This chapter covers Stream Analytics SQL queries and window functions, which are core to processing real-time data streams in Azure. Understanding how to write queries that aggregate data over time windows is critical for the DP-900 exam, as approximately 10-15% of questions relate to stream processing and windowing. You will learn the syntax and behavior of tumbling, hopping, sliding, and session windows, along with common query patterns like filtering, joining, and aggregating streaming data. Mastery of these concepts is essential for passing the analytics domain of the exam.
Jump to a section
Imagine a factory assembly line where products move on a conveyor belt past inspection stations. Each station can only inspect items that arrive within a specific time shift. For example, Station A operates from 8:00 to 8:05, then Station B from 8:05 to 8:10, and so on. This is a tumbling window: each station handles exactly one non-overlapping batch. Now suppose Station A inspects items from 8:00 to 8:05, but also includes items from 7:55 to 8:00 (a sliding window effect); this means every item is inspected twice. If the station instead inspects items that arrived within the last 5 minutes, sliding forward every minute, that's a hopping window. Finally, imagine a station that starts a new batch whenever it sees a 'defective' tag, and stops after 10 seconds of no new items—this is a session window. The SQL queries in Stream Analytics act like the logic that decides which items go to which station and how aggregates (like average weight) are computed over each batch. Just as the factory needs precise timing rules to avoid missing items or double-counting, Stream Analytics uses window functions to partition the never-ending stream into manageable chunks for aggregation.
What Are Stream Analytics SQL Queries and Windows?
Azure Stream Analytics is a real-time analytics service that processes data streaming from sources like Azure Event Hubs, IoT Hub, or Blob Storage. You define queries using a SQL-like language to filter, aggregate, join, and transform the data before outputting to sinks like Azure SQL Database, Power BI, or Blob Storage. Because streaming data is infinite and unbounded, you cannot run standard SQL aggregates (like SUM or AVG) over the entire stream—you must partition it into finite chunks called windows. Windows define the time range over which aggregations are computed. The DP-900 exam tests your ability to choose the correct window type and write queries that use them.
How Windows Work Internally
Stream Analytics processes each event as it arrives. When a query includes a window function (e.g., TumblingWindow, HoppingWindow, SlidingWindow, SessionWindow), the engine creates a set of windows based on the system clock and the event timestamp (if using Event Time, or arrival time if not). Each window is a logical bucket that holds events whose timestamps fall within the window's start and end times. The engine evaluates the query for each window when it completes (for tumbling, hopping, session) or continuously (for sliding). Internally, the engine maintains state for each window, aggregating events as they arrive. When a window ends, the result is emitted. For hopping windows, overlapping windows are maintained simultaneously. The engine uses watermarking to handle late-arriving events, controlled by the Lateness tolerance parameter.
Key Components and Defaults
TumblingWindow: A series of fixed-size, non-overlapping, contiguous time intervals. Example: TumblingWindow(second, 10) creates windows of 10 seconds each, starting at 00:00:00. Default offset is 0. No overlap.
HoppingWindow: Windows that hop forward in time by a fixed period, possibly overlapping. Example: HoppingWindow(second, 10, 5) creates windows of 10 seconds that start every 5 seconds. The window size must be greater than the hop size. Overlap ratio can be up to 100%.
SlidingWindow: Produces an output only when an event occurs—it continuously slides forward. Example: SlidingWindow(second, 10) calculates the aggregate over the last 10 seconds, updating with each new event. No fixed end; output is event-driven.
SessionWindow: Groups events that arrive close together, with a maximum gap and maximum duration. Example: SessionWindow(second, 5, 10) starts a window when an event arrives, closes after 5 seconds of inactivity, but never exceeds 10 seconds total. Useful for user sessions.
Timestamp by: Used to specify a field in the event as the event time. If not specified, the arrival time (system clock) is used. Syntax: TIMESTAMP BY eventTimeField.
Lateness tolerance: Configurable via WITH clause or LATE keyword. Default is 0 seconds. Allows late-arriving events to be included in the correct window up to a specified delay.
System.Timestamp(): Returns the window end time for the current aggregation.
Configuration and Verification
Queries are defined in the Stream Analytics job's query editor. Example of a tumbling window query:
SELECT System.Timestamp() AS WindowEnd, COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY eventTime
GROUP BY TumblingWindow(second, 10)To verify, you can run sample data or use the 'Test' button in the Azure portal. For production, monitor the job's latency and watermark delay metrics.
Interaction with Related Technologies
Stream Analytics queries can read from multiple inputs and write to multiple outputs. Common patterns include:
Filtering: WHERE clause to select relevant events.
Joining two streams: JOIN with a window to correlate events from different sources within a time range.
Reference data: Use JOIN with a reference input (static data from Blob Storage or SQL Database) to enrich streams.
Partitioning: Use PARTITION BY to parallelize processing across partitions (e.g., by device ID).
Trap Patterns for the Exam
Mixing window types: Candidates often try to use a tumbling window when a hopping window is needed for overlapping intervals. Remember: tumbling = non-overlapping; hopping = overlapping (if hop < size).
Forgetting TIMESTAMP BY: If you don't specify a timestamp field, the engine uses arrival time, which may not match the event's actual time. Exam questions may test this.
Ignoring lateness: The default lateness tolerance is 0. If events arrive late, they are dropped. The exam may ask how to handle late events (use LATE keyword or adjust lateness tolerance).
SlidingWindow output frequency: Sliding windows emit output for every event, which can be high-volume. Candidates may think it emits at fixed intervals.
SessionWindow parameters: The maximum gap and maximum duration are often confused. Remember: gap defines inactivity timeout; duration caps total length.
Specific Numbers and Values
TumblingWindow, HoppingWindow, SlidingWindow: supported units are second, minute, hour, day. Max window size: 7 days for tumbling and hopping, 1 hour for sliding.
SessionWindow: max gap and max duration in same units. Max duration can be up to 7 days.
Lateness tolerance: configured in the query via WITH clause: WITH (LATE = 10 SECONDS) or globally in job settings. Max lateness: 7 days.
System.Timestamp() returns a datetime value representing the window end.
Step-by-Step Query Execution
When a query with a tumbling window runs: 1. The engine identifies the window boundaries based on the event timestamp and the window size. 2. Each event is assigned to the window whose start <= event time < end. 3. As events arrive, they are stored in the window's state. 4. When the system clock passes the window end (plus lateness tolerance), the window is considered complete. 5. The aggregate is computed and the result is sent to output. 6. The window state is discarded.
For sliding windows, steps 1-3 are similar, but step 4 is event-driven: every new event triggers a re-evaluation of the aggregate over the sliding interval, and output is emitted immediately.
Define Input and Output
First, you configure the input source (e.g., Event Hub) and output sink (e.g., SQL Database) in the Stream Analytics job. Inputs have an alias used in queries. Outputs also have an alias. For example, `INTO [output]` specifies where results go. You must ensure the schema (field names and types) matches between input and query expectations.
Write SELECT with Window
Write a SQL query that includes a window function in the GROUP BY clause. For example: `SELECT System.Timestamp() AS WindowEnd, AVG(temperature) AS AvgTemp FROM [input] GROUP BY TumblingWindow(second, 10)`. The window function defines the time interval for aggregation. You can also use `TIMESTAMP BY` to specify the event time column.
Configure Lateness Tolerance
Set the lateness tolerance to handle out-of-order or late events. In the query, use the `WITH` clause: `WITH (LATE = 5 SECONDS)`. This tells the engine to wait up to 5 seconds after the window end before emitting results, allowing late events to be included. If not specified, late events are dropped.
Test with Sample Data
Use the 'Test' button in the Azure portal to run the query against a sample dataset. You can upload a sample file or use live data in test mode. This verifies the query logic and output schema. Check for any errors like mismatched types or missing fields.
Start the Job and Monitor
Start the Stream Analytics job. Monitor the job's metrics such as Input Events, Output Events, Watermark Delay, and Late Events. Watermark delay indicates how far behind the engine is from real-time. High lateness suggests you need to adjust the lateness tolerance or check event timestamps.
Scenario 1: Real-Time IoT Temperature Monitoring
A manufacturing plant uses IoT sensors to monitor machine temperature. Each sensor sends a temperature reading every second. The operations team needs to know the average temperature per machine over 5-minute windows, and alert if the average exceeds 80°C. They use Azure Stream Analytics with a tumbling window of 5 minutes, grouping by machineId. The query: SELECT machineId, System.Timestamp() AS WindowEnd, AVG(temperature) AS AvgTemp INTO [output] FROM [input] TIMESTAMP BY eventTime GROUP BY machineId, TumblingWindow(minute, 5). They also set a lateness tolerance of 10 seconds to handle network delays. In production, they have 10,000 sensors, so they partition the input by machineId (using PARTITION BY) to scale. A common misconfiguration is forgetting to use TIMESTAMP BY, causing the engine to use arrival time, which may be skewed due to network latency. This leads to incorrect averages. They monitor watermark delay; if it exceeds 10 seconds, they increase the lateness tolerance.
Scenario 2: Clickstream Analysis for E-Commerce
An e-commerce website wants to analyze user sessions: how many pages a user visits within a session, where a session ends after 30 minutes of inactivity. They use session windows: SELECT userId, System.Timestamp() AS SessionEnd, COUNT(*) AS PageViews FROM [clickstream] TIMESTAMP BY clickTime GROUP BY userId, SessionWindow(minute, 30, 120). The maximum duration is 2 hours to prevent infinite sessions if the user is very active. They join this with a reference table containing user demographics to enrich the output. A common issue is that session windows can be resource-intensive for high-traffic sites; they need to ensure the Stream Analytics job has sufficient streaming units. They also set a lateness tolerance of 1 minute to account for delayed click events.
Scenario 3: Fraud Detection in Financial Transactions
A bank processes credit card transactions in real time. They need to count the number of transactions per card in the last 1 hour (sliding window) and flag if it exceeds 10. The query: SELECT cardId, COUNT(*) AS TransactionCount INTO [alert] FROM [transactions] TIMESTAMP BY transactionTime GROUP BY cardId, SlidingWindow(hour, 1). Sliding window updates with every transaction, so the alert is immediate. However, this can generate a high volume of output; they use a filter in the query to only output when count > 10. They also set a lateness tolerance of 30 seconds. A pitfall is that sliding windows do not automatically handle duplicates; if the same transaction is sent twice, it will be counted twice. They must deduplicate upstream.
Exactly What DP-900 Tests
Objective 3.5: Describe Stream Analytics SQL queries and windows. The exam expects you to:
Identify the correct window type for a given scenario (tumbling, hopping, sliding, session).
Understand that windows are used to aggregate data over time.
Know the syntax: GROUP BY TumblingWindow(second, 10) and TIMESTAMP BY.
Recognize that System.Timestamp() returns the window end time.
Understand the purpose of lateness tolerance.
Common Wrong Answers and Why
Choosing hopping window when tumbling is correct: Candidates see 'overlap' in the scenario and select hopping, but if the scenario says 'non-overlapping contiguous intervals', it's tumbling. Trap: hopping can be non-overlapping if hop equals size, but that's equivalent to tumbling.
Thinking sliding windows emit at fixed intervals: Sliding windows emit output for every event, not every N seconds. The exam may ask: 'Which window type emits results only when an event occurs?' Answer: Sliding.
Confusing session window parameters: The maximum gap is the inactivity timeout; maximum duration is the total window length. Many candidates reverse them.
Assuming lateness tolerance is unlimited: Default is 0 seconds. Max is 7 days. The exam may ask what happens to late events when tolerance is not set: they are dropped.
Specific Values on the Exam
Window units: second, minute, hour, day.
Max window size for tumbling/hopping: 7 days.
Max window size for sliding: 1 hour.
System.Timestamp() returns datetime.
TIMESTAMP BY is used to specify event time.
Edge Cases
If you omit TIMESTAMP BY, the engine uses arrival time. This can cause events to be assigned to wrong windows if they arrive late.
Hopping windows with hop size equal to window size are identical to tumbling windows.
Session windows can produce many small windows if events are sporadic.
How to Eliminate Wrong Answers
If the scenario mentions 'continuous output for every event', it's sliding.
If it mentions 'fixed intervals, non-overlapping', it's tumbling.
If it mentions 'overlapping intervals', it's hopping.
If it mentions 'sessions or inactivity timeout', it's session.
Look for keywords like 'every 5 minutes' (tumbling), 'every minute but looking back 5 minutes' (hopping), 'real-time update' (sliding), 'user session' (session).
Windows partition infinite streams into finite time chunks for aggregation.
Four window types: Tumbling (non-overlapping), Hopping (overlapping), Sliding (event-driven), Session (inactivity-based).
Use TIMESTAMP BY to specify event time; otherwise arrival time is used.
System.Timestamp() returns the window end time.
Lateness tolerance (default 0) allows late events to be included up to a max of 7 days.
Max window size: Tumbling/Hopping = 7 days, Sliding = 1 hour.
Sliding windows emit output for every event, not at fixed intervals.
Session windows have a maximum gap (inactivity) and maximum duration (total length).
These come up on the exam all the time. Here's how to tell them apart.
Tumbling Window
Non-overlapping, contiguous intervals
Fixed size, no overlap
Simple aggregation per batch
Use when you need distinct time buckets
Example: `TumblingWindow(second, 10)`
Hopping Window
Overlapping intervals (if hop < size)
Fixed size, hop forward by a period
Allows smooth transitions between windows
Use when you need frequent updates with larger context
Example: `HoppingWindow(second, 10, 5)`
Mistake
Sliding windows emit results at fixed time intervals.
Correct
Sliding windows emit results for every event that occurs within the window, not at fixed intervals. The output is event-driven.
Mistake
Hopping windows always overlap.
Correct
Hopping windows overlap only if the hop size is less than the window size. If hop size equals window size, they are non-overlapping and identical to tumbling windows.
Mistake
You must always use TIMESTAMP BY in every query.
Correct
TIMESTAMP BY is optional. If omitted, the engine uses the event's arrival time (system clock). However, for accurate event time processing, it is recommended.
Mistake
Session windows close immediately after the maximum gap expires.
Correct
Session windows close after the maximum gap of inactivity, but they also have a maximum duration that caps the total window length. The window closes when either condition is met.
Mistake
Lateness tolerance can be set to any value without limit.
Correct
Lateness tolerance has a maximum of 7 days. Also, a high tolerance increases memory usage because windows must be kept open longer.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Tumbling windows are a series of fixed-size, non-overlapping, contiguous time intervals. Hopping windows also have fixed size but hop forward in time by a specified period, which can create overlapping windows if the hop size is less than the window size. Use tumbling for distinct batches, hopping for overlapping time ranges.
Use sliding windows when you need to output results for every event, providing real-time updates. Hopping windows output at fixed intervals (the hop size). Sliding windows are event-driven, while hopping windows are time-driven.
System.Timestamp() returns the end time of the current window as a datetime value. For tumbling and hopping windows, it returns the window end. For sliding windows, it returns the current event's timestamp. For session windows, it returns the window end.
Set the lateness tolerance using the `WITH (LATE = N SECONDS)` clause in the query or configure it globally. This tells the engine to wait up to N seconds after the window end before emitting results, allowing late events to be included. The default is 0 seconds (no tolerance).
Yes, you can use multiple window functions by specifying multiple GROUP BY clauses with different window types, but each window must be in its own subquery or use a UNION. Alternatively, you can use the `Collect` and `CollectTop` functions for certain patterns.
The maximum window size for a sliding window is 1 hour. For tumbling and hopping windows, the maximum is 7 days.
If you omit TIMESTAMP BY, the engine uses the event's arrival time (the time it was received by the input source) as the timestamp. This may not match the actual event time, leading to incorrect window assignments if events arrive out of order.
You've just covered Stream Analytics SQL Queries and Windows — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?