This chapter covers real-time analytics on Azure, focusing on Azure Stream Analytics and related services. For the DP-900 exam, this topic falls under domain 'Analytics' objective 3.5 ('Describe real-time analytics in Azure'). Approximately 10-15% of exam questions touch real-time analytics concepts. You will learn the key services, their use cases, and how they differ from batch processing. Mastering this content is essential for distinguishing between streaming and batch solutions on the exam.
Imagine a factory that sorts packages on a conveyor belt. Packages (events) arrive continuously from multiple chutes (data sources). Instead of storing all packages in a warehouse before sorting, the belt has sensors that inspect each package as it passes. A sensor reads the label, a computer decides the destination chute, and a pusher arm diverts the package instantly. If a package is damaged (bad data), it is diverted to a rejection bin for later inspection. The belt runs at a constant speed (throughput limit). If too many packages arrive at once, a buffer area holds them temporarily, but if the buffer overflows, packages are dropped (data loss). Azure Stream Analytics works like this: it processes events in memory as they flow through, using time windows to aggregate data (e.g., average temperature per minute). The output is sent to sinks like Power BI for live dashboards or Azure SQL for storage. Just as the factory can adjust belt speed or add parallel belts for higher volume, Stream Analytics can scale out by increasing Streaming Units (SUs) to handle more data per second.
What is Real-Time Analytics and Why Does It Exist?
Real-time analytics is the process of analyzing data as soon as it arrives, with minimal latency, to enable immediate insights or actions. Traditional batch processing collects data over a period (e.g., hourly, daily) and then processes it, which introduces delays. In contrast, streaming analytics processes events in motion, often within seconds or milliseconds. The need for real-time analytics arises in scenarios where timely decisions are critical: fraud detection, IoT sensor monitoring, stock trading, social media sentiment analysis, and operational dashboards. Azure Stream Analytics (ASA) is the primary PaaS offering for real-time analytics in Azure. It uses a SQL-like query language to process data streams from sources like Azure Event Hubs, IoT Hub, or Blob Storage (for reference data) and writes results to sinks like Power BI, Azure SQL Database, or Azure Data Lake Storage.
How Azure Stream Analytics Works Internally
ASA jobs are composed of three logical components: input, query, and output. The input is a stream of events (e.g., from Event Hubs). The query is a continuous SQL-like query that runs indefinitely. The output is a sink where results are written. ASA uses a temporal windowing mechanism to group events over time. There are four types of windows: Tumbling, Hopping, Sliding, and Session. A Tumbling window is a fixed-size, non-overlapping time interval (e.g., every 5 minutes). A Hopping window is also fixed-size, but consecutive windows can overlap (e.g., a 5-minute window that advances every 1 minute). A Sliding window produces results for each point in time when an event occurs, and a Session window groups events based on inactivity gaps.
Under the hood, ASA partitions the input stream into sub-streams (by partition key or default) and processes them in parallel. The job uses Streaming Units (SUs) to define the compute capacity. 1 SU provides 1 GB memory and up to 1 MB/s throughput. For high-throughput scenarios, you can increase SUs up to 192 (for Standard tier).
ASA guarantees exactly-once event delivery for outputs that support it (e.g., Azure SQL Database with upsert). It also handles late-arriving events using a configurable late arrival tolerance window (default 5 seconds). Out-of-order events are handled by an out-of-order tolerance window (default 0 seconds). These tolerance settings trade a little latency for temporal accuracy.
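The window types described above can be illustrated with a small simulation. This is a minimal sketch in Python, not ASA code: timestamps are plain integers (seconds), windows are assumed to be start-exclusive and end-inclusive, and the session logic models only the inactivity timeout. The function names are hypothetical.

```python
from typing import List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """Return the single (start, end] window that contains event time ts."""
    end = -(-ts // size) * size  # round ts up to the next multiple of size
    return (end - size, end)


def hopping_windows(ts: int, size: int, hop: int) -> List[Tuple[int, int]]:
    """Return every (start, end] window of length `size`, ending on a
    multiple of `hop`, that contains event time ts."""
    first_end = -(-ts // hop) * hop
    return [(end - size, end) for end in range(first_end, ts + size, hop)]


def session_windows(timestamps: List[int], timeout: int) -> List[List[int]]:
    """Group event times into sessions separated by gaps longer than timeout."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # still within the inactivity timeout
        else:
            sessions.append([ts])    # gap too long: start a new session
    return sessions


if __name__ == "__main__":
    print(tumbling_window(130, 300))
    print(hopping_windows(130, 300, 60))
    print(session_windows([1, 2, 3, 20, 21, 50], 5))
```

An event falls into exactly one tumbling window but into size/hop hopping windows, which is why hopping windows suit rolling aggregates.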
Key Components, Values, Defaults, and Timers
Inputs: Event Hubs, IoT Hub, Blob Storage (for reference data), ADLS Gen2. Each input has a serialization format (JSON, CSV, Avro, Parquet).
Outputs: Azure SQL Database, Azure Synapse Analytics, Blob Storage, Data Lake Storage, Event Hubs, IoT Hub, Power BI, Azure Functions, Table Storage, Queue Storage, Service Bus Topics/Queues, Cosmos DB, Azure Data Explorer.
Streaming Units (SU): 1 SU = 1 GB memory, ~1 MB/s throughput. Scale up to 192 SU for Standard tier. Job can be scaled manually or auto-scaled (preview).
Late Arrival Tolerance: Default 5 seconds. Events that arrive later than this are either dropped or have their timestamps adjusted, depending on the configured policy.
Out-of-Order Tolerance: Default 0 seconds. Events with timestamps earlier than the current window are considered out-of-order and can be dropped or adjusted.
Query Language: T-SQL based with extensions for windowing functions (TumblingWindow, HoppingWindow, SlidingWindow, SessionWindow), joins (time-bounded joins between streams, and joins with reference data), and user-defined functions (UDFs) in JavaScript or C#.
Checkpointing: ASA automatically checkpoints job progress to Blob Storage every few seconds for recovery.
Compatibility Level: 1.0, 1.1, 1.2 (latest). Determines query behavior and features.
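The Streaming Unit figure above lends itself to a quick back-of-the-envelope estimate. The sketch below is only arithmetic based on the "~1 MB/s per SU" rule of thumb from this list; the function name and the average event size are assumptions, and real sizing also depends on query complexity and on which SU values the service accepts.

```python
import math


def estimate_streaming_units(events_per_second: float,
                             avg_event_bytes: float,
                             mb_per_su: float = 1.0) -> int:
    """Rough SU estimate from ingest volume alone (ignores query complexity)."""
    throughput_mb_per_s = events_per_second * avg_event_bytes / 1_000_000
    # Round up, and never go below the 1 SU minimum.
    return max(1, math.ceil(throughput_mb_per_s / mb_per_su))


if __name__ == "__main__":
    # 10,000 events/s at an assumed 1 KB per event is about 10 MB/s.
    print(estimate_streaming_units(10_000, 1_000))
```

Treat the result as a starting point and confirm it against the SU Utilization metric once the job is running.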
Configuration and Verification Commands
Creating an ASA job via Azure CLI:
az stream-analytics job create --resource-group myRG --name myJob --location westus --output-error-policy Drop --compatibility-level 1.2 --data-locale en-US
Adding an input:
az stream-analytics input create --resource-group myRG --job-name myJob --name myInput --type Stream --datasource "{\"type\":\"Microsoft.EventHub/EventHub\",\"properties\":{\"eventHubName\":\"myHub\",\"serviceBusNamespace\":\"myNamespace\",\"sharedAccessPolicyName\":\"RootManageSharedAccessKey\",\"sharedAccessPolicyKey\":\"...\"}}" --serialization "{\"type\":\"Json\",\"properties\":{\"encoding\":\"UTF8\"}}"
Adding an output:
az stream-analytics output create --resource-group myRG --job-name myJob --name myOutput --datasource "{\"type\":\"Microsoft.Sql/Server/Database\",\"properties\":{\"server\":\"myServer\",\"database\":\"myDB\",\"user\":\"myUser\",\"password\":\"myPass\",\"table\":\"myTable\"}}" --serialization "{\"type\":\"Json\",\"properties\":{\"encoding\":\"UTF8\"}}"
Starting a job:
az stream-analytics job start --resource-group myRG --name myJob --output-start-mode JobStartTime
Verification: Monitor job metrics via Azure Portal (Input Events, Output Events, Watermark Delay, SU Utilization). Use az monitor metrics to query programmatically.
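Hand-escaping the JSON arguments in the commands above is easy to get wrong. One possible approach, sketched here in Python, is to build the payloads as dictionaries and let the standard library handle serialization and shell quoting. The flags and property names are copied from the input command in this section and are not verified against any particular CLI extension version; the key stays elided as in the original.

```python
import json
import shlex

# Same values as the "Adding an input" command above.
datasource = {
    "type": "Microsoft.EventHub/EventHub",
    "properties": {
        "eventHubName": "myHub",
        "serviceBusNamespace": "myNamespace",
        "sharedAccessPolicyName": "RootManageSharedAccessKey",
        "sharedAccessPolicyKey": "...",  # elided in the source; supply your own
    },
}
serialization = {"type": "Json", "properties": {"encoding": "UTF8"}}


def cli_json(value: dict) -> str:
    """Serialize a dict to compact JSON and quote it safely for a POSIX shell."""
    return shlex.quote(json.dumps(value, separators=(",", ":")))


command = " ".join([
    "az stream-analytics input create",
    "--resource-group myRG --job-name myJob --name myInput --type Stream",
    "--datasource", cli_json(datasource),
    "--serialization", cli_json(serialization),
])

if __name__ == "__main__":
    print(command)
```

The printed command can be pasted into a POSIX shell; on Windows shells the quoting rules differ, so the output would need adjusting there.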
How ASA Interacts with Related Technologies
ASA is often used with Azure Event Hubs for high-throughput event ingestion (millions of events per second). Event Hubs provides partitioned Kafka-compatible endpoints. ASA can also consume from IoT Hub for device telemetry. For reference data (static or slowly changing), ASA loads data from Blob Storage or SQL Database and joins it with the stream. Output to Power BI enables real-time dashboards with automatic refresh. For long-term storage, ASA can write to Azure Data Lake Storage or Azure Synapse Analytics. Azure Functions can be used as custom output for complex logic. Azure Data Explorer (ADX) is another real-time analytics service for interactive queries on large volumes of streaming data, but it is not part of the DP-900 scope except as a comparison to ASA. The exam focuses on ASA as the core service for real-time analytics.
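The reference-data join mentioned above behaves like a lookup against a small, mostly static table. As a rough illustration in Python (not ASA code), the sketch below enriches each streaming event with the matching reference row and drops events with no match, as an inner join would. The field names `sensorId`, `temperature`, and `location` are hypothetical.

```python
from typing import Dict, Iterable, List


def enrich(events: Iterable[dict], reference: Dict[str, dict]) -> List[dict]:
    """Inner-join streaming events with a reference lookup keyed by sensorId."""
    enriched = []
    for event in events:
        ref_row = reference.get(event["sensorId"])
        if ref_row is not None:  # inner join: skip events without a match
            enriched.append({**event, **ref_row})
    return enriched


if __name__ == "__main__":
    reference = {"s1": {"location": "Line A"}, "s2": {"location": "Line B"}}
    stream = [
        {"sensorId": "s1", "temperature": 71.5},
        {"sensorId": "s9", "temperature": 64.0},  # no reference row
    ]
    print(enrich(stream, reference))
```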
Define the Input Source
Identify the data source for the streaming analytics job. Common sources are Azure Event Hubs, IoT Hub, or Blob Storage. For Event Hubs, you need the namespace, event hub name, shared access policy name, and key. You do not have to declare an input schema up front: ASA infers it from the incoming events. You do specify the serialization format (JSON, CSV, Avro), for example JSON with UTF-8 encoding. The input is configured in the Azure portal or via CLI. Ensure the source is active and sending events before starting the job.
Define the Output Sink
Choose where the results of the query will be written. Common outputs: Power BI for dashboards, Azure SQL Database for persistent storage, Azure Data Lake Storage for big data analytics. Each output requires connection details: server name, database, table, authentication (user/password or managed identity). For Power BI, you must authenticate with a user or service principal and specify the dataset and table name. Outputs can be partitioned for parallelism. ASA supports exactly-once semantics for outputs that support upsert (e.g., SQL with a unique key).
Write the Query
The query is a continuous SQL-like statement that defines the transformation. Example: `SELECT AVG(temperature) AS avgTemp, System.Timestamp AS windowEnd INTO myOutput FROM myInput TIMESTAMP BY eventTime GROUP BY TumblingWindow(minute, 5)`. The TIMESTAMP BY clause defines the event time field. If omitted, arrival time is used. Use window functions to aggregate over time. You can also join with reference data: `SELECT * FROM myInput JOIN refData ON myInput.sensorId = refData.id`. The query runs continuously on the stream. You can test the query with sample data in the portal.
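To make the semantics of the example query concrete, here is a small Python simulation of what it computes: one average per 5-minute tumbling window, stamped with the window end. This is a sketch of the behaviour, not ASA code. It assumes `eventTime` is an integer number of seconds and that windows are start-exclusive and end-inclusive.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def avg_temp_per_window(events: List[dict], window_seconds: int = 300) -> List[dict]:
    """Mimic SELECT AVG(temperature), System.Timestamp ... GROUP BY TumblingWindow(minute, 5)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for event in events:
        # Round the event time up to the end of its window.
        window_end = -(-event["eventTime"] // window_seconds) * window_seconds
        buckets[window_end].append(event["temperature"])
    return [
        {"avgTemp": mean(temps), "windowEnd": window_end}
        for window_end, temps in sorted(buckets.items())
    ]


if __name__ == "__main__":
    sample = [
        {"eventTime": 10, "temperature": 20.0},
        {"eventTime": 290, "temperature": 22.0},
        {"eventTime": 310, "temperature": 30.0},
    ]
    print(avg_temp_per_window(sample))
```

Unlike this batch-style function, the real query emits each row as soon as its window closes and never terminates.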
Configure Late Arrival and Out-of-Order Policies
Set the tolerance for late-arriving events (default 5 seconds) and out-of-order events (default 0 seconds). These settings affect temporal accuracy. Events that fall outside the tolerance window are handled according to the configured policy, which is either 'Drop' (discard the event) or 'Adjust' (rewrite its timestamp so that it falls within tolerance). For example, if out-of-order tolerance is 10 seconds, events with timestamps up to 10 seconds before the current watermark are considered in-order. The watermark is, roughly, the maximum event time seen so far. These settings are configured in the job's 'Event ordering' section.
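The effect of the two policies can be sketched as follows. This Python model is simplified and rests on assumptions: lateness is taken as arrival time minus event time, and 'Adjust' is assumed to move the timestamp to arrival time minus the tolerance. The function name is hypothetical, and times are plain seconds.

```python
from typing import Optional


def apply_late_arrival_policy(event_time: float,
                              arrival_time: float,
                              tolerance: float = 5.0,
                              policy: str = "Adjust") -> Optional[float]:
    """Return the event time to use, or None if the event is dropped."""
    lateness = arrival_time - event_time
    if lateness <= tolerance:
        return event_time            # within tolerance: keep as-is
    if policy == "Drop":
        return None                  # too late: discard the event
    if policy == "Adjust":
        return arrival_time - tolerance  # assumed adjustment rule
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(apply_late_arrival_policy(100, 103))          # on time
    print(apply_late_arrival_policy(100, 130, 5, "Drop"))
    print(apply_late_arrival_policy(100, 130, 5, "Adjust"))
```

'Drop' keeps timestamps truthful at the cost of losing data, while 'Adjust' keeps every event but attributes late ones to a later point in time.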
Start and Monitor the Job
Start the job with an output start mode: JobStartTime (starting now), CustomTime (specific past time), or LastOutputEventTime (resume from last stop). The job will begin processing events. Monitor metrics: Input Events, Output Events, Watermark Delay (time behind real-time), SU Utilization. High SU Utilization (>80%) indicates need for scaling. You can increase SUs to improve throughput. Errors appear in Activity Log or Diagnostic Logs. Use Azure Monitor alerts for proactive notification.
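The monitoring guidance above can be expressed as a simple rule check. The sketch below is illustrative only: the 80% SU threshold comes from the text, while the watermark delay limit, the function name, and the idea of passing metric values in as plain numbers are assumptions (in practice the values would come from Azure Monitor).

```python
from typing import List


def job_health(su_utilization_pct: float,
               watermark_delay_s: float,
               max_delay_s: float = 60.0) -> List[str]:
    """Return a list of findings for a job based on two key metrics."""
    findings = []
    if su_utilization_pct > 80:
        findings.append("SU utilization above 80%: consider adding Streaming Units")
    if watermark_delay_s > max_delay_s:
        findings.append("Watermark delay is high: the job is falling behind its input")
    return findings


if __name__ == "__main__":
    for finding in job_health(su_utilization_pct=85, watermark_delay_s=120):
        print(finding)
```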
Real-World Scenarios for Azure Stream Analytics
Scenario 1: Real-Time Fraud Detection for E-Commerce
A large online retailer processes millions of transactions per hour. They use Azure Event Hubs to ingest clickstream and purchase events. An ASA job runs a query that computes the average transaction amount per user over a sliding window of 5 minutes. If a user's average exceeds a threshold (e.g., 3 standard deviations above their historical mean), an alert is sent to an Azure Function, which triggers a fraud review. The job also writes summary stats to Azure SQL Database for compliance. The job is configured with 10 SUs to handle a peak load of 10,000 events per second. Late arrival tolerance is set to 30 seconds to account for network delays. Misconfiguration: If SUs are too low, watermark delay increases, causing late results and missed fraud. If out-of-order tolerance is too high, the job waits longer before emitting results, so alerts are delayed.
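The detection rule in this scenario (a sliding 5-minute average compared with the historical mean plus 3 standard deviations) can be sketched in Python. This illustrates the arithmetic only, not the ASA job: the class and function names are hypothetical, timestamps are seconds, and the historical statistics are assumed to be supplied from elsewhere.

```python
from collections import deque


class SlidingAverage:
    """Average of the amounts seen in the last `window_seconds` for one user."""

    def __init__(self, window_seconds: int = 300) -> None:
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount) pairs, oldest first
        self.total = 0.0

    def add(self, ts: int, amount: float) -> float:
        self.events.append((ts, amount))
        self.total += amount
        # Evict events that have slid out of the window.
        while self.events[0][0] <= ts - self.window_seconds:
            _, old_amount = self.events.popleft()
            self.total -= old_amount
        return self.total / len(self.events)


def is_suspicious(window_avg: float, hist_mean: float, hist_std: float,
                  k: float = 3.0) -> bool:
    """Flag the user when the window average exceeds mean + k standard deviations."""
    return window_avg > hist_mean + k * hist_std


if __name__ == "__main__":
    avg = SlidingAverage()
    for ts, amount in [(0, 40.0), (60, 55.0), (120, 900.0)]:
        current = avg.add(ts, amount)
        print(ts, current, is_suspicious(current, hist_mean=50.0, hist_std=20.0))
```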
Scenario 2: IoT Predictive Maintenance in Manufacturing
A factory deploys sensors on equipment that emit temperature, vibration, and pressure readings every second. Data is sent to IoT Hub. An ASA job joins the sensor stream with reference data (equipment thresholds stored in Blob Storage) and computes the average over each 5-minute tumbling window. If the average exceeds a threshold, an output is sent to Power BI for a real-time dashboard and also to an Azure Function that sends an SMS alert. The job uses 3 SUs for 1000 devices. A session window is also used to detect anomalies based on inactivity, such as a sensor that stops reporting. Common pitfall: Not partitioning by device ID. Without partitioning, the job processes all devices in a single stream, causing high latency. Best practice: Use PARTITION BY DeviceId in the query to enable parallelism.
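The benefit of partitioning by device ID is that each partition can be processed independently. The Python sketch below shows the idea under simple assumptions: a stable hash (CRC32) maps each `DeviceId` to one of a fixed number of partitions, so all readings from a device stay together. The function names and the partition count are hypothetical, and ASA's own partition assignment is not modelled.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(device_id: str, partition_count: int) -> int:
    """Map a device ID to a partition using a hash that is stable across runs."""
    return zlib.crc32(device_id.encode("utf-8")) % partition_count


def split_by_partition(events: List[dict], partition_count: int) -> Dict[int, List[dict]]:
    """Group events so that each partition can be aggregated in parallel."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["DeviceId"], partition_count)].append(event)
    return dict(partitions)


if __name__ == "__main__":
    readings = [
        {"DeviceId": "pump-1", "temperature": 70.1},
        {"DeviceId": "pump-2", "temperature": 68.4},
        {"DeviceId": "pump-1", "temperature": 71.0},
    ]
    for partition, batch in split_by_partition(readings, 4).items():
        print(partition, batch)
```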
Scenario 3: Social Media Sentiment Analysis for Brand Monitoring
A marketing agency ingests tweets from Twitter API via Event Hubs. ASA runs a sentiment analysis query using a user-defined function backed by a pre-trained machine learning model. Note that JavaScript UDFs in ASA are stateless and compute-only, so they cannot call an external endpoint such as an Azure Function; an externally hosted model is typically reached through an Azure Machine Learning function instead. The query groups tweets by hashtag over a hopping window (1-minute hop, 15-minute window) and outputs sentiment aggregates to Power BI for a live dashboard. The job also outputs raw tweets to Azure Data Lake for historical analysis. The challenge is handling high volume (5000 tweets/sec). The job is scaled to 20 SUs. Misconfiguration: Using a tumbling window instead of a hopping window would miss overlapping trends. Also, the scoring function might become a bottleneck if it is not stateless and scalable.
Exam Focus: What DP-900 Tests on Real-Time Analytics
Objective Codes: The exam objective 'Describe real-time analytics in Azure' (3.5) includes:
Identify use cases for real-time analytics (e.g., fraud detection, IoT)
Describe Azure Stream Analytics (ASA) as the primary service
Differentiate between batch and streaming processing
Understand that ASA uses SQL-like queries with temporal windows
Know that Event Hubs and IoT Hub are common inputs
Know that Power BI is a common output for dashboards
Recognize that ASA runs continuously and processes data in memory
Common Wrong Answers and Why Candidates Choose Them:
1. Confusing Azure Stream Analytics with Azure Batch Processing (Azure Data Factory, HDInsight): Candidates often pick batch services for streaming scenarios. The trap: The question describes 'real-time' or 'continuous' processing. If the scenario mentions 'every hour' or 'scheduled', it is batch. If it says 'as events arrive' or 'continuous', it is streaming.
2. Choosing Event Hubs as the analytics service instead of the ingestion service: Event Hubs is a data ingestion platform, not an analytics engine. ASA is the analytics engine. Candidates think Event Hubs does analytics because it can filter events, but it does not support complex windowing or SQL queries.
3. Selecting Azure Data Lake Storage as a real-time output: Data Lake Storage is for long-term storage, not real-time dashboards. Power BI is the correct output for real-time visualization. Candidates confuse storage with visualization.
4. Misunderstanding window types: Questions may ask for the correct window type for a scenario. For example, 'average temperature every 5 minutes' uses a tumbling window. 'Rolling average over 15 minutes updated every minute' uses a hopping window. 'Detect idle periods' uses a session window. Candidates mix them up.
Specific Numbers and Terms on the Exam:
- Streaming Units (SU): 1 SU = 1 GB memory, ~1 MB/s throughput. Max 192 SU.
- Late arrival tolerance default: 5 seconds.
- Out-of-order tolerance default: 0 seconds.
- Window types: Tumbling, Hopping, Sliding, Session.
- Input sources: Event Hubs, IoT Hub, Blob Storage (reference data).
- Output sinks: Power BI, Azure SQL Database, Azure Data Lake Storage, Event Hubs, Azure Functions.
- Query language: SQL-like with temporal functions.
Edge Cases and Exceptions:
- ASA can process reference data from Blob Storage or SQL Database for joins (e.g., static lookup).
- ASA supports exactly-once delivery for outputs like Azure SQL Database when using upsert.
- ASA can be used with Azure Functions for custom processing.
- ASA supports both event time (TIMESTAMP BY) and arrival time.
- ASA jobs can be stopped and started; output start mode determines how far back to process.
How to Eliminate Wrong Answers:
- If the scenario mentions 'real-time dashboard', the output is likely Power BI.
- If the scenario mentions 'continuous query', the service is ASA.
- If the scenario mentions 'ingestion', the service is Event Hubs or IoT Hub.
- If the scenario mentions 'schedule', it is batch processing (e.g., ADF).
- Use the underlying mechanism: ASA processes in-memory with windows, Event Hubs stores events for up to 7 days but does not analyze.
Azure Stream Analytics (ASA) is the primary PaaS service for real-time analytics on Azure.
ASA processes data in-memory with sub-second latency using continuous SQL queries.
Common inputs: Azure Event Hubs, IoT Hub; common outputs: Power BI, Azure SQL Database.
ASA uses four window types: Tumbling, Hopping, Sliding, and Session.
Streaming Units (SU) define compute capacity: 1 SU = 1 GB memory, ~1 MB/s throughput.
Default late arrival tolerance is 5 seconds; default out-of-order tolerance is 0 seconds.
ASA can join streaming data with reference data from Blob Storage or SQL Database.
Exactly-once delivery is supported for outputs like Azure SQL Database with upsert.
ASA is not a batch service; use Azure Data Factory or HDInsight for batch processing.
Power BI is the typical output for real-time dashboards, not Blob Storage.
These come up on the exam all the time. Here's how to tell them apart.
Azure Stream Analytics
Processes data continuously in real-time (sub-second latency).
Uses SQL-like query language with temporal windowing.
Inputs: Event Hubs, IoT Hub, Blob Storage (stream).
Outputs: Power BI, SQL Database, Data Lake Storage, Event Hubs.
Scales via Streaming Units (1-192 SU).
Azure Data Factory
Processes data in batches on a schedule (e.g., hourly, daily).
Uses pipelines with copy, transform, and data flow activities.
Inputs: Wide variety of data stores (Blob, SQL, etc.).
Outputs: Many data stores; does not output to Power BI directly.
Scales via Integration Runtime and Data Flow clusters.
Mistake
Azure Event Hubs performs real-time analytics.
Correct
Event Hubs is a data ingestion service that captures and stores events for up to 7 days. It does not perform analytics like aggregation or windowing. Azure Stream Analytics (ASA) is the service that processes data from Event Hubs using SQL queries.
Mistake
Stream Analytics outputs data to Blob Storage for real-time dashboards.
Correct
Blob Storage is not suitable for real-time dashboards because it introduces latency and requires continuous polling. Power BI is the appropriate output for real-time dashboards as it provides live streaming tiles.
Mistake
Stream Analytics processes data in batches every few minutes.
Correct
ASA processes data continuously as events arrive, with sub-second latency. It does not use batch cycles. However, it can aggregate over time windows (e.g., every 5 minutes) but the output is still streamed.
Mistake
Stream Analytics queries cannot use reference data from SQL Database.
Correct
ASA can join streaming data with reference data from Azure SQL Database (or Blob Storage) using a static snapshot that is periodically refreshed. This is a common pattern for enriching streams.
Mistake
Stream Analytics supports exactly-once processing for all outputs.
Correct
Exactly-once semantics are only supported for outputs that support idempotent inserts or upserts (e.g., Azure SQL Database with a unique key). For other outputs like Blob Storage, ASA provides at-least-once delivery.
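The difference between at-least-once and effectively exactly-once output comes down to whether a replayed write is idempotent. The Python sketch below illustrates this with in-memory stand-ins: a list plays the role of an append-only sink and a dictionary keyed on a unique column plays the role of a table with upsert. The names and the `windowEnd` key are hypothetical.

```python
from typing import Dict, List


def append_output(sink: List[dict], batch: List[dict]) -> None:
    """Append-only sink: a replayed batch produces duplicate rows."""
    sink.extend(batch)


def upsert_output(table: Dict[int, dict], batch: List[dict], key: str = "windowEnd") -> None:
    """Keyed sink: a replayed batch overwrites the same rows (idempotent)."""
    for row in batch:
        table[row[key]] = row


if __name__ == "__main__":
    batch = [{"windowEnd": 300, "avgTemp": 21.0}, {"windowEnd": 600, "avgTemp": 30.0}]
    appended: List[dict] = []
    upserted: Dict[int, dict] = {}
    for _ in range(2):  # simulate a retry after a failure
        append_output(appended, batch)
        upsert_output(upserted, batch)
    print(len(appended), len(upserted))
```

With at-least-once delivery, retries are expected, so a sink with a unique key is what turns duplicates into harmless overwrites.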
Azure Event Hubs is a data ingestion platform that captures and stores events for up to 7 days. It can process millions of events per second but does not perform analytics. Azure Stream Analytics is a real-time analytics service that reads data from Event Hubs (or other sources), applies SQL queries with windowing, and writes results to outputs like Power BI or SQL Database. In short, Event Hubs is for ingestion, Stream Analytics is for processing.
Power BI is a common output for ASA. When configured, ASA writes streaming data directly to a Power BI dataset, which can be visualized in real-time dashboards with automatic refresh. This is ideal for monitoring live metrics. However, Power BI enforces throughput limits on streaming datasets, and the output must be in a compatible format.
Streaming Units (SU) represent the compute resources allocated to an ASA job. 1 SU provides about 1 GB of memory and can process roughly 1 MB/s of throughput. You can scale up to 192 SU for high-throughput scenarios. The number of SUs needed depends on event volume and query complexity. You can monitor SU utilization in the portal and adjust accordingly.
ASA has a configurable late arrival tolerance window (default 5 seconds). Events that arrive after this window relative to the current watermark are either dropped or have their timestamp adjusted (based on policy). The watermark is the maximum event timestamp seen so far. This ensures that results are not skewed by delayed events. For example, if tolerance is 30 seconds, events up to 30 seconds late are still processed.
ASA supports four window types: Tumbling (fixed, non-overlapping intervals), Hopping (overlapping intervals that move forward by a hop size), Sliding (produces results for each point in time when an event occurs), and Session (groups events separated by a period of inactivity). Each is used for different aggregation patterns. For example, Tumbling for periodic reports, Hopping for rolling averages, Sliding for real-time alerts, Session for detecting idle periods.
ASA can join streaming data with reference data stored in Azure SQL Database (or Blob Storage). The reference data is loaded as a static snapshot at job start and can be refreshed periodically (e.g., every 15 minutes). This is useful for enriching streams with lookup tables, such as product details or customer segments.
ASA is designed for real-time streaming, not batch. For batch processing on a schedule (e.g., hourly or daily), use Azure Data Factory, Azure HDInsight, or Azure Databricks. ASA processes data continuously as it arrives, with no built-in scheduling. If you need both, you can use ASA for real-time and ADF for periodic batch.