DP-900 | Chapter 17 of 101 | Objective 1.4

Batch Processing vs Streaming Analytics

This chapter covers the fundamental differences between batch processing and streaming analytics in Azure, two core data processing paradigms you must understand for the DP-900 exam. Roughly 15-20% of exam questions touch this topic, often asking you to identify which approach fits a given scenario based on latency, volume, and processing requirements. You will learn the mechanisms behind each, their Azure implementations, and how to choose between them — a skill tested directly in several scenario-based questions.

25 min read
Intermediate
Updated May 31, 2026

Batch vs Streaming: Bakery Orders and Fresh Donuts

Imagine a donut shop that receives customer orders. In batch processing, the shop collects all orders throughout the day and fries them in large batches at midnight. The baker takes the entire stack of order slips, mixes dough for all donuts at once, fries them together, and boxes them. Customers receive their donuts the next morning: the data is hours old, but the process is efficient because the fryer runs at full capacity.

In streaming analytics, the shop makes donuts one at a time as each order arrives. The baker has a continuous fryer belt: a customer orders a glazed donut, the baker immediately drops a scoop of dough onto the belt, it fries for exactly 90 seconds, gets glazed, and slides into a box. The customer gets a hot, fresh donut within two minutes.

The trade-off: batch processing maximizes throughput and minimizes cost per donut (one large batch uses less oil per donut) but introduces latency. Streaming minimizes latency (order to delivery in minutes rather than overnight) but requires more equipment and energy (the fryer belt runs continuously even during lulls). In Azure, batch processing corresponds to services like Azure Synapse or Azure Batch, where data accumulates before processing. Streaming corresponds to Azure Event Hubs for ingestion paired with Azure Stream Analytics for processing, where data is processed as it arrives with sub-second to second-level latency.

How It Actually Works

What Are Batch Processing and Streaming Analytics?

Batch processing and streaming analytics are two distinct approaches to data processing, each optimized for different time and volume constraints. Batch processing handles large volumes of data collected over a period — minutes, hours, or days — and processes it all at once in a single job. Streaming analytics processes data continuously as it arrives, often with sub-second to second-level latency.

Batch Processing: The Mechanism

Batch processing follows a "collect then process" model. Data is ingested from sources like files, databases, or message queues and stored in a staging area (e.g., Azure Blob Storage, Azure Data Lake Storage). A processing job — typically running on a schedule (e.g., every hour, nightly) or triggered by an event (e.g., file arrival) — reads the accumulated data, transforms it, and writes the output to a destination (e.g., a data warehouse, a data lake).

Key characteristics:

- Latency: Minutes to hours. The time from data creation to availability of processed results is bounded by the batch interval.
- Volume: Handles terabytes to petabytes per batch. Optimized for high throughput.
- Processing engine: Typically MapReduce (Hadoop), Spark, or Azure Synapse pipelines. In Azure, common services include Azure Synapse Analytics (SQL pools, Spark pools), Azure Data Factory, and Azure Batch.
- Fault tolerance: Batch jobs can be retried from the beginning or from checkpoints. Because data is stored durably, failures don't lose data; they only delay output.
- Cost: Lower cost per unit of data because resources are provisioned for the duration of the job and can be scaled down between runs.

Streaming Analytics: The Mechanism

Streaming analytics processes data in motion. Data arrives as a continuous stream of events (e.g., sensor readings, clickstream logs, financial transactions) and is processed within a small time window — often seconds or milliseconds. The processing is typically stateful, meaning the engine maintains state across events (e.g., aggregating counts over a tumbling window).

Key characteristics:

- Latency: Sub-second to a few seconds. The time from event ingestion to output is minimal.
- Volume: Handles millions of events per second. Throughput is high but each event is small.
- Processing engine: Typically Apache Flink, Apache Spark Streaming, Kafka Streams, or Azure Stream Analytics. In Azure, common services include Azure Stream Analytics, Azure Event Hubs (with Capture for batch fallback), and Azure Data Explorer.
- Fault tolerance: Streaming engines use checkpointing and replayable sources (like Event Hubs) to recover from failures without data loss. They guarantee at-least-once or exactly-once processing semantics.
- Cost: Higher cost per unit of data because resources must run continuously to handle the stream.
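To make "stateful" concrete, here is a minimal Python sketch of per-event processing that keeps state between events. It is an illustration only, not how Azure Stream Analytics is implemented; the function name and the sensor IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_counts(events: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (key, count_so_far) after every event.

    The dictionary is the engine's state: it survives from one event
    to the next, which is what makes the processing stateful.
    """
    state: Dict[str, int] = defaultdict(int)
    for key in events:
        state[key] += 1          # update state for this event
        yield key, state[key]    # emit a result immediately


# Hypothetical sensor IDs arriving one at a time.
stream = ["sensor-a", "sensor-b", "sensor-a", "sensor-a"]
results = list(running_counts(stream))
print(results)
```

A result is available after each event rather than after the whole input has been collected, which is the essential difference from the batch model.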

How They Work Internally

Batch Processing Step-by-Step

1. Data Ingestion: Data is collected into a durable store. For example, log files are uploaded to Azure Blob Storage every 15 minutes.

2. Job Trigger: A scheduler (e.g., Azure Data Factory trigger) launches a processing job at a specified time or when data arrives.

3. Data Loading: The job reads the data from the storage. In Azure Synapse, this could be a PolyBase external table reading from Blob Storage.

4. Transformation: The data is transformed using SQL, Spark, or custom code. For example, aggregating daily sales by product.

5. Output: The transformed data is written to a destination, such as a Synapse dedicated SQL pool table or a Parquet file in Data Lake Storage.

6. Cleanup: Temporary data may be deleted. The job logs completion status.
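The batch steps above can be sketched in plain Python. This is a toy model under stated assumptions: the in-memory CSV string stands in for a staged file in Blob Storage, and the function names (`load`, `transform`, `run_batch_job`) are hypothetical, not part of any Azure SDK.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# Hypothetical staged file contents (stand-in for a blob in the staging area).
STAGED_CSV = """product,amount
donut,2.50
coffee,3.00
donut,2.50
"""


def load(staged: str) -> List[Dict[str, str]]:
    """Data Loading: read the whole accumulated dataset at once."""
    return list(csv.DictReader(io.StringIO(staged)))


def transform(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Transformation: aggregate sales by product."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["product"]] += float(row["amount"])
    return dict(totals)


def run_batch_job(staged: str) -> Dict[str, float]:
    """Job Trigger entry point: load everything, then transform, then return output."""
    return transform(load(staged))


daily_sales = run_batch_job(STAGED_CSV)
print(daily_sales)
```

Nothing is produced until the whole input has been read, so result latency is bounded below by how often the job runs.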

Streaming Analytics Step-by-Step

1. Event Ingestion: Events are sent to an event ingestion service like Azure Event Hubs or IoT Hub. Events are stored durably in partitions for a configurable retention period (1 day by default; up to 7 days on the Event Hubs standard tier and for IoT Hub).

2. Stream Processing: A Stream Analytics job reads events from the input source. It processes events using a SQL-like query language. The job runs continuously until it is explicitly stopped or fails.

3. Windowing: Aggregations are performed over time windows: tumbling (fixed, non-overlapping), hopping (overlapping), sliding (event-driven), or session (based on inactivity gaps). For example, a 5-minute tumbling window counts events every 5 minutes.

4. State Management: The engine maintains state for windows and aggregates. State is checkpointed periodically to durable storage (e.g., Azure Storage) for fault tolerance.

5. Output: Results are sent to output sinks: Azure SQL Database, Azure Blob Storage, Power BI, Event Hubs, or Azure Functions. Output is typically written with sub-second to second-level latency.
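Of the window types listed in the Windowing step, session windows are the least intuitive, so here is a small Python sketch of the idea: events are grouped together until an inactivity gap is exceeded. It is a simplification (for instance, it ignores any maximum session duration), and `session_windows` is a hypothetical helper, not a Stream Analytics API.

```python
from typing import List


def session_windows(timestamps: List[float], gap: float) -> List[List[float]]:
    """Group event times (in seconds) into sessions.

    A new session starts whenever the time since the previous event
    is greater than `gap` seconds of inactivity.
    """
    sessions: List[List[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the inactivity gap
        else:
            sessions.append([ts])     # gap exceeded: open a new session
    return sessions


# Three bursts of activity separated by quiet periods longer than 5 seconds.
sessions = session_windows([0, 2, 3, 20, 21, 50], gap=5)
print(sessions)
```

Unlike tumbling or hopping windows, the window boundaries here are determined by the data itself rather than by the clock.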

Key Components, Values, and Defaults

Azure Batch Processing Services

Azure Data Factory: Cloud-based ETL and orchestration service. Supports schedule triggers (minimum recurrence 1 minute) and event-based triggers (for example, blob created). Activity retry policy: the retry count is configurable (default 0), with a default retry interval of 30 seconds.

Azure Synapse Analytics: Integrated analytics service. Dedicated SQL pool: scales from DW100c to DW30000c. Serverless SQL pool: pay per TB of data processed. Spark pools: up to 200 nodes.

Azure Batch: Managed HPC service. Supports job preparation tasks and task dependencies. Default task retention time: 7 days.

Azure Streaming Analytics Services

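Azure Event Hubs: Ingestion service with partitions (up to 32 per event hub on the standard tier; higher limits on the premium and dedicated tiers). Throughput units (TU): 1 TU = ingress of up to 1 MB/s or 1,000 events/s, and egress of up to 2 MB/s or 4,096 events/s. Retention: 1 day default (up to 7 days on the standard tier).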

Azure IoT Hub: For IoT devices. Capacity is purchased in units, and the daily message quota per unit depends on the tier (S1: 400,000 messages/day per unit; S2: 6,000,000 messages/day per unit). Device-to-cloud message size: up to 256 KB.

Azure Stream Analytics: Processing engine. Supports up to 192 streaming units (SU). Each SU provides 1 MB/s throughput. Query language: T-SQL with extensions for windowing and temporal operations. Event ordering: configurable late arrival tolerance (default 5 seconds) and out-of-order tolerance (default 0 seconds).

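Azure Data Explorer: Real-time analytics on large volumes of streaming data. Ingestion batching: by default, queued ingestion seals a batch after 5 minutes or when 1 GB of data has accumulated, whichever comes first (an item-count limit also applies). The batching policy can be tuned, and streaming ingestion is available when lower latency is needed.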
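The Event Hubs throughput-unit figures above lend themselves to a quick sizing calculation. The sketch below assumes the ingress limits quoted in this section (1 MB/s and 1,000 events/s per TU) and ignores egress; the function name is hypothetical and the workloads are made-up examples.

```python
import math

# Ingress limits per throughput unit, as quoted in this section.
MB_PER_SEC_PER_TU = 1.0
EVENTS_PER_SEC_PER_TU = 1000


def required_throughput_units(ingress_mb_per_sec: float, events_per_sec: float) -> int:
    """Return the smallest TU count that satisfies both ingress limits."""
    by_bytes = math.ceil(ingress_mb_per_sec / MB_PER_SEC_PER_TU)
    by_events = math.ceil(events_per_sec / EVENTS_PER_SEC_PER_TU)
    # Whichever limit is hit first determines the requirement; at least 1 TU.
    return max(1, by_bytes, by_events)


# Larger events: bandwidth is the bottleneck.
tus_large_events = required_throughput_units(ingress_mb_per_sec=2.5, events_per_sec=1200)
# Tiny events: the event-count limit is the bottleneck.
tus_small_events = required_throughput_units(ingress_mb_per_sec=0.5, events_per_sec=4500)
print(tus_large_events, tus_small_events)
```

The point of taking the maximum is that a stream of many tiny events can exhaust the events-per-second limit long before it reaches the bandwidth limit.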

Configuration and Verification

Azure Data Factory Pipeline Example

To create a batch pipeline that copies data daily:

1. Create a linked service to source (e.g., Blob Storage).
2. Create a dataset pointing to source files.
3. Create a pipeline with a Copy activity.
4. Add a schedule trigger with recurrence: every 1 day at 02:00 AM.
5. Publish and trigger.

Verify using: Get-AzDataFactoryV2PipelineRun -ResourceGroupName ...

Azure Stream Analytics Job Example

To create a job that counts events per 5-minute window:

SELECT
    System.Timestamp AS WindowEnd,
    COUNT(*) AS EventCount
INTO [output]
FROM [input]
GROUP BY TumblingWindow(minute, 5)

Verify using: Get-AzStreamAnalyticsJob -ResourceGroupName ... and check job metrics (InputEvents, OutputEvents, Watermark delay) in Azure Monitor.
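To see what the tumbling-window query above computes, here is a rough Python emulation of its result: one row per 5-minute window, keyed by the window end (the role `System.Timestamp` plays in the query). It assumes integer epoch-second timestamps and treats each window as covering the interval after its start up to and including its end; `tumbling_counts` is a hypothetical helper, not part of any Azure SDK.

```python
from collections import Counter
from typing import Dict, Iterable

WINDOW_SECONDS = 5 * 60  # matches TumblingWindow(minute, 5)


def tumbling_counts(event_times: Iterable[int], window: int = WINDOW_SECONDS) -> Dict[int, int]:
    """Map each window end (epoch seconds) to the number of events in that window."""
    counts: Counter = Counter()
    for ts in event_times:
        # Round up to the next multiple of `window`; an event exactly on a
        # boundary belongs to the window that ends at that boundary.
        window_end = ((ts + window - 1) // window) * window
        counts[window_end] += 1
    return dict(counts)


# Event times in seconds: three fall in the first window, one each in the next two.
event_counts = tumbling_counts([10, 299, 300, 301, 650])
print(event_counts)
```

Each event lands in exactly one window, which is the defining property of a tumbling window; windows with no events simply produce no row in this sketch.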

Interaction with Related Technologies

Batch processing often feeds into data warehouses (Azure Synapse) or data lakes (Azure Data Lake Storage). Streaming analytics can also write to these destinations, but typically for real-time dashboards (Power BI) or alerts (Azure Functions, Logic Apps).

Event Hubs can serve both: real-time consumption via Stream Analytics and batch consumption via Event Hubs Capture (writes to Blob Storage in Avro format, which can be processed by Data Factory or Synapse).

Azure Data Explorer bridges both: it ingests streaming data with low latency but also supports batch ingestion from storage.

Summary of Differences

| Aspect | Batch Processing | Streaming Analytics |
|--------|------------------|---------------------|
| Latency | Minutes to hours | Sub-second to seconds |
| Data Volume per Unit | Large (TB/PB) | Small per event, high throughput |
| Processing Model | Collect then process | Process as it arrives |
| State | Stateless or checkpointed | Stateful (windows, aggregates) |
| Cost | Lower per GB | Higher per GB |
| Typical Use Cases | Historical reports, ETL | Real-time dashboards, alerts |
| Azure Services | Data Factory, Synapse, Batch | Stream Analytics, Event Hubs, Data Explorer |

The choice between batch and streaming depends on latency requirements. If the business can tolerate minutes or hours of delay, batch is more cost-effective. If sub-second decisions are needed (fraud detection, IoT alerts), streaming is mandatory. The exam tests this decision point repeatedly.
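The decision rule in the paragraph above can be written down as a one-line check. The 60-second cutoff is an assumption made for this sketch (the text only distinguishes "minutes or hours" from "sub-second"), and `choose_paradigm` is a hypothetical name.

```python
def choose_paradigm(max_acceptable_latency_seconds: float,
                    threshold_seconds: float = 60.0) -> str:
    """Pick a processing paradigm from the latency the business can tolerate.

    threshold_seconds is an illustrative cutoff, not an official figure:
    anything that must be answered faster than this is treated as streaming.
    """
    if max_acceptable_latency_seconds < threshold_seconds:
        return "streaming"
    return "batch"


fraud_alert = choose_paradigm(0.5)            # decision needed in under a second
nightly_report = choose_paradigm(8 * 3600)    # results needed by the next morning
print(fraud_alert, nightly_report)
```

Real decisions also weigh cost and data volume, but on exam scenarios the latency requirement is usually the deciding clue.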

Walk-Through

1. Ingest data into durable store

For batch processing, data is first collected into a durable storage system like Azure Blob Storage or Azure Data Lake Storage. This step is asynchronous: data sources write files at their own pace. The storage acts as a buffer, decoupling ingestion from processing. In streaming, data is ingested into a temporary, ordered log like Azure Event Hubs or IoT Hub. Events are stored in partitions with configurable retention (default 1 day for Event Hubs standard). The ingestion service acknowledges each event once it has been stored, so senders know which events are safe.

2. Trigger processing job

Batch jobs are triggered by a scheduler (e.g., Azure Data Factory trigger) or an event (e.g., blob created). Time-based triggers fire on a schedule (e.g., every 24 hours at 2 AM). Event-based triggers fire when a new file appears. In streaming, the processing job runs continuously: it is deployed and started once, and it keeps running until it is explicitly stopped or fails. The Stream Analytics job reads from the event source continuously rather than waiting for a trigger.

3. Read and transform data

In batch, the job reads the entire accumulated dataset from storage. Transformations are applied using SQL, Spark, or custom code. For example, a Data Factory pipeline may use a Data Flow to join, aggregate, and filter data. The job processes all data in a single pass. In streaming, the job reads events from the input source as they arrive. It applies a SQL query that may include windowing functions. For each event, the engine updates state (e.g., a count per window) and emits results when a window completes or when an output is triggered.

4. Write output to destination

Batch output is written to a destination such as a data warehouse table (Azure Synapse dedicated SQL pool), a file (Parquet in Data Lake), or a database. The write is typically all-or-nothing: either all output succeeds or the job fails and retries. In streaming, output is written continuously as results are produced. Common sinks include Power BI (for real-time dashboards), Azure SQL Database, Blob Storage, or Event Hubs. Output is written with low latency, often within seconds of the event.

5. Monitor and handle failures

Batch jobs rely on retry policies (e.g., a Data Factory activity configured to retry up to 3 times at 30-second intervals). If all retries fail, the job is marked as failed and an alert can be sent. The data in storage remains intact for reprocessing. Streaming jobs use checkpointing: the engine periodically saves the state and the current position in the event stream. If the job fails, it restarts from the last checkpoint, ensuring at-least-once processing. Monitoring metrics include InputEvents, OutputEvents, and Watermark delay (for Stream Analytics).
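The checkpoint-and-restart behavior described in the last step explains why recovery gives at-least-once rather than exactly-once processing. The toy simulation below checkpoints the read position every few events, crashes once, and resumes from the last checkpoint; all names and the crash mechanism are hypothetical and much simpler than a real engine.

```python
from typing import List


def process_with_checkpoints(events: List[str], checkpoint_every: int, crash_at: int) -> List[str]:
    """Process events in order, simulating one crash followed by a restart.

    The checkpoint stores the offset of the next event to read. After the
    crash, processing resumes from the last checkpoint, so any events handled
    since that checkpoint are processed (and emitted) a second time.
    """
    output: List[str] = []
    checkpoint = 0
    offset = 0
    crashed = False
    while offset < len(events):
        if offset == crash_at and not crashed:
            crashed = True
            offset = checkpoint          # restart from the last durable checkpoint
            continue
        output.append(events[offset])    # "process" the event
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = offset          # durably record progress
    return output


emitted = process_with_checkpoints(["e0", "e1", "e2", "e3", "e4"], checkpoint_every=2, crash_at=3)
print(emitted)
```

No event is lost, but `e2` is emitted twice because it was processed after the last checkpoint and before the crash. Removing such duplicates is the job of the sink or of downstream logic.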

What This Looks Like on the Job

Scenario 1: E-commerce Order Processing

A large online retailer processes millions of orders daily. They use batch processing for nightly reports: all orders from the day are aggregated into a data warehouse (Azure Synapse) using Azure Data Factory pipelines running at 2 AM. This batch job calculates daily revenue, inventory levels, and customer purchase patterns. The latency is acceptable because reports are used for strategic decisions, not real-time operations. However, they also need real-time fraud detection: if a customer places multiple high-value orders in quick succession, the system must flag it immediately. For this, they use Azure Event Hubs to ingest order events and Azure Stream Analytics to evaluate fraud rules with sub-second latency. Alerts are sent via Azure Functions to the fraud team. The batch and streaming systems coexist: streaming handles real-time alerts, while batch provides comprehensive historical analysis.

Scenario 2: IoT Temperature Monitoring

A manufacturing plant has thousands of sensors sending temperature readings every second. They need to detect overheating immediately to prevent equipment damage. They use Azure IoT Hub to ingest sensor data, then Azure Stream Analytics to process the stream with a sliding window: if the average temperature over the last 10 seconds exceeds 100°C, an alert is sent. This is a streaming analytics workload: latency must be under 5 seconds. Additionally, they archive all raw sensor data to Azure Blob Storage using IoT Hub message routing and Event Hubs Capture. Every night, a batch job (Azure Data Factory) reads the archived data and computes daily statistics (min, max, average temperature per sensor) for predictive maintenance models. Misconfiguration: if the Stream Analytics job's late arrival tolerance is set too low (e.g., 1 second), events delayed in transit beyond that tolerance are dropped or have their timestamps adjusted, depending on the configured policy, which can cause missed alerts (false negatives). The correct setting depends on network latency, typically 5-10 seconds.

Scenario 3: Financial Transaction Processing

A bank processes credit card transactions. For regulatory reporting, they need accurate daily summaries of all transactions — this is done via batch processing using Azure Synapse pipelines that run after midnight. However, for fraud detection, they need real-time analysis: if a transaction is flagged as suspicious (e.g., amount > $10,000 and location mismatch), the system must block the transaction within milliseconds. They use Azure Event Hubs with Stream Analytics to evaluate each transaction against rules. The Stream Analytics job outputs to an Azure SQL Database that is queried by the authorization system. Common mistake: engineers sometimes try to use the same batch pipeline for both purposes, adding a streaming component as an afterthought. This leads to either high latency for alerts or high cost for batch processing that runs continuously. The correct architecture is to separate concerns: streaming for real-time decisions, batch for historical analytics.

How DP-900 Actually Tests This

What DP-900 Tests

DP-900 objective 1.4: "Describe batch and streaming processing." The exam focuses on:

Identifying whether a scenario requires batch or streaming based on latency, volume, and processing frequency.

Recognizing Azure services that support each: batch (Azure Data Factory, Azure Synapse, Azure Batch) vs streaming (Azure Stream Analytics, Azure Event Hubs, Azure IoT Hub).

Understanding the difference between tumbling, hopping, and sliding windows in Stream Analytics.

Knowing that Event Hubs can be used for both real-time consumption and batch (via Capture).

Common Wrong Answers and Why

1. "Use Azure Stream Analytics for batch processing": Candidates see "analytics" and think it covers all processing. Reality: Stream Analytics is exclusively for streaming (continuous, low-latency). For batch, use Data Factory or Synapse.

2. "Azure Data Factory can do real-time processing": Data Factory is an orchestration service, not a stream processor. Its triggers can be event-based but still launch batch jobs with minutes of latency, not sub-second.

3. "Batch processing is always cheaper": While generally true, if batch jobs require large clusters that run 24/7, streaming may be more cost-effective. The exam tests the general principle: batch is cheaper for periodic processing of large volumes.

4. "Streaming analytics cannot handle historical data": It can, but it's inefficient. Streaming is designed for recent data; for historical analysis, batch is better.

Specific Numbers and Terms

Event Hubs retention: default 1 day, max 7 days (standard tier).

Stream Analytics streaming units: up to 192 SU, each ~1 MB/s throughput.

Window types: Tumbling (fixed, non-overlapping), Hopping (overlapping with hop size < window size), Sliding (event-driven, emits when value changes), Session (based on inactivity gap).

Late arrival tolerance: default 5 seconds in Stream Analytics.

Out-of-order tolerance: default 0 seconds (events out of order beyond this are dropped or adjusted).
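The late arrival tolerance can be pictured as a simple comparison between when an event happened and when it arrived. The sketch below is a simplified model of the drop-or-adjust behavior, using the 5-second default quoted above; the function name and the exact adjustment rule (moving the timestamp to the arrival time minus the tolerance) are assumptions made for illustration.

```python
from typing import Optional


def apply_late_arrival_policy(event_time: float, arrival_time: float,
                              tolerance: float = 5.0, action: str = "adjust") -> Optional[float]:
    """Return the timestamp the engine would use, or None if the event is dropped.

    An event is late when it arrives more than `tolerance` seconds after
    the time it says it occurred.
    """
    delay = arrival_time - event_time
    if delay <= tolerance:
        return event_time                 # on time: keep the original timestamp
    if action == "drop":
        return None                       # late and the policy is to drop
    return arrival_time - tolerance       # late: move the timestamp forward


on_time = apply_late_arrival_policy(event_time=100.0, arrival_time=103.0)
adjusted = apply_late_arrival_policy(event_time=100.0, arrival_time=110.0)
dropped = apply_late_arrival_policy(event_time=100.0, arrival_time=110.0, action="drop")
print(on_time, adjusted, dropped)
```

Either outcome for a late event changes which window it counts toward, which is why the tolerance should reflect real network delays.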

Edge Cases

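Exactly-once vs at-least-once: Stream Analytics guarantees at-least-once delivery to its outputs, so duplicates can appear after a restart. Exactly-once results are achievable with sinks that support idempotent or keyed writes, such as Azure SQL Database with a unique key constraint. Batch jobs can be rerun safely with exactly-once results when their writes are idempotent.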

Event Hubs Capture: enables batch processing of streaming data. Capture writes events to Blob Storage in Avro format, which can then be processed by Data Factory or Synapse. This is a hybrid pattern.

IoT Hub message routing: can route messages to multiple endpoints, including Event Hubs (for streaming) and Blob Storage (for batch).
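The exactly-once versus at-least-once distinction is easier to see with a duplicate in hand. In this sketch a hypothetical sink receives the same event twice (as can happen after a restart); writing by key makes the duplicate harmless, while a plain append double-counts it. The event IDs and amounts are invented for the example.

```python
from typing import Dict, List, Optional, Tuple


def upsert_all(delivered: List[Tuple[str, int]],
               sink: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Write events into a keyed sink; rewriting the same key is a no-op in effect."""
    sink = {} if sink is None else sink
    for event_id, value in delivered:
        sink[event_id] = value   # idempotent: a redelivered event overwrites itself
    return sink


# At-least-once delivery: "tx-2" arrives twice.
delivered = [("tx-1", 40), ("tx-2", 75), ("tx-2", 75), ("tx-3", 10)]

keyed_total = sum(upsert_all(delivered).values())     # duplicate absorbed by the key
appended_total = sum(value for _, value in delivered)  # duplicate counted twice
print(keyed_total, appended_total)
```

This is why a sink with a unique key can turn at-least-once delivery into exactly-once results, and why batch reruns are safe when their writes are idempotent.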

How to Eliminate Wrong Answers

If a question mentions "real-time" or "immediate" or "sub-second" — eliminate batch services. If it mentions "nightly" or "daily" or "large volume" — eliminate streaming services. Look for keywords: "continuous", "stream", "event" indicate streaming; "schedule", "batch", "accumulate" indicate batch. Also, note that Azure Data Explorer supports both but is optimized for interactive analytics on streaming data, not traditional batch ETL.

Key Takeaways

Batch processing is used when latency requirements are minutes or longer; streaming is for sub-second to second-level latency.

Azure Data Factory and Azure Synapse are the primary batch processing services; Azure Stream Analytics and Azure Event Hubs are for streaming.

Stream Analytics uses tumbling, hopping, sliding, and session windows for temporal aggregations.

Event Hubs retention defaults to 1 day (max 7 days) for standard tier; Capture enables batch processing of streamed data.

Stream Analytics supports up to 192 streaming units; each SU provides ~1 MB/s throughput.

Batch is generally cheaper per GB processed than streaming because resources are used intermittently.

The exam tests scenario-based identification: look for keywords like 'real-time' (streaming) vs 'nightly' (batch).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Batch Processing

Processes data in discrete chunks after accumulation

Latency: minutes to hours

Optimized for high throughput on large volumes

Lower cost per unit of data processed

Azure services: Data Factory, Synapse, Batch

Streaming Analytics

Processes data continuously as it arrives

Latency: sub-second to seconds

Optimized for low latency on high-velocity data

Higher cost per unit of data processed

Azure services: Stream Analytics, Event Hubs, IoT Hub

Watch Out for These

Mistake

Streaming analytics processes data exactly once, with no duplicates.

Correct

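Stream Analytics guarantees at-least-once delivery to its outputs. Exactly-once results are achievable only with sinks that support idempotent or keyed writes (for example, Azure SQL Database with a unique key constraint). With other sinks, such as Power BI, duplicates can occur after failures and restarts.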

Mistake

Batch processing is always slower than streaming.

Correct

Batch processing has higher latency but can achieve higher throughput for large datasets. For a given volume, batch may finish faster in wall-clock time if the streaming engine is throttled by per-event overhead.

Mistake

Azure Data Factory can process streaming data in real time.

Correct

Data Factory is an orchestration service for batch and incremental data movement. It does not support continuous stream processing. It can trigger pipelines based on events, but each pipeline run is a batch job.

Mistake

Event Hubs is only for streaming analytics.

Correct

Event Hubs can also support batch processing via the Capture feature, which writes events to Azure Blob Storage in Avro format. These files can then be processed by batch services like Azure Data Factory or Azure Synapse.

Mistake

Tumbling windows and hopping windows are the same thing.

Correct

Tumbling windows are a series of fixed-size, consecutive, non-overlapping time intervals. Hopping windows overlap — they hop forward by a specified interval that is smaller than the window size. For example, a 1-hour window hopping every 15 minutes produces 4 overlapping windows per hour.
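The hopping-window example can be checked with a few lines of Python. The sketch lists every window (identified by its end time, in seconds) that contains a given event, assuming windows end on multiples of the hop and cover the interval after their start up to and including their end; `hopping_window_ends` is a hypothetical helper.

```python
from typing import List


def hopping_window_ends(ts: int, size: int, hop: int) -> List[int]:
    """Return the end times of all hopping windows that contain timestamp `ts`."""
    # First hop boundary at or after the event.
    first_end = ((ts + hop - 1) // hop) * hop
    # Later windows still contain the event as long as they started before it.
    return [end for end in range(first_end, first_end + size, hop) if end - size < ts]


HOUR = 3600
QUARTER_HOUR = 900

# An event at t=1000s with a 1-hour window hopping every 15 minutes.
hopping = hopping_window_ends(1000, size=HOUR, hop=QUARTER_HOUR)
# Setting hop == size degenerates to a tumbling window: exactly one window.
tumbling = hopping_window_ends(1000, size=HOUR, hop=HOUR)
print(hopping, tumbling)
```

The event is counted in four overlapping windows in the hopping case and in exactly one in the tumbling case, which is the practical difference between the two.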


Frequently Asked Questions

What is the difference between batch processing and streaming analytics in Azure?

Batch processing collects data over a period and processes it all at once, resulting in minutes-to-hours latency. Streaming analytics processes data continuously as it arrives, achieving sub-second to seconds latency. In Azure, batch uses services like Azure Data Factory and Azure Synapse; streaming uses Azure Stream Analytics and Event Hubs. Choose batch for cost-effective large-volume historical analysis; choose streaming for real-time alerts and dashboards.

Which Azure services are used for batch processing?

Primary batch processing services include Azure Data Factory (orchestration and ETL), Azure Synapse Analytics (dedicated SQL pool, serverless SQL pool, Spark pools), and Azure Batch (for HPC workloads). These services handle large datasets with scheduled or event-triggered jobs. They are not designed for sub-second latency.

Which Azure services are used for streaming analytics?

Key streaming services are Azure Stream Analytics (real-time SQL-based processing), Azure Event Hubs (event ingestion), Azure IoT Hub (IoT device messaging), and Azure Data Explorer (interactive analytics on streaming data). They process events continuously with low latency.

What are the window types in Azure Stream Analytics?

Stream Analytics supports five window types: Tumbling (fixed, non-overlapping intervals), Hopping (overlapping intervals with a hop size smaller than the window size), Sliding (outputs only when the content of the window changes, that is, when an event enters or leaves it), Session (starts with an event and closes after a period of inactivity), and Snapshot (groups events that share the same timestamp). The first four are the ones this chapter focuses on.

Can Event Hubs be used for batch processing?

Yes, through Event Hubs Capture. Capture automatically writes ingested events to Azure Blob Storage or Azure Data Lake Storage in Avro format whenever a configured time window or size window is reached, whichever comes first (the defaults are 5 minutes and 300 MB; the time window can be set from 1 to 15 minutes and the size window from 10 to 500 MB). These files can then be processed by batch services like Azure Data Factory or Azure Synapse. This enables a hybrid pattern where streaming data is also available for batch analytics.
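Capture's flush rule (close the current file when either a time threshold or a size threshold is reached) can be modeled in a few lines. This is a simplified, event-driven sketch with tiny made-up thresholds for readability; `capture_files` is a hypothetical function and does not reflect the Avro format or how the service is actually implemented.

```python
from typing import List, Optional, Tuple


def capture_files(events: List[Tuple[float, bytes]],
                  max_seconds: float, max_bytes: int) -> List[List[bytes]]:
    """Split (arrival_time, payload) events into files by time or size, whichever comes first."""
    files: List[List[bytes]] = []
    current: List[bytes] = []
    size = 0
    window_start: Optional[float] = None
    for arrival, payload in events:
        if window_start is None:
            window_start = arrival
        # Time window elapsed: close the open file before accepting this event.
        if current and arrival - window_start >= max_seconds:
            files.append(current)
            current, size, window_start = [], 0, arrival
        current.append(payload)
        size += len(payload)
        # Size window reached: close the file immediately.
        if size >= max_bytes:
            files.append(current)
            current, size, window_start = [], 0, None
    if current:
        files.append(current)   # flush whatever remains at the end
    return files


events = [(0, b"aaaa"), (10, b"bbbb"), (20, b"cc"), (30, b"dd"), (400, b"ee")]
captured = capture_files(events, max_seconds=300, max_bytes=10)
print(captured)
```

Here the first file closes because it reached the size limit and the second because the time window elapsed, giving downstream batch jobs a steady supply of bounded files.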

What is the default retention period for Event Hubs?

For the standard tier, the default retention period is 1 day, configurable up to 7 days. The premium and dedicated tiers allow longer retention, up to 90 days. Events older than the retention period are automatically deleted.

How does Stream Analytics handle late-arriving events?

Stream Analytics has a configurable late arrival tolerance (default 5 seconds) and out-of-order tolerance (default 0 seconds). Events arriving within the late arrival window are processed with their original timestamps; events outside that window are either dropped or have their timestamps adjusted, depending on the configured policy. This keeps windowed aggregations consistent despite network delays.
