This chapter covers data ingestion patterns—specifically batch processing and streaming—as they relate to Azure data services. Understanding these patterns is critical for the DP-900 exam, as questions on ingestion methods appear in roughly 15-20% of the exam, primarily under objective 1.4: 'Describe core data concepts.' You will learn the fundamental differences, use cases, and Azure services that implement each pattern. Mastery of this topic will help you answer scenario-based questions about choosing the right ingestion approach for given latency and volume requirements.
Imagine a city's water supply system. In a batch processing model, the city collects rainwater in a large reservoir over a period (e.g., one week). At the end of the week, a team opens a valve, releasing the entire reservoir's water into the treatment plant, which processes the full volume at once. The treated water is then distributed to homes. This approach is efficient because the treatment plant operates at full capacity in one go, but residents must wait until the end of the week for water. In contrast, a streaming model uses a pipeline directly from each rain gutter to a continuous treatment system. As soon as a raindrop hits the gutter, it flows through a small pipe to a mini-treatment unit that processes it instantly. The treated water is immediately sent to homes. Residents get water in real-time, but the system requires many small treatment units and continuous monitoring. The batch model is simpler and cheaper for large volumes but introduces latency; the streaming model provides low-latency updates but is more complex and costly. In data, batch processing collects data over time (e.g., hourly, daily) and processes it in bulk, while streaming processes each data event as it arrives, enabling real-time insights.
What Are Data Ingestion Patterns?
Data ingestion is the process of moving data from one or more sources to a destination where it can be stored, processed, and analyzed. The two primary patterns are batch and streaming. Batch ingestion processes data in large, discrete chunks at scheduled intervals, while streaming ingestion processes data continuously as it is generated.
Why Two Patterns Exist
Different business scenarios demand different trade-offs between latency, cost, complexity, and throughput. Batch is ideal for historical analysis, reporting, and scenarios where near-real-time updates are not required. Streaming is essential for real-time dashboards, fraud detection, IoT telemetry, and any application that needs immediate action on incoming data.
How Batch Processing Works Internally
Data Collection: Data is accumulated over a period (e.g., every hour, daily) in a staging area such as Azure Blob Storage or a landing zone.
Trigger: A scheduler (e.g., Azure Data Factory schedule, cron job) initiates the batch job.
Extraction: The batch job reads the accumulated data from the source.
Transformation: Data is cleaned, validated, and transformed using tools like Azure Data Factory mapping data flows or Azure Databricks notebooks.
Load: The processed data is written to the target, such as Azure Synapse Analytics, Azure SQL Database, or Azure Data Lake Storage.
Post-processing: The staging area may be cleaned or archived.
Azure Data Factory schedule triggers can recur as often as every 1 minute or as rarely as once a month. Hourly and daily schedules are the most common choices.
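The batch steps above can be sketched in plain Python. This is a minimal illustration, not Azure Data Factory code: the in-memory `staging` list stands in for a landing zone such as Blob Storage, `target` stands in for the destination table, and the function names and validation rule are hypothetical.

```python
from typing import Dict, List


def is_valid(record: Dict) -> bool:
    """Validation rule (assumed for illustration): the reading must be present."""
    return record.get("reading") not in (None, "")


def transform(record: Dict) -> Dict:
    """Clean one record: trim the device name and convert the reading to float."""
    return {"device": record["device"].strip(), "reading": float(record["reading"])}


def run_batch(staging: List[Dict], target: List[Dict]) -> int:
    """Process everything that accumulated in staging in one bulk run."""
    extracted = list(staging)                                   # Extraction
    cleaned = [transform(r) for r in extracted if is_valid(r)]  # Transformation
    target.extend(cleaned)                                      # Load
    staging.clear()                                             # Post-processing
    return len(cleaned)


# Data collected since the previous run (hypothetical sample records).
staging = [
    {"device": " pump-1 ", "reading": "20.5"},
    {"device": "pump-2", "reading": ""},
    {"device": "pump-3", "reading": "22"},
]
target: List[Dict] = []
loaded = run_batch(staging, target)
print(loaded, target)
```

Nothing reaches `target` until `run_batch` is called, which is where the latency of the batch pattern comes from: a scheduler decides when that call happens.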
How Streaming Processing Works Internally
Event Capture: Data events are captured in real-time from sources like IoT devices, application logs, or change data capture (CDC) streams. Azure Event Hubs or IoT Hub are common capture points.
Ingestion: Events are ingested into a streaming platform (e.g., Event Hubs, Kafka) with a defined partition count and retention period (default 1 day, configurable up to 7 days for Event Hubs).
Processing: A stream processor (e.g., Azure Stream Analytics, Azure Functions, Spark Structured Streaming) reads events and applies transformations, aggregations, or filters on-the-fly. For example, a Stream Analytics job can compute a 5-minute rolling average of sensor readings.
Output: Processed results are sent to a sink such as Power BI for real-time dashboards, Azure Storage for archival, or Azure SQL Database for low-latency queries.
Stream processors aggregate events over time windows. A tumbling window of 5 minutes processes events in non-overlapping 5-minute chunks. A hopping window of 10 minutes with a 5-minute hop produces overlapping windows, so each event can belong to more than one window. Stream Analytics has no default window size: the window type and duration are specified explicitly in the query, for example TumblingWindow(minute, 5).
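To make the difference between the two window types concrete, here is a small Python sketch of window assignment. It is an illustration only: timestamps are plain integers in minutes, the function names are hypothetical, and windows are treated as half-open intervals [start, end), which is a simplification and may not match how Stream Analytics handles events that land exactly on a boundary.

```python
from typing import List, Tuple


def tumbling_window(ts: int, size: int) -> Tuple[int, int]:
    """Return the single [start, end) window that contains timestamp ts."""
    start = (ts // size) * size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window of length `size`, starting on a
    multiple of `hop`, that contains timestamp ts."""
    windows = []
    start = (ts // hop) * hop  # latest window start that is <= ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


# An event at minute 7 falls in exactly one 5-minute tumbling window...
print(tumbling_window(7, 5))
# ...but in two overlapping 10-minute windows that hop every 5 minutes.
print(hopping_windows(7, 10, 5))
```

With a hop equal to the window size, `hopping_windows` returns a single window, which is why a tumbling window can be seen as a special case of a hopping window.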
Key Differences Between Batch and Streaming
Latency: Batch has high latency (minutes to days); streaming has low latency (seconds to milliseconds).
Data Volume: Batch handles large volumes efficiently; streaming handles continuous, potentially unbounded data.
Complexity: Batch is simpler to implement and debug; streaming requires handling out-of-order events, exactly-once semantics, and checkpointing.
Cost: Batch is often cheaper per unit of data because it uses less infrastructure; streaming requires sustained compute and storage.
Azure Services for Batch Ingestion
Azure Data Factory (ADF): Orchestrates batch pipelines with over 100 connectors. Supports scheduling, triggers, and tumbling windows.
Azure Synapse Pipelines: Built on ADF, optimized for Synapse Analytics.
Azure Databricks: Can process batch data using Spark DataFrames.
Azure Logic Apps: For simpler, event-driven batch flows.
Azure Services for Streaming Ingestion
Azure Event Hubs: A highly scalable event ingestion service capable of ingesting millions of events per second. Supports AMQP, HTTPS, and Kafka protocol.
Azure IoT Hub: A managed service for bidirectional communication with IoT devices.
Azure Stream Analytics: A serverless stream processing engine that can run queries on streams from Event Hubs, IoT Hub, or Blob Storage.
Azure Data Explorer (ADX): Optimized for real-time analytics on streaming data.
Azure Functions: Can be triggered by Event Hubs for custom processing.
Configuration and Verification Commands
Azure Data Factory (batch):
- Create a pipeline with a schedule trigger using Azure CLI:
az datafactory pipeline create --resource-group MyRG --factory-name MyADF --name MyPipeline --pipeline @pipeline.json
az datafactory trigger create --resource-group MyRG --factory-name MyADF --trigger-name HourlyTrigger --properties @trigger.json
- Trigger JSON example:
{
"properties": {
"type": "ScheduleTrigger",
"typeProperties": {
"recurrence": {
"frequency": "Hour",
"interval": 1,
"startTime": "2024-01-01T00:00:00Z"
}
}
}
}
Azure Event Hubs (streaming):
- Create an Event Hubs namespace and event hub using Azure CLI:
az eventhubs namespace create --name MyNamespace --resource-group MyRG --location eastus
az eventhubs eventhub create --name MyEventHub --namespace-name MyNamespace --resource-group MyRG --message-retention 1 --partition-count 4
- Verify that the event hub was created and is active:
az eventhubs eventhub show --name MyEventHub --namespace-name MyNamespace --resource-group MyRG --query "status" --output tsv
Azure Stream Analytics:
- Create a job and start it:
az stream-analytics job create --resource-group MyRG --job-name MyJob --location eastus --output-error-policy Drop --events-outoforder-policy Adjust --events-outoforder-max-delay 5
az stream-analytics job start --resource-group MyRG --job-name MyJob --output-start-mode JobStartTime
How Batch and Streaming Interact
Modern architectures often combine both: a streaming layer provides real-time dashboards, while a batch layer processes historical data for deep analytics. This is known as the Lambda Architecture. Azure supports this via Event Hubs for streaming and Blob Storage for batch, with Stream Analytics for real-time and ADF for batch. The Kappa Architecture simplifies by using a single streaming platform for both real-time and batch-like processing (e.g., by replaying streams).
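The Lambda Architecture described above can be sketched as three small pieces: a batch view, a speed view, and a serving step that merges them. This is a conceptual Python sketch under simplifying assumptions: events are just product names, both layers only count them, and the class and function names are hypothetical rather than part of any Azure SDK.

```python
from collections import Counter
from typing import Dict, List


def batch_view(historical_events: List[str]) -> Counter:
    """Batch layer: recompute totals from the full historical dataset."""
    return Counter(historical_events)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_event(self, product: str) -> None:
        self.counts[product] += 1


def serve(batch: Counter, speed: Counter) -> Dict[str, int]:
    """Serving layer: merge both views to answer a query."""
    return dict(batch + speed)


# Last night's batch run covered three historical orders.
batch = batch_view(["laptop", "phone", "laptop"])

# Two more orders have streamed in since then.
speed = SpeedLayer()
speed.on_event("laptop")
speed.on_event("tablet")

combined = serve(batch, speed.counts)
print(combined)
```

In a Kappa-style design the `batch_view` function would disappear: the historical totals would be rebuilt by replaying the stored events through `SpeedLayer.on_event`.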
Exam-Relevant Details
Batch processing is typically used for data warehousing, historical reporting, and ETL jobs.
Streaming is used for real-time monitoring, fraud detection, and IoT.
Azure Data Factory is the primary batch orchestration service in Azure.
Azure Event Hubs is the primary event ingestion service. In the standard tier, capacity is purchased in throughput units, each allowing up to 1 MB per second (or 1,000 events per second) of ingress; the premium and dedicated tiers offer higher throughput.
Azure Stream Analytics supports SQL-like queries and can output to Power BI, Azure SQL, Blob Storage, and more.
The default message retention in Event Hubs is 1 day, with a maximum of 7 days (standard tier).
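To handle late or out-of-order events in streaming, configure the Stream Analytics event ordering policy (Adjust or Drop). The separate output error policy (Stop or Drop) controls what happens when an event cannot be written to the output.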
Trap Patterns
Confusing batch and streaming: A common wrong answer is to use Stream Analytics for a nightly report. Nightly reports are batch, so ADF or Synapse Pipelines are correct.
Overlooking cost: Batch is cheaper for large volumes; streaming costs more due to continuous processing.
Choosing Event Hubs for batch: Event Hubs is for streaming; Blob Storage is for batch.
Ignoring latency requirements: If the scenario says "real-time dashboard," the answer must involve streaming services (Event Hubs + Stream Analytics).
Define Ingestion Requirements
Identify the data source (e.g., IoT sensors, transaction logs, CRM exports) and the target (e.g., Azure SQL Database, Data Lake). Determine the required latency: if data must be available within seconds, choose streaming; if minutes to hours are acceptable, batch may suffice. Also assess data volume (MB/s vs TB/day) and budget. This step influences all subsequent choices.
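The latency rule in this step can be written down as a tiny decision helper. This is only a sketch: the function name is hypothetical and the 60-second cutoff is an assumption chosen to separate "seconds" from "minutes to hours", not an Azure or exam-defined threshold.

```python
def choose_pattern(max_latency_seconds: float) -> str:
    """Suggest an ingestion pattern from the longest acceptable delay.

    The 60-second cutoff is an assumed boundary for illustration only.
    """
    return "streaming" if max_latency_seconds < 60 else "batch"


print(choose_pattern(2))         # data needed within seconds
print(choose_pattern(6 * 3600))  # a six-hour delay is acceptable
```

Volume and budget would refine the choice in practice; latency is simply the first filter to apply.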
Select Ingestion Service
For batch, choose Azure Data Factory (ADF) if you need complex orchestration, or Azure Synapse Pipelines for Synapse-native solutions. For streaming, choose Azure Event Hubs for high-throughput event ingestion, IoT Hub for device-specific scenarios, or Azure Data Explorer for real-time analytics. Consider the number of connectors and integration with existing systems.
Configure Data Collection
For batch, set up a landing zone (e.g., Azure Blob Storage container) with a folder structure partitioned by date. For streaming, create an Event Hubs namespace with an appropriate partition count (default 1, but scale to 16 or 32 for high throughput). Set message retention (default 1 day). Configure source-specific adapters (e.g., Azure IoT Hub for devices, Kafka clients for Apache Kafka sources).
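Two ideas from this step lend themselves to a short sketch: building a date-partitioned folder path for a batch landing zone, and spreading streaming events across partitions by key. The container and dataset names are hypothetical, and the CRC32-based mapping is an illustrative stand-in, since Event Hubs uses its own internal hash to assign partition keys to partitions.

```python
import zlib
from datetime import datetime, timezone


def landing_path(container: str, dataset: str, when: datetime) -> str:
    """Build a date-partitioned folder path for a batch landing zone."""
    return f"{container}/{dataset}/{when:%Y/%m/%d}/"


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index with a stable hash.

    Illustrative only: the real service applies its own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % partition_count


path = landing_path("landing", "orders", datetime(2024, 1, 15, tzinfo=timezone.utc))
print(path)

# Events with the same key always map to the same partition,
# which preserves their relative order within that partition.
print(partition_for("device-42", 4), partition_for("device-42", 4))
```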
Implement Processing Logic
For batch, use ADF mapping data flows or Databricks notebooks to transform data. Schedule the pipeline with a tumbling window trigger (e.g., every 5 minutes). For streaming, write a Stream Analytics query using SQL-like syntax. Define windows (tumbling, hopping, or session) for aggregations. For example, a 5-minute tumbling window on temperature readings computes average per window. Test with sample data.
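The 5-minute tumbling-window average mentioned above can be reproduced offline in Python, which is one way to check expected results before testing a query with sample data. This is a sketch, not Stream Analytics code: timestamps are seconds since an arbitrary start, the sample readings are made up, and windows are half-open intervals keyed by their start time.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

WINDOW_SECONDS = 300  # 5-minute tumbling window


def average_per_window(readings: List[Tuple[int, float]]) -> Dict[int, float]:
    """Group (timestamp_seconds, temperature) pairs into tumbling windows
    and return the average temperature per window start."""
    buckets = defaultdict(list)
    for ts, temperature in readings:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        buckets[window_start].append(temperature)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Hypothetical sample readings: (seconds, degrees Celsius).
readings = [(10, 20.0), (200, 22.0), (310, 30.0), (590, 34.0), (600, 40.0)]
result = average_per_window(readings)
print(result)
```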
Monitor and Optimize
For batch, monitor pipeline runs in the ADF monitoring view and check for failures and long run times. For streaming, monitor Event Hubs metrics (incoming messages, throttled requests) and Stream Analytics job metrics (watermark delay, input events). Adjust partition count, retention, and streaming unit count (as a rough guideline, 1 SU handles about 1 MB/s of input) to meet SLAs. Enable diagnostic logs for troubleshooting.
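When throttled requests show up in the metrics, a quick capacity estimate helps decide how far to scale. The sketch below assumes the Event Hubs standard-tier ingress limit of 1 MB per second or 1,000 events per second per throughput unit, whichever is reached first; the function name is hypothetical and 1 MB is treated as 1,000,000 bytes for simplicity.

```python
import math


def throughput_units_needed(events_per_second: float, avg_event_bytes: float) -> int:
    """Estimate standard-tier throughput units for a given ingress load.

    Assumes each unit allows 1 MB/s or 1,000 events/s of ingress,
    whichever limit is hit first.
    """
    mb_per_second = events_per_second * avg_event_bytes / 1_000_000
    by_bandwidth = math.ceil(mb_per_second / 1.0)
    by_event_rate = math.ceil(events_per_second / 1000)
    return max(1, by_bandwidth, by_event_rate)


# 10,000 small events per second: the event-rate limit dominates.
print(throughput_units_needed(10_000, 200))
# 500 large events per second: the bandwidth limit dominates.
print(throughput_units_needed(500, 4000))
```

The estimate covers ingress only; readers of the stream and headroom for bursts would push the real number higher.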
Scenario 1: E-commerce Order Processing
An online retailer needs to process orders for real-time inventory updates and nightly sales reports. The streaming path: when a customer places an order, the web app sends an event to Azure Event Hubs. Azure Stream Analytics reads the stream, updates a Power BI dashboard showing live order counts, and writes each order to Azure SQL Database for immediate fulfillment. The batch path: every night at 2 AM, Azure Data Factory runs a pipeline that extracts orders from the SQL DB, joins with product data from Blob Storage, and loads aggregated sales into Azure Synapse Analytics for historical analysis. This hybrid approach ensures real-time visibility for operations while enabling deep analytics. Common issues: if the Event Hubs partition count is too low (e.g., 1 partition for 1000 orders/sec), throttling occurs. Solution: increase partitions to 4 or 8 based on throughput needs. Also, ensure Stream Analytics output to Power BI uses a dedicated capacity to avoid throttling.
Scenario 2: IoT Sensor Telemetry
A manufacturing plant collects temperature and vibration data from 10,000 sensors every second. Each sensor sends a JSON message to Azure IoT Hub. The message includes device ID, timestamp, and readings. Azure Stream Analytics processes the stream to detect anomalies (e.g., temperature > 80°C) and sends alerts via Azure Functions to an operations team. Simultaneously, the raw data is archived to Azure Blob Storage using a Stream Analytics output to Blob (with a 5-minute window). For batch analysis, Azure Data Factory runs a daily job to load the archived data into Azure Data Lake Storage Gen2 for machine learning model training. Key considerations: Stream Analytics reads from IoT Hub through its built-in Event Hubs-compatible endpoint, whose partition count is fixed when the hub is created, so size it for the expected device load. Default retention is 1 day, but for compliance it may be extended to 7 days. Retention also bounds recovery: if the Stream Analytics job is stopped or failing for longer than the retention period, events that were never processed are lost. To mitigate, archive the raw events to Blob Storage independently of the job, for example with IoT Hub message routing to a storage endpoint or, when ingesting through Event Hubs, with Event Hubs Capture.
Scenario 3: Financial Transaction Logs
A bank must process millions of credit card transactions daily for fraud detection and monthly reporting. The streaming pipeline uses Azure Event Hubs to ingest transactions in real-time. Azure Stream Analytics runs a query that flags transactions exceeding a 3-standard-deviation threshold compared to the customer's historical average. These flags are sent to an Azure Function that triggers an SMS alert. The batch pipeline runs every morning at 3 AM: Azure Data Factory extracts the previous day's transactions from the Event Hubs Capture (stored in Blob), performs complex joins with customer data, and loads the results into Azure SQL Data Warehouse (now Azure Synapse) for risk analysis. A common pitfall is assuming that Stream Analytics can handle all processing; however, large historical joins are better suited for batch. Also, Event Hubs Capture must be enabled before data flows, or historical data will be missing.
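The 3-standard-deviation rule in this scenario is easy to state in code. The following Python sketch shows the arithmetic only; it is not how a Stream Analytics query would be written, the function name and sample amounts are hypothetical, and it flags only amounts above the customer's average, which is one reasonable reading of "exceeding".

```python
from statistics import mean, stdev
from typing import List


def is_suspicious(history: List[float], amount: float, threshold: float = 3.0) -> bool:
    """Flag a transaction that is more than `threshold` sample standard
    deviations above the customer's historical average."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return amount > average
    return amount - average > threshold * spread


# Hypothetical past purchases for one customer.
history = [20.0, 22.0, 19.0, 21.0, 18.0]
print(is_suspicious(history, 23.0))   # within normal variation
print(is_suspicious(history, 500.0))  # far outside it
```

Keeping each customer's average and spread up to date requires joining against history, which is the kind of work the scenario assigns to the batch pipeline.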
What DP-900 Tests on This Topic
Objective 1.4: 'Describe core data concepts' includes recognizing batch and streaming processing modes. The exam expects you to:
Identify batch processing scenarios (e.g., daily reports, ETL, data warehousing).
Identify streaming processing scenarios (e.g., real-time dashboards, IoT, fraud detection).
Match Azure services to patterns: Azure Data Factory (batch), Azure Event Hubs (streaming), Azure Stream Analytics (streaming), Azure Synapse Pipelines (batch).
Understand that batch is scheduled, streaming is continuous.
Know that batch handles bounded data, streaming handles unbounded data.
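The bounded versus unbounded distinction can be shown with a list and a generator. This Python sketch is purely conceptual and all names in it are hypothetical: a bounded dataset can be summed once for a final answer, while an endless stream can only produce running results.

```python
from itertools import count, islice
from typing import Iterable, Iterator, List


def batch_total(dataset: List[float]) -> float:
    """Bounded data: the whole dataset is available, so one pass gives a final answer."""
    return sum(dataset)


def running_totals(stream: Iterable[float]) -> Iterator[float]:
    """Unbounded data: emit an updated result after every event,
    because the stream has no end to wait for."""
    total = 0.0
    for value in stream:
        total += value
        yield total


def sensor_stream() -> Iterator[float]:
    """Hypothetical endless source: yields 1.0, 2.0, 3.0, ... forever."""
    for n in count(1):
        yield float(n)


print(batch_total([1.0, 2.0, 3.0]))

# Only a slice of the stream can ever be observed.
first_three = list(islice(running_totals(sensor_stream()), 3))
print(first_three)
```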
Common Wrong Answers and Why
Using Azure Stream Analytics for a nightly batch job: Candidates see 'data ingestion' and pick Stream Analytics because it's a data service. However, nightly jobs are batch; Stream Analytics is only for streaming. Correct: Azure Data Factory.
Choosing Azure Event Hubs for batch data ingestion: Event Hubs is for real-time event ingestion, not for bulk uploads. Batch uses Blob Storage or Data Lake as a landing zone.
Confusing Azure Data Factory with Azure Stream Analytics: ADF is for orchestration of batch pipelines; ASA is for stream processing. The exam may ask 'which service processes data in real-time?' — answer Stream Analytics.
Thinking batch is faster than streaming: Batch is slower by nature. If a question mentions 'low latency,' streaming is correct.
Specific Numbers and Terms That Appear on the Exam
Default Event Hubs message retention: 1 day (standard tier), max 7 days.
Event Hubs partition count: default 1, configurable up to 32 (standard tier) or 200 (dedicated).
Stream Analytics windows: there is no default size; the duration is set in the query, and a 5-minute tumbling window is a common example.
Azure Data Factory trigger frequency: minimum 1 minute.
Batch vs streaming: batch processes 'bounded data,' streaming processes 'unbounded data.'
Key phrases: 'real-time' and 'near real-time' both indicate streaming on the exam, even though few systems are strictly real-time.
Edge Cases and Exceptions
Hybrid scenarios: The exam may describe a system that uses both batch and streaming. For example, 'ingest sensor data in real-time and also generate daily summaries.' The correct answer may involve both Event Hubs and Azure Data Factory.
Streaming with batch-like behavior: Azure Stream Analytics can output to Blob Storage in windowed batches (e.g., every 5 minutes). This is still streaming because the input is continuous.
Batch with streaming-like triggers: ADF can be triggered by an event (e.g., new file arrives). This is still batch processing because the job processes a batch of data (the file).
How to Eliminate Wrong Answers
If the scenario mentions 'scheduled' or 'daily' or 'every hour' → batch.
If the scenario mentions 'real-time,' 'continuous,' 'stream,' or 'event' → streaming.
If the service name includes 'Stream' or 'Event' (Stream Analytics, Event Hubs) → streaming.
If the service name includes 'Data Factory' or 'Pipeline' → batch.
If the data is 'bounded' (finite set) → batch; if 'unbounded' (infinite) → streaming.
Look for latency keywords: 'seconds' or 'milliseconds' → streaming; 'minutes' or 'hours' → batch.
Batch processing uses scheduled jobs; streaming uses continuous processing.
Azure Data Factory is the primary batch orchestration service in Azure.
Azure Event Hubs and IoT Hub are the primary event ingestion services for streaming.
Azure Stream Analytics is the primary stream processing service in Azure.
Batch is for bounded data; streaming is for unbounded data.
Default Event Hubs message retention is 1 day (max 7 days).
Stream Analytics window sizes are set explicitly in the query; a 5-minute tumbling window is a typical example.
Choose batch for cost-sensitive, latency-tolerant workloads; choose streaming for real-time needs.
These come up on the exam all the time. Here's how to tell them apart.
Batch Processing
Processes data in discrete chunks at scheduled intervals.
Higher latency (minutes to days).
Handles bounded data (finite set).
Simpler to implement and debug.
Lower cost per unit data processed.
Streaming Processing
Processes data continuously as it arrives.
Low latency (seconds to milliseconds).
Handles unbounded data (infinite stream).
More complex, requires handling out-of-order events.
Higher cost due to sustained compute.
Mistake
Batch processing always means processing data in 24-hour intervals.
Correct
Batch processing can run at any interval, from every minute to monthly. The key is that data is collected and processed in discrete chunks, not continuously. Azure Data Factory supports triggers as frequent as every 1 minute.
Mistake
Streaming processing is always faster than batch processing.
Correct
Streaming processing has lower latency (seconds to milliseconds), but it does not necessarily have higher throughput. Batch can process massive volumes more efficiently per unit time. Streaming is designed for low latency, not high throughput per job.
Mistake
Azure Event Hubs can be used for batch data ingestion.
Correct
Event Hubs is a streaming event ingestion service. It expects continuous, real-time data streams. For batch ingestion, use Azure Blob Storage or Azure Data Lake Storage as a landing zone, then process with Azure Data Factory.
Mistake
Azure Stream Analytics can only process streaming data from Event Hubs.
Correct
Stream Analytics can also ingest data from IoT Hub, Blob Storage (as a stream of new blobs), and Event Hubs. It can output to multiple sinks including Power BI, SQL Database, and Blob Storage.
Mistake
Batch processing cannot handle real-time data.
Correct
Batch processing inherently introduces latency because it waits for a time window to elapse. However, if the batch interval is very short (e.g., 1 minute), it can approximate real-time, but it is still batch. True real-time requires streaming.
Batch ingestion processes data in large, discrete chunks at scheduled intervals (e.g., hourly, daily). Streaming ingestion processes data continuously as it arrives, enabling real-time or near-real-time analytics. In Azure, batch uses services like Azure Data Factory, while streaming uses Azure Event Hubs and Azure Stream Analytics.
Use Azure Data Factory for batch orchestration—scheduled pipelines that move and transform data in bulk. Use Azure Stream Analytics for real-time processing of streaming data from Event Hubs or IoT Hub. If you need a nightly report, choose Data Factory; if you need a live dashboard, choose Stream Analytics.
Azure Stream Analytics is designed for streaming data. It can process data from Blob Storage if new blobs are continuously arriving, but it is not intended for one-time batch jobs. For batch processing, use Azure Data Factory or Azure Databricks.
The default message retention period is 1 day. For standard tier, you can configure it up to 7 days. For dedicated tier, retention can be up to 90 days. Retention determines how far back you can replay events.
A tumbling window is a time-based window that does not overlap. For example, a 5-minute tumbling window groups events into non-overlapping 5-minute chunks. It is useful for aggregations like counting events per 5-minute interval. Stream Analytics has no default window size; the duration is always specified in the query.
Azure Data Factory is primarily a batch orchestration service. It can be triggered by events (e.g., new file arrival) but still processes data in batches (the file). For true real-time ingestion, use Azure Event Hubs with Azure Stream Analytics.
Bounded data is a finite dataset that has a defined start and end (e.g., a CSV file). Unbounded data is a continuous stream with no fixed end (e.g., sensor readings). Batch processing handles bounded data; streaming handles unbounded data.