This chapter covers Event Hubs Capture, a feature that automatically writes streaming events from an Event Hub to Azure Blob Storage or Azure Data Lake Storage in Avro format. It is essential for event sourcing, auditing, and long-term retention scenarios. On the AZ-204 exam, this topic appears in about 5-7% of questions, typically integrated with questions about Event Hubs, Azure Storage, and data pipelines. You will need to know how to enable Capture, configure its settings, understand the file naming and partitioning, and handle common pitfalls like timeouts and file size thresholds.
Imagine an airline's flight data recorder (black box) on a commercial jet. The recorder continuously writes a stream of telemetry data (altitude, speed, heading, engine parameters) onto a loop of magnetic tape. The tape has a fixed capacity; once full, it overwrites the oldest data. However, the airline wants to archive all flights permanently for post-incident analysis. They install a secondary system: a capture device that reads the same stream and writes it to a durable, long-term storage medium (like a solid-state drive) in Avro format. The capture device runs continuously, writing files in 15-minute chunks or whenever 100 MB of data accumulates, whichever comes first. If the plane loses power, the capture device stops but the last file is finalized. Back on the ground, analysts query the archived data using standard tools. This mirrors Event Hubs Capture: the Event Hub is the looped buffer (with retention up to 7 days on the Standard tier), and Capture writes the same events to Azure Blob Storage or Azure Data Lake Storage in Avro format, partitioned by time and entity. A file is written when a time window (e.g., 15 minutes) expires or when a size threshold (e.g., 100 MB) is hit. Capture operates in parallel with the live stream and does not block consumers. If the Event Hub is deleted, the captured data survives in storage.
What is Event Hubs Capture?
Event Hubs Capture is a built-in feature of Azure Event Hubs that automatically persists the streaming data (events) from an Event Hub to a durable storage destination: either Azure Blob Storage or Azure Data Lake Storage Gen2 (ADLS Gen2). It operates as a sink that runs in parallel with the event stream, meaning it does not block or slow down the ingestion path. Capture is designed for scenarios where you need to retain events beyond the Event Hub's retention window (up to 7 days on the Standard tier, and up to 90 days on Premium and Dedicated), such as for event sourcing, auditing, batch analytics, or machine learning training.
How Capture Works Internally
When Capture is enabled on an Event Hub, the Event Hubs service internally reads events from the same partition stream that consumers see. It buffers events until either a time window or a size threshold is reached, then flushes the buffer to a file in Avro format in the configured storage account. The process is:
1. Events arrive at the Event Hub and are stored in the partition's log (the underlying storage).
2. A Capture background process reads from the same log, starting from the earliest available event.
3. It accumulates events into an in-memory buffer. The buffer is flushed when:
- The time window (default 15 minutes, configurable from 1 to 15 minutes) expires, OR
- The size threshold (default 100 MB, configurable from 10 MB to 500 MB) is reached, whichever happens first.
4. The flushed data is written to a single Avro file in the configured storage container. The file path includes the Event Hub namespace, Event Hub name, partition ID, and timestamp (year, month, day, hour, minute, second).
5. After writing, the Capture process updates its checkpoint so it knows where to resume.
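The "whichever happens first" rule in step 3 can be sketched as a small decision function. This is only a toy model of the rule described above, not the service's actual implementation; the function name flush_reason and its parameters are made up for illustration, and the defaults mirror the values quoted in this chapter.

```python
from typing import Optional

MB = 1024 * 1024


def flush_reason(elapsed_seconds: float,
                 buffered_bytes: int,
                 window_seconds: int = 15 * 60,
                 size_limit_bytes: int = 100 * MB) -> Optional[str]:
    """Return why a partition buffer should flush, or None to keep buffering.

    Hypothetical helper: it only models the rule "time window OR size
    threshold, whichever comes first" for a single partition.
    """
    if buffered_bytes >= size_limit_bytes:
        return "size"   # size threshold reached before the window expired
    if elapsed_seconds >= window_seconds:
        return "time"   # time window expired
    return None


print(flush_reason(elapsed_seconds=120, buffered_bytes=100 * MB))  # size
print(flush_reason(elapsed_seconds=900, buffered_bytes=5 * MB))    # time
print(flush_reason(elapsed_seconds=120, buffered_bytes=5 * MB))    # None
```

Because each partition has its own buffer, a check like this would run per partition, which is why files from different partitions can carry different timestamps.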
Key Components and Defaults
Storage Destination: Azure Blob Storage or ADLS Gen2. You must provide an existing storage account and a container (or file system for ADLS Gen2). The container must be created before enabling Capture.
Time Window: 1 to 15 minutes. Default is 15 minutes. This is the maximum time between file writes. If no events arrive during a window, Capture still writes an empty file when the window expires, unless you enable the option to skip empty files.
Size Window: 10 MB to 500 MB. Default is 100 MB. The file is written once the accumulated data reaches this size, even if the time window hasn't expired.
File Format: Avro, a compact binary format that includes a schema. Each Avro file contains a header with the schema and then a series of records (events). The schema includes fields like SequenceNumber, Offset, EnqueuedTimeUtc, SystemProperties, and Properties (application properties).
File Naming Convention: {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
Partitioning: Capture writes one file per partition per flush window. If you have 4 partitions, you may get up to 4 files per flush interval.
Checkpointing: Capture maintains its own checkpoint internally to track the last flushed event offset. It does not interfere with consumer checkpoints.
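To make the file naming convention above concrete, here is a minimal sketch that expands the name format for one flush. The function name capture_blob_path and the sample namespace and hub names are hypothetical, and the zero-padded date and time segments are an assumption of this sketch.

```python
from datetime import datetime, timezone

DEFAULT_NAME_FORMAT = ("{Namespace}/{EventHub}/{PartitionId}/"
                       "{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}")


def capture_blob_path(namespace: str, event_hub: str, partition_id: int,
                      flushed_at: datetime,
                      name_format: str = DEFAULT_NAME_FORMAT) -> str:
    """Expand the archive name format into a blob path (illustrative only)."""
    tokens = {
        "Namespace": namespace,
        "EventHub": event_hub,
        "PartitionId": str(partition_id),
        # Assumption: date and time parts are zero-padded.
        "Year": f"{flushed_at:%Y}",
        "Month": f"{flushed_at:%m}",
        "Day": f"{flushed_at:%d}",
        "Hour": f"{flushed_at:%H}",
        "Minute": f"{flushed_at:%M}",
        "Second": f"{flushed_at:%S}",
    }
    return name_format.format(**tokens) + ".avro"


when = datetime(2024, 3, 5, 9, 7, 2, tzinfo=timezone.utc)
print(capture_blob_path("mynamespace", "myeventhub", 0, when))
# mynamespace/myeventhub/0/2024/03/05/09/07/02.avro
```

With 4 partitions, the same flush instant would produce up to four such paths that differ only in the partition segment.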
Configuration Methods
Capture can be enabled via:
- Azure Portal: When creating or editing an Event Hub, under the 'Capture' blade, toggle 'On', select storage account and container, set time and size windows.
- Azure CLI: Use az eventhubs eventhub create or az eventhubs eventhub update with --enable-capture, --capture-interval, --capture-size-limit, --archive-name-format, etc.
- Azure PowerShell: Use New-AzEventHub or Set-AzEventHub with -CaptureEnabled, -IntervalInSeconds, -SizeLimitInBytes.
- ARM Templates: Include the captureDescription property in the Event Hub resource definition.
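As a rough illustration of the ARM option, the sketch below builds a captureDescription object in Python and prints it as JSON. The property names follow the Microsoft.EventHub ARM schema as best understood here and should be checked against the current template reference; the helper name capture_description and the placeholder resource ID are hypothetical.

```python
import json

MB = 1024 * 1024


def capture_description(storage_account_resource_id: str,
                        blob_container: str,
                        interval_seconds: int = 300,
                        size_limit_bytes: int = 100 * MB) -> dict:
    """Build a captureDescription fragment, validating the documented ranges."""
    if not 60 <= interval_seconds <= 900:
        raise ValueError("time window must be 60-900 seconds (1-15 minutes)")
    if not 10 * MB <= size_limit_bytes <= 500 * MB:
        raise ValueError("size window must be 10-500 MB")
    return {
        "enabled": True,
        "encoding": "Avro",
        "intervalInSeconds": interval_seconds,
        "sizeLimitInBytes": size_limit_bytes,
        "destination": {
            "name": "EventHubArchive.AzureBlockBlob",
            "properties": {
                "storageAccountResourceId": storage_account_resource_id,
                "blobContainer": blob_container,
                "archiveNameFormat": "{Namespace}/{EventHub}/{PartitionId}/"
                                     "{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}",
            },
        },
    }


# Placeholder resource ID; substitute a real storage account ID.
fragment = capture_description("<storage-account-resource-id>", "mycontainer")
print(json.dumps({"captureDescription": fragment}, indent=2))
```

Validating the ranges locally is a design convenience: an out-of-range value fails before a deployment is attempted.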
Example CLI command to enable Capture on an existing Event Hub:
az eventhubs eventhub update \
--resource-group MyResourceGroup \
--namespace-name MyNamespace \
--name MyEventHub \
--enable-capture true \
--capture-interval 300 \
--capture-size-limit 104857600 \
--archive-name-format "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}" \
--destination-name EventHubArchive.AzureBlockBlob \
--storage-account MyStorageAccount \
--blob-container mycontainer
Note: --capture-interval is in seconds (300 = 5 minutes). --capture-size-limit is in bytes (104857600 = 100 MB). --destination-name identifies the Blob Storage capture destination.
Interaction with Related Technologies
Event Hubs + Capture + Azure Stream Analytics: Stream Analytics can read from Event Hubs for real-time processing, while Capture archives the same data for historical analysis.
Event Hubs + Capture + Azure Data Lake Storage: Ideal for big data analytics with engines such as Spark or Hive.
Event Hubs + Capture + Azure Storage Lifecycle Management: You can set lifecycle policies on the storage container to move old Avro files to cool or archive tiers, or delete them after a period.
Event Hubs + Capture + Azure Functions: Use a blob trigger to process each new Avro file as it arrives.
Performance and Scaling
Capture uses internal resources of the Event Hubs service and does not count against your egress quota. On the Standard tier it is billed as a separate hourly charge in proportion to the throughput units (TUs) purchased; on the Premium and Dedicated tiers it is included in the price. In addition, the write operations to storage count against your storage account's transaction limits. For high-throughput scenarios, ensure your storage account can handle the additional write IOPS. Captured data becomes available in storage with a delay of up to one flush interval (seconds to minutes, depending on the configured windows). The default 15-minute window is a good balance between cost and freshness.
Error Handling and Reliability
Capture is designed to be resilient. If a write to storage fails (e.g., network issue, storage account throttling), Capture retries internally. While the destination stays unavailable, the uncaptured events are not lost: they remain in the Event Hub's retention buffer and are captured once storage is reachable again. However, if the retention period expires before they are captured, they are permanently lost. Therefore, set the retention period long enough to accommodate any capture delays (e.g., 1 day). Capture does not guarantee exactly-once delivery to storage; duplicates can occur if the flush succeeds but the checkpoint update fails. Downstream consumers should be idempotent.
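One way to make a downstream consumer idempotent is to drop records it has already seen. The sketch below assumes each record has been decoded into a dict and tagged with a partition_id taken from the blob path (that field is not part of the captured record itself); sequence numbers are unique within a partition, so the pair identifies an event. The function name deduplicate is hypothetical.

```python
from typing import Iterable, List


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Keep the first occurrence of each (partition_id, SequenceNumber) pair."""
    seen = set()
    unique = []
    for record in records:
        # partition_id is an assumed field added by the reader from the file path.
        key = (record["partition_id"], record["SequenceNumber"])
        if key in seen:
            continue  # duplicate from a re-written batch, skip it
        seen.add(key)
        unique.append(record)
    return unique


batch = [
    {"partition_id": "0", "SequenceNumber": 41, "Body": b"a"},
    {"partition_id": "0", "SequenceNumber": 42, "Body": b"b"},
    {"partition_id": "0", "SequenceNumber": 42, "Body": b"b"},  # duplicate
    {"partition_id": "1", "SequenceNumber": 42, "Body": b"c"},  # other partition
]
print(len(deduplicate(batch)))  # 3
```

For large archives an in-memory set would not scale; a real pipeline would more likely rely on a unique key or upsert in the target store.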
Monitoring
You can monitor Capture via Azure Monitor metrics:
- Capture Backlog: Number of bytes that have not yet been captured. A growing backlog indicates Capture is falling behind.
- Captured Messages: Number of messages written to the capture destination.
- Captured Bytes: Number of bytes written to the capture destination.
Use in Event Sourcing
Event sourcing requires an immutable, ordered log of events. Event Hubs provides the ordered log per partition, and Capture persists that log to long-term storage. However, Capture writes files at intervals, so the order across partitions is not guaranteed. For event sourcing, you must consume events from the Event Hub in order per partition, and Capture files can be replayed in partition order. A common pattern is to use Capture as the raw archive and then process the Avro files into a structured event store (e.g., Cosmos DB or SQL) for querying.
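Replaying in partition order amounts to grouping the captured files by partition and sorting each group by the timestamp in its path. The sketch below assumes the default naming convention (namespace, event hub, partition, then year through second); the function name replay_order and the sample paths are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def replay_order(blob_paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group Capture blob paths by partition, oldest file first in each group."""
    groups = defaultdict(list)
    for path in blob_paths:
        stem = path[:-len(".avro")] if path.endswith(".avro") else path
        parts = stem.split("/")
        partition_id = parts[2]                      # third path segment
        timestamp = tuple(int(p) for p in parts[3:9])  # year .. second
        groups[partition_id].append((timestamp, path))
    return {pid: [p for _, p in sorted(items)] for pid, items in groups.items()}


paths = [
    "ns/hub/0/2024/03/05/09/30/00.avro",
    "ns/hub/1/2024/03/05/09/15/00.avro",
    "ns/hub/0/2024/03/05/09/15/00.avro",
]
for partition, files in sorted(replay_order(paths).items()):
    print(partition, files)
```

Ordering is only meaningful inside each partition's list; nothing here establishes an order between events in different partitions.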
Enable Capture on Event Hub
Navigate to the Event Hub in Azure Portal, select the 'Capture' blade, and toggle it on. You must specify an existing storage account and container. Optionally, set the time window (1-15 minutes) and size threshold (10-500 MB). The default is 15 minutes and 100 MB. You can also set the file name format, but the default is recommended. After enabling, Capture starts immediately. It will begin reading events from the earliest available offset in each partition. If the Event Hub already has events, Capture will backfill them (up to the retention limit).
Events flow into Event Hub partitions
Producers send events to the Event Hub. Events are assigned to a partition (either by partition key or round-robin). Each partition maintains an ordered log. Capture internally reads from this log independently of any consumer groups. It uses its own checkpoint to track the last captured offset. The Capture process runs in the background, consuming minimal resources. It does not affect the ingestion throughput or latency for producers.
Capture buffers events until flush condition
The Capture process accumulates events from a partition into an in-memory buffer. Two conditions trigger a flush: the time window expires (default 15 minutes) or the buffer size reaches the threshold (default 100 MB). Whichever happens first causes the buffer to be written to storage. If no events arrive during a window, Capture still writes an empty file at the end of the window unless the option to skip empty files is enabled. The buffer is per partition, so each partition flushes independently. This means files from different partitions may have different timestamps even for the same time window.
Write Avro file to Azure Storage
When a flush is triggered, the buffer is serialized into Avro format and written as a single .avro file to the configured storage container. The file path follows the pattern: `{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro`. The timestamp corresponds to the end of the flush window (rounded to the second). The file includes a header with the Avro schema and then records. Each record contains the event body, system properties (offset, sequence number, enqueued time), and application properties. After successful write, the checkpoint is updated.
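Once an Avro library has decoded a file, each record is available as a mapping with the fields described above. The sketch below shows how such a decoded record might be turned into an application-level event. It assumes the producer sent UTF-8 JSON bodies, which is a property of the producer rather than of Capture; the function name to_event and the sample values are made up for illustration.

```python
import json
from typing import Any, Dict


def to_event(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one decoded Capture record to a simpler event dict.

    Assumption: Body holds UTF-8 encoded JSON (or is None).
    """
    body = record["Body"]
    payload = json.loads(body.decode("utf-8")) if body is not None else None
    return {
        "sequence_number": record["SequenceNumber"],
        "offset": record["Offset"],
        "enqueued_time_utc": record["EnqueuedTimeUtc"],
        "properties": record["Properties"],  # application properties
        "payload": payload,
    }


# Illustrative record, shaped like the fields listed in this chapter.
sample = {
    "SequenceNumber": 42,
    "Offset": "1024",
    "EnqueuedTimeUtc": "3/5/2024 9:07:02 AM",
    "SystemProperties": {},
    "Properties": {"source": "sensor-7"},
    "Body": b'{"temperature": 21.5}',
}
print(to_event(sample)["payload"])  # {'temperature': 21.5}
```

If producers send something other than JSON (for example Protobuf or plain text), only the body-decoding line would change.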
Downstream processing of captured files
Once the Avro file is written, it can be consumed by any tool that supports Avro, such as Azure Data Factory, Azure Databricks, HDInsight, or custom code. Common patterns include: using Azure Functions with a blob trigger to process each new file, running periodic batch jobs with Azure Batch, or using PolyBase in Azure Synapse to query directly. Because the files are in Avro, they preserve the schema and are efficient for compression. You can also apply Azure Storage lifecycle management to move old files to cool or archive tiers to reduce costs.
Scenario 1: Financial Audit Trail
A fintech company processes millions of transactions per hour through Event Hubs. They need to retain all transaction events for 7 years for regulatory compliance. They enable Capture with a 5-minute time window and 50 MB size threshold, writing to Azure Blob Storage. The storage account is configured with immutable blob policies (WORM) to prevent tampering. A separate Azure Function triggers on each new Avro file, deserializes it, and inserts the records into a SQL database for fast querying. The company also uses Azure Policy to enforce that Capture is enabled on all Event Hubs in production.
Common issue: If the storage account is in a different region, latency increases and Capture may fall behind. Solution: Use a storage account in the same region as the Event Hub.
Scenario 2: IoT Telemetry Archival
A manufacturing company collects sensor data from thousands of IoT devices. Each device sends telemetry every second. They use Event Hubs with Capture to archive raw data to ADLS Gen2 for later analysis using Spark. They set the time window to 15 minutes (default) to batch files efficiently. They also set a size window of 200 MB to avoid too many small files. After archiving, they run a monthly Databricks job to aggregate data and delete raw files older than 90 days.
Common issue: Too many small files (e.g., if the time window is too short and data volume is low). Solution: Increase the time window or size threshold to create fewer, larger files.
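The effect of the time window on file count is simple arithmetic: each partition can produce one file per window. The sketch below computes that time-triggered count per day; the function name is hypothetical, and size-triggered flushes under heavy load would add files on top of this figure.

```python
def time_triggered_files_per_day(partitions: int, window_minutes: int) -> int:
    """Files per day produced by time-window flushes alone (one per partition per window)."""
    if not 1 <= window_minutes <= 15:
        raise ValueError("time window must be between 1 and 15 minutes")
    windows_per_day = (24 * 60) // window_minutes
    return partitions * windows_per_day


print(time_triggered_files_per_day(partitions=4, window_minutes=1))   # 5760
print(time_triggered_files_per_day(partitions=4, window_minutes=15))  # 384
```

Moving from a 1-minute to a 15-minute window cuts the file count by a factor of 15 for the same data, which is why a longer window helps low-volume hubs.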
Scenario 3: Streaming ETL with Real-time and Batch
A media company ingests clickstream data from websites. They use Event Hubs for real-time processing with Azure Stream Analytics to update dashboards. At the same time, Capture writes the same data to Blob Storage. Every hour, a Data Factory pipeline copies the latest Avro files to a staging area, transforms them into Parquet, and loads them into Azure Synapse for historical reporting. They set the Capture time window to 10 minutes to balance between data freshness and file count.
Common pitfall: Forgetting to set the retention period on the Event Hub long enough to cover any Capture downtime. If Capture fails for an hour and retention is only 1 hour, events older than 1 hour are lost. Best practice: Set retention to at least 24 hours.
What AZ-204 Tests on Event Hubs Capture
Objective 5.3: Implement event processing. The exam expects you to know:
How to enable Capture via Azure Portal, CLI, PowerShell, and ARM templates (multiple-choice and scenario-based).
The default time window (15 minutes) and size threshold (100 MB), and that they are configurable.
That Capture writes Avro files to Blob Storage or ADLS Gen2.
The file naming convention (Namespace/EventHub/PartitionId/Year/Month/Day/Hour/Minute/Second).
That Capture does not affect the real-time ingestion path.
That Capture can be used for event sourcing and long-term retention.
Common Wrong Answers
"Capture writes to Azure Table Storage": Wrong. Capture only writes to Blob Storage or ADLS Gen2.
"Capture uses JSON format": Wrong. It uses Avro, not JSON.
"Capture must be enabled at namespace level": Wrong. Capture is enabled per Event Hub, not per namespace.
"Capture blocks consumers until the file is written": Wrong. Capture runs in parallel and does not block.
"The minimum time window is 5 minutes": Wrong. The minimum is 1 minute; the default and the maximum are both 15 minutes.
Exam Numbers and Values
Time window range: 1 to 15 minutes.
Size window range: 10 MB to 500 MB.
Default time: 15 minutes.
Default size: 100 MB.
File format: Avro.
Storage: Azure Blob Storage or Azure Data Lake Storage Gen2.
Retention of Event Hub: up to 7 days on the Standard tier (up to 90 days on Premium and Dedicated). Capture extends retention beyond that.
Edge Cases
If Capture is enabled on an Event Hub that already has events, it will start capturing from the oldest available event (based on retention).
If the storage account is deleted, Capture will stop and log errors. Events remain in the Event Hub until retention expires.
If the container does not exist, Capture will fail. The container must be created beforehand.
Capture does not support writing to ADLS Gen1 (only Gen2).
Eliminating Wrong Answers
When you see a question about archiving Event Hubs data, immediately think of Capture. If the answer mentions using a separate consumer group or a worker role to write to storage, it's likely wrong because Capture is built-in. Also, if the answer says "the data is stored in JSON" or "Parquet", eliminate it — it's Avro. If the answer says "Capture writes to a secondary Event Hub", that's wrong.
Event Hubs Capture automatically writes events to Azure Blob Storage or ADLS Gen2 in Avro format.
Default time window is 15 minutes (range 1-15 min); default size window is 100 MB (range 10-500 MB).
Capture is enabled per Event Hub, not per namespace.
Capture does not block real-time ingestion or consume throughput units.
The file naming convention includes namespace, event hub name, partition ID, and timestamp.
Avro files contain a schema header and event records with system and application properties.
Capture is ideal for event sourcing, long-term retention, and batch analytics.
If the storage account is unavailable, Capture retries; events stay in the Event Hub until they are captured or the retention period expires.
These come up on the exam all the time. Here's how to tell them apart.
Event Hubs Capture
Built-in, no additional code required
Writes Avro files automatically
Does not consume throughput units
Manages checkpoints internally
Supports both Blob and ADLS Gen2
Custom Consumer with Blob Storage
Requires custom consumer implementation
Can write any format (JSON, CSV, Parquet)
Consumes throughput units or processing units
Must manage checkpoints manually
More flexible but higher development effort
Mistake
Event Hubs Capture writes data to Azure Table Storage.
Correct
Capture only writes to Azure Blob Storage or Azure Data Lake Storage Gen2. It does not support Table Storage or Queue Storage.
Mistake
Capture uses JSON format for the output files.
Correct
Capture uses Avro format, which is a compact binary format with a schema. This is a common exam trap.
Mistake
Capture can be enabled only at the Event Hubs namespace level.
Correct
Capture is configured per Event Hub, not per namespace. Each Event Hub within a namespace can have its own Capture settings.
Mistake
Capture blocks event ingestion while it writes files to storage.
Correct
Capture runs as a background process and does not block or slow down the real-time ingestion. Producers and consumers continue unaffected.
Mistake
The minimum time window for Capture is 5 minutes.
Correct
The minimum time window is 1 minute. The default is 15 minutes. The maximum is 15 minutes.
No. Capture supports only Azure Blob Storage and Azure Data Lake Storage Gen2. ADLS Gen1 is not supported. If you need to archive to Gen1, you must use a custom consumer.
Capture will start reading from the oldest available event in each partition, up to the retention limit. It will backfill all existing events that are still within the retention window. This is useful for retroactive archiving.
No. Capture provides at-least-once delivery. Duplicates can occur if the file is written but the checkpoint update fails. Downstream consumers should be idempotent.
Yes. You can update the time window (1-15 minutes) and size threshold (10-500 MB) at any time via portal, CLI, or PowerShell. The changes take effect immediately for subsequent flushes.
On the Standard tier, Capture is billed as an additional hourly charge based on the throughput units purchased; on the Premium and Dedicated tiers it is included in the price. In every tier you also pay for the storage used (Blob or ADLS Gen2) and the write operations (transactions) against that storage. The number of writes depends on the flush frequency.
You can use any tool that supports Avro, such as Azure Data Factory, Azure Databricks, HDInsight, or custom .NET/Java code with Avro libraries. Azure Functions with a blob trigger is a common serverless option.
Yes. If the Event Hub is in a VNet, you need to ensure the storage account can be accessed from the VNet. You may need to configure firewall rules or service endpoints.