This chapter covers the delivery and retry logic for webhooks in Azure Event Grid, a critical topic for the AZ-204 exam. You will learn the exact mechanism Event Grid uses to deliver events to HTTP endpoints, the configurable retry policy, and how dead-lettering works. Expect 5-7% of exam questions to touch on Event Grid delivery, retry schedules, and error handling. Mastery of this topic is essential for building reliable event-driven applications.
Imagine the Azure Event Grid webhook delivery system as a highly automated postal service. You (the publisher) drop off a letter (event) at the post office (Event Grid), which is responsible for delivering it to a specific recipient (webhook endpoint). The postal service has strict rules: the recipient must confirm delivery within 30 seconds, or the carrier assumes the attempt failed. If the recipient answers the door and signs for the letter (any HTTP 2xx response), delivery is complete. If no one answers (timeout) or the recipient refuses (non-2xx status), the carrier (Event Grid's delivery engine) notes the failure and schedules a retry. Retries follow a best-effort schedule with growing gaps: roughly 10 seconds, 30 seconds, 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and then every 12 hours, until the default 24-hour time-to-live runs out. There is also a dead-letter process: if the recipient has moved and left no forwarding address (invalid endpoint), then once retries are exhausted the letter is sent to a special dead letter office (a Blob Storage container) for analysis, provided you configured one. The postal service can also be asked to hold letters for pickup (a Storage Queue destination) instead of delivering to the door. This ensures that even in a storm (transient failure), the letter is eventually delivered or properly archived.
What is Webhook Delivery and Why Does It Exist?
Webhooks are user-defined HTTP callbacks that are triggered by specific events. In Azure Event Grid, when an event occurs (e.g., a blob is created), Event Grid pushes the event to a registered webhook endpoint. Delivery is the process of sending the event to that endpoint and receiving a successful acknowledgment. Retry logic is the mechanism Event Grid uses to re-attempt delivery if the initial attempt fails. This is crucial for building resilient, fault-tolerant event-driven architectures where transient failures (network blips, endpoint overload) must not cause data loss.
How It Works Internally
When Event Grid attempts to deliver an event to a webhook endpoint, it sends an HTTP POST request with a JSON payload containing the event data. The request includes headers like Content-Type: application/json and aeg-event-type: Notification. The endpoint must respond with an HTTP 2xx status code within 30 seconds. Any 2xx response (200, 201, 202, etc.) counts as a successful delivery and no retry occurs. If the endpoint responds with a non-2xx status (e.g., 400, 403, 500, 503) or the request times out (no response within 30 seconds), Event Grid classifies the attempt as a delivery failure and applies the retry policy.
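The success rule described above can be sketched as a small Python function. This is an illustrative model, not Event Grid's own code: the name `classify_attempt` and the constant are hypothetical, and the 30-second value is an assumption taken from Microsoft's delivery documentation.

```python
from typing import Optional

# Assumption: response timeout as stated in Microsoft's delivery documentation.
RESPONSE_TIMEOUT_SECONDS = 30.0


def classify_attempt(status_code: Optional[int],
                     elapsed_seconds: float,
                     timeout_seconds: float = RESPONSE_TIMEOUT_SECONDS) -> str:
    """Return "delivered" or "retry" for one delivery attempt.

    status_code is None when no HTTP response arrived at all
    (for example a DNS failure or a dropped connection).
    """
    # No response, or a response that arrived too late, is a failed attempt.
    if status_code is None or elapsed_seconds > timeout_seconds:
        return "retry"
    # Only the 2xx range counts as success; the response body is ignored.
    if 200 <= status_code <= 299:
        return "delivered"
    # 3xx, 4xx (including 429) and 5xx are all treated as failures.
    return "retry"
```

For instance, `classify_attempt(202, 1.2)` yields `"delivered"`, while `classify_attempt(429, 0.3)` and `classify_attempt(200, 45.0)` both yield `"retry"`.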
Event Grid retries with exponential backoff on a best-effort basis. The documented schedule is 10 seconds, 30 seconds, 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and then every 12 hours up to 24 hours. Event Grid adds a small randomization to each step and may skip retries if the endpoint looks consistently unhealthy, so treat these as approximate intervals rather than a precise timetable. Retrying stops when the event's time-to-live expires or the maximum number of delivery attempts is reached, whichever comes first; the event is then dropped, or sent to a dead-letter destination if one is configured. The default maximum is 30 attempts, configurable through the maxDeliveryAttempts property of the subscription's retry policy (--max-delivery-attempts in Azure CLI). The policy is set per event subscription, and each event is retried independently under it.
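To see how the attempt limit and the TTL interact, the following Python sketch plans attempt times from a list of intervals. It is a simplified model under stated assumptions: the intervals are the best-effort values from Microsoft's retry documentation, the helper name `planned_attempt_times` is hypothetical, and the real service adds randomization and may skip attempts.

```python
from typing import List

# Assumption: best-effort retry intervals in seconds, per Microsoft's
# retry documentation (10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, 12h).
DOCUMENTED_INTERVALS = [10, 30, 60, 300, 600, 1800, 3600, 10800, 21600, 43200]


def planned_attempt_times(max_attempts: int = 30,
                          ttl_minutes: int = 1440,
                          intervals: List[int] = DOCUMENTED_INTERVALS) -> List[int]:
    """Seconds after the first delivery at which each attempt would start.

    Attempt 1 is the initial delivery at t=0. Planning stops when the
    attempt budget or the TTL is exhausted, whichever comes first.
    The time each attempt itself takes is ignored for simplicity.
    """
    ttl_seconds = ttl_minutes * 60
    times = [0]
    elapsed = 0
    retry_index = 0
    while len(times) < max_attempts:
        # After the list runs out, keep repeating the last interval.
        wait = intervals[min(retry_index, len(intervals) - 1)]
        elapsed += wait
        if elapsed > ttl_seconds:
            break  # TTL would expire before this retry
        times.append(elapsed)
        retry_index += 1
    return times
```

Under this model the default settings are bounded by the 24-hour TTL long before 30 attempts are used, which is why lowering the TTL or the attempt count is the practical way to make Event Grid give up sooner.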
Key Components, Values, Defaults, and Timers
HTTP Response Code: Only 2xx codes indicate success. 3xx, 4xx, and 5xx are failures. Specifically, 429 (Too Many Requests) is treated as a failure and triggers retry; Event Grid does not respect the Retry-After header.
Timeout: 30 seconds. If no response arrives within 30 seconds, the attempt is considered failed.
Retry Schedule (default, best effort): 10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, then every 12h up to 24h.
Max Delivery Attempts (maxDeliveryAttempts): Default 30. Configurable 1-30.
Event Time-to-Live (TTL): Default 24 hours. Configurable 1-1440 minutes (24 hours). After TTL expires, the event is dropped or dead-lettered.
Dead-Lettering: Optional. Events that cannot be delivered are written to a Blob Storage container. Configured on the event subscription.
Validation: For webhook endpoints, Event Grid requires a validation handshake before events are delivered. The endpoint must respond to a SubscriptionValidation event with a validation code in the response body.
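The validation handshake can be sketched as a pure Python function that a webhook handler would call before normal processing. This is a minimal sketch, assuming the Event Grid event schema (a JSON array whose validation event carries `data.validationCode`); the function name `validation_response` is hypothetical and no web framework is implied.

```python
import json
from typing import Optional

VALIDATION_EVENT_TYPE = "Microsoft.EventGrid.SubscriptionValidationEvent"


def validation_response(request_body: str) -> Optional[str]:
    """Return the JSON handshake reply, or None for a normal delivery."""
    events = json.loads(request_body)
    # The Event Grid schema posts a JSON array; tolerate a single object too.
    if isinstance(events, dict):
        events = [events]
    for event in events:
        if event.get("eventType") == VALIDATION_EVENT_TYPE:
            # Echo the code back so Event Grid can confirm endpoint ownership.
            code = event["data"]["validationCode"]
            return json.dumps({"validationResponse": code})
    return None
```

A handler would return this string with a 200 status when it is not None, and otherwise go on to process the events as ordinary notifications.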
Configuration and Verification Commands
You can configure retry policy and dead-lettering using Azure CLI, PowerShell, or ARM templates. Example using Azure CLI:
# Create an event subscription with custom retry policy and dead-lettering
az eventgrid event-subscription create \
  --name mySubscription \
  --source-resource-id /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{storage} \
  --endpoint https://mywebhook.azurewebsites.net/api/events \
  --max-delivery-attempts 10 \
  --event-ttl 60 \
  --deadletter-endpoint /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{dlstorage}/blobServices/default/containers/{dlcontainer}
To verify delivery status, you can use Azure Monitor metrics or check the Event Grid diagnostic logs. The DeadLetteredCount metric shows events sent to dead-letter; DeliveryAttemptFailCount shows failed delivery attempts.
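The same settings can also be expressed as the retry and dead-letter portion of an event subscription's ARM properties. The Python sketch below builds that fragment and checks the allowed ranges first. The helper name `build_delivery_settings` is hypothetical, and the ranges (1-30 attempts, 1-1440 minutes) are assumptions taken from Microsoft's retry policy documentation.

```python
def build_delivery_settings(max_attempts: int,
                            ttl_minutes: int,
                            storage_account_id: str,
                            container_name: str) -> dict:
    """Build the retryPolicy / deadLetterDestination fragment of an
    event subscription's properties, validating ranges up front."""
    if not 1 <= max_attempts <= 30:
        raise ValueError("max_attempts must be between 1 and 30")
    if not 1 <= ttl_minutes <= 1440:
        raise ValueError("ttl_minutes must be between 1 and 1440")
    return {
        "retryPolicy": {
            "maxDeliveryAttempts": max_attempts,
            "eventTimeToLiveInMinutes": ttl_minutes,
        },
        "deadLetterDestination": {
            # Dead-lettered events are written to a blob container.
            "endpointType": "StorageBlob",
            "properties": {
                "resourceId": storage_account_id,
                "blobContainerName": container_name,
            },
        },
    }
```

Validating locally gives a clearer error than a failed deployment, although the service remains the authority on which values it accepts.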
Interaction with Related Technologies
Event Grid webhook delivery integrates with:
Azure Functions: Functions can serve as webhook endpoints. The runtime automatically responds with 202 if the function completes successfully.
Logic Apps: Logic Apps can be triggered via webhook. They respond with 200 after processing.
Azure Event Hubs: As an alternative to webhooks, events can be sent to Event Hubs for reliable streaming.
Storage Queues: Instead of a webhook, you can route events to a queue for decoupled processing.
Azure Monitor: Alerts can be triggered on delivery failures or dead-lettered events.
Edge Cases
Throttled Endpoints: If the endpoint returns 429, Event Grid retries but does not back off more than the standard schedule. Consider using dead-lettering for persistent throttling.
Invalid Endpoint: If the endpoint returns 404, Event Grid treats it as a failure and retries. Dead-lettering is recommended.
Validation Handshake: If the endpoint does not complete the validation handshake, Event Grid will not send any events to it. When manual validation is used, the subscription stays in the AwaitingManualAction provisioning state until the validation URL is visited.
1. Event Occurs and Publish
When an event occurs (e.g., a new blob is created in Azure Storage), the source (Storage account) publishes the event to Event Grid. Event Grid receives the event and identifies all subscriptions that match the event type and subject filter. For each matching subscription, Event Grid prepares an HTTP POST request to the registered webhook endpoint. The request body contains the event in the Event Grid event schema by default, or in CloudEvents 1.0 format if the subscription is configured for it. The request includes headers such as `aeg-event-type: Notification`, `aeg-subscription-name`, and `aeg-delivery-count` (starting at 1).
2. Initial Delivery Attempt
Event Grid sends the HTTP POST request to the webhook endpoint. The endpoint must respond with an HTTP 2xx status code within 30 seconds. If it does, Event Grid considers the delivery successful and removes the event from its internal queue. If the endpoint responds with any other status (including 3xx, 4xx, 5xx) or fails to respond within 30 seconds, Event Grid records a delivery failure and increments a retry counter for that event.
3. First Retry (10 seconds)
After the initial failure, Event Grid waits about 10 seconds before retrying. The retry uses the same HTTP POST request with the same event payload. The `aeg-delivery-count` header is incremented to 2. The endpoint must again respond with a 2xx status within 30 seconds. If successful, the event is delivered. If not, Event Grid moves to the next retry interval.
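Because a retry resends the same payload, an endpoint that processed an event but answered too late will see that event again. A common defence is to make the handler idempotent by tracking event ids. The Python sketch below illustrates the idea; the name `process_batch` is hypothetical, and the in-memory set is an assumption that a real service would replace with durable storage.

```python
from typing import Dict, List, Set


def process_batch(events: List[Dict], seen_ids: Set[str]) -> List[str]:
    """Handle each event at most once; return the ids processed in this call.

    seen_ids holds the ids of events already handled and is updated in place.
    """
    processed = []
    for event in events:
        event_id = event["id"]
        if event_id in seen_ids:
            # Redelivery of an event we already handled: acknowledge, do nothing.
            continue
        seen_ids.add(event_id)
        # Real business logic (update inventory, write a record) would go here.
        processed.append(event_id)
    return processed
```

The handler should still return a 2xx status for duplicates, otherwise Event Grid keeps retrying an event that has already been handled.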
4. Subsequent Retries (Exponential Backoff)
Event Grid follows a best-effort schedule: after the first retry at about 10 seconds, the waits grow to 30 seconds, 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and then 12 hours. The maximum number of attempts is set by `maxDeliveryAttempts` (default 30). The event also has a time-to-live (TTL) of 24 hours by default. Whichever limit is reached first ends the retries, and the event is then dropped or dead-lettered. The intervals are not a precise timetable: Event Grid adds a small randomization to each step and may delay or skip attempts to an endpoint that appears unhealthy.
5. Dead-Lettering (Optional)
If dead-lettering is configured, after exhausting all retries (or upon TTL expiry), Event Grid writes the event to a Blob Storage container. The dead-lettered event keeps its original payload and gains delivery metadata such as the dead-letter reason and the number of delivery attempts. This allows for later manual or automated reprocessing. Dead-lettering is configured on the event subscription. The dead-letter destination must be in the same Azure region as the event subscription. Diagnostic logs can be used to monitor dead-lettered events.
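A first step in reprocessing is usually to see why events ended up in the dead-letter container. The Python sketch below summarizes one dead-letter blob by reason. It assumes the blob holds a JSON array of events and that each carries a `deadLetterReason` field; the function name `summarize_dead_letters` and the reason strings in the usage note are illustrative, not an exhaustive list.

```python
import json
from collections import Counter
from typing import Dict


def summarize_dead_letters(blob_text: str) -> Dict[str, int]:
    """Count dead-lettered events per reason for one blob's contents."""
    events = json.loads(blob_text)
    # Events without the field are grouped under "Unknown" instead of failing.
    reasons = Counter(event.get("deadLetterReason", "Unknown") for event in events)
    return dict(reasons)
```

A summary dominated by a reason like `MaxDeliveryAttemptsExceeded` points at a persistently failing endpoint, whereas mostly TTL-related reasons suggest the time-to-live is too short for the endpoint's recovery time.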
Scenario 1: E-commerce Order Processing
A large e-commerce platform uses Event Grid to trigger order fulfillment workflows. When a new order is placed, an event is published and delivered to a webhook endpoint hosted on Azure Functions. The function processes the order, updates inventory, and sends confirmation. To handle transient failures (e.g., database throttling), the retry policy is configured with a maximum of 10 delivery attempts and a TTL of 1 hour. Dead-lettering is enabled to capture any orders that fail after retries, allowing operations to manually re-process them. Without dead-lettering, lost orders could result in revenue loss. The team monitors the DeadLetteredCount metric and sets up alerts for spikes.
Scenario 2: IoT Telemetry Ingestion
An IoT solution ingests telemetry from millions of devices. Events are published to Event Grid and delivered to a webhook endpoint that writes to a time-series database. The endpoint occasionally returns 503 during peak load. The retry policy is set with a maximum of 5 delivery attempts and a TTL of 30 minutes. Because the endpoint is critical, dead-lettering is configured to a Blob Storage container. A separate process reads the dead-lettered blobs and replays events during off-peak hours. Without retry logic, transient failures would cause data gaps. The team also uses diagnostic logs to correlate delivery failures with endpoint CPU spikes.
Scenario 3: Multi-tenant SaaS Application
A SaaS provider uses Event Grid to notify tenants of resource changes. Each tenant registers a webhook endpoint. One tenant's endpoint becomes misconfigured (returns 404). With default settings, Event Grid keeps retrying for up to 24 hours, consuming resources. The provider configures dead-lettering and sets a low maximum of 3 delivery attempts for tenant subscriptions to prevent resource exhaustion. They also rely on the validation handshake to verify endpoints before delivery starts. Misconfigured dead-lettering (e.g., pointing to a non-existent container) causes events to be lost silently, so the provider uses ARM templates to enforce correct configuration.
What AZ-204 Tests on This Topic
AZ-204 objective 5.3 covers "Implement event handling" which includes configuring Event Grid subscriptions, retry policies, and dead-lettering. Exam questions typically present a scenario where an event-driven application must handle transient failures or ensure delivery guarantees. You must know:
The default retry schedule (10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, then every 12h up to 24h)
The default maximum delivery attempts (30) and TTL (24 hours)
That only 2xx responses are considered successful (3xx, 4xx, 5xx are failures)
That dead-lettering requires a Blob Storage container in a storage account and is optional
That validation handshake is required for webhook endpoints
That retry policy is per-subscription, not per-event
Common Wrong Answers and Why
Retry respects Retry-After header: Many candidates think Event Grid respects the Retry-After header from the endpoint. It does not. It uses its own fixed schedule. This is a common trap.
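Retry timing is exact: The documented intervals are best-effort. Event Grid adds a small randomization to each step and may skip retries for an endpoint that appears unhealthy, so do not expect a retry at precisely the listed offset. The exam still expects you to know the documented sequence.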
Dead-lettering is default: Dead-lettering is optional and must be explicitly configured. Without it, events are dropped after TTL expiry.
HTTP 429 is treated as success: 429 is a failure and triggers retry. Candidates often think rate-limiting responses are handled gracefully.
Specific Numbers and Terms
Memorize: default maxDeliveryAttempts = 30, default TTL = 1440 minutes (24 hours), response timeout = 30 seconds. The retry schedule sequence: 10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, then every 12h up to 24h. Valid dead-letter destination: a Blob Storage container. The validation event type is Microsoft.EventGrid.SubscriptionValidationEvent.
Edge Cases
If the endpoint returns 200 but with an invalid body, Event Grid still considers it success. The exam might test that only HTTP status matters.
If the endpoint is unreachable (DNS failure), Event Grid retries. This is a delivery failure.
You can change the retry policy on an existing subscription by updating it; you do not need to delete and recreate it.
Dead-letter destination must be in the same Azure region.
How to Eliminate Wrong Answers
Focus on the mechanism: if the endpoint responds with any non-2xx or times out, it's a failure. If the question says "retry after 5 seconds" — that's wrong because the schedule starts at 10s. If it says "dead-lettering is enabled by default" — that's wrong. Look for phrases like "retry after 1 minute" — that matches the third interval. Use the default values to eliminate options.
Event Grid webhook delivery retries on a best-effort schedule: 10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, then every 12h up to 24h.
Default maxDeliveryAttempts is 30; configurable 1-30. Default TTL is 24 hours (1440 minutes); configurable 1-1440 minutes.
Only HTTP 2xx responses are considered successful. 3xx, 4xx, 5xx, and timeouts are failures.
Event Grid does not respect the Retry-After header from endpoints.
Dead-lettering is optional and must be explicitly configured to a Blob Storage container.
Webhook endpoints must respond within 30 seconds; otherwise, the attempt is counted as a failure.
Validation handshake is required for webhook subscriptions; the endpoint must respond to SubscriptionValidationEvent.
Retry policy and dead-lettering are configured per event subscription, not per event or per topic.
These come up on the exam all the time. Here's how to tell them apart.
Event Grid Webhook Delivery
Delivers events via HTTP POST to a webhook endpoint.
Endpoint must respond with 2xx within 30 seconds.
Retries follow a best-effort exponential backoff schedule.
Failure can result in dead-lettering or dropping.
Requires validation handshake for new endpoints.
Event Grid Queue Storage Delivery
Delivers events by writing to an Azure Storage Queue.
No HTTP endpoint required; queue is polled by consumer.
Retries apply only if the write to the queue fails; once written, messages remain in the queue until read or until their message TTL expires.
No dead-lettering needed; events persist in queue.
No validation handshake; queue is always ready.
Mistake
Event Grid retries at exact, guaranteed intervals.
Correct
Event Grid uses a best-effort exponential backoff schedule: 10s, 30s, 1m, 5m, 10m, 30m, 1h, 3h, 6h, then every 12h up to 24h. It adds a small randomization to each step and may skip retries for an unhealthy endpoint.
Mistake
Only HTTP 200 is treated as success.
Correct
Any 2xx (200, 201, 202, etc.) is considered success. Even 202 Accepted is treated as successful delivery.
Mistake
Dead-lettering is enabled by default.
Correct
Dead-lettering is optional. You must explicitly configure a dead-letter destination (a Blob Storage container) on the event subscription.
Mistake
Event Grid respects the Retry-After header from the endpoint.
Correct
Event Grid does not respect Retry-After. It uses its own fixed retry schedule regardless of the header.
Mistake
The maximum number of delivery attempts is 30 and cannot be changed.
Correct
The default is 30, but it is configurable from 1 to 30. The TTL is also configurable from 1 to 1440 minutes.
The default retry schedule is: 10 seconds, 30 seconds, 1 minute, 5 minutes, 10 minutes, 30 minutes, 1 hour, 3 hours, 6 hours, and then every 12 hours, up to a maximum of 24 hours (TTL). The default maximum number of delivery attempts is 30. The schedule is best-effort, but these values are important for the exam; you need to memorize the sequence.
Event Grid treats HTTP 429 (Too Many Requests) as a delivery failure, just like any other non-2xx status. It does not respect the Retry-After header. The event will be retried according to the fixed schedule. If you need to handle throttling, consider using dead-lettering to capture events after retries are exhausted.
Yes. The retry policy (maxDeliveryAttempts and event TTL) and the dead-letter destination are properties of the event subscription and can be updated on an existing subscription, for example in the Azure portal or through an ARM template deployment. You do not need to delete and recreate the subscription.
Event Grid treats any 3xx response as a failure. It does not follow redirects. The delivery is considered failed and will be retried according to the schedule. The endpoint must respond with 2xx for success.
No, dead-lettering is optional. If not configured, events that fail delivery after exhausting retries or TTL are simply dropped. Dead-lettering is recommended for critical events to prevent data loss.
When you create a webhook subscription, Event Grid sends a validation event (SubscriptionValidationEvent) to the endpoint. The endpoint must respond with the validation code in the response body (e.g., `{"validationResponse": "<code>"}`). If successful, the subscription is activated. If not, the subscription remains in a pending state.
Event Grid is a push-based event routing service with retry logic, ideal for discrete events. Event Hubs is a pull-based streaming platform with high throughput, suitable for large volumes of events. Event Grid can also deliver to Event Hubs as a destination.