This chapter covers SQS Dead Letter Queues (DLQs), a critical feature for building resilient and fault-tolerant message processing systems on AWS. For the SAA-C03 exam, DLQs appear in roughly 5-10% of questions, often as a distractor or as part of a larger solution design. Understanding DLQ mechanics, configuration parameters, and integration patterns is essential to avoid infinite processing loops and to enable systematic error handling in distributed architectures.
Imagine a post office that processes millions of letters daily. Each letter has a destination address, and the post office has multiple sorting machines that route letters to the correct delivery trucks. However, some letters are undeliverable: the address is illegible, the recipient moved, or the package is damaged. Normally, the sorting machine tries to deliver a letter three times. If all attempts fail, the letter is not simply thrown away; instead, it is placed into a special bin labeled "Dead Letters." A supervisor periodically checks this bin to analyze why those letters failed. The post office can configure the number of delivery attempts (redelivery policy) and the maximum time a letter sits in the dead letter bin before it is destroyed (retention period). This system prevents undeliverable letters from clogging the main sorting machines and allows the post office to debug systemic issues, such as a misconfigured sorting machine or a wrong address format. In AWS, SQS Dead Letter Queues work much like this: undeliverable messages are moved to a separate queue for analysis, preventing infinite processing loops and preserving the main queue's performance.
What is an SQS Dead Letter Queue?
An SQS Dead Letter Queue (DLQ) is a standard or FIFO queue that other source queues can target for messages that are not successfully consumed after a specified number of attempts. The DLQ acts as a sink for problematic messages, isolating them from the main processing flow so that they can be analyzed, re-processed, or discarded without blocking the primary queue.
Why DLQs Exist
In a distributed system, messages can fail to process for many reasons: transient network issues, application bugs, corrupt payloads, or downstream service unavailability. Without a DLQ, a consumer might repeatedly attempt to process the same message, leading to wasted compute, increased latency for other messages, and potential infinite loops. DLQs solve this by providing a mechanism to automatically move messages that exceed a configurable maximum receive count to a separate queue. This protects the main queue's performance and allows developers to focus on fixing the root cause without losing messages.
How DLQs Work Internally
When a source queue is configured with a DLQ, the following mechanism governs message movement:
Receive Count Tracking: Each message in an SQS queue has an ApproximateReceiveCount attribute (called ReceiveCount in this chapter) that increments every time a consumer calls ReceiveMessage and retrieves that message. The count is stored per message and persists across visibility timeout expirations.
Maximum Receives Threshold: The source queue has a redrivePolicy that specifies a maxReceiveCount (integer, 1 to 1000). When a message's ReceiveCount exceeds this threshold, SQS automatically moves the message to the DLQ.
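Redrive Permission: The DLQ must be in the same AWS account and Region as the source queue. SQS moves the messages itself, so no sqs:SendMessage grant in a queue policy is required. Optionally, the DLQ's RedriveAllowPolicy attribute restricts which source queues may target it (allowAll, denyAll, or byQueue with up to 10 source queue ARNs).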
Redrive Process: SQS performs the redrive asynchronously. It does not guarantee immediate movement; there may be a delay of a few seconds to minutes. During this time, the message remains in the source queue and can still be received by consumers (if the visibility timeout expires). Once moved, the message is deleted from the source queue and a new message with the same body and attributes is created in the DLQ. The original message ID is preserved, but the ReceiveCount resets in the DLQ.
DLQ Queue Configuration: The DLQ itself is a normal SQS queue that can be standard or FIFO. It has its own retention period (default 4 days, max 14 days), and messages in the DLQ can be consumed, deleted, or re-driven back to the source queue (using the SQS DLQ redrive feature or custom scripts).
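The receive-count mechanism above can be sketched as a small in-memory simulation. This is a simplified model for intuition only, not how SQS is implemented: the `Message` class and `run_failing_consumer` function are hypothetical names, the visibility timeout is collapsed into "the message goes back on the queue", and the consumer is assumed to fail every time.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0  # stands in for ApproximateReceiveCount


def run_failing_consumer(bodies, max_receive_count):
    """Deliver each message to a consumer that never deletes it.

    Returns (deliveries, dlq): how many times the consumer saw each body,
    and the bodies that ended up in the simulated DLQ.
    """
    source = deque(Message(b) for b in bodies)
    dlq = []
    deliveries = {b: 0 for b in bodies}
    while source:
        msg = source.popleft()
        # Another receive would push the count past the threshold, so the
        # message goes to the DLQ instead of being delivered again.
        if msg.receive_count + 1 > max_receive_count:
            dlq.append(msg.body)
            continue
        msg.receive_count += 1
        deliveries[msg.body] += 1
        # The consumer fails and never calls DeleteMessage, so the message
        # becomes visible again and returns to the queue.
        source.append(msg)
    return deliveries, dlq


deliveries, dlq = run_failing_consumer(["order-1", "order-2"], max_receive_count=3)
print(deliveries)  # {'order-1': 3, 'order-2': 3}
print(dlq)         # ['order-1', 'order-2']
```

With `max_receive_count=3`, each poison message is handed to the consumer three times and then leaves the source queue, which is the behavior that breaks the retry loop.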
Key Parameters and Defaults
maxReceiveCount: Default is not set; you must explicitly configure it. Common values: 3, 5, 10.
DLQ Queue: Must be of the same type (standard or FIFO) as the source queue. FIFO queues require a DLQ that is also FIFO.
Retention Period: Source queue default 4 days, DLQ default 4 days. Both can be set from 60 seconds to 14 days.
Visibility Timeout: Not directly related to DLQ, but influences how quickly messages are re-delivered and thus how fast ReceiveCount increments.
Maximum Message Size: Both queues support 1 KB to 256 KB.
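Because the visibility timeout controls how quickly ReceiveCount climbs, the two settings together give a rough estimate of how long a poison message lingers in the source queue. The sketch below assumes the worst case where every attempt holds the message for the full visibility timeout and a consumer picks it up again as soon as it is visible; `seconds_until_dlq` is a hypothetical helper, not an AWS API.

```python
def seconds_until_dlq(max_receive_count, visibility_timeout_s):
    """Rough estimate of how long a poison message stays in the source queue.

    Assumes each failed attempt holds the message for the whole visibility
    timeout and that it is received again immediately afterwards.
    """
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    if not 0 <= visibility_timeout_s <= 43200:
        raise ValueError("visibility timeout must be between 0 and 43200 seconds")
    return max_receive_count * visibility_timeout_s


# Default 30-second visibility timeout with maxReceiveCount of 5.
print(seconds_until_dlq(5, 30))  # 150
```

An estimate like this is useful when choosing maxReceiveCount: a high count combined with a long visibility timeout can keep a failing message out of the DLQ for hours.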
Configuration Steps
Using AWS CLI:
Create the DLQ:
aws sqs create-queue --queue-name MyDLQ
Get the DLQ ARN:
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/MyDLQ --attribute-names QueueArn
Set the source queue's redrive policy (must be valid JSON):
aws sqs set-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/MySourceQueue --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:MyDLQ\",\"maxReceiveCount\":5}"}'
Verify:
aws sqs get-queue-attributes --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/MySourceQueue --attribute-names RedrivePolicy
Interaction with Related Technologies
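Lambda: A Lambda function has its own DLQ setting (DeadLetterConfig), separate from the SQS DLQ, but it applies only to asynchronous invocations (for example, events from SNS, S3, or EventBridge). When Lambda consumes from SQS through an event source mapping, the function-level DLQ does not receive failed messages; failed batches return to the queue, and the SQS DLQ on the source queue catches messages that are repeatedly received but not deleted. The two mechanisms cover different failure paths and can coexist in one architecture.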
Auto Scaling: DLQ depth can be used as a CloudWatch metric to trigger alarms or auto-scaling actions (e.g., increase consumer instances if DLQ grows).
SNS: A CloudWatch alarm on the DLQ can publish to an SNS topic to send notifications when messages arrive in the DLQ. (SNS itself cannot subscribe to an SQS queue.)
Event Source Mapping: For Lambda polling SQS, the maxReceiveCount in the SQS DLQ is independent of the Lambda function's own retry behavior. The Lambda function may still attempt processing multiple times before the message exceeds the SQS threshold.
Best Practices
Always configure a DLQ for production queues to prevent message loss and infinite loops.
Set maxReceiveCount to a value that allows for transient failures (e.g., 3-5) but not so high that it delays error detection.
Monitor DLQ depth using CloudWatch metrics (ApproximateNumberOfMessagesVisible). A growing DLQ indicates a systemic issue.
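Set the DLQ retention period longer than the source queue's (e.g., 14 days) to allow time for analysis. For standard queues, a message's expiration is based on its original enqueue timestamp, so time already spent in the source queue counts against the DLQ retention.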
For FIFO queues, the DLQ must also be FIFO, and message ordering is preserved per message group ID.
Do not use the same DLQ for multiple source queues unless you are okay with intermingling messages; use separate DLQs per application or use case.
Common Pitfalls
Wrong Account or Region: The DLQ must be in the same AWS account and Region as the source queue; cross-account and cross-Region DLQs are not supported. No queue policy statement is needed for SQS to move messages, but if the DLQ has a RedriveAllowPolicy, it must permit the source queue.
Wrong Queue Type: Mixing standard and FIFO queues will cause configuration failure.
Setting maxReceiveCount too low: For example, setting it to 1 means a single failed attempt (or one visibility timeout that expires while the consumer is still processing) sends the message to the DLQ on the next receive. This can lead to false positives.
Not handling DLQ messages: A DLQ with no consumer keeps accumulating messages until their retention period expires, at which point they are lost. Implement a process to analyze and re-drive or discard messages.
Advanced: Re-driving Messages from DLQ
SQS provides a native DLQ redrive. In the console, choose Start DLQ redrive on the DLQ; through the API, call StartMessageMoveTask with the DLQ's ARN (and optionally a destination ARN and a maximum rate). Write a custom script or Lambda function that reads from the DLQ and sends to the source queue only when messages must be filtered or modified first.
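SQS has since added a DLQ redrive feature exposed through the StartMessageMoveTask API. A minimal sketch of the request follows, assuming boto3 as the SDK. The helper `redrive_task_request` is hypothetical and only builds the keyword arguments, so it runs without AWS credentials; the ARN is the placeholder used elsewhere in this chapter.

```python
def redrive_task_request(dlq_arn, destination_arn=None, max_per_second=None):
    """Build keyword arguments for the SQS StartMessageMoveTask call."""
    request = {"SourceArn": dlq_arn}
    if destination_arn is not None:
        # When omitted, SQS returns messages to the queues they came from.
        request["DestinationArn"] = destination_arn
    if max_per_second is not None:
        if max_per_second < 1:
            raise ValueError("rate must be at least 1 message per second")
        # The service enforces its own upper bound on this rate.
        request["MaxNumberOfMessagesPerSecond"] = max_per_second
    return request


request = redrive_task_request(
    "arn:aws:sqs:us-east-1:123456789012:MyApp-DLQ", max_per_second=50
)
print(request)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("sqs").start_message_move_task(**request)
```

Throttling the move with a rate limit is a cautious default: it keeps a large DLQ backlog from flooding consumers that have only just been fixed.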
Configure DLQ Queue Creation
First, create the SQS queue that will serve as the Dead Letter Queue. This queue can be standard or FIFO, but must match the type of the source queue. Choose appropriate retention period (default 4 days, max 14 days) and other settings. Note the queue ARN for later use. In the AWS Management Console, navigate to SQS, click 'Create queue', select type, name it (e.g., 'MyApp-DLQ'), set retention period, and create. Using CLI: `aws sqs create-queue --queue-name MyApp-DLQ`.
Configure Source Queue Redrive Policy
On the source queue (the queue from which messages will be moved), set the `RedrivePolicy` attribute. This policy is a JSON object containing `deadLetterTargetArn` (the ARN of the DLQ) and `maxReceiveCount` (the number of receive attempts before moving). For example, `{"deadLetterTargetArn":"arn:aws:sqs:us-east-1:123456789012:MyApp-DLQ","maxReceiveCount":5}`. This can be set via the console (the Dead-letter queue section when you edit the source queue) or CLI: `aws sqs set-queue-attributes --queue-url <source-queue-url> --attributes '{"RedrivePolicy": "..."}'`.
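The `RedrivePolicy` value is a JSON document serialized into a string, which is easy to get wrong when quoting by hand. The sketch below builds it with `json.dumps`, assuming boto3 as the SDK; `redrive_policy_request` is a hypothetical helper, and the queue URL and ARN are placeholders.

```python
import json


def redrive_policy_request(source_queue_url, dlq_arn, max_receive_count):
    """Build keyword arguments for the SQS SetQueueAttributes call."""
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string, not a nested object.
    return {
        "QueueUrl": source_queue_url,
        "Attributes": {"RedrivePolicy": json.dumps(policy)},
    }


request = redrive_policy_request(
    "https://sqs.us-east-1.amazonaws.com/123456789012/MySourceQueue",
    "arn:aws:sqs:us-east-1:123456789012:MyApp-DLQ",
    5,
)
print(request["Attributes"]["RedrivePolicy"])

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("sqs").set_queue_attributes(**request)
```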
Verify DLQ Permissions
Confirm that the DLQ is in the same AWS account and Region as the source queue; SQS does not support cross-account or cross-Region DLQs. No queue policy statement is needed for SQS to move messages between the two queues. If the DLQ has a `RedriveAllowPolicy`, make sure it permits the source queue (`allowAll`, or `byQueue` with the source queue's ARN listed). You can check using `aws sqs get-queue-attributes --queue-url <dlq-url> --attribute-names RedriveAllowPolicy`.
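A DLQ can also carry a `RedriveAllowPolicy` attribute that limits which source queues may target it. The sketch below builds a `byQueue` policy, assuming boto3 as the SDK; `redrive_allow_policy_request` is a hypothetical helper, and the URL and ARN are placeholders.

```python
import json


def redrive_allow_policy_request(dlq_url, source_queue_arns):
    """Build SetQueueAttributes arguments that restrict a DLQ to named sources."""
    arns = list(source_queue_arns)
    # byQueue accepts between 1 and 10 source queue ARNs.
    if not 1 <= len(arns) <= 10:
        raise ValueError("byQueue requires between 1 and 10 source queue ARNs")
    policy = {"redrivePermission": "byQueue", "sourceQueueArns": arns}
    return {
        "QueueUrl": dlq_url,
        "Attributes": {"RedriveAllowPolicy": json.dumps(policy)},
    }


request = redrive_allow_policy_request(
    "https://sqs.us-east-1.amazonaws.com/123456789012/MyApp-DLQ",
    ["arn:aws:sqs:us-east-1:123456789012:MySourceQueue"],
)
print(request["Attributes"]["RedriveAllowPolicy"])

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("sqs").set_queue_attributes(**request)
```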
Monitor Receive Count and Message Movement
When a consumer calls `ReceiveMessage` on the source queue, the message's `ReceiveCount` increments. If the consumer does not delete the message before the visibility timeout expires, the message becomes visible again and can be received again, incrementing the count. Once `ReceiveCount` exceeds `maxReceiveCount`, SQS asynchronously moves the message to the DLQ. You can monitor the DLQ's `ApproximateNumberOfMessagesVisible` metric to track redrives. The movement is not instantaneous; expect a delay of up to a few minutes.
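A consumer can read the receive count itself by requesting the `ApproximateReceiveCount` attribute on `ReceiveMessage`. The sketch below works on the shape of that response and flags messages on their final attempt. `is_last_attempt` is a hypothetical helper, and `sample_response` is hand-written illustrative data, not real output.

```python
def is_last_attempt(message, max_receive_count):
    """Return True when a failure now would send the message to the DLQ."""
    # SQS returns attribute values as strings.
    count = int(message["Attributes"]["ApproximateReceiveCount"])
    return count >= max_receive_count


# Illustrative response, as returned when the request includes
# AttributeNames=["ApproximateReceiveCount"].
sample_response = {
    "Messages": [
        {"MessageId": "m-1", "Body": "{}", "Attributes": {"ApproximateReceiveCount": "3"}},
        {"MessageId": "m-2", "Body": "{}", "Attributes": {"ApproximateReceiveCount": "5"}},
    ]
}

for message in sample_response["Messages"]:
    print(message["MessageId"], is_last_attempt(message, max_receive_count=5))
# m-1 False
# m-2 True
```

Logging extra diagnostics on the last attempt can make later analysis of the DLQ much easier.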
Process DLQ Messages
Set up a consumer (e.g., Lambda, EC2) to read messages from the DLQ. Analyze the message body and attributes to determine why processing failed. Options: log the message for debugging, re-drive it back to the source queue, or delete it if it is poison. For re-drive, use the SQS DLQ redrive feature (console or the StartMessageMoveTask API), or create a Lambda that reads from the DLQ and sends to the source queue when messages need to be changed first. Guard against infinite loops by adding a message attribute or using a separate DLQ for re-driven messages.
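As one possible shape for such a consumer, the sketch below is a Lambda handler triggered by the DLQ that sorts each record into "redrive" or "discard". The triage rule is an assumption made for illustration (payloads are expected to be JSON, and an unparseable body is treated as poison); `triage` and the sample event are hypothetical.

```python
import json


def triage(record):
    """Classify one SQS record from a Lambda event (illustrative rule only)."""
    try:
        json.loads(record["body"])
    except (json.JSONDecodeError, TypeError):
        return "discard"  # body is not valid JSON, so retrying cannot help
    return "redrive"      # body parses, so the failure was probably downstream


def handler(event, context):
    """Lambda entry point: map each message ID to a triage decision."""
    decisions = {}
    for record in event.get("Records", []):
        decisions[record["messageId"]] = triage(record)
    return decisions


sample_event = {
    "Records": [
        {"messageId": "m-1", "body": '{"orderId": 42}'},
        {"messageId": "m-2", "body": '{"orderId": '},
    ]
}
print(handler(sample_event, None))  # {'m-1': 'redrive', 'm-2': 'discard'}
```

A real consumer would act on the decisions, for example by archiving discarded bodies before deleting them.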
Set CloudWatch Alarms on DLQ
Create CloudWatch alarms on the DLQ's `ApproximateNumberOfMessagesVisible` metric. For example, alarm when messages appear (indicating failures) or when count exceeds a threshold (e.g., 100) for sustained period. This triggers notifications (SNS) or auto-scaling actions. Also monitor the source queue's `ApproximateAgeOfOldestMessage` to detect if messages are stuck. These alarms are crucial for operational visibility.
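An alarm of this kind can be sketched as the arguments to CloudWatch's PutMetricAlarm call, assuming boto3 as the SDK. `dlq_alarm_request` is a hypothetical helper; the alarm name, period, and SNS topic ARN are illustrative choices rather than recommended values.

```python
def dlq_alarm_request(dlq_name, topic_arn, threshold=0):
    """Build PutMetricAlarm arguments that fire when the DLQ holds messages."""
    return {
        "AlarmName": f"{dlq_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue may stop reporting metrics; treat that as healthy.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


request = dlq_alarm_request(
    "MyApp-DLQ", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print(request["AlarmName"])  # MyApp-DLQ-not-empty

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```

A threshold of 0 alerts on the first dead-lettered message; raising it (for example to 100) trades early warning for fewer notifications.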
Enterprise Scenario 1: E-commerce Order Processing
A large e-commerce platform uses SQS to decouple order placement from downstream processing (inventory, payment, shipping). The source queue receives orders from the web tier. A fleet of EC2 instances polls the queue, processes orders, and deletes messages on success. However, occasionally an order has a corrupted JSON payload or references a discontinued product, causing the consumer to fail repeatedly. Without a DLQ, the same message is retried indefinitely, consuming resources and delaying other orders. The solution: configure a DLQ with maxReceiveCount of 3. After three failed attempts, the message moves to the DLQ. A separate Lambda function reads from the DLQ, logs the error, and sends an alert to the operations team. The DLQ retention is set to 14 days to allow time for analysis. The operations team can then fix the issue and re-drive the message using a custom script. In production, the DLQ depth is monitored via CloudWatch; if it exceeds 100 messages, an alarm triggers auto-scaling of the consumer fleet to handle potential load spikes.
Enterprise Scenario 2: Financial Transaction Reconciliation
A financial institution uses FIFO queues to process transactions in order. Strict ordering is required per account. The source queue is FIFO, and the DLQ must also be FIFO to preserve ordering. Messages that fail processing (e.g., due to insufficient funds) are moved to the DLQ after 5 receive attempts. The DLQ consumer is a dedicated Lambda that examines the message, attempts a different reconciliation path, and if still failing, sends the message to a separate S3 bucket for archival and manual review. The source queue's maxReceiveCount is set carefully: too low (e.g., 2) and transient network issues cause false positives; too high (e.g., 20) and systemic issues take too long to detect. The team uses a value of 5 based on historical failure patterns. They also use message deduplication ID to prevent duplicate entries in the DLQ.
Enterprise Scenario 3: IoT Device Telemetry
An IoT platform ingests sensor data from millions of devices via SQS. The data is processed by a Lambda function that normalizes and stores it in DynamoDB. Some devices send malformed data (e.g., temperature values outside valid range). These messages are moved to a DLQ after 3 receive attempts. The DLQ is used as a trigger for a separate Lambda that updates a device's status in a registry table, marking it as 'faulty'. This allows the platform to isolate problematic devices without affecting the main data pipeline. The DLQ retention is set to 7 days, after which messages are automatically deleted. The team also uses a CloudWatch metric on DLQ depth to trigger an SNS notification to the device management team.
What Goes Wrong When Misconfigured
No DLQ: A poison message causes infinite retries, increasing costs and latency. Eventually, the queue may become backlogged, causing downstream systems to time out.
maxReceiveCount too low: Messages are moved to DLQ prematurely, even for transient failures. This increases operational overhead as teams investigate false positives.
maxReceiveCount too high: Systemic issues go undetected for too long, wasting resources and delaying response.
DLQ deleted or not usable: The move cannot happen; messages remain in the source queue, leading to the same symptoms as no DLQ.
DLQ retention too short: Messages expire before they are analyzed, causing permanent data loss.
Using same DLQ for multiple queues: Messages from different sources intermingle, making it hard to trace root cause.
Performance Considerations
DLQ redrive is asynchronous and does not impact source queue throughput significantly. However, if the DLQ becomes very deep (millions of messages), the cost of storing and processing can increase. It is recommended to set a reasonable retention period and periodically purge or process DLQ messages. The DLQ does not add latency to message delivery; it only affects messages that fail.
SAA-C03 Exam Focus: SQS Dead Letter Queues
Objective Codes: This topic falls under Domain 2: Resilient Architectures, Objective 2.1: Design scalable, decoupled, and highly available systems. Specifically, questions test your understanding of decoupling patterns using SQS and DLQs.
What the Exam Tests:
1. Purpose of DLQ: The exam expects you to know that DLQs prevent infinite processing loops and allow for systematic error handling. They are not for load balancing, caching, or ordering (unless FIFO).
2. Configuration Parameters: You must know maxReceiveCount and deadLetterTargetArn. The default maxReceiveCount is not set; you must configure it. Values like 3, 5, 10 are common.
3. Queue Type Matching: Standard source requires standard DLQ; FIFO source requires FIFO DLQ. Mixing types is invalid.
4. Redrive Process: Messages are moved asynchronously after exceeding maxReceiveCount. The message ID is preserved, but ReceiveCount resets in DLQ.
5. Permissions: The DLQ must be in the same account and Region as the source queue. The DLQ's RedriveAllowPolicy can restrict which source queues may use it.
6. Integration with Lambda: Lambda has its own DLQ for failed asynchronous invocations; the SQS DLQ is separate and is the one that applies when Lambda polls an SQS queue. The exam tests distinguishing between them.
7. Monitoring: CloudWatch metrics like ApproximateNumberOfMessagesVisible on DLQ are used for alarms.
Common Wrong Answers and Traps:
1. "DLQ provides high availability": Wrong. DLQ is for error handling, not availability. Availability is achieved through multi-AZ and redundant consumers.
2. "DLQ guarantees message ordering": Only for FIFO queues. For standard queues, ordering is not guaranteed even with DLQ.
3. "Messages are moved immediately after first failure": Wrong. They are moved only after exceeding maxReceiveCount. The default threshold is not set, so messages never move unless configured.
4. "DLQ can be used to throttle messages": Wrong. Throttling is done via visibility timeout or Lambda concurrency limits.
5. "Re-driving messages from a DLQ always requires custom scripts or AWS Support": Wrong. SQS offers DLQ redrive in the console and through the StartMessageMoveTask API.
Specific Numbers and Terms:
- maxReceiveCount: 1 to 1000.
- Retention period: 60 seconds to 14 days (default 4 days).
- Maximum message size: 256 KB.
- DLQ must be same type (standard or FIFO) as source.
- The RedrivePolicy is a JSON attribute on the source queue.
Edge Cases:
If the DLQ is deleted, the redrive fails silently; messages remain in source queue.
SQS does not cap the number of messages stored in a queue, so a DLQ cannot become "full"; messages leave it only when they are deleted, re-driven, or expire at the end of the retention period.
For FIFO queues, if a message in a group is moved to DLQ, subsequent messages in the same group can still be processed normally (the group is not blocked).
How to Eliminate Wrong Answers:
If a question asks about handling poison messages, the answer should involve a DLQ.
If a question mentions "infinite retries" or "processing failures", look for DLQ.
If a question asks about ordering and DLQ, remember only FIFO DLQ preserves ordering.
If a question offers both Lambda DLQ and SQS DLQ, identify which layer the failure occurs at (Lambda invocation vs. SQS consumption).
DLQ prevents infinite processing loops by moving messages that exceed maxReceiveCount (1-1000).
DLQ must be same queue type (standard or FIFO) as the source queue.
Redrive is asynchronous; messages may be processed once more after threshold is exceeded.
Default retention period for both queues is 4 days (configurable 60s to 14 days).
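DLQ must be in the same AWS account and Region as the source queue; a RedriveAllowPolicy on the DLQ can restrict which source queues use it.
Native DLQ redrive is available in the console and via StartMessageMoveTask; use a custom Lambda only when messages need filtering or changes first.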
Monitor DLQ depth via CloudWatch metric ApproximateNumberOfMessagesVisible.
Lambda DLQ is separate from SQS DLQ: it covers failed asynchronous invocations, while the SQS DLQ covers messages that a queue consumer (including Lambda) fails to process.
These come up on the exam all the time. Here's how to tell them apart.
SQS Dead Letter Queue (DLQ)
Configured on the SQS source queue via RedrivePolicy.
Catches messages that are repeatedly received but not deleted (poison messages).
Works at the SQS consumption layer.
maxReceiveCount determines when message moves (1-1000).
Messages can be re-driven back to the source queue with DLQ redrive (console or StartMessageMoveTask) or custom scripts.
Lambda Dead Letter Queue (Lambda DLQ)
Configured on the Lambda function (DeadLetterConfig).
Catches events that fail after all Lambda retries for asynchronous invocations only; it does not apply to SQS event source mappings.
Works at the Lambda invocation layer.
Lambda retries an asynchronous invocation up to 2 times (3 attempts in total) before sending the event to the DLQ.
Events are sent to SQS or SNS; no native re-drive to Lambda.
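To make the contrast concrete, the Lambda-side settings live on the function rather than on a queue. The sketch below builds the arguments for two Lambda API calls, assuming boto3 as the SDK; `lambda_async_failure_requests` is a hypothetical helper, and the function name and ARN are placeholders.

```python
def lambda_async_failure_requests(function_name, dlq_arn, max_retries=2):
    """Build arguments for configuring async failure handling on a function."""
    # Lambda allows 0, 1, or 2 retries for asynchronous invocations.
    if max_retries not in (0, 1, 2):
        raise ValueError("MaximumRetryAttempts must be 0, 1, or 2")
    return {
        # Where events go after the retries are exhausted.
        "update_function_configuration": {
            "FunctionName": function_name,
            "DeadLetterConfig": {"TargetArn": dlq_arn},
        },
        # How many times Lambda retries an asynchronous invocation.
        "put_function_event_invoke_config": {
            "FunctionName": function_name,
            "MaximumRetryAttempts": max_retries,
        },
    }


requests = lambda_async_failure_requests(
    "process-orders", "arn:aws:sqs:us-east-1:123456789012:MyApp-Lambda-DLQ"
)
for api_call, kwargs in requests.items():
    print(api_call, kwargs)

# With boto3 installed and credentials configured, each entry maps to a call:
#   import boto3
#   client = boto3.client("lambda")
#   client.update_function_configuration(**requests["update_function_configuration"])
#   client.put_function_event_invoke_config(**requests["put_function_event_invoke_config"])
```

Neither call touches the source queue's RedrivePolicy, which is why the two DLQ types must be configured separately.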
Mistake
DLQ automatically retries messages after moving them to DLQ.
Correct
No. Once a message is moved to the DLQ, it sits there until a consumer reads it or its retention period expires. There is no automatic retry. To reprocess messages, start a DLQ redrive (console or the StartMessageMoveTask API) or build a custom consumer that sends them back to the source queue.
Mistake
You must set maxReceiveCount to the same value as the number of retries configured in the consumer application.
Correct
Not necessarily. The consumer's internal retry logic is independent of SQS's maxReceiveCount. For example, a Lambda function may retry internally 3 times before failing, but SQS's maxReceiveCount counts each ReceiveMessage call. If the consumer retries internally without calling ReceiveMessage again, the count does not increment. You should set maxReceiveCount based on the number of times the message is actually received from the queue, not internal retries.
Mistake
DLQ can be a standard queue when the source is FIFO.
Correct
False. The DLQ must be the same type as the source queue. If the source is FIFO, the DLQ must also be FIFO to preserve message ordering and deduplication. Configuring a standard DLQ for a FIFO source will result in an error.
Mistake
Messages are moved to DLQ immediately after the last receive attempt.
Correct
The redrive is asynchronous and may take seconds to minutes. During this time, the message is still in the source queue and can be received again if the visibility timeout expires. This can cause the message to be processed one more time before being moved.
Mistake
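There is no native way to re-drive messages from a DLQ; you must always write a custom script or contact AWS Support.
Correct
SQS provides DLQ redrive natively. In the console, select the DLQ and choose Start DLQ redrive; programmatically, use the StartMessageMoveTask API (with ListMessageMoveTasks and CancelMessageMoveTask to track or stop the task). Messages can go back to their source queue or to another queue. Custom scripts (e.g., Lambda) are still useful when messages must be filtered or modified before reprocessing.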
SQS DLQ is configured on an SQS queue and catches messages that are repeatedly received but not deleted (poison messages). It operates at the message consumption layer. Lambda DLQ is configured on a Lambda function and catches events that fail after all Lambda retries for asynchronous invocations (up to 2 retries, 3 attempts in total). It operates at the function invocation layer and does not apply when Lambda polls SQS through an event source mapping. Both can appear in one architecture: the SQS DLQ catches messages that fail after multiple receive attempts, and the Lambda DLQ catches failed asynchronous events. They address different failure points.
If the DLQ is deleted, the redrive policy on the source queue still references the old ARN. When a message exceeds maxReceiveCount, SQS cannot move it because the target queue no longer exists, so the message remains in the source queue (it is not deleted) and the source queue continues to operate normally. To fix this, either recreate a queue with the same name in the same account and Region, which yields the same ARN (a deleted queue's name can be reused after 60 seconds), or update the source queue's redrive policy to point to a new DLQ.
Yes, technically you can use the same DLQ for multiple source queues. However, this is not recommended because messages from different sources will intermingle, making it difficult to identify which source queue a message came from. To differentiate, you can set message attributes (e.g., 'SourceQueueName') when sending messages, but this adds complexity. Best practice is to use a separate DLQ per application or use case.
SQS provides a native DLQ redrive. In the console, select the DLQ and choose Start DLQ redrive; through the API, call StartMessageMoveTask with the DLQ's ARN, and optionally a destination queue ARN and a maximum rate. Re-driven messages arrive as new messages, so their receive count starts over. If messages must be filtered or modified first, implement a custom solution, typically a Lambda function that reads messages from the DLQ and sends them to the source queue using the SQS SendMessage API. Be careful to avoid infinite loops: fix the root cause before re-driving, and track re-driven messages with a message attribute or use a separate DLQ for re-driven messages.
There is no default value for maxReceiveCount. If you do not configure a redrive policy, messages will never be moved to a DLQ. You must explicitly set maxReceiveCount when configuring the redrive policy. Common values are 3, 5, or 10, depending on the application's tolerance for transient failures.
Yes, for FIFO queues, the DLQ must also be FIFO, and message ordering is preserved per message group ID. When a message from a group is moved to the DLQ, subsequent messages in the same group remain in the source queue and are processed normally. The DLQ maintains the order of messages as they were moved, but the source queue's order for the remaining messages is unaffected.
SQS queues (including DLQs) do not have a built-in maximum message limit. However, you can use queue policies to restrict who can send messages to the DLQ, but there is no native cap on the number of messages. To limit DLQ size, you can set a short retention period and implement a consumer that processes messages regularly. Alternatively, you can use CloudWatch alarms to notify when the DLQ depth exceeds a threshold.