A company runs a critical application on AWS Lambda that processes messages from an Amazon SQS queue. The application must be resilient to downstream service failures. The team notices that when the downstream service is unhealthy, messages are repeatedly retried and eventually sent to the dead-letter queue (DLQ) before the service recovers. What design change would improve resilience by allowing automatic retries after the downstream service recovers?
Long visibility timeout and high maxReceiveCount allow messages to be retried over an extended period.
Why this answer
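Option A is correct because a long visibility timeout (e.g., 6 hours; SQS allows up to 12 hours) spaces out the retry attempts rather than eliminating them. After a failed attempt, the message stays hidden in the source queue and becomes visible again only when the visibility timeout expires. Combined with a high maxReceiveCount in the redrive policy, this stretches the total retry window to roughly the visibility timeout multiplied by maxReceiveCount, so messages are still in the source queue and are retried automatically once the downstream service recovers, instead of exhausting their receive count and moving to the DLQ during the outage. The message retention period (up to 14 days) must cover this window, or messages expire before their retries are used.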
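The interaction between the visibility timeout, maxReceiveCount, and the retention period can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper names `build_queue_attributes` and `retry_window_seconds` and the DLQ ARN are hypothetical, and the retry window is an approximation that assumes each failed attempt leaves the message hidden for one full visibility timeout.

```python
import json

# SQS limits: visibility timeout 0-43200 s (12 h), retention 60-1209600 s (14 days),
# maxReceiveCount 1-1000.
MAX_VISIBILITY_S = 43_200
MAX_RETENTION_S = 1_209_600


def build_queue_attributes(dlq_arn, visibility_timeout_s, max_receive_count,
                           retention_s=MAX_RETENTION_S):
    """Build an Attributes mapping for the source queue (hypothetical helper)."""
    if not 0 <= visibility_timeout_s <= MAX_VISIBILITY_S:
        raise ValueError("visibility timeout must be between 0 and 43200 seconds")
    if not 1 <= max_receive_count <= 1000:
        raise ValueError("maxReceiveCount must be between 1 and 1000")
    if not 60 <= retention_s <= MAX_RETENTION_S:
        raise ValueError("retention must be between 60 and 1209600 seconds")
    return {
        "VisibilityTimeout": str(visibility_timeout_s),
        "MessageRetentionPeriod": str(retention_s),
        # The redrive policy is passed to SQS as a JSON string.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


def retry_window_seconds(visibility_timeout_s, max_receive_count, retention_s):
    """Approximate how long a failing message keeps being retried.

    Assumes every failed attempt hides the message for one full visibility
    timeout; the window can never exceed the retention period.
    """
    return min(visibility_timeout_s * max_receive_count, retention_s)


# Example: 6-hour visibility timeout, 10 receives, 14-day retention.
attrs = build_queue_attributes(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq",  # hypothetical ARN
    visibility_timeout_s=6 * 3600,
    max_receive_count=10,
)
window = retry_window_seconds(6 * 3600, 10, MAX_RETENTION_S)
print(attrs)
print(f"approximate retry window: {window / 3600:.0f} hours")
```

With these example values the message would be retried for about 60 hours before moving to the DLQ. The resulting mapping is in the shape that the boto3 SQS client's `set_queue_attributes(QueueUrl=..., Attributes=attrs)` call accepts, assuming a queue and DLQ that already exist.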
Exam trap
The trap here is that candidates often think increasing the DLQ retention or reducing retries (maxReceiveCount) is the solution. The real key is controlling retry timing via the visibility timeout, so that the downstream service has time to recover before a message uses up its allowed receive attempts.
How to eliminate wrong answers
Option B is wrong because reducing maxReceiveCount to 1 sends a message to the DLQ after its first failed attempt, which defeats resilience: no retries are allowed, and the messages must be reprocessed manually from the DLQ. Option C is wrong because increasing the message retention period and using a DLQ with high retention does not prevent messages from reaching the DLQ prematurely; it only keeps them in the DLQ longer. Messages in a DLQ are not retried automatically, so even after the downstream service recovers they sit there until someone redrives them. Option D is wrong because using SNS to fan out to multiple SQS queues with different retry policies adds complexity without addressing the core issue of premature DLQ delivery; each queue still relies on the same visibility timeout and retry mechanism.