DVA-C02 | Chapter 90 of 101 | Objective 4.2

Error Handling Patterns in Serverless Apps

This chapter covers error handling patterns in serverless applications on AWS, focusing on how to design resilient systems using Lambda, SQS, SNS, Step Functions, and CloudWatch. Error handling matters for the DVA-C02 exam because troubleshooting questions frequently involve identifying and fixing failures in serverless architectures. You'll learn the specific mechanisms (retries, DLQs, destinations, and circuit breakers) that prevent data loss and keep applications reliable, along with exam traps around timeout values, async invocation behavior, and error visibility.

25 min read
Intermediate
Updated May 31, 2026

Serverless Error Handling as a Multi-Layer Safety Net

Imagine a high-rise building with a central fire alarm system. On each floor, there are smoke detectors (Lambda function code) that can detect a fire and trigger an alarm. But what if a detector itself malfunctions? The building has a backup system: a manual pull station on each floor (Dead Letter Queue) that collects any alarms the detectors fail to send. Additionally, there's a central monitoring station (CloudWatch) that tracks all detector health and alarm history. If a detector fails to report, the monitoring station can trigger a maintenance request (SNS notification) to the building engineer (developer). The building also has a fire suppression system (retry logic) that automatically tries to extinguish a small fire three times before escalating to the fire department (SQS retries). If the suppression system fails, the water flow triggers a separate alarm (Lambda destination on failure) that directly alerts the fire chief. Each layer is independent: if the smoke detector fails, the pull station still works; if the pull station is blocked, the monitoring station still logs the event. This multi-layer design ensures no fire goes unnoticed, just as serverless error handling ensures no event is lost or unprocessed.

How It Actually Works

What Is Error Handling in Serverless Apps?

Serverless applications, by their nature, involve many moving parts: Lambda functions, event sources (S3, SQS, API Gateway), and downstream services (DynamoDB, SNS). Failures can occur at any layer: code exceptions, timeouts, throttling, permission errors, or downstream service unavailability. Error handling patterns are the structured approaches to detect, capture, and respond to these failures without losing data or causing cascading failures.

On the DVA-C02 exam, you must understand how Lambda handles errors differently based on invocation type (synchronous vs. asynchronous) and event source mapping (e.g., SQS, DynamoDB Streams). The exam also tests your ability to configure DLQs (Dead Letter Queues), set retry policies, and use Lambda destinations for failure routing.

How Lambda Handles Errors Internally

Lambda supports three invocation types:

- Synchronous (e.g., API Gateway, SDK Invoke with InvocationType='RequestResponse'): The caller receives the error response immediately. Lambda does not retry; any retry is up to the caller (for example, the AWS SDK's built-in retries or your own client logic).
- Asynchronous (e.g., S3 events, SNS, SDK Invoke with InvocationType='Event'): Lambda queues the event and, by default, retries a failed invocation twice (3 attempts in total), waiting roughly 1 minute before the first retry and 2 minutes before the second. After all retries fail, the event can be sent to a Dead Letter Queue (DLQ) or an on-failure Lambda destination.
- Poll-based (event source mappings; e.g., SQS, DynamoDB Streams, Kinesis): Lambda polls the source and invokes the function synchronously with a batch. For SQS, successfully processed messages are deleted, and a failed batch becomes visible again after the visibility timeout. For streams, Lambda retries the failed batch until it succeeds or the records expire, unless you limit retries on the mapping.
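To make the asynchronous case concrete, here is a small local model of the default retry behavior. It is only a sketch: `simulate_async_invocation` and the `dead_letters` list are hypothetical names, no AWS service is called, and the waits between attempts are left out.

```python
from typing import Any, Callable, Dict, List


def simulate_async_invocation(
    handler: Callable[[Any], Any],
    event: Any,
    dead_letters: List[Any],
    max_retry_attempts: int = 2,
) -> Dict[str, Any]:
    """Local model (not the real service) of an async Lambda invocation.

    The handler is called once, then retried up to max_retry_attempts times.
    If every attempt raises, the event is appended to dead_letters, which
    stands in for a DLQ or on-failure destination.
    """
    attempts = 0
    last_error = ""
    for _ in range(1 + max_retry_attempts):
        attempts += 1
        try:
            result = handler(event)
            return {"status": "succeeded", "attempts": attempts, "result": result}
        except Exception as exc:  # any unhandled error counts as a failed attempt
            last_error = repr(exc)
    dead_letters.append(event)
    return {"status": "failed", "attempts": attempts, "error": last_error}


def always_fails(event: Any) -> None:
    raise RuntimeError("downstream unavailable")
```

Calling `simulate_async_invocation(always_fails, {"id": 1}, dlq)` with an empty list `dlq` uses all three attempts and leaves the event in `dlq`; without such a list (that is, without a DLQ or destination) the event would simply be gone.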

Key Components and Defaults

Dead Letter Queue (DLQ): An SQS queue or SNS topic where Lambda sends events that failed after all retries. For async invocations, the DLQ is configured on the function. For SQS event sources, the DLQ is a second SQS queue configured on the source queue through a redrive policy. Default: No DLQ. Exam tip: DLQs never apply to synchronous invocations.

Lambda Destinations: A newer feature that allows you to route invocation results to SQS, SNS, Lambda, or EventBridge. You can set separate destinations for success and failure. Default: No destinations. Exam tip: Destinations can receive both success and failure events, while DLQs only receive failure events.

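Retry Attempts: For async invocations, Lambda retries twice by default (total 3 attempts); the retry count can be lowered to 0 or 1. For SQS event source mappings, the number of retries is controlled by the maxReceiveCount in the queue's redrive policy and by the visibility timeout; without a redrive policy, a failing message is retried until the queue's retention period expires. For Kinesis/DynamoDB Streams, Lambda by default retries a failed batch until it succeeds or the records expire from the stream (24 hours by default); you can bound this on the event source mapping with MaximumRetryAttempts, MaximumRecordAgeInSeconds, and BisectBatchOnFunctionError.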

Maximum Event Age: For async invocations, you can set maximum-event-age-in-seconds (1 min to 6 hours, default 6 hours). Events older than this are discarded or sent to DLQ/destination.

Bisect Batch on Function Error: For stream-based event sources, this option splits the failed batch into two halves to isolate the failing record. Default: Disabled.
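The bisecting idea can be sketched locally as a recursive split. This is an illustrative model only, not how the Lambda service is implemented; `process_with_bisect` and the sample handler are hypothetical names.

```python
from typing import Any, Callable, List, Sequence


def process_with_bisect(
    batch: Sequence[Any], handler: Callable[[Sequence[Any]], None]
) -> List[Any]:
    """Return the records that still fail once the batch has been halved
    down to single records (the "poison" records)."""
    try:
        handler(batch)
        return []  # the whole batch succeeded
    except Exception:
        if len(batch) <= 1:
            return list(batch)  # cannot split further: this record is the culprit
        mid = len(batch) // 2
        # Retry each half independently, as bisecting a failed batch would.
        return process_with_bisect(batch[:mid], handler) + process_with_bisect(
            batch[mid:], handler
        )


def sample_handler(records: Sequence[str]) -> None:
    # Hypothetical handler that rejects the whole batch if any record is bad.
    if "bad" in records:
        raise ValueError("cannot process batch")
```

Note that records in a failing half are handed to the handler more than once before the culprit is isolated, which is one reason stream consumers should be idempotent.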

Configuration and Verification

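You can configure DLQs and destinations via the AWS Console, CLI, or CloudFormation. A function-level DLQ for async invocations is set on the function configuration (aws lambda update-function-configuration with --dead-letter-config TargetArn=<queue-or-topic-ARN>). The on-failure destination, which serves the same purpose, is set in the event invoke config. For example, to route failed async invocations to an SQS queue: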

aws lambda put-function-event-invoke-config \
  --function-name my-function \
  --destination-config '{"OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:my-dlq"}}'

For SQS event source mapping, the DLQ is configured on the SQS queue itself (redrive policy):

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
  --attributes '{"RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-dlq\",\"maxReceiveCount\":\"5\"}"}'
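The nested, escaped JSON in that command is easy to get wrong by hand. As a sketch, the same attribute can be built in Python and passed to boto3's `set_queue_attributes`; the helper name `build_redrive_attributes` is an assumption for illustration, and the ARN is the placeholder from the command above.

```python
import json
from typing import Dict


def build_redrive_attributes(dlq_arn: str, max_receive_count: int) -> Dict[str, str]:
    """Build the Attributes mapping for an SQS redrive policy.

    SQS expects RedrivePolicy to be a JSON *string*, so the inner policy
    is serialized with json.dumps instead of being escaped manually.
    """
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    return {"RedrivePolicy": json.dumps(policy)}


attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:my-dlq", 5
)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```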

To verify, check CloudWatch Logs for error messages like Task timed out after 3.00 seconds or Process exited before completing request. Also monitor the DeadLetterErrors metric (DLQ send failures) and DestinationDeliveryFailures.

Interaction with Related Technologies

SQS + Lambda: Lambda polls the queue and invokes the function synchronously with a batch of messages. If the function returns an error, the messages remain in the queue and become visible again after the visibility timeout. The maxReceiveCount in the redrive policy determines when a message moves to the DLQ. Common exam scenario: Setting maxReceiveCount to 2 means each message can be retried once before being sent to DLQ.

SNS + Lambda: SNS sends messages to Lambda asynchronously. If Lambda fails, SNS does not retry; it's up to Lambda's async retry mechanism. However, SNS can send to SQS first, then Lambda polls SQS.

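Step Functions: Error handling is built into state machines. You can use Retry (with IntervalSeconds, MaxAttempts, BackoffRate) and Catch (to route to another state). A state retries only if it defines a Retry block; within a retrier, omitted fields default to 3 attempts, a 1-second interval, and a 2.0 backoff rate. Exam tip: Step Functions has no built-in DLQ, but a Catch can route to a state that sends the failed input to SQS or SNS, with ResultPath preserving the error details alongside the original input.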

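API Gateway + Lambda: With a Lambda proxy integration, API Gateway returns 502 (Bad Gateway) if the function throws an unhandled error, times out, or returns a malformed response, and 504 (Gateway Timeout) if the integration exceeds API Gateway's own integration timeout (29 seconds by default). To return meaningful status codes, catch errors in the function and return a proper response, or in a non-proxy integration use integration response mapping to translate Lambda errors into custom error codes.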

Timeout and Throttling Errors

Lambda function timeout defaults to 3 seconds (max 15 minutes). If your function exceeds the timeout, it is terminated and returns an error. For async invocations, this triggers a retry. For synchronous, the caller gets a Task timed out error.

Throttling: Lambda has a concurrency limit (default 1000 per region). If exceeded, synchronous invocations return TooManyRequestsException (429). Asynchronous invocations are queued and retried automatically for up to 6 hours. Exam tip: Throttled async invocations do not go to DLQ unless the maximum-event-age expires.

Idempotency and Poison Messages

Poison messages are messages that repeatedly fail processing. Without a DLQ, they are retried every time the visibility timeout expires until the queue's message retention period runs out, consuming invocations the whole time. DLQs isolate these messages for manual inspection. Idempotency ensures that reprocessing the same message does not cause side effects (e.g., duplicate charges). Use idempotency tokens (e.g., DynamoDB table with TTL) to deduplicate.
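A minimal sketch of the idempotency-token idea follows. It uses an in-memory dictionary as a stand-in for the DynamoDB table with TTL; `IdempotencyStore` and `handle_once` are hypothetical names, and in a real function the claim would typically be a conditional write so that two concurrent invocations cannot both succeed.

```python
import time
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for a DynamoDB table with a TTL attribute."""

    def __init__(self, ttl_seconds: float = 3600, clock: Callable[[], float] = time.time):
        self._expires: Dict[str, float] = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def claim(self, key: str) -> bool:
        """Return True if the key was unseen (or expired) and is now claimed."""
        now = self._clock()
        expires = self._expires.get(key)
        if expires is not None and expires > now:
            return False
        self._expires[key] = now + self._ttl
        return True

    def release(self, key: str) -> None:
        """Forget a key so that a failed attempt can be retried."""
        self._expires.pop(key, None)


def handle_once(store: IdempotencyStore, message_id: str, side_effect: Callable[[], None]) -> str:
    """Run side_effect at most once per message_id within the TTL window."""
    if not store.claim(message_id):
        return "skipped"
    try:
        side_effect()
    except Exception:
        store.release(message_id)  # allow the retry to run the side effect again
        raise
    return "processed"
```

Releasing the key on failure is a design choice: it keeps a retried message from being skipped when the first attempt never completed its side effect.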

Walk-Through

1

Configure Lambda Function Timeout

Set the function timeout to a value that accommodates worst-case execution but not so high that it delays retries. Default is 3 seconds; max is 900 seconds (15 minutes). For a typical API endpoint, 10-30 seconds is common. If the function exceeds this timeout, Lambda terminates it and returns an error. For async invocations, this triggers a retry after ~1 minute. Monitor `Duration` and `Throttles` metrics in CloudWatch.

2

Enable Retries for Async Invocations

Lambda automatically retries failed async invocations twice by default (3 total attempts). You can lower the retry count to 0 or 1 with the maximum retry attempts setting, and you can control the maximum event age (1 min to 6 hours). Events that fail after all retries are sent to the DLQ or on-failure destination (if configured) or discarded. Use `put-function-event-invoke-config` to set the retry count, event age, and destinations; the DLQ itself is set on the function configuration with `update-function-configuration --dead-letter-config`. Exam tip: The retry delay is not configurable and is approximately 1 minute then 2 minutes.

3

Configure Dead Letter Queue (DLQ)

Create an SQS queue or SNS topic to receive failed events. For async invocations, set the DLQ ARN in the function configuration (`DeadLetterConfig`); the function's execution role needs permission to send to it (`sqs:SendMessage` or `sns:Publish`). For SQS event source mappings, set the redrive policy on the source queue with `maxReceiveCount`. Typical values: 3-5. Messages that exceed `maxReceiveCount` are moved to the DLQ. Monitor the `ApproximateNumberOfMessagesVisible` metric on the DLQ to detect failures.

4

Set Lambda Destinations for Success/Failure

Destinations provide a more flexible alternative to DLQs. You can route success and failure results to different targets (SQS, SNS, Lambda, EventBridge). Configure via `put-function-event-invoke-config` with `OnSuccess` and `OnFailure`. Exam tip: Destinations can receive both success and failure events, while DLQs only receive failures. Also, success and failure destinations apply to async invocations; event source mappings for Kinesis and DynamoDB Streams support an on-failure destination only, and SQS event source mappings rely on the queue's redrive policy instead.
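As a sketch, the arguments for boto3's `put_function_event_invoke_config` can be assembled and range-checked before any call is made. The helper name `build_event_invoke_config` and the ARNs in the usage note are assumptions for illustration.

```python
from typing import Any, Dict, Optional


def build_event_invoke_config(
    function_name: str,
    on_failure_arn: str,
    on_success_arn: Optional[str] = None,
    max_retry_attempts: int = 2,
    max_event_age_seconds: int = 21600,
) -> Dict[str, Any]:
    """Assemble keyword arguments for put_function_event_invoke_config."""
    if not 0 <= max_retry_attempts <= 2:
        raise ValueError("MaximumRetryAttempts must be 0, 1, or 2")
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 21600")
    destinations: Dict[str, Dict[str, str]] = {
        "OnFailure": {"Destination": on_failure_arn}
    }
    if on_success_arn is not None:
        destinations["OnSuccess"] = {"Destination": on_success_arn}
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retry_attempts,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        "DestinationConfig": destinations,
    }
```

With boto3 available, the result could be applied as `boto3.client("lambda").put_function_event_invoke_config(**kwargs)`, for example with `kwargs = build_event_invoke_config("my-function", "arn:aws:sqs:us-east-1:123456789012:my-dlq", max_retry_attempts=1)`.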

5

Implement Circuit Breaker with Step Functions

For complex workflows, use Step Functions with `Retry` and `Catch` states. Define a `Retrier` with `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. Example: Retry 3 times with 2-second interval and 2.0 backoff. If all retries fail, `Catch` routes to a fallback state (e.g., send to DLQ). This prevents repeated failures from overwhelming downstream services.
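The retry example from this step can be written out as a state machine definition plus a helper that shows the resulting wait times. This is a sketch under assumptions: the state names, the Lambda ARN, and the queue URL are hypothetical, and the fallback state uses the SQS `sendMessage` service integration as one possible way to "send to DLQ".

```python
import json
from typing import List


def retry_delays(
    interval_seconds: float = 1, max_attempts: int = 3, backoff_rate: float = 2.0
) -> List[float]:
    """Seconds waited before each retry: interval * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


state_machine = {
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            # Hypothetical function ARN.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "ResultPath": "$.error",  # keep the input, add error details
                    "Next": "SendToFailureQueue",
                }
            ],
            "End": True,
        },
        "SendToFailureQueue": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage",
            "Parameters": {
                # Hypothetical queue URL.
                "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq",
                "MessageBody.$": "$",
            },
            "End": True,
        },
    },
}

definition_json = json.dumps(state_machine, indent=2)
```

With a 2-second interval and a 2.0 backoff rate, `retry_delays(2, 3, 2.0)` gives waits of 2, 4, and 8 seconds before the `Catch` takes over.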

What This Looks Like on the Job

In a production e-commerce platform, an order processing Lambda function receives events from SQS. The function writes to DynamoDB and calls an external payment gateway. If the payment gateway times out, the function returns an error. The source queue has a 30-second visibility timeout and a redrive policy with a max receive count of 3, so the same message is retried each time the visibility timeout expires; after 3 failed receives, it moves to the DLQ. (Without the redrive policy, it would keep cycling until the retention period expired.) The operations team monitors the DLQ for failed orders and manually reprocesses them after fixing the issue (e.g., payment gateway down). They also set up a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible to trigger an SNS notification to the on-call engineer.
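The DLQ alarm from this scenario might be defined roughly as follows. This is a sketch: the helper name `build_dlq_alarm`, the alarm naming scheme, and the 5-minute period are assumptions, and the result is meant to be passed to boto3's CloudWatch `put_metric_alarm`.

```python
from typing import Any, Dict


def build_dlq_alarm(queue_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Keyword arguments for an alarm that fires when the DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-has-messages",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds; assumed evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # notify the on-call topic
    }


alarm = build_dlq_alarm("my-dlq", "arn:aws:sns:us-east-1:123456789012:on-call")
# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```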

Another scenario: a media processing pipeline uses Lambda to transcode videos uploaded to S3. The Lambda function has a 5-minute timeout. If a video is too large, the function times out. With async invocation, Lambda retries twice, and each retry also runs for 5 minutes, so about 18 minutes pass (three 5-minute attempts plus the 1-minute and 2-minute waits) before the event goes to the DLQ. To improve this, the team sets maximum-event-age-in-seconds to 300 (5 minutes) so that an event older than 5 minutes is sent to the DLQ instead of being retried; setting the maximum retry attempts to 0 achieves the same result more directly. They also use Lambda destinations to send the failure payload to an SNS topic that triggers a notification to the engineering team.

A common misconfiguration is setting the SQS visibility timeout lower than the Lambda function timeout. For example, if Lambda timeout is 30 seconds but visibility timeout is 20 seconds, the message becomes visible again before the function finishes, leading to duplicate processing. The rule of thumb: set visibility timeout to at least 6x the Lambda timeout, which leaves room for Lambda to retry a batch when invocations are throttled. In practice, for a 30-second Lambda, set visibility timeout to 180 seconds.
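The rule of thumb is simple enough to encode as a pre-deployment check. The function names below are hypothetical, and the multiplier of 6 is the guideline quoted above rather than a hard limit enforced by AWS.

```python
from typing import List


def recommended_visibility_timeout(function_timeout_seconds: int, multiplier: int = 6) -> int:
    """Visibility timeout suggested by the 6x rule of thumb."""
    return function_timeout_seconds * multiplier


def check_queue_config(function_timeout_seconds: int, visibility_timeout_seconds: int) -> List[str]:
    """Return warnings for a queue/function timeout pairing (empty if fine)."""
    warnings: List[str] = []
    recommended = recommended_visibility_timeout(function_timeout_seconds)
    if visibility_timeout_seconds < function_timeout_seconds:
        warnings.append(
            "visibility timeout is shorter than the function timeout: "
            "messages can reappear while still being processed"
        )
    elif visibility_timeout_seconds < recommended:
        warnings.append(
            f"visibility timeout is below the recommended {recommended} seconds"
        )
    return warnings
```

For the example above, `check_queue_config(30, 20)` reports the duplicate-processing risk, while `check_queue_config(30, 180)` returns no warnings.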

Another issue is forgetting to enable ReportBatchItemFailures for SQS event source mappings. Without this, a single failing record causes the entire batch to be reprocessed, even if other records succeeded. Enabling this feature allows Lambda to return partial failures, so only the failed records are retried. This is a key exam topic.
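A handler that reports partial batch failures might look like the sketch below. The `batchItemFailures` / `itemIdentifier` response shape is what Lambda expects once `ReportBatchItemFailures` is enabled on the event source mapping; `process_record` and the `order_id` field are hypothetical stand-ins for real business logic.

```python
import json
from typing import Any, Dict, List


def process_record(body: Dict[str, Any]) -> None:
    """Hypothetical business logic: reject orders without an ID."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, List[Dict[str, str]]]:
    """Process an SQS batch and report only the records that failed."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            process_record(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list tells Lambda the whole batch succeeded, so errors must never be swallowed without adding the message ID to the list.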

How DVA-C02 Actually Tests This

The DVA-C02 exam tests error handling under Objective 4.2 (Troubleshooting), and the topic also appears in Domain 1 (Development with AWS Services) and Domain 3 (Deployment). Expect a few questions specifically on DLQs, retries, and destinations. The most common wrong answers involve:

1. Thinking DLQs work for synchronous invocations: They do not. Lambda's DLQ applies to async invocations, and SQS-triggered functions rely on the source queue's redrive policy. For synchronous invocations, the caller must handle errors.
2. Confusing DLQs with Lambda Destinations: DLQs are older and only handle failures. Destinations handle both success and failure, and are more flexible. The exam may ask which to use for a given scenario (e.g., need to log successes? Use destinations).
3. Assuming Lambda retries synchronous invocations: It does not. The caller (for example, a client using the AWS SDK) may retry, but Lambda itself does not.
4. Setting `maxReceiveCount` too low: If set to 1, every transient failure sends the message to DLQ, causing unnecessary manual intervention. Typical values are 3-5.

Specific numbers to memorize:

Lambda async retries: 2 retries (3 total attempts)

Lambda function timeout default: 3 seconds, max 900 seconds

Maximum event age for async: 1 min to 6 hours, default 6 hours

SQS visibility timeout: 0 seconds to 12 hours, default 30 seconds

Step Functions retrier defaults (when a Retry block omits them): 3 attempts, 1-second interval, backoff rate 2.0

Lambda concurrency default: 1000 per region

Edge cases:

Throttled async invocations are queued and retried, but they do NOT go to DLQ unless the event age expires.

If a DLQ is not configured and all retries fail, the event is discarded (lost).

For SQS event source mappings, if the Lambda function returns an error, the message becomes visible again after the visibility timeout. If the function catches every exception and still returns normally, Lambda treats the batch as processed and deletes the messages, even though the work failed. This is a common pitfall.

To eliminate wrong answers, trace the invocation type: is it synchronous? Then DLQ/destinations are irrelevant. Is it async? Then check if DLQ or destination is configured. For SQS, check the maxReceiveCount and visibility timeout. Always ask: "What happens after all retries?" If the answer says 'discarded', that's correct if no DLQ configured.

Key Takeaways

Lambda async invocations are automatically retried twice (3 total attempts) before being sent to a DLQ or destination.

Synchronous invocations are not retried by Lambda; the caller must handle errors.

DLQs only capture failures; Lambda destinations capture both success and failure.

For SQS event source mappings, set visibility timeout to at least 6 times the Lambda function timeout.

Enable ReportBatchItemFailures for SQS partial batch failures to avoid reprocessing successful records.

Step Functions retries only where a Retry block is defined; omitted retrier fields default to 3 attempts with 1-second interval and 2.0 backoff rate.

Lambda function timeout default is 3 seconds, max 900 seconds (15 minutes).

If no DLQ is configured and async retries fail, the event is discarded (lost).

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Dead Letter Queue (DLQ)

Only captures failed events (on failure).

Function-level DLQ applies to async invocations; SQS event sources use the source queue's own redrive-policy DLQ.

Targets: SQS queue or SNS topic only.

Older feature, simpler configuration.

No success routing; the message carries the original event plus a few attributes (request ID, error code, error message).

Lambda Destinations

Captures both success and failure events.

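Success and failure routing for async invocations; stream event source mappings (Kinesis, DynamoDB Streams) support on-failure only; not available for SQS event source mappings.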

Targets: SQS, SNS, Lambda, EventBridge.

Newer feature, more flexible.

Can include invocation metadata (request ID, response).

Watch Out for These

Mistake

Lambda automatically retries all failed invocations three times regardless of invocation type.

Correct

Only asynchronous invocations are automatically retried twice (three total attempts). Synchronous invocations are not retried by Lambda; the calling service or client must handle retries.

Mistake

A Dead Letter Queue can be used for synchronous Lambda invocations.

Correct

Lambda's DLQ applies only to asynchronous invocations; SQS event sources use the queue's redrive policy, and stream sources use an on-failure destination. Synchronous invocations return errors directly to the caller, and no DLQ is involved.

Mistake

Lambda destinations and DLQs are the same thing.

Correct

DLQs only capture failed events for async invocations. Lambda destinations can capture both success and failure events for async invocations, and they support multiple target types (SQS, SNS, Lambda, EventBridge). Destinations are more modern and flexible.

Mistake

If an SQS message fails processing, it is immediately sent to the DLQ.

Correct

The message is retried based on the visibility timeout and `maxReceiveCount`. It only moves to the DLQ after exceeding `maxReceiveCount`. Without a redrive policy there is no `maxReceiveCount` at all, so the message is never sent to a DLQ and keeps being retried until the retention period expires.

Mistake

Lambda function timeout can be set to any value up to 15 minutes.

Correct

The maximum is indeed 900 seconds (15 minutes) and the default is 3 seconds, but the usable value depends on the caller. Behind API Gateway, the integration times out after 29 seconds by default, so a longer-running function still produces a gateway timeout for the client. Set the Lambda timeout below the integration timeout.


Frequently Asked Questions

What happens to an asynchronous Lambda invocation if the function keeps failing and no DLQ is configured?

If no DLQ or destination is configured for an async invocation, Lambda retries the invocation twice (three total attempts). After all retries fail, the event is discarded. It is not stored anywhere and is lost. To prevent data loss, always configure a DLQ or a Lambda destination on failure. Exam tip: This is a common scenario where you must recommend adding a DLQ to avoid losing events.

Can I use a Lambda destination with an SQS event source mapping?

No. Destinations in the function's event invoke config apply only to asynchronous invocations (e.g., S3 events, SNS). For SQS event source mappings, use the DLQ on the SQS queue (redrive policy) to capture failed messages. Stream sources are different: event source mappings for Kinesis and DynamoDB Streams accept an on-failure destination. The exam often tests this distinction: destinations for async, redrive-policy DLQ for SQS.

How do I configure Lambda to retry a failed synchronous invocation?

Lambda does not retry synchronous invocations automatically. You must implement retry logic in the calling application (e.g., using exponential backoff with jitter in your SDK client). For example, the AWS SDKs already retry throttling and transient service errors, whereas API Gateway simply returns the failure to the client. Exam tip: If a question mentions a synchronous invocation failing, the answer will not involve DLQ or Lambda destinations.
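Client-side retry with exponential backoff and full jitter can be sketched as below. The function name `call_with_backoff` and its default values are assumptions; `operation` stands for any synchronous call, such as a wrapper around an SDK `invoke`. For brevity this sketch retries on every exception, whereas real code would normally retry only throttling and transient errors.

```python
import random
import time
from typing import Any, Callable


def call_with_backoff(
    operation: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Any:
    """Call operation, retrying failures with capped exponential backoff.

    Each wait is a random fraction ("full jitter") of
    min(max_delay, base_delay * 2 ** (attempt - 1)).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters are injectable so the retry schedule can be tested without real waiting or randomness.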

What is the difference between DLQ and redrive policy?

A DLQ is a separate queue (or SNS topic) that receives messages after processing failures. The redrive policy is the configuration on the source queue that specifies the DLQ ARN and `maxReceiveCount`. Without a redrive policy, messages are not moved to a DLQ. For Lambda async invocations, the DLQ is configured on the function itself, not on a queue.

How can I avoid duplicate processing of SQS messages?

Use idempotency tokens. For example, before processing a message, check a DynamoDB table for a unique message ID. If it exists, skip processing. Also, set the SQS visibility timeout to be longer than the Lambda timeout plus any retry time. The recommended formula: visibility timeout = 6 * Lambda timeout. This prevents the message from becoming visible again while still being processed.

What is the 'maximum event age' for Lambda async invocations?

The `maximum-event-age-in-seconds` parameter controls how long an event can remain in the Lambda internal queue before being discarded or sent to a DLQ/destination. The default is 21600 seconds (6 hours), and the range is 60 seconds to 21600 seconds. If you set it to 300 seconds (5 minutes), events older than 5 minutes are not retried but are routed to the DLQ/destination.

Can I use SNS as a DLQ target?

Yes, for Lambda async invocations, you can specify an SNS topic as the DLQ target. When a failed event is sent to the DLQ, it publishes a message to the SNS topic, which can then trigger notifications (e.g., email, SMS) or invoke another Lambda function. For SQS event source mappings, the DLQ must be an SQS queue (SNS is not supported).
