DVA-C02 | Chapter 94 of 101 | Objective 1.6

Saga Pattern for Distributed Transactions

This chapter covers the Saga pattern for managing distributed transactions in microservices architectures, a critical concept for the DVA-C02 exam under Domain 1 (Development) Objective 1.6 (Implement application design patterns). Expect 1–2 questions directly testing saga types (choreography vs. orchestration), compensating transactions, and integration with AWS services like Step Functions. Understanding sagas is essential for designing fault-tolerant, eventually consistent systems on AWS.

25 min read
Intermediate
Updated May 31, 2026

The Travel Agent's Distributed Itinerary

Imagine you are a travel agent booking a complex vacation for a client: a flight, a hotel, and a car rental. Each booking lives in a separate, independent reservation system (a microservice). You cannot book everything atomically, because each system runs its own transaction. If you book the flight, then the hotel, and then the car rental fails, you cannot simply undo the flight and hotel without calling each system again. This is the problem the saga pattern addresses. You design a compensating action for every step: each step has a forward action (book) and a compensating action (cancel). If the car rental fails, you cancel the hotel and the flight (compensating transactions). The saga orchestrator (you) tracks each step and, on failure, executes the compensating actions in reverse order. In AWS, this is often implemented using AWS Step Functions for orchestration and AWS Lambda for the forward and compensating logic, with state stored in Amazon DynamoDB. The saga ensures eventual consistency without requiring distributed locks or two-phase commits.

How It Actually Works

What is the Saga Pattern?

The Saga pattern is a design pattern for managing data consistency across microservices in distributed transactions. Unlike traditional ACID transactions (Atomic, Consistent, Isolated, Durable) that rely on a single database with locks, sagas break a distributed transaction into a sequence of local transactions, each with a compensating action to undo its effects if a later step fails. Sagas ensure eventual consistency without blocking resources.
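The definition above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an AWS implementation: `run_saga` and the step names are hypothetical, and each "local transaction" is just a function call.

```python
from typing import Callable, List, Tuple

# (name, forward action, compensating action); all names are illustrative
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step]) -> List[str]:
    """Run forward actions in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, forward, _ = step
        try:
            forward()
            log.append(f"done:{name}")
            completed.append(step)
        except Exception:
            log.append(f"failed:{name}")
            # Undo only what already succeeded, newest first
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            break
    return log

def ok() -> None:
    pass

def fail() -> None:
    raise RuntimeError("car rental unavailable")

saga_log = run_saga([("flight", ok, ok), ("hotel", ok, ok), ("car", fail, ok)])
print(saga_log)
```

The failed step itself is not compensated here because its local transaction never committed; only the flight and hotel are undone, in reverse order.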

Why Not Two-Phase Commit (2PC)?

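In distributed systems, 2PC requires a coordinator that locks resources across all participants until the transaction commits or aborts. This leads to:

- Blocking: Participants hold locks from the prepare phase until the coordinator's decision arrives, reducing concurrency.
- Single point of failure: If the coordinator fails mid-transaction, participants can be left holding locks; the coordinator can also become a throughput bottleneck.
- No support in many AWS data stores: Services such as DynamoDB and S3 do not expose a 2PC interface that an external coordinator could drive.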

Sagas avoid these issues by releasing resources after each local transaction, using compensating actions only when needed.

Types of Sagas

There are two main implementation types:

- Choreography-based saga: Each service publishes events after its local transaction. Other services listen and react. There is no central coordinator. Example: After an order service creates an order, it emits an "OrderCreated" event. The payment service listens, processes payment, and emits "PaymentProcessed". If payment fails, it emits "PaymentFailed", and the order service listens and cancels the order.
  - Pros: Loose coupling, simple for few services.
  - Cons: Complex to trace when many services are involved; risk of cyclic dependencies.

- Orchestration-based saga: A central orchestrator (e.g., AWS Step Functions) tells each service what to do. The orchestrator tracks state and executes compensating transactions on failure.
  - Pros: Easier to manage, test, and monitor.
  - Cons: More coupling to the orchestrator.
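To make the choreography example concrete, here is a small sketch that replays the "OrderCreated" / "PaymentFailed" exchange. The `EventBus` class is a hypothetical in-memory stand-in for SNS or EventBridge that delivers events synchronously, and the payment service is hard-coded to decline so that the failure path runs.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, str]], None]

class EventBus:
    """In-memory stand-in for an event broker (assumption, not an AWS API)."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, str]) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
orders: Dict[str, str] = {}

def payment_service(event: Dict[str, str]) -> None:
    # Declines every payment in this sketch to exercise the failure path
    bus.publish("PaymentFailed", event)

def order_service_on_payment_failed(event: Dict[str, str]) -> None:
    # Compensating action: cancel the order created earlier
    orders[event["order_id"]] = "CANCELLED"

bus.subscribe("OrderCreated", payment_service)
bus.subscribe("PaymentFailed", order_service_on_payment_failed)

orders["o-1"] = "PENDING"  # the order service's local transaction
bus.publish("OrderCreated", {"order_id": "o-1"})
print(bus.history, orders)
```

No component holds the whole workflow: the sequence only emerges from the subscriptions, which is why tracing gets harder as services are added.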

Compensating Transactions

A compensating transaction is an operation that semantically undoes a previous transaction. It is not a rollback; it applies a new transaction to revert the state. For example:

Forward: Debit $100 from account A.

Compensating: Credit $100 to account A.

Compensating transactions must be idempotent to handle retries. They may also be partial if the forward transaction cannot be fully undone (e.g., sending an email cannot be unsent; instead, send a follow-up email).
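The debit/credit pair can be sketched with an idempotency guard. This is an illustrative in-memory model; the `Account` class and the operation ID format are assumptions, not part of any AWS API.

```python
from typing import Set

class Account:
    """Toy ledger that applies each operation ID at most once."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.applied: Set[str] = set()  # operation IDs already applied

    def apply(self, op_id: str, delta: int) -> bool:
        """Apply delta once per op_id; return False when the call is a duplicate."""
        if op_id in self.applied:
            return False
        self.balance += delta
        self.applied.add(op_id)
        return True

account = Account(balance=500)
account.apply("tx-1:debit", -100)        # forward: debit $100
account.apply("tx-1:compensate", 100)    # compensating: credit $100
retried = account.apply("tx-1:compensate", 100)  # a retry of the same compensation
print(account.balance, retried)
```

The compensation is a new ledger entry rather than an erased one, and the retry is ignored, so the balance returns to its starting value exactly once.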

AWS Implementation with Step Functions

AWS Step Functions is the primary service for implementing orchestration-based sagas. Key components:

- State machine: Defines the workflow (Standard or Express).
- Task states: Invoke Lambda functions, call APIs, or run ECS tasks.
- Choice states: Branch based on output.
- Catch/Retry: Handle errors with Catch routes and retry policies (MaxAttempts, IntervalSeconds, BackoffRate).
- ResultPath: Controls where a state's output is placed in the execution's JSON state.

Example state machine for an order saga:

{
  "Comment": "Order saga",
  "StartAt": "CreateOrder",
  "States": {
    "CreateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:createOrder",
      "Next": "ProcessPayment",
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "CancelOrder"
      }]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:processPayment",
      "Next": "UpdateInventory",
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "RefundPayment"
      }]
    },
    "UpdateInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:updateInventory",
      "End": true,
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "RefundPayment"
      }]
    },
    "RefundPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:refundPayment",
      "Next": "CancelOrder"
    },
    "CancelOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:cancelOrder",
      "Next": "OrderFailed"
    },
    "OrderFailed": {
      "Type": "Fail",
      "Error": "OrderSagaFailed",
      "Cause": "A step failed and the saga was compensated"
    }
  }
}

Key Considerations

Idempotency: Compensating actions must be idempotent. For example, a refund Lambda should check if already refunded.

Timeouts and limits: Standard workflows can run for up to 1 year, and their execution history is retained for up to 90 days after the execution closes. Express workflows run for at most 5 minutes.

Error handling: Use Retry with exponential backoff (e.g., IntervalSeconds: 2, BackoffRate: 2, MaxAttempts: 5).

State persistence: Store saga state in DynamoDB for idempotency and recovery.
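One way to record saga state so that duplicate writes are rejected is a conditional put. The sketch below only builds the request parameters and does not call AWS; the table name `SagaState` and the attribute names `sagaId`, `step`, and `stepStatus` are hypothetical, and a composite key of `sagaId` plus `step` is assumed.

```python
from typing import Any, Dict

def build_step_record(saga_id: str, step: str, status: str) -> Dict[str, Any]:
    """Build put_item parameters that succeed only for the first write of a step."""
    return {
        "TableName": "SagaState",  # hypothetical table name
        "Item": {
            "sagaId": {"S": saga_id},    # assumed partition key
            "step": {"S": step},         # assumed sort key
            "stepStatus": {"S": status},
        },
        # Fails if an item with this exact key already exists
        "ConditionExpression": "attribute_not_exists(sagaId)",
    }

params = build_step_record("order-123", "ProcessPayment", "SUCCESS")
print(params["ConditionExpression"])
```

With boto3 installed, these parameters could be passed as `boto3.client("dynamodb").put_item(**params)`; a repeated write for the same step would then be rejected with a conditional check failure, which the caller can treat as "already recorded".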

Interaction with Other AWS Services

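Amazon DynamoDB: Often used to store saga state, with each step recording its status. DynamoDB can also back a transactional outbox, a related pattern in which a service writes its state change and the outgoing event in one local transaction.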

Amazon SQS: For choreography-based sagas, services communicate via queues (e.g., order queue, payment queue).

Amazon EventBridge: For event-driven choreography.

AWS Lambda: Executes both forward and compensating actions.

When to Use Saga

Long-lived transactions (e.g., booking a trip across multiple systems).

Systems using NoSQL databases that lack ACID across partitions.

High-availability requirements where blocking is unacceptable.

When Not to Use Saga

When strong consistency is required (e.g., financial ledgers must balance immediately).

When the business logic is simple and a single database transaction suffices.

Exam Focus

DVA-C02 tests the ability to choose between choreography and orchestration, design compensating transactions, and implement sagas using Step Functions. You must know that Step Functions is the recommended AWS service for orchestration. Also, understand that sagas provide eventual consistency, not strong consistency.

Walk-Through

1. Start Saga Execution

The saga begins when a client request triggers a distributed transaction. For an orchestration saga, an AWS Step Functions state machine is started via the StartExecution API. The input contains the order details. The Step Functions execution context is stored in the execution history, which is durable and fault-tolerant. The state machine's first state is invoked, typically a Lambda function. The Lambda processes the local transaction (e.g., create an order in DynamoDB with status 'PENDING') and returns a result. If the Lambda succeeds, Step Functions transitions to the next state. If it fails (uncaught exception), Step Functions catches the error and routes to a compensating state as defined in the Catch field. The execution ID uniquely identifies the saga instance.
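Starting the saga might look like the sketch below, which only assembles the arguments for the StartExecution call. The ARN, the `order-` name prefix, and the order fields are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict

def build_start_execution_args(state_machine_arn: str, order: Dict[str, Any]) -> Dict[str, str]:
    """Assemble StartExecution arguments with a deterministic execution name."""
    return {
        "stateMachineArn": state_machine_arn,
        # Deriving the name from the order ID ties one saga instance to one order
        "name": f"order-{order['orderId']}",
        "input": json.dumps(order, sort_keys=True),
    }

args = build_start_execution_args(
    "arn:aws:states:us-east-1:123456789012:stateMachine:OrderSaga",  # hypothetical ARN
    {"orderId": "123", "amount": 100},
)
print(args["name"])
```

With boto3 available, the call would be `boto3.client("stepfunctions").start_execution(**args)`. A deterministic name is a design choice that helps detect duplicate starts for the same order; check the StartExecution documentation for the exact behavior on each workflow type.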

2. Execute Forward Transaction

Each step executes a local transaction within a single microservice. For example, the 'ProcessPayment' Lambda calls the payment service API, which charges the customer's credit card. The Lambda must be idempotent: if the payment was already processed (detected via a unique idempotency key in the request), it returns the previous result without charging again. The Lambda updates the saga state in DynamoDB (e.g., set paymentStatus to 'SUCCESS'). If the payment service returns a failure (e.g., insufficient funds), the Lambda throws an error. The state's Catch block catches the error and transitions to the compensating path. In the example state machine above, a 'ProcessPayment' failure routes to 'RefundPayment' and then 'CancelOrder', so the refund must be a safe no-op when no charge was captured; a leaner design routes a declined payment directly to 'CancelOrder'. ResultPath stores the Lambda output in the execution JSON for use in subsequent steps.

3. Handle Failure and Compensate

When a forward transaction fails, the saga must undo all previously completed steps. In Step Functions, this is achieved by chaining compensating states in reverse order. For instance, if 'UpdateInventory' fails after 'ProcessPayment' succeeded, the state machine goes to 'RefundPayment' (compensating for payment) and then 'CancelOrder' (compensating for order creation). Each compensating Lambda must be idempotent and handle partial failures, such as a refund that fails because the payment gateway is down. Step Functions does not retry a compensating state unless that state defines its own Retry field; once retries are exhausted, the state machine can transition to a 'ManualIntervention' state or send the failed request to a queue that serves as a dead-letter queue (DLQ) for operators. The saga state in DynamoDB is updated to 'FAILED' along with the step where the failure occurred.
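The compensation logic described above, including escalation when a compensating action keeps failing, can be sketched as follows. `compensate_all` and the two sample actions are illustrative names; in Step Functions the retry loop would normally be expressed with a Retry field rather than in code.

```python
from typing import Callable, Dict, List, Tuple

def compensate_all(
    compensations: List[Tuple[str, Callable[[], None]]],
    max_attempts: int = 3,
) -> List[str]:
    """Run compensations newest-first; return names that need manual intervention."""
    needs_manual: List[str] = []
    for name, action in reversed(compensations):
        for attempt in range(1, max_attempts + 1):
            try:
                action()
                break
            except Exception:
                if attempt == max_attempts:
                    needs_manual.append(name)  # escalate instead of looping forever
    return needs_manual

calls: Dict[str, int] = {"refund": 0}

def refund_payment() -> None:
    calls["refund"] += 1
    if calls["refund"] < 2:
        raise ConnectionError("payment gateway down")  # transient: succeeds on retry

def cancel_order() -> None:
    raise RuntimeError("order service unavailable")  # persistent failure

# Listed in the order the forward steps completed
manual = compensate_all([("CancelOrder", cancel_order), ("RefundPayment", refund_payment)])
print(manual, calls)
```

The refund recovers on its second attempt, while the order cancellation exhausts its attempts and is reported for manual follow-up rather than silently dropped.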

4. Complete Saga Successfully

If all forward transactions succeed, the saga completes successfully. The final state in Step Functions marks the execution as 'SUCCEEDED'. The saga state in DynamoDB is updated to 'COMPLETED'. The client can be notified via a callback (e.g., SNS notification) or by polling the saga state. The execution history is retained for auditing. In a choreography saga there is no single execution status: the saga is complete when the last service in the chain emits its success event, so tracking completion means correlating events across services. This is one reason orchestration is often preferred for clarity. The saga ensures eventual consistency: at any point during execution, the system may be inconsistent, but after completion, all services are consistent.

5. Handle Timeouts and Retries

Step Functions allows configuring retry policies for each state. For example, a Task state whose Lambda calls an external API may have a retry policy with MaxAttempts: 3, IntervalSeconds: 2, BackoffRate: 2. If all retries fail, the state's Catch routes the error to compensation. Additionally, the state machine can set a top-level TimeoutSeconds, bounded by the workflow type's maximum duration (5 minutes for Express, 1 year for Standard). An execution-level timeout ends the execution outright and cannot be handled by a Catch inside the state machine, so compensation for that case has to be triggered from outside the execution (for example, by reacting to the execution's timed-out status). Setting TimeoutSeconds on individual Task states keeps timeouts catchable as States.Timeout. The saga must also handle partial failures: if a compensating action fails, the saga may need to escalate (e.g., send to SQS for manual processing). It is critical to design compensating actions that are robust and do not themselves fail frequently.
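The retry arithmetic is easy to check by hand. The sketch below computes the wait before each retry for the policy mentioned above; `retry_delays` is a hypothetical helper, and it deliberately ignores optional settings such as a maximum delay or jitter.

```python
from typing import List

def retry_delays(interval_seconds: float, backoff_rate: float, max_attempts: int) -> List[float]:
    """Wait before retry n (0-based) is interval_seconds * backoff_rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]

retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 2,
    "BackoffRate": 2.0,
    "MaxAttempts": 3,
}

delays = retry_delays(
    retry_policy["IntervalSeconds"],
    retry_policy["BackoffRate"],
    retry_policy["MaxAttempts"],
)
print(delays, sum(delays))  # [2.0, 4.0, 8.0] 14.0
```

So this policy adds at most 14 seconds of waiting, on top of the task's own run time for each attempt, before the error is handed to Catch. That total matters when sizing TimeoutSeconds, especially under the 5-minute Express limit.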

What This Looks Like on the Job

Enterprise Scenario 1: E-Commerce Order Fulfillment

A large e-commerce platform uses a saga pattern to process orders across microservices: Order Service, Payment Service, Inventory Service, and Shipping Service. When a customer places an order, the saga orchestrator (Step Functions) first creates the order (status: PENDING), then processes payment (debit card), then reserves inventory, and finally initiates shipping. If inventory is insufficient, the saga compensates by refunding the payment and cancelling the order. The system handles 10,000 orders per day. Key considerations: The payment refund must be idempotent to avoid double refunds if the refund Lambda is retried. The inventory reservation has a TTL of 30 minutes; if the saga fails after reservation, the inventory is released by the compensating action. The saga state is stored in DynamoDB with a TTL of 7 days for cleanup. Misconfiguration: If the compensating actions are not idempotent, a retry can cause duplicate refunds or multiple cancellations, leading to customer complaints.

Enterprise Scenario 2: Travel Booking System

A travel agency aggregates flights, hotels, and car rentals from third-party APIs. The saga orchestrator books each component sequentially. If the car rental fails after the flight and hotel are booked, the saga cancels the hotel and flight (each with cancellation fees). The compensating actions must account for partial refunds and cancellation policies. The system uses Step Functions with Express workflows for low latency (under 5 minutes). Each API call has a timeout of 10 seconds; if exceeded, the saga retries once and then compensates. The saga state is logged to CloudWatch for auditing. Common issue: The third-party APIs are not idempotent; a retry may book twice. Solution: Use a unique request ID for each booking attempt and check for duplicates on the third-party side.
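One way to produce a unique request ID that stays stable across retries is to derive it from the saga instance and the step name. This is a sketch under the assumption that the third-party API accepts a caller-supplied request ID; the `saga/...` naming scheme is made up for illustration.

```python
import uuid

def booking_request_id(saga_id: str, step: str) -> str:
    """Derive a deterministic ID so that a retry of the same step reuses it."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"saga/{saga_id}/{step}"))

first_try = booking_request_id("trip-42", "BookCar")
retry = booking_request_id("trip-42", "BookCar")
other_step = booking_request_id("trip-42", "BookHotel")
print(first_try == retry, first_try == other_step)
```

A random ID generated inside the Lambda would change on every retry and defeat duplicate detection, which is why the ID here depends only on values that stay fixed for the step.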

Performance and Scale

At scale, sagas must handle concurrent executions. Step Functions Standard workflows support millions of executions per month. DynamoDB must be provisioned with sufficient read/write capacity to handle saga state updates. For high throughput, use DynamoDB auto-scaling. Compensating actions should be lightweight; if they involve external API calls, consider using SQS to queue them for async processing. The saga pattern is not suitable for real-time systems requiring strong consistency; eventual consistency may lead to temporary inconsistencies (e.g., an order appears as 'confirmed' but later is cancelled).

How DVA-C02 Actually Tests This

What DVA-C02 Tests on Sagas (Objective 1.6)

The exam focuses on:

Identifying when to use saga vs. 2PC.

Choosing between choreography and orchestration based on requirements (e.g., loose coupling vs. central control).

Implementing compensating transactions.

Using AWS Step Functions for orchestration.

Understanding eventual consistency vs. strong consistency.

Common Wrong Answers

1. Choosing 2PC for NoSQL databases: Many candidates think 2PC is the standard for distributed transactions, but AWS NoSQL services (DynamoDB, S3) do not support 2PC. The correct answer is saga.

2. Assuming sagas provide strong consistency: Sagas provide eventual consistency. If the question requires immediate consistency, saga is wrong.

3. Using SQS for orchestration: Candidates often pick SQS for decoupling, but SQS alone does not provide orchestration logic; Step Functions is the AWS-managed orchestrator.

4. Forgetting compensating transactions: Some answers propose simply logging the failure and not undoing previous steps. The correct answer must include compensating actions.

Specific Numbers and Terms

Step Functions execution history retention: up to 90 days.

Standard workflow max duration: 1 year.

Express workflow max duration: 5 minutes.

Retry parameters: MaxAttempts (default 3), IntervalSeconds (default 1), BackoffRate (default 2.0).

Catch: can catch specific errors (e.g., Lambda.ServiceException) or States.ALL.
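The difference between catching a specific error and catching States.ALL comes down to catcher order. The sketch below is a simplified model of how the first matching catcher is chosen; it is not the Step Functions implementation, it ignores errors that States.ALL cannot catch, and the error name `PaymentDeclined` is hypothetical.

```python
from typing import Dict, List, Optional

def select_catcher(catchers: List[Dict], error_name: str) -> Optional[str]:
    """Return the Next state of the first catcher whose ErrorEquals matches."""
    for catcher in catchers:
        names = catcher["ErrorEquals"]
        if error_name in names or "States.ALL" in names:
            return catcher["Next"]
    return None  # no catcher matched: the execution would fail

catchers = [
    {"ErrorEquals": ["PaymentDeclined"], "Next": "CancelOrder"},
    {"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"},
]

declined_route = select_catcher(catchers, "PaymentDeclined")
other_route = select_catcher(catchers, "Lambda.ServiceException")
print(declined_route, other_route)
```

Placing the specific catcher first lets a declined payment skip the refund, while every other error falls through to the catch-all.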

Edge Cases

Partial compensation: If a compensating action fails, the saga must handle it (e.g., send to DLQ).

Idempotency: Every forward and compensating action must be idempotent.

Timeouts: If a forward transaction times out, the saga should treat it as a failure and compensate.

Concurrent sagas: Multiple sagas may update the same resource; use optimistic locking in DynamoDB.
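Optimistic locking in DynamoDB is usually expressed as a conditional update on a version attribute. The sketch below only builds the update parameters; the `Inventory` table, the `sku` key, and the `quantity` and `version` attributes are hypothetical names.

```python
from typing import Any, Dict

def build_versioned_update(sku: str, new_qty: int, expected_version: int) -> Dict[str, Any]:
    """Build update_item parameters that apply only if the version is unchanged."""
    return {
        "TableName": "Inventory",  # hypothetical table name
        "Key": {"sku": {"S": sku}},
        "UpdateExpression": "SET #q = :q, #v = :next",
        # Rejects the write if another saga bumped the version first
        "ConditionExpression": "#v = :expected",
        "ExpressionAttributeNames": {"#q": "quantity", "#v": "version"},
        "ExpressionAttributeValues": {
            ":q": {"N": str(new_qty)},
            ":expected": {"N": str(expected_version)},
            ":next": {"N": str(expected_version + 1)},
        },
    }

update = build_versioned_update("sku-1", new_qty=9, expected_version=4)
print(update["ExpressionAttributeValues"][":next"])
```

If two sagas read version 4 and both try to write, only the first update succeeds; the second fails its condition and must re-read the item before retrying or compensating.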

How to Eliminate Wrong Answers

If the answer mentions locks or blocking, it is likely 2PC, not saga.

If the answer uses SNS/SQS without a coordinator, it is choreography.

If the answer uses Step Functions, it is orchestration.

If the answer does not mention compensating actions, it is incomplete.

Always look for the option that includes Step Functions and compensating transactions for orchestration, or events for choreography.

Key Takeaways

Saga pattern manages distributed transactions without locks, providing eventual consistency.

Two types: choreography (event-driven) and orchestration (central coordinator).

AWS Step Functions is the recommended service for orchestration-based sagas.

Every forward transaction must have a compensating transaction that is idempotent.

Compensating transactions are not rollbacks; they are new operations to undo effects.

Step Functions Standard workflows can run up to 1 year; Express up to 5 minutes.

Use DynamoDB to store saga state for idempotency and recovery.

Saga is appropriate for long-lived transactions and NoSQL databases; not for strong consistency needs.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Choreography-based Saga

No central coordinator; services communicate via events.

Loose coupling; each service only knows about its own events.

Harder to trace and debug; complex workflows become tangled.

Best for simple workflows with few services.

Example: Order service emits 'OrderCreated'; Payment service listens and emits 'PaymentProcessed'.

Orchestration-based Saga

Central coordinator (e.g., Step Functions) controls the flow.

Tighter coupling; services depend on the orchestrator.

Easier to monitor, test, and manage; clear workflow definition.

Best for complex workflows with many services and error handling.

Example: Step Functions calls order Lambda, then payment Lambda, then inventory Lambda.

Watch Out for These

Mistake

Saga provides strong consistency like ACID transactions.

Correct

Saga provides eventual consistency. At any point during execution, the system may be inconsistent until all steps complete or compensation finishes.

Mistake

Choreography-based sagas are always better than orchestration because they are more decoupled.

Correct

Choreography is simpler for few services but becomes hard to manage and debug as the number of services grows. Orchestration (Step Functions) is easier to monitor and maintain for complex workflows.

Mistake

Compensating transactions are the same as rolling back a database transaction.

Correct

Compensating transactions are new transactions that semantically undo the effect, not a physical rollback. They must be idempotent and may not fully restore the original state (e.g., non-refundable fees).

Mistake

Saga pattern requires a central database for coordination.

Correct

Saga does not require a central database; state can be stored in event logs or Step Functions execution history. However, DynamoDB is commonly used for idempotency and state tracking.

Mistake

AWS Step Functions is the only way to implement sagas on AWS.

Correct

You can also implement choreography using SQS, SNS, or EventBridge. Step Functions is the recommended orchestration service, but not mandatory.

Frequently Asked Questions

What is the difference between saga and two-phase commit (2PC)?

2PC provides strong consistency by locking resources across participants until the coordinator decides to commit or abort. This blocking reduces concurrency and is not supported by many NoSQL databases. Saga, on the other hand, uses a sequence of local transactions with compensating actions, releasing resources after each step. Saga provides eventual consistency and is suitable for microservices and NoSQL environments. In AWS, 2PC is not commonly used because services like DynamoDB and S3 do not support it; saga is the recommended pattern.

How do I implement a saga on AWS?

Use AWS Step Functions to create an orchestration-based saga. Define a state machine with Task states for each forward transaction (e.g., Lambda functions) and Catch blocks to route to compensating states on failure. Store saga state in DynamoDB for idempotency. For choreography, use Amazon SNS or EventBridge to emit events, and have each service subscribe and react. Step Functions is the easier and more manageable approach for the exam.

What is a compensating transaction?

A compensating transaction is an operation that semantically undoes a previous transaction. For example, if a forward transaction debits $100, the compensating transaction credits $100. It must be idempotent to handle retries. Compensating transactions are not database rollbacks; they are new transactions that may have side effects (e.g., sending a cancellation email). If the forward transaction cannot be fully undone (e.g., sending an email), the compensating action might be to send a follow-up email.

Can I use Amazon RDS with saga?

Yes, but RDS supports ACID transactions within a single database. If your saga spans multiple RDS instances or includes other services, you still need a saga. You can use Step Functions to call Lambda functions that execute SQL transactions. Each local transaction can use RDS transactions, but the overall saga is not ACID across databases.

What is the role of idempotency in sagas?

Idempotency ensures that executing the same operation multiple times produces the same result. In sagas, both forward and compensating actions must be idempotent because Step Functions may retry on failure. For example, a payment Lambda should check if the payment was already processed using an idempotency key (e.g., order ID) and skip processing if already done. Without idempotency, retries could cause duplicate charges or multiple refunds.

How does saga handle partial failures?

If a forward transaction fails, the saga executes compensating transactions for all previously successful steps. If a compensating transaction fails, the saga may retry it (if configured) or escalate to a manual process (e.g., send to a Dead Letter Queue). The saga state is updated to reflect the failure, and an operator can intervene. Step Functions allows defining retry policies on compensating states as well.

Is saga suitable for real-time systems?

Saga provides eventual consistency, so there is a window where the system is inconsistent. For real-time systems that require strong consistency (e.g., financial trading), saga is not suitable. Use a single database transaction or 2PC if the system can tolerate blocking. Saga is ideal for long-lived transactions where temporary inconsistency is acceptable.
