SAA-C03 · topic practice

Design Resilient Architectures practice questions

Use this page to practise high availability and resilience questions. The SAA-C03 exam tests your ability to match an architecture pattern to an RTO/RPO requirement — know the cost and recovery time of each pattern.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Design Resilient Architectures

What the exam tests

What to know about Design Resilient Architectures

High availability and resilience questions test multi-AZ vs multi-Region patterns, Auto Scaling, load balancing and the right service for a given recovery time objective.

Multi-AZ vs multi-Region deployment trade-offs.

Auto Scaling policies and when to scale horizontally vs vertically.

Elastic Load Balancing: ALB, NLB, CLB and their use cases.

RTO and RPO targets matched to the correct AWS architecture.
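
As a rough way to practise matching a recovery time objective to a pattern, the sketch below picks the cheapest disaster recovery strategy whose recovery time fits a given RTO. The four strategy names follow AWS's commonly cited disaster recovery options; the cost ranks and recovery-time figures are illustrative assumptions, not AWS-published numbers, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrStrategy:
    name: str
    cost_rank: int            # 1 = cheapest; ordinal only (assumption)
    typical_rto_minutes: int  # illustrative figure, not an AWS number


# Ordered cheapest first. Recovery time shrinks as cost grows.
STRATEGIES = [
    DrStrategy("backup and restore", 1, 24 * 60),
    DrStrategy("pilot light", 2, 60),
    DrStrategy("warm standby", 3, 10),
    DrStrategy("multi-site active/active", 4, 1),
]


def cheapest_strategy(rto_minutes: int) -> DrStrategy:
    """Return the cheapest strategy whose typical RTO fits the target."""
    for strategy in STRATEGIES:
        if strategy.typical_rto_minutes <= rto_minutes:
            return strategy
    # Nothing fits: fall back to the fastest (and most expensive) option.
    return STRATEGIES[-1]


print(cheapest_strategy(240).name)  # pilot light
print(cheapest_strategy(15).name)   # warm standby
```

The ordering is the point to remember for the exam: each step down the list costs more and recovers faster.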

Watch out for

Common Design Resilient Architectures exam traps

  • Multi-AZ protects against AZ failure; multi-Region protects against Region failure.
  • Auto Scaling does not guarantee zero downtime without a load balancer.
  • ALB operates at Layer 7; NLB operates at Layer 4.
  • Pilot light is cheaper than warm standby but has longer recovery time.
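
The first trap can be made concrete with a little arithmetic. This is a minimal sketch, assuming each Availability Zone fails independently and using an illustrative 99% per-AZ availability; neither assumption comes from AWS, and `combined_availability` is a hypothetical helper.

```python
def combined_availability(per_az: float, az_count: int) -> float:
    """Availability of a workload that survives while any one AZ is up.

    Assumes independent AZ failures, which does not hold during a
    Region-wide event.
    """
    if not 0.0 <= per_az <= 1.0:
        raise ValueError("per_az must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return 1.0 - (1.0 - per_az) ** az_count


single_az = combined_availability(0.99, 1)
two_az = combined_availability(0.99, 2)
print(f"one AZ:  {single_az:.4f}")
print(f"two AZs: {two_az:.4f}")
```

The independence assumption is exactly what breaks in a Region-level failure, which is why multi-AZ alone does not cover that case.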

Practice set

Design Resilient Architectures questions

20 questions · select your answer, then reveal the explanation

An order-processing service consumes messages from an Amazon SQS Standard queue using a custom worker. During traffic spikes, the worker occasionally times out after performing some work but before acknowledging the message, so SQS redelivers it and it may be processed again.

You also observe that a small set of “poison” messages always fail validation.

What change most directly improves resilience by (1) preventing poison messages from retrying indefinitely and (2) avoiding duplicate side effects caused by legitimate retries?
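
To picture the redelivery behaviour in this scenario, here is a toy in-memory model of at-least-once delivery. It is a sketch only: it does not use the SQS API, and `deliver_all` and the message names are hypothetical.

```python
from collections import deque


def deliver_all(bodies, timeouts):
    """Toy model: a message that is not acknowledged in time is redelivered.

    `timeouts` maps a message body to the number of deliveries that time
    out after the work is done but before the acknowledgement is sent.
    Returns every side effect the worker performed, in order.
    """
    remaining = dict(timeouts)
    queue = deque(bodies)
    side_effects = []
    while queue:
        body = queue.popleft()
        side_effects.append(body)  # the work happens (for example, an order is booked)
        if remaining.get(body, 0) > 0:
            remaining[body] -= 1
            queue.append(body)     # no acknowledgement: the message becomes visible again
    return side_effects


effects = deliver_all(["order-1", "order-2", "order-3"], {"order-2": 1})
print(effects)  # ['order-1', 'order-2', 'order-3', 'order-2']
```

With a single timeout on `order-2`, the model records its side effect twice, which is the duplicate processing the question describes.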

Based on the exhibit, the application sees several minutes of connection errors during an Aurora failover. What is the best change to reduce failover impact?

Exhibit

Application configuration
  JDBC URL: jdbc:postgresql://mydb-instance-1.abcdefghijkl.us-east-1.rds.amazonaws.com:5432/app
Aurora event log
  11:15:02 Failover initiated
  11:15:04 Writer moved to a different instance
  11:18:20 Application still reporting connection refused errors
Notes from the team
  The application uses a connection pool and does not re-resolve the endpoint quickly.

A payments service receives payment orders by consuming messages from an Amazon SQS Standard queue. The downstream processor occasionally exceeds its processing timeout. As a result, some messages reappear in the queue and may be processed more than once.

The team wants to prevent duplicate side effects (for example, double-charging) and also ensure poison messages do not repeatedly consume processing capacity.

What approach best satisfies both goals?

Question 4 · medium · multiple choice

A company runs an application behind an Application Load Balancer (ALB). An Auto Scaling group (ASG) is configured with desired capacity 2, but it is attached only to subnets in a single Availability Zone. The ALB is healthy because it is configured across multiple Availability Zones.

When the Availability Zone that contains the ASG subnets experiences an outage, what change most directly improves resilience and allows capacity to be restored automatically?

Question 5 · hard · multiple choice

Based on the exhibit, DNS still sends traffic to the primary Region even though Route 53 health checks show the primary endpoint is unhealthy. What is the best change to make failover work as intended?

Exhibit

Route 53 record sets for app.example.com:
- Record 1: Type A, RoutingPolicy=Simple, AliasTarget=alb-use1.amazonaws.com
- Record 2: Type A, RoutingPolicy=Simple, AliasTarget=alb-usw2.amazonaws.com

Health check status:
hc-primary: FAILED
hc-secondary: HEALTHY

Resolver test:
$ dig +short app.example.com
alb-use1.amazonaws.com

Ops note:
The intent is to send all traffic to us-east-1 normally and fail over to us-west-2 only when the primary is unhealthy.

Based on the exhibit, the web application must remain available even if one Availability Zone fails. What is the best change to improve resilience with the least redesign?

Exhibit

Application Load Balancer
  Subnets: subnet-a1 (us-east-1a), subnet-b1 (us-east-1b)
Auto Scaling group
  VPCZoneIdentifier: subnet-a1 (us-east-1a)
  DesiredCapacity: 2
  MinSize: 2
  MaxSize: 4
CloudWatch
  HealthyHostCount: 2
  HTTPCode_Target_5XX_Count: 0
Troubleshooting note
  A planned test that disabled us-east-1a caused the application to become unreachable.

Question 7 · medium · multiple choice

An Auto Scaling group behind an Application Load Balancer frequently replaces new EC2 instances. The application needs ~6 minutes to warm up after instance launch. However, the ALB target group health checks start immediately and mark the targets unhealthy until the application is ready. Because the targets become unhealthy early, the Auto Scaling group then terminates the instances and launches replacements, creating a repeated unhealthy/termination loop.

What configuration change will most directly improve recovery by preventing premature ASG termination while the application is warming up?

Question 8 · medium · multiple choice

A company runs an internet-facing API in two AWS Regions. Route 53 currently uses simple routing to a primary Application Load Balancer (ALB) DNS name. When the primary Region experiences an outage, customers wait a long time because the DNS entry is not changed automatically.

The team wants automatic failover: if the primary Region ALB health check fails for a sustained period, Route 53 should route users to the secondary Region ALB.

Which Route 53 approach best meets this requirement?

A team accidentally updates critical rows in an Amazon RDS for PostgreSQL database. Automated backups are enabled. They need to recover the data to the exact state as of 90 minutes ago.

They also cannot risk interrupting the current production database instance while investigators validate the restored data.

Which recovery strategy best meets these constraints?

Based on the exhibit, the database must continue serving if the current Availability Zone fails. What should you change?

Exhibit

Amazon RDS for PostgreSQL
DB instance identifier: orders-db
Multi-AZ: false
Automated backups: enabled
Availability Zone: us-east-1b
Publicly accessible: no

Based on the exhibit, the application tier is not replacing unhealthy instances even though the Auto Scaling group spans two Availability Zones. What change most directly improves automatic recovery when the application process fails?

Network Topology
$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names orders-asg
{
  "AutoScalingGroups": [
    {
      "AutoScalingGroupName": "orders-asg",
      "DesiredCapacity": 4,
      "MinSize": 4,
      "MaxSize": 8,
      "AvailabilityZones": ["us-east-1a", "us-east-1b"],
      "HealthCheckType": "EC2",
      "HealthCheckGracePeriod": 300,
      "TargetGroupARNs": ["arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/orders-tg/abcd1234"]
    }
  ]
}

$ aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/orders-tg/abcd1234
TARGETS
  i-01e2a3b4: healthy
  i-02e3b4c5: healthy
  i-03f4c5d6: unhealthy
  i-04a5d6e7: unhealthy

Application health endpoint:
  2026-04-27T13:05:22Z GET /health -> 500
EC2 status checks: passing

Based on the exhibit, the team must restore an Amazon RDS for PostgreSQL database to the exact state just before a bad delete happened. What is the best recovery approach?

Exhibit

RDS backup status:
- Automated backups: Enabled
- Backup retention period: 14 days
- Latest automated snapshot: 2026-04-27 09:00 UTC
- Latest restorable time: 2026-04-27 15:14 UTC

Incident timeline:
- 2026-04-27 15:11 UTC: deployment script accidentally deleted critical rows
- 2026-04-27 15:12 UTC: application detected missing data
- Required restore point: 2026-04-27 15:10 UTC

Operations note:
The business wants to recover to a new database first, verify data, and then cut over the application.
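
The timestamps in this exhibit can be checked with a few lines of date arithmetic. This is a sketch under one stated assumption: the restorable window is taken to open roughly one retention period before the latest restorable time. `can_restore_to` is a hypothetical helper, not an RDS API call.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def can_restore_to(target, latest_restorable, retention_days):
    """True if `target` falls inside the assumed restorable window."""
    earliest = latest_restorable - timedelta(days=retention_days)
    return earliest <= target <= latest_restorable


# Values copied from the exhibit.
last_snapshot = datetime(2026, 4, 27, 9, 0, tzinfo=UTC)
latest_restorable = datetime(2026, 4, 27, 15, 14, tzinfo=UTC)
bad_delete = datetime(2026, 4, 27, 15, 11, tzinfo=UTC)
required = datetime(2026, 4, 27, 15, 10, tzinfo=UTC)

restorable = can_restore_to(required, latest_restorable, retention_days=14)
predates_delete = required < bad_delete
snapshot_gap = required - last_snapshot  # changes the 09:00 snapshot alone would miss

print(restorable, predates_delete, snapshot_gap)  # True True 6:10:00
```

The required restore point sits inside the window and one minute before the bad delete, while the latest automated snapshot is more than six hours older than that point.
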

Question 13 · medium · multiple choice

Based on the exhibit, the company wants DNS traffic to fail over automatically from the primary Region to a secondary Region when the primary endpoint is unhealthy. Which Route 53 change is best?

Exhibit

Route 53 record set
  Name: app.example.com
  Type: A (Alias)
  Routing policy: Simple
  Alias target: alb-primary-123.us-east-1.elb.amazonaws.com
  TTL: 60 seconds
Health check
  ID: hc-44
  Status: Inactive
Secondary environment
  ALB target exists in us-west-2: alb-secondary-456.us-west-2.elb.amazonaws.com
Operational note
  A Region outage should shift users to the secondary ALB without manual DNS changes.

Based on the exhibit, downstream payment timeouts cause EventBridge deliveries to back up and some events are retried until they age out. What change best improves resilience and preserves events during downstream outages?

Exhibit

Amazon EventBridge rule:
- source: orders.checkout
- target: Lambda function process-orders
- retry policy: default

CloudWatch metrics:
- Invocations: 120/min
- Throttles: 87/min
- ApproximateAgeOfOldestEvent: 900 seconds

Lambda log excerpt:
2026-04-27T18:22:41Z payment API timeout
2026-04-27T18:22:44Z retry attempt 3 failed
2026-04-27T18:22:48Z processing orderId=90118 paused

Business requirement:
No events should be lost during a temporary payment API outage, and the system must absorb bursts instead of failing immediately.

A SaaS platform plans to run in two AWS Regions for lower latency. The team wants to enable active-active writes (both Regions accept updates) to avoid failover downtime. However, the business requires strong consistency for order status transitions (for example, an order may move from “Paid” to “Shipped” only once).

Which architectural choice best meets the consistency requirement?

Based on the exhibit, the web tier becomes unavailable if us-west-2a has an outage. What is the best change to improve resilience with the least redesign?

Exhibit

Auto Scaling group: web-asg
Attached subnets: subnet-1111 (us-west-2a)
Load balancer subnets: subnet-1111 (us-west-2a)
Desired capacity: 2
Health check type: ELB

Based on the exhibit, the database is manually promoted during an Availability Zone failure and the application outage lasts longer than the target. What change best improves resilience with the least operational intervention?

Exhibit

Current topology:
app -> Amazon RDS for PostgreSQL primary db-a in us-east-1a
app -> Amazon RDS read replica db-b in us-east-1b

Incident report:
10:14 UTC - Primary AZ impaired
10:15 UTC - Application returns database connection errors
10:18 UTC - DBA manually promotes db-b
10:22 UTC - Application reconnects
Observed replication lag before failure: 40 seconds
Target:
- Automatic failover within 2 minutes
- No manual promotion during an AZ outage
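
The incident timeline can be turned into numbers to compare against the target. This is a small sketch using only the times in the exhibit. The variable names are illustrative, and reading the 40-second lag as a rough bound on lost writes assumes the read replica replicates asynchronously.

```python
from datetime import datetime, timedelta


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timestamp from the incident report (same day, UTC)."""
    return datetime.strptime(hhmm, "%H:%M")


az_impaired = at("10:14")
first_errors = at("10:15")
reconnected = at("10:22")

target_failover = timedelta(minutes=2)
replication_lag = timedelta(seconds=40)  # writes at risk if the replica is promoted

app_outage = reconnected - first_errors      # time users saw errors
time_to_recover = reconnected - az_impaired  # time from impairment to recovery
misses_target = time_to_recover > target_failover

print(app_outage, time_to_recover, misses_target)  # 0:07:00 0:08:00 True
```

Recovery took eight minutes against a two-minute target, and the manual promotion step alone accounts for several of those minutes.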

An application writes to an Amazon Aurora DB cluster. After a planned Aurora failover, the application experiences several minutes of connection errors.

The logs show the application continues connecting to the specific DB instance endpoint that was the primary before the failover.

What change most directly improves resilience during Aurora failovers?

A service processes customer payments from a message queue. Because the queue provides at-least-once delivery, the same payment message can be delivered more than once if the consumer times out before committing its state. Currently, the service sometimes charges the customer twice.

Which design change most directly prevents duplicate charges while still allowing safe retries?

Question 20 · medium · multiple choice

Your web application is deployed in two AWS Regions (Region A and Region B). You want Route 53 to automatically fail over DNS traffic from Region A to Region B when Region A is unhealthy.

The failover decision must be based on health checks that verify whether the application in Region A is reachable.

Which Route 53 routing configuration best meets these requirements?

Focused Design Resilient Architectures sessions

Start a Design Resilient Architectures only practice session

Every question in these sessions is drawn from the Design Resilient Architectures domain — nothing else.

Related practice questions

Related SAA-C03 topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the SAA-C03 exam test about Design Resilient Architectures?
High availability and resilience questions test multi-AZ vs multi-Region patterns, Auto Scaling, load balancing and the right service for a given recovery time objective.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Design Resilient Architectures questions in a focused session?
Yes — the session launcher on this page draws every question from the Design Resilient Architectures domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other SAA-C03 topics?
Use the topic links above to move to related areas, or go back to the SAA-C03 question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the SAA-C03 exam covers. They are not copied from any real exam or dump site.