SAA-C03 domain

Design Resilient Architectures

Use this page to practise SAA-C03 Design Resilient Architectures practice questions. The goal is not to memorise dumps, but to understand the concept, review the explanation and improve your exam readiness.

250 questions

Focused practice

Start a Design Resilient Architectures session

All sessions draw only from this domain. Pick a length or try interactive practice with inline explanations.

Start 20-question practice session →

What the exam tests

What to know about Design Resilient Architectures

Design Resilient Architectures questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Question index

All Design Resilient Architectures questions (250)

Click any question to see the full explanation, or start a practice session above.

1

An order-processing service consumes messages from an Amazon SQS Standard queue using a custom worker. During traffic spikes, the worker occasionally times out after performing some work but before acknowledging the message, so SQS redelivers it and it may be processed again. You also observe that a small set of “poison” messages always fail validation. What change most directly improves resilience by (1) preventing poison messages from retrying indefinitely and (2) avoiding duplicate side effects caused by legitimate retries?

2

Based on the exhibit, the application sees several minutes of connection errors during an Aurora failover. What is the best change to reduce failover impact?

3

A payments service receives payment orders by consuming messages from an Amazon SQS Standard queue. The downstream processor occasionally exceeds its processing timeout. As a result, some messages reappear in the queue and may be processed more than once. The team wants to prevent duplicate side effects (for example, double-charging) and also ensure poison messages do not repeatedly consume processing capacity. What approach best satisfies both goals?

4

A company runs an application behind an Application Load Balancer (ALB). An Auto Scaling group (ASG) is configured with desired capacity 2, but it is attached only to subnets in a single Availability Zone. The ALB is healthy because it is configured across multiple Availability Zones. When the Availability Zone that contains the ASG subnets experiences an outage, what change most directly improves resilience and allows capacity to be restored automatically?

5

Based on the exhibit, DNS still sends traffic to the primary Region even though Route 53 health checks show the primary endpoint is unhealthy. What is the best change to make failover work as intended?

6

Based on the exhibit, the web application must remain available even if one Availability Zone fails. What is the best change to improve resilience with the least redesign?

7

An Auto Scaling group behind an Application Load Balancer frequently replaces new EC2 instances. The application needs ~6 minutes to warm up after instance launch. However, the ALB target group health checks start immediately and mark the targets unhealthy until the application is ready. Because the targets become unhealthy early, the Auto Scaling group then terminates the instances and launches replacements, creating a repeated unhealthy/termination loop. What configuration change will most directly improve recovery by preventing premature ASG termination while the application is warming up?

8

A company runs an internet-facing API in two AWS Regions. Route 53 currently uses simple routing to a primary Application Load Balancer (ALB) DNS name. When the primary Region experiences an outage, customers wait a long time because the DNS entry is not changed automatically. The team wants automatic failover: if the primary Region ALB health check fails for a sustained period, Route 53 should route users to the secondary Region ALB. Which Route 53 approach best meets this requirement?

9

A team accidentally updates critical rows in an Amazon RDS for PostgreSQL database. Automated backups are enabled. They need to recover the data to the exact state as of 90 minutes ago. They also cannot risk interrupting the current production database instance while investigators validate the restored data. Which recovery strategy best meets these constraints?

10

Based on the exhibit, the database must continue serving if the current Availability Zone fails. What should you change?

11

Based on the exhibit, the application tier is not replacing unhealthy instances even though the Auto Scaling group spans two Availability Zones. What change most directly improves automatic recovery when the application process fails?

12

Based on the exhibit, the team must restore an Amazon RDS for PostgreSQL database to the exact state just before a bad delete happened. What is the best recovery approach?

13

Based on the exhibit, the company wants DNS traffic to fail over automatically from the primary Region to a secondary Region when the primary endpoint is unhealthy. Which Route 53 change is best?

14

Based on the exhibit, downstream payment timeouts cause EventBridge deliveries to back up and some events are retried until they age out. What change best improves resilience and preserves events during downstream outages?

15

A SaaS platform plans to run in two AWS Regions for lower latency. The team wants to enable active-active writes (both regions accept updates) to avoid failover downtime. However, the business requires strong consistency for order status transitions (for example, only one transition from “Paid” to “Shipped” must be allowed). Which statement is the best architectural choice to meet the consistency requirement?

16

Based on the exhibit, the web tier becomes unavailable if us-west-2a has an outage. What is the best change to improve resilience with the least redesign?

17

Based on the exhibit, the database is manually promoted during an Availability Zone failure and the application outage lasts longer than the target. What change best improves resilience with the least operational intervention?

18

An application writes to an Amazon Aurora DB cluster. After a planned Aurora failover, the application experiences several minutes of connection errors. The logs show the application continues connecting to the specific DB instance endpoint that was the primary before the failover. What change most directly improves resilience during Aurora failovers?

19

A service processes customer payments from a message queue. Because the queue provides at-least-once delivery, the same payment message can be delivered more than once if the consumer times out before committing its state. Currently, the service sometimes charges the customer twice. Which design change most directly prevents duplicate charges while still allowing safe retries?

20

Your web application is deployed in two AWS Regions (Region A and Region B). You want Route 53 to automatically fail over DNS traffic from Region A to Region B when Region A is unhealthy. The failover decision must be based on health checks that verify whether the application in Region A is reachable. Which Route 53 routing configuration best meets these requirements?

21

Based on the exhibit, the payment worker sometimes processes the same SQS Standard message more than once after a timeout. What change best prevents duplicate charges while keeping the queue architecture?

22

Based on the exhibit, duplicate payment charges occasionally occur when the worker times out after the charge is submitted but before the message is deleted. What change best prevents duplicate charges while keeping retry behavior?

23

A production team accidentally deletes critical rows in an Amazon RDS for PostgreSQL database. The deletion occurred about 6 hours ago. The team wants to recover to a specific point in time with minimal disruption. Assuming automated backups are enabled, which approach provides the best resilience outcome?

24

A web application uses pooled JDBC connections to an Amazon Aurora cluster using the writer endpoint. During an Aurora planned failover, monitoring shows a short spike in failed requests. The Aurora cluster writer endpoint remains the same, but many existing pooled connections briefly fail. The application retries aggressively and overloads the new writer during the transition. Which design change will most improve application resilience during Aurora failovers without requiring application redeployment?

25

Based on the exhibit, an administrator accidentally deleted data from Amazon RDS for PostgreSQL about 90 minutes ago. Which recovery approach best restores the database to the exact required point in time?

26

Based on the exhibit, the current disaster recovery design misses the RTO target even though the database replica is current. Which deployment model best meets the requirements with the least always-on cost?

27

A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?

28

Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?

29

Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.

30

A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?

31

A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.

32

Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?

33

A company uses Amazon RDS for a PostgreSQL database powering a customer-facing application. The application’s availability depends on fast database failover with minimal manual intervention. The RDS instance currently runs as a single-AZ deployment in one DB subnet group. Which change most directly meets the goal?

34

A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?

35

A caching layer uses Amazon ElastiCache for Redis in front of a stateless web service. The service must continue to read cached responses during maintenance events and should automatically fail over to another node if one AZ becomes impaired. Which design change best satisfies this requirement?

36

A company runs a stateful analytics workload on EC2 instances that use EBS volumes. The data must be restorable in another Region after a major outage, with frequent point-in-time recovery. Which approach provides the most suitable replication mechanism for the EBS-backed data?

37

An order processing workflow uses Amazon SQS as the decoupling layer between a producer and a consumer Lambda function. The consumer intermittently fails due to a downstream dependency. The team has observed that certain “poison” messages keep being retried repeatedly and prevent other messages from being processed efficiently. Which SQS configuration most directly addresses this issue?

38

A media company stores original uploads in an S3 bucket. They must recover from accidental overwrites/deletes and also recover quickly from a full Region outage. The required RPO is about 1 hour. Which configuration best meets these requirements?

39

An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity uses an Auto Scaling group (ASG) with min size set to 1, and the ASG also spans only one subnet in practice. What is the most effective change to meet the requirement that the service continues during a single-AZ instance loss?

40

Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?

41

A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.

42

An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?

43

A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.

44

Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?

45

A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?

46

A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.

47

A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.

48

An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?

49

A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?

50

A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?

51

A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.

52

Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?

53

A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?

54

A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?

55

A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?

56

A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?

57

A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?

58

A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?

59

A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.

60

A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?

61

A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?

62

A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.

63

An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?

64

A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.

65

A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?

66

Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?

67

A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?

68

An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?

69

A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?

70

A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.

71

A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.

72

Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?

73

Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?

74

Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?

75

Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.

76

A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?

77

Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?

78

A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?

79

A company uses Amazon RDS for a PostgreSQL database powering a customer-facing application. The application’s availability depends on fast database failover with minimal manual intervention. The RDS instance currently runs as a single-AZ deployment in one DB subnet group. Which change most directly meets the goal?

80

An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity uses an Auto Scaling group (ASG) with min size set to 1, and the ASG also spans only one subnet in practice. What is the most effective change to meet the requirement that the service continues during a single-AZ instance loss?

81

An order processing workflow uses Amazon SQS as the decoupling layer between a producer and a consumer Lambda function. The consumer intermittently fails due to a downstream dependency. The team has observed that certain “poison” messages keep being retried repeatedly and prevent other messages from being processed efficiently. Which SQS configuration most directly addresses this issue?

82

A company runs a stateful analytics workload on EC2 instances that use EBS volumes. The data must be restorable in another Region after a major outage, with frequent point-in-time recovery. Which approach provides the most suitable replication mechanism for the EBS-backed data?

83

A caching layer uses Amazon ElastiCache for Redis in front of a stateless web service. The service must continue to read cached responses during maintenance events and should automatically fail over to another node if one AZ becomes impaired. Which design change best satisfies this requirement?

84

A media company stores original uploads in an S3 bucket. They must recover from accidental overwrites/deletes and also recover quickly from a full Region outage. The required RPO is about 1 hour. Which configuration best meets these requirements?

85

A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.

86

A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?

87

A web app runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). The ALB is configured with health checks and the ASG spans three subnets in three Availability Zones. During an AZ outage, monitoring shows the number of healthy instances drops sharply and never returns to the original capacity until the ASG is manually adjusted. What change most directly improves resilience so capacity returns automatically during an AZ failure?

88

Your public API is hosted in two regions. You want Route 53 to automatically send traffic to the secondary region when the primary region’s endpoint fails. The primary API health check is returning failure codes, but clients still reach the primary region for several minutes. Which Route 53 configuration most directly addresses this behavior?

89

An orders service publishes payment instructions to an Amazon SQS queue. After occasional processing timeouts, the downstream consumer sometimes processes the same instruction twice, resulting in duplicate payment attempts. The team currently uses an SQS Standard queue with a visibility timeout of 2 minutes and relies on the consumer to finish before the timeout expires. What approach best improves resilience against duplicate processing?

90

A developer accidentally deletes important rows in an RDS database. The mistake is discovered 45 minutes later. The database has automated backups enabled with a retention period of 7 days. What is the best way to restore the database to a point just before the deletion?

91

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG is currently attached to subnets in only two Availability Zones (AZs). During a planned maintenance window, one AZ becomes unavailable for about 25 minutes. Monitoring shows that targets in the remaining AZ go healthy, and the ALB/target group health checks report normal. However, users still experience intermittent connection failures and slower responses during the AZ outage. What change will most directly improve resilience against an AZ loss while keeping the same ALB-based design?

92

An application uses an Amazon Aurora DB cluster. The cluster performs an automatic failover from the writer instance to a standby instance. After failover completes, reads succeed, but all new writes fail with errors indicating the application is connecting to the old writer endpoint. Which change best fixes the resiliency issue after failover?

93

A company hosts a public API using two AWS regions behind a single custom domain. Route 53 is configured with latency-based routing and health checks. During a regional outage, application metrics confirm the primary API is unhealthy, but clients still resolve to the primary region for most requests. Which DNS configuration change will most directly ensure automatic failover to the secondary region when the primary fails?

94

An orders service publishes payment instructions to an Amazon SQS queue. The downstream consumer sometimes times out while processing a message. After the message becomes visible again, the consumer may process the same instruction more than once and occasionally creates duplicate orders. The team needs a resiliency-focused design that prevents duplicates from creating double-charges, even if the same message is processed multiple times. What is the best architectural change?

95

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). After a new release, instances begin failing ALB health checks with errors like 502 while the application is still starting up. CloudWatch shows that the ASG replaces the instances before they finish initializing, so traffic never reaches healthy targets. Which change most directly prevents premature replacement during startup so traffic can resume as soon as the instances are actually healthy?

96

A company uses an Amazon Aurora DB cluster in a Multi-AZ configuration. During a planned failover of the writer instance, the database endpoints in the application are updated incorrectly. After failover, reads work but writes fail with connection errors and timeouts for several minutes. The team currently uses the instance endpoint for the writer. What should they change to improve write resilience during failovers?

97

A public API is deployed in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). The team wants Route 53 to automatically route users to the secondary region if the primary API becomes unhealthy. They will use Route 53 health checks that monitor the API’s /status endpoint over HTTPS. Which Route 53 configuration most directly implements this failover behavior?

98

An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out and retries the work, causing the consumer to process the same instruction more than once. Operationally, the team must ensure that duplicate processing does not create duplicate charges. The queue type cannot be changed. What is the most resilient application-side approach?

99

A service consumes messages from an SQS queue. Recently, a new message format started failing validation in the consumer. The consumer catches the exception but cannot successfully process those messages without code changes. The team wants failed messages to be isolated for later investigation instead of being retried indefinitely. What should they configure?

100

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG uses the ALB target group health checks to decide when instances are healthy (for example, by using the ELB/target-group health check integration). During a deployment, the ASG performs instance replacement. Shortly after the deployment starts and while new instances are still bootstrapping, CloudWatch shows the ALB target group briefly has zero healthy targets, and users intermittently receive 502 responses. Which ASG deployment configuration best reduces the chance that there will be a period with zero healthy ALB targets, while still keeping failover behavior resilient?

101

You host a public API using Amazon API Gateway in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). You want Route 53 to send client traffic to the secondary region only when the primary API is unhealthy. Which Route 53 setup best meets this requirement?

102

An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out or crashes after it has partially completed processing, causing the same instruction to be processed more than once. You must keep the design resilient without attempting to guarantee exactly-once processing. Which approach best handles duplicates safely?

103

A Multi-AZ Amazon RDS database experiences incorrect writes at 10:15 UTC due to a buggy release. The team detects the problem at 10:25 UTC. They want to restore the data to a known-good point around 10:15 UTC, and validate the recovered data, without taking the current production instance offline during the recovery process. What is the most appropriate AWS action?

104

An events service publishes critical notifications using Amazon SNS. Three independent downstream systems (A, B, and C) subscribe to the topic. Downstream system B sometimes fails to process certain messages (for example, it times out or returns an error while handling the message), and you want: 1) failures in B to be isolated so A and C keep processing unaffected, and 2) messages that B cannot successfully process after retries to be sent to a DLQ for B. Which design best meets these requirements?

105

A web application runs on an Amazon EC2 Auto Scaling group behind an Application Load Balancer (ALB). After each deployment, new instances take about 2 minutes to download artifacts and become ready to accept requests on the target port. In the last deployment, the ALB started marking targets unhealthy before the app was ready, and the Auto Scaling group then replaced those instances repeatedly, causing a prolonged outage. Which change best improves resilience during instance start-up without reducing actual availability once the application is healthy?

106

A company uses Amazon RDS with automated backups enabled (retention period: 7 days). At 10:30 UTC, a bad release corrupts specific rows in a production table. The team detects the issue at 11:10 UTC. They need to revert the database state to what it was from 10:00–10:30 UTC, recover quickly, and minimize risk to the currently running workload. What is the best option?

107

An internal-facing application is available in two AWS regions (Region 1 and Region 2). Each region has its own Application Load Balancer (ALB) and target group. The company uses an AWS Route 53 private hosted zone to route clients to Region 1 by default, but it must automatically fail over to Region 2 when Region 1’s ALB is unhealthy. Which Route 53 design best meets this requirement?

108

An internal worker consumes messages from an Amazon SQS Standard queue. Recently, some messages fail validation in the worker (for example, missing required fields), causing the worker to crash before it can successfully process those messages. Those messages keep getting retried repeatedly, slowing down processing of valid messages. The team wants a resilient mechanism to quarantine bad messages after a limited number of receive attempts. What should they implement?

109

An orders service publishes payment instructions to an Amazon SQS Standard queue. The downstream processor sometimes times out after it has already applied the payment, but before it can delete the message from the queue. As a result, the same payment instruction can be processed more than once. The team wants the strongest way to prevent duplicate side effects while keeping the system decoupled. What should they implement?

110

A web application runs on an Amazon EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ALB is configured to use at least two Availability Zones (AZs), but the ASG currently uses subnets in only one AZ. If that AZ becomes unavailable, the application stops serving requests. Which change most directly improves resilience to an AZ outage?

111

Your company hosts an internal API in two AWS Regions. You want Amazon Route 53 to automatically send traffic to the secondary Region if the primary Region’s endpoint becomes unhealthy. Which Route 53 configuration best meets this requirement?

112

An internal worker consumes messages from an Amazon SQS queue. Occasionally, a message fails validation in the worker (for example, missing required fields). Reprocessing the same bad message repeatedly wastes processing time and delays healthy messages. What is the best AWS approach to handle these poison messages without blocking the rest of the queue?

113

A team needs a relational database solution that can automatically fail over to a standby instance if the primary database becomes unavailable. They want the standby to be located in a different Availability Zone. Which RDS/Aurora configuration best satisfies this requirement?

114

A production Amazon RDS database has automated backups enabled with sufficient retention. At 10:30 UTC, a release corrupts specific rows. The issue is detected at 10:45 UTC. The team wants to restore the database state to before the corruption with minimal complexity. What should they do?

115

An orders service consumes payment instructions from an Amazon SQS queue. Sometimes the consumer times out after applying the payment but before deleting the SQS message. As a result, the same payment instruction is processed again. Which design change most directly prevents duplicate side effects caused by message retries?

116

A production Amazon RDS database has automated backups enabled. At 10:00 UTC, an application deploy accidentally overwrote a subset of rows due to a faulty migration. The issue is detected at 10:45 UTC. The team confirms that the required retention window is still available. Which approach offers the most resilient and least disruptive way to recover the affected data close to the time of the event?

117

An orders system sends payment instructions to an Amazon SQS queue. The consumer sometimes times out after it has already created the payment record but before it deletes the SQS message. As a result, the same instruction can be processed more than once. Which design best ensures the consumer remains resilient and does not create duplicate payments when the same instruction is delivered multiple times?

118

A company runs an Amazon Aurora DB cluster with a Multi-AZ deployment. The application is configured with a hard-coded endpoint that points to the current writer *DB instance* (an instance-specific endpoint), rather than the Aurora cluster writer endpoint. During an unexpected AZ failure, Aurora promotes the standby to become the new writer. However, the application continues to fail to connect until an operator updates the hard-coded endpoint. What change most directly improves resiliency so the application automatically reconnects after failover?

119

An event-driven order processing service consumes messages from an Amazon SQS Standard queue. After a deployment, about 1% of messages start failing validation because a required field is missing. The consumer catches the exception and returns control, so the messages are retried. However, those poison messages keep reappearing and repeatedly consuming processing time for hours, delaying handling of valid messages. What is the most resilient way to handle the poison messages while keeping the system available?

120

A company hosts an internal API behind an Application Load Balancer (ALB) in two AWS Regions. They want Amazon Route 53 to automatically fail over to the secondary Region when the primary Region’s ALB is unhealthy. Health checks for the primary ALB are already configured, but the DNS record currently uses a latency-based routing policy. Which Route 53 configuration most directly provides automatic failover based on health status?

121

A web application runs on an EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG spans three Availability Zones. After a deployment, new instances frequently fail the ALB target group health checks with HTTP 5xx responses and are quickly terminated by the ASG. What change most improves resiliency during deployments with minimal downtime by preventing premature removal of instances that are still starting?

122

A fintech company has a two-Region DR requirement: RPO must be within 15 minutes and RTO must be under 2 hours. To control cost, they do not want to run full production infrastructure in the secondary Region continuously. They plan to continuously replicate the database and keep the application infrastructure in the secondary Region prepared, but at reduced capacity. Which DR strategy best matches this requirement and accurately describes their plan?

123

A web application runs on an Auto Scaling group behind an Application Load Balancer. The business wants the service to keep running if one Availability Zone goes down. Which two changes should you make? Select two.

124

A production Amazon RDS database must continue serving the application if the primary DB instance fails. The application should reconnect automatically without hard-coding a new IP address. Which two actions should you take? Select two.

125

A company hosts an internal API in two AWS Regions. Traffic must automatically switch to the secondary Region when the primary Region's endpoint is unhealthy. Which two Route 53 settings are required? Select two.

126

A service processes messages from an Amazon SQS queue. Sometimes the worker finishes the business logic but does not delete the message before the visibility timeout expires, so the message is delivered again. Which two changes improve resilience and reduce the impact of duplicate processing? Select two.

127

A developer accidentally corrupts part of a production Amazon RDS database, and the issue is discovered 45 minutes later. The team needs to restore the database to the state immediately before the change. Which two actions should be part of the recovery plan? Select two.

128

A batch processing job can be interrupted and restarted from checkpoints. The business wants to lower compute cost while still keeping the workload resilient to interruptions. Which two choices are best? Select two.

129

A production application uses an Amazon RDS Multi-AZ DB instance. During an unplanned failover, the database endpoint remains the same. What change should the application team make to handle the failover reliably?

130

Your web tier runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). You currently deploy both the ALB and the Auto Scaling group in only two Availability Zones (AZs). One AZ fails. What is the best configuration change to improve resilience?

131

An internal API is hosted in two AWS Regions behind Route 53. Under normal conditions, clients should use the primary region. If the primary endpoint becomes unhealthy, traffic must automatically switch to the secondary region. Which Route 53 setup best meets this requirement?

132

An order-processing system publishes an event whenever a payment succeeds. Three downstream services (inventory, shipping, and analytics) must react independently. Analytics sometimes has high latency, but order processing must not be blocked. What is the best AWS approach to decouple these consumers?

133

A consumer application reads from an Amazon SQS queue. Some messages have an invalid format and always fail processing. They are retried repeatedly and consume consumer capacity. What is the best way to prevent these "poison pill" messages from blocking normal processing?

134

An event consumer sometimes processes the same SQS message more than once due to timeouts and retries. The consumer must ensure the payment is not charged twice. What design choice best addresses this requirement?

135

A company needs an Amazon RDS database that automatically fails over to a standby when the primary DB instance becomes unavailable. Which approach best meets the requirement with minimal operational effort?

136

An internal service is hosted behind an Application Load Balancer (ALB) with targets spread across two Availability Zones. If the targets in one Availability Zone become unhealthy, the service must continue serving traffic from the healthy AZ. What change most directly improves resilience at the load-balancing layer?

137

A worker consumes messages from an Amazon SQS queue. Some messages consistently fail validation and are retried until the worker can no longer process them. What is the most appropriate AWS mechanism to handle these poison messages while keeping the queue usable?

138

A production Amazon RDS database has automated backups enabled. At 10:45 UTC, an issue is discovered. The team needs to restore the database to its state as of 10:30 UTC. Which capability should they use?

139

A system processes events from Amazon SQS and sometimes sees duplicate messages due to retries. The business requirement is that each payment must be charged at most once. What design choice best addresses this resiliency requirement?

140

A company wants a disaster recovery setup for a web application. They need relatively quick recovery, but they can't afford running full production in the secondary location at all times. Which option best matches this requirement?

141

A fintech company needs a disaster recovery design for a web application in two Regions. The business requires an RPO of 15 minutes and an RTO under 2 hours, but it cannot afford to keep a full production stack running in both Regions all the time. Which two DR strategies best fit the requirement? Select two.

142

A transactional application uses Amazon RDS for MySQL in a single Availability Zone. The team wants the database to fail over automatically if the primary DB instance becomes unavailable, and they want the application to recover with minimal code changes. Which two actions should they take? Select two.

143

An order-processing worker consumes messages from Amazon SQS. Occasionally, the worker times out after successfully creating a payment record but before deleting the message, which causes duplicate charges during retries. Some messages also fail validation repeatedly because required fields are missing. Which two changes should the team make? Select two.

144

A production Amazon RDS database already has automated backups enabled. At 10:45 UTC, the team discovers that a faulty migration corrupted rows in a table at 10:30 UTC. The business wants the database restored to exactly the state it had at 10:30 UTC with minimal risk. Which two actions should the team take? Select two.

145

A worker service consumes messages from an Amazon SQS queue. Some messages are malformed and always fail validation. The worker retries, but it keeps reprocessing the same bad messages and consumes processing capacity that should be used for valid work. What is the best solution to prevent “poison messages” from blocking progress?

146

A production Amazon RDS database has automated backups enabled. An application mistakenly updates a table and the issue is discovered one hour later. The team needs to restore the database to the exact state it had 45 minutes ago. Which approach best meets the requirement?

147

A company wants a disaster recovery setup for a web application. They want to keep costs low but still recover within a couple of hours after a regional disruption. They are willing to run only minimal infrastructure in the secondary location and scale it up during the outage. Which DR approach best matches this requirement?

148

A company hosts a web application on Amazon EC2 instances in an Auto Scaling group behind an Application Load Balancer (ALB). The ALB and the Auto Scaling group are currently deployed in only one Availability Zone (AZ). The business wants the application to keep running if that AZ has an outage. What is the best change?

149

A team runs an Amazon RDS for MySQL database in a single Availability Zone. They want automatic failover with minimal downtime if the primary database instance becomes unavailable. Automated backups are already enabled. Which configuration change best meets the requirement?

150

An organization hosts the same public API in two AWS Regions. Normal traffic should go to the primary Region. If the primary endpoint becomes unhealthy, Route 53 should automatically route users to the secondary Region. What is the best Route 53 configuration approach?

151

An orders service currently sends HTTP requests directly to two downstream services (inventory and shipping). During peak load, inventory slows down, causing the orders service to slow as well. The team wants the orders service to remain responsive even when a downstream service is temporarily slow or restarted. Which design change best achieves this resiliency goal?

152

A payment worker consumes messages from an Amazon SQS queue. Sometimes the worker finishes the payment creation, but a timeout prevents message deletion and the same payment request is delivered again. Which two design changes best reduce the risk of duplicate charges and keep bad messages from looping forever? Select two.

153

An application uses an Amazon RDS Multi-AZ DB instance. During a failover test, connections fail until the application is restarted, even though the database comes back online. Which two changes should the team make to improve resilience during failover? Select two.

154

An internal API is deployed in two AWS Regions behind separate Application Load Balancers. The company wants clients to use the primary Region when it is healthy and automatically switch to the secondary Region if the primary health check fails. Which two Route 53 record configurations are required? Select two.

155

A production Amazon Aurora MySQL database is corrupted by a bad migration at 10:30 UTC, and the problem is discovered at 10:45 UTC. The team wants to recover to the state just before the migration with minimal manual effort. Which two actions should they take? Select two.

156

An order service must notify inventory, shipping, and analytics independently when payment succeeds. The shipping service may be slow, but the order service should keep accepting new orders even if one consumer is unavailable. Which two changes best improve resilience? Select two.

157

Based on the exhibit, the team wants to stop poison messages from consuming worker capacity and also prevent duplicate side effects if the same message is delivered more than once. Which design change best meets the requirement?

158

Based on the exhibit, a faulty deployment corrupted production data at 10:30 UTC and the issue was discovered at 10:55 UTC. The team needs to recover the database to the last good state before the corruption. Which action should they take?

159

Based on the exhibit, the application team wants the database to keep the same connection endpoint during failover and to reconnect automatically after the primary instance becomes unavailable. Which change best meets the requirement?

160

Based on the exhibit, which Route 53 configuration should be used so traffic automatically returns to the secondary Region only when the primary Region becomes unhealthy?

161

Based on the exhibit, the business needs Regional disaster recovery with an RTO of 45 minutes and an RPO of 15 minutes. The solution should keep cost lower than running two fully active production environments. Which DR strategy is the best fit?

162

Based on the exhibit, the application should continue serving requests if one Availability Zone fails. Which change best improves resilience with the least operational complexity?

163

Based on the exhibit, some SQS messages fail validation repeatedly and continue consuming worker time. What change best prevents the bad messages from being retried forever?

164

Based on the exhibit, the web team wants the application to continue serving traffic if one Availability Zone fails. Which change best meets the requirement with the least operational overhead?

165

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?

166

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?

167

A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?

168

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?

169

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?

170

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?

171

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured?

172

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used?

173

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?

174

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?

175

A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?

176

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?

177

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?

178

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?

179

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?

180

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?

181

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?

182

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?

183

A inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured?

184

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used?

185

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?

186

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?

187

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?

188

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?

189

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.

190

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.

191

A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.

192

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.

193

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.

194

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.

195

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.

196

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.

197

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.

198

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.

199

A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The design must avoid adding custom operational scripts.

200

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.

201

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.

202

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.

203

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.

204

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.

205

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.

206

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.

207

A inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.

208

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.

209

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.

210

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.

211

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The design must avoid adding custom operational scripts.

212

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.

213

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.

214

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.

215

A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.

216

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.

217

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.

218

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.

219

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.

220

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.

221

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.

222

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.

223

A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.

224

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.

225

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.

226

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.

227

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.

228

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.

229

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.

230

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.

231

A inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.

232

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.

233

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.

234

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.

235

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.

236

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.

237

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.

238

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.

239

A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The team wants the control to be enforceable during normal operations.

240

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The team wants the control to be enforceable during normal operations.

241

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The team wants the control to be enforceable during normal operations.

242

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The team wants the control to be enforceable during normal operations.

243

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The team wants the control to be enforceable during normal operations.

244

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The team wants the control to be enforceable during normal operations.

245

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The team wants the control to be enforceable during normal operations.

246

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The team wants the control to be enforceable during normal operations.

247

A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The team wants the control to be enforceable during normal operations.

248

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The team wants the control to be enforceable during normal operations.

249

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.

250

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.

Watch out for

Common Design Resilient Architectures exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Frequently asked questions

What does the Design Resilient Architectures domain cover on the SAA-C03 exam?
Design Resilient Architectures questions test whether you can apply the concept in context, not just recognise a definition.
How many questions are in this domain?
This page lists all 250 Design Resilient Architectures questions in the SAA-C03 question bank. The actual exam draws from this domain proportionally to its weighting in the official exam blueprint.
What is the best way to practise this domain?
Start with a short focused session (10 questions) to identify gaps, then use the interactive practice page to work through explanations. Repeat with a longer session once the weak areas feel solid.
Can I practise only Design Resilient Architectures questions?
Yes — the session launcher on this page filters questions to this domain only. Choose any session length or try the interactive practice page for inline explanations.