SAA-C03 domain
Design Resilient Architectures
Use this page to practise SAA-C03 Design Resilient Architectures questions. The goal is not to memorise dumps, but to understand each concept, review the explanation, and improve your exam readiness.
Focused practice
Start a Design Resilient Architectures session
All sessions draw only from this domain. Pick a length or try interactive practice with inline explanations.
Start 20-question practice session →
What the exam tests
What to know about Design Resilient Architectures
Design Resilient Architectures questions test whether you can apply the concept in context, not just recognise a definition.
How the topic appears in realistic exam-style scenarios.
Which detail in the question changes the correct answer.
How to eliminate plausible but wrong options.
How to connect the question back to the wider exam objective.
Question index
All Design Resilient Architectures questions (250)
Click any question to see the full explanation, or start a practice session above.
1. An order-processing service consumes messages from an Amazon SQS Standard queue using a custom worker. During traffic spikes, the worker occasionally times out after performing some work but before acknowledging the message, so SQS redelivers it and it may be processed again. You also observe that a small set of “poison” messages always fail validation. What change most directly improves resilience by (1) preventing poison messages from retrying indefinitely and (2) avoiding duplicate side effects caused by legitimate retries?
2. Based on the exhibit, the application sees several minutes of connection errors during an Aurora failover. What is the best change to reduce failover impact?
3. A payments service receives payment orders by consuming messages from an Amazon SQS Standard queue. The downstream processor occasionally exceeds its processing timeout. As a result, some messages reappear in the queue and may be processed more than once. The team wants to prevent duplicate side effects (for example, double-charging) and also ensure poison messages do not repeatedly consume processing capacity. What approach best satisfies both goals?
4. A company runs an application behind an Application Load Balancer (ALB). An Auto Scaling group (ASG) is configured with desired capacity 2, but it is attached only to subnets in a single Availability Zone. The ALB is healthy because it is configured across multiple Availability Zones. When the Availability Zone that contains the ASG subnets experiences an outage, what change most directly improves resilience and allows capacity to be restored automatically?
5. Based on the exhibit, DNS still sends traffic to the primary Region even though Route 53 health checks show the primary endpoint is unhealthy. What is the best change to make failover work as intended?
6. Based on the exhibit, the web application must remain available even if one Availability Zone fails. What is the best change to improve resilience with the least redesign?
7. An Auto Scaling group behind an Application Load Balancer frequently replaces new EC2 instances. The application needs ~6 minutes to warm up after instance launch. However, the ALB target group health checks start immediately and mark the targets unhealthy until the application is ready. Because the targets become unhealthy early, the Auto Scaling group then terminates the instances and launches replacements, creating a repeated unhealthy/termination loop. What configuration change will most directly improve recovery by preventing premature ASG termination while the application is warming up?
8. A company runs an internet-facing API in two AWS Regions. Route 53 currently uses simple routing to a primary Application Load Balancer (ALB) DNS name. When the primary Region experiences an outage, customers wait a long time because the DNS entry is not changed automatically. The team wants automatic failover: if the primary Region ALB health check fails for a sustained period, Route 53 should route users to the secondary Region ALB. Which Route 53 approach best meets this requirement?
9. A team accidentally updates critical rows in an Amazon RDS for PostgreSQL database. Automated backups are enabled. They need to recover the data to the exact state as of 90 minutes ago. They also cannot risk interrupting the current production database instance while investigators validate the restored data. Which recovery strategy best meets these constraints?
10. Based on the exhibit, the database must continue serving if the current Availability Zone fails. What should you change?
11. Based on the exhibit, the application tier is not replacing unhealthy instances even though the Auto Scaling group spans two Availability Zones. What change most directly improves automatic recovery when the application process fails?
12. Based on the exhibit, the team must restore an Amazon RDS for PostgreSQL database to the exact state just before a bad delete happened. What is the best recovery approach?
13. Based on the exhibit, the company wants DNS traffic to fail over automatically from the primary Region to a secondary Region when the primary endpoint is unhealthy. Which Route 53 change is best?
14. Based on the exhibit, downstream payment timeouts cause EventBridge deliveries to back up and some events are retried until they age out. What change best improves resilience and preserves events during downstream outages?
15. A SaaS platform plans to run in two AWS Regions for lower latency. The team wants to enable active-active writes (both regions accept updates) to avoid failover downtime. However, the business requires strong consistency for order status transitions (for example, only one transition from “Paid” to “Shipped” must be allowed). Which statement is the best architectural choice to meet the consistency requirement?
16. Based on the exhibit, the web tier becomes unavailable if us-west-2a has an outage. What is the best change to improve resilience with the least redesign?
17. Based on the exhibit, the database is manually promoted during an Availability Zone failure and the application outage lasts longer than the target. What change best improves resilience with the least operational intervention?
18. An application writes to an Amazon Aurora DB cluster. After a planned Aurora failover, the application experiences several minutes of connection errors. The logs show the application continues connecting to the specific DB instance endpoint that was the primary before the failover. What change most directly improves resilience during Aurora failovers?
19. A service processes customer payments from a message queue. Because the queue provides at-least-once delivery, the same payment message can be delivered more than once if the consumer times out before committing its state. Currently, the service sometimes charges the customer twice. Which design change most directly prevents duplicate charges while still allowing safe retries?
20. Your web application is deployed in two AWS Regions (Region A and Region B). You want Route 53 to automatically fail over DNS traffic from Region A to Region B when Region A is unhealthy. The failover decision must be based on health checks that verify whether the application in Region A is reachable. Which Route 53 routing configuration best meets these requirements?
21. Based on the exhibit, the payment worker sometimes processes the same SQS Standard message more than once after a timeout. What change best prevents duplicate charges while keeping the queue architecture?
22. Based on the exhibit, duplicate payment charges occasionally occur when the worker times out after the charge is submitted but before the message is deleted. What change best prevents duplicate charges while keeping retry behavior?
23. A production team accidentally deletes critical rows in an Amazon RDS for PostgreSQL database. The deletion occurred about 6 hours ago. The team wants to recover to a specific point in time with minimal disruption. Assuming automated backups are enabled, which approach provides the best resilience outcome?
24. A web application uses pooled JDBC connections to an Amazon Aurora cluster using the writer endpoint. During an Aurora planned failover, monitoring shows a short spike in failed requests. The Aurora cluster writer endpoint remains the same, but many existing pooled connections briefly fail. The application retries aggressively and overloads the new writer during the transition. Which design change will most improve application resilience during Aurora failovers without requiring application redeployment?
25. Based on the exhibit, an administrator accidentally deleted data from Amazon RDS for PostgreSQL about 90 minutes ago. Which recovery approach best restores the database to the exact required point in time?
26. Based on the exhibit, the current disaster recovery design misses the RTO target even though the database replica is current. Which deployment model best meets the requirements with the least always-on cost?
27. A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?
28. Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?
29. Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.
30. A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?
31. A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.
32. Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?
33. A company uses Amazon RDS for a PostgreSQL database powering a customer-facing application. The application’s availability depends on fast database failover with minimal manual intervention. The RDS instance currently runs as a single-AZ deployment in one DB subnet group. Which change most directly meets the goal?
34. A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?
35. A caching layer uses Amazon ElastiCache for Redis in front of a stateless web service. The service must continue to read cached responses during maintenance events and should automatically fail over to another node if one AZ becomes impaired. Which design change best satisfies this requirement?
36. A company runs a stateful analytics workload on EC2 instances that use EBS volumes. The data must be restorable in another Region after a major outage, with frequent point-in-time recovery. Which approach provides the most suitable replication mechanism for the EBS-backed data?
37. An order processing workflow uses Amazon SQS as the decoupling layer between a producer and a consumer Lambda function. The consumer intermittently fails due to a downstream dependency. The team has observed that certain “poison” messages keep being retried repeatedly and prevent other messages from being processed efficiently. Which SQS configuration most directly addresses this issue?
38. A media company stores original uploads in an S3 bucket. They must recover from accidental overwrites/deletes and also recover quickly from a full Region outage. The required RPO is about 1 hour. Which configuration best meets these requirements?
39. An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity uses an Auto Scaling group (ASG) with min size set to 1, and the ASG also spans only one subnet in practice. What is the most effective change to meet the requirement that the service continues during a single-AZ instance loss?
40. Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?
41. A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.
42. An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?
43. A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.
44. Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?
45. A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?
46. A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.
47. A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.
48. An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?
49. A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?
50. A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?
51. A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.
52. Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?
53. A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?
54. A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?
55. A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?
56. A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?
57. A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?
58. A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?
59. A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.
60. A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?
61. A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?
62. A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.
63. An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?
64. A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.
65. A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?
66. Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?
67. A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?
68. An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?
69. A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?
70. A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.
71. A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.
72. Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?
73. Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?
74. Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?
75. Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.
76. A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?
77. Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?
78. A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?
79. A company uses Amazon RDS for a PostgreSQL database powering a customer-facing application. The application’s availability depends on fast database failover with minimal manual intervention. The RDS instance currently runs as a single-AZ deployment in one DB subnet group. Which change most directly meets the goal?
80. An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity uses an Auto Scaling group (ASG) with min size set to 1, and the ASG also spans only one subnet in practice. What is the most effective change to meet the requirement that the service continues during a single-AZ instance loss?
81. An order processing workflow uses Amazon SQS as the decoupling layer between a producer and a consumer Lambda function. The consumer intermittently fails due to a downstream dependency. The team has observed that certain “poison” messages keep being retried repeatedly and prevent other messages from being processed efficiently. Which SQS configuration most directly addresses this issue?
82. A company runs a stateful analytics workload on EC2 instances that use EBS volumes. The data must be restorable in another Region after a major outage, with frequent point-in-time recovery. Which approach provides the most suitable replication mechanism for the EBS-backed data?
83. A caching layer uses Amazon ElastiCache for Redis in front of a stateless web service. The service must continue to read cached responses during maintenance events and should automatically fail over to another node if one AZ becomes impaired. Which design change best satisfies this requirement?
84. A media company stores original uploads in an S3 bucket. They must recover from accidental overwrites/deletes and also recover quickly from a full Region outage. The required RPO is about 1 hour. Which configuration best meets these requirements?
85. A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.
86. A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?
87. A web app runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). The ALB is configured with health checks and the ASG spans three subnets in three Availability Zones. During an AZ outage, monitoring shows the number of healthy instances drops sharply and never returns to the original capacity until the ASG is manually adjusted. What change most directly improves resilience so capacity returns automatically during an AZ failure?
88. Your public API is hosted in two regions. You want Route 53 to automatically send traffic to the secondary region when the primary region’s endpoint fails. The primary API health check is returning failure codes, but clients still reach the primary region for several minutes. Which Route 53 configuration most directly addresses this behavior?
89. An orders service publishes payment instructions to an Amazon SQS queue. After occasional processing timeouts, the downstream consumer sometimes processes the same instruction twice, resulting in duplicate payment attempts. The team currently uses an SQS Standard queue with a visibility timeout of 2 minutes and relies on the consumer to finish before the timeout expires. What approach best improves resilience against duplicate processing?
90. A developer accidentally deletes important rows in an RDS database. The mistake is discovered 45 minutes later. The database has automated backups enabled with a retention period of 7 days. What is the best way to restore the database to a point just before the deletion?
91. A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG is currently attached to subnets in only two Availability Zones (AZs). During a planned maintenance window, one AZ becomes unavailable for about 25 minutes. Monitoring shows that targets in the remaining AZ stay healthy, and the ALB/target group health checks report normal. However, users still experience intermittent connection failures and slower responses during the AZ outage. What change will most directly improve resilience against an AZ loss while keeping the same ALB-based design?
92. An application uses an Amazon Aurora DB cluster. The cluster performs an automatic failover from the writer instance to a standby instance. After failover completes, reads succeed, but all new writes fail with errors indicating the application is connecting to the old writer endpoint. Which change best fixes the resiliency issue after failover?
93. A company hosts a public API using two AWS regions behind a single custom domain. Route 53 is configured with latency-based routing and health checks. During a regional outage, application metrics confirm the primary API is unhealthy, but clients still resolve to the primary region for most requests. Which DNS configuration change will most directly ensure automatic failover to the secondary region when the primary fails?
94. An orders service publishes payment instructions to an Amazon SQS queue. The downstream consumer sometimes times out while processing a message. After the message becomes visible again, the consumer may process the same instruction more than once and occasionally creates duplicate orders. The team needs a resiliency-focused design that prevents duplicates from creating double-charges, even if the same message is processed multiple times. What is the best architectural change?
95. A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). After a new release, instances begin failing ALB health checks with errors like 502 while the application is still starting up. CloudWatch shows that the ASG replaces the instances before they finish initializing, so traffic never reaches healthy targets. Which change most directly prevents premature replacement during startup so traffic can resume as soon as the instances are actually healthy?
96. A company uses an Amazon Aurora DB cluster in a Multi-AZ configuration. During a planned failover of the writer instance, the database endpoints in the application are updated incorrectly. After failover, reads work but writes fail with connection errors and timeouts for several minutes. The team currently uses the instance endpoint for the writer. What should they change to improve write resilience during failovers?
97. A public API is deployed in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). The team wants Route 53 to automatically route users to the secondary region if the primary API becomes unhealthy. They will use Route 53 health checks that monitor the API’s /status endpoint over HTTPS. Which Route 53 configuration most directly implements this failover behavior?
98. An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out and retries the work, causing the consumer to process the same instruction more than once. Operationally, the team must ensure that duplicate processing does not create duplicate charges. The queue type cannot be changed. What is the most resilient application-side approach?
99. A service consumes messages from an SQS queue. Recently, a new message format started failing validation in the consumer. The consumer catches the exception but cannot successfully process those messages without code changes. The team wants failed messages to be isolated for later investigation instead of being retried indefinitely. What should they configure?
100. A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG uses the ALB target group health checks to decide when instances are healthy (for example, by using the ELB/target-group health check integration). During a deployment, the ASG performs instance replacement. Shortly after the deployment starts and while new instances are still bootstrapping, CloudWatch shows the ALB target group briefly has zero healthy targets, and users intermittently receive 502 responses. Which ASG deployment configuration best reduces the chance that there will be a period with zero healthy ALB targets, while still keeping failover behavior resilient?
101You host a public API using Amazon API Gateway in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). You want Route 53 to send client traffic to the secondary region only when the primary API is unhealthy. Which Route 53 setup best meets this requirement?
102An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out or crashes after it has partially completed processing, causing the same instruction to be processed more than once. You must keep the design resilient without attempting to guarantee exactly-once processing. Which approach best handles duplicates safely?
103A Multi-AZ Amazon RDS database experiences incorrect writes at 10:15 UTC due to a buggy release. The team detects the problem at 10:25 UTC. They want to restore the data to a known-good point around 10:15 UTC, and validate the recovered data, without taking the current production instance offline during the recovery process. What is the most appropriate AWS action?
104An events service publishes critical notifications using Amazon SNS. Three independent downstream systems (A, B, and C) subscribe to the topic. Downstream system B sometimes fails to process certain messages (for example, it times out or returns an error while handling the message), and you want: 1) failures in B to be isolated so A and C keep processing unaffected, and 2) messages that B cannot successfully process after retries to be sent to a DLQ for B. Which design best meets these requirements?
105A web application runs on an Amazon EC2 Auto Scaling group behind an Application Load Balancer (ALB). After each deployment, new instances take about 2 minutes to download artifacts and become ready to accept requests on the target port. In the last deployment, the ALB started marking targets unhealthy before the app was ready, and the Auto Scaling group then replaced those instances repeatedly, causing a prolonged outage. Which change best improves resilience during instance start-up without reducing actual availability once the application is healthy?
106A company uses Amazon RDS with automated backups enabled (retention period: 7 days). At 10:30 UTC, a bad release corrupts specific rows in a production table. The team detects the issue at 11:10 UTC. They need to revert the database state to what it was from 10:00–10:30 UTC, recover quickly, and minimize risk to the currently running workload. What is the best option?
107An internal-facing application is available in two AWS Regions (Region 1 and Region 2). Each Region has its own Application Load Balancer (ALB) and target group. The company uses an Amazon Route 53 private hosted zone to route clients to Region 1 by default, but it must automatically fail over to Region 2 when Region 1’s ALB is unhealthy. Which Route 53 design best meets this requirement?
108An internal worker consumes messages from an Amazon SQS Standard queue. Recently, some messages fail validation in the worker (for example, missing required fields), causing the worker to crash before it can successfully process those messages. Those messages keep getting retried repeatedly, slowing down processing of valid messages. The team wants a resilient mechanism to quarantine bad messages after a limited number of receive attempts. What should they implement?
109An orders service publishes payment instructions to an Amazon SQS Standard queue. The downstream processor sometimes times out after it has already applied the payment, but before it can delete the message from the queue. As a result, the same payment instruction can be processed more than once. The team wants the strongest way to prevent duplicate side effects while keeping the system decoupled. What should they implement?
110A web application runs on an Amazon EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ALB is configured to use at least two Availability Zones (AZs), but the ASG currently uses subnets in only one AZ. If that AZ becomes unavailable, the application stops serving requests. Which change most directly improves resilience to an AZ outage?
111Your company hosts an internal API in two AWS Regions. You want Amazon Route 53 to automatically send traffic to the secondary Region if the primary Region’s endpoint becomes unhealthy. Which Route 53 configuration best meets this requirement?
112An internal worker consumes messages from an Amazon SQS queue. Occasionally, a message fails validation in the worker (for example, missing required fields). Reprocessing the same bad message repeatedly wastes processing time and delays healthy messages. What is the best AWS approach to handle these poison messages without blocking the rest of the queue?
113A team needs a relational database solution that can automatically fail over to a standby instance if the primary database becomes unavailable. They want the standby to be located in a different Availability Zone. Which RDS/Aurora configuration best satisfies this requirement?
114A production Amazon RDS database has automated backups enabled with sufficient retention. At 10:30 UTC, a release corrupts specific rows. The issue is detected at 10:45 UTC. The team wants to restore the database state to before the corruption with minimal complexity. What should they do?
115An orders service consumes payment instructions from an Amazon SQS queue. Sometimes the consumer times out after applying the payment but before deleting the SQS message. As a result, the same payment instruction is processed again. Which design change most directly prevents duplicate side effects caused by message retries?
116A production Amazon RDS database has automated backups enabled. At 10:00 UTC, an application deploy accidentally overwrote a subset of rows due to a faulty migration. The issue is detected at 10:45 UTC. The team confirms that the required retention window is still available. Which approach offers the most resilient and least disruptive way to recover the affected data close to the time of the event?
117An orders system sends payment instructions to an Amazon SQS queue. The consumer sometimes times out after it has already created the payment record but before it deletes the SQS message. As a result, the same instruction can be processed more than once. Which design best ensures the consumer remains resilient and does not create duplicate payments when the same instruction is delivered multiple times?
118A company runs an Amazon Aurora DB cluster with a Multi-AZ deployment. The application is configured with a hard-coded endpoint that points to the current writer *DB instance* (an instance-specific endpoint), rather than the Aurora cluster writer endpoint. During an unexpected AZ failure, Aurora promotes the standby to become the new writer. However, the application continues to fail to connect until an operator updates the hard-coded endpoint. What change most directly improves resiliency so the application automatically reconnects after failover?
119An event-driven order processing service consumes messages from an Amazon SQS Standard queue. After a deployment, about 1% of messages start failing validation because a required field is missing. The consumer catches the exception and returns control, so the messages are retried. However, those poison messages keep reappearing and repeatedly consuming processing time for hours, delaying handling of valid messages. What is the most resilient way to handle the poison messages while keeping the system available?
120A company hosts an internal API behind an Application Load Balancer (ALB) in two AWS Regions. They want Amazon Route 53 to automatically fail over to the secondary Region when the primary Region’s ALB is unhealthy. Health checks for the primary ALB are already configured, but the DNS record currently uses a latency-based routing policy. Which Route 53 configuration most directly provides automatic failover based on health status?
121A web application runs on an EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG spans three Availability Zones. After a deployment, new instances frequently fail the ALB target group health checks with HTTP 5xx responses and are quickly terminated by the ASG. What change most improves resiliency during deployments with minimal downtime by preventing premature removal of instances that are still starting?
122A fintech company has a two-Region DR requirement: RPO must be within 15 minutes and RTO must be under 2 hours. To control cost, they do not want to run full production infrastructure in the secondary Region continuously. They plan to continuously replicate the database and keep the application infrastructure in the secondary Region prepared, but at reduced capacity. Which DR strategy best matches this requirement and accurately describes their plan?
123A web application runs on an Auto Scaling group behind an Application Load Balancer. The business wants the service to keep running if one Availability Zone goes down. Which two changes should you make? Select two.
124A production Amazon RDS database must continue serving the application if the primary DB instance fails. The application should reconnect automatically without hard-coding a new IP address. Which two actions should you take? Select two.
125A company hosts an internal API in two AWS Regions. Traffic must automatically switch to the secondary Region when the primary Region's endpoint is unhealthy. Which two Route 53 settings are required? Select two.
126A service processes messages from an Amazon SQS queue. Sometimes the worker finishes the business logic but does not delete the message before the visibility timeout expires, so the message is delivered again. Which two changes improve resilience and reduce the impact of duplicate processing? Select two.
127A developer accidentally corrupts part of a production Amazon RDS database, and the issue is discovered 45 minutes later. The team needs to restore the database to the state immediately before the change. Which two actions should be part of the recovery plan? Select two.
128A batch processing job can be interrupted and restarted from checkpoints. The business wants to lower compute cost while still keeping the workload resilient to interruptions. Which two choices are best? Select two.
129A production application uses an Amazon RDS Multi-AZ DB instance. During an unplanned failover, the database endpoint remains the same. What change should the application team make to handle the failover reliably?
130Your web tier runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). You currently deploy both the ALB and the Auto Scaling group in only two Availability Zones (AZs). One AZ fails. What is the best configuration change to improve resilience?
131An internal API is hosted in two AWS Regions behind Route 53. Under normal conditions, clients should use the primary region. If the primary endpoint becomes unhealthy, traffic must automatically switch to the secondary region. Which Route 53 setup best meets this requirement?
132An order-processing system publishes an event whenever a payment succeeds. Three downstream services (inventory, shipping, and analytics) must react independently. Analytics sometimes has high latency, but order processing must not be blocked. What is the best AWS approach to decouple these consumers?
133A consumer application reads from an Amazon SQS queue. Some messages have an invalid format and always fail processing. They are retried repeatedly and consume consumer capacity. What is the best way to prevent these "poison pill" messages from blocking normal processing?
134An event consumer sometimes processes the same SQS message more than once due to timeouts and retries. The consumer must ensure the payment is not charged twice. What design choice best addresses this requirement?
135A company needs an Amazon RDS database that automatically fails over to a standby when the primary DB instance becomes unavailable. Which approach best meets the requirement with minimal operational effort?
136An internal service is hosted behind an Application Load Balancer (ALB) with targets spread across two Availability Zones. If the targets in one Availability Zone become unhealthy, the service must continue serving traffic from the healthy AZ. What change most directly improves resilience at the load-balancing layer?
137A worker consumes messages from an Amazon SQS queue. Some messages consistently fail validation and are retried repeatedly even though the worker can never process them. What is the most appropriate AWS mechanism to handle these poison messages while keeping the queue usable?
138A production Amazon RDS database has automated backups enabled. At 10:45 UTC, an issue is discovered. The team needs to restore the database to its state as of 10:30 UTC. Which capability should they use?
139A system processes events from Amazon SQS and sometimes sees duplicate messages due to retries. The business requirement is that each payment must be charged at most once. What design choice best addresses this resiliency requirement?
140A company wants a disaster recovery setup for a web application. They need relatively quick recovery, but they can't afford running full production in the secondary location at all times. Which option best matches this requirement?
141A fintech company needs a disaster recovery design for a web application in two Regions. The business requires an RPO of 15 minutes and an RTO under 2 hours, but it cannot afford to keep a full production stack running in both Regions all the time. Which two DR strategies best fit the requirement? Select two.
142A transactional application uses Amazon RDS for MySQL in a single Availability Zone. The team wants the database to fail over automatically if the primary DB instance becomes unavailable, and they want the application to recover with minimal code changes. Which two actions should they take? Select two.
143An order-processing worker consumes messages from Amazon SQS. Occasionally, the worker times out after successfully creating a payment record but before deleting the message, which causes duplicate charges during retries. Some messages also fail validation repeatedly because required fields are missing. Which two changes should the team make? Select two.
144A production Amazon RDS database already has automated backups enabled. At 10:45 UTC, the team discovers that a faulty migration corrupted rows in a table at 10:30 UTC. The business wants the database restored to exactly the state it had at 10:30 UTC with minimal risk. Which two actions should the team take? Select two.
145A worker service consumes messages from an Amazon SQS queue. Some messages are malformed and always fail validation. The worker retries, but it keeps reprocessing the same bad messages and consumes processing capacity that should be used for valid work. What is the best solution to prevent “poison messages” from blocking progress?
146A production Amazon RDS database has automated backups enabled. An application mistakenly updates a table and the issue is discovered one hour later. The team needs to restore the database to the exact state it had 45 minutes ago. Which approach best meets the requirement?
147A company wants a disaster recovery setup for a web application. They want to keep costs low but still recover within a couple of hours after a regional disruption. They are willing to run only minimal infrastructure in the secondary location and scale it up during the outage. Which DR approach best matches this requirement?
148A company hosts a web application on Amazon EC2 instances in an Auto Scaling group behind an Application Load Balancer (ALB). The ALB and the Auto Scaling group are currently deployed in only one Availability Zone (AZ). The business wants the application to keep running if that AZ has an outage. What is the best change?
149A team runs an Amazon RDS for MySQL database in a single Availability Zone. They want automatic failover with minimal downtime if the primary database instance becomes unavailable. Automated backups are already enabled. Which configuration change best meets the requirement?
150An organization hosts the same public API in two AWS Regions. Normal traffic should go to the primary Region. If the primary endpoint becomes unhealthy, Route 53 should automatically route users to the secondary Region. What is the best Route 53 configuration approach?
151An orders service currently sends HTTP requests directly to two downstream services (inventory and shipping). During peak load, inventory slows down, causing the orders service to slow as well. The team wants the orders service to remain responsive even when a downstream service is temporarily slow or restarted. Which design change best achieves this resiliency goal?
152A payment worker consumes messages from an Amazon SQS queue. Sometimes the worker finishes the payment creation, but a timeout prevents message deletion and the same payment request is delivered again. Which two design changes best reduce the risk of duplicate charges and keep bad messages from looping forever? Select two.
153An application uses an Amazon RDS Multi-AZ DB instance. During a failover test, connections fail until the application is restarted, even though the database comes back online. Which two changes should the team make to improve resilience during failover? Select two.
154An internal API is deployed in two AWS Regions behind separate Application Load Balancers. The company wants clients to use the primary Region when it is healthy and automatically switch to the secondary Region if the primary health check fails. Which two Route 53 record configurations are required? Select two.
155A production Amazon Aurora MySQL database is corrupted by a bad migration at 10:30 UTC, and the problem is discovered at 10:45 UTC. The team wants to recover to the state just before the migration with minimal manual effort. Which two actions should they take? Select two.
156An order service must notify inventory, shipping, and analytics independently when payment succeeds. The shipping service may be slow, but the order service should keep accepting new orders even if one consumer is unavailable. Which two changes best improve resilience? Select two.
157Based on the exhibit, the team wants to stop poison messages from consuming worker capacity and also prevent duplicate side effects if the same message is delivered more than once. Which design change best meets the requirement?
158Based on the exhibit, a faulty deployment corrupted production data at 10:30 UTC and the issue was discovered at 10:55 UTC. The team needs to recover the database to the last good state before the corruption. Which action should they take?
159Based on the exhibit, the application team wants the database to keep the same connection endpoint during failover and to reconnect automatically after the primary instance becomes unavailable. Which change best meets the requirement?
160Based on the exhibit, which Route 53 configuration should be used so traffic is automatically routed to the secondary Region only when the primary Region becomes unhealthy?
161Based on the exhibit, the business needs Regional disaster recovery with an RTO of 45 minutes and an RPO of 15 minutes. The solution should keep cost lower than running two fully active production environments. Which DR strategy is the best fit?
162Based on the exhibit, the application should continue serving requests if one Availability Zone fails. Which change best improves resilience with the least operational complexity?
163Based on the exhibit, some SQS messages fail validation repeatedly and continue consuming worker time. What change best prevents the bad messages from being retried forever?
164Based on the exhibit, the web team wants the application to continue serving traffic if one Availability Zone fails. Which change best meets the requirement with the least operational overhead?
165A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?
166A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?
167A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?
168A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?
169A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?
170A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?
171A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured?
172A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used?
173A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?
174A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?
175An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?
176A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?
177A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?
178A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?
179A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?
180A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?
181A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?
182A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?
183An inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured?
184A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used?
185A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?
186A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?
187A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?
188A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?
189A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.
190A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.
191A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.
192A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.
193A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.
194A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.
195A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.
196A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.
197A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.
198A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.
199An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The design must avoid adding custom operational scripts.
200A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.
201A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.
202A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.
203A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.
204A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.
205A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.
206A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.
207An inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.
208A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.
209A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.
210A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.
211A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The design must avoid adding custom operational scripts.
212A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.
213A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.
214A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.
215A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.
216A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.
217A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.
218A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.
219A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.
220A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.
221A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.
222A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.
223A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.
224A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.
225A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.
226A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.
227A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.
228A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.
229A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.
230A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.
231A inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.
232A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.
233A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.
234A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.
235A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.
236A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.
237A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.
238A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.
239A regional web application for a inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The team wants the control to be enforceable during normal operations.
240A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The team wants the control to be enforceable during normal operations.
241A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The team wants the control to be enforceable during normal operations.
242A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The team wants the control to be enforceable during normal operations.
243A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The team wants the control to be enforceable during normal operations.
244A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The team wants the control to be enforceable during normal operations.
245A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The team wants the control to be enforceable during normal operations.
246A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The team wants the control to be enforceable during normal operations.
247A inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The team wants the control to be enforceable during normal operations.
248A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The team wants the control to be enforceable during normal operations.
249A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.
250A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.
Watch out for
Common Design Resilient Architectures exam traps
- Answering from memory before reading the full scenario.
- Missing a constraint such as cost, availability, security, scope or command context.
- Choosing a broad answer when the question asks for the most specific fix.
- Ignoring why the wrong options are tempting.
Frequently asked questions
- What does the Design Resilient Architectures domain cover on the SAA-C03 exam?
- Design Resilient Architectures questions test whether you can apply the concept in context, not just recognise a definition.
- How many questions are in this domain?
- This page lists all 250 Design Resilient Architectures questions in the SAA-C03 question bank. The actual exam draws from this domain proportionally to its weighting in the official exam blueprint.
- What is the best way to practise this domain?
- Start with a short focused session (10 questions) to identify gaps, then use the interactive practice page to work through explanations. Repeat with a longer session once the weak areas feel solid.
- Can I practise only Design Resilient Architectures questions?
- Yes. The session launcher on this page filters questions to this domain only. Choose any session length or try the interactive practice page for inline explanations.