CCNA Design Resilient Questions — Page 2 of 4

MCQ (medium)

An Auto Scaling group behind an Application Load Balancer frequently replaces new EC2 instances. The application needs ~6 minutes to warm up after instance launch. However, the ALB target group health checks start immediately and mark the targets unhealthy until the application is ready. Because the targets become unhealthy early, the Auto Scaling group then terminates the instances and launches replacements, creating a repeated unhealthy/termination loop. What configuration change will most directly improve recovery by preventing premature ASG termination while the application is warming up?

A. Set a health check grace period on the Auto Scaling group that exceeds the application startup/warm-up time.

B. Increase the Auto Scaling group's desired capacity to a higher number than required.

C. Disable ALB target group health checks so instances are considered healthy as soon as they register.

D. Change the Auto Scaling health check type from ELB to EC2 so the ALB will no longer determine instance health.

Answer: A

A health check grace period delays when the Auto Scaling group starts evaluating instance health. This prevents the ASG from terminating instances due to ALB/target health being unhealthy during the initial warm-up window, breaking the unhealthy/termination loop.

Why this answer

The health check grace period on an Auto Scaling group (ASG) tells the group to ignore failed health checks for a newly launched instance until the period has elapsed. Setting it above the application's ~6-minute warm-up time means the ASG does not terminate an instance merely because the ALB still reports it unhealthy during startup. This directly breaks the unhealthy/termination loop while the application initializes.

Exam trap

The trap here is that candidates may think disabling health checks or changing the health check type is a valid fix, but the correct solution is to use the ASG's built-in grace period to decouple early health check failures from termination decisions.

How to eliminate wrong answers

Option B is wrong because increasing the desired capacity does not address the root cause of premature termination; it only adds more instances that will also be terminated during the warm-up period. Option C is wrong because without target group health checks the ALB would send traffic to instances that are still warming up, and it could no longer stop routing to instances that fail later. Option D is wrong because switching the health check type to EC2 makes the ASG ignore ALB results permanently, not just during warm-up. The loop would stop, since EC2 status checks pass soon after launch, but the ASG would no longer replace instances whose application has failed while the instance itself is still running.
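The grace-period fix can be sketched as the parameters for an Auto Scaling group update. This is a minimal sketch, not a definitive configuration: the group name `web-asg` and the 25% safety margin are assumptions for illustration. `HealthCheckType` and `HealthCheckGracePeriod` are the parameter names of the boto3 `update_auto_scaling_group` call, which is only referenced in a comment and not invoked here.

```python
import math


def grace_period_seconds(warmup_seconds: int, safety_margin: float = 0.25) -> int:
    """Return a grace period that exceeds the warm-up time by a margin."""
    return math.ceil(warmup_seconds * (1 + safety_margin))


def build_asg_update(asg_name: str, warmup_seconds: int) -> dict:
    """Build keyword arguments for an Auto Scaling group update.

    The ELB health check type is kept so the ALB still decides instance
    health once the grace period has elapsed.
    """
    return {
        "AutoScalingGroupName": asg_name,  # hypothetical group name
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_period_seconds(warmup_seconds),
    }


# Roughly 6 minutes of warm-up, as in the question.
params = build_asg_update("web-asg", warmup_seconds=6 * 60)
print(params)
# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("autoscaling").update_auto_scaling_group(**params)
```

Keeping the health check type as ELB reflects why option A is preferred over option D: the ALB result is only ignored during warm-up, not permanently.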

MCQ (medium)

A company runs an internet-facing API in two AWS Regions. Route 53 currently uses simple routing to a primary Application Load Balancer (ALB) DNS name. When the primary Region experiences an outage, customers wait a long time because the DNS entry is not changed automatically. The team wants automatic failover: if the primary Region ALB health check fails for a sustained period, Route 53 should route users to the secondary Region ALB. Which Route 53 approach best meets this requirement?

A. Use Route 53 failover routing with a PRIMARY and SECONDARY record set for the same name, and attach health checks to the ALBs.

B. Use latency-based routing so Route 53 automatically spreads traffic to both Regions based on measured latency.

C. Use weighted routing and configure the secondary ALB to receive 100% traffic when the primary returns HTTP 5xx responses.

D. Use geolocation routing and restrict the primary Region record to specific countries only.

Answer: A

Failover routing is designed for active/passive DNS failover. Route 53 evaluates health checks for the PRIMARY record and automatically serves the SECONDARY record when the PRIMARY is considered unhealthy for the configured evaluation period.

Why this answer

Route 53 failover routing is designed specifically for active-passive failover scenarios. By creating PRIMARY and SECONDARY record sets with the same DNS name and attaching health checks to the ALBs, Route 53 will automatically route traffic to the secondary ALB when the primary ALB health check fails for a sustained period. This meets the requirement for automatic failover without manual intervention.

Exam trap

The trap here is that candidates often confuse failover routing with latency-based or weighted routing, assuming that latency-based routing inherently provides failover. It does not: it optimizes for performance, not availability.

How to eliminate wrong answers

Option B is wrong because latency-based routing distributes traffic based on lowest latency, not health status; it does not provide automatic failover from a primary to a secondary region when the primary is unhealthy. Option C is wrong because weighted routing distributes traffic based on fixed weights and does not automatically shift 100% traffic to the secondary based on HTTP 5xx responses; it requires external automation or custom health checks to adjust weights. Option D is wrong because geolocation routing directs traffic based on the geographic location of the user, not on the health of the endpoint; it cannot automatically failover from a primary to a secondary region when the primary is unhealthy.
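The PRIMARY/SECONDARY pair from option A can be sketched as a Route 53 change batch. This is an illustrative sketch under assumptions: the record name, ALB DNS names, and hosted zone IDs are placeholders, and the dictionary is only built, not submitted. The field names (`Failover`, `SetIdentifier`, `AliasTarget`, `EvaluateTargetHealth`) follow the `ChangeBatch` structure accepted by the boto3 `change_resource_record_sets` call.

```python
def failover_record(name: str, role: str, alb_dns: str, alb_zone_id: str) -> dict:
    """Build one alias record set for Route 53 failover routing."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-alb",
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,   # the ALB's own hosted zone ID
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,  # let Route 53 follow ALB health
            },
        },
    }


# All names and IDs below are placeholders, not real resources.
change_batch = {
    "Changes": [
        failover_record("api.example.com.", "PRIMARY",
                        "primary-alb.example-region-1.elb.amazonaws.com.",
                        "ZPRIMARYPLACEHOLDER"),
        failover_record("api.example.com.", "SECONDARY",
                        "secondary-alb.example-region-2.elb.amazonaws.com.",
                        "ZSECONDARYPLACEHOLDER"),
    ]
}
print([c["ResourceRecordSet"]["Failover"] for c in change_batch["Changes"]])
```

Both records share the same name and type and differ only in `SetIdentifier`, `Failover` role, and target, which is what lets Route 53 treat them as one active/passive pair.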

MCQ (medium)

A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?

A. Use the Aurora cluster endpoint for write traffic, use the reader endpoint for read-only traffic, and implement connection retry or reconnect logic on failover.

B. Keep using the original writer instance endpoint so the database host name never changes during failover.

C. Convert the Aurora cluster to Single-AZ so there is only one database node to connect to.

D. Place Route 53 in front of the database and manually update DNS records whenever failover occurs.

Answer: A

The cluster endpoint always targets the current writer, and failover-aware reconnect logic helps the application recover from dropped connections after promotion.

Why this answer

The Aurora cluster endpoint automatically points to the current writer instance and updates DNS after a failover, so the application can reconnect without manual intervention. However, because the application keeps long-lived connections, it must implement connection retry or reconnect logic to detect the broken connection and re-resolve the DNS name to the new writer. This combination ensures writes resume automatically after failover.

Exam trap

The trap here is that candidates assume the cluster endpoint alone solves the problem, forgetting that long-lived connections must be re-established after failover, which requires explicit retry or reconnect logic in the application.

How to eliminate wrong answers

Option B is wrong because the writer instance endpoint is tied to a specific database node; after a failover the host name still resolves to that same node, which has been demoted to a reader or is unavailable, so connections to it no longer accept writes. Option C is wrong because converting to Single-AZ removes the failover capability entirely, making the system less resilient and still subject to interruptions during maintenance. Option D is wrong because manually updating Route 53 records is slow, error-prone, and defeats the purpose of automated failover; Aurora already provides managed endpoints that update automatically.
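The reconnect logic that option A relies on can be sketched generically. This is a hedged illustration, not a driver-specific implementation: `ConnectionLost`, `fake_connect`, and `fake_write` are hypothetical stand-ins, and a real application would catch its database driver's own connection errors and open connections against the cluster endpoint.

```python
import time


class ConnectionLost(Exception):
    """Stand-in for a driver error raised when the connection drops."""


def run_with_reconnect(connect, operation, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep):
    """Run operation(conn), opening a fresh connection after each failure.

    connect() is called again on every attempt, so the endpoint name is
    resolved again and the new writer can be reached after a failover.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            conn = connect()
            return operation(conn)
        except ConnectionLost as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error


# Simulated failover: the first connection reaches the demoted writer.
attempts = []


def fake_connect():
    attempts.append(1)
    return "old-writer" if len(attempts) == 1 else "new-writer"


def fake_write(conn):
    if conn == "old-writer":
        raise ConnectionLost("writer was demoted during failover")
    return f"written via {conn}"


result = run_with_reconnect(fake_connect, fake_write, sleep=lambda s: None)
print(result)
```

The key design choice is that a new connection is created on every retry instead of reusing the long-lived one, since the stale connection is what keeps pointing at the old writer.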

MCQ (medium)

A company hosts a public API using two AWS regions behind a single custom domain. Route 53 is configured with latency-based routing and health checks. During a regional outage, application metrics confirm the primary API is unhealthy, but clients still resolve to the primary region for most requests. Which DNS configuration change will most directly ensure automatic failover to the secondary region when the primary fails?

A. Change the record type to A/AAAA alias with an active-active routing policy so both regions always receive equal traffic.

B. Switch to Route 53 failover routing: configure the primary record with the primary health check and the secondary record with the secondary failover health check.

C. Keep latency-based routing but shorten the health check interval to 5 seconds.

D. Use geolocation routing so requests from each country route to the nearest region.

Answer: B

Failover routing is designed for disaster recovery-style behavior using health checks. Route 53 returns the primary record when its health check is passing, and it automatically switches resolution to the secondary record when the primary health check fails. This directly matches the requirement that clients should mostly move to the secondary region during a primary regional outage.

Why this answer

Option B is correct because Route 53 failover routing with health checks explicitly directs traffic to the secondary region when the primary health check fails. This ensures automatic failover at the DNS level. Latency-based routing does not guarantee failover even with health checks: it is designed to reduce latency and may still return unhealthy records if no healthy alternative exists.

Exam trap

The trap here is that candidates assume latency-based routing with health checks will automatically fail over. It only routes to the lowest-latency healthy endpoint and, if no healthy endpoint exists, may still return unhealthy records, whereas failover routing explicitly switches to the secondary record on health check failure.

How to eliminate wrong answers

Option A is wrong because an alias record type is not a routing policy and Route 53 has no routing policy named active-active; splitting traffic equally between both regions would keep sending requests to the failed region instead of failing over. Option C is wrong because Route 53 health checks support request intervals of 30 seconds (standard) or 10 seconds (fast), not 5 seconds, and a shorter interval only speeds up failure detection; it does not give latency-based routing an explicit primary/secondary failover order. Option D is wrong because geolocation routing routes based on client location, not health, and does not automatically fail over to a secondary region when the primary is unhealthy.

MCQ (hard)

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The team wants the control to be enforceable during normal operations.

A. Use an in-memory queue on one EC2 instance

B. Use UDP messages sent directly to workers

C. Use Amazon SQS standard queue and design consumers to be idempotent

D. Use CloudFront signed URLs

Answer: C

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery, ensuring every event is processed at least once, which matches the requirement. Duplicate processing is acceptable because the team can design consumers to be idempotent, handling duplicates without side effects. SQS is a fully managed, scalable, and durable service that enforces this behavior during normal operations without requiring custom infrastructure.

Exam trap

The trap here is that candidates may confuse 'at-least-once' delivery with 'exactly-once' delivery, or incorrectly assume that UDP or in-memory queues can provide reliable event processing, when in fact only a managed queue service like SQS with idempotent consumers meets the stated requirement for enforceability during normal operations.

How to eliminate wrong answers

Option A is wrong because an in-memory queue on a single EC2 instance is not durable, cannot survive instance failures, and does not provide at-least-once delivery guarantees across restarts or scaling events. Option B is wrong because UDP is a connectionless, unreliable protocol that does not guarantee message delivery, order, or duplicate detection, making it unsuitable for at-least-once processing. Option D is wrong because CloudFront signed URLs are used for access control to content delivery, not for event processing or messaging, and they do not provide any delivery guarantee or queue semantics.
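What "design consumers to be idempotent" means can be sketched with a consumer that remembers which message IDs it has already applied. This is a simplified illustration under assumptions: the message shape (`id`, `body`) is hypothetical, and the in-memory set stands in for a durable store that a real consumer would need so that the record of processed IDs survives restarts.

```python
class IdempotentConsumer:
    """Apply each message at most once, even if it is delivered twice."""

    def __init__(self):
        self.seen_ids = set()   # stand-in for a durable deduplication store
        self.applied = []       # side effects performed so far

    def handle(self, message: dict) -> bool:
        """Apply a message; return False when it is a duplicate delivery."""
        message_id = message["id"]
        if message_id in self.seen_ids:
            return False
        self.applied.append(message["body"])
        self.seen_ids.add(message_id)
        return True


# At-least-once delivery: message "m-1" arrives twice.
deliveries = [
    {"id": "m-1", "body": "admit patient 17"},
    {"id": "m-2", "body": "update record 42"},
    {"id": "m-1", "body": "admit patient 17"},
]
consumer = IdempotentConsumer()
outcomes = [consumer.handle(m) for m in deliveries]
print(outcomes, consumer.applied)
```

Every event is still processed at least once, and the duplicate is acknowledged without repeating its side effect.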

MCQ (medium)

A team accidentally updates critical rows in an Amazon RDS for PostgreSQL database. Automated backups are enabled. They need to recover the data to the exact state as of 90 minutes ago. They also cannot risk interrupting the current production database instance while investigators validate the restored data. Which recovery strategy best meets these constraints?

A. Use point-in-time recovery (PITR) to restore to a new RDS DB instance as of 90 minutes ago, then validate and cut over after approval.

B. Restore a manual snapshot and overwrite the existing production DB instance so the data matches exactly 90 minutes ago.

C. Wait for the next automated backup window and then restart the current DB instance to roll back changes automatically.

D. Use cross-region read replicas to rewind changes and promote the replica to become the writer immediately.

Answer: A

PITR can restore a new DB instance to a specific timestamp using automated backups and transaction logs. Restoring to a separate instance avoids overwriting or interrupting the existing production instance during validation.

Why this answer

Point-in-time recovery (PITR) for Amazon RDS restores a DB instance to a chosen second within the backup retention period, up to the latest restorable time, using automated backups and transaction logs. PITR always creates a new DB instance, so restoring to 90 minutes ago yields an isolated copy for validation without affecting the production database. This meets both the requirement to recover the state as of 90 minutes ago and the constraint of not interrupting the current production instance.

Exam trap

The trap here is that candidates confuse point-in-time recovery with snapshot restoration or assume that read replicas can be used for time-based rollbacks, but only PITR provides the exact time-targeted restore without affecting the production instance.

How to eliminate wrong answers

Option B is wrong because a manual snapshot captures the database only at the moment it was taken, so it cannot target an exact point 90 minutes ago, and replacing the production DB instance with restored data would interrupt the production database. Option C is wrong because waiting for the next automated backup window does not roll back changes; automated backups are for restoration, not for in-place rollback, and the database would continue to operate with the erroneous data. Option D is wrong because cross-region read replicas replicate data asynchronously and cannot rewind changes; promoting a replica does not revert data to a past state, and the replica would contain the same erroneous updates.
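The restore in option A can be sketched as the arguments for a point-in-time restore to a separate instance. This is a minimal sketch under assumptions: the instance identifiers are hypothetical, and the request is only built, not sent. `SourceDBInstanceIdentifier`, `TargetDBInstanceIdentifier`, and `RestoreTime` are parameter names of the boto3 `restore_db_instance_to_point_in_time` call.

```python
from datetime import datetime, timedelta, timezone


def build_pitr_restore(source_id: str, target_id: str, minutes_ago: int,
                       now: datetime = None) -> dict:
    """Build arguments for restoring to a new instance at a past time."""
    if target_id == source_id:
        # The restore must land on a separate instance so production
        # keeps running while the restored data is validated.
        raise ValueError("target must differ from the source instance")
    now = now or datetime.now(timezone.utc)
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": now - timedelta(minutes=minutes_ago),
    }


# A fixed clock keeps the example reproducible; identifiers are placeholders.
clock = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
request = build_pitr_restore("prod-db", "prod-db-validation", 90, now=clock)
print(request["RestoreTime"].isoformat())
```

After investigators approve the restored copy, the cutover itself (for example, repointing the application) is a separate step that this sketch does not cover.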

Drag & Drop (medium)

Order the steps to create a static website using Amazon S3 and CloudFront.

Why this order

Create the S3 bucket and enable static website hosting, upload the site files, create the CloudFront distribution, configure the distribution, and finally set up DNS.

MCQ (medium)

Based on the exhibit, the business needs Regional disaster recovery with an RTO of 45 minutes and an RPO of 15 minutes. The solution should keep cost lower than running two fully active production environments. Which DR strategy is the best fit?

A. Backup and restore only, because the existing daily backups are already in another Region.

B. Pilot light, because the recovery Region only needs minimal resources and can be scaled after a disaster.

C. Warm standby, because a scaled-down but fully functional copy can take traffic quickly while keeping costs below full duplication.

D. Active-active, because it minimizes RTO by keeping both Regions fully live all the time.

Answer: C

Warm standby keeps a functional copy of the environment running in the recovery Region at reduced capacity. That shortens failover time compared with backup and restore or pilot light, while still costing less than a fully scaled second production stack. With continuous or near-continuous data replication and automated cutover, it can satisfy an RTO of 45 minutes and an RPO of 15 minutes.

Why this answer

Warm standby is the best fit because it maintains a scaled-down but fully functional copy of the production environment in the recovery Region, which can be quickly scaled up to handle production traffic. This meets the RTO of 45 minutes and RPO of 15 minutes by keeping the standby environment ready with continuously replicated data (e.g., using a cross-Region read replica, which replicates asynchronously), while costing less than two fully active environments since the standby runs on smaller instances or fewer resources until failover.

Exam trap

The trap here is that candidates often confuse pilot light with warm standby, assuming minimal resources can be scaled quickly enough to meet a 45-minute RTO, but pilot light requires provisioning and configuring additional resources (e.g., launching EC2 instances, attaching volumes) which typically exceeds that time window, whereas warm standby already has a running (though scaled-down) environment ready to accept traffic.

How to eliminate wrong answers

Option A is wrong because backup and restore only cannot achieve an RTO of 45 minutes or RPO of 15 minutes; restoring from daily backups would take hours and lose up to 24 hours of data, far exceeding the required RPO. Option B is wrong because pilot light keeps only core components such as replicated data stores running, with application servers switched off or not yet provisioned, so the full infrastructure must be provisioned and scaled after a disaster, which typically takes longer than 45 minutes to become fully operational and fails the RTO requirement. Option D is wrong because active-active keeps both Regions fully live all the time, which incurs the cost of two fully active production environments, contradicting the requirement to keep costs lower than full duplication.

MCQ (easy)

Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?

A. Create a read replica in the same Availability Zone as the primary database.

B. Convert the database to a Multi-AZ RDS deployment.

C. Increase the backup retention period to 35 days.

D. Move the database to an EC2 instance with an attached EBS volume.

Answer: B

A Multi-AZ RDS deployment keeps a synchronous standby in another Availability Zone and automatically fails over when the primary fails. This matches the requirement for minimal manual intervention and preserves the same database endpoint, so the application does not need connection string changes. It is the standard AWS choice for resilient relational databases.

Why this answer

Option B is correct because a Multi-AZ RDS deployment automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary AZ fails, Amazon RDS automatically fails over to the standby, typically within 60–120 seconds, without requiring manual intervention or changes to the application connection string.

Exam trap

The trap here is that candidates confuse read replicas (which are asynchronous and require manual promotion) with Multi-AZ deployments (which provide automatic synchronous failover), often selecting a read replica in the same AZ because they think it offers high availability without understanding the fundamental replication mode difference.

How to eliminate wrong answers

Option A is wrong because a read replica in the same AZ does not provide automatic failover; it is designed for read scaling, not high availability, and requires manual promotion. Option C is wrong because increasing the backup retention period to 35 days only affects point-in-time recovery and automated backups, not failover capability. Option D is wrong because moving the database to an EC2 instance with an attached EBS volume requires custom scripting or third-party tools to implement automatic failover, and EBS volumes are AZ-specific, so they cannot survive an AZ outage without manual intervention.
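Converting to Multi-AZ, as in option B, can be sketched as a check over an inventory of DB instances that produces a modify request for each one that is still Single-AZ. This is an illustration under assumptions: the instance identifiers are made up, and the input only mirrors two keys (`DBInstanceIdentifier`, `MultiAZ`) from entries in a `describe_db_instances` response. The output keys are parameter names of the boto3 `modify_db_instance` call, which is not invoked here.

```python
def multi_az_conversions(db_instances: list) -> list:
    """Return modify requests for every instance that is still Single-AZ."""
    requests = []
    for instance in db_instances:
        if not instance.get("MultiAZ", False):
            requests.append({
                "DBInstanceIdentifier": instance["DBInstanceIdentifier"],
                "MultiAZ": True,
                # Assumption: apply now instead of waiting for the
                # next maintenance window.
                "ApplyImmediately": True,
            })
    return requests


# Hypothetical inventory: one Single-AZ instance, one already Multi-AZ.
inventory = [
    {"DBInstanceIdentifier": "orders-db", "MultiAZ": False},
    {"DBInstanceIdentifier": "audit-db", "MultiAZ": True},
]
pending = multi_az_conversions(inventory)
print(pending)
```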

MCQ (medium)

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?

A. Aurora Global Database

B. A single-AZ Aurora cluster

C. An ElastiCache Redis replica

D. Manual snapshots copied monthly

Answer: A

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is designed for cross-Region disaster recovery with a typical RPO of 1 second or less, using storage-based replication that does not impact database performance. This meets the requirement for fast failover and low data loss, unlike manual snapshot-based approaches which have higher RPO and slower recovery.

Exam trap

The trap here is that candidates may confuse cross-Region read replicas (which have higher lag and manual promotion) with Aurora Global Database, or assume that ElastiCache or single-AZ deployments can provide adequate DR, when only Aurora Global Database meets the low RPO and fast cross-Region recovery requirements.

How to eliminate wrong answers

Option B is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, providing no disaster recovery across Regions. Option C is wrong because ElastiCache Redis is an in-memory cache, not a persistent database, and cannot serve as the primary data store for ticket booking transactions or provide cross-Region DR for the Aurora MySQL data. Option D is wrong because manual snapshots copied monthly result in an RPO of up to one month, which is far too high for the low RPO requirement, and recovery would require provisioning a new cluster from the snapshot, leading to significant downtime.

Multi-Select (hard)

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?

Select 2 answers

A. Point-in-time recovery

B. DAX

C. Deletion protection or tightly controlled delete permissions

D. Global secondary indexes

Answers: A, C

PITR allows restoration to a specific second within the supported recovery window.

Why this answer

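Point-in-time recovery (PITR) for DynamoDB enables continuous backups with per-second granularity over a recovery window of up to 35 days, allowing restoration to any second within that window. This directly satisfies the requirement for point-in-time recovery by providing the ability to restore the table to a specific state before a data corruption or accidental write event. Deletion protection, or tightly controlled delete permissions, covers the second requirement by preventing the table itself from being deleted by accident.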

Exam trap

The trap here is that candidates often confuse DAX or GSIs as data protection mechanisms, but neither provides backup, recovery, or deletion prevention—they are performance and query optimization features, not resilience controls.
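The two settings in answers A and C can be sketched as a pair of requests for one table. This is a minimal sketch under assumptions: the table name `claims` is hypothetical and nothing is sent to AWS. The first dictionary follows the parameters of the boto3 `update_continuous_backups` call and the second those of `update_table`.

```python
def protection_requests(table_name: str) -> tuple:
    """Build the PITR and deletion-protection requests for a table."""
    pitr_request = {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {
            "PointInTimeRecoveryEnabled": True,
        },
    }
    deletion_request = {
        "TableName": table_name,
        "DeletionProtectionEnabled": True,
    }
    return pitr_request, deletion_request


# "claims" is a placeholder table name.
pitr, deletion = protection_requests("claims")
print(pitr)
print(deletion)
# With boto3 available, these could be passed as:
#   client = boto3.client("dynamodb")
#   client.update_continuous_backups(**pitr)
#   client.update_table(**deletion)
```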

MCQ (hard)

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.

A. Amazon EFS with mount targets in multiple Availability Zones

B. S3 mounted as a POSIX file system without a file gateway

C. Instance store volumes

D. An EBS volume attached to all instances

Answer: A

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, NFS-based shared file system that can be mounted concurrently by multiple Linux EC2 instances across different Availability Zones. With a mount target in each AZ, the file system remains accessible even if one AZ fails, because instances in the surviving AZs keep reaching it through their own mount targets. This meets the requirement for shared, resilient storage with a managed AWS-native control plane.

Exam trap

The trap here is that candidates often confuse EBS Multi-Attach (which is limited to a single AZ and specific instance types) with the cross-AZ shared file system capability of EFS, or incorrectly assume S3 with a FUSE mount can replace a POSIX-compliant file system.

How to eliminate wrong answers

Option B is wrong because mounting S3 as a POSIX file system (e.g., using s3fs-fuse) does not provide full POSIX semantics such as file locking, and it introduces performance issues unsuitable for shared file workloads; it also requires a third-party tool, not a fully managed AWS-native service. Option C is wrong because instance store volumes are ephemeral, tied to the lifecycle of a single EC2 instance, and cannot be shared across instances or survive an AZ failure. Option D is wrong because an EBS volume exists in a single Availability Zone and can be attached only to instances in that AZ. It normally attaches to one instance at a time, and even Multi-Attach, which is limited to specific EBS volume types, does not work across AZs or provide a shared file system.

MCQ (easy)

A worker service consumes messages from an Amazon SQS queue. Some messages are malformed and always fail validation. The worker retries, but it keeps reprocessing the same bad messages and consumes processing capacity that should be used for valid work. What is the best solution to prevent “poison messages” from blocking progress?

A. Configure a Dead-Letter Queue (DLQ) and set a redrive policy so messages move to the DLQ after a maximum number of receives.

B. Increase the visibility timeout so the worker gets fewer retries per hour.

C. Disable SQS retries by deleting messages immediately on any processing error.

D. Create a second worker that polls the queue less frequently until the malformed message is processed successfully.

Answer: A

DLQs isolate repeatedly failing messages so they stop consuming worker capacity and can be analyzed later.

Why this answer

A Dead-Letter Queue (DLQ) with a redrive policy is the standard AWS mechanism for handling poison messages. By setting a maximum receive count (e.g., 5), the SQS queue automatically moves messages that fail processing repeatedly to the DLQ, isolating them from the main queue. This prevents the worker from wasting capacity on invalid messages and allows the main queue to continue processing valid work without interruption.

Exam trap

The trap here is that candidates may think increasing the visibility timeout or deleting messages on error is a valid solution, but AWS specifically designed the DLQ pattern to isolate poison messages without losing data or impacting throughput.

How to eliminate wrong answers

Option B is wrong because increasing the visibility timeout only delays the retry; it does not prevent the worker from eventually reprocessing the same bad message, so the poison message still consumes processing capacity. Option C is wrong because deleting messages immediately on any error would discard them without any chance for recovery or analysis, and it would also drop valid messages that failed for transient reasons. Option D is wrong because creating a second worker that polls less frequently does not solve the problem: the malformed message will never pass validation, so it will still be retried and consume capacity, and a slower poll rate only reduces throughput without addressing the root cause.
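The redrive policy from option A can be sketched as the queue attribute that links a source queue to its DLQ. This is an illustrative sketch under assumptions: the ARN is a placeholder and the attribute is only built, not applied. `RedrivePolicy` with `deadLetterTargetArn` and `maxReceiveCount` is the attribute format accepted by the boto3 `set_queue_attributes` call; the value 5 matches the example count in the explanation above.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the queue attribute that routes poison messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not a nested dictionary.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for a hypothetical dead-letter queue.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")
print(attrs)
# With boto3 available, this could be applied to the source queue as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=source_queue_url,
#                                            Attributes=attrs)
```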

MCQ (medium)

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.

A. AWS WAF

B. Amazon Route 53 weighted routing

C. Amazon SQS queue

D. Amazon CloudFront

Answer: C

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a durable, scalable message buffer that decouples the web tier from the fulfilment workers. When order bursts arrive, messages are stored reliably in the queue, and workers can poll at their own pace, retrying failed messages automatically without any custom scripts. This pattern absorbs spikes and ensures no requests are lost, meeting the requirement for a fully managed, serverless integration.

Exam trap

The trap here is that candidates often confuse load-balancing or traffic-routing services (like Route 53 or CloudFront) with message queuing, mistakenly thinking they can absorb processing spikes, whereas only a queue like SQS provides durable storage and asynchronous decoupling for request bursts.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules (e.g., SQL injection, XSS) and does not provide message buffering, decoupling, or retry capabilities for downstream services. Option B is wrong because Amazon Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights, but it operates at the DNS level and cannot absorb processing spikes or retry failed requests; it simply routes new connections. Option D is wrong because Amazon CloudFront is a content delivery network (CDN) that caches static and dynamic content at edge locations to reduce latency, but it does not offer message queuing, buffering, or retry logic for backend processing workloads.
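How a queue absorbs bursts can be illustrated with a small simulation. This is a toy model, not SQS itself: the burst sizes, the fixed worker rate, and the tick-based clock are assumptions chosen for illustration. It shows that a buffer between producer and consumer lets a slow worker eventually process every request.

```python
from collections import deque


def simulate(bursts: list, worker_rate: int) -> tuple:
    """Return (requests processed, peak queue depth) for a burst pattern.

    bursts[i] is the number of requests arriving at tick i; the worker
    handles at most worker_rate requests per tick.
    """
    if worker_rate < 1:
        raise ValueError("worker_rate must be at least 1")
    queue = deque()
    processed = 0
    peak_depth = 0
    tick = 0
    while tick < len(bursts) or queue:
        if tick < len(bursts):
            for n in range(bursts[tick]):
                queue.append((tick, n))      # buffer the incoming request
        peak_depth = max(peak_depth, len(queue))
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()                   # worker drains at its own pace
            processed += 1
        tick += 1
    return processed, peak_depth


# Two bursts (10 and 8 orders) against a worker that handles 3 per tick.
total, peak = simulate([10, 0, 0, 8, 0], worker_rate=3)
print(total, peak)
```

The worker never sees more than its fixed rate per tick, yet all 18 requests are eventually handled, which is the decoupling behavior the question asks for.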

MCQ (easy)

A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?

A. Use an RDS single-AZ instance and periodically restore from automated backups if needed.

B. Deploy the RDS PostgreSQL instance as Multi-AZ with automatic failover enabled.

C. Create a read replica in a different AZ and use it only when the primary fails.

D. Use RDS with Multi-AZ disabled, but increase storage IOPS to prevent failover.

Answer: B

Multi-AZ RDS maintains a standby instance in a different AZ. If the primary fails, RDS performs automatic failover, preserving the same database endpoint behavior.

Why this answer

Option B is correct because RDS Multi-AZ for PostgreSQL automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary AZ fails, Amazon RDS automatically fails over to the standby, typically within 60–120 seconds, with no changes required to the application's connection string (the DNS name remains the same). This meets the requirement for minimal application changes and automatic failover within the same Region.

Exam trap

The trap here is that candidates often confuse a read replica (which requires manual promotion and DNS changes) with a Multi-AZ standby (which provides automatic, transparent failover), leading them to incorrectly select Option C.

How to eliminate wrong answers

Option A is wrong because restoring from automated backups is a manual process that can take hours, not an automatic failover, and it does not meet the requirement for minimal application changes. Option C is wrong because a read replica is designed for read scaling, not automatic failover; promoting a read replica to primary requires manual intervention and a DNS change, which violates the 'minimal application changes' requirement. Option D is wrong because disabling Multi-AZ and increasing IOPS does not provide any failover capability; it only improves performance and does not protect against an AZ failure.

Multi-Selectmedium

A company is deploying a stateless web application on Amazon ECS with Fargate. The application must be resilient to individual task failures and Availability Zone failures. Which three steps should the company take to achieve this resilience? (Choose three.)

Select 3 answers

A.Configure the ECS service to use a spread placement strategy across Availability Zones.

B.Set a minimum healthy percent of 50 and a maximum percent of 200 in the ECS service deployment configuration.

C.Place all ECS tasks in a single subnet to minimize network latency.

D.Use an Application Load Balancer (ALB) in front of the ECS service to distribute traffic across tasks.

E.Store application session data in an attached EFS file system shared across all tasks.

F.Disable automatic task replacement to avoid unnecessary task churn during failures.

Why this answer

Spreading the service's tasks across Availability Zones provides resilience against an AZ failure. On Fargate this is achieved by giving the service subnets in more than one AZ, and Fargate then spreads tasks across them on a best-effort basis; explicit placement strategies apply to the EC2 launch type, so read the first option as "spread tasks across AZs." Setting a minimum healthy percent of 50 and a maximum percent of 200 keeps at least half of the desired tasks running during a deployment while allowing up to double the desired count, so replacement tasks can start before old ones stop. Using an Application Load Balancer (ALB) in front of the ECS service distributes incoming traffic across healthy tasks in different AZs and stops routing to a task or AZ that fails its health checks.
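
As a rough worked example of the two deployment percentages, the sketch below computes the task-count limits that would apply during a rolling deployment. The function name and the desired count of 4 are illustrative assumptions, and the rounding directions (lower limit rounded up, upper limit rounded down) are an assumption based on the ECS deployment configuration documentation that is worth verifying.

```python
import math


def deployment_bounds(desired_count: int, min_healthy_percent: int, max_percent: int) -> tuple[int, int]:
    """Return (lower, upper) limits on running tasks during a rolling deployment."""
    # Assumed rounding: the lower limit rounds up, the upper limit rounds down.
    lower = math.ceil(desired_count * min_healthy_percent / 100)
    upper = math.floor(desired_count * max_percent / 100)
    return lower, upper


# With 4 desired tasks, 50% minimum healthy and 200% maximum:
print(deployment_bounds(4, 50, 200))
```

Under these assumptions, a service with 4 desired tasks keeps at least 2 running and may temporarily run up to 8 while new tasks replace old ones.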

Exam trap

The trap here is that candidates may confuse stateless applications with stateful ones and incorrectly choose to store session data in EFS, or they may think placing tasks in a single subnet improves performance without considering the single point of failure risk.

MCQeasy

Your company hosts an internal API in two AWS Regions. You want Amazon Route 53 to automatically send traffic to the secondary Region if the primary Region’s endpoint becomes unhealthy. Which Route 53 configuration best meets this requirement?

A.Latency-based routing with health checks for both Regions.

B.Failover routing with a primary record associated with a health check, and a secondary (failover) record associated with its own health check settings.

C.Weighted routing to distribute traffic evenly across both Regions.

D.Geolocation routing based on the client’s country to choose a Region.

AnswerB

Route 53 failover routing is explicitly designed for primary/secondary behavior. When the primary record’s health check fails, Route 53 automatically routes to the secondary (failover) record, matching the stated requirement.

Why this answer

Failover routing in Route 53 is specifically designed for active-passive configurations where traffic is directed to a primary resource unless a health check indicates it is unhealthy, at which point traffic is automatically routed to the secondary (failover) record. By associating a health check with the primary record, Route 53 can monitor the endpoint's health and perform the failover seamlessly. This directly meets the requirement to send traffic to the secondary Region when the primary endpoint becomes unhealthy.

Exam trap

The trap here is that candidates often confuse failover routing with latency-based routing, assuming that latency-based routing with health checks will automatically redirect traffic to the next best Region when one is unhealthy, but in reality, latency-based routing only selects the lowest-latency healthy endpoint and does not enforce a strict primary-secondary failover order.

How to eliminate wrong answers

Option A is wrong because latency-based routing answers with whichever healthy Region has the lowest latency for the client; health checks remove an unhealthy record from consideration, but there is no designated primary and secondary, so clients closer to the secondary Region are sent there even while the primary is healthy. Option C is wrong because weighted routing distributes traffic according to assigned weights; with health checks attached it skips an unhealthy record, but an even split still sends half of the traffic to the secondary Region during normal operation, which is not primary/secondary behavior. Option D is wrong because geolocation routing directs traffic based on the client's geographic location, not on endpoint health; it does not provide automatic failover to a designated secondary Region when the primary is unhealthy.
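
To make the primary/secondary behavior concrete, here is a minimal sketch of the decision a failover policy makes. It is a simplified model, not Route 53's implementation: the `Record` class, the function name, and the Region names are illustrative assumptions, and the case where both records are unhealthy is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Record:
    region: str
    role: str       # "PRIMARY" or "SECONDARY"
    healthy: bool


def resolve_failover(records: list[Record]) -> str:
    """Pick the Region a failover policy would answer with (simplified model)."""
    primary = next(r for r in records if r.role == "PRIMARY")
    secondary = next(r for r in records if r.role == "SECONDARY")
    # The primary is returned whenever it is healthy; otherwise this sketch
    # returns the secondary. Behavior when both are unhealthy is not modeled.
    if primary.healthy:
        return primary.region
    return secondary.region


records = [
    Record("us-east-1", "PRIMARY", healthy=False),
    Record("us-west-2", "SECONDARY", healthy=True),
]
print(resolve_failover(records))
```

Unlike latency-based or weighted routing, the choice here never depends on the client or on a traffic split: the secondary is used only when the primary's health check fails.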

MCQmedium

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.

A.Subnets in at least two Availability Zones with health checks enabled

B.All instances in one larger subnet

C.A Network Load Balancer in one subnet

D.A single EC2 instance with detailed monitoring

AnswerA

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option A is correct because distributing subnets across at least two Availability Zones ensures that if one AZ fails, the Auto Scaling group can launch replacement instances in the remaining AZ(s) to maintain capacity. Enabling health checks on the Application Load Balancer allows the Auto Scaling group to detect and replace unhealthy instances, enforcing resilience during normal operations without manual intervention.

Exam trap

The trap here is that candidates often assume a single larger subnet or a Network Load Balancer provides high availability, but they overlook the requirement for multi-AZ distribution and application-layer health checks to enforce resilience during normal operations.

How to eliminate wrong answers

Option B is wrong because placing all instances in one larger subnet within a single Availability Zone creates a single point of failure; if that AZ goes down, all instances are lost. Option C is wrong because a Network Load Balancer placed in one subnet is itself confined to a single Availability Zone, so it does not address multi-AZ fault tolerance; the load balancer type is not the issue, since the application already sits behind an ALB. Option D is wrong because a single EC2 instance, even with detailed monitoring, cannot survive an AZ failure; there is no second instance in another AZ to take over.

MCQmedium

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.

A.Aurora Global Database

B.A single-AZ Aurora cluster

C.An ElastiCache Redis replica

D.Manual snapshots copied monthly

AnswerA

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is the correct choice because it provides a managed, cross-Region disaster recovery solution with a Recovery Point Objective (RPO) of typically less than 1 second, using storage-level replication that has minimal impact on database performance. This meets the requirement for fast failover and low data loss, and it is a managed AWS-native feature, which is what the architecture review board prefers.

Exam trap

The trap here is that candidates may confuse cross-Region read replicas (which have higher RPO and require manual promotion) with Aurora Global Database, or assume that any caching layer like ElastiCache can substitute for database DR, when in fact only Aurora Global Database provides the required low RPO and managed failover.

How to eliminate wrong answers

Option B is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, offering no disaster recovery across Regions. Option C is wrong because ElastiCache Redis is an in-memory cache, not a persistent database, and cannot serve as a primary data store for ticket bookings or provide cross-Region DR with low RPO. Option D is wrong because manual snapshots copied monthly result in an RPO of up to a month, which is far too high for fast disaster recovery requirements.

MCQmedium

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.

A.An EBS snapshot schedule

B.S3 Cross-Region Replication with versioning enabled

C.S3 lifecycle transition to Glacier Flexible Retrieval

D.A CloudFront distribution

AnswerB

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) is the correct AWS-native managed solution for automatically replicating objects from a source S3 bucket in one region to a destination bucket in another region, meeting the disaster recovery requirement. Versioning must be enabled on both source and destination buckets for CRR to function, as replication relies on version IDs to track and copy objects. This provides asynchronous, automatic replication without custom scripting or third-party tools.
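
As an illustration of what such a replication rule looks like, the sketch below builds a configuration dictionary in the shape accepted by the S3 `put_bucket_replication` call (`ReplicationConfiguration` parameter). The role ARN, bucket ARN, and rule ID are hypothetical placeholders, and the sketch assumes versioning is already enabled on both buckets; it only constructs the data and does not call AWS.

```python
def replication_config(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a replication configuration that copies every object to one destination."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-copy",                 # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},                    # empty filter: apply to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


config = replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",  # placeholder
    "arn:aws:s3:::example-dr-bucket",                            # placeholder
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

The destination bucket would live in the second Region; the rule itself does not name a Region because the bucket ARN identifies the target.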

Exam trap

The trap here is that candidates may confuse S3 lifecycle policies (which only manage storage tiers within a region) with cross-region replication, or incorrectly assume CloudFront's global edge caching provides durable DR storage in another region.

How to eliminate wrong answers

Option A is wrong because EBS snapshots are for Amazon Elastic Block Store volumes attached to EC2 instances, not for S3 objects; they cannot replicate data across regions for S3-based storage. Option C is wrong because S3 lifecycle transitions to Glacier Flexible Retrieval only change the storage class within the same region for cost optimization, not replicate data to another region for disaster recovery. Option D is wrong because CloudFront is a content delivery network (CDN) that caches content at edge locations for low-latency access, but it does not provide cross-region replication or persistent storage in a secondary region for DR.

Multi-Selecthard

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.

Select 2 answers

A.AWS Organizations service control policies

B.Route 53 failover routing with health checks

C.S3 Transfer Acceleration

D.A deployed standby application stack in the secondary Region

AnswersB, D

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

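Route 53 failover routing with health checks (Option B) is required because it monitors the primary endpoint's health and automatically reroutes traffic to a secondary Region when the primary becomes unhealthy. This is the managed AWS-native control for DNS-based failover, meeting the architecture review board's preference. A deployed standby application stack in the secondary Region (Option D) is also required because DNS failover only redirects traffic; something must already be running in the secondary Region to serve it.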

Exam trap

The trap here is that candidates often think DNS failover alone is sufficient, forgetting that you must also have a running application stack in the secondary Region to receive traffic after failover.

MCQmedium

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG uses the ALB target group health checks to decide when instances are healthy (for example, by using the ELB/target-group health check integration). During a deployment, the ASG performs instance replacement. Shortly after the deployment starts and while new instances are still bootstrapping, CloudWatch shows the ALB target group briefly has zero healthy targets, and users intermittently receive 502 responses. Which ASG deployment configuration best reduces the chance that there will be a period with zero healthy ALB targets, while still keeping failover behavior resilient?

A.Set the target group HealthCheckGracePeriod to a very short value so the ALB quickly declares instances healthy or unhealthy.

B.Use an ASG rolling update approach that launches replacement instances first, ensures the new instances pass the ALB target group health checks, and only then terminates the old instances (for example, by configuring sufficient minimum healthy capacity and waiting on ALB health).

C.Disable ALB target group health checks and route traffic to any registered targets so replacements do not depend on health check status.

D.Reduce the ASG desired capacity by one instance during deployments so the replacement happens faster.

AnswerB

This sequencing avoids a “no healthy targets” window. By keeping capacity stable (or maintaining a minimum healthy percentage) and waiting for the new instances to be marked healthy by the ALB, traffic is only sent to healthy targets during replacement.

Why this answer

Option B is correct because it describes a rolling update strategy that launches new instances first, waits for them to pass ALB target group health checks, and only then terminates old instances. As long as that ordering holds, enough healthy instances remain to serve traffic throughout the deployment, so the ALB target group should not drop to zero healthy targets. The ASG's minimum healthy capacity setting and the wait on ALB health keep failover resilient because the old instances continue to handle requests until the new ones are fully ready.

Exam trap

The trap here is that candidates often think reducing the health check grace period or disabling health checks will speed up recovery. In reality, a short grace period causes premature termination of instances that are still bootstrapping, and disabling health checks lets traffic reach targets that are not ready; both increase the likelihood of 502 errors and reduce resilience.

How to eliminate wrong answers

Option A is wrong because HealthCheckGracePeriod is an Auto Scaling group setting, not a target group setting, and shortening it does not prevent zero healthy targets; it makes the ASG act on failed health checks sooner, so instances that are still bootstrapping can be terminated and replaced, which makes the problem worse. Option C is wrong because disabling ALB target group health checks would cause the ALB to route traffic to any registered target regardless of its actual health, leading to more 502 errors and no failover resilience. Option D is wrong because reducing the ASG desired capacity during a deployment removes serving capacity exactly when it is needed; it does nothing to make the ASG wait for new instances to become healthy before old ones are terminated.
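
The difference between the two orderings can be seen in a small simulation. This is a simplified model, not Auto Scaling's actual algorithm: the function name is an assumption, instances are replaced one at a time, and a new instance is assumed to become healthy exactly one step after launch.

```python
def rolling_replace(old_count: int, launch_first: bool) -> int:
    """Return the lowest number of healthy instances seen while replacing one by one."""
    healthy_old, healthy_new = old_count, 0
    lowest = healthy_old
    for _ in range(old_count):
        if launch_first:
            healthy_new += 1   # new instance passes health checks first
            healthy_old -= 1   # only then is an old instance terminated
        else:
            healthy_old -= 1   # old instance is terminated immediately
            lowest = min(lowest, healthy_old + healthy_new)
            healthy_new += 1   # its replacement becomes healthy afterwards
        lowest = min(lowest, healthy_old + healthy_new)
    return lowest


print(rolling_replace(1, launch_first=True), rolling_replace(1, launch_first=False))
```

With a single instance, terminating before the replacement is healthy drops the healthy count to zero, which corresponds to the 502 window described in the question; launching first keeps it at one.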

Multi-Selectmedium

A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.

Select 2 answers

A.Backup and restore

B.Pilot light

C.Warm standby

D.Multi-site active-active

E.Single-AZ deployment

AnswersB, C

Pilot light keeps only core components running in the secondary Region, which lowers cost while reducing recovery time.

Why this answer

Pilot light is correct because it keeps the data layer continuously replicated and running in the secondary Region (for example, a replica database) while application servers stay switched off or undeployed until needed. During a regional outage, you start and scale out the application tier (for example, EC2 instances from pre-baked AMIs), which can meet a recovery time objective (RTO) of a few hours while keeping ongoing costs well below a fully active second Region. Warm standby is correct for the same reason at a somewhat higher cost: a scaled-down but fully functional copy of the workload runs in the secondary Region at all times and only needs to be scaled up, so nothing is rebuilt from scratch.

Exam trap

AWS often tests the distinction between pilot light and warm standby: candidates confuse the minimal 'pilot light' core with the scaled-down but fully running 'warm standby' environment, or assume that backup and restore can meet a few-hour RTO when it typically cannot due to provisioning and data restoration latency.

MCQhard

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The team wants the control to be enforceable during normal operations.

A.A FIFO queue without a redrive policy

B.Short polling instead of long polling

C.A dead-letter queue with an appropriate maxReceiveCount

D.A larger message retention period only

AnswerC

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

A dead-letter queue (DLQ) with an appropriate maxReceiveCount is the correct solution because it automatically moves messages that have failed processing a specified number of times to a separate queue, preventing them from blocking subsequent retries. This enforces control during normal operations by isolating poison messages without manual intervention, allowing the main queue to continue processing valid messages.

Exam trap

The trap here is that candidates may confuse a dead-letter queue with simply increasing retention or changing polling methods, failing to recognize that only a DLQ with a maxReceiveCount enforces automatic removal of poison messages during normal operations.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not handle poison messages; it preserves message order and exactly-once processing, but without a DLQ a failing message keeps being retried until its retention period expires. Option B is wrong because short polling (returning immediately even if the queue is empty) does not address poison messages; it affects message availability timing, not retry behavior or failure handling. Option D is wrong because increasing the message retention period only extends how long messages stay in the queue; it does not limit retries or remove failing messages, so poison messages would still block retries until the retention period expires.
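
A short sketch of how `maxReceiveCount` is expressed and what it means in practice. The first function builds the `RedrivePolicy` queue attribute as a JSON string; the second is a simplified model of the redrive decision, not SQS's implementation. Function names and the ARN are hypothetical.

```python
import json


def redrive_policy(dlq_arn: str, max_receive_count: int) -> str:
    """Serialize the RedrivePolicy queue attribute as a JSON string."""
    return json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count})


def route_message(receive_count: int, max_receive_count: int) -> str:
    """Simplified model: where does a message go on its next receive attempt?"""
    # Once a message has already been received max_receive_count times without
    # being deleted, the next attempt moves it to the DLQ instead of the consumer.
    return "dlq" if receive_count >= max_receive_count else "source"


print(redrive_policy("arn:aws:sqs:us-east-1:111122223333:example-dlq", 3))
print(route_message(2, 3), route_message(3, 3))
```

Choosing the count is a trade-off: too low and transient failures land in the DLQ, too high and a poison message consumes many processing attempts before it is isolated.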

100

MCQmedium

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?

A.Subnets in at least two Availability Zones with health checks enabled

B.All instances in one larger subnet

C.A Network Load Balancer in one subnet

D.A single EC2 instance with detailed monitoring

AnswerA

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option A is correct because an Auto Scaling group configured with subnets in at least two Availability Zones and health checks enabled ensures that if one AZ fails, EC2 instances in the remaining AZs continue to serve traffic. The Application Load Balancer distributes requests across healthy instances in multiple AZs, and the Auto Scaling group replaces failed instances in the affected AZ, maintaining capacity. This design meets the requirement to tolerate the failure of one Availability Zone.

Exam trap

The trap here is that candidates often think a single larger subnet or a different load balancer type provides resilience, but only distributing subnets across multiple Availability Zones with health checks ensures the system can survive an AZ failure.

How to eliminate wrong answers

Option B is wrong because placing all instances in one larger subnet within a single Availability Zone creates a single point of failure; if that AZ goes down, all instances become unavailable. Option C is wrong because a Network Load Balancer placed in one subnet is confined to a single Availability Zone and so does not address multi-AZ resilience; an NLB also lacks the path-based routing an ALB provides, but the load balancer type is not what determines AZ fault tolerance. Option D is wrong because a single EC2 instance, even with detailed monitoring, cannot survive an AZ failure; there is no redundancy or automatic failover.

101

Multi-Selectmedium

A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.

Select 2 answers

A.Enable bucket versioning on the S3 bucket.

B.Turn on S3 Transfer Acceleration for the bucket.

C.Use only lifecycle policies to move objects to Glacier.

D.Configure replication to a bucket in a second AWS Region.

E.Enable S3 Block Public Access on the bucket.

AnswersA, D

Versioning preserves prior object versions so accidental deletes and overwrites can be recovered later.

Why this answer

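Option A is correct because enabling S3 Versioning on the bucket protects objects from accidental overwrite or deletion by preserving previous versions of each object. When versioning is enabled, a delete marker is placed instead of permanently removing the object, and overwrites create a new version while retaining the old one. This directly meets the requirement to guard against accidental data loss. Option D is correct because replicating to a bucket in a second AWS Region provides the second copy needed for recovery after a regional outage; replication requires versioning on both buckets, so the two actions work together.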

Exam trap

The trap here is that candidates may confuse S3 Transfer Acceleration or Block Public Access with data protection features, when in fact only versioning and replication directly address the requirements for preventing accidental deletion and providing cross-region recovery.

102

Multi-Selectmedium

A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.

Select 2 answers

A.Increase the Lambda timeout and keep writing directly to the database.

B.Put an Amazon SQS queue between the API and the database-processing function.

C.Replace SQS with SNS so every request is delivered immediately to all subscribers.

D.Make the database write idempotent by using a unique request token or order ID.

E.Disable retries so failed writes are never duplicated.

AnswersB, D

SQS buffers bursts and decouples producers from consumers, so the database can be processed at a steadier rate.

Why this answer

Option B is correct because inserting an SQS queue between the API and the database-processing Lambda function decouples ingestion from the database write. During traffic spikes, SQS buffers the requests, and the function can drain the queue at a controlled rate that the database can sustain. A message that fails processing becomes visible again after the visibility timeout and is retried, but standard queues deliver at least once, so a retry can still repeat a write. Option D closes that gap: making the write idempotent with a unique request token or order ID means a repeated delivery does not create a second order record.

Exam trap

The trap here is that candidates often think SNS (Option C) is a suitable replacement for SQS because both are messaging services, but SNS pushes each message to subscribers as soon as it is published and gives the consumer no queue to drain at its own pace, making it inappropriate for smoothing traffic spikes and handling failures gracefully.

103

MCQhard

A company runs a production MySQL database on Amazon RDS in us-east-1. A read replica exists in us-west-2 for disaster recovery. The primary region experiences a complete outage. Which of the following describes the correct procedure to restore database service using the cross-region read replica?

A.Wait for AWS to automatically fail over the read replica to become the new primary

B.Restore the primary database from the most recent automated snapshot in us-west-2

C.Manually promote the us-west-2 read replica to a standalone DB instance and update application endpoints

D.Create a new RDS instance in us-west-2 and manually restore data from application logs

AnswerC

Manual promotion is the correct procedure. The replica becomes a writable standalone DB in us-west-2. Applications must update their connection strings to the new endpoint.

Why this answer

Cross-region RDS read replicas support manual promotion to a standalone database instance. When the primary region fails, the replica must be manually promoted — this makes it an independent writable instance in us-west-2.

Key points: Promotion is NOT automatic (unlike RDS Multi-AZ failover). Promotion breaks the replication link — the replica becomes autonomous. After promotion, application connection strings must be updated to the new endpoint. Any replication lag at the time of the outage represents potential data loss (RPO > 0).

Exam trap

RDS Multi-AZ provides automatic failover, with no manual action required. Cross-Region read replicas do NOT fail over automatically; promotion must be triggered manually. This distinction appears frequently.

For managed cross-Region failover with an RPO typically measured in seconds, consider Amazon Aurora Global Database; note that a cross-Region failover there is still initiated by an operator rather than happening on its own.

Why the other options are wrong

Option A: RDS cross-Region read replicas do NOT fail over automatically. Only RDS Multi-AZ provides automatic same-Region failover. Manual promotion is required for cross-Region replicas.

Option B: Restoring from a snapshot creates a new instance from an older state. The read replica contains more recent data because it is replicated asynchronously on an ongoing basis. Promotion is faster and yields less data loss than snapshot restoration when the replica is available.

Option D: Creating an empty new instance and manually re-entering data is not a valid DR procedure. The read replica already contains replicated production data. Never manually re-enter data as part of a DR plan.
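
The promotion procedure can be sketched as a small function. It assumes `rds_client` behaves like a boto3 RDS client created for us-west-2 (for example `boto3.client("rds", region_name="us-west-2")`); the function name and replica identifier are hypothetical. Treat it as an outline of the steps, not a complete runbook: a real procedure would also confirm that the instance status has actually changed before trusting the waiter, and would handle errors.

```python
def promote_replica(rds_client, replica_id: str) -> str:
    """Promote a read replica and return the endpoint applications should switch to."""
    # Step 1: break replication and make the replica a standalone, writable instance.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)
    # Step 2: block until the instance reports as available.
    rds_client.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
    # Step 3: look up the endpoint that application connection strings must now use.
    described = rds_client.describe_db_instances(DBInstanceIdentifier=replica_id)
    return described["DBInstances"][0]["Endpoint"]["Address"]
```

The returned address is what the "update application endpoints" step refers to; because the client is passed in, the function can be exercised against a stub in a DR drill before it is ever needed.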

104

MCQeasy

An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?

A.Place the Auto Scaling group in only one subnet so instance launches are simpler.

B.Ensure the ALB and Auto Scaling group span multiple subnets in at least two Availability Zones.

C.Increase the target group deregistration delay to allow old instances to stay longer.

D.Use a Network Load Balancer, but keep all subnets in a single Availability Zone.

AnswerB

Spreading the ALB and Auto Scaling group across at least two AZs provides redundant capacity. If one AZ fails, the ALB continues routing to healthy targets in the other AZ.

Why this answer

Option B is correct because placing both the ALB and the Auto Scaling group across multiple subnets in at least two Availability Zones ensures that if one AZ becomes unavailable, the ALB can route traffic to healthy instances in the remaining AZs, and the Auto Scaling group can automatically launch replacement instances in the other AZs. This directly provides automatic failover for the compute layer within a single Region without manual intervention.

Exam trap

The trap here is that candidates may think a single-AZ setup with a load balancer is sufficient for high availability, but without multi-AZ subnets for both the ALB and Auto Scaling group, the architecture remains vulnerable to AZ failure and requires manual recovery.

How to eliminate wrong answers

Option A is wrong because placing the Auto Scaling group in only one subnet (single AZ) creates a single point of failure; if that AZ becomes unavailable, all instances are lost and traffic fails until new instances are manually launched in another AZ. Option C is wrong because increasing the target group deregistration delay only keeps old instances longer during a deregistration process, which does not help with failover when an entire AZ is unavailable; it delays traffic draining but does not provide automatic recovery from AZ failure. Option D is wrong because using a Network Load Balancer in a single AZ still creates a single point of failure; the NLB cannot route traffic to other AZs if the only AZ is down, and it does not improve automatic failover compared to an ALB spanning multiple AZs.

105

MCQeasy

An orders service consumes payment instructions from an Amazon SQS queue. Sometimes the consumer times out after applying the payment but before deleting the SQS message. As a result, the same payment instruction is processed again. Which design change most directly prevents duplicate side effects caused by message retries?

A.Delete the SQS message immediately after it is received, before processing, to ensure it is not retried.

B.Implement idempotency by recording a processed marker keyed by the instruction ID and ignoring duplicates.

C.Increase the SQS visibility timeout to a maximum value to avoid retries entirely.

D.Convert the queue to FIFO and enable content-based deduplication.

AnswerB

Idempotency ensures that repeated deliveries of the same instruction do not cause repeated side effects. By persisting a record keyed by instruction ID (or enforcing a unique constraint in a transactional store), the service can detect duplicates and safely skip or reconcile them even if SQS redelivers the message.

Why this answer

Option B is correct because implementing idempotency ensures that even if the same payment instruction is processed multiple times due to a timeout and retry, the side effect (e.g., applying the payment) occurs only once. By recording a processed marker keyed by the instruction ID (e.g., using a DynamoDB table or Redis), the consumer can check the marker before processing and ignore duplicates. This directly addresses the root cause—duplicate processing—without altering the queue's retry behavior.

Exam trap

The trap here is that candidates confuse message deduplication (which stops a producer's duplicate sends from being enqueued) with idempotent processing (which prevents duplicate side effects), leading them to choose Option D. FIFO deduplication does not stop SQS from redelivering a message whose consumer timed out before deleting it.

How to eliminate wrong answers

Option A is wrong because deleting the SQS message immediately after receipt, before processing, gives up at-least-once delivery; if the consumer crashes after the delete but before applying the payment, the instruction is lost permanently. Option C is wrong because increasing the visibility timeout, even to the 12-hour maximum, does not prevent retries; the message becomes visible again if the consumer fails to delete it within the timeout, and a long timeout also delays the retry of messages that genuinely failed. Option D is wrong because FIFO content-based deduplication only discards duplicate sends of the same message body within the deduplication interval; it does not stop the queue from redelivering a message that was received but never deleted, which is exactly what happens here, and moving to FIFO may not suit the existing architecture.
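
A minimal sketch of the processed-marker idea from Option B. The in-memory set is a stand-in for a durable store, and the class and method names are hypothetical; in a real system the marker write and the side effect would need to be made atomic (for example with a conditional write or a unique constraint), which this sketch does not attempt.

```python
class PaymentConsumer:
    """Applies each payment instruction at most once, even if it is delivered again."""

    def __init__(self) -> None:
        self.processed_ids: set[str] = set()  # stand-in for a durable marker store
        self.applied: list[str] = []          # side effects that actually ran

    def handle(self, instruction_id: str) -> bool:
        """Return True if the payment was applied, False if it was a duplicate."""
        if instruction_id in self.processed_ids:
            return False                        # duplicate delivery: skip the side effect
        self.applied.append(instruction_id)     # the side effect (apply the payment)
        self.processed_ids.add(instruction_id)  # record the processed marker
        return True


consumer = PaymentConsumer()
print(consumer.handle("pay-001"), consumer.handle("pay-001"))
```

The second call models the redelivery after a timeout: the message is received again, but the payment is not applied a second time, and the consumer can then delete the message normally.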

106

MCQmedium

A public API is deployed in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). The team wants Route 53 to automatically route users to the secondary region if the primary API becomes unhealthy. They will use Route 53 health checks that monitor the API’s /status endpoint over HTTPS. Which Route 53 configuration most directly implements this failover behavior?

A.Create two latency-based alias records for the same name, each with different health checks; Route 53 will automatically shift to the secondary when primary is unhealthy.

B.Create a primary alias record and a failover alias record (secondary), configure failover routing policy, and attach health checks to both records.

C.Use geolocation routing with a health check; when the primary is unhealthy, Route 53 will automatically change the region mapping globally.

D.Use simple routing with weighted records and a low health check threshold so traffic quickly moves to the secondary region.

AnswerB

Route 53 failover routing (primary/secondary) is designed for active-passive regional DR. When the primary health check fails, Route 53 automatically stops returning the primary alias and returns the secondary alias target; attaching health checks ensures the change is driven by the /status endpoint health.

Why this answer

B is correct because the failover routing policy in Route 53 is specifically designed for active-passive failover. By creating a primary alias record and a secondary failover alias record, each with an associated health check, Route 53 will automatically route traffic to the secondary region when the health check for the primary fails. This directly implements the required behavior without relying on latency or geographic proximity.

Exam trap

The trap here is that candidates often confuse failover routing with latency-based or geolocation routing, assuming that health checks automatically trigger failover in those policies, but only failover routing provides the explicit active-passive failover behavior described in the question.

How to eliminate wrong answers

Option A is wrong because latency-based routing is an active-active policy, not a primary/secondary one. Route 53 answers each query with the healthy record that has the lowest latency for that user, so while both Regions are healthy, some users are sent to us-west-2 instead of the designated primary. Attached health checks do take an unhealthy record out of rotation, but the policy has no concept of a standby that receives traffic only on failure. Option C is wrong because geolocation routing chooses a record from the user's geographic location; a health check can remove an unhealthy record from the answer, but Route 53 never rewrites the location-to-Region mapping globally, and users whose location has no healthy matching or default record get no answer. Option D is wrong because simple routing and weighted routing are separate policies, and a "low health check threshold" only changes how quickly a failure is detected. Weighted records split traffic in proportion to their weights across both Regions, which again is not the primary/secondary behavior the question asks for.
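
As a rough illustration of option B, the sketch below builds the kind of change batch that the Route 53 ChangeResourceRecordSets API accepts for a failover pair. The function name, record name, load balancer DNS names, hosted zone IDs, and health check IDs are placeholders assumed for the example; only the field names (Failover, SetIdentifier, AliasTarget, HealthCheckId) come from the Route 53 API. It builds the payload only and does not call AWS.

```python
def failover_change_batch(name, primary, secondary):
    """Build a change batch with one PRIMARY and one SECONDARY alias record.

    `primary` and `secondary` are dicts with hypothetical keys:
    dns_name, alias_zone_id, health_check_id.
    """
    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": target["alias_zone_id"],
                    "DNSName": target["dns_name"],
                    "EvaluateTargetHealth": True,
                },
                # Health check that probes the /status endpoint over HTTPS.
                "HealthCheckId": target["health_check_id"],
            },
        }

    return {"Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)]}


# Placeholder values, assumed for illustration only.
batch = failover_change_batch(
    "api.example.com.",
    {"dns_name": "primary-alb.example.com.", "alias_zone_id": "ZONEIDEAST", "health_check_id": "hc-east"},
    {"dns_name": "secondary-alb.example.com.", "alias_zone_id": "ZONEIDWEST", "health_check_id": "hc-west"},
)
```

With boto3, a batch like this would be passed as the ChangeBatch argument of change_resource_record_sets, alongside the hosted zone ID.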

107

Multi-Selecthard

A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The team wants the control to be enforceable during normal operations.

Select 2 answers

A.Route 53 failover routing with health checks

B.S3 Transfer Acceleration

C.A deployed standby application stack in the secondary Region

D.AWS Organizations service control policies

AnswersA, C

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks is correct because it enables automatic DNS-level failover to a secondary Region when the primary endpoint is unhealthy. Route 53 health checks monitor the primary endpoint, and when they report a failure, Route 53 answers queries with the secondary record instead of the primary one. The routing policy is part of the hosted zone configuration, so the control is in force at all times rather than applied only during an incident. A deployed standby application stack is also required, because DNS failover only helps if something in the secondary Region is ready to serve the redirected traffic.

Exam trap

The trap here is that candidates may think DNS-level failover alone is sufficient, forgetting that a fully deployed standby stack in the secondary Region is required to actually serve traffic after failover.

108

Multi-Selectmedium

A production Amazon Aurora MySQL database is corrupted by a bad migration at 10:30 UTC, and the problem is discovered at 10:45 UTC. The team wants to recover to the state just before the migration with minimal manual effort. Which two actions should they take? Select two.

Select 2 answers

A.Restore only the affected table from the latest snapshot and keep the current cluster online.

B.Perform a point-in-time restore to a new DB cluster or instance using automated backups.

C.Reboot the writer so Aurora automatically rolls back the bad migration.

D.Validate the restored database, then repoint the application or DNS name to the restored endpoint.

E.Promote a read replica from the same cluster without restoring from backup.

AnswersB, D

Point-in-time restore is the supported mechanism for recovering to a specific timestamp before the corruption occurred. It uses automated backups and transaction logs to recreate a clean copy of the database state.

Why this answer

Option B is correct because Amazon Aurora supports point-in-time recovery (PITR) to any point within the backup retention window, allowing you to restore the database to a state just before the migration (e.g., 10:29 UTC). This uses automated backups and requires minimal manual effort, as you simply specify the target time and a new DB cluster is created.

Exam trap

The trap here is that candidates may think rebooting or promoting a replica can undo data changes, but these actions do not affect committed transactions; only a point-in-time restore can recover to a pre-migration state.
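
To make the restore target concrete, here is a minimal sketch that computes a restore time just before the corruption and assembles the arguments for an Aurora point-in-time restore. The parameter names match the RDS RestoreDBClusterToPointInTime API; the helper name, the cluster identifiers, the calendar date, and the one-minute safety margin are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def pitr_request(source_cluster, new_cluster, corruption_time,
                 margin=timedelta(minutes=1)):
    """Return restore arguments targeting a moment just before the corruption."""
    if corruption_time.tzinfo is None:
        raise ValueError("corruption_time must be timezone-aware")
    return {
        "SourceDBClusterIdentifier": source_cluster,
        # The restore always creates a NEW cluster; the damaged one is untouched.
        "DBClusterIdentifier": new_cluster,
        "RestoreToTime": corruption_time.astimezone(timezone.utc) - margin,
    }


# Hypothetical identifiers and date; the 10:30 UTC time comes from the scenario.
request = pitr_request(
    "prod-aurora",
    "prod-aurora-restored",
    datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc),
)
```

The restored cluster gets its own endpoint, which is why validating it and then repointing the application or DNS name (option D) is the second required action.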

109

MCQmedium

A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?

A.Backup and restore only, relying on scheduled snapshots and manual restores during incidents.

B.Pilot light, keeping only minimal infrastructure in the secondary Region and starting full services after failover.

C.Warm standby, keeping core infrastructure and a partially provisioned environment ready in the secondary Region with frequent data replication.

D.Active/active, routing production traffic to both Regions continuously and accepting dual-region complexity.

AnswerC

Warm standby balances cost and readiness by keeping enough capacity and services running to shorten recovery time while meeting RPO needs.

Why this answer

Warm standby is the best fit because it maintains a partially provisioned environment in the secondary Region with core infrastructure (e.g., a smaller EC2 instance fleet, a replicated database) and uses frequent data replication (e.g., Amazon RDS cross-Region replication or DynamoDB global tables) to achieve an RPO of 15 minutes. The RTO of about 1 hour is achievable by scaling up the standby environment and redirecting traffic, which is faster than a full rebuild but avoids the cost of full duplicate capacity. This balances the business constraint of not affording active/active with the need for automated readiness and guided failover.

Exam trap

The trap here is that candidates often confuse pilot light with warm standby, assuming minimal infrastructure is sufficient for a 1-hour RTO, but pilot light requires provisioning compute resources after failover, which adds significant time, whereas warm standby already has compute running and only needs scaling.

How to eliminate wrong answers

Option A is wrong because backup and restore only relies on scheduled snapshots (e.g., EBS snapshots or RDS automated backups) and manual restores, which typically cannot achieve an RPO of 15 minutes (snapshots are often taken every few hours) and would result in an RTO far exceeding 1 hour due to manual intervention and data restoration time. Option B is wrong because pilot light keeps only minimal infrastructure (e.g., a small database replica and no application servers) in the secondary Region, and starting full services after failover requires provisioning compute resources, which would likely exceed the 1-hour RTO target. Option D is wrong because active/active requires full duplicate capacity in both Regions all the time, which contradicts the business constraint that they cannot afford this, and it introduces dual-region complexity that is unnecessary for the stated RPO/RTO goals.

110

Multi-Selecteasy

A batch processing job can be interrupted and restarted from checkpoints. The business wants to lower compute cost while still keeping the workload resilient to interruptions. Which two choices are best? Select two.

Select 2 answers

A.Run the workload on Amazon EC2 Spot Instances.

B.Store checkpoints in durable storage such as Amazon S3.

C.Use a single On-Demand instance in one Availability Zone only.

D.Disable automatic replacement so the job is never restarted.

E.Keep all intermediate state only in instance memory.

AnswersA, B

Spot Instances are significantly cheaper than On-Demand capacity and are a good fit when the workload can tolerate interruption. Because the job can restart from checkpoints, interruptions are acceptable.

Why this answer

Amazon EC2 Spot Instances offer significant cost savings (up to 90% compared to On-Demand) and are ideal for fault-tolerant batch processing jobs that can be interrupted and restarted from checkpoints. Storing those checkpoints in durable storage such as Amazon S3 keeps the job's progress available after an instance is reclaimed, so a replacement instance can resume instead of starting over. Together, the two choices meet the business goal of lowering compute costs while maintaining resilience.

Exam trap

The trap here is that candidates may overlook the synergy between Spot Instances and checkpointing, mistakenly thinking that any cost-saving measure (like a single On-Demand instance) suffices, or that resilience can be achieved without durable external storage for state.
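
The interplay between interruption and checkpointing can be sketched in a few lines. In this simplified model a plain dictionary stands in for durable storage such as S3, and an exception stands in for a Spot interruption; the function and key names are invented for illustration.

```python
class Interrupted(Exception):
    """Stands in for a Spot Instance being reclaimed mid-job."""


def run_batch(items, store, interrupt_at=None):
    """Sum `items`, saving a checkpoint to `store` after every item."""
    checkpoint = store.get("checkpoint", {"next_index": 0, "total": 0})
    total = checkpoint["total"]
    for i in range(checkpoint["next_index"], len(items)):
        if interrupt_at is not None and i == interrupt_at:
            raise Interrupted(f"interrupted before item {i}")
        total += items[i]
        # Progress lives outside the worker, so it survives the interruption.
        store["checkpoint"] = {"next_index": i + 1, "total": total}
    return total


durable_store = {}          # assumption: a stand-in for an S3 object
work = [1, 2, 3, 4, 5]
try:
    run_batch(work, durable_store, interrupt_at=3)   # first worker is reclaimed
except Interrupted:
    pass
result = run_batch(work, durable_store)              # replacement worker resumes
```

If the checkpoint were kept only in instance memory (option E), the second run would have nothing to resume from.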

111

Multi-Selectmedium

A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.

Select 2 answers

A.Use a failover routing policy with a primary and secondary record.

B.Create a health check and associate it with the primary endpoint.

C.Use weighted routing with a 50/50 traffic split between both Regions.

D.Use latency-based routing so clients always choose the fastest Region.

E.Use a geolocation policy without health checks.

AnswersA, B

Failover routing is designed specifically to send traffic to a secondary endpoint when the primary becomes unhealthy.

Why this answer

A failover routing policy is correct because it allows you to designate one record as primary and another as secondary. Route 53 will route traffic to the primary record as long as it is healthy, and automatically fail over to the secondary record when the primary becomes unhealthy. This directly meets the requirement to send new clients to the primary region when healthy and fail over automatically.

Exam trap

The trap here is that candidates often confuse failover routing with weighted or latency-based routing, assuming any multi-region setup with health checks will automatically fail over, but only failover routing provides the explicit primary/secondary failover behavior required.

112

MCQeasy

Based on the exhibit, the database must continue serving if the current Availability Zone fails. What should you change?

A.Create a read replica in another Availability Zone and promote it manually if needed.

B.Modify the DB instance to use a Multi-AZ deployment.

C.Increase the automated backup retention period to 30 days.

D.Resize the DB instance to a larger class.

AnswerB

A Multi-AZ RDS deployment provides synchronous standby replication in another Availability Zone and automatic failover if the primary AZ becomes unavailable. This directly matches the requirement to keep the database serving after an AZ failure. It is the simplest resilient design change when the application needs high availability rather than just backups.

Why this answer

Option B is correct because Multi-AZ deployment automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary AZ fails, Amazon RDS automatically fails over to the standby, ensuring database availability without manual intervention. This meets the requirement of continuing service during an AZ failure.

Exam trap

The trap here is confusing read replicas (which are for read scaling and asynchronous replication) with Multi-AZ (which is for high availability and synchronous replication), leading candidates to choose Option A for failover scenarios.

How to eliminate wrong answers

Option A is wrong because a read replica is asynchronous and not designed for automatic failover; promoting it manually introduces downtime and data loss risk, which does not satisfy the requirement for continued service without manual action. Option C is wrong because increasing the backup retention period only affects point-in-time recovery duration, not high availability or failover capability. Option D is wrong because resizing the instance class improves performance but does not provide any redundancy or failover across Availability Zones.

113

MCQeasy

An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?

A.CloudFront caching with appropriate TTLs

B.AWS Backup Vault Lock

C.IAM Access Analyzer

D.S3 Select

AnswerA

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caches responses from the S3 origin based on configured TTLs (Cache-Control or Expires headers). If the S3 origin experiences a short outage, CloudFront can still serve cached pages to users from its edge locations, maintaining availability. This is the most direct way to ensure users receive content during origin failures.

Exam trap

The trap here is that candidates may confuse CloudFront's caching with other AWS services like S3 Transfer Acceleration or S3 Cross-Region Replication, which do not provide cached responses during origin outages.

How to eliminate wrong answers

Option B (AWS Backup Vault Lock) is wrong because it is a data protection feature for backups, enforcing retention policies and preventing deletion, not related to serving cached web content during origin outages. Option C (IAM Access Analyzer) is wrong because it analyzes resource-based policies to identify unintended public or cross-account access, not for caching or origin failover. Option D (S3 Select) is wrong because it is a feature to retrieve subsets of object data using SQL queries, not for caching or serving static content during outages.
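
The effect of TTL-based caching during an origin outage can be modeled with a toy cache. This is a simplified sketch of the general idea, not CloudFront's actual behavior: the class name, the explicit `now` argument, and the 60-second TTL are assumptions chosen to keep the example deterministic.

```python
class EdgeCache:
    """Toy TTL cache: serves stored responses without contacting the origin."""

    def __init__(self, fetch_origin, ttl_seconds):
        self.fetch_origin = fetch_origin
        self.ttl = ttl_seconds
        self.entries = {}  # path -> (body, time stored)

    def get(self, path, now):
        entry = self.entries.get(path)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]                  # fresh hit: origin is not called
        body = self.fetch_origin(path)       # miss or expired: go to origin
        self.entries[path] = (body, now)
        return body


origin_state = {"up": True}
origin_calls = []


def origin(path):
    if not origin_state["up"]:
        raise ConnectionError("origin unavailable")
    origin_calls.append(path)
    return "<h1>inventory</h1>"


cache = EdgeCache(origin, ttl_seconds=60)
first = cache.get("/index.html", now=0)       # populates the cache
origin_state["up"] = False                    # short origin outage begins
during_outage = cache.get("/index.html", now=30)
```

In this model a request that arrives after the TTL has expired still fails while the origin is down, which is why the TTL needs to be long enough to cover the outages the site should ride out.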

114

MCQmedium

A company uses Amazon RDS for a PostgreSQL database powering a customer-facing application. The application’s availability depends on fast database failover with minimal manual intervention. The RDS instance currently runs as a single-AZ deployment in one DB subnet group. Which change most directly meets the goal?

A.Create a read replica in a different Availability Zone and configure the application to fail over manually.

B.Enable Multi-AZ for the RDS DB instance so AWS manages a standby in another Availability Zone with automatic failover.

C.Switch the database to use EBS snapshots more frequently and restore in case of failure.

D.Pin the DB to a specific instance type with higher CPU credits to prevent CPU-related disconnects.

AnswerB

RDS Multi-AZ maintains a standby in another AZ and supports automatic failover, improving resilience and reducing manual work.

Why this answer

Enabling Multi-AZ for the RDS DB instance creates a synchronous standby replica in a different Availability Zone. AWS automatically handles failover to the standby with no manual intervention required, meeting the goal of fast database failover with minimal manual intervention.

Exam trap

The trap here is that candidates confuse a read replica (asynchronous, manual promotion) with Multi-AZ (synchronous, automatic failover), or assume that frequent backups or instance sizing improvements can substitute for a dedicated high-availability standby.

How to eliminate wrong answers

Option A is wrong because a read replica is asynchronous and intended for read scaling, not automatic failover; manual failover requires promoting the replica, which involves data loss risk and does not meet the 'minimal manual intervention' requirement. Option C is wrong because EBS snapshots are point-in-time backups that require manual restore and significant downtime, not fast automated failover. Option D is wrong because CPU credits apply to burstable instance types (e.g., T-series) and do not address database availability or failover; higher CPU credits prevent CPU throttling but do not provide a standby or automatic failover mechanism.

115

MCQmedium

Based on the exhibit, a faulty deployment corrupted production data at 10:30 UTC and the issue was discovered at 10:55 UTC. The team needs to recover the database to the last good state before the corruption. Which action should they take?

A.Restore the latest manual snapshot and accept data loss since the snapshot was taken overnight.

B.Use point-in-time restore to create a new database instance at 10:29 UTC, then switch the application to it.

C.Restart the database instance so the transaction log replays the failed migration cleanly.

D.Create a read replica and promote it, because replicas always contain the previous transaction state.

AnswerB

Point-in-time restore is the correct recovery method when automated backups are enabled and the team needs the database just before a known corruption event. Restoring to 10:29 UTC brings the data back to the last safe moment before the migration began. Creating a new instance first avoids modifying the damaged database until the restored copy is validated.

Why this answer

Option B is correct because Amazon RDS for MySQL (and other engines) supports point-in-time recovery (PITR), which restores a database to any second within the backup retention period, up to the latest restorable time (typically within the last five minutes). By restoring to 10:29 UTC (one minute before the corruption at 10:30 UTC), the team can recover the database to its last good state with minimal data loss. After restoring, the application can be pointed to the new instance, avoiding the corrupted data.

Exam trap

The trap here is that candidates may confuse point-in-time restore with snapshot restore, assuming snapshots are the only recovery option, or incorrectly believe that restarting or promoting a replica can undo a logical corruption that has already been written to disk.

How to eliminate wrong answers

Option A is wrong because restoring the latest manual snapshot would revert the database to the time the snapshot was taken (overnight), losing every transaction between that snapshot and 10:30 UTC, which is unacceptable when a more precise recovery is available. Option C is wrong because restarting the database instance does not undo a faulty deployment; crash recovery on restart preserves every committed change, including the corrupted data. Option D is wrong because a read replica applies the same changes as the primary, delayed only by replication lag, rather than holding a previous transaction state; once the replica has replayed the migration, promoting it yields the same corrupted data.

116

MCQeasy

An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?

A.Send events directly from EventBridge to Lambda without any queue to simplify the flow.

B.Use Amazon SQS as a buffer between the event source and Lambda, with an SQS dead-letter queue (DLQ).

C.Use SNS fan-out to multiple Lambda functions, but keep no retry logic and no DLQ.

D.Store events in an S3 bucket and trigger Lambda immediately after each upload, without using DLQs.

AnswerB

SQS buffers bursts, supports retries via visibility timeouts, and DLQs capture messages that fail repeatedly for later review.

Why this answer

Option B is correct because Amazon SQS acts as a durable buffer between the event source and Lambda, absorbing traffic spikes and decoupling the producer from the consumer. The SQS dead-letter queue (DLQ) automatically captures messages that have been received more times than the source queue's configured maxReceiveCount, allowing the team to inspect and reprocess them later without loss. This design provides the required buffering, automatic retries via the Lambda event source mapping, and isolation of repeatedly failing messages.

Exam trap

The trap here is that candidates often assume a direct event-driven flow (like EventBridge to Lambda) is simpler and sufficient, but they overlook the need for buffering and a DLQ to handle throttling and isolate persistent failures, which SQS explicitly provides.

How to eliminate wrong answers

Option A is wrong because sending events directly from EventBridge to Lambda leaves nothing to absorb traffic spikes before they reach the database, so throttling continues, and events that exhaust the limited asynchronous retries can still be lost. Option C is wrong because SNS fan-out to multiple Lambda functions with no retry logic and no DLQ provides no buffering, no automatic retries, and no mechanism to isolate failed messages, so throttling and message loss remain unaddressed. Option D is wrong because triggering Lambda from S3 uploads adds no buffer in front of the database, and the option explicitly rules out a DLQ, so there is nowhere to isolate events that keep failing; this approach also introduces latency and complexity for real-time order processing.
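
The retry-then-isolate behavior of a queue with a DLQ can be modeled in memory. The sketch below is a simulation under simplifying assumptions, not the SQS API: the function name is invented, a deque stands in for the queue, and a failed message goes straight back to the end of the queue instead of waiting out a visibility timeout.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process messages, retrying failures and dead-lettering repeat offenders."""
    queue = deque((msg, 0) for msg in messages)
    done, dead_letter = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
            done.append(msg)                    # success: message is deleted
        except Exception:
            if receives >= max_receive_count:
                dead_letter.append(msg)         # isolate for later inspection
            else:
                queue.append((msg, receives))   # becomes visible again
    return done, dead_letter


attempts = {}


def write_order(msg):
    """Hypothetical handler: 'flaky' is throttled once, 'bad' always fails."""
    attempts[msg] = attempts.get(msg, 0) + 1
    if msg == "bad" or (msg == "flaky" and attempts[msg] == 1):
        raise RuntimeError("database throttled")


processed, dlq = drain(["ok", "flaky", "bad"], write_order)
```

The transiently failing message succeeds on retry, while the one that never succeeds ends up in the dead-letter list instead of being dropped or retried forever.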

117

MCQmedium

A.Create a read replica in a different Availability Zone and configure the application to fail over manually.

B.Enable Multi-AZ for the RDS DB instance so AWS manages a standby in another Availability Zone with automatic failover.

C.Switch the database to use EBS snapshots more frequently and restore in case of failure.

D.Pin the DB to a specific instance type with higher CPU credits to prevent CPU-related disconnects.

AnswerB

RDS Multi-AZ maintains a standby in another AZ and supports automatic failover, improving resilience and reducing manual work.

Why this answer

Enabling Multi-AZ for the RDS DB instance creates a synchronous standby replica in a different Availability Zone. AWS automatically handles failover to the standby with no manual intervention required, which directly meets the goal of fast database failover with minimal manual intervention.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ (which provides automatic failover and high availability), leading them to choose Option A incorrectly.

How to eliminate wrong answers

Option A is wrong because a read replica is asynchronous and not designed for automatic failover; manual failover requires application changes and introduces significant downtime. Option C is wrong because restoring from EBS snapshots is a slow, manual process that can take minutes to hours, far from the fast, automated failover required. Option D is wrong because CPU credits (relevant to burstable instances like T-series) do not address database availability or failover; they only prevent CPU throttling, not instance or AZ failures.

118

MCQmedium

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include?

A.A single EC2 instance with detailed monitoring

B.Subnets in at least two Availability Zones with health checks enabled

C.All instances in one larger subnet

D.A Network Load Balancer in one subnet

AnswerB

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option B is correct because distributing EC2 instances across subnets in at least two Availability Zones ensures that if one AZ fails, the Auto Scaling group can maintain capacity using instances in the remaining AZ(s). Enabling health checks allows the group to detect and replace unhealthy instances, which is essential for fault tolerance. This configuration meets the requirement to tolerate the failure of one Availability Zone.

Exam trap

The trap here is that candidates often confuse high availability with fault tolerance, thinking a single large subnet or a single instance with monitoring is sufficient, when in fact distributing across multiple Availability Zones is the key to surviving an AZ failure.

How to eliminate wrong answers

Option A is wrong because a single EC2 instance, even with detailed monitoring, cannot tolerate the failure of an entire Availability Zone; if that AZ goes down, the instance becomes unavailable. Option C is wrong because placing all instances in one larger subnet confines them to a single Availability Zone, providing no redundancy if that AZ fails. Option D is wrong because a Network Load Balancer in one subnet does not solve the AZ failure requirement; the Auto Scaling group must span multiple AZs, and the load balancer itself should be cross-zone enabled to distribute traffic across AZs.
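
A small guard can catch the single-AZ mistake before a group is created. The sketch below assembles arguments using parameter names from the Auto Scaling CreateAutoScalingGroup API and refuses a subnet set that covers fewer than two Availability Zones. The helper name, the subnet-to-AZ mapping, the capacity numbers, and the 300-second grace period are assumptions for the example.

```python
def asg_request(name, subnet_zones, target_group_arn, launch_template_id, desired=2):
    """Build Auto Scaling group arguments; `subnet_zones` maps subnet ID -> AZ."""
    if len(set(subnet_zones.values())) < 2:
        raise ValueError("subnets must cover at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {"LaunchTemplateId": launch_template_id, "Version": "$Latest"},
        "MinSize": desired,
        "MaxSize": desired * 2,
        "DesiredCapacity": desired,
        # Comma-separated subnet IDs; the group balances instances across them.
        "VPCZoneIdentifier": ",".join(sorted(subnet_zones)),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's view of health, not just EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# Placeholder identifiers, assumed for illustration.
request = asg_request(
    "trading-dashboard",
    {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"},
    "arn:aws:elasticloadbalancing:example-target-group",
    "lt-example",
)
```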

119

MCQmedium

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?

A.An EBS snapshot schedule

B.S3 Cross-Region Replication with versioning enabled

C.S3 lifecycle transition to Glacier Flexible Retrieval

D.A CloudFront distribution

AnswerB

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) with versioning enabled automatically and asynchronously replicates objects to a destination bucket in a different AWS Region, providing a durable disaster recovery copy. Versioning must be enabled on both source and destination buckets to track object changes and ensure consistency during replication. This meets the requirement for a cross-region copy without manual intervention.

Exam trap

The trap here is that candidates may confuse lifecycle transitions (which change storage class within the same region) with cross-region replication (which copies data to a different region), or assume EBS snapshots apply to S3 storage.

How to eliminate wrong answers

Option A is wrong because EBS snapshots are used for backing up EC2 block storage volumes, not for S3 objects, and they are region-specific unless manually copied. Option C is wrong because S3 lifecycle transition to Glacier Flexible Retrieval moves objects to a cold storage tier for cost savings, not to a different AWS Region for disaster recovery. Option D is wrong because CloudFront is a content delivery network that caches data at edge locations for low-latency access, not a mechanism for replicating data to another region for DR.
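
For reference, a replication setup has two parts: versioning on both buckets and a replication rule on the source. The sketch below builds both payloads using field names from the S3 PutBucketVersioning and PutBucketReplication APIs. The helper name, rule ID, role ARN, and bucket name are placeholders, and the empty prefix filter (replicate everything) is an assumption about what the business wants copied.

```python
# Versioning must be enabled on the source AND destination buckets.
VERSIONING_ENABLED = {"Status": "Enabled"}


def replication_config(role_arn, destination_bucket):
    """Build a one-rule replication configuration that copies all objects."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate on your behalf
        "Rules": [
            {
                "ID": "dr-copy",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


# Placeholder role and bucket names, assumed for illustration.
config = replication_config(
    "arn:aws:iam::111122223333:role/replication-role",
    "documents-dr-copy",
)
```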

120

MCQeasy

An internal service is hosted behind an Application Load Balancer (ALB) with targets spread across two Availability Zones. If the targets in one Availability Zone become unhealthy, the service must continue serving traffic from the healthy AZ. What change most directly improves resilience at the load-balancing layer?

A.Turn off health checks and rely only on instance CPU utilization to route traffic.

B.Configure ALB listener rules to route all traffic to a single target group in one Availability Zone.

C.Configure target group health checks so the ALB stops sending traffic to unhealthy targets and continues routing to healthy targets in the other Availability Zone.

D.Store requests in an SQS queue before routing them to the ALB.

AnswerC

With target group health checks enabled and configured correctly, the ALB evaluates each target's health and stops routing requests to targets marked unhealthy. As long as healthy targets exist in the other AZ, the ALB preserves reachability.

Why this answer

Option C is correct because configuring target group health checks allows the ALB to automatically detect unhealthy targets and stop sending traffic to them, while continuing to route requests to healthy targets in the other Availability Zone. This directly improves resilience at the load-balancing layer by ensuring traffic is only forwarded to healthy instances, maintaining service availability even when an entire AZ fails.

Exam trap

The trap here is that candidates may think SQS decoupling (Option D) improves resilience at the load-balancing layer, but SQS operates at the application layer and does not affect how the ALB routes traffic to unhealthy targets.

How to eliminate wrong answers

Option A is wrong because turning off health checks removes the ALB's ability to detect unhealthy targets, which would cause traffic to be sent to failed instances, breaking resilience. Option B is wrong because routing all traffic to a single target group in one AZ creates a single point of failure and defeats the purpose of multi-AZ redundancy. Option D is wrong because storing requests in an SQS queue before routing to the ALB adds unnecessary latency and complexity, and does not address the immediate need for the ALB to stop sending traffic to unhealthy targets.

121

Multi-Selecthard

A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.

Select 2 answers

A.Route 53 failover routing with health checks

B.S3 Transfer Acceleration

C.A deployed standby application stack in the secondary Region

D.AWS Organizations service control policies

AnswersA, C

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks (Option A) is required because it automatically evaluates the health of the primary endpoint and, upon detecting failure, updates DNS resolution to direct traffic to the secondary Region. This is the native AWS mechanism for DNS-based failover without custom scripts, relying on Route 53 health checkers to assess endpoint health via HTTP/HTTPS/TCP or calculated health checks.

Exam trap

The trap here is that candidates may think a single service like Route 53 alone can handle failover, but without a pre-deployed standby application stack in the secondary Region, there is no infrastructure to route traffic to, making both Route 53 failover routing and the standby stack required together.

122

Drag & Dropmedium

Order the steps for setting up a VPC with public and private subnets using a NAT gateway.

Why this order

VPC creation comes first, then subnets, IGW, NAT Gateway, and finally route table updates.
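
One way to see why this order works is to write down what each resource depends on and let a topological sort produce a valid sequence. The dependency map below is a simplified assumption made for illustration (for example, it treats the NAT gateway as needing both a public subnet and the internet gateway), and the resource names are just labels.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key lists the resources that must exist before it can be set up.
dependencies = {
    "vpc": set(),
    "subnets": {"vpc"},
    "internet_gateway": {"vpc"},
    "nat_gateway": {"subnets", "internet_gateway"},
    "route_tables": {"internet_gateway", "nat_gateway"},
}

# static_order() yields every resource after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
```

Subnets and the internet gateway depend only on the VPC, so a sort may emit them in either order; the sequence in the explanation is one valid choice.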

123

MCQhard

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.

A.Use an in-memory queue on one EC2 instance

B.Use UDP messages sent directly to workers

C.Use Amazon SQS standard queue and design consumers to be idempotent

D.Use CloudFront signed URLs

AnswerC

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery, meaning each message is delivered at least once but may be delivered more than once. This aligns perfectly with the requirement that every event must be processed at least once and that duplicate processing is acceptable if consumers are idempotent. SQS is a fully managed, AWS-native service that meets the architecture review board's preference for a managed solution.

Exam trap

The trap here is that candidates may confuse 'at-least-once' delivery with 'exactly-once' delivery and incorrectly choose a solution like an in-memory queue (Option A) or UDP (Option B), overlooking that SQS standard queues are the managed, AWS-native way to achieve at-least-once delivery with idempotent consumers.

How to eliminate wrong answers

Option A is wrong because an in-memory queue on a single EC2 instance is not managed, introduces a single point of failure, and cannot guarantee at-least-once delivery across failures. Option B is wrong because UDP is a connectionless, unreliable protocol that does not guarantee message delivery, so it cannot ensure every event is processed at least once. Option D is wrong because CloudFront signed URLs are used for securing content delivery, not for event processing or messaging between components.

124

MCQeasy

An event consumer sometimes processes the same SQS message more than once due to timeouts and retries. The consumer must ensure the payment is not charged twice. What design choice best addresses this requirement?

A.Assume messages are processed exactly once because SQS uses durable storage.

B.Make the payment operation idempotent by using an idempotency key and skipping side effects when the key indicates the payment already succeeded.

C.Increase the consumer visibility timeout to several days so messages are not redelivered.

D.Delete the message immediately even if processing fails validation.

AnswerB

Idempotency ensures that repeated processing attempts produce the same result. The consumer should use a stable idempotency key (for example, a business transaction ID) and record completion in durable storage. If the key already indicates the payment succeeded, the consumer skips charging again.

Why this answer

Option B is correct because making the payment operation idempotent using an idempotency key ensures that even if the same SQS message is processed multiple times due to timeouts and retries, the payment will only be charged once. The consumer checks the idempotency key before executing the payment; if the key indicates the payment already succeeded, the consumer skips the side effect. This pattern directly addresses the requirement of not charging twice without depending on SQS to deliver each message exactly once.

Exam trap

The trap here is that candidates assume SQS provides exactly-once delivery or that increasing the visibility timeout is a reliable solution, but the exam tests understanding that SQS is at-least-once and that idempotency is the correct architectural pattern to handle duplicates.

How to eliminate wrong answers

Option A is wrong because SQS guarantees at-least-once delivery, not exactly-once processing; messages can be duplicated due to network issues or consumer timeouts, so assuming exactly-once processing is incorrect. Option C is wrong because increasing the visibility timeout to several days does not prevent redelivery; it only delays it, and if the consumer crashes or fails to delete the message, it will still be redelivered after the timeout expires. Option D is wrong because deleting a message immediately even if processing fails validation means the message is lost permanently, preventing any retry or dead-letter queue handling, which can lead to data loss or incomplete processing.
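
The idempotency-key pattern can be sketched as follows. This is a minimal illustration, not a production design: `PaymentProcessor`, the `transaction_id` field, and the in-memory dictionary standing in for durable storage are all assumptions made for the example.

```python
class PaymentProcessor:
    """Toy consumer that charges at most once per idempotency key."""

    def __init__(self):
        # Stand-in for durable storage (for example, a table keyed by the idempotency key).
        self.completed = {}
        # Records every charge actually performed, so duplicates are easy to spot.
        self.charges = []

    def handle(self, message):
        key = message["transaction_id"]  # hypothetical stable business identifier
        if key in self.completed:
            # Duplicate delivery: return the stored result and skip the side effect.
            return self.completed[key]
        self.charges.append((key, message["amount"]))
        result = {"transaction_id": key, "status": "charged"}
        self.completed[key] = result
        return result
```

In a real system the check and the record would need to be atomic (for example, a conditional write), otherwise two concurrent deliveries could both pass the check.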

125

MCQmedium

An inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.

A.Lambda reserved concurrency set to zero

B.A Lambda dead-letter queue or failure destination

C.A larger deployment package

D.CloudFront error pages

AnswerB

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

Lambda dead-letter queues (DLQs) or failure destinations are the correct AWS-native mechanism to retain failed events after retries are exhausted. When a Lambda function fails to process an event (e.g., due to an unreliable third-party API), the function can be configured to send the failed event payload to an SQS queue or SNS topic for later investigation. This ensures no data loss and aligns with the requirement for a managed, AWS-native solution.

Exam trap

The trap here is that candidates often confuse Lambda DLQs with SQS DLQs or assume that an unrelated setting such as reserved concurrency or package size addresses failure handling, but the key is the explicit configuration to capture events after retries are exhausted.

How to eliminate wrong answers

Option A is wrong because setting Lambda reserved concurrency to zero would prevent the function from executing at all, not retain failed events. Option C is wrong because a larger deployment package has no impact on error handling or event retention; it only affects cold start times and deployment size. Option D is wrong because CloudFront error pages are for HTTP-level errors in front of web applications, not for Lambda function invocation failures or event retention.
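
As a rough sketch, the failure destination can be expressed as the parameters one would pass to the Lambda `PutFunctionEventInvokeConfig` API (boto3: `put_function_event_invoke_config`). The helper name, function name, and ARN below are hypothetical; the helper only builds the request and does not call AWS.

```python
def failure_capture_config(function_name, destination_arn, max_retries=2):
    """Build asynchronous-invocation settings that route exhausted events to a destination."""
    # Lambda accepts 0 to 2 retry attempts for asynchronous invocations.
    if not 0 <= max_retries <= 2:
        raise ValueError("MaximumRetryAttempts must be between 0 and 2")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        # OnFailure receives the event once retries are exhausted.
        "DestinationConfig": {"OnFailure": {"Destination": destination_arn}},
    }

# Hypothetical names; with boto3 this could be passed as
# lambda_client.put_function_event_invoke_config(**config).
config = failure_capture_config(
    "inventory-sync",
    "arn:aws:sqs:us-east-1:123456789012:failed-events",
)
```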

126

MCQmedium

A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?

A.Create a single Route 53 record using weighted routing across both ALBs with weights adjusted manually during an incident.

B.Use Route 53 failover routing with a primary record pointing to the us-east-1 ALB and a secondary record pointing to the us-west-2 ALB, each using health checks.

C.Use latency-based routing so Route 53 always selects the fastest region; health checks are unnecessary because client latency reflects availability.

D.Use a single A record with a static IP address that points to a NAT gateway, and update that IP during failure events.

AnswerB

Failover routing with health checks enables automatic switching of DNS responses when the primary endpoint fails health evaluation.

Why this answer

Route 53 failover routing is designed for active-passive configurations where traffic must automatically shift to a secondary endpoint when the primary fails. By attaching health checks to the primary record (us-east-1 ALB), Route 53 can detect regional unavailability and automatically route traffic to the secondary record (us-west-2 ALB) without manual intervention. This meets the requirement for DNS-based failover without manual DNS changes during incidents.

Exam trap

The trap here is that candidates often confuse latency-based routing with failover routing, assuming that lower latency implies availability. Latency-based routing selects the Region with the lowest latency, and without health checks attached to its records (as Option C proposes) it cannot redirect traffic away from an unhealthy Region.

How to eliminate wrong answers

Option A is wrong because weighted routing requires manual adjustment of weights during an incident, which violates the requirement for no manual DNS changes. Option C is wrong because latency-based routing optimizes for performance, not availability; with health checks omitted, as the option proposes, Route 53 has no signal that a Region has failed, so traffic could still be sent to an unhealthy Region if it has lower latency. Option D is wrong because a NAT gateway provides outbound connectivity for private subnets and cannot serve inbound API traffic, and updating the IP address during failure events requires manual intervention, contradicting the automated failover requirement.
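
The failover pair in Option B can be sketched as the change batch that Route 53's `ChangeResourceRecordSets` API accepts. The helper names, domain, ALB DNS names, hosted zone IDs, and health check ID are placeholders assumed for illustration; the code only builds the request structure.

```python
def failover_record(name, role, target):
    """Build one alias A record for a failover pair; role is PRIMARY or SECONDARY."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": target["zone_id"],  # the ALB's hosted zone ID
            "DNSName": target["dns"],
            "EvaluateTargetHealth": True,
        },
    }
    if target.get("health_check_id"):
        record["HealthCheckId"] = target["health_check_id"]
    return {"Action": "UPSERT", "ResourceRecordSet": record}

def failover_change_batch(name, primary, secondary):
    """Return a ChangeBatch holding the primary and secondary records."""
    return {
        "Changes": [
            failover_record(name, "PRIMARY", primary),
            failover_record(name, "SECONDARY", secondary),
        ]
    }

batch = failover_change_batch(
    "api.example.com",
    {"dns": "primary-alb.example", "zone_id": "ZPRIMARY", "health_check_id": "hc-east"},
    {"dns": "secondary-alb.example", "zone_id": "ZSECONDARY", "health_check_id": "hc-west"},
)
```

While the primary record's health evaluation passes, Route 53 answers with the primary; once it fails, Route 53 answers with the secondary, with no manual DNS change.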

127

MCQmedium

An inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured?

A.Lambda reserved concurrency set to zero

B.A Lambda dead-letter queue or failure destination

C.A larger deployment package

D.CloudFront error pages

AnswerB

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

Lambda dead-letter queues (DLQs) or failure destinations are the correct mechanism to retain failed events after all retries are exhausted. When a Lambda function fails to process an asynchronously invoked event, the service automatically retries twice by default. If those retries fail, the event can be sent to an SQS queue or SNS topic (DLQ) or to a specified destination (failure destination) for later investigation. This ensures no data loss and provides durable storage for post-mortem analysis.

Exam trap

The trap here is that candidates may confuse DLQs with retry mechanisms or think that increasing function resources (like memory or package size) will prevent failures, when in fact a DLQ or failure destination is what durably captures events after retries are exhausted.

How to eliminate wrong answers

Option A is wrong because setting reserved concurrency to zero would prevent the Lambda function from executing at all, not retain failed events. Option C is wrong because a larger deployment package does not affect error handling or event retention; it only increases cold start latency and storage overhead. Option D is wrong because CloudFront error pages are for HTTP-level errors from a web distribution, not for capturing asynchronous Lambda invocation failures.

128

MCQmedium

Based on the exhibit, the application team wants the database to keep the same connection endpoint during failover and to reconnect automatically after the primary instance becomes unavailable. Which change best meets the requirement?

A.Keep the IP address and increase the JDBC connection timeout so the application waits longer during failover.

B.Replace the IP address with the RDS DNS endpoint and add client retry logic that re-resolves DNS after connection loss.

C.Create an additional read replica and point the application to it so failover is faster.

D.Place a Network Load Balancer in front of the database and use the load balancer target IP to avoid DNS changes.

AnswerB

RDS Multi-AZ failover preserves the database endpoint name, not the underlying IP address. When the standby is promoted, AWS updates the DNS record to point to the new primary. Using the RDS endpoint allows the application to follow that change, and retry logic helps the client recover from the short disconnect that occurs during failover.

Why this answer

Option B is correct because using the RDS DNS endpoint ensures that the application connects to the current primary instance, even after a failover. When the primary becomes unavailable, RDS Multi-AZ promotes the standby to be the new primary and updates the DNS record to point to the new instance's IP. By adding client retry logic that re-resolves DNS after a connection loss, the application automatically picks up the new IP and reconnects without manual intervention, meeting both requirements of a stable endpoint and automatic reconnection.

Exam trap

The trap here is that candidates assume a static IP or a load balancer can provide a stable endpoint, but the IP address behind an RDS endpoint changes on Multi-AZ failover, and a Network Load Balancer in front of the database would still need its target updated to the new IP. The reliable approach is to use the RDS DNS endpoint with retry logic that re-resolves DNS after a connection loss.

How to eliminate wrong answers

Option A is wrong because keeping the IP address is unreliable: after a failover, the new primary instance will have a different IP address, so the application would connect to a stale IP and fail. Increasing the JDBC connection timeout only delays the failure; it does not resolve the underlying IP mismatch. Option C is wrong because creating an additional read replica does not give the application an endpoint that follows the primary through failover; read replicas are for read scaling, not for providing a failover endpoint. Option D is wrong because a Network Load Balancer would register the database by IP address, and that IP would still change after failover, so the target would have to be updated anyway, which adds complexity without providing a stable endpoint.
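
Retry logic that re-resolves DNS on every attempt might look like the sketch below. The function names, the injectable `resolver` and `connect` callables, and the retry counts are assumptions for illustration; a real client would pass its database driver's connect function and would usually add backoff.

```python
import socket
import time

def resolve(host, port):
    """Look up the current IP address for the endpoint name."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]

def connect_with_retry(host, port, connect, resolver=resolve, attempts=5, delay=0.0):
    """Try to connect, resolving the endpoint name again before every attempt."""
    last_error = None
    for _ in range(attempts):
        # Re-resolve each time so a DNS update after failover is picked up.
        address = resolver(host, port)
        try:
            return connect(address, port)
        except OSError as exc:
            last_error = exc
            time.sleep(delay)
    raise ConnectionError(f"could not connect to {host} after {attempts} attempts") from last_error
```

The key design choice is that the host name, not a cached IP, is the input to every attempt; clients that cache DNS results for a long time would defeat this.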

129

MCQmedium

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The architecture review board prefers a managed AWS-native control.

A.Lambda reserved concurrency set to zero

B.A larger deployment package

C.CloudFront error pages

D.A Lambda dead-letter queue or failure destination

AnswerD

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

Option D is correct because Lambda dead-letter queues (DLQs) or failure destinations are the managed AWS-native way to capture events that have exhausted all retry attempts from an asynchronous invocation. When the Lambda function still fails after its retries (two by default for asynchronous invocations), the event is sent to an SQS queue or SNS topic (DLQ) or to a specified destination (e.g., SQS, SNS, EventBridge) for later investigation and reprocessing.

Exam trap

The trap here is that candidates may confuse Lambda's synchronous invocation retry behavior (which is controlled by the caller) with asynchronous invocation retries (which are managed by Lambda itself and require a DLQ or failure destination for post-retry capture).

How to eliminate wrong answers

Option A is wrong because setting reserved concurrency to zero would prevent the Lambda function from executing at all, not handle failed events after retries. Option B is wrong because a larger deployment package does not affect retry or failure handling; it only increases the function's code size and cold start latency. Option C is wrong because CloudFront error pages are for HTTP-level errors from a web distribution, not for capturing failed asynchronous Lambda invocations from a third-party API call.

130

MCQhard

Based on the exhibit, the application tier is not replacing unhealthy instances even though the Auto Scaling group spans two Availability Zones. What change most directly improves automatic recovery when the application process fails?

A.Increase the ASG desired capacity so that extra instances absorb the failed ones.

B.Set the Auto Scaling group health check type to ELB so target group health determines replacement.

C.Replace the Application Load Balancer with a Network Load Balancer to improve failover speed.

D.Increase the HealthCheckGracePeriod to the maximum value so the instances have more time to stabilize.

AnswerB

This makes Auto Scaling replace instances that fail the load balancer health check even when EC2 status checks still pass. The exhibit shows the application health endpoint returns 500 while EC2 checks remain passing, so EC2-only health checks miss the failure. ELB-based health checks align replacement with real application availability.

Why this answer

Option B is correct because setting the Auto Scaling group health check type to ELB allows the ASG to use the target group's health checks, which monitor application-level health (e.g., HTTP 200 responses). When the application process fails, the load balancer marks the target as unhealthy, and the ASG terminates and replaces the instance. This directly addresses the issue of unhealthy instances not being replaced, as the default EC2 health check only evaluates instance state and EC2 status checks, not application responsiveness.

Exam trap

The trap here is that candidates assume the default EC2 health check is sufficient for application-level failures, but it only covers instance state and EC2 status checks, not the application process, so the ASG never triggers replacement for application crashes.

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity does not cause the ASG to replace unhealthy instances; it only adds more instances, which may mask the failure but does not fix the underlying health check mechanism. Option C is wrong because replacing the Application Load Balancer with a Network Load Balancer does not change how the ASG decides to replace instances; while the ASG health check type remains EC2, a target that the load balancer reports as unhealthy is still not replaced. Option D is wrong because increasing the HealthCheckGracePeriod delays the start of health checks, giving instances more time to stabilize, but it does not change the health check type to monitor application health; if the health check type remains EC2, the ASG still won't detect application failures.
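
The difference between the two health check types can be illustrated with a deliberately simplified decision function. This is a toy model assumed for the example, not the Auto Scaling service's actual algorithm; in practice the setting is the `HealthCheckType` parameter of the Auto Scaling group.

```python
def should_replace(health_check_type, ec2_status_ok, elb_target_healthy,
                   seconds_since_launch, grace_period_seconds):
    """Simplified model of whether an ASG would replace an instance."""
    # Inside the grace period, health check results are not yet acted on.
    if seconds_since_launch < grace_period_seconds:
        return False
    # EC2 status is considered under both health check types.
    if not ec2_status_ok:
        return True
    # Load balancer target health only counts when the type is ELB.
    if health_check_type == "ELB":
        return not elb_target_healthy
    return False

# The exhibit's situation: EC2 checks pass, the application returns 500.
with_ec2_type = should_replace("EC2", True, False, 600, 300)
with_elb_type = should_replace("ELB", True, False, 600, 300)
```

Under this model the EC2 type leaves the failed instance in service, while the ELB type replaces it.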

131

Multi-Selecthard

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.

Select 2 answers

A.Deletion protection or tightly controlled delete permissions

B.Point-in-time recovery

C.Global secondary indexes

D.DAX

AnswersA, B

Deletion protection and least-privilege controls reduce accidental table removal risk.

Why this answer

Point-in-time recovery (PITR) enables continuous backups of the DynamoDB table, allowing restoration to any point within the last 35 days, which satisfies the requirement for point-in-time recovery. Deletion protection prevents accidental deletion of the table by rejecting DeleteTable requests while it is enabled, meeting the accidental-delete protection requirement. Both are managed AWS-native controls that require no custom scripting or external tooling.

Exam trap

The trap here is that candidates often confuse operational features like DAX (caching) or GSIs (indexing) with data protection mechanisms, but neither provides backup/restore or deletion safeguards required for resilience and data durability.
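
Both settings map to DynamoDB API parameters. The sketch below only builds the request parameters for the `UpdateTable` and `UpdateContinuousBackups` operations; the helper name and table name are hypothetical.

```python
def protection_requests(table_name):
    """Build the two requests that enable deletion protection and PITR."""
    return {
        # Passed to UpdateTable (boto3: dynamodb_client.update_table).
        "update_table": {
            "TableName": table_name,
            "DeletionProtectionEnabled": True,
        },
        # Passed to UpdateContinuousBackups (boto3: update_continuous_backups).
        "update_continuous_backups": {
            "TableName": table_name,
            "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
        },
    }

requests = protection_requests("payments")  # hypothetical table name
```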

132

MCQmedium

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered?

A.A single-AZ Aurora cluster

B.Aurora Global Database

C.Manual snapshots copied monthly

D.An ElastiCache Redis replica

AnswerB

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is designed for cross-Region disaster recovery with a typical RPO of 1 second and RTO of less than 1 minute, using storage-based replication that does not impact database performance. This meets the low RPO requirement for a trading dashboard, where data loss must be minimized.

Exam trap

The trap here is that candidates might choose manual snapshots (Option C) thinking they are sufficient for DR, but they overlook the critical requirement of low RPO, which snapshots copied monthly cannot satisfy.

How to eliminate wrong answers

Option A is wrong because a single-AZ Aurora cluster provides no cross-Region replication and offers no disaster recovery across AWS Regions, resulting in potentially high RPO if the primary Region fails. Option C is wrong because manual snapshots copied monthly have an RPO of up to one month, which is far too high for a trading dashboard requiring low RPO. Option D is wrong because ElastiCache Redis is an in-memory cache, not a persistent database, and cannot serve as a cross-Region disaster recovery solution for Aurora MySQL data.

133

MCQmedium

An events service publishes critical notifications using Amazon SNS. Three independent downstream systems (A, B, and C) subscribe to the topic. Downstream system B sometimes fails to process certain messages (for example, it times out or returns an error while handling the message), and you want: 1) failures in B to be isolated so A and C keep processing unaffected, and 2) messages that B cannot successfully process after retries to be sent to a DLQ for B. Which design best meets these requirements?

A.Subscribe each downstream directly with HTTPS endpoints and configure a single SNS dead-letter queue (DLQ) for the topic.

B.For each downstream system, create its own SQS queue, subscribe each SQS queue to the SNS topic, and configure a redrive policy with a DLQ for each SQS queue.

C.Use one shared SQS queue for all three downstream systems and configure a single DLQ only when all three downstream systems fail.

D.Use EventBridge rules to invoke A, B, and C synchronously with retries enabled, and send failures to a common DLQ.

AnswerB

SNS delivers the message independently to each subscribed SQS queue. If downstream B fails to process a message, B can avoid deleting it from its own queue; after visibility timeout and retry attempts, SQS redrives messages to B’s DLQ. A and C are isolated because they have separate queues and DLQs, so B’s failures do not prevent deliveries to A and C.

Why this answer

Option B is correct because it creates a dedicated SQS queue for each downstream system, which isolates failures: if system B fails, its SQS queue will accumulate messages while systems A and C continue processing from their own queues. Each SQS queue can have a redrive policy that moves messages to a per-queue DLQ after the configured maximum retries are exhausted, satisfying the requirement for a B-specific DLQ without affecting the other subscribers.

Exam trap

The trap here is that candidates assume a single DLQ at the SNS topic level is sufficient, but SNS DLQs only apply to the SNS delivery failure (e.g., HTTP endpoint unreachable), not to downstream processing failures after the message is delivered to SQS.

How to eliminate wrong answers

Option A is wrong because an SNS dead-letter queue captures only messages that SNS itself could not deliver to a subscription endpoint; it does not capture messages that were delivered but that B then failed to process, and a single shared DLQ is not isolated to B. Option C is wrong because a shared SQS queue for all three systems means a failure in B could block or delay messages for A and C, and a single DLQ would trigger only when all three fail, not when B alone fails. Option D is wrong because invoking A, B, and C synchronously couples the three consumers, so B's timeouts or errors can delay the others, and the DLQ is shared rather than isolated to B.
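
The per-subscriber layout in Option B can be sketched as a plan of queues and redrive policies. The SQS `RedrivePolicy` attribute is a JSON string containing `deadLetterTargetArn` and `maxReceiveCount`; the helper names, queue naming scheme, account ID, and receive count below are assumptions for illustration.

```python
import json

def queue_attributes(dlq_arn, max_receive_count=5):
    """Build SQS queue attributes that redrive to a DLQ after repeated receives."""
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    return {"RedrivePolicy": json.dumps(policy)}

def fanout_plan(systems, region, account):
    """Give every downstream system its own queue and its own DLQ."""
    plan = {}
    for system in systems:
        dlq_arn = f"arn:aws:sqs:{region}:{account}:{system}-dlq"
        plan[system] = {
            "queue_name": f"{system}-queue",  # subscribed to the SNS topic
            "dlq_name": f"{system}-dlq",
            "attributes": queue_attributes(dlq_arn),
        }
    return plan

plan = fanout_plan(["a", "b", "c"], "us-east-1", "123456789012")
```

Because B's redrive policy points only at B's DLQ, messages that B cannot process end up there without touching the queues for A and C.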

134

MCQmedium

A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?

A.Keep the ASG in one subnet/AZ, but enable ALB stickiness to reduce session interruption.

B.Update the ASG to launch instances across subnets in at least two Availability Zones and ensure ALB health checks target an application-ready path.

C.Add a NAT gateway in the public subnets so instances can reach the internet during maintenance events.

D.Create a second ALB in the same Availability Zone and route traffic using DNS failover.

AnswerB

Spreading instances across multiple AZs ensures the ALB can route to healthy targets even when one AZ fails.

Why this answer

The most direct fix for AZ failure resilience is to distribute the ASG across multiple Availability Zones. With the ALB already spanning two AZs, if the ASG only launches instances in one AZ, a failure of that AZ leaves the ALB with zero healthy targets, causing timeouts. By configuring the ASG to launch instances in at least two AZs and setting ALB health checks to an application-ready path, the ALB can route traffic to healthy instances in the surviving AZ, maintaining availability.

Exam trap

The trap here is that candidates may think adding a second ALB or enabling stickiness solves the problem, when the real issue is that the ASG is not distributing instances across multiple Availability Zones, leaving the ALB with no healthy targets during an AZ outage.

How to eliminate wrong answers

Option A is wrong because enabling ALB stickiness (session affinity) does not solve the underlying problem of zero healthy targets; it only binds a client session to a specific target, which still fails if that target is in the failed AZ. Option C is wrong because a NAT gateway provides outbound internet access for instances in private subnets, but it does not affect the availability of targets for inbound traffic through the ALB during an AZ failure. Option D is wrong because creating a second ALB in the same AZ and using DNS failover adds complexity and cost without addressing the root cause—the ASG's single-AZ deployment—and DNS failover introduces propagation delays, not the immediate resilience needed.
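
The effect of the ASG's subnet placement can be shown with a small model of which targets survive an AZ outage. The function name, instance IDs, and AZ names are made up for the example.

```python
def healthy_targets(instances_by_az, failed_az):
    """Return the instances still able to serve traffic after one AZ fails."""
    return [
        instance
        for az, instances in instances_by_az.items()
        if az != failed_az
        for instance in instances
    ]

# Current design: every instance sits in one AZ, so nothing survives its outage.
single_az = healthy_targets({"us-east-1a": ["i-1", "i-2"]}, "us-east-1a")

# Option B: instances spread across two AZs, so one AZ's targets remain.
multi_az = healthy_targets({"us-east-1a": ["i-1"], "us-east-1b": ["i-2"]}, "us-east-1a")
```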

135

MCQmedium

An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out or crashes after it has partially completed processing, causing the same instruction to be processed more than once. You must keep the design resilient without attempting to guarantee exactly-once processing. Which approach best handles duplicates safely?

A.Set the SQS visibility timeout extremely long so the message cannot be retried even after processing failures.

B.Make the consumer idempotent by deriving a deterministic idempotency key from the payment instruction (for example, the instruction ID), persisting the result of successful processing, and skipping re-processing when that key is already marked successful.

C.Switch to an SQS FIFO queue but remove error handling in the consumer so duplicates never occur.

D.Send all failed messages to a DLQ and rely on it to deduplicate messages that were already successfully processed.

AnswerB

SQS Standard provides at-least-once delivery, so duplicates are expected. Idempotency ensures that re-processing the same instruction does not create incorrect side effects. Persisting a deterministic key/result allows the consumer to safely short-circuit duplicates after retries/timeouts.

Why this answer

Option B is correct because making the consumer idempotent ensures that even if the same payment instruction is processed multiple times due to timeouts or crashes, the system remains consistent. By deriving a deterministic idempotency key (e.g., the instruction ID) and persisting the result of successful processing, the consumer can skip re-processing when the key is already marked as successful. This approach aligns with the requirement to keep the design resilient without guaranteeing exactly-once processing, as it safely handles duplicates at the application level.

Exam trap

The trap here is that candidates often assume SQS FIFO queues or DLQs inherently solve duplicate processing, but the exam tests understanding that Standard queues require application-level idempotency for safe duplicate handling, and FIFO queues do not eliminate the need for idempotent consumers in crash scenarios.

How to eliminate wrong answers

Option A is wrong because setting the SQS visibility timeout extremely long does not prevent duplicates; it only delays retries, and if the consumer crashes after partially processing, the message will eventually become visible again and be reprocessed, leading to duplicates. Option C is wrong because an SQS FIFO queue deduplicates messages sent within a five-minute deduplication interval, but a message that is received and not deleted before its visibility timeout expires is still redelivered; removing error handling in the consumer does not prevent duplicates from timeouts or crashes and makes the system fragile. Option D is wrong because a Dead-Letter Queue (DLQ) is used to capture messages that fail after multiple retries, not to deduplicate messages; relying on a DLQ for deduplication is a misconception, as DLQs do not track successful processing and cannot prevent duplicates from being processed again.
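
Why a longer visibility timeout only delays redelivery can be seen in a toy queue model. `SimulatedQueue` is an in-memory stand-in written for this example, with time passed in explicitly; it is not the SQS API.

```python
class SimulatedQueue:
    """Toy queue that hides a received message until its visibility timeout expires."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> (body, time at which it is visible again)

    def send(self, message_id, body):
        self.messages[message_id] = (body, 0.0)

    def receive(self, now):
        for message_id, (body, visible_at) in self.messages.items():
            if now >= visible_at:
                # Hide the message; it reappears unless it is deleted in time.
                self.messages[message_id] = (body, now + self.visibility_timeout)
                return message_id, body
        return None

    def delete(self, message_id):
        self.messages.pop(message_id, None)

queue = SimulatedQueue(visibility_timeout=30)
queue.send("m1", "pay instruction 42")
first = queue.receive(now=0)     # consumer receives, then crashes before deleting
hidden = queue.receive(now=10)   # still inside the visibility timeout
again = queue.receive(now=30)    # timeout expired: the same message is delivered again
```

A larger timeout moves the moment of redelivery later but does not remove it, which is why the consumer itself has to be idempotent.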

136

MCQmedium

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.EBS snapshots every hour

D.Read replicas only

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL provides automatic failover to a standby replica in a different Availability Zone. This ensures high availability during an AZ failure with minimal application changes, as the DNS endpoint remains the same and failover is handled by AWS without custom scripts.

Exam trap

The trap here is that candidates often confuse read replicas with Multi-AZ deployments, assuming read replicas provide automatic failover, but they require manual promotion and do not maintain the same endpoint.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage replication across regions, not for database availability within a region, and it does not address RDS MySQL failover. Option C is wrong because EBS snapshots every hour provide point-in-time backups but do not enable automatic failover or maintain availability during an AZ failure; recovery would require manual intervention and could lose up to an hour of data. Option D is wrong because read replicas are for read scaling and do not provide automatic failover for the primary instance; promoting a read replica requires manual steps or custom scripts, violating the 'no custom operational scripts' constraint.

137

Multi-Selecteasy

A web application runs on an Auto Scaling group behind an Application Load Balancer. The business wants the service to keep running if one Availability Zone goes down. Which two changes should you make? Select two.

Select 2 answers

A.Place the Auto Scaling group in subnets across at least two Availability Zones.

B.Attach the Application Load Balancer to subnets in at least two Availability Zones.

C.Increase the instance size so each server can handle more traffic alone.

D.Disable ALB health checks so instances stay registered longer.

E.Run the whole stack in one Availability Zone for simpler networking.

AnswersA, B

Spreading the Auto Scaling group across multiple Availability Zones lets EC2 capacity remain available if one Zone fails. The group can continue launching and serving instances in the remaining healthy Zone, which improves availability without changing the application itself.

Why this answer

Option A is correct because placing the Auto Scaling group in subnets across at least two Availability Zones ensures that if one AZ fails, the Auto Scaling group can still launch instances in the remaining healthy AZ(s). This is a fundamental pattern for high availability, as the Auto Scaling group distributes instances across the specified subnets, and if an entire AZ becomes unavailable, instances in other AZs continue to serve traffic.

Exam trap

The trap here is that candidates often think increasing instance size or disabling health checks provides resilience, but AWS's high availability model relies on distributing resources across multiple Availability Zones, not on making individual instances more powerful or ignoring failures.
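
The check behind options A and B can be sketched in a few lines of Python. This is only an illustration: the subnet IDs and the subnet-to-AZ mapping are hypothetical, and in a real account the mapping would come from the EC2 API.

```python
# Hypothetical subnet-to-AZ mapping; real values would come from the EC2 API.
SUBNET_AZ = {
    "subnet-aaa": "us-east-1a",
    "subnet-bbb": "us-east-1b",
    "subnet-ccc": "us-east-1a",
}


def spans_multiple_azs(subnet_ids, subnet_az=SUBNET_AZ, minimum=2):
    """Return True if the subnets cover at least `minimum` distinct AZs."""
    zones = {subnet_az[s] for s in subnet_ids}
    return len(zones) >= minimum


def survives_one_az_loss(asg_subnets, alb_subnets):
    """Both the ASG and the ALB need subnets in at least two AZs."""
    return spans_multiple_azs(asg_subnets) and spans_multiple_azs(alb_subnets)
```

Two subnets in the same Zone (subnet-aaa and subnet-ccc here) do not count as multi-AZ, which is why the check compares Zones rather than counting subnets.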

138

MCQ (hard)

Based on the exhibit, the team must restore an Amazon RDS for PostgreSQL database to the exact state just before a bad delete happened. What is the best recovery approach?

A.Restore the latest automated snapshot and accept data loss from the last backup window.

B.Perform a point-in-time restore to 2026-04-27 15:10 UTC into a new DB instance, then cut over after validation.

C.Promote a read replica because it will contain the deleted rows and can replace the primary immediately.

D.Enable Multi-AZ on the current database and wait for automatic failover to reverse the delete.

Answer: B

Point-in-time restore uses the automated backups and transaction logs to rebuild the database to an exact time before the bad change. The exhibit confirms the requested restore time is within the restorable window, and the business wants to validate the restored copy before switching traffic. Restoring to a new instance first is the safest way to recover without risking the current production database.

Why this answer

Point-in-time recovery (PITR) lets you restore an Amazon RDS for PostgreSQL database to any second within the backup retention period, up to the latest restorable time (typically within the last five minutes), using automated backups and transaction logs. Restoring to 2026-04-27 15:10 UTC, just before the bad delete occurred, recovers the pre-delete state into a new DB instance, which the team can validate before cutting over.

Exam trap

The trap here is that candidates often confuse read replicas or Multi-AZ as solutions for logical data corruption, when in fact they only protect against infrastructure failures, not user errors like a bad delete.

How to eliminate wrong answers

Option A is wrong because restoring the latest automated snapshot would only recover data up to the last snapshot time, which could be hours before the delete, resulting in data loss from the backup window. Option C is wrong because a read replica in RDS for PostgreSQL does not contain deleted rows from the primary; it applies the same changes asynchronously, so the delete would also be replicated, and promoting it would not recover the lost data. Option D is wrong because enabling Multi-AZ provides high availability through synchronous replication to a standby in another Availability Zone, but it does not protect against logical errors like a bad delete; the delete would be replicated to the standby, and failover would not reverse it.
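
A minimal sketch of the restore request for option B, in Python. The restore time comes from the question; the instance names and the restorable-window bounds are hypothetical because the exhibit is not reproduced here. The dictionary keys are intended to mirror the parameters of the RDS point-in-time restore API, and nothing is sent to AWS.

```python
from datetime import datetime, timezone


def build_pitr_request(source_id, target_id, restore_time, earliest, latest):
    """Build keyword arguments for an RDS point-in-time restore.

    Raises ValueError if restore_time is outside the restorable window.
    """
    if not (earliest <= restore_time <= latest):
        raise ValueError("restore time is outside the restorable window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a new instance, not in-place
        "RestoreTime": restore_time,
    }


# Restore time from the question; window bounds and names are hypothetical.
restore_time = datetime(2026, 4, 27, 15, 10, tzinfo=timezone.utc)
earliest = datetime(2026, 4, 20, 0, 0, tzinfo=timezone.utc)
latest = datetime(2026, 4, 27, 15, 40, tzinfo=timezone.utc)
request = build_pitr_request(
    "orders-db", "orders-db-restored", restore_time, earliest, latest
)
```

Restoring into a separate target identifier leaves the production instance untouched, so the team can validate the copy before cutting over.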

139

MCQ (hard)

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?

A.Use an in-memory queue on one EC2 instance

B.Use UDP messages sent directly to workers

C.Use Amazon SQS standard queue and design consumers to be idempotent

D.Use CloudFront signed URLs

Answer: C

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery, meaning each message is delivered at least once but may occasionally be delivered more than once. This aligns with the requirement that every event must be processed at least once, and since duplicate processing is acceptable when consumers are idempotent, the standard queue is the most suitable choice. SQS handles the decoupling and durability of messages without requiring custom infrastructure.

Exam trap

The trap here is that candidates may confuse 'at-least-once' with 'exactly-once' and incorrectly choose a FIFO queue or another option, but the question explicitly accepts duplicates if the consumer handles idempotency, making the standard queue the correct choice.

How to eliminate wrong answers

Option A is wrong because an in-memory queue on a single EC2 instance is not durable, cannot survive instance failures, and does not provide at-least-once delivery guarantees across distributed consumers. Option B is wrong because UDP is a connectionless, unreliable protocol that does not guarantee message delivery, ordering, or duplicate detection, making it unsuitable for at-least-once processing. Option D is wrong because CloudFront signed URLs are used for secure content delivery and access control, not for event messaging or queue-based processing.

140

MCQ (easy)

A worker consumes messages from an Amazon SQS queue. Some messages consistently fail validation and are retried until the worker can no longer process them. What is the most appropriate AWS mechanism to handle these poison messages while keeping the queue usable?

A.Enable SQS long polling and increase the maximum message size for the queue.

B.Send failing messages to an SQS dead-letter queue (DLQ) using a redrive policy based on receive count.

C.Change the queue to a FIFO queue and handle duplicates in the worker code without DLQs.

D.Delete the queue and recreate it hourly to clear out any problematic messages.

Answer: B

A DLQ with a redrive policy isolates poison messages. After a message is received and fails processing more than the configured maxReceiveCount, SQS moves it to the DLQ, preventing it from continually blocking retries in the source queue.

Why this answer

Option B is correct because an SQS dead-letter queue (DLQ) with a redrive policy based on receive count allows messages that repeatedly fail processing (poison pills) to be moved out of the main queue after a specified number of retries. This keeps the main queue operational for valid messages and isolates problematic messages for later analysis or manual intervention.

Exam trap

The trap here is that candidates may think enabling long polling or increasing the message size (Option A) solves the problem, but the exam specifically tests the concept of isolating poison messages via a DLQ with a receive-count-based redrive policy to maintain queue availability.

How to eliminate wrong answers

Option A is wrong because enabling long polling and increasing maximum message size does not address the core issue of messages that consistently fail validation; long polling reduces empty responses and larger message size allows bigger payloads, but neither prevents poison messages from blocking processing. Option C is wrong because changing to a FIFO queue does not inherently handle poison messages; FIFO queues preserve order and deduplicate based on message deduplication ID, but they still require a DLQ or explicit error handling to remove failing messages, and the worker code alone cannot prevent retries from exhausting resources. Option D is wrong because deleting and recreating the queue hourly is a disruptive, non-scalable approach that loses all messages (including valid ones) and does not provide a mechanism to isolate or analyze poison messages; it also violates the requirement to keep the queue usable.
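
The redrive policy from option B can be sketched as follows. The queue ARN is hypothetical, and `route_message` is a simplified model of the receive-count rule rather than the service's actual implementation. The policy keys `deadLetterTargetArn` and `maxReceiveCount` are the ones the SQS RedrivePolicy attribute uses.

```python
import json


def build_redrive_policy(dlq_arn, max_receive_count):
    """Return the RedrivePolicy queue attribute value as a JSON string."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })


def route_message(receive_attempt, max_receive_count):
    """Model where a message ends up on its Nth receive attempt (1-based).

    Once the receive count would exceed maxReceiveCount, the message is
    moved to the dead-letter queue instead of being delivered again.
    """
    return "dlq" if receive_attempt > max_receive_count else "source"


# Hypothetical DLQ ARN, for illustration only.
policy = build_redrive_policy("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)
```

With a limit of 5, a poison message is delivered five times from the source queue and is moved to the DLQ on what would have been the sixth receive, so valid messages behind it keep flowing.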

141

MCQ (medium)

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The team wants the control to be enforceable during normal operations.

A.A single EC2 instance with detailed monitoring

B.Subnets in at least two Availability Zones with health checks enabled

C.All instances in one larger subnet

D.A Network Load Balancer in one subnet

Answer: B

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option B is correct because distributing EC2 instances across at least two Availability Zones (AZs) ensures that if one AZ fails, the Auto Scaling group can maintain capacity in the remaining AZ(s). Enabling health checks allows the group to detect instance failures and automatically replace them, providing fault tolerance. This configuration meets the requirement to tolerate a single AZ failure while remaining enforceable during normal operations.

Exam trap

The trap here is that candidates often confuse high availability (spanning multiple AZs) with fault tolerance at the instance level, mistakenly thinking a single instance with monitoring or a single subnet can survive an AZ failure.

How to eliminate wrong answers

Option A is wrong because a single EC2 instance, even with detailed monitoring, cannot tolerate the failure of an entire Availability Zone; if that AZ goes down, the instance becomes unavailable. Option C is wrong because placing all instances in one larger subnet within a single AZ creates a single point of failure; an AZ failure would take down all instances. Option D is wrong because a Network Load Balancer in one subnet does not provide AZ-level fault tolerance; it still relies on that single AZ, and the Auto Scaling group must span multiple AZs for resilience.
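
As a rough sketch, the Auto Scaling group settings implied by option B might look like the following. The group name, subnet IDs, target group ARN, and the 300-second grace period are all hypothetical, and the launch template that a real request also needs is omitted. The keys are intended to mirror the Auto Scaling group creation API; nothing is sent to AWS.

```python
def build_asg_request(name, subnet_ids, target_group_arn, min_size, max_size):
    """Build keyword arguments for a multi-AZ Auto Scaling group."""
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs; the subnets should sit in different AZs.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Use load balancer health checks in addition to EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,  # illustrative value, in seconds
    }


# All identifiers below are hypothetical.
asg_request = build_asg_request(
    "trading-dashboard-asg",
    ["subnet-aaa", "subnet-bbb"],
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/dash/abc123",
    min_size=2,
    max_size=6,
)
```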

142

MCQ (medium)

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The design must avoid adding custom operational scripts.

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.Read replicas only

D.EBS snapshots every hour

Answer: B

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary AZ fails, RDS performs an automatic failover to the standby, ensuring database availability with minimal application changes (the application only needs to reconnect using the same endpoint). This meets the requirement of avoiding custom operational scripts.

Exam trap

The trap here is that candidates often confuse read replicas with Multi-AZ, thinking read replicas provide high availability, but they lack automatic failover for write traffic and require manual promotion, which violates the 'no custom operational scripts' constraint.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage replication across AWS regions, not for RDS database availability during an AZ failure, and it would require significant application changes to redirect traffic. Option C is wrong because read replicas are designed for read scaling and do not provide automatic failover for write operations; promoting a read replica to a primary requires manual intervention or custom scripts. Option D is wrong because EBS snapshots every hour provide point-in-time backups but do not enable automatic failover; restoring from a snapshot would cause significant downtime and require custom scripts to automate recovery.

143

MCQ (medium)

Based on the exhibit, the company wants DNS traffic to fail over automatically from the primary Region to a secondary Region when the primary endpoint is unhealthy. Which Route 53 change is best?

A.Keep simple routing and lower the TTL to 10 seconds.

B.Use weighted routing with equal weights for both ALBs.

C.Use geolocation routing so users in each continent reach a closer ALB.

D.Create Route 53 failover records with health checks for the primary and secondary ALBs.

Answer: D

Failover routing is the Route 53 policy intended for this use case. Route 53 returns the primary record while its health check passes, and automatically serves the secondary record when the primary health check fails. That provides DNS-based Regional failover without manual intervention.

Why this answer

Route 53 failover routing with health checks is the option designed for this active-passive pattern: it automatically directs DNS traffic away from an unhealthy primary endpoint to a healthy secondary endpoint. When the health check for the primary ALB fails, Route 53 answers queries with the secondary ALB's record instead, providing automatic failover across Regions. The other options, as described, either attach no health checks (simple routing, equal-weight routing) or select an endpoint by user location rather than by a primary/secondary relationship (geolocation routing).

Exam trap

The trap here is that candidates often confuse weighted routing with failover, assuming equal weights will somehow produce a primary/secondary arrangement. Equal weights split traffic across both Regions at all times, and without health checks attached to the records Route 53 keeps returning an unhealthy endpoint.

How to eliminate wrong answers

Option A is wrong because simple routing has no health-check-based failover; lowering the TTL only shortens DNS caching and does not change which record is returned when the endpoint is unhealthy. Option B is wrong because equal weights send roughly half the traffic to each Region all the time, which is an active-active split rather than primary-to-secondary failover, and the option attaches no health checks, so Route 53 would keep returning the unhealthy ALB. Option C is wrong because geolocation routing selects an endpoint by the user's geographic location, not by a primary/secondary relationship; it is not a mechanism for failing over from the primary Region to the secondary Region.
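
A minimal sketch of the failover record pair from option D. The domain name, ALB DNS names, hosted zone IDs, and health check ID are hypothetical. The record keys are intended to mirror a Route 53 resource record set, and `resolve` is a deliberately simplified model of the failover decision, not Route 53's actual logic.

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one alias record set for Route 53 failover routing."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def resolve(primary, secondary, primary_healthy):
    """Simplified model: serve the primary while healthy, else the secondary."""
    chosen = primary if primary_healthy else secondary
    return chosen["AliasTarget"]["DNSName"]


# All names and IDs below are hypothetical.
primary = failover_record("app.example.com", "PRIMARY",
                          "primary-alb.example.com", "ZPRIMARY", "hc-primary")
secondary = failover_record("app.example.com", "SECONDARY",
                            "secondary-alb.example.com", "ZSECONDARY")
```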

144

MCQ (hard)

A.Amazon EFS with mount targets in multiple Availability Zones

B.S3 mounted as a POSIX file system without a file gateway

C.Instance store volumes

D.An EBS volume attached to all instances

Answer: A

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, scalable, and elastic NFS file system that can be mounted concurrently on multiple Linux EC2 instances across different Availability Zones. By configuring mount targets in each AZ, the file system remains accessible even if one AZ fails, because the other mount targets continue to serve traffic. This meets the requirement for shared, highly available file storage across AZs.

Exam trap

The trap here is that candidates may confuse EBS multi-attach (which has strict limitations and is not suitable for shared file systems across AZs) with a true distributed file system like EFS, or assume that S3 with a FUSE driver can replace a POSIX-compliant shared file system.

How to eliminate wrong answers

Option B is wrong because mounting S3 as a POSIX file system without a file gateway (e.g., using s3fs-fuse) does not provide true POSIX semantics (e.g., file locking, atomic operations) and introduces performance and consistency issues; it is not a native shared file system for Linux EC2 instances. Option C is wrong because instance store volumes are ephemeral and tied to a single EC2 instance; their data is lost if the instance stops or fails, and they cannot be shared across instances or survive an AZ failure. Option D is wrong because an EBS volume exists in a single Availability Zone and normally attaches to one EC2 instance at a time; Multi-Attach is limited to io1/io2 volumes and to instances in the same AZ, so it cannot serve instances in different AZs.

145

MCQ (medium)

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG is currently attached to subnets in only two Availability Zones (AZs). During a planned maintenance window, one AZ becomes unavailable for about 25 minutes. Monitoring shows that targets in the remaining AZ remain healthy, and the ALB/target group health checks report normal. However, users still experience intermittent connection failures and slower responses during the AZ outage. What change will most directly improve resilience against an AZ loss while keeping the same ALB-based design?

A.Set the ASG min capacity to 0 so instances can be recreated faster when an AZ recovers.

B.Extend the ASG to use subnets in three AZs so there is placement redundancy during an AZ outage, while continuing to keep traffic behind the ALB.

C.Increase the ALB idle timeout to 120 seconds to reduce connection drops.

D.Disable health checks on the target group so instances are not deregistered during the maintenance window.

Answer: B

An AZ outage reduces the number of AZs where the ASG can place instances. With only two AZs, losing one significantly limits capacity and can cause temporary shortages and uneven load distribution, even if existing targets are marked healthy. Expanding the ASG to subnets in three (or more) AZs provides additional placement options so the ASG can maintain the desired number of instances across the remaining AZ(s). The ALB will continue routing only to healthy targets, and the system is more likely to sustain stable response times during the outage.

Why this answer

B is correct because deploying the ASG across three Availability Zones (AZs) ensures that when one AZ becomes unavailable, the remaining two AZs can handle the full traffic load without overloading the instances. This placement redundancy directly addresses the intermittent connection failures and slower responses, as the ALB can distribute traffic only to healthy targets in the remaining AZs, maintaining capacity and performance. The current two-AZ setup lacks sufficient buffer capacity, causing the single remaining AZ to become overwhelmed during the outage.

Exam trap

The trap here is that candidates may focus on connection-level settings (idle timeout) or health check behavior, missing the fundamental architectural need for multi-AZ redundancy to maintain capacity during an AZ outage.

How to eliminate wrong answers

Option A is wrong because setting the ASG min capacity to 0 does not help during an AZ outage; it would actually allow all instances to be terminated, making the application unavailable, and it does not address the lack of capacity in the remaining AZ. Option C is wrong because increasing the ALB idle timeout to 120 seconds only keeps idle connections open longer, which does not prevent connection failures or slow responses caused by insufficient capacity in the remaining AZ; it may even mask underlying issues. Option D is wrong because disabling health checks on the target group would prevent the ALB from deregistering unhealthy instances, causing traffic to be routed to failed instances in the unavailable AZ, leading to more connection failures and no improvement in resilience.
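
The capacity argument can be made concrete with a little arithmetic. The sketch below assumes instances are spread evenly across AZs and looks only at the moment right after one AZ is lost, before the ASG launches replacements; the fleet size and request rate are made-up numbers.

```python
def surviving_capacity(az_count, instances_per_az):
    """Instances left right after one AZ is lost, before any replacement."""
    return (az_count - 1) * instances_per_az


def load_per_instance(total_requests, az_count, instances_per_az):
    """Requests per surviving instance right after one AZ is lost."""
    return total_requests / surviving_capacity(az_count, instances_per_az)


# Hypothetical fleet: 12 instances serving 1200 requests per second.
two_az = load_per_instance(1200, az_count=2, instances_per_az=6)    # 200.0
three_az = load_per_instance(1200, az_count=3, instances_per_az=4)  # 150.0
```

With the same 12 instances, each one normally handles 100 requests per second. Losing an AZ doubles the per-instance load in the two-AZ layout but raises it by only half in the three-AZ layout.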

146

MCQ (medium)

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The team wants the control to be enforceable during normal operations.

A.AWS WAF

B.Amazon CloudFront

C.Amazon SQS queue

D.Amazon Route 53 weighted routing

Answer: C

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a durable buffer between the web tier and fulfilment workers, decoupling the producers from consumers. When bursts of orders arrive, SQS queues the messages and allows the fulfilment service to poll and process them at its own pace, absorbing spikes without data loss. The queue provides at-least-once delivery: a message that is not deleted becomes visible again after the visibility timeout and is retried, and a dead-letter queue can capture messages that keep failing, so requests are not lost when processing fails.

Exam trap

The trap here is that candidates confuse buffering and decoupling (SQS) with traffic distribution (Route 53) or security filtering (WAF), or they mistakenly think a CDN (CloudFront) can handle asynchronous order processing, but none of those services provide durable message storage or retry logic.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules (e.g., SQL injection, XSS), not a message queue; it cannot buffer or retry order processing. Option B is wrong because Amazon CloudFront is a content delivery network (CDN) that caches and accelerates static/dynamic content delivery, not a queue for asynchronous message passing; it does not provide durable storage for order requests. Option D is wrong because Amazon Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights, but it does not buffer or retry requests; it only controls which server receives a request, and if the downstream service is overwhelmed, requests are still lost or fail.
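
The buffering behaviour can be illustrated with a toy simulation. This uses an in-process `deque` purely as a stand-in for the queue (it has none of the durability of SQS), and the arrival pattern and worker rate are made-up numbers.

```python
from collections import deque


def simulate(arrivals_per_tick, worker_rate):
    """Return (processed, max_backlog) for a queue drained at a fixed rate."""
    queue = deque()
    processed = 0
    max_backlog = 0
    next_id = 0
    for arrivals in arrivals_per_tick:
        # The web tier enqueues every order it receives in this tick.
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # Workers take at most `worker_rate` orders per tick.
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()
            processed += 1
        max_backlog = max(max_backlog, len(queue))
    # Keep draining after arrivals stop until the backlog is empty.
    while queue:
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()
            processed += 1
    return processed, max_backlog


# A burst of 10 orders hits workers that handle 3 per tick.
result = simulate([10, 0, 0, 0], worker_rate=3)
```

Here the burst leaves a backlog of 7 orders after the first tick, yet all 10 are eventually processed; without the buffer, the 7 orders beyond the workers' capacity would have been rejected or dropped.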

147

MCQ (medium)

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.

A.AWS WAF

B.Amazon Route 53 weighted routing

C.Amazon SQS queue

D.Amazon CloudFront

Answer: C

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a fully managed message queue that decouples the web tier from the fulfilment workers, buffering incoming order bursts. It provides at-least-once delivery and allows workers to poll messages at their own pace, ensuring no requests are lost even during spikes. A message that fails processing becomes visible again after the visibility timeout and is retried, and a dead-letter queue (DLQ) can capture messages that keep failing, meeting the requirement for a resilient, managed, AWS-native control.

Exam trap

The trap here is that candidates may confuse AWS WAF or CloudFront as tools for handling traffic spikes, but neither provides the decoupling, buffering, and retry capabilities of a queue; they are designed for security and content delivery, respectively, not for asynchronous processing.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules (e.g., SQL injection, XSS) and does not provide message buffering, queuing, or retry logic for downstream services. Option B is wrong because Amazon Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights, but it does not absorb spikes or provide retry mechanisms; it only controls which endpoint receives a request, and a failed request is lost unless the client retries. Option D is wrong because Amazon CloudFront is a content delivery network (CDN) that caches static and dynamic content at edge locations to reduce latency, but it cannot buffer or retry requests for a downstream fulfilment service; it is designed for accelerating content delivery, not for decoupling or absorbing processing spikes.

148

MCQ (hard)

A patient portal must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.

A.Use an in-memory queue on one EC2 instance

B.Use UDP messages sent directly to workers

C.Use Amazon SQS standard queue and design consumers to be idempotent

D.Use CloudFront signed URLs

Answer: C

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues guarantee at-least-once delivery, which satisfies the requirement that every event is processed at least once. The design avoids custom operational scripts by leveraging a fully managed service, and the acceptance of duplicate processing is handled by making consumers idempotent. This combination provides a scalable, resilient, and cost-effective event-driven architecture without the need for custom infrastructure management.

Exam trap

The trap here is that candidates may confuse 'at-least-once' delivery with 'exactly-once' delivery and incorrectly choose a solution like a FIFO queue or a custom retry mechanism, but the question explicitly allows duplicate processing, making the standard queue the correct choice.

How to eliminate wrong answers

Option A is wrong because an in-memory queue on a single EC2 instance creates a single point of failure, lacks durability, and requires custom operational scripts for management and recovery, violating the 'avoid adding custom operational scripts' constraint. Option B is wrong because UDP is a connectionless, unreliable protocol that does not guarantee message delivery, so it cannot ensure at-least-once processing; it also requires custom application-level handling for reliability. Option D is wrong because CloudFront signed URLs are used for securing content delivery and controlling access to files, not for event processing or message queuing; they do not provide any event delivery guarantee or queue semantics.

149

MCQ (medium)

An orders service publishes payment instructions to an Amazon SQS queue. The downstream consumer sometimes times out while processing a message. After the message becomes visible again, the consumer may process the same instruction more than once and occasionally creates duplicate orders. The team needs a resiliency-focused design that prevents duplicates from creating double-charges, even if the same message is processed multiple times. What is the best architectural change?

A.Rely on SQS to guarantee exactly-once delivery for standard queues and remove all duplicate-handling logic in the consumer.

B.Make the consumer idempotent by using an idempotency key from the payment instruction (for example, a unique transaction/payment ID) and storing processing results with conditional writes so repeated deliveries do not create a second order.

C.Increase the SQS visibility timeout to the maximum value so the consumer never retries the message.

D.Change the queue to SNS with a fan-out subscription so each consumer gets a separate copy, ensuring processing is sequential and duplicate-free.

Answer: B

Because standard SQS is at-least-once, duplicates are expected under failure scenarios. The resilient approach is to ensure the side effect is performed only once by implementing idempotency. Store a record keyed by a payment/instruction ID using conditional logic (for example, a database conditional put/update or a transaction with a uniqueness constraint). If the key already exists, the consumer should treat the message as already processed and avoid creating a duplicate order/charge.

Why this answer

Option B is correct because making the consumer idempotent ensures that processing the same payment instruction multiple times does not result in duplicate orders or double charges. By using a unique idempotency key (e.g., transaction ID) and conditional writes (e.g., DynamoDB conditional put or database INSERT ... ON CONFLICT DO NOTHING), the consumer can safely handle repeated message deliveries without side effects.

This directly addresses the resiliency requirement without relying on SQS guarantees, which standard queues do not provide for exactly-once delivery.

Exam trap

The trap here is that candidates assume SQS FIFO queues or increased visibility timeouts solve duplicates, but the question specifically tests the concept of idempotency as the correct resiliency pattern for at-least-once delivery systems.

How to eliminate wrong answers

Option A is wrong because standard SQS queues do not guarantee exactly-once delivery; they offer at-least-once delivery, meaning duplicates can occur. Option C is wrong because increasing the visibility timeout to the maximum value (12 hours) does not prevent the consumer from timing out or processing duplicates; it only delays retries, and the message will still become visible again after the timeout expires, leading to potential duplicate processing. Option D is wrong because switching to SNS with fan-out does not prevent duplicates; it sends the same message to multiple subscribers, which could increase duplicate processing, and SNS does not provide ordering or deduplication guarantees.
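
The idempotent consumer from option B can be sketched with a uniqueness constraint. This example uses an in-memory SQLite table as a stand-in for the real data store, and the table, column names, and message shape are hypothetical; a DynamoDB conditional put would play the same role.

```python
import sqlite3


def make_store():
    """Create an in-memory table with a uniqueness constraint on payment_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (payment_id TEXT PRIMARY KEY, amount INTEGER)"
    )
    return conn


def process(conn, message):
    """Create the order once; return False for a repeated delivery."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO orders (payment_id, amount) VALUES (?, ?)",
        (message["payment_id"], message["amount"]),
    )
    conn.commit()
    # rowcount is 0 when the key already existed and the insert was ignored.
    return cursor.rowcount == 1


conn = make_store()
message = {"payment_id": "pay-123", "amount": 500}  # hypothetical payload
first = process(conn, message)    # creates the order
second = process(conn, message)   # redelivery: no second order
```

Because the data store enforces uniqueness on the idempotency key, the check and the write happen in one atomic step, which avoids the race that a separate "look up, then insert" sequence would have between two concurrent consumers.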

150

Multi-Select (medium)

A transactional application uses Amazon RDS for MySQL in a single Availability Zone. The team wants the database to fail over automatically if the primary DB instance becomes unavailable, and they want the application to recover with minimal code changes. Which two actions should they take? Select two.

Select 2 answers

A.Convert the database to an RDS Multi-AZ deployment.

B.Have the application connect to the RDS endpoint by DNS name and reconnect after failures.

C.Add a read replica and promote it manually during an outage.

D.Store the current DB instance IP address in the application configuration file.

E.Rely on nightly snapshots because they provide automatic failover to another Availability Zone.

Answers: A, B

Correct. RDS Multi-AZ is the AWS-managed availability feature designed for automatic failover to a standby in another Availability Zone. It preserves the database endpoint and reduces recovery time without requiring the application to implement its own replica-selection logic.

Why this answer

Option A is correct because RDS Multi-AZ synchronously replicates data to a standby instance in a different Availability Zone. If the primary fails, RDS automatically fails over to the standby, typically within 60–120 seconds, with no manual intervention required. Option B is correct because the failover repoints the same DB endpoint DNS name at the new primary, so an application that connects by DNS name and reconnects after a failure recovers with minimal code changes.

Exam trap

The trap here is that candidates often confuse read replicas with Multi-AZ failover. A read replica can be promoted manually, but it does not fail over automatically; AWS separates these features, with read replicas aimed at read scaling and Multi-AZ at high-availability failover.
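
The "reconnect after failures" half of the answer can be sketched as a small retry helper. The `connect` callable, the endpoint name, and the backoff values are all hypothetical; a real application would pass its database driver's connect function and catch that driver's own error type.

```python
import time


def connect_with_retry(connect, endpoint, attempts=5, base_delay=1.0,
                       sleep=time.sleep):
    """Call connect(endpoint) until it succeeds, backing off between tries."""
    for attempt in range(attempts):
        try:
            # Always connect by DNS name so a failover is picked up.
            return connect(endpoint)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Connecting by endpoint name each time matters because the name stays the same across a failover while the address behind it changes, which is also why a hard-coded IP address (option D) breaks.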
