CCNA Design Resilient Questions — Page 4 of 4

226

Multi-Select (hard)

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable?

Select 2 answers

A. Deletion protection or tightly controlled delete permissions

B. Point-in-time recovery

C. Global secondary indexes

D. DAX

Answers: A, B

Deletion protection and least-privilege controls reduce accidental table removal risk.

Why this answer

Deletion protection (Option A) prevents accidental table deletion by rejecting DeleteTable API calls until the setting is explicitly disabled, which protects the payments table from human error or automated scripts. Point-in-time recovery (Option B) enables continuous backups with per-second granularity over a recovery window of up to 35 days, allowing restoration to any second within that window to recover from accidental writes or data corruption. Together, these settings satisfy both the point-in-time recovery and accidental-delete protection requirements for the DynamoDB table.
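The two settings can be sketched as request parameters. This is a minimal sketch, assuming the boto3 DynamoDB client operations `update_table` and `update_continuous_backups`; the table name `payments` and both helper names are hypothetical, and the code only builds the parameter dictionaries rather than calling AWS.

```python
def deletion_protection_params(table_name: str) -> dict:
    """Parameters for DynamoDB UpdateTable that turn on deletion protection."""
    return {"TableName": table_name, "DeletionProtectionEnabled": True}


def pitr_params(table_name: str) -> dict:
    """Parameters for DynamoDB UpdateContinuousBackups that turn on PITR."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
    }


if __name__ == "__main__":
    # With boto3 installed and credentials configured, these dictionaries would be
    # passed as client.update_table(**...) and client.update_continuous_backups(**...).
    print(deletion_protection_params("payments"))  # "payments" is a hypothetical table
    print(pitr_params("payments"))
```

Keeping the two requests separate mirrors the question: one setting guards the table itself, the other guards the data inside it.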

Exam trap

The trap here is that candidates often confuse operational features like GSIs or DAX with data protection mechanisms, mistakenly thinking they provide recovery or deletion safeguards when they only serve performance or query optimization roles.

227

Multi-Select (medium)

A company is designing a disaster recovery plan for a critical application hosted on AWS. The application runs on EC2 instances with data stored in Amazon EBS volumes and Amazon S3. The recovery time objective (RTO) is 15 minutes, and the recovery point objective (RPO) is 1 hour. Which three strategies would help meet these objectives? (Choose three.)

Select 3 answers

A. Use AWS Backup to create hourly snapshots of EBS volumes and copy them to a different AWS Region.

B. Pre-provision EC2 instances in the disaster recovery region and keep them running 24/7.

C. Replicate critical data to S3 in the disaster recovery region using S3 Cross-Region Replication (CRR).

D. Store Amazon Machine Images (AMIs) in the source region and use AWS Lambda to copy them after a disaster.

E. Configure Amazon Route 53 with a failover routing policy and health checks to redirect traffic to the DR region.

F. Set up an AWS Direct Connect link between the primary and DR regions for faster data transfer.

Why this answer

AWS Backup can create hourly snapshots of EBS volumes and copy them to a different AWS Region, meeting the 1-hour RPO by ensuring backups are taken every hour. S3 Cross-Region Replication (CRR) asynchronously replicates objects to a bucket in another region, keeping data synchronized within minutes and supporting the RPO. Amazon Route 53 with a failover routing policy and health checks can automatically redirect traffic to the DR region within seconds to minutes, enabling the 15-minute RTO by quickly failing over to pre-prepared infrastructure.
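The RPO reasoning is simple arithmetic and can be sketched as follows. This is a rough model, assuming the worst-case data loss equals the gap between two backups and ignoring the time a snapshot and its cross-Region copy take to complete; the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of changes."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case loss stays within the recovery point objective."""
    return worst_case_data_loss(backup_interval) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    print(meets_rpo(timedelta(hours=1), rpo))  # hourly snapshots
    print(meets_rpo(timedelta(days=1), rpo))   # daily snapshots
```

Under this model hourly snapshots sit exactly at the 1-hour limit, so any real copy delay would argue for a shorter interval.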

Exam trap

The trap here is that candidates may confuse operational readiness (like pre-provisioning instances) with a specific strategy that directly contributes to meeting RTO/RPO, or they may think Direct Connect is a disaster recovery strategy when it is merely a connectivity option that does not automate failover or data replication.

228

MCQ (medium)

A service consumes messages from an SQS queue. Recently, a new message format started failing validation in the consumer. The consumer catches the exception but cannot successfully process those messages without code changes. The team wants failed messages to be isolated for later investigation instead of being retried indefinitely. What should they configure?

A. Set the queue’s retention period to 1 minute and rely on messages expiring naturally.

B. Configure a dead-letter queue (DLQ) with a redrive policy and set maxReceiveCount so messages move after repeated failed receives.

C. Increase the visibility timeout to 7 days so failed messages cannot be retried.

D. Publish the same message again to SNS on every failure so a different subscriber might succeed.

Answer: B

A DLQ isolates “poison messages” that repeatedly fail processing. With a redrive policy, SQS tracks receives; once a message exceeds maxReceiveCount without successful processing, SQS moves it to the DLQ. This prevents infinite retries on the bad format while preserving the failed messages for debugging and code fixes.

Why this answer

A dead-letter queue (DLQ) with a redrive policy is the correct solution because it moves messages that repeatedly fail processing to a separate queue once their receive count exceeds maxReceiveCount. This isolates problematic messages for later investigation without blocking the main queue or causing infinite retries. Because the consumer catches the exception and never deletes the message, the message becomes visible in the queue again when its visibility timeout expires and is redelivered; the DLQ ensures that after a configurable number of receives, the message is redirected instead of being retried indefinitely.
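The redrive policy itself is a small JSON document stored in the source queue's `RedrivePolicy` attribute. The sketch below builds that document and models the move rule; the queue ARN and the helper names are hypothetical, and nothing here calls AWS.

```python
import json


def redrive_policy(dlq_arn: str, max_receive_count: int) -> str:
    """Build the JSON string used for the SQS RedrivePolicy queue attribute."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    return json.dumps(
        {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    )


def goes_to_dlq(receive_count: int, max_receive_count: int) -> bool:
    """SQS moves a message once its receive count exceeds maxReceiveCount."""
    return receive_count > max_receive_count


if __name__ == "__main__":
    # Hypothetical ARN for illustration only.
    arn = "arn:aws:sqs:us-east-1:123456789012:orders-dlq"
    print(redrive_policy(arn, 5))
    print(goes_to_dlq(5, 5), goes_to_dlq(6, 5))
```

A low maxReceiveCount isolates poison messages quickly, while a higher value tolerates transient failures before giving up.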

Exam trap

The trap here is that candidates may think increasing the visibility timeout or relying on message expiration is sufficient, but they fail to understand that those approaches either affect all messages or only temporarily hide the message, whereas a DLQ provides a permanent, targeted isolation mechanism for repeatedly failing messages.

How to eliminate wrong answers

Option A is wrong because setting the retention period to 1 minute would cause all messages (including valid ones) to expire quickly, leading to data loss and not isolating only the failed messages. Option C is wrong because the visibility timeout only hides a message from consumers temporarily: SQS caps it at 12 hours, so 7 days is not a valid setting, and once the timeout expires the message becomes visible again and is retried rather than isolated. Option D is wrong because publishing the same message to SNS on every failure would create a loop of republishing, and subscribers that use the same validation logic would also fail, so the isolation requirement is not met.

229

MCQ (easy)

A production Amazon RDS database has automated backups enabled with sufficient retention. At 10:30 UTC, a release corrupts specific rows. The issue is detected at 10:45 UTC. The team wants to restore the database state to before the corruption with minimal complexity. What should they do?

A. Perform a point-in-time restore (PITR) to a timestamp just before 10:30 UTC and create a restored DB instance/cluster.

B. Change the VPC route tables so the database restarts in a clean state.

C. Relaunch the same DB instance in the same Availability Zone and rely on caching to revert the changes.

D. Enable a DLQ on the database to store invalid SQL statements until the system is fixed.

Answer: A

PITR uses automated backups to restore the database to a specific point in time. Selecting a timestamp just before the corruption (for example, slightly before 10:30 UTC) restores the affected data state as it existed before the bad release.

Why this answer

Option A is correct because Amazon RDS Point-in-Time Restore (PITR) allows you to restore the database to any second within the backup retention period, using automated backups and transaction logs. By restoring to a timestamp just before 10:30 UTC, you can recover the database to a state before the corruption occurred, creating a new DB instance/cluster with minimal complexity and no data loss from the uncorrupted period.
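Choosing the restore timestamp can be sketched as below. This is a minimal sketch, assuming the boto3 RDS client operation `restore_db_instance_to_point_in_time`; the instance identifiers, the calendar date, and the one-minute safety margin are hypothetical, and the code only builds the parameters.

```python
from datetime import datetime, timedelta, timezone


def pitr_restore_params(
    source_id: str,
    target_id: str,
    corruption_time: datetime,
    safety_margin: timedelta = timedelta(minutes=1),
) -> dict:
    """Parameters for a restore to a new instance shortly before the corruption."""
    if corruption_time.tzinfo is None:
        raise ValueError("corruption_time must be timezone-aware")
    restore_time = (corruption_time - safety_margin).astimezone(timezone.utc)
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a new instance, not the source
        "RestoreTime": restore_time,
    }


if __name__ == "__main__":
    # The date is illustrative; the scenario only gives the time 10:30 UTC.
    bad_release = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
    print(pitr_restore_params("prod-db", "prod-db-restored", bad_release))
```

Restoring to a separate target identifier keeps the corrupted instance available for comparison while the restored copy is checked.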

Exam trap

The trap here is that candidates may confuse database recovery methods with network or application-level fixes, or incorrectly assume that restarting or relaunching an instance will clear data changes, when in fact only a restore from backup or PITR can revert committed transactions.

How to eliminate wrong answers

Option B is wrong because changing VPC route tables affects network traffic routing, not database state or data integrity; it cannot revert corrupt rows or restart the database in a clean state. Option C is wrong because relaunching the same DB instance in the same Availability Zone does not revert data changes; it simply creates a new instance with the same underlying storage, which still contains the corrupt rows. Option D is wrong because a Dead Letter Queue (DLQ) is a concept for message queues (like Amazon SQS) to handle failed message processing, not a feature of Amazon RDS; it cannot store or revert SQL statements.

230

MCQ (medium)

A Multi-AZ Amazon RDS database experiences incorrect writes at 10:15 UTC due to a buggy release. The team detects the problem at 10:25 UTC. They want to restore the data to a known-good point around 10:15 UTC, and validate the recovered data, without taking the current production instance offline during the recovery process. What is the most appropriate AWS action?

A. Immediately reboot the RDS instance and rely on the reboot to roll back the bad writes.

B. Perform a point-in-time restore (PITR) to a new DB instance using a restore time around 10:15 UTC, then test the restored instance before cutting over.

C. Create a new Read Replica from the current primary and use it as the recovered database after applying reverse migrations.

D. Temporarily disable Multi-AZ to speed up storage rollback, then re-enable Multi-AZ.

Answer: B

PITR restores to a specific timestamp using backups and transaction logs. Importantly, it creates a recovered copy (typically a new DB instance), which allows validation and cutover decisions without stopping or directly impacting the existing production instance.

Why this answer

Option B is correct because Amazon RDS point-in-time recovery (PITR) allows you to restore a DB instance to any second within the backup retention period, creating a new, independent DB instance. This lets you validate the recovered data without affecting the current production instance, which remains online and serving traffic. The team can then cut over to the restored instance after confirming it is clean.

Exam trap

The trap here is that candidates may assume a reboot or Read Replica can undo bad writes, but neither provides a rollback mechanism; only PITR or a manual restore from a snapshot can recover to a specific point in time without affecting the live instance.

How to eliminate wrong answers

Option A is wrong because rebooting an RDS instance does not roll back writes; it only restarts the database engine and applies any pending maintenance or parameter changes, leaving the bad data intact. Option C is wrong because a Read Replica is an asynchronous copy of the primary that replicates all writes, including the buggy ones, so it cannot serve as a point-in-time recovery target without manual, error-prone reverse migrations. Option D is wrong because disabling Multi-AZ does not provide a storage rollback mechanism; it only removes the standby replica, and the primary's storage still contains the incorrect writes.

231

MCQ (medium)

A production team accidentally deletes critical rows in an Amazon RDS for PostgreSQL database. The deletion occurred about 6 hours ago. The team wants to recover to a specific point in time with minimal disruption. Assuming automated backups are enabled, which approach provides the best resilience outcome?

A. Restore the current DB instance in place by overwriting it with only the latest automated backup.

B. Use point-in-time recovery (PITR) to restore a new DB instance to a timestamp shortly before the deletion, then switch application traffic to the restored instance.

C. Create a manual snapshot and restore from it only if the snapshot date exactly matches today.

D. Perform a database-level rollback using transaction logs from the application server without using RDS restore features.

Answer: B

With automated backups enabled, PITR allows restoring to a precise timestamp within the retention window. Creating a new DB instance (rather than overwriting production) enables verification of data correctness and then a controlled cutover, minimizing disruption while meeting the “specific point in time” requirement.

Why this answer

Point-in-time recovery (PITR) allows you to restore a new DB instance to any second within the automated backup retention period, which includes transaction logs. By restoring to a timestamp just before the deletion, you recover the lost rows without affecting the current production instance, then switch traffic to the new instance for minimal disruption.

Exam trap

The trap here is that candidates may think restoring in place (Option A) is faster or simpler, but they overlook that PITR provides granular recovery without overwriting the production instance, which is the key to minimal disruption.

How to eliminate wrong answers

Option A is wrong because RDS does not restore a backup over an existing DB instance; every restore creates a new instance, and relying only on the latest automated backup would not let the team choose a timestamp just before the deletion 6 hours ago. Option C is wrong because manual snapshots capture the entire DB at a single moment and do not support point-in-time granularity; a snapshot taken after the deletion would still be missing the deleted rows. Option D is wrong because RDS for PostgreSQL does not expose transaction logs for direct database-level rollback; application-level rollback cannot guarantee consistency with the RDS-managed storage engine.

232

MCQ (easy)

A company wants a disaster recovery setup for a web application. They want to keep costs low but still recover within a couple of hours after a regional disruption. They are willing to run only minimal infrastructure in the secondary location and scale it up during the outage. Which DR approach best matches this requirement?

A. Active-active, where both Regions run full production at all times.

B. Pilot light, where the secondary Region keeps minimal core components ready and scales up during failover.

C. Cold standby, where no infrastructure is running in the secondary Region until an outage occurs.

D. Backups-only, where recovery relies solely on manually restoring snapshots during an outage.

Answer: B

Pilot light maintains a small baseline in the secondary Region to enable faster, cost-optimized recovery.

Why this answer

The Pilot light approach is correct because it keeps minimal core components (e.g., a small database, a scaled-down application server) running in the secondary Region, allowing rapid failover by scaling up those resources during an outage. This meets the requirement of low cost during normal operations while achieving recovery within a couple of hours, as the core infrastructure is already provisioned and can be scaled horizontally (e.g., using Auto Scaling groups and pre-configured AMIs) without needing to rebuild from scratch.

Exam trap

The trap here is confusing Pilot light with Cold standby, as both involve minimal infrastructure, but Pilot light has core components already running (e.g., a small database instance) while Cold standby has nothing provisioned, leading to significantly longer recovery times.

How to eliminate wrong answers

Option A is wrong because Active-active runs full production in both Regions at all times, which incurs high costs and does not match the requirement to keep costs low. Option C is wrong because Cold standby has no infrastructure running in the secondary Region until an outage occurs, which would typically require more than a couple of hours to provision and configure resources (e.g., launching EC2 instances, restoring databases) and thus fails the recovery time objective. Option D is wrong because Backups-only relies on manually restoring snapshots (e.g., EBS snapshots, RDS snapshots) during an outage, which is slow and error-prone, often exceeding the couple-of-hours recovery window due to manual intervention and data transfer times.

233

Multi-Select (hard)

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?

Select 2 answers

A. AWS Organizations service control policies

B. Route 53 failover routing with health checks

C. S3 Transfer Acceleration

D. A deployed standby application stack in the secondary Region

Answers: B, D

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks (B) is required because it continuously monitors the health of the primary endpoint and automatically reroutes traffic to a secondary Region when the primary becomes unhealthy. This is achieved by configuring primary and secondary failover records in Route 53, with the health check associated with the primary record. When the health check fails, Route 53 answers queries with the secondary record, enabling automatic failover at the DNS level. A deployed standby application stack in the secondary Region (D) is required because the DNS switch only helps if something in that Region is ready to serve the redirected traffic.

Exam trap

The trap here is that candidates often think Route 53 alone is sufficient for failover, but they forget that a fully deployed standby application stack in the secondary Region is also required to actually serve traffic after the DNS switch.

234

MCQ (easy)

An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. The design must avoid adding custom operational scripts. Which feature helps most?

A. CloudFront caching with appropriate TTLs

B. AWS Backup Vault Lock

C. IAM Access Analyzer

D. S3 Select

Answer: A

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caches responses at edge locations based on configured TTLs (Cache-Control or Expires headers). If the S3 origin becomes temporarily unavailable, CloudFront can still serve stale or cached content to users, maintaining availability without any custom scripts or failover logic. This directly addresses the requirement to serve cached pages during short S3 outages.

Exam trap

The trap here is that candidates might think AWS Backup Vault Lock (Option B) provides some form of data availability or failover, but it is purely a compliance and retention tool with no impact on serving cached web content during origin outages.

How to eliminate wrong answers

Option B is wrong because AWS Backup Vault Lock is a data protection feature for backup vaults, enforcing retention policies (WORM) to prevent deletion; it does not provide caching or origin failover for web content. Option C is wrong because IAM Access Analyzer helps identify unintended resource access policies, not caching or availability during origin outages. Option D is wrong because S3 Select is a query-in-place feature to retrieve subsets of object data using SQL expressions; it has no role in caching or serving cached pages during origin failures.

235

MCQ (medium)

Based on the exhibit, which Route 53 configuration should be used so traffic is automatically routed to the secondary Region only when the primary Region becomes unhealthy?

A. Use latency-based routing with both ALB records enabled.

B. Use failover routing with a primary alias record, a secondary alias record, and a Route 53 health check on the primary target.

C. Use geolocation routing so users are always sent to the closest Region.

D. Use a CNAME record that points to both ALBs so DNS can round-robin between Regions.

Answer: B

Failover routing is designed for this pattern: Route 53 returns the primary alias while the primary endpoint is healthy, and switches to the secondary alias when the primary health check fails. Alias records integrate cleanly with ALB targets, and the health check provides the signal that drives the failover decision.

Why this answer

Failover routing in Amazon Route 53 is designed for active-passive configurations. By creating a primary alias record pointing to the ALB in the primary Region and a secondary alias record pointing to the ALB in the secondary Region, and attaching a Route 53 health check to the primary target, traffic automatically fails over to the secondary Region only when the health check detects the primary as unhealthy. This meets the requirement of returning traffic to the secondary Region only upon primary failure.
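The primary and secondary alias records can be sketched as a Route 53 change batch. This is a minimal sketch, assuming the record-set shape accepted by the boto3 Route 53 client operation `change_resource_record_sets`; the domain name, ALB DNS names, hosted zone IDs, and health check ID used in the example are hypothetical placeholders.

```python
def failover_record(name, role, alb_dns_name, alb_zone_id, health_check_id=None):
    """Build one failover alias record set; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        # The health check on the primary record drives the failover decision.
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(primary: dict, secondary: dict) -> dict:
    """Wrap both record sets in a single UPSERT change batch."""
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ]
    }
```

Both records share one name and differ only in their `Failover` role and `SetIdentifier`, which is what lets Route 53 choose between them.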

Exam trap

The trap here is that candidates often confuse failover routing with latency-based or geolocation routing, assuming that 'closest' or 'fastest' automatically implies health awareness, but Route 53 health checks must be explicitly associated with failover records to trigger automatic traffic redirection.

How to eliminate wrong answers

Option A is wrong because latency-based routing directs users based on lowest latency, not health status, so it would not automatically fail over only when the primary is unhealthy; traffic could still be sent to an unhealthy primary if latency is low. Option C is wrong because geolocation routing sends users based on their geographic location, not the health of the endpoint, so it cannot automatically redirect traffic to the secondary Region when the primary becomes unhealthy. Option D is wrong because a CNAME record cannot point to multiple ALBs for round-robin; CNAME records can only point to a single DNS name, and DNS round-robin does not consider health checks, so traffic would still be sent to an unhealthy primary.

236

MCQ (medium)

A web application uses pooled JDBC connections to an Amazon Aurora cluster using the writer endpoint. During an Aurora planned failover, monitoring shows a short spike in failed requests. The Aurora cluster writer endpoint remains the same, but many existing pooled connections briefly fail. The application retries aggressively and overloads the new writer during the transition. Which design change will most improve application resilience during Aurora failovers without requiring application redeployment?

A. Add an RDS Proxy between the application and Aurora to manage database connections across failovers.

B. Change the Aurora cluster to Single-AZ to reduce failover events.

C. Increase the application thread count so more requests can be served while connections reconnect.

D. Pin all database traffic to a specific instance hostname instead of the writer cluster endpoint.

Answer: A

RDS Proxy terminates and manages client connections, while maintaining separate managed connections to the database. During a writer failover, the proxy can re-establish backend connections to the new writer, reducing failed pooled connections seen by the application and lowering retry pressure.

Why this answer

RDS Proxy sits between the application and the Aurora cluster, maintaining a warm connection pool to the database. During a failover, RDS Proxy reconnects to the new writer instance while keeping the application's idle connections open, which sharply reduces the spike in failed requests and dampens the aggressive retry storm that overloads the new writer. This requires no code changes; the application only needs its connection string pointed at the proxy endpoint instead of the cluster endpoint.

Exam trap

The trap here is that candidates assume the writer endpoint remains the same so connections should survive, but they miss that pooled JDBC connections hold stale server-side state (like TCP sockets and session context) that is invalidated during failover, and only RDS Proxy can transparently preserve those connections without application changes.

How to eliminate wrong answers

Option B is wrong because switching to Single-AZ eliminates the failover mechanism entirely, making the application less resilient and increasing downtime during any instance failure, which is the opposite of improving resilience. Option C is wrong because increasing the thread count only amplifies the retry storm, overwhelming the new writer even more and failing to address the root cause of connection drops during failover. Option D is wrong because pinning traffic to a specific instance hostname bypasses the writer endpoint's automatic failover routing, causing all traffic to fail if that instance becomes unavailable, and it requires application redeployment to change the hostname.

237

MCQ (medium)

A web application runs on an Auto Scaling group (ASG) behind an Application Load Balancer (ALB). After a new release, instances begin failing ALB health checks with errors like 502 while the application is still starting up. CloudWatch shows that the ASG replaces the instances before they finish initializing, so traffic never reaches healthy targets. Which change most directly prevents premature replacement during startup so traffic can resume as soon as the instances are actually healthy?

A. Reduce the ALB health check timeout to 1 second so failures are detected faster.

B. Increase the Auto Scaling group health check grace period to cover application startup and initialization time.

C. Enable connection draining on the ALB target group but set deregistration delay to 0 seconds.

D. Switch the ALB target group health checks from HTTP to TCP so the application does not need to return HTTP 200.

Answer: B

The ASG health check grace period tells Auto Scaling to ignore failing health checks for a period after instance launch. This prevents newly launched instances from being replaced before the application has finished booting and can pass ALB health checks.

Why this answer

B is correct because the Auto Scaling group health check grace period allows instances a specified amount of time to initialize before the ASG starts checking their health status. By increasing this grace period to cover the application startup time, the ASG will not prematurely replace instances that are still initializing, allowing them to pass the ALB health checks and begin receiving traffic once they are actually healthy.
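The effect of the grace period can be sketched with a deliberately simplified model. The sketch assumes failed health checks only count once the grace period ends and ignores check intervals and unhealthy thresholds; the helper names, the group name, and the 60-second margin are hypothetical, and `HealthCheckGracePeriod` is the parameter name used by the boto3 Auto Scaling operation `update_auto_scaling_group`.

```python
def replaced_before_ready(startup_seconds: int, grace_period_seconds: int) -> bool:
    """True if the instance is still starting when health checks begin to count."""
    return startup_seconds > grace_period_seconds


def grace_period_params(asg_name: str, startup_seconds: int, margin_seconds: int = 60) -> dict:
    """Parameters for a grace period that covers startup plus a safety margin."""
    return {
        "AutoScalingGroupName": asg_name,
        "HealthCheckGracePeriod": startup_seconds + margin_seconds,
    }


if __name__ == "__main__":
    # An application that needs 240 seconds to start, with a 60-second grace period.
    print(replaced_before_ready(240, 60))
    print(grace_period_params("web-asg", 240))
```

Measuring real startup time after each release and adding a margin keeps the grace period from drifting below what the application needs.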

Exam trap

The trap here is that candidates often confuse the ALB health check timeout or interval with the ASG health check grace period, thinking that adjusting ALB settings will fix the premature replacement issue, when in fact the ASG grace period is the direct control for delaying health check evaluation during startup.

How to eliminate wrong answers

Option A is wrong because reducing the ALB health check timeout to 1 second would cause health checks to fail even faster, exacerbating the problem of premature instance replacement. Option C is wrong because connection draining controls how existing connections are closed during deregistration, not how quickly instances are replaced during startup; setting deregistration delay to 0 seconds would abruptly terminate active connections, causing user disruption. Option D is wrong because switching to TCP health checks would bypass the application layer, allowing the ALB to consider an instance healthy even if the application is not fully initialized, which could lead to serving 502 errors to users.

238

MCQ (medium)

A stateless web API runs on EC2 instances behind an Application Load Balancer (ALB). The Auto Scaling group (ASG) currently uses subnets from only one Availability Zone, even though the ALB spans two Availability Zones. During maintenance of that single AZ, the ALB remains up but clients see timeouts because there are no healthy targets. Which change most directly improves resilience against an AZ failure?

A. Keep the ASG in one subnet/AZ, but enable ALB stickiness to reduce session interruption.

B. Update the ASG to launch instances across subnets in at least two Availability Zones and ensure ALB health checks target an application-ready path.

C. Add a NAT gateway in the public subnets so instances can reach the internet during maintenance events.

D. Create a second ALB in the same Availability Zone and route traffic using DNS failover.

Answer: B

Spreading instances across multiple AZs ensures the ALB can route to healthy targets even when one AZ fails.

Why this answer

Option B is correct because it directly addresses the single point of failure: the ASG only launches instances in one AZ, so when that AZ fails, the ALB has no healthy targets to route traffic to, causing timeouts. By configuring the ASG to span at least two AZs, the ALB can distribute traffic to healthy instances in the remaining AZ during maintenance, ensuring high availability. The ALB health check must target an application-ready path (e.g., /health) to accurately detect instance health and avoid routing requests to impaired instances.
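The multi-AZ requirement can be checked before the group is updated. This is a minimal sketch, assuming the boto3 Auto Scaling operation `update_auto_scaling_group`, whose `VPCZoneIdentifier` parameter takes a comma-separated list of subnet IDs; the subnet IDs, zone names, and helper name are hypothetical.

```python
def multi_az_asg_params(asg_name: str, subnet_to_az: dict) -> dict:
    """Build update parameters only if the subnets span at least two AZs."""
    zones = set(subnet_to_az.values())
    if len(zones) < 2:
        raise ValueError("subnets must cover at least two Availability Zones")
    return {
        "AutoScalingGroupName": asg_name,
        # Sorted so the resulting string is deterministic.
        "VPCZoneIdentifier": ",".join(sorted(subnet_to_az)),
    }


if __name__ == "__main__":
    subnets = {"subnet-aaa": "us-east-1a", "subnet-bbb": "us-east-1b"}
    print(multi_az_asg_params("api-asg", subnets))
```

Rejecting a single-AZ subnet list up front turns the failure mode in this scenario into a configuration error instead of an outage.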

Exam trap

The trap here is that candidates may think ALB stickiness or DNS failover can compensate for a single-AZ deployment, but resilience to an AZ failure requires healthy targets in more than one Availability Zone, and the ALB's health check must be application-aware to avoid routing to impaired instances.

How to eliminate wrong answers

Option A is wrong because enabling ALB stickiness (session affinity) does not solve the root cause; it only binds a client session to a specific target, but if all targets in the single AZ are unhealthy, stickiness cannot route traffic to a healthy instance and timeouts will still occur. Option C is wrong because adding a NAT gateway in public subnets provides outbound internet access for instances, which is irrelevant to the AZ failure scenario where the issue is the lack of healthy targets in the ALB's target group, not internet connectivity. Option D is wrong because creating a second ALB in the same single AZ does not eliminate the single point of failure; DNS failover would still route to an ALB that has no healthy targets if that AZ fails, and the architecture remains dependent on one AZ.

239

Multi-Select (medium)

A company is designing a multi-Region disaster recovery (DR) strategy for a stateless web application running on Amazon EC2 instances behind an Application Load Balancer (ALB). The application uses an Amazon RDS for MySQL database as its data store. The architecture must provide rapid failover with the lowest possible Recovery Point Objective (RPO) and Recovery Time Objective (RTO). Which of the following design choices will help achieve these objectives? (Choose four.)

Select 4 answers

A. Configure an active-passive failover strategy by deploying the application stack in two AWS Regions and using Amazon Route 53 health checks with a failover routing policy.

B. Set up Amazon RDS Multi-AZ deployment to enable automatic failover to a standby replica in a different Availability Zone within the primary Region.

C. Use Amazon RDS cross-Region read replicas with automatic failover to promote a read replica to a primary instance in the secondary Region.

D. Deploy the application and ALB in an active-active configuration across two AWS Regions using Amazon Route 53 latency-based routing.

E. Store static assets and application state in Amazon S3 with cross-Region replication enabled, and serve them via Amazon CloudFront.

F. Use an Amazon RDS for MySQL single-AZ deployment in the primary Region and take daily snapshots copied to the secondary Region.

Why this answer

An active-passive failover strategy with a Route 53 failover routing policy is correct because it provides rapid failover by directing traffic to the secondary Region only when health checks fail in the primary, minimizing RTO. Cross-Region read replicas are correct because a replica in the secondary Region can be promoted to a standalone primary with low RPO (typically seconds to minutes of replication lag); for RDS for MySQL the promotion is started by an operator or by automation rather than by RDS itself, so scripting it keeps RTO low. An active-active configuration with latency-based routing is correct because both Regions already serve traffic, so no standby capacity has to be started during a failover, achieving very low RTO.

Storing static assets and application state in S3 with cross-Region replication and CloudFront is correct because it ensures data durability and low-latency access, supporting rapid recovery with minimal RPO.

Exam trap

The trap here is that candidates often confuse Multi-AZ (single-Region high availability) with cross-Region DR, or they assume daily snapshots provide adequate RPO for a DR strategy requiring the lowest possible RPO and RTO.

240

MCQmedium

An inventory service uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.

A.Lambda reserved concurrency set to zero

B.A Lambda dead-letter queue or failure destination

C.A larger deployment package

D.CloudFront error pages

AnswerB

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

A Lambda dead-letter queue (DLQ) or failure destination captures events that have exhausted all retry attempts from an asynchronous invocation. When the function still fails after the retries are exhausted (by default, two retries after the initial attempt), the event is sent to the configured target, such as an SQS queue or SNS topic, for later investigation, without requiring custom scripts or manual polling.
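As a rough sketch of how a failure destination might be wired up, the helper below builds the arguments for Lambda's `PutFunctionEventInvokeConfig` API (`put_function_event_invoke_config` in boto3). The function name and queue ARN are placeholders, and the retry and event-age values are assumptions chosen for illustration.

```python
def build_failure_destination_config(function_name, destination_arn,
                                     max_retries=2, max_event_age_seconds=3600):
    """Build kwargs for boto3's lambda.put_function_event_invoke_config."""
    # Asynchronous invocations allow between 0 and 2 retries.
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    # The maximum event age must be between 60 seconds and 6 hours.
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("max_event_age_seconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        # Events that still fail after the retries go to this target.
        "DestinationConfig": {"OnFailure": {"Destination": destination_arn}},
    }


# Placeholder names, not taken from the question.
config = build_failure_destination_config(
    "inventory-sync",
    "arn:aws:sqs:us-east-1:123456789012:inventory-failed-events",
)
```

With real credentials, these arguments could be passed as `boto3.client("lambda").put_function_event_invoke_config(**config)`.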

Exam trap

The trap here is that candidates may confuse Lambda's DLQ/failure destination with other error-handling mechanisms like SQS redrive policies or CloudFront custom error pages, which serve different purposes and operate at different layers of the architecture.

How to eliminate wrong answers

Option A is wrong because setting reserved concurrency to zero would completely disable the Lambda function, preventing any invocations and thus failing to process or retain any events. Option C is wrong because a larger deployment package does not affect error handling or event retention; it only increases cold start latency and deployment size. Option D is wrong because CloudFront error pages are for HTTP-level errors from a web distribution, not for capturing asynchronous Lambda invocation failures or dead-letter events.

241

MCQeasy

An order system receives events and uses a Lambda function to write each order into a database. During traffic spikes, the database sometimes throttles, and Lambda retries lead to occasional message loss in the event flow. The team wants buffering, automatic retries, and a way to isolate messages that repeatedly fail so they can be inspected later. What design change best meets this need?

A.Send events directly from EventBridge to Lambda without any queue to simplify the flow.

B.Use Amazon SQS as a buffer between the event source and Lambda, with an SQS dead-letter queue (DLQ).

C.Use SNS fan-out to multiple Lambda functions, but keep no retry logic and no DLQ.

D.Store events in an S3 bucket and trigger Lambda immediately after each upload, without using DLQs.

AnswerB

SQS buffers bursts, supports retries via visibility timeouts, and DLQs capture messages that fail repeatedly for later review.

Why this answer

B is correct because Amazon SQS acts as a durable buffer between the event source and Lambda, absorbing traffic spikes and providing automatic retries via its visibility timeout mechanism. By attaching a dead-letter queue (DLQ) to the SQS queue, messages that repeatedly fail processing can be isolated for later inspection, preventing data loss and enabling debugging.
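A minimal sketch of the queue settings involved, assuming the DLQ already exists: the helper builds the `Attributes` map that SQS `CreateQueue` or `SetQueueAttributes` accepts. The ARN, receive count, and visibility timeout are illustrative assumptions, not values from the question.

```python
import json


def build_queue_attributes(dlq_arn, max_receive_count=5,
                           visibility_timeout_seconds=60):
    """Build SQS queue attributes that attach a dead-letter queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        # SQS attribute values are passed as strings.
        "VisibilityTimeout": str(visibility_timeout_seconds),
        # After max_receive_count failed receives, SQS moves the message
        # to the queue identified by deadLetterTargetArn.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Placeholder ARN for illustration only.
attributes = build_queue_attributes(
    "arn:aws:sqs:us-east-1:123456789012:orders-dlq"
)
policy = json.loads(attributes["RedrivePolicy"])
```

When Lambda consumes the queue, the visibility timeout is usually set comfortably above the function timeout so that a message is not redelivered while it is still being processed.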

Exam trap

The trap here is that candidates may think EventBridge or S3 triggers provide sufficient retry and isolation, but they lack the built-in DLQ and configurable retry mechanics that SQS offers for decoupling and resilience.

How to eliminate wrong answers

Option A is wrong because sending events directly from EventBridge to Lambda without a queue provides no buffering or retry isolation; Lambda's asynchronous invocation retries are limited and can still lead to message loss under throttling. Option C is wrong because SNS fan-out to multiple Lambda functions without retry logic and no DLQ means failed messages are dropped immediately, with no mechanism for buffering or isolating problematic messages. Option D is wrong because storing events in S3 and triggering Lambda immediately after upload does not provide built-in retry logic for processing failures, and S3 does not offer a DLQ concept; failed events would be lost unless custom retry logic is implemented.

242

MCQeasy

A company runs its customer-facing web app on EC2 behind an Application Load Balancer. The database is Amazon RDS for PostgreSQL. The requirement is that if a single Availability Zone fails, the database must automatically fail over within the same AWS Region with minimal application changes. Which database setup best meets this requirement?

A.Use an RDS single-AZ instance and periodically restore from automated backups if needed.

B.Deploy the RDS PostgreSQL instance as Multi-AZ with automatic failover enabled.

C.Create a read replica in a different AZ and use it only when the primary fails.

D.Use RDS with Multi-AZ disabled, but increase storage IOPS to prevent failover.

AnswerB

Multi-AZ RDS maintains a standby instance in a different AZ. If the primary fails, RDS performs automatic failover, preserving the same database endpoint behavior.

Why this answer

Option B is correct because RDS Multi-AZ with automatic failover provides synchronous replication to a standby instance in a different Availability Zone. If the primary AZ fails, RDS automatically flips the DNS CNAME to the standby, resulting in minimal application changes (only a brief connection interruption). This meets the requirement for automatic failover within the same region without manual intervention.
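For an existing instance, converting to Multi-AZ is a single setting. The sketch below builds the arguments for the RDS `ModifyDBInstance` API; the instance identifier is a placeholder, and applying the change immediately is an assumption (it could also wait for the maintenance window).

```python
def build_multi_az_request(db_instance_id, apply_immediately=True):
    """Build kwargs for boto3's rds.modify_db_instance to enable Multi-AZ."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        # Provisions a synchronous standby in another Availability Zone.
        "MultiAZ": True,
        "ApplyImmediately": apply_immediately,
    }


# Placeholder identifier for illustration only.
request = build_multi_az_request("customer-portal-db")
```

The application keeps using the same instance endpoint before and after the change, which is why only reconnect handling is needed on failover.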

Exam trap

The trap here is that candidates often confuse a read replica (Option C) with a Multi-AZ standby, but a read replica is asynchronous and requires manual promotion, whereas Multi-AZ provides automatic synchronous failover with no application changes beyond reconnecting.

How to eliminate wrong answers

Option A is wrong because restoring from automated backups is a manual, time-consuming process that does not provide automatic failover; it can take hours and requires application changes to point to a new endpoint. Option C is wrong because a read replica is designed for read scaling, not automatic failover; promoting it to a primary requires manual intervention and does not provide synchronous replication, leading to potential data loss. Option D is wrong because disabling Multi-AZ and increasing IOPS does not provide any failover capability; it only improves performance, not availability, and a single-AZ failure will still cause an outage.

243

MCQmedium

A company runs a customer portal on an Amazon Aurora PostgreSQL cluster. The application currently connects directly to the writer instance endpoint and keeps long-lived connections open. During a maintenance failover, writes fail until clients are restarted. The team wants the application to reconnect to the correct Aurora endpoint automatically and reduce user-visible write interruptions. Which change is most likely to achieve this?

A.Use the Aurora cluster endpoint for write traffic, use the reader endpoint for read-only traffic, and implement connection retry or reconnect logic on failover.

B.Keep using the original writer instance endpoint so the database host name never changes during failover.

C.Convert the Aurora cluster to Single-AZ so there is only one database node to connect to.

D.Place Route 53 in front of the database and manually update DNS records whenever failover occurs.

AnswerA

The cluster endpoint always targets the current writer, and failover-aware reconnect logic helps the application recover from dropped connections after promotion.

Why this answer

The Aurora cluster endpoint automatically points to the current writer instance, so using it for write traffic ensures that after a failover, new writes are directed to the new writer without needing to change the connection string. Implementing connection retry or reconnect logic in the application is essential because the existing long-lived connections will be broken during failover; the application must detect the failure and re-establish connections to the cluster endpoint to resume writes seamlessly.
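The reconnect logic can be sketched as a retry loop with exponential backoff. This is an illustrative pattern rather than any particular driver's API: `connect` stands for whatever callable opens a connection to the cluster endpoint, and `TransientConnectionError` is a hypothetical stand-in for the driver's connection-lost exception.

```python
import time


class TransientConnectionError(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def connect_with_retry(connect, max_attempts=5, base_delay=0.5,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off between attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except TransientConnectionError as err:
            last_error = err
            # Wait 0.5s, 1s, 2s, ... before the next attempt.
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error


# Demo: the first two attempts fail, as they might during a failover.
attempts = []


def fake_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientConnectionError("writer not available yet")
    return "connected to cluster endpoint"


delays = []
result = connect_with_retry(fake_connect, sleep=delays.append)
```

Because every attempt targets the cluster endpoint, the loop reaches the new writer as soon as the endpoint resolves to it; no host name in the application changes.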

Exam trap

The trap here is that candidates assume the writer instance endpoint follows the writer role during failover (Option B). In Aurora, an instance endpoint is tied to one specific DB instance, not to the cluster, so after failover it no longer points to the writer.

How to eliminate wrong answers

Option B is wrong because the writer instance endpoint is tied to a specific database node; during a failover, a different instance is promoted to writer, so the original instance endpoint no longer leads to the writer, and the application would still need to reconnect. Option C is wrong because converting to Single-AZ removes the high-availability failover capability entirely, which contradicts the goal of reducing write interruptions during a failover. Option D is wrong because manually updating Route 53 DNS records is error-prone, introduces latency due to DNS caching, and does not provide the automatic, low-latency failover behavior that the Aurora cluster endpoint offers natively.

244

Multi-Selecteasy

A company hosts an internal API in two AWS Regions. Traffic must automatically switch to the secondary Region when the primary Region's endpoint is unhealthy. Which two Route 53 settings are required? Select two.

Select 2 answers

A.Use a failover routing policy for the DNS record.

B.Configure a health check for the primary endpoint.

C.Use geolocation routing so users are always sent to the closest Region.

D.Use a private hosted zone to expose the API to the internet.

E.Set the TTL to zero and skip health checks to make failover faster.

AnswersA, B

Failover routing is specifically designed for primary and secondary endpoints. Route 53 returns the secondary record when the primary record is considered unhealthy.

Why this answer

A failover routing policy is required because it allows Route 53 to automatically route traffic from a primary resource to a secondary resource when the primary is unhealthy. This is the only routing policy that supports active-passive failover across two AWS Regions. Without this policy, Route 53 would not know which endpoint to consider primary or how to switch traffic upon failure.
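The second required setting, a health check on the primary endpoint, might be defined along the following lines. The helper builds arguments for the Route 53 `CreateHealthCheck` API; the domain name, path, and thresholds are assumptions for illustration.

```python
def build_health_check_request(domain_name, caller_reference, path="/health",
                               interval_seconds=30, failure_threshold=3):
    """Build kwargs for boto3's route53.create_health_check."""
    # Route 53 supports request intervals of 10 or 30 seconds.
    if interval_seconds not in (10, 30):
        raise ValueError("interval_seconds must be 10 or 30")
    if not 1 <= failure_threshold <= 10:
        raise ValueError("failure_threshold must be between 1 and 10")
    return {
        # Any unique string; it makes the create call safe to retry.
        "CallerReference": caller_reference,
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": domain_name,
            "Port": 443,
            "ResourcePath": path,
            "RequestInterval": interval_seconds,
            "FailureThreshold": failure_threshold,
        },
    }


# Placeholder values, not taken from the question.
health_check = build_health_check_request(
    "api-primary.example.com", "primary-api-check-001"
)
```

The health check ID returned by the API is then attached to the primary failover record so Route 53 knows when to answer with the secondary.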

Exam trap

The trap here is that candidates often confuse failover routing with geolocation routing, thinking geographic proximity is sufficient for disaster recovery, but failover routing is the only policy that provides automatic health-based switching between primary and secondary endpoints.

245

MCQmedium

A developer accidentally deletes important rows in an RDS database. The mistake is discovered 45 minutes later. The database has automated backups enabled with a retention period of 7 days. What is the best way to restore the database to a point just before the deletion?

A.Restore the latest manual snapshot and then run SQL scripts to revert the deletion.

B.Use point-in-time restore (PITR) to restore the database to a specific timestamp before the deletion, based on automated backups.

C.Promote an existing read replica to be the primary and then copy the missing rows from logs.

D.Recreate the instance using the most recent CloudWatch metric alarm snapshot of storage metrics.

AnswerB

With automated backups enabled, RDS supports PITR within the retention window. PITR lets you restore to any second within that window, so you can select a timestamp just before the destructive deletion occurred. This avoids restoring a potentially stale snapshot and eliminates the need for risky manual compensating scripts.

Why this answer

Point-in-time restore (PITR) allows you to restore an RDS DB instance to any second within the automated backup retention period (here, 7 days). Since the deletion occurred 45 minutes ago, you can specify a timestamp just before the deletion, and RDS will replay the transaction logs to bring the database to that exact state. This is the most precise and efficient recovery method for accidental data modifications.
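A short sketch of how the restore request could be assembled, assuming the deletion time is known. The helper targets the RDS `RestoreDBInstanceToPointInTime` API; the identifiers and the one-minute safety margin are assumptions, and the retention check is simplified (the real upper bound is the instance's latest restorable time).

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_id, target_id, deletion_time, now,
                       retention_days=7,
                       safety_margin=timedelta(minutes=1)):
    """Build kwargs for boto3's rds.restore_db_instance_to_point_in_time."""
    # Restore to slightly before the destructive statement ran.
    restore_time = deletion_time - safety_margin
    earliest = now - timedelta(days=retention_days)
    if not earliest <= restore_time <= now:
        raise ValueError("restore time is outside the retention window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        # PITR always creates a new instance; it never overwrites the source.
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Example timeline: the deletion happened 45 minutes before 'now'.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
deleted_at = now - timedelta(minutes=45)
request = build_pitr_request("orders-db", "orders-db-restored",
                             deleted_at, now)
```

After the new instance is available, the missing rows can be copied back or the application can be repointed to it.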

Exam trap

The trap here is that candidates may assume manual snapshots or read replicas can be used for granular point-in-time recovery, but only automated backups with transaction logs enable restoring to a specific second within the retention period.

How to eliminate wrong answers

Option A is wrong because manual snapshots capture the entire instance at a point in time, but they do not provide the granularity to restore to a specific moment just before the deletion; you would lose all changes made after the snapshot, and running SQL scripts to revert deletions is error-prone and not a built-in RDS feature. Option C is wrong because promoting a read replica makes it a new primary, but it does not revert data; it simply becomes a writable copy of the current state, which still contains the deletion. Option D is wrong because CloudWatch metric alarms monitor performance metrics, not database row-level data; they cannot be used to restore or recover deleted rows.

246

MCQmedium

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.Read replicas only

D.EBS snapshots every hour

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL automatically provisions and maintains a synchronous standby replica in a different Availability Zone. In the event of an AZ failure, Amazon RDS automatically fails over to the standby, providing high availability with minimal application changes (the application only needs to reconnect to the same endpoint). This meets the requirement for availability during an AZ outage without requiring code modifications.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ (which provides automatic failover for high availability), leading them to select read replicas as a cheaper or simpler alternative.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage in S3, not for RDS MySQL databases, and it does not provide automatic failover for a relational database. Option C is wrong because read replicas are designed for read scaling, not for automatic failover during an AZ failure; they require manual promotion and application changes to redirect writes. Option D is wrong because EBS snapshots every hour provide point-in-time backup and recovery, not high availability; restoring from a snapshot would involve significant downtime and manual intervention, not minimal application changes.

247

MCQmedium

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.Read replicas only

D.EBS snapshots every hour

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL automatically provisions and synchronously replicates a standby instance in a different Availability Zone. In the event of an AZ failure, Amazon RDS automatically fails over to the standby, providing high availability with minimal application changes (the application only needs to reconnect using the same endpoint). This meets the requirement of remaining available during an AZ failure while being enforceable during normal operations.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read offloading and disaster recovery across regions) with Multi-AZ deployments (which provide synchronous replication and automatic failover for high availability within a region).

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object-level replication in Amazon S3, not for RDS MySQL databases, and it does not provide automatic failover for a relational database. Option C is wrong because read replicas are designed for read scaling and asynchronous replication; they do not provide automatic failover for the primary database instance during an AZ failure, and promoting a read replica requires manual intervention and changes to the application connection string. Option D is wrong because EBS snapshots every hour provide point-in-time backups, not high availability; restoring from a snapshot would involve significant downtime and manual steps, failing the requirement for minimal application changes and continuous availability.

248

MCQmedium

You host a public API using Amazon API Gateway in two AWS Regions: us-east-1 (primary) and us-west-2 (secondary). You want Route 53 to send client traffic to the secondary region only when the primary API is unhealthy. Which Route 53 setup best meets this requirement?

A.Use latency-based routing with one routing policy per region, and use CloudWatch alarms to update traffic weights between regions.

B.Use Route 53 failover routing with two ALIAS records (same DNS name) pointing to the API Gateway regional endpoints: one record is configured as PRIMARY with an associated health check, and the other is configured as SECONDARY.

C.Use weighted routing across both regions and rely on Route 53 health checks to automatically set the secondary to 100% weight when the primary fails.

D.Use geolocation routing to map some client geographies to the secondary region and the rest to the primary region.

AnswerB

Failover routing is designed for active-passive regional resiliency. With a PRIMARY record tied to a health check, Route 53 automatically returns DNS answers to the SECONDARY endpoint when the PRIMARY fails health checks.

Why this answer

Route 53 failover routing is designed for active-passive setups where traffic is sent to a primary resource unless it is unhealthy, in which case traffic is routed to a secondary resource. By creating two ALIAS records with the same DNS name, one marked PRIMARY with an associated health check and the other marked SECONDARY, Route 53 will automatically fail over to the secondary region when the health check for the primary API Gateway endpoint fails. This directly meets the requirement of sending traffic to the secondary region only when the primary API is unhealthy.
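The record pair described above could be expressed as a Route 53 change batch roughly as follows. This is a sketch: the DNS names, hosted zone IDs, and health check ID are placeholders, and the helper name `failover_record` is hypothetical.

```python
def failover_record(name, role, set_id, alias_dns, alias_zone_id,
                    health_check_id=None):
    """Build one UPSERT change for a failover alias record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,
            "DNSName": alias_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Both records share one DNS name; all values are placeholders.
change_batch = {
    "Changes": [
        failover_record("api.example.com", "PRIMARY", "use1",
                        "d-primary.execute-api.us-east-1.amazonaws.com",
                        "ZPRIMARYPLACEHOLDER", health_check_id="hc-primary"),
        failover_record("api.example.com", "SECONDARY", "usw2",
                        "d-secondary.execute-api.us-west-2.amazonaws.com",
                        "ZSECONDARYPLACEHOLDER"),
    ]
}
```

The batch would be submitted with `change_resource_record_sets(HostedZoneId=..., ChangeBatch=change_batch)` against the hosted zone that owns the API's domain.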

Exam trap

The trap here is that candidates often confuse weighted routing with failover routing, mistakenly believing that Route 53 health checks can automatically adjust weights to achieve active-passive failover, when in fact weighted routing does not support dynamic weight adjustment based on health.

How to eliminate wrong answers

Option A is wrong because latency-based routing directs traffic based on lowest latency, not health, and using CloudWatch alarms to manually update weights is not an automatic failover mechanism; it also requires custom automation and does not natively support health-check-driven failover. Option C is wrong because weighted routing distributes traffic based on assigned weights and does not automatically set the secondary to 100% weight when the primary fails; Route 53 health checks can mark a record as unhealthy but do not dynamically adjust weights—they would cause the primary record to be excluded from responses, but the secondary would only receive traffic if its weight is non-zero, and the behavior is not a clean active-passive failover. Option D is wrong because geolocation routing directs traffic based on the geographic location of the client, not the health of the endpoint, and it cannot automatically fail over traffic from one region to another when the primary becomes unhealthy.

249

MCQmedium

An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity comes from an Auto Scaling group (ASG) with min size set to 1, and in practice the ASG uses only one subnet. What is the most effective change to ensure the service continues running if the instances in a single AZ are lost?

A.Set the ECS deployment configuration to maximum percent 100 so tasks replace instances faster during rollouts.

B.Increase ASG min size to at least 2 and ensure the ASG uses subnets in at least two Availability Zones.

C.Enable ALB connection draining longer than expected so existing connections survive longer during an AZ event.

D.Reduce task memory reservations to pack both tasks onto a single EC2 instance.

AnswerB

Multi-AZ instance capacity ensures tasks have eligible compute in another AZ when one AZ loses instances.

Why this answer

The current architecture has a single point of failure because the Auto Scaling group (ASG) spans only one subnet (one Availability Zone). If that AZ fails, all EC2 instances are lost, and the ECS service cannot run any tasks. Increasing the ASG min size to at least 2 and configuring it to use subnets in at least two AZs ensures that EC2 instances are distributed across AZs, allowing the ECS service to maintain at least one task in the surviving AZ during a single-AZ failure.
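To make the fix concrete, the sketch below validates that the chosen subnets cover at least two AZs and then builds arguments for the Auto Scaling `UpdateAutoScalingGroup` API. The group name, subnet IDs, and the subnet-to-AZ mapping are illustrative assumptions.

```python
def build_multi_az_asg_update(asg_name, subnet_to_az, min_size=2):
    """Build kwargs for boto3's autoscaling.update_auto_scaling_group."""
    zones = set(subnet_to_az.values())
    if len(zones) < 2:
        raise ValueError("subnets must cover at least two Availability Zones")
    if min_size < 2:
        raise ValueError("min_size must be at least 2 to survive an AZ loss")
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": min_size,
        # The API expects a comma-separated list of subnet IDs.
        "VPCZoneIdentifier": ",".join(sorted(subnet_to_az)),
    }


# Placeholder subnets in two different AZs.
update = build_multi_az_asg_update(
    "ecs-capacity-asg",
    {"subnet-aaa111": "us-east-1a", "subnet-bbb222": "us-east-1b"},
)
```

The group's maximum size must be at least the new minimum, so it may need to be raised in the same call.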

Exam trap

The trap here is that candidates often focus on ECS-specific settings (like deployment configuration or task placement) rather than recognizing that the root cause is the ASG's single-AZ limitation, which is a fundamental infrastructure resilience issue.

How to eliminate wrong answers

Option A is wrong because setting the ECS deployment configuration to maximum percent 100 controls how tasks are replaced during a rolling update, not how the service survives an AZ failure; it does not address the underlying lack of EC2 capacity in multiple AZs. Option C is wrong because ALB connection draining only helps gracefully terminate existing connections during deregistration or health check failures; it does not provision new compute capacity or ensure tasks run in another AZ after an AZ loss. Option D is wrong because reducing task memory reservations to pack both tasks onto a single EC2 instance actually increases risk—if that single instance (or its AZ) fails, both tasks are lost, and the ASG min size of 1 cannot recover quickly enough to meet the requirement.

250

Multi-Selectmedium

A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.

Select 2 answers

A.Increase the EC2 instance size and keep all application servers in the same subnet.

B.Configure the Auto Scaling group to launch instances across private subnets in at least two Availability Zones.

C.Replace the Application Load Balancer with a Network Load Balancer in a single Availability Zone.

D.Convert the RDS for PostgreSQL database to a Multi-AZ deployment.

E.Add an Amazon RDS read replica and point the application to the replica endpoint.

AnswersB, D

Spreading the application tier across multiple AZs preserves healthy capacity if one AZ fails and lets the load balancer keep serving requests.

Why this answer

Option B is correct because distributing EC2 instances across multiple Availability Zones via an Auto Scaling group ensures that if one AZ fails, the remaining AZs continue to serve traffic. Option D is correct because converting the RDS for PostgreSQL database to a Multi-AZ deployment provides a standby replica in a different AZ, enabling automatic failover and continued database availability during a single-AZ failure.

Exam trap

The trap here is that candidates often confuse a read replica (which is for read scaling and requires manual promotion) with a Multi-AZ standby (which provides automatic failover for high availability).

251

MCQmedium

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.

A.Aurora Global Database

B.A single-AZ Aurora cluster

C.An ElastiCache Redis replica

D.Manual snapshots copied monthly

AnswerA

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is designed for cross-Region disaster recovery with a typical RPO of 1 second and RTO of 1 minute, using storage-based replication that does not require custom scripts. It replicates data from a primary Region to up to five secondary Regions with minimal impact on database performance, meeting the low RPO requirement without operational overhead.
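Setting up a global database from an existing cluster involves two API calls, sketched here as plain argument dictionaries. The identifiers and ARN are placeholders; the first dictionary is intended for RDS `CreateGlobalCluster` in the primary Region and the second for `CreateDBCluster` in the secondary Region.

```python
def build_global_database_requests(global_id, primary_cluster_arn,
                                   secondary_cluster_id,
                                   engine="aurora-mysql"):
    """Build kwargs for create_global_cluster and create_db_cluster."""
    create_global = {
        "GlobalClusterIdentifier": global_id,
        # The existing cluster becomes the primary of the global database.
        "SourceDBClusterIdentifier": primary_cluster_arn,
    }
    create_secondary = {
        "DBClusterIdentifier": secondary_cluster_id,
        "Engine": engine,
        # Joining the global cluster makes this a replicated secondary.
        "GlobalClusterIdentifier": global_id,
    }
    return create_global, create_secondary


# Placeholder identifiers for illustration only.
create_global, create_secondary = build_global_database_requests(
    "tickets-global",
    "arn:aws:rds:us-east-1:123456789012:cluster:tickets-primary",
    "tickets-secondary",
)
```

The secondary cluster also needs at least one DB instance before it can serve reads or take over as primary.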

Exam trap

The trap here is that candidates may confuse cross-Region read replicas (which require manual promotion and have higher RPO) with Aurora Global Database, which provides managed cross-Region failover and lower RPO without custom scripts.

How to eliminate wrong answers

Option B is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, providing no disaster recovery across Regions. Option C is wrong because ElastiCache Redis is an in-memory cache, not a persistent database, and cannot serve as a primary data store for ticket bookings or provide cross-Region DR with low RPO. Option D is wrong because manual snapshots copied monthly result in an RPO of up to one month, which is far too high for fast disaster recovery, and the process requires custom scripting to automate cross-Region copy.

252

MCQmedium

Based on the exhibit, an administrator accidentally deleted data from Amazon RDS for PostgreSQL about 90 minutes ago. Which recovery approach best restores the database to the exact required point in time?

A.Restore the latest automated snapshot back onto the existing DB instance.

B.Restore the database to the specified point in time into a new DB instance.

C.Create a read replica and promote it after the deletion is noticed.

D.Enable Multi-AZ so the database can automatically undo application mistakes.

AnswerB

Point-in-time restore uses automated backups plus transaction logs to recreate the database at a specific moment. For accidental deletion, this is the correct RDS recovery method because it can recover the database to just before the bad change while preserving all legitimate data up to that point.

Why this answer

Amazon RDS for PostgreSQL supports Point-in-Time Recovery (PITR), which allows you to restore a DB instance to any second within the backup retention period, up to the latest restorable time, which is typically within the last five minutes. Since the deletion occurred approximately 90 minutes ago, you can restore to just before it by specifying the timestamp, and RDS will create a new DB instance from automated backups and transaction logs. This is the only option that recovers the exact state before the accidental deletion.

Exam trap

The trap here is that candidates confuse automated snapshots with point-in-time recovery, assuming a snapshot restore can target a specific time, when in fact snapshots are point-in-time captures and cannot replay transaction logs to reach an arbitrary second.

How to eliminate wrong answers

Option A is wrong because restoring the latest automated snapshot would recover data only up to the snapshot creation time, which could be hours or days before the deletion, not the exact point 90 minutes ago. Option C is wrong because creating a read replica replicates data asynchronously from the source; by the time the replica is promoted, it will already contain the deletion, and it cannot roll back to a prior point in time. Option D is wrong because Multi-AZ provides high availability through synchronous standby replication, but it does not protect against logical data corruption or accidental deletions; it cannot undo application mistakes.

253

Multi-Selectmedium

A serverless order-ingestion API writes directly to a database. During traffic spikes, the database occasionally throttles, Lambda retries create duplicate order records, and some requests time out. Which two changes best improve buffering and safe retry behavior? Select two.

Select 2 answers

A.Increase the Lambda timeout and keep writing directly to the database.

B.Put an Amazon SQS queue between the API and the database-processing function.

C.Replace SQS with SNS so every request is delivered immediately to all subscribers.

D.Make the database write idempotent by using a unique request token or order ID.

E.Disable retries so failed writes are never duplicated.

AnswersB, D

SQS buffers bursts and decouples producers from consumers, so writes reach the database at a steadier rate.

Why this answer

Option B is correct because inserting an SQS queue between the API Gateway and the Lambda function decouples the ingestion from the database write. During traffic spikes, SQS acts as a buffer, absorbing bursts and allowing the Lambda function to poll messages at a controlled rate, which prevents database throttling. Combined with a dead-letter queue, failed messages can be retried safely without overwhelming the database or creating duplicate records.
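Option D, idempotent writes, is what makes those retries safe. A minimal in-memory sketch of the idea follows, with a dictionary standing in for the database and the order ID acting as the unique key; in a real database this would typically be a conditional write or a unique constraint. `OrderStore` and `handle_message` are hypothetical names.

```python
class OrderStore:
    """In-memory stand-in for a table with a unique key on order_id."""

    def __init__(self):
        self._orders = {}

    def put_if_absent(self, order_id, order):
        """Insert the order once; return False if it already exists."""
        if order_id in self._orders:
            return False
        self._orders[order_id] = order
        return True

    def count(self):
        return len(self._orders)


def handle_message(store, message):
    """Process one queue message; duplicates become harmless no-ops."""
    return store.put_if_absent(message["order_id"], message["payload"])


store = OrderStore()
first = {"order_id": "o-1", "payload": {"sku": "A", "qty": 1}}
second = {"order_id": "o-2", "payload": {"sku": "B", "qty": 2}}
# The first message is delivered twice, as can happen after a retry.
results = [handle_message(store, m) for m in (first, second, first)]
```

The redelivered message is recognised by its order ID and skipped, so a retry never produces a second order record.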

Exam trap

The trap here is that candidates often confuse SNS (push-based, no buffering) with SQS (pull-based, buffering), and they overlook that idempotency is a complementary pattern to handle retries without duplicates, not a replacement for decoupling.

254

MCQmedium

A warehouse integration service receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?

A.AWS WAF

B.Amazon Route 53 weighted routing

C.Amazon SQS queue

D.Amazon CloudFront

AnswerC

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a durable, fully managed message queue that decouples the web tier from the fulfilment workers. It can absorb bursts of orders by storing messages durably, and workers can poll the queue at their own pace, with built-in retry logic via visibility timeouts and dead-letter queues to ensure no requests are lost.
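The buffering and retry behaviour can be illustrated with a small in-memory model. This is a simplified simulation, not the SQS API: a failed message goes back on the queue (as it would when its visibility timeout expires), and after `max_receive_count` failed receives it moves to a dead-letter list.

```python
from collections import deque


def drain(messages, process, max_receive_count=3):
    """Process queued messages, retrying failures up to a receive limit."""
    queue = deque((message, 0) for message in messages)
    done, dead_letter = [], []
    while queue:
        message, receives = queue.popleft()
        receives += 1
        try:
            process(message)
            done.append(message)
        except Exception:
            if receives >= max_receive_count:
                # Retries exhausted: isolate the message for inspection.
                dead_letter.append(message)
            else:
                # Becomes visible again for another attempt.
                queue.append((message, receives))
    return done, dead_letter


calls = {}


def flaky_worker(order):
    """Hypothetical worker: one order always fails, one fails once."""
    calls[order] = calls.get(order, 0) + 1
    if order == "order-bad":
        raise RuntimeError("fulfilment rejected the order")
    if order == "order-2" and calls[order] < 2:
        raise RuntimeError("fulfilment temporarily overloaded")


done, dead_letter = drain(["order-1", "order-2", "order-bad"], flaky_worker)
```

In this run, `order-2` succeeds on its second attempt, while `order-bad` is attempted three times and then lands in the dead-letter list instead of being lost.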

Exam trap

The trap here is that candidates may confuse load-balancing or caching services (like Route 53 or CloudFront) with message queuing, failing to recognize that only a durable queue like SQS provides the necessary buffering, decoupling, and retry semantics for asynchronous order processing.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules, not a queuing or buffering mechanism; it cannot absorb spikes or retry processing. Option B is wrong because Amazon Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights, but it does not provide durable storage or retry capabilities for individual requests. Option D is wrong because Amazon CloudFront is a content delivery network (CDN) that caches static and dynamic content at edge locations; it can reduce load on origins but cannot queue or retry individual order messages.

255

MCQmedium

Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?

A.Move the EC2 instances into Auto Scaling Groups that span private subnets in at least two AZs, keeping the ALB spanning those subnets.

B.Switch from RDS Single-AZ to RDS Multi-AZ, keeping the EC2 instances in only AZ-a because failover will still reach them.

C.Terminate the ALB and use a Network Load Balancer (NLB) in front of the existing single-AZ EC2 instances.

D.Add more EC2 instances in AZ-a and increase the ALB health check thresholds to avoid unnecessary replacements during impairments.

AnswerA

An Auto Scaling Group across multiple AZs ensures healthy capacity exists when an AZ becomes impaired, and the ALB can route to instances in any available AZ.

Why this answer

The correct answer is A because the current design has a single point of failure: all EC2 instances are in one Availability Zone (AZ-a). During an AZ impairment, those instances become unreachable, causing the checkout process to fail until the impairment ends or manual intervention occurs. By placing EC2 instances in an Auto Scaling Group spanning at least two AZs, the application can automatically recover by launching new instances in a healthy AZ, while the ALB distributes traffic across the surviving AZs.

This minimizes operational overhead as Auto Scaling handles instance replacement automatically.

Exam trap

The trap here is that candidates may focus on the database layer (Multi-AZ) or load balancer type (NLB vs ALB) and overlook the single-AZ EC2 instance placement, which is the actual single point of failure causing the prolonged recovery during an AZ impairment.

How to eliminate wrong answers

Option B is wrong because the RDS DB instance is already Multi-AZ (as stated in the question), so switching from Single-AZ to Multi-AZ is not a change; moreover, keeping EC2 instances in only AZ-a still leaves them vulnerable to an AZ impairment, as the ALB cannot route traffic to a healthy AZ if no instances exist there. Option C is wrong because replacing the ALB with an NLB does not address the root cause: the EC2 instances are still confined to a single AZ. In addition, an NLB operates at Layer 4 and lacks the Layer 7 features an ALB provides, such as path-based and host-based routing, which the ecommerce application may depend on. Option D is wrong because adding more EC2 instances in AZ-a only increases capacity within the same failing AZ, and increasing health check thresholds delays the detection of unhealthy instances, prolonging recovery time rather than improving it.

256

MCQeasy

An internal API is hosted in two AWS Regions behind Route 53. Under normal conditions, clients should use the primary region. If the primary endpoint becomes unhealthy, traffic must automatically switch to the secondary region. Which Route 53 setup best meets this requirement?

A.Use latency-based routing with one record per region and no health checks.

B.Use failover routing policy: create two alias records for the same name (primary and failover) and associate health checks with the primary record.

C.Use weighted routing and manually change the weights during incidents.

D.Create a single alias record only for the primary region and rely on client-side DNS retries.

AnswerB

Route 53 failover routing is designed for deterministic primary/secondary switching based on health check status. When the primary health check fails, Route 53 automatically returns the secondary region endpoint.

Why this answer

Route 53 failover routing policy is designed for active-passive failover scenarios. By creating two alias records (primary and secondary) for the same DNS name and associating a health check with the primary record, Route 53 automatically directs traffic to the secondary region if the primary health check fails. This meets the requirement of automatic failover without manual intervention.
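As a rough sketch of the configuration described above, the two records can be expressed in the shape that the Route 53 ChangeResourceRecordSets API expects. The helper names `failover_record` and `failover_change_batch`, along with every domain name, hosted zone ID, and health check ID used with them, are placeholders assumed for illustration. The code only builds the request body; it makes no AWS call.

```python
def failover_record(name, role, alias_dns_name, alias_hosted_zone_id,
                    health_check_id=None):
    """Build one failover alias record as a ResourceRecordSet-shaped dict."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),  # must differ between the two records
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_hosted_zone_id,
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name, primary_alias, secondary_alias,
                          primary_health_check_id):
    """Build a change batch; each alias is a (dns_name, hosted_zone_id) tuple."""
    primary = failover_record(name, "PRIMARY", *primary_alias,
                              health_check_id=primary_health_check_id)
    secondary = failover_record(name, "SECONDARY", *secondary_alias)
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ]
    }
```

With boto3, a batch like this would typically be passed as the `ChangeBatch` argument of `change_resource_record_sets`, together with the hosted zone that owns the record name.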

Exam trap

The trap here is that candidates often confuse failover routing with latency-based routing, assuming latency routing inherently handles failover, but latency routing does not automatically switch traffic when an endpoint becomes unhealthy unless health checks are explicitly configured.

How to eliminate wrong answers

Option A is wrong because latency-based routing distributes traffic based on lowest latency, not active-passive failover, and without health checks it cannot detect endpoint failures. Option C is wrong because weighted routing requires manual weight changes during incidents, which violates the requirement for automatic failover. Option D is wrong because a single alias record with no secondary endpoint provides no failover capability; client-side DNS retries do not redirect to a different region.

257

Multi-Selectmedium

A company is designing a multi-tier web application on AWS that must be resilient to the failure of an entire AWS Region. The application uses Amazon Route 53, an Application Load Balancer, EC2 instances, and Amazon RDS. Which three design choices support a multi-Region resilient architecture? (Choose three.)

Select 3 answers

.Use Route 53 latency-based routing to direct users to the closest healthy region.

.Configure Route 53 with a failover routing policy and health checks on the application endpoints.

.Deploy the application stack in two separate AWS Regions and use an active-passive setup.

.Use Amazon RDS Cross-Region read replicas to keep the standby region database up-to-date.

.Store all application state in a single Amazon ElastiCache cluster in the primary region.

.Place the Application Load Balancer in a single region but use it across multiple Availability Zones.

Why this answer

Route 53 failover routing with health checks is correct because it allows DNS to automatically route traffic away from a failed primary region to a standby region, which is essential for multi-Region resilience. Deploying the application stack in two separate AWS Regions with an active-passive setup is correct because it ensures that if the primary region fails, the passive region can take over, providing regional fault isolation. Amazon RDS Cross-Region read replicas are correct because they asynchronously replicate the primary database to the standby region, where a replica can be promoted to a standalone primary database in a disaster recovery scenario.

Exam trap

AWS often tests the misconception that latency-based routing is a failover policy. Latency-based routing returns the lowest-latency endpoint for each user; it does not designate a primary and a standby region, so it does not implement health-check-driven failover to a specific standby region.

258

MCQeasy

An orders service currently sends HTTP requests directly to two downstream services (inventory and shipping). During peak load, inventory slows down, causing the orders service to slow as well. The team wants the orders service to remain responsive even when a downstream service is temporarily slow or restarted. Which design change best achieves this resiliency goal?

A.Keep HTTP calls but add longer client timeouts so orders requests wait for slow downstream responses.

B.Introduce Amazon SQS as a buffer between orders and downstream services, with consumers processing from the queue.

C.Replace the downstream services with AWS Lambda functions that are invoked synchronously by the orders service.

D.Call the downstream services in parallel threads to reduce waiting time during peak load.

AnswerB

SQS decouples the producer (orders service) from the consumers (inventory/shipping processors). The orders service can quickly enqueue work and return to the caller, even if a downstream service is slow or restarted. Messages remain in the queue until consumers can process them, preventing cascading latency/backpressure from propagating to the orders API.

Why this answer

Option B is correct because introducing Amazon SQS as a buffer decouples the orders service from the downstream inventory and shipping services. The orders service can immediately enqueue messages and respond to the client, while downstream consumers process messages at their own pace. This prevents backpressure from a slow or restarting downstream service from blocking the orders service, achieving the desired resiliency.
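The decoupling described above can be sketched with an in-process analogy. This is an illustration under assumptions, not AWS code: a `deque` stands in for the SQS queue, and `OrdersService` and `InventoryWorker` are hypothetical classes. The point is that the request path only enqueues, so an unavailable consumer produces a backlog instead of a slow response.

```python
from collections import deque


class OrdersService:
    """Producer: accepts an order by enqueueing it, with no downstream call."""

    def __init__(self, queue):
        self.queue = queue  # stands in for the SQS queue

    def place_order(self, order_id):
        self.queue.append(order_id)
        return {"order_id": order_id, "status": "accepted"}


class InventoryWorker:
    """Consumer: drains the queue at its own pace when it is available."""

    def __init__(self, queue):
        self.queue = queue
        self.available = True
        self.processed = []

    def poll(self, max_messages=10):
        """Process up to max_messages; leave everything queued while unavailable."""
        if not self.available:
            return 0
        handled = 0
        while self.queue and handled < max_messages:
            self.processed.append(self.queue.popleft())
            handled += 1
        return handled
```

While `available` is false, `place_order` keeps returning immediately and the backlog grows; once the worker recovers, repeated `poll` calls work through it in order.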

Exam trap

The trap here is that candidates may think parallelizing calls (Option D) or increasing timeouts (Option A) solves the problem, but they fail to recognize that true resiliency requires decoupling via asynchronous messaging, not just concurrency or tolerance of delays.

How to eliminate wrong answers

Option A is wrong because adding longer client timeouts does not prevent the orders service from being blocked; it only increases the wait time before a timeout occurs, still causing the orders service to slow down during peak load. Option C is wrong because replacing downstream services with synchronously invoked Lambda functions does not decouple the services; the orders service would still block until each invocation completes, so its response time remains tied to the downstream function's duration. Option D is wrong because calling downstream services in parallel threads reduces latency only if both services are responsive; if one service is slow or restarting, the orders service still waits for that slow response, and thread pool exhaustion can occur under peak load, leading to resource contention and slowdown.

259

MCQmedium

Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?

A.Enable S3 versioning on the bucket and configure cross-Region replication so previous versions are available after regional loss and accidental deletion.

B.Move all objects to S3 Glacier Instant Retrieval and apply a lifecycle policy to keep only the latest object copy.

C.Use S3 server-side encryption with KMS keys and rely on access logs to manually recover the deleted objects.

D.Enable S3 bucket policies that deny DeleteObject, but do not enable versioning or replication.

AnswerA

Versioning retains prior object versions, and cross-Region replication provides redundancy across Regions for recovery after deletion or disaster.

Why this answer

Option A is correct because enabling S3 Versioning preserves every version of an object. A delete adds a delete marker rather than removing data, and an overwrite creates a new version, so earlier versions remain recoverable after both accidental deletions and overwrites. Cross-Region Replication (CRR), which requires versioning on both the source and destination buckets, copies new object versions to a bucket in a secondary region, providing protection against regional disasters. Together, these settings keep the original versions recoverable even if objects are deleted and later overwritten.
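To make the versioning behaviour concrete, here is a toy model of a versioning-enabled bucket. It is a sketch, not the S3 API: `VersionedBucket` and its methods are hypothetical, and the `v1`, `v2` style identifiers are simplified stand-ins for S3's opaque version IDs. It shows that a delete only adds a marker and an overwrite only adds a version, so the original bytes stay retrievable by version ID.

```python
class VersionedBucket:
    """Hypothetical model: writes and deletes append versions, never replace them."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, payload), oldest first
        self._next_id = 1

    def _append(self, key, payload):
        version_id = f"v{self._next_id}"
        self._next_id += 1
        self._versions.setdefault(key, []).append((version_id, payload))
        return version_id

    def put(self, key, data):
        """Store a new version; earlier versions of the key are kept."""
        return self._append(key, data)

    def delete(self, key):
        """A simple delete adds a delete marker; earlier versions stay stored."""
        return self._append(key, self.DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the latest version, or a specific one when version_id is given."""
        versions = self._versions.get(key, [])
        if version_id is None:
            if not versions or versions[-1][1] is self.DELETE_MARKER:
                raise KeyError(key)  # latest entry is a delete marker
            return versions[-1][1]
        for vid, payload in versions:
            if vid == version_id and payload is not self.DELETE_MARKER:
                return payload
        raise KeyError((key, version_id))
```

In this model, recovering a deleted and later overwritten upload is a `get` with the original version ID; without versioning there would be only one slot per key and nothing to go back to.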

Exam trap

The trap here is that candidates often assume S3 bucket policies or encryption alone can protect against deletion, but only versioning preserves object history, and only replication provides regional disaster recovery.

How to eliminate wrong answers

Option B is wrong because moving objects to S3 Glacier Instant Retrieval does not provide versioning or replication, so deleted objects cannot be restored and there is no protection against regional disasters. Option C is wrong because S3 server-side encryption with KMS keys does not preserve deleted or overwritten objects; access logs only record events, not the data itself, making manual recovery impossible. Option D is wrong because a bucket policy denying DeleteObject does not stop an overwrite from replacing an object's contents, and without versioning or replication there are no previous versions to restore and no copy in another region for disaster recovery.

260

MCQmedium

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The design must avoid adding custom operational scripts.

A.Lambda reserved concurrency set to zero

B.A larger deployment package

C.CloudFront error pages

D.A Lambda dead-letter queue or failure destination

AnswerD

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

A Lambda dead-letter queue (DLQ) or failure destination is the correct solution because it captures events that have exhausted all retry attempts from an asynchronous Lambda invocation. This allows failed events to be retained in an Amazon SQS queue or SNS topic for later investigation, without requiring custom operational scripts. The DLQ or failure destination integrates directly with Lambda's built-in retry behavior, ensuring that only events that fail after the configured number of retries are sent to the destination.
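As a sketch of what such a configuration might look like, the helper below builds the arguments for Lambda's asynchronous invocation settings, in the shape used by boto3's `put_function_event_invoke_config`. The helper name `failure_destination_config`, the function name, and the ARN passed to it are placeholders assumed for illustration, and no AWS call is made.

```python
def failure_destination_config(function_name, destination_arn,
                               max_retries=2, max_event_age_seconds=3600):
    """Build async-invocation settings that route exhausted events to a destination."""
    # Lambda allows 0 to 2 retries for asynchronous invocations.
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    # Events can be kept for retry for 60 seconds up to 6 hours.
    if not 60 <= max_event_age_seconds <= 21600:
        raise ValueError("max_event_age_seconds must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_seconds,
        "DestinationConfig": {
            # Placeholder ARN, e.g. an SQS queue that holds failed events.
            "OnFailure": {"Destination": destination_arn},
        },
    }
```

Because the retry count and the destination are part of the function's own configuration, no separate script is needed to collect failed events; they arrive in the destination once retries are exhausted.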

Exam trap

The trap here is that candidates may confuse a DLQ with other error-handling mechanisms like CloudFront error pages or reserved concurrency, but only a DLQ or failure destination directly captures failed asynchronous Lambda events without custom code.

How to eliminate wrong answers

Option A is wrong because setting reserved concurrency to zero would prevent the Lambda function from executing at all, which stops all invocations and does not retain failed events. Option B is wrong because a larger deployment package does not affect error handling or event retention; it only increases the function's storage size and cold start time. Option C is wrong because CloudFront error pages are used for HTTP error responses from a web distribution, not for capturing failed Lambda invocations from asynchronous event sources.

261

MCQmedium

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.

A.A single EC2 instance with detailed monitoring

B.Subnets in at least two Availability Zones with health checks enabled

C.All instances in one larger subnet

D.A Network Load Balancer in one subnet

AnswerB

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option B is correct because distributing EC2 instances across subnets in at least two Availability Zones (AZs) ensures that the Auto Scaling group can maintain capacity even if one AZ fails. Enabling Elastic Load Balancing health checks on the Auto Scaling group lets it use the Application Load Balancer's (ALB) target health to automatically replace unhealthy instances without custom scripts, meeting the fault-tolerance requirement.
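A minimal sketch of such a configuration follows, expressed as the keyword arguments that boto3's `create_auto_scaling_group` accepts. The helper name `multi_az_asg_params`, the subnet IDs, AZ names, launch template name, and target group ARN are all placeholders assumed for illustration. The helper refuses to build a configuration whose subnets sit in fewer than two AZs.

```python
def multi_az_asg_params(name, launch_template_name, subnet_azs,
                        target_group_arn, min_size=2, max_size=6):
    """Build Auto Scaling group settings spanning at least two AZs.

    subnet_azs maps each subnet ID to the Availability Zone it belongs to.
    """
    if len(set(subnet_azs.values())) < 2:
        raise ValueError("subnets must cover at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs; the group places instances across them.
        "VPCZoneIdentifier": ",".join(sorted(subnet_azs)),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's health status, not only EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }
```

Keeping `MinSize` at two or more means the group tries to hold capacity in more than one AZ during normal operation, so one AZ failing does not leave the dashboard with zero instances.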

Exam trap

The trap here is that candidates often confuse using a load balancer (like an NLB) with achieving AZ redundancy, but without multi-AZ subnets in the Auto Scaling group, the architecture remains single-AZ and vulnerable to failure.

How to eliminate wrong answers

Option A is wrong because a single EC2 instance, even with detailed monitoring, cannot tolerate an AZ failure—if that AZ goes down, the instance is lost. Option C is wrong because placing all instances in one larger subnet confines them to a single AZ, providing no redundancy against AZ failure. Option D is wrong because a Network Load Balancer (NLB) in one subnet does not inherently distribute across AZs; the Auto Scaling group must span multiple AZs, and the NLB alone does not replace the need for multi-AZ instance placement.

262

MCQhard

Based on the exhibit, the current disaster recovery design misses the RTO target even though the database replica is current. Which deployment model best meets the requirements with the least always-on cost?

A.Pilot light, because only the database needs to be running in the secondary Region.

B.Warm standby, because a scaled-down application stack stays running in the secondary Region and can take over faster.

C.Active-active, because both Regions should always serve traffic to guarantee the RTO.

D.Backup and restore, because restoring from backups is the least expensive DR model available.

AnswerB

Warm standby is the best fit when you need faster recovery than pilot light but do not want the cost of full active-active capacity. The exhibit shows that starting the application stack from zero consumes most of the recovery time. Keeping a reduced but functional stack running in the secondary Region removes that startup delay and should bring the total recovery time within the 15-minute RTO while still keeping always-on cost below full production duplication.

Why this answer

Warm standby is the correct choice because it keeps a scaled-down application stack running in the secondary Region, which can be scaled up quickly to handle production traffic. This design meets the RTO target by reducing failover time compared to a pilot light, while avoiding the higher always-on cost of an active-active deployment.

Exam trap

The trap here is that candidates confuse pilot light with warm standby, assuming that only the database needs to be running to meet the RTO, but they overlook the time required to provision the application stack on failover.

How to eliminate wrong answers

Option A is wrong because a pilot light only keeps the database running and requires provisioning the full application stack on failover, which cannot meet the RTO target. Option C is wrong because active-active runs full application stacks in both Regions at all times, incurring higher always-on cost than necessary. Option D is wrong because backup and restore requires rebuilding the environment from backups (e.g., RDS snapshots or S3), which takes too long to meet the RTO target; it has the lowest always-on cost, but it does not meet the requirement.

263

MCQeasy

A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?

A.Use a single Route 53 A record that points only to Region A’s ALB and manually update it after failures.

B.Use Route 53 latency-based routing with separate records for each region.

C.Use Route 53 failover routing with health checks for each region’s endpoint.

D.Use weighted routing and set the Region B weight to 0 to ensure it is only used when needed.

AnswerC

Failover routing works with health checks to move traffic from a primary endpoint to a secondary endpoint when the primary becomes unhealthy.

Why this answer

Route 53 failover routing with health checks is the correct choice because it automatically directs traffic to the secondary (Region B) endpoint when the primary (Region A) endpoint fails a health check. This provides automated DNS failover with minimal configuration effort, as Route 53 monitors the health of each ALB endpoint and updates DNS responses accordingly.

Exam trap

The trap here is that candidates often confuse latency-based routing with failover routing, assuming latency routing will automatically redirect traffic away from an unhealthy region. Latency records with no health checks attached have no health awareness, so Route 53 keeps returning a down endpoint if it offers the lowest latency.

How to eliminate wrong answers

Option A is wrong because manually updating a single A record after a failure is not automated, which contradicts the requirement for automatic routing to Region B; it also introduces significant downtime during the manual update window. Option B is wrong because latency-based routing selects the record with the lowest latency, not a primary and a secondary; as described, with no health checks attached, it would still send traffic to Region A if it has lower latency, even if it is down. Option D is wrong because a weight of 0 keeps traffic away from Region B under normal conditions, and the option attaches no health checks, so Route 53 has no signal to stop returning Region A when it fails; failover routing is the purpose-built policy for this primary/secondary pattern.

264

MCQeasy

A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?

A.Backup and restore only, with no continuously running infrastructure in the secondary Region.

B.Pilot light, keeping only the minimum resources needed to bootstrap the environment.

C.Warm standby, keeping a reduced but ready-to-scale environment in the secondary Region.

D.Multi-site active-active, serving production traffic from both Regions at all times.

AnswerC

Warm standby maintains enough infrastructure to reduce recovery time, while not fully running production capacity continuously.

Why this answer

Warm standby is the correct strategy because it maintains a scaled-down but fully functional copy of the production environment in the secondary Region, which can be scaled up within the 6-hour RTO. The RPO of 1 hour is met by continuous replication (e.g., Amazon RDS cross-Region read replicas or DynamoDB global tables), and the reduced footprint avoids the cost of full production capacity while still enabling rapid failover.

Exam trap

The trap here is that candidates confuse pilot light with warm standby, assuming that any minimal running infrastructure qualifies as pilot light. Warm standby specifically requires a scaled-down but fully functional environment that can begin serving traffic right away and then scale up, whereas pilot light requires significant provisioning before it can serve traffic.

How to eliminate wrong answers

Option A is wrong because backup and restore keeps nothing running in the secondary Region, so infrastructure must be provisioned and data restored before any traffic can be served, which conflicts with the requirement to start serving quickly; the achievable RPO also depends on how often backups are taken and copied to the secondary Region. Option B is wrong because pilot light keeps only the minimal core resources (typically replicated data, with application servers not running), so the application tier must be provisioned and scaled after failover before it can serve traffic, which is slower to start than warm standby. Option D is wrong because multi-site active-active runs full production capacity in both Regions at all times, which violates the requirement to not run full production capacity continuously and incurs unnecessary cost.
