Knowledge + Practice

CCNA Design Resilient Architectures Questions

75 of 264 questions · Page 1/4 · Design Resilient Architectures · Answers revealed

1

Multi-Select · medium

A production Amazon RDS database already has automated backups enabled. At 10:45 UTC, the team discovers that a faulty migration corrupted rows in a table at 10:30 UTC. The business wants the database restored to exactly the state it had at 10:30 UTC with minimal risk. Which two actions should the team take? Select two.

Select 2 answers

A.Restore the database to a new instance using point-in-time restore for 10:30 UTC.

B.Validate the restored database, then switch the application endpoint to the restored database.

C.Restore the most recent manual snapshot because it will include the 10:30 UTC state.

D.Overwrite the existing database instance in place so the application keeps the same storage volume.

E.Wait for automated backups to complete again, then replay the migration to restore the missing rows.

Answers: A, B

Correct. Point-in-time restore is the RDS recovery method for returning to a specific moment before the corruption occurred. Restoring to a new instance gives the team a clean database copy at the desired timestamp without risking the current production instance.

Why this answer

Option A is correct because Amazon RDS Point-in-Time Restore (PITR) allows you to restore a DB instance to any second within the backup retention period, including 10:30 UTC. This uses automated backups and transaction logs to reconstruct the exact database state at that specific time, providing a precise recovery point with minimal data loss.
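
As a minimal sketch of the restore step, the snippet below only builds the request parameters for a point-in-time restore to a new instance; it makes no AWS call. The instance identifiers and the calendar date are hypothetical, and only the 10:30 UTC time comes from the scenario. The parameter names are assumed to match boto3's `restore_db_instance_to_point_in_time`.

```python
from datetime import datetime, timezone


def build_pitr_request(source_id: str, target_id: str, restore_time: datetime) -> dict:
    """Build keyword arguments for an RDS point-in-time restore to a NEW instance."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware (UTC)")
    if source_id == target_id:
        # PITR creates a new instance; it does not overwrite the source in place.
        raise ValueError("target must be a new instance identifier")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time.astimezone(timezone.utc),
    }


# Hypothetical identifiers and date; only the 10:30 UTC time is from the question.
params = build_pitr_request(
    "orders-prod",
    "orders-prod-restored-1030",
    datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(params)
# With boto3 installed and credentials configured, these could be passed as:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**params)
```

Refusing a target identifier equal to the source mirrors the reasoning in the answer: the restored copy is validated separately before any traffic is switched to it.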

Exam trap

The trap here is that candidates may think manual snapshots can be used for point-in-time recovery, but they only capture a single moment and cannot roll forward to a specific time like automated backups can.

2

MCQ · easy

Based on the exhibit, the web team wants the application to continue serving traffic if one Availability Zone fails. Which change best meets the requirement with the least operational overhead?

A.Increase desired capacity to 3 in the same Availability Zone so one extra instance is always available.

B.Add the unused subnet in us-east-1b to the Auto Scaling group so instances can launch in both AZs.

C.Replace the Application Load Balancer with a Network Load Balancer because it will automatically keep the app online.

D.Move the application to a larger EC2 instance type so a single server can handle the full workload.

Answer: B

Placing the Auto Scaling group in at least two Availability Zones allows AWS to distribute and replace instances across zones. Because the Application Load Balancer can route only to healthy targets, adding the second subnet is the lowest-complexity change that gives the application resilience to a full AZ outage.

Why this answer

Option B is correct because it adds the unused subnet in us-east-1b to the Auto Scaling group, enabling EC2 instances to launch across two Availability Zones. This provides fault isolation: if one AZ fails, the ALB can route traffic to healthy instances in the other AZ. The change requires only a configuration update to the Auto Scaling group, minimizing operational overhead while meeting the high-availability requirement.
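
The configuration change can be sketched as follows, assuming the Auto Scaling group's subnets are expressed as the comma-separated `VPCZoneIdentifier` string that boto3's `update_auto_scaling_group` accepts. The subnet IDs and group name are hypothetical because the exhibit's values are not shown, and the snippet only builds the request.

```python
def add_subnet(vpc_zone_identifier: str, new_subnet_id: str) -> str:
    """Return a comma-separated subnet list that also contains new_subnet_id."""
    subnets = [s.strip() for s in vpc_zone_identifier.split(",") if s.strip()]
    if new_subnet_id not in subnets:
        subnets.append(new_subnet_id)
    return ",".join(subnets)


# Hypothetical values: the exhibit's real subnet IDs and group name are unknown.
current = "subnet-0aaa1111"                       # existing subnet in the first AZ
updated = add_subnet(current, "subnet-0bbb2222")  # unused subnet in us-east-1b

update_request = {
    "AutoScalingGroupName": "web-asg",
    "VPCZoneIdentifier": updated,
}
print(update_request)
# With boto3: boto3.client("autoscaling").update_auto_scaling_group(**update_request)
```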

Exam trap

The trap here is that candidates often assume increasing instance count in a single AZ or using a different load balancer type alone provides high availability, but true resilience requires distributing instances across multiple Availability Zones.

How to eliminate wrong answers

Option A is wrong because increasing desired capacity to 3 in the same Availability Zone does not protect against an AZ failure; all instances remain in a single AZ, so if that AZ fails, all traffic is lost. Option C is wrong because replacing the Application Load Balancer with a Network Load Balancer does not inherently provide cross-AZ failover; the NLB still requires instances in multiple AZs to maintain availability, and the change introduces unnecessary operational overhead. Option D is wrong because moving to a larger EC2 instance type does not eliminate the single point of failure; if the AZ hosting that single instance fails, the application goes down regardless of instance size.

3

MCQ · medium

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The architecture review board prefers a managed AWS-native control.

A.A single-AZ Aurora cluster

B.Aurora Global Database

C.Manual snapshots copied monthly

D.An ElastiCache Redis replica

Answer: B

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is the correct choice because it provides a managed, cross-Region disaster recovery solution with a Recovery Point Objective (RPO) of less than 1 second and a Recovery Time Objective (RTO) of typically less than 1 minute. It uses storage-based replication to keep a secondary cluster in another AWS Region up to date with minimal latency, meeting the low RPO requirement without manual intervention.

Exam trap

The trap here is that candidates may confuse cross-Region replication with multi-AZ deployments, or assume that manual snapshots or caching solutions can meet low RPO requirements, when only a managed global database service like Aurora Global Database provides the necessary sub-second RPO and managed cross-Region failover.

How to eliminate wrong answers

Option A is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, offering no disaster recovery across Regions and resulting in an unacceptably high RPO if the primary Region fails. Option C is wrong because manual snapshots copied monthly provide an RPO of up to one month, which is far too high for the low RPO requirement, and the process is not automated or managed natively for rapid recovery. Option D is wrong because ElastiCache Redis is an in-memory cache, not a persistent relational database, and cannot serve as the primary data store for the trading dashboard's transactional data; replicating a cache does nothing to protect the Aurora data in another Region.

4

MCQ · medium

A payments platform requires disaster recovery across Regions. Requirements: RPO of 15 minutes and RTO of about 1 hour. The business cannot afford full duplicate capacity in both Regions all the time, but the team wants automated readiness so failover is mostly operationally guided rather than a slow rebuild. Which DR strategy is the best fit?

A.Backup and restore only, relying on scheduled snapshots and manual restores during incidents.

B.Pilot light, keeping only minimal infrastructure in the secondary Region and starting full services after failover.

C.Warm standby, keeping core infrastructure and a partially provisioned environment ready in the secondary Region with frequent data replication.

D.Active/active, routing production traffic to both Regions continuously and accepting dual-region complexity.

Answer: C

Warm standby balances cost and readiness by keeping enough capacity and services running to shorten recovery time while meeting RPO needs.

Why this answer

Warm standby is the best fit because it maintains a partially provisioned environment in the secondary Region with core infrastructure (e.g., a smaller EC2 Auto Scaling group and a standby database fed by asynchronous cross-Region replication) and frequent data replication, enabling an RPO of 15 minutes and an RTO of about 1 hour. This approach balances cost and automated readiness: the team scales up the standby environment during failover instead of paying for full duplicate capacity, while continuous replication (e.g., Amazon RDS cross-Region read replicas or DynamoDB global tables) keeps the data within the recovery objectives.

Exam trap

The trap here is that candidates often confuse pilot light with warm standby, assuming minimal infrastructure is sufficient for a 1-hour RTO, but pilot light's need to provision and configure full services after failover typically pushes RTO beyond 1 hour, whereas warm standby's partially provisioned environment allows faster scaling.

How to eliminate wrong answers

Option A is wrong because backup and restore with scheduled snapshots cannot achieve an RPO of 15 minutes (snapshots are typically taken every few hours) and manual restores would far exceed the 1-hour RTO due to data transfer and restoration time. Option B is wrong because pilot light keeps only minimal infrastructure (e.g., a small database and no application servers) and requires starting full services after failover, which would likely exceed the 1-hour RTO due to provisioning and configuration time. Option D is wrong because active/active requires full duplicate capacity in both Regions all the time, which the business explicitly cannot afford, and introduces unnecessary complexity and cost for a scenario where failover is only occasional.

5

MCQ · easy

An internal worker consumes messages from an Amazon SQS queue. Occasionally, a message fails validation in the worker (for example, missing required fields). Reprocessing the same bad message repeatedly wastes processing time and delays healthy messages. What is the best AWS approach to handle these poison messages without blocking the rest of the queue?

A.Configure an SQS dead-letter queue (DLQ) using a redrive policy with a maxReceiveCount.

B.Delete the SQS queue and recreate it daily to clear invalid messages.

C.Increase the consumer timeout/processing time so validation failures take longer to occur.

D.Use SNS fan-out without any DLQ and rely only on application retries.

Answer: A

With a redrive policy, SQS continues delivering the message to consumers until it has been received unsuccessfully maxReceiveCount times. After that threshold, SQS moves the poison message to a DLQ, isolating it from the main processing flow so healthy messages can continue being processed.

Why this answer

Option A is correct because an SQS dead-letter queue (DLQ) with a redrive policy that sets a maxReceiveCount allows the worker to process a message up to a specified number of times. After that threshold is exceeded, the message is automatically moved to the DLQ, isolating the poison message and preventing it from blocking or delaying the processing of healthy messages in the main queue.
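
A redrive policy is attached to the source queue as a JSON-encoded `RedrivePolicy` attribute. The sketch below only builds that attribute; the DLQ ARN and the threshold of 5 are hypothetical values chosen for illustration, not recommendations.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects RedrivePolicy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN and threshold, used only for illustration.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:worker-dlq", 5)
print(attrs["RedrivePolicy"])
# With boto3: boto3.client("sqs").set_queue_attributes(QueueUrl=..., Attributes=attrs)
```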

Exam trap

The trap here is that candidates may think increasing timeouts or relying on application retries alone can solve the problem, but they fail to recognize that only a DLQ with a redrive policy provides automatic, queue-level isolation of poison messages without blocking healthy message processing.

How to eliminate wrong answers

Option B is wrong because deleting and recreating the queue daily is disruptive, causes data loss of all messages (including valid ones), and does not provide a targeted mechanism to isolate only the poison messages. Option C is wrong because increasing the consumer timeout or processing time does not prevent validation failures; it only delays the retry cycle and does not remove the bad message from the queue, so it will still be reprocessed and waste resources. Option D is wrong because SNS fan-out changes how messages are distributed, not how failures are handled: the worker still consumes from the SQS queue, so without a DLQ the poison message keeps being redelivered and reprocessed until it expires, and application retries alone provide no automatic isolation mechanism.

6

MCQ · easy

A production application uses an Amazon RDS Multi-AZ DB instance. During an unplanned failover, the database endpoint remains the same. What change should the application team make to handle the failover reliably?

A.Hard-code the new writer instance IP address after failover completes.

B.Keep using the same RDS endpoint and implement connection retry logic on failures.

C.Disable Multi-AZ and rely on manual intervention to switch endpoints.

D.Move reads to application-side caching only, and avoid reopening DB connections.

Answer: B

For RDS Multi-AZ, the DB endpoint is designed to remain consistent. During failover, in-flight connections may drop, so the application should treat connection/transaction errors as transient and reconnect with retry (for example, exponential backoff).

Why this answer

Option B is correct because the RDS Multi-AZ DNS endpoint remains unchanged during a failover, automatically pointing to the new writer instance. Implementing connection retry logic with exponential backoff allows the application to handle the brief DNS propagation delay and connection interruption, ensuring reliable recovery without manual intervention.
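
One way to sketch the retry logic is a small helper that retries transient errors with exponential backoff and jitter. The helper name, the delay values, and the `flaky_query` stand-in are illustrative assumptions; in a real application the operation would open a fresh connection to the unchanged RDS endpoint on every attempt.

```python
import random
import time


def call_with_retry(operation, max_attempts=5, base_delay=0.2, max_delay=5.0,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Run operation(), retrying transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


# Stand-in for "connect to the same endpoint and run a query" during a failover.
attempts = {"count": 0}


def flaky_query():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("connection dropped during failover")
    return "ok"


result = call_with_retry(flaky_query, sleep=lambda seconds: None)  # skip real sleeps
print(result, attempts["count"])  # prints: ok 3
```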

Exam trap

The trap here is that candidates assume the endpoint changes or that Multi-AZ provides seamless failover without any application-side changes, but in reality the application must implement retry logic to handle the brief connection disruption during DNS propagation.

How to eliminate wrong answers

Option A is wrong because hard-coding the new writer instance IP address is impractical and error-prone; the IP address can change after failover, and this approach bypasses the automatic DNS update provided by Multi-AZ. Option C is wrong because disabling Multi-AZ removes high availability entirely, forcing manual endpoint switching which increases downtime and violates the goal of reliable failover handling. Option D is wrong because moving reads to application-side caching does not address the need to re-establish the database connection after failover; the application must still handle connection failures and retries for writes.

7

MCQ · easy

Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?

A.Increase the desired capacity to 8 instances in the same subnet.

B.Add a subnet in another Availability Zone to the Auto Scaling group and keep the ALB spanning both AZs.

C.Replace the Application Load Balancer with a Network Load Balancer.

D.Move the instances to a larger instance type with more CPU and memory.

Answer: B

This places application instances across multiple Availability Zones, which protects the stateless tier from a single-AZ failure. The ALB already spans two AZs, so the missing piece is the Auto Scaling group using subnets in more than one AZ. That allows AWS to replace unhealthy instances and continue serving traffic from the surviving Zone.

Why this answer

Adding a subnet in another Availability Zone (AZ) to the Auto Scaling group and keeping the ALB spanning both AZs ensures that if one AZ fails, the ALB can route traffic to healthy instances in the other AZ. This is the standard pattern for building multi-AZ resilient architectures with Auto Scaling and ALB, as it eliminates the single point of failure at the AZ level.

Exam trap

The trap here is that candidates often think increasing instance count or size improves resilience, but without multi-AZ distribution, all instances remain vulnerable to a single AZ failure.

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity to 8 instances in the same subnet does not protect against an AZ failure; all instances remain in a single AZ, so if that AZ goes down, all instances become unavailable. Option C is wrong because replacing the ALB with a Network Load Balancer does not inherently improve resilience against AZ failure; both ALB and NLB support multi-AZ deployments, but the issue is the lack of cross-AZ instance distribution, not the load balancer type. Option D is wrong because moving to a larger instance type with more CPU and memory addresses performance scaling, not availability; it does not protect against an AZ outage.

8

MCQ · medium

An order-processing service consumes messages from an Amazon SQS Standard queue using a custom worker. During traffic spikes, the worker occasionally times out after performing some work but before acknowledging the message, so SQS redelivers it and it may be processed again. You also observe that a small set of “poison” messages always fail validation. What change most directly improves resilience by (1) preventing poison messages from retrying indefinitely and (2) avoiding duplicate side effects caused by legitimate retries?

A.Increase the SQS visibility timeout and, when validation fails, call DeleteMessage in the consumer to remove the message immediately.

B.Move to SNS topics with subscriptions and rely on SNS to provide exactly-once delivery to eliminate duplicates automatically.

C.Configure a dead-letter queue (DLQ) with a redrive policy that moves messages after maxReceiveCount, and implement idempotent processing in the consumer using an idempotency key.

D.Change the queue to FIFO and enable content-based deduplication, leaving the consumer logic unchanged.

Answer: C

SQS Standard is at-least-once delivery, so timeouts can cause redelivery and duplicates. A DLQ with a redrive policy prevents poison messages from retrying forever by moving them after repeated failures. Idempotent processing (for example, storing a processed marker in a database with conditional logic keyed by an idempotency key) prevents duplicate side effects when retries occur for valid messages.

Why this answer

Option C is correct because a dead-letter queue (DLQ) with a maxReceiveCount redrive policy directly addresses the poison message problem by moving messages that repeatedly fail validation out of the main queue after a set number of retries, preventing indefinite retries. Implementing idempotent processing using an idempotency key ensures that even if a legitimate message is redelivered due to a visibility timeout, the consumer can detect and skip duplicate side effects, thus solving both requirements most directly.
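
The effect of a receive-count threshold can be illustrated with a toy in-memory model. This is not how SQS is implemented; it is only a sketch of the behavior, with a hypothetical `validate` handler and a threshold of 3: a message that keeps failing is handed to the consumer a bounded number of times and then set aside, while healthy messages keep flowing.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Toy model of a queue with a redrive policy. Returns (processed, dead_letter)."""
    queue = deque((body, 0) for body in messages)  # (body, receive_count)
    processed, dead_letter = [], []
    while queue:
        body, receives = queue.popleft()
        if receives >= max_receive_count:
            dead_letter.append(body)  # threshold reached: isolate the message
            continue
        try:
            handler(body)
            processed.append(body)    # success is the equivalent of DeleteMessage
        except ValueError:
            queue.append((body, receives + 1))  # becomes visible again later
    return processed, dead_letter


def validate(order):
    # Hypothetical validation rule for the example.
    if "order_id" not in order:
        raise ValueError("missing order_id")


ok, dlq = drain([{"order_id": 1}, {"note": "bad"}, {"order_id": 2}], validate)
print(ok, dlq)
```

In this model the bad message is attempted three times and then lands in `dlq`, and both valid orders are processed without waiting for it.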

Exam trap

The trap here is that candidates often confuse FIFO queues as a universal solution for both deduplication and poison message handling, but FIFO only provides exactly-once processing within a deduplication window and does not automatically handle poison messages without a DLQ, nor does it address idempotency for retries outside that window.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout only makes redelivery of slow messages less likely; it does not bound retries or prevent duplicate side effects when a legitimate message is redelivered. Calling DeleteMessage on validation failure does stop that message from retrying, but it discards the message instead of preserving it for inspection and depends entirely on consumer code rather than a queue-level policy. Option B is wrong because standard SNS topics do not provide exactly-once delivery; SNS is a pub/sub messaging service that can deliver a message more than once, and it does not replace the need for a DLQ or idempotent processing. Option D is wrong because content-based deduplication on a FIFO queue only suppresses duplicate sends within the 5-minute deduplication interval. It does not stop a message from being redelivered after a visibility timeout, and it does not handle poison messages, which would still be retried until they expire unless a DLQ is configured. With the consumer logic unchanged, duplicate side effects from redeliveries could still occur.

9

Multi-Select · hard

A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The architecture review board prefers a managed AWS-native control.

Select 2 answers

A.Route 53 failover routing with health checks

B.S3 Transfer Acceleration

C.A deployed standby application stack in the secondary Region

D.AWS Organizations service control policies

Answers: A, C

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks is correct because it provides the DNS-level automatic failover mechanism required to redirect traffic from the primary Region to a secondary Region when the primary endpoint becomes unhealthy. Route 53 health checks monitor the primary endpoint's health, and when they detect a failure, the failover routing policy automatically returns the IP address of the secondary endpoint, enabling seamless failover without manual intervention. This is a managed AWS-native control that meets the architecture review board's preference.
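
A failover pair can be sketched as two record sets with the same name, one marked `PRIMARY` and tied to a health check, the other marked `SECONDARY`. The snippet below only builds a change batch in the shape that boto3's `change_resource_record_sets` accepts; the domain name, IP addresses (documentation ranges), and health check ID are hypothetical.

```python
def failover_record(name, role, ip_address, health_check_id=None, ttl=60):
    """Build one Route 53 failover record set; role is 'PRIMARY' or 'SECONDARY'."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id  # health of this check drives failover
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical domain, documentation-range IPs, and health check ID.
change_batch = {
    "Changes": [
        failover_record("app.example.com.", "PRIMARY", "192.0.2.10", "hc-primary-id"),
        failover_record("app.example.com.", "SECONDARY", "198.51.100.20"),
    ]
}
print([c["ResourceRecordSet"]["Failover"] for c in change_batch["Changes"]])
# With boto3: boto3.client("route53").change_resource_record_sets(
#     HostedZoneId=..., ChangeBatch=change_batch)
```

The secondary record is only useful if something is actually deployed behind its address, which is why the standby stack in the secondary Region is the other required piece.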

Exam trap

The trap here is that candidates may think a pre-deployed standby stack in the secondary Region is optional or that a single service like Route 53 alone can handle failover, but both the DNS routing mechanism (Route 53) and the actual compute/storage resources in the secondary Region are required for a working failover solution.

10

MCQ · easy

A production Amazon RDS database has automated backups enabled. An application mistakenly updates a table and the issue is discovered one hour later. The team needs to restore the database to the exact state it had 45 minutes ago. Which approach best meets the requirement?

A.Perform a point-in-time restore to a timestamp within the automated backup retention period.

B.Restore only from the latest daily snapshot, then manually undo the last hour’s changes.

C.Increase Multi-AZ to generate a new standby and redirect traffic back to the previous primary state.

D.Stop the database and change the application to ignore the table going forward.

Answer: A

Point-in-time restore lets RDS recover to a specific time, which matches the “45 minutes ago” requirement.

Why this answer

Amazon RDS point-in-time recovery (PITR) allows you to restore a DB instance to a specific second within the automated backup retention period, using the automated backups together with the transaction logs needed to reconstruct the database state at the desired time. Because the target time of 45 minutes ago falls inside that period, PITR can create a new DB instance at exactly that timestamp by replaying transaction logs up to it; changes made after that timestamp are not part of the restored instance.

Exam trap

The trap here is that candidates often confuse automated backups with manual snapshots or assume Multi-AZ provides recovery capabilities, but Multi-AZ only ensures failover, not point-in-time restoration, and PITR requires transaction logs, not just snapshots.

How to eliminate wrong answers

Option B is wrong because restoring from the latest daily snapshot would only provide a backup from the snapshot time (likely hours or days old), not the state 45 minutes ago, and manually undoing changes is error-prone and not supported by RDS as a native feature. Option C is wrong because Multi-AZ provides high availability through a standby replica, but it does not offer point-in-time recovery; the standby is an identical copy of the primary and cannot be used to revert to a previous state. Option D is wrong because stopping the database and ignoring the table does not restore the lost or incorrect data; it only avoids the immediate symptom without recovering the required database state.

11

MCQ · medium

A web application runs on an EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ASG spans three Availability Zones. After a deployment, new instances frequently fail the ALB target group health checks with HTTP 5xx responses and are quickly terminated by the ASG. What change most improves resiliency during deployments with minimal downtime by preventing premature removal of instances that are still starting?

A.Reduce the ASG health check grace period to 0 seconds so issues are detected faster.

B.Use a longer ASG health check grace period and deploy new instances using controlled replacement (for example, rolling instance refresh) so existing healthy instances continue serving while new ones warm up.

C.Restrict the ASG to a single Availability Zone so health check evaluation is simpler.

D.Disable ALB health checks so the ASG does not terminate instances on HTTP 5xx responses.

Answer: B

A longer ASG health check grace period prevents instances from being evaluated too early during normal startup time. Controlled replacement or rolling instance refresh ensures capacity is maintained while new instances warm up, so the ALB continues routing requests only to healthy targets.

Why this answer

Option B is correct because increasing the ASG health check grace period gives new instances more time to complete their startup and pass the ALB health checks before the ASG marks them unhealthy. A rolling instance refresh replaces instances in a controlled manner, ensuring that existing healthy instances continue serving traffic while new instances warm up, minimizing downtime and preventing premature termination.
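
The two settings can be sketched as request parameters, assuming the shapes accepted by boto3's `update_auto_scaling_group` and `start_instance_refresh`. The group name, the 300-second startup time, and the 90% minimum healthy percentage are hypothetical; suitable values depend on how long the application really takes to start.

```python
def deployment_requests(asg_name: str, startup_seconds: int, min_healthy_pct: int = 90):
    """Build request parameters for a longer grace period and a rolling refresh."""
    if not 0 <= min_healthy_pct <= 100:
        raise ValueError("min_healthy_pct must be between 0 and 100")
    update_group = {
        "AutoScalingGroupName": asg_name,
        "HealthCheckType": "ELB",                   # honor ALB target health
        "HealthCheckGracePeriod": startup_seconds,  # seconds before checks count
    }
    instance_refresh = {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": min_healthy_pct,  # capacity kept in service
            "InstanceWarmup": startup_seconds,
        },
    }
    return update_group, instance_refresh


# Hypothetical group name and timings.
update_group, refresh = deployment_requests("web-asg", startup_seconds=300)
print(update_group["HealthCheckGracePeriod"], refresh["Preferences"])
# With boto3:
#   client = boto3.client("autoscaling")
#   client.update_auto_scaling_group(**update_group)
#   client.start_instance_refresh(**refresh)
```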

Exam trap

The trap here is that candidates think reducing the grace period or disabling health checks will speed up recovery, when in fact it causes premature termination or serves traffic to unhealthy instances, increasing downtime.

How to eliminate wrong answers

Option A is wrong because reducing the grace period to 0 seconds would cause the ASG to terminate instances even faster when they return HTTP 5xx during startup, worsening the problem. Option C is wrong because restricting to a single Availability Zone reduces fault tolerance and does not address the root cause of premature termination during startup. Option D is wrong because disabling ALB health checks would prevent the ASG from detecting actual instance failures, leading to serving traffic to unhealthy instances and increasing downtime.

12

MCQ · medium

Based on the exhibit, the application sees several minutes of connection errors during an Aurora failover. What is the best change to reduce failover impact?

A.Change the application to use the Aurora cluster writer endpoint and retry transient connections.

B.Add an Aurora read replica and keep using the same JDBC URL.

C.Increase the EC2 instance size of the application servers.

D.Switch to a single-AZ RDS PostgreSQL instance for simpler connectivity.

Answer: A

The current configuration targets a specific instance endpoint, which becomes stale after failover. The Aurora cluster writer endpoint always resolves to the current writer, so the application can reconnect without manual endpoint changes. Adding retries with backoff helps the application survive the short DNS and connection transition during failover.

Why this answer

The Aurora cluster writer endpoint always points to the current primary instance, even after a failover. By using this endpoint and implementing retry logic for transient connection errors, the application can automatically reconnect to the new writer without manual intervention, which shortens the disruption from several minutes to roughly the duration of the failover itself.
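
A quick configuration check can flag a connection string that targets an instance endpoint. The sketch below assumes the documented Aurora hostname patterns, in which cluster endpoints carry a `cluster-`, `cluster-ro-`, or `cluster-custom-` prefix in the second DNS label; the hostnames are made up, and the exhibit's actual JDBC URL is not shown.

```python
def endpoint_kind(hostname: str) -> str:
    """Classify an Aurora hostname as 'writer', 'reader', 'custom', or 'instance'."""
    labels = hostname.lower().split(".")
    second = labels[1] if len(labels) > 1 else ""
    if second.startswith("cluster-ro-"):
        return "reader"    # cluster reader endpoint
    if second.startswith("cluster-custom-"):
        return "custom"    # user-defined custom endpoint
    if second.startswith("cluster-"):
        return "writer"    # cluster (writer) endpoint, follows the current primary
    return "instance"      # pinned to one DB instance, stale after failover


# Hypothetical hostnames for illustration.
hosts = [
    "orders.cluster-c7tj4example.us-east-1.rds.amazonaws.com",
    "orders.cluster-ro-c7tj4example.us-east-1.rds.amazonaws.com",
    "orders-instance-1.c7tj4example.us-east-1.rds.amazonaws.com",
]
for host in hosts:
    print(endpoint_kind(host), host)
```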

Exam trap

The trap here is that candidates often think adding read replicas or scaling application servers will fix failover connectivity, but the real issue is that the application must use the correct endpoint and handle transient disconnections gracefully.

How to eliminate wrong answers

Option B is wrong because adding a read replica does not help with write connection errors during failover; the application would still need to reconnect to the new writer, and the same JDBC URL (which likely points to a specific instance) would fail after failover. Option C is wrong because increasing the EC2 instance size of the application servers does not address the root cause of connection errors during failover; it only improves compute capacity, not database connectivity resilience. Option D is wrong because switching to a single-AZ RDS PostgreSQL instance would actually increase downtime during a failure (no automatic failover) and does not solve the transient connection issue; it also loses Aurora's high-availability features.

13

MCQ · hard

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The architecture review board prefers a managed AWS-native control.

A.Instance store volumes

B.Amazon EFS with mount targets in multiple Availability Zones

C.An EBS volume attached to all instances

D.S3 mounted as a POSIX file system without a file gateway

Answer: B

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, shared POSIX-compliant file system that can be mounted concurrently across multiple Linux EC2 instances. By creating mount targets in multiple Availability Zones, the file system remains accessible even if one AZ fails, meeting the high-availability requirement. This aligns with the architecture review board's preference for a managed AWS-native control.

Exam trap

The trap here is that candidates may confuse EBS multi-attach (which is limited to a single AZ and specific volume types) with the cross-AZ shared file system capability of EFS, or incorrectly assume that mounting S3 as a POSIX file system is a reliable, managed solution for shared storage.

How to eliminate wrong answers

Option A is wrong because instance store volumes are ephemeral, tied to a single EC2 instance, and data is lost on instance stop or termination, so they cannot provide shared, durable storage across AZs. Option C is wrong because an EBS volume exists in a single Availability Zone and can normally be attached to only one EC2 instance at a time (except for Multi-Attach io1/io2 volumes, which can attach to up to 16 Nitro-based instances in the same Availability Zone and are still not designed for cross-AZ shared file storage). Option D is wrong because mounting an S3 bucket as a POSIX file system (e.g., via s3fs-fuse) does not provide native POSIX locking or consistency semantics, and it introduces performance and reliability issues; it is not a managed AWS-native file system service.

14

MCQ · medium

An orders service publishes payment instructions to an Amazon SQS Standard queue. The downstream processor sometimes times out after it has already applied the payment, but before it can delete the message from the queue. As a result, the same payment instruction can be processed more than once. The team wants the strongest way to prevent duplicate side effects while keeping the system decoupled. What should they implement?

A.Keep the queue as SQS Standard but increase the visibility timeout so duplicates are less likely to reappear during timeouts.

B.Change the queue to an SQS FIFO queue and use a stable deduplication ID derived from the payment instruction ID.

C.Make the downstream processor idempotent by recording processed payment instruction IDs in a durable datastore and ignoring repeats.

D.Use an ALB health check to restart the downstream processor when timeouts occur.

Answer: C

SQS Standard is at-least-once delivery, so the same message can be delivered more than once if the consumer times out before deleting it. Idempotent processing is the strongest protection against duplicate side effects because it prevents repeat application of the payment even when the message is redelivered.

Why this answer

Option C is correct because making the downstream processor idempotent ensures that duplicate payment instructions are safely ignored, even if the same message is delivered more than once. This approach provides the strongest guarantee against duplicate side effects without requiring changes to the queue type or increasing visibility timeouts, and it keeps the system fully decoupled.
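
A minimal sketch of such an idempotent processor follows, using an in-memory SQLite table as a stand-in for the durable datastore. The table name, the `apply_once` helper, and the `pay-123` ID are hypothetical. The primary-key constraint is what rejects a repeated instruction ID; in a real system the marker and the payment side effect would need to be committed atomically, or the side effect would itself need to be safe to repeat.

```python
import sqlite3


def apply_once(conn, instruction_id, apply_payment):
    """Apply a payment at most once per instruction_id. Returns True if applied."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute("INSERT INTO processed (instruction_id) VALUES (?)",
                         (instruction_id,))
            apply_payment()
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key conflict: this instruction was already handled


conn = sqlite3.connect(":memory:")  # stand-in for a durable datastore
conn.execute("CREATE TABLE processed (instruction_id TEXT PRIMARY KEY)")

ledger = []
first = apply_once(conn, "pay-123", lambda: ledger.append("pay-123"))
second = apply_once(conn, "pay-123", lambda: ledger.append("pay-123"))  # redelivery
print(first, second, ledger)  # prints: True False ['pay-123']
```

Because the marker insert and the callback share one transaction, a failure inside `apply_payment` rolls the marker back, so a later redelivery can still be processed.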

Exam trap

The trap here is that candidates often assume that switching to a FIFO queue or increasing visibility timeout fully solves duplicate processing, but they overlook that the downstream processor's timeout after applying the payment is the root cause, which idempotency directly addresses.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout only makes duplicates less likely; a timeout that occurs after the payment is applied still leads to redelivery. Option B is wrong because FIFO deduplication only discards messages that a producer sends again with the same deduplication ID within the 5-minute deduplication interval; it does not stop SQS from redelivering a message that the processor received but failed to delete before the visibility timeout expired, so the payment could still be applied twice. Option D is wrong because ALB health checks only control whether the load balancer routes requests to a target; they have no effect on SQS redelivery and do not prevent duplicate processing of the same payment instruction.
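The idempotency check described in option C can be sketched as follows. This is a minimal illustration, not a production design: an in-memory SQLite table stands in for the durable datastore, and the names make_store, handle_payment, and the ID "pay-123" are hypothetical.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    # In-memory SQLite is an assumption for the demo; a real system
    # would use a durable store that survives restarts.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (instruction_id TEXT PRIMARY KEY)")
    return conn


def handle_payment(conn, instruction_id, apply_payment) -> bool:
    """Apply the payment once; return False if the ID was already recorded."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT INTO processed (instruction_id) VALUES (?)",
                (instruction_id,),
            )
            apply_payment(instruction_id)
    except sqlite3.IntegrityError:
        return False  # primary-key conflict: this instruction was seen before
    return True


applied = []
store = make_store()
first = handle_payment(store, "pay-123", applied.append)
second = handle_payment(store, "pay-123", applied.append)  # simulated redelivery
```

Because the primary key rejects the second insert, the redelivered message is acknowledged without applying the payment again.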

15

MCQmedium

Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?

A.Send EventBridge events to an SQS queue, configure a redrive policy to move messages to a dead-letter queue (DLQ) after a defined receive count, and make the Lambda processing idempotent.

B.Invoke Lambda directly from EventBridge in asynchronous mode, and increase the Lambda timeout to reduce failures.

C.Use SNS topics with Lambda subscriptions, but remove all retry and DLQ configuration to minimize duplicate events.

D.Store failed events only in CloudWatch logs, and have operators manually copy log entries back into the database for reprocessing.

AnswerA

EventBridge-to-SQS provides buffering and decoupling; SQS redrive with a DLQ bounds retries and preserves failed events for analysis and replay.

Why this answer

Option A is correct because it introduces an SQS queue between EventBridge and Lambda, which provides a durable buffer for events. The redrive policy moves events to a dead-letter queue (DLQ) after a defined number of failed processing attempts, ensuring bounded retries and preserving unprocessable events for later inspection and reprocessing. Because SQS delivers each message at least once, making the Lambda function idempotent ensures that a redelivered event does not update the order status twice.

Exam trap

The trap here is that candidates may think increasing Lambda timeout or relying on asynchronous invocation retries alone is sufficient, but they overlook the need for a DLQ to capture and inspect events that persistently fail, which is a key requirement for operational visibility and reprocessing.

How to eliminate wrong answers

Option B is wrong because increasing the Lambda timeout does not address transient database timeouts; it only gives the function more time to complete, and once the asynchronous invocation retries are exhausted the event is discarded unless a DLQ or on-failure destination is configured. Option C is wrong because removing all retry and DLQ configuration means a failed delivery is dropped with no further attempts, which violates the requirements for bounded retries and inspectable unprocessable events. Option D is wrong because storing failed events only in CloudWatch logs does not provide a structured, queryable, or automated reprocessing mechanism; manual copying is error-prone, unscalable, and does not meet the requirement for bounded retries or at-least-once processing.
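As a rough sketch of the redrive policy in option A, the snippet below builds the RedrivePolicy queue attribute, which SQS expects as a JSON string containing deadLetterTargetArn and maxReceiveCount. The helper name, the ARN, and the count of 5 are placeholders chosen for illustration.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Return queue attributes that dead-letter a message after N receives."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS takes the policy as a JSON-encoded string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:111122223333:orders-dlq", 5)
```

The resulting dictionary is the shape one would pass as queue attributes when creating or updating the source queue.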

16

MCQeasy

An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.

A.CloudFront caching with appropriate TTLs

B.AWS Backup Vault Lock

C.IAM Access Analyzer

D.S3 Select

AnswerA

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caching with appropriate TTLs allows cached responses to be served to users even when the S3 origin is temporarily unavailable. TTLs come from the origin's Cache-Control or Expires headers, bounded by the minimum, default, and maximum TTL settings in the cache policy. While an object's TTL has not expired, CloudFront answers requests from the edge cache without contacting the origin, so a TTL longer than the expected outage keeps previously cached pages available. This is a managed AWS-native feature that aligns with the architecture review board's preference.

Exam trap

The trap here is that candidates may confuse data protection features (like Backup Vault Lock) or data retrieval tools (like S3 Select) with caching and origin resilience, overlooking that CloudFront's TTL-based caching is the direct AWS-managed solution for serving content during origin outages.

How to eliminate wrong answers

Option B (AWS Backup Vault Lock) is wrong because it is a data protection feature for backup vaults that prevents deletion of backups, not a mechanism to serve cached content during an origin outage. Option C (IAM Access Analyzer) is wrong because it analyzes resource-based policies to identify unintended public access, not to cache or serve static content. Option D (S3 Select) is wrong because it is a query-in-place feature that retrieves subsets of data from objects using SQL expressions, and it does not provide caching or resilience against origin outages.
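The effect of a TTL during an origin outage can be shown with a toy edge cache. This is a simplified model for intuition only, not CloudFront's implementation; EdgeCache, OriginDown, and the 300-second TTL are assumptions made for the example.

```python
class OriginDown(Exception):
    """Raised by the fake origin while it is unavailable."""


class EdgeCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries = {}  # path -> (body, time stored)

    def get(self, path, now, fetch_origin):
        entry = self._entries.get(path)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # still fresh: the origin is never contacted
        body = fetch_origin(path)  # may raise OriginDown
        self._entries[path] = (body, now)
        return body


def healthy_origin(path):
    return "<h1>home</h1>"


def failed_origin(path):
    raise OriginDown(path)


cache = EdgeCache(ttl_seconds=300)
warm = cache.get("/", now=0, fetch_origin=healthy_origin)
during_outage = cache.get("/", now=120, fetch_origin=failed_origin)
```

The request at 120 seconds succeeds because the entry is still within its TTL; in this model a request after the TTL expires would reach the failed origin and raise.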

17

MCQmedium

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured?

A.S3 lifecycle transition to Glacier Flexible Retrieval

B.An EBS snapshot schedule

C.S3 Cross-Region Replication with versioning enabled

D.A CloudFront distribution

AnswerC

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) with versioning enabled automatically copies objects from a source bucket in one AWS Region to a destination bucket in another Region, meeting the disaster recovery requirement for a geographically separate copy. Versioning must be enabled on both buckets to support replication of all object versions, ensuring consistency and recoverability. This is the native S3 feature designed for cross-region data redundancy without custom scripting or third-party tools.

Exam trap

The trap here is that candidates confuse S3 Cross-Region Replication with S3 lifecycle policies or other storage services like EBS snapshots, failing to recognize that CRR is the only option that directly creates a second copy of S3 objects in a different AWS Region for disaster recovery.

How to eliminate wrong answers

Option A is wrong because S3 lifecycle transition to Glacier Flexible Retrieval only moves objects to a lower-cost storage class within the same bucket and region; it does not create a copy in another AWS Region. Option B is wrong because EBS snapshots are for Amazon Elastic Block Store volumes attached to EC2 instances, not for S3 objects, and they cannot replicate data across regions automatically without additional configuration like copying snapshots manually. Option D is wrong because CloudFront is a content delivery network (CDN) that caches content at edge locations for low-latency delivery; it does not provide persistent cross-region storage replication for disaster recovery.
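For orientation, a replication configuration for option C might look like the dictionary built below. It is a sketch under assumptions: the role ARN, bucket ARN, and rule ID are placeholders, and both buckets are assumed to have versioning enabled already.

```python
def crr_configuration(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a single-rule replication configuration covering the whole bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-copy",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: apply to every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Placeholder ARNs, for illustration only.
config = crr_configuration(
    "arn:aws:iam::111122223333:role/s3-replication",
    "arn:aws:s3:::documents-dr-copy",
)
```

With boto3, a dictionary of this shape would be supplied as the ReplicationConfiguration argument of put_bucket_replication on the source bucket.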

18

MCQeasy

Based on the exhibit, a web application must stay available if one Availability Zone fails. What is the best change to improve resilience?

A.Increase the desired capacity to 8 instances in the same subnet.

B.Add a subnet in another Availability Zone to the Auto Scaling group and keep the ALB spanning both AZs.

C.Replace the Application Load Balancer with a Network Load Balancer.

D.Move the instances to a larger instance type with more CPU and memory.

AnswerB

This places application instances across multiple Availability Zones, which protects the stateless tier from a single-AZ failure. The ALB already spans two AZs, so the missing piece is the Auto Scaling group using subnets in more than one AZ. That allows AWS to replace unhealthy instances and continue serving traffic from the surviving Zone.

Why this answer

Adding a subnet in another Availability Zone (AZ) to the Auto Scaling group and keeping the ALB spanning both AZs ensures that if one AZ fails, the ALB can route traffic to healthy instances in the other AZ. This is the standard pattern for building multi-AZ resilient architectures with Auto Scaling and ALB, as it eliminates the single point of failure at the AZ level.

Exam trap

The trap here is that candidates often focus on scaling up (more instances or larger instances) or changing the load balancer type, missing the fundamental requirement of distributing resources across multiple Availability Zones to achieve AZ-level resilience.

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity to 8 instances in the same subnet does not protect against an AZ failure; all instances remain in a single AZ, so if that AZ goes down, all instances become unavailable. Option C is wrong because replacing the ALB with a Network Load Balancer does not inherently improve resilience against AZ failure; both ALB and NLB can span multiple AZs, but the key issue is the lack of multi-AZ instance placement, not the load balancer type. Option D is wrong because moving to a larger instance type with more CPU and memory improves performance but does not address AZ-level fault tolerance; a single AZ failure would still take down all instances regardless of size.

19

Multi-Selecthard

A regional web application for an inventory service must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required?

Select 2 answers

A.Route 53 failover routing with health checks

B.S3 Transfer Acceleration

C.A deployed standby application stack in the secondary Region

D.AWS Organizations service control policies

AnswersA, C

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks is correct because it monitors the health of the primary endpoint via periodic HTTP/HTTPS or TCP checks. If the health check fails, Route 53 automatically updates DNS resolution to point to the secondary Region's endpoint, enabling failover at the DNS level without manual intervention.

Exam trap

The trap here is that candidates often assume Route 53 alone is sufficient for failover, forgetting that a fully deployed standby application stack in the secondary Region is required to actually serve traffic after DNS rerouting.
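The interaction between a health check and failover routing can be modeled in a few lines. This is a toy simulation, not the Route 53 API: the class, the threshold of 3, and the hostnames are illustrative assumptions, and recovery is simplified to a single successful probe.

```python
class HealthCheck:
    """Marks an endpoint unhealthy after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def resolve(check: HealthCheck, primary: str, secondary: str) -> str:
    """Answer with the primary record while healthy, else the secondary."""
    return primary if check.healthy else secondary


check = HealthCheck(failure_threshold=3)
before = resolve(check, "primary.example.com", "standby.example.com")
for _ in range(3):
    check.record(probe_ok=False)
after = resolve(check, "primary.example.com", "standby.example.com")
```

The answer only helps if standby.example.com points at a stack that is actually deployed, which is why option C is required alongside option A.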

20

Multi-Selecthard

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The architecture review board prefers a managed AWS-native control.

Select 2 answers

A.Point-in-time recovery

B.DAX

C.Deletion protection or tightly controlled delete permissions

D.Global secondary indexes

AnswersA, C

PITR allows restoration to a specific second within the supported recovery window.

Why this answer

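Point-in-time recovery (PITR) for DynamoDB provides continuous backups with per-second granularity over a recovery window of up to 35 days, allowing restoration to any second within that window. This satisfies the point-in-time recovery requirement by providing a managed AWS-native control that automatically backs up table data without manual intervention. Deletion protection (option C) covers the second requirement by rejecting requests to delete the table while the setting is enabled.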

Exam trap

The trap here is that candidates often confuse DAX (a caching layer) with backup/recovery features, or assume that global secondary indexes provide some form of data redundancy or protection, when in fact they only support query flexibility.

21

Multi-Selectmedium

A retail API runs on Amazon EC2 instances behind an Application Load Balancer and stores orders in an Amazon RDS for PostgreSQL database. A test that stopped one Availability Zone caused the API to return errors because all application servers were in the same AZ and the database was single-AZ. Which two changes should the architect make to continue serving traffic during a single-AZ failure? Select two.

Select 2 answers

A.Increase the EC2 instance size and keep all application servers in the same subnet.

B.Configure the Auto Scaling group to launch instances across private subnets in at least two Availability Zones.

C.Replace the Application Load Balancer with a Network Load Balancer in a single Availability Zone.

D.Convert the RDS for PostgreSQL database to a Multi-AZ deployment.

E.Add an Amazon RDS read replica and point the application to the replica endpoint.

AnswersB, D

Spreading the application tier across multiple AZs preserves healthy capacity if one AZ fails and lets the load balancer keep serving requests.

Why this answer

Option B is correct because distributing EC2 instances across private subnets in at least two Availability Zones (AZs) ensures that if one AZ fails, the Auto Scaling group can continue serving traffic from instances in the remaining AZs. This eliminates the single point of failure for the application tier. Option D is correct because converting the RDS for PostgreSQL database to a Multi-AZ deployment automatically provisions a standby replica in a different AZ, enabling automatic failover during an AZ outage and preserving database availability.

Exam trap

The trap here is that candidates often think a read replica can serve as a high-availability solution for writes, but read replicas are asynchronous and do not support automatic failover for the primary database.

22

Matchingmedium

Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.

Concepts

Matches

Lowest cost option where the environment is rebuilt from backups and hours of downtime are acceptable.

Keep only the critical core running in the secondary Region, then scale out after failover.

Run a scaled-down but functional environment in another Region for faster cutover.

Serve production traffic from more than one Region at the same time for the fastest recovery.

Why these pairings

These pairs match disaster recovery strategies to their recovery postures, aligning with AWS DR strategies where RTO and RPO define the recovery objectives.

23

MCQmedium

A payments service receives payment orders by consuming messages from an Amazon SQS Standard queue. The downstream processor occasionally exceeds its processing timeout. As a result, some messages reappear in the queue and may be processed more than once. The team wants to prevent duplicate side effects (for example, double-charging) and also ensure poison messages do not repeatedly consume processing capacity. What approach best satisfies both goals?

A.Implement idempotent processing (for example, store processed payment IDs in DynamoDB) and configure an SQS dead-letter queue (DLQ) using a redrive policy with an appropriate maxReceiveCount.

B.Rely only on increasing the SQS visibility timeout so duplicates rarely occur, without adding idempotency checks or a DLQ.

C.Switch to a FIFO queue and delete messages immediately upon receipt to avoid duplicates.

D.Move the workload to SNS and use synchronous HTTP endpoints so the sender retries until the receiver confirms success.

AnswerA

With SQS Standard’s at-least-once delivery, duplicates can occur. Idempotency ensures repeated processing of the same payment ID does not create duplicate side effects. A DLQ with redrive policy isolates poison messages: after a message is received and fails processing more than maxReceiveCount times, SQS moves it to the DLQ instead of cycling it back to the main queue indefinitely.

Why this answer

Option A is correct because it addresses both requirements: idempotent processing (e.g., storing processed payment IDs in DynamoDB) ensures that even if a message is processed more than once, duplicate side effects like double-charging are prevented. Configuring an SQS dead-letter queue (DLQ) with a redrive policy and an appropriate maxReceiveCount (e.g., 3 or 5) automatically moves messages that exceed the maximum number of receives to the DLQ, preventing poison messages from repeatedly consuming processing capacity.

Exam trap

The trap here is that candidates often assume a FIFO queue's deduplication makes processing exactly-once, failing to realize that idempotency is still required to handle failures after a message is received, and that a DLQ is necessary to manage poison messages regardless of queue type.

How to eliminate wrong answers

Option B is wrong because simply increasing the SQS visibility timeout reduces the likelihood of duplicates but does not eliminate them entirely, and it fails to handle poison messages that may still cause repeated processing failures. Option C is wrong because deleting messages immediately upon receipt turns a processor failure into a lost payment: once the message is deleted, it can no longer be retried. FIFO deduplication also only covers repeated sends within the 5-minute deduplication interval, so it does not replace idempotent processing, and the option does nothing about poison messages. Option D is wrong because moving to SNS with synchronous HTTP endpoints shifts the retry responsibility to the sender, but it does not inherently prevent duplicate side effects (e.g., if the receiver processes the request but the acknowledgment is lost) and does not address poison messages that could repeatedly fail.

24

MCQmedium

A fintech company has a two-Region DR requirement: RPO must be within 15 minutes and RTO must be under 2 hours. To control cost, they do not want to run full production infrastructure in the secondary Region continuously. They plan to continuously replicate the database and keep the application infrastructure in the secondary Region prepared, but at reduced capacity. Which DR strategy best matches this requirement and accurately describes their plan?

A.Pilot light: keep only minimal components (for example, replicated storage and a small amount of core services), so the app scales up during a disaster.

B.Warm standby: keep the essential parts of the application running in the secondary Region at reduced capacity, while using database replication to meet the RPO.

C.Active-active: run the application fully in both Regions with synchronized writes and share traffic continuously.

D.Cold standby: store backups in the secondary Region and provision all infrastructure only during a disaster.

AnswerB

Warm standby aligns with both constraints: reduced-cost readiness is maintained in the secondary Region (so RTO is faster), and continuous replication is used to keep data lag within the 15-minute RPO target.

Why this answer

Warm standby is the correct strategy because it runs a scaled-down version of the production application in the secondary Region continuously, with database replication (e.g., an Amazon RDS cross-Region read replica or Aurora Global Database) meeting the 15-minute RPO. The reduced-capacity infrastructure can be scaled up within the 2-hour RTO during a disaster, balancing cost and recovery requirements.

Exam trap

The trap here is confusing pilot light with warm standby: candidates often think any pre-provisioned infrastructure qualifies as pilot light, but warm standby explicitly runs the application at reduced capacity, whereas pilot light keeps only core services and storage without running the application stack.

How to eliminate wrong answers

Option A is wrong because pilot light keeps only minimal core services and storage, not a running application at reduced capacity, and requires provisioning and scaling up compute resources during a disaster, which may not meet the 2-hour RTO if scaling takes significant time. Option C is wrong because active-active runs the application fully in both Regions with synchronized writes and continuous traffic sharing, which violates the cost control requirement of not running full production infrastructure continuously. Option D is wrong because cold standby stores only backups and provisions all infrastructure during a disaster, leading to RTOs that typically exceed 2 hours due to provisioning and data restoration delays.

25

MCQeasy

A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?

A.Enable S3 lifecycle policies to automatically move objects to Glacier after 30 days only.

B.Enable S3 versioning and configure Cross-Region Replication to a destination bucket in another Region.

C.Disable all versioning and rely on AWS Backup to restore objects from a scheduled backup window.

D.Enable S3 Block Public Access and SSE-S3 encryption, without using versioning or replication.

AnswerB

Versioning preserves previous object states against overwrites and deletes, while replication provides an additional Region copy for recovery.

Why this answer

Option B is correct because enabling S3 versioning protects against accidental overwrites by preserving previous versions of objects, and configuring Cross-Region Replication (CRR) automatically replicates objects to a destination bucket in another AWS Region, providing disaster recovery. This combination meets both requirements without manual intervention.

Exam trap

The trap here is that candidates may think AWS Backup alone can handle both accidental overwrites and disaster recovery, but it does not provide continuous versioning protection or real-time cross-region replication, and disabling versioning removes the ability to recover from overwrites.

How to eliminate wrong answers

Option A is wrong because lifecycle policies to Glacier only manage storage tier transitions and do not protect against accidental overwrites or provide cross-region replication for disaster recovery. Option C is wrong because disabling versioning removes the ability to recover from accidental overwrites, and relying solely on AWS Backup for scheduled restores does not provide real-time protection or continuous replication to another region. Option D is wrong because enabling Block Public Access and SSE-S3 encryption addresses security and encryption, but does not protect against accidental overwrites (no versioning) nor replicate data to another region (no replication).

26

MCQhard

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?

A.A FIFO queue without a redrive policy

B.A dead-letter queue with an appropriate maxReceiveCount

C.A larger message retention period only

D.Short polling instead of long polling

AnswerB

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

B is correct because a dead-letter queue (DLQ) with an appropriate maxReceiveCount allows the payments API to isolate poison messages after a specified number of failed processing attempts. This prevents repeated failures from blocking useful retries, as the problematic messages are moved to the DLQ for manual inspection or separate handling, while the main queue continues processing valid messages.

Exam trap

The trap here is that candidates often confuse increasing the retention period or switching polling methods as solutions for poison messages, when the correct mechanism is a dead-letter queue with a maxReceiveCount to limit retries.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy has no DLQ, so a failing message keeps returning to the queue until its retention period expires and, in a FIFO queue, holds up later messages in the same message group. Option C is wrong because increasing the message retention period only extends how long messages stay in the queue, but does not address the repeated failure and blocking caused by poison messages. Option D is wrong because short polling only changes how a receive request returns (immediately, after sampling a subset of servers, instead of waiting for messages to arrive); it has no effect on poison messages or retry behavior.
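The maxReceiveCount behavior can be simulated with an in-memory queue. This is a simplified model of the mechanism, not SQS itself; the drain function, the message bodies, and the limit of 3 are hypothetical.

```python
from collections import deque


def drain(messages, handler, max_receive_count):
    """Process messages with retries; return (processed, dead_lettered)."""
    queue = deque((body, 0) for body in messages)
    processed, dead_lettered = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1  # each delivery attempt counts as one receive
        try:
            handler(body)
            processed.append(body)
        except Exception:
            if receives >= max_receive_count:
                dead_lettered.append(body)  # retry budget used up: isolate it
            else:
                queue.append((body, receives))  # back on the queue for a retry
    return processed, dead_lettered


attempts = []


def handler(body):
    attempts.append(body)
    if body == "bad":
        raise ValueError("cannot parse payment")


processed, dead_lettered = drain(["ok-1", "bad", "ok-2"], handler, max_receive_count=3)
```

The poison message is attempted three times and then set aside, while the two valid messages are processed normally.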

27

MCQeasy

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most?

A.IAM Access Analyzer

B.AWS Backup Vault Lock

C.CloudFront caching with appropriate TTLs

D.S3 Select

AnswerC

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caches responses from the S3 origin based on configured TTLs (Cache-Control or Expires headers). If the S3 origin experiences a short outage, CloudFront can still serve cached content to users until the TTL expires, maintaining availability. This is the most direct way to ensure users receive pages during transient origin failures.

Exam trap

The trap here is confusing data protection features (like Backup Vault Lock) or data retrieval features (like S3 Select) with caching mechanisms that directly improve availability during origin outages.

How to eliminate wrong answers

Option A is wrong because IAM Access Analyzer helps identify unintended access to resources but does not provide caching or origin failover capabilities. Option B is wrong because AWS Backup Vault Lock prevents deletion of backups but does not affect content delivery or caching behavior. Option D is wrong because S3 Select is a feature to retrieve subsets of object data using SQL queries, not a mechanism for caching or serving static content during outages.

28

MCQmedium

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.EBS snapshots every hour

D.Read replicas only

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL provides synchronous standby replication to a different Availability Zone. In the event of an AZ failure, RDS automatically fails over to the standby, ensuring high availability with minimal application changes (the same endpoint is used). This is a managed AWS-native solution that meets the architecture review board's preference.

Exam trap

The trap here is that candidates often confuse Read replicas (which are for read scaling and not automatic failover) with Multi-AZ (which provides synchronous standby for high availability), or they mistakenly think EBS snapshots or S3 replication can provide the same level of automatic recovery with minimal downtime.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage replication across Regions, not for RDS database availability within a Region, and it does not address AZ failure for a MySQL database. Option C is wrong because EBS snapshots every hour provide point-in-time backups but do not enable automatic failover; recovery would require manual restoration, causing significant downtime. Option D is wrong because read replicas are designed for read scaling and use asynchronous replication; they do not support automatic failover for write operations and cannot maintain availability during an AZ failure without manual promotion.

29

MCQmedium

A company runs an application behind an Application Load Balancer (ALB). An Auto Scaling group (ASG) is configured with desired capacity 2, but it is attached only to subnets in a single Availability Zone. The ALB is healthy because it is configured across multiple Availability Zones. When the Availability Zone that contains the ASG subnets experiences an outage, what change most directly improves resilience and allows capacity to be restored automatically?

A.Update the ASG to use subnet IDs that span at least two Availability Zones so it can launch replacement instances after an AZ outage.

B.Reduce the ALB health check interval to speed up detection of unhealthy targets.

C.Enable connection draining on the ALB so existing requests complete before targets are terminated.

D.Increase the ASG desired capacity from 2 to 6 to compensate for the missing subnets.

AnswerA

If the ASG is attached to subnets in multiple Availability Zones, when instances in the failed AZ become unhealthy/terminate, Auto Scaling can launch new instances in the remaining AZs to restore the desired capacity. This directly addresses the root cause: the ASG cannot create capacity outside the AZs it is configured for.

Why this answer

Option A is correct because an Auto Scaling group (ASG) can only launch instances into the subnets explicitly assigned to it. If those subnets reside in a single Availability Zone (AZ) and that AZ fails, the ASG has no capacity to launch replacement instances, even though the ALB is multi-AZ. By configuring the ASG with subnet IDs spanning at least two AZs, the ASG can automatically launch instances in a healthy AZ, restoring capacity and resilience.

Exam trap

The trap here is that candidates assume a multi-AZ ALB automatically makes the entire architecture resilient, overlooking that the ASG must also be configured with subnets in multiple AZs to launch replacement instances after an AZ failure.

How to eliminate wrong answers

Option B is wrong because reducing the ALB health check interval speeds up detection of unhealthy targets but does not address the root cause: the ASG has no subnets in a healthy AZ to launch replacement instances. Option C is wrong because connection draining ensures in-flight requests complete before targets are deregistered, but it does not help restore capacity after an AZ outage. Option D is wrong because increasing the desired capacity from 2 to 6 does not solve the problem; the ASG still cannot launch instances if its subnets are all in the failed AZ, so the extra capacity is unreachable.
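A small model makes the difference between a single-AZ and a multi-AZ Auto Scaling group concrete. It is a deliberate simplification of how capacity is restored, and the function name and AZ names are illustrative assumptions.

```python
def capacity_after_az_failure(asg_azs, desired, failed_az):
    """Return instances per surviving AZ once lost capacity is replaced."""
    surviving = [az for az in asg_azs if az != failed_az]
    if not surviving:
        return {}  # no configured subnet is left to launch into
    base, extra = divmod(desired, len(surviving))
    # Spread the desired capacity as evenly as possible across surviving AZs.
    return {az: base + (1 if i < extra else 0) for i, az in enumerate(surviving)}


single_az = capacity_after_az_failure(["us-east-1a"], 2, "us-east-1a")
multi_az = capacity_after_az_failure(["us-east-1a", "us-east-1b"], 2, "us-east-1a")
```

With one AZ configured the result is empty regardless of desired capacity, which is also why raising the desired capacity (option D) does not help.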

30

MCQmedium

A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?

A.Use an active/active design with multi-Region data replication (for example, global tables for the write-heavy datastore) and route traffic to multiple Regions based on health and latency.

B.Use warm standby with periodic backups of the primary write datastore every 24 hours.

C.Use pilot light where the secondary Region runs only infrastructure templates and starts data replication only after detecting failure.

D.Use a single-writer model in one Region and deploy read-only replicas in the other Region for continuity.

AnswerA

Active/active supports writing in multiple Regions and reduces the blast radius of a Regional failure while enabling continued operations.

Why this answer

Option A is correct because an active/active design with multi-Region data replication, such as Amazon DynamoDB global tables, allows writes to occur in any Region and replicates them to all other Regions with near-real-time latency (typically sub-second). This meets the requirement for near-zero data loss during a full Regional outage, as data is asynchronously replicated to multiple Regions, and users can continue writing to the closest healthy Region via Route 53 latency-based or geolocation routing.

Exam trap

The trap here is that candidates often confuse 'multi-Region replication' with 'read replicas only' (Option D) or assume that periodic backups (Option B) provide sufficient durability, failing to recognize that near-zero data loss requires continuous asynchronous replication, not batch-based or on-demand replication.

How to eliminate wrong answers

Option B is wrong because warm standby with periodic backups every 24 hours cannot achieve near-zero data loss; a 24-hour backup window means up to 24 hours of writes could be lost in a Regional failure. Option C is wrong because pilot light starts data replication only after detecting failure, which introduces a recovery time objective (RTO) and recovery point objective (RPO) that are too high for near-zero data loss, and it does not support continuous writes during the incident. Option D is wrong because a single-writer model with read-only replicas in another Region means writes cannot continue during a Regional outage of the primary Region, violating the requirement that users can write during the incident.

Practice this question →

31

MCQmedium

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers?

A.AWS WAF

B.Amazon CloudFront

C.Amazon SQS queue

D.Amazon Route 53 weighted routing

AnswerC

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a durable, highly available message buffer between the web tier and the fulfilment workers. It decouples the components, allowing the web tier to enqueue requests immediately without waiting for the downstream service, while the workers poll and process messages at their own pace. SQS retains messages for a configurable period (4 days by default, up to 14 days). A message that is not deleted becomes visible again after its visibility timeout and is retried, and a dead-letter queue captures messages that keep failing, so requests are not lost during spikes.
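To make the buffering effect concrete, here is a minimal in-memory simulation. It is a sketch only: it uses a Python `deque` in place of SQS, and the function name `simulate_buffer`, the arrival pattern, and the worker capacity are assumptions chosen for illustration.

```python
from collections import deque


def simulate_buffer(arrivals_per_tick, worker_capacity_per_tick):
    """Return (processed_total, max_backlog) for bursts absorbed by a queue."""
    if worker_capacity_per_tick <= 0:
        raise ValueError("worker capacity must be positive")
    queue = deque()
    processed = 0
    max_backlog = 0
    next_id = 0
    for arrivals in arrivals_per_tick:
        # The web tier enqueues immediately, regardless of worker speed.
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # Workers consume at their own fixed pace.
        for _ in range(min(worker_capacity_per_tick, len(queue))):
            queue.popleft()
            processed += 1
        max_backlog = max(max_backlog, len(queue))
    # After the burst, workers drain whatever is still buffered.
    while queue:
        queue.popleft()
        processed += 1
    return processed, max_backlog


processed, max_backlog = simulate_buffer([2, 50, 3, 0, 0],
                                         worker_capacity_per_tick=10)
```

With these numbers the backlog peaks at 40 messages during the burst, yet all 55 requests are eventually processed; nothing is dropped because the queue holds the excess.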

Exam trap

The trap here is that candidates may confuse a load-balancing or caching service (like CloudFront or Route 53) with a message queue, failing to recognize that only a queue provides durable, asynchronous decoupling and retry capability for request processing.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules (e.g., SQL injection, XSS) and does not provide message buffering, retry logic, or decoupling for asynchronous processing. Option B is wrong because Amazon CloudFront is a content delivery network (CDN) that caches and accelerates static and dynamic content at edge locations; it cannot buffer or persist requests for downstream workers to process asynchronously. Option D is wrong because Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights, but it operates at the DNS level and cannot absorb spikes or retry failed requests; it provides no queueing or persistence.

Practice this question →

32

MCQeasy

An engineering team deploys a stateless web API on EC2 using an Auto Scaling group and an Application Load Balancer (ALB). During a recent test, they noticed that when one Availability Zone was unavailable, traffic failed until new instances were manually launched. Which change most directly improves automatic failover for the compute layer within a single Region?

A.Place the Auto Scaling group in only one subnet so instance launches are simpler.

B.Ensure the ALB and Auto Scaling group span multiple subnets in at least two Availability Zones.

C.Increase the target group deregistration delay to allow old instances to stay longer.

D.Use a Network Load Balancer, but keep all subnets in a single Availability Zone.

AnswerB

Spreading the ALB and Auto Scaling group across at least two AZs provides redundant capacity. If one AZ fails, the ALB continues routing to healthy targets in the other AZ.

Why this answer

Option B is correct because an Application Load Balancer (ALB) and Auto Scaling group must span multiple subnets in at least two Availability Zones (AZs) to provide automatic failover. When one AZ becomes unavailable, the ALB automatically reroutes traffic to healthy targets in the remaining AZs, and the Auto Scaling group can launch replacement instances in the surviving AZs. This architecture ensures that the compute layer remains available without manual intervention.
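A quick pre-deployment check for this requirement might look like the sketch below. The subnet IDs, the subnet-to-AZ mapping, and both function names are hypothetical; in practice the mapping would come from your own inventory or an API lookup.

```python
def spans_multiple_azs(subnet_to_az, subnet_ids, minimum=2):
    """True if the given subnets cover at least `minimum` distinct AZs."""
    azs = {subnet_to_az[subnet] for subnet in subnet_ids}
    return len(azs) >= minimum


def surviving_subnets(subnet_to_az, subnet_ids, failed_az):
    """Subnets still usable for launches and routing when one AZ is down."""
    return [s for s in subnet_ids if subnet_to_az[s] != failed_az]


# Hypothetical subnet inventory.
SUBNET_TO_AZ = {
    "subnet-aaa": "us-east-1a",
    "subnet-bbb": "us-east-1a",
    "subnet-ccc": "us-east-1b",
}

single_az_group = ["subnet-aaa", "subnet-bbb"]
multi_az_group = ["subnet-aaa", "subnet-ccc"]

single_ok = spans_multiple_azs(SUBNET_TO_AZ, single_az_group)
multi_ok = spans_multiple_azs(SUBNET_TO_AZ, multi_az_group)
left_after_failure = surviving_subnets(SUBNET_TO_AZ, multi_az_group, "us-east-1a")
```

Two subnets in the same AZ do not count as redundancy; only the second group leaves a subnet to launch into when `us-east-1a` fails.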

Exam trap

The trap here is that candidates often think a single-AZ deployment with a load balancer provides failover, but without multiple AZs, the load balancer itself becomes a single point of failure and cannot reroute traffic when the AZ goes down.

How to eliminate wrong answers

Option A is wrong because placing the Auto Scaling group in only one subnet (single AZ) eliminates redundancy; if that AZ fails, all instances become unreachable and no automatic failover is possible. Option C is wrong because increasing the target group deregistration delay only keeps old instances longer during a scale-in event; it does not provide failover when an AZ becomes unavailable. Option D is wrong because using a Network Load Balancer (NLB) in a single AZ still creates a single point of failure; the NLB cannot route traffic to healthy targets in other AZs if the only AZ is down, and it does not improve failover over an ALB in this scenario.

Practice this question →

33

Multi-Selecthard

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The team wants the control to be enforceable during normal operations.

Select 2 answers

A.Deletion protection or tightly controlled delete permissions

B.Point-in-time recovery

C.Global secondary indexes

D.DAX

AnswersA, B

Deletion protection and least-privilege controls reduce accidental table removal risk.

Why this answer

Deletion protection (Option A) blocks DeleteTable requests while it is enabled, which prevents accidental table deletion and is enforceable during normal operations. Point-in-time recovery (Option B) enables continuous backups with per-second granularity, allowing restoration to any second within a recovery window of up to 35 days. Together, they satisfy the requirements for accidental-delete protection and point-in-time recovery.
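As a sketch of how the two settings could be expressed, the helper below builds the request parameters that would typically be passed to the DynamoDB `UpdateTable` and `UpdateContinuousBackups` operations (for example via boto3's `update_table` and `update_continuous_backups`). No AWS call is made here; the helper name `build_protection_requests` and the table name are assumptions.

```python
def build_protection_requests(table_name):
    """Build request parameters for deletion protection and PITR.

    Returns two dicts: one for UpdateTable, one for UpdateContinuousBackups.
    """
    update_table_kwargs = {
        "TableName": table_name,
        # Rejects DeleteTable calls until this flag is turned off again.
        "DeletionProtectionEnabled": True,
    }
    pitr_kwargs = {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {
            "PointInTimeRecoveryEnabled": True,
        },
    }
    return update_table_kwargs, pitr_kwargs


# "payments" is a hypothetical table name.
update_table_kwargs, pitr_kwargs = build_protection_requests("payments")
```

With a configured client, these would be applied as `client.update_table(**update_table_kwargs)` and `client.update_continuous_backups(**pitr_kwargs)`.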

Exam trap

The trap here is that candidates often confuse point-in-time recovery with backup solutions like AWS Backup or assume that GSIs or DAX provide data protection, when in fact they serve entirely different purposes (performance optimization and caching).

Practice this question →

34

MCQeasy

An inventory service exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The team wants the control to be enforceable during normal operations.

A.CloudFront caching with appropriate TTLs

B.AWS Backup Vault Lock

C.IAM Access Analyzer

D.S3 Select

AnswerA

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caching with appropriate TTLs allows the distribution to serve stale or cached content from edge locations even when the S3 origin is temporarily unavailable. By setting a minimum TTL (e.g., 0 seconds) and a default/max TTL (e.g., 86400 seconds), CloudFront can continue to respond to user requests with previously cached objects during an origin outage, ensuring high availability. This feature is enforceable during normal operations because the TTL settings are configured in the CloudFront distribution behavior and are always active, not just during failures.
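The behaviour described above can be modelled with a toy edge cache. This is a simplified illustration, not CloudFront's implementation: the class `EdgeCache`, the exception `OriginUnavailable`, and the 60-second TTL are all assumptions.

```python
class OriginUnavailable(Exception):
    """Raised by the origin fetcher when the origin cannot be reached."""


class EdgeCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # path -> (body, fetched_at)

    def get(self, path, now, fetch_origin):
        cached = self.store.get(path)
        # Fresh object: answer from the edge without contacting the origin.
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0], "hit"
        try:
            body = fetch_origin(path)
        except OriginUnavailable:
            # Expired object but origin is down: serve the stale copy.
            if cached is not None:
                return cached[0], "stale"
            raise
        self.store[path] = (body, now)
        return body, "miss"


def healthy_origin(path):
    return "<h1>home</h1>"


def broken_origin(path):
    raise OriginUnavailable(path)


cache = EdgeCache(ttl_seconds=60)
first = cache.get("/index.html", now=0, fetch_origin=healthy_origin)
second = cache.get("/index.html", now=30, fetch_origin=broken_origin)
third = cache.get("/index.html", now=120, fetch_origin=broken_origin)
```

The second request never touches the failed origin because the object is still within its TTL, and the third falls back to the stale copy. A path that was never cached would still fail, which is why TTLs need to be long enough for pages to be in the cache before an outage.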

Exam trap

The trap here is that candidates may confuse CloudFront's caching with origin failover or think that features like AWS Backup Vault Lock or IAM Access Analyzer can somehow enforce availability, when in fact only proper TTL configuration ensures cached content is served during an outage.

How to eliminate wrong answers

Option B (AWS Backup Vault Lock) is wrong because it is a data protection feature that prevents deletion or modification of backup vaults, not a mechanism to serve cached content during an origin outage. Option C (IAM Access Analyzer) is wrong because it analyzes resource-based policies to identify unintended access, not to control caching or origin failover behavior. Option D (S3 Select) is wrong because it is a query-in-place feature for filtering data within S3 objects, not a caching or availability mechanism for static websites.

Practice this question →

35

MCQmedium

A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?

A.Pilot light: replicate databases and store backups, keep only minimal infrastructure in the secondary Region, and scale up fully during failover.

B.Warm standby: keep a scaled-down application environment and database replication active in the secondary Region, using automated failover controls.

C.Backup and restore only: rely on daily automated backups and restore into the secondary Region during an incident.

D.Multi-site active-active: run both Regions at full capacity and route live traffic to both simultaneously.

AnswerB

Warm standby aligns with moderate RTO requirements by having ready-to-run resources plus continuous replication to meet the RPO target during failover.

Why this answer

Warm standby (B) is the best fit because it maintains a scaled-down but fully functional application environment in the secondary Region with active database replication. Cross-Region replication is asynchronous, but when it runs continuously (e.g., PostgreSQL streaming replication to a cross-Region replica, or AWS DMS with ongoing replication) the replication lag normally stays well within the 15-minute RPO. Automated failover controls (e.g., Route 53 health checks and Lambda automation) can achieve the RTO of 2 hours by scaling up the standby environment, and the always-on infrastructure allows regular testing of the failover process without large manual effort.
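The comparison against the stated objectives is simple arithmetic, sketched below. The RPO and RTO come from the question; the data-loss and recovery-time figures for each approach are illustrative assumptions, not measurements, and `meets_objectives` is a hypothetical helper.

```python
from datetime import timedelta


def meets_objectives(worst_case_data_loss, estimated_recovery_time, rpo, rto):
    """True if both the data-loss and recovery-time estimates fit the targets."""
    return worst_case_data_loss <= rpo and estimated_recovery_time <= rto


RPO = timedelta(minutes=15)  # from the question
RTO = timedelta(hours=2)     # from the question

# Illustrative assumptions: daily backups can lose up to a day of writes;
# continuous replication is assumed to lag by seconds.
daily_backups_ok = meets_objectives(
    worst_case_data_loss=timedelta(hours=24),
    estimated_recovery_time=timedelta(hours=6),
    rpo=RPO, rto=RTO)
warm_standby_ok = meets_objectives(
    worst_case_data_loss=timedelta(seconds=30),
    estimated_recovery_time=timedelta(minutes=45),
    rpo=RPO, rto=RTO)
```

Whatever the exact numbers, the interval between recovery points bounds the worst-case data loss, which is why a 24-hour backup cycle can never satisfy a 15-minute RPO.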

Exam trap

The trap here is that candidates confuse 'pilot light' with 'warm standby' because both involve a secondary Region with minimal resources, but pilot light lacks pre-provisioned application servers and automated failover, making it unsuitable for the stated RTO and testability requirements.

How to eliminate wrong answers

Option A is wrong because pilot light keeps only minimal infrastructure (e.g., database replicas and no application servers) and requires manual or scripted scaling during failover, which risks exceeding the 2-hour RTO due to provisioning delays and lacks the automated failover controls needed for regular testing. Option C is wrong because backup and restore relies on daily backups, which cannot meet the 15-minute RPO (backups are typically taken every 24 hours) and restoring from backups into a secondary Region often takes longer than 2 hours due to data transfer and recovery time. Option D is wrong because multi-site active-active requires full production capacity in both Regions, which exceeds the budget constraint of running only a small, always-on set of infrastructure in the secondary Region.

Practice this question →

36

MCQhard

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure?

A.A FIFO queue without a redrive policy

B.Short polling instead of long polling

C.A dead-letter queue with an appropriate maxReceiveCount

D.A larger message retention period only

AnswerC

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

Option C is correct because a dead-letter queue (DLQ) with an appropriate maxReceiveCount allows messages that repeatedly fail processing to be moved out of the source queue after a specified number of receive attempts. This prevents poison messages from blocking the queue and consuming retry capacity, enabling the workflow to continue processing valid messages without interruption.
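The mechanism can be sketched with an in-memory model. This is a simplification of SQS behaviour: here a message moves to the dead-letter list once it has failed `max_receive_count` times. The helper `redrive_policy` produces the JSON shape used for the queue's `RedrivePolicy` attribute; the ARN, the handler, and the function names are hypothetical.

```python
import json
from collections import deque


def redrive_policy(dlq_arn, max_receive_count):
    """JSON string in the shape of an SQS RedrivePolicy attribute."""
    return json.dumps({"deadLetterTargetArn": dlq_arn,
                       "maxReceiveCount": max_receive_count})


def drain(messages, handler, max_receive_count):
    """Process messages, isolating those that fail max_receive_count times."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    source = deque((body, 0) for body in messages)
    done, dead = [], []
    while source:
        body, receives = source.popleft()
        receives += 1
        try:
            handler(body)
            done.append(body)
        except Exception:
            if receives >= max_receive_count:
                dead.append(body)          # isolated for investigation
            else:
                source.append((body, receives))  # retried later
    return done, dead


def handler(body):
    if body == "poison":
        raise ValueError("cannot parse claim")


policy = redrive_policy("arn:aws:sqs:us-east-1:123456789012:claims-dlq", 3)
done, dead = drain(["a", "poison", "b"], handler, max_receive_count=3)
```

The poison message is attempted three times and then set aside, while the valid messages complete normally. Choosing `maxReceiveCount` is a trade-off: too low and transient failures land in the DLQ, too high and poison messages keep consuming worker time.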

Exam trap

The trap here is that candidates often confuse increasing the retention period or changing polling behavior with solving poison message issues, when the correct solution is to use a dead-letter queue with a maxReceiveCount to isolate failing messages.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not automatically handle poison messages; without a DLQ, failed messages remain in the queue and continue to block retries. Option B is wrong because short polling only changes how messages are retrieved: it samples a subset of SQS servers, can return empty responses, and tends to increase cost, but it does nothing to stop a poison message from failing repeatedly. Option D is wrong because increasing the message retention period only keeps messages in the queue longer; it does not remove or isolate poison messages, so they will continue to fail and block useful retries.

Practice this question →

37

MCQmedium

An orders service publishes payment instructions to an Amazon SQS queue. After occasional processing timeouts, the downstream consumer sometimes processes the same instruction twice, resulting in duplicate payment attempts. The team currently uses an SQS Standard queue with a visibility timeout of 2 minutes and relies on the consumer to finish before the timeout expires. What approach best improves resilience against duplicate processing?

A.Decrease visibility timeout to 10 seconds so duplicates are less likely to occur.

B.Make the consumer idempotent using the order ID as a deduplication key, and set the visibility timeout longer than the worst-case processing time.

C.Use an EventBridge rule with a fixed retry policy that only retries when the payload matches exactly.

D.Enable a dead-letter queue (DLQ) only, without changing the queue type or consumer logic.

AnswerB

SQS Standard provides at-least-once delivery, so duplicates can still occur. The most resilient design is to make the payment handler idempotent so repeated deliveries do not create duplicate side effects, and to set the visibility timeout long enough to cover the worst-case processing time to reduce unnecessary re-delivery.

Why this answer

Option B is correct because making the consumer idempotent using the order ID as a deduplication key ensures that even if the same message is processed multiple times, the downstream system will only apply the payment once. Setting the visibility timeout longer than the worst-case processing time keeps the message from becoming visible again before the consumer finishes, which removes the most common cause of redelivery. Because a Standard queue can still deliver a message more than once, idempotency remains the safeguard against duplicate payments.
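The effect of the visibility timeout can be shown with a small calculation. In this simplified model a message is redelivered whenever processing takes longer than the visibility timeout; the processing times and the function name `count_redeliveries` are assumptions for illustration.

```python
def count_redeliveries(processing_seconds, visibility_timeout_seconds):
    """Count messages that become visible again before processing finishes."""
    return sum(1 for seconds in processing_seconds
               if seconds > visibility_timeout_seconds)


# Hypothetical processing times for three payment instructions, in seconds.
samples = [30, 90, 150]

with_10s_timeout = count_redeliveries(samples, 10)     # Option A's setting
with_120s_timeout = count_redeliveries(samples, 120)   # the current setting
with_180s_timeout = count_redeliveries(samples, 180)   # above worst case
```

Shortening the timeout to 10 seconds makes every message in this sample reappear mid-processing, the current 2 minutes still lets the slowest one through, and only a timeout above the worst case avoids timeout-driven redelivery.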

Exam trap

The trap here is that candidates often think reducing the visibility timeout or adding a DLQ alone solves duplicates, but they overlook that Standard queues inherently allow at-least-once delivery, so idempotency is the only reliable solution.

How to eliminate wrong answers

Option A is wrong because decreasing the visibility timeout to 10 seconds would increase the likelihood of duplicates by making the message reappear sooner if the consumer takes longer than 10 seconds, exacerbating the timeout issue. Option C is wrong because an EventBridge rule with a fixed retry policy does not address duplicate processing; EventBridge is an event bus service, not a queue, and its retry policy cannot prevent duplicate delivery from SQS. Option D is wrong because enabling only a DLQ without changing the queue type or consumer logic does not prevent duplicates; a DLQ captures failed messages but does not make the consumer idempotent or adjust the visibility timeout to avoid reprocessing.

Practice this question →

38

MCQmedium

An orders system sends payment instructions to an Amazon SQS queue. The consumer sometimes times out after it has already created the payment record but before it deletes the SQS message. As a result, the same instruction can be processed more than once. Which design best ensures the consumer remains resilient and does not create duplicate payments when the same instruction is delivered multiple times?

A.Assume the consumer will always delete the SQS message in the same execution path, and ignore the timeout case.

B.Use idempotency: store a deterministic payment request identifier in a DynamoDB table and only create a payment when a conditional write indicates it was not processed before.

C.Switch to SQS Standard because it provides exactly-once delivery, so duplicates cannot happen.

D.Increase the consumer timeout and reduce the number of retries so that duplicates rarely occur.

AnswerB

Idempotency based on a stable identifier prevents duplicates by making processing repeatable and safely detectable.

Why this answer

Option B is correct because it implements idempotency using a DynamoDB table with a conditional write. By storing a deterministic payment request identifier (e.g., a hash of the message body) and only creating the payment if the conditional write succeeds (i.e., the identifier does not already exist), the consumer can safely process the same SQS message multiple times without creating duplicate payments. This pattern ensures resilience against the at-least-once delivery semantics of SQS and consumer timeouts that prevent message deletion.
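A minimal sketch of the pattern follows, with an in-memory class standing in for the DynamoDB table. `PaymentsTable.put_if_absent` emulates a conditional put that succeeds only when the key does not yet exist (in DynamoDB this would typically be a `PutItem` with an `attribute_not_exists` condition). All class, function, and field names are hypothetical.

```python
import hashlib


class ConditionalCheckFailed(Exception):
    """Raised when the key already exists, like a failed conditional write."""


class PaymentsTable:
    """In-memory stand-in for a table that supports conditional puts."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, item):
        if key in self._items:
            raise ConditionalCheckFailed(key)
        self._items[key] = item


def request_id(message_body):
    """Deterministic identifier: the same instruction always maps to one key."""
    return hashlib.sha256(message_body.encode("utf-8")).hexdigest()


def handle(table, message_body, charge):
    key = request_id(message_body)
    try:
        table.put_if_absent(key, {"body": message_body})
    except ConditionalCheckFailed:
        return "duplicate-skipped"  # already processed; safe to delete message
    charge(message_body)
    return "charged"


table = PaymentsTable()
charges = []
instruction = '{"order_id": "o-1001", "amount": 25}'

first_result = handle(table, instruction, charges.append)
second_result = handle(table, instruction, charges.append)  # redelivery
```

The redelivered message is recognised and skipped, so only one charge is recorded. This sketch ignores the case where `charge` fails after the key is written; a production design would also need to track and recover from that state.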

Exam trap

The trap here is assuming that the queue itself can guarantee that a payment is processed only once. SQS Standard queues provide at-least-once delivery, and even FIFO queues deduplicate sent messages only within a 5-minute deduplication interval; a message whose consumer times out before deleting it is still delivered again. The reliable solution is to make the consumer itself idempotent.

How to eliminate wrong answers

Option A is wrong because ignoring the timeout case violates the principle of designing for failure; SQS guarantees at-least-once delivery, and timeouts are a real-world occurrence that must be handled explicitly. Option C is wrong because SQS Standard does not provide exactly-once delivery; it offers at-least-once delivery, and duplicates can still occur due to network retries or consumer failures. Option D is wrong because increasing the consumer timeout and reducing retries only reduces the probability of duplicates but does not eliminate them, and it does not address the fundamental issue of at-least-once delivery semantics.

Practice this question →

39

Multi-Selectmedium

A customer portal must recover from a regional outage within a few hours. The business wants lower ongoing cost than a fully active second Region and does not want to rebuild everything from scratch during the outage. Which two DR patterns best fit that goal? Select two.

Select 2 answers

A.Backup and restore

B.Pilot light

C.Warm standby

D.Multi-site active-active

E.Single-AZ deployment

AnswersB, C

Pilot light keeps only core components running in the secondary Region, which lowers cost while reducing recovery time.

Why this answer

Pilot light is correct because it maintains a minimal core infrastructure (e.g., database, networking) in the secondary Region that can be quickly scaled up during a disaster, meeting the recovery time objective (RTO) of a few hours while keeping ongoing costs lower than a fully active second Region. It avoids rebuilding everything from scratch by having critical data and configurations already in place, allowing compute resources to be launched on demand. Warm standby is also correct: it keeps a scaled-down but running copy of the full stack in the secondary Region, so recovery is mainly a matter of scaling up, at a cost that is still below a fully active second Region.
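The elimination logic for this question can be written as a filter over two qualitative properties. The table below is an illustrative encoding of how the four DR patterns are described in these questions, and the names `Strategy` and `fits_requirements` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    pre_provisioned_in_secondary: bool  # something is already in place
    full_capacity_always_on: bool       # pays for a full second stack


STRATEGIES = [
    Strategy("Backup and restore", False, False),
    Strategy("Pilot light", True, False),
    Strategy("Warm standby", True, False),
    Strategy("Multi-site active-active", True, True),
]


def fits_requirements(strategies):
    """Keep strategies that avoid a full rebuild and a full second stack."""
    return [s.name for s in strategies
            if s.pre_provisioned_in_secondary and not s.full_capacity_always_on]


matches = fits_requirements(STRATEGIES)
```

Backup and restore is removed by the "do not rebuild from scratch" constraint, and active-active by the cost constraint, leaving pilot light and warm standby.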

Exam trap

AWS often tests the distinctions between DR patterns. The trap here is that candidates may confuse pilot light with backup and restore, not realizing that pilot light maintains a live, minimal environment (e.g., database replicas) rather than just backup files, enabling faster recovery without a full rebuild.

Practice this question →

40

Multi-Selectmedium

A fintech company needs a disaster recovery design for a web application in two Regions. The business requires an RPO of 15 minutes and an RTO under 2 hours, but it cannot afford to keep a full production stack running in both Regions all the time. Which two DR strategies best fit the requirement? Select two.

Select 2 answers

A.Pilot light with critical data and minimal services pre-staged in the secondary Region.

B.Warm standby with a scaled-down but running environment in the secondary Region.

C.Active-active deployment with full production capacity in both Regions.

D.Backup-and-restore only, with no pre-provisioned resources in the secondary Region.

E.Single-Region deployment with Multi-AZ only, because that already covers disaster recovery.

AnswersA, B

Correct because pilot light keeps a small but ready foundation in the recovery Region, which lowers cost while still allowing much faster recovery than restoring everything from scratch. It is a common fit when the business can accept a short recovery window and controlled failover steps.

Why this answer

A pilot light strategy is correct because it pre-stages only critical data (e.g., continuously replicated databases) and minimal core services in the secondary Region, which can be scaled up to full production within the RTO of under 2 hours. The RPO of 15 minutes is achievable with continuous asynchronous cross-Region replication (e.g., an Amazon RDS cross-Region read replica or DynamoDB global tables), which keeps data loss minimal. Warm standby meets the same objectives with a scaled-down but running environment that only needs to be scaled up. Both approaches avoid the cost of a full production stack while meeting the recovery objectives.

Exam trap

The trap here is that candidates often confuse warm standby with active-active, assuming any running environment in the secondary Region must be at full capacity, but warm standby allows a scaled-down environment that can be scaled up within the RTO, meeting cost constraints.

Practice this question →

41

MCQeasy

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The architecture review board prefers a managed AWS-native control.

A.IAM Access Analyzer

B.AWS Backup Vault Lock

C.CloudFront caching with appropriate TTLs

D.S3 Select

AnswerC

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caching with appropriate TTLs ensures that even if the S3 origin becomes temporarily unavailable, CloudFront can serve cached content from its edge locations to users. With TTLs suited to the content (longer for static assets), objects remain in the edge caches, and if the origin cannot be reached when a cached object expires, CloudFront can serve the stale copy instead of an error, maintaining availability. This is a managed AWS-native feature that requires no additional infrastructure and aligns with the architecture review board's preference.

Exam trap

The trap here is that candidates might confuse CloudFront's caching with other AWS services like AWS Global Accelerator or Route 53 health checks, or incorrectly assume that S3's built-in redundancy alone handles origin outages, overlooking CloudFront's ability to serve stale cached content during origin failures.

How to eliminate wrong answers

Option A is wrong because IAM Access Analyzer is a tool for analyzing resource-based policies to identify unintended public or cross-account access, not for providing caching or origin resilience. Option B is wrong because AWS Backup Vault Lock is a feature to enforce retention policies on backups and prevent deletion, unrelated to serving cached content during an origin outage. Option D is wrong because S3 Select is a feature to retrieve subsets of data from objects using SQL-like queries, not a caching mechanism or origin failover solution.

Practice this question →

42

MCQeasy

A company wants a disaster recovery setup for a web application. They need relatively quick recovery, but they can't afford running full production in the secondary location at all times. Which option best matches this requirement?

A.Pilot light: keep only essential infrastructure in the secondary location and scale up the application during a failure.

B.Warm standby: run a minimal but functional version of the application and supporting services in the secondary location, and scale up during a failure.

C.Active-active: run full production in both the primary and secondary locations at the same time.

D.Backup and restore only: rely on periodic backups and restore the application after a failure.

AnswerB

Warm standby balances cost and recovery time by keeping some capacity running in the secondary environment (for example, smaller Auto Scaling capacity for the app tier and replication for the data tier). When the primary fails, you fail over and scale out quickly.

Why this answer

Warm standby (Option B) is correct because it runs a minimal but functional version of the application in the secondary location, allowing for faster recovery than a pilot light while avoiding the cost of full production. During a failure, you scale up the standby environment to handle production traffic, meeting the requirement for relatively quick recovery without the expense of active-active.

Exam trap

The trap here is confusing 'pilot light' with 'warm standby' — candidates often think pilot light is faster because it sounds minimal, but warm standby actually provides quicker recovery by having the application already deployed and ready to scale.

How to eliminate wrong answers

Option A is wrong because a pilot light keeps only essential infrastructure (e.g., database, core services) without running the application, requiring more time to deploy and scale the application layer during a failure, which does not meet the 'relatively quick recovery' requirement. Option C is wrong because active-active runs full production in both locations at all times, which contradicts the requirement that the company cannot afford to run full production in the secondary location. Option D is wrong because backup and restore relies on periodic backups and manual restoration, which gives the slowest recovery (hours to days) and does not provide the relatively quick recovery needed.

Practice this question →

43

MCQeasy

A production Amazon RDS database has automated backups enabled. At 10:45 UTC, an issue is discovered. The team needs to restore the database to its state as of 10:30 UTC. Which capability should they use?

A.Point-in-time restore (PITR) using automated backups to a specific timestamp.

B.Perform a Multi-AZ manual failover of the standby to recover to the earlier timestamp.

C.Promote a cross-region replication target to replace the current database with the last-known good copy.

D.Switch to a read replica to access an older view of data without restoring.

AnswerA

PITR restores an RDS DB instance to a chosen moment within the retention period for automated backups, allowing the team to roll back to 10:30 UTC.

Why this answer

Amazon RDS automated backups enable point-in-time recovery (PITR) to any second within the backup retention period, restoring to a new DB instance. Since the issue was discovered at 10:45 UTC and the desired recovery point is 10:30 UTC, PITR can restore the database to that exact timestamp, provided it falls within the backup retention period and is not later than the latest restorable time (typically within the last five minutes).
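As a sketch, the helper below validates the requested timestamp against the retention period and builds the parameters that would typically be passed to the RDS `RestoreDBInstanceToPointInTime` operation (for example boto3's `restore_db_instance_to_point_in_time`). No AWS call is made; the instance identifiers, the date, the 7-day retention, and the helper name are assumptions, and the check ignores the few minutes by which the latest restorable time trails the present.

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_id, target_id, restore_time, now, retention_days):
    """Validate the restore time and return request parameters for a PITR."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware (use UTC)")
    earliest = now - timedelta(days=retention_days)
    if not (earliest <= restore_time <= now):
        raise ValueError("restore_time is outside the retention period")
    return {
        "SourceDBInstanceIdentifier": source_id,
        # PITR always creates a new instance; production is left untouched.
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Hypothetical date; the times of day come from the question.
now = datetime(2024, 1, 15, 10, 45, tzinfo=timezone.utc)
restore_time = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)

request = build_pitr_request("prod-db", "prod-db-restored-1030",
                             restore_time, now, retention_days=7)
```

Because the restore targets a new instance, the team can validate the data before repointing the application.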

Exam trap

The trap here is confusing Multi-AZ failover or read replicas with point-in-time recovery capabilities, leading candidates to think failover or replica promotion can roll back to a specific past state when they only provide high availability or read scaling.

How to eliminate wrong answers

Option B is wrong because Multi-AZ failover switches to a standby that is synchronously replicated from the primary, so the corrupted data is already present on it; failover provides high availability, not a way to roll back to an earlier point in time. Option C is wrong because cross-Region replication (e.g., using a read replica in another Region) replicates data asynchronously and cannot be used to restore to a specific past timestamp; promoting it would give you a copy from a lagged point, not necessarily 10:30 UTC. Option D is wrong because a read replica provides a live, near-real-time copy of the primary database and does not retain historical snapshots or allow accessing an older view of data without a full restore.

Practice this question →

44

Multi-Selectmedium

A solutions architect is designing a highly available and resilient architecture for a critical internal application that processes financial transactions. The application runs on Amazon EC2 instances inside an Auto Scaling group. The database layer uses an Amazon Aurora MySQL cluster. The company requires that if an entire AWS Availability Zone (AZ) fails, the application must remain operational with minimal impact and automatically recover without manual intervention. Which combination of architectural decisions will meet these requirements? (Choose four.)

Select 4 answers

A.Configure the Auto Scaling group to span at least three Availability Zones in the same AWS Region.

B.Deploy the Aurora cluster with a single DB instance to reduce complexity and cost.

C.Configure the Aurora cluster to include at least one Aurora Replica in a different Availability Zone than the primary instance.

D.Use an Application Load Balancer (ALB) to distribute traffic across EC2 instances in multiple Availability Zones.

E.Place the EC2 instances in a single Availability Zone to ensure data locality with the primary database.

F.Set up an Amazon RDS Proxy to manage database connections and provide connection pooling for improved resilience.

Why this answer

Configuring the Auto Scaling group to span at least three Availability Zones ensures that if one AZ fails, the remaining AZs can carry the load (provided capacity is sized for the loss of one AZ), and the Auto Scaling group can automatically launch new instances in the healthy AZs. Deploying the Aurora cluster with at least one Aurora Replica in a different AZ than the primary instance provides automatic failover to that replica, typically within about 30 seconds, ensuring database resilience without manual intervention. Using an Application Load Balancer (ALB) to distribute traffic across EC2 instances in multiple AZs allows the ALB to automatically route traffic away from failed AZs and only to healthy targets, maintaining application availability.

Setting up an Amazon RDS Proxy manages database connections by pooling and reusing them, which reduces the load on the database during failover and improves resilience by providing seamless connection handling across AZ failures.
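The capacity side of surviving an AZ failure can be worked out with simple arithmetic. The sketch below assumes instances are spread evenly across AZs and that the workload needs a fixed number of instances to carry full load; the function names and the figure of 6 instances are illustrative.

```python
import math


def instances_per_az(required_after_failure, az_count):
    """Instances each AZ must run so that losing one AZ still leaves enough."""
    if az_count < 2:
        raise ValueError("at least two AZs are needed to survive an AZ failure")
    return math.ceil(required_after_failure / (az_count - 1))


def total_instances(required_after_failure, az_count):
    return instances_per_az(required_after_failure, az_count) * az_count


# Suppose 6 instances are needed to carry full load (an assumed figure).
two_az_total = total_instances(6, az_count=2)    # 6 per AZ
three_az_total = total_instances(6, az_count=3)  # 3 per AZ
```

Spreading across three AZs needs 9 instances instead of 12 to tolerate the same failure, which is one reason a three-AZ layout is often preferred when the Region offers it.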

Exam trap

The trap here is that candidates often think a single Aurora instance with multi-AZ storage is sufficient, but without an Aurora Replica in a different AZ there is no failover target, so recovery depends on a new DB instance being created and takes considerably longer; similarly, they may assume that placing all EC2 instances in one AZ simplifies data locality, but this sacrifices availability for a false sense of performance optimization.

Practice this question →

45

MCQmedium

Your public API is hosted in two regions. You want Route 53 to automatically send traffic to the secondary region when the primary region’s endpoint fails. The primary API health check is returning failure codes, but clients still reach the primary region for several minutes. Which Route 53 configuration most directly addresses this behavior?

A.Use a single Alias A record with simple routing and a short TTL so Route 53 quickly changes the IP address.

B.Use Route 53 failover routing with a primary record and a secondary record, each associated with its own health check, so Route 53 answers with the healthy region.

C.Use weighted routing to send a small percentage of traffic to the secondary region, increasing it manually when the primary fails.

D.Use latency routing only, letting Route 53 choose the lowest-latency region at query time, without health checks.

AnswerB

Failover routing is designed for this: Route 53 evaluates health checks and returns the primary record while it is healthy. When the primary health check fails, Route 53 automatically returns the secondary record. Clients may still reach the primary region for a short time because resolvers cache the previous answer until its TTL expires, but failover routing is the configuration that enables automatic region switching.

Why this answer

Option B is correct because Route 53 failover routing with health checks on both primary and secondary records ensures that when the primary health check fails, Route 53 stops returning the primary record's IP and instead returns the secondary record's IP. This directly addresses the observed behavior where clients still reach the primary region for several minutes—likely because the primary record's health check was not configured or associated, or a simple routing policy was used without health check integration, causing stale DNS responses to be served until TTL expires.

Exam trap

The trap here is that candidates assume a short TTL alone (Option A) is sufficient for fast failover. A short TTL only limits how long resolvers cache an answer; without health checks, Route 53 has no mechanism to detect endpoint failure and keeps returning the primary record until someone updates it manually.

How to eliminate wrong answers

Option A is wrong because simple routing with a short TTL does not incorporate health checks; Route 53 continues to return the primary record even if the endpoint is unhealthy, so clients keep reaching the failing region until the record is manually updated and cached answers expire. Option C is wrong because weighted routing as described requires manual intervention to adjust weights when the primary fails, which does not provide automatic failover and can still send clients to the unhealthy primary region. Option D is wrong because latency routing without health checks continues to return the primary region whenever it has the lowest latency, even when the primary endpoint is returning failure codes.
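
The interaction between failover routing and DNS caching can be sketched with a toy model. This is a minimal illustration in plain Python, not the Route 53 API: the class names, hostnames, and the 60-second TTL are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FailoverRecordSet:
    """Toy model of a primary/secondary failover pair (not the Route 53 API)."""
    primary: str
    secondary: str
    ttl_seconds: int

    def answer(self, primary_healthy: bool) -> str:
        # Failover routing: return the primary while it is healthy,
        # otherwise return the secondary.
        return self.primary if primary_healthy else self.secondary


class CachingResolver:
    """Toy recursive resolver that caches an answer until its TTL expires."""

    def __init__(self, records: FailoverRecordSet):
        self.records = records
        self.cached = None
        self.expires_at = 0.0

    def resolve(self, now: float, primary_healthy: bool) -> str:
        # Only ask the authoritative side again once the cached answer has expired.
        if self.cached is None or now >= self.expires_at:
            self.cached = self.records.answer(primary_healthy)
            self.expires_at = now + self.records.ttl_seconds
        return self.cached


# Hypothetical hostnames and TTL, for illustration only.
records = FailoverRecordSet("primary.example.com", "secondary.example.com", ttl_seconds=60)
resolver = CachingResolver(records)

first = resolver.resolve(now=0, primary_healthy=True)      # answer cached until t=60
during = resolver.resolve(now=30, primary_healthy=False)   # cached primary is still served
after = resolver.resolve(now=61, primary_healthy=False)    # cache expired, secondary returned
```

In this model the client keeps receiving the primary hostname at t=30 even though the primary is already unhealthy, which mirrors the few minutes of residual traffic described in the question.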

46

MCQeasy

Based on the exhibit, some SQS messages fail validation repeatedly and continue consuming worker time. What change best prevents the bad messages from being retried forever?

A.Increase the visibility timeout so each message has more time to finish processing.

B.Configure a dead-letter queue and a redrive policy for messages that exceed the retry limit.

C.Replace the queue with an Amazon SNS topic so failed messages will not be retried.

D.Increase the number of workers so the queue drains faster during peak load.

AnswerB

A dead-letter queue captures messages that fail repeatedly after a defined receive count. The main queue can keep processing healthy messages, while the poison messages are isolated for later inspection and remediation.

Why this answer

A dead-letter queue (DLQ) with a redrive policy moves a message to a separate queue once its receive count exceeds the policy's maxReceiveCount, where it can be analyzed or handled manually. This prevents the same invalid message from being processed by workers over and over, freeing up compute resources and avoiding infinite retry loops.

Exam trap

The trap here is that candidates may think increasing the visibility timeout or adding more workers will solve the retry problem, but neither addresses the root cause of a message that will always fail validation.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout only gives workers more time to process a message before it becomes visible again; it does not stop a failing message from being retried indefinitely. Option C is wrong because Amazon SNS is a pub/sub service that pushes each message to its subscribers; it does not give the workers a durable queue to poll, and replacing the queue does nothing to isolate messages that fail validation. Option D is wrong because adding more workers only increases throughput for valid messages but does not prevent the same invalid message from being retried forever; the bad message will still consume worker time on every retry.
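
As a rough sketch of what the redrive policy looks like in practice, the helper below builds the queue attribute that attaches a dead-letter queue. The function name, the ARN, and the retry limit of 5 are hypothetical; the attribute name `RedrivePolicy` and its `deadLetterTargetArn` and `maxReceiveCount` fields are the ones SQS uses.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the queue attributes that attach a dead-letter queue to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS expects the redrive policy as a JSON string under the RedrivePolicy attribute.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN and retry limit, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 5)

# With boto3 (not executed here), the attributes could be applied as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```

Choosing a small receive limit keeps poison messages from consuming many worker cycles, while a limit that is too low risks sidelining messages that failed only because of a transient error.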

47

MCQmedium

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The architecture review board prefers a managed AWS-native control.

A.AWS WAF

B.Amazon CloudFront

C.Amazon SQS queue

D.Amazon Route 53 weighted routing

AnswerC

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a decoupling buffer between the web tier and the fulfilment workers. It can absorb sudden bursts of orders by storing messages durably, and workers can poll the queue at their own pace, retrying failed processing without losing any requests. This aligns with the requirement for a managed AWS-native service that handles spikes and retries.

Exam trap

The trap here is that candidates may confuse buffering and decoupling with services like CloudFront (caching) or Route 53 (traffic routing), failing to recognize that Amazon SQS is the only option listed that provides asynchronous message queuing with retry support.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules, not a message queue for buffering and retrying requests. Option B is wrong because Amazon CloudFront is a content delivery network (CDN) that caches and accelerates static/dynamic content delivery, not a service for decoupling and buffering asynchronous workloads. Option D is wrong because Amazon Route 53 weighted routing is a DNS routing policy for distributing traffic across endpoints, not a message queuing or buffering service.
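
The buffering behavior can be illustrated with a small simulation. This is a toy model using an in-memory deque rather than SQS itself; the function name, the burst size, and the worker capacity are assumptions chosen for the example.

```python
from collections import deque


def simulate(arrivals_per_tick, worker_capacity):
    """Return (processed_total, max_backlog) for a toy queue between two tiers."""
    queue = deque()
    processed = 0
    max_backlog = 0
    next_id = 0
    for arriving in arrivals_per_tick:
        # The web tier enqueues every order immediately, regardless of worker speed.
        for _ in range(arriving):
            queue.append(next_id)
            next_id += 1
        # Workers pull at most worker_capacity messages per tick.
        for _ in range(min(worker_capacity, len(queue))):
            queue.popleft()
            processed += 1
        max_backlog = max(max_backlog, len(queue))
    # Anything still queued after the last tick is drained later, not lost.
    processed += len(queue)
    return processed, max_backlog


# A burst of 10 orders arrives at once; workers handle 3 per tick.
processed, max_backlog = simulate([10, 0, 0, 0, 0], worker_capacity=3)
```

The burst produces a temporary backlog, but every order is eventually processed because the queue holds the requests until the workers catch up.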

48

MCQeasy

A team runs an Amazon RDS for MySQL database in a single Availability Zone. They want automatic failover with minimal downtime if the primary database instance becomes unavailable. Automated backups are already enabled. Which configuration change best meets the requirement?

A.Keep the deployment as single-AZ, but increase automated backup retention to 35 days.

B.Create a read replica in another Availability Zone, but keep Multi-AZ disabled.

C.Enable RDS Multi-AZ so AWS maintains a standby in another Availability Zone for automatic failover.

D.Rely on restoring from the most recent manual snapshot after an outage.

AnswerC

RDS Multi-AZ creates a standby instance in a different AZ and replicates data to it. If the primary becomes unavailable, AWS performs an automatic failover, promoting the standby and maintaining high availability with minimal application disruption.

Why this answer

Option C is correct because enabling Multi-AZ on Amazon RDS for MySQL automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary instance fails, Amazon RDS automatically fails over to the standby, typically within 60–120 seconds, minimizing downtime without manual intervention. This meets the requirement for automatic failover with minimal downtime.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ (which is for high availability and automatic failover), leading them to incorrectly choose Option B.

How to eliminate wrong answers

Option A is wrong because increasing automated backup retention to 35 days only extends the point-in-time recovery window; it does not provide automatic failover or a standby instance. Option B is wrong because a read replica in another AZ is asynchronous and does not support automatic failover; it requires manual promotion to become the primary, which incurs downtime. Option D is wrong because restoring from a manual snapshot is a manual process that can take significant time (minutes to hours depending on size) and does not provide automatic failover.
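
A minimal sketch of the configuration change, assuming boto3 is used to call the RDS `modify_db_instance` operation. The helper name and the instance identifier are hypothetical; the code only builds the request parameters and does not call AWS.

```python
def multi_az_request(db_instance_id: str, apply_immediately: bool = False) -> dict:
    """Keyword arguments for converting an existing DB instance to Multi-AZ."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MultiAZ": True,
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


params = multi_az_request("orders-db")  # hypothetical instance name

# With boto3 (not executed here):
#   boto3.client("rds").modify_db_instance(**params)
```

Because the application keeps using the same DB endpoint after a failover, no connection-string change is implied by this modification.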

49

Multi-Selectmedium

A media company stores daily financial exports in Amazon S3. The files must be protected against accidental overwrite or deletion, and the business also wants a second copy in another Region for recovery after a regional outage. Which two actions should the architect take? Select two.

Select 2 answers

A.Enable bucket versioning on the S3 bucket.

B.Turn on S3 Transfer Acceleration for the bucket.

C.Use only lifecycle policies to move objects to Glacier.

D.Configure replication to a bucket in a second AWS Region.

E.Enable S3 Block Public Access on the bucket.

AnswersA, D

Versioning preserves prior object versions so accidental deletes and overwrites can be recovered later.

Why this answer

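Option A is correct because enabling S3 Versioning protects objects from accidental overwrite or deletion by preserving previous versions. When a file is overwritten or deleted, the original version is retained, allowing recovery. This directly addresses the requirement to guard against data loss from user or application errors. Option D is correct because replicating to a bucket in a second AWS Region (S3 Cross-Region Replication, which requires versioning on both buckets) provides the second copy needed for recovery after a regional outage.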

Exam trap

The trap here is that candidates may confuse S3 Transfer Acceleration or lifecycle policies with data protection features, overlooking that versioning and replication are the S3 features that address recovery from accidental deletion and cross-Region recovery, respectively.
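
The two settings can be sketched as the configuration documents that would be passed to the S3 API. The helper names, role ARN, and bucket names are hypothetical, and the rule shown (replicate every object, do not replicate delete markers) is one possible choice rather than the required one.

```python
def versioning_config() -> dict:
    """Versioning must be enabled on both the source and destination buckets."""
    return {"Status": "Enabled"}


def replication_config(role_arn: str, destination_bucket_arn: str) -> dict:
    """A single rule that replicates all objects to the destination bucket."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": "replicate-all",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical ARNs, for illustration only.
config = replication_config(
    "arn:aws:iam::123456789012:role/s3-replication-role",
    "arn:aws:s3:::exports-replica-us-west-2",
)

# With boto3 (not executed here):
#   s3 = boto3.client("s3")
#   s3.put_bucket_versioning(Bucket="exports", VersioningConfiguration=versioning_config())
#   s3.put_bucket_replication(Bucket="exports", ReplicationConfiguration=config)
```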

50

Matchingmedium

A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.

Concepts

Matches

Stop sending requests to unhealthy targets and keep only healthy instances in rotation.

Launch replacement instances in healthy AZs when capacity is lost.

Maintain a synchronous standby in another AZ and fail over automatically.

Allow instances to be replaced without losing user sessions that are stored elsewhere.

Why these pairings

These pairs match architecture elements with their resilience behaviors for surviving an Availability Zone failure, focusing on AWS services that provide high availability and fault tolerance.

51

Multi-Selectmedium

An order service must notify inventory, shipping, and analytics independently when payment succeeds. The shipping service may be slow, but the order service should keep accepting new orders even if one consumer is unavailable. Which two changes best improve resilience? Select two.

Select 2 answers

A.Publish the event to an Amazon SNS topic and subscribe a separate SQS queue for each downstream service.

B.Have the order service call all downstream services synchronously so failures are visible immediately.

C.Use one shared SQS queue for all three consumers so they always process the same message.

D.Store the event in a relational database and poll it from every consumer on a fixed schedule.

E.Configure a dead-letter queue on each consumer queue to isolate poison messages.

AnswersA, E

SNS fan-out with separate SQS queues decouples the producer from each consumer. Every downstream service gets its own buffered queue, so a slow or unavailable consumer does not block the others or the order service.

Why this answer

Option A is correct because publishing the event to an SNS topic allows the order service to emit a single notification that is then fanned out to multiple SQS queues, one per downstream service. This decouples the order service from the consumers, so even if the shipping service is slow or unavailable, the order service can continue accepting new orders without blocking. Each SQS queue provides independent buffering and retry logic, ensuring resilience against individual consumer failures. Option E is correct because a dead-letter queue on each consumer queue moves messages that repeatedly fail processing out of the way, so one poison message cannot stall that consumer's queue.

Exam trap

The trap here is that candidates often confuse a single shared queue (Option C) with a fan-out pattern. Consumers of one SQS queue compete for messages, so each message is processed by only one of them; every downstream service needs its own queue to receive and process every event independently.
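
The fan-out pattern can be illustrated with a toy model in which each subscriber gets its own queue. The `Topic` class is a stand-in written for this example, not the SNS or SQS API.

```python
from collections import deque


class Topic:
    """Toy stand-in for an SNS topic with one SQS-like queue per subscriber."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name: str) -> deque:
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event: dict) -> None:
        # Fan-out: every subscriber queue receives its own copy of the event.
        for queue in self.queues.values():
            queue.append(dict(event))


topic = Topic()
inventory = topic.subscribe("inventory")
shipping = topic.subscribe("shipping")
analytics = topic.subscribe("analytics")

# The order service publishes and moves on; it never waits for a consumer.
for order_id in range(3):
    topic.publish({"order_id": order_id, "status": "paid"})

# Inventory and analytics drain their queues; shipping is slow and has processed nothing yet.
inventory_done = [inventory.popleft()["order_id"] for _ in range(len(inventory))]
analytics_done = [analytics.popleft()["order_id"] for _ in range(len(analytics))]
shipping_backlog = len(shipping)
```

The shipping backlog grows without affecting the other two consumers or the publisher, which is the isolation the question asks for.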

52

MCQeasy

A company needs an Amazon RDS database that automatically fails over to a standby when the primary DB instance becomes unavailable. Which approach best meets the requirement with minimal operational effort?

A.Keep the DB as a single-AZ instance and implement a manual process to promote a standby when needed.

B.Deploy the DB as a Multi-AZ DB instance so AWS maintains a synchronous standby in another Availability Zone and performs automated failover.

C.Enable versioned backups only, and restore the database each time the primary instance becomes unavailable.

D.Replicate the database to another region and switch clients to the secondary region using manual DNS changes.

AnswerB

RDS Multi-AZ provisions a synchronous standby in a different Availability Zone within the same AWS Region. When the primary DB instance is unavailable, AWS performs automated failover to the standby, reducing downtime without custom scripts.

Why this answer

Option B is correct because Amazon RDS Multi-AZ automatically provisions and maintains a synchronous standby replica in a different Availability Zone. When the primary DB instance fails, AWS handles the automatic failover to the standby with zero manual intervention, meeting the requirement with minimal operational effort.

Exam trap

The trap here is that candidates often confuse Multi-AZ (synchronous replication, automatic failover) with Multi-Region (asynchronous replication, manual or automated cross-region failover) or assume that backups alone can provide high availability, but backups do not offer automatic failover or minimal downtime.

How to eliminate wrong answers

Option A is wrong because a single-AZ instance has no standby, and a manual process to promote a standby would require creating a new instance from a snapshot or read replica, which incurs significant downtime and operational overhead. Option C is wrong because versioned backups alone do not provide a standby; restoring from a backup can take minutes to hours, resulting in unacceptable downtime and data loss. Option D is wrong because cross-region replication requires manual DNS changes to redirect traffic, introduces higher latency, and involves more operational complexity than a Multi-AZ deployment within a single region.

53

MCQmedium

A SaaS platform serves an API using two regional deployments: us-east-1 (primary) and us-west-2 (secondary). Each region has its own ALB. The business requires automated DNS-based failover when the primary region becomes unhealthy, and they do not want manual DNS changes during incidents. Which Route 53 configuration is the best match?

A.Create a single Route 53 record using weighted routing across both ALBs with weights adjusted manually during an incident.

B.Use Route 53 failover routing with a primary record pointing to the us-east-1 ALB and a secondary record pointing to the us-west-2 ALB, each using health checks.

C.Use latency-based routing so Route 53 always selects the fastest region; health checks are unnecessary because client latency reflects availability.

D.Use a single A record with a static IP address that points to a NAT gateway, and update that IP during failure events.

AnswerB

Failover routing with health checks enables automatic switching of DNS responses when the primary endpoint fails health evaluation.

Why this answer

Route 53 failover routing is designed specifically for active-passive failover scenarios where you have a primary and secondary resource. By associating health checks with each record, Route 53 automatically detects when the primary ALB in us-east-1 becomes unhealthy and routes traffic to the secondary ALB in us-west-2 without manual intervention. This meets the requirement for automated DNS-based failover without manual DNS changes.

Exam trap

The trap here is that candidates may confuse latency-based routing with failover routing, assuming that lowest latency implies health. Without health checks, latency routing does not consider endpoint health and keeps sending traffic to an unhealthy region if it is still the lowest-latency choice.

How to eliminate wrong answers

Option A is wrong because weighted routing requires manual adjustment of weights during an incident, which violates the requirement for automated failover without manual DNS changes. Option C is wrong because latency-based routing selects the region with the lowest latency for each user, based on latency measurements between users and AWS Regions rather than on endpoint availability; without health checks it provides no failover when a region becomes unhealthy. Option D is wrong because using a static IP pointing to a NAT gateway is not a scalable or resilient approach for an API served by ALBs, and updating the IP during failure events requires manual intervention, which contradicts the automation requirement.
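
A sketch of the failover record pair, expressed as the change batch that the Route 53 `change_resource_record_sets` operation accepts. The helper names, domain, ALB DNS names, hosted zone IDs, and health check IDs are all placeholders invented for the example; the code only builds the request and does not call AWS.

```python
def failover_alias_record(name, role, alb_dns_name, alb_zone_id, health_check_id=None):
    """Build one failover alias record that points at an ALB."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,   # the ALB's hosted zone, not your own zone
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record


def change_batch(records):
    """Wrap records in UPSERT changes for change_resource_record_sets."""
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records]}


# All identifiers below are placeholders.
batch = change_batch([
    failover_alias_record("api.example.com", "PRIMARY",
                          "primary-alb.us-east-1.elb.amazonaws.com", "ZPLACEHOLDEREAST",
                          health_check_id="hc-primary-placeholder"),
    failover_alias_record("api.example.com", "SECONDARY",
                          "secondary-alb.us-west-2.elb.amazonaws.com", "ZPLACEHOLDERWEST",
                          health_check_id="hc-secondary-placeholder"),
])

# With boto3 (not executed here):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId=my_zone_id, ChangeBatch=batch)
```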

54

MCQhard

Based on the exhibit, DNS still sends traffic to the primary Region even though Route 53 health checks show the primary endpoint is unhealthy. What is the best change to make failover work as intended?

A.Change both records to weighted routing with a 50/50 split so Route 53 can shift traffic gradually.

B.Use a failover routing policy with a primary record and a secondary record, and attach the health check to the primary record.

C.Switch to latency-based routing so users are always directed to the lowest-latency Region.

D.Use geolocation routing so clients in one Region are sent to the healthier endpoint.

AnswerB

Failover routing is designed for active-passive DNS behavior. With a primary and secondary record, Route 53 answers with the primary record when it is healthy and returns the secondary record when the primary health check fails. The exhibit shows simple routing, which does not express the failover intent. Switching to failover routing aligns the DNS policy with the stated requirement.

Why this answer

Option B is correct because a failover routing policy with a health check attached to the primary record is the only configuration that allows Route 53 to automatically stop sending traffic to an unhealthy primary endpoint and redirect it to the secondary endpoint. Without the health check attached to the primary record, Route 53 has no mechanism to detect the failure and will continue routing traffic to the primary Region, even if the health check status shows unhealthy.

Exam trap

The trap here is that candidates assume Route 53 automatically uses health check status to influence routing regardless of how records are configured. In reality, a health check only affects DNS answers when it is associated with a record (or target health evaluation is enabled on an alias record) and the routing policy, such as failover, has another record to return instead.

How to eliminate wrong answers

Option A is wrong because weighted routing distributes traffic based on weights, not failover; it does not automatically shift all traffic away from an unhealthy endpoint, and a 50/50 split would still send half the traffic to the unhealthy primary. Option C is wrong because latency-based routing directs users to the endpoint with the lowest latency, not based on health; it does not provide automatic failover when a health check fails. Option D is wrong because geolocation routing directs traffic based on the geographic location of the user, not on endpoint health; it cannot automatically reroute traffic away from an unhealthy primary endpoint.

55

Drag & Dropmedium

Order the steps to restore an Amazon RDS DB instance from a snapshot.

Steps

Order

Why this order

Select the snapshot, restore it to a new DB instance, configure the new instance, redirect the application to the new endpoint, and then delete the old instance.

56

MCQhard

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used?

A.Instance store volumes

B.Amazon EFS with mount targets in multiple Availability Zones

C.An EBS volume attached to all instances

D.S3 mounted as a POSIX file system without a file gateway

AnswerB

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a scalable, fully managed NFS file system that can be mounted concurrently on multiple Linux EC2 instances. By creating mount targets in multiple Availability Zones, the file system remains accessible even if one AZ fails, ensuring high availability and shared file storage across instances.

Exam trap

The trap here is that candidates may confuse EBS multi-attach (which is limited to specific instance types and does not span AZs) with the true multi-AZ shared file system capability of EFS.

How to eliminate wrong answers

Option A is wrong because instance store volumes are ephemeral and tied to a single EC2 instance; they cannot be shared across instances or survive an AZ failure. Option C is wrong because an EBS volume exists in a single Availability Zone and can normally be attached to only one EC2 instance at a time; Multi-Attach is limited to specific volume and instance types within that AZ and does not provide a shared file system. Option D is wrong because S3 is object storage; mounting it as a file system without a file gateway (e.g., using s3fs-fuse) does not provide full POSIX semantics such as file locking, so it is not a suitable replacement for shared file storage.
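
As a sketch of the multi-AZ setup, the helper below builds one mount target request per Availability Zone for the EFS `create_mount_target` operation. The function name and every identifier are placeholders; the code does not call AWS.

```python
def mount_target_requests(file_system_id, subnet_by_az, security_group_id):
    """Build one create_mount_target request per Availability Zone."""
    if len(subnet_by_az) < 2:
        raise ValueError("use subnets in at least two Availability Zones")
    return [
        {
            "FileSystemId": file_system_id,
            "SubnetId": subnet_id,
            "SecurityGroups": [security_group_id],
        }
        for _az, subnet_id in sorted(subnet_by_az.items())
    ]


# Placeholder IDs, for illustration only.
mount_requests = mount_target_requests(
    "fs-0123456789abcdef0",
    {"us-east-1a": "subnet-aaa", "us-east-1b": "subnet-bbb", "us-east-1c": "subnet-ccc"},
    "sg-0123456789abcdef0",
)

# With boto3 (not executed here):
#   efs = boto3.client("efs")
#   for request in mount_requests:
#       efs.create_mount_target(**request)
```

Instances in each AZ would then mount the file system through the mount target in their own AZ, so losing one AZ does not remove access for instances elsewhere.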

57

MCQmedium

A web app runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). The ALB is configured with health checks and the ASG spans three subnets in three Availability Zones. During an AZ outage, monitoring shows the number of healthy instances drops sharply and never returns to the original capacity until the ASG is manually adjusted. What change most directly improves resilience so capacity returns automatically during an AZ failure?

A.Reduce the ASG desired capacity by 1 and rely on the ALB to route traffic to fewer instances during the outage.

B.Configure the ASG to use the ALB target-group health checks (ELB/target-group health) and ensure the ASG has at least two subnets in different Availability Zones that remain available for instance placement.

C.Move the ALB to only one subnet so health checks and routing remain consistent during the outage.

D.Add an S3 event trigger to terminate unhealthy instances so the ASG can scale back out using its scheduled actions.

AnswerB

If the AZ outage prevents the ALB from reaching targets, instance-level (EC2) health checks may still consider instances “healthy” because the instances are running. When the ASG is configured to use ALB/target-group health (ASG health check type set to ELB and tied to the target group), the ASG can detect application-level unreachability and replace unhealthy instances. With multiple eligible subnets across different AZs, the ASG can launch replacement instances in the remaining AZs and automatically return to the configured desired capacity.

Why this answer

Option B is correct because configuring the ASG to use ALB target-group health checks (ELB health checks) ensures that the ASG replaces instances that fail the ALB's health checks, including those in an impaired AZ. By also ensuring the ASG has at least two subnets in different AZs that remain available, the ASG can launch replacement instances in the healthy AZs when one AZ fails, automatically restoring capacity without manual intervention.

Exam trap

The trap here is that candidates assume EC2 status checks are sufficient for AZ failure detection, but they fail to recognize that an instance in a failed AZ may still pass EC2 status checks while being unreachable via the network, so only ALB target-group health checks trigger the ASG to replace them.

How to eliminate wrong answers

Option A is wrong because reducing the desired capacity does not address the root cause; the ASG will not automatically replace instances in the failed AZ, and the ALB simply routes traffic to fewer instances, leaving the capacity deficit permanent. Option C is wrong because moving the ALB to only one subnet creates a single point of failure, defeating the purpose of multi-AZ resilience and potentially causing the ALB itself to become unavailable during an AZ outage. Option D is wrong because S3 event triggers are not designed to terminate unhealthy instances or trigger ASG scaling; scheduled actions are time-based and cannot react to dynamic failures like an AZ outage, and the described mechanism is not a valid AWS pattern for health-based replacement.
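
The change in Option B can be sketched as the parameters for the Auto Scaling `update_auto_scaling_group` operation. The helper name, group name, subnet IDs, and the 300-second grace period are assumptions for the example, and the check on subnet count is only a rough guard: it cannot verify that the subnets are actually in different AZs.

```python
def asg_health_update(group_name, subnet_ids, grace_period_seconds=300):
    """Build parameters that switch an ASG to load balancer health checks."""
    if len(set(subnet_ids)) < 2:
        raise ValueError("provide at least two distinct subnets (ideally one per AZ)")
    return {
        "AutoScalingGroupName": group_name,
        # "ELB" makes the ASG honor the load balancer's target health.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": grace_period_seconds,
        # The API takes the subnet list as a comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


# Placeholder names, for illustration only.
asg_params = asg_health_update("web-asg", ["subnet-aaa", "subnet-bbb", "subnet-ccc"])

# With boto3 (not executed here):
#   boto3.client("autoscaling").update_auto_scaling_group(**asg_params)
```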

58

MCQmedium

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable?

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.EBS snapshots every hour

D.Read replicas only

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL automatically provisions and maintains a synchronous standby replica in a different Availability Zone. In the event of an AZ failure, Amazon RDS automatically fails over to the standby, providing high availability with minimal application changes (the application simply reconnects to the same endpoint). This meets the requirement for availability during an AZ outage without requiring code modifications.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and manual promotion) with Multi-AZ (which provides automatic failover and high availability), leading them to select 'Read replicas only' as a cheaper but incorrect alternative.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is designed for object-level replication across AWS regions, not for database high availability within a region, and it does not provide automatic failover for an RDS MySQL database. Option C is wrong because EBS snapshots every hour provide point-in-time backup and recovery, not automatic failover; restoring from a snapshot requires manual intervention and results in data loss for transactions after the last snapshot. Option D is wrong because read replicas only provide read scaling and asynchronous replication; they do not support automatic failover for write operations, and promoting a read replica to a primary requires manual action and potential data loss.

59

Multi-Selectmedium

A payment worker consumes messages from an Amazon SQS queue. Sometimes the worker finishes the payment creation, but a timeout prevents message deletion and the same payment request is delivered again. Which two design changes best reduce the risk of duplicate charges and keep bad messages from looping forever? Select two.

Select 2 answers

A.Make the payment operation idempotent by storing a unique request identifier before charging.

B.Reduce the visibility timeout so retries happen sooner after each timeout.

C.Move the queue to Amazon SNS so each message is delivered only once.

D.Increase the message retention period so failed payments stay available longer.

E.Configure a dead-letter queue with a redrive policy for messages that exceed the max receive count.

AnswersA, E

Idempotency ensures the same business request cannot create multiple charges if SQS redelivers the message.

Why this answer

Option A is correct because making the payment operation idempotent ensures that even if the same message is processed multiple times due to a timeout, the payment is only charged once. This is typically achieved by storing a unique request identifier (e.g., a UUID or idempotency key) in a database or cache before processing; subsequent duplicate requests with the same identifier are detected and ignored, preventing duplicate charges. Option E is correct because a dead-letter queue with a redrive policy moves a message out of the main queue once it exceeds the max receive count, so a message that can never be processed successfully stops looping.

Exam trap

The trap here is that candidates often think reducing the visibility timeout or switching to SNS will solve duplicates. A shorter visibility timeout makes redelivery more likely, standard SQS queues provide at-least-once delivery, and moving to SNS does not make processing exactly-once; the correct approach is to combine idempotency with a dead-letter queue to handle both duplicate charges and infinite retries.
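
The idempotency idea can be sketched with an in-memory handler. The class, field names, and charge IDs are hypothetical, and the dictionary stands in for a durable store; a production system would make the check-and-record step atomic (for example with a conditional write), which this sketch does not attempt.

```python
class PaymentProcessor:
    """Toy idempotent handler keyed on a unique request identifier."""

    def __init__(self):
        self.results = {}   # request_id -> charge_id (stand-in for a durable store)
        self.charges = []   # every charge actually made

    def handle(self, message: dict) -> str:
        request_id = message["request_id"]
        # A redelivered message finds the earlier result and charges nothing.
        if request_id in self.results:
            return self.results[request_id]
        charge_id = f"charge-{len(self.charges) + 1}"
        self.charges.append((charge_id, message["amount_cents"]))
        self.results[request_id] = charge_id
        return charge_id


processor = PaymentProcessor()
message = {"request_id": "req-123", "amount_cents": 4999}

first_result = processor.handle(message)    # original delivery
second_result = processor.handle(message)   # redelivery after the delete timed out
```

Both deliveries return the same charge identifier and only one charge is recorded, so the worker can safely delete the message on the second attempt.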

60

MCQmedium

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The architecture review board prefers a managed AWS-native control.

A.S3 lifecycle transition to Glacier Flexible Retrieval

B.An EBS snapshot schedule

C.S3 Cross-Region Replication with versioning enabled

D.A CloudFront distribution

AnswerC

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) is a fully managed AWS-native feature that automatically replicates objects from a source S3 bucket in one AWS Region to a destination bucket in another Region, meeting the disaster recovery requirement for a geographically separate copy. Enabling versioning on both buckets is mandatory for CRR to function, as it tracks object versions and ensures consistency during replication.

Exam trap

The trap here is that candidates often confuse S3 Cross-Region Replication with S3 lifecycle policies or Glacier transitions, mistakenly thinking that moving data to a cheaper storage class in the same region satisfies a disaster recovery requirement for geographic separation.

How to eliminate wrong answers

Option A is wrong because S3 lifecycle transition to Glacier Flexible Retrieval only moves data within the same bucket and region to a colder storage class for cost optimization, not to another AWS Region for disaster recovery. Option B is wrong because EBS snapshot schedules are used for backing up Amazon EBS volumes attached to EC2 instances, not for S3 objects, and they do not provide cross-region replication for S3 data. Option D is wrong because CloudFront is a content delivery network (CDN) that caches data at edge locations for low-latency access, not a replication mechanism to copy data to another AWS Region for disaster recovery.

61

Multi-Selecteasy

A developer accidentally corrupts part of a production Amazon RDS database, and the issue is discovered 45 minutes later. The team needs to restore the database to the state immediately before the change. Which two actions should be part of the recovery plan? Select two.

Select 2 answers

A.Enable automated backups with a retention period that covers the recovery window.

B.Perform a point-in-time restore to a new database instance.

C.Convert the database to a single-AZ deployment for faster restores.

D.Delete the corrupted rows manually and continue without restoring.

E.Use a read replica as the only recovery source for all deletions.

AnswersA, B

Point-in-time recovery in RDS depends on automated backups and transaction logs. The retention period must include the time before the corruption occurred, otherwise the desired recovery point will not be available.

Why this answer

Option A is correct because automated backups must be enabled to allow point-in-time recovery (PITR) within the retention window. Since the corruption occurred 45 minutes ago, the retention period must cover at least that duration to restore to the state immediately before the change. Option B is correct because PITR restores the database to a specified time (down to the second) within the backup retention period, creating a new DB instance that reflects the state just before the corruption.

Exam trap

The trap here is that candidates may think a read replica can be used for point-in-time recovery, but it only provides read scaling and asynchronous replication, not a restore point before the corruption occurred.
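
As a rough illustration of the recovery plan above, the sketch below builds the arguments for the RDS `restore_db_instance_to_point_in_time` call and rejects a restore time that falls outside the retention window. The instance names, dates, and 7-day retention period are hypothetical, and the window check is a simplification (the real latest restorable time usually lags the current time by a few minutes).

```python
from datetime import datetime, timedelta, timezone


def build_pitr_request(source_id, target_id, restore_time, now, retention_days):
    """Build kwargs for the RDS restore_db_instance_to_point_in_time call.

    Raises ValueError if restore_time is outside the (simplified) retention window.
    """
    earliest = now - timedelta(days=retention_days)
    if not (earliest <= restore_time <= now):
        raise ValueError("restore time is outside the backup retention window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        # PITR always creates a new instance; the source is left untouched.
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Hypothetical incident: bad change at 10:30 UTC, discovered 45 minutes later.
now = datetime(2024, 1, 15, 11, 15, tzinfo=timezone.utc)
changed_at = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
request = build_pitr_request(
    "orders-prod",            # assumed source instance name
    "orders-prod-restored",   # assumed name for the new instance
    restore_time=changed_at - timedelta(seconds=1),
    now=now,
    retention_days=7,
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**request)
```

Keeping the request builder separate from the API call makes the retention check easy to exercise without touching a real account.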

62

MCQ · medium

Your media processing pipeline writes original uploads to an S3 bucket and later generates derivative files. An operator accidentally deletes a subset of original uploads in production. You need to (1) restore the deleted objects with minimal data loss and (2) protect against both regional disasters and future operator mistakes. The company requires recovery even if objects are deleted and later overwritten. What is the most effective change to meet these requirements?

A. Enable S3 versioning on the bucket and configure cross-Region replication so previous versions are available after regional loss and accidental deletion.

B. Move all objects to S3 Glacier Instant Retrieval and apply a lifecycle policy to keep only the latest object copy.

C. Use S3 server-side encryption with KMS keys and rely on access logs to manually recover the deleted objects.

D. Enable S3 bucket policies that deny DeleteObject, but do not enable versioning or replication.

Answer: A

Versioning retains prior object versions, and cross-Region replication provides redundancy across Regions for recovery after deletion or disaster.

Why this answer

Option A is correct because S3 Versioning preserves every version of an object: an overwrite creates a new version, and a delete adds a delete marker instead of removing data. You restore a deleted object by removing its delete marker, and you recover an overwritten object by retrieving or copying the earlier version. Cross-Region Replication (CRR), which requires versioning on both buckets, copies new object versions to a bucket in a secondary Region, protecting against regional disasters. Together, they ensure recovery even if objects are deleted and later overwritten, meeting all requirements.

Exam trap

The trap here is that candidates may think a bucket policy denying DeleteObject is sufficient to prevent data loss, but it does not protect against overwrites, authorized user mistakes, or regional disasters, and without versioning, deleted objects are permanently lost.

How to eliminate wrong answers

Option B is wrong because moving objects to S3 Glacier Instant Retrieval does not provide versioning or replication; a lifecycle policy that keeps only the latest copy would permanently lose previous versions and deleted objects, failing the recovery requirement. Option C is wrong because S3 server-side encryption with KMS keys does not protect against deletion or overwrite; access logs only record events, they do not restore deleted objects, and manual recovery from logs is impractical and not guaranteed. Option D is wrong because a bucket policy denying DeleteObject can be bypassed by authorized users (e.g., operators with elevated permissions) and does not protect against overwrites or regional disasters; without versioning or replication, deleted objects are unrecoverable.
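
To make the versioning behavior concrete, here is a toy in-memory model of a versioned bucket. It is not the S3 API; the class, method names, and object keys are invented for illustration. It only mirrors the semantics described above: overwrites add versions, deletes add a marker, and removing the marker brings the object back.

```python
class VersionedBucket:
    """Toy in-memory model of S3 versioning semantics (not the S3 API)."""

    DELETE_MARKER = object()

    def __init__(self):
        self._versions = {}  # key -> list of versions, oldest first

    def put(self, key, body):
        # An overwrite appends a new version; older versions are kept.
        self._versions.setdefault(key, []).append(body)

    def delete(self, key):
        # A delete adds a marker on top of the version stack.
        self._versions.setdefault(key, []).append(self.DELETE_MARKER)

    def get(self, key):
        versions = self._versions.get(key, [])
        if not versions or versions[-1] is self.DELETE_MARKER:
            raise KeyError(key)
        return versions[-1]

    def remove_delete_marker(self, key):
        versions = self._versions.get(key, [])
        if versions and versions[-1] is self.DELETE_MARKER:
            versions.pop()

    def history(self, key):
        # All stored data versions, oldest first, ignoring delete markers.
        return [v for v in self._versions.get(key, []) if v is not self.DELETE_MARKER]


bucket = VersionedBucket()
bucket.put("uploads/a.mov", b"original")
bucket.put("uploads/a.mov", b"overwritten")
bucket.delete("uploads/a.mov")                # object now appears deleted
bucket.remove_delete_marker("uploads/a.mov")  # latest version is visible again
```

After the last call, both the overwritten and the original versions are still retrievable from the history, which is the property the question's "deleted and later overwritten" requirement depends on.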

63

Matching · medium

Match the disaster recovery strategy to the recovery posture it best fits for a Regional outage.

Concepts

Backup and restore, pilot light, warm standby, multi-site active/active

Matches

Backup and restore: Lowest cost option where the environment is rebuilt from backups and hours of downtime are acceptable.

Pilot light: Keep only the critical core running in the secondary Region, then scale out after failover.

Warm standby: Run a scaled-down but functional environment in another Region for faster cutover.

Multi-site active/active: Serve production traffic from more than one Region at the same time for the fastest recovery.

Why these pairings

The pairs match disaster recovery strategies to their typical recovery posture for a regional outage, based on AWS Well-Architected Framework and common DR patterns.

64

MCQ · hard

A patient portal must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.

A. Instance store volumes

B. Amazon EFS with mount targets in multiple Availability Zones

C. An EBS volume attached to all instances

D. S3 mounted as a POSIX file system without a file gateway

Answer: B

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, scalable, and shared file system that can be mounted concurrently on multiple Linux EC2 instances across different Availability Zones. By creating mount targets in each AZ, the file system remains accessible even if one AZ fails, meeting the high availability requirement without custom scripts.

Exam trap

The trap here is that candidates may confuse EBS Multi-Attach (limited to Provisioned IOPS io1/io2 volumes on Nitro-based instances within a single AZ) with a true cross-AZ shared file system, or assume S3 with a FUSE mount provides POSIX compliance without operational overhead.

How to eliminate wrong answers

Option A is wrong because instance store volumes are ephemeral and tied to a single EC2 instance; data is lost when the instance stops or terminates, making them unsuitable for shared, durable storage across AZs. Option C is wrong because an EBS volume lives in one AZ and normally attaches to one EC2 instance at a time; Multi-Attach is the exception, but it is limited to Provisioned IOPS io1/io2 volumes on Nitro-based instances in the same AZ and is not a cross-AZ shared file system. Option D is wrong because mounting S3 as a POSIX file system (e.g., via s3fs) requires custom setup, does not provide native POSIX consistency or locking, and introduces performance and reliability issues for shared file storage.

65

MCQ · easy

Based on the exhibit, the database must fail over automatically if the primary Availability Zone goes down. Which solution should the architect choose?

A. Create a read replica in the same Availability Zone as the primary database.

B. Convert the database to a Multi-AZ RDS deployment.

C. Increase the backup retention period to 35 days.

D. Move the database to an EC2 instance with an attached EBS volume.

Answer: B

A Multi-AZ RDS deployment keeps a synchronous standby in another Availability Zone and automatically fails over when the primary fails. This matches the requirement for minimal manual intervention and preserves the same database endpoint, so the application does not need connection string changes. It is the standard AWS choice for resilient relational databases.

Why this answer

A Multi-AZ RDS deployment synchronously replicates data to a standby instance in a different Availability Zone. If the primary AZ fails, Amazon RDS automatically fails over to the standby, ensuring high availability without manual intervention. This meets the requirement for automatic failover when the primary AZ goes down.

Exam trap

The trap here is that candidates often confuse read replicas (which are for read scaling and require manual promotion) with Multi-AZ deployments (which provide automatic failover), leading them to incorrectly select Option A.

How to eliminate wrong answers

Option A is wrong because a read replica in the same AZ does not provide automatic failover; it is designed for read scaling and requires manual promotion to become a primary. Option C is wrong because increasing the backup retention period to 35 days only affects point-in-time recovery duration, not failover capability. Option D is wrong because an EC2 instance with an attached EBS volume requires custom scripting or additional services (e.g., Auto Scaling, Elastic IP reassignment) to achieve automatic failover, and does not provide the managed, synchronous replication of Multi-AZ RDS.
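
Converting an existing instance to Multi-AZ is a single setting change. The sketch below builds the arguments for the RDS `modify_db_instance` call; the instance name is hypothetical, and whether to apply the change immediately or in the next maintenance window is a choice left to the team.

```python
def build_multi_az_request(db_instance_id, apply_immediately=False):
    """Build kwargs for the RDS modify_db_instance call that enables Multi-AZ."""
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MultiAZ": True,
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }


request = build_multi_az_request("portal-db", apply_immediately=True)  # assumed name
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("rds").modify_db_instance(**request)
```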

66

MCQ · medium

A global application experiences frequent writes and must survive a full Regional outage with near-zero data loss. The product team also requires that users can continue to write during the incident using the closest Region. Which approach is most aligned with these requirements?

A. Use an active/active design with multi-Region data replication (for example, global tables for the write-heavy datastore) and route traffic to multiple Regions based on health and latency.

B. Use warm standby with periodic backups of the primary write datastore every 24 hours.

C. Use pilot light where the secondary Region runs only infrastructure templates and starts data replication only after detecting failure.

D. Use a single-writer model in one Region and deploy read-only replicas in the other Region for continuity.

Answer: A

Active/active supports writing in multiple Regions and reduces the blast radius of a Regional failure while enabling continued operations.

Why this answer

Option A is correct because an active/active design with multi-Region data replication, such as DynamoDB global tables, accepts writes in any replica Region and replicates them asynchronously to the other Regions, typically within about a second. This gives near-zero data loss (an RPO of seconds) and continuous write availability during a full Regional outage, while Route 53 latency-based routing with health checks directs users to the closest healthy Region.

Exam trap

The trap here is that candidates often confuse 'read-only replicas' (which cannot accept writes) with 'multi-Region write replicas' (which can), leading them to choose Option D despite its inability to support writes during an outage.

How to eliminate wrong answers

Option B is wrong because warm standby with 24-hour periodic backups cannot achieve near-zero data loss; the RPO would be up to 24 hours, and writes would stop during failover. Option C is wrong because replication that starts only after a failure is detected cannot copy data out of a Region that is already down, so recent writes are lost and writes stay unavailable until the secondary Region is built out. Option D is wrong because a single-writer model with read-only replicas prevents writes during a Regional outage, violating the requirement that users continue to write during the incident.

67

MCQ · easy

An organization hosts the same public API in two AWS Regions. Normal traffic should go to the primary Region. If the primary endpoint becomes unhealthy, Route 53 should automatically route users to the secondary Region. What is the best Route 53 configuration approach?

A. Use simple routing with one record that contains both regions as weighted targets.

B. Use weighted routing and set the secondary Region weight to 0 until needed.

C. Use Route 53 failover routing with health checks that mark the primary as unhealthy and fail over to the secondary.

D. Use latency-based routing so requests go to the region with the lowest latency, regardless of health.

Answer: C

Failover routing is designed for active/passive disaster recovery. You configure a primary record and a secondary record, each associated with health checks. When the primary fails its health checks, Route 53 automatically resolves the name to the secondary target.

Why this answer

Route 53 failover routing is designed for active-passive configurations where traffic is directed to a primary resource unless a health check marks it as unhealthy, at which point all traffic automatically shifts to the secondary resource. This directly matches the requirement of routing normal traffic to the primary Region and failing over to the secondary Region only when the primary endpoint becomes unhealthy.

Exam trap

The trap here is that candidates often confuse weighted routing with failover routing, mistakenly thinking that setting a weight of 0 on the secondary is a valid way to keep it inactive until needed, but Route 53 does not automatically adjust weights based on health checks.

How to eliminate wrong answers

Option A is wrong because simple routing does not support health checks or automatic failover; it returns all values in the record in random order, which cannot enforce a primary-secondary failover pattern. Option B is wrong because a weight of 0 keeps traffic away from the secondary, and with no health checks attached to the records Route 53 has no way to detect that the primary is down; someone would have to change the weights by hand, which defeats the purpose of automatic failover. Option D is wrong because latency-based routing selects the Region with the lowest latency for each user, which does not guarantee that the primary Region handles normal traffic and, as described in the option, ignores endpoint health.
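
A failover pair is expressed as two record sets with the same name, one marked `PRIMARY` and one marked `SECONDARY`. The sketch below builds such a change batch for the Route 53 `change_resource_record_sets` call. The domain name, IP addresses (taken from documentation ranges), and health check ID are placeholders, not values from the question.

```python
def failover_record(name, role, target_ip, health_check_id=None, ttl=60):
    """Build one ResourceRecordSet dict for Route 53 failover routing."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{name}-{role.lower()}",  # must be unique per record set
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id is not None:
        # The primary needs a health check so Route 53 can detect failure.
        record["HealthCheckId"] = health_check_id
    return record


def build_change_batch(records):
    """Wrap record sets in an UPSERT change batch."""
    return {"Changes": [{"Action": "UPSERT", "ResourceRecordSet": r} for r in records]}


batch = build_change_batch([
    failover_record("api.example.com.", "PRIMARY", "192.0.2.10", "hc-primary-placeholder"),
    failover_record("api.example.com.", "SECONDARY", "198.51.100.10"),
])
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId=hosted_zone_id, ChangeBatch=batch)
```

A short TTL is used here on the assumption that clients should pick up the secondary quickly; the right value depends on the application.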

68

MCQ · medium

A claims workflow uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The team wants the control to be enforceable during normal operations.

A. S3 Cross-Region Replication

B. Multi-AZ deployment for the RDS DB instance

C. EBS snapshots every hour

D. Read replicas only

Answer: B

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL provides synchronous standby replication across two Availability Zones. In the event of an AZ failure, Amazon RDS automatically fails over to the standby in the other AZ, ensuring availability with minimal application changes (the same database endpoint is used). This meets the requirement for enforceability during normal operations because Multi-AZ is always active, not a manual or scheduled process.

Exam trap

The trap here is that candidates often confuse read replicas (which only handle read traffic and require manual promotion) with Multi-AZ (which provides automatic failover for both reads and writes), or they assume EBS snapshots provide high availability rather than just backup.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage in S3, not for RDS MySQL databases, and it does not provide automatic failover for a relational database. Option C is wrong because hourly EBS snapshots are backups, not a failover mechanism; recovery requires a manual restore and can lose up to an hour of data. Option D is wrong because read replicas only support read traffic and do not provide automatic failover for write operations; they require manual promotion and application changes to redirect writes.

69

MCQ · medium

Based on the exhibit, the web application must remain available even if one Availability Zone fails. What is the best change to improve resilience with the least redesign?

A. Increase DesiredCapacity to 4 while keeping all instances in subnet-a1.

B. Add subnet-b1 in a different Availability Zone to the Auto Scaling group.

C. Replace the Application Load Balancer with a Network Load Balancer.

D. Enable EBS encryption on the launch template volumes.

Answer: B

This spreads EC2 instances across two Availability Zones, so the Auto Scaling group can continue serving traffic if one AZ becomes unavailable. Because the ALB is already deployed in both subnets, this is the smallest change that adds true zonal resilience to the compute tier.

Why this answer

Adding subnet-b1 in a different Availability Zone to the Auto Scaling group ensures that EC2 instances are launched across two Availability Zones. If one zone fails, the ALB can route traffic to healthy instances in the other zone, maintaining application availability. This change requires minimal redesign because it only modifies the Auto Scaling group's subnet configuration without altering the load balancer or compute architecture.

Exam trap

The trap here is that candidates may think increasing instance count or changing load balancer type improves resilience, but without multi-AZ distribution, a single AZ failure still causes a total outage.

How to eliminate wrong answers

Option A is wrong because increasing DesiredCapacity to 4 while keeping all instances in subnet-a1 does not provide multi-AZ resilience; a single Availability Zone failure would still take all instances offline. Option C is wrong because replacing the Application Load Balancer with a Network Load Balancer does not inherently improve resilience against Availability Zone failures; both ALB and NLB support multi-AZ deployments, but the NLB operates at Layer 4 and lacks Layer 7 features like path-based routing, which may be required for the web application. Option D is wrong because enabling EBS encryption on the launch template volumes protects data at rest but does not affect availability or resilience against an Availability Zone failure.

70

MCQ · hard

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable?

A. Use CloudFront signed URLs

B. Use Amazon SQS standard queue and design consumers to be idempotent

C. Use UDP messages sent directly to workers

D. Use an in-memory queue on one EC2 instance

Answer: B

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery, meaning each message is delivered at least once but can occasionally be delivered more than once. This matches the requirement to process every event at least once, and since duplicate processing is acceptable when consumers are idempotent, the standard queue is the most suitable and cost-effective choice. SQS also decouples the warehouse integration service from its consumers, improving resilience and scalability.

Exam trap

The trap here is that candidates may confuse 'at-least-once' with 'exactly-once' and incorrectly choose FIFO queues or other options, but the question explicitly accepts duplicates if idempotency is handled, making the standard queue the correct and simpler choice.

How to eliminate wrong answers

Option A is wrong because CloudFront signed URLs are used to control access to content delivered via CloudFront, not for event processing or message queuing; they provide no delivery guarantee mechanism. Option C is wrong because UDP is a connectionless, unreliable transport protocol that does not guarantee message delivery, order, or duplicate prevention, making it unsuitable for at-least-once processing. Option D is wrong because an in-memory queue on a single EC2 instance creates a single point of failure and lacks durability; if the instance fails, all queued events are lost, violating the requirement to process every event at least once.
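
One common way to make a consumer idempotent is to remember which message IDs have already been handled and skip repeats. The sketch below is a minimal in-memory version with invented names; a real consumer would need a durable, shared store for the seen IDs, which is assumed rather than shown.

```python
class IdempotentConsumer:
    """Skips messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # stand-in for a durable store in a real system

    def process(self, message_id, body):
        """Return True if the message was handled, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after the handler succeeds, so failures are retried.
        self._seen.add(message_id)
        return True


received = []
consumer = IdempotentConsumer(received.append)
results = [
    consumer.process(message_id, body)
    for message_id, body in [
        ("m-1", "pallet 17 received"),
        ("m-2", "pallet 18 received"),
        ("m-1", "pallet 17 received"),  # duplicate delivery of m-1
    ]
]
```

In this run the third delivery is recognized as a duplicate, so the handler sees each event exactly once even though the queue delivered one of them twice.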

71

MCQ · easy

A team needs a relational database solution that can automatically fail over to a standby instance if the primary database becomes unavailable. They want the standby to be located in a different Availability Zone. Which RDS/Aurora configuration best satisfies this requirement?

A. Single-AZ DB deployment and rely on manual snapshot restore during failures.

B. Multi-AZ deployment with an automatically managed standby in a different Availability Zone and automatic failover.

C. Enable read replicas only, and promote a replica manually when the primary fails.

D. Enable point-in-time recovery (PITR) without configuring any Multi-AZ standby.

Answer: B

RDS/Aurora Multi-AZ deployments maintain a standby instance in a separate AZ. When configured for Multi-AZ, RDS/Aurora can perform automatic failover to the standby, meeting both the “different AZ” and “automatic failover” requirements.

Why this answer

Option B is correct because a Multi-AZ RDS deployment automatically provisions and maintains a standby instance in a different Availability Zone, and the failover is handled automatically by AWS without manual intervention. This meets the requirement for automatic failover to a standby in a different AZ, which is the core purpose of Multi-AZ deployments.

Exam trap

The trap here is that candidates often confuse read replicas with Multi-AZ standby, thinking that promoting a read replica provides automatic failover, but read replicas require manual promotion and do not serve as a synchronous standby.

How to eliminate wrong answers

Option A is wrong because a Single-AZ deployment has no standby instance, and manual snapshot restore requires significant downtime and manual steps, failing the automatic failover requirement. Option C is wrong because read replicas are designed for read scaling, not automatic failover; promoting a read replica manually introduces downtime and does not provide automatic failover to a standby. Option D is wrong because point-in-time recovery (PITR) only enables restoring to a specific time from backups, not automatic failover to a standby instance in a different AZ.

72

MCQ · medium

A trading dashboard runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.

A. A single EC2 instance with detailed monitoring

B. Subnets in at least two Availability Zones with health checks enabled

C. All instances in one larger subnet

D. A Network Load Balancer in one subnet

Answer: B

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option B is correct because distributing EC2 instances across at least two Availability Zones (AZs) ensures that the application remains available if one AZ fails. The Auto Scaling group must include subnets in multiple AZs and use health checks (e.g., ELB health checks) to automatically replace unhealthy instances. This configuration meets the requirement for fault tolerance and aligns with AWS-managed best practices for high availability.

Exam trap

The trap here is that candidates often confuse 'scaling' with 'resilience' and think that a single large subnet or a different load balancer type (NLB) provides AZ fault tolerance, but only multi-AZ subnet configuration with health checks ensures automatic recovery from an AZ failure.

How to eliminate wrong answers

Option A is wrong because a single EC2 instance, even with detailed monitoring, cannot tolerate the failure of an Availability Zone; it represents a single point of failure. Option C is wrong because placing all instances in one larger subnet confines them to a single Availability Zone, which does not provide AZ-level fault tolerance. Option D is wrong because a Network Load Balancer in one subnet is itself confined to a single AZ and does nothing to distribute instances across AZs; changing the load balancer type does not add AZ fault tolerance.
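
The multi-AZ requirement can be checked before the group is ever created. The sketch below builds a subset of the arguments for the Auto Scaling `create_auto_scaling_group` call and refuses subnets that all sit in one AZ. The subnet IDs, AZ names, target group ARN, and grace period are hypothetical, and the launch template argument that a real call also needs is omitted.

```python
def build_asg_request(name, subnet_to_az, min_size, max_size, target_group_arn):
    """Build partial kwargs for create_auto_scaling_group, enforcing multi-AZ subnets."""
    if len(set(subnet_to_az.values())) < 2:
        raise ValueError("subnets must span at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # Comma-separated subnet IDs; the AZs are implied by the subnets.
        "VPCZoneIdentifier": ",".join(sorted(subnet_to_az)),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's health checks, not only EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 120,  # assumed warm-up time in seconds
    }


request = build_asg_request(
    "dashboard-asg",
    {"subnet-a1": "us-east-1a", "subnet-b1": "us-east-1b"},
    min_size=2,
    max_size=6,
    target_group_arn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/dashboard/0123456789abcdef",
)
```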

73

MCQ · medium

An event-driven order processing service consumes messages from an Amazon SQS Standard queue. After a deployment, about 1% of messages start failing validation because a required field is missing. The consumer catches the exception and returns control, so the messages are retried. However, those poison messages keep reappearing and repeatedly consuming processing time for hours, delaying handling of valid messages. What is the most resilient way to handle the poison messages while keeping the system available?

A. Set the consumer visibility timeout to a very large value so failing messages are hidden for hours.

B. Configure an SQS redrive policy to send messages to a dead-letter queue (DLQ) after a limited number of receives (maxReceiveCount).

C. Switch the SQS queue from Standard to FIFO so poison messages do not retry.

D. Increase the consumer concurrency indefinitely so the system processes all messages even if some fail validation.

Answer: B

A DLQ redrive policy creates a deterministic stop condition for poison messages. After maxReceiveCount, the messages are moved to the DLQ instead of cycling in the main queue, preventing repeated failed deliveries from degrading capacity and availability for valid messages.

Why this answer

Option B is correct because configuring an SQS redrive policy with a maxReceiveCount (e.g., 3–5) automatically moves messages that repeatedly fail processing to a dead-letter queue (DLQ) after the specified number of receives. This isolates the poison messages so they stop cycling through the source queue and tying up consumer capacity, while valid messages are handled without delay. The DLQ can then be analyzed or reprocessed offline, maintaining system availability.

Exam trap

The trap here is that candidates may think increasing visibility timeout or concurrency solves the problem, but they fail to recognize that only a dead-letter queue permanently isolates poison messages from the processing pipeline.

How to eliminate wrong answers

Option A is wrong because a very large visibility timeout only hides failing messages for hours; they reappear when the timeout expires, so the retry cycle continues, and the long timeout also delays legitimate retries of valid messages. Option C is wrong because FIFO queues still redeliver messages that are not deleted and need a DLQ for poison handling just as Standard queues do; in addition, an existing Standard queue cannot be converted to FIFO, and FIFO queues have lower throughput. Option D is wrong because increasing consumer concurrency indefinitely does not address the root cause: poison messages are still retried and still consume processing slots, potentially overwhelming the system and delaying valid messages further.
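
The effect of a redrive policy can be seen in a toy simulation. The code below is not SQS; it is a simplified in-memory queue with invented function names and sample orders, in which a message that fails on its `max_receive_count`-th receive is set aside rather than retried. (Real SQS moves the message on the receive after the count is exceeded, but the number of failed processing attempts is the same.)

```python
from collections import deque


def drain(messages, handler, max_receive_count):
    """Toy model of a redrive policy. Returns (processed, dead_lettered)."""
    queue = deque((body, 0) for body in messages)
    processed, dead_lettered = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            processed.append(body)
        except ValueError:
            if receives >= max_receive_count:
                dead_lettered.append(body)      # stop retrying the poison message
            else:
                queue.append((body, receives))  # becomes visible again for retry
    return processed, dead_lettered


def validate(order):
    """Hypothetical validation step that rejects orders without a 'sku' field."""
    if "sku" not in order:
        raise ValueError("missing required field: sku")


orders = [{"sku": "A1"}, {"qty": 2}, {"sku": "B7"}]
processed, dead_lettered = drain(orders, validate, max_receive_count=3)
```

Without the `max_receive_count` limit the loop would never terminate for the invalid order, which mirrors the hours of repeated retries described in the question.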

74

MCQ · hard

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.

A. A FIFO queue without a redrive policy

B. Short polling instead of long polling

C. A dead-letter queue with an appropriate maxReceiveCount

D. A larger message retention period only

Answer: C

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

A dead-letter queue (DLQ) with an appropriate maxReceiveCount allows messages that repeatedly fail processing to be moved out of the source queue after a specified number of receive attempts. This prevents poison messages from blocking useful retries and is a fully managed AWS-native pattern. The architecture review board's preference for a managed solution is satisfied because SQS DLQs are a built-in feature requiring no custom code.

Exam trap

The trap here is that candidates may confuse a DLQ with simply increasing retention or changing polling behavior, not realizing that poison messages require explicit isolation via a separate queue and a maxReceiveCount threshold to stop infinite retries.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not automatically handle poison messages; without a DLQ, failed messages remain in the queue and continue to block retries. Option B is wrong because polling mode has no effect on message failure handling; short polling samples only a subset of SQS servers, so it can return more empty responses, but failed messages are still redelivered. Option D is wrong because increasing the message retention period only keeps messages longer without removing failing ones; poison messages would still be retried until they expire, continuing to block useful retries.
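
On the source queue, the DLQ is attached through the `RedrivePolicy` queue attribute, a JSON string that names the dead-letter queue ARN and the `maxReceiveCount`. The sketch below builds that attribute for the SQS `set_queue_attributes` call; the ARN and the count of 5 are hypothetical values.

```python
import json


def build_redrive_attributes(dlq_arn, max_receive_count):
    """Build the Attributes dict that attaches a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(policy)}


attributes = build_redrive_attributes(
    "arn:aws:sqs:us-east-1:123456789012:claims-dlq",  # assumed DLQ ARN
    max_receive_count=5,
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("sqs").set_queue_attributes(
#       QueueUrl=source_queue_url, Attributes=attributes)
```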

75

Multi-Select · medium

An application uses an Amazon RDS Multi-AZ DB instance. During a failover test, connections fail until the application is restarted, even though the database comes back online. Which two changes should the team make to improve resilience during failover? Select two.

Select 2 answers

A. Cache and reconnect to the current writer IP address to avoid DNS lookups during failover.

B. Use the RDS endpoint name instead of hard-coding the current instance IP or hostname in the application.

C. Switch to a read replica and let it promote manually after every outage.

D. Add retry logic with exponential backoff for transient connection and DNS resolution errors.

E. Disable connection pooling so each request opens a fresh socket during normal operation.

Answers: B, D

The RDS endpoint abstracts the underlying writer instance. When failover occurs, AWS updates the endpoint to point at the new writer, so the application should reconnect by using the managed name rather than a fixed IP or hostname.

Why this answer

Option B is correct because the RDS endpoint is a DNS name that automatically resolves to the current writer instance's IP address. During a failover, the DNS record is updated to point to the new primary, so using the endpoint instead of a hard-coded IP or hostname allows the application to reconnect without manual intervention. Option D is correct because adding retry logic with exponential backoff handles transient failures during DNS resolution and connection establishment, which are common during the brief period when the DNS TTL has not yet expired after a failover.

Exam trap

The trap here is that candidates often think caching the IP (Option A) improves performance, but it actually breaks failover resilience because the application never learns the new writer's address after a failover.
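
Retry with exponential backoff can be sketched in a few lines. In the example below, `connect` stands for whatever function opens a connection to the RDS endpoint name; the attempt count, delays, and the use of random jitter are assumptions made for illustration rather than recommended values.

```python
import random
import time


def connect_with_backoff(connect, attempts=5, base_delay=0.5, max_delay=8.0,
                         sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            # connect() should resolve the endpoint name on every call,
            # so a DNS update after failover is picked up.
            return connect()
        except OSError:
            # OSError covers ConnectionError and socket.gaierror (DNS failures).
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter spreads out reconnect storms


# Demonstration with a fake connector that fails twice, then succeeds.
calls = {"count": 0}


def flaky_connect():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("endpoint not reachable yet")
    return "connected"


result = connect_with_backoff(flaky_connect, sleep=lambda seconds: None)
```

The `sleep` parameter is injectable only so the demonstration runs instantly; production code would leave the default in place.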
