CCNA Design Resilient Questions — Page 3 of 4

151

Multi-Select (hard)

A payments API requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.

Select 2 answers

A. Deletion protection or tightly controlled delete permissions

B. Point-in-time recovery

C. Global secondary indexes

D. DAX

Answers: A, B

Deletion protection and least-privilege controls reduce accidental table removal risk.

Why this answer

Deletion protection (option A) prevents accidental deletion of the DynamoDB table itself, which is critical for the accidental-delete protection requirement. Point-in-time recovery (option B) enables restoring the table to any point within the last 35 days, satisfying the point-in-time recovery requirement. Both features are native DynamoDB capabilities that require no custom scripts.
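As a rough sketch of what enabling both settings could look like, the snippet below only builds the keyword arguments that would be handed to the boto3 DynamoDB client methods `update_table` and `update_continuous_backups`. The table name `payments` and the helper name are illustrative assumptions, and nothing is sent to AWS.

```python
def dynamodb_protection_requests(table_name: str) -> dict:
    """Build keyword arguments for the two DynamoDB calls (no AWS call is made)."""
    return {
        # Would be passed as client.update_table(**kwargs): guards the table resource.
        "update_table": {
            "TableName": table_name,
            "DeletionProtectionEnabled": True,
        },
        # Would be passed as client.update_continuous_backups(**kwargs): turns on PITR.
        "update_continuous_backups": {
            "TableName": table_name,
            "PointInTimeRecoverySpecification": {
                "PointInTimeRecoveryEnabled": True,
            },
        },
    }


# "payments" is a hypothetical table name used only for illustration.
requests = dynamodb_protection_requests("payments")
```

Both settings are plain table configuration, which is why no custom operational script is involved.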

Exam trap

The trap here is that candidates may confuse deletion protection (which protects the table resource) with item-level delete prevention, or think that GSIs or DAX provide data durability or recovery features when they do not.

152

MCQ (medium)

A company hosts an internal API behind an Application Load Balancer (ALB) in two AWS Regions. They want Amazon Route 53 to automatically fail over to the secondary Region when the primary Region’s ALB is unhealthy. Health checks for the primary ALB are already configured, but the DNS record currently uses a latency-based routing policy. Which Route 53 configuration most directly provides automatic failover based on health status?

A. Keep latency-based routing, and set the weights so the secondary Region rarely receives traffic unless manual changes are made.

B. Use a Route 53 failover routing policy: configure two alias records for the ALBs where the primary record is marked PRIMARY, the secondary is marked SECONDARY, and each record has an associated health check.

C. Use an alias A record that returns both ALBs simultaneously so clients automatically load balance across Regions during outages.

D. Use geolocation routing to route users to the primary Region and rely on ALB health checks to shift requests between Regions.

Answer: B

Route 53 failover routing uses health checks to determine whether to return the PRIMARY or SECONDARY record. When the primary health check fails, Route 53 automatically switches resolution to the secondary ALB.

Why this answer

Option B is correct because Route 53 failover routing policy is specifically designed to automatically route traffic away from an unhealthy resource to a healthy one. By creating two alias records (one PRIMARY with an associated health check for the primary ALB, and one SECONDARY for the secondary ALB), Route 53 will automatically fail over to the secondary record when the primary health check fails. This directly meets the requirement for automatic failover based on health status, unlike latency-based routing which only optimizes for response time.
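The record pair described above might be expressed roughly as follows. This is a sketch that only builds a `ChangeBatch` dictionary in the shape accepted by the Route 53 `ChangeResourceRecordSets` API; the domain name, ALB DNS names, hosted zone IDs, and health check IDs are placeholder assumptions, not real values.

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id):
    """Return one UPSERT change for a failover alias record (role is PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            # SetIdentifier distinguishes records that share the same name and type.
            "SetIdentifier": f"api-{role.lower()}",
            "Failover": role,
            "HealthCheckId": health_check_id,
            "AliasTarget": {
                "HostedZoneId": alb_zone_id,
                "DNSName": alb_dns,
                "EvaluateTargetHealth": True,
            },
        },
    }


# All identifiers below are hypothetical placeholders.
change_batch = {
    "Changes": [
        failover_record("api.example.internal.", "PRIMARY",
                        "primary-alb.example.com.", "ZPRIMARYEXAMPLE", "hc-primary-example"),
        failover_record("api.example.internal.", "SECONDARY",
                        "secondary-alb.example.com.", "ZSECONDARYEXAMPLE", "hc-secondary-example"),
    ]
}
```

The dictionary would be passed as the `ChangeBatch` argument together with the hosted zone ID of the DNS zone that owns the record.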

Exam trap

The trap here is that candidates often confuse latency-based routing with failover routing. Latency-based routing chooses the Region with the lowest latency for each user; it does not define a primary and a standby, so it is not the most direct way to express health-based failover to a secondary Region.

How to eliminate wrong answers

Option A is wrong because latency-based routing has no weight setting (weights belong to weighted routing), and depending on manual changes to shift traffic contradicts the 'automatic failover' requirement. Option C is wrong because an alias record points to a single AWS resource and cannot return multiple ALBs simultaneously; returning multiple IPs would require a non-alias record with multiple values, which still does not provide health-based failover. Option D is wrong because geolocation routing routes based on user location, not health; ALB health checks only cover targets within a Region and cannot shift requests between Regions, because the DNS answer does not change without health-aware routing such as a failover policy.

153

MCQ (medium)

An orders service publishes payment instructions to an Amazon SQS Standard queue. A downstream consumer sometimes times out and retries the work, causing the consumer to process the same instruction more than once. Operationally, the team must ensure that duplicate processing does not create duplicate charges. The queue type cannot be changed. What is the most resilient application-side approach?

A. Rely on SQS Standard to provide exactly-once delivery for each message, since the consumer uses retries.

B. Implement idempotent processing using a persistent deduplication key (for example, paymentInstructionId) so repeated messages are ignored or safely merged.

C. Increase the queue’s visibility timeout to 24 hours so messages never reappear even if the consumer times out.

D. Delete and recreate the queue with a different name whenever duplicates are detected in production.

Answer: B

Because SQS Standard is at-least-once, the consumer must assume duplicates are possible. Persisting a record keyed by paymentInstructionId (or using a database unique constraint) lets the consumer detect that a given instruction was already processed successfully and safely skip the charge or merge results deterministically.

Why this answer

Option B is correct because implementing idempotent processing with a persistent deduplication key (e.g., paymentInstructionId) ensures that even if SQS Standard delivers the same message multiple times due to consumer timeouts and retries, the downstream logic will detect and ignore or safely merge duplicate charges. This is the most resilient application-side approach as it does not rely on queue configuration changes and works within the constraints of SQS Standard's at-least-once delivery model.
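A minimal sketch of the deduplication-key idea, assuming a relational store with a unique constraint. SQLite in memory stands in for the persistent database, and `process_once` and the `charge` callback are hypothetical names. Note that the transaction covers only the local insert; a real external charge would need its own idempotency token.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create the dedup table; in-memory here, a durable database in practice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (payment_instruction_id TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def process_once(conn, payment_instruction_id, charge) -> bool:
    """Run charge() only the first time this ID is seen; return False for duplicates."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                "INSERT INTO processed (payment_instruction_id) VALUES (?)",
                (payment_instruction_id,),
            )
            charge(payment_instruction_id)
    except sqlite3.IntegrityError:
        return False  # primary key already present: the instruction was handled before
    return True


charges = []
store = open_store()
# The same instruction is delivered three times, as at-least-once delivery allows.
results = [process_once(store, "pi-1001", charges.append) for _ in range(3)]
```

Only the first delivery triggers the charge; the later two hit the primary-key constraint and are skipped.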

Exam trap

The trap here is that candidates often assume SQS Standard can provide exactly-once delivery if retries are handled properly, but the exam tests the understanding that SQS Standard inherently allows duplicates and that idempotency is the only reliable application-side solution.

How to eliminate wrong answers

Option A is wrong because SQS Standard queues provide at-least-once delivery, not exactly-once delivery; retries and timeouts can cause duplicate messages, and relying on exactly-once is a misconception. Option C is wrong because a long visibility timeout only delays redelivery; it does not make a repeated delivery safe, and a message whose consumer timed out would stay invisible for hours before it could be retried. SQS also caps the visibility timeout at 12 hours, so 24 hours is not a valid setting. Option D is wrong because deleting and recreating the queue with a different name is a disruptive, manual, and non-scalable approach that does not address the root cause of duplicate processing and would lose any messages still in the queue.

154

MCQ (medium)

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The architecture review board prefers a managed AWS-native control.

A. Subnets in at least two Availability Zones with health checks enabled

B. All instances in one larger subnet

C. A Network Load Balancer in one subnet

D. A single EC2 instance with detailed monitoring

Answer: A

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option A is correct because an Auto Scaling group configured with subnets in at least two Availability Zones ensures that if one AZ fails, the remaining AZ(s) can continue to serve traffic. Health checks on the EC2 instances allow the Auto Scaling group to detect and replace unhealthy instances, maintaining the desired capacity across the surviving AZs. This aligns with the requirement for a managed AWS-native control to tolerate an AZ failure.
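For illustration, the multi-AZ part of such a group could be captured in a partial set of keyword arguments for the boto3 Auto Scaling call `create_auto_scaling_group`. This is a sketch only: the subnet IDs, group name, and target group ARN are placeholders, and a real call would also need a launch template or equivalent.

```python
def multi_az_asg_request(name, subnet_ids, target_group_arn, min_size=2, max_size=4):
    """Build partial create_auto_scaling_group arguments (no AWS call is made)."""
    if len(subnet_ids) < 2:
        raise ValueError("provide subnets from at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "MinSize": min_size,
        "MaxSize": max_size,
        # One subnet per AZ; the API expects a comma-separated string.
        "VPCZoneIdentifier": ",".join(subnet_ids),
        "TargetGroupARNs": [target_group_arn],
        # Use the load balancer's health checks in addition to EC2 status checks.
        "HealthCheckType": "ELB",
        "HealthCheckGracePeriod": 300,
    }


# Placeholder identifiers; each subnet is assumed to live in a different AZ.
asg_request = multi_az_asg_request(
    "ticket-booking-asg",
    ["subnet-aaaa1111", "subnet-bbbb2222"],
    "arn:aws:elasticloadbalancing:placeholder-target-group",
)
```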

Exam trap

The trap here is that candidates might think a single large subnet or a Network Load Balancer provides AZ resilience, but subnets are AZ-scoped and an NLB is a separate load-balancing component, not an Auto Scaling group configuration setting.

How to eliminate wrong answers

Option B is wrong because a subnet exists in exactly one Availability Zone, so placing all instances in one larger subnet keeps them in a single AZ; a failure of that AZ would take down every instance. Option C is wrong because a Network Load Balancer (NLB) is not a component of an Auto Scaling group configuration; the question asks what the Auto Scaling group should include, and an NLB is a separate resource, not a configuration setting within the group. Option D is wrong because a single EC2 instance, even with detailed monitoring, cannot tolerate the failure of one Availability Zone; if that instance resides in the failed AZ, the application becomes unavailable, and detailed monitoring does not provide redundancy.

155

Multi-Select (easy)

A service processes messages from an Amazon SQS queue. Sometimes the worker finishes the business logic but does not delete the message before the visibility timeout expires, so the message is delivered again. Which two changes improve resilience and reduce the impact of duplicate processing? Select two.

Select 2 answers

A. Make the message handler idempotent.

B. Set the SQS visibility timeout long enough for normal processing to complete.

C. Switch from SQS to Amazon SNS for reliable buffering.

D. Shorten the queue retention period so messages expire quickly.

E. Disable retries in the consumer application.

Answers: A, B

SQS provides at-least-once delivery, so the same message can be seen more than once. An idempotent handler ensures a repeated delivery does not create duplicate records, duplicate payments, or other repeated side effects.

Why this answer

Option A is correct because making the message handler idempotent ensures that even if a message is processed multiple times (due to visibility timeout expiry), the business outcome remains the same. Idempotency is a key design pattern for resilient architectures when using at-least-once delivery systems like SQS. Option B is correct because setting the visibility timeout long enough for normal processing prevents premature redelivery, reducing the chance of duplicate processing in the first place.
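One way to reason about "long enough" is to derive the timeout from observed processing time plus a margin. The helper below is a sketch under that assumption: the safety factor of 2 and the 45-second figure are illustrative choices, not AWS recommendations, while the 12-hour ceiling is the SQS limit. It returns attributes in the string form that the SQS `SetQueueAttributes` API expects.

```python
SQS_MAX_VISIBILITY_TIMEOUT = 43_200  # seconds (12 hours), the SQS upper limit


def visibility_timeout_attributes(p99_processing_seconds: float,
                                  safety_factor: float = 2.0) -> dict:
    """Pick a visibility timeout above normal processing time, within SQS limits."""
    timeout = int(p99_processing_seconds * safety_factor)
    timeout = max(1, min(timeout, SQS_MAX_VISIBILITY_TIMEOUT))
    # SQS queue attribute values are passed as strings.
    return {"VisibilityTimeout": str(timeout)}


# Assumed measurement: 99% of messages finish within 45 seconds.
attributes = visibility_timeout_attributes(45)
```

Even with a well-chosen timeout, the handler still needs to be idempotent, because a worker can crash or stall after the side effect and before the delete.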

Exam trap

The trap here is that candidates often think disabling retries or switching to SNS will solve the duplicate processing issue, but they fail to recognize that SQS's at-least-once delivery model inherently requires idempotent consumers and proper visibility timeout configuration.

156

Multi-Select (medium)

A company runs a production database on Amazon RDS for MySQL with Multi-AZ enabled. The database experiences a sudden increase in read traffic due to a marketing campaign. Which three strategies would help ensure the database remains resilient under heavy read traffic? (Choose three.)

Select 3 answers

A. Create additional read replicas in different Availability Zones to distribute read traffic.

B. Enable Multi-AZ on the read replicas to provide automatic failover for read operations.

C. Use an RDS Proxy between the application and the database to manage connection pooling.

D. Promote a read replica to a standalone DB instance to offload write traffic.

E. Configure the application to use the read replica endpoint for read queries and the primary endpoint for writes.

F. Increase the storage size of the primary DB instance to improve I/O throughput.

Why this answer

Creating additional read replicas in different Availability Zones distributes read traffic across multiple isolated locations, reducing load on the primary instance and improving read scalability. Using RDS Proxy between the application and the database manages connection pooling, which reduces the number of database connections and prevents resource exhaustion under heavy traffic. Configuring the application to use the read replica endpoint for read queries and the primary endpoint for writes offloads read traffic from the primary instance, preserving its capacity for write operations and maintaining write availability.
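The read/write split can be sketched as a small routing function in the application. The endpoint host names below are hypothetical, and the rule is deliberately simple (it routes by the first SQL keyword); reads that must see a just-written row, or that run inside a transaction, would normally stay on the primary because replicas can lag.

```python
READ_KEYWORDS = ("select", "show", "explain")


def pick_endpoint(sql: str, writer_endpoint: str, reader_endpoint: str) -> str:
    """Send plain read statements to the replica endpoint and everything else to the primary."""
    words = sql.strip().split(None, 1)
    first_word = words[0].lower() if words else ""
    return reader_endpoint if first_word in READ_KEYWORDS else writer_endpoint


# Hypothetical endpoint names used only for illustration.
WRITER = "orders-primary.example.internal"
READER = "orders-replica.example.internal"

read_target = pick_endpoint("SELECT * FROM orders WHERE id = 7", WRITER, READER)
write_target = pick_endpoint("  UPDATE orders SET status = 'paid' WHERE id = 7", WRITER, READER)
```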

Exam trap

The trap here is that candidates often confuse Multi-AZ with read replicas, thinking Multi-AZ can be applied to replicas for failover, or they assume promoting a replica helps with write offloading, when in fact it creates a separate write target without reducing load on the original primary.

157

MCQ (medium)

Based on the exhibit, the team wants to stop poison messages from consuming worker capacity and also prevent duplicate side effects if the same message is delivered more than once. Which design change best meets the requirement?

A. Increase the SQS queue batch size so each worker processes more messages per request.

B. Replace SQS with Amazon SNS and let each worker subscribe directly to the topic.

C. Configure a dead-letter queue and make the handler idempotent by storing a durable processed-message key.

D. Disable retries and shorten the visibility timeout so failed messages disappear sooner.

Answer: C

A dead-letter queue isolates messages that repeatedly fail so they stop wasting worker capacity. Idempotency ensures a message processed more than once does not create duplicate side effects, which is essential when visibility timeouts expire or retries occur. Together, these controls address both poison-message handling and at-least-once delivery behavior.

Why this answer

Option C is correct because a dead-letter queue isolates poison messages that repeatedly fail processing, preventing them from consuming worker capacity. Making the handler idempotent by storing a durable processed-message key (e.g., using DynamoDB or a database) ensures that even if the same message is delivered more than once, duplicate side effects are avoided. This combination directly addresses both requirements: stopping poison messages from wasting resources and preventing duplicate processing.
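The interaction of the two controls can be illustrated with a simplified in-memory model. This is not the SQS API: the queue is a Python list, the "durable" key store is a set, and a message is moved to the dead-letter list once it has been received `max_receive_count` times without success. All names are hypothetical.

```python
def drain(messages, handler, max_receive_count=3):
    """Process messages once each; park repeatedly failing ones in a dead-letter list."""
    processed_keys = set()   # stands in for a durable store such as a database table
    dead_letter = []
    receive_counts = {}
    pending = list(messages)
    while pending:
        msg = pending.pop(0)
        key = msg["id"]
        receive_counts[key] = receive_counts.get(key, 0) + 1
        if key in processed_keys:
            continue  # duplicate delivery: skip the side effect
        try:
            handler(msg)
            processed_keys.add(key)
        except ValueError:
            if receive_counts[key] >= max_receive_count:
                dead_letter.append(msg)   # poison message isolated for inspection
            else:
                pending.append(msg)       # becomes visible again for a retry
    return processed_keys, dead_letter


charged = []


def handle(msg):
    if "amount" not in msg:
        raise ValueError("missing amount")
    charged.append(msg["id"])


inbox = [{"id": "a", "amount": 10}, {"id": "a", "amount": 10}, {"id": "bad"}]
done, dlq = drain(inbox, handle)
```

Message `a` is charged once despite arriving twice, and `bad` stops consuming worker attempts after three receives.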

Exam trap

The trap here is that candidates often think disabling retries or increasing batch size solves poison messages, but they fail to realize that only a dead-letter queue isolates problematic messages, and idempotency is required to handle duplicate deliveries inherent in SQS's at-least-once delivery model.

How to eliminate wrong answers

Option A is wrong because increasing the SQS batch size does not prevent poison messages from consuming worker capacity; it only makes each worker process more messages per request, which could actually increase the impact of poison messages. Option B is wrong because replacing SQS with Amazon SNS removes the queue that buffers work between producers and consumers: SNS pushes each message to subscribers instead of holding it for workers to pull at their own pace, and the switch does nothing to make the handler idempotent, so repeated deliveries could still cause duplicate side effects. Option D is wrong because disabling retries and shortening the visibility timeout does not prevent duplicate side effects (a shorter timeout makes redelivery during processing more likely) and does not isolate poison messages; failed messages would simply be lost, not handled.

158

MCQ (easy)

A retail platform needs disaster recovery across AWS Regions. The business requirement is: RTO up to 6 hours, RPO up to 1 hour, and they want the ability to start serving quickly during a Region outage but do not want to run full production capacity continuously. Which DR strategy best fits these requirements?

A. Backup and restore only, with no continuously running infrastructure in the secondary Region.

B. Pilot light, keeping only the minimum resources needed to bootstrap the environment.

C. Warm standby, keeping a reduced but ready-to-scale environment in the secondary Region.

D. Multi-site active-active, serving production traffic from both Regions at all times.

Answer: C

Warm standby maintains enough infrastructure to reduce recovery time, while not fully running production capacity continuously.

Why this answer

Warm standby is the best fit because it maintains a scaled-down but fully functional copy of the production environment in the secondary Region, allowing the RTO of 6 hours and RPO of 1 hour to be met without running full production capacity continuously. During a disaster, the standby environment can be scaled up quickly to serve traffic, balancing cost and recovery speed.

Exam trap

The trap here is that candidates confuse pilot light with warm standby. Pilot light keeps core pieces such as replicated data in place but leaves application servers switched off or undeployed, so nothing can serve traffic until they are started and scaled. Warm standby keeps a scaled-down but running copy of the environment, which matches the requirement to start serving quickly without running full production capacity.

How to eliminate wrong answers

Option A is wrong because backup and restore requires provisioning infrastructure and restoring data from backups before anything can serve traffic; it is the slowest recovery path and does not satisfy the need to start serving quickly. Option B is wrong because pilot light keeps only the minimal core resources (e.g., database, DNS); compute must still be provisioned and scaled during the outage, so it cannot start serving as quickly as a warm standby. Option D is wrong because multi-site active-active runs full production capacity in both Regions at all times, which violates the requirement to not run full production capacity continuously and incurs unnecessary cost.

159

MCQ (medium)

A caching layer uses Amazon ElastiCache for Redis in front of a stateless web service. The service must continue to read cached responses during maintenance events and should automatically fail over to another node if one AZ becomes impaired. Which design change best satisfies this requirement?

A. Deploy a single-node Redis cluster and rely on application-level retries when cache misses occur.

B. Configure an ElastiCache Redis replication group with automatic failover across multiple Availability Zones.

C. Move the cache into the VPC but keep it in one Availability Zone to reduce network latency.

D. Use a Memcached cluster and configure only client-side connection pooling without failover support.

Answer: B

Multi-AZ replication groups provide redundant nodes and automatic failover, improving cache resilience during AZ events.

Why this answer

Option B is correct because an ElastiCache Redis replication group with automatic failover across multiple Availability Zones ensures that if the primary node or its AZ becomes impaired, a read-replica in another AZ is automatically promoted to primary. This allows the stateless web service to continue reading cached responses without interruption, satisfying both the maintenance and AZ impairment requirements.
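As an illustration, the settings that matter here could be collected into keyword arguments for the boto3 ElastiCache call `create_replication_group`. The group ID, description, and node type are placeholder assumptions, the snippet makes no AWS call, and a real request would usually also specify subnet and security groups.

```python
def redis_multi_az_request(group_id: str, node_type: str = "cache.t3.micro",
                           replicas: int = 1) -> dict:
    """Build create_replication_group arguments for a primary plus replicas across AZs."""
    if replicas < 1:
        raise ValueError("automatic failover needs at least one replica")
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Cache with automatic Multi-AZ failover",
        "Engine": "redis",
        "CacheNodeType": node_type,
        # One primary plus the requested number of replicas.
        "NumCacheClusters": 1 + replicas,
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
    }


# "web-cache" is a hypothetical replication group ID.
cache_request = redis_multi_az_request("web-cache")
```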

Exam trap

The trap here is that candidates often confuse Memcached's simplicity with Redis's replication capabilities, assuming that client-side connection pooling alone can handle failover, when in fact Memcached lacks any built-in replication or automatic failover mechanism.

How to eliminate wrong answers

Option A is wrong because a single-node Redis cluster provides no redundancy; if the node fails or its AZ becomes impaired, the cache is completely unavailable, forcing the web service to fall back to the origin (cache miss) until the node is restored, which violates the requirement for automatic failover. Option C is wrong because keeping the cache in one Availability Zone does not protect against AZ impairment; a single-AZ deployment cannot automatically fail over to another node in a different AZ, so the service would lose cached responses during an AZ outage. Option D is wrong because Memcached does not support replication or automatic failover; it is a pure caching engine with no built-in mechanism to promote a standby node, and client-side connection pooling alone cannot provide failover if a node or AZ becomes impaired.

160

MCQ (medium)

An ECS service runs on EC2 instances and is fronted by an ALB. The ALB spans two Availability Zones, and the ECS service desired count is 2 tasks. The underlying EC2 capacity uses an Auto Scaling group (ASG) with min size set to 1, and the ASG also spans only one subnet in practice. What is the most effective change to meet the requirement that the service continues during a single-AZ instance loss?

A. Set the ECS deployment configuration to maximum percent 100 so tasks replace instances faster during rollouts.

B. Increase ASG min size to at least 2 and ensure the ASG uses subnets in at least two Availability Zones.

C. Enable ALB connection draining longer than expected so existing connections survive longer during an AZ event.

D. Reduce task memory reservations to pack both tasks onto a single EC2 instance.

Answer: B

Multi-AZ instance capacity ensures tasks have eligible compute in another AZ when one AZ loses instances.

Why this answer

Option B is correct because the current architecture has a single point of failure: the ASG spans only one subnet (one AZ), so if that AZ fails, all EC2 instances are lost, and the ECS service cannot run any tasks. By increasing the ASG min size to at least 2 and ensuring it uses subnets in at least two AZs, the ASG will maintain at least one healthy instance in each AZ, allowing the ECS service to survive a single-AZ outage. This aligns with the AWS Well-Architected Framework's principle of deploying across multiple AZs for high availability.
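The two conditions in the fix can be written down as a small pre-deployment check. This is a hypothetical helper, not an AWS API: it takes the group's minimum size and a mapping from subnet ID to Availability Zone, both of which would be looked up separately.

```python
def survives_single_az_loss(min_size: int, subnet_to_az: dict) -> bool:
    """True when the group keeps at least 2 instances and can place them in 2 or more AZs."""
    distinct_azs = set(subnet_to_az.values())
    return min_size >= 2 and len(distinct_azs) >= 2


# Placeholder subnet IDs and AZ names for illustration.
current = survives_single_az_loss(1, {"subnet-a": "us-east-1a"})
proposed = survives_single_az_loss(2, {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"})
same_az = survives_single_az_loss(2, {"subnet-a": "us-east-1a", "subnet-c": "us-east-1a"})
```

The third case shows why adding a second subnet is not enough on its own: both subnets must sit in different Availability Zones.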

Exam trap

The trap here is that candidates may focus on the ALB's multi-AZ configuration and overlook that the compute layer (ASG/EC2) is the actual bottleneck, leading them to choose connection draining or deployment settings that do not address the fundamental lack of cross-AZ capacity.

How to eliminate wrong answers

Option A is wrong because setting the ECS deployment configuration maximum percent to 100 controls how many tasks are replaced during a rolling update, not the ability to survive an AZ failure; it does not address the underlying lack of EC2 capacity across AZs. Option C is wrong because ALB connection draining only gracefully terminates existing connections during deregistration or health check failures; it does not prevent service interruption when all EC2 instances in the single AZ become unavailable. Option D is wrong because reducing task memory reservations to pack both tasks onto a single EC2 instance actually increases risk—if that single instance or its AZ fails, both tasks are lost, violating the resilience requirement.

161

MCQ (hard)

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The architecture review board prefers a managed AWS-native control.

A. Use CloudFront signed URLs

B. Use Amazon SQS standard queue and design consumers to be idempotent

C. Use UDP messages sent directly to workers

D. Use an in-memory queue on one EC2 instance

Answer: B

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery: every message is delivered to a consumer at least once, which meets the requirement that every event must be processed as long as the consumer deletes a message only after handling it. Duplicate processing is acceptable because the consumer can be designed to handle idempotency. This is a managed, AWS-native service that aligns with the architecture review board's preference.

Exam trap

The trap here is that candidates may confuse 'at-least-once' delivery with 'exactly-once' delivery and incorrectly choose a solution like FIFO queues (not listed) or dismiss SQS standard queues due to the duplicate processing allowance, but the question explicitly states duplicates are acceptable if idempotency is handled, making SQS standard the correct choice.

How to eliminate wrong answers

Option A is wrong because CloudFront signed URLs are used for securing content delivery, not for event processing or messaging. Option C is wrong because UDP is a connectionless, unreliable protocol that does not guarantee delivery, so it cannot ensure at-least-once processing. Option D is wrong because an in-memory queue on a single EC2 instance is not managed, not AWS-native, and introduces a single point of failure, violating the requirement for a resilient, managed service.

162

Multi-Select (medium)

An order-processing worker consumes messages from Amazon SQS. Occasionally, the worker times out after successfully creating a payment record but before deleting the message, which causes duplicate charges during retries. Some messages also fail validation repeatedly because required fields are missing. Which two changes should the team make? Select two.

Select 2 answers

A. Make the payment step idempotent using a unique transaction identifier.

B. Configure an SQS dead-letter queue with a redrive policy.

C. Reduce the visibility timeout so failed messages return to the queue faster.

D. Run only one long-lived worker instance so the queue can never be processed twice.

E. Switch from a standard queue to a FIFO queue and remove all other changes.

Answers: A, B

SQS provides at-least-once delivery, so the same message can be processed more than once if the worker times out, retries, or crashes after partially completing the work. An idempotency key lets the application recognize that the payment was already created and prevents duplicate charges.

Why this answer

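Option A is correct because making the payment step idempotent using a unique transaction identifier ensures that if the same message is processed multiple times due to a timeout, the payment is only charged once. This is a common pattern for handling at-least-once delivery semantics in Amazon SQS, where the worker must be designed to handle duplicate messages safely. Option B is correct because a dead-letter queue with a redrive policy moves messages that repeatedly fail validation out of the main queue after a set number of receives, so they stop being retried indefinitely.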

Exam trap

The trap here is that candidates often think reducing the visibility timeout will speed up recovery, but it actually increases the chance of duplicate processing, and they may also overlook that a FIFO queue alone does not fix the worker's failure to delete the message after processing.

163

MCQ (hard)

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.

A. A FIFO queue without a redrive policy

B. A dead-letter queue with an appropriate maxReceiveCount

C. A larger message retention period only

D. Short polling instead of long polling

Answer: B

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

A dead-letter queue (DLQ) with an appropriate maxReceiveCount allows messages that repeatedly fail processing to be moved out of the source queue after a specified number of receive attempts. This prevents poison messages from blocking retries and consuming processing resources, without requiring custom operational scripts.
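A sketch of the redrive configuration: the source queue's `RedrivePolicy` attribute is a JSON string naming the DLQ ARN and the `maxReceiveCount`. The ARN below is a placeholder and the count of 5 is an illustrative choice; the resulting dictionary is in the form accepted as queue attributes by the SQS `SetQueueAttributes` API.

```python
import json


def redrive_policy_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the RedrivePolicy queue attribute for the source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # The attribute value itself must be a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN for a hypothetical dead-letter queue.
dlq_attributes = redrive_policy_attributes("arn:aws:sqs:us-east-1:111122223333:payments-dlq")
decoded_policy = json.loads(dlq_attributes["RedrivePolicy"])
```

A low count isolates poison messages quickly; a higher count tolerates more transient failures before a message is set aside.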

Exam trap

The trap here is that candidates may think increasing retention or switching polling modes solves poison messages, but only a DLQ with maxReceiveCount directly addresses repeated failures without custom scripts.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not automatically handle poison messages; it still requires a DLQ configuration to move failing messages out. Option C is wrong because increasing the message retention period only keeps messages longer but does not prevent poison messages from repeatedly failing and blocking retries. Option D is wrong because short polling (immediate return with fewer messages) does not address poison message handling; it only affects message availability and latency, not failure management.

164

MCQ (easy)

A content publishing system exposes a static website from S3 and CloudFront. Users should still receive cached pages if the S3 origin has a short outage. Which feature helps most? The design must avoid adding custom operational scripts.

A. IAM Access Analyzer

B. AWS Backup Vault Lock

C. CloudFront caching with appropriate TTLs

D. S3 Select

Answer: C

CloudFront can serve cached content from edge locations when the origin is temporarily unavailable.

Why this answer

CloudFront caches responses from the S3 origin based on configured TTLs (Cache-Control or Expires headers). If the S3 origin experiences a short outage, CloudFront can still serve cached content to users as long as the TTL has not expired, ensuring availability without custom scripts. This is the most direct and resilient feature for this use case.
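The TTL reasoning can be illustrated with a simplified freshness check based only on the `max-age` directive of a `Cache-Control` header. This is a teaching sketch, not CloudFront's actual algorithm: CloudFront also considers `s-maxage`, `Expires`, and the minimum, default, and maximum TTLs of the cache policy, and here a missing `max-age` is simply treated as not fresh.

```python
def is_fresh(age_seconds: int, cache_control: str) -> bool:
    """Return True if a cached object of the given age is within its max-age."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return age_seconds < int(value)
    return False  # no usable max-age directive in this simplified model


# An object cached 2 minutes ago with a 1-hour max-age survives a short origin outage.
during_outage = is_fresh(120, "public, max-age=3600")
after_expiry = is_fresh(4000, "public, max-age=3600")
```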

Exam trap

The trap here is that candidates may confuse backup or access control features (like Backup Vault Lock or IAM Access Analyzer) with availability mechanisms, or think S3 Select provides caching, when the correct answer is simply leveraging CloudFront's built-in caching TTLs to serve stale content during origin outages.

How to eliminate wrong answers

Option A is wrong because IAM Access Analyzer helps identify unintended access to resources, not caching or origin resilience. Option B is wrong because AWS Backup Vault Lock prevents deletion of backups, not caching or serving stale content during origin outages. Option D is wrong because S3 Select is a feature to retrieve subsets of data from objects using SQL queries, not related to caching or origin resilience.

165

MCQ (hard)

Based on the exhibit, downstream payment timeouts cause EventBridge deliveries to back up and some events are retried until they age out. What change best improves resilience and preserves events during downstream outages?

A. Increase the Lambda timeout so each invocation can wait longer for the payment API.

B. Put an Amazon SQS queue between EventBridge and the consumer, and have workers drain the queue with a DLQ for poison messages.

C. Switch the target to a Lambda function with reserved concurrency of zero during outages.

D. Replace EventBridge with CloudWatch Logs subscriptions so the consumer can poll the log stream later.

Answer: B

SQS is the right durability and buffering layer for this requirement. EventBridge can publish orders.checkout events to a queue, and workers can consume them at a controlled rate even when the payment API is unavailable. This decouples event ingestion from downstream processing, absorbs bursts, and preserves events until the outage ends. A DLQ provides a safe landing zone for messages that continue to fail after retries so they are not silently dropped.

Why this answer

Option B is correct because introducing an SQS queue between EventBridge and the consumer decouples the event delivery from the downstream payment API. During outages, events are stored durably in SQS and can be processed later without being lost. A Dead Letter Queue (DLQ) captures events that fail repeatedly, preventing poison messages from blocking the queue and ensuring no events age out due to retry exhaustion.

Exam trap

The trap here is that candidates often assume increasing timeouts or concurrency adjustments can fix backpressure issues, but they fail to recognize that decoupling with a durable queue is the only way to preserve events during extended downstream outages without losing them to retry expiration.

How to eliminate wrong answers

Option A is wrong because increasing the Lambda timeout does not prevent events from backing up or aging out; it only allows a single invocation to wait longer, which can exacerbate concurrency limits and still result in timeouts if the downstream API is unavailable. Option C is wrong because setting reserved concurrency to zero during outages would completely stop all Lambda invocations, causing all events to be dropped or immediately sent to the DLQ, rather than preserving them for later processing. Option D is wrong because replacing EventBridge with CloudWatch Logs subscriptions does not provide a durable, replayable buffer; log subscriptions are designed for real-time streaming and cannot natively pause or retry deliveries during outages, leading to event loss.

166

MCQ (medium)

A company runs a stateful analytics workload on EC2 instances that use EBS volumes. The data must be restorable in another Region after a major outage, with frequent point-in-time recovery. Which approach provides the most suitable replication mechanism for the EBS-backed data?

A.Create scheduled EBS snapshots and copy them to another Region, then restore the volumes from those snapshots during recovery.

B.Enable EBS multi-attach to spread the workload across AZs and replicate snapshots automatically between Regions.

C.Use RDS read replicas in another Region and keep the analytics dataset in an RDS instance only.

D.Rely on instance store for durability and copy only AMIs across Regions.

Answer: A

Snapshotting and cross-Region copying gives point-in-time images of EBS volumes that can be restored in the target Region.

Why this answer

Scheduled EBS snapshots provide point-in-time backups of EBS volumes, which can be copied to another Region using the cross-Region snapshot copy feature. During recovery, you restore volumes from those snapshots in the target Region, ensuring the data is restorable after a major outage. This approach meets the requirements for frequent point-in-time recovery and cross-Region durability.
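As a rough sketch of the cross-Region copy step, the helper below only builds the keyword arguments accepted by boto3's EC2 `copy_snapshot` call. The function name, snapshot ID, and Regions are illustrative placeholders, not values from the question. The call itself would be made on an EC2 client created in the destination Region, once the source snapshot has completed.

```python
def copy_snapshot_kwargs(snapshot_id: str, source_region: str, description: str = "") -> dict:
    """Build arguments for ec2.copy_snapshot(), called on a client in the destination Region."""
    if not snapshot_id.startswith("snap-"):
        raise ValueError(f"not a snapshot ID: {snapshot_id!r}")
    return {
        "SourceRegion": source_region,
        "SourceSnapshotId": snapshot_id,
        "Description": description or f"DR copy of {snapshot_id} from {source_region}",
    }


# Placeholder snapshot ID and Regions, for illustration only.
kwargs = copy_snapshot_kwargs("snap-0123456789abcdef0", "us-east-1")

# With boto3 installed and credentials configured, the copy would be requested as:
#   boto3.client("ec2", region_name="us-west-2").copy_snapshot(**kwargs)
```

How often the snapshot-and-copy cycle runs determines the recovery point, so the schedule should follow the "frequent point-in-time recovery" requirement.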

Exam trap

The trap here is that candidates may confuse EBS multi-attach (which is for high availability within a single AZ) with cross-Region replication, or mistakenly think instance store provides durability for long-term data recovery.

How to eliminate wrong answers

Option B is wrong because EBS multi-attach allows a single EBS volume to be attached to multiple EC2 instances within the same Availability Zone, but it does not replicate snapshots automatically between Regions or provide cross-Region disaster recovery. Option C is wrong because RDS read replicas are for relational databases, not for analytics workloads running on EC2 with EBS volumes; moving the dataset to RDS would require a database migration and does not replicate EBS-backed data. Option D is wrong because instance store volumes are ephemeral and do not persist data across instance stops or terminations, making them unsuitable for durable data that needs point-in-time recovery; an AMI copy captures only the volumes as they were when the image was created, so it is not a frequent point-in-time backup of the changing analytics data.

167

MCQ (medium)

A company hosts a web application on EC2 instances behind an Application Load Balancer (ALB) in us-east-1. A static failover site is hosted in an S3 bucket with static website hosting enabled. The company needs automatic DNS failover to the S3 bucket if the primary ALB becomes unhealthy. Which Route 53 configuration achieves this?

A.Configure Route 53 Failover routing with a health check on the ALB as PRIMARY and the S3 bucket website endpoint as SECONDARY

B.Configure Route 53 Weighted routing with 100% weight on the ALB and 0% on the S3 bucket

C.Configure Route 53 Latency routing with records in both regions to route to the healthiest endpoint

D.Configure Route 53 Geolocation routing with North American users directed to the ALB and all others to S3

Answer: A

Failover routing with a health-checked PRIMARY (ALB) and SECONDARY (S3) provides automatic DNS switchover. When the ALB health check fails, Route 53 returns the S3 endpoint automatically.

Why this answer

Route 53 Failover routing uses health checks to route traffic to a primary resource and automatically switch to a secondary when the primary health check fails.

Configuration: Create a Route 53 health check targeting the ALB endpoint. Create a PRIMARY alias A record pointing to the ALB with the health check associated. Create a SECONDARY alias A record pointing to the S3 static website endpoint. When the ALB health check fails, Route 53 returns the S3 endpoint automatically.
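The configuration steps above can be sketched as a change batch in the shape that Route 53's ChangeResourceRecordSets API (boto3 `change_resource_record_sets`) accepts. The helper name, domain, hosted zone IDs, DNS names, and health check ID below are all placeholders for illustration. Note that for an S3 website alias, the record name has to match the bucket name.

```python
def failover_alias_record(name, role, alias_zone_id, alias_dns_name, health_check_id=None):
    """Build one failover alias A record change (UPSERT) for a Route 53 change batch."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": role.lower(),  # any unique label per record works
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,
            "DNSName": alias_dns_name,
            # Assumption: only the primary evaluates target health in this sketch.
            "EvaluateTargetHealth": role == "PRIMARY",
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# All identifiers below are placeholders, not real zone IDs or endpoints.
change_batch = {
    "Changes": [
        failover_alias_record(
            "app.example.com.", "PRIMARY",
            "ZALBZONEPLACEHOLDER", "my-alb-123.us-east-1.elb.amazonaws.com.",
            health_check_id="hc-placeholder-id",
        ),
        failover_alias_record(
            "app.example.com.", "SECONDARY",
            "ZS3ZONEPLACEHOLDER", "s3-website-us-east-1.amazonaws.com.",
        ),
    ]
}
```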

Exam trap

Route 53 offers multiple routing policies. Failover routing is active-passive — one primary resource, one standby. Weighted routing splits traffic percentages (active-active).

Latency routing picks the lowest-latency endpoint. Geolocation routes by user geography. Only Failover routing provides automatic primary/secondary switchover based on health checks.

Weighted routing at 100%/0% with no health checks attached does NOT fail over when the 100% target fails.

Why the other options are wrong

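Weighted routing at 100%/0% is not a failover configuration. The option attaches no health check to the ALB record, so Route 53 keeps returning the ALB even when it is unhealthy. Route 53 falls back to zero-weighted records only when health checks mark every nonzero-weighted record unhealthy, and even then weighted routing is meant for splitting traffic by percentage, not for primary/secondary switchover.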

Latency routing routes to the lowest-latency endpoint for each client. It does not implement primary/secondary logic. If us-east-1 is unhealthy, some clients may still be routed there unless combined with health checks (but even then this is not a defined primary/secondary failover).

Geolocation routing directs traffic by user geography — North American users always go to the ALB even when it fails. S3 only receives other-region traffic. This is not a failover configuration.

168

MCQ (easy)

A web application runs on an Amazon EC2 Auto Scaling group (ASG) behind an Application Load Balancer (ALB). The ALB is configured to use at least two Availability Zones (AZs), but the ASG currently uses subnets in only one AZ. If that AZ becomes unavailable, the application stops serving requests. Which change most directly improves resilience to an AZ outage?

A.Keep the ASG in one Availability Zone, but reduce ALB health check intervals.

B.Place the ASG across multiple Availability Zones by configuring it with subnets in at least two AZs.

C.Switch the load balancer from an ALB to an NLB to remove HTTP health check dependency.

D.Add an Amazon SQS queue to buffer requests during failures.

Answer: B

An ASG launches instances into the AZs of the subnets you specify. With the ASG in at least two AZs, the ALB can route traffic to healthy targets in the remaining AZ(s) if one AZ fails, and the ASG launches replacement instances there to restore desired capacity.

Why this answer

Option B is correct because distributing an Auto Scaling group across multiple Availability Zones (AZs) ensures that if one AZ fails, the remaining AZs continue to serve traffic. The Application Load Balancer (ALB) is already configured for at least two AZs, but the ASG’s single-AZ subnet placement creates a single point of failure. By adding subnets in at least two AZs to the ASG, the application becomes resilient to an AZ outage without any other architectural changes.
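A small check along these lines can make the single-AZ gap visible before an outage does. It is only a sketch: the subnet IDs and AZ names are made up, and the subnet-to-AZ mapping would in practice come from the VPC's subnet metadata. The final comma-separated string is the format the Auto Scaling `VPCZoneIdentifier` parameter takes.

```python
def azs_covered(subnet_to_az: dict, asg_subnets: list) -> set:
    """Return the set of Availability Zones that the ASG's subnets span."""
    return {subnet_to_az[subnet] for subnet in asg_subnets}


def tolerates_one_az_outage(subnet_to_az: dict, asg_subnets: list) -> bool:
    """True if capacity can still exist somewhere after losing any single AZ."""
    return len(azs_covered(subnet_to_az, asg_subnets)) >= 2


# Hypothetical subnets, for illustration only.
SUBNET_TO_AZ = {"subnet-aaa": "us-east-1a", "subnet-bbb": "us-east-1b"}

before = ["subnet-aaa"]                 # the situation in the question
after = ["subnet-aaa", "subnet-bbb"]    # the fix from option B

# Value for the ASG's VPCZoneIdentifier setting.
vpc_zone_identifier = ",".join(after)
```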

Exam trap

The trap here is that candidates assume the ALB’s multi-AZ configuration automatically protects the application, overlooking that the ASG must also span multiple AZs to provide compute redundancy.

How to eliminate wrong answers

Option A is wrong because reducing health check intervals only detects failures faster but does not eliminate the single point of failure; if the sole AZ becomes unavailable, no healthy instances exist to serve traffic. Option C is wrong because switching from an ALB to an NLB does not address the root cause—the ASG is still in one AZ—and HTTP health checks are not the issue; the ALB can already perform health checks across AZs. Option D is wrong because adding an SQS queue buffers requests but does not provide compute capacity in another AZ; without instances in a second AZ, the queue cannot process requests during an AZ outage.

169

MCQ (easy)

An order-processing system publishes an event whenever a payment succeeds. Three downstream services (inventory, shipping, and analytics) must react independently. Analytics sometimes has high latency, but order processing must not be blocked. What is the best AWS approach to decouple these consumers?

A.Have order processing call each service synchronously via HTTPS and retry on failures.

B.Publish payment events to SNS (or EventBridge) and let each downstream service consume independently (for example, via SQS queues or other async targets).

C.Store events in a single relational database table and let consumers poll continuously for new rows.

D.Send events directly from the producer to each consumer EC2 instance using SSH tunnels.

Answer: B

Using pub/sub decouples the producer from consumers. Order processing publishes once and can complete without waiting for each downstream service. Each consumer receives events independently, so analytics latency does not directly block inventory or shipping processing.

Why this answer

Option B is correct because Amazon SNS (or EventBridge) enables asynchronous, fan-out messaging where a single payment-success event is published once and delivered independently to multiple downstream services (inventory, shipping, analytics) via SQS queues or other targets. This decouples the producer from consumer latency—analytics can take its time without blocking order processing—and ensures each consumer processes the event at its own pace, meeting the requirement for independent, non-blocking reactions.
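The fan-out idea can be illustrated with a toy in-memory model. This is not the SNS or SQS API, just a sketch in which a hypothetical `Topic` class copies each event into one queue per subscriber, so a slow analytics consumer only grows its own backlog.

```python
from collections import deque


class Topic:
    """Toy pub/sub topic: every published event is copied to each subscriber queue."""

    def __init__(self):
        self.queues = {}

    def subscribe(self, name: str) -> deque:
        self.queues[name] = deque()
        return self.queues[name]

    def publish(self, event: dict) -> int:
        # The producer returns as soon as the copies are enqueued.
        for queue in self.queues.values():
            queue.append(event)
        return len(self.queues)


topic = Topic()
inventory = topic.subscribe("inventory")
shipping = topic.subscribe("shipping")
analytics = topic.subscribe("analytics")

for order_id in ("o-1", "o-2", "o-3"):
    topic.publish({"type": "payment.succeeded", "orderId": order_id})

# Fast consumers drain everything; analytics has only handled one event so far.
inventory.clear()
shipping.clear()
analytics.popleft()
```

After this run, analytics still holds two pending events while inventory and shipping are empty, and the producer never waited on any of them.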

Exam trap

The trap here is that candidates may choose synchronous integration (Option A) because it seems simpler, failing to recognize that the requirement 'must not be blocked' explicitly demands asynchronous decoupling, not just retries.

How to eliminate wrong answers

Option A is wrong because synchronous HTTPS calls with retries tightly couple the producer to all consumers; if analytics has high latency, order processing is blocked waiting for responses, violating the requirement that it must not be blocked. Option C is wrong because storing events in a single relational database table introduces a single point of failure, creates a polling bottleneck, and tightly couples consumers to a shared schema and table, which is not a decoupled, scalable architecture. Option D is wrong because sending events directly via SSH tunnels requires direct network connectivity to each EC2 instance, introduces security risks, and tightly couples the producer to consumer instances, making it brittle and unscalable.

170

MCQ (medium)

An internal worker consumes messages from an Amazon SQS Standard queue. Recently, some messages fail validation in the worker (for example, missing required fields), causing the worker to crash before it can successfully process those messages. Those messages keep getting retried repeatedly, slowing down processing of valid messages. The team wants a resilient mechanism to quarantine bad messages after a limited number of receive attempts. What should they implement?

A.Increase the SQS visibility timeout to several hours so the worker does not retry too quickly.

B.Configure a redrive policy with a Dead-Letter Queue (DLQ) and set maxReceiveCount so poison messages are moved to the DLQ after repeated failures.

C.Switch the queue to an SNS topic and subscribe the worker directly, eliminating message retries.

D.Enable KMS encryption with a new CMK to ensure validation errors stop occurring.

Answer: B

An SQS DLQ with a redrive policy is specifically designed for poison-message handling. When a message exceeds maxReceiveCount without successful processing (for example, the worker crashes before deletion), SQS moves the message to the DLQ. This quarantines bad messages and protects throughput for valid messages.

Why this answer

Option B is correct because Amazon SQS supports configuring a redrive policy with a Dead-Letter Queue (DLQ) that automatically moves messages after a specified number of receive attempts (maxReceiveCount). This isolates poison messages that fail validation and cause crashes, preventing them from being retried indefinitely and slowing down valid message processing. The worker can then focus on valid messages while the DLQ stores the problematic ones for later analysis or manual intervention.
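For reference, the redrive policy is stored on the source queue as a JSON string under the `RedrivePolicy` attribute. The sketch below builds that attribute; the helper name, DLQ ARN, and the receive count of 5 are assumptions chosen for illustration.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int) -> dict:
    """Build the SQS queue attributes that attach a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
    # SQS attribute values are strings, so the policy is JSON-encoded.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN and threshold.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:worker-dlq", 5)

# With boto3, these would be applied to the source queue roughly as:
#   sqs.set_queue_attributes(QueueUrl=source_queue_url, Attributes=attrs)
```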

Exam trap

The trap here is that candidates may think increasing the visibility timeout (Option A) solves the retry problem, but it only delays retries without eliminating the root cause, while the DLQ mechanism (Option B) provides a proper quarantine by moving messages after a configurable number of receive attempts.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout to several hours only delays retries, it does not prevent them; the worker would still crash on the same invalid messages each time the timeout expires, and any valid message that fails transiently would also wait hours before being redelivered. Option C is wrong because an SNS topic pushes messages to subscribers instead of letting the worker pull them at its own pace; the worker would still crash on invalid messages, and nothing would quarantine them after a set number of receive attempts. Option D is wrong because enabling KMS encryption with a new CMK addresses encryption at rest and has no effect on message content validation errors or crash handling; encryption does not fix missing required fields or prevent retries.

171

MCQ (medium)

A ticket booking system runs on EC2 instances behind an Application Load Balancer. The design must tolerate the failure of one Availability Zone. What should the Auto Scaling group configuration include? The design must avoid adding custom operational scripts.

A.Subnets in at least two Availability Zones with health checks enabled

B.All instances in one larger subnet

C.A Network Load Balancer in one subnet

D.A single EC2 instance with detailed monitoring

Answer: A

An Auto Scaling group spanning multiple AZs can replace unhealthy instances and maintain capacity during an AZ failure.

Why this answer

Option A is correct because placing subnets in at least two Availability Zones ensures that if one AZ fails, the Auto Scaling group can launch instances in the remaining healthy AZ, maintaining application availability. Health checks integrated with the Application Load Balancer allow the Auto Scaling group to automatically replace unhealthy instances without custom scripts, aligning with the requirement to avoid operational overhead.

Exam trap

The trap here is that candidates often assume a single larger subnet or a Network Load Balancer provides AZ resilience, but they fail to recognize that without multiple subnets in distinct AZs, the architecture cannot survive an AZ failure, and custom scripts would be needed for health checks without ELB integration.

How to eliminate wrong answers

Option B is wrong because placing all instances in one larger subnet confines them to a single Availability Zone, violating the requirement to tolerate the failure of one AZ. Option C is wrong because a Network Load Balancer placed in a single subnet is confined to one Availability Zone, so it is itself a single point of failure; changing the load balancer type does nothing to spread the instances across AZs. Option D is wrong because a single EC2 instance, even with detailed monitoring, cannot survive an AZ failure and does not leverage Auto Scaling for automatic recovery.

172

MCQ (medium)

A company uses an Amazon Aurora DB cluster in a Multi-AZ configuration. During a planned failover of the writer instance, the database endpoints in the application are updated incorrectly. After failover, reads work but writes fail with connection errors and timeouts for several minutes. The team currently uses the instance endpoint for the writer. What should they change to improve write resilience during failovers?

A.Continue using the instance endpoint, but increase application retry count so the writer changes are handled more quickly.

B.Use the Aurora cluster writer endpoint for all write operations.

C.Use a read replica endpoint for writes because it is typically stable across failovers.

D.Disable Multi-AZ failover so the writer instance never changes and writes remain consistent.

Answer: B

Aurora provides a writer endpoint designed specifically for write traffic. During failover, Aurora updates where the writer endpoint points, so the same DNS name continues to resolve to the current writer instance without requiring manual endpoint changes in the application.

Why this answer

The Aurora cluster writer endpoint always points to the current primary (writer) instance, even after a failover. By using this endpoint instead of a static instance endpoint, the application resolves to the new writer without manual updates, so writes resume as soon as the failover completes and the DNS change is picked up, instead of failing until someone corrects the endpoint.
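One way to catch this misconfiguration early is to inspect the hostname the application is configured with. The sketch below assumes the usual Aurora naming pattern, in which the second DNS label starts with `cluster-` for the writer endpoint, `cluster-ro-` for the reader endpoint, and `cluster-custom-` for custom endpoints; the function name and hostnames are invented for illustration.

```python
def classify_aurora_endpoint(host: str) -> str:
    """Guess the endpoint type from the hostname's second DNS label (assumed pattern)."""
    labels = host.split(".")
    if len(labels) < 2:
        raise ValueError(f"unexpected hostname: {host!r}")
    second = labels[1]
    # Check the more specific prefixes before the generic "cluster-" prefix.
    if second.startswith("cluster-ro-"):
        return "reader"
    if second.startswith("cluster-custom-"):
        return "custom"
    if second.startswith("cluster-"):
        return "writer"
    return "instance"


# Hypothetical hostnames.
WRITER = "orders.cluster-abc123xyz.us-east-1.rds.amazonaws.com"
READER = "orders.cluster-ro-abc123xyz.us-east-1.rds.amazonaws.com"
INSTANCE = "orders-instance-1.abc123xyz.us-east-1.rds.amazonaws.com"
```

A startup check could log a warning when the write connection string classifies as `instance`, which is the situation described in the question.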

Exam trap

The trap here is that candidates confuse the instance endpoint (which is static and tied to a specific instance) with the cluster endpoint (which is dynamic and always points to the current writer), assuming any endpoint will automatically follow failover.

How to eliminate wrong answers

Option A is wrong because increasing the retry count does not fix the root cause—the application is still pointing to the old (now read-only) instance endpoint, so writes will continue to fail until the endpoint is manually corrected. Option C is wrong because read replica endpoints point to read-only instances; writes to a read replica will always fail with an error, regardless of failover state. Option D is wrong because disabling Multi-AZ failover removes high availability entirely, making the database vulnerable to a single point of failure, which contradicts the goal of improving write resilience.

173

Multi-Select (hard)

A claims workflow requires point-in-time recovery and accidental-delete protection for a DynamoDB table. Which two settings should the architect enable? The design must avoid adding custom operational scripts.

Select 2 answers

A.Point-in-time recovery

B.DAX

C.Deletion protection or tightly controlled delete permissions

D.Global secondary indexes

Answers: A, C

PITR allows restoration to a specific second within the supported recovery window.

Why this answer

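Point-in-time recovery (PITR) for DynamoDB enables continuous backups with per-second granularity over a recovery window of up to 35 days, allowing restoration to any second within that window. This directly satisfies the point-in-time recovery requirement without custom scripts, as it is a native AWS feature. Deletion protection (or tightly controlled delete permissions) covers the second requirement by preventing accidental deletion of the table itself.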
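Both settings map to single API calls. The sketch below only assembles the method names and arguments for boto3's DynamoDB client (`update_continuous_backups` and `update_table`); the helper name and the table name are placeholders.

```python
def protection_calls(table_name: str) -> list:
    """Return (method_name, kwargs) pairs that enable PITR and deletion protection."""
    return [
        (
            "update_continuous_backups",
            {
                "TableName": table_name,
                "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
            },
        ),
        (
            "update_table",
            {"TableName": table_name, "DeletionProtectionEnabled": True},
        ),
    ]


calls = protection_calls("claims")  # hypothetical table name

# With a boto3 DynamoDB client, the calls could be applied as:
#   for method, kwargs in calls:
#       getattr(dynamodb_client, method)(**kwargs)
```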

Exam trap

The trap here is that candidates often confuse DAX with a data protection feature, but DAX only accelerates reads and has no role in backup or deletion prevention.

174

MCQ (medium)

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.

A.An EBS snapshot schedule

B.S3 Cross-Region Replication with versioning enabled

C.S3 lifecycle transition to Glacier Flexible Retrieval

D.A CloudFront distribution

Answer: B

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) automatically replicates objects to a destination bucket in a different AWS Region, meeting the disaster recovery requirement without custom scripts. Versioning must be enabled on both source and destination buckets for CRR to function, as it tracks object versions and ensures consistency during replication.
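A minimal sketch of the two pieces of configuration involved, in the shapes that boto3's `put_bucket_versioning` and `put_bucket_replication` expect. The helper name, rule ID, role ARN, and bucket ARN are placeholders, and the rule replicates the whole bucket with delete marker replication turned off, which is an assumption rather than something the question specifies.

```python
# Versioning must be enabled on both the source and the destination bucket.
VERSIONING = {"Status": "Enabled"}


def replication_config(role_arn: str, dest_bucket_arn: str) -> dict:
    """Build a one-rule replication configuration covering every object in the bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-copy",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: apply to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": dest_bucket_arn},
            }
        ],
    }


config = replication_config(
    "arn:aws:iam::123456789012:role/s3-replication-placeholder",
    "arn:aws:s3:::documents-dr-placeholder",
)

# With boto3, roughly:
#   s3.put_bucket_versioning(Bucket=name, VersioningConfiguration=VERSIONING)  # both buckets
#   s3.put_bucket_replication(Bucket=source_name, ReplicationConfiguration=config)
```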

Exam trap

The trap here is that candidates may confuse S3 Cross-Region Replication with S3 lifecycle policies or CloudFront, thinking they provide cross-region replication, but only CRR with versioning enabled meets the DR requirement without custom scripts.

How to eliminate wrong answers

Option A is wrong because EBS snapshots are for Amazon Elastic Block Store volumes attached to EC2 instances, not for S3 objects; they cannot replicate S3 data across regions. Option C is wrong because S3 lifecycle transitions to Glacier Flexible Retrieval only change storage class within the same region for cost optimization, not replicate data to another region. Option D is wrong because CloudFront is a content delivery network (CDN) that caches content at edge locations for low-latency access, not a replication mechanism for disaster recovery across regions.

175

MCQ (medium)

An order processing workflow uses Amazon SQS as the decoupling layer between a producer and a consumer Lambda function. The consumer intermittently fails due to a downstream dependency. The team has observed that certain “poison” messages keep being retried repeatedly and prevent other messages from being processed efficiently. Which SQS configuration most directly addresses this issue?

A.Set the SQS queue’s retention period to 10 years and rely on application retries to eventually succeed.

B.Increase visibility timeout to a very large value and avoid dead-letter queues to keep ordering stable.

C.Configure a redrive policy with a dead-letter queue (DLQ) and set an appropriate visibility timeout greater than the maximum processing time.

D.Switch the queue to FIFO and remove retries in the Lambda event source mapping entirely.

Answer: C

A DLQ isolates poison messages after a receive count threshold, and correct visibility timeout prevents premature retries.

Why this answer

Option C is correct because configuring a redrive policy with a dead-letter queue (DLQ) allows messages that repeatedly fail processing to be moved out of the main queue after a specified number of receive attempts. Setting an appropriate visibility timeout greater than the maximum processing time ensures that messages are not made visible again before the consumer finishes processing, preventing premature retries. This directly isolates poison messages so they no longer block the processing of other messages in the queue.
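The effect of a receive-count threshold can be seen in a simplified simulation. This is a toy model, not SQS itself: it ignores visibility timeouts and concurrency, and the `drain` function and sample messages are invented for illustration.

```python
from collections import deque


def drain(messages, handler, max_receive_count):
    """Process messages, moving any that fail max_receive_count times to a DLQ list."""
    queue = deque((message, 0) for message in messages)
    processed, dlq = [], []
    while queue:
        message, receives = queue.popleft()
        receives += 1
        try:
            handler(message)
            processed.append(message)
        except Exception:
            if receives >= max_receive_count:
                dlq.append(message)                 # quarantined for later inspection
            else:
                queue.append((message, receives))   # becomes available again for retry
    return processed, dlq


def handler(message):
    # Stand-in for the consumer: one message can never be processed.
    if message == "poison":
        raise ValueError("cannot process")


processed, dlq = drain(["order-1", "poison", "order-2"], handler, max_receive_count=3)
```

Without the threshold, the poison message would cycle through the queue until its retention period expired; with it, the message is attempted three times and then set aside.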

Exam trap

The trap here is that candidates may think increasing visibility timeout or switching to FIFO alone will handle failed messages, but without a DLQ, poison messages remain in the queue and continue to block other messages, which is the core issue described.

How to eliminate wrong answers

Option A is wrong because SQS message retention cannot be set to 10 years (the maximum is 14 days), and a longer retention period would not address the repeated retry of poison messages anyway; it only keeps messages longer, allowing them to continue blocking the queue. Option B is wrong because increasing visibility timeout to a very large value without a DLQ means poison messages will remain in the queue until the retention period expires, still preventing other messages from being processed efficiently. Option D is wrong because switching to FIFO does not solve the poison message problem; FIFO queues still require a DLQ for handling failed messages, and removing retries entirely would cause messages to be lost permanently without any recovery mechanism.

176

MCQ (easy)

A system processes events from Amazon SQS and sometimes sees duplicate messages due to retries. The business requirement is that each payment must be charged at most once. What design choice best addresses this resiliency requirement?

A.Assume duplicates never occur because the consumer deletes messages immediately after receiving them.

B.Implement idempotent processing using a deduplication key (for example, paymentId) and record completed charges so duplicates are safely ignored.

C.Increase the SQS visibility timeout until duplicates never happen.

D.Use SNS topics instead of SQS so retries are disabled by default.

Answer: B

Idempotency ensures at-most-once side effects even when duplicates are delivered. Persist a record keyed by paymentId (e.g., a unique constraint/conditional write). If the record indicates the payment was already charged, skip the charge for any subsequent duplicate message.

Why this answer

Option B is correct because implementing idempotent processing with a deduplication key (e.g., paymentId) ensures that even if duplicate messages arrive from SQS (due to retries or at-least-once delivery), the consumer can check a record of completed charges and safely ignore duplicates. This satisfies the business requirement of charging each payment at most once without relying on queue-level deduplication or message ordering.
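The unique-constraint approach can be sketched with an in-memory SQLite table standing in for the record of completed charges. The table name, function name, and the `charged` list (a stand-in for calls to a payment provider) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges (payment_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)

charged = []  # stand-in for calls to a real payment provider


def handle_payment(payment_id: str, amount_cents: int) -> bool:
    """Charge once per payment_id. Returns False when the message is a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            # The primary key rejects a second insert for the same payment_id.
            conn.execute(
                "INSERT INTO charges VALUES (?, ?)", (payment_id, amount_cents)
            )
            charged.append((payment_id, amount_cents))
    except sqlite3.IntegrityError:
        return False
    return True


first = handle_payment("pay-42", 1999)
duplicate = handle_payment("pay-42", 1999)  # redelivered message
```

In a real system the charge and the record live in different systems, so the claim would typically be paired with an idempotency key passed to the payment provider as well.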

Exam trap

The trap here is that candidates assume SQS guarantees exactly-once delivery or that increasing visibility timeouts can prevent duplicates, but SQS is designed for at-least-once delivery, and the only reliable way to handle duplicates is to make the consumer idempotent.

How to eliminate wrong answers

Option A is wrong because SQS provides at-least-once delivery, and deleting a message immediately after receiving it does not prevent duplicates that may arrive before the delete is processed or due to visibility timeout expiration; assuming duplicates never occur violates the fundamental reliability guarantee of SQS. Option C is wrong because increasing the visibility timeout cannot eliminate duplicates; it only delays the redelivery of unacknowledged messages, and duplicates can still occur due to network retries, consumer crashes, or SQS’s internal replication. Option D is wrong because SNS topics do not disable retries by default; SNS uses at-least-once delivery and can retry HTTP/S endpoints, and switching to SNS does not solve the duplicate problem—it may even introduce additional delivery attempts without built-in deduplication.

177

MCQ (easy)

A company hosts a web application on Amazon EC2 instances in an Auto Scaling group behind an Application Load Balancer (ALB). The ALB and the Auto Scaling group are currently deployed in only one Availability Zone (AZ). The business wants the application to keep running if that AZ has an outage. What is the best change?

A.Increase the desired capacity in the existing Availability Zone to handle all traffic during an outage.

B.Deploy the ALB and the Auto Scaling group across at least two Availability Zones so healthy targets remain.

C.Enable longer ALB health check intervals so failing instances are detected more slowly.

D.Switch from the ALB to an Internet Gateway so instances can fail over to the public internet.

Answer: B

To tolerate an AZ outage, both the load-balancing entry point (the ALB) and the compute capacity (the Auto Scaling instances) must be available in more than one AZ. With the ALB in multiple AZs and the Auto Scaling group using multiple subnets/AZs, requests can be routed to healthy targets in a remaining AZ while Auto Scaling replaces unhealthy instances.

Why this answer

Deploying the ALB and Auto Scaling group across at least two Availability Zones (AZs) ensures that if one AZ fails, the ALB can route traffic to healthy EC2 instances in the remaining AZ(s). This is the fundamental AWS best practice for high availability: an ALB is a regional service that requires targets in multiple AZs to survive an AZ outage, and the Auto Scaling group must also span those AZs to maintain capacity. Without multi-AZ deployment, a single AZ failure makes the entire application unavailable regardless of instance health checks.

Exam trap

The trap here is that candidates think increasing capacity or adjusting health check intervals can compensate for a single-AZ deployment, but AWS high availability fundamentally requires distributing resources across multiple isolated failure domains (AZs).

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity in a single AZ does not protect against an AZ outage; all instances are in the same failure domain, so they all become unreachable simultaneously. Option C is wrong because enabling longer health check intervals would delay detection of failing instances, making the application less responsive to failures and increasing downtime, not improving availability. Option D is wrong because an Internet Gateway (IGW) is a VPC component that provides internet connectivity for resources in the VPC, not a load balancer; it cannot perform health checks, distribute traffic, or fail over traffic between instances, and it does not replace the ALB's role in high availability.

178

MCQ (medium)

Your order-processing system uses EventBridge rules to send events to a Lambda function that updates order status. Over the last week, some events fail with a transient database timeout, and the Lambda retries intermittently but then the events are lost (no alerts after failures). You want at-least-once processing, bounded retries, and a way to inspect unprocessable events for later reprocessing. Which architecture change best meets these requirements?

A.Send EventBridge events to an SQS queue, configure a redrive policy to move messages to a dead-letter queue (DLQ) after a defined receive count, and make the Lambda processing idempotent.

B.Invoke Lambda directly from EventBridge in asynchronous mode, and increase the Lambda timeout to reduce failures.

C.Use SNS topics with Lambda subscriptions, but remove all retry and DLQ configuration to minimize duplicate events.

D.Store failed events only in CloudWatch logs, and have operators manually copy log entries back into the database for reprocessing.

Answer: A

EventBridge-to-SQS provides buffering and decoupling; SQS redrive with a DLQ bounds retries and preserves failed events for analysis and replay.

Why this answer

Option A is correct because it introduces an SQS queue between EventBridge and Lambda, which provides at-least-once processing through message visibility timeouts and retries. The redrive policy moves messages to a dead-letter queue (DLQ) after a defined receive count, ensuring bounded retries and preserving unprocessable events for later inspection and reprocessing. Making the Lambda idempotent prevents duplicate side effects from at-least-once delivery.
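On the Lambda side, one possible shape for the consumer is a handler that reports per-message failures, so only the failed messages return to the queue and count toward the redrive threshold. This sketch assumes the event source mapping has `ReportBatchItemFailures` enabled and that each SQS message body is the EventBridge event JSON with the order under `detail`; `process_order` is a hypothetical placeholder for the status update.

```python
import json


def process_order(order: dict) -> None:
    """Hypothetical business logic; raises when the order cannot be processed."""
    if "orderId" not in order:
        raise ValueError("missing orderId")


def handler(event, context=None):
    """SQS-triggered Lambda handler that returns a partial batch response."""
    failures = []
    for record in event.get("Records", []):
        try:
            body = json.loads(record["body"])
            process_order(body.get("detail", {}))
        except Exception:
            # Only this message is retried; the rest of the batch is deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Made-up batch: one good event and one that fails validation.
sample_event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"detail": {"orderId": "o-1"}})},
        {"messageId": "m-2", "body": json.dumps({"detail": {}})},
    ]
}
result = handler(sample_event)
```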

Exam trap

The trap here is that candidates may think increasing Lambda timeout or relying on asynchronous invocation retries alone is sufficient, but they overlook the need for a DLQ to capture and inspect events that fail after all retries are exhausted.

How to eliminate wrong answers

Option B is wrong because increasing the Lambda timeout does not address transient database timeouts or prevent event loss; asynchronous invocation already retries twice by default, but after exhausting retries, events are discarded without a DLQ. Option C is wrong because removing all retry and DLQ configuration means a failed delivery or invocation is neither retried nor captured anywhere, so events are lost on failure, violating the requirement for bounded retries and inspectable failures. Option D is wrong because storing failed events only in CloudWatch logs does not provide a structured, automated way to reprocess them; manual copy-paste is error-prone and does not meet the requirement for at-least-once processing or bounded retries.

179

MCQ (medium)

A web application runs on an Amazon EC2 Auto Scaling group behind an Application Load Balancer (ALB). After each deployment, new instances take about 2 minutes to download artifacts and become ready to accept requests on the target port. In the last deployment, the ALB started marking targets unhealthy before the app was ready, and the Auto Scaling group then replaced those instances repeatedly, causing a prolonged outage. Which change best improves resilience during instance start-up without reducing actual availability once the application is healthy?

A.Increase the Auto Scaling group’s health check grace period so it exceeds the ~2-minute initialization time.

B.Add more subnets across additional Availability Zones to distribute the same instances more widely.

C.Switch the load balancer target type from instance targets to IP targets to avoid health check failures.

D.Reduce the ALB health check interval so unhealthy targets are removed faster.

Answer: A

A health check grace period prevents the Auto Scaling group from treating early health check failures as instance health problems. This avoids terminating instances before the application finishes initializing, which stops the restart/replace loop during deployments while still allowing normal health checks to apply once the app is ready.

Why this answer

The Auto Scaling group's health check grace period tells the group to ignore failing ELB health checks for a set time after an instance launches. The ALB may still report the target as unhealthy during start-up, but with the grace period set above the ~2-minute artifact download time, the ASG does not terminate and replace instances that are still initializing, which prevents the cascade of terminations and relaunches that caused the outage. This addresses the root cause (the ASG acting on premature health check failures) without loosening the ALB health check itself or reducing availability once the app is ready.

Exam trap

The trap here is that candidates confuse the ALB health check interval or target type with the Auto Scaling group's lifecycle management, mistakenly thinking that changing how the ALB checks health (interval or target type) will fix the premature replacement, when the correct solution is to adjust the ASG's grace period to align with the application's startup time.

How to eliminate wrong answers

Option B is wrong because adding more subnets across additional Availability Zones distributes instances more widely for fault tolerance but does not prevent the ALB from marking starting instances as unhealthy, so it does not solve the premature replacement issue. Option C is wrong because switching from instance targets to IP targets changes how the ALB routes traffic but does not alter the health check logic or timing; the ALB will still mark the target as unhealthy if the health check fails during the initialization window. Option D is wrong because reducing the ALB health check interval causes unhealthy targets to be detected and removed faster, which would worsen the problem by accelerating the replacement cycle, not improving resilience during start-up.
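As a rough illustration of option A, the sketch below builds the arguments one might pass to boto3's `update_auto_scaling_group` call. The group name `web-asg`, the 60-second safety margin, and the helper name are assumptions made for illustration; they do not come from the question.

```python
def grace_period_update(group_name: str, init_seconds: int, margin_seconds: int = 60) -> dict:
    """Build keyword arguments for an Auto Scaling group update.

    The grace period is the observed initialization time plus a margin.
    The margin is an assumed value; tune it for the real workload.
    """
    if init_seconds < 0 or margin_seconds < 0:
        raise ValueError("durations must be non-negative")
    return {
        "AutoScalingGroupName": group_name,
        "HealthCheckType": "ELB",  # the ASG acts on load balancer health checks
        "HealthCheckGracePeriod": init_seconds + margin_seconds,
    }


params = grace_period_update("web-asg", init_seconds=120)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("autoscaling").update_auto_scaling_group(**params)
```

The function only prepares the request, so it can be checked locally before anything touches a real account.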

180

MCQmedium

A.Set the SQS queue’s retention period to 10 years and rely on application retries to eventually succeed.

B.Increase visibility timeout to a very large value and avoid dead-letter queues to keep ordering stable.

C.Configure a redrive policy with a dead-letter queue (DLQ) and set an appropriate visibility timeout greater than the maximum processing time.

D.Switch the queue to FIFO and remove retries in the Lambda event source mapping entirely.

AnswerC

A DLQ isolates poison messages after a receive count threshold, and correct visibility timeout prevents premature retries.

Why this answer

Option C is correct because configuring a redrive policy with a dead-letter queue (DLQ) allows messages that exceed a specified maximum receive count to be moved to the DLQ, isolating poison messages. Setting the visibility timeout greater than the maximum processing time ensures the consumer has enough time to process each message before it becomes visible again, preventing premature retries. This directly addresses the issue of poison messages blocking the queue and degrading throughput.

Exam trap

The trap here is that candidates often confuse increasing the visibility timeout or switching to FIFO as solutions for poison messages, but neither addresses the root cause of isolating messages that repeatedly fail processing.

How to eliminate wrong answers

Option A is wrong because a longer retention period does not prevent poison messages from being retried; it only keeps them in the queue longer, worsening the problem. SQS also caps message retention at 14 days, so a 10-year setting is not possible. Option B is wrong because increasing visibility timeout to a very large value without a DLQ means poison messages will still be retried indefinitely, and avoiding DLQs does not help with ordering or poison message handling. Option D is wrong because switching to a FIFO queue does not address poison messages; FIFO ensures strict ordering but still requires a DLQ for poison message handling, and removing retries entirely would cause message loss if processing fails.
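The two settings in option C can be sketched as a set of SQS queue attributes. This is a minimal example under assumptions: the DLQ ARN, the 30-second buffer, and the default `max_receive_count` of 5 are illustrative values, not recommendations from the question.

```python
import json


def queue_attributes(dlq_arn: str, max_processing_seconds: int, max_receive_count: int = 5) -> dict:
    """Build SQS queue attributes for a redrive policy plus visibility timeout.

    SQS attribute values are passed as strings; RedrivePolicy is a JSON document.
    """
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    return {
        # Keep the message hidden for longer than the slowest expected run
        # (the 30-second buffer is an assumed value).
        "VisibilityTimeout": str(max_processing_seconds + 30),
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


attrs = queue_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 90)
# Could be applied with boto3 as:
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```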

181

Multi-Selectmedium

An internal API is deployed in two AWS Regions behind separate Application Load Balancers. The company wants clients to use the primary Region when it is healthy and automatically switch to the secondary Region if the primary health check fails. Which two Route 53 record configurations are required? Select two.

Select 2 answers

A.Create a primary failover record that points to the primary ALB and associates a Route 53 health check.

B.Create a weighted record set that sends 50 percent of traffic to each Region.

C.Create a secondary failover record that points to the secondary ALB.

D.Create a latency-based record set so Route 53 always prefers the fastest Region.

E.Create a multivalue answer record to return both ALB addresses on each lookup.

AnswersA, C

A primary failover record is the active answer while the primary Region remains healthy. The associated health check tells Route 53 when the primary endpoint should stop being returned to clients.

Why this answer

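Option A is correct because a primary failover record in Amazon Route 53 directs traffic to the primary ALB and is associated with a Route 53 health check. If the health check fails, Route 53 automatically fails over to the secondary failover record, ensuring high availability across Regions. Option C is correct because that secondary failover record, which points to the secondary ALB, is the answer Route 53 returns while the primary is unhealthy; without it there is nothing to fail over to.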

Exam trap

The trap here is that candidates often confuse failover routing with weighted or latency routing, assuming any health-aware routing provides automatic primary/secondary failover, but only failover records enforce a strict active-passive pattern.
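The primary/secondary pair can be sketched as a Route 53 change batch. The record name, ALB DNS names, hosted zone IDs, and health check ID below are placeholders invented for illustration; real values come from the account being configured.

```python
from typing import Optional


def failover_record(name: str, role: str, alb_dns: str, alb_zone_id: str,
                    health_check_id: Optional[str] = None) -> dict:
    """Build one UPSERT change for an alias A record with a failover role."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's hosted zone, not your own
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_record("api.example.internal.", "PRIMARY",
                        "primary-alb.example.com.", "ZPRIMARYEXAMPLE", "hc-primary-example"),
        failover_record("api.example.internal.", "SECONDARY",
                        "secondary-alb.example.com.", "ZSECONDARYEXAMPLE"),
    ]
}
# Could be submitted with boto3 as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId=my_zone_id, ChangeBatch=change_batch)
```

Both records share the same name and type; only the failover role and set identifier differ.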

182

Multi-Selecteasy

A production Amazon RDS database must continue serving the application if the primary DB instance fails. The application should reconnect automatically without hard-coding a new IP address. Which two actions should you take? Select two.

Select 2 answers

A.Create an RDS Multi-AZ deployment for the database.

B.Connect the application to the RDS endpoint instead of hard-coding the database IP address.

C.Disable automated backups to reduce the time needed for failover.

D.Use a single-AZ deployment so the standby is not split across Zones.

E.Replace the database with an Amazon S3 bucket and store rows as objects.

AnswersA, B

RDS Multi-AZ maintains a synchronous standby in another Availability Zone and automatically promotes it when the primary fails. This is the standard AWS high-availability pattern for managed relational databases.

Why this answer

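A is correct because an RDS Multi-AZ deployment automatically provisions and maintains a synchronous standby replica in a different Availability Zone. If the primary DB instance fails, Amazon RDS automatically fails over to the standby, typically within 60–120 seconds, without requiring manual intervention. This ensures high availability and continuity for the production database. B is correct because the RDS endpoint is a DNS name that RDS repoints to the new primary after failover, so an application that connects by endpoint reconnects to the promoted standby without any change to its configuration.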

Exam trap

The trap here is that candidates may think disabling backups or using a single-AZ deployment could improve failover speed, but in reality, Multi-AZ and endpoint-based connections are the only correct combination for automatic failover and reconnection without hard-coded IP addresses.
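On the application side, connecting by endpoint usually goes together with a reconnect loop. The sketch below is one possible shape, assuming a caller-supplied `connect` function that takes a hostname and raises `ConnectionError` on failure; the function name, attempt count, and delays are illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[str], T], endpoint: str,
                       attempts: int = 5, base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect(endpoint) until it succeeds, backing off exponentially.

    The hostname is resolved again on every attempt, so once the endpoint's
    DNS record points at the promoted standby, the next attempt reaches it.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect(endpoint)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"could not reach {endpoint}") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to exercise in tests without real waiting.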

183

MCQmedium

An internal-facing application is available in two AWS Regions (Region 1 and Region 2). Each Region has its own Application Load Balancer (ALB) and target group. The company uses an Amazon Route 53 private hosted zone to route clients to Region 1 by default, but it must automatically fail over to Region 2 when Region 1’s ALB is unhealthy. Which Route 53 design best meets this requirement?

A.Use latency-based routing with two alias records; Route 53 will automatically shift traffic away from the unhealthy region.

B.Use weighted routing with weights 100/0 and update weights manually after detecting failures.

C.Use failover routing with two alias A records for the same name: one PRIMARY and one SECONDARY, each pointing to its own Region’s ALB; attach the health check to the PRIMARY record.

D.Use geolocation routing with a single alias record for Region 1, and enable EDNS Client Subnet to detect unhealthy endpoints.

AnswerC

Failover routing uses health checks to determine which record Route 53 should return. By creating PRIMARY and SECONDARY alias records and associating a health check with the PRIMARY ALB endpoint, Route 53 can automatically stop routing to Region 1 when the health check fails and route to Region 2 until Region 1 recovers.

Why this answer

Option C is correct because Route 53 failover routing with a PRIMARY and SECONDARY alias record allows automatic failover when the health check attached to the PRIMARY record fails. The health check monitors Region 1's ALB, and upon failure, Route 53 returns the SECONDARY record's IP (Region 2's ALB) to clients. This design meets the requirement for automatic failover without manual intervention.

Exam trap

The trap here is that candidates often assume latency-based or geolocation routing handles endpoint failure on its own. Those policies can skip unhealthy records when health checks are associated with them, but they choose answers by latency or client location; only failover routing expresses a default Region with an automatic standby.

How to eliminate wrong answers

Option A is wrong because latency-based routing chooses the Region with the lowest latency for each client, so it does not send clients to Region 1 by default, and it shifts traffic away from an unhealthy Region only when health checks or Evaluate Target Health are configured, which the option does not specify. Option B is wrong because weighted routing with manual weight updates requires human intervention to detect failures and change weights, which violates the 'automatically fail over' requirement. Option D is wrong because geolocation routing routes based on client location, not endpoint health, and EDNS Client Subnet only improves location accuracy; it does not trigger failover when an endpoint becomes unhealthy.
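To make the failover behavior concrete, here is a simplified model of how a health check's status changes and which Region the failover pair answers with. It assumes a failure threshold of 3 consecutive results (the Route 53 default) applied in both directions, and it treats the SECONDARY record as always healthy; the Region labels are placeholders.

```python
def answers_over_time(check_results, failure_threshold=3):
    """Return the Region answered after each health check result.

    The status flips only after `failure_threshold` consecutive results
    that disagree with the current status.
    """
    healthy = True
    streak = 0  # consecutive results that contradict the current status
    answers = []
    for passed in check_results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= failure_threshold:
                healthy = passed
                streak = 0
        answers.append("region-1" if healthy else "region-2")
    return answers


timeline = answers_over_time([True, False, False, False, True, True, True])
# -> region-1 for the first three checks, region-2 for the next three,
#    then back to region-1 after three consecutive passes
```

The model ignores DNS caching; in practice clients keep using a cached answer until the record's TTL expires.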

184

MCQmedium

A SaaS platform plans to run in two AWS Regions for lower latency. The team wants to enable active-active writes (both regions accept updates) to avoid failover downtime. However, the business requires strong consistency for order status transitions (for example, only one transition from “Paid” to “Shipped” must be allowed). Which statement is the best architectural choice to meet the consistency requirement?

A.Use active-active writes only when the workload tolerates eventual consistency; for strongly consistent transitions, use a single-writer pattern with failover (active-passive/pilot light).

B.Active-active writes always provide strong consistency because AWS replicates data across Regions automatically and immediately.

C.Active-active writes can be used safely by simply enabling retries and expecting the application to resolve conflicts without coordination.

D.To ensure strong consistency, run both Regions with different IAM roles and block cross-Region writes at the API layer only.

AnswerA

Strong consistency requirements typically conflict with multi-master active-active replication semantics, so single-writer designs are safer.

Why this answer

Option A is correct because active-active writes across AWS Regions generally cannot guarantee strong consistency, given the inter-Region latency and the asynchronous replication used between Regions. For order status transitions that require exactly-once semantics (e.g., only one transition from 'Paid' to 'Shipped'), a single-writer pattern (active-passive or pilot light) ensures that only one Region accepts writes at a time, avoiding conflicts and maintaining a single source of truth. In their default configuration, DynamoDB global tables replicate asynchronously and resolve concurrent multi-Region writes with last-writer-wins, while Aurora Global Database keeps a single writer Region with read-only secondaries that can be promoted on failover.

Exam trap

The trap here is that candidates assume AWS's global services (like DynamoDB global tables or Aurora Global Database) inherently provide strong consistency for multi-Region writes, when in fact their default cross-Region replication is asynchronous and eventually consistent, which requires careful trade-offs for strict ordering requirements.

How to eliminate wrong answers

Option B is wrong because AWS does not replicate data across Regions automatically and immediately for strong consistency; cross-Region replication is typically asynchronous (e.g., DynamoDB global tables are eventually consistent by default, and S3 Cross-Region Replication copies objects asynchronously). Option C is wrong because retries alone cannot resolve conflicts in an active-active write scenario; without a distributed consensus protocol (like Paxos or Raft) or a conflict-resolution mechanism, concurrent writes can lead to inconsistent states (e.g., the same order being moved to 'Shipped' in both Regions at once). Option D is wrong because different IAM roles and an API-layer block do not coordinate the data layer; both Regions could still accept updates independently, leading to conflicts, and IAM does not order writes across Regions.
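With a single writer, the "only one Paid to Shipped transition" rule reduces to a conditional update (compare-and-set) on the one writable copy. The class below is an in-memory stand-in written for illustration, not an AWS API; in DynamoDB the same idea would typically be expressed with a condition expression on the update.

```python
import threading


class OrderStore:
    """In-memory stand-in for the single writable copy of order status."""

    def __init__(self):
        self._status = {}
        self._lock = threading.Lock()

    def put(self, order_id: str, status: str) -> None:
        with self._lock:
            self._status[order_id] = status

    def status(self, order_id: str):
        with self._lock:
            return self._status.get(order_id)

    def transition(self, order_id: str, expected: str, new: str) -> bool:
        """Apply the transition only if the current status equals `expected`."""
        with self._lock:
            if self._status.get(order_id) != expected:
                return False  # another request already moved the order
            self._status[order_id] = new
            return True


store = OrderStore()
store.put("order-1", "Paid")
first = store.transition("order-1", "Paid", "Shipped")   # accepted
second = store.transition("order-1", "Paid", "Shipped")  # rejected
```

The check and the write happen under one lock, which is what a second independent writer in another Region could not guarantee.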

185

MCQeasy

A consumer application reads from an Amazon SQS queue. Some messages have an invalid format and always fail processing. They are retried repeatedly and consume consumer capacity. What is the best way to prevent these "poison pill" messages from blocking normal processing?

A.Enable long polling and increase the maximum message retention to 30 days.

B.Configure a dead-letter queue (DLQ) with a redrive policy and a maxReceiveCount.

C.Switch the queue to FIFO and disable retries in the consumer code.

D.Delete the main queue and recreate it after every failure.

AnswerB

A DLQ with a redrive policy isolates poison-pill messages. After a message fails processing and is received more than maxReceiveCount times, SQS stops returning it to the main queue and moves it to the DLQ. Normal messages continue to be processed without repeatedly consuming consumer capacity.

Why this answer

Option B is correct because a dead-letter queue (DLQ) with a redrive policy and a maxReceiveCount allows messages that repeatedly fail processing to be moved to a separate queue after a specified number of receive attempts. This prevents poison pill messages from being retried indefinitely, freeing consumer capacity for valid messages. Amazon SQS automatically redirects messages to the DLQ once the maxReceiveCount threshold is exceeded, ensuring normal processing is not blocked.

Exam trap

The trap here is that candidates may think increasing retention or polling settings will solve the problem, but they fail to recognize that only a DLQ with a redrive policy isolates repeatedly failing messages from consuming consumer capacity.

How to eliminate wrong answers

Option A is wrong because enabling long polling and increasing maximum message retention does not address the root cause of invalid messages; it only reduces empty responses and keeps messages longer, but poison pills will still be retried. Option C is wrong because switching to a FIFO queue does not prevent poison pills; FIFO ensures exactly-once processing but still retries failed messages, and disabling retries in consumer code would cause message loss without moving them to a DLQ. Option D is wrong because deleting and recreating the main queue after every failure is disruptive, loses all messages, and does not provide a systematic way to isolate or inspect poison pills.
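The effect of `maxReceiveCount` can be seen in a small simulation. This is a toy in-memory model, not the SQS API: the queue is a `deque`, a failed message is simply re-queued, and the handler signals failure by raising `ValueError`.

```python
from collections import deque


def drain(messages, handler, max_receive_count=3):
    """Process messages, moving any that keep failing to a dead-letter list.

    A message is handed to the handler at most `max_receive_count` times;
    on the next receive attempt it goes to the dead-letter list instead.
    """
    queue = deque((body, 0) for body in messages)
    processed, dead_letter = [], []
    while queue:
        body, receives = queue.popleft()
        if receives >= max_receive_count:
            dead_letter.append(body)
            continue
        try:
            handler(body)
            processed.append(body)
        except ValueError:
            queue.append((body, receives + 1))  # becomes visible again later
    return processed, dead_letter


attempts = []

def parse(body):
    attempts.append(body)
    if body == "bad":
        raise ValueError("invalid format")

done, dlq = drain(["a", "bad", "b"], parse)
# done == ["a", "b"], dlq == ["bad"]; "bad" was attempted three times
```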

186

MCQmedium

A.Create scheduled EBS snapshots and copy them to another Region, then restore the volumes from those snapshots during recovery.

B.Enable EBS multi-attach to spread the workload across AZs and replicate snapshots automatically between Regions.

C.Use RDS read replicas in another Region and keep the analytics dataset in an RDS instance only.

D.Rely on instance store for durability and copy only AMIs across Regions.

AnswerA

Snapshotting and cross-Region copying gives point-in-time images of EBS volumes that can be restored in the target Region.

Why this answer

Option A is correct because scheduled EBS snapshots provide point-in-time recovery and can be copied to another Region for cross-region disaster recovery. When a major outage occurs, you can restore EBS volumes from those snapshots in the target Region, meeting the requirement for frequent restorable backups. This approach is native to AWS, cost-effective, and supports the stateful analytics workload without architectural changes.

How to eliminate wrong answers

Option B is wrong because EBS multi-attach allows a single EBS volume to be attached to multiple EC2 instances within the same Availability Zone, but it does not replicate snapshots across Regions or provide cross-region disaster recovery. Option C is wrong because RDS read replicas are for relational databases and cannot store or replicate arbitrary analytics datasets from EC2 instances with EBS volumes; this option misapplies RDS to a non-database workload. Option D is wrong because instance store volumes are ephemeral and lose data on instance stop or termination, making them unsuitable for durable data that must be restorable after an outage; copying AMIs across Regions does not preserve the analytics data stored on instance store.

187

MCQeasy

Based on the exhibit, the web tier becomes unavailable if us-west-2a has an outage. What is the best change to improve resilience with the least redesign?

A.Increase the Auto Scaling group desired capacity from 2 to 3 in the same subnet.

B.Attach the Application Load Balancer and Auto Scaling group to subnets in a second Availability Zone.

C.Replace the Application Load Balancer with a Network Load Balancer.

D.Increase the health check grace period so instances stay registered longer.

AnswerB

Spanning the load balancer and Auto Scaling group across at least two Availability Zones removes the single-AZ dependency shown in the exhibit. If us-west-2a fails, the remaining AZ can continue serving traffic and Auto Scaling can replace unhealthy instances there. This is the smallest architectural change that directly improves availability.

Why this answer

The web tier is currently deployed in a single Availability Zone (us-west-2a), so an outage of that AZ makes the entire tier unavailable. By attaching the Application Load Balancer and Auto Scaling group to subnets in a second Availability Zone, the application can continue serving traffic from the healthy AZ, achieving high availability with minimal architectural changes. This is the standard AWS best practice for multi-AZ resilience.

Exam trap

The trap here is that candidates may think increasing instance count or changing load balancer type improves resilience, but the core issue is the single-AZ deployment, which only multi-AZ subnets can fix.

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity to 3 in the same subnet still keeps all instances in a single Availability Zone; an AZ outage would still take down all instances. Option C is wrong because replacing the Application Load Balancer with a Network Load Balancer does not address the single-AZ failure; both ALB and NLB can operate across AZs, but the issue is the lack of multi-AZ subnets, not the load balancer type. Option D is wrong because the health check grace period only delays when the Auto Scaling group starts acting on health checks for newly launched instances; it does not prevent the loss of all instances when the entire AZ fails.
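A quick way to spot the single-AZ dependency is to map each attached subnet to its Availability Zone and count the distinct zones. The subnet IDs and the mapping below are made up for illustration; in practice the mapping would come from the VPC's subnet descriptions.

```python
def spans_multiple_azs(subnet_to_az: dict, attached_subnets: list) -> bool:
    """Return True when the attached subnets cover at least two AZs."""
    zones = {subnet_to_az[subnet] for subnet in attached_subnets}
    return len(zones) >= 2


subnet_to_az = {
    "subnet-aaa": "us-west-2a",
    "subnet-bbb": "us-west-2a",
    "subnet-ccc": "us-west-2b",
}

before = spans_multiple_azs(subnet_to_az, ["subnet-aaa", "subnet-bbb"])  # False
after = spans_multiple_azs(subnet_to_az, ["subnet-aaa", "subnet-ccc"])   # True
```

Adding a second subnet in the same zone, as in the `before` case, does not change the result, which mirrors why option A fails.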

188

Multi-Selectmedium

A company is migrating a legacy monolithic application to AWS and wants to improve its resilience by decoupling components. The application currently writes directly to a shared file system and uses synchronous HTTP calls between modules. Which three AWS services should the company use to achieve a more resilient, decoupled architecture? (Choose three.)

Select 3 answers

A.Amazon SQS for asynchronous message passing between application components.

B.Amazon EFS as a shared file system to replace the on-premises NAS.

C.Amazon SNS to fan-out notifications to multiple subscribers for event-driven processing.

D.AWS Lambda to run components as stateless functions with automatic scaling.

E.AWS Direct Connect to provide a dedicated network link for inter-component communication.

F.Amazon Elastic Block Store (EBS) with Multi-Attach for shared block storage between components.

AnswersA, C, D

Why this answer

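Amazon SQS is correct because it enables asynchronous message passing between application components, decoupling them so that a failure in one component does not block others. This replaces the synchronous HTTP calls, improving resilience by allowing messages to be buffered and processed independently. Amazon SNS is correct because it fans events out to multiple subscribers, so publishers do not need to know about or wait for each consumer. AWS Lambda is correct because it runs components as stateless functions that scale automatically and can be triggered by those queues and topics.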

Exam trap

The trap here is that candidates often confuse shared storage solutions (EFS, EBS Multi-Attach) with decoupling mechanisms. Shared storage still couples components through common state, whereas the correct services (SQS, SNS, Lambda) enable asynchronous, stateless, event-driven decoupling.
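The difference between a synchronous call and a queue can be shown with Python's standard library. This is a local analogy, assuming an in-process `queue.Queue` in place of SQS and a thread in place of a Lambda consumer; it is not AWS code.

```python
import queue
import threading


def run_pipeline(orders):
    """Producer hands work to a buffer; a separate worker drains it."""
    buffer = queue.Queue()
    results = []

    def worker():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: no more work
                break
            results.append(f"processed:{item}")

    consumer = threading.Thread(target=worker)
    consumer.start()
    for order in orders:
        buffer.put(order)  # returns immediately; no wait for processing
    buffer.put(None)
    consumer.join()
    return results


output = run_pipeline(["o1", "o2", "o3"])
```

The producer never calls the consumer directly, so a slow or restarted consumer delays processing without failing the producer's request.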

189

MCQhard

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The design must avoid adding custom operational scripts.

A.Amazon EFS with mount targets in multiple Availability Zones

B.S3 mounted as a POSIX file system without a file gateway

C.Instance store volumes

D.An EBS volume attached to all instances

AnswerA

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, POSIX-compliant NFSv4.1 shared file system that can be mounted concurrently across multiple Linux EC2 instances. By deploying mount targets in multiple Availability Zones, the file system remains accessible even if one AZ fails, satisfying the high-availability requirement without any custom scripts.

Exam trap

The trap here is that candidates may confuse EBS Multi-Attach (which has strict limitations and requires cluster-aware file systems) with a true shared file system, or assume that S3 with a FUSE mount is a viable POSIX alternative without considering the operational overhead and the lack of full POSIX semantics.

How to eliminate wrong answers

Option B is wrong because S3 is an object store, and mounting it as a file system through a FUSE client (e.g., s3fs-fuse) adds software the team must install and operate while still not providing full POSIX semantics such as file locking, making it unsuitable as shared file storage here. Option C is wrong because instance store volumes are ephemeral, tied to a single EC2 instance, and cannot be shared across instances or survive AZ failures. Option D is wrong because an EBS volume lives in a single Availability Zone and can be attached only to instances in that zone; Multi-Attach is limited to Provisioned IOPS (io1/io2) volumes on Nitro-based instances in the same AZ and still needs a cluster-aware file system, so it cannot serve instances in multiple AZs or survive an AZ failure.

190

MCQmedium

A.Deploy a single-node Redis cluster and rely on application-level retries when cache misses occur.

B.Configure an ElastiCache Redis replication group with automatic failover across multiple Availability Zones.

C.Move the cache into the VPC but keep it in one Availability Zone to reduce network latency.

D.Use a Memcached cluster and configure only client-side connection pooling without failover support.

AnswerB

Multi-AZ replication groups provide redundant nodes and automatic failover, improving cache resilience during AZ events.

Why this answer

Option B is correct because an ElastiCache Redis replication group with automatic failover across multiple Availability Zones ensures that if the primary node or its AZ becomes impaired, a read replica in another AZ is automatically promoted to primary. The stateless web service can keep reading cached responses during maintenance events with only a brief interruption, because the replication group's primary endpoint follows the promoted node.

Exam trap

The trap here is that candidates often confuse Memcached with Redis, assuming that Memcached also supports replication and automatic failover, or they mistakenly think a single-node Redis cluster with retries is sufficient for high availability during AZ impairments.

How to eliminate wrong answers

Option A is wrong because a single-node Redis cluster provides no redundancy; if the node or its AZ fails, the cache is completely unavailable, forcing the web service to fall back to the origin server for all requests, which defeats the purpose of a caching layer. Option C is wrong because keeping the cache in one Availability Zone does not protect against AZ impairment; a single-AZ deployment cannot automatically fail over to another node in a different AZ, so the service would lose cached data during an AZ outage. Option D is wrong because Memcached does not support replication or automatic failover; it is a distributed cache with no built-in mechanism to promote a standby node, so any node failure results in cache misses and requires client-side reconfiguration.
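The settings that distinguish option B from the others can be sketched as arguments for boto3's `create_replication_group`. The group ID, description, node type, and replica count are assumptions chosen for illustration.

```python
def replication_group_params(group_id: str, node_type: str = "cache.t3.micro",
                             replicas: int = 1) -> dict:
    """Build arguments for a Redis replication group with Multi-AZ failover."""
    if replicas < 1:
        raise ValueError("automatic failover needs at least one replica")
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "Cache with automatic failover (example)",
        "Engine": "redis",
        "CacheNodeType": node_type,
        "NumCacheClusters": replicas + 1,  # one primary plus the replicas
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
    }


cache_params = replication_group_params("web-cache")
# Could be applied with boto3 as:
#   boto3.client("elasticache").create_replication_group(**cache_params)
```

The guard on `replicas` reflects the point made against option A: with a single node there is nothing to promote.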

191

MCQeasy

A company runs the same public API in two regions (Region A and Region B), each fronted by an ALB. They want Route 53 to automatically route clients to the Region B API when Region A becomes unhealthy, with minimal configuration effort. Which Route 53 approach should they use?

A.Use a single Route 53 A record that points only to Region A’s ALB and manually update it after failures.

B.Use Route 53 latency-based routing with separate records for each region.

C.Use Route 53 failover routing with health checks for each region’s endpoint.

D.Use weighted routing and set the Region B weight to 0 to ensure it is only used when needed.

AnswerC

Failover routing works with health checks to move traffic from a primary endpoint to a secondary endpoint when the primary becomes unhealthy.

Why this answer

Route 53 failover routing with health checks is the correct choice because it automatically directs traffic to a secondary endpoint (Region B) when the primary endpoint (Region A) fails a health check. This provides active-passive failover with minimal configuration, as Route 53 monitors the health of each ALB and updates DNS responses accordingly without manual intervention.

Exam trap

The trap here is that candidates often confuse latency-based routing with failover routing, assuming latency routing alone will avoid unhealthy endpoints. Latency routing optimizes for speed; it skips an unhealthy Region only when health checks are associated with its records, and even then it does not designate a primary and a standby.

How to eliminate wrong answers

Option A is wrong because manually updating a single A record after a failure is not automated, contradicts the requirement for automatic routing, and introduces significant downtime during the manual update window. Option B is wrong because latency-based routing selects a Region by lowest latency rather than by a primary/secondary preference; without health checks attached to the records, clients would still be directed to Region A if it has lower latency, even if it is down, and with health checks the policy still gives an active-active layout rather than the requested failover to Region B. Option D is wrong because a record with weight 0 receives no traffic while any record with a non-zero weight is considered healthy, and the option configures no health checks, so Route 53 would never detect that Region A has failed; failover routing expresses the same intent directly.

192

MCQhard

Based on the exhibit, the database is manually promoted during an Availability Zone failure and the application outage lasts longer than the target. What change best improves resilience with the least operational intervention?

A.Keep the read replica and automate promotion with a runbook after CloudWatch alarms fire.

B.Convert the database to an RDS Multi-AZ deployment so a synchronous standby can fail over automatically.

C.Use a cross-Region read replica so promotion happens faster during an AZ failure.

D.Increase the application retry count and keep the current database design.

AnswerB

Multi-AZ is designed for automatic failover within the same Region and maintains a synchronous standby for high availability. The exhibit shows that the current read replica requires manual promotion and produces an outage longer than the target. Switching to Multi-AZ removes the manual step and aligns the database layer with the desired recovery time.

Why this answer

B is correct because RDS Multi-AZ synchronously replicates data to a standby in a different Availability Zone and fails over to it automatically, with no manual intervention, when an AZ failure occurs. This directly addresses the requirement to improve resilience while minimizing operational effort, as the failover is handled by AWS without any runbook execution or manual promotion.

Exam trap

The trap here is that candidates often confuse read replicas (designed for read scaling and manual promotion) with Multi-AZ deployments (designed for automatic failover), and incorrectly assume that automating a runbook for read replica promotion is equivalent to the native automatic failover of Multi-AZ.

How to eliminate wrong answers

Option A is wrong because it still requires manual or automated runbook execution to promote the read replica, which introduces operational intervention and delay, failing the 'least operational intervention' requirement. Option C is wrong because a cross-Region read replica involves asynchronous replication and manual promotion, which is slower and more complex than Multi-AZ failover, and does not meet the 'least operational intervention' goal. Option D is wrong because increasing the application retry count does not resolve the underlying database unavailability during an AZ failure; it only masks the symptom and does not improve resilience.
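For reference, converting an existing instance to Multi-AZ is a single setting on the instance. The sketch below builds arguments for boto3's `modify_db_instance`; the instance identifier is a placeholder, and applying the change immediately rather than in the maintenance window is an assumption made for the example.

```python
def multi_az_conversion(db_instance_id: str, apply_immediately: bool = True) -> dict:
    """Build arguments that turn a single-AZ RDS instance into Multi-AZ."""
    if not db_instance_id:
        raise ValueError("db_instance_id is required")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MultiAZ": True,
        "ApplyImmediately": apply_immediately,
    }


rds_params = multi_az_conversion("orders-db")
# Could be applied with boto3 as:
#   boto3.client("rds").modify_db_instance(**rds_params)
```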

193

MCQmedium

A company uses Amazon SQS and AWS Lambda to process orders. Lambda typically completes in 4 minutes, but complex orders can take up to 12 minutes. The team reports that some orders are being processed more than once. Which is the MOST likely cause and the recommended fix?

A.Enable SQS FIFO queue to prevent duplicate message delivery

B.Increase the SQS queue visibility timeout to exceed the maximum Lambda processing time

C.Reduce the Lambda function timeout to 4 minutes to match typical processing time

D.Enable SQS long polling to reduce the frequency of message retrieval

AnswerB

Setting visibility timeout above 12 minutes (the maximum processing time) prevents messages from reappearing while being processed. This eliminates the root cause of duplicate processing.

Why this answer

SQS visibility timeout defines how long a message is hidden from other consumers after it is received. If a Lambda function takes longer than the visibility timeout to process a message, the message becomes visible again and another Lambda invocation picks it up — causing duplicate processing.

The default SQS visibility timeout is 30 seconds. If processing takes 12 minutes but the visibility timeout is 30 seconds, messages reappear and are processed again. The fix is to increase the visibility timeout beyond the maximum processing time, for example to 13-15 minutes; for Lambda event source mappings, AWS guidance is to set it to at least six times the function timeout.

Exam trap

Many architects set up SQS/Lambda integrations without adjusting the visibility timeout from the default 30 seconds. When Lambda functions run longer than this, the message reappears and creates duplicates. The symptom is duplicate processing — a classic visibility timeout mismatch.

Fix the root cause (extend visibility timeout) rather than adding application-level deduplication logic.

Why the other options are wrong

SQS FIFO queues deduplicate messages sent within the deduplication window, but they have throughput limits, and a FIFO message whose visibility timeout expires is still delivered again. The root cause here is a visibility timeout mismatch, so switching to FIFO adds complexity without addressing it.

Reducing Lambda timeout to 4 minutes would cause 12-minute complex orders to fail before completing, sending them to the DLQ or causing retries. This makes the problem worse.

Long polling reduces API calls and costs for sparse queues by waiting up to 20 seconds for messages. It has no effect on visibility timeout or duplicate processing.
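The arithmetic behind the duplicate processing can be sketched as follows. The model is deliberately simplified: it assumes a consumer is always free to pick up the message the moment it becomes visible and that every attempt takes the full processing time. The six-times multiplier reflects AWS guidance for Lambda event source mappings; both function names are illustrative.

```python
import math


def worst_case_deliveries(processing_seconds: int, visibility_timeout_seconds: int) -> int:
    """Upper bound on deliveries that start before the first attempt finishes."""
    if processing_seconds <= 0 or visibility_timeout_seconds <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(processing_seconds / visibility_timeout_seconds))


def suggested_visibility_timeout(function_timeout_seconds: int) -> int:
    """Six times the Lambda function timeout, per AWS guidance."""
    return 6 * function_timeout_seconds


with_default = worst_case_deliveries(12 * 60, 30)     # 24 overlapping deliveries
with_fix = worst_case_deliveries(12 * 60, 15 * 60)    # 1 delivery
suggested = suggested_visibility_timeout(15 * 60)      # 5400 seconds
```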

194

MCQmedium

A trading dashboard stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The team wants the control to be enforceable during normal operations.

A.An EBS snapshot schedule

B.S3 Cross-Region Replication with versioning enabled

C.S3 lifecycle transition to Glacier Flexible Retrieval

D.A CloudFront distribution

AnswerB

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) automatically replicates objects to a destination bucket in a different AWS Region, providing a disaster recovery copy. Versioning must be enabled on both the source and destination buckets for CRR to function. The control is enforceable during normal operations because replication is defined as a rule on the source bucket: once the rule is in place, S3 replicates new objects asynchronously without any action from the uploaders. This meets the business requirement for an enforceable, automated DR copy.
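For illustration, a minimal replication configuration for the source bucket might look like the dictionary built below. The role ARN, bucket ARN, and rule ID are hypothetical placeholders, and the sketch assumes versioning is already enabled on both buckets.

```python
def replication_configuration(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a minimal replication configuration for the source bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "dr-copy",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: the rule applies to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


config = replication_configuration(
    "arn:aws:iam::111122223333:role/example-replication-role",  # placeholder
    "arn:aws:s3:::example-dr-bucket",  # placeholder
)
print(config["Rules"][0]["Status"])
```

With boto3, this dictionary could be supplied as `ReplicationConfiguration` to `put_bucket_replication` on the source bucket.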

Exam trap

The trap here is that candidates confuse S3 Cross-Region Replication with S3 lifecycle policies or Glacier transitions, thinking that moving data to a cold storage class in the same region provides DR, when in fact DR requires a copy in a separate geographic region.

How to eliminate wrong answers

Option A is wrong because EBS snapshots are for Amazon EC2 block storage volumes, not for S3 objects; they cannot replicate data stored in S3 buckets. Option C is wrong because an S3 lifecycle transition to Glacier Flexible Retrieval only moves objects to a lower-cost storage class within the same Region; it does not create a copy in another AWS Region for disaster recovery. Option D is wrong because CloudFront is a content delivery network (CDN) that caches content at edge locations for low-latency access, not a replication mechanism for creating a regional DR copy.

195

MCQmedium

An application writes to an Amazon Aurora DB cluster. After a planned Aurora failover, the application experiences several minutes of connection errors. The logs show the application continues connecting to the specific DB instance endpoint that was the primary before the failover. What change most directly improves resilience during Aurora failovers?

A.Update the application to use the Aurora cluster writer endpoint for write traffic so it always resolves to the current writer instance.

B.Increase Aurora storage autoscaling so failovers are unnecessary.

C.Point both reads and writes to the Aurora reader endpoint to keep the DNS name the same.

D.Disable Aurora failover capability so the cluster never switches writer instances.

AnswerA

During failover, Aurora changes which underlying DB instance is the writer. The cluster writer endpoint (for the cluster) always resolves to the current writer. Using the writer endpoint prevents the application from being pinned to an old instance endpoint that may stop accepting writes after failover.

Why this answer

The Aurora cluster writer endpoint always resolves to the current primary DB instance, even after a failover. By using this endpoint instead of a specific instance endpoint, the application automatically reconnects to the new writer without manual intervention or connection errors.
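One way to catch this misconfiguration early is to check which kind of endpoint the application is configured with. The sketch below is a heuristic based on the usual Aurora hostname conventions (`cluster-` for the writer endpoint, `cluster-ro-` for the reader endpoint, `cluster-custom-` for custom endpoints); the function name and the example hostnames are hypothetical.

```python
def aurora_endpoint_kind(hostname: str) -> str:
    """Classify an Aurora hostname by its second DNS label (naming heuristic)."""
    labels = hostname.split(".")
    if len(labels) < 2:
        return "unknown"
    second = labels[1]
    # Check the more specific prefixes before the generic "cluster-" prefix.
    if second.startswith("cluster-ro-"):
        return "reader"
    if second.startswith("cluster-custom-"):
        return "custom"
    if second.startswith("cluster-"):
        return "writer"
    return "instance"


writer = "mydb.cluster-abc123example.us-east-1.rds.amazonaws.com"
pinned = "mydb-instance-1.abc123example.us-east-1.rds.amazonaws.com"
print(aurora_endpoint_kind(writer), aurora_endpoint_kind(pinned))
```

A startup check could warn when write traffic is configured with anything other than a `writer` endpoint.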

Exam trap

The trap here is that candidates may think using any Aurora endpoint (like the reader endpoint) is sufficient, but they must understand that only the cluster writer endpoint guarantees write availability after a failover, while the reader endpoint is strictly for read traffic.

How to eliminate wrong answers

Option B is wrong because increasing storage autoscaling does not prevent failovers; failovers occur due to instance health or AZ issues, not storage capacity. Option C is wrong because the reader endpoint is designed for read-only traffic and does not accept write connections, so pointing writes to it would cause immediate failures. Option D is wrong because disabling failover capability would make the cluster unable to recover from primary instance failures, leading to prolonged downtime.

196

MCQmedium

A company uses Amazon RDS with automated backups enabled (retention period: 7 days). At 10:30 UTC, a bad release corrupts specific rows in a production table. The team detects the issue at 11:10 UTC. They need to revert the database state to what it was from 10:00–10:30 UTC, recover quickly, and minimize risk to the currently running workload. What is the best option?

A.Reboot the DB instance and rely on the corrupted data being overwritten by storage-level changes.

B.Perform a point-in-time restore to a new DB instance using a timestamp before the corruption (for example, a time within 10:00–10:30 UTC).

C.Restore only the most recent automated backup snapshot, even if it is after the corruption timestamp.

D.Create a read replica of the current DB instance and overwrite the corrupted table using SELECT queries from the replica.

AnswerB

With automated backups enabled, RDS supports point-in-time recovery (PITR) within the retention window. Restoring to a timestamp before the corruption creates a consistent copy from that moment. The team can validate the restored DB and then cut over application traffic, reducing risk to the currently running workload.

Why this answer

Amazon RDS automated backups enable point-in-time recovery (PITR) to any second within the retention window. By restoring to a timestamp between 10:00 and 10:30 UTC, you recover the database to a state before the corruption occurred, without affecting the current production instance. This minimizes risk to the running workload because the restore creates a new DB instance, leaving the original untouched until you are ready to switch.
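A minimal sketch of preparing such a restore request follows. It validates that the chosen timestamp is before the corruption and inside the retention window, then returns the arguments for a point-in-time restore to a new instance. The instance identifiers and the calendar date are hypothetical; only the times of day come from the scenario.

```python
from datetime import datetime, timedelta, timezone


def pitr_request(source_id, target_id, restore_time, corruption_time, now,
                 retention_days=7):
    """Validate a restore timestamp and build the restore arguments."""
    if restore_time.tzinfo is None:
        raise ValueError("restore_time must be timezone-aware (UTC)")
    if restore_time >= corruption_time:
        raise ValueError("restore_time must be before the corruption")
    if restore_time < now - timedelta(days=retention_days):
        raise ValueError("restore_time is outside the backup retention window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,  # a new instance is created
        "RestoreTime": restore_time,
    }


day = datetime(2024, 1, 15, tzinfo=timezone.utc)  # hypothetical date
request = pitr_request(
    "prod-db",            # hypothetical source instance
    "prod-db-restored",   # hypothetical new instance
    restore_time=day.replace(hour=10, minute=29),
    corruption_time=day.replace(hour=10, minute=30),
    now=day.replace(hour=11, minute=10),
)
print(request["RestoreTime"].isoformat())
```

With boto3, these keyword arguments correspond to `restore_db_instance_to_point_in_time`. The original instance keeps running while the restored copy is validated.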

Exam trap

The trap here is that candidates may confuse automated backup snapshots (which are full backups taken once per day) with point-in-time recovery (which uses transaction logs to restore to any point within the retention window), leading them to choose Option C instead of B.

How to eliminate wrong answers

Option A is wrong because rebooting a DB instance does not revert data; it only restarts the database engine and does not undo committed transactions or storage-level changes. Option C is wrong because restoring the most recent automated backup snapshot includes the corrupted data, so it does not achieve the goal of reverting to a pre-corruption state. Option D is wrong because a read replica mirrors the current (corrupted) data; using SELECT queries from it cannot overwrite the corrupted table with clean data, and it does not provide a mechanism to roll back changes.

197

MCQmedium

A service processes customer payments from a message queue. Because the queue provides at-least-once delivery, the same payment message can be delivered more than once if the consumer times out before committing its state. Currently, the service sometimes charges the customer twice. Which design change most directly prevents duplicate charges while still allowing safe retries?

A.Delete the message from the queue immediately after receive to prevent redelivery.

B.Make the payment processing idempotent by recording an idempotency key for each payment and ensuring repeated deliveries do not apply the charge twice.

C.Increase the queue visibility timeout to a very large value so messages rarely reappear.

D.Switch to a single-threaded consumer with one worker so messages are processed in order.

AnswerB

Idempotency ensures that reprocessing the same payment message has no additional side effects. Recording an idempotency key and using conditional logic prevents duplicate charges.

Why this answer

Option B is correct because making payment processing idempotent using an idempotency key ensures that even if the same message is delivered multiple times due to at-least-once delivery semantics, the charge is applied only once. The consumer records a unique key (e.g., payment ID) in a durable store (like DynamoDB or Redis) and checks it before processing; if the key already exists, the charge is skipped. This directly prevents duplicate charges while still allowing safe retries, as the consumer can safely reprocess messages without side effects.
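The check-then-charge pattern described above can be sketched with an in-memory store. The class, the message fields, and the dictionary standing in for the durable store are all illustrative assumptions; a real consumer would record the key with an atomic conditional write in a durable store so that two concurrent deliveries cannot both pass the check.

```python
class PaymentProcessor:
    """Applies each payment at most once, keyed by an idempotency key."""

    def __init__(self):
        self._processed = {}  # idempotency key -> result (stand-in for a durable store)
        self.charges = []     # side effects actually applied

    def handle(self, message: dict) -> str:
        key = message["payment_id"]  # hypothetical field used as the idempotency key
        if key in self._processed:
            # Redelivery: return the earlier result without charging again.
            return self._processed[key]
        self.charges.append((key, message["amount_cents"]))
        result = f"charged:{key}"
        self._processed[key] = result
        return result


processor = PaymentProcessor()
msg = {"payment_id": "pay-001", "amount_cents": 2500}
first = processor.handle(msg)
second = processor.handle(msg)  # simulated duplicate delivery
print(first, second, len(processor.charges))
```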

Exam trap

The trap here is that candidates often confuse at-least-once delivery with exactly-once delivery and assume that increasing visibility timeouts or using single-threaded consumers will prevent duplicates, when in fact only idempotency guarantees safe retries without duplicate charges.

How to eliminate wrong answers

Option A is wrong because deleting the message immediately after receive violates the at-least-once delivery contract and can lead to message loss if the consumer crashes before processing completes. Option C is wrong because increasing the visibility timeout to a very large value only delays redelivery but does not prevent it entirely; if the consumer fails, the message will reappear after the timeout, still risking duplicate charges. Option D is wrong because single-threaded processing does not eliminate duplicates from at-least-once delivery; the same message can still be redelivered if the consumer times out, and ordering alone does not prevent duplicate charges.

198

MCQmedium

A media company stores original uploads in an S3 bucket. They must recover from accidental overwrites/deletes and also recover quickly from a full Region outage. The required RPO is about 1 hour. Which configuration best meets these requirements?

A.Enable an S3 lifecycle policy to transition objects to Glacier after 7 days without enabling versioning.

B.Enable S3 cross-Region replication (CRR) but leave the bucket without versioning enabled.

C.Enable S3 versioning and configure cross-Region replication to a bucket in another Region.

D.Rely on frequent EBS snapshots of a temporary cache used during uploads.

AnswerC

Versioning enables recovery from accidental overwrites/deletes, and CRR provides near-current copies for Region-level disaster recovery.

Why this answer

Option C is correct because enabling S3 versioning protects against accidental overwrites and deletes by preserving all object versions, while cross-Region replication (CRR) asynchronously replicates objects to a bucket in another Region, providing recovery from a full Region outage. Once the replication rule is in place, CRR replicates new object versions as they are written, usually within minutes, which fits the ~1-hour RPO and keeps a durable copy in a second Region.

Exam trap

AWS often tests the misconception that CRR can work without versioning. S3 requires versioning on both the source and destination buckets before replication can be configured, and candidates may overlook that versioning is also the mechanism that protects against accidental overwrites and deletes.

How to eliminate wrong answers

Option A is wrong because a lifecycle policy to transition objects to Glacier after 7 days does not protect against accidental overwrites or deletes (versioning is required for that), nor does it provide cross-Region recovery; Glacier is a cold storage class in the same Region, not a replication mechanism. Option B is wrong because S3 cross-Region replication requires versioning on both the source and destination buckets; a replication rule cannot be configured on an unversioned bucket, and without versioning there is no protection against overwrites or deletes. Option D is wrong because EBS snapshots of a temporary cache used during uploads do not protect the original S3 objects from overwrites or deletes; they capture EC2 volume data rather than the bucket's contents, so they do not help recover the uploads after a Region outage.

199

MCQeasy

Your web tier runs on an EC2 Auto Scaling group behind an Application Load Balancer (ALB). You currently deploy both the ALB and the Auto Scaling group in only two Availability Zones (AZs). One AZ fails. What is the best configuration change to improve resilience?

A.Reduce health check timeouts so instances are replaced sooner in the failed AZ.

B.Add a third Availability Zone so the ALB and Auto Scaling group span at least three AZs.

C.Enable instance scale-in protection to stop the ASG from terminating unhealthy instances.

D.Switch the ALB to an internal Network Load Balancer (NLB) to avoid cross-AZ traffic.

AnswerB

An AZ failure typically reduces available capacity to the other AZs. Spreading the ALB subnets and ASG instances across at least three AZs reduces the impact of losing any single AZ and helps ensure the remaining AZs can continue serving traffic.

Why this answer

Adding a third Availability Zone (AZ) ensures that the Application Load Balancer (ALB) and Auto Scaling group (ASG) can continue to route traffic and maintain capacity even if one AZ fails. With only two AZs, a single AZ failure removes half of the fleet, so the surviving AZ must carry the full load on its own until replacement instances launch. Spreading across three AZs provides a higher resilience margin: losing one AZ removes only about one-third of capacity, and the remaining two AZs absorb the load while the failed AZ recovers.

Exam trap

The trap here is that candidates think reducing health check timeouts or enabling scale-in protection can compensate for an AZ failure. An ALB only requires subnets in two AZs, so the two-AZ design is valid, but neither tweak restores the capacity lost with the failed AZ; spreading across three AZs is what reduces the impact of any single AZ outage.
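The capacity arithmetic behind this can be worked through with two small helpers. The function names and the 12-instance requirement are illustrative assumptions; the model also assumes instances are spread evenly across AZs.

```python
import math


def surviving_fraction(az_count: int) -> float:
    """Fraction of an evenly spread fleet that remains after losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    return (az_count - 1) / az_count


def instances_needed(required: int, az_count: int) -> int:
    """Total instances so the surviving AZs still provide `required` instances."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    per_az = math.ceil(required / (az_count - 1))
    return per_az * az_count


for azs in (2, 3):
    print(azs, round(surviving_fraction(azs), 2), instances_needed(12, azs))
```

Under these assumptions, a workload that needs 12 instances after an AZ loss must run 24 instances across two AZs but only 18 across three, which is why the third AZ improves both resilience and cost efficiency.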

How to eliminate wrong answers

Option A is wrong because reducing health check timeouts only accelerates the replacement of unhealthy instances in the failed AZ, but does not prevent the loss of capacity from that AZ; the ASG will still be unable to launch instances in a failed AZ, so the fleet remains degraded. Option C is wrong because instance scale-in protection prevents termination of instances during scale-in events, but does not protect against AZ failure; unhealthy instances in a failed AZ will still be terminated by the ASG health check process, and scale-in protection does not help maintain capacity. Option D is wrong because switching to an internal NLB does not improve resilience to AZ failure; NLBs also operate within AZs and cross-AZ traffic is not the issue—the core problem is insufficient AZ count to absorb a single AZ outage.

200

MCQmedium

A fintech startup uses AWS to run a web API and a PostgreSQL database. They must meet an RPO of 15 minutes and an RTO of 2 hours for a Region-wide disaster. Budget allows running a small, always-on set of infrastructure in a secondary Region, but not full production capacity. The team wants a DR approach that is regularly testable without large manual effort. Which disaster recovery strategy is the best fit?

A.Pilot light: replicate databases and store backups, keep only minimal infrastructure in the secondary Region, and scale up fully during failover.

B.Warm standby: keep a scaled-down application environment and database replication active in the secondary Region, using automated failover controls.

C.Backup and restore only: rely on daily automated backups and restore into the secondary Region during an incident.

D.Multi-site active-active: run both Regions at full capacity and route live traffic to both simultaneously.

AnswerB

Warm standby aligns with moderate RTO requirements by having ready-to-run resources plus continuous replication to meet the RPO target during failover.

Why this answer

Warm standby (B) is the best fit because it maintains a scaled-down but fully functional application environment in the secondary Region with active database replication. Cross-Region replication is asynchronous (for example, a cross-Region read replica or PostgreSQL streaming replication), and its lag normally stays well below the 15-minute RPO. The 2-hour RTO is met through automated failover controls such as Route 53 health checks and AWS Lambda automation, since the standby only needs to be promoted and scaled up. This approach allows regular testing without large manual effort, and the budget constraint is satisfied by running only minimal compute resources (e.g., smaller EC2 instances) in the secondary Region.

Exam trap

The trap here is that candidates often confuse pilot light with warm standby. In pilot light, the application servers are not running and must be provisioned and configured during failover, which lengthens recovery and makes regular testing more manual; warm standby already has the application running at reduced size and only needs scaling.

How to eliminate wrong answers

Option A is wrong because pilot light keeps only minimal infrastructure (e.g., database replicas and storage) but does not maintain a running application environment; provisioning and configuring the application tier during failover puts the 2-hour RTO at risk, and testing requires more manual steps. Option C is wrong because backup and restore relies on daily backups, which cannot achieve a 15-minute RPO, and restoring into a new environment is the slowest recovery path, making the 2-hour RTO hard to meet. Option D is wrong because multi-site active-active requires full production capacity in both Regions, which exceeds the budget constraint of running only a small, always-on set of infrastructure in the secondary Region.

201

MCQhard

A warehouse integration service must process every event at least once, but duplicate processing is acceptable if the consumer handles idempotency. Which eventing approach is most suitable? The design must avoid adding custom operational scripts.

A.Use CloudFront signed URLs

B.Use Amazon SQS standard queue and design consumers to be idempotent

C.Use UDP messages sent directly to workers

D.Use an in-memory queue on one EC2 instance

AnswerB

SQS standard queues provide at-least-once delivery and high throughput; consumers must handle occasional duplicates.

Why this answer

Amazon SQS standard queues provide at-least-once delivery, ensuring every event is processed at least once, with the possibility of duplicates. Designing consumers to be idempotent handles duplicates without requiring custom scripts, aligning with the requirement to avoid operational overhead. This approach is serverless, scalable, and fits the warehouse integration use case.

Exam trap

The trap here is that candidates may choose UDP (Option C) thinking it is lightweight and fast, but they overlook its lack of delivery guarantees, which fails the 'process every event at least once' requirement.

How to eliminate wrong answers

Option A is wrong because CloudFront signed URLs are for controlling access to content, not for event processing or messaging; they do not provide at-least-once delivery guarantees. Option C is wrong because UDP is a connectionless, unreliable protocol that does not guarantee message delivery, making it unsuitable for processing every event at least once. Option D is wrong because an in-memory queue on a single EC2 instance introduces a single point of failure and requires custom scripts for management, violating the 'avoid adding custom operational scripts' constraint.

202

MCQmedium

A content publishing system uses Lambda functions that call an unreliable third-party API. Failed events must be retained for later investigation after retries are exhausted. What should be configured? The team wants the control to be enforceable during normal operations.

A.Lambda reserved concurrency set to zero

B.A larger deployment package

C.CloudFront error pages

D.A Lambda dead-letter queue or failure destination

AnswerD

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

Option D is correct because a Lambda dead-letter queue (DLQ) or failure destination allows you to capture events that have exhausted all retry attempts from an asynchronous invocation. This ensures failed events are retained in Amazon SQS or SNS for later investigation, providing enforceable control during normal operations without impacting the function's ability to process successful events.
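As a hedged sketch, the asynchronous invocation settings for an on-failure destination could be assembled as below. The function name and queue ARN are hypothetical placeholders; Lambda accepts between 0 and 2 retry attempts for asynchronous invocations.

```python
def async_failure_config(function_name: str, failure_queue_arn: str,
                         max_retries: int = 2) -> dict:
    """Build arguments that send exhausted async events to a failure destination."""
    if not 0 <= max_retries <= 2:
        raise ValueError("async retry attempts must be between 0 and 2")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "DestinationConfig": {
            # Events that still fail after the retries are sent here.
            "OnFailure": {"Destination": failure_queue_arn},
        },
    }


invoke_config = async_failure_config(
    "publish-content",  # hypothetical function name
    "arn:aws:sqs:us-east-1:111122223333:publish-failures",  # placeholder ARN
)
print(invoke_config["DestinationConfig"]["OnFailure"]["Destination"])
```

With boto3, these keyword arguments correspond to `put_function_event_invoke_config`; the function's execution role would also need permission to send to the destination.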

Exam trap

The trap here is that candidates may confuse Lambda's synchronous invocation error handling (where DLQs are not supported) with asynchronous invocation, or mistakenly think that increasing function resources (like deployment package size) can improve reliability against external API failures.

How to eliminate wrong answers

Option A is wrong because setting Lambda reserved concurrency to zero would completely disable the function, preventing any invocations and thus failing to process events at all, which does not address the need to retain failed events after retries. Option B is wrong because a larger deployment package does not affect error handling or retention of failed events; it only increases the function's size, potentially impacting cold start times and deployment limits. Option C is wrong because CloudFront error pages are used for customizing HTTP error responses for web distributions, not for capturing or retaining Lambda invocation failures from asynchronous API calls.

203

Multi-Selectmedium

A SaaS application is deployed in us-east-1 and us-west-2 behind separate ALBs. The business wants DNS to send new clients to the primary Region when it is healthy and automatically fail over to the secondary Region when the primary endpoint is unhealthy. Which two Route 53 settings are required? Select two.

Select 2 answers

A.Use a failover routing policy with a primary and secondary record.

B.Create a health check and associate it with the primary endpoint.

C.Use weighted routing with a 50/50 traffic split between both Regions.

D.Use latency-based routing so clients always choose the fastest Region.

E.Use a geolocation policy without health checks.

AnswersA, B

Failover routing is designed specifically to send traffic to a secondary endpoint when the primary becomes unhealthy.

Why this answer

A failover routing policy is correct because it allows you to designate one record as primary and another as secondary. Route 53 routes traffic to the primary record as long as it is healthy and automatically fails over to the secondary record when the primary is unhealthy. The health check associated with the primary endpoint is what tells Route 53 when the primary is unhealthy; without it, Route 53 has no signal to trigger the failover. Together, these two settings meet the requirement to send new clients to the primary Region when healthy and fail over to the secondary Region.
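A minimal sketch of the two records follows, expressed as the change batch that Route 53 accepts. The domain name, ALB DNS names, hosted zone IDs, and health check ID are hypothetical placeholders.

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover alias record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",  # must be unique per record
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's zone ID, not your own zone
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Changes": [
        failover_record("app.example.com.", "PRIMARY",
                        "primary-alb.us-east-1.elb.amazonaws.com",
                        "ZEXAMPLEEAST", health_check_id="hc-primary-example"),
        failover_record("app.example.com.", "SECONDARY",
                        "secondary-alb.us-west-2.elb.amazonaws.com",
                        "ZEXAMPLEWEST"),
    ]
}
print([c["ResourceRecordSet"]["Failover"] for c in change_batch["Changes"]])
```

With boto3, `change_batch` could be passed as `ChangeBatch` to `change_resource_record_sets` for your hosted zone.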

Exam trap

The trap here is that candidates often confuse failover routing with weighted or latency-based routing, thinking any multi-region setup with health checks will automatically fail over, but only failover routing policy provides the explicit primary/secondary failover behavior required.

204

MCQhard

A payments API uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The architecture review board prefers a managed AWS-native control.

A.A FIFO queue without a redrive policy

B.A dead-letter queue with an appropriate maxReceiveCount

C.A larger message retention period only

D.Short polling instead of long polling

AnswerB

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

A dead-letter queue (DLQ) with an appropriate maxReceiveCount is the correct AWS-native solution for handling poison messages. When a message is repeatedly received from an SQS queue but fails processing, it is considered a poison message. By configuring a DLQ and setting a maxReceiveCount (e.g., 3 or 5), the message is automatically moved to the DLQ after exceeding that threshold, preventing it from blocking further retries and allowing the main queue to process valid messages.
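The redrive policy described above is stored as a JSON string in the source queue's attributes. The sketch below builds it; the DLQ ARN is a placeholder and the threshold of 5 is only an example value.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build the source-queue attribute that routes poison messages to a DLQ."""
    if max_receive_count < 1:
        raise ValueError("maxReceiveCount must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy serialized as a JSON string.
    return {"RedrivePolicy": json.dumps(policy)}


redrive = redrive_attributes("arn:aws:sqs:us-east-1:111122223333:payments-dlq")
print(redrive["RedrivePolicy"])
```

With boto3, `redrive` could be passed as `Attributes` to `set_queue_attributes` on the main queue.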

Exam trap

The trap here is that candidates may confuse poison message handling with ordering or polling optimizations, and incorrectly choose FIFO queues or short polling, not realizing that only a DLQ with a redrive policy isolates repeatedly failing messages.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not automatically handle poison messages; it only ensures strict ordering and exactly-once processing, but failed messages remain in the queue and continue to block retries. Option C is wrong because increasing the message retention period only keeps messages longer in the queue, but does nothing to isolate or remove poison messages that are repeatedly failing. Option D is wrong because short polling (returning immediately even if no messages are available) versus long polling (waiting for messages) affects latency and cost, but does not address the poison message problem; poison messages are a content/processing issue, not a polling mechanism issue.

205

MCQmedium

A patient portal receives bursts of orders that sometimes overwhelm a downstream fulfilment service. The architecture must absorb spikes and retry processing without losing requests. Which service should be placed between the web tier and fulfilment workers? The design must avoid adding custom operational scripts.

A.AWS WAF

B.Amazon CloudFront

C.Amazon SQS queue

D.Amazon Route 53 weighted routing

AnswerC

SQS decouples producers and consumers, buffers bursts, and supports retries through visibility timeout and dead-letter queues.

Why this answer

Amazon SQS is the correct choice because it acts as a durable, fully managed message buffer that decouples the web tier from the fulfilment workers. When bursts of orders arrive, SQS queues the messages and allows workers to poll at their own pace, absorbing spikes without data loss. The built-in retry logic (visibility timeout and dead-letter queue) ensures failed processing attempts are automatically retried, and no custom operational scripts are needed.
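The buffering effect can be illustrated with a toy simulation that uses an in-memory deque in place of the queue. The function name, the arrival pattern, and the worker capacity are illustrative assumptions; this is a model of the idea, not of SQS itself.

```python
from collections import deque


def simulate(arrivals_per_tick, worker_capacity):
    """Show a queue absorbing a burst while workers drain at a fixed rate."""
    if worker_capacity <= 0:
        raise ValueError("worker_capacity must be positive")
    queue = deque()
    pending = list(arrivals_per_tick)
    processed = peak_depth = ticks = 0
    while pending or queue:
        arrivals = pending.pop(0) if pending else 0
        queue.extend(range(arrivals))          # producers enqueue the burst
        peak_depth = max(peak_depth, len(queue))
        for _ in range(min(worker_capacity, len(queue))):
            queue.popleft()                    # workers pull at their own pace
            processed += 1
        ticks += 1
    return {"processed": processed, "peak_depth": peak_depth, "ticks": ticks}


stats = simulate([50, 0, 0], worker_capacity=10)
print(stats)
```

In this run, a burst of 50 orders is fully processed over five ticks with no request dropped, even though the workers handle only 10 per tick.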

Exam trap

The trap here is that candidates often confuse decoupling with caching or DNS-level distribution, picking CloudFront or Route 53 because they think 'absorbing spikes' means scaling web servers, but the question specifically requires buffering and retry without custom scripts, which only a queue service like SQS provides.

How to eliminate wrong answers

Option A is wrong because AWS WAF is a web application firewall that filters HTTP/S traffic based on rules (e.g., SQL injection, XSS); it does not buffer or retry messages between tiers. Option B is wrong because Amazon CloudFront is a content delivery network (CDN) that caches and accelerates static/dynamic content at edge locations; it cannot queue or retry asynchronous order processing. Option D is wrong because Amazon Route 53 weighted routing distributes DNS traffic across multiple endpoints based on weights; it provides load balancing at the DNS level but does not absorb spikes or provide retry mechanisms for message processing.

206

Multi-Selecthard

A regional web application for a content publishing system must fail over automatically to a secondary Region if the primary endpoint becomes unhealthy. Which two services or features are required? The design must avoid adding custom operational scripts.

Select 2 answers

A.AWS Organizations service control policies

B.Route 53 failover routing with health checks

C.S3 Transfer Acceleration

D.A deployed standby application stack in the secondary Region

AnswersB, D

Route 53 can monitor endpoint health and return the standby endpoint when the primary is unhealthy.

Why this answer

Route 53 failover routing with health checks is required because it automatically directs traffic away from an unhealthy primary endpoint to a secondary endpoint, enabling cross-region failover without custom scripts. A deployed standby application stack in the secondary Region is necessary to serve traffic when the primary fails, as Route 53 can only route to healthy endpoints that are actually running.

Exam trap

The trap here is that candidates often assume Route 53 alone is sufficient, forgetting that the secondary Region must have a fully deployed and running application stack to receive traffic after failover.

207

Multi-Selectmedium

A company is designing a highly available web application on AWS. The application runs on Amazon EC2 instances behind an Application Load Balancer (ALB) and uses an Amazon RDS Multi-AZ DB instance. Which three design choices would improve the application's resilience against an AWS Availability Zone failure? (Choose three.)

Select 3 answers

A.Deploy EC2 instances across at least two Availability Zones in the same AWS Region.

B.Configure the ALB as a Network Load Balancer for faster failover.

C.Enable Amazon RDS Multi-AZ deployment for automatic failover to a standby in a different Availability Zone.

D.Use Amazon Route 53 health checks with a failover routing policy to redirect traffic to a different Region.

E.Store application session data in Amazon ElastiCache for Redis with replication across two Availability Zones.

F.Provision EC2 instances in a single Availability Zone and use Auto Scaling to replace failed instances.

Why this answer

Deploying EC2 instances across at least two Availability Zones (AZs) ensures that if one AZ fails, the ALB can route traffic to healthy instances in the other AZ, maintaining application availability. This is a fundamental pattern for building resilient architectures on AWS, as it eliminates the single point of failure at the AZ level.

Exam trap

The trap here is that candidates often confuse AZ-level failures with Regional failures and incorrectly select cross-Region solutions like Route 53 failover routing, which is unnecessary and adds latency for an AZ-level scenario.

208

MCQmedium

Your web application is deployed in two AWS Regions (Region A and Region B). You want Route 53 to automatically fail over DNS traffic from Region A to Region B when Region A is unhealthy. The failover decision must be based on health checks that verify whether the application in Region A is reachable. Which Route 53 routing configuration best meets these requirements?

A.Latency-based routing with regional aliases to split traffic based on measured latency.

B.Geolocation routing using country-based routing policies.

C.Failover routing using a primary record with an associated health check for Region A and a secondary record for Region B.

D.Weighted routing with weights set to 100 for Region A and 0 for Region B.

AnswerC

Route 53 failover routing is designed for active/standby patterns. You configure the Region A record as primary with a health check. When that health check fails, Route 53 automatically returns the Region B (secondary) record, enabling health-check-driven regional failover.

Why this answer

Option C is correct because Route 53 failover routing allows you to create a primary record with an associated health check for Region A and a secondary record for Region B. When the health check for Region A fails, Route 53 automatically returns the secondary record's IP address, directing traffic to Region B. This directly meets the requirement for automatic failover based on application reachability.

Exam trap

The trap here is that candidates often confuse failover routing with weighted routing, mistakenly thinking that setting weights to 100/0 will achieve failover, but weighted routing does not automatically adjust weights based on health checks.

How to eliminate wrong answers

Option A is wrong because latency-based routing directs traffic to the region with the lowest latency, not based on health checks or failover logic; it does not automatically fail over when a region becomes unhealthy. Option B is wrong because geolocation routing directs traffic based on the geographic location of the user, not on the health of the application endpoint; it cannot perform automatic failover between regions. Option D is wrong because weighted routing distributes traffic based on assigned weights; setting weights to 100 for Region A and 0 for Region B would send all traffic to Region A and never fail over to Region B, even if Region A is unhealthy.

209

MCQmedium

A.Enable an S3 lifecycle policy to transition objects to Glacier after 7 days without enabling versioning.

B.Enable S3 cross-Region replication (CRR) but leave the bucket without versioning enabled.

C.Enable S3 versioning and configure cross-Region replication to a bucket in another Region.

D.Rely on frequent EBS snapshots of a temporary cache used during uploads.

AnswerC

Versioning enables recovery from accidental overwrites/deletes, and CRR provides near-current copies for Region-level disaster recovery.

Why this answer

Option C is correct because enabling S3 versioning protects against accidental overwrites and deletes by preserving all object versions, while cross-Region replication (CRR) asynchronously replicates objects to a bucket in another Region, enabling recovery from a full Region outage. CRR fits an RPO of about 1 hour because most objects replicate within about 15 minutes, although replication can occasionally take longer; S3 Replication Time Control can be added when a predictable replication window is required. Versioning allows earlier object versions to be restored after an unwanted change.

Exam trap

The trap here is that candidates often assume CRR alone is sufficient for data protection, overlooking that without versioning, overwrites and deletes are permanent and cannot be recovered, which directly violates the requirement to recover from accidental overwrites/deletes.

How to eliminate wrong answers

Option A is wrong because a lifecycle policy that transitions objects to Glacier after 7 days does not protect against accidental overwrites or deletes (no versioning), and the transitioned objects stay in the same Region, so there is no copy to recover from during a Region outage. Option B is wrong because CRR without versioning cannot recover from accidental overwrites or deletes, as overwrites permanently replace the object and deletes remove it entirely, leaving no previous versions to restore. Option D is wrong because EBS snapshots of a temporary cache are not designed for S3 object recovery; they capture block-level changes of an EC2 instance volume, not the S3 bucket's object state, and do not provide cross-Region durability or protection against S3-specific overwrites/deletes.
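
To make the versioning argument concrete, here is a minimal in-memory model of a versioned bucket. It is a toy, not the S3 API: the class name, method names, and integer version IDs are all assumptions made for illustration. It shows why an overwrite or a simple delete stays recoverable once versions are kept.

```python
class VersionedBucket:
    """Toy in-memory model of a versioned bucket (not the S3 API)."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, body or None)
        self._counter = 0

    def _append(self, key, body):
        self._counter += 1
        self._versions.setdefault(key, []).append((self._counter, body))
        return self._counter

    def put(self, key, body):
        # An overwrite adds a new version instead of replacing the old one.
        return self._append(key, body)

    def delete(self, key):
        # A simple delete adds a delete marker (body None); old versions stay.
        return self._append(key, None)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            return versions[-1][1] if versions else None
        for vid, body in versions:
            if vid == version_id:
                return body
        return None


bucket = VersionedBucket()
v1 = bucket.put("report.csv", "good data")
bucket.put("report.csv", "bad overwrite")
bucket.delete("report.csv")
print(bucket.get("report.csv"))      # None (the delete marker is current)
print(bucket.get("report.csv", v1))  # good data
```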

210

MCQmedium

A trading dashboard uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The design must avoid adding custom operational scripts.

A.A single-AZ Aurora cluster

B.Aurora Global Database

C.Manual snapshots copied monthly

D.An ElastiCache Redis replica

AnswerB

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is the correct choice because it provides a fully managed cross-Region disaster recovery solution with a typical RPO of 1 second or less, using storage-based replication that does not require custom scripts. This meets the low RPO requirement while avoiding operational overhead, as replication is handled automatically by the Aurora storage layer.

Exam trap

The trap here is that candidates may confuse cross-Region read replicas (which require manual promotion and scripting) with Aurora Global Database, which provides automated, low-latency replication without custom operational scripts.

How to eliminate wrong answers

Option A is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, offering no disaster recovery across Regions. Option C is wrong because manual snapshots copied monthly result in an RPO of up to one month, which is far too high for a trading dashboard requiring low RPO. Option D is wrong because an ElastiCache Redis replica is an in-memory cache, not a database with persistent cross-Region replication, and it does not provide the required disaster recovery for Aurora MySQL data.

211

MCQmedium

A ticket booking system stores uploaded documents in S3. The business requires a copy in another AWS Region for disaster recovery. What should be configured? The design must avoid adding custom operational scripts.

A.S3 lifecycle transition to Glacier Flexible Retrieval

B.An EBS snapshot schedule

C.S3 Cross-Region Replication with versioning enabled

D.A CloudFront distribution

AnswerC

CRR asynchronously replicates objects to a bucket in another Region and requires versioning.

Why this answer

S3 Cross-Region Replication (CRR) automatically replicates objects from a source bucket in one AWS Region to a destination bucket in another Region, providing a disaster recovery copy without custom scripts. Enabling versioning on both buckets is a prerequisite for CRR, ensuring that all object versions are replicated and that the destination bucket can maintain a complete history of changes.

Exam trap

The trap here is that candidates may confuse S3 Lifecycle policies (which only manage storage tiers within a region) with cross-region replication, or mistakenly think CloudFront's edge caching provides a durable cross-region copy, when in fact CloudFront does not replicate the original S3 object to another region.

How to eliminate wrong answers

Option A is wrong because S3 Lifecycle transitions to Glacier Flexible Retrieval only change the storage class within the same bucket and region; they do not create a cross-region copy for disaster recovery. Option B is wrong because EBS snapshots are for block-level backups of EC2 volumes, not for S3 objects, and they cannot replicate S3 data across regions. Option D is wrong because CloudFront is a content delivery network (CDN) that caches content at edge locations for low-latency access; it does not provide persistent cross-region replication or disaster recovery copies of S3 objects.
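
The sketch below assembles the request parameters such a setup would typically need: versioning on both buckets, then a replication rule on the source. The key names follow the S3 `PutBucketVersioning` and `PutBucketReplication` request shapes as used by boto3; the bucket names, rule ID, and IAM role ARN are hypothetical, and the code only builds dictionaries without contacting AWS.

```python
def versioning_request(bucket):
    """Parameters for enabling versioning on one bucket."""
    return {"Bucket": bucket, "VersioningConfiguration": {"Status": "Enabled"}}


def replication_request(source_bucket, dest_bucket, role_arn):
    """Parameters for a single rule that replicates every object."""
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,  # role that S3 assumes to copy objects
            "Rules": [
                {
                    "ID": "dr-copy",  # hypothetical rule name
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {},  # empty filter: applies to all objects
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
                }
            ],
        },
    }


def crr_plan(source_bucket, dest_bucket, role_arn):
    """Ordered steps: versioning must be on for both buckets before replication."""
    return [
        ("put_bucket_versioning", versioning_request(source_bucket)),
        ("put_bucket_versioning", versioning_request(dest_bucket)),
        ("put_bucket_replication",
         replication_request(source_bucket, dest_bucket, role_arn)),
    ]


plan = crr_plan(
    "tickets-docs-primary",    # hypothetical source bucket
    "tickets-docs-dr",         # hypothetical destination bucket in another Region
    "arn:aws:iam::123456789012:role/hypothetical-replication-role",
)
for operation, params in plan:
    print(operation, params["Bucket"])
```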

212

MCQmedium

An application uses an Amazon Aurora DB cluster. The cluster performs an automatic failover from the writer instance to a standby instance. After failover completes, reads succeed, but all new writes fail with errors indicating the application is connecting to the old writer endpoint. Which change best fixes the resiliency issue after failover?

A.Update the application to use the Aurora cluster writer endpoint (or the cluster endpoint intended for writes) rather than an instance-specific endpoint.

B.Enable Multi-AZ on the individual writer instance settings so it can automatically create a new instance during failover.

C.Increase the failover timeout for Aurora to 60 minutes to ensure the app finishes reconnecting.

D.Switch the cluster to a single-AZ configuration to reduce connection retries after failover.

AnswerA

During Aurora failover, the writer role moves to a different underlying DB instance. The cluster writer endpoint is stable and always resolves to the current writer, even after failover. An instance-specific endpoint continues to point to the original (now non-writer) instance, so write operations fail if the application keeps using that stale endpoint.

Why this answer

The application is failing writes because it is connecting to the old writer instance's endpoint, which is no longer the writer after failover. The Aurora cluster writer endpoint is a DNS name that always points to the current primary (writer) instance, regardless of failovers. By using the cluster writer endpoint, the application automatically connects to the new writer after failover, eliminating the need to manually update connection strings.

Exam trap

The trap here is that candidates often confuse instance-specific endpoints with cluster endpoints, assuming that failover automatically updates all DNS records, but only the cluster endpoint is dynamically updated to reflect the new writer.

How to eliminate wrong answers

Option B is wrong because Multi-AZ is already inherent in Aurora clusters (by default, Aurora stores data across three Availability Zones) and enabling it on an individual instance does not change the failover behavior or fix the endpoint issue. Option C is wrong because increasing the failover timeout to 60 minutes does not address the root cause; the application will still connect to the old writer endpoint and fail writes indefinitely. Option D is wrong because switching to a single-AZ configuration would actually reduce resiliency and increase the risk of data loss, and it does not solve the problem of the application using the wrong endpoint.
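
The difference between the two endpoint types can be sketched with a small in-memory model. This is not how Aurora is implemented; the class, its methods, and the instance names are assumptions chosen only to show why a pinned instance endpoint goes stale while the cluster writer endpoint keeps working.

```python
class AuroraClusterModel:
    """Toy model: tracks which instance currently holds the writer role."""

    def __init__(self, writer, replicas):
        self.writer = writer
        self.replicas = list(replicas)

    def failover(self):
        """Promote the first replica; the old writer rejoins as a replica."""
        new_writer = self.replicas.pop(0)
        self.replicas.append(self.writer)
        self.writer = new_writer

    def resolve_cluster_endpoint(self):
        """The cluster writer endpoint always resolves to the current writer."""
        return self.writer

    def can_write(self, instance):
        """Only the current writer accepts writes."""
        return instance == self.writer


cluster = AuroraClusterModel("db-instance-1", ["db-instance-2", "db-instance-3"])
pinned = cluster.writer  # the application hard-codes this instance endpoint
cluster.failover()

print(cluster.can_write(pinned))                              # False
print(cluster.can_write(cluster.resolve_cluster_endpoint()))  # True
```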

213

MCQmedium

Based on the exhibit, the application should continue serving requests if one Availability Zone fails. Which change best improves resilience with the least operational complexity?

A.Increase the desired capacity in AZ-a so more instances can absorb the failure of that same Availability Zone.

B.Add at least one subnet from a second Availability Zone to both the ALB and the Auto Scaling group.

C.Disable health checks so the ALB stops removing targets during brief infrastructure issues.

D.Move the application to a single larger instance type so the fleet has fewer moving parts.

AnswerB

A resilient design needs the load balancer and the Auto Scaling group to span multiple Availability Zones. If one AZ fails, the ALB can still route to healthy targets in the remaining AZs and the Auto Scaling group can replenish capacity there. This is the simplest and most common way to achieve AZ-level fault tolerance.

Why this answer

Option B is correct because adding subnets from a second Availability Zone to both the ALB and the Auto Scaling group distributes the application across multiple AZs. This ensures that if one AZ fails, the ALB can route traffic to healthy targets in the remaining AZ, and the Auto Scaling group can maintain capacity by launching instances in the surviving AZ. This approach directly addresses the requirement to continue serving requests during an AZ failure with minimal operational complexity.

Exam trap

The trap here is that candidates often think increasing capacity in a single AZ (Option A) provides resilience, but it actually concentrates risk in that AZ, while the correct answer requires distributing resources across multiple AZs to achieve true fault tolerance.

How to eliminate wrong answers

Option A is wrong because increasing the desired capacity in a single AZ does not provide resilience against the failure of that same AZ; all instances would be lost if the AZ fails. Option C is wrong because disabling health checks would prevent the ALB from detecting and removing unhealthy targets, causing traffic to be routed to failed instances and degrading application availability. Option D is wrong because moving to a single larger instance type creates a single point of failure; if that instance fails, the entire application becomes unavailable, and it does not address AZ-level failures.
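
A quick way to see why option A does not help is to count what survives an AZ failure. The sketch below uses hypothetical AZ labels and fleet sizes (the exhibit is not reproduced here) and simply filters out instances in the failed zone.

```python
def spans_multiple_azs(subnet_azs):
    """True when the configured subnets cover at least two distinct AZs."""
    return len(set(subnet_azs)) >= 2


def surviving_instances(instance_azs, failed_az):
    """Instances that remain after one AZ is lost."""
    return [az for az in instance_azs if az != failed_az]


# Hypothetical fleets: four instances each, before and after the change.
single_az_fleet = ["az-a", "az-a", "az-a", "az-a"]
multi_az_fleet = ["az-a", "az-a", "az-b", "az-b"]

print(spans_multiple_azs(single_az_fleet))                # False
print(len(surviving_instances(single_az_fleet, "az-a")))  # 0
print(spans_multiple_azs(multi_az_fleet))                 # True
print(len(surviving_instances(multi_az_fleet, "az-a")))   # 2
```

In the multi-AZ case the Auto Scaling group would then launch replacements in the surviving zone to return to the desired capacity; the sketch does not model that step.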

214

MCQhard

A claims workflow uses Amazon SQS. Poison messages are repeatedly failing and blocking useful retries. What should the architect configure? The design must avoid adding custom operational scripts.

A.A FIFO queue without a redrive policy

B.Short polling instead of long polling

C.A dead-letter queue with an appropriate maxReceiveCount

D.A larger message retention period only

AnswerC

A DLQ isolates messages that fail repeatedly so they can be investigated without disrupting normal processing.

Why this answer

A dead-letter queue (DLQ) with an appropriate maxReceiveCount allows messages that repeatedly fail processing to be moved to a separate queue after a specified number of receive attempts. This prevents poison messages from blocking the main queue and consuming retry capacity, while avoiding custom operational scripts by using native SQS functionality.

Exam trap

The trap here is that candidates often confuse increasing message retention or changing polling behavior with solving poison message issues, but only a dead-letter queue with a maxReceiveCount directly removes repeatedly failing messages from the processing flow.

How to eliminate wrong answers

Option A is wrong because a FIFO queue without a redrive policy does not automatically handle poison messages; it still requires a DLQ configuration to move failing messages out of the main queue. Option B is wrong because the polling mode does not change how failed messages are retried; short polling queries only a subset of SQS servers and tends to increase empty responses, but it does not prevent repeated failures. Option D is wrong because a larger message retention period only keeps messages longer in the queue; it does not stop poison messages from being repeatedly retried and blocking useful retries.
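
For reference, a redrive policy is set as a queue attribute whose value is a JSON string. The sketch below builds that attribute and models the documented rule that a message moves to the DLQ once its receive count exceeds `maxReceiveCount`. The queue ARN and the threshold of 3 are hypothetical, and the helper names are not part of any AWS SDK.

```python
import json


def queue_attributes_with_dlq(dlq_arn, max_receive_count=5):
    """Build the Attributes dict that attaches a DLQ to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not a nested dict.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


def should_move_to_dlq(receive_count, max_receive_count):
    """A message is moved once its receive count exceeds the threshold."""
    return receive_count > max_receive_count


attrs = queue_attributes_with_dlq(
    "arn:aws:sqs:us-east-1:123456789012:claims-dlq",  # hypothetical DLQ ARN
    max_receive_count=3,
)
print(attrs["RedrivePolicy"])
print(should_move_to_dlq(3, 3))  # False
print(should_move_to_dlq(4, 3))  # True
```

Choosing the threshold is a trade-off: too low and transient failures end up in the DLQ, too high and a poison message keeps consuming worker time.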

215

MCQmedium

Based on the exhibit, the payment worker sometimes processes the same SQS Standard message more than once after a timeout. What change best prevents duplicate charges while keeping the queue architecture?

A.Increase the SQS visibility timeout to 15 minutes and leave the worker unchanged.

B.Replace the Standard queue with a FIFO queue and rely only on message ordering.

C.Make the payment workflow idempotent by recording a unique order key before charging.

D.Add a second consumer so duplicate messages are processed faster.

AnswerC

SQS Standard queues are at-least-once delivery, so duplicate messages are always possible. The correct safeguard is idempotency: store a unique order or payment request key, check whether that key has already been processed, and only perform the charge the first time it is seen. Any later delivery is safely ignored.

Why this answer

Option C is correct because making the payment workflow idempotent ensures that even if the same SQS Standard message is processed more than once (due to a visibility timeout), the duplicate charge is prevented by checking a unique order key before processing. This is the most robust solution for handling at-least-once delivery semantics of Standard queues without changing the queue architecture.

Exam trap

The trap here is that candidates often think increasing the visibility timeout (Option A) or switching to a FIFO queue (Option B) will solve duplicate processing, but they overlook that the root cause is the worker's timeout behavior, which requires application-level idempotency to prevent duplicate charges.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout to 15 minutes does not guarantee that the worker will finish processing within that time; if the worker still times out, the message becomes visible again and can be processed again, leading to duplicate charges. Option B is wrong because message ordering does nothing to stop redelivery: if the worker times out before deleting the message, the message becomes visible again and is processed a second time. FIFO deduplication only discards duplicate sends within the 5-minute deduplication interval, so it does not make the charge itself exactly-once, and the option also changes the queue architecture. Option D is wrong because adding a second consumer does not change at-least-once delivery; a message that becomes visible again after a timeout can be picked up by either consumer, so duplicate charges remain possible.
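
A minimal sketch of the idempotency check in option C is shown below. It keeps processed keys in memory purely for illustration; a real worker would need a durable, shared store. The message fields `order_id` and `amount_cents` and the class name are assumptions, not taken from the exhibit.

```python
class PaymentWorker:
    """Toy worker that charges each order key at most once."""

    def __init__(self):
        self.processed_keys = set()  # stand-in for a durable key store
        self.charges = []            # stand-in for calls to a payment provider

    def handle(self, message):
        key = message["order_id"]
        if key in self.processed_keys:
            # Duplicate delivery: acknowledge it without charging again.
            return "skipped"
        # Record the key before charging, as option C describes.
        self.processed_keys.add(key)
        self.charges.append((key, message["amount_cents"]))
        return "charged"


worker = PaymentWorker()
message = {"order_id": "order-1001", "amount_cents": 2599}

print(worker.handle(message))  # charged
print(worker.handle(message))  # skipped (same message delivered again)
print(len(worker.charges))     # 1
```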

216

MCQmedium

A production Amazon RDS database has automated backups enabled. At 10:00 UTC, an application deploy accidentally overwrote a subset of rows due to a faulty migration. The issue is detected at 10:45 UTC. The team confirms that the required retention window is still available. Which approach offers the most resilient and least disruptive way to recover the affected data close to the time of the event?

A.Perform a snapshot restore and attach the restored instance, then manually copy only the affected rows back into the current database.

B.Use point-in-time recovery to restore the database to a timestamp just before 10:00 UTC, then swap application connectivity to the recovered instance.

C.Rely on automated backups to roll forward automatically until the data becomes correct.

D.Disable automated backups going forward to prevent future corruption, then reindex the corrupted table.

AnswerB

Point-in-time recovery leverages automated backups to create a recovery point near the incident and supports restoring close to 10:00.

Why this answer

Option B is correct because point-in-time recovery (PITR) can restore the database to any second within the backup retention window (up to the latest restorable time), such as just before the faulty migration at 10:00 UTC. The restore creates a new DB instance with a complete, consistent database state, which avoids manual row-by-row recovery. Swapping application connectivity to the restored instance is the least disruptive approach, as it avoids complex manual data merging and reduces downtime.

Exam trap

The trap here is that candidates may choose snapshot restore (Option A) thinking it is faster or simpler, but they overlook that PITR provides a more precise, consistent recovery point without manual data extraction and reinsertion.

How to eliminate wrong answers

Option A is wrong because performing a snapshot restore and manually copying affected rows is error-prone, time-consuming, and risks data inconsistency, especially if the affected rows have dependencies. Option C is wrong because automated backups do not 'roll forward' to correct data corruption; they are used for restore operations, not automatic healing. Option D is wrong because disabling automated backups does not recover lost data and actually increases future risk; reindexing does not restore overwritten rows.
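
As a sketch of what the restore request might look like, the helper below builds parameters in the shape of the RDS `RestoreDBInstanceToPointInTime` API. The instance identifiers, the calendar date, and the one-minute safety margin are hypothetical; the function only computes a dictionary and does not call AWS.

```python
from datetime import datetime, timedelta, timezone


def pitr_restore_params(source_id, target_id, incident_time,
                        safety_margin=timedelta(minutes=1)):
    """Build restore parameters for a point just before the incident."""
    if incident_time.tzinfo is None:
        raise ValueError("incident_time must be timezone-aware")
    restore_time = incident_time.astimezone(timezone.utc) - safety_margin
    return {
        "SourceDBInstanceIdentifier": source_id,
        # PITR creates a new instance; the source is left untouched.
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Hypothetical date; the scenario only gives the time of day (10:00 UTC).
incident = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
params = pitr_restore_params("orders-db", "orders-db-restored", incident)
print(params["RestoreTime"].isoformat())  # 2024-01-15T09:59:00+00:00
```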

217

MCQmedium

Your ecommerce app runs behind an Application Load Balancer (ALB) and uses an RDS database for orders. During an AZ impairment in us-east-1, customers report that checkout takes several minutes to recover. The current design places EC2 instances only in private subnets of AZ-a, while the ALB spans multiple subnets. The RDS DB instance is Multi-AZ. Management wants automatic recovery within the same Region. Which change best addresses the issue with minimal operational overhead?

A.Move the EC2 instances into Auto Scaling Groups that span private subnets in at least two AZs, keeping the ALB spanning those subnets.

B.Switch from RDS Single-AZ to RDS Multi-AZ, keeping the EC2 instances in only AZ-a because failover will still reach them.

C.Terminate the ALB and use a Network Load Balancer (NLB) in front of the existing single-AZ EC2 instances.

D.Add more EC2 instances in AZ-a and increase the ALB health check thresholds to avoid unnecessary replacements during impairments.

AnswerA

An Auto Scaling Group across multiple AZs ensures healthy capacity exists when an AZ becomes impaired, and the ALB can route to instances in any available AZ.

Why this answer

The current design places EC2 instances only in AZ-a, so when that AZ becomes impaired, all compute capacity is lost, causing checkout to fail until the impairment ends or manual intervention occurs. By moving EC2 instances into Auto Scaling Groups spanning at least two AZs, the application gains automatic recovery within the same Region because the ALB can route traffic to healthy instances in the remaining AZs. This change minimizes operational overhead because Auto Scaling automatically replaces failed instances and maintains desired capacity across AZs, while the ALB’s health checks ensure traffic is only sent to healthy targets.

Exam trap

The trap here is that candidates assume Multi-AZ RDS alone guarantees full application resilience, overlooking that the compute layer (EC2) must also be distributed across AZs to survive an AZ impairment.

How to eliminate wrong answers

Option B is wrong because the RDS Multi-AZ deployment is already in place (the question states the RDS DB instance is Multi-AZ), so this change does nothing to address the single-AZ EC2 failure; the database remains reachable, but the compute layer is still unavailable during the AZ impairment. Option C is wrong because replacing the ALB with an NLB does not solve the single-AZ EC2 problem; an NLB also requires targets in multiple AZs for high availability, and the existing single-AZ EC2 instances would still be lost during the impairment. Option D is wrong because adding more EC2 instances in the same impaired AZ-a does not provide recovery when that AZ fails, and increasing health check thresholds would actually delay the detection of unhealthy instances, prolonging the recovery time.

218

MCQmedium

A company runs an Amazon Aurora DB cluster with a Multi-AZ deployment. The application is configured with a hard-coded endpoint that points to the current writer *DB instance* (an instance-specific endpoint), rather than the Aurora cluster writer endpoint. During an unexpected AZ failure, Aurora promotes the standby to become the new writer. However, the application continues to fail to connect until an operator updates the hard-coded endpoint. What change most directly improves resiliency so the application automatically reconnects after failover?

A.Keep using the writer DB instance endpoint, but increase the client connection timeout.

B.Connect using the Aurora cluster writer endpoint so DNS resolves to the current writer after failover.

C.Disable Multi-AZ failover and rely on manual snapshot restore to bring the database back online.

D.Enable cross-Region read replicas and route application traffic to the replica during the outage.

AnswerB

Aurora cluster endpoints are designed to provide continuity across failovers. The cluster writer endpoint is updated on failover so that DNS resolves to the promoted writer, and the application can reconnect without manual endpoint changes.

Why this answer

Option B is correct because the Aurora cluster writer endpoint is a DNS name that always resolves to the current writer instance in the cluster, even after a failover. By using this endpoint instead of a hard-coded instance-specific endpoint, the application automatically reconnects to the new writer without manual intervention, directly improving resiliency.

Exam trap

The trap here is that candidates may confuse the instance-specific endpoint with the cluster writer endpoint, or think that increasing timeouts or using read replicas can solve a writer failover issue, when the core problem is the hard-coded reference to a specific instance that no longer exists.

How to eliminate wrong answers

Option A is wrong because increasing the client connection timeout does not change the fact that the hard-coded endpoint points to a failed instance; the connection will still fail after the timeout expires. Option C is wrong because disabling Multi-AZ failover and relying on manual snapshot restore would cause significant downtime and data loss, directly contradicting the goal of improving resiliency. Option D is wrong because cross-Region read replicas are read-only and cannot accept writes; routing application traffic to a read replica during an outage would not allow the application to write data, and it does not address the failover of the writer instance.
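
One practical way to catch this misconfiguration is to check which kind of endpoint a connection string uses. The sketch below relies on the naming pattern shown in the Aurora documentation, where cluster endpoints carry a `cluster-` prefix (and reader endpoints `cluster-ro-`) in the second DNS label. The function name and sample hostnames are illustrative, and the pattern should be treated as an assumption to verify against your own endpoints.

```python
def endpoint_kind(hostname):
    """Classify an Aurora endpoint by its second DNS label."""
    labels = hostname.lower().split(".")
    if len(labels) < 2:
        return "unknown"
    second = labels[1]
    # Check the more specific prefixes before the generic "cluster-".
    if second.startswith("cluster-ro-"):
        return "reader"
    if second.startswith("cluster-custom-"):
        return "custom"
    if second.startswith("cluster-"):
        return "cluster-writer"
    return "instance"


samples = [
    "mydbcluster.cluster-c7tj4example.us-east-1.rds.amazonaws.com",
    "mydbcluster.cluster-ro-c7tj4example.us-east-1.rds.amazonaws.com",
    "mydbinstance.c7tj4example.us-east-1.rds.amazonaws.com",
]
for host in samples:
    print(endpoint_kind(host))
# cluster-writer
# reader
# instance
```

A configuration check could flag any write connection whose endpoint is classified as `instance`.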

219

MCQhard

Based on the exhibit, duplicate payment charges occasionally occur when the worker times out after the charge is submitted but before the message is deleted. What change best prevents duplicate charges while keeping retry behavior?

A.Switch the queue to FIFO and rely on content-based deduplication to guarantee exactly-once processing.

B.Make the consumer idempotent by storing a processed payment key and rejecting repeat charges.

C.Reduce the visibility timeout so the message becomes available again sooner after a timeout.

D.Add a dead-letter queue and disable retries so the message is never processed twice.

AnswerB

The worker can still receive the same message more than once because SQS Standard is at-least-once delivery and the delete happened after the charge. Idempotency is the correct safety control because it prevents the payment from being applied twice even when the message is retried. A processed-payment record or conditional write lets retries remain possible without creating duplicate charges.

Why this answer

Option B is correct because making the consumer idempotent ensures that even if the same message is processed more than once (due to a timeout after the charge is submitted but before the message is deleted), the duplicate charge will be rejected. By storing a processed payment key (e.g., a unique transaction ID) and checking it before processing, the system can safely retry without causing duplicate payments. This approach preserves retry behavior while preventing duplicates, which is the core requirement.

Exam trap

The trap here is that candidates often assume FIFO queues with deduplication guarantee exactly-once processing, but deduplication only discards duplicate sends within the deduplication interval. It does not stop a message from being redelivered and processed again when the consumer times out after processing but before deleting the message.

How to eliminate wrong answers

Option A is wrong because switching to a FIFO queue with content-based deduplication does not guarantee exactly-once processing in this scenario. Deduplication only discards duplicate sends that arrive within the 5-minute deduplication interval; if the consumer times out after submitting the charge but before deleting the message, the message becomes visible again, is redelivered, and is charged a second time. Option C is wrong because reducing the visibility timeout would make the message reappear sooner, increasing the likelihood of duplicate processing and not preventing the existing duplicate issue. Option D is wrong because adding a dead-letter queue and disabling retries would eliminate retry behavior entirely, which contradicts the requirement to keep retry behavior; it would also move the message to a DLQ after the first failure, potentially losing the payment charge.
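
The conditional write mentioned in the answer can be sketched as follows. DynamoDB is assumed here as the key store (the explanation does not name one), and `claim_request` builds parameters in the shape of a `PutItem` call with a condition expression. The in-memory ledger, table name, and `payment_key` attribute are hypothetical stand-ins so the retry scenario can be run locally.

```python
def claim_request(table_name, payment_key):
    """Parameters for a conditional put that succeeds only for a new key."""
    return {
        "TableName": table_name,
        "Item": {"payment_key": {"S": payment_key}},
        "ConditionExpression": "attribute_not_exists(payment_key)",
    }


class InMemoryLedger:
    """Stand-in for the table: put_if_absent mimics the conditional write."""

    def __init__(self):
        self._keys = set()

    def put_if_absent(self, payment_key):
        if payment_key in self._keys:
            return False
        self._keys.add(payment_key)
        return True


def process(message, ledger, charges):
    """Charge only when the payment key is claimed for the first time."""
    if not ledger.put_if_absent(message["payment_key"]):
        return "duplicate-ignored"  # safe to delete the message
    charges.append(message["payment_key"])
    return "charged"


ledger, charges = InMemoryLedger(), []
msg = {"payment_key": "pay-001"}
first = process(msg, ledger, charges)   # worker then times out before deleting
second = process(msg, ledger, charges)  # SQS redelivers the same message
print(first, second, charges)  # charged duplicate-ignored ['pay-001']
```

A real implementation would also need to handle a failure between the claim and the charge, for example by storing a status alongside the key; that case is outside this sketch.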

220

Matchingmedium

A team wants a web application to keep serving traffic if one Availability Zone fails. Match each architecture element to the resilience behavior it provides.

Concepts

Matches

Stop sending requests to unhealthy targets and keep only healthy instances in rotation.

Launch replacement instances in healthy AZs when capacity is lost.

Maintain a synchronous standby in another AZ and fail over automatically.

Allow instances to be replaced without losing user sessions that are stored elsewhere.

Why these pairings

These pairs match architecture elements with their resilience behaviors for surviving an Availability Zone failure, focusing on AWS services that provide high availability and fault tolerance.

221

MCQmedium

A ticket booking system uses Aurora MySQL. The company wants fast cross-Region disaster recovery with low RPO. Which architecture should be considered? The team wants the control to be enforceable during normal operations.

A.Aurora Global Database

B.A single-AZ Aurora cluster

C.An ElastiCache Redis replica

D.Manual snapshots copied monthly

AnswerA

Aurora Global Database replicates with low latency to secondary Regions and supports faster disaster recovery than snapshot-only approaches.

Why this answer

Aurora Global Database is designed for cross-Region disaster recovery with a typical RPO of 1 second or less, using storage-based replication that does not impact database performance. It provides fast failover to a secondary Region and allows the primary Region to enforce write control during normal operations, meeting the low RPO and enforceable control requirements.

Exam trap

The trap here is that candidates may confuse Aurora Global Database with cross-Region read replicas or manual snapshot copy strategies, underestimating the RPO and failover speed requirements for disaster recovery.

How to eliminate wrong answers

Option B is wrong because a single-AZ Aurora cluster lacks any cross-Region replication or failover capability, resulting in no DR protection and an RPO that depends on manual backups. Option C is wrong because ElastiCache Redis is an in-memory cache, not a persistent database, and its cross-Region replication (Global Datastore) does not provide the same transactional consistency or DR guarantees as Aurora Global Database for a ticket booking system. Option D is wrong because manual snapshots copied monthly would yield an RPO of up to 30 days, far exceeding the low RPO requirement, and they require manual intervention for recovery, which is not fast.

222

MCQhard

A warehouse integration service must use shared file storage across Linux EC2 instances in multiple Availability Zones. The storage must remain available during an AZ failure. Which service should be used? The team wants the control to be enforceable during normal operations.

A.Amazon EFS with mount targets in multiple Availability Zones

B.S3 mounted as a POSIX file system without a file gateway

C.Instance store volumes

D.An EBS volume attached to all instances

AnswerA

EFS is regional file storage and supports mount targets across AZs.

Why this answer

Amazon EFS provides a fully managed, POSIX-compliant NFS file system that can be mounted concurrently on multiple Linux EC2 instances across different Availability Zones. By creating mount targets in each AZ, the file system remains accessible even if one AZ fails, because the other mount targets continue to serve traffic. EFS also supports lifecycle policies and IAM enforcement to control access during normal operations, meeting the requirement for enforceable control.

Exam trap

The trap here is that candidates often confuse EBS Multi-Attach (which is limited to Provisioned IOPS volumes on Nitro-based instances within a single AZ) with the cross-AZ shared file system capability that only EFS provides, or they mistakenly think S3 with a FUSE mount is a reliable POSIX file system for production workloads.

How to eliminate wrong answers

Option B is wrong because mounting S3 as a POSIX file system (e.g., using s3fs-fuse) does not provide true POSIX semantics (e.g., no file locking and no in-place updates or atomic renames), so it is not a dependable shared file system for these instances. Option C is wrong because instance store volumes are ephemeral, tied to a single EC2 instance, and data is lost if the instance stops or fails; they cannot be shared across instances or survive an AZ failure. Option D is wrong because an EBS volume resides in a single AZ and can be attached only to instances in that AZ, normally one instance at a time; Multi-Attach is limited to Provisioned IOPS volumes on Nitro-based instances in the same AZ and needs a cluster-aware file system, so it cannot provide shared storage that survives an AZ failure.
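
A small planning check can confirm that every AZ hosting instances also has an EFS mount target, and show which mount targets remain after one AZ is lost. The AZ names and helper functions below are hypothetical; this is a set calculation, not an EFS API call.

```python
def azs_missing_mount_target(instance_azs, mount_target_azs):
    """AZs that host instances but have no mount target."""
    return sorted(set(instance_azs) - set(mount_target_azs))


def mount_targets_after_failure(mount_target_azs, failed_az):
    """Mount target AZs still available after one AZ fails."""
    return sorted(set(mount_target_azs) - {failed_az})


instance_azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
mount_target_azs = ["us-east-1a", "us-east-1b"]

print(azs_missing_mount_target(instance_azs, mount_target_azs))
# ['us-east-1c']
print(mount_targets_after_failure(mount_target_azs, "us-east-1a"))
# ['us-east-1b']
```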

223

MCQmedium

A payments API uses an RDS MySQL database and must remain available during an Availability Zone failure with minimal application changes. What should the architect enable? The architecture review board prefers a managed AWS-native control.

A.S3 Cross-Region Replication

B.Multi-AZ deployment for the RDS DB instance

C.Read replicas only

D.EBS snapshots every hour

AnswerB

Multi-AZ provides synchronous standby replication and automatic failover within a Region.

Why this answer

Multi-AZ deployment for RDS MySQL provides synchronous standby replication to a different Availability Zone, with automatic failover during an AZ failure, no loss of committed data (RPO of zero), and minimal downtime (failover typically completes in one to two minutes). This is a managed AWS-native solution that requires no application changes: the DB instance endpoint stays the same, and RDS repoints its DNS record to the standby during failover.

Exam trap

The trap here is that candidates often confuse read replicas with Multi-AZ, assuming read replicas can provide high availability, but they lack automatic failover and synchronous replication, making them unsuitable for AZ failure scenarios requiring minimal application changes.

How to eliminate wrong answers

Option A is wrong because S3 Cross-Region Replication is for object storage redundancy across regions, not for RDS database availability within a region, and it does not provide automatic failover for a MySQL database. Option C is wrong because read replicas are asynchronous and do not support automatic failover; they are designed for read scaling, not for maintaining write availability during an AZ failure. Option D is wrong because EBS snapshots are point-in-time backups that require manual restoration and do not provide automatic failover or real-time replication, leading to significant downtime and potential data loss.

224

MCQ, medium

A. Lambda reserved concurrency set to zero

B. A larger deployment package

C. CloudFront error pages

D. A Lambda dead-letter queue or failure destination

Answer: D

A DLQ or asynchronous failure destination captures failed events after retry attempts.

Why this answer

A Lambda dead-letter queue (an Amazon SQS queue or SNS topic) or an on-failure destination captures asynchronous invocation events that have exhausted all retry attempts, preserving them for later investigation. This ensures failed invocations caused by the unreliable third-party API are not lost and can be analyzed or replayed, meeting the requirement for retention after retries are exhausted.
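The retry-then-retain behavior can be modeled in a few lines. The sketch below is an in-memory simulation, not the Lambda API: `invoke_async`, `flaky_third_party_call`, and the `dead_letters` list are hypothetical names, and the default of two retries is chosen to mirror Lambda's documented default for asynchronous invocations.

```python
from typing import Any, Callable, Dict, List


def invoke_async(handler: Callable[[Dict[str, Any]], Any],
                 event: Dict[str, Any],
                 dead_letters: List[Dict[str, Any]],
                 max_retries: int = 2) -> bool:
    """Run the handler once plus up to max_retries retries.

    If every attempt fails, the event is appended to dead_letters
    instead of being dropped. Returns True on success, False otherwise.
    """
    attempts = 1 + max_retries
    last_error = ""
    for _ in range(attempts):
        try:
            handler(event)
            return True
        except Exception as exc:  # any handler error counts as a failed attempt
            last_error = repr(exc)
    dead_letters.append({"event": event, "attempts": attempts, "error": last_error})
    return False


def flaky_third_party_call(event: Dict[str, Any]) -> str:
    """Stand-in for a handler that depends on an unreliable external API."""
    if event.get("fail"):
        raise RuntimeError("third-party API unavailable")
    return "ok"


dead_letters: List[Dict[str, Any]] = []
ok = invoke_async(flaky_third_party_call, {"id": 1}, dead_letters)
failed = invoke_async(flaky_third_party_call, {"id": 2, "fail": True}, dead_letters)
```

After these two calls, only the event with `"id": 2` sits in `dead_letters`, along with the attempt count and the last error, ready to be inspected or replayed.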

Exam trap

The trap here is that candidates may confuse a dead-letter queue with other error-handling mechanisms like reserved concurrency or CloudFront customizations, failing to recognize that DLQs specifically retain events after retries are exhausted for asynchronous Lambda invocations.

How to eliminate wrong answers

Option A is wrong because setting reserved concurrency to zero would prevent the Lambda function from executing at all, not handle failed events after retries. Option B is wrong because a larger deployment package does not affect error handling or retention of failed events; it only increases the function's code size, which can impact cold start times. Option C is wrong because CloudFront error pages are used to customize HTTP error responses for web content delivery, not to capture or retain Lambda invocation failures from asynchronous or event-driven processing.

225

MCQ, easy

A team uses an S3 bucket to store important customer-generated exports. They need protection against accidental overwrites and also want copies of the data in another AWS Region for disaster recovery. Which S3 configuration best satisfies both requirements?

A. Enable S3 lifecycle policies to automatically move objects to Glacier after 30 days only.

B. Enable S3 versioning and configure Cross-Region Replication to a destination bucket in another Region.

C. Disable all versioning and rely on AWS Backup to restore objects from a scheduled backup window.

D. Enable S3 Block Public Access and SSE-S3 encryption, without using versioning or replication.

Answer: B

Versioning preserves previous object states against overwrites and deletes, while replication provides an additional Region copy for recovery.

Why this answer

Option B is correct because S3 versioning keeps every version of an object, so an accidental overwrite can be undone by restoring the previous version. Cross-Region Replication (CRR) then automatically copies new objects to a destination bucket in another AWS Region, providing a disaster recovery copy in a separate geographic location. CRR requires versioning to be enabled on both the source and destination buckets, so the two settings work together.
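A minimal sketch of the two settings, expressed as the parameter dictionaries that boto3's `put_bucket_versioning` and `put_bucket_replication` calls accept. The bucket names, role ARN, rule ID, and helper function names are all hypothetical, and `apply_configuration` is only an assumption about how the requests might be sent; it needs AWS credentials and an existing IAM replication role, so it is not executed here.

```python
from typing import Any, Dict


def versioning_kwargs(bucket: str) -> Dict[str, Any]:
    """Parameters that turn on versioning for one bucket."""
    return {"Bucket": bucket, "VersioningConfiguration": {"Status": "Enabled"}}


def replication_kwargs(source_bucket: str, dest_bucket: str,
                       role_arn: str) -> Dict[str, Any]:
    """Parameters for one replication rule covering every object in the source."""
    return {
        "Bucket": source_bucket,
        "ReplicationConfiguration": {
            "Role": role_arn,  # IAM role S3 assumes to replicate objects
            "Rules": [{
                "ID": "replicate-all",     # illustrative rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{dest_bucket}"},
            }],
        },
    }


def apply_configuration(source_bucket: str, source_region: str,
                        dest_bucket: str, dest_region: str,
                        role_arn: str) -> None:
    """Send the requests with boto3 (requires credentials; not run in this sketch)."""
    import boto3  # imported lazily so the builders above work without boto3

    src = boto3.client("s3", region_name=source_region)
    dst = boto3.client("s3", region_name=dest_region)
    # Versioning must be enabled on both buckets before replication is configured
    src.put_bucket_versioning(**versioning_kwargs(source_bucket))
    dst.put_bucket_versioning(**versioning_kwargs(dest_bucket))
    src.put_bucket_replication(
        **replication_kwargs(source_bucket, dest_bucket, role_arn))


# Illustrative values only
rule = replication_kwargs(
    "exports-primary", "exports-dr",
    "arn:aws:iam::123456789012:role/s3-replication",
)["ReplicationConfiguration"]["Rules"][0]
```

Versioning is applied to both buckets first because replication cannot be configured otherwise.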

Exam trap

The trap here is that candidates may think lifecycle policies or AWS Backup alone can handle both accidental overwrites and disaster recovery, but they fail to address the real-time protection and cross-region copy requirements that versioning and CRR specifically provide.

How to eliminate wrong answers

Option A is wrong because lifecycle policies to Glacier only manage storage tier transitions and do not protect against accidental overwrites or provide cross-region copies for disaster recovery. Option C is wrong because disabling versioning removes the ability to recover from accidental overwrites, and relying solely on AWS Backup for scheduled restores does not provide real-time protection against overwrites or continuous replication to another Region. Option D is wrong because enabling Block Public Access and SSE-S3 encryption addresses security and encryption but does not prevent accidental overwrites or create cross-region copies for disaster recovery.
