Knowledge + Practice

CCNA Resilient Cloud Solutions Questions

75 of 259 questions · Page 2/4 · Resilient Cloud Solutions · Answers revealed

76

MCQ · medium

A company runs a microservices application on Amazon ECS with Fargate launch type. The application experiences intermittent failures when calling an external API. The errors are transient and usually resolve within a few seconds. How should the company improve resilience?

A. Increase the timeout of the external API call to 60 seconds.

B. Implement retry logic with exponential backoff in the application code.

C. Increase the number of tasks in the ECS service to handle failures.

D. Use an Amazon SQS queue to decouple the API call from the application.

Answer: B

Retries with backoff are best practice for transient failures.

Why this answer

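Option B is correct because retry logic with exponential backoff handles transient failures gracefully: each retry waits longer than the last, which gives the external API a few seconds to recover. Option A is wrong because a 60-second timeout only makes callers wait longer on a failing call and worsens latency. Option C is wrong because adding ECS tasks does not help when the failure is in the external API.

Option D is wrong because asynchronous processing through SQS adds complexity and does not directly address the API failures.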
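
As a rough illustration of the retry pattern in Option B, here is a minimal Python sketch. The `TransientError` class, the `call_with_backoff` helper, and its default delays are assumptions made for this example, not part of any AWS SDK; a real client would decide which exceptions count as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter spreads retries out so many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

The cap on the delay and the bounded attempt count keep a persistent outage from turning into an unbounded wait.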

77

MCQ · medium

A company runs a critical web application on EC2 instances behind an Application Load Balancer (ALB). The application stores session state in an Amazon DynamoDB table. During a recent traffic spike, users experienced session timeouts and the application became unavailable. Which design change would BEST improve resilience?

A. Move session state to Amazon ElastiCache for Redis with cluster mode disabled.

B. Enable DynamoDB global tables or use a Multi-AZ deployment for the table.

C. Add a read replica to the DynamoDB table.

D. Increase the EC2 instance sizes to handle more traffic.

Answer: B

Multi-AZ provides automatic failover and high availability for session data.

Why this answer

Option B is correct because replicating the session table with global tables or a Multi-AZ deployment provides automatic failover and higher availability. Option A is wrong because ElastiCache for Redis with cluster mode disabled is not as resilient as DynamoDB for session state. Option C is wrong because read replicas do not improve write availability.

Option D is wrong because increasing the instance size does not address session state persistence.

78

MCQ · medium

A company uses AWS Lambda to process messages from an Amazon SQS queue. The Lambda function occasionally times out after 15 seconds. To improve resilience, the team wants to ensure messages are not lost and are retried. Which configuration is MOST appropriate?

A. Reduce the Lambda timeout to 5 seconds to fail fast and retry quickly.

B. Set the SQS queue visibility timeout to less than the Lambda timeout.

C. Increase the batch size and remove the DLQ to speed up processing.

D. Increase the Lambda timeout to 30 seconds and configure a dead-letter queue (DLQ) for the SQS queue.

Answer: D

Longer timeout accommodates processing, DLQ captures failures for later analysis.

Why this answer

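Option D is correct because a longer Lambda timeout gives the function time to finish processing, and a dead-letter queue captures messages that still fail after their retries, so no messages are lost. Option A is wrong because reducing the timeout worsens the issue. Option B is wrong because a visibility timeout shorter than the Lambda timeout lets a message become visible again while it is still being processed, which causes duplicate processing.

Option C is wrong because removing the DLQ loses failed messages, and a larger batch does not help with the timeout.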
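
To make the retry-then-DLQ behavior concrete, the following Python sketch models it in memory. It does not call SQS; `process_queue` and its arguments are hypothetical names, and only `max_receive_count` corresponds to a real setting (the `maxReceiveCount` field of an SQS redrive policy).

```python
from collections import deque


def process_queue(messages, handler, max_receive_count=3):
    """Simulate SQS redelivery: a failed message returns to the queue until it
    has been received max_receive_count times, then moves to the DLQ."""
    queue = deque((msg, 0) for msg in messages)  # (body, receive count so far)
    processed, dead_letter = [], []
    while queue:
        body, receives = queue.popleft()
        receives += 1
        try:
            handler(body)
            processed.append(body)  # success: the message is deleted
        except Exception:
            if receives >= max_receive_count:
                dead_letter.append(body)  # retries exhausted: keep it for analysis
            else:
                queue.append((body, receives))  # becomes visible again and is retried
    return processed, dead_letter


def handler(body):
    """Stand-in for the Lambda function; 'bad' always times out."""
    if body == "bad":
        raise TimeoutError("simulated timeout")


processed, dead_letter = process_queue(["a", "bad", "b"], handler)
```

In this run the healthy messages are processed once and the failing one lands in the dead-letter list after three receives instead of disappearing.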

79

MCQ · hard

A company runs a critical batch processing workload on Amazon EMR that must complete within a 2-hour window each night. The workload is fault-tolerant but must be resilient to instance failures. Currently, the EMR cluster uses instance fleets with Spot Instances. Recently, Spot Instance interruptions caused the cluster to take over 3 hours to complete. Which change will MOST effectively ensure the workload completes within the 2-hour window despite Spot interruptions?

A. Increase the number of core nodes to 20 to improve parallelism.

B. Switch to using On-Demand instances for all nodes.

C. Use a mixed instances policy that includes multiple instance types across different Availability Zones.

D. Configure the cluster to terminate idle nodes after 5 minutes to reduce costs.

Answer: C

Diversified instance types reduce the chance of simultaneous interruptions.

Why this answer

Option C is correct because a mixed instances policy across multiple Availability Zones increases the diversity of Spot capacity pools. When one instance type or zone experiences interruptions, the cluster can fall back to other pools, reducing the likelihood of prolonged delays. This approach directly addresses Spot interruption risk without sacrificing cost efficiency, as On-Demand instances would.

Exam trap

The trap here is that candidates may assume increasing parallelism (Option A) or using On-Demand instances (Option B) are the only ways to handle Spot interruptions, overlooking the cost-effective and resilient design of mixed instances across zones.

How to eliminate wrong answers

Option A is wrong because simply increasing core nodes to 20 does not mitigate Spot interruptions; it only adds parallelism, which may not help if all nodes are interrupted simultaneously. Option B is wrong because switching entirely to On-Demand instances eliminates Spot interruption risk but significantly increases cost, which is not the most effective solution given the fault-tolerant nature of the workload. Option D is wrong because terminating idle nodes after 5 minutes reduces cost but does not address the root cause of Spot interruptions causing delays; it may even worsen performance by removing nodes that could be reused.

80

MCQ · easy

A company runs a critical web application on EC2 instances behind an Application Load Balancer. To improve resilience, they want to automatically replace unhealthy instances. Which AWS feature should they use?

A. Auto Scaling group with ELB health checks

B. CloudWatch alarm that terminates the instance

C. AWS Lambda function that checks health and launches new instances

D. Elastic Load Balancer health checks

Answer: A

Auto Scaling groups automatically replace instances that fail health checks.

Why this answer

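Option A is correct because an Auto Scaling group that uses ELB health checks automatically terminates and replaces instances the load balancer reports as unhealthy. Option B is wrong because a CloudWatch alarm can trigger an action such as terminating the instance, but it does not launch a replacement. Option C is wrong because a Lambda function can call the APIs to do this, but it is a custom build rather than the native mechanism.

Option D is wrong because ELB health checks on their own only mark instances unhealthy; replacement requires Auto Scaling.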

81

MCQ · medium

A company uses an Application Load Balancer (ALB) to distribute traffic to EC2 instances. The ALB is in us-east-1a and us-east-1b. They want to ensure that if one AZ fails, traffic is routed only to healthy instances in the other AZ. What configuration is necessary?

A. Enable sticky sessions (session affinity)

B. Configure health checks on the target group

C. Add more subnets in additional AZs

D. Enable cross-zone load balancing on the ALB

Answer: D

Allows traffic to be routed to healthy instances in any AZ.

Why this answer

Option D is correct because cross-zone load balancing lets each load balancer node distribute traffic across instances in all enabled AZs. If it is disabled, each load balancer node sends traffic only to targets in its own AZ. Option A is wrong because stickiness does not affect failover. Option B is wrong because health checks are already required.

Option C is wrong because adding more AZs helps, but without cross-zone load balancing traffic may not be evenly distributed.

82

Drag & Drop · medium

Drag and drop the steps to troubleshoot a failed deployment in AWS CodeDeploy into the correct order.

Why this order

Troubleshooting starts with the CodeDeploy console, then the agent logs, then the AppSpec file, then the instance configuration, and ends with a redeploy.

83

MCQ · easy

A company's application runs on EC2 instances in a single Availability Zone. The operations team wants to improve resilience without redesigning the application. Which action is the MOST effective?

A. Use a larger instance type to handle more traffic.

B. Enable EC2 Auto Recovery to automatically restart the instance if it fails.

C. Deploy EC2 instances across multiple Availability Zones using an Auto Scaling group.

D. Place the instance in a placement group to ensure low latency.

Answer: C

Multi-AZ deployment ensures application availability even if one AZ fails.

Why this answer

Deploying EC2 instances across multiple Availability Zones (AZs) using an Auto Scaling group is the most effective action because it eliminates the single point of failure at the AZ level. If one AZ experiences an outage, the Auto Scaling group automatically launches replacement instances in the remaining healthy AZs, ensuring application availability without requiring any application-level changes. This directly addresses the goal of improving resilience by leveraging AWS's fault-isolated infrastructure.

Exam trap

The trap here is that candidates often confuse instance-level recovery (Auto Recovery) with infrastructure-level resilience (multi-AZ deployment), mistakenly thinking that restarting a failed instance in the same AZ provides sufficient protection, when it does nothing against an AZ outage.

How to eliminate wrong answers

Option A is wrong because using a larger instance type only increases compute capacity, not resilience; a single AZ failure still takes down all instances regardless of size. Option B is wrong because EC2 Auto Recovery only recovers an instance within the same AZ if the underlying hardware fails, but it does not protect against an entire AZ outage, which is the primary risk. Option D is wrong because a placement group is designed to reduce network latency by ensuring instances are in close proximity, but it actually increases the risk of correlated failures and does not improve resilience against AZ-level failures.

84

Multi-Select · easy

A company wants to ensure that its application running on AWS can withstand the failure of an entire AWS Region. Which TWO strategies should the company implement?

Select 2 answers

A. Deploy the application in multiple AWS Regions using an active-active or active-passive pattern

B. Deploy the application across multiple Availability Zones in a single Region

C. Replicate data across Regions using services like DynamoDB global tables or RDS cross-Region replication

D. Use a single CloudFront distribution with multiple origins in the same Region

E. Configure RDS read replicas in the same Region

Answers: A, C

Provides resilience against Region failure.

Why this answer

A and C are correct. A multi-Region deployment provides Regional isolation, and cross-Region data replication keeps the data available in the surviving Region.

Option B is wrong because a Multi-AZ deployment protects only within a single Region. Option D is wrong because a single CloudFront distribution whose origins are all in the same Region does not provide Regional failover. Option E is wrong because read replicas in the same Region serve read scaling and are lost along with the Region.

85

MCQ · hard

A company runs a critical e-commerce platform on AWS. The architecture includes an Application Load Balancer (ALB) that distributes traffic to a fleet of EC2 instances in an Auto Scaling group across three Availability Zones. The instances run a Java application that connects to an Amazon RDS Multi-AZ MySQL database. The application also uses Amazon ElastiCache for Redis for session caching. The company recently experienced a severe outage where the ALB's 5xx error rate spiked to 100% for 45 minutes. The root cause was a combination of a slow-running query on the RDS primary instance and a subsequent failover that caused the application to lose connections to the database. The failover happened because the slow query caused the primary to become unresponsive, triggering a Multi-AZ failover. During the failover, the application's connection pool exhausted, and new connections failed. The application logs show a high rate of 'java.sql.SQLTimeoutException' and 'com.mysql.cj.exceptions.CJCommunicationsException'. The DevOps team needs to implement a long-term solution that minimizes the impact of similar incidents. The solution must be cost-effective and require minimal application changes. Which combination of actions should the DevOps team take?

A. Implement Amazon RDS Proxy to manage database connections and add read replicas to offload read traffic.

B. Use an Auto Scaling policy for EC2 based on RDS connection count and implement a read replica for the primary.

C. Configure Multi-AZ RDS with a synchronous standby and use Amazon RDS for MySQL with enhanced monitoring.

D. Increase the instance size of the RDS primary and enable Performance Insights to identify slow queries.

Answer: A

RDS Proxy handles connection pooling and failover seamlessly, reducing connection timeouts during failover. Read replicas reduce load on the primary, preventing slow queries from causing failovers.

Why this answer

Amazon RDS Proxy is the correct solution because it efficiently manages database connection pooling, reducing the likelihood of connection exhaustion during failovers. By maintaining a warm connection pool and automatically reconnecting to the new primary after a Multi-AZ failover, RDS Proxy minimizes application-side connection timeouts and errors like SQLTimeoutException and CJCommunicationsException. Adding read replicas offloads read traffic, reducing the load on the primary and mitigating the risk of slow queries causing unresponsiveness.

This combination requires minimal application changes and is cost-effective compared to scaling the primary instance.

Exam trap

The trap here is that candidates often focus on scaling the database (e.g., increasing instance size or adding read replicas) to fix performance issues, but overlook the critical connection management problem that causes application-level timeouts during failover, which RDS Proxy directly addresses.

How to eliminate wrong answers

Option B is wrong because using an Auto Scaling policy based on RDS connection count does not address the root cause of connection exhaustion during failover; it only scales EC2 instances reactively, which may not prevent timeouts and adds complexity without solving the connection management issue. Option C is wrong because simply configuring Multi-AZ RDS with a synchronous standby and enhanced monitoring does not prevent connection pool exhaustion during failover; the application still needs to manage connections, and enhanced monitoring only provides visibility, not mitigation. Option D is wrong because increasing the instance size of the RDS primary and enabling Performance Insights addresses performance but does not solve the connection management problem during failover; it may delay the issue but does not prevent connection timeouts or exhaustion.

86

MCQ · hard

A company runs a critical microservice on Amazon ECS with AWS Fargate. The service must be highly available across multiple Availability Zones. The DevOps engineer configured the service with a desired count of 4 tasks spread across 2 Availability Zones. During a deployment, a new task fails to start due to a missing environment variable. The deployment fails, but the old tasks continue to run. What is the most likely cause of the deployment failure and how can the engineer ensure future deployments are resilient?

A. The deployment failed because the ECS service was using the rolling update deployment controller. Change to blue/green deployment.

B. The deployment failed because the ECS service did not have the deployment circuit breaker enabled. Enable the circuit breaker with rollback.

C. The deployment failed because the desired count was too low. Increase the desired count to 6.

D. The deployment failed because the health check grace period was too short. Increase the grace period.

Answer: B

The circuit breaker detects failed tasks and rolls back to the stable version, maintaining service availability.

Why this answer

The deployment failed because the new task could not start due to a missing environment variable, and the ECS service did not have the deployment circuit breaker enabled. Without the circuit breaker, ECS continues to attempt the deployment indefinitely or until a timeout, but it does not automatically roll back to the previous stable task set. Enabling the deployment circuit breaker with rollback ensures that if a specified number of tasks fail to start (e.g., due to health checks or runtime errors), ECS automatically rolls back to the last successful deployment, maintaining service availability.

Exam trap

The trap here is that candidates may focus on the deployment controller type (rolling vs. blue/green) or task count, but the real issue is the lack of automatic rollback capability provided by the deployment circuit breaker, which is specifically designed to handle task startup failures during deployments.

How to eliminate wrong answers

Option A is wrong because the rolling update deployment controller is not the cause of the failure; it is the default and works correctly here by keeping old tasks running. Changing to blue/green deployment would not inherently fix the missing environment variable issue and adds complexity. Option C is wrong because the desired count of 4 tasks is sufficient for high availability across 2 AZs; increasing it to 6 does not address the root cause of task startup failure.

Option D is wrong because the health check grace period only delays the start of health checks, but the task failed to start entirely due to a missing environment variable, not because health checks failed prematurely.
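
For reference, the circuit breaker in Option B is a field of the service's deployment configuration. The Python sketch below only builds the request arguments and makes no AWS call; the helper name `circuit_breaker_update` and the cluster and service names are placeholders chosen for this example.

```python
def circuit_breaker_update(cluster, service):
    """Build keyword arguments for boto3's ECS update_service call that
    enable the deployment circuit breaker with automatic rollback."""
    return {
        "cluster": cluster,
        "service": service,
        "deploymentConfiguration": {
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    }


# Hypothetical names. With boto3 installed and credentials configured, this
# could be applied as: boto3.client("ecs").update_service(**kwargs)
kwargs = circuit_breaker_update("prod-cluster", "orders-service")
```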

87

MCQ · hard

A company runs a stateful application on EC2 that requires sticky sessions. They use an ALB with duration-based stickiness. During a deployment, they want to drain existing connections gracefully before terminating instances. Which step is necessary?

A. Increase the deregistration delay on the target group.

B. Reduce the stickiness duration to zero.

C. Configure health checks to mark instances unhealthy.

D. Enable connection draining on the target group.

Answer: D

ALB connection draining allows in-flight requests to complete.

Why this answer

Option D is correct because connection draining on the target group lets existing connections complete before the instance is terminated. Option A is incorrect because the question asks for the draining behavior itself rather than a longer delay; note that on ALB target groups this behavior is configured through the deregistration delay attribute, and connection draining is the Classic Load Balancer name for it. Option B is incorrect because the stickiness duration does not drain connections.

Option C is incorrect because health checks do not manage connection draining.

88

MCQ · hard

Refer to the exhibit. A Lambda function uses the IAM role with the above policy. The function is configured to access a DynamoDB table MyTable and an RDS instance in a VPC. When invoked, the function fails with an error indicating it cannot describe VPC subnets. What is the MOST likely cause?

A. The Lambda function is missing permissions to describe VPC subnets and security groups.

B. The Lambda function does not have permission to write to DynamoDB.

C. The Lambda function cannot create network interfaces in the VPC.

D. The DynamoDB table's resource policy denies access from Lambda.

Answer: A

Lambda needs ec2:DescribeSubnets and ec2:DescribeSecurityGroups to set up elastic network interfaces in a VPC.

Why this answer

Option A is correct because a Lambda function configured for VPC access needs permission to describe the VPC's subnets and security groups. The policy allows the EC2 network interface actions but not ec2:DescribeSubnets or ec2:DescribeSecurityGroups. Option B is wrong because the policy allows dynamodb:PutItem and dynamodb:UpdateItem.

Option C is wrong because the policy allows creating network interfaces. Option D is wrong because the error is about describing subnets, not about DynamoDB.
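
The exhibit is not reproduced on this page, so the policy in the following Python sketch is a hypothetical stand-in built from the explanation above: it grants the network interface and DynamoDB actions and omits the two describe actions. The helper names are also assumptions, and the check matches action names exactly without expanding wildcards.

```python
def allowed_actions(policy):
    """Collect every action granted by Allow statements in an IAM policy document."""
    actions = set()
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        action = stmt.get("Action", [])
        actions.update([action] if isinstance(action, str) else action)
    return actions


def missing_actions(policy, required):
    """Return the required actions the policy does not grant (exact match only)."""
    return sorted(set(required) - allowed_actions(policy))


# Hypothetical policy shaped like the one the explanation describes.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem"], "Resource": "*"},
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateNetworkInterface",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DeleteNetworkInterface",
            ],
            "Resource": "*",
        },
    ],
}

missing = missing_actions(example_policy, ["ec2:DescribeSubnets", "ec2:DescribeSecurityGroups"])
print(missing)
```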

89

MCQ · easy

A company runs a static website on Amazon S3 with public read access. The website content is stored in an S3 bucket and served through an Amazon CloudFront distribution for better performance and security. Recently, the company noticed that some users are accessing the S3 bucket directly via the S3 endpoint, bypassing CloudFront. This increases costs and exposes the bucket to potential attacks. The company wants to ensure that all access to the website goes through CloudFront only. Which solution should the company implement?

A. Set the S3 bucket policy to deny all requests that do not come from the CloudFront distribution's IP addresses.

B. Configure the S3 bucket to use AWS WAF to block requests that do not have a custom header set by CloudFront.

C. Create an origin access identity (OAI) in CloudFront and update the S3 bucket policy to allow only the OAI to read objects.

D. Change the S3 bucket to be private and use presigned URLs for all requests.

Answer: C

OAI ensures only CloudFront can access the bucket.

Why this answer

To restrict access to the S3 bucket only through CloudFront, use an origin access identity (OAI) and a bucket policy that allows only the OAI. This way, direct access via S3 URL is denied.
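
A bucket policy of the kind described might look like the one built by this Python sketch. The bucket name and OAI ID are placeholders, and `oai_bucket_policy` is a helper invented for this example; the code only constructs and prints the JSON document.

```python
import json


def oai_bucket_policy(bucket_name, oai_id):
    """Build an S3 bucket policy that grants object reads to one CloudFront OAI."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCloudFrontOAIRead",
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {oai_id}"
                },
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# Placeholder values, not real identifiers.
policy = oai_bucket_policy("example-website-bucket", "EXAMPLEOAIID")
print(json.dumps(policy, indent=2))
```

The policy restricts access only if the bucket's existing public-read grant is removed as well; otherwise the S3 endpoint remains reachable directly.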

90

MCQ · medium

A company runs a containerized application on Amazon EKS. They want to ensure that if a node fails, the pods are rescheduled on healthy nodes. Which configuration is necessary?

A. Configure a pod disruption budget to prevent too many pods from being terminated simultaneously.

B. Use a horizontal pod autoscaler to increase the number of pods during high load.

C. Configure the EKS managed node group with a health check and ensure that the Kubernetes control plane automatically reschedules pods from failed nodes.

D. Use a cluster autoscaler to automatically add new nodes when pods are pending.

Answer: C

EKS managed node groups automatically replace unhealthy nodes, and Kubernetes reschedules pods.

Why this answer

Option C is correct because an EKS managed node group replaces unhealthy nodes and Kubernetes automatically reschedules pods from failed nodes onto healthy ones. Option A is wrong because pod disruption budgets limit voluntary disruptions, not node failures. Option B is wrong because the horizontal pod autoscaler scales pods based on load, not node failures.

Option D is wrong because the cluster autoscaler adds nodes when pods are pending, but it does not reschedule pods from failed nodes.

91

MCQ · hard

A company runs a production e-commerce platform on AWS. The architecture includes an Application Load Balancer (ALB) that distributes traffic to a fleet of Amazon EC2 instances running in an Auto Scaling group across three Availability Zones (AZs). The application stores session state in Amazon ElastiCache for Redis (cluster mode disabled) with a single node. The database is an Amazon Aurora MySQL DB cluster with one writer and two reader instances in different AZs. The platform experiences intermittent slowdowns and occasional timeouts during peak traffic hours. The CloudWatch metrics show that the ALB's TargetResponseTime is elevated, and the Redis CPU utilization is consistently above 80% during these periods. The Auto Scaling group is scaling out, but new instances take several minutes to become healthy. The DevOps team has been asked to improve the resilience and performance of the application with minimal changes to the application code. Which solution should the team implement?

A. Replace the ALB with a Network Load Balancer (NLB) to reduce latency, and use an Auto Scaling group with a step scaling policy based on Redis CPU utilization.

B. Increase the instance size of the ElastiCache for Redis node and the size of the Aurora writer instance. Also, increase the cooldown period for the Auto Scaling group to allow new instances to warm up.

C. Implement Amazon RDS Proxy in front of the Aurora cluster to reduce database connection overhead, and increase the size of the Redis instance to handle more connections.

D. Migrate ElastiCache for Redis to a cluster mode enabled configuration with multiple shards and enable Multi-AZ with automatic failover. Also, use an ElastiCache replication group with read replicas in different AZs.

Answer: D

Cluster mode shards data, reducing per-node CPU. Multi-AZ and replicas improve resilience and reduce failover time.

Why this answer

Option D is correct because the primary bottleneck is the single-node Redis instance (CPU > 80%), which cannot scale reads or handle failover. Migrating to cluster mode enabled with multiple shards distributes the CPU load across shards, while Multi-AZ with automatic failover and read replicas in different AZs provides high availability and read scaling. This directly addresses the elevated ALB TargetResponseTime caused by Redis latency without requiring application code changes.

Exam trap

The trap here is that candidates focus on scaling the database or load balancer (options A, B, C) instead of recognizing that the single-node Redis cache is the bottleneck and requires horizontal scaling and high availability to resolve both performance and resilience issues.

How to eliminate wrong answers

Option A is wrong because replacing the ALB with an NLB does not reduce application-layer latency (an NLB operates at Layer 4, not Layer 7, and cannot inspect HTTP requests), and a step scaling policy based on Redis CPU utilization does not fix the single-node Redis bottleneck or the slow instance warm-up. Option B is wrong because increasing the instance size of the single Redis node and the Aurora writer instance only vertically scales the existing bottlenecks, and increasing the Auto Scaling group cooldown period would delay scaling further, worsening the timeouts. Option C is wrong because RDS Proxy reduces database connection overhead but does not address the Redis CPU bottleneck (the primary cause of elevated response times), and increasing the Redis instance size alone does not provide the read scaling or high availability needed.
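
To show why sharding spreads load, this Python sketch computes the Redis Cluster hash slot for a key (CRC16 of the key modulo 16384) and maps slots to shards. The equal contiguous slot ranges in `shard_for`, the three-shard layout, and the `session:` key names are assumptions for illustration, and hash tags are ignored.

```python
import binascii

TOTAL_SLOTS = 16384  # Redis Cluster divides the key space into this many hash slots


def key_slot(key: str) -> int:
    """Hash slot for a key: CRC16 (XMODEM variant) of the key, modulo 16384.
    Hash tags ({...}) are ignored in this sketch."""
    return binascii.crc_hqx(key.encode(), 0) % TOTAL_SLOTS


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard, assuming slots are split into equal contiguous ranges."""
    return key_slot(key) * num_shards // TOTAL_SLOTS


# Count how many hypothetical session keys land on each of three shards.
counts = [0, 0, 0]
for i in range(3000):
    counts[shard_for(f"session:{i}", 3)] += 1
print(counts)
```

Because each key is owned by exactly one shard, the CPU work for session reads and writes is divided among the shards rather than concentrated on a single node.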

92

MCQ · hard

A company runs a high-traffic web application on a fleet of EC2 instances behind an Application Load Balancer (ALB) with Auto Scaling. The application uses an Amazon RDS for PostgreSQL database. Recently, during a traffic spike, the application became unresponsive. Investigation revealed that the database CPU utilization reached 100%, causing queries to timeout. The Auto Scaling group added more EC2 instances, which only increased the load on the database. The DevOps team needs to implement a solution that prevents the database from being overwhelmed during traffic spikes while maintaining application availability. The solution must be cost-effective and require minimal changes to the application code. Which solution should the DevOps team implement?

A. Implement read replicas for the RDS database and modify the application to use read replicas for read queries.

B. Increase the instance size of the RDS database to a larger instance type to handle more connections.

C. Use Amazon RDS Proxy between the application and the database to pool and reuse connections.

D. Configure Auto Scaling to launch EC2 instances based on a custom metric that tracks database CPU utilization, and throttle the number of instances.

Answer: C

RDS Proxy reduces the number of database connections, lowering CPU usage and improving scalability.

Why this answer

RDS Proxy manages database connections efficiently, reducing the number of connections and CPU overhead. It also provides connection pooling, which helps handle spikes without overwhelming the database.
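
The pooling idea can be sketched in a few lines of Python. This is a toy in-process pool, not how RDS Proxy is implemented; `ConnectionPool` and `fake_connect` are hypothetical names, and the "connection" is a plain object standing in for a database session.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class ConnectionPool:
    """A tiny fixed-size pool: callers borrow a connection and return it after use."""

    def __init__(self, connect, size):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # open all connections up front

    def run(self, work):
        conn = self._idle.get()  # blocks while every connection is busy
        try:
            return work(conn)
        finally:
            self._idle.put(conn)  # hand the connection back for reuse


opened = []


def fake_connect():
    """Stand-in for opening a real database connection (hypothetical)."""
    conn = object()
    opened.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=5)
with ThreadPoolExecutor(max_workers=50) as executor:
    results = list(executor.map(lambda i: pool.run(lambda conn: i * 2), range(200)))
```

Fifty concurrent workers complete 200 requests while the database side only ever sees five connections, which is the effect the explanation attributes to pooling during a spike.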

93

MCQ · hard

A company runs a stateful web application on EC2 instances behind a Network Load Balancer (NLB) in a single Availability Zone. The application stores session state locally on the instance. The company wants to achieve high availability across multiple AZs with minimal application changes. What should the DevOps engineer do?

A. Add more AZs and configure the NLB with cross-zone load balancing.

B. Replace the NLB with an ALB and use ElastiCache for session storage.

C. Use a Multi-AZ RDS instance to store session state.

D. Replace the NLB with an ALB and enable sticky sessions (session affinity) using the ALB's cookie.

Answer: D

Sticky sessions ensure that requests from the same client are routed to the same instance, preserving local session state.

Why this answer

Option D is correct because replacing the NLB with an ALB and enabling sticky sessions (session affinity) using the ALB's cookie allows the stateful web application to maintain session state across multiple AZs without modifying the application code. The ALB generates a cookie (AWSALB) that binds a client's session to a specific target instance, ensuring subsequent requests from the same client are routed to the same EC2 instance. This achieves high availability across AZs with minimal changes, as the application continues to store session state locally on the instance.

Exam trap

The trap here is that candidates often assume cross-zone load balancing or adding more AZs inherently solves high availability for stateful applications, but they overlook that session affinity is required to keep a client's requests directed to the same instance when session state is stored locally.

How to eliminate wrong answers

Option A is wrong because adding more AZs and configuring cross-zone load balancing with an NLB does not solve the session state problem; the NLB distributes traffic across instances without session affinity, so a client's requests may be routed to different instances in different AZs, breaking the locally stored session. Option B is wrong because replacing the NLB with an ALB and using ElastiCache for session storage requires application code changes to read/write session data to ElastiCache, which contradicts the requirement for minimal application changes. Option C is wrong because using a Multi-AZ RDS instance for session storage also requires significant application code changes to store and retrieve session data from the database, and it introduces unnecessary complexity and latency for session management.

94

Multi-Select · hard

A company's application uses Amazon DynamoDB as its primary data store. The application experiences occasional throttling errors during traffic spikes. The DevOps team needs to implement a solution that ensures consistent performance without manual intervention. Which TWO actions should the team take? (Choose TWO.)

Select 2 answers

A. Use eventually consistent reads for all queries.

B. Move the data to Amazon RDS with read replicas.

C. Implement DynamoDB Accelerator (DAX) to cache read requests.

D. Enable DynamoDB Auto Scaling for read and write capacity.

E. Switch DynamoDB to On-Demand capacity mode.

Answers: C, D

DAX reduces read load on DynamoDB, mitigating throttling for read-heavy workloads.

Why this answer

DynamoDB Accelerator (DAX) is an in-memory cache that reduces read response times from milliseconds to microseconds, offloading read-heavy workloads from the DynamoDB table. By caching frequently accessed items, DAX absorbs traffic spikes and reduces the likelihood of throttling on read requests, ensuring consistent performance without manual intervention. DynamoDB Auto Scaling complements this by adjusting the table's provisioned read and write capacity as traffic changes, which covers the write traffic that DAX does not absorb.

Exam trap

The trap here is that candidates may think On-Demand capacity mode (Option E) is the only way to handle spikes without manual intervention, but it ignores the cost implications and the fact that DAX plus Auto Scaling provides a more balanced and cost-effective solution for read-heavy workloads.
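
For Option D, DynamoDB Auto Scaling is configured through Application Auto Scaling. The Python sketch below builds the two request payloads for read capacity and makes no AWS call; the table name, capacity limits, 70% target, and the helper name are assumptions chosen for this example.

```python
def dynamodb_autoscaling_config(table_name, min_units, max_units, target_utilization=70.0):
    """Build the two Application Auto Scaling requests for a table's read capacity."""
    common = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": f"table/{table_name}",
        "ScalableDimension": "dynamodb:table:ReadCapacityUnits",
    }
    # Bounds within which the provisioned capacity may move.
    target = {**common, "MinCapacity": min_units, "MaxCapacity": max_units}
    # Target tracking keeps consumed/provisioned capacity near the target percentage.
    policy = {
        **common,
        "PolicyName": f"{table_name}-read-target-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    }
    return target, policy


# Hypothetical table and limits. With boto3 these could be passed to
# register_scalable_target(**target) and put_scaling_policy(**policy)
# on the "application-autoscaling" client.
target, policy = dynamodb_autoscaling_config("MyTable", 5, 500)
```

Write capacity would need a second pair of requests using the `dynamodb:table:WriteCapacityUnits` dimension and the `DynamoDBWriteCapacityUtilization` metric.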

95

MCQ · medium

A company runs a critical application on Amazon RDS for MySQL with Multi-AZ deployment. The database is 2 TB in size. The DevOps team needs to perform a major version upgrade (e.g., MySQL 5.7 to 8.0) with minimal downtime. The RTO is 5 minutes and RPO is 1 minute. Which approach should the team take?

A.Create a read replica of the database in the same region, upgrade the replica to the new version, and then promote it to primary.

B.Take a snapshot of the database, restore it as a new instance, upgrade the restored instance, and redirect traffic.

C.Perform an in-place major version upgrade directly on the primary instance.

D.Use AWS Database Migration Service (DMS) to migrate data to a new database instance with the upgraded version.

AnswerA

This minimizes downtime as the replica syncs continuously; cutover is quick.

Why this answer

Option A is correct because a read replica can be upgraded to the new version while the primary keeps serving traffic, so promoting the upgraded replica limits downtime to the cutover. The replica syncs continuously, so RPO is near zero. Option C (in-place upgrade) causes downtime.

Option D (DMS) is for migration, not upgrade, and may have longer downtime. Option B (snapshot restore) has significant downtime.

96

MCQhard

A company is implementing a disaster recovery strategy for its Amazon Aurora MySQL database. The primary database is in us-west-2. The company requires an RPO of less than 1 minute and an RTO of less than 5 minutes. Which solution meets these requirements?

A.Create a cross-Region read replica in the secondary Region and promote it during failover.

B.Use automated backups and restore to a new DB instance in the secondary Region.

C.Use Amazon Aurora Global Database with a secondary Region cluster.

D.Take manual snapshots of the DB instance and copy them to the secondary Region every hour.

AnswerC

Aurora Global Database provides low-latency replication and fast failover (under 1 minute for RPO, minutes for RTO).

Why this answer

Amazon Aurora Global Database is designed for low-latency cross-Region replication with a typical RPO of 1 second and RTO of 1 minute or less, meeting the <1 minute RPO and <5 minute RTO requirements. It uses a dedicated storage-level replication channel that keeps the secondary cluster closely in sync, typically within about a second, without impacting primary performance, and failover involves promoting the secondary cluster to primary in under a minute.

Exam trap

The trap here is that candidates confuse a cross-Region read replica (Option A) with Aurora Global Database, assuming both provide similar failover speed, but the read replica's promotion process is slower and less reliable for meeting strict RTO/RPO targets.

How to eliminate wrong answers

Option A is wrong because a cross-Region read replica for Aurora MySQL uses asynchronous replication with a typical RPO of several seconds to minutes, but the promotion process can take longer than 5 minutes due to the need to apply remaining redo logs and reconfigure endpoints, failing the RTO requirement. Option B is wrong because automated backups are taken once per day (default retention of 1-35 days) and restoring to a new instance in a secondary Region requires copying the backup across Regions, which can take hours and far exceeds both the RPO and RTO limits. Option D is wrong because manual snapshots taken every hour provide an RPO of up to 60 minutes, which violates the <1 minute RPO requirement, and restoring from a snapshot in a secondary Region also takes significantly longer than 5 minutes.
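
The elimination above amounts to checking each option's RPO and RTO against the stated limits. The sketch below shows that check in Python. `DrOption` and `meets_targets` are hypothetical names, and the figures are illustrative: the Global Database values and the hourly-snapshot RPO follow the explanation, while the two-hour snapshot restore time is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo_s: float  # worst-case data loss, in seconds
    rto_s: float  # time to restore service, in seconds


def meets_targets(option: DrOption, max_rpo_s: float, max_rto_s: float) -> bool:
    """True when both objectives are strictly within the required limits."""
    return option.rpo_s < max_rpo_s and option.rto_s < max_rto_s


# Illustrative figures only; the snapshot restore time is an assumption.
OPTIONS = [
    DrOption("Aurora Global Database", rpo_s=1, rto_s=60),
    DrOption("Hourly snapshot copy", rpo_s=3600, rto_s=2 * 3600),
]

passing = [o.name for o in OPTIONS if meets_targets(o, max_rpo_s=60, max_rto_s=300)]
print(passing)  # ['Aurora Global Database']
```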

97

Multi-Selectmedium

A company runs a stateful web application on EC2 instances that store session data locally. They want to migrate to a stateless architecture for better resilience. Which TWO actions should they take?

Select 2 answers

A.Use Amazon CloudFront to cache session data at the edge.

B.Use Amazon DynamoDB to store session data.

C.Use Amazon S3 to store session data as objects.

D.Use ElastiCache for Redis to store session data externally.

E.Use Amazon EFS to store session data as files.

AnswersB, D

DynamoDB is a scalable, low-latency session store.

Why this answer

Options B and D are correct. ElastiCache for Redis provides a centralized session store, and DynamoDB provides a durable, scalable session store. Option E is wrong because EFS is a file system, which is not ideal for session data.

Option A is wrong because CloudFront is a CDN, not a session store. Option C is wrong because S3 is object storage and is not suited to high-frequency writes such as session updates.
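
Moving session data out of the instance can be sketched as follows. This is a Python illustration under stated assumptions: `SharedSessionStore` is a hypothetical in-memory stand-in for an external store such as ElastiCache or DynamoDB, and `WebServer` is a hypothetical application server. No real AWS client calls are shown.

```python
from typing import Dict, List, Optional


class SharedSessionStore:
    """In-memory stand-in for an external store (e.g. ElastiCache or DynamoDB)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[str]] = {}

    def get_cart(self, session_id: str) -> Optional[List[str]]:
        cart = self._sessions.get(session_id)
        return list(cart) if cart is not None else None

    def put_cart(self, session_id: str, cart: List[str]) -> None:
        self._sessions[session_id] = list(cart)


class WebServer:
    """A stateless server: it keeps no session data between requests."""

    def __init__(self, name: str, store: SharedSessionStore) -> None:
        self.name = name
        self._store = store

    def add_to_cart(self, session_id: str, item: str) -> None:
        cart = self._store.get_cart(session_id) or []
        cart.append(item)
        self._store.put_cart(session_id, cart)

    def view_cart(self, session_id: str) -> List[str]:
        return self._store.get_cart(session_id) or []


store = SharedSessionStore()
server_a = WebServer("a", store)
server_b = WebServer("b", store)

server_a.add_to_cart("session-1", "book")
# server_a could now be terminated; server_b still sees the session.
print(server_b.view_cart("session-1"))  # ['book']
```

Because neither server holds the cart itself, any instance can serve any request, which is the property that makes instance replacement safe.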

98

MCQeasy

A company runs a serverless application using AWS Lambda functions behind an Amazon API Gateway. The application processes user uploads stored in an S3 bucket. The Lambda function writes results to a DynamoDB table. Recently, the function started timing out when processing large files. What should the DevOps engineer do to improve resilience for large file processing?

A.Increase the Lambda function memory to improve CPU performance.

B.Use S3 event notifications to trigger an AWS Step Functions workflow that processes the file asynchronously.

C.Increase the Lambda function timeout to the maximum 15 minutes.

D.Add Amazon ElastiCache to cache processed results and reduce Lambda execution time.

AnswerB

Step Functions can orchestrate long-running tasks, breaking the file into chunks or using parallel processing, avoiding Lambda timeouts.

Why this answer

Option B is correct because Lambda has a maximum execution timeout of 15 minutes. For large files, using S3 event notifications to trigger an asynchronous Step Functions workflow, or using S3 Batch Operations, is more resilient than processing the whole file in a single invocation. Option C (increasing the timeout) might not be enough if the file is very large.

Option D (ElastiCache) is not relevant. Option A (increasing memory) might help but is still subject to the timeout limit.
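
One way a workflow can break a large file into chunks is to split the object into byte ranges and hand each range to a separate worker. The Python sketch below shows only that splitting step. `split_into_ranges` is a hypothetical helper, the chunk size is arbitrary, and the wiring to Step Functions or S3 is not shown.

```python
from typing import List, Tuple


def split_into_ranges(object_size: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges that cover the whole object."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(split_into_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

Each range is small enough to finish well inside a single invocation's time limit, and the ranges can be processed in parallel.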

99

MCQhard

A company uses Amazon Route 53 with a failover routing policy to direct traffic to an active and a standby endpoint. The health checks are configured to check the active endpoint every 10 seconds. During a recent outage, the failover took over 3 minutes to detect and switch. How can the company improve the failover time to under 1 minute?

A.Configure a Route 53 calculated health check that aggregates multiple fast health checks with a lower failure threshold.

B.Add additional health checks for the same endpoint.

C.Reduce the health check interval to 5 seconds.

D.Change the routing policy to latency based.

AnswerA

Calculated health checks can combine quick checks to detect failure faster.

Why this answer

Option A is correct because a calculated health check that aggregates fast endpoint checks with a lower failure threshold can reduce detection time. Option C is wrong because a 5-second health check interval is not supported (the minimum is 10 seconds). Option D is wrong because latency-based routing does not provide active/passive failover.

Option B is wrong because adding more health checks for the same endpoint does not reduce failover time.
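
The effect of the failure threshold on failover time can be estimated with simple arithmetic. The Python sketch below uses a deliberately simplified model: detection time is the request interval multiplied by the failure threshold, plus the DNS record TTL that clients may cache. The function name and the example threshold and TTL values are assumptions, since the scenario does not state them.

```python
def failover_seconds(request_interval_s: int, failure_threshold: int,
                     record_ttl_s: int) -> int:
    """Rough upper bound: time to mark the endpoint unhealthy plus DNS cache expiry."""
    if request_interval_s <= 0 or failure_threshold <= 0 or record_ttl_s < 0:
        raise ValueError("interval and threshold must be positive, TTL non-negative")
    detection = request_interval_s * failure_threshold
    return detection + record_ttl_s


# Hypothetical settings: 10 s interval, threshold 3, 300 s TTL.
print(failover_seconds(10, 3, 300))  # 330
# Same interval with a lower threshold and a shorter TTL.
print(failover_seconds(10, 2, 30))   # 50
```

Under this model the interval is already at its minimum, so the remaining levers are the failure threshold and the record TTL.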

100

MCQeasy

A company uses AWS CodeDeploy to deploy a new version of an application to EC2 instances. They want to minimize downtime and roll back quickly if the deployment fails. Which deployment type should they use?

A.Canary deployment

B.Linear deployment

C.Blue/green deployment

D.In-place deployment

AnswerC

Blue/green allows instant rollback by switching back.

Why this answer

Blue/green deployment creates a new environment (green) and shifts traffic after testing, allowing instant rollback by switching back to the original (blue). Option D is wrong because in-place updates cause downtime during the deployment. Option A is wrong because canary is a subset of blue/green, not the full strategy.

Option B is wrong because linear is a traffic-shifting pattern, not a deployment type.
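
The switch-and-rollback idea behind blue/green can be sketched in a few lines of Python. This is a conceptual model only and does not use the CodeDeploy API: `BlueGreenRouter`, `release`, and `passes_checks` are hypothetical names.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which environment receives traffic and which one is idle."""

    def __init__(self, live: str, idle: str) -> None:
        self.live = live
        self.idle = idle

    def switch(self) -> None:
        """Swap environments; calling it a second time is the rollback."""
        self.live, self.idle = self.idle, self.live


def release(router: BlueGreenRouter, passes_checks: Callable[[str], bool]) -> bool:
    """Shift traffic to the idle environment only if it passes validation."""
    if not passes_checks(router.idle):
        return False          # traffic never left the current environment
    router.switch()
    return True


router = BlueGreenRouter(live="blue", idle="green")
release(router, passes_checks=lambda env: True)
print(router.live)  # green
router.switch()     # rollback
print(router.live)  # blue
```

Rollback is fast because the previous environment is left running and only the routing decision changes.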

101

MCQmedium

A company is running a stateful web application on EC2 instances behind an Application Load Balancer. During a deployment, users report session timeouts. What should the DevOps engineer implement to ensure zero-downtime deployments without losing in-flight sessions?

A.Update the Auto Scaling group's launch template to use a new AMI and perform a rolling update.

B.Use an Auto Scaling group with a lifecycle hook that waits for instance termination.

C.Enable connection draining (deregistration delay) on the ALB target group and use lifecycle hooks to wait for the draining period.

D.Increase the health check interval and unhealthy threshold on the ALB target group.

AnswerC

Deregistration delay ensures in-flight requests complete; lifecycle hooks provide additional control over termination.

Why this answer

Option C is correct because deregistration delay (connection draining) on the ALB target group allows in-flight requests to complete before instances are terminated, and the lifecycle hook holds termination until the draining period ends. Option B is wrong because an Auto Scaling lifecycle hook by itself does not manage session stickiness or connection draining during deployments. Option A is wrong because updating the launch template does not prevent session loss during replacement.

Option D is wrong because increasing the health check interval and unhealthy threshold does not ensure existing sessions are preserved.
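
The draining behaviour can be modelled as a small simulation. The Python sketch below is a toy model and not ALB or Auto Scaling code: `Target`, `drain`, and the per-second completion rate are hypothetical. It shows that a draining target stops taking new requests and that termination waits until in-flight requests finish or the delay expires.

```python
from dataclasses import dataclass


@dataclass
class Target:
    in_flight: int
    draining: bool = False

    def accepts_new_requests(self) -> bool:
        return not self.draining


def drain(target: Target, deregistration_delay_s: int, finished_per_second: int) -> int:
    """Mark the target as draining and return the seconds waited before termination."""
    target.draining = True
    waited = 0
    while target.in_flight > 0 and waited < deregistration_delay_s:
        target.in_flight = max(0, target.in_flight - finished_per_second)
        waited += 1
    return waited


target = Target(in_flight=10)
print(drain(target, deregistration_delay_s=300, finished_per_second=4))  # 3
print(target.in_flight)  # 0
```

If the delay is shorter than the time the requests need, some requests are still in flight when the wait ends, which is why the delay should be sized to the longest expected request.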

102

Multi-Selecthard

A company has a microservices architecture running on Amazon ECS with Fargate launch type. Each service is deployed in multiple Availability Zones. The services communicate via REST APIs. Recently, a downstream service experienced a partial outage, causing upstream services to time out and leading to cascading failures. The team wants to improve resilience against such failures. Which combination of actions should the DevOps engineer take? (Choose TWO.)

Select 2 answers

A.Increase the HTTP timeout values for all service-to-service calls.

B.Implement circuit breaker patterns in the service clients.

C.Remove all retry logic from service calls.

D.Adopt an asynchronous communication pattern using Amazon SQS or Amazon EventBridge.

E.Configure Auto Scaling for all services based on request count.

AnswersB, D

Circuit breakers stop cascading failures by failing fast when downstream is unhealthy.

Why this answer

Option B is correct because implementing circuit breakers (e.g., with AWS App Mesh or client libraries) prevents cascading failures by failing fast when a downstream service is unhealthy. Option D is correct because using an asynchronous messaging pattern (SQS, SNS, EventBridge) decouples services, allowing upstream services to continue processing even if downstream is unavailable. Option A (increasing timeouts) could worsen the situation by holding resources longer.

Option E (Auto Scaling) helps with capacity but not with handling unavailability. Option C (removing retries) is too drastic and could cause data loss.
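
A client-side circuit breaker can be sketched as below. This is a minimal Python illustration of the pattern, not a specific library's implementation: `CircuitBreaker`, `CircuitOpenError`, the thresholds, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"   # allow one trial call through
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("downstream marked unhealthy; failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Trip on repeated failures, or re-open if a half-open trial fails.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting for a timeout, so upstream threads and connections are not tied up by a downstream service that is already failing.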

103

Multi-Selectmedium

A company runs a microservices application on Amazon ECS with Fargate. The application includes a service that processes orders and stores them in an RDS PostgreSQL database. The company wants to ensure that the order service is resilient to AZ failures and can handle a sudden increase in order volume. Which TWO actions should the DevOps engineer take? (Choose TWO.)

Select 2 answers

A.Increase the CPU and memory limits for the ECS task definition.

B.Place an Amazon CloudFront distribution in front of the order service.

C.Deploy the RDS instance in a Multi-AZ configuration.

D.Configure the ECS service to run tasks in multiple Availability Zones.

E.Use RDS Proxy to manage database connections.

AnswersC, D

Multi-AZ RDS provides automatic failover to a standby in another AZ.

Why this answer

Option C is correct because deploying the RDS instance in a Multi-AZ configuration provides automatic failover to a standby replica in a different Availability Zone, ensuring database resilience to AZ failures. Option D is correct because configuring the ECS service to run tasks in multiple Availability Zones distributes the order processing workload across AZs, improving both fault tolerance and scalability during sudden traffic spikes.

Exam trap

The trap here is that candidates often confuse connection pooling (RDS Proxy) with high availability (Multi-AZ) or assume that vertical scaling (increasing task limits) is sufficient for both resilience and sudden load, when in fact horizontal distribution across AZs is required for fault tolerance and elasticity.

104

Multi-Selectmedium

A company runs a containerized application on Amazon ECS with Fargate. The application experiences intermittent failures due to resource exhaustion. The company wants to improve resilience by automatically replacing unhealthy tasks and scaling based on demand. Which TWO actions should the company take? (Choose TWO.)

Select 2 answers

A.Enable cluster auto scaling to add more Fargate capacity.

B.Use an Auto Scaling group to manage the number of tasks.

C.Define a health check command in the task definition to restart unhealthy containers.

D.Configure ECS service auto scaling with a target tracking policy based on CPU utilization.

E.Set a minimum healthy percent of 50% and maximum percent of 200% in the service configuration.

AnswersD, E

Auto scaling adjusts task count to meet demand.

Why this answer

Option D is correct because ECS service auto scaling with a target tracking policy based on CPU utilization automatically adjusts the desired count of tasks to match demand, preventing resource exhaustion by scaling out when CPU is high and scaling in when low. This directly addresses the intermittent failures caused by resource exhaustion by ensuring sufficient task capacity during load spikes.

Exam trap

The trap here is confusing cluster auto scaling (for EC2 instances) with ECS service auto scaling (for Fargate tasks), and assuming a health check command alone restarts containers without understanding that ECS service health checks only trigger replacement when integrated with the service's deployment controller.

105

Multi-Selectmedium

A company is designing a highly available architecture for a stateless web application using AWS services. Which TWO steps should they take to achieve high availability?

Select 2 answers

A.Store session state in an EBS volume attached to each instance

B.Deploy EC2 instances in multiple Availability Zones

C.Use a single NAT instance in a public subnet

D.Use only M5 instance types for better performance

E.Use an Application Load Balancer to distribute traffic

AnswersB, E

Essential for high availability.

Why this answer

Deploying across multiple AZs ensures availability during an AZ failure. An ALB distributes traffic and performs health checks. Option D is wrong because M5 instances are not required for high availability.

Option A is wrong because session state on an instance-attached EBS volume is not stateless; it is better to use ephemeral storage or S3. Option C is wrong because a single NAT instance is a single point of failure, which is an anti-pattern.

106

MCQmedium

An e-commerce platform uses Amazon DynamoDB as its primary database. The platform experiences occasional read throttling during flash sales. The operations team needs to ensure that read traffic is handled without errors, while keeping costs low. What should a DevOps engineer recommend?

A.Enable DynamoDB Accelerator (DAX) to cache frequently read data.

B.Increase the read capacity units for the table during flash sale events.

C.Use DynamoDB Streams to replicate reads to a separate table.

D.Implement Global Tables to distribute read traffic across multiple regions.

AnswerA

DAX reduces read load and throttling with lower cost than increasing capacity.

Why this answer

Option A is correct because DynamoDB Accelerator (DAX) provides an in-memory cache that reduces read load on the database, improving performance and reducing throttling. Option B is wrong because increasing read capacity units increases cost without optimization. Option C is wrong because DynamoDB Streams is for change data capture, not caching.

Option D is wrong because Global Tables is for multi-region replication, not read scaling.

107

MCQeasy

A company is deploying a critical application on Amazon EC2 instances behind an Application Load Balancer (ALB) across multiple Availability Zones. The application must be resilient to the failure of an entire Availability Zone. Which design should the company implement?

A.Launch EC2 instances in at least two Availability Zones and place them behind an Application Load Balancer with cross-zone load balancing enabled.

B.Use one EC2 instance in a single Availability Zone behind a Network Load Balancer.

C.Launch EC2 instances in one Availability Zone and use an Application Load Balancer to distribute traffic.

D.Deploy EC2 instances in two Availability Zones but use a single Application Load Balancer in one AZ.

AnswerA

Multiple AZs provide resilience; ALB distributes traffic and performs health checks.

Why this answer

Option A is correct because deploying EC2 instances across at least two Availability Zones (AZs) behind an Application Load Balancer (ALB) with cross-zone load balancing enabled ensures that if an entire AZ fails, the ALB can route traffic to healthy instances in the remaining AZs. Cross-zone load balancing allows the ALB to distribute incoming requests evenly across all registered instances in all enabled AZs, which improves fault tolerance and resource utilization. This design meets the requirement for resilience to an AZ failure by eliminating a single point of failure at the AZ level.

Exam trap

The trap here is that candidates often assume that simply placing instances in multiple AZs behind a load balancer is sufficient, but they overlook the critical requirement that the load balancer itself must be deployed across multiple AZs to avoid being a single point of failure.

How to eliminate wrong answers

Option B is wrong because using a single EC2 instance in one AZ behind a Network Load Balancer (NLB) does not provide resilience to an AZ failure; if that AZ goes down, the application becomes unavailable. Option C is wrong because launching EC2 instances in only one AZ behind an ALB still creates a single point of failure at the AZ level; the ALB cannot route traffic to healthy instances if the entire AZ fails. Option D is wrong because deploying EC2 instances in two AZs but using a single ALB in one AZ means the ALB itself is a single point of failure; if that AZ fails, the ALB becomes unavailable, and traffic cannot be distributed to instances in the other AZ.

108

MCQmedium

A company's DevOps team is designing a multi-region disaster recovery solution for a stateless web application. The application runs on Amazon EC2 instances behind an Application Load Balancer (ALB) in the us-east-1 region. The team needs to fail over to a secondary region (us-west-2) with minimal downtime in case of a regional outage. Which AWS service should the team use to route traffic to the healthy region?

A.Amazon Route 53 with a failover routing policy and health checks on the primary ALB.

B.Elastic Load Balancing (ELB) cross-zone load balancing across regions.

C.AWS Global Accelerator with endpoint groups in both regions.

D.Amazon CloudFront with a multi-origin setup and origin failover.

AnswerA

Route 53 automatically responds to health check failures by routing to the secondary region.

Why this answer

Amazon Route 53 with a failover routing policy and health checks on the primary ALB is correct because it allows DNS-based routing that automatically directs traffic to the secondary region (us-west-2) when the primary ALB in us-east-1 is unhealthy. The health checks monitor the ALB endpoint, and upon failure, Route 53 returns the failover record's IP addresses, enabling multi-region failover with minimal downtime for stateless web applications.

Exam trap

The trap here is that candidates often confuse AWS Global Accelerator's traffic dials and endpoint weights with a true failover mechanism, but Route 53 failover routing is the only service that provides a binary, health-check-driven switch between primary and secondary regions for DNS-based traffic routing.

How to eliminate wrong answers

Option B is wrong because ELB cross-zone load balancing distributes traffic across instances within the same region, not across regions; it cannot route traffic between us-east-1 and us-west-2. Option C is wrong because AWS Global Accelerator uses anycast IPs and endpoint groups to route traffic to the nearest healthy endpoint, but it does not provide a failover routing policy that explicitly switches traffic to a secondary region only when the primary fails; it relies on endpoint health and traffic dials, which can cause partial traffic shifts rather than a clean failover. Option D is wrong because Amazon CloudFront with multi-origin and origin failover is designed for static content delivery and can fail over between origins, but it is not optimized for dynamic web application traffic behind an ALB and introduces additional latency and complexity for failover scenarios that require immediate DNS-level switching.

109

MCQmedium

A DevOps engineer is designing a multi-Region active-active architecture for a stateless web application using Route 53 latency-based routing and DynamoDB global tables. The application must continue to serve traffic even if an entire AWS Region becomes unavailable. Which additional step is MOST critical for resilience?

A.Use an Auto Scaling group with a scheduled scaling policy

B.Enable DynamoDB Accelerator (DAX) in each Region

C.Place a CloudFront distribution in front of the application

D.Configure Route 53 health checks and associate them with the latency records

AnswerD

Health checks enable Route 53 to detect regional outages and failover.

Why this answer

Option D is correct because health checks are required for Route 53 to detect Regional failures and route traffic away from unhealthy endpoints. Option B is wrong because DAX only caches reads and does not help the application survive a Regional outage. Option C is wrong because CloudFront does not replace Regional routing.

Option A is wrong because scheduled scaling does not address failover automation.
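
The role of health checks in latency-based routing can be shown with a small selection function. This Python sketch is a conceptual model and not Route 53's actual resolution logic: `pick_region` and the latency figures are hypothetical.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency Region among those whose health check passes."""
    candidates = {region: ms for region, ms in latency_ms.items()
                  if healthy.get(region, False)}
    if not candidates:
        return None  # no healthy Region is left to answer with
    return min(candidates, key=candidates.get)


latency = {"us-east-1": 20.0, "us-west-2": 70.0}

print(pick_region(latency, {"us-east-1": True, "us-west-2": True}))   # us-east-1
print(pick_region(latency, {"us-east-1": False, "us-west-2": True}))  # us-west-2
```

Without the health filter, the function would keep returning the lowest-latency Region even while it is down, which is the failure mode the health checks prevent.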

110

MCQhard

An organization runs a critical application on Amazon EC2 instances in an Auto Scaling group behind an Application Load Balancer. The application requires that all traffic be encrypted in transit. The security team mandates the use of TLS 1.2 or higher and specific ciphers. What is the MOST efficient way to enforce this requirement?

A.Use a Network Load Balancer with TLS listeners and target groups.

B.Place a CloudFront distribution in front of the ALB and configure the origin protocol policy.

C.Install a self-signed certificate on each EC2 instance and configure the web server.

D.Configure the ALB with a security policy that enforces TLS 1.2 and the required ciphers.

AnswerD

ALB security policies allow centralized control of TLS settings.

Why this answer

Option D is correct because an ALB supports predefined security policies that can be set to require TLS 1.2 and specific ciphers. Option B is wrong because CloudFront is a CDN; placing it in front of the ALB adds a component without enforcing the policy on the ALB listener. Option A is wrong because replacing the ALB with a Network Load Balancer is unnecessary: the existing ALB can enforce the required policy directly.

Option C is wrong because configuring each EC2 instance is inefficient and not centralized.

111

Multi-Selecteasy

A company wants to design a highly available and fault-tolerant architecture for a stateless web application on AWS. Which TWO actions should they take? (Choose two.)

Select 2 answers

A.Use a single large EC2 instance to simplify management

B.Deploy multiple Application Load Balancers in each AZ

C.Launch EC2 instances in at least two Availability Zones

D.Use an RDS Multi-AZ deployment for the web server fleet

E.Use an Auto Scaling group to replace failed instances automatically

AnswersC, E

Multiple AZs provide fault tolerance.

Why this answer

Deploying EC2 instances in multiple Availability Zones and using an Auto Scaling group ensures fault tolerance and high availability. Option A is wrong because a single instance in one AZ is not fault-tolerant. Option D is wrong because Multi-AZ is an RDS feature, not a deployment option for web application instances.

Option B is wrong because a single ALB spans the enabled AZs, so multiple ALBs are not needed.

112

MCQmedium

A company runs a stateful application on EC2 instances. They want to distribute traffic evenly and maintain session stickiness. Which AWS service should they use?

A.Network Load Balancer

B.Application Load Balancer with sticky sessions

C.Amazon Route 53 weighted routing policy

D.Amazon CloudFront with origin failover

AnswerB

ALB supports sticky sessions via cookies.

Why this answer

An Application Load Balancer with sticky sessions (session affinity) ensures that a client's requests are sent to the same target. Option A is wrong because Network Load Balancer does not natively support sticky sessions based on application cookies. Option C is wrong because Route 53 weighted routing does not handle session stickiness.

Option D is wrong because CloudFront can forward cookies but is not primarily for load balancing.
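
Cookie-based stickiness can be illustrated with a toy load balancer. This Python sketch is a simplification and not how the ALB encodes its cookie: `StickyBalancer` is hypothetical, and the cookie value is simply the target ID, whereas a real load balancer uses an opaque value.

```python
import itertools
from typing import Iterable, Optional, Tuple


class StickyBalancer:
    def __init__(self, targets: Iterable[str]) -> None:
        self._targets = list(targets)
        if not self._targets:
            raise ValueError("at least one target is required")
        self._round_robin = itertools.cycle(self._targets)

    def route(self, cookie: Optional[str]) -> Tuple[str, str]:
        """Return (chosen target, cookie to set on the response)."""
        if cookie in self._targets:
            return cookie, cookie          # returning client: same target again
        target = next(self._round_robin)   # new client: next target in rotation
        return target, target


lb = StickyBalancer(["i-a", "i-b"])
first_target, cookie = lb.route(None)
print(first_target)         # i-a
print(lb.route(cookie)[0])  # i-a
print(lb.route(None)[0])    # i-b
```

New clients are spread across targets in rotation, while a client that presents its cookie keeps returning to the target that holds its session.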

113

MCQhard

A company runs a critical web application on Amazon EC2 instances behind an Application Load Balancer (ALB). The application frequently experiences high latency during peak hours. The DevOps team needs to implement a solution that automatically adds capacity based on demand and reduces cost during off-peak hours. Which combination of AWS services should the team use?

A.Use an AWS Auto Scaling group with scheduled scaling policies that add instances during known peak hours and remove them during off-peak hours.

B.Implement Amazon Route 53 weighted routing policies to distribute traffic to multiple ALBs, each fronting a fixed set of EC2 instances.

C.Use an AWS Auto Scaling group with simple scaling policies based on CPU utilization and attach it to the ALB target group.

D.Use an AWS Auto Scaling group with target tracking scaling policies based on the ALB's request count per target, and attach it to the ALB target group.

AnswerD

This dynamically adjusts capacity based on actual load.

Why this answer

Option D is correct because target tracking scaling policies allow the Auto Scaling group to automatically adjust capacity based on a specific metric, such as ALB request count per target, which directly reflects the load on each instance. This ensures that capacity is added during high latency periods and removed during off-peak hours, optimizing both performance and cost. The ALB target group integration ensures that new instances are automatically registered and start receiving traffic.

Exam trap

The trap here is that candidates often choose scheduled scaling (Option A) because it seems straightforward for known peak hours, but they overlook the requirement to handle unpredictable high latency during peak hours, which demands a dynamic, metric-based scaling solution like target tracking.

How to eliminate wrong answers

Option A is wrong because scheduled scaling policies only add or remove instances at predefined times, which cannot react to real-time demand fluctuations or unexpected traffic spikes, leading to either over-provisioning or under-provisioning. Option B is wrong because Route 53 weighted routing policies distribute traffic across multiple ALBs but do not dynamically scale the underlying EC2 instances; each fixed set of instances would still suffer from high latency during peak hours. Option C is wrong because simple scaling policies based on CPU utilization require manual configuration of thresholds and cooldown periods, which can cause slow reaction to sudden load changes and may not directly correlate with application latency as effectively as request count per target.
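
The idea behind target tracking on request count per target can be approximated with a proportional calculation. The Python sketch below is a simplified model and not the algorithm AWS uses internally: `desired_capacity` and the example numbers are assumptions.

```python
import math


def desired_capacity(current_capacity: int, requests_per_target: float,
                     target_value: float, min_size: int, max_size: int) -> int:
    """Capacity that brings requests per target back to the target value."""
    if current_capacity <= 0 or target_value <= 0:
        raise ValueError("current_capacity and target_value must be positive")
    total_load = current_capacity * requests_per_target
    needed = math.ceil(total_load / target_value)
    return max(min_size, min(max_size, needed))


# Peak: 4 instances each handling 1500 requests against a target of 1000.
print(desired_capacity(4, 1500, 1000, min_size=2, max_size=10))  # 6
# Off-peak: 4 instances each handling 200 requests; scale in to the minimum.
print(desired_capacity(4, 200, 1000, min_size=2, max_size=10))   # 2
```

The same rule adds capacity under load and removes it when traffic falls, which is how one policy covers both the performance and the cost requirement.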

114

MCQeasy

A development team wants to ensure that their application can continue serving traffic even if an entire AWS Availability Zone (AZ) becomes unavailable. The application runs on Amazon EC2 instances in an Auto Scaling group and uses an Application Load Balancer (ALB). Which configuration should the team implement to meet this requirement?

A.Configure the Auto Scaling group to launch EC2 instances across multiple AZs, and ensure the ALB is enabled for cross-zone load balancing.

B.Use a launch template with multiple instance types to ensure diversity across the fleet.

C.Use a single AZ but configure EC2 Auto Scaling to replace unhealthy instances automatically.

D.Launch all EC2 instances in the same AZ to minimize latency, and configure the Auto Scaling group to maintain a minimum of two instances.

AnswerA

Multiple AZs provide resilience against AZ failure; cross-zone load balancing distributes traffic evenly.

Why this answer

Option A is correct because deploying EC2 instances across multiple Availability Zones (AZs) ensures that if one AZ fails, the remaining AZs continue to serve traffic. Enabling cross-zone load balancing on the ALB distributes incoming requests evenly across all healthy instances in all AZs, rather than each load balancer node sending traffic only to instances in its own AZ. This architecture meets the requirement for high availability and fault tolerance at the AZ level.

Exam trap

The trap here is that candidates often confuse instance-level resilience (e.g., replacing unhealthy instances) with AZ-level resilience, or they think that multiple instances in a single AZ provide sufficient fault tolerance, ignoring the fact that an AZ failure takes down all instances in that AZ.

How to eliminate wrong answers

Option B is wrong because using multiple instance types in a launch template addresses instance diversity and spot instance interruption resilience, not AZ-level failure; it does not protect against an entire AZ becoming unavailable. Option C is wrong because using a single AZ means all instances are in one failure domain; even with automatic replacement, the application cannot serve traffic if that AZ fails, as the new instances would also be launched in the same unavailable AZ. Option D is wrong because launching all instances in the same AZ and maintaining a minimum of two instances does not provide AZ-level redundancy; if that AZ fails, all instances become unavailable, and the application cannot serve traffic.
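
Surviving an AZ failure also depends on how much capacity remains afterwards. The Python sketch below sizes each AZ so that losing any one still leaves the required number of instances. `instances_per_az` is a hypothetical helper, and it assumes instances are spread evenly across AZs.

```python
import math


def instances_per_az(required_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so that losing one AZ still leaves enough."""
    if az_count < 2:
        raise ValueError("at least two AZs are needed to survive an AZ failure")
    if required_instances < 0:
        raise ValueError("required_instances must be non-negative")
    return math.ceil(required_instances / (az_count - 1))


print(instances_per_az(6, az_count=3))  # 3
print(instances_per_az(6, az_count=2))  # 6
```

With three AZs the fleet runs 9 instances to guarantee 6 after a failure, while two AZs need 12, so spreading across more AZs lowers the overhead of AZ-level redundancy.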

115

MCQeasy

A DevOps engineer is designing a resilient architecture for a serverless application using AWS Lambda, Amazon API Gateway, and Amazon DynamoDB. The application experiences occasional spikes in traffic that cause Lambda function throttling and increased error rates. What is the MOST effective way to improve resilience and reduce throttling?

A.Increase the Lambda function memory to the maximum allowed.

B.Enable DynamoDB auto scaling for the table to handle traffic spikes.

C.Set API Gateway throttling limits to match the expected peak traffic.

D.Reserve concurrency for the Lambda function to ensure it always has available capacity.

AnswerB

Auto scaling adjusts capacity to prevent throttling.

Why this answer

DynamoDB auto scaling adjusts the provisioned throughput (read/write capacity units) in response to traffic spikes, preventing throttling at the database layer that would cause Lambda retries and errors. This directly addresses the root cause of increased error rates when the table's capacity is exceeded, which is a common bottleneck in serverless architectures.

Exam trap

The trap here is that candidates often focus on Lambda concurrency or API Gateway throttling as the first line of defense, but the scenario's 'increased error rates' and 'spikes' point to a downstream dependency (DynamoDB) being overwhelmed, not the upstream request path.

How to eliminate wrong answers

Option A is wrong because increasing Lambda memory also increases CPU and network allocation, but it does not resolve throttling caused by DynamoDB capacity limits or Lambda concurrency limits; it only improves execution speed for compute-bound functions. Option C is wrong because setting API Gateway throttling limits to match expected peak traffic would cap requests at that level, rejecting legitimate traffic during spikes rather than improving resilience. Option D is wrong because reserving concurrency for the Lambda function guarantees a fixed number of concurrent executions, but if the DynamoDB table lacks sufficient capacity, those executions will still fail due to database throttling, and reserved concurrency can also waste capacity during low traffic.

116

MCQeasy

A DevOps team is designing a disaster recovery plan for an RDS MySQL database. The database must be recoverable with minimal data loss in case of a regional failure. Which solution provides the LOWEST Recovery Point Objective (RPO)?

A.Configure a Cross-Region Read Replica.

B.Take daily automated snapshots and copy them to another Region.

C.Use a Multi-AZ deployment with synchronous standby.

D.Use RDS Proxy to cache database writes.

AnswerA

Replication is continuous, RPO in seconds.

Why this answer

Cross-Region Read Replicas provide asynchronous replication with a typical RPO of seconds, offering minimal data loss.

117

MCQmedium

A company runs a high-traffic e-commerce application on EC2 instances in an Auto Scaling group behind an ALB. The application uses an in-memory cache on the EC2 instances. During a recent deployment, the Auto Scaling group terminated an instance that had active user sessions, causing users to lose their cart data and leading to a poor customer experience. The company wants to prevent this in future deployments. They need a solution that allows existing sessions to complete before instance termination, without manual intervention. Which solution should they use?

A.Increase the Auto Scaling group's cooldown period and health check grace period.

B.Enable connection draining on the ALB target group and increase the deregistration delay.

C.Implement an Auto Scaling lifecycle hook that puts the instance in a 'Terminating:Wait' state, and have a script on the instance that signals completion after draining sessions.

D.Change the health check type to ELB and mark instances unhealthy before deployment.

AnswerC

Lifecycle hooks enable custom actions before termination, allowing session draining.

Why this answer

Option C is correct because lifecycle hooks allow the Auto Scaling group to wait for a specified timeout before terminating an instance, giving the application time to drain sessions. Option A is incorrect because increasing the cooldown period and health check grace period does not delay termination. Option B is incorrect because connection draining on the ALB only handles in-flight HTTP connections, not application-level session state.

Option D is incorrect because updating the health check type does not prevent immediate termination.
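
The drain-then-signal step of the lifecycle hook can be sketched as follows. This is a minimal illustration, not a definitive implementation: the `active_sessions` callable, the hook name, and the timeout values are assumptions. The returned dictionary uses the parameter names of the Auto Scaling CompleteLifecycleAction API; a real script would pass it to an AWS SDK client, which is left out here so the sketch runs without credentials.

```python
import time
from typing import Callable, Dict


def drain_then_build_completion(
    active_sessions: Callable[[], int],  # hypothetical: returns sessions still open
    instance_id: str,
    group_name: str,
    hook_name: str,                      # hypothetical hook name chosen by the team
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
) -> Dict[str, str]:
    """Wait until sessions drain (or the timeout passes), then build the request."""
    deadline = time.monotonic() + timeout_s
    while active_sessions() > 0 and time.monotonic() < deadline:
        time.sleep(poll_s)
    # CONTINUE tells the Auto Scaling group that termination may proceed.
    return {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "InstanceId": instance_id,
        "LifecycleActionResult": "CONTINUE",
    }
```

The local timeout should stay below the hook's own heartbeat timeout, since the group proceeds on its own once that expires.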

118

Multi-Selecteasy

A company wants to protect its application from DDoS attacks. Which THREE AWS services should they use?

Select 3 answers

A.Amazon Inspector

B.AWS WAF

C.AWS Shield Advanced

D.Amazon CloudFront

E.Amazon GuardDuty

AnswersB, C, D

WAF filters malicious web traffic.

Why this answer

AWS Shield Advanced, WAF, and CloudFront provide layered DDoS protection.

119

MCQhard

A company runs a critical microservices architecture on Amazon ECS with Fargate. They want to ensure that if a task fails, it is automatically restarted, and the service remains available across multiple Availability Zones. How should they configure the ECS service?

A.Place all tasks in the same Availability Zone to reduce latency

B.Run a standalone Fargate task and use a CloudWatch alarm to restart it

C.Use an EC2 launch type with a single instance to reduce complexity

D.Define an ECS service with a task definition, set desired count across multiple Availability Zones, and use Service Auto Scaling

AnswerD

This ensures tasks are distributed and automatically replaced.

Why this answer

Option D is correct because an ECS service maintains the desired count: the service scheduler automatically replaces failed tasks, and spreading tasks across multiple AZs with Service Auto Scaling keeps the service available. Option A is wrong because placing all tasks in one AZ does not provide AZ resilience.

Option B is wrong because a standalone task does not have automatic restart. Option C is wrong because a single EC2 instance is a single point of failure.

120

MCQhard

A company runs a stateful web application on EC2 instances behind an ALB. The application uses sticky sessions (session affinity) to maintain user sessions. During a deployment, the company wants to update the application with zero downtime and ensure that in-flight sessions are not lost. Which deployment strategy should they use?

A.Perform a rolling update of the Auto Scaling group with a health check grace period.

B.Use an immutable deployment by launching a new Auto Scaling group and then updating the ALB target group to point to the new group.

C.Use a blue/green deployment: launch a new Auto Scaling group, register it with a new target group, and gradually shift traffic using weighted target groups on the ALB.

D.Use a canary deployment with AWS Lambda to gradually route a percentage of requests to the new version.

AnswerC

Gradual shift preserves sessions on old environment until they complete.

Why this answer

Option C is correct because a blue/green deployment with a new target group and a gradual shift of traffic using the ALB's weighted target groups allows existing sessions to complete on the old environment while new sessions go to the new one. Option A is wrong because rolling update with a fixed number of instances may cause session loss. Option B is wrong because immutable deployment without traffic shifting drops sessions.

Option D is wrong because canary deployment with Lambda is not applicable to EC2.
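
A gradual shift between two target groups can be sketched as a sequence of weighted forward actions. This is only an illustration: the ARNs are placeholders and the 25-point step is an arbitrary assumption. The dictionaries follow the shape of an ALB listener's forward action (`ForwardConfig` with weighted `TargetGroups`); no AWS call is made.

```python
from typing import Dict, List


def weighted_forward_action(blue_arn: str, green_arn: str, green_weight: int) -> Dict:
    """Build one ALB 'forward' action that splits traffic between two target groups."""
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": blue_arn, "Weight": 100 - green_weight},
                {"TargetGroupArn": green_arn, "Weight": green_weight},
            ]
        },
    }


def shift_plan(blue_arn: str, green_arn: str, step: int = 25) -> List[Dict]:
    """Return the listener actions for a gradual shift from blue to green."""
    weights = list(range(step, 100, step)) + [100]  # always finish at 100% green
    return [weighted_forward_action(blue_arn, green_arn, w) for w in weights]
```

Applying each action in turn, with a pause and a health review between steps, keeps the blue environment serving its existing sessions until the final step.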

121

Multi-Selectmedium

A company uses AWS CloudFormation to deploy infrastructure. The DevOps team wants to ensure that if a stack update fails, the stack automatically rolls back to the previous known good state. The team also wants to receive notifications of the rollback. Which combination of steps should the team take? (Choose THREE.)

Select 3 answers

A.Set the 'Rollback on failure' option to 'Yes' in the CloudFormation stack properties.

B.Enable termination protection on the CloudFormation stack.

C.Configure an SNS topic as a notification target in the CloudFormation stack.

D.Create a CloudWatch Logs log group and subscribe to stack events.

E.Enable drift detection on the CloudFormation stack.

AnswersA, B, C

This ensures automatic rollback to the last known good state on update failure.

Why this answer

Option A is correct because CloudFormation stack updates can be configured with a rollback on failure. Option B is correct because enabling termination protection prevents accidental deletion of the stack. Option C is correct because CloudFormation can send events to Amazon SNS, which can then trigger notifications (e.g., email).

Option D (CloudWatch Logs) is for logging, not rollback. Option E (drift detection) is for detecting manual changes, not rollback.

122

MCQhard

A company uses AWS CloudFormation to deploy a multi-tier application. During an update, the stack fails and rolls back. The rollback also fails, leaving the stack in UPDATE_ROLLBACK_FAILED state. The operations team needs to resolve this with minimal disruption. What is the MOST efficient approach?

A.Use the 'ContinueUpdateRollback' API or AWS Management Console to retry the rollback.

B.Manually modify the resources to match the previous stack state.

C.Delete the stack and recreate it from the original template.

D.Execute a change set to update the stack to the desired configuration.

AnswerA

This action can complete the rollback and recover the stack.

Why this answer

Option A is correct because continuing the rollback can resolve the failure and bring the stack to a consistent state. Option B is wrong because manual modification is error-prone and not recommended. Option C is wrong because deleting the stack would remove all resources.

Option D is wrong because executing a change set on a failed stack is not possible.

123

MCQhard

A company deploys the above CloudFormation stack. They want to enforce HTTPS for all requests to the S3 bucket. After deployment, users are still able to make HTTP requests. What is the problem?

A.The condition key 'aws:SecureTransport' is misspelled; it should be 'aws:SecureTransport' with a capital 'T'

B.The bucket is not versioned, so the policy does not apply to object versions

C.The policy uses Deny, but an Allow policy from another statement overrides it

D.The Deny statement's Resource specifies only the objects, not the bucket itself

AnswerD

The Resource does not include the bucket ARN, so bucket-level operations like ListBucket are not denied.

Why this answer

The Deny policy only applies to objects in the bucket (Resource: arn:aws:s3:::bucket/*), but not to the bucket itself. Bucket-level actions such as listing objects (s3:ListBucket) are denied only if the Resource also includes the bucket ARN. Option A is wrong because the condition key is spelled correctly.

Option B is wrong because versioning has no bearing on whether the policy enforces HTTPS. Option C is wrong because an explicit Deny overrides any Allow.
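
A policy that covers both the bucket and its objects might look like the sketch below. The bucket name and the `Sid` are placeholders; the referenced CloudFormation template is not shown in this question, so this is an independent illustration rather than a corrected copy of it.

```python
import json
from typing import Dict


def https_only_policy(bucket_name: str) -> Dict:
    """Build a bucket policy that denies any request not sent over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # placeholder statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Both ARNs are listed: the bucket itself (e.g. s3:ListBucket)
                # and every object in it (e.g. s3:GetObject).
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(https_only_policy("example-bucket"), indent=2))
```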

124

MCQmedium

A DevOps engineer created this IAM policy for a CI/CD pipeline role. The pipeline needs to stop and start production EC2 instances and manage Auto Scaling groups. However, the pipeline fails when trying to stop an instance. What is the most likely reason?

A.The policy does not allow ec2:DescribeInstances for all instances.

B.The policy does not allow autoscaling:UpdateAutoScalingGroup for instances.

C.The instance does not have the tag 'Environment' with value 'production'.

D.The ec2:StopInstances action requires an additional ec2:DescribeInstanceStatus permission.

AnswerC

The condition restricts StartInstances and StopInstances to instances with that tag. Without the tag, the action is denied.

Why this answer

Option C is correct because the Condition in the first statement restricts ec2:StartInstances and ec2:StopInstances to instances tagged 'Environment':'production'. If the instance does not have that tag, the statement does not apply and the action is denied. The second statement allows Auto Scaling actions without conditions.

Option A is wrong because ec2:DescribeInstances is allowed. Option B is wrong because the Auto Scaling actions are allowed. Option D is wrong because the stop permission exists but is conditional; no additional ec2:DescribeInstanceStatus permission is needed.

125

MCQmedium

A DevOps team uses AWS CodePipeline to deploy a web application. The pipeline has a deploy stage that uses CodeDeploy to deploy to an Auto Scaling group. During deployment, the new instances fail health checks and the deployment rolls back. However, the rollback also fails because the old instances have been terminated. What should the team do to avoid this issue?

A.Increase the health check grace period in the Auto Scaling group.

B.Add a manual approval step before the deploy stage.

C.Configure the pipeline to deploy to a new Auto Scaling group each time.

D.Use a blue/green deployment strategy in CodeDeploy to keep the old instances running until the new ones pass health checks.

AnswerD

Blue/green deployment preserves the old environment for rollback.

Why this answer

Option D is correct because using a blue/green deployment with CodeDeploy allows the old instances to remain until the new ones are verified healthy. Option A is wrong because increasing the health check grace period does not help if instances are unhealthy. Option B is wrong because manual approval does not prevent the rollback failure.

Option C is wrong because creating a new Auto Scaling group does not inherently solve the termination issue.

126

MCQmedium

A company runs a critical database on Amazon RDS for PostgreSQL with Multi-AZ deployment. The application experiences a brief outage during automatic failover. To improve availability, the company wants to reduce the failover time. What should they do?

A.Create a cross-Region Read Replica and promote it during failure

B.Enable Multi-AZ DB cluster with synchronous replication and a standby in a different AZ

C.Increase the DB instance class size to improve I/O performance

D.Remove Multi-AZ and use a single instance with increased backup frequency

AnswerB

Multi-AZ DB cluster provides faster failover with reader endpoint.

Why this answer

Option B is correct because a Multi-AZ DB cluster, which keeps standbys in other AZs that are reachable through a reader endpoint, provides faster failover than a Multi-AZ DB instance deployment. Option A is wrong because Read Replicas are for read scaling, not automatic failover. Option C is wrong because increasing instance size does not reduce failover time.

Option D is wrong because removing Multi-AZ increases downtime.

127

MCQeasy

A DevOps engineer is designing a disaster recovery plan for a critical database. The RTO is 15 minutes and RPO is 1 minute. Which solution meets these requirements?

A.Use Amazon RDS Multi-AZ with automatic failover.

B.Use Amazon RDS with cross-Region read replicas and promote one during failover.

C.Schedule automated snapshots every 1 minute and restore in the same AZ.

D.Deploy a standby EC2 instance with a self-managed replication script.

AnswerA

Synchronous replication meets RPO; automatic failover meets RTO.

Why this answer

Option A is correct because Amazon RDS Multi-AZ provides synchronous replication with automatic failover, meeting RPO of seconds and RTO of a few minutes. Option B is wrong because cross-Region replication has higher RPO due to asynchronous replication. Option C is wrong because snapshot-based recovery has high RPO and RTO.

Option D is wrong because RDS Multi-AZ is cheaper and simpler than an always-on EC2 replica.

128

MCQmedium

A company runs a global web application on EC2 instances behind an ALB in us-east-1. They want to improve resilience by routing users to the nearest healthy region. Which service should they use?

A.AWS Global Accelerator

B.Application Load Balancer cross-zone load balancing

C.Amazon Route 53 latency-based routing with health checks

D.Amazon CloudFront with multiple origins

AnswerC

Routes to the region with lowest latency and healthy endpoints.

Why this answer

Amazon Route 53 with latency-based routing and health checks routes users to the region with the lowest latency and only to healthy endpoints. Option A is wrong because Global Accelerator improves performance but is not primarily for routing to the nearest region. Option B is wrong because ALB is regional, not global.

Option D is wrong because CloudFront with multi-region origins requires additional configuration.

129

MCQmedium

Refer to the exhibit. An Auto Scaling group is configured with an Application Load Balancer. The group has a desired capacity of 2 instances spread across two Availability Zones. Recently, the application has been experiencing high error rates during deployments. The team suspects that new instances are being marked as healthy before they are fully ready. What should the team do to resolve this issue?

A.Add a step scaling policy to scale out more gradually.

B.Increase the HealthCheckGracePeriod to 600 seconds.

C.Increase the MaxSize to 10.

D.Change the HealthCheckType to ELB.

AnswerD

ELB health checks can be configured to require a successful response from the application, ensuring readiness.

Why this answer

Option D is correct because changing the health check type to ELB ensures that the ALB health checks determine instance health, which can be configured to require the application to respond before marking the instance healthy. Option A is wrong because scaling policies do not affect health checks. Option B is wrong because increasing the grace period may delay health checks but does not ensure readiness.

Option C is wrong because increasing max size does not address the readiness issue.

130

MCQhard

A company has a serverless application using AWS Lambda functions that process messages from an Amazon SQS queue. The Lambda function sometimes fails due to transient errors. The company wants to ensure that failed messages are retried and eventually processed or sent to a dead-letter queue after 3 retries. What is the correct configuration?

A.Set the Lambda function's retry policy to Maximum retries: 3 and configure a DLQ on the Lambda function.

B.Set the Lambda function's DLQ to an SQS queue and configure the event source mapping to use that DLQ after 3 retries.

C.Configure the SQS queue's redrive policy with maxReceiveCount: 3 and a dead-letter queue.

D.Create an AWS Step Functions workflow that polls the SQS queue, processes messages, and retries failures up to 3 times before moving to a DLQ.

AnswerC

SQS redrive policy handles retries and DLQ for messages that fail processing.

Why this answer

Option C is correct because the SQS queue's redrive policy specifies the max receive count and dead-letter queue, and Lambda's event source mapping handles retries. Option A is wrong because Lambda's retry policy with max retries=3 is for asynchronous invocations, not SQS. Option B is wrong because a DLQ on the Lambda function is for asynchronous invocations, not SQS.

Option D is wrong because Step Functions adds unnecessary complexity.
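
The redrive policy is set as a queue attribute on the source queue, not on the Lambda function. A minimal sketch of that attribute, assuming a placeholder dead-letter queue ARN (no AWS call is made; the resulting dictionary is what would be supplied as the queue's attributes):

```python
import json
from typing import Dict


def redrive_attributes(dlq_arn: str, max_receive_count: int = 3) -> Dict[str, str]:
    """Build the RedrivePolicy queue attribute for the source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,          # where exhausted messages go
        "maxReceiveCount": max_receive_count,    # receives allowed before the move
    }
    # SQS expects the policy as a JSON string, not a nested object.
    return {"RedrivePolicy": json.dumps(policy)}
```

A message moves to the dead-letter queue once its receive count exceeds `maxReceiveCount`, so each failed Lambda invocation that leaves the message undeleted counts toward the limit.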

131

MCQhard

A company uses AWS CloudFormation to deploy infrastructure. The stack creation fails with the error: 'Resource handler returned message: 'The security group does not exist in VPC'.' The template references a security group by name. What is the MOST likely cause?

A.The security group name is misspelled or uses incorrect case

B.The IAM role used for CloudFormation does not have permissions to describe security groups

C.The stack is being created in a Region where the security group does not exist

D.The template uses a parameter that resolves to the default VPC security group

AnswerA, C

A misspelled name would cause the security group not to be found.

Why this answer

Option A is correct because CloudFormation resolves security group references by exact name match, including case sensitivity. When a security group is specified by name in a template, CloudFormation performs a case-sensitive lookup in the target VPC. If the name is misspelled or the case differs (e.g., 'MySecurityGroup' vs 'mysecuritygroup'), the resource handler fails with the error 'The security group does not exist in VPC' because the lookup returns no match.

Exam trap

The trap here is that candidates often assume the error is due to IAM permissions or a missing VPC, but the specific phrasing 'does not exist in VPC' points directly to a name mismatch or case sensitivity issue in the security group lookup.

How to eliminate wrong answers

Option B is wrong because the error message specifically indicates the security group does not exist in the VPC, not a permissions issue; an IAM permissions error would produce a different message such as 'AccessDenied' or 'Unauthorized operation'. Option D is wrong because referencing a parameter that resolves to the default VPC security group would not cause this error; the default security group exists in every VPC and would be found, so the error would not occur unless the VPC itself is missing or the parameter value is invalid.

132

MCQmedium

A company is designing a disaster recovery strategy for a critical application that requires a Recovery Time Objective (RTO) of 15 minutes and a Recovery Point Objective (RPO) of 1 hour. The application runs on EC2 with data stored in Amazon RDS Multi-AZ. Which approach meets these requirements?

A.Use a pilot light strategy with RDS cross-Region read replicas and automated backups

B.Use backup and restore with daily snapshots to another Region

C.Use a warm standby with a scaled-down production environment in another Region

D.Use a Multi-AZ deployment in the same Region for DR

AnswerA

Pilot light can achieve RTO 15 min, RPO 1 hour.

Why this answer

Option A is correct because a pilot light strategy with RDS cross-Region read replicas and automated backups meets the RTO of 15 minutes and RPO of 1 hour. The cross-Region read replica uses asynchronous replication, so the RPO equals the replica lag, which is typically seconds and well inside the 1-hour target, and automated backups enable point-in-time recovery within the 1-hour RPO. The pilot light approach allows rapid promotion of the replica to a primary instance, achieving the 15-minute RTO by keeping minimal core services running in the DR Region.

Exam trap

The trap here is that candidates often confuse Multi-AZ (high availability within a Region) with cross-Region disaster recovery, assuming Multi-AZ alone provides DR, but it does not protect against Region-wide outages.

How to eliminate wrong answers

Option B is wrong because daily snapshots to another Region result in an RPO of up to 24 hours, far exceeding the required 1-hour RPO, and restoring from snapshots takes longer than 15 minutes. Option C is wrong because a warm standby with a scaled-down production environment typically has an RTO of minutes but requires continuous replication and failover orchestration; while it could meet the RPO, it is over-engineered and more costly than necessary, and the question asks for an approach that meets requirements, not the most optimal. Option D is wrong because a Multi-AZ deployment in the same Region does not provide disaster recovery across Regions; it only protects against Availability Zone failures, not regional disasters, and thus fails to meet the DR requirement.

133

MCQmedium

A DevOps engineer ran the above command and saw this output. What is the MOST likely cause of the stack creation failure?

A.The key pair specified in the launch template does not exist.

B.The IAM role does not have permission to create the Auto Scaling group.

C.The AMI ID specified in the launch template is not available in this Region.

D.The launch template name specified in the CloudFormation template is incorrect or does not exist.

AnswerD

The error indicates the launch template parameter is not supported, likely due to a missing or incorrect name.

Why this answer

The error message indicates that the launch template name specified in the CloudFormation template does not match any existing launch template in the account and Region. CloudFormation resolves the launch template name at stack creation time; if the name is incorrect or the template does not exist, the Auto Scaling group creation fails with a validation error. This is the most direct cause because the launch template name is a required parameter that must reference a pre-existing resource.

Exam trap

The trap here is that candidates confuse launch template validation errors with instance-level errors (like missing AMI or key pair), but CloudFormation validates the launch template name at the Auto Scaling group resource level before any EC2 instances are launched.

How to eliminate wrong answers

Option A is wrong because a missing key pair would cause an EC2 instance launch failure, not a stack creation failure at the Auto Scaling group level; the error message would reference 'InvalidKeyPair.NotFound' or similar. Option B is wrong because an IAM role lacking permissions to create an Auto Scaling group would produce an 'AccessDenied' or authorization error, not a validation error about a missing launch template. Option C is wrong because an unavailable AMI ID would cause an instance launch failure with an 'InvalidAMIID.NotFound' error, not a stack creation failure related to the launch template name.

134

MCQeasy

A company is building a multi-tier web application on AWS. The web tier runs on EC2 instances behind an ALB. The application tier runs on EC2 instances that are not publicly accessible. The database tier runs on RDS MySQL. Which design provides the HIGHEST level of resilience for the database tier?

A.Deploy a single RDS DB instance in one Availability Zone.

B.Deploy an RDS DB instance with a cross-region read replica.

C.Deploy an RDS DB instance with a read replica in the same region.

D.Deploy an RDS DB instance in a Multi-AZ configuration.

AnswerD

Multi-AZ provides automatic failover to a standby in another AZ.

Why this answer

Option D is correct because a Multi-AZ RDS deployment provides automatic failover to a standby in another AZ. Option A is wrong because a single DB instance is not resilient. Option B is wrong because cross-region read replicas have higher latency and are not for failover.

Option C is wrong because read replicas do not provide automatic failover.

135

Multi-Selecthard

A company runs a containerized application on Amazon EKS. The application must be highly available across multiple Availability Zones and must automatically recover from node failures. Which THREE steps should be taken?

Select 3 answers

A.Use Pod Disruption Budgets to ensure a minimum number of pods are available during voluntary disruptions.

B.Configure the Cluster Autoscaler to add nodes when pods are unschedulable.

C.Deploy worker nodes across multiple Availability Zones.

D.Deploy worker nodes in a single Availability Zone to reduce cross-AZ data transfer costs.

E.Use a single large instance type for all worker nodes to simplify management.

AnswersA, B, C

Helps maintain availability during updates.

Why this answer

Deploying worker nodes across multiple AZs ensures node diversity. Pod Disruption Budgets maintain minimum pod availability during voluntary disruptions. The Cluster Autoscaler adds nodes when pods become unschedulable, for example after a node failure.

136

MCQhard

A company runs a critical application on Amazon ECS with the Fargate launch type. The application is deployed across three Availability Zones. Each service has its own Application Load Balancer. The company wants to implement a blue/green deployment strategy to reduce risk. They currently use AWS CodeDeploy for ECS deployments. During a recent deployment, the company noticed that the new version (green) was not receiving any traffic even after passing all health checks. The CodeDeploy configuration uses a 'Linear10PercentEvery3Minutes' traffic shifting configuration. What is the most likely reason that the green tasks are not receiving traffic?

A.The CodeDeploy deployment group is not associated with the correct ECS service.

B.The green target group's health check is misconfigured, causing CodeDeploy to consider the green tasks unhealthy and not route traffic.

C.The blue target group is still set as the production target group in the load balancer listener.

D.The green tasks are in a different VPC than the load balancer.

AnswerB

Health check failures prevent traffic routing even if tasks run.

Why this answer

The green target group's health check is misconfigured, causing CodeDeploy to consider the green tasks unhealthy. With a 'Linear10PercentEvery3Minutes' traffic shifting configuration, CodeDeploy gradually shifts traffic in 10% increments every 3 minutes, but only if the green target group passes health checks. If the health check fails, CodeDeploy stops traffic shifting, leaving the green tasks with zero traffic despite the tasks themselves being healthy.

Exam trap

The trap here is that candidates assume health checks passing on the ECS tasks means traffic will automatically route, but CodeDeploy relies on the target group's health check configuration, not the task's health status, to determine when to shift traffic.

How to eliminate wrong answers

Option A is wrong because if the CodeDeploy deployment group were not associated with the correct ECS service, the deployment would fail entirely or target the wrong service, but the green tasks would still be created and potentially receive traffic if health checks passed. Option C is wrong because CodeDeploy automatically updates the load balancer listener rules to point to the green target group during the traffic shifting process; the blue target group being set as production is the initial state, but CodeDeploy changes it as traffic shifts. Option D is wrong because ECS Fargate tasks and the load balancer must be in the same VPC for the service to function; if they were in different VPCs, the service would not register targets or pass health checks at all, not just fail to receive traffic after health checks pass.

137

Multi-Selecteasy

A company is designing a highly available architecture for a web application using AWS. Which TWO of the following design principles should be applied? (Select TWO.)

Select 2 answers

A.Run all resources in a single Availability Zone to reduce complexity

B.Store session data on EC2 instances to improve performance

C.Deploy resources across multiple Availability Zones

D.Use loosely coupled components, such as queues and asynchronous processing

E.Use tightly coupled components to reduce latency

AnswersC, D

Provides fault tolerance.

Why this answer

Correct answers: C and D. C spreads resources across AZs for fault tolerance. D decouples components to prevent cascading failures.

A is wrong because a single Availability Zone is not resilient. E is wrong because tightly coupled components reduce availability.

138

Multi-Selectmedium

A company is designing a resilient architecture for a critical application. Which TWO strategies improve resilience?

Select 2 answers

A.Deploy resources across multiple Availability Zones

B.Use a single large instance instead of multiple smaller ones

C.Use health checks to automatically replace unhealthy resources

D.Disable automated backups to reduce latency

E.Deploy resources in a single Availability Zone

AnswersA, C

Multi-AZ provides redundancy.

Why this answer

Multi-AZ deployments and health checks with auto-remediation improve resilience by handling failures automatically.

139

MCQhard

A DevOps engineer runs the above command and sees that one target is unhealthy with a 503 error. The application is a web server running on port 80. The health check is configured to hit the root path '/'. Which action should the engineer take to resolve the issue?

A.Change the health check port to 443 and use HTTPS

B.Verify that the application on the unhealthy instance is configured to respond to '/' with a 200 status code

C.Increase the health check interval and timeout settings

D.Check the security group rules for the target group to ensure port 80 is open

AnswerB

The health check expects a 200 response; a 503 means the app is not serving the root path correctly.

Why this answer

A 503 error indicates the web server is running but cannot handle the request, likely because the application is not responding correctly to the health check path. Option A is wrong because the application serves on port 80, which the health check already targets. Option C is wrong because a 503 is an error response, not a connection timeout.

Option D is wrong because a 503 means the server is reachable, so port 80 is already open.

140

Multi-Selecthard

A company runs a microservices architecture on Amazon ECS with Fargate. Services communicate via an internal Application Load Balancer (ALB). The operations team notices that occasional traffic spikes cause increased latency and timeouts. The team wants to improve resilience without over-provisioning. Which THREE steps should be taken? (Choose THREE.)

Select 3 answers

A.Increase the CPU and memory limits in the task definitions.

B.Enable ECS Service Connect for inter-service communication to manage traffic distribution.

C.Configure ECS service auto scaling with a target tracking policy based on ALB request count per target.

D.Implement a graceful shutdown handler in the application to handle SIGTERM.

E.Use EC2 launch type with Spot Instances to reduce cost.

AnswersB, C, D

Service Connect provides resilient service mesh capabilities.

Why this answer

Options B, C, and D are correct. B: ECS Service Connect provides service discovery and connection resilience. C: Auto scaling with a target tracking policy ensures capacity matches demand.

D: Graceful shutdown allows in-flight requests to complete before a task stops. Option A is wrong because increasing CPU and memory does not solve latency caused by traffic spikes. Option E is wrong because Spot Instances are not suitable for latency-sensitive workloads.
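
The graceful shutdown handler (option D) can be sketched in Python as follows. This is an illustrative outline, not a reference implementation: `handle_sigterm`, `install_handler`, `serve`, and the `process_one` callback are hypothetical names, and the loop stands in for whatever request handling the service actually does.

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop checks it between units of work.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only record the request; the loop finishes in-flight work and exits.
    shutdown_requested.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_one, poll_seconds=0.1):
    """Call process_one() until shutdown is requested.

    process_one is a hypothetical callback that handles one unit of work
    and returns True if it did something, False if there was nothing to do.
    Returns the number of units handled.
    """
    handled = 0
    while not shutdown_requested.is_set():
        if process_one():
            handled += 1
        else:
            # Idle: wait briefly, but wake immediately on shutdown.
            shutdown_requested.wait(poll_seconds)
    return handled
```

The handler only sets a flag, so the unit of work already in progress completes before the loop exits, instead of being cut off when the task is stopped.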

141

MCQ · hard

An application running on Amazon ECS with Fargate experiences intermittent failures. The task definition includes a single container with a health check command. Despite the health check passing, the application occasionally returns HTTP 500 errors. The application logs are sent to CloudWatch Logs. What is the MOST likely root cause?

A.The health check command only checks the process status, not the application's ability to serve requests.

B.The application is missing environment variables that are required for certain requests.

C.The ECS service is configured with a target tracking scaling policy that reacts too slowly.

D.The container port and host port in the task definition do not match the ALB target group port.

Answer: A

A shallow health check can report healthy while the app is unable to serve requests.

Why this answer

Option A is correct because if the health check does not verify the application's ability to serve requests (e.g., it only checks the process), it can report healthy even when the application is failing. Option B is wrong because missing environment variables would cause consistent failures. Option C is wrong because ECS service auto scaling does not cause intermittent failures.

Option D is wrong because a container port mismatch would cause persistent failures.
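
To illustrate the difference between a shallow and a deeper health check, the sketch below reports healthy only when every dependency probe succeeds. The function name `deep_health_check` and the probe names are hypothetical; real probes would exercise the code paths the application needs to serve requests.

```python
from typing import Callable, Dict, Tuple


def deep_health_check(
    probes: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run each probe and return (HTTP status, per-probe result).

    A probe returns True when its dependency works. Exceptions are
    treated as failures so one broken probe cannot crash the check.
    """
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            results[name] = "ok" if probe() else "failed"
        except Exception as exc:
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical probes for illustration only.
status, details = deep_health_check(
    {"database": lambda: True, "cache": lambda: False}
)
print(status, details)  # 503 {'database': 'ok', 'cache': 'failed'}
```

A check that only confirms the process is running corresponds to an empty probe set here, which always returns 200.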

142

MCQ · medium

A company's application uses Amazon SQS to decouple microservices. During peak hours, the SQS queue backlog grows significantly, causing processing delays. The DevOps team wants to reduce latency without increasing costs unnecessarily. What should the team do?

A.Increase the visibility timeout to allow consumers more time to process messages.

B.Use an SQS queue with priority settings to process high-priority messages first.

C.Increase the SQS queue's throughput by requesting a quota increase.

D.Configure Auto Scaling for the consumer fleet based on the ApproximateNumberOfMessagesVisible metric.

Answer: D

Auto Scaling adds consumers as queue depth increases, reducing processing time.

Why this answer

Option D is correct because scaling the consumer fleet based on the ApproximateNumberOfMessagesVisible metric directly addresses the backlog by adding more processing capacity when the queue grows. This approach reduces latency dynamically without incurring unnecessary costs during off-peak hours, as it only scales up when needed. Auto Scaling with SQS metrics is a cost-effective, elastic solution for handling variable workloads.

Exam trap

The trap here is that candidates may confuse SQS's throughput capabilities with consumer-side scaling, assuming that increasing queue throughput (Option C) solves backlog, when in fact SQS already handles high throughput and the bottleneck is the consumer processing rate.

How to eliminate wrong answers

Option A is wrong because increasing the visibility timeout does not reduce the backlog; it only gives consumers more time to process a message, which can increase latency when a consumer fails, because the message stays hidden longer before it can be retried. Option B is wrong because SQS has no built-in priority feature; FIFO queues offer ordering but not priority-based message selection. Option C is wrong because standard queues already offer nearly unlimited throughput and FIFO queues support up to 3,000 messages per second with batching, so requesting a quota increase is unnecessary and does not address consumer-side processing capacity.
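
The scaling idea can be sketched as a small calculation: given the visible backlog, estimate how many consumers are needed to drain it within an acceptable delay. The function name, the throughput figure, and the fleet limits below are assumptions chosen for illustration, not values from the question.

```python
import math


def desired_consumers(
    visible_messages: int,
    msgs_per_consumer_per_sec: int,
    max_latency_sec: int,
    min_consumers: int = 1,
    max_consumers: int = 20,
) -> int:
    """Estimate the consumer count needed to drain the backlog in time.

    visible_messages corresponds to ApproximateNumberOfMessagesVisible.
    Each consumer can absorb (rate * acceptable delay) messages.
    """
    acceptable_backlog = msgs_per_consumer_per_sec * max_latency_sec
    if acceptable_backlog <= 0:
        raise ValueError("rate and latency must be positive")
    needed = math.ceil(visible_messages / acceptable_backlog)
    # Clamp to the fleet limits so the estimate never scales to zero
    # or beyond the configured maximum.
    return max(min_consumers, min(max_consumers, needed))


# 1,500 queued messages, 10 msg/s per consumer, 10 s acceptable delay.
print(desired_consumers(1500, 10, 10))  # 15
```

Because the result follows the queue depth, capacity grows during peaks and falls back to the minimum when the queue is empty.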

143

MCQ · medium

An application on EC2 instances in an Auto Scaling group uses an ALB. The ALB health checks are failing for some instances, but the instances are healthy from the OS perspective. What is the most likely cause?

A.The ALB idle timeout is too low

B.The security group for the instances does not allow traffic from the ALB

C.The Auto Scaling group cooldown period is too short

D.The ALB cross-zone load balancing is disabled

Answer: B

If the security group blocks health check traffic, the ALB marks instances unhealthy.

Why this answer

Misconfigured security group rules can block health check traffic, causing the ALB to mark instances as unhealthy.

144

MCQ · medium

A media company runs a video processing pipeline on AWS. Raw videos are uploaded to an S3 bucket, which triggers a Lambda function to start an AWS Batch job for transcoding. The Batch job reads the source video from S3, processes it, and writes the output to another S3 bucket. Recently, the company has seen an increase in processing failures. Investigation shows that the Batch jobs are being terminated with a 'TIMEOUT' status after running for exactly 30 minutes. The video files are large, and some jobs legitimately take up to 45 minutes. The Batch job definition has a 'timeout' setting configured. Which action should be taken to resolve this issue?

A.Modify the Batch job definition to increase the 'timeout' value to 3600 seconds (60 minutes).

B.Increase the S3 bucket lifecycle policy to retain videos longer.

C.Increase the Lambda function timeout to 60 minutes.

D.Change the Batch job queue to a different compute environment.

Answer: A

The timeout in the job definition controls how long Batch allows a job to run.

Why this answer

The timeout configured in the job definition is causing jobs that exceed 30 minutes to be terminated. Increasing the timeout to 60 minutes allows longer-running jobs to complete.
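
The arithmetic behind this answer can be checked with a short sketch. The helper names and the 15-minute headroom are assumptions for illustration; the `attemptDurationSeconds` key mirrors the field used in the `timeout` setting of a Batch job definition.

```python
def would_time_out(run_minutes: int, timeout_seconds: int) -> bool:
    """True if a job running for run_minutes exceeds the timeout."""
    return run_minutes * 60 > timeout_seconds


def batch_timeout(longest_run_minutes: int, headroom_minutes: int = 15) -> dict:
    """Build a timeout setting with headroom above the longest expected run."""
    seconds = (longest_run_minutes + headroom_minutes) * 60
    # Never go below one minute in this sketch.
    return {"attemptDurationSeconds": max(seconds, 60)}


print(would_time_out(45, 1800))  # True: a 45-minute job hits a 30-minute limit
print(would_time_out(45, 3600))  # False: it fits within 60 minutes
print(batch_timeout(45))         # {'attemptDurationSeconds': 3600}
```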

145

MCQ · hard

A company runs a critical web application on EC2 instances in an Auto Scaling group. The application uses an Application Load Balancer (ALB) with health checks pointing to /health. Recently, the application experienced intermittent failures where the ALB would mark instances as unhealthy and route traffic away, causing a reduction in capacity. The development team noticed that the /health endpoint occasionally returns HTTP 503 when the application is under heavy load, but the application can recover quickly. The team wants to avoid unnecessary instance replacements while ensuring availability. Which solution should the DevOps engineer implement?

A.Implement a custom health check using Lambda that ignores 503 responses

B.Decrease the unhealthy threshold to mark instances unhealthy faster

C.Increase the health check interval and increase the unhealthy threshold

D.Decrease the health check interval and decrease the healthy threshold

Answer: C

A longer interval and a higher unhealthy threshold make the health check less sensitive to transient errors.

Why this answer

Option C is correct because increasing the health check interval and unhealthy threshold reduces sensitivity to transient errors, avoiding unnecessary instance replacements. Option A is wrong because a custom Lambda health check is not necessary; the existing health check configuration can be tuned. Option B is wrong because reducing the unhealthy threshold makes it easier to mark instances unhealthy.

Option D is wrong because decreasing the interval makes the health check more sensitive, worsening the issue.
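
The effect of the two settings can be seen in a simplified model: a target is marked unhealthy only after a run of consecutive failed checks equal to the unhealthy threshold. The function names and sample values are assumptions for illustration, and the model ignores the healthy threshold used to bring a target back.

```python
from typing import List


def seconds_until_unhealthy(interval_sec: int, unhealthy_threshold: int) -> int:
    """Approximate time a target must keep failing before it is marked unhealthy."""
    return interval_sec * unhealthy_threshold


def is_marked_unhealthy(results: List[bool], unhealthy_threshold: int) -> bool:
    """results holds one entry per check (True = passed).

    Returns True once the number of consecutive failures reaches the threshold.
    """
    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak >= unhealthy_threshold:
            return True
    return False


# A short burst of 503s under load: two failed checks, then recovery.
blip = [True, False, False, True]
print(is_marked_unhealthy(blip, 2))     # True: a low threshold reacts to the blip
print(is_marked_unhealthy(blip, 5))     # False: a higher threshold rides it out
print(seconds_until_unhealthy(30, 5))   # 150
```

The trade-off is slower detection of real failures, since a genuinely broken target stays in rotation for the full interval multiplied by the threshold.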

146

MCQ · hard

A company runs a microservices application on Amazon EKS. The application's frontend service needs to communicate with the backend service. The DevOps team wants to implement service-to-service authentication using AWS IAM. Which method should the team use?

A.Configure the backend service as an Amazon RDS database with IAM database authentication.

B.Use AWS App Mesh with mTLS for authentication between services.

C.Create an IAM user with access keys and store them as Kubernetes secrets.

D.Use IAM roles for service accounts (IRSA) to associate an IAM role with each service's Kubernetes service account.

Answer: D

IRSA provides fine-grained IAM permissions to pods.

Why this answer

IAM roles for service accounts (IRSA) allows each Kubernetes service account to assume an IAM role with fine-grained permissions, enabling secure service-to-service authentication without managing long-lived credentials. The frontend service can use its associated IAM role to sign AWS API requests (e.g., STS AssumeRole) to authenticate to the backend service, which validates the role via IAM policies. This approach integrates natively with EKS and follows AWS best practices for workload identity.

Exam trap

The trap here is that candidates may confuse mTLS (which provides encryption and certificate-based authentication) with IAM-based authentication, or assume that static IAM users with secrets are acceptable in Kubernetes, when IRSA is the recommended AWS-native approach for pod-level IAM integration.

How to eliminate wrong answers

Option A is wrong because Amazon RDS IAM database authentication is designed for database access, not for service-to-service authentication between microservices on EKS; it does not provide a mechanism for frontend-to-backend communication. Option B is wrong because AWS App Mesh with mTLS provides transport-layer encryption and mutual TLS authentication, but it does not use AWS IAM for authentication; it relies on X.509 certificates, not IAM roles or policies. Option C is wrong because creating an IAM user with access keys and storing them as Kubernetes secrets introduces long-lived static credentials, which violates security best practices (e.g., no automatic rotation, risk of exposure) and does not leverage IAM roles for dynamic, scoped access.
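
As a rough illustration of how IRSA ties a role to a workload, the sketch below builds a Kubernetes ServiceAccount manifest as a Python dict carrying the `eks.amazonaws.com/role-arn` annotation. The helper name, namespace, account ID, and role name are hypothetical placeholders, and the sketch does not cover the IAM trust policy or OIDC provider setup that IRSA also needs.

```python
def service_account_manifest(name: str, namespace: str, role_arn: str) -> dict:
    """Return a ServiceAccount manifest annotated with an IAM role ARN."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Pods using this service account receive credentials for the role.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


# Placeholder values for illustration only.
frontend_sa = service_account_manifest(
    "frontend",
    "shop",
    "arn:aws:iam::111122223333:role/frontend-role",
)
print(frontend_sa["metadata"]["annotations"])
```

Giving each service its own service account and role keeps permissions scoped per workload, with no access keys stored in Kubernetes secrets.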

147

Multi-Select · hard

Which THREE components are required to implement a global application that can withstand the failure of an entire AWS Region? (Select THREE.)

Select 3 answers

A.An Application Load Balancer in the primary Region.

B.Amazon CloudFront with multiple origins and origin failover.

C.Amazon DynamoDB Global Tables.

D.Amazon RDS with a single-AZ deployment.

E.Amazon Route 53 with health checks and failover routing policy.

Answers: B, C, E

CloudFront provides edge caching and origin failover.

Why this answer

Options B, C, and E are correct. Route 53 with health checks and failover routing provides DNS failover. DynamoDB Global Tables provide multi-Region write capability.

CloudFront provides edge caching and origin failover. Option A is wrong because a single ALB is regional. Option D is wrong because RDS Single-AZ is not resilient.

148

Multi-Select · easy

Which TWO actions can help ensure that an application running on EC2 instances can survive the loss of an entire Availability Zone?

Select 2 answers

A.Deploy all instances in a single Availability Zone for consistency

B.Use an Auto Scaling group with multiple Availability Zones

C.Deploy EC2 instances in at least two Availability Zones

D.Use a larger instance type to handle more load

E.Use CloudWatch alarms to monitor instance health

Answers: B, C

Auto Scaling distributes instances across AZs and replaces failed ones.

Why this answer

Deploying instances in multiple AZs ensures that if one AZ fails, instances in other AZs continue to run. Using an Auto Scaling group with multiple AZs automatically distributes instances and replaces failed ones. Option A is wrong because a single AZ is vulnerable.

Option D is wrong because instance type does not affect AZ resilience. Option E is wrong because CloudWatch alarms do not distribute instances.

149

MCQ · medium

A company runs a stateful application on EC2 instances in an Auto Scaling group. The application stores state on local instance storage. During a scaling event, users lose session data. How can the company make the application resilient without modifying the application code?

A.Reduce the Auto Scaling group cooldown period.

B.Enable sticky sessions on the Application Load Balancer.

C.Increase the instance size to reduce scaling events.

D.Use Elastic Block Store (EBS) volumes instead of instance store.

Answer: B

Sticky sessions route users to the same instance, preserving local state.

Why this answer

Option B is correct because enabling sticky sessions (session affinity) on the ALB ensures users are routed to the same instance, preserving local state. Option A is wrong because reducing the cooldown period does not preserve state. Option C is wrong because larger instances do not prevent data loss on termination.

Option D is wrong because EBS volumes persist but require reattachment.

150

Multi-Select · easy

A company is implementing a disaster recovery plan for its on-premises database using AWS. The plan must have a Recovery Time Objective (RTO) of 2 hours and a Recovery Point Objective (RPO) of 15 minutes. Which TWO AWS services should the company use? (Choose TWO.)

Select 2 answers

A.AWS Snowball Edge

B.Amazon S3 with versioning

C.AWS Backup with cross-Region backup copy

D.AWS Database Migration Service (DMS) with ongoing replication

E.AWS Storage Gateway with cached volumes

Answers: C, D

AWS Backup can automate and restore backups quickly, meeting RTO.

Why this answer

Option D is correct because AWS Database Migration Service (DMS) can replicate changes continuously, meeting the RPO of 15 minutes. Option C is correct because AWS Backup can automate backups and support cross-Region recovery, meeting the RTO of 2 hours. Option A is wrong because Snowball is for large data transfer, not real-time replication.

Option B is wrong because S3 alone does not provide database replication. Option E is wrong because Storage Gateway is for file/volume storage, not database replication.
