CCNA Incident Response Questions

75 of 254 questions · Page 2/4 · Incident Response topic · Answers revealed

76
MCQ · hard

An organization uses AWS Config to track resource changes. They notice that a particular S3 bucket policy was deleted, but the Config rule 's3-bucket-policy-grantee-check' did not trigger a remediation. What is the most likely reason?

A. The Config rule does not support S3 bucket policies.
B. The rule is set to evaluate resources only on a periodic basis.
C. The S3 bucket policy change is not supported by AWS Config.
D. The bucket was deleted and recreated, and the configuration recorder did not capture the deletion event.
Answer: D

If the bucket was recreated, the deletion might not trigger an evaluation if the bucket was not tracked.

Why this answer

Option D is correct because AWS Config evaluates change-triggered rules only when a configuration change is recorded. If the bucket was deleted and recreated without a policy, the configuration recorder may never have recorded the policy deletion as a change, so the rule was not evaluated and no remediation ran. Option A is wrong because Config supports bucket policies.

Option B is wrong because evaluation frequency does not affect triggered evaluations. Option C is wrong because S3 bucket policy changes are supported.

77
MCQ · medium

A company runs a production database on Amazon RDS for PostgreSQL. The DevOps team has set up a read replica to offload read traffic. Recently, the replica started experiencing replication lag that is increasing over time. The primary instance's CPU and memory utilization are normal. The network bandwidth between the primary and replica is not saturated. The team has already increased the replica's instance class, but the lag persists. The primary database is under heavy write load due to a batch job that runs hourly. What is the MOST likely cause of the increasing replication lag?

A. The primary database should be migrated to a Multi-AZ deployment instead of using a read replica.
B. The replica's 'max_standby_streaming_delay' parameter is set too low, causing the replica to cancel queries.
C. The batch job on the primary is using long-running transactions that are holding up WAL generation.
D. The replica's instance class is still too small for the write load from the primary.
Answer: C

Long transactions prevent WAL segments from being recycled and cause replication lag.

Why this answer

Option C is correct because long-running transactions on the primary can cause replication lag: the replica must wait for the transaction to complete before applying its changes. Option A is wrong because Multi-AZ provides failover for availability; it does not reduce lag on a read replica. Option B is wrong because a low 'max_standby_streaming_delay' makes the replica cancel conflicting queries sooner, which lets WAL replay proceed rather than fall behind.

Option D is wrong because the team already increased the replica's instance class and the lag persisted, so replica size is not the root cause.

78
MCQ · easy

A company's production environment uses an Amazon ElastiCache Redis cluster for session caching. The operations team reports that the cache hit ratio has dropped significantly, causing increased load on the backend database. What is the MOST likely cause?

A. The cache is under memory pressure and evicting keys to make room.
B. The cluster was resized from a single node to a cluster mode.
C. The encryption in transit was enabled, adding latency.
D. There is a network partition between the application and the cache.
Answer: A

When memory is full, eviction removes keys, reducing hit ratio.

Why this answer

Option A is correct because a dropped cache hit ratio often indicates that cached keys are being evicted due to memory pressure, especially if the eviction policy is set to 'allkeys-lru' or similar. Option B is wrong because a cluster mode change would cause a temporary disruption but not necessarily a sustained drop in hit ratio. Option C is wrong because encryption in transit does not change which keys are cached, so it does not lower the hit ratio.

Option D is wrong because a network partition would cause complete cache unavailability, not just a drop in hit ratio.

79
MCQ · medium

A company uses AWS CloudFormation to deploy infrastructure. During a production deployment, the stack update fails, and the stack enters the UPDATE_ROLLBACK_COMPLETE state. The DevOps engineer needs to investigate the failure. The engineer checks the CloudFormation console and sees a stack event with a status of UPDATE_FAILED and a reason of 'Internal failure'. The engineer wants to find more details. What is the BEST way to get detailed error information?

A. Check the CloudFormation template for syntax errors using cfn-lint.
B. Manually re-run the stack update in a test environment.
C. Review the stack events in the CloudFormation console for more details.
D. Use AWS CloudTrail to view the CreateStack or UpdateStack API calls and the associated error messages.
Answer: D

CloudTrail captures API errors.

Why this answer

Option D is correct because CloudFormation logs resource provider operations to CloudTrail, so the underlying API calls and their error messages are recorded there. Option A is wrong because the template is not necessarily the cause. Option B is wrong because the stack is already rolled back.

Option C is wrong because the console does not show internal details; the stack event already reports only 'Internal failure'.

80
MCQ · hard

A company uses AWS Organizations with multiple accounts. The security team needs to receive real-time notifications when any IAM user in any account creates an access key. Which solution is the most operationally efficient?

A. Enable CloudTrail in all accounts and send logs to a central S3 bucket, then use Amazon Athena to query.
B. Create an SCP that denies access key creation and monitor with CloudWatch.
C. Use AWS Trusted Advisor to check for exposed access keys.
D. Use AWS Config rules with a CloudWatch Events rule to detect CreateAccessKey and publish to an SNS topic.
Answer: D

Real-time detection via CloudWatch Events.

Why this answer

Option D is correct because an AWS Config rule with a CloudWatch Events trigger can detect CreateAccessKey API calls and send notifications via SNS. Option A is wrong because querying centralized CloudTrail logs with Athena is not real time, whereas CloudWatch Events can capture the API calls directly. Option B is wrong because SCPs cannot trigger notifications.

Option C is wrong because Trusted Advisor does not monitor IAM access key creation in real time.
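As an illustration of the CloudWatch Events (EventBridge) detection described above, the following is a minimal sketch of an event pattern for CreateAccessKey calls recorded by CloudTrail. The `matches` helper is a hypothetical, simplified local stand-in for the service's pattern matching, included only so the pattern can be tried against a sample event; it is not the real matching engine.

```python
# Event pattern for IAM CreateAccessKey calls delivered via CloudTrail.
CREATE_ACCESS_KEY_PATTERN = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": ["CreateAccessKey"],
    },
}


def matches(pattern, event):
    """Simplified matcher (assumption): every key in the pattern must exist
    in the event, and each leaf value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Hypothetical sample events, trimmed to the fields the pattern inspects.
sample_create = {
    "source": "aws.iam",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey"},
}
sample_delete = {
    "source": "aws.iam",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "iam.amazonaws.com", "eventName": "DeleteAccessKey"},
}

print(matches(CREATE_ACCESS_KEY_PATTERN, sample_create))
print(matches(CREATE_ACCESS_KEY_PATTERN, sample_delete))
```

A rule using a pattern like this would have an SNS topic as its target, so the security team is notified without polling logs.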

81
Multi-Select · medium

A company uses Amazon ECS with the Fargate launch type for a microservices application. The application experiences intermittent HTTP 5xx errors from the ALB. The DevOps team needs to diagnose the issue. Which TWO steps should be taken to gather diagnostic information? (Choose TWO.)

Select 2 answers
A. Enable Container Insights for the ECS cluster.
B. Enable AWS X-Ray tracing on the ECS service.
C. Enable detailed monitoring on the underlying EC2 instances.
D. Use AWS Systems Manager Session Manager to connect to the running containers.
E. Configure the ECS task definition to send application logs to CloudWatch Logs.
Answers: A, E

Container Insights provides metrics and logs for Fargate tasks.

Why this answer

Options A and E are correct. Option A: CloudWatch Container Insights provides metrics and logs for Fargate tasks, including resource utilization that can cause timeouts. Option E: application logs sent to CloudWatch Logs (through the awslogs driver) contain application output that can reveal errors.

Option B is wrong because X-Ray traces HTTP requests but may not capture the root cause if the error is at the application level. Option C is wrong because instance-level metrics are not available with Fargate (there are no EC2 instances to monitor). Option D is wrong because Fargate does not support Systems Manager Session Manager for tasks.

82
MCQ · hard

A company configures AWS CloudTrail to deliver logs to S3 bucket 'my-app-logs'. However, no log files appear. The DevOps engineer runs the above command and sees the bucket policy. What is the issue?

A. The bucket policy requires bucket-owner-full-control ACL, but CloudTrail does not support ACLs.
B. The bucket policy does not allow the s3:PutObject action.
C. The service principal in the bucket policy is incorrect; it should be 'cloudtrail.amazonaws.com'.
D. The bucket does not exist; the policy retrieval failed silently.
Answer: C

CloudTrail requires its own service principal.

Why this answer

Option C is correct because CloudTrail requires the S3 bucket policy to grant the service principal 'cloudtrail.amazonaws.com', not 'delivery.logs.amazonaws.com'. Option A is wrong because CloudTrail can deliver to buckets with ACL settings as long as the policy is correct. Option B is wrong because the policy allows PutObject.

Option D is wrong because the bucket exists and the policy is present.
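The exhibit is not reproduced here, so the following is only a sketch of how such a policy could be checked for the principal mismatch described above. The sample policy, the Sid, and the account ID 111122223333 are hypothetical placeholders, and `statements_missing_cloudtrail` is an illustrative helper rather than an AWS API.

```python
EXPECTED_PRINCIPAL = "cloudtrail.amazonaws.com"


def service_principals(statement):
    """Return the service principals of a statement as a list."""
    principal = statement.get("Principal", {})
    if not isinstance(principal, dict):
        return []
    services = principal.get("Service", [])
    return [services] if isinstance(services, str) else list(services)


def statements_missing_cloudtrail(policy):
    """Return the Sids of Allow statements that grant s3:PutObject to a
    service principal other than CloudTrail."""
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if stmt.get("Effect") == "Allow" and "s3:PutObject" in actions:
            services = service_principals(stmt)
            if services and EXPECTED_PRINCIPAL not in services:
                flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged


# Hypothetical policy with the wrong service principal.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AWSCloudTrailWrite",
            "Effect": "Allow",
            "Principal": {"Service": "delivery.logs.amazonaws.com"},
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::my-app-logs/AWSLogs/111122223333/*",
            "Condition": {
                "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
            },
        }
    ],
}

print(statements_missing_cloudtrail(sample_policy))
```

Changing the principal to 'cloudtrail.amazonaws.com' would leave the list empty under this check.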

83
MCQ · medium

A company uses AWS Organizations with multiple accounts. The security team wants to ensure that all accounts automatically forward their CloudWatch Logs to a central logging account. Which solution should the team implement?

A. Enable AWS Config aggregator in the central account
B. Use AWS Service Catalog to create a product for log forwarding
C. Use AWS CloudFormation StackSets to deploy a subscription filter and Lambda function in each account
D. Configure AWS Organizations to automatically forward logs
Answer: C

StackSets deploy resources across accounts automatically.

Why this answer

Option C is correct because a CloudFormation StackSet can deploy log forwarding resources across all accounts in the organization. Option A is wrong because Config aggregates configuration data, not logs. Option B is wrong because Service Catalog is for provisioning products; it does not deploy resources to every account automatically.

Option D is wrong because Organizations does not forward logs on its own.

84
MCQ · easy

A DevOps engineer is investigating an incident where an EC2 instance became unreachable. The engineer checks the AWS Management Console and finds the instance is running, but the status check shows '2/2 checks passed' and the system log shows no errors. What should the engineer do NEXT to diagnose the connectivity issue?

A. Review the CloudWatch metrics for CPU utilization and network throughput.
B. Reboot the instance to reset the network interface.
C. Stop and start the instance to move it to new underlying hardware.
D. Check the security group and network ACL rules to ensure inbound traffic is allowed.
Answer: D

Connectivity issues often stem from network permissions.

Why this answer

Since the instance is running, status checks pass, and the system log shows no errors, the issue is not with the operating system or underlying hardware. The most likely cause is a network-layer restriction, such as security group or network ACL rules blocking inbound traffic. Checking these rules is the correct next step because they control traffic at the instance and subnet levels, respectively, and misconfigurations here are a common cause of unreachability despite healthy instance status.

Exam trap

The trap here is that candidates assume a 'running' instance with passing status checks guarantees network reachability, overlooking that security groups and NACLs can silently drop traffic without any error in system logs or status checks.

How to eliminate wrong answers

Option A is wrong because CloudWatch metrics for CPU utilization and network throughput measure performance, not connectivity; they would not reveal whether inbound traffic is being blocked by security groups or NACLs. Option B is wrong because rebooting the instance resets the OS but does not change network configurations or underlying hardware; if the instance is running and status checks pass, a reboot is unlikely to resolve a network-level block. Option C is wrong because stopping and starting the instance moves it to new underlying hardware, which could help if the issue were hardware-related, but the status checks passing indicates the hardware is healthy; this action is more disruptive and unnecessary for a likely network configuration problem.

85
MCQ · medium

A DevOps engineer is troubleshooting an issue where an EC2 instance running a web application becomes unresponsive every few hours. CloudWatch logs show no application errors, but the instance's status checks are passing. The engineer suspects a memory leak. Which AWS service can be used to capture memory utilization metrics at a granular level to confirm the leak?

A. EC2 Status Checks
B. AWS Config
C. AWS CloudTrail
D. CloudWatch Agent
Answer: D

CloudWatch Agent can collect custom metrics like memory utilization from EC2 instances and send them to CloudWatch for analysis.

Why this answer

Option D is correct because CloudWatch Agent can collect memory metrics from EC2 instances and send them to CloudWatch. Option A is wrong because EC2 status checks only check system status, not memory. Option B is wrong because Config records resource configuration changes, not metrics.

Option C is wrong because CloudTrail logs API calls, not performance metrics.
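To make the memory collection concrete, here is a minimal sketch that builds a CloudWatch Agent configuration fragment enabling the `mem_used_percent` measurement. The 60-second interval is an assumed value for illustration, and a real configuration would usually contain more sections than this.

```python
import json

# Minimal agent configuration fragment: collect memory utilization only.
# The interval (in seconds) is an assumption chosen for illustration.
agent_config = {
    "metrics": {
        "metrics_collected": {
            "mem": {
                "measurement": ["mem_used_percent"],
                "metrics_collection_interval": 60,
            }
        }
    }
}

# Serialize to the JSON text the agent would read from its config file.
config_text = json.dumps(agent_config, indent=2)
print(config_text)
```

Charting the resulting metric over several hours would show whether memory use climbs steadily before each unresponsive period, which is the signature of a leak.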

86
MCQ · hard

Refer to the exhibit. An IAM policy is attached to a user. The user tries to upload an object to 'my-bucket' without specifying server-side encryption. What will happen?

A. The upload is denied because the Allow statement requires encryption.
B. The upload is denied because of the Deny statement.
C. The upload is allowed because the Allow statement grants permission.
D. The upload is allowed because no encryption header is specified.
Answer: B

The Deny statement explicitly denies requests without AES256 encryption.

Why this answer

Option B is correct. The Allow statement requires AES256 encryption, but the request does not specify encryption, so its condition is not met. The Deny statement catches any request that does not have encryption set to AES256 (since it uses StringNotEquals), including a request with no encryption header at all.

The explicit Deny overrides any Allow, so the request is denied. Option A is wrong because, although the Allow condition is also unmet, the explicit Deny statement is what decides the outcome. Option C is wrong because the Allow statement has a condition that is not met.

Option D is wrong because the request is not allowed.
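The exhibit is not reproduced here, so the following is only a minimal sketch of the evaluation logic described above. It assumes a policy with one Allow and one Deny statement keyed on `s3:x-amz-server-side-encryption`; the `POLICY` structure and the `evaluate` helper are hypothetical simplifications, not the real IAM evaluation engine.

```python
KEY = "s3:x-amz-server-side-encryption"

# Assumed shape of the policy in the missing exhibit (simplified).
POLICY = [
    {"Effect": "Allow", "Condition": {"StringEquals": {KEY: "AES256"}}},
    {"Effect": "Deny", "Condition": {"StringNotEquals": {KEY: "AES256"}}},
]


def condition_met(condition, context):
    """Evaluate the two string operators used in this sketch."""
    for operator, pairs in condition.items():
        for key, expected in pairs.items():
            actual = context.get(key)  # None when the header is absent
            if operator == "StringEquals":
                if actual != expected:
                    return False
            elif operator == "StringNotEquals":
                # A missing key also satisfies a negated operator.
                if actual == expected:
                    return False
            else:
                raise ValueError(f"unsupported operator: {operator}")
    return True


def evaluate(policy, context):
    """Explicit Deny wins, then Allow, otherwise implicit deny."""
    matched = [s["Effect"] for s in policy if condition_met(s["Condition"], context)]
    if "Deny" in matched:
        return "ExplicitDeny"
    if "Allow" in matched:
        return "Allow"
    return "ImplicitDeny"


print(evaluate(POLICY, {}))                # upload with no encryption header
print(evaluate(POLICY, {KEY: "AES256"}))   # upload with SSE-S3 requested
```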

87
MCQ · easy

A company uses CloudWatch Synthetics canaries to monitor a critical API endpoint. Recently, a canary started failing with a '403 Forbidden' error. The DevOps engineer verifies that the canary's IAM role has the necessary permissions to invoke the API and that the API endpoint is publicly accessible. What should the engineer check NEXT?

A. Review the canary's CloudWatch Logs for any runtime errors.
B. Increase the canary's memory to 512 MB to prevent timeout-related issues.
C. Check if the API requires an API key or other authentication that the canary is not providing.
D. Verify that the canary is attached to the correct VPC and subnet.
Answer: C

A 403 often indicates authorization failure; missing API key is a common cause.

Why this answer

Option C is correct because the API may have a usage plan or API key requirement, and the canary must include the appropriate key or authentication header. Option A is wrong because logging is for debugging, not the immediate cause. Option B is wrong because increasing memory won't fix a 403 error.

Option D is wrong because the canary's VPC configuration does not affect public endpoints.

88
Multi-Select · medium

A DevOps team is investigating a production incident where an Amazon RDS for MySQL database experienced a sudden spike in connections and CPU utilization. The team suspects a SQL injection attack. Which TWO actions should the team take to investigate and mitigate the incident?

Select 2 answers
A. Delete the error logs to free up storage space.
B. Enable automated backups and ensure point-in-time recovery is configured.
C. Enable RDS Enhanced Monitoring and audit logs to capture SQL queries.
D. Create a read replica to offload traffic from the primary instance.
E. Increase the DB instance size to handle the increased load.
Answers: B, C

Backups allow recovery to a pre-incident state if data is compromised.

Why this answer

Option B is correct because enabling automated backups and point-in-time recovery ensures that the database can be restored to a state before the suspected SQL injection attack, preserving data integrity and enabling forensic analysis. Option C is correct because RDS Enhanced Monitoring provides OS-level metrics (CPU, memory, disk I/O) to correlate with the spike, while audit logs capture actual SQL queries, which are essential for identifying malicious patterns and confirming the attack vector.

Exam trap

The trap here is that candidates confuse reactive scaling (Option E) or read replicas (Option D) with proper incident response, failing to recognize that investigation and mitigation require enabling logging and backup capabilities, not just increasing capacity.

89
Multi-Select · easy

A company runs a critical application on EC2 instances in an Auto Scaling group. The application must be highly available across multiple Availability Zones. Which TWO configurations are necessary to achieve this? (Choose TWO.)

Select 2 answers
A. Use a single Availability Zone to reduce latency.
B. Use Spot Instances to reduce costs.
C. Use a Classic Load Balancer to distribute traffic.
D. Place an Application Load Balancer in front of the Auto Scaling group.
E. Configure the Auto Scaling group to launch instances in multiple Availability Zones.
Answers: D, E

ALB distributes traffic across AZs and instances.

Why this answer

Options D and E are correct. Option D: An Application Load Balancer (ALB) distributes traffic across instances in multiple AZs. Option E: Distributing instances across multiple AZs ensures that if one AZ fails, the application continues in other AZs.

Option A is wrong because a single AZ defeats high availability. Option B is wrong because Spot Instances can be terminated, reducing availability. Option C is wrong because a Classic Load Balancer is not recommended; an ALB is the better choice for distributing traffic across zones.

90
MCQ · easy

A DevOps engineer is troubleshooting an Auto Scaling group (ASG) that is not launching instances as expected. The ASG is configured with a launch template that uses an Amazon Linux 2 AMI. The engineer checks the EC2 Auto Scaling console and sees that the group's desired capacity is set to 2, but only 1 instance is running. The last scaling activity shows 'Failed to launch instance. Error: Your quota allows for 0 more running instance(s).' What is the most likely cause?

A. The launch template has insufficient IAM permissions to create instances.
B. The account has reached the EC2 instance limit for the selected instance type in the region.
C. The VPC subnet does not have enough available IP addresses.
D. There is an instance that is not passing health checks, preventing new instances.
Answer: B

B is correct because the error explicitly states quota for running instances is exceeded.

Why this answer

The error message 'Your quota allows for 0 more running instance(s)' directly indicates that the AWS account has reached its EC2 instance limit for the specific instance type in the region. Auto Scaling groups cannot launch instances beyond the service quota, regardless of the desired capacity. This is a common issue when the default limit (e.g., 5 or 20 instances per instance family) has been exhausted.

Exam trap

The trap here is that candidates confuse IAM permissions with service quotas, or assume a VPC subnet IP shortage is the cause, when the explicit quota error message is the definitive clue.

How to eliminate wrong answers

Option A is wrong because IAM permissions affect the ability to call EC2 APIs (e.g., RunInstances), but the error message explicitly cites a quota issue, not an authorization failure (which would return 'UnauthorizedOperation' or 'AccessDenied'). Option C is wrong because insufficient IP addresses in the subnet would produce an error like 'InsufficientFreeAddressesInSubnet' or 'NoFreeAddressesInSubnet', not a quota limit message. Option D is wrong because an instance failing health checks does not prevent new instances from launching; the ASG would still attempt to launch replacements, and the error would relate to health check failures, not a quota limit.

91
MCQ · medium

A company uses Amazon CloudFront to distribute content globally. Users in some regions report slow load times. The DevOps team wants to identify the geographic regions where performance is worst. Which tool should they use?

A. Amazon CloudWatch Metrics for CloudFront
B. CloudFront access logs in S3
C. Amazon Route 53 latency records
D. CloudFront reports in the AWS Management Console
Answer: D

CloudFront reports provide geographic performance data.

Why this answer

Option D is correct because CloudFront reports show metrics like total requests, error rates, and latency by geographic region. Option A is wrong because CloudWatch metrics are per distribution, not per region. Option B is wrong because access logs do not aggregate performance. Option C is wrong because Route 53 is for DNS, not performance analysis.

92
MCQ · medium

A company uses Amazon CloudFront to serve static content from an S3 bucket. Users in a specific region report slow load times. The DevOps team checks CloudFront metrics and sees a high error rate (5xx) for that region. The S3 bucket is healthy. What is the most likely cause?

A. The AWS WAF web ACL is blocking requests from that region.
B. The S3 bucket policy does not grant access to the CloudFront origin access identity (OAI).
C. The CloudFront distribution's default TTL is too short.
D. The CloudFront origin shield is misconfigured.
Answer: B

Misconfigured OAI causes 403 errors that may appear as 5xx if not handled properly.

Why this answer

A high 5xx error rate from CloudFront combined with a healthy S3 bucket strongly indicates that CloudFront cannot fetch objects from the origin. If the S3 bucket policy does not grant access to the CloudFront origin access identity (OAI), CloudFront receives an access denied (403) response from S3, which CloudFront translates into a 5xx error for the user. This is the most common cause of regional 5xx errors when the origin is otherwise healthy.

Exam trap

The trap here is that candidates often assume 5xx errors always indicate an origin server problem, but in CloudFront, a 5xx can also result from an authentication failure (403) at the origin, which CloudFront converts to a 5xx for the viewer.

How to eliminate wrong answers

Option A is wrong because AWS WAF web ACLs block requests at the edge with a 403 Forbidden response, not a 5xx error, and the question states the error rate is 5xx. Option C is wrong because a short default TTL would cause more frequent origin fetches and potentially increase latency, but it would not cause 5xx errors; the origin is healthy and would still return valid content. Option D is wrong because a misconfigured Origin Shield could increase latency or cause connectivity issues, but it would not produce a high 5xx error rate if the underlying S3 bucket is healthy and accessible; Origin Shield is an additional caching layer, not an authentication mechanism.

93
Multi-Select · medium

A company uses AWS CloudTrail to log API calls in a multi-account environment. The security team wants to be alerted immediately when an IAM user or role performs a specific sensitive action (e.g., DeleteTrail, DeleteDBInstance). Which TWO services can be used together to achieve near real-time alerting? (Choose TWO.)

Select 2 answers
A. CloudWatch Logs metric filters and alarms
B. CloudTrail with CloudWatch Logs integration
C. CloudTrail with Amazon S3 event notifications
D. Amazon Athena and CloudWatch dashboards
E. AWS Config and AWS Lambda
Answers: A, B

Metric filters can trigger alarms.

Why this answer

Option A is correct because a CloudWatch Logs metric filter can match the sensitive API calls and trigger an alarm. Option B is correct because CloudTrail can deliver events to CloudWatch Logs. Option C is wrong because S3 delivery is not real time.

Option D is wrong because Athena is for analysis. Option E is wrong because Config is for resource compliance.
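As a small illustration of the metric filter half of this pairing, the sketch below builds a CloudWatch Logs JSON filter pattern that matches the event names mentioned in the question. The `event_name_filter` helper is hypothetical; only the resulting pattern string would be supplied to the metric filter.

```python
def event_name_filter(event_names):
    """Build a JSON filter pattern that matches any of the given
    CloudTrail eventName values."""
    clauses = [f'($.eventName = "{name}")' for name in event_names]
    return "{ " + " || ".join(clauses) + " }"


pattern = event_name_filter(["DeleteTrail", "DeleteDBInstance"])
print(pattern)
```

An alarm on the resulting metric, with a threshold of one occurrence, is what turns a matching log event into a notification.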

94
Multi-Select · medium

A company's DevOps team is designing an automated incident response workflow using AWS Systems Manager Incident Manager and AWS Lambda. The workflow should automatically acknowledge incidents and send notifications to the appropriate response team. Which TWO actions should the team take to achieve this?

Select 2 answers
A. Have the Lambda function publish a message to an SNS topic that the response team subscribes to.
B. Configure the CloudWatch alarm to directly invoke the Lambda function for incident creation.
C. Use Incident Manager's built-in 'Acknowledge' and 'Notify' actions in a response plan.
D. Use AWS Step Functions to orchestrate the Lambda function and SNS topic.
E. Create an EventBridge rule that matches Incident Manager events and triggers a Lambda function.
Answers: A, E

SNS can send notifications via email, SMS, etc., to the team.

Why this answer

Option E is correct because Incident Manager events can trigger a Lambda function through Amazon EventBridge. Option A is correct because the Lambda function can use SNS to send notifications to the team. Option B is wrong because a CloudWatch alarm that invokes Lambda to create incidents does not acknowledge incidents or notify responders.

Option C is wrong because response plans define engagements, chat channels, and runbooks; they do not offer the 'Acknowledge' and 'Notify' actions the option describes. Option D is wrong because Step Functions is unnecessary for this simple workflow.

95
MCQ · easy

A company hosts a static website on Amazon S3 with CloudFront as the CDN. Users report that they see an old version of the website even after the DevOps team updated the S3 objects. The team verified that the new objects are in the S3 bucket and are publicly accessible. The CloudFront distribution has a default TTL of 24 hours. To immediately serve the new content to users, the team needs to invalidate the CloudFront cache. Which of the following is the CORRECT approach to achieve this with minimal impact?

A. Create a CloudFront invalidation request for the path '/*'.
B. Change the CloudFront origin path to point to a new S3 bucket.
C. Update the CloudFront distribution's default TTL to 0 and wait for the changes to propagate.
D. Delete the S3 objects and re-upload them with different names.
Answer: A

Invalidation removes cached objects immediately.

Why this answer

Option A is correct because creating a CloudFront invalidation for the path '/*' will remove all cached objects and force CloudFront to fetch new content from S3. Option B is wrong because changing the origin path does not invalidate the cache. Option C is wrong because updating the TTL affects future caching, not objects that are already cached.

Option D is wrong because S3 does not control the CloudFront cache.
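For reference, the sketch below builds the invalidation batch structure that a CloudFront CreateInvalidation request carries. The `build_invalidation_batch` helper and the timestamp-based caller reference are illustrative choices; with boto3 the resulting dictionary would be passed as the `InvalidationBatch` argument of `create_invalidation`, together with a distribution ID that is not shown here.

```python
import time


def build_invalidation_batch(paths, caller_reference=None):
    """Build the InvalidationBatch structure for a CreateInvalidation call.

    caller_reference must be unique per request; a timestamp is used here
    as a simple assumption when none is supplied.
    """
    paths = list(paths)
    return {
        "Paths": {"Quantity": len(paths), "Items": paths},
        "CallerReference": caller_reference or str(time.time()),
    }


batch = build_invalidation_batch(["/*"], caller_reference="site-update-001")
print(batch)
```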

96
MCQ · medium

An application log excerpt shows repeated HTTP 500 errors for the /api/orders endpoint, with occasional successful health checks. The application runs on EC2 instances behind an ALB. What is the MOST likely cause of this pattern?

A. The backend service that the /api/orders endpoint depends on is unavailable or failing.
B. The EC2 instances are running out of memory and the application is crashing.
C. The ALB is misconfigured and routing requests to the wrong target group.
D. The EC2 instances are not passing health checks and are being deregistered from the target group.
Answer: A

The endpoint consistently fails while health check succeeds, indicating a dependency issue.

Why this answer

The pattern of repeated HTTP 500 errors for /api/orders with occasional successful health checks strongly indicates that the backend service dependency (e.g., a database, cache, or another microservice) is intermittently failing or unavailable. HTTP 500 errors are server-side errors, meaning the application code is running but cannot complete the request due to a downstream failure. Successful health checks confirm the EC2 instances themselves are healthy and in-service, ruling out instance-level or ALB misconfiguration issues.

Exam trap

The trap here is that candidates confuse HTTP 500 errors with instance-level failures (like OOM or health check failures), but the key differentiator is that successful health checks prove the instances are operational, shifting the root cause to a failing backend dependency rather than the compute layer.

How to eliminate wrong answers

Option B is wrong because running out of memory typically causes the application process to crash (e.g., OOM killer), leading to connection timeouts or immediate 503 errors, not repeated HTTP 500 errors with successful health checks. Option C is wrong because a misconfigured ALB routing to the wrong target group would cause requests to reach instances that don't serve the /api/orders endpoint, resulting in 404 or 503 errors, not 500 errors from the application itself. Option D is wrong because if instances were failing health checks and being deregistered, they would be removed from the target group and stop receiving traffic entirely, which contradicts the observed pattern of occasional successful health checks and persistent 500 errors on /api/orders.

97
MCQ · hard

A DevOps engineer applied the above S3 bucket policy to restrict access. Users report that they can download objects from the bucket only when using HTTPS from within the 10.0.0.0/8 network. However, users outside that network receive access denied errors even over HTTPS. What is wrong with the policy?

A. The Deny statement blocks all requests over HTTPS because of the BoolIfExists condition.
B. The Allow statement should include a condition for SecureTransport.
C. The Deny statement should use a Condition for SourceIp instead of SecureTransport.
D. The Allow statement only allows requests from the specific IP range; requests from other IPs are implicitly denied even over HTTPS.
Answer: D

Implicit deny applies to all not explicitly allowed.

Why this answer

Option D is correct because the Deny statement with SecureTransport false only denies non-HTTPS requests; it does not explicitly allow HTTPS requests from outside the allowed IP range. The Allow statement only allows requests from the specific IP range, so requests from outside that range are implicitly denied. Option A is wrong because the condition is fine: it matches only requests that are not sent over HTTPS.

Option B is wrong because the policy already uses a Deny for non-HTTPS requests. Option C is wrong because the Deny statement is not the issue; the lack of an Allow for HTTPS from other IPs is the issue.
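The policy itself is not reproduced here, so the following is only a sketch of the behavior the question describes: a Deny on non-HTTPS requests combined with an Allow limited to 10.0.0.0/8. The `evaluate` function is a hypothetical simplification of how the two statements interact, not the actual S3 authorization logic.

```python
import ipaddress

# Network named in the question's Allow statement.
ALLOWED_NETWORK = ipaddress.ip_network("10.0.0.0/8")


def evaluate(source_ip, secure_transport):
    """Return the decision for a GetObject request under the assumed policy."""
    # Deny statement: applies when aws:SecureTransport is false.
    if not secure_transport:
        return "ExplicitDeny"
    # Allow statement: applies only when aws:SourceIp is inside the network.
    if ipaddress.ip_address(source_ip) in ALLOWED_NETWORK:
        return "Allow"
    # No statement matched, so the request falls through to implicit deny.
    return "ImplicitDeny"


print(evaluate("10.1.2.3", True))       # inside the network, HTTPS
print(evaluate("203.0.113.10", True))   # outside the network, HTTPS
print(evaluate("10.1.2.3", False))      # inside the network, plain HTTP
```

Allowing outside users over HTTPS would require an Allow statement that does not restrict the source IP, while the Deny on non-HTTPS requests stays in place.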

98
MCQ · hard

A company runs a critical application on an Amazon RDS for MySQL DB instance. The application experiences intermittent connection timeouts. The DevOps team notices that the DB instance's CPU and memory metrics are normal. What should the team check NEXT to diagnose the issue?

A. Enable Enhanced Monitoring to check OS-level metrics
B. Examine the slow query log to identify long-running queries
C. Verify that the DB instance's storage is not full
D. Check the 'DatabaseConnections' CloudWatch metric to see if the connection count is near the max_connections limit
Answer: D

Connection timeouts often result from hitting the max connections limit.

Why this answer

Option D is correct because connection timeouts with normal CPU and memory often indicate that the DB instance's maximum connections limit has been reached. Option A is wrong because Enhanced Monitoring is for OS-level metrics, not connections. Option B is wrong because query latency is not directly related to connection timeouts. Option C is wrong because storage is not the issue here.

99
MCQ · easy

A company uses Amazon RDS for MySQL with Multi-AZ deployment. The database instance fails and AWS automatically fails over to the standby. After the failover, the application cannot connect to the database. The engineer checks the RDS console and sees that the instance status is Available. What is the MOST likely cause of the connectivity issue?

A. The security group for the RDS instance has changed during failover.
B. The application is using the database's DNS endpoint for the old primary, which is no longer the writer.
C. The DNS record for the RDS endpoint has not propagated to the application's DNS resolver.
D. The database instance is still in the process of failover and is not yet accepting connections.
Answer: B

After failover, the writer endpoint points to the new primary, but if the application caches the old endpoint, it may fail.

Why this answer

After an RDS Multi-AZ failover, the DNS endpoint for the DB instance remains the same but its underlying IP address changes to point to the new primary (formerly the standby). If the application caches the IP address of the old primary or uses a direct connection to the old writer endpoint, it will attempt to connect to a node that is no longer the writer. The correct practice is to always connect using the RDS instance endpoint (CNAME), which automatically resolves to the current writer, and to avoid caching the resolved IP address.

Since the instance status is 'Available', the new primary is ready, so the issue is a stale connection target.

Exam trap

The trap here is that candidates assume a failed Multi-AZ failover or a DNS propagation delay, when in reality the instance is healthy and DNS updates quickly, but the application's cached IP address from the old primary is the root cause.

How to eliminate wrong answers

Option A is wrong because security groups are associated with the RDS instance itself, not with a specific node; during a failover, the security group configuration is preserved and does not change. Option C is wrong because the DNS record (CNAME) for the RDS endpoint is managed by AWS Route 53 with a very low TTL (typically 5 seconds) and propagates quickly; the application's DNS resolver would have the updated record shortly after the failover completes. Option D is wrong because the RDS console shows the instance status as 'Available', which means the failover has completed and the new primary is accepting connections; the issue is not that the instance is still transitioning.
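
The advice to connect by endpoint name rather than by a stored IP address can be sketched as follows. This is a minimal illustration using only Python's standard `socket` module; the function names are hypothetical, and the operating system or runtime may still apply its own DNS caching on top of this.

```python
import socket


def resolve_endpoint(hostname, port):
    """Return the IP addresses the hostname resolves to right now."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def connect_fresh(hostname, port, timeout=5.0):
    """Open a TCP connection by name so the lookup happens at connect time."""
    return socket.create_connection((hostname, port), timeout=timeout)
```

Calling `connect_fresh` for every new connection (instead of reusing an address resolved at startup) lets the application pick up the endpoint's new target after a failover.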

100
MCQhard

A company runs a critical web application on AWS. The application is deployed across multiple Availability Zones using an Application Load Balancer (ALB) with an Auto Scaling group of EC2 instances. The Auto Scaling group uses a launch template that specifies an Amazon Linux 2 AMI. The application stores session state in an ElastiCache Redis cluster. Recently, the operations team received alerts that the application is returning 503 errors intermittently. Investigation shows that the ALB target group health checks are failing for some instances, but those instances are still in service. The CloudWatch logs from the instances show that the application is running, but the health check endpoint is timing out after 5 seconds. The health check is configured with a 5-second timeout, 10-second interval, and 2 consecutive successes required to mark healthy. The DevOps engineer suspects that the issue is due to high CPU utilization on the instances causing the health check to respond slowly. The engineer wants to implement a solution that prevents the ALB from routing traffic to instances that are experiencing high CPU, and also automatically scales out to handle the increased load. What should the engineer do?

A.Configure the Auto Scaling group to use ELB health checks and set the health check grace period to 600 seconds.
B.Create a CloudWatch alarm on CPU utilization and use it to perform an EC2 action to stop the instance, and configure the Auto Scaling group to use a target tracking scaling policy based on CPU utilization.
C.Create a scheduled scaling action to add more instances during peak hours.
D.Increase the health check timeout to 10 seconds and the interval to 20 seconds to give instances more time to respond.
AnswerB

Stopping high CPU instances removes them from the ALB, and target tracking scaling adds capacity when needed.

Why this answer

Option B is correct because it addresses both the immediate issue (high CPU causing health check timeouts) and the scaling requirement. Stopping the instance via a CloudWatch alarm removes it from the ALB target group, preventing traffic routing to unhealthy instances. The target tracking scaling policy based on CPU utilization automatically adds instances when CPU is high, ensuring capacity matches demand.

Exam trap

The trap here is that candidates may think increasing health check timeout or grace period solves the problem, but AWS expects you to recognize that high CPU instances should be removed from service and replaced via scaling, not just given more time to respond.

How to eliminate wrong answers

Option A is wrong because increasing the health check grace period to 600 seconds only delays the start of health checks, but does not prevent traffic from being routed to instances with high CPU after the grace period ends; it also does not trigger scaling. Option C is wrong because a scheduled scaling action is reactive to time-based patterns, not to real-time CPU spikes, and does not address the immediate health check failures. Option D is wrong because increasing the health check timeout and interval only masks the symptom by allowing more time for slow responses, but does not remove unhealthy instances from service or scale out to handle load.

101
MCQmedium

A company's application running on EC2 instances behind an Application Load Balancer (ALB) is returning intermittent 504 errors. The instances are in an Auto Scaling group with a health check grace period of 300 seconds. What should the DevOps engineer check first to troubleshoot the issue?

A.Review the Auto Scaling group scaling policies.
B.Verify the target group health checks are passing.
C.Check security group rules for the ALB.
D.Check ALB access logs for target response times.
AnswerD

Helps identify slow responses from targets.

Why this answer

Option D is correct because 504 errors indicate the load balancer did not receive a response from the target within the idle timeout period. Checking the ALB access logs for target response times can confirm whether the backend is slow. Option B is wrong because failing health checks would cause the instances to be replaced, whereas the issue here is intermittent.

Option A is wrong because scaling policies affect capacity, not response times. Option C is wrong because security group rules would cause connection timeouts, not 504s.
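
As an illustration of what "check the access logs for target response times" can look like, here is a small Python sketch. It assumes the standard space-separated ALB access log layout, in which the fifth field is the target, the seventh is `target_processing_time`, and the ninth is the load balancer status code; the threshold and the sample values used with it are made up.

```python
import shlex


def slow_or_timed_out(log_lines, threshold_seconds=1.0):
    """Return (target, target_processing_time, elb_status) for suspicious entries."""
    results = []
    for line in log_lines:
        fields = shlex.split(line)  # keeps quoted fields such as the request together
        if len(fields) < 10:
            continue  # not a complete access log entry
        target = fields[4]
        target_time = float(fields[6])  # -1 means the target never answered
        elb_status = fields[8]
        if elb_status == "504" or target_time == -1 or target_time >= threshold_seconds:
            results.append((target, target_time, elb_status))
    return results
```

Grouping the returned entries by target shows whether the slow responses come from one instance or from the whole fleet.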

102
MCQeasy

A DevOps engineer is troubleshooting a Lambda function that processes S3 events. The function has been running successfully for months, but today it started timing out. The engineer checks CloudWatch Logs and sees 'Task timed out after 3.01 seconds' errors. The function is configured with a 3-second timeout. What should the engineer do to resolve the issue?

A.Increase the Lambda function reserved concurrency.
B.Increase the memory allocation for the Lambda function.
C.Increase the Lambda function timeout to 10 seconds.
D.Configure a dead-letter queue (DLQ) for the Lambda function.
AnswerC

Increasing the timeout directly resolves the timeout error.

Why this answer

Option C is correct because the function is timing out at the configured limit. Increasing the timeout to a higher value (e.g., 10 seconds) gives the function more time to complete. Option B is wrong because increasing memory may improve performance but does not directly address the timeout.

Option D is wrong because a DLQ only captures events that failed processing; it does not stop the function from timing out. Option A is wrong because the function is hitting its timeout, not a concurrency limit.

103
MCQmedium

A company uses EC2 instances in an Auto Scaling group behind an ALB. The DevOps team receives alerts that the CPU utilization on the instances is consistently above 90% during peak hours. The Auto Scaling group is configured with a simple scaling policy that adds one instance when CPU exceeds 80% and removes one when below 30%. However, during sudden traffic spikes, the scaling policy reacts too slowly, causing performance degradation. The team wants to improve the scaling responsiveness without over-provisioning. What should the team do?

A.Increase the cooldown period for the simple scaling policy to allow more time for metrics to stabilize.
B.Replace the simple scaling policy with a step scaling policy that adds multiple instances when CPU exceeds 80%.
C.Create a scheduled scaling action to add instances before peak hours based on historical data.
D.Replace the simple scaling policy with a target tracking scaling policy based on average CPU utilization with a target value of 70%.
AnswerD

Target tracking continuously adjusts capacity to maintain the target metric, providing faster response to spikes.

Why this answer

Option D is correct because using a target tracking scaling policy with a lower target value (e.g., 70%) allows a more proactive scaling response, and the policy automatically adjusts capacity based on the metric. Option C is wrong because scheduled scaling may not handle unpredictable spikes. Option A is wrong because increasing the cooldown would slow down scaling.

Option B is wrong because a step scaling policy can be more responsive than simple scaling, but target tracking is even simpler and more effective.
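
For reference, a target tracking policy like the one in option D could be described with parameters along these lines. This is a sketch only: the helper and the policy name are hypothetical, and the dictionary is meant to be passed to the Auto Scaling `PutScalingPolicy` API (for example via boto3's `put_scaling_policy`), which is not called here.

```python
def target_tracking_policy_kwargs(asg_name, target_cpu_percent=70.0):
    """Build PutScalingPolicy parameters for a CPU-based target tracking policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": "cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                # Average CPU across the instances in the group
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


# Example (assumes boto3 and AWS credentials are available):
# boto3.client("autoscaling").put_scaling_policy(**target_tracking_policy_kwargs("web-asg"))
```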

104
MCQmedium

A company uses AWS CloudTrail to log all API calls. During an incident investigation, a security engineer needs to identify who deleted an S3 bucket named 'critical-data' two days ago. Which approach will provide the necessary information?

A.Use AWS CloudTrail LookupEvents API to search for DeleteBucket events.
B.Check the AWS Management Console activity history.
C.Review the S3 access logs for the bucket.
D.Search CloudWatch Logs for 'DeleteBucket' events.
AnswerA

CloudTrail records DeleteBucket and LookupEvents can filter by event name and time.

Why this answer

Option A is correct because CloudTrail logs all API calls, including DeleteBucket, and the LookupEvents API can be used to search for events. The engineer can filter by event name or resource name within the relevant time range. Option C is wrong because S3 access logs record object-level access, not bucket deletion.

Option D is wrong because CloudWatch Logs does not store CloudTrail events by default; they must be streamed. Option B is wrong because AWS Management Console activity is itself logged in CloudTrail; the console does not keep its own record of user actions.
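
The lookup described above can be sketched in Python. The helpers below are hypothetical: the first builds parameters for the CloudTrail `LookupEvents` API (which accepts a single lookup attribute per request, so the bucket name is matched client-side), and the second filters returned events, assuming each event carries a `CloudTrailEvent` JSON string with `requestParameters.bucketName`.

```python
import json
from datetime import datetime, timedelta, timezone


def delete_bucket_lookup_kwargs(now, days_back=3):
    """Build LookupEvents parameters that search for DeleteBucket calls."""
    return {
        "LookupAttributes": [
            {"AttributeKey": "EventName", "AttributeValue": "DeleteBucket"}
        ],
        "StartTime": now - timedelta(days=days_back),
        "EndTime": now,
    }


def who_deleted(events, bucket_name):
    """Return (username, source IP, event time) for deletions of the given bucket."""
    matches = []
    for event in events:
        detail = json.loads(event["CloudTrailEvent"])
        params = detail.get("requestParameters") or {}
        if params.get("bucketName") == bucket_name:
            matches.append(
                (event.get("Username"), detail.get("sourceIPAddress"), detail.get("eventTime"))
            )
    return matches


# Example (assumes boto3 and AWS credentials are available):
# response = boto3.client("cloudtrail").lookup_events(
#     **delete_bucket_lookup_kwargs(datetime.now(timezone.utc)))
# print(who_deleted(response["Events"], "critical-data"))
```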

105
MCQhard

A company runs a critical application on EC2 instances behind an Application Load Balancer. The application is deployed across three Availability Zones. The DevOps team uses AWS CloudFormation to manage the infrastructure. During a recent deployment, a stack update failed, and the stack entered the UPDATE_ROLLBACK_IN_PROGRESS state. However, the rollback also failed, leaving the stack in the UPDATE_ROLLBACK_FAILED state. The engineer needs to restore the application to a working state. The stack includes an Auto Scaling group, an ALB, security groups, and a DynamoDB table. The DynamoDB table is defined with deletion protection enabled. The engineer is considering retrying the rollback with ContinueUpdateRollback after fixing the resource that caused the failure, deleting the stack and recreating it from the last known good template, using CloudFormation's 'SignalResource' to manually complete the rollback, or manually updating the resources to match the previous template and then resuming the rollback. Which action should the engineer take?

A.Manually update the resources to match the previous template, then resume the rollback.
B.Use ContinueUpdateRollback to retry the rollback after fixing the resource that caused the failure.
C.Use CloudFormation's 'SignalResource' to manually complete the rollback.
D.Delete the stack and recreate it from the last known good template.
AnswerB

ContinueUpdateRollback is the correct procedure to resolve rollback failures.

Why this answer

Option B is correct because ContinueUpdateRollback is the designed method to resolve a rollback failure. The engineer can fix the underlying issue (e.g., DynamoDB deletion protection preventing table deletion) and then retry the rollback. Option D is wrong because deleting the stack would also attempt to delete the DynamoDB table (deletion protection may prevent that deletion, causing further issues).

Option C is wrong because SignalResource is for signaling creation of resources, not for rollback. Option A is wrong because manually updating resources is error-prone and not recommended; ContinueUpdateRollback automates the process.

106
MCQeasy

A DevOps engineer receives a CloudWatch alarm that the 'StatusCheckFailed' metric for an EC2 instance is in ALARM state. The instance is part of an Auto Scaling group. What should the engineer do first to restore service?

A.Update the Auto Scaling group's launch configuration
B.Wait for Auto Scaling to replace the instance
C.Manually terminate the instance
D.Reboot the instance
AnswerB

Auto Scaling will automatically terminate and launch a new instance.

Why this answer

Option B is correct because the instance status check failure indicates a system issue; if the instance is unhealthy, Auto Scaling will terminate and replace it. Option D is wrong because rebooting may not fix underlying issues. Option C is wrong because terminating manually is not needed; Auto Scaling handles it.

Option A is wrong because updating the launch configuration (for example, to change the AMI) does not restore service immediately.

107
Multi-Selectmedium

A company uses AWS CodePipeline for CI/CD. A recent pipeline execution failed at the 'Deploy' stage with the error 'Action execution failed: Access Denied'. The pipeline uses an IAM service role. Which THREE checks should the engineer perform to resolve this?

Select 3 answers
A.Check that CloudWatch Events rule is configured to trigger the pipeline.
B.Verify that the IAM service role has sufficient permissions to perform the deploy action on the target resource.
C.Ensure the artifact store S3 bucket has a bucket policy that allows the pipeline role to access it.
D.Enable S3 event notifications to trigger the pipeline on code changes.
E.Confirm that the service role's trust policy allows CodePipeline to assume the role.
AnswersB, C, E

Missing permissions cause access denied errors.

Why this answer

Option B is correct because the service role may lack permissions to the deployment target. Option C is correct because the artifact bucket must grant the pipeline role access. Option E is correct because the IAM role trust policy must allow CodePipeline to assume it.

Option A is wrong because CloudWatch Events are not required for pipeline execution. Option D is wrong because S3 events are not relevant to the deploy stage.

108
MCQeasy

A DevOps engineer receives an alarm that an EC2 instance's CPU utilization has exceeded 90% for 5 minutes. The engineer needs to automatically recover the instance. Which AWS service should be used to configure automatic recovery?

A.Amazon CloudWatch Alarms
B.AWS Lambda
C.AWS Systems Manager Automation
D.EC2 Auto Scaling
AnswerA

CloudWatch alarms can initiate EC2 automatic recovery through the 'recover' alarm action.

Why this answer

Amazon CloudWatch alarms support EC2 actions (stop, terminate, reboot, and recover) that run when the alarm's metric breaches its threshold. The recover action moves the instance to new underlying hardware while preserving the instance ID, private IP, and Elastic IP. Note that AWS documents the recover action for alarms on the StatusCheckFailed_System metric; a CPU utilization alarm such as the one in the question would typically use another EC2 action, such as reboot. Either way, the alarm action is the native, built-in mechanism and requires no additional compute or orchestration services.

Exam trap

The trap here is that candidates often confuse EC2 Auto Scaling (which replaces instances) with automatic recovery (which recovers the same instance), or they overcomplicate the solution by choosing Lambda or Systems Manager when a simple CloudWatch Alarm action is the correct and native AWS mechanism.

How to eliminate wrong answers

Option B is wrong because AWS Lambda is a serverless compute service that can execute custom code in response to events, but it is not the direct service used to configure automatic EC2 instance recovery; while Lambda could be used to script a recovery, it adds unnecessary complexity and latency compared to the native CloudWatch Alarm recovery action. Option C is wrong because AWS Systems Manager Automation provides runbooks for automated remediation and operational tasks, but it is not the primary service for configuring automatic EC2 instance recovery; it would require additional setup and is not the simplest or recommended approach. Option D is wrong because EC2 Auto Scaling is designed to manage the number of instances in an Auto Scaling group based on scaling policies, not to recover a specific impaired instance; it would terminate and replace the instance rather than recover it, which changes the instance ID and associated resources.
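
A recover alarm of the kind discussed here might be defined with parameters like the following. This is a hedged sketch: the helper and alarm name are hypothetical, the period and evaluation settings are arbitrary, and the metric is assumed to be `StatusCheckFailed_System`, which is the metric the recover action is documented for. The dictionary targets the CloudWatch `PutMetricAlarm` API and is not sent anywhere by this code.

```python
def recover_alarm_kwargs(instance_id, region):
    """Build PutMetricAlarm parameters that recover an instance on system check failure."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 2,  # two failing datapoints in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action ARN; no Lambda or runbook is involved
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


# Example (assumes boto3 and AWS credentials are available):
# boto3.client("cloudwatch").put_metric_alarm(**recover_alarm_kwargs("i-0123456789abcdef0", "us-east-1"))
```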

109
Multi-Selectmedium

A company is designing an incident response strategy for its Amazon EKS cluster. Which THREE steps should be taken to ensure rapid response to a compromised pod?

Select 3 answers
A.Delete the entire namespace to ensure all resources are removed.
B.Scale down the deployment to 0 replicas.
C.Delete the pod using kubectl delete pod.
D.Use kubectl exec to gather forensic data from the pod before termination.
E.Apply a Kubernetes NetworkPolicy to deny all ingress/egress traffic to the compromised pod.
AnswersC, D, E

Terminating the pod stops the compromise.

Why this answer

Option E is correct because a network policy can isolate the pod. Option D is correct because executing commands in the pod helps gather forensic data before termination. Option C is correct because terminating the pod stops the compromise.

Option A is wrong because deleting the namespace would affect other resources. Option B is wrong because scaling down the deployment might not be immediate enough.
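
The isolation step can be illustrated with a deny-all NetworkPolicy. The Python sketch below only builds the manifest as a dictionary; the policy name and the `quarantine=true` label are hypothetical, and enforcement assumes the cluster's network plugin supports NetworkPolicy.

```python
import json


def quarantine_policy(namespace, label_key="quarantine", label_value="true"):
    """Build a NetworkPolicy that blocks all traffic to and from labeled pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            # Applies only to pods carrying the quarantine label
            "podSelector": {"matchLabels": {label_key: label_value}},
            # Both directions are listed and no allow rules are given,
            # so all ingress and egress for the selected pods is denied.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


manifest_json = json.dumps(quarantine_policy("prod"), indent=2)
```

The JSON output can be saved to a file and applied with `kubectl apply -f`, after which labeling the compromised pod places it under the policy.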

110
Multi-Selectmedium

A company runs a critical application on EC2 instances behind an Application Load Balancer (ALB) in an Auto Scaling group. The team wants to automate the response to an instance failure. Which THREE steps should be taken to ensure automatic recovery and notification?

Select 3 answers
A.Create a CloudWatch alarm to terminate the instance
B.Configure Auto Scaling to replace unhealthy instances
C.Configure the ALB health check to mark instances as unhealthy
D.Set up Amazon SNS notifications for Auto Scaling events
E.Create a scaling policy based on CPU utilization
AnswersB, C, D

Auto Scaling can automatically terminate and launch instances based on health checks.

Why this answer

Option B is correct because Auto Scaling terminates and replaces unhealthy instances. Option D is correct because SNS notifications report Auto Scaling events to the team. Option C is correct because the ALB health check is what marks instances as unhealthy.

Option E is wrong because scaling policies adjust capacity but do not handle individual instance failures. Option A is wrong because a CloudWatch alarm can terminate instances, but Auto Scaling already handles replacement.

111
MCQeasy

A DevOps engineer is troubleshooting a Lambda function that times out after 3 seconds. The function makes an HTTP request to an external API. The function's timeout setting is 10 seconds. What is the most likely cause of the timeout?

A.The external API is throttling the request.
B.The HTTP client's timeout is set to a low value.
C.The Lambda function is not in a VPC.
D.The Lambda function's memory is insufficient.
AnswerB

The HTTP client has its own timeout, here apparently about 3 seconds; it needs to be increased.

Why this answer

The Lambda function fails after 3 seconds even though its configured timeout is 10 seconds, so something inside the function is giving up first. An HTTP client timeout (set explicitly in code or inherited from an SDK or wrapper library) fires independently of the Lambda timeout and causes the call to fail before the Lambda limit is reached. This is the most likely cause because the symptom matches a client-side timeout rather than a server-side or infrastructure issue.

Exam trap

The trap here is that candidates assume the Lambda function's configured timeout (10 seconds) is the only timeout that matters, overlooking that application-level timeouts (like HTTP client timeouts) can fire independently and cause earlier failures.

How to eliminate wrong answers

Option A is wrong because external API throttling would typically return an HTTP 429 status code or cause a slower response, not a consistent 3-second timeout; the function would still wait for the response up to the Lambda timeout. Option C is wrong because not being in a VPC actually reduces network latency and complexity (Lambda uses the public internet by default), so it would not cause a timeout; VPC-related timeouts usually occur when the function is in a VPC without a NAT gateway or proper routing. Option D is wrong because insufficient memory would cause out-of-memory errors or slower execution, not a consistent 3-second timeout; memory affects CPU allocation but does not directly trigger a timeout at a specific second.
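
To make the client-side timeout concrete, here is a minimal Python sketch using only the standard library. The `fetch` and `silent_listener` helpers are hypothetical and exist purely for illustration; the point is that the `timeout` passed to the HTTP client decides when the call gives up, regardless of any larger limit configured around it.

```python
import socket
import time
import urllib.error
import urllib.request

# Ignore proxy environment variables so the request goes straight to the URL.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url, client_timeout):
    """Return ("ok", status) or ("client_timeout", elapsed_seconds)."""
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=client_timeout) as response:
            return ("ok", response.status)
    except socket.timeout:
        # Raised when the server accepts the connection but does not answer in time
        return ("client_timeout", time.monotonic() - start)
    except urllib.error.URLError as error:
        # Connect-phase timeouts arrive wrapped in URLError
        if isinstance(error.reason, socket.timeout):
            return ("client_timeout", time.monotonic() - start)
        raise


def silent_listener():
    """Hypothetical helper: a local socket that lets clients connect but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    return server
```

Pointing `fetch` at a `silent_listener` port with a short `client_timeout` returns `"client_timeout"` after roughly that interval, which mirrors a function failing at 3 seconds under a 10-second Lambda limit.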

112
MCQhard

A company uses AWS CloudFormation to manage infrastructure. A recent stack update failed, and the stack is now in the 'UPDATE_ROLLBACK_FAILED' state. The engineer needs to fix the stack. What is the correct course of action?

A.Contact AWS Support to fix the stack
B.Delete the stack and recreate it
C.Submit another stack update with the desired configuration
D.Continue the rollback using the 'ContinueUpdateRollback' API
AnswerD

This allows the stack to complete rollback by skipping resources that failed.

Why this answer

Option D is correct because when a rollback fails, you can continue the rollback (which may skip resources that failed to roll back) after addressing the cause of the failure. Option B is wrong because deleting may not be possible if the stack is stuck. Option C is wrong because you cannot update a stack in the rollback failed state without first continuing the rollback.

Option A is wrong because support can help but is not the direct action.
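
A call to continue the rollback could be parameterized as below. This is a small sketch with a hypothetical helper; `StackName` and `ResourcesToSkip` are parameters of the CloudFormation `ContinueUpdateRollback` API, and the logical resource IDs shown in the example are made up. Skipped resources are treated as rolled back by CloudFormation, so they may need manual reconciliation afterward.

```python
def continue_rollback_kwargs(stack_name, resources_to_skip=None):
    """Build ContinueUpdateRollback parameters, optionally skipping failed resources."""
    kwargs = {"StackName": stack_name}
    if resources_to_skip:
        # Logical IDs of resources that could not be rolled back
        kwargs["ResourcesToSkip"] = list(resources_to_skip)
    return kwargs


# Example (assumes boto3 and AWS credentials are available):
# boto3.client("cloudformation").continue_update_rollback(
#     **continue_rollback_kwargs("web-stack", ["AppTable"]))
```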

113
MCQeasy

A company uses AWS CloudFormation to deploy a multi-tier web application. During an incident, the stack update fails with a 'ROLLBACK_IN_PROGRESS' status. The operations team needs to investigate the root cause quickly without losing the stack's current state. What is the MOST efficient approach?

A.Create a change set and execute it to see the changes.
B.Delete the stack and recreate it using the same template.
C.Manually fix the resources and then retry the stack update.
D.Use the ContinueUpdateRollback API with the ResourcesToSkip parameter.
AnswerD

This allows continuing the rollback while skipping problematic resources.

Why this answer

Option D is correct because using 'ContinueUpdateRollback' with the 'ResourcesToSkip' parameter allows skipping resources that caused the failure while preserving the stack. Option B is wrong because deleting the stack removes all resources. Option A is wrong because creating and executing a change set does not help investigate the current failure.

Option C is wrong because manually fixing resources and retrying the update without rollback is risky; the stack is already in rollback.

114
Multi-Selectmedium

A DevOps team is investigating a security incident where an unauthorized user accessed an S3 bucket. The team needs to determine what actions were taken by the user. Which TWO AWS services should be used together to investigate? (Choose TWO.)

Select 2 answers
A.S3 server access logs
B.Amazon CloudWatch metrics
C.AWS Config
D.AWS CloudTrail
E.Amazon GuardDuty
AnswersA, D

Server access logs provide detailed records of requests made to the bucket.

Why this answer

Option A and Option D are correct. CloudTrail logs API calls to S3, including who made the call, from where, and what action was taken (object-level data events must be enabled on a trail). S3 server access logs provide detailed records of requests made to the bucket, including object-level operations.

Option C is wrong because Config records configuration changes, not access. Option E is wrong because GuardDuty detects threats but does not provide detailed logs of past actions. Option B is wrong because CloudWatch metrics provide performance data, not access details.

115
MCQeasy

A DevOps engineer receives an alarm that an EC2 instance's status check has failed. The instance is part of an Auto Scaling group. How should the engineer respond?

A.Immediately terminate the instance to trigger a replacement
B.Verify the instance's system and instance status checks, then consider terminating the instance to allow the Auto Scaling group to launch a new one
C.Do nothing, because the Auto Scaling group will automatically replace the instance
D.Check the security group rules to ensure they allow traffic
AnswerB

This is the correct response to a status check failure.

Why this answer

Option B is correct because if the instance fails status checks, the engineer should verify which checks are failing and replace the instance if needed. Option D is wrong because status checks are internal, not security group issues; A is wrong because terminating the instance immediately may not be necessary; C is wrong because the Auto Scaling group will launch a new instance only if the instance is terminated or marked unhealthy.

116
Multi-Selecthard

A company is experiencing a DDoS attack on its application hosted on AWS. The application uses an Application Load Balancer (ALB) with an Auto Scaling group of EC2 instances. The security team needs to mitigate the attack with minimal latency impact on legitimate users. Which THREE actions should the team take? (Choose THREE.)

Select 3 answers
A.Disable cross-zone load balancing on the ALB to limit the number of instances receiving traffic.
B.Enable AWS Shield Advanced on the ALB.
C.Enable connection draining (deregistration delay) on the ALB target group.
D.Configure the Auto Scaling group to scale based on the NetworkIn metric to handle the increased traffic.
E.Configure AWS WAF on the ALB to block requests based on source IP reputation or rate-based rules.
AnswersB, C, E

Shield Advanced provides enhanced DDoS protection and access to DDoS Response Team (DRT).

Why this answer

AWS Shield Advanced provides enhanced DDoS mitigation for ALBs, including access to the DDoS Response Team (DRT) and financial protection against scaling costs. It operates at the network and transport layers with minimal latency, as it inspects traffic inline without introducing significant processing delay. This makes it a critical first line of defense for high-availability applications under attack.

Exam trap

The trap here is confusing scaling-based absorption (Option D) with actual mitigation, leading candidates to think handling more traffic automatically defends against DDoS, when in reality it only increases cost and resource exhaustion without blocking the attack source.
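
For the rate-based rule mentioned in option E, a rule definition might look roughly like this. It is a sketch: the helper, rule name, metric name, and limit are hypothetical values, and the dictionary follows the AWS WAFv2 rule structure (a `RateBasedStatement` aggregated by source IP with a `Block` action) without being submitted to any API.

```python
def rate_limit_rule(limit_per_5_minutes=2000, priority=1):
    """Build a WAFv2 rule that blocks source IPs exceeding a request rate."""
    return {
        "Name": "block-high-rate-ips",  # hypothetical rule name
        "Priority": priority,
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit_per_5_minutes,  # requests per evaluation window, per IP
                "AggregateKeyType": "IP",
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "blockHighRateIps",  # hypothetical metric name
        },
    }
```

A rule like this would sit in the `Rules` list of a web ACL associated with the ALB, so abusive sources are blocked before requests reach the instances.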

117
MCQeasy

A DevOps engineer is investigating why an Amazon ECS service is not scaling out as expected. The service has a target tracking scaling policy based on average CPU utilization. The CloudWatch alarm shows that CPU utilization has exceeded the target for several minutes, but no scaling activity has occurred. What is the most likely cause?

A.The ECS service is configured with a minimum healthy percent that prevents scaling out.
B.The ECS service does not have an IAM role that allows it to call CloudWatch.
C.The CloudWatch alarm is configured with a period that is too long.
D.The scaling policy has a cooldown period that is still in effect from a previous scaling activity.
AnswerD

Cooldown periods prevent further scaling actions until the cooldown expires.

Why this answer

Option D is correct because target tracking scaling policies in Amazon ECS have a cooldown period (default 300 seconds) that prevents the policy from initiating additional scaling activities immediately after a previous scaling action. If a recent scaling activity occurred, the cooldown period would still be in effect, causing the policy to ignore the alarm even though CPU utilization has exceeded the target. This is the most likely reason no scaling activity is observed despite the alarm being triggered.

Exam trap

The trap here is that candidates often overlook the cooldown period and instead blame IAM permissions or alarm configuration, but the cooldown is a deliberate stabilization mechanism that directly explains why scaling is not occurring despite the alarm being active.

How to eliminate wrong answers

Option A is wrong because the minimum healthy percent parameter controls the minimum percentage of tasks that must remain healthy during deployments or scaling activities, but it does not prevent scaling out; it only affects how many tasks can be stopped or started at once. Option B is wrong because the ECS service does not need an IAM role to call CloudWatch; the scaling policy uses the service-linked role AWSServiceRoleForApplicationAutoScaling, which already has permissions to read CloudWatch alarms and metrics. Option C is wrong because a long CloudWatch alarm period would delay the alarm transition to ALARM state, but the question states the alarm has already exceeded the target for several minutes, meaning the period is not the issue.

118
MCQmedium

A company uses an AWS Elastic Load Balancer (ELB) to distribute traffic to EC2 instances. During an incident, some users report slow response times. The DevOps engineer suspects that one instance is unhealthy but the health check is not detecting it. What should the engineer do to improve health check accuracy?

A.Increase the health check interval to reduce load on instances
B.Configure a health check that checks a specific application endpoint
C.Decrease the health check unhealthy threshold to 2
D.Set the health check path to the root document ('/')
AnswerB

Deep health checks verify application availability.

Why this answer

Option B is correct because a deep health check (checking an application-specific endpoint) ensures the instance is truly healthy. Option A is incorrect because increasing the interval makes checks less frequent. Option C is incorrect because decreasing the unhealthy threshold only marks instances unhealthy sooner; it does not make the check verify application behavior.

Option D is incorrect because the health check path should be specific to the application.

119
MCQhard

A DevOps engineer updates an ECS service via CloudFormation. The stack update fails with the message 'Resource update cancelled'. The engineer notices that the ECS service's desired count was temporarily reduced during the update. What is the most likely cause of the failure?

A.The ECS service's minimum healthy percent was set to 100, causing the desired count reduction to zero to be rejected.
B.The ECS service's target group had an unhealthy instance that prevented the deregistration.
C.The ECS service deployment circuit breaker was triggered due to a timeout.
D.The ECS service did not have the required IAM role to call ecs:UpdateService.
AnswerA

CloudFormation reduces desired count to 0 during update; if min healthy percent is 100, it fails.

Why this answer

The error 'Resource update cancelled' occurs because CloudFormation detected that the ECS service update was not progressing as expected. When the minimum healthy percent is set to 100, the deployment process cannot reduce the desired count to zero (or below the current running count) without violating the requirement that 100% of the tasks remain healthy. This causes the update to be cancelled as CloudFormation waits indefinitely for the deployment to complete, eventually timing out and rolling back.

Exam trap

The trap here is that candidates often confuse the 'Resource update cancelled' error with a permissions or circuit breaker issue, but the key clue is the temporary reduction in desired count, which directly points to a minimum healthy percent constraint that prevents the service from scaling down to zero.

How to eliminate wrong answers

Option B is wrong because an unhealthy instance in the target group would cause health check failures and potential deployment issues, but it would not directly cause a 'Resource update cancelled' error with a temporary desired count reduction; the error message specifically points to a deployment configuration issue, not a target group health issue. Option C is wrong because the deployment circuit breaker is a feature that rolls back a deployment when it detects a failure (e.g., tasks failing to start), but it would not cause the desired count to be temporarily reduced; the circuit breaker triggers after a deployment failure, not as a cause of the count reduction. Option D is wrong because a missing IAM role for ecs:UpdateService would result in an authorization error (e.g., 'AccessDenied') during the update, not a 'Resource update cancelled' error with a temporary desired count reduction; the update would fail immediately with a permissions error, not after a partial state change.

120
MCQhard

A DevOps team is configuring an Auto Scaling group for a web application behind an Application Load Balancer. The team wants to automatically replace instances that fail the health check. Which scaling policy should be used?

A.Target tracking scaling policy
B.Default health check replacement
C.Step scaling policy
D.Manual scaling
AnswerB

Auto Scaling automatically terminates unhealthy instances and launches new ones based on the health check configuration.

Why this answer

Option B is correct because the default health check replacement automatically replaces unhealthy instances. Option A is wrong because target tracking policies adjust capacity based on metrics, not health. Option C is wrong because step scaling policies adjust capacity based on metric thresholds, not health.

Option D is wrong because manual scaling does not automate replacement.

121
MCQhard

A company experiences a security incident where an unauthorized user accessed an S3 bucket containing sensitive data. The DevOps team needs to identify the source IP address and user agent of the request. Which AWS service provides this information?

A.VPC Flow Logs
B.Amazon S3 server access logs
C.AWS CloudTrail
D.Amazon CloudWatch Logs
AnswerB

Contains source IP and user agent.

Why this answer

Option B is correct because S3 server access logs contain detailed information about each request, including source IP, user agent, and requester. Option C is wrong because CloudTrail does not log object-level S3 data events by default. Option D is wrong because CloudWatch Logs can store logs but does not generate them.

Option A is wrong because VPC Flow Logs capture network traffic but not application-level details like user agent.

122
Multi-Selecteasy

A company uses AWS CloudFormation to manage infrastructure. A stack update fails with a 'ROLLBACK_IN_PROGRESS' status. The DevOps engineer needs to investigate the failure. Which TWO actions should the engineer take?

Select 2 answers
A.Review AWS CloudTrail logs for the 'UpdateStack' API call.
B.Review the 'Stack Events' tab in the CloudFormation console to see the specific error messages.
C.Check the 'Rollback triggers' configuration for the stack.
D.Examine the stack's 'Template' and 'Parameters' to ensure they are correct.
E.Create a Change Set to see the proposed changes before re-attempting the update.
AnswersB, D

Stack events contain detailed status reasons for each resource, including failure reasons.

Why this answer

Option B is correct because the 'Stack Events' tab shows detailed error messages for each resource. Option D is correct because reviewing the stack's template and parameters helps identify misconfigurations. Option C is wrong because rollback triggers roll a stack back when a monitored CloudWatch alarm fires; they do not explain why a resource failed.

Option A is wrong because CloudTrail logs API calls, not template validation errors. Option E is wrong because Change Sets are used for reviewing changes before execution, not for debugging failed updates.

123
MCQeasy

A DevOps engineer is troubleshooting an Amazon CloudWatch alarm that is not triggering as expected. The alarm monitors an SQS queue's ApproximateNumberOfMessagesVisible metric with a threshold of 100 for 1 evaluation period. The queue has had over 100 messages for the past 30 minutes, but the alarm remains in OK state. What is the most likely cause?

A.The alarm period is set longer than 30 minutes.
B.The alarm is not configured to evaluate the metric correctly.
C.The SQS queue is not being polled.
D.Messages are being consumed but not deleted, causing them to become invisible.
AnswerD

Messages in flight are not counted as visible.

Why this answer

Option D is correct because messages that are received but not deleted are in flight and stay invisible until the visibility timeout expires; the ApproximateNumberOfMessagesVisible metric counts only messages that are available for retrieval, not those in flight. Option B is wrong because the alarm is configured to monitor the metric, so it would trigger if the metric crossed the threshold. Option A is wrong because the alarm period is likely set appropriately.

Option C is wrong because if the queue is not polled, messages remain in the queue and stay visible, so the metric would exceed the threshold.

124
Multi-Selectmedium

Which TWO actions should a DevOps engineer take to ensure that an Amazon RDS for PostgreSQL database is automatically recovered in the event of a failure?

Select 2 answers
A.Create a read replica in a different Availability Zone
B.Take manual snapshots every hour
C.Configure cross-region replication to a secondary region
D.Enable automated backups with a retention period of at least 1 day
E.Enable Multi-AZ deployment
AnswersD, E

Automated backups enable point-in-time recovery for manual restoration.

Why this answer

Options D and E are correct. Multi-AZ deployment (E) provides automatic failover, and automated backups (D) with point-in-time recovery allow restoring to a specific time. Option B is incorrect because manual snapshots require manual intervention.

Option A is incorrect because read replicas are for read scaling, not automatic failover. Option C is incorrect because cross-region replication is for disaster recovery, not automatic recovery.

125
MCQhard

A DevOps team uses AWS Systems Manager Incident Manager for incident response. They have an escalation plan that sends notifications to an SNS topic. However, during a recent incident, the on-call engineer did not receive the notification. The engineer's phone number and email are correct in the SSM Incident Manager contact settings. What is the MOST likely cause of the missed notification?

A.SSM Incident Manager requires a VPC endpoint for SNS, which is not configured.
B.The SNS topic's IAM policy does not allow Systems Manager to publish messages.
C.The SNS subscription for the engineer's contact was never confirmed.
D.The SNS topic and the engineer's contact are in different AWS Regions.
AnswerC

SNS requires subscription confirmation for email/HTTP endpoints; without confirmation, notifications are not sent.

Why this answer

Option C is correct because SNS topic subscriptions must be confirmed (typically via email) before delivery begins; an unconfirmed subscription will not deliver messages. Option B is wrong because the issue is at the subscription level, not in the topic's publish permissions. Option D is wrong because a Region mismatch is not the most likely cause; SNS can deliver to subscribers regardless of where the contact is located.

Option A is wrong because SSM Incident Manager does not require a VPC endpoint to send SNS notifications.

126
Multi-Selecthard

A company uses AWS CloudFormation to manage infrastructure. A stack update fails with the error: 'UPDATE_ROLLBACK_IN_PROGRESS'. The DevOps engineer needs to investigate the cause. Which THREE steps should the engineer take? (Choose THREE.)

Select 3 answers
A.Create a change set to see what changes were attempted.
B.Use the 'describe-stack-resource' AWS CLI command to get the resource status.
C.Review the CloudFormation console to identify which resource failed.
D.Use the '--retain-resources' option to preserve resources that failed to delete.
E.Check the CloudFormation stack events for error messages.
AnswersC, D, E

The console highlights the failed resource.

Why this answer

Options C, D, and E are correct. Option C: The CloudFormation console shows the specific resource that caused the failure. Option E: Stack events provide detailed status and error messages.

Option D: The '--retain-resources' parameter can preserve resources that failed to delete. Option A is wrong because change sets are used to preview changes before execution, not to troubleshoot failures. Option B is wrong because DescribeStackResource returns details for a specific resource but not the overall failure cause, so it is not a primary troubleshooting step.

127
MCQmedium

A company uses an Auto Scaling group with a dynamic scaling policy based on the average CPU utilization of the instances. During an incident, the DevOps team notices that the Auto Scaling group is not launching new instances quickly enough to handle a traffic spike. What is a possible cause for the slow scaling response?

A.The cooldown period is set too high.
B.The health check grace period is set too low.
C.The minimum group size is set too low.
D.The launch template has a long warm-up time.
AnswerA

A long cooldown can delay additional scaling activities after a previous one.

Why this answer

Option A is correct because the cooldown period prevents the Auto Scaling group from launching or terminating instances after a scaling activity. If the cooldown is too long, it delays subsequent scaling actions. Option B is wrong because the health check grace period is for instance health checks after launch, not scaling speed.

Option D is wrong because a launch template does not have a warm-up time parameter. Option C is wrong because the minimum group size is irrelevant to scaling speed.

128
MCQhard

A company runs a critical application on Amazon ECS with Fargate launch type. The application uses an Application Load Balancer (ALB) in front. During a load test, the team notices a sudden increase in 5xx errors from the ALB, and some tasks become unhealthy. The task logs show occasional 'OutOfMemoryError' exceptions. The task definition currently has 512 CPU units and 1024 MiB memory. What should the team do to mitigate the issue while maintaining a cost-effective approach?

A.Increase the task definition CPU to 1024 units and memory to 2048 MiB.
B.Increase the task definition memory to 2048 MiB while keeping CPU at 512 units.
C.Configure the ECS service to use a rolling update with a longer health check grace period.
D.Decrease the task definition memory to 512 MiB to force garbage collection more frequently.
AnswerB

This directly addresses the memory error without wasting resources on extra CPU.

Why this answer

Option B is correct because the application is experiencing OutOfMemoryError, indicating the current 1024 MiB memory allocation is insufficient. Increasing memory to 2048 MiB while keeping CPU at 512 units directly resolves the memory constraint without unnecessary CPU cost. ECS Fargate allows independent scaling of CPU and memory within valid combinations, and this change maintains a cost-effective approach by only increasing the resource that is actually constrained.
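The idea that CPU and memory can be changed independently, but only within valid combinations, can be sketched as a small lookup. The table below is an assumption based on the commonly documented Fargate task sizes up to 4 vCPU and should be verified against current AWS documentation; `is_valid_fargate_size` is a hypothetical helper, not an AWS API.

```python
# Assumed Fargate task sizes: CPU units -> allowed memory values in MiB.
# These values are an assumption; check the current AWS documentation.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """Return True if the CPU/memory pair appears in the assumed table."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu_units, [])


# The change described in the question: keep 512 CPU units, raise memory.
print(is_valid_fargate_size(512, 2048))
# Lowering memory to 512 MiB at 512 CPU units is not in the assumed table.
print(is_valid_fargate_size(512, 512))
```

Under these assumed values, raising memory to 2048 MiB at 512 CPU units stays within a valid task size, so no CPU increase is forced by the memory change.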

Exam trap

The trap here is that candidates may assume both CPU and memory must be increased together (Option A) or that a deployment strategy change (Option C) can mitigate resource exhaustion, when in fact the root cause is a memory limit that must be raised independently.

How to eliminate wrong answers

Option A is wrong because it increases both CPU and memory, which is unnecessary and more costly; the issue is memory, not CPU, and the extra CPU units would not resolve OutOfMemoryError. Option C is wrong because a rolling update with a longer health check grace period does not address the root cause of memory exhaustion; it only delays health check failures without fixing the underlying resource shortage. Option D is wrong because decreasing memory to 512 MiB would exacerbate the OutOfMemoryError, causing more frequent failures and task crashes, not improving garbage collection behavior.

129
MCQmedium

A company uses AWS Elastic Beanstalk for its web application. After a deployment, the environment health changes to 'Severe' and the application becomes unresponsive. The DevOps team needs to quickly revert to the previous working version. What is the FASTEST way to achieve this?

A.Use the Elastic Beanstalk console to deploy the previous application version.
B.Swap environment URLs with a different environment that runs the previous version.
C.Terminate the environment and create a new one with the previous version.
D.Redeploy the same application version to the environment.
AnswerA

Elastic Beanstalk allows rolling back to a previous version quickly.

Why this answer

Option A is correct because Elastic Beanstalk supports rolling back to a previous application version through the console or CLI. Option C is wrong because terminating and recreating takes longer and loses configuration. Option B is wrong because swapping URLs with a different environment requires that environment to exist.

Option D is wrong because redeploying the same version may not fix the issue if the environment configuration has changed.

130
MCQeasy

A company uses AWS CloudTrail to log API activity in their AWS account. The DevOps engineer needs to ensure that all management events are logged and that the logs are delivered to an S3 bucket in another account for centralized auditing. The engineer has already created an S3 bucket in the central auditing account and applied a bucket policy that grants the CloudTrail service permission to write logs. However, logs are not being delivered. The engineer verifies that the CloudTrail trail is configured to point to the correct S3 bucket name and that the bucket exists. What is the MOST likely reason the logs are not being delivered?

A.The S3 bucket policy in the central account does not include the 's3:PutObject' action for the CloudTrail service.
B.The S3 bucket must have ACLs enabled to allow cross-account writes.
C.The source account does not have a VPC peering connection to the central auditing account.
D.The CloudTrail trail must be created in the central auditing account, not the source account.
AnswerA

The bucket policy must explicitly allow the CloudTrail service to put objects; otherwise, the service cannot write.

Why this answer

Option A is correct because the S3 bucket policy must explicitly grant the CloudTrail service the 's3:PutObject' permission for the bucket, not just allow write access generically. Option C is wrong because CloudTrail can deliver logs cross-account without a VPC peering connection. Option D is wrong because the trail can stay in the source account; what matters is the bucket policy on the destination bucket.

Option B is wrong because the bucket is in another account, so the source account cannot set ACLs; the bucket policy is the correct mechanism.
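The bucket policy requirement can be illustrated with a minimal sketch that builds the policy document as a Python dictionary. The bucket name and account ID are placeholders, `cloudtrail_bucket_policy` is a hypothetical helper, and the statement layout follows the commonly documented CloudTrail pattern; treat it as an assumption to compare against your own trail configuration.

```python
import json


def cloudtrail_bucket_policy(bucket_name: str, source_account_id: str) -> dict:
    """Build a bucket policy letting CloudTrail write logs for one account."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # CloudTrail checks the bucket ACL before delivering logs.
                "Sid": "AWSCloudTrailAclCheck",
                "Effect": "Allow",
                "Principal": {"Service": "cloudtrail.amazonaws.com"},
                "Action": "s3:GetBucketAcl",
                "Resource": bucket_arn,
            },
            {
                # Without this s3:PutObject grant, log delivery fails.
                "Sid": "AWSCloudTrailWrite",
                "Effect": "Allow",
                "Principal": {"Service": "cloudtrail.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/AWSLogs/{source_account_id}/*",
                "Condition": {
                    "StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}
                },
            },
        ],
    }


# Placeholder values for illustration only.
policy = cloudtrail_bucket_policy("example-central-audit-logs", "111111111111")
print(json.dumps(policy, indent=2))
```

The resource path includes the source account ID, so each account that delivers to the central bucket needs its own prefix covered by the policy.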

131
MCQeasy

A DevOps engineer is responsible for monitoring a production environment that uses Amazon EC2 Auto Scaling. The engineer notices that the Auto Scaling group has been launching and terminating instances frequently over the past hour. The group uses a dynamic scaling policy based on average CPU utilization. The CloudWatch alarm that triggers scaling is set to a threshold of 70% CPU for scale-out and 30% for scale-in. The engineer checks the CloudWatch metrics and sees that CPU utilization is oscillating between 40% and 60%, never reaching the thresholds. The engineer suspects that the scaling policy is not working correctly. Which action should the engineer take?

A.Disable the scale-in policy to prevent frequent terminations.
B.Change the scaling policy to use a target tracking policy with a target value of 50% CPU utilization.
C.Use a simple scaling policy instead of a dynamic scaling policy.
D.Increase the cooldown period for the scaling policy to 300 seconds.
AnswerB

Target tracking maintains a steady CPU level, reducing oscillations.

Why this answer

Option B is correct because a target tracking policy automatically adjusts the desired capacity to maintain the target value, smoothing out oscillations. Option D is wrong because increasing the cooldown may help but does not address the root cause of the oscillations; target tracking is more effective. Option A is wrong because disabling scale-in could lead to over-provisioning.

Option C is wrong because simple scaling policies are less responsive and can also cause oscillations.

132
MCQhard

An application running on Amazon ECS Fargate is experiencing intermittent HTTP 503 errors from the Application Load Balancer (ALB). The target group health checks are passing. Which configuration is MOST likely causing this issue?

A.The deregistration delay is set too short, causing connections to be closed before requests complete.
B.The ALB's slow start duration is too long, causing requests to be dropped.
C.The health check interval is set too low, causing targets to be marked unhealthy prematurely.
D.The ALB's circuit breaker is tripping due to high error rates.
AnswerA

If delay is too short, in-flight requests may fail with 503 when a target is deregistered.

Why this answer

Option A is correct because a deregistration delay that is too short can close connections to a deregistering target before in-flight requests complete, resulting in 503 errors. Option C is wrong because a health check interval that is too short would cause flapping, not 503 errors while health checks are passing. Option B is wrong because slow start only affects new targets.

Option D is wrong because a circuit breaker does not cause 503s directly.

133
MCQmedium

A company uses AWS CloudFormation to deploy infrastructure. During an incident, a stack update fails with a stack rollback. The engineer needs to prevent the stack from rolling back on future failures and instead retain the resources for debugging. Which CloudFormation feature should the engineer use?

A.Enable drift detection on the stack
B.Use the '--disable-rollback' option when updating the stack
C.Use AWS CloudFormation StackSets to deploy the stack
D.Create a change set before updating instead of direct update
AnswerB

This retains resources for debugging.

Why this answer

The `--disable-rollback` option (the `DisableRollback` parameter of the CloudFormation API) instructs AWS CloudFormation to leave the stack in its current state, with the failed resources intact, instead of automatically rolling back to the last known good state. This allows engineers to retain the resources for debugging after a failure.

Exam trap

The trap here is that candidates often confuse change sets (which preview changes) with the ability to prevent rollback, or mistakenly think drift detection or StackSets can alter rollback behavior, when only the `--disable-rollback` flag directly controls whether resources are retained on failure.

How to eliminate wrong answers

Option A is wrong because drift detection only identifies differences between the stack's actual deployed resources and the expected template configuration; it does not prevent rollback or retain resources after a failed update. Option C is wrong because StackSets are used to deploy stacks across multiple accounts and regions, not to control rollback behavior on a single stack update failure. Option D is wrong because a change set allows you to preview changes before updating, but it does not affect the rollback behavior; if the update fails, the stack will still roll back by default unless `--disable-rollback` is specified.

134
MCQhard

During an incident, a DevOps engineer needs to quickly revoke access to a set of IAM users who are suspected to be compromised. The users have programmatic access keys and console passwords. The engineer wants to minimize the impact on non-compromised users. Which action should the engineer take FIRST?

A.Delete the compromised IAM users.
B.Attach an IAM policy that explicitly denies all actions to the compromised users.
C.Delete the access keys of the compromised users.
D.Change the IAM password policy to require strong passwords.
AnswerB

This immediately revokes all permissions while preserving the user objects for investigation.

Why this answer

Option B is correct because attaching an IAM policy that explicitly denies all actions to the users effectively revokes their permissions without deleting their credentials, which is reversible and allows investigation. Option C is wrong because deleting access keys would not disable console access. Option D is wrong because changing the password policy would affect all users, not only the compromised ones.

Option A is wrong because deleting the users is irreversible and would lose audit trails.
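A deny-all policy of the kind described here is short enough to show in full. The sketch below only builds the policy document; the policy name and the way it is attached are assumptions, and the commented call shows one plausible way to attach it with boto3 rather than a prescribed procedure.

```python
import json

# An explicit Deny on every action and resource overrides any Allow
# the user has from other policies or group memberships.
DENY_ALL_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyEverything",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
        }
    ],
}


def deny_all_policy_document() -> str:
    """Serialize the deny-all policy to the JSON string IAM expects."""
    return json.dumps(DENY_ALL_POLICY)


print(deny_all_policy_document())

# Assumed usage with boto3 (not executed here; user and policy names
# are hypothetical):
#   iam = boto3.client("iam")
#   iam.put_user_policy(
#       UserName="suspected-user",
#       PolicyName="IncidentDenyAll",
#       PolicyDocument=deny_all_policy_document(),
#   )
```

Because the policy is attached only to the suspected users, other users are unaffected, and removing the inline policy later restores the original permissions.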

135
MCQeasy

A company's DevOps team uses AWS Config to monitor resource compliance. They have created a custom AWS Config rule that triggers an AWS Lambda function to evaluate whether EC2 instances have the 'Environment' tag with value 'Production' or 'Staging'. The rule is set to evaluate resources on configuration changes. However, the team notices that the rule does not trigger when an EC2 instance is launched. The Lambda function's IAM role has the necessary permissions to describe EC2 instances. The CloudWatch Logs for the Lambda function show that it is not being invoked. What is the MOST likely reason?

A.The Lambda function's IAM role does not have permission to write to CloudWatch Logs.
B.The AWS Config rule is set to evaluate resources periodically, not on configuration changes.
C.The AWS Config rule is not configured to trigger on AWS::EC2::Instance resources.
D.The custom rule must be deployed using AWS CloudFormation to be active.
AnswerC

The rule's scope must include the resource type; otherwise, Config will not evaluate EC2 instances.

Why this answer

Option C is correct because AWS Config rules need to be associated with the specific resource type (AWS::EC2::Instance) in the rule's scope; without this, the rule won't trigger on EC2 instance changes. Option A is wrong because the Lambda function's permissions are not the problem; the issue occurs before invocation. Option B is wrong because the rule is already set to evaluate on configuration changes, so it is event-based rather than periodic.

Option D is wrong because a custom rule does not need to be deployed with AWS CloudFormation to be active.

136
MCQeasy

A DevOps engineer notices that an Auto Scaling group is repeatedly launching and terminating instances. CloudWatch alarms show high CPU but the group's metrics are erratic. What is the most likely cause?

A.The Auto Scaling group is not associated with a load balancer.
B.The launch template user data is causing instances to fail during boot.
C.The health check grace period is too short and the health check type is ELB, but the load balancer health check is failing.
D.The AMI used in the launch template is not properly configured.
AnswerC

Short grace period and failing ELB health checks cause instances to be terminated and replaced.

Why this answer

Option C is correct because incorrect health check configuration can cause the Auto Scaling group to continuously replace instances that are actually healthy, leading to the described behavior. Option D is wrong because a misconfigured AMI would cause launch failures, not repeated cycling. Option A is wrong because load balancer health checks determine instance health only when a load balancer is attached; without one, instances are not marked unhealthy by ELB checks.

Option B is wrong because launch template user data would not cause this cycling.

137
MCQmedium

An application running on Amazon EC2 instances in an Auto Scaling group is experiencing intermittent connectivity issues. The DevOps team suspects a security group configuration problem. Which approach should the team use to analyze security group traffic and identify denied requests?

A.Use AWS Config to review security group rules
B.Check AWS CloudTrail for security group modification events
C.Enable AWS Security Hub and review the security findings
D.Enable VPC Flow Logs and query Amazon Athena
AnswerD

Flow Logs capture traffic; Athena can query to find denied connections.

Why this answer

Option D is correct because VPC Flow Logs can capture all traffic (accepted and rejected) and be analyzed to find denied connections. Option C is wrong because Security Hub aggregates security findings but does not log traffic. Option B is wrong because CloudTrail logs API calls, not network traffic.

Option A is wrong because Config records resource configurations, not traffic.
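Finding denied connections in flow logs comes down to filtering records on the action field. The sketch below parses records in the default (version 2) flow log format locally and keeps those marked REJECT, which is the same filter an Athena query would apply with a condition on the action column. The sample records are invented for illustration, and `parse_record` and `rejected` are hypothetical helper names.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> dict:
    """Split one space-separated flow log record into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def rejected(lines: list) -> list:
    """Return only the records whose action is REJECT."""
    return [rec for rec in map(parse_record, lines) if rec["action"] == "REJECT"]


# Invented sample records (documentation IP ranges, placeholder IDs).
SAMPLE = [
    "2 123456789012 eni-0abc 203.0.113.12 10.0.1.5 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc 198.51.100.7 10.0.1.5 49153 22 6 3 180 1700000000 1700000060 REJECT OK",
]

for rec in rejected(SAMPLE):
    print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"])
```

A REJECT record shows that a security group or network ACL blocked the traffic, so the source address and destination port point to the rule that needs review.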

138
Multi-Selecteasy

A DevOps engineer is troubleshooting an AWS CodeDeploy deployment that failed. Which TWO resources should the engineer examine to identify the cause of the failure? (Choose two.)

Select 2 answers
A.EC2 instance system logs
B.CloudWatch Logs for CodeDeploy
C.S3 access logs
D.CloudTrail logs
E.CodeDeploy deployment group configuration
AnswersB, E

Contains deployment events and error messages.

Why this answer

Option B is correct because CodeDeploy logs about deployment lifecycle events (e.g., BeforeInstall, ApplicationStop) can be sent to CloudWatch Logs. These logs contain error messages, script output, and status codes that directly indicate why a deployment step failed, such as a permission issue or a script syntax error. Examining CloudWatch Logs for CodeDeploy is the primary method to diagnose deployment failures. Option E is also correct because the deployment group configuration determines which instances are targeted and how the deployment proceeds, so a misconfiguration there can cause the failure.

Exam trap

The trap here is that candidates often confuse CloudTrail (API auditing) with CloudWatch Logs (application-level logging), or they mistakenly think EC2 system logs are relevant for application deployment failures, when in fact CodeDeploy-specific logs are the correct source.

139
MCQmedium

A company uses AWS Lambda functions to process incoming events from Amazon S3. The operations team notices that some events are not being processed, and there is no error in the Lambda function logs. What is the most likely cause?

A.The Lambda function has reserved concurrency set to a low value, causing throttling.
B.The S3 event notification is configured to send to an SNS topic that is not subscribed to the Lambda function.
C.The S3 bucket policy does not allow the Lambda function to be invoked.
D.The Lambda function has a timeout that is too short.
AnswerA

Throttled events are not logged in the function's CloudWatch Logs because the function is not invoked.

Why this answer

When a Lambda function has reserved concurrency set to a low value, it limits the number of concurrent executions allowed for that function. If incoming S3 events exceed this limit, the Lambda service throttles the invocations, causing some events to be silently dropped without generating errors in the function logs because the function never actually runs. This matches the symptom of missing events with no error logs.

Exam trap

The trap here is that candidates often assume missing events are due to permission or timeout errors, but the absence of any error logs points to throttling, where the function is never invoked and thus no logs are generated.

How to eliminate wrong answers

Option B is wrong because if the SNS topic is not subscribed to the Lambda function, the event would never reach Lambda, but the question states the Lambda function logs show no errors, implying the function is invoked for some events; the issue is about events not being processed, not about delivery failure. Option C is wrong because if the S3 bucket policy did not allow Lambda invocation, the invocation would fail with an access denied error, which would be logged in CloudTrail or appear as an error in the Lambda logs, contradicting the 'no error' condition. Option D is wrong because a timeout that is too short would cause the function to fail mid-execution and generate a timeout error in the Lambda logs, not silently drop events without any log entries.

140
MCQhard

A company uses Amazon CloudWatch Logs to collect application logs from EC2 instances. The security team requires that log data be encrypted at rest using a customer-managed AWS KMS key. The logs are currently being delivered, but they are not encrypted. What is the most likely reason?

A.The IAM role for the EC2 instance does not have kms:Encrypt permission
B.The CloudWatch Logs agent is not configured to encrypt logs
C.The KMS key is disabled
D.The KMS key policy does not allow the CloudWatch Logs service principal
AnswerA

The role needs permission to encrypt using the CMK.

Why this answer

Option A is correct because the role must have kms:Encrypt permission to use the CMK. Option D is wrong because KMS key policies can delegate to IAM policies, but the IAM policies must still grant the permissions. Option C is wrong because the CMK must be enabled, but that is not the typical issue.

Option B is wrong because encryption is applied at the log group level, not at the agent level.

141
MCQmedium

A company uses AWS CodePipeline for CI/CD. During a production deployment, the pipeline fails at the 'Deploy' stage with an error: 'The deployment failed because the deployment group does not have enough capacity to handle the deployment.' The engineer checks the CodeDeploy deployment group and sees that it is configured with a minimum healthy hosts of 100% and a deployment configuration of 'CodeDeployDefault.OneAtATime'. What is the MOST likely cause?

A.The deployment configuration 'OneAtATime' is not compatible with the deployment group.
B.The target group health check is misconfigured, causing all instances to be unhealthy.
C.The CodeDeploy agent on the instances is not running.
D.The deployment group has only one instance, and the minimum healthy hosts setting prevents the deployment.
AnswerD

With one instance and min healthy hosts 100%, taking that instance out of service violates the constraint.

Why this answer

Option D is correct because with a minimum healthy hosts of 100%, CodeDeploy requires that all hosts remain healthy during deployment. The OneAtATime configuration updates only one host at a time, but if the deployment group has only one instance, taking it out of service violates the minimum healthy hosts. Option B is wrong because the error is about capacity, not target group health.

Option C is wrong because CodeDeploy agent issues would cause a different error. Option A is wrong because the deployment configuration is correct for rolling updates.
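The capacity constraint is simple arithmetic: the hosts left in service during a batch must be at least the required healthy count. The sketch below is a simplified model of that check, assuming the required count is the percentage rounded up; the exact rounding and health evaluation that CodeDeploy applies may differ, and `can_take_hosts_offline` is a hypothetical helper.

```python
import math


def can_take_hosts_offline(total_hosts: int, min_healthy_percent: float,
                           batch_size: int = 1) -> bool:
    """Check whether a batch can go offline without breaking the minimum.

    Simplified model: required healthy hosts is the percentage of the
    fleet, rounded up (an assumption about the rounding rule).
    """
    required_healthy = math.ceil(total_hosts * min_healthy_percent / 100)
    return total_hosts - batch_size >= required_healthy


# One instance with a 100% minimum: nothing can be taken out of service.
print(can_take_hosts_offline(total_hosts=1, min_healthy_percent=100))
# Four instances with a 75% minimum: one at a time is allowed.
print(can_take_hosts_offline(total_hosts=4, min_healthy_percent=75))
```

In this model a 100% minimum blocks every in-place batch regardless of fleet size, which is why the setting has to be lowered or capacity added before the deployment can proceed.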

142
MCQhard

A company uses AWS Lambda functions that are triggered by S3 events (object creation). The Lambda function processes the file and stores results in DynamoDB. Recently, the function started timing out after 15 seconds, causing some files to not be processed. The average file size has increased significantly. The DevOps engineer increases the Lambda function's timeout to 30 seconds and the memory to 512 MB, but the function still times out for large files. The CloudWatch Logs show that the timeout occurs during the 'dynamodb.put_item' call for a large item. The DynamoDB table's write capacity is set to on-demand, and there are no throttling errors. What should the engineer do to resolve the timeout issue?

A.Modify the Lambda function to split the large item into multiple smaller items before writing to DynamoDB.
B.Configure the Lambda function to write to an SQS queue first, then have another Lambda process the queue.
C.Mount an EFS file system to the Lambda function and write the large item to a file instead of DynamoDB.
D.Increase the Lambda function's timeout to 5 minutes and memory to 1024 MB.
AnswerA

DynamoDB has a 400 KB item size limit; splitting the item avoids the timeout.

Why this answer

Option A is correct because DynamoDB limits each item to 400 KB; splitting the large item into smaller items reduces the payload of each write. Option D is wrong because increasing memory and timeout further may not help if the DynamoDB API call itself times out due to the large item size. Option B is wrong because the issue is not related to SQS; queuing the work does not shrink the item.

Option C is wrong because the Lambda function is not using EFS, and the results need to be stored in DynamoDB.
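Splitting a large item can be sketched as chunking the payload so that each piece stays under the 400 KB item limit. The key names (`pk`, `sk`), the chunk size, and the `split_payload` helper are all assumptions made for illustration; the chunk size leaves headroom because attribute names and other attributes also count toward the item size.

```python
MAX_ITEM_BYTES = 400 * 1024  # DynamoDB item size limit


def split_payload(item_id: str, payload: bytes,
                  chunk_bytes: int = 350 * 1024) -> list:
    """Split a payload into chunk items that each fit in one DynamoDB item.

    Key names and chunk size are hypothetical; headroom is left below
    the limit for keys and other attributes.
    """
    if chunk_bytes <= 0 or chunk_bytes >= MAX_ITEM_BYTES:
        raise ValueError("chunk_bytes must be positive and below the item limit")
    chunks = [payload[i:i + chunk_bytes]
              for i in range(0, len(payload), chunk_bytes)] or [b""]
    return [
        {
            "pk": item_id,                 # same partition key for all chunks
            "sk": f"chunk#{index:05d}",    # sort key preserves chunk order
            "total_chunks": len(chunks),
            "data": chunk,
        }
        for index, chunk in enumerate(chunks)
    ]


parts = split_payload("file-1", b"x" * (1000 * 1024))
print(len(parts), [len(p["data"]) for p in parts])
```

Each returned dictionary would then be written as its own item, and a reader reassembles the payload by querying the partition key and concatenating the chunks in sort-key order.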

143
MCQeasy

A company runs a critical Amazon RDS for PostgreSQL database. The database suddenly becomes unresponsive. The DevOps team checks CloudWatch metrics and notices that the 'DatabaseConnections' metric spiked to the maximum limit. What is the MOST likely cause and immediate action?

A.There is a network connectivity issue; check the VPC settings.
B.The database storage is full; increase the allocated storage.
C.The application has a connection leak; restart the database to clear connections.
D.The database CPU is at 100%; scale up the instance class.
AnswerC

A connection leak causes connections to remain open, hitting the max. Restarting clears them.

Why this answer

Option C is correct because a sudden spike to max connections indicates that the application might have a connection leak, causing connections to accumulate. The immediate action is to restart the database to clear all connections and then investigate the application. Option A is wrong because a network issue would affect connectivity, not cause max connections.

Option B is wrong because storage full would cause 'FreeStorageSpace' to be low, not necessarily max connections. Option D is wrong because high CPU would be shown by the 'CPUUtilization' metric, not directly by connections.
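To make the idea of a connection leak concrete, here is a small sketch using a toy pool. `TinyPool` is a hypothetical stand-in for the database's connection limit, not a real driver or RDS API; it only shows why connections that are acquired but never released eventually hit the cap, and how a context manager avoids that.

```python
from contextlib import contextmanager


class TinyPool:
    """Toy connection pool with a hard cap, standing in for the max limit."""

    def __init__(self, max_connections: int) -> None:
        self.max_connections = max_connections
        self.open_connections = 0

    def acquire(self) -> None:
        """Open one connection, failing once the cap is reached."""
        if self.open_connections >= self.max_connections:
            raise RuntimeError("too many connections")
        self.open_connections += 1

    def release(self) -> None:
        """Close one connection."""
        self.open_connections -= 1

    @contextmanager
    def connection(self):
        """Acquire a connection and always release it, even on error."""
        self.acquire()
        try:
            yield self
        finally:
            self.release()
```

Code that calls `acquire()` without a matching `release()` exhausts the pool after `max_connections` calls, whereas code that uses `with pool.connection():` returns the count to zero after every request.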

144
MCQeasy

A DevOps engineer notices that an Amazon RDS for MySQL instance has failed over to a standby replica. The engineer needs to identify the root cause by examining metrics. Which AWS service should the engineer use to view the database load, replication lag, and failover events?

A.AWS CloudTrail
B.Amazon CloudWatch
C.VPC Flow Logs
D.AWS Trusted Advisor
AnswerB

CloudWatch monitors RDS metrics including load and replication lag.

Why this answer

Option B is correct because Amazon RDS integrates with CloudWatch to monitor metrics like DatabaseConnections, ReplicaLag, and failover events. Option A is wrong because CloudTrail records API calls but not database-level metrics. Option C is wrong because VPC Flow Logs capture network traffic, not database metrics.

Option D is wrong because Trusted Advisor provides best-practice checks, not real-time metrics.

145
MCQhard

A company uses an Application Load Balancer (ALB) in front of a fleet of EC2 instances. The security team reports that a specific client IP address is sending malicious requests and must be blocked immediately. The ALB's security group only allows HTTP/HTTPS from 0.0.0.0/0. What is the FASTEST way to block traffic from this IP address without affecting other traffic?

A.Create an AWS WAF web ACL with an IP set deny rule and associate it with the ALB.
B.Modify the ALB listener rules to drop requests from the client IP.
C.Update the ALB security group to add a deny rule for the client IP address.
D.Update the VPC route table to drop packets from the client IP.
AnswerA

AWS WAF can block requests based on source IP quickly.

Why this answer

Option A is correct because AWS WAF can block requests based on source IP address immediately, without changing network infrastructure: create a web ACL with an IP set deny rule and associate it with the ALB. A network ACL on the ALB's subnets could also deny the address, but that approach is not listed.

Option B is wrong because modifying the ALB's listener rules does not block traffic; it only affects routing. Option C is wrong because security groups are stateful and support only allow rules, so a deny rule cannot be added.

Option D is wrong because updating route tables would affect all traffic to the ALB, not just that IP.
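The logic of an IP-based deny rule with a default allow can be illustrated with Python's standard `ipaddress` module. This is a sketch of the matching behavior only, not the AWS WAF API; the function name `is_blocked` and the example addresses (taken from documentation ranges) are assumptions.

```python
import ipaddress
from typing import Iterable


def is_blocked(client_ip: str, deny_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any CIDR in the deny list.

    Anything that does not match is allowed, mirroring a deny rule
    layered on top of a default-allow policy.
    """
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in deny_cidrs)


# A single address is expressed as a /32 network.
DENY_LIST = ["203.0.113.7/32"]
```

With this deny list, `is_blocked("203.0.113.7", DENY_LIST)` is true while every other client address still passes, which is the "without affecting other traffic" property the question asks for.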

146
MCQhard

A company runs a critical application on an Amazon RDS for MySQL DB instance. During a recent incident, the database became unresponsive. The DevOps team suspects that a long-running query is blocking other operations. Which metric should they monitor in Amazon CloudWatch to detect this type of issue?

A.DatabaseConnections
B.ActiveTransactions (from Enhanced Monitoring)
C.ReadLatency
D.CPUUtilization
AnswerB

High active transactions or long-running ones can cause blocking.

Why this answer

Option B is correct because the 'Maximum UsedTransactionIDs' metric (or the number of active transactions) can indicate long-running transactions that hold locks. However, for MySQL, the relevant metric is 'ActiveTransactions' or 'OldestActiveTransaction' (via Enhanced Monitoring). Among the options, 'ActiveTransactions' is the most direct.

Option A is wrong because 'DatabaseConnections' shows total connections, not blocking. Option C is wrong because 'ReadLatency' could be affected but is not a direct indicator of blocking. Option D is wrong because 'CPUUtilization' may be high but is not specific to blocking.

147
MCQmedium

A company's production RDS MySQL instance experienced a failover. The DevOps team needs to understand the root cause. Which set of logs should be reviewed first?

A.VPC Flow Logs
B.RDS MySQL error logs
C.AWS CloudTrail logs
D.RDS slow query logs
AnswerC

CloudTrail records RDS API calls including failover events.

Why this answer

Option C is correct because RDS events (failover, maintenance) are logged in CloudTrail. Option A is wrong because VPC Flow Logs capture network traffic, not database events. Option B is wrong because error logs may show errors but not the failover trigger.

Option D is wrong because slow query logs do not record failover events.

148
MCQmedium

A company's DevOps team uses AWS CodePipeline to automate deployments. A recent pipeline execution failed at the 'Deploy' stage. The engineer needs to view the detailed logs for the failed action. Which AWS service or feature should the engineer use?

A.CloudWatch Logs
B.S3 access logs
C.CodeBuild logs
D.CloudTrail
AnswerA

CodePipeline logs execution details to CloudWatch Logs.

Why this answer

AWS CodePipeline integrates with Amazon CloudWatch Logs to capture and store detailed execution logs for each pipeline action, including the 'Deploy' stage. When a deployment action fails, the engineer can view the associated logs directly from the CodePipeline console or via the CloudWatch Logs console, which provides granular error messages, timestamps, and stack traces necessary for troubleshooting. This is the designated service for accessing action-level logs in CodePipeline.

Exam trap

The trap here is that candidates may confuse CloudTrail (audit logs) with CloudWatch Logs (operational logs), or assume that CodeBuild logs cover all pipeline stages, when in fact each stage type (e.g., Deploy) has its own log destination.

How to eliminate wrong answers

Option B (S3 access logs) is wrong because S3 access logs record requests made to an S3 bucket, not the execution logs of CodePipeline actions. Option C (CodeBuild logs) is wrong because CodeBuild logs are specific to build actions within CodePipeline, not to deploy actions, which may use other providers like CodeDeploy or Elastic Beanstalk. Option D (CloudTrail) is wrong because CloudTrail records API calls made to AWS services for auditing purposes, not the detailed runtime logs of a pipeline execution.

149
MCQeasy

A company uses AWS CloudFormation to deploy infrastructure. A recent stack update failed, and the engineer needs to roll back to the previous stable state. Which CloudFormation feature should the engineer use?

A.Use AWS CloudFormation Drift Detection.
B.Use AWS CloudFormation StackSets.
C.Use the 'Rollback' action in the CloudFormation console.
D.Create a Change Set and execute it.
AnswerC

CloudFormation supports rolling back a failed stack update.

Why this answer

Option C is correct because CloudFormation provides a built-in 'Rollback' action that automatically reverts a stack to its last known stable state when a stack update fails. This rollback occurs by default on failure, but if the engineer needs to manually trigger it after a failed update, they can use the 'Rollback' action in the CloudFormation console or API (e.g., `aws cloudformation rollback-stack`). This ensures the infrastructure returns to the previous working configuration without manual intervention.

Exam trap

The trap here is that candidates may confuse the 'Rollback' action with creating a Change Set or using Drift Detection, thinking they need to manually define the previous state, when in fact CloudFormation automatically preserves the previous stable state for rollback.

How to eliminate wrong answers

Option A is wrong because Drift Detection is used to identify whether a stack's actual resources have deviated from the expected template configuration, not to roll back a failed update. Option B is wrong because StackSets are designed to deploy stacks across multiple accounts and regions, not to handle rollback of a single stack update. Option D is wrong because creating and executing a Change Set is a method to propose and apply changes to a stack, but it does not automatically revert to the previous state; it would require the engineer to manually define the previous template and parameters, which is less efficient than using the built-in rollback feature.

150
MCQmedium

A company has an AWS Lambda function that processes S3 events. The function is invoked multiple times for the same S3 object, causing duplicate processing. The engineer suspects the issue is related to retries from the S3 event notification or Lambda's built-in retry behavior. What is the MOST effective way to ensure idempotent processing?

A.Modify the S3 bucket event notification configuration to use a prefix filter that excludes duplicate objects.
B.Use a DynamoDB table to store a record of processed S3 object keys and check for existence before processing.
C.Set the Lambda function's ReservedConcurrency to 1 to prevent concurrent executions.
D.Use an Amazon SQS FIFO queue as the event source and enable content-based deduplication.
AnswerB

This pattern ensures idempotency by tracking processed objects.

Why this answer

Option B is correct because storing processed S3 object keys in a DynamoDB table and checking for existence before processing ensures idempotency at the application level. This approach directly handles duplicate invocations caused by S3 event retries or Lambda's built-in retry behavior, as the function can conditionally skip processing if the key already exists in DynamoDB. It provides a durable, consistent, and scalable mechanism to prevent duplicate processing regardless of how many times the function is invoked for the same object.

Exam trap

The trap here is that candidates often confuse concurrency control (ReservedConcurrency) with idempotency, or assume SQS FIFO deduplication is a drop-in solution without realizing S3 cannot directly send events to FIFO queues.

How to eliminate wrong answers

Option A is wrong because S3 prefix filters only filter events based on object key prefixes or suffixes, not on duplicate detection; they cannot prevent multiple notifications for the same object. Option C is wrong because setting ReservedConcurrency to 1 prevents concurrent executions but does not prevent sequential duplicate invocations from retries; the function could still be invoked multiple times for the same object in sequence. Option D is wrong because using an SQS FIFO queue with content-based deduplication would require the S3 event notification to be sent to the queue, but S3 does not natively support sending events to SQS FIFO queues; it only supports standard SQS queues, and even if it did, the deduplication window is only 5 minutes, which may not cover all retry scenarios.
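The check-before-processing pattern from the correct answer might look roughly like this. The sketch uses an in-memory class, `ProcessedKeyStore`, as a hypothetical stand-in for the DynamoDB table so that it runs without AWS access; the names `claim` and `handle_event` are likewise assumptions.

```python
from typing import Callable, Set


class ProcessedKeyStore:
    """In-memory stand-in for the DynamoDB table of processed object keys."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()

    def claim(self, key: str) -> bool:
        """Record the key; return True only the first time it is seen.

        With DynamoDB this would typically be one conditional write
        (PutItem with an attribute_not_exists condition on the key
        attribute), so the check and the write happen atomically.
        """
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


def handle_event(store: ProcessedKeyStore, bucket: str, object_key: str,
                 process: Callable[[str, str], None]) -> bool:
    """Process an S3 object once; skip repeat invocations for the same key.

    Returns True if the object was processed, False if it was a duplicate.
    """
    if not store.claim(f"{bucket}/{object_key}"):
        return False
    process(bucket, object_key)
    return True
```

A fuller version would also need to decide what happens when processing fails after the key has been claimed, for example by removing the record so that a retry can succeed.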
