A company runs a containerized application on Amazon ECS with the Fargate launch type. The application consists of three microservices: frontend, backend, and database. The ECS cluster is in a VPC with public and private subnets. The frontend service is publicly accessible via an Application Load Balancer (ALB) in public subnets. The backend service communicates with the database service, which runs as a stateful service with persistent storage using Amazon EFS. The DevOps team is using CloudWatch Container Insights and has enabled Prometheus metrics for the ECS cluster.

Recently, the team observed that the frontend service's response time has increased significantly, and some requests are timing out. The team checked the ALB metrics and saw an increase in 5xx errors. They also noticed that the backend service's CPU utilization is high, and the database service's disk I/O is high. The team suspects a bottleneck in the backend service.

Which course of action should the team take FIRST to identify the root cause?
The backend service's application logs will help pinpoint the issue.
Why this answer
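Option B is correct. The team should first analyze the backend service's application logs for errors and slow operations. The high backend CPU utilization and high database disk I/O are symptoms rather than causes: they could result from inefficient queries or code, and the logs can show which operations are responsible.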
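As a rough illustration of that first step, the sketch below runs a CloudWatch Logs Insights query for error-like messages in the backend's logs. It assumes the backend task ships its logs to CloudWatch Logs, which the scenario does not state, and the log group name `/ecs/backend` and the filter pattern are hypothetical. The client is passed in, so the function works with any object exposing the `start_query` and `get_query_results` calls of the boto3 `logs` client.

```python
import time

# Hypothetical log group name; the scenario does not name one.
LOG_GROUP = "/ecs/backend"

# Logs Insights query: most recent messages that look like errors or timeouts.
# The pattern is an assumption about how the backend words its log lines.
ERROR_QUERY = """fields @timestamp, @message
| filter @message like /(?i)(error|exception|timeout)/
| sort @timestamp desc
| limit 50"""


def run_insights_query(logs_client, log_group, query,
                       minutes=60, poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query over the last `minutes` and return (status, rows)."""
    end = int(time.time())
    start = end - minutes * 60

    # start_query is asynchronous: it returns an ID that is polled for results.
    started = logs_client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=query,
    )
    query_id = started["queryId"]

    status, results = "Unknown", []
    for _ in range(max_polls):
        resp = logs_client.get_query_results(queryId=query_id)
        status, results = resp["status"], resp.get("results", [])
        if status not in ("Scheduled", "Running"):
            break  # finished, failed, cancelled, or timed out
        time.sleep(poll_seconds)

    # Each result row is a list of {"field": ..., "value": ...} cells;
    # flatten each row into a plain dict for easier inspection.
    rows = [{cell["field"]: cell["value"] for cell in row} for row in results]
    return status, rows
```

With AWS credentials configured, this could be called as `run_insights_query(boto3.client("logs"), LOG_GROUP, ERROR_QUERY)`. The returned rows are only a starting point: the team would still have to correlate the timestamps with the CPU and disk I/O spikes seen in Container Insights.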
Option A is incorrect because increasing capacity without understanding the root cause may not solve the issue and could increase costs. Option C is incorrect because switching to a different database does not address the immediate issue. Option D is incorrect because disabling health checks would hide the problem, not fix it.