A cloud engineer notices that a virtual machine (VM) in a public cloud environment is consistently running at 90% CPU during business hours. The VM hosts a customer-facing web application. Which of the following is the BEST initial troubleshooting step?
Reviewing metrics and logs is the standard first step in troubleshooting.
Why this answer
Option B is correct because the initial step in troubleshooting high CPU usage is to gather diagnostic data. Reviewing the VM's performance metrics (e.g., CPU utilization, memory, disk I/O) and application logs helps identify whether the issue is caused by a legitimate workload spike, a memory leak, or a misconfiguration. This aligns with the 'identify before act' principle in cloud operations, ensuring the engineer understands the root cause before making changes.
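As a minimal sketch of what "gather diagnostic data" can look like in practice, the Python snippet below compares average CPU during business hours with average CPU outside them. Everything in it is an assumption for illustration: the sample values, the 09:00 to 17:00 business window, the 80% threshold, and the function names are hypothetical. Real data would come from the cloud provider's monitoring service.

```python
from statistics import mean

# Hypothetical hourly CPU samples as (hour_of_day, cpu_percent).
# These values are illustrative only, not real measurements.
SAMPLES = [
    (2, 12.0), (5, 10.0), (9, 91.0), (11, 89.0),
    (14, 92.0), (16, 90.0), (20, 15.0), (23, 11.0),
]

# Assumed business window: 09:00 through 16:59.
BUSINESS_HOURS = range(9, 17)


def summarize_cpu(samples, business_hours=BUSINESS_HOURS):
    """Return mean CPU inside and outside the assumed business window."""
    busy = [cpu for hour, cpu in samples if hour in business_hours]
    idle = [cpu for hour, cpu in samples if hour not in business_hours]
    return {
        "business_mean": mean(busy) if busy else None,
        "off_hours_mean": mean(idle) if idle else None,
    }


def follows_business_hours(summary, threshold=80.0):
    """True when CPU is high only during the business window."""
    busy = summary["business_mean"]
    idle = summary["off_hours_mean"]
    if busy is None or idle is None:
        return False
    return busy >= threshold and idle < threshold


summary = summarize_cpu(SAMPLES)
print(summary)
print(follows_business_hours(summary))
```

If high CPU tracks business hours, as in this made-up data, the load is plausibly driven by user traffic, and the application logs are the next place to look. If CPU stays high around the clock, a runaway process or misconfiguration becomes more likely. Either way, the decision to scale, reboot, or tune comes after this data, not before it.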
Exam trap
The trap is that candidates often jump straight to a "fix" such as scaling up or rebooting. Cisco tests the foundational troubleshooting methodology of "gather data first," which avoids unnecessary changes and keeps the eventual solution targeted and cost-effective.
How to eliminate wrong answers
Option A is wrong because migrating the VM to a different availability zone does not address high CPU usage; it only changes the physical location, which may introduce latency or availability issues without resolving the performance bottleneck.
Option C is wrong because rebooting the VM is a disruptive action that only temporarily resets resource usage; it does not diagnose or fix the underlying cause, and it can lead to application downtime for a customer-facing web app.
Option D is wrong because scaling up to a larger instance size is a reactive measure that may mask the problem without investigation; it increases costs and could be unnecessary if the issue is due to a software bug or misconfiguration.