A company is running a microservices application on a Kubernetes cluster. They have noticed that one of the services, 'payment-api', is experiencing intermittent high latency. The team wants to identify the root cause without modifying the application code. Which approach should they take?
Distributed tracing tracks request flow and identifies slow components.
Why this answer
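Option C is correct because distributed tracing with tools such as Jaeger or Zipkin lets you follow a single request as it traverses multiple microservices and see exactly which service or call introduces the latency. Tracing can be adopted without changing application code when the instrumentation is supplied externally, for example by a service mesh sidecar proxy or an OpenTelemetry auto-instrumentation agent. A mesh on its own still depends on the application forwarding the trace-context headers it receives, so that per-hop spans can be joined into one trace. Tracing is designed specifically to pinpoint performance bottlenecks in distributed systems. CPU/memory metrics and log analysis, by contrast, cannot reconstruct a request's end-to-end path.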
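A trace is a tree of spans, and each span records one service's part of the request. To find "which service or call introduces latency", you can compute each span's self time: its duration minus the time covered by its child spans. The sketch below is a minimal Python version of that analysis. The `Span` fields are a hypothetical simplification, not Jaeger's or Zipkin's actual export schema. The service names `fraud-service` and `ledger-db` are also hypothetical; only `payment-api` comes from the question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not a real tracing backend's schema)
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    service: str
    operation: str
    start_ms: float
    duration_ms: float

    @property
    def end_ms(self) -> float:
        return self.start_ms + self.duration_ms


def covered_time(intervals: list[tuple[float, float]]) -> float:
    """Total length of the union of (start, end) intervals, so parallel children aren't double-counted."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def self_times(spans: list[Span]) -> dict[str, float]:
    """Time each span spent outside its child spans, i.e. latency it added itself."""
    children: dict[str, list[Span]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)
    result = {}
    for s in spans:
        # Clip children to the parent's window so clock skew cannot produce negative self time
        clipped = [
            (max(c.start_ms, s.start_ms), min(c.end_ms, s.end_ms))
            for c in children.get(s.span_id, [])
            if c.end_ms > s.start_ms and c.start_ms < s.end_ms
        ]
        result[s.span_id] = s.duration_ms - covered_time(clipped)
    return result


def slowest_component(spans: list[Span]) -> Span:
    """The span that contributed the most self time to this request."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


if __name__ == "__main__":
    # One slow request to payment-api; downstream service names are hypothetical
    trace = [
        Span("a1", None, "payment-api", "POST /pay", 0, 480),
        Span("b2", "a1", "fraud-service", "check", 10, 50),
        Span("c3", "a1", "ledger-db", "INSERT payment", 70, 400),
    ]
    slow = slowest_component(trace)
    print(slow.service, slow.operation)  # ledger-db INSERT payment
```

In this example `payment-api` itself adds only 30 ms; the trace attributes the remaining 400 ms to the database call. An aggregate CPU graph for the `payment-api` pod could not show that.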
Exam trap
CNCF exams often test the distinction between two kinds of signal. Distributed tracing carries request-level context. Aggregate cluster and resource metrics (kube-state-metrics, Node Exporter) and unstructured logs do not. Candidates who miss this distinction tend to choose CPU/memory correlation or log analysis to pinpoint intermittent latency in a microservices architecture.
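The request-level context that metrics lack is, concretely, a small identifier that travels with every request. In the W3C Trace Context standard, which OpenTelemetry uses by default, this is the `traceparent` HTTP header: `version-traceid-parentid-flags`. The Python sketch below parses that header and builds the one to send downstream. This is the step a sidecar or auto-instrumentation agent performs for the application.

Its limits: it handles only version `00`, it ignores the companion `tracestate` header, and its helper names are hypothetical. Metrics from kube-state-metrics or Node Exporter carry no such per-request identifier, so they cannot link a slow response to a specific downstream call.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# W3C Trace Context "traceparent" (version 00 only), e.g.
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str  # shared by every span of one request's journey
    span_id: str   # id of the span that sent this request (the "parent-id" field)
    sampled: bool  # bit 0x01 of the trace-flags field


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed or invalid."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero ids are invalid per the specification
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def _random_hex_id(nbytes: int) -> str:
    """Random lowercase hex id; all-zero ids are invalid, so retry (vanishingly rare)."""
    while True:
        value = secrets.token_hex(nbytes)
        if set(value) != {"0"}:
            return value


def child_traceparent(incoming: Optional[TraceContext]) -> str:
    """Header for an outgoing call: same trace id, a new span id for this hop."""
    new_span_id = _random_hex_id(8)
    if incoming is None:
        # No valid context arrived, so this hop starts a new trace
        # (assumption: sample every new trace)
        return f"00-{_random_hex_id(16)}-{new_span_id}-01"
    flags = "01" if incoming.sampled else "00"
    return f"00-{incoming.trace_id}-{new_span_id}-{flags}"
```

Because every hop reuses the incoming `trace_id`, a tracing backend can stitch the spans from `payment-api` and its downstream calls into one end-to-end request path. If any hop drops the header, the trace breaks at that point.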
How to eliminate wrong answers
Option A is wrong because CPU and memory figures describe only aggregate resource pressure. Latency in a microservice is often caused by network delays, database contention, or slow upstream services, and none of these need show up as local resource usage. Correlation does not imply causation, and none of these metrics traces the request path. kube-state-metrics is also a poor source for this data. It reports the state of Kubernetes objects (pod phase, restart counts, resource requests and limits), not actual CPU and memory usage, which comes from cAdvisor via the kubelet or from metrics-server.
Option B is wrong because raising log verbosity for all services generates massive volumes of unstructured data. It also relies on error messages that may never appear during an intermittent latency spike. Logs also lack the context of a specific request's journey across services, which makes root-cause identification inefficient and often impossible.
Option D is wrong because node-level metrics from Prometheus Node Exporter show only host-level resource usage (for example, disk I/O and network bandwidth). They cannot reveal which microservice or request is causing latency within the cluster. They are useful for infrastructure troubleshooting, not for following a request across services.