KCNA Observability Questions

75 of 86 questions · Page 1/2 · KCNA Observability topic · Answers revealed

1
MCQhard

When using OpenTelemetry, what is the role of the 'Collector'?

A.To alert on abnormal metrics
B.To store traces for long-term retention
C.To receive, process, and export telemetry data in a vendor-neutral way
D.To instrument application code manually
AnswerC

Correct. The Collector is a pipeline for telemetry data.

Why this answer

Option C is correct because the OpenTelemetry Collector is a vendor-agnostic proxy that receives telemetry data (traces, metrics, logs) from instrumented applications, processes it (e.g., batching, filtering, enrichment), and exports it to one or more backends (e.g., Jaeger, Prometheus, or any OTLP-compatible system). It decouples data generation from data storage, enabling flexible, scalable observability pipelines without vendor lock-in.

Exam trap

CNCF often tests the misconception that the Collector is a storage or alerting system, when in fact it is a stateless pipeline component that only receives, processes, and exports telemetry data.

How to eliminate wrong answers

Option A is wrong because alerting on abnormal metrics is the responsibility of monitoring systems like Prometheus with Alertmanager, not the OpenTelemetry Collector, which focuses on data ingestion, processing, and export. Option B is wrong because long-term storage of traces is handled by backend systems (e.g., Jaeger, Tempo) or databases; the Collector is a pipeline component that forwards data, not a persistent store. Option D is wrong because manual instrumentation of application code is done using OpenTelemetry SDKs and APIs (e.g., for traces, metrics), while the Collector operates as a separate infrastructure component that receives already-instrumented telemetry.
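
The receive, process, export flow described above can be sketched in a few lines. This is a toy model only, not the Collector's actual API or configuration: the `ToyPipeline` class, the `drop` flag, and the `cluster` attribute are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for a receive -> process -> export pipeline."""
    batch_size: int
    exporters: list  # callables that each accept a list of records
    _buffer: list = field(default_factory=list)

    def receive(self, record: dict) -> None:
        # Processing step 1 (filtering): drop records flagged as noise.
        if record.get("drop"):
            return
        # Processing step 2 (enrichment): attach a static attribute.
        self._buffer.append({**record, "cluster": "demo"})
        # Processing step 3 (batching): export once the batch is full.
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        for export in self.exporters:
            export(batch)  # fan the same batch out to every backend


# Two lists stand in for two different backends.
backend_a, backend_b = [], []
pipeline = ToyPipeline(batch_size=2, exporters=[backend_a.append, backend_b.append])
pipeline.receive({"span": "GET /"})
pipeline.receive({"span": "debug", "drop": True})
pipeline.receive({"span": "SELECT"})
```

After the batch is exported the pipeline's buffer is empty again, which mirrors the point that the Collector forwards data rather than storing it.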

2
Multi-Selectmedium

Which TWO of the following are valid PromQL functions? (Select two.)

Select 2 answers
A.topk()
B.histogram_quantile()
C.rate()
D.avg()
E.sum()
AnswersB, C

histogram_quantile() calculates quantiles from histogram metrics.

Why this answer

rate() and histogram_quantile() are PromQL functions. topk(), avg(), and sum() are aggregation operators rather than functions, even though they use call-like syntax. avg_over_time(), by contrast, is a function.

3
MCQmedium

A developer wants to view the logs of a specific container named 'sidecar' in a pod called 'web-app'. Which command should they use?

A.kubectl logs sidecar --pod web-app
B.kubectl logs -c sidecar web-app
C.kubectl logs sidecar web-app
D.kubectl logs web-app -c sidecar
AnswerD

Correct. -c specifies the container.

Why this answer

The -c flag specifies the container name when a pod has multiple containers.

4
MCQeasy

Which Prometheus metric type is used to represent a value that can increase or decrease over time, such as memory usage?

A.Gauge
B.Histogram
C.Summary
D.Counter
AnswerA

A gauge can increase or decrease.

Why this answer

A gauge is a metric that can go up and down, like memory usage or temperature.

5
Multi-Selecthard

Which THREE of the following are components of the OpenTelemetry project?

Select 3 answers
A.Prometheus
B.Collector
C.Specification
D.Jaeger
E.SDKs (Software Development Kits)
AnswersB, C, E

The Collector is a vendor-agnostic telemetry pipeline.

Why this answer

OpenTelemetry includes specification, SDKs, Collector, and instrumentations. Prometheus and Jaeger are separate projects.

6
MCQhard

What is the main advantage of using OpenTelemetry over vendor-specific instrumentation libraries?

A.It provides a single, vendor-agnostic instrumentation standard
B.It eliminates the need for logging
C.It automatically reduces latency
D.It is the only tool that supports traces
AnswerA

OpenTelemetry is an open standard that works with multiple backends.

Why this answer

OpenTelemetry provides a unified standard that avoids vendor lock-in, allowing data to be sent to any backend.

7
MCQeasy

Which Prometheus metric type is best suited for counting the total number of HTTP requests received by a service?

A.Summary
B.Counter
C.Gauge
D.Histogram
AnswerB

Correct. A counter is cumulative and only increases, perfect for counting total requests.

Why this answer

A counter is a cumulative metric that only increases (or resets to zero). It is ideal for counting events like HTTP requests.
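
The difference between a counter and a gauge can be made concrete with a minimal sketch. The `Counter` and `Gauge` classes below are hypothetical illustrations written for this example; they are not the API of any Prometheus client library.

```python
class Counter:
    """Cumulative value that can only go up (a restart would reset it to zero)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that may go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = float(value)

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


http_requests_total = Counter()
for _ in range(3):
    http_requests_total.inc()      # one increment per handled request

memory_in_use_mib = Gauge()
memory_in_use_mib.set(512)
memory_in_use_mib.dec(128)         # usage dropped, which a counter could not express
```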

8
MCQmedium

An organization uses Prometheus and Grafana for monitoring. They want to alert when the 99th percentile of request latency exceeds 500ms for more than 5 minutes. Which PromQL query should they use in the alert rule?

A.histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m])) > 0.5
B.histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
C.avg(rate(http_request_duration_seconds_bucket[5m])) > 0.5
D.max(rate(http_request_duration_seconds_bucket[5m])) > 0.5
AnswerB

Correctly calculates 99th percentile over 5 minutes, then compares to 0.5 seconds.

Why this answer

Option B is correct because `histogram_quantile(0.99, rate(...[5m]))` estimates the 99th percentile request latency from the histogram buckets and compares it to 0.5 seconds (500ms). The `rate()` function with a `[5m]` range computes the per-second average rate of increase of each bucket counter over the last 5 minutes, which smooths out short spikes before the quantile is estimated. Note that the "for more than 5 minutes" part of the requirement is expressed by the `for: 5m` clause of the alerting rule, not by the range selector; of the expressions offered, B is the one that computes the 99th percentile over a 5-minute window.

Exam trap

The trap here is that candidates often pick a 1-minute rate window (Option A) thinking it provides faster detection. The question asks for a sustained condition, and a short window makes the quantile estimate noisier. The pending duration itself is set with the rule's `for` clause, not with the rate window.

How to eliminate wrong answers

Option A is wrong because its 1-minute rate window (`[1m]`) reacts to short bursts and gives a noisier estimate than the 5-minute window the scenario calls for. Option C is wrong because `avg(rate(...))` calculates the average rate across all buckets, which has no relation to percentile latency and cannot detect the 99th percentile threshold. Option D is wrong because `max(rate(...))` takes the maximum rate across buckets, which is not a percentile metric and would incorrectly alert on the highest bucket's rate rather than the 99th percentile latency.
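
To show what a quantile estimate over cumulative buckets involves, here is a simplified sketch of the idea behind `histogram_quantile()`: find the bucket that contains the target rank and interpolate linearly inside it. It is an approximation written for illustration, not Prometheus's implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to have upper_bound == math.inf, like the
    le="+Inf" series of a Prometheus histogram.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper bound (or an empty bucket): fall back to
                # the highest bound seen so far.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical per-bucket rates for le = 0.1s, 0.25s, 0.5s, 1s, +Inf.
buckets = [(0.1, 50), (0.25, 80), (0.5, 98), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
alert_condition = p99 > 0.5
```

With these sample buckets the estimate lands between the 0.5s and 1s bounds, so the `> 0.5` comparison is true. In a real alerting rule that comparison would still have to hold for the whole `for` duration before the alert fires.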

9
MCQeasy

What is the purpose of the metrics-server in Kubernetes?

A.To provide resource usage metrics for pods and nodes
B.To manage service meshes
C.To collect application logs
D.To store historical metrics
AnswerA

The metrics-server exposes CPU and memory metrics from kubelets.

Why this answer

The metrics-server provides resource metrics (CPU and memory) per pod and node, used by kubectl top and the Horizontal Pod Autoscaler.

10
Multi-Selectmedium

Which TWO of the following are valid Prometheus metric types?

Select 2 answers
A.Quantile
B.Counter
C.Timer
D.Meter
E.Gauge
AnswersB, E

Correct. A Counter is a cumulative metric that only increases.

Why this answer

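Prometheus supports four metric types: counter, gauge, histogram, and summary. Options B (Counter) and E (Gauge) are valid; A (Quantile), C (Timer), and D (Meter) are not Prometheus metric types, although similar terms appear in other metrics libraries and in OpenTelemetry.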

11
MCQmedium

In OpenTelemetry, which component is responsible for receiving, processing, and exporting telemetry data from multiple sources?

A.OpenTelemetry Collector
B.OpenTelemetry SDK
C.OpenTelemetry Exporter
D.OpenTelemetry API
AnswerA

The Collector is a pipeline component for receiving, processing, and exporting data.

Why this answer

The OpenTelemetry Collector is a vendor-agnostic agent that receives, processes, and exports telemetry data.

12
MCQhard

In Prometheus, what is the purpose of the Alertmanager component?

A.To scrape metrics from targets
B.To provide a graphical dashboard for metrics
C.To manage, group, and route alerts to notification channels like email or Slack
D.To store historical metrics data long-term
AnswerC

Correct. Alertmanager handles alert processing and notifications.

Why this answer

Alertmanager handles alerts sent by Prometheus server, deduplicates, groups, and routes them to receivers (email, Slack, etc.), and manages silencing and inhibition.

13
Drag & Dropmedium

Drag and drop the steps to perform a backup of etcd in a Kubernetes cluster into the correct order.

Why this order

Access the node, save snapshot, verify, store securely, and restore when necessary.

14
MCQmedium

Which of the following is a core component of the three pillars of observability?

A.Alerting
B.SLIs
C.Logs
D.Dashboards
AnswerC

Logs are one of the three pillars of observability.

Why this answer

The three pillars of observability are logs, metrics, and traces. Alerting is derived from metrics, not a pillar itself.

15
MCQmedium

Which Prometheus metric type is best suited to count the number of HTTP requests received?

A.Gauge
B.Histogram
C.Summary
D.Counter
AnswerD

Counters are cumulative and only increase, perfect for counting total requests.

Why this answer

A counter is a cumulative metric that only increases, ideal for counting requests.

16
Multi-Selectmedium

Which THREE of the following are benefits of using a service mesh for observability? (Select three.)

Select 3 answers
A.Reduced network latency
B.Centralized logging of all application logs
C.Distributed tracing across services
D.Collection of detailed metrics for service-to-service communication
E.Automatic instrumentation of application code for traces
AnswersC, D, E

Service mesh can propagate trace context and generate spans for each hop.

Why this answer

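A service mesh provides visibility into inter-service communication: its proxies generate spans and metrics such as request latency for each hop without changes to application code, although applications typically still need to forward trace context headers so that the spans join into a single trace.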

17
MCQmedium

A DevOps team notices that a microservice is returning 503 errors intermittently. The service runs in Kubernetes and uses a liveness probe. The team wants to understand the root cause without restarting the pod. Which observability approach should they use first?

A.Use kubectl describe pod to check recent events
B.Query Prometheus for kubelet metrics on probe successes and failures
C.Increase log verbosity in the application to capture all requests
D.Enable distributed tracing across the service mesh
AnswerB

Kubelet metrics such as 'prober_probe_total' (labelled by probe type and result) show probe outcomes over time, helping identify intermittent failures.

Why this answer

Option B is correct because Prometheus can scrape kubelet metrics that expose liveness probe success and failure counts directly, allowing the team to see if the probe is failing without restarting the pod. This approach provides historical data on probe behavior, which is essential for diagnosing intermittent 503 errors that stem from the kubelet restarting the container when the liveness probe fails. Unlike other options, it does not require modifying the application or restarting the pod, and it directly surfaces the root cause if the probe is the issue.

Exam trap

The trap here is that candidates often assume 'kubectl describe pod' (Option A) is sufficient for debugging, but they overlook that its event log is short-lived and may not retain evidence of intermittent failures, whereas Prometheus metrics provide persistent historical data.

How to eliminate wrong answers

Option A is wrong because 'kubectl describe pod' shows recent events, but these events are ephemeral and may not capture intermittent failures that occurred minutes or hours ago, especially if the pod has not been restarted recently. Option C is wrong because increasing log verbosity requires modifying the application deployment and restarting the pod, which the team explicitly wants to avoid, and it does not directly reveal liveness probe failures (which are handled by the kubelet, not the application). Option D is wrong because distributed tracing across the service mesh focuses on request-level latency and errors between services, not on the kubelet's health check mechanism; it would not show liveness probe failures unless the probe itself is instrumented as a span, which is not standard.

18
MCQeasy

Which of the following is NOT one of the three pillars of observability?

A.Metrics
B.Logs
C.Traces
D.Alerts
AnswerD

Alerts are not a pillar; they are typically generated from metrics or logs.

Why this answer

The three pillars are logs, metrics, and traces. Alerts are derived from these pillars but not considered a pillar themselves.

19
MCQmedium

An application is instrumented with OpenTelemetry to export traces to Jaeger. The team notices that some traces are incomplete. What is the most likely cause?

A.Context propagation is not correctly implemented
B.Span attributes are missing
C.Sampling rate is too high
D.Jaeger database is full
AnswerA

Missing context propagation breaks trace continuity.

Why this answer

Incomplete traces often occur when context propagation is not implemented correctly, causing spans to be disconnected.
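
A minimal sketch of what correct propagation means, assuming the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). The `child_traceparent` helper is a hypothetical name for this example and is not part of any OpenTelemetry SDK; real SDKs do this through their propagator APIs.

```python
import secrets


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call made while handling `incoming`."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    # The trace ID must be carried over unchanged; only the span ID is new.
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

If a service generated a fresh trace ID here instead of reusing the incoming one, or dropped the header entirely, the downstream spans would show up in the backend as a separate trace, which is how incomplete traces arise.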

20
MCQmedium

What type of Prometheus metric is best suited to count the total number of HTTP requests received by a service?

A.Histogram
B.Gauge
C.Summary
D.Counter
AnswerD

Counter is a cumulative metric that increases monotonically, suitable for counting requests.

Why this answer

A counter is a cumulative metric that can only increase or be reset to zero, ideal for counting events like requests.

21
MCQmedium

Which tool is specifically designed for log aggregation and is built by Grafana Labs as a lightweight, cost-effective alternative to traditional log systems?

A.Loki
B.Zipkin
C.Prometheus
D.Jaeger
AnswerA

Correct. Loki is a log aggregation system from Grafana Labs.

Why this answer

Loki is a log aggregation system optimized for Kubernetes, designed to be cost-effective and easy to operate.

22
Multi-Selectmedium

Which TWO of the following are Prometheus metric types? (Select two.)

Select 2 answers
A.Event
B.Gauge
C.Set
D.Counter
E.Timer
AnswersB, D

Gauge represents a single numerical value that can go up and down.

Why this answer

Prometheus metric types include Counter, Gauge, Histogram, and Summary. Options B (Gauge) and D (Counter) are correct.

23
MCQhard

A team wants to set up alerts when a Kubernetes pod consumes more than 90% of its memory limit for over 5 minutes. They use Prometheus and Alertmanager. Which Prometheus query would trigger an alert for a specific pod named 'web-app' in the 'default' namespace?

A.container_memory_usage_bytes{pod='web-app'} > 0.9
B.container_memory_usage_bytes{pod='web-app'} / container_spec_memory_limit_bytes{pod='web-app'} > 0.9
C.avg(container_memory_usage_bytes{pod='web-app'}) > 0.9
D.container_memory_limit_bytes{pod='web-app'} > 0.9
AnswerB

This calculates the fraction of the memory limit that is in use.

Why this answer

The correct query divides container memory usage by its limit and compares to 0.9.
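
The arithmetic behind the ratio check can be sketched as follows. The function name and the sample byte values are hypothetical, and treating a non-positive limit as "no limit set" is an assumption of this sketch.

```python
def memory_ratio_exceeds(usage_bytes: float, limit_bytes: float, threshold: float = 0.9) -> bool:
    """Return True when usage is above `threshold` as a fraction of the limit."""
    if limit_bytes <= 0:
        # Assumption: a non-positive limit means no limit is configured,
        # so a ratio cannot be computed.
        return False
    return usage_bytes / limit_bytes > threshold


MIB = 1024 * 1024
near_limit = memory_ratio_exceeds(950 * MIB, 1024 * MIB)   # about 93% of the limit
comfortable = memory_ratio_exceeds(512 * MIB, 1024 * MIB)  # 50% of the limit
```

This also shows why comparing raw usage to 0.9 is meaningless: usage is measured in bytes, so any running container would exceed 0.9 by many orders of magnitude.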

24
MCQeasy

Which of the following is considered one of the three pillars of observability?

A.Events
B.Metrics
C.Alerts
D.Profiles
AnswerB

Metrics are one of the three pillars of observability, along with logs and traces.

Why this answer

The three pillars of observability in cloud native environments are logs, metrics, and traces.

25
MCQmedium

Which log aggregation tool is designed specifically for Kubernetes and is often used as a lightweight alternative to Fluentd?

A.Logstash
B.Fluent Bit
C.Loki
D.Elasticsearch
AnswerB

Fluent Bit is lightweight and designed for Kubernetes.

Why this answer

Fluent Bit is a lightweight log processor and forwarder, often used in Kubernetes.

26
MCQeasy

What is the primary purpose of Prometheus in cloud native observability?

A.Provide distributed tracing
B.Visualize data
C.Collect and store logs
D.Collect and store metrics
AnswerD

Prometheus is a metrics system.

Why this answer

Prometheus is a metrics system that collects and stores numeric time-series data.

27
MCQhard

An SRE team defines an SLO that 99.9% of requests to a service should complete in under 500ms over a 30-day rolling window. If the service receives 10 million requests in a month, what is the maximum number of requests that can exceed the latency threshold while still meeting the SLO?

A.10,000
B.5,000
C.1,000
D.100,000
AnswerA

0.1% of 10 million is 10,000.

Why this answer

An SLO of 99.9% leaves an error budget of 0.1% of requests that may miss the latency target. 0.1% of 10,000,000 is 10,000 requests.
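
The same error-budget arithmetic as a short sketch. Exact fractions are used to avoid floating-point rounding; the function name is a hypothetical choice for this example.

```python
from fractions import Fraction


def allowed_bad_requests(total_requests: int, slo_target: str) -> int:
    """Largest whole number of requests that may miss the target under the SLO."""
    budget_fraction = 1 - Fraction(slo_target)   # e.g. 1 - 0.999 = 0.001
    return int(budget_fraction * total_requests)  # round down to whole requests


budget = allowed_bad_requests(10_000_000, "0.999")
```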

28
Multi-Selecthard

Which THREE of the following are core components of the OpenTelemetry specification? (Select three.)

Select 3 answers
A.Data Model
B.API
C.Collector
D.Exporter
E.SDK
AnswersA, B, E

Data Model defines the schema for telemetry data.

Why this answer

The OpenTelemetry specification defines the API, SDK, and data model. These are the core components.

29
MCQhard

A platform team wants to implement observability for a Kubernetes cluster running 500+ microservices. They need to reduce the cost of storing logs while retaining the ability to search for specific error patterns. Which strategy best achieves this?

A.Increase log retention to one year for compliance
B.Store all logs in a centralized Elasticsearch cluster with high retention
C.Aggregate logs into a single pod for easier indexing
D.Use structured logging and sample debug logs, retaining error logs fully
AnswerD

Sampling reduces volume while keeping critical error logs for search.

Why this answer

Option D is correct because structured logging (e.g., JSON format) enables efficient indexing and querying of logs, while sampling debug logs and retaining error logs fully reduces storage costs without losing critical error patterns. This approach balances observability needs with cost optimization, a key principle in cloud-native environments.

Exam trap

The trap here is that candidates may assume centralized storage (Elasticsearch) or longer retention always improves observability, ignoring the cost and scalability constraints of 500+ microservices in a cloud-native environment.

How to eliminate wrong answers

Option A is wrong because increasing log retention to one year for compliance does not address cost reduction; it increases storage costs and may violate data minimization principles. Option B is wrong because storing all logs in a centralized Elasticsearch cluster with high retention is expensive and inefficient, as it retains unnecessary debug logs and scales poorly for 500+ microservices. Option C is wrong because aggregating logs into a single pod creates a single point of failure, violates pod isolation, and does not reduce storage costs or improve searchability.
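
The sampling strategy in Option D can be sketched as a simple keep-or-drop decision per log record. The level names, the 10% sample rate, and the choice to keep every non-debug record are assumptions made for illustration.

```python
import random


def should_keep(level: str, rng: random.Random, debug_sample_rate: float = 0.1) -> bool:
    """Keep every non-debug record; keep only a random sample of debug records."""
    if level == "DEBUG":
        return rng.random() < debug_sample_rate
    return True  # ERROR (and, in this sketch, INFO/WARNING) are retained fully


rng = random.Random(0)  # fixed seed so the example is repeatable
records = ["DEBUG"] * 1000 + ["ERROR"] * 20
kept = [level for level in records if should_keep(level, rng)]
kept_errors = kept.count("ERROR")
kept_debug = kept.count("DEBUG")
```

All 20 error records survive, while only a small share of the 1,000 debug records is stored, which is where the storage saving comes from.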

30
Multi-Selecthard

Which TWO of the following are valid components of the Alertmanager configuration? (Select two.)

Select 2 answers
A.group_by
B.prometheus_rules
C.route
D.alert
E.receivers
AnswersC, E

Route defines alert routing tree in Alertmanager.

Why this answer

Alertmanager configuration includes 'route' for routing alerts and 'receivers' for notification channels. 'prometheus_rules' is not an Alertmanager setting; alerting rules belong to the Prometheus configuration. 'alert' is not a top-level Alertmanager key, and 'group_by' is a field nested under 'route' rather than a top-level section.

31
MCQmedium

A DevOps team wants to collect logs from all Kubernetes nodes and forward them to a central log storage system. Which tool is specifically designed for lightweight log aggregation and forwarding on Kubernetes nodes?

A.Elasticsearch
B.Prometheus
C.Fluent Bit
D.Grafana
AnswerC

Fluent Bit is a lightweight log forwarder suitable for Kubernetes nodes.

Why this answer

Fluent Bit is a lightweight log processor and forwarder, designed for resource-constrained environments like Kubernetes nodes.

32
MCQmedium

A team wants to visualize metrics from Prometheus in a dashboard. Which tool is commonly used for this purpose?

A.Grafana
B.Alertmanager
C.Jaeger UI
D.Kibana
AnswerA

Grafana integrates natively with Prometheus.

Why this answer

Grafana is the most popular visualization tool for Prometheus metrics, offering rich dashboards.

33
MCQhard

A team wants to implement cost monitoring for their Kubernetes clusters. Which approach is most effective?

A.Use cloud provider billing APIs combined with resource utilization data
B.Use kubectl top to get resource usage
C.Estimate costs based on node count
D.Monitor CPU and memory usage with Prometheus
AnswerA

This maps resource consumption to cost.

Why this answer

Option A is correct because cloud provider billing APIs provide actual cost data per resource (e.g., per node, per persistent volume, per network egress), and combining this with resource utilization data (e.g., CPU/memory requests and actual usage from metrics) enables accurate cost allocation per namespace, pod, or workload. This approach directly maps infrastructure spend to Kubernetes abstractions, which is essential for chargeback or showback in multi-tenant clusters.

Exam trap

The trap here is that candidates confuse resource monitoring (CPU/memory) with cost monitoring, assuming that tracking utilization alone (e.g., with Prometheus or kubectl top) is sufficient to understand spending, when in fact cost data requires explicit billing integration.

How to eliminate wrong answers

Option B is wrong because 'kubectl top' only shows current resource usage (CPU/memory) for nodes and pods, not cost data; it lacks any billing context or historical aggregation needed for cost monitoring. Option C is wrong because estimating costs based solely on node count ignores variable costs like storage, network egress, and managed services (e.g., load balancers), leading to inaccurate cost attribution. Option D is wrong because Prometheus monitors resource utilization metrics (CPU, memory, disk I/O) but does not inherently provide cost data; it would need to be combined with pricing information from cloud provider APIs to calculate costs.

34
MCQmedium

Which tool is specifically designed for distributed tracing and is a Cloud Native Computing Foundation (CNCF) graduated project?

A.Grafana
B.Fluentd
C.Jaeger
D.Prometheus
AnswerC

Jaeger is a graduated CNCF project for distributed tracing.

Why this answer

Jaeger is a CNCF graduated project focused on distributed tracing.

35
MCQmedium

Which component of the OpenTelemetry architecture is responsible for receiving data from instrumented applications and processing it before export?

A.OpenTelemetry SDK
B.OpenTelemetry API
C.OpenTelemetry Collector
D.OpenTelemetry exporter
AnswerC

The Collector handles ingestion, processing, and export.

Why this answer

The OpenTelemetry Collector receives, processes, and exports telemetry data.

36
Multi-Selecteasy

Which TWO of the following tools are commonly used for distributed tracing in cloud-native environments? (Select two.)

Select 2 answers
A.Zipkin
B.Grafana
C.Jaeger
D.Fluentd
E.Prometheus
AnswersA, C

Zipkin is a distributed tracing system.

Why this answer

Jaeger and Zipkin are popular open-source distributed tracing systems.

37
MCQhard

In PromQL, which function would you use to calculate the per-second rate of increase of a counter over a specified time window?

A.rate()
B.delta()
C.avg_over_time()
D.increase()
AnswerA

rate() is the correct function for per-second rate of a counter.

Why this answer

The rate() function calculates the per-second average rate of increase of a counter over a time range.
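
A simplified sketch of the calculation: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. The real `rate()` additionally extrapolates to the edges of the range window, which this sketch deliberately omits; the sample data is made up.

```python
def simple_rate(samples):
    """Per-second average increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went backwards, so assume it restarted from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / duration


# Counter scraped every 15s; the process restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
per_second = simple_rate(samples)
```

Here the total increase is 30 + 30 + 10 + 30 = 100 over 60 seconds. By comparison, `increase()` reports the total increase over the window rather than a per-second value, and `delta()` is meant for gauges.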

38
MCQhard

In OpenTelemetry, what is the purpose of the Collector component?

A.Instrument code automatically
B.Receive, process, and export telemetry data
C.Visualize traces and metrics
D.Aggregate logs from multiple sources
AnswerB

The Collector is a vendor-agnostic pipeline for telemetry data.

Why this answer

The OpenTelemetry Collector is a vendor-agnostic agent or gateway that receives telemetry data (traces, metrics, logs) from instrumented applications, processes it (e.g., batching, filtering, sampling), and exports it to one or more backends (e.g., Jaeger, Prometheus, or any OTLP-compatible system). It decouples data generation from data export, enabling flexible pipeline management without modifying application code.

Exam trap

CNCF often tests the distinction between the Collector's role (data pipeline) and other components like SDKs (instrumentation) or backends (visualization/storage), so candidates mistakenly associate the Collector with auto-instrumentation or visualization.

How to eliminate wrong answers

Option A is wrong because automatic code instrumentation is the role of OpenTelemetry SDKs and auto-instrumentation agents (e.g., Java agent), not the Collector; the Collector does not instrument code. Option C is wrong because visualization of traces and metrics is the responsibility of backend tools like Jaeger UI, Grafana, or Prometheus, not the Collector, which only processes and forwards data. Option D is wrong because while the Collector can handle logs, its primary purpose is not limited to log aggregation; it is a unified pipeline for traces, metrics, and logs, and log aggregation alone is a narrower function often served by tools like Fluentd or Logstash.

39
MCQeasy

A Kubernetes administrator is troubleshooting a pod that is stuck in CrashLoopBackOff. The pod's restart count is increasing. Which initial step should the administrator take to diagnose the issue?

A.Run 'kubectl describe pod <pod-name>' to check events
B.Check the Prometheus metrics for the pod's CPU usage
C.Run 'kubectl exec -it <pod-name> -- /bin/sh' to inspect the container
D.Run 'kubectl logs <pod-name>' to view the application logs
AnswerD

Logs often contain error messages that explain why the application is crashing.

Why this answer

Option D is correct because when a pod is in CrashLoopBackOff, the immediate priority is to inspect the application logs to understand why the container is failing. `kubectl logs <pod-name>` retrieves the stdout/stderr output from the container, which typically contains error messages, stack traces, or configuration issues that caused the crash. This is the most direct and efficient first step before deeper investigation.

Exam trap

The trap here is that candidates often jump to `kubectl describe pod` (Option A) because it shows events and status, but they overlook that application-level errors are only visible in the container logs, not in the pod events.

How to eliminate wrong answers

Option A is wrong because `kubectl describe pod` shows events and status details, but it does not show the application's runtime logs; it is useful for cluster-level issues (e.g., image pull failures, node problems) but not for application crashes. Option B is wrong because Prometheus metrics are for long-term monitoring and alerting, not for real-time crash diagnosis; CPU usage data will not reveal why a process exited. Option C is wrong because `kubectl exec` requires a running container, but a pod in CrashLoopBackOff has a container that is repeatedly crashing and may not be running at the moment the command is issued, causing the exec to fail.

40
MCQeasy

Which of the following is the correct definition of a Service Level Indicator (SLI)?

A.A formal contract between a service provider and a customer
B.A target value or range for a metric, agreed upon with stakeholders
C.A quantitative measure of a specific aspect of the service's reliability
D.A tool for aggregating logs from multiple sources
AnswerC

An SLI is exactly that: a metric that indicates the level of service.

Why this answer

An SLI is a specific metric that measures a particular aspect of service reliability, such as request latency or error rate.

41
Multi-Selecthard

Which TWO of the following are recommended practices for achieving observability in a Kubernetes cluster?

Select 2 answers
A.Use a single centralized logging solution to aggregate logs from all components.
B.Store all debug logs for a minimum of 90 days for compliance.
C.Include correlation IDs in structured logs to enable tracing across services.
D.Disable leader election for monitoring components to reduce complexity.
E.Use Prometheus with a pull-based model to scrape metrics from pods.
AnswersC, E

Correlation IDs help trace requests across microservices.

Why this answer

Option C is correct because including correlation IDs in structured logs is a key observability practice that enables distributed tracing across microservices. In Kubernetes, where requests often traverse multiple pods and services, correlation IDs allow you to link logs from different components into a single transaction flow, which is essential for debugging and understanding system behavior.

Exam trap

CNCF often tests the misconception that centralized logging is always best, but the trap here is that observability emphasizes distributed, resilient data collection over a single monolithic log sink, and that debug logs are not subject to long-term compliance retention like audit logs.

42
MCQmedium

In distributed tracing, what is a 'span'?

A.A metric measuring request latency
B.A single logical operation within a trace
C.A collection of related traces
D.A log entry with trace context
AnswerB

A span represents one operation, such as a function call or a request.

Why this answer

A span represents a unit of work in a distributed system, often a single operation like an HTTP request or database call.

43
MCQmedium

Which component is responsible for aggregating metrics from Kubernetes nodes and exposing them to the metrics API?

A.Prometheus Server
B.Grafana
C.metrics-server
D.Fluentd
AnswerC

Correct. The metrics-server is a cluster-wide aggregator of resource usage data.

Why this answer

The metrics-server is the correct component because it is specifically designed to collect resource metrics (CPU and memory) from the kubelet on each node via the Summary API and expose them through the Kubernetes Metrics API. This allows tools like `kubectl top` and the Horizontal Pod Autoscaler to access real-time resource usage without requiring a full monitoring stack.

Exam trap

The trap here is that candidates often confuse Prometheus (a full monitoring system) with the metrics-server (a lightweight, Kubernetes-native component for the Metrics API), assuming Prometheus is required for `kubectl top` or HPA when in fact the metrics-server is the dedicated and simpler solution.

How to eliminate wrong answers

Option A is wrong because Prometheus Server is a full monitoring and alerting system that scrapes metrics from various endpoints (including the kubelet), but it does not serve the Kubernetes Metrics API used by `kubectl top` and the Horizontal Pod Autoscaler. Option B is wrong because Grafana is a visualization and dashboarding tool that queries data sources such as Prometheus; it does not aggregate metrics or expose them through the Metrics API. Option D is wrong because Fluentd is a log collector and forwarder used for log aggregation, not for collecting or exposing resource metrics to the Kubernetes Metrics API.

44
MCQeasy

Which of the following is NOT one of the three pillars of observability in cloud-native environments?

A.Metrics
B.Traces
C.Security
D.Logs
AnswerC

Security is not one of the three pillars.

Why this answer

The three pillars are logs, metrics, and traces. Security is not one of them, though it is important.

45
Multi-Selecthard

Which THREE of the following are important considerations when defining SLOs (Service Level Objectives)? (Select three.)

Select 3 answers
A.They should include an error budget
B.They must be aligned with business impact
C.They must be based on measurable SLIs
D.They define a target percentage over a time window
E.They should minimize infrastructure cost
AnswersB, C, D

SLOs should reflect what matters to users and business.

Why this answer

SLOs should be based on SLIs, define a target (e.g., 99.9%), and include a measurement window. Cost is not a direct consideration for SLO definition.

46
MCQmedium

In the context of distributed tracing, what is a 'span'?

A.A metric that measures request latency
B.A tool for collecting logs from containers
C.The entire end-to-end transaction across services
D.A single logical operation within a service, with a start and end time
AnswerD

Correct. A span represents one operation, such as a database call or an HTTP request handler.

Why this answer

A span is the fundamental building block of a trace, representing a single unit of work in a distributed system.

47
Matchingmedium

Match each Kubernetes security concept to its definition.

Concepts
Matches

Identity for processes running in a pod

Role-based access control to authorize API requests

Specifies how groups of pods are allowed to communicate

Deprecated but formerly controlled security-sensitive pod settings

Stores sensitive data like passwords and tokens

Why these pairings

These are key security mechanisms in Kubernetes.

48
MCQeasy

What is the primary purpose of structured logging?

A.To replace metrics and traces
B.To reduce the size of log files
C.To make logs human-readable only
D.To enable automated analysis and querying of logs
AnswerD

Structured logs allow tools like Loki or Elasticsearch to index and search log fields efficiently.

Why this answer

Structured logging formats log data in a consistent, machine-parseable format (e.g., JSON) with key-value pairs. This enables automated tools like Elasticsearch, Loki, or Splunk to efficiently index, search, filter, and aggregate logs, which is essential for observability at scale. The primary purpose is to facilitate automated analysis and querying, not to replace other telemetry signals or to focus on human readability alone.

Exam trap

The trap here is that candidates confuse 'structured logging' with 'log formatting for readability' (Option C), but the KCNA exam emphasizes that structured logging is fundamentally about enabling automated processing and correlation, not just making logs easier for humans to read.

How to eliminate wrong answers

Option A is wrong because structured logging does not replace metrics and traces; it complements them as part of the three pillars of observability (logs, metrics, traces), each serving a distinct purpose. Option B is wrong because structured logging often increases log file size due to added metadata (e.g., JSON keys), not reduces it; compression or sampling is used for size reduction. Option C is wrong because while structured logs can be formatted for readability, their core design is for machine parsing, not human readability; unstructured plain-text logs are typically more human-readable.
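
As a rough illustration of the explanation above, the following sketch uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The logger name `payments` and the `request_id` and `status` fields are hypothetical, chosen only for this example; real services often rely on a dedicated logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        # Only these two hypothetical fields are copied in this sketch.
        for key in ("request_id", "status"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("payments")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("charge failed", extra={"request_id": "req-42", "status": 502})

# A log backend can now filter on fields instead of parsing free text.
parsed = json.loads(stream.getvalue())
```

Because every entry carries the same keys, a query such as "all entries with `status` of 502" becomes a field lookup rather than a regular-expression search.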

49
Multi-Selectmedium

Which TWO of the following are common log aggregation tools used in Kubernetes environments? (Select two)

Select 2 answers
A.Loki
B.Fluentd
C.Prometheus
D.Jaeger
E.Fluent Bit
AnswersB, E

Fluentd is a widely used log collector and forwarder.

Why this answer

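Fluentd and Fluent Bit are both popular log collectors and forwarders in Kubernetes, usually deployed as a DaemonSet on every node. Loki is primarily a log storage and query backend that receives logs shipped by such agents, so it is not the intended answer here.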

50
MCQhard

What is context propagation in distributed tracing?

A.Sampling traces to reduce data volume
B.Visualizing traces in a user interface
C.Carrying trace context (trace ID, span ID) across services
D.Storing trace data in a centralized database
AnswerC

Context propagation passes metadata to correlate spans.

Why this answer

Context propagation carries trace context across service boundaries to connect spans into a single trace.
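
To make the idea concrete, here is a minimal sketch of propagation using the `traceparent` header layout from the W3C Trace Context format (`version-traceid-spanid-flags`). It is not the OpenTelemetry SDK API; the `SpanContext`, `inject`, `extract`, and `start_span` names are hypothetical helpers written for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str   # shared by every span in the same trace
    span_id: str    # unique per span
    sampled: bool


def inject(ctx: SpanContext, headers: dict) -> None:
    """Caller side: write the current context into outgoing headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict) -> Optional[SpanContext]:
    """Callee side: read the caller's context, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


def start_span(parent: Optional[SpanContext]) -> SpanContext:
    """Reuse the parent's trace ID if there is one; always mint a new span ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return SpanContext(trace_id, secrets.token_hex(8), sampled)


# Service A starts a trace and calls service B over HTTP.
span_a = start_span(None)
outgoing_headers = {}
inject(span_a, outgoing_headers)

# Service B continues the same trace with its own span.
span_b = start_span(extract(outgoing_headers))
```

Both spans share one trace ID while keeping distinct span IDs, which is what lets a tracing backend stitch them into a single trace.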

51
MCQeasy

What is the primary purpose of structured logging?

A.To format logs in a consistent, machine-readable way for easier processing
B.To compress log files and reduce storage usage
C.To encrypt log data for security purposes
D.To send logs directly to the user's terminal
AnswerA

Correct. Structured logging uses formats like JSON to enable automated analysis.

Why this answer

Structured logging outputs logs in a consistent, machine-readable format (e.g., JSON) making it easier to parse, filter, and analyze log data.

52
MCQmedium

Which tool is primarily used for distributed tracing in cloud native environments?

A.Grafana
B.Fluentd
C.Jaeger
D.Prometheus
AnswerC

Jaeger is a distributed tracing tool.

Why this answer

Jaeger is a popular open-source distributed tracing system.

53
Multi-Selectmedium

Which THREE of the following are benefits of structured logging? (Select three.)

Select 3 answers
A.Easier querying and filtering
B.More human-readable than plain text
C.Reduced storage requirements
D.Machine-parseable output
E.Consistent field names across services
AnswersA, D, E

Fields can be indexed and queried.

Why this answer

Structured logging provides machine-parseable output, enables easier querying and analysis, and ensures consistent field naming. Human readability is not a primary benefit; structured logs can be less human-friendly than plain text.

54
MCQeasy

What is the purpose of Alertmanager in Prometheus?

A.Handle alert notifications
B.Visualize metrics
C.Store long-term metrics
D.Collect metrics from targets
AnswerA

Alertmanager manages alerts and sends notifications.

Why this answer

Alertmanager is the component in the Prometheus ecosystem responsible for handling alerts fired by the Prometheus server. It deduplicates, groups, and routes alerts to configured notification channels such as email, PagerDuty, or Slack, ensuring that operators receive actionable notifications without alert fatigue.

Exam trap

The trap here is that candidates confuse Alertmanager with Prometheus itself, thinking it collects or stores metrics, when in fact it is solely a notification routing and deduplication engine.

How to eliminate wrong answers

Option B is wrong because visualizing metrics is the role of Grafana or the Prometheus expression browser, not Alertmanager. Option C is wrong because long-term metrics storage is handled by remote storage integrations (e.g., Thanos, Cortex) or the Prometheus TSDB itself, not Alertmanager. Option D is wrong because collecting metrics from targets is the function of the Prometheus server via its scrape mechanism, not Alertmanager.

55
Multi-Selecthard

Which THREE are responsibilities of the OpenTelemetry project? (Select three.)

Select 3 answers
A.Visualize telemetry data
B.Store long-term telemetry data
C.Provide instrumentation libraries
D.Define a standard for telemetry data
E.Provide a vendor-agnostic Collector
AnswersC, D, E

Why this answer

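OpenTelemetry provides APIs and SDKs for instrumentation, a vendor-agnostic Collector for receiving, processing, and exporting telemetry, and specifications (including the OTLP protocol) that standardize telemetry data. Storing and visualizing that data is left to backends such as Jaeger, Prometheus, or Grafana.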

56
Multi-Selectmedium

Which TWO of the following are best practices for implementing observability in a cloud-native environment?

Select 2 answers
A.Store all raw observability data indefinitely for forensic analysis
B.Use only metrics and avoid logs to reduce complexity
C.Add unique request IDs to logs for end-to-end tracing correlation
D.Randomly sample all traces and logs to reduce storage
E.Use structured logging (e.g., JSON format) for easier automated parsing
AnswersC, E

Request IDs help correlate logs across microservices for tracing.

Why this answer

Option C is correct because adding unique request IDs (e.g., via OpenTelemetry trace IDs or custom correlation IDs) to logs enables end-to-end tracing across microservices. This allows operators to correlate a single user request as it traverses multiple services, which is essential for debugging distributed systems in a cloud-native environment.

Exam trap

CNCF often tests the misconception that 'more data is always better' (Option A) or that 'simplifying to one data type is efficient' (Option B), while the correct approach balances cost, performance, and diagnostic value through structured logging and correlation IDs.
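
The value of a request ID shows up at query time. The sketch below, which assumes hypothetical JSON log lines with `ts`, `service`, `request_id`, and `msg` fields, rebuilds the timeline of one request from interleaved logs of several services.

```python
import json
from collections import defaultdict


def group_by_request(lines):
    """Group JSON log lines by request_id, ordered by timestamp within each group."""
    grouped = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        grouped[entry["request_id"]].append(entry)
    for entries in grouped.values():
        entries.sort(key=lambda e: e["ts"])
    return dict(grouped)


# Hypothetical log lines as an aggregator might receive them, out of order.
LINES = [
    '{"ts": 3, "service": "payments", "request_id": "req-1", "msg": "card declined"}',
    '{"ts": 1, "service": "gateway", "request_id": "req-1", "msg": "POST /checkout"}',
    '{"ts": 2, "service": "gateway", "request_id": "req-2", "msg": "GET /health"}',
    '{"ts": 2, "service": "cart", "request_id": "req-1", "msg": "cart loaded"}',
]

# The path request "req-1" took through the system, in order.
timeline = [e["service"] for e in group_by_request(LINES)["req-1"]]
```

Without the shared ID, the `payments` error could not be tied back to the `gateway` request that triggered it.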

57
MCQeasy

A developer wants to monitor the health of a Kubernetes deployment by checking if the number of ready replicas matches the desired replicas. Which metric from kube-state-metrics should they query?

A.kube_deployment_status_replicas_ready
B.kube_deployment_spec_replicas
C.kube_node_status_condition
D.kube_pod_container_status_running
AnswerA

This metric shows ready replicas, enabling comparison with desired replicas.

Why this answer

Option A is correct because `kube_deployment_status_replicas_ready` directly exposes the number of ready replicas for a Deployment, which can be compared against `kube_deployment_spec_replicas` to determine if the desired state matches the actual healthy state. This metric is emitted by kube-state-metrics, which generates Prometheus-compatible metrics from Kubernetes API objects, making it the standard choice for monitoring Deployment health.

Exam trap

The trap here is that candidates might confuse metrics that show pod state (like `kube_pod_container_status_running`) with Deployment-level readiness, not realizing that a pod can be running but not ready, and that the correct metric must reflect the Deployment's own status field.

How to eliminate wrong answers

Option B is wrong because `kube_deployment_spec_replicas` only shows the desired number of replicas as defined in the Deployment spec, not the actual ready count, so it cannot alone indicate health. Option C is wrong because `kube_node_status_condition` tracks node-level conditions (e.g., Ready, DiskPressure) and has no relation to Deployment replica health. Option D is wrong because `kube_pod_container_status_running` counts containers in Running state, not ready replicas of a Deployment, and does not account for readiness probes or desired replica counts.
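
The comparison described above can be sketched in a few lines. In PromQL it would typically be written as `kube_deployment_status_replicas_ready < kube_deployment_spec_replicas`; the Python below mirrors that logic on hypothetical sample values keyed by `(namespace, deployment)`, and is only an illustration of the check, not a client for Prometheus.

```python
def unhealthy_deployments(spec_replicas, ready_replicas):
    """Return {key: (ready, desired)} for deployments with fewer ready than desired replicas."""
    result = {}
    for key, desired in spec_replicas.items():
        ready = ready_replicas.get(key, 0)  # a missing series is treated as zero ready
        if ready < desired:
            result[key] = (ready, desired)
    return result


# Hypothetical values of kube_deployment_spec_replicas
spec = {("shop", "checkout"): 3, ("shop", "cart"): 2}
# Hypothetical values of kube_deployment_status_replicas_ready
ready = {("shop", "checkout"): 1, ("shop", "cart"): 2}

degraded = unhealthy_deployments(spec, ready)
```

Here only `checkout` is reported, because it has one ready replica out of three desired.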

58
MCQmedium

A team wants to ensure that at least 99.9% of all requests to their application complete within 500ms over a 30-day window. How should this requirement be classified?

A.Service Level Agreement (SLA)
B.Service Level Objective (SLO)
C.Service Level Indicator (SLI)
D.Key Performance Indicator (KPI)
AnswerB

Correct. This is an internal target for reliability.

Why this answer

An SLO is a target level of reliability, expressed as a percentage of a metric over a time window.
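
The relationship between the terms can be shown with a small calculation: the SLI is what is measured, the SLO is the target it is compared against, and the error budget is what the target leaves over. This is a minimal sketch with hypothetical request counts; the `Slo` class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A target over a window, e.g. 99.9% of requests within 500 ms over 30 days."""
    target: float        # required fraction of good requests, e.g. 0.999
    threshold_ms: float  # latency bound that makes a request "good"
    window_days: int


def sli(latencies_ms, threshold_ms):
    """SLI: the measured fraction of requests that completed within the threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(slo, total_requests, bad_requests):
    """Bad requests the SLO still tolerates in this window (negative means the SLO is missed)."""
    allowed = (1 - slo.target) * total_requests
    return allowed - bad_requests


slo = Slo(target=0.999, threshold_ms=500, window_days=30)

measured = sli([100, 200, 600, 450], slo.threshold_ms)          # 3 of 4 requests are good
remaining = error_budget_remaining(slo, 1_000_000, 400)         # about 600 bad requests left
meets_slo = remaining >= 0
```

An SLA would wrap the same kind of target in a contract with consequences, which is why the requirement in this question is an SLO and not an SLA.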

59
MCQhard

A company uses Prometheus for monitoring and wants to alert when the average CPU usage over 5 minutes exceeds 80%. Which PromQL query would correctly define this alert rule?

A.avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
B.avg(node_cpu_seconds_total{mode!="idle"}[5m]) > 0.8
C.avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
D.sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8
AnswerC

Correct. This calculates the average non-idle (usage) rate over 5 minutes and checks if >80%.

Why this answer

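Because `node_cpu_seconds_total` is a counter, it must be wrapped in `rate()` to obtain per-second CPU time; averaging the non-idle rate over 5 minutes and comparing it to 0.8 (80%) is the intended answer. Option A measures idle time rather than usage, option B applies `avg()` to a range vector without `rate()`, which is not valid PromQL, and option D sums instead of averaging. In practice, utilization is more commonly computed as `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))`, because non-idle time is split across several modes per CPU.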
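
Why the counter needs `rate()` is easier to see with numbers. The sketch below uses hypothetical samples of the idle counter for two cores and a deliberately simplified `rate` (first and last sample only, no counter-reset handling or extrapolation, unlike PromQL), then derives utilization as one minus the average idle rate.

```python
def rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_utilization(idle_samples_per_core):
    """Average utilization across cores: 1 minus the mean idle rate."""
    idle_rates = [rate(samples) for samples in idle_samples_per_core]
    return 1 - sum(idle_rates) / len(idle_rates)


# Hypothetical idle-mode counter samples over a 300-second window.
core0 = [(0, 1000.0), (300, 1030.0)]  # idle for 30 s of 300 s
core1 = [(0, 2000.0), (300, 2090.0)]  # idle for 90 s of 300 s

utilization = cpu_utilization([core0, core1])
should_alert = utilization > 0.8
```

The raw counter values (1030, 2090) are cumulative seconds and say nothing about current load on their own; only the per-second change over the window yields a fraction that can be compared with 0.8.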

60
Multi-Selectmedium

Which TWO are components of a distributed trace? (Select two.)

Select 2 answers
A.Metrics
B.Traces
C.Spans
D.Logs
E.Alerts
AnswersB, C

Why this answer

A trace consists of spans, which are individual units of work.

61
MCQmedium

Which of the following is true about Prometheus's pull-based model for collecting metrics?

A.Targets push metrics to Prometheus
B.Prometheus only collects metrics from Kubernetes API server
C.Prometheus scrapes metrics from HTTP endpoints
D.Prometheus stores metrics in a relational database
AnswerC

Prometheus pulls (scrapes) metrics from targets' /metrics endpoints.

Why this answer

Prometheus pulls metrics from targets at regular intervals, which is the pull-based model.
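
What a target serves on its `/metrics` endpoint is plain text. The following sketch renders samples in the Prometheus text exposition format; the `render_metrics` helper and the metric shown are hypothetical, and label-value escaping is omitted for brevity. In real applications an official client library normally does this work.

```python
def render_metrics(name, help_text, metric_type, samples):
    """Render one metric family as Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_metrics(
    "http_requests_total",
    "Total HTTP requests.",
    "counter",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
)
```

Prometheus fetches this body over HTTP on each scrape interval, so the target only needs to expose its current values and never has to know where Prometheus runs.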

62
MCQhard

A Prometheus alert rule fires when the error rate exceeds 5% for 5 minutes. The alert is sent to Alertmanager. What must be configured in Alertmanager to ensure the alert is deduplicated, grouped, and routed to the correct team?

A.An inhibition rule
B.A recording rule
C.A silence rule
D.A route configuration
AnswerD

Routes in Alertmanager define grouping, deduplication, and which receiver to use.

Why this answer

Alertmanager uses routes to match alerts and receivers to send notifications. Routes define grouping and routing logic.
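
The three behaviours named in the question can be imitated with a toy model. This is not Alertmanager's implementation or configuration format: real routes form a tree with more matching options, and the route table, receiver names, and `GROUP_BY` labels below are hypothetical. The sketch only shows what deduplicating, grouping, and routing mean for a batch of alerts.

```python
from collections import defaultdict

# Hypothetical flat route table: the first entry whose labels all match wins.
ROUTES = [
    ({"team": "payments"}, "payments-pager"),
    ({"severity": "warning"}, "ops-slack"),
]
DEFAULT_RECEIVER = "ops-email"
GROUP_BY = ("alertname", "service")


def pick_receiver(labels):
    """Routing: choose the receiver of the first matching route."""
    for matchers, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER


def notifications(alerts):
    """Deduplicate identical alerts, then group them per receiver and GROUP_BY labels."""
    unique = {tuple(sorted(a.items())) for a in alerts}  # deduplication
    groups = defaultdict(list)
    for item in sorted(unique):
        labels = dict(item)
        key = (pick_receiver(labels),) + tuple(labels.get(k, "") for k in GROUP_BY)
        groups[key].append(labels)                        # grouping
    return dict(groups)


ALERTS = [
    {"alertname": "HighErrorRate", "service": "checkout", "team": "payments", "instance": "pod-a"},
    {"alertname": "HighErrorRate", "service": "checkout", "team": "payments", "instance": "pod-a"},
    {"alertname": "HighErrorRate", "service": "checkout", "team": "payments", "instance": "pod-b"},
    {"alertname": "DiskFilling", "service": "logs", "severity": "warning", "instance": "node-1"},
]

result = notifications(ALERTS)
```

Four incoming alerts collapse into two notifications: one for the payments team covering both affected pods, and one for the ops channel.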

63
MCQmedium

A company deploys a microservice application on Kubernetes. They notice that one of the services is returning 5xx errors intermittently. Which observability tool should they use to correlate the errors with resource usage across all pods of that service?

A.Prometheus
B.Grafana
C.Fluentd
D.Jaeger
AnswerA

Prometheus collects metrics and can correlate error rates with resource usage via labels.

Why this answer

Prometheus is the correct choice because it is a monitoring and alerting toolkit designed to collect and store time-series metrics, such as CPU, memory, and request error rates. By querying Prometheus with PromQL, you can correlate 5xx error spikes with resource usage across all pods of a service, as it scrapes metrics from each pod's /metrics endpoint. This direct correlation of application-level errors with infrastructure metrics is not natively provided by the other tools listed.

Exam trap

CNCF often tests the distinction between observability pillars (metrics, logs, traces) and their specific tools, so the trap here is confusing Grafana (a visualization layer) with Prometheus (a metrics backend) or assuming Jaeger (tracing) can correlate resource usage metrics.

How to eliminate wrong answers

Option B (Grafana) is wrong because Grafana is a visualization and dashboarding tool, not a metrics collector; it cannot collect or store metrics on its own and relies on Prometheus or other backends for data. Option C (Fluentd) is wrong because Fluentd is a log collector and forwarder, focused on log data rather than time-series metrics or resource usage. Option D (Jaeger) is wrong because Jaeger is a distributed tracing tool for tracking request paths across services, not for correlating error rates with resource usage metrics like CPU or memory.

64
MCQhard

A company uses OpenTelemetry to instrument their microservices. They want to ensure that traces from one service can be correlated with those from another service across network calls. Which OpenTelemetry concept enables this correlation?

A.Exporter configuration
B.Span attributes
C.Context propagation
D.Sampling
AnswerC

Context propagation carries trace IDs and other context across service boundaries.

Why this answer

Context propagation allows trace context to be passed between services, enabling distributed tracing correlation.

65
MCQeasy

What does the 'kubectl logs' command retrieve?

A.Audit logs
B.Cluster events
C.Container logs
D.Node logs
AnswerC

kubectl logs shows the logs of a single container.

Why this answer

kubectl logs fetches the standard output and standard error logs from a container in a pod.

66
MCQeasy

Which tool is commonly used for log aggregation in Kubernetes and is designed to be lightweight?

A.Fluent Bit
B.Jaeger
C.Prometheus
D.Grafana
AnswerA

Fluent Bit is a lightweight log processor for log aggregation.

Why this answer

Fluent Bit is a lightweight log processor and forwarder, often used as a DaemonSet to collect logs.

67
MCQhard

You are an SRE managing a Kubernetes cluster with 200 nodes and 10,000 pods. The cluster runs a critical payment processing application. Users report that transactions are occasionally failing with a 'timeout' error. You have Prometheus and Grafana set up for monitoring, and you use Fluentd with Elasticsearch for logging. You notice that during peak hours, the CPU usage of the payment service pods spikes to 90%, but memory usage remains stable. The pod restart count is low. You also see that the response time of the payment service increases significantly during these spikes. You need to identify the root cause and propose a fix. Which course of action is most appropriate?

A.Add more replicas of the payment service to distribute the load
B.Increase the memory limits for the payment service pods to improve caching
C.Implement a circuit breaker pattern to fail fast and avoid timeouts
D.Increase the CPU limits for the payment service pods to allow more CPU resources during spikes
AnswerD

This directly addresses the CPU bottleneck, reducing response time.

Why this answer

Option D is correct because the CPU usage spikes to 90% during peak hours, indicating that the payment service pods are CPU-bound. Increasing CPU limits allows the pods to burst and utilize more CPU resources, reducing response times and preventing timeouts. This directly addresses the bottleneck without adding unnecessary replicas or changing memory settings.

Exam trap

CNCF often tests the misconception that scaling replicas always solves performance issues, but here the bottleneck is per-pod CPU limits, not overall load distribution.

How to eliminate wrong answers

Option A is wrong because adding more replicas does not solve the root cause of CPU starvation; it may spread the load but each pod still faces the same CPU limit, and the issue is per-pod CPU saturation, not overall cluster capacity. Option B is wrong because memory usage is stable, so increasing memory limits does not address the CPU bottleneck and could waste resources. Option C is wrong because a circuit breaker pattern handles failures gracefully but does not fix the underlying performance issue; it would only mask the timeouts by failing fast, not reduce the actual response time.

68
Multi-Selectmedium

Which TWO are pillars of observability? (Select two.)

Select 2 answers
A.SLIs
B.Alerting
C.Logs
D.Metrics
E.Dashboards
AnswersC, D

Why this answer

Logs and Metrics are two of the three pillars of observability (alongside Traces). Logs provide immutable, timestamped records of discrete events, while Metrics are numeric aggregations of data over time (e.g., Prometheus counters, histograms). Together they form the foundation for understanding system behavior in cloud-native environments.

Exam trap

CNCF often tests the distinction between the pillars of observability (Logs, Metrics, Traces) and the tools or outputs derived from them (e.g., SLIs, Alerting, Dashboards), leading candidates to confuse operational practices with foundational data types.

69
MCQmedium

Which open-source project provides a unified standard for collecting and exporting telemetry data (metrics, logs, and traces) from applications?

A.Prometheus
B.OpenTelemetry
C.Jaeger
D.Fluentd
AnswerB

Correct. OpenTelemetry is a unified standard for metrics, logs, and traces.

Why this answer

OpenTelemetry (OTel) is the industry standard for observability data collection and export, providing vendor-agnostic instrumentation.

70
Multi-Selectmedium

Which THREE of the following are valid use cases for distributed tracing in a microservices architecture?

Select 3 answers
A.Monitoring CPU and memory usage of each service instance
B.Understanding the dependency graph between microservices
C.Pinpointing the root cause of an error in a distributed transaction
D.Identifying which service contributes the most latency to an end-user request
E.Capturing detailed error messages and stack traces
AnswersB, C, D

Traces reveal service call relationships.

Why this answer

Distributed tracing is designed to track the flow of a single request across multiple microservices, recording timing and causality. Option B is correct because tracing systems like Jaeger or Zipkin automatically build a dependency graph by analyzing the parent-child relationships between spans, which reveals how services interact. This is a core use case for understanding service topology and identifying bottlenecks in a distributed system.

Exam trap

CNCF often tests the distinction between observability pillars (metrics, logs, traces) and expects candidates to recognize that distributed tracing is not a catch-all for monitoring or logging tasks, so the trap is confusing request-level tracing with infrastructure metrics or detailed error logging.
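
Two of the use cases above, the dependency graph and the latency breakdown, fall out of the parent-child links between spans. The sketch below works on a hypothetical four-span trace; the `Span` class and service names are invented for the example, and the self-time calculation assumes child spans of the same parent do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int


def dependency_edges(spans):
    """Service-to-service call edges derived from parent-child span links."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent and parent.service != s.service:
            edges.add((parent.service, s.service))
    return edges


def self_time_by_service(spans):
    """Time spent in each service itself: span duration minus its direct children."""
    child_time = {}
    for s in spans:
        if s.parent_id is not None:
            child_time[s.parent_id] = child_time.get(s.parent_id, 0) + (s.end_ms - s.start_ms)
    totals = {}
    for s in spans:
        own = (s.end_ms - s.start_ms) - child_time.get(s.span_id, 0)
        totals[s.service] = totals.get(s.service, 0) + own
    return totals


# Hypothetical trace of one checkout request.
TRACE = [
    Span("a", None, "gateway", 0, 100),
    Span("b", "a", "cart", 10, 30),
    Span("c", "a", "payments", 30, 95),
    Span("d", "c", "payments-db", 40, 90),
]

edges = dependency_edges(TRACE)
self_time = self_time_by_service(TRACE)
slowest = max(self_time, key=self_time.get)
```

In this trace the database behind `payments` accounts for half of the 100 ms request, which neither CPU metrics nor individual log lines would show directly.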

71
MCQmedium

Which command retrieves logs from a specific container named 'sidecar' in a multi-container pod?

A.kubectl logs pod-name sidecar
B.kubectl logs pod-name --container sidecar
C.kubectl logs pod-name -c sidecar
D.kubectl logs sidecar pod-name
AnswerC

Correct syntax for specifying a container.

Why this answer

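The -c flag specifies the container name. Note that --container is the long form of the same flag, so kubectl accepts option B as well; option C is the short form this question expects.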

72
MCQeasy

What does SLA stand for in the context of service reliability?

A.Service Level Agreement
B.Service Level Indicator
C.Service Level Availability
D.Service Level Objective
AnswerA

Correct.

Why this answer

SLA stands for Service Level Agreement, a contract specifying expected service level.

73
Multi-Selecthard

Which TWO of the following are best practices for structuring log output in cloud-native applications to maximize observability?

Select 2 answers
A.Include verbose debug-level information in every log line
B.Use multi-line log entries for detailed error information
C.Output logs in structured format such as JSON
D.Include a unique request or correlation ID in each log entry
E.Avoid timestamps to reduce log size
AnswersC, D

Structured logs are machine-parseable and easily ingested by log aggregators.

Why this answer

Option C is correct because structured logging (e.g., JSON) enables automated parsing, filtering, and querying by log aggregation tools like Fluentd, Logstash, or cloud-native observability backends (e.g., Elasticsearch, Loki). This format ensures each log entry has consistent key-value pairs, making it machine-readable and facilitating correlation across distributed services without manual text parsing.

Exam trap

CNCF often tests the misconception that 'more detail is better' (Option A) or that 'human readability' (Option B) is the priority, when in cloud-native observability, machine-parseable, single-line structured logs are the standard for scalability and automation.

74
MCQmedium

A developer wants to view the logs of a specific container named 'sidecar' inside a pod named 'app-pod'. Which command should they use?

A.kubectl log app-pod --container sidecar
B.kubectl logs app-pod sidecar
C.kubectl logs -c sidecar app-pod
D.kubectl logs app-pod -c sidecar
AnswerD

This command correctly uses the -c flag to select the container.

Why this answer

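The -c flag specifies the container name. The correct command is 'kubectl logs app-pod -c sidecar'. Option A fails because the subcommand is 'logs', not 'log'. kubectl also accepts flags before the pod name, so option C would run too, but option D is the conventional form this question expects.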

75
MCQhard

A platform team is designing a monitoring strategy for a multi-tenant Kubernetes cluster. Each tenant runs workloads in separate namespaces. The team needs to ensure tenant isolation while providing aggregated cluster-wide dashboards. Which approach best meets these requirements?

A.Deploy a single Prometheus instance with namespace labels on all metrics
B.Use a global Prometheus with recording rules to aggregate per-namespace metrics
C.Have each tenant deploy their own monitoring stack and view separately
D.Deploy a Prometheus instance per tenant and use Thanos to aggregate metrics globally
AnswerD

Per-tenant Prometheus ensures isolation, and Thanos sidecar allows secure global aggregation with proper RBAC.

Why this answer

Option D is correct because deploying a Prometheus instance per tenant enforces strong tenant isolation by preventing cross-tenant metric access and resource contention, while Thanos provides the global view: a Thanos Querier fans out queries to the sidecar running next to each tenant's Prometheus and merges the results. This approach satisfies both isolation and aggregated dashboards without compromising security or scalability.

Exam trap

CNCF often tests the misconception that namespace labels alone provide sufficient isolation, but in practice, labels do not enforce access control or resource boundaries, making a single Prometheus instance a security and reliability risk in multi-tenant clusters.

How to eliminate wrong answers

Option A is wrong because a single Prometheus instance with namespace labels does not enforce tenant isolation; any user with access to Prometheus can query all namespaces, and a misconfigured or malicious tenant could overload the instance, affecting others. Option B is wrong because a global Prometheus with recording rules still runs a single instance, failing to isolate tenant workloads and creating a single point of failure and performance bottleneck. Option C is wrong because having each tenant deploy their own monitoring stack and view separately prevents the team from creating aggregated cluster-wide dashboards, as there is no unified query layer to combine metrics across tenants.
