This chapter covers monitoring AKS clusters using Container Insights and Prometheus, two complementary tools that provide deep observability into containerized workloads. For the AZ-104 exam, this topic falls under Compute objective 3.3 (Monitor AKS). Approximately 5-10% of exam questions touch on AKS monitoring, and candidates often mix up the capabilities of each tool. By the end of this chapter, you will understand exactly what each tool monitors, how to enable them, and how to interpret their data to troubleshoot performance and health issues.
Imagine you manage a massive delivery fleet of 500 trucks (your AKS cluster nodes). Each truck has dozens of packages (containers) that are constantly being loaded, unloaded, and moved between trucks. To monitor this chaotic operation, you install two systems:
A dashboard camera in every truck (the Log Analytics agent for Container Insights). This camera records everything: engine RPM (CPU), fuel level (memory), tire pressure (disk I/O), and the exact time each package is handled. The footage is streamed to a central monitoring room (Log Analytics workspace) where a team (Azure Monitor) analyzes trends, sets alerts (e.g., 'if any truck's engine temp exceeds 100°C for 5 minutes, page the mechanic'), and generates reports.
A separate telemetry system (Prometheus) that polls each truck's engine computer (node) and package tracking devices (pods) for high-frequency metrics such as 'packages processed per second' or 'engine vibrations' (custom application metrics). This data is stored in a time-series database (a Prometheus server on the cluster or Azure Monitor managed service for Prometheus). Unlike the camera footage, this is structured, numerical data that can be queried instantly with PromQL.
The camera system (Container Insights) gives you broad, out-of-the-box visibility — you see everything but at a coarser granularity. The telemetry system (Prometheus) gives you precise, real-time performance data that you can tailor to your specific packages and routes. Together, they provide a complete operational picture. If a truck breaks down, you replay the camera footage to see the context, then check the telemetry to pinpoint the exact second the vibration spiked.
1. What is Container Insights and Why It Exists
Container Insights is a feature of Azure Monitor that collects performance metrics, inventory data, and logs from containerized workloads running on AKS, AKS-engine, and Azure Container Instances. It solves the problem of 'black box' containers: without it, you would have to SSH into nodes, run kubectl top or docker stats, and manually correlate data across clusters. Container Insights automates this by deploying a containerized Log Analytics agent (the omsagent) as a DaemonSet on each node. The agent collects:
Node-level metrics: CPU, memory, disk, network usage.
Pod-level metrics: CPU and memory per container, restarts, status.
Kubernetes object data: Counts of pods, deployments, services, and their states.
Container logs: stdout/stderr from containers, collected via the agent and sent to Log Analytics.
Live data: Real-time streaming of pod and node performance via the 'Live Data' feature (uses the Kubernetes API).
The data is stored in a Log Analytics workspace, where you can query it using Kusto Query Language (KQL), create alerts, and visualize with workbooks. Container Insights is enabled per cluster and requires the Microsoft.OperationsManagement and Microsoft.OperationalInsights resource providers to be registered.
2. How Container Insights Works Internally
When you enable Container Insights on an AKS cluster, the following happens:
1. Agent Deployment: Azure deploys a DaemonSet called omsagent in the kube-system namespace. The agent runs as a privileged container on each node (required to access cgroups and the Docker socket). It also deploys a replica set for data aggregation and a ConfigMap for configuration.
2. Metrics Collection: The agent scrapes metrics from:
- cAdvisor: A built-in Kubernetes component that exposes container resource usage via the kubelet API. The agent queries /stats/summary on the kubelet every 60 seconds by default.
- Kubelet API: For node-level metrics like node_cpu_usage, node_memory_working_set.
- Kubernetes API: For inventory data (pods, deployments, namespaces) every 60 seconds.
3. Log Collection: The agent collects container logs by tailing the Docker log files (JSON files on the host) and sends them to Log Analytics. By default, it collects logs from all containers, but you can filter by namespace, pod, or container using the ConfigMap.
4. Data Flow: The agent sends data to the Log Analytics workspace over HTTPS on port 443. The data is stored in custom log tables:
- Perf: Performance metrics (CPU, memory, disk, network).
- ContainerInventory: Container lifecycle events.
- ContainerLog: Container stdout/stderr logs.
- KubePodInventory: Pod metadata.
- InsightsMetrics: Additional metrics like node filesystem usage.
5. Live Data: When you use the 'Live Data' feature in the portal, the agent establishes a direct WebSocket connection to the Kubernetes API server (using your credentials) to stream real-time events. This is not stored in Log Analytics.
Key Timers and Defaults:
Metric collection interval: 60 seconds (configurable via ConfigMap).
Log collection: Near real-time, but there is a 5-10 second delay.
Data retention: Determined by Log Analytics workspace retention (default 30 days, configurable up to 730 days).
Agent resource limits: Default CPU limit 150m, memory 400MB per node (can be overridden).
3. What is Prometheus and Why It Exists
Prometheus is an open-source monitoring and alerting toolkit originally built by SoundCloud. It is now a CNCF-graduated project and the de facto standard for Kubernetes monitoring. Prometheus excels at collecting high-dimensional, time-series data from targets that expose metrics via HTTP endpoints (e.g., /metrics). It is pull-based: the Prometheus server scrapes metrics from targets at configurable intervals. For AKS, you can deploy Prometheus yourself (using the Prometheus Operator or Helm chart) or use Azure Monitor managed service for Prometheus.
The exam focuses on the integration of Prometheus with Azure Monitor via Container Insights. You can enable Prometheus metrics scraping in Container Insights by modifying the agent ConfigMap to add scrape targets. This allows you to collect Prometheus-formatted metrics from your applications and send them to the Log Analytics workspace, where they are stored in the InsightsMetrics table.
4. How Prometheus Works Internally (in the Context of AKS)
When you configure Container Insights to scrape Prometheus metrics:
ConfigMap Update: You edit the container-azm-ms-agentconfig ConfigMap in the kube-system namespace. You add a prometheus-data-collection-settings section that specifies target endpoints, scrape intervals, and metric filtering.
Agent Behavior: The omsagent DaemonSet picks up the ConfigMap changes (no restart needed) and starts scraping the specified targets. Each agent scrapes targets from its own node (e.g., node exporters, kubelet) and can also scrape cluster-level targets (e.g., API server, coredns) via a designated aggregator pod.
Data Transformation: The agent converts Prometheus metrics into the InsightsMetrics table format. Each metric becomes a row with columns: TimeGenerated, SourceSystem, Computer (node), Origin (e.g., Prometheus), Namespace, Name, Val, Tags (a JSON string of labels).
Querying: You query Prometheus metrics using KQL. For example, to get average CPU usage from a Prometheus metric:
InsightsMetrics
| where Namespace == "prometheus"
| where Name == "container_cpu_usage_seconds_total"
| summarize avg(Val) by bin(TimeGenerated, 1m), Computer
Key Prometheus Concepts:
- Metric types: Counter (cumulative), Gauge (current value), Histogram (buckets), Summary (quantiles).
- Labels: Key-value pairs that add dimensions to metrics (e.g., pod, namespace, method).
- Scrape interval: Default 60 seconds, configurable per target.
- Scrape timeout: Default 10 seconds.
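To make the data transformation step concrete, here is a minimal Python sketch of how one sample in the Prometheus text format could be reshaped into a row with the InsightsMetrics columns listed above. It is an illustration only, not the agent's actual code: the function name `to_insights_row` and the regular expressions are assumptions made for this example, and only a subset of the columns is filled in.

```python
import json
import re
from datetime import datetime, timezone

# One sample line of the Prometheus text format: name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def to_insights_row(line, computer, now=None):
    """Reshape one Prometheus sample into an InsightsMetrics-style dict (hypothetical helper)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    # Labels become the JSON string held in the Tags column.
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    now = now or datetime.now(timezone.utc)
    return {
        "TimeGenerated": now.isoformat(),
        "Computer": computer,
        "Namespace": "prometheus",
        "Name": match.group("name"),
        "Val": float(match.group("value")),
        "Tags": json.dumps(labels, sort_keys=True),
    }


row = to_insights_row('http_requests_total{method="GET",code="200"} 1027', "aks-node-0")
print(row["Name"], row["Val"], row["Tags"])
```

Because labels end up in a single JSON string, filtering by pod or namespace in KQL means parsing the Tags column first, which is the main practical difference from querying labels directly in PromQL.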
5. Configuring Prometheus Scraping in Container Insights
To enable Prometheus scraping, you must create or edit the ConfigMap container-azm-ms-agentconfig with the following structure:
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  prometheus-data-collection-settings: |-
    [cluster_level_targets]
      # Targets that are accessible from any node
      targets = ["https://kubernetes.default.svc:443/metrics"]
      interval = "60s"
    [node_level_targets]
      # Targets per node
      targets = ["http://<node-ip>:9100/metrics"]
      interval = "30s"
You can also use a ConfigMap from a URL or apply it directly via kubectl apply -f. After applying, check the agent logs for errors.
6. Comparison: Container Insights vs. Prometheus
Container Insights is a managed, out-of-the-box monitoring solution that requires no configuration to start collecting basic metrics. Prometheus is a more powerful but complex tool that requires setup of exporters and scrape targets. The exam tests when to use each:
Use Container Insights alone: When you need basic cluster health monitoring with minimal configuration. You get CPU, memory, network, and logs without any additional setup.
Add Prometheus scraping: When your application exposes custom metrics (e.g., request latency, queue depth) or you need high-cardinality data (e.g., per-URL metrics).
Use managed Prometheus: When you want to avoid managing a Prometheus server but still need PromQL queries and alerting. Azure Monitor managed Prometheus integrates with Container Insights, but the exam focuses on the self-managed approach within Container Insights.
7. Troubleshooting and Verification
To verify Container Insights is working:
kubectl get ds omsagent --namespace=kube-system
# Should show desired = ready = number of nodes
Check agent logs:
kubectl logs -n kube-system -l app=omsagent
Check if metrics are flowing to Log Analytics: run a query in the Log Analytics workspace:
Perf
| where TimeGenerated > ago(5m)
| count
For Prometheus scraping, check the ConfigMap:
kubectl get configmap container-azm-ms-agentconfig -o yaml -n kube-system
If Prometheus metrics are not appearing, check the agent logs for scrape errors:
kubectl logs -n kube-system -l app=omsagent --tail=50 | grep "prometheus"
8. Interaction with Azure Monitor and Alerts
Container Insights data flows into Azure Monitor, which means you can:
Create metric alerts based on CPU or memory thresholds (e.g., 'if average CPU > 80% for 5 minutes, send email').
Create log alerts using KQL queries (e.g., 'if number of OOMKilled pods > 0 in 10 minutes, trigger webhook').
Use Azure Workbooks to build custom dashboards.
Use Azure Managed Grafana to visualize Prometheus metrics.
The exam may ask about alert rules: you must specify the 'Resource' as the Log Analytics workspace, not the AKS cluster directly, for log alerts based on Container Insights data.
9. Cost Considerations
Container Insights costs are based on:
- Log Analytics data ingestion: The amount of data sent to the workspace (GB/month). Container logs and performance metrics count towards this.
- Data retention: Beyond 31 days (for Perf and ContainerLog) incurs additional cost.
- Prometheus scraping: No extra cost for the agent, but the additional metrics increase ingestion volume.
- Live Data: Free (uses existing Kubernetes API bandwidth).
Default data retention is 30 days for the workspace. You can change it per table (e.g., retain Perf for 90 days, ContainerLog for 30 days).
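Since cost follows ingestion volume, a rough back-of-the-envelope estimate helps show why the scrape interval and the number of time series matter. The Python sketch below is an approximation under stated assumptions: the function name `monthly_ingestion_gb` is made up for this example, and the 500 bytes per stored row is a placeholder, not a published figure. Measure the real row size in your own workspace before relying on the result.

```python
def monthly_ingestion_gb(series_count, scrape_interval_seconds,
                         bytes_per_row=500, days=30):
    """Estimate monthly ingestion for scraped metrics.

    bytes_per_row is an assumed average size of one stored row, used only
    for illustration.
    """
    samples_per_day = 86_400 / scrape_interval_seconds
    total_bytes = series_count * samples_per_day * bytes_per_row * days
    return total_bytes / 1_000_000_000


# 10,000 distinct time series (metric name plus label combination)
at_60s = monthly_ingestion_gb(10_000, 60)
at_15s = monthly_ingestion_gb(10_000, 15)
print(f"60s interval: {at_60s:.0f} GB/month, 15s interval: {at_15s:.0f} GB/month")
```

Whatever the true row size turns out to be, the ratio holds: scraping every 15 seconds ingests four times as much as scraping every 60 seconds, and each extra label value multiplies the series count.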
10. Security and Permissions
The omsagent requires:
- Privileged container access: To read Docker socket and cgroups.
- Kubernetes RBAC: A ClusterRole bound to the agent's service account that grants read access to Kubernetes API objects. Read-only access is sufficient; cluster-admin is not required.
- Log Analytics workspace key: Stored as a Kubernetes secret in the cluster.
On the exam, remember that Container Insights cannot be enabled on a cluster that is private (no outbound internet access) unless you use a Log Analytics gateway or enable forced tunneling with a firewall. The agent must be able to reach the Log Analytics endpoint (e.g., <workspace-id>.ods.opinsights.azure.com).
Enable Container Insights on AKS
In the Azure portal, navigate to your AKS cluster and select 'Insights' under 'Monitoring'. Click 'Enable' to deploy the omsagent DaemonSet. Alternatively, use the Azure CLI: `az aks enable-addons -a monitoring -n <cluster-name> -g <resource-group> --workspace-resource-id <workspace-id>`. This command creates a Log Analytics workspace if not specified, registers the necessary resource providers, and deploys the agent. The agent downloads the container image `mcr.microsoft.com/azuremonitor/containerinsights/ciprod:latest` and starts collecting metrics within 5 minutes.
Verify Agent Deployment and Data Flow
Run `kubectl get pods -n kube-system | grep omsagent` to confirm the agent pods are running (one per node). Then check the Log Analytics workspace by running a simple KQL query: `Perf | where TimeGenerated > ago(1h) | count`. You should see a non-zero count. If not, check agent logs with `kubectl logs -n kube-system ds/omsagent`. Common issues include network restrictions (agent cannot reach workspace) or misconfigured workspace key.
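The "one agent pod per node" check can also be scripted. The following Python sketch reads the JSON form of the DaemonSet and compares the `desiredNumberScheduled` and `numberReady` status fields, which are standard Kubernetes DaemonSet status fields. The helper name `daemonset_is_healthy` is an assumption for this example.

```python
import json


def daemonset_is_healthy(ds_json):
    """Return (healthy, desired, ready) for a DaemonSet given as a JSON string."""
    status = json.loads(ds_json).get("status", {})
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    # Healthy means at least one pod is expected and every expected pod is ready.
    healthy = desired > 0 and desired == ready
    return (healthy, desired, ready)


# Sample input shaped like the status block of a 3-node cluster with one agent not ready.
sample = '{"status": {"desiredNumberScheduled": 3, "numberReady": 2}}'
healthy, desired, ready = daemonset_is_healthy(sample)
print(f"healthy={healthy} desired={desired} ready={ready}")
```

In practice the input would come from `kubectl get ds omsagent -n kube-system -o json`, for example by saving the output to a file and passing its contents to the function.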
Configure Prometheus Scraping via ConfigMap
Create a ConfigMap YAML file with the `prometheus-data-collection-settings` section. For example, to scrape the kubelet metrics endpoint, add: `[node_level_targets]` with `targets = ["https://<node-ip>:10250/metrics"]` and `interval = "30s"`. Note that the kubelet serves port 10250 over HTTPS and requires authentication, so a plain HTTP request to that port will fail. Apply with `kubectl apply -f configmap.yaml`. The agent picks up changes automatically within 5 minutes. Verify by querying `InsightsMetrics | where Namespace == "prometheus" | count`.
Query Metrics with Kusto (KQL)
In the Log Analytics workspace, use KQL to analyze data. For example, to get average pod CPU usage over time: `Perf | where ObjectName == "K8SContainer" and CounterName == "cpuUsageNanoCores" | summarize avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName`. For Prometheus metrics: `InsightsMetrics | where Name == "container_cpu_usage_seconds_total" | project TimeGenerated, Computer, Val, Tags`. Use the `Tags` column to filter by pod labels.
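One caution when reading `container_cpu_usage_seconds_total`: it is a counter, so each `Val` is a running total of CPU seconds rather than a usage level, and averaging the raw values says little. What you usually want is the rate of change between consecutive samples. The Python sketch below shows that calculation on plain tuples; the function name `per_second_rates` is hypothetical, and the reset handling assumes the counter restarts from zero when a container restarts.

```python
def per_second_rates(samples):
    """Convert cumulative counter samples into per-second rates.

    samples: list of (unix_seconds, counter_value) tuples sorted by time.
    Returns one (unix_seconds, rate) tuple per interval between samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        # A drop means the counter was reset, so the new value is the whole increase.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / elapsed))
    return rates


# Samples taken every 60 seconds; the counter resets before the third sample.
samples = [(0, 100.0), (60, 130.0), (120, 10.0), (180, 40.0)]
for timestamp, rate in per_second_rates(samples):
    print(timestamp, round(rate, 3))
```

For a CPU counter measured in seconds, a rate of 0.5 means the container used half a core on average over that interval.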
Set Up Alerts Based on Container Insights
In Azure Monitor, create a new alert rule. Select the 'Log Analytics workspace' as the scope (not the AKS cluster). Use a KQL query like `Perf | where ObjectName == "K8SNode" and CounterName == "cpuUsagePercentage" | summarize AvgCpu = avg(CounterValue) by Computer | where AvgCpu > 90`. If the query returns nothing, confirm which counters your workspace actually contains with `Perf | distinct ObjectName, CounterName`. Set the frequency and threshold. For metric alerts, select the AKS cluster as the resource and choose a metric like 'cpuUsagePercentage'. Note: metric alerts have lower latency (1-5 minutes) than log alerts (5-15 minutes).
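The logic behind a threshold rule such as 'if average CPU > 80% for 5 minutes' can be sketched in a few lines of Python. This only illustrates the evaluation idea (average the samples inside a look-back window, then compare against the threshold); it is not how Azure Monitor is implemented, and the function name `should_fire` and its defaults are assumptions for this example.

```python
def should_fire(samples, now, window_seconds=300, threshold=80.0):
    """Decide whether a threshold alert would fire.

    samples: list of (unix_seconds, cpu_percent) tuples.
    Only samples inside the look-back window ending at `now` are averaged.
    """
    recent = [value for ts, value in samples if now - window_seconds < ts <= now]
    if not recent:
        return False  # no data in the window, so nothing to evaluate
    return sum(recent) / len(recent) > threshold


# One sample per minute; the last five samples average 90%.
samples = [(0, 70), (60, 85), (120, 90), (180, 95), (240, 88), (300, 92)]
print(should_fire(samples, now=300))                  # threshold 80
print(should_fire(samples, now=300, threshold=95.0))  # stricter threshold
```

The evaluation frequency you set on the rule determines how often this check runs, which is one reason log alerts react more slowly than metric alerts.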
Enterprise Scenario 1: E-Commerce Platform with Microservices
A large e-commerce company runs 50 microservices on a 10-node AKS cluster. They need to monitor both infrastructure health and business metrics (e.g., checkout latency, order rate). They enable Container Insights for basic CPU/memory alerts and to collect container logs. They then configure Prometheus scraping to collect custom metrics from each microservice (e.g., http_requests_total, checkout_duration_seconds). They use the InsightsMetrics table to create a Grafana dashboard showing real-time order rates and error rates. The challenge: high cardinality metrics (e.g., per-URL request counts) can explode ingestion costs. They solve this by using Prometheus relabeling to drop unnecessary labels and by setting a metric collection interval of 60 seconds instead of 15.
Enterprise Scenario 2: Financial Services with Compliance Requirements
A bank runs a PCI-DSS compliant AKS cluster. They cannot use the default Container Insights agent because it requires privileged access. They deploy the agent with a custom ConfigMap that limits log collection to specific namespaces and disables the collection of sensitive data. They also use Azure Policy to enforce that only approved container images are used. Prometheus is used to monitor node-level metrics (disk I/O, network) via node exporters, but application metrics are collected via a sidecar exporter that sends data to a separate Prometheus server outside the cluster to meet audit requirements. The key lesson: in regulated environments, you must review the agent's data collection scope and ensure no sensitive data leaves the cluster.
Enterprise Scenario 3: Multi-Cluster Monitoring with Log Analytics Workspace
A SaaS provider manages 20 AKS clusters across different regions. They send all Container Insights data to a single Log Analytics workspace (cross-region) to have a unified view. They use Azure Workbooks to create a 'global health' dashboard that shows the status of all clusters. The pitfall: if the workspace is in a different region, data transfer costs apply. Also, if one cluster's agent has network issues, it can affect the entire workspace's data completeness. They mitigate this by setting up health alerts per cluster and using a separate workspace for critical production clusters. The exam may test that a Log Analytics workspace can receive data from multiple clusters, but you cannot have a single cluster send data to multiple workspaces.
What AZ-104 Tests on This Topic
The exam objectives under 'Monitor AKS' (Compute 3.3) include: enabling Container Insights, analyzing monitoring data, and configuring alerts. Specific codes: AZ-104: Monitor and troubleshoot AKS clusters. The exam does NOT require deep Prometheus configuration (that's more AZ-400 or CNCF level), but you must know:
How to enable Container Insights (portal, CLI, or ARM template).
What metrics are collected by default (CPU, memory, disk, network, pod restarts).
How to view live data and logs.
How to create alerts from Container Insights data.
The relationship between Container Insights and 'Azure Monitor for containers' (the latter is the feature's former name).
Common Wrong Answers and Why
'Container Insights requires a Prometheus server': WRONG. Container Insights works without Prometheus. Prometheus scraping is an add-on. Candidates confuse the two because both collect metrics.
'You can enable Container Insights by installing the Log Analytics agent on each node VM': WRONG. You must enable it via AKS addon or CLI, not by manually installing the agent. The agent is deployed as a DaemonSet.
'Prometheus metrics are stored in a separate database': WRONG. When scraped via Container Insights, Prometheus metrics are stored in the Log Analytics workspace in the InsightsMetrics table. The exam tests that all data goes to Log Analytics.
'Live Data is stored in Log Analytics for 30 days': WRONG. Live Data is not stored at all; it is a real-time stream. Only historical data is stored.
Specific Numbers and Values to Memorize
Default metric collection interval: 60 seconds.
Default Log Analytics retention: 30 days (configurable up to 730 days).
Agent default resource limits: CPU 150m, memory 400MB.
Number of tables created: 5 main tables (Perf, ContainerInventory, ContainerLog, KubePodInventory, InsightsMetrics).
Port used for agent communication: 443 (HTTPS).
Edge Cases the Exam Loves
Private clusters: If the AKS cluster is private (no egress), Container Insights fails unless you use a Log Analytics gateway or enable forced tunneling with a firewall. The exam may ask: 'Your cluster has no outbound internet access. Can you enable Container Insights?' Answer: only if you configure a Log Analytics gateway.
Multiple workspaces: A cluster can only send data to ONE Log Analytics workspace. You cannot split data across workspaces.
Agent updates: The agent updates automatically via the container image tag 'ciprod' (latest). You cannot pin a specific version.
How to Eliminate Wrong Answers
When you see a question about Container Insights, ask:
Is the agent deployed as a DaemonSet? (Yes, always.)
Is the data stored in Log Analytics? (Yes, always.)
Does it require Prometheus? (No, optional.)
Can I see real-time data? (Yes, via Live Data, but not stored.)
For Prometheus questions: remember that Prometheus is pull-based (scrapes targets), not push-based. Container Insights agent is the scraper. The exam might ask: 'Which tool collects metrics from the kubelet?' Answer: both Container Insights (via cAdvisor) and Prometheus (via scraping the kubelet metrics endpoint).
Container Insights is enabled via AKS addon or CLI, not by manual agent installation.
The omsagent runs as a DaemonSet in kube-system and collects metrics every 60 seconds.
All Container Insights data is stored in a Log Analytics workspace; you query with KQL.
Prometheus scraping is an optional add-on configured via a ConfigMap.
Prometheus metrics are stored in the InsightsMetrics table, not a separate TSDB.
Live Data is real-time and not stored; it uses a WebSocket to the Kubernetes API.
Default Log Analytics retention is 30 days; can be extended to 730 days per table.
Container Insights cannot be enabled on a private cluster without a Log Analytics gateway or firewall rules.
Alerts based on Container Insights can be log alerts (KQL) or metric alerts (pre-built metrics).
Cost is based on data ingestion volume; high-cardinality Prometheus metrics can increase costs.
These come up on the exam all the time. Here's how to tell them apart.
Container Insights (Default Agent)
Collects CPU, memory, disk, network from nodes and pods out of the box.
Uses cAdvisor and kubelet API; no additional configuration needed.
Data stored in Perf, ContainerInventory, ContainerLog tables.
Best for basic health monitoring and log collection.
Default scrape interval is 60 seconds.
Prometheus (via Container Insights)
Collects custom application metrics exposed at /metrics endpoints.
Requires ConfigMap configuration to define scrape targets.
Data stored in InsightsMetrics table with Tags column for labels.
Best for high-cardinality, application-specific metrics.
Configurable scrape interval (default 60s, can be shorter).
Mistake
Container Insights requires a Prometheus server to collect metrics.
Correct
Container Insights collects metrics independently via its own agent (omsagent) that queries cAdvisor and the kubelet API. Prometheus scraping is an additional feature that can be enabled, but it is not required.
Mistake
Prometheus metrics collected by Container Insights are stored in a separate Prometheus database.
Correct
They are stored in the Log Analytics workspace in the `InsightsMetrics` table, not in a Prometheus TSDB. You query them with KQL, not PromQL (unless you also deploy a standalone Prometheus).
Mistake
Live Data in Container Insights is stored in Log Analytics and can be queried later.
Correct
Live Data is a real-time stream that is not persisted. It uses a WebSocket connection to the Kubernetes API and is not stored in any table.
Mistake
You can enable Container Insights by installing the Log Analytics agent on each node VM.
Correct
You must enable it through the AKS addon or CLI. Manually installing the agent on VMs will not integrate with AKS monitoring dashboards and will not collect Kubernetes-specific metrics.
Mistake
Container Insights collects logs from all containers by default with no filtering.
Correct
It collects stdout/stderr from all containers by default, but you can filter by namespace, pod, or container name using the ConfigMap `container-azm-ms-agentconfig`.
You can enable it from the Azure portal by navigating to the cluster and clicking 'Insights' > 'Enable'. Alternatively, use the Azure CLI: `az aks enable-addons -a monitoring -n <cluster-name> -g <resource-group> --workspace-resource-id <workspace-id>`. If you don't specify a workspace, Azure creates one in the same region. The command registers the necessary resource providers and deploys the omsagent DaemonSet.
By default, Container Insights collects node-level CPU, memory, disk, and network usage; pod-level CPU and memory; container restarts and status; and inventory of Kubernetes objects (pods, deployments, etc.). It also collects container stdout/stderr logs. These are stored in tables like Perf, ContainerInventory, ContainerLog, and KubePodInventory.
Use the 'Live Data' feature in the Azure portal under the AKS cluster's 'Insights' blade. You can select a pod and view its logs in real time. This uses a WebSocket connection to the Kubernetes API server and does not store the data. Alternatively, you can use `kubectl logs -f <pod>` directly.
No. An AKS cluster can only send Container Insights data to a single Log Analytics workspace. If you need data in multiple workspaces, you would need to configure data export from the primary workspace to another workspace (using Azure Monitor data export), but that is not covered in the exam.
Edit the ConfigMap `container-azm-ms-agentconfig` in the `kube-system` namespace. Add a `prometheus-data-collection-settings` section with targets and intervals. For example, to scrape the kubelet: add `[node_level_targets]` with `targets = ["https://<node-ip>:10250/metrics"]` (the kubelet serves port 10250 over HTTPS and requires authentication). Apply the ConfigMap with `kubectl apply -f`. The agent picks up changes automatically.
Container Insights is the specific feature within Azure Monitor that provides monitoring for AKS. 'Azure Monitor for containers' is the former name of the same feature, so older documentation and questions may still use it. On the exam, 'Container Insights' is the correct term for AKS monitoring.
Create a log alert in Azure Monitor with a KQL query like: `KubePodInventory | where ContainerRestartCount > 5 | summarize dcount(ContainerID) by bin(TimeGenerated, 5m)`. The restart count is a column of the KubePodInventory table, not ContainerInventory. Set the threshold to > 0. The scope must be the Log Analytics workspace, not the AKS cluster. You can also use a metric alert for 'Pod restarts' if you select the AKS cluster as the resource.