This chapter covers distributed tracing using Azure Application Insights, a critical skill for the AZ-204 exam. Distributed tracing allows you to track requests as they travel across multiple components of a distributed application, providing end-to-end visibility into performance and failures. Application Insights telemetry, including distributed tracing, is a recurring topic in the exam's monitoring domain. You will learn the core concepts, implementation steps, configuration, and how to analyze traces to diagnose issues.
Imagine a complex surgery performed by a team of specialists: a surgeon, an anesthesiologist, a nurse, and a technician. Each person performs their role, and the entire procedure is recorded on multiple cameras. Later, a reviewer wants to understand exactly why a particular complication occurred. Without distributed tracing, the reviewer would have to watch each camera feed separately, trying to correlate timestamps manually—a tedious and error-prone process. With distributed tracing, every action (e.g., 'surgeon makes incision', 'nurse hands scalpel', 'anesthesiologist adjusts drip') is recorded as a 'span' with a unique ID, a parent span ID (if triggered by another action), and precise timing. All spans belonging to the same 'operation' share a common trace ID. The reviewer can now see a single timeline: the trace ID groups all related actions, and the spans show the sequence, duration, and hierarchy. For example, the span 'incision' might have child spans 'adjust lighting' and 'suction'. If the complication occurred because the anesthesiologist's adjustment took too long, the reviewer can immediately see the bottleneck by examining the duration of that span. This is exactly how Application Insights distributed tracing works—it correlates telemetry across multiple services using trace and span IDs, enabling end-to-end visibility into request flows.
What is Distributed Tracing and Why Does It Exist?
Distributed tracing is a method used to profile and monitor applications, especially those built using a microservices architecture. In a monolithic application, a single process handles a request end-to-end, making it easy to log and trace. However, in a distributed system, a single user request may trigger calls to multiple services, databases, queues, and caches. Traditional logging produces isolated logs in each component, making it nearly impossible to correlate them to understand the full flow. Distributed tracing solves this by assigning a unique trace ID to the root request and propagating it across all service boundaries. Each unit of work within a service is recorded as a span with a span ID and a parent span ID, forming a tree structure. This allows developers and operators to reconstruct the entire request path, measure latency at each hop, and identify bottlenecks or failures.
How Distributed Tracing Works Internally
The mechanism relies on three key identifiers: trace ID, span ID, and parent span ID. When a request first enters the system (e.g., via an HTTP request to a front-end service), the tracing system generates a globally unique trace ID (usually a 128-bit or 64-bit number, often represented as a hex string). It also creates the first span, called the root span, with a unique span ID. The root span has no parent (parent span ID is null or empty). As the front-end service calls downstream services (e.g., via HTTP, gRPC, or message queues), it propagates the trace ID and its own span ID as the parent span ID to the downstream service. The downstream service then creates a new span with its own span ID, sets the parent span ID to the incoming span ID, and continues. This creates a hierarchical tree where each span represents a logical unit of work (e.g., a database query, a cache lookup, an external API call).
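The ID relationships described above can be observed with .NET's built-in `System.Diagnostics.Activity` type, which the Application Insights .NET SDK uses for its operation context. This is a minimal sketch rather than SDK code: the span names are made up for illustration, and it assumes .NET 5 or later.

```csharp
using System;
using System.Diagnostics;

// Force W3C-style IDs (already the default on .NET 5 and later).
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

// Root span: gets a fresh trace ID and has no parent.
var root = new Activity("GET /api/orders").Start();

// Child span: started while the root is current, so it joins the same trace.
var child = new Activity("query orders table").Start();

Console.WriteLine($"trace ID       : {root.TraceId.ToHexString()}");
Console.WriteLine($"root span ID   : {root.SpanId.ToHexString()}");
Console.WriteLine($"child span ID  : {child.SpanId.ToHexString()}");
Console.WriteLine($"child's parent : {child.ParentSpanId.ToHexString()}");

// Stop in reverse order so Activity.Current unwinds correctly.
child.Stop();
root.Stop();
```

The child's parent span ID equals the root's span ID, and both spans report the same trace ID. That pair of relationships is what lets a backend rebuild the tree.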
Application Insights correlates telemetry using the W3C Trace Context standard, and Microsoft's newer instrumentation is built on OpenTelemetry (the project that merged OpenCensus and OpenTracing). The Application Insights SDK automatically instruments common frameworks (ASP.NET Core, Java, Node.js, etc.) to create and propagate spans. For example, in an ASP.NET Core application, the SDK records a RequestTelemetry item for each incoming HTTP request and a DependencyTelemetry item for each outgoing HTTP call. Each item carries the trace ID, its own span ID, and the parent span ID. The SDK also supports manual instrumentation using the TelemetryClient API to create custom dependencies or track custom operations.
Key Components, Values, Defaults, and Timers
Trace ID: A 128-bit (16-byte) identifier, often represented as a 32-character hex string. It is unique per request flow. In Application Insights, it is stored as the operation_Id field in telemetry items.
Span ID: A 64-bit (8-byte) identifier, represented as a 16-character hex string. Each span within a trace has a unique span ID. In Application Insights, it is stored as the id field in request and dependency telemetry.
Parent Span ID: The span ID of the parent span. For root spans, this is null. In Application Insights, it is stored as the operation_ParentId field.
Operation Name: A human-readable name for the operation, e.g., "GET /api/orders". This is stored as the operation_Name field.
Duration: The time taken by the span, measured in milliseconds. Application Insights records the start time and duration for each telemetry item.
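Sampling: To reduce data volume, Application Insights supports three kinds of sampling: adaptive, fixed-rate, and ingestion sampling. Adaptive sampling runs in the SDK, is enabled by default in the ASP.NET and ASP.NET Core SDKs, and adjusts the rate automatically based on telemetry volume. Fixed-rate sampling (e.g., 50%) also runs in the SDK, at a rate you choose. Ingestion sampling is applied at the service endpoint and defaults to 100% (nothing discarded). The SDK keeps or drops related items together based on the operation ID, so the decision is made once per trace rather than per span; this is the head-based behavior referred to in this chapter.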
Telemetry Initializers: You can add custom properties to all telemetry items using telemetry initializers. This is useful for adding environment-specific tags.
Correlation Context: For HTTP calls, the trace context is propagated via standard headers: traceparent and tracestate (W3C Trace Context). Application Insights also supports the older Request-Id header.
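To make the `traceparent` layout concrete, the sketch below parses a header value with .NET's `ActivityContext.TryParse` (available in .NET 5 and later). The header string is an example value for illustration, not output from a real service.

```csharp
using System;
using System.Diagnostics;

// Example W3C traceparent value: version-traceid-spanid-flags.
var header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01";

// The second argument is the optional tracestate header; none is supplied here.
bool parsed = ActivityContext.TryParse(header, null, out ActivityContext context);

Console.WriteLine($"parsed   : {parsed}");
Console.WriteLine($"trace ID : {context.TraceId.ToHexString()}");
Console.WriteLine($"span ID  : {context.SpanId.ToHexString()}");
Console.WriteLine($"sampled  : {context.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");
```

The trace ID portion is what appears as operation_Id in Application Insights, and the span ID portion becomes the parent of whatever span the receiving service creates.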
Configuration and Verification Commands
To enable distributed tracing in an ASP.NET Core application, you need to add the Application Insights SDK and configure the connection string. In the appsettings.json file:
{
  "ApplicationInsights": {
    "ConnectionString": "InstrumentationKey=...;IngestionEndpoint=..."
  }
}
In Startup.cs (or Program.cs for .NET 6+), add:
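using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.Extensions.DependencyInjection;
// Startup.cs: call this inside ConfigureServices(IServiceCollection services).
// Program.cs (.NET 6+): call builder.Services.AddApplicationInsightsTelemetry() instead.
services.AddApplicationInsightsTelemetry();
For manual instrumentation, you can use the TelemetryClient: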
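using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;
// In ASP.NET Core, prefer injecting TelemetryClient from DI; the parameterless constructor is marked obsolete on .NET Core.
var telemetryClient = new TelemetryClient(TelemetryConfiguration.CreateDefault());
using (var operation = telemetryClient.StartOperation<RequestTelemetry>("MyCustomOperation"))
{
    // Your code
    operation.Telemetry.Success = true;
} // Disposing the operation stops its timer and sends the telemetry.
To verify that distributed tracing is working, you can query Application Insights using Log Analytics (Kusto). For example, to find all telemetry for a specific trace: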
union requests, dependencies, exceptions, traces
| where operation_Id == "<trace_id>"
| order by timestamp asc
| project timestamp, itemType, name, id, operation_ParentId, duration
How It Interacts with Related Technologies
Distributed tracing in Application Insights integrates with Azure Monitor and other Azure services. For example, you can correlate traces with Azure Functions, Azure Kubernetes Service (AKS), and Azure Service Bus. When using Azure Functions, the host automatically propagates the trace context via the traceparent header, and the Application Insights SDK in the function app creates spans for each function execution. For AKS, you can enable Container insights (formerly Azure Monitor for containers), which collects Prometheus metrics and container logs, and use it alongside Application Insights for distributed tracing. Additionally, Application Insights supports OpenTelemetry, allowing you to use the OpenTelemetry SDK to export traces to Application Insights. Because OpenTelemetry is vendor-neutral, the same instrumentation can also feed other observability tools like Jaeger or Zipkin.
Performance and Scaling Considerations
Distributed tracing generates additional telemetry data, which can impact performance and cost. The overhead of generating and sending telemetry is usually small, but the volume of data can be significant in high-throughput systems. Use sampling to control costs: head-based sampling (decide at the start of the trace) ensures that entire traces are either sampled or not, which preserves the trace tree. Adaptive sampling adjusts automatically based on traffic, and its MaxTelemetryItemsPerSecond setting defines the target number of telemetry items sent per second. In production, it is common to sample at 10% or lower for high-traffic services while keeping 100% sampling for critical paths like payment processing.
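As a configuration sketch, fixed-rate sampling in an ASP.NET Core app (.NET 6+ minimal hosting) might look like the following. It assumes the `Microsoft.ApplicationInsights.AspNetCore` package is installed and follows the pattern in Microsoft's sampling documentation. The 10% rate is an illustrative choice, not a recommendation.

```csharp
using Microsoft.ApplicationInsights.AspNetCore.Extensions;
using Microsoft.ApplicationInsights.Extensibility;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Add a fixed-rate sampling processor to the default telemetry sink.
builder.Services.Configure<TelemetryConfiguration>(telemetryConfiguration =>
{
    var chainBuilder = telemetryConfiguration.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chainBuilder.UseSampling(10.0); // keep roughly 10% of operations (illustrative value)
});

// Turn off the default adaptive sampling so the two do not stack.
builder.Services.AddApplicationInsightsTelemetry(new ApplicationInsightsServiceOptions
{
    EnableAdaptiveSampling = false,
});

var app = builder.Build();
app.Run();
```

Disabling adaptive sampling matters here: leaving it on while adding a fixed-rate processor would apply both reductions.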
Troubleshooting Common Issues
Missing traces: Ensure that the trace context is propagated correctly across service boundaries. Check that the SDK is configured to add the traceparent header to outgoing HTTP requests. For non-HTTP protocols (e.g., message queues), you may need to manually propagate the trace context.
Incorrect parent-child relationships: Verify that the operation_ParentId is set correctly. For custom spans, use the StartOperation method to automatically manage the hierarchy.
Sampling causing incomplete traces: If head-based sampling is used, entire traces are either sampled or not. If you see partial traces, check that the sampling decision is consistent across all services (e.g., use a consistent sampling rate or use adaptive sampling).
Too much data: Adjust sampling rates or use telemetry processors to filter out low-value telemetry.
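The consistency point about sampling can be illustrated with a small sketch of a deterministic, trace-ID-based decision. The scoring function below is a hypothetical stand-in chosen for readability; it is not the hash that the Application Insights SDK uses.

```csharp
using System;

// Illustrative only: map a trace ID to a stable score in [0, 100).
// Uses the first 8 hex characters of the trace ID as an unsigned integer.
double Score(string traceId) =>
    Convert.ToUInt32(traceId.Substring(0, 8), 16) / 4294967296.0 * 100.0;

// Keep the trace when its score falls below the configured percentage.
bool ShouldSample(string traceId, double ratePercent) => Score(traceId) < ratePercent;

var traceId = "0af7651916cd43dd8448eb211c80319c";

// Two services using the same rate reach the same decision for the same trace.
bool frontend = ShouldSample(traceId, 20);
bool payment = ShouldSample(traceId, 20);

// A service with a much lower rate can disagree, which is how partial traces arise.
bool strictService = ShouldSample(traceId, 1);

Console.WriteLine($"frontend={frontend}, payment={payment}, strictService={strictService}");
```

Because the decision depends only on the trace ID and the rate, any trace kept at a lower rate is also kept at every higher rate. The gaps show up only in traces that the stricter service drops.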
Enable Application Insights SDK
Add the Application Insights SDK to your application. For .NET Core, install the `Microsoft.ApplicationInsights.AspNetCore` NuGet package. Configure the connection string in `appsettings.json`. The SDK automatically collects incoming request telemetry and outgoing dependency calls (HTTP, SQL, etc.) without any code changes. This step is essential to start collecting telemetry.
Propagate Trace Context
The SDK automatically propagates the trace context via the `traceparent` header for outgoing HTTP calls. For transports where the client library does not carry the context for you (e.g., some messaging clients and custom protocols), you need to propagate the trace ID and span ID yourself. Read the current context from `Activity.Current` and inject it into the outgoing message's properties.
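Manual propagation can be sketched with `System.Diagnostics.Activity` alone. In this illustration a plain dictionary stands in for a message's property bag (such as Service Bus `ApplicationProperties`), and the span names are hypothetical. A real sender and receiver would run in separate processes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

// Stand-in for the outgoing message's property bag.
var properties = new Dictionary<string, object>();

// Sender: with W3C IDs, Activity.Id is already a full traceparent value.
var send = new Activity("enqueue order").Start();
properties["traceparent"] = send.Id ?? string.Empty;
string senderTraceId = send.TraceId.ToHexString();
string senderSpanId = send.SpanId.ToHexString();
send.Stop();

// Receiver: read the property and use it as the parent before starting the span.
var process = new Activity("process order");
if (properties.TryGetValue("traceparent", out var value) && value is string traceparent)
{
    process.SetParentId(traceparent);
}
process.Start();

Console.WriteLine($"same trace : {process.TraceId.ToHexString() == senderTraceId}");
Console.WriteLine($"parent ok  : {process.ParentSpanId.ToHexString() == senderSpanId}");
process.Stop();
```

Using `send.Id` carries the real trace flags through, so the receiver sees the sender's actual sampling decision.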
Create Custom Spans
For operations not automatically instrumented (e.g., background jobs, custom database calls), create custom spans using `StartOperation<RequestTelemetry>` or `StartOperation<DependencyTelemetry>`. This ensures that these operations are part of the distributed trace tree. Always dispose the operation to send the telemetry.
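A custom dependency span might be recorded as in the sketch below, which assumes the `Microsoft.ApplicationInsights` NuGet package. The operation name, dependency type, and target host are hypothetical values, and the cache call itself is left as a placeholder.

```csharp
using System;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

// Console-style setup for illustration; in ASP.NET Core, inject TelemetryClient from DI.
var telemetryClient = new TelemetryClient(TelemetryConfiguration.CreateDefault());
bool? recordedSuccess = null;

using (var operation = telemetryClient.StartOperation<DependencyTelemetry>("GET cart:42"))
{
    operation.Telemetry.Type = "Redis";            // dependency kind (hypothetical)
    operation.Telemetry.Target = "cache.internal"; // hypothetical host name
    try
    {
        // ... call the cache here ...
        operation.Telemetry.Success = true;
    }
    catch (Exception ex)
    {
        operation.Telemetry.Success = false;
        telemetryClient.TrackException(ex);
        throw;
    }
    finally
    {
        recordedSuccess = operation.Telemetry.Success;
    }
} // Disposing the holder stops the timer and sends the telemetry.

// Flush buffered telemetry before a short-lived process exits.
telemetryClient.Flush();
```

Any telemetry tracked inside the `using` block is parented to this dependency, so nested calls appear beneath it in the trace tree.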
Analyze Traces in Portal
Use the Application Insights 'Transaction Search' blade to view distributed traces. You can filter by trace ID, operation name, or duration. The 'End-to-end transaction details' view shows a timeline of spans. Identify slow spans or failures. Use the 'Performance' blade to see aggregated metrics like average duration and failure rate per operation.
Query with Log Analytics
For advanced analysis, use Log Analytics (Kusto queries). For example, to find the 10 slowest requests in the last day: `requests | where timestamp > ago(1d) | order by duration desc | take 10`. To see the entire trace tree, union all telemetry types and filter by `operation_Id`. This allows deep dives into specific issues.
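To show how `id` and `operation_ParentId` form a tree, the sketch below rebuilds the hierarchy from a few rows. The rows are invented sample data shaped like the query's output, with shortened IDs for readability.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows: (id, operation_ParentId, name, duration in ms).
var rows = new List<(string Id, string ParentId, string Name, double DurationMs)>
{
    ("a1", "",   "GET /api/orders",             420),
    ("b2", "a1", "POST payment-service/charge", 310),
    ("c3", "b2", "POST fraud-api/check",        280),
    ("d4", "a1", "SQL: insert order",            35),
};

var ids = rows.Select(r => r.Id).ToHashSet();
var childrenByParent = rows.ToLookup(r => r.ParentId);

List<string> Render()
{
    var lines = new List<string>();

    void Walk((string Id, string ParentId, string Name, double DurationMs) row, int depth)
    {
        lines.Add($"{new string(' ', depth * 2)}{row.Name} ({row.DurationMs} ms)");
        foreach (var child in childrenByParent[row.Id]) Walk(child, depth + 1);
    }

    // A row is a root when its parent ID does not match any span in the result set.
    foreach (var root in rows.Where(r => !ids.Contains(r.ParentId))) Walk(root, 0);
    return lines;
}

foreach (var line in Render()) Console.WriteLine(line);
// Prints:
// GET /api/orders (420 ms)
//   POST payment-service/charge (310 ms)
//     POST fraud-api/check (280 ms)
//   SQL: insert order (35 ms)
```

Treating any row with an unmatched parent as a root keeps the output usable when sampling or a missing service has removed part of the chain.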
Enterprise Scenario 1: E-Commerce Order Processing
A large e-commerce platform uses a microservices architecture with separate services for user authentication, product catalog, shopping cart, order processing, payment, and shipping. Each service is deployed in Azure Kubernetes Service (AKS). When a customer places an order, the front-end service calls the order service, which then calls payment and shipping services, each of which may call other internal APIs. Without distributed tracing, diagnosing a slow order is nearly impossible. The team implements Application Insights distributed tracing across all services using the OpenTelemetry SDK. They configure a fixed sampling rate of 20% to balance cost and visibility. In production, they notice that the payment service occasionally takes >10 seconds. Using the transaction search, they find a specific trace where the payment service's span shows a long duration. They drill down and see that the payment service made a call to an external fraud detection API that timed out. The team adds a circuit breaker and reduces the timeout. They also set up alerts using Application Insights metrics for p95 latency. The result is a 50% reduction in order processing time.
Enterprise Scenario 2: IoT Device Management
A company manages thousands of IoT devices that send telemetry to an Azure IoT Hub, which then processes data through Azure Functions and stores results in Azure Cosmos DB. Each device message triggers a chain of functions. To monitor the end-to-end flow, the team enables distributed tracing in the Azure Functions host (using the built-in Application Insights integration). They manually propagate the trace context through IoT Hub messages by adding the traceparent header to the device-to-cloud messages. In the portal, they can see a trace for each message, showing the time spent in each function and in Cosmos DB. They discover that one function is a bottleneck because it makes a synchronous HTTP call to a weather API. They change it to an asynchronous pattern and reduce the function timeout. They also use sampling to avoid high costs from the high volume of device messages.
Common Mistakes and Pitfalls
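Losing context in background work: In .NET, the ambient trace context (Activity.Current) is stored in an AsyncLocal, so it normally flows across await and Task.Run. It can still be lost when work is queued with execution-context flow suppressed (e.g., ExecutionContext.SuppressFlow or ThreadPool.UnsafeQueueUserWorkItem), or when it is handed to a long-lived worker that was started outside the request. In those cases, capture the context explicitly and start a new operation with it as the parent.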
Forgetting to configure sampling: In high-traffic systems, not sampling can lead to excessive costs (e.g., $1000s per month). Always set a sampling rate appropriate for your traffic.
Relying only on auto-instrumentation: Some dependencies (e.g., Redis, custom TCP calls) are not automatically instrumented. You must manually create dependency telemetry for them.
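The async-boundary pitfall can be demonstrated with a short sketch using `System.Diagnostics.Activity`. It contrasts a normal `Task.Run`, where the ambient context flows, with work queued while execution-context flow is suppressed. The activity name is made up for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var request = new Activity("handle request").Start();
string expected = request.TraceId.ToHexString();

// Normal case: Task.Run captures the execution context, so Activity.Current flows in.
string flowed = await Task.Run(() => Activity.Current?.TraceId.ToHexString() ?? "(none)");

// Suppressed flow: the queued work starts with no ambient activity.
Task<string> detached;
using (ExecutionContext.SuppressFlow())
{
    detached = Task.Run(() => Activity.Current?.TraceId.ToHexString() ?? "(none)");
}
string lost = await detached;

Console.WriteLine($"flowed into Task.Run : {flowed == expected}");
Console.WriteLine($"with suppressed flow : {lost}");

request.Stop();
```

When the second situation applies, passing the parent's `Id` into the background work and calling `SetParentId` on the new activity restores the link.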
What AZ-204 Tests on Distributed Tracing
The AZ-204 exam objective 'Monitor and troubleshoot solutions' includes subtopics on configuring Application Insights and analyzing telemetry. Specifically, you need to know how to enable distributed tracing, propagate trace context, and use the portal to diagnose issues. The exam focuses on:
Objective 4.1: Configure instrumentation in an application (e.g., enable Application Insights SDK, configure sampling).
Objective 4.2: Analyze and troubleshoot distributed applications (e.g., use transaction search, identify root causes using traces).
Common Wrong Answers and Why Candidates Choose Them
'Distributed tracing is automatically enabled for all Azure services.' This is false. Only services that have the Application Insights SDK installed and configured will generate traces. Many candidates assume Azure automatically correlates logs across services, but without explicit instrumentation, no tracing occurs.
'Sampling ensures all traces are captured but with reduced detail.' This is misleading. Sampling either captures entire traces or drops them entirely (head-based) or adjusts the rate (adaptive). It does not reduce detail within a captured trace. Candidates often confuse sampling with data truncation.
'The trace ID is the same as the request ID.' In Application Insights, the trace ID is operation_Id, while the request ID is id. They are different. Candidates mix up these fields.
'You must manually propagate trace context for all dependencies.' While manual propagation can be needed for messaging and custom protocols, the SDK automatically adds trace headers to outgoing HTTP calls and automatically records SQL calls as dependencies. Candidates think they need to write custom code for everything.
Specific Numbers and Terms on the Exam
Ingestion sampling defaults to 100% (nothing discarded). Adaptive sampling, which is on by default in the ASP.NET and ASP.NET Core SDKs, adjusts the rate based on traffic.
The traceparent header format: 00-<trace-id>-<span-id>-<trace-flags> (W3C standard).
The operation_Id field is used to group telemetry from the same trace.
The operation_ParentId field stores the parent span ID.
The StartOperation method returns an IOperationHolder<T> that must be disposed to send telemetry.
Edge Cases and Exceptions
When using Azure Functions with Consumption Plan: The function host may recycle, causing trace context to be lost if not propagated via durable functions or external storage.
When using message queues: You must manually inject the trace context into the message properties (e.g., Service Bus message ApplicationProperties).
When using multiple instrumentation keys: Traces cannot be correlated across different Application Insights resources unless you manually propagate and query across resources.
How to Eliminate Wrong Answers
Look for keywords: 'automatically' vs. 'manually'. If the question says 'without any code changes', it's likely about auto-instrumentation (e.g., enabling Application Insights for an App Service from the portal).
If the question mentions 'cost' or 'volume', the answer likely involves sampling.
If the question asks about 'correlating across services', the answer involves trace ID propagation.
Distributed tracing uses a trace ID (operation_Id) and span IDs (id, operation_ParentId) to correlate telemetry across services.
The SDK automatically propagates trace context via the W3C traceparent header for HTTP calls.
Sampling controls data volume: head-based sampling captures entire traces at a fixed rate; adaptive sampling adjusts dynamically.
Use StartOperation to create custom spans for operations not automatically instrumented.
In Log Analytics, union requests, dependencies, exceptions, and traces filtered by operation_Id to see the full trace.
For non-HTTP dependencies (e.g., Service Bus), manually propagate trace context using message properties.
Ingestion sampling defaults to 100% (no sampling); configure SDK sampling in code for ASP.NET Core, or in ApplicationInsights.config for classic ASP.NET.
The Transaction Search blade in the portal provides a visual timeline of spans for a given trace.
Azure Functions automatically propagates trace context when using the built-in Application Insights integration.
Distributed tracing is essential for diagnosing performance issues in microservices architectures.
These come up on the exam all the time. Here's how to tell them apart.
Head-Based Sampling
Decision made at the start of the trace: entire trace is either sampled or not.
Simple to configure with a fixed rate (e.g., 10%).
Preserves complete trace trees for sampled requests.
May over-sample low-traffic periods or under-sample high-traffic periods.
Recommended for predictable workloads where cost control is critical.
Adaptive Sampling
Adjusts sampling rate dynamically based on telemetry volume.
Aims to keep a target rate of telemetry per second (default 5 items/sec).
May break trace trees if some spans are sampled differently (but adaptive tries to keep consistency).
Better for variable workloads; reduces cost during spikes.
Enabled by default in the ASP.NET and ASP.NET Core SDKs; requires no configuration.
Mistake
Distributed tracing works automatically for all Azure services without any SDK installation.
Correct
Only services that have the Application Insights SDK and have configured the connection string will generate traces. Azure Functions and App Services have built-in integration, but you still need to enable it.
Mistake
Sampling reduces the detail inside each trace, so you get partial data per trace.
Correct
Sampling keeps or drops all telemetry items that share an operation ID, so a sampled trace keeps its full detail. Partial traces appear only when the services in a call chain make different sampling decisions, for example because they sample at different rates.
Mistake
The trace ID and request ID are the same thing.
Correct
The trace ID (operation_Id) groups all spans of a request, while the request ID (id) is a unique identifier for a single request telemetry item. They are different fields.
Mistake
You must manually propagate trace context for every dependency using custom code.
Correct
The SDK automatically propagates trace context on outgoing HTTP calls and records SQL calls as dependencies. Manual propagation is needed only where the client library does not carry the context for you, typically message queues and custom protocols.
Mistake
Distributed tracing only works for HTTP-based microservices.
Correct
It works for any type of dependency as long as trace context is propagated. You can trace through queues, events, and custom protocols using manual instrumentation.
Install the `Microsoft.ApplicationInsights.AspNetCore` NuGet package and add `services.AddApplicationInsightsTelemetry()` in `Startup.cs`. Configure the connection string in `appsettings.json`. The SDK will automatically collect incoming request and outgoing dependency telemetry, including trace context propagation.
`operation_Id` is the trace ID that groups all telemetry items belonging to the same distributed trace. `id` is the span ID of a specific telemetry item (e.g., a request or dependency). The parent span ID is stored in `operation_ParentId`. Together, these fields form the trace tree.
Head-based sampling decides at the start of a trace whether to capture it entirely. If a trace is not sampled, no spans are recorded. Adaptive sampling adjusts the rate based on volume, but may break trace trees if spans are sampled inconsistently. To preserve trace trees, use head-based sampling with consistent rates across services.
Retrieve the current trace context using `Activity.Current` in .NET. With W3C IDs enabled, `activity.Id` already holds the full `traceparent` value, so you can add it to the Service Bus message directly (e.g., `message.ApplicationProperties["traceparent"] = activity.Id`) instead of assembling it by hand with a hard-coded flags field. On the receiving side, parse the header and set the parent span ID accordingly.
By default, no. Each Application Insights resource has its own instrumentation key. To correlate, you would need to export telemetry to a common storage (e.g., Log Analytics workspace) and query across resources, or use a single Application Insights resource for all services.
The format is `00-<trace-id>-<span-id>-<trace-flags>`, where trace-id is a 32-character hex string (16 bytes), span-id is a 16-character hex string (8 bytes), and trace-flags is a 2-character hex string (e.g., `01` for sampled). Example: `00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01`.
Go to your Application Insights resource, select 'Transaction Search' from the left menu. You can search by trace ID, operation name, or time range. Click on a result to see the end-to-end transaction details, which shows a timeline of spans. You can also use the 'Performance' blade to see aggregated metrics.