This chapter covers AWS X-Ray, a distributed tracing service that helps you analyze and debug production applications. For the SOA-C02 exam, understanding X-Ray's architecture, configuration, and integration with other AWS services is critical, as questions on monitoring and troubleshooting appear across multiple domains. Approximately 5-10% of exam questions touch on X-Ray or related monitoring services, often in scenario-based formats where you must decide which tool to use for a given problem.
Imagine a massive e-commerce warehouse (your distributed application) where hundreds of workers (microservices) process orders. Each worker picks items from shelves, packs boxes, and hands them to the next worker. But when a customer complains that their order arrived damaged, you need to trace exactly which worker mishandled it and why. AWS X-Ray is like installing a tiny camera and GPS tracker on every package as it moves through the warehouse. Each camera records: which worker handled it, what time it arrived and left, any delays (e.g., worker waited 15 seconds for a packing slip), and errors (e.g., scanner failed). The GPS tracker logs the route through the warehouse, showing the exact path from shelf to shipping dock. Back at the supervisor's office (the X-Ray console), you can replay the entire journey of any package, zoom into a specific worker (service), see their efficiency trends, and identify bottlenecks — like Worker B taking 5 seconds longer per package than average. The cameras don't interfere with the work; they just observe and timestamp. Without X-Ray, you'd only know the final result (order delivered or failed), not the step-by-step process. In AWS, X-Ray similarly traces requests as they travel through EC2, Lambda, API Gateway, DynamoDB, and other services, collecting metadata, timings, and errors to help you debug performance issues and pinpoint root causes.
What is AWS X-Ray?
AWS X-Ray is a distributed tracing service that collects data about requests your application processes, providing an end-to-end view of requests as they travel through your AWS infrastructure. It helps you identify performance bottlenecks, errors, and root causes of issues in microservices architectures. X-Ray supports applications running on Amazon EC2, AWS Lambda, Amazon ECS, AWS Elastic Beanstalk, and even on-premises servers via the X-Ray daemon.
How X-Ray Works Internally
When a request enters your application (e.g., via API Gateway or an ALB), the first X-Ray-aware component generates a unique trace ID, or reuses the one carried in an incoming trace header, and propagates it downstream. Each instrumented service sends segments to the X-Ray daemon, which buffers and forwards them to the X-Ray API. X-Ray then groups segments that share a trace ID into a trace and builds a service graph showing the flow and latency of each component.
Trace: A trace records the entire journey of a single request. It consists of one or more segments. Each trace has a unique ID made of a version digit, 8 hexadecimal digits of epoch time, and 24 random hexadecimal digits (e.g., 1-5f2e8a3b-1a2b3c4d5e6f7a8b9c0d1e2f).
Segment: A segment records the work of a single service or component (e.g., a Lambda function, a web server). Segments contain metadata like start time, end time, errors, and annotations.
Subsegment: A subsegment records a downstream call within a service (e.g., an HTTP request to another API or a DynamoDB query). Subsegments can be nested.
Service Graph: A visual representation of the services in your application and their connections, with metrics like average latency and error rates.
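The trace ID layout described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the documented format of a version digit, 8 hexadecimal digits of epoch time, and 24 random hexadecimal digits; the helper names `new_trace_id` and `is_valid_trace_id` are hypothetical and not part of any X-Ray SDK.

```python
import os
import re
import time

# Version "1", 8 hex digits of epoch seconds, 24 hex digits of randomness.
TRACE_ID_PATTERN = re.compile(r"1-[0-9a-f]{8}-[0-9a-f]{24}")


def new_trace_id(now=None):
    """Build a trace ID; `now` (epoch seconds) is injectable for testing."""
    epoch_seconds = int(time.time() if now is None else now)
    random_part = os.urandom(12).hex()  # 12 bytes -> 24 hex characters
    return f"1-{epoch_seconds:08x}-{random_part}"


def is_valid_trace_id(candidate):
    """Check the shape only; this does not prove the trace exists in X-Ray."""
    return TRACE_ID_PATTERN.fullmatch(candidate) is not None


if __name__ == "__main__":
    trace_id = new_trace_id()
    print(trace_id, is_valid_trace_id(trace_id))
```

A shape check like this is handy when a CLI call rejects an ID copied from logs: any character outside 0-9 and a-f in the last two groups makes the ID invalid.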
Key Components and Defaults
X-Ray SDK: Integrates with your application code to capture traces. Available for Java, Python, Node.js, .NET, Ruby, and Go. The SDK automatically captures incoming and outgoing HTTP requests, AWS SDK calls, and SQL queries (for supported databases).
X-Ray Daemon: A software agent that collects segment data from the SDK and sends it to the X-Ray API. The daemon runs on EC2 instances, on-premises servers, or as a sidecar in ECS tasks. It listens on UDP port 2000 by default. The daemon batches segments and sends them to the X-Ray API every 1 second or when the buffer reaches 100 segments (whichever comes first).
Sampling: To reduce cost and noise, X-Ray uses sampling rules. By default, it records the first request per second and 5% of additional requests. You can configure custom sampling rules with reservoir size (fixed number of traces per second) and rate (percentage of additional requests).
Annotations and Metadata: You can add custom annotations (key-value pairs indexed for search) and metadata (non-indexed key-value pairs) to segments for richer filtering.
Trace Header: The X-Amzn-Trace-Id header is used to propagate the trace ID between services. It includes the trace ID, parent segment ID, and sampling decision.
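To make the trace header concrete, here is a small Python sketch that splits an `X-Amzn-Trace-Id` value into its fields. It assumes the usual `Root=...;Parent=...;Sampled=...` layout; the function names `parse_trace_header` and `is_sampled` are illustrative only.

```python
def parse_trace_header(header_value):
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of its fields."""
    fields = {}
    for part in header_value.split(";"):
        key, separator, value = part.strip().partition("=")
        if separator:  # skip empty or malformed parts
            fields[key] = value
    return fields


def is_sampled(header_value):
    """True when the upstream service decided to record this request."""
    return parse_trace_header(header_value).get("Sampled") == "1"


if __name__ == "__main__":
    header = "Root=1-5f2e8a3b-1a2b3c4d5e6f7a8b9c0d1e2f;Parent=53995c3f42cd8ad8;Sampled=1"
    print(parse_trace_header(header))
```

A service that forwards this header unchanged lets the next hop attach its segment to the same trace; a service that drops it starts a new, disconnected trace.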
Configuration and Verification
To enable X-Ray for your application:
1. Install the X-Ray SDK in your application and instrument your code (or use automatic instrumentation for supported frameworks).
2. Deploy the X-Ray daemon on your compute resources (EC2, ECS, on-premises servers, etc.). For Lambda, you do not deploy a daemon yourself; tracing is enabled in the function configuration and requires only the right IAM permissions.
3. Ensure IAM permissions allow the daemon to call PutTraceSegments and PutTelemetryRecords.
4. Use the AWS Management Console, CLI, or SDK to view traces and service maps.
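Step 3 can be illustrated with a Python sketch that builds a minimal IAM policy document containing the two write actions named above. This is an assumption-laden minimum, not a copy of the `AWSXRayDaemonWriteAccess` managed policy, which grants additional actions; the function name `daemon_write_policy` is hypothetical.

```python
import json


def daemon_write_policy():
    """Minimal policy letting a daemon or function upload trace data."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "xray:PutTraceSegments",
                    "xray:PutTelemetryRecords",
                ],
                # X-Ray write actions are not scoped to a specific resource here.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(daemon_write_policy(), indent=2))
```

In practice, attaching the managed policy to the instance role or execution role is simpler; a hand-written policy like this is mainly useful for seeing exactly which actions a missing-traces problem depends on.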
Example CLI command to retrieve full traces by ID:
aws xray batch-get-traces --trace-ids "1-5f2e8a3b-1a2b3c4d5e6f7a8b9c0d1e2f"
Example to get a service graph:
aws xray get-service-graph --start-time 1614556800 --end-time 1614643200
Integration with Other AWS Services
AWS Lambda: X-Ray can be enabled per function ('Active tracing'). Lambda then records segments for function invocations and runs the X-Ray daemon for you, so you do not deploy a separate daemon.
Amazon API Gateway: You can enable X-Ray tracing on API stages. API Gateway adds the X-Amzn-Trace-Id header to incoming requests and sends segments for each API call.
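Elastic Load Balancing (ALB): An Application Load Balancer adds the X-Amzn-Trace-Id header to incoming HTTP requests so that downstream services can continue the trace. The load balancer itself does not send segments to X-Ray and does not appear as a node on the service map.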
Amazon DynamoDB: The X-Ray SDK can trace DynamoDB calls as subsegments, showing latency and errors for each query.
AWS App Mesh: Envoy proxies can send trace data to X-Ray, enabling tracing for service mesh architectures.
Performance Considerations and Best Practices
Sampling: Use sampling rules to control trace volume and cost. For high-traffic applications, set a reservoir of 1-5 traces per second and a rate of 1-10%.
Annotations: Use annotations for indexing traces (e.g., CustomerID, Region). They are searchable via the console or API. Metadata is for extra data not intended for search.
Security: Traces may contain sensitive data (e.g., request parameters). Use the X-Ray SDK's filtering to exclude sensitive fields. The daemon sends data over HTTPS.
Cost: X-Ray charges per trace recorded and per scan for trace retrieval. Monitor usage via AWS Cost Explorer.
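The interaction between sampling and cost can be estimated with simple arithmetic. The Python sketch below assumes a reservoir-plus-rate rule and a monthly free allowance of 100,000 recorded traces; `price_per_million` is deliberately a required argument because the current price should be taken from the X-Ray pricing page rather than from this example. The function names and the 30-day month are assumptions made for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month, an approximation


def sampled_per_second(requests_per_second, reservoir=1, rate=0.05):
    """Expected traces recorded per second under a reservoir-plus-rate rule."""
    from_reservoir = min(requests_per_second, reservoir)
    from_rate = max(requests_per_second - reservoir, 0) * rate
    return from_reservoir + from_rate


def monthly_recording_cost(requests_per_second, price_per_million,
                           free_traces=100_000, reservoir=1, rate=0.05):
    """Rough monthly charge for recorded traces, ignoring retrieval and scans."""
    per_second = sampled_per_second(requests_per_second, reservoir, rate)
    traces = per_second * SECONDS_PER_MONTH
    billable = max(traces - free_traces, 0)
    return billable / 1_000_000 * price_per_million


if __name__ == "__main__":
    # 5.0 is an illustrative price per million traces, not a quoted figure.
    print(sampled_per_second(100))
    print(monthly_recording_cost(100, price_per_million=5.0))
```

With the default rule, a service handling 100 requests per second records roughly 1 + 0.05 × 99 ≈ 5.95 traces per second, which shows why raising the rate toward 100% on busy services multiplies cost quickly.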
Troubleshooting with X-Ray
Missing traces: Ensure the daemon is running and can reach the X-Ray API (TCP 443). Check IAM permissions. For Lambda, verify X-Ray is enabled in the function configuration.
Incomplete traces: Some services may not propagate the trace header. Ensure downstream services are instrumented with X-Ray SDK.
High latency: Use the service map to identify services with high average latency. Drill into traces to find slow subsegments.
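Drilling into a trace to find the slow subsegment can also be done programmatically on a segment document. The sketch below is a minimal Python example that assumes the segment has already been parsed into a dictionary with `name`, `start_time`, `end_time`, and nested `subsegments` fields; `EXAMPLE_SEGMENT` is made-up data, and the helper names are hypothetical.

```python
def iter_subsegments(node, path=()):
    """Yield (path, subsegment) for every nested subsegment under `node`."""
    for sub in node.get("subsegments", []):
        sub_path = path + (sub["name"],)
        yield sub_path, sub
        yield from iter_subsegments(sub, sub_path)


def slowest_subsegment(segment):
    """Return (path, seconds) for the longest completed subsegment, or None."""
    timed = [
        (path, sub["end_time"] - sub["start_time"])
        for path, sub in iter_subsegments(segment)
        if "end_time" in sub  # in-progress subsegments have no end time yet
    ]
    return max(timed, key=lambda item: item[1], default=None)


# Made-up segment for illustration: a checkout service with two downstream calls.
EXAMPLE_SEGMENT = {
    "name": "checkout",
    "start_time": 100.0,
    "end_time": 104.5,
    "subsegments": [
        {"name": "DynamoDB", "start_time": 100.125, "end_time": 100.25},
        {
            "name": "payment-gateway",
            "start_time": 100.5,
            "end_time": 104.25,
            "subsegments": [
                {"name": "retry", "start_time": 103.5, "end_time": 104.25},
            ],
        },
    ],
}

if __name__ == "__main__":
    print(slowest_subsegment(EXAMPLE_SEGMENT))
```

A parent subsegment always lasts at least as long as anything nested inside it, so after finding the slowest top-level call it is worth repeating the search within that call to see where its time went.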
Exam-Specific Details
X-Ray is not real-time; there is a delay of up to 30 seconds for traces to appear.
The daemon uses UDP to receive segments from the SDK, which is fire-and-forget; if the daemon is down, segments are lost.
The X-Ray SDK uses a thread-local storage to track the current segment. In asynchronous code, you must manually propagate the trace context.
For EC2, you must install the daemon manually or use the AWS-provided CloudFormation template. For Elastic Beanstalk, you can enable X-Ray via the console or configuration files.
The GetTraceSummaries API returns trace IDs and summary data based on filter criteria (e.g., time range, annotations). You then use BatchGetTraces to retrieve full trace details for those IDs.
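The two-step retrieval in the last point can be sketched in Python. The function below takes an X-Ray client as a parameter (for example one created with `boto3.client("xray")`, which is not imported here so the sketch stays self-contained) and uses the boto3 method names `get_trace_summaries` and `batch_get_traces`. The batch size of five is a conservative assumption; check the API reference for the current per-call limit. The function name `fetch_traces` is hypothetical.

```python
def fetch_traces(xray_client, start_time, end_time, filter_expression, batch_size=5):
    """Collect trace IDs from summaries, then fetch full traces in batches."""
    trace_ids = []
    kwargs = {
        "StartTime": start_time,
        "EndTime": end_time,
        "FilterExpression": filter_expression,
    }
    # Step 1: page through summaries that match the filter expression.
    while True:
        page = xray_client.get_trace_summaries(**kwargs)
        trace_ids.extend(s["Id"] for s in page.get("TraceSummaries", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token

    # Step 2: fetch the full segment documents a few IDs at a time.
    # Simplification: pagination of the batch call itself is not handled.
    traces = []
    for start in range(0, len(trace_ids), batch_size):
        batch = trace_ids[start:start + batch_size]
        response = xray_client.batch_get_traces(TraceIds=batch)
        traces.extend(response.get("Traces", []))
    return traces
```

Passing the client in keeps the function easy to exercise with a stub, and every retrieval and scan counts toward X-Ray charges, so narrow time ranges and specific filter expressions are worth the effort.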
Step-by-Step: Tracing a Request from API Gateway to Lambda to DynamoDB
Request arrives at API Gateway: API Gateway adds the X-Amzn-Trace-Id header (if tracing enabled). It creates a segment and sends it to X-Ray.
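API Gateway invokes Lambda: The trace header is passed to Lambda with the invocation. With active tracing enabled, Lambda records segments for the invocation and exposes the trace context to the function through the `_X_AMZN_TRACE_ID` environment variable.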
Lambda processes request: The X-Ray SDK (if instrumented) creates subsegments for each downstream call, e.g., a DynamoDB query.
Lambda sends subsegments: The Lambda runtime sends segments to X-Ray automatically (no daemon needed).
DynamoDB processes query: If the SDK is used, the DynamoDB call is recorded as a subsegment with latency and error info.
Response flows back: The response propagates the trace ID back through Lambda to API Gateway.
X-Ray assembles trace: The service combines all segments into a trace, visible in the console within seconds.
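The final assembly step can be approximated with a short Python sketch. It assumes each record carries `id`, `name`, `start_time`, `end_time`, and an optional `parent_id`, and it flattens segments and subsegments into one list for simplicity; the real service does considerably more. `EXAMPLE_RECORDS` is made-up data and `assemble_timeline` is a hypothetical name.

```python
from collections import defaultdict


def assemble_timeline(records):
    """Return (depth, name, duration_ms) rows in parent-before-child order."""
    known_ids = {record["id"] for record in records}
    children = defaultdict(list)
    roots = []
    for record in records:
        parent = record.get("parent_id")
        if parent in known_ids:
            children[parent].append(record)
        else:
            roots.append(record)  # entry point, or a link broken upstream

    rows = []

    def visit(record, depth):
        duration_ms = round((record["end_time"] - record["start_time"]) * 1000)
        rows.append((depth, record["name"], duration_ms))
        for child in sorted(children[record["id"]], key=lambda r: r["start_time"]):
            visit(child, depth + 1)

    for root in sorted(roots, key=lambda r: r["start_time"]):
        visit(root, 0)
    return rows


# Made-up records mirroring the API Gateway -> Lambda -> DynamoDB walk-through.
EXAMPLE_RECORDS = [
    {"id": "c3", "parent_id": "b2", "name": "DynamoDB", "start_time": 0.125, "end_time": 0.25},
    {"id": "a1", "name": "API Gateway", "start_time": 0.0, "end_time": 0.5},
    {"id": "b2", "parent_id": "a1", "name": "Lambda", "start_time": 0.0625, "end_time": 0.4375},
]

if __name__ == "__main__":
    for depth, name, duration_ms in assemble_timeline(EXAMPLE_RECORDS):
        print("  " * depth + f"{name}: {duration_ms} ms")
```

A record whose parent is missing shows up at depth 0 alongside the real entry point, which is how a dropped trace header typically appears: as a second, disconnected root.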
Common Pitfalls
Not enabling X-Ray on all services: If a service doesn't propagate the trace header, the trace appears broken. Enable tracing on API Gateway stages and Lambda functions, and instrument the services behind an ALB, since the load balancer only adds the header.
Over-sampling: Setting too high a sampling rate can incur significant costs. Use the default or custom rules carefully.
Ignoring IAM permissions: The daemon needs PutTraceSegments and PutTelemetryRecords. Lambda needs xray:PutTraceSegments in its execution role.
Forgetting to install the daemon on EC2: Without the daemon, segments from the SDK are never sent to the X-Ray API.
Enable Tracing on API Gateway
In the API Gateway console, under the 'Settings' tab of a stage, enable 'X-Ray Tracing'. This causes API Gateway to add the `X-Amzn-Trace-Id` header to incoming requests and send a segment to X-Ray. The segment includes the request method, path, and status code. Without this, the trace would start at the first instrumented service downstream, missing the initial entry point.
Instrument Lambda with X-Ray SDK
For a Node.js Lambda function, add the `aws-xray-sdk` package. Wrap the AWS SDK with `AWSXRay.captureAWS()` and outbound HTTP with `AWSXRay.captureHTTPsGlobal()` so that downstream calls are recorded as subsegments; use `AWSXRay.captureAsyncFunc()` to add custom subsegments around your own code. Enable 'Active tracing' in the Lambda console under 'Monitoring tools'. This ensures Lambda sends segment data to X-Ray without you running a daemon.
Deploy X-Ray Daemon on EC2
If your application runs on EC2, install the X-Ray daemon (e.g., via `yum install -y xray-daemon` on Amazon Linux 2). The daemon runs as a service listening on UDP port 2000. Ensure the instance's IAM role has the `AWSXRayDaemonWriteAccess` policy. The daemon buffers segments and sends them to the X-Ray API every 1 second or when the buffer reaches 100 segments.
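To show what the SDK-to-daemon hop looks like on the wire, the Python sketch below builds a datagram in the format the daemon accepts: a one-line JSON header followed by a newline and the segment document. It is a simplified illustration with only the basic segment fields; the function names are hypothetical, and the address assumes the default listener on UDP port 2000.

```python
import json
import os
import socket

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # default daemon listener
HEADER = json.dumps({"format": "json", "version": 1})


def build_datagram(name, trace_id, start_time, end_time):
    """Return the bytes for one completed segment, ready to send over UDP."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digit segment ID
        "trace_id": trace_id,
        "start_time": start_time,
        "end_time": end_time,
    }
    return (HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send_to_daemon(datagram, address=DAEMON_ADDRESS):
    """Fire and forget: UDP gives no acknowledgement if the daemon is down."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)


if __name__ == "__main__":
    payload = build_datagram("web", "1-5f2e8a3b-1a2b3c4d5e6f7a8b9c0d1e2f", 1.0, 2.0)
    print(payload.decode("utf-8"))
```

Because `send_to_daemon` returns normally whether or not anything is listening, the application never learns that segments were dropped, which is why daemon health needs its own monitoring.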
Configure Sampling Rules
By default, X-Ray samples the first request each second and 5% of additional requests. For a high-traffic application, create custom sampling rules via the console or CLI. For example, set a reservoir of 1 trace per second and a rate of 10% for all requests. Sampling rules are evaluated in ascending order of priority; the first rule that matches a request applies. Use the `GetSamplingRules` API to verify.
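The reservoir-plus-rate behavior can be modeled for a single process with a few lines of Python. This is a simplified local model, not the SDK's implementation: real rules are managed centrally and the reservoir is shared across instances. The class name `ReservoirSampler` and the injectable `rng` parameter are assumptions made so the behavior can be tested deterministically.

```python
import random


class ReservoirSampler:
    """Sample the first `reservoir` requests each second, then a fixed rate."""

    def __init__(self, reservoir=1, rate=0.05, rng=random.random):
        self.reservoir = reservoir
        self.rate = rate
        self._rng = rng
        self._current_second = None
        self._used = 0

    def should_sample(self, now):
        """Decide for one request arriving at epoch time `now` (seconds)."""
        second = int(now)
        if second != self._current_second:
            # A new second started: refill the reservoir.
            self._current_second = second
            self._used = 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: fall back to the percentage rate.
        return self._rng() < self.rate


if __name__ == "__main__":
    sampler = ReservoirSampler(reservoir=1, rate=0.10)
    decisions = [sampler.should_sample(1000 + i / 100) for i in range(200)]
    print(sum(decisions), "of", len(decisions), "requests sampled")
```

The reservoir guarantees that even a quiet service records something every second, while the rate keeps volume proportional to traffic once the reservoir is used up.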
Analyze Traces in Console
Open the X-Ray console and view the service map. Identify services with high latency or error rates. Click on a service to see its traces. Filter traces by time range, annotations (e.g., `annotation.CustomerID = "123"`), or response codes. Use the 'Raw data' view to inspect individual segments and subsegments, including timestamps, durations, and any errors recorded.
Scenario 1: E-commerce Checkout Microservice
A large e-commerce platform uses a microservices architecture for its checkout flow: API Gateway -> Authentication Service -> Order Service -> Payment Service -> Inventory Service -> Notification Service. Users report that checkout occasionally takes over 10 seconds. Without X-Ray, the team can only see overall response times from the frontend. With X-Ray, they enable tracing on API Gateway and instrument each service with the X-Ray SDK. The service map reveals that the Payment Service has high latency (average 4 seconds) during peak hours. Drilling into traces, they find that a specific downstream call to a third-party payment gateway times out after 3 seconds, causing retries. The team adds a timeout and circuit breaker, reducing checkout time to under 2 seconds.
Scenario 2: Serverless Data Pipeline
A data analytics company processes streaming data using Kinesis, Lambda, DynamoDB, and S3. They enable active tracing on all Lambda functions and instrument them with the X-Ray SDK. A bug causes some records to be dropped. Using X-Ray traces, they discover that a Lambda function processing records from a particular shard fails intermittently due to a DynamoDB throttling error. The trace shows the exact ProvisionedThroughputExceededException and the duration of the throttled call. They increase DynamoDB read capacity units and add exponential backoff to the Lambda function, resolving the data loss.
Scenario 3: Hybrid On-Premises and AWS Application
A financial services company runs a legacy application on-premises that connects to AWS services (RDS, SQS). They install the X-Ray daemon on their on-premises servers and instrument the application with the X-Ray SDK. The daemon sends segments to X-Ray over the internet (HTTPS). They can now trace requests from on-premises through to AWS, identifying that a slow SQL query on RDS is causing performance degradation. They optimize the query and see latency drop from 2 seconds to 200ms.
Common Production Issues:
- Misconfigured Sampling: Setting too high a sampling rate (e.g., 100%) on a high-traffic service leads to excessive costs (thousands of dollars per month). Best practice: start with default sampling and adjust based on traffic.
- Missing IAM Permissions: The daemon fails silently if the IAM role lacks PutTraceSegments. Traces appear missing. Always test with a simple trace first.
- Daemon Not Running: On EC2, if the daemon crashes or is stopped, segments are lost. Use CloudWatch alarms to monitor daemon health.
Objective Codes: SOA-C02 Domain 1: Monitoring, Logging, and Remediation (Objective 1.1: Implement and manage monitoring). X-Ray questions often appear in combination with CloudWatch and CloudTrail.
Common Wrong Answers:
1. 'Use CloudWatch Logs to trace requests': CloudWatch Logs collects log events but does not correlate them across services. X-Ray provides end-to-end tracing with a single trace ID. The exam tests that you know the difference.
2. 'Enable detailed billing metrics': Billing metrics are unrelated to tracing. Candidates confuse 'monitoring' with 'cost tracking'.
3. 'Use VPC Flow Logs': Flow logs capture network traffic metadata (IPs, ports) but not application-level request tracing. X-Ray works at the application layer.
4. 'X-Ray is real-time': X-Ray has a delay of up to 30 seconds. The exam might present a scenario requiring immediate alerting; CloudWatch metrics or logs would be appropriate instead.
Specific Numbers and Terms:
Default sampling: 1 trace per second (reservoir) + 5% of additional requests.
Daemon UDP port: 2000.
Daemon send interval: 1 second or 100 segments.
Trace ID format: 1-<8 hexadecimal digits of epoch time>-<24 random hexadecimal digits> (e.g., 1-5f2e8a3b-1a2b3c4d5e6f7a8b9c0d1e2f).
Header: X-Amzn-Trace-Id.
IAM policies: AWSXRayDaemonWriteAccess for daemon; AWSXrayReadOnlyAccess for read users.
Lambda integration: 'Active tracing' must be enabled; no daemon needed.
API Gateway: Enable per stage.
Edge Cases:
- Asynchronous Lambda invocations: If Lambda invokes another Lambda asynchronously, the trace ID is not automatically propagated. You must pass the trace header manually in the event.
- Step Functions: X-Ray tracing can be enabled on a state machine, but to see downstream calls made inside Lambda tasks you still need to instrument those functions with the X-Ray SDK.
- On-premises servers: The X-Ray daemon can send segments from on-premises, but the server must have outbound HTTPS access to the X-Ray endpoint.
How to Eliminate Wrong Answers:
If the question asks for 'end-to-end request tracing' or 'visualizing service dependencies', the answer is X-Ray.
If the question involves 'real-time alerting' or 'metric-based alarms', the answer is CloudWatch.
If the question is about 'audit logs of API calls', the answer is CloudTrail.
If the question involves 'distributed tracing across microservices', X-Ray is the correct choice; other options like CloudWatch Logs Insights require manual correlation.
X-Ray provides distributed tracing for microservices applications, correlating requests across services using a trace ID.
Default sampling: 1 trace per second (reservoir) + 5% of additional requests.
The X-Ray daemon listens on UDP port 2000 and sends segments to the API every 1 second or after 100 segments.
For Lambda, enable 'Active tracing' in the function configuration; no daemon needed.
The `X-Amzn-Trace-Id` header propagates the trace context between services.
X-Ray is not real-time; traces appear within 30 seconds.
IAM permissions required: `PutTraceSegments` and `PutTelemetryRecords` for the daemon, `xray:PutTraceSegments` for Lambda.
Annotations are indexed for search; metadata is not indexed.
CloudWatch Logs collects logs, CloudTrail records AWS API calls, X-Ray traces application requests.
Costs depend on number of traces recorded and scans performed.
These come up on the exam all the time. Here's how to tell them apart.
AWS X-Ray
Provides end-to-end tracing across multiple services using a single trace ID.
Automatically correlates segments into a service graph and trace timeline.
Supports sampling to reduce cost and noise.
Integrates with Lambda, API Gateway, ALB, and other services natively.
Requires SDK instrumentation for detailed subsegment tracing.
Amazon CloudWatch Logs
Collects log events from multiple sources but does not correlate them automatically.
Requires manual correlation using log groups and filters (e.g., `@requestId`).
No built-in sampling; you pay for all ingested logs.
Can be used for any application without SDK changes (agent-based).
Supports real-time alerting via metric filters and alarms.
AWS X-Ray
Traces application-level requests and their performance.
Captures segment data including latency, errors, and metadata.
Used for debugging and performance analysis.
Integrates with application SDKs.
Trace data is retained for 30 days; to keep it longer, you must export it yourself (for example, to S3).
AWS CloudTrail
Records API calls made to AWS services (management events).
Captures who made the call, from which IP, and what action was taken.
Used for auditing and security analysis.
No application-level tracing; only AWS API calls.
Data is retained for 90 days by default in CloudTrail console, or indefinitely in S3.
Mistake
X-Ray requires code changes to work with Lambda.
Correct
Lambda can automatically send trace data for invocations without instrumenting code if 'Active tracing' is enabled. However, for capturing downstream calls (e.g., DynamoDB queries), you need the X-Ray SDK in your Lambda function code.
Mistake
X-Ray provides real-time tracing with sub-second latency.
Correct
X-Ray has a delay of up to 30 seconds between when a request is processed and when the trace appears in the console. It is not suitable for real-time alerting.
Mistake
Enabling X-Ray on all services automatically propagates the trace ID.
Correct
The trace ID is propagated via the `X-Amzn-Trace-Id` header. Services not instrumented with the X-Ray SDK may drop or ignore this header, breaking the trace. You must ensure all services in the request path are instrumented.
Mistake
X-Ray can trace requests that go through an on-premises load balancer without any configuration.
Correct
For on-premises servers, you must install the X-Ray daemon and instrument your application with the SDK. The load balancer itself does not add the trace header automatically.
Mistake
X-Ray is free with no additional cost.
Correct
X-Ray charges per trace recorded and per scan for trace retrieval. The first 100,000 traces per month are free, but high-volume usage can incur significant costs.
In the Lambda console, select your function, go to the 'Configuration' tab, then 'Monitoring tools', and click 'Edit'. Enable 'Active tracing' and save. Ensure the function's execution role has the `AWSXRayDaemonWriteAccess` policy (or equivalent permissions). No code changes are required for basic tracing of invocation duration and response, but to trace downstream calls (e.g., DynamoDB), you must instrument your code with the X-Ray SDK.
A segment represents the work of a single service (e.g., an API Gateway stage, a Lambda function). A subsegment represents a downstream call within that service (e.g., an HTTP request to another service, a DynamoDB query). Subsegments are nested within segments and provide granular timing and error information for each call.
Add an annotation to your segment or subsegment using the X-Ray SDK, e.g., `AWSXRay.getSegment().putAnnotation('CustomerID', '12345')`. Then in the X-Ray console, use the filter bar to enter `annotation.CustomerID = "12345"`. You can also use the `GetTraceSummaries` API with a filter expression.
Possible reasons: (1) The X-Ray daemon is not running on EC2 or the IAM role lacks permissions. (2) For Lambda, 'Active tracing' is not enabled. (3) Some services in the request path are not instrumented with the X-Ray SDK, causing the trace header to be lost. (4) Sampling rules are filtering out the request. Check the daemon logs (e.g., `/var/log/xray/xray.log`) for errors.
Yes. Install the X-Ray daemon on your on-premises server and instrument your application with the X-Ray SDK. The daemon sends segments to the X-Ray API over HTTPS (TCP 443). Ensure the server can reach the X-Ray endpoint (e.g., `xray.us-east-1.amazonaws.com`). The daemon requires outbound internet access.
Use sampling rules to limit the number of traces recorded. Set a low reservoir (e.g., 1 trace per second) and a low rate (e.g., 5%). You can also create rules that match only certain services, HTTP methods, or URL paths. Because the sampling decision is made when a request starts, a rule cannot select only the requests that later fail. Additionally, avoid attaching unnecessary metadata to the segments you send.
X-Ray retains trace data for 30 days. To keep traces longer, export them yourself, for example by configuring a CloudWatch Events rule to trigger a Lambda function that periodically reads traces through the X-Ray API and writes them to S3.