How many troubleshooting scenario questions are on this page?

This page has 13 troubleshooting scenario questions for the MLA-C01 exam, each with detailed explanations and wrong-answer analysis.

How should I approach MLA-C01 scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.


Scenario-based practice

Troubleshooting Scenario Questions

Practise with original exam-style scenarios for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam, covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.


Scenario questions: 13

Exam code: MLA-C01

Vendor: Amazon Web Services

Scenario guide

How to approach troubleshooting scenario questions

These questions describe a symptom, such as a failed training job, an endpoint error, or degraded model accuracy, and ask you to identify the root cause or the correct fix. They appear across all certification exams and reward systematic thinking over memorisation. The best candidates follow a consistent troubleshooting framework even under time pressure.

Quick answer

Troubleshooting scenario questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 (easy, multiple choice)

An ML engineer runs the CLI command shown in the exhibit. However, the training job fails immediately with an error: 'Unable to assume role'. What is the most likely cause?

Exhibit

Refer to the exhibit.

aws sagemaker create-training-job \
    --training-job-name my-training-job \
    --algorithm-specification 'TrainingImage=123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-training:latest,TrainingInputMode=File' \
    --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
    --input-data-config '[{"ChannelName":"train","DataSource":{"S3DataSource":{"S3Uri":"s3://my-bucket/train/","S3DataType":"S3Prefix"}},"ContentType":"text/csv"}]' \
    --output-data-config '{"S3OutputPath":"s3://my-bucket/output/"}' \
    --resource-config '{"InstanceType":"ml.m5.large","InstanceCount":1,"VolumeSizeInGB":30}' \
    --vpc-config '{"SecurityGroupIds":["sg-12345678"],"Subnets":["subnet-12345678"]}'

A
The IAM role 'SageMakerExecutionRole' does not have permission to create the training job.
Why wrong: The error concerns assuming the role, not what the role is permitted to do once it has been assumed.
B
The training image in ECR does not exist.
Why wrong: Image existence is checked later; the immediate error is role assumption.
C
The S3 bucket 'my-bucket' does not exist.
Why wrong: If the bucket didn't exist, the error would be later, not an immediate role failure.
D
The IAM role's trust policy does not grant SageMaker permission to assume the role.
Correct: Without a trust policy that lets the SageMaker service assume the role, SageMaker cannot use it, so the job fails immediately.
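
As a rough illustration of what the trust policy needs to contain, the sketch below builds a minimal trust policy document and checks it with a small helper. The helper `allows_service` is hypothetical and written only for this example; it is not an AWS API, and it ignores conditions and Deny statements.

```python
import json

# Minimal trust policy that lets the SageMaker service assume the role.
# It is attached to the role itself (for example SageMakerExecutionRole
# from the exhibit), not to the user who runs the CLI command.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def allows_service(policy: dict, service: str) -> bool:
    """Hypothetical helper: True if an Allow statement lets `service`
    call sts:AssumeRole. Conditions and Deny statements are ignored."""
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        if isinstance(services, str):
            services = [services]
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if (
            stmt.get("Effect") == "Allow"
            and service in services
            and "sts:AssumeRole" in actions
        ):
            return True
    return False


print(json.dumps(trust_policy, indent=2))
print(allows_service(trust_policy, "sagemaker.amazonaws.com"))
```

A role whose trust policy names only another service, such as ec2.amazonaws.com, would fail this check, which matches the immediate failure described in the question.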

Question 2 (easy, multiple choice)

Refer to the exhibit. A user is unable to invoke a SageMaker endpoint. The IAM policy shown is attached to the user. Which permission is missing to allow invocation?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:ListEndpoints"
      ],
      "Resource": "*"
    }
  ]
}

A
sagemaker:InvokeEndpoint
Correct: sagemaker:InvokeEndpoint is required to send inference requests to the endpoint.
B
sagemaker:DescribeEndpoint
Why wrong: Already allowed but not sufficient for invocation.
C
sagemaker:CreateEndpoint
Why wrong: Creating endpoints is not needed for invocation.
D
sagemaker:ListEndpoints
Why wrong: Already allowed but not sufficient for invocation.
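
The gap in the exhibit can be sketched in a few lines. The function `is_allowed` below is a deliberately simplified, hypothetical check written for this example: it only matches action names against Allow statements and ignores Deny statements, resources, and conditions, so it is not a substitute for real IAM policy evaluation.

```python
import copy
from fnmatch import fnmatchcase

# The policy from the exhibit.
exhibit_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["sagemaker:DescribeEndpoint", "sagemaker:ListEndpoints"],
            "Resource": "*",
        }
    ],
}


def is_allowed(policy: dict, action: str) -> bool:
    """Hypothetical, simplified check: does any Allow statement list `action`?
    IAM action names are case-insensitive and may contain wildcards."""
    for stmt in policy["Statement"]:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if stmt["Effect"] == "Allow" and any(
            fnmatchcase(action.lower(), pattern.lower()) for pattern in actions
        ):
            return True
    return False


# Same policy with the missing action added.
fixed_policy = copy.deepcopy(exhibit_policy)
fixed_policy["Statement"][0]["Action"].append("sagemaker:InvokeEndpoint")

print(is_allowed(exhibit_policy, "sagemaker:InvokeEndpoint"))
print(is_allowed(fixed_policy, "sagemaker:InvokeEndpoint"))
```

In practice the Resource element would usually be narrowed to the specific endpoint ARN rather than left as "*".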

Question 3 (medium, multiple choice)

A machine learning engineer is troubleshooting a model that is producing unexpectedly low accuracy in production. The engineer examines the model's training data and finds that the distribution of the target variable in production is significantly different from the training set. What type of drift is the model experiencing?

A
Prior probability shift
Why wrong: Prior probability shift is a specific case of concept drift where class proportions change.
B
Concept drift
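Correct: This question bank treats a change in the statistical properties of the target variable as concept drift. Stricter definitions reserve the term for a change in the relationship between the inputs and the target, P(y | x).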
C
Data drift
Why wrong: Data drift is a broad term; concept drift is more specific.
D
Covariate shift
Why wrong: Covariate shift refers to changes in input features, not the target.
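
The symptom in this question, a target distribution that differs between training and production, can be checked by comparing class proportions. The sketch below is one minimal way to do that in plain Python; the label lists are made-up illustration data and the function names are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def max_proportion_gap(train_labels, prod_labels):
    """Largest absolute difference in class share between the two sets."""
    p = class_proportions(train_labels)
    q = class_proportions(prod_labels)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Made-up labels: 10% positives in training, 30% in production.
train_labels = [0] * 90 + [1] * 10
prod_labels = [0] * 70 + [1] * 30

gap = max_proportion_gap(train_labels, prod_labels)
print(f"largest class-share gap: {gap:.2f}")
```

A gap this large would justify a closer look with a proper statistical test before deciding whether to retrain.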

Question 4 (medium, multiple choice)

A SageMaker Processing job fails with the error: 'Unable to parse CSV file due to inconsistent number of columns'. The data is stored as CSV in S3. What is the most likely cause?

A
The CSV file is missing a header row
Why wrong: Missing header affects column names, not number of columns per row.
B
The file uses a different delimiter like tab
Why wrong: A different delimiter would typically make every row parse to the same wrong column count, not an inconsistent one.
C
Some fields contain quoted commas
Why wrong: Standards-compliant CSV parsers treat commas inside quoted fields as data, not as delimiters.
D
Some rows have missing values causing fewer columns
Correct: If a row drops values together with their delimiters, it has fewer commas than the other rows, which produces the column-count mismatch.
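
A quick way to confirm this kind of failure before rerunning the job is to count the columns in each row. The sketch below uses Python's standard csv module on a small made-up sample; `column_count_report` is a hypothetical helper name. Note that the row with a quoted comma still parses to three columns, while the short row does not.

```python
import csv
import io


def column_count_report(text: str) -> dict:
    """Map each observed column count to the 1-based row numbers that have it."""
    report = {}
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        report.setdefault(len(row), []).append(row_number)
    return report


# Made-up sample: row 2 has a quoted comma, row 3 is missing its last field.
sample = 'id,name,score\n1,"Doe, Jane",0.9\n2,Smith\n3,Lee,0.7\n'

report = column_count_report(sample)
for count, rows in sorted(report.items()):
    print(f"{count} columns: rows {rows}")
```

More than one key in the report means the file has inconsistent rows, and the row numbers point to where to look.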

Question 5 (hard, multiple choice)

A team uses SageMaker Neo to compile a model for deployment on a target device. After compilation, they deploy the compiled model to a SageMaker endpoint using the Neo-optimized container. The endpoint fails to start with error "RuntimeError: Unable to load model". What could be the issue?

A
The compiled model was not uploaded to the correct S3 path.
Why wrong: Incorrect path would cause a file not found error.
B
The Neo compilation job failed silently.
Why wrong: A failed compilation would produce an error during compilation, not at deployment.
C
The endpoint instance type does not support Neo.
Why wrong: Neo supports many instance types; unsupported types would cause a different error.
D
The target device architecture during compilation does not match the endpoint instance architecture.
Correct: Neo compiles a model for a specific target architecture, so a mismatch with the endpoint instance architecture causes the load failure.

Question 6 (medium, multiple choice)

A team used the above config to create an endpoint. However, the endpoint fails to invoke because of a "ModelError". What is the most likely cause?

Exhibit: endpoint configuration (not shown)

A
The instance type is not available in the region.
Why wrong: Instance unavailability would cause an insufficient capacity error.
B
The IAM role does not have permission to access the S3 bucket.
Correct: Without s3:GetObject on the bucket, the endpoint cannot load the model artifact.
C
The model data URL points to a non-existent file.
Why wrong: Missing file would cause a file not found error, not ModelError.
D
The ECR image URI is incorrect for the region.
Why wrong: An incorrect image URI typically results in an image not found error, not ModelError.

Question 7 (medium, multiple choice)

A company is training a deep learning model on Amazon SageMaker. The training job started but has been stuck in 'InProgress' state for an unusually long time with low CPU utilization. The data scientist suspects a bottleneck. What should be the first troubleshooting step?

A
Switch the training job to use Spot instances to reduce cost and potentially improve throughput.
Why wrong: Spot instances do not fix performance bottlenecks; they may interrupt jobs.
B
Increase the number of training instances to parallelize data loading.
Why wrong: Increasing instances can worsen the problem if the bottleneck is not compute-bound.
C
Stop and restart the training job with a different instance type.
Why wrong: Restarting without root cause analysis may lead to the same issue.
D
Review CloudWatch Logs for the training container to identify errors or warnings.
Correct: The logs often show the exact cause of a hang, such as waiting for data or hitting a resource constraint.

Question 8 (easy, multiple choice)

A company uses Amazon SageMaker to deploy a real-time inference endpoint. They notice increased latency in predictions during peak hours. Which should they investigate first to address the issue?

A
Review the endpoint auto-scaling policy
Correct: The auto-scaling policy determines how instances are added and removed; insufficient capacity at peak causes high latency.
B
Check the data labeling job status
Why wrong: Data labeling jobs are not related to inference latency.
C
Modify the training instance type
Why wrong: Training instance type does not affect inference latency.
D
Increase the model artifact size
Why wrong: Larger model artifact may increase latency but is not the first thing to investigate.

Question 9 (medium, multiple choice)

Refer to the exhibit. A data engineer investigates why a SageMaker endpoint is returning errors. The endpoint configuration has been updated to point to a new model version. What is the MOST likely cause of the error?

Exhibit

[ERROR] 2024-03-15 10:23:45,123 - sagemaker - 1321 - root - ERROR - InvocationException: Received response status code 404 from container. Error: ResourceNotFoundException: Model 'my-model-v2' is not found. You may be using an outdated endpoint configuration.

A
The endpoint instance type is insufficient.
Why wrong: Insufficient instances would lead to timeouts or throttling, not a 404.
B
The container image for the new model is not compatible.
Why wrong: Incompatible image would cause a different error, such as container startup failure.
C
The IAM role does not have permission to invoke the endpoint.
Why wrong: Permission errors typically surface as a 403 access-denied response, not a 404.
D
The endpoint is still using the previous configuration.
Correct. The endpoint configuration likely still points to an older model name that does not exist.
E
The new model artifact is not properly uploaded to S3.
Why wrong: If the artifact were missing, the error would be about accessing S3, not a 404 from the container.

Question 10 (medium, multiple choice)

A data scientist is training a binary classification model using Amazon SageMaker. The dataset has a severe class imbalance (95% negative, 5% positive). The model achieves 99% accuracy but fails to identify positive cases correctly. Which action should the data scientist take to improve the model's ability to detect positive cases?

A
Switch to a logistic regression model with balanced class weights.
Why wrong: Model choice alone does not guarantee improved recall.
B
Use accuracy as the evaluation metric and retrain the model.
Why wrong: Accuracy is misleading for imbalanced classes.
C
Apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data.
Why wrong: SMOTE is a valid approach but not the most direct action for improving detection of positive cases.
D
Use the F1 score as the evaluation metric and adjust the classification threshold based on the precision-recall curve.
Correct: The F1 score balances precision and recall, and tuning the threshold lets the model trade some precision for better detection of the minority class.
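
To make the threshold-tuning idea concrete, the sketch below computes precision, recall, and F1 at a few candidate thresholds and picks the one with the highest F1. The labels and scores are made-up illustration data, and the function names are hypothetical; a real workflow would more likely use a library such as scikit-learn on a held-out validation set.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F1 when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(y_true, scores, candidates):
    """Candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])


# Made-up validation data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60]

for t in (0.3, 0.4, 0.5):
    p, r, f1 = precision_recall_f1(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

print("best threshold by F1:", best_threshold(y_true, scores, (0.3, 0.4, 0.5)))
```

With this sample, the default threshold of 0.5 misses one of the two positives, while a lower threshold recovers it at the cost of one false positive.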

Question 11 (hard, multiple choice)

A company is deploying an ML model for real-time fraud detection using SageMaker. The model must process requests within 50 ms and scale to handle up to 10,000 requests per second during peak hours. The data includes PII, so all traffic must stay within a VPC. The team has configured the SageMaker endpoint with a VPC and an internet gateway for model downloads. During a load test, the endpoint fails to achieve the required throughput. Which change would most likely resolve the issue?

A
Remove the VPC configuration and use public endpoints to reduce network overhead.
Why wrong: Removing VPC would break VPC-only requirement and increase latency due to internet egress.
B
Use VPC endpoints (interface endpoint for SageMaker and gateway endpoint for S3) to keep traffic within AWS backbone.
Correct: VPC endpoints reduce latency and keep traffic within the AWS network, improving throughput.
C
Add a NAT gateway to allow the SageMaker endpoint to access the internet efficiently.
Why wrong: NAT gateway adds latency and still routes through internet, not optimal.
D
Increase the instance count and use a larger instance type to handle the throughput.
Why wrong: While scaling helps, the fundamental network bottleneck remains; VPC endpoints address the root cause.

Question 12 (hard, multiple choice)

A team is using SageMaker Pipelines to train a model. The pipeline has multiple steps: data processing, training, evaluation, and registration. They use a Condition step to evaluate the model's accuracy and if it exceeds a threshold, register the model. They run the pipeline and the training step succeeds, but the pipeline fails at the Condition step with an error: 'Unable to evaluate condition: the property 'Accuracy' does not exist.' The evaluation step output is a JSON file with key 'accuracy'. What is the most likely cause?

A
The evaluation step did not produce the output correctly.
Why wrong: The evaluation step produced the output with key 'accuracy' as expected.
B
The training step output is being used instead of the evaluation step output.
Why wrong: Even if that were the case, the property name mismatch would still cause the error.
C
The pipeline definition has a syntax error.
Why wrong: The error message points to a missing property, not a syntax error.
D
The Condition step is referencing the wrong property name.
Correct: 'Accuracy' vs 'accuracy' case mismatch causes the error.
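
The failure mode is ordinary case-sensitive key lookup. The sketch below reproduces it with plain Python and the standard json module rather than the SageMaker Pipelines SDK; the report value 0.93 and the helper `get_metric` are hypothetical and exist only for this example.

```python
import json

# Hypothetical evaluation output; only the key name comes from the question.
evaluation_report = json.loads('{"accuracy": 0.93}')


def get_metric(report: dict, name: str):
    """Look up a metric by exact key, with a hint on case-only mismatches."""
    if name not in report:
        close = [k for k in report if k.lower() == name.lower()]
        hint = f" Did you mean {close[0]!r}?" if close else ""
        raise KeyError(f"property {name!r} does not exist.{hint}")
    return report[name]


print(get_metric(evaluation_report, "accuracy"))

try:
    get_metric(evaluation_report, "Accuracy")
except KeyError as err:
    print("lookup failed:", err)
```

The fix in the pipeline is the same as here: make the property name in the Condition step match the key in the evaluation output exactly.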

Question 13 (hard, multiple choice)

A financial services company uses Amazon SageMaker to deploy a fraud detection model for real-time inference. The model is deployed on an ml.m5.large instance with a SageMaker real-time endpoint. The endpoint has an auto scaling policy configured using a custom scaling policy based on average CPU utilization, with scale out threshold at 70% and scale in threshold at 30%. During a flash sale event, the traffic to the endpoint spikes tenfold within minutes. The endpoint fails to handle the load, resulting in increased latency and timeouts. The data science team needs to improve the scalability of the endpoint to handle sudden traffic spikes. Which solution should the team implement?

A
Implement a SageMaker Model Ensemble with two additional models to balance the load.
Why wrong: Adding models increases the computational load per request, worsening the latency issue.
B
Replace the custom scaling policy with a target tracking scaling policy based on the number of invocations per instance, with a target value of 1000.
Correct: Target tracking on request count reacts faster to traffic spikes because it measures the traffic directly, whereas CPU utilization is a lagging indicator.
C
Implement a SageMaker Inference Pipeline with a pre-processing step to reduce model input size.
Why wrong: An inference pipeline adds extra pre-processing time, increasing latency, and does not improve scaling responsiveness.
D
Switch to a GPU instance type, such as ml.p3.2xlarge, to increase compute capacity.
Why wrong: While GPU instances may process requests faster, the scaling policy still lags, so the endpoint will still fail to scale in time.
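
A target tracking policy of this kind is usually expressed as a request to Application Auto Scaling. The sketch below only builds the request parameters so that it runs without AWS credentials; in a real setup they would be passed to the boto3 `application-autoscaling` client's `put_scaling_policy` call after the variant has been registered as a scalable target. The endpoint name, variant name, policy name, and cooldown values are assumptions for illustration, not values from the question.

```python
def target_tracking_request(endpoint_name, variant_name, target_invocations):
    """Build put_scaling_policy parameters for invocations-per-instance
    target tracking on a SageMaker endpoint variant."""
    return {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",  # assumed name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,   # assumed: react quickly to spikes
            "ScaleInCooldown": 300,   # assumed: scale in more cautiously
        },
    }


# Hypothetical endpoint and variant names; the target of 1000 is from option B.
request = target_tracking_request("fraud-endpoint", "AllTraffic", 1000)
print(request["ResourceId"])
print(request["TargetTrackingScalingPolicyConfiguration"]["TargetValue"])
```

The right target value depends on how many requests one instance can serve within the latency budget, so it is normally derived from a load test rather than chosen up front.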


These MLA-C01 practice questions are part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style MLA-C01 questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.


Related MLA-C01 topic practice pages

Data Preparation for Machine Learning practice questions

ML Model Development practice questions

Deployment and Orchestration of ML Workflows practice questions

ML Solution Monitoring, Maintenance and Security practice questions

MLA-C01 fundamentals practice questions

MLA-C01 scenario practice questions

MLA-C01 troubleshooting practice questions
