How many troubleshooting scenario questions are on this page?

This page has 8 troubleshooting scenario questions for the MLS-C01 exam, each with detailed explanations and wrong-answer analysis.

How should I approach MLS-C01 scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.

Scenario-based practice

Troubleshooting Scenario Questions

Practise with AWS Certified Machine Learning Specialty (MLS-C01) questions: original exam-style scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Exam code: MLS-C01

Vendor: Amazon Web Services

Scenario guide

How to approach troubleshooting scenario questions

These questions describe a symptom, such as a failed training job or a blocked deployment, and ask you to identify the root cause or the correct fix. They reward systematic thinking over memorisation. The best candidates follow a consistent troubleshooting framework even under time pressure.

Quick answer

Troubleshooting scenario questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 (easy, multiple choice)

A data scientist is training a binary classification model using Amazon SageMaker. The dataset is highly imbalanced (99% negative class, 1% positive class). The model currently achieves 99% accuracy but fails to detect most positive cases. Which metric should the data scientist primarily use to evaluate model performance?

A
ROC AUC
Why wrong: ROC AUC can be overly optimistic for imbalanced data.
B
F1 score
Why correct: The F1 score balances precision and recall, which makes it suitable for imbalanced data.
C
Recall
Why wrong: Recall alone ignores false positives.
D
Accuracy
Why wrong: Accuracy is misleading for imbalanced datasets.
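
The gap between accuracy and F1 can be checked by hand. The following is a minimal sketch in plain Python, assuming a hypothetical test set of 1,000 samples with 10 positives, where the model finds only 2 of them and raises 2 false alarms. The data and the helper name `classification_metrics` are illustrative and not part of the question.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 10 positives (1%) followed by 990 negatives.
y_true = [1] * 10 + [0] * 990
# The model detects 2 positives, misses 8, and flags 2 negatives by mistake.
y_pred = [1] * 2 + [0] * 8 + [1] * 2 + [0] * 988

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers accuracy stays at 0.99 while F1 is below 0.3, which is why accuracy hides the problem and F1 exposes it.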

Question 2 (easy, multiple choice)

An ML engineer is troubleshooting why an automated CI/CD pipeline cannot deploy an updated model to an existing SageMaker endpoint. The pipeline uses the IAM role that has the attached policy shown in the exhibit. What is the MOST likely cause of the failure?

Exhibit

Refer to the exhibit.
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:CreateEndpoint",
        "sagemaker:InvokeEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "sagemaker.amazonaws.com"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": [
        "sagemaker:DeleteEndpoint",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:DeleteModel"
      ],
      "Resource": "*"
    }
  ]
}
```

A
The pipeline tries to update an existing endpoint, but the sagemaker:UpdateEndpoint action is not allowed.
Why correct: The policy does not include sagemaker:UpdateEndpoint, which is required to update an existing endpoint. Because no statement allows the action, it is implicitly denied and the update fails.
B
The pipeline tries to create a new endpoint, but the sagemaker:CreateEndpoint action is denied.
Why wrong: The policy allows sagemaker:CreateEndpoint, so creation is permitted. The issue is with updating an existing endpoint.
C
The pipeline tries to delete the old endpoint, but the sagemaker:DeleteEndpoint action is denied by a Deny statement.
Why wrong: The Deny statement blocks Delete actions, but the pipeline is trying to update, not delete. The lack of Update permission is the issue.
D
The pipeline attempts to invoke the endpoint, but the sagemaker:InvokeEndpoint action is denied.
Why wrong: The policy allows sagemaker:InvokeEndpoint, so invocation is not the issue.
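
The reasoning behind the answer follows the usual IAM evaluation order: an explicit Deny wins, otherwise a matching Allow permits the action, otherwise the request is implicitly denied. The sketch below is a deliberately simplified evaluator, not the real IAM engine. It matches on action names only and ignores `Resource` and `Condition`; the function name `evaluate` and the trimmed-down `POLICY` dictionary are illustrative assumptions.

```python
from fnmatch import fnmatchcase

# Action lists copied from the exhibit; Resource and Condition are omitted
# because this simplified sketch does not evaluate them.
POLICY = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateModel",
                "sagemaker:CreateEndpointConfig",
                "sagemaker:CreateEndpoint",
                "sagemaker:InvokeEndpoint",
            ],
        },
        {"Effect": "Allow", "Action": "iam:PassRole"},
        {
            "Effect": "Deny",
            "Action": [
                "sagemaker:DeleteEndpoint",
                "sagemaker:DeleteEndpointConfig",
                "sagemaker:DeleteModel",
            ],
        },
    ]
}


def evaluate(policy, action):
    """Return 'ExplicitDeny', 'Allow' or 'ImplicitDeny' for a single action name."""
    decision = "ImplicitDeny"
    for statement in policy["Statement"]:
        actions = statement["Action"]
        patterns = [actions] if isinstance(actions, str) else actions
        # IAM action names are case-insensitive and may contain wildcards.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            if statement["Effect"] == "Deny":
                return "ExplicitDeny"  # an explicit Deny always wins
            decision = "Allow"
    return decision


for name in ("sagemaker:CreateEndpoint", "sagemaker:UpdateEndpoint", "sagemaker:DeleteEndpoint"):
    print(name, "->", evaluate(POLICY, name))
```

In this simplified model, `sagemaker:UpdateEndpoint` matches no statement at all, so it falls through to an implicit deny, which corresponds to option A.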

Question 3 (medium, multiple choice)

A company is building a binary classifier to predict equipment failure. The dataset has 99% negative (no failure) and 1% positive (failure) examples. The data scientist uses a random forest model with default settings. The model achieves 99% accuracy on the test set but fails to identify any actual failures. Which metric should the data scientist use to evaluate the model?

A
RMSE
Why wrong: RMSE is for regression, not classification.
B
R-squared
Why wrong: R-squared is for regression, not classification.
C
Recall
Why correct: Recall measures the proportion of actual positives that are correctly identified, which is critical for imbalanced data and is exactly where this model fails.
D
Precision
Why wrong: Precision is useful but does not capture the failure to identify positives; recall is more relevant here.
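
As a worked example, assume a hypothetical test set of 1,000 records with 10 failures, and a model that always predicts "no failure". Then TP = 0, FN = 10, TN = 990 and FP = 0, which gives:

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 990}{1000} = 0.99 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{0}{0 + 10} = 0
\end{align*}
```

Precision is undefined in this case (0/0) because the model never predicts the positive class, so recall is the metric that directly shows the missed failures.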

Question 4 (medium, multiple choice)

Refer to the exhibit. A data scientist runs the CLI command shown in the exhibit to create a SageMaker training job. The job fails with an error 'Unable to read data from s3://bucket/train/'. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```
aws sagemaker create-training-job \
    --training-job-name my-job \
    --algorithm-specification TrainingImage=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest,TrainingInputMode=File \
    --role-arn arn:aws:iam::123456789012:role/SageMakerRole \
    --input-data-config '[{"ChannelName": "train", "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://bucket/train/"}}}]' \
    --output-data-config S3OutputPath=s3://bucket/output/ \
    --resource-config InstanceType=ml.c5.xlarge,InstanceCount=1,VolumeSizeInGB=10 \
    --stopping-condition MaxRuntimeInSeconds=3600
```

A
The training image is not accessible
Why wrong: Image accessibility would cause a different error (e.g., 'Unable to pull image').
B
The instance type does not support the required memory
Why wrong: Instance memory would cause an out-of-memory error, not a data read error.
C
The IAM role does not have permissions to read from the S3 bucket
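Why correct: The role must have s3:GetObject permission on the training data. Because the channel uses an S3Prefix, the role typically also needs s3:ListBucket on the bucket.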
D
The training job is in a different region than the S3 bucket
Why wrong: Nothing in the exhibit indicates which Region the bucket is in, and a read failure on the S3 URI points more directly to missing S3 permissions on the role.
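
For illustration, here is a minimal sketch of an identity policy that would let the training role read the channel data. The helper `s3_read_policy` is hypothetical, the bucket name `bucket` and prefix `train/` come from the exhibit, and scoping `s3:ListBucket` with the `s3:prefix` condition key is one reasonable choice rather than a requirement.

```python
import json


def s3_read_policy(bucket, prefix):
    """Build a least-privilege read policy for one S3 prefix (illustrative helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the resource is the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an object-level action, so the resource is the object ARN pattern.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = s3_read_policy("bucket", "train/")
print(json.dumps(policy, indent=2))
```

The split between the bucket ARN for listing and the object ARN pattern for reading is a common source of this kind of read error when only one of the two is granted.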

Question 5 (hard, multiple choice)

Refer to the exhibit. The training job 'my-job' failed with the error 'Unable to pull image from ECR'. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
aws sagemaker create-training-job \
    --training-job-name my-job \
    --algorithm-specification TrainingImage=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-image:latest,TrainingInputMode=File \
    --role-arn arn:aws:iam::123456789012:role/SageMakerRole \
    --input-data-config 'ChannelName=training,DataSource={S3DataSource={S3Uri=s3://my-bucket/train/,S3DataType=S3Prefix,S3DataDistributionType=FullyReplicated}}' \
    --output-data-config S3OutputPath=s3://my-bucket/output/ \
    --resource-config InstanceType=ml.m5.large,InstanceCount=1,VolumeSizeInGB=10 \
    --stopping-condition MaxRuntimeInSeconds=3600
```

A
The IAM role does not have permission to pull images from the ECR repository.
Why correct: Without ecr:GetDownloadUrlForLayer and ecr:BatchGetImage (along with ecr:BatchCheckLayerAvailability and ecr:GetAuthorizationToken), the pull fails.
B
The instance type ml.m5.large does not support custom images.
Why wrong: Custom images are supported on all instance types.
C
The S3 bucket for training data is in a different account.
Why wrong: The error is about ECR, not S3.
D
The role ARN is incorrect.
Why wrong: The ARN appears valid.
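
A quick way to reason about this failure is to compare the actions a role is allowed to perform against the set commonly needed to pull an image from Amazon ECR. The sketch below assumes you already have a flat list of allowed action names; the helper `missing_pull_actions` and the sample role are hypothetical, and the check does not expand wildcards or consider the repository policy.

```python
# Actions commonly required to pull an image from a private ECR repository.
REQUIRED_ECR_PULL_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
}


def missing_pull_actions(allowed_actions):
    """Return the required ECR pull actions that are absent from allowed_actions."""
    allowed = {action.lower() for action in allowed_actions}
    return sorted(a for a in REQUIRED_ECR_PULL_ACTIONS if a.lower() not in allowed)


# Hypothetical role that can authenticate to ECR and read S3, but cannot fetch layers.
role_actions = ["ecr:GetAuthorizationToken", "s3:GetObject", "s3:ListBucket"]
missing = missing_pull_actions(role_actions)
print("Missing ECR actions:", missing)
```

If the image lives in another account, the repository policy on the ECR side would also need to allow the pull, which this sketch does not model.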

Question 6 (easy, multiple choice)

A data scientist is training a binary classification model on an imbalanced dataset where the positive class represents only 1% of the data. The model achieves 99% accuracy but fails to identify most positive cases. Which metric should the data scientist use to evaluate model performance?

A
R-squared
Why wrong: R-squared is for regression, not classification.
B
F1 score
Why correct: The F1 score balances precision and recall, which makes it suitable for imbalanced data.
C
Accuracy
Why wrong: Accuracy is misleading for imbalanced datasets.
D
RMSE
Why wrong: RMSE is for regression, not classification.

Question 7 (medium, multiple choice)

A company uses Amazon SageMaker to train a model using a custom Docker container. The training job fails with an error: "Unable to write to /opt/ml/output/data". The data scientist checks the container and finds that the /opt/ml directory is not writable. What is the MOST likely cause?

A
The Docker image is built from a base image that does not have the required libraries.
Why wrong: Missing libraries would cause import errors, not write permission errors.
B
The container runs as a non-root user that lacks write permissions to /opt/ml.
Why correct: SageMaker mounts volumes as root by default; if the container runs as a different user, it may not have write access.
C
The SageMaker training job is configured with insufficient memory.
Why wrong: Insufficient memory would cause out-of-memory errors, not write permission errors.
D
The training script is not copying the model to /opt/ml/model.
Why wrong: The error is about writing to /opt/ml/output/data, not copying model.
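
When debugging this kind of failure, it can help to have the container report which user it runs as and whether the output directory is writable. The following is a minimal sketch using only the standard library, assuming a Linux container. The helper `describe_write_access` is hypothetical, and the demo runs against a temporary directory so it works outside SageMaker; inside a training container you would pass `/opt/ml/output/data` instead.

```python
import os
import tempfile


def describe_write_access(path):
    """Report the current user ID and whether it can write to the given directory."""
    return {
        "path": path,
        "uid": os.getuid(),                      # 0 means the process runs as root
        "exists": os.path.isdir(path),
        "writable": os.access(path, os.W_OK),    # False if missing or not writable
    }


# Demo against a temporary directory; in a training container, use
# describe_write_access("/opt/ml/output/data") at the start of the entry point.
demo_dir = tempfile.mkdtemp()
report = describe_write_access(demo_dir)
print(report)
```

A non-zero `uid` combined with `writable: False` would be consistent with option B, where a `USER` directive in the Dockerfile switches the container away from root.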

Question 8 (hard, multiple choice)

A machine learning team is using Amazon SageMaker to train a model with a custom algorithm packaged in a Docker container. The training job fails with the error 'Error: Unable to locate sagemaker-training toolkit.' What is the MOST likely cause?

A
The container does not have internet access to download dependencies
Why wrong: The toolkit should be included in the image, not downloaded at runtime.
B
The instance type is incompatible with the container
Why wrong: Incompatibility would cause a different error.
C
The training role does not have permissions to access the container repository
Why wrong: That would cause a different error.
D
The container does not include the SageMaker Training Toolkit
Why correct: The toolkit must be installed in the Docker image at build time; it is not injected by SageMaker at runtime.
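
One way to confirm the cause is to check, inside the image, whether the toolkit's Python package can be found. The sketch below assumes the toolkit is installed from the `sagemaker-training` package and imports as `sagemaker_training`, which is how it is distributed at the time of writing; the helper `module_available` is hypothetical.

```python
import importlib.util


def module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Run this inside the training image, for example as a build-time smoke test.
if module_available("sagemaker_training"):
    print("SageMaker Training Toolkit found")
else:
    print("SageMaker Training Toolkit missing: install it in the Dockerfile")
```

Running a check like this during the image build surfaces the problem before a training job is launched.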

These MLS-C01 practice questions are part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style MLS-C01 questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.
