Scenario-based practice

Hard Difficulty Questions

Practise original exam-style scenarios for AWS Certified Machine Learning Specialty (MLS-C01), covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Scenario questions: 20
Exam code: MLS-C01
Vendor: Amazon Web Services

Scenario guide

How to approach hard difficulty questions

These are the questions most candidates get wrong. They require connecting multiple concepts, reading tricky output, or knowing edge-case behaviour that isn't on most study cards. Practising them trains you to operate under uncertainty — a necessary skill on the real exam.

Quick answer

Hard-difficulty questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice set

Practice scenarios

Question 1 (hard, multiple choice)

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

Question 2 (hard, multiple choice)

A company uses Amazon SageMaker to train and deploy machine learning models. The training data is stored in Amazon S3 (Parquet format, 10 TB). The data scientists have been running training jobs using the File mode input, but the jobs are taking too long due to data download time. They want to reduce the training start-up time and overall training time. Which solution is MOST cost-effective and efficient?

Question 3 (hard, multi select)

A company needs to build a data lake on AWS for analytics. The data includes structured, semi-structured, and unstructured data. The solution must support schema-on-read, provide fine-grained access control, and be cost-effective for storing rarely accessed data. Which THREE services should be used? (Choose THREE)

Question 4 (hard, multiple choice)

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-data-lake/*",
      "Condition": {
        "StringEquals": {
          "s3:x-amz-server-side-encryption": "AES256"
        }
      }
    }
  ]
}
```
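
The `Condition` block in the exhibit only matches when the request context carries the `s3:x-amz-server-side-encryption` key with the value `AES256`. As a rough illustration, the sketch below models how a `StringEquals` test behaves when that key is present, different, or absent. It is a simplified model, not the IAM evaluation engine, and the function name and request dictionaries are hypothetical.

```python
# Simplified model of an IAM StringEquals condition (illustration only).
KEY = "s3:x-amz-server-side-encryption"


def string_equals(context: dict, key: str, expected: str) -> bool:
    """Return True only if the key is present in the context and equal to expected.

    A key that is missing from the request context never matches.
    """
    return context.get(key) == expected


# Hypothetical request contexts for three PutObject calls.
put_with_header = {KEY: "AES256"}
put_with_kms = {KEY: "aws:kms"}
put_without_header = {}

for name, ctx in [
    ("header = AES256", put_with_header),
    ("header = aws:kms", put_with_kms),
    ("no header", put_without_header),
]:
    print(f"{name}: condition matches = {string_equals(ctx, KEY, 'AES256')}")
```

When the condition on an `Allow` statement does not match, that statement grants nothing, so the request falls back to an implicit deny unless another statement allows it.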

Question 5 (hard, multiple choice)

A company runs a real-time fraud detection system using Amazon Kinesis Data Streams with 100 shards. Data is consumed by a custom Java application running on Amazon EC2 instances in an Auto Scaling group. The application processes records and writes results to a DynamoDB table. Over the past month, the application has experienced intermittent slowdowns and the DynamoDB write capacity has been fully utilized during peak hours. The team wants to improve throughput without losing the ability to reprocess failed records. The application currently uses the Kinesis Client Library (KCL) with DynamoDB as the lease table. The team is considering the following changes:

A. Increase the number of EC2 instances to match the number of shards.
B. Switch to using AWS Lambda as the consumer to handle scaling automatically.
C. Increase the write capacity of the DynamoDB lease table to handle more workers.
D. Use enhanced fan-out to have each consumer receive its own 2 MB/second shard throughput.

Which change should the team implement first to address the issue?
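
To put numbers on the stream side of this scenario, the sketch below works through the per-shard limits for the 100 shards mentioned in the question: 1 MB/s of ingest per shard, and 2 MB/s of read throughput per shard that is shared between consumer applications unless enhanced fan-out is used. The helper name and the consumer counts passed to it are hypothetical, chosen only for illustration.

```python
SHARDS = 100                 # from the scenario
WRITE_MB_PER_SHARD = 1       # ingest limit per shard, MB/s
READ_MB_PER_SHARD = 2        # read limit per shard, MB/s


def per_consumer_read_mb(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput (MB/s) one consumer application can get from one shard."""
    if enhanced_fan_out:
        # Each registered consumer gets its own dedicated pipe per shard.
        return float(READ_MB_PER_SHARD)
    # Without enhanced fan-out the per-shard read limit is shared.
    return READ_MB_PER_SHARD / consumers


print("aggregate ingest limit:", SHARDS * WRITE_MB_PER_SHARD, "MB/s")
for n in (1, 4):
    shared = per_consumer_read_mb(n, enhanced_fan_out=False)
    dedicated = per_consumer_read_mb(n, enhanced_fan_out=True)
    print(f"{n} consumer app(s): shared={shared} MB/s, enhanced fan-out={dedicated} MB/s per shard")
```

The scenario describes a single consumer application, so it is worth checking which of these limits, if any, it is actually hitting before comparing the options.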

Question 6 (hard, multiple choice)

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

Question 7 (hard, multiple choice)

An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?
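
The symptom in this question is a gap between training RMSE and test RMSE. As a small worked example, the sketch below computes RMSE for two sets of made-up LTV values. All numbers are hypothetical and only illustrate what "training RMSE much lower than test RMSE" looks like.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical LTV values: the model fits the training rows almost exactly...
train_rmse = rmse([100.0, 200.0, 300.0], [101.0, 199.0, 300.0])
# ...but misses badly on unseen rows.
test_rmse = rmse([150.0, 250.0], [120.0, 290.0])

print(f"train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")
```

A large ratio between the two values is the signal of high variance that the answer options need to address.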

Question 8 (hard, multi select)

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)
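
The pattern described, validation loss turning upward while training loss keeps falling, can be located programmatically. The sketch below finds the epoch with the lowest validation loss in a made-up loss history; the values and the function name are hypothetical and only mirror the "after 10 epochs" detail in the question.

```python
def best_epoch(val_losses: list[float]) -> int:
    """Return the 1-based epoch at which validation loss was lowest."""
    return val_losses.index(min(val_losses)) + 1


# Hypothetical validation losses: improving through epoch 10, then rising.
val_history = [0.90, 0.80, 0.72, 0.66, 0.61, 0.57, 0.54, 0.52, 0.51, 0.50,
               0.52, 0.55, 0.59]

print("lowest validation loss at epoch", best_epoch(val_history))
```

Training past that epoch improves the fit to the training data only, which is the overfitting the question refers to.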

Question 9 (hard, multiple choice)

A data scientist is exploring a dataset with 500 features and 10,000 samples. The data scientist computes the pairwise correlation matrix and finds that many features have correlations above 0.9. The data scientist wants to reduce the dataset to 50 features while preserving as much variance as possible. Which technique should be used?

Question 10 (hard, multi select)

A data scientist is analyzing a dataset of customer reviews. The dataset contains a text column 'review' and a numerical rating from 1 to 5. The data scientist wants to create features for sentiment analysis. Which THREE preprocessing steps should be applied to the text data before feature extraction? (Choose THREE.)

Question 11 (hard, multi select)

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

Question 12 (hard, multiple choice)

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/training/"
    }
  ]
}
```
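
When reading a policy like the one in the exhibit, it helps to check which `Resource` patterns actually cover the object being requested. The sketch below approximates IAM wildcard matching with `fnmatch.fnmatchcase`, which is an assumption for illustration only: it handles `*` and `?` the same way, but also supports `[seq]` patterns that IAM does not. The helper name is hypothetical.

```python
from fnmatch import fnmatchcase


def arn_matches(resource_pattern: str, arn: str) -> bool:
    """Approximate check of an ARN against a policy Resource pattern."""
    return fnmatchcase(arn, resource_pattern)


# ARN of the object from the question: bucket 'my-bucket', key 'training/data.csv'.
OBJECT_ARN = "arn:aws:s3:::my-bucket/training/data.csv"

# Resource values from the two statements in the exhibit.
PATTERNS = [
    "arn:aws:s3:::my-bucket/training/*",
    "arn:aws:s3:::my-bucket/training/",
]

for pattern in PATTERNS:
    print(pattern, "->", arn_matches(pattern, OBJECT_ARN))
```

The first pattern covers the object and the second does not, so the identity policy on its own is not the whole story when working through the answer options.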

Question 13 (hard, multiple choice)

A data scientist is building a fraud detection model using a dataset of 500,000 credit card transactions. The dataset contains 20 features, including transaction amount, merchant category, time since last transaction, and customer age. The target variable 'is_fraud' has 0.1% positive examples. Initial EDA reveals that the transaction amount distribution is highly skewed with a long tail. Also, there are missing values in the 'customer_age' field (5% missing). The data scientist needs to prepare the data for training a binary classifier. Which combination of preprocessing steps should the data scientist apply to address these issues and improve model performance? (Select TWO.)
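
The percentages in this scenario are easier to reason about as counts. The sketch below converts them using the figures from the question, then shows how a log transform compresses a long-tailed amount column. The sample amounts are hypothetical, and `math.log1p` is just one common transform for right-skewed, non-negative values.

```python
import math

TOTAL = 500_000                      # transactions, from the scenario

fraud_rows = TOTAL // 1000           # 0.1% positive class
missing_age_rows = TOTAL * 5 // 100  # 5% missing customer_age

print("fraud rows:", fraud_rows)
print("rows with missing customer_age:", missing_age_rows)

# Hypothetical transaction amounts with one extreme value in the tail.
amounts = [5.0, 20.0, 80.0, 12_000.0]
logged = [math.log1p(a) for a in amounts]

print("max/min before:", max(amounts) / min(amounts))
print("max/min after log1p:", round(max(logged) / min(logged), 2))
```

With only 500 positive rows out of 500,000, any preprocessing choice has to be judged by its effect on the minority class rather than on overall accuracy.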

Question 14 (hard, multi select)

A company is using Amazon SageMaker to tune hyperparameters for a gradient boosting model. The objective is to minimize root mean squared error (RMSE). The data scientist wants to explore the hyperparameter space efficiently. Which THREE hyperparameter tuning strategies should the data scientist consider? (Choose 3.)

Question 15 (hard, multiple choice)

A machine learning engineer is training a neural network on Amazon SageMaker using a custom Docker container. The training job fails with an error: 'CUDA out of memory.' The training instance is an ml.p3.2xlarge with 16 GB GPU memory. The model and data fit into memory when using batch size 32, but the engineer wants to maximize GPU utilization. Which approach should the engineer use to fix the out-of-memory error while maintaining efficient training?

Question 16 (hard, multiple choice)

A data scientist runs a SageMaker training job that fails with the error shown in the exhibit. The S3 bucket and object exist, and the IAM role has s3:GetObject permission. What is the MOST likely cause?

Exhibit

Refer to the exhibit.

```
Training job status: Failed
Error: ClientError: Data download failed.
The downloaded file size (0 bytes) does not match expected size (1024 bytes).
Check that the S3 object exists and is readable.
```

Question 17 (hard, multiple choice)

A data scientist is building a multi-class classification model with 10 classes. The dataset has 100,000 samples. After training a random forest with 100 trees, the model achieves 85% accuracy on the test set. However, the data scientist notices that for one rare class (1% of data), recall is only 5%. Which technique is MOST likely to improve recall for the rare class without significantly reducing overall accuracy?
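
This scenario turns on the difference between overall accuracy and per-class recall. The sketch below works through the arithmetic for a test set whose size is assumed (the question does not state it); only the 1% class share and the 5% recall come from the question.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual positives that the model identified."""
    return true_positives / (true_positives + false_negatives)


TEST_SIZE = 20_000                    # hypothetical test-set size
rare_total = TEST_SIZE // 100         # rare class is 1% of the data
rare_found = rare_total * 5 // 100    # 5% recall on the rare class
rare_missed = rare_total - rare_found

print("rare-class samples:", rare_total)
print("rare-class recall:", recall(rare_found, rare_missed))

# Even perfect recall on the rare class moves overall accuracy by under 1 point.
max_accuracy_gain = rare_missed / TEST_SIZE
print(f"largest possible accuracy gain from the rare class: {max_accuracy_gain:.2%}")
```

Because the rare class contributes so little to accuracy, a model can score 85% overall while missing nearly all of it, which is why the question asks about recall specifically.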

Question 18 (hard, multi select)

Which THREE techniques can help reduce overfitting in a neural network trained on a small dataset?

Question 19 (hard, multiple choice)

A company is deploying a real-time fraud detection system using a gradient boosting model on Amazon SageMaker. The model uses 200 features and is trained on 50 GB of data. The inference latency requirement is under 10 ms per request. During load testing, the endpoint shows an average latency of 15 ms. Which change is MOST likely to reduce latency below 10 ms?

Question 20 (hard, multi select)

A company is using Amazon SageMaker to train a large language model. The training job is taking too long. The data scientist wants to reduce training time without sacrificing model accuracy. Which THREE strategies are MOST appropriate?

These MLS-C01 practice questions are part of Courseiva's free Amazon Web Services certification practice question bank. Courseiva provides original exam-style MLS-C01 questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.