How many hard-difficulty questions are on this page?

This page has 20 hard-difficulty scenario questions for the MLS-C01 exam, each with detailed explanations and wrong-answer analysis.

How should I approach MLS-C01 scenario questions?

Read the full scenario before looking at the answer options. Identify the constraint or requirement in the scenario, then eliminate options that are generally true but wrong for this specific case. Scenario questions reward careful reading over pattern matching.

Scenario-based practice

Hard Difficulty Questions

Practise for the AWS Certified Machine Learning Specialty (MLS-C01) exam with original exam-style scenarios covering every exam domain, each with detailed explanations, wrong-answer analysis, and common exam traps.

Exam code: MLS-C01

Vendor: Amazon Web Services

Scenario guide

How to approach hard difficulty questions

These are the questions most candidates get wrong. They require connecting multiple concepts, reading tricky output, or knowing edge-case behaviour that isn't on most study cards. Practising them trains you to operate under uncertainty, a necessary skill on the real exam.

Quick answer

Hard-difficulty questions test whether you can apply a concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Practice scenarios

Question 1 · hard · multiple choice

A team is building a data pipeline to process terabytes of log data daily using Amazon EMR. The data arrives in 5-minute windows and must be available for querying within 30 minutes. The data is originally in gzip-compressed CSV files. Which approach will minimize processing time and cost?

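A. Use Amazon EMR with Spark to convert data to Parquet and use on-demand instances.
Why wrong: The Parquet conversion helps, but running everything on On-Demand Instances costs more than using Spot Instances for task nodes.
B. Use Amazon EMR with Spark to convert data to Parquet and store in S3, using spot instances for task nodes.
Why correct: Parquet's columnar layout reduces the amount of data each query scans, and Spot Instances on task nodes reduce compute cost.
C. Use AWS Glue to convert data to gzip-compressed CSV and query with Athena.
Why wrong: The data stays in gzip-compressed CSV, a row-based format that is not optimal for Athena performance.
D. Use Amazon EMR with Hive to transform data to compressed CSV and store in S3.
Why wrong: Compressed CSV still incurs full scan costs, because a query cannot read only the columns it needs.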
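
The claim that a columnar format reduces scan size can be illustrated without a cluster. The sketch below is a toy model, not Parquet itself: it stores the same synthetic log records once as gzip-compressed CSV and once as one independently compressed blob per column, then counts how many compressed bytes must be read to sum a single column. The record fields, the function names, and the per-column layout are all assumptions made for illustration; real Parquet adds row groups, encodings, and statistics on top of this idea.

```python
import csv
import gzip
import io

# Hypothetical column names for the synthetic log records below.
COLUMN_NAMES = ("timestamp", "status", "latency_ms", "message")


def make_rows(n):
    """Build n made-up log records: (timestamp, status, latency_ms, message)."""
    return [
        (1_700_000_000 + i, 500 if i % 10 == 0 else 200, i % 250, f"request {i} handled")
        for i in range(n)
    ]


def to_gzip_csv(rows):
    """Row-based layout: all columns interleaved in one compressed file."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return gzip.compress(buf.getvalue().encode("utf-8"))


def to_columnar(rows):
    """Toy columnar layout: one separately compressed blob per column."""
    columns = zip(*rows)
    return {
        name: gzip.compress("\n".join(map(str, col)).encode("utf-8"))
        for name, col in zip(COLUMN_NAMES, columns)
    }


def sum_latency_from_csv(blob):
    """The whole file must be read and decompressed to reach one column."""
    text = gzip.decompress(blob).decode("utf-8")
    total = sum(int(record[2]) for record in csv.reader(io.StringIO(text)))
    return total, len(blob)


def sum_latency_from_columnar(store):
    """Only the latency_ms blob is read; the other columns are never touched."""
    blob = store["latency_ms"]
    values = gzip.decompress(blob).decode("utf-8").split("\n")
    return sum(int(v) for v in values), len(blob)


if __name__ == "__main__":
    rows = make_rows(5000)
    csv_total, csv_bytes = sum_latency_from_csv(to_gzip_csv(rows))
    col_total, col_bytes = sum_latency_from_columnar(to_columnar(rows))
    print("same answer:", csv_total == col_total)
    print("bytes read (gzip CSV):", csv_bytes)
    print("bytes read (columnar):", col_bytes)
```

Both layouts return the same sum, but the row-based file has to be read in full while the columnar store reads one blob. That difference is why options C and D, which keep the data in CSV, do not reduce scan cost.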

Related MLS-C01 topic practice pages

Data Engineering practice questions

Machine Learning Implementation and Operations practice questions

Modeling practice questions

Exploratory Data Analysis practice questions

MLS-C01 fundamentals practice questions

MLS-C01 scenario practice questions

MLS-C01 troubleshooting practice questions

Practice scenarios

A data engineer created an IAM policy to allow a Glue ETL job to read and write objects to an S3 bucket. The ETL job fails when writing data with the error 'Access Denied'. The job is configured to use SSE-S3 (AES256) encryption. What is the likely issue?

Exhibit

A data scientist is performing EDA on a dataset with 1,000 features and 10,000 rows. The target variable is binary. After checking for multicollinearity, the scientist finds many pairs of features with correlation > 0.95. Which action should be taken to prepare the data for modeling?

An e-commerce company uses a linear regression model to predict customer lifetime value (LTV). The model shows high variance on the test set, with training RMSE much lower than test RMSE. Which of the following is the MOST effective approach to reduce overfitting?

A data scientist is training a deep learning model using Amazon SageMaker. The training loss is decreasing, but the validation loss starts increasing after 10 epochs. The model is overfitting. Which TWO actions should the data scientist take to reduce overfitting? (Choose 2.)

Which THREE of the following are best practices when performing exploratory data analysis on a dataset with both numerical and categorical features?

A data scientist is trying to read a CSV file from S3 bucket 'my-bucket' with key 'training/data.csv' using an IAM role with the attached policy shown in the exhibit. The read operation fails with an Access Denied error. What is the most likely cause?

Exhibit

A data scientist runs a SageMaker training job that fails with the above error. The S3 bucket and object exist, and the IAM role has s3:GetObject permission. What is the MOST likely cause?

Exhibit

Which THREE techniques can help reduce overfitting in a neural network trained on a small dataset?

A company is using Amazon SageMaker to train a large language model. The training job is taking too long. The data scientist wants to reduce training time without sacrificing model accuracy. Which THREE strategies are MOST appropriate?
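
Several of the scenarios above describe the same symptom: training loss keeps falling while validation loss starts to rise. One commonly used response, among others such as regularisation or dropout, is early stopping. The sketch below is a minimal, framework-independent illustration, not the answer key for any question on this page. The function name, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Scan per-epoch validation losses and decide where training would stop.

    Returns (stop_epoch, best_epoch), both 1-indexed. stop_epoch is None if
    the validation loss never goes `patience` epochs without improving by
    more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return None, best_epoch


if __name__ == "__main__":
    # Made-up losses: validation loss improves through epoch 10, then rises.
    val_losses = [0.90, 0.80, 0.72, 0.66, 0.61, 0.57, 0.54,
                  0.52, 0.51, 0.50, 0.52, 0.55, 0.59, 0.64]
    stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
    print("stop at epoch:", stop_epoch)
    print("restore weights from epoch:", best_epoch)
```

In practice the model weights from the best epoch are the ones kept, so the extra epochs spent waiting out the patience window do not degrade the final model.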