How should I use these Data Preparation for Machine Learning practice questions?

Read each scenario carefully and choose your answer before revealing the explanation. Then check why your choice was right or wrong. Repeat until the reasoning feels automatic.

MLA-C01 · topic practice

Data Preparation for Machine Learning practice questions

Q: Can I practise just Data Preparation for Machine Learning questions in a focused session?

Yes — use the session launcher on this page to start a 10-, 20-, 30- or 50-question session drawn entirely from the Data Preparation for Machine Learning domain.

Practise AWS Certified Machine Learning Engineer Associate (MLA-C01) questions on Data Preparation for Machine Learning: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations — not memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security

20 questions · Domain: Data Preparation for Machine Learning

What the exam tests

What to know about Data Preparation for Machine Learning

Data Preparation for Machine Learning questions test whether you can apply the concept in context, not just recognise a definition.

▸ How the topic appears in realistic exam-style scenarios.
▸ Which detail in the question changes the correct answer.
▸ How to eliminate plausible but wrong options.
▸ How to connect the question back to the wider exam objective.

Watch out for

Common Data Preparation for Machine Learning exam traps

▸ Answering from memory before reading the full scenario.
▸ Missing a constraint such as cost, availability, security, scope or command context.
▸ Choosing a broad answer when the question asks for the most specific fix.
▸ Ignoring why the wrong options are tempting.

Practice set

Data Preparation for Machine Learning questions

20 questions · select your answer, then reveal the explanation

Question 1 · easy · multiple choice

A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?

A. Use AWS Glue ETL to write a custom Python script that imputes missing values with the mean.
Why wrong: A custom script takes development and maintenance effort, so it is unlikely to be the most efficient route for a large dataset.
B. Use Amazon SageMaker Data Wrangler to impute missing values using built-in transforms.
Why correct: Data Wrangler provides built-in, visual transforms for scalable data preparation without custom code.
C. Use pandas in a SageMaker notebook to impute missing values with the median.
Why wrong: pandas holds the whole dataset in memory on a single instance, so it may run out of memory on a large dataset.
D. Remove all rows with missing values from the dataset.
Why wrong: Removing rows loses data and can bias the model, particularly when the values are not missing at random.
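
To make the term concrete, here is a minimal sketch of what "imputing missing values with the mean or median" does to a table. It uses pandas on a tiny in-memory sample purely for illustration, so it reflects the small-scale approach from option C rather than Data Wrangler's built-in transforms. The function name `impute_numeric`, the column names, and the sample values are all hypothetical.

```python
import pandas as pd


def impute_numeric(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Return a copy of df with NaNs in numeric columns filled column by column.

    Hypothetical helper for illustration only; not a Data Wrangler or Glue API.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    out = df.copy()
    # Only numeric columns have a meaningful mean or median.
    for col in out.select_dtypes(include="number").columns:
        fill = out[col].mean() if strategy == "mean" else out[col].median()
        out[col] = out[col].fillna(fill)
    return out


# Made-up sample: one missing age, one missing income, one missing city.
sample = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "income": [50_000.0, 62_000.0, None, 1_000_000.0],
        "city": ["Leeds", None, "York", "Bath"],
    }
)

imputed_median = impute_numeric(sample, "median")
imputed_mean = impute_numeric(sample, "mean")
```

In this sample the single very large income pulls the mean far above the typical value, while the median stays at 62,000, which is one reason the choice of statistic matters. The non-numeric `city` column is left untouched by this sketch.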

Data Preparation for Machine Learning questions

A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?

A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?

A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column with date strings in the format 'YYYY-MM-DD'. The data scientist wants to extract the year, month, and day as separate features. Which Data Wrangler transform should be used?

A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?

A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?

A data engineer is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column 'review_date' with timestamps. The engineer wants to extract the day of the week as a new feature. How should this transformation be performed in Data Wrangler?

A company uses AWS Glue ETL jobs to transform data for machine learning. They have a dataset with a column 'income' that is heavily right-skewed. Which transformation should be applied to make the distribution more Gaussian-like?

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?

A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?
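
Several of the scenarios above name concrete transforms: splitting a date into parts, taking the day of the week, reshaping a right-skewed column, and flagging outliers. The sketch below shows what those operations look like on a small table, assuming pandas and NumPy as stand-ins. It is not the Data Wrangler or AWS Glue interface, it is not offered as the answer key for any question, and the helper names, column names, and data are hypothetical.

```python
import numpy as np
import pandas as pd


def add_date_parts(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy of df with year, month, day and day-of-week columns
    derived from a 'YYYY-MM-DD' string column (hypothetical helper)."""
    out = df.copy()
    dates = pd.to_datetime(out[col], format="%Y-%m-%d")
    out[f"{col}_year"] = dates.dt.year
    out[f"{col}_month"] = dates.dt.month
    out[f"{col}_day"] = dates.dt.day
    out[f"{col}_dayofweek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    return out


def log_transform(series: pd.Series) -> pd.Series:
    """Compress a right-skewed, non-negative column with log(1 + x)."""
    return np.log1p(series)


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up data for illustration.
reviews = pd.DataFrame({"review_date": ["2024-01-15", "2024-02-29", "2023-12-31"]})
with_parts = add_date_parts(reviews, "review_date")

income_k = pd.Series([30.0, 32.0, 35.0, 38.0, 40.0, 41.0, 45.0, 500.0])
log_income = log_transform(income_k)
outliers = iqr_outlier_mask(income_k)
```

The 1.5 multiplier in `iqr_outlier_mask` is a common convention rather than a rule, and `log1p` is used instead of a plain logarithm so that zero values do not produce negative infinity.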
