MLA-C01 · topic practice

Data Preparation for Machine Learning practice questions

Practise Data Preparation for Machine Learning questions for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam: original exam-style scenarios with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations, not to memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Data Preparation for Machine Learning

What the exam tests

What to know about Data Preparation for Machine Learning

Data Preparation for Machine Learning questions test whether you can apply the concept in context, not just recognise a definition.

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Data Preparation for Machine Learning exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Data Preparation for Machine Learning questions

20 questions · select your answer, then reveal the explanation

A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?

A company is using AWS Glue to prepare data for a machine learning pipeline. The source data is in an Amazon S3 bucket in CSV format. The data scientist wants to convert the data to Parquet format and partition it by date. Which AWS Glue feature should be used to optimize the data for query performance and reduce storage costs?

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?

A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?

A machine learning team is building a model using a dataset that contains a mix of numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). The team wants to use Amazon SageMaker for training. Which technique should the team use to encode the high-cardinality categorical features effectively?

A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column with date strings in the format 'YYYY-MM-DD'. The data scientist wants to extract the year, month, and day as separate features. Which Data Wrangler transform should be used?
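
To make the goal of this scenario concrete, here is a minimal plain-Python sketch of splitting a 'YYYY-MM-DD' string into year, month, and day features. It does not use Data Wrangler itself and is not the answer key; the function name `split_date` and the sample values are hypothetical.

```python
from datetime import datetime

def split_date(date_string):
    """Parse a 'YYYY-MM-DD' string into separate integer features."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return {"year": parsed.year, "month": parsed.month, "day": parsed.day}

# Hypothetical sample values in the format named in the question.
rows = ["2024-03-15", "2023-12-01"]
features = [split_date(value) for value in rows]
print(features[0])
```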

A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)
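
As background for this scenario, the sketch below shows one widely used rule for flagging and capping outliers: the interquartile range (IQR) fence. It is plain Python rather than an AWS Glue job, and it is only one of several possible approaches; the helper names, the multiplier `k=1.5`, and the sample values are assumptions for illustration.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, k=1.5):
    """Cap values that fall outside the IQR fences instead of dropping rows."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]

# Hypothetical column with one extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 500]
clipped = clip_outliers(values)
```

Capping keeps every row, which matters when other columns in the same record are still useful; dropping the rows is the alternative when the extreme values are known to be errors.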

A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)
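
The scale mismatch described in this scenario (age versus income) can be illustrated with z-score standardization. This is a minimal plain-Python sketch of that single step, not a complete answer to the question; the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
ages = [25, 40, 58, 33]
incomes = [30_000, 85_000, 250_000, 52_000]
scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and then reused for validation and test data, so that no information leaks from held-out rows.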

A data scientist is using Amazon SageMaker Data Wrangler to create a data flow for a machine learning project. The source data is in Amazon S3 and contains PII (personally identifiable information) such as email addresses and credit card numbers. The data scientist needs to prepare the data for training while ensuring compliance with data privacy regulations. Which THREE actions should the data scientist take? (Select THREE.)

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?
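
One technique that keeps every original row is random oversampling of the minority class. The sketch below is a plain-Python illustration under that assumption; it is not presented as the expected answer, and the function name, the `label` key, and the toy rows are hypothetical.

```python
import random

def oversample_minority(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra rows with replacement until this class reaches the target count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy dataset with the 95% / 5% split from the scenario.
rows = [{"x": i, "label": 0} for i in range(95)] + [{"x": i, "label": 1} for i in range(5)]
balanced = oversample_minority(rows, "label")
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split lets copies of the same record appear on both sides.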

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?
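
To show why high cardinality is a problem for one-hot encoding, the sketch below uses the hashing trick, which maps any number of distinct values into a fixed number of buckets. It is one of several options (target encoding and learned embeddings are others) and is not necessarily the answer the question expects; `hash_bucket`, the bucket count of 32, and the sample zip codes are assumptions.

```python
import hashlib

def hash_bucket(value, num_buckets=32):
    """Map a categorical value to a stable bucket index in [0, num_buckets)."""
    # hashlib is used instead of the built-in hash() so results are stable across runs.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

# Hypothetical zip codes; thousands of distinct values would still use 32 buckets.
zip_codes = ["10001", "94105", "60614", "10001"]
encoded = [hash_bucket(z) for z in zip_codes]
```

The trade-off is collisions: two different zip codes can share a bucket, so the bucket count is a tuning choice between feature width and information loss.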

A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?

A data engineer is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column 'review_date' with timestamps. The engineer wants to extract the day of the week as a new feature. How should this transformation be performed in Data Wrangler?

A company uses AWS Glue ETL jobs to transform data for machine learning. They have a dataset with a column 'income' that is heavily right-skewed. Which transformation should be applied to make the distribution more Gaussian-like?
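
A logarithmic transform is a common way to compress the long right tail described in this scenario. The sketch below is plain Python rather than a Glue ETL script, and the function name and sample incomes are hypothetical.

```python
import math

def log_transform(values):
    """Apply log(1 + x), which is defined at zero and compresses large values."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed incomes: most values are modest, one is very large.
incomes = [0, 20_000, 45_000, 60_000, 1_200_000]
transformed = log_transform(incomes)
```

The transform preserves the ordering of the values while shrinking the gap between typical and extreme incomes; it requires non-negative input.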

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?

A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation? (Select TWO.)

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance? (Select THREE.)

A retail company is preparing a dataset for a machine learning model to predict customer churn. The dataset includes customer_id, signup_date, last_purchase_date, total_purchases, average_order_value, and churn_label. The data scientist notices that the 'total_purchases' column has missing values for 15% of the records. The company wants to use AWS Glue for data preparation. Which approach should the data scientist take to handle the missing values while minimizing bias and preserving data integrity?

Focused Data Preparation for Machine Learning sessions

Start a Data Preparation for Machine Learning only practice session

Every question in these sessions is drawn from the Data Preparation for Machine Learning domain — nothing else.

Related practice questions

Related MLA-C01 topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the MLA-C01 exam test about Data Preparation for Machine Learning?
Data Preparation for Machine Learning questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Data Preparation for Machine Learning questions in a focused session?
Yes — the session launcher on this page draws every question from the Data Preparation for Machine Learning domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other MLA-C01 topics?
Use the topic links above to move to related areas, or go back to the MLA-C01 question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the MLA-C01 exam covers. They are not copied from any real exam or dump site.