20+ practice questions focused on Data Preparation for Machine Learning — one of the most tested topics on the AWS Certified Machine Learning Engineer Associate MLA-C01 exam. Each question includes a detailed explanation so you learn why the right answer is correct.
Start Data Preparation for Machine Learning Practice
A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?
Explanation: Amazon SageMaker Data Wrangler provides a visual interface and built-in transforms for handling missing values efficiently at scale, without writing custom code. AWS Glue ETL requires more code, and imputation with pandas on a single machine does not scale to large datasets. Dropping every row that has a missing value discards usable data and can bias the training set, so it is rarely the best choice.
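To make the idea of imputation concrete, here is a minimal plain-Python sketch of median imputation. It is only an illustration of what such a transform does, not how Data Wrangler implements it; the `impute_median` helper and the `age` and `income` columns are hypothetical.

```python
from statistics import median


def impute_median(rows, column):
    """Return a copy of rows with None in `column` replaced by the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    if not observed:
        # No observed values, so there is nothing to derive a fill value from.
        return [dict(r) for r in rows]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical sample data with one missing value per column.
rows = [
    {"age": 30, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 50, "income": None},
    {"age": 40, "income": 48000},
]

cleaned = impute_median(impute_median(rows, "age"), "income")
```

The original rows are left unchanged, which makes it easy to compare the dataset before and after the fill.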
A company is using AWS Glue to prepare data for a machine learning pipeline. The source data is in an Amazon S3 bucket in CSV format. The data scientist wants to convert the data to Parquet format and partition it by date. Which AWS Glue feature should be used to optimize the data for query performance and reduce storage costs?
Explanation: Option B is correct because AWS Glue DynamicFrames can write directly to columnar formats such as Parquet. Parquet improves query performance through predicate pushdown and column pruning, and it reduces storage costs through compression and efficient encoding. Setting `partitionKeys` in the S3 write options partitions the output by date, and the DynamicFrame's `repartition()` method lets you control the number of output files. Writing Parquet directly from Glue also avoids intermediate conversions, making it the most efficient choice for this task.
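As a rough sketch of the write step, the helper below wraps the `write_dynamic_frame.from_options` call on a Glue context and sets `partitionKeys` so the output is laid out by date. The helper name `write_partitioned_parquet`, the bucket path, and the `year`/`month`/`day` column names are assumptions for illustration; the function expects a real `GlueContext` and `DynamicFrame` to be passed in from a Glue job.

```python
def write_partitioned_parquet(glue_context, frame, s3_path, partition_keys):
    """Write a DynamicFrame to S3 as Parquet, partitioned by the given columns."""
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={
            "path": s3_path,
            # Each distinct combination of these columns gets its own S3 prefix.
            "partitionKeys": list(partition_keys),
        },
        format="parquet",
    )
```

Inside a Glue job this might be called as `write_partitioned_parquet(glueContext, dyf, "s3://example-bucket/curated/", ["year", "month", "day"])`, assuming the frame already has those date columns.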
A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?
Explanation: Option B is correct because SMOTE generates synthetic samples for the minority class by interpolating between existing minority instances. This directly addresses the severe class imbalance (95% class A, 5% class B) by creating a more balanced training dataset. The technique is particularly effective for tabular data in Amazon SageMaker: it increases the representation of the minority class without simply duplicating existing samples, which reduces overfitting compared with plain duplication and improves the model's ability to learn decision boundaries for the minority class. SMOTE should be applied only to the training dataset, so that validation and test data keep the real class distribution.
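The interpolation step can be sketched in a few lines of standard-library Python. This is a simplified illustration of the SMOTE idea, not a production implementation (in practice a library such as imbalanced-learn is commonly used); the `smote` function, its parameters, and the sample points are hypothetical, and the sketch assumes at least two minority samples with numeric features.

```python
import math
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class feature vectors (two numeric features each).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_synthetic=6)
```

Every synthetic point lies on a line segment between two real minority samples, which is why the new data stays inside the region the minority class already occupies.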
A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?
Explanation: Option C is correct because 'CustomerID' is a unique identifier with no predictive power for churn. Including it as a feature would cause the model to memorize individual customers rather than learn generalizable patterns, leading to overfitting and poor performance on unseen data. In machine learning, such columns should be removed during data preparation to ensure the model learns from meaningful features.
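Dropping the identifier before training is a one-line transform. The sketch below uses plain Python dictionaries; the `drop_columns` helper and every column except `CustomerID` are hypothetical names chosen for illustration.

```python
def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    excluded = set(columns)
    return [{k: v for k, v in row.items() if k not in excluded} for row in rows]


# Hypothetical churn records; only 'CustomerID' comes from the question.
customers = [
    {"CustomerID": "C-001", "tenure_months": 12, "monthly_charges": 70.5, "churned": 0},
    {"CustomerID": "C-002", "tenure_months": 2, "monthly_charges": 95.0, "churned": 1},
]

training_rows = drop_columns(customers, ["CustomerID"])
```

Keeping the original records intact means the identifier is still available later for joining predictions back to customers.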
A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?
Explanation: Increasing the number of DPUs (Data Processing Units) in AWS Glue can improve parallelism and reduce job runtime, but it increases cost. Using the G.1X worker type with more memory per worker can improve performance without increasing the DPU count, offering better resource utilization. Switching from Parquet to CSV would likely degrade performance, because CSV is a row-based format without columnar compression. Using partition pruning on the source data can reduce the amount of data scanned but may not address resource consumption.
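For reference, the partition pruning mentioned above is typically expressed in Glue through the `push_down_predicate` argument when reading from the Data Catalog. The sketch below assumes a catalog table partitioned by string-valued `year` and `month` columns; the helper names `date_predicate` and `read_partitions` are hypothetical, and a real `GlueContext` must be supplied by the job.

```python
def date_predicate(year, month):
    """Build a predicate that selects a single year/month partition."""
    # Partition values are assumed to be stored as zero-padded strings.
    return f"year == '{year:04d}' and month == '{month:02d}'"


def read_partitions(glue_context, database, table_name, predicate):
    """Load only the catalog partitions that match the predicate."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=predicate,
    )
```

Because the predicate is evaluated against partition metadata, partitions that do not match are never listed or read from S3.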
+15 more Data Preparation for Machine Learning questions available
Practice all Data Preparation for Machine Learning questions
1. Baseline your knowledge
Start with 10 questions to gauge your current understanding of Data Preparation for Machine Learning. This tells you whether you need a concept refresher or just practice.
2. Review every explanation
For each question — right or wrong — read the full explanation. Understanding why an answer is correct is more valuable than knowing the answer itself.
3. Focus on exam traps
Data Preparation for Machine Learning questions on the MLA-C01 frequently use trap wording. Look for subtle differences in answers that test your precision, not just general knowledge.
4. Reach 80% consistently
Do repeated sessions until you score 80%+ three times in a row. Then move to mixed-mode practice to test cross-topic recall under realistic conditions.
The exact number varies per candidate. Data Preparation for Machine Learning is tested as part of the AWS Certified Machine Learning Engineer Associate MLA-C01 blueprint. Practicing with targeted Data Preparation for Machine Learning questions ensures you can handle any format or difficulty that appears.
Yes. Courseiva provides free MLA-C01 practice questions across all exam topics and domains. The platform includes topic-based practice, mock exams, missed-question review, bookmarked questions, and readiness tracking — no account required.
Difficulty is subjective, but Data Preparation for Machine Learning is a high-priority exam concept tested in multiple ways — direct recall, scenario analysis, and command-output interpretation. Consistent practice is the best way to build confidence.
Launch a full Data Preparation for Machine Learning practice session with instant scoring and detailed explanations.
Start Data Preparation for Machine Learning Practice →