Practice MLA-C01 Data Preparation for Machine Learning questions with full explanations on every answer.
1. A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?
2. A company is using AWS Glue to prepare data for a machine learning pipeline. The source data is in an Amazon S3 bucket in CSV format. The data scientist wants to convert the data to Parquet format and partition it by date. Which AWS Glue feature should be used to optimize the data for query performance and reduce storage costs?
3. A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?
4. A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?
5. A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?
6. A machine learning team is building a model using a dataset that contains a mix of numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). The team wants to use Amazon SageMaker for training. Which technique should the team use to encode the high-cardinality categorical features effectively?
7. A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column with date strings in the format 'YYYY-MM-DD'. The data scientist wants to extract the year, month, and day as separate features. Which Data Wrangler transform should be used?
8. A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)
9. A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)
10. A data scientist is using Amazon SageMaker Data Wrangler to create a data flow for a machine learning project. The source data is in Amazon S3 and contains PII (personally identifiable information) such as email addresses and credit card numbers. The data scientist needs to prepare the data for training while ensuring compliance with data privacy regulations. Which THREE actions should the data scientist take? (Select THREE.)
11. A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?
12. A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?
13. A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?
14. A data engineer is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column 'review_date' with timestamps. The engineer wants to extract the day of the week as a new feature. How should this transformation be performed in Data Wrangler?
15. A company uses AWS Glue ETL jobs to transform data for machine learning. They have a dataset with a column 'income' that is heavily right-skewed. Which transformation should be applied to make the distribution more Gaussian-like?
16. A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?
17. A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?
18. A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?
19. A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?
20. A retail company is preparing a dataset for a machine learning model to predict customer churn. The dataset includes customer_id, signup_date, last_purchase_date, total_purchases, average_order_value, and churn_label. The data scientist notices that the 'total_purchases' column has missing values for 15% of the records. The company wants to use AWS Glue for data preparation. Which approach should the data scientist take to handle the missing values while minimizing bias and preserving data integrity?
21. A financial services company is building a fraud detection model using transactional data stored in Amazon S3. The data includes transaction_id, timestamp, amount, merchant_category, and fraud_label (0/1). The data is collected from multiple sources and has inconsistencies: timestamps are in different timezones (UTC and EST), merchant categories are sometimes misspelled (e.g., 'RESTAURANT', 'Restaurant', 'restaurant'), and the fraud_label is missing for about 5% of records. The data science team uses AWS Glue for ETL. They need to prepare a clean dataset for training. The final dataset must have consistent timestamps in UTC, standardized merchant categories, and no missing fraud labels. The team also wants to minimize data loss. Which set of actions should the team take?
22. A healthcare startup is building a model to predict patient readmission within 30 days. The data is stored in Amazon Redshift and includes patient demographics, admission history, lab results, and medication records. The data scientist extracts a sample of 10,000 records to Amazon S3 as CSV files for initial prototyping. During exploratory data analysis, they find that the 'age' column has values like '150', '0', and negative numbers. The 'diagnosis_code' column contains codes like 'E11', 'E11.9', and 'e11' (inconsistent formatting). The 'readmitted' target column has 60% 'Yes' and 40% 'No'. The data scientist wants to use AWS Glue DataBrew for data cleaning. Which combination of steps should they use?
23. An e-commerce company is building a recommendation system using user interaction data stored in Amazon DynamoDB. The data includes user_id, product_id, timestamp, event_type (click, add_to_cart, purchase), and session_id. The data science team exports the data to Amazon S3 as JSON files. During preprocessing, they discover that the 'event_type' field contains inconsistent values due to logging errors: 'Click', 'click', 'CLICK', and 'clck' all appear. Also, there are duplicate records where the same user_id, product_id, and timestamp appear multiple times with the same event_type. The team wants to use AWS Glue to clean the data for training a sequence-based recommendation model. Which set of actions should they perform?
24. A data scientist is preparing a dataset for training a binary classification model. The dataset has 100,000 rows and 50 features. The target variable is imbalanced, with only 5% positive cases. Which technique should the data scientist apply to address the class imbalance BEFORE training?
25. A machine learning engineer is building a pipeline to preprocess text data for a sentiment analysis model. The data consists of customer reviews. The engineer wants to convert the text into numerical features while preserving the semantic meaning of words. Which technique should be used?
26. A company uses Amazon SageMaker Data Wrangler to prepare data for ML. The dataset contains a timestamp column and sensor readings from IoT devices. The data scientist needs to create features such as moving averages and rolling statistics over time windows. Which Data Wrangler transformation type should be selected?
27. A data engineer is preparing a large dataset of 10 TB for ML training on Amazon SageMaker. The data is stored in Amazon S3 as CSV files. To reduce training time and cost, the engineer wants to use a columnar format that is optimized for analytical queries. Which format should the engineer convert the data to?
28. A data scientist is using Amazon SageMaker Processing to run a feature engineering job. The job requires installing additional Python libraries not included in the default SageMaker containers. Which approach should the data scientist use to include these libraries?
29. A machine learning team is building a model to predict customer churn. They have historical data that includes customer activity logs, each with a timestamp. The team wants to ensure that the training data does not contain any data leakage from the future. Which approach should they take when preparing the training and validation datasets?
30. A data scientist is working with a dataset that contains missing values in several numeric features. The data scientist wants to impute the missing values with the median of each feature. Which Amazon SageMaker Data Wrangler transformation should be used?
31. A data engineer needs to join two large datasets from Amazon S3: one containing customer demographics and another containing transaction history. The join key is `customer_id`. To minimize data shuffling and improve performance, the engineer decides to use Amazon SageMaker Processing with Spark. Which configuration should the engineer use?
32. A data scientist is preparing a dataset for a regression model that predicts house prices. The dataset includes a `neighborhood` feature with 500 distinct categories. The data scientist wants to encode this feature without increasing dimensionality too much and while capturing the target relationship. Which encoding technique should be used?
33. A data scientist is performing feature engineering for a dataset with both numerical and categorical features. The data scientist wants to apply transformations that preserve the interpretability of the features. Which TWO transformations should the data scientist use? (Select TWO)
34. A company is building a real-time inference pipeline for an ML model. The raw data arrives in JSON format via Amazon Kinesis Data Streams. Before invoking the SageMaker endpoint, the data must be preprocessed to match the training data format. Which THREE steps should be included in the preprocessing function? (Select THREE)
35. A data engineer is using AWS Glue to prepare a dataset for ML. The engineer wants to split the dataset into training and testing sets while preserving the distribution of the target variable. Which TWO methods achieve this goal? (Select TWO)
36. A data scientist runs the AWS Glue ETL job shown in the exhibit. The job fails with a Spark stage failure error. What is the most likely cause?
37. A SageMaker Processing job fails with 'Access Denied' when listing objects in an S3 bucket, despite the IAM policy shown in the exhibit. What is the most likely cause?
38. A data scientist creates a feature group as shown in the exhibit. When ingesting data with an 'age' column of integer values, the ingestion fails. What is the most likely cause?
39. A data scientist is preparing a dataset for a linear regression model. The dataset has a few missing values in a numerical feature with a normal distribution and no outliers. Which imputation method is most appropriate?
40. A SageMaker Processing job fails with the error: 'Unable to parse CSV file due to inconsistent number of columns'. The data is stored as CSV in S3. What is the most likely cause?
41. A company is preparing a dataset with a categorical feature that has over 1000 unique values. They need to create features for a random forest model. Which feature engineering approach is most scalable and effective in AWS for high-cardinality categories?
42. An organization stores raw data in Amazon S3 as CSV files. They need to perform serverless data transformation and convert the data to Parquet format for efficient ML training. Which AWS service is most appropriate?
43. A data scientist is using SageMaker Data Wrangler to prepare a large dataset. The data contains duplicate rows, which could bias the model. Which built-in step in Data Wrangler can automatically detect and remove duplicates?
44. A dataset contains a numerical feature with extreme outliers. The outliers are genuine (not errors), and the ML model is a linear regression which is sensitive to outliers. Which data transformation should be applied to reduce the impact of outliers while preserving the data?
45. An ML engineer needs to convert a raw dataset from CSV to Parquet format in a serverless manner for cost efficiency. Which AWS service can be used to perform this conversion without managing servers?
46. A data scientist needs to split a dataset into training, validation, and test sets. The dataset has a categorical target variable with imbalanced class distribution. Which splitting technique ensures that each subset has a similar proportion of each class?
47. In SageMaker Data Wrangler, you have a flow that imports data from Amazon S3 and needs to join it with a table from Amazon Redshift. The data volumes are large (hundreds of GB). Which approach is most efficient within Data Wrangler?
48. A dataset for binary classification has a severe class imbalance (5% positive class). Which two data preparation techniques can help address this imbalance? (Choose two.)
49. You are preparing a time-series dataset for a forecasting model. Which three steps are critical to prevent data leakage during preprocessing? (Choose three.)
50. A company ingests daily log data into an S3 bucket. They need to update the existing ML training dataset with new data without reprocessing the entire history. Which two strategies should they adopt? (Choose two.)
51. Refer to the exhibit. A SageMaker Processing job configured as above fails with a timeout error. The input data is 100 GB of CSV files. The processing script performs standard data cleaning operations. What is the most likely cause?
52. Refer to the exhibit. A Glue job runs successfully the first time but on subsequent runs with new data (added to the same input location), the job does not process the new data. What is the most likely cause?
53. Refer to the exhibit. The Glue job reads a CSV file and attempts to write to a Parquet table. What is the most likely cause of this error?
54. A company uses SageMaker Processing jobs to clean customer transaction data. The processing script runs on a single ml.m5.large instance and takes 30 minutes to process 50 GB of data in CSV format. To reduce processing time, the company wants to process 200 GB of data within 1 hour. Which combination of changes should the company make?
55. A data scientist is training a binary classifier on a highly imbalanced dataset (1:100 class ratio). The dataset contains 500,000 rows and 30 features. The data is stored in S3 in Parquet format. The data scientist wants to use SageMaker's built-in XGBoost algorithm. Which data preparation technique should the data scientist apply to best address the class imbalance without causing data leakage?
56. A data engineer needs to prepare a large dataset for machine learning. The data is stored in an Amazon RDS MySQL database and needs to be transformed and moved to an S3 bucket in Parquet format for use with SageMaker. Which AWS service is most suitable for this extraction, transformation, and loading (ETL) task?
57. A team is building a recommendation system and wants to store and serve features for online and offline models. The features include user statistics (updated daily) and movie metadata (static). The team needs low-latency inference for real-time recommendations and wants to reuse features across multiple models. Which AWS service should the team use to store, manage, and serve these features?
58. A data scientist is preprocessing time series data for a fraud detection model. The data includes transaction timestamps, amounts, and merchant IDs. The model should predict fraud within seconds of a transaction. The data scientist wants to avoid data leakage by not using future information to predict past events. Which data preparation practice should be implemented?
59. A company has 10 TB of log data in compressed JSON format stored in Amazon S3. The data needs to be processed and transformed into a structured format for machine learning. The processing requires complex transformations, including parsing nested JSON and joining with a reference table. The company wants to minimize infrastructure management. Which approach should the company use?
60. A team is collaborating on a machine learning project and needs to ensure that data used for training is consistent across experiments. The team wants to version datasets, track data lineage, and be able to reproduce past experiments. The team uses SageMaker for model training. Which combination of services and features should the team use?
61. A data scientist is using the SageMaker built-in Linear Learner algorithm for a regression problem. The dataset has 10 features, some of which have missing values, and the target variable is right-skewed. The data scientist wants to handle missing values and transform the target variable to improve model performance. Which data preparation steps should the data scientist take?
62. A company has a dataset of 2 billion records stored as text files in Amazon S3. The data is partitioned by year and month. The data science team wants to read only the last 6 months of data for model training using SageMaker. To minimize data scanned and reduce costs, which approach should the team use?
63. A data team is preparing data for a machine learning pipeline. Which TWO practices are best for ensuring data quality and reproducibility? (Choose two.)
64. A data scientist is working with a dataset containing customer demographics and purchase history. The dataset includes categorical variables with high cardinality (e.g., ZIP code, product ID). The data scientist wants to perform feature engineering to improve model performance. Which THREE feature engineering techniques should the data scientist consider? (Choose three.)
65. A data engineer needs to provide the data science team with access to various data sources for machine learning. The team uses Amazon SageMaker Studio. Which TWO data sources can be accessed directly from SageMaker Studio notebooks without additional infrastructure? (Choose two.)
66. A data scientist needs to convert categorical variables to numerical format for a linear regression model. The dataset contains a 'Country' column with 50 unique values. Which transformation should the data scientist use to avoid introducing ordinal relationships?
67. A company is building a fraud detection model on an imbalanced dataset (99% legitimate, 1% fraudulent). To improve recall on the minority class, they want to resample data. Which combination of techniques should they use?
68. A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?
69. A machine learning engineer needs to handle missing values in a dataset containing numerical features. The missingness is completely at random (MCAR). Which imputation strategy is most robust for downstream model performance?
70. A team is using Amazon SageMaker for feature engineering. They have a dataset with a column 'TransactionDate' in string format (e.g., '2023-01-15 10:30:00'). They need to create features: year, month, day, hour, and day_of_week. What is the most efficient way to do this in a SageMaker processing job?
71. A data scientist is preparing text data for natural language processing (NLP). The corpus contains many rare words and typos. To reduce dimensionality and improve generalization, they decide to apply stemming and remove stop words. However, after training, the model performs poorly on domain-specific terms. What is the most likely cause?
72. An ML engineer needs to split a dataset into training, validation, and test sets. The dataset has a time-based column that should not be leaked. Which split method is most appropriate?
73. A company collects sensor data from IoT devices. The data arrives with missing timestamps due to network issues. For anomaly detection, the engineer needs to create features that capture rolling statistics over fixed windows. Which data preprocessing step is essential before feature generation?
74. A data scientist is using Amazon SageMaker Data Wrangler for feature engineering on a large dataset stored in S3. The dataset has a column 'ProductCategory' with 1000+ unique values. To reduce dimensionality, they want to group categories that appear less than 1% of the time into an 'Other' category. Which Data Wrangler transform should they use?
75. A machine learning engineer is preparing a dataset for a binary classification model. The dataset has 10,000 rows and 200 features, with 5% positive class. The engineer suspects class imbalance may affect model performance. Which TWO actions should the engineer take to mitigate imbalance? (Choose 2.)
76. A data engineer is building a feature engineering pipeline in AWS Glue ETL to process streaming data from Amazon Kinesis. The data includes a nested JSON structure with arrays. The engineer needs to flatten the nested structures into a tabular format for machine learning. Which THREE approaches are valid for this task? (Choose 3.)
77. A data scientist is evaluating data quality for a machine learning project. The dataset has missing values, outliers, and inconsistent formatting. Which TWO steps should the data scientist perform during the data preparation phase? (Choose 2.)
78. A data scientist is preparing a dataset for a binary classification model. The dataset has 10,000 records with 100 features. The target variable is imbalanced, with 95% negative class and 5% positive class. Which data preparation step should the data scientist take to address the imbalance before training?
79. A company is building a machine learning model on customer transaction data stored in Amazon S3. The data includes columns with missing values in the 'age' field. The data scientist wants to impute missing values with the median age across all customers. Which approach is MOST efficient for preparing the data at scale?
80. A machine learning team is processing a large dataset in Amazon SageMaker using a processing job. The data is stored in S3 in CSV format. The team wants to split the data into training, validation, and test sets (70/20/10) while ensuring that the distribution of a categorical feature 'region' is preserved across splits. Which SageMaker SDK method should they use to write the output?
81. A data engineer needs to convert a JSON dataset to Parquet format for efficient querying with Amazon Athena. The JSON files are in an S3 bucket. Which service can perform this conversion with minimal coding?
82. A data scientist is exploring data stored in an Amazon Redshift cluster. The data includes timestamp columns with different formats. The scientist wants to create a new column that standardizes the timestamp format to UTC. Which approach is MOST efficient?
83. A team is using AWS Glue to process streaming data from Amazon Kinesis. The streaming data contains both structured and semi-structured fields. The team needs to flatten the semi-structured fields into columns for downstream ML training. Which Glue feature is BEST suited?
84. A data engineer notices that an AWS Glue ETL job is failing with an Out of Memory error when processing a large dataset. The dataset is 500 GB in size, and the worker type is G.1X. Which change is MOST likely to resolve the issue?
85. A company needs to anonymize personally identifiable information (PII) in a dataset before using it for ML. The dataset is stored in S3 as CSV files. The team wants to mask credit card numbers by replacing all digits except the last four with asterisks. Which approach is the most scalable?
86. A data scientist is preparing data for a regression model. The target variable has a skewed distribution. The scientist wants to apply a log transformation to make it closer to normal. Which step should be taken before applying log transformation?
87. A data engineer is preparing a dataset for a classification model. The dataset contains duplicate rows. Which TWO approaches are appropriate to handle duplicates in AWS? (Choose 2.)
88. A data scientist is cleaning a text dataset for natural language processing. The raw data contains HTML tags, URLs, and special characters. Which THREE steps should be taken to preprocess the text data? (Choose 3.)
89. A company is preparing data for a time-series forecasting model. The data is collected from IoT sensors at irregular intervals. Which TWO steps are necessary to prepare the data? (Choose 2.)
90. Refer to the exhibit. A data scientist is trying to use AWS Glue to read data from the S3 bucket `ml-data-bucket`. The Glue job fails with an access denied error. What is the most likely cause?
91. Refer to the exhibit. A data engineer runs a Glue ETL job that uses a Python script. The job fails because of a missing module `scikit-learn`. Which fix is MOST appropriate?
92. Refer to the exhibit. A data engineer deploys this Glue job via CloudFormation. When running, the job fails with a timeout after 2 hours. The job processes a large dataset and is expected to take 3 hours. Which change would resolve the issue?
93. A data scientist is preparing a dataset for binary classification using SageMaker. The dataset has 100 features and 10,000 rows, but the target variable is highly imbalanced (95% negative, 5% positive). Which technique should the data scientist apply during data preparation to address the imbalance?
94. A machine learning engineer is using SageMaker Processing to run a scikit-learn preprocessing script. The script reads a CSV file from S3, applies a StandardScaler, and writes the output. The job fails with a 'MemoryError'. Which change should the engineer make to the data preparation process?
95. A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The source data in S3 has a schema that evolves over time (new columns are added occasionally). The Glue job schema is defined as a fixed schema in the job script. After an update to the source data, the Glue job fails with an error about mismatched schemas. How should the data engineer modify the data preparation process to handle schema evolution?
96. A data scientist is using SageMaker Data Wrangler to prepare features for a classification model. Which TWO statements about feature engineering in Data Wrangler are correct?
97. A data engineer is optimizing Amazon Athena queries on large datasets stored in S3 for machine learning data preparation. Which THREE practices improve query performance?
98. A team is building a machine learning model for natural language processing using SageMaker BlazingText. The data preparation step must format the training data correctly. What format does BlazingText require for supervised text classification?
99. A company uses Amazon SageMaker Ground Truth to create labeled datasets for object detection. The output must be in COCO format for downstream model training. How should the data preparation process be configured?
100. A machine learning engineer is using SageMaker Data Wrangler to perform data validation. Which step should be added to the pipeline to ensure data quality before training?
101. A data scientist is preparing a large dataset (50 GB) for training a TensorFlow model on SageMaker. The dataset consists of many small CSV files. Training is slow due to I/O bottlenecks. Which data preparation strategy most effectively accelerates training?
102. Refer to the exhibit. A data engineer runs an AWS Glue ETL job with the following script portion. The job fails with an error: 'An error occurred while calling o113.pyWriteDynamicFrame. No such file or directory'. What is the most likely cause?
103. Refer to the exhibit. A SageMaker Processing job fails with the following error log. Which change during data preparation would resolve the issue?
104. A data engineer is using SageMaker Pipelines to automate data preparation. Which TWO statements about data validation within a pipeline are correct?
105. A company is building a time series forecasting model using SageMaker DeepAR. The raw data is a CSV with columns: timestamp, item_id, and value. What is the correct data format required for DeepAR training?
106. A data engineer is using Amazon Athena to query a partitioned dataset stored in S3. Which THREE actions are necessary to ensure the queries can access the data and run efficiently?
107. A company operates an IoT platform that ingests sensor data from thousands of devices. Data is streamed via Amazon Kinesis Data Streams and stored in an S3 bucket using a Kinesis Firehose delivery stream, which writes data in 5-minute windows. The data is then used to train a machine learning model for anomaly detection. Recently, the data science team noticed that the training dataset is always missing the last 5 minutes of events from the end of each day. The S3 objects show that the last delivery stream buffer window is incomplete. The data engineer checked the Kinesis Firehose metrics and found no delivery errors or data loss, but the 'IncomingBytes' and 'IncomingRecords' metrics show consistent data for all periods. The S3 bucket has Lifecycle policies that do not delete objects. The team suspects the issue is related to the data preparation pipeline. Which course of action would correctly resolve the missing data problem?
108. A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?
109. A machine learning engineer is building a regression model to predict house prices. The feature 'square_footage' has values ranging from 500 to 10,000, while 'num_bedrooms' ranges from 1 to 10. Which preprocessing step is most critical before training a model that uses gradient descent?
110. A company uses Amazon SageMaker Data Wrangler to create a data flow for a classification model. The dataset contains a high-cardinality categorical feature 'product_id' with 50,000 unique values. The data scientist wants to reduce dimensionality while preserving predictive power. Which approach is most effective?
111. A data engineer needs to prepare a large dataset (10 TB) stored in Amazon S3 for a training job on SageMaker. The data is in CSV format, but the training algorithm expects Parquet for performance. The engineer must transform the data with minimal cost and without writing custom code. Which service should be used?
112. A data science team is building a model to predict fraudulent transactions. The dataset has 1 million legitimate transactions and only 1,000 fraudulent ones. They plan to use Amazon SageMaker to train a model. Which data preparation technique should they apply to address the severe class imbalance before training?
113. A company is training a deep learning model on Amazon SageMaker using a dataset stored in Amazon S3. The training job is taking a long time due to I/O bottlenecks. The data is in JSON lines format. Which data preparation step combined with SageMaker's best practices would most effectively reduce training time?
114. A data engineer is using Amazon SageMaker Processing to run a data preprocessing script on a dataset with 500 million rows. The script runs out of memory on a single ml.r5.24xlarge instance. The engineer needs to modify the processing job to handle the dataset size. Which approach is most cost-effective and scalable?
115. Which TWO actions are recommended best practices when preparing training data for a machine learning model in AWS? (Choose two.)
116. A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. Which TWO features of Data Wrangler can be used to handle imbalanced classification problems? (Choose two.)
117. A company is preparing a large dataset for a SageMaker built-in XGBoost model. The dataset has missing values in both numeric and categorical features, and some categorical features have high cardinality. Which THREE data preparation steps should the company take to optimize model performance? (Choose three.)
118. A company runs an online retail business and wants to build a product recommendation system. They have a dataset of customer purchases stored in Amazon S3 as CSV files. The dataset includes columns: 'customer_id', 'product_id', 'purchase_date', 'quantity', 'price', and 'category'. The data science team plans to use Amazon SageMaker to train a factorization machines model. During data exploration, they discover that the 'category' column has 1,200 unique values, and many categories appear only a few times. The 'product_id' column has 50,000 unique values. They want to include both features in the model. The team is concerned about the high cardinality of these features. Which approach should they take to prepare these features for the factorization machines model?
119. A healthcare company is building a model to predict patient readmission rates. The dataset contains a mix of numeric features (age, blood pressure, lab test results) and categorical features (gender, diagnosis code, hospital department). The dataset has 2 million rows. The data is stored in an Amazon S3 bucket, and they use AWS Glue to catalog and preprocess the data. The data scientist notices that the 'diagnosis_code' column has 10,000 unique codes, and 20% of the rows have missing values for 'blood_pressure'. They plan to use a SageMaker built-in XGBoost model. For optimal model performance, which preprocessing steps should they apply using AWS Glue ETL?
120. A financial services company is developing a fraud detection model using Amazon SageMaker. They have a dataset with 10 million transactions, each with 300 features. The dataset is highly imbalanced (0.1% fraud). They have performed feature engineering and now need to split the data for training, validation, and test sets. The data is stored in CSV files in Amazon S3. They plan to use SageMaker's built-in XGBoost algorithm. To ensure proper evaluation and avoid data leakage, which data splitting strategy should they use?
121. An e-commerce company uses Amazon SageMaker to train a model that predicts click-through rates. The training data includes a timestamp column 'click_time' and a categorical feature 'device_type' (8 values). They notice that the model's performance degrades over time because the data distribution shifts. They want to ensure the training data represents the most recent behavior. The data is stored in a daily partitioned S3 bucket (e.g., s3://bucket/data/2024-01-01/). The total dataset size is 500 GB. Which approach should they take to prepare the training data while minimizing bias and cost?
122. A data engineer is building a data pipeline for a machine learning model that requires both structured and unstructured data. The structured data (customer demographics) is in Amazon RDS, and the unstructured data (customer support chat logs) is in Amazon S3 as JSON files. The engineer needs to combine these datasets into a single training dataset stored in S3 in Parquet format. They must also perform feature engineering such as text vectorization on the chat logs. The pipeline should be serverless and cost-effective. Which approach should they use?
123. A data scientist needs to prepare a dataset for a binary classification model. The dataset contains 100,000 records with 50 features, including categorical variables with high cardinality, missing values in 30% of records for a key numeric feature, and a severe class imbalance (5% positive class). The data is stored in an Amazon S3 bucket. Which TWO actions should the data scientist take to improve model performance and ensure robust data preparation? (Choose two.)
124. A retail company is building a machine learning model to predict customer churn. The data engineering team has extracted customer transaction data from Amazon Aurora and stored it as CSV files in Amazon S3. The data includes customer IDs, transaction amounts, timestamps, and product categories. A data scientist discovers that the 'transaction_amount' column is missing values in about 15% of the records. The data scientist also notices that the 'customer_id' column has some duplicate entries. The team wants to prepare the data for training a churn model using Amazon SageMaker. The data is approximately 50 GB in size. What should the data scientist do to handle the missing values and duplicates efficiently while preparing the data for training?
125. A financial services company is building a fraud detection model using historical transaction data stored in Amazon S3. The data includes features such as transaction amount, merchant category, time of day, and user location. The data scientist observes that the 'merchant_category' column is a text attribute with over 200 unique values. Additionally, the 'transaction_amount' column has a long-tail distribution with extreme outliers. The dataset is 200 GB in size, and the company wants to use Amazon SageMaker for model training. The data scientist needs to engineer features that capture the high-cardinality category and reduce the impact of outliers. What is the MOST efficient and effective approach to prepare this data?
126. A social media company is processing a real-time stream of user activity data from Amazon Kinesis Data Streams to train a machine learning model for content recommendation. The raw data includes user ID, timestamp, content ID, interaction type (like, share, comment), and device type. The data scientists need to aggregate features per user over a sliding window of 7 days, including counts of interaction types, unique content IDs engaged, and a moving average of interaction timestamps. The aggregated data will be used to update a user embedding model. The streaming data volume is approximately 500 records per second, and the company uses an AWS Glue streaming ETL job for transformation. However, the Glue job is failing frequently with high latency and checkpoint errors. The team needs a more robust solution to prepare the streaming data features. Which approach should the team take?
127. A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days after discharge. The dataset contains electronic health record (EHR) data from multiple hospitals, stored as Parquet files in Amazon S3. The data includes patient demographics, diagnoses (ICD-10 codes), medications, lab results, and length of stay. A data scientist notices that the 'lab_result' column has a high number of null values (over 60%) because some tests are not applicable to all patients. Additionally, the 'diagnosis_code' column has over 10,000 unique ICD-10 codes. The company wants to build a model that complies with HIPAA and performs well. The data scientist must prepare the features efficiently using AWS services. Which combination of steps should the data scientist take? (Assume the company can use any AWS service.)
128. A marketing company is preparing a dataset to train a logistic regression model to predict whether a customer will click on an online ad. The dataset includes 1 million records with features: customer_age (numeric), income (numeric), education_level (ordinal: high school, bachelor, master, PhD), and ad_category (categorical: 50 unique values). The data is stored in a CSV file in Amazon S3. The data scientist plans to use Amazon SageMaker's built-in linear learner algorithm. The data scientist needs to preprocess the data before training. What is the correct sequence of data preparation steps that should be applied to this dataset to ensure optimal model performance?
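The feature types named in the last question above (numeric, ordinal, and nominal categorical) are usually prepared differently for a linear model. The following is a minimal sketch in plain Python, not the question bank's official answer and not SageMaker code: the sample rows, the `preprocess` and `standardize` helpers, and the category values are all hypothetical and exist only for illustration. It assumes numeric columns are standardized, the ordinal column is mapped to its rank, and the nominal column is one-hot encoded.

```python
from statistics import mean, pstdev

# Hypothetical sample rows; only the column names come from the question.
rows = [
    {"customer_age": 25, "income": 40000, "education_level": "high school", "ad_category": "sports"},
    {"customer_age": 35, "income": 60000, "education_level": "bachelor", "ad_category": "travel"},
    {"customer_age": 45, "income": 80000, "education_level": "PhD", "ad_category": "sports"},
]

# Ordinal levels in the order stated in the question.
EDUCATION_ORDER = ["high school", "bachelor", "master", "PhD"]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def preprocess(rows):
    """Return (feature_vectors, one_hot_categories) for the given rows.

    In a real pipeline the mean, standard deviation, and category list
    would be computed on the training split only and reused elsewhere.
    """
    ages = standardize([r["customer_age"] for r in rows])
    incomes = standardize([r["income"] for r in rows])
    categories = sorted({r["ad_category"] for r in rows})

    features = []
    for r, age, income in zip(rows, ages, incomes):
        # Ordinal feature: keep the ordering by using the rank as the value.
        education_rank = float(EDUCATION_ORDER.index(r["education_level"]))
        # Nominal feature: one indicator column per category.
        one_hot = [1.0 if r["ad_category"] == c else 0.0 for c in categories]
        features.append([age, income, education_rank] + one_hot)
    return features, categories


features, categories = preprocess(rows)
for vector in features:
    print(vector)
```

Each output row holds the standardized age and income, the education rank, and one indicator column per ad category. With the 50 categories described in the question, the one-hot part would contribute 50 columns.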
The Data Preparation for Machine Learning domain covers the key concepts tested in this area of the MLA-C01 exam blueprint published by Amazon Web Services. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all MLA-C01 domains — no account required.
The Courseiva MLA-C01 question bank contains 128 questions in the Data Preparation for Machine Learning domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Data Preparation for Machine Learning domain, so you can practice this domain on its own. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.