MLA-C01 Data Preparation for Machine Learning — All Questions With Answers

Question 1 (easy, multiple choice)

A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?

Question 2 (medium, multiple choice)

A company is using AWS Glue to prepare data for a machine learning pipeline. The source data is in an Amazon S3 bucket in CSV format. The data scientist wants to convert the data to Parquet format and partition it by date. Which AWS Glue feature should be used to optimize the data for query performance and reduce storage costs?

Question 3 (hard, multiple choice)

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has a severe class imbalance (95% class A, 5% class B). The engineer wants to use Amazon SageMaker to train the model. Which data preparation technique should the engineer apply to the training dataset to address the imbalance and improve model performance?

Question 4 (easy, multiple choice)

A data scientist is preparing a dataset for a machine learning model that predicts customer churn. The dataset contains a column 'CustomerID' that is a unique identifier. What should the data scientist do with this column before training the model?

Question 5 (medium, multiple choice)

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The data is stored in Amazon S3 in Parquet format. A data engineer notices that the Glue job is running slowly and consuming a lot of resources. What is the MOST cost-effective way to improve the performance of the Glue job?

Question 6 (hard, multiple choice)

A machine learning team is building a model using a dataset that contains a mix of numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). The team wants to use Amazon SageMaker for training. Which technique should the team use to encode the high-cardinality categorical features effectively?

Question 7 (medium, multiple choice)

A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column with date strings in the format 'YYYY-MM-DD'. The data scientist wants to extract the year, month, and day as separate features. Which Data Wrangler transform should be used?

Question 8 (easy, multi select)

A data engineer is using AWS Glue to prepare a dataset for machine learning. The dataset has several columns with outliers. The engineer wants to detect and handle outliers in a scalable manner. Which TWO approaches should the engineer consider? (Select TWO.)

Question 9 (medium, multi select)

A machine learning team is preparing a dataset for a regression model. The dataset contains numerical features that are on different scales (e.g., age 0-100, income 0-1,000,000). The team plans to use Amazon SageMaker to train a linear regression model. Which THREE data preparation steps should the team take to ensure the model performs well? (Select THREE.)
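
As an illustration of bringing features such as age and income onto a comparable scale, the sketch below applies z-score standardization using only the Python standard library. The function name `standardize` and the sample values are hypothetical; in a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused for validation and test data.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical sample values on very different scales.
ages = [20, 40, 60]
incomes = [10_000, 500_000, 990_000]

scaled_ages = standardize(ages)
scaled_incomes = standardize(incomes)
```

After scaling, both columns have mean 0 and unit variance, so neither dominates a linear model's coefficients purely because of its units.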

Question 10 (hard, multi select)

A data scientist is using Amazon SageMaker Data Wrangler to create a data flow for a machine learning project. The source data is in Amazon S3 and contains PII (personally identifiable information) such as email addresses and credit card numbers. The data scientist needs to prepare the data for training while ensuring compliance with data privacy regulations. Which THREE actions should the data scientist take? (Select THREE.)

Question 11 (medium, multiple choice)

A data scientist is preparing a large dataset for training a binary classification model. The dataset has a severe class imbalance (95% negative, 5% positive). Which data preparation technique should the scientist use to address this imbalance without losing too much data?
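
One family of techniques for a 95/5 imbalance keeps every original row and adds copies of minority-class rows. The sketch below shows plain random oversampling with the standard library; `random_oversample` and the `label` key are hypothetical names, and this is only one of several possible approaches (synthetic oversampling and class weighting are others). Resampling would normally be applied to the training split only.

```python
import random

def random_oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical data: 95 negatives, 5 positives.
rows = [{"label": 0}] * 95 + [{"label": 1}] * 5
balanced = random_oversample(rows, "label")
```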

Question 12 (easy, multiple choice)

A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?

Question 13 (hard, multiple choice)

A team is building a regression model on a dataset with missing values in multiple features. They decide to use a k-Nearest Neighbors (k-NN) imputer. The dataset has 100,000 rows and 50 features. Which step should the team take to ensure the imputation is efficient and accurate?

Question 14 (medium, multiple choice)

A data engineer is using Amazon SageMaker Data Wrangler to prepare a dataset. The dataset contains a column 'review_date' with timestamps. The engineer wants to extract the day of the week as a new feature. How should this transformation be performed in Data Wrangler?

Question 15 (hard, multiple choice)

A company uses AWS Glue ETL jobs to transform data for machine learning. They have a dataset with a column 'income' that is heavily right-skewed. Which transformation should be applied to make the distribution more Gaussian-like?
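
A logarithmic transform is a commonly used option for compressing a long right tail. The sketch below uses `log(1 + x)` so that zero incomes are still defined; the function name and the sample figures are made up for illustration, and the input is assumed to be non-negative.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; assumes every x is >= 0."""
    return [math.log1p(x) for x in values]

# Hypothetical incomes with one very large value in the right tail.
incomes = [20_000, 35_000, 50_000, 1_000_000]
transformed = log_transform(incomes)

# Spread before and after: the largest value is 50x the smallest on the
# raw scale, but well under 2x on the log scale.
raw_ratio = max(incomes) / min(incomes)
log_ratio = max(transformed) / min(transformed)
```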

Question 16 (easy, multiple choice)

A data scientist is working on a time series forecasting problem. The dataset contains a column 'sales' with occasional negative values due to returns. The model expects non-negative input. Which data preparation step should be taken?

Question 17 (medium, multiple choice)

A team is using Amazon SageMaker Processing for data preprocessing. They have a Parquet dataset in Amazon S3. Which configuration will provide the most efficient reading of the dataset during processing?

Question 18 (medium, multi select)

A machine learning engineer is preparing a dataset for a multiclass classification task. The dataset has 10 features and 100,000 rows. Which TWO techniques should the engineer use to reduce the risk of overfitting during data preparation?

Question 19 (hard, multi select)

A team is preparing text data for a natural language processing (NLP) model. They have a corpus of customer reviews. Which THREE preprocessing steps are essential to reduce noise and improve model performance?

Question 20 (medium, multiple choice)

A retail company is preparing a dataset for a machine learning model to predict customer churn. The dataset includes customer_id, signup_date, last_purchase_date, total_purchases, average_order_value, and churn_label. The data scientist notices that the 'total_purchases' column has missing values for 15% of the records. The company wants to use AWS Glue for data preparation. Which approach should the data scientist take to handle the missing values while minimizing bias and preserving data integrity?

Question 21 (hard, multiple choice)

A financial services company is building a fraud detection model using transactional data stored in Amazon S3. The data includes transaction_id, timestamp, amount, merchant_category, and fraud_label (0/1). The data is collected from multiple sources and has inconsistencies: timestamps are in different timezones (UTC and EST), merchant categories are sometimes misspelled (e.g., 'RESTAURANT', 'Restaurant', 'restaurant'), and the fraud_label is missing for about 5% of records. The data science team uses AWS Glue for ETL. They need to prepare a clean dataset for training. The final dataset must have consistent timestamps in UTC, standardized merchant categories, and no missing fraud labels. The team also wants to minimize data loss. Which set of actions should the team take?

Question 22 (easy, multiple choice)

A healthcare startup is building a model to predict patient readmission within 30 days. The data is stored in Amazon Redshift and includes patient demographics, admission history, lab results, and medication records. The data scientist extracts a sample of 10,000 records to Amazon S3 as CSV files for initial prototyping. During exploratory data analysis, they find that the 'age' column has values like '150', '0', and negative numbers. The 'diagnosis_code' column contains codes like 'E11', 'E11.9', and 'e11' (inconsistent formatting). The 'readmitted' target column has 60% 'Yes' and 40% 'No'. The data scientist wants to use AWS Glue DataBrew for data cleaning. Which combination of steps should they use?

Question 23 (medium, multiple choice)

An e-commerce company is building a recommendation system using user interaction data stored in Amazon DynamoDB. The data includes user_id, product_id, timestamp, event_type (click, add_to_cart, purchase), and session_id. The data science team exports the data to Amazon S3 as JSON files. During preprocessing, they discover that the 'event_type' field contains inconsistent values due to logging errors: 'Click', 'click', 'CLICK', and 'clck' all appear. Also, there are duplicate records where the same user_id, product_id, and timestamp appear multiple times with the same event_type. The team wants to use AWS Glue to clean the data for training a sequence-based recommendation model. Which set of actions should they perform?
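
The two cleaning problems in this scenario, inconsistent `event_type` spellings and repeated records, can be sketched in plain Python as follows. The `CORRECTIONS` typo map and the `clean_events` helper are hypothetical; in an AWS Glue job the same logic would more likely be expressed with Spark DataFrame operations such as `lower()` and `dropDuplicates()`.

```python
VALID_EVENTS = {"click", "add_to_cart", "purchase"}
CORRECTIONS = {"clck": "click"}  # hypothetical map of known logging typos

def clean_events(events):
    """Lowercase event_type, fix known typos, and drop exact duplicates."""
    seen = set()
    cleaned = []
    for event in events:
        event_type = event["event_type"].strip().lower()
        event_type = CORRECTIONS.get(event_type, event_type)
        if event_type not in VALID_EVENTS:
            continue  # unknown value: skip rather than guess
        key = (event["user_id"], event["product_id"], event["timestamp"], event_type)
        if key in seen:
            continue  # same user, product, timestamp, and event already kept
        seen.add(key)
        cleaned.append({**event, "event_type": event_type})
    return cleaned

# Hypothetical records: the first two become duplicates after normalization.
events = [
    {"user_id": 1, "product_id": 9, "timestamp": 100, "event_type": "Click"},
    {"user_id": 1, "product_id": 9, "timestamp": 100, "event_type": "click"},
    {"user_id": 1, "product_id": 9, "timestamp": 105, "event_type": "clck"},
]
cleaned = clean_events(events)
```

Normalizing before deduplicating matters here: records that differ only in the casing of `event_type` would otherwise survive as distinct rows.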

Question 24 (easy, multiple choice)

A data scientist is preparing a dataset for training a binary classification model. The dataset has 100,000 rows and 50 features. The target variable is imbalanced, with only 5% positive cases. Which technique should the data scientist apply to address the class imbalance BEFORE training?

Question 25 (medium, multiple choice)

A machine learning engineer is building a pipeline to preprocess text data for a sentiment analysis model. The data consists of customer reviews. The engineer wants to convert the text into numerical features while preserving the semantic meaning of words. Which technique should be used?

Question 26 (hard, multiple choice)

A company uses Amazon SageMaker Data Wrangler to prepare data for ML. The dataset contains a timestamp column and sensor readings from IoT devices. The data scientist needs to create features such as moving averages and rolling statistics over time windows. Which Data Wrangler transformation type should be selected?
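
To make "moving average over a time window" concrete, the sketch below computes a trailing mean over the last `window` readings in plain Python. It is an illustration of the feature itself, not of the Data Wrangler transform; the function name and readings are hypothetical, and the input is assumed to be sorted by timestamp.

```python
from collections import deque

def moving_average(values, window):
    """Trailing mean over the last `window` readings (fewer at the start)."""
    buffer = deque(maxlen=window)  # oldest reading is dropped automatically
    averages = []
    for value in values:
        buffer.append(value)
        averages.append(sum(buffer) / len(buffer))
    return averages

readings = [10.0, 20.0, 30.0, 40.0]
smoothed = moving_average(readings, window=2)  # [10.0, 15.0, 25.0, 35.0]
```

Because each output uses only the current and earlier readings, the feature does not look ahead in time.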

Question 27 (easy, multiple choice)

A data engineer is preparing a large dataset of 10 TB for ML training on Amazon SageMaker. The data is stored in Amazon S3 as CSV files. To reduce training time and cost, the engineer wants to use a columnar format that is optimized for analytical queries. Which format should the engineer convert the data to?

Question 28 (medium, multiple choice)

A data scientist is using Amazon SageMaker Processing to run a feature engineering job. The job requires installing additional Python libraries not included in the default SageMaker containers. Which approach should the data scientist use to include these libraries?

Question 29 (hard, multiple choice)

A machine learning team is building a model to predict customer churn. They have historical data that includes customer activity logs, each with a timestamp. The team wants to ensure that the training data does not contain any data leakage from the future. Which approach should they take when preparing the training and validation datasets?
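
A split by time, rather than at random, is the usual way to keep future records out of the training set. The sketch below is a minimal version under assumed names: the `event_date` field, the cutoff date, and the sample log rows are all hypothetical.

```python
from datetime import date

def chronological_split(records, cutoff):
    """Rows strictly before `cutoff` go to training; the rest go to validation."""
    train = [r for r in records if r["event_date"] < cutoff]
    valid = [r for r in records if r["event_date"] >= cutoff]
    return train, valid

# Hypothetical activity logs.
logs = [
    {"customer_id": 1, "event_date": date(2023, 1, 5)},
    {"customer_id": 2, "event_date": date(2023, 3, 9)},
    {"customer_id": 1, "event_date": date(2023, 6, 1)},
]
train, valid = chronological_split(logs, cutoff=date(2023, 4, 1))
```

Every training row is older than every validation row, which mirrors how the model would be used after deployment.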

Question 30 (easy, multiple choice)

A data scientist is working with a dataset that contains missing values in several numeric features. The data scientist wants to impute the missing values with the median of each feature. Which Amazon SageMaker Data Wrangler transformation should be used?
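
For reference, median imputation itself amounts to the following; this is a plain-Python illustration of the operation, not the Data Wrangler step, and it assumes missing values are represented as `None`. The helper name and sample readings are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical feature with two gaps and one large value.
readings = [3.0, None, 7.0, 100.0, None]
imputed = impute_median(readings)  # [3.0, 7.0, 7.0, 100.0, 7.0]
```

The median (7.0) is unaffected by the large value 100.0, whereas the mean of the observed values would be about 36.7.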

Question 31 (medium, multiple choice)

A data engineer needs to join two large datasets from Amazon S3: one containing customer demographics and another containing transaction history. The join key is `customer_id`. To minimize data shuffling and improve performance, the engineer decides to use Amazon SageMaker Processing with Spark. Which configuration should the engineer use?

Question 32 (hard, multiple choice)

A data scientist is preparing a dataset for a regression model that predicts house prices. The dataset includes a `neighborhood` feature with 500 distinct categories. The data scientist wants to encode this feature without increasing dimensionality too much and while capturing the target relationship. Which encoding technique should be used?
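
One encoding that replaces a categorical column with a single numeric column tied to the target is target (mean) encoding. The sketch below adds simple additive smoothing toward the global mean; the function name, the `smoothing` parameter, and the sample prices are hypothetical. To avoid leaking the target, the mapping would normally be fitted on training folds only.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are pulled toward the global mean, which limits
    overfitting on categories that have only a few rows.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }

# Hypothetical neighborhoods and prices (global mean is 330.0).
neighborhoods = ["north", "north", "south", "east"]
prices = [300.0, 320.0, 200.0, 500.0]
encoding = target_encode(neighborhoods, prices, smoothing=1.0)
```

With `smoothing=1.0`, "south" (one row at 200.0) is encoded as (200 + 330) / 2 = 265.0 rather than its raw mean of 200.0.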

Question 33 (medium, multi select)

A data scientist is performing feature engineering for a dataset with both numerical and categorical features. The data scientist wants to apply transformations that preserve the interpretability of the features. Which TWO transformations should the data scientist use? (Select TWO)

Question 34 (hard, multi select)

A company is building a real-time inference pipeline for an ML model. The raw data arrives in JSON format via Amazon Kinesis Data Streams. Before invoking the SageMaker endpoint, the data must be preprocessed to match the training data format. Which THREE steps should be included in the preprocessing function? (Select THREE)

Question 35 (easy, multi select)

A data engineer is using AWS Glue to prepare a dataset for ML. The engineer wants to split the dataset into training and testing sets while preserving the distribution of the target variable. Which TWO methods achieve this goal? (Select TWO)

Question 36 (medium, multiple choice)

A data scientist runs the AWS Glue ETL job shown in the exhibit. The job fails with a Spark stage failure error. What is the most likely cause?

Exhibit

Refer to the exhibit. A data scientist runs the following AWS Glue ETL job script (Spark) to prepare data for ML:

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://bucket/input/"]},
    format = "csv",
    format_options = {"withHeader": True}
)

applymapping = ApplyMapping.apply(frame = datasource, mappings = [("id", "int", "id", "int"), ("value", "string", "value", "double")])
...
```

The job fails with an error: "Job run failed: org.apache.spark.SparkException: Job aborted due to stage failure: Task failed while writing rows." What is the most likely cause of this error?

Question 37 (easy, multiple choice)

A SageMaker Processing job fails with 'Access Denied' when listing objects in an S3 bucket, despite the IAM policy shown in the exhibit. What is the most likely cause?

Exhibit

Refer to the exhibit. A data scientist is trying to run a SageMaker Processing job that reads data from an S3 bucket. The IAM role attached to the processing job has the following policy:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/*"
        }
    ]
}
```

The job fails with an error: "Access Denied" when trying to list objects. What is the root cause?
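
For comparison with the exhibit: Amazon S3 treats listing as a bucket-level action, so `s3:ListBucket` is evaluated against the bucket ARN (`arn:aws:s3:::my-bucket`), whereas `s3:GetObject` is evaluated against object ARNs (`arn:aws:s3:::my-bucket/*`). The sketch below builds a policy document that covers both; it is an illustrative policy under that assumption, not the role the data scientist actually used.

```python
import json

BUCKET = "my-bucket"  # bucket name taken from the exhibit

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Object-level action: the resource is the objects in the bucket.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {
            # Bucket-level action: the resource is the bucket itself.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

policy_json = json.dumps(policy, indent=4)
```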

Question 38 (hard, multiple choice)

A data scientist creates a feature group as shown in the exhibit. When ingesting data with an 'age' column of integer values, the ingestion fails. What is the most likely cause?

Exhibit

Refer to the exhibit. A data scientist uses the following SageMaker Feature Store feature definition (using the Boto3 SDK) to create a feature group:

```python
import boto3
sagemaker = boto3.client('sagemaker', region_name='us-east-1')
response = sagemaker.create_feature_group(
    FeatureGroupName='my-feature-group',
    RecordIdentifierFeatureName='customer_id',
    EventTimeFeatureName='timestamp',
    FeatureDefinitions=[
        {'FeatureName': 'customer_id', 'FeatureType': 'String'},
        {'FeatureName': 'age', 'FeatureType': 'String'},
        {'FeatureName': 'income', 'FeatureType': 'Fractional'}
    ],
    OnlineStoreConfig={'EnableOnlineStore': True},
    RoleArn='arn:aws:iam::123456789012:role/SageMakerRole'
)
```

The data scientist later tries to ingest data with an 'age' column containing integer values. The ingestion fails. What is the most likely reason?

Question 39 (easy, multiple choice)

A data scientist is preparing a dataset for a linear regression model. The dataset has a few missing values in a numerical feature with a normal distribution and no outliers. Which imputation method is most appropriate?

Question 40 (medium, multiple choice)

A SageMaker Processing job fails with the error: 'Unable to parse CSV file due to inconsistent number of columns'. The data is stored as CSV in S3. What is the most likely cause?

Question 41 (hard, multiple choice)

A company is preparing a dataset with a categorical feature that has over 1000 unique values. They need to create features for a random forest model. Which feature engineering approach is most scalable and effective in AWS for high-cardinality categories?

Question 42 (easy, multiple choice)

An organization stores raw data in Amazon S3 as CSV files. They need to perform serverless data transformation and convert the data to Parquet format for efficient ML training. Which AWS service is most appropriate?

Question 43 (medium, multiple choice)

A data scientist is using SageMaker Data Wrangler to prepare a large dataset. The data contains duplicate rows, which could bias the model. Which built-in step in Data Wrangler can automatically detect and remove duplicates?

Question 44 (hard, multiple choice)

A dataset contains a numerical feature with extreme outliers. The outliers are genuine (not errors), and the ML model is a linear regression which is sensitive to outliers. Which data transformation should be applied to reduce the impact of outliers while preserving the data?

Question 45 (easy, multiple choice)

An ML engineer needs to convert a raw dataset from CSV to Parquet format in a serverless manner for cost efficiency. Which AWS service can be used to perform this conversion without managing servers?

Question 46 (medium, multiple choice)

A data scientist needs to split a dataset into training, validation, and test sets. The dataset has a categorical target variable with imbalanced class distribution. Which splitting technique ensures that each subset has a similar proportion of each class?
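
Splitting within each class, rather than across the whole dataset, is what keeps class proportions similar in every subset. The sketch below shows a two-way stratified split with the standard library; applying it twice would give training, validation, and test sets. The helper name, the `y` key, and the 10/90 sample data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key, test_fraction, seed=0):
    """Split rows so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Hypothetical imbalanced data: 10 positives, 90 negatives.
rows = [{"y": 1}] * 10 + [{"y": 0}] * 90
train, test = stratified_split(rows, "y", test_fraction=0.2)
```

Both parts keep the original 10% positive rate: 2 of 20 test rows and 8 of 80 training rows are positive.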

Question 47 (hard, multiple choice)

In SageMaker Data Wrangler, you have a flow that imports data from Amazon S3 and needs to join it with a table from Amazon Redshift. The data volumes are large (hundreds of GB). Which approach is most efficient within Data Wrangler?

Question 48 (medium, multi select)

A dataset for binary classification has a severe class imbalance (5% positive class). Which two data preparation techniques can help address this imbalance? (Choose two.)

Question 49 (hard, multi select)

You are preparing a time-series dataset for a forecasting model. Which three steps are critical to prevent data leakage during preprocessing? (Choose three.)

Question 50 (easy, multi select)

A company ingests daily log data into an S3 bucket. They need to update the existing ML training dataset with new data without reprocessing the entire history. Which two strategies should they adopt? (Choose two.)

Question 51 (hard, multiple choice)

Refer to the exhibit. A SageMaker Processing job configured as shown in the exhibit fails with a timeout error. The input data is 100 GB of CSV files. The processing script performs standard data cleaning operations. What is the most likely cause?

Exhibit

{
  "ProcessingResources": {
    "ClusterConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m5.large",
      "VolumeSizeInGB": 30
    }
  },
  "AppSpecification": {
    "ImageUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:latest",
    "ContainerEntrypoint": ["python", "process.py"]
  },
  "RoleArn": "arn:aws:iam::123456789012:role/SageMakerProcessingRole",
  "ProcessingInputs": [
    {
      "InputName": "input-1",
      "S3Input": {
        "S3Uri": "s3://my-bucket/input/data.csv",
        "LocalPath": "/opt/ml/processing/input",
        "S3DataType": "S3Prefix",
        "S3InputMode": "File",
        "S3DataDistributionType": "FullyReplicated",
        "S3CompressionType": "None"
      }
    }
  ],
  "ProcessingOutputConfig": {
    "Outputs": [
      {
        "OutputName": "output-1",
        "S3Output": {
          "S3Uri": "s3://my-bucket/output/",
          "LocalPath": "/opt/ml/processing/output",
          "S3UploadMode": "EndOfJob"
        }
      }
    ]
  }
}

Question 52 (medium, multiple choice)

Refer to the exhibit. A Glue job runs successfully the first time but on subsequent runs with new data (added to the same input location), the job does not process the new data. What is the most likely cause?

Question 53 (easy, multiple choice)

Refer to the exhibit. The Glue job reads a CSV file and attempts to write to a Parquet table. What is the most likely cause of this error?

Exhibit

An AWS Glue job fails with the following error from the CloudWatch logs:
"Conversion error: Unable to convert column 'price' from String to Double for some rows."

Question 54 (medium, multiple choice)

A company uses SageMaker Processing jobs to clean customer transaction data. The processing script runs on a single ml.m5.large instance and takes 30 minutes to process 50 GB of data in CSV format. To reduce processing time, the company wants to process 200 GB of data within 1 hour. Which combination of changes should the company make?

Question 55 (hard, multiple choice)

A data scientist is training a binary classifier on a highly imbalanced dataset (1:100 class ratio). The dataset contains 500,000 rows and 30 features. The data is stored in S3 in Parquet format. The data scientist wants to use SageMaker's built-in XGBoost algorithm. Which data preparation technique should the data scientist apply to best address the class imbalance without causing data leakage?

Question 56 (easy, multiple choice)

A data engineer needs to prepare a large dataset for machine learning. The data is stored in an Amazon RDS MySQL database and needs to be transformed and moved to an S3 bucket in Parquet format for use with SageMaker. Which AWS service is most suitable for this extraction, transformation, and loading (ETL) task?

Question 57 (medium, multiple choice)

A team is building a recommendation system and wants to store and serve features for online and offline models. The features include user statistics (updated daily) and movie metadata (static). The team needs low-latency inference for real-time recommendations and wants to reuse features across multiple models. Which AWS service should the team use to store, manage, and serve these features?

Question 58 (hard, multiple choice)

A data scientist is preprocessing time series data for a fraud detection model. The data includes transaction timestamps, amounts, and merchant IDs. The model should predict fraud within seconds of a transaction. The data scientist wants to avoid data leakage by not using future information to predict past events. Which data preparation practice should be implemented?

Question 59 (easy, multiple choice)

A company has 10 TB of log data in compressed JSON format stored in Amazon S3. The data needs to be processed and transformed into a structured format for machine learning. The processing requires complex transformations, including parsing nested JSON and joining with a reference table. The company wants to minimize infrastructure management. Which approach should the company use?

Question 60 (medium, multiple choice)

A team is collaborating on a machine learning project and needs to ensure that data used for training is consistent across experiments. The team wants to version datasets, track data lineage, and be able to reproduce past experiments. The team uses SageMaker for model training. Which combination of services and features should the team use?

Question 61 (hard, multiple choice)

A data scientist is using the SageMaker built-in Linear Learner algorithm for a regression problem. The dataset has 10 features, some of which have missing values, and the target variable is right-skewed. The data scientist wants to handle missing values and transform the target variable to improve model performance. Which data preparation steps should the data scientist take?

Question 62 (easy, multiple choice)

A company has a dataset of 2 billion records stored as text files in Amazon S3. The data is partitioned by year and month. The data science team wants to read only the last 6 months of data for model training using SageMaker. To minimize data scanned and reduce costs, which approach should the team use?

Question 63 (medium, multi select)

A data team is preparing data for a machine learning pipeline. Which TWO practices are best for ensuring data quality and reproducibility? (Choose two.)

Question 64 (hard, multi select)

A data scientist is working with a dataset containing customer demographics and purchase history. The dataset includes categorical variables with high cardinality (e.g., ZIP code, product ID). The data scientist wants to perform feature engineering to improve model performance. Which THREE feature engineering techniques should the data scientist consider? (Choose three.)

Question 65 (easy, multi select)

A data engineer needs to provide the data science team with access to various data sources for machine learning. The team uses Amazon SageMaker Studio. Which TWO data sources can be accessed directly from SageMaker Studio notebooks without additional infrastructure? (Choose two.)

Question 66 (easy, multiple choice)

A data scientist needs to convert categorical variables to numerical format for a linear regression model. The dataset contains a 'Country' column with 50 unique values. Which transformation should the data scientist use to avoid introducing ordinal relationships?
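
An encoding that assigns each category its own 0/1 indicator column implies no ordering between categories. The sketch below shows such a one-hot encoding in plain Python; the helper name and the three country codes are hypothetical stand-ins for the 50-value 'Country' column.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one column is set per row
        rows.append(row)
    return categories, rows

columns, encoded = one_hot(["FR", "US", "FR", "JP"])
# columns -> ["FR", "JP", "US"]
# encoded -> [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

For a linear model with an intercept, one indicator column is often dropped to avoid perfect collinearity among the columns.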

Question 67 (medium, multiple choice)

A company is building a fraud detection model on an imbalanced dataset (99% legitimate, 1% fraudulent). To improve recall on the minority class, they want to resample data. Which combination of techniques should they use?

Question 68 (hard, multiple choice)

A data engineer is processing a large dataset in Amazon S3 with AWS Glue ETL. The dataset contains timestamps in multiple time zones. The engineer needs to create a feature for hour-of-day consistent across all records. Which approach ensures correctness?
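
An hour-of-day feature is only comparable across records once every timestamp is expressed in a single reference zone. The sketch below normalizes to UTC before extracting the hour; it assumes each timestamp is an ISO 8601 string that carries its own UTC offset, and the helper name and sample values are hypothetical.

```python
from datetime import datetime, timezone

def utc_hour(iso_text):
    """Parse an ISO 8601 timestamp with an offset and return the UTC hour."""
    local = datetime.fromisoformat(iso_text)
    if local.tzinfo is None:
        # Without an offset the instant is ambiguous, so refuse to guess.
        raise ValueError("timestamp has no UTC offset; the zone is ambiguous")
    return local.astimezone(timezone.utc).hour

hours = [
    utc_hour("2023-01-15T10:30:00-05:00"),  # five hours behind UTC
    utc_hour("2023-01-15T15:30:00+00:00"),  # already UTC
]
# Both timestamps denote the same instant, so both hours are 15.
```

Extracting the hour from the raw strings instead would yield 10 and 15 for the same moment in time.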

Question 69 (easy, multiple choice)

A machine learning engineer needs to handle missing values in a dataset containing numerical features. The missingness is completely at random (MCAR). Which imputation strategy is most robust for downstream model performance?

Question 70 (medium, multiple choice)

A team is using Amazon SageMaker for feature engineering. They have a dataset with a column 'TransactionDate' in string format (e.g., '2023-01-15 10:30:00'). They need to create features: year, month, day, hour, and day_of_week. What is the most efficient way to do this in a SageMaker processing job?
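
The five requested features can all be read off a parsed timestamp. The sketch below uses the standard library's `datetime.strptime` on the example string from the question; the helper name is hypothetical, and on a large dataset the same logic would typically be expressed with vectorized date functions in pandas or Spark rather than a per-row Python call.

```python
from datetime import datetime

def datetime_features(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return calendar features."""
    ts = datetime.strptime(text, fmt)
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
    }

features = datetime_features("2023-01-15 10:30:00")
# 15 January 2023 was a Sunday, so day_of_week is 6.
```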

Question 71 (hard, multiple choice)

A data scientist is preparing text data for natural language processing (NLP). The corpus contains many rare words and typos. To reduce dimensionality and improve generalization, they decide to apply stemming and remove stop words. However, after training, the model performs poorly on domain-specific terms. What is the most likely cause?

Question 72 (easy, multiple choice)

An ML engineer needs to split a dataset into training, validation, and test sets. The dataset has a time-based column that should not be leaked. Which split method is most appropriate?

Question 73 (medium, multiple choice)

A company collects sensor data from IoT devices. The data arrives with missing timestamps due to network issues. For anomaly detection, the engineer needs to create features that capture rolling statistics over fixed windows. Which data preprocessing step is essential before feature generation?

Question 74 (hard, multiple choice)

A data scientist is using Amazon SageMaker Data Wrangler for feature engineering on a large dataset stored in S3. The dataset has a column 'ProductCategory' with 1000+ unique values. To reduce dimensionality, they want to group categories that appear less than 1% of the time into an 'Other' category. Which Data Wrangler transform should they use?
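
The grouping logic described in the question can be sketched in plain Python as follows. This shows the idea only and says nothing about which Data Wrangler transform implements it; `group_rare`, the 1% default, and the sample categories are hypothetical.

```python
from collections import Counter


def group_rare(values, min_share=0.01, other_label="Other"):
    """Replace categories whose share of rows is below min_share with other_label."""
    counts = Counter(values)
    total = len(values)
    keep = {value for value, count in counts.items() if count / total >= min_share}
    return [value if value in keep else other_label for value in values]


# Hypothetical column: 'kites' covers 0.5% of rows, below the 1% threshold.
categories = ["books"] * 600 + ["toys"] * 395 + ["kites"] * 5
grouped = group_rare(categories)
print(Counter(grouped))
```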

Question 75 (medium, multi-select)

A machine learning engineer is preparing a dataset for a binary classification model. The dataset has 10,000 rows and 200 features, with 5% positive class. The engineer suspects class imbalance may affect model performance. Which TWO actions should the engineer take to mitigate imbalance? (Choose 2.)

Question 76 (hard, multi-select)

A data engineer is building a feature engineering pipeline in AWS Glue ETL to process streaming data from Amazon Kinesis. The data includes a nested JSON structure with arrays. The engineer needs to flatten the nested structures into a tabular format for machine learning. Which THREE approaches are valid for this task? (Choose 3.)

Question 77 (easy, multi-select)

A data scientist is evaluating data quality for a machine learning project. The dataset has missing values, outliers, and inconsistent formatting. Which TWO steps should the data scientist perform during the data preparation phase? (Choose 2.)

Question 78 (easy, multiple choice)

A data scientist is preparing a dataset for a binary classification model. The dataset has 10,000 records with 100 features. The target variable is imbalanced, with 95% negative class and 5% positive class. Which data preparation step should the data scientist take to address the imbalance before training?

Question 79 (medium, multiple choice)

A company is building a machine learning model on customer transaction data stored in Amazon S3. The 'age' column contains missing values. The data scientist wants to impute the missing values with the median age across all customers. Which approach is MOST efficient for preparing the data at scale?

Question 80 (hard, multiple choice)

A machine learning team is processing a large dataset in Amazon SageMaker using a processing job. The data is stored in S3 in CSV format. The team wants to split the data into training, validation, and test sets (70/20/10) while ensuring that the distribution of a categorical feature 'region' is preserved across splits. Which SageMaker SDK method should they use to write the output?
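
The stratification requirement (not the SDK call the question asks about) can be sketched in plain Python: split each 'region' group separately so every split keeps the same regional mix. The function name, seed, and sample rows below are hypothetical, and integer percentages are used to avoid floating-point rounding surprises.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, percents=(70, 20, 10), seed=42):
    """Split rows into train/val/test while preserving the distribution of `key`."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[key]].append(row)

    train, val, test = [], [], []
    for group_rows in by_group.values():
        rng.shuffle(group_rows)  # randomize within each group only
        n = len(group_rows)
        n_train = n * percents[0] // 100
        n_val = n * percents[1] // 100
        train += group_rows[:n_train]
        val += group_rows[n_train:n_train + n_val]
        test += group_rows[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Hypothetical data: 100 'east' rows and 50 'west' rows.
rows = [{"region": "east", "id": i} for i in range(100)]
rows += [{"region": "west", "id": i} for i in range(100, 150)]
train, val, test = stratified_split(rows, "region")
print(len(train), len(val), len(test))
```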

Question 81 (easy, multiple choice)

A data engineer needs to convert a JSON dataset to Parquet format for efficient querying with Amazon Athena. The JSON files are in an S3 bucket. Which service can perform this conversion with minimal coding?

Question 82 (medium, multiple choice)

A data scientist is exploring data stored in an Amazon Redshift cluster. The data includes timestamp columns with different formats. The scientist wants to create a new column that standardizes the timestamp format to UTC. Which approach is MOST efficient?

Question 83 (hard, multiple choice)

A team is using AWS Glue to process streaming data from Amazon Kinesis. The streaming data contains both structured and semi-structured fields. The team needs to flatten the semi-structured fields into columns for downstream ML training. Which Glue feature is BEST suited?

Question 84 (easy, multiple choice)

A data engineer notices that an AWS Glue ETL job is failing with an Out of Memory error when processing a large dataset. The dataset is 500 GB in size, and the worker type is G.1X. Which change is MOST likely to resolve the issue?

Question 85 (medium, multiple choice)

A company needs to anonymize personally identifiable information (PII) in a dataset before using it for ML. The dataset is stored in S3 as CSV files. The team wants to mask credit card numbers by replacing all digits except the last four with asterisks. Which approach is the most scalable?
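
The masking rule itself is simple to state in code. The sketch below is a per-value illustration only, assuming card numbers arrive as strings that may contain separators; it does not address which service scales best, and `mask_card_number` is a hypothetical helper.

```python
def mask_card_number(card: str, visible: int = 4) -> str:
    """Replace every digit except the last `visible` digits with '*'."""
    total_digits = sum(ch.isdigit() for ch in card)
    digits_to_mask = max(total_digits - visible, 0)
    masked = []
    for ch in card:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append("*")
            digits_to_mask -= 1
        else:
            masked.append(ch)  # keep separators and the trailing digits
    return "".join(masked)


masked = mask_card_number("4111-1111-1111-1234")
print(masked)
```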

Question 86 (hard, multiple choice)

A data scientist is preparing data for a regression model. The target variable has a skewed distribution. The scientist wants to apply a log transformation to make it closer to normal. Which step should be taken before applying log transformation?

Question 87 (medium, multi-select)

A data engineer is preparing a dataset for a classification model. The dataset contains duplicate rows. Which TWO approaches are appropriate to handle duplicates in AWS? (Choose 2.)

Question 88 (hard, multi-select)

A data scientist is cleaning a text dataset for natural language processing. The raw data contains HTML tags, URLs, and special characters. Which THREE steps should be taken to preprocess the text data? (Choose 3.)

Question 89 (medium, multi-select)

A company is preparing data for a time-series forecasting model. The data is collected from IoT sensors at irregular intervals. Which TWO steps are necessary to prepare the data? (Choose 2.)

Question 90 (easy, multiple choice)

Refer to the exhibit. A data scientist is trying to use AWS Glue to read data from the S3 bucket `ml-data-bucket`. The Glue job fails with an access denied error. What is the most likely cause?

Exhibit

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::ml-data-bucket/*"
    }
  ]
}

Question 91 (medium, multiple choice)

Refer to the exhibit. A data engineer runs a Glue ETL job that uses a Python script. The job fails because of a missing module `scikit-learn`. Which fix is MOST appropriate?

Exhibit

Question 92 (hard, multiple choice)

Refer to the exhibit. A data engineer deploys this Glue job via CloudFormation. When run, the job fails with a timeout after 2 hours. The job processes a large dataset and is expected to take 3 hours. Which change would resolve the issue?

Exhibit

Question 93 (easy, multiple choice)

A data scientist is preparing a dataset for binary classification using SageMaker. The dataset has 100 features and 10,000 rows, but the target variable is highly imbalanced (95% negative, 5% positive). Which technique should the data scientist apply during data preparation to address the imbalance?

Question 94 (medium, multiple choice)

A machine learning engineer is using SageMaker Processing to run a scikit-learn preprocessing script. The script reads a CSV file from S3, applies a StandardScaler, and writes the output. The job fails with a 'MemoryError'. Which change should the engineer make to the data preparation process?

Question 95 (hard, multiple choice)

A company uses AWS Glue to run ETL jobs that prepare data for machine learning. The source data in S3 has a schema that evolves over time (new columns are added occasionally). The Glue job schema is defined as a fixed schema in the job script. After an update to the source data, the Glue job fails with an error about mismatched schemas. How should the data engineer modify the data preparation process to handle schema evolution?

Question 96 (medium, multi-select)

A data scientist is using SageMaker Data Wrangler to prepare features for a classification model. Which TWO statements about feature engineering in Data Wrangler are correct? (Choose 2.)

Question 97 (hard, multi-select)

A data engineer is optimizing Amazon Athena queries on large datasets stored in S3 for machine learning data preparation. Which THREE practices improve query performance? (Choose 3.)

Question 98 (easy, multiple choice)

A team is building a machine learning model for natural language processing using SageMaker BlazingText. The data preparation step must format the training data correctly. What format does BlazingText require for supervised text classification?

Question 99 (medium, multiple choice)

A company uses Amazon SageMaker Ground Truth to create labeled datasets for object detection. The output must be in COCO format for downstream model training. How should the data preparation process be configured?

Question 100 (easy, multiple choice)

A machine learning engineer is using SageMaker Data Wrangler to perform data validation. Which step should be added to the pipeline to ensure data quality before training?

Question 101 (hard, multiple choice)

A data scientist is preparing a large dataset (50 GB) for training a TensorFlow model on SageMaker. The dataset consists of many small CSV files. Training is slow due to I/O bottlenecks. Which data preparation strategy most effectively accelerates training?

Question 102 (hard, multiple choice)

Refer to the exhibit. A data engineer runs an AWS Glue ETL job with the following script portion. The job fails with an error: 'An error occurred while calling o113.pyWriteDynamicFrame. No such file or directory'. What is the most likely cause?

Exhibit

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

raw = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/input/year=2023/month=01/"]},
    format="json")

transformed = raw.select_fields(["col1", "col2"]).rename_field("col1", "new_col")

glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/transformed"},
    format="parquet")

job.commit()

Question 103 (medium, multiple choice)

Refer to the exhibit. A SageMaker Processing job fails with the following error log. Which change during data preparation would resolve the issue?

Exhibit

ProcessingJobError: Execution failed
Error: Traceback (most recent call last):
  File "/opt/ml/processing/input/code/preprocess.py", line 45, in <module>
    df['age'] = df['age'].apply(float)
ValueError: could not convert string to float: 'twenty-five'
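
For context on the traceback, the sketch below shows one defensive way to coerce a column to numbers so that non-numeric entries become missing values instead of crashing the job. It is an illustration under stated assumptions, not the answer key; `to_float_or_none` and the sample values are hypothetical.

```python
def to_float_or_none(value):
    """Convert to float; return None for entries that are not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Hypothetical 'age' column containing a free-text entry and a missing entry.
ages = ["31", "twenty-five", "47.5", None]
cleaned = [to_float_or_none(a) for a in ages]
print(cleaned)
```

The resulting None entries can then be handled by whatever imputation or filtering step the pipeline already applies to missing values.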

Question 104 (easy, multi-select)

A data engineer is using SageMaker Pipelines to automate data preparation. Which TWO statements about data validation within a pipeline are correct? (Choose 2.)

Question 105 (medium, multiple choice)

A company is building a time series forecasting model using SageMaker DeepAR. The raw data is a CSV with columns: timestamp, item_id, and value. What is the correct data format required for DeepAR training?

Question 106 (medium, multi-select)

A data engineer is using Amazon Athena to query a partitioned dataset stored in S3. Which THREE actions are necessary to ensure the queries can access the data and run efficiently? (Choose 3.)

Question 107 (hard, multiple choice)

A company operates an IoT platform that ingests sensor data from thousands of devices. Data is streamed through Amazon Kinesis Data Streams and delivered to an S3 bucket by a Kinesis Firehose delivery stream, which writes data in 5-minute windows. The data is then used to train a machine learning model for anomaly detection. Recently, the data science team noticed that the training dataset is always missing the last 5 minutes of events from each day. The S3 objects show that the last delivery stream buffer window is incomplete. The data engineer checked the Kinesis Firehose metrics and found no delivery errors or data loss, and the 'IncomingBytes' and 'IncomingRecords' metrics show consistent data for all periods. The S3 bucket's lifecycle policies do not delete objects. The team suspects the issue is related to the data preparation pipeline. Which course of action would correctly resolve the missing data problem?

Question 108 (easy, multiple choice)

A data scientist is preparing a dataset for a binary classification model to predict customer churn. The dataset contains a timestamp column 'signup_date' that is not relevant for the prediction. What is the most appropriate action to handle this column?

Question 109 (easy, multiple choice)

A machine learning engineer is building a regression model to predict house prices. The feature 'square_footage' has values ranging from 500 to 10,000, while 'num_bedrooms' ranges from 1 to 10. Which preprocessing step is most critical before training a model that uses gradient descent?
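
To show why the differing ranges matter, here is a small standard-library sketch of z-score standardization, one common way to put features on a comparable scale. It is offered as an illustration rather than the answer key; the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no signal
    return [(v - mu) / sigma for v in values]


# Hypothetical samples spanning the ranges mentioned in the question.
square_footage = [500, 2500, 10000]
num_bedrooms = [1, 3, 10]
scaled_sqft = standardize(square_footage)
scaled_beds = standardize(num_bedrooms)
print(scaled_sqft, scaled_beds)
```

After scaling, both features have the same mean and spread, so neither dominates the gradient updates simply because of its units.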

Question 110 (medium, multiple choice)

A company uses Amazon SageMaker Data Wrangler to create a data flow for a classification model. The dataset contains a high-cardinality categorical feature 'product_id' with 50,000 unique values. The data scientist wants to reduce dimensionality while preserving predictive power. Which approach is most effective?

Question 111 (medium, multiple choice)

A data engineer needs to prepare a large dataset (10 TB) stored in Amazon S3 for a training job on SageMaker. The data is in CSV format, but the training algorithm expects Parquet for performance. The engineer must transform the data with minimal cost and without writing custom code. Which service should be used?

Question 112 (hard, multiple choice)

A data science team is building a model to predict fraudulent transactions. The dataset has 1 million legitimate transactions and only 1,000 fraudulent ones. They plan to use Amazon SageMaker to train a model. Which data preparation technique should they apply to address the severe class imbalance before training?

Question 113 (hard, multiple choice)

A company is training a deep learning model on Amazon SageMaker using a dataset stored in Amazon S3. The training job is taking a long time due to I/O bottlenecks. The data is in JSON lines format. Which data preparation step combined with SageMaker's best practices would most effectively reduce training time?

Question 114 (hard, multiple choice)

A data engineer is using Amazon SageMaker Processing to run a data preprocessing script on a dataset with 500 million rows. The script runs out of memory on a single ml.r5.24xlarge instance. The engineer needs to modify the processing job to handle the dataset size. Which approach is most cost-effective and scalable?

Question 115 (easy, multi-select)

Which TWO actions are recommended best practices when preparing training data for a machine learning model in AWS? (Choose two.)

Question 116 (medium, multi-select)

A data scientist is using Amazon SageMaker Data Wrangler to prepare a dataset. Which TWO features of Data Wrangler can be used to handle imbalanced classification problems? (Choose two.)

Question 117 (hard, multi-select)

A company is preparing a large dataset for a SageMaker built-in XGBoost model. The dataset has missing values in both numeric and categorical features, and some categorical features have high cardinality. Which THREE data preparation steps should the company take to optimize model performance? (Choose three.)

Question 118 (medium, multiple choice)

A company runs an online retail business and wants to build a product recommendation system. They have a dataset of customer purchases stored in Amazon S3 as CSV files. The dataset includes columns: 'customer_id', 'product_id', 'purchase_date', 'quantity', 'price', and 'category'. The data science team plans to use Amazon SageMaker to train a factorization machines model. During data exploration, they discover that the 'category' column has 1,200 unique values, and many categories appear only a few times. The 'product_id' column has 50,000 unique values. They want to include both features in the model. The team is concerned about the high cardinality of these features. Which approach should they take to prepare these features for the factorization machines model?

Question 119 (medium, multiple choice)

A healthcare company is building a model to predict patient readmission rates. The dataset contains a mix of numeric features (age, blood pressure, lab test results) and categorical features (gender, diagnosis code, hospital department). The dataset has 2 million rows. The data is stored in an Amazon S3 bucket, and they use AWS Glue to catalog and preprocess the data. The data scientist notices that the 'diagnosis_code' column has 10,000 unique codes, and 20% of the rows have missing values for 'blood_pressure'. They plan to use a SageMaker built-in XGBoost model. For optimal model performance, which preprocessing steps should they apply using AWS Glue ETL?

Question 120 (hard, multiple choice)

A financial services company is developing a fraud detection model using Amazon SageMaker. They have a dataset with 10 million transactions, each with 300 features. The dataset is highly imbalanced (0.1% fraud). They have performed feature engineering and now need to split the data for training, validation, and test sets. The data is stored in CSV files in Amazon S3. They plan to use SageMaker's built-in XGBoost algorithm. To ensure proper evaluation and avoid data leakage, which data splitting strategy should they use?

Question 121 (hard, multiple choice)

An e-commerce company uses Amazon SageMaker to train a model that predicts click-through rates. The training data includes a timestamp column 'click_time' and a categorical feature 'device_type' (8 values). They notice that the model's performance degrades over time because the data distribution shifts. They want to ensure the training data represents the most recent behavior. The data is stored in a daily partitioned S3 bucket (e.g., s3://bucket/data/2024-01-01/). The total dataset size is 500 GB. Which approach should they take to prepare the training data while minimizing bias and cost?

Question 122 (medium, multiple choice)

A data engineer is building a data pipeline for a machine learning model that requires both structured and unstructured data. The structured data (customer demographics) is in Amazon RDS, and the unstructured data (customer support chat logs) is in Amazon S3 as JSON files. The engineer needs to combine these datasets into a single training dataset stored in S3 in Parquet format. They must also perform feature engineering such as text vectorization on the chat logs. The pipeline should be serverless and cost-effective. Which approach should they use?

Question 123 (medium, multi-select)

A data scientist needs to prepare a dataset for a binary classification model. The dataset contains 100,000 records with 50 features, including categorical variables with high cardinality, missing values in 30% of records for a key numeric feature, and a severe class imbalance (5% positive class). The data is stored in an Amazon S3 bucket. Which TWO actions should the data scientist take to improve model performance and ensure robust data preparation? (Choose two.)

Question 124 (easy, multiple choice)

A retail company is building a machine learning model to predict customer churn. The data engineering team has extracted customer transaction data from Amazon Aurora and stored it as CSV files in Amazon S3. The data includes customer IDs, transaction amounts, timestamps, and product categories. A data scientist discovers that the 'transaction_amount' value is missing in about 15% of the records and that the 'customer_id' column contains some duplicate entries. The team wants to prepare the data for training a churn model using Amazon SageMaker. The data is approximately 50 GB in size. What should the data scientist do to handle the missing values and duplicates efficiently while preparing the data for training?

Question 125 (medium, multiple choice)

A financial services company is building a fraud detection model using historical transaction data stored in Amazon S3. The data includes features such as transaction amount, merchant category, time of day, and user location. The data scientist observes that the 'merchant_category' column is a text attribute with over 200 unique values. Additionally, the 'transaction_amount' column has a long-tail distribution with extreme outliers. The dataset is 200 GB in size, and the company wants to use Amazon SageMaker for model training. The data scientist needs to engineer features that capture the high-cardinality category and reduce the impact of outliers. What is the MOST efficient and effective approach to prepare this data?

Question 126 (hard, multiple choice)

A social media company is processing a real-time stream of user activity data from Amazon Kinesis Data Streams to train a machine learning model for content recommendation. The raw data includes user ID, timestamp, content ID, interaction type (like, share, comment), and device type. The data scientists need to aggregate features per user over a sliding window of 7 days, including counts of interaction types, unique content IDs engaged, and a moving average of interaction timestamps. The aggregated data will be used to update a user embedding model. The streaming data volume is approximately 500 records per second, and the company uses an AWS Glue streaming ETL job for transformation. However, the Glue job is failing frequently with high latency and checkpoint errors. The team needs a more robust solution to prepare the streaming data features. Which approach should the team take?

Question 127 (medium, multiple choice)

A healthcare company is developing a predictive model to identify patients at risk of readmission within 30 days after discharge. The dataset contains electronic health record (EHR) data from multiple hospitals, stored as Parquet files in Amazon S3. The data includes patient demographics, diagnoses (ICD-10 codes), medications, lab results, and length of stay. A data scientist notices that the 'lab_result' column has a high number of null values (over 60%) because some tests are not applicable to all patients. Additionally, the 'diagnosis_code' column has over 10,000 unique ICD-10 codes. The company wants to build a model that complies with HIPAA and performs well. The data scientist must prepare the features efficiently using AWS services. Which combination of steps should the data scientist take? (Assume the company can use any AWS service.)

Question 128 (easy, multiple choice)

A marketing company is preparing a dataset to train a logistic regression model to predict whether a customer will click on an online ad. The dataset includes 1 million records with features: customer_age (numeric), income (numeric), education_level (ordinal: high school, bachelor, master, PhD), and ad_category (categorical: 50 unique values). The data is stored in a CSV file in Amazon S3. The data scientist plans to use Amazon SageMaker's built-in linear learner algorithm. The data scientist needs to preprocess the data before training. What is the correct sequence of data preparation steps that should be applied to this dataset to ensure optimal model performance?