A data scientist is preparing a large dataset for training a machine learning model. The dataset contains missing values in several columns. Which approach is the MOST efficient for handling missing values in a large dataset using AWS services?
Trap 1: Use AWS Glue ETL to write a custom Python script that imputes…
A custom script has to be written, tested, and maintained, which is more effort than applying a built-in transform to a large dataset.
Trap 2: Use pandas in a SageMaker notebook to impute missing values with…
pandas loads the whole dataset into the memory of a single notebook instance, so a large dataset can exhaust memory.
Trap 3: Remove all rows with missing values from the dataset.
Removing rows discards data and can bias the model when values are not missing at random; it is cheap to run but often costs model quality.
- A
Use AWS Glue ETL to write a custom Python script that imputes missing values with the mean.
Why wrong: A custom script has to be written, tested, and maintained, which is more effort than applying a built-in transform to a large dataset.
- B
Use Amazon SageMaker Data Wrangler to impute missing values using built-in transforms.
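Data Wrangler includes built-in transforms for handling missing values (such as impute and drop) in a visual interface, and the same flow can be run as a processing job over the full dataset, so no custom code is needed.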
- C
Use pandas in a SageMaker notebook to impute missing values with the median.
Why wrong: pandas loads the whole dataset into the memory of a single notebook instance, so a large dataset can exhaust memory.
- D
Remove all rows with missing values from the dataset.
Why wrong: Removing rows discards data and can bias the model when values are not missing at random; it is cheap to run but often costs model quality.
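
To make the difference between dropping rows (option D) and imputing a column statistic (options A and C) concrete, here is a minimal in-memory sketch in plain Python. The toy rows, the column names, and the helper functions `drop_missing` and `impute` are hypothetical and exist only for illustration; this is not how Data Wrangler or Glue implement their transforms, and it would not scale to a large dataset.

```python
from statistics import mean, median

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 29, "income": None},
    {"age": 45, "income": 58000.0},
    {"age": 51, "income": None},
]

def drop_missing(rows):
    """Option D: keep only the rows that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def impute(rows, strategy=median):
    """Options A/C: replace each missing value with a per-column statistic."""
    columns = list(rows[0].keys())
    # Compute the fill value for each column from its observed values only.
    fill = {
        c: strategy([r[c] for r in rows if r[c] is not None])
        for c in columns
    }
    return [
        {c: fill[c] if r[c] is None else r[c] for c in columns}
        for r in rows
    ]

dropped = drop_missing(rows)
imputed = impute(rows, strategy=median)

print(len(rows), len(dropped), len(imputed))       # 5 2 5
print(imputed[1]["age"], imputed[2]["income"])     # 39.5 58000.0
```

On this toy input, dropping keeps only 2 of 5 rows, while imputation keeps all 5 and fills the gaps with the column median. Passing `strategy=mean` switches to mean imputation, as in option A.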