A data scientist is using the Amazon SageMaker built-in linear learner algorithm for a regression problem. The dataset has 10 features, some of which have missing values, and the target variable is right-skewed. The data scientist wants to handle the missing values and transform the target variable to improve model performance. Which data preparation steps should the data scientist take?
Correct answer: Option C. Imputing missing values with the median and log-transforming the target handles both the missing data and the skew.
Why this answer
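Option C is correct because median imputation keeps every row and, unlike mean imputation, is robust to outliers and skew in the individual features. Applying a log transformation to the right-skewed target compresses its long right tail and makes its distribution more symmetric. A squared-error regression objective, the usual choice for linear regression, corresponds to assuming roughly normally distributed errors and is sensitive to extreme target values, so the transformed target typically gives a more stable fit and better prediction accuracy. Predictions made on the log scale must be mapped back to the original scale with the inverse transformation.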
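The two steps in Option C can be sketched in a few lines. This is a minimal, dependency-free illustration rather than SageMaker code: the helper names (`median_impute`, `log_transform`, `inverse_log_transform`) and the sample values are hypothetical, and it assumes missing entries are represented as `None` and that the target is non-negative, so `log1p` is used to tolerate zeros.

```python
import math
import statistics


def median_impute(column):
    """Replace None entries with the median of the observed values.

    In practice the median should be computed on the training split only
    and reused unchanged for validation and test data.
    """
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


def log_transform(target):
    """Apply log(1 + y); assumes every target value is greater than -1."""
    return [math.log1p(y) for y in target]


def inverse_log_transform(predictions):
    """Map log-scale predictions back to the original target scale."""
    return [math.expm1(p) for p in predictions]


# Hypothetical data: one feature with gaps and an outlier, one skewed target.
feature = [1.0, None, 3.0, 100.0, None, 2.0]
target = [0.0, 10.0, 20.0, 30.0, 500.0, 10000.0]

imputed = median_impute(feature)       # gaps filled with the median of 1, 2, 3, 100
log_target = log_transform(target)     # the model would be trained on this
recovered = inverse_log_transform(log_target)  # back on the original scale
```

Note that the outlier 100.0 does not pull the fill value upward the way a mean would, which is the robustness the explanation refers to.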
Exam trap
The trap is assuming that standardizing features (Option B) is the decisive step. Scaling the features does nothing about a right-skewed target; transforming the target itself (for example, with a log transform) is what addresses the skew, and imputing missing values rather than dropping rows avoids losing data.
How to eliminate wrong answers
Option A is wrong because one-hot encoding is meant for categorical features; applying it to numeric features would create a column for every distinct value, dramatically increasing dimensionality and discarding the ordering of the values. Dropping rows with missing values also reduces the dataset size and can introduce bias.
Option B is wrong even though standardizing features is beneficial. A Box-Cox transformation requires all target values to be strictly positive (which may not hold), introduces a parameter λ that must be estimated and then reused to invert the predictions, and is not available in SageMaker's built-in linear learner without custom preprocessing. A log transformation, which is the λ = 0 special case of Box-Cox, is the simpler and more common choice for a right-skewed target.
Option D is wrong because removing rows with missing values discards potentially valuable data and can lead to biased models. Normalizing the target to [0, 1] is a linear rescaling: it changes the range of the target but leaves the shape of its distribution, including the skew, unchanged.
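The point about Option D can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original question: the `skewness` and `min_max` helpers and the sample target values are hypothetical, and skewness is computed as the standard population moment ratio m3 / m2^1.5.

```python
import math


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical right-skewed target with a long right tail.
target = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 20.0, 60.0, 400.0]

raw_skew = skewness(target)
scaled_skew = skewness(min_max(target))                # same as raw_skew
log_skew = skewness([math.log1p(y) for y in target])   # noticeably smaller

print(f"raw: {raw_skew:.3f}  min-max: {scaled_skew:.3f}  log1p: {log_skew:.3f}")
```

Because min-max scaling is a linear map with a positive slope, the skewness is identical before and after scaling, while the log transform reduces it.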