A machine learning engineer is preparing a dataset that contains both numerical and categorical features. The categorical features have high cardinality (e.g., zip code with thousands of unique values). Which technique is most appropriate for encoding these high-cardinality categorical features?
Target encoding: encodes each category using the target mean and handles high cardinality well.
Why this answer
Target encoding is the most appropriate technique for high-cardinality categorical features because it replaces each category with the mean of the target variable for that category. The feature stays a single numeric column while still carrying predictive signal. This avoids the dimensionality explosion of one-hot encoding and the arbitrary ordering of label encoding, which is why it is a common choice with gradient boosting frameworks like XGBoost or LightGBM on datasets with thousands of unique categories. Because the encoding is derived from the target, it should be fit on training data only (ideally with smoothing or out-of-fold estimates) to avoid target leakage.
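The idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the helper names (fit_target_encoding, apply_target_encoding) and the toy zip codes are hypothetical, there is no smoothing or out-of-fold handling, and unseen categories simply fall back to the global target mean.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Learn a category -> mean target mapping plus a global-mean fallback."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {category: sums[category] / counts[category] for category in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean

def apply_target_encoding(categories, mapping, fallback):
    """Replace each category with its learned mean; unseen categories get the fallback."""
    return [mapping.get(category, fallback) for category in categories]

# Hypothetical toy training data: one categorical column and a binary target.
train_zip_codes = ["10001", "10001", "94105", "94105", "60601"]
train_targets = [1, 0, 1, 1, 0]

# Fit on training rows only, then reuse the mapping for new rows.
mapping, global_mean = fit_target_encoding(train_zip_codes, train_targets)
train_encoded = apply_target_encoding(train_zip_codes, mapping, global_mean)
new_encoded = apply_target_encoding(["10001", "99999"], mapping, global_mean)

print(train_encoded)
print(new_encoded)
```

However many distinct zip codes appear, the output remains one numeric value per row. In practice a library encoder that adds smoothing and cross-fitting would likely be preferable to this hand-rolled version.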
Exam trap
AWS often tests the misconception that one-hot encoding is always the safest choice for categorical data. With high cardinality it becomes impractical, and candidates who miss this tend to overlook target encoding as the more efficient alternative.
How to eliminate wrong answers
Option A is wrong because label encoding assigns arbitrary integer values to categories, which introduces a false ordinal relationship. Models can treat those integers as a meaningful order (linear and distance-based models most directly, and tree-based models through threshold splits over arbitrary ranges), which can degrade performance.
Option B is wrong because one-hot encoding creates a binary column for each unique category. With thousands of categories this produces an extremely high-dimensional, sparse feature space that strains memory and increases the risk of overfitting.
Option C is wrong because frequency encoding replaces categories with their occurrence counts. This ignores the relationship between the category and the target variable, and categories with equal counts become indistinguishable, so it often yields weaker predictive power than target encoding.
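To make the contrast between the three rejected options concrete, here is a small sketch in plain Python. The toy zip codes are hypothetical, and the label order (sorted category order) is just one arbitrary choice among many.

```python
from collections import Counter

# Hypothetical toy column; a real zip code feature would have thousands of values.
zip_codes = ["10001", "94105", "10001", "60601", "94105", "10001"]
categories = sorted(set(zip_codes))

# Label encoding: one integer per category. The order is arbitrary
# (here, sorted order), yet a model may read it as a ranking.
label_map = {category: index for index, category in enumerate(categories)}
label_encoded = [label_map[z] for z in zip_codes]

# One-hot encoding: one binary column per unique category,
# so the row width grows with cardinality.
one_hot_encoded = [[1 if z == category else 0 for category in categories]
                   for z in zip_codes]

# Frequency encoding: each category becomes its occurrence count,
# computed without looking at the target at all.
counts = Counter(zip_codes)
frequency_encoded = [counts[z] for z in zip_codes]

print(label_encoded)
print(len(one_hot_encoded[0]))  # number of one-hot columns = number of categories
print(frequency_encoded)
```

With three categories the one-hot rows are three columns wide. With thousands of zip codes they would be thousands of columns wide, while label and frequency encoding stay at one column but carry no information about the target.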