A machine learning engineer is analyzing a dataset that contains a categorical feature 'country' with 200 unique values. The target variable is binary. The engineer wants to use this feature in a linear model. Which encoding method should be applied during EDA to prepare the data for modeling, considering the high cardinality?
Target encoding captures the relationship with the target, and cross-validation prevents data leakage.
Why this answer
Option D is correct because target encoding (or mean encoding) replaces each category with the mean of the target, which is suitable for high cardinality in linear models but requires careful validation to avoid overfitting. Option A is wrong because one-hot encoding would create 199 dummy variables, leading to high dimensionality. Option B is wrong because label encoding imposes an arbitrary ordinal relationship.
Option C is wrong because frequency encoding may not capture the relationship with the target.