A data scientist is analyzing a dataset with high-cardinality categorical features (e.g., user IDs with millions of unique values). They want to visualize the relationship between these categorical features and a continuous target variable. Which approach is most effective for exploratory data analysis (EDA)?
Grouping rare categories reduces cardinality, and box plots then show the relationship with the target effectively.
Why this answer
For high-cardinality categorical features, grouping rare categories into an 'Other' category reduces cardinality and makes visualizations such as box plots meaningful. Option A is wrong because removing the feature loses information. Option B is wrong because one-hot encoding creates too many columns and is not suitable for visualization.
Option D is wrong because visualizing millions of categories is not feasible. Option E is wrong because feature hashing is for modeling, not EDA visualization.
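As a rough illustration of the grouping step, here is a minimal sketch in pandas. The helper names (`group_rare_categories`, `plot_target_by_category`), the column names, the toy data, and the `min_count` threshold are all assumptions made for this example; a suitable threshold depends on the dataset.

```python
import pandas as pd


def group_rare_categories(series, min_count=100, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories as they are; everything else becomes other_label.
    return series.where(series.isin(frequent), other_label)


def plot_target_by_category(df, cat_col, target_col):
    """Draw one box per remaining category (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ax = df.boxplot(column=target_col, by=cat_col, rot=45)
    ax.set_ylabel(target_col)
    plt.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
    return ax


# Toy data (hypothetical): two frequent users and three one-off users.
df = pd.DataFrame({
    "user_id": ["u1"] * 5 + ["u2"] * 4 + ["u3", "u4", "u5"],
    "spend": [10.0, 12.0, 11.0, 13.0, 9.0,
              20.0, 22.0, 21.0, 19.0,
              5.0, 50.0, 7.0],
})

# Users with fewer than 4 rows are pooled into "Other".
df["user_group"] = group_rare_categories(df["user_id"], min_count=4)
```

Calling `plot_target_by_category(df, "user_group", "spend")` would then draw three boxes (u1, u2, Other) instead of one per user ID. On real data the threshold is usually chosen so that only a few dozen groups remain, since a box plot with more groups than that becomes hard to read.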