A data scientist is working on a project to predict customer churn. The dataset contains 50,000 rows and 20 features, including categorical variables like 'Region' (10 categories) and 'SubscriptionType' (5 categories). The target variable is binary (churn or not). During exploratory data analysis, they plot the distribution of each feature and notice that 'Region' has a highly imbalanced distribution: one region accounts for 80% of the data. Which of the following is the most appropriate next step?
This reduces sparsity and helps the model learn patterns for rare categories.
Why this answer
Option B is correct because the rare levels of an imbalanced categorical feature have too few rows for the model to learn reliable patterns, so it may effectively ignore them; grouping rare levels into an 'Other' category pools those rows and can improve model performance. Option A is wrong because removing the feature could discard useful information. Option C is wrong because one-hot encoding alone does not address the imbalance; the rare levels just become sparse indicator columns. Option D is wrong because oversampling addresses imbalance in the target variable, not in a feature.
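The grouping described in Option B can be sketched in pandas as follows. This is a minimal illustration, not a prescribed implementation: the helper name `group_rare_levels`, the 5% threshold, the region names, and the toy counts are all assumptions made for the example and do not come from the question.

```python
import pandas as pd


def group_rare_levels(series: pd.Series, min_share: float = 0.05,
                      other_label: str = "Other") -> pd.Series:
    """Replace levels whose share of rows is below min_share with other_label.

    Hypothetical helper; the 5% default is an illustrative choice.
    """
    # Fraction of rows belonging to each level.
    shares = series.value_counts(normalize=True)
    # Levels that fall below the threshold.
    rare = shares[shares < min_share].index
    # Keep common levels, overwrite rare ones with the pooled label.
    return series.where(~series.isin(rare), other_label)


# Toy 'Region' column (made-up values): one level holds 80% of the rows.
region = pd.Series(
    ["North"] * 80 + ["South"] * 12 + ["East"] * 4 + ["West"] * 3 + ["Central"] * 1,
    name="Region",
)

grouped = group_rare_levels(region, min_share=0.05)
counts = grouped.value_counts().to_dict()
print(counts)
```

In a real pipeline the set of rare levels would normally be determined on the training split only and then applied unchanged to validation and test data, so that the threshold does not leak information across splits.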