A data scientist is working on a regression problem to predict house prices. The dataset has 80 features, including categorical variables with high cardinality (e.g., zip code with 10,000 unique values). The target variable is log-transformed. The data scientist trains a linear regression model and obtains an R² of 0.45 on the test set. To improve performance, the data scientist considers:
A) Applying one-hot encoding to all categorical features and using Ridge regression.
B) Using target encoding for high-cardinality features and using a tree-based model like XGBoost.
C) Removing all categorical features and using polynomial features for numerical features.
D) Using principal component analysis (PCA) on all features before training a linear model.
Which approach is MOST likely to improve the model's performance?
B) Target encoding replaces each high-cardinality feature with a single numeric column, and XGBoost captures non-linear patterns and interactions.
Why this answer
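Target encoding replaces each category with a statistic of the target (typically its mean) for that category, so a feature such as zip code with 10,000 unique values becomes one numeric column instead of 10,000. Tree-based models like XGBoost can capture non-linear relationships and feature interactions that a linear model misses, so this combination is the most likely to raise R² above 0.45. One-hot encoding (A) would add roughly 10,000 sparse columns for zip code alone, most with very few observations each, which Ridge regularization only partly mitigates. Removing all categorical features (C) discards information such as location, which is usually a strong predictor of house prices. PCA (D) is unsupervised: it keeps the directions of highest variance, which are not necessarily the ones most predictive of price, and the model that follows is still linear.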
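To make the encoding step concrete, here is a minimal sketch of smoothed target encoding in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function names, the `smoothing` parameter, the round-robin fold assignment, and the toy zip codes and log prices are all hypothetical. The out-of-fold variant shows one common way to keep a row's own target from leaking into its encoded feature. The resulting numeric column would then be passed to a tree-based model such as XGBoost, which is not shown here.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Return (mapping, global_mean) for smoothed target encoding.

    Each category maps to a weighted average of its own target mean and the
    global mean. `smoothing` (an assumed default) controls how strongly rare
    categories are pulled toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


def out_of_fold_encode(categories, targets, n_folds=5, smoothing=10.0):
    """Encode each training row using statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Round-robin fold assignment (an assumption made for simplicity).
        held_out = [i for i in range(n) if i % n_folds == fold]
        rest = [i for i in range(n) if i % n_folds != fold]
        mapping, global_mean = fit_target_encoding(
            [categories[i] for i in rest],
            [targets[i] for i in rest],
            smoothing,
        )
        for i in held_out:
            encoded[i] = mapping.get(categories[i], global_mean)
    return encoded


# Toy data (hypothetical): zip codes and log-transformed prices.
zip_codes = ["10001", "10001", "94105", "94105", "94105", "60601"]
log_prices = [13.0, 13.2, 14.0, 14.1, 13.9, 12.5]

# Training rows: out-of-fold encoding to limit target leakage.
train_encoded = out_of_fold_encode(zip_codes, log_prices, n_folds=3, smoothing=2.0)

# Test rows: encode with a mapping fitted on the full training set.
mapping, global_mean = fit_target_encoding(zip_codes, log_prices, smoothing=2.0)
test_encoded = apply_target_encoding(["94105", "99999"], mapping, global_mean)
```

Whatever the number of distinct zip codes, the encoded feature stays a single column. Smoothing keeps a category with only a handful of rows from being assigned an extreme value.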