A data scientist is exploring a dataset of customer transactions. The dataset has 1 million rows and 50 columns. The target variable is a binary flag indicating whether a customer churned. The data scientist computes a correlation matrix over all numerical features and finds that two features have a correlation coefficient of 0.98. Which action should be taken to improve model performance?
Trap 1: Create an interaction term between the two features.
An interaction term adds a third column derived from the same two features, which can worsen multicollinearity and adds complexity without removing the redundancy.
Trap 2: Increase the regularization parameter (e.g., lambda) in the model.
Stronger regularization shrinks the coefficients and can dampen instability, but both redundant features stay in the model, so the redundancy itself is not addressed.
Trap 3: Apply mean-centering to both features to reduce correlation.
Pearson correlation is invariant to shifting a variable by a constant, so mean-centering leaves the coefficient at 0.98.
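The trap rationales above can be checked numerically. The following is a minimal sketch on synthetic data, not the dataset from the question: the variables `x1` and `x2`, the noise level, and the seed are all assumptions chosen so that the pair correlates at roughly 0.98. The interaction check applies to features whose mean is far from zero, as in this synthetic example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; synthetic data for illustration only
n = 10_000

# Hypothetical features: x2 is x1 plus a little noise, so the pair is nearly redundant.
x1 = rng.normal(loc=50.0, scale=10.0, size=n)
x2 = x1 + rng.normal(loc=0.0, scale=2.0, size=n)


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Trap 3: subtracting the mean shifts each feature but leaves the correlation unchanged.
r_raw = pearson(x1, x2)
r_centered = pearson(x1 - x1.mean(), x2 - x2.mean())

# Trap 1: the product of the two features is a new column that is itself
# strongly correlated with x1 here, so it adds redundancy instead of removing it.
interaction = x1 * x2
r_inter_x1 = pearson(interaction, x1)

print(f"raw correlation:           {r_raw:.4f}")
print(f"after mean-centering:      {r_centered:.4f}")
print(f"interaction term vs. x1:   {r_inter_x1:.4f}")
```

The first two printed values agree up to floating-point error, which is the point of Trap 3.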
- A
Create an interaction term between the two features.
Why wrong: An interaction term adds a third column derived from the same two features, which can worsen multicollinearity and adds complexity without removing the redundancy.
- B
Remove one of the two highly correlated features from the dataset.
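Removing one feature eliminates the multicollinearity between the pair at little cost in information (r = 0.98 means the two features share about 96% of their variance), which simplifies the model and improves interpretability.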
- C
Increase the regularization parameter (e.g., lambda) in the model.
Why wrong: Stronger regularization shrinks the coefficients and can dampen instability, but both redundant features stay in the model, so the redundancy itself is not addressed.
- D
Apply mean-centering to both features to reduce correlation.
Why wrong: Pearson correlation is invariant to shifting a variable by a constant, so mean-centering leaves the coefficient at 0.98.
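The correct option, removing one feature of a highly correlated pair, might be sketched as follows with pandas. The helper `drop_highly_correlated`, the 0.95 threshold, and the column names are hypothetical and chosen only for illustration; the data is synthetic, not the transactions dataset from the question.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold.

    Hypothetical helper: returns the reduced frame and the list of dropped columns.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Synthetic example with one nearly redundant pair (assumed column names).
rng = np.random.default_rng(0)
n = 5_000
spend = rng.normal(50.0, 10.0, n)
features = pd.DataFrame(
    {
        "monthly_spend": spend,
        "monthly_spend_with_tax": spend + rng.normal(0.0, 2.0, n),
        "tenure_months": rng.integers(1, 72, n).astype(float),
    }
)

reduced, dropped = drop_highly_correlated(features, threshold=0.95)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

This sketch always keeps the column that appears first. In practice, which member of the pair to keep is a judgment call, for example the one with fewer missing values or the clearer business meaning.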