A data scientist is preparing a dataset for training a classification model. The dataset contains 10,000 records with a binary target variable where 9,500 belong to class A and 500 belong to class B. Which technique should the scientist use to address the class imbalance?
Trap 1: Random undersampling of class A
Undersampling class A to match class B would discard up to 9,000 of the 10,000 records and may lose important majority-class patterns.
Trap 2: Adding Gaussian noise to class B
Adding noise perturbs the existing class B records but does not create new informative samples.
Trap 3: Principal Component Analysis (PCA)
PCA reduces the number of features; it does not address class imbalance.
- A
SMOTE (Synthetic Minority Oversampling Technique)
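SMOTE creates synthetic minority-class samples by interpolating between existing class B records and their nearest class B neighbors, which balances the classes without discarding data.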
- B
Random undersampling of class A
Why wrong: Undersampling class A to match class B would discard up to 9,000 of the 10,000 records and may lose important majority-class patterns.
- C
Adding Gaussian noise to class B
Why wrong: Adding noise perturbs the existing class B records but does not create new informative samples.
- D
Principal Component Analysis (PCA)
Why wrong: PCA reduces the number of features; it does not address class imbalance.
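
Illustration for option A: the sketch below shows the core idea of SMOTE using only the Python standard library. It is a simplified illustration, not the reference algorithm. The function name `smote_sketch`, the parameters `n_new`, `k`, and `seed`, and the toy 2-D data are all assumptions made for this example. In practice one would more likely use a maintained implementation such as the `SMOTE` class in the imbalanced-learn package.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points (hypothetical helper, not a library API).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed for illustration): 95 class A points, 5 class B points,
# mirroring the 19:1 ratio in the question at a smaller scale.
data_rng = random.Random(42)
class_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(95)]
class_b = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(5)]

new_b = smote_sketch(class_b, n_new=len(class_a) - len(class_b), k=3)
balanced_b = class_b + new_b
print(len(class_a), len(balanced_b))  # prints "95 95"
```

Because each synthetic point is interpolated between two real class B records, it stays inside the region the minority class already occupies. As a general precaution, oversampling is usually applied to the training split only, so that synthetic records do not leak into evaluation data.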