A data scientist is training a binary classification model to detect fraudulent transactions. The dataset is highly imbalanced with only 1% fraud cases. Which technique is most appropriate to address the class imbalance?
Oversampling adds duplicated or synthetic instances of the minority class, helping the model learn a better decision boundary.
Why this answer
Oversampling the minority class (e.g., using SMOTE or random oversampling) is the most appropriate technique because it balances the training data by generating synthetic or duplicate examples of the fraud cases, allowing the model to learn the decision boundary for the minority class without discarding valuable majority-class data. This directly addresses the class imbalance where only 1% of transactions are fraudulent. It typically improves recall on fraud cases; precision can drop as a side effect, so both should be measured on test data that has not been resampled.
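To make the idea concrete, here is a minimal sketch of random oversampling using only NumPy. The function name random_oversample, the toy data, and the 10,000-row dataset size are assumptions made for illustration; they do not come from the question. SMOTE, which interpolates new points between minority-class neighbours, is not shown. In practice the resampling would be applied to the training split only, so that the evaluation data keeps the real 1% fraud rate.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate minority rows (sampled with replacement) until both classes
    have the same number of rows. Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    n_extra = len(majority_idx) - len(minority_idx)
    if n_extra <= 0 or len(minority_idx) == 0:
        return X, y  # already balanced, or nothing to duplicate

    # Draw extra minority rows with replacement; every majority row is kept.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra_idx])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Assumed toy data: 10,000 transactions, 1% labelled as fraud (label 1).
data_rng = np.random.default_rng(42)
y = np.zeros(10_000, dtype=int)
y[:100] = 1
X = data_rng.normal(size=(10_000, 4))

X_bal, y_bal = random_oversample(X, y)

# Class counts before: 9,900 legitimate and 100 fraud.
# Class counts after: 9,900 of each, with no legitimate rows removed.
print(np.bincount(y), np.bincount(y_bal))
```

Note that all 9,900 majority-class rows survive; only the fraud rows are repeated.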
Exam trap
Cisco often tests the misconception that undersampling is always better because it reduces dataset size and training time, but the trap here is that undersampling discards majority-class data, which can severely degrade model performance when the imbalance is extreme (e.g., 1:99 ratio).
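The cost of undersampling at a 1:99 ratio can be checked with simple arithmetic. The sketch below assumes a hypothetical dataset of 100,000 transactions and a target of one majority row per minority row; neither figure comes from the question, and undersample_counts is an illustrative helper, not a library function.

```python
def undersample_counts(n_total, minority_fraction):
    """Return (rows kept, majority rows discarded) when the majority class
    is undersampled down to the size of the minority class."""
    n_minority = round(n_total * minority_fraction)
    n_majority = n_total - n_minority
    kept = 2 * n_minority                # all minority rows + equal number of majority rows
    discarded = n_majority - n_minority  # majority rows thrown away
    return kept, discarded

# Assumed dataset size: 100,000 transactions with 1% fraud.
kept, discarded = undersample_counts(100_000, 0.01)
print(kept, discarded)  # 2000 98000
```

Under these assumptions, balancing by undersampling trains on 2,000 rows and throws away 98,000 legitimate transactions, which is about 99% of the majority class.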
How to eliminate wrong answers
Option A is wrong because linear regression is a regression algorithm, not a classification model. Its outputs are unbounded continuous values rather than class probabilities, and it does nothing to address class imbalance.
Option C is wrong because undersampling the majority class discards a large amount of potentially useful non-fraud data. That loss of information can lead to poor generalization, especially when the imbalance is severe (99% majority).
Option D is wrong because increasing the learning rate does not address class imbalance. It only changes the step size of gradient descent, which affects convergence speed and may cause the model to overshoot the optimum; the class distribution of the dataset stays the same.