A data scientist is training a binary classification model on a dataset with a severe class imbalance (95% negative, 5% positive). The model achieves 95% accuracy but only correctly identifies 10% of the positive class. Which metric should the data scientist use to evaluate model performance?
F1 score balances precision and recall, making it suitable for imbalanced datasets where the minority class is important.
Why this answer
The F1 score is the harmonic mean of precision and recall. Because neither term counts true negatives, F1 is not inflated by the large negative class the way accuracy is. With 95% accuracy but only 10% recall on the positive class, the model behaves almost like a trivial classifier that always predicts the majority class. F1 penalizes both false positives (through precision) and false negatives (through recall), so it gives a balanced view of performance on the minority class.
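To make the gap between accuracy and F1 concrete, the figures in the question can be turned into a confusion matrix. This is a minimal sketch: the dataset size of 10,000 is an assumption chosen only to give whole numbers, and only the ratios (95/5 split, 95% accuracy, 10% recall) come from the question.

```python
# Hypothetical dataset size; only the ratios below come from the question.
n_total = 10_000
n_pos = n_total * 5 // 100        # 5% positive  -> 500
n_neg = n_total - n_pos           # 95% negative -> 9,500

tp = n_pos * 10 // 100            # 10% of positives are found -> 50
fn = n_pos - tp                   # missed positives -> 450

n_correct = n_total * 95 // 100   # 95% accuracy -> 9,500 correct predictions
tn = n_correct - tp               # correct negatives -> 9,450
fp = n_neg - tn                   # negatives flagged as positive -> 50

accuracy = (tp + tn) / n_total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.950 precision=0.500 recall=0.100 f1=0.167
```

Under these assumptions the same predictions score 0.95 on accuracy but only about 0.17 on F1, which is the signal the question is looking for.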
Exam trap
The trap is that candidates see high accuracy and assume the model is good. AWS tests the understanding that accuracy is uninformative on a severely imbalanced dataset, and that AUC can be misleadingly high even when minority-class recall is poor.
How to eliminate wrong answers
Option A is wrong because log loss measures the probabilistic confidence of predictions and can be misleading when class imbalance is severe, as it is dominated by the majority class.
Option C is wrong because accuracy is misleading in imbalanced datasets; a model predicting all negatives achieves 95% accuracy without learning anything about the positive class.
Option D is wrong because AUC measures the model's ability to rank positive instances higher than negative ones, but it can still be high even when recall on the positive class is low, as it aggregates performance across all thresholds.
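The point about AUC can be illustrated with a small, made-up set of scores. This is a sketch under stated assumptions: the labels and scores are invented for illustration, and `roc_auc` and `recall_at` are hypothetical helper names written here in plain Python rather than calls into any library. AUC is computed as the fraction of positive/negative pairs in which the positive receives the higher score.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at(labels, scores, threshold):
    """Recall on the positive class when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    return tp / sum(labels)


# Invented data: 10 positives (5%) and 190 negatives (95%).
labels = [1] * 10 + [0] * 190
# Every positive outranks every negative, but only one clears 0.5.
scores = [0.9] + [0.4] * 9 + [0.1] * 190

auc = roc_auc(labels, scores)
recall_default = recall_at(labels, scores, 0.5)
recall_lowered = recall_at(labels, scores, 0.3)

print(f"auc={auc:.2f} recall@0.5={recall_default:.2f} "
      f"recall@0.3={recall_lowered:.2f}")
# auc=1.00 recall@0.5=0.10 recall@0.3=1.00
```

In this toy case the ranking is perfect, so AUC is 1.0, yet recall at the default 0.5 threshold is only 10%. AUC describes ranking quality across all thresholds, while F1 reflects the predictions actually made at the chosen threshold.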