A company is building an AI-powered document intelligence system to extract key fields from scanned invoices. In the dataset, 95% of the invoices come from one vendor and 5% come from other vendors. After training, the model achieves an F1 score of 0.95 on the overall test set, but it performs very poorly on the minority-vendor invoices. What is the MOST likely cause?
Imbalanced data causes the model to optimize for the majority class, ignoring minority classes.
Why this answer
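Because 95% of the invoices come from a single vendor, the model learns mostly from that vendor's layout and generalizes poorly to the other vendors. The same imbalance carries over to the test set: minority-vendor invoices make up only about 5% of it, so the overall F1 score is dominated by the majority vendor and hides the weak minority performance. The other options are less likely because a problem that affected all invoices would also lower the overall F1 score.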
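
To see how an overall score can hide a weak subgroup, here is a minimal sketch in Python. The true positive, false positive, and false negative counts are hypothetical, chosen only to mirror the 95%/5% split in the question, and the group names and the f1 helper are illustrative rather than part of any particular library.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical field-extraction counts per vendor group (assumed numbers,
# not taken from the question): about 95% of the examples are majority-vendor.
counts = {
    "majority_vendor": {"tp": 931, "fp": 19, "fn": 19},
    "minority_vendors": {"tp": 20, "fp": 30, "fn": 30},
}

# F1 computed separately for each vendor group.
per_group_f1 = {name: f1(**c) for name, c in counts.items()}

# Overall F1 computed by pooling the counts from all groups.
overall_f1 = f1(
    tp=sum(c["tp"] for c in counts.values()),
    fp=sum(c["fp"] for c in counts.values()),
    fn=sum(c["fn"] for c in counts.values()),
)

print(f"overall F1: {overall_f1:.3f}")  # overall F1: 0.951
for name, score in per_group_f1.items():
    print(f"{name} F1: {score:.3f}")
# majority_vendor F1: 0.980
# minority_vendors F1: 0.400
```

With these assumed counts, the pooled score stays near 0.95 while the minority-vendor score is 0.40, which is why reporting the metric per vendor group is a useful check alongside the overall number.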