A machine learning engineer is training a transformer model for machine translation. The model's perplexity on the validation set is 8.5, and the BLEU score is 32. After increasing the number of encoder layers from 6 to 12, perplexity drops to 7.2 but BLEU decreases to 28. What is the MOST likely cause?
Overfitting: the deeper model fits the token-level likelihood objective more closely, which lowers perplexity, but the translations it actually generates match the references less well, which lowers BLEU.
Why this answer
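Perplexity and BLEU measure different things. Perplexity is the exponentiated average negative log-likelihood the model assigns to the reference tokens, typically computed with the correct previous tokens supplied (teacher forcing), so it reflects how confident the model is in the reference. BLEU scores the n-gram overlap between the translations the model generates at decoding time and the references. The deeper model has more capacity to overfit: it becomes more confident token by token (lower perplexity), but that confidence does not carry over to its decoded output, which produces fewer exact n-gram matches with the references (lower BLEU).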
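To make the difference between the two metrics concrete, here is a minimal sketch in Python. The function names `perplexity` and `ngram_precision` are illustrative, not from any library, and the inputs are toy values. `ngram_precision` is only the clipped n-gram precision that BLEU is built from; full BLEU also combines several n-gram orders and applies a brevity penalty.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Exponentiated mean negative log-probability (natural log) of the reference tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference (a building block of BLEU)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Perplexity is computed from the probabilities assigned to the reference tokens.
# Toy values: four reference tokens, each given probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)

# Precision is computed from the decoded output, which may differ from the reference.
reference = "the cat sat on the mat".split()
decoded = "the cat sat".split()
p2 = ngram_precision(decoded, reference, 2)  # 2 of 2 candidate bigrams match
```

Because the first function reads only the probabilities assigned to reference tokens and the second reads only the decoded text, one number can improve while the other gets worse.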