A data scientist is preparing text data for natural language processing (NLP). The corpus contains many rare words and typos. To reduce dimensionality and improve generalization, they decide to apply stemming and remove stop words. However, after training, the model performs poorly on domain-specific terms. What is the most likely cause?
In specialized domains, stemming can distort meaning and stop words can carry essential context.
Why this answer
Option B is correct because both stemming and stop word removal can be inappropriate for domain-specific text. Stemming reduces words to crude root forms and can conflate distinct domain terms (e.g., an aggressive stemmer may collapse 'therapy' and 'therapist' to a shared stem such as 'therap'), losing critical semantic nuance. Stop word removal can discard words that carry domain-specific meaning (e.g., 'not' in medical negation or 'up' in 'tune-up' for maintenance). Together, these steps degrade the model's performance on specialized vocabulary.
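The effect described above can be illustrated with a minimal Python sketch. It assumes a toy suffix stripper and a tiny stop list; `naive_stem`, `preprocess`, `STOP_WORDS`, and `SUFFIXES` are hypothetical names invented for this illustration and do not reproduce the behavior of any real stemmer or library stop list.

```python
# Toy illustration only: a naive suffix stripper and a tiny stop list,
# both hypothetical stand-ins for a real stemmer and stop word list.
STOP_WORDS = {"the", "a", "is", "to", "of", "not", "no", "up"}
SUFFIXES = ("ist", "ing", "ies", "ed", "y", "s")  # checked in this order


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


# Two distinct domain terms collapse to the same stem.
print(naive_stem("therapy"), naive_stem("therapist"))

# Removing 'not' makes a negated and a non-negated sentence identical.
print(preprocess("patient is not responding to therapy"))
print(preprocess("patient is responding to therapy"))
```

In this sketch the two sentences produce the same token list, so a downstream model could not tell a negated finding from a positive one.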
Exam trap
AWS often tests the misconception that lemmatization is always superior to stemming, but the trap here is that the root cause is the inappropriate application of both preprocessing techniques to domain-specific text, not the choice between stemming and lemmatization.
How to eliminate wrong answers
Option A is wrong because lemmatization, while more accurate than stemming, does not address the core issue: stop word removal and aggressive normalization are ill-suited to domain-specific text, where rare terms and typos call for preserving original forms or specialized handling.
Option C is wrong because the aggressiveness of the stemmer is not the whole problem. The combination of stemming and stop word removal strips domain-relevant context; even a less aggressive stemmer would fail if stop words that carry domain meaning are removed.
Option D is wrong because stop word removal is only part of the issue. It can indeed remove important context words, but the question states that the model performs poorly on domain-specific terms, and stemming distorts those terms directly.
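As one possible form of the "specialized handling" mentioned for Option A, the sketch below skips stemming entirely and keeps stop words that are assumed to carry domain meaning. The names `GENERIC_STOP_WORDS`, `DOMAIN_KEEP`, and `domain_preprocess` are hypothetical, and the word lists are illustrative assumptions, not a standard.

```python
# Hypothetical names throughout; the lists below are illustrative only.
GENERIC_STOP_WORDS = {"the", "a", "is", "to", "of", "not", "no", "up"}
DOMAIN_KEEP = {"not", "no", "up"}  # assumed to carry meaning in this domain


def domain_preprocess(text):
    """Lowercase and tokenize, dropping only stop words outside DOMAIN_KEEP.

    No stemming is applied, so domain terms keep their original surface form.
    """
    stop_words = GENERIC_STOP_WORDS - DOMAIN_KEEP
    return [t for t in text.lower().split() if t not in stop_words]


print(domain_preprocess("The patient is not responding to therapy"))
```

Here the negation 'not' and the unstemmed term 'therapy' both survive preprocessing. Whether this trade-off helps in practice depends on the corpus and would need to be validated on held-out domain data.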