A machine learning team is preparing a dataset for a supervised learning task. They have 100,000 labeled samples. Which data preparation step is essential before splitting into train/test sets?
Shuffling prevents biased splits.
Why this answer
Option C is correct because randomly shuffling the dataset before splitting makes it likely that the train and test sets follow similar distributions. Without shuffling, the split can separate ordered or grouped data (e.g., records sorted by label or collected in batches), so the test set stops being representative of the overall population and the evaluation becomes biased. The exception is data with a genuine temporal order, such as time series, where a chronological split is generally used instead of a shuffle.
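The effect of ordering on a split can be illustrated with a minimal sketch. The helper name `shuffled_split`, the 80/20 ratio, the seed, and the toy dataset sorted by label are all assumptions made for illustration; they do not come from the question.

```python
import random


def shuffled_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples`, then return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    shuffled = list(samples)    # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset sorted by label: 800 negatives followed by 200 positives.
data = [(i, 0) for i in range(800)] + [(i, 1) for i in range(800, 1000)]

# Taking the last 20% without shuffling yields a test set of positives only.
naive_test = data[800:]
naive_rate = sum(label for _, label in naive_test) / len(naive_test)

# Shuffling first gives a test set whose positive rate is close to the overall 20%.
train, test = shuffled_split(data)
shuffled_rate = sum(label for _, label in test) / len(test)
```

In practice a library routine such as scikit-learn's `train_test_split`, which shuffles by default, would usually replace the hand-written helper.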
Exam trap
Salesforce often tests the misconception that normalization or outlier removal must be done before splitting. The trap is that candidates overlook the more fundamental need to randomize the data order, which keeps ordering or grouping bias out of the train/test split.
How to eliminate wrong answers
Option A is wrong because normalizing features to the same scale is a preprocessing step typically applied after splitting the data, using statistics (e.g., mean and standard deviation) computed only from the training set, so that no information from the test set leaks into training.
Option B is wrong because removing all outliers before splitting can introduce bias and reduce the dataset's representativeness. Outlier handling should be done with care, often after splitting, and is frequently domain-specific.
Option D is wrong because visualizing data distributions is an exploratory step that helps in understanding the data but is not essential before splitting. It can be performed on the training portion after the split, which keeps test-set characteristics from influencing modeling decisions.
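The point about Option A, computing scaling statistics from the training set only, can be sketched as follows. The function names `fit_standardizer` and `standardize` and the toy values are hypothetical, chosen only to show the order of operations.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for a constant feature
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / std for v in values]


# Hypothetical single feature, already split into train and test.
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [10.0, 0.0]

# Fit on the training split, then reuse the same statistics for the test split.
mean, std = fit_standardizer(train_values)
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)
```

Because the test values never enter `fit_standardizer`, they can fall outside the range seen in training after scaling, which is the expected behavior rather than a defect.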