This chapter covers the critical concept of splitting data into training, validation, and test sets in machine learning—a fundamental practice for building models that generalize well. For the AI-900 exam, understanding these splits is essential because it underpins model evaluation and prevents overfitting. Approximately 10-15% of exam questions directly or indirectly test your knowledge of data partitioning, especially in the context of automated machine learning and responsible AI. Mastering this topic ensures you can critically evaluate model performance claims and design robust ML workflows.
Imagine a chef developing a new cake recipe. She bakes batches from her working recipe and serves them to a small group of trusted tasters (the training data), adjusting ingredients and baking time after each round. Next, she bakes a few variants of the recipe, for example with different sugar levels or oven temperatures, and asks a different group of tasters (the validation data) which variant works best. Once she has chosen a variant, she freezes the recipe. Finally, she serves the cake to a completely new group of customers who have never tasted it before (the test data) to evaluate how well the recipe will perform in the real world. The chef never uses the test group's feedback to change the recipe; that would be cheating. Similarly, in machine learning, the training data is used to learn patterns, the validation data is used to tune hyperparameters, and the test data provides an unbiased final evaluation of model performance. Mixing these sets or using test data for training would give an overly optimistic estimate of real-world accuracy and hide any overfitting.
What Are Training, Validation, and Test Data Splits?
In supervised machine learning, we train a model on labeled data to predict outcomes on unseen data. The goal is not to memorize the training examples but to learn patterns that generalize. To achieve this, we must partition the available labeled data into three distinct subsets: training, validation, and test sets. Each serves a unique purpose in the model development lifecycle.
Training set: The largest subset (typically 60-80% of the data) used to fit the model parameters (e.g., weights in a neural network, coefficients in linear regression). The model sees this data during training and adjusts its internal parameters to minimize prediction error.
Validation set: A separate subset (typically 10-20%) used to tune hyperparameters and make model selection decisions. Hyperparameters are settings that control the learning process, such as learning rate, regularization strength, or tree depth. The model does not learn from the validation set; instead, we evaluate its performance on this set after each training epoch or configuration to choose the best model.
Test set: A held-out subset (typically 10-20%) used only once at the end to estimate the final model's performance on unseen data. It must never be used for training or validation decisions. Its sole purpose is to provide an unbiased evaluation of the model's generalization error.
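The three subsets described above can be produced with a shuffle followed by two cuts. The following is a minimal sketch using only the Python standard library; the function name `three_way_split`, the 70/15/15 fractions, and the stand-in data are assumptions chosen for illustration. Libraries such as scikit-learn offer equivalent helpers.

```python
import random

def three_way_split(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle a copy of rows and cut it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

rows = list(range(1000))  # stand-in for 1,000 labeled examples
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled copy, no example can appear in more than one subset.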
Why Splitting Is Necessary
Without splitting, we cannot detect overfitting, where the model performs well on the data it has seen but poorly on new data. If you evaluate a model on the same data it was trained on, the performance metrics (accuracy, precision, etc.) will be artificially high because the model may have memorized the training examples. This gives a false sense of confidence. The validation set helps detect overfitting during development: if validation performance stops improving (or gets worse) while training performance continues to improve, the model is overfitting. The test set provides the final, honest assessment.
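One way to act on the validation signal described above is to track validation loss per epoch and keep the epoch where it was lowest. The sketch below uses made-up loss values and a hypothetical helper named `best_epoch`; it illustrates the idea rather than prescribing a method.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the lowest validation loss seen before the loss
    has failed to improve for `patience` consecutive epochs."""
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break  # validation loss has stalled while training continues
    return best_idx

# Made-up loss curves for illustration only.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_losses = [0.95, 0.70, 0.58, 0.55, 0.57, 0.61, 0.66]

stop = best_epoch(val_losses)
print(stop, train_losses[stop], val_losses[stop])
```

In these curves the training loss keeps falling after epoch 3 while the validation loss rises, which is the overfitting pattern the text describes.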
How Splitting Works Internally
When you perform a train/validation/test split, you are essentially randomly shuffling the dataset and dividing it into three contiguous blocks. The randomness ensures that each subset is representative of the overall data distribution. However, in some cases, you may need to preserve the distribution of a categorical variable (stratification) or maintain temporal order (for time-series data).
Stratified splitting: For classification problems with imbalanced classes (e.g., 90% cats, 10% dogs), a simple random split might leave one subset with far too few dogs, or on a small dataset none at all. Stratified splitting ensures each subset has approximately the same proportion of each class as the original dataset. This is critical for reliable evaluation.
Time-series splitting: For data with a temporal component (e.g., stock prices, sensor readings), you cannot randomly shuffle because future data points depend on past ones. Instead, you split chronologically: train on older data, validate on more recent, and test on the most recent. This simulates real-world forecasting.
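Stratified splitting can be sketched by shuffling and cutting each class separately. This example uses only the standard library; `stratified_split` is a hypothetical helper, and it produces a two-way split (applying it again to the training indices would carve out a validation set).

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in each."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["cat"] * 90 + ["dog"] * 10  # the 90/10 imbalance from the text
train_idx, test_idx = stratified_split(labels)
dogs_in_test = sum(labels[i] == "dog" for i in test_idx)
print(len(test_idx), dogs_in_test)
```

Each class contributes 20% of its own rows to the test set, so the 90/10 ratio is preserved on both sides.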
Typical Split Ratios
There is no one-size-fits-all ratio, but common conventions exist:
Small datasets (<10,000 examples): 60% train, 20% validation, 20% test. A larger validation/test proportion ensures enough data for reliable evaluation.
Large datasets (>1 million examples): 98% train, 1% validation, 1% test. With massive data, even a small fraction gives tens of thousands of examples (or more) for validation and test.
Very large datasets (e.g., the ImageNet ILSVRC 2012 benchmark): roughly 1.28 million training images, 50,000 validation images, and 100,000 test images.
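In Azure Machine Learning, when using automated ML, you can supply validation data explicitly or let the service generate it. If you supply none, automated ML picks a validation strategy based on dataset size: according to Microsoft's documentation, a holdout of about 10% for larger datasets and k-fold cross-validation for smaller ones. A test set is not created unless you provide test data or specify a test size. All of these settings can be customized.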
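To make the ratios listed above concrete, the following sketch converts fractions into row counts. The helper name `split_sizes` is an assumption for illustration, and the dataset sizes are example values.

```python
def split_sizes(n_rows, train_frac, val_frac):
    """Return (train, validation, test) row counts; test takes the remainder."""
    n_train = round(n_rows * train_frac)
    n_val = round(n_rows * val_frac)
    return n_train, n_val, n_rows - n_train - n_val

small = split_sizes(5_000, 0.60, 0.20)      # a small dataset at 60/20/20
large = split_sizes(1_000_000, 0.98, 0.01)  # a large dataset at 98/1/1
print(small, large)
```

At 98/1/1, a dataset of one million rows still leaves 10,000 rows each for validation and test, which is why the fractions can shrink as the data grows.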
Cross-Validation: An Alternative to a Single Validation Set
Instead of a single validation set, you can use k-fold cross-validation. The training set is split into k equal folds. The model is trained on k-1 folds and validated on the remaining fold. This process repeats k times, each time using a different fold for validation. The final validation metric is the average across all k folds. This gives a more robust estimate of model performance and reduces the variance of the validation metric. Common values for k are 5 or 10. For very large datasets, even 2-fold (50% train, 50% validation) works. Note that cross-validation is computationally expensive because you train k models.
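The fold rotation described above can be sketched as a generator of index pairs. This is a minimal standard-library illustration; `k_fold_indices` is a hypothetical name, and the sketch assumes the rows have already been shuffled and the test set already set aside.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

In practice you would train one model per pair and average the k validation scores; every row is used for validation exactly once.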
The Test Set: Sacred and Untouchable
The test set must be isolated from the entire model development process. This means:
Never use test data for training or hyperparameter tuning.
Never look at test set labels while developing the model.
Only evaluate on the test set once, after all model selection is complete.
If you repeatedly evaluate on the test set and adjust your model based on those results, you are effectively leaking information from the test set into the model, leading to overfitting to the test set itself. This is a common pitfall called test set contamination.
Data Leakage: The Hidden Danger
Data leakage occurs when information from outside the training set influences the model, leading to overly optimistic performance. Common causes include:
Using future data to predict the past: In time-series, if you shuffle data, the model might learn patterns that depend on future values.
Including features that are not available at prediction time: For example, using 'patient_diagnosis' to predict 'patient_diagnosis'—a trivial leak.
Improper splitting of grouped data: If you have multiple rows per customer (e.g., transactions), you must split by customer, not by row, to avoid having the same customer in both train and test.
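Azure Machine Learning provides tools that can hint at leakage. Feature importance, for example, may reveal a single feature that dominates predictions suspiciously, while data drift monitoring tracks changes between training and serving data rather than leakage itself. The best defense is careful data partitioning before any modeling.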
How Azure Machine Learning Handles Splits
In Azure ML, you can specify data splits when configuring an automated ML run:
Validation type: Choose 'k-fold cross-validation' or 'train-validation split'.
Number of cross-validations: Specify k (e.g., 5).
Validation data size: For a single split, specify the fraction or absolute number of rows for validation.
Test data: Can be provided as a separate dataset or generated by splitting the input data. Azure ML automatically splits if you specify a test size.
Example configuration in Python SDK:
    from azureml.train.automl import AutoMLConfig

    # train_data, val_data, and test_data are assumed to be datasets prepared earlier.
    automl_config = AutoMLConfig(
        task='classification',
        primary_metric='accuracy',
        training_data=train_data,
        n_cross_validations=5,  # use 5-fold cross-validation on training_data
        # validation_data=val_data,  # alternative to n_cross_validations; set one or the other
        test_data=test_data,  # separate test set
        # ...
    )

If you do not provide a separate test dataset, Azure ML can hold one out from the training_data when you specify a test size. However, best practice is to bring your own test set that has never been touched.
Evaluation Metrics and Splits
The choice of split affects how you compute metrics. For example:
Training metrics: Computed on the training set—high values may indicate overfitting.
Validation metrics: Used to compare models and tune hyperparameters.
Test metrics: The final reported performance.
Azure ML automated ML reports metrics for both validation and test sets. The test set metrics are the ones you should trust for deployment decisions.
Common Pitfalls on the Exam
The AI-900 exam tests your understanding of why splitting is necessary and what each set is used for. Common trap questions:
Using test data for hyperparameter tuning: This is wrong because it leaks information. The correct approach is to use validation data.
Believing that a high training accuracy guarantees good real-world performance: This ignores overfitting. You need validation/test accuracy.
Thinking that cross-validation replaces the need for a test set: Cross-validation still uses a validation set; you still need a separate test set for final evaluation.
Assuming all splits must be equal size: No, training is usually larger.
Summary of Best Practices
Split data before any analysis or preprocessing that uses the entire dataset (e.g., scaling, imputation). Fit preprocessing on training data only, then apply to validation and test sets.
Use stratified splitting for classification.
Use chronological splitting for time-series.
Keep the test set locked away until the final evaluation.
Use cross-validation for robust validation when data is limited.
In Azure ML, leverage built-in splitting and cross-validation options.
By following these practices, you ensure your model's reported performance is trustworthy and that you are not fooling yourself with overly optimistic numbers.
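The first practice above, fitting preprocessing on training data only, can be sketched with a simple standardization step. The helper names `fit_scaler` and `apply_scaler` and the sample values are assumptions for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn scaling statistics from the training set only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero spread
    return mu, sigma

def apply_scaler(values, mu, sigma):
    """Apply statistics learned elsewhere; nothing is re-fitted here."""
    return [(v - mu) / sigma for v in values]

train_values = [10.0, 12.0, 14.0, 16.0]
val_values = [11.0, 20.0]
test_values = [9.0, 15.0]

mu, sigma = fit_scaler(train_values)  # validation and test rows never seen
train_scaled = apply_scaler(train_values, mu, sigma)
val_scaled = apply_scaler(val_values, mu, sigma)
test_scaled = apply_scaler(test_values, mu, sigma)
print(mu, round(sigma, 3))
```

Computing the mean and standard deviation over all three sets together would let validation and test values influence the training inputs, which is a subtle form of leakage.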
Split Data into Three Sets
Begin by randomly shuffling the entire labeled dataset to ensure representative distribution across subsets. Then partition the data into three non-overlapping sets: training (typically 60-80%), validation (10-20%), and test (10-20%). For classification, use stratified splitting to preserve class proportions. In Python, this can be done with scikit-learn's `train_test_split` function; in Azure ML automated ML, you can instead specify the `validation_data` and `test_data` parameters in AutoMLConfig. The split must be performed before any data preprocessing that uses global statistics (e.g., mean imputation) to avoid data leakage.
Train Model on Training Set
Use the training set to fit the model. The model learns patterns by minimizing a loss function (e.g., mean squared error for regression, cross-entropy for classification). During training, the model updates its internal parameters (weights, biases) based on the training data. This step is iterative; for neural networks, multiple epochs pass over the data. The training set is the only data used to adjust model parameters. The validation and test sets are not touched during this phase.
Validate and Tune Hyperparameters
After each training epoch or configuration, evaluate the model on the validation set. Compute metrics like accuracy, precision, recall, or F1-score. Use these metrics to tune hyperparameters such as learning rate, number of trees in a random forest, or regularization strength. This step may involve many iterations (e.g., grid search, random search). The validation set guides model selection without influencing the training process directly. In Azure ML automated ML, this is done automatically via cross-validation or a fixed validation split.
Select Best Model
Compare all candidate models (different architectures, hyperparameter combinations) based on their validation set performance. Choose the model that achieves the best validation metric (e.g., highest accuracy or lowest error). This model is considered the final candidate. It is important to note that the validation set is used multiple times during this selection, which can introduce some bias, but it is acceptable as long as the test set remains untouched.
Evaluate on Test Set
Once the best model is selected, evaluate it on the test set exactly once. Compute final performance metrics. These metrics are the unbiased estimate of how the model will perform on new, unseen data. If the test set performance is significantly worse than validation performance, the model may have been overfitting to the validation set (due to too many tuning iterations). In that case, you may need to collect more data or simplify the model, but you should not retrain using test set feedback.
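The full sequence of steps above can be sketched end to end on synthetic data. The example below is an illustration under stated assumptions: a toy one-dimensional nearest-neighbour classifier stands in for the model, `k` is the only hyperparameter, and the data and helper names are invented.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, rows, k):
    correct = sum(knn_predict(train, x, k) == y for x, y in rows)
    return correct / len(rows)

# Synthetic labeled data: label is 1 when a noisy copy of x exceeds 5.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    y = 1 if x + rng.gauss(0, 1.5) > 5 else 0
    data.append((x, y))

# Split: shuffle, then cut into 60% train, 20% validation, 20% test.
rng.shuffle(data)
train, val, test = data[:180], data[180:240], data[240:]

# Validate and tune: score each candidate k on the validation set only.
candidates = [1, 3, 5, 9, 15]
val_scores = {k: accuracy(train, val, k) for k in candidates}

# Select best model: highest validation accuracy wins.
best_k = max(val_scores, key=val_scores.get)

# Evaluate on test set: one evaluation, after all choices are final.
test_score = accuracy(train, test, best_k)
print(best_k, val_scores[best_k], test_score)
```

The test set appears in exactly one line, after `best_k` has been fixed, which mirrors the rule that test results must not feed back into model selection.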
Scenario 1: Credit Risk Modeling at a Bank
A bank wants to build a machine learning model to predict loan default risk. They have historical data for 500,000 loans with features like credit score, income, debt-to-income ratio, and loan amount. The dataset is imbalanced: only 5% of loans defaulted. The data science team splits the data into 80% training, 10% validation, and 10% test, using stratified splitting to maintain the 5% default rate in each set. They train several models (logistic regression, random forest, gradient boosting) and use the validation set to tune hyperparameters and select the best model. The final model achieves 92% validation AUC. When they evaluate on the test set, AUC drops to 88%, suggesting that model selection overfit the validation set to some degree. If they respond by simplifying the model and increasing regularization, the test set has influenced a modeling decision, so the revised model should be confirmed on fresh held-out data. The test set evaluation is critical because it provides the honest estimate that regulators and business stakeholders rely on. Misconfiguring the split (e.g., not stratifying) could leave the sets with different default rates and make validation metrics less reliable; with rarer classes or smaller datasets, a subset might contain no defaults at all.
Scenario 2: Predictive Maintenance in Manufacturing
A factory uses IoT sensors to monitor equipment and predict failures. Data is time-series: sensor readings every minute for 2 years. The team must split chronologically: train on the first 18 months, validate on the next 3 months, and test on the last 3 months. They cannot shuffle because future data points depend on past trends. They use Azure ML automated ML with a time-series task. The validation set is used to choose the best lookback window and forecast horizon. The test set simulates the model's performance in the future. A common mistake is using random splitting, which would allow the model to learn from future sensor readings to predict past failures—a classic data leakage that results in unrealistic accuracy. The factory engineers must ensure the split respects temporal order; otherwise, the model will fail when deployed.
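The chronological split in this scenario can be sketched by cutting on timestamps instead of shuffling. The example below is a simplified assumption: it uses one reading per day rather than per minute, invented dates, and a hypothetical helper named `chronological_split`.

```python
from datetime import datetime, timedelta

def chronological_split(rows, train_end, val_end):
    """rows: list of (timestamp, value). Cut by time; never shuffle."""
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < train_end]
    val = [r for r in ordered if train_end <= r[0] < val_end]
    test = [r for r in ordered if r[0] >= val_end]
    return train, val, test

# Hypothetical daily readings covering 2022-01-01 through 2023-12-31.
start = datetime(2022, 1, 1)
rows = [(start + timedelta(days=d), float(d)) for d in range(730)]

train, val, test = chronological_split(
    rows,
    train_end=datetime(2023, 7, 1),  # first 18 months go to training
    val_end=datetime(2023, 10, 1),   # next 3 months go to validation
)
print(len(train), len(val), len(test))
```

Every training timestamp precedes every validation timestamp, and every validation timestamp precedes every test timestamp, so the model never sees readings from its own future.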
Scenario 3: Medical Image Classification at a Hospital
A hospital develops a deep learning model to detect pneumonia from chest X-rays. They have 100,000 images from 10,000 patients. To avoid patient-level leakage, they split by patient ID: 70% of patients for training, 15% for validation, 15% for test. All images from a single patient go into only one set. If they split by image instead, the model could learn to recognize a patient's anatomy rather than pneumonia patterns, leading to inflated test performance. The validation set helps tune the number of layers and learning rate. The test set provides the final accuracy that determines whether the model is safe for clinical use. The hospital's regulatory approval depends on the test set performance being a true reflection of generalizability. Misunderstanding this grouping requirement is a common exam trap.
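Splitting by patient rather than by image can be sketched by shuffling the group identifiers and assigning whole groups to one side. The records, file names, and the helper `group_split` below are hypothetical, and the sketch shows a two-way split for brevity.

```python
import random

def group_split(rows, group_of, test_frac=0.15, seed=0):
    """Split rows so that every group lands entirely in train or in test."""
    groups = sorted({group_of(r) for r in rows})
    random.Random(seed).shuffle(groups)  # shuffle groups, not individual rows
    n_test = round(len(groups) * test_frac)
    test_groups = set(groups[:n_test])
    train = [r for r in rows if group_of(r) not in test_groups]
    test = [r for r in rows if group_of(r) in test_groups]
    return train, test

# Hypothetical records: (patient_id, image_name), 10 images for each of 100 patients.
rows = [(p, f"patient{p}_img{i}.png") for p in range(100) for i in range(10)]
train, test = group_split(rows, group_of=lambda r: r[0])

train_patients = {r[0] for r in train}
test_patients = {r[0] for r in test}
print(len(test_patients), len(train_patients & test_patients))
```

The intersection of the two patient sets is empty, so no patient's anatomy can appear on both sides of the split.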
What AI-900 Tests on This Topic (Objective 2.1)
AI-900 objective 2.1 states: 'Describe core machine learning concepts.' Within this, you must understand the purpose of training, validation, and test data. The exam does not test specific split ratios or cross-validation details in depth, but it expects you to know:
Why each split is used.
The order of use: train -> validate -> test.
That the test set is only used once at the end.
That validation data is used for hyperparameter tuning.
That training data is used to fit the model.
Common Wrong Answers and Traps
'The test set is used to tune hyperparameters.' This is the most common trap. Candidates confuse validation and test sets. The correct answer is that the validation set is used for tuning; the test set is for final evaluation.
'Splitting is only needed for large datasets.' Wrong. Splitting is essential for any supervised learning project, regardless of size, to avoid overfitting and get an unbiased performance estimate.
'The training set should be as small as possible to save time.' Incorrect. The training set should be the largest subset because the model needs enough data to learn patterns. Too little training data produces a model that generalizes poorly.
'You can use the test set multiple times to improve the model.' This is a violation of best practices. Using the test set repeatedly leads to overfitting to the test set and invalidates the final evaluation.
Specific Numbers and Terms That Appear on the Exam
The term 'holdout set' is sometimes used synonymously with test set.
'Cross-validation' may appear as an alternative to a single validation split.
'Data leakage' is a related concept that the exam may test.
The phrase 'unbiased estimate of model performance' is often associated with the test set.
Edge Cases and Exceptions
Time-series data: The exam may ask about splitting time-series data. The correct approach is chronological splitting, not random.
Imbalanced data: Stratified splitting is important to maintain class proportions.
Very small datasets: The exam might mention that with very small data, cross-validation is preferred over a single validation split because it uses data more efficiently.
How to Eliminate Wrong Answers
When you see a question about data splits, ask yourself: 'What is the purpose of each set?' The training set is for learning parameters. The validation set is for tuning. The test set is for final evaluation. If a question says 'uses test data to choose hyperparameters,' it is wrong. If it says 'uses validation data to train the model,' it is wrong. Look for keywords like 'tune,' 'select model,' 'evaluate final performance.' The correct answer will match the purpose exactly.
Training set: learns patterns; validation set: tunes hyperparameters; test set: final unbiased evaluation.
Typical split: 60-80% train, 10-20% validation, 10-20% test.
Never use test data for training or hyperparameter tuning.
For time-series, split chronologically; for classification, use stratified splitting.
Cross-validation uses multiple validation folds but still requires a separate test set.
Data leakage can occur if splitting does not respect group or temporal boundaries.
In Azure ML, automated ML can split data for you, but best practice is to provide a separate test set.
The test set is sometimes called a holdout set.
These come up on the exam all the time. Here's how to tell them apart.
Training Set
Used to fit model parameters (weights, biases).
Largest subset (60-80% of data).
Model sees this data multiple times during training.
Metrics on training set can be misleadingly high due to overfitting.
No tuning decisions are made based on training set performance alone.
Validation Set
Used to tune hyperparameters and select models.
Medium subset (10-20% of data).
Model does not learn from this data; only evaluated.
Provides an estimate of model performance during development.
Can be used multiple times across different model configurations.
Validation Set
Used iteratively during model development.
Influences model selection and hyperparameter choices.
Can be used many times.
May introduce some bias if used too many times.
Performance on validation set is not the final word.
Test Set
Used only once at the end.
Provides the final, unbiased performance estimate.
Must never be used for tuning.
Should be kept isolated until final evaluation.
Performance on test set determines deployment readiness.
Mistake
The validation set is used to train the model.
Correct
The validation set is never used to update model parameters. It is used only to evaluate performance during hyperparameter tuning and model selection. Training uses the training set exclusively.
Mistake
The test set can be used multiple times to refine the model.
Correct
The test set should be used only once at the very end. Using it multiple times leaks information and invalidates the performance estimate. The validation set is the proper tool for iterative refinement.
Mistake
A 50/50 split between training and test is standard.
Correct
Typical splits allocate the majority (60-80%) to training, with validation and test each taking 10-20%. A 50/50 split leaves less data for training than necessary, which usually hurts how well the model generalizes.
Mistake
Cross-validation eliminates the need for a separate test set.
Correct
Cross-validation still uses a validation set (the held-out fold). A separate test set is still required for the final unbiased evaluation. Cross-validation does not replace the test set.
Mistake
Data splitting is unnecessary if you have a very large dataset.
Correct
Even with large datasets, splitting is essential to detect overfitting and provide an unbiased performance estimate. Without a test set, you cannot know if the model generalizes.
The training set is used to fit the model's parameters (e.g., weights in a neural network). The validation set is used to tune hyperparameters and select the best model. The test set is used only once at the end to provide an unbiased estimate of the final model's performance on unseen data. Using the test set for tuning would lead to overfitting and an overly optimistic performance estimate.
If you use the test set for tuning, you are effectively leaking information about the test set into the model. The model may then perform well on that specific test set but fail on new data. The test set should remain untouched until the very end to simulate how the model will perform in the real world. The validation set is the proper dataset for tuning.
Cross-validation is a technique where the training set is divided into k folds. The model is trained on k-1 folds and validated on the remaining fold, repeating k times. The validation metric is averaged across folds. This provides a more robust estimate than a single validation split. However, cross-validation still requires a separate test set for final evaluation; it does not replace it.
For time-series data, you must split chronologically to preserve temporal order. Typically, you train on older data, validate on more recent data, and test on the most recent data. Random shuffling would cause data leakage because future information would be used to predict the past, leading to overly optimistic performance.
Stratified splitting ensures that each subset (train, validation, test) has the same proportion of classes as the original dataset. It is essential for classification problems with imbalanced classes to prevent one subset from being unrepresentative. For example, if only 5% of data is positive, stratified splitting keeps 5% positives in each set.
The test set cannot be reused after you have changed the model in response to its results; it should be used only once. If you evaluate on the test set, then modify the model based on those results, and evaluate again, you have leaked information from the test set into the model. This invalidates the test set as an unbiased evaluation. You would need a new, unseen test set.
Data leakage occurs when information from outside the training set influences the model, leading to overly optimistic performance. Proper splitting prevents leakage by ensuring that no data from the validation or test sets is used during training. Common sources of leakage include: using future data in time-series, including features not available at prediction time, and splitting by row instead of by group (e.g., patient).