This chapter covers decision trees and random forests, two fundamental supervised learning algorithms used for both classification and regression. For the AI-900 exam, these topics appear in roughly 5-8% of questions, primarily under objective 2.2 (Create and evaluate machine learning models) and 2.3 (Select appropriate models for different scenarios). You must understand how decision trees split data, how random forests improve upon them, and when to use each algorithm in Azure Machine Learning. We will also cover key hyperparameters, overfitting, and evaluation metrics like accuracy and feature importance.
Imagine a hospital emergency department that uses a triage nurse to decide which patients get immediate care. The nurse follows a decision tree: first, is the patient breathing? If no, immediate resuscitation (leaf node). If yes, check pulse. If weak, go to critical care; if strong, ask about chest pain. If yes, go to cardiac unit; if no, check temperature. If high fever, go to isolation; otherwise, wait in general queue. Each question splits the patient into a branch until a final action is assigned. This is exactly how a decision tree works: it asks a series of yes/no or threshold-based questions about the data features, each splitting the dataset into purer subsets, until a leaf node makes a final prediction. A random forest is like having many triage nurses, each with a slightly different set of questions (trained on different random subsets of patients and symptoms), and then taking a vote among them to decide the final action. This reduces the chance that any single nurse's flawed rule (overfitting) misdirects the entire department. The ensemble of trees is more robust and accurate than any single tree, especially when the data is noisy or has missing values.
What Are Decision Trees?
A decision tree is a supervised learning algorithm that models decisions and their possible consequences as a tree-like structure. It is used for both classification (predicting a discrete class) and regression (predicting a continuous value). The tree consists of:
- Root node: the topmost node that contains the entire dataset.
- Internal nodes: nodes that test a specific feature and split the data into branches based on the outcome.
- Branches: the outcomes of a test (e.g., feature <= threshold, or categorical value).
- Leaf nodes: terminal nodes that output a prediction (class label or numeric value).
The algorithm recursively partitions the feature space into regions, each associated with a simple model (e.g., majority class). The goal is to create subsets that are as homogeneous as possible with respect to the target variable.
How Decision Trees Work Internally
The tree is built using a greedy, top-down, recursive partitioning algorithm. At each node, the algorithm selects the best feature and split point that maximizes the separation of the target variable. The "best" split is determined by a criterion such as:
- Gini impurity: measures the probability of misclassifying a randomly chosen element if it were labeled according to the class distribution in the node. Lower Gini is better. Formula: Gini = 1 - Σ(p_i)^2, where p_i is the proportion of class i in the node.
- Entropy / Information gain: entropy measures disorder; information gain is the reduction in entropy after a split. Higher information gain is better. Entropy = -Σ(p_i * log2(p_i)).
- Mean squared error (MSE): for regression trees, the split that minimizes the total MSE of the two child nodes.
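The two classification criteria above can be computed by hand in a few lines. This is a minimal sketch in plain Python; the function names and the yes/no sample labels are illustrative assumptions, not part of any Azure or scikit-learn API.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the proportion p_i of each class present in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini = 1 - sum(p_i^2); 0 means the node is pure."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)); 0 means the node is pure."""
    return -sum(p * log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" samples, split into two children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("Parent Gini:", gini(parent))
print("Parent entropy:", entropy(parent))
print("Information gain of the split:", information_gain(parent, left, right))
```

A perfectly mixed two-class node has the highest possible impurity, so any split that produces purer children yields a positive information gain.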
The algorithm stops splitting when a stopping criterion is met, such as:
Maximum depth reached.
Minimum number of samples in a node (e.g., 2).
Minimum number of samples required to split (e.g., 10).
No further improvement in impurity beyond a threshold.
Key Hyperparameters in Decision Trees
When you train a decision tree in Azure Machine Learning with scikit-learn, the estimator exposes these critical hyperparameters:
- Maximum depth (max_depth): the maximum number of levels from root to leaf. Default is None (unlimited), but typical values range from 3 to 20. Deeper trees risk overfitting.
- Minimum samples split (min_samples_split): minimum number of samples required to split an internal node. Default = 2. Increase to prevent splits on very small subsets.
- Minimum samples leaf (min_samples_leaf): minimum number of samples that must be in a leaf node. Default = 1. Increase to smooth the model.
- Criterion: the function to measure split quality. For classification: 'gini' or 'entropy'. For regression: 'squared_error' or 'absolute_error' (named 'mse' and 'mae' in older scikit-learn versions).
- Max features (max_features): number of features to consider when looking for the best split. For a single decision tree the default is None, meaning all features are considered; sqrt(n_features) is the default for random forest classifiers.
- Splitter: strategy used to choose the split at each node. 'best' (default) chooses the best split; 'random' chooses the best random split.
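To make the hyperparameters concrete, the sketch below sets each of them on a scikit-learn `DecisionTreeClassifier`. The Iris dataset and the specific values are assumptions chosen only for illustration; they are not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, used here only as a stand-in for your own data.
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",        # split quality measure
    max_depth=3,             # at most 3 levels below the root
    min_samples_split=10,    # a node needs 10+ samples to be split
    min_samples_leaf=5,      # every leaf keeps at least 5 samples
    max_features=None,       # consider all features at each split
    splitter="best",         # evaluate the best split, not a random one
    random_state=0,          # make the result reproducible
)
tree.fit(X, y)

print("Depth actually reached:", tree.get_depth())
print("Number of leaves:", tree.get_n_leaves())
```

Tightening `max_depth` or raising `min_samples_leaf` produces a smaller tree, which is the usual first step when a tree overfits.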
Training and Prediction
When you train a decision tree using Azure Machine Learning, you provide a labeled dataset. The algorithm builds the tree by examining all possible splits for each feature. For numeric features, it sorts the values and considers midpoints between consecutive distinct values as potential split points. For categorical features, it considers partitions of the categories.
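The midpoint search for a numeric feature can be sketched as follows. This is a simplified illustration in plain Python, not the code any library actually runs; the temperature readings and labels are made up for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def candidate_thresholds(values):
    """Midpoints between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the purest two-way split."""
    n = len(values)
    best = None
    for threshold in candidate_thresholds(values):
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical body temperatures (°C) and triage outcomes.
temperatures = [36.5, 37.0, 38.5, 39.0, 40.0]
outcomes = ["wait", "wait", "isolate", "isolate", "isolate"]

print(candidate_thresholds(temperatures))  # [36.75, 37.75, 38.75, 39.5]
print(best_split(temperatures, outcomes))  # (37.75, 0.0)
```

The threshold 37.75 separates the two classes perfectly, so the weighted Gini of its children is 0.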
During prediction, a new sample traverses the tree from root to a leaf, following the branches based on its feature values. The leaf's majority class (classification) or mean value (regression) is the prediction.
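The traversal can be sketched with a tree stored as nested dictionaries. The structure, feature names, and thresholds below are hypothetical and chosen only to echo the triage analogy; real libraries store trees in more compact array form.

```python
# Each internal node tests one feature against a threshold;
# each leaf stores the prediction directly.
triage_tree = {
    "feature": "temperature",
    "threshold": 38.0,
    "left": {"leaf": "wait"},
    "right": {
        "feature": "pulse",
        "threshold": 100,
        "left": {"leaf": "isolate"},
        "right": {"leaf": "critical care"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "leaf" not in node:
        goes_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["leaf"]


print(predict(triage_tree, {"temperature": 37.2, "pulse": 80}))   # wait
print(predict(triage_tree, {"temperature": 39.5, "pulse": 90}))   # isolate
print(predict(triage_tree, {"temperature": 39.5, "pulse": 120}))  # critical care
```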
Overfitting in Decision Trees
Decision trees are prone to overfitting, especially when the tree is deep and captures noise in the training data. Symptoms include high accuracy on training data but poor performance on test data. Mitigation strategies include:
Pruning: remove branches that have little predictive power.
Setting a maximum depth.
Requiring a minimum number of samples per leaf.
Using ensemble methods like random forests.
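The overfitting symptom described above can be observed by comparing training and test accuracy for an unrestricted tree and a restricted one. This is a sketch on synthetic data with deliberately noisy labels (`flip_y=0.2`); the dataset and settings are assumptions for illustration, and exact scores will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data where about 20% of labels are randomly reassigned (noise).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unrestricted tree: free to memorize the noisy training labels.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Restricted tree: limited depth and larger leaves.
shallow = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_train, y_train)

for name, model in [("unlimited depth", deep), ("max_depth=3", shallow)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between training and test accuracy for the unrestricted tree is the signal to prune, limit depth, or move to an ensemble.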
What Are Random Forests?
A random forest is an ensemble of decision trees, typically trained with the bagging (Bootstrap Aggregating) method. It combines the predictions of multiple trees to improve accuracy and control overfitting. The algorithm works as follows:
1. Create N bootstrap samples (random samples with replacement) from the original dataset.
2. For each bootstrap sample, train a decision tree, but with a twist: at each node, only a random subset of the p features is considered for splitting (commonly sqrt(p) for classification; p/3 is a classic recommendation for regression, although scikit-learn defaults to all p features for regression).
3. The final prediction is the majority vote (classification) or average (regression) of all trees.
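The three steps can be sketched by hand using scikit-learn trees as building blocks. This is an illustrative re-implementation under simplifying assumptions (synthetic binary data, 25 trees), not how `RandomForestClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
rng = np.random.default_rng(0)

trees = []
for i in range(25):  # an odd number of trees avoids tied votes
    # Step 1: bootstrap sample, drawn with replacement, same size as the data.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: each split considers only sqrt(n_features) random features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: majority vote across trees (labels are 0/1, so the mean is the vote share).
votes = np.stack([tree.predict(X) for tree in trees])  # shape: (25, 300)
majority = (votes.mean(axis=0) >= 0.5).astype(int)

print("Training accuracy of the vote:", (majority == y).mean())
```

For regression, the last step would average the trees' numeric predictions instead of voting.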
Key Hyperparameters in Random Forests
Number of estimators (n_estimators): number of trees in the forest. Default = 100. Higher values improve performance but increase training time.
Maximum depth: same as decision trees; often left unlimited in random forests because the ensemble reduces overfitting.
Minimum samples split & leaf: same as decision trees.
Max features: size of the random subset of features to consider at each split. Default is sqrt(n_features) for classification, n_features for regression. Common values: 'sqrt', 'log2', or a fraction.
Bootstrap: whether bootstrap samples are used. Default = True.
Out-of-bag (OOB) score: if bootstrap=True, OOB samples (those not in the bootstrap sample) can be used to estimate generalization error without a separate validation set.
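These hyperparameters map directly onto scikit-learn's `RandomForestClassifier`. The sketch below uses synthetic data as an assumption and enables the out-of-bag estimate, so no separate validation split is needed for a first look at generalization.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=None,        # let each tree grow fully
    min_samples_leaf=1,    # smallest allowed leaf
    max_features="sqrt",   # random feature subset size per split
    bootstrap=True,        # train each tree on a bootstrap sample
    oob_score=True,        # score each sample using trees that never saw it
    random_state=0,
)
forest.fit(X, y)

print("Trees trained:", len(forest.estimators_))
print("Out-of-bag accuracy estimate:", round(forest.oob_score_, 3))
```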
How Random Forests Reduce Overfitting
The combination of bagging (training each tree on a different random subset of data) and random feature selection decorrelates the trees. If all trees were identical, averaging them would give exactly the same predictions, and therefore the same variance, as a single tree. By introducing randomness, each tree learns different patterns, and the averaging/voting reduces variance without increasing bias significantly. This makes random forests robust to noise and outliers.
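The effect of decorrelation can be quantified with a standard result that is not stated in this chapter: if each of B trees has prediction variance σ² and any two trees have correlation ρ, the variance of their average is ρσ² + (1 - ρ)σ²/B. The helper below is a small sketch of that formula; the function name and the sample values are illustrative.

```python
def ensemble_variance(tree_variance, correlation, n_trees):
    """Variance of the average of n_trees equally correlated predictors."""
    return (
        correlation * tree_variance
        + (1 - correlation) * tree_variance / n_trees
    )


# Identical trees (correlation 1): averaging gains nothing.
print(ensemble_variance(1.0, 1.0, 100))  # 1.0

# Fully independent trees (correlation 0): variance shrinks by 1/n_trees.
print(ensemble_variance(1.0, 0.0, 100))  # 0.01

# Partially correlated trees: the correlation term sets a floor on the variance.
print(ensemble_variance(1.0, 0.3, 100))
```

This is why random feature selection matters: more trees only shrink the second term, while lowering the correlation shrinks the floor itself.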
Feature Importance
Both decision trees and random forests provide feature importance scores, which indicate how much each feature contributes to the predictions. In random forests, importance is often calculated as the average reduction in impurity (e.g., Gini) across all trees when a feature is used for splitting. This is useful for feature selection and model interpretation.
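In scikit-learn, impurity-based importances are available on a fitted forest as `feature_importances_`. The sketch below uses the built-in Iris dataset as an assumption and ranks its four features; the scores are normalized so that they sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Pair each feature name with its mean impurity reduction, highest first.
ranked = sorted(
    zip(data.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```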
Interaction with Azure Machine Learning
In Azure Machine Learning, you can build decision trees and random forests using:
- Designer: drag-and-drop modules like "Two-Class Decision Forest" or "Multiclass Decision Forest" for classification, and "Decision Forest Regression" for regression.
- Automated ML: automatically tries decision trees and random forests among other algorithms.
- SDK (Python): using azureml.train.automl or directly using scikit-learn estimators.
The Azure ML Designer provides prebuilt modules with hyperparameter options. For example, the "Two-Class Decision Forest" module has parameters like:
- Number of decision trees: default 8.
- Maximum depth of the decision trees: default 32.
- Number of random splits per node: default 128.
- Minimum number of samples per leaf node: default 1.
Evaluation Metrics
For classification, common metrics include accuracy, precision, recall, F1-score, and AUC-ROC. For regression, use RMSE, MAE, and R-squared. The AI-900 exam expects you to know that random forests generally outperform single decision trees in accuracy but are less interpretable. You should also understand that tolerance of missing values depends on the implementation: some tree and forest implementations handle them through surrogate splits or by ignoring missing values during splitting, while others require imputation first.
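The metrics named above can be computed with `sklearn.metrics`. The labels and predictions below are small made-up examples so the arithmetic is easy to check by hand; AUC is omitted because it needs predicted scores rather than hard labels.

```python
from math import sqrt

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Classification: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6/8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Regression: errors are 1, 0 and -2.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mae = mean_absolute_error(actual, predicted)        # (1 + 0 + 2) / 3
rmse = sqrt(mean_squared_error(actual, predicted))  # sqrt((1 + 0 + 4) / 3)

print(accuracy, precision, recall, f1)
print(mae, round(rmse, 3))
```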
When to Use Decision Trees vs. Random Forests
Decision trees: when interpretability is critical (e.g., medical diagnosis where you need to explain the reasoning), when the dataset is small, or when you need a quick baseline model.
Random forests: when accuracy is the primary goal, when the dataset is large, when there are many features, or when you need to handle missing values and outliers robustly.
Azure Specifics: Automated ML and Hyperparameter Tuning
In Azure Automated ML, decision trees and random forests are among the algorithms tried automatically. You can constrain the search space by specifying allowed models. Hyperparameter tuning can be done via Azure Hyperdrive, which supports random sampling, grid sampling, and Bayesian optimization. For random forests, typical hyperparameter ranges include n_estimators (10-500), max_depth (1-50), and min_samples_leaf (1-20).
Prepare the dataset
Load and preprocess your data in Azure Machine Learning. Ensure the target column is labeled and, if you train with scikit-learn, that all features are numeric (categorical features must be one-hot encoded or label-encoded). Split the data into training and test sets (e.g., 80/20). For the AI-900 exam, you might be asked about the need to handle missing values: some random forest implementations tolerate missing data internally, but imputing it explicitly is the safer choice. Use Azure's 'Clean Missing Data' module or pandas `fillna()`.
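A code-first version of this step might look like the sketch below. The column names and values are invented for illustration. For brevity the median is computed on the full table; in a real pipeline you would compute it on the training split only, to avoid leaking test information.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical loan data with missing incomes and one categorical feature.
df = pd.DataFrame({
    "income": [52000, None, 61000, 45000, 80000, 39000, None, 72000, 58000, 64000],
    "payment_method": ["card", "bank", "card", "cash", "bank",
                       "card", "cash", "bank", "card", "cash"],
    "defaulted": [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],
})

# Impute missing numeric values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

# One-hot encode the categorical column and separate features from the target.
features = pd.get_dummies(df.drop(columns="defaulted"), columns=["payment_method"])
target = df["defaulted"]

# 80/20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=0
)

print(features.columns.tolist())
print(len(X_train), "training rows,", len(X_test), "test rows")
```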
Choose the algorithm
Select 'Two-Class Decision Forest' or 'Multiclass Decision Forest' in the Azure ML Designer for classification, or 'Decision Forest Regression' for regression, and connect the untrained algorithm module to the 'Train Model' module. Alternatively, train a scikit-learn estimator in code, for example through the Python SDK. The exam expects you to know that random forests are ensemble methods that combine multiple decision trees to improve accuracy and reduce overfitting.
Configure hyperparameters
Set key hyperparameters: number of trees (n_estimators), maximum depth (max_depth), minimum samples per leaf (min_samples_leaf), and number of random splits per node. In Azure Designer, the default number of trees is 8, but for production you might use 100-500. Increasing trees improves accuracy but increases training time. Maximum depth controls tree complexity; typical values are 10-50. Raising the minimum samples per leaf helps prevent overfitting; the Designer default is 1.
Train the model
Connect the untrained algorithm module and the training dataset to the 'Train Model' module, select the label column, and run the pipeline. The algorithm builds multiple decision trees on bootstrap samples, each considering a random subset of features at each split. In Azure, the 'Train Model' module outputs a trained model object. Training time grows roughly linearly with the number of trees and increases with the size of the dataset.
Evaluate the model
Use the 'Score Model' module to generate predictions on the test set, then 'Evaluate Model' to compute metrics like accuracy, precision, recall, F1-score, and AUC. For regression, use RMSE and R-squared. The exam may ask you to interpret these metrics. For example, a high accuracy on training but low on test indicates overfitting. Use out-of-bag error in random forests as an internal validation.
Tune hyperparameters
If performance is unsatisfactory, use Azure Hyperdrive to sweep hyperparameters. Define a parameter space (e.g., n_estimators: [50, 100, 200], max_depth: [10, 20, 30]), choose a sampling method (random or grid), and an early termination policy (e.g., BanditPolicy). The best model can be registered in the Azure ML workspace. The exam expects you to know that hyperparameter tuning is essential for optimal performance.
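The idea of sweeping a parameter space can be tried locally before setting up a Hyperdrive run. The sketch below uses scikit-learn's `GridSearchCV` as a stand-in, so it is an analogue of grid sampling, not the Hyperdrive API itself; the Iris dataset and the reduced grid are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Parameter space to sweep: every combination is trained and cross-validated.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Grid search trains every combination, so its cost grows quickly with the number of parameters; random sampling with an early termination policy is the usual way to keep larger sweeps affordable.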
Enterprise Scenario 1: Credit Risk Assessment
A large bank uses a random forest model to predict whether a loan applicant will default. The dataset includes features like credit score, income, debt-to-income ratio, employment length, and number of late payments. The bank trains a random forest with 500 trees on historical data. The model outputs a probability of default, and the bank sets a threshold (e.g., 0.3) to approve or reject loans. The random forest handles missing values (e.g., some applicants have no employment length) and provides feature importance scores, revealing that credit score and debt-to-income ratio are top predictors. The model is deployed as a web service on Azure Kubernetes Service for real-time scoring. Misconfiguration: if the number of trees is too low (e.g., 10), the model may have high variance; if max_depth is too high (e.g., 100), it overfits. The bank uses Azure ML's automated ML to tune hyperparameters, achieving an AUC of 0.85 on the test set.
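The thresholding step in this scenario can be sketched as follows. The data is synthetic and the forest uses 100 trees rather than 500 to keep the example quick, so treat the numbers as illustrative only; the 0.3 cut-off comes from the scenario above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for loan data: roughly 20% of applicants default (class 1).
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of default.
default_probability = model.predict_proba(X_test)[:, 1]

# A 0.3 threshold rejects more applicants than the usual 0.5 cut-off.
rejected_at_030 = default_probability >= 0.3
rejected_at_050 = default_probability >= 0.5

print("Rejected at 0.3:", int(rejected_at_030.sum()))
print("Rejected at 0.5:", int(rejected_at_050.sum()))
```

Lowering the threshold trades more rejected good applicants for fewer approved defaulters, which is a business decision rather than a modeling one.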
Enterprise Scenario 2: Predictive Maintenance in Manufacturing
A manufacturing company uses decision trees to diagnose machine failures. Each machine has sensors recording temperature, vibration, pressure, and runtime. A decision tree is trained to classify whether a failure will occur within the next 24 hours. The tree's interpretability is crucial because engineers need to understand why a failure is predicted (e.g., if temperature > 80°C and vibration > 0.5 mm/s, then failure likely). The tree is deployed on edge devices using Azure IoT Edge. However, the tree overfits to a specific machine type, so the company uses a random forest to generalize across different machines. The random forest achieves 92% accuracy, but the engineers sacrifice some interpretability. They use SHAP values (SHapley Additive exPlanations) to explain individual predictions. Common issue: if the dataset is imbalanced (failures are rare), the model may predict 'no failure' for all cases. They use SMOTE (Synthetic Minority Over-sampling Technique) in the Azure ML pipeline to balance the classes.
Enterprise Scenario 3: Customer Churn Prediction
A telecom company uses a random forest to predict which customers are likely to cancel their subscription. Features include contract length, monthly charges, tenure, number of customer service calls, and payment method. The random forest is trained on 100,000 customers. The model's feature importance reveals that tenure and contract length are the most important. The company uses the model to target high-risk customers with retention offers. They deploy the model as a batch inference pipeline in Azure ML, scoring 1 million customers weekly. A common pitfall is using default hyperparameters without tuning, leading to suboptimal performance. By tuning n_estimators to 200 and max_depth to 20, they improve precision from 0.65 to 0.72. The model is monitored for concept drift using Azure ML's data drift monitoring.
Exactly What AI-900 Tests
The AI-900 exam covers decision trees and random forests under objective 2.2 (Create and evaluate machine learning models) and 2.3 (Select appropriate models for different scenarios). Specific points tested:
Understand that decision trees are easy to interpret but prone to overfitting.
Know that random forests are ensemble methods that combine multiple decision trees to improve accuracy and reduce overfitting.
Recognize that random forests use bagging and random feature selection.
Be able to identify scenarios where interpretability is key (decision tree) vs. where accuracy is paramount (random forest).
Know that hyperparameters like number of trees and maximum depth affect performance.
Understand that feature importance is a key output of random forests.
Know that Azure ML Designer has modules for Decision Forest (Two-Class, Multiclass, Regression).
Common Wrong Answers and Why Candidates Choose Them
"Decision trees are always better than random forests because they are simpler." Wrong because simplicity does not guarantee better accuracy. Random forests almost always outperform a single tree in predictive performance. Candidates confuse interpretability with accuracy.
"Random forests use boosting instead of bagging." Wrong. Random forests use bagging (bootstrap aggregating). Boosting is used by algorithms like AdaBoost and Gradient Boosting. Candidates often mix up ensemble methods.
"You should never prune decision trees." Wrong. Pruning (or setting max_depth) is essential to prevent overfitting. Candidates may think that deeper trees are always better.
"Random forests can only be used for classification." Wrong. Random forests also support regression (predicting continuous values). The Azure ML Designer has a separate 'Decision Forest Regression' module.
Specific Numbers and Terms That Appear on the Exam
Default number of trees in Azure ML's Two-Class Decision Forest: 8.
Default maximum depth: 32.
Minimum samples per leaf: 1.
Number of random splits per node: 128.
Gini impurity and entropy are common split criteria.
The term 'ensemble learning' appears frequently.
Edge Cases and Exceptions
When the dataset is very small (e.g., < 100 samples), a decision tree may be better than a random forest because each bootstrap sample contains few distinct examples, so the trees end up highly correlated and the ensemble adds little.
When features are highly correlated, random forests may still perform well, but feature importance can be unreliable.
For imbalanced datasets, random forests can be biased toward the majority class; use class weights or resampling.
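Class weighting, one of the remedies mentioned above, is available in scikit-learn through `class_weight="balanced"`. The sketch below uses a made-up dataset with 95 negatives and 5 positives to show the weights this setting produces; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 95 "no failure" (0) and 5 "failure" (1).
y = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(0)
# Three random features, shifted by one unit for the failure class.
X = rng.normal(size=(100, 3)) + y[:, None]

# "balanced" weights each class by n_samples / (n_classes * class_count).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)
print("Weight for class 0:", round(weights[0], 3))  # 100 / (2 * 95)
print("Weight for class 1:", round(weights[1], 3))  # 100 / (2 * 5)

# The same weighting is applied inside the forest when training.
forest = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
).fit(X, y)
```

With these weights, misclassifying a rare failure costs the model far more than misclassifying a common non-failure.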
How to Eliminate Wrong Answers
If the question asks for a model that is easy to interpret, eliminate random forest and choose decision tree.
If the question asks for a model that reduces overfitting, eliminate decision tree and choose random forest.
If the question mentions 'ensemble', 'bagging', or 'multiple trees', the answer is likely random forest.
If the question asks for feature importance, both can provide it, but random forest is more robust.
Decision trees split data recursively based on feature thresholds to maximize purity (Gini or entropy).
Random forests are ensemble models that combine many decision trees trained on bootstrap samples with random feature subsets.
Random forests generally outperform single decision trees in accuracy but are less interpretable.
Key hyperparameters: number of trees (default 8 in Azure ML Designer), maximum depth (default 32), minimum samples per leaf (default 1).
Feature importance is a key output of both algorithms, indicating which features contribute most to predictions.
Decision trees are ideal when interpretability is critical; random forests are preferred when accuracy is the primary goal.
In Azure ML, use Designer modules like 'Two-Class Decision Forest' and 'Decision Forest Regression'.
Overfitting in decision trees can be mitigated by pruning, by limiting depth, or by using random forests.
These come up on the exam all the time. Here's how to tell them apart.
Decision Tree
Single tree model
High interpretability (easy to visualize and explain)
Prone to overfitting, especially with deep trees
Low computational cost for training and prediction
Sensitive to small changes in data (high variance)
Random Forest
Ensemble of multiple decision trees
Lower interpretability (black-box compared to single tree)
Reduces overfitting through averaging/voting
Higher computational cost (training and prediction scale with number of trees)
More robust to noise and outliers (lower variance)
Mistake
Decision trees can only handle numeric features.
Correct
Decision trees can handle both numeric and categorical features. For categorical features, the algorithm can split on any subset of categories. However, many implementations (including scikit-learn) require categorical features to be encoded as numbers first, for example with one-hot encoding.
Mistake
Random forests never overfit.
Correct
While random forests are less prone to overfitting than single decision trees, they can still overfit, typically when the individual trees are very deep and the data is noisy. Adding more trees does not by itself cause overfitting; it only yields diminishing returns. Techniques like limiting max_depth and raising min_samples_leaf help.
Mistake
The more trees in a random forest, the better, with no downside.
Correct
Adding more trees improves accuracy up to a point, but after a certain number (e.g., 500-1000), the gain is negligible while training time and memory usage increase linearly. If the trees are highly correlated, extra trees add even less, because averaging similar trees removes little variance.
Mistake
Decision trees are not affected by outliers.
Correct
Decision trees are relatively robust to outliers in the input features because splits are based on thresholds, not distances, but they are not immune. Outliers can still influence the tree if they cause splits that isolate them, leading to overfitting. Random forests mitigate this by averaging.
Mistake
Random forests cannot handle missing values.
Correct
Many implementations of random forests can handle missing values by using surrogate splits (splits on other features that approximate the original split) or by ignoring missing values during split evaluation. Azure ML's Decision Forest module supports missing values.
A decision tree is a single model that splits data based on feature values, while a random forest is an ensemble of many decision trees trained on random subsets of data and features. The random forest averages the predictions of all trees, reducing overfitting and improving accuracy. For AI-900, remember that random forests use bagging and random feature selection.
Use a decision tree when you need a model that is easy to interpret and explain, such as in medical diagnosis or regulatory compliance. Decision trees are also faster to train and can serve as a baseline. However, if accuracy is the priority and interpretability is less important, use a random forest.
Azure ML Designer provides modules like 'Two-Class Decision Forest', 'Multiclass Decision Forest', and 'Decision Forest Regression'. These modules allow you to set hyperparameters such as number of trees (default 8), maximum depth (default 32), minimum samples per leaf (default 1), and number of random splits per node (default 128).
Feature importance measures how much each feature contributes to the model's predictions. In random forests, it is typically calculated as the average reduction in impurity (e.g., Gini) across all trees when that feature is used for splitting. Higher importance means the feature is more influential. This helps in feature selection and model interpretation.
Yes, random forests can be used for regression tasks. Instead of voting on class labels, the forest averages the predicted values from all trees. Azure ML Designer has a dedicated 'Decision Forest Regression' module. The hyperparameters are similar to classification, but the split criterion is typically MSE.
Overfitting occurs when a model learns noise in the training data, performing well on training but poorly on new data. Random forests prevent overfitting by training each tree on a different bootstrap sample and considering only a random subset of features at each split. This decorrelates the trees, and averaging their predictions reduces variance.
In Azure ML Designer's Two-Class Decision Forest, the default number of trees is 8, maximum depth is 32, minimum number of samples per leaf is 1, and number of random splits per node is 128. These can be adjusted to improve performance.