AI-900 | Chapter 36 of 100 | Objective 2.2

ML Evaluation Metrics: Accuracy, Precision, Recall

This chapter covers three fundamental machine learning evaluation metrics: accuracy, precision, and recall. Understanding these metrics is critical for the AI-900 exam, as questions on model evaluation appear in approximately 15-20% of the exam questions, often requiring you to choose the appropriate metric for a given scenario. By the end of this chapter, you will be able to calculate and interpret these metrics, understand their trade-offs, and select the right metric for different classification problems.

25 min read
Intermediate
Updated May 31, 2026

The Fire Alarm System Analogy

Imagine a building with a fire alarm system designed to detect real fires and ignore false alarms. The system has two key components: smoke detectors and heat sensors. Accuracy measures how often the alarm is correct overall: if there are 100 events (fires or non-fires) and the alarm correctly identifies 90 of them, accuracy is 90%. But this can be misleading if fires are rare—say only 1 fire per 100 events. The alarm could simply never go off and still be 99% accurate. Precision focuses on the times the alarm goes off: of all the times it sounds, how many are real fires? If the alarm sounds 10 times but only 8 are real fires, precision is 80%. Recall focuses on real fires: of all the real fires, how many did the alarm catch? If there are 5 real fires and the alarm catches 4, recall is 80%. In a hospital, you want high recall (never miss a fire) even if it means more false alarms (lower precision). In a server room, you want high precision (avoid false alarms that cause costly shutdowns) even if you miss a few fires (lower recall). The trade-off is controlled by adjusting the sensitivity of the detectors.

How It Actually Works

What Are Evaluation Metrics and Why Do They Exist?

Machine learning models are only as good as their ability to generalize to new, unseen data. Evaluation metrics quantify how well a model performs on test data, allowing data scientists to compare models, tune hyperparameters, and decide whether a model is ready for deployment. Without metrics, you cannot objectively measure improvement or detect problems like overfitting. The AI-900 exam expects you to understand the most common metrics for classification tasks: accuracy, precision, recall, and the related F1 score.

The Confusion Matrix: The Foundation

Before diving into metrics, you must understand the confusion matrix. It is a 2x2 table that compares actual vs. predicted labels for a binary classification problem. The four cells are:

True Positives (TP): Actual positive, predicted positive.

False Positives (FP): Actual negative, predicted positive (Type I error).

True Negatives (TN): Actual negative, predicted negative.

False Negatives (FN): Actual positive, predicted negative (Type II error).

For example, a spam filter:

TP: Spam email correctly marked as spam.

FP: Legitimate email incorrectly marked as spam.

TN: Legitimate email correctly marked as not spam.

FN: Spam email incorrectly allowed into inbox.
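The four cells can be counted directly from paired labels. The following is a minimal sketch, assuming a hypothetical `confusion_counts` helper and six made-up emails; the label strings "spam" and "ham" are illustrative choices, not part of any exam material.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, FP, TN, FN for a binary problem (hypothetical helper)."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if truth == positive else "FP"] += 1
        else:
            # Predicted negative: a miss (FN) or correct (TN).
            counts["FN" if truth == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "TN", "FN")}

# Illustrative labels for six emails (made-up data).
actual    = ["spam", "spam", "ham", "ham",  "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "ham", "spam"]

cells = confusion_counts(actual, predicted)
print(cells)  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```

Every metric in the rest of this chapter is a ratio built from these four counts.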

Accuracy: The Overall Correctness

Accuracy is the ratio of correct predictions to total predictions:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

It answers: "Out of all predictions, how many were correct?"

Accuracy is intuitive and easy to understand, but it can be misleading when classes are imbalanced. For instance, if 95% of emails are legitimate and only 5% are spam, a model that predicts "not spam" for every email achieves 95% accuracy—yet it never catches any spam. The AI-900 exam often tests this concept by presenting a scenario with imbalanced classes and asking why accuracy is not the best metric.

When to use: When classes are roughly balanced and false positives and false negatives have similar costs.

When to avoid: Imbalanced datasets where one class dominates.
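The imbalance problem is easy to reproduce with arithmetic. This sketch assumes 1,000 hypothetical emails split 95% legitimate and 5% spam, and a model that always predicts "not spam"; the counts are illustrative only.

```python
def accuracy(tp, fp, tn, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)

# 1,000 hypothetical emails: 950 legitimate, 50 spam.
# An "always not spam" model never predicts positive, so TP = FP = 0.
tp, fp, tn, fn = 0, 0, 950, 50

acc = accuracy(tp, fp, tn, fn)
spam_caught = tp / (tp + fn)  # share of actual spam that was flagged

print(f"accuracy = {acc:.0%}, spam caught = {spam_caught:.0%}")
# accuracy = 95%, spam caught = 0%
```

The accuracy figure looks strong even though the model has done nothing useful for the minority class.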

Precision: How Trustworthy Are Positive Predictions?

Precision measures the proportion of positive identifications that were actually correct:

Precision = TP / (TP + FP)

It answers: "Of all the instances the model labeled as positive, how many were truly positive?"

High precision means that few of the model's positive predictions are false positives. In the spam filter example, precision tells you how many of the emails flagged as spam are actually spam. If precision is low, you are blocking legitimate emails.

When to use: When the cost of false positives is high. Examples:

Medical diagnosis: A false positive (saying a healthy patient has a disease) can lead to unnecessary stress and treatment.

Fraud detection: Flagging a legitimate transaction as fraud can inconvenience customers.

Search engines: Returning irrelevant results (false positives) degrades user trust.

Recall: How Many Positives Did You Catch?

Recall (also called sensitivity or true positive rate) measures the proportion of actual positives that were correctly identified:

Recall = TP / (TP + FN)

It answers: "Of all the actual positive instances, how many did the model find?"

High recall means the model has a low false negative rate. In the spam filter, recall tells you how many of the actual spam emails were caught. If recall is low, spam is getting through.

When to use: When the cost of false negatives is high. Examples:

Cancer screening: Missing a cancer case (false negative) could be fatal.

Security monitoring: Failing to detect an intrusion (false negative) could lead to a breach.

Medical testing: A false negative means a sick patient is told they are healthy.
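To see how the two formulas differ only in their denominator, here is a small sketch with hypothetical spam-filter counts (TP=40, FP=10, FN=20). The function names are illustrative.

```python
def precision(tp, fp):
    """Of everything predicted positive, the share that was truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything actually positive, the share the model found."""
    return tp / (tp + fn)

# Hypothetical counts: 50 emails flagged as spam, 60 emails actually spam.
tp, fp, fn = 40, 10, 20

p = precision(tp, fp)  # 40 / 50
r = recall(tp, fn)     # 40 / 60

print(f"precision = {p:.1%}, recall = {r:.1%}")
# precision = 80.0%, recall = 66.7%
```

Both metrics share the numerator TP; precision divides by predicted positives, recall by actual positives.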

The Precision-Recall Trade-off

Precision and recall usually pull in opposite directions: improving one typically costs you some of the other. This trade-off is controlled by the classification threshold. Most classifiers output a probability or score; the threshold is the score above which a prediction counts as positive. A lower threshold (e.g., score > 0.3) labels more instances as positive, so recall rises or stays the same (more positives are caught) while precision usually falls (more false positives). A higher threshold (e.g., score > 0.7) usually increases precision but decreases recall.

Example: In a logistic regression model predicting disease, setting a low threshold (e.g., 0.3) will flag many patients as having the disease, catching most true cases (high recall) but also causing many false alarms (low precision). Setting a high threshold (e.g., 0.9) will only flag very certain cases, resulting in high precision but missing many actual cases (low recall).
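The effect of moving the threshold can be sketched on a handful of scores. The ten scores and labels below are made up for illustration, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Made-up model scores with their true labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

def precision_recall_at(threshold, scores, labels):
    """Predict positive when score > threshold, then compute both metrics."""
    predicted = [score > threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

results = {t: precision_recall_at(t, scores, labels) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in results.items():
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# threshold 0.3: precision 0.57, recall 1.00
# threshold 0.5: precision 0.60, recall 0.75
# threshold 0.7: precision 0.67, recall 0.50
```

On this toy data, raising the threshold trades recall for precision at every step; real datasets follow the same general pattern, though precision does not always move smoothly.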

F1 Score: Harmonic Mean of Precision and Recall

The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It is especially useful when you need a balance between precision and recall, or when classes are imbalanced. The harmonic mean penalizes extreme values more than the arithmetic mean. For example, if precision = 1.0 and recall = 0.0, the arithmetic mean is 0.5, but F1 is 0 because the numerator 2 * (Precision * Recall) is zero.

When to use: When you need a single metric that balances precision and recall, especially in imbalanced datasets.
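A quick comparison shows how the harmonic mean reacts to lopsided values. This is a minimal sketch; the three precision/recall pairs are chosen for illustration, and returning 0.0 when both inputs are zero is a common convention rather than a rule.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

# Illustrative pairs: balanced, moderately lopsided, extremely lopsided.
pairs = [(0.75, 0.75), (0.9, 0.5), (1.0, 0.0)]

for p, r in pairs:
    print(f"P={p}, R={r}: arithmetic={arithmetic_mean(p, r):.3f}, F1={f1_score(p, r):.3f}")
# P=0.75, R=0.75: arithmetic=0.750, F1=0.750
# P=0.9, R=0.5: arithmetic=0.700, F1=0.643
# P=1.0, R=0.0: arithmetic=0.500, F1=0.000
```

The two means agree only when precision equals recall; the wider the gap, the further F1 falls below the simple average.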

Other Related Metrics

Specificity (True Negative Rate): TN / (TN + FP). Measures how well the model identifies negatives. High specificity means few false positives.

False Positive Rate (FPR): FP / (FP + TN). Used in ROC curves.

False Negative Rate (FNR): FN / (FN + TP).

ROC AUC: Area under the Receiver Operating Characteristic curve, which plots TPR vs. FPR at various thresholds. A higher AUC indicates better overall performance.
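These rates can be sketched with the same four counts. The example below uses hypothetical counts (TP=15, FP=5, TN=75, FN=5) and made-up scores. The `roc_auc` helper relies on a known equivalence: AUC equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as half. It is an illustration, not how any particular library computes AUC.

```python
def rates(tp, fp, tn, fn):
    """Rates derived from the confusion matrix."""
    return {
        "specificity": tn / (tn + fp),  # true negative rate
        "fpr": fp / (fp + tn),          # equals 1 - specificity
        "fnr": fn / (fn + tp),          # equals 1 - recall
        "tpr": tp / (tp + fn),          # recall
    }

def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

spam = rates(tp=15, fp=5, tn=75, fn=5)
print(spam)  # specificity 0.9375, fpr 0.0625, fnr 0.25, tpr 0.75

# Made-up scores and labels for ten examples.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]
auc = roc_auc(scores, labels)
print(auc)  # 0.875
```

Note that FPR and specificity always sum to 1, as do FNR and recall.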

How Metrics Interact with Azure Machine Learning

In Azure Machine Learning, you can view these metrics in the model evaluation reports. When you run an automated ML experiment, it automatically computes accuracy, precision, recall, F1 score, and AUC for classification models. You can also log custom metrics, for example with the run.log() method in the Azure Machine Learning Python SDK v1 (newer SDK v2 workflows typically log through MLflow instead). The AI-900 exam may ask which metric to use in a given scenario, often referencing Azure's capabilities.

Common Exam Traps

Accuracy paradox: High accuracy does not mean a good model if classes are imbalanced. The exam will present a dataset with 95% negative class and a model that predicts all negatives, achieving 95% accuracy but 0% recall.

Confusing precision and recall: Candidates often mix up the formulas. Remember: precision looks at predicted positives; recall looks at actual positives.

Ignoring the threshold: The exam may ask how to adjust precision or recall—the answer is often to change the classification threshold.

F1 score interpretation: The exam may ask why F1 is better than the simple average of precision and recall. The answer is that F1 is a harmonic mean, not an arithmetic mean, so it penalizes a large gap between the two.

Step-by-Step Calculation Example

Suppose a model predicts whether an email is spam. The test set has 100 emails: 20 actual spam, 80 legitimate. The model predicts:

TP = 15 (correctly identified spam)

FP = 5 (legitimate marked as spam)

TN = 75 (correctly identified legitimate)

FN = 5 (spam missed)

Calculate:

Accuracy = (15+75)/100 = 90%

Precision = 15/(15+5) = 75%

Recall = 15/(15+5) = 75%

F1 = 2*(0.75*0.75)/(0.75+0.75) = 75%

If we lower the threshold to catch more spam, we might get: TP=18, FP=15, TN=65, FN=2. Then:

Precision = 18/(18+15) = 54.5%

Recall = 18/(18+2) = 90%

F1 = 2*(0.545*0.9)/(0.545+0.9) = 67.9%

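This shows the trade-off: recall increased from 75% to 90%, but precision fell from 75% to 54.5%. F1 dropped from 75% to 67.9% because the gap between precision and recall widened, and accuracy also fell from 90% to 83% ((18+65)/100).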
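The two sets of counts from this example can be checked with a few lines of code. This is a minimal sketch; `report` is a hypothetical helper name, and it assumes none of the denominators are zero.

```python
def report(tp, fp, tn, fn):
    """Compute the four headline metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

original = report(tp=15, fp=5, tn=75, fn=5)    # counts at the first threshold
lowered = report(tp=18, fp=15, tn=65, fn=2)    # counts after lowering it

for name, metrics in (("original", original), ("lowered", lowered)):
    summary = ", ".join(f"{key}={value:.1%}" for key, value in metrics.items())
    print(f"{name}: {summary}")
# original: accuracy=90.0%, precision=75.0%, recall=75.0%, f1=75.0%
# lowered: accuracy=83.0%, precision=54.5%, recall=90.0%, f1=67.9%
```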

Summary of Key Points for the Exam

Accuracy is for balanced classes with equal cost of errors.

Precision is for minimizing false positives.

Recall is for minimizing false negatives.

F1 is a balanced metric for imbalanced classes.

The confusion matrix is the basis for all these metrics.

Threshold tuning changes precision and recall.

Always consider the business context when choosing a metric.

Walk-Through

1. Build the Confusion Matrix

Start by collecting the model's predictions on a test set with known true labels. Count the four outcomes: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). This matrix is the foundation for all subsequent metrics. For example, if your test set has 100 samples and the model predicted positive for 30, but only 20 were actually positive, you might have TP=18, FP=12, TN=68, FN=2. Ensure you understand which cell corresponds to which error type: FP is a Type I error, FN is a Type II error.

2. Calculate Accuracy

Accuracy = (TP + TN) / (TP + FP + TN + FN). This gives the proportion of correct predictions out of all predictions. In the example above, accuracy = (18+68)/100 = 86%. Accuracy is intuitive but can be misleading if classes are imbalanced. For instance, if 95% of samples are negative, a model that always predicts negative achieves 95% accuracy, yet has zero recall. The exam often tests this pitfall.

3. Calculate Precision

Precision = TP / (TP + FP). This measures how many of the predicted positives are actually positive. In the example, precision = 18/(18+12) = 60%. High precision means few false alarms. Precision is critical when false positives are costly, such as in spam detection (blocking legitimate emails) or medical diagnosis (unnecessary treatment). Note: precision is also called positive predictive value.

4. Calculate Recall

Recall = TP / (TP + FN). This measures how many of the actual positives were captured. In the example, recall = 18/(18+2) = 90%. High recall means few missed positives. Recall is crucial when false negatives are costly, such as in cancer screening (missing a tumor) or fraud detection (missing a fraudulent transaction). Recall is also called sensitivity or true positive rate.

5. Calculate F1 Score

F1 = 2 * (Precision * Recall) / (Precision + Recall). This is the harmonic mean, which balances precision and recall. In the example, F1 = 2*(0.6*0.9)/(0.6+0.9) = 1.08/1.5 = 0.72 or 72%. The F1 score is especially useful when you need a single metric for imbalanced datasets. Unlike the arithmetic mean, the harmonic mean is lower when one metric is low, penalizing extreme imbalance.
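As a cross-check, F1 can also be written directly in terms of the confusion-matrix counts. The short derivation below substitutes the definitions of precision (P) and recall (R) into the F1 formula, then plugs in the counts from this walk-through (TP=18, FP=12, FN=2).

```latex
\begin{aligned}
F_1 &= \frac{2PR}{P + R},
\qquad P = \frac{TP}{TP + FP},
\qquad R = \frac{TP}{TP + FN} \\[4pt]
2PR &= \frac{2\,TP^2}{(TP + FP)(TP + FN)},
\qquad
P + R = \frac{TP\,(2\,TP + FP + FN)}{(TP + FP)(TP + FN)} \\[4pt]
F_1 &= \frac{2\,TP}{2\,TP + FP + FN}
     = \frac{2 \cdot 18}{2 \cdot 18 + 12 + 2}
     = \frac{36}{50} = 0.72
\end{aligned}
```

The count-based form makes it clear that true negatives never enter the F1 calculation.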

What This Looks Like on the Job

Enterprise Scenario 1: Healthcare – Cancer Detection

A hospital deploys a machine learning model to detect malignant tumors from MRI scans. The dataset is imbalanced: only 2% of scans show malignancy. The model must have high recall to avoid missing any cancers (false negatives are life-threatening). However, too many false positives (low precision) would overwhelm radiologists with unnecessary follow-ups. The team uses Azure Machine Learning to train a convolutional neural network and tunes the classification threshold to achieve recall > 98% while maintaining precision above 50%. They monitor the confusion matrix weekly and adjust the threshold as new data comes in. Misconfiguration (e.g., optimizing for accuracy) would lead to a model that misses cancers, potentially causing patient harm.

Enterprise Scenario 2: E-commerce – Fraud Detection

An online retailer uses a fraud detection model to flag suspicious transactions. False positives (blocking legitimate purchases) lead to lost sales and customer frustration, while false negatives (allowing fraud) cause financial loss. The business decides that the cost of a false positive is higher than a false negative (because they can recover fraud losses but not lost customers). Therefore, they optimize for high precision, accepting lower recall. They set the threshold to 0.9, achieving precision of 95% but recall of 60%. The model is deployed in Azure Functions, processing thousands of transactions per second. They continuously retrain the model with new fraud patterns and use Azure Monitor to track precision and recall in real time.

Enterprise Scenario 3: Manufacturing – Defect Detection

A factory uses a computer vision model to detect defects on an assembly line. The cost of missing a defect (false negative) is high because defective products reach customers, damaging brand reputation. The cost of false positives is moderate—they cause unnecessary re-inspection but no major loss. The team optimizes for recall, setting a low threshold of 0.3, achieving recall of 99% but precision of 70%. They use Azure IoT Edge to run the model on edge devices. The confusion matrix is reviewed daily, and if precision drops below 60%, they trigger a retraining pipeline. Misconfiguring the threshold could lead to either too many missed defects or too many false alarms slowing production.
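All three scenarios come down to the same tuning step: pick the threshold that satisfies a floor on one metric while getting the most out of the other. The sketch below shows one simple way to do that for a recall floor. The `pick_threshold` helper, the scores, and the labels are all hypothetical, and the helper assumes the data contains at least one positive. It predicts positive when score >= threshold so that every candidate threshold flags at least one item.

```python
def pick_threshold(scores, labels, min_recall):
    """Return (threshold, precision, recall) with the best precision among
    thresholds whose recall is at least min_recall, or None if none qualify."""
    total_positives = sum(labels)
    best = None
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        recall = tp / total_positives
        precision = tp / (tp + fp)
        if recall >= min_recall and (best is None or precision > best[1]):
            best = (threshold, precision, recall)
    return best

# Made-up validation scores with true labels (1 = positive).
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

strict = pick_threshold(scores, labels, min_recall=1.0)
relaxed = pick_threshold(scores, labels, min_recall=0.75)
print(strict)   # threshold 0.45: precision about 0.67, recall 1.0
print(relaxed)  # threshold 0.65: precision 0.75, recall 0.75
```

Demanding full recall forces a lower threshold and costs precision; relaxing the recall floor lets the threshold rise. A precision floor works the same way with the roles of the two metrics swapped.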

How AI-900 Actually Tests This

What the AI-900 Exam Tests

The AI-900 exam objective 2.2 expects you to "evaluate machine learning models" by understanding metrics like accuracy, precision, recall, and F1 score. Specifically, you should be able to:

Interpret a confusion matrix.

Identify which metric to use in a given scenario (e.g., high recall for cancer detection).

Explain the trade-off between precision and recall.

Recognize that accuracy is misleading for imbalanced datasets.

Understand that F1 score balances precision and recall.

Common Wrong Answers and Why Candidates Choose Them

1. Choosing accuracy for imbalanced datasets: Candidates see accuracy as the default metric and don't realize it can be high even when the model is useless. The exam will present a dataset with 95% negatives and a model predicting all negatives: accuracy is 95%, but the model fails to catch any positives. The correct answer is to use precision, recall, or F1.

2. Confusing precision and recall: Many students mix up the formulas. They might say precision measures how many actual positives were caught (which is recall). Remember: precision = TP/(TP+FP) (focus on predicted positives); recall = TP/(TP+FN) (focus on actual positives).

3. Thinking F1 is the arithmetic mean: The exam may ask why F1 is better than the simple average of precision and recall. The answer is that F1 uses the harmonic mean, which penalizes a large gap between the two more heavily.

4. Ignoring the threshold: When asked how to improve precision, some candidates suggest retraining the model, but the simplest way is to raise the classification threshold.

Specific Numbers and Terms on the Exam

The confusion matrix is always 2x2 for binary classification.

Formulas: Accuracy = (TP+TN)/(TP+FP+TN+FN); Precision = TP/(TP+FP); Recall = TP/(TP+FN); F1 = 2*(P*R)/(P+R).

The exam may present a scenario with a specific cost for false positives vs. false negatives and ask which metric to optimize.

Terms like "sensitivity" (recall) and "positive predictive value" (precision) may appear.

Edge Cases and Exceptions

Multi-class classification: For multi-class, metrics are computed per class (one-vs-rest) and then averaged (macro, micro, weighted). The AI-900 exam focuses on binary classification, but be aware that precision and recall can be extended.

Perfect model: If precision = 1 and recall = 1, F1 = 1. But if one is 0, F1 = 0.

Zero division: If TP+FP = 0, precision is undefined (often set to 0). Similarly if TP+FN = 0, recall is undefined.
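The zero-division cases are worth handling explicitly in code. This is a minimal sketch, assuming the common convention of reporting 0.0 when a metric is undefined; `safe_divide` and `precision_recall_f1` are hypothetical helper names.

```python
def safe_divide(numerator, denominator, default=0.0):
    """Return default instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else default

def precision_recall_f1(tp, fp, fn):
    precision = safe_divide(tp, tp + fp)  # undefined when nothing is predicted positive
    recall = safe_divide(tp, tp + fn)     # undefined when there are no actual positives
    f1 = safe_divide(2 * precision * recall, precision + recall)
    return precision, recall, f1

# A model that never predicts positive: TP = FP = 0, every positive is missed.
never_positive = precision_recall_f1(tp=0, fp=0, fn=50)
print(never_positive)  # (0.0, 0.0, 0.0)

# A perfect model: no false positives and no false negatives.
perfect = precision_recall_f1(tp=10, fp=0, fn=0)
print(perfect)  # (1.0, 1.0, 1.0)
```

Whether an undefined metric should be reported as 0, skipped, or flagged is a design choice; the fallback value here is only one option.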

How to Eliminate Wrong Answers

If the question states that false positives are costly, eliminate accuracy and recall; choose precision, or F1 if a balance is needed.

If the question states that false negatives are costly, eliminate accuracy and precision; choose recall, or F1 if a balance is needed.

If the dataset is imbalanced, eliminate accuracy; choose F1 or a combination of precision and recall.

If the question asks for a single metric that balances precision and recall, choose F1.

Key Takeaways

Accuracy = (TP+TN)/(TP+FP+TN+FN); best for balanced classes.

Precision = TP/(TP+FP); minimizes false positives.

Recall = TP/(TP+FN); minimizes false negatives.

F1 = 2*P*R/(P+R); harmonic mean, balances precision and recall.

The confusion matrix is a 2x2 table of TP, FP, TN, FN.

Classification threshold determines precision-recall trade-off.

For imbalanced datasets, avoid accuracy; use precision, recall, or F1.

On the AI-900 exam, always consider business context to choose the right metric.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Accuracy

Measures overall correctness: (TP+TN)/total.

Works well for balanced classes.

Misleading for imbalanced datasets.

Does not distinguish between error types.

Easy to understand and explain.

F1 Score

Harmonic mean of precision and recall.

Better for imbalanced datasets.

Penalizes extreme imbalance between precision and recall.

Does not consider true negatives.

More complex but more informative for skewed classes.

Watch Out for These

Mistake

Accuracy is always the best metric to evaluate a model.

Correct

Accuracy can be misleading when classes are imbalanced. For example, a model that predicts the majority class for all instances can have high accuracy but zero recall for the minority class. Always consider class distribution and business costs.

Mistake

Precision and recall are independent of each other.

Correct

Precision and recall are inversely related due to the classification threshold. Increasing one typically decreases the other. The F1 score captures this trade-off.

Mistake

F1 score is the arithmetic mean of precision and recall.

Correct

F1 is the harmonic mean, which is lower than the arithmetic mean when precision and recall differ. The harmonic mean penalizes imbalance more severely.

Mistake

A high F1 score always means a good model.

Correct

F1 score balances precision and recall, but it does not account for true negatives. If the cost of false positives and false negatives is asymmetric, you might still need to prioritize one metric over F1.

Mistake

You cannot improve precision without reducing recall.

Correct

While there is typically a trade-off, you can sometimes improve both by retraining the model with better features, more data, or a different algorithm. However, for a fixed model, the trade-off is governed by the threshold.


Frequently Asked Questions

What is the difference between precision and recall?

Precision measures how many of the predicted positives are actually positive (TP/(TP+FP)), while recall measures how many of the actual positives were captured (TP/(TP+FN)). In a spam filter, precision tells you how many flagged emails are truly spam; recall tells you how much spam was caught. The exam often tests this distinction by asking which metric to use when false positives are costly (precision) or false negatives are costly (recall).

When should I use F1 score instead of accuracy?

Use F1 score when you have an imbalanced dataset or when you need a balance between precision and recall. Accuracy can be high even if the model misses all positive cases when negatives dominate. F1 score captures both false positives and false negatives and is more informative for skewed classes. For example, in fraud detection (99.9% legitimate transactions), a model that predicts all transactions as legitimate has 99.9% accuracy but 0% recall. Its precision is undefined because it makes no positive predictions, and its F1 is conventionally reported as 0, which clearly indicates failure.

How do I calculate precision and recall from a confusion matrix?

Given a confusion matrix with TP, FP, TN, FN: Precision = TP/(TP+FP). Recall = TP/(TP+FN). For example, if TP=80, FP=20, FN=10, TN=90, then Precision=80/(80+20)=0.8, Recall=80/(80+10)=0.889. Always check which axis holds the actual labels, because layouts vary between tools. When rows are actual classes and columns are predicted classes, with the positive class listed first, TP is top-left, FN top-right, FP bottom-left, and TN bottom-right.

What is the impact of changing the classification threshold on precision and recall?

Lowering the threshold (e.g., from 0.5 to 0.3) increases recall (more positives are predicted, so more true positives are caught) but decreases precision (more false positives). Raising the threshold does the opposite. This trade-off is fundamental. For example, in a medical test, a lower threshold catches more diseases but also causes more false alarms. The exam may ask how to increase recall—the answer is to lower the threshold.

Why is accuracy not a good metric for imbalanced datasets?

Accuracy is the ratio of correct predictions to total predictions. In an imbalanced dataset where 95% of instances are negative, a model that always predicts negative achieves 95% accuracy, yet it fails to identify any positive instances. This is called the accuracy paradox. The model is useless for the minority class. Precision, recall, or F1 provide a better picture. The exam often tests this by presenting such a scenario and asking why accuracy is misleading.

What does the F1 score penalize?

The F1 score penalizes imbalance between precision and recall. Because it is a harmonic mean, it is lower than the arithmetic mean when precision and recall differ. For example, if precision=1.0 and recall=0.0, the arithmetic mean is 0.5 but F1 is 0.0. This penalizes models that are extremely good at one metric but terrible at the other. The exam may ask why F1 is better than the simple average of precision and recall.

How do I interpret a confusion matrix for a multi-class problem?

For multi-class, the confusion matrix has size NxN where N is the number of classes. Each row represents actual class, each column predicted class. To compute precision and recall for a specific class, treat that class as positive and all others as negative (one-vs-rest). Then compute TP, FP, FN accordingly. The AI-900 exam focuses on binary classification, but be aware of this extension.
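The one-vs-rest idea can be sketched on a small 3x3 matrix. The class names and counts below are made up, rows are actual classes and columns are predicted classes, and `per_class_metrics` is a hypothetical helper. The final line shows macro averaging, which is the unweighted mean of the per-class values.

```python
classes = ["cat", "dog", "bird"]

# Rows are actual classes, columns are predicted classes (made-up counts).
matrix = [
    [8, 1, 1],  # actual cat
    [2, 7, 1],  # actual dog
    [0, 2, 8],  # actual bird
]

def per_class_metrics(matrix, index):
    """Treat one class as positive and all others as negative."""
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                 # rest of the row: missed positives
    fp = sum(row[index] for row in matrix) - tp  # rest of the column: false alarms
    return tp / (tp + fp), tp / (tp + fn)

per_class = {name: per_class_metrics(matrix, i) for i, name in enumerate(classes)}
for name, (precision, recall) in per_class.items():
    print(f"{name}: precision {precision:.2f}, recall {recall:.2f}")
# cat: precision 0.80, recall 0.80
# dog: precision 0.70, recall 0.70
# bird: precision 0.80, recall 0.80

macro_precision = sum(p for p, _ in per_class.values()) / len(classes)
print(f"macro precision {macro_precision:.3f}")  # macro precision 0.767
```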
