
Custom Vision Models: Training and Evaluation

This chapter covers how to train and evaluate custom vision models using Azure Custom Vision, a key service for computer vision tasks without deep learning expertise. For the AI-900 exam, understanding the training process, evaluation metrics, and iteration is crucial, as approximately 10-15% of questions relate to custom vision and its lifecycle. You will learn the mechanics of image classification and object detection training, how to interpret performance metrics, and common pitfalls to avoid.

Updated May 31, 2026

Training a Custom Vision Model Like a Detective

Think of training a custom vision model as teaching a detective to recognize specific types of evidence from crime scenes. You are the lead detective who has a pile of photos—some show burglaries, others show arson, and others are irrelevant. You start by sorting these photos into labeled folders: 'burglary', 'arson', and 'no crime'. Each folder contains many examples, some with the evidence in different lighting, angles, and backgrounds. The detective (the model) studies each folder, learning patterns: broken windows and footprints for burglary, charred wood and smoke for arson. But the detective doesn't just memorize—they learn to generalize. You then test the detective with new photos they've never seen. If they misidentify a photo of a car theft as burglary, you correct them and add that photo to the training. Over time, the detective becomes an expert, able to spot even subtle clues. In Azure Custom Vision, you provide labeled images (training data), the service extracts features (like a detective's keen eye), and you evaluate performance with metrics like precision and recall. The iterative process of adding more examples or correcting errors is exactly like refining a detective's skills until they can reliably solve cases.

How It Actually Works

What is Custom Vision and Why Does It Exist?

Azure Custom Vision is a cloud-based service that allows you to build, train, and deploy custom image classifiers and object detectors without requiring machine learning expertise. It is part of Azure AI services (formerly Azure Cognitive Services) and sits alongside the pre-built Computer Vision service rather than inside it. The service is designed for scenarios where pre-built models (like those in Computer Vision) are insufficient because you need to recognize domain-specific objects or concepts, for example identifying defects in manufactured parts, distinguishing between different species of plants, or detecting specific logos in images.

Custom Vision eliminates the need to write code for model architecture, training loops, or hyperparameter tuning. Instead, you provide labeled images, and the service automatically handles feature extraction, model selection, and optimization. The underlying approach is transfer learning: a deep neural network pre-trained on a large image corpus is fine-tuned on your custom dataset. Microsoft does not publish the exact architecture; networks in the ResNet or MobileNet family are typical of this approach. Transfer learning dramatically reduces the amount of data and time required compared to training from scratch.

How It Works Internally

When you upload images and assign tags (for classification) or draw bounding boxes (for object detection), Azure Custom Vision performs several steps:

1. Image Preprocessing: Images are resized to a fixed input size and normalized so that every training batch is consistent. The exact size is not published; 224x224 pixels is typical for classification networks of this kind.

2. Feature Extraction: A pre-trained base network processes each image to extract features such as edges, textures, and shapes. In standard transfer learning the base network's weights start out frozen, so only the final layers are trained on your data.

3. Fine-Tuning: The final classification layer (or, for object detection, the detection head) is replaced with one sized to match your custom tags. A typical recipe trains it with stochastic gradient descent (SGD) at a small learning rate (e.g., 0.001) so that the pre-trained features are not overwritten (catastrophic forgetting). Training minimizes a loss function: categorical cross-entropy for classification, and a combination of classification and localization loss for object detection.

4. Iteration and Evaluation: Each training run produces an iteration. The service evaluates it automatically on images held out from the set you uploaded (Microsoft's documentation describes the procedure as k-fold cross-validation), so you do not need to supply a separate test set. It reports precision, recall, and average precision (mAP for object detection). You can view these in the Custom Vision portal and decide whether the model is ready for deployment or needs more data or adjustments.

Key Components, Values, and Defaults

- Project Types:
  - Image Classification: Assigns one or more tags to an entire image. Two subtypes:
    - Multiclass: Each image has exactly one tag (e.g., 'cat' or 'dog').
    - Multilabel: Each image can have multiple tags (e.g., 'cat' and 'sleeping').
  - Object Detection: Identifies objects within an image and draws bounding boxes around them. This requires you to draw bounding boxes for each object instance.

- Domains: Custom Vision offers specialized domains that optimize the model for specific scenarios:
  - General: Default, works for most everyday objects.
  - Food: Optimized for recognizing food items.
  - Landmarks: For recognizing natural and man-made landmarks.
  - Retail: For products on shelves.
  - Compact Domains: Lightweight models for edge devices (e.g., 'General (compact)', 'Landmarks (compact)'). These are smaller and faster but may have lower accuracy.

Training Budget: Training time is metered by the hour. The Free tier includes 2 training hours per month; on the S0 tier you pay $2 per training hour.

Minimum Data Requirements:

For classification: At least 5 images per tag (and at least 2 tags in the project), but 50+ per tag is recommended for good accuracy.

For object detection: At least 15 images per tag, with at least 30 instances of each tag.
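Before you spend training time, it can help to check your tag counts against these thresholds. The following is a minimal sketch, not part of any Azure SDK: `check_tag_counts` and the sample counts are hypothetical, and the caller passes in whichever minimum applies to the project type.

```python
def check_tag_counts(image_counts, minimum, recommended=50):
    """Classify each tag as 'too few', 'below recommended', or 'ok'."""
    report = {}
    for tag, count in image_counts.items():
        if count < minimum:
            report[tag] = "too few"            # training would be rejected
        elif count < recommended:
            report[tag] = "below recommended"  # trains, but accuracy may suffer
        else:
            report[tag] = "ok"
    return report


# Hypothetical object detection project, using the 15-image minimum quoted above.
counts = {"scratch": 12, "dent": 30, "ok_part": 80}
report = check_tag_counts(counts, minimum=15)
print(report)
```

A check like this only counts images; it says nothing about whether they cover the lighting, angles, and backgrounds the model will see in production.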

Image Requirements:

Format: JPEG, PNG, BMP, or GIF (static).

Max file size: 6 MB for training images, 4 MB for prediction.

Dimensions: Minimum 256 pixels on the shortest side; maximum 10240 pixels on the longest side.

Color space: RGB.
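These limits are easy to check locally before uploading. The sketch below is illustrative only: `training_image_problems` is a hypothetical helper, it assumes you already know each file's size and pixel dimensions, and it treats 6 MB as 6 x 1024 x 1024 bytes, which is an assumption about how the limit is measured.

```python
import os

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}
MAX_TRAINING_BYTES = 6 * 1024 * 1024  # assumed interpretation of "6 MB"
MIN_SHORT_SIDE = 256
MAX_LONG_SIDE = 10240


def training_image_problems(filename, size_bytes, width, height):
    """Return a list of reasons the image falls outside the documented limits."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > MAX_TRAINING_BYTES:
        problems.append("file larger than 6 MB")
    if min(width, height) < MIN_SHORT_SIDE:
        problems.append("shortest side under 256 px")
    if max(width, height) > MAX_LONG_SIDE:
        problems.append("longest side over 10240 px")
    return problems


print(training_image_problems("part_017.JPG", 1_200_000, 1920, 1080))
print(training_image_problems("scan.tiff", 7 * 1024 * 1024, 200, 300))
```

The check relies on the file extension, so a mislabeled file would slip through; it also cannot tell an animated GIF from a static one.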

- Evaluation Metrics:
  - Precision: Percentage of correct positive predictions (True Positives / (True Positives + False Positives)).
  - Recall: Percentage of actual positives correctly identified (True Positives / (True Positives + False Negatives)).
  - Mean Average Precision (mAP): Standard metric for object detection. It averages the precision across different recall thresholds. Higher is better (range 0-100).
  - Accuracy: For classification, the percentage of correctly classified images ((True Positives + True Negatives) / Total). This metric can be misleading if classes are imbalanced.
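The formulas above are simple enough to compute by hand, and doing so shows why accuracy misleads on imbalanced data. This sketch uses made-up counts for a hypothetical defect classifier; the function names are illustrative, not part of Custom Vision.

```python
def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all images that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical test set: 1000 parts, only 10 of them defective.
# Model A predicts "not defective" for every image.
lazy = {"tp": 0, "fp": 0, "fn": 10, "tn": 990}
# Model B finds 8 of the 10 defects and raises 4 false alarms.
useful = {"tp": 8, "fp": 4, "fn": 2, "tn": 986}

for name, c in (("A", lazy), ("B", useful)):
    print(
        name,
        "accuracy:", round(accuracy(**c), 3),
        "precision:", round(precision(c["tp"], c["fp"]), 3),
        "recall:", round(recall(c["tp"], c["fn"]), 3),
    )
```

Model A scores 99% accuracy while finding no defects at all, which is why precision and recall are the metrics to watch when one class is rare.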

Probability Threshold: During prediction, you can set a probability threshold (default 50%). Only predictions with confidence above this threshold are returned. Adjusting this threshold trades off precision and recall.
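To see the trade-off concretely, you can score the same set of predictions at several thresholds. The data below is invented for illustration, and the sketch assumes a prediction counts as positive when its probability is at or above the threshold.

```python
def precision_recall_at(scored, threshold):
    """scored is a list of (probability, is_actually_positive) pairs."""
    tp = sum(1 for p, actual in scored if p >= threshold and actual)
    fp = sum(1 for p, actual in scored if p >= threshold and not actual)
    fn = sum(1 for p, actual in scored if p < threshold and actual)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model outputs for one tag, paired with the ground truth.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, True), (0.55, False), (0.40, True), (0.30, False),
]

for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(scored, threshold)
    print(f"threshold={threshold:.2f} precision={p:.3f} recall={r:.3f}")
```

Raising the threshold from 0.25 to 0.85 pushes precision up and recall down on this sample, which matches the behavior described above.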

Configuration and Verification Commands

Custom Vision is primarily managed through the Azure portal or the Custom Vision API. There is no direct CLI for training, but you can use the REST API or SDKs (Python, C#, etc.).

Using the Python SDK to create a project and train:

import time

from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from msrest.authentication import ApiKeyCredentials

# Replace with your endpoint and training key
endpoint = "https://your-resource.cognitiveservices.azure.com/"
training_key = "your-training-key"

credentials = ApiKeyCredentials(in_headers={"Training-key": training_key})
trainer = CustomVisionTrainingClient(endpoint, credentials)

# Create a new project ("domain-id" is a placeholder; trainer.get_domains() lists the real IDs)
project = trainer.create_project("My Custom Vision Project", domain_id="domain-id")

# Add tags and upload images (code omitted for brevity)

# Start training. train_project returns right away, so poll until the iteration finishes.
iteration = trainer.train_project(project.id)
while iteration.status != "Completed":
    time.sleep(10)
    iteration = trainer.get_iteration(project.id, iteration.id)
    print("Training status:", iteration.status)
print("Training completed with iteration ID:", iteration.id)

Verifying training status:

iteration = trainer.get_iteration(project.id, iteration.id)
print("Status:", iteration.status)
print("Trained at:", iteration.trained_time)
print("Is default:", iteration.is_default)

Publishing the model for prediction:

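# prediction_resource_id is the Azure resource ID of your prediction resource
# (shown on the resource's Properties page in the Azure portal); set it before this call.
trainer.publish_iteration(project.id, iteration.id, "myModel", prediction_resource_id)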

How It Interacts with Related Technologies

Custom Vision is often used alongside other Azure services:
- Azure Storage: Store your training images in Blob storage and import them into Custom Vision.
- Azure Functions or Logic Apps: Automate retraining when new images are added.
- Azure IoT Edge: Deploy compact models to edge devices for offline inference.
- Containers: Export a compact model as a Docker container and run it on-premises for low-latency scenarios.

The service feeds into a larger AI pipeline: images are captured, preprocessed, sent to Custom Vision for inference, and results are stored or trigger actions.

Walk-Through

1. Create a Custom Vision Resource

In the Azure portal, create a Custom Vision resource. You need two keys: a training key and a prediction key. The training key is used to create projects, upload images, and train models. The prediction key is used to call the prediction endpoint. The resource also has an endpoint URL. For the Free tier, you get 2 training hours per month and 10,000 predictions per month. The S0 tier is paid and scales.

2. Create a Project and Choose Domain

Once the resource is created, go to the Custom Vision portal (customvision.ai) and create a new project. Select the project type (Classification or Object Detection) and the domain that best matches your scenario. For example, if you are classifying food, choose the 'Food' domain. The domain optimizes the base model for that type of imagery. If you plan to deploy to a mobile device, choose a compact domain.

3. Upload and Tag Images

Upload your images to the project. For classification, assign one or more tags to each image. For object detection, draw bounding boxes around each object and assign a tag. Use at least 50 images per tag for classification, and 15 images per tag with multiple instances for object detection. Ensure images represent real-world variability: different angles, lighting, backgrounds, and occlusions. The service will automatically split your data into training (80%) and validation (20%) sets, but you can also manually assign images to specific sets.
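Separately from whatever the service holds back for its own metrics, many teams keep a small test set that is never uploaded for training, so they have unseen images for later checks. The following is a local sketch of one way to do that; `stratified_holdout` and the file names are hypothetical, and this is not how Custom Vision splits data internally.

```python
import random
from collections import defaultdict


def stratified_holdout(labeled_images, holdout_fraction=0.2, seed=0):
    """Split (path, tag) pairs so every tag keeps roughly the same holdout share."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_tag = defaultdict(list)
    for path, tag in labeled_images:
        by_tag[tag].append(path)

    train, holdout = [], []
    for tag, paths in by_tag.items():
        paths = sorted(paths)
        rng.shuffle(paths)
        # Hold out at least one image per tag, even for very small tags.
        n_holdout = max(1, round(len(paths) * holdout_fraction))
        holdout += [(p, tag) for p in paths[:n_holdout]]
        train += [(p, tag) for p in paths[n_holdout:]]
    return train, holdout


images = [(f"cat_{i}.jpg", "cat") for i in range(10)]
images += [(f"dog_{i}.jpg", "dog") for i in range(5)]
train, holdout = stratified_holdout(images)
print(len(train), "to upload,", len(holdout), "kept back for testing")
```

Splitting per tag rather than across the whole pool keeps rare tags from ending up entirely on one side of the split.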

4. Train the Model

Click the 'Train' button in the portal. You can choose 'Quick Training' (faster, uses default settings) or 'Advanced Training', where you set a training budget in hours. Training takes minutes to hours depending on dataset size and the chosen budget, and the portal shows the iteration's status while it runs. The model learns to map features to tags. Once training completes, you get precision, recall, and mAP (for object detection) for the iteration.

5. Evaluate and Iterate

Review the performance metrics. If precision is low, your model is producing many false positives—consider adding negative images or retagging ambiguous cases. If recall is low, you need more positive examples. Use the 'Quick Test' feature to test the model on new images. You can also view the confusion matrix to see which classes are often confused. If performance is unsatisfactory, add more images, correct labels, or adjust the probability threshold. Retrain and repeat until you meet your accuracy goals.
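If you record the true tag and the predicted tag for each of your own test images, you can tally which pairs of classes get confused most often. This is a minimal local sketch with invented labels; `confusion_counts` and `most_confused` are hypothetical helpers, not Custom Vision features.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


def most_confused(actual, predicted):
    """Return the most frequent misclassification as ((actual, predicted), count)."""
    errors = {
        pair: n
        for pair, n in confusion_counts(actual, predicted).items()
        if pair[0] != pair[1]
    }
    if not errors:
        return None
    return max(errors.items(), key=lambda item: item[1])


# Hypothetical results from testing six held-back images.
actual = ["burglary", "burglary", "arson", "arson", "arson", "no crime"]
predicted = ["burglary", "arson", "arson", "burglary", "burglary", "no crime"]

print(confusion_counts(actual, predicted))
print(most_confused(actual, predicted))
```

Here the most common error is arson being predicted as burglary, which would suggest adding more varied arson images before retraining.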

6. Publish and Deploy

Once satisfied, publish the iteration to a prediction endpoint. You give the published model a name (e.g., 'myModel'). Then, using the prediction key and endpoint, you can call the API to classify new images. The prediction API returns a JSON object with predicted tags and confidence scores. You can set a probability threshold to filter low-confidence predictions. The model can be deployed to containers for edge scenarios or used directly via the cloud.
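Filtering that JSON on the client side takes only a few lines. The response below is a trimmed, made-up example that keeps just the `predictions` list with `tagName` and `probability`; a real response carries additional fields, and `confident_tags` is a hypothetical helper.

```python
import json

# Trimmed, illustrative response body (real responses include more fields).
sample_response = """
{
  "predictions": [
    {"tagName": "scratch", "probability": 0.91},
    {"tagName": "dent", "probability": 0.42},
    {"tagName": "ok_part", "probability": 0.07}
  ]
}
"""


def confident_tags(response_text, threshold=0.5):
    """Return (tag, probability) pairs at or above the threshold, best first."""
    data = json.loads(response_text)
    hits = [
        (p["tagName"], p["probability"])
        for p in data["predictions"]
        if p["probability"] >= threshold
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


print(confident_tags(sample_response))
print(confident_tags(sample_response, threshold=0.4))
```

Applying the threshold in your own code, rather than hard-coding it elsewhere, makes it easy to tune later without retraining or republishing the model.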

What This Looks Like on the Job

Enterprise Scenario 1: Manufacturing Defect Detection

A car parts manufacturer uses Custom Vision to detect scratches on painted surfaces. They train an object detection model with hundreds of images of scratched and flawless parts. The model is deployed on the factory floor using Azure IoT Edge on a local GPU server. The system processes images from cameras in real time, flagging defective parts. Key considerations: the model must achieve high recall (to catch almost all defects) even if precision is slightly lower (some false alarms are acceptable). The team uses a compact domain for low latency. They continuously add new images of rare defects to improve recall. Misconfiguration: if the training set contains only images under ideal lighting, the model fails under variable factory lighting. They solved this by augmenting the dataset with synthetic variations.

Enterprise Scenario 2: Retail Inventory Management

A retail chain uses Custom Vision to identify products on shelves. They train a multilabel classifier to recognize multiple brands and categories in a single shelf image. The model runs on cameras in stores, updating inventory in real time. The challenge is scale: hundreds of stores, each with thousands of products. They use the General domain and train on images from multiple stores to handle lighting differences. Performance metric: mAP is used because it balances precision and recall across many classes. Common issue: class imbalance—popular products dominate, while rare items have few examples. They mitigate by oversampling rare classes. The model is retrained weekly with new product images.
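Class imbalance of this kind is easy to quantify before retraining. The sketch below uses invented product counts and a hypothetical helper, `images_needed_to_balance`, to show how many more images each tag would need to match the largest one; it only measures the gap and does not collect or duplicate images.

```python
def images_needed_to_balance(counts):
    """For each tag, how many more images it needs to match the largest tag."""
    target = max(counts.values())
    return {tag: target - n for tag, n in counts.items()}


# Hypothetical image counts per product tag.
counts = {"cola": 400, "chips": 250, "seasonal_candy": 40}
gaps = images_needed_to_balance(counts)
print(gaps)

# Ratio between the largest and smallest tag as a quick imbalance indicator.
imbalance_ratio = max(counts.values()) / min(counts.values())
print("largest tag is", imbalance_ratio, "times the smallest")
```

A large gap for a tag is a prompt to gather more distinct photos of that product rather than to re-upload copies of the ones you already have.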

Enterprise Scenario 3: Medical Image Analysis

A healthcare startup uses Custom Vision to classify X-ray images for signs of pneumonia. They use a multiclass classifier with tags 'normal' and 'pneumonia'. Due to regulatory requirements, the model must have extremely high precision (to avoid false positives that could lead to unnecessary treatment). They set the probability threshold to 95% and only act on high-confidence predictions. The model is deployed in a container on-premises for data privacy. They continuously monitor performance using a separate test set. Failure mode: if the training set has more normal images than pneumonia, the model becomes biased toward 'normal'. They carefully balance the dataset and use weighted loss functions.

How AI-900 Actually Tests This

What AI-900 Tests on Custom Vision

The AI-900 exam (objective 3.5) focuses on the lifecycle of a custom vision model: data preparation, training, evaluation, and iteration. Specifically, you should know:

The difference between image classification and object detection.

Minimum data requirements: at least 5 images per tag for classification, but 50+ recommended.

The concept of probability threshold and its effect on precision and recall.

How to interpret precision, recall, and mAP.

The role of domains (General, Food, Landmarks, Retail) and compact domains for edge devices.

The iterative nature of training: add more data, correct labels, retrain.
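Of the metrics in this list, mAP is the hardest to picture, so a small worked calculation may help. The sketch below uses one common formulation of average precision (the mean of the precision values at each correct detection, ranked by confidence) on invented data; the exact procedure Custom Vision uses, such as how it matches boxes to ground truth, may differ.

```python
def average_precision(scored, total_positives=None):
    """scored is a list of (confidence, is_correct) pairs for one tag.

    total_positives is the number of ground-truth objects; if omitted, it is
    assumed that every ground-truth object appears somewhere in scored.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    if total_positives is None:
        total_positives = sum(1 for _, correct in ranked if correct)
    if total_positives == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, (_, correct) in enumerate(ranked, start=1):
        if correct:
            hits += 1
            precision_sum += hits / rank  # precision at this point in the ranking
    return precision_sum / total_positives


def mean_average_precision(per_tag):
    """Average the per-tag AP values."""
    return sum(average_precision(s) for s in per_tag.values()) / len(per_tag)


# Hypothetical detections per tag: (confidence, was the detection correct?)
detections = {
    "scratch": [(0.9, True), (0.8, False), (0.7, True), (0.6, True)],
    "dent": [(0.95, True), (0.5, True)],
}
for tag, scored in detections.items():
    print(tag, round(average_precision(scored), 4))
print("mAP", round(mean_average_precision(detections), 4))
```

The single wrong 'scratch' detection ranked second pulls that tag's AP below 1.0, while 'dent' scores a perfect 1.0; mAP is simply the mean of the two.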

Common Wrong Answers and Why Candidates Choose Them

1. "You need to write custom code for the model architecture." Many candidates assume that custom vision requires machine learning expertise. In reality, Azure Custom Vision is a no-code service; you only upload and tag images.

2. "Precision is more important than recall in all cases." This is false. The importance depends on the business scenario. For medical diagnosis, high recall may be critical to catch all cases. The exam expects you to understand the trade-off.

3. "You must split your data manually into training and testing sets." While you can, the service automatically splits data 80/20 if you don't specify. The exam tests that automation.

4. "Object detection requires one bounding box per image." False. An image can have zero or multiple bounding boxes. The exam may test that you can tag images with no objects (negative images).

Specific Numbers and Terms to Memorize

Max image file size: 6 MB for training, 4 MB for prediction.

Minimum image dimension: 256 pixels on shortest side.

Probability threshold default: 50%.

Free tier: 2 training hours/month, 10,000 predictions/month.

Compact domains: for mobile or edge deployment.

Edge Cases and Exceptions

If you have very few images (e.g., only 5 per tag), the model will likely overfit and perform poorly on new data. The exam may ask about the minimum recommended number.

For object detection, if you don't draw bounding boxes for all instances, the model will miss them. The exam might present a scenario where only some objects are labeled.

The evaluation metrics are computed on the validation set. If you don't have a separate validation set, the service uses a random subset of your training data.

How to Eliminate Wrong Answers

If an answer suggests that Custom Vision requires coding, eliminate it.

If an answer claims that you must manually split data, eliminate it (optional, not required).

If an answer confuses precision with recall, check the definition: precision = correct positives among predicted positives; recall = correct positives among actual positives.

For deployment, remember that you can publish an iteration to a prediction endpoint and use the prediction key.

Key Takeaways

Custom Vision is a no-code service for building custom image classifiers and object detectors.

Minimum data requirement: 5 images per tag for classification, 15 images per tag for object detection (50+ recommended for both).

Domains optimize the base model for specific scenarios (General, Food, Landmarks, Retail, Compact).

Key metrics: Precision, Recall, and mAP (for object detection).

Probability threshold (default 50%) controls the trade-off between precision and recall.

Training is iterative: add more data, correct labels, retrain until performance is satisfactory.

Published models are accessed via a prediction endpoint using the prediction key.

Compact domains are designed for deployment on edge devices with limited resources.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Image Classification

Assigns one or more tags to the entire image.

No bounding boxes required; only image-level labels.

Simpler and requires less labeling effort.

Suitable for scenarios like scene recognition or image categorization.

Output is a set of tags with confidence scores.

Object Detection

Identifies specific objects and their locations with bounding boxes.

Requires drawing bounding boxes around each object instance.

More complex and time-consuming to label.

Suitable for scenarios like counting objects or spatial analysis.

Output includes bounding box coordinates (x, y, width, height) and tag with confidence.
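The prediction API reports each bounding box as `left`, `top`, `width`, and `height`, expressed as fractions of the image size (values between 0 and 1) rather than pixels. To draw a box on the original image you scale it back up. A minimal sketch, with a hypothetical helper name and made-up values:

```python
def to_pixels(box, image_width, image_height):
    """Convert a normalized bounding box to pixel (left, top, width, height)."""
    left = round(box["left"] * image_width)
    top = round(box["top"] * image_height)
    width = round(box["width"] * image_width)
    height = round(box["height"] * image_height)
    return left, top, width, height


# Hypothetical detection on an 800x600 image.
box = {"left": 0.25, "top": 0.1, "width": 0.5, "height": 0.3}
print(to_pixels(box, image_width=800, image_height=600))
```

Because the values are normalized, the same prediction can be drawn on a resized copy of the image simply by passing different dimensions.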

Watch Out for These

Mistake

Custom Vision requires you to write Python code to train a model.

Correct

You can train entirely through the Custom Vision portal without writing any code. Code is optional for automation.

Mistake

The probability threshold is fixed at 50% and cannot be changed.

Correct

You can adjust the probability threshold when calling the prediction API or in the portal's test interface. It defaults to 50% but can be set from 0 to 100.

Mistake

You must have a separate validation set, and if you don't provide one, the model cannot be evaluated.

Correct

If you don't specify a validation set, Custom Vision automatically reserves 20% of your training images for validation. You can also manually assign images to the validation set.

Mistake

Object detection models can only detect one type of object per image.

Correct

Object detection can detect multiple objects of different types in the same image, each with its own bounding box and tag.

Mistake

Training is instantaneous once you click 'Train'.

Correct

Training takes time, from a few minutes to hours, depending on the number of images, iterations, and chosen training budget.


Frequently Asked Questions

How many images do I need to train a custom vision model?

For classification, you need at least 5 images per tag, but 50 or more per tag is recommended for good accuracy. For object detection, you need at least 15 images per tag, with at least 30 instances of each tag. The more diverse and representative your images, the better the model will generalize.

What is the difference between precision and recall?

Precision measures how many of the predicted positive cases were actually correct (TP / (TP + FP)). Recall measures how many of the actual positive cases were correctly identified (TP / (TP + FN)). In the context of Custom Vision, precision reflects how many of your model's predictions are correct, while recall reflects how many of the true objects are found. The exam may ask about the trade-off: increasing the probability threshold increases precision but decreases recall.

Can I deploy a Custom Vision model to an edge device?

Yes, you can export a compact domain model as a Docker container, TensorFlow model, or ONNX model for deployment on edge devices like IoT devices or mobile phones. Compact domains are specifically designed for this purpose, offering smaller model sizes and faster inference at the cost of some accuracy.

How do I improve a poorly performing custom vision model?

First, check the confusion matrix to see which classes are confused. Then, add more images of those classes, especially images that are difficult or ambiguous. Ensure your training set is balanced across classes. Also, consider using a more suitable domain or adjusting the probability threshold. Retrain and evaluate again. Iterate until performance meets your needs.

What is the training budget?

The training budget is the amount of time (in hours) you allocate for training. In the free tier, you get 2 hours per month. In the S0 tier, you pay per hour. A longer training budget can lead to better model performance because the model can iterate more, but beyond a certain point, returns diminish.

Can I use Custom Vision to detect multiple objects in one image?

Yes, object detection projects are designed for that. You draw bounding boxes around each object and assign a tag. The model will output a list of detected objects with their bounding boxes and confidence scores. An image can have zero or multiple objects of different types.

What image formats are supported?

JPEG, PNG, BMP, and static GIF. Images must be in RGB color space, with a minimum dimension of 256 pixels on the shortest side, and a maximum file size of 6 MB for training and 4 MB for prediction.
