AI-900 · Chapter 13 of 100 · Objective 3.5

Azure Custom Vision

This chapter covers Azure Custom Vision, a service that allows you to build, train, and deploy custom image classification and object detection models without needing machine learning expertise. For the AI-900 exam, questions on Custom Vision typically appear in the Computer Vision domain (objective 3.5) and account for about 5-10% of the exam. You need to understand the two project types (image classification and object detection), the training process, performance metrics like precision and recall, and how to export models for offline use. This chapter provides the depth required to answer all exam questions on this topic.

25 min read

Intermediate

Updated May 31, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Training a Specialist Doctor

Imagine you need a doctor who can instantly diagnose different species of birds from X-ray images. You can't just give the doctor a general medical textbook and expect expertise in ornithology. Instead, you bring in a specialist trainer who shows the doctor hundreds of X-rays of specific bird species, each labeled with the correct diagnosis. The doctor studies these examples, learning the subtle patterns that distinguish a sparrow from a finch. Over time, the doctor becomes an expert in bird X-rays, but only for the species you trained on. If you later show an X-ray of a mammal, the doctor might misclassify it or be uncertain. This is exactly how Custom Vision works: you provide labeled images (the training data), the service learns the visual features of each class, and then it can classify new images based on that learned knowledge. The service is specialized for your custom categories, unlike pre-built models that recognize general objects.

How It Actually Works

What is Azure Custom Vision?

Azure Custom Vision is a cloud-based service that lets you create custom image classification and object detection models using your own labeled images. It is part of Azure Cognitive Services (now branded Azure AI services) and is designed for users who want to build specialized vision models without writing code or understanding deep learning. The service manages most of the machine learning pipeline: storing and labeling images, training the model, evaluating it, and deploying it to a prediction endpoint or as an exported model.

Why does it exist?

Pre-built models (like those from the Computer Vision API) recognize thousands of common objects, but they cannot handle domain-specific categories such as types of manufacturing defects, rare animal species, or unique logos. Custom Vision fills this gap by letting you train a model on your own dataset. It uses transfer learning: starting from a neural network already trained on a large image collection (like ResNet), it fine-tunes the final layers on your images. This requires far fewer images than training from scratch: typically 50-100 per class for classification, and more for object detection.

How does it work internally?

The process has four main phases:

1. Project creation: You choose a project type (classification or object detection) and a domain (e.g., General, Food, Landmarks). The domain optimizes the underlying model for certain image characteristics. For example, the 'Food' domain is tuned for images of dishes, while 'General' works for most scenarios.

2. Image upload and labeling: You upload images to the Custom Vision portal (or via the API) and assign tags (for classification) or draw bounding boxes around objects and tag them (for detection). Each tag represents a class. The service stores the images and their labels with the project.

3. Training: When you click 'Train', the service fine-tunes its base model on your labeled images. Training takes from a few minutes to hours depending on dataset size; besides Quick Training, the portal offers Advanced Training, where you set a budget of compute hours. The service chooses the training hyperparameters for you; you do not set a learning rate or number of epochs.

4. Evaluation and iteration: After training, the service reports precision, recall, and average precision (AP) for classification, or mean average precision (mAP) for detection. These figures are estimated from your own labeled images using k-fold cross-validation, so you do not supply a separate test set. You can then add more images, fix mislabeled ones, or adjust thresholds and retrain.

Key components, values, and defaults

Project types: Image Classification (assign one or more tags to the entire image) and Object Detection (locate objects with bounding boxes and assign tags). Classification supports 'Multilabel' (multiple tags per image) and 'Multiclass' (single tag per image).

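Domains: for image classification, General, Food, Landmarks, and Retail, plus their compact versions (General (compact), Food (compact), Landmarks (compact), Retail (compact)); for object detection, General, Logo, and Products on Shelves, plus General (compact). Compact domains create smaller models suitable for export to mobile or edge devices.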

Training budget: Free tier (F0) includes 1 training hour per month; Standard tier (S0) has no fixed cap but charges per compute hour of training. Training time depends on image count and resolution.

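Image requirements: Supported formats are JPEG, PNG, BMP, and GIF. Training images can be up to 6MB; images sent for prediction can be up to 4MB. Images should be at least 256 pixels on the shortest edge; smaller images are scaled up by the service. These requirements apply to both classification and object detection.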

Tag limits: Up to 500 tags per project for classification, 50 for object detection (S0 tier). Free tier: up to 100 tags for classification, 10 for detection.

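Performance metrics: Precision (P) = TP/(TP+FP), Recall (R) = TP/(TP+FN). Average precision (AP) summarizes precision across all recall levels for one tag, and mean average precision (mAP, reported for detection) is the mean of the per-tag AP values. The portal shows these per tag and overall.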

Export options: Models can be exported as TensorFlow (including TensorFlow Lite), ONNX, CoreML, or a Dockerfile for building a container (Linux, Windows, or ARM) for IoT and edge scenarios. Only compact domain models are exportable.
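The metric definitions above can be checked with a few lines of arithmetic. The sketch below is a generic illustration, not Custom Vision's internal code: it computes precision and recall from hypothetical TP/FP/FN counts, a simple non-interpolated AP from a ranked list of predictions, and mAP as the mean of per-tag AP values. The portal's exact AP computation may differ in detail.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that were correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def average_precision(ranked_hits: list[bool], total_positives: int) -> float:
    """Non-interpolated AP for one tag.

    ranked_hits lists predictions sorted by descending probability; True marks
    a correct prediction. Precision is recorded at each rank where a true
    positive appears, then summed and divided by all actual positives, so
    positives the model never finds contribute zero.
    """
    hits = 0
    precisions = []
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_positives if total_positives else 0.0


def mean_average_precision(ap_per_tag: dict[str, float]) -> float:
    """mAP: the mean of the per-tag AP values."""
    return sum(ap_per_tag.values()) / len(ap_per_tag) if ap_per_tag else 0.0


# Hypothetical test-set counts for a 'scratch' tag
print(precision(tp=45, fp=5))  # 0.9
print(recall(tp=45, fn=15))    # 0.75
```

Keeping AP as a separate function makes the distinction visible: precision and recall describe one threshold, while AP summarizes the whole ranking.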

Configuration and verification

You interact with Custom Vision primarily through the Custom Vision web portal (customvision.ai) or via REST APIs and SDKs (C#, Python, Node.js, Go). To train a model programmatically with the Python SDK (package azure-cognitiveservices-vision-customvision):

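from azure.cognitiveservices.vision.customvision.training import CustomVisionTrainingClient
from azure.cognitiveservices.vision.customvision.training.models import ImageFileCreateBatch, ImageFileCreateEntry
from msrest.authentication import ApiKeyCredentials

# ENDPOINT, TRAINING_KEY, DOMAIN_ID and image_list (pairs of file path and tag name) are values you supply
credentials = ApiKeyCredentials(in_headers={"Training-key": TRAINING_KEY})
trainer = CustomVisionTrainingClient(ENDPOINT, credentials)

# Create project; DOMAIN_ID selects the domain (General, Food, a compact domain, ...)
project = trainer.create_project("My Project", domain_id=DOMAIN_ID)

# Create one tag per distinct class name
tags = {name: trainer.create_tag(project.id, name) for name in {tag_name for _, tag_name in image_list}}

# Read each image and label it with its tag's ID
entries = []
for image_path, tag_name in image_list:
    with open(image_path, "rb") as image_contents:
        entries.append(ImageFileCreateEntry(name=image_path, contents=image_contents.read(), tag_ids=[tags[tag_name].id]))

# Upload in batches of at most 64 images, the per-call limit
for start in range(0, len(entries), 64):
    result = trainer.create_images_from_files(project.id, ImageFileCreateBatch(images=entries[start:start + 64]))
    if not result.is_batch_successful:
        raise RuntimeError("One or more images failed to upload")

# Start training; the call returns an Iteration while training runs in the service
iteration = trainer.train_project(project.id)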

To verify training status:

iteration = trainer.get_iteration(project.id, iteration.id)
print(iteration.status)  # 'Completed', 'Training', 'Failed'

Interaction with related technologies

Custom Vision integrates with Azure IoT Edge for running exported models on edge devices, Azure Functions for serverless inference, and Power Automate for automated workflows. It can also be used alongside Azure Machine Learning for advanced model management. Once you publish a trained iteration, its prediction endpoint returns a probability for each tag (classification) or tagged bounding boxes with probabilities (detection).
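As a rough sketch of consuming the prediction output, the code below parses a classification response and applies a client-side probability threshold. The JSON shape (a `predictions` array whose items carry `tagName` and `probability`) is modeled on the prediction REST API's response, but treat the field names and the sample values as assumptions to confirm against the API reference.

```python
import json

# Hypothetical response body; field names modeled on the prediction REST API,
# values invented for illustration.
SAMPLE_RESPONSE = """
{
  "predictions": [
    {"tagName": "scratch", "probability": 0.91},
    {"tagName": "dent", "probability": 0.42},
    {"tagName": "paint_bubble", "probability": 0.07}
  ]
}
"""


def tags_above_threshold(response_body: str, threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (tag, probability) pairs at or above the threshold, highest first."""
    predictions = json.loads(response_body)["predictions"]
    kept = [(p["tagName"], p["probability"]) for p in predictions if p["probability"] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


print(tags_above_threshold(SAMPLE_RESPONSE))       # [('scratch', 0.91)]
print(tags_above_threshold(SAMPLE_RESPONSE, 0.4))  # [('scratch', 0.91), ('dent', 0.42)]
```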

Pricing and limits

Free tier (F0): 1 training hour/month, 10,000 predictions/month, 2 projects, 5,000 training images per project.

Standard tier (S0): no fixed training cap (billed per compute hour), no prediction cap (billed per 1,000 transactions), and much higher project and per-project image quotas.

Prediction API: S0 charges per 1,000 transactions at the current regional rate; F0 includes 10,000 free predictions per month.

Best practices for exam

Use at least 50 images per class for classification, 100+ for detection.

Include diverse images: different angles, lighting, backgrounds.

Avoid images with ambiguous or overlapping objects.

Use the 'Quick Test' feature to validate model before deployment.

For export, always choose a compact domain.

Understand that Custom Vision is not suitable for scenarios requiring high accuracy on very similar classes (e.g., different dog breeds) without large datasets.
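The image-count guidance above is easy to check before uploading. This hypothetical helper counts labeled images per tag and flags tags below this chapter's recommended minimum (50 for classification, 100 for object detection) and tags far smaller than the largest class. The thresholds are rules of thumb from this chapter, not limits the service enforces, and the 0.5 imbalance ratio is an arbitrary choice for illustration.

```python
from collections import Counter

# Rules of thumb from this chapter, not service-enforced limits
RECOMMENDED_MIN = {"classification": 50, "object_detection": 100}


def review_dataset(labels: list[str], project_type: str, imbalance_ratio: float = 0.5) -> dict[str, list[str]]:
    """Flag tags with too few images or far fewer images than the largest tag.

    labels holds the tag name of each labeled image.
    """
    counts = Counter(labels)
    minimum = RECOMMENDED_MIN[project_type]
    largest = max(counts.values(), default=0)
    too_few = sorted(tag for tag, n in counts.items() if n < minimum)
    underrepresented = sorted(tag for tag, n in counts.items() if n < imbalance_ratio * largest)
    return {"too_few": too_few, "underrepresented": underrepresented}


labels = ["cat"] * 120 + ["dog"] * 80 + ["fox"] * 30
print(review_dataset(labels, "classification"))
# {'too_few': ['fox'], 'underrepresented': ['fox']}
```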

Walk-Through

Create a Custom Vision Project

In the Custom Vision portal (customvision.ai), click 'New Project'. Provide a name and description, and select the Custom Vision training resource the project will use. Choose the project type: Image Classification (Multilabel or Multiclass) or Object Detection. Then select a domain that best matches your images. For example, choose 'Food' if your images are of dishes. The domain determines which base model your project starts from, and only compact domains produce exportable models. After creation, you get a project ID used in API calls.
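When you script project creation, the domain is passed as an ID. The Python training SDK lists the available domains through `trainer.get_domains()`, whose items carry fields including `id`, `name`, `type` ("Classification" or "ObjectDetection"), and `exportable`. The helper below is a hypothetical sketch that picks a domain from such a list; it is written against plain stand-in objects so it runs without an Azure connection, and the domain IDs in the example are made up.

```python
from types import SimpleNamespace


def pick_domain(domains, project_type: str, name: str, need_export: bool = False):
    """Return the first domain matching the project type and name.

    project_type is "Classification" or "ObjectDetection"; when need_export
    is True, only exportable (compact) domains qualify.
    """
    for domain in domains:
        if domain.type != project_type or domain.name != name:
            continue
        if need_export and not domain.exportable:
            continue
        return domain
    qualifier = "exportable " if need_export else ""
    raise LookupError(f"No {qualifier}{project_type} domain named {name!r}")


# Stand-ins for the objects trainer.get_domains() would return (IDs are made up)
domains = [
    SimpleNamespace(id="d1", name="General", type="Classification", exportable=False),
    SimpleNamespace(id="d2", name="General (compact)", type="Classification", exportable=True),
    SimpleNamespace(id="d3", name="General", type="ObjectDetection", exportable=False),
]
print(pick_domain(domains, "ObjectDetection", "General").id)                             # d3
print(pick_domain(domains, "Classification", "General (compact)", need_export=True).id)  # d2
```

With a real client, the chosen domain's `id` would be passed as `domain_id` to `trainer.create_project`.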

Upload and Label Images

Upload your images (up to 6MB each) via the portal or API. For classification, assign one or more tags per image. For object detection, draw bounding boxes around each object and assign a tag. The portal offers tools for efficient labeling: you can apply a tag to a whole batch of images as you upload them, and after the first training iteration the Smart Labeler suggests tags (or regions, for detection) that you review and confirm. Aim for a balanced distribution, with roughly the same number of images per class. Avoid too many near-identical images; include variations.
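Before uploading, you can screen files against the image requirements described in this chapter. The hypothetical helper below checks the file extension, byte size, and pixel dimensions you pass in (for example, read beforehand with an imaging library). The limits encoded here (JPEG/PNG/BMP/GIF, 6 MB for training images, 256 px minimum edge) come from this chapter; treating 1 MB as 1,048,576 bytes is an assumption.

```python
from pathlib import Path

# Training-image limits as stated in this chapter (1 MB assumed = 1,048,576 bytes)
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}
MAX_TRAINING_BYTES = 6 * 1024 * 1024
MIN_SHORT_EDGE = 256


def image_problems(filename: str, size_bytes: int, width: int, height: int) -> list[str]:
    """Return human-readable problems for one candidate training image (empty if none)."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported format")
    if size_bytes > MAX_TRAINING_BYTES:
        problems.append("larger than 6 MB")
    if min(width, height) < MIN_SHORT_EDGE:
        problems.append("shortest edge under 256 px")
    return problems


print(image_problems("panel_001.JPG", 2_500_000, 1920, 1080))  # []
print(image_problems("clip.mp4", 8_000_000, 640, 200))
# ['unsupported format', 'larger than 6 MB', 'shortest edge under 256 px']
```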

Train the Model

Click the 'Train' button and choose Quick Training or Advanced Training (which lets you set a compute-hour budget). The service fine-tunes the model on your labeled images; training time depends on image count, resolution, and domain. You can monitor progress in the portal. Once complete, the service displays performance metrics: precision, recall, and AP (classification) or mAP (detection), estimated from your labeled images with k-fold cross-validation. If results are unsatisfactory, add more images or fix mislabels and retrain. You can also adjust the probability threshold (default 50%) to trade off precision and recall.
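To see the threshold trade-off concretely, the sketch below sweeps a probability threshold over a small, invented set of predictions for one tag and reports precision and recall at each setting. It illustrates the general effect only; it is not how the portal computes its figures.

```python
# Invented (probability, image_actually_has_tag) pairs for one tag
SCORED = [(0.95, True), (0.80, True), (0.65, False), (0.55, True),
          (0.40, True), (0.30, False), (0.10, False)]


def precision_recall_at(threshold: float, scored: list[tuple[float, bool]]) -> tuple[float, float]:
    """Treat probability >= threshold as a positive prediction and score it."""
    tp = sum(1 for p, actual in scored if p >= threshold and actual)
    fp = sum(1 for p, actual in scored if p >= threshold and not actual)
    fn = sum(1 for p, actual in scored if p < threshold and actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


for t in (0.7, 0.5, 0.3):
    p, r = precision_recall_at(t, SCORED)
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
# threshold=0.7 precision=1.00 recall=0.50
# threshold=0.5 precision=0.75 recall=0.75
# threshold=0.3 precision=0.67 recall=1.00
```

Lowering the threshold admits more predictions, so recall rises while precision falls, which is the trade-off the slider controls.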

Evaluate and Iterate

Review the precision and recall per tag. The 'Quick Test' feature lets you upload a new image to see the model's predictions; for detection, also check how well the bounding boxes fit. Identify tags with low performance and collect more diverse images for those classes. The Predictions tab lists images that were sent to your prediction endpoint, so you can review the ones the model was unsure about, tag them, and add them to the training set. Retrain after each change. The goal is to achieve high precision (few false positives) and high recall (few false negatives) for your use case.
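Reviewing the images a model is unsure about can be sketched as simple uncertainty sampling: keep images whose top tag probability falls in a middle band and label those first. This is a hypothetical helper over invented data; the band limits are arbitrary and are not a Custom Vision setting.

```python
def most_uncertain(predictions: dict[str, dict[str, float]],
                   low: float = 0.35, high: float = 0.65) -> list[str]:
    """Return images whose highest tag probability lies in [low, high], least confident first.

    predictions maps image name -> {tag name: probability}.
    """
    candidates = []
    for image, probs in predictions.items():
        top = max(probs.values())
        if low <= top <= high:
            candidates.append((top, image))
    return [image for _, image in sorted(candidates)]


preds = {
    "img_01.jpg": {"sparrow": 0.97, "finch": 0.03},
    "img_02.jpg": {"sparrow": 0.52, "finch": 0.48},
    "img_03.jpg": {"sparrow": 0.38, "finch": 0.62},
    "img_04.jpg": {"sparrow": 0.10, "finch": 0.90},
}
print(most_uncertain(preds))  # ['img_02.jpg', 'img_03.jpg']
```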

Publish and Export the Model

Once satisfied, publish the model to a prediction endpoint. In the portal, go to the 'Performance' tab, select the iteration, and click 'Publish'. Provide a model name, choose the prediction resource, and confirm. The prediction URL and prediction key are then available. For offline use, export the model: the project must use a compact domain (choose one at creation, or switch the domain in the project settings and retrain), then under 'Export' select a format (TensorFlow, ONNX, CoreML, Docker). Exporting a non-compact model is not supported. The exported model can be deployed to mobile devices, IoT devices, or local servers.
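The export step can also be scripted. The sketch below assumes the Python training SDK's `export_iteration(project_id, iteration_id, platform, flavor)` and `get_exports(project_id, iteration_id)` calls, and export objects with `platform`, `flavor`, `status` ("Exporting", then "Done" or a failure status), and `download_uri` fields; confirm these in the SDK reference. The client is passed in as a parameter, so the polling loop can be exercised with the stand-in `FakeTrainer`, which is purely illustrative.

```python
import time
from types import SimpleNamespace
from typing import Callable, Optional


def export_and_wait(trainer, project_id: str, iteration_id: str, platform: str = "TensorFlow",
                    flavor: Optional[str] = None, poll_seconds: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Start exporting a trained (compact-domain) iteration and return its download URI."""
    export = trainer.export_iteration(project_id, iteration_id, platform, flavor)
    while export.status == "Exporting":
        sleep(poll_seconds)
        for candidate in trainer.get_exports(project_id, iteration_id):
            # An iteration can have several exports; follow the one started above
            if candidate.platform == export.platform and candidate.flavor == export.flavor:
                export = candidate
                break
    if export.status != "Done":
        raise RuntimeError(f"Export ended with status {export.status!r}")
    return export.download_uri


class FakeTrainer:
    """Hypothetical stand-in client: reports 'Exporting' once, then 'Done'."""

    def __init__(self):
        self.calls = 0

    def export_iteration(self, project_id, iteration_id, platform, flavor=None):
        return SimpleNamespace(platform=platform, flavor=flavor, status="Exporting", download_uri=None)

    def get_exports(self, project_id, iteration_id):
        self.calls += 1
        done = self.calls >= 2
        return [SimpleNamespace(platform="TensorFlow", flavor="TensorFlowLite",
                                status="Done" if done else "Exporting",
                                download_uri="https://example.invalid/model.zip" if done else None)]


print(export_and_wait(FakeTrainer(), "proj", "iter", "TensorFlow", "TensorFlowLite", sleep=lambda s: None))
# https://example.invalid/model.zip
```

Injecting `sleep` keeps the real ten-second polling interval for production while letting a test run instantly.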

What This Looks Like on the Job

Enterprise Scenario 1: Quality Inspection in Manufacturing

A car manufacturer wants to detect surface defects on painted car bodies. They capture thousands of images of car panels from cameras on the assembly line. Using Custom Vision Object Detection, they label images with defects like scratches, dents, and paint bubbles. They train a model on 500 images per defect class. The model is exported as a Docker container (which requires a compact domain) and runs on a local server, performing inference in real time. When a defect is detected, the system alerts operators. Challenges include varying lighting conditions and reflections; the team mitigates them by adding images from different shifts and angles. Misconfiguration (e.g., using classification instead of detection) would only report whether a panel image contains a defect, not where it is, leaving operators to search the whole panel.
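Object detection predictions describe each box with `left`, `top`, `width`, and `height` expressed as fractions of the image size, so an alerting system like the one in this scenario has to scale them to pixels before drawing or cropping. The hypothetical helper below does that for detections above a probability cutoff; the `Detection` type, tag names, values, and the 0.6 cutoff are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    tag: str
    probability: float
    # Box position and size as fractions (0-1) of the image width/height
    left: float
    top: float
    width: float
    height: float


def to_pixel_boxes(detections: list[Detection], image_width: int, image_height: int,
                   min_probability: float = 0.6) -> list[tuple[str, int, int, int, int]]:
    """Return (tag, x1, y1, x2, y2) pixel boxes for detections at or above min_probability."""
    boxes = []
    for d in detections:
        if d.probability < min_probability:
            continue
        x1 = round(d.left * image_width)
        y1 = round(d.top * image_height)
        x2 = round((d.left + d.width) * image_width)
        y2 = round((d.top + d.height) * image_height)
        boxes.append((d.tag, x1, y1, x2, y2))
    return boxes


found = [
    Detection("scratch", 0.88, left=0.25, top=0.5, width=0.125, height=0.25),
    Detection("dent", 0.31, left=0.6, top=0.1, width=0.2, height=0.2),
]
print(to_pixel_boxes(found, image_width=1600, image_height=1200))
# [('scratch', 400, 600, 600, 900)]
```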

Enterprise Scenario 2: Retail Shelf Monitoring

A retail chain uses Custom Vision to monitor product availability on shelves. They deploy cameras in stores and train an object detection model to recognize products using compact domain for export to Jetson Nano devices at the edge. The model checks if shelves are stocked or if products are misplaced. With 50 products, they need at least 100 images per product, including occluded and rotated views. The model runs inference every 5 minutes, sending results to Azure IoT Hub. Performance consideration: model size must be under 50MB for edge devices; compact domains ensure this. Common issue: false positives when products are partially hidden by other items; solved by retraining with more occluded examples.
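The shelf check in this scenario comes down to comparing the products a shelf should hold with the tags the detector actually returned. The sketch below is hypothetical: product tags, expected counts, detections, and the 0.5 cutoff are invented, and it simply counts confident detections per tag for one camera frame.

```python
from collections import Counter


def shelf_gaps(expected: dict[str, int], detections: list[tuple[str, float]],
               min_probability: float = 0.5) -> dict[str, int]:
    """Return how many units of each product are missing from the frame.

    expected maps product tag -> units that should be visible;
    detections is a list of (tag, probability) pairs from one frame.
    """
    seen = Counter(tag for tag, p in detections if p >= min_probability)
    return {tag: want - seen[tag] for tag, want in expected.items() if seen[tag] < want}


expected = {"cola_1l": 4, "water_500ml": 3, "juice_1l": 2}
detections = [("cola_1l", 0.91), ("cola_1l", 0.88), ("cola_1l", 0.79), ("cola_1l", 0.72),
              ("water_500ml", 0.83), ("water_500ml", 0.41),
              ("juice_1l", 0.95), ("juice_1l", 0.90)]
print(shelf_gaps(expected, detections))  # {'water_500ml': 2}
```

The low-confidence water detection (0.41) is ignored, which is also how partially hidden products can show up as gaps until the model is retrained with more occluded examples.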

Enterprise Scenario 3: Wildlife Conservation

A conservation organization uses Custom Vision to identify species from camera trap photos. They classify images into species tags (e.g., tiger, deer, human). They use the General domain and train with 200 images per species. The model is published as an API endpoint, and a Power Automate flow triggers when a rare species is detected. They face challenges with images taken at night (infrared) and motion blur; they include such images in training. Misconfiguration: using multilabel instead of multiclass when each image has only one animal, causing unnecessary complexity and lower accuracy.

How AI-900 Actually Tests This

What AI-900 Tests on Custom Vision (Objective 3.5)

The exam expects you to:

Distinguish between Custom Vision and Computer Vision API: Custom Vision is for custom categories you train; Computer Vision is for pre-built categories like 'cat', 'dog', 'text'.

Identify the two project types: Image Classification and Object Detection. Know that classification assigns tags to the whole image; detection adds bounding boxes.

Understand the training process: you provide labeled images, the service trains a model, and you evaluate with precision, recall, and mAP.

Know that compact domains allow model export to TensorFlow, ONNX, CoreML, or Docker.

Recognize that Custom Vision is a no-code/low-code tool, but you can also use SDKs.

Common Wrong Answers and Why Candidates Choose Them

'Custom Vision can recognize any object without training' — This is false; candidates confuse it with Computer Vision API. Custom Vision requires training with your own images.

'Custom Vision object detection can outline objects with polygons.' This is false: Custom Vision labels and predicts objects only with axis-aligned rectangular bounding boxes. Candidates might think polygons are supported because other labeling tools and services offer them.

'You need to write code to train a model' — The portal provides a no-code interface; you can train by clicking buttons. Candidates familiar with ML may assume coding is required.

'Precision and recall are the same' — The exam tests understanding: precision = correct positives / all predicted positives; recall = correct positives / all actual positives. Candidates often mix them up.

Specific Numbers and Terms on the Exam

Minimum images per class: 50 (classification), 100 (detection) — though not strict, the exam expects this.

Image size limit: 6MB for training images, 4MB for prediction images.

Supported export formats: TensorFlow, ONNX, CoreML, Docker.

Domains: General, Food, Landmarks, Retail, and their compact versions for classification; General, Logo, Products on Shelves, and General (compact) for object detection.

Free tier limits: 1 training hour/month, 10,000 predictions/month, 2 projects.

Edge Cases and Exceptions

If you use a non-compact domain, export is disabled.

Multilabel classification allows multiple tags per image; multiclass allows only one. The exam may test this distinction.

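The probability threshold (default 50% in the portal) is applied when you interpret predictions: the prediction API returns a probability for every tag, and your application decides the cutoff. Lowering it increases recall but may lower precision.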

Custom Vision does NOT support image segmentation (pixel-level masks). That requires a different tool, such as a custom model built with Azure Machine Learning.

How to Eliminate Wrong Answers

If a question mentions 'pre-built categories' or 'common objects', the answer is Computer Vision API, not Custom Vision.

If a question mentions 'locating objects with bounding boxes', it's Object Detection.

If a question mentions 'export to mobile', look for 'compact domain'.

If a question says 'no training required', it's not Custom Vision.

Key Takeaways

Custom Vision requires labeled images for training; you cannot use it without providing your own data.

The two project types are Image Classification (tags for whole image) and Object Detection (bounding boxes for objects).

At least 50 images per class for classification, 100 for object detection is recommended.

Only compact domain models can be exported to TensorFlow, ONNX, CoreML, or Docker.

Performance metrics: precision, recall, and mean Average Precision (mAP) for detection.

Free tier (F0) limits: 1 training hour/month, 10,000 predictions/month, 2 projects.

Custom Vision is different from Computer Vision API; the latter is pre-trained.

You can adjust the probability threshold to balance precision and recall.

Reported metrics are estimated from your own labeled images with k-fold cross-validation; you do not need to supply a separate test set.

Custom Vision supports multilabel and multiclass classification.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Image Classification (Multiclass)

Each image can have only one tag.

Useful when images belong to mutually exclusive categories (e.g., 'cat' or 'dog').

Model outputs a single predicted label per image.

Simpler and often more accurate for exclusive classes.

Example: Classifying a photo as 'beach' or 'mountain'.

Image Classification (Multilabel)

Each image can have multiple tags (e.g., 'cat' and 'sleeping').

Useful when images contain multiple objects or attributes.

Model outputs multiple probabilities, one per tag, with a threshold.

More flexible but can be harder to train if tags are correlated.

Example: Tagging a photo with 'cat', 'sleeping', 'indoors'.
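The practical difference can be sketched as follows: in both modes the prediction carries a probability per tag, but a multiclass result is read by taking the single most probable tag, while a multilabel result keeps every tag that clears a threshold. The helper and the probabilities are hypothetical.

```python
def interpret(probabilities: dict[str, float], mode: str, threshold: float = 0.5) -> list[str]:
    """Turn per-tag probabilities into predicted tags.

    mode "multiclass": exactly one tag, the most probable.
    mode "multilabel": every tag at or above the threshold, highest first.
    """
    if mode == "multiclass":
        return [max(probabilities, key=probabilities.get)]
    if mode == "multilabel":
        kept = [tag for tag, p in probabilities.items() if p >= threshold]
        return sorted(kept, key=probabilities.get, reverse=True)
    raise ValueError(f"unknown mode: {mode}")


print(interpret({"beach": 0.7, "mountain": 0.3}, "multiclass"))                    # ['beach']
print(interpret({"cat": 0.96, "sleeping": 0.81, "indoors": 0.34}, "multilabel"))  # ['cat', 'sleeping']
```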

Custom Vision

Requires you to upload and label your own images.

Train for custom categories not in pre-built set.

Supports classification and object detection.

Model can be exported for offline use.

Best for domain-specific scenarios (e.g., defects, rare species).

Computer Vision API

No training needed; uses pre-built models.

Recognizes thousands of common objects, faces, text, etc.

Provides rich features like OCR, celebrity recognition, description.

Cannot be exported; you call the hosted API (a few features, such as Read OCR, are also offered as Docker containers).

Best for general-purpose image analysis.

Watch Out for These

Mistake

Custom Vision can automatically label images for you.

Correct

Custom Vision learns only from labels you provide (tags or bounding boxes), so you must supply the ground truth. After a first training iteration, the Smart Labeler can suggest tags or regions for untagged images, but you review and confirm each suggestion before it becomes a label.

Mistake

You can train a Custom Vision model with just 5 images per class.

Correct

While technically possible, the model will have poor accuracy. Microsoft recommends at least 50 images per class for classification and 100 for object detection to achieve reliable results.

Mistake

Custom Vision supports image segmentation (pixel-level masks).

Correct

Custom Vision only supports image classification (whole image) and object detection (bounding boxes). For pixel-level segmentation, you need a different service like Azure Machine Learning.

Mistake

All domains support model export.

Correct

Only compact domains (e.g., General (compact), Food (compact)) support export. Standard domains like 'General' cannot be exported to TensorFlow, ONNX, etc.

Mistake

Custom Vision models can be trained on video directly.

Correct

Custom Vision only accepts still images (JPEG, PNG, BMP, GIF). To work with video, you must extract frames and treat them as individual images.
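Extracting frames usually means sampling a video at a fixed interval rather than keeping every frame, since consecutive frames are near-duplicates and this chapter advises against too many similar images. This hypothetical helper only computes which frame numbers to keep; reading and saving the frames would be done with a video library such as OpenCV, which is not shown.

```python
def frame_indices(total_frames: int, fps: float, every_seconds: float) -> list[int]:
    """Frame numbers to extract when sampling one frame every `every_seconds` seconds."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    step = max(1, round(fps * every_seconds))  # at least one frame per step
    return list(range(0, total_frames, step))


# A 10-second clip at 30 fps, sampled every 2 seconds
print(frame_indices(total_frames=300, fps=30, every_seconds=2))  # [0, 60, 120, 180, 240]
```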

Frequently Asked Questions

What is the difference between Custom Vision and Computer Vision API?

Custom Vision allows you to train a model on your own images for custom categories, while Computer Vision API uses pre-trained models for common objects, text, and faces. Custom Vision requires labeled training data; Computer Vision does not. For AI-900, know that Custom Vision is for custom scenarios, Computer Vision for general-purpose analysis.

How many images do I need to train a Custom Vision model?

Microsoft recommends at least 50 images per class for image classification and 100 per class for object detection. More images with variation (angles, lighting) improve accuracy. The exam may test this recommendation.

Can I export a Custom Vision model to run on a mobile device?

Yes, but only if the project uses a compact domain (e.g., General (compact)); you can pick one at project creation or switch an existing project to one and retrain. Export formats include TensorFlow, CoreML (iOS), ONNX, and Docker. Non-compact models cannot be exported.

What is the difference between multiclass and multilabel classification in Custom Vision?

Multiclass assigns a single tag per image (mutually exclusive categories). Multilabel allows multiple tags per image (e.g., 'cat' and 'sleeping'). Choose based on whether your images can belong to more than one category.

How do I improve my Custom Vision model's accuracy?

Add more images, especially of difficult cases (e.g., different angles, lighting, occlusions). Ensure balanced classes. Use the 'Quick Test' to find misclassifications. Adjust the probability threshold. Retrain after each change.

What are the pricing tiers for Custom Vision?

Free tier (F0): 1 training hour/month, 10,000 predictions/month, 2 projects, 5,000 training images/project. Standard tier (S0): pay per training compute hour and per 1,000 predictions, with much higher project and per-project image quotas.

Can Custom Vision detect multiple objects in one image?

Yes, using Object Detection project type. It draws bounding boxes around each detected object and assigns tags. For classification, multiple tags can be assigned per image in multilabel mode.

