This chapter covers two fundamental computer vision tasks: custom image classification and object detection. You will learn the difference between them, how each works in Azure AI services, and when to use one over the other. Computer vision workloads account for roughly 15-20% of the AI-900 exam, and understanding these two concepts is critical for scenario-based questions. By the end, you will be able to choose the right service for a given business requirement and explain the underlying mechanisms.
Imagine you have a huge photo album of your family and friends. Custom image classification is like asking, 'Who is in this photo?' You look at the picture and say, 'That's Aunt Mary.' You don't care exactly where she is standing; you just know she is present. Now imagine you have a treasure map. Object detection is like saying, 'There is a treasure chest at grid B-4, and a pirate at grid D-7.' You not only identify what is in the scene but also pinpoint where each item is. In computer vision, custom image classification assigns a single label (or multiple labels) to an entire image, while object detection draws a bounding box around each object of interest and labels it. The 'photo album' approach describes the image as a whole, whereas the 'treasure map' approach identifies individual items and their positions. For the AI-900 exam, you need to understand when to use each technique: classification for simple categorisation tasks, and detection for tasks requiring spatial awareness, like counting objects or locating defects.
What Are Custom Image Classification and Object Detection?
Custom image classification and object detection are two distinct computer vision tasks. In image classification, the model assigns one or more labels to an entire image. For example, given a photo of a dog, the model outputs 'dog'. If the image contains both a dog and a cat, a multi-label classification model can output both 'dog' and 'cat', but it does not indicate where each animal is located.
Object detection goes a step further: it identifies the objects in an image and also localizes each one by drawing a bounding box around it. The output includes the object class (e.g., 'dog') and coordinates (x, y, width, height) of the bounding box. This allows you to count objects, track their positions, or measure sizes.
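The difference in output shape can be made concrete with a small sketch. The class names, field names, and sample values below are hypothetical and chosen for illustration; they are not the service's actual response types. The point is only that counting needs one record per object, which a whole-image label cannot provide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TagPrediction:
    """One classification result: a label for the whole image."""
    tag: str
    probability: float


@dataclass
class BoxPrediction:
    """One detection result: a label plus where the object is."""
    tag: str
    probability: float
    x: float       # left edge of the box
    y: float       # top edge of the box
    width: float
    height: float


def count_objects(detections, min_probability=0.5):
    """Count detected objects per tag, ignoring low-confidence boxes."""
    return Counter(d.tag for d in detections if d.probability >= min_probability)


# Hypothetical results for one photo containing two dogs and a cat.
classification = [TagPrediction("dog", 0.97), TagPrediction("cat", 0.91)]
detection = [
    BoxPrediction("dog", 0.95, 0.10, 0.40, 0.25, 0.35),
    BoxPrediction("dog", 0.88, 0.55, 0.45, 0.20, 0.30),
    BoxPrediction("cat", 0.90, 0.80, 0.60, 0.15, 0.20),
]

# Classification tells us which labels apply, but not how many of each.
print(sorted(p.tag for p in classification))
# Detection returns one box per object, so counting is a simple tally.
print(count_objects(detection))
```

Note that the multilabel classification result lists 'dog' once even though two dogs are present, while the detection result carries one entry per dog.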
How They Work Internally
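Both tasks are typically performed using convolutional neural networks (CNNs). For classification, the network ends with a fully connected layer followed by an activation that turns raw scores into per-class probabilities: softmax for multiclass models, where the probabilities sum to 1, and typically an independent sigmoid per class for multilabel models. The model is trained on images labelled with the class(es) present.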
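As a rough illustration of that final layer, the sketch below converts raw scores into probabilities in plain Python. The scores and labels are made up. The sigmoid variant reflects a common convention for multilabel models in general, not a documented detail of Custom Vision's internals.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (multiclass)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each class independently between 0 and 1 (multilabel)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


labels = ["dog", "cat", "bird"]
logits = [2.0, 1.5, -3.0]  # hypothetical raw scores from the final layer

multiclass = softmax(logits)
multilabel = sigmoid(logits)

# Multiclass: pick the single most probable label.
best = labels[multiclass.index(max(multiclass))]
# Multilabel: keep every label whose own score clears a threshold.
present = [name for name, p in zip(labels, multilabel) if p >= 0.5]

print(best, present)
```

With softmax the classes compete for a fixed total of 1, which suits "exactly one label per image"; with independent sigmoids several labels can score highly at once.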
For object detection, the architecture is more complex. Common approaches include:
- Region-based CNNs (R-CNN): Propose regions of interest, then classify each region.
- You Only Look Once (YOLO): Divide the image into a grid, then predict bounding boxes and class probabilities directly in one pass.
- Single Shot Detector (SSD): Similar to YOLO but uses multi-scale feature maps.
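To give a feel for the grid idea behind YOLO, here is a minimal sketch of how an object can be assigned to a grid cell by the position of its box centre, as in the original YOLO formulation. The function name, the normalised box format, and the 7x7 grid are illustrative assumptions, and this shows only the assignment step, not a detector.

```python
def responsible_cell(box, grid_size):
    """Return the (row, col) of the grid cell containing the box centre.

    `box` is (left, top, width, height) with values normalised to 0..1.
    """
    left, top, width, height = box
    centre_x = left + width / 2
    centre_y = top + height / 2
    # min() keeps a centre lying exactly on the right or bottom edge in range.
    col = min(int(centre_x * grid_size), grid_size - 1)
    row = min(int(centre_y * grid_size), grid_size - 1)
    return row, col


# A hypothetical object in the left half of the image, slightly below centre.
print(responsible_cell((0.10, 0.40, 0.25, 0.35), grid_size=7))
```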
Azure Custom Vision service supports both tasks with a simple drag-and-drop interface. Under the hood, it uses transfer learning from pre-trained models like ResNet or MobileNet, fine-tuned on your dataset.
Key Components and Defaults
Custom Vision service: Azure resource that provides a training and prediction API.
Project types: Choose 'Classification' (multilabel or multiclass) or 'Object Detection'.
Training time: Quick training on a small dataset usually finishes in minutes; Advanced training runs for up to the time budget you set, so larger datasets may take longer.
Minimum images per class: Azure recommends at least 30 images per class for classification, and at least 50 images per object for detection.
Image size: Maximum 6 MB per image, dimensions up to 1024x1024 (if larger, the service scales down).
Export formats: TensorFlow, ONNX, CoreML, Docker for edge deployment.
Configuration and Verification
To create a classification project via the Azure portal:
1. Create a Custom Vision resource.
2. Go to the Custom Vision portal (customvision.ai).
3. Create a new project, select 'Classification' and the type (Multilabel or Multiclass).
4. Upload images and tag them with labels.
5. Train the model.
6. Publish the iteration and obtain the prediction endpoint.
For object detection:
1. In the project creation, select 'Object Detection'.
2. Upload images and draw bounding boxes around objects using the portal's tagging interface.
3. Train and publish similarly.
Verification: After training, check the precision, recall, and mAP (mean Average Precision) metrics. For classification, precision and recall are per class. For detection, mAP is the primary metric.
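Precision and recall are simple ratios, and working one example by hand can help the definitions stick. The sketch below assumes you already have counts of true positives, false positives, and false negatives for one class; the numbers are invented.

```python
def precision(true_positives, false_positives):
    """Share of the model's positive predictions that were correct."""
    predicted = true_positives + false_positives
    return true_positives / predicted if predicted else 0.0


def recall(true_positives, false_negatives):
    """Share of the actual positives that the model found."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else 0.0


# Hypothetical 'dog' class: the model tagged 40 images as 'dog', 36 of them
# correctly, and missed 9 images that really contained a dog.
p = precision(true_positives=36, false_positives=4)
r = recall(true_positives=36, false_negatives=9)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here precision is 36 out of 40 predictions and recall is 36 out of 45 actual dogs, so a model can be precise while still missing a fair number of positives.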
Interaction with Related Technologies
Azure AI services (formerly Azure Cognitive Services): Custom Vision is part of this family. It can be integrated with Logic Apps, Power Automate, or custom applications via the REST API.
Azure Machine Learning: For advanced users, you can export the model and retrain it using Azure ML pipelines.
Azure Functions: Trigger image classification or detection on blob uploads.
Edge deployment: Export to Docker or TensorFlow Lite for running on IoT devices.
Performance Considerations
Accuracy vs. speed: Deeper models (e.g., ResNet) are more accurate but slower. For real-time applications, use lighter models like MobileNet.
Data quality: Ensure images are diverse and representative of real-world conditions (lighting, angles, backgrounds).
Overfitting: With small datasets, the model may memorize rather than generalize. Use data augmentation (rotation, flip, crop) built into Custom Vision.
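If you augment images yourself before uploading, the one detail that is easy to get wrong for detection projects is that the bounding boxes must be transformed along with the pixels. The sketch below shows a horizontal flip on a tiny made-up pixel grid and the matching box update. The helper names and the normalised (left, top, width, height) box format are assumptions for illustration, and nothing here describes how the service augments data internally.

```python
def flip_image_horizontally(pixels):
    """Mirror a row-major grid of pixel values left to right."""
    return [list(reversed(row)) for row in pixels]


def flip_box_horizontally(box):
    """Mirror a normalised (left, top, width, height) box to match."""
    left, top, width, height = box
    # The old right edge (left + width) becomes the new left edge.
    return (1.0 - left - width, top, width, height)


image = [
    [0, 1, 2],
    [3, 4, 5],
]
box = (0.10, 0.40, 0.25, 0.35)

flipped_image = flip_image_horizontally(image)
flipped_box = flip_box_horizontally(box)
print(flipped_image, flipped_box)
```

For classification the label is unchanged by a flip, so only the pixel transform is needed.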
Pricing
Training: Free for up to 1 hour per month; then $20 per hour.
Prediction: Free tier includes 5,000 predictions per month; then $0.50 per 1,000 predictions.
Storage: $0.02 per image per month for stored training images.
Exam Relevance
AI-900 tests your ability to distinguish between classification and detection scenarios. Typical questions present a business requirement (e.g., 'count the number of cars in a parking lot') and ask which Azure service to use. You must recognize that counting requires localization, and therefore object detection rather than classification.
Choose Project Type
In the Custom Vision portal, you first select the project type: Classification (Multiclass or Multilabel) or Object Detection. Multiclass assigns a single label per image; Multilabel allows multiple labels. Object Detection requires bounding boxes. This choice determines the model architecture and output format. For the exam, remember that if you need to locate objects, choose Object Detection.
Upload and Tag Images
For classification, upload images and assign tags (labels) to each image. For object detection, upload images and draw bounding boxes around each object of interest, then assign a tag to each box. Azure recommends at least 30 images per class for classification, 50 per object for detection. Images should be at least 256x256 pixels. Use diverse images to avoid overfitting.
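Before uploading, it can be worth checking your tag counts against the recommended minimums quoted above. The sketch below is a local sanity check only; the function name and the list-of-tags input format are hypothetical, and the thresholds are simply the figures stated in this chapter.

```python
from collections import Counter

# Recommended minimum examples per tag, as quoted in this chapter.
RECOMMENDED_MINIMUM = {"classification": 30, "detection": 50}


def under_represented_tags(tags, project_type):
    """Return {tag: count} for tags with fewer examples than recommended."""
    minimum = RECOMMENDED_MINIMUM[project_type]
    counts = Counter(tags)
    return {tag: n for tag, n in counts.items() if n < minimum}


# Hypothetical dataset: one tag entry per labelled image (or per drawn box).
tags = ["dog"] * 45 + ["cat"] * 12

print(under_represented_tags(tags, "classification"))
print(under_represented_tags(tags, "detection"))
```

In this made-up dataset, 'dog' clears the classification recommendation but would fall short for a detection project, while 'cat' needs more examples either way.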
Train the Model
Click 'Train' in the portal. Custom Vision uses transfer learning from a pre-trained CNN, so training time depends mainly on the size of your dataset. You can choose a training type: Quick or Advanced. Quick training is faster but may be less accurate; Advanced training uses more compute and lets you set a time budget. The portal displays precision and recall (plus mAP for detection) after training.
Evaluate Model Performance
Check the metrics: Precision (the share of predicted positives that are correct), Recall (the share of actual positives that were found), and, for detection, mAP (the mean of the per-class average precision values). If performance is low, add more images, improve tagging consistency, or use data augmentation. The portal provides a confusion matrix for classification and precision-recall curves for detection.
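To see how mAP is built up from per-class results, here is a simplified sketch. It uses the non-interpolated form of average precision; detection benchmarks and the service itself may use interpolated variants, so treat the exact numbers as illustrative. It also assumes each prediction has already been marked correct or incorrect (for detection this is normally decided by how well the predicted box overlaps a ground-truth box). All names and sample values are hypothetical.

```python
def average_precision(scored_hits, total_positives):
    """Non-interpolated average precision for one class.

    `scored_hits` is a list of (confidence, is_correct) pairs, one per
    prediction; `total_positives` is the number of ground-truth objects.
    """
    if total_positives == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


def mean_average_precision(per_class):
    """Mean of the per-class average precision values."""
    aps = [average_precision(hits, total) for hits, total in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical predictions: (confidence, was it a correct detection?)
per_class = {
    "car": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "person": ([(0.95, True), (0.6, True)], 2),
}

map_score = mean_average_precision(per_class)
print(round(map_score, 3))
```

The 'car' class is penalised because a wrong prediction outranks a correct one, while 'person' scores a perfect AP; mAP is the mean of the two.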
Publish and Consume
After training, publish the iteration. Obtain the prediction endpoint URL and key from the portal. Use the REST API or SDK to send new images for prediction. The response for classification includes tag names and probabilities. For object detection, it also includes bounding box coordinates (left, top, width, height), expressed as fractions of the image width and height, alongside each probability. You can also export the model for offline use.
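The sketch below shows roughly what consuming a published model involves: building the request and turning a detection response into pixel coordinates. It makes several assumptions that you should check against your own project: the URL pattern and API version are a guess at the v3.0 prediction route (copy the real URL from the portal), the sample JSON is hand-written, and box values are taken to be fractions of the image size. No network call is made.

```python
import json


def prediction_request(endpoint, project_id, published_name, key, detect):
    """Build the URL and headers for an image-upload prediction call.

    The route below is an assumed pattern; prefer the URL shown in the portal.
    """
    task = "detect" if detect else "classify"
    url = (
        f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/{project_id}"
        f"/{task}/iterations/{published_name}/image"
    )
    headers = {
        "Prediction-Key": key,
        "Content-Type": "application/octet-stream",
    }
    return url, headers


def boxes_in_pixels(response_text, image_width, image_height, threshold=0.5):
    """Convert normalised boxes to (tag, left, top, width, height) in pixels."""
    result = json.loads(response_text)
    boxes = []
    for prediction in result["predictions"]:
        if prediction["probability"] < threshold:
            continue  # drop low-confidence detections
        box = prediction["boundingBox"]
        boxes.append((
            prediction["tagName"],
            round(box["left"] * image_width),
            round(box["top"] * image_height),
            round(box["width"] * image_width),
            round(box["height"] * image_height),
        ))
    return boxes


# Hand-written sample response for illustration only.
sample_response = json.dumps({
    "predictions": [
        {"tagName": "dog", "probability": 0.92,
         "boundingBox": {"left": 0.25, "top": 0.5, "width": 0.5, "height": 0.25}},
        {"tagName": "cat", "probability": 0.30,
         "boundingBox": {"left": 0.1, "top": 0.1, "width": 0.2, "height": 0.2}},
    ]
})

url, headers = prediction_request(
    "https://example.cognitiveservices.azure.com/",  # placeholder endpoint
    "my-project-id", "Iteration1", "my-key", detect=True,
)
print(url)
print(boxes_in_pixels(sample_response, image_width=800, image_height=600))
```

Filtering by a probability threshold before drawing or counting boxes is a design choice; the right cut-off depends on how costly false positives are in your scenario.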
Enterprise Scenario 1: Retail Inventory Management
A large retailer wants to automatically count products on shelves from store camera feeds. They need to identify each product type and its location to detect out-of-stock items. Object detection is the correct choice because it provides bounding boxes around each product. The retailer uses Azure Custom Vision Object Detection with a dataset of 200 images per product, covering various lighting and shelf angles. They train a model and deploy it to edge devices (Docker containers) in each store. Common issues: overlapping products cause missed detections; adding more training images with occlusion helps. The model achieves 90% mAP, reducing manual checks by 80%.
Enterprise Scenario 2: Manufacturing Quality Control
A factory inspects parts for defects. They need to classify each part as 'defective' or 'non-defective' without locating the defect. Custom image classification is sufficient. They collect 500 images of defective parts and 500 of non-defective parts. The model is trained as a multiclass classifier and deployed via the prediction API. The system achieves 99% accuracy. A common misstep is using object detection when classification is enough, which adds unnecessary complexity and cost. The exam tests this distinction: if the requirement is just to classify, use classification.
Enterprise Scenario 3: Wildlife Monitoring
A conservation organization uses camera traps to identify animal species and count individuals. They need both identification and localization to avoid double-counting. Object detection is used. They train a model on 1000 images per species, with bounding boxes. The model is exported to TensorFlow and run on Raspberry Pis at the edge. Performance: 85% mAP. They face challenges with small animals; using a higher resolution input (1024x1024) improves detection. The exam may ask: 'Which Azure service can identify animals and their positions?' Answer: Custom Vision with Object Detection.
Exactly What AI-900 Tests
AI-900 objective 3.5: 'Describe computer vision workloads' includes distinguishing between classification and detection. The exam presents scenarios and asks you to choose the appropriate Azure service or method. Common question formats:
- 'Which Azure Cognitive Service can identify objects and their locations in an image?'
- 'You need to count the number of people in a photo. Which service should you use?'
- 'A company wants to categorize images of products. Which project type in Custom Vision should they choose?'
Common Wrong Answers and Why
Choosing 'Computer Vision' OCR service for classification: OCR is for text extraction, not object classification.
Selecting 'Face API' for object detection: Face API detects human faces only, not general objects.
Confusing multiclass with multilabel: Multiclass assigns one label per image; multilabel allows multiple. The exam tests this distinction.
Using classification when detection is needed: For counting or locating, classification is insufficient.
Specific Numbers and Terms
Minimum images per class: 30 for classification, 50 for detection.
Maximum image size: 6 MB, 1024x1024 pixels.
mAP: metric for object detection.
Precision and recall: metrics for classification.
Export formats: TensorFlow, ONNX, CoreML, Docker.
Edge Cases and Exceptions
If an image contains multiple objects of the same class, classification (multilabel) will output the class once, not per object. Detection will output multiple bounding boxes.
For real-time applications, use lightweight models (MobileNet) via export.
Custom Vision does not support object detection with rotated bounding boxes; only axis-aligned rectangles.
How to Eliminate Wrong Answers
If the scenario mentions 'location', 'position', 'bounding box', or 'count', it's object detection.
If it mentions 'categorize', 'classify', 'label', or 'identify' without location, it's classification.
If it mentions 'text', 'handwriting', or 'OCR', use Computer Vision OCR.
If it mentions 'faces', use Face API.
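The elimination rules above amount to a keyword lookup, which can be written out as a toy function for self-testing. This is a study aid under obvious assumptions: the rule order, the keyword lists, and the function name are choices made here, and real exam questions need reading in full rather than keyword matching.

```python
import re

# Ordered rules: the first group with a matching keyword wins.
RULES = [
    (("text", "handwriting", "ocr"), "Computer Vision OCR"),
    (("face",), "Face API"),
    (("location", "position", "bounding box", "count"),
     "Custom Vision object detection"),
    (("categorize", "classify", "label", "identify"),
     "Custom Vision classification"),
]


def suggest_service(scenario):
    """Return the first service whose keywords appear in the scenario."""
    text = scenario.lower()
    for keywords, service in RULES:
        for keyword in keywords:
            # \b anchors the keyword at a word start, so 'count' matches
            # 'counting' but not 'account'.
            if re.search(r"\b" + re.escape(keyword), text):
                return service
    return None


print(suggest_service("You need to count the number of people in a photo."))
print(suggest_service("A company wants to categorize images of products."))
```

Detection keywords are checked before classification keywords so that a scenario such as 'identify animals and their positions' resolves to object detection.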
Custom image classification assigns labels to entire images; object detection localizes objects with bounding boxes.
Use classification when you only need to know what is in the image, not where.
Use object detection when you need to count objects or know their positions.
Azure Custom Vision supports both tasks with a minimum of 30 images per class (classification) or 50 per object (detection).
Object detection outputs include bounding box coordinates (left, top, width, height) and confidence scores.
Metrics: classification uses precision and recall; detection uses mAP.
Export formats include TensorFlow, ONNX, CoreML, and Docker for edge deployment.
These come up on the exam all the time. Here's how to tell them apart.
Custom Image Classification
Outputs one or more labels for the entire image.
No spatial information about objects.
Suitable for categorizing images (e.g., 'dog', 'cat').
Uses multiclass or multilabel project type.
Metrics: precision, recall, accuracy.
Object Detection
Outputs bounding boxes and labels for each object.
Provides location (coordinates) of objects.
Suitable for counting, tracking, or measuring objects.
Uses object detection project type.
Metrics: mAP (mean Average Precision).
Mistake
Custom image classification can also provide the location of objects.
Correct
Classification only outputs labels for the entire image; it does not provide bounding boxes or coordinates. For location, you need object detection.
Mistake
Object detection in Custom Vision requires a minimum of 100 images per object.
Correct
Azure recommends at least 50 images per object for detection, not 100.
Mistake
Multiclass classification can assign multiple labels to a single image.
Correct
Multiclass assigns exactly one label per image. Multilabel classification allows multiple labels.
Mistake
Custom Vision can only be used via the portal, not programmatically.
Correct
Custom Vision provides REST APIs and SDKs (C#, Python, Node.js) for training and prediction. The portal is just one interface.
Mistake
You can train a Custom Vision model with as few as 5 images per class.
Correct
While possible, Azure recommends at least 30 images per class for classification to achieve acceptable accuracy. Fewer images often lead to overfitting.
Custom image classification assigns a label to the entire image (e.g., 'cat'), while object detection identifies individual objects and their locations using bounding boxes. Use classification for simple categorization, detection for tasks requiring spatial awareness like counting objects.
Use multiclass when each image belongs to exactly one category (e.g., 'cat' or 'dog'). Use multilabel when an image can contain multiple categories (e.g., both 'cat' and 'dog'). The project type is selected during project creation.
mAP stands for mean Average Precision. It is the primary metric for evaluating object detection models. Average precision (AP) summarises one class's precision across recall levels, and mAP is the mean of those AP values over all classes. A higher mAP indicates better detection performance.
No, Custom Vision is for custom object detection and classification. For text detection, use Azure Computer Vision's OCR feature (Read API).
Azure recommends at least 30 images per class for classification and 50 images per object for detection. Images should be at least 256x256 pixels and no larger than 6 MB or 1024x1024 pixels (larger images are scaled down).
After training, go to the 'Performance' tab, select the iteration, and click 'Export'. You can export to TensorFlow, ONNX, CoreML, or Docker. The exported model can be run locally or on edge devices.
Custom Vision allows you to train custom models on your own images. Computer Vision provides pre-built models for common tasks like OCR, object detection, and image analysis. Use Custom Vision when you need a model trained on your specific data.