AI-900 · Chapter 62 of 100 · Objective 3.5

Custom Image Classification vs Object Detection

What distinguishes custom image classification from object detection, the two fundamental computer vision tasks? This chapter covers the difference between them, how each works in Azure AI services, and when to use one over the other. Computer vision accounts for roughly 15-20% of the AI-900 skills outline, and these two concepts are central to scenario-based questions. By the end, you will be able to choose the right project type for a given business requirement and explain the underlying mechanisms.

25 min read

Intermediate

Updated Jul 21, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security


Photo Album vs. Treasure Map

Picture a huge photo album holding a hundred snapshots of your family and friends. Custom image classification is like asking, 'Who is in this photo?' You look at the picture and say, 'That's Aunt Mary.' You don't care exactly where she is standing; you just know she is present. Now imagine you have a treasure map. Object detection is like saying, 'There is a treasure chest at grid B-4, and a pirate at grid D-7.' You not only identify what is in the scene but also pinpoint its exact location. In computer vision, custom image classification assigns a single label (or multiple labels) to an entire image, while object detection draws a bounding box around each object of interest and labels it. The 'photo album' approach describes the image as a whole, whereas the 'treasure map' approach identifies individual items and their positions. For the AI-900 exam, you need to understand when to use each technique: classification for simple categorisation tasks, and detection for tasks requiring spatial awareness, like counting objects or locating defects.

How It Actually Works

What Are Custom Image Classification and Object Detection?

Custom image classification and object detection are two distinct computer vision tasks. In image classification, the model assigns one or more labels to an entire image. For example, given a photo of a dog, the model outputs 'dog'. If the image contains both a dog and a cat, a multi-label classification model can output both 'dog' and 'cat', but it does not indicate where each animal is located.

Object detection goes a step further: it identifies the objects in an image and also localizes each one by drawing a bounding box around it. The output includes the object class (e.g., 'dog') and coordinates (x, y, width, height) of the bounding box. This allows you to count objects, track their positions, or measure sizes.
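To make the difference in output concrete, here is a minimal sketch of how detection results might be represented and used. The `Detection` class and the sample values are hypothetical and exist only for illustration; they are not the Custom Vision response format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label plus an axis-aligned box."""
    label: str
    x: float       # left edge of the box
    y: float       # top edge of the box
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


# Hypothetical detector output for one photo (values are made up).
detections = [
    Detection("dog", x=10, y=20, width=100, height=80),
    Detection("dog", x=200, y=40, width=90, height=70),
    Detection("cat", x=120, y=150, width=60, height=50),
]

# A multilabel classifier would only report which classes are present.
labels_present = sorted({d.label for d in detections})

# Detection keeps every instance, so counting and measuring are possible.
counts = Counter(d.label for d in detections)
largest = max(detections, key=lambda d: d.area)

print(labels_present)               # ['cat', 'dog']
print(counts["dog"])                # 2
print(largest.label, largest.area)  # dog 8000
```

Notice that the classification-style result loses the fact that there are two dogs; only the detection list can answer "how many?" or "how big?".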

How They Work Internally

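Both tasks are typically performed using convolutional neural networks (CNNs). For multiclass classification, the network usually ends with a fully connected layer and a softmax activation that outputs a probability for each class, with the probabilities summing to 1. Multilabel models typically score each class independently (for example, with a per-class sigmoid) so that several labels can be present at once. The model is trained on images labelled with the class(es) present.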
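The effect of the final activation can be sketched in a few lines of plain Python. This is an illustration of standard practice, not a description of Custom Vision's internals: softmax is the usual choice when exactly one class applies, and a per-class sigmoid is a common choice (assumed here) when several labels can apply at once. The scores and the 0.5 threshold are made up.

```python
import math


def softmax(logits):
    """Multiclass: scores compete, and the probabilities sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Multilabel: each class gets its own independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


classes = ["dog", "cat", "bird"]
logits = [2.0, 1.5, -3.0]  # made-up raw scores from the last layer

multiclass = softmax(logits)
multilabel = sigmoid(logits)

# Multiclass: pick the single most likely class.
best = classes[multiclass.index(max(multiclass))]

# Multilabel: keep every class above a threshold (0.5 is an arbitrary choice).
present = [c for c, p in zip(classes, multilabel) if p >= 0.5]

print(best)     # dog
print(present)  # ['dog', 'cat']
```

With the same raw scores, the multiclass reading commits to 'dog' alone, while the multilabel reading reports both 'dog' and 'cat'.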

For object detection, the architecture is more complex. Common approaches include:

Region-based CNNs (R-CNN): Propose regions of interest, then classify each region.

You Only Look Once (YOLO): Divide the image into a grid, then predict bounding boxes and class probabilities directly in one pass.

Single Shot Detector (SSD): Similar to YOLO but uses multi-scale feature maps.

Azure Custom Vision service supports both tasks with a simple drag-and-drop interface. Under the hood, it uses transfer learning from pre-trained models like ResNet or MobileNet, fine-tuned on your dataset.

Key Components and Defaults

Custom Vision service: Azure resource that provides a training and prediction API.

Project types: Choose 'Classification' (multilabel or multiclass) or 'Object Detection'.

Training time: Depends on dataset size and training type. Quick training of a small dataset usually finishes in minutes; Advanced training runs for up to the number of hours you budget.

Minimum images per class: Azure recommends at least 30 images per class for classification, and at least 50 images per object for detection.

Image size: Maximum 6 MB per image, dimensions up to 1024x1024 (if larger, the service scales down).

Export formats: TensorFlow, ONNX, CoreML, Docker for edge deployment.

Configuration and Verification

To create a classification project via the Azure portal:

1. Create a Custom Vision resource.

2. Go to the Custom Vision portal (customvision.ai).

3. Create a new project, then select 'Classification' and the type (Multilabel or Multiclass).

4. Upload images and tag them with labels.

5. Train the model.

6. Publish the iteration and obtain the prediction endpoint.

For object detection:

1. During project creation, select 'Object Detection'.

2. Upload images and draw bounding boxes around objects using the portal's tagging interface.

3. Train and publish as for classification.

Verification: After training, check the precision, recall, and mAP (mean Average Precision) metrics. For classification, precision and recall are per class. For detection, mAP is the primary metric.

Interaction with Related Technologies

Azure AI services (formerly Azure Cognitive Services): Custom Vision is part of this family. It can be integrated with Logic Apps, Power Automate, or custom applications via REST API.

Azure Machine Learning: For advanced users, you can export the model and retrain it using Azure ML pipelines.

Azure Functions: Trigger image classification or detection on blob uploads.

Edge deployment: Export to Docker or TensorFlow Lite for running on IoT devices.

Performance Considerations

Accuracy vs. speed: Deeper models (e.g., ResNet) are more accurate but slower. For real-time applications, use lighter models like MobileNet.

Data quality: Ensure images are diverse and representative of real-world conditions (lighting, angles, backgrounds).

Overfitting: With small datasets, the model may memorize rather than generalize. Use data augmentation (rotation, flip, crop) built into Custom Vision.

Pricing

Training: Free for up to 1 hour per month; then $20 per hour.

Prediction: Free tier includes 5,000 predictions per month; then $0.50 per 1,000 predictions.

Storage: $0.02 per image per month for stored training images.

Exam Relevance

AI-900 tests your ability to distinguish between classification and detection scenarios. Typical questions present a business requirement (e.g., 'count number of cars in a parking lot') and ask which Azure service to use. You must recognize that counting requires localization, hence object detection, not classification.

Walk-Through

Choose Project Type

In the Custom Vision portal, you first select the project type: Classification (Multiclass or Multilabel) or Object Detection. Multiclass assigns a single label per image; Multilabel allows multiple labels. Object Detection requires bounding boxes. This choice determines the model architecture and output format. For the exam, remember that if you need to locate objects, choose Object Detection.

Upload and Tag Images

For classification, upload images and assign tags (labels) to each image. For object detection, upload images and draw bounding boxes around each object of interest, then assign a tag to each box. Azure recommends at least 30 images per class for classification, 50 per object for detection. Images should be at least 256x256 pixels. Use diverse images to avoid overfitting.
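Before uploading, it can help to check how far each tag is from the recommended image count. The sketch below uses the figures quoted above (30 per class for classification, 50 per object for detection); the function name and sample data are hypothetical.

```python
from collections import Counter

# Recommended minimums as quoted in this chapter.
RECOMMENDED_MINIMUM = {"classification": 30, "detection": 50}


def tags_needing_more_images(tag_per_image, project_type):
    """Return {tag: number of extra images needed} for under-represented tags."""
    minimum = RECOMMENDED_MINIMUM[project_type]
    counts = Counter(tag_per_image)
    return {tag: minimum - n for tag, n in counts.items() if n < minimum}


# Made-up dataset: one tag per image.
tags = ["defective"] * 12 + ["non-defective"] * 45

print(tags_needing_more_images(tags, "classification"))
# {'defective': 18}
```

A tag that falls short is a candidate for more photos taken under different lighting, angles, and backgrounds, rather than near-duplicates of existing ones.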

Train the Model

Click 'Train' in the portal. Custom Vision uses transfer learning from a pre-trained CNN, so training time depends mainly on how many images you have uploaded. You can choose a training type: Quick or Advanced. Quick training is faster but may be less accurate. Advanced training uses more compute, up to a time budget you set. The portal displays precision, recall, and mAP (for detection) after training.

Evaluate Model Performance

Check the metrics: Precision (how many predicted positives are correct), Recall (how many actual positives were found), and mAP (for detection, the mean of the per-class average precision). If performance is low, add more images, improve tagging consistency, or use data augmentation. The portal provides a confusion matrix for classification and precision-recall curves for detection.
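The two core metrics reduce to simple ratios. Here is a minimal sketch with made-up counts for a single tag; the function name is illustrative and not part of any Azure SDK.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up results for one tag: 40 correct hits, 10 false alarms, 20 misses.
p, r = precision_recall(true_positives=40, false_positives=10, false_negatives=20)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.67
```

High precision with low recall means the model is cautious: what it flags is usually right, but it misses many real cases. The reverse means it catches most cases at the cost of more false alarms.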

Publish and Consume

After training, publish the iteration. Obtain the prediction endpoint URL and key from the portal. Use the REST API or SDK to send new images for prediction. The response for classification includes tag names and probabilities. For object detection, it includes bounding box coordinates (left, top, width, height) and probabilities. You can also export the model for offline use.
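As a rough sketch of the consume step, the code below builds (but does not send) a prediction request and then reads a detection response. Several things here are assumptions rather than facts from this chapter: the URL shape follows the v3.0 prediction REST API as the author understands it, the endpoint, project ID, and key are placeholders, the sample JSON is made up, and bounding box values are assumed to be fractions of the image size. Copy the real URL and key from the portal's 'Prediction URL' dialog.

```python
import json
import urllib.request


def build_prediction_request(endpoint, project_id, published_name,
                             prediction_key, image_bytes, task="detect"):
    """Build a POST request; task is 'classify' or 'detect' (assumed URL shape)."""
    url = (f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/"
           f"{project_id}/{task}/iterations/{published_name}/image")
    headers = {
        "Prediction-Key": prediction_key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=image_bytes,
                                  headers=headers, method="POST")


def boxes_in_pixels(response_json, image_width, image_height,
                    min_probability=0.5):
    """Keep confident detections and convert fractional boxes to pixels."""
    results = []
    for pred in json.loads(response_json)["predictions"]:
        if pred["probability"] < min_probability:
            continue  # drop low-confidence detections
        box = pred["boundingBox"]
        results.append({
            "tag": pred["tagName"],
            "left": round(box["left"] * image_width),
            "top": round(box["top"] * image_height),
            "width": round(box["width"] * image_width),
            "height": round(box["height"] * image_height),
        })
    return results


# Placeholder values; nothing is sent over the network here.
request = build_prediction_request(
    "https://example.cognitiveservices.azure.com/", "my-project-id",
    "Iteration1", "my-prediction-key", b"fake image bytes")
print(request.get_method(), request.full_url)

# Made-up response for an 800x600 image.
sample = json.dumps({"predictions": [
    {"tagName": "car", "probability": 0.92,
     "boundingBox": {"left": 0.10, "top": 0.20, "width": 0.25, "height": 0.30}},
    {"tagName": "car", "probability": 0.31,
     "boundingBox": {"left": 0.60, "top": 0.50, "width": 0.20, "height": 0.20}},
]})
cars = boxes_in_pixels(sample, image_width=800, image_height=600)
print(len(cars))  # 1
print(cars[0])
```

The probability threshold is a design choice: raising it trades recall for precision, so a car-counting application would tune it against real footage rather than accept 0.5 by default.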

What This Looks Like on the Job

Enterprise Scenario 1: Retail Inventory Management

A large retailer wants to automatically count products on shelves from store camera feeds. They need to identify each product type and its location to detect out-of-stock items. Object detection is the correct choice because it provides bounding boxes around each product. The retailer uses Azure Custom Vision Object Detection with a dataset of 200 images per product, covering various lighting and shelf angles. They train a model and deploy it to edge devices (Docker containers) in each store. Common issues: overlapping products cause missed detections; adding more training images with occlusion helps. The model achieves 90% mAP, reducing manual checks by 80%.

Enterprise Scenario 2: Manufacturing Quality Control

A factory inspects parts for defects. They need to classify each part as 'defective' or 'non-defective' without locating the defect. Custom image classification is sufficient. They collect 500 images of defective parts and 500 of non-defective parts. The model is trained as a multiclass classifier and deployed via the prediction API. The system achieves 99% accuracy. A common misstep is using object detection when classification is enough, which adds unnecessary complexity and cost. The exam tests this distinction: if the requirement is just to classify, use classification.

Enterprise Scenario 3: Wildlife Monitoring

A conservation organization uses camera traps to identify animal species and count individuals. They need both identification and localization to avoid double-counting. Object detection is used. They train a model on 1000 images per species, with bounding boxes. The model is exported to TensorFlow and run on Raspberry Pis at the edge. Performance: 85% mAP. They face challenges with small animals; using a higher resolution input (1024x1024) improves detection. The exam may ask: 'Which Azure service can identify animals and their positions?' Answer: Custom Vision with Object Detection.

How AI-900 Actually Tests This

Exactly What AI-900 Tests

AI-900 objective 3.5: 'Describe computer vision workloads' includes distinguishing between classification and detection. The exam presents scenarios and asks you to choose the appropriate Azure service or method. Common question formats:

'Which Azure Cognitive Service can identify objects and their locations in an image?'

'You need to count the number of people in a photo. Which service should you use?'

'A company wants to categorize images of products. Which project type in Custom Vision should they choose?'

Common Wrong Answers and Why

Choosing 'Computer Vision' OCR service for classification: OCR is for text extraction, not object classification.

Selecting 'Face API' for object detection: Face API detects human faces only, not general objects.

Confusing multiclass with multilabel: Multiclass assigns one label per image; multilabel allows multiple. The exam tests this distinction.

Using classification when detection is needed: For counting or locating, classification is insufficient.

Specific Numbers and Terms

Minimum images per class: 30 for classification, 50 for detection.

Maximum image size: 6 MB, 1024x1024 pixels.

mAP: metric for object detection.

Precision and recall: metrics for classification.

Export formats: TensorFlow, ONNX, CoreML, Docker.

Edge Cases and Exceptions

If an image contains multiple objects of the same class, classification (multilabel) will output the class once, not per object. Detection will output multiple bounding boxes.

For real-time applications, use lightweight models (MobileNet) via export.

Custom Vision does not support object detection with rotated bounding boxes; only axis-aligned rectangles.

How to Eliminate Wrong Answers

If the scenario mentions 'location', 'position', 'bounding box', or 'count', it's object detection.

If it mentions 'categorize', 'classify', 'label', or 'identify' without location, it's classification.

If it mentions 'text', 'handwriting', or 'OCR', use Computer Vision OCR.

If it mentions 'faces', use Face API.
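The elimination rules above can be written down as a small lookup, which is a handy way to drill them. This is a deliberately naive study aid built on the keywords listed above, not a real classifier: it matches word prefixes only, and the rule order encodes "location words win over classification words".

```python
import re

# Checked in order: location cues first, generic classification cues last.
RULES = [
    ("Custom Vision: Object Detection",
     ("location", "position", "bounding box", "count")),
    ("Computer Vision OCR", ("text", "handwriting", "ocr")),
    ("Face API", ("face",)),
    ("Custom Vision: Classification",
     ("categorize", "classify", "label", "identify")),
]


def suggest_service(scenario):
    """Return the first service whose keyword starts a word in the scenario."""
    text = scenario.lower()
    for service, keywords in RULES:
        if any(re.search(r"\b" + re.escape(k), text) for k in keywords):
            return service
    return "unclear"


print(suggest_service("Count the number of cars in a parking lot"))
# Custom Vision: Object Detection
print(suggest_service("Categorize images of products"))
# Custom Vision: Classification
print(suggest_service("Identify animals and their positions"))
# Custom Vision: Object Detection
print(suggest_service("Extract handwriting from scanned forms"))
# Computer Vision OCR
```

The third example shows why order matters: 'identify' alone suggests classification, but the mention of positions tips the answer to object detection.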

Key Takeaways

Custom image classification assigns labels to entire images; object detection localizes objects with bounding boxes.

Use classification when you only need to know what is in the image, not where.

Use object detection when you need to count objects or know their positions.

Azure Custom Vision supports both tasks with a minimum of 30 images per class (classification) or 50 per object (detection).

Object detection outputs include bounding box coordinates (left, top, width, height) and confidence scores.

Metrics: classification uses precision and recall; detection uses mAP.

Export formats include TensorFlow, ONNX, CoreML, and Docker for edge deployment.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Custom Image Classification

Outputs one or more labels for the entire image.

No spatial information about objects.

Suitable for categorizing images (e.g., 'dog', 'cat').

Uses multiclass or multilabel project type.

Metrics: precision, recall, accuracy.

Object Detection

Outputs bounding boxes and labels for each object.

Provides location (coordinates) of objects.

Suitable for counting, tracking, or measuring objects.

Uses object detection project type.

Metrics: mAP (mean Average Precision).

Watch Out for These

Mistake

Custom image classification can also provide the location of objects.

Correct

Classification only outputs labels for the entire image; it does not provide bounding boxes or coordinates. For location, you need object detection.

Mistake

Object detection in Custom Vision requires a minimum of 100 images per object.

Correct

Azure recommends at least 50 images per object for detection, not 100.

Mistake

Multiclass classification can assign multiple labels to a single image.

Correct

Multiclass assigns exactly one label per image. Multilabel classification allows multiple labels.

Mistake

Custom Vision can only be used via the portal, not programmatically.

Correct

Custom Vision provides REST APIs and SDKs (C#, Python, Node.js) for training and prediction. The portal is just one interface.

Mistake

You can train a Custom Vision model with as few as 5 images per class.

Correct

While possible, Azure recommends at least 30 images per class for classification to achieve acceptable accuracy. Fewer images often lead to overfitting.


Frequently Asked Questions

What is the difference between custom image classification and object detection in Azure Custom Vision?

Custom image classification assigns a label to the entire image (e.g., 'cat'), while object detection identifies individual objects and their locations using bounding boxes. Use classification for simple categorization, detection for tasks requiring spatial awareness like counting objects.

When should I use multilabel vs multiclass classification in Custom Vision?

Use multiclass when each image belongs to exactly one category (e.g., 'cat' or 'dog'). Use multilabel when an image can contain multiple categories (e.g., both 'cat' and 'dog'). The project type is selected during project creation.

What is mAP in object detection?

mAP stands for mean Average Precision. It is the primary metric for evaluating object detection models. It averages the precision across all classes and at different recall thresholds. A higher mAP indicates better detection performance.
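A worked example may make the averaging clearer. The sketch below uses one common, non-interpolated definition of average precision; how Custom Vision computes its reported mAP is not stated here, so treat this as an illustration of the idea. It assumes each detection has already been judged correct or incorrect (in practice this is usually decided by how much the predicted box overlaps a labelled box), and the sample data is made up.

```python
def average_precision(hits, num_ground_truth):
    """AP for one class.

    hits: detections sorted by descending confidence; True means correct.
    num_ground_truth: number of labelled objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    true_positives = 0
    total = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / num_ground_truth


def mean_average_precision(per_class):
    """mAP: the plain mean of the per-class AP values."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)


# Made-up results: (ranked hit/miss list, number of labelled objects).
per_class = {
    "dog": ([True, True, False, True], 4),  # AP = (1 + 1 + 3/4) / 4
    "cat": ([True, False], 1),              # AP = 1 / 1
}

print(f"{mean_average_precision(per_class):.3f}")  # 0.844
```

Because missed objects never appear in the ranked list, they lower AP only through the division by the number of labelled objects, which is why the 'dog' class scores below 1.0 even though three of its four detections are correct.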

Can I use Custom Vision to detect text in images?

No, Custom Vision is for custom object detection and classification. For text detection, use Azure Computer Vision's OCR feature (Read API).

What are the minimum image requirements for Custom Vision?

Azure recommends at least 30 images per class for classification and 50 images per object for detection. Images should be at least 256x256 pixels and no larger than 6 MB or 1024x1024 pixels (larger images are scaled down).

How do I export a Custom Vision model for offline use?

After training, go to the 'Performance' tab, select the iteration, and click 'Export'. You can export to TensorFlow, ONNX, CoreML, or Docker. The exported model can be run locally or on edge devices.

What is the difference between Custom Vision and Computer Vision service?

Custom Vision allows you to train custom models on your own images. Computer Vision provides pre-built models for common tasks like OCR, object detection, and image analysis. Use Custom Vision when you need a model trained on your specific data.

