AI-900 | Chapter 51 of 100 | Objective 3.1

Image Classification Tasks

This chapter covers image classification tasks in Azure Computer Vision, a core topic for the AI-900 exam. Image classification is the process of assigning one or more labels to an entire image based on its visual content. Microsoft's published skills outline weights computer vision workloads at roughly 15–20% of the AI-900 exam, and image classification is a foundational concept within that area. You will learn how image classification works, the difference between single-label and multi-label classification, how to use Azure's pre-built models, and how to build custom classifiers using Custom Vision. This knowledge is essential for passing the exam and for real-world applications like product categorization, medical imaging, and content moderation.

25 min read
Intermediate
Updated May 31, 2026

The Librarian Sorting Photos by Subject

Imagine a library with millions of unsorted photographs. The librarian's job is to categorize each photo into predefined albums: 'Beach', 'Mountain', 'City', 'Forest'. The librarian doesn't know what's in a photo until she looks at it. She examines each photo for visual clues: the presence of sand and water suggests 'Beach'; tall trees and green leaves suggest 'Forest'. She doesn't need to see the entire photo—she can focus on small patches that are most informative. Over time, she gets better by learning which patterns are most indicative of each category. For example, she learns that blue at the bottom often means water, and gray jagged shapes often mean mountains. When a new photo arrives, she quickly scans it, identifies the most prominent patterns, and assigns it to the most likely album. If a photo contains both sand and trees (a beach forest), she might place it in 'Beach' because the sand pattern is stronger, but she might also create a new album if she sees many such mixed photos. This is exactly how image classification works: the model learns to recognize distinctive visual features from labeled examples, then uses those features to classify new images into predefined categories, with a confidence score for each prediction.

How It Actually Works

What Is Image Classification?

Image classification is a supervised machine learning task where a model learns to assign a label (class) to an entire image from a predefined set of labels. For example, given a picture of a dog, the model outputs 'dog' from classes like 'dog', 'cat', 'bird', etc. The model is trained on a dataset of labeled images, learning to map pixel patterns to labels.

Single-Label vs. Multi-Label Classification

Single-label classification: Each image belongs to exactly one class. Example: a photo of a cat is classified as 'cat', not 'dog' or 'bird'. The model outputs a probability distribution over all classes, and the class with the highest probability is chosen.

Multi-label classification: An image can belong to multiple classes simultaneously. Example: a photo of a cat and a dog might be labeled both 'cat' and 'dog'. The model outputs a probability for each class independently, and a threshold (e.g., 0.5) determines which labels are assigned.
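The two decision rules above can be sketched in a few lines of plain Python. This is an illustration only: the class names and raw scores are made-up values, and the helper names (`predict_single_label`, `predict_multi_label`) are hypothetical, not part of any Azure SDK.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Squash one raw score into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_single_label(classes, scores):
    """Single-label: pick the one class with the highest softmax probability."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs[best]

def predict_multi_label(classes, scores, threshold=0.5):
    """Multi-label: keep every class whose sigmoid probability meets the threshold."""
    return [c for c, s in zip(classes, scores) if sigmoid(s) >= threshold]

classes = ["cat", "dog", "bird"]
raw_scores = [2.0, 1.5, -3.0]  # hypothetical model outputs (logits)

print(predict_single_label(classes, raw_scores))  # one winner only
print(predict_multi_label(classes, raw_scores))   # every class above the threshold
```

With the same scores, the single-label rule returns only 'cat', while the multi-label rule keeps both 'cat' and 'dog' because each is judged independently.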

How Image Classification Works Internally

Modern image classification uses deep convolutional neural networks (CNNs). The process:

1. Input: An image is represented as a tensor of pixel values (height × width × 3 for RGB).

2. Convolutional layers: Apply filters (kernels) that slide over the image, detecting features like edges, textures, and shapes. Early layers detect low-level features (edges, colors); later layers combine them into high-level features (faces, objects).

3. Pooling layers: Downsample the feature maps, reducing dimensionality and providing translation invariance. Max pooling (taking the maximum value in a window) is common.

4. Fully connected layers: After several convolutional and pooling layers, the output is flattened and fed into one or more dense layers that learn to map features to class scores.

5. Output layer: For single-label classification, a softmax activation function produces a probability distribution over classes. For multi-label, a sigmoid activation on each output neuron gives independent probabilities.
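To make the convolution and pooling steps concrete, here is a minimal pure-Python sketch on a tiny grayscale image. The image and the hand-written edge kernel are assumptions chosen for illustration; a real CNN learns its kernels during training and uses an optimized library rather than nested loops.

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D grayscale image (no padding, stride 1).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hypothetical 5x5 image: dark left side (0), bright right side (1)
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to a dark-to-bright vertical edge
vertical_edge = [[-1, 1], [-1, 1]]

features = convolve2d(image, vertical_edge)  # 4x4 feature map
pooled = max_pool2d(features)                # 2x2 map after pooling
print(features)
print(pooled)
```

The feature map is non-zero only where the brightness changes, and pooling keeps that response while shrinking the map, which is the intuition behind steps 2 and 3.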

Training Process

Training involves:
- Dataset: Thousands to millions of labeled images.
- Loss function: Categorical cross-entropy for single-label, binary cross-entropy for multi-label.
- Optimizer: Stochastic gradient descent (SGD) or Adam to minimize loss.
- Backpropagation: Gradients are propagated through the network to update weights.
- Epochs: Full passes over the training data. Typical values: 10–100.
- Batch size: Number of images processed before updating weights. Common: 32, 64, 128.
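The two loss functions named above can be written out directly. The sketch below uses made-up probabilities for three classes; it shows only how each loss is computed for one image, not how a framework batches or differentiates it.

```python
import math

def categorical_cross_entropy(probs, true_index):
    """Single-label loss: negative log of the probability given to the true class."""
    return -math.log(probs[true_index])

def binary_cross_entropy(probs, targets):
    """Multi-label loss: binary cross-entropy per class, averaged over classes."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical predictions over the classes ['cat', 'dog', 'bird']
softmax_probs = [0.7, 0.2, 0.1]  # sums to 1 (single-label model)
sigmoid_probs = [0.9, 0.8, 0.1]  # independent values (multi-label model)

# True class is 'cat' (index 0) for the single-label case
print(categorical_cross_entropy(softmax_probs, true_index=0))
# True labels are 'cat' and 'dog' for the multi-label case
print(binary_cross_entropy(sigmoid_probs, targets=[1, 1, 0]))
```

In both cases the loss shrinks as the model puts more probability on the correct labels, which is what the optimizer is pushing toward.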

Azure Computer Vision Pre-built Models

Azure offers two services for image classification:
- Computer Vision API: Uses pre-trained models that you cannot retrain. The Categories feature assigns an image to a fixed, hierarchical taxonomy of categories, and the Tags feature returns tags drawn from thousands of recognizable objects, living beings, scenery, and actions. Both return confidence scores. Example: https://<endpoint>/vision/v3.2/analyze?visualFeatures=Categories.
- Custom Vision: Allows you to train custom classifiers with your own images. Supports both single-label and multi-label classification, and also object detection. It uses a transfer learning approach: it starts with a pre-trained model (e.g., ResNet) and fine-tunes on your data.
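As a rough sketch of what a call to the analyze URL above looks like at the HTTP level, the snippet below builds (but does not send) a request using only the standard library. The endpoint name, key, and image URL are placeholders invented for the example, and `build_analyze_request` is a hypothetical helper, not an SDK function.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_analyze_request(endpoint, key, image_url, features=("Categories", "Tags")):
    """Build a v3.2 Analyze Image POST request for a publicly reachable image URL."""
    query = urlencode({"visualFeatures": ",".join(features)})
    url = f"{endpoint.rstrip('/')}/vision/v3.2/analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # resource key goes in this header
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")

request = build_analyze_request(
    "https://my-vision.cognitiveservices.azure.com/",  # hypothetical endpoint
    "<your-key>",                                      # placeholder key
    "https://example.com/image.jpg",                   # hypothetical image URL
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because it needs a real resource and key.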

Custom Vision Service

Custom Vision is a no-code/low-code platform for building custom image classifiers. Key concepts:
- Project: Contains your model. You specify domain (e.g., 'General', 'Food', 'Landmarks').
- Tags: Labels for your classes.
- Training: Upload images, assign tags, and train. You can choose between 'Quick Training' (fast, less accurate) and 'Advanced Training' (more time, higher accuracy).
- Prediction: After training, you get an endpoint URL and prediction key. Send images to classify.
- Performance metrics: Precision, recall, and mean average precision (mAP). The exam tests understanding of these metrics.

Key Configuration Values

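Minimum probability threshold: The Custom Vision portal applies a probability threshold (default 0.5) when it calculates precision and recall. The prediction endpoint returns a probability for every tag, so your application applies its own minimum threshold to filter out low-confidence predictions.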

Domain: Custom Vision offers classification domains such as 'General', 'Food', 'Landmarks', and 'Retail', plus compact variants intended for export to edge devices. Choosing the right domain improves accuracy.

Training time: Quick training typically takes 1–5 minutes; advanced training can take 10–60 minutes depending on image count.

Verification Commands (Azure CLI)

To list the Custom Vision prediction resources in a resource group (the CLI lists Azure resources, not the projects inside them):

az cognitiveservices account list --resource-group <rg> --query "[?kind=='CustomVision.Prediction']"

To get prediction endpoint:

az cognitiveservices account show --name <name> --resource-group <rg> --query "properties.endpoint"

Interaction with Other Azure Services

Azure Blob Storage: Store training images.

Azure Functions: Trigger classification on new images.

Azure Logic Apps: Automate workflows, e.g., classify images in email attachments.

Power BI: Visualize classification results.

Common Pitfalls

Overfitting: When the model memorizes training data but fails on new images. Symptoms: high training accuracy, low test accuracy. Mitigation: more data, data augmentation, regularization.

Imbalanced data: If one class has many more images than others, the model becomes biased. Mitigation: oversample minority classes or use weighted loss.

Incorrect domain selection: Using 'Food' domain for classifying cars reduces accuracy.
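For the imbalanced-data pitfall, one common way to build a weighted loss is to weight each class inversely to its frequency. The sketch below is an assumption about how you might compute such weights yourself; it is not something Custom Vision exposes, and the label counts are invented.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * class_count).

    Rare classes get weights above 1, common classes get weights below 1.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# Hypothetical imbalanced training set: 90 'normal' images, 10 'defect' images
labels = ["normal"] * 90 + ["defect"] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```

Here each 'defect' image counts nine times as much as a 'normal' image in the loss, which offsets the 9:1 imbalance in the data.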

Exam-Relevant Details

The Computer Vision API can analyze images for categories, tags, descriptions, objects, brands, faces, and adult content.

Custom Vision supports both classification and object detection.

Precision = TP/(TP+FP); Recall = TP/(TP+FN). mAP is the mean of average precision across all classes.

The minimum probability threshold in Custom Vision prediction is 0.5 by default, but you can adjust it.

Transfer learning is used in Custom Vision: starts with a pre-trained model and fine-tunes it on your dataset.
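The precision and recall formulas from the list above translate directly into code. The counts used here are invented for a single tag, purely to show the arithmetic.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that was correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of everything actually positive, the fraction the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical results for the tag 'cat' on a test set:
# 40 correct 'cat' predictions, 10 wrong 'cat' predictions, 20 missed cats
tp, fp, fn = 40, 10, 20

print(f"precision = {precision(tp, fp):.2f}")  # 40 / (40 + 10)
print(f"recall    = {recall(tp, fn):.2f}")     # 40 / (40 + 20)
```

The same model scores differently on the two metrics because false positives only hurt precision and false negatives only hurt recall.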

Code Example: Using Computer Vision API

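from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from msrest.authentication import CognitiveServicesCredentials

# Replace the placeholders with your resource's endpoint and key
endpoint = "https://<your-endpoint>.cognitiveservices.azure.com/"
key = "<your-key>"

client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(key))

# Request categories and tags for a local image file in a single call
with open("image.jpg", "rb") as image:
    result = client.analyze_image_in_stream(
        image,
        visual_features=[VisualFeatureTypes.categories, VisualFeatureTypes.tags],
    )

# Categories carry a score; tags carry a confidence
for category in result.categories:
    print(f"Category: {category.name}, Score: {category.score:.2f}")
for tag in result.tags:
    print(f"Tag: {tag.name}, Confidence: {tag.confidence:.2f}")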

Summary

Image classification is a core computer vision task. Azure provides both pre-built and custom models. The exam focuses on understanding the difference between single-label and multi-label, the capabilities of Computer Vision API and Custom Vision, and key metrics like precision, recall, and mAP. Be prepared to identify scenarios where each service is appropriate and to interpret classification results.

Walk-Through

1. Define the Classification Task

Determine whether you need single-label or multi-label classification. Single-label means each image belongs to exactly one class (e.g., 'cat' or 'dog'). Multi-label means an image can have multiple classes (e.g., 'cat' and 'dog' together). This decision affects model architecture and output. For Azure Custom Vision, you specify the project type (Classification) and then choose 'Multilabel' or 'Multiclass' during project creation. The exam may present a scenario and ask which type to use.

2. Collect and Label Training Data

Gather a representative dataset of images. Each image must be labeled with the correct class(es). For single-label, one tag per image. For multi-label, multiple tags. Azure Custom Vision requires a minimum of 5 images per tag, but 50+ is recommended for good accuracy. Images should cover variations in lighting, angle, background, and object appearance. Store images in Azure Blob Storage or upload directly to Custom Vision portal.

3. Create a Custom Vision Project

In the Custom Vision portal (customvision.ai), create a new project. Choose 'Classification' as the project type. Then select 'Multiclass' (single-label) or 'Multilabel' (multi-label). Choose a domain: 'General' for most cases, 'Food' for food images, 'Landmarks' for landmarks, etc. The domain optimizes the underlying pre-trained model. Each project has a unique ID; the training and prediction keys belong to the Custom Vision resources that the project uses.

4. Upload and Tag Images

Upload your training images to the project. Assign tags (labels) to each image. For multiclass, each image gets exactly one tag. For multilabel, you can assign multiple tags. Use the portal's tagging interface or the Custom Vision API. Ensure balanced representation across classes to avoid bias. The portal shows the number of images per tag.
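Before uploading, it can help to check tag counts locally. The sketch below flags tags against the minimum (5) and recommended (50) image counts stated in this chapter; the `(filename, tag)` input format, the helper name, and the sample data are assumptions for illustration.

```python
from collections import Counter

MIN_IMAGES_PER_TAG = 5           # minimum stated in this chapter
RECOMMENDED_IMAGES_PER_TAG = 50  # recommendation stated in this chapter

def check_tag_counts(image_tags):
    """Return a status per tag from a list of (filename, tag) pairs."""
    counts = Counter(tag for _, tag in image_tags)
    report = {}
    for tag, n in counts.items():
        if n < MIN_IMAGES_PER_TAG:
            report[tag] = f"{n} images: below the minimum"
        elif n < RECOMMENDED_IMAGES_PER_TAG:
            report[tag] = f"{n} images: trainable, but add more"
        else:
            report[tag] = f"{n} images: ok"
    return report

# Hypothetical dataset listing
image_tags = (
    [(f"beach_{i}.jpg", "beach") for i in range(60)]
    + [(f"forest_{i}.jpg", "forest") for i in range(12)]
    + [(f"city_{i}.jpg", "city") for i in range(3)]
)
for tag, status in check_tag_counts(image_tags).items():
    print(tag, "->", status)
```

A report like this also makes class imbalance visible before training, rather than after the metrics come back skewed.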

5. Train the Model

Click 'Train' in the portal. Choose 'Quick Training' for faster results (minutes) or 'Advanced Training' for longer but potentially more accurate (tens of minutes). The service uses transfer learning: it starts with a pre-trained CNN (like ResNet) and fine-tunes it on your data. After training, the portal displays performance metrics: precision, recall, and mean average precision (mAP). Review these to assess model quality.

6. Evaluate and Iterate

Use the 'Quick Test' feature to test with new images. If accuracy is low, consider adding more images, especially for misclassified examples. You can also adjust the probability threshold (default 0.5) to trade off precision and recall. Re-train after changes. The exam may ask about interpreting precision/recall and how to improve model performance.
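The precision/recall trade-off from moving the threshold can be seen by recomputing both metrics at several cut-offs. The scored predictions below are invented for one tag, and treating precision as 1.0 when nothing is predicted positive is a convention chosen for this sketch.

```python
def precision_recall_at(threshold, scored):
    """scored: list of (probability, is_positive) pairs for one tag."""
    tp = sum(1 for p, y in scored if p >= threshold and y)
    fp = sum(1 for p, y in scored if p >= threshold and not y)
    fn = sum(1 for p, y in scored if p < threshold and y)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention: no predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test predictions: (probability, image really has the tag?)
scored = [
    (0.95, True), (0.85, True), (0.70, False), (0.60, True),
    (0.45, True), (0.35, False), (0.20, False), (0.10, False),
]

for threshold in (0.3, 0.5, 0.8):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this sample, lowering the threshold to 0.3 finds every positive at the cost of more false positives, while raising it to 0.8 removes the false positives but misses half the positives.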

7. Publish and Use the Prediction Endpoint

After training, publish the model iteration. The portal provides a prediction endpoint URL and key. Use the Custom Vision prediction API to classify new images. For example, send an image URL or binary data to `https://<endpoint>/customvision/v3.0/Prediction/<project-id>/classify/iterations/<iteration-name>/url`. The response includes predicted tag names and probabilities. Set a minimum probability threshold in code to filter low-confidence predictions.
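Filtering the response in code might look like the sketch below. It assumes the response body contains a `predictions` list whose items carry `tagName` and `probability`; the sample JSON is trimmed and invented, and `top_predictions` is a hypothetical helper.

```python
import json

def top_predictions(response_json, min_probability=0.5):
    """Return (tagName, probability) pairs at or above min_probability, best first."""
    predictions = json.loads(response_json)["predictions"]
    kept = [
        (p["tagName"], p["probability"])
        for p in predictions
        if p["probability"] >= min_probability
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical response body, trimmed to the fields used here
sample = json.dumps({
    "predictions": [
        {"tagName": "dog", "probability": 0.12},
        {"tagName": "cat", "probability": 0.91},
        {"tagName": "bird", "probability": 0.55},
    ]
})

print(top_predictions(sample))                       # default threshold 0.5
print(top_predictions(sample, min_probability=0.8))  # stricter threshold
```

For a multiclass project you would typically keep only the first item; for a multilabel project, every item that clears the threshold is a label for the image.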

What This Looks Like on the Job

Enterprise Scenario 1: Retail Product Categorization

A large e-commerce company needs to automatically categorize product images into categories like 'Electronics', 'Clothing', 'Home & Garden', etc. They have millions of product images from sellers. Using Azure Custom Vision, they create a multi-class classifier with 50 categories. They train on 1,000 images per category. The model achieves 95% precision and 92% recall. They deploy the prediction endpoint as a web service called from their product upload pipeline. When a seller uploads an image, it is classified, and the product is assigned to a category. Misclassifications are flagged for manual review. Performance considerations: latency must be under 500ms; they use Azure App Service with autoscaling. Common issues: sellers upload images with watermarks or white backgrounds that differ from training data, causing misclassifications. They mitigate by adding augmented images during training (rotations, brightness changes) and setting a high probability threshold (0.8) to reduce false positives.

Scenario 2: Medical Imaging Diagnosis Support

A hospital uses image classification to flag chest X-rays for potential pneumonia. They use a binary classifier: 'Normal' vs 'Pneumonia'. They train on 10,000 labeled X-rays. The model must have very high recall (low false negatives) because missing a pneumonia case is dangerous. They set the probability threshold low (0.3) to catch more positives, accepting more false positives that radiologists review. They use Custom Vision with the 'General' domain. The model is deployed in a HIPAA-compliant environment. They monitor performance over time as new X-ray machines produce different image characteristics. They retrain monthly with new data. Misconfiguration: initially they used single-label but some X-rays show both pneumonia and other conditions; they switched to multi-label to capture all findings.

Scenario 3: Content Moderation for Social Media

A social media platform uses image classification to detect inappropriate content like violence, nudity, or hate symbols. They use Azure Computer Vision's pre-built 'Adult' classification (which detects adult content) and combine it with a custom classifier for specific banned symbols. The pre-built model returns scores for 'adult', 'racy', and 'gory' categories. They set thresholds: if 'adult' score > 0.7, automatically block; if between 0.4 and 0.7, send for human review. The custom model is trained on images of banned symbols using Custom Vision. They process millions of images daily, requiring high throughput. They use Azure Functions triggered by blob storage events to classify images asynchronously. Common failure: the pre-built model may flag artistic nudes as adult, causing false positives; they fine-tune thresholds based on user feedback.
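The threshold routing described in this scenario reduces to a small decision function. The thresholds come from the scenario; the function name and the choice to send a score of exactly 0.7 to review are assumptions of this sketch.

```python
def moderation_decision(adult_score, block_above=0.7, review_from=0.4):
    """Route an image by its 'adult' score: block, human review, or allow."""
    if adult_score > block_above:
        return "block"
    if adult_score >= review_from:
        return "review"
    return "allow"

# Hypothetical scores returned for three images
for score in (0.92, 0.55, 0.10):
    print(score, "->", moderation_decision(score))
```

Keeping the thresholds as parameters makes it easy to tune them from user feedback, as the platform in the scenario does.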

How AI-900 Actually Tests This

What AI-900 Tests on Image Classification

AI-900 exam objective 3.1 covers 'Identify computer vision workloads'. Image classification is a key workload. The exam expects you to:

Differentiate between image classification, object detection, and image segmentation.

Understand single-label vs multi-label classification.

Know the capabilities of Azure Computer Vision (pre-built) and Custom Vision (custom).

Identify appropriate scenarios for each service.

Interpret model performance metrics: precision, recall, mAP.

Recognize the minimum number of images needed per tag (5) and recommended (50+).

Understand transfer learning and how Custom Vision uses it.

Common Wrong Answers and Why Candidates Choose Them

1. Confusing image classification with object detection: Many candidates think image classification identifies objects and their locations. That is object detection. Image classification only labels the entire image. Wrong answer: 'Image classification identifies the location of objects in an image.'

2. Believing Custom Vision requires thousands of images: The minimum is 5 per tag. If the exam asks 'What is the minimum number of images per tag?', the answer is 5. Candidates often overestimate.

3. Thinking multi-label classification is the same as multi-class: Multi-class is single-label (one class per image). Multi-label allows multiple classes. The exam may describe a scenario and ask which type to use.

4. Assuming the Computer Vision API can detect specific custom objects: It only recognizes its pre-defined categories and tags. For custom objects, use Custom Vision.

5. Misinterpreting precision and recall: Precision = TP/(TP+FP) (how many predicted positives are correct). Recall = TP/(TP+FN) (how many actual positives were found). Candidates often swap them.

Specific Numbers and Terms on the Exam

Minimum images per tag: 5 (Custom Vision).

Default probability threshold: 0.5.

Domains: General, Food, Landmarks, Retail (plus compact variants for export).

Training types: Quick Training (fast), Advanced Training (more accurate).

Metrics: precision, recall, mAP.

Pre-built Computer Vision visual features: Categories, Tags, Description, Objects, Brands, Faces, ImageType, Color, Adult. Image metadata (width, height, format) is returned with every response.

Edge Cases and Exceptions

If you have fewer than 5 images per tag, the Custom Vision portal will warn you but may still train with low accuracy.

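Adult content detection is a pre-built feature of the Computer Vision API (the 'Adult' visual feature), not a Custom Vision domain. In Custom Vision, pick the domain closest to your images; 'General' is the fallback when none fits.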

Multi-label classification in Custom Vision returns an independent probability for each tag; your application can apply a different threshold to each tag when deciding which labels to keep.

The Computer Vision API's 'Categories' feature returns a hierarchy (e.g., 'animal_dog').

How to Eliminate Wrong Answers

If a question asks about labeling an entire image, it's classification. If it asks about locating objects with bounding boxes, it's object detection.

If the scenario involves custom categories not in the pre-built set, the answer is Custom Vision.

If the scenario requires assigning several labels to one image without locating anything, it's multi-label classification. If it also needs the position of each item, it's object detection.

For metric questions, think: 'Precision: when the model says something is true, how often is it right? Recall: of all true things, how many did the model find?' Use the formulas.

Key Takeaways

Image classification assigns one or more labels to an entire image; it does not locate objects.

Azure Computer Vision pre-built API recognizes thousands of categories, but cannot be customized.

Custom Vision allows training custom classifiers with a minimum of 5 images per tag.

Single-label (multiclass) classification: each image gets exactly one label. Multi-label: multiple labels allowed.

Precision = TP/(TP+FP); Recall = TP/(TP+FN); mAP is the mean of average precision across classes.

Default probability threshold for Custom Vision predictions is 0.5.

Transfer learning is used in Custom Vision: starts with a pre-trained model and fine-tunes on your data.

Domains in Custom Vision (General, Food, Landmarks, Retail) optimize the base model for specific image types.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Single-label (Multiclass) Classification

Each image belongs to exactly one class.

Output is a probability distribution over classes; softmax activation.

Used when classes are mutually exclusive (e.g., breed of dog).

Loss function: categorical cross-entropy.

Azure Custom Vision project type: 'Multiclass'.

Multi-label Classification

Each image can belong to multiple classes.

Output is independent probabilities per class; sigmoid activation.

Used when classes are not mutually exclusive (e.g., tags on a photo).

Loss function: binary cross-entropy.

Azure Custom Vision project type: 'Multilabel'.

Watch Out for These

Mistake

Image classification can identify multiple objects in an image with their locations.

Correct

Image classification only assigns a label to the entire image. It does not provide location information. Object detection provides bounding boxes around objects.

Mistake

You need at least 50 images per tag to train a Custom Vision model.

Correct

The minimum is 5 images per tag, but 50+ is recommended for good accuracy. The exam tests the minimum of 5.

Mistake

Multi-label classification means each image has exactly one label.

Correct

Multi-label classification allows an image to have multiple labels simultaneously. Single-label (multiclass) means exactly one label per image.

Mistake

Azure Computer Vision API can be trained on custom images.

Correct

The Computer Vision API uses pre-trained models and cannot be retrained on custom data. For custom models, use Custom Vision.

Mistake

Precision and recall are the same thing.

Correct

Precision = TP/(TP+FP) (accuracy of positive predictions). Recall = TP/(TP+FN) (coverage of actual positives). They measure different aspects of model performance.

Frequently Asked Questions

What is the difference between image classification and object detection?

Image classification assigns a label to the entire image, indicating what the image contains overall. Object detection not only identifies objects but also locates them with bounding boxes. For example, classification might say 'this is a dog', while detection says 'there is a dog at coordinates (x1,y1,x2,y2)'. On the AI-900 exam, if a scenario requires bounding boxes, it's object detection; otherwise, it's classification.

How many images do I need to train a Custom Vision classifier?

The minimum is 5 images per tag. However, for reliable accuracy, Microsoft recommends at least 50 images per tag. The exam tests the minimum of 5. If you have fewer, the portal will warn you but training may still proceed with poor results.

Can I use Azure Computer Vision API to classify my own custom categories?

No. The Computer Vision API uses pre-trained models that recognize a fixed set of categories and tags. You cannot add custom categories. For custom categories, you must use Azure Custom Vision to train a custom model.

What is the default probability threshold in Custom Vision predictions?

The default is 0.5. In the portal you can move the probability threshold to see how precision and recall change; when calling the prediction endpoint, you apply the threshold in your own code to the probabilities it returns. Lowering the threshold increases recall (more positives detected) but may decrease precision (more false positives). Raising it does the opposite.

What is transfer learning and how does it relate to Custom Vision?

Transfer learning is a technique where a model trained on a large dataset (e.g., ImageNet) is used as a starting point for a new task. Custom Vision uses transfer learning: it takes a pre-trained convolutional neural network (like ResNet) and fine-tunes its weights on your images. This allows training with fewer images and less time than training from scratch.

What are precision, recall, and mAP in the context of Custom Vision?

Precision = TP/(TP+FP): of all images predicted as a certain class, how many are correct. Recall = TP/(TP+FN): of all actual images of that class, how many were correctly predicted. mAP (mean average precision) is the average of precision values at different recall levels, averaged across all classes. Higher values indicate better model performance.
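One common way to compute average precision is to rank predictions by probability and average the precision at each rank where a true positive appears; mAP is then the mean over classes. The sketch below follows that definition on invented data, and the exact calculation Custom Vision uses may differ.

```python
def average_precision(scored):
    """scored: list of (probability, is_positive) pairs for one class.

    Ranks predictions by probability and averages the precision measured
    at each rank where a true positive appears.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class_scored):
    """Mean of the per-class average precision values."""
    aps = [average_precision(scored) for scored in per_class_scored.values()]
    return sum(aps) / len(aps)

# Hypothetical ranked predictions for two classes
per_class = {
    "cat": [(0.9, True), (0.8, False), (0.7, True), (0.2, False)],
    "dog": [(0.95, True), (0.6, True), (0.3, False)],
}

for name, scored in per_class.items():
    print(name, round(average_precision(scored), 3))
print("mAP", round(mean_average_precision(per_class), 3))
```

The 'dog' class scores a perfect AP because both true positives are ranked above the negative, while 'cat' loses points for ranking a negative above one of its positives.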

What domains are available in Custom Vision?

The classification domains include General, Food, Landmarks, and Retail, along with compact variants for exporting models to edge devices. Choosing the appropriate domain optimizes the underlying model for your type of images, potentially improving accuracy. For example, use 'Food' for food images, 'Landmarks' for landmarks. 'General' works for most other cases.
