AI-900 · Chapter 55 of 100 · Objective 3.2

Image Analysis: Tags, Captions, Objects

This chapter covers image tagging, caption generation, and object detection, three core image analysis capabilities in Azure AI Vision (formerly Computer Vision). These features fall under Domain 3 (Computer Vision), which accounts for approximately 15-20% of the AI-900 exam. You must understand the differences between tags, captions, and objects, when to use each, and how to call them via the Azure AI Vision service. We will cover the internal mechanisms, API parameters, default behaviors, and common exam traps.

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Photo Album with Librarian Assistants

Your massive photo album holds thousands of pictures. You hire three specialized assistants to help you organize and describe it. The first assistant, 'Taggy,' looks at every photo and writes down one-word labels on sticky notes: 'beach,' 'dog,' 'sunset,' 'birthday.' She doesn't write sentences; she just attaches as many relevant keywords as she can. The second assistant, 'Caption Carol,' writes a full sentence for each photo, like 'A golden retriever playing fetch on a sunny beach.' She focuses on the main subject and action, ignoring minor details. The third assistant, 'Bounding Bob,' draws boxes around every distinct object he sees: one box around the dog, another around the ball, another around the person. He labels each box with the object name.

When you later search for 'dog,' Taggy's sticky notes help you find all dog photos. If you want a natural description, Carol's captions are perfect. If you need to count how many dogs are in a photo or know exactly where each object is located, Bob's boxes give you that spatial information.

In Azure Computer Vision, these three assistants work in parallel. Tags are the sticky notes (many keywords), captions are the full sentences (one per image), and object detection draws bounding boxes with labels. All three use deep neural networks trained on millions of images, but they produce different types of output for different use cases.

How It Actually Works

What Are Image Tags, Captions, and Objects?

Azure Computer Vision provides three distinct but related features for analyzing images, all exposed by the same service:

Image Tagging: Returns a list of words (tags) that describe the content of the image. Each tag has a confidence score (0 to 1). Tags are not mutually exclusive; an image of a dog on a beach might get tags like 'dog', 'beach', 'sand', 'animal', 'outdoor'. The service can return up to 79 tags per image (default is 10, but can be increased via the maxCandidates parameter). Tags are derived from a set of thousands of predefined categories.

Image Captioning: Generates a human-readable sentence describing the image content. The service returns one or more captions, each with a confidence score. The caption focuses on the most salient objects and actions. For example, 'A brown dog running on a sandy beach.' The default returns one caption; you can request multiple via maxCandidates.

Object Detection: Identifies objects within an image and returns bounding boxes (coordinates) for each object, along with a label and confidence score. Unlike tagging, object detection provides spatial location. It can detect up to 80 common object categories (e.g., person, bicycle, car, dog). The service also returns a 'parent' tag for grouped objects (e.g., 'furniture' for 'chair').

How They Work Internally

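All three features use deep neural networks trained on large image corpora. Microsoft does not publish the exact architecture or training data, so the pipeline below is a conceptual model based on common computer vision designs (convolutional backbones and datasets such as Common Objects in Context, COCO) rather than a documented specification: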

Preprocessing: The image is resized to a standard input size (e.g., 224x224 pixels for tagging/captioning, variable for object detection). The service supports images up to 4 MB in size and dimensions up to 10,000 x 10,000 pixels.

Feature Extraction: A CNN (e.g., ResNet-50 or ResNet-101) extracts feature maps from the image. These maps represent hierarchical visual features (edges, textures, shapes, objects).

Task-Specific Heads: Each feature has its own output stage on top of the shared feature maps.

Tagging: The feature maps are fed into a multi-label classifier that outputs probabilities for each of the thousands of predefined tags. A threshold (default 0.5) filters out low-confidence tags.

Captioning: The feature maps are passed to a recurrent neural network (RNN) or transformer-based language model that generates a sequence of words. The model uses attention mechanisms to focus on relevant image regions while generating each word.

Object Detection: The feature maps are processed by a region proposal network (RPN) that generates candidate bounding boxes. Then a classifier assigns labels to each box, and a regression head refines box coordinates. Non-maximum suppression removes duplicate detections.
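The non-maximum suppression step mentioned for object detection can be illustrated with a small sketch. This is the generic textbook procedure, not the service's actual implementation; the (x, y, w, h) box format, the names `iou` and `non_max_suppression`, and the 0.5 overlap threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left origin


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    detections: List[Tuple[Box, float]], iou_threshold: float = 0.5
) -> List[Tuple[Box, float]]:
    """Keep the highest-scoring box in each cluster of overlapping boxes."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    kept: List[Tuple[Box, float]] = []
    for box, score in ranked:
        # Keep a box only if it does not overlap too much with one already kept.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


candidates = [
    ((100, 200, 150, 120), 0.98),
    ((105, 205, 150, 120), 0.90),  # near-duplicate of the first box
    ((300, 250, 50, 50), 0.85),
]
kept = non_max_suppression(candidates)
```

Here the second candidate overlaps the first heavily and is dropped, while the third box sits elsewhere in the image and survives.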

Key Parameters and Defaults

`maxCandidates`: For tagging and captioning, controls the maximum number of tags/captions returned. Default is 10 for tags, 1 for captions. Range: 1-79 for tags, 1-10 for captions.

`language`: Specifies the language for tags and captions. Default is 'en'. Supported languages include 'en', 'zh', 'ja', 'es', etc. The tag vocabulary is language-specific.

`model-version`: Allows choosing a specific model version (e.g., '2023-10-01'). Default is the latest stable version.

Confidence threshold: For object detection, you can filter results by confidence score. The default threshold is 0.5. Lowering it returns more detections but may include false positives.
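The query parameters above could be assembled along these lines. This is a minimal sketch: `build_analyze_url` is a hypothetical helper, the resource host name is a placeholder, and only parameters named in this chapter are included.

```python
from urllib.parse import urlencode


def build_analyze_url(endpoint, features, api_version="2023-10-01",
                      language=None, model_version=None):
    """Assemble the analyze URL with the query parameters described above."""
    params = {"api-version": api_version, "features": ",".join(features)}
    # Optional parameters are only added when the caller sets them,
    # so the service's own defaults apply otherwise.
    if language is not None:
        params["language"] = language
    if model_version is not None:
        params["model-version"] = model_version
    base = endpoint.rstrip("/") + "/computervision/imageanalysis:analyze"
    # safe="," keeps the feature list readable instead of encoding it as %2C.
    return base + "?" + urlencode(params, safe=",")


url = build_analyze_url(
    "https://example-resource.cognitiveservices.azure.com/",  # placeholder host
    ["tags", "caption", "objects"],
    language="en",
)
```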

API Calls and Examples

To use these features, you need an Azure AI Vision resource (formerly Computer Vision) and its endpoint and key. The REST API endpoint is:

https://<endpoint>/computervision/imageanalysis:analyze?api-version=2023-10-01

Example request body (JSON):

{
  "url": "https://example.com/image.jpg"
}

Or send binary image data directly.

Query parameters to specify features:

features=tags for tagging

features=caption for captioning

features=objects for object detection

You can combine: features=tags,caption,objects

Example cURL command:

curl -X POST "https://<endpoint>/computervision/imageanalysis:analyze?api-version=2023-10-01&features=tags,caption,objects" \
-H "Ocp-Apim-Subscription-Key: <key>" \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com/image.jpg"}'

Sample response (truncated):

{
  "captionResult": {
    "text": "A brown dog running on a sandy beach",
    "confidence": 0.95
  },
  "tagsResult": {
    "values": [
      {"name": "dog", "confidence": 0.99},
      {"name": "beach", "confidence": 0.98},
      {"name": "sand", "confidence": 0.97},
      {"name": "animal", "confidence": 0.96},
      {"name": "outdoor", "confidence": 0.95}
    ]
  },
  "objectsResult": {
    "values": [
      {
        "boundingBox": {"x": 100, "y": 200, "w": 150, "h": 120},
        "tags": [{"name": "dog", "confidence": 0.98}]
      },
      {
        "boundingBox": {"x": 300, "y": 250, "w": 50, "h": 50},
        "tags": [{"name": "frisbee", "confidence": 0.85}]
      }
    ]
  }
}
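Once the JSON is decoded, the three result sections can be read out with a few lines of Python. The sketch below is deliberately defensive: field names have differed between API versions, so the hypothetical `summarize` helper accepts either a `values` list or a feature-named list (`tags`, `objects`), and a label stored either on the object itself or in a nested `tags` list. Treat both shapes as assumptions and check them against a real response from your resource.

```python
def summarize(result: dict) -> dict:
    """Pull the caption, tag names, and object labels out of a decoded response."""
    caption = result.get("captionResult", {}).get("text")

    def items(section: str, alt_key: str) -> list:
        block = result.get(section, {})
        # Accept either a generic "values" list or a feature-named list.
        return block.get("values") or block.get(alt_key) or []

    tags = [t["name"] for t in items("tagsResult", "tags")]

    objects = []
    for obj in items("objectsResult", "objects"):
        # The label may sit on the object itself or in a nested "tags" list.
        label = obj.get("name") or obj["tags"][0]["name"]
        objects.append((label, obj["boundingBox"]))
    return {"caption": caption, "tags": tags, "objects": objects}


sample = {
    "captionResult": {"text": "A brown dog running on a sandy beach",
                      "confidence": 0.95},
    "tagsResult": {"values": [{"name": "dog", "confidence": 0.99},
                              {"name": "beach", "confidence": 0.98}]},
    "objectsResult": {"values": [
        {"boundingBox": {"x": 100, "y": 200, "w": 150, "h": 120},
         "tags": [{"name": "dog", "confidence": 0.98}]},
    ]},
}
summary = summarize(sample)
```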

How They Interact

Tags and captions both describe content but at different granularities. Tags are atomic keywords; captions are sentences. Object detection adds spatial awareness. The exam often asks which service to use for a given scenario:

Search by keyword: Use tags (e.g., find all images with 'dog').

Generate alt text for accessibility: Use captions.

Count objects or locate them: Use object detection.

All three can be called in a single API request by specifying multiple features. This is efficient and commonly done in production.

Performance and Limits

Throughput: Standard tier supports up to 20 transactions per second (TPS) per resource. Free tier (F0) supports 20 transactions per minute and is limited to 5,000 transactions per month.

Latency: Typically 1-3 seconds per image depending on size and features requested. Object detection is slightly slower due to region proposal.

Image size: Max 4 MB; image dimensions up to 10,000 x 10,000 pixels.

Accepted formats: JPEG, PNG, GIF, BMP, WEBP. Animated GIFs are treated as static (first frame).
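A cheap client-side check against these limits can prevent avoidable 400 errors. The sketch below uses only the file size and format figures quoted in this chapter; `check_image` is a hypothetical helper, and it looks at the file extension only, not the actual image contents or pixel dimensions.

```python
from pathlib import Path

MAX_BYTES = 4 * 1024 * 1024  # 4 MB limit quoted in this chapter
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def check_image(name: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(name).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported format")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 4 MB")
    return problems


issues = check_image("warehouse_scan.tiff", 6 * 1024 * 1024)
```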

Common Use Cases

E-commerce: Auto-tag product images for search; generate alt text for accessibility; detect objects for inventory.

Social media: Automatically caption photos for visually impaired users; moderate content by detecting objects.

Surveillance: Detect people, vehicles, and objects in security footage.

Important Exam Details

Tags are not the same as objects. Tags are keywords; objects are bounding boxes.

Captions are generated by a separate model that produces natural language sentences.

The maxCandidates parameter affects both tags and captions but with different defaults.

Confidence scores are between 0 and 1. Higher is better.

The service does not recognize faces or celebrities by default; that requires Face API.

Optical character recognition (OCR) is a separate feature (not covered here).

Walk-Through

Create Azure AI Vision Resource

Go to the Azure portal (portal.azure.com) and create a new 'Computer Vision' resource (now called 'Azure AI Vision'). Choose a resource group, region (e.g., East US), and pricing tier (Free F0 or Standard S0). For production, use Standard S0. After deployment, note the endpoint URL and access key. These are required for API calls. The endpoint format is 'https://<region>.api.cognitive.microsoft.com/' or 'https://<custom-name>.cognitiveservices.azure.com/'.

Prepare Image and API Request

Select an image (local file or URL). Ensure it's under 4 MB and in a supported format. Decide which features you need: tags, caption, objects, or a combination. Construct the REST API call with the correct endpoint and query parameters. For example, to get all three, use 'features=tags,caption,objects'. Set the HTTP header 'Ocp-Apim-Subscription-Key' to your access key and 'Content-Type' to 'application/json' for URL input or 'application/octet-stream' for binary.
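The header choice for URL versus binary input can be sketched with Python's standard library. `build_request` is a hypothetical helper and the host name and key are placeholders; the function only constructs the request object and leaves sending to the caller.

```python
import json
import urllib.request


def build_request(url, key, image_url=None, image_bytes=None):
    """Create a POST request for either a public image URL or raw image bytes."""
    if (image_url is None) == (image_bytes is None):
        raise ValueError("pass exactly one of image_url or image_bytes")
    if image_url is not None:
        # URL input: JSON body pointing at the image.
        body = json.dumps({"url": image_url}).encode("utf-8")
        content_type = "application/json"
    else:
        # Binary input: the raw bytes are the body.
        body = image_bytes
        content_type = "application/octet-stream"
    headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": content_type}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_request(
    "https://example-resource.cognitiveservices.azure.com"  # placeholder host
    "/computervision/imageanalysis:analyze"
    "?api-version=2023-10-01&features=tags,caption,objects",
    "<key>",
    image_url="https://example.com/image.jpg",
)
# Sending is left to the caller, for example urllib.request.urlopen(req).
```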

Send Request and Parse Response

Execute the POST request. The service processes the image and returns a JSON response. For tags, you get an array of 'name' and 'confidence' pairs. For caption, you get 'text' and 'confidence'. For objects, each entry carries a label ('name'), a 'confidence', and a 'boundingBox' whose x and y give the top-left corner in pixels and whose w and h give the width and height. Parse the response in your application. Typical HTTP status codes: 200 (success), 400 (bad request, e.g., invalid image), 401 (unauthorized), 429 (rate limit exceeded).

Filter Results by Confidence

By default, the service returns all results with confidence above a threshold (0.5 for tags and objects, no threshold for caption). You can apply additional filtering in your code. For example, you might only accept tags with confidence > 0.7. Lowering the threshold increases recall but may include false positives. The exam may ask about default confidence thresholds; remember 0.5 is the default for tags and objects.
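Client-side filtering of this kind takes only a few lines. In the sketch below, `filter_by_confidence` is a hypothetical helper, the tag records are made-up sample data, and the cutoff is inclusive (`>=`), which is one reasonable choice among several.

```python
def filter_by_confidence(items, minimum=0.7):
    """Keep entries whose confidence meets the cutoff, highest first."""
    kept = [item for item in items if item["confidence"] >= minimum]
    return sorted(kept, key=lambda item: item["confidence"], reverse=True)


tags = [
    {"name": "outdoor", "confidence": 0.95},
    {"name": "dog", "confidence": 0.99},
    {"name": "person", "confidence": 0.62},
]
strong = filter_by_confidence(tags, minimum=0.7)
names = [t["name"] for t in strong]
```

With these sample values, 'person' falls below the 0.7 cutoff and is dropped, while the remaining tags are ordered by confidence.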

Integrate into Application

Use the results in your application. For tagging, you might build a search index. For captions, generate alt text for HTML img tags. For objects, draw bounding boxes on the image for visualization. In production, consider caching results for frequently accessed images to reduce latency and cost. Also handle errors gracefully: retry on 429 with exponential backoff, and validate image input to avoid 400 errors.
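Retrying on 429 with exponential backoff might look like the following sketch. `RateLimited` and `call_with_backoff` are hypothetical names; the `send` callable stands in for whatever function performs the HTTP request and is assumed to raise `RateLimited` when the service answers 429. The `sleep` argument is injectable so the logic can be exercised without real delays.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's send function when the service answers 429."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on RateLimited wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            sleep(base_delay * (2 ** attempt))


# Demonstration with a stand-in that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_send():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ok"


delays = []
outcome = call_with_backoff(flaky_send, sleep=delays.append)
```

Production code often adds random jitter to the delay so that many clients do not retry in lockstep; that refinement is omitted here for brevity.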

What This Looks Like on the Job

Enterprise Scenario 1: E-commerce Product Catalog

A large online retailer needs to automatically tag millions of product images with keywords for search engine optimization and internal search. They use Azure Computer Vision's image tagging feature, sending each product image with features=tags and maxCandidates=20. The tags (e.g., 'shoes', 'red', 'leather', 'sneakers') are stored in a database and indexed by Azure Cognitive Search. In production, they process images in batches using Azure Functions triggered by blob storage uploads, with a queue in front of the service to stay within the throttling limit (20 TPS per resource). They learned to avoid sending near-duplicate images (e.g., the same product from different angles): the tags come back largely the same, yet every call is still billed as a transaction. To reduce costs further, they cache results for identical images using a hash of the image bytes. A common misconfiguration is setting maxCandidates too low, which drops relevant tags like 'running' for athletic shoes. They also discovered that the service sometimes tags 'person' even if only a hand is visible, so they index only tags with a confidence of 0.7 or higher.
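The hash-based caching described in this scenario can be sketched as follows. `TagCache` is a hypothetical class, the cache lives in memory only, and `analyze` stands in for whatever function actually calls the service; a real deployment would more likely use a shared store.

```python
import hashlib


class TagCache:
    """In-memory cache keyed by the SHA-256 digest of the image bytes."""

    def __init__(self, analyze):
        self._analyze = analyze  # callable: bytes -> analysis result
        self._store = {}
        self.misses = 0  # number of times the service had to be called

    def get(self, image_bytes: bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only byte-identical images share a key; a re-encoded or
            # resized copy hashes differently and triggers a new call.
            self.misses += 1
            self._store[key] = self._analyze(image_bytes)
        return self._store[key]


# Stand-in analyzer for demonstration; a real one would call the API.
cache = TagCache(lambda data: ["tag-for-%d-bytes" % len(data)])
first = cache.get(b"same image")
second = cache.get(b"same image")
```

The second lookup is served from the cache, so only one billable transaction would have been made for the two requests.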

Scenario 2: Accessibility for a News Website

A news media company wants to automatically generate alt text for images to comply with WCAG accessibility guidelines. They use the caption feature. Each news article image is sent with features=caption. The returned caption becomes the alt attribute. They use the Standard tier and process about 50,000 images per day. They found that captions are generally accurate for common scenes but struggle with abstract or artistic images. For those, they fall back to manual entry. They also noticed that the caption model sometimes describes irrelevant background details, so they use the confidence score to decide whether to use the caption (only if confidence > 0.8). The exam may ask: 'Which feature should you use to generate a sentence describing an image?' The answer is caption, not tags.
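The confidence gate and alt-attribute generation from this scenario could be sketched like this. Both function names are hypothetical, and the 0.8 cutoff is taken from the scenario above; returning `None` signals that the image should go to manual entry.

```python
from html import escape


def choose_alt_text(caption, confidence, minimum=0.8):
    """Return the caption if it is trusted, else None for manual entry."""
    return caption if confidence > minimum else None


def img_tag(src, alt):
    """Build an <img> element with attribute values safely escaped."""
    return '<img src="{}" alt="{}">'.format(escape(src), escape(alt))


alt = choose_alt_text('A "breaking news" banner over a city skyline', 0.91)
markup = img_tag("skyline.jpg", alt) if alt is not None else None
```

Escaping matters here because generated captions can contain quotation marks or angle brackets that would otherwise break the attribute.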

Scenario 3: Inventory Management in a Warehouse

A logistics company uses object detection to count items on shelves from camera feeds. They deploy Azure Computer Vision object detection on edge devices using Azure IoT Edge. Each frame is analyzed with features=objects. They detect 'box', 'pallet', 'forklift', and 'person'. Bounding boxes allow them to estimate occupancy. They set the confidence threshold to 0.6 to reduce false positives. A common problem is overlapping boxes (e.g., a person standing in front of a box) – non-maximum suppression handles this internally, but occasionally two objects merge into one box. They also use the 'parent' tag (e.g., 'furniture' for 'chair') to group objects. Performance is critical; they use GPU-accelerated VMs for lower latency. The exam may test that object detection returns bounding box coordinates (x, y, width, height) and that you can count objects programmatically.
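Counting detections per label, as in this scenario, is straightforward once each detection has been reduced to a label and a confidence. In the sketch below, `count_objects` is a hypothetical helper, the records are made-up sample data assumed to have been flattened into dicts with `name` and `confidence` keys, and the 0.6 cutoff comes from the scenario.

```python
from collections import Counter


def count_objects(objects, minimum=0.6):
    """Count detections per label, ignoring those below the cutoff."""
    return Counter(o["name"] for o in objects if o["confidence"] >= minimum)


frame = [
    {"name": "box", "confidence": 0.93},
    {"name": "box", "confidence": 0.81},
    {"name": "box", "confidence": 0.42},  # below the cutoff, not counted
    {"name": "person", "confidence": 0.88},
]
counts = count_objects(frame)
```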

How AI-900 Actually Tests This

What AI-900 Tests

Domain 3.2: 'Identify features of computer vision workloads on Azure' includes image analysis capabilities. Specific objectives:

3.2.1: Describe image tagging – returns tags (keywords) with confidence scores.

3.2.2: Describe image captioning – returns a human-readable sentence describing the image.

3.2.3: Describe object detection – returns bounding boxes for objects.

The exam expects you to differentiate these three and match them to scenarios.

Common Wrong Answers and Traps

Confusing tags with captions: A question like 'Which feature should you use to generate a description for an alt attribute?' Many candidates choose 'tags' because they think tags describe the image. But tags are keywords, not sentences. The correct answer is 'caption'.

Thinking tags are exclusive: Tags are not mutually exclusive; multiple tags can apply. A common wrong answer says 'tags return a single label' – false.

Assuming object detection returns only one object: Object detection can return multiple objects, each with its own bounding box. Some candidates think it returns a single bounding box for the whole image – that's incorrect.

Mixing up confidence thresholds: The default confidence threshold for tags and objects is 0.5. Candidates often guess 0.7 or 0.9. Remember 0.5 is the default.

Believing captions are always accurate: Captions are probabilistic; they have a confidence score. The exam may present a scenario where a caption is wrong and ask what to do – check the confidence score.

Edge Cases and Exceptions

No objects detected: Object detection returns an empty array if no objects are found above the threshold.

Tags with low confidence: Returned tags can still carry modest confidence scores, so apply your own cutoff in code when precision matters.

Language support: Tags and captions support multiple languages, but the vocabulary is language-specific. An image of a 'dog' in English may not be tagged as 'chien' in French unless the service is called with language=fr.

Image orientation: The service automatically corrects orientation for JPEG images with EXIF metadata. For other formats, you may need to preprocess.

How to Eliminate Wrong Answers

If the scenario mentions 'keywords' or 'search terms', it's tagging.

If it mentions 'sentence' or 'description', it's captioning.

If it mentions 'bounding boxes', 'location', or 'count objects', it's object detection.

Look for confidence scores in the description – all three return them.

Remember that you can combine features in a single API call; the exam may ask about efficiency.

Key Takeaways

Image tagging returns multiple keywords (tags) with confidence scores; default max is 10, up to 79.

Image captioning returns a human-readable sentence; default returns 1 caption, up to 10 with maxCandidates.

Object detection returns bounding boxes (x, y, width, height) for each detected object; default confidence threshold is 0.5.

All three features can be called in a single API request by specifying multiple features in the features parameter.

The service supports images up to 4 MB and dimensions up to 10,000 x 10,000 pixels.

Free tier (F0) allows 5,000 transactions per month; Standard tier (S0) has no monthly limit but charges per transaction.

Tags and captions support multiple languages; use the 'language' parameter (default 'en').

Object detection can detect up to 80 common object categories; it also returns parent tags for grouped objects.

Confidence scores range from 0 to 1; higher is better. Filter results based on your application's needs.

The Azure AI Vision resource (formerly Computer Vision) is used for all three features.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Image Tagging

Returns multiple keywords (tags) per image

Each tag has a confidence score

Best for search indexing and categorization

Can return up to 79 tags

No natural language sentence generation

Image Captioning

Returns a single sentence (or multiple) describing the image

Each caption has a confidence score

Best for alt text and accessibility

Returns 1 caption by default (max 10 with maxCandidates)

Uses a language model to generate fluent text

Image Tagging

Returns only keywords, no spatial info

Can detect abstract concepts (e.g., 'outdoor', 'beautiful')

Faster than object detection

No bounding boxes

Suitable for content moderation keywords

Object Detection

Returns bounding boxes with coordinates

Detects only physical objects (e.g., person, car, chair)

Slower due to region proposal

Provides exact location of each object

Suitable for counting objects or spatial analysis

Watch Out for These

Mistake

Image tagging returns a single label per image.

Correct

Tagging returns multiple tags (keywords) with confidence scores. You can control the maximum number via the `maxCandidates` parameter (default 10, up to 79).

Mistake

Object detection and image tagging are the same thing.

Correct

Object detection returns bounding boxes with coordinates for each detected object, while tagging returns only keywords without spatial information.

Mistake

Captions are always 100% accurate.

Correct

Captions have a confidence score (0-1). They are probabilistic and can be incorrect, especially for complex or ambiguous images.

Mistake

You must use separate API calls for tags, captions, and objects.

Correct

You can request all three in a single API call by specifying multiple features in the `features` parameter (e.g., `features=tags,caption,objects`).

Mistake

The default confidence threshold for object detection is 0.8.

Correct

The default confidence threshold for object detection and tagging is 0.5. The analyze API has no `confidenceThreshold` query parameter, so apply any stricter cutoff client-side by filtering on the returned confidence values.

Frequently Asked Questions

What is the difference between image tagging and object detection in Azure Computer Vision?

Image tagging returns a list of keywords (tags) that describe the image content, such as 'dog', 'beach', 'sunset'. Each tag has a confidence score. Object detection, on the other hand, identifies objects within the image and returns their bounding box coordinates (x, y, width, height) along with a label and confidence score. Tagging does not provide spatial location; object detection does. Use tagging for keyword-based search and object detection for counting or locating objects.

Can I get both tags and captions in a single API call?

Yes, you can request multiple features in a single API call by specifying them in the `features` query parameter. For example, `features=tags,caption` will return both tags and captions in the response. This is efficient and reduces the number of API calls. You can also include 'objects' for object detection.

What is the default confidence threshold for tags and objects?

The default confidence threshold is 0.5 for both tags and objects. This means the service returns only those tags and objects with a confidence score of 0.5 or higher. You can filter results further in your code by checking the confidence values. Lowering the threshold may return more results but with lower accuracy.

How many tags can be returned per image?

By default, the service returns up to 10 tags per image. You can increase this by setting the `maxCandidates` parameter to a value between 1 and 79. The service has a predefined vocabulary of thousands of tags, but the actual number returned depends on the image content and confidence threshold.

What image formats are supported by Azure Computer Vision?

The service supports JPEG, PNG, GIF, BMP, and WEBP formats. For animated GIFs, only the first frame is analyzed. The maximum file size is 4 MB, and the maximum image dimensions are 10,000 x 10,000 pixels. If your image exceeds these limits, you need to resize or compress it before sending.

Can I use Azure Computer Vision to detect faces?

Object detection can locate people in an image (the 'person' object), but the image analysis features covered here do not provide face-specific attributes like age or emotion, and they do not identify individuals. For detailed face analysis, use the Azure Face API, which is a separate service. The exam may test that face analysis and identification belong to the Face service rather than to these image analysis features.

What is the pricing tier for Azure Computer Vision?

There are two tiers: Free (F0) and Standard (S0). The Free tier allows up to 5,000 transactions per month and 20 transactions per minute. The Standard tier has no monthly limit but charges per transaction (pricing varies by region). For production workloads, use Standard S0. The exam may ask about limits of the free tier.
