This chapter covers Azure AI Vision, a core service in Microsoft's computer vision portfolio that enables applications to extract rich information from images. On the AI-900 exam, computer vision workloads, including Azure AI Vision, account for roughly 15-20% of the questions, making this a critical topic to master. You will learn how to use the service for image analysis, object detection, optical character recognition (OCR), and facial analysis, along with the specific capabilities, pricing tiers, and integration patterns you need to know for the exam.
Imagine you are an art connoisseur standing before a massive painting. Your task is to describe everything you see: the objects (a tree, a person), their attributes (the tree is green, the person is smiling), and even the text on a plaque. You have a magnifying glass that can zoom in on specific details, and you can mentally compare what you see to a vast library of known images and concepts. This is exactly how Azure AI Vision works. The service uses pre-trained deep neural networks as its 'eyes' and 'brain'. When you send an image, the service first breaks it down into thousands of tiny features (like the connoisseur scanning the painting in sections). Then, it uses a hierarchy of layers to detect edges, shapes, objects, and even entire scenes. The 'magnifying glass' corresponds to the ability to extract specific features like text (OCR) or faces. The 'library' is the massive training dataset that allows the service to recognize over 10,000 objects and concepts. Just as a connoisseur might provide a caption for the painting, Azure AI Vision can generate a human-readable description of the image. The entire process happens in milliseconds, thanks to optimized neural networks running on Azure's GPU infrastructure.
What is Azure AI Vision?
Azure AI Vision is a cloud-based service that uses pre-trained deep learning models to analyze images and extract information such as objects, faces, text, and descriptions. It is part of the Azure Cognitive Services family (since rebranded as Azure AI services) and is designed to be used without requiring machine learning expertise. The service exposes a REST API and SDKs for popular programming languages, allowing developers to integrate vision capabilities into their applications with minimal code.
Why It Exists
Before cloud-based vision APIs, building computer vision systems required significant expertise in machine learning, large datasets, and powerful hardware. Azure AI Vision democratizes this technology by providing pre-trained models that can be called via simple HTTP requests. This allows businesses to add features like automated image tagging, content moderation, and OCR to their applications without building models from scratch.
How It Works Internally
When you send an image to Azure AI Vision, the service performs the following high-level steps:
1. Image Preprocessing: The image is decoded and resized to a standard input size (e.g., 224x224 pixels for feature extraction). The service also normalizes pixel values.
2. Feature Extraction: The preprocessed image is passed through a convolutional neural network (CNN) such as ResNet or EfficientNet. The CNN extracts hierarchical features: edges in early layers, shapes in middle layers, and high-level concepts in later layers.
3. Task-Specific Heads: The extracted features are fed into specialized heads for each capability:
   - Image Classification: A softmax layer outputs probabilities over thousands of categories (e.g., 'dog', 'car').
   - Object Detection: Region proposal networks (RPN) identify bounding boxes, and classification heads label each box.
   - OCR: A text detection model (e.g., CRAFT) locates text regions, followed by a recognition model (e.g., CRNN) that reads the text.
   - Facial Analysis: Face detection (using a variant of MTCNN) identifies faces, then attribute classifiers predict age, emotion, etc.
4. Post-Processing: Results are formatted as JSON, including confidence scores, bounding box coordinates, and metadata.
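The classification head in step 3 can be illustrated with a small softmax sketch. This is not the service's internal code: the labels and raw scores below are made-up values chosen only to show how raw scores become probabilities that sum to 1.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the largest value first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical labels and logits, for illustration only.
labels = ["dog", "car", "tree"]
probs = softmax([2.0, 1.0, 0.1])

# The predicted label is the one with the highest probability.
best = max(zip(labels, probs), key=lambda pair: pair[1])
print(best[0])
```

The highest probability plays the role of the confidence score that the service reports alongside each result.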
Key Components, Values, and Defaults
- Pricing Tiers:
  - Free (F0): 20 transactions per minute, 5K transactions per month.
  - Standard (S0): Pay-as-you-go, up to 10 transactions per second (TPS) per region. Higher TPS available via request.
- Image Size Limits:
  - Maximum image size: 4 MB (for all features).
  - Minimum image size: 10 x 10 pixels (for face detection).
- Supported Image Formats: JPEG, PNG, GIF (non-animated), BMP, TIFF.
- OCR:
  - Read API (latest): Supports printed and handwritten text, multiple languages, and orientation detection. Returns text lines and words with bounding boxes.
  - OCR API (legacy): Supports printed text in multiple languages. Not recommended for new projects.
- Analyze Image API: Returns tags, objects, description (captions), brands, faces, image type, and color scheme.
- Confidence Scores: Most results include a confidence score between 0 and 1. For example, object detection returns a score for each detected object.
Configuration and Verification Commands
To use Azure AI Vision, you need an Azure subscription and a Cognitive Services resource. You can create one via the Azure portal, CLI, or PowerShell.
Azure CLI example to create a resource:
az cognitiveservices account create \
--name myvisionresource \
--resource-group myresourcegroup \
--kind ComputerVision \
--sku S0 \
--location westus2 \
--yes
Get the endpoint and key:
az cognitiveservices account show \
--name myvisionresource \
--resource-group myresourcegroup \
--query "properties.endpoint"
az cognitiveservices account keys list \
--name myvisionresource \
--resource-group myresourcegroup \
--query "key1"
Test with a sample image using curl:
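# JSON requires double-quoted strings, so the body is wrapped in single quotes for the shell.
# The URL is quoted so the shell does not interpret the '?' character.
curl -H "Ocp-Apim-Subscription-Key: <your-key>" \
-H "Content-Type: application/json" \
-d '{"url":"https://example.com/image.jpg"}' \
"https://<your-endpoint>/vision/v3.2/analyze?visualFeatures=Objects,Tags,Description"
How It Interacts with Related Technologies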
Azure Cognitive Search: Azure AI Vision can be used to enrich search indexes by extracting tags and descriptions from images, enabling image-based search.
Azure Logic Apps and Power Automate: Integrate image analysis into workflows, e.g., automatically moderate uploaded images.
Azure Functions: Serverless compute to process images as they are uploaded to blob storage.
Custom Vision: If the pre-trained models do not meet your needs, you can use Custom Vision to train a custom model on your own images, then deploy it alongside Azure AI Vision.
Exam-Relevant Details
Capabilities: Know the main features: Image Analysis (tags, objects, description, faces, brands), OCR (Read API), and Face Detection (but not face recognition — that's Face API).
Pricing: Free tier allows 20 calls per minute, 5K per month. Standard tier is pay-as-you-go, up to 10 TPS.
Image Requirements: Max 4 MB, min 10x10 pixels for face detection.
SDKs: Available for .NET, Python, Java, Node.js, and Go. The REST API can be called from any language.
Read API vs. OCR API: Read API is the latest, supports handwritten text and multiple languages. OCR API is legacy.
Visual Features: The visualFeatures parameter in the Analyze API accepts: Categories, Tags, Description, Faces, ImageType, Color, Adult, Objects, Brands.
Confidence Thresholds: The service returns scores; you can filter results by a threshold (e.g., only show objects with confidence > 0.5).
Error Handling: Common errors include 429 (rate limit exceeded), 401 (invalid key), 413 (image too large), and 415 (unsupported media type).
Create an Azure AI Vision Resource
In the Azure portal, search for 'Cognitive Services' and click 'Create'. Select 'Computer Vision' as the API type. Choose a subscription, resource group, region (e.g., West US), and pricing tier (F0 for free, S0 for standard). Provide a name and click 'Review + create'. After deployment, note the endpoint URL and one of the two keys. These are used for authentication in API calls. The endpoint typically looks like 'https://<region>.api.cognitive.microsoft.com/'.
Prepare Your Image
The image must be a JPEG, PNG, GIF (non-animated), BMP, or TIFF file. Maximum file size is 4 MB. Minimum dimensions for face detection are 10x10 pixels. The image can be provided as a URL or as binary data in the request body. For best results, ensure the image is clear, well-lit, and the subject is prominent. Avoid images with excessive noise or very small objects.
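A pre-flight check along these lines can catch the most common problems before a request is sent. This is a minimal sketch: the function name is hypothetical, 4 MB is assumed to mean 4 * 1024 * 1024 bytes, and the format check looks only at the file extension, so it cannot detect an animated GIF or a mislabeled file.

```python
import os

MAX_BYTES = 4 * 1024 * 1024  # assumed interpretation of the 4 MB limit
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tif", ".tiff"}


def check_image(filename, size_bytes):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 4 MB")
    return problems


print(check_image("shelf.png", 2_000_000))
print(check_image("clip.webp", 10))
```

In practice the size would come from something like os.path.getsize on the local file.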
Call the Analyze Image API
Send an HTTP POST request to the endpoint: 'https://<endpoint>/vision/v3.2/analyze?visualFeatures=Objects,Tags,Description'. Include the header 'Ocp-Apim-Subscription-Key' with your key. The request body contains the image URL or binary data. The service processes the image and returns a JSON response with detected objects (name, confidence, bounding box), tags (confidence), and a description (captions with confidence). The response time is typically under 2 seconds for average-sized images.
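The pieces of that request can be assembled as in the sketch below, which only builds the URL, headers, and body and does not send anything. The helper name and the placeholder endpoint are assumptions for illustration; the path, query parameter, and header name are the ones given above.

```python
import json
from urllib.parse import urlencode


def build_analyze_request(endpoint, key, image_url, features):
    """Return (url, headers, body) for an Analyze Image call; nothing is sent."""
    # safe="," keeps the commas between feature names readable in the URL.
    query = urlencode({"visualFeatures": ",".join(features)}, safe=",")
    url = f"{endpoint.rstrip('/')}/vision/v3.2/analyze?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url}).encode("utf-8")
    return url, headers, body


url, headers, body = build_analyze_request(
    "https://example-resource.example.com/",  # placeholder, not a real endpoint
    "<your-key>",
    "https://example.com/image.jpg",
    ["Objects", "Tags", "Description"],
)
print(url)
```

The three values could then be passed to an HTTP client of your choice, for example urllib.request.Request(url, data=body, headers=headers, method="POST").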
Parse the API Response
The JSON response contains an 'objects' array with entries like {'object': 'dog', 'confidence': 0.95, 'rectangle': {'x': 100, 'y': 200, 'w': 300, 'h': 400}}. The 'tags' array includes tags with confidence scores. The 'description' object has an array of 'captions' with text and confidence. You can filter results by a confidence threshold, e.g., only consider objects with confidence > 0.5. Use these results to drive application logic, such as auto-tagging images or generating alt text.
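A sketch of that parsing step, using a hand-written sample payload shaped like the response described above. The sample values and the helper names are illustrative assumptions, not real service output.

```python
# Hand-written sample shaped like an Analyze Image response (not real output).
sample_response = {
    "objects": [
        {"object": "dog", "confidence": 0.95,
         "rectangle": {"x": 100, "y": 200, "w": 300, "h": 400}},
        {"object": "ball", "confidence": 0.42,
         "rectangle": {"x": 10, "y": 20, "w": 30, "h": 40}},
    ],
    "tags": [{"name": "outdoor", "confidence": 0.88}],
    "description": {
        "captions": [{"text": "a dog on grass", "confidence": 0.91}],
    },
}


def confident_objects(response, threshold=0.5):
    """Return names of detected objects whose confidence exceeds the threshold."""
    return [
        obj["object"]
        for obj in response.get("objects", [])
        if obj["confidence"] > threshold
    ]


def best_caption(response):
    """Return the highest-confidence caption text, or None if there is none."""
    captions = response.get("description", {}).get("captions", [])
    if not captions:
        return None
    return max(captions, key=lambda c: c["confidence"])["text"]


print(confident_objects(sample_response))
print(best_caption(sample_response))
```

The caption returned by best_caption is the kind of value that could be used as alt text.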
Extract Text Using the Read API
For OCR, use the Read API (latest). Send a POST request to 'https://<endpoint>/vision/v3.2/read/analyze' with the image. The service responds with 202 Accepted and an 'Operation-Location' header containing a URL to poll for results. Poll the URL with GET requests until the status is 'succeeded'. The final response contains an 'analyzeResult' object whose 'readResults' array holds pages, lines, and words, each with bounding boxes and text. This API supports printed and handwritten text in multiple languages. The Read API is asynchronous to handle large documents.
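The polling step can be sketched as follows. The HTTP call itself is left to a caller-supplied function (fetch_status is a hypothetical name), so the loop can be shown without network access. The field names analyzeResult and readResults and the status values reflect the v3.2 response shape as understood here; treat them as assumptions and check them against a real response.

```python
import time


def wait_for_read_result(fetch_status, max_wait_seconds=60, interval=1.0,
                         sleep=time.sleep):
    """Poll until the Read operation finishes or the time budget runs out.

    fetch_status is a caller-supplied function that performs the GET on the
    Operation-Location URL and returns the parsed JSON body as a dict.
    """
    waited = 0.0
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError("Read operation failed")
        if waited >= max_wait_seconds:
            raise TimeoutError("Read operation did not finish in time")
        sleep(interval)
        waited += interval


def extract_lines(result):
    """Collect the text of every line on every page, in order."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

Passing sleep as a parameter keeps the loop easy to exercise in tests with a no-op function in place of a real delay.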
Enterprise Scenario 1: Automated Content Moderation
A social media platform uses Azure AI Vision to automatically moderate user-uploaded images. They use the Analyze API with the 'Adult' visual feature to detect adult or racy content. The API returns an 'adult' score (0-1) and a 'racy' score. The platform sets a threshold of 0.7 for adult content and automatically rejects images exceeding it. They also use the 'Tags' feature to flag images with violence-related tags (e.g., 'weapon', 'blood'). The system processes up to 10 images per second using the S0 tier. When the threshold was initially set too low, the platform saw false positives that blocked legitimate content, so the team later adjusted thresholds based on A/B testing.
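The decision logic in this scenario might look like the sketch below. The function name and the three outcomes are hypothetical, and the adultScore field name is an assumption about the response shape; the 0.7 threshold and the example tags come from the scenario.

```python
VIOLENCE_TAGS = {"weapon", "blood"}  # example tags from the scenario


def moderate(response, adult_threshold=0.7):
    """Return 'reject', 'flag', or 'allow' for one Analyze Image response."""
    # Assumed field name for the adult score; missing data counts as 0.0.
    adult_score = response.get("adult", {}).get("adultScore", 0.0)
    if adult_score > adult_threshold:
        return "reject"
    tag_names = {tag["name"] for tag in response.get("tags", [])}
    if tag_names & VIOLENCE_TAGS:
        return "flag"
    return "allow"


print(moderate({"adult": {"adultScore": 0.92}}))
```

Keeping the threshold as a parameter makes it straightforward to tune when false positives appear.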
Enterprise Scenario 2: Retail Inventory Management
A retail chain uses Azure AI Vision to analyze shelf images from stores. They use the Object Detection feature to detect products on shelves. The API returns bounding boxes and labels for each product. They integrated this with a custom backend that matches detected products against inventory databases. The system processes images from 500 stores daily, each image around 2 MB. They chose the S0 tier with 10 TPS. A common issue is overlapping products causing missed detections; they mitigated by taking multiple angles. Performance considerations include network latency (they use Azure CDN for image uploads) and cost optimization (they cache results for unchanged shelves).
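One way to approximate the result caching mentioned above is to key results by a hash of the image bytes. This is a minimal sketch with hypothetical names. It only recognizes byte-identical images, so a new photo of an unchanged shelf would still count as a miss, and the scenario does not say how the retailer actually detects unchanged shelves.

```python
import hashlib


class AnalysisCache:
    """Cache analysis results keyed by the SHA-256 hash of the image bytes."""

    def __init__(self, analyze):
        self._analyze = analyze  # caller-supplied function: bytes -> result
        self._results = {}
        self.api_calls = 0  # counts how often the underlying function ran

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._results:
            self._results[key] = self._analyze(image_bytes)
            self.api_calls += 1
        return self._results[key]


# A stand-in for the real API call, for illustration only.
cache = AnalysisCache(lambda data: {"size": len(data)})
cache.get(b"shelf-photo-1")
cache.get(b"shelf-photo-1")  # served from the cache
print(cache.api_calls)
```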
Enterprise Scenario 3: Document Digitization
A legal firm uses the Read API to digitize scanned contracts. They send multi-page TIFF files (each page under 4 MB) to the Read API. The asynchronous operation allows them to poll for results without blocking. They extract text from the JSON response and feed it into a search index. They handle handwritten annotations with the Read API's handwriting support. A frequent misconfiguration is not handling the polling loop correctly, leading to timeouts. They set a maximum wait of 60 seconds per page. They also encountered rate limiting (429 errors) when sending pages too fast; they implemented exponential backoff.
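The exponential backoff mentioned here can be sketched as follows. The RateLimited exception and the send callable are hypothetical stand-ins for whatever HTTP client is in use; the delay doubles after each 429, starting from base_delay.

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied function when the service answers 429."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(), retrying with doubling delays while it raises RateLimited."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits are 1, 2, 4, 8, and 16 seconds before the call finally gives up. Production code often adds random jitter so that many clients do not retry at the same moment.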
What AI-900 Tests on Azure AI Vision
The AI-900 exam (objective 3.2) focuses on understanding the capabilities and use cases of Azure AI Vision. You do not need to memorize API endpoints or code, but you must know:
The difference between Azure AI Vision, Custom Vision, and Face API.
The specific features: image analysis (tags, objects, descriptions, faces, brands, adult content), OCR (Read API for printed and handwritten text), and face detection (but not recognition).
Use cases: automated tagging, content moderation, text extraction from images, and accessibility (generating alt text).
Limitations: image size (4 MB), format support, and free tier limits (20 calls/min, 5K/month).
Common Wrong Answers and Why Candidates Choose Them
Choosing 'Face API' for face detection: Candidates often confuse Azure AI Vision (which includes face detection) with Face API (which adds face recognition and verification). The exam expects you to know that Azure AI Vision can detect faces and attributes like age and emotion, but it cannot identify individuals. Face API is for recognition.
Selecting 'Custom Vision' for general image analysis: Candidates think they need to train a custom model for every task. However, Azure AI Vision's pre-trained models cover thousands of common objects and scenes. Custom Vision is only needed for domain-specific objects not covered by the pre-trained model.
Believing the OCR API is the latest: The exam tests your knowledge of the Read API as the current OCR solution. The legacy OCR API is still available but not recommended. Candidates who answer questions about OCR with the legacy API will be wrong.
Assuming face detection returns identity: Azure AI Vision's face detection returns attributes like age, emotion, and facial hair, but it does not recognize individuals. The exam may ask what information is returned by Azure AI Vision vs. Face API.
Specific Numbers and Terms on the Exam
Free tier: 20 transactions per minute, 5,000 per month.
Standard tier: up to 10 transactions per second.
Image size limit: 4 MB.
Minimum face detection size: 10x10 pixels.
Supported image formats: JPEG, PNG, GIF, BMP, TIFF.
Visual features: Categories, Tags, Description, Faces, ImageType, Color, Adult, Objects, Brands.
Read API supports printed and handwritten text.
Azure AI Vision does NOT support face recognition or verification.
Edge Cases and Exceptions
If an image is too large (over 4 MB), the API returns a 413 error. You must resize or compress the image.
If the image format is unsupported (e.g., animated GIF), the API returns 415.
The Read API is asynchronous; you must poll the Operation-Location URL. The initial POST returns a 202 Accepted status.
Free tier has a rate limit of 20 calls per minute. Exceeding it returns 429 Too Many Requests.
The service may return empty results if the image quality is poor (e.g., too dark, blurry).
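The status codes above map naturally onto a small lookup table. This sketch uses hypothetical action strings and covers only the codes named in this chapter.

```python
# Suggested next steps for the status codes discussed in this chapter.
ACTIONS = {
    200: "ok: parse the JSON body",
    202: "accepted: poll the Operation-Location URL",
    401: "check the subscription key",
    413: "resize or compress the image",
    415: "convert the image to a supported format",
    429: "slow down and retry later",
}


def next_step(status_code):
    """Return a suggested action for an HTTP status code from the service."""
    return ACTIONS.get(status_code, "unexpected status: inspect the response body")


print(next_step(413))
```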
How to Eliminate Wrong Answers
Understand the underlying mechanism: Azure AI Vision is a pre-trained, general-purpose vision service. If a scenario requires recognizing specific objects not in the pre-trained set (e.g., rare species of birds), Custom Vision is needed. If a scenario requires identifying a specific person (e.g., 'Is this John?'), Face API is needed. If the scenario just needs to detect faces or general objects, Azure AI Vision suffices. Also, remember that OCR is now the Read API, not the legacy one.
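The elimination rule can be written down as a tiny decision function. It is a study aid with hypothetical parameter names, not part of any SDK.

```python
def choose_service(needs_identity=False, needs_custom_objects=False):
    """Pick the service that fits an exam scenario, following the rule above."""
    if needs_identity:
        # "Is this John?" requires face recognition.
        return "Face API"
    if needs_custom_objects:
        # Objects outside the pre-trained set require a trained custom model.
        return "Custom Vision"
    # General objects, text, or face detection: the pre-trained service suffices.
    return "Azure AI Vision"


print(choose_service(needs_custom_objects=True))
```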
Azure AI Vision provides pre-trained models for image analysis, OCR, and face detection — no ML expertise needed.
The Read API is the latest OCR service, supporting printed and handwritten text asynchronously.
Face detection in Azure AI Vision returns attributes (age, emotion) but does NOT identify individuals.
Free tier: 20 calls/min, 5,000 calls/month. Standard tier: up to 10 TPS (pay-as-you-go).
Image size limit: 4 MB; supported formats: JPEG, PNG, GIF, BMP, TIFF.
Use Custom Vision when you need to recognize objects not covered by the pre-trained model.
Use Face API when you need face recognition (identifying specific people).
The Analyze API can return tags, objects, description, faces, brands, adult content, and more.
These come up on the exam all the time. Here's how to tell them apart.
Azure AI Vision
Pre-trained on thousands of common objects and scenes.
No training required; ready to use via API.
Supports image analysis, OCR, face detection.
Cannot recognize domain-specific objects not in training set.
Pricing: Free tier (5K/month) or Standard (pay-as-you-go).
Custom Vision
Train custom models on your own labeled images.
Requires training process; you provide images and tags.
Supports image classification and object detection (custom).
Can recognize any object you train it on.
Pricing: Separate tiers; training consumes compute hours.
Mistake
Azure AI Vision can recognize specific individuals in images.
Correct
Azure AI Vision only detects faces and predicts attributes like age and emotion. It does not perform face recognition (identifying a specific person). That requires the Face API.
Mistake
The OCR API and Read API are the same.
Correct
The Read API is the latest OCR service, supporting printed and handwritten text, multiple languages, and asynchronous processing. The legacy OCR API only supports printed text and is not recommended for new projects.
Mistake
You must train a custom model for any image analysis task.
Correct
Azure AI Vision comes with pre-trained models that can recognize thousands of common objects, scenes, and concepts. Custom Vision is only needed when you have domain-specific images not covered by the pre-trained model.
Mistake
Azure AI Vision can process images of any size.
Correct
Images are limited to 4 MB in file size. Larger images must be resized or compressed before sending. Also, for face detection, the minimum size is 10x10 pixels.
Mistake
The free tier has unlimited usage.
Correct
The free tier (F0) allows 20 transactions per minute and 5,000 transactions per month. Exceeding these limits results in a 429 (rate limit) error.
Azure AI Vision is a pre-trained service that can analyze general images for common objects, text, and faces. Custom Vision allows you to train your own model on your own images for specialized tasks. Use Azure AI Vision for common scenarios; use Custom Vision when you need to recognize unique items not in the pre-trained set.
Yes, the Read API (the latest OCR service) supports both printed and handwritten text. The legacy OCR API does not support handwriting. For the exam, remember that the Read API is the correct choice for handwriting.
The free tier (F0) allows 20 transactions per minute and 5,000 transactions per month. If you exceed these limits, you will receive a 429 rate limit error. For higher throughput, you need the Standard tier.
No. Azure AI Vision can detect faces and predict attributes like age, emotion, and facial hair, but it cannot identify specific individuals. For face recognition, use the Face API.
Supported formats are JPEG, PNG, GIF (non-animated), BMP, and TIFF. If you send an animated GIF or another unsupported format, you'll get a 415 error.
The maximum file size is 4 MB. If your image is larger, you must resize or compress it before sending. The API will return a 413 error if the image exceeds this limit.
The Analyze API is used for general image analysis (tags, objects, descriptions, faces, etc.). The Read API is specifically for optical character recognition (OCR) to extract text from images. They are different endpoints with different capabilities.