AI-900 | Chapter 10 of 100 | Objective 3.1

What is Computer Vision?

This chapter covers the fundamentals of computer vision, a core area in AI that enables machines to interpret and understand visual information from the world. For the AI-900 exam, computer vision topics appear in approximately 15-20% of questions, making it essential to grasp the key concepts, services, and use cases. You will learn how computer vision works, the specific Azure AI services that implement it, and how to apply them in real-world scenarios.

25 min read
Intermediate
Updated May 31, 2026

The Automated Quality Inspector

Imagine a factory assembly line producing smartphones. Each phone passes under a high-speed camera connected to an automated quality inspector. This inspector doesn't just take a picture; it has been trained on thousands of images of perfect phones and defective phones. When a phone comes down the line, the camera captures a digital image and breaks it down into millions of tiny squares called pixels. Each pixel has a color value. The inspector's trained neural network analyzes the pixel patterns, looking for features like edges, textures, and shapes. It compares these features against its learned model of what a 'good' phone looks like. If a phone has a scratch, the pattern of pixels in that area will deviate from the expected pattern. The inspector flags that phone as defective. The key is that the inspector doesn't just see a scratch; it recognizes the scratch as a pattern anomaly because it has learned from many examples. This mirrors computer vision: a computer doesn't 'see' images like we do; it processes pixel data through trained models to detect objects, classify scenes, or read text. The quality inspector's ability to generalize—to spot a new type of defect it hasn't seen before—is analogous to a well-trained computer vision model that can identify objects in varied lighting and angles.

How It Actually Works

Computer vision is a field of artificial intelligence that trains computers to interpret and understand the visual world. Using digital images from cameras and videos, computer vision systems can identify objects, classify scenes, detect faces, read text, and even analyze motion. Unlike human vision, which is intuitive and context-rich, computer vision relies on mathematical models and massive datasets to recognize patterns.

Why Does Computer Vision Exist?

Human visual processing is limited by scale, speed, and consistency. A human can look at an image and identify a cat, but cannot process millions of images per hour without fatigue. Computer vision exists to automate visual tasks at scale, enabling applications like autonomous vehicles, medical imaging analysis, quality inspection in manufacturing, and security surveillance. Microsoft Azure provides several pre-built computer vision services that allow developers to add vision capabilities without building models from scratch.

How Does Computer Vision Work Internally?

At a high level, computer vision systems follow a pipeline: image acquisition, preprocessing, feature extraction, and interpretation.

- Image Acquisition: An image is captured as a grid of pixels. Each pixel has a numerical value representing color or intensity. For example, a 640x480 image has 307,200 pixels.
- Preprocessing: The image may be resized or normalized (scaling pixel values to a range like 0-1). During training, images may also be augmented (rotated, flipped) to improve model robustness.
- Feature Extraction: Traditional computer vision used hand-crafted algorithms like edge detection (Canny, Sobel) to identify edges and corners. Modern deep learning approaches use convolutional neural networks (CNNs) that automatically learn hierarchical features. A CNN typically consists of:
  - Convolutional layers: Apply filters (kernels) to detect features like edges, textures, and shapes. For example, a 3x3 filter slides over the image, computing dot products to create feature maps.
  - Activation functions: ReLU (Rectified Linear Unit) introduces non-linearity, setting negative values to zero.
  - Pooling layers: Reduce spatial dimensions (e.g., 2x2 max pooling with a stride of 2 keeps the maximum value in each 2x2 region, halving both width and height).
  - Fully connected layers: Flatten the feature maps into a vector and classify based on learned weights.
- Interpretation: The final layer outputs probabilities for each class (e.g., 0.95 cat, 0.05 dog). In single-label classification the class with the highest probability is usually the prediction; in binary or multi-label settings, a threshold (often 0.5) decides whether a label applies.
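The convolution, ReLU, and pooling steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (a tiny 6x6 grayscale image, one hand-written 3x3 vertical-edge filter, no padding, stride 1); real CNNs learn their filters and run on optimized libraries. All function names here are made up for the example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), summing products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Set negative values to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 region (stride 2)."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# 6x6 grayscale image with a bright vertical stripe in the middle columns
image = [[0, 0, 9, 9, 0, 0] for _ in range(6)]
# Hand-written filter that responds to dark-to-bright vertical edges
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = convolve2d(image, kernel)  # 4x4: positive at the left edge, negative at the right
activated = relu(feature_map)            # negatives become 0
pooled = max_pool_2x2(activated)         # 2x2: width and height halved
print(feature_map[0], activated[0], pooled)
```

The filter responds positively where the image goes from dark to bright and negatively where it goes from bright to dark; ReLU then discards the negative responses, and pooling shrinks the 4x4 map to 2x2.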

Key Components and Values

Image resolution: Common input sizes for CNNs are 224x224 (ImageNet standard), 256x256, or 299x299 (Inception).

Color channels: RGB images have 3 channels; grayscale has 1.

Batch size: Number of images processed together (e.g., 32, 64) affects memory and training speed.

Learning rate: Typically 0.001 to 0.0001, controls how much weights update per step.

Epochs: One full pass through the training dataset; often 10-100.

Confidence threshold: For object detection, a threshold (e.g., 0.5) filters out low-confidence detections.
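To make these numbers concrete, the arithmetic behind them can be worked through in a short sketch. The dataset size of 10,000 images and the choice of 10 epochs are assumptions picked for illustration, not values from any Azure service.

```python
import math


def values_per_image(width, height, channels):
    """Number of numeric values in one input image."""
    return width * height * channels


def steps_per_epoch(num_images, batch_size):
    """Weight updates needed for one full pass over the training set."""
    return math.ceil(num_images / batch_size)


per_image = values_per_image(224, 224, 3)   # RGB image at the 224x224 input size
per_batch = per_image * 32                  # one batch of 32 images
steps = steps_per_epoch(10_000, 32)         # hypothetical dataset of 10,000 images
total_updates = steps * 10                  # hypothetical training run of 10 epochs

print(per_image, per_batch, steps, total_updates)
```

A larger batch size lowers the number of steps per epoch but raises the memory needed for each step, which is the trade-off the batch size entry describes.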

Azure Computer Vision Services

Microsoft Azure offers several computer vision services under Azure AI Services:

Azure Computer Vision: A pre-built service for image analysis, OCR, and spatial analysis. It can extract tags, describe images, detect objects, read text (OCR), and analyze faces.

Azure Custom Vision: Allows you to train custom image classification and object detection models using your own images. You upload images, label them, and train a model via a simple GUI or API.

Azure Face API: Detects, identifies, and analyzes human faces. Can detect 27 facial landmarks, estimate age, emotion (8 categories), and identify individuals (requires PersonGroup training).

Azure Form Recognizer: Extracts text, key-value pairs, and tables from documents using OCR and deep learning.

Azure Video Analyzer: Analyzes video streams for motion, object detection, and event detection.

How to Configure and Verify

To use Azure Computer Vision, you create a Computer Vision resource in the Azure portal. You'll get an endpoint URL and two keys. Example using Python:

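import requests

# Placeholders: replace with your resource's endpoint and one of its two keys
endpoint = "https://your-resource-name.cognitiveservices.azure.com/"
subscription_key = "your-key"

# Analyze an image
url = endpoint + "vision/v3.2/analyze"
params = {"visualFeatures": "Categories,Description,Color"}  # features to return
image_url = {"url": "https://example.com/image.jpg"}  # image the service can reach by URL
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
response = requests.post(url, params=params, json=image_url, headers=headers, timeout=30)
response.raise_for_status()  # raise on 4xx/5xx instead of printing an error body
print(response.json())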
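
Once the JSON comes back, the application usually picks out a few fields. The sketch below works on a hand-written sample dictionary rather than a live call, so it runs without an Azure resource. The field names (categories, description.captions, confidence) follow the shape the v3.2 analyze operation is understood to return, but the values and the helper names are invented; check a real response before relying on them.

```python
# Hand-written sample shaped like an analyze response; values are invented.
sample_response = {
    "categories": [
        {"name": "outdoor_", "score": 0.03},
        {"name": "people_", "score": 0.85},
    ],
    "description": {
        "tags": ["person", "outdoor"],
        "captions": [{"text": "a person standing outside", "confidence": 0.91}],
    },
    "color": {"dominantColorForeground": "Grey", "dominantColorBackground": "White"},
    "requestId": "hypothetical-request-id",
}


def best_caption(analysis):
    """Return (text, confidence) of the most confident caption, or None."""
    captions = analysis.get("description", {}).get("captions", [])
    if not captions:
        return None
    top = max(captions, key=lambda c: c["confidence"])
    return top["text"], top["confidence"]


def top_category(analysis):
    """Return the name of the highest-scoring category, or None."""
    categories = analysis.get("categories", [])
    if not categories:
        return None
    return max(categories, key=lambda c: c["score"])["name"]


print(best_caption(sample_response))
print(top_category(sample_response))
```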

For Custom Vision, you use the Custom Vision portal (customvision.ai) to upload images, label them, and train. The training process uses transfer learning—starting from a pre-trained model (like ResNet) and fine-tuning on your data.

Interaction with Related Technologies

Computer vision often integrates with:
- Azure Cognitive Search: Index images and extract metadata via OCR and image analysis for search.
- Azure Logic Apps: Automate workflows triggered by image uploads.
- Azure Functions: Process images serverlessly.
- Azure Machine Learning: Build custom models using deep learning frameworks (TensorFlow, PyTorch) and deploy as endpoints.

Common Exam Traps

Trap 1: Confusing Computer Vision with Custom Vision. Computer Vision is pre-built; Custom Vision allows custom training.

Trap 2: Thinking Computer Vision can identify specific people. For person identification, use Face API with PersonGroup.

Trap 3: Assuming OCR works on handwritten text. Azure Computer Vision OCR works best on printed text; for handwriting, use Form Recognizer or a custom model.

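Trap 4: Believing that image analysis returns bounding boxes for all objects. Unless you request object detection, the "analyze" operation returns features such as tags and descriptions without bounding boxes; for bounding boxes, use object detection (the "detect" operation, or the Objects visual feature of "analyze").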

Walk-Through

1. Capture and Preprocess Image

The process begins when an image is captured from a camera, file, or URL. The image is a 2D array of pixels, each with RGB values (0-255). Preprocessing may resize the image to a standard size (e.g., 224x224), normalize pixel values to [0,1] or [-1,1], and convert to tensor format. For Azure Computer Vision, the service accepts images up to 4 MB in size, with dimensions up to 10,000 x 10,000 pixels. The image can be sent as binary in the request body or as a URL.
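
A minimal sketch of this step, assuming the image is already decoded into nested lists of (R, G, B) values: it checks the size limits quoted above and applies the two normalizations mentioned. The helper names are hypothetical, and treating 4 MB as 4 * 1024 * 1024 bytes is an assumption; the service itself decides what it accepts.

```python
MAX_BYTES = 4 * 1024 * 1024   # 4 MB limit quoted in the text (byte interpretation assumed)
MAX_SIDE = 10_000             # maximum width or height in pixels, per the text


def within_service_limits(num_bytes, width, height):
    """Client-side check against the limits quoted above."""
    return num_bytes <= MAX_BYTES and width <= MAX_SIDE and height <= MAX_SIDE


def normalize_unit(pixels):
    """Scale 0-255 channel values to the range [0, 1]."""
    return [[[ch / 255 for ch in px] for px in row] for row in pixels]


def normalize_signed(pixels):
    """Scale 0-255 channel values to the range [-1, 1]."""
    return [[[ch / 127.5 - 1 for ch in px] for px in row] for row in pixels]


# A 1x2 "image": one black pixel and one white pixel
tiny_image = [[(0, 0, 0), (255, 255, 255)]]

print(within_service_limits(3 * 1024 * 1024, 4000, 3000))
print(normalize_unit(tiny_image))
print(normalize_signed(tiny_image))
```

Checking limits before uploading avoids a round trip that would only end in a rejected request.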

2. Feature Extraction via CNN

The preprocessed image passes through a convolutional neural network (CNN) trained on millions of images. The CNN's first layers detect low-level features like edges and corners. Subsequent layers combine these into mid-level features (e.g., textures, shapes) and high-level features (e.g., object parts). For example, a ResNet-50 model has 50 layers and uses residual connections to avoid vanishing gradients. The output is a feature vector representing the image's content.

3. Classification or Detection

The feature vector is fed into a classifier (e.g., softmax layer) that outputs probabilities for each class. For object detection, the model also predicts bounding boxes using region proposal networks (like in Faster R-CNN) or anchor boxes (like in YOLO). Azure Computer Vision's object detection returns up to 50 objects per image, each with a bounding box (coordinates as x, y, width, height) and a confidence score (0-1). A confidence threshold (default 0.5) filters out low-confidence detections.
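
The two ideas in this step, turning raw scores into probabilities and dropping low-confidence detections, can be sketched as follows. The logits, labels, and detection dictionaries are invented for illustration; the filter simply applies a threshold in client code and does not represent any particular Azure parameter.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_detections(detections, threshold=0.5):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


labels = ["cat", "dog", "bird"]
probabilities = softmax([2.0, 1.0, 0.1])     # made-up logits
predicted = labels[probabilities.index(max(probabilities))]

detections = [
    {"object": "dog", "confidence": 0.92},
    {"object": "ball", "confidence": 0.31},
]
kept = filter_detections(detections)

print(predicted, [round(p, 3) for p in probabilities])
print(kept)
```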

4. Post-processing and Output

The raw predictions are post-processed to remove duplicates via non-maximum suppression (NMS). NMS selects the bounding box with highest confidence and suppresses others with overlap greater than a threshold (e.g., 0.5 IoU). The final output is a JSON object containing tags, descriptions, captions, or detected objects. For OCR, the text is returned with bounding polygons and confidence scores for each word.
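
Non-maximum suppression as described above can be sketched in plain Python. This is a simple greedy version, assuming boxes are given as (x, y, width, height) tuples and that all detections belong to the same class; production detectors typically run NMS per class with vectorized code.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    overlap_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the most confident box, drop boxes overlapping it too much."""
    remaining = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best["box"], d["box"]) <= iou_threshold]
    return kept


detections = [
    {"box": (10, 10, 100, 100), "confidence": 0.9},
    {"box": (20, 20, 100, 100), "confidence": 0.8},   # overlaps the first box heavily
    {"box": (300, 300, 50, 50), "confidence": 0.7},   # separate object
]
final = non_max_suppression(detections)
print([d["confidence"] for d in final])
```

The first two boxes share an IoU of about 0.68, above the 0.5 threshold, so only the more confident of the two survives alongside the separate third box.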

5. Consume Results in Application

The application receives the JSON response and processes it. For example, a retail app might use detected objects to inventory items, or a security system might trigger an alert when a specific object (e.g., weapon) is detected. The response includes metadata like request ID and API version. Developers can use SDKs in C#, Python, Java, etc., to integrate easily.
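
As a rough sketch of the retail and security examples, the snippet below counts detected objects and checks for a watched label. It uses a hand-written sample response; the "objects", "rectangle", "object", and "confidence" field names follow the shape the object detection response is understood to have, while the values and helper names are invented.

```python
from collections import Counter

# Hand-written sample shaped like an object detection response; values are invented.
sample_response = {
    "objects": [
        {"rectangle": {"x": 10, "y": 20, "w": 50, "h": 120}, "object": "bottle", "confidence": 0.90},
        {"rectangle": {"x": 70, "y": 22, "w": 48, "h": 118}, "object": "bottle", "confidence": 0.80},
        {"rectangle": {"x": 140, "y": 30, "w": 60, "h": 60}, "object": "box", "confidence": 0.40},
        {"rectangle": {"x": 220, "y": 5, "w": 90, "h": 200}, "object": "person", "confidence": 0.95},
    ],
    "requestId": "hypothetical-request-id",
}


def count_objects(response, min_confidence=0.5):
    """Count detected objects per label, ignoring low-confidence detections."""
    return Counter(
        obj["object"]
        for obj in response.get("objects", [])
        if obj["confidence"] >= min_confidence
    )


def should_alert(response, watched_labels, min_confidence=0.5):
    """Return True if any watched label is detected with enough confidence."""
    counts = count_objects(response, min_confidence)
    return any(label in counts for label in watched_labels)


print(count_objects(sample_response))
print(should_alert(sample_response, {"person"}))
```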

What This Looks Like on the Job

Enterprise Scenario 1: Retail Inventory Management

A large retail chain uses Azure Custom Vision to automate shelf inventory. They deploy cameras on shopping carts that capture images of shelves. The images are sent to a Custom Vision model trained on thousands of labeled product images. The model detects and counts products on shelves, identifying out-of-stock items. The system is configured to process images every 5 minutes during store hours, handling up to 100 images per minute. A common issue is misclassification due to similar packaging or partial occlusion. To mitigate, the team retrains the model weekly with new images and adjusts the confidence threshold to 0.7 to reduce false positives. Performance considerations include network latency (images are sent over Wi-Fi) and processing time (each image takes ~500ms). Misconfiguration, like using Computer Vision instead of Custom Vision, would fail because pre-built models don't recognize specific products.

Enterprise Scenario 2: Medical Imaging Triage

A hospital deploys a computer vision solution on Azure to analyze chest X-rays for signs of pneumonia. The system uses a custom model built in Azure Machine Learning, deployed as a real-time endpoint. X-rays are uploaded to a secure blob storage, triggering an Azure Function that calls the endpoint. The model returns a probability score and highlights suspicious regions. The system processes 200 images per day, with a latency requirement of under 2 seconds per image. A critical consideration is compliance with HIPAA; all data must be encrypted in transit and at rest. A common pitfall is using the pre-built Computer Vision service, which is not trained for medical images and would give inaccurate results. The team uses transfer learning on a pre-trained ResNet-50 with a custom dataset of labeled X-rays. They monitor model drift by comparing weekly accuracy against a holdout set.

Enterprise Scenario 3: Automated Document Processing

A financial services firm uses Azure Form Recognizer to extract data from invoices and receipts. The system processes 10,000 documents per day, extracting fields like invoice number, date, total amount, and line items. Form Recognizer uses OCR and deep learning to understand document layout. It is configured with a custom model trained on 500 sample invoices. The system integrates with Azure Logic Apps to route extracted data to an ERP system. A common issue is low confidence on handwritten fields; the team sets a confidence threshold of 0.8 and flags low-confidence extractions for manual review. Misconfiguration, such as using Computer Vision's OCR instead of Form Recognizer, would miss key-value pair extraction and table structure.
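
The manual-review rule in this scenario amounts to a simple split on confidence. The sketch below assumes the extracted fields have already been reduced to a dictionary of value and confidence pairs; that structure and the function name are hypothetical and not the Form Recognizer response format.

```python
def route_fields(fields, threshold=0.8):
    """Split extracted fields into auto-accepted values and values needing review."""
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


# Invented example of fields extracted from one invoice
extracted = {
    "invoice_number": {"value": "INV-1001", "confidence": 0.97},
    "total": {"value": "42.00", "confidence": 0.91},
    "notes": {"value": "paid in cash", "confidence": 0.55},   # e.g., a handwritten field
}

accepted, needs_review = route_fields(extracted)
print(accepted)
print(needs_review)
```

Raising the threshold sends more fields to manual review and lowers the risk of bad data reaching the ERP system; lowering it does the opposite.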

How AI-900 Actually Tests This

What AI-900 Tests on Computer Vision (Objective 3.1)

The exam focuses on identifying the correct Azure service for a given computer vision task. Key skills: describe computer vision concepts, identify Azure services for computer vision, and understand image analysis, object detection, OCR, and face detection.

Common Wrong Answers

Wrong Answer 1: Choosing 'Face API' for general image analysis. Face API is specialized for faces; for general analysis (tags, descriptions), use Computer Vision.

Wrong Answer 2: Selecting 'Custom Vision' when the scenario requires pre-built capabilities. If the task is common (e.g., detecting common objects), Computer Vision is sufficient. Custom Vision is for custom, unique objects.

Wrong Answer 3: Assuming OCR is only available in Computer Vision. OCR is also available in Form Recognizer (for documents) and in Azure Cognitive Search (for indexing). The exam expects you to choose the most appropriate service.

Wrong Answer 4: Thinking that Computer Vision can identify individuals. For person identification, use Face API with a PersonGroup.

Specific Numbers and Terms

Confidence scores: Ranges from 0 to 1. Thresholds like 0.5 are common.

Bounding boxes: Returned as x, y, width, height (in pixels).

Image size limit: 4 MB for Computer Vision.

Maximum image dimensions: 10,000 x 10,000 pixels.

Number of objects detected: Up to 50 per image.

Face landmarks: 27 landmarks detected by Face API.

Emotion categories: 8 (anger, contempt, disgust, fear, happiness, neutral, sadness, surprise).

Edge Cases and Exceptions

Handwritten text: Computer Vision's OCR works best on printed text. For handwriting, use Form Recognizer (pre-built receipt model) or a custom model.

Celebrity detection: Computer Vision can identify celebrities if the 'celebrities' domain is specified. This is a specific model.

Landmark detection: Similar to celebrities, Computer Vision can recognize landmarks like the Eiffel Tower.

Thumbnail generation: Computer Vision can generate a thumbnail by cropping around the most important region (smart cropping).

How to Eliminate Wrong Answers

Read the scenario carefully: Look for keywords like 'custom', 'specific', 'unique' to decide between Computer Vision and Custom Vision.

Identify the task: If the task is about faces (detect, identify, verify), choose Face API. If it's about reading text in documents, choose Form Recognizer. If it's general image analysis, choose Computer Vision.

Check for pre-built vs. custom: If the scenario mentions a common object (car, person, dog), Computer Vision works. If it's a proprietary product, Custom Vision is needed.

Watch for 'extract key-value pairs': This points to Form Recognizer, not Computer Vision.
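
The elimination rules above can be written down as a toy keyword lookup. This is a study aid only: the keyword lists are assumptions drawn from the hints in this section, and a real exam question needs to be read in full rather than matched on keywords.

```python
import re


def suggest_service(scenario):
    """Map a scenario description to a likely service using the rules above."""
    words = set(re.findall(r"[a-z\-]+", scenario.lower()))
    if words & {"key-value", "invoice", "invoices", "receipt", "receipts"}:
        return "Form Recognizer"
    if words & {"face", "faces"}:
        return "Face API"
    if words & {"custom", "specific", "unique", "proprietary"}:
        return "Custom Vision"
    return "Computer Vision"


scenarios = [
    "Extract key-value pairs from scanned invoices",
    "Verify that two photos show the same face",
    "Detect our proprietary product packaging on shelves",
    "Generate a caption and tags for a holiday photo",
]
for text in scenarios:
    print(suggest_service(text), "<-", text)
```

The order of the checks matters: document and face keywords are tested before "specific" or "custom", so a scenario about a specific person's face still points to Face API.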

Key Takeaways

Computer vision enables machines to interpret visual data using deep learning, specifically CNNs.

Azure Computer Vision is a pre-built service for common tasks like object detection, OCR, and image description.

Azure Custom Vision lets you train custom image classification and object detection models with your own images.

Azure Face API detects and identifies faces, including attributes like emotion, age, and landmarks.

Azure Form Recognizer extracts text, key-value pairs, and tables from documents using OCR and layout analysis.

For video analysis, use Azure Video Analyzer, not Computer Vision.

Confidence scores range from 0 to 1; a threshold of 0.5 is typical for filtering detections.

Computer Vision image size limit is 4 MB; dimensions up to 10,000 x 10,000 pixels.

Object detection returns up to 50 objects per image with bounding boxes.

OCR in Computer Vision is best for printed text; use Form Recognizer for handwriting or structured documents.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Computer Vision

Pre-trained on millions of images; no training needed.

Can analyze images for tags, descriptions, objects, faces, OCR.

Supports domain-specific models like celebrities and landmarks.

Processes images via API; max 4 MB, 10,000x10,000 pixels.

Best for common, general-purpose image analysis tasks.

Azure Custom Vision

Requires training with your own labeled images.

Supports image classification and object detection (with bounding boxes).

Exports models to TensorFlow, ONNX, or Docker for on-premises deployment.

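Training is done via customvision.ai or API; a classifier requires at least two tags and at least five images per tag, and considerably more images are recommended for good accuracy.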

Best for specialized or unique visual concepts not covered by pre-built models.

Watch Out for These

Mistake

Computer Vision can identify specific people by name.

Correct

Computer Vision does not identify individuals. It detects faces and can estimate attributes like age and emotion, but person identification requires the Face API with a PersonGroup trained on known individuals.

Mistake

Custom Vision is always better than Computer Vision for any image analysis.

Correct

Custom Vision requires training data and time. For common tasks like detecting everyday objects, reading printed text, or generating image descriptions, the pre-built Computer Vision service is faster to adopt and typically accurate enough without any training. Custom Vision is only needed for specialized objects or scenarios.

Mistake

OCR in Computer Vision can extract handwritten text perfectly.

Correct

Computer Vision's OCR (Read API) is optimized for printed text. For handwritten text, accuracy is lower. Azure Form Recognizer's pre-built receipt or invoice models handle handwriting better, but for general handwriting, a custom model may be needed.

Mistake

All Azure AI services for computer vision require custom training.

Correct

Azure Computer Vision, Face API, and Form Recognizer offer pre-built models that work out of the box. Custom Vision is the only one that requires training with your own images. The exam tests knowing which service is pre-built vs. custom.

Mistake

Computer Vision can analyze videos in real time.

Correct

Computer Vision is designed for static images. For video analysis, use Azure Video Analyzer for Media (formerly Video Indexer), which can process video and extract insights like motion, objects, and speech.

Frequently Asked Questions

What is the difference between Computer Vision and Custom Vision in Azure?

Azure Computer Vision is a pre-trained service that can analyze images out of the box for common tasks like detecting objects, reading text, and generating captions. It requires no training data. Azure Custom Vision, on the other hand, allows you to train your own image classification or object detection model using your own labeled images. Use Computer Vision for general purposes; use Custom Vision when you need to recognize specific items not covered by the pre-trained model.

Can Computer Vision detect faces?

Yes, Computer Vision can detect faces in an image and return bounding boxes, along with estimated age and gender. However, for more advanced face analysis like emotion detection (8 emotions) or person identification, you should use Azure Face API. Face API provides 27 facial landmarks and can identify individuals if you create a PersonGroup.

How does OCR work in Azure Computer Vision?

Azure Computer Vision includes OCR (Optical Character Recognition) via the Read API. It extracts printed text from images and documents. The process involves detecting text regions, recognizing characters using deep learning, and returning the text along with bounding polygons and confidence scores. The Read API is asynchronous for large documents; you submit the image, get an operation ID, and poll for results.
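
The submit-then-poll pattern can be sketched without a live service by passing the status check in as a function. Everything here is an illustration: `poll_read_result` and `extract_lines` are hypothetical helpers, the responses are simulated, and the field names (status, analyzeResult, readResults, lines, text) follow the shape the Read operation result is understood to have.

```python
import time


def poll_read_result(get_status, interval_seconds=1.0, max_attempts=10):
    """Call get_status() until the operation succeeds or fails, or attempts run out."""
    for _ in range(max_attempts):
        result = get_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval_seconds)
    raise TimeoutError("Read operation did not finish in time")


def extract_lines(result):
    """Collect the recognized text lines from a finished Read result."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]


# Simulated status responses: still running, then finished (values invented)
_fake_responses = iter([
    {"status": "running"},
    {"status": "succeeded",
     "analyzeResult": {"readResults": [
         {"lines": [{"text": "INVOICE 1001"}, {"text": "Total: 42.00"}]}
     ]}},
])

final = poll_read_result(lambda: next(_fake_responses), interval_seconds=0)
lines = extract_lines(final)
print(final["status"], lines)
```

In a real client, `get_status` would typically issue a GET request to the operation URL returned by the submit call and return the parsed JSON body.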

What is the maximum image size for Azure Computer Vision?

The maximum image file size is 4 MB, and the maximum dimensions are 10,000 x 10,000 pixels. Images larger than 4 MB are rejected, so resize or compress larger images before sending them.

Can I use Computer Vision to identify my company logo?

The pre-built Computer Vision service can detect common objects, but it is not trained on specific company logos. To detect a custom logo, you should use Azure Custom Vision to train a model with images of your logo. Alternatively, you can use Computer Vision's OCR to read text in the logo, but that won't recognize the logo itself.

What is the confidence threshold in object detection?

The confidence threshold is a value between 0 and 1 that filters out low-confidence predictions. Azure Computer Vision's object detection returns only objects with confidence above the threshold. The default is 0.5, but you can adjust it via API. A higher threshold (e.g., 0.8) reduces false positives but may miss some objects.

Is Azure Computer Vision available in all regions?

Azure Computer Vision is available in many Azure regions, including East US, West Europe, Southeast Asia, and more. Not all regions support all features; for example, the 'Read' OCR API may have different regional availability. Always check the Azure documentation for the latest regional availability.
