This chapter covers semantic segmentation, a computer vision technique that assigns a class label to every pixel in an image. For the AI-900 exam, semantic segmentation appears within Objective 3.1: Identify features of common computer vision workloads. Expect approximately 5–10% of exam questions to touch on segmentation concepts, often distinguishing it from image classification and object detection. Understanding the mechanism, use cases, and Azure services that support segmentation is critical for scoring well on these questions.
Imagine you have a giant photograph of a city street, and your task is to create a mosaic of that photo using thousands of tiny colored tiles. Each tile represents a small square of the image. In a regular classification task, you would say, "This mosaic is a city street." In object detection, you would draw bounding boxes around cars, pedestrians, and buildings, labeling each box. But semantic segmentation is like assigning a color to every single tile based on what part of the scene it belongs to: all tiles that are part of the sky get blue, all tiles that are part of a car get red, all tiles that are part of a building get gray, and so on. Every tile gets exactly one color (class), and no tile is left uncolored. The process is pixel-level: for each tile, you look at its content and decide its class. The final mosaic is a fully labeled image where every pixel has a class label. This is analogous to how a convolutional neural network (CNN) processes an image: it slides a window over the image, makes a prediction for each pixel based on its surrounding context, and outputs a segmentation map where each pixel is assigned a class. The network does not just say "there is a car" — it tells you exactly which pixels belong to that car.
What is Semantic Segmentation?
Semantic segmentation is a computer vision task where the goal is to classify each pixel in an image into a predefined category. Unlike image classification (which labels the entire image) or object detection (which locates objects with bounding boxes), semantic segmentation produces a dense, pixel-wise label map. For example, in an autonomous driving scenario, every pixel might be labeled as "road," "car," "pedestrian," "building," "sky," etc. This is essential for tasks requiring fine-grained understanding, such as medical image analysis (segmenting organs or tumors), satellite imagery (identifying land cover types), and augmented reality (understanding scene geometry).
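To make the contrast with object detection concrete, here is a minimal sketch in plain Python. The tiny 4x6 label map and the class IDs are made up for illustration. It shows that a dense label map carries enough information to derive a detection-style bounding box, while the box alone cannot say which pixels inside it belong to the object.

```python
# Hypothetical 4x6 scene; the class IDs below are assumptions for illustration.
CLASSES = {0: "sky", 1: "road", 2: "car"}

label_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 2, 2, 0, 0],
    [1, 2, 2, 2, 2, 1],
    [1, 1, 1, 1, 1, 1],
]

def mask_to_bbox(label_map, class_id):
    """Return (top, left, bottom, right) covering all pixels of class_id, or None."""
    coords = [(r, c) for r, row in enumerate(label_map)
              for c, v in enumerate(row) if v == class_id]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

def pixel_counts(label_map):
    """Count how many pixels were assigned to each class name."""
    counts = {}
    for row in label_map:
        for v in row:
            counts[CLASSES[v]] = counts.get(CLASSES[v], 0) + 1
    return counts

car_box = mask_to_bbox(label_map, 2)  # detection-style summary of the car
counts = pixel_counts(label_map)      # every pixel belongs to exactly one class
print(car_box, counts)
```

The box spans 2 rows by 4 columns (8 pixels), but only 6 of those pixels are labeled as car; that difference is the detail segmentation adds.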
How It Works Internally
Semantic segmentation is typically implemented using fully convolutional networks (FCNs) or encoder-decoder architectures such as U-Net and DeepLab. (Mask R-CNN is often mentioned alongside these, but it is an instance segmentation model.) The core mechanism involves:
Encoder: A convolutional neural network (e.g., ResNet, VGG) that downsamples the input image through a series of convolutional and pooling layers, extracting high-level semantic features while reducing spatial resolution.
Decoder: A network that upsamples the feature maps back to the original image resolution, using techniques like transposed convolutions (deconvolutions) or bilinear interpolation. The decoder combines high-level features from the encoder with low-level features (via skip connections) to produce precise segmentation boundaries.
Pixel-wise Classification: The final layer of the decoder outputs a tensor with shape (H, W, C) where C is the number of classes. A softmax activation is applied along the C dimension to produce a probability distribution over classes for each pixel. The class with the highest probability is assigned to that pixel.
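The last step in the list above, picking the highest-scoring class at each pixel, can be sketched as follows. This is an illustrative example with a made-up 2x2 image and 3 classes, not any framework's API. Because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same label as taking the argmax of the probabilities.

```python
def argmax_label_map(scores):
    """scores: nested list of shape (H, W, C). Returns an (H, W) list of class IDs."""
    return [[max(range(len(pixel)), key=lambda k: pixel[k]) for pixel in row]
            for row in scores]

# Hypothetical per-pixel class scores for a 2x2 image with C = 3 classes.
scores = [
    [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]],
    [[0.1, 0.2, 3.0], [0.4, 0.9, 0.8]],
]

label_map = argmax_label_map(scores)
print(label_map)
```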
Key Components and Parameters
Loss Function: Cross-entropy loss is commonly used, computed per pixel and averaged over the entire image. For imbalanced classes (e.g., small objects), weighted cross-entropy or Dice loss may be used.
Metrics: Mean Intersection over Union (mIoU) is the standard evaluation metric. It computes the overlap between predicted and ground truth segmentation masks for each class and averages across classes. Pixel accuracy is also used but can be misleading for imbalanced datasets.
Input Size: Images are often resized to a fixed size (e.g., 512x512 or 256x256) for batch processing. In Azure Custom Vision, the maximum image size is 6 MB and dimensions are scaled to fit within 1024x1024.
Training: Requires a dataset with pixel-level annotations, which are expensive to produce. In Azure Custom Vision, which approximates segmentation with polygon annotations rather than true pixel-level labels, plan on at least 50 images per class for training (as of 2024).
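The point about pixel accuracy being misleading is easier to see with numbers. The sketch below uses a made-up 4x5 ground truth in which only 2 of 20 pixels belong to a small object, and a prediction that labels everything as background. The function names are illustrative.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted class matches the ground truth."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)

def mean_iou(pred, truth, num_classes):
    """Average of per-class intersection over union."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pr, tr in zip(pred, truth):
            for p, t in zip(pr, tr):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

truth = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
pred = [[0] * 5 for _ in range(4)]  # model predicts background everywhere

acc = pixel_accuracy(pred, truth)
miou = mean_iou(pred, truth, num_classes=2)
print(acc, miou)
```

Pixel accuracy comes out at 0.9 even though the small object was missed entirely, while mIoU drops to 0.45 because the object class contributes an IoU of 0.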
Configuration and Verification in Azure
Azure offers two primary services for semantic segmentation:
Computer Vision Image Analysis 4.0 (preview): Provides a pre-built segmentation capability called "background removal" or "segmentation" that can isolate the foreground from the background. It uses a deep neural network trained on large datasets. The API returns a segmentation mask as a binary image or polygon coordinates.
Custom Vision: Allows you to train a custom model using your own labeled images. You upload images, draw polygons around objects of interest, and train a model. The service offers two project types, classification and object detection; segmentation-style output is available only through the "Object Detection" project type with "Polygon" as the region shape. For true pixel-level segmentation, you would use Azure Machine Learning with a custom model.
To verify a segmentation model's performance, use the Custom Vision portal's "Quick Test" or the Prediction API. The response includes bounding boxes and polygon points for detected objects. For the Computer Vision API, the response includes a segmentation property with a mask URL or base64-encoded mask.
Interaction with Related Technologies
Semantic segmentation is often combined with:
Object Detection: In autonomous driving, object detection identifies vehicles and pedestrians, while segmentation provides precise boundaries for drivable space.
Image Classification: Segmentation can be used to isolate objects before classification, improving accuracy.
Optical Character Recognition (OCR): Segmentation can locate text regions before OCR processing.
Azure Cognitive Services: The Computer Vision API's "Read" OCR can be combined with segmentation to extract text from specific regions.
Exam-Relevant Details
The AI-900 exam expects you to:
Differentiate semantic segmentation from image classification and object detection.
Identify use cases: medical imaging, autonomous vehicles, satellite imagery, background removal.
Know that Azure Custom Vision supports segmentation via polygon annotation in Object Detection projects.
Understand that semantic segmentation assigns a class to every pixel, while instance segmentation distinguishes individual objects of the same class.
Input Image Acquisition
The process begins with acquiring an image, typically in RGB format (3 channels) with dimensions like 640x480 or 1920x1080. The image may be preprocessed by resizing to a fixed input size required by the neural network (e.g., 512x512). In Azure Custom Vision, images are automatically scaled to fit within 1024x1024 while maintaining aspect ratio. The image is then normalized by subtracting mean and dividing by standard deviation per channel (e.g., mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] for models trained on ImageNet). This normalization aligns the input distribution with the training data distribution, improving convergence.
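The normalization step can be sketched in a few lines. This example assumes 8-bit RGB input that is first scaled to the range [0, 1], which is the convention these ImageNet statistics are normally used with; the function names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale 8-bit RGB values to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, mean, std))

def normalize_image(image):
    """image: (H, W) nested list of (R, G, B) tuples."""
    return [[normalize_pixel(px) for px in row] for row in image]

# Hypothetical 1x2 image: one black pixel and one white pixel.
image = [[(0, 0, 0), (255, 255, 255)]]
normalized = normalize_image(image)
print(normalized)
```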
Encoding Feature Extraction
The normalized image passes through the encoder network, a series of convolutional layers that progressively downsample the spatial dimensions while increasing the number of feature channels. For example, a ResNet-50 encoder reduces a 512x512 image to a 16x16 feature map with 2048 channels. Each convolutional layer applies filters that detect edges, textures, and higher-level patterns. Max-pooling layers reduce spatial size by selecting the maximum value in each 2x2 window. The encoder outputs a rich, compact representation of the image content. In the encoder, batch normalization and ReLU activation are applied after each convolution to stabilize training and introduce non-linearity.
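As a rough illustration of how downsampling shrinks the spatial size, the sketch below applies 2x2 max pooling with stride 2 to a small single-channel map, then traces the 512 to 16 reduction mentioned above as five halvings (an overall stride of 32). The input values are made up, and the helper is a simplification that assumes even height and width.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single-channel map (even height and width)."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(feature_map)  # 4x4 becomes 2x2

# Five successive halvings take a 512-pixel side down to 16.
size = 512
for _ in range(5):
    size //= 2
print(pooled, size)
```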
Decoding Upsampling
The decoder takes the low-resolution, high-channel feature map from the encoder and upsamples it back to the original image resolution. This is done via transposed convolutions (also called deconvolutions) that learn to interpolate features. For example, a 2x2 transposed convolution with stride 2 doubles the spatial dimensions. Skip connections from corresponding encoder layers are concatenated to the upsampled features, providing fine-grained details lost during downsampling. This is crucial for accurate boundary prediction. The decoder gradually reduces the number of channels while increasing spatial size, finally outputting a tensor of shape (H, W, C) where C is the number of classes.
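The upsampling step can be sketched for a single channel. The example below is a simplified, hand-written 2x2 transposed convolution with stride 2 (no bias, no learned weights), followed by a toy skip connection that pairs each upsampled value with an encoder value at the same location. The kernel and the skip values are assumptions chosen for illustration; with an all-ones kernel the operation reduces to nearest-neighbor upsampling.

```python
def transposed_conv_2x2_stride2(x, kernel):
    """Each input value writes a scaled copy of the 2x2 kernel into its own output block."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for c in range(w):
            for kr in range(2):
                for kc in range(2):
                    out[2 * r + kr][2 * c + kc] += x[r][c] * kernel[kr][kc]
    return out

def concat_skip(upsampled, skip):
    """Pair each upsampled value with the encoder value at the same location (2 channels)."""
    return [[(u, s) for u, s in zip(u_row, s_row)]
            for u_row, s_row in zip(upsampled, skip)]

low_res = [[1, 2], [3, 4]]
ones_kernel = [[1, 1], [1, 1]]      # hypothetical weights; real ones are learned
upsampled = transposed_conv_2x2_stride2(low_res, ones_kernel)

skip = [[9] * 4 for _ in range(4)]  # stand-in for a 4x4 encoder feature map
fused = concat_skip(upsampled, skip)
print(upsampled)
```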
Pixel-wise Classification with Softmax
The decoder's output tensor passes through a softmax activation function along the channel dimension. For each pixel location (i,j), softmax converts the raw scores (logits) for each class into probabilities that sum to 1. The formula is: softmax(x_c) = exp(x_c) / sum_{k=1}^{C} exp(x_k). The class with the highest probability is selected as the predicted label for that pixel. This produces a segmentation map of size (H, W) where each pixel contains an integer class ID. In Azure Custom Vision, the Prediction API returns this map as a set of polygons for each detected object, not as a full pixel map, because Custom Vision uses object detection with polygons rather than true semantic segmentation.
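The softmax formula above can be checked by hand for a single pixel. The sketch below uses made-up logits for three classes and subtracts the maximum logit before exponentiating, a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert raw class scores for one pixel into probabilities that sum to 1."""
    m = max(logits)                       # subtracting the max avoids overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                  # hypothetical scores for one pixel
probs = softmax(logits)
predicted = probs.index(max(probs))       # class ID with the highest probability
print([round(p, 3) for p in probs], predicted)
```

For these logits the probabilities are roughly 0.659, 0.242, and 0.099, so the pixel is assigned class 0.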
Post-processing and Output
The raw segmentation map may undergo post-processing such as conditional random fields (CRFs) to refine boundaries by enforcing spatial consistency. CRFs penalize neighboring pixels that have different labels if their colors are similar, smoothing the output. In practice, many modern segmentation networks (e.g., DeepLab) use atrous convolution and spatial pyramid pooling to handle multi-scale objects without needing CRFs. The final output is a segmentation mask that can be overlaid on the original image for visualization. For Azure Computer Vision API, the output is a binary mask (foreground/background) or a set of polygons. The API returns the mask as a base64-encoded string or a URL to a mask image. The response also includes confidence scores for the segmentation.
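Overlaying a mask on the original image for visualization amounts to blending a highlight color into the masked pixels. The following is a minimal sketch with a made-up 1x2 image; the color and blending factor are arbitrary choices, and it does not depend on any Azure response format.

```python
def overlay(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend color into the pixels of image where mask is 1; leave the rest unchanged."""
    out = []
    for img_row, mask_row in zip(image, mask):
        out_row = []
        for px, m in zip(img_row, mask_row):
            if m:
                px = tuple(round((1 - alpha) * p + alpha * c)
                           for p, c in zip(px, color))
            out_row.append(px)
        out.append(out_row)
    return out

image = [[(100, 100, 100), (100, 100, 100)]]  # two gray pixels
mask = [[1, 0]]                               # only the first pixel is foreground
blended = overlay(image, mask, color=(200, 0, 0), alpha=0.5)
print(blended)
```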
Autonomous Vehicle Perception
A leading autonomous vehicle company uses semantic segmentation to understand the driving environment. The system processes camera feeds at 30 fps, segmenting each frame into classes: road, lane markings, vehicles, pedestrians, cyclists, traffic signs, and sky. The segmentation map is fed into a path-planning algorithm that determines drivable space. The model is a DeepLabV3+ with a ResNet-101 backbone, trained on a proprietary dataset of 2 million annotated images. In production, the model runs on an NVIDIA DRIVE AGX Pegasus GPU, achieving inference in under 30 ms per frame. A common misconfiguration is using a model trained on daytime data only, which fails at night. The solution is to include diverse lighting conditions in the training set and use data augmentation (e.g., brightness adjustment).
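The brightness augmentation mentioned above can be sketched as a simple scaling of pixel values. This is an illustrative helper, not the company's pipeline; the factors are arbitrary. Note that a brightness change does not move any pixels, so the segmentation labels for the image can be reused unchanged.

```python
def adjust_brightness(image, factor):
    """Scale 8-bit pixel values by factor and clip the result to [0, 255]."""
    return [[tuple(min(255, max(0, round(v * factor))) for v in px) for px in row]
            for row in image]

image = [[(100, 200, 50)]]                 # hypothetical single-pixel image
darker = adjust_brightness(image, 0.5)     # simulates low-light conditions
brighter = adjust_brightness(image, 1.5)   # values above 255 are clipped
print(darker, brighter)
```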
Medical Image Analysis for Tumor Segmentation
A hospital's radiology department uses semantic segmentation to delineate brain tumors from MRI scans. The model is a 3D U-Net trained on the BraTS dataset. It segments each voxel into four classes: healthy tissue, necrotic core, peritumoral edema, and enhancing tumor. The segmentation output is used to calculate tumor volume and guide surgical planning. The model is deployed on Azure Kubernetes Service with GPU nodes for inference. A key performance consideration is the trade-off between resolution and inference speed: full-resolution 3D volumes (240x240x155) take 10 seconds per scan, but the hospital requires under 30 seconds. By using a sliding window approach with 128x128x128 patches and overlapping, they achieve 15-second inference with acceptable accuracy. A common pitfall is class imbalance: the tumor region is tiny compared to healthy tissue, so the model may predict all voxels as healthy. This is mitigated by using a weighted Dice loss function.
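To see why a weighted Dice loss helps with class imbalance, consider the simplified sketch below. It works on hard labels in a flat list of 100 voxels, whereas a real training loss would use predicted probabilities so that it stays differentiable. The class weights are hypothetical.

```python
def dice_score(pred, truth, class_id, eps=1e-6):
    """Dice = 2 * |overlap| / (|pred| + |truth|) for one class."""
    inter = sum(p == class_id and t == class_id for p, t in zip(pred, truth))
    total = sum(p == class_id for p in pred) + sum(t == class_id for t in truth)
    return (2 * inter + eps) / (total + eps)

def weighted_dice_loss(pred, truth, weights):
    """1 minus the weighted average of per-class Dice scores."""
    weighted = sum(w * dice_score(pred, truth, c) for c, w in weights.items())
    return 1 - weighted / sum(weights.values())

truth = [0] * 98 + [1] * 2   # 2% tumor voxels (class 1)
pred = [0] * 100             # model predicts healthy tissue everywhere

accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
loss_equal = weighted_dice_loss(pred, truth, {0: 1.0, 1: 1.0})
loss_tumor_heavy = weighted_dice_loss(pred, truth, {0: 0.1, 1: 0.9})
print(accuracy, round(loss_equal, 3), round(loss_tumor_heavy, 3))
```

The all-healthy prediction scores 98% accuracy, yet its loss rises from about 0.505 to about 0.901 once the tumor class is weighted more heavily, which pushes training away from that shortcut.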
Satellite Imagery for Land Cover Classification
An agricultural analytics company uses semantic segmentation to classify land cover from satellite images. The model segments each pixel into categories: crop type, forest, water, bare soil, urban. They use a U-Net with an EfficientNet-B3 encoder, trained on Sentinel-2 imagery (13 spectral bands). The output helps farmers monitor crop health and estimate yields. A challenge is the variability in scale: a single image covers 100 km², but objects like individual fields are small. The solution is to use atrous spatial pyramid pooling (ASPP) to capture multi-scale features. In production, the model is served via Azure Functions with GPU acceleration. A common error is misclassifying shadows as water; this is reduced by incorporating a normalized difference water index (NDWI) as an additional input channel.
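NDWI is computed per pixel from the green and near-infrared (NIR) bands as (Green - NIR) / (Green + NIR). The sketch below appends it as an extra channel. The reflectance values and the positions of the green and NIR bands within each pixel are assumptions for illustration; in Sentinel-2 data these correspond to bands B3 and B8.

```python
def ndwi(green, nir, eps=1e-9):
    """Normalized difference water index; eps guards against division by zero."""
    return (green - nir) / (green + nir + eps)

def add_ndwi_channel(pixels, green_idx, nir_idx):
    """Append NDWI as one more channel to each pixel's band values."""
    return [list(px) + [ndwi(px[green_idx], px[nir_idx])] for px in pixels]

# Hypothetical (green, nir) reflectances: open water, then vegetation.
pixels = [(0.3, 0.1), (0.1, 0.3)]
with_ndwi = add_ndwi_channel(pixels, green_idx=0, nir_idx=1)
print(with_ndwi)
```

Water tends to produce positive NDWI values because it reflects more green light than near-infrared, which gives the model an extra cue beyond how dark a pixel looks.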
What AI-900 Tests on Semantic Segmentation (Objective 3.1)
The exam focuses on understanding the difference between semantic segmentation and other computer vision tasks. You will NOT be asked to implement a segmentation model or recall specific neural network architectures. Instead, you must:
Recognize that semantic segmentation assigns a label to every pixel in an image.
Identify that segmentation is used in autonomous vehicles (delineating road, obstacles), medical imaging (organ/tumor boundaries), and satellite imagery (land cover classification).
Know that Azure Custom Vision supports segmentation through polygon annotations in Object Detection projects (an approximation, not true pixel-level segmentation).
Understand that Azure Computer Vision Image Analysis 4.0 provides a background removal capability that segments foreground from background.
Common Wrong Answers and Why Candidates Choose Them
"Semantic segmentation draws bounding boxes around objects." This confuses segmentation with object detection. Candidates see "segmentation" and think of dividing an image into regions, but they miss the pixel-level detail. Remember: bounding boxes are for detection, not segmentation.
"Semantic segmentation classifies the entire image into one category." This is image classification. Candidates confuse the term "classification" with segmentation. The key differentiator is that segmentation operates per pixel.
"Instance segmentation and semantic segmentation are the same." Instance segmentation distinguishes individual objects of the same class (e.g., car1 vs car2), while semantic segmentation does not. The exam may ask which task is appropriate when you need to count distinct objects.
Specific Numbers and Terms That Appear on the Exam
The term "pixel-level classification" is used in official Microsoft documentation.
Mean Intersection over Union (mIoU) is the standard evaluation metric.
Azure Custom Vision requires at least 50 images per class for training a segmentation model.
Computer Vision API's segmentation feature is in preview (as of 2024).
Edge Cases and Exceptions
If an image contains multiple objects of the same class that touch or overlap, semantic segmentation assigns the same label to all of their pixels, so they merge into a single region (it does not separate instances).
Segmentation models require dense annotations (every pixel labeled), which is expensive and time-consuming to create.
Azure Custom Vision's polygon annotation is not true pixel-level segmentation; it approximates object boundaries but may miss fine details.
How to Eliminate Wrong Answers
When a question asks which computer vision technique to use for a scenario:
If the scenario requires locating objects with bounding boxes, eliminate segmentation.
If the scenario requires classifying the entire image, eliminate segmentation.
If the scenario requires separating individual instances of the same class (e.g., count cars), eliminate semantic segmentation and choose instance segmentation.
If the scenario requires background removal or pixel-level labeling, choose semantic segmentation.
Semantic segmentation classifies every pixel in an image into a predefined class.
It differs from image classification (one label per image) and object detection (bounding boxes).
Common use cases: autonomous driving, medical imaging, satellite imagery, background removal.
Azure Custom Vision supports polygon-based segmentation via Object Detection projects.
Azure Computer Vision API 4.0 offers background removal segmentation (preview).
Mean Intersection over Union (mIoU) is the standard evaluation metric.
Training requires pixel-level annotated data (at least 50 images per class for Custom Vision).
Semantic segmentation does not distinguish between instances of the same class.
These come up on the exam all the time. Here's how to tell them apart.
Semantic Segmentation
Assigns the same label to all pixels of a given class (e.g., all cars labeled 'car').
Does not distinguish between individual objects of the same class.
Output is a single segmentation map with class IDs per pixel.
Used for scene understanding (e.g., drivable area, land cover).
Example models: U-Net, DeepLab.
Instance Segmentation
Assigns unique labels to each object instance (e.g., car1, car2).
Distinguishes individual objects even if same class.
Output is a map where each instance has a unique ID (often with a mask per instance).
Used for counting objects or tracking individual items.
Example model: Mask R-CNN.
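The counting difference between the two outputs can be shown with a toy example. In the sketch below, two cars are parked side by side. The maps are made up, and the region counter is a plain 4-connected flood fill rather than part of any segmentation model. Counting connected regions in the semantic map finds one blob, while the instance map keeps the two cars apart.

```python
def count_regions(mask):
    """Count 4-connected regions of non-zero pixels using an iterative flood fill."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                regions += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return regions

semantic = [[0, 1, 1, 1, 1, 0],
            [0, 1, 1, 1, 1, 0]]  # 1 = car; both cars share one label
instance = [[0, 1, 1, 2, 2, 0],
            [0, 1, 1, 2, 2, 0]]  # each car has its own instance ID

regions_from_semantic = count_regions(semantic)
cars_from_instance = len({v for row in instance for v in row if v != 0})
print(regions_from_semantic, cars_from_instance)
```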
Mistake
Semantic segmentation and image classification are the same because both classify objects.
Correct
Image classification assigns a single label to the entire image, while semantic segmentation assigns a label to every pixel. Segmentation provides spatial detail that classification lacks.
Mistake
Semantic segmentation can distinguish between two different cars of the same make and model.
Correct
Semantic segmentation assigns the same class label to all pixels of the same object type. It does not differentiate between instances. For that, you need instance segmentation (e.g., Mask R-CNN).
Mistake
Azure Custom Vision provides true pixel-level semantic segmentation out of the box.
Correct
Custom Vision supports object detection with polygon annotations, which approximate object boundaries but do not label every pixel. For true pixel-level segmentation, you need to use Azure Machine Learning with a custom model.
Mistake
Semantic segmentation models require no training data; they work out of the box.
Correct
Like all supervised models, semantic segmentation requires a labeled dataset with pixel-wise annotations. Pre-trained models exist but usually need fine-tuning on the target domain.
Mistake
Semantic segmentation produces bounding boxes as output.
Correct
The output is a segmentation mask (a pixel-wise label map), not bounding boxes. Bounding boxes are output by object detection models.
Semantic segmentation assigns the same class label to all pixels of a given category (e.g., all cars are labeled 'car'), without distinguishing individual objects. Instance segmentation, on the other hand, assigns a unique label to each object instance (e.g., car1, car2). For example, in an image with two cars, semantic segmentation labels all car pixels as 'car', while instance segmentation labels one car's pixels as 'car1' and the other as 'car2'. Use instance segmentation when you need to count or track individual objects.
Azure Custom Vision does not offer true pixel-level semantic segmentation in its standard tiers. Instead, you can create an Object Detection project and draw polygons around objects of interest. The model learns to output polygon coordinates for each detected object, which approximates segmentation. For true pixel-level segmentation, you must use Azure Machine Learning to train a custom model (e.g., U-Net) and deploy it as a web service.
The output is typically a 2D array (image) of the same height and width as the input, where each pixel contains an integer representing the predicted class ID. Alternatively, for binary segmentation (foreground/background), the output may be a binary mask. In Azure Computer Vision API, the segmentation mask can be returned as a base64-encoded image or a URL to a PNG file.
Key challenges include: (1) requiring large amounts of pixel-level annotated data, which is expensive and time-consuming to create; (2) class imbalance, where some classes (e.g., tumor) occupy very few pixels; (3) handling varying object scales and fine boundaries; (4) domain shift between training and deployment environments (e.g., day vs night driving). Solutions include data augmentation, weighted loss functions, and using pre-trained encoders.
Yes, but it depends on the model complexity and hardware. Lightweight models like ENet or MobileNet-based segmentation can run at 30+ fps on a GPU. For autonomous driving, models like DeepLabV3+ with ResNet-101 achieve ~10-20 fps on high-end GPUs. Model quantization, pruning, or TensorRT can further reduce inference time. Azure's Computer Vision API has latency on the order of seconds due to cloud processing.
Skip connections transfer feature maps from the encoder to the decoder at corresponding resolutions. This helps the decoder recover fine-grained spatial information lost during downsampling, improving boundary accuracy. Without skip connections, the decoder relies solely on upsampled low-resolution features, leading to blurry boundaries.
The Computer Vision Image Analysis 4.0 API (preview) provides a segmentation endpoint that detects the dominant foreground object and returns a binary mask separating it from the background. It uses a deep neural network trained on millions of images. The API requires an image URL or binary data and returns the mask as a base64-encoded PNG. This is a form of semantic segmentation with two classes: foreground and background.