This chapter covers object detection and bounding boxes, a core computer vision capability in Microsoft Azure. You will learn how object detection differs from image classification, the mechanism of bounding boxes, and how Azure services implement this. Approximately 10-15% of AI-900 exam questions touch on object detection, especially in the context of Custom Vision and Computer Vision services. Mastering this topic is essential for scenarios like automated inventory, autonomous driving, and medical imaging.
Imagine you are organizing a huge photo album of a crowded city street. Instead of just saying 'this is a photo of a street' (classification), you need to find every person, car, and traffic light, and draw a box around each one, labeling it. That's object detection. The bounding box is like drawing a rectangle around each object with a label. Now, imagine you have a team of assistants: each assistant is a specialized 'detector' trained to find one type of object (like cars). The assistants scan the photo by sliding a small window across the image, and at each position, they decide if the window contains a car and how well it fits. This produces many candidate boxes, often overlapping. Then a 'referee' (non-maximum suppression) picks the best box for each car, discarding duplicates. The final result is a set of bounding boxes with confidence scores, just like your album with labeled rectangles around each object. In Azure Custom Vision, you train a model by providing images and manually drawing bounding boxes. The model learns to predict these boxes on new images. The key is that object detection not only classifies but also locates, which is critical for applications like counting vehicles or detecting defects.
What is Object Detection?
Object detection is a computer vision task that identifies and locates objects within an image or video. Unlike image classification, which assigns a single label to the entire image, object detection finds multiple objects, assigns each a class label, and provides a bounding box (a rectangle) that tightly encloses each object. The AI-900 exam focuses on understanding the concept and the Azure services that provide this capability: Azure Computer Vision and Azure Custom Vision.
Why Object Detection Exists
In real-world applications, you often need to know not just what is in an image but also where it is. For example, in a retail store, you may want to count the number of products on a shelf and locate each one. In autonomous driving, detecting pedestrians and other vehicles requires both classification and localization. Object detection enables these scenarios by outputting a list of detected objects with their positions.
How Object Detection Works Internally
Modern object detection models (like YOLO, SSD, or Faster R-CNN) use deep neural networks. The process can be broken into steps:
Feature Extraction: The image is passed through a convolutional neural network (CNN) that extracts features at multiple scales.
Region Proposal (for two-stage detectors): Some models first propose candidate regions (bounding boxes) that might contain objects. For example, Faster R-CNN uses a Region Proposal Network (RPN) that slides a small window over the feature map and outputs potential boxes with objectness scores.
Classification and Regression: For each candidate box, the model classifies the object (e.g., car, pedestrian) and refines the box coordinates (regression) to better fit the object.
Non-Maximum Suppression (NMS): Since multiple boxes may overlap for the same object, NMS removes duplicates by keeping only the box with the highest confidence score and discarding others with high overlap (e.g., Intersection over Union > 0.5).
Key Components and Defaults
Bounding Box: Represented by four numbers: (x, y, width, height) or (left, top, right, bottom). Custom Vision returns coordinates normalized (0-1) relative to the image dimensions, while the Computer Vision Analyze Image API returns the rectangle in pixels.
Confidence Score: A value between 0 and 1 indicating how likely the box contains an object of the predicted class. The service returns a score for each box, and you filter results with a threshold of your choice (the Custom Vision portal's probability threshold slider defaults to 50%).
Intersection over Union (IoU): A metric measuring overlap between predicted and ground-truth boxes. IoU = area of overlap / area of union. Used in NMS and evaluation. A typical threshold for NMS is 0.5.
Non-Maximum Suppression (NMS): Algorithm to select the best bounding box among overlapping candidates. Steps: sort by confidence, pick the highest, remove all boxes with IoU > threshold (e.g., 0.5) with the picked box, repeat.
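The IoU formula and the NMS steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code any Azure service runs: the function names `iou` and `nms`, the (left, top, width, height) box layout, and the sample boxes are assumptions made for the example. Real detectors usually apply NMS separately per class.

```python
from typing import List, Tuple

# A box is (left, top, width, height); units only need to be consistent.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two (left, top, width, height) boxes."""
    a_right, a_bottom = a[0] + a[2], a[1] + a[3]
    b_right, b_bottom = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(a_right, b_right) - max(a[0], b[0]))
    inter_h = max(0.0, min(a_bottom, b_bottom) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes that survive, highest score first."""
    # Sort candidate indices by confidence, best first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)  # pick the highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the picked box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping candidates for one object, plus a separate object.
boxes = [(10, 10, 100, 100), (20, 20, 100, 100), (300, 300, 50, 50)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The first two boxes share an overlap of 90 x 90 = 8100 against a union of 11900, so their IoU is about 0.68. That exceeds the 0.5 threshold, and the lower-scoring box is discarded.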
Azure Services for Object Detection
Azure Computer Vision (prebuilt): The Analyze Image API can detect common objects (people, vehicles, animals, etc.) from a set of predefined categories. It returns bounding boxes and confidence scores. You cannot train custom objects.
Azure Custom Vision (custom): You can train your own object detection model by uploading images and drawing bounding boxes for your custom objects. The service supports two project types: Classification (one or more labels for the whole image) and Object Detection (multiple labels with boxes).
How to Use Object Detection in Azure
Using Computer Vision API:
Call the POST https://{endpoint}/vision/v3.2/analyze endpoint with visualFeatures=Objects.
The response includes an objects array. Each entry has a rectangle (x, y, w, h, in pixels), an object name with a confidence score, and, where applicable, a parent class.
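As a rough sketch of the two steps above, the following Python builds the request with the standard library and flattens a response. The helper names `build_analyze_request` and `parse_objects`, the example endpoint, and every value in `sample_response` are hypothetical; the sample only imitates the shape of the `objects` result and is not real service output. The request is built but not sent.

```python
import json
import urllib.parse
import urllib.request


def build_analyze_request(endpoint: str, key: str, image_url: str) -> urllib.request.Request:
    """Build (but do not send) an Analyze Image request asking for objects."""
    query = urllib.parse.urlencode({"visualFeatures": "Objects"})
    url = f"{endpoint.rstrip('/')}/vision/v3.2/analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # resource key from the Azure portal
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_objects(response: dict) -> list:
    """Flatten the objects array into (name, confidence, (x, y, w, h)) tuples."""
    results = []
    for obj in response.get("objects", []):
        r = obj["rectangle"]
        results.append((obj["object"], obj["confidence"], (r["x"], r["y"], r["w"], r["h"])))
    return results


# Illustrative response with made-up values.
sample_response = {
    "objects": [
        {"rectangle": {"x": 50, "y": 60, "w": 200, "h": 150},
         "object": "dog", "confidence": 0.9,
         "parent": {"object": "mammal", "confidence": 0.92}},
        {"rectangle": {"x": 300, "y": 40, "w": 80, "h": 220},
         "object": "person", "confidence": 0.85},
    ]
}

request = build_analyze_request(
    "https://example.cognitiveservices.azure.com",  # hypothetical endpoint
    "YOUR_KEY",
    "https://example.com/street.jpg",
)
detected = parse_objects(sample_response)
print(request.full_url)
print(detected)
```

To actually call the service, pass the request to `urllib.request.urlopen` and decode the JSON body before handing it to `parse_objects`.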
Using Custom Vision:
Create a Custom Vision project with the project type "Object Detection" and a domain such as "General" or "Logo".
Upload images and use the tagging interface to draw bounding boxes around each object.
Train the model. The service outputs a model that can be deployed as a prediction endpoint.
Prediction API returns predictions array with boundingBox (left, top, width, height) and probability.
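Once a model is published, calling it is a single HTTP POST. The sketch below only builds the request. The URL pattern and the `Prediction-Key` header follow the v3.0 Prediction REST API as commonly documented, but treat the prediction URL and key shown in the Custom Vision portal as authoritative. The helper name `build_detect_request` and all argument values are hypothetical.

```python
import json
import urllib.request


def build_detect_request(endpoint: str, prediction_key: str, project_id: str,
                         published_name: str, image_url: str) -> urllib.request.Request:
    """Build (but do not send) a detect-by-URL request for a published iteration."""
    # Assumed URL pattern; copy the exact URL from the portal when in doubt.
    url = (
        f"{endpoint.rstrip('/')}/customvision/v3.0/Prediction/"
        f"{project_id}/detect/iterations/{published_name}/url"
    )
    body = json.dumps({"Url": image_url}).encode("utf-8")
    headers = {
        "Prediction-Key": prediction_key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


detect_request = build_detect_request(
    "https://example.cognitiveservices.azure.com/",  # hypothetical endpoint
    "YOUR_PREDICTION_KEY",
    "00000000-0000-0000-0000-000000000000",          # placeholder project ID
    "shelf-detector-v1",                             # placeholder published name
    "https://example.com/shelf.jpg",
)
print(detect_request.full_url)
```

Sending the request with `urllib.request.urlopen(detect_request)` would return the JSON document containing the predictions array described above.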
Interaction with Related Technologies
Object detection often works alongside other Azure services:
Azure Video Indexer: Uses object detection to identify objects in videos.
Azure Cognitive Search: Can index objects detected in images for search.
Azure Logic Apps: Automate workflows when objects are detected (e.g., send an alert if a specific object appears).
Performance Considerations
Model Accuracy: Depends on training data quality, the number of images, and their diversity. Use at least 15 tagged images per class; more images generally improve accuracy.
Inference Speed: Custom Vision's prediction endpoint is optimized for low latency. For high throughput, consider scaling.
Bounding Box Precision: The model outputs coordinates as floats between 0 and 1. Multiply by image width/height to get pixel coordinates.
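The conversion from normalized values to pixels is plain multiplication. Here is a small sketch, assuming the box arrives as a dictionary with `left`, `top`, `width`, and `height` keys; the helper name `to_pixels` and the 640 x 480 image size are illustrative.

```python
def to_pixels(box: dict, image_width: int, image_height: int) -> tuple:
    """Convert a normalized (0-1) box to integer pixel values."""
    left = round(box["left"] * image_width)     # horizontal values scale by width
    top = round(box["top"] * image_height)      # vertical values scale by height
    width = round(box["width"] * image_width)
    height = round(box["height"] * image_height)
    return left, top, width, height


# A box covering the central half of a 640 x 480 image.
centered = {"left": 0.25, "top": 0.25, "width": 0.5, "height": 0.5}
pixel_box = to_pixels(centered, 640, 480)
print(pixel_box)  # (160, 120, 320, 240)
```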
Exam-Relevant Details
The AI-900 exam expects you to know that object detection provides bounding boxes, while image classification does not.
You should understand that Custom Vision allows training custom object detectors, while Computer Vision uses prebuilt models.
Know that bounding boxes are rectangular and defined by coordinates.
Understand the concept of confidence score and that you can set a threshold to filter low-confidence predictions.
Prepare Training Images
Collect images that contain the objects you want to detect. For each image, you will need to draw bounding boxes around each object instance. In Azure Custom Vision, you upload images and then use the tagging interface to draw rectangles. The service requires a minimum of 15 images per class for object detection, but more is better for accuracy. Images should be diverse in terms of scale, orientation, lighting, and background to avoid overfitting.
Create Custom Vision Project
In the Azure portal, create a Custom Vision resource. Then go to the Custom Vision portal (customvision.ai) and create a new project. Choose 'Object Detection' as the project type. Select a domain: 'General' for common objects, 'Logo' for logos, or a compact domain such as 'General (compact)' when you need to export the model to edge devices. The domain affects the model architecture and inference speed.
Upload and Tag Images
Upload your training images to the project. For each image, use the tagging interface to draw a bounding box around each object of interest and assign it a class label (tag). An image can contain multiple tagged boxes. Ensure boxes tightly enclose the objects. The service uses these boxes to learn both location and class.
Train the Model
Click the 'Train' button in the Custom Vision portal. Choose 'Quick Training' for a faster iteration or 'Advanced Training' for more accuracy. The training process uses your tagged images to build a deep learning model. Quick training typically takes a few minutes. After training, the portal reports precision, recall, and mean average precision (mAP) for the iteration.
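Precision and recall are simple ratios, and a small worked example shows how they move when the probability threshold changes. The sketch assumes each prediction has already been marked correct or incorrect (in practice this is decided by matching predicted boxes to ground-truth boxes with IoU). The function name `precision_recall` and all numbers are made up for illustration. mAP is not computed here.

```python
def precision_recall(predictions, num_ground_truth, threshold):
    """predictions: list of (probability, is_correct) pairs.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    """
    kept = [is_correct for prob, is_correct in predictions if prob >= threshold]
    true_positives = sum(kept)
    # Convention chosen here: precision is 0.0 when nothing passes the threshold.
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Five predictions on images that contain four real objects in total.
predictions = [(0.95, True), (0.80, True), (0.65, False), (0.55, True), (0.30, False)]

at_half = precision_recall(predictions, num_ground_truth=4, threshold=0.5)
at_seventy = precision_recall(predictions, num_ground_truth=4, threshold=0.7)
print(at_half)     # (0.75, 0.75)
print(at_seventy)  # (1.0, 0.5)
```

Raising the threshold from 0.5 to 0.7 removes the one false positive, so precision rises to 1.0, but it also drops a correct detection, so recall falls to 0.5.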
Evaluate and Publish
After training, test the model using the 'Quick Test' button or by submitting new images via the prediction API. You can iterate by adding more images or correcting bounding boxes. Once satisfied, publish the model to an endpoint. In the Custom Vision portal, go to 'Performance' and click 'Publish'. You will get a prediction URL and key.
Call Prediction API
Use the published endpoint to send new images. The API returns a JSON with predictions, each containing `tagName`, `probability` (confidence score), and `boundingBox` with `left`, `top`, `width`, `height` as normalized coordinates. You can set a probability threshold (e.g., 0.5) to filter out low-confidence predictions. Then you can draw the bounding boxes on the image for visualization.
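The filtering step can be sketched as follows. The JSON below is an illustrative sample with made-up values that imitates the fields named above (`tagName`, `probability`, `boundingBox`); it is not real service output, and the helper name `filter_predictions` is hypothetical.

```python
import json

# Illustrative prediction response (made-up values).
sample_json = """
{
  "predictions": [
    {"tagName": "car", "probability": 0.92,
     "boundingBox": {"left": 0.10, "top": 0.20, "width": 0.30, "height": 0.25}},
    {"tagName": "person", "probability": 0.35,
     "boundingBox": {"left": 0.55, "top": 0.15, "width": 0.10, "height": 0.40}},
    {"tagName": "car", "probability": 0.61,
     "boundingBox": {"left": 0.60, "top": 0.50, "width": 0.25, "height": 0.20}}
  ]
}
"""


def filter_predictions(response: dict, threshold: float = 0.5) -> list:
    """Keep predictions at or above the threshold, most confident first."""
    kept = [p for p in response["predictions"] if p["probability"] >= threshold]
    return sorted(kept, key=lambda p: p["probability"], reverse=True)


confident = filter_predictions(json.loads(sample_json), threshold=0.5)
for p in confident:
    box = p["boundingBox"]
    print(p["tagName"], p["probability"], box["left"], box["top"], box["width"], box["height"])
```

With a threshold of 0.5, the low-confidence "person" prediction is dropped and the two "car" predictions remain.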
Retail Inventory Management
A large retailer uses object detection to automatically count and locate products on shelves. They deploy a camera system that captures images of shelves multiple times per day. Each image is sent to Azure Custom Vision, which has been trained to detect dozens of product types. The bounding boxes indicate where each product is, and the count per product is used to trigger restocking alerts. The system processes thousands of images per hour. A common issue is when products are partially occluded or stacked, causing missed detections. To mitigate, the training set includes images with occlusions and varied lighting. The confidence threshold is set to 0.6 to reduce false positives, but this may lower recall. Regular retraining with new product images is necessary.
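The counting logic in this scenario might look like the sketch below. It assumes predictions shaped like the Custom Vision output (`tagName`, `probability`). The product names, the minimum stock levels, and the helper names `count_products` and `restock_alerts` are all hypothetical.

```python
from collections import Counter


def count_products(predictions, threshold=0.6):
    """Count detections per tag, ignoring those below the confidence threshold."""
    return Counter(p["tagName"] for p in predictions if p["probability"] >= threshold)


def restock_alerts(counts, minimum_stock):
    """Return the tags whose detected count is below the required minimum."""
    return sorted(tag for tag, minimum in minimum_stock.items() if counts.get(tag, 0) < minimum)


# Made-up detections from one shelf image.
shelf_predictions = [
    {"tagName": "cereal", "probability": 0.90},
    {"tagName": "cereal", "probability": 0.70},
    {"tagName": "cereal", "probability": 0.55},  # below 0.6, ignored
    {"tagName": "soup", "probability": 0.80},
]
minimum_stock = {"cereal": 3, "soup": 1, "pasta": 2}  # hypothetical shelf minimums

counts = count_products(shelf_predictions)
alerts = restock_alerts(counts, minimum_stock)
print(dict(counts))  # {'cereal': 2, 'soup': 1}
print(alerts)        # ['cereal', 'pasta']
```

The third cereal box is really on the shelf but scored 0.55, so the 0.6 threshold hides it and triggers a restock alert. This is the recall cost the scenario describes.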
Autonomous Vehicle Perception
An autonomous driving startup uses Azure Computer Vision's prebuilt object detection to detect pedestrians, vehicles, and traffic signs in real-time video streams. The bounding boxes are used to calculate distances and trajectories. The prebuilt model covers common objects but needs custom detection for unusual traffic signs. They use Custom Vision to augment the model. Performance is critical: the detection must run at 30 fps. They use compact domain models and deploy to Azure Stack Edge for edge computing. Misconfigurations like incorrect confidence thresholds can lead to false positives (e.g., detecting a shadow as a pedestrian) causing unnecessary braking, or false negatives causing collisions.
Medical Imaging Analysis
A hospital uses object detection to locate tumors in CT scans. Radiologists manually draw bounding boxes on training images. The model then assists by highlighting potential tumors in new scans. The bounding boxes must be precise to guide biopsy. The model outputs coordinates and confidence. A confidence threshold of 0.7 is used to avoid false positives. However, the model may miss small tumors (low recall). The team continuously adds new cases to improve accuracy. They also use Azure Machine Learning to fine-tune the model. A common pitfall is using the same training domain for different types of scans (e.g., chest vs. brain), which degrades performance.
What AI-900 Tests on Object Detection
The AI-900 exam objectives (Domain 3: Computer Vision, Objective 3.1) expect you to:
Understand that object detection identifies and locates objects using bounding boxes.
Differentiate between image classification (labels the whole image) and object detection (labels and locates multiple objects).
Know that Azure Custom Vision can train custom object detection models, while Azure Computer Vision provides prebuilt object detection.
Recognize that bounding boxes are rectangular and defined by coordinates (left, top, width, height).
Understand that confidence scores indicate the likelihood of a correct detection.
Common Wrong Answers and Why Candidates Choose Them
Confusing object detection with semantic segmentation: Candidates often think object detection provides pixel-level masks. The exam tests that object detection outputs bounding boxes, not segmentation masks. The wrong answer might say "object detection outputs a mask for each object." Semantic segmentation does that, not object detection.
Believing object detection can classify the entire image: Some candidates think object detection is the same as image classification. The exam will have a scenario where you need to locate objects, and the wrong answer suggests using image classification. Remember: classification gives one label per image; detection gives labels and locations.
Thinking Custom Vision only does classification: A common trap is that Custom Vision only does image classification. The exam expects you to know that Custom Vision supports both classification and object detection projects.
Misunderstanding bounding box coordinates: Candidates may think coordinates are always absolute pixels or that they are measured from the center. The exam may ask about normalized coordinates. Remember: Custom Vision returns coordinates normalized (0-1) relative to image dimensions, starting from the top-left corner of the box, while the Computer Vision Analyze Image API returns pixel values.
Specific Numbers and Terms on the Exam
Confidence score: A value between 0 and 1. A threshold of 0.5 is a common default (the Custom Vision portal's probability threshold slider starts at 50%).
Bounding box format: left, top, width, height (normalized).
Minimum images per class: At least 15 for object detection in Custom Vision.
IoU: Not directly tested but concept may appear.
Edge Cases and Exceptions
The exam may ask about handling multiple objects of the same class: object detection can detect multiple instances, each with its own bounding box.
When using the Computer Vision API, the visualFeatures parameter must include Objects to get object detection.
Custom Vision object detection models can be exported (e.g., to TensorFlow or ONNX) only if they were trained with a compact domain; models trained with the standard 'General' domain cannot be exported.
How to Eliminate Wrong Answers
If the question asks for location information, eliminate any answer that only provides class labels.
If the question mentions custom objects, the answer should involve Custom Vision, not Computer Vision.
If the answer mentions 'pixel-level segmentation', it is likely wrong unless the question is about semantic segmentation.
For bounding box questions, remember that boxes are rectangular and that Custom Vision returns normalized coordinates.
Object detection outputs bounding boxes with class labels and confidence scores.
Azure Custom Vision allows training custom object detection models; Azure Computer Vision provides prebuilt detection.
Custom Vision bounding box coordinates are normalized (0-1) and represent left, top, width, height.
Use at least 15 images per class when training object detection in Custom Vision.
The confidence score threshold can be adjusted to filter predictions (0.5 is a common default).
Non-maximum suppression removes duplicate boxes based on IoU threshold.
Object detection is different from image classification and semantic segmentation.
These come up on the exam all the time. Here's how to tell them apart.
Image Classification
Outputs a single label (class) for the entire image.
No location information provided.
Used for tasks like 'Is this a cat?'.
Azure Custom Vision supports classification projects.
Training requires images tagged as a whole, with no bounding boxes (one tag per image in multiclass projects, several in multilabel projects).
Object Detection
Outputs multiple labels with bounding boxes for each object.
Provides location (bounding box coordinates).
Used for tasks like 'Where are the cats?'.
Azure Custom Vision supports object detection projects.
Training requires images with bounding boxes drawn around each object.
Mistake
Object detection and image classification are the same thing.
Correct
Image classification assigns a single label to the entire image. Object detection identifies multiple objects and their locations using bounding boxes. They are different tasks.
Mistake
Bounding boxes are always squares.
Correct
Bounding boxes are rectangles; they can have different widths and heights to tightly enclose objects.
Mistake
Azure Computer Vision can detect any custom object.
Correct
Azure Computer Vision uses prebuilt models for common objects. For custom objects, you must use Azure Custom Vision and train your own model.
Mistake
Confidence score is the probability that the bounding box is correct.
Correct
Confidence score is the model's certainty that the detected object is of the predicted class. It is not a probability in the strict sense but a score between 0 and 1.
Mistake
You only need to draw a bounding box around one example of each object in a training image.
Correct
For object detection training, you must draw bounding boxes for all instances of the objects you want to detect. Omitting objects can confuse the model.
Image classification assigns a single label to the entire image (e.g., 'cat'). Object detection identifies multiple objects and their locations using bounding boxes (e.g., 'cat' at coordinates x,y,w,h). In Azure, you use classification when you only need to know what is in the image, and object detection when you also need to know where objects are.
Use Azure Custom Vision. Create a project with type 'Object Detection'. Upload images and draw bounding boxes around each object of interest. Assign class labels. Train the model. Then publish the model to get a prediction endpoint. You can then send new images to the endpoint and receive bounding box predictions.
In Custom Vision, bounding box coordinates are normalized (0 to 1) values representing left, top, width, and height relative to the image dimensions. For example, a box with left=0.25, top=0.25, width=0.5, height=0.5 covers the center of the image. To get pixel coordinates, multiply the horizontal values by the image width and the vertical values by the image height.
No, Azure Computer Vision's Analyze Image API detects only a fixed set of common objects (people, vehicles, animals, etc.). For custom objects, you must use Azure Custom Vision to train your own model.
A confidence score is a value between 0 and 1 that indicates how likely the predicted bounding box contains an object of the specified class. You can set a threshold (e.g., 0.5) to only consider predictions above that threshold. Higher thresholds reduce false positives but may miss some objects.
Azure Custom Vision recommends at least 15 images per class for object detection. More images improve accuracy. Images should be diverse in scale, orientation, lighting, and background to avoid overfitting.
NMS is a post-processing step that removes duplicate bounding boxes for the same object. It selects the box with the highest confidence score and discards any other box that has a high overlap (IoU > threshold, typically 0.5) with it. This ensures each object is detected only once.