This chapter covers Spatial Analysis and Video Insights within Azure Cognitive Services, focusing on how to extract actionable intelligence from video feeds using AI. For the AI-900 exam, this topic appears in approximately 10-15% of Computer Vision questions, particularly around the capabilities of Azure Video Indexer and the Spatial Analysis container. You will learn the mechanisms behind person tracking, zone occupancy, and line crossing, along with deployment configurations and common pitfalls. Mastering this content ensures you can distinguish between video analysis services and understand when to use each.
Imagine a large corporate security control room with dozens of monitors displaying live feeds from hundreds of cameras. A team of security guards watches these feeds, but they can't look at every camera at once. Instead, they set up rules: 'If a person enters a restricted area, alert me immediately.' The guards don't just watch; they analyze movement, count people, and track objects across cameras. In Azure, Video Indexer and Spatial Analysis act like this control room. Video Indexer is the guard who extracts metadata—like speech, faces, and text—from the video stream. Spatial Analysis is the guard who watches for specific spatial events: people moving within zones, crossing lines, or maintaining distance. The 'rules' are configured via Azure Cognitive Services APIs. The 'cameras' are video feeds from Azure Video Analyzer or direct RTSP streams. The 'alerts' are events sent to Azure Event Grid or Logic Apps. Just as a guard might miss something if too many rules are active, Spatial Analysis has performance limits—maximum 30 frames per second (fps) per camera, and each operation (e.g., person crossing line) consumes compute. The system is like a control room that scales: you can add more cameras (video feeds) and more guards (compute instances) as needed, but each guard has a cognitive load limit. If you exceed it, you drop frames or miss events—exactly like a tired guard missing a breach.
What Are Spatial Analysis and Video Insights?
Spatial Analysis and Video Insights are two distinct but complementary Azure services under the Computer Vision umbrella. Video Insights is primarily delivered via Azure Video Indexer (formerly Video Analyzer for Media), which extracts rich metadata from video files—including transcripts, faces, emotions, objects, and keyframes. Spatial Analysis, on the other hand, is a real-time video analysis capability that uses Azure Cognitive Services Computer Vision to detect and track people in a video stream, and trigger events based on their spatial relationships (e.g., entering a zone, crossing a line, maintaining distance).
For the AI-900 exam, you must understand that:
- Azure Video Indexer is for processing recorded video or live streams to generate insights like OCR, speech-to-text, sentiment, and object detection. It is a SaaS solution.
- Spatial Analysis is a containerized service (deployed on the edge or cloud) that analyzes live video feeds to count people, track movement, and detect specific spatial events. It is designed for real-time scenarios like retail analytics, workplace safety, and access control.
How Spatial Analysis Works Internally
Spatial Analysis is deployed as a Docker container that ingests video frames from an RTSP (Real-Time Streaming Protocol) camera or a video file. The container runs on an Azure IoT Edge device or a virtual machine with a GPU. The processing pipeline follows these steps:
Frame Ingestion: The container captures frames at a configurable rate, up to 30 fps. The default is 15 fps. Each frame is resized to a standard resolution (e.g., 1920x1080) before processing.
Person Detection: A deep learning model (based on YOLOv3) detects bounding boxes for people in each frame. The model is pre-trained and optimized for edge deployment. It outputs confidence scores (default threshold 0.5) and bounding box coordinates.
Person Tracking: The container assigns a unique ID to each detected person and tracks them across frames using a Kalman filter. The tracking algorithm predicts the person's next position based on velocity and motion model. If a person is not detected for 3 consecutive frames, the ID is released.
Spatial Event Detection: Based on configured operations (e.g., personCrossingLine, personZoneEnterExit, personCount), the container evaluates spatial relationships. For personCrossingLine, it checks if the person's bounding box center crosses a defined polyline. For personZoneEnterExit, it checks if the person's bounding box intersects a polygon zone.
Event Emission: When a spatial event occurs, the container sends a JSON message to an Azure IoT Hub or a configured endpoint (e.g., Event Grid, HTTP). The message includes the event type, timestamp, person ID, and relevant coordinates.
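The ID lifecycle in the tracking step can be sketched as follows. This is a minimal illustration, not the container's implementation: it swaps the Kalman filter for plain nearest-neighbour matching, and the class names and the `max_match_distance` gate are hypothetical. Only the rule that an ID is released after 3 consecutive missed frames comes from the text.

```python
from dataclasses import dataclass
from itertools import count
from math import hypot

MAX_MISSED_FRAMES = 3  # from the text: ID released after 3 consecutive misses


@dataclass
class Track:
    track_id: int
    x: float
    y: float
    missed: int = 0


class SimpleTracker:
    """Nearest-neighbour tracker; a simplified stand-in for a Kalman-based one."""

    def __init__(self, max_match_distance=0.1):
        # Hypothetical matching gate, in normalized image coordinates.
        self.max_match_distance = max_match_distance
        self.tracks = {}
        self._ids = count(1)

    def update(self, detections):
        """Feed one frame of (x, y) person centers; returns the live tracks."""
        unmatched = list(detections)
        for track in list(self.tracks.values()):
            best = min(
                unmatched,
                key=lambda d: hypot(d[0] - track.x, d[1] - track.y),
                default=None,
            )
            if best is not None and hypot(best[0] - track.x, best[1] - track.y) <= self.max_match_distance:
                track.x, track.y = best
                track.missed = 0
                unmatched.remove(best)
            else:
                track.missed += 1
                if track.missed >= MAX_MISSED_FRAMES:
                    del self.tracks[track.track_id]  # release the ID
        # Any detection that matched no existing track starts a new one.
        for x, y in unmatched:
            track_id = next(self._ids)
            self.tracks[track_id] = Track(track_id, x, y)
        return self.tracks
```

Feeding three consecutive empty frames to a tracker that holds one person removes that person's track, and a later detection receives a fresh ID.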
Key Components and Defaults
- Operations: Spatial Analysis supports several operations. Each operation has specific parameters:
- personCrossingLine: Detects when a person crosses a line. Parameters: Line (list of points defining the line), CrossingDirection (e.g., leftToRight, rightToLeft, both). Default: both.
- personZoneEnterExit: Detects when a person enters or exits a zone. Parameters: Zone (list of points defining polygon), EventType (enter, exit, both). Default: both.
- personCount: Counts the number of people in a zone. No event emission; returns count at intervals.
- personDistance: Detects if two people are within a specified distance. Parameters: MinimumDistanceThreshold in pixels (default 100).
- Configuration via JSON: All operations are configured in a JSON file passed to the container at startup. Example:
{
  "operations": [
    {
      "operation": "personCrossingLine",
      "parameters": {
        "Line": [[0.1, 0.5], [0.9, 0.5]],
        "CrossingDirection": "both"
      }
    }
  ],
  "cameraConfiguration": {
    "videoUrl": "rtsp://camera:554/stream",
    "frameRate": 15
  }
}
- Hardware Requirements: Spatial Analysis requires a GPU for real-time processing. Minimum: NVIDIA Tesla T4 or equivalent. Without GPU, frame rate drops below 1 fps.
- Pricing: Spatial Analysis is billed per operation per hour of video processed. As of 2025, $0.50 per operation per hour. Video Indexer is billed per minute of video indexed.
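A configuration like the example above can be generated and sanity-checked before it is handed to the container. The sketch below is illustrative only: the helper name `build_line_crossing_config` is hypothetical, and the field names simply mirror the example JSON in this chapter rather than a verified schema. The checks encode two values from the text: line points normalized to the [0, 1] range and a maximum frame rate of 30 fps.

```python
import json

MAX_FRAME_RATE = 30  # maximum stated in the text
VALID_DIRECTIONS = {"leftToRight", "rightToLeft", "both"}


def build_line_crossing_config(video_url, line, frame_rate=15, direction="both"):
    """Return a config dict shaped like the chapter's example (hypothetical helper)."""
    if not 0 < frame_rate <= MAX_FRAME_RATE:
        raise ValueError(f"frameRate must be in 1..{MAX_FRAME_RATE}, got {frame_rate}")
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown CrossingDirection: {direction}")
    if len(line) < 2:
        raise ValueError("a line needs at least two points")
    for x, y in line:
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            raise ValueError(f"point ({x}, {y}) is not normalized to [0, 1]")
    return {
        "operations": [
            {
                "operation": "personCrossingLine",
                "parameters": {
                    "Line": [list(point) for point in line],
                    "CrossingDirection": direction,
                },
            }
        ],
        "cameraConfiguration": {"videoUrl": video_url, "frameRate": frame_rate},
    }


if __name__ == "__main__":
    config = build_line_crossing_config(
        "rtsp://camera:554/stream", [[0.1, 0.5], [0.9, 0.5]]
    )
    print(json.dumps(config, indent=2))
```

Validating up front catches out-of-range coordinates or frame rates before the container is restarted with a bad file.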
Azure Video Indexer Deep Dive
Azure Video Indexer is a cloud service that analyzes video and audio to extract insights. It supports both uploaded files and live streams via the API, although live indexing runs with a delay of minutes rather than in real time. Key capabilities tested on AI-900:
Face Detection and Identification: Detects faces in video and can identify celebrities from a built-in database. To recognize custom faces, you add them to a custom face (Person) model in Video Indexer.
Speech-to-Text: Transcribes spoken words into text with speaker diarization (who said what). Supports multiple languages.
Optical Character Recognition (OCR): Extracts text from video frames (e.g., signs, captions).
Object Detection: Detects common objects (e.g., car, dog, person).
Sentiment Analysis: Classifies sentiment (positive, negative, neutral) from the transcribed speech and on-screen text.
Keyframe Extraction: Identifies significant frames (e.g., scene changes) and extracts them as thumbnails.
Labels: Tags video with labels like 'outdoor', 'indoor', 'meeting'.
Video Indexer uses pre-trained models and does not require custom training for general insights. You can customize by uploading a custom language model or face model.
Interaction with Other Azure Services
Spatial Analysis and Video Indexer often integrate with:
- Azure IoT Hub: For receiving events from Spatial Analysis containers.
- Azure Event Grid: For routing events to functions or logic apps.
- Azure Logic Apps: For triggering workflows (e.g., send email when person enters restricted zone).
- Azure Blob Storage: For storing video files and insights.
- Azure Media Services: For encoding and streaming video.
Verification and Monitoring
To verify Spatial Analysis is running:
Check container logs: docker logs spatial-analysis-container
Monitor via Azure Monitor: Metrics like EventsEmitted, DroppedFrames, AverageLatency.
Use the Spatial Analysis REST API to query current state: GET /state returns connected cameras and operations.
For Video Indexer, you can use the Azure portal or REST API to view indexed insights. The portal provides a timeline with detected entities.
Edge Cases and Limits
Crowded Scenes: If more than 50 people are in frame, detection accuracy drops. The model may fail to track individuals.
Occlusion: If a person is partially hidden, the tracking may lose the ID. The Kalman filter can predict for up to 3 frames.
Lighting Conditions: Low light (< 10 lux) reduces detection confidence. The model expects typical indoor/outdoor lighting.
Camera Angle: Best results with overhead cameras (top-down view). Side-angle cameras may cause occlusions.
Video Indexer Limitations: Free tier allows 10 hours of indexing per month. Paid tier starts at $0.13 per minute.
Exam-Relevant Configuration Values
Frame rate: Default 15 fps, max 30 fps.
Confidence threshold: Default 0.5 for person detection.
Tracking loss threshold: 3 consecutive frames without detection.
Maximum people tracked: Up to 50 per camera.
Supported protocols: RTSP (Real-Time Streaming Protocol) for live feeds; MP4, MOV for files.
Summary of Differences
| Feature | Spatial Analysis | Video Indexer |
|---------|------------------|---------------|
| Deployment | Container (edge/cloud) | SaaS (cloud) |
| Use Case | Real-time spatial events | Post-processing insights |
| Output | Event messages | JSON metadata, timeline |
| Customization | Parameters only | Custom models (face, language) |
| Hardware | GPU required | No GPU needed |
Both services are part of the Computer Vision category in Azure Cognitive Services. The exam expects you to know which service to use for a given scenario: real-time vs. batch, edge vs. cloud, spatial vs. general insights.
Deploy Spatial Analysis Container
First, you need to deploy the Spatial Analysis container on an Azure IoT Edge device or a VM with GPU. You must have an Azure Cognitive Services resource (Computer Vision) created in the portal. Then, pull the container image from Microsoft Container Registry: `mcr.microsoft.com/azure-cognitive-services/vision/spatial-analysis`. Configure the container with a JSON file specifying the operations, camera URL, and IoT Hub connection string. The container connects to IoT Hub to receive configuration updates and send events. Ensure the device has NVIDIA GPU drivers and Docker installed. The container exposes ports 5000 (HTTP) and 5001 (HTTPS) for local API access. Deployment typically takes 5-10 minutes.
Configure Camera and Operations
In the configuration JSON, define the video source (RTSP URL) and operations. Each operation has a unique name and parameters. For example, to detect people crossing a line, use `personCrossingLine` with a list of line points normalized to [0,1] range. The camera must support RTSP; common ports are 554 or 8554. You can set `frameRate` to balance accuracy and performance – lower frame rates reduce GPU load but may miss fast movements. The container supports up to 8 simultaneous operations per camera. After configuration, restart the container to apply changes. Verify the container logs show 'Connected to camera' and 'Operation started'.
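The log check described above can be scripted. The following is a rough sketch: the two marker strings are the ones quoted in this section, the container name `spatial-analysis-container` is taken from the verification section of this chapter, and the helper names are hypothetical. It assumes the `docker` CLI is on the PATH of the machine running the script.

```python
import subprocess

# Log lines this chapter says to look for after a restart.
EXPECTED_MARKERS = ("Connected to camera", "Operation started")


def missing_markers(log_text, markers=EXPECTED_MARKERS):
    """Return the expected markers that do not appear in the log text."""
    return [marker for marker in markers if marker not in log_text]


def read_container_logs(container_name):
    """Collect a container's logs via the docker CLI (stdout and stderr)."""
    result = subprocess.run(
        ["docker", "logs", container_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    logs = read_container_logs("spatial-analysis-container")
    missing = missing_markers(logs)
    print("OK" if not missing else f"Missing log markers: {missing}")
```

Keeping the string check separate from the `docker` call makes it easy to test against saved log files.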
Process Video Frames
The container ingests frames at the configured rate. Each frame is decoded and resized. The person detection model runs on the GPU, outputting bounding boxes and confidence scores. The tracking algorithm assigns IDs and maintains a state for each person (position, velocity). The container uses a sliding window of 3 frames to smooth detections. If a person is not detected for 3 frames, the ID is released and a 'track lost' event may be emitted if configured. The container also computes optical flow to assist tracking. Processing latency is typically 30-50 ms per frame on a T4 GPU.
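The latency figure above implies a ceiling on the frame rate a single pipeline can sustain. The arithmetic below assumes frames are processed strictly one after another with no batching or pipelining; that is an assumption for illustration, since the text does not describe how the container schedules work. The function names are hypothetical.

```python
def max_sustainable_fps(latency_ms):
    """Frames per second a strictly sequential pipeline can process."""
    return 1000.0 / latency_ms


def keeps_up(frame_rate, latency_ms):
    """True if the configured frame rate fits within the per-frame latency."""
    return frame_rate <= max_sustainable_fps(latency_ms)


if __name__ == "__main__":
    for latency in (30, 50):
        print(latency, "ms per frame ->", round(max_sustainable_fps(latency), 1), "fps")
```

Under that assumption, the 15 fps default fits even at 50 ms per frame, while 30 fps only fits at the fast end of the range. This is one way to read the advice to lower `frameRate` when frames are being dropped.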
Detect Spatial Events
For each frame, the container evaluates all active operations. For `personZoneEnterExit`, it checks if the person's bounding box centroid is inside the polygon zone. If the state changes from outside to inside, an 'enter' event is triggered. For `personCrossingLine`, it checks if the centroid crosses the line by comparing previous and current positions. Events include a timestamp, person ID, and the operation name. Events are batched and sent every 1 second or immediately if the buffer reaches 10 events. The container also supports `personCount` which returns the current count every 5 seconds without emitting events.
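The two geometric tests described above can be sketched with standard techniques: ray casting for point-in-polygon and an orientation test for segment intersection. This is an illustration of the idea, not the container's code. All function names are hypothetical, coordinates are assumed to be normalized (x, y) pairs, and a centroid that lands exactly on the line or a zone edge is not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is the point inside the polygon (list of vertices)?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal ray through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def _side(a, b, p):
    """Signed area: positive or negative depending on which side of a->b p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def crossed_line(prev, curr, a, b):
    """True if the move prev->curr strictly crosses the segment a->b."""
    return (
        _side(a, b, prev) * _side(a, b, curr) < 0
        and _side(prev, curr, a) * _side(prev, curr, b) < 0
    )


def zone_event(prev, curr, polygon):
    """Return 'enter', 'exit', or None from two consecutive centroid positions."""
    was_inside = point_in_polygon(prev, polygon)
    is_inside = point_in_polygon(curr, polygon)
    if not was_inside and is_inside:
        return "enter"
    if was_inside and not is_inside:
        return "exit"
    return None
```

Both tests compare the previous and current positions, which is why a stable track ID is needed before any spatial event can be raised.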
Emit Events to Azure IoT Hub
Spatial events are sent as JSON device-to-cloud messages to Azure IoT Hub. The message format includes: `{ "eventType": "personZoneEnterExit", "personId": 123, "timestamp": "2025-03-15T10:00:00Z", "zone": "restricted_area" }`. IoT Hub can then route these messages to Azure Functions, Logic Apps, or Event Grid for further processing. The container also supports HTTP endpoints for direct integration. If the IoT Hub connection fails, the container buffers up to 1000 events in memory. After 5 minutes of disconnection, it stops processing and logs an error.
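On the receiving side, a consumer such as an Azure Function would typically parse each message before acting on it. The sketch below parses the sample message shown above into a typed record. The `SpatialEvent` class and `parse_event` function are hypothetical, and the field names are taken only from that sample, so a real payload may carry more or different fields.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SpatialEvent:
    event_type: str
    person_id: int
    timestamp: datetime
    zone: Optional[str]


def parse_event(raw):
    """Parse one JSON event message shaped like the chapter's sample."""
    data = json.loads(raw)
    # Replace the trailing 'Z' so fromisoformat also works on older Python versions.
    timestamp = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    return SpatialEvent(
        event_type=data["eventType"],
        person_id=int(data["personId"]),
        timestamp=timestamp,
        zone=data.get("zone"),
    )


if __name__ == "__main__":
    sample = (
        '{ "eventType": "personZoneEnterExit", "personId": 123, '
        '"timestamp": "2025-03-15T10:00:00Z", "zone": "restricted_area" }'
    )
    print(parse_event(sample))
```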
Enterprise Scenario 1: Retail Store Analytics
A large retail chain deploys Spatial Analysis in 500 stores to count foot traffic and measure dwell times. Each store has 4 ceiling-mounted cameras covering aisles and entrances. The operations used are personCount for each zone and personCrossingLine for entrance/exit lines. The container runs on an NVIDIA Jetson AGX Orin device at each store. Events are sent to Azure IoT Hub, which feeds a Power BI dashboard showing real-time occupancy. The system processes 15 fps per camera, handling up to 30 people per frame. Common misconfiguration: setting frameRate too high (30 fps) causes GPU overload and dropped frames. The solution is to reduce to 10 fps for static scenes. Another issue: camera angle too oblique reduces detection accuracy – overhead angles (45-60 degrees) are optimal. The chain saves 20% on staffing costs by aligning schedules with traffic patterns.
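The real-time occupancy figure on such a dashboard can be derived from entrance line crossings as entries minus exits. The sketch below assumes each crossing has already been reduced to a direction string and that `leftToRight` means entering the store. Both are illustrative assumptions; the direction that counts as an entry depends on how the line is drawn for each camera.

```python
def occupancy(crossing_directions, entry_direction="leftToRight"):
    """Running occupancy from a sequence of line-crossing directions."""
    people_inside = 0
    for direction in crossing_directions:
        if direction == entry_direction:
            people_inside += 1
        else:
            # Clamp at zero so a missed entry cannot drive the count negative.
            people_inside = max(0, people_inside - 1)
    return people_inside


if __name__ == "__main__":
    print(occupancy(["leftToRight", "leftToRight", "rightToLeft"]))
```

Clamping at zero is a design choice for dashboards: dropped frames can cause a missed entry, and a negative occupancy would be more confusing than a slightly low one.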
Enterprise Scenario 2: Workplace Safety Monitoring
A manufacturing plant uses Spatial Analysis to enforce social distancing. Cameras monitor assembly lines and break rooms. The personDistance operation is configured with MinimumDistanceThreshold=150 pixels (about 6 feet at typical camera distance). When two workers are too close, an alert is sent to a safety supervisor via Logic Apps and SMS. The system also uses personZoneEnterExit to detect entry into hazardous areas. The container runs on an Azure Stack Edge device with a T4 GPU. Performance consideration: the distance calculation is computationally intensive – each pair of people requires a comparison. With 20 people in frame, that's 190 comparisons per frame. To optimize, the container skips comparisons if the bounding boxes are far apart (using a spatial hash grid). The plant reduced safety incidents by 40%.
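The comparison count quoted above follows from the number of unordered pairs, n(n-1)/2, which gives 190 for 20 people. The sketch below shows that count and a brute-force distance check over all pairs. It is not the spatial hash grid optimization mentioned in the scenario. The function names and the use of bounding-box centers in pixels are assumptions for illustration.

```python
from itertools import combinations
from math import comb, hypot


def pair_count(people):
    """Number of unordered pairs to compare for a given head count."""
    return comb(people, 2)


def too_close_pairs(centers, min_distance_px=150.0):
    """Brute-force check: index pairs whose centers are closer than the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(centers), 2)
        if hypot(a[0] - b[0], a[1] - b[1]) < min_distance_px
    ]


if __name__ == "__main__":
    print(pair_count(20))
    print(too_close_pairs([(0, 0), (100, 0), (400, 0)]))
```

Because the pair count grows quadratically, doubling the number of people in frame roughly quadruples the work, which is why a pruning step such as a spatial grid becomes worthwhile in crowded scenes.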
Enterprise Scenario 3: Smart Building Access Control
An office building uses Spatial Analysis to grant access to authorized personnel. Cameras at entrances detect people and trigger personCrossingLine. The event is sent to an Azure Function that checks a database of authorized employees via face recognition (using Face API). If authorized, the function opens the door via an IoT relay. The system processes 5 entrances simultaneously. A key challenge: lighting changes at different times of day cause false negatives. The solution is to adjust the confidence threshold dynamically based on time of day (0.6 during day, 0.4 at night). The building also uses Video Indexer to analyze lobby footage for security audits, indexing 24 hours of video daily. The combination of real-time spatial analysis and batch video indexing provides both immediate response and historical review.
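The time-of-day threshold adjustment can be sketched as a simple lookup applied before detections are passed on. The 0.6 and 0.4 values come from the scenario. The day window of 07:00 to 19:00, the function names, and the detection dictionaries with a `confidence` key are hypothetical choices made for this example.

```python
from datetime import time

DAY_START = time(7, 0)   # hypothetical start of the daytime window
DAY_END = time(19, 0)    # hypothetical end of the daytime window
DAY_THRESHOLD = 0.6      # from the scenario
NIGHT_THRESHOLD = 0.4    # from the scenario


def confidence_threshold(now):
    """Pick the detection threshold for the given time of day."""
    return DAY_THRESHOLD if DAY_START <= now < DAY_END else NIGHT_THRESHOLD


def keep_detections(detections, now):
    """Drop detections whose confidence is below the current threshold."""
    threshold = confidence_threshold(now)
    return [d for d in detections if d["confidence"] >= threshold]


if __name__ == "__main__":
    sample = [{"confidence": 0.5}, {"confidence": 0.7}]
    print(keep_detections(sample, time(12, 0)))
    print(keep_detections(sample, time(23, 0)))
```

Lowering the threshold at night trades more false positives for fewer missed people, so the night-time value is worth validating against real footage before it gates a door.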
What AI-900 Tests on This Topic (Objective 3.2)
The AI-900 exam covers Spatial Analysis and Video Insights under 'Computer Vision workloads' with objective code 3.2: Identify capabilities of computer vision. Specifically, you need to know:
The difference between Azure Video Indexer and Spatial Analysis.
The types of insights each service provides (e.g., Video Indexer: OCR, face detection, sentiment; Spatial Analysis: person tracking, zone counting, line crossing).
That Spatial Analysis is a container that can run on the edge (IoT Edge) or in the cloud.
That Video Indexer is a SaaS service for processing recorded or live video.
Common Wrong Answers and Why Candidates Choose Them
'Azure Video Indexer can detect people crossing a line in real time.' – Wrong. Video Indexer is for post-processing; it does not support real-time spatial events. Candidates confuse it with Spatial Analysis.
'Spatial Analysis requires an internet connection to Azure at all times.' – Wrong. Spatial Analysis containers can run fully offline on the edge, sending events later. Candidates think all Azure services require connectivity.
'Video Indexer uses the same container as Spatial Analysis.' – Wrong. Video Indexer is a cloud service, not a container. Candidates mix up deployment models.
'Spatial Analysis can identify specific people by name.' – Wrong. Spatial Analysis only tracks people as anonymous IDs; it does not perform facial recognition. For identification, you need the Face API.
Exact Terms and Values That Appear on the Exam
Operations: personCount, personCrossingLine, personZoneEnterExit, personDistance.
Frame rate: 15 fps default, maximum 30 fps.
Confidence threshold: 0.5 default.
Maximum people tracked: 50 per camera.
Hardware: GPU required (e.g., NVIDIA Tesla T4).
Protocol: RTSP for live cameras.
Video Indexer pricing: Per minute of video indexed.
Edge Cases the Exam Loves
What if the camera angle is too low? – Detection accuracy decreases; overhead angles are recommended.
What if many people are in the frame? – Tracking may fail beyond 50 people; the system may drop events.
Can Spatial Analysis work with recorded video? – Yes, if you provide an RTSP stream from a video file (e.g., using a local media server). But typically it's for live.
How to Eliminate Wrong Answers
If the scenario mentions 'real-time detection of people entering a zone', the answer is Spatial Analysis, not Video Indexer.
If the scenario mentions 'extracting text from a video', the answer is Video Indexer (OCR capability).
If the question asks about 'deploying on a camera device without internet', the answer is Spatial Analysis container on IoT Edge.
If the question mentions 'identifying celebrities', the answer is Video Indexer (celebrity recognition).
Spatial Analysis is for real-time spatial events; Video Indexer is for batch video insights.
Spatial Analysis runs as a container; Video Indexer is SaaS.
Default frame rate for Spatial Analysis is 15 fps; max 30 fps.
Spatial Analysis tracks up to 50 people per camera.
Spatial Analysis operations: personCount, personCrossingLine, personZoneEnterExit, personDistance.
Video Indexer includes face detection, OCR, speech-to-text, sentiment analysis, and object detection.
Spatial Analysis requires GPU (e.g., NVIDIA T4) for real-time processing.
Spatial Analysis can run offline on IoT Edge; events are buffered and sent later.
These come up on the exam all the time. Here's how to tell them apart.
Spatial Analysis
Real-time event detection (sub-second latency)
Deployed as a Docker container on edge or cloud
Tracks people and detects spatial events (crossing lines, zone entry)
Requires GPU hardware for processing
Outputs JSON events to IoT Hub or HTTP endpoint
Azure Video Indexer
Post-processing of recorded video (minutes delay)
SaaS service, no deployment needed
Extracts transcripts, faces, OCR, objects, sentiments
No GPU required; runs in Azure cloud
Outputs a JSON metadata file and timeline in portal
Mistake
Spatial Analysis can identify specific individuals by name.
Correct
Spatial Analysis only assigns anonymous IDs to tracked persons. It does not perform facial recognition or identification. For identification, you must integrate with Azure Face API.
Mistake
Azure Video Indexer can process live video streams in real time for spatial events.
Correct
Video Indexer is primarily for post-processing recorded video. It can index live streams but with a delay (minutes), not real-time. For real-time spatial events, use Spatial Analysis.
Mistake
Spatial Analysis requires constant internet connectivity to Azure.
Correct
The container can run fully offline on an edge device. It can buffer events and send them later when connectivity is restored. However, initial deployment and configuration may require internet.
Mistake
Video Indexer and Spatial Analysis are the same service with different names.
Correct
They are distinct services. Video Indexer is a cloud SaaS for general video insights; Spatial Analysis is a container for real-time spatial tracking. They serve different use cases.
Mistake
Spatial Analysis can process any video format directly.
Correct
Spatial Analysis requires RTSP streams for live video. For recorded video, you need to stream it via an RTSP server. It does not directly read MP4 files.
Spatial Analysis is a container-based service for real-time detection of spatial events like people crossing lines or entering zones. It runs on edge devices with a GPU. Azure Video Indexer is a cloud SaaS that processes recorded video to extract insights like transcripts, faces, and objects. Use Spatial Analysis for real-time monitoring and Video Indexer for post-event analysis.
No, Spatial Analysis only assigns anonymous IDs to tracked individuals. It does not perform facial recognition. To identify people, you need to integrate with Azure Face API or a custom recognition system.
Spatial Analysis requires a GPU for real-time processing. Minimum recommended is NVIDIA Tesla T4 or equivalent. Without GPU, processing speed drops below 1 fps.
Yes, Video Indexer can index live streams, but with a delay (typically 2-5 minutes). It is not real-time. For immediate event detection, use Spatial Analysis.
The default confidence threshold is 0.5. You can adjust it in the configuration JSON to reduce false positives or false negatives.
If a person is not detected for 3 consecutive frames, the tracking ID is released. The Kalman filter can predict position for up to 3 frames during temporary occlusion.
Technically yes, but performance will be very poor (less than 1 fps). It is not recommended for any production use. Always use a GPU.