Optical Character Recognition (OCR) is a core Computer Vision capability in Azure AI that extracts printed or handwritten text from images and documents. On the AI-900 exam, OCR accounts for roughly 5-10% of questions under objective 3.2 (Computer Vision). This chapter covers the mechanics of OCR, its implementation in Azure AI services through the Read API, and how it differs from structured extraction with Azure AI Document Intelligence (formerly Form Recognizer). You will learn what the exam expects you to know about OCR, including its use cases, supported languages, and limitations.
Imagine a postal sorting facility where letters arrive in bins. The facility has an automated sorter that must read handwritten addresses on envelopes. First, a high-speed camera takes a photo of each envelope. The image is then converted to a digital format. The sorter's software scans the image to find the region containing the address, ignoring stamps and logos. It then isolates each character by looking for white spaces between them. For each character, the software compares its shape to a library of known character outlines. It uses machine learning models trained on millions of examples to handle variations in handwriting. The result is a string of text that represents the address. Finally, the sorter prints a barcode on the envelope and routes it to the correct bin. If the handwriting is too messy, the sorter flags the envelope for manual processing. This mirrors Azure AI Document Intelligence's OCR: it takes an image, preprocesses it, detects text regions, segments characters, and uses a trained model to recognize them, outputting structured text with confidence scores.
What is OCR and Why It Exists
Optical Character Recognition (OCR) is a technology that converts different types of documents—such as scanned paper documents, PDF files, or images captured by a digital camera—into editable and searchable data. The core problem OCR solves is that images are pixel-based and lack semantic text information. Without OCR, the text in an image is just a collection of colored pixels; a computer cannot search, copy, or edit it. OCR bridges the gap between visual content and machine-readable text.
In the context of Azure AI, OCR is provided by the Computer Vision service, specifically through the Read API. The Azure AI Document Intelligence service (formerly Form Recognizer) also includes OCR capabilities, but its primary focus is on extracting key-value pairs and tables from forms. For the AI-900 exam, you need to understand the Read API as the primary OCR offering.
How OCR Works Internally
OCR in Azure AI follows a multi-step pipeline:
Image Preprocessing: The input image is converted to a standard format (e.g., JPEG or PNG) and resized if necessary. The service applies noise reduction, contrast adjustment, and skew correction to improve recognition accuracy. For example, if a document is scanned at a slight angle, the service deskews it so that text lines are horizontal.
Text Detection (Region Proposal): The service uses a deep learning model to identify regions in the image that contain text. This is done by scanning the image with a sliding window and classifying each window as containing text or not. The output is a set of bounding boxes around each text region. A text region can be a paragraph, a line, or even a single word.
Text Segmentation: Within each detected region, the service segments the text into lines and words. It uses spatial analysis—looking at the distance between characters and lines—to group characters into words and words into lines. Punctuation marks are typically attached to the preceding word.
Character Recognition: Each segmented character image is passed to a convolutional neural network (CNN) that has been trained on millions of character examples. The network outputs a probability distribution over all possible characters (letters, digits, punctuation). The character with the highest probability is selected, but the service also returns confidence scores for each character.
Post-processing: The recognized characters are assembled into words and lines. The service applies language models to correct common errors (e.g., the letter pair 'cl' can be misread as 'd', and the language model uses the surrounding word to pick the more plausible reading). It also formats the output into a structured JSON response containing the recognized text, bounding box coordinates, and confidence levels.
Key Components and Parameters
- Read API Endpoint: https://<your-resource-name>.cognitiveservices.azure.com/vision/v3.2/read/analyze (version 3.2 is commonly tested).
- Operation: The Read API is asynchronous. You first call the POST endpoint to submit the image, which returns an Operation-Location header with a URL to poll for results.
- Polling: You must poll the GET endpoint at the Operation-Location URL until the status is 'succeeded' or 'failed'. The recommended polling interval is 1 second; polling more often than once every 0.5 seconds risks throttling.
- Output Fields: The JSON response includes:
  - status: 'succeeded', 'running', 'failed'
  - analyzeResult: contains readResults array, each with:
    - page: page number (1-indexed)
    - angle: skew correction angle in degrees
    - width, height: dimensions of the image
    - unit: 'pixel' or 'inch'
    - lines: array of text lines, each with:
      - boundingBox: array of 8 numbers (x1,y1,x2,y2,x3,y3,x4,y4)
      - text: the recognized text
      - words: array of words, each with:
        - boundingBox: similar
        - text: the word
        - confidence: confidence score between 0 and 1
- Supported Languages: The Read API supports printed text in over 160 languages, including English, Spanish, French, German, Chinese, Japanese, and Arabic. Handwritten text is supported for English and a smaller set of other languages (such as Spanish, French, German, Italian, Portuguese, Dutch). Check the latest documentation for the exact list.
- Image Requirements:
  - Image size: between 50x50 and 10000x10000 pixels.
  - File size: up to 4 MB (free tier) or larger depending on pricing tier.
  - Supported formats: JPEG, PNG, BMP, TIFF (multi-page), PDF (multi-page).
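The image requirements above can be checked on the client before a file is submitted. The following is a minimal sketch, not part of any Azure SDK: the function name `check_image` and the constant names are hypothetical, and the limits are simply the figures quoted in this chapter, which should be confirmed against current documentation.

```python
from pathlib import PurePath

# Limits as quoted in this chapter (assumed values; verify before relying on them).
MIN_SIDE, MAX_SIDE = 50, 10_000
FREE_TIER_MAX_BYTES = 4 * 1024 * 1024
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".pdf"}


def check_image(filename, width, height, size_bytes, max_bytes=FREE_TIER_MAX_BYTES):
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    # Format is judged by file extension only, which is a simplification.
    if PurePath(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format")
    if not (MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE):
        problems.append("dimensions out of range")
    if size_bytes > max_bytes:
        problems.append("file too large")
    return problems
```

Running the check locally avoids spending a transaction on a request that would be rejected with a 400 response.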
Configuration and Usage
To use the Read API, you need an Azure AI Services resource (formerly Cognitive Services) with the Computer Vision endpoint and key. You can call the API via REST or using SDKs (Python, C#, Java, etc.).
Example REST call:
POST https://<your-resource-name>.cognitiveservices.azure.com/vision/v3.2/read/analyze
Headers:
  Ocp-Apim-Subscription-Key: <your-key>
  Content-Type: application/json
Body:
  {
    "url": "https://example.com/document.jpg"
  }
Response header:
  Operation-Location: https://<your-resource-name>.cognitiveservices.azure.com/vision/v3.2/read/analyzeResults/<operation-id>
Then poll:
GET https://<your-resource-name>.cognitiveservices.azure.com/vision/v3.2/read/analyzeResults/<operation-id>
Headers:
  Ocp-Apim-Subscription-Key: <your-key>
Interaction with Related Technologies
Azure AI Document Intelligence: While the Read API extracts raw text, Document Intelligence (Form Recognizer) uses OCR as a first step but then applies additional models to extract structured information like key-value pairs, tables, and signatures. For the exam, know that Document Intelligence is for forms and documents, while the Read API is for general OCR.
Azure Cognitive Search: OCR can be used with Cognitive Search to index scanned documents. The OCR skill in an AI enrichment pipeline extracts text from images, which is then indexed for full-text search.
Power Automate: You can integrate OCR into workflows using the AI Builder OCR model, which is built on the same technology but is no-code.
Performance and Limitations
Latency: The Read API typically takes a few seconds for a single-page document. For large documents (many pages), it can take minutes. The asynchronous design allows you to handle large workloads without blocking.
Accuracy: OCR accuracy depends on image quality, font clarity, and language. Printed text in standard fonts can achieve >99% accuracy. Handwritten text is less accurate, especially with cursive or unusual styles. The confidence score helps you decide whether to accept or flag results.
Throttling: The free tier allows 20 transactions per minute. Paid tiers have higher limits (e.g., 100 TPS). Exceeding limits results in HTTP 429 errors.
Exam-Relevant Details
The Read API is part of the Computer Vision service, not a separate service.
The correct endpoint version for the exam is v3.2 or v3.1 (both are tested).
OCR is asynchronous: you must poll for results.
The response includes bounding boxes and confidence scores.
OCR can handle printed and handwritten text, but handwritten is limited to certain languages.
The maximum image size is 10000x10000 pixels.
The free tier has a limit of 20 calls per minute.
OCR does not extract structure like tables or forms—that's Form Recognizer's job.
Common Exam Traps
Trap 1: Thinking OCR is synchronous. The Read API is always asynchronous. A synchronous call would be incorrect.
Trap 2: Confusing OCR with Form Recognizer. OCR extracts text only; Form Recognizer extracts structure.
Trap 3: Assuming OCR works on all handwritten text. It supports only a subset of languages for handwriting.
Trap 4: Forgetting to poll for results. The initial POST only submits the job; you must poll the GET endpoint.
Submit Image for OCR
You send a POST request to the Read API endpoint with the image URL or binary data. The request must include your subscription key in the header. The service validates the image: it checks file size (max 4 MB for free tier, larger for paid), dimensions (50x50 to 10000x10000), and format (JPEG, PNG, BMP, TIFF, PDF). If valid, it returns a 202 Accepted status with an Operation-Location header containing a URL to poll for the result. This URL includes an operation ID that uniquely identifies your OCR job.
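As an illustration of the submit step, the sketch below builds the POST request with Python's standard library but does not send it. The helper names `build_read_request` and `operation_id` are hypothetical, and the resource name and key are placeholders you would supply yourself.

```python
import json
import urllib.request


def build_read_request(resource_name, key, image_url):
    """Build (but do not send) the POST request that submits an image URL."""
    endpoint = (
        f"https://{resource_name}.cognitiveservices.azure.com"
        "/vision/v3.2/read/analyze"
    )
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )


def operation_id(operation_location):
    """Extract the trailing operation ID from an Operation-Location URL."""
    return operation_location.rstrip("/").rsplit("/", 1)[-1]
```

Passing the returned object to `urllib.request.urlopen` would send it. On a 202 response, the `Operation-Location` header holds the URL to poll.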
Asynchronous Processing Starts
The service queues your job and begins processing. It first preprocesses the image: converts to a standard format, applies noise reduction, and corrects skew. Then it runs a deep learning model to detect text regions. This model uses a convolutional neural network (CNN) trained on millions of images. The processing time depends on image size and complexity. For a typical A4 page, it takes about 1-2 seconds. The service stores the result in a temporary location that expires after 24 hours.
Poll for Results
You must poll the GET endpoint at the Operation-Location URL. The recommended interval is 1 second. The response includes a 'status' field: 'running' (still processing), 'succeeded' (done), or 'failed' (error). You should poll until status is not 'running'. If you poll too frequently (more often than once every 0.5 seconds), you may get throttled. The service returns a 200 OK for each poll, even if still running. The final response includes the full OCR results.
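The polling step can be sketched as a small loop. This is an illustrative outline only: `fetch_status` is a hypothetical caller-supplied function that performs the GET against the Operation-Location URL and returns the parsed JSON body, and the `max_attempts` cap is an assumption added so the loop cannot run forever.

```python
import time


def poll_until_done(fetch_status, interval=1.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_status() until the job reports 'succeeded' or 'failed'.

    Any other status value is treated as still in progress.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        # Wait before the next poll; 1 second is the interval this chapter recommends.
        sleep(interval)
    raise TimeoutError("OCR job did not finish within the allowed attempts")
```

Injecting `sleep` and `fetch_status` keeps the loop easy to exercise without a network connection or real delays.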
Parse OCR Output
Once status is 'succeeded', the response contains an 'analyzeResult' object. Inside, there is a 'readResults' array. Each element corresponds to a page (for multi-page documents). Each page has a 'lines' array, and each line carries its own 'words' array. Each line has a 'boundingBox' (8 numbers giving the x,y coordinates of the four corners in the order top-left, top-right, bottom-right, bottom-left) and 'text' (the line as a string). Each word has its own bounding box, text, and a 'confidence' score (0 to 1). A confidence below 0.5 is considered low and may indicate poor recognition.
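To make the response structure concrete, here is a small sketch that walks a hand-written sample shaped like the fields described above. The sample data is invented for illustration and is not real service output. The helper names `low_confidence_words` and `to_rect` are hypothetical.

```python
def low_confidence_words(response, threshold=0.5):
    """Yield (page, text, confidence) for words scoring below the threshold."""
    for page in response["analyzeResult"]["readResults"]:
        for line in page.get("lines", []):
            for word in line.get("words", []):
                if word["confidence"] < threshold:
                    yield page["page"], word["text"], word["confidence"]


def to_rect(bounding_box):
    """Convert 8 numbers (x1, y1, ..., x4, y4) to (left, top, right, bottom)."""
    xs, ys = bounding_box[0::2], bounding_box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


# Invented sample in the shape this chapter describes.
sample = {
    "status": "succeeded",
    "analyzeResult": {
        "readResults": [
            {
                "page": 1,
                "lines": [
                    {
                        "boundingBox": [10, 20, 210, 22, 209, 60, 9, 58],
                        "text": "lnvoice 2041",
                        "words": [
                            {"boundingBox": [10, 20, 110, 22, 109, 60, 9, 58],
                             "text": "lnvoice", "confidence": 0.42},
                            {"boundingBox": [120, 21, 210, 22, 209, 60, 119, 59],
                             "text": "2041", "confidence": 0.98},
                        ],
                    }
                ],
            }
        ]
    },
}

flagged = list(low_confidence_words(sample))
```

The axis-aligned rectangle from `to_rect` is a simplification: the four corners can describe a rotated box, so the rectangle is only the smallest upright box that contains it.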
Handle Errors and Edge Cases
If the image is too large or unsupported format, the initial POST returns a 400 Bad Request. If the key is invalid, you get 401 Unauthorized. If you exceed the rate limit, you get 429 Too Many Requests. For poor-quality images, the OCR may return empty lines or low confidence scores. You should implement logic to retry with a different image or flag for manual review. Also note that the service may not recognize very small text (less than 10 pixels) or text with heavy background patterns.
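One way to organize this error handling is to map each status code from the initial POST to a next step. The sketch below is an assumption about client-side design, not behavior defined by the service: the function names, the action labels, and the exponential backoff schedule are all hypothetical choices.

```python
def next_action(status_code, attempt, max_retries=3):
    """Map an HTTP status from the initial POST to a suggested action."""
    if status_code == 202:
        return "poll"          # job accepted; start polling Operation-Location
    if status_code == 429 and attempt < max_retries:
        return "retry"         # rate limited; wait and resubmit
    if status_code in (400, 401):
        return "fix_request"   # bad image or bad key; retrying will not help
    return "give_up"


def backoff_seconds(attempt, base=1.0, cap=30.0):
    """Exponential backoff (an assumed policy): 1s, 2s, 4s, ... up to the cap."""
    return min(cap, base * 2 ** attempt)
```

Separating the 400 and 401 cases from 429 matters because only the rate-limit error is worth retrying unchanged.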
Enterprise Scenario 1: Invoice Processing at a Large Retailer
A retail company receives thousands of paper invoices daily from suppliers. They use Azure AI Document Intelligence (which includes OCR) to digitize invoices. The OCR step extracts all text from scanned invoices, then Form Recognizer extracts key fields like invoice number, date, total amount, and line items. The company processes about 10,000 invoices per day. They use the paid tier with a throughput of 100 TPS. They have a custom model trained on their invoice format. Common issues: poor scan quality (low resolution, skewed pages) leads to OCR errors. They preprocess images to ensure at least 300 DPI and deskew before sending. They also set up a fallback: if confidence scores are below 0.6 for critical fields, the invoice is sent for manual review. Misconfiguration: initially they used the Read API alone, but it did not extract structured data, so they had to switch to Form Recognizer.
Enterprise Scenario 2: Automated Mailroom at a Law Firm
A law firm receives physical mail that needs to be digitized and indexed. They use a scanner that automatically sends images to Azure Computer Vision OCR. The OCR extracts text from envelopes and letters. The text is then used to classify the mail (e.g., client, court, opposing counsel) using Azure Cognitive Search. They process about 500 documents per day. They use the free tier initially but upgrade to paid as volume grows. A common problem: handwritten addresses on envelopes have low accuracy (around 80%). They mitigate by using a custom handwriting recognition model trained on their specific handwriting styles (e.g., from known senders). They also integrate with Power Automate to trigger workflows based on recognized text. Misconfiguration: they initially used synchronous OCR calls, which caused timeouts for large documents. They switched to the asynchronous Read API and implemented proper polling.
Enterprise Scenario 3: Historical Document Digitization at a Library
A library digitizes historical books and newspapers. They use OCR to make the text searchable. The documents are old with faded ink and unusual fonts. They use the Read API with the 'language' parameter set to the document's language. They process about 100 pages per hour. To improve accuracy, they preprocess images with contrast enhancement and binarization. They also use the 'pages' parameter to process specific pages in a multi-page TIFF. A common issue: the OCR confuses similar characters like 's' and 'f' in old fonts. They use post-processing with a custom dictionary to correct known errors. They also store the raw OCR output for archival purposes, even if not perfect. Misconfiguration: they tried to process very large TIFF files (over 100 MB) that exceeded the file size limit. They now split files into individual pages before sending.
What AI-900 Tests on OCR (Objective 3.2)
The exam focuses on identifying the correct Azure service for OCR tasks, understanding the asynchronous nature of the Read API, and knowing the output format (bounding boxes, confidence scores). Specific objective codes: 3.2.1 (Identify capabilities of the Computer Vision service) includes OCR. 3.2.2 (Identify capabilities of Azure AI Document Intelligence) includes form extraction but not pure OCR. Questions often ask: 'Which service should you use to extract printed text from an image?' The answer is Computer Vision (Read API). Another common question: 'How do you get the results from the Read API?' The answer is by polling the Operation-Location URL.
Most Common Wrong Answers and Why
Wrong answer: Form Recognizer (Document Intelligence) - Candidates choose this because they think OCR is for forms. But Form Recognizer is for structured extraction; for raw text, use the Read API.
Wrong answer: The Read API returns results synchronously - Candidates assume all APIs are synchronous. They must remember the Read API is asynchronous.
Wrong answer: OCR works for all handwritten text - The exam tests that handwritten OCR is limited to certain languages (English and a few others).
Wrong answer: OCR extracts tables and key-value pairs - OCR only extracts raw text; structured extraction is done by Form Recognizer.
Specific Numbers and Terms
Endpoint version: v3.2 or v3.1
Operation-Location header
Polling interval: 1 second recommended (no more often than once every 0.5 seconds)
Image size limits: 50x50 to 10000x10000 pixels
File size: up to 4 MB (free tier)
Supported formats: JPEG, PNG, BMP, TIFF, PDF
Status values: 'running', 'succeeded', 'failed'
Confidence scores: between 0 and 1
Response includes: 'readResults', 'lines', 'words', 'boundingBox', 'text', 'confidence'
Edge Cases and Exceptions
Multi-page documents: the Read API can process TIFF and PDF with multiple pages. Each page is returned as a separate entry in 'readResults'.
Language specification: you can pass a 'language' parameter to improve accuracy. If not specified, the service auto-detects.
Handwriting detection: only works for some languages. The exam may ask which languages are supported for handwriting (e.g., English, Spanish, French).
Throttling: free tier 20 calls per minute; paid tier varies.
How to Eliminate Wrong Answers
If the question asks for 'extracting text from an image', eliminate any answer that mentions 'key-value pairs' or 'tables'—those are Form Recognizer.
If the question mentions 'asynchronous', look for polling or Operation-Location.
If the question is about 'printed text' only, OCR works for many languages; if it is about 'handwritten' text, only a few languages are supported.
If the question includes 'confidence scores', that's a clue for OCR output.
By understanding these patterns, you can quickly eliminate distractors and choose the correct answer.
OCR in Azure is provided by the Computer Vision Read API, which is asynchronous.
The Read API returns bounding boxes (8 coordinates) and confidence scores (0-1) for each word.
The initial POST returns 202 with an Operation-Location header; you must poll the GET endpoint until status is 'succeeded'.
OCR supports printed text in over 160 languages, but handwritten text is limited to a subset (English, Spanish, French, German, Italian, Portuguese, Dutch).
Image size must be between 50x50 and 10000x10000 pixels; file size up to 4 MB on free tier.
The free tier allows 20 transactions per minute; exceeding this results in HTTP 429 errors.
OCR does not extract structure (tables, key-value pairs); use Form Recognizer for that.
The endpoint version commonly tested is v3.2 or v3.1.
Confidence scores below 0.5 indicate low reliability; you may need manual review.
Multi-page documents (TIFF, PDF) are supported; each page is returned separately in readResults.
These come up on the exam all the time. Here's how to tell them apart.
Read API (Computer Vision)
Extracts raw text from images and documents.
Returns bounding boxes and confidence scores for each word.
Asynchronous operation with polling.
Part of the Computer Vision service.
Best for general OCR tasks (scanned books, signs, etc.).
Azure AI Document Intelligence (Form Recognizer)
Extracts structured information like key-value pairs and tables.
Uses OCR internally but adds custom models for form understanding.
Also asynchronous; uses a similar polling mechanism.
Separate service (formerly Form Recognizer).
Best for forms, invoices, receipts, and structured documents.
Mistake
OCR in Azure is a synchronous operation that returns text immediately.
Correct
The Read API is asynchronous. You submit the image and get an Operation-Location URL, then you must poll that URL until the status is 'succeeded'. The initial POST does not return the text.
Mistake
Azure OCR can extract text from any image, regardless of quality or language.
Correct
OCR accuracy depends on image quality (resolution, contrast, skew) and language support. Handwritten text is only supported for a few languages (English, Spanish, French, German, Italian, Portuguese, Dutch). The service may fail on low-quality images.
Mistake
The Read API and Form Recognizer are the same thing.
Correct
The Read API (Computer Vision) extracts raw text with bounding boxes. Form Recognizer (Azure AI Document Intelligence) uses OCR as a first step but then applies additional models to extract structured data like key-value pairs and tables.
Mistake
OCR can recognize text in any font, including handwritten cursive with 100% accuracy.
Correct
Even with deep learning, OCR is not 100% accurate. Printed text can achieve >99% accuracy, but handwritten text is lower (typically 80-95%). The service returns confidence scores to indicate reliability.
Mistake
You can call the Read API with a synchronous endpoint if you set a parameter.
Correct
There is no synchronous version of the Read API. All text extraction via the Computer Vision Read API is asynchronous. The only way to get results is by polling the Operation-Location URL.
OCR (via the Read API) extracts raw text from images and returns it with bounding boxes and confidence scores. Form Recognizer (now Azure AI Document Intelligence) uses OCR as a first step but then applies custom models to extract structured data like key-value pairs, tables, and signatures. For the AI-900 exam, remember that OCR is for general text extraction, while Form Recognizer is for forms and documents with structure.
First, send a POST request to the Read API endpoint with your image. The response includes an Operation-Location header with a URL. Then, send GET requests to that URL to poll for results. When the status is 'succeeded', the response body contains the extracted text. The recommended polling interval is 1 second.
Azure OCR supports JPEG, PNG, BMP, TIFF, and PDF. For multi-page documents, TIFF and PDF are supported. The image must be between 50x50 and 10000x10000 pixels, and the file size must be under 4 MB for the free tier (larger for paid tiers).
Yes, but only for a limited set of languages: English, Spanish, French, German, Italian, Portuguese, and Dutch. For printed text, it supports over 160 languages. The accuracy for handwriting is lower than for printed text, and confidence scores help identify uncertain results.
The maximum image dimensions are 10000x10000 pixels. The minimum is 50x50 pixels. For file size, the free tier allows up to 4 MB; paid tiers can handle larger files (check documentation).
The free tier of Azure Computer Vision allows 20 transactions per minute. If you exceed this, you will receive an HTTP 429 (Too Many Requests) error. Paid tiers have higher limits (e.g., 100 transactions per second).
Yes, the Read API accepts PDF files as input. It can process multi-page PDFs and returns results for each page separately. The PDF is converted to images internally for OCR processing.