Document Layout Analysis

A scanned document arrives as an unstructured image. Before anything else, you need to identify its structural components (paragraphs, tables, figures, selection marks, and more) without interpreting what the content means. Document Layout Analysis, a core capability of Azure AI Document Intelligence, solves this problem. For the AI-900 exam, this topic accounts for approximately 5–8% of questions under Objective 3.4, which focuses on how layout analysis differs from OCR and extraction and on when to use the prebuilt layout model. This chapter covers the internal mechanisms, the output schema, and how to configure and interpret results, all of which the exam can draw on.

Updated Jul 20, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

Document Layout Analysis as a Master Organizer

What does a master organizer do when faced with a cluttered desk covered with papers, sticky notes, photographs, and handwritten memos? The organizer's job is not to read every word but to sort everything into labeled folders: one for letters, one for forms, one for images, one for tables, and one for handwritten notes. To do this, the organizer first scans the entire desk, identifying each item's shape, size, and position. A rectangular sheet with lines and a signature block goes into the 'form' folder. A glossy, irregular shape with colors goes into the 'image' folder. A small yellow square with handwriting goes into the 'sticky note' folder. The organizer does not need to understand what the content means, only its structural role. Later, someone else (the downstream AI) reads the sorted documents.

In Azure AI Document Intelligence, the prebuilt layout model acts like this organizer. It takes a scanned document (the cluttered desk) and returns the bounding region, content, and type of every element: paragraphs, tables, figures, and selection marks such as checkboxes. It does not interpret the text's semantics; that is the job of the prebuilt and custom extraction models. This separation of concerns lets you understand a document's structure first and extract specific data second.

How It Actually Works

What Is Document Layout Analysis and Why Does It Exist?

Document Layout Analysis is a computer vision technique that identifies and classifies the physical structure of a document page. Unlike Optical Character Recognition (OCR), which extracts raw text and its bounding boxes, layout analysis groups text and images into meaningful regions: paragraphs, sections, tables, figures, headers, footers, page numbers, and selection marks (checkboxes and radio buttons). This is a prerequisite for downstream tasks like document understanding, form extraction, and intelligent search.

Azure AI Document Intelligence (formerly Form Recognizer) offers a prebuilt layout model that performs this analysis. It is part of the broader Document Intelligence service, which includes prebuilt models for invoices, receipts, identity documents, and custom extraction models. The layout model is the foundation: you can run it alone to understand a document's structure, or use its output as input to other models.

How It Works Internally — Step Through the Mechanism

The layout model uses a deep neural network. It is generally described as transformer-based, in the spirit of the LayoutLM family of research models, although Microsoft does not document the exact architecture. The process has four stages:

1. Image Preprocessing: The input document (PDF, TIFF, or image) is converted to a standard resolution (300 DPI for images, or the native resolution for PDFs). A separate orientation detection model ensures the page is oriented correctly (0°, 90°, 180°, or 270°). If the image is skewed, skew of up to about 5 degrees is corrected.

2. OCR Layer: An internal OCR engine extracts all text and its bounding boxes at the word level. This is a standard OCR pass that returns text content, confidence scores (0–1), and bounding polygon coordinates. The OCR engine is Microsoft's own, with support for over 100 languages.

3. Layout Classification: The model analyzes the spatial arrangement of text blocks, images, and lines. It uses attention mechanisms to understand relationships between adjacent elements. For each detected region, it assigns a type from a predefined taxonomy:
- paragraph: A block of text that forms a coherent paragraph.
- sectionHeading: A heading that starts a new section.
- figure: An image or chart (bounding box only, no content extraction).
- table: A structured table with rows and columns.
- pageNumber: A page number.
- header/footer: Repeating text at the top or bottom of the page (reported in the API response as the paragraph roles pageHeader and pageFooter).
- checkbox/radioButton: Selection marks (selected or unselected).
- formula: A mathematical equation.

4. Output Generation: The model returns a JSON object containing:
- pages: Array of pages, each with width, height, angle, and unit (pixel or inch).
- paragraphs: Array of paragraphs with content, boundingRegions (page, polygon), and spans (offset and length in the concatenated text).
- tables: Array of tables with cells. Each cell has kind (content, rowHeader, or columnHeader), rowIndex, columnIndex, rowSpan, columnSpan, content, and boundingRegions.
- figures: Array of figures with boundingRegions and caption (if detected).
- sections: Array of section headings with content and boundingRegions.
- selectionMarks: Array of selection marks with state (selected or unselected), boundingRegions, and confidence.
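To make the role of `spans` concrete, here is a minimal sketch that slices the concatenated text with an element's offset and length. The dictionary is hand-written to resemble the output described above, and the helper name `text_for_spans` is hypothetical. The sketch also assumes offsets count Python string characters, which is worth confirming for your API settings.

```python
# Hand-written sample shaped like the layout output described above.
# The values are illustrative, not real service output.
sample_result = {
    "content": "Invoice\nThank you for your order.",
    "paragraphs": [
        {"content": "Invoice", "spans": [{"offset": 0, "length": 7}]},
        {
            "content": "Thank you for your order.",
            "spans": [{"offset": 8, "length": 25}],
        },
    ],
}


def text_for_spans(full_text, spans):
    """Return the text an element covers by slicing the concatenated content.

    Hypothetical helper; assumes offsets are Python string indices.
    """
    return "".join(
        full_text[span["offset"]: span["offset"] + span["length"]]
        for span in spans
    )


for paragraph in sample_result["paragraphs"]:
    recovered = text_for_spans(sample_result["content"], paragraph["spans"])
    # Each paragraph's spans should point back at its own text.
    print(repr(recovered), recovered == paragraph["content"])
```

The same slicing works for any element that carries `spans`, which is how a table cell or paragraph can be located inside the full text.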

Key Components, Values, Defaults, and Timers

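API Version: The examples in this chapter use version 2023-07-31 with the prebuilt-layout model ID. Microsoft has since released newer versions (for example, 2024-11-30), so always check the current documentation. The exam may also reference older versions.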

Input Formats: PDF (up to 2,000 pages), TIFF (up to 2,000 pages), JPEG, PNG, BMP. Maximum file size is 500 MB for paid tier, 4 MB for free tier.

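Output Units: Coordinates are expressed in the unit reported for each page: pixels for images and inches for PDFs. The outputContentFormat option (available in newer API versions) only switches the returned content between plain text and Markdown; it does not change the coordinate units.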

Confidence Threshold: The API does not filter by default; you must implement your own threshold (e.g., confidence > 0.8 for production use).

Language Support: The layout model supports over 100 languages for OCR, but layout classification (paragraph, table, etc.) is language-agnostic.

Pricing: The layout model is billed per page (e.g., $0.01 per page for S0 tier).
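Because the API does not filter by confidence, the threshold has to live in your own code. The following is a minimal sketch of that idea, using plain dictionaries in place of real selection marks; the function name `split_by_confidence` and the sample values are illustrative, and the 0.8 default simply mirrors the example threshold above.

```python
def split_by_confidence(selection_marks, threshold=0.8):
    """Split marks into (accepted, needs_review) using confidence > threshold."""
    accepted, needs_review = [], []
    for mark in selection_marks:
        if mark["confidence"] > threshold:
            accepted.append(mark)
        else:
            needs_review.append(mark)
    return accepted, needs_review


# Illustrative marks, not real service output.
marks = [
    {"state": "selected", "confidence": 0.97},
    {"state": "unselected", "confidence": 0.91},
    {"state": "selected", "confidence": 0.42},  # possibly a smudge
]

accepted, needs_review = split_by_confidence(marks)
print(f"accepted={len(accepted)}, needs_review={len(needs_review)}")
```

Routing low-confidence marks to a review list, rather than discarding them, keeps a human in the loop for ambiguous cases.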

Configuration and Verification Commands

To call the layout model, use the Azure AI Document Intelligence REST API or SDK. Here is an example using the Python SDK:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Replace these placeholders with your resource's endpoint and key.
endpoint = "https://your-endpoint.cognitiveservices.azure.com/"
credential = AzureKeyCredential("your-api-key")
client = DocumentAnalysisClient(endpoint, credential)

# Submit the document and block until the long-running analysis finishes.
with open("document.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()

for page in result.pages:
    # Page-level metadata: dimensions, measurement unit, and rotation angle.
    print(f"Page {page.page_number}: width={page.width}, height={page.height}, unit={page.unit}, angle={page.angle}")

    # In this SDK, selection marks are reported on each page, not on the result.
    for mark in page.selection_marks or []:
        print(f"  Selection mark on page {page.page_number}: state={mark.state}, confidence={mark.confidence}")

# Tables are collected at the document level; each cell carries its grid position.
for table in result.tables or []:
    print(f"Table with {len(table.cells)} cells")
    for cell in table.cells:
        print(f"  Row {cell.row_index}, Col {cell.column_index}: {cell.content}")

To verify the output, examine the JSON response. The analyzeResult object contains all layout elements. For example, a table cell appears as:

{
  "kind": "content",
  "rowIndex": 0,
  "columnIndex": 0,
  "rowSpan": 1,
  "columnSpan": 1,
  "content": "Name",
  "boundingRegions": [{"pageNumber": 1, "polygon": [x1,y1,x2,y2,x3,y3,x4,y4]}]
}
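Cell objects like the one above are flat, so a common first step is to rebuild the grid from `rowIndex` and `columnIndex`. Here is a minimal sketch under a few assumptions: the table dictionary is hand-written, the `rowCount` and `columnCount` fields are taken from the table object, `build_grid` is a hypothetical helper, and a spanning cell simply repeats its text in every position it covers.

```python
def build_grid(table):
    """Rebuild a row-by-column grid of cell text from a flat list of cells."""
    grid = [
        ["" for _ in range(table["columnCount"])]
        for _ in range(table["rowCount"])
    ]
    for cell in table["cells"]:
        row_span = cell.get("rowSpan", 1)
        column_span = cell.get("columnSpan", 1)
        # Repeat the content across every grid position the cell covers.
        for r in range(cell["rowIndex"], cell["rowIndex"] + row_span):
            for c in range(cell["columnIndex"], cell["columnIndex"] + column_span):
                grid[r][c] = cell["content"]
    return grid


# Illustrative table, not real service output.
sample_table = {
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"kind": "columnHeader", "rowIndex": 0, "columnIndex": 0, "content": "Name"},
        {"kind": "columnHeader", "rowIndex": 0, "columnIndex": 1, "content": "Qty"},
        {"kind": "content", "rowIndex": 1, "columnIndex": 0, "content": "Widget"},
        {"kind": "content", "rowIndex": 1, "columnIndex": 1, "content": "2"},
    ],
}

for row in build_grid(sample_table):
    print(" | ".join(row))
```

Note that the grid still contains only text and position. Deciding that "Qty" means a quantity is extraction, which is outside what layout analysis provides.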

How It Interacts with Related Technologies

Document Layout Analysis is often the first step in a pipeline:

OCR: The layout model includes OCR internally, but you can also use the read model (prebuilt-read) for pure text extraction without layout classification.

Custom Extraction Models: After layout analysis, you can train a custom model to extract specific fields (e.g., invoice total) using the layout output as a feature.

Search Indexing: The structured output (paragraphs, tables) can be indexed in Azure AI Search (formerly Azure Cognitive Search) to enable semantic search over document structure.

Power Automate: You can use the Document Intelligence connector in Power Automate to trigger workflows based on detected layout elements (e.g., if a checkbox is selected).

Common Pitfalls and Exam Traps

Trap: Layout analysis extracts data values. Reality: It only identifies structure, not semantics. For example, it tells you "this is a table cell" but not that the cell contains an invoice date.

Trap: Layout analysis works on handwritten documents. Reality: It works best on printed text; handwriting may be misclassified or have low confidence.

Trap: Tables are always detected perfectly. Reality: Tables without clear borders (e.g., tab-separated) may be classified as paragraphs. The model uses visual cues like lines and spacing.

Trap: The output includes all text in reading order. Reality: Elements are generally returned in the order they appear on the page (top-left to bottom-right), so multi-column layouts may not come back in logical reading order. The paragraphs array is usually the better guide to reading order.

Walk-Through

Submit Document to Layout API

Send a POST request to the Document Intelligence endpoint with the document file or URL, using the `prebuilt-layout` model ID. The API accepts PDF, TIFF, JPEG, PNG, or BMP. For PDFs, up to 2,000 pages are processed. The call is asynchronous: the initial response carries no results, only an `operation-location` header containing a URL to poll. You must poll this URL until the status changes to `succeeded`. Typical processing time is 1–5 seconds per page, depending on complexity.
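As a rough illustration of the submit step, the sketch below only assembles the pieces of the POST request and does not send anything. The URL path, the `Ocp-Apim-Subscription-Key` header, and the `urlSource` body field reflect the 2023-07-31 REST API as best understood here (newer versions use a different path), so treat them as assumptions to check against the current reference. The helper name `build_analyze_request` and all sample values are hypothetical.

```python
import json


def build_analyze_request(endpoint, api_key, document_url, api_version="2023-07-31"):
    """Assemble URL, headers, and JSON body for a layout analysis request.

    Nothing is sent; pass the returned values to the HTTP client of your choice.
    """
    url = (
        f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
        f"prebuilt-layout:analyze?api-version={api_version}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    }
    body = json.dumps({"urlSource": document_url})
    return url, headers, body


# Placeholder values for illustration only.
url, headers, body = build_analyze_request(
    "https://your-endpoint.cognitiveservices.azure.com/",
    "your-api-key",
    "https://example.com/sample.pdf",
)
print(url)
print(body)
```

A successful submission returns no analysis in its body, so the caller's next job is to read the polling URL from the response headers.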

Image Preprocessing and OCR

The service preprocesses the document: it converts to 300 DPI if needed, detects orientation, and deskews if skew is less than 5 degrees. Then, OCR extracts every character with bounding polygons. The OCR engine returns text content and confidence scores (0–1). This step is invisible to the user but is essential for the layout model to have text coordinates. The OCR supports over 100 languages; you can specify the language hint in the request (e.g., `locale: "en"`).

Layout Classification by Neural Network

The transformer-based model analyzes the spatial relationships between OCR words, lines, and images. It uses attention to determine if a block of text is a paragraph, heading, table, or figure. Tables are identified by detecting grid-like structures (rows and columns) and cell boundaries. Selection marks are identified by looking for small squares or circles with or without a fill. The model outputs a type for each region: `paragraph`, `sectionHeading`, `figure`, `table`, `pageNumber`, `header`, `footer`, `checkbox`, `radioButton`, or `formula`. Each region includes bounding polygons and confidence scores.

Construct Structured JSON Output

The model assembles the results into a JSON object with arrays for pages, paragraphs, tables, figures, selection marks, and sections. Each element includes `boundingRegions` (page number and polygon), `spans` (offset and length from the concatenated text), and confidence. For tables, each cell includes `rowIndex`, `columnIndex`, `rowSpan`, `columnSpan`, and `kind` (content, rowHeader, columnHeader). The `content` field holds the extracted text. The `spans` allow you to map cells back to the full text.
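Polygons are flat lists of alternating x and y values, which is awkward for tasks such as cropping or redaction. The sketch below converts one into an axis-aligned bounding box; the function name `polygon_to_bbox` and the sample numbers are illustrative, and the box is only an approximation when the region is rotated.

```python
def polygon_to_bbox(polygon):
    """Convert [x1, y1, x2, y2, ...] into (min_x, min_y, max_x, max_y)."""
    if len(polygon) < 2 or len(polygon) % 2 != 0:
        raise ValueError("polygon must contain an even number of coordinates")
    xs = polygon[0::2]  # every other value starting at index 0
    ys = polygon[1::2]  # every other value starting at index 1
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative, slightly skewed quadrilateral (units follow the page's unit).
region = {"pageNumber": 1, "polygon": [1.0, 2.0, 3.0, 2.1, 3.0, 4.0, 1.0, 3.9]}
left, top, right, bottom = polygon_to_bbox(region["polygon"])
print(f"page {region['pageNumber']}: ({left}, {top}) to ({right}, {bottom})")
```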

Poll for Results and Parse Response

Poll the `operation-location` URL using GET requests every 1–2 seconds until `status` is `succeeded` or `failed`. The response includes `analyzeResult` with all layout elements. Parse the JSON to extract needed data. For example, iterate through `result.tables` to find all table structures. Use `result.paragraphs` to get reading order. The `result.content` field contains the full concatenated text with line breaks. The polling endpoint is rate-limited; free tier allows 20 calls per minute.
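The polling step can be sketched as a small loop that is independent of any HTTP library. In this sketch, `fetch_status` is a hypothetical callable that performs the GET and returns the parsed JSON body; the fake responses below stand in for the service so the loop can be run without a network connection.

```python
import time


def poll_until_done(fetch_status, interval_seconds=1.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_status() until the operation succeeds, fails, or times out."""
    for _ in range(max_attempts):
        response = fetch_status()
        status = response.get("status")
        if status == "succeeded":
            return response["analyzeResult"]
        if status == "failed":
            raise RuntimeError(f"analysis failed: {response.get('error')}")
        # Any other status means the operation is still in progress.
        sleep(interval_seconds)
    raise TimeoutError(f"no final status after {max_attempts} attempts")


# Fake responses standing in for the service, for illustration only.
fake_responses = iter([
    {"status": "notStarted"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"content": "Hello", "tables": []}},
])

analyze_result = poll_until_done(lambda: next(fake_responses), sleep=lambda _: None)
print(analyze_result["content"])
```

Capping the number of attempts and spacing the requests out helps stay within the polling rate limit mentioned above.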

What This Looks Like on the Job

Enterprise Scenario 1: Invoice Processing Automation

A large logistics company receives 50,000 invoices per month in PDF format. They need to extract line items, totals, and vendor names. Before extraction, they use the layout model to identify tables (line items) and paragraphs (vendor address). The layout output is fed into a custom extraction model that uses the table structure to extract each row. The company processes invoices in batches using Azure Logic Apps, which calls the layout API and stores the JSON in Azure Blob Storage. A key challenge is handling multi-page invoices where tables span pages; the layout model correctly identifies the table as a single entity with cells on multiple pages. Misconfiguration: If the company uses the read model instead, they lose the table structure and must parse text manually, leading to 20% error rates.

Enterprise Scenario 2: Legal Document Review

A law firm uses Document Intelligence to analyze contracts. They need to identify clauses (paragraphs), signatures (figures), and checkboxes (selection marks). The layout model helps them categorize each section. They run the layout model on thousands of documents and then use custom NLP to extract clause types. Performance: Processing a 100-page contract takes about 2 minutes. The firm must handle scanned documents with low resolution; they preprocess images to 300 DPI to improve accuracy. A common problem is that handwritten signatures are often misclassified as figures, which is acceptable because they only need the bounding box for redaction.

Enterprise Scenario 3: Healthcare Forms Digitization

A hospital digitizes patient intake forms. The forms contain checkbox fields (e.g., symptoms list) and handwritten notes. The layout model correctly identifies checkboxes and their state (selected/unselected). However, handwritten text in note fields is not extracted by the layout model; they use a separate handwriting recognition model. The hospital processes 10,000 forms daily using a batch pipeline. They set a confidence threshold of 0.7 for selection marks to reduce false positives. Misconfiguration: If they do not filter by confidence, they might mark a smudge as a selected checkbox, causing incorrect patient records.

How AI-900 Actually Tests This

What AI-900 Tests on This Topic

Objective 3.4 covers "Document Layout Analysis" under "Computer Vision". The exam expects you to:

Understand the purpose of layout analysis: to identify document structure (paragraphs, tables, figures, selection marks) without extracting semantic meaning.

Know the prebuilt layout model's capabilities and limitations.

Distinguish between layout analysis, OCR (read model), and extraction models.

Identify appropriate use cases: forms, contracts, reports, etc.

Recognize the output format: bounding boxes, page numbers, and element types.

Common Wrong Answers and Why Candidates Choose Them

"Layout analysis extracts key-value pairs like invoice total." This is the most common trap. Candidates confuse layout analysis with the prebuilt invoice or custom extraction models. Layout analysis only gives structure; it does not know what a 'total' is.

"Layout analysis works on any image quality." The model requires clear text; low-resolution or heavily skewed images reduce accuracy. Candidates overestimate the model's robustness.

"Layout analysis recognizes handwriting." While OCR can extract handwritten text, the layout model's classification (e.g., paragraph vs. table) is less reliable on handwriting. The exam may test that printed documents yield better results.

"Layout analysis returns text in reading order across columns." Reading order is generally top-left to bottom-right, but multi-column layouts may not be perfectly ordered. The exam might ask about ordering limitations.

Specific Numbers and Terms That Appear on the Exam

Model name: prebuilt-layout

API version: 2023-07-31 (or older, but know the latest)

Maximum PDF pages: 2,000

Maximum file size: 500 MB (paid), 4 MB (free)

Supported output element types: paragraph, sectionHeading, figure, table, pageNumber, header, footer, checkbox, radioButton, formula

Selection mark states: selected or unselected

Confidence scores: 0–1 (float)

Edge Cases and Exceptions

Password-protected PDFs: Not supported; the API returns an error.

Scanned images with rotation: The model auto-rotates pages turned by 90°, 180°, or 270°, but if the remaining skew exceeds about 5 degrees, deskewing fails and accuracy drops.

Tables without borders: May be misclassified as paragraphs. The model relies on visual cues.

Mixed-language documents: OCR accuracy varies; layout classification remains language-agnostic.

How to Eliminate Wrong Answers

If the question mentions extracting a specific field (e.g., "invoice date"), it is NOT layout analysis — it is extraction.

If the question mentions "text only" without structure, it is the read model, not layout.

If the question mentions "checkbox detection," layout analysis is correct because selection marks are a layout element.

If the question mentions "handwriting," be cautious: layout analysis can detect handwriting as text but not classify it reliably.
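The elimination rules above amount to a short decision procedure. The sketch below encodes them as a study aid only; the function name `pick_model` and its two flags are hypothetical and do not correspond to anything in the service.

```python
def pick_model(needs_specific_fields, needs_structure):
    """Map what a question asks for to the kind of model that answers it."""
    if needs_specific_fields:
        # For example "invoice date" or "total": that is extraction, not layout.
        return "extraction model (prebuilt invoice/receipt or custom)"
    if needs_structure:
        # Tables, paragraphs, or checkbox detection are layout elements.
        return "prebuilt-layout"
    # Text only, with no structure required.
    return "prebuilt-read"


print(pick_model(needs_specific_fields=True, needs_structure=True))
print(pick_model(needs_specific_fields=False, needs_structure=True))
print(pick_model(needs_specific_fields=False, needs_structure=False))
```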

Key Takeaways

Document Layout Analysis identifies the physical structure of a document: paragraphs, tables, figures, selection marks, headers, footers, and page numbers.

The prebuilt-layout model in Azure AI Document Intelligence performs layout analysis without requiring training.

Layout analysis does NOT extract semantic meaning (e.g., invoice totals); it only provides structural context.

The output includes bounding polygons, confidence scores, and element types for each detected region.

Selection marks are detected and reported as 'selected' or 'unselected' with confidence.

Tables are identified by visual grid lines; borderless tables may be misclassified as paragraphs.

Maximum document size is 500 MB (paid) or 4 MB (free), with up to 2,000 pages for PDFs.

The model supports over 100 languages for OCR but layout classification is language-agnostic.

Layout analysis is the foundation for custom extraction models and document understanding pipelines.

On the AI-900 exam, distinguish layout analysis from OCR (read model) and extraction models by focusing on whether the task requires structure or semantics.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Prebuilt Layout Model

Identifies document structure: paragraphs, tables, figures, selection marks.

Returns bounding polygons for each structural element.

Includes OCR internally; outputs both text and layout.

Suitable for understanding document structure before extraction.

Higher cost per page due to additional analysis.

Prebuilt Read Model

Extracts raw text only, with bounding boxes for words and lines.

Does not classify text into structural types.

Faster processing since no layout classification.

Suitable for pure text extraction without structure.

Lower cost per page.

Watch Out for These

Mistake

Document Layout Analysis can extract the value of a checkbox (e.g., 'Yes' or 'No').

Correct

Layout analysis only detects the selection mark's state (selected or unselected) and its bounding box. It does not associate the checkbox with a label or extract the label's text. That requires a custom extraction model.

Mistake

Layout analysis works on any document, including handwritten notes, with equal accuracy.

Correct

The model is optimized for printed text. Handwriting may have lower OCR confidence, and layout classification (e.g., table vs. paragraph) may be incorrect. For best results, use printed or typed documents.

Mistake

The layout model returns all text in a single concatenated string in reading order.

Correct

The model provides a `content` field with concatenated text, but the order is top-left to bottom-right per page. Multi-column layouts may not follow logical reading order. Use the `paragraphs` array for better ordering.

Mistake

Layout analysis is the same as OCR (Optical Character Recognition).

Correct

OCR extracts raw text and its coordinates. Layout analysis goes further by classifying regions into structural types (paragraph, table, figure, etc.). The prebuilt-read model does OCR only; prebuilt-layout adds classification.

Mistake

You need to train a custom model to use layout analysis.

Correct

The prebuilt-layout model is ready to use out of the box. No training is required. Custom models are for extracting specific fields from forms, not for layout classification.

Frequently Asked Questions

What is the difference between the prebuilt-layout model and the prebuilt-read model in Azure AI Document Intelligence?

The prebuilt-read model performs OCR, extracting all text and its bounding boxes. The prebuilt-layout model goes further by classifying text regions into structural types like paragraphs, tables, figures, and selection marks. Use read when you only need raw text; use layout when you need to understand the document's structure, such as for table extraction or form processing.

Can Document Layout Analysis extract data from tables?

Yes, but only the structure. The layout model identifies table boundaries, rows, columns, and cells. It returns the text content of each cell. However, it does not interpret the data (e.g., it does not know that a cell contains a 'Total' value). For semantic extraction, you need a custom extraction model or the prebuilt invoice/receipt models.

Does the layout model support handwriting?

The layout model uses OCR that can extract handwritten text, but accuracy is lower than for printed text. Layout classification (e.g., identifying a table) is less reliable on handwritten documents. For best results, use printed documents. The exam may test that the model works best on printed text.

What input formats are supported by the prebuilt-layout model?

Supported formats: PDF, TIFF, JPEG, PNG, and BMP. PDF and TIFF can have up to 2,000 pages. Maximum file size is 500 MB for paid tier, 4 MB for free tier. The model also supports documents from URLs.
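These limits can be checked locally before a document is uploaded. The following is a minimal sketch that uses only the formats and size limits listed above; the helper name `preflight_problems` is hypothetical, the extension list is an assumption about how those formats are usually named, and 1 MB is taken to mean 1024 × 1024 bytes.

```python
import os

# Extensions assumed to correspond to the formats listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff", ".jpg", ".jpeg", ".png", ".bmp"}

# Size limits from the text; 1 MB is assumed to be 1024 * 1024 bytes.
MAX_BYTES = {"paid": 500 * 1024 * 1024, "free": 4 * 1024 * 1024}


def preflight_problems(filename, size_bytes, tier="paid"):
    """Return a list of reasons the file would likely be rejected (empty if none)."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_BYTES[tier]:
        problems.append(f"file exceeds the {tier} tier size limit")
    return problems


print(preflight_problems("scan.pdf", 5 * 1024 * 1024, tier="free"))
print(preflight_problems("scan.PNG", 100_000))
```

A check like this cannot detect password-protected PDFs or count pages, so the service may still reject a file that passes it.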

How do I get the reading order of a document using layout analysis?

The `paragraphs` array in the output is ordered from top-left to bottom-right of the page. For multi-column layouts, the order may not match logical reading order. The `content` field contains the concatenated text of all elements in the order they appear. If you need exact reading order, you may need to post-process the output.

What does a selection mark look like in the output?

A selection mark is represented as an object in the `selectionMarks` array. It includes `state` (selected or unselected), `confidence` (0–1), and `boundingRegions` (page number and polygon). The polygon is a list of eight numbers (x1,y1,x2,y2,x3,y3,x4,y4): the x and y coordinates of the four corners of the mark's boundary.

Can I use the layout model on password-protected PDFs?

No. The API returns an error if the PDF is password-protected. You must remove the password before submission.
