This chapter covers Azure AI Search, a cloud search service that indexes unstructured data and makes it searchable via full-text queries, faceted navigation, and AI enrichment. For the DP-900 exam, this topic falls under Domain 3: Describe how to work with non-relational data on Azure (Objective 3.5: Describe AI Search for unstructured data). Approximately 5-8% of exam questions touch on Azure AI Search, focusing on its purpose, components (index, indexer, skillset), and how it handles unstructured data like text, images, and blobs. You will need to know when to use it versus other Azure data services and how AI enrichment adds value.
Azure AI Search is like a library that does not just store books on shelves but also creates a massive card catalog for every word, phrase, and concept inside every book. The library's catalog is built by indexers that scan each book, extract key terms, and organize them into a searchable index. When a patron asks a question, the librarian does not walk the aisles reading every book; instead, she consults the card catalog, which points directly to the exact pages where the answer appears. The catalog can rank results by relevance, highlight matching text, and even suggest related topics. In this analogy, the books are your unstructured data (PDFs, emails, web pages), the indexers are the Azure AI Search components that crawl your data sources (optionally applying built-in skills such as OCR, key phrase extraction, and language detection along the way), and the card catalog is the search index. The librarian's ability to answer complex questions using the catalog mirrors how Azure AI Search enables full-text search, faceted navigation, and semantic ranking on unstructured content. Just as a library patron does not need to know the Dewey Decimal System to find a book, a user does not need to understand the underlying index schema to get relevant search results.
What is Azure AI Search and Why It Exists
Azure AI Search (formerly Azure Cognitive Search) is a fully managed, cloud-based search-as-a-service solution that enables developers to build rich search experiences over heterogeneous content. It is designed primarily for unstructured and semi-structured data—documents, images, text files, web pages, emails, and other content that does not fit neatly into a relational schema. The service ingests data from various sources, indexes it using inverted indexes and AI-powered enrichment, and exposes search capabilities via REST APIs or SDKs.
The core problem Azure AI Search solves is the difficulty of searching across large volumes of unstructured data efficiently. Traditional databases excel at querying structured data (e.g., SELECT * FROM Customers WHERE LastName = 'Smith'), but they struggle with free-text queries, relevance ranking, and handling content like PDFs or images. Azure AI Search fills this gap by providing:
Full-text search with linguistic analysis (stemming, lemmatization, tokenization).
Relevance scoring using TF-IDF or BM25 (default BM25 since 2020).
Faceted navigation and filtering.
AI enrichment via built-in cognitive skills (OCR, key phrase extraction, entity recognition, language detection, sentiment analysis).
Integration with Azure data sources (Blob Storage, Cosmos DB, SQL Database, etc.).
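To make faceted navigation and filtering concrete, here is a minimal Python sketch of what a facet count and a filter amount to over a handful of in-memory documents. The documents and the field names `category` and `language` are made up for illustration, and the real service computes facets inside the index rather than by scanning documents like this.

```python
from collections import Counter

# Hypothetical documents; the field names are illustrative only.
DOCS = [
    {"id": "1", "category": "contract", "language": "en"},
    {"id": "2", "category": "email", "language": "en"},
    {"id": "3", "category": "contract", "language": "fr"},
]


def facet_counts(docs, field):
    """Count how many documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs)


def apply_filter(docs, field, value):
    """Keep only the documents whose `field` equals `value`."""
    return [doc for doc in docs if doc[field] == value]
```

A search UI would typically show the facet counts as clickable categories and apply the matching filter when the user selects one.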
How It Works Internally – Step Through the Mechanism
The process of making unstructured data searchable in Azure AI Search occurs in three main phases: indexing, enrichment (optional), and querying.
#### 1. Indexing
Indexing is the process of extracting data from a source, transforming it, and loading it into a search index. The key components are:
Data Source: A connection to the source of your unstructured data. Supported sources include Azure Blob Storage, Azure Data Lake Storage Gen2, Azure Cosmos DB, Azure SQL Database, Azure Table Storage, and more. For unstructured data, Blob Storage is the most common.
Indexer: A service that automates the data ingestion pipeline. It connects to the data source, reads the data, and writes it to the index. Indexers can run on-demand or on a schedule (e.g., every 5 minutes). They also handle incremental indexing by tracking changes using high-water marks or change tracking.
Index: A persistent store of searchable content. It is defined by a schema that includes fields (e.g., content, metadata_storage_name, metadata_last_modified), each with a type (Edm.String, Edm.Int32, Edm.DateTimeOffset, etc.) and attributes (searchable, filterable, sortable, facetable, retrievable). The index uses an inverted index structure: for each unique term, it stores a list of document IDs and positions where that term appears.
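The inverted index structure described above can be sketched in a few lines of Python. This is a toy illustration, not how the service is implemented: the tokenizer is a simple lowercase split, and the two sample documents are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, term):
    """Return the sorted IDs of documents that contain `term`."""
    return sorted(index.get(term.lower(), {}))


# Invented sample documents.
DOCS = {
    "doc1": "Azure AI Search indexes unstructured data",
    "doc2": "Blob Storage holds unstructured data such as PDFs",
}
INDEX = build_inverted_index(DOCS)
```

Because a lookup goes straight from a term to the documents that contain it, a query never has to scan the documents themselves.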
#### 2. AI Enrichment (Skillsets)
AI enrichment is optional but powerful for unstructured data. It uses a skillset: a pipeline of cognitive skills that transform or extract information from the raw content. Skillsets are attached to indexers and execute during indexing. Each skill takes input from the document or previous skills and produces output that is written to the index.
Built-in skills include:
- OCR Skill: Extracts text from images (JPEG, PNG, TIFF, BMP) using optical character recognition. Supports multiple languages.
- Key Phrase Extraction Skill: Identifies key talking points from text (e.g., from a PDF).
- Entity Recognition Skill: Identifies named entities (people, organizations, locations, dates, etc.).
- Language Detection Skill: Detects the language of text (supports 120+ languages).
- Sentiment Analysis Skill: Evaluates text sentiment. The original skill returns a score from 0 (negative) to 1 (positive); the newer version returns positive, neutral, or negative labels with confidence scores.
- Merge Skill: Merges fields together (e.g., concatenate title and content).
- Conditional Skill: Implements if-then-else logic.
- Custom Skill: Allows you to call an Azure Function or web API for custom processing.
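To illustrate how one skill can feed another, the following Python sketch builds a skillset definition in which language detection runs first and key phrase extraction consumes its output. The skillset name, skill names, and the `language` target name are hypothetical; the `@odata.type` values and input/output names follow the built-in skill definitions as I understand them, so treat the shape as an assumption to verify against the REST reference. The helper `unresolved_inputs` is an invented check, not part of any SDK.

```python
import json


def build_skillset(name):
    """Return a skillset definition that chains two built-in skills."""
    return {
        "name": name,
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
                "name": "detect-language",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "languageCode", "targetName": "language"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "name": "keyphrase",
                "context": "/document",
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    # Output of the previous skill, addressed by its path.
                    {"name": "languageCode", "source": "/document/language"},
                ],
                "outputs": [{"name": "keyPhrases", "targetName": "myKeyPhrases"}],
            },
        ],
    }


def unresolved_inputs(skillset, known=("/document/content",)):
    """List input sources that neither the raw document nor an earlier skill provides."""
    available = set(known)
    missing = []
    for skill in skillset["skills"]:
        for item in skill["inputs"]:
            if item["source"] not in available:
                missing.append(item["source"])
        for out in skill["outputs"]:
            available.add(f'{skill["context"]}/{out["targetName"]}')
    return missing


if __name__ == "__main__":
    print(json.dumps(build_skillset("demo-skillset"), indent=2))
```

The check walks the skills in list order only to keep the sketch simple; it shows the dependency between outputs and inputs rather than how the service schedules skills.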
Skillsets are defined in JSON. Here is an example snippet:
    {
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
          "name": "keyphrase",
          "context": "/document",
          "inputs": [
            { "name": "text", "source": "/document/content" }
          ],
          "outputs": [
            { "name": "keyPhrases", "targetName": "myKeyPhrases" }
          ]
        }
      ]
    }

#### 3. Querying
Once the index is populated, clients can query it via the REST API or an SDK. Queries are typically HTTP POST requests to the index's docs/search endpoint (simple queries can also be sent as GET requests), with parameters such as search, filter, orderby, facets, and highlight. The search engine parses the query, looks up terms in the inverted index, computes relevance scores (BM25), and returns results ordered by score.
Example query:
    POST https://[service-name].search.windows.net/indexes/[index-name]/docs/search?api-version=2020-06-30
    Content-Type: application/json

    {
      "search": "Azure AI Search",
      "queryType": "simple",
      "searchMode": "any",
      "top": 10
    }

Key Components, Values, Defaults, and Timers
Service Tiers (SKUs): Free (F), Basic (B), Standard (S1, S2, S3), Storage Optimized (L1, L2). The Free tier is limited to 3 indexes, 50 MB of storage, and 10,000 documents per index. Standard tiers scale up to 200 GB per partition (S3) and support up to 12 replicas for high availability.
Index Limits: Max fields per index: 1,000 (Free/Basic) or 2,500 (Standard). Max index size depends on tier.
Indexer Schedule: Minimum interval is 5 minutes. Runs can be triggered on-demand or via Azure Logic Apps.
Skillset Execution: Custom (Web API) skills have a configurable timeout that defaults to 30 seconds and can be raised to a maximum of 230 seconds. Timeouts for built-in skills are managed by the service.
Query Defaults: top defaults to 50, searchMode defaults to any (returns documents matching any term). For more precise results, use searchMode=all.
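Relevance Scoring: Default is BM25. Parameters k1 (term frequency saturation) and b (length normalization) can be tuned through the similarity property of the index definition. Scoring profiles are a separate mechanism for boosting particular fields or values.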
Semantic Search: Optional capability (additional cost) that uses deep learning models to improve relevance. Requires a semantic configuration in the index.
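The roles of k1 and b are easier to see in code. Below is a textbook-style BM25 sketch in Python; it is not the service's implementation, the IDF variant is the common Lucene-style form, the defaults k1=1.2 and b=0.75 are the conventional values, and the tiny tokenized corpus is invented.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with the BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # b controls how strongly long documents are penalized.
        norm = 1 - b + b * len(doc_terms) / avg_len
        # k1 controls how quickly repeated terms stop adding score.
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score


# Invented, already-tokenized documents.
CORPUS = [
    ["azure", "search", "indexes", "documents"],
    ["azure", "blob", "storage"],
    ["search", "search", "search", "relevance", "ranking", "with", "search"],
]
```

With b set to 0 (no length normalization), a document that mentions "search" four times scores higher than one that mentions it once, but by much less than four times as much; that is the saturation effect k1 governs.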
Configuration and Verification Commands
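Indexers are data-plane objects, so they are created and managed through the REST API, an SDK, or the portal; the Azure CLI's az search commands manage the search service itself (for example, its SKU and API keys). To create an indexer with the REST API:

    PUT https://[service-name].search.windows.net/indexers/myindexer?api-version=2020-06-30
    Content-Type: application/json
    api-key: [admin-key]

    { "name": "myindexer", "dataSourceName": "mydatasource", "targetIndexName": "myindex", "skillsetName": "myskillset", "schedule": { "interval": "PT5M" } }

To check indexer status:

    GET https://[service-name].search.windows.net/indexers/myindexer/status?api-version=2020-06-30
    api-key: [admin-key]

To run an indexer on demand:

    POST https://[service-name].search.windows.net/indexers/myindexer/run?api-version=2020-06-30
    api-key: [admin-key]

How It Interacts with Related Technologies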
Azure AI Search integrates with many Azure services:
- Azure Blob Storage: Primary source for unstructured data (PDFs, images, text files). Indexers can extract metadata and content from blobs.
- Azure Cognitive Services: Skillsets use Cognitive Services APIs (e.g., Computer Vision for OCR, Text Analytics for key phrases). You must attach a Cognitive Services resource to enable billable skills.
- Azure Data Factory: Can orchestrate data movement into Blob Storage before indexing.
- Azure Synapse Analytics: Can use search indexes as a source for enrichment pipelines.
- Power BI: Can connect to search indexes via the Search Index connector for reporting.
- Azure Functions: Custom skills can call Azure Functions for domain-specific processing.
- Azure Monitor: Logs and metrics for indexing operations, query latency, and errors.
The service is not a replacement for Azure Cosmos DB (OLTP) or Azure SQL Database (relational queries); it is complementary for search-centric workloads.
Create a Search Service
In the Azure portal, click 'Create a resource' and search for 'Azure AI Search' (older portal text may still say 'Azure Cognitive Search'). Choose a subscription, resource group, service name, location, and pricing tier. The Free tier is suitable for small-scale tests (max 3 indexes, 50 MB storage). For production, choose at least Basic or Standard. The service endpoint is https://[name].search.windows.net. After deployment, you will have admin API keys for management and query API keys for client apps.
Define a Data Source
Create a data source object that tells the indexer where to read data. For unstructured data, use Azure Blob Storage. Provide a connection string, either one that includes an account key or one that relies on a managed identity. Specify the container name and, optionally, a virtual folder within it; restricting by file type (e.g., only .pdf) is configured on the indexer (the indexedFileNameExtensions setting) rather than on the data source. The data source type must match the source: 'azureblob', 'azuresql', 'cosmosdb', etc. You can create it via the REST API, an SDK, or the portal.
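As a rough sketch of what such a definition looks like, the Python function below assembles the JSON body for a Blob Storage data source. The function name and the sample names are hypothetical, the connection string is a placeholder, and the property names reflect the REST data source schema as I understand it, so check them against the API reference before relying on them.

```python
import json


def build_blob_data_source(name, connection_string, container, folder=None):
    """Build the JSON body for a Blob Storage data source definition."""
    container_spec = {"name": container}
    if folder:
        # "query" narrows indexing to a virtual folder inside the container.
        container_spec["query"] = folder
    return {
        "name": name,
        "type": "azureblob",  # must match the kind of source being indexed
        "credentials": {"connectionString": connection_string},
        "container": container_spec,
    }


if __name__ == "__main__":
    body = build_blob_data_source(
        "mydatasource", "<storage-connection-string>", "contracts", folder="2024"
    )
    print(json.dumps(body, indent=2))
```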
Create a Search Index
Define the index schema with fields that map to the data you want to search and display. Common fields for unstructured data include: content (the extracted text), metadata_storage_name (filename), metadata_storage_size, metadata_last_modified, and any enriched fields like keyphrases, entities, or language. Each field has a type and attributes: searchable (included in full-text search), filterable (can be used in $filter), sortable, facetable, and retrievable (returned in results). Use the Azure portal's 'Import Data' wizard to auto-generate an index from a data source, or define it manually via REST.
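The schema described here can be sketched as a plain Python dictionary that mirrors the JSON you would send when defining an index manually. The index name, the `id` key field, and the particular attribute choices are illustrative assumptions, not a recommended design; `key_fields` is an invented helper.

```python
def field(name, edm_type, **attributes):
    """Describe one index field; unspecified attributes fall back to service defaults."""
    return {"name": name, "type": edm_type, **attributes}


def build_index(name):
    """Return an index definition for blob content plus a few enriched fields."""
    return {
        "name": name,
        "fields": [
            field("id", "Edm.String", key=True, filterable=True),
            field("content", "Edm.String", searchable=True, retrievable=True),
            field("metadata_storage_name", "Edm.String", searchable=True, sortable=True),
            field("metadata_storage_size", "Edm.Int64", filterable=True, sortable=True),
            field("metadata_last_modified", "Edm.DateTimeOffset", filterable=True, sortable=True),
            field("language", "Edm.String", filterable=True, facetable=True),
            field("keyphrases", "Collection(Edm.String)", searchable=True, facetable=True),
        ],
    }


def key_fields(index):
    """Return the names of fields marked as the document key."""
    return [f["name"] for f in index["fields"] if f.get("key")]
```

Every index needs exactly one key field, which is why the sketch adds `id` alongside the content and metadata fields.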
Create and Run an Indexer
An indexer automates the pipeline from data source to index. Attach it to the data source and index. Optionally attach a skillset for AI enrichment. Configure a schedule (e.g., PT5M for every 5 minutes) or run manually. The indexer reads each blob, extracts its text (embedded text in a PDF is extracted directly; scanned pages and embedded images need an OCR skill in the attached skillset), applies any skills, and writes the output to the index. Monitor indexer execution via the portal, the REST API, or an SDK; check for errors like 'Failed to decode document' or 'Skill execution timeout'.
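A minimal Python sketch of an indexer definition follows, including a check for the 5-minute minimum schedule interval mentioned in this chapter. The function names are hypothetical, the duration parser handles only simple hour/minute forms such as PT5M or PT2H30M, and the property names follow the REST indexer schema as I understand it.

```python
import re

MIN_INTERVAL_MINUTES = 5  # minimum schedule interval stated in this chapter


def interval_minutes(duration):
    """Convert a simple ISO 8601 duration such as PT5M or PT2H30M to minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def build_indexer(name, data_source, index, skillset=None, interval="PT5M"):
    """Build the JSON body for an indexer, rejecting intervals under 5 minutes."""
    if interval_minutes(interval) < MIN_INTERVAL_MINUTES:
        raise ValueError("indexer schedules cannot run more often than every 5 minutes")
    body = {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "schedule": {"interval": interval},
    }
    if skillset:
        body["skillsetName"] = skillset  # optional AI enrichment
    return body
```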
Query the Index
Use the REST API or SDK to send search requests. For POST requests, the query endpoint is https://[service].search.windows.net/indexes/[index]/docs/search?api-version=2020-06-30. Include a JSON body with the search term, filters, facets, and pagination. For example, to search for 'Azure' and filter by language 'en', use: {"search": "Azure", "filter": "language eq 'en'"}. The response includes a list of documents with relevance scores (BM25). Use the 'highlight' parameter to show matching text snippets. For advanced scenarios, enable semantic search for better understanding of natural language queries.
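The request described above can be assembled with nothing but the Python standard library. This sketch only builds the request object and does not send it; the service name, index name, and key are placeholders, and `build_search_request` is a hypothetical helper rather than part of any SDK.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"


def build_search_request(service, index, api_key, search,
                         filter_expr=None, highlight=None, top=10):
    """Build (but do not send) a POST request for the docs/search endpoint."""
    url = (f"https://{service}.search.windows.net/indexes/{index}"
           f"/docs/search?api-version={API_VERSION}")
    body = {"search": search, "top": top}
    if filter_expr:
        body["filter"] = filter_expr  # OData filter, e.g. language eq 'en'
    if highlight:
        body["highlight"] = highlight  # field(s) to return hit snippets for
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request("myservice", "myindex", "<query-key>",
                               "Azure", filter_expr="language eq 'en'",
                               highlight="content")
    print(req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; a query key, rather than an admin key, is the appropriate credential for client-side searches.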
Enterprise Scenario 1: Legal Document Review
A law firm ingests millions of PDF contracts, emails, and scanned documents into Azure Blob Storage. They need to search across all documents for specific clauses, parties, and dates. They deploy Azure AI Search with an indexer that runs nightly. A skillset extracts key phrases (e.g., 'confidentiality', 'indemnification'), entities (company names, people), and performs OCR on scanned PDFs. The index is configured with faceted navigation on document type, date range, and law firm. Lawyers use a custom web app to search and filter. Performance: With Standard S2 tier, they index 500,000 documents per day and achieve sub-second query latency. Common misconfiguration: leaving the indexer's 'parsingMode' at its default when some blobs are JSON arrays (which need 'jsonArray'), causing indexer failures. They also learned to handle large PDFs (over 100 pages) by moving slow processing into custom skills with longer timeouts.
Enterprise Scenario 2: E-commerce Product Catalog
An online retailer stores product descriptions, images, and specifications in a mix of Cosmos DB and Blob Storage. They need a unified search experience that allows customers to search by keywords, filter by category and price, and get suggestions. They use Azure AI Search with a Cosmos DB data source for structured product info and a Blob Storage data source for product images (with OCR to extract text from images). The index includes fields like productId, name, description, price, category, and imageTags (extracted via Computer Vision). They enable semantic search to handle queries like 'affordable wireless headphones'. The service is configured with 3 replicas for high availability and 2 partitions for performance. They use a custom scoring profile to boost products with higher ratings. A common pitfall: Over-fetching fields (setting 'retrievable' on all fields) increases storage and query latency. They optimized by only marking essential fields as retrievable.
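The replica and partition counts in this scenario combine into billable capacity. As a small worked example, the sketch below assumes that capacity is measured in search units equal to replicas multiplied by partitions, with a cap of 36 units per service on Standard tiers; both the cap constant and the function name are assumptions for illustration.

```python
MAX_SEARCH_UNITS = 36  # assumed cap on replicas x partitions for Standard tiers


def search_units(replicas, partitions):
    """Return the search units consumed by a replica/partition combination."""
    if replicas < 1 or partitions < 1:
        raise ValueError("a service needs at least one replica and one partition")
    units = replicas * partitions
    if units > MAX_SEARCH_UNITS:
        raise ValueError(f"{units} search units exceeds the assumed cap of {MAX_SEARCH_UNITS}")
    return units


if __name__ == "__main__":
    # The retailer's configuration: 3 replicas and 2 partitions.
    print(search_units(3, 2))
```

Replicas add query capacity and availability, while partitions add storage and indexing throughput, so the two are scaled for different reasons even though they are billed together.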
Enterprise Scenario 3: Healthcare Patient Records
A hospital digitizes patient records (PDFs, forms, lab reports) and stores them in Azure Blob Storage. They need to search for diagnoses, medications, and procedures across millions of records. They use Azure AI Search with a skillset that includes entity recognition for medical terms (using a custom entity extraction model via Custom Skill) and language detection for multilingual records. The index includes fields like patientId, date, diagnosis, medications, and procedures. Access control is implemented using Azure AD and document-level security filters (e.g., only doctors from a specific department can see certain records). They run indexing on a schedule every hour to capture new documents. Performance: Indexing 1 million documents takes about 4 hours on S2 tier. They monitor indexer errors using Azure Monitor alerts. A common issue: OCR fails on poor-quality scans; they preprocess images to enhance contrast before indexing.
What DP-900 Tests on This Topic (Objective 3.5)
The exam focuses on:
Identifying Azure AI Search as the service for search over unstructured data.
Understanding the purpose of components: index, indexer, data source, skillset.
Knowing that AI enrichment uses cognitive skills (OCR, key phrase extraction, entity recognition, language detection, sentiment analysis).
Recognizing that Azure AI Search can index data from Azure Blob Storage, Cosmos DB, SQL Database, and other sources.
Differentiating between Azure AI Search and other data services (e.g., Azure Cognitive Services, Azure Data Lake, Azure Synapse).
Common Wrong Answers and Why Candidates Choose Them
Wrong: 'Azure AI Search is used for storing structured data.' Why chosen: Candidates confuse it with Azure SQL Database or Cosmos DB. Correction: Azure AI Search is for search over unstructured data; it does not store the original data, only an index.
Wrong: 'AI enrichment is required for all unstructured data indexing.' Why chosen: Candidates think AI skills are mandatory because the service name includes 'AI'. Correction: AI enrichment is optional; you can index text-based files without any skills.
Wrong: 'Azure AI Search replaces Azure Cognitive Services.' Why chosen: Both have 'Cognitive' in their history. Correction: Azure AI Search uses Cognitive Services APIs for skills but is a separate service for search.
Wrong: 'You can query unstructured data directly in Blob Storage using Azure AI Search.' Why chosen: Candidates assume the service queries the source directly. Correction: Data must be indexed first; queries run against the index, not the source.
Specific Numbers, Values, and Terms That Appear Verbatim
BM25 is the default relevance scoring algorithm.
Skillset is the term for the AI enrichment pipeline.
Indexer automates data ingestion.
Free tier limits: 3 indexes, 50 MB storage, 10,000 documents per index.
Minimum indexer schedule interval: 5 minutes.
Supported data sources: Azure Blob Storage, Cosmos DB, SQL Database, Table Storage, Data Lake Storage Gen2.
Built-in skills: OCR, Key Phrase Extraction, Entity Recognition, Language Detection, Sentiment Analysis.
Edge Cases and Exceptions the Exam Loves to Test
Images: Without an OCR skill, images are not searchable (only metadata is indexed).
Multiple languages: Language detection skill can auto-detect language; you can then filter by language.
Custom skills: You can create custom skills using Azure Functions for domain-specific processing.
Security: Use API keys or Azure AD for authentication; implement document-level security via filters.
Semantic search: An optional feature (additional cost) that improves relevance for natural language queries.
How to Eliminate Wrong Answers
If a question asks about 'searching unstructured data', eliminate any option that describes an operational database (SQL Database, Cosmos DB) or an analytics store (Synapse, Data Lake).
If the question mentions 'AI enrichment', look for keywords like OCR, key phrases, entity recognition, sentiment.
If the question asks about 'automated data ingestion', the answer is 'indexer'.
If the question asks about 'scoring algorithm', the answer is 'BM25' (or 'TF-IDF' for older services, but BM25 is current default).
If the question mentions 'faceted navigation', it is a feature of Azure AI Search, not of the data source itself.
Azure AI Search is for building search experiences over unstructured data (text, images, documents).
Key components: data source, indexer, index, skillset (optional).
Default relevance scoring algorithm is BM25.
AI enrichment is optional and uses cognitive skills (OCR, key phrase extraction, etc.).
Indexers can run on a schedule with a minimum interval of 5 minutes.
Free tier limits: 3 indexes, 50 MB storage, 10,000 documents per index.
Data sources include Azure Blob Storage, Cosmos DB, SQL Database, and more.
Queries are executed against the index, not the source data.
Semantic search is an optional add-on for better natural language understanding.
Use API keys or Azure AD for authentication; implement document-level security with filters.
These come up on the exam all the time. Here's how to tell them apart.
Azure AI Search
Purpose: Full-text search with ranking, faceting, and filtering.
Ingests data via indexers from various sources.
Stores an index for fast querying.
Supports AI enrichment as a pipeline (skillset).
Provides search-specific features like autocomplete, suggestions, and semantic search.
Azure Cognitive Services (e.g., Text Analytics)
Purpose: Analyze text or images via pre-built AI models.
No data ingestion; you send data via API calls.
No persistent index; results are returned per request.
Offers individual APIs (e.g., key phrase extraction) without a pipeline.
Does not support search ranking or faceted navigation.
Mistake
Azure AI Search stores the original unstructured data.
Correct
It stores only a search index (inverted index and field values). The original data remains in the source (e.g., Blob Storage). The index is a copy of searchable fields, not the full document.
Mistake
AI enrichment is mandatory for every index.
Correct
AI enrichment is optional. You can index plain text files without any cognitive skills. Skills are only needed when you want to extract information from images or perform advanced text analysis.
Mistake
Azure AI Search can query data directly from Blob Storage without indexing.
Correct
No. The service must first index the data. Queries are executed against the index, not the source. Without an index, there is nothing to search.
Mistake
The Free tier supports production workloads.
Correct
The Free tier is limited to 3 indexes, 50 MB storage, and 10,000 documents per index. It is intended for development and testing only. Production requires at least Basic or Standard tier.
Mistake
Azure AI Search and Azure Cognitive Services are the same service.
Correct
They are separate. Azure Cognitive Services provides APIs for vision, speech, language, etc. Azure AI Search uses those APIs as part of skillsets but is a dedicated search service.
They are the same service. Azure Cognitive Search was renamed to Azure AI Search in 2023. The old name is still used in some documentation. For the DP-900 exam, both names refer to the same service.
Yes, but only if you include an OCR skill in the skillset. Without OCR, only image metadata (like filename, size) is indexed. The OCR skill extracts text from images, making that text searchable.
BM25 is the default scoring algorithm since July 2020. Previously, TF-IDF was used. BM25 improves relevance by saturating term frequency and normalizing by document length. You can tune BM25 parameters (k1, b) through the index's similarity settings; scoring profiles are a separate feature for boosting particular fields or values.
You can use API keys (admin and query) for service-level access. For document-level security, implement a filter that restricts results based on user identity. For example, add a field 'allowedUsers' and filter with $filter=allowedUsers/any(u: u eq 'user@domain.com').
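A filter like the one above is usually generated from the signed-in user's identity. The Python sketch below builds that expression and doubles any single quotes in the value, which is how OData string literals escape them. The function names are hypothetical, and `allowedUsers` is just the example field name from this answer.

```python
def odata_quote(value):
    """Quote a string literal for OData by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def allowed_users_filter(user, field="allowedUsers"):
    """Build a filter matching documents whose collection field contains `user`."""
    return f"{field}/any(u: u eq {odata_quote(user)})"


if __name__ == "__main__":
    print(allowed_users_filter("user@domain.com"))
```

The filter should be applied by trusted server-side code rather than supplied by the client, since a client that controls the filter string could request documents it should not see.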
An indexer orchestrates the data ingestion pipeline: it reads from a data source, optionally runs a skillset, and writes to an index. A skillset is a collection of cognitive skills that transform or enrich data during indexing. You can have an indexer without a skillset, but a skillset is always attached to an indexer.
Yes, Azure AI Search can index structured data from sources like Azure SQL Database or Cosmos DB. However, its strength is full-text search over unstructured content. For pure structured queries, a relational database is more efficient.
Tiers: Free (F), Basic (B), Standard (S1, S2, S3), Storage Optimized (L1, L2). Free is for development. Basic offers 2 GB storage. Standard scales up to 200 GB (S3). Storage Optimized tiers are for large indexes with lower query throughput.