This chapter covers search indexes and full-text search concepts in Azure, a key topic in the Analytics domain of DP-900 (Objective 3.5). You will learn how Azure Cognitive Search (since renamed Azure AI Search) builds and uses indexes to enable fast, relevant full-text queries over structured and unstructured data. Expect about 5-10% of exam questions to touch on search indexes, index components, and query capabilities. Mastering this material will help you differentiate between search indexes and traditional database indexes, understand how to design an index schema, and know when to use Azure Cognitive Search.
Imagine a large library with millions of books. Instead of reading every book to find a specific phrase, the library maintains a card catalog: a structured index of every word and its location in every book. Each card lists a word, the book title, page number, and paragraph. When a patron searches for 'data lake', the librarian consults the card catalog, finds all cards for 'data' and 'lake', intersects the book lists, and retrieves the relevant pages. This is how a full-text search index works: it pre-processes documents into an inverted index that maps each token (word) to the documents and positions where it occurs. Without the catalog, the librarian would have to scan every book (the equivalent of a full table scan). The catalog is updated when new books arrive (index maintenance). The librarian can also handle word variations (stemming); for example, 'running' and 'runs' map to the same card, 'run'. The cards themselves hold only words and pointers, just as the inverted index holds tokens, document IDs, and positions; the search service keeps field values in separate structures so that matching documents can be returned. This analogy mirrors Azure Cognitive Search: you define an index (the catalog), the search service tokenizes and normalizes text, and queries use the index to return results in milliseconds instead of minutes.
What is a Search Index?
A search index is a persistent data structure optimized for fast full-text search queries. Unlike a traditional database index (which speeds up exact-match lookups or range scans), a search index supports tokenization, stemming, fuzzy matching, scoring, and relevance ranking. In Azure Cognitive Search, the index is the primary resource that stores searchable content. It is analogous to a table in a relational database but designed for information retrieval.
Why It Exists
Full-text search is needed when users want to find documents based on words or phrases, not just exact IDs or key values. For example, searching for 'cloud computing' in a set of technical articles should return documents containing 'cloud', 'computing', or both, ranked by relevance. A SQL predicate such as LIKE '%cloud%' is slow because the leading wildcard generally prevents use of an ordinary index, so every row is scanned, and it applies no linguistic processing. A search index pre-processes documents to enable sub-second responses even on millions of documents.
How It Works Internally
When you create an Azure Cognitive Search index, you define a schema with fields, each having attributes like searchable, filterable, sortable, facetable, and retrievable. The indexing process works as follows:
Document Ingestion: Documents are pushed to the search service via the REST API or an SDK, or pulled from a data source (Azure SQL, Blob Storage, Cosmos DB) by an indexer.
Tokenization: Each searchable field's text is broken into tokens (words) by an analyzer. The default standard Lucene analyzer splits on whitespace and punctuation and lowercases tokens; language-specific analyzers additionally remove stop words (e.g., 'the', 'and').
Normalization: Tokens are reduced to a common form. With an English language analyzer, for example, 'running' becomes 'run' via stemming.
Inverted Index Creation: For each unique token, an inverted index is built mapping the token to a list of document IDs and positions within the document. This is the core data structure enabling fast lookup.
Storage: The inverted index is stored in a compressed, optimized format on disk. Additional structures store field values for sorting, filtering, and faceting.
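The indexing steps above can be sketched in a few lines of Python. This is a toy illustration, not how the search service is implemented: the `tokenize` and `build_inverted_index` helpers and the sample documents are hypothetical, and the tokenizer only lowercases and splits on non-alphanumeric characters.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index

# Hypothetical sample documents keyed by document ID.
docs = {
    "1": "Data lake storage for analytics",
    "2": "A lake house by the lake",
    "3": "Relational data in tables",
}
index = build_inverted_index(docs)

print(index["lake"])                             # {'1': [1], '2': [1, 5]}
print(set(index["data"]) & set(index["lake"]))   # {'1'}
```

Intersecting the document sets for 'data' and 'lake' is the same move the librarian makes with the card catalog: no document text is scanned at query time.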
Query Execution
When a user submits a search query (e.g., 'data lake'):
The query string is parsed and tokenized using the same analyzer used during indexing (ensuring consistency).
The search engine looks up each token in the inverted index.
It computes a score for each matching document using a relevance algorithm: classic term frequency-inverse document frequency (TF-IDF) on older services, or BM25, which newer services use by default.
Results are sorted by score and returned, along with any requested fields, highlights, or facets.
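To make the scoring step concrete, here is a minimal BM25 sketch over a tiny corpus. It is an illustration under stated assumptions: the helper names and sample documents are hypothetical, the parameters `k1=1.2` and `b=0.75` are commonly used values chosen for this example, and a production engine adds many refinements on top of this formula.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return {doc_id: score} for documents matching at least one query term."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores

def search(query, docs):
    """Return matching document IDs, best score first."""
    scores = bm25_scores(query, docs)
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "1": "Data lake storage for analytics",
    "2": "A lake house by the lake",
    "3": "Relational data in tables",
}
print(search("data lake", docs))  # ['1', '2', '3']
```

Document 1 ranks first because it contains both query terms; documents 2 and 3 each match only one.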
Key Components, Values, Defaults, and Timers
Index Schema: Must define at least one field as key (unique identifier). Fields have types: Edm.String, Edm.Int32, Edm.DateTimeOffset, etc.
Analyzers: Default is standard.lucene. You can choose language-specific analyzers (e.g., en.microsoft, fr.microsoft) or custom analyzers with custom tokenizers and filters.
Scoring Profiles: Optional; they boost scores based on field weights or functions (freshness, distance).
Suggesters: Enable autocomplete and search-as-you-type. They are built on a separate data structure (trie) and require a suggester definition in the index.
Index Size: Depends on the number of documents and fields. The Free tier is limited to 50 MB of storage or 10,000 documents (whichever is reached first). Paid tiers have higher limits: the S2 tier, for example, allows up to 100 GB per partition.
Replicas and Partitions: You can scale an index by increasing replicas (for higher query throughput) and partitions (for larger index size). The default is 1 replica and 1 partition.
Indexing Latency: Near-real-time for push-based indexing (typically under 5 seconds). For pull-based indexing (indexers), latency depends on how often the indexer runs: on demand, or on a schedule whose shortest supported interval is 5 minutes.
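The suggester entry in the list above mentions a trie. As a rough sketch of why a trie suits search-as-you-type, the toy class below indexes every word of each field value so that a prefix of any word finds the value. The class names and hotel names are hypothetical, and nothing here reflects the service's actual internal structure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.values = set()  # full field values reachable through this node

class PrefixSuggester:
    """Toy suggester: a prefix of any word in a value matches that value."""

    def __init__(self, values):
        self.root = TrieNode()
        for value in values:
            for word in value.lower().split():
                node = self.root
                for ch in word:
                    node = node.children.setdefault(ch, TrieNode())
                    node.values.add(value)

    def suggest(self, prefix, top=5):
        """Walk the trie one character at a time; stop early if no branch matches."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted(node.values)[:top]

suggester = PrefixSuggester(["Grand Hotel", "Harbor Inn", "Hotel Granada"])
print(suggester.suggest("gra"))  # ['Grand Hotel', 'Hotel Granada']
print(suggester.suggest("har"))  # ['Harbor Inn']
```

The lookup cost depends on the length of the typed prefix rather than on how many values are indexed, which is what makes suggestions feel instant.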
Configuration and Verification Commands
You can create an index using the Azure portal, REST API, or SDK. Here's an example REST API call to create an index:
POST https://[service name].search.windows.net/indexes?api-version=2020-06-30
Content-Type: application/json
api-key: [admin key]
{
  "name": "hotels-index",
  "fields": [
    { "name": "HotelId", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "HotelName", "type": "Edm.String", "searchable": true, "filterable": true },
    { "name": "Description", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft" },
    { "name": "Category", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "Tags", "type": "Collection(Edm.String)", "searchable": true },
    { "name": "Rating", "type": "Edm.Int32", "filterable": true, "sortable": true },
    { "name": "LastRenovationDate", "type": "Edm.DateTimeOffset", "filterable": true, "sortable": true }
  ],
  "suggesters": [
    { "name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": ["HotelName"] }
  ]
}
To verify that an index exists and view its statistics:
GET https://[service name].search.windows.net/indexes/hotels-index/stats?api-version=2020-06-30
api-key: [admin key]
The response includes documentCount and storageSize (in bytes). Newer API versions also report vectorIndexSize.
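Before sending an index definition, it can help to check it locally against the rules this chapter stresses. The function below is a hypothetical sanity check written for this guide, not part of any Azure SDK: it verifies that there is exactly one key field, that the key is Edm.String, and that each suggester points at fields that exist. The service performs its own, more thorough validation.

```python
def validate_index_schema(schema):
    """Return a list of problems found in an index definition (empty if none)."""
    problems = []
    fields = schema.get("fields", [])
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")
    for f in keys:
        if f.get("type") != "Edm.String":
            problems.append(f"key field {f['name']} must be Edm.String")
    field_names = {f["name"] for f in fields}
    for s in schema.get("suggesters", []):
        for source in s.get("sourceFields", []):
            if source not in field_names:
                problems.append(f"suggester {s['name']} references unknown field {source}")
    return problems

good = {
    "name": "hotels-index",
    "fields": [
        {"name": "HotelId", "type": "Edm.String", "key": True},
        {"name": "HotelName", "type": "Edm.String", "searchable": True},
    ],
    "suggesters": [
        {"name": "sg", "searchMode": "analyzingInfixMatching", "sourceFields": ["HotelName"]}
    ],
}
bad = {
    "name": "broken-index",
    "fields": [{"name": "Id", "type": "Edm.Int32", "key": True}],
    "suggesters": [{"name": "sg", "sourceFields": ["Title"]}],
}

print(validate_index_schema(good))  # []
print(validate_index_schema(bad))
```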
Interaction with Related Technologies
Azure Cognitive Search can index data from Azure SQL Database, Azure Cosmos DB, Azure Blob Storage (including PDFs, Office docs, images with OCR), and Azure Table Storage via indexers.
AI Enrichment: Using cognitive skills (e.g., entity recognition, key phrase extraction, translation) to enrich documents during indexing. These skills are defined in a skillset attached to an indexer.
Knowledge Store: An optional projection of enriched data into Azure Storage for downstream analytics.
Synonym Maps: Extend query capabilities by defining equivalent terms (e.g., 'car' and 'automobile'). Uploaded as a separate resource and referenced in index fields.
Semantic Search: A premium feature that uses deep learning models to understand query intent and provide more relevant results, including captions and answers.
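The synonym map entry above can be pictured as query-time expansion. The sketch below assumes equivalence rules written one per line as comma-separated terms (for example 'car, automobile'); the two helper functions are hypothetical and handle only that simple rule form.

```python
def parse_equivalence_rules(rules_text):
    """Parse lines like 'car, automobile' into a term -> set-of-equivalents mapping."""
    mapping = {}
    for line in rules_text.splitlines():
        terms = {t.strip().lower() for t in line.split(",") if t.strip()}
        for term in terms:
            mapping.setdefault(term, set()).update(terms)
    return mapping

def expand_query(query, mapping):
    """Replace each query term with the sorted list of its equivalents."""
    return [sorted(mapping.get(term, {term})) for term in query.lower().split()]

rules = "car, automobile\nplaintiff, claimant"
mapping = parse_equivalence_rules(rules)

print(expand_query("used car", mapping))  # [['used'], ['automobile', 'car']]
```

Each inner list is treated as a set of alternatives, so a query for 'car' also matches documents that only say 'automobile', without re-indexing any content.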
Indexing Process in Detail
Define Data Source: Specifies connection string, container/table, and credentials.
Create Index: Schema as above.
Create Indexer: Orchestrates data extraction, optional skillset execution, and index population. Indexers can run on a schedule or on-demand.
Monitor: Check indexer execution history via portal or REST API.
Querying the Index
Search queries use either the simple query syntax (the default) or the full Lucene query syntax. Example:
POST https://[service name].search.windows.net/indexes/hotels-index/docs/search?api-version=2020-06-30
Content-Type: application/json
api-key: [query key]
{
  "search": "luxury hotel",
  "queryType": "simple",
  "searchMode": "any",
  "filter": "Rating gt 4",
  "facets": ["Category"],
  "top": 10
}
searchMode=any returns documents matching any term (OR). searchMode=all requires all terms (AND).
Filters are OData expressions that restrict which documents can appear in the result set.
Facets provide counts for each value in a facetable field (e.g., Category: {Budget: 15, Luxury: 8}).
Ordering, field selection, and total counts are also supported: orderby, select, and count in a POST body, or $orderby, $select, and $count on a GET query string.
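The difference between the two search modes, and how a filter and facet counts shape the response, can be simulated locally. This is a simplified model with hypothetical documents and a hypothetical `search` function; the `min_rating` argument merely stands in for an OData filter such as 'Rating gt 4'.

```python
from collections import Counter

docs = [
    {"id": "1", "text": "luxury hotel downtown", "Category": "Luxury", "Rating": 5},
    {"id": "2", "text": "budget hotel near airport", "Category": "Budget", "Rating": 3},
    {"id": "3", "text": "luxury spa resort", "Category": "Luxury", "Rating": 5},
]

def search(query, docs, search_mode="any", min_rating=None):
    """Return (matching IDs, facet counts by Category)."""
    terms = query.lower().split()
    combine = any if search_mode == "any" else all  # OR versus AND
    hits = []
    for doc in docs:
        if min_rating is not None and not doc["Rating"] > min_rating:
            continue  # stands in for the filter "Rating gt N"
        words = set(doc["text"].split())
        if combine(term in words for term in terms):
            hits.append(doc)
    facets = Counter(doc["Category"] for doc in hits)
    return [doc["id"] for doc in hits], dict(facets)

print(search("luxury hotel", docs, "any"))                # (['1', '2', '3'], {'Luxury': 2, 'Budget': 1})
print(search("luxury hotel", docs, "all"))                # (['1'], {'Luxury': 1})
print(search("luxury hotel", docs, "any", min_rating=4))  # (['1', '3'], {'Luxury': 2})
```

Note that facet counts are computed over the documents that survive both the search terms and the filter, which is why the Budget bucket disappears in the last call.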
Performance Considerations
Index Size: Keep fields searchable only when needed. Every searchable field increases index size and indexing time.
Replicas: Add replicas for higher QPS. Each replica is a copy of the index; queries are load-balanced.
Partitions: Add partitions to distribute index data across multiple nodes. This increases index capacity but does not improve query throughput (actually adds overhead for distributed queries).
Scoring: Complex scoring profiles can slow down queries. Use them judiciously.
Indexing vs. Querying: They share resources. Heavy indexing can degrade query performance. Schedule indexing during low-load periods.
Default Values and Limits
Maximum index fields: 1000 (S2 tier).
Maximum indexer execution time: 24 hours.
Maximum index size per partition: varies by tier (e.g., S1: 25 GB, S2: 100 GB).
Maximum number of indexes per service: 50 (S1), 200 (S2).
Query timeout: 30 seconds default, configurable up to 120 seconds.
Maximum number of terms in a search query: 1024 (Lucene limit).
Common Exam Traps
Confusing search indexes with database indexes: A database index (B-tree) is for exact lookups; a search index (inverted index) is for full-text search. They are not interchangeable.
Assuming all fields are searchable by default: Only fields marked searchable: true are tokenized and indexed for full-text search.
Thinking fuzzy search is automatic: Fuzzy search requires the ~ operator (e.g., blue~). Simple queries do not support it by default.
Forgetting that the key field must be a string: In Azure Cognitive Search, the document key must be of type Edm.String.
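Fuzzy search, mentioned in the traps above, matches terms within a small edit distance of the query term. The sketch below uses plain Levenshtein distance and an assumed limit of two edits; the vocabulary is made up, and Lucene's own fuzzy matching differs in detail (for example, in how it counts transpositions).

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]

def fuzzy_match(term, vocabulary, max_edits=2):
    """Return vocabulary words within max_edits of term (max_edits is an assumption)."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_edits)

vocabulary = ["blue", "blues", "glue", "black", "green"]
print(fuzzy_match("blue", vocabulary))  # ['blue', 'blues', 'glue']
```

This is the extra work behind a query like blue~, which helps explain why fuzzy matching is something you ask for explicitly.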
Define Index Schema
Create the index structure by defining fields with names, types, and attributes. Each field can be searchable, filterable, sortable, facetable, or retrievable. The key field must be a string. You can also define suggesters for autocomplete and scoring profiles for custom ranking. This step is analogous to creating a table schema in a database, but with additional search-specific attributes.
Ingest Documents
Documents are pushed to the index via REST API or pulled using an indexer from a data source. Each document must have a unique key value. For push-based indexing, documents are sent as JSON. For pull-based, an indexer reads from Azure SQL, Blob Storage, etc. The indexer can also apply cognitive skills for enrichment. Pushed documents are sent in batches of up to 1,000 documents or 16 MB per request.
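For push-based indexing, a client typically splits its documents into batches before sending them. The sketch below only builds the request bodies and sends nothing; the helper names and generated hotel documents are hypothetical, while the body shape (a value array whose items carry an @search.action) follows the push API's documented format.

```python
import json

def make_upload_batches(documents, max_docs=1000):
    """Yield request bodies of at most max_docs documents, each marked for upload."""
    for start in range(0, len(documents), max_docs):
        chunk = documents[start:start + max_docs]
        yield {"value": [{"@search.action": "upload", **doc} for doc in chunk]}

def body_size_bytes(batch):
    """Size of the JSON-encoded body, useful for staying under the size limit."""
    return len(json.dumps(batch).encode("utf-8"))

# Hypothetical documents; each one carries a unique string key (HotelId).
documents = [{"HotelId": str(i), "HotelName": f"Hotel {i}"} for i in range(2500)]
batches = list(make_upload_batches(documents))

print([len(b["value"]) for b in batches])  # [1000, 1000, 500]
print(body_size_bytes(batches[0]) < 16 * 1024 * 1024)  # True
```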
Tokenization and Analysis
Each searchable field's text is passed through an analyzer that performs tokenization (splitting into words) and lowercasing and, depending on the analyzer, stop word removal and stemming. The default analyzer is standard Lucene. Language-specific analyzers handle linguistic variations. Custom analyzers can use custom tokenizers and filters. This step converts raw text into normalized tokens for the inverted index.
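An analyzer is easiest to picture as a chain of small stages. The toy pipeline below is only a sketch: the stop word list and the suffix-stripping stemmer are deliberately naive stand-ins, and because the stages that run depend on the analyzer you choose, they are exposed here as flags.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "in"}  # illustrative, not a real analyzer's list

def naive_stem(token):
    """Strip a few common English suffixes; real stemmers are far more careful."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text, remove_stop_words=True, stem=True):
    """Tokenize and lowercase, then optionally drop stop words and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

print(analyze("The Walking tours and the parks"))  # ['walk', 'tour', 'park']
print(analyze("walked park"))                      # ['walk', 'park']
```

Because the document text and the query text pass through the same chain, 'walked' in a query lines up with 'Walking' in a document.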
Build Inverted Index
For each unique token, the search engine creates an inverted index entry mapping the token to a list of document IDs and positions. This is stored in a compressed format. Additional structures store field values for filtering, sorting, and faceting. Looking up a token goes straight to its entry rather than scanning documents, so lookup cost stays nearly flat as the collection grows, which keeps full-text search fast even on large datasets.
Execute Query and Rank Results
When a search query is submitted, it is tokenized using the same analyzer. The engine looks up each token in the inverted index, computes a relevance score (TF-IDF or BM25) for each matching document, and returns results sorted by score. Filters, facets, and sorting are applied after the search. The query can use simple syntax (AND/OR, phrase searches) or full Lucene syntax (fuzzy, proximity, regex).
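Proximity search, one of the full Lucene features named above, relies on the token positions stored in the index. The sketch below checks whether two terms occur within a given number of positions of each other in one document. It is a simplification: the function names and sample sentences are hypothetical, and Lucene defines its proximity distance somewhat differently.

```python
import re

def positions(text):
    """Return {token: [positions]} for one document."""
    result = {}
    for pos, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        result.setdefault(token, []).append(pos)
    return result

def within(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur no more than max_distance positions apart."""
    pos = positions(text)
    return any(
        abs(a - b) <= max_distance
        for a in pos.get(term_a, [])
        for b in pos.get(term_b, [])
    )

near = "The duty of care and negligence"
far = "Negligence requires a breach of the duty of care"

print(within(near, "negligence", "duty", 5))  # True  (positions 5 and 1)
print(within(far, "negligence", "duty", 5))   # False (positions 0 and 6)
```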
Enterprise Scenario 1: E-Commerce Product Search
A large online retailer needs to provide fast, relevant product search across millions of SKUs. They use Azure Cognitive Search with an index containing fields like ProductName, Description, Category, Price, and Tags. The Description field uses the en.microsoft analyzer for better stemming. A scoring profile boosts products with higher ratings and newer release dates. The search is integrated into the website via the REST API. The index is updated every 15 minutes via an indexer pulling from Azure SQL Database. Without proper index design, queries like 'laptop' might return irrelevant results because of poor field weighting. Misconfigured analyzers could cause 'running shoes' to miss 'run shoes' if stemming is not applied. The retailer also uses suggesters for autocomplete, reducing typing effort. Performance is critical: they need sub-500ms response times at 10,000 QPS. They scale to 6 replicas and 2 partitions. A common mistake is making too many fields searchable, causing index bloat and slower indexing.
Enterprise Scenario 2: Legal Document Discovery
A law firm indexes millions of legal documents (PDFs, Word files) stored in Azure Blob Storage. They use Azure Cognitive Search with AI enrichment to extract key entities (people, organizations, dates) and key phrases. The index includes Content (searchable), CaseNumber (filterable), and DateFiled (sortable). They use a custom analyzer that preserves hyphenated terms like 'pre-trial'. The indexer runs daily with a skillset that calls Text Analytics for entity recognition. The firm needs to support complex queries with Boolean operators and proximity search (e.g., 'negligence' within 5 words of 'duty'). Without proper use of search modes (any vs. all), they might get too many or too few results. They also use synonym maps to include legal terminology (e.g., 'plaintiff' and 'claimant'). A typical pitfall is forgetting to set searchable on the content field, which leaves it unavailable for full-text search.
Enterprise Scenario 3: Customer Support Knowledge Base
A software company builds a knowledge base for customer support agents. They index articles from a Cosmos DB collection. The index has fields Title, Body, Product, and Version. They implement a scoring profile that boosts articles marked as 'official' and penalizes outdated versions. They use semantic search (premium tier) to provide direct answers to questions like 'How do I reset my password?'. The index is updated in near-real-time as new articles are added. A common misconfiguration is not defining a suggester for search-as-you-type, forcing users to type complete queries. Another is omitting filters that restrict results to the customer's product version, which leads to irrelevant articles. Monitoring indexer execution history reveals failures due to schema mismatches (e.g., a field expected to be a string but received an integer).
Exactly What DP-900 Tests on This Topic (Objective 3.5)
The DP-900 exam focuses on the conceptual understanding of search indexes and full-text search, not on administrative details or pricing. You need to know:
What a search index is and how it differs from a database index.
The purpose of an analyzer in full-text search.
How to design a simple index schema (fields with attributes).
The difference between push and pull indexing.
Basic query capabilities: simple search, filters, facets, scoring.
The role of Azure Cognitive Search in a data analytics pipeline.
Integration with other Azure services (SQL, Blob, Cosmos DB).
Common Wrong Answers and Why Candidates Choose Them
1. Wrong: 'A search index is the same as a clustered index in SQL Server.'
Why chosen: Candidates confuse the term 'index'. They think all indexes are B-trees. Reality: Search indexes use inverted indexes for full-text search.
2. Wrong: 'All fields in a search index are automatically searchable.'
Why chosen: They assume default behavior. Reality: Only fields explicitly marked searchable are tokenized.
3. Wrong: 'Fuzzy search is enabled by default in simple queries.'
Why chosen: They think 'search' automatically handles typos. Reality: Fuzzy search requires the ~ operator and Lucene query syntax.
4. Wrong: 'You can use a search index to perform real-time analytics on streaming data.'
Why chosen: They confuse search with real-time processing. Reality: Search indexes are for queries on indexed data, not for streaming analytics.
Specific Numbers, Values, and Terms on the Exam
Analyzer types: Standard Lucene, language-specific (e.g., en.microsoft), custom.
Field attributes: searchable, filterable, sortable, facetable, retrievable.
Key field: Must be Edm.String.
Query types: Simple (default) and Full Lucene.
Scoring: TF-IDF or BM25.
Indexers: For pull-based indexing from Azure data sources.
Skillsets: For AI enrichment.
Synonyms: Defined in a synonym map.
Edge Cases and Exceptions
Case sensitivity: By default, search is case-insensitive because analyzers lowercase tokens. You can use a custom analyzer to preserve case.
Stop words: Language-specific analyzers typically remove them. If you need to search for 'the', use an analyzer that keeps stop words, such as standard Lucene or a custom analyzer.
Special characters: The standard analyzer strips punctuation. To search for 'C++', use a custom analyzer that treats '+' as part of the token.
Multiple languages: Use separate fields with different analyzers for each language.
How to Eliminate Wrong Answers Using the Underlying Mechanism
If a question mentions 'fast search for words or phrases' and an option says 'database index', eliminate it because database indexes are for exact lookups.
If a question mentions 'ranking by relevance', the answer must involve a search index with scoring.
If a question mentions 'autocomplete', the answer must involve a suggester.
If a question mentions 'processing images or text extraction', the answer involves AI enrichment via skillsets.
Remember: DP-900 is a fundamentals exam. Focus on what the technology does and why it is used, not on how to configure every detail.
A search index uses an inverted index to enable fast full-text search across millions of documents.
Fields must be explicitly marked as searchable to be tokenized and added to the inverted index.
The key field in Azure Cognitive Search must be of type Edm.String.
Analyzers perform tokenization and lowercasing and, depending on the analyzer, stop word removal and stemming; default is standard Lucene.
Scoring is based on TF-IDF or BM25; you can customize with scoring profiles.
Indexers pull data from Azure sources; push API sends documents directly.
Suggesters enable autocomplete and search-as-you-type.
Fuzzy search requires Lucene query syntax and the ~ operator.
Search indexes are not for real-time streaming analytics.
Synonyms are defined in a synonym map and applied at query time.
These come up on the exam all the time. Here's how to tell them apart.
Search Index (Azure Cognitive Search)
Uses inverted index for full-text search
Supports tokenization, stemming, fuzzy search
Returns ranked results by relevance
Can index unstructured text (PDFs, Word docs)
Supports facets, suggesters, scoring profiles
Database Index (SQL Server)
Uses B-tree or hash index for exact lookups
Does not tokenize; searches exact values
Returns exact matches or range results
Indexes structured data only (columns)
Supports only sort and filter; no relevance scoring
Mistake
A search index is just like a database index.
Correct
A database index (e.g., B-tree) speeds up exact-match or range queries on structured data. A search index uses an inverted index optimized for full-text search, supporting tokenization, stemming, fuzzy matching, and relevance scoring.
Mistake
All fields in a search index are automatically searchable.
Correct
Only fields explicitly defined with `searchable: true` are tokenized and added to the inverted index. Fields without this attribute can be used for filtering, sorting, or faceting but not for full-text search.
Mistake
Fuzzy search is enabled by default in simple queries.
Correct
Simple query syntax does not support fuzzy search. You must use Lucene query syntax and append the `~` operator (e.g., `blue~`). The default query type is simple.
Mistake
You can perform real-time analytics on streaming data using a search index.
Correct
Search indexes are designed for querying indexed documents, not for real-time stream processing. For streaming analytics, use Azure Stream Analytics or Azure Synapse Analytics.
Mistake
The key field in a search index can be any data type.
Correct
The key field must be of type `Edm.String`. Other types such as integers are not allowed as document keys in Azure Cognitive Search; a numeric ID or GUID has to be stored as its string representation.
A search index is a persistent data structure that stores tokenized, normalized text and metadata for fast full-text search. It is built by defining a schema with fields and attributes, then populating it with documents. The index uses an inverted index to map tokens to document IDs, enabling sub-second queries. Unlike a database index, it supports linguistic processing, fuzzy matching, and relevance ranking.
During indexing, text is tokenized and normalized by an analyzer. An inverted index is built mapping each token to document IDs and positions. When you query, the search engine tokenizes the query string, looks up tokens in the inverted index, calculates relevance scores (TF-IDF or BM25), and returns ranked results. Filters, facets, and sorting are applied post-search.
Push indexing involves sending documents directly to the search service via REST API or SDK. It offers near-real-time updates (seconds). Pull indexing uses an indexer to fetch data from a supported source (Azure SQL, Blob, Cosmos DB) on a schedule or on-demand. Pull indexing can also apply AI enrichment via skillsets. Push is simpler for custom applications; pull is better for automated ingestion from existing data stores.
An analyzer is a component that processes text during indexing and querying. It performs tokenization (splitting into words) and lowercasing and, depending on the analyzer, stop word removal and stemming. The default is standard Lucene. You can choose language-specific analyzers (e.g., en.microsoft) or create custom analyzers with custom tokenizers and filters. By default the same analyzer is used for indexing and querying, which keeps query tokens consistent with indexed tokens.
Use scoring profiles to boost fields or apply functions (e.g., freshness, distance). Use language-specific analyzers for better stemming. Add synonyms to handle different terms with the same meaning. Use semantic search (premium tier) for deep understanding. Ensure fields are properly weighted in the search schema. Test queries with different search modes (any vs. all) and use filters to narrow results.
Facets are counts of documents for each value in a facetable field. They enable drill-down navigation in search results. For example, a 'Category' facet might show 'Laptop: 45, Phone: 30'. Facets are returned in the query response and are used to build filter UI elements. Only fields marked `facetable: true` can be used for faceting.
A suggester is a feature that enables autocomplete and search-as-you-type suggestions. It is defined in the index schema with a list of source fields. The suggester builds a separate data structure (trie) for fast prefix matching. Queries use the `suggest` API to return suggested terms. It is useful for improving user experience in search boxes.