This chapter covers Microsoft Purview Data Map and its scanning capabilities, a core component of the Microsoft Purview governance portal. Understanding how Purview discovers, classifies, and catalogs data assets is essential for the DP-900 exam, particularly in the 'Describe core data concepts' domain. Approximately 10-15% of exam questions touch on data cataloging, classification, and scanning concepts. You will learn the architecture, scanning mechanics, classification rules, and how Purview integrates with Azure data services to provide a unified data map.
Imagine a massive university library with millions of books across multiple buildings, each with its own filing system. The library wants a single, searchable catalog that shows where every book is, what language it's in, its subject, and who last borrowed it. The catalog doesn't store the books themselves—it stores metadata about them. Microsoft Purview Data Map is like that unified library catalog for an organization's data assets. Just as a library catalog has librarians who walk through each building, record book details, and update the central index, Purview uses scanners that connect to data sources (like Azure SQL Database, Azure Blob Storage, or on-premises SQL Server) and extract metadata—schema, classifications, lineage—into a central map. The catalog then allows data consumers to search, browse, and understand data assets without needing direct access to the underlying systems. Without Purview, finding data across hundreds of databases and lakes is like trying to find a book in a library with no catalog—you'd have to physically walk through every aisle and open every book. Purview's scanning is not a one-time event; it runs on schedules (e.g., weekly full scan, daily incremental scan) to keep the catalog up-to-date as new tables, columns, or files are added. The scan rules (like classification regex patterns) act as the cataloging rules that determine how each book is categorized.
What is Microsoft Purview Data Map?
Microsoft Purview Data Map is a fully managed, cloud-native service that provides a unified map of an organization's data landscape. It automatically discovers and catalogs metadata from various data sources—both on-premises and in the cloud—including Azure SQL Database, Azure Synapse Analytics, Azure Data Lake Storage, Power BI, and more. The Data Map is the foundation for data discovery, data lineage, and data governance in Microsoft Purview.
Why does it exist?
Modern enterprises have data scattered across hundreds of databases, data lakes, and analytics systems. Without a central catalog, data consumers (data scientists, analysts, engineers) waste time finding the right data, understanding its meaning, and assessing its quality. The Data Map solves this by providing a single pane of glass for data discovery and metadata management.
How does scanning work?
Scanning is the process by which Purview connects to a data source, extracts metadata (schema, table names, column names, data types), samples data, and classifies sensitive data (e.g., credit card numbers, social security numbers). A scan rule set governs the scan by defining which classifications to apply. Each scan is executed by a scan runtime that runs in the Purview managed virtual network (or on a self-hosted integration runtime for on-premises sources).
Key components and defaults:
Data Source: The system being scanned (e.g., Azure SQL Database, Azure Blob Storage).
Scan: A scheduled or one-time process that connects to a data source and extracts metadata.
Scan Rule Set: A collection of classification rules and criteria. Purview provides built-in system rule sets (e.g., "System.Default" for Azure SQL) that include over 100 built-in classification types (e.g., Credit Card Number, Email Address, National ID). Custom rule sets can be created.
Classification: A label applied to a column or asset based on its data pattern (e.g., "Credit Card Number"). Classifications are defined by regular expressions and can have minimum match thresholds (e.g., 50% of values in a column must match the pattern to classify).
Schedule: Scans can run once or on a recurring schedule (e.g., weekly full scan, daily incremental scan). The default schedule for a new scan is "Once".
Incremental vs. Full Scan: A full scan re-evaluates all assets. An incremental scan only processes assets that have changed since the last scan (based on modification timestamps). Incremental scans are faster and consume fewer resources.
Scan Level: For databases, you can scan all tables/views or a subset. For storage, you can scan specific folders/containers.
Self-hosted Integration Runtime (SHIR): Required for scanning on-premises data sources or sources in virtual networks. The SHIR must be installed on a machine that can reach the data source and communicate with Purview over HTTPS (port 443).
Managed Virtual Network: For Azure sources, Purview uses a managed VNet to securely connect to data sources without exposing them to the internet.
Step-by-step scanning mechanism:
Create a data source registration: User registers a data source in Purview Studio, providing connection details (server name, database name, authentication method).
Define a scan: User selects the data source, chooses a scan rule set (built-in or custom), sets the schedule, and specifies the scope (e.g., all tables or a subset).
Scan execution: Purview's scan runtime connects to the data source using the provided credentials. For Azure SQL, it uses a read-only user (recommended) to query system tables (e.g., INFORMATION_SCHEMA.COLUMNS, sys.tables) to extract schema. For Azure Blob Storage, it enumerates containers, directories, and files, reading metadata (file name, size, last modified) and sample data (first 100 rows or 1 MB, whichever is smaller).
Classification: For each column or file, Purview applies the scan rule set. It samples data (default: 100 rows for databases, 100 files for storage) and runs regular expressions against the sample. If the share of sampled values matching a pattern meets the rule's threshold (default: 50%), the classification is applied.
Ingestion into Data Map: The extracted metadata (schema, classifications, lineage) is stored in the Purview Data Map as assets (e.g., SQL Table, SQL Column, Azure Blob File).
Post-scan actions: Users can view the scan results in Purview Studio, approve or reject classifications, and add custom metadata (e.g., descriptions, glossary terms).
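The sample-and-threshold logic in the classification step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Purview's implementation: the function name `classify_column`, the email pattern, and the sample data are all hypothetical, and only the 100-value sample and 50% threshold come from the text above.

```python
import re

# Hypothetical pattern for illustration only; not an actual Purview rule.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def classify_column(values, pattern, sample_size=100, threshold=0.5):
    """Return True when enough sampled values match the pattern.

    Only the first `sample_size` non-null values are inspected, mirroring
    the idea that a scan classifies from a sample rather than the full column.
    """
    sample = [v for v in values[:sample_size] if v is not None]
    if not sample:
        return False
    matches = sum(1 for v in sample if pattern.fullmatch(str(v)))
    return matches / len(sample) >= threshold


emails = ["a@example.com", "b@example.com", "n/a", "c@example.com"]
notes = ["call back", "a@example.com", "urgent", "closed"]

print(classify_column(emails, EMAIL_PATTERN))  # 3 of 4 values match
print(classify_column(notes, EMAIL_PATTERN))   # 1 of 4 values match
```

Lowering `threshold` makes the second column classify as well, which is the false-positive risk of a threshold set too low.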
Configuration and verification:
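To create a scan from the command line (illustrative syntax; check the command group and flags against the Azure CLI extension or Purview CLI version you have installed):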
az purview scan create --account-name MyPurviewAccount \
--data-source-name MyAzureSqlSource \
--name MyScan \
--scan-ruleset-name "System.Default" \
--collection-name "MyCollection" \
--resource-group MyResourceGroup \
--schedule "{\"interval\":\"Week\",\"startTime\":\"2025-01-01T00:00:00Z\"}"
To view scan history:
az purview scan list-runs --account-name MyPurviewAccount \
--data-source-name MyAzureSqlSource \
--scan-name MyScan
Interaction with related technologies:
Purview Data Map feeds into Purview Data Catalog (the search and browse UI) and Purview Data Estate Insights (analytics on data assets).
Lineage: Purview can automatically capture lineage from Azure Data Factory, Azure Synapse Pipelines, and Power BI when scanning these sources. Lineage shows how data flows from source to destination (e.g., from Azure Blob to Azure SQL via a Data Factory copy activity).
Glossary: Business terms defined in the Purview glossary can be mapped to assets in the Data Map, enabling business metadata enrichment.
Microsoft 365 Compliance Center: Purview can export classifications to Microsoft 365 for sensitivity labels (via Microsoft Information Protection integration).
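Lineage is easiest to picture as a directed graph whose edges point from a source asset to the asset derived from it. The sketch below walks such a graph backwards to answer "where did this dataset come from?". The asset names and the `upstream_sources` helper are hypothetical and only illustrate the idea; they do not reflect how Purview stores or queries lineage.

```python
from collections import deque

# Hypothetical lineage edges: (source asset, destination asset).
LINEAGE_EDGES = [
    ("blob:/raw/sales.csv", "sql:dbo.Sales"),        # e.g. a copy activity
    ("sql:dbo.Sales", "powerbi:SalesDataset"),
    ("sql:dbo.Customers", "powerbi:SalesDataset"),
]


def upstream_sources(asset, edges):
    """Return every asset that feeds `asset`, directly or indirectly."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


print(upstream_sources("powerbi:SalesDataset", LINEAGE_EDGES))
```

An auditor tracing a report back to its raw files is running this kind of traversal through the catalog's lineage view.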
Trap patterns for the exam:
Confusing scanning with data replication: Purview scans metadata, not the actual data. It does not copy data into Purview.
Assuming all classifications are automatic: Classifications are applied only by the scan rule set selected for the scan. The built-in rule sets cover common patterns, and custom rule sets can be created for anything they miss.
Thinking scanning is real-time: Scans run on a schedule (e.g., weekly). Incremental scans can be daily, but there is always latency between data changes and catalog updates.
Believing on-premises sources don't need SHIR: Any source not directly reachable via Purview's managed VNet (on-premises, other clouds) requires a self-hosted integration runtime.
Exam-specific numbers and terms:
Built-in classification types: over 100.
Default classification threshold: 50% of sampled values must match the pattern.
Default sample size for databases: 100 rows.
Default sample size for storage: 100 files or 1 MB.
Scanning can be full or incremental.
SHIR communicates over HTTPS (port 443).
Purview supports scanning of Azure SQL DB, Azure Synapse, Azure Blob, Azure Data Lake Storage Gen1/Gen2, Power BI, and on-premises SQL Server (via SHIR).
Register Data Source
In Purview Studio, navigate to the Data Map section and select 'Register'. Choose the source type (e.g., Azure SQL Database). Provide the server name, database name, and authentication method (SQL Authentication or Managed Identity). Purview stores this registration as a connection definition. The registration does not test connectivity immediately; that happens at scan time. You can also assign the source to a collection for organizational hierarchy. For on-premises sources, you must first install a self-hosted integration runtime on a machine that can reach the source, then register the source using the SHIR endpoint.
Define Scan Rule Set
Select or create a scan rule set. Built-in rule sets like 'System.Default' include over 100 classification rules (e.g., Credit Card Number, Email Address). You can also create custom rule sets by selecting specific classification rules or defining custom regex patterns. Each classification rule has a minimum match threshold (default 50%). For example, the 'Credit Card Number' rule looks for patterns like 16-digit numbers with Luhn check. You can also choose to include or exclude certain file types (e.g., .csv, .parquet) for storage scans. The rule set is applied during the scan to classify columns or files.
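The "16-digit number with Luhn check" idea mentioned above can be illustrated with the standard Luhn checksum. This is a sketch of the general technique, not Purview's actual rule definition; the function names and the choice to strip spaces and hyphens are assumptions made for the example.

```python
import re

SIXTEEN_DIGITS = re.compile(r"\d{16}")


def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_card_number(value: str) -> bool:
    """True for 16-digit values (spaces and hyphens ignored) that pass Luhn."""
    digits = re.sub(r"[ -]", "", value)
    return bool(SIXTEEN_DIGITS.fullmatch(digits)) and luhn_valid(digits)


print(looks_like_card_number("4111 1111 1111 1111"))  # common test number
print(looks_like_card_number("4111 1111 1111 1112"))  # checksum fails
```

Combining a format pattern with a checksum is what keeps arbitrary 16-digit identifiers from being labeled as card numbers.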
Configure and Run Scan
In Purview Studio, select the registered data source and click 'New scan'. Choose the scan rule set, set the scope (e.g., all tables or specific schemas), and define the schedule (e.g., weekly full scan on Sunday, daily incremental scan). You can also set a 'Scan trigger' to run immediately. For Azure SQL, Purview uses a read-only account to query `INFORMATION_SCHEMA`. The scan runtime connects using the provided credentials. For Blob Storage, it enumerates containers and files up to a configurable depth. The scan then extracts metadata and samples data for classification. The runtime respects the data source's firewall rules; you must allow Purview's managed VNet IP range or configure a private endpoint.
View and Manage Scan Results
After the scan completes, you can view the results in Purview Studio under 'Data Map' > 'Scans'. The scan status shows 'Completed', 'Failed', or 'In Progress'. For each asset (table, column, file), you can see the classifications applied (e.g., 'Credit Card Number' on a column). Users can approve or reject classifications manually. If a classification is incorrect, you can reclassify or edit the scan rule set. You can also add custom metadata like descriptions, glossary terms, or contacts. The scan history shows past runs, durations, and any errors (e.g., connection failures, permission issues).
Search and Browse Data Catalog
Once metadata is in the Data Map, it becomes searchable in the Purview Data Catalog. Users can search by asset name, classification, or glossary term. The catalog shows schema details, classifications, lineage (if available), and related glossary terms. For example, a data scientist can search for 'customer' and find all tables across the organization that contain customer data, along with their sensitivity classifications. The catalog also provides a 'Browse' view organized by collection, source type, or classification. This step is where the value of scanning is realized: users can discover and understand data without needing direct access to the source systems.
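Because the catalog holds only metadata, a search is a filter over asset records rather than a query against the source systems. The sketch below shows that idea with a hypothetical `CatalogAsset` record and `search` function; the field names and sample assets are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogAsset:
    """Hypothetical metadata record; no data rows are stored."""
    name: str
    source_type: str
    classifications: frozenset = field(default_factory=frozenset)
    glossary_terms: frozenset = field(default_factory=frozenset)


CATALOG = [
    CatalogAsset("dbo.Customer", "Azure SQL Database",
                 frozenset({"Email Address"}), frozenset({"Customer"})),
    CatalogAsset("customer_orders.parquet", "Azure Data Lake Storage Gen2",
                 frozenset(), frozenset({"Sales Transaction"})),
    CatalogAsset("dbo.Payments", "Azure SQL Database",
                 frozenset({"Credit Card Number"})),
]


def search(catalog, keyword=None, classification=None):
    """Return asset names matching an optional keyword and classification."""
    results = []
    for asset in catalog:
        if keyword and keyword.lower() not in asset.name.lower():
            continue
        if classification and classification not in asset.classifications:
            continue
        results.append(asset.name)
    return results


print(search(CATALOG, keyword="customer"))
print(search(CATALOG, classification="Credit Card Number"))
```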
Enterprise Scenario 1: Financial Services Compliance
A global bank must comply with GDPR and PCI DSS. They have thousands of databases across Azure SQL, on-premises SQL Server, and Azure Data Lake Storage. The compliance team needs to identify all columns containing personal data (e.g., email, phone, credit card) and ensure they are properly classified and monitored. Using Purview Data Map, they register all data sources and run weekly full scans with a custom scan rule set that includes both built-in classifications (Credit Card Number, Email Address) and custom regex patterns for internal employee IDs. The scan results feed into Purview Data Estate Insights, which provides dashboards showing classification coverage and data source health. The bank also enables lineage tracking by scanning Azure Data Factory pipelines, so they can trace how sensitive data flows from source to reporting. A common misconfiguration is not allowing Purview's managed VNet IPs in the Azure SQL firewall, causing scans to fail. The team must also ensure the self-hosted integration runtime for on-premises SQL Server has network connectivity and is updated regularly.
Enterprise Scenario 2: Retail Data Lake Governance
A large retailer uses Azure Data Lake Storage Gen2 for their data lake, with thousands of Parquet files ingested daily from point-of-sale systems. Data analysts struggle to find relevant tables because there is no central catalog. The data engineering team deploys Purview and schedules daily incremental scans of the data lake. They create a custom scan rule set that classifies columns containing product IDs, store IDs, and transaction amounts. They also map glossary terms like 'Sales Transaction' to the corresponding folders. One challenge is that the data lake contains a mix of file formats: columnar Parquet alongside delimited text (CSV) and semi-structured JSON. Purview can scan all these formats, but classification works from a sample (100 files, or 1 MB per file), so very large files are not read in full. The team learns that they need to set the scan scope to specific folders (e.g., /sales/transactions/) to avoid scanning non-business data like logs. They also configure Purview to ignore certain file extensions (e.g., .tmp) to reduce scan time.
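The scoping decisions in this scenario (include only certain folders, skip certain extensions) amount to a simple path filter. The sketch below is a hypothetical illustration of that filter; the function name, parameters, and example paths are assumptions, and in practice the scope is configured in the scan settings rather than in code.

```python
from pathlib import PurePosixPath


def in_scan_scope(path, include_prefixes, excluded_extensions):
    """Return True when a file path should be scanned.

    A path is in scope when it sits under one of the included folder
    prefixes and its extension is not on the exclusion list.
    """
    if PurePosixPath(path).suffix.lower() in excluded_extensions:
        return False
    return any(path.startswith(prefix) for prefix in include_prefixes)


INCLUDE = ["/sales/transactions/"]
EXCLUDE = {".tmp"}

for candidate in [
    "/sales/transactions/2025/01/part-0001.parquet",
    "/sales/transactions/2025/01/_staging.tmp",
    "/logs/app.log",
]:
    print(candidate, in_scan_scope(candidate, INCLUDE, EXCLUDE))
```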
Enterprise Scenario 3: Healthcare Data Integration
A healthcare provider uses Azure Synapse Analytics for their data warehouse and Power BI for reporting. They need to ensure Protected Health Information (PHI) is properly classified and that data lineage is visible for audit purposes. They register their Azure Synapse workspace in Purview and enable scanning of both the SQL pool and the underlying Azure Data Lake Storage. They use a custom classification, 'Medical Record Number' (defined with a custom regex), alongside the built-in 'Date of Birth' classification. They also scan their Power BI tenant to capture dataset lineage. A common pitfall is that scanning Power BI requires the Power BI admin to enable 'Allow service principals to use read-only Power BI admin APIs' in the Power BI admin portal. Without this, the scan fails. Once configured, Purview shows lineage from Power BI datasets back to the Synapse tables, giving auditors a complete picture of data flow. The team also uses Purview's Data Map to assign data owners to each asset, ensuring accountability.
DP-900 Exam Focus: Microsoft Purview Data Map and Scanning
Objective Code: The DP-900 exam covers Purview under 'Describe core data concepts' (15-20% of exam). Specifically, you need to 'describe how to discover and classify data using Microsoft Purview'. This includes understanding the purpose of the Data Map, the scanning process, and the difference between full and incremental scans.
Common Wrong Answers and Why:
1. 'Purview copies data to a central repository.' – This is wrong because Purview only extracts metadata (schema, classifications), not the actual data. Candidates often confuse data cataloging with data replication.
2. 'Scans run in real-time.' – Purview scans are scheduled (e.g., weekly). There is no real-time scanning. Candidates may think changes are reflected immediately, but there is always latency.
3. 'All classifications are applied automatically without configuration.' – While built-in classifications exist, you must select a scan rule set. Without it, no classifications are applied. Also, custom rule sets can be created.
4. 'On-premises sources can be scanned without additional infrastructure.' – On-premises sources require a self-hosted integration runtime (SHIR). Azure-only sources use the managed VNet.
Specific Numbers and Terms to Memorize:
Built-in classification types: over 100.
Default classification threshold: 50% of sampled values must match.
Default sample size for databases: 100 rows.
Default sample size for storage: 100 files or 1 MB.
Full vs. incremental scans: incremental only processes changed assets.
SHIR communication: HTTPS port 443.
Supported sources: Azure SQL DB, Azure Synapse, Azure Blob, ADLS Gen1/Gen2, Power BI, on-premises SQL Server (via SHIR).
Edge Cases and Exceptions:
If a data source is behind a firewall, you must allow Purview's managed VNet IPs or configure a private endpoint.
Scanning Power BI requires Power BI admin to enable read-only admin APIs.
Classification thresholds can be adjusted per rule; setting a threshold too low (e.g., 10%) may cause false positives.
Incremental scans rely on modification timestamps; if the source doesn't track changes (e.g., some file systems), incremental scan may not work and full scan is needed.
How to Eliminate Wrong Answers:
If an answer says 'Purview stores the actual data', eliminate it immediately.
If an answer says 'real-time', it's almost certainly wrong.
If an answer suggests scanning on-premises without mentioning SHIR, it's incomplete.
Look for keywords: 'metadata', 'classification', 'scan rule set', 'full/incremental'. The exam tests understanding of these terms.
Purview Data Map stores metadata only, not actual data.
Scans can be full or incremental; incremental scans are faster but rely on change tracking.
Built-in classification rules exceed 100; default threshold is 50% match.
Sample size for classification: 100 rows for databases, 100 files or 1 MB for storage.
On-premises sources require a self-hosted integration runtime (SHIR) over HTTPS port 443.
Purview supports Azure SQL DB, Azure Synapse, Azure Blob, ADLS Gen1/Gen2, Power BI, and on-premises SQL Server.
Scan schedules can be once, weekly, monthly, or custom; default is 'Once'.
Classification can be automated via scan rule sets or manually overridden.
Purview integrates with Azure Data Factory and Power BI for lineage.
Managed VNet is used for Azure sources; SHIR for on-premises/non-Azure sources.
Full and incremental scans come up on the exam frequently. Here's how to tell them apart.
Full Scan
Re-evaluates all assets in the scope, regardless of whether they changed.
Takes longer and consumes more resources (CPU, network).
Suitable for initial discovery or when you need a complete refresh.
Always captures all classifications and metadata.
Can be scheduled less frequently (e.g., weekly) to reduce overhead.
Incremental Scan
Only processes assets that have changed since the last scan (based on modification timestamps).
Faster and consumes fewer resources.
Suitable for ongoing updates after an initial full scan.
May miss changes if the source does not update timestamps properly.
Typically scheduled more frequently (e.g., daily).
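The difference between the two scan types comes down to which assets are selected for processing. The sketch below models that selection using modification timestamps, as described above; the function name, asset paths, and timestamps are hypothetical.

```python
from datetime import datetime, timezone


def assets_to_process(assets, last_scan_time=None):
    """Select assets for a scan.

    `assets` maps an asset name to its last-modified timestamp.
    With no `last_scan_time`, this behaves like a full scan and returns
    everything; otherwise it returns only assets modified since then.
    """
    if last_scan_time is None:
        return sorted(assets)
    return sorted(
        name for name, modified in assets.items() if modified > last_scan_time
    )


ASSETS = {
    "sales/2025-01-01.parquet": datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc),
    "sales/2025-01-02.parquet": datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc),
    "sales/2025-01-03.parquet": datetime(2025, 1, 3, 6, 0, tzinfo=timezone.utc),
}
LAST_SCAN = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)

print(assets_to_process(ASSETS))             # full scan: every asset
print(assets_to_process(ASSETS, LAST_SCAN))  # incremental: changed assets only
```

An asset whose contents changed without its timestamp moving would never pass the `modified > last_scan_time` test, which is why an incremental scan can miss changes and a periodic full scan remains useful.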
Mistake
Purview Data Map stores a copy of the actual data from scanned sources.
Correct
Purview stores only metadata (schema, classifications, lineage), never the actual data rows. A scan reads a small sample of data solely to apply classifications; the sample is not stored permanently, and only the resulting classification labels are persisted.
Mistake
Scans run continuously in real-time, so changes are reflected immediately.
Correct
Scans are scheduled (e.g., weekly full, daily incremental). There is always latency between a data change and its appearance in the Data Map. Incremental scans depend on modification timestamps and are typically scheduled daily.
Mistake
All columns are automatically classified without any configuration.
Correct
Classification requires a scan rule set. Without selecting a rule set, no classifications are applied. Built-in rule sets exist but must be chosen. Custom rule sets can also be created.
Mistake
On-premises data sources can be scanned using the same managed VNet as Azure sources.
Correct
On-premises sources require a self-hosted integration runtime (SHIR) installed on a machine that can reach both the on-premises source and Purview over HTTPS. Azure sources can use the managed VNet.
Mistake
Incremental scans always capture all changes since the last scan.
Correct
Incremental scans rely on source-side change tracking (e.g., LastModified timestamps for files). If the source does not support change tracking or if timestamps are not updated, incremental scans may miss changes. In that case, a full scan is needed.
No. Purview Data Map stores only metadata: schema (table names, column names, data types), classifications (e.g., 'Credit Card Number'), and lineage. It does not copy the actual data rows. During scanning, Purview may sample a small amount of data (e.g., 100 rows) to apply classifications, but this sample is temporary and not stored permanently. The Data Map is a catalog, not a data warehouse.
You can run scans on a schedule: once, weekly, monthly, or custom (e.g., every 5 days). There is no real-time or continuous scanning. The minimum interval for recurring scans is 1 hour, but typical schedules are daily incremental and weekly full. You can also trigger a scan manually at any time.
A full scan re-evaluates all assets in the scope (e.g., all tables in a database), regardless of whether they changed. It is slower but ensures completeness. An incremental scan only processes assets that have changed since the last scan, based on modification timestamps. Incremental scans are faster and consume fewer resources, but they rely on the source system accurately tracking changes (e.g., LastModified for files). For databases, how changes are detected depends on the source type, so confirm incremental behavior for each source before relying on it.
Yes. To scan on-premises sources (e.g., SQL Server on-premises), you must install a self-hosted integration runtime (SHIR) on a machine that has network access to both the on-premises data source and the internet (HTTPS port 443 to Purview). The SHIR acts as a bridge. For Azure-only sources, Purview uses a managed virtual network and does not require SHIR.
Purview uses scan rule sets that contain classification rules. Each rule is a regular expression pattern (e.g., for credit card: 16-digit number with Luhn check). During a scan, Purview samples data (default 100 rows for databases) and checks each value against the pattern. If at least 50% of sampled values match the pattern, the column is classified with that label (e.g., 'Credit Card Number'). You can adjust the threshold or create custom rules.
Yes. Purview can scan Power BI tenants to discover datasets, reports, and dashboards. However, you must enable 'Allow service principals to use read-only Power BI admin APIs' in the Power BI admin portal. The scan extracts metadata like dataset name, tables, columns, and lineage from Power BI to underlying sources (if those sources are also scanned).
If a scan fails, Purview records the error in the scan history (e.g., 'Connection failed', 'Permission denied'). You can view the error details and take corrective action (e.g., update credentials, allow IP addresses in firewall). You can then re-run the scan manually. Failed scans do not delete previously ingested metadata; the Data Map retains the last successful scan's data until a new successful scan overwrites it.
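The failure behavior described above (every run is logged, but only a successful run replaces the stored metadata) can be modeled with a small class. This is a hypothetical sketch for illustration; the class name, status strings, and data shapes are assumptions and do not describe Purview's internals.

```python
class DataMapSource:
    """Hypothetical model of the metadata held for one registered source."""

    def __init__(self):
        self.assets = {}    # metadata from the last successful scan
        self.history = []   # one entry per scan run

    def record_run(self, status, assets=None, error=None):
        """Log a run; replace stored metadata only when the run completed."""
        self.history.append({"status": status, "error": error})
        if status == "Completed" and assets is not None:
            self.assets = dict(assets)


source = DataMapSource()
source.record_run("Completed", assets={"dbo.Sales": ["Credit Card Number"]})
source.record_run("Failed", error="Connection failed")

print(source.assets)               # still the metadata from the first run
print(source.history[-1]["error"])
```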