This chapter covers Microsoft Purview Data Catalog, a key component of the Microsoft Purview governance platform. For the DP-900 exam, this topic appears in approximately 5-10% of questions, primarily within the 'Core Data Concepts' domain (Objective 1.1: Identify data formats and data stores). Understanding Purview's role in data discovery, classification, and lineage is essential for answering questions about metadata management and data cataloging in Azure. This chapter provides a deep dive into Purview's architecture, scanning process, and how it enables data governance.
Imagine a massive public library with millions of books scattered across floors and rooms. Without a card catalog, finding a specific book would require wandering aimlessly. The card catalog is a centralized registry: each card lists the book's title, author, subject, location (e.g., '3rd floor, aisle 12, shelf 4'), and a summary. Librarians update cards when new books arrive or are moved. Patrons search by title, author, or subject to get the exact location and a quick description before fetching the book. Microsoft Purview Data Catalog works the same way for enterprise data assets. It stores metadata—technical schema, business descriptions, classifications (e.g., 'contains PII'), and lineage—in a searchable index. Data consumers search the catalog to find datasets (like 'sales_2023' in Azure Data Lake), see their schema, understand sensitivity, and trace origin. Just as a card catalog doesn't hold the books themselves, Purview catalog doesn't store data; it holds pointers (metadata) to the data sources. The catalog is automatically populated by scanning registered sources (Azure SQL Database, Power BI, etc.) using 'scanners' that crawl metadata, analogous to librarians walking aisles and noting new acquisitions. This mechanism ensures that data discovery is systematic, governed, and auditable.
What is Microsoft Purview Data Catalog?
Microsoft Purview Data Catalog is a unified data governance service that helps organizations discover, understand, and manage their data assets across on-premises, multi-cloud, and SaaS environments. It is part of the Microsoft Purview governance portal (formerly Azure Purview). The data catalog is a metadata repository that stores technical, business, and operational metadata about data sources. It enables data consumers (data analysts, data scientists, data stewards) to search for datasets, understand their structure, lineage, and sensitivity, and request access.
Why it Exists
In large enterprises, data is spread across hundreds of sources—Azure SQL Database, Azure Synapse, Azure Data Lake Storage, Power BI datasets, on-premises SQL Server, Amazon S3, etc. Without a catalog, users waste time finding the right data, often creating duplicate copies or using stale datasets. Purview solves this by providing a single pane of glass for data discovery and governance.
How Purview Data Catalog Works Internally
Purview operates through a collection of components:
- Data Map: The backbone that stores metadata about data sources. It is built on Apache Atlas and includes entities (data assets) and classifications (tags).
- Scanners: Connectors that crawl data sources to extract metadata. Each scanner is specific to a source type (e.g., Azure SQL DB scanner, Power BI scanner).
- Classification: Purview automatically scans sample data (up to 100 rows per table by default) to detect sensitive information like credit card numbers, email addresses, or custom patterns. It uses built-in classifiers (e.g., 'Credit Card Number', 'Person Name') and custom rules.
- Lineage: Tracks how data moves from source to destination (e.g., from Azure Data Lake to Azure Synapse to Power BI). This is captured via integration with Azure Data Factory, SQL Server Integration Services (SSIS), and custom lineage extraction.
- Search & Browse: Users search using keywords, filters (data source type, classification, contact), or browse the data map hierarchy.
- Glossary: Business terms (e.g., 'Customer ID', 'Revenue') that map to technical assets, bridging business and IT.
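To make the "metadata, not data" idea concrete, here is a minimal Python sketch of what a catalog entry could hold. The class names, field names, and qualified-name strings are hypothetical and chosen for illustration; they are not Purview's actual entity schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ColumnMetadata:
    name: str
    data_type: str
    classifications: List[str] = field(default_factory=list)


@dataclass
class CatalogAsset:
    """Metadata about one data asset. Note there is no field for row data."""
    qualified_name: str   # pointer to where the data actually lives
    source_type: str      # e.g. "Azure SQL Database"
    columns: List[ColumnMetadata] = field(default_factory=list)
    glossary_terms: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)  # lineage inputs

    def classifications(self) -> Set[str]:
        """All classification tags found on any column of this asset."""
        return {tag for col in self.columns for tag in col.classifications}


# Hypothetical asset; the qualified names are made up for this example.
sales = CatalogAsset(
    qualified_name="mssql://example-server/salesdb/dbo/sales_2023",
    source_type="Azure SQL Database",
    columns=[
        ColumnMetadata("customer_email", "nvarchar", ["Email"]),
        ColumnMetadata("amount", "decimal"),
    ],
    glossary_terms=["Revenue"],
    upstream=["adls://example-lake/raw/sales/2023"],
)

print(sorted(sales.classifications()))  # ['Email']
```

The record carries schema, tags, glossary terms, and lineage pointers, which is everything a consumer needs to find and judge the dataset without the catalog ever holding a single row.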
Key Components, Values, and Defaults
Scanning schedule: Can be one-time or recurring (daily, weekly, monthly). Default scan concurrency is 5 scans per source type.
Classification thresholds: Built-in classifiers trigger when confidence level exceeds a threshold (e.g., 80% for credit card). Custom classifiers can set minimum match count and confidence.
Data Map size: Maximum of 10,000 data sources per Purview account. Each source can have thousands of assets (tables, files).
Lineage capture: For Azure Data Factory, lineage is automatically captured when using Copy activity or Data Flow. For other sources, custom lineage can be pushed via Atlas API.
Role-based access control (RBAC): Roles include Data Curator (manage metadata), Data Reader (search and view), Data Source Administrator (register sources).
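The classification threshold described above can be sketched in a few lines of Python. This is an illustration only: it treats "confidence" as the share of sampled values that match a pattern, which is an assumption rather than a description of Purview's internals, and the email pattern is deliberately simplified.

```python
import re

# Simplified pattern for illustration; production classifiers are stricter.
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def classify_column(samples, pattern, label, min_match_ratio=0.8, min_match_count=1):
    """Return `label` if enough non-empty sampled values match `pattern`, else None."""
    values = [v for v in samples if v]  # ignore NULLs and empty strings
    if not values:
        return None
    matches = sum(1 for v in values if pattern.match(v))
    ratio = matches / len(values)
    # Tag the column only when both the count and the ratio thresholds are met.
    if matches >= min_match_count and ratio >= min_match_ratio:
        return label
    return None


emails = ["ana@example.com", "bo@example.org", "cy@example.net",
          "di@example.com", "ed@example.com", "n/a"]
notes = ["call back", "ana@example.com", "shipped", "refund", "ok"]

print(classify_column(emails, EMAIL_PATTERN, "Email"))  # Email
print(classify_column(notes, EMAIL_PATTERN, "Email"))   # None
```

Five of the six sampled values in the first column look like email addresses, so it clears the 80% bar; a free-text column with one stray address does not.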
Configuration and Verification
To register a data source in Purview:
1. Open the Microsoft Purview governance portal.
2. Go to 'Data Map' -> 'Sources'.
3. Click 'Register' and select the source type (e.g., 'Azure SQL Database').
4. Provide connection details (server name, database name, authentication method).
5. Create a scan rule set (optional) to include/exclude specific tables.
6. Run a scan immediately or schedule it.
Verification: After scanning, assets appear under 'Browse assets'. Search for a table name to confirm. Use the 'Lineage' tab to see data movement.
How Purview Interacts with Related Technologies
Azure Data Factory: Purview captures lineage from ADF pipelines automatically when 'Lineage' is enabled in ADF (requires Purview account connection).
Power BI: Purview can scan Power BI tenants to catalog datasets, reports, and dashboards. Lineage from Power BI to underlying sources is captured if the sources are also registered in Purview.
Azure Synapse Analytics: Purview scans Synapse workspaces to catalog tables, views, and stored procedures.
Microsoft 365: Purview integrates with Microsoft 365 sensitivity labels via 'Information Protection' to apply labels to data assets.
Azure Policy: Purview can enforce data governance policies via Azure Policy, e.g., requiring all Azure SQL DBs to be registered in Purview.
Step-by-Step Scanning Process
Registration: User registers a data source in Purview, providing endpoint and credentials.
Scan Trigger: A scan is triggered manually or via schedule. The scanner connects to the source.
Metadata Extraction: Scanner retrieves schema (table names, column names, data types), statistics (row count, size), and sample data (up to 100 rows per table).
Classification: Purview applies built-in and custom classifiers to sample data, generating classification tags (e.g., 'Email', 'Credit Card Number').
Ingestion: Metadata and classifications are uploaded to the Data Map, creating or updating assets.
Lineage Extraction: If the source participates in data movement (e.g., ADF pipeline), Purview captures lineage from the data factory.
Search Index Update: The search index is refreshed, making new assets discoverable.
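The steps above can be sketched end to end with a toy scanner. The example below uses an in-memory SQLite database as a stand-in for a registered source; the `scan` function, the sample size constant, and the 0.8 match ratio are illustrative assumptions, not Purview's implementation.

```python
import re
import sqlite3

SAMPLE_ROWS = 100  # per-table sample size quoted in the text
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")  # simplified illustrative pattern


def scan(conn):
    """Return {table: {column: {"type", "classifications"}}}. No row data is kept."""
    data_map = {}
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # Metadata extraction: column names and declared types.
        columns = [(r[1], r[2]) for r in conn.execute(f'PRAGMA table_info("{table}")')]
        # Sampling: read at most SAMPLE_ROWS rows, used only for classification.
        sample = conn.execute(f'SELECT * FROM "{table}" LIMIT {SAMPLE_ROWS}').fetchall()
        entry = {}
        for idx, (name, col_type) in enumerate(columns):
            values = [row[idx] for row in sample if isinstance(row[idx], str)]
            hits = sum(1 for v in values if EMAIL.match(v))
            tags = ["Email"] if values and hits / len(values) >= 0.8 else []
            # Ingestion: only schema and tags go into the data map.
            entry[name] = {"type": col_type, "classifications": tags}
        data_map[table] = entry
    return data_map


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(250)])

catalog = scan(conn)
print(catalog)
```

The table holds 250 rows, but the scanner reads only the first 100 and discards them after tagging the column, so the resulting data map contains column names, types, and the 'Email' classification and nothing else.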
Key Defaults and Timers
Scan timeout: Default 60 minutes per scan. If exceeded, scan fails.
Sample rows: 100 rows per table (configurable up to 1000).
Classification confidence threshold: 80% for built-in classifiers.
Recurring scan interval: Minimum 1 hour.
Data Map sync: Metadata changes appear within minutes (typically 5-10 minutes after scan completion).
Exam Trap Patterns
Trap: 'Purview stores the actual data.' Reality: Purview stores only metadata, not the data itself.
Trap: 'Purview can automatically classify all data in real time.' Reality: Classification is based on sampled data during scans, not real-time.
Trap: 'Purview is only for Azure sources.' Reality: Purview supports on-premises, multi-cloud (AWS, Google Cloud), and SaaS (Salesforce, etc.) via connectors.
Trap: 'Lineage is automatically captured for all sources.' Reality: Lineage requires integration with data movement services like ADF or custom API calls.
Command Examples
While Purview is primarily GUI-based, you can automate it with REST APIs or the Azure CLI (the 'purview' CLI extension must be installed). Example Azure CLI command to create a Purview account:
az purview account create --name MyPurviewAccount --resource-group MyRG --location eastus
Registered data sources are managed through Purview's data-plane scanning REST API rather than through the account-level CLI commands; run 'az purview --help' to confirm which subcommands and flags your extension version supports.
Conclusion
Microsoft Purview Data Catalog is a critical tool for data governance in Azure. The DP-900 exam expects you to understand its purpose (metadata management, data discovery, classification, lineage), know the types of data sources it supports (Azure, on-premises, multi-cloud), and recognize that it does not store actual data. Focus on the scanning process, classification, and integration with other Azure services.
Register a Data Source
In the Purview governance portal, navigate to Data Map > Sources. Click 'Register' and select the source type (e.g., Azure SQL Database). You'll provide the server name, database name, and authentication credentials (SQL authentication, service principal, or managed identity). This step does not yet extract metadata; it only creates a reference to the source in Purview's data map. The source is now registered, but none of its tables or files appear as catalog assets until a scan runs.
Create or Select a Scan Rule Set
Scan rule sets define what metadata to extract. For example, you can include or exclude specific tables, folders, or file types. Purview provides default scan rule sets for each source type (e.g., 'DefaultAzureSqlDBScanRuleSet'). You can also create custom rule sets to filter out system tables or specific schemas. The rule set also determines whether to enable classification (default: enabled) and lineage extraction.
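The include/exclude behaviour of a scan rule set can be illustrated with simple glob matching. This is a sketch of the filtering idea only; the function name and pattern syntax are assumptions and do not reflect how Purview expresses rule sets.

```python
from fnmatch import fnmatchcase


def apply_rule_set(asset_names, include=("*",), exclude=()):
    """Keep names that match any include pattern and no exclude pattern."""
    selected = []
    for name in asset_names:
        included = any(fnmatchcase(name, p) for p in include)
        excluded = any(fnmatchcase(name, p) for p in exclude)
        if included and not excluded:
            selected.append(name)
    return selected


tables = ["dbo.sales_2023", "dbo.customers", "sys.objects", "staging.tmp_load"]

# Filter out system tables and a staging schema before scanning.
print(apply_rule_set(tables, exclude=("sys.*", "staging.*")))
# ['dbo.sales_2023', 'dbo.customers']
```

Excluding noise up front keeps scans shorter and the catalog cleaner, which matters once hundreds of sources are registered.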
Configure and Run a Scan
You set the scan trigger: one-time or recurring (daily, weekly, monthly). For recurring scans, you specify the interval and start time. The scanner uses the credentials from registration to connect to the source. It then crawls metadata: for a database, it lists tables, columns, data types, and row counts. For files in ADLS, it lists file names, sizes, and schema (if Parquet/CSV). The scanner also samples up to 100 rows per table (configurable) for classification. The scan runs asynchronously; you can monitor progress in the 'Scan history' tab.
Review and Curate Metadata
After the scan completes, assets appear in the data map. Data curators can edit metadata: add descriptions, assign business glossary terms (e.g., 'Customer'), set sensitivity labels (e.g., 'Confidential'), and configure lineage manually. They can also approve or reject automatic classifications. For example, if a column is misclassified as 'Email', a curator can remove that classification. Curators can also add contacts (experts) and set data ownership.
Search and Discover Assets
Data consumers use the search bar to find assets by keyword (e.g., 'sales'), filter by source type, classification, or glossary term. Search results show asset name, type, source, and classification badges. Clicking an asset reveals detailed metadata: schema, lineage (if available), classifications, contacts, and related glossary terms. Users can also browse the data map hierarchy by source. This step enables self-service data discovery without needing to know the exact location of data.
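The keyword-plus-filter search described above can be mimicked over a small in-memory catalog. The asset records and the `search` function are hypothetical; the real catalog uses a search index rather than a linear scan.

```python
# Hypothetical catalog entries; only metadata is stored.
CATALOG = [
    {"name": "sales_2023", "source_type": "Azure Synapse Analytics",
     "classifications": [], "glossary_terms": ["Revenue"]},
    {"name": "customer_contacts", "source_type": "Azure SQL Database",
     "classifications": ["Email"], "glossary_terms": ["Customer"]},
    {"name": "sales_returns", "source_type": "Azure Data Lake Storage Gen2",
     "classifications": ["Credit Card Number"], "glossary_terms": ["Revenue"]},
]


def search(catalog, keyword="", source_type=None, classification=None, glossary_term=None):
    """Return names of assets matching the keyword and every supplied filter."""
    keyword = keyword.lower()
    results = []
    for asset in catalog:
        if keyword and keyword not in asset["name"].lower():
            continue
        if source_type and asset["source_type"] != source_type:
            continue
        if classification and classification not in asset["classifications"]:
            continue
        if glossary_term and glossary_term not in asset["glossary_terms"]:
            continue
        results.append(asset["name"])
    return results


print(search(CATALOG, "sales"))
# ['sales_2023', 'sales_returns']
print(search(CATALOG, "sales", classification="Credit Card Number"))
# ['sales_returns']
```

Filters narrow the keyword hits, which is how an auditor can go from "everything about sales" to "sales assets that contain card numbers" in one query.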
Enterprise Scenario 1: Financial Services – Regulatory Compliance
A large bank needs to identify all data assets containing personally identifiable information (PII) to comply with GDPR and CCPA. They deploy Purview to scan over 500 Azure SQL databases, 200 Azure Data Lake Storage Gen2 containers, and 50 on-premises SQL Server instances. The scanning process runs weekly to catch new assets. Purview automatically classifies columns containing credit card numbers, email addresses, and social security numbers using built-in classifiers. Data stewards review classifications and apply sensitivity labels (e.g., 'Highly Confidential') via Microsoft 365 Information Protection. The catalog enables auditors to search for all assets with 'Credit Card Number' classification and generate compliance reports. Common pitfalls: initial scan of 500 databases may take 48+ hours; they had to increase scan concurrency from default 5 to 20 by contacting support. Misconfiguration: failing to exclude system tables (e.g., 'sys.*') caused unnecessary noise and slower scans.
Enterprise Scenario 2: Retail – Data Democratization
A retail company wants to enable its data analysts to self-serve data for sales forecasting. They register Azure Synapse Analytics, Azure Data Lake Storage, and Power BI datasets in Purview. They create a business glossary with terms like 'Revenue', 'Product SKU', and 'Region'. Analysts search the catalog for 'sales 2023' and find the correct table in Synapse, see its schema, and understand it is refreshed daily (via lineage from Azure Data Factory). They also see contact information for the data owner. Lineage shows that the data flows from the transactional database to the data lake to Synapse. This reduces time to find data from days to minutes. Performance consideration: lineage from ADF can be delayed up to 1 hour after pipeline run. Misconfiguration: if ADF lineage is not enabled, analysts cannot trace data origin; they must enable 'Lineage' in ADF linked to Purview.
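Tracing a dataset back to its origin, as the analysts do in this scenario, amounts to walking a directed graph. The sketch below models lineage as a plain dictionary; the asset names are invented for illustration and the traversal is a generic breadth-first search, not Purview's lineage API.

```python
from collections import deque

# Each entry reads "data flows from the key to each listed value".
LINEAGE = {
    "transactional_db.orders": ["data_lake/raw/orders"],
    "data_lake/raw/orders": ["synapse.dbo.sales_2023"],
    "synapse.dbo.sales_2023": ["powerbi.sales_forecast"],
}


def upstream_of(asset, edges):
    """Return every asset that feeds `asset`, nearest first (breadth-first)."""
    # Invert the edges so each asset maps to its direct inputs.
    parents = {}
    for src, dsts in edges.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    seen, order, queue = set(), [], deque([asset])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(upstream_of("synapse.dbo.sales_2023", LINEAGE))
# ['data_lake/raw/orders', 'transactional_db.orders']
```

If the pipeline that produces an edge never reports lineage, that edge is simply missing from the graph and the trace stops early, which is why enabling lineage in ADF matters.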
Enterprise Scenario 3: Healthcare – Data Sharing and Governance
A healthcare provider shares de-identified patient data with research partners. They use Purview to catalog datasets in Azure Data Lake and apply classification tags like 'Patient ID' and 'Diagnosis Code'. They use custom classifiers to detect specific medical codes (ICD-10). They set up access workflows so researchers can request access via the catalog. Because Azure Data Catalog is deprecated, Purview is also the migration target for anything previously cataloged there. Common issue: custom classifiers require careful tuning to avoid false positives; they had to raise the minimum match count to 10 to reduce noise. The catalog helps maintain compliance with HIPAA by ensuring only de-identified data is shared.
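The effect of a minimum match count on a custom classifier can be shown with a short Python sketch. The regular expression below only checks the general shape of an ICD-10 code (a letter, two digits, an optional dotted suffix) and is an illustrative simplification, not a validator; the function name is hypothetical.

```python
import re

# Shape check only: letter, two digits, optional ".suffix" of 1-4 characters.
ICD10_SHAPE = re.compile(r"^[A-Z][0-9]{2}(\.[0-9A-Z]{1,4})?$")


def looks_like_icd10_column(samples, min_match_count=10):
    """Tag the column only if at least `min_match_count` samples are code-shaped."""
    matches = sum(
        1 for v in samples
        if isinstance(v, str) and ICD10_SHAPE.match(v.strip())
    )
    return matches >= min_match_count


diagnosis = ["E11.9", "I10", "J45.909", "U07.1"] * 3   # 12 code-shaped values
room_codes = ["A12", "B07", "ward 3", "C19", "n/a"]     # only 3 code-shaped values

print(looks_like_icd10_column(diagnosis))    # True
print(looks_like_icd10_column(room_codes))   # False
```

Values such as 'A12' are code-shaped by accident, so a low threshold would tag the room column as diagnosis data; requiring ten matches in the sample filters out that kind of false positive.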
DP-900 Exam Focus: Microsoft Purview Data Catalog
Objective Code: This topic falls under Core Data Concepts (25-30% of the exam in the current skills outline; verify the weighting before you sit the exam), specifically Objective 1.1: 'Identify data formats and data stores'. Questions on Purview may also appear under 'Describe an analytics workload' (Objective 1.3) because Purview is part of the analytics ecosystem.
What DP-900 Tests:
Purpose of Purview Data Catalog: metadata management, data discovery, classification, lineage.
What it does NOT do: store actual data, process data, or perform ETL.
Supported data sources: Azure (SQL DB, Synapse, Data Lake, Power BI), on-premises (SQL Server, Oracle), multi-cloud (AWS S3, Google Cloud Storage), SaaS (Salesforce).
Classification: uses built-in classifiers (e.g., 'Credit Card Number', 'Email') and custom classifiers.
Lineage: captures data movement from source to destination via Azure Data Factory, SSIS, and custom APIs.
Integration: with Microsoft 365 sensitivity labels, Azure Policy, and Azure Data Factory.
Common Wrong Answers:
1. 'Purview stores the actual data.' – Wrong because Purview stores only metadata; the actual data stays in the source.
2. 'Purview can classify data in real time.' – Wrong because classification happens during scheduled scans, not real-time.
3. 'Purview is only for Azure data sources.' – Wrong because it supports on-premises, multi-cloud, and SaaS.
4. 'Purview automatically captures lineage for all data movement.' – Wrong because lineage requires integration (e.g., ADF) or custom API.
Specific Numbers and Terms:
- 'Data Map' (the metadata store).
- 'Scanner' (the component that crawls sources).
- 'Classification' (tagging sensitive data).
- 'Lineage' (data movement tracking).
- 'Glossary' (business terms).
- Default sample rows: 100 per table.
- Default classification confidence threshold: 80%.
Edge Cases:
Purview can scan Azure Data Lake Storage Gen2 as well as Azure Blob Storage and Azure Files; a storage account does not need a hierarchical namespace to be scanned.
Lineage for Power BI datasets is only captured if the underlying data sources are also registered in Purview.
Custom classifiers require a minimum of 10 data patterns to be effective.
How to Eliminate Wrong Answers:
If an answer says 'Purview stores data', eliminate it.
If an answer claims real-time classification, eliminate it.
If an answer limits Purview to only Azure, eliminate it (unless the question specifically mentions Azure-only scenario).
For lineage questions, look for integration with Azure Data Factory or SSIS.
Purview Data Catalog stores metadata (schema, classifications, lineage) – never actual data.
Classification uses built-in and custom classifiers; default sample rows = 100 per table.
Lineage requires integration with Azure Data Factory, SSIS, or custom API – not automatic.
Supports Azure, on-premises (SQL Server, Oracle), multi-cloud (AWS S3, GCS), and SaaS (Salesforce).
Scans are scheduled (one-time or recurring) – not real-time.
Default classification confidence threshold is 80% for built-in classifiers.
Purview integrates with Microsoft 365 sensitivity labels for data governance.
These come up on the exam all the time. Here's how to tell them apart.
Microsoft Purview Data Catalog
Part of Microsoft Purview governance portal (unified governance).
Supports automated scanning and classification with built-in classifiers.
Captures lineage from Azure Data Factory, SSIS, and custom sources.
Integrates with Microsoft 365 sensitivity labels and Azure Policy.
Supports multi-cloud and on-premises sources.
Azure Data Catalog (Classic)
Standalone service (now deprecated; migration to Purview recommended).
Required manual annotation; no automated classification.
No lineage capture capability.
Limited integration with other Azure services.
Only supported Azure data sources (SQL DB, Data Lake, etc.).
Mistake
Purview Data Catalog stores the actual data from scanned sources.
Correct
Purview stores only metadata—schema, classifications, lineage, and descriptions. The actual data remains in the original source (e.g., Azure SQL DB, Data Lake). Purview never ingests or copies data.
Mistake
Purview can classify data in real time as it changes.
Correct
Classification is performed during scheduled scans (e.g., daily, weekly). It is not real-time. Changes to data between scans will not be classified until the next scan.
Mistake
Purview is only for Azure data sources.
Correct
Purview supports a wide range of sources: Azure (SQL DB, Synapse, Data Lake, Blob, Power BI), on-premises (SQL Server, Oracle, Teradata), multi-cloud (Amazon S3, Google Cloud Storage), and SaaS (Salesforce, SAP). The DP-900 exam may ask about supported sources.
Mistake
Lineage is automatically captured for all data movement without any configuration.
Correct
Lineage requires explicit integration. For Azure Data Factory, you must enable lineage in ADF and connect it to Purview. For other tools, you must use Purview's Atlas API to push lineage.
Mistake
Purview can scan Azure Data Lake Storage Gen2 but not Azure Blob Storage.
Correct
Purview supports both Azure Data Lake Storage Gen2 (hierarchical namespace) and Azure Blob Storage (flat namespace) as scannable sources. The exam is most likely to test that Purview works with ADLS Gen2, but do not eliminate an answer just because it mentions Blob Storage.
No. Purview stores only metadata—table names, column names, data types, classifications, and lineage. It never copies or stores the actual data rows. The data remains in its original source. This is a common exam trap: answers that say Purview stores data are wrong.
Purview supports a wide range: Azure SQL Database, Azure Synapse Analytics, Azure Data Lake Storage Gen2, Azure Blob Storage, Power BI, on-premises SQL Server, Oracle, Teradata, Amazon S3, Google Cloud Storage, and SaaS like Salesforce. The DP-900 exam may ask you to identify supported sources.
During a scan, Purview samples up to 100 rows (configurable) from each table and applies built-in classifiers (e.g., 'Credit Card Number', 'Email'). If the confidence level exceeds 80% (default), the classification is applied. You can also create custom classifiers with regex patterns.
Lineage shows how data moves from source to destination (e.g., from Azure Data Lake to Azure Synapse to Power BI). It is captured via integration with Azure Data Factory (enable lineage in ADF), SSIS, or by pushing lineage using Purview's Atlas API. It is not automatic for all data movement.
Yes. Purview can scan on-premises SQL Server, Oracle, Teradata, and other sources using a self-hosted integration runtime (SHIR). The SHIR connects to on-premises networks and extracts metadata. This is a key distinction from Azure Data Catalog (classic) which only supported Azure sources.
In the Purview governance portal, use the search bar to find assets by name, classification, glossary term, or source type. You can also browse the data map hierarchy. Search results show asset details, schema, lineage, and contacts. This enables self-service data discovery.
Purview is the evolution of Azure Data Catalog. Key differences: Purview supports automated scanning and classification, lineage capture, multi-cloud sources, and integration with Microsoft 365 sensitivity labels. Azure Data Catalog (classic) is deprecated and required manual annotation.