MS-102 · Chapter 72 of 104 · Objective 3.3

Microsoft Purview Data Map

This chapter covers Microsoft Purview Data Map, a critical component of Microsoft's data governance and compliance solution. You will learn its architecture, how it discovers and classifies data across hybrid environments, and how it integrates with other Purview tools like Data Loss Prevention (DLP) and Insider Risk Management. On the MS-102 exam, questions on Data Map appear in about 5-8% of the Security Threats domain (Objective 3.3), focusing on its role in data classification, sensitivity labeling, and compliance readiness.

25 min read

Intermediate

Updated May 31, 2026

Reviewed by Johnson Ajibi · Senior Network & Security Engineer · MSc IT Security

The Library Card Catalog for Your Data

Imagine a massive library with millions of books, journals, maps, and digital media scattered across different floors, rooms, and even offsite storage. Without a central catalog, finding anything would be impossible. The library employs a card catalog system: a master index that records every item's title, author, subject, location (floor, shelf, section), and metadata like publication date and ISBN. When a librarian needs to find a book, they consult the catalog first, which tells them exactly where to go. The catalog doesn't store the books themselves; it stores pointers and descriptive information.

Similarly, Microsoft Purview Data Map is a metadata catalog for an organization's data estate. It scans data sources across on-premises, Azure, and other clouds, extracting technical metadata (schema, data types), business metadata (descriptions, classifications), and lineage (how data flows and transforms). Just as the library catalog allows efficient discovery and governance of physical items, the Data Map enables data stewards and compliance officers to discover, classify, and govern sensitive data without moving or copying the actual data. The Data Map stores metadata in a scalable graph database, providing a unified view. It also supports automated classification using sensitive information types (SITs) and custom classifiers, akin to the catalog's subject headings, allowing quick identification of sensitive content like PII or financial data. Without the Data Map, organizations would be like a library without a catalog: chaotic, inefficient, and non-compliant.

How It Actually Works

What is Microsoft Purview Data Map?

Microsoft Purview Data Map is a unified metadata repository that provides a comprehensive view of an organization's data assets. It is the foundation of the Microsoft Purview governance portfolio, enabling data discovery, classification, lineage tracking, and policy enforcement across on-premises, multi-cloud, and SaaS data sources. The Data Map does NOT store the actual data; it stores metadata—information about the data—such as schema, data types, classifications, sensitivity labels, and relationships.
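To make the metadata-only point concrete, here is a minimal Python sketch of what a catalog entry might hold. The class name `AssetMetadata`, its fields, and the sample qualified name are hypothetical illustrations, not the Data Map's actual entity schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssetMetadata:
    """Hypothetical catalog entry: describes an asset without holding its rows."""
    qualified_name: str                                        # pointer to where the data lives
    asset_type: str                                            # e.g. "table" or "file"
    schema: Dict[str, str] = field(default_factory=dict)       # column name -> data type
    classifications: List[str] = field(default_factory=list)   # e.g. "Credit Card Number"
    sensitivity_label: Optional[str] = None
    glossary_terms: List[str] = field(default_factory=list)


# Illustrative entry: only descriptive information is recorded.
entry = AssetMetadata(
    qualified_name="mssql://myserver.database.windows.net/mydb/dbo/Payments",
    asset_type="table",
    schema={"CardNumber": "varchar(19)", "Amount": "decimal(10,2)"},
    classifications=["Credit Card Number"],
)
print(entry.qualified_name, entry.classifications)
```

The entry carries a pointer, a schema, and classification results, but there is no field for the rows themselves, which stay in the source system.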

Why Does It Exist?

Organizations face challenges in understanding and governing their data estate due to data silos, varying formats, and lack of centralized visibility. Regulatory requirements (GDPR, HIPAA, CCPA) demand that organizations know where sensitive data resides, who has access, and how it flows. The Data Map solves this by providing a single pane of glass for data governance, enabling:

- Data Discovery: Automatically scanning data sources to catalog assets.
- Classification: Applying sensitivity labels and classifications to identify sensitive data.
- Lineage: Tracking data movement and transformation across pipelines.
- Policy Enforcement: Integrating with Microsoft Purview Data Loss Prevention (DLP) and other compliance tools.

How It Works Internally

The Data Map operates through a scanning and classification engine that connects to data sources via scanning connectors. Here is a step-by-step mechanism:

1. Registration: An administrator registers a data source (e.g., Azure SQL Database, Amazon S3, on-premises SQL Server) in the Purview governance portal. This involves providing connection details and credentials.

2. Scanning: A scan is triggered manually or on a schedule (e.g., weekly). The scanner connects to the data source and extracts metadata:

   - For relational databases: tables, columns, data types, primary/foreign keys.
   - For files: file name, size, format, last modified date.
   - For Power BI: datasets, reports, dashboards, measures.

3. Classification: During scanning, the Data Map applies built-in sensitive information types (SITs), such as credit card numbers, social security numbers, and passport numbers, to detect sensitive data. Custom SITs can be defined via regular expressions or keyword lists. Classification results are stored as metadata attributes.

4. Lineage Extraction: For data sources like Azure Data Factory or SQL Server Integration Services, the scanner captures data lineage: how data is transformed and moved from source to destination. Lineage is represented as a directed acyclic graph (DAG) in the Data Map.

5. Cataloging: All metadata is ingested into the Purview Atlas, a graph database based on Apache Atlas. The graph model allows rich relationships: an asset (e.g., a table) can be linked to its schema, classifications, lineage, and glossary terms.

6. Search and Discovery: Users can search the Data Map using the Purview governance portal or programmatically via REST APIs. Search results include assets, classifications, glossary terms, and lineage.
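The register, scan, classify, catalog, and search steps above can be sketched as a toy in-memory pipeline. Everything here is an assumption made for illustration: the dictionary-based source, the simplified SSN pattern standing in for a built-in SIT, and the `scan` and `search` helpers are not Purview APIs.

```python
import re

# Simplified stand-in for a built-in SIT (assumption, not the real definition).
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")

# Step 1: registration. A toy in-memory source plays the role of a database.
registered_sources = {
    "my-sql-server": {
        "Employees": {
            "Name": ["Ada", "Grace"],
            "TaxId": ["123-45-6789", "987-65-4321"],
        },
    }
}

catalog = []  # Step 5: the catalog keeps metadata entries, never the rows


def scan(source_name):
    for table, columns in registered_sources[source_name].items():
        for column, values in columns.items():
            # Step 2: record technical metadata for the column.
            entry = {"source": source_name, "table": table,
                     "column": column, "classifications": []}
            # Step 3: classify from the sampled values.
            if values and all(SSN_PATTERN.match(v) for v in values):
                entry["classifications"].append("U.S. Social Security Number")
            catalog.append(entry)


def search(keyword):
    # Step 6: discovery queries run against the catalog, not the source.
    kw = keyword.lower()
    return [e for e in catalog
            if kw in e["column"].lower()
            or any(kw in c.lower() for c in e["classifications"])]


scan("my-sql-server")
hits = search("social security")
print([(h["table"], h["column"]) for h in hits])
```

Lineage extraction (step 4) is left out of this sketch to keep it short.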

Key Components and Defaults

- Scanning Connectors: Pre-built connectors for over 30 data sources, including:
  - Azure: Blob Storage, Data Lake Storage, SQL Database, Cosmos DB
  - On-premises: SQL Server, Oracle, Teradata
  - Multi-cloud: Amazon S3, Google BigQuery, Snowflake
  - SaaS: Power BI, Salesforce
- Scan Rulesets: Define which classifications to apply during a scan (the scan schedule is set on the scan itself). Default rulesets include all built-in SITs.
- Classification Thresholds: For pattern-based SITs, a match must meet the minimum confidence level (default 60%) before a classification is applied. This is configurable per SIT.
- Sensitive Information Types (SITs): Over 200 built-in types covering PII, financial, medical, and more. Examples:
  - Credit Card Number: 12-19 digits, passes Luhn check.
  - U.S. Social Security Number: 9 digits with dashes (XXX-XX-XXXX).
  - Azure Storage Account Key: 88 characters, base64-encoded.
- Custom Classifiers: Can be created in the Microsoft 365 Compliance Center or via PowerShell (e.g., New-DlpSensitiveInformationType). They support keyword lists, proximity rules, and confidence levels.
- Glossary: Business metadata that provides context. Terms can be assigned to assets to indicate business definitions, ownership, and stewardship.
- Lineage: Captured for Azure Data Factory, Azure Synapse Pipelines, Power BI, SQL Server Integration Services, and custom pipelines via the Open Lineage API.
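The pattern checks named in the SIT examples can be illustrated with a short Python sketch. It assumes the simplified rules quoted above (12-19 digits plus a Luhn check, and the XXX-XX-XXXX layout); real SIT definitions include more conditions, such as supporting keywords. The function and constant names are hypothetical.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` (12-19 of them) pass the Luhn check."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 12 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Format-only check for the dashed SSN layout described above.
SSN_FORMAT = re.compile(r"^\d{3}-\d{2}-\d{4}$")

print(luhn_valid("4111 1111 1111 1111"))   # well-known test card number
print(luhn_valid("4111 1111 1111 1112"))   # last digit changed
print(bool(SSN_FORMAT.match("123-45-6789")))
```

The Luhn step is why a random 16-digit number is usually rejected: only about one in ten digit strings has a valid check digit.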

Configuration and Verification Commands

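The PowerShell examples below use illustrative syntax; confirm cmdlet and parameter names against the installed Az.Purview module before running them.

- Register Data Source (PowerShell):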

Register-AzPurviewDataSource -AccountName 'purview-account' -Name 'my-sql-server' -DataSourceType 'AzureSqlDatabase' -ConnectionString 'Server=tcp:myserver.database.windows.net;Database=mydb;...'

- Trigger Scan (PowerShell):

New-AzPurviewScan -AccountName 'purview-account' -DataSourceName 'my-sql-server' -Name 'weekly-scan' -ScanRulesetName 'default' -CollectionName 'finance' -TriggerType 'Recurring' -TriggerRecurrenceInterval 'Weekly'

- List Classification Rules (REST API; replace {accountName} with your Purview account name):

GET https://{accountName}.purview.azure.com/scan/classificationrules?api-version=2022-02-01-preview

- Search Assets (REST API):

POST https://{accountName}.purview.azure.com/catalog/api/search/query?api-version=2022-03-01-preview
  Body: {"keywords":"credit card","limit":10,"filter":{"entityType":"azure_sql_table"}}
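As a rough illustration of how the search call above could be issued from code, the following Python sketch builds the POST request without sending it. The helper name `build_search_request`, the placeholder URL, and the token string are assumptions; take the real endpoint path and api-version from the Purview REST reference, and obtain the bearer token from Microsoft Entra ID.

```python
import json
import urllib.request


def build_search_request(search_url, token, keywords, search_filter=None):
    """Build (but do not send) a JSON POST request for an asset search."""
    body = {"keywords": keywords}
    if search_filter is not None:
        body["filter"] = search_filter
    return urllib.request.Request(
        search_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only; nothing is sent over the network here.
request = build_search_request(
    "https://example.invalid/search",
    "<access-token>",
    "credit card",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it; that step is omitted so the sketch runs without credentials.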

Interaction with Related Technologies

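Microsoft Purview Information Protection: Sensitivity labels are defined in Information Protection, not in the Data Map. Once labels are extended to the Data Map and automatic labeling rules are configured, a scan can label assets based on the classifications it finds. For example, a column that contains credit card numbers can be labeled "Highly Confidential." The Data Map supplies the classification input; the labeling policy lives in the Microsoft 365 compliance center.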

Microsoft Purview Data Loss Prevention (DLP): DLP policies can use classifications from the Data Map to detect and protect sensitive data in transit (e.g., email, Teams) or at rest (e.g., SharePoint). However, DLP primarily relies on the Microsoft 365 compliance center's classification engine, not directly on the Data Map. The Data Map provides visibility into where sensitive data resides, which informs DLP policy design.

Microsoft Purview Insider Risk Management: The Data Map helps identify sensitive data that may be at risk of insider threats. For example, if a user accesses a highly classified dataset and then attempts to export it, the risk score increases.

Azure Policy: The Data Map can integrate with Azure Policy to enforce tagging and classification requirements at scale. For example, a policy can require that all Azure SQL databases are registered in Purview and scanned weekly.

Microsoft Sentinel: The Data Map can send audit events (e.g., scan completion, classification changes) to Sentinel for security monitoring.

Performance and Scale Considerations

The Data Map can catalog billions of assets. However, scanning frequency and depth impact performance.

Full scans are resource-intensive; incremental scans (only changes since last scan) are recommended for large data sources.
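The difference between the two scan types can be sketched in a few lines of Python. This assumes change detection is based on a last-modified timestamp, which is a simplification chosen for illustration; the `select_assets` helper and the sample assets are hypothetical.

```python
from datetime import datetime, timezone


def select_assets(assets, last_scan_time=None):
    """Return the assets a scan would process.

    Assumption: an asset counts as changed when its last-modified
    timestamp is later than the previous scan time.
    """
    if last_scan_time is None:
        return list(assets)  # full scan: every asset is processed
    return [a for a in assets if a["modified"] > last_scan_time]


assets = [
    {"name": "dbo.Customers", "modified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"name": "dbo.Orders", "modified": datetime(2024, 1, 9, tzinfo=timezone.utc)},
    {"name": "dbo.Payments", "modified": datetime(2024, 1, 10, tzinfo=timezone.utc)},
]
last_scan = datetime(2024, 1, 7, tzinfo=timezone.utc)

full = select_assets(assets)
incremental = select_assets(assets, last_scan)
print(len(full), [a["name"] for a in incremental])
```

The incremental pass touches two of the three assets here; on a source where little changes between runs, the saving grows accordingly.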

The scanning infrastructure runs on Azure; for on-premises sources, an Integration Runtime (a self-hosted agent) must be installed. The agent must have network access to both the data source and the Purview account.

Classification accuracy depends on the quality of SITs and custom rules. False positives can occur; manual review and tuning are necessary.

Walk-Through

Register a Data Source

In the Microsoft Purview governance portal, navigate to 'Data Map' > 'Sources'. Click 'Register' and select the data source type (e.g., Azure SQL Database). Provide connection details: server name, database name, authentication method (SQL authentication or managed identity). For on-premises sources, you must first install a self-hosted Integration Runtime. The registration creates a metadata entry in the Data Map's graph database, linking the source to a collection (e.g., 'Finance' or 'HR'). This step does NOT scan data yet; it only establishes the connection.

Create a Scan Rule Set

Before scanning, define which classifications to apply. In the Purview portal, go to 'Data Map' > 'Scan rule sets'. You can use the default rule set (includes all built-in SITs) or create a custom one. For example, a custom rule set might only include SITs relevant to GDPR (e.g., EU passport number, EU debit card number). You can also configure classification thresholds: for each SIT, set a minimum confidence level (default 60%). Lowering the threshold increases recall but may introduce false positives. The rule set is versioned and can be updated without affecting active scans.
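The recall versus false-positive trade-off can be seen with a small Python sketch. The detections, their confidence scores, and the ground-truth flags are made-up sample data, and `apply_threshold` is a hypothetical helper; the sketch treats the threshold as a minimum that a detection must meet.

```python
# Made-up detections: each has a confidence score and a ground-truth flag.
detections = [
    {"value": "4111 1111 1111 1111", "confidence": 95, "is_real_card": True},
    {"value": "5500 0000 0000 0004", "confidence": 75, "is_real_card": True},
    {"value": "tracking 9999888877776666", "confidence": 65, "is_real_card": False},
    {"value": "order id 1234567890123456", "confidence": 45, "is_real_card": False},
]


def apply_threshold(items, threshold):
    """Return (true positives, false positives) at a given minimum confidence."""
    accepted = [d for d in items if d["confidence"] >= threshold]
    true_pos = sum(1 for d in accepted if d["is_real_card"])
    false_pos = len(accepted) - true_pos
    return true_pos, false_pos


for threshold in (40, 60, 85):
    print(threshold, apply_threshold(detections, threshold))
```

With this sample, a threshold of 40 keeps both real cards but also both false matches, while 85 removes every false match at the cost of missing one real card.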

Trigger a Scan

Select the registered data source and click 'New scan'. Choose the scan rule set and configure the scan type: 'Full' (all data) or 'Incremental' (changes since last scan). Set the schedule—e.g., weekly full scan on Sunday nights, daily incremental scans. For large datasets, you can set a scan scope (e.g., specific schemas or folders) to limit the scan. The scanner connects to the data source using the provided credentials, enumerates assets (tables, files), and for each asset, extracts schema and sample data (up to 1,000 rows by default) to run classification. Metadata is sent to the Purview Atlas via HTTPS.
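Sample-based classification can be sketched as follows. The 1,000-row sample size comes from the text above; the `min_match_ratio` of 0.5, the `classify_column` helper, and the simplified SSN pattern are assumptions for illustration and do not describe how the service scores a column.

```python
import re
from itertools import islice

SAMPLE_SIZE = 1000  # default sample size quoted in the text
SSN_FORMAT = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # simplified pattern


def classify_column(values, pattern, sample_size=SAMPLE_SIZE, min_match_ratio=0.5):
    """Classify a column from at most `sample_size` of its values.

    Returns (is_match, rows_inspected). `min_match_ratio` is an
    illustrative cut-off, not a documented service setting.
    """
    sample = list(islice(values, sample_size))
    if not sample:
        return False, 0
    matches = sum(1 for v in sample if pattern.match(str(v)))
    return matches / len(sample) >= min_match_ratio, len(sample)


# A generator stands in for a 5,000-row column; only the sample is read.
column = (f"{i % 1000:03d}-12-3456" for i in range(5000))
print(classify_column(column, SSN_FORMAT))
```

Because only the sample is read, the cost of classifying a column stays roughly constant no matter how many rows the table holds.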

Review Classification Results

After the scan completes, view results in the 'Data Map' > 'Browse' or 'Search'. Each asset shows classifications applied (e.g., 'Credit Card Number' with 95% confidence). You can manually override classifications or assign sensitivity labels. For example, if a column is misclassified, you can remove the classification. The Data Map also provides a 'Glossary' where you can assign business terms (e.g., 'Customer PII') to assets. This step is crucial for data stewards to validate and enrich metadata.

Set Up Lineage Tracking

For data sources that support lineage (e.g., Azure Data Factory), enable lineage capture during scan configuration. The scanner extracts transformation steps: for example, a Data Factory pipeline that copies data from Blob Storage to SQL Database creates a lineage edge showing the source and destination. Lineage is stored in the graph database and can be visualized in the Purview portal as a DAG. This helps trace data origin and impact analysis—e.g., if a source table changes, which downstream reports are affected? Lineage is updated incrementally as pipelines run.
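Impact analysis over a lineage graph amounts to a downstream traversal. The sketch below stores lineage as an adjacency list and walks it breadth-first; the asset names and the `downstream` helper are hypothetical, and the graph mirrors the Blob Storage to SQL Database example above with two reports added.

```python
from collections import deque

# Lineage edges: each key feeds the assets listed as its value.
lineage = {
    "blob://raw/sales.csv": ["adf://pipelines/copy_sales"],
    "adf://pipelines/copy_sales": ["sql://mydb/dbo/Sales"],
    "sql://mydb/dbo/Sales": ["powerbi://reports/revenue", "powerbi://reports/forecast"],
}


def downstream(graph, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(downstream(lineage, "blob://raw/sales.csv"))
```

Starting from the source file, the traversal returns the pipeline, the SQL table, and both reports, which is the set of assets to review if the file's structure changes.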

What This Looks Like on the Job

Enterprise Scenario 1: Financial Services Compliance

A large bank must comply with SOX and PCI DSS. They use the Data Map to scan on-premises SQL Server databases and Azure SQL Databases for credit card numbers and financial account details. They register 50 SQL Server instances and 20 Azure SQL databases. Scans are scheduled weekly (full) and nightly (incremental). The Data Map classifies over 10,000 columns containing PAN (Primary Account Numbers). They create a custom SIT for their internal 'Account Number' format. The Data Map integrates with Microsoft Purview DLP to block emails containing credit card numbers. A common misconfiguration: they initially set the threshold too low (40%), causing many false positives (e.g., random 16-digit numbers classified as credit cards). After tuning to 85%, accuracy improved. Performance: scanning 50 TB of data took 8 hours per full scan; incremental scans took 30 minutes. They use Integration Runtime on 5 VMs for on-premises scanning.

Enterprise Scenario 2: Healthcare Data Governance

A hospital network uses the Data Map to govern patient data across Azure Data Lake, on-premises file servers, and Salesforce. They scan for HIPAA identifiers like SSN, medical record numbers, and health plan beneficiary numbers. They create a glossary term 'PHI' (Protected Health Information) and assign it to classified assets. The Data Map lineage feature tracks how patient data flows from the EHR system (on-premises) through Azure Data Factory to analytics dashboards. This helps them demonstrate compliance during audits. A challenge: the hospital has many legacy file shares with unstructured data; the Data Map's file scanning supports Office documents, PDFs, and CSV files but not all proprietary formats. They had to convert some files to supported formats. They also use custom classifiers with regex to detect their internal patient ID format (e.g., 'PAT-XXXXXXXX').
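A regex for the internal patient ID format might look like the Python sketch below. It assumes each X in 'PAT-XXXXXXXX' stands for a digit, which the scenario does not state, and the sample text is invented.

```python
import re

# Assumption: 'PAT-' followed by exactly eight digits.
PATIENT_ID = re.compile(r"\bPAT-\d{8}\b")


def find_patient_ids(text):
    """Return every patient ID found in free text."""
    return PATIENT_ID.findall(text)


sample = "Discharge summary for PAT-00123456, ref PAT-12 and XPAT-99999999."
print(find_patient_ids(sample))
```

The word boundaries keep the pattern from matching inside longer tokens, one simple way to cut false positives in unstructured files.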

Enterprise Scenario 3: Retail Multi-Cloud Discovery

A global retailer uses AWS S3, Google Cloud Storage, and Azure Blob for different business units. They need a unified view of data across clouds. They register each cloud storage as a source in Purview. Scans run weekly; they use the default rule set plus custom SITs for loyalty card numbers and gift card codes. The Data Map helps them identify that customer PII is stored in all three clouds, prompting them to centralize sensitive data to Azure for better control. A pitfall: they initially used the same scan schedule for all sources, but the AWS S3 bucket had millions of small files, causing scans to take 48 hours. They optimized by scoping scans to specific folders and using incremental scans. They also learned that the Data Map does not support all AWS services (e.g., DynamoDB is not natively supported; they had to use a custom connector via the Open Lineage API).
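Scoping a scan to specific folders can be pictured as a prefix filter over object keys. This Python sketch uses invented bucket contents and a hypothetical `in_scope` helper; in the portal the scope is chosen by selecting folders, not by writing code.

```python
def in_scope(object_key, include_prefixes):
    """Return True if the object sits under one of the scoped folders."""
    return any(object_key.startswith(prefix) for prefix in include_prefixes)


# Invented bucket listing for illustration.
bucket_objects = [
    "loyalty/2024/members.csv",
    "loyalty/2024/cards.parquet",
    "logs/app/2024-01-01.log",
    "tmp/cache/0001.bin",
]

scoped = [key for key in bucket_objects if in_scope(key, ["loyalty/"])]
print(f"{len(scoped)} of {len(bucket_objects)} objects in scope")
```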

How MS-102 Actually Tests This

MS-102 Exam Focus on Microsoft Purview Data Map

Objective Code: This topic falls under Domain 3: Security Threats, Objective 3.3: Implement and manage data classification and sensitivity labels. Specifically, the Data Map is tested as the mechanism for automated data classification and discovery. Expect 1-2 questions on the Data Map itself, and 2-3 questions on its integration with sensitivity labels and DLP.

Common Wrong Answers and Why:

1. 'The Data Map stores actual data for compliance': WRONG. The Data Map stores only metadata. Candidates confuse it with eDiscovery or backup solutions. The correct answer is that it stores metadata (schema, classifications, lineage).

2. 'Sensitivity labels are applied by the Data Map directly': PARTIALLY WRONG. The Data Map can suggest labels based on classifications, but labels are applied by Microsoft Purview Information Protection (MIP) clients or automatic labeling policies. The Data Map's role is to classify and provide visibility.

3. 'The Data Map can scan all data sources without an agent': WRONG. For on-premises sources, a self-hosted Integration Runtime is required. Azure sources can use managed identity. Cloud sources like AWS require network connectivity and credentials.

4. 'Classification is only based on content inspection': WRONG. Classification can be based on metadata (e.g., column name 'SSN') and schema patterns. The Data Map uses a combination of pattern matching (regex), keyword lists, and machine learning (for some SITs).

Specific Numbers and Terms That Appear on the Exam:

- Default confidence threshold: 60%
- Number of built-in SITs: Over 200
- Supported data sources: Over 30 (know the major ones: Azure SQL, SQL Server, Power BI, Azure Blob, Amazon S3)
- Scan types: Full and Incremental
- Integration Runtime: Required for on-premises sources
- Lineage: Captured for Azure Data Factory, Azure Synapse, Power BI, SQL Server Integration Services

Edge Cases and Exceptions:

The Data Map does NOT scan data in Microsoft 365 (Exchange, SharePoint, Teams). Those are governed by Microsoft 365 compliance center's own classification engine. This is a common trick: a question might ask about scanning SharePoint Online—the answer is not the Data Map.

The Data Map can be integrated with Azure Policy to enforce that all resources are registered and scanned. Know that Azure Policy can audit or deny resources not in compliance.

The Data Map supports custom classifiers but they must be defined in the Microsoft 365 Compliance Center (for Microsoft 365 SITs) or via Purview's own custom classification rules (for Data Map-specific ones). Candidates confuse these two.

How to Eliminate Wrong Answers:

If an answer says 'stores actual data', eliminate it.

If an answer says 'applies sensitivity labels automatically without any additional configuration', eliminate it (automatic labeling requires MIP policies).

If an answer says 'scans Exchange Online', eliminate it (Data Map does not cover Microsoft 365 workloads).

If an answer says Azure SQL 'requires an agent', eliminate it: Azure SQL can be scanned with managed identity, while on-premises sources require a self-hosted Integration Runtime.

Focus on the word 'metadata'—if the answer describes metadata, it's likely correct.

Key Takeaways

Microsoft Purview Data Map is a metadata catalog; it does NOT store actual data.

The Data Map supports over 30 data sources including Azure, on-premises, and multi-cloud (AWS, GCP).

On-premises data sources require a self-hosted Integration Runtime to scan.

Built-in sensitive information types (SITs) number over 200; default confidence threshold is 60%.

The Data Map does NOT scan Microsoft 365 workloads (Exchange, SharePoint, Teams).

Lineage is captured for Azure Data Factory, Azure Synapse, Power BI, and SQL Server Integration Services.

Classification can be based on content (sample rows), metadata (column names), or both.

Sensitivity labels are NOT applied by the Data Map; they are applied by Microsoft Purview Information Protection policies.

Scans can be full or incremental; incremental scans only process changes since the last scan.

The Data Map integrates with Azure Policy to enforce registration and scanning compliance.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Microsoft Purview Data Map

Scans on-premises, Azure, and multi-cloud data sources (e.g., SQL Server, Amazon S3).

Stores metadata in a graph database (Apache Atlas).

Provides lineage tracking for data pipelines (ADF, Synapse).

Integrates with Azure Policy for governance at scale.

Uses built-in SITs and custom classifiers for data classification.

Microsoft 365 Compliance Center Classification

Scans Microsoft 365 data (Exchange, SharePoint, Teams, OneDrive).

Stores classification results in the Microsoft 365 compliance center (not a graph database).

Does not provide lineage tracking.

Integrates with DLP and retention policies in Microsoft 365.

Uses the same SITs but also supports trainable classifiers (machine learning).

Watch Out for These

Mistake

Microsoft Purview Data Map stores the actual data from scanned sources.

Correct

The Data Map stores only metadata—schema, classifications, lineage, and glossary terms. The actual data remains in its original location. This is a fundamental distinction: the Data Map is a catalog, not a data lake or backup.

Mistake

The Data Map can scan Microsoft 365 data (Exchange, SharePoint, Teams).

Correct

The Data Map does not scan Microsoft 365 workloads. Those are governed by the Microsoft 365 compliance center's own classification engine. The Data Map focuses on on-premises, Azure, and other cloud data sources.

Mistake

Sensitivity labels are applied automatically by the Data Map during scanning.

Correct

The Data Map can classify data and suggest labels, but the actual application of sensitivity labels is done by Microsoft Purview Information Protection (MIP) clients (e.g., Office apps) or automatic labeling policies in the compliance center. The Data Map provides the classification input, not the labeling action.

Mistake

All data sources can be scanned without installing any software.

Correct

Azure data sources can use managed identity (no agent), but on-premises sources (e.g., SQL Server on-prem) require a self-hosted Integration Runtime. Multi-cloud sources (AWS, GCP) require network connectivity and credentials but no agent if they are cloud-only.

Mistake

The Data Map classification is only based on content inspection (sample data).

Correct

Classification uses multiple methods: content inspection (regex, keywords), metadata (column names, data types), and machine learning (for some built-in SITs). The scanner samples up to 1,000 rows per table for content analysis, but schema-based classification is also applied.
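Combining a column-name hint with content inspection can be sketched in Python as below. The hint and content patterns, the `classify` helper, and the way evidence is reported are all assumptions for illustration; machine-learning-based detection is not modelled.

```python
import re

# Hypothetical rules: one name hint and one content pattern per classification.
NAME_HINTS = {"U.S. Social Security Number": re.compile(r"ssn|social", re.IGNORECASE)}
CONTENT_PATTERNS = {"U.S. Social Security Number": re.compile(r"^\d{3}-\d{2}-\d{4}$")}


def classify(column_name, sample_values):
    """Return {classification: [evidence, ...]} for one column."""
    results = {}
    for label, content_pattern in CONTENT_PATTERNS.items():
        evidence = []
        if NAME_HINTS[label].search(column_name):
            evidence.append("column name")
        if sample_values and all(content_pattern.match(v) for v in sample_values):
            evidence.append("content")
        if evidence:
            results[label] = evidence
    return results


print(classify("SSN", ["123-45-6789"]))
print(classify("tax_ref", ["123-45-6789"]))
print(classify("notes", ["hello"]))
```

A column can be flagged on its name alone, its content alone, or both, which is why an answer claiming classification is "only" content-based is wrong.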

Frequently Asked Questions

What is the difference between Microsoft Purview Data Map and Microsoft Purview Data Catalog?

The Data Map is the underlying metadata storage and scanning engine. The Data Catalog is the user-facing experience (search, browse, glossary) built on top of the Data Map. On the exam, the term 'Data Map' often refers to the entire service, but technically the Data Catalog is the interface. Think of the Data Map as the database and the Data Catalog as the frontend.

Can Microsoft Purview Data Map scan data in Amazon S3?

Yes, the Data Map provides a native connector for Amazon S3. You register the S3 bucket as a source, provide AWS credentials (access key/secret), and scan. The scanner enumerates objects and extracts metadata. Note: the scanner runs in Azure, so network connectivity to AWS is required. Also, the Data Map does not support all AWS services (e.g., DynamoDB requires a custom connector).

How does the Data Map handle scanning large datasets?

For large datasets, use incremental scans after an initial full scan. The scanner samples up to 1,000 rows per table by default (configurable). For file stores, it scans file metadata and samples content for classification. To improve performance, scope scans to specific folders or schemas. The Integration Runtime can also be scaled out with multiple nodes for on-premises sources.

What is the role of the Integration Runtime in Purview Data Map?

The Integration Runtime (IR) is a self-hosted agent required for scanning on-premises data sources (e.g., SQL Server on VM, Oracle on-prem). It connects to the data source, extracts metadata, and sends it to the Purview account via HTTPS. The IR must be installed on a machine with network access to both the data source and the internet. It can be scaled by adding nodes for high availability.

Does the Data Map support custom classifiers?

Yes. You can create custom sensitive information types (SITs) in the Microsoft 365 Compliance Center (for use across Purview and Microsoft 365) or in the Purview governance portal (for Data Map-specific classification). Custom SITs support regex patterns, keyword lists, proximity rules, and confidence levels. They are useful for detecting proprietary data formats like employee IDs or internal project codes.

How does the Data Map integrate with Azure Policy?

Azure Policy can enforce that Azure resources (e.g., SQL databases, storage accounts) are registered in Purview and scanned regularly. For example, a policy can audit resources not registered or deny creation of resources without registration. This helps ensure consistent governance across subscriptions. The integration uses Azure Policy's compliance evaluation based on tags or resource properties.

What is data lineage in Purview Data Map?

Data lineage shows how data flows from source to destination, including transformations. For example, a pipeline in Azure Data Factory that copies data from Blob Storage to SQL Database creates a lineage edge. Lineage is stored as a directed acyclic graph (DAG) and can be viewed in the Data Catalog. It supports impact analysis (e.g., if a source changes, which reports are affected?) and root cause analysis.
