DP-900 | Chapter 80 of 101 | Objective 1.1

Data Governance: Lineage, Glossary, and Classification

This chapter covers three core pillars of data governance — data lineage, business glossary, and data classification — as tested in DP-900 Objective 1.1. These concepts are foundational for understanding how organizations manage, trust, and protect their data assets. Approximately 15-20% of exam questions touch on governance topics, making this a critical area for candidates. By the end, you'll know exactly what each term means, how they relate, and what the exam expects you to recall.

Updated May 31, 2026

Library Catalog for Corporate Data

Imagine a massive corporate library with millions of books, documents, and files scattered across multiple floors and rooms. Without a central catalog, finding anything would be impossible. Data governance is like that library's catalog system. Data lineage is the checkout history and shelf map showing where each book came from, who borrowed it, and where it is now. It tracks every movement and transformation. The business glossary is the library's official subject-heading guide — it defines what each term means (e.g., 'Customer' is a person who has purchased in the last 12 months) so everyone uses the same language. Classification is the labeling system: each book gets a sticker indicating its sensitivity (e.g., 'Public' or 'Confidential'). Just as a librarian uses these tools to manage the collection, a data governance team uses lineage, glossary, and classification to ensure data is trustworthy, understandable, and secure. Without them, data becomes chaotic, duplicated, and risky — like a library where books are never returned, labels are wrong, and no one agrees on what a 'book' even is.

How It Actually Works

What is Data Governance?

Data governance is the overall management of data availability, usability, integrity, and security. It defines policies, procedures, and standards for how data is collected, stored, processed, and used. In Azure, data governance is implemented through services like Azure Purview (now Microsoft Purview), which provides a unified data governance solution. The DP-900 exam focuses on three specific components: data lineage, business glossary, and data classification.

Data Lineage: The Data Journey

Data lineage is the process of tracking data as it flows from source to destination, including all transformations and movements along the way. It answers: Where did this data come from? How was it changed? Where is it used? For example, a sales report might originate from a CRM database, be transformed in Azure Data Factory, and then loaded into Power BI. Lineage shows each step.

How Lineage Works:

- Source Systems: Databases, files, APIs.
- Transformation Steps: ETL/ELT processes, data flows, stored procedures.
- Destinations: Data warehouses, data lakes, reporting tools.
- Metadata Collection: Azure Purview scans registered data sources and extracts metadata about datasets, columns, and transformations.
- Lineage Graph: A directed acyclic graph (DAG) showing the path. Each node is a dataset or process; each edge is a data flow.
- Impact Analysis: If a source table changes, lineage shows all downstream reports affected.
- Root Cause Analysis: If a report has errors, lineage helps trace back to the source.

Key Components:

- Dataset: A named collection of data (e.g., a table, file, folder).
- Process: An activity that transforms data (e.g., copy activity, SQL stored procedure).
- Column-Level Lineage: Tracks individual columns (e.g., 'CustomerID' from source to destination).
- Lineage Timestamps: When the lineage was captured (typically UTC).
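
The graph model above can be sketched in a few lines of Python. This is a minimal illustration, not how Purview stores lineage: the asset names and the edge list are hypothetical, and a breadth-first walk over the edges stands in for impact analysis (downstream) and root cause analysis (upstream).

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream -> downstream); all names are made up.
EDGES = [
    ("crm.Customers", "adf.CopyCustomers"),
    ("adf.CopyCustomers", "lake.curated_customers"),
    ("lake.curated_customers", "powerbi.SalesReport"),
    ("lake.curated_customers", "powerbi.ChurnDashboard"),
]


def build_graph(edges):
    """Return (downstream, upstream) adjacency maps for a list of edges."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first walk: every node reachable from start, excluding start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_graph(EDGES)

# Impact analysis: what is affected if the CRM table changes?
impacted = reachable("crm.Customers", DOWNSTREAM)

# Root cause analysis: what feeds the sales report?
origins = reachable("powerbi.SalesReport", UPSTREAM)
```

Both questions use the same traversal; only the direction of the edges differs, which is why a single lineage graph can serve impact analysis and root cause analysis alike.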

Azure Purview Lineage:

- Supports lineage for Azure Data Factory, Azure Synapse pipelines, Power BI datasets, SQL Server Integration Services (SSIS), and custom processes.
- For supported services, lineage is captured automatically once the service is connected to Purview, for example when a connected Azure Data Factory pipeline runs or a registered source is scanned.
- Manual lineage can be added using the Purview Atlas API.

Exam Focus:

- Know that lineage helps with impact analysis and root cause analysis.
- Understand that lineage is about tracking data flow and transformations.
- Remember that Azure Purview is the primary tool for lineage in Azure.

Business Glossary: A Common Language

A business glossary is a collection of terms and definitions used by an organization to ensure consistent understanding of business concepts. It provides a controlled vocabulary for data assets. For example, the term 'Customer' may be defined as 'An individual or entity that has purchased a product or service in the last 12 months.'

How a Business Glossary Works:

- Terms: Each term has a name, definition, and optional attributes like synonyms, acronyms, and related terms.
- Hierarchy: Terms can be organized into categories (e.g., 'Customer' under 'Sales').
- Stewards: Data stewards own and maintain terms.
- Association: Terms are linked to data assets (e.g., a column 'CustID' is associated with the term 'Customer').
- Approval Workflow: Terms go through a lifecycle: draft, review, approved, deprecated.
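
To make the term structure concrete, here is a minimal Python sketch of a glossary term and its links to assets. The class, field names, steward name, and asset path are all hypothetical; they illustrate the concepts in the list above and are not a Purview schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class GlossaryTerm:
    """One glossary entry; the field names are illustrative, not a Purview schema."""
    name: str
    definition: str
    status: str = "Draft"
    steward: Optional[str] = None
    category: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)
    assets: Set[str] = field(default_factory=set)

    def associate(self, asset: str) -> None:
        """Link this term to a data asset such as a table or column."""
        self.assets.add(asset)


def terms_for_asset(terms, asset):
    """Return the names of all terms linked to the given asset, sorted."""
    return sorted(t.name for t in terms if asset in t.assets)


customer = GlossaryTerm(
    name="Customer",
    definition="An individual or entity that has purchased a product or service in the last 12 months.",
    steward="sales-data-steward",
    category="Sales",
    synonyms=["Client"],
)
customer.status = "Approved"
customer.associate("Sales.Customers.CustID")
```

Calling `terms_for_asset([customer], "Sales.Customers.CustID")` returns the term names attached to that column, which is the lookup a catalog search performs when it shows business context next to a technical asset.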

Azure Purview Business Glossary:

- Create terms via the Purview Studio UI, REST API, or import from a CSV file.
- Terms can have parent terms, synonyms, acronyms, and related terms.
- Stewards assign terms to assets and columns.
- Glossary is searchable and supports role-based access control.

Exam Focus:

- The business glossary ensures consistent definitions across the organization.
- It helps avoid confusion when different departments use the same term differently.
- Know that terms can be associated with data assets and columns.

Data Classification: Protecting Sensitive Data

Data classification is the process of categorizing data based on its sensitivity and business impact. Common classification labels include Public, Internal, Confidential, and Highly Confidential. Classification helps apply appropriate security controls, such as encryption, access restrictions, and retention policies.

How Classification Works:

- Manual Classification: Users assign labels manually.
- Automatic Classification: Uses pattern matching, machine learning, or content inspection. For example, scanning for credit card numbers (16 digits) can classify a column as 'Financial' or 'Confidential'.
- System-Defined vs. Custom Labels: Azure provides built-in labels (e.g., 'Public', 'Confidential'), but organizations can create custom ones.
- Sensitivity Labels: In Microsoft 365, sensitivity labels can be applied to documents and emails. Azure Purview extends this to data assets.

Azure Purview Classification:

- Uses built-in classifiers (e.g., 'Credit Card Number', 'Person Name', 'Social Security Number').
- Custom classifiers can be defined using regular expressions (regex) or dictionaries of known values.
- Classification results are stored as metadata and can trigger policies (e.g., masking, encryption).
- Integration with Microsoft Information Protection (MIP) for sensitivity labels.
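
The idea that classification is metadata which a separate policy then acts on can be sketched as follows. Everything here is an assumption for illustration: the column names, the label-to-policy mapping, the 'Internal' fallback for unclassified columns, and the masking rule. It is not how Purview or Azure SQL Database implement masking.

```python
# Hypothetical classification metadata produced by a scan: column -> label.
COLUMN_CLASSIFICATIONS = {
    "customers.email": "Confidential",
    "customers.country": "Public",
}

# Assumed policy: only these labels trigger masking.
MASKED_LABELS = {"Confidential", "Highly Confidential"}


def mask(value: str) -> str:
    """Keep the last four characters and replace the rest with asterisks."""
    if len(value) <= 4:
        return "*" * len(value)
    return "*" * (len(value) - 4) + value[-4:]


def apply_policy(column: str, value: str) -> str:
    """Return the value as a consumer would see it under the masking policy."""
    # Unclassified columns fall back to 'Internal' (an assumption of this sketch).
    label = COLUMN_CLASSIFICATIONS.get(column, "Internal")
    return mask(value) if label in MASKED_LABELS else value


masked_email = apply_policy("customers.email", "ana@example.com")
plain_country = apply_policy("customers.country", "Chile")
```

The label itself never blocks access; the policy function does. That separation is why classification and access control are treated as different things.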

Exam Focus:

- Classification helps protect sensitive data and comply with regulations (e.g., GDPR, HIPAA).
- Know the difference between classification labels (Public, Confidential) and sensitivity labels.
- Understand that Azure Purview can automatically classify data using built-in classifiers.

How They Work Together

Data lineage, business glossary, and classification are interconnected:

- Lineage shows where data comes from and how it transforms.
- Glossary provides business context for the data (e.g., what 'Customer' means).
- Classification adds security context (e.g., this column contains PII).

Together, they give a complete picture: trusted, understandable, and secure data.

Example: A column named 'SSN' in a SQL database:

- Lineage: Shows it was copied from a legacy HR system via Azure Data Factory.
- Glossary: Associates the column with the term 'Social Security Number' defined as 'U.S. tax identifier'.
- Classification: Automatically classified as 'Confidential' because it matches the 'Social Security Number' pattern.

Implementation in Azure

Azure Purview is the primary service for unified data governance. Key steps:

1. Register Data Sources: Register Azure SQL Database, Azure Data Lake Storage, Power BI, etc.
2. Scan Sources: Run scans to extract metadata, lineage, and classification.
3. Create Glossary Terms: Define business terms and associate them with assets.
4. Review Lineage: View the data flow graph in Purview Studio.
5. Apply Classification: Use built-in or custom classifiers.
6. Search and Discover: Use the catalog to find and understand data.

Commands/API:

- REST API: POST https://{accountName}.purview.azure.com/catalog/api/atlas/v2/types/typedefs for custom types, where {accountName} is the name of your Purview account.
- PowerShell: New-AzPurviewAccount to create a Purview account.
- Azure CLI: az purview account create (preview).
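
For readers curious what the REST call might look like, the sketch below builds (but does not send) a typedefs request using only the Python standard library. Treat the details as assumptions: it assumes the account-specific host pattern `{accountName}.purview.azure.com`, a bearer token obtained elsewhere, and an Apache Atlas-style `entityDefs` payload. The account name, token placeholder, and type name are hypothetical.

```python
import json
import urllib.request


def build_typedef_request(account_name: str, token: str, type_name: str) -> urllib.request.Request:
    """Build a POST request that would register a custom entity type."""
    # Assumed host pattern: each Purview account has its own subdomain.
    url = f"https://{account_name}.purview.azure.com/catalog/api/atlas/v2/types/typedefs"
    # Atlas-style payload: a custom type that extends the generic DataSet type.
    payload = {
        "entityDefs": [
            {"name": type_name, "superTypes": ["DataSet"], "attributeDefs": []}
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values; nothing is sent over the network here.
req = build_typedef_request("contoso-purview", "<access-token>", "custom_dataset")
```

Sending the request would require a valid Microsoft Entra access token and suitable permissions on the account, which is outside the scope of this chapter.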

Exam Tip:

- You don't need to memorize API endpoints for DP-900. Focus on concepts.
- Know that Purview is the tool for all three: lineage, glossary, and classification.

Common Parameters and Defaults

Scan Frequency: Default is weekly; can be set to daily, hourly, or monthly.

Classification Confidence: Built-in classifiers have a confidence level (e.g., high, medium, low).

Glossary Term Status: Draft, Approved, Expired, Alert.

Lineage Retention: Not explicitly defined; lineage is kept as long as the account exists.

Interaction with Other Azure Services

Azure Data Factory: Provides lineage for copy activities and data flows.

Azure Synapse Analytics: Lineage for pipelines and SQL scripts.

Power BI: Lineage for datasets, reports, and dashboards.

Microsoft 365: Sensitivity labels can be applied to Office documents and synced with Purview.

Azure Policy: Can enforce classification labels on new resources.

Summary

Data governance in Azure is centered on Microsoft Purview. The three pillars — lineage, glossary, and classification — enable organizations to trust, understand, and protect their data. For DP-900, focus on the definitions, benefits, and how they relate to each other.

Walk-Through

1. Register Data Sources in Purview

Begin by registering your data sources (e.g., Azure SQL Database, Azure Data Lake Storage Gen2, Power BI) in Azure Purview. This tells Purview where to look for metadata. Each source type has a specific registration process — for example, for Azure SQL, you provide the server name, database name, and authentication credentials (SQL auth or managed identity). Purview stores these registration details and uses them during scans. The registration step is essential because without it, Purview cannot collect any metadata. You can register up to 100 sources per Purview account (default limit). The registration does not copy data; it only stores connection info.

2. Configure and Run Scans

After registration, configure a scan for each source. A scan extracts metadata: table names, column names, data types, and relationships. You set the scan frequency (e.g., weekly, daily) and scope (e.g., specific tables or folders). During a scan, Purview connects to the source, reads the schema, and for classification, it samples data (up to 128 rows per column by default) to detect sensitive patterns. The scan also captures lineage if the source is part of a data pipeline (e.g., Azure Data Factory). The scan duration depends on the volume of metadata; typical scans complete within minutes for small sources but can take hours for large data lakes. After the scan, the metadata is stored in Purview's catalog.
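
The sampling step can be pictured with a short Python sketch. It is only an illustration under stated assumptions: rows arrive as dictionaries, the 128-row limit is the figure quoted above, and the function name and sample data are hypothetical.

```python
from itertools import islice

SAMPLE_ROWS = 128  # per-column sample size quoted in the text above


def sample_columns(rows, limit=SAMPLE_ROWS):
    """Collect up to `limit` values per column from an iterable of dict rows."""
    samples = {}
    # islice stops after `limit` rows, so a large source is never read in full.
    for row in islice(rows, limit):
        for column, value in row.items():
            samples.setdefault(column, []).append(value)
    return samples


# Hypothetical source with 1,000 rows; only the first 128 are read.
rows = ({"id": i, "ssn": f"{i:03d}-12-3456"} for i in range(1000))
samples = sample_columns(rows)
```

Because only a sample is inspected, a column whose sensitive values appear late in the data could be missed, which is one reason classification results deserve a manual review.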

3. Create and Manage Glossary Terms

Next, create business glossary terms to standardize definitions. In Purview Studio, navigate to the Glossary section and click 'New term'. Provide a name (e.g., 'Customer'), definition (e.g., 'An individual or organization that has purchased a product or service in the last 12 months'), and optionally select a parent term, synonyms (e.g., 'Client'), and related terms (e.g., 'Order'). Terms can have a status: Draft, Approved, Expired, or Alert. After creation, assign a steward (a user responsible for maintaining the term). Terms can be imported via CSV using a specific template. Once terms are approved, they can be associated with data assets (e.g., link the 'Customer' term to the 'CustID' column in a table).
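
As a rough picture of a CSV import, the sketch below parses a small term file with Python's standard `csv` module. The column headers (`Name`, `Definition`, `Status`, `Synonyms`) and the semicolon separator are hypothetical and are not the actual Purview import template; download the template from the portal for the real layout.

```python
import csv
import io

# Hypothetical CSV layout; not the real Purview template.
CSV_TEXT = """Name,Definition,Status,Synonyms
Customer,An individual or organization that has purchased in the last 12 months,Draft,Client
Revenue,Net sales after returns,Approved,Net Sales;Turnover
"""


def load_terms(text):
    """Parse glossary terms from CSV text into a list of dictionaries."""
    terms = []
    for row in csv.DictReader(io.StringIO(text)):
        terms.append({
            "name": row["Name"].strip(),
            "definition": row["Definition"].strip(),
            # An empty status cell falls back to Draft (an assumption of this sketch).
            "status": row["Status"].strip() or "Draft",
            "synonyms": [s.strip() for s in row["Synonyms"].split(";") if s.strip()],
        })
    return terms


terms = load_terms(CSV_TEXT)
```

Bulk import like this is mainly useful when a glossary already exists in a spreadsheet and retyping each term through the UI would be slow and error-prone.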

4. Associate Glossary Terms with Assets

After terms are created, manually or automatically associate them with data assets. In Purview, you can select a table or column and assign a glossary term. For example, select the 'Sales.Customers' table and assign the 'Customer' term. This association provides business context. You can also use automated rules: for instance, if a column name contains 'cust' or 'customer', automatically assign the 'Customer' term. This step is crucial for making data understandable to business users. The association is stored as metadata and appears in the data catalog when users search for 'Customer'. Multiple terms can be associated with one asset, and one term can link to many assets.
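
The name-based rule mentioned above can be sketched as a simple substring match. The rule table and function are hypothetical and only illustrate the idea; they do not represent a Purview feature or API.

```python
# Hypothetical rule table: glossary term -> substrings to look for in column names.
RULES = {
    "Customer": ("cust", "customer"),
}


def suggest_terms(column_name, rules=RULES):
    """Return glossary terms whose rule matches the column name (case-insensitive)."""
    lowered = column_name.lower()
    return sorted(
        term
        for term, needles in rules.items()
        if any(needle in lowered for needle in needles)
    )


suggested = suggest_terms("CustID")
unrelated = suggest_terms("OrderDate")
```

Substring rules are blunt: a column named 'custody_flag' would also match 'cust', so suggestions like these are best reviewed by a steward before the association is saved.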

5. View and Validate Lineage

Lineage is automatically captured for supported sources (e.g., Azure Data Factory pipelines, Power BI datasets). In Purview Studio, navigate to the asset (e.g., a dataset) and click on the 'Lineage' tab. You'll see a directed graph showing sources, processes (e.g., copy activity), and destinations. Each node shows metadata like last update time. You can click on a process to see details: input datasets, output datasets, transformation logic (if available). Validate lineage by ensuring that all expected data flows are present. If a pipeline is missing, check that the pipeline is registered and scanned. Lineage helps answer: 'Where did this data come from?' and 'What downstream reports are affected if I change this source?'
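
Validation boils down to comparing the flows you expect with the flows that were captured. A minimal sketch, assuming both are available as sets of (source, destination) pairs; the asset names are hypothetical.

```python
# Flows the team expects to exist (hypothetical names).
EXPECTED = {
    ("sql.Transactions", "adf.LoadCurated"),
    ("adf.LoadCurated", "lake.curated_sales"),
    ("lake.curated_sales", "powerbi.MonthlySalesReport"),
}

# Flows that actually appear in the captured lineage graph.
CAPTURED = {
    ("sql.Transactions", "adf.LoadCurated"),
    ("adf.LoadCurated", "lake.curated_sales"),
}


def missing_flows(expected, captured):
    """Return expected edges that are absent from the captured lineage, sorted."""
    return sorted(expected - captured)


gaps = missing_flows(EXPECTED, CAPTURED)
```

Here the gap is the last hop into the Power BI report, which would point to checking whether that Power BI workspace is registered and scanned.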

6. Review Classification Results

After scans, Purview automatically applies built-in classifiers (e.g., 'Credit Card Number', 'Person Name', 'Social Security Number') to columns. In the asset details, go to the 'Classification' tab to see which columns are classified and with what label (e.g., 'Confidential'). You can also manually apply classifications. Review the results to ensure accuracy, because false positives may occur (e.g., a numeric ID column might be misclassified as a credit card). You can reclassify or remove classifications. For custom classifiers, you define a regex pattern (e.g., `\d{3}-\d{2}-\d{4}` for SSN) and a minimum match threshold (default 60%), which is the minimum percentage of distinct values in a column that must match the pattern before the classification is applied. Classification results can trigger policies like data masking in Azure SQL Database.
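
A regex classifier with a threshold can be sketched in a few lines. The pattern is the SSN example from the text; the 60% threshold is the default quoted above. How the threshold is computed here (share of distinct non-empty values that fully match) is an assumption of this sketch, not a statement of Purview's exact algorithm.

```python
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
MIN_MATCH = 0.60  # default threshold quoted in the text


def classify_column(values, pattern=SSN_PATTERN, threshold=MIN_MATCH):
    """Return True if enough distinct, non-empty values fully match the pattern."""
    distinct = {v for v in values if v}
    if not distinct:
        return False
    matches = sum(1 for v in distinct if pattern.fullmatch(v))
    return matches / len(distinct) >= threshold


# Two of three distinct values match (about 67%), so the column is classified.
mostly_ssn = classify_column(["123-45-6789", "987-65-4321", "n/a"])

# Only one of three distinct values matches (about 33%), so it is not.
mostly_other = classify_column(["123-45-6789", "A1", "B2"])
```

Raising the threshold reduces false positives on columns that contain only a few look-alike values; lowering it catches sparse columns at the cost of more manual review.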

What This Looks Like on the Job

Enterprise Scenario 1: Financial Services Compliance

A large bank needs to comply with GDPR and SOX regulations. They use Azure Purview to automatically classify columns containing PII (e.g., Social Security Numbers, account numbers) as 'Confidential'. The business glossary defines terms like 'Customer' and 'Account Balance' consistently across retail and investment divisions. Lineage tracks how customer data flows from the core banking system (on-premises SQL Server) through Azure Data Factory to Azure Synapse Analytics for reporting. When a regulator asks, 'Where is customer SSN data stored and who has access?', the governance team runs a Purview search to find all assets classified as 'Confidential' and views lineage to understand data movement. They also use impact analysis: if the source table schema changes, they can see which downstream reports break. Without this, compliance audits would be manual, error-prone, and time-consuming.

Enterprise Scenario 2: Healthcare Data Integration

A hospital network acquires a smaller clinic and needs to integrate patient data. The clinic uses different terminology (e.g., 'Patient ID' vs 'Medical Record Number'). The business glossary in Purview maps these terms to a standard definition: 'Patient Identifier – unique alphanumeric code assigned to each patient.' Classification automatically detects columns with health information (e.g., diagnosis codes, lab results) and labels them as 'Highly Confidential' per HIPAA. Lineage shows the data flow from the clinic's EHR system (on-premises) via Azure Data Factory to the central Azure SQL Database. During integration, they discover that a column named 'SSN' in the clinic data is not actually a Social Security Number but a patient ID; lineage helps trace back to verify. They correct the glossary association and reclassify. This scenario highlights how governance tools prevent data misinterpretation and ensure compliance.

Enterprise Scenario 3: Retail Analytics and Data Democratization

A retail company wants to empower business analysts to self-serve data for sales reports. They register all data sources (Azure SQL Database for transactions, Azure Data Lake for clickstream, Power BI for dashboards) in Purview. The business glossary defines terms like 'Revenue' (net sales after returns) and 'Customer Lifetime Value' (predicted total spend). Classification labels customer email addresses as 'Confidential' and masks them in non-production environments. Lineage shows that the 'Monthly Sales Report' in Power BI sources data from a curated layer in the data lake, which itself comes from raw transaction data transformed by Azure Data Factory. When an analyst finds an anomaly in the report, they use lineage to trace back to the source and identify a bug in the transformation. The glossary ensures that 'Revenue' is consistently calculated across all reports. This reduces time-to-insight and builds trust in data.

How DP-900 Actually Tests This

DP-900 Exam Focus: Data Governance

What the Exam Tests (Objective 1.1)

The DP-900 exam covers data governance under 'Core Data Concepts' with a focus on:

- Describe data governance components: data lineage, business glossary, data classification.
- Identify benefits: impact analysis, root cause analysis, consistent definitions, data protection.
- Recognize Azure tools: Microsoft Purview (formerly Azure Purview) as the primary governance service.

Most Common Wrong Answers

1. Confusing data lineage with data versioning: Many candidates think lineage is about tracking different versions of a dataset (like Git). Wrong. Lineage is about data flow and transformations, not version history.
2. Thinking the business glossary is just a dictionary: Candidates often assume it's a simple list of terms. In reality, it includes definitions, synonyms, related terms, stewards, and associations with assets.
3. Mixing up classification with access control: Classification labels (e.g., 'Confidential') are not the same as access permissions (e.g., 'Read' or 'Write'). Classification informs policy but does not enforce it directly.
4. Believing lineage is only for ETL processes: Lineage also covers data movement in Power BI, SQL Server Integration Services (SSIS), and even custom applications via API.

Specific Numbers and Terms on the Exam

Built-in classifiers: 'Credit Card Number', 'Person Name', 'Social Security Number', 'U.S. Bank Account Number'.

Classification confidence: High, Medium, Low.

Glossary term statuses: Draft, Approved, Expired, Alert.

Lineage graph: Directed acyclic graph (DAG) – know the term.

Purview scanning: Default frequency is weekly.

Edge Cases and Exam Traps

Lineage for on-premises sources: Purview supports lineage for on-premises SQL Server via self-hosted integration runtime. The exam may test that lineage is not limited to cloud sources.

Custom classifiers: You can define custom classifiers using regular expressions. The exam might ask about regex-based classification.

Multiple glossary terms per asset: An asset can have multiple terms. For example, a column 'Email' could be associated with both 'Contact Information' and 'Personal Data'.

Lineage retention: Lineage persists until the asset is deleted or the scan is removed. There is no expiration timer.

How to Eliminate Wrong Answers

If the question asks about 'tracking data origin and transformations', the answer is 'data lineage', not 'data classification' or 'glossary'.

If the question mentions 'consistent definitions across departments', think 'business glossary'.

If the question involves 'sensitivity labels' or 'protecting sensitive data', think 'data classification' (and possibly Microsoft Information Protection).

For 'root cause analysis' or 'impact analysis', always choose 'data lineage'.

Exam Tips

Remember the mnemonic: Lineage = Location (where data came from and where it goes), Glossary = Language (shared definitions), Classification = Confidentiality (sensitivity).

Practice identifying scenarios: 'A data analyst wants to know which reports will break if a source table is modified' → Impact analysis → Data lineage.

Know that Azure Purview is the single tool for all three; no other Azure service provides all three natively.

Key Takeaways

Data lineage tracks the origin, movement, and transformation of data from source to destination.

Business glossary provides standardized definitions and terms for consistent business understanding.

Data classification labels data based on sensitivity (e.g., Public, Confidential, Highly Confidential).

Azure Purview (now Microsoft Purview) is the unified data governance service for lineage, glossary, and classification.

Lineage supports impact analysis (what breaks if source changes) and root cause analysis (why data is wrong).

Glossary terms can have synonyms, acronyms, parent terms, and be linked to specific columns or tables.

Classification uses built-in classifiers (e.g., Credit Card Number, SSN) with confidence levels (High, Medium, Low).

Custom classifiers can be defined using regular expressions (regex) for organization-specific patterns.

Lineage is automatically captured for Azure Data Factory, Azure Synapse, Power BI, and SSIS pipelines.

The business glossary helps avoid confusion when different departments use the same term differently.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Data Lineage

Tracks data flow from source to destination.

Used for impact analysis and root cause analysis.

Shows transformations and movements.

Visualized as a directed acyclic graph (DAG).

Automatically captured for supported pipelines.

Data Classification

Categorizes data by sensitivity (e.g., Public, Confidential).

Used for compliance and data protection.

Applies labels to columns and files.

Can be automatic (pattern matching) or manual.

Integrates with Microsoft Information Protection.

Watch Out for These

Mistake

Data lineage is the same as data versioning.

Correct

Data lineage tracks the flow and transformation of data from source to destination, not version history. Versioning (e.g., in Azure Blob Storage) keeps snapshots of data at different points in time, while lineage shows the path data takes through pipelines.

Mistake

The business glossary is just a list of terms with definitions.

Correct

A business glossary includes definitions, synonyms, acronyms, related terms, parent-child relationships, stewards, statuses, and associations with data assets (tables, columns). It is a rich metadata repository that enables consistent business understanding.

Mistake

Data classification automatically restricts access to data.

Correct

Classification labels (e.g., 'Confidential') indicate the sensitivity of data but do not enforce access controls. They can be used to trigger policies (e.g., masking, encryption) but are not permissions themselves. Access control is handled separately via Azure RBAC or SQL permissions.

Mistake

Lineage is only available for cloud data sources.

Correct

Azure Purview supports lineage for on-premises sources (e.g., SQL Server) via self-hosted integration runtime. It also supports custom lineage through the Atlas API, so any data flow can be tracked.

Mistake

You need to manually create all lineage connections.

Correct

For supported services like Azure Data Factory and Power BI, lineage is captured automatically during scans. You only need manual input for custom or unsupported processes.


Frequently Asked Questions

What is the difference between data lineage and data classification?

Data lineage tracks the flow and transformation of data from source to destination, showing where data came from and how it changed. Data classification labels data based on sensitivity (e.g., Public, Confidential). Lineage answers 'where did this data come from?', while classification answers 'how sensitive is this data?' Both are part of data governance in Azure Purview.

How does Azure Purview capture data lineage automatically?

Azure Purview captures lineage by scanning supported data sources and pipelines. For example, when you register an Azure Data Factory instance and run a scan, Purview extracts metadata about copy activities, data flows, and their inputs/outputs. It builds a directed acyclic graph (DAG) showing the data flow. This is automatic for services like ADF, Synapse, Power BI, and SSIS. No manual coding is required.

Can I create custom classifications in Azure Purview?

Yes, you can create custom classifications using regular expressions (regex) or dictionaries of known values. For example, to classify employee IDs that follow a pattern like 'EMP-12345', you can define a regex `EMP-\d{5}`. You also set a minimum match threshold (default 60%), the minimum percentage of distinct values in a column that must match. Custom classifiers appear in the classification list and can be applied automatically during scans.

What is the purpose of a business glossary?

The business glossary ensures consistent definitions of business terms across the organization. For example, 'Revenue' might mean different things in sales vs. finance. The glossary defines it as 'Net sales after returns and discounts.' It also includes synonyms, acronyms, related terms, and associations with data assets. This helps business users find and understand data correctly.

How does data lineage help with impact analysis?

Impact analysis uses lineage to determine what downstream assets (reports, dashboards, datasets) will be affected if a source table or column changes. For example, if a column in a source database is renamed, lineage shows all the pipelines, data flows, and Power BI reports that depend on that column. This allows data engineers to assess risk and plan changes accordingly.

Is data lineage only for cloud data?

No, Azure Purview supports lineage for on-premises data sources via self-hosted integration runtime. You can register on-premises SQL Server, Oracle, or Teradata and capture lineage for data moved to Azure. Additionally, you can add custom lineage using the Atlas API for any data flow, regardless of location.

Can I associate multiple glossary terms with one column?

Yes, a single column can be associated with multiple glossary terms. For example, a column 'Email' could be linked to both 'Contact Information' and 'Personal Data' terms. This provides richer context. The association is stored as metadata and appears in the data catalog.
