DP-900 | Chapter 33 of 101 | Objective 1.1

Data Governance: Quality, Lineage, Cataloguing

This chapter covers data governance fundamentals: cataloguing, lineage, and quality. These topics are critical for the DP-900 exam because they appear in roughly 10–15% of questions under Core Data Concepts (Objective 1.1). You will need to understand how Azure Purview provides unified data governance, how lineage traces data from source to consumption, and how quality dimensions like completeness and accuracy are measured. We will also explore the exam's focus on definitions, Azure services, and common misconceptions.

25 min read
Intermediate
Updated May 31, 2026

Library Catalog, Checkout Log, and Book Condition

Imagine a large public library with millions of books. Data cataloguing is like the library's online catalog: it lists every book's title, author, subject, location (aisle and shelf), and a summary. Without this catalog, finding a book would require wandering aimlessly. Data lineage is like the library's checkout log: it records every time a book was borrowed, by whom, when it was returned, and any damage noted. This log helps track the book's history and provenance. Data quality is like the library's condition assessment: each book is inspected for missing pages, stains, or binding wear. Books are rated (e.g., 'like new', 'good', 'fair'). If a book has critical errors, it's flagged and quarantined. In a data lake, the catalog is the metadata store (like Azure Purview), the log is the lineage graph, and quality checks are rules that assign a score (e.g., completeness >99%, accuracy >95%). Just as a librarian uses all three to manage the collection, a data engineer uses Purview to discover, trace, and trust data.

How It Actually Works

What is Data Governance and Why Does It Matter?

Data governance is the overall management of data availability, usability, integrity, and security. In Azure, it is implemented primarily through Azure Purview (renamed Microsoft Purview in 2022), a unified data governance service. Governance ensures that data consumers can find, understand, trust, and use data. Without governance, data lakes become data swamps: untrusted, undocumented, and unusable.

Data Cataloguing: The Inventory of Your Data Assets

A data catalog is a metadata repository that stores information about data assets: tables, files, reports, and even machine learning models. Once you register a source, Azure Purview's catalog scans it, whether on-premises or in the cloud (Azure SQL Database, Azure Synapse, Azure Data Lake Storage, Power BI, etc.), and extracts technical metadata (schema, data types, partitions). Classifications are applied during the scan, while business metadata such as descriptions and glossary terms is attached by data stewards and users.

Key Components:
- Scanners: Purview uses scanners to connect to data sources via runtime environments (self-hosted integration runtime for on-premises, Azure IR for cloud). Scanners extract metadata once or on a recurring schedule that you configure.
- Asset Types: Each scanned object becomes an asset. For example, a table in Azure SQL Database becomes an asset with columns, constraints, and relationships.
- Classifications: Purview automatically applies classifications like "Credit Card Number" (based on regex patterns) and "Person Name" (based on ML classifiers). You can also create custom classifications.
- Glossary: A business glossary defines terms like "Customer" or "Revenue" and maps them to technical assets. This bridges the gap between business and IT.

How Scanning Works:
1. A scanner connects to the source (e.g., Azure Data Lake Storage Gen2).
2. It reads the schema: for a Parquet file, it reads column names, data types, and partitioning.
3. It applies classification rules: if a column name contains "email" or matches a regex pattern, it is classified as "Email Address".
4. The metadata is stored in Purview's managed storage (Cosmos DB and Blob Storage).
5. Assets appear in the Purview Studio, searchable via the search bar or the catalog API.
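
The classification step above can be sketched in a few lines of Python. This is an illustration only: the `RULES` table, the regular expressions, and the `min_match_ratio` threshold are assumptions made for the example, not Purview's built-in classifiers or its actual matching logic.

```python
import re

# Hypothetical rules for illustration; these are not Purview's built-in patterns.
RULES = {
    "Email Address": {
        "name_hint": re.compile(r"e[-_]?mail", re.IGNORECASE),
        "value": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    },
    "Credit Card Number": {
        "name_hint": re.compile(r"card|ccnum", re.IGNORECASE),
        "value": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
    },
}


def classify_column(name, sample_values, min_match_ratio=0.6):
    """Return the labels whose name hint or value pattern fits the column."""
    labels = []
    non_null = [v for v in sample_values if v]
    for label, rule in RULES.items():
        name_hit = bool(rule["name_hint"].search(name))
        matches = sum(1 for v in non_null if rule["value"].match(v))
        value_hit = bool(non_null) and matches / len(non_null) >= min_match_ratio
        if name_hit or value_hit:
            labels.append(label)
    return labels


# A tiny made-up schema: column name -> sampled values.
schema = {
    "contact_email": ["ana@example.com", "bo@example.org", None],
    "payment_ref": ["4111 1111 1111 1111", "5500-0000-0000-0004"],
    "order_total": ["19.99", "5.00"],
}
classified = {col: classify_column(col, vals) for col, vals in schema.items()}
print(classified)
```

Note that `payment_ref` is labelled from its values alone, which is why content-based rules matter when column names are uninformative.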

Search and Discovery: Users can search by asset name, classification, or glossary term. For example, searching "credit card" returns all assets with that classification. The search is powered by Azure Cognitive Search, providing faceted filters (source type, classification, etc.).
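
To make faceted search concrete, here is a minimal in-memory sketch. The `ASSETS` list, its field names, and the `search` and `facet_counts` helpers are hypothetical; they mimic the behaviour described above rather than calling any Purview API.

```python
from collections import Counter

# Hypothetical in-memory catalog; field names are illustrative only.
ASSETS = [
    {"name": "dbo.Customers", "source_type": "Azure SQL Database",
     "classifications": ["Person Name", "Email Address"]},
    {"name": "dbo.Payments", "source_type": "Azure SQL Database",
     "classifications": ["Credit Card Number"]},
    {"name": "raw/payments.parquet", "source_type": "Azure Data Lake Storage Gen2",
     "classifications": ["Credit Card Number"]},
]


def search(assets, keyword="", source_type=None):
    """Match the keyword against names and classifications, then apply a facet filter."""
    needle = keyword.lower()
    hits = [
        a for a in assets
        if needle in a["name"].lower()
        or any(needle in c.lower() for c in a["classifications"])
    ]
    if source_type is not None:
        hits = [a for a in hits if a["source_type"] == source_type]
    return hits


def facet_counts(hits, field="source_type"):
    """Count hits per facet value so a UI could show filter options."""
    return dict(Counter(a[field] for a in hits))


hits = search(ASSETS, "credit card")
print([a["name"] for a in hits])
print(facet_counts(hits))
print([a["name"] for a in search(ASSETS, "credit card", "Azure SQL Database")])
```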

Data Lineage: Tracing Data from Source to Consumption

Lineage shows the path data takes: from raw ingestion through transformations to final reports or models. It answers questions like "Where did this data come from?" and "What transformations were applied?"

How Azure Purview Captures Lineage:
- Automatic Lineage: For certain Azure services, lineage is captured automatically. For example:
  - Azure Data Factory (ADF): Copy activity and Data Flow produce lineage. Purview reads ADF's lineage metadata.
  - Azure Synapse Pipelines: Similar to ADF.
  - Power BI: Datasets and reports can be linked to their source data assets.
- Manual Lineage (via Atlas API): For custom processes, you can push lineage using the Apache Atlas REST API. You define entities (e.g., a table) and processes (e.g., a Spark job) and the relationships between them.
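
For manual lineage, the request body follows the Apache Atlas entity model: a `Process` entity whose `inputs` and `outputs` reference existing assets by `qualifiedName`. The sketch below only builds and prints such a payload. The job name, the qualified names, and the asset type names (`azure_datalake_gen2_path`, `azure_sql_table`) are assumptions for illustration, and no HTTP call is made; check the type names and the entity endpoint against your own account before sending anything.

```python
import json


def ref(type_name, qualified_name):
    """Reference an existing catalog entity by its unique qualifiedName."""
    return {"typeName": type_name, "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_process_entity(name, qualified_name, inputs, outputs):
    """Build an Atlas 'Process' entity linking input assets to output assets."""
    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": inputs,
                "outputs": outputs,
            },
        }
    }


# All names below are hypothetical placeholders.
payload = build_process_entity(
    name="nightly_sales_spark_job",
    qualified_name="custom://jobs/nightly_sales_spark_job",
    inputs=[ref("azure_datalake_gen2_path",
                "https://examplelake.dfs.core.windows.net/raw/sales.csv")],
    outputs=[ref("azure_sql_table",
                 "mssql://example.database.windows.net/salesdb/dbo/Sales")],
)
print(json.dumps(payload, indent=2))
```

Once a process entity like this exists in the catalog, the lineage graph can draw the file as upstream of the job and the table as downstream of it.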

Lineage Graph:
- Nodes: Data assets (tables, files) and processes (Copy Data, SQL Stored Procedure).
- Edges: Directed edges showing data flow. For example, a raw CSV file is an input to a Copy Data process, which outputs to a SQL table.
- Impact Analysis: If a source table changes, you can see all downstream assets that depend on it. This is critical for change management.
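
Impact analysis is a graph traversal: start at the asset that is changing and follow directed edges downstream. The sketch below uses a breadth-first walk over a small hand-written edge list; the node names are hypothetical, and this is a conceptual model rather than how Purview stores or queries its graph.

```python
from collections import deque

# Directed edges: (upstream node, downstream node). Names are hypothetical.
EDGES = [
    ("raw/sales.csv", "Copy Data"),
    ("Copy Data", "dbo.Sales"),
    ("dbo.Sales", "usp_BuildSummary"),
    ("usp_BuildSummary", "dbo.SalesSummary"),
    ("dbo.SalesSummary", "Sales Report"),
    ("dbo.Sales", "Churn Model"),
]


def downstream(edges, start):
    """Breadth-first walk that returns every node reachable from `start`."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


impacted = downstream(EDGES, "dbo.Sales")
print(impacted)
```

Reversing each edge before the walk gives the upstream view, which answers "Where did this data come from?" instead.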

Lineage in the UI: In Purview Studio, you can click on an asset and view its lineage tab. It shows a graph with upstream sources and downstream consumers. You can click on any node to see its metadata.

Lineage vs. Data Catalog: Catalog tells you *what* data exists and *where* it is. Lineage tells you *how* it got there and *where it goes*.

Data Quality: Ensuring Data Is Fit for Purpose

Data quality measures the condition of data based on dimensions such as completeness, accuracy, consistency, timeliness, and uniqueness. In Azure, quality checks are commonly built with Azure Data Factory Data Flows, and Purview's own data quality features can profile data and score it against rules so that the results sit alongside the catalog and lineage.

Key Dimensions (as tested on DP-900):
- Completeness: Are all required fields present? For example, a customer record missing an email address has low completeness.
- Accuracy: Does the data reflect the real world? For example, a birth date of "1900-01-01" may be inaccurate.
- Consistency: Is the same data consistent across systems? For example, a customer's name spelled differently in CRM vs. ERP.
- Timeliness: Is the data up to date? For example, stock prices from last month are not timely.
- Uniqueness: Are there duplicate records? For example, two customer IDs for the same person.

How Quality Is Measured in Azure:
- Profiling: Purview can profile data sources to compute statistics: row count, null count, min/max values, distinct values. This helps identify completeness and uniqueness issues.
- Rules: You can define quality rules using Azure Data Factory Data Flows or Purview's data quality features. Rules are expressions, often SQL-like, that flag bad data.
- Scoring: Each rule can produce a pass/fail or a score (e.g., 95% complete). These scores can be stored as metadata in Purview.
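
The profiling statistics listed above are simple to compute by hand, which helps when reasoning about what a score means. The following sketch profiles one column of a small made-up dataset; the `profile_column` helper and the way completeness and uniqueness are defined here are assumptions for illustration, not Purview's scoring formula.

```python
def profile_column(rows, column):
    """Compute basic profile statistics for one column of a list of dict rows."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    distinct = set(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(distinct),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        # Share of rows with a value present (completeness).
        "completeness": len(non_null) / len(values) if values else 0.0,
        # Share of non-null values that are distinct (uniqueness).
        "uniqueness": len(distinct) / len(non_null) if non_null else 0.0,
    }


# Made-up sample: one missing email and one duplicated customer_id.
rows = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 2, "email": "bo@example.org"},
    {"customer_id": 4, "email": "cy@example.net"},
]
id_profile = profile_column(rows, "customer_id")
email_profile = profile_column(rows, "email")
print(id_profile)
print(email_profile)
```

Here the missing email lowers completeness for `email`, while the repeated ID lowers uniqueness for `customer_id`: two different dimensions surfaced by the same profile.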

Example Rule:

-- Return rows that violate the rule: OrderDate must be present and must not be in the future
SELECT OrderID, OrderDate
FROM Orders
WHERE OrderDate IS NULL OR OrderDate > GETDATE();

If this query returns any rows, the data fails the quality check.
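
To try the rule end to end without an Azure SQL Database, the sketch below runs an equivalent query against an in-memory SQLite table. This swaps in SQLite for convenience: `date('now')` replaces T-SQL's `GETDATE()`, dates are stored as ISO-8601 text, and the three sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT)")
conn.executemany(
    "INSERT INTO Orders (OrderID, OrderDate) VALUES (?, ?)",
    [
        (1, "2020-01-15"),   # valid: present and in the past
        (2, None),           # violation: missing date
        (3, "2999-12-31"),   # violation: date in the future
    ],
)

# Same rule as the T-SQL example; SQLite's date('now') stands in for GETDATE().
violations = conn.execute(
    "SELECT OrderID, OrderDate FROM Orders "
    "WHERE OrderDate IS NULL OR OrderDate > date('now') "
    "ORDER BY OrderID"
).fetchall()

total = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
passed = len(violations) == 0            # pass/fail outcome of the rule
score = (total - len(violations)) / total  # share of rows that satisfy the rule

print(violations)
print(passed, round(score, 2))
conn.close()
```

The same query therefore yields both styles of result mentioned earlier: a pass/fail flag and a percentage score.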

Monitoring and Remediation:
- Alerts: Azure Monitor can trigger alerts when quality scores drop below a threshold.
- Data Cleansing: ADF Data Flows can clean data in-flight: remove duplicates, fill nulls, correct formats.

Azure Purview Architecture

Purview is a PaaS service with two main components:
1. Purview Account: The management layer that stores metadata and provides the UI and APIs.
2. Scanning Infrastructure: Self-hosted integration runtime (SHIR) for on-premises sources, or Azure IR for cloud sources. The SHIR is installed on a VM or on-premises machine and communicates outbound to Purview.

Data Sources Supported:
- Azure: Blob Storage, ADLS Gen1/Gen2, Azure SQL Database, Azure Synapse, Azure Cosmos DB, Power BI, etc.
- On-premises: SQL Server, Oracle, Teradata, etc. (via SHIR).
- Other clouds: Amazon S3, Google BigQuery, and others. Runtime requirements vary by connector (BigQuery scans use a SHIR), so check each connector's prerequisites.

Authentication:
- Managed Identity (recommended) for Azure sources.
- Service Principal or SQL Authentication for others.

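Cost: Purview is a paid, consumption-based service. Charges depend on the metadata capacity you use and on scanning activity, and the exact meters have changed between pricing models, so check the current pricing page before estimating. A self-hosted integration runtime runs on your own machine, so you also carry the cost of that VM or server.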

Integration with Other Azure Services

Azure Synapse: Synapse Studio can search Purview for datasets. Lineage is captured automatically for Synapse pipelines.

Power BI: Power BI datasets can be registered in Purview, allowing users to see lineage from data source to report.

Azure Data Factory: ADF pipelines can be annotated with lineage information. ADF also provides data flow debugging with quality checks.

Azure Policy: You can enforce data classification tags via Azure Policy, ensuring new resources are automatically governed.

Common Exam Scenarios

Scenario 1: A company wants to know what data they have and where it resides. Solution: Use Azure Purview to scan all sources and create a catalog.

Scenario 2: A data engineer needs to understand the impact of changing a source schema. Solution: Use lineage in Purview to see all downstream dependencies.

Scenario 3: A data analyst wants to trust sales data. Solution: Run data quality profiling and rules, and view quality scores in Purview.

Key Terms for DP-900

Metadata: Data about data. Technical metadata (schema, size) and business metadata (description, owner).

Classification: A label applied to data based on content (e.g., "Social Security Number").

Glossary: A collection of business terms mapped to technical assets.

Lineage: The lifecycle of data from its origin to its consumption.

Data Quality Dimensions: Completeness, accuracy, consistency, timeliness, uniqueness.

Profiling: The process of examining data to compute statistics.

Exam Traps

Trap 1: Confusing data catalog with data warehouse. A catalog is metadata; a warehouse stores data.

Trap 2: Thinking lineage is only for ETL processes. It includes any transformation, even SQL views.

Trap 3: Assuming quality is only about correctness. It includes completeness, consistency, etc.

Trap 4: Believing Purview automatically scans all data in a subscription. You must configure scans and provide credentials.

Configuration Example: Scanning an Azure SQL Database

1. In Purview Studio, go to Sources.
2. Register a new source: Azure SQL Database.
3. Select authentication: Managed Identity (recommended) or SQL Auth.
4. Select the database and schema to scan.
5. Choose scan frequency: Once, every hour, daily, weekly.
6. Run the scan.
7. View assets in the catalog.

Verification Commands

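PowerShell to list scans (a sketch; the Az.Purview data-plane cmdlets are addressed by account endpoint rather than by resource group, so verify parameter names against your installed module version):

Get-AzPurviewScan -Endpoint 'https://purview-account.purview.azure.com/' -DataSourceName 'sql-db'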

REST API to get lineage:

GET https://purview-account.catalog.purview.azure.com/api/atlas/v2/lineage/{entityGuid}?direction=BOTH
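
An Atlas-style lineage response contains a `guidEntityMap` (the nodes) and a `relations` list (the edges, each with `fromEntityId` and `toEntityId`). The sketch below parses a hand-written sample in that shape instead of calling the service; the GUIDs, names, and type names are made up, and a real response carries more fields than shown.

```python
# Hand-written sample in the shape of an Atlas lineage response; GUIDs and names are made up.
sample_response = {
    "baseEntityGuid": "guid-table",
    "guidEntityMap": {
        "guid-file": {"typeName": "azure_datalake_gen2_path", "attributes": {"name": "sales.csv"}},
        "guid-copy": {"typeName": "Process", "attributes": {"name": "Copy Data"}},
        "guid-table": {"typeName": "azure_sql_table", "attributes": {"name": "Sales"}},
        "guid-report": {"typeName": "powerbi_report", "attributes": {"name": "Sales Report"}},
    },
    "relations": [
        {"fromEntityId": "guid-file", "toEntityId": "guid-copy"},
        {"fromEntityId": "guid-copy", "toEntityId": "guid-table"},
        {"fromEntityId": "guid-table", "toEntityId": "guid-report"},
    ],
}


def neighbours(response):
    """Split the direct neighbours of the base entity into upstream and downstream names."""
    base = response["baseEntityGuid"]
    names = {g: e["attributes"]["name"] for g, e in response["guidEntityMap"].items()}
    upstream = [names[r["fromEntityId"]] for r in response["relations"] if r["toEntityId"] == base]
    downstream = [names[r["toEntityId"]] for r in response["relations"] if r["fromEntityId"] == base]
    return upstream, downstream


up, down = neighbours(sample_response)
print(up, down)
```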

Summary of Mechanisms

Cataloguing: Scanners extract metadata -> stored in Purview's metadata store -> indexed for search.

Lineage: ADF/Synapse/Power BI emit lineage events -> Purview stores graph relationships -> visualized in UI.

Quality: Profiling computes stats -> rules flag bad data -> scores stored as metadata -> alerts triggered.

Understanding these mechanisms allows you to answer exam questions about governance, stewardship, and data trust.

Walk-Through

1. Register Data Sources in Purview

First, you register your data sources in Azure Purview. This involves specifying the source type (e.g., Azure SQL Database, Azure Data Lake Storage Gen2) and providing connection details. For cloud sources, you typically use Managed Identity for authentication to avoid managing secrets. For on-premises sources, you must install a self-hosted integration runtime (SHIR) on a machine that can reach the source. The registration creates a data source object in Purview, which serves as a container for scans. Without registration, Purview cannot access the source.

2. Configure and Run Scans

After registration, you create a scan rule set that defines what to scan (e.g., all tables, specific folders) and how to classify data (e.g., apply built-in classifiers like 'Credit Card Number'). You set the scan frequency: once, hourly, daily, or weekly. The scanner reads the schema, extracts metadata, and applies classifications. For large sources, scanning can take hours. Purview stores the results in its managed storage. You can monitor scan status in the Purview Studio; failed scans show error details.

3. Browse and Search the Data Catalog

Once scans complete, assets appear in the catalog. Users search by asset name, classification, or glossary term. The search uses Azure Cognitive Search, providing faceted filters like data source type, classification, and owner. Each asset has a detail page showing technical metadata (columns, data types), classifications, and lineage (if available). Users can also add business metadata like descriptions and owners. This step is crucial for data discovery – without it, users would not know what data exists.

4. Define Business Glossary Terms

To bridge business and IT, you create a glossary with terms like 'Customer', 'Revenue', 'Product'. Each term can have a definition, synonyms, and related terms. Then you map glossary terms to technical assets (e.g., the 'Customer' term maps to the 'dbo.Customers' table). This mapping appears in the asset's detail page. Users can search by glossary term, making it easy for business analysts to find relevant data. Without glossary mapping, technical names like 'CUST_TBL' are meaningless to business users.
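
The term-to-asset mapping can be pictured as a small lookup structure. In this sketch the `GLOSSARY` dictionary, its definitions, and the `find_assets` helper are all hypothetical; they show how a search by business term or synonym resolves to technical names such as 'CUST_TBL'.

```python
# Hypothetical glossary; term names, definitions, and asset names are illustrative.
GLOSSARY = {
    "Customer": {
        "definition": "A person or organisation that has placed at least one order.",
        "synonyms": ["Client", "Account holder"],
        "assets": ["dbo.Customers", "CUST_TBL"],
    },
    "Revenue": {
        "definition": "Income recognised from completed sales.",
        "synonyms": ["Sales income"],
        "assets": ["dbo.SalesSummary"],
    },
}


def find_assets(glossary, query):
    """Resolve a business term or one of its synonyms to the mapped technical assets."""
    wanted = query.strip().lower()
    for term, entry in glossary.items():
        candidates = [term] + entry["synonyms"]
        if wanted in (c.lower() for c in candidates):
            return entry["assets"]
    return []


print(find_assets(GLOSSARY, "customer"))
print(find_assets(GLOSSARY, "Client"))
print(find_assets(GLOSSARY, "Supplier"))
```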

5. View and Analyze Lineage

For assets that have lineage (e.g., from ADF pipelines or Power BI), you can view the lineage graph. Click on an asset and select the 'Lineage' tab. The graph shows upstream sources (e.g., raw CSV files) and downstream consumers (e.g., a SQL view or a Power BI report). You can click on any node to see its metadata. This helps with impact analysis: if a source changes, you can identify all downstream dependencies. Lineage is automatically captured for supported services; for custom processes, you can push lineage via API.

6. Monitor and Improve Data Quality

Data quality is assessed through profiling and rules. Profiling computes statistics like null counts and distinct values. You can define quality rules in Azure Data Factory Data Flows or with Purview's data quality features. For example, a rule might check that 'OrderDate' is not null and is not in the future. The results are stored as metadata in Purview, and you can set up alerts via Azure Monitor when quality drops below a threshold. Remediation can be automated in ADF Data Flows to clean data in-flight.

What This Looks Like on the Job

Enterprise Scenario 1: Financial Services Regulatory Compliance

A large bank must prove to regulators that customer data is accurate, complete, and traceable. They use Azure Purview to scan all data sources: core banking databases (on-premises SQL Server), Azure SQL Database for transactions, and Azure Data Lake for analytics. The catalog provides a single inventory of all customer-related data. Lineage shows how raw transaction data flows through ETL into risk models and regulatory reports. Data quality rules check for completeness (all required fields present) and accuracy (e.g., account balances match general ledger). When a regulator requests evidence, the bank can produce a lineage report showing the data's provenance and quality scores. Misconfiguration: if scans are not run daily, new tables may be missing from the catalog, leading to incomplete reporting.

Enterprise Scenario 2: Retail Data Lake Governance

A global retailer ingests data from point-of-sale systems, e-commerce platforms, and inventory systems into Azure Data Lake Storage. Without governance, the lake becomes a swamp with thousands of files and no documentation. They deploy Purview to automatically scan the lake, classify sensitive data (e.g., credit card numbers), and build a business glossary for terms like 'Sales', 'Inventory', 'Customer'. Data engineers use lineage to track the impact of schema changes in source systems. Data quality profiling reveals that 15% of sales records have missing store IDs, triggering an alert. The retailer then uses ADF Data Flows to cleanse incoming data. At scale, scanning petabytes of data requires careful scheduling (e.g., incremental scans every hour) and sufficient SHIR resources for on-premises sources. Common failure: misconfigured scan rule sets that skip critical folders, leaving data undiscovered.

Enterprise Scenario 3: Healthcare Data Stewardship

A healthcare provider needs to govern patient data across multiple Azure SQL Databases and a Synapse Analytics warehouse. They use Purview to classify PHI (Protected Health Information) like patient names and diagnoses. The glossary maps clinical terms to database columns. Lineage tracks data from electronic health records (EHR) through de-identification processes into analytical datasets. Data quality rules ensure completeness of mandatory fields (e.g., diagnosis code). When a data breach occurs, the lineage graph helps identify which reports contained the exposed data. Performance consideration: scanning large databases with many tables can be slow; they use incremental scans and filter schemas to reduce load. Misconfiguration: failing to set up proper authentication (e.g., using SQL credentials instead of Managed Identity) can lead to scan failures.

How DP-900 Actually Tests This

Exactly What DP-900 Tests (Objective 1.1)

DP-900 questions on data governance focus on definitions and service capabilities. You will NOT be asked to configure Purview or write SQL. Instead, expect scenario-based questions where you choose the correct Azure service or governance concept.

Objective codes covered:
- 1.1 Identify data governance concepts: cataloguing, lineage, quality.
- Specifically: "Describe the role of Azure Purview in data governance" and "Identify data quality dimensions."

Common Wrong Answers and Why:
1. Confusing Azure Purview with Azure Data Catalog (retired). Candidates think Data Catalog is still the answer. Reality: Data Catalog was deprecated; Purview is the current service. The exam may ask "Which service provides unified data governance?" The answer is Azure Purview.
2. Mixing up lineage with data catalog. Candidates choose "catalog" when asked about data origin tracking. Lineage is the correct term for tracking data flow.
3. Thinking data quality is only about accuracy. The exam tests multiple dimensions: completeness, consistency, timeliness, uniqueness. A question might say "Which dimension measures if all required fields are present?" Answer: completeness.
4. Believing Purview automatically governs all data in a subscription. It only governs sources you register and scan. A trap question: "Azure Purview automatically scans all Azure data sources." False.

Specific Numbers, Values, and Terms:
- Scan frequency options: Once, every hour, daily, weekly.
- Data quality dimensions: Completeness, accuracy, consistency, timeliness, uniqueness (remember the acronym CACTU).
- Lineage sources: Azure Data Factory, Azure Synapse Pipelines, Power BI.
- Classification examples: "Credit Card Number", "Person Name", "Email Address".
- Glossary vs. classification: Glossary is business terms; classification is content-based labels.

Edge Cases and Exceptions:
- On-premises sources require a self-hosted integration runtime (SHIR). The exam may ask: "What is needed to scan an on-premises SQL Server?" Answer: Self-hosted integration runtime.
- Lineage is not automatically captured for all sources; only for supported services (ADF, Synapse, Power BI). For custom processes, you must push lineage via API.
- Data quality profiling is optional; you must enable it in scan rule sets.

How to Eliminate Wrong Answers:
- If a question asks about describing data flow from source to report, eliminate any option mentioning "catalog" or "quality": the answer is lineage.
- If a question asks about finding data assets, eliminate lineage or quality: the answer is catalog.
- If a question asks about data trustworthiness, eliminate catalog: the answer is quality.
- Use the mechanism: catalog = inventory, lineage = map, quality = health check.

Key Takeaways

Azure Purview is the primary data governance tool for DP-900; it provides cataloguing, lineage, and classification.

Data cataloguing is the process of creating an inventory of data assets with metadata like schema, classifications, and glossary terms.

Data lineage tracks the flow of data from source to consumption, showing transformations and dependencies.

Data quality has five key dimensions: completeness, accuracy, consistency, timeliness, and uniqueness (CACTU).

Automatic lineage is captured for Azure Data Factory, Azure Synapse Pipelines, and Power BI.

On-premises data sources require a self-hosted integration runtime (SHIR) to scan.

Scan frequency options: once, every hour, daily, weekly.

Classification labels like 'Credit Card Number' are applied automatically using built-in classifiers.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Data Catalog

Answers 'What data exists and where is it located?'

Stores metadata like schema, classifications, glossary terms.

Enables search and discovery of data assets.

Updated via scanning at scheduled intervals.

Example: scanning a SQL table to list its columns.

Data Lineage

Answers 'Where did this data come from and how was it transformed?'

Stores graph relationships between data assets and processes.

Enables impact analysis and root cause tracing.

Updated in near real-time via pipeline events or API.

Example: showing a Power BI report depends on a SQL view.

Azure Purview

Unified data governance service for cataloguing, lineage, classification.

Does not move or transform data; only manages metadata.

Provides a searchable catalog and lineage graph.

Scans and classifies data at rest.

Integrates with ADF to capture lineage automatically.

Azure Data Factory (ADF)

ETL/ELT service for data movement and transformation.

Actually moves and transforms data (copy, data flow).

Provides pipeline orchestration and monitoring.

Processes data in motion (during pipeline execution).

Pushes lineage metadata to Purview for governance.

Watch Out for These

Mistake

Azure Purview automatically scans all data in an Azure subscription without any configuration.

Correct

Purview only scans sources that you explicitly register and configure scans for. You must provide credentials and choose scope. It does not auto-discover new sources.

Mistake

Data lineage is the same as a data catalog.

Correct

A data catalog is an inventory of data assets (what and where). Lineage shows the flow of data from source to destination (how and why). They are complementary but distinct.

Mistake

Data quality only measures accuracy.

Correct

Data quality includes multiple dimensions: completeness, accuracy, consistency, timeliness, uniqueness. Accuracy is just one of them.

Mistake

Lineage is automatically captured for all data transformations in Azure.

Correct

Automatic lineage is only supported for specific services: Azure Data Factory, Azure Synapse Pipelines, and Power BI. For custom transformations (e.g., Spark jobs), you must push lineage via API.

Mistake

Azure Data Catalog is still the recommended service for data governance.

Correct

Azure Data Catalog was retired and replaced by Azure Purview. Purview provides cataloguing, lineage, and classification in one service. Exam questions will reference Purview.

Frequently Asked Questions

What is the difference between data catalog and data lineage?

A data catalog is an inventory of data assets (e.g., tables, files) with metadata like schema, location, and classifications. It helps you find and understand what data exists. Data lineage shows the data's journey from source to destination, including transformations and dependencies. It helps you trace data provenance and perform impact analysis. In Azure Purview, the catalog is the searchable list of assets, while lineage is a graph showing how assets are connected via processes.

How does Azure Purview automatically capture lineage?

Azure Purview integrates with Azure Data Factory, Azure Synapse Pipelines, and Power BI to capture lineage automatically. When you run a pipeline in ADF, Purview reads the pipeline metadata and creates lineage entries showing input datasets, output datasets, and the copy or data flow activity. For Power BI, lineage from dataset to report is captured when datasets are registered in Purview. For custom processes, you can push lineage via the Apache Atlas REST API.

What are the five dimensions of data quality tested on DP-900?

The five dimensions are: completeness (are all required fields present?), accuracy (does data reflect reality?), consistency (is data the same across systems?), timeliness (is data up to date?), and uniqueness (are there duplicates?). A helpful mnemonic is CACTU. The exam may ask you to identify which dimension applies to a given scenario, e.g., 'A customer record missing an email address' is a completeness issue.

Do I need a self-hosted integration runtime to scan Azure SQL Database?

No, for Azure SQL Database (PaaS), you can use the Azure integration runtime (default). The self-hosted integration runtime is only required for on-premises or VM-based data sources, like SQL Server on-premises or Oracle. The SHIR must be installed on a machine that can reach the source and communicate outbound to Purview.

Can Azure Purview classify data in files like CSV or Parquet?

Yes, Azure Purview can scan files in Azure Data Lake Storage (Gen1/Gen2) and Blob Storage. It reads the schema of structured files (Parquet, Avro, ORC, CSV, TSV) and applies classifications based on column names and content patterns. For example, a column named 'Email' will be classified as 'Email Address'. Unstructured files like images are not classified.

What happens if a scan fails in Azure Purview?

If a scan fails, Purview reports the error in the scan history. Common causes include authentication failure (e.g., wrong credentials), network issues (SHIR cannot reach source), or schema changes. You can retry the scan after fixing the issue. Failed scans do not delete previously scanned assets; they simply do not update metadata. You should monitor scan health via Azure Monitor alerts.

Is Azure Purview free?

No, Azure Purview is a paid service. Pricing is consumption-based and tied to metadata capacity and scanning activity; the specific meters and any free allowance have changed over time, so consult the current pricing page for figures. For DP-900, you only need to know that it is a paid service, not the exact pricing.
