This chapter covers Master Data Management (MDM) concepts, a core topic for the DP-900 exam. MDM is the discipline of ensuring that an organization's critical data entities—such as customers, products, and employees—are consistent, accurate, and governed across all systems. Approximately 15-20% of exam questions touch on data management concepts, including MDM, data lakes, and data warehouses. Understanding MDM is essential for distinguishing between master data, transactional data, and reference data, and for recognizing how Azure tools like Azure Data Factory and Azure Purview support MDM initiatives.
Jump to a section
Imagine a global retail chain with 500 stores. Each store maintains its own product catalog: store A lists 'Widget X' as 'WidgX', store B calls it 'Widget X (blue)', and store C has it as 'Part# 12345'. When headquarters tries to run a total sales report for Widget X, they get three different numbers because each store uses a different name. This is the problem master data management (MDM) solves. MDM is like a central product master database that assigns a single, universal product ID (e.g., SKU-001) to every item. When a store receives a shipment, they scan the barcode, and the system automatically looks up the master record to get the correct name, price, and description. Any store can update the master record (with proper permissions), and all stores instantly see the change. Without MDM, the company would have data silos—each store's system is an island. With MDM, there is one authoritative source for product, customer, and supplier data. The MDM hub enforces data quality rules (e.g., no null names, unique SKUs) and publishes updates to all downstream systems via a publish-subscribe mechanism. If a store tries to enter a duplicate customer, the MDM system detects it via matching rules (e.g., same email or phone) and merges the records, keeping the best attributes from each.
What is Master Data Management (MDM)?
Master Data Management (MDM) is a comprehensive method of defining and managing the critical data of an organization to provide a single point of reference. The goal is to ensure that the organization has a consistent, accurate, and timely view of its core business entities across all operational and analytical systems. Master data typically includes: - Customer Data: Names, addresses, contact details, account numbers. - Product Data: SKUs, descriptions, prices, categories. - Employee Data: Employee IDs, names, departments, roles. - Supplier Data: Vendor names, tax IDs, payment terms. - Location Data: Store codes, warehouse IDs, geographic coordinates.
Master data is distinct from transactional data (e.g., sales orders, invoices) and reference data (e.g., country codes, currency codes). Transactional data records events (a sale), while master data describes the parties and items involved in those events. Reference data is a set of permissible values used to classify other data (e.g., ISO country codes).
Why MDM Exists
Without MDM, organizations suffer from data silos. Each department or system maintains its own version of the same entities, leading to inconsistencies. For example, the CRM system might have a customer as "John Smith" with email "john@email.com", while the ERP system has the same customer as "Jon Smith" with email "johnny@email.com". This causes errors in reporting, compliance issues, and poor customer experience. MDM solves this by creating a single version of the truth (often called the "golden record") that is shared across systems.
How MDM Works Internally
MDM is implemented through a central platform (an MDM hub) that manages the lifecycle of master data. The typical workflow involves:
Data Ingestion: Master data is ingested from source systems (CRM, ERP, legacy databases) via batch or real-time integration. Azure Data Factory can be used to orchestrate these data movements.
Data Cleansing and Standardization: Incoming data is cleaned to remove duplicates, correct formatting, and enforce standard values. For example, phone numbers are normalized to a standard format, and addresses are validated against a postal service database.
Matching and Deduplication: The MDM system uses matching rules (exact match, fuzzy match, probabilistic matching) to identify records that represent the same real-world entity. For example, two customer records with the same email address are likely the same person. The system merges them into a single golden record, keeping the best attributes from each source.
Survivorship: When merging records, survivorship rules determine which attribute value wins. For example, the most recent update might take precedence, or a trusted source system might override others.
Publishing and Syndication: The golden record is published to subscribing systems. This can be done via a publish-subscribe model, where the MDM hub sends updates to all downstream systems when a change occurs. Alternatively, systems can pull the latest data via APIs or data replication.
Governance and Stewardship: MDM includes data governance policies that define who can create, update, or delete master data. Data stewards are responsible for resolving conflicts and ensuring data quality.
Key Components of an MDM Solution
MDM Repository: A database that stores the golden records and their history. Often a relational database (like Azure SQL Database) or a specialized MDM platform.
Matching Engine: The component that performs identity resolution. It uses algorithms like Levenshtein distance for name matching, or blocking techniques to reduce comparison pairs.
Data Quality Tools: Tools that profile, cleanse, and standardize data. Azure Purview includes data quality capabilities.
Integration Layer: Connectors to source and target systems. Azure Data Factory and Azure Logic Apps are commonly used.
Governance Portal: A user interface for data stewards to manage exceptions, view data lineage, and enforce policies.
MDM Architecture Styles
There are three primary MDM architecture styles, each with different trade-offs:
Registry Style: The MDM hub stores only identifiers and pointers to source systems, not the full data. It maintains a registry of which system has the best version of each attribute. This style is lightweight but requires real-time access to source systems.
Consolidation Style: The MDM hub stores a full copy of the golden record. It periodically synchronizes with source systems. This provides a single point of access but can lead to data staleness.
Coexistence Style: A hybrid where the MDM hub stores some attributes (e.g., customer ID) while other attributes remain in source systems. This balances control and autonomy.
Azure Purview supports MDM capabilities by providing data cataloging, lineage, and governance. However, for full MDM, organizations often use third-party tools like Profisee, Informatica MDM, or Semarchy, which run on Azure infrastructure.
MDM and Azure Data Services
Azure provides several services that support MDM initiatives:
Azure Data Factory: Used to orchestrate data movement between source systems and the MDM hub.
Azure SQL Database or Azure Cosmos DB: Can serve as the MDM repository.
Azure Purview: Provides data discovery, classification, and governance. It can map data lineage and enforce data quality rules.
Azure Databricks: Can be used for advanced matching and machine learning-based deduplication.
Azure Logic Apps: Can automate workflows for data stewardship (e.g., sending alerts when a duplicate is detected).
Configuration and Verification
While MDM is not a native Azure service, you can implement a simple MDM solution using Azure Data Factory and Azure SQL Database. For example:
Use Azure Data Factory to copy customer data from CRM and ERP systems to a staging table in Azure SQL Database.
Use a stored procedure or Data Flow to perform matching and deduplication. A sample matching query might look like:
-- Detect potential duplicates based on email
SELECT a.CustomerID AS SourceA, b.CustomerID AS SourceB, a.Email
FROM StagingCustomers a
INNER JOIN StagingCustomers b
ON a.Email = b.Email AND a.CustomerID < b.CustomerIDCreate a golden record table and insert merged records.
Use Data Factory to publish the golden records back to source systems via update activities.
To verify the MDM process, you can query the golden record table and compare counts with source systems:
-- Count distinct customers after dedup
SELECT COUNT(*) FROM GoldenCustomersMDM Best Practices
Define clear data ownership: Each master data entity should have a designated data owner.
Establish data quality metrics: Measure completeness, accuracy, and uniqueness.
Use a phased approach: Start with one domain (e.g., customer) and expand.
Automate stewardship: Use rules to auto-merge low-risk duplicates, flag high-risk ones for manual review.
Monitor data lineage: Understand where data comes from and how it changes.
Common MDM Challenges
Data silos: Legacy systems may not support integration.
Data quality: Poor source data quality undermines MDM.
Scalability: Matching millions of records requires robust algorithms and infrastructure.
Change management: Organizational resistance to a single source of truth.
Interaction with Related Technologies
MDM interacts closely with data warehousing and business intelligence. A data warehouse (e.g., Azure Synapse Analytics) typically stores integrated, cleaned data sourced from an MDM hub. MDM ensures that the dimensions in the star schema (e.g., Customer dimension, Product dimension) are consistent. Without MDM, the data warehouse might contain duplicate customers, leading to inaccurate reports.
MDM also complements data lakes. While a data lake stores raw data in its native format, MDM provides a curated, governed view of key entities. In a lakehouse architecture, MDM can be used to create clean dimension tables that are used for analytics.
Summary of Key Concepts
Master data: Core business entities that are shared across systems.
Golden record: The single, authoritative version of a master data entity.
Matching: The process of identifying duplicate records.
Survivorship: Rules to determine which attribute value survives a merge.
MDM architecture styles: Registry, Consolidation, Coexistence.
This section provides the foundation for understanding MDM as tested in DP-900. The exam expects you to differentiate master data from other data types and recognize the role of MDM in ensuring data consistency.
Identify Master Data Domains
First, determine which data entities are critical for the organization. Common domains include customers, products, suppliers, employees, and locations. Each domain has attributes that are shared across multiple systems. For example, customer master data includes CustomerID, Name, Address, Phone, Email, and CreditLimit. The goal is to create a single source of truth for each domain. This step involves interviewing business stakeholders and analyzing existing data flows. On the DP-900 exam, you might be asked to identify which data is considered master data versus transactional data.
Define Data Governance Policies
Establish rules for data creation, update, deletion, and quality. This includes defining data ownership (who is responsible for each domain), data standards (format, naming conventions), and stewardship workflows (how to handle duplicates or errors). For example, a policy might state that only the CRM system can create new customer records, and any updates must be approved by a data steward. Governance policies are enforced by the MDM platform. Azure Purview can help define and enforce these policies. The exam may test your understanding of data governance as part of MDM.
Implement Data Integration
Connect source systems to the MDM hub using extract, transform, load (ETL) or enterprise service bus (ESB) patterns. Use Azure Data Factory to schedule data ingestion. For real-time integration, use Azure Event Hubs or Kafka. Data is extracted from sources like CRM, ERP, and legacy databases. During integration, data is cleansed and standardized. For example, all dates are converted to ISO 8601 format, and all country names are replaced with ISO codes. The integration layer must handle conflicts when the same entity appears in multiple sources. The exam may ask about tools used for data integration in Azure.
Perform Matching and Deduplication
The MDM system compares incoming records against existing golden records to find matches. Matching can be exact (e.g., same email) or fuzzy (e.g., similar name and address). Algorithms like Levenshtein distance or Jaro-Winkler are used for fuzzy matching. Blocking reduces the number of comparisons by grouping records that share a common key (e.g., first letter of last name). Once matches are found, the system merges them into a single golden record. Survivorship rules decide which attribute values to keep. For example, the most recent update wins, or the source system with highest trust takes precedence. The exam may test the concept of matching and survivorship.
Publish Golden Records
After deduplication, the golden records are published back to source systems and other downstream systems. This can be done via batch updates (e.g., nightly sync) or real-time APIs. The MDM hub maintains a history of changes for audit purposes. For example, when a customer address is updated in the golden record, the MDM system sends an update to the CRM, ERP, and marketing automation system. In Azure, you can use Azure Data Factory to push updates to Azure SQL Database, or use Azure API Management to expose REST APIs. The exam may ask about the distribution of master data to multiple systems.
Enterprise Scenario 1: Global Retail Customer Master
A multinational retail chain with 2000 stores across 50 countries uses separate CRM systems for each region. The marketing team wants a single view of customers to run global campaigns. They implement an MDM solution using Azure Data Factory and Profisee MDM. Customer data from each CRM is ingested daily into Azure SQL Database. The matching engine uses email and phone number to identify duplicates. For example, a customer in the US and a customer in Europe with the same email are merged into one golden record. Survivorship rules give priority to the most recent customer profile. The golden record is then published to a central Azure Synapse Analytics data warehouse for analytics. Common issues: data quality problems like missing emails or typos cause false negatives in matching. The team implemented a data quality dashboard in Power BI to monitor completeness. Performance: matching 50 million records took 4 hours using Azure Databricks with Spark.
Enterprise Scenario 2: Healthcare Provider Master Data
A large hospital network manages patient data across multiple EHR systems (Epic, Cerner, etc.). They need a unified patient identifier to coordinate care. They deploy an MDM hub using Informatica MDM on Azure VMs. Patient data is ingested via HL7 messages in real-time using Azure Event Hubs. Matching uses probabilistic algorithms with blocking on date of birth and gender. A data steward reviews potential matches with confidence scores between 80-95%. The golden record includes patient ID, name, DOB, SSN, and allergies. The system publishes updates to all EHR systems via Mirth Connect. A misconfiguration in survivorship rules once caused a patient's allergy information to be overwritten by an older record, leading to a near-miss event. The team added a rule that critical medical data (allergies, blood type) always takes precedence from the source that last updated it. Scale: 10 million patients, 50 million encounters.
Enterprise Scenario 3: Financial Services Product Master
A bank manages financial products (loans, accounts, investments) across multiple legacy core banking systems. They need a consistent product catalog for online banking and regulatory reporting. They implement a consolidation-style MDM using Azure SQL Database and custom .NET services. Product data is extracted nightly from each core system via SSIS. The MDM system standardizes product names and categories. For example, 'Savings Account' and 'Savings Acct' are merged. The golden record is stored with a unique product ID and published to the data warehouse for reporting. A common problem: new products are added in source systems without notifying the MDM team, causing orphan records. They automated alerts when a new product appears that doesn't match any existing golden record. Performance: 5000 products, but high attribute count (200+ attributes per product) requires careful indexing.
DP-900 Exam Focus on Master Data Management
The DP-900 exam (Objective 1.1: Core Data Concepts) tests your understanding of master data as one of several data types. Specifically, you must be able to:
Differentiate between master data, transactional data, and reference data.
Recognize the purpose of MDM: to create a single source of truth (golden record).
Identify common master data domains: customers, products, employees, suppliers.
Understand that MDM involves processes like deduplication, matching, and data governance.
Know that Azure Purview and Azure Data Factory are Azure services that support MDM initiatives.
Common Wrong Answers and Why Candidates Choose Them
Confusing master data with reference data: Candidates often think country codes or currency codes are master data. They are reference data—a set of permissible values. Master data describes business entities, not classification lists. The exam may present a list of data types and ask which is master data.
Thinking transactional data is master data: A sales order is transactional, not master. The customer and product in the order are master data. Candidates might mistakenly select 'sales order' as master data.
Believing MDM is only about customers: While customer is a common domain, MDM also covers products, employees, suppliers, etc. The exam may test that MDM applies to any critical business entity.
Assuming MDM eliminates all duplicates: MDM reduces duplicates but cannot eliminate them entirely due to data quality issues. The exam might ask about the goal of MDM—it is to create a single view, not necessarily 100% duplicate-free.
Specific Numbers, Values, and Terms That Appear on the Exam
The term 'golden record' is used to describe the single version of truth.
The phrase 'single source of truth' is often used in exam questions.
Matching algorithms: exact match, fuzzy match, probabilistic matching.
Survivorship rules: last update wins, most complete record wins, source priority.
Azure services: Azure Data Factory (integration), Azure Purview (governance), Azure SQL Database (repository).
Edge Cases and Exceptions the Exam Loves to Test
What if two systems have conflicting data for the same entity? The MDM system uses survivorship rules to resolve conflicts. The exam may ask which rule is applied.
Is a product price master data? Yes, price is an attribute of product master data, but it can change frequently. The exam might test that master data can be updated but is relatively stable compared to transactional data.
Can master data be stored in a data lake? Yes, but it is typically stored in a structured format (e.g., Parquet) and governed by MDM processes. The exam may ask about appropriate storage for master data.
How to Eliminate Wrong Answers Using the Underlying Mechanism
If a question asks about data that describes business entities (e.g., customer name, product ID), it is master data. If it describes a transaction (e.g., order date, quantity), it is transactional.
If a question mentions 'single version of truth' or 'golden record', the answer likely involves MDM.
If a question asks about a set of allowed values (e.g., country codes), it is reference data.
Use the process of elimination: master data is shared across systems, is relatively stable, and is critical for operations.
Master data describes core business entities (customers, products, employees, suppliers).
MDM creates a single source of truth (golden record) for master data across systems.
MDM involves data ingestion, cleansing, matching, deduplication, and publishing.
Matching algorithms include exact match, fuzzy match, and probabilistic matching.
Survivorship rules determine which attribute value wins when merging records.
Azure Data Factory and Azure Purview are key Azure services that support MDM.
Master data is distinct from transactional data (events) and reference data (code lists).
MDM is an ongoing process requiring data governance and stewardship.
These come up on the exam all the time. Here's how to tell them apart.
Master Data
Describes core business entities (customers, products, employees).
Relatively stable, changes infrequently.
Shared across multiple systems and processes.
Often used as dimensions in data warehouses.
Managed through MDM processes to ensure consistency.
Transactional Data
Records business events (sales orders, invoices, log entries).
Highly volatile, changes with each transaction.
Usually specific to a single system or process.
Stored as fact tables in data warehouses.
Typically not subject to MDM; data quality is enforced at entry.
Master Data
Describes unique business entities (e.g., CustomerID 123).
Each entity has many attributes that can change.
Managed by MDM with deduplication and survivorship.
Examples: customer name, product description.
Stored in MDM repository and shared across systems.
Reference Data
Describes classification codes (e.g., 'US' for United States).
Set of permissible values, rarely changes.
Managed by data governance, often in a code table.
Examples: country codes, currency codes, status codes.
Stored in reference data tables, may be shared via lookup services.
Mistake
Master data is the same as reference data.
Correct
Master data describes core business entities (customers, products). Reference data is a set of permissible values used to classify other data (e.g., ISO country codes, currency codes). Reference data is a type of master data in some definitions, but DP-900 treats them as distinct. Reference data is often static and managed separately.
Mistake
MDM is only needed for large enterprises.
Correct
MDM is valuable for any organization with multiple systems that share data. Even small businesses with a CRM and an accounting system can benefit from consistent customer data. The exam tests the concept, not the scale.
Mistake
MDM eliminates all data duplicates.
Correct
MDM reduces duplicates but cannot eliminate them entirely due to data quality issues (typos, missing values). Matching algorithms have thresholds; some duplicates may go undetected. The goal is to minimize duplicates, not achieve perfection.
Mistake
MDM is a one-time project.
Correct
MDM is an ongoing process. Data changes, new systems are added, and data quality degrades over time. Continuous monitoring and stewardship are required. The exam may test that MDM involves ongoing governance.
Mistake
Master data never changes.
Correct
Master data can change, though less frequently than transactional data. Customer addresses, product prices, and supplier terms all change. MDM systems track changes and maintain history. The exam might test that master data is relatively stable but not static.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Master data describes core business entities such as customers, products, and employees. It is relatively stable and shared across systems. Transactional data records events or transactions, such as sales orders, invoices, or log entries. Transactional data is highly volatile and usually specific to a single system. For example, a customer record (name, address) is master data, while a sales order (date, quantity, total) is transactional data. The exam expects you to differentiate between these two data types.
A golden record is the single, authoritative version of a master data entity after deduplication and merging. It contains the best attributes from multiple source systems. For example, if two source systems have different phone numbers for the same customer, the golden record keeps the most reliable one based on survivorship rules. The golden record is stored in the MDM repository and published to downstream systems. The term 'golden record' is commonly used in exam questions.
Azure provides several services that support MDM initiatives. Azure Data Factory is used for data integration (ETL) to ingest data from source systems. Azure Purview provides data governance, cataloging, and lineage. Azure SQL Database or Azure Cosmos DB can serve as the MDM repository. Azure Databricks can be used for advanced matching algorithms. However, Azure does not have a native MDM service; organizations typically use third-party MDM tools (e.g., Profisee, Informatica) on Azure infrastructure.
Matching is the process of identifying records from different source systems that represent the same real-world entity. For example, two customer records with the same email address are likely the same person. Matching can be exact (e.g., same Social Security number) or fuzzy (e.g., similar name and address). The goal is to merge duplicate records into a single golden record. Matching is a critical step in MDM to ensure a single source of truth.
Yes, master data can be stored in a data lake, but it is typically stored in a structured format (e.g., Parquet or Delta tables) and governed by MDM processes. The data lake stores raw data, and MDM provides a curated view of master data. In a lakehouse architecture, master data is often stored as dimension tables in a Delta Lake. The exam may test that master data is usually stored in structured databases, but it can also reside in a data lake with proper governance.
Survivorship refers to the rules that determine which attribute value is kept when merging duplicate records. For example, if one source has a customer's phone number as '555-1234' and another has '555-5678', survivorship rules might choose the most recent update, the most complete record, or the value from a trusted source. Survivorship is essential for creating an accurate golden record. The exam may ask about survivorship rules in the context of MDM.
Product price is an attribute of product master data. Even though prices change, they describe the product entity. Master data can change over time but is relatively stable compared to transactional data. For example, a price list is maintained as part of the product master. However, the price on a specific sales order is transactional data. The exam may test that attributes like price are part of master data.
You've just covered Master Data Management Concepts — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?