DP-900Chapter 95 of 101Objective 1.1

Data Quality: Completeness, Accuracy, Consistency

This chapter covers data quality dimensions — completeness, accuracy, and consistency — as tested in DP-900 Objective 1.1 (Core Data Concepts). Understanding these dimensions is critical because approximately 15-20% of DP-900 exam questions directly or indirectly assess data quality concepts, often as part of scenarios about data processing, storage, or analytics. You will learn the precise definitions, how each dimension is measured, and how they relate to data governance and data integrity in Azure data services.

25 min read
Intermediate
Updated May 31, 2026

Data Quality as a Library's Card Catalog

Imagine a library with millions of books. The card catalog is the system that tracks every book's location, author, title, and status. Completeness means every book has a card — no missing entries for books on the shelves. If a book is returned but no card is updated, that's an incompleteness. Accuracy means the card says the correct author and title — a card that lists 'Moby-Dick' under 'Melville' is accurate; one that says 'Hawthorne' is inaccurate. Consistency means all cards follow the same format: author last name first, title in quotes, shelf location in the same pattern (e.g., 'Section A, Row 3'). If one card uses 'Row 3' and another uses '3rd row', the catalog is inconsistent. In a library, a patron searching by card catalog expects to find the book exactly where the card says. Data quality in databases works identically: completeness ensures no missing records, accuracy ensures correct values, and consistency ensures uniform representation across all entries. A library with a flawed catalog frustrates patrons; a database with poor data quality leads to incorrect analytics, failed transactions, and broken reports — directly impacting business decisions.

How It Actually Works

What Is Data Quality and Why Does It Matter?

Data quality is a measure of how well a dataset meets the requirements of its intended use. In the context of DP-900, data quality is assessed across several dimensions, with completeness, accuracy, and consistency being the three most frequently tested. These dimensions are not optional features; they are fundamental to ensuring that data-driven decisions are based on reliable information.

Completeness refers to the extent to which all expected data is present. Missing values, null fields, or omitted records reduce completeness.

Accuracy measures how closely data values match the real-world entities or events they represent. An inaccurate value is factually wrong.

Consistency ensures that data is uniform across different systems, tables, or formats. Contradictions or variations in representation indicate inconsistency.

Data quality is a cross-cutting concern in Azure data services. For instance, Azure Data Factory pipelines often include data quality checks as part of data cleansing activities. Azure Synapse Analytics and Azure Databricks provide tools to profile data and identify quality issues. The exam expects you to know not only the definitions but also how to identify examples of each dimension in given scenarios.

How Completeness Works

Completeness is typically measured as a percentage: (number of non-null values in a field / total expected values) × 100. However, completeness can also refer to entire records. For example, a customer table with 100 rows but only 95 unique customer IDs indicates incomplete data if every customer should have a record.

Field-level completeness: Every column in a row should have a value where required. For example, a mandatory 'Email' column must not be NULL.

Record-level completeness: All expected rows are present. For instance, a daily sales transaction table should have entries for every business day.

Dataset-level completeness: The entire dataset covers the expected scope. For example, a dataset covering all US states should include 50 rows.

In Azure, you can use Azure Data Factory's Data Flow transformations (e.g., Derived Column, Conditional Split) to check for NULLs and filter incomplete records. Azure Synapse provides built-in data profiling capabilities that report completeness percentages per column.

Common exam trap: Candidates often confuse completeness with accuracy. A record can be complete (all fields filled) but inaccurate (wrong values). For example, a customer record with a phone number that is not actually the customer's is complete but inaccurate.

How Accuracy Works

Accuracy is harder to measure because it requires a trusted source of truth (e.g., a master data set or real-world verification). Accuracy is often expressed as a percentage of records that match the true value.

Syntactic accuracy: Does the value conform to a valid format? For example, an email address must contain '@' and a domain.

Semantic accuracy: Does the value represent the correct real-world entity? For example, a product price of $10.00 when the actual price is $9.99 is semantically inaccurate.

In Azure, accuracy checks are typically implemented through business rules. For example, you can use Azure Data Factory's Lookup activity to cross-reference data against a master database. Azure SQL Database supports CHECK constraints that enforce syntactic accuracy (e.g., CHECK (Email LIKE '%_@__%.__%')).

Exam trap: Candidates sometimes think accuracy is the same as precision. Precision refers to the granularity of data (e.g., 3.14159 vs. 3.14), while accuracy refers to correctness. A precise value can be inaccurate.

How Consistency Works

Consistency ensures that data values are the same across different systems or within the same system over time. Inconsistencies arise when:

The same entity is represented differently (e.g., 'NY' vs. 'New York').

The same rule is applied differently (e.g., date format 'MM/dd/yyyy' vs. 'dd/MM/yyyy').

Referential integrity is violated (e.g., an order references a customer ID that does not exist).

Consistency is often enforced through: - Data standards: Defining common formats, codes, and naming conventions. - Referential integrity constraints: Foreign keys in relational databases. - ETL/ELT transformations: Standardizing data during ingestion.

In Azure, Azure Data Factory can enforce consistency by using mapping data flows to apply standard transformations. Azure SQL Database enforces referential consistency via foreign keys. Azure Cosmos DB uses consistency levels (strong, bounded staleness, session, consistent prefix, eventual) to balance consistency with performance.

Exam trap: Consistency is often confused with integrity. Integrity is a broader concept that includes consistency, accuracy, and completeness working together. Consistency specifically focuses on uniform representation.

Measuring Data Quality

Data quality is not binary; it is measured on a continuum. Common metrics include: - Completeness rate: e.g., 95% of email fields are non-null. - Accuracy rate: e.g., 98% of addresses match the USPS database. - Consistency rate: e.g., 99% of customer IDs are formatted as 'CUST-XXXX'.

In Azure, you can use Azure Purview (now Microsoft Purview) to scan data sources and generate data quality reports. Azure Synapse Studio has a 'Data Profiling' feature that automatically computes completeness and uniqueness statistics.

Data Quality in Azure Data Services

Azure SQL Database: Supports constraints (NOT NULL, CHECK, FOREIGN KEY) to enforce completeness, syntactic accuracy, and referential consistency.

Azure Data Factory: Provides data flow transformations to cleanse data (e.g., Filter rows with NULLs, Derive standard formats).

Azure Databricks: Offers Delta Lake, which enforces schema on write and can enforce constraints like NOT NULL.

Azure Synapse Analytics: Includes data profiling and pipeline capabilities for quality checks.

Azure Cosmos DB: Uses consistency levels that affect how quickly data becomes consistent across replicas.

The Data Quality Lifecycle

Data quality is not a one-time activity; it is a continuous process: 1. Define quality rules: What constitutes complete, accurate, consistent data? 2. Measure: Profile data to identify violations. 3. Analyze: Root cause analysis of quality issues. 4. Improve: Cleanse, standardize, or enrich data. 5. Monitor: Ongoing tracking using dashboards.

In Azure, this lifecycle can be implemented using Azure Data Factory pipelines that trigger on a schedule, running data quality checks and logging results to Azure SQL Database or Azure Data Lake.

Edge Cases and Exam Pitfalls

NULL vs. empty string: In completeness, a NULL is missing, but an empty string ('') is present. The exam may test whether you consider an empty string as complete or incomplete. Typically, an empty string is considered present (complete) but potentially inaccurate.

Duplicate records: Duplicates can make data seem complete (all fields filled) but actually represent the same real-world entity multiple times — this is a consistency issue.

Out-of-range values: A value like '9999' for age is syntactically valid but semantically inaccurate. The exam may ask which dimension is violated.

Cross-system consistency: If two databases store the same customer address differently, that is a consistency issue. The exam may present a scenario with multiple data sources.

Summary of Key Numbers and Values

Completeness is often expressed as a percentage (0-100%).

Accuracy may be measured against a master data set with a threshold (e.g., 99.5% accuracy required).

Consistency is enforced via constraints like FOREIGN KEY, UNIQUE, and CHECK.

Azure Data Factory's data flow has transformations like 'Filter', 'Derived Column', and 'Conditional Split' for quality checks.

Azure Synapse's data profiling shows completeness per column.

Microsoft Purview provides data quality dashboards.

How Data Quality Interacts with Other Concepts

Data governance: Data quality is a key pillar of data governance. Policies define acceptable quality levels.

Data integration: ETL processes often include quality checks to prevent bad data from entering the warehouse.

Data security: While separate, poor quality can lead to security issues (e.g., inaccurate access logs).

Data privacy: Incomplete data may violate privacy regulations if personal data is missing but required.

Understanding these interactions helps you answer scenario-based questions on the DP-900 exam that ask you to choose the appropriate Azure service or technique to address a data quality problem.

Walk-Through

1

Define Data Quality Rules

Before measuring, you must define what constitutes complete, accurate, and consistent data for your specific use case. For completeness, specify which fields are mandatory (e.g., 'Email' must not be NULL). For accuracy, define valid ranges or reference data (e.g., 'State' must be a valid USPS abbreviation). For consistency, specify formats (e.g., 'Date' must be 'YYYY-MM-DD'). These rules become the basis for all subsequent checks. In Azure, rules can be documented in Microsoft Purview as business glossary terms or implemented as constraints in Azure SQL Database.

2

Profile the Data

Use data profiling tools to assess the current state of the data. In Azure Synapse Studio, you can right-click a table and select 'Profile Data' to see column statistics: null counts, distinct values, min/max, etc. For completeness, examine the 'Null Count' column. For accuracy, look for out-of-range values. For consistency, check distinct values for unexpected entries (e.g., 'NY' vs 'New York'). Profiling gives you baseline metrics. This step typically runs on a sample or full dataset and can be automated via Azure Data Factory pipelines using the 'Data Profiling' transformation.

3

Identify Quality Issues

Compare profiling results against your defined rules. Flag records that violate completeness (missing mandatory fields), accuracy (values outside allowed range), or consistency (format mismatches). For example, if 'State' column contains 'NY' and 'New York', that's an inconsistency. Use Azure Data Factory's 'Conditional Split' to route bad records to a separate output for review. Log issues to an Azure SQL Database table with columns: Source, Rule, Violation Count, Timestamp. This step produces actionable data quality reports.

4

Cleanse and Standardize Data

Apply transformations to fix identified issues. For completeness: fill missing values using default values, mean, or lookup. For accuracy: correct values using a reference dataset (e.g., replace 'New York' with 'NY'). For consistency: standardize formats (e.g., convert all dates to 'YYYY-MM-DD'). In Azure Data Factory, use 'Derived Column' to set defaults, 'Lookup' to cross-reference, and 'Data Flow' expressions for format conversions. This step may be iterative, as cleansing can introduce new issues (e.g., filling missing values may reduce accuracy).

5

Validate and Monitor

After cleansing, re-profile the data to confirm quality improvements. Then set up ongoing monitoring using Azure Data Factory triggers (e.g., daily pipeline) or Azure Monitor alerts. For example, if completeness of 'Email' field drops below 95%, send an alert. Use Azure Purview to track data quality metrics over time and create dashboards. This step ensures data quality is maintained, not just fixed once. The exam may ask about tools for monitoring: Azure Monitor for alerts, Power BI for dashboards.

What This Looks Like on the Job

Enterprise Scenario 1: Customer Data Migration to Azure SQL Database

A retail company is migrating its customer database from an on-premises SQL Server to Azure SQL Database. The source data has been accumulated over 20 years with minimal validation. Completeness: 12% of records lack phone numbers; 5% lack email. Accuracy: 8% of zip codes are invalid (e.g., 'ABCDE'). Consistency: State abbreviations appear as 'CA', 'Calif', 'California'. The data engineering team uses Azure Data Factory with a Copy activity to ingest data into a staging table in Azure SQL Database. Then, a Data Flow pipeline runs daily: it uses Conditional Split to separate records with missing email (incomplete) into a review table. A Lookup transformation cross-references zip codes against a USPS reference table; mismatches are flagged. Derived Column transformations standardize states using a lookup table mapping full names to abbreviations. The cleansed data is loaded into the production table. Performance considerations: the Data Flow runs on Azure Integration Runtime with 8 cores; processing 10 million records takes about 45 minutes. Common pitfalls: if the reference table is outdated, accuracy corrections may introduce new errors. Misconfiguration of the Data Flow's 'Row conditions' can cause too many records to be flagged, overwhelming the review queue.

Enterprise Scenario 2: Real-time IoT Data in Azure Stream Analytics

A manufacturing company uses IoT sensors to monitor equipment temperature. Data streams into Azure Event Hubs and is processed by Azure Stream Analytics. Completeness: every reading should include a timestamp, sensor ID, and temperature. Accuracy: temperature must be between -50°C and 150°C. Consistency: timestamps must be in UTC. During a network outage, some messages arrive without timestamps (incomplete). The Stream Analytics job uses a TIMESTAMP BY clause to assign a system timestamp when missing. For accuracy, a WHERE clause filters out temperatures outside the valid range, routing them to a dead-letter queue for analysis. Consistency is enforced by the sensor firmware, but the Stream Analytics job also checks that timestamps are in UTC; if not, it applies a conversion using a UDF. This scenario tests real-time quality enforcement. Common issues: if the dead-letter queue is not monitored, quality issues go unnoticed. Also, the TIMESTAMP BY clause can cause out-of-order events if the system clock is unsynchronized.

Enterprise Scenario 3: Data Lake Analytics with Azure Databricks

A financial services firm stores transaction data in Azure Data Lake Storage Gen2. Analysts use Azure Databricks to run daily reports. Completeness: all transactions must have a transaction ID and amount. Accuracy: amounts must be positive. Consistency: currency codes must follow ISO 4217 (e.g., 'USD', 'EUR'). The Databricks notebook reads Parquet files, applies a schema with NOT NULL constraints on transaction ID and amount using DataFrame API (e.g., df.filter(col('amount').isNotNull())). For accuracy, it filters out rows where amount <= 0. For consistency, it uses a lookup table to validate currency codes. Bad records are written to a separate 'quarantine' table. The notebook runs as an automated job every hour. Performance: processing 100 GB of data takes about 20 minutes with a 10-node cluster. Pitfalls: if the schema evolves (e.g., new column added), the notebook may break or silently drop columns. Also, the quarantine table can grow large if quality issues are not addressed at the source.

How DP-900 Actually Tests This

What DP-900 Tests on Data Quality

DP-900 Objective 1.1: 'Describe core data concepts' includes data quality dimensions. The exam expects you to:

Identify examples of completeness, accuracy, and consistency in given scenarios.

Understand how Azure services (Azure Data Factory, Azure Synapse, Azure SQL Database, Azure Databricks) support data quality.

Know the difference between data quality and data integrity.

Recognize common data quality issues (missing values, duplicates, format inconsistencies).

Most Common Wrong Answers and Why

1.

Confusing completeness with accuracy: A scenario shows a record with all fields filled but wrong values. Candidates often select 'completeness issue' because they see missing values as the only quality problem. The correct answer is 'accuracy issue' because the values are factually incorrect.

2.

Confusing consistency with integrity: A scenario about referential integrity (e.g., order with non-existent customer) is sometimes mislabeled as 'consistency'. Integrity is broader; consistency specifically is about uniform representation, not existence of relationships.

3.

Choosing the wrong Azure service: When asked which service to use for data profiling, candidates may pick Azure SQL Database (which stores data) instead of Azure Synapse or Azure Data Factory (which profile). Know that profiling is a feature of Synapse Studio and Data Factory data flows.

4.

Overlooking NULL vs empty string: A question may state that a column has empty strings but no NULLs. Candidates may call it incomplete, but empty strings are considered present (complete) — though they may be inaccurate.

Specific Numbers and Terms That Appear Verbatim

Completeness is often measured as a percentage.

Accuracy may reference 'master data' or 'reference data'.

Consistency is often tied to 'format standards' or 'naming conventions'.

Azure Data Factory 'Data Flow' transformations: Derived Column, Conditional Split, Lookup.

Azure Synapse 'Data Profiling' feature.

Microsoft Purview 'Data Quality' dashboards.

Edge Cases and Exceptions

Duplicate records: Duplicates are not a completeness issue (all fields present) but a consistency issue (multiple representations of same entity). The exam may present a table with duplicate rows and ask which dimension is violated.

Out-of-range but plausible values: For example, age 200 is inaccurate but may pass syntax checks. The exam expects you to recognize it as accuracy issue.

Cross-system consistency: If two databases store the same customer address differently, that's consistency. The exam may describe a scenario with multiple data sources.

How to Eliminate Wrong Answers

1.

Read the scenario carefully: Identify what exactly is wrong. Is a field missing? -> Completeness. Is a value factually wrong? -> Accuracy. Are representations different? -> Consistency.

2.

Match the symptom to the dimension: Use the specific definitions. For example, if the issue is about format (e.g., date format mismatch), it's consistency.

3.

Consider the Azure service: If the question asks which tool to use for data quality checks, think about profiling (Azure Synapse) or transformation (Azure Data Factory). Avoid services that only store data.

4.

Watch for traps: The exam may use the term 'data quality' broadly but expect you to pick the most specific dimension. Always choose the most precise answer.

Key Takeaways

Completeness = presence of data; measured as % of non-null values.

Accuracy = correctness of data; requires a trusted reference.

Consistency = uniform representation across systems; enforced via standards and constraints.

NULL indicates incompleteness; empty string is complete but may be inaccurate.

Duplicate records indicate a consistency issue, not completeness.

Azure Synapse Studio provides data profiling for completeness and consistency checks.

Azure Data Factory Data Flow transformations (Derived Column, Conditional Split) are used for cleansing.

Microsoft Purview offers data quality dashboards for monitoring.

Data quality is a continuous process, not a one-time fix.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Completeness

Measures presence of data: are all expected values there?

Example: Missing email address in customer record.

Expressed as percentage of non-null values.

Enforced via NOT NULL constraints.

Azure tool: Data Profiling in Synapse shows null counts.

Accuracy

Measures correctness of data: do values match reality?

Example: Email address is present but is wrong (e.g., typo).

Expressed as percentage of records matching a trusted source.

Enforced via CHECK constraints or reference data lookups.

Azure tool: Data Factory Lookup against master data.

Watch Out for These

Mistake

Completeness means every field must have a value — empty strings count as incomplete.

Correct

Empty strings are considered present (non-null) and therefore count as complete. Completeness specifically measures missing (NULL) values. An empty string may be inaccurate, but it is not incomplete.

Mistake

Accuracy is the same as precision.

Correct

Accuracy refers to correctness (value matches real world), while precision refers to granularity (number of decimal places). A value can be precise but inaccurate (e.g., 3.14159 when the true value is 2.71828). DP-900 tests accuracy, not precision.

Mistake

Consistency ensures that data is correct across systems.

Correct

Consistency ensures uniform representation (same format, same codes), but does not guarantee correctness. Two systems can consistently represent a wrong value (e.g., both store 'NY' for a customer in California). Accuracy must be checked separately.

Mistake

Data quality is a one-time activity during data migration.

Correct

Data quality must be continuously monitored and maintained because data changes over time. Azure services like Data Factory and Purview support ongoing quality checks through scheduled pipelines and alerts.

Mistake

Azure SQL Database automatically ensures data quality.

Correct

Azure SQL Database provides constraints (NOT NULL, CHECK, FOREIGN KEY) that can enforce some quality rules, but it does not automatically profile data or detect complex quality issues. Additional tools like Azure Data Factory or Synapse are needed for comprehensive quality management.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between data quality and data integrity?

Data integrity is a broader concept that encompasses data quality, but also includes aspects like referential integrity, entity integrity, and domain integrity. Data quality focuses on dimensions like completeness, accuracy, consistency, timeliness, and validity. In DP-900, integrity often refers to constraints (e.g., foreign keys), while quality refers to the fitness of data for use.

How do I check for data completeness in Azure Synapse?

In Azure Synapse Studio, you can right-click a table and select 'Profile Data'. The profiling report shows null counts per column, which directly indicates incompleteness. You can also use T-SQL queries like `SELECT COUNT(*) FROM table WHERE column IS NULL` to compute completeness percentages.

What Azure service is best for data quality cleansing?

Azure Data Factory is the primary service for data quality cleansing. Its Data Flow feature allows you to filter, transform, and standardize data using visual transformations like Derived Column, Conditional Split, and Lookup. For real-time data, Azure Stream Analytics can filter and transform data in motion.

Can Azure SQL Database enforce data quality automatically?

Azure SQL Database can enforce some aspects of data quality through constraints: NOT NULL for completeness, CHECK for syntactic accuracy, and FOREIGN KEY for referential consistency. However, it does not automatically profile data or detect semantic inaccuracies. Additional tools like Azure Data Factory are needed for comprehensive quality management.

What is the difference between syntactic and semantic accuracy?

Syntactic accuracy checks if data conforms to a valid format (e.g., email contains '@'). Semantic accuracy checks if the value correctly represents the real-world entity (e.g., the email actually belongs to the customer). DP-900 may test both, but semantic accuracy is more commonly associated with 'accuracy' as a dimension.

How do I handle duplicate records in terms of data quality?

Duplicate records are a consistency issue because the same real-world entity is represented multiple times. They can also affect completeness (if duplicates inflate counts) and accuracy (if different duplicates have conflicting values). To handle them, use Azure Data Factory's 'Aggregate' transformation or T-SQL's ROW_NUMBER() to deduplicate.

What is the role of Microsoft Purview in data quality?

Microsoft Purview provides data governance capabilities, including data quality dashboards. It can scan data sources, profile data, and report on quality metrics over time. It also allows you to define business rules and track compliance. For DP-900, know that Purview is used for data quality monitoring and governance.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Data Quality: Completeness, Accuracy, Consistency — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?