This chapter covers data quality dimensions — completeness, accuracy, and consistency — as tested in DP-900 Objective 1.1 (Core Data Concepts). Understanding these dimensions is critical because approximately 15-20% of DP-900 exam questions directly or indirectly assess data quality concepts, often as part of scenarios about data processing, storage, or analytics. You will learn the precise definitions, how each dimension is measured, and how they relate to data governance and data integrity in Azure data services.
Jump to a section
Imagine a library with millions of books. The card catalog is the system that tracks every book's location, author, title, and status. Completeness means every book has a card — no missing entries for books on the shelves. If a book is returned but no card is updated, that's an incompleteness. Accuracy means the card says the correct author and title — a card that lists 'Moby-Dick' under 'Melville' is accurate; one that says 'Hawthorne' is inaccurate. Consistency means all cards follow the same format: author last name first, title in quotes, shelf location in the same pattern (e.g., 'Section A, Row 3'). If one card uses 'Row 3' and another uses '3rd row', the catalog is inconsistent. In a library, a patron searching by card catalog expects to find the book exactly where the card says. Data quality in databases works identically: completeness ensures no missing records, accuracy ensures correct values, and consistency ensures uniform representation across all entries. A library with a flawed catalog frustrates patrons; a database with poor data quality leads to incorrect analytics, failed transactions, and broken reports — directly impacting business decisions.
What Is Data Quality and Why Does It Matter?
Data quality is a measure of how well a dataset meets the requirements of its intended use. In the context of DP-900, data quality is assessed across several dimensions, with completeness, accuracy, and consistency being the three most frequently tested. These dimensions are not optional features; they are fundamental to ensuring that data-driven decisions are based on reliable information.
Completeness refers to the extent to which all expected data is present. Missing values, null fields, or omitted records reduce completeness.
Accuracy measures how closely data values match the real-world entities or events they represent. An inaccurate value is factually wrong.
Consistency ensures that data is uniform across different systems, tables, or formats. Contradictions or variations in representation indicate inconsistency.
Data quality is a cross-cutting concern in Azure data services. For instance, Azure Data Factory pipelines often include data quality checks as part of data cleansing activities. Azure Synapse Analytics and Azure Databricks provide tools to profile data and identify quality issues. The exam expects you to know not only the definitions but also how to identify examples of each dimension in given scenarios.
How Completeness Works
Completeness is typically measured as a percentage: (number of non-null values in a field / total expected values) × 100. However, completeness can also refer to entire records. For example, a customer table with 100 rows but only 95 unique customer IDs indicates incomplete data if every customer should have a record.
Field-level completeness: Every column in a row should have a value where required. For example, a mandatory 'Email' column must not be NULL.
Record-level completeness: All expected rows are present. For instance, a daily sales transaction table should have entries for every business day.
Dataset-level completeness: The entire dataset covers the expected scope. For example, a dataset covering all US states should include 50 rows.
In Azure, you can use Azure Data Factory's Data Flow transformations (e.g., Derived Column, Conditional Split) to check for NULLs and filter incomplete records. Azure Synapse provides built-in data profiling capabilities that report completeness percentages per column.
Common exam trap: Candidates often confuse completeness with accuracy. A record can be complete (all fields filled) but inaccurate (wrong values). For example, a customer record with a phone number that is not actually the customer's is complete but inaccurate.
How Accuracy Works
Accuracy is harder to measure because it requires a trusted source of truth (e.g., a master data set or real-world verification). Accuracy is often expressed as a percentage of records that match the true value.
Syntactic accuracy: Does the value conform to a valid format? For example, an email address must contain '@' and a domain.
Semantic accuracy: Does the value represent the correct real-world entity? For example, a product price of $10.00 when the actual price is $9.99 is semantically inaccurate.
In Azure, accuracy checks are typically implemented through business rules. For example, you can use Azure Data Factory's Lookup activity to cross-reference data against a master database. Azure SQL Database supports CHECK constraints that enforce syntactic accuracy (e.g., CHECK (Email LIKE '%_@__%.__%')).
Exam trap: Candidates sometimes think accuracy is the same as precision. Precision refers to the granularity of data (e.g., 3.14159 vs. 3.14), while accuracy refers to correctness. A precise value can be inaccurate.
How Consistency Works
Consistency ensures that data values are the same across different systems or within the same system over time. Inconsistencies arise when:
The same entity is represented differently (e.g., 'NY' vs. 'New York').
The same rule is applied differently (e.g., date format 'MM/dd/yyyy' vs. 'dd/MM/yyyy').
Referential integrity is violated (e.g., an order references a customer ID that does not exist).
Consistency is often enforced through: - Data standards: Defining common formats, codes, and naming conventions. - Referential integrity constraints: Foreign keys in relational databases. - ETL/ELT transformations: Standardizing data during ingestion.
In Azure, Azure Data Factory can enforce consistency by using mapping data flows to apply standard transformations. Azure SQL Database enforces referential consistency via foreign keys. Azure Cosmos DB uses consistency levels (strong, bounded staleness, session, consistent prefix, eventual) to balance consistency with performance.
Exam trap: Consistency is often confused with integrity. Integrity is a broader concept that includes consistency, accuracy, and completeness working together. Consistency specifically focuses on uniform representation.
Measuring Data Quality
Data quality is not binary; it is measured on a continuum. Common metrics include: - Completeness rate: e.g., 95% of email fields are non-null. - Accuracy rate: e.g., 98% of addresses match the USPS database. - Consistency rate: e.g., 99% of customer IDs are formatted as 'CUST-XXXX'.
In Azure, you can use Azure Purview (now Microsoft Purview) to scan data sources and generate data quality reports. Azure Synapse Studio has a 'Data Profiling' feature that automatically computes completeness and uniqueness statistics.
Data Quality in Azure Data Services
Azure SQL Database: Supports constraints (NOT NULL, CHECK, FOREIGN KEY) to enforce completeness, syntactic accuracy, and referential consistency.
Azure Data Factory: Provides data flow transformations to cleanse data (e.g., Filter rows with NULLs, Derive standard formats).
Azure Databricks: Offers Delta Lake, which enforces schema on write and can enforce constraints like NOT NULL.
Azure Synapse Analytics: Includes data profiling and pipeline capabilities for quality checks.
Azure Cosmos DB: Uses consistency levels that affect how quickly data becomes consistent across replicas.
The Data Quality Lifecycle
Data quality is not a one-time activity; it is a continuous process: 1. Define quality rules: What constitutes complete, accurate, consistent data? 2. Measure: Profile data to identify violations. 3. Analyze: Root cause analysis of quality issues. 4. Improve: Cleanse, standardize, or enrich data. 5. Monitor: Ongoing tracking using dashboards.
In Azure, this lifecycle can be implemented using Azure Data Factory pipelines that trigger on a schedule, running data quality checks and logging results to Azure SQL Database or Azure Data Lake.
Edge Cases and Exam Pitfalls
NULL vs. empty string: In completeness, a NULL is missing, but an empty string ('') is present. The exam may test whether you consider an empty string as complete or incomplete. Typically, an empty string is considered present (complete) but potentially inaccurate.
Duplicate records: Duplicates can make data seem complete (all fields filled) but actually represent the same real-world entity multiple times — this is a consistency issue.
Out-of-range values: A value like '9999' for age is syntactically valid but semantically inaccurate. The exam may ask which dimension is violated.
Cross-system consistency: If two databases store the same customer address differently, that is a consistency issue. The exam may present a scenario with multiple data sources.
Summary of Key Numbers and Values
Completeness is often expressed as a percentage (0-100%).
Accuracy may be measured against a master data set with a threshold (e.g., 99.5% accuracy required).
Consistency is enforced via constraints like FOREIGN KEY, UNIQUE, and CHECK.
Azure Data Factory's data flow has transformations like 'Filter', 'Derived Column', and 'Conditional Split' for quality checks.
Azure Synapse's data profiling shows completeness per column.
Microsoft Purview provides data quality dashboards.
How Data Quality Interacts with Other Concepts
Data governance: Data quality is a key pillar of data governance. Policies define acceptable quality levels.
Data integration: ETL processes often include quality checks to prevent bad data from entering the warehouse.
Data security: While separate, poor quality can lead to security issues (e.g., inaccurate access logs).
Data privacy: Incomplete data may violate privacy regulations if personal data is missing but required.
Understanding these interactions helps you answer scenario-based questions on the DP-900 exam that ask you to choose the appropriate Azure service or technique to address a data quality problem.
Define Data Quality Rules
Before measuring, you must define what constitutes complete, accurate, and consistent data for your specific use case. For completeness, specify which fields are mandatory (e.g., 'Email' must not be NULL). For accuracy, define valid ranges or reference data (e.g., 'State' must be a valid USPS abbreviation). For consistency, specify formats (e.g., 'Date' must be 'YYYY-MM-DD'). These rules become the basis for all subsequent checks. In Azure, rules can be documented in Microsoft Purview as business glossary terms or implemented as constraints in Azure SQL Database.
Profile the Data
Use data profiling tools to assess the current state of the data. In Azure Synapse Studio, you can right-click a table and select 'Profile Data' to see column statistics: null counts, distinct values, min/max, etc. For completeness, examine the 'Null Count' column. For accuracy, look for out-of-range values. For consistency, check distinct values for unexpected entries (e.g., 'NY' vs 'New York'). Profiling gives you baseline metrics. This step typically runs on a sample or full dataset and can be automated via Azure Data Factory pipelines using the 'Data Profiling' transformation.
Identify Quality Issues
Compare profiling results against your defined rules. Flag records that violate completeness (missing mandatory fields), accuracy (values outside allowed range), or consistency (format mismatches). For example, if 'State' column contains 'NY' and 'New York', that's an inconsistency. Use Azure Data Factory's 'Conditional Split' to route bad records to a separate output for review. Log issues to an Azure SQL Database table with columns: Source, Rule, Violation Count, Timestamp. This step produces actionable data quality reports.
Cleanse and Standardize Data
Apply transformations to fix identified issues. For completeness: fill missing values using default values, mean, or lookup. For accuracy: correct values using a reference dataset (e.g., replace 'New York' with 'NY'). For consistency: standardize formats (e.g., convert all dates to 'YYYY-MM-DD'). In Azure Data Factory, use 'Derived Column' to set defaults, 'Lookup' to cross-reference, and 'Data Flow' expressions for format conversions. This step may be iterative, as cleansing can introduce new issues (e.g., filling missing values may reduce accuracy).
Validate and Monitor
After cleansing, re-profile the data to confirm quality improvements. Then set up ongoing monitoring using Azure Data Factory triggers (e.g., daily pipeline) or Azure Monitor alerts. For example, if completeness of 'Email' field drops below 95%, send an alert. Use Azure Purview to track data quality metrics over time and create dashboards. This step ensures data quality is maintained, not just fixed once. The exam may ask about tools for monitoring: Azure Monitor for alerts, Power BI for dashboards.
Enterprise Scenario 1: Customer Data Migration to Azure SQL Database
A retail company is migrating its customer database from an on-premises SQL Server to Azure SQL Database. The source data has been accumulated over 20 years with minimal validation. Completeness: 12% of records lack phone numbers; 5% lack email. Accuracy: 8% of zip codes are invalid (e.g., 'ABCDE'). Consistency: State abbreviations appear as 'CA', 'Calif', 'California'. The data engineering team uses Azure Data Factory with a Copy activity to ingest data into a staging table in Azure SQL Database. Then, a Data Flow pipeline runs daily: it uses Conditional Split to separate records with missing email (incomplete) into a review table. A Lookup transformation cross-references zip codes against a USPS reference table; mismatches are flagged. Derived Column transformations standardize states using a lookup table mapping full names to abbreviations. The cleansed data is loaded into the production table. Performance considerations: the Data Flow runs on Azure Integration Runtime with 8 cores; processing 10 million records takes about 45 minutes. Common pitfalls: if the reference table is outdated, accuracy corrections may introduce new errors. Misconfiguration of the Data Flow's 'Row conditions' can cause too many records to be flagged, overwhelming the review queue.
Enterprise Scenario 2: Real-time IoT Data in Azure Stream Analytics
A manufacturing company uses IoT sensors to monitor equipment temperature. Data streams into Azure Event Hubs and is processed by Azure Stream Analytics. Completeness: every reading should include a timestamp, sensor ID, and temperature. Accuracy: temperature must be between -50°C and 150°C. Consistency: timestamps must be in UTC. During a network outage, some messages arrive without timestamps (incomplete). The Stream Analytics job uses a TIMESTAMP BY clause to assign a system timestamp when missing. For accuracy, a WHERE clause filters out temperatures outside the valid range, routing them to a dead-letter queue for analysis. Consistency is enforced by the sensor firmware, but the Stream Analytics job also checks that timestamps are in UTC; if not, it applies a conversion using a UDF. This scenario tests real-time quality enforcement. Common issues: if the dead-letter queue is not monitored, quality issues go unnoticed. Also, the TIMESTAMP BY clause can cause out-of-order events if the system clock is unsynchronized.
Enterprise Scenario 3: Data Lake Analytics with Azure Databricks
A financial services firm stores transaction data in Azure Data Lake Storage Gen2. Analysts use Azure Databricks to run daily reports. Completeness: all transactions must have a transaction ID and amount. Accuracy: amounts must be positive. Consistency: currency codes must follow ISO 4217 (e.g., 'USD', 'EUR'). The Databricks notebook reads Parquet files, applies a schema with NOT NULL constraints on transaction ID and amount using DataFrame API (e.g., df.filter(col('amount').isNotNull())). For accuracy, it filters out rows where amount <= 0. For consistency, it uses a lookup table to validate currency codes. Bad records are written to a separate 'quarantine' table. The notebook runs as an automated job every hour. Performance: processing 100 GB of data takes about 20 minutes with a 10-node cluster. Pitfalls: if the schema evolves (e.g., new column added), the notebook may break or silently drop columns. Also, the quarantine table can grow large if quality issues are not addressed at the source.
What DP-900 Tests on Data Quality
DP-900 Objective 1.1: 'Describe core data concepts' includes data quality dimensions. The exam expects you to:
Identify examples of completeness, accuracy, and consistency in given scenarios.
Understand how Azure services (Azure Data Factory, Azure Synapse, Azure SQL Database, Azure Databricks) support data quality.
Know the difference between data quality and data integrity.
Recognize common data quality issues (missing values, duplicates, format inconsistencies).
Most Common Wrong Answers and Why
Confusing completeness with accuracy: A scenario shows a record with all fields filled but wrong values. Candidates often select 'completeness issue' because they see missing values as the only quality problem. The correct answer is 'accuracy issue' because the values are factually incorrect.
Confusing consistency with integrity: A scenario about referential integrity (e.g., order with non-existent customer) is sometimes mislabeled as 'consistency'. Integrity is broader; consistency specifically is about uniform representation, not existence of relationships.
Choosing the wrong Azure service: When asked which service to use for data profiling, candidates may pick Azure SQL Database (which stores data) instead of Azure Synapse or Azure Data Factory (which profile). Know that profiling is a feature of Synapse Studio and Data Factory data flows.
Overlooking NULL vs empty string: A question may state that a column has empty strings but no NULLs. Candidates may call it incomplete, but empty strings are considered present (complete) — though they may be inaccurate.
Specific Numbers and Terms That Appear Verbatim
Completeness is often measured as a percentage.
Accuracy may reference 'master data' or 'reference data'.
Consistency is often tied to 'format standards' or 'naming conventions'.
Azure Data Factory 'Data Flow' transformations: Derived Column, Conditional Split, Lookup.
Azure Synapse 'Data Profiling' feature.
Microsoft Purview 'Data Quality' dashboards.
Edge Cases and Exceptions
Duplicate records: Duplicates are not a completeness issue (all fields present) but a consistency issue (multiple representations of same entity). The exam may present a table with duplicate rows and ask which dimension is violated.
Out-of-range but plausible values: For example, age 200 is inaccurate but may pass syntax checks. The exam expects you to recognize it as accuracy issue.
Cross-system consistency: If two databases store the same customer address differently, that's consistency. The exam may describe a scenario with multiple data sources.
How to Eliminate Wrong Answers
Read the scenario carefully: Identify what exactly is wrong. Is a field missing? -> Completeness. Is a value factually wrong? -> Accuracy. Are representations different? -> Consistency.
Match the symptom to the dimension: Use the specific definitions. For example, if the issue is about format (e.g., date format mismatch), it's consistency.
Consider the Azure service: If the question asks which tool to use for data quality checks, think about profiling (Azure Synapse) or transformation (Azure Data Factory). Avoid services that only store data.
Watch for traps: The exam may use the term 'data quality' broadly but expect you to pick the most specific dimension. Always choose the most precise answer.
Completeness = presence of data; measured as % of non-null values.
Accuracy = correctness of data; requires a trusted reference.
Consistency = uniform representation across systems; enforced via standards and constraints.
NULL indicates incompleteness; empty string is complete but may be inaccurate.
Duplicate records indicate a consistency issue, not completeness.
Azure Synapse Studio provides data profiling for completeness and consistency checks.
Azure Data Factory Data Flow transformations (Derived Column, Conditional Split) are used for cleansing.
Microsoft Purview offers data quality dashboards for monitoring.
Data quality is a continuous process, not a one-time fix.
These come up on the exam all the time. Here's how to tell them apart.
Completeness
Measures presence of data: are all expected values there?
Example: Missing email address in customer record.
Expressed as percentage of non-null values.
Enforced via NOT NULL constraints.
Azure tool: Data Profiling in Synapse shows null counts.
Accuracy
Measures correctness of data: do values match reality?
Example: Email address is present but is wrong (e.g., typo).
Expressed as percentage of records matching a trusted source.
Enforced via CHECK constraints or reference data lookups.
Azure tool: Data Factory Lookup against master data.
Mistake
Completeness means every field must have a value — empty strings count as incomplete.
Correct
Empty strings are considered present (non-null) and therefore count as complete. Completeness specifically measures missing (NULL) values. An empty string may be inaccurate, but it is not incomplete.
Mistake
Accuracy is the same as precision.
Correct
Accuracy refers to correctness (value matches real world), while precision refers to granularity (number of decimal places). A value can be precise but inaccurate (e.g., 3.14159 when the true value is 2.71828). DP-900 tests accuracy, not precision.
Mistake
Consistency ensures that data is correct across systems.
Correct
Consistency ensures uniform representation (same format, same codes), but does not guarantee correctness. Two systems can consistently represent a wrong value (e.g., both store 'NY' for a customer in California). Accuracy must be checked separately.
Mistake
Data quality is a one-time activity during data migration.
Correct
Data quality must be continuously monitored and maintained because data changes over time. Azure services like Data Factory and Purview support ongoing quality checks through scheduled pipelines and alerts.
Mistake
Azure SQL Database automatically ensures data quality.
Correct
Azure SQL Database provides constraints (NOT NULL, CHECK, FOREIGN KEY) that can enforce some quality rules, but it does not automatically profile data or detect complex quality issues. Additional tools like Azure Data Factory or Synapse are needed for comprehensive quality management.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Data integrity is a broader concept that encompasses data quality, but also includes aspects like referential integrity, entity integrity, and domain integrity. Data quality focuses on dimensions like completeness, accuracy, consistency, timeliness, and validity. In DP-900, integrity often refers to constraints (e.g., foreign keys), while quality refers to the fitness of data for use.
In Azure Synapse Studio, you can right-click a table and select 'Profile Data'. The profiling report shows null counts per column, which directly indicates incompleteness. You can also use T-SQL queries like `SELECT COUNT(*) FROM table WHERE column IS NULL` to compute completeness percentages.
Azure Data Factory is the primary service for data quality cleansing. Its Data Flow feature allows you to filter, transform, and standardize data using visual transformations like Derived Column, Conditional Split, and Lookup. For real-time data, Azure Stream Analytics can filter and transform data in motion.
Azure SQL Database can enforce some aspects of data quality through constraints: NOT NULL for completeness, CHECK for syntactic accuracy, and FOREIGN KEY for referential consistency. However, it does not automatically profile data or detect semantic inaccuracies. Additional tools like Azure Data Factory are needed for comprehensive quality management.
Syntactic accuracy checks if data conforms to a valid format (e.g., email contains '@'). Semantic accuracy checks if the value correctly represents the real-world entity (e.g., the email actually belongs to the customer). DP-900 may test both, but semantic accuracy is more commonly associated with 'accuracy' as a dimension.
Duplicate records are a consistency issue because the same real-world entity is represented multiple times. They can also affect completeness (if duplicates inflate counts) and accuracy (if different duplicates have conflicting values). To handle them, use Azure Data Factory's 'Aggregate' transformation or T-SQL's ROW_NUMBER() to deduplicate.
Microsoft Purview provides data governance capabilities, including data quality dashboards. It can scan data sources, profile data, and report on quality metrics over time. It also allows you to define business rules and track compliance. For DP-900, know that Purview is used for data quality monitoring and governance.
You've just covered Data Quality: Completeness, Accuracy, Consistency — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?