A large retail company is integrating customer data from two separate CRM systems into a new data warehouse. System A stores customer IDs as integers (e.g., 12345), while System B stores them as alphanumeric strings (e.g., 'CUST-12345-X'). Additionally, some customers exist in both systems but with slight name variations (e.g., 'John Smith' vs 'Jon Smith'). The data warehouse requires a unified customer table with a single unique identifier for each customer. The analyst needs to design the data acquisition process. Which of the following is the most appropriate first step?
Profiling provides the necessary insights to plan transformations, handle inconsistencies, and design the matching strategy.
Why this answer
Option C is correct because data profiling is the foundational first step in any data integration project. It systematically assesses source data types, formats, completeness, and quality issues (e.g., integer vs. alphanumeric IDs, name variations) before designing transformation logic. Without profiling, subsequent steps like fuzzy matching or ID standardization risk being built on incorrect assumptions about the data.
Exam trap
The trap here is that candidates often jump to a technical solution (fuzzy matching or ID standardization) without recognizing that data profiling is the prerequisite step that validates source assumptions and prevents costly rework.
How to eliminate wrong answers
Option A is wrong because exact name matches cannot resolve the known name variations (e.g., 'John Smith' vs 'Jon Smith'), leading to missed linkages and duplicate customers. Option B is wrong because loading all data into a staging table before profiling risks propagating unknown data quality issues (e.g., inconsistent ID formats, nulls) into the staging area, making fuzzy matching less reliable and harder to tune. Option D is wrong because standardizing IDs to a common format (e.g., UUIDs) without first profiling the source data ignores the need to understand existing relationships and quality issues, and may break referential integrity if applied prematurely.