CCNA Data Concepts Questions

40 of 115 questions · Page 2/2 · Data Concepts topic · Answers revealed

76
MCQhard

Refer to the exhibit. A database administrator notices that queries filtering on both CustomerID and OrderDate are slow. Which single change would most likely improve performance for such queries?

A.Partition the table by OrderDate
B.Convert TotalAmount to VARCHAR
C.Add a composite index on (CustomerID, OrderDate)
D.Remove the primary key constraint
AnswerC

A composite index can satisfy both conditions in one index seek.

Why this answer

A composite index on (CustomerID, OrderDate) allows the database to use a single index to filter on both columns, which is more efficient than using separate indexes and combining results.
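
The effect can be sketched with Python's built-in sqlite3 module. The Orders table layout, the index name idx_orders_customer_date, and the sample rows are assumptions, since the exhibit is not reproduced here. Other database engines report their query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Orders (
           OrderID     INTEGER PRIMARY KEY,
           CustomerID  INTEGER NOT NULL,
           OrderDate   TEXT    NOT NULL,
           TotalAmount REAL
       )"""
)
# Made-up sample data: 50 customers, 28 orders each.
conn.executemany(
    "INSERT INTO Orders (CustomerID, OrderDate, TotalAmount) VALUES (?, ?, ?)",
    [(c, f"2024-01-{d:02d}", 10.0 * d) for c in range(1, 51) for d in range(1, 29)],
)

# Composite index: CustomerID first, then OrderDate.
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON Orders (CustomerID, OrderDate)"
)

# Ask SQLite how it would run a query that filters on both columns.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM Orders WHERE CustomerID = ? AND OrderDate = ?",
    (7, "2024-01-15"),
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

In this sketch the plan text names the composite index, which suggests both filter conditions are resolved through that one index rather than a full table scan.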

77
MCQmedium

A data quality report shows that 95% of records have all required fields completed, but 20% of the completed fields contain values that are outside valid ranges. Which data quality dimension is most affected?

A.Consistency
B.Accuracy
C.Timeliness
D.Completeness
AnswerB

Accuracy is compromised because values outside valid ranges are incorrect.

Why this answer

Accuracy measures how well data reflects real-world values or a defined standard. Here, 20% of completed fields contain values outside valid ranges, meaning the data is present but incorrect, directly degrading accuracy. Completeness (95% filled) is high, but the core issue is that the values themselves are wrong, not missing or late.

Exam trap

The trap here is that the '95% of records have all required fields completed' figure draws attention to Completeness. Candidates may assume the 20% of out-of-range values also count against Completeness, when in fact Accuracy is the dimension that suffers when data is present but invalid.

How to eliminate wrong answers

Option A (Consistency) is wrong because consistency checks for logical coherence across datasets or over time (e.g., same customer ID format in two tables), not whether individual field values fall within valid ranges. Option C (Timeliness) is wrong because timeliness concerns whether data is available when needed or within a required time window, not the correctness of values. Option D (Completeness) is wrong because completeness measures the presence of data (95% of records have all required fields), which is high; the problem is with the quality of the present data, not its absence.
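
The difference between "present" and "within a valid range" can be illustrated with a small Python sketch. The records, field names, and valid ranges below are invented for illustration and do not come from the report in the question.

```python
# Hypothetical records and valid ranges, for illustration only.
records = [
    {"age": 34, "score": 88},
    {"age": 210, "score": 75},   # age present but outside the valid range
    {"age": None, "score": 91},  # age missing
    {"age": 51, "score": 140},   # score present but outside the valid range
]
VALID_RANGES = {"age": (0, 120), "score": (0, 100)}


def completeness(rows, fields):
    """Share of field values that are present (not None)."""
    total = len(rows) * len(fields)
    present = sum(1 for r in rows for f in fields if r[f] is not None)
    return present / total


def range_validity(rows, ranges):
    """Share of present values that fall inside their valid range."""
    present = [(f, r[f]) for r in rows for f in ranges if r[f] is not None]
    valid = sum(1 for f, v in present if ranges[f][0] <= v <= ranges[f][1])
    return valid / len(present)


print(completeness(records, list(VALID_RANGES)))
print(range_validity(records, VALID_RANGES))
```

The two measures move independently: a dataset can score high on the first and still fail the second, which is the situation the question describes.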

78
Multi-Selecthard

A data analyst is evaluating data quality issues in a customer database. Which TWO actions are best practices for ensuring data consistency?

Select 2 answers
A.Allowing null values for foreign keys
B.Standardizing date formats across all tables
C.Implementing referential integrity constraints
D.Enabling cascading updates on primary keys
E.Using data profiling to identify duplicate records
AnswersB, C

Correct: Uniform formats ensure consistency in temporal data.

Why this answer

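Standardizing date formats across all tables (Option B) ensures that date values are stored and interpreted uniformly, eliminating inconsistencies that arise from mixed formats (e.g., MM/DD/YYYY vs. DD-MM-YY). This practice directly supports data consistency by enforcing a single representation, which is critical for accurate querying, reporting, and integration across systems. Implementing referential integrity constraints (Option C) complements this by ensuring that every foreign key value refers to an existing row in the parent table, so related tables cannot drift out of agreement.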

Exam trap

CompTIA often tests the distinction between data quality dimensions (e.g., consistency vs. accuracy), leading candidates to confuse data profiling (which identifies duplicates) with a direct method for enforcing consistency.

79
MCQhard

Refer to the exhibit. Which conclusion can be drawn from this data quality report?

A.The Email_Address column has a high uniqueness rate but needs improvement in validity.
B.The column is fully consistent but has low completeness.
C.The column has low validity and low uniqueness.
D.The column requires immediate action to improve completeness.
AnswerA

Uniqueness is 97%, but validity is only 85%, meaning some emails may be in invalid format.

Why this answer

Option A is correct because the data quality report shows that the Email_Address column has a high uniqueness rate (97%), indicating few duplicate entries, but a lower validity score (85%), meaning some entries fail format checks (for example, a missing '@' or domain). The column is largely free of duplicates but contains invalid data, so it needs improvement in validity.

Exam trap

CompTIA often tests the distinction between uniqueness and validity, trapping candidates who assume high uniqueness implies high quality, when in fact validity is a separate dimension that can be poor even with perfect uniqueness.

How to eliminate wrong answers

Option B is wrong because the report indicates low validity, not full consistency; consistency refers to adherence to a standard format, which is violated here. Option C is wrong because the report shows high uniqueness (not low uniqueness), so the claim of 'low uniqueness' is factually incorrect. Option D is wrong because completeness (non-null values) appears high or acceptable; the issue is with validity, not missing data.
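
As a rough illustration of how uniqueness and validity are measured separately, here is a minimal Python sketch. The sample addresses and the simplified format pattern are assumptions, not the exhibit's data, and the pattern is far looser than real email validation.

```python
import re

# Simplified format check (assumption): something@something.something
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Made-up sample column values.
emails = [
    "ann@example.com",
    "bob@example.com",
    "carol.example.com",  # missing '@'
    "dave@",              # missing domain
    "erin@example.org",
]

# Uniqueness: distinct values as a share of all values.
uniqueness = len(set(emails)) / len(emails)

# Validity: values that pass the format check as a share of all values.
validity = sum(1 for e in emails if EMAIL_PATTERN.fullmatch(e)) / len(emails)

print(uniqueness, validity)
```

Every value in this sample is distinct, yet two of the five fail the format check, so uniqueness is high while validity is not.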

80
MCQmedium

Refer to the exhibit. Which type of data is the field "region"?

A.Qualitative
B.Continuous
C.Quantitative
D.Discrete
AnswerA

Correct. Region is a descriptive category.

Why this answer

The field 'region' contains categorical labels (e.g., 'North', 'South', 'East', 'West') that represent distinct groups or categories, not numerical measurements. Qualitative data (also called categorical data) describes attributes or characteristics that are named or labeled rather than measured on a numeric scale. Since 'region' assigns a name to a geographic area without any inherent numeric value, it is a classic example of qualitative data.

Exam trap

The trap here is that candidates may confuse 'region' with a numeric code (e.g., region ID 1, 2, 3) and incorrectly classify it as discrete quantitative data, but the field 'region' as shown contains text labels, making it qualitative.

How to eliminate wrong answers

Option B is wrong because continuous data represents measurements that can take any value within a range (e.g., temperature, time), but 'region' consists of discrete labels with no numeric continuum. Option C is wrong because quantitative data involves numerical values that can be counted or measured (e.g., sales amount, age), whereas 'region' is a non-numeric category. Option D is wrong because discrete data is a subset of quantitative data that takes countable integer values (e.g., number of customers), but 'region' is not numeric at all.

81
Multi-Selectmedium

Which TWO of the following are examples of semi-structured data?

Select 2 answers
A.XML document
B.JSON object
C.Relational table
D.Plain text file
E.CSV file
AnswersA, B

XML uses tags and has a flexible schema, so it is semi-structured.

Why this answer

XML and JSON have tags/keys but no rigid schema, making them semi-structured. CSV is structured, relational tables are structured, plain text is unstructured.

82
MCQeasy

A hospital wants to analyze patient readmission rates. The data contains daily patient visits. What is the level of granularity?

A.Patient
B.Visit
C.Day
D.Hospital
AnswerB

Correct. Each record captures one visit.

Why this answer

The level of granularity refers to the finest detail captured in the dataset. Since the data contains daily patient visits, each record represents a single visit event, not the patient or the day itself. Therefore, 'Visit' is the correct granularity because each row corresponds to one visit occurrence.

Exam trap

The trap here is confusing the subject of analysis (patient readmission rates) with the actual data granularity (each row is a visit), leading candidates to incorrectly select 'Patient' instead of 'Visit'.

How to eliminate wrong answers

Option A is wrong because 'Patient' would be the granularity if the data summarized all visits per patient (e.g., one row per patient with aggregated readmission counts), but here each visit is a separate record. Option C is wrong because 'Day' would be the granularity if the data aggregated all visits per day (e.g., total visits per day), but the data contains individual visit records, not daily summaries. Option D is wrong because 'Hospital' would be the granularity if the data aggregated across the entire hospital (e.g., total readmission rate for the hospital), but the data is at the individual visit level.
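
The relationship between visit-level rows and the coarser grains named in the wrong options can be sketched in Python. The visit records below are hypothetical.

```python
from collections import Counter

# Hypothetical visit-level data: one tuple per visit (the finest grain).
visits = [
    ("V1", "P1", "2024-03-01"),
    ("V2", "P2", "2024-03-01"),
    ("V3", "P1", "2024-03-09"),
    ("V4", "P3", "2024-03-09"),
]

# Rolling up to patient grain: one entry per patient.
per_patient = Counter(patient_id for _, patient_id, _ in visits)

# Rolling up to day grain: one entry per calendar day.
per_day = Counter(visit_date for _, _, visit_date in visits)

print(len(visits), dict(per_patient), dict(per_day))
```

Coarser grains can be derived from visit-level rows by aggregation, but visit-level detail cannot be recovered from patient-level or day-level summaries.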

83
MCQmedium

A retail company analyzes customer purchase data to improve inventory management. They store daily transaction records in a relational database and monthly aggregate reports in a data warehouse. Which difference between these storage methods best explains why the warehouse is more suitable for trend analysis?

A.The database uses a star schema while the warehouse uses a normalized schema.
B.The database enforces ACID transactions, while the warehouse uses eventual consistency.
C.The database is optimized for write-heavy OLTP, while the warehouse is optimized for read-heavy OLAP.
D.The database stores only current data, while the warehouse stores historical data.
AnswerC

Correct: OLTP supports many writes; OLAP supports complex reads.

Why this answer

Option C is correct because OLTP databases are optimized for high-frequency write operations (INSERT/UPDATE/DELETE) and ACID compliance, making them ideal for transaction processing but poor for complex analytical queries. In contrast, a data warehouse is optimized for read-heavy OLAP workloads, using columnar storage, pre-aggregated tables, and indexing strategies that enable fast aggregation and trend analysis over large historical datasets. This architectural difference directly supports the retail company's need to analyze purchase trends over time.

Exam trap

CompTIA often tests the misconception that 'data warehouses only store historical data' (Option D) as the primary reason for trend analysis suitability, but the real differentiator is the workload optimization (OLTP vs. OLAP), not merely the presence of history.

How to eliminate wrong answers

Option A is wrong because a star schema (with fact and dimension tables) is actually typical of data warehouses for analytical queries, while OLTP databases usually use normalized schemas to reduce redundancy and maintain data integrity. Option B is wrong because data warehouses often support ACID or snapshot isolation for consistency, and eventual consistency is more characteristic of NoSQL systems, not traditional data warehouses. Option D is wrong because relational databases can store historical data as well; the key difference is not the presence of history but the optimization for read-heavy analytical queries versus write-heavy transactional processing.

84
MCQeasy

A data analyst is creating a report for a marketing campaign. The campaign data includes customer names, email addresses, and purchase history. Which of the following best describes the 'customer name' data type?

A.Nominal
B.Quantitative
C.Ordinal
D.Discrete
AnswerA

Nominal is categorical without order.

Why this answer

Customer names are categorical labels that identify individuals without any inherent order or numerical value. This fits the definition of nominal data, which is used for naming or classifying variables. In data analysis, nominal data can be stored as strings and used for grouping or filtering, but arithmetic operations are meaningless.

Exam trap

CompTIA often tests the distinction between nominal and ordinal data by presenting a label that could be mistaken for having an order (e.g., 'customer name' might be confused with 'rank' or 'tier'), but the trap here is that names are purely categorical with no intrinsic ranking.

How to eliminate wrong answers

Option B is wrong because quantitative data represents numerical measurements or counts (e.g., purchase amount), not text labels like names. Option C is wrong because ordinal data has a meaningful order or rank (e.g., customer satisfaction rating), but customer names have no inherent sequence. Option D is wrong because discrete data consists of countable numerical values (e.g., number of purchases), whereas customer names are non-numeric categories.

85
Multi-Selecthard

Which TWO of the following are primary benefits of implementing a data governance program?

Select 2 answers
A.Faster data processing speed
B.Increased data volume
C.Improved data quality and consistency
D.Lower storage costs
E.Reduced data redundancy
AnswersC, E

Governance establishes standards that enhance quality and consistency.

Why this answer

Option C is correct because a primary benefit of a data governance program is improved data quality and consistency. Data governance establishes policies, standards, and procedures for data management, ensuring that data is accurate, complete, and reliable across the organization. This directly enhances decision-making and operational efficiency by reducing errors and inconsistencies in data assets.

Exam trap

The trap here is that candidates may confuse data governance with data management or data engineering tasks, mistakenly thinking it directly improves performance or reduces costs, when its core value is in quality, consistency, and compliance.

86
MCQeasy

Refer to the exhibit. Which data quality dimension is compromised by the missing value for Charlie's salary?

A.Uniqueness
B.Completeness
C.Timeliness
D.Accuracy
AnswerB

Correct. The salary field is missing, so data is incomplete.

Why this answer

Completeness measures whether all required data is present. Charlie's missing salary value means the record is incomplete, directly violating this dimension. In data quality frameworks, completeness is assessed by the proportion of non-null values in a field, and a null salary here fails that check.

Exam trap

CompTIA often tests the distinction between 'missing' (completeness) and 'wrong' (accuracy), leading candidates to confuse a null value with an incorrect value.

How to eliminate wrong answers

Option A is wrong because uniqueness refers to the absence of duplicate records or values, not missing data; a missing salary does not create a duplicate. Option C is wrong because timeliness concerns whether data is up-to-date or available when needed, not whether a value is present or absent. Option D is wrong because accuracy measures correctness of values against a reference source; a missing value is not an inaccurate value—it is an absent one.

87
Multi-Selectmedium

Which THREE of the following are common characteristics of unstructured data?

Select 3 answers
A.Easily queried using SQL
B.Often stored in NoSQL databases or data lakes
C.Can include text, images, and video
D.Stored in relational tables
E.Lacks a predefined schema
AnswersB, C, E

NoSQL and data lakes are designed to store unstructured data at scale.

Why this answer

Options B, C, and E are correct. Unstructured data lacks a predefined schema, can include various media types, and is often stored in NoSQL databases or data lakes. A is wrong because SQL queries are not designed for unstructured data.

D is wrong because unstructured data is not stored in relational tables.

88
MCQhard

An organization has multiple systems that store customer information inconsistently. To create a single authoritative view of customer data, they implement a process that identifies and merges duplicate records. This is an example of which data management discipline?

A.Data governance
B.Data warehousing
C.Data quality
D.Master Data Management (MDM)
AnswerD

MDM focuses on creating a single, consistent view of master data entities like customers.

Why this answer

Option D is correct because Master Data Management (MDM) creates a golden record by merging duplicates. Option A is wrong because data governance provides oversight but does not directly merge records. Option B is wrong because data warehousing consolidates data for reporting.

Option C is wrong because data quality is a component but not the specific discipline for master data.

89
Multi-Selecteasy

Which TWO of the following are examples of data transformation? (Choose TWO.)

Select 2 answers
A.Normalizing data to eliminate redundancy
B.Creating a backup of the database
C.Converting string dates to date format
D.Generating summary statistics
E.Removing duplicate records
AnswersA, C

Normalization is a transformation.

Why this answer

Option A is correct because data normalization is a transformation process that reorganizes data to reduce redundancy and improve integrity, typically by decomposing tables into smaller, related tables (e.g., achieving 3NF in relational databases). This changes the structure and representation of the data, which is a core example of data transformation. Option C is correct for the same reason: converting string dates to a date format changes the data type and representation of the values without changing what they mean.

Exam trap

CompTIA often tests the distinction between data transformation (changing format/structure) and data cleansing (removing errors/duplicates) or data analysis (generating summaries), leading candidates to mistakenly select removal of duplicates or summary statistics as transformations.
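
Converting string dates to a date type, one of the transformations named in this question, might look like the following Python sketch. The MM/DD/YYYY input format and the sample values are assumptions.

```python
from datetime import date, datetime


def parse_us_date(text):
    """Convert an MM/DD/YYYY string into a datetime.date."""
    return datetime.strptime(text, "%m/%d/%Y").date()


raw = ["03/15/2024", "12/01/2023"]
parsed = [parse_us_date(s) for s in raw]

# As strings, "03/15/2024" sorts before "12/01/2023";
# as dates, December 2023 correctly comes first.
print(sorted(raw))
print(sorted(parsed))
```

Once the values are real dates, sorting and date arithmetic behave chronologically instead of alphabetically.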

90
MCQmedium

A data analyst is working with a dataset that contains customer names and addresses. Some records have missing state codes. Which data quality issue is this?

A.Duplication
B.Incompleteness
C.Outliers
D.Inconsistency
AnswerB

Missing state codes make the record incomplete.

Why this answer

Incompleteness is the correct answer because missing state codes in customer address records represent a lack of required data. This is a classic example of incomplete data, where fields that should contain values are left null or blank, reducing the dataset's usability for analysis.

Exam trap

The trap here is that candidates may confuse incompleteness with inconsistency, but incompleteness is about missing data (nulls), while inconsistency is about contradictory data across records.

How to eliminate wrong answers

Option A is wrong because duplication refers to duplicate records (e.g., same customer appearing multiple times), not missing values. Option C is wrong because outliers are data points that deviate significantly from the norm (e.g., an unusually high age), not absent data. Option D is wrong because inconsistency involves contradictory or conflicting data (e.g., same customer with different state codes in different records), not missing values.

91
MCQmedium

A telecommunications company is experiencing issues with its customer satisfaction survey data. The data is collected from multiple channels: phone, email, and web forms. Each channel uses a different scale for ratings: phone uses 1-10, email uses 1-5, and web uses 1-7. Additionally, some survey responses contain missing values for demographic fields. The data analyst needs to calculate an overall satisfaction score that is comparable across all channels. The company's leadership wants a single metric that minimizes distortion from the different scales. Which approach should the analyst use to standardize the ratings?

A.Normalize each rating to a 0-100 scale using min-max normalization.
B.Calculate the average rating separately for each channel and then compare the averages.
C.Convert all ratings to a binary metric of satisfied (above midpoint) or unsatisfied.
D.Convert all ratings to a 1-10 scale by multiplying email ratings by 2 and web by 1.43.
AnswerA

Correct: Min-max normalization maps each scale to a common range, preserving relative differences.

Why this answer

Option A is correct because min-max normalization rescales each rating to a common 0-100 range using the formula (x - min) / (max - min) * 100. This preserves the relative distribution of responses within each channel while eliminating the effect of different scale lengths, making the scores directly comparable. It minimizes distortion better than simple multiplication or binary conversion, as it accounts for the full range of each original scale.

Exam trap

The trap here is that candidates may think simple multiplication (Option D) is sufficient for scale conversion. Multiplying stretches the top of each scale to 10 but leaves the bottoms of the scales misaligned, whereas min-max normalization maps both endpoints of every scale onto the same range and is the proper technique for creating a comparable metric across different scales.

How to eliminate wrong answers

Option B is wrong because calculating separate averages per channel does not standardize the ratings; it only produces channel-specific means that remain on different scales, making direct comparison invalid. Option C is wrong because converting to a binary satisfied/unsatisfied metric discards granularity and loses information about the degree of satisfaction, which can distort the overall score and reduce statistical power. Option D is wrong because multiplying email ratings by 2 and web ratings by 1.43 aligns only the top of each scale with 10. All three scales start at 1, so the lowest possible rating becomes 2 for email and 1.43 for web while staying 1 for phone, which systematically inflates the converted email and web scores relative to phone ratings.
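
The min-max formula (x - min) / (max - min) * 100 can be written as a short Python function. The function name and the sample ratings are illustrative. The scale bounds come from the question (phone 1-10, email 1-5, web 1-7).

```python
def min_max_to_100(x, lo, hi):
    """Rescale x from the range [lo, hi] to the range [0, 100]."""
    return (x - lo) / (hi - lo) * 100


# The midpoint of each scale lands on the same normalized score.
print(min_max_to_100(3, 1, 5))    # email midpoint
print(min_max_to_100(4, 1, 7))    # web midpoint
print(min_max_to_100(5.5, 1, 10)) # phone midpoint

# The lowest rating on every scale maps to 0, the highest to 100.
print(min_max_to_100(1, 1, 5), min_max_to_100(10, 1, 10))
```

By contrast, simple multiplication would turn the lowest email rating (1) into 2, so the lowest scores on different channels would no longer agree.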

92
Multi-Selectmedium

Which TWO of the following are examples of quantitative data? (Choose TWO.)

Select 2 answers
A.Product color
B.Age in years
C.Customer satisfaction rating (Poor, Fair, Good)
D.Country of origin
E.Shoe size
AnswersB, E

Age is a numeric, quantitative variable.

Why this answer

Option B is correct because age in years is a numerical measurement that can be counted or measured on a ratio scale, making it quantitative data. Quantitative data represents quantities that can be expressed numerically and subjected to mathematical operations, such as calculating the average age of a group.

Exam trap

The trap here is that candidates often confuse ordinal data (like customer satisfaction ratings) with quantitative data because the categories have an order, but they are still qualitative since the values are not numeric measurements.

93
MCQhard

A company is designing a data lake to store raw sensor data from IoT devices. The data arrives as JSON objects with varying schemas. Which storage approach is most appropriate?

A.Ingest into a relational database with a predefined schema
B.Store each JSON object as a separate file in a compressed columnar format
C.Convert all JSON to Avro with a fixed schema before storing
D.Store raw JSON files in a distributed file system and apply schema-on-read
AnswerD

Schema-on-read allows handling varying schemas without upfront transformation.

Why this answer

Option D is correct because a data lake is designed to store raw data in its native format, and IoT sensor data with varying schemas is best handled by storing raw JSON files in a distributed file system (e.g., HDFS or Amazon S3). This approach leverages schema-on-read, where the schema is applied at query time rather than at write time, allowing flexibility for heterogeneous JSON objects without data loss or transformation overhead.

Exam trap

The trap here is that candidates confuse 'schema-on-read' with 'schema-on-write' and assume that converting to a structured format like Avro or columnar storage is always better for performance, ignoring the requirement to store raw, varying-schema data as-is.

How to eliminate wrong answers

Option A is wrong because relational databases require a predefined schema and enforce ACID constraints, which cannot accommodate JSON objects with varying schemas without costly schema migrations or data loss. Option B is wrong because storing each JSON object as a separate file in a compressed columnar format (e.g., Parquet or ORC) is inefficient for small, variable-schema records; columnar formats are optimized for analytical queries on large, homogeneous datasets, not for raw ingestion of many small, schema-varying JSON objects. Option C is wrong because converting all JSON to Avro with a fixed schema before storing defeats the purpose of a data lake, which is to preserve raw data; Avro requires a predefined schema at write time, and forcing a fixed schema on varying JSON objects would either lose data or require complex schema evolution management.
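
Schema-on-read can be sketched in a few lines of Python: the raw JSON is kept exactly as received, and a schema (here just a list of field names) is applied only when the data is read. The device names, fields, and the helper read_with_schema are hypothetical. A real data lake would read files from distributed storage rather than an in-memory list.

```python
import json

# Raw records stored as-is; each one has a different set of fields.
raw_lines = [
    '{"device": "t-01", "temp_c": 21.5}',
    '{"device": "h-07", "humidity": 40, "battery": 0.82}',
    '{"device": "t-02", "temp_c": 19.0, "firmware": "1.4"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields at read time."""
    for line in lines:
        record = json.loads(line)
        # Fields a record does not carry come back as None.
        yield {f: record.get(f) for f in fields}


temps = list(read_with_schema(raw_lines, ["device", "temp_c"]))
print(temps)
```

A different query could apply a different field list to the same raw lines, which is the flexibility that a fixed write-time schema would give up.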

94
MCQmedium

A data engineer needs to store logs from web servers that have varying fields. The logs are in JSON format. Which data type describes this JSON data?

A.Binary data
B.Structured data
C.Semi-structured data
D.Unstructured data
AnswerC

JSON allows schema flexibility with key-value pairs, fitting the semi-structured definition.

Why this answer

JSON data with varying fields is classified as semi-structured data because it has organizational properties (key-value pairs, nested structures) but does not conform to a rigid schema like a relational table. The logs from web servers may have different fields per record, which is a hallmark of semi-structured data, as it allows flexibility while still being self-describing.

Exam trap

The trap here is that candidates confuse 'structured' with any data that has a format, but JSON's lack of a fixed schema and varying fields disqualifies it from being structured data, which requires a rigid, predefined schema like a relational database table.

How to eliminate wrong answers

Option A is wrong because binary data refers to raw bytes or encoded formats (e.g., images, executables) that lack any inherent structure or human-readable format, whereas JSON is text-based and has explicit key-value organization. Option B is wrong because structured data requires a fixed schema with predefined fields and data types (e.g., rows in a SQL table), but JSON logs with varying fields violate this strict schema requirement. Option D is wrong because unstructured data has no predefined format or organization (e.g., plain text, video files), while JSON has a defined syntax with keys, values, and nesting, providing a clear structure.

95
MCQhard

A database table has columns: OrderID (primary key), ProductID, CustomerID, CustomerName, OrderDate, ProductName. All products are purchased only by the customer who placed the order. Which normal form violation exists if CustomerName depends on CustomerID?

A.Boyce-Codd normal form (BCNF)
B.Third normal form (3NF)
C.Second normal form (2NF)
D.First normal form (1NF)
AnswerB

CustomerName depends on CustomerID, which is not a candidate key, creating a transitive dependency and violating 3NF.

Why this answer

The table violates Third Normal Form (3NF) because CustomerName depends on CustomerID, which is not a candidate key (the primary key is OrderID). 3NF requires that every non-key attribute be non-transitively dependent on the primary key; here, CustomerName is transitively dependent on OrderID via CustomerID. Since CustomerID is a non-key attribute (it is not part of the primary key), this transitive dependency breaks 3NF.

Exam trap

The trap here is that candidates often confuse transitive dependencies (3NF violation) with partial dependencies (2NF violation) or think that any dependency on a non-key attribute automatically violates BCNF, but the specific scenario of CustomerName depending on CustomerID is a textbook transitive dependency that breaks 3NF first.

How to eliminate wrong answers

Option A is wrong because Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF that requires every determinant to be a candidate key; while this table also violates BCNF, the question asks which normal form violation exists, and the dependency described is a classic 3NF violation (transitive dependency), not a BCNF-specific one. Option C is wrong because Second Normal Form (2NF) is violated only when a non-key attribute depends on a proper subset of a composite primary key; here the primary key is a single column (OrderID), so no partial dependency exists, and 2NF is satisfied. Option D is wrong because First Normal Form (1NF) is violated only if there are repeating groups or non-atomic values; the table as described has atomic columns and no repeating groups, so 1NF is satisfied.
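
One common way to remove the transitive dependency is to move CustomerName into its own table keyed by CustomerID. The sketch below uses Python's sqlite3 module. The Customers table name and the sample rows are assumptions, and the product columns are left out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- CustomerName now depends only on the key of its own table.
    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL
    );
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers (CustomerID),
        OrderDate  TEXT NOT NULL
    );
    INSERT INTO Customers VALUES (1, 'Acme Ltd');
    INSERT INTO Orders VALUES (100, 1, '2024-05-01'), (101, 1, '2024-05-03');
    """
)

# A name change is now a single-row update, not one update per order.
conn.execute(
    "UPDATE Customers SET CustomerName = 'Acme Limited' WHERE CustomerID = 1"
)

rows = conn.execute(
    "SELECT o.OrderID, c.CustomerName "
    "FROM Orders o JOIN Customers c ON c.CustomerID = o.CustomerID "
    "ORDER BY o.OrderID"
).fetchall()
print(rows)
```

Because the name is stored once, every order sees the updated value through the join, and the update anomaly that the transitive dependency allowed cannot occur.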

96
MCQmedium

A data team is building a predictive model. They have data on 'Number of employees' (whole numbers) and 'Revenue' (currency). Which statement correctly compares these data types?

A.Number of employees is discrete; revenue is continuous
B.Both are continuous data
C.Both are ratio data
D.Number of employees is qualitative; revenue is quantitative
AnswerA

Employees are counted in whole units (discrete), while revenue can have fractional values (continuous).

Why this answer

Option A is correct because 'Number of employees' is a count of distinct entities, making it discrete data (only whole numbers), while 'Revenue' can take any value within a range (including decimals), making it continuous data. Discrete data arises from counting, whereas continuous data arises from measurement.

Exam trap

The trap here is that candidates confuse the measurement scale (ratio) with the data type (discrete vs. continuous), leading them to pick option C even though the question specifically asks about data type classification.

How to eliminate wrong answers

Option B is wrong because 'Number of employees' is not continuous; it is discrete as it can only take integer values (e.g., 10, 11, not 10.5). Option C is wrong because while both are ratio data (they have a true zero point), the question asks about data types (discrete vs. continuous), not measurement scales. Option D is wrong because both 'Number of employees' and 'Revenue' are quantitative (numerical) data, not qualitative (categorical).

97
MCQhard

A data modeler is designing a dimensional model for a sales analytics system. The fact table contains sales transactions, and the dimension tables include product, customer, and time. To reduce data redundancy, the modeler normalizes the dimension tables into multiple related tables. Which schema is being implemented?

A.Vault schema
B.Star schema
C.Galaxy schema
D.Snowflake schema
AnswerD

Snowflake schema normalizes dimension tables to reduce redundancy.

Why this answer

The snowflake schema is a dimensional model where dimension tables are normalized into multiple related tables to reduce data redundancy. In this scenario, the product, customer, and time dimensions are split into sub-dimensions (e.g., product category, customer geography, time hierarchy), which is the defining characteristic of a snowflake schema. This contrasts with a star schema where dimensions remain denormalized.

Exam trap

CompTIA often tests the distinction between star and snowflake schemas by emphasizing normalization of dimensions; the trap here is that candidates may confuse 'normalized dimensions' with a star schema, which actually uses denormalized dimensions for simplicity and performance.

How to eliminate wrong answers

Option A is wrong because a vault schema (Data Vault) is a hybrid modeling approach focused on auditability and flexibility using hubs, links, and satellites, not on normalizing dimension tables for a sales analytics fact table. Option B is wrong because a star schema keeps dimension tables denormalized (single table per dimension) to optimize query performance, which directly contradicts the normalization described in the question. Option C is wrong because a galaxy schema (also called a fact constellation) contains multiple fact tables sharing dimension tables, not the normalization of a single fact table’s dimensions.
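
A snowflaked product dimension might be sketched as follows with Python's sqlite3 module. All table names, column names, and rows are hypothetical. In a star schema, category_name would be stored directly on the product dimension instead of in a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Sub-dimension split out of the product dimension (the snowflake step).
    CREATE TABLE dim_category (
        category_id   INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES dim_category (category_id)
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product (product_id),
        amount     REAL
    );
    INSERT INTO dim_category VALUES (1, 'Audio'), (2, 'Video');
    INSERT INTO dim_product VALUES
        (10, 'Headphones', 1), (11, 'Speaker', 1), (12, 'Monitor', 2);
    INSERT INTO fact_sales VALUES
        (1, 10, 50.0), (2, 11, 80.0), (3, 12, 200.0), (4, 10, 50.0);
    """
)

# Reaching the category requires two joins instead of one.
totals = conn.execute(
    "SELECT c.category_name, SUM(f.amount) "
    "FROM fact_sales f "
    "JOIN dim_product p ON p.product_id = f.product_id "
    "JOIN dim_category c ON c.category_id = p.category_id "
    "GROUP BY c.category_name ORDER BY c.category_name"
).fetchall()
print(totals)
```

Each category name is stored once, which reduces redundancy, at the cost of an extra join per normalized level.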

98
MCQeasy

A healthcare database stores patient records. Each patient has a unique patient_id, and the database includes a table 'visits' with visit_id, patient_id, visit_date, and diagnosis_code. To ensure data integrity, which constraint should be applied to the patient_id column in the 'visits' table?

A.Unique constraint
B.Foreign key
C.Primary key
D.Check constraint
AnswerB

Foreign key enforces referential integrity.

Why this answer

Option B is correct because a foreign key constraint ensures that patient_id in visits references a valid patient_id in the patient table. Option A is wrong because a unique constraint prevents duplicates, but a patient can have many visits. Option C is wrong because a primary key ensures uniqueness in its own table; in 'visits' that role belongs to visit_id.

Option D is wrong because a check constraint validates values based on a condition.
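
A minimal sketch of the foreign key in action, using Python's sqlite3 module. The patients table name and the sample rows are assumptions. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON; most other engines enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific opt-in
conn.executescript(
    """
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY
    );
    CREATE TABLE visits (
        visit_id       INTEGER PRIMARY KEY,
        patient_id     INTEGER NOT NULL REFERENCES patients (patient_id),
        visit_date     TEXT,
        diagnosis_code TEXT
    );
    INSERT INTO patients VALUES (1);
    INSERT INTO visits VALUES (500, 1, '2024-02-10', 'J10');
    """
)

# Patient 99 does not exist, so this visit should be rejected.
try:
    conn.execute("INSERT INTO visits VALUES (501, 99, '2024-02-11', 'J11')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

visit_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(rejected, visit_count)
```

The visit for the existing patient is stored, while the visit pointing at a nonexistent patient raises an integrity error and is not stored.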

99
Multi-Selecthard

Which THREE of the following are valid data quality dimensions? (Choose THREE.)

Select 3 answers
A.Encryption
B.Redundancy
C.Completeness
D.Timeliness
E.Accuracy
AnswersC, D, E

Completeness is a data quality dimension.

Why this answer

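Completeness is a core data quality dimension that measures whether all required data is present. In the context of the DA0-001 exam, completeness ensures that no fields or records are missing, which is fundamental for reliable analysis and reporting. Timeliness (data is available when it is needed) and accuracy (values correctly reflect the real world) are likewise standard dimensions, whereas encryption is a security control and redundancy is a storage characteristic.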

Exam trap

CompTIA often tests the distinction between data quality dimensions and data management techniques, so candidates may mistakenly select encryption or redundancy because they sound like important data concepts, but they are not part of the standard quality dimensions.

100
MCQmedium

A manufacturing company has two primary data systems: an ERP system that stores production orders with fields like OrderID, ProductID, Quantity, and ProductionDate, and a CRM system that stores customer sales with fields like SaleID, CustomerID, ProductID, SaleDate, and Amount. The data analyst needs to create a unified view of product performance by joining these tables. However, the ProductID field in the ERP uses a 5-character alphanumeric code (e.g., 'P1234'), while the CRM uses a 6-character code (e.g., 'PR1234'). Additionally, some products have multiple entries due to slight variations in naming. The analyst wants to ensure accurate matching without losing data. Which action should the analyst take first to address the data inconsistency?

A.Create a mapping table that standardizes ProductID formats between ERP and CRM.
B.Perform data profiling to identify all unique ProductID values and their frequencies.
C.Aggregate data by product name and ignore ProductID mismatches.
D.Use a fuzzy matching algorithm to join on similar ProductID strings.
AnswerA

Correct: Standardization of keys is necessary before joining.

Why this answer

Option A is correct because creating a mapping table allows the analyst to explicitly define the relationship between the 5-character ERP ProductID and the 6-character CRM ProductID, ensuring accurate joins without data loss. This approach standardizes the inconsistent formats and handles variations by providing a controlled, deterministic lookup, which is essential for maintaining referential integrity in a unified view.

Exam trap

The trap here is that candidates may choose fuzzy matching (Option D) thinking it handles all variations, but CompTIA often tests the principle that deterministic mapping is preferred over probabilistic methods when the inconsistency is systematic and can be resolved with a known transformation.

How to eliminate wrong answers

Option B is wrong because data profiling only identifies the unique values and their frequencies but does not resolve the format mismatch; it merely highlights the problem without providing a mechanism to align the keys for joining. Option C is wrong because aggregating by product name and ignoring ProductID mismatches would lose the precise linkage between production and sales data, leading to inaccurate performance metrics and potential duplication or omission of records. Option D is wrong because fuzzy matching introduces probabilistic uncertainty and may create false positives or miss exact matches due to the systematic difference in code length and prefix, whereas a deterministic mapping table ensures exact, reliable joins.
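
As a rough illustration of the mapping-table approach, the sketch below joins ERP and CRM records through an explicit lookup. The mapping entries, sample rows, and the names product_id_map, unified, and unmapped are hypothetical; in practice the mapping would be a maintained reference table.

```python
# Hypothetical mapping table: ERP ProductID -> CRM ProductID.
product_id_map = {"P1234": "PR1234", "P5678": "PR5678"}

erp_orders = [
    {"OrderID": 1, "ProductID": "P1234", "Quantity": 50},
    {"OrderID": 2, "ProductID": "P5678", "Quantity": 20},
    {"OrderID": 3, "ProductID": "P9999", "Quantity": 5},  # no mapping defined yet
]
crm_sales = [
    {"SaleID": 10, "ProductID": "PR1234", "Amount": 400.0},
    {"SaleID": 11, "ProductID": "PR1234", "Amount": 150.0},
]

# Total sales amount per CRM ProductID.
sales_by_product = {}
for sale in crm_sales:
    pid = sale["ProductID"]
    sales_by_product[pid] = sales_by_product.get(pid, 0.0) + sale["Amount"]

unified = []   # one row per ERP order with its matched sales total
unmapped = []  # ERP codes with no mapping, kept for review rather than dropped
for order in erp_orders:
    crm_id = product_id_map.get(order["ProductID"])
    if crm_id is None:
        unmapped.append(order["ProductID"])
        continue
    unified.append(
        {
            "ErpProductID": order["ProductID"],
            "CrmProductID": crm_id,
            "Quantity": order["Quantity"],
            "TotalSales": sales_by_product.get(crm_id, 0.0),
        }
    )
```

Codes without a mapping are collected instead of silently discarded, which matches the requirement to match accurately without losing data.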

101
Multi-Selectmedium

Which THREE of the following are characteristics of a relational database?

Select 3 answers
A.Enforces referential integrity through foreign keys
B.Stores data in key-value pairs
C.Supports NoSQL document storage
D.Uses Structured Query Language (SQL) for data manipulation
E.Data is organized into tables with rows and columns
AnswersA, D, E

Referential integrity ensures relationships.

Why this answer

Option A is correct because relational databases enforce referential integrity through foreign keys, which ensure that relationships between tables remain consistent. A foreign key in a child table must match a primary key value in the parent table, preventing orphaned records and maintaining data integrity. Options D and E are correct because relational databases organize data into tables of rows and columns and use SQL to query and manipulate it.

Exam trap

The trap here is that candidates may confuse key-value stores or document databases with relational databases, especially when they hear terms like 'keys' or 'documents' in other contexts, but relational databases strictly use tables, rows, columns, and SQL.

102
MCQhard

A table Orders has OrderID (primary key), CustomerID, and CustomerEmail. During analysis, it is found that CustomerID uniquely identifies CustomerEmail. Which normal form is violated if both CustomerID and CustomerEmail are stored in this table?

A.Second normal form (2NF)
B.Third normal form (3NF)
C.No violation
D.First normal form (1NF)
AnswerB

CustomerEmail depends on CustomerID, which is a non-key attribute, creating a transitive dependency violating 3NF.

Why this answer

The table violates Third Normal Form (3NF) because CustomerEmail is transitively dependent on CustomerID, which is not a candidate key. In 3NF, every non-key attribute must depend only on the primary key (OrderID), not on another non-key attribute. Since CustomerID uniquely identifies CustomerEmail, CustomerEmail depends on CustomerID, not directly on OrderID, creating a transitive dependency.

Exam trap

The trap here is that candidates often confuse transitive dependencies with partial dependencies, mistakenly thinking that because CustomerID is not part of the primary key, the violation is 2NF rather than 3NF.

How to eliminate wrong answers

Option A is wrong because Second Normal Form (2NF) requires that all non-key attributes are fully functionally dependent on the entire primary key; here, the primary key is a single column (OrderID), so there is no partial dependency, and 2NF is satisfied. Option C is wrong because a violation does exist — the transitive dependency between CustomerID and CustomerEmail breaks 3NF. Option D is wrong because First Normal Form (1NF) is not violated; the table has atomic values and a primary key, so it meets 1NF requirements.
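
One way to picture the fix is to move CustomerEmail into its own customer table. The sketch below is only an illustration with made-up rows; the names customers, orders, and email_for_order are assumptions, not part of the question.

```python
# Unnormalized rows: (OrderID, CustomerID, CustomerEmail).
orders_unnormalized = [
    (1, "C01", "ana@example.com"),
    (2, "C02", "ben@example.com"),
    (3, "C01", "ana@example.com"),  # email repeated for the same customer
]

customers = {}  # CustomerID -> CustomerEmail, stored once per customer
orders = {}     # OrderID -> CustomerID only

for order_id, customer_id, email in orders_unnormalized:
    # Guard: the dependency CustomerID -> CustomerEmail must actually hold.
    if customers.setdefault(customer_id, email) != email:
        raise ValueError(f"conflicting emails for {customer_id}")
    orders[order_id] = customer_id


def email_for_order(order_id):
    """Look up the email through CustomerID instead of storing it per order."""
    return customers[orders[order_id]]
```

After the split, every non-key attribute depends only on the key of its own table, so the transitive dependency is gone.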

103
MCQmedium

A company is implementing a data lifecycle management policy. Which stage occurs immediately after data is created?

A.Storage
B.Deletion
C.Archival
D.Analysis
AnswerA

Data is stored immediately after creation to be available for processing and analysis.

Why this answer

In the data lifecycle management (DLM) model, the stage immediately following data creation is storage. Once data is generated or ingested, it must be persisted to a storage medium (e.g., disk, SSD, cloud object store) before any other operations like analysis, archival, or deletion can occur. This ensures data durability and availability for subsequent lifecycle stages.

Exam trap

CompTIA often tests the misconception that analysis or processing is the immediate next step after data creation, but the correct sequence in DLM always begins with storage to ensure data persistence.

How to eliminate wrong answers

Option B (Deletion) is wrong because deletion is a final stage in the lifecycle, occurring only after data is no longer needed and retention policies have expired. Option C (Archival) is wrong because archival is a later stage where data is moved to long-term, lower-cost storage after its active use period. Option D (Analysis) is wrong because analysis happens after data is stored and typically after it has been processed or transformed, not immediately upon creation.

104
MCQeasy

A data analyst notices that customer addresses in the database contain invalid ZIP codes. Which data quality dimension is being violated?

A.Validity
B.Timeliness
C.Consistency
D.Completeness
AnswerA

Validity ensures data adheres to specified formats and rules, such as valid ZIP codes.

Why this answer

A is correct because validity refers to the degree to which data conforms to its defined format, rules, or constraints. Invalid ZIP codes (e.g., a five-digit code containing letters or a non-existent postal code) directly violate the format and domain rules expected for that field, making this a validity issue.

Exam trap

The trap here is that candidates confuse 'validity' with 'completeness' or 'consistency,' mistakenly thinking a missing or mismatched ZIP code is a completeness or consistency issue, when in fact the violation is about the data not conforming to the required format or rule set.

How to eliminate wrong answers

Option B (Timeliness) is wrong because timeliness concerns whether data is available when needed, not whether individual values match expected formats. Option C (Consistency) is wrong because consistency checks for logical coherence across related data sets or fields (e.g., ZIP code matching city/state), not the intrinsic correctness of a single value. Option D (Completeness) is wrong because completeness measures whether all required data is present (e.g., missing ZIP codes), not whether present data is correctly formatted.
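
A validity rule can be expressed as a simple format check. The sketch below assumes US ZIP codes (five digits with an optional four-digit suffix); it tests format only and does not confirm that a code actually exists.

```python
import re

# Assumed rule: five digits, optionally followed by a hyphen and four digits.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def is_valid_zip(value):
    """Return True when the value matches the assumed ZIP code format."""
    return isinstance(value, str) and ZIP_PATTERN.fullmatch(value) is not None


zip_codes = ["30301", "1234A", "90210-1234", "123"]
invalid = [z for z in zip_codes if not is_valid_zip(z)]
```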

105
Multi-Selecthard

Which THREE of the following are properties of ratio data? (Choose THREE.)

Select 3 answers
A.Data can be categorized into groups
B.Allows negative values
C.Supports multiplication and division
D.Intervals between values are equal
E.Has a meaningful zero point
AnswersC, D, E

Ratio data allows meaningful ratios (e.g., twice as heavy).

Why this answer

Ratio data supports multiplication and division because it has a true, meaningful zero point that indicates the absence of the measured attribute. This allows ratios to be computed (e.g., one value is twice another), which is a defining property of ratio scales in measurement theory. Like interval data, ratio data also has equal intervals between values.

Exam trap

The trap here is that candidates confuse the 'meaningful zero' property with the ability to have negative values, or they think categorization is a defining feature of ratio data, when it is actually a property shared by all measurement scales.

106
MCQhard

A dataset contains a column 'Education Level' with values: 'High School', 'Bachelor', 'Master', 'PhD'. An analyst computes the average by assigning numbers 1-4. Which data concept is being violated?

A.Misclassifying data as structured
B.Treating ordinal data as interval
C.Treating nominal data as ordinal
D.Treating ratio data as interval
AnswerB

Assigning numbers and averaging assumes equal intervals, which ordinal data lacks.

Why this answer

The analyst assigned numeric values (1-4) to 'Education Level' categories and computed an average. This treats the ordinal data as if it were interval data, assuming equal spacing between categories (e.g., the difference between 'High School' and 'Bachelor' is the same as between 'Master' and 'PhD'), which is not valid. Ordinal data only preserves order, not magnitude or equal intervals, so calculating a mean is inappropriate.

Exam trap

CompTIA often tests the distinction between ordinal and interval scales by presenting a scenario where a mean is computed on ranked categories, tempting candidates to think the error is about nominal vs. ordinal (Option C) rather than the misuse of arithmetic operations on ordinal data.

How to eliminate wrong answers

Option A is wrong because misclassifying data as structured refers to incorrectly labeling unstructured data (e.g., text) as structured, but the dataset already has a structured column; the violation is about measurement scale, not structure. Option C is wrong because treating nominal data as ordinal would involve imposing an order on unordered categories (e.g., colors), but 'Education Level' already has a natural order, so the error is not about misordering but about assuming equal intervals. Option D is wrong because treating ratio data as interval would ignore a true zero point (e.g., income), but 'Education Level' has no meaningful zero, so the violation is not about ratio vs. interval but about ordinal vs. interval.
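
For ordinal data, order-based summaries such as the median or mode are safer than a mean. The sketch below uses made-up responses, and the names LEVELS and RANK are assumptions; the ranks are used only to sort, not to do arithmetic.

```python
import statistics

# Ordered categories; the index is a rank used only for ordering.
LEVELS = ["High School", "Bachelor", "Master", "PhD"]
RANK = {level: i for i, level in enumerate(LEVELS)}

responses = ["Bachelor", "High School", "PhD", "Bachelor", "Master"]

ranks = [RANK[r] for r in responses]
# median_low always returns an actual data point, so it maps back to a category.
median_level = LEVELS[statistics.median_low(ranks)]
modal_level = statistics.mode(responses)
```

Both summaries return a real category, whereas a mean of the ranks would return a number with no defined meaning on this scale.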

107
MCQeasy

A data analyst is working with a dataset containing customer information. The dataset includes a column 'full_name' which stores first and last names together. To perform analysis on first names separately, which data concept describes the process of splitting 'full_name' into 'first_name' and 'last_name'?

A.Data deduplication
B.Data summarization
C.Data normalization
D.Data aggregation
AnswerC

Normalization reduces redundancy and breaks down attributes.

Why this answer

Option C is correct because data normalization is the process of organizing data to reduce redundancy and improve integrity, which includes splitting composite attributes like 'full_name' into atomic values ('first_name', 'last_name'). This aligns with the first normal form (1NF) principle in database design, where each column should contain indivisible values. The data analyst is decomposing a single field into multiple, more granular fields to enable separate analysis.

Exam trap

The trap here is that candidates confuse data normalization with data aggregation or summarization, because both involve restructuring data, but normalization focuses on reducing redundancy and achieving atomicity, not on computing summary statistics.

How to eliminate wrong answers

Option A is wrong because data deduplication refers to identifying and removing duplicate records or entries, not splitting a single column into multiple columns. Option B is wrong because data summarization involves aggregating or condensing data (e.g., calculating averages or totals) to provide a high-level view, not decomposing a field. Option D is wrong because data aggregation combines multiple data points into a single summary value (e.g., sum, count), which is the opposite of splitting a field into more granular components.
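
A minimal sketch of the split, assuming names are separated by spaces and the first token is the first name. Real names (middle names, single names, suffixes) need more careful rules, and split_full_name is a hypothetical helper.

```python
def split_full_name(full_name):
    """Split 'First Last' into (first_name, last_name) at the first space."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


rows = ["Ada Lovelace", "  Grace  Hopper ", "Cher"]
split_rows = [split_full_name(name) for name in rows]
```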

108
MCQeasy

A hospital's patient records system must process thousands of small transactions per second. Which type of database system is best suited for this workload?

A.Data mart
B.OLTP
C.Data warehouse
D.OLAP
AnswerB

OLTP handles many concurrent short transactions efficiently.

Why this answer

OLTP (Online Transaction Processing) systems are designed to handle a high volume of small, concurrent transactions with low latency and high concurrency. This makes them ideal for a hospital patient records system that must process thousands of small transactions per second, such as patient check-ins, prescription updates, and billing entries.

Exam trap

The trap here is that candidates often confuse OLTP with OLAP, mistakenly thinking that 'processing many transactions' implies analytical processing, when in fact OLTP is the correct choice for high-frequency, small, write-heavy workloads.

How to eliminate wrong answers

Option A is wrong because a data mart is a subset of a data warehouse focused on a specific business line (e.g., cardiology), not designed for high-throughput transactional processing. Option C is wrong because a data warehouse is optimized for complex analytical queries on large historical datasets, not for handling thousands of small, real-time transactions per second. Option D is wrong because OLAP (Online Analytical Processing) is used for multidimensional analysis and reporting, not for high-frequency transactional workloads.

109
Multi-Selectmedium

Which TWO data types are considered quantitative? (Select two.)

Select 2 answers
A.Customer satisfaction rating (1-5)
B.Temperature in Celsius
C.Product color
D.Zip code
E.Employee ID
AnswersA, B

Correct. Although ordinal, it is a numeric rating and considered quantitative.

Why this answer

Customer satisfaction rating (1-5) is treated as quantitative here because it is recorded as a number and is commonly summarized with an average, although strictly it is an ordinal scale and averaging it assumes the steps between ratings are equal. Temperature in Celsius is quantitative because it is a continuous numerical measurement on an interval scale, where differences between values are meaningful. Both options are numeric data that can be analyzed statistically.

Exam trap

The trap here is that candidates often mistake numeric-looking identifiers (like zip codes or employee IDs) for quantitative data, failing to recognize that they are nominal categorical variables with no mathematical significance.

110
MCQeasy

An e-commerce company wants to provide real-time personalized product recommendations based on customer browsing behavior. Currently, they have a traditional data warehouse that processes batch updates every night. The marketing team complains that recommendations are outdated within hours because customers see yesterday's data. The data engineer needs to modify the architecture to support near-real-time analytics. The budget is limited, and the existing warehouse infrastructure must be reused as much as possible. Which architectural change would best meet the requirement?

A.Replace the warehouse with an in-memory database for real-time processing.
B.Add more nodes to the warehouse cluster to speed up batch processing.
C.Implement a streaming data pipeline (e.g., Apache Kafka) that feeds a real-time recommendation engine.
D.Increase the frequency of batch load from nightly to every hour.
AnswerC

Correct: Streaming enables real-time analytics without replacing the warehouse.

Why this answer

Option C is correct because implementing a streaming data pipeline like Apache Kafka enables the ingestion and processing of customer browsing events in near real-time, feeding a dedicated recommendation engine that can update recommendations within seconds or minutes. This approach reuses the existing data warehouse for historical analytics and batch reporting while adding a lightweight streaming layer for low-latency recommendations, aligning with the limited budget and reuse requirement.

Exam trap

The trap here is that candidates may assume increasing batch frequency (Option D) is sufficient for near-real-time needs, but the exam tests the understanding that 'near-real-time' typically requires sub-minute latency, which batch processing cannot achieve due to scheduling overhead and resource contention.

How to eliminate wrong answers

Option A is wrong because replacing the warehouse with an in-memory database would discard the existing infrastructure entirely, incurring high migration costs and losing the warehouse's batch processing capabilities for other workloads, which violates the constraint to reuse the existing warehouse. Option B is wrong because adding more nodes to the warehouse cluster only improves the throughput of batch processing, but does not reduce the latency of data freshness—recommendations would still be based on data that is at least hours old, failing the near-real-time requirement. Option D is wrong because increasing batch frequency to every hour still introduces a delay of up to 60 minutes, which is insufficient for real-time personalization; moreover, frequent batch loads can cause resource contention and degrade warehouse performance for other queries.

111
MCQhard

A retail company has merged with another firm and now needs to create a unified customer data warehouse. The existing systems use different data classification methods: System A stores customer income as a categorical range (e.g., '$0-$50k', '$50k-$100k', '$100k+') while System B stores exact income as a decimal number. A data analyst must combine these into a single table. The goal is to perform statistical analysis that includes calculating average income, but the categorical data from System A loses precision. The analyst proposes converting System B's exact values into the same ranges as System A to ensure consistency. However, the data governance team wants to preserve as much detail as possible. Which course of action should the analyst recommend?

A.Store both columns separately and treat them as independent attributes
B.Convert System B's exact income to ranges matching System A, then combine
C.Impute System A's categorical data with the midpoint of each range to create a continuous numeric field, then combine with System B's exact values
D.Use only System B's data and discard System A because it is less precise
AnswerC

This preserves detail from System B and creates a usable numeric field from System A for analysis.

Why this answer

Option C is correct because imputing the midpoint of each income range converts System A's categorical data into a continuous numeric field, allowing it to be combined with System B's exact decimal values. This approach preserves the granularity of System B's data while enabling statistical calculations like average income across the unified dataset, balancing the data governance team's requirement for detail with the need for consistency.

Exam trap

The trap here is that candidates may choose Option B, thinking consistency requires downgrading all data to the lowest common denominator, but the exam tests the ability to preserve precision while achieving integration through transformation techniques like midpoint imputation.

How to eliminate wrong answers

Option A is wrong because storing both columns separately as independent attributes fails to create a unified customer data warehouse and prevents direct statistical analysis across the combined dataset, such as calculating a single average income. Option B is wrong because converting System B's exact decimal values into the same categorical ranges as System A discards precision unnecessarily, violating the data governance team's goal to preserve as much detail as possible. Option D is wrong because discarding System A's data entirely ignores valuable customer information from the merged firm, leading to data loss and an incomplete unified warehouse.
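
Midpoint imputation can be sketched as below. The sample values are invented, and the figure used for '$100k+' is a pure assumption, since an open-ended range has no true midpoint; flagging imputed rows keeps that uncertainty visible.

```python
# Assumed representative value per range; '$100k+' has no upper bound,
# so 125,000 is a placeholder that would need to be agreed with stakeholders.
RANGE_MIDPOINTS = {
    "$0-$50k": 25_000.0,
    "$50k-$100k": 75_000.0,
    "$100k+": 125_000.0,
}

system_a = ["$0-$50k", "$50k-$100k", "$100k+"]  # categorical ranges
system_b = [42_500.0, 88_000.0]                 # exact incomes

combined = [
    {"income": RANGE_MIDPOINTS[r], "source": "A", "imputed": True}
    for r in system_a
] + [
    {"income": v, "source": "B", "imputed": False}
    for v in system_b
]

average_income = sum(row["income"] for row in combined) / len(combined)
```

System B's exact values pass through unchanged, so no precision is lost on that side.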

112
MCQmedium

A data analyst receives a dataset with inconsistent date formats (e.g., "01/02/2023", "2023-01-02", "Jan 2, 2023"). Which data quality dimension is most directly affected?

A.Accuracy
B.Consistency
C.Completeness
D.Timeliness
AnswerB

Inconsistent formats directly impact data consistency.

Why this answer

Consistency refers to the uniformity of data representation. Inconsistent date formats violate consistency, not accuracy, completeness, or timeliness.
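
A sketch of restoring consistency by parsing each format into ISO 8601. The format list is an assumption: "01/02/2023" is read as MM/DD/YYYY here, which would be wrong for day-first sources, and month-name parsing depends on an English locale.

```python
from datetime import datetime

# Assumed source formats, tried in order.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y"]


def to_iso_date(text):
    """Parse a date string in any known format and return YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


raw_dates = ["01/02/2023", "2023-01-02", "Jan 2, 2023"]
normalized = [to_iso_date(d) for d in raw_dates]
```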

113
MCQhard

Refer to the exhibit. A data architect is designing a data dictionary for a relational database. Based on the exhibit, which data concept is being illustrated?

A.Data constraints
B.Data aggregation
C.Data normalization
D.Data cardinality
AnswerA

The exhibit specifies field properties like nullable and unique, which are constraints on the data.

Why this answer

The exhibit shows a table definition with column attributes such as NOT NULL, UNIQUE, and PRIMARY KEY, which are data constraints that enforce rules on the data values. Data constraints ensure data integrity by restricting what data can be stored in a column, such as preventing null values or duplicate entries. This directly aligns with the concept of data constraints, making option A correct.

Exam trap

The trap here is that candidates may confuse data constraints with data cardinality, because both involve 'rules' in a database, but cardinality specifically describes the nature of relationships between tables, not the column-level restrictions shown in the exhibit.

How to eliminate wrong answers

Option B is wrong because data aggregation involves summarizing or combining data from multiple rows (e.g., using SUM, AVG), which is not illustrated in the table definition. Option C is wrong because data normalization is a process of organizing data to reduce redundancy and dependency, typically involving splitting tables into related tables, not defining column-level constraints. Option D is wrong because data cardinality refers to the relationship between tables (e.g., one-to-many), not the rules applied to individual columns in a table definition.

114
MCQhard

A data engineer is designing a data warehouse for a multinational corporation. The company has sales data from different regions with varying currencies and date formats. To ensure consistency, which data concept should be applied to standardize the data before loading into the warehouse?

A.Data cleansing
B.Data transformation
C.Data profiling
D.Data masking
AnswerB

Transformation includes standardization of formats.

Why this answer

Data transformation is the correct concept because it converts data from source formats (e.g., different currencies and date formats) into a consistent, standardized format before loading into the data warehouse. This process applies conversion rules, such as using ISO 8601 for dates and a single base currency (e.g., USD) with exchange rate tables, ensuring uniformity across all regional data. Without transformation, the warehouse would hold amounts and dates in incompatible formats, so totals and comparisons across regions would be unreliable.

Exam trap

CompTIA often tests the distinction between data cleansing and data transformation, where candidates mistakenly choose cleansing because they think fixing formats is about 'cleaning' data, but cleansing addresses errors and missing values, not structural conversions like currency or date standardization.

How to eliminate wrong answers

Option A is wrong because data cleansing focuses on detecting and correcting inaccuracies, inconsistencies, or missing values (e.g., removing duplicates or fixing typos), not on converting data types or formats like currencies and dates. Option C is wrong because data profiling is an exploratory process that analyzes source data to understand its structure, quality, and relationships (e.g., checking data types or null percentages), but it does not perform any standardization or conversion. Option D is wrong because data masking is a security technique used to obfuscate sensitive information (e.g., replacing credit card numbers with tokens) for privacy or compliance, and it has no role in standardizing currencies or date formats.
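
The currency side of such a transformation might look like the sketch below. The exchange rates are hypothetical placeholders, and to_usd is an illustrative helper; a real pipeline would look up the rate that applies on each transaction date.

```python
from decimal import Decimal

# Hypothetical exchange rates to the base currency (USD).
RATES_TO_USD = {
    "USD": Decimal("1.00"),
    "EUR": Decimal("1.10"),
    "GBP": Decimal("1.25"),
}


def to_usd(amount, currency):
    """Convert an amount given as a string into USD, rounded to cents."""
    rate = RATES_TO_USD[currency]
    return (Decimal(amount) * rate).quantize(Decimal("0.01"))


regional_sales = [("100.00", "EUR"), ("80.00", "GBP"), ("50.00", "USD")]
sales_usd = [to_usd(amount, currency) for amount, currency in regional_sales]
```

Decimal is used instead of float so that monetary amounts are not affected by binary rounding.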

115
MCQhard

A logistics company is analyzing truck delivery times. Which variable is discrete?

A.Number of stops
B.Time taken in hours
C.Fuel consumption in liters
D.Distance traveled
AnswerA

Correct. The number of stops is a count and therefore discrete.

Why this answer

A discrete variable is one that takes on a countable number of distinct values, often integers. The number of stops a truck makes is a count (e.g., 0, 1, 2, 3) and cannot be a fraction, making it a classic discrete variable in data analysis.

Exam trap

The trap here is that candidates confuse 'recorded as an integer' with 'discrete'—for example, thinking distance in whole kilometers is discrete, when the underlying measurement scale is continuous.

How to eliminate wrong answers

Option B is wrong because time taken in hours is a continuous variable—it can be measured to any fractional precision (e.g., 2.5 hours, 3.75 hours). Option C is wrong because fuel consumption in liters is continuous; it can take any value within a range (e.g., 45.3 liters). Option D is wrong because distance traveled is continuous, as it can be measured in fractional units (e.g., 120.7 km).
