Knowledge + Practice

CompTIA Data+ Data Concepts Questions

1

MCQ · easy

A healthcare provider needs to integrate patient data from multiple clinics into a single data warehouse. Which process is used to extract, transform, and load the data?

A. ELT

B. ETL

C. OLAP

D. OLTP

Answer: B

ETL extracts data, transforms it, and loads it into the warehouse, suitable for structured integration.

Why this answer

ETL (Extract, Transform, Load) is the correct process because the healthcare provider must first extract data from multiple source clinics, then transform it (e.g., standardize formats, clean duplicates, apply business rules) before loading it into the target data warehouse. This ensures data quality and consistency, which is critical for clinical analytics and reporting.

Exam trap

The trap here is confusing ETL with ELT, where candidates assume ELT is always better due to modern big data tools, but the question explicitly describes a traditional data warehouse integration requiring pre-load transformations.

How to eliminate wrong answers

Option A is wrong because ELT (Extract, Load, Transform) loads raw data into the target system first and transforms it later, which is less suitable for a data warehouse requiring pre-integrated, clean data from multiple sources; it is more common in big data environments like Hadoop. Option C is wrong because OLAP (Online Analytical Processing) is a category of database systems optimized for complex queries and multidimensional analysis, not a data integration process. Option D is wrong because OLTP (Online Transaction Processing) is designed for high-volume transactional operations (e.g., recording patient visits), not for extracting, transforming, and loading data into a warehouse.
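
As a rough illustration of the extract, transform, load order described above, the following Python sketch may help. The clinic records, field names, and the in-memory SQLite table standing in for the warehouse are all hypothetical and not part of the question.

```python
import sqlite3

# Hypothetical raw extracts from two clinics (assumed field names and formats).
CLINIC_A = [{"patient": " Ana Diaz ", "visit": "2024-01-05"}]
CLINIC_B = [
    {"patient": "ANA DIAZ", "visit": "2024-01-05"},
    {"patient": "Bo Chen", "visit": "2024-02-10"},
]

def extract():
    """Extract: gather raw rows from every source."""
    return CLINIC_A + CLINIC_B

def transform(rows):
    """Transform: standardize names and drop duplicates before loading."""
    seen, clean = set(), []
    for row in rows:
        key = (row["patient"].strip().title(), row["visit"])
        if key not in seen:
            seen.add(key)
            clean.append(key)
    return clean

def load(rows, conn):
    """Load: write only the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE visits (patient TEXT, visit_date TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute(
    "SELECT patient, visit_date FROM visits ORDER BY patient"
).fetchall()
print(loaded)
```

The point of the sketch is the ordering: the duplicate visit is removed in `transform` before anything reaches the warehouse table.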

2

MCQ · hard

A healthcare organization is subject to strict data privacy regulations requiring the classification of all data assets. The data governance team has identified three data sensitivity levels: Public, Internal, and Restricted. They have a new data pipeline importing patient health records from multiple clinics. The records include patient names, diagnoses, treatment codes, and insurance information. The team must ensure that the classification is applied correctly and that restricted data (e.g., diagnoses) is not exposed to unauthorized personnel. However, the pipeline uses automated tagging based on metadata rules, and some fields are misclassified. What is the most effective immediate action to improve classification accuracy?

A. Encrypt all data at rest and in transit regardless of classification.

B. Require manual review and reclassification of all incoming records.

C. Expand the metadata rule set to include more keywords and patterns.

D. Implement data loss prevention (DLP) tools that inspect content and enforce classification rules.

Answer: D

Correct: DLP can reclassify based on actual content, improving accuracy.

Why this answer

Option D is correct. Implementing data loss prevention (DLP) tools with content inspection can automatically detect sensitive patterns (e.g., diagnosis codes) and apply the correct classification, even if metadata rules are incomplete. Option B (manual review) is too slow for a pipeline.

Option C (expanding metadata rules) can help but may still miss cases; DLP provides an additional check. Option A (encrypting all data) is a security measure but does not fix classification. DLP directly addresses misclassification by analyzing content.

3

Matching · medium

Match each data quality dimension to its description.

Matches

Degree to which data correctly reflects real-world values

Extent to which all required data is present

Absence of contradictions across data sources

Data is up-to-date and available when needed

No duplicate records exist within the dataset

Why these pairings

These are key data quality dimensions tested in Data+.

4

MCQ · hard

A data analyst notices that a column labeled 'Income' contains values like '$50,000' and '$75,000', but also 'High' and 'Low'. What data concept issue is occurring?

A. Mixing quantitative and qualitative data

B. Mixing discrete and continuous data

C. Mixing nominal and ordinal data

D. Mixing structured and unstructured data

Answer: A

Income should be quantitative, but text labels are qualitative.

Why this answer

The 'Income' column contains both numeric values (e.g., '$50,000', '$75,000') which are quantitative data, and categorical labels ('High', 'Low') which are qualitative data. Mixing these two distinct data types in a single column violates data consistency principles and prevents proper statistical analysis or machine learning processing. This is a classic example of mixing quantitative and qualitative data.

Exam trap

CompTIA often tests the distinction between data type categories (quantitative vs. qualitative) versus subtypes (discrete/continuous or nominal/ordinal), so candidates mistakenly pick a subtype option when the core issue is the fundamental type mismatch.

How to eliminate wrong answers

Option B is wrong because discrete and continuous data are both subtypes of quantitative data (e.g., number of children vs. height), but the issue here is mixing numbers with text labels, not distinguishing between countable and measurable values. Option C is wrong because nominal and ordinal data are both categorical (qualitative) subtypes (e.g., colors vs. rankings), but the column includes actual numeric income values, not just ordered categories. Option D is wrong because structured data refers to organized formats like tables (which this column is part of), while unstructured data refers to free-form text or media; the problem is not about format but about inconsistent data types within a structured field.
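
One way to surface this kind of mixed column is to try parsing every value as a number and collect whatever fails. The sketch below is only an illustration; the function name `split_income_values` and the sample list are assumptions, and a real column may need more cleaning rules than stripping `$` and `,`.

```python
def split_income_values(values):
    """Separate values that parse as currency amounts from text labels."""
    numeric, labels = [], []
    for raw in values:
        cleaned = raw.replace("$", "").replace(",", "").strip()
        try:
            numeric.append(float(cleaned))  # quantitative value
        except ValueError:
            labels.append(raw)              # qualitative label
    return numeric, labels

income = ["$50,000", "$75,000", "High", "Low"]
numeric, labels = split_income_values(income)
print(numeric, labels)
```

A non-empty `labels` list signals that the column mixes quantitative and qualitative values and needs to be split or recoded before analysis.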

5

MCQ · easy

A marketing team needs to store customer feedback from social media posts, including text, images, and emojis. Which data concept is most appropriate for this storage?

A. Unstructured data in a NoSQL document database

B. Structured data in a relational database

C. Unstructured data in a relational database

D. Semi-structured data in an XML database

Answer: A

NoSQL document databases store unstructured data such as text, images, and emojis without a fixed schema.

Why this answer

Customer feedback from social media includes text, images, and emojis, which lack a predefined schema and are best stored as unstructured data. NoSQL document databases (e.g., MongoDB) store such data in flexible JSON-like documents, allowing each record to have varying fields and data types without requiring a fixed schema.

Exam trap

CompTIA often tests the misconception that 'unstructured data' cannot be stored in any database, when in fact NoSQL document databases are purpose-built for it, while relational databases require rigid schemas that fail with variable content.

How to eliminate wrong answers

Option B is wrong because structured data in a relational database requires a fixed schema with predefined columns and data types, which cannot efficiently handle variable-length text, images, and emojis without complex workarounds like BLOBs. Option C is wrong because relational databases are designed for structured data; storing unstructured data in them forces schema rigidity and poor performance for heterogeneous content. Option D is wrong because XML databases are semi-structured and impose hierarchical markup, which is unnecessary overhead for social media posts that are naturally schema-less and better served by document stores.

6

Multi-Select · easy

Which TWO of the following are characteristics of structured data? (Choose TWO.)

Select 2 answers

A. Stored in rows and columns

B. Lacks a fixed schema

C. Has a predefined data model

D. Uses tags to define elements

E. Consists of free-form text

Answers: A, C

Tabular storage is a hallmark of structured data like relational databases.

Why this answer

Option A is correct because structured data is organized into rows and columns, typically within relational databases or spreadsheets, where each column represents a specific attribute and each row represents a record. This tabular format enables efficient querying, sorting, and indexing using languages like SQL. The rigid row-and-column structure ensures data consistency and supports ACID (Atomicity, Consistency, Isolation, Durability) properties.

Exam trap

The trap here is that candidates often confuse semi-structured data (which uses tags or labels) with structured data, or they incorrectly assume structured data can lack a schema, when in fact a predefined schema is its defining requirement.

7

MCQ · medium

To consolidate data from multiple operational databases into a central repository for reporting, a company decides to transform data before loading it into the target system. Which data integration approach is being used?

A. ETL (Extract, Transform, Load)

B. Data virtualization

C. Change data capture

D. ELT (Extract, Load, Transform)

Answer: A

ETL transforms data during the integration process before loading into the target.

Why this answer

The scenario describes transforming data before loading it into the target system, which is the defining characteristic of ETL (Extract, Transform, Load). In ETL, data is extracted from source systems, transformed in a staging area (e.g., cleaning, aggregating, joining), and then loaded into the central repository. This approach is commonly used when the target system (e.g., a data warehouse) requires pre-processed, high-quality data for reporting.

Exam trap

The trap here is that candidates often confuse ETL with ELT, assuming that any transformation before loading is ELT, but the key distinction is that ELT loads raw data first and transforms it later inside the target system, whereas ETL transforms data before it reaches the target.

How to eliminate wrong answers

Option B (Data virtualization) is wrong because it does not physically move or transform data before loading; instead, it creates a virtual layer that queries source systems in real-time, leaving data in place. Option C (Change data capture) is wrong because it is a technique for identifying and capturing only changed data from source systems, not a complete integration approach that includes transformation before loading. Option D (ELT) is wrong because it loads raw data into the target system first and then transforms it within the target, which contradicts the 'transform before loading' requirement in the question.
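
The ETL/ELT distinction comes down to where the transformation runs. The following sketch contrasts the two on the same input, using an in-memory SQLite database as a stand-in target; the table names and the sample rows are hypothetical.

```python
import sqlite3

raw_orders = [("1001", " 19.50 "), ("1002", "5.00")]  # hypothetical raw strings

conn = sqlite3.connect(":memory:")

# ELT: load the raw values first, then transform inside the target with SQL.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw_orders)
conn.execute(
    "CREATE TABLE orders_elt AS "
    "SELECT CAST(order_id AS INTEGER) AS order_id, "
    "CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
)

# ETL: transform in application code first, then load only the cleaned rows.
cleaned = [(int(oid), float(amt.strip())) for oid, amt in raw_orders]
conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
print(elt_rows, etl_rows)
```

Both paths end with the same cleaned rows; the difference is that the ELT path leaves a raw copy in the target and relies on the target's engine to do the cleaning.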

8

Multi-Select · hard

Which THREE of the following are NoSQL database types?

Select 3 answers

A. Document

B. Hierarchical

C. Relational

D. Key-Value

E. Graph

Answers: A, D, E

Document stores (e.g., MongoDB) are NoSQL.

Why this answer

Option A is correct because document databases, such as MongoDB, store data in flexible, JSON-like documents (BSON in MongoDB's case). This allows for nested structures and schema-less designs, making them a core NoSQL category distinct from relational models.

Exam trap

CompTIA often tests the distinction between legacy database models (hierarchical) and modern NoSQL categories, leading candidates to mistakenly include hierarchical as a NoSQL type due to its non-relational nature.

9

MCQ · medium

A healthcare analytics team is building a dashboard to monitor patient vitals. They receive data from two sources: Source 1 provides 'heart rate' as an integer (beats per minute), and Source 2 provides 'blood pressure' as a ratio (systolic/diastolic, e.g., 120/80). The team wants to create a combined metric called 'cardiac stress index' that uses both heart rate and systolic blood pressure. However, they notice that heart rate data occasionally contains negative values due to sensor errors. The data governance policy requires that all data be valid and meaningful. Which action best addresses the data quality issue while preserving the data types?

A. Convert heart rate to absolute values (remove the negative sign)

B. Keep negative values but set them to NULL to indicate missing data

C. Change heart rate from integer to categorical (e.g., low, normal, high) to avoid negative issues

D. Remove all records with negative heart rate values as they are invalid

Answer: D

Negative heart rates are not physiologically possible, so deletion is appropriate for data quality.

Why this answer

Option D is correct because negative heart rate values are physiologically impossible and violate the data governance policy requiring valid and meaningful data. Removing these records ensures the dashboard only contains accurate, actionable data without altering the original integer data type of heart rate, preserving its numerical integrity for the 'cardiac stress index' calculation.

Exam trap

The trap here is that candidates may choose Option A (converting to absolute values) thinking it 'fixes' the data, but this introduces false data and violates data validity, whereas the correct approach is to remove invalid records to maintain data integrity.

How to eliminate wrong answers

Option A is wrong because converting negative heart rates to absolute values introduces false data, masking sensor errors and potentially skewing the cardiac stress index with artificially inflated readings. Option B is wrong because setting negative values to NULL retains invalid records in the dataset, which can cause calculation errors or missing data handling issues in the dashboard without addressing the root cause of sensor errors. Option C is wrong because changing heart rate from integer to categorical loses granularity and prevents the precise numerical computation required for the cardiac stress index, violating the requirement to preserve data types.
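
A minimal sketch of the "remove invalid records" approach might look like the following. The sample readings and the helper name `drop_invalid_heart_rates` are assumptions for illustration; the rule applied is simply that a heart rate must be a positive integer.

```python
# Hypothetical readings; patient 2 has a negative value from a sensor error.
readings = [
    {"patient_id": 1, "heart_rate": 72, "systolic": 120},
    {"patient_id": 2, "heart_rate": -65, "systolic": 118},
    {"patient_id": 3, "heart_rate": 88, "systolic": 135},
]

def drop_invalid_heart_rates(rows):
    """Keep only rows whose heart rate is a positive integer."""
    return [
        r for r in rows
        if isinstance(r["heart_rate"], int) and r["heart_rate"] > 0
    ]

valid = drop_invalid_heart_rates(readings)
rejected = [r for r in readings if r not in valid]
print(len(valid), len(rejected))
```

The surviving rows keep `heart_rate` as an integer, and the rejected rows can be logged separately so the sensor fault is still visible to whoever maintains the source.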

10

MCQ · easy

A retail analyst needs to determine the most popular product category. The dataset includes columns: ProductID, Category, SalesDate, QuantitySold, UnitPrice. Which column contains qualitative data?

A. SalesDate

B. QuantitySold

C. UnitPrice

D. Category

Answer: D

Correct. Category is a qualitative variable as it describes a product attribute.

Why this answer

Qualitative data (also called categorical data) represents non-numeric categories or labels. The 'Category' column contains text values such as 'Electronics' or 'Clothing', which are descriptive and cannot be used in arithmetic operations. This makes it the only qualitative column in the dataset.

Exam trap

The trap here is that candidates often mistake dates (SalesDate) for qualitative data because they are not numeric, but dates are actually quantitative interval data with a meaningful order and equal intervals.

How to eliminate wrong answers

Option A is wrong because SalesDate represents a point in time, which is quantitative (interval) data, not qualitative. Option B is wrong because QuantitySold is a numeric count, making it quantitative (discrete) data. Option C is wrong because UnitPrice is a numeric monetary value, making it quantitative (continuous) data.

11

MCQ · easy

A data architect is designing a schema for a product catalog where each product has a variable number of attributes. Which NoSQL database type is most appropriate?

A. Graph database

B. Document store

C. Key-value store

D. Relational database

Answer: B

Document stores allow flexible schemas, perfect for variable attributes.

Why this answer

A document store (e.g., MongoDB, Couchbase) is the most appropriate choice because it stores data in flexible, self-describing documents (typically JSON or BSON), allowing each product to have a variable number of attributes without requiring a predefined schema. This directly matches the requirement of a product catalog where attributes can differ per product, unlike rigid relational tables that would require complex EAV (Entity-Attribute-Value) patterns or frequent schema migrations.

Exam trap

The trap here is that candidates often confuse 'variable attributes' with 'relationships' and incorrectly choose a graph database, or they assume key-value stores are flexible enough, overlooking the need for queryability on individual attributes.

How to eliminate wrong answers

Option A is wrong because graph databases (e.g., Neo4j) are optimized for highly connected data and relationship traversal, not for storing documents with variable attributes; they would force you to model each attribute as a node or relationship, adding unnecessary complexity. Option C is wrong because key-value stores (e.g., Redis, DynamoDB) treat the entire product as an opaque value, making it impossible to query or index individual attributes without application-level parsing, which defeats the purpose of a catalog. Option D is wrong because relational databases require a fixed schema per table; handling variable attributes would necessitate either many nullable columns, frequent ALTER TABLE statements, or a cumbersome EAV pattern, all of which degrade performance and maintainability.
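
The difference between a document store and a key-value store can be mimicked with plain Python structures. This is only an analogy, not any database's API; the product records and the helper `find_by_attribute` are made up for the example.

```python
import json

# Document-style records: each product carries only the attributes it has.
products = [
    {"sku": "TV-1", "name": "Television", "screen_inches": 55, "hdmi_ports": 3},
    {"sku": "SH-9", "name": "Running shoe", "size": 42, "color": "blue"},
]

def find_by_attribute(docs, field, value):
    """Query on an individual attribute, as a document store would allow."""
    return [d["sku"] for d in docs if d.get(field) == value]

# Key-value style: the value is an opaque string; the store sees only the key.
kv_store = {p["sku"]: json.dumps(p) for p in products}

blue_items = find_by_attribute(products, "color", "blue")
print(blue_items)
```

In the key-value version, finding blue items would require fetching and decoding every value in application code, which is the queryability gap the explanation refers to.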

12

MCQ · medium

A company's database has a table 'orders' with columns: order_id, customer_id, order_date, and total_amount. A data analyst needs to identify customers who have placed more than 5 orders in the past year. Which data concept should be used to group orders by customer and count them?

A. Joining with other tables

B. Filtering with WHERE clause

C. Sorting with ORDER BY

D. Aggregation with GROUP BY

Answer: D

GROUP BY groups rows and aggregation functions compute counts.

Why this answer

Option D is correct because the requirement to count orders per customer requires grouping rows by customer_id and then applying a count function. The GROUP BY clause in SQL aggregates rows that share a common value (customer_id) into summary rows, and the COUNT function tallies the number of orders per group. This is the standard approach for such 'per-customer' aggregations.

Exam trap

The trap here is that candidates confuse filtering (WHERE) with aggregation (GROUP BY), thinking that a WHERE clause alone can count orders per customer, when in fact WHERE only filters rows and cannot produce grouped counts.

How to eliminate wrong answers

Option A is wrong because joining with other tables merges columns from multiple tables but does not group or count rows; it would not produce a count of orders per customer. Option B is wrong because filtering with a WHERE clause restricts rows before any grouping but does not aggregate or count; it cannot produce a count of orders per customer. Option C is wrong because sorting with ORDER BY only arranges the result set order and has no effect on grouping or counting rows.
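
To make the roles of WHERE, GROUP BY, and HAVING concrete, here is a small sketch run against an in-memory SQLite database from Python. The sample rows and the fixed cutoff date standing in for "the past year" are assumptions; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, "
    "order_date TEXT, total_amount REAL)"
)
# Hypothetical data: customer 1 has six recent orders, customer 2 has two.
rows = [(i, 1, "2024-03-01", 10.0) for i in range(1, 7)]
rows += [(7, 2, "2024-03-02", 20.0), (8, 2, "2024-04-02", 5.0)]
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)

query = """
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= ?          -- filter individual rows first
GROUP BY customer_id           -- then form one group per customer
HAVING COUNT(*) > 5            -- then filter the groups by their count
"""
frequent = conn.execute(query, ("2024-01-01",)).fetchall()
print(frequent)
```

WHERE narrows the rows to the time window, but only GROUP BY with COUNT produces the per-customer totals that the "more than 5" condition is applied to.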

13

Multi-Select · medium

A data analyst is comparing characteristics of structured and unstructured data. Which TWO of the following are characteristics of structured data? (Choose two.)

Select 2 answers

A. Data is typically stored as raw text

B. Data lacks a fixed format

C. Data is stored in predefined schemas

D. Data often requires NoSQL databases for storage

E. Data can be easily queried using SQL

Answers: C, E

Structured data follows a predefined schema, such as tables in a relational database.

Why this answer

Structured data is organized into predefined schemas, such as tables with rows and columns, which enforce a consistent data format and relationships. This rigid structure allows structured data to be easily queried using SQL, as SQL is designed to operate on relational database management systems (RDBMS) that rely on these schemas. Option C is correct because a predefined schema is a defining characteristic of structured data, enabling efficient storage, retrieval, and integrity constraints.

Exam trap

The trap here is that candidates often confuse 'lack of fixed format' (unstructured) with 'flexibility in storage' (NoSQL), leading them to select options B or D, which describe unstructured or semi-structured data, not structured data.

14

Matching · medium

Match each data visualization type to its best use case.

Matches

Compare quantities across categories

Show relationship between two numeric variables

Display distribution of a single continuous variable

Show magnitude of values across two dimensions

Summarize distribution and identify outliers

Why these pairings

Choosing the right chart is key to effective data presentation.

15

MCQ · medium

A data scientist is building a model to predict customer churn. The company's internal CRM system provides customer demographics and transaction history. They also purchase demographic data from a third-party vendor. How should the purchased data be classified?

A. Secondary data

B. Internal data

C. Structured data

D. Primary data

Answer: A

Correct. Secondary data is collected by another entity and reused.

Why this answer

Purchased demographic data from a third-party vendor is classified as secondary data because it was originally collected by another entity for a different purpose and is being reused by the data scientist for churn prediction. Secondary data contrasts with primary data, which is collected firsthand for the specific analysis at hand. This classification is independent of whether the data is structured or unstructured.

Exam trap

The trap here is that candidates confuse 'secondary data' with 'structured data' because purchased data is often delivered in a structured format like CSV, but the classification is based on data origin and collection purpose, not its structure.

How to eliminate wrong answers

Option B (Internal data) is wrong because the purchased data originates from an external vendor, not from the company's own CRM or internal systems. Option C (Structured data) is wrong because the classification of data as primary or secondary is about its origin and collection purpose, not its format; purchased data could be structured or unstructured. Option D (Primary data) is wrong because primary data is collected directly by the researcher for the specific study, whereas this data was pre-existing and collected by a third party.

16

Multi-Select · hard

Which THREE data quality dimensions are commonly assessed in a data profiling task?

Select 3 answers

A. Scalability

B. Consistency

C. Uniqueness

D. Availability

E. Completeness

Answers: B, C, E

Consistency ensures uniform data representation, a common profiling check.

Why this answer

Consistency is a core data quality dimension assessed in data profiling because it evaluates whether data values are free from contradiction and adhere to the same representation rules across records. In profiling tools like Informatica or Talend, consistency checks identify violations such as 'NY' vs 'New York' in a state column, ensuring semantic uniformity.
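
The three profiling dimensions in the answer can each be expressed as a simple check. The sketch below is a toy profile in plain Python, not how Informatica or Talend implement it; the sample records, the helper names, and the allowed set of state codes are assumptions.

```python
# Hypothetical sample records with one issue per dimension.
records = [
    {"id": 1, "state": "NY", "email": "a@example.com"},
    {"id": 2, "state": "New York", "email": None},       # inconsistent, incomplete
    {"id": 2, "state": "CA", "email": "c@example.com"},  # duplicate id
]

def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)

def duplicates(rows, field):
    """Values of the field that appear more than once (uniqueness check)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[field] in seen:
            dupes.add(r[field])
        seen.add(r[field])
    return dupes

def inconsistent(rows, field, allowed):
    """Values that break the agreed representation (consistency check)."""
    return [r[field] for r in rows if r[field] not in allowed]

report = {
    "email_completeness": completeness(records, "email"),
    "duplicate_ids": duplicates(records, "id"),
    "bad_states": inconsistent(records, "state", {"NY", "CA"}),
}
print(report)
```

Scalability and availability do not appear anywhere in such a report because they describe the system holding the data, not the data values themselves.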

Exam trap

CompTIA often tests the distinction between data quality dimensions (completeness, consistency, uniqueness) and system-level attributes (scalability, availability), leading candidates to mistakenly select non-quality terms like 'Availability' or 'Scalability' because they sound relevant to data management.

17

Multi-Select · medium

Which TWO of the following are considered structured data?

Select 2 answers

A. A PDF report with free-form text

B. A relational database table

C. A JPEG image of a product

D. A JSON file with nested key-value pairs

E. A CSV file containing sales records

Answers: B, E

Tables have a fixed schema.

Why this answer

Option B is correct because a relational database table stores data in a predefined schema of rows and columns, where each column has a fixed data type. This rigid structure allows for efficient querying, indexing, and relational operations, making it a classic example of structured data.

Exam trap

The trap here is that candidates often mistake semi-structured data (like JSON) for structured data because it has key-value pairs, but the DA0-001 exam strictly defines structured data as having a fixed, predefined schema—typically found in relational databases or CSV files with consistent column headers.

18

MCQ · medium

A data engineer is designing a data warehouse for a retail company. The fact table must record each sale transaction, including product ID, store ID, date, and quantity sold. The product details (name, category, price) are stored in a separate table. This design is an example of which data modeling concept?

A. Star schema

B. Data lake

C. Normalization

D. Snowflake schema

Answer: A

Correct: fact table linked to dimension tables.

Why this answer

This design is a classic star schema, where a central fact table (sales transactions) contains foreign keys to dimension tables (product, store, date). The fact table stores quantitative measures (quantity sold) and foreign keys, while dimension tables hold descriptive attributes (product name, category, price). This separation optimizes query performance for OLAP workloads by reducing joins and enabling straightforward aggregations.

Exam trap

The trap here is that candidates confuse star schema with snowflake schema, but the key differentiator is whether dimension tables are further normalized (snowflake) or kept denormalized (star), and this question's single product table clearly indicates a star schema.

How to eliminate wrong answers

Option B is wrong because a data lake stores raw, unprocessed data in its native format (e.g., CSV, Parquet) without a predefined schema, whereas this design explicitly separates facts and dimensions with a structured schema. Option C is wrong because normalization would split data into many related tables to eliminate redundancy (e.g., separating product category into its own table), but here product details are kept in a single dimension table, which is denormalized. Option D is wrong because a snowflake schema further normalizes dimension tables into sub-dimensions (e.g., splitting product category into a separate table), but this design keeps product details in one table, making it a star schema, not a snowflake.
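
A star schema of the kind described can be sketched with one fact table and one dimension table. The example below uses an in-memory SQLite database; the table names `fact_sales` and `dim_product` and the sample rows are illustrative assumptions, and only the product dimension is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes kept together (denormalized).
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, name TEXT, category TEXT, price REAL
);
-- Fact table: foreign keys plus the numeric measure.
CREATE TABLE fact_sales (
    product_id INTEGER, store_id INTEGER, sale_date TEXT, quantity INTEGER
);
INSERT INTO dim_product VALUES
    (1, 'Kettle', 'Kitchen', 25.0), (2, 'Lamp', 'Home', 40.0);
INSERT INTO fact_sales VALUES
    (1, 10, '2024-05-01', 2), (1, 11, '2024-05-02', 1), (2, 10, '2024-05-01', 4);
""")

totals = conn.execute("""
SELECT p.category, SUM(f.quantity) AS units
FROM fact_sales AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.category
ORDER BY p.category
""").fetchall()
print(totals)
```

A single join from the fact table to the dimension is enough to aggregate by category. In a snowflake variant, `category` would live in its own table and the query would need a second join.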

19

MCQ · hard

An e-commerce company stores customer support emails in a text database, product images in a blob store, and sales transactions in a SQL table. Which data store holds only structured data?

A. Blob store

B. Text database

C. SQL table

D. None

Answer: C

Correct. SQL tables have rows and columns with defined data types.

Why this answer

Structured data conforms to a predefined schema with rows and columns, enforcing data types and relationships. A SQL table is the canonical example of a structured data store because it organizes data into tables with fixed schemas, supports ACID transactions, and enables relational queries via SQL. In contrast, blob stores and text databases store unstructured or semi-structured data without a rigid schema.

Exam trap

The trap here is that candidates confuse 'structured data' with any data that has some organization (like tags in a blob store or fields in a text document), but only a SQL table enforces a rigid, predefined schema with typed columns and relational constraints, which is the defining characteristic of structured data.

How to eliminate wrong answers

Option A is wrong because a blob store (e.g., Amazon S3, Azure Blob Storage) stores binary large objects such as images, videos, or documents as opaque blobs with no inherent schema or structure — it is designed for unstructured data. Option B is wrong because a text database (e.g., a NoSQL document store like MongoDB or a plain text file repository) stores free-form text or semi-structured documents (e.g., JSON, XML) that lack a fixed, predefined schema and are not organized into rows and columns. Option D is wrong because the SQL table explicitly holds structured data, so 'None' is incorrect.

20

Drag & Drop · medium

Drag and drop the steps to create a data visualization dashboard in the correct order.

Why this order

Dashboard creation starts with planning, then chart selection, layout design, building, and testing.

21

MCQ · hard

Refer to the exhibit. Based on the data profiling results, what is a likely data quality issue?

A. Completeness

B. Accuracy

C. Validity

D. Consistency

Answer: C

Correct. Values like 0 and 150 violate reasonable constraints.

Why this answer

The minimum of 0 and maximum of 150 are implausible values for a customer's age. A 0-year-old or 150-year-old customer most likely indicates entries that fall outside the acceptable range, which is a validity problem.
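
A validity check of this kind usually compares profiled values against an agreed range. The sketch below is illustrative only: the sample ages and the 18 to 110 range are assumptions, since the exhibit and the actual business rule are not shown here.

```python
ages = [0, 34, 52, 150, 41]  # hypothetical column values

# Basic profile, similar to what a profiling exhibit would report.
profile = {"min": min(ages), "max": max(ages)}

def out_of_range(values, low, high):
    """Return the values that violate the allowed range (validity rule)."""
    return [v for v in values if not low <= v <= high]

invalid = out_of_range(ages, 18, 110)  # assumed business rule for customer age
print(profile, invalid)
```

The profile alone only reports the extremes; it is the comparison against a defined rule that turns them into a validity finding.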

22

MCQ · easy

An organization wants to assign responsibility for data quality and metadata management. Which role is primarily accountable for defining data standards and ensuring data quality across a specific domain?

A. Data analyst

B. Data owner

C. Data steward

D. Data custodian

Answer: C

A data steward ensures data quality, standards, and metadata management for a specific domain.

Why this answer

The data steward is the role primarily accountable for defining data standards and ensuring data quality within a specific domain. This aligns with the DAMA-DMBOK framework, where the data steward acts as the business-side owner of data content, establishing rules for data entry, validation, and metadata management to maintain consistency and accuracy.

Exam trap

The trap here is confusing the data steward with the data owner or data custodian, as many candidates mistakenly think the owner handles domain-level quality or that the custodian defines standards, when in fact the steward is the bridge between business requirements and technical enforcement.

How to eliminate wrong answers

Option A is wrong because a data analyst focuses on querying, analyzing, and reporting data, not on defining standards or governing data quality across a domain. Option B is wrong because a data owner is typically a senior executive accountable for data assets at an enterprise level, not for day-to-day domain-specific standards and quality enforcement. Option D is wrong because a data custodian handles technical implementation, storage, and security, but does not define business-level data standards or quality rules.

23

MCQ · easy

A data analyst receives a dataset with a column 'salary' that contains values like '45,000', '55,000', and '65,000'. The analyst notices that the values are stored as text. Which data concept should be applied to convert the salary column from text to numeric format for analysis?

A. Data imputation

B. Data type conversion

C. Data validation

D. Data normalization

Answer: B

Conversion changes data type, e.g., string to integer.

Why this answer

Data type conversion is the correct concept because the salary values are stored as text (string) but need to be converted to a numeric type (e.g., integer or float) for mathematical operations like aggregation or averaging. In tools like Python (pandas `astype(float)`), SQL (`CAST(salary AS INTEGER)`), or Excel (`VALUE()` function), this explicit conversion ensures the data is treated as numbers, not strings. Thousands separators such as the comma in '45,000' generally have to be removed first for the Python and SQL casts to give the intended value. Without conversion, operations like `SUM` or `AVG` would fail or produce incorrect results.

Exam trap

CompTIA often tests the distinction between data transformation (type conversion) and data preparation techniques like imputation or normalization, trapping candidates who confuse 'changing format' with 'filling gaps' or 'scaling values'.

How to eliminate wrong answers

Option A is wrong because data imputation deals with filling missing values (e.g., using mean or median), not changing the data type of existing values. Option C is wrong because data validation checks whether data meets predefined rules (e.g., range or format constraints), but it does not transform text to numeric format. Option D is wrong because data normalization rescales numeric values to a standard range (e.g., 0–1 or z-scores), which assumes the data is already numeric, not converting text to numbers.
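
For the salary values in the question, a conversion in plain Python might look like the sketch below. The helper name `to_int` is an assumption; the comma is removed before casting because `int("45,000")` would raise a `ValueError`.

```python
def to_int(text):
    """Convert a text salary such as '45,000' to an integer."""
    return int(text.replace(",", "").strip())

salaries_text = ["45,000", "55,000", "65,000"]
salaries = [to_int(s) for s in salaries_text]

# Arithmetic now works because the values are numbers, not strings.
average = sum(salaries) / len(salaries)
print(salaries, average)
```

Once converted, the column supports sums and averages directly, which is what the analysis in the scenario requires.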

24

MCQmedium

A researcher wants to study the effect of a new drug. She collects data directly from clinical trial participants. Later, she compares her findings with historical data from medical journals. Which contrast best describes her data sources?

A.Internal vs. External

B.Quantitative vs. Qualitative

C.Structured vs. Unstructured

D.Primary vs. Secondary

AnswerD

Primary data is collected firsthand; secondary data is obtained from existing sources.

Why this answer

Option D is correct because the researcher is directly collecting data from clinical trial participants (primary data) and then comparing it with historical data from medical journals (secondary data). Primary data is original data collected firsthand for a specific purpose, while secondary data is pre-existing data collected by others for different purposes. This contrast directly maps to the primary vs. secondary data classification in data management.

Exam trap

The trap here is that candidates confuse 'internal vs. external' (Option A) with 'primary vs. secondary' because both involve a contrast between data from the researcher's own work versus outside sources, but the DA0-001 exam specifically tests the distinction based on whether the data was collected firsthand (primary) or reused from existing records (secondary).

How to eliminate wrong answers

Option A is wrong because internal vs. external refers to whether data originates inside or outside an organization, not to how it was collected. The trial data could be described as internal and the journal data as external, but the contrast the question draws is between data collected firsthand and data reused from existing sources. Option B is wrong because quantitative vs. qualitative describes data types (numerical vs. categorical/textual), not the source of data; both the clinical trial data and historical journal data could be quantitative or qualitative. Option C is wrong because structured vs. unstructured refers to data format (e.g., tables vs. free text), not the source; both data sources could be structured (e.g., trial results in a database) or unstructured (e.g., narrative journal articles).

25

MCQmedium

A company requires real-time masking of credit card numbers for customer support agents while allowing full access for accountants. Which technique should be implemented?

A.Dynamic data masking

B.Tokenization

C.Static data masking

D.Data encryption

AnswerA

Dynamic masking masks data on-the-fly based on user roles, perfect for this requirement.

Why this answer

Dynamic data masking (DDM) applies masking rules at query runtime based on user privileges, allowing accountants full access while customer support agents see only masked credit card numbers. Unlike static masking, DDM does not alter the underlying stored data, making it ideal for real-time, role-based obfuscation without duplicating or transforming the database.
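
The idea can be sketched in a few lines of Python. This is an illustration only, not how any particular database implements DDM; the role names and the `read_card` and `mask_card` helpers are assumptions made for the example.

```python
def mask_card(number: str) -> str:
    """Show only the last four digits of a card number."""
    return "*" * (len(number) - 4) + number[-4:]


def read_card(stored_number: str, role: str) -> str:
    """Return the card number as the given role is allowed to see it.

    The stored value is never modified; masking is applied at read time.
    Role names are illustrative assumptions.
    """
    if role == "accountant":
        return stored_number
    return mask_card(stored_number)


stored = "4111111111111111"  # well-known test card number
print(read_card(stored, "accountant"))
print(read_card(stored, "support_agent"))
```

The same stored value yields different results depending on who reads it, which is the property that separates dynamic masking from a static masked copy.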

Exam trap

CompTIA often tests the misconception that encryption or tokenization can provide real-time, role-based masking, but these technologies either require decryption (exposing the full value) or introduce latency and storage overhead, making dynamic data masking the only correct choice for this use case.

How to eliminate wrong answers

Option B (Tokenization) is wrong because it replaces sensitive data with a non-sensitive token stored in a separate vault, requiring a detokenization process that adds latency and is not designed for real-time, role-based masking within the same database. Option C (Static data masking) is wrong because it creates a permanent, masked copy of the data in a non-production environment, which cannot provide real-time, on-the-fly masking for live queries. Option D (Data encryption) is wrong because encryption protects data at rest or in transit but does not provide role-based masking at query time; decryption keys grant full access, not partial masking.

26

MCQmedium

A retail company stores customer purchase history in a relational database. The database contains a table 'transactions' with columns: transaction_id, customer_id, product_id, quantity, price, and transaction_date. A data analyst needs to create a report that shows total revenue per customer for the last quarter. Which data concept describes the relationship between customer_id and total revenue?

A.Foreign key

B.Composite attribute

C.Derived attribute

D.Atomic attribute

AnswerC

Total revenue is calculated from other attributes, making it derived.

Why this answer

Total revenue is calculated by summing (quantity * price) for each customer, making it a derived attribute because it is computed from existing stored data (quantity and price) rather than stored directly. In the context of the 'transactions' table, customer_id is a stored key, but total_revenue is not stored; it is derived via aggregation, which matches the definition of a derived attribute in database design.
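
A short sketch of how such a derived value could be computed, using hypothetical rows that follow the column order in the question. The last-quarter date filter is left out to keep the example small.

```python
from collections import defaultdict

# Hypothetical rows: (transaction_id, customer_id, product_id, quantity, price)
transactions = [
    (1, "C1", "P1", 2, 10.0),
    (2, "C1", "P2", 1, 5.0),
    (3, "C2", "P1", 3, 10.0),
]

# total_revenue is never stored; it is derived from quantity and price.
total_revenue = defaultdict(float)
for _transaction_id, customer_id, _product_id, quantity, price in transactions:
    total_revenue[customer_id] += quantity * price

print(dict(total_revenue))
```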

Exam trap

CompTIA often tests the confusion between a derived attribute (computed from other attributes) and a foreign key (a referential constraint), leading candidates to incorrectly select 'foreign key' because customer_id appears in multiple tables.

How to eliminate wrong answers

Option A is wrong because a foreign key is a column that references a primary key in another table to enforce referential integrity; customer_id in the transactions table is a foreign key referencing the customers table, but total revenue is not a key; it is a computed value. Option B is wrong because a composite attribute is an attribute that can be divided into smaller sub-parts (e.g., address into street, city, zip); total revenue is a single calculated value, not composed of multiple atomic sub-attributes. Option D is wrong because an atomic attribute is an indivisible value that is stored directly (e.g., price, quantity); total revenue is computed from other attributes rather than stored, so "derived" describes it more precisely than "atomic".

27

MCQhard

A data governance team is drafting a policy for handling personally identifiable information (PII). According to data governance best practices, which document should define the classification levels and handling procedures?

A.Data dictionary

B.Data classification policy

C.Data quality report

D.Data flow diagram

AnswerB

A data classification policy categorizes data by sensitivity and outlines handling rules.

Why this answer

The data classification policy is the authoritative document that defines classification levels (e.g., public, internal, confidential, restricted) and specifies handling procedures for each category, including PII. This aligns with data governance best practices, as it establishes the rules for labeling, storing, transmitting, and disposing of sensitive data. A data dictionary describes metadata and schema, not classification rules.

Exam trap

The trap here is that candidates confuse the data dictionary (which describes data structure) with the data classification policy (which governs data sensitivity and handling), leading them to select the dictionary as the document that defines classification levels.

How to eliminate wrong answers

Option A is wrong because a data dictionary documents metadata such as field names, data types, and definitions, but it does not define classification levels or handling procedures for PII. Option C is wrong because a data quality report measures data accuracy, completeness, and consistency, not security or classification policies. Option D is wrong because a data flow diagram visually maps how data moves between systems, but it does not prescribe classification levels or handling rules.

28

MCQhard

A large financial institution is implementing a data governance framework to comply with new regulations requiring strict control over sensitive customer data. The data governance committee has identified several domains, including customer master data, transaction data, and risk assessment data. They need to decide on a master data management (MDM) approach that ensures a single, authoritative source of customer information across all systems. However, the current environment has multiple legacy systems with conflicting customer records. The committee is concerned about downtime and business disruption during the transition. Which MDM approach best balances data consistency with minimal operational impact?

A.Registry MDM, which creates a virtual view by linking records without moving data.

B.Centralized MDM, which replaces all systems with one central master system.

C.Consolidation MDM, which merges all customer data into a single physical hub.

D.Coexistence MDM, which maintains multiple masters with synchronization.

AnswerA

Correct: Registry avoids disruption by leaving data in place and creating a unified index.

Why this answer

Registry MDM is correct because it creates a virtual, unified view of customer data by linking records from multiple legacy systems via a registry index, without physically moving or consolidating data. This approach ensures a single authoritative source of customer information while avoiding downtime and business disruption, as the underlying systems remain unchanged and operational during the transition.
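
A toy sketch of the registry idea, under the assumption of two source systems and made-up keys. The system names, record IDs, and the `unified_view` helper are all hypothetical; a real MDM product would add matching, survivorship rules, and security.

```python
# Source systems keep their own records; nothing is copied or moved.
crm = {"crm-17": {"name": "Ana Diaz", "email": "ana@example.com"}}
billing = {"bill-204": {"name": "A. Diaz", "phone": "555-0100"}}
systems = {"crm": crm, "billing": billing}

# The registry stores only cross-references from a master ID to source keys.
registry = {"cust-001": [("crm", "crm-17"), ("billing", "bill-204")]}


def unified_view(master_id: str) -> dict:
    """Assemble a virtual customer record at query time from linked sources."""
    view = {}
    for system_name, local_key in registry[master_id]:
        for field, value in systems[system_name][local_key].items():
            view.setdefault(field, value)  # first linked source wins on conflict
    return view


print(unified_view("cust-001"))
```

The source dictionaries are only read, never rewritten, which mirrors why this style is associated with low disruption.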

Exam trap

The trap here is that candidates often confuse Registry MDM with Consolidation MDM, assuming that a single authoritative source requires physical data movement, when in fact a virtual registry can achieve the same goal with minimal operational impact.

How to eliminate wrong answers

Option B is wrong because Centralized MDM requires replacing all existing systems with a single central master system, which would cause significant downtime and business disruption due to the need for data migration and system decommissioning. Option C is wrong because Consolidation MDM merges all customer data into a single physical hub, which still requires moving data from legacy systems and can lead to operational interruptions during the consolidation process. Option D is wrong because Coexistence MDM maintains multiple masters with synchronization, which does not ensure a single authoritative source of customer information and can lead to data conflicts and inconsistency across systems.

29

MCQeasy

A retail company stores customer transaction data in a relational database. They want to analyze purchasing patterns over time. Which type of data structure best supports this analysis?

A.Relational table

B.Graph database

C.Document store

D.Key-value store

AnswerA

Relational tables with SQL support are excellent for structured data and time-series analysis.

Why this answer

A relational table is the correct choice because it organizes transaction data into structured rows and columns with defined schemas, enabling efficient SQL-based queries for time-series analysis (e.g., aggregating purchases by date, customer, or product). The relational model supports ACID transactions and joins across related tables (e.g., customers, products, transactions), which is essential for analyzing purchasing patterns over time while maintaining data integrity.
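
To illustrate the kind of time-based aggregation described above, here is a small sketch using Python's built-in `sqlite3` module. The table layout and sample rows are assumptions chosen for the example, not the company's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id TEXT, amount REAL, transaction_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("C1", 20.0, "2024-01-15"),
        ("C2", 30.0, "2024-01-20"),
        ("C1", 10.0, "2024-02-03"),
    ],
)

# Aggregate purchases by calendar month (first 7 characters of an ISO date).
monthly_totals = conn.execute(
    """
    SELECT substr(transaction_date, 1, 7) AS month, SUM(amount) AS total
    FROM transactions
    GROUP BY substr(transaction_date, 1, 7)
    ORDER BY month
    """
).fetchall()

print(monthly_totals)
```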

Exam trap

The trap here is that candidates may confuse 'analyzing purchasing patterns over time' with needing a graph database for relationships, but the key requirement is structured time-series aggregation, which is a core strength of relational tables, not graph or NoSQL stores.

How to eliminate wrong answers

Option B (Graph database) is wrong because graph databases excel at modeling relationships between entities (e.g., social networks or recommendation engines) but are generally not optimized for time-series aggregation or range queries over structured transaction data, which SQL handles directly with GROUP BY and window functions. Option C (Document store) is wrong because document stores (e.g., MongoDB) store semi-structured JSON-like documents, which can lead to data duplication and complex aggregation pipelines for time-based analysis; they typically do not enforce strict schemas by default and have traditionally offered weaker multi-document transaction guarantees than relational systems. Option D (Key-value store) is wrong because key-value stores (e.g., Redis) provide fast lookups by a single key but cannot efficiently query on multiple attributes (e.g., date range, product category) or perform relational joins, making them unsuitable for analytical queries on purchasing patterns.

30

MCQeasy

A data analyst at a marketing agency is working with a dataset containing customer demographics, purchase history, and social media engagement metrics. The agency wants to perform sentiment analysis on unstructured social media comments to identify brand perception. The dataset also includes structured fields like age, income, and purchase amounts. The analyst needs to choose a storage and processing platform that can handle both structured and unstructured data efficiently without requiring extensive schema definition upfront. Which platform should the analyst recommend?

A.Relational database (RDBMS)

B.Data lake

C.Data warehouse

D.NoSQL document database

AnswerB

Correct: Data lakes store raw data in native format and support schema-on-read for both structured and unstructured.

Why this answer

A data lake is the correct choice because it can store both structured data (e.g., age, income, purchase amounts) and unstructured data (e.g., social media comments) in its native format without requiring a predefined schema. This flexibility allows the analyst to ingest raw social media text for sentiment analysis and later apply schema-on-read for structured queries, avoiding the upfront schema definition needed by other platforms.
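
Schema-on-read can be sketched as follows: raw records are kept exactly as they arrive, and a set of fields is chosen only when the data is read. The sample records and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw records land in the lake as-is; no schema is enforced on write.
raw_lines = [
    '{"age": 34, "income": 52000, "purchase_amount": 19.99}',
    '{"comment": "Love this brand!", "platform": "social"}',
    '{"age": 41, "comment": "Delivery was slow."}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto the requested fields at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different views applied to the same raw data.
structured_rows = read_with_schema(raw_lines, ["age", "income"])
comments = [
    row["comment"]
    for row in read_with_schema(raw_lines, ["comment"])
    if row["comment"]
]
print(structured_rows)
print(comments)
```

Fields that a record does not contain come back as `None`, so the structured query and the text-analysis query can share one copy of the raw data.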

Exam trap

The trap here is that candidates often confuse a data warehouse with a data lake, assuming both can handle unstructured data, but a data warehouse requires structured, transformed data and cannot natively store raw social media comments without prior schema definition.

How to eliminate wrong answers

Option A is wrong because a relational database (RDBMS) requires a rigid, predefined schema and is optimized for structured data, making it inefficient for storing and processing unstructured social media comments without extensive ETL. Option C is wrong because a data warehouse is designed for structured, processed data and typically uses a schema-on-write approach, which cannot natively handle unstructured text like social media comments without significant transformation. Option D is wrong because a NoSQL document database can store semi-structured data (e.g., JSON) but is not optimized for large-scale, raw unstructured text and is not usually paired with the large-scale processing engines (e.g., Apache Spark or Hadoop) that are commonly run over a data lake for tasks such as sentiment analysis.

31

MCQmedium

Refer to the exhibit. A data analyst notices that direct S3 access to files outside the "incoming/" prefix is blocked. Which data governance principle does this policy enforce?

A.Data colocation

B.Data retention

C.Data access control

D.Data encryption

AnswerC

The policy restricts which objects can be accessed, controlling access based on prefix.

Why this answer

The policy blocks direct S3 access to files outside the 'incoming/' prefix, which restricts which users or roles can read or write objects in specific S3 prefixes. This is a classic implementation of data access control, as it enforces permissions based on the resource path, ensuring only authorized operations are allowed on designated data. In AWS S3, such restrictions are typically applied via bucket policies or IAM policies that use conditions like `s3:prefix` to limit access.

Exam trap

CompTIA often tests the distinction between access control and encryption by presenting a policy that restricts access based on a path or condition, leading candidates to confuse it with data encryption, which is about scrambling data rather than authorizing access.

How to eliminate wrong answers

Option A is wrong because data colocation refers to physically or logically placing related data together for performance or compliance, not to restricting access based on a prefix. Option B is wrong because data retention governs how long data is kept (e.g., lifecycle policies or retention periods), not who can access it. Option D is wrong because data encryption protects data at rest or in transit (e.g., using SSE-S3 or TLS), but the policy described does not mention encryption keys, algorithms, or any cryptographic controls.

32

MCQhard

A data audit reveals that some numbers in the "Revenue" column were manually entered from PDF invoices. This introduces potential errors. Which data concept is being addressed?

A.Data lineage

B.Data quality

C.Data security

D.Data governance

AnswerB

Correct. Manual entry from PDF invoices introduces potential errors, which is an accuracy (data quality) concern.

Why this answer

The scenario describes a data audit that identifies potential errors introduced by manual data entry from PDF invoices. This directly concerns the accuracy, consistency, and reliability of the data, which are core dimensions of data quality. Data quality encompasses the fitness of data for its intended use, including aspects like accuracy, completeness, and timeliness, making it the correct concept being addressed.

Exam trap

The trap here is that candidates may confuse the concept of data lineage (tracking data origins) with data quality (assessing data accuracy), but the question explicitly mentions 'potential errors' from manual entry, which is a quality concern, not a lineage tracking issue.

How to eliminate wrong answers

Option A is wrong because data lineage tracks the origin, movement, and transformation of data through its lifecycle, not the potential errors from manual entry. Option C is wrong because data security focuses on protecting data from unauthorized access, breaches, or corruption, not on the accuracy of manually entered values. Option D is wrong because data governance defines policies, roles, and procedures for managing data assets, but the specific issue of manual entry errors falls under data quality assessment, not governance frameworks.

33

MCQmedium

The exhibit shows a JSON schema for a dataset. Which statement correctly describes the data types represented?

A.All fields are qualitative

B.age and income are continuous; customer_id and education are discrete

C.customer_id and age are discrete quantitative; income is continuous quantitative; education is qualitative

D.All fields are quantitative

AnswerC

Integers are discrete; float is continuous; string is qualitative.

Why this answer

Option C is correct because it accurately classifies the data types: customer_id (a unique identifier) is discrete quantitative, age (countable whole numbers) is discrete quantitative, income (measurable on a continuous scale) is continuous quantitative, and education (categorical levels) is qualitative. This aligns with the JSON schema where numeric fields like age and income are represented as numbers, while education is a string and customer_id is an integer.

Exam trap

The trap here is that candidates often assume all numeric fields are continuous and therefore misclassify age. They also overlook that customer_id, although it functions as an identifier rather than a measurable quantity, is stored as an integer and is therefore grouped with the discrete values in this question.

How to eliminate wrong answers

Option A is wrong because it claims all fields are qualitative, but age, income, and customer_id are quantitative (numeric) fields, not categorical. Option B is wrong because it treats age as continuous, although age is recorded as discrete integer values, and it labels education as discrete when it is a qualitative (categorical) string field. Option D is wrong because it claims all fields are quantitative, but education is a string field representing categorical data, which is qualitative.

34

Multi-Selecteasy

An organization is implementing a data lake to store raw data from various sources. Which THREE characteristics are typically associated with a data lake compared to a data warehouse?

Select 3 answers

A.Supports batch and real-time processing

B.Stores data in its native format

C.Schema-on-read approach

D.Supports only structured data

E.Requires data transformation before loading

AnswersA, B, C

Correct: Data lakes handle both processing modes.

Why this answer

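Option A is correct because data lakes are designed to ingest and process data in both batch and real-time streaming modes, whereas data warehouses primarily handle batch ETL workloads. This flexibility is enabled by technologies like Apache Kafka for streaming and Apache Spark for batch processing, allowing the data lake to support diverse ingestion patterns without a predefined schema. Options B and C are correct because a data lake keeps data in its native format and applies a schema only when the data is read (schema-on-read). Options D and E instead describe a traditional data warehouse, which expects structured data that is transformed before loading.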

Exam trap

CompTIA often tests the misconception that data lakes require data transformation before loading (schema-on-write), when in fact they use schema-on-read, allowing raw data storage without upfront transformation.

35

MCQeasy

A company needs to store raw, unprocessed data from IoT sensors for future machine learning experiments. The data is in various formats and schemas are not yet defined. Which storage solution is most appropriate?

A.Data lake

B.Data mart

C.Data warehouse

D.Operational database

AnswerA

A data lake stores raw data in its native format and supports schema-on-read, making it ideal for storing unstructured and semi-structured data for future analysis.

Why this answer

A data lake is the correct choice because it stores raw, unprocessed data in its native format (structured, semi-structured, or unstructured) without requiring a predefined schema. This aligns perfectly with the need to ingest IoT sensor data in various formats for future machine learning experiments, where schemas are not yet defined. Unlike data warehouses or data marts, a data lake supports schema-on-read, allowing the data to be transformed and queried later as needed.

Exam trap

CompTIA often tests the misconception that 'raw data' belongs in a data warehouse because it is 'data,' but the trap is that data warehouses require structured, processed data with a fixed schema, while a data lake is specifically designed for raw, schema-less data storage.

How to eliminate wrong answers

Option B is wrong because a data mart is a subset of a data warehouse designed for a specific business line or department, requiring pre-defined schemas and processed data, not raw unprocessed data. Option C is wrong because a data warehouse stores structured, cleaned, and transformed data optimized for business intelligence and reporting, not raw data in various formats. Option D is wrong because an operational database (e.g., OLTP system) is designed for real-time transaction processing with strict schemas and ACID compliance, not for storing large volumes of raw, schema-less IoT data for future analytics.

36

Multi-Selecthard

Which THREE characteristics describe unstructured data? (Select three.)

Select 3 answers

A.Contains text, images, audio, video

B.Easily processed by standard SQL

C.Organized in rows and columns

D.Requires schema-on-read

E.Often stored in data lakes

AnswersA, D, E

Correct. Unstructured data includes multimedia and free text.

Why this answer

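Option A is correct because unstructured data lacks a predefined data model and is typically composed of non-tabular formats such as text, images, audio, and video. Unlike structured data, it does not fit the rows-and-columns model of a relational table without transformation. Options D and E follow from this: with no schema fixed at write time, structure is applied when the data is read (schema-on-read), and data lakes are a common place to store such raw content. The absence of a predefined model is what distinguishes unstructured data from structured or semi-structured types.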

Exam trap

The trap here is that candidates may confuse 'unstructured' with 'semi-structured' (e.g., JSON or XML) and incorrectly assume that SQL can process all non-tabular data, or they may forget that schema-on-read is a defining trait of unstructured data in modern big data architectures.

37

MCQmedium

A healthcare organization maintains a database of patient records. The database has a table 'patients' with columns: patient_id (primary key), first_name, last_name, date_of_birth, gender, and last_visit_date. A data analyst is tasked with creating a report that lists all patients who have not visited in the last two years. The analyst writes a query: SELECT * FROM patients WHERE last_visit_date < DATEADD(year, -2, GETDATE()); However, the query returns zero rows, even though the analyst knows there are patients who have not visited for over two years. Upon inspection, the analyst discovers that the last_visit_date column contains NULL values for patients who have never visited. Which modification to the query should the analyst make to include patients with NULL last_visit_date?

A.Remove the WHERE clause entirely.

B.Add OR last_visit_date IS NULL to the WHERE clause.

C.Use COALESCE(last_visit_date, '1900-01-01') in the WHERE clause.

D.Add AND last_visit_date IS NOT NULL to the WHERE clause.

AnswerB

Includes NULLs in the result.

Why this answer

The original query uses a WHERE clause that compares last_visit_date to a computed date, but NULL comparisons in SQL always yield UNKNOWN, so rows with NULL last_visit_date are excluded. Adding OR last_visit_date IS NULL explicitly includes those rows, ensuring patients who have never visited are listed in the report.
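
The behavior can be reproduced with Python's built-in `sqlite3` module. This is a sketch with made-up rows; SQLite's `date('now', '-2 years')` is used here as a stand-in for the SQL Server expression `DATEADD(year, -2, GETDATE())` in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, last_visit_date TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(1, "2015-03-01"), (2, None), (3, "2999-01-01")],  # old visit, never visited, far future
)

cutoff = "date('now', '-2 years')"  # SQLite stand-in for DATEADD(year, -2, GETDATE())

# The comparison against NULL evaluates to UNKNOWN, so patient 2 is filtered out.
without_null_check = conn.execute(
    f"SELECT patient_id FROM patients WHERE last_visit_date < {cutoff} "
    "ORDER BY patient_id"
).fetchall()

# Adding IS NULL brings the never-visited patient back into the result.
with_null_check = conn.execute(
    f"SELECT patient_id FROM patients WHERE last_visit_date < {cutoff} "
    "OR last_visit_date IS NULL ORDER BY patient_id"
).fetchall()

print(without_null_check, with_null_check)
```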

Exam trap

The trap here is that candidates often forget that NULL comparisons in SQL do not return TRUE, leading them to incorrectly think the original query already handles NULLs, and they may choose Option C (COALESCE) as a workaround instead of the simpler and correct IS NULL check.

How to eliminate wrong answers

Option A is wrong because removing the WHERE clause entirely would return all rows, including those with recent visits, which fails to filter for patients who have not visited in two years. Option C is wrong because COALESCE(last_visit_date, '1900-01-01') would replace NULL with a very old date so that the comparison succeeds, but it relies on an arbitrary sentinel value and wraps the column in a function, which can prevent the database from using an index on that column; the direct IS NULL check states the intent plainly. Option D is wrong because AND last_visit_date IS NOT NULL would explicitly exclude rows with NULL last_visit_date, which is the opposite of what is needed.

38

Multi-Selecteasy

Which TWO of the following are characteristics of a data lake?

Select 2 answers

A.Retains raw data in native format

B.Optimized for OLTP

C.Stores only structured data

D.Enforces ACID transactions

E.Uses schema-on-read

AnswersA, E

Data lakes store data as-is without transformation.

Why this answer

A data lake retains raw data in its native format, meaning data is ingested without transformation or schema enforcement. This allows storage of structured, semi-structured, and unstructured data as-is, preserving fidelity for future analytics. Unlike a data warehouse, a data lake does not require upfront schema definition, enabling flexible exploration and machine learning workloads.

Exam trap

The trap here is that candidates confuse data lakes with data warehouses, assuming all enterprise data stores enforce ACID and schema-on-write, when in fact data lakes prioritize raw storage and schema flexibility.

39

Multi-Selectmedium

Which TWO of the following are characteristics of structured data? (Choose TWO.)

Select 2 answers

A.Has a defined schema

B.Requires NoSQL databases for storage

C.Often contains natural language text

D.Cannot be queried using SQL

E.Organized in rows and columns

AnswersA, E

Schema defines structure.

Why this answer

Structured data is defined by having a predefined schema, which specifies the data types, constraints, and relationships for each field. This schema ensures consistency and allows for efficient querying and validation. Option A is correct because a defined schema is a fundamental characteristic of structured data, as seen in relational database tables where each column has a specific data type and constraints.

Exam trap

The trap here is that candidates often confuse structured data with semi-structured data (e.g., JSON or XML) and incorrectly assume that structured data cannot be queried with SQL or that it requires NoSQL databases.

40

MCQmedium

Refer to the exhibit. Which data concept does this exhibit best represent?

A.Entity relationship diagram

B.JSON schema

C.Data quality rule

D.Data dictionary

AnswerB

The exhibit is a JSON schema that validates the structure of JSON data.

Why this answer

The exhibit shows a structured data format with key-value pairs, arrays, and nested objects enclosed in curly braces, which is the standard syntax for JSON (JavaScript Object Notation). A JSON schema defines the structure, data types, and validation rules for JSON data, making it the best fit for the exhibit. The presence of fields like "name", "age", and "address" with specific types (e.g., string, integer) confirms this is a schema definition, not a generic data representation.

Exam trap

CompTIA often tests the distinction between a JSON schema (a structural definition with type constraints) and a raw JSON object (just data), leading candidates to mistake the schema for a simple data dictionary or ERD.

How to eliminate wrong answers

Option A is wrong because an entity relationship diagram (ERD) uses graphical symbols like rectangles and diamonds to represent entities and their relationships, not structured text with curly braces and key-value pairs. Option C is wrong because a data quality rule expresses a single business constraint such as uniqueness or completeness (e.g., 'age must be > 0'), whereas the exhibit is a structural blueprint of fields and data types. Option D is wrong because a data dictionary is a tabular or textual catalog of data elements with definitions, not a machine-readable schema with nested objects and arrays.

41

MCQhard

A financial institution wants to analyze transaction networks to detect fraud rings. Which database type is best suited for this analysis?

A.Wide-column store

B.Graph database

C.Key-value store

D.Document store

AnswerB

Graph databases excel at traversing relationships, perfect for network analysis.

Why this answer

A graph database is designed to store and traverse relationships between entities, making it ideal for analyzing transaction networks where connections between accounts, merchants, and transactions reveal fraud rings. Its native graph model (nodes and edges) allows efficient pattern matching and pathfinding queries, such as detecting circular transactions or shared attributes, which are common in fraud detection.
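
The kind of traversal described above can be sketched in plain Python with an adjacency list. This is only an illustration of the circular-transaction pattern on hypothetical accounts; a graph database would express the same search in its own query language and handle far larger networks.

```python
# Directed edges: payer account -> payee accounts (hypothetical data).
transfers = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],  # closes the ring A -> B -> C -> A
    "D": ["A"],
}


def find_ring(graph, start):
    """Return one path that leaves `start` and returns to it, or None."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for neighbor in graph.get(node, []):
            if neighbor == start:
                return path + [start]
            if neighbor not in path:  # only extend simple paths
                stack.append((neighbor, path + [neighbor]))
    return None


print(find_ring(transfers, "A"))
print(find_ring(transfers, "D"))
```

Account D pays into the ring but receives nothing back, so no cycle is reported for it.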

Exam trap

CompTIA often tests the misconception that any NoSQL database can handle relationship-heavy workloads, but the trap here is that only graph databases are purpose-built for deep relationship traversal and pattern matching, while other NoSQL types sacrifice relationship performance for scalability or flexibility.

How to eliminate wrong answers

Option A is wrong because wide-column stores (e.g., Cassandra, HBase) are optimized for high-volume, low-latency reads/writes on sparse data with flexible schemas, but they lack native relationship traversal capabilities, making multi-hop queries across transaction networks slow and complex. Option C is wrong because key-value stores (e.g., Redis, DynamoDB) provide fast lookups by primary key but cannot efficiently model or query the interconnected relationships between transactions and entities, requiring application-level joins that degrade performance. Option D is wrong because document stores (e.g., MongoDB, Couchbase) store semi-structured data as JSON-like documents and support indexing, but they do not have built-in graph traversal algorithms, so analyzing fraud rings would require expensive recursive queries or external graph processing.

42

MCQhard

A data analyst is troubleshooting a report that shows unusually high sales for a specific product. Upon investigation, the analyst finds that the product was returned by several customers, but the returns were recorded in a separate system and not reflected in the sales data. Which data integration concept was likely missing?

A.ETL (Extract, Transform, Load)

B.Data reconciliation

C.Data profiling

D.Data governance

AnswerB

Reconciliation ensures data matches across sources.

Why this answer

The core issue is that the sales data and returns data are inconsistent because they were not cross-verified. Data reconciliation is the process of comparing datasets to ensure they are in agreement and identifying discrepancies, such as returns not being reflected in sales figures. Without reconciliation, the analyst would not detect that the high sales number is inflated by unrecorded returns.

Exam trap

The trap here is that candidates confuse the data movement process (ETL) with the data validation process (reconciliation), assuming that simply extracting and loading data will automatically ensure consistency between separate systems.

How to eliminate wrong answers

Option A is wrong because ETL (Extract, Transform, Load) is a process for moving and transforming data from source to target systems, but it does not inherently include a step to compare or verify data consistency between separate systems; the missing concept here is not about data movement but about data agreement. Option C is wrong because data profiling focuses on examining data quality, structure, and content (e.g., nulls, duplicates, data types), not on cross-system consistency checks; the problem is not about the quality of the sales data itself but about its mismatch with returns data. Option D is wrong because data governance refers to the overall management of data availability, usability, integrity, and security through policies and standards, not a specific technical process for reconciling discrepancies between two systems.
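
A minimal sketch of a reconciliation check, assuming the sales and returns systems can each be exported as a mapping from product ID to units. The product IDs, quantities, and the `reconcile` helper are made up for illustration.

```python
# Hypothetical extracts from two separate systems, keyed by product ID.
sales_units = {"P100": 120, "P200": 45}
return_units = {"P100": 30}

def reconcile(sales, returns):
    """Return net units per product and the products whose totals changed."""
    net = {pid: qty - returns.get(pid, 0) for pid, qty in sales.items()}
    discrepancies = sorted(pid for pid in sales if returns.get(pid, 0) > 0)
    return net, discrepancies

net_units, flagged = reconcile(sales_units, return_units)
print(net_units, flagged)
```

The flagged list is the part that matters for the scenario: it surfaces products whose reported sales disagree with the returns system before the report is published.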

43

MCQmedium

A business needs to store large volumes of raw data in its native format for future analytics. Which storage architecture is most appropriate?

A.Relational database

B.Data lake

C.Operational data store

D.Data warehouse

AnswerB

Data lakes store raw data in its native format without transformation.

Why this answer

A data lake is designed to store large volumes of raw data in its native format (structured, semi-structured, or unstructured) without requiring a predefined schema. This makes it ideal for future analytics where the data schema may not yet be known, as it supports schema-on-read rather than schema-on-write.

Exam trap

The trap here is that candidates confuse a data warehouse with a data lake, assuming both are for analytics, but the key differentiator is that a data warehouse requires schema-on-write and processed data, while a data lake stores raw data in native format.

How to eliminate wrong answers

Option A is wrong because a relational database enforces a strict schema-on-write and is optimized for transactional processing (OLTP), not for storing raw, unprocessed data at scale. Option C is wrong because an operational data store (ODS) is used for integrating data from multiple operational systems for near-real-time reporting, not for storing raw data in native format for future analytics. Option D is wrong because a data warehouse stores cleansed, transformed, and structured data optimized for query performance and business intelligence, not raw data in its native format.
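
To make schema-on-read concrete, the sketch below stores raw JSON lines untouched and applies types and defaults only when a record is read. The field names and the `read_with_schema` helper are assumptions for illustration, not part of any particular data lake product.

```python
import json

# Hypothetical raw events stored as-is; no schema was enforced at write time.
raw_lines = [
    '{"user": "u1", "amount": "19.99", "channel": "web"}',
    '{"user": "u2", "amount": "5.00"}',
]

def read_with_schema(line):
    """Apply types and defaults only when the record is read."""
    record = json.loads(line)
    return {
        "user": str(record["user"]),
        "amount": float(record["amount"]),
        "channel": record.get("channel"),  # None when the field is absent
    }

rows = [read_with_schema(line) for line in raw_lines]
print(rows)
```

A schema-on-write system would instead reject or reshape the second record at load time, because it lacks the channel field.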

44

MCQeasy

A market research firm collects survey responses where customers rate satisfaction on a scale of 'Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied'. What type of data is being collected?

A.Interval

B.Ordinal

C.Ratio

D.Nominal

AnswerB

The satisfaction scale has a clear order, but the intervals between levels are not guaranteed to be equal, which fits the definition of ordinal data.

Why this answer

The data is ordinal because the satisfaction levels have a clear, ordered ranking from 'Very Unsatisfied' to 'Very Satisfied', but the intervals between categories are not necessarily equal. This type of categorical data preserves the order without assuming a consistent numerical difference between each level.

Exam trap

The trap here is that candidates mistakenly treat ordered categorical data as interval data because they assume the numeric labels (e.g., 1 to 5) imply equal spacing, but the exam expects you to recognize that the underlying measurement scale lacks guaranteed equal intervals.

How to eliminate wrong answers

Option A is wrong because interval data requires equal, measurable intervals between values (e.g., temperature in Celsius), but the satisfaction scale does not guarantee equal psychological distance between categories. Option C is wrong because ratio data requires a true, meaningful zero point (e.g., income, height), and 'Very Unsatisfied' does not represent an absolute absence of satisfaction. Option D is wrong because nominal data is unordered categorical data (e.g., colors, gender), but the satisfaction scale has a natural order that must be preserved.
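
One way to see what ordinal data does and does not support is to treat the labels as ranks. The sketch below, with hypothetical responses, sorts them and takes a median, both of which rely only on order; it avoids averaging the ranks, since that would assume equal spacing between levels.

```python
# The order of this list encodes the ranking; positions are ranks, not measurements.
LEVELS = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]
rank = {label: i for i, label in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = ["Satisfied", "Neutral", "Very Satisfied", "Unsatisfied", "Satisfied"]

# Sorting and the median are meaningful for ordinal data.
ordered = sorted(responses, key=rank.__getitem__)
median_response = ordered[len(ordered) // 2]
print(median_response)
```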

45

MCQhard

A data scientist is building a machine learning model to predict customer churn. The dataset includes both numerical features (age, income) and categorical features (gender, marital status). Which data concept describes the process of converting categorical features into numerical values that can be used by the algorithm?

A.Data sampling

B.Encoding

C.Feature scaling

D.Dimensionality reduction

AnswerB

Encoding converts categories to numbers, e.g., one-hot encoding.

Why this answer

Encoding is the correct data concept because it transforms categorical features (like gender and marital status) into numerical representations (e.g., one-hot encoding, label encoding) that machine learning algorithms can process. Unlike feature scaling or dimensionality reduction, encoding directly addresses the incompatibility of non-numeric data with mathematical model operations.

Exam trap

CompTIA often tests the distinction between encoding and feature scaling, where candidates mistakenly think scaling applies to categorical data, but scaling only adjusts numeric ranges and cannot convert text labels to numbers.

How to eliminate wrong answers

Option A is wrong because data sampling refers to selecting a subset of data for training/testing, not converting categorical data to numeric. Option C is wrong because feature scaling normalizes numerical ranges (e.g., via min-max scaling or z-score standardization) and does not handle categorical-to-numeric conversion. Option D is wrong because dimensionality reduction (e.g., PCA, t-SNE) reduces the number of features, but it assumes all input features are already numeric and does not address the encoding of categorical variables.
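
A minimal sketch of one-hot encoding in plain Python, using a hypothetical marital status column. The `one_hot` helper is illustrative; libraries such as pandas or scikit-learn provide their own encoders with more options.

```python
def one_hot(values):
    """Map each category to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

marital_status = ["single", "married", "single", "divorced"]  # hypothetical
categories, encoded = one_hot(marital_status)
print(categories)
print(encoded)
```

Each row contains exactly one 1, so the model sees the category without any implied ordering between values.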

46

MCQmedium

When the analyst runs the query, it fails. What is the most likely reason?

A.The alias 'TotalValue' cannot be used in the WHERE clause.

B.The table name 'Products' is misspelled.

C.The data types of Price and Quantity are incompatible.

D.The expression 'Price * Quantity' is invalid in SQL.

AnswerA

Aliases are not recognized in WHERE due to order of execution.

Why this answer

The alias 'TotalValue' is defined in the SELECT clause but is referenced in the WHERE clause. In standard SQL, column aliases cannot be used in the WHERE clause because WHERE is logically evaluated before SELECT, so the alias does not yet exist at that point in the query execution order. Most database engines report this as an invalid or unknown column error.

Exam trap

CompTIA often tests the SQL query execution order, specifically that column aliases cannot be used in the WHERE clause, leading candidates to mistakenly think the alias is available everywhere in the query.

How to eliminate wrong answers

Option B is wrong because a misspelled table name would cause a 'table not found' error, not the alias-related failure described. Option C is wrong because Price and Quantity are typically numeric types (e.g., DECIMAL, INT), and multiplication is valid between compatible numeric types; if they were incompatible, the error would be about implicit conversion, not alias usage. Option D is wrong because 'Price * Quantity' is a valid arithmetic expression in SQL, and the multiplication operator works on numeric columns.
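
The exhibit query is not reproduced here, so the sketch below assumes a Products table with Price and Quantity columns and a filter on the computed total. It shows two portable rewrites: repeating the expression in WHERE, or defining the alias in a subquery and filtering outside it. The rows and the threshold are hypothetical. SQLite is used only because it ships with Python; it is more permissive about aliases than most engines, so this demonstrates the rewrites rather than the failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Products (Name TEXT, Price REAL, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [("widget", 10.0, 5), ("gadget", 100.0, 20)],  # hypothetical rows
)

# Rewrite 1: repeat the expression in WHERE instead of the alias.
repeated = conn.execute(
    "SELECT Name, Price * Quantity AS TotalValue "
    "FROM Products WHERE Price * Quantity > 1000"
).fetchall()

# Rewrite 2: define the alias in a subquery, then filter on it outside.
subquery = conn.execute(
    "SELECT Name, TotalValue FROM "
    "(SELECT Name, Price * Quantity AS TotalValue FROM Products) AS p "
    "WHERE TotalValue > 1000"
).fetchall()

print(repeated, subquery)
```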

47

Drag & Dropmedium

Drag and drop the steps for the ETL (Extract, Transform, Load) process in the correct order.

Why this order

ETL begins with extraction, followed by cleaning, transformation, loading, and verification.
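
The order above can be sketched as a chain of small functions, one per step. Everything here is a simplified stand-in: the sample rows, the cleaning rules, and the in-memory list playing the role of the warehouse are assumptions for illustration.

```python
def extract():
    # Hypothetical source rows; a real job would read from files or databases.
    return [
        {"id": "1", "name": " Ann "},
        {"id": "2", "name": None},
        {"id": "1", "name": " Ann "},
    ]

def clean(rows):
    """Drop rows with missing names and rows whose ID was already seen."""
    seen, kept = set(), []
    for row in rows:
        if row["name"] is None or row["id"] in seen:
            continue
        seen.add(row["id"])
        kept.append(row)
    return kept

def transform(rows):
    """Convert IDs to integers and normalize names."""
    return [{"id": int(r["id"]), "name": r["name"].strip().upper()} for r in rows]

def load(rows, target):
    target.extend(rows)
    return len(rows)

def verify(loaded_count, target):
    """Confirm the target holds exactly as many rows as were loaded."""
    return loaded_count == len(target)

warehouse = []
loaded = load(transform(clean(extract())), warehouse)
ok = verify(loaded, warehouse)
print(warehouse, ok)
```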

48

MCQmedium

A data analyst finds that the "Age" column contains values like "N/A", "unknown", and negative numbers. Which data quality dimension is primarily affected?

A.Accuracy

B.Consistency

C.Validity

D.Completeness

AnswerC

Correct. The values are not valid for the Age attribute.

Why this answer

Validity refers to the degree to which data conforms to defined business rules or constraints. Invalid entries (non-numeric, negative) violate the expected data type and range.
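
A validity rule for this column might look like the sketch below. The 0 to 120 range is an assumed business rule, and the sample values are hypothetical.

```python
def is_valid_age(value, low=0, high=120):
    """Valid means: parses as an integer and falls inside the allowed range."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return False
    return low <= age <= high

ages = ["34", "N/A", "unknown", "-5", "61"]  # hypothetical column values
invalid = [a for a in ages if not is_valid_age(a)]
print(invalid)
```

Note that the check says nothing about whether "34" is the person's true age; that would be an accuracy question, not a validity one.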

49

MCQeasy

Which of the following is an example of qualitative data?

A.Stock price

B.Customer feedback comments

C.Number of website visitors

D.Product weight in grams

AnswerB

Comments are text, non-numeric, qualitative data.

Why this answer

Customer feedback comments are qualitative data because they consist of non-numerical, descriptive text that captures opinions, sentiments, or experiences. Unlike quantitative data, which can be measured or counted, qualitative data is categorical and often requires thematic analysis to derive insights.

Exam trap

The trap here is that candidates often confuse 'qualitative' with 'quantifiable' and may incorrectly select a numeric option like stock price or website visitors, not realizing that qualitative data is inherently non-numeric and descriptive.

How to eliminate wrong answers

Option A is wrong because stock price is a numerical value that can be measured and compared, making it quantitative data. Option C is wrong because the number of website visitors is a count, which is a discrete numerical value and thus quantitative data. Option D is wrong because product weight in grams is a continuous numerical measurement, falling under quantitative data.

50

MCQeasy

Refer to the exhibit. An Avro schema is defined as shown. Which data design concept does this represent?

A.Schema-on-read

B.Schema-less design

C.Dynamic schema

D.Schema-on-write

AnswerD

Avro uses a predefined schema that is applied when data is written, typical of schema-on-write.

Why this answer

Avro requires defining the schema before writing data, imposing structure at write time (schema-on-write). Schema-on-read would apply structure when reading, and schema-less design has no predefined schema.

51

MCQmedium

A data engineer is comparing data warehouses and data lakes. Which statement accurately describes a data warehouse?

A.Typically stores data in object storage

B.Optimized for complex queries on structured data

C.Stores raw, unprocessed data

D.Uses schema-on-read

AnswerB

Data warehouses are designed for analytical queries on structured data.

Why this answer

A data warehouse is optimized for complex queries on structured data because it uses a schema-on-write approach, where data is cleaned, transformed, and organized into relational tables (e.g., star or snowflake schemas) before loading. This pre-processing enables efficient execution of aggregations, joins, and reporting queries using SQL, making it ideal for business intelligence and analytics. In contrast, data lakes store raw data in native formats and rely on schema-on-read, which is less performant for structured query patterns.

Exam trap

The trap here is that candidates confuse the storage location (object storage) or data state (raw vs. processed) with the defining characteristic of a data warehouse, which is its schema-on-write design and optimization for structured query performance.

How to eliminate wrong answers

Option A is wrong because data warehouses typically store data in relational tables managed by the warehouse's own storage engine (often columnar), not as raw files in object storage such as Amazon S3 or Azure Blob Storage, which is characteristic of data lakes. Option C is wrong because data warehouses store processed, transformed, and cleansed data optimized for analysis, not raw, unprocessed data; raw data is a hallmark of data lakes. Option D is wrong because data warehouses use schema-on-write, where the schema is defined and enforced at data ingestion time, whereas schema-on-read is a property of data lakes where the schema is applied only when the data is queried.

52

MCQhard

A mid-sized e-commerce company stores customer data in a relational database. The database has a table named 'Customers' with columns: CustomerID (primary key), FirstName, LastName, Email, Phone, Address, City, State, ZipCode, and SignUpDate. The company is migrating to a new CRM system that requires a denormalized structure for performance reasons. The new system expects a single table 'CustomerDetails' with columns: CustomerID, FullName (concatenation of first and last name), ContactInfo (JSON object containing email, phone, and address), SignUpDate, and Region (derived from state). The data analyst must design an ETL process to transform the data. During a test run, the analyst notices that some records have missing Phone or Address values. Which of the following is the best approach to handle missing data in the ContactInfo JSON object?

A.Exclude any record with missing Phone or Address from the migration.

B.Set missing values to an empty string in the JSON object.

C.Include the missing fields as null in the JSON object.

D.Replace missing values with 'N/A' string.

AnswerC

Null explicitly indicates missing data.

Why this answer

Option C is correct because representing missing fields as null in the JSON object preserves the data structure and allows downstream systems to explicitly handle null values. This approach maintains data integrity without discarding records or introducing ambiguous placeholder strings that could be misinterpreted as actual data.

Exam trap

The trap here is that candidates may confuse 'handling missing data' with 'filling in missing data,' leading them to choose placeholder strings (B or D) instead of preserving the null representation that JSON natively supports.

How to eliminate wrong answers

Option A is wrong because excluding records with missing Phone or Address would result in data loss, violating the migration requirement to preserve all customer data. Option B is wrong because setting missing values to an empty string conflates 'no data' with 'empty data,' which can cause incorrect processing in JSON parsers or CRM logic that expects null for absent values. Option D is wrong because replacing missing values with 'N/A' string introduces a non-standard placeholder that may be treated as valid data, leading to errors in downstream analytics or validation rules.
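
A short sketch of the null-preserving approach, using Python's standard json module, which serializes None as JSON null. The `contact_info` helper and the sample customer are hypothetical.

```python
import json

def contact_info(email, phone, address):
    """Build the ContactInfo JSON text; missing values stay None and become null."""
    return json.dumps({"email": email, "phone": phone, "address": address})

# Hypothetical customer with no phone on file.
payload = contact_info("ann@example.com", None, "1 Main St")
print(payload)

# Downstream code can test for absence explicitly after parsing.
parsed = json.loads(payload)
```

Because the key is still present, consumers can distinguish "no phone recorded" from a phone field that was never part of the structure.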

53

MCQeasy

Refer to the exhibit. The data shown is an example of which data concept?

A.Structured data

B.Unstructured data

C.Metadata

D.Semi-structured data

AnswerA

Structured data conforms to a predefined schema, as shown in the CSV.

Why this answer

The exhibit shows data organized into rows and columns with a fixed schema (e.g., 'Name', 'Age', 'City'), which is the defining characteristic of structured data. Structured data conforms to a predefined data model, typically stored in relational databases or spreadsheets, and can be easily queried using SQL. This tabular format with consistent data types per column is the classic example of structured data.

Exam trap

CompTIA often tests the distinction between structured and semi-structured data, trapping candidates who see any organization (like tags in JSON) and mistakenly label it as structured, when the rigid row-and-column format is the key differentiator.

How to eliminate wrong answers

Option B is wrong because unstructured data lacks a predefined schema or organization, such as raw text, images, or audio files, whereas the exhibit clearly has a tabular structure. Option C is wrong because metadata is 'data about data' (e.g., file size, creation date), not the actual data values shown in the table. Option D is wrong because semi-structured data (e.g., JSON, XML) has some organizational properties (tags, key-value pairs) but does not enforce a rigid row-and-column schema like the exhibit does.

54

MCQmedium

An e-commerce company uses a star schema for its data warehouse. The fact table 'sales_fact' contains foreign keys to dimension tables: customer_dim, product_dim, time_dim, and store_dim. A business user wants to know the total sales for each product category in the last month. Which join operation is required to retrieve this data?

A.Self-join on the fact table

B.Cross join between fact and dimension tables

C.Inner join between fact table and dimension tables

D.Left outer join between fact and dimension tables

AnswerC

Inner join returns only matching rows, which is typical in star schema queries.

Why this answer

To retrieve total sales for each product category, you need to join the fact table with the product dimension table to map product keys to categories, and with the time dimension table to filter on the last month. An inner join is correct because it returns only rows where matching keys exist in both tables, which is the standard approach for star-schema queries where all required dimension attributes are present. This ensures that only valid sales transactions with corresponding product and time entries are included in the aggregation.

Exam trap

The trap here is that candidates often assume a left outer join is needed to 'preserve all fact rows.' In a well-designed star schema with referential integrity, an inner join is sufficient and usually more performant; a left outer join is needed only when fact rows might lack matching dimension keys (e.g., orphaned records).

How to eliminate wrong answers

Option A is wrong because a self-join on the fact table would match rows within the same table, which is unnecessary here since the required attributes (product category and month) are in dimension tables, not in the fact table itself. Option B is wrong because a cross join between fact and dimension tables would produce a Cartesian product, generating every possible combination of fact rows with dimension rows, leading to massively inflated and incorrect sales totals. Option D is wrong because a left outer join would include fact rows even if there is no matching dimension row (e.g., a product key not in product_dim), which could introduce NULL values for category and potentially skew the aggregation; inner join is the standard for guaranteed referential integrity in a star schema.
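
The query shape described above can be sketched with an in-memory SQLite database. The column names (product_key, time_key, category, month, amount) and the sample rows are assumptions, since the question names only the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE sales_fact (product_key INTEGER, time_key INTEGER, amount REAL);
INSERT INTO product_dim VALUES (1, 'Toys'), (2, 'Books');
INSERT INTO time_dim VALUES (10, '2024-05'), (11, '2024-06');
INSERT INTO sales_fact VALUES (1, 11, 20.0), (1, 11, 5.0), (2, 11, 12.0), (2, 10, 99.0);
""")

# Inner joins map fact keys to category and month; the filter keeps one month.
totals = conn.execute("""
SELECT p.category, SUM(f.amount) AS total_sales
FROM sales_fact AS f
INNER JOIN product_dim AS p ON p.product_key = f.product_key
INNER JOIN time_dim AS t ON t.time_key = f.time_key
WHERE t.month = ?
GROUP BY p.category
ORDER BY p.category
""", ("2024-06",)).fetchall()
print(totals)
```

Only product_dim and time_dim are joined because the question needs category and month; customer_dim and store_dim can be left out of the query entirely.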

55

MCQhard

A sensor records temperature readings in Celsius and a separate sensor records wind speed in meters per second. A data scientist wants to combine these datasets for analysis. Which statement accurately compares these data types?

A.Both are ratio data

B.Temperature is discrete; wind speed is continuous

C.Both are discrete data

D.Temperature is interval; wind speed is ratio

AnswerD

Celsius has no true zero (interval), while wind speed has a true zero (ratio).

Why this answer

Temperature measured in Celsius has an arbitrary zero point (0°C does not mean 'no heat'), so it is interval data. Wind speed in meters per second has a true zero point (0 m/s means no wind), making it ratio data. Therefore, option D correctly identifies temperature as interval and wind speed as ratio.

Exam trap

The trap here is confusing interval and ratio data by overlooking the significance of a true zero point, leading candidates to incorrectly classify temperature as ratio data.

How to eliminate wrong answers

Option A is wrong because temperature in Celsius is interval data, not ratio data, due to the lack of a true zero point. Option B is wrong because temperature is continuous (can take any value within a range), not discrete; wind speed is also continuous. Option C is wrong because both temperature and wind speed are continuous data types, not discrete.
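
A quick arithmetic check of why the true zero matters, using hypothetical readings of 10 and 20 on each sensor. Dividing Celsius values gives a ratio with no physical meaning, while the same division is meaningful for wind speed.

```python
def celsius_to_kelvin(c):
    return c + 273.15

# Ratios of Celsius values are not meaningful: 20 °C is not "twice as hot" as 10 °C.
naive_ratio = 20 / 10
# On the Kelvin scale, which has a true zero, the ratio is close to 1.
physical_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Wind speed has a true zero, so 20 m/s really is twice 10 m/s.
wind_ratio = 20 / 10

print(naive_ratio, round(physical_ratio, 3), wind_ratio)
```

Differences remain valid on both scales: a 10-degree change in Celsius is the same size as a 10-kelvin change, which is what makes Celsius interval data.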

56

Multi-Selecthard

Which TWO of the following are examples of data governance best practices?

Select 2 answers

A.Defining data owners for each dataset

B.Implementing data quality standards

C.Creating indexes on frequently queried columns

D.Using a data lake for storage

E.Encrypting all data at rest

AnswersA, B

Ownership is a governance practice.

Why this answer

Option A is correct because defining data owners for each dataset is a core data governance practice that establishes accountability and responsibility for data assets. Data owners are typically senior stakeholders who ensure data is managed according to policies, including access controls and quality standards. This practice aligns with frameworks like DAMA-DMBOK, which emphasizes stewardship and ownership as foundational to governance. Option B is correct because data quality standards are likewise a governance concern: they set the policy-level expectations that datasets must meet, independent of how the data is stored or tuned.

Exam trap

CompTIA often tests the distinction between data governance (policies, ownership, quality) and data management (implementation, storage, performance) or security (encryption, access controls), leading candidates to confuse operational tasks with governance practices.

57

MCQeasy

A market researcher conducts a survey with questions like "What is your favorite brand?" and "How many units do you purchase per year?" Which data types correspond?

A.Qualitative & Quantitative

B.Quantitative & Qualitative

C.Both quantitative

D.Both qualitative

AnswerA

Correct. Brand is qualitative; units is quantitative.

Why this answer

Option A is correct because 'favorite brand' is a categorical label (qualitative data), while 'units purchased per year' is a numerical count (quantitative data). The question explicitly pairs these two distinct data types, matching the definition of qualitative (non-numeric categories) and quantitative (numeric measurements).

Exam trap

The trap here is that candidates often confuse the order of the data types in the question, assuming the first listed data type must be quantitative, leading them to select Option B instead of correctly identifying 'favorite brand' as qualitative.

How to eliminate wrong answers

Option B is wrong because it reverses the order: 'favorite brand' is qualitative, not quantitative, and 'units purchased per year' is quantitative, not qualitative. Option C is wrong because 'favorite brand' is not a numeric value; it is a categorical label, so both cannot be quantitative. Option D is wrong because 'units purchased per year' is a numeric count, not a categorical label, so both cannot be qualitative.

58

Multi-Selectmedium

Which TWO roles are primarily responsible for defining and enforcing data governance policies within an organization?

Select 2 answers

A.Data analyst

B.Data architect

C.Data custodian

D.Data steward

E.Data owner

AnswersD, E

Ensures compliance and enforces data governance rules.

Why this answer

Option D (Data steward) is correct because data stewards are responsible for the day-to-day management, quality, and enforcement of data governance policies, including data classification, access controls, and compliance with regulatory standards. Option E (Data owner) is correct because data owners are senior stakeholders who define the governance policies, approve data access decisions, and are accountable for the data assets within their domain.

Exam trap

The trap here is that candidates often confuse 'data custodian' (technical implementation) with 'data steward' (policy enforcement), or assume 'data analyst' has governance authority because they work closely with data, but the exam specifically tests the distinct RACI model roles in data governance.

59

MCQeasy

A company stores customer data in a relational database with tables for orders, products, and customers. Which type of data best describes this?

A.Structured data

B.Unstructured data

C.Qualitative data

D.Semi-structured data

AnswerA

Relational databases impose a strict schema, making data structured.

Why this answer

A is correct because the data is stored in a relational database with predefined schemas (tables for orders, products, and customers), which enforces a fixed structure of rows and columns. This makes it structured data, as each field has a specific data type and relationships are defined via foreign keys, enabling efficient querying with SQL.

Exam trap

CompTIA often tests whether candidates can separate where data is stored from how it is structured. Data in a database is not automatically structured (JSON documents in a NoSQL store are semi-structured), but relational tables with a fixed schema are strictly structured.

How to eliminate wrong answers

Option B is wrong because unstructured data lacks a predefined schema and cannot be stored in relational tables; examples include text files, images, or videos. Option C is wrong because qualitative data is non-numerical and descriptive (e.g., customer feedback text), but the scenario describes structured tables with quantitative and categorical fields. Option D is wrong because semi-structured data has some organizational properties (like tags or key-value pairs) but does not conform to a rigid relational schema; examples include JSON or XML files, not relational database tables.

60

Multi-Selecteasy

Which TWO of the following are considered internal data sources within an organization?

Select 2 answers

A.Social media feeds

B.Employee payroll data

C.Government census data

D.Sales transaction records

E.Market research reports from third parties

AnswersB, D

Payroll data is generated and maintained internally by HR systems.

Why this answer

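Employee payroll data is generated and stored internally by an organization's HR or finance systems, making it an internal data source. It is proprietary and not accessible from outside the organization, fitting the definition of internal data. Sales transaction records (Option D) are internal for the same reason: they are produced by the organization's own sales systems.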

Exam trap

The trap here is that candidates may confuse 'data used internally' with 'internal data source,' mistakenly selecting options like social media feeds or third-party reports because the organization uses them for analysis, even though they originate externally.

61

MCQmedium

A financial application requires fast query performance for aggregations on large historical datasets. The schema has many lookup tables. Which schema design is most efficient for this workload?

A.Snowflake schema

B.Star schema

C.Wide table

D.Third normal form (3NF)

AnswerB

Star schema denormalizes dimension tables, reducing the number of joins and improving query performance for aggregations.

Why this answer

The star schema is most efficient for this workload because it folds the many lookup tables into a small number of denormalized dimension tables, reducing the number of joins required for aggregations. Fewer joins mean simpler query plans and faster scans over large historical datasets, which is critical for financial applications needing rapid aggregations.

Exam trap

The trap here is that candidates often confuse normalization with performance, assuming snowflake or 3NF schemas are faster due to reduced redundancy, when in fact denormalization in a star schema minimizes joins for analytical queries.

How to eliminate wrong answers

Option A is wrong because the snowflake schema normalizes dimension tables into sub-dimensions, increasing join complexity and degrading query performance on large datasets. Option C is wrong because a wide table, while denormalized, leads to excessive redundancy and storage overhead, and can cause performance issues due to wide row scans and index inefficiencies. Option D is wrong because third normal form (3NF) prioritizes data integrity over query speed, requiring many joins that slow down aggregations on historical data.

62

MCQeasy

A retail company processes daily transactions. The current system transforms data before loading it into the data warehouse. The volume is growing rapidly, and they want to load raw data first to reduce processing time. Which approach should they adopt?

A.Change data capture (CDC)

B.ETL (Extract, Transform, Load)

C.ELT (Extract, Load, Transform)

D.Data replication

AnswerC

ELT loads raw data first, then transforms in the warehouse, reducing initial load time and utilizing warehouse resources.

Why this answer

The correct answer is C (ELT) because the company wants to load raw data first and then transform it later, reducing initial processing time. ELT leverages the power of modern data warehouses to perform transformations after loading, which is ideal for rapidly growing volumes of raw transaction data.

Exam trap

The trap here is that candidates often confuse ETL and ELT, assuming that 'transform before load' (ETL) is always faster, but the question explicitly states the goal is to reduce processing time by loading raw data first, which directly points to ELT.

How to eliminate wrong answers

Option A is wrong because Change Data Capture (CDC) is a technique for capturing incremental changes from source systems, not a data loading approach that loads raw data first. Option B is wrong because ETL (Extract, Transform, Load) transforms data before loading, which contradicts the requirement to reduce processing time by loading raw data first. Option D is wrong because Data Replication copies data between systems in real-time or near-real-time, but it does not inherently load raw data into a data warehouse for later transformation.
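
The load-first, transform-later pattern can be sketched with an in-memory SQLite database standing in for the warehouse. The table names, columns, and rows are hypothetical.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse

# Load: raw rows go in untouched, as text.
warehouse.execute("CREATE TABLE raw_transactions (store TEXT, amount TEXT)")
warehouse.executemany(
    "INSERT INTO raw_transactions VALUES (?, ?)",
    [(" north ", "10.50"), ("NORTH", "4.50"), ("south", "7.00")],  # hypothetical
)

# Transform: done afterwards, inside the warehouse, with SQL.
warehouse.execute("""
CREATE TABLE daily_sales AS
SELECT UPPER(TRIM(store)) AS store, SUM(CAST(amount AS REAL)) AS total
FROM raw_transactions
GROUP BY UPPER(TRIM(store))
""")

result = warehouse.execute(
    "SELECT store, total FROM daily_sales ORDER BY store"
).fetchall()
print(result)
```

The raw table is kept, so the transformation can be rerun or changed later without extracting from the source systems again.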

63

Multi-Selecteasy

Which TWO are examples of primary data? (Select two.)

Select 2 answers

A.Industry reports from a trade association

B.Government census data

C.Customer survey responses collected by the company itself

D.Company sales records

E.Social media data purchased from a vendor

AnswersC, D

Correct. Surveys conducted by the company are primary.

Why this answer

Option C is correct because primary data is collected directly by the researcher or organization for a specific purpose. Customer survey responses gathered by the company itself are firsthand, original data that have not been previously published or aggregated by an external source. This aligns with the definition of primary data as original, unprocessed information collected from the source. Option D is correct on the same basis: company sales records are generated directly by the company's own transactions rather than obtained from an external party.

Exam trap

CompTIA often tests the distinction between primary and secondary data by including options that appear firsthand but are actually collected by an external entity, such as purchased datasets or government reports, leading candidates to mistakenly classify them as primary.

64

MCQmedium

Refer to the exhibit. A data analyst is trying to understand access permissions for the company-data bucket. Which statement accurately describes the effective permissions?

A.DataAnalyst can read objects in prod bucket except those in the sensitive subfolder.

B.DataAnalyst can read all objects in the prod bucket, including the sensitive subfolder.

C.No one can read from the prod bucket except DataAnalyst.

D.Only DataAnalyst is allowed to read from the entire prod bucket.

AnswerA

Allow on prod/*, Deny on prod/sensitive/* explicitly blocks access to sensitive subfolder.

Why this answer

Option A is correct because the exhibit shows an IAM policy that grants the DataAnalyst user s3:GetObject permission on the prod bucket but also contains a Deny statement scoped to the sensitive subfolder. In AWS IAM policy evaluation, an explicit Deny always overrides an Allow, regardless of the order in which the statements appear, so the Deny on the sensitive subfolder blocks read access to objects there while reads elsewhere in the bucket remain permitted.

Exam trap

The trap here is that candidates often assume an Allow statement on a bucket grants full access to all objects, forgetting that an explicit Deny on a subfolder (via a condition key) takes precedence and creates a narrower effective permission.

How to eliminate wrong answers

Option B is wrong because it claims DataAnalyst can read all objects including the sensitive subfolder, but the explicit Deny on that subfolder prevents read access, so this statement is false. Option C is wrong because it states 'No one can read from the prod bucket except DataAnalyst,' which is incorrect; the policy only applies to DataAnalyst and does not grant or deny permissions to other principals, so other users or roles may have separate policies allowing read access. Option D is wrong because it says 'Only DataAnalyst is allowed to read from the entire prod bucket,' but the Deny on the sensitive subfolder means DataAnalyst cannot read from the entire bucket, and other principals might also have read permissions via different policies.
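
The precedence rule can be illustrated with a small sketch. This is not the AWS evaluation engine; the two statements are hypothetical stand-ins for the exhibit (which is not reproduced here), modeled on the answer note's "Allow on prod/*, Deny on prod/sensitive/*", and `is_allowed` is an illustrative helper name.

```python
from fnmatch import fnmatchcase

# Hypothetical statements modeled on the answer note; not the real exhibit.
STATEMENTS = [
    {"effect": "Allow", "action": "s3:GetObject", "resource": "prod/*"},
    {"effect": "Deny", "action": "s3:GetObject", "resource": "prod/sensitive/*"},
]


def is_allowed(action: str, key: str) -> bool:
    """Return True only if some Allow matches and no Deny matches."""
    matched_effects = [
        stmt["effect"]
        for stmt in STATEMENTS
        if stmt["action"] == action and fnmatchcase(key, stmt["resource"])
    ]
    if "Deny" in matched_effects:
        # An explicit Deny wins, whatever order the statements are in.
        return False
    # With no matching Allow, the request falls through to an implicit deny.
    return "Allow" in matched_effects


print(is_allowed("s3:GetObject", "prod/reports/q1.csv"))     # True
print(is_allowed("s3:GetObject", "prod/sensitive/ssn.csv"))  # False
print(is_allowed("s3:GetObject", "staging/readme.txt"))      # False
```

The second call matches both statements, and the Deny decides the outcome, which is the behavior Option A describes.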

65

MCQhard

During an ETL process, a data quality check fails due to duplicate customer IDs. Which data quality dimension is violated?

A.Consistency

B.Uniqueness

C.Completeness

D.Accuracy

AnswerB

Duplicates violate the uniqueness dimension.

Why this answer

Duplicate customer IDs violate the uniqueness dimension because uniqueness ensures that each record in a dataset has a distinct identifier with no duplicates. In an ETL process, a primary key or unique constraint on the customer ID column would reject duplicate values, causing the data quality check to fail. This is distinct from consistency, which checks for logical agreement across data sources.

Exam trap

The trap here is that candidates confuse uniqueness with accuracy, thinking a duplicate ID is 'inaccurate' data, but accuracy concerns correctness of values, not their distinctness.

How to eliminate wrong answers

Option A is wrong because consistency refers to data being logically coherent across systems (e.g., same customer name in CRM and ERP), not to the absence of duplicate IDs. Option C is wrong because completeness measures whether all required data is present (e.g., missing customer names), not whether values are duplicated. Option D is wrong because accuracy checks if data correctly reflects real-world values (e.g., correct spelling of a name), not uniqueness of identifiers.
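
A uniqueness check of the kind described could look like the following minimal sketch. The row layout, the `customer_id` field name, and the `duplicate_ids` helper are assumptions made for illustration.

```python
from collections import Counter


def duplicate_ids(rows, key="customer_id"):
    """Return the sorted key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical staging rows; customer 101 appears twice.
rows = [
    {"customer_id": 101, "name": "Ana"},
    {"customer_id": 102, "name": "Ben"},
    {"customer_id": 101, "name": "Ana M."},
]

dupes = duplicate_ids(rows)
print(dupes)  # [101]
```

A non-empty result is what would make a uniqueness check fail before the load step.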

66

MCQeasy

A data analyst needs to ensure that a customer's address is stored in a consistent format across multiple databases. Which data quality dimension is the analyst primarily concerned with?

A.Consistency

B.Completeness

C.Accuracy

D.Timeliness

AnswerA

Consistency ensures data is uniform across systems.

Why this answer

The data analyst is primarily concerned with consistency, which ensures that the same data values are represented uniformly across different systems or databases. In this scenario, the customer's address must follow the same format (e.g., street, city, state, ZIP code) in every database to enable reliable merging and querying. Consistency is a key data quality dimension that focuses on cross-system uniformity, distinct from accuracy (correctness of values) or completeness (presence of all required fields).

Exam trap

The trap here is that candidates often confuse consistency with accuracy, thinking that if the address is correct (accurate), it must be consistent, but consistency is about format uniformity across systems, not the truthfulness of the data.

How to eliminate wrong answers

Option B (Completeness) is wrong because completeness measures whether all required data fields are present, not whether the data is formatted uniformly across databases. Option C (Accuracy) is wrong because accuracy refers to the correctness of the data values relative to the real-world entity, not the format or representation. Option D (Timeliness) is wrong because timeliness concerns whether the data is up-to-date and available when needed, not the consistency of its format across systems.

67

MCQmedium

In a customer database, each row represents a customer with columns: CustomerID, Name, Address, Phone. What does the column "Name" represent?

A.Instance

B.Entity

C.Attribute

D.Record

AnswerC

Correct. Name is an attribute of the customer entity.

Why this answer

In the context of a relational database, a column represents an attribute of an entity. The 'Name' column stores a specific characteristic (the customer's name) for each row, making it an attribute. This aligns with the data modeling concept where attributes define the properties of an entity.

Exam trap

The trap here is that candidates confuse 'attribute' with 'record' because they think of a row as containing all attributes, but the question specifically asks what a single column represents, not the row itself.

How to eliminate wrong answers

Option A is wrong because an instance refers to a single occurrence of an entity (e.g., a specific customer row), not a column. Option B is wrong because an entity is a table-level concept representing a real-world object (e.g., the Customer table), not a column within it. Option D is wrong because a record is a row in the table, which contains values for all attributes, not a single column like 'Name'.
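
The four terms can be mapped onto a small sketch. The class below is only an analogy for the table in the question, and the sample values are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:        # entity: the kind of thing the table describes
    customer_id: int   # each field is an attribute (a column)
    name: str
    address: str
    phone: str


# One record (row), which is also one instance of the entity.
row = Customer(1, "Ana Ruiz", "12 Elm St", "555-0100")

attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)  # ['customer_id', 'name', 'address', 'phone']
print(row.name)         # Ana Ruiz
```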

68

MCQeasy

Refer to the exhibit. A data pipeline is failing to parse this log entry. What is the most likely cause of the error?

A.Missing comma between fields

B.Incorrect data type for age

C.Extra whitespace

D.Unquoted string for country

AnswerA

There should be a comma after '30' before the next key.

Why this answer

The answer note places the fault directly after the value 30, where the next key begins without a separating comma. That points to a key-value format such as JSON, in which every pair must be separated from the next by a comma. Without it, the parser cannot tell where the age pair ends and the next pair begins, so it rejects the whole entry.

Exam trap

CompTIA often tests the concept that missing or inconsistent delimiters cause parsing failures, and the trap here is that candidates may focus on the unquoted string or extra whitespace rather than recognizing that the fundamental issue is the absence of a required comma between fields.

How to eliminate wrong answers

Option B is wrong because the age value 30 is a valid integer and would parse correctly if the fields were properly delimited; the error is not a data type mismatch. Option C is wrong because extra whitespace is not the issue: parsers for key-value formats such as JSON ignore whitespace between tokens, so it would not cause a failure. Option D is wrong because the quoting of the country value is not what breaks parsing here; the core problem is the missing comma between fields.
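
Since the exhibit is not reproduced here, the following sketch assumes a JSON-style entry, as the answer note's wording ("a comma after '30' before the next key") suggests. Both sample strings and the `try_parse` helper are hypothetical.

```python
import json

# Hypothetical entries: the first omits the comma after 30, the second does not.
broken = '{"name": "Ana", "age": 30 "country": "United States"}'
fixed = '{"name": "Ana", "age": 30, "country": "United States"}'


def try_parse(text):
    """Return the parsed object, or a short error string if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        return f"parse error: {err.msg}"


print(try_parse(broken))            # an error string, not a dict
print(try_parse(fixed)["country"])  # United States
```

Note that the space inside "United States" is harmless once the value is quoted and the pairs are comma-separated.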

69

MCQhard

A financial services company is migrating its customer data from a legacy on-premises relational database to a cloud-based data warehouse. The legacy database uses a denormalized schema with a single table 'customer_master' that contains all customer attributes, including repeated groups for multiple accounts per customer (account1_type, account1_balance, account2_type, account2_balance, etc.). The data warehouse team wants to implement a normalized star schema with separate dimension and fact tables. During the ETL process, the team encounters an error: 'Data truncation: string data right truncation' when loading account_type values into the dim_account table. The account_type column in dim_account is defined as VARCHAR(10), but the source data contains account types like 'SavingsPlus' (11 characters) and 'CheckingPremium' (15 characters). The team must resolve this issue without losing data. Which course of action should the team take?

A.Truncate the account_type values to 10 characters during ETL.

B.Change the data type of dim_account.account_type to TEXT.

C.Ignore the error and continue loading with NULL values for truncated rows.

D.Increase the VARCHAR length of dim_account.account_type to accommodate the longest account type.

AnswerD

This resolves truncation without data loss.

Why this answer

Option D is correct because increasing the VARCHAR length of dim_account.account_type to accommodate the longest account type (e.g., VARCHAR(15) for 'CheckingPremium') resolves the data truncation error without data loss. The source values stay intact, and the column definition matches the actual data length. The team must avoid truncation or NULL insertion to maintain accurate dimensional attributes for analytics.

Exam trap

The trap here is that candidates may choose truncation (Option A) or NULL insertion (Option C) as quick fixes, overlooking the requirement to preserve data integrity, or mistakenly think TEXT (Option B) is a safe catch-all without considering performance implications in a data warehouse context.

How to eliminate wrong answers

Option A is wrong because truncating account_type values to 10 characters would lose data, violating the requirement to resolve the issue without data loss. Option B is wrong because changing the data type to TEXT is unnecessary: a VARCHAR sized to the longest value already fits the data, and on some platforms TEXT columns carry indexing and query overhead. Option C is wrong because ignoring the error and loading NULL values for the affected rows would discard valid account_type data and reduce analytics accuracy.
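
Before widening the column, the team would need to know the longest source value. A minimal sketch of that profiling step follows; 'SavingsPlus' and 'CheckingPremium' come from the question, while the other two values are assumed for illustration.

```python
# 'Savings' and 'Checking' are assumed sample values; the other two are from the question.
source_account_types = ["Savings", "Checking", "SavingsPlus", "CheckingPremium"]
CURRENT_LIMIT = 10  # dim_account.account_type is VARCHAR(10)

# Values that would trigger the truncation error at the current width.
too_long = [v for v in source_account_types if len(v) > CURRENT_LIMIT]

# Minimum width that fits every value without truncation.
required_length = max(len(v) for v in source_account_types)

print(too_long)         # ['SavingsPlus', 'CheckingPremium']
print(required_length)  # 15
```

The exact statement for widening a column differs between warehouse platforms, so it is left out here.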

70

MCQmedium

A data analyst needs to compare sales data from the company's internal CRM with public demographic data from a government census. Which data concept best describes this scenario?

A.Internal vs. External data

B.Primary vs. Secondary data

C.Structured vs. Unstructured data

D.Quantitative vs. Qualitative data

AnswerA

CRM data is internal; census data is external, directly contrasting sources.

Why this answer

The scenario involves comparing internal CRM data (generated and owned by the company) with external government census data (publicly sourced from outside the organization). This directly maps to the Internal vs. External data concept, where internal data is collected within the enterprise (e.g., sales transactions, customer records) and external data is acquired from third-party sources (e.g., census bureaus, market research firms).

The key distinction is the data's origin and ownership, not its structure, collection method, or measurement type.

Exam trap

CompTIA often tests the Internal vs. External data concept by presenting a scenario where the key differentiator is the data's source (inside vs. outside the organization), tempting candidates to confuse it with Primary vs. Secondary data, which focuses on whether the data was collected firsthand or repurposed.

How to eliminate wrong answers

Option B (Primary vs. Secondary data) is wrong because both datasets could be primary (collected firsthand by the CRM or census) or secondary (repurposed from another source), but the question focuses on the origin relative to the organization, not the collection method. Option C (Structured vs. Unstructured data) is wrong because both CRM sales data and census demographic data are typically structured (e.g., tables with rows and columns), so the contrast is not about format but about source. Option D (Quantitative vs. Qualitative data) is wrong because both datasets contain quantitative values (e.g., sales figures, population counts) and possibly qualitative labels (e.g., region names), but the core distinction in the scenario is internal versus external sourcing, not measurement scale.

71

MCQhard

Refer to the exhibit. A data analyst runs this query to identify high-value customers. However, the result does not include customers with exactly 5 orders. Which data concept does the HAVING clause illustrate?

A.Data sorting with ORDER BY

B.Data joining with INNER JOIN

C.Data aggregation with filtering on aggregated values

D.Data filtering on row-level conditions

AnswerC

HAVING filters after GROUP BY, operating on aggregated results.

Why this answer

HAVING filters groups after aggregation, unlike WHERE which filters rows before aggregation. This demonstrates data aggregation with filtering on aggregated values.
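
The exhibit query is not reproduced here, so the sketch below assumes it used a strict comparison such as `HAVING COUNT(*) > 5`, which would explain why customers with exactly 5 orders are missing. The table, the data, and the `customers_having` helper are hypothetical, and SQLite stands in for whatever database the exhibit targets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_id INTEGER)")

# Customer 1 has exactly 5 orders, customer 2 has 7, customer 3 has 2.
rows = (
    [(1, n) for n in range(5)]
    + [(2, n) for n in range(7)]
    + [(3, n) for n in range(2)]
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


def customers_having(condition):
    """Group orders by customer, then keep groups matching the HAVING condition."""
    sql = (
        "SELECT customer_id FROM orders "
        f"GROUP BY customer_id HAVING {condition} ORDER BY customer_id"
    )
    return [r[0] for r in conn.execute(sql)]


strict = customers_having("COUNT(*) > 5")
inclusive = customers_having("COUNT(*) >= 5")
print(strict)     # [2]
print(inclusive)  # [1, 2]
```

In both queries the condition is evaluated once per group on the aggregated count, which is what distinguishes HAVING from a row-level WHERE filter.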

72

Multi-Selectmedium

A data team must implement a data retention policy to reduce storage costs while meeting legal requirements. Which TWO actions best achieve this?

Select 2 answers

A.Set data retention limits with automated deletion

B.Use data compression

C.Increase primary storage capacity

D.Implement data deduplication

E.Archive historical data to tape or cloud archive

AnswersA, E

Ensures data is deleted after a defined period, complying with legal requirements.

Why this answer

Archiving old data to cheaper storage reduces primary storage costs, and setting retention limits ensures data is deleted when no longer needed, balancing cost and compliance.

73

MCQeasy

A retail company is merging customer data from three separate systems: an e-commerce platform, a point-of-sale (POS) system, and a loyalty program. The e-commerce platform stores customer names in "FirstName LastName" format, the POS system stores names as "LastName, FirstName", and the loyalty program stores names in separate "first_name" and "last_name" fields. The data analyst needs to create a unified customer master table. After initial merging, there are 20% more records than expected, including duplicates with slight name variations (e.g., "John Smith" vs "John A. Smith"). To ensure accurate consolidation, which data concept should the analyst prioritize applying first?

A.Data profiling

B.Data standardization

C.Data indexing

D.Data encryption

AnswerB

Standardizing name formats to a common convention reduces variations and allows accurate matching and deduplication.

Why this answer

Data standardization is the correct first step because it resolves the inconsistent name formats (e.g., 'FirstName LastName', 'LastName, FirstName', and separate fields) into a single, consistent representation. By applying a standardized format (e.g., 'FirstName LastName'), the analyst can then accurately identify and merge duplicates like 'John Smith' and 'John A. Smith' using fuzzy matching or exact matching on the standardized values.

This ensures the unified customer master table has the correct number of records without the 20% inflation caused by formatting variations.

Exam trap

The trap here is that candidates confuse data profiling (which only identifies issues) with data standardization (which actively resolves format inconsistencies), leading them to choose A instead of B, even though profiling alone cannot fix the duplicate records caused by name variations.

How to eliminate wrong answers

Option A is wrong because data profiling is an exploratory process that assesses data quality and structure (e.g., detecting nulls, patterns, or anomalies), but it does not transform or resolve the inconsistent name formats that cause duplicate records. Option C is wrong because data indexing improves query performance by creating sorted structures (e.g., B-trees or hash indexes) on columns, but it does not address the underlying data inconsistency or deduplication needed for accurate consolidation. Option D is wrong because data encryption protects data at rest or in transit (e.g., using AES-256 or TLS 1.3), but it has no role in standardizing name formats or removing duplicates from merged datasets.
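
A minimal sketch of the standardization step, assuming the three source layouts named in the question. The `standardize_name` function and the sample records are hypothetical.

```python
def standardize_name(raw) -> str:
    """Return 'FirstName LastName' from any of the three source layouts."""
    if isinstance(raw, dict):
        # Loyalty program: separate first_name and last_name fields.
        first, last = raw["first_name"], raw["last_name"]
    elif "," in raw:
        # POS system: 'LastName, FirstName'.
        last, first = raw.split(",", 1)
    else:
        # E-commerce platform: already 'FirstName LastName'.
        first, last = raw, ""
    # Collapse stray whitespace so equal names compare equal.
    return " ".join(f"{first} {last}".split())


sources = [
    "John Smith",                                   # e-commerce
    "Smith, John",                                  # POS
    {"first_name": "John", "last_name": "Smith"},   # loyalty program
]

unique_names = {standardize_name(s) for s in sources}
print(unique_names)  # {'John Smith'}
```

Variations such as 'John A. Smith' survive this step unchanged; they are what the later fuzzy-matching pass would have to reconcile.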

74

Multi-Selecthard

Which THREE of the following are valid methods for handling missing data?

Select 3 answers

A.Using a placeholder like 'Unknown' for categorical data

B.Ignoring missing values and proceeding with analysis

C.Replacing missing values with the mean of the column

D.Sorting the data to bring missing values to the top

E.Deleting rows with missing values

AnswersA, C, E

Placeholder is a valid approach.

Why this answer

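Option A is correct because using a placeholder like 'Unknown' for categorical missing data preserves the dataset's structure and lets analysis proceed without discarding rows. This method suits nominal data, where the missing category can be treated as a distinct value, enabling downstream operations like one-hot encoding or frequency analysis. Option C is correct because replacing missing numeric values with the column mean (mean imputation) is a standard technique that keeps every row, although it reduces the column's variance. Option E is correct because deleting rows with missing values (listwise deletion) is also a standard technique, best suited to cases where only a small share of rows is affected.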

Exam trap

The trap here is that candidates may confuse 'handling missing data' with 'preprocessing steps'—sorting (Option D) is a data organization technique, not a valid method for dealing with missing values, and ignoring missing data (Option B) is often mistakenly considered acceptable in quick analyses, but it violates best practices for robust data science workflows.
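
The three valid methods (A, C, and E) can be compared side by side in a short sketch. The rows and field names are invented for illustration.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 30, "segment": "Retail"},
    {"age": None, "segment": "Wholesale"},
    {"age": 50, "segment": None},
]

# Option E: delete rows that contain any missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option C: replace missing numeric values with the column mean.
age_mean = mean(r["age"] for r in rows if r["age"] is not None)

# Option A: use an 'Unknown' placeholder for missing categorical values.
filled = [
    {
        "age": r["age"] if r["age"] is not None else age_mean,
        "segment": r["segment"] if r["segment"] is not None else "Unknown",
    }
    for r in rows
]

print(len(complete_rows))    # 1
print(filled[1]["age"])
print(filled[2]["segment"])  # Unknown
```

Deletion keeps only one of the three rows here, while imputation and the placeholder keep all three, which is the usual trade-off between the methods.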

75

MCQhard

A data analyst is reviewing the error log from a nightly batch load. What is the most likely cause of the error?

A.A row with the same primary key was already loaded in a previous batch.

B.The data type of order_id is incorrect.

C.The source and target schemas are mismatched.

D.The order_id field contains null values.

AnswerA

The error explicitly says duplicate key.

Why this answer

The error log from a nightly batch load indicates a primary key violation. This occurs when a row with the same primary key value already exists in the target table from a previous batch load. Since batch loads typically use INSERT operations, attempting to insert a duplicate primary key will raise a constraint violation error, halting the load process.

Exam trap

The trap here is that candidates confuse a primary key violation with a data type mismatch or schema mismatch, but the error log's specific reference to a duplicate key points directly to the primary key constraint.

How to eliminate wrong answers

Option B is wrong because an incorrect data type for order_id would cause a data type conversion error or truncation error, not a primary key violation. Option C is wrong because a schema mismatch (e.g., missing columns or different column order) would produce a column mapping error or a 'column not found' error, not a duplicate key error. Option D is wrong because null values in order_id would violate a NOT NULL constraint if the primary key column is defined as NOT NULL, but the error message specifically points to a duplicate key violation, not a null constraint violation.
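
The failure mode can be reproduced in a small sketch. The exhibit's error log is not shown here, so the table, the batches, and the `load_batch` helper are hypothetical, and SQLite stands in for the real target database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")


def load_batch(batch):
    """Insert rows one by one; return the order_ids rejected as duplicates."""
    rejected = []
    for order_id, amount in batch:
        try:
            conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        except sqlite3.IntegrityError:
            # Raised when order_id already exists in the target table.
            rejected.append(order_id)
    return rejected


first_night = load_batch([(1001, 25.0), (1002, 40.0)])
second_night = load_batch([(1002, 40.0), (1003, 15.5)])  # 1002 is resent

print(first_night)   # []
print(second_night)  # [1002]
```

This sketch catches the error and continues, whereas the load in the question halts on the first violation.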
