CCNA Dap Data Concepts Questions

74 questions · Dap Data Concepts topic · All types, answers revealed

1

MCQ · medium

A retail company wants to analyze customer purchase patterns over time. The data is stored in a relational database with tables for Customers, Orders, and Products. Which database concept should be used to ensure that each order references a valid customer?

A.View

B.Index

C.Primary key

D.Foreign key

Answer: D

A foreign key links tables by referencing a primary key in another table, maintaining referential integrity.

Why this answer

A foreign key constraint enforces referential integrity by ensuring that every value in the 'customer_id' column of the Orders table matches a valid primary key value in the Customers table. This prevents orphaned records and guarantees that each order references an existing customer.

Exam trap

Cisco often tests the distinction between a primary key (which enforces uniqueness within a table) and a foreign key (which enforces relationships between tables), leading candidates to mistakenly choose primary key when the question asks about cross-table validation.

How to eliminate wrong answers

Option A is wrong because a view is a virtual table based on a query and does not enforce any constraints between tables. Option B is wrong because an index speeds up data retrieval but does not enforce referential integrity or validate relationships. Option C is wrong because a primary key uniquely identifies rows within its own table and cannot enforce relationships between different tables.
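
The constraint described above can be tried out with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the retail database; the table layout, the sample rows, and the helper try_insert_order are assumptions made for illustration, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE Orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES Customers(customer_id))"
)
conn.execute("INSERT INTO Customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO Orders VALUES (100, 1)")  # valid: customer 1 exists


def try_insert_order(order_id, customer_id):
    """Return True if the order was accepted, False if the foreign key rejected it."""
    try:
        conn.execute("INSERT INTO Orders VALUES (?, ?)", (order_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False


# Customer 999 does not exist, so this order would be an orphan.
orphan_accepted = try_insert_order(101, 999)
```

Here orphan_accepted ends up False: the database itself refuses the orphaned order, which is the behavior a view, an index, or a primary key alone would not give.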

2

Multi-Select · medium

Which TWO of the following are benefits of database normalization to 3NF? (Select 2)

Select 2 answers

A.Improves query performance for all queries

B.Reduces data redundancy

C.Simplifies complex joins

D.Eliminates all data anomalies

E.Increases data integrity

Answers: B, E

Normalization reduces redundant (duplicated) data, which in turn improves data integrity.

Why this answer

Normalization to 3NF eliminates transitive dependencies, which directly reduces data redundancy by ensuring each non-key attribute depends only on the primary key. This reduction in redundancy also increases data integrity because updates, inserts, and deletes are less likely to create inconsistencies or anomalies. In a relational database, 3NF achieves this without sacrificing the ability to reconstruct the original data via joins.

Exam trap

The trap here is that candidates confuse normalization with denormalization and assume that reducing redundancy always improves query performance. In fact, normalization often adds joins and can slow read queries, and 3NF reduces update, insert, and delete anomalies without guaranteeing that every anomaly is eliminated.
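
A small sketch may make the redundancy point concrete. It assumes a hypothetical orders_flat table in which customer_city depends on customer_id rather than directly on the key (a transitive dependency), and compares it with a 3NF split into customers and orders. All table and column names are invented for illustration; sqlite3 is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Not in 3NF: order_id -> customer_id -> customer_city
CREATE TABLE orders_flat (
    order_id INTEGER PRIMARY KEY, customer_id INTEGER, customer_city TEXT);
INSERT INTO orders_flat VALUES (1, 10, 'Lyon'), (2, 10, 'Lyon'), (3, 11, 'Oslo');

-- 3NF: the city is stored once per customer
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id));
INSERT INTO customers VALUES (10, 'Lyon'), (11, 'Oslo');
INSERT INTO orders VALUES (1, 10), (2, 10), (3, 11);
""")

# Customer 10 moves: the flat design must rewrite every one of their orders.
flat_rows_touched = conn.execute(
    "UPDATE orders_flat SET customer_city = 'Paris' WHERE customer_id = 10"
).rowcount
normalized_rows_touched = conn.execute(
    "UPDATE customers SET city = 'Paris' WHERE customer_id = 10"
).rowcount

# The original wide rows can still be rebuilt with a join.
rebuilt = conn.execute(
    "SELECT o.order_id, o.customer_id, c.city "
    "FROM orders o JOIN customers c ON c.customer_id = o.customer_id "
    "ORDER BY o.order_id"
).fetchall()
```

The flat table touches two rows for one change while the normalized one touches a single row, and the join needed to produce rebuilt is the extra cost that the exam trap refers to.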

3

MCQ · hard

A data analyst needs to combine customer data from two tables: Customers (CustomerID, Name) and Orders (OrderID, CustomerID, Amount). Only customers who have placed at least one order should be included. Which JOIN type should be used?

A.INNER JOIN

B.LEFT JOIN

C.FULL OUTER JOIN

D.RIGHT JOIN

Answer: A

INNER JOIN returns only matching rows, which are customers with orders.

Why this answer

An INNER JOIN returns only rows where there is a match in both tables. Since the requirement is to include only customers who have placed at least one order, the INNER JOIN on CustomerID will filter out any customer without a matching order record, exactly meeting the condition.

Exam trap

The trap here is that candidates often choose LEFT JOIN because it keeps every customer, overlooking that it therefore also returns customers with no orders, which fails the explicit condition of 'only customers who have placed at least one order'.

How to eliminate wrong answers

Option B (LEFT JOIN) is wrong because it would include all customers, even those with no orders, with NULL values for order columns, which violates the 'only customers who have placed at least one order' requirement. Option C (FULL OUTER JOIN) is wrong because it would include customers without orders and orders without customers, neither of which is needed. Option D (RIGHT JOIN) is wrong because, with Customers as the left table, it keeps every order, including any order with no matching customer, so the result can contain rows with NULL customer columns.
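
The difference between the two most tempting options can be checked with a short sketch. The table and column names follow the question; the sample rows (three customers, one of whom never ordered) are assumptions, and sqlite3 is used only as a convenient local SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, Amount REAL);
INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO Orders VALUES (100, 1, 25.0), (101, 1, 40.0), (102, 2, 15.0);
""")

# INNER JOIN: only customers with at least one matching order survive.
inner_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT c.Name FROM Customers c "
    "INNER JOIN Orders o ON o.CustomerID = c.CustomerID ORDER BY c.Name"
)]

# LEFT JOIN: every customer is kept, with NULL order columns when no order exists.
left_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT c.Name FROM Customers c "
    "LEFT JOIN Orders o ON o.CustomerID = c.CustomerID ORDER BY c.Name"
)]
```

With this sample data, inner_names holds Ada and Grace, while left_names also contains Linus, the customer with no orders.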

4

Multi-Select · easy

A data analyst is working with a dataset that includes customer names, email addresses, and purchase history. The analyst wants to ensure that each customer is uniquely identified. Which TWO database concepts should be used to enforce uniqueness and link related data?

Select 2 answers

A.Foreign key

B.Normalization

C.View

D.Primary key

E.Index

Answers: A, D

A foreign key links purchase history to customers, maintaining relationships.

Why this answer

A primary key uniquely identifies each row in a table, ensuring no duplicate customer records. A foreign key links related data across tables by referencing the primary key of another table, enforcing referential integrity. Together, they guarantee uniqueness and enable relational joins between customer and purchase history tables.

Exam trap

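The trap here is that candidates often treat normalization as a constraint or assume that an ordinary index enforces uniqueness. Normalization is a design process, and a non-unique index only speeds up lookups; among the listed options, the primary key provides unique identification and the foreign key provides the link between tables.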

5

MCQ · hard

A company needs to store user session data for a web application. Each session has a unique session ID, and the data must be retrieved very quickly by session ID. The data does not require complex relationships or transactions. Which type of NoSQL database is most appropriate?

A.Key-value store (e.g., Redis)

B.Wide-column store (e.g., Cassandra)

C.Document store (e.g., MongoDB)

D.Graph database (e.g., Neo4j)

Answer: A

Key-value stores provide high-performance lookups by key.

Why this answer

A key-value store like Redis is the most appropriate choice because it is optimized for extremely fast lookups by a unique key (session ID) and does not require complex relationships or transactions. Redis stores data in memory, providing sub-millisecond retrieval times ideal for session management, and supports built-in expiration (TTL) to automatically clean up stale sessions.

Exam trap

The trap here is that candidates often choose a document store like MongoDB because they associate 'session data' with JSON objects, overlooking that key-value stores are purpose-built for the exact use case of fast, simple key-based retrieval without the overhead of document querying.

How to eliminate wrong answers

Option B (Wide-column store, e.g., Cassandra) is wrong because it is designed for high-volume, distributed writes and queries modeled around partition keys across many nodes; that distributed design adds operational overhead that simple, low-latency session lookups do not need. Option C (Document store, e.g., MongoDB) is wrong because it stores semi-structured JSON-like documents with rich querying capabilities, which adds unnecessary complexity and latency for simple session data that only needs key-based retrieval. Option D (Graph database, e.g., Neo4j) is wrong because it is purpose-built for traversing relationships between entities (nodes and edges), which is irrelevant for session data that has no relational structure.
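
The access pattern behind the correct answer can be sketched with a toy in-memory store. This is not Redis and does not use a Redis client; the SessionStore class, its method names, and the injectable clock are all assumptions made so the example runs with the standard library alone.

```python
import time


class SessionStore:
    """Toy key-value store with per-key expiry, loosely imitating a TTL."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # session_id -> (expires_at, value)
        self._clock = clock  # injectable so expiry can be shown without sleeping

    def set(self, session_id, value, ttl_seconds):
        self._data[session_id] = (self._clock() + ttl_seconds, value)

    def get(self, session_id):
        entry = self._data.get(session_id)  # single lookup by key
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[session_id]      # lazily evict the stale session
            return None
        return value


now = [0.0]                                  # fake clock, in seconds
store = SessionStore(clock=lambda: now[0])
store.set("sess-42", {"user": "ada"}, ttl_seconds=30)
fresh = store.get("sess-42")                 # still within the 30 s window
now[0] = 31.0
expired = store.get("sess-42")               # past the TTL, so nothing is returned
```

Every operation is a lookup by one key with no joins, queries over fields, or relationship traversal, which is why the other three database types add capability the scenario does not use.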

6

Multi-Select · hard

A data engineer is designing a data pipeline for a retail company. The source system is an OLTP database that records sales transactions. The target is a data warehouse used for reporting. The engineer is evaluating whether to use ETL or ELT. Which three factors would favor using ELT over ETL? (Select THREE)

Select 3 answers

A.The transformation logic requires proprietary functions not available in the warehouse

B.The business analysts need access to raw data for ad-hoc exploration

C.The target data warehouse has massive compute power (e.g., Snowflake) that can handle transformations efficiently

D.Data must be cleansed and validated before loading into the warehouse

E.The source data volume is very large and the warehouse can scale resources on demand

Answers: B, C, E

ELT loads raw data, allowing analysts to explore it before transformation.

Why this answer

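Option B is correct because ELT loads raw data into the warehouse first, allowing business analysts to perform ad-hoc exploration directly on the source data without pre-transformation. Option C is correct because ELT runs transformations inside the warehouse, so it pays off when the warehouse has ample compute. Option E is correct because large volumes can be loaded as-is and then transformed with resources the warehouse scales on demand. Options A and D favor ETL instead: both require transformation or cleansing to happen outside the warehouse before loading.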

Exam trap

The trap here is that candidates often confuse the direction of data flow, mistakenly thinking that ELT requires transformations before loading, when in fact ELT defers transformations until after data is in the warehouse.

7

MCQ · hard

A company uses a NoSQL document database to store product catalogs. Each product document includes fields like product_id, name, category, and price. The operations team frequently queries by product_id and by category. Which type of NoSQL database is being used, and what should be created to optimize queries by category?

A.Key-value store; create a secondary index on category

B.Graph database; create a relationship between products

C.Document database; create an index on category

D.Wide-column store; create a column family for category

Answer: C

Document databases support secondary indexes on any field, which speeds up queries.

Why this answer

The question explicitly states a document database is used, and the operations team frequently queries by category. In a document database like MongoDB, creating an index on the category field optimizes these queries by allowing the database to quickly locate documents without scanning every document in the collection. This is the standard approach for improving query performance on non-primary-key fields in document stores.

Exam trap

Cisco often tests the misconception that every NoSQL database type can be tuned the same way with a secondary index. Document databases support secondary indexes on arbitrary fields natively, whereas a pure key-value store is built around lookups by key, and wide-column stores are usually tuned through partition-key and table design first.

How to eliminate wrong answers

Option A is wrong because the scenario describes a document database, and a pure key-value store is built around lookups by key (product_id) rather than by fields inside the value such as category. Option B is wrong because a graph database is designed for relationship-heavy data (e.g., social networks), not for product catalogs with simple field queries, and creating relationships between products does not optimize category-based lookups. Option D is wrong because a wide-column store organizes data by column families, not by documents, and creating a column family for category would not provide the index-based optimization needed for document-style queries.
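
The idea of an index on category can be illustrated with plain Python dictionaries. This is only a conceptual sketch: real document databases maintain their indexes internally with their own data structures, and the sample products and function names below are invented for the example.

```python
from collections import defaultdict

products = [
    {"product_id": "p1", "name": "Kettle", "category": "kitchen", "price": 30},
    {"product_id": "p2", "name": "Desk", "category": "office", "price": 120},
    {"product_id": "p3", "name": "Toaster", "category": "kitchen", "price": 25},
]

# Primary lookup: product_id -> document.
by_id = {doc["product_id"]: doc for doc in products}

# Secondary index: category -> list of product_ids, built once and then reused.
category_index = defaultdict(list)
for doc in products:
    category_index[doc["category"]].append(doc["product_id"])


def find_by_category_scan(category):
    """Without an index: inspect every document in the collection."""
    return [doc["product_id"] for doc in products if doc["category"] == category]


def find_by_category_indexed(category):
    """With an index: jump straight to the matching product_ids."""
    return list(category_index.get(category, []))
```

Both functions return the same ids; the indexed version simply avoids reading documents that cannot match, which matters as the catalog grows.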

8

MCQ · easy

Which of the following is a characteristic of a NoSQL document database like MongoDB?

A.Schema-flexible, JSON-like documents

B.Data stored in tables with rows and columns

C.Strict schema enforcement

D.Support for ACID transactions across multiple documents

Answer: A

Document databases store data in flexible documents.

Why this answer

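Document databases store data in flexible, JSON-like documents, so fields can vary between documents in the same collection. Options B and C describe relational databases. Option D is not a defining characteristic: MongoDB has offered multi-document ACID transactions since version 4.0, but that feature is not what distinguishes the document model.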

9

MCQ · medium

A data engineer is designing a system to store raw sensor data from thousands of IoT devices. The data is expected to be used for exploratory analytics and machine learning. Which storage solution is most appropriate?

A.Data lake

B.Relational database

C.Data mart

D.Key-value store

Answer: A

Data lakes store raw, unprocessed data, suitable for IoT sensor data.

Why this answer

A data lake is the most appropriate choice because it can store raw, unprocessed sensor data in its native format (e.g., JSON, Parquet, or binary) without requiring a predefined schema. This flexibility supports exploratory analytics and machine learning workflows where data schemas may evolve or be unknown at ingestion time. Data lakes also scale horizontally to handle the high volume and velocity of data from thousands of IoT devices, unlike traditional storage systems that impose rigid structures or size limits.

Exam trap

The trap here is that candidates often confuse a data lake with a data warehouse or relational database, assuming raw data must be structured immediately, when in fact a data lake's schema-on-read approach is specifically designed for exploratory and machine learning use cases.

How to eliminate wrong answers

Option B is wrong because a relational database enforces a fixed schema and ACID transactions, which are unnecessary for raw sensor data and would introduce significant overhead for high-velocity, schema-on-read workloads. Option C is wrong because a data mart is a subset of data optimized for a specific business function or department, not designed to store raw, exploratory data from thousands of IoT devices. Option D is wrong because a key-value store is optimized for simple lookups by a single key and lacks the query flexibility and analytical capabilities needed for exploratory analytics and machine learning on complex sensor data.

10

MCQ · hard

A database has a table 'Orders' with columns OrderID (PK), CustomerID, OrderDate, and a table 'OrderDetails' with OrderID (FK), ProductID, Quantity. To ensure that every OrderID in OrderDetails exists in Orders, which integrity constraint is enforced?

A.Entity integrity

B.Domain integrity

C.User-defined integrity

D.Referential integrity

Answer: D

Referential integrity ensures foreign key values match primary key values.

Why this answer

Referential integrity, enforced via a foreign key constraint, ensures that values in the foreign key column (OrderID in OrderDetails) match values in the primary key column of the referenced table (Orders).

11

Multi-Select · hard

An organization uses a data warehouse for analytics. Which two characteristics are typical of a data warehouse compared to a data lake? (Select TWO.)

Select 2 answers

A.Optimized for complex queries and reporting

B.Typically uses ELT process

C.Stores raw data in native format

D.Uses schema-on-write

E.Supports all data types including unstructured

Answers: A, D

Data warehouses are designed for analytics.

Why this answer

Option A is correct because data warehouses are optimized for complex queries and reporting through the use of columnar storage, indexing, and pre-aggregated data structures like star schemas and materialized views. These optimizations allow for fast execution of analytical queries involving aggregations, joins, and filtering, which are typical in business intelligence (BI) reporting. In contrast, data lakes are designed for storage flexibility, not query performance. Option D is correct because a data warehouse applies schema-on-write: data must conform to a defined schema when it is loaded, whereas a data lake defers structure until the data is read (schema-on-read).

Exam trap

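The trap here is that candidates mix up which process goes with which store. Loading raw data first and shaping it later (ELT, schema-on-read) is the pattern this question associates with data lakes, even though cloud data warehouses can also run ELT. Candidates also assume a data warehouse handles every data type, including unstructured data, when storing all data types in native format is a key differentiator of data lakes.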

12

MCQ · medium

A data analyst needs to retrieve current weather data from a third-party service. The service provides an endpoint that returns data in JSON format over HTTP. Which data source type is being used?

A.Streaming data

B.Flat file

C.Web scraping

D.API

Answer: D

A REST API provides structured data over HTTP.

Why this answer

Option D is correct because the data analyst is retrieving data from a third-party service via an HTTP endpoint that returns JSON. This is the classic definition of an API (Application Programming Interface) — specifically a RESTful web API — which allows programmatic access to structured data over HTTP using standard methods like GET. The JSON format confirms it is an API response, not a file or stream.

Exam trap

The trap here is that candidates confuse 'web scraping' (Option C) with API consumption because both involve HTTP, but scraping parses unstructured HTML while an API returns structured JSON, and Cisco often tests this distinction by describing a direct JSON endpoint to lure test-takers into selecting web scraping.

How to eliminate wrong answers

Option A is wrong because streaming data implies a continuous, real-time flow of data (e.g., from Kafka, WebSockets, or sensor feeds), whereas the question describes a single request-response retrieval over HTTP. Option B is wrong because a flat file (e.g., CSV, TSV, or fixed-width) is a static file stored locally or on a file server, not an HTTP endpoint that returns JSON dynamically. Option C is wrong because web scraping involves parsing raw HTML from a web page to extract data, not consuming a structured JSON response from a dedicated API endpoint.
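
The API-versus-scraping distinction can be sketched without any network access. The JSON body and the HTML snippet below are hypothetical stand-ins for what a weather service might return; the field names, the span markup, and the TempScraper class are assumptions for illustration only.

```python
import json
from html.parser import HTMLParser

# What a JSON endpoint might return: structured, directly addressable by key.
api_body = '{"city": "Oslo", "temp_c": 4.5, "conditions": "cloudy"}'
weather = json.loads(api_body)
temp_from_api = weather["temp_c"]

# What a web page might contain: the value is buried in presentation markup.
page_html = '<html><body><span class="temp">4.5</span></body></html>'


class TempScraper(HTMLParser):
    """Pulls the text out of <span class="temp">, as a scraper would have to."""

    def __init__(self):
        super().__init__()
        self._in_temp = False
        self.temp = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "temp") in attrs:
            self._in_temp = True

    def handle_data(self, data):
        if self._in_temp:
            self.temp = float(data)
            self._in_temp = False


scraper = TempScraper()
scraper.feed(page_html)
temp_from_scrape = scraper.temp
```

Both paths recover the same number, but the API path is a key lookup on parsed JSON, while the scraping path depends on the page layout staying the same.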

13

MCQ · medium

A data analyst needs to extract data from a transactional database and load it into a data warehouse for reporting. Which process typically transforms the data before loading it into the warehouse?

A.Data virtualization

B.ELT

C.ETL

D.Data replication

Answer: C

ETL transforms before loading.

Why this answer

ETL (Extract, Transform, Load) transforms data before loading it into the warehouse, which is common for traditional data warehousing.

14

MCQ · medium

A company needs to store raw data from IoT sensors for future machine learning projects. The data is expected to be massive and in various formats. Which storage solution is most appropriate?

A.Data mart

B.Data warehouse

C.Data lake

D.Relational database

Answer: C

Data lakes store raw data in native format, suitable for large-scale, diverse data.

Why this answer

A data lake stores raw data in its native format without predefined schema, ideal for large volumes of varied data for future use.

15

Multi-Select · hard

A data warehouse team is considering moving from an ETL to an ELT approach. Which THREE of the following are advantages of ELT over ETL?

Select 3 answers

A.Requires less storage space in the data warehouse

B.Reduces data loading time because transformations are done after loading

C.Allows data to be reprocessed easily if transformation logic changes

D.Ensures data is cleaned before loading

E.Eliminates the need for a separate ETL server

Answers: B, C, E

Data is loaded quickly without transformation, reducing initial load time.

Why this answer

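Option B is correct because in ELT, data is loaded into the data warehouse first and transformations are applied afterward. This reduces the initial loading time since no transformation processing occurs during the load phase, allowing raw data to be ingested more quickly. Option C is correct because the raw data stays in the warehouse, so changed transformation logic can be rerun without extracting from the source again. Option E is correct because the warehouse engine performs the transformations, so no separate transformation server is required.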

Exam trap

Cisco often tests the misconception that ELT reduces storage requirements, but in reality, ELT increases storage needs because raw data is persisted alongside transformed data, whereas ETL can discard raw data after transformation.

16

MCQ · hard

A data engineer is designing a data pipeline where raw data is loaded into a cloud data warehouse (Snowflake) and then transformed using SQL. This approach is called:

A.ELT

B.ETL

C.Data migration

D.Data wrangling

Answer: A

ELT loads raw data, then transforms in the warehouse.

Why this answer

ELT (Extract, Load, Transform) loads raw data first, then transforms it within the warehouse, which is suitable for modern cloud warehouses.
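
The load-then-transform order can be sketched on a small scale. An in-memory SQLite database stands in for the cloud warehouse here, which is an assumption made so the example is runnable; the raw_sales and sales_by_product tables and the sample rows are also invented.

```python
import sqlite3

# Extract: rows exactly as the source produced them (untrimmed, amounts as text).
raw_rows = [
    ("2024-01-05", " widget ", "19.99"),
    ("2024-01-05", "GADGET", "5.00"),
    ("2024-01-06", "widget", "19.99"),
]

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: land the raw data unchanged.
wh.execute("CREATE TABLE raw_sales (sale_date TEXT, product TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: cleaning and aggregation happen afterwards, in SQL, inside the warehouse.
wh.execute("""
    CREATE TABLE sales_by_product AS
    SELECT LOWER(TRIM(product)) AS product,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM raw_sales
    GROUP BY LOWER(TRIM(product))
""")

summary = wh.execute(
    "SELECT product, revenue FROM sales_by_product ORDER BY product"
).fetchall()
raw_count = wh.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
```

The raw table is still present after the transformation, so the cleaning rules can be changed and rerun without going back to the source system.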

17

Multi-Select · medium

Which THREE of the following are components of Master Data Management (MDM)? (Select 3)

Select 3 answers

A.Data governance

B.Data quality management

C.Data encryption

D.Data archival

E.Data integration

Answers: A, B, E

MDM requires governance policies.

Why this answer

MDM includes data governance, data integration, and data quality management to maintain a single source of truth.

18

MCQ · hard

A data analyst needs to combine customer data from a MySQL transactional database with product data from a MongoDB document store to create a unified view for reporting. The analyst uses a SQL query that joins the tables after extracting data from both sources. Which database concept is being applied?

A.View

B.Join

C.Stored procedure

D.Index

Answer: B

Joins combine rows from two or more tables based on a related column.

Why this answer

The correct answer is B (Join) because the scenario describes combining data from two different sources—MySQL and MongoDB—into a unified view using a SQL query that joins the tables after extraction. This is a classic example of a cross-source join, where data from disparate databases is merged based on a common key, which is the fundamental purpose of a JOIN operation in SQL.

Exam trap

The trap here is that candidates may pick 'view' because the result is described as a unified view. A view is a saved query defined inside one database system; the operation that actually combines the extracted MySQL and MongoDB rows on a common key is the join.

How to eliminate wrong answers

Option A (View) is wrong because a view is a saved SQL query that presents data from one or more tables within the same database, not a mechanism to combine data from different source systems like MySQL and MongoDB. Option C (Stored procedure) is wrong because a stored procedure is a precompiled collection of SQL statements that performs a specific task within a single database, not a concept for joining data across heterogeneous data stores. Option D (Index) is wrong because an index is a data structure that improves the speed of data retrieval operations on a table, not a method for combining data from multiple sources.
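
One way to picture the scenario is to land both extracts in a single SQL engine and join them there. In this sketch a list of tuples stands in for the MySQL extract, a list of dictionaries stands in for the MongoDB extract, and an in-memory SQLite database is the place where they meet; the column names and the product_id join key are assumptions.

```python
import sqlite3

# Stand-in for rows extracted from the MySQL customer table.
customer_rows = [(1, "Ada", "p1"), (2, "Grace", "p2"), (3, "Linus", "p1")]

# Stand-in for documents extracted from the MongoDB product collection.
product_docs = [
    {"product_id": "p1", "name": "Kettle", "specs": {"watts": 1500}},
    {"product_id": "p2", "name": "Desk"},
]

stage = sqlite3.connect(":memory:")
stage.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, product_id TEXT)")
stage.execute("CREATE TABLE products (product_id TEXT, product_name TEXT)")
stage.executemany("INSERT INTO customers VALUES (?, ?, ?)", customer_rows)
# Flatten each document to the columns the report needs before loading it.
stage.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(doc["product_id"], doc["name"]) for doc in product_docs],
)

# The join combines rows from both extracts on the shared product_id column.
unified = stage.execute(
    "SELECT c.name, p.product_name "
    "FROM customers c JOIN products p ON p.product_id = c.product_id "
    "ORDER BY c.customer_id"
).fetchall()
```

Once both extracts are tables in the same engine, the combining step is an ordinary SQL JOIN, regardless of where the rows originally came from.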

19

MCQ · medium

Which database concept ensures that data in one table corresponds to data in another table, preventing orphan records?

A.Index

B.Referential integrity

C.Foreign key

D.Primary key

Answer: B

Referential integrity is the constraint that maintains consistency between related tables.

Why this answer

Referential integrity ensures that foreign key values match primary key values in the referenced table.

20

MCQ · easy

Which of the following data sources is most likely to generate streaming data?

A.Transactional database

B.Flat file

C.API

D.IoT sensors

Answer: D

IoT sensors continuously send data, making them a streaming source.

Why this answer

Streaming data is continuously generated from sources like IoT sensors, clickstreams, and social media feeds.

21

MCQ · easy

A data analyst receives a file with the extension .json. This file contains product information with attributes that vary between records. How should this file be classified?

A.Semi-structured data

B.Structured data

C.Transactional data

D.Unstructured data

Answer: A

JSON is a common semi-structured format built from keys and nested elements.

Why this answer

A JSON file with varying attributes per record is a classic example of semi-structured data. Unlike strictly structured data (e.g., a relational table with fixed columns), JSON allows each object to have a different set of key-value pairs, making it schema-flexible. This self-describing nature, where metadata is embedded within the data itself, is the defining characteristic of semi-structured formats.

Exam trap

The trap here is that candidates confuse 'structured data' with any data that has a format or organization, forgetting that structured data specifically requires a fixed, predefined schema enforced at write time, unlike JSON's flexible schema-on-read approach.

How to eliminate wrong answers

Option B is wrong because structured data requires a rigid, predefined schema (like a SQL table with fixed columns and data types), which JSON explicitly does not enforce. Option C is wrong because transactional data refers to records of business events (e.g., sales, orders) and is a classification by use case, not by format; a JSON file can contain transactional data, but the question asks how the file itself should be classified based on its structure. Option D is wrong because unstructured data lacks any internal structure or metadata (e.g., raw text, images, audio), whereas JSON has a clear hierarchical structure with keys and values.
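
A short sketch shows what varying attributes look like in practice. The product records below are invented for illustration; the check simply compares the set of keys each record carries.

```python
import json

payload = """[
  {"sku": "p1", "name": "Kettle", "watts": 1500},
  {"sku": "p2", "name": "Desk", "dimensions": {"w": 120, "d": 60}},
  {"sku": "p3", "name": "Novel", "author": "Le Guin", "pages": 304}
]"""

records = json.loads(payload)

# Each record describes itself: its keys travel with the data.
key_sets = [frozenset(record) for record in records]

common_keys = set.intersection(*(set(record) for record in records))
all_keys = set.union(*(set(record) for record in records))

# A fixed schema would mean every record has exactly the same keys.
has_fixed_schema = len(set(key_sets)) == 1
```

Only sku and name are shared by every record, so has_fixed_schema is False, yet each record still has clear internal structure. That combination is what separates semi-structured data from both structured and unstructured data.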

22

MCQ · medium

An organization is implementing a data warehouse to support business intelligence reporting. The data warehouse must ensure that transactions are processed reliably. Which property guarantees that each transaction is treated as a single, indivisible unit?

A.Consistency

B.Isolation

C.Atomicity

D.Durability

Answer: C

Atomicity ensures a transaction is completed fully or not at all.

Why this answer

Atomicity (option C) is the correct property because it ensures that a transaction is treated as a single, indivisible unit of work. In the context of a data warehouse, this means that either all operations within the transaction are committed successfully, or none are applied, preventing partial updates that could corrupt the data. This is a core component of the ACID (Atomicity, Consistency, Isolation, Durability) model, which is fundamental to reliable transaction processing in databases like SQL Server, Oracle, or PostgreSQL.

Exam trap

The trap here is that candidates often confuse atomicity with consistency, thinking that 'indivisible unit' means the data must be consistent, but consistency is a separate property that ensures data integrity rules are met, not that the transaction is all-or-nothing.

How to eliminate wrong answers

Option A (Consistency) is wrong because consistency ensures that a transaction brings the database from one valid state to another, preserving all defined rules (e.g., constraints, triggers), but it does not guarantee that the transaction is treated as a single unit. Option B (Isolation) is wrong because isolation controls how transaction changes are visible to other concurrent transactions, preventing dirty reads and other anomalies, but it does not address the indivisibility of the transaction itself. Option D (Durability) is wrong because durability guarantees that once a transaction is committed, its changes persist even in the event of a system failure (e.g., via write-ahead logging), but it does not ensure the transaction is atomic.
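
Atomicity can be observed directly with a two-step transaction in which the second step fails. The accounts table, the CHECK constraint, and the transfer helper are assumptions chosen for illustration, and SQLite stands in for whichever database engine is in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(src, dst, amount):
    """Move amount from src to dst as one all-or-nothing unit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The credit succeeds, but the debit would drive the balance negative and fails.
ok = transfer("a", "b", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

Although the credit to b ran before the failure, the rollback undoes it, so balances is unchanged: the transaction happened entirely or not at all.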

23

MCQ · easy

Which of the following is a characteristic of structured data?

A.It conforms to a fixed schema with rows and columns.

B.It has a flexible schema that can vary per record.

C.It cannot be analyzed using SQL.

D.It is stored as blobs in a data lake.

Answer: A

Structured data conforms to a fixed schema of rows and columns.

Why this answer

Structured data is defined by its adherence to a fixed schema, typically organized into rows and columns within relational databases. This rigid structure enables efficient querying and manipulation using SQL, as each field has a predefined data type and constraints. The correct answer highlights this fundamental characteristic, which distinguishes structured data from semi-structured or unstructured formats.

Exam trap

The trap here is that candidates often confuse semi-structured data (which has some organizational tags but no fixed schema) with structured data, leading them to select Option B, or they mistakenly think SQL cannot analyze structured data, falling for Option C.

How to eliminate wrong answers

Option B is wrong because a flexible schema that can vary per record describes semi-structured data (e.g., JSON, XML), not structured data. Option C is wrong because structured data is specifically designed to be analyzed using SQL, which is the primary query language for relational databases. Option D is wrong because storing data as blobs in a data lake is characteristic of unstructured data (e.g., images, videos), not structured data, which is stored in tables with defined schemas.

24

MCQ · medium

A company ingests customer clickstream data from its website. The data arrives continuously in JSON format and must be stored for real-time analytics. Which type of data source is being described?

A.Transactional database

B.Flat file

C.Data warehouse

D.Streaming data

Answer: D

Clickstream data arrives continuously, which makes it a streaming source.

Why this answer

The description matches a streaming data source because clickstream data arrives continuously in JSON format and must be stored for real-time analytics. Streaming platforms such as Apache Kafka or Amazon Kinesis ingest this unbounded data in real time, enabling immediate processing and analytics without batch delays.

Exam trap

Cisco often tests the distinction between 'streaming data' and 'data warehouse' by describing continuous ingestion, leading candidates to mistakenly choose 'data warehouse' because they associate analytics with warehousing, ignoring the real-time requirement.

How to eliminate wrong answers

Option A is wrong because a transactional database (e.g., OLTP system) is designed for ACID-compliant transaction processing, not for ingesting continuous, high-velocity streaming data. Option B is wrong because a flat file (e.g., CSV or text file) is a static, batch-oriented storage format that cannot handle real-time, continuous ingestion without manual intervention or scheduled loads. Option C is wrong because a data warehouse is optimized for structured, historical analytics and typically relies on batch ETL processes, not real-time streaming ingestion from clickstream sources.

25

Multi-Select · medium

A company needs to store data that is highly interconnected, such as social network relationships. Which two database types are best suited for this? (Select TWO.)

Select 2 answers

A.Relational database

B.Wide-column store (e.g., Cassandra)

C.Document store (e.g., MongoDB)

D.Graph database (e.g., Neo4j)

E.Key-value store (e.g., Redis)

Answers: C, D

Document stores can embed related data and are flexible for interconnected data.

Why this answer

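Graph databases store relationships as first-class edges, which makes traversals such as friend-of-friend queries efficient. Document stores can embed or reference related entities inside a document and have a flexible schema, so this question accepts them as the second fit among the listed options.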

26

Multi-Select · medium

A data governance team is defining roles and responsibilities for data management. Which TWO of the following are common data governance roles? (Select TWO).

Select 2 answers

A.Data steward

B.Data owner

C.Data scientist

D.Database administrator

E.Data analyst

Answers: A, B

Data steward manages data quality and standards.

Why this answer

Data owner and data steward are standard roles in data governance, responsible for data ownership and stewardship respectively.

27

MCQ · medium

A company wants to share a dataset with external partners via an API. Which API type is typically used for web services and uses XML or JSON for messaging?

A.REST API

B.GraphQL API

C.SOAP API

D.WebSocket API

Answer: A

REST APIs use JSON/XML and are widely used for web services.

Why this answer

REST APIs commonly use JSON or XML for messaging, making them a standard choice for data sharing over the web.

28

MCQ · easy

Which of the following data types best describes a JSON file containing customer orders with varying fields per record?

A.Unstructured data

B.Structured data

C.Relational data

D.Semi-structured data

Answer: D

JSON organizes data with keys and nested objects but lets fields vary between records, making it semi-structured.

Why this answer

JSON is a semi-structured data format because it allows flexible schema where fields can vary between records.

29

MCQ · easy

Which of the following best describes a data mart?

A.A repository for raw, unprocessed data

B.An OLTP system for transaction processing

C.A subject-specific subset of a data warehouse

D.A tool for extract, transform, and load processes

AnswerC

Data marts are subsets of data warehouses.

Why this answer

A data mart is a subject-specific subset of a data warehouse, focused on a particular department or function.

30

MCQeasy

Which of the following is an example of semi-structured data?

A.A CSV file without header

B.An image file

C.A table in a relational database

D.A JSON file

AnswerD

JSON uses key-value pairs with flexible schema, typical of semi-structured data.

Why this answer

Semi-structured data has tags or markers to separate data elements, like JSON or XML.

31

MCQmedium

A company uses a data warehouse for reporting. They need to extract data from multiple sources, load it into a staging area, and then transform it before moving to the warehouse. This process is known as:

A.ELT

B.ETL

C.Data replication

D.Data ingestion

AnswerB

ETL extracts, transforms in a staging area, then loads into warehouse.

Why this answer

The process described—extracting data from multiple sources, loading it into a staging area, and then transforming it before moving to the warehouse—is the classic definition of ETL (Extract, Transform, Load). In ETL, transformation occurs after extraction but before loading into the target system, which is exactly what the staging area is used for. This contrasts with ELT, where transformation happens after loading into the warehouse.

Exam trap

The trap here is that candidates confuse the order of operations in ETL versus ELT, assuming that because modern cloud warehouses support ELT, the described staging-area process must be ELT, when in fact the staging area is a hallmark of traditional ETL.

How to eliminate wrong answers

Option A is wrong because ELT (Extract, Load, Transform) loads raw data into the target system first and transforms it later, which is the opposite of the described sequence where transformation occurs before moving to the warehouse. Option C is wrong because data replication refers to copying data from one system to another for redundancy or availability, not a multi-stage pipeline with transformation. Option D is wrong because data ingestion is a broad term covering the initial import of data into a system, but it does not specifically include the staging and transformation steps described in the question.
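
The ordering that defines ETL can be sketched in a few lines. This is an illustration only: the source rows, the transform function, and the sales table are assumptions, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical source rows, as they might arrive from two systems.
source_rows = [
    {"id": "1", "name": " Ana ", "amount": "25.50"},
    {"id": "2", "name": "BEN", "amount": "40"},
]

def transform(row):
    """Clean one staged row: cast types and normalize the name."""
    return (int(row["id"]), row["name"].strip().title(), float(row["amount"]))

# Extract: copy source rows into a staging list.
staging = list(source_rows)

# Transform: runs outside the warehouse, before any load.
clean_rows = [transform(r) for r in staging]

# Load: only transformed rows reach the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
warehouse.commit()

loaded = warehouse.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

The warehouse never sees the untrimmed names or string-typed amounts, which is the property that separates this flow from ELT.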

32

Multi-Selectmedium

A database designer wants to improve query performance on a large table that is frequently filtered by multiple columns. Which TWO types of indexes could be beneficial? (Select TWO).

Select 2 answers

A.Bitmap index

B.Composite index

C.Hash index

D.Full-text index

E.B-tree index

AnswersB, E

Composite indexes on multiple columns can speed up queries filtering by those columns.

Why this answer

Composite indexes cover multiple columns, and B-tree indexes are general-purpose and support range queries, both useful for filtering on multiple columns.

33

MCQhard

A DBA wants to improve query performance on a large table that is frequently filtered on two columns: department_id and hire_date. The table has millions of rows. Which index strategy would be most effective?

A.Create a composite B-tree index on (department_id, hire_date)

B.Create a bitmap index on hire_date

C.Create a hash index on department_id only

D.Create two separate B-tree indexes, one on each column

AnswerA

A composite index that lists both columns, with the equality column first, can serve both filter conditions in one scan.

Why this answer

A composite B-tree index on (department_id, hire_date) is most effective because it allows the database to satisfy equality and range predicates on both columns in a single index scan. B-tree indexes are optimized for high-cardinality columns and support efficient multi-column filtering when the leading column matches the query's equality condition, followed by the range condition on hire_date.

Exam trap

The trap here is that candidates often assume two separate single-column indexes are equivalent to a composite index, but they fail to realize that the database cannot efficiently combine them for range predicates without a costly index merge operation.

How to eliminate wrong answers

Option B is wrong because bitmap indexes are designed for low-cardinality columns (e.g., gender or status) and perform poorly with high-cardinality columns like hire_date, leading to excessive bitmap merge overhead and poor query performance. Option C is wrong because a hash index on department_id only supports equality lookups, not range queries on hire_date, and cannot be used for filtering on both columns simultaneously. Option D is wrong because two separate B-tree indexes would force the optimizer to choose one index and then filter the other column via a table access (or perform an expensive index merge), which is less efficient than a single composite index that can directly satisfy both predicates.
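
One way to check whether a composite index is actually used is to ask the database for its query plan. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for the DBA's system. The table layout and the index name idx_dept_hire are assumptions, and other engines expose plans through different commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    " emp_id INTEGER PRIMARY KEY, department_id INTEGER, hire_date TEXT)"
)
# Composite index: equality column first, range column second.
conn.execute("CREATE INDEX idx_dept_hire ON employees (department_id, hire_date)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT emp_id FROM employees "
    "WHERE department_id = ? AND hire_date >= ?",
    (10, "2020-01-01"),
).fetchall()

# The last column of each plan row is a human-readable description.
details = [row[-1] for row in plan]
print(details)
```

The exact wording of the plan text varies between SQLite versions, so the useful signal is simply that the index name appears in it.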

34

Multi-Selectmedium

A data analyst is performing a join between two tables: 'employees' and 'departments'. The 'employees' table has a foreign key 'dept_id' referencing the 'departments' table. Which two join types would include all rows from the 'employees' table, regardless of whether there is a matching department? (Select TWO)

Select 2 answers

A.LEFT JOIN

B.INNER JOIN

C.CROSS JOIN

D.RIGHT JOIN

E.FULL OUTER JOIN

AnswersA, E

LEFT JOIN returns all rows from the left (employees) table.

Why this answer

Both LEFT JOIN and FULL OUTER JOIN keep every row from the 'employees' table. With 'employees' as the left table, LEFT JOIN returns all employees and fills the department columns with NULL where no match exists. FULL OUTER JOIN returns all rows from both tables, so it also includes every employee.

RIGHT JOIN keeps every row of the right table, so it preserves all employees only if 'employees' is written on the right side of the join, which the question does not state. INNER JOIN drops employees without a matching department, and CROSS JOIN pairs every employee with every department rather than matching on 'dept_id'.

35

Multi-Selectmedium

A university database stores student information in a normalized schema. The 'students' table has a primary key 'student_id'. The 'enrollments' table has a foreign key 'student_id' referencing 'students'. Which two of the following are true about primary and foreign keys? (Select TWO)

Select 2 answers

A.A foreign key must have the same name as the primary key it references

B.A foreign key ensures referential integrity between tables

C.A foreign key can reference a column that is not a primary key

D.A table can have multiple primary keys

E.A primary key column cannot contain NULL values

AnswersB, E

Foreign keys enforce that values match the referenced primary key.

Why this answer

Option B is correct because a foreign key enforces referential integrity by ensuring that every value in the foreign key column of the 'enrollments' table matches a valid primary key value in the 'students' table. This prevents orphaned records and maintains consistency across related tables in a normalized relational database. Option E is correct because a primary key must identify every row, so its columns are implicitly NOT NULL.

Exam trap

The trap here is that candidates often assume a foreign key can reference any column, forgetting that the referenced column must have a unique constraint (primary key or unique) to ensure a single target row, which is a common point of confusion in DA0-001.
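
Referential integrity is easy to see in action. The following sketch uses SQLite via Python's sqlite3 module. The column names beyond student_id and the sample rows are assumptions made for the example. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, while many other engines enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE enrollments ("
    " enrollment_id INTEGER PRIMARY KEY,"
    " student_id INTEGER NOT NULL REFERENCES students(student_id),"
    " course TEXT)"
)
conn.execute("INSERT INTO students VALUES (1, 'Ana')")
conn.execute("INSERT INTO enrollments VALUES (100, 1, 'Databases')")  # valid parent

try:
    # Student 999 does not exist, so this row would be an orphan.
    conn.execute("INSERT INTO enrollments VALUES (101, 999, 'Networks')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

enrollment_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(orphan_rejected, enrollment_count)
```

The second insert fails because no parent row exists, so the enrollments table keeps only the valid row.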

36

MCQmedium

A data analyst needs to combine data from two tables: one containing customer information and another containing order details. The analyst wants to include all customers, even those who have not placed any orders. Which type of join should be used?

A.FULL OUTER JOIN

B.LEFT JOIN

C.INNER JOIN

D.RIGHT JOIN

AnswerB

LEFT JOIN includes all rows from the left table, even if no match exists in the right table.

Why this answer

A LEFT JOIN returns all rows from the left table (customers) and matching rows from the right table (orders). If a customer has no orders, the order columns will contain NULLs. This satisfies the requirement to include all customers, even those without orders.

Exam trap

The trap here is that candidates often confuse LEFT JOIN with FULL OUTER JOIN, thinking they need to preserve all rows from both tables, when the requirement only specifies preserving all customers.

How to eliminate wrong answers

Option A is wrong because a FULL OUTER JOIN returns all rows from both tables, which would include unmatched orders (if any) and is unnecessary when only all customers are needed. Option C is wrong because an INNER JOIN returns only rows with matches in both tables, excluding customers who have not placed orders. Option D is wrong because a RIGHT JOIN returns all rows from the right table (orders) and matching customers, which would omit customers without orders if the customer table is on the left.
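
The difference between the join types shows up clearly on a tiny dataset. This is a sketch using SQLite through Python's sqlite3 module. The table layouts and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 3, 15.0);
""")

# LEFT JOIN: every customer appears, with NULL order columns if unmatched.
left_rows = conn.execute(
    "SELECT c.name, o.order_id FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.customer_id "
    "ORDER BY c.customer_id, o.order_id"
).fetchall()

# INNER JOIN: only customers with at least one order appear.
inner_rows = conn.execute(
    "SELECT c.name, o.order_id FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.customer_id "
    "ORDER BY c.customer_id, o.order_id"
).fetchall()

print(left_rows)   # includes ('Ben', None)
print(inner_rows)  # Ben is absent
```

Ben has no orders, so he survives the LEFT JOIN with a NULL order_id (None in Python) and disappears from the INNER JOIN.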

37

MCQhard

A data engineer is designing a system to handle high-velocity clickstream data from a website. The system must allow low-latency writes and support key-value lookups. Which type of database is most appropriate?

A.Graph database (e.g., Neo4j)

B.Document store (e.g., MongoDB)

C.Key-value store (e.g., Redis)

D.Wide-column store (e.g., Cassandra)

AnswerC

Key-value stores excel at high-speed writes and lookups.

Why this answer

A key-value store like Redis is optimized for high-velocity writes and low-latency key-value lookups, making it ideal for clickstream data.

38

MCQeasy

Which of the following data types is characterized by a flexible schema and is commonly represented using JSON or XML?

A.Unstructured data

B.Structured data

C.Semi-structured data

D.Relational data

AnswerC

JSON and XML are typical semi-structured formats.

Why this answer

JSON and XML are examples of semi-structured data, which has a flexible schema unlike structured data (fixed schema) or unstructured data (no schema).

39

Multi-Selecthard

A company is designing a data pipeline to process streaming data from social media feeds. Which THREE of the following are characteristics of streaming data? (Select THREE).

Select 3 answers

A.Data is unbounded and infinite

B.Data is processed in micro-batches

C.Data arrives continuously

D.Data is stored permanently before processing

E.Data is processed in real-time

AnswersA, C, E

Streaming data is unbounded.

Why this answer

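Streaming data is inherently unbounded and infinite because social media feeds generate a continuous, never-ending flow of events. Unlike batch data, there is no natural end to the stream; new tweets, posts, or interactions arrive constantly, making the dataset theoretically infinite in size. For the same reason, the data arrives continuously (option C) and is processed in real time as it arrives (option E) rather than after a complete dataset has been collected.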

Exam trap

The trap here is that candidates confuse processing strategies (like micro-batching) with the inherent nature of streaming data, or they assume streaming data must be stored before processing, which is a batch-oriented mindset.
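
A Python generator gives a rough feel for an unbounded source. This is only a sketch: event_stream and its fields are hypothetical, and a real pipeline would read from a message broker rather than a counter.

```python
import itertools

def event_stream():
    """Hypothetical unbounded source: yields events forever."""
    for n in itertools.count(1):
        yield {"event_id": n, "likes": n % 3}

running_total = 0
seen = 0

# Each event is handled as it arrives; islice only stops the demo.
for event in itertools.islice(event_stream(), 5):
    running_total += event["likes"]
    seen += 1

print(seen, running_total)
```

The consumer keeps running state and never waits for the stream to end, because the stream has no end. Cutting it off after five events is purely a convenience for the example.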

40

MCQeasy

A company is designing a database for an e-commerce application that requires high transaction throughput and must guarantee that each transaction is processed atomically. Which property of ACID ensures that a transaction is either fully completed or not executed at all?

A.Atomicity

B.Isolation

C.Durability

D.Consistency

AnswerA

Atomicity ensures the transaction is all-or-nothing.

Why this answer

Atomicity guarantees that a transaction is treated as a single unit; it either completes entirely or is rolled back, preventing partial updates.
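
All-or-nothing behavior can be demonstrated with a failed transfer. This sketch uses SQLite via Python's sqlite3 module. The accounts table, the names, and the amounts are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("buyer", 50), ("seller", 0)])
conn.commit()

try:
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'seller'")
        # This debit would leave the buyer at -30 and violates the CHECK constraint.
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'buyer'")
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
```

The credit to the seller succeeded on its own, but because the debit failed the whole transaction was rolled back and both balances are unchanged.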

41

MCQmedium

A company is ingesting data from multiple sources into a cloud data warehouse. They decide to load the data raw and then perform transformations within the warehouse. Which approach does this describe?

A.Data lake ingestion

B.ETL

C.ELT

D.Stream processing

AnswerC

ELT loads raw data then transforms within the warehouse.

Why this answer

ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the data warehouse, as opposed to ETL which transforms before loading.

42

Multi-Selecthard

A company is migrating its data pipeline from on-premises to the cloud. The current ETL process transforms data before loading into a data warehouse. The new architecture will use ELT instead. Which THREE of the following are advantages of ELT over traditional ETL? (Select 3)

Select 3 answers

A.Ensures data quality before loading

B.Provides ability to reprocess raw data if transformation logic changes

C.Leverages the processing power of the cloud data warehouse

D.Reduces storage costs by storing only transformed data

E.Allows for schema-on-read, enabling flexible analysis

AnswersB, C, E

Raw data is preserved, allowing re-transformation.

Why this answer

ELT leverages cloud scalability, allows raw data storage for flexibility, and enables schema-on-read.

43

Multi-Selectmedium

A data governance team is establishing policies for data quality. Which THREE of the following are common dimensions of data quality? (Select 3)

Select 3 answers

A.Consistency

B.Completeness

C.Accuracy

D.Velocity

E.Volume

AnswersA, B, C

Data is uniform across systems.

Why this answer

Consistency is a common dimension of data quality because it ensures that data values are uniform across different datasets or systems, preventing contradictions. For example, if a customer's address is stored as '123 Main St' in one database and '123 Main Street' in another, consistency rules would flag this discrepancy. This dimension is critical for reliable reporting and integration. Completeness (no required values are missing) and accuracy (values correctly reflect the real-world entity) are the other two quality dimensions in the list.

Exam trap

The trap here is that candidates confuse the characteristics of big data (velocity, volume, variety) with the dimensions of data quality, leading them to select velocity or volume instead of the correct quality-focused options.
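
Two of these dimensions can be measured with very little code. The sketch below is illustrative only: the crm and billing records are invented, and accuracy is left out because checking it requires a trusted reference to compare against.

```python
# Hypothetical customer records from two systems.
crm = {
    "c1": {"email": "ana@example.com", "address": "123 Main St"},
    "c2": {"email": None, "address": "9 Oak Ave"},
}
billing = {
    "c1": {"address": "123 Main Street"},
    "c2": {"address": "9 Oak Ave"},
}

# Completeness: share of CRM records with a non-empty email.
completeness = sum(1 for r in crm.values() if r["email"]) / len(crm)

# Consistency: customer IDs whose address differs between systems.
inconsistent = sorted(
    cid for cid in crm
    if cid in billing and crm[cid]["address"] != billing[cid]["address"]
)

print(completeness, inconsistent)
```

Neither check says anything about how much data there is or how fast it arrives, which is why volume and velocity are not quality dimensions.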

44

MCQmedium

A database administrator wants to ensure that every value in a column matches values in a primary key column of another table. Which constraint enforces this rule?

A.Unique constraint

B.Primary key

C.Check constraint

D.Foreign key

AnswerD

Foreign key enforces referential integrity.

Why this answer

A foreign key constraint ensures referential integrity by requiring that values in a column match values in the primary key of another table.

45

MCQhard

An analyst is reviewing a table that stores customer orders. The table contains columns: OrderID, CustomerName, Product1, Product1Qty, Product2, Product2Qty. This design violates which normal form?

A.No violation

B.Third normal form (3NF)

C.Second normal form (2NF)

D.First normal form (1NF)

AnswerD

Repeating groups violate 1NF.

Why this answer

The table violates First Normal Form (1NF) because it contains repeating groups (Product1, Product1Qty, Product2, Product2Qty) instead of storing each product in a separate row. 1NF requires that each column contains atomic values and that there are no repeating groups or arrays. The presence of multiple product columns for a single order breaks this atomicity and normalization rule.

Exam trap

The trap here is that candidates often think the table is already in 1NF because it has a primary key (OrderID), but they overlook the repeating group columns that violate the atomicity requirement of 1NF.

How to eliminate wrong answers

Option A is wrong because the table clearly violates normalization rules due to repeating groups, so a violation exists. Option B is wrong because Third Normal Form (3NF) requires that the table already be in 2NF and have no transitive dependencies; the immediate violation is at the 1NF level, not 3NF. Option C is wrong because Second Normal Form (2NF) requires that the table first satisfy 1NF and then have no partial dependencies; since the table fails 1NF, it cannot be evaluated for 2NF.
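
Removing the repeating group means turning the numbered product columns into rows. The following is a minimal sketch of that reshaping in plain Python. The sample values and the orders / order_items split are assumptions about how one might restructure the table.

```python
# Wide rows with repeating groups, as in the question.
wide_orders = [
    {"OrderID": 1, "CustomerName": "Ana",
     "Product1": "Pen", "Product1Qty": 2, "Product2": "Ink", "Product2Qty": 1},
    {"OrderID": 2, "CustomerName": "Ben",
     "Product1": "Pad", "Product1Qty": 5, "Product2": None, "Product2Qty": None},
]

orders = []        # one row per order
order_items = []   # one row per (order, product)

for row in wide_orders:
    orders.append((row["OrderID"], row["CustomerName"]))
    for slot in (1, 2):
        product = row[f"Product{slot}"]
        if product is not None:
            order_items.append((row["OrderID"], product, row[f"Product{slot}Qty"]))

print(orders)
print(order_items)
```

After the split, an order with three products needs one more row rather than two more columns, and the empty Product2 slot for the second order simply does not exist.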

46

MCQmedium

A company is building a data pipeline to ingest sensor data from IoT devices. The data arrives continuously in small batches and must be processed in real-time for monitoring. Which type of data source best describes this scenario?

A.Transactional database

B.Streaming data

C.Web scraping

D.Flat file

AnswerB

IoT sensors produce streaming data that is continuous and requires real-time processing.

Why this answer

B is correct because the scenario describes data arriving continuously in small batches that must be processed in real-time for monitoring. This is the defining characteristic of streaming data, which is typically ingested via technologies like Apache Kafka, Amazon Kinesis, or MQTT brokers, enabling low-latency processing and immediate alerting.

Exam trap

The trap here is that candidates may confuse 'real-time' with 'fast batch processing' and incorrectly choose a transactional database, not recognizing that streaming data sources are specifically designed for continuous, unbounded data flows with sub-second latency requirements.

How to eliminate wrong answers

Option A is wrong because a transactional database (e.g., PostgreSQL, MySQL) is designed for ACID-compliant, query-based storage and retrieval, not for continuous real-time ingestion of sensor data; it would introduce latency and cannot handle unbounded streams efficiently. Option C is wrong because web scraping is a technique for extracting data from web pages via HTTP requests (e.g., using BeautifulSoup or Scrapy), which is batch-oriented and not suited for real-time IoT sensor data. Option D is wrong because a flat file (e.g., CSV, JSON file) is a static storage format that requires manual or scheduled batch loads, making it incapable of supporting real-time processing or continuous ingestion.

47

MCQeasy

An organization needs to store raw data from IoT sensors in its native format for future analysis. Which storage solution is best suited for this purpose?

A.Relational database

B.Data lake

C.Data mart

D.Data warehouse

AnswerB

Data lakes store raw data in native format.

Why this answer

A data lake is designed to store raw data in its native format, including unstructured and semi-structured data from IoT sensors, without requiring a predefined schema. This allows the organization to preserve the original data for future analysis, unlike traditional databases that enforce structure upon ingestion.

Exam trap

The trap here is that candidates often confuse a data warehouse with a data lake, assuming both are for storage, but a data warehouse requires ETL and structured schemas, making it unsuitable for raw, native-format IoT data.

How to eliminate wrong answers

Option A is wrong because a relational database requires a predefined schema and is optimized for structured data, not raw, native-format IoT sensor data. Option C is wrong because a data mart is a subset of a data warehouse focused on a specific business domain, not designed for storing raw, unprocessed data. Option D is wrong because a data warehouse stores processed, structured, and transformed data for analytical queries, not raw data in its native format.

48

MCQeasy

Which of the following is an example of unstructured data?

A.A JSON file

B.An image file

C.A relational database table

D.A CSV file with rows and columns

AnswerB

Images are unstructured.

Why this answer

Unstructured data has no predefined schema. Images are a classic example of unstructured data.

49

MCQmedium

A data governance team is implementing a program to ensure consistent definitions and quality of customer data across the organization. They assign a senior manager to be accountable for the data asset. Which role does this manager fulfill?

A.Data analyst

B.Data custodian

C.Data owner

D.Data steward

AnswerC

Data owner is accountable for a specific data domain.

Why this answer

The data owner is the senior manager accountable for a specific data asset, including its quality, definition, and compliance. In the DA0-001 context, the data owner has ultimate responsibility for the data, not just day-to-day management. This role ensures consistent definitions and quality across the organization, aligning with the governance team's objectives.

Exam trap

The trap here is confusing the data owner's accountability with the data steward's operational duties, leading candidates to pick 'Data steward' because they associate governance with hands-on management rather than executive responsibility.

How to eliminate wrong answers

Option A is wrong because a data analyst focuses on analyzing and interpreting data, not on accountability for data definitions or quality. Option B is wrong because a data custodian is responsible for the technical environment and security of data, not for defining or governing its meaning. Option D is wrong because a data steward handles day-to-day data governance tasks like metadata management and quality monitoring, but does not hold the ultimate accountability that a senior manager does.

50

MCQhard

A data governance team is establishing policies to ensure data quality. They define rules for data accuracy, completeness, and consistency. Which data governance function is primarily responsible for defining and enforcing these rules?

A.Data stewardship

B.Data ownership

C.Data quality management

D.Master data management

AnswerC

Data quality management is responsible for defining and enforcing quality rules.

Why this answer

Data quality management is the function that sets standards and processes to ensure data is accurate, complete, and consistent. Data stewardship often involves implementing these rules, but the overall responsibility lies with data quality management.

51

Multi-Selectmedium

A data analyst is extracting data from a web page using web scraping techniques. The data will be used for market research. Which TWO of the following are common challenges associated with web scraping?

Select 2 answers

A.Limited API rate limits

B.Legal and ethical restrictions

C.Website structure changes

D.High latency of data transfer

E.Inconsistent data formatting

AnswersB, C

Many websites prohibit scraping in their terms of service, and legal issues may arise.

Why this answer

Option B is correct because web scraping often involves accessing data that may be protected by copyright, terms of service, privacy regulations such as GDPR, or computer misuse laws such as the Computer Fraud and Abuse Act (CFAA). Even if data is publicly accessible, repurposing it for market research without permission can lead to legal liability or ethical violations, making this a fundamental challenge. Option C is correct because a scraper depends on the page's HTML structure, so a change to the site's layout or markup can break extraction without warning.

Exam trap

Cisco often tests the distinction between API-related challenges (rate limits, authentication) and web-scraping-specific challenges (structure changes, legal/ethical issues), so candidates mistakenly select 'Limited API rate limits' because they confuse web scraping with API consumption.

52

MCQeasy

Which stage of the data lifecycle involves converting raw data into a usable format, such as cleaning or validating?

A.Archival

B.Processing

C.Ingestion

D.Storage

AnswerB

Processing includes cleaning and transforming raw data.

Why this answer

Processing is the stage where raw data is transformed into a usable format through cleaning, validation, normalization, or aggregation. This step ensures data quality and consistency before analysis or storage, directly matching the question's description.

Exam trap

The trap here is confusing ingestion (data arrival) with processing (data transformation), as both occur early in the lifecycle but serve distinct purposes.

How to eliminate wrong answers

Option A is wrong because archival refers to moving data to long-term storage for compliance or historical purposes, not cleaning or validating. Option C is wrong because ingestion is the initial capture or import of raw data from sources, not its transformation. Option D is wrong because storage is the persistent retention of data in databases or filesystems, not the conversion into a usable format.
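
A processing step typically combines cleaning, validation, and deduplication. Here is a small sketch under assumed rules: the records, the clean function, and the very loose email check are all invented for illustration and are not a recommended validation standard.

```python
raw_records = [
    {"email": " Ana@Example.com ", "age": "34"},
    {"email": "ana@example.com", "age": "34"},   # duplicate after cleaning
    {"email": "not-an-email", "age": "29"},      # fails validation
    {"email": "ben@example.com", "age": "abc"},  # fails validation
]

def clean(record):
    """Return a cleaned record, or None if it fails validation."""
    email = record["email"].strip().lower()
    if "@" not in email or not record["age"].isdigit():
        return None
    return {"email": email, "age": int(record["age"])}

processed = []
seen_emails = set()
for rec in raw_records:
    cleaned = clean(rec)
    # Keep only valid records whose email has not been seen yet.
    if cleaned and cleaned["email"] not in seen_emails:
        seen_emails.add(cleaned["email"])
        processed.append(cleaned)

print(processed)
```

Ingestion would have accepted all four raw records as they are. Processing is the stage that reduces them to the single usable one.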

53

MCQmedium

A data analyst wants to retrieve data from a REST API that returns JSON. Which step is part of the data lifecycle for this activity?

A.Data archival

B.Data sharing

C.Data deletion

D.Data ingestion

AnswerD

Ingestion is the initial step of bringing data from a source.

Why this answer

Ingestion is the process of bringing data into a system for further processing.

54

MCQhard

A data engineer is designing a system to store raw sensor data from thousands of IoT devices. The data will be used later for various analytics projects, but the schema is not yet defined. Which storage solution is most appropriate?

A.Data lake

B.Data mart

C.Data warehouse

D.Relational database

AnswerA

Data lakes store raw data in any format and allow schema-on-read.

Why this answer

A data lake stores raw data in its native format (e.g., S3, ADLS) without requiring a predefined schema, making it suitable for IoT data.
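
Schema-on-read means the structure is decided by the reader, not by the storage layer. The sketch below imitates that idea with raw JSON lines held in a list. The payloads, field names, and the read_temperature_c helper are hypothetical, and a real lake would keep such lines as files in object storage.

```python
import json

# Raw sensor payloads stored as-is; devices disagree on field names.
raw_lines = [
    '{"device": "t-1", "temp_c": 21.5}',
    '{"device": "t-2", "temperature": 70.7, "unit": "F"}',
    '{"device": "h-1", "humidity": 40}',
]

def read_temperature_c(line):
    """Schema applied at read time: return Celsius, or None if absent."""
    rec = json.loads(line)
    if "temp_c" in rec:
        return rec["temp_c"]
    if rec.get("unit") == "F" and "temperature" in rec:
        return round((rec["temperature"] - 32) * 5 / 9, 1)
    return None

temps = [read_temperature_c(line) for line in raw_lines]
print(temps)
```

Nothing was rejected or reshaped when the lines were stored. A later project interested in humidity could read the same raw lines with a different function.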

55

MCQmedium

An OLTP system processes thousands of transactions per second. Which property ensures that a transaction is fully completed or fully rolled back, preventing partial updates?

A.Isolation

B.Durability

C.Atomicity

D.Consistency

AnswerC

Atomicity ensures all operations in a transaction complete or none do.

Why this answer

Atomicity guarantees that a transaction is treated as a single unit, completed entirely or not at all.

56

MCQmedium

A table named Orders has columns OrderID, CustomerID, OrderDate, and TotalAmount. Which column should be the primary key to uniquely identify each order?

A.OrderDate

B.OrderID

C.TotalAmount

D.CustomerID

AnswerB

OrderID is unique per order.

Why this answer

The OrderID column is the correct choice for the primary key because it contains unique values for each order, ensuring that each row can be uniquely identified. A primary key must be unique, non-null, and stable; OrderID satisfies all these requirements, whereas the other columns do not guarantee uniqueness or are subject to change.

Exam trap

The trap here is that candidates may confuse a column that is frequently used for filtering or grouping (like CustomerID or OrderDate) with one that guarantees uniqueness, overlooking the fundamental primary key requirement of uniqueness and non-nullability.

How to eliminate wrong answers

Option A is wrong because OrderDate is not unique; multiple orders can occur on the same date, and it can also be null, violating primary key constraints. Option C is wrong because TotalAmount can have duplicate values (e.g., two orders with the same total) and is not inherently unique or stable. Option D is wrong because CustomerID is not unique per order; a single customer can place many orders, so it cannot uniquely identify each order row.
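
The uniqueness requirement can be checked directly. This sketch uses SQLite via Python's sqlite3 module with the column names from the question. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    " OrderID INTEGER PRIMARY KEY, CustomerID INTEGER,"
    " OrderDate TEXT, TotalAmount REAL)"
)
# Same customer, date, and total on two orders is perfectly legal.
conn.execute("INSERT INTO Orders VALUES (1, 7, '2024-03-01', 19.99)")
conn.execute("INSERT INTO Orders VALUES (2, 7, '2024-03-01', 19.99)")

try:
    # Reusing OrderID 2 violates the primary key.
    conn.execute("INSERT INTO Orders VALUES (2, 8, '2024-03-02', 5.00)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(duplicate_rejected, order_count)
```

Two rows share CustomerID, OrderDate, and TotalAmount without any problem, which shows why none of those columns could serve as the key.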

57

MCQmedium

A data analyst needs to combine rows from two tables based on a related column, but only wants rows that have matching values in both tables. Which join type should the analyst use?

A.RIGHT JOIN

B.INNER JOIN

C.FULL OUTER JOIN

D.LEFT JOIN

AnswerB

INNER JOIN returns only matching rows.

Why this answer

INNER JOIN returns only rows with matching values in both tables, which matches the requirement.

58

MCQmedium

A data analyst needs to combine customer information from a CRM table and order information from an orders table, returning only customers who have placed at least one order. Which type of join should the analyst use?

A.RIGHT JOIN

B.INNER JOIN

C.FULL OUTER JOIN

D.LEFT JOIN

AnswerB

INNER JOIN returns only rows where the join condition is met, i.e., customers with orders.

Why this answer

An INNER JOIN between the CRM table and the orders table returns only rows where there is a match in both tables based on the join key (e.g., customer ID). This satisfies the requirement to return only customers who have placed at least one order, because any customer without an order in the orders table will be excluded from the result set.

Exam trap

Cisco often tests the misconception that a LEFT JOIN will include only customers with orders because it 'joins' the tables, but the trap is that a LEFT JOIN preserves all rows from the left table, including those with no matches, so it does not filter out customers without orders.

How to eliminate wrong answers

Option A (RIGHT JOIN) is wrong because it returns all rows from the orders table and matching rows from the CRM table, which could include orders without a matching customer (if referential integrity is not enforced) and would not limit results to only customers with orders. Option C (FULL OUTER JOIN) is wrong because it returns all rows from both tables, including customers without orders and orders without customers, which violates the requirement to return only customers who have placed at least one order. Option D (LEFT JOIN) is wrong because it returns all rows from the CRM table and matching rows from the orders table, which would include customers with zero orders (where the orders columns are NULL), failing to filter out customers without orders.

59

Multi-Selecteasy

Which TWO of the following are characteristics of OLTP systems? (Select 2)

Select 2 answers

A.Typically uses a denormalized schema

B.Optimized for complex analytical queries

C.Stores historical data for trend analysis

D.Designed for high transaction throughput

E.Supports ACID transactions

AnswersD, E

OLTP handles many concurrent transactions.

Why this answer

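OLTP systems are designed for high transaction throughput, handling large volumes of short, atomic transactions efficiently. They prioritize fast data processing and immediate consistency, making option D correct. Option E is also correct because OLTP systems rely on ACID transactions to keep concurrent updates correct.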

Exam trap

The trap here is that candidates often confuse OLTP with OLAP, mistakenly selecting denormalized schemas or analytical optimization as OLTP characteristics, when in fact OLTP emphasizes normalized schemas and high transaction throughput with ACID compliance.

60

MCQmedium

A company has a large data warehouse running on Snowflake. They receive daily CSV files from multiple sources and load them directly into the warehouse, then run SQL transformations to clean and aggregate the data. Which data integration approach does this describe?

A.ELT

B.Data streaming

C.ETL

D.CDC

AnswerA

ELT loads raw data, then transforms in the warehouse, typical for modern cloud warehouses.

Why this answer

This describes ELT (Extract, Load, Transform) because the raw CSV files are first loaded directly into Snowflake, and then SQL transformations are applied within the warehouse. Unlike ETL, where data is transformed before loading, ELT leverages Snowflake's compute power to perform transformations after ingestion, which is efficient for large-scale batch processing.

Exam trap

The trap here is that candidates confuse ELT with ETL because both involve transformations, but the key distinction is the order of loading versus transforming; Cisco often tests this by describing the sequence of operations to see if you recognize that loading raw data first is the hallmark of ELT.

How to eliminate wrong answers

Option B is wrong because data streaming involves continuous, real-time ingestion (e.g., using Kafka or Kinesis), not daily batch CSV file loads. Option C is wrong because ETL would transform the data before loading into Snowflake, but the question states raw CSV files are loaded directly and then transformed afterward. Option D is wrong because CDC (Change Data Capture) captures incremental changes from source databases (e.g., via Debezium or Oracle GoldenGate), not daily full-file CSV imports.
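
The load-then-transform order can be sketched as follows. An in-memory SQLite database stands in for Snowflake here, and the raw_sales and daily_sales tables and the sample rows are assumptions. Snowflake's own loading commands are not shown.

```python
import sqlite3

# Hypothetical daily CSV rows; everything arrives as text.
raw_rows = [
    ("2024-01-01", "north", "10.5"),
    ("2024-01-01", "north", "4.5"),
    ("2024-01-01", "south", "7"),
]

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Load: raw rows go in untouched.
wh.execute("CREATE TABLE raw_sales (sale_date TEXT, region TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: casting and aggregation run as SQL inside the warehouse.
wh.execute(
    "CREATE TABLE daily_sales AS "
    "SELECT sale_date, region, SUM(CAST(amount AS REAL)) AS total "
    "FROM raw_sales GROUP BY sale_date, region"
)

daily = wh.execute("SELECT region, total FROM daily_sales ORDER BY region").fetchall()
raw_count = wh.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
print(daily, raw_count)
```

The raw table is still there after the transformation, so the aggregation can be rerun with different logic without going back to the source files.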

61

MCQeasy

A data engineer needs to extract data from a REST API and load it into a data warehouse. The data is received in JSON format. Which data type best describes JSON?

A.Transactional

B.Semi-structured

C.Unstructured

D.Structured

AnswerB

JSON is semi-structured as it has organizational properties (key-value pairs) but no rigid schema.

Why this answer

JSON (JavaScript Object Notation) is classified as a semi-structured data type because it uses a flexible, self-describing schema with key-value pairs and nested structures, but does not enforce a rigid tabular schema like relational databases. In the context of extracting data from a REST API, JSON allows for varying fields and hierarchical data, which aligns with the semi-structured category.

Exam trap

The trap here is that candidates confuse the presence of structure (keys and values) with being fully structured, overlooking that JSON lacks a fixed schema and allows variability, which places it in the semi-structured category.

How to eliminate wrong answers

Option A is wrong because transactional data refers to records of business transactions (e.g., sales, orders) typically stored in structured formats with ACID properties, not to the format of the data itself. Option C is wrong because unstructured data lacks any predefined structure or schema (e.g., raw text, images, video), whereas JSON has a defined syntax with keys, values, and nesting. Option D is wrong because structured data requires a fixed schema (e.g., rows and columns in a relational table), while JSON allows optional fields and varying data types, making it semi-structured.
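
A short sketch of what "semi-structured" looks like in practice, using Python's standard `json` module. The payload is a made-up API response, not taken from any real service.

```python
import json

# Hypothetical API payload: records share some keys but follow no fixed schema.
payload = """
[
  {"id": 1, "name": "Ada", "tags": ["vip"]},
  {"id": 2, "name": "Lin", "address": {"city": "Oslo"}}
]
"""

records = json.loads(payload)

# Each record describes itself through its keys, and the key sets differ.
key_sets = [sorted(r.keys()) for r in records]

# Moving to a structured (tabular) shape means choosing columns and filling gaps.
table = [
    (r["id"], r["name"], r.get("address", {}).get("city"))
    for r in records
]
```

The first record has no `address` and the second has no `tags`, yet both are valid JSON. A relational table would need those columns declared up front.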

62

MCQhard

In the data lifecycle, which phase involves converting raw data into a usable format for analysis?

A.Ingestion

B.Analysis

C.Archival

D.Processing

AnswerD

Processing transforms raw data into a usable format.

Why this answer

Option D is correct because the processing phase in the data lifecycle is specifically where raw data is cleaned, transformed, and structured into a usable format for analysis. This includes operations such as parsing, normalization, deduplication, and conversion into formats like Parquet or Avro, which are optimized for query engines like Apache Spark or Presto.

Exam trap

The trap here is that candidates often confuse 'ingestion' with 'processing' because both involve moving data, but ingestion is about raw data capture, while processing is about transformation and cleaning before analysis.

How to eliminate wrong answers

Option A is wrong because ingestion refers to the initial collection and import of raw data from sources (e.g., via Apache Kafka or Flume) into a storage system, not its transformation into a usable format. Option B is wrong because analysis is the phase where processed data is queried, visualized, or modeled to derive insights, not where raw data is converted. Option C is wrong because archival involves moving older or infrequently accessed data to long-term storage (e.g., Amazon S3 Glacier or tape) for compliance or cost savings, not for preparing data for analysis.
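
As a rough illustration of the processing phase, the sketch below parses, standardizes, and deduplicates a few raw lines. The input lines and the `process` function are hypothetical; real pipelines would typically do this in a framework such as Spark.

```python
# Hypothetical raw input as it might arrive from ingestion.
raw_lines = [
    "2024-01-05, ALICE ,login",
    "2024-01-05,alice,login",  # duplicate once values are standardized
    "bad line",
    "2024-01-06,Bob,purchase",
]


def process(lines):
    seen = set()
    cleaned = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # parsing: drop malformed rows
        date, user, event = parts
        record = (date, user.lower(), event)  # standardize the user value
        if record in seen:
            continue  # deduplication
        seen.add(record)
        cleaned.append({"date": date, "user": user.lower(), "event": event})
    return cleaned


processed = process(raw_lines)
```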

63

MCQhard

A database has a table that violates 2NF because it contains a composite primary key and some attributes depend only on part of that key. Which normal form would be violated next if the table is not addressed?

A.2NF

B.BCNF

C.3NF

D.1NF

AnswerA

2NF is violated by partial dependencies on a composite key.

Why this answer

The table already violates 2NF because it has a composite primary key and some attributes depend only on part of that key. If this violation is not addressed, the table will continue to violate 2NF, as the condition for 2NF is not satisfied. Therefore, the next normal form that would be violated is 2NF itself, since the table already fails to meet its requirements.

Exam trap

The trap here is that candidates read "violated next" as a prompt to name the next higher normal form and choose 3NF. Each normal form builds on the one before it, so a table that never reaches 2NF cannot be assessed against 3NF; until the partial dependencies are removed, 2NF remains the violation to resolve first.

How to eliminate wrong answers

Option B is wrong because BCNF is a stricter version of 3NF and presupposes that 3NF, and therefore 2NF, is already satisfied; with partial dependencies still present, BCNF is not the next form in question. Option C is wrong because 3NF also presupposes 2NF: it addresses transitive dependencies, which only become the deciding issue once the partial dependencies have been removed. Option D is wrong because 1NF is already satisfied (the table has atomic values and a primary key), and 1NF comes before 2NF, not after it.

64

MCQmedium

A company uses an OLTP system for processing customer transactions. Which characteristic is most important for this system to ensure that each transaction is processed reliably, even if multiple users access the system simultaneously?

A.It uses a columnar storage format

B.It stores data in a denormalized schema

C.It supports complex analytical queries

D.It follows ACID properties

AnswerD

ACID ensures transactions are processed reliably and consistently.

Why this answer

ACID properties (Atomicity, Consistency, Isolation, Durability) are essential for OLTP systems to ensure reliable transaction processing.
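
The sketch below illustrates atomicity and consistency with Python's built-in `sqlite3` module. It is a simplified, single-connection example, so it does not demonstrate isolation between concurrent users. The `accounts` table and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds in one transaction; return True if it committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

In the second call the credit to bob succeeds before the debit violates the CHECK constraint, and the rollback undoes both statements, so no partial transfer is left behind.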

65

Multi-Selectmedium

Which TWO of the following are examples of unstructured data? (Select 2)

Select 2 answers

A.MP4 video

B.CSV file

C.XML file

D.JPEG image

E.JSON document

AnswersA, D

Video files are unstructured.

Why this answer

A is correct because MP4 video files contain binary data that lacks a predefined schema or tabular structure, making them a classic example of unstructured data. Unlike structured data, MP4 files store audiovisual content in a container format that cannot be easily queried or analyzed without specialized processing. D is correct for the same reason: a JPEG image is encoded pixel data with no rows, columns, or fields that a query can address directly.

Exam trap

The trap here is that candidates often confuse structured or semi-structured formats (CSV, XML, JSON) with unstructured data, forgetting that these formats still carry organizing structure (delimited columns, tags, or keys), unlike raw binary or free-form text.

66

MCQeasy

A data architect needs to store raw data from various sources, including social media feeds and log files, for future analysis. The data may be used for machine learning and ad-hoc queries. Which storage solution is most appropriate for storing raw data in its native format?

A.Data lake

B.Data mart

C.Relational database

D.Data warehouse

AnswerA

Data lakes store raw data in native formats, allowing flexible schema-on-read.

Why this answer

A data lake is designed to store raw data in its native format, including unstructured and semi-structured data from sources like social media feeds and log files. It supports schema-on-read, making it ideal for future machine learning and ad-hoc queries without requiring upfront transformation. This aligns directly with the requirement to preserve raw data for flexible analysis.

Exam trap

The trap here is that candidates confuse a data lake with a data warehouse, assuming both are for analytics, but the key distinction is that a data warehouse requires structured, transformed data while a data lake preserves raw, native-format data.

How to eliminate wrong answers

Option B is wrong because a data mart is a subset of a data warehouse optimized for a specific business domain, not for storing raw, diverse data in native format. Option C is wrong because a relational database enforces a rigid schema and ACID constraints, making it unsuitable for unstructured data like social media feeds and log files. Option D is wrong because a data warehouse stores processed, structured data optimized for reporting and BI, not raw data in its native format.

67

MCQmedium

A database administrator is designing a normalized database to reduce data redundancy. They have a table with columns: OrderID, ProductID, ProductName, and Quantity. The table is currently in 1NF. To move to 2NF, which issue must be resolved?

A.The table has repeating groups

B.ProductName depends only on ProductID, causing a partial dependency

C.Quantity depends on both OrderID and ProductID

D.The table has a transitive dependency

AnswerB

Partial dependency on part of a composite key violates 2NF.

Why this answer

To achieve 2NF, the table must be in 1NF and have no partial dependencies, where a non-key column depends on only part of a composite primary key. Here, ProductName depends solely on ProductID, not on the full composite key (OrderID, ProductID), creating a partial dependency. Removing ProductName to a separate Product table resolves this and moves the design to 2NF.

Exam trap

Cisco often tests the distinction between partial dependencies (2NF) and transitive dependencies (3NF), so candidates mistakenly choose a transitive dependency when the real issue is a partial dependency on a composite key.

How to eliminate wrong answers

Option A is wrong because repeating groups are a 1NF violation, and the table is already stated to be in 1NF, so this issue is already resolved. Option C is wrong because Quantity depending on both OrderID and ProductID is a full functional dependency on the composite key, which is acceptable and does not violate 2NF. Option D is wrong because a transitive dependency (where a non-key column depends on another non-key column) is a 3NF violation, not a 2NF issue.
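
The decomposition can be sketched with SQLite. The column names come from the question; the sample rows and the table names `order_items_1nf`, `products`, and `order_items` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- 1NF table from the question: ProductName repeats for every order of a product.
    CREATE TABLE order_items_1nf (
        OrderID INTEGER, ProductID INTEGER, ProductName TEXT, Quantity INTEGER,
        PRIMARY KEY (OrderID, ProductID)
    );
    INSERT INTO order_items_1nf VALUES
        (1, 10, 'Widget', 2), (2, 10, 'Widget', 5), (2, 20, 'Gadget', 1);

    -- 2NF: ProductName moves to a table keyed by ProductID alone.
    CREATE TABLE products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE order_items (
        OrderID INTEGER,
        ProductID INTEGER REFERENCES products(ProductID),
        Quantity INTEGER,
        PRIMARY KEY (OrderID, ProductID)
    );
    INSERT INTO products SELECT DISTINCT ProductID, ProductName FROM order_items_1nf;
    INSERT INTO order_items SELECT OrderID, ProductID, Quantity FROM order_items_1nf;
    """
)

product_rows = conn.execute("SELECT * FROM products ORDER BY ProductID").fetchall()

# Joining the two tables reproduces the original rows, so no information is lost.
rejoined = conn.execute(
    """
    SELECT oi.OrderID, oi.ProductID, p.ProductName, oi.Quantity
    FROM order_items AS oi
    JOIN products AS p ON p.ProductID = oi.ProductID
    ORDER BY oi.OrderID, oi.ProductID
    """
).fetchall()
original = conn.execute(
    "SELECT * FROM order_items_1nf ORDER BY OrderID, ProductID"
).fetchall()
```

After the split, each product name is stored once, so renaming a product touches a single row instead of every order line that mentions it.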

68

Multi-Selecthard

A data analyst is designing a database for a retail application. Which TWO of the following are valid reasons to use a NoSQL document database like MongoDB instead of a relational database? (Select 2)

Select 2 answers

A.The application requires high-speed transactional consistency

B.The data structure evolves frequently

C.The data is hierarchical, such as orders with line items

D.The data has a fixed schema with many relationships

E.The application needs complex joins across multiple tables

AnswersB, C

Document stores allow schema flexibility.

Why this answer

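Option B is correct because NoSQL document databases like MongoDB are schema-flexible, allowing the data structure to evolve over time without a schema migration for every change. This is ideal for agile development where application requirements change frequently, as documents can have varying fields without breaking existing records. Option C is correct because hierarchical data, such as an order with its line items, maps naturally onto a single nested document that can be read and written as one unit without joins.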

Exam trap

The trap here is that candidates often assume NoSQL databases are always faster or more consistent, but the exam tests the specific trade-offs: document databases excel at flexible schemas and hierarchical data, not at transactional consistency or complex joins.
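
Both correct options can be illustrated with plain Python dictionaries standing in for documents. No MongoDB driver is used here; the field names and the `order_total` helper are hypothetical.

```python
import json

# Hypothetical order documents, shaped as a document store might hold them.
orders = [
    {
        "order_id": 1001,
        "customer": "Ada",
        "line_items": [
            {"sku": "A-1", "qty": 2, "price": 9.5},
            {"sku": "B-7", "qty": 1, "price": 20.0},
        ],
    },
    {
        "order_id": 1002,
        "customer": "Lin",
        "line_items": [{"sku": "A-1", "qty": 3, "price": 9.5}],
        "gift_note": "Happy birthday",  # field added later; older documents are untouched
    },
]


def order_total(order: dict) -> float:
    """Sum the nested line items without joining to another table."""
    return sum(item["qty"] * item["price"] for item in order["line_items"])


totals = {o["order_id"]: order_total(o) for o in orders}
notes = [o.get("gift_note") for o in orders]
serialized = json.dumps(orders[0])  # one self-contained document per order
```

The line items live inside each order (hierarchical data), and the second document carries a field the first lacks (evolving structure).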

69

MCQhard

A data analyst is working with a relational database that contains a table of customer orders. To optimize query performance for a report that filters by order date and customer ID, the analyst wants to create an index. Which type of index would be most effective for queries that filter on both columns?

A.B-tree index on order_date

B.Hash index on customer_id

C.Composite index on (order_date, customer_id)

D.Clustered index on order_id

AnswerC

A composite index on both columns allows the database to use the index for queries filtering on both columns, improving performance.

Why this answer

Option C is correct because a composite B-tree index on (order_date, customer_id) lets the database evaluate the predicates on both columns from a single index. B-tree indexes support ordered traversal and range lookups, making them suitable for date-based filtering combined with an equality filter on customer_id. The leading column narrows the scan to the date range, and the second column lets the engine check the customer ID within the index instead of fetching every date-matched row. In many engines, placing the equality column (customer_id) first narrows the scan further, but among the options given the composite index is the only one that covers both predicates.

Exam trap

The trap here is that candidates often choose a single-column index (A or B) thinking it will be sufficient, not realizing that a composite index is required to avoid a 'filter' step that scans many rows after the index lookup.

How to eliminate wrong answers

Option A is wrong because a single-column B-tree index on order_date can only efficiently filter by date; any additional filter on customer_id must then be checked against every date-matched row, leading to poor performance. Option B is wrong because a hash index on customer_id only supports equality lookups and cannot handle range queries on order_date, making it unsuitable for date-range filtering. Option D is wrong because a clustered index on order_id physically orders the table by order_id, which does not help with filtering on order_date or customer_id; without a suitable secondary index, such queries still scan the table.
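
To see a composite index being picked up, the sketch below creates one in SQLite and asks for the query plan. The table layout and the index name `idx_orders_date_customer` are assumptions for illustration, and the exact plan wording varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_date TEXT,
        total REAL
    )
    """
)
conn.execute(
    "CREATE INDEX idx_orders_date_customer ON orders (order_date, customer_id)"
)

query = """
    SELECT order_id, total FROM orders
    WHERE order_date BETWEEN ? AND ? AND customer_id = ?
"""
# EXPLAIN QUERY PLAN returns rows whose last column is a readable description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN " + query, ("2024-01-01", "2024-01-31", 42)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

A plan that names the index indicates a search through it, not a scan of the whole table. Other database engines expose similar information through their own EXPLAIN commands.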

70

MCQmedium

An organization uses a data warehouse for analytics. The data team wants to load data from source systems into the warehouse. They choose to load raw data first and then perform transformations within the warehouse. Which approach are they using?

A.ELT

B.Data lake

C.Data mart

D.ETL

AnswerA

ELT loads raw data first, then transforms it within the warehouse.

Why this answer

ELT (Extract, Load, Transform) involves extracting data, loading it into the target system (e.g., data warehouse), and then transforming it there. This is common with modern cloud warehouses like Snowflake or BigQuery that handle transformations efficiently.

71

Multi-Selecthard

A data governance team is establishing policies. Which three activities are part of data governance? (Select THREE.)

Select 3 answers

A.Data quality management

B.Data ownership assignment

C.Data indexing

D.Data steward designation

E.Data normalization

AnswersA, B, D

Ensuring data quality is a core governance function.

Why this answer

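Data quality management is a core activity of data governance because it ensures that data meets defined standards for accuracy, completeness, consistency, and timeliness. Governance policies mandate monitoring and remediation processes to maintain data quality across the organization. Assigning data owners and designating data stewards are governance activities as well, because they establish who is accountable for a data asset and who maintains it day to day.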

Exam trap

Cisco often tests the distinction between data governance (policies, roles, quality) and data management (technical implementation like indexing and normalization), leading candidates to confuse operational tasks with governance activities.

72

MCQmedium

A large online retailer stores customer orders in a PostgreSQL database. Each order has a unique order ID, and the database is normalized to 3NF. Which type of data is this?

A.Semi-structured data

B.Structured data

C.Unstructured data

D.Metadata

AnswerB

Relational databases store structured data with fixed schemas, rows, and columns.

Why this answer

The data is structured because it resides in a normalized PostgreSQL database with a unique order ID and conforms to a fixed schema (3NF). Structured data is organized into rows and columns with defined data types, enabling efficient SQL querying and ACID compliance. PostgreSQL's relational model enforces this structure through tables, constraints, and indexes.

Exam trap

The trap here is that candidates confuse 'structured data' with 'metadata' or assume that any database containing JSON fields is semi-structured, but the question specifies a normalized 3NF schema, which inherently means structured data regardless of any JSON columns.

How to eliminate wrong answers

Option A is wrong because semi-structured data (e.g., JSON, XML) does not require a fixed schema and is typically stored in NoSQL databases or as JSONB in PostgreSQL, not in a normalized 3NF relational schema. Option C is wrong because unstructured data (e.g., images, videos, free text) lacks a predefined data model and cannot be directly stored in normalized relational tables without transformation. Option D is wrong because metadata is data about data (e.g., table schemas, column descriptions), not the actual customer order records themselves.

73

MCQmedium

A data analyst needs to share a weekly sales report with the marketing team. The report includes aggregated data from the data warehouse. To simplify access, the analyst creates a virtual table that encapsulates the complex query. Which database object should the analyst create?

A.Trigger

B.View

C.Stored procedure

D.Index

AnswerB

A view is a virtual table that simplifies querying by hiding complexity.

Why this answer

A view is a virtual table that encapsulates a complex query, allowing users to access aggregated data without needing to understand the underlying SQL. In this scenario, the analyst creates a view to simplify access to the weekly sales report, as it presents pre-defined, aggregated data from the data warehouse as if it were a table.

Exam trap

The trap here is that candidates may confuse a view with a stored procedure, thinking both can encapsulate logic, but only a view behaves as a virtual table that can be queried directly with SELECT; a stored procedure must be invoked explicitly and cannot be referenced in a FROM clause like a table.

How to eliminate wrong answers

Option A is wrong because a trigger is procedural code that automatically executes in response to certain events (e.g., INSERT, UPDATE, DELETE) on a table, not a virtual table for simplifying query access. Option C is wrong because a stored procedure is a set of precompiled SQL statements that can accept parameters and perform operations, but it does not act as a virtual table that can be queried directly with SELECT statements. Option D is wrong because an index is a database structure that improves the speed of data retrieval operations on a table, but it is not a virtual table or a query encapsulation object.
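
A minimal sketch of the idea in SQLite: the view hides the aggregation, and readers query it like a table. The `sales` table, its rows, and the view name `weekly_sales_by_region` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (week TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2024-W01', 'north', 100.0),
        ('2024-W01', 'north', 50.0),
        ('2024-W01', 'south', 75.0);

    -- The view stores the query, not the data.
    CREATE VIEW weekly_sales_by_region AS
    SELECT week, region, SUM(amount) AS total
    FROM sales
    GROUP BY week, region;
    """
)

report_sql = "SELECT region, total FROM weekly_sales_by_region ORDER BY region"
report = conn.execute(report_sql).fetchall()

# New base-table rows show up the next time the view is queried.
conn.execute("INSERT INTO sales VALUES ('2024-W01', 'south', 25.0)")
updated_report = conn.execute(report_sql).fetchall()
```

The marketing team would only need the short SELECT against the view; the GROUP BY logic stays in one place.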

74

MCQeasy

Which database index type is most commonly used for exact-match lookups and range queries in a B-tree structure?

A.B-tree index

B.Hash index

C.Clustered index

D.Bitmap index

AnswerA

B-tree indexes support both exact-match and range queries.

Why this answer

A B-tree index is the correct answer because it maintains sorted data in a balanced tree structure, enabling both exact-match lookups (via equality searches) and efficient range queries (via ordered traversal of leaf nodes). This dual capability makes it the standard index type in relational databases like MySQL, PostgreSQL, and Oracle for general-purpose querying.

Exam trap

The trap here is that candidates often mistake 'clustered index' for a separate index type, but it is actually a physical implementation of a B-tree where the leaf nodes contain the full row data, not a different algorithmic structure.

How to eliminate wrong answers

Option B (Hash index) is wrong because hash indexes use a hash function to map keys to bucket locations, which is extremely fast for exact-match lookups but does not support range queries (e.g., BETWEEN, >, <) since the hash order does not preserve key order. Option C (Clustered index) is wrong because while a clustered index physically reorders table data based on the index key and can support range queries, it is not a distinct index type but rather a storage organization; the underlying structure is still a B-tree, and the question asks for the index type most commonly used for both operations, which is the B-tree itself. Option D (Bitmap index) is wrong because bitmap indexes store bitmaps for each distinct key value and are optimized for low-cardinality columns and complex boolean queries, not for efficient range scans or exact-match lookups in high-cardinality scenarios.
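
The difference between ordered and hashed lookups can be sketched in plain Python. A sorted list searched with `bisect` is only a stand-in for the ordered leaf level of a B-tree, not a B-tree implementation, and a `dict` plays the role of a hash index. All names here are hypothetical.

```python
from bisect import bisect_left, bisect_right

keys = [3, 8, 15, 21, 34, 55]  # kept sorted, like the leaf level of a B-tree
hash_index = {k: f"row-{k}" for k in keys}  # unordered key-to-row mapping


def ordered_exact(sorted_keys, key):
    """Exact match by binary search over the ordered keys."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


def ordered_range(sorted_keys, low, high):
    """Keys with low <= key <= high, located by two binary searches."""
    return sorted_keys[bisect_left(sorted_keys, low):bisect_right(sorted_keys, high)]


def hash_range(index, low, high):
    """A hash structure has no key order, so every key must be checked."""
    return sorted(k for k in index if low <= k <= high)
```

Both structures answer an exact match quickly, but only the ordered one can jump to the start of a range and read forward; the hash version has to examine every key.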

CCNA Dap Data Concepts Questions | Courseiva