Knowledge + Practice

CCNA Mining Acquiring Data Questions

75 of 99 questions · Page 1/2 · Mining Acquiring Data topic · Answers revealed

Practice these questions Exam hub All questions

1

MCQhard

A large retail company is integrating customer data from two separate CRM systems into a new data warehouse. System A stores customer IDs as integers (e.g., 12345), while System B stores them as alphanumeric strings (e.g., 'CUST-12345-X'). Additionally, some customers exist in both systems but with slight name variations (e.g., 'John Smith' vs 'Jon Smith'). The data warehouse requires a unified customer table with a single unique identifier for each customer. The analyst needs to design the data acquisition process. Which of the following is the most appropriate first step?

A.Use a simple crosswalk table based on exact name matches to link records

B.Load all data from both systems into a staging table, then run a fuzzy matching algorithm to identify duplicates

C.Perform data profiling to analyze data distributions, data types, and quality issues in each source

D.Standardize all customer IDs to a common format (e.g., UUIDs) and then merge the tables

AnswerC

Profiling provides the necessary insights to plan transformations, handle inconsistencies, and design the matching strategy.

Why this answer

Option C is correct because data profiling is the foundational first step in any data integration project. It systematically assesses source data types, formats, completeness, and quality issues (e.g., integer vs. alphanumeric IDs, name variations) before designing transformation logic. Without profiling, subsequent steps like fuzzy matching or ID standardization risk being built on incorrect assumptions about the data.

Exam trap

The trap here is that candidates often jump to a technical solution (fuzzy matching or ID standardization) without recognizing that data profiling is the prerequisite step that validates source assumptions and prevents costly rework.

How to eliminate wrong answers

Option A is wrong because exact name matches cannot resolve the known name variations (e.g., 'John Smith' vs 'Jon Smith'), leading to missed linkages and duplicate customers. Option B is wrong because loading all data into a staging table before profiling risks propagating unknown data quality issues (e.g., inconsistent ID formats, nulls) into the staging area, making fuzzy matching less reliable and harder to tune. Option D is wrong because standardizing IDs to a common format (e.g., UUIDs) without first profiling the source data ignores the need to understand existing relationships and quality issues, and may break referential integrity if applied prematurely.

Practice this question →

2

MCQeasy

A data analyst needs to combine two datasets: one contains customer information (customer_id, name, address) and the other contains order information (order_id, customer_id, order_date). The analyst wants to include all customers, even those who have not placed orders. Which type of join should be used?

A.FULL OUTER JOIN

B.INNER JOIN

C.LEFT JOIN

D.RIGHT JOIN

AnswerC

LEFT JOIN includes all customers, with order data where available.

Why this answer

A LEFT JOIN returns all rows from the left table (customers) and the matching rows from the right table (orders). If a customer has no orders, the order columns will contain NULLs. This satisfies the requirement to include all customers, even those without orders.

Exam trap

The trap here is that candidates often confuse LEFT JOIN with INNER JOIN, assuming all customers must have orders, or they pick FULL OUTER JOIN thinking it includes all customers, but it also includes unmatched orders, which is not required.

How to eliminate wrong answers

Option A is wrong because a FULL OUTER JOIN returns all rows from both tables, which would include unmatched orders (if any) — unnecessary for this requirement. Option B is wrong because an INNER JOIN returns only rows with matching keys in both tables, excluding customers who have never placed an order. Option D is wrong because a RIGHT JOIN returns all rows from the right table (orders) and only matching rows from the left table (customers), which would exclude customers without orders.

Practice this question →

3

MCQmedium

Refer to the exhibit. What does the query return?

A.All orders grouped by customer ID.

B.Customers who have placed at least 5 orders.

C.Customers who have placed more than 5 orders.

D.All customers who have placed orders.

AnswerC

HAVING COUNT(*) > 5 ensures only customers with more than 5 orders are included.

Why this answer

The query uses a HAVING clause with COUNT(*) > 5, which filters groups (by customer ID) to only those with more than 5 orders. The GROUP BY customer ID ensures the count is per customer, so the result is customers who have placed more than 5 orders. Option C is correct because the condition is strictly greater than 5, not at least 5.

Exam trap

CompTIA often tests the distinction between 'at least' (>=) and 'more than' (>) in HAVING clauses, and candidates may misread the condition as including exactly 5 orders.

How to eliminate wrong answers

Option A is wrong because the query does not return all orders; it returns aggregated results (counts) per customer, not individual order rows. Option B is wrong because the condition is COUNT(*) > 5, not COUNT(*) >= 5; 'at least 5' would include exactly 5, which is excluded by the strict greater-than operator. Option D is wrong because the HAVING clause filters out customers with 5 or fewer orders; the query does not return all customers who have placed orders, only those exceeding the threshold.

Practice this question →

4

MCQeasy

A marketing analyst needs to combine customer data from a CRM database with social media engagement data from a third-party API. Which data acquisition method is most appropriate?

A.Web scraping

B.Manual data entry

C.API integration

D.Batch file upload

AnswerC

API integration provides structured, real-time access to third-party data, which is ideal for social media engagement data.

Why this answer

API integration is the most appropriate method because it allows the analyst to programmatically retrieve structured social media engagement data directly from the third-party service's RESTful or GraphQL API endpoints. This approach ensures real-time or near-real-time data synchronization, supports authentication (e.g., OAuth 2.0), and returns data in standardized formats like JSON or XML, which can be directly ingested into the CRM system without manual intervention.

Exam trap

The trap here is that candidates may confuse web scraping with API integration, assuming both can retrieve web data, but the question specifically requires combining structured data from a third-party API, where web scraping would be unreliable, unauthorized, and technically inappropriate for programmatic data acquisition.

How to eliminate wrong answers

Option A is wrong because web scraping is used to extract unstructured data from HTML pages, which is inefficient, brittle, and often violates the third-party API's terms of service; it is not designed for reliable, authenticated access to structured social media metrics. Option B is wrong because manual data entry is error-prone, time-consuming, and impractical for large volumes of social media engagement data, and it lacks any automated validation or consistency checks. Option D is wrong because batch file upload assumes the data is already exported into a file (e.g., CSV) and delivered manually, which introduces latency and requires the third-party to support file exports, whereas the API provides direct, on-demand access to live data.

Practice this question →

5

Multi-Selecthard

Which THREE are best practices for acquiring data via web scraping? (Select exactly 3)

Select 3 answers

A.Use multiple IP addresses

B.Respect robots.txt

C.Identify yourself with a user-agent

D.Scrape all data without regard to terms

E.Limit request rate

AnswersB, C, E

Legal and ethical best practice.

Why this answer

Options A, B, and E are correct because best practices include respecting robots.txt (legal and ethical), limiting request rate (avoid overloading servers), and identifying yourself with a user-agent (transparency). Option C is wrong because using multiple IP addresses may be necessary but is not a best practice per se; it can be used to avoid blocking but may violate terms. Option D is wrong because scraping all data without regard to terms is unethical and possibly illegal.

Practice this question →

6

MCQeasy

A marketing team wants to collect data on competitor pricing for similar products. Which data source is most appropriate?

A.Customer surveys

B.Internal ERP system

C.External public web scraping

D.Internal sales data

AnswerC

Web scraping can collect competitor pricing from public websites.

Why this answer

External public web scraping is the most appropriate data source because competitor pricing is publicly available on websites, and web scraping allows automated extraction of this structured or unstructured data. This approach directly addresses the need for external competitive intelligence without relying on internal or customer-reported data.

Exam trap

The trap here is that candidates may confuse internal data sources (ERP, sales) with external data needs, or mistakenly think customer surveys can provide accurate, unbiased competitor pricing data.

How to eliminate wrong answers

Option A is wrong because customer surveys collect subjective opinions and self-reported data, not objective, real-time competitor pricing from external sources. Option B is wrong because an internal ERP system contains only the company's own operational and financial data, not competitor pricing information. Option D is wrong because internal sales data reflects the company's own transactions and pricing, not competitor pricing.

Practice this question →

7

MCQhard

A financial analytics firm needs to acquire historical stock market tick data (millions of records per day) from a data vendor. The vendor provides data via FTP in binary format. The firm's existing infrastructure uses on-premise servers with limited storage and processing power. They need to stream the data into a cloud data lake for analysis. However, the binary format is proprietary and requires a licensed decoder. The budget is constrained. Which approach best meets the data acquisition requirements?

A.Negotiate with the vendor to provide an API that outputs JSON

B.Purchase a license for the decoder and set up an ETL job to convert and upload

C.Request the vendor to send data in CSV format via email

D.Use a third-party cloud service that already decodes and normalizes the data for a subscription fee

AnswerB

This allows processing of the binary format and integration with the cloud data lake, with a one-time cost.

Why this answer

Purchasing the decoder license and setting up an ETL job to convert and upload to the cloud directly addresses the format issue and enables streaming to the data lake. Requesting CSV via email is impractical for millions of records. Negotiating for a JSON API is a good idea but may not be available or cost more than the decoder.

Using a third-party service adds recurring costs and may introduce dependency.

Practice this question →

8

Multi-Selectmedium

Which TWO of the following are common methods for acquiring data from external sources?

Select 2 answers

A.Data warehousing

B.Manual data entry

C.Public APIs

D.Web scraping

E.Direct database connection to an internal server

AnswersC, D

APIs provide structured access to external data.

Why this answer

Public APIs (C) are a common method for acquiring data from external sources because they provide a standardized, programmatic interface (often RESTful over HTTP/HTTPS) for requesting and receiving structured data, such as JSON or XML, from third-party services like social media platforms or weather services. Web scraping (D) is another common method that involves programmatically extracting data from web pages by parsing HTML or DOM structures, often using tools like BeautifulSoup or Selenium, when no API is available.

Exam trap

The trap here is that candidates may confuse data warehousing (a storage/management process) with data acquisition methods, or think manual data entry is a valid external acquisition method, when the exam specifically tests automated, programmatic techniques for pulling data from outside the organization.

Practice this question →

9

Multi-Selecthard

A data scientist is merging retail transaction data from online and in-store sources. Which THREE steps are required to ensure data consistency?

Select 3 answers

A.Ensure product IDs are standardized across sources

B.Convert all monetary amounts to a common currency

C.Remove all transactions with missing customer ID

D.Synchronize timestamps to a single time zone

E.Merge data using only store location

AnswersA, B, D

Standard IDs prevent mismatches.

Why this answer

Options A (standardize product IDs), C (convert currency), and D (synchronize timestamps) are essential. Option B (use only store location) is insufficient. Option E (remove transactions with missing customer ID) may discard useful data.

Practice this question →

10

MCQhard

A data engineer needs to acquire data from a legacy mainframe system that does not support modern APIs or direct database connectivity. Which approach is most feasible?

A.Re-platform the mainframe to a modern system

B.Use a database gateway

C.Use FTP to transfer flat files

D.Manual data entry

AnswerC

FTP is a standard, simple method for file transfer from legacy systems.

Why this answer

Option C is correct because FTP (File Transfer Protocol, RFC 959) is a widely supported, low-overhead method for transferring flat files (e.g., CSV, EBCDIC-encoded text) from legacy mainframe systems that lack modern APIs or direct database connectivity. Mainframes like IBM z/OS natively support FTP, allowing the data engineer to schedule periodic file exports without requiring system modernization or complex middleware.

Exam trap

The trap here is that candidates may assume a database gateway (Option B) is always the best integration approach, but the question explicitly denies direct database connectivity, making FTP the only practical option that leverages existing mainframe capabilities without major infrastructure changes.

How to eliminate wrong answers

Option A is wrong because re-platforming the mainframe to a modern system is a costly, high-risk, and time-consuming project that far exceeds the scope of a simple data acquisition task; it introduces unnecessary complexity and potential downtime. Option B is wrong because a database gateway typically requires the mainframe to support ODBC/JDBC or similar database connectivity protocols, which the question explicitly states is not available. Option D is wrong because manual data entry is error-prone, unscalable, and impractical for any reasonable volume of data, violating basic data integrity and efficiency requirements.

Practice this question →

11

MCQmedium

A logistics company receives GPS tracking data from fleet vehicles at 1-second intervals via a cellular network. The data is used to optimize routes and monitor driver behavior. Recently, the data acquisition system has been missing updates for some vehicles when they pass through tunnels or remote areas. The data team notices gaps during these periods. The company needs a solution to ensure near-real-time data continuity. What should they do?

A.Use a hybrid approach that combines cellular and Wi-Fi networks

B.Implement a store-and-forward mechanism that buffers data on the vehicle's onboard unit and uploads when connectivity resumes

C.Increase the frequency of data transmission to every 0.5 seconds

D.Switch to a satellite-based GPS system

AnswerB

Buffering ensures data is not lost and is transmitted later, providing continuity despite temporary outages.

Why this answer

Option B is correct because a store-and-forward mechanism buffers GPS data locally on the vehicle's onboard unit during connectivity loss (e.g., in tunnels) and automatically uploads the backlog when cellular connectivity resumes. This ensures data continuity without requiring real-time transmission, directly addressing the intermittent connectivity issue while maintaining near-real-time updates.

Exam trap

The trap here is that candidates confuse the data source (GPS) with the transmission method, thinking satellite GPS solves connectivity issues, when the real problem is the cellular network's coverage gaps, not the positioning technology.

How to eliminate wrong answers

Option A is wrong because Wi-Fi networks are not suitable for fleet vehicles in motion; they have limited range and are not available in tunnels or remote areas, so combining them with cellular does not solve the core problem of coverage gaps. Option C is wrong because increasing transmission frequency to 0.5 seconds would exacerbate data loss during connectivity gaps and increase bandwidth/cost without addressing the root cause of missing updates. Option D is wrong because switching to satellite-based GPS only changes the positioning source, not the data transmission method; the vehicle still needs a network to send data, and satellite communication (e.g., Iridium) is expensive, high-latency, and not typically used for high-frequency GPS telemetry in logistics.

Practice this question →

12

MCQmedium

A retail company is migrating its on-premises data warehouse to a cloud data warehouse. The current ETL process extracts data from a transactional database (SQL Server) and a web analytics system (JSON logs). The ETL runs nightly and takes 6 hours. The business requires that the new cloud warehouse support real-time reporting with data latency of less than 15 minutes. The data engineer proposes using change data capture (CDC) from the SQL Server database and streaming the JSON logs via a message queue. However, management is concerned about cost and complexity. The engineer must design a solution that meets the latency requirement while minimizing operational overhead. Which approach should the engineer recommend?

A.Export the SQL Server data to flat files every 15 minutes and use a cloud storage trigger to load

B.Continue with nightly batch loads but increase the frequency to every hour

C.Implement CDC for the SQL Server database and stream the JSON logs via a message queue to the cloud warehouse

D.Use a data virtualization tool to query the source systems directly without moving data

AnswerC

CDC provides real-time changes; streaming handles JSON logs with low latency.

Why this answer

Option C is correct because CDC captures only changed rows from SQL Server, minimizing data volume and enabling near-real-time ingestion, while streaming JSON logs via a message queue (e.g., Apache Kafka or Amazon Kinesis) provides sub-15-minute latency. This combination meets the latency requirement without the overhead of full batch exports or complex virtualization, addressing management's cost and complexity concerns.

Exam trap

The trap here is that candidates may choose Option A or D because they seem simpler, but they fail to meet the strict latency requirement or introduce hidden operational complexity, while Option C's CDC and streaming approach is the only one that balances low latency with minimal overhead.

How to eliminate wrong answers

Option A is wrong because exporting SQL Server data to flat files every 15 minutes introduces latency from file generation, cloud storage upload, and trigger-based loading, which can easily exceed the 15-minute requirement and adds operational overhead for file management. Option B is wrong because increasing nightly batch loads to hourly still results in up to 60-minute latency, failing the 15-minute requirement, and does not address the need for real-time streaming of JSON logs. Option D is wrong because data virtualization queries source systems directly, which can cause performance degradation on the transactional SQL Server and web analytics system, and does not provide a persistent, low-latency data pipeline to the cloud warehouse.

Practice this question →

13

MCQmedium

A data analyst is tasked with extracting data from a legacy system that outputs fixed-width text files. The analyst needs to parse these files into a structured format. Which tool or method is most appropriate for this task?

A.A spreadsheet application

B.An ETL tool with a graphical interface

C.A scripting language such as Python

D.SQL

AnswerC

Python provides libraries and string manipulation ideal for parsing fixed-width files.

Why this answer

Python is the most appropriate choice because fixed-width text files require precise column slicing based on character positions, which Python's string slicing and libraries like `struct` or `pandas.read_fwf` handle natively. Unlike graphical ETL tools or spreadsheets, Python provides programmatic control to define exact field widths, handle edge cases like missing delimiters, and process large files efficiently without manual intervention.

Exam trap

The trap here is that candidates assume a graphical ETL tool is always the best for data extraction, but the question specifically tests the ability to handle unstructured or semi-structured legacy formats where scripting provides the necessary precision and automation.

How to eliminate wrong answers

Option A is wrong because spreadsheet applications like Excel are designed for delimited data (e.g., CSV) and lack built-in functionality to parse fixed-width columns without manual column splitting, which is error-prone and impractical for large datasets. Option B is wrong because while ETL tools can parse fixed-width files, they typically require defining column widths in a graphical interface, which is less flexible and harder to automate than a scripting language for legacy systems with inconsistent formatting. Option D is wrong because SQL operates on structured data within a database and cannot directly parse raw fixed-width text files; it would require the data to be pre-processed into a table format first.

Practice this question →

14

MCQmedium

During data acquisition, an analyst notices that the data from an external vendor has inconsistent date formats. What is the first step the analyst should take?

A.Contact the vendor to request corrected data

B.Immediately transform dates to a standard format

C.Perform data profiling

D.Reject the entire dataset

AnswerC

Profiling identifies inconsistencies and guides next steps.

Why this answer

Data profiling is the initial step to understand the structure, quality, and issues in the data. Rejecting or transforming without profiling may lead to errors, and contacting the vendor is premature without understanding the scope.

Practice this question →

15

Multi-Selecthard

Which THREE of the following are best practices when performing data extraction for a data pipeline?

Select 3 answers

A.Performing a full refresh every time

B.Implementing error handling and logging

C.Documenting the extraction process

D.Ignoring data quality issues during extraction

E.Using incremental extraction where possible

AnswersB, C, E

Error handling ensures the pipeline can recover from failures.

Why this answer

Option B is correct because implementing error handling and logging is a critical best practice in data pipeline extraction. It ensures that failures (e.g., network timeouts, authentication errors, or schema mismatches) are captured and can be diagnosed without data loss or silent corruption, which is essential for maintaining pipeline reliability and auditability.

Exam trap

CompTIA often tests the misconception that full refreshes are always safer or simpler, but the trap is that they ignore the operational cost and scalability issues, while incremental extraction with proper error handling is the standard in production pipelines.

Practice this question →

16

MCQeasy

A data analyst is extracting data from a relational database using SQL. Which clause is essential for limiting the rows retrieved to only those needed?

A.GROUP BY

B.ORDER BY

C.WHERE

D.HAVING

AnswerC

Filters rows based on conditions.

Why this answer

Option A is correct because the WHERE clause filters rows based on conditions. Option B is wrong because GROUP BY groups rows for aggregation. Option C is wrong because ORDER BY sorts results.

Option D is wrong because HAVING filters groups after aggregation.

Practice this question →

17

MCQhard

A financial institution is merging transaction data from two different systems. System A stores currency amounts as integers in cents, and System B stores as decimals in dollars. What is the best way to integrate the data?

A.Convert System A amounts to dollars by dividing by 100.

B.Keep both as is and use a transformation layer.

C.Store all amounts as strings to preserve precision.

D.Convert System B amounts to cents by multiplying by 100.

AnswerA

This standardizes all amounts to dollar decimal format.

Why this answer

Option A is correct because converting System A's integer cents to dollars by dividing by 100 ensures both datasets share a consistent unit (dollars) and numeric data type (decimal). This direct transformation eliminates ambiguity in aggregation and reporting, as financial calculations require uniform precision and scale. Using a transformation layer or storing as strings would introduce unnecessary complexity or risk of rounding errors.

Exam trap

The trap here is that candidates may assume keeping both formats (Option B) is simpler or that converting to cents (Option D) is safer, but they overlook the critical requirement for a single, consistent unit to enable direct arithmetic and avoid precision loss in financial data integration.

How to eliminate wrong answers

Option B is wrong because keeping both formats as-is forces every downstream query or application to repeatedly apply conversion logic, increasing complexity, maintenance overhead, and the risk of inconsistent results. Option C is wrong because storing currency amounts as strings prevents arithmetic operations (e.g., SUM, AVG) without explicit casting, degrades query performance, and can lead to sorting or comparison errors due to lexical ordering. Option D is wrong because converting System B's decimal dollars to cents by multiplying by 100 would lose fractional cent precision (e.g., $1.234 becomes 123 cents, truncating 0.4 cents), which is unacceptable for financial data integrity.

Practice this question →

18

MCQeasy

A data engineer is configuring access to a data lake in Amazon S3. What does the JSON policy shown allow?

A.Change user permissions

B.Read objects from the bucket

C.Delete objects from the bucket

D.Write objects to the bucket

AnswerB

GetObject allows reading.

Why this answer

The JSON policy shown grants the `s3:GetObject` action, which allows reading objects from the specified S3 bucket. This is a standard AWS IAM policy that explicitly permits the `GetObject` API call, enabling the data engineer to retrieve objects from the data lake.

Exam trap

The trap here is that candidates may confuse `s3:GetObject` with broader permissions like `s3:PutObject` or `s3:DeleteObject`, or assume that any S3 policy allows all actions, when in fact each action must be explicitly listed.

How to eliminate wrong answers

Option A is wrong because changing user permissions requires the `iam:ChangePassword` or `iam:UpdateUser` actions, which are not included in this S3-specific policy. Option C is wrong because deleting objects requires the `s3:DeleteObject` action, which is not listed in the policy. Option D is wrong because writing objects requires the `s3:PutObject` action, which is also absent from the policy.

Practice this question →

19

MCQhard

A social media monitoring company collects public tweets using the Twitter API. The API has a tiered access: free tier allows 500,000 tweets per month, and paid tier allows 2 million tweets per month. The company needs to collect 1.5 million tweets per month for analysis. They are on a free tier but have been exceeding the limit, causing account suspension. They need a sustainable solution without significantly increasing costs. What should they do?

A.Request an academic research exemption

B.Reduce the collection to exactly 500,000 tweets per month by sampling

C.Use multiple developer accounts to stay within free limits

D.Upgrade to the paid tier

AnswerC

Multiple accounts can split the load, staying within free limits and avoiding costs.

Why this answer

Using multiple developer accounts to distribute the collection load can allow access to more tweets while staying within each account's free limit. This avoids the cost of upgrading to a paid tier. Reducing collection to 500,000 tweets would cause loss of critical data.

Requesting an academic exemption is unlikely because the company is commercial. Upgrading to paid tier increases costs significantly.

Practice this question →

20

MCQmedium

A data team is designing an ETL process to extract data from an operational database daily. The database experiences heavy write loads during business hours. What is the best practice to minimize impact on operations?

A.Extract directly from the primary database with high priority

B.Run the extraction during peak hours to ensure data freshness

C.Schedule extraction at midnight when load is low

D.Use replication or a read replica to extract data

AnswerD

Read replicas are designed for such purposes and do not affect the primary.

Why this answer

Option B (use a read replica) is best because it offloads the extraction from the primary. Option A (extract from primary) impacts performance. Option C (schedule at midnight) still hits primary.

Option D (run during peak) increases load.

Practice this question →

21

MCQmedium

A data analyst is pulling data from a production database for a report. The database contains customer orders with a column 'order_date'. The analyst notices that some orders have dates in the future. Which data quality issue does this represent?

A.Invalid data type

B.Inconsistent data

C.Missing data

D.Violation of business rules

AnswerD

Future orders are not valid per business rules, indicating a data quality issue.

Why this answer

Option D is correct because future order dates violate a business rule that order_date must be in the past or present. This is a classic data integrity issue where the data does not conform to domain-specific constraints, such as 'order_date <= CURRENT_DATE'. The analyst should flag this as a violation of business rules, not a data type or consistency problem.

Exam trap

The trap here is that candidates confuse 'invalid data type' (Option A) with 'invalid data value' — the data is of the correct type but violates a logical business rule, which is a distinct quality issue often tested in DA0-001.

How to eliminate wrong answers

Option A is wrong because the column 'order_date' is of a valid date data type (e.g., DATE or TIMESTAMP), so there is no data type mismatch. Option B is wrong because inconsistent data refers to contradictory values across related columns (e.g., different date formats), not a single column containing future dates. Option C is wrong because missing data would involve NULL or empty values, not dates that are present but invalid according to business logic.

Practice this question →

22

MCQmedium

A healthcare organization acquires data from multiple hospitals with different patient record systems. The data includes patient IDs but no common identifier across systems. Which technique should be used to link records?

A.Merge all records without deduplication

B.Generate random unique IDs for each system

C.Manually match records for all patients

D.Probabilistic record linkage using name, DOB, and ZIP

AnswerD

Probabilistic linkage uses multiple attributes to find matches with high confidence.

Why this answer

Option D (probabilistic linkage) is designed for such situations. Option A (random IDs) loses connections. Option B (merge without dedup) creates duplicates.

Option C (manual matching) is not scalable.

Practice this question →

23

MCQhard

A research firm is acquiring data from public government databases via API. The API rate limits at 100 requests per minute. They need to download 10,000 records, but each request returns a maximum of 100 records. What is the most efficient approach to ensure complete acquisition without being blocked?

A.Use a retry logic with exponential backoff and pagination

B.Request a data dump from the government via email

C.Download one record per second

D.Send all requests simultaneously in parallel

AnswerA

This approach respects the rate limit, handles failures gracefully, and ensures complete data acquisition.

Why this answer

Pagination with retry logic using exponential backoff allows the firm to send requests in a controlled manner, respecting the rate limit and handling potential failures. Sending all requests in parallel would likely exceed the rate limit and cause blocking. Downloading one record per second is too slow.

Requesting a data dump via email is inefficient and may not be supported.

Practice this question →

24

MCQhard

A healthcare organization is building a data warehouse to support population health analytics. The data sources include: (1) an electronic health record (EHR) system with a relational database containing patient demographics, diagnoses, and medications; (2) a claims system that generates CSV files daily; (3) patient-generated health data from mobile apps via a REST API returning JSON. The data engineer needs to design a data acquisition process that runs nightly. The EHR system has a change tracking mechanism that logs changes with timestamps. The claims CSV files are appended daily. The API supports filtering by date. The data warehouse uses a star schema with fact and dimension tables. The engineer must ensure data consistency and minimize load times. Which approach should the engineer take?

A.Perform a full extraction of all data from all sources every night and load directly into the data warehouse

B.Extract only new and changed EHR data using change tracking, extract the full claims CSV (since it's append-only), and extract API data filtered by the last extraction date

C.Use a staging area to land all raw data first, then transform and load

D.Extract the EHR data using change tracking, extract the full claims CSV, and extract the API data using a full dump

AnswerB

This minimizes data transfer and load time while capturing all changes.

Why this answer

Option B is correct because it uses incremental extraction for the EHR system (via change tracking) and the API (via date filtering), while performing a full extraction of the claims CSV since it is append-only and small enough to reload nightly. This minimizes load times by avoiding full re-extraction of large, slowly changing datasets, and ensures data consistency by capturing only new or modified records. The star schema in the data warehouse is then populated efficiently from these targeted extracts.

Exam trap

The trap here is that candidates may assume a staging area (Option C) is always required for data consistency, but the question specifically asks for the acquisition approach to minimize load times, and incremental extraction (Option B) directly achieves that without mandating a staging area.

How to eliminate wrong answers

Option A is wrong because performing a full extraction of all data every night would be extremely inefficient, causing unnecessarily long load times and high resource consumption, especially for large relational databases like the EHR system. Option C is wrong because while using a staging area is a best practice for data quality and transformation, it does not address the core requirement of minimizing load times through incremental extraction; the question specifically asks for the acquisition approach, not the ETL pipeline design. Option D is wrong because extracting a full dump of the API data every night ignores the API's built-in date filtering capability, leading to redundant data transfer and longer load times compared to incremental extraction.

Practice this question →

25

MCQmedium

An analyst needs to combine two datasets from different sources that share a common key but have different levels of granularity. Dataset A has daily sales per store, Dataset B has hourly foot traffic per store. The analyst wants to analyze correlation. Which approach is appropriate?

A.Aggregate Dataset B to daily level before merging

B.Use an outer join and keep all rows

C.Disaggregate Dataset A to hourly level by dividing daily sales by hours

D.Join on store and date without aggregation

AnswerA

Aggregating the more granular dataset to match the less granular is the standard approach.

Why this answer

Aggregating Dataset B (hourly foot traffic) to the daily level ensures both datasets share the same granularity before merging on the common key (store and date). This allows a valid correlation analysis between daily sales and daily foot traffic without introducing artificial patterns or data duplication. Merging at mismatched granularities would violate the assumption that each row represents a comparable unit of observation.

Exam trap

CompTIA often tests the misconception that disaggregating (splitting) the coarser dataset is acceptable, but this introduces artificial data and violates the assumption of uniform distribution, whereas aggregation preserves the actual measured values.

How to eliminate wrong answers

Option B is wrong because an outer join without aggregation would produce multiple rows per store-date (one for each hour) when joined with daily sales, inflating the number of rows and creating a many-to-one relationship that distorts correlation calculations. Option C is wrong because disaggregating daily sales by simply dividing by hours (e.g., 24) assumes uniform sales distribution, which is rarely true and introduces artificial hourly values that do not reflect actual sales patterns. Option D is wrong because joining on store and date without aggregation retains hourly granularity from Dataset B, causing each daily sales row to repeat for every hour, leading to duplicate data and invalid statistical analysis.

Practice this question →

26

MCQhard

A data architect is designing an ETL pipeline to ingest streaming data from IoT sensors. The data must be available for real-time analytics. Which acquisition method is best?

A.Real-time streaming via API

B.Poll sensors every hour

C.Manually upload sensor logs

D.Batch load daily CSV files

AnswerA

Streaming provides continuous, low-latency data flow.

Why this answer

Real-time streaming via API is the best method because IoT sensors generate continuous data that must be ingested with sub-second latency for real-time analytics. APIs (e.g., REST, WebSocket, or MQTT) enable event-driven ingestion, allowing the ETL pipeline to process each sensor reading as it arrives, which is essential for time-sensitive use cases like anomaly detection or live monitoring.

Exam trap

The trap here is that candidates may confuse 'real-time' with 'frequent batch' and choose hourly polling (Option B), not realizing that real-time analytics requires sub-second latency, not just periodic updates.

How to eliminate wrong answers

Option B is wrong because polling sensors every hour introduces latency of up to 60 minutes, which violates the real-time analytics requirement and can cause data staleness for time-critical decisions. Option C is wrong because manually uploading sensor logs is not automated, introduces human error, and cannot achieve the low-latency ingestion needed for streaming data. Option D is wrong because batch loading daily CSV files imposes a 24-hour delay, making the data unavailable for real-time analytics and contradicting the explicit requirement for immediate data availability.

Practice this question →

27

Multi-Selecthard

Which THREE are common challenges when acquiring data from external APIs? (Choose three.)

Select 3 answers

A.Authentication requirements

B.Consistent data schemas

C.Rate limiting

D.Guaranteed uptime

E.Data volume constraints

AnswersA, C, E

Most APIs require keys or tokens for access.

Why this answer

Authentication requirements are a common challenge because external APIs typically require valid credentials (e.g., API keys, OAuth 2.0 tokens, or JWT) to access protected resources. Without proper authentication, requests are rejected with HTTP 401 Unauthorized or 403 Forbidden errors, and managing token expiration, refresh cycles, and secure storage adds significant complexity to data acquisition pipelines.

Exam trap

The trap here is that candidates confuse 'consistent data schemas' (which APIs typically provide) with 'inconsistent data quality' (which is a separate challenge), leading them to incorrectly select Option B instead of recognizing that schema consistency is actually a benefit of using APIs.

Practice this question →

28

MCQhard

An organization needs to acquire data from a third-party vendor. The data will be used for regulatory reporting. Which of the following should be the primary consideration before acquiring the data?

A.Legal and compliance requirements

B.Volume of data

C.Data format

D.Cost of the data

AnswerA

Regulatory reporting requires adherence to data governance and privacy laws.

Why this answer

When acquiring data for regulatory reporting, legal and compliance requirements must be the primary consideration because the data must adhere to specific laws (e.g., GDPR, HIPAA, SOX) and industry regulations. Failing to ensure compliance can result in legal penalties, fines, or rejection of the report by regulatory bodies. This overrides technical or cost concerns, as non-compliant data is unusable for its intended purpose.

Exam trap

The trap here is that candidates prioritize technical or cost factors (volume, format, price) over the foundational legal and compliance gate, mistakenly assuming any data can be adapted later without verifying regulatory fitness first.

How to eliminate wrong answers

Option B is wrong because the volume of data is a secondary operational concern (e.g., storage, processing bandwidth) but does not address whether the data legally satisfies regulatory mandates. Option C is wrong because data format (e.g., CSV, JSON, XML) is a technical integration detail that can be transformed later, not a primary legal or compliance gate. Option D is wrong because cost is a business negotiation factor; even free data must first meet regulatory requirements to be used for reporting.

Practice this question →

29

MCQmedium

A data analyst is building a dataset from multiple sources and needs to ensure data quality. During the data acquisition phase, which activity is most important to perform?

A.Data visualization

B.Data cleaning

C.Data profiling

D.Data modeling

AnswerC

Profiling assesses data quality and structure before further processing.

Why this answer

Data profiling is the most important activity during the data acquisition phase because it involves examining source data to understand its structure, content, and quality issues before integration. This step identifies missing values, data types, duplicates, and inconsistencies early, preventing downstream errors in analysis. Without profiling, subsequent cleaning and modeling may be based on flawed assumptions about the data.

Exam trap

CompTIA often tests the distinction between data profiling (discovery/assessment) and data cleaning (correction), leading candidates to mistakenly choose cleaning as the first step during acquisition when profiling must come first to identify what needs cleaning.

How to eliminate wrong answers

Option A is wrong because data visualization is a presentation and exploratory analysis technique used after data is acquired and cleaned, not during acquisition. Option B is wrong because data cleaning is a corrective process that typically follows data profiling; performing cleaning without first profiling can waste effort on unknown issues or miss critical quality problems. Option D is wrong because data modeling defines relationships and structures for storage or analysis, which occurs after data is acquired and understood, not during the initial acquisition phase.

Practice this question →

30

Multi-Selecthard

Which THREE are challenges in acquiring data from external sources? (Select three.)

Select 3 answers

A.Data redundancy

B.Unauthorized access

C.Licensing restrictions

D.Rate limiting

E.Data format inconsistency

AnswersC, D, E

External data may have legal restrictions on usage, sharing, or redistribution.

Why this answer

Data format inconsistency occurs when integrating data from different sources. Rate limiting is a common API restriction that limits how much data can be accessed. Licensing restrictions may limit the use or redistribution of acquired data.

Data redundancy is an internal data quality issue, not a challenge specific to acquisition. Unauthorized access is a security concern but not a typical acquisition challenge.

Practice this question →

31

MCQhard

During a data mining project, an analyst discovers that a significant number of records have a negative value for the age field. What is the most appropriate first step?

A.Impute using regression.

B.Replace negative age with the mean age.

C.Investigate the source system for data entry errors.

D.Remove all records with negative age.

AnswerC

Determining why negative ages occur enables targeted correction and prevents future errors.

Why this answer

Option C is correct because the first step in handling anomalous data like negative ages is to investigate the source system for data entry errors. This aligns with the data mining process, where understanding the root cause of data quality issues is critical before applying any imputation or removal techniques. Without investigation, you risk masking systemic problems that could affect all records.

Exam trap

The trap here is that candidates often jump to data cleaning techniques like imputation or removal without first verifying whether the anomaly is a data quality issue or a legitimate value, which the DA0-001 exam tests by emphasizing the investigative step as the most appropriate first action.

How to eliminate wrong answers

Option A is wrong because imputing using regression assumes the negative values are missing at random and that other variables can predict age, which is inappropriate when the negative values likely indicate a data entry error rather than missing data. Option B is wrong because replacing negative age with the mean age introduces bias and does not address the underlying cause; it also assumes the negative values are outliers rather than errors. Option D is wrong because removing all records with negative age could discard valid data if the negative values are due to a correctable entry mistake, and it reduces sample size without solving the root issue.

Practice this question →

32

Multi-Selectmedium

A data analyst is merging two datasets from different departments. The analyst notices that the 'CustomerID' field in the first dataset is stored as an integer, while in the second dataset it is stored as a string with leading zeros. Which TWO steps should the analyst take to ensure successful data integration?

Select 2 answers

A.Perform the merge directly without transformation since databases handle type conversions automatically.

B.Strip all non-numeric characters from the string CustomerID before joining.

C.Use a left join and treat the CustomerID as a string after conversion.

D.Convert the string CustomerID to an integer by removing leading zeros.

E.Convert the integer CustomerID to a string with leading zeros to match the format in the second dataset.

AnswersC, E

A left join requires matching keys; converting to string ensures compatibility.

Why this answer

Option C is correct because converting the integer CustomerID to a string ensures both datasets have a compatible data type for the join. This approach preserves the leading zeros in the second dataset, which are semantically significant (e.g., '00123' vs. 123). A left join is appropriate to retain all records from the primary dataset while matching on the converted key.

Exam trap

The trap here is that candidates assume implicit type conversion will handle the join correctly, but they overlook that leading zeros are lost during conversion, causing silent data loss or incorrect matches.

Practice this question →

33

MCQmedium

Refer to the exhibit. What is the most likely issue causing the unexpectedly low count?

A.The customers table is indexed incorrectly

B.The query is missing a GROUP BY clause

C.The database was not refreshed

D.The signup_date column is in a different date format

AnswerD

Format mismatch causes filter mismatch.

Why this answer

Option A is correct because if the signup_date column is stored in a different date format (e.g., MM/DD/YYYY), the comparison with '2023-01-01' (YYYY-MM-DD) may not match many records. Option B is wrong because GROUP BY is not needed for COUNT(*). Option C is wrong because database refresh does not directly affect query result.

Option D is wrong because indexing affects performance, not correctness.

Practice this question →

34

MCQhard

A financial institution needs to acquire credit transaction data from multiple sources while ensuring compliance with data privacy regulations. What is the most critical step?

A.Data replication for redundancy

B.Data enrichment with external sources

C.Data compression for storage

D.Data anonymization during extraction

AnswerD

Ensures sensitive information is protected early.

Why this answer

Data anonymization during extraction is the most critical step because it ensures that personally identifiable information (PII) is irreversibly masked or removed before the data enters the processing pipeline, directly addressing compliance with regulations such as GDPR and PCI DSS. Without this step, even if other measures are applied later, the initial exposure of sensitive data violates privacy mandates and increases breach risk.

Exam trap

The trap here is that candidates confuse operational efficiency measures (replication, compression) or data enhancement (enrichment) with privacy compliance, overlooking that anonymization must be applied at the earliest point of data acquisition to satisfy regulatory requirements.

How to eliminate wrong answers

Option A is wrong because data replication for redundancy focuses on high availability and disaster recovery, not on privacy compliance; it does not prevent exposure of sensitive credit transaction data. Option B is wrong because data enrichment with external sources typically adds more data attributes, which can increase privacy risk and regulatory exposure rather than ensuring compliance. Option C is wrong because data compression for storage reduces storage footprint and may improve I/O performance but has no effect on data privacy or regulatory compliance.

Practice this question →

35

MCQeasy

A company receives daily sales data in CSV format. The data includes a 'Date' column in MM/DD/YYYY format. To load this into a database that expects YYYY-MM-DD, the analyst should:

A.Manually edit the CSV files before loading

B.Change the database schema to accept MM/DD/YYYY

C.Ignore the date column and use a default date

D.Use a data transformation tool to convert the date format during ETL

AnswerD

Transformation tools automate the conversion and ensure consistency.

Why this answer

Option B (use a transformation tool) is the standard practice during ETL. Option A (manually edit) is inefficient and error-prone. Option C (change schema) may cause other issues.

Option D (ignore) loses data integrity.

Practice this question →

36

MCQhard

A retail company is acquiring sales data from 150 stores worldwide. Each store sends daily CSV files via email to a central email address. The data acquisition process is manual: an intern downloads each attachment and copies it into a shared folder. The shared folder is then accessed by an ETL tool that loads data into a data warehouse. Recently, the data warehouse has been missing records for several stores. The intern reports that some emails are not being received or are delayed. The company needs to improve the reliability and timeliness of data acquisition. Which course of action should be taken first?

A.Train the intern to check email more frequently and manually verify all attachments.

B.Replace the email method with a web-based API that stores push data in real-time.

C.Implement an automated email parser that downloads attachments and moves them to the shared folder.

D.Require stores to upload CSV files directly to a cloud-based storage bucket.

AnswerD

Eliminates email dependency and manual steps.

Why this answer

Option B is correct because requiring stores to upload CSV files directly to a cloud-based storage bucket eliminates dependency on email and manual intervention, improving reliability and timeliness. Option A is wrong because it still relies on email as the transport mechanism. Option C is wrong because implementing a web-based API might be too complex and time-consuming as a first step.

Option D is wrong because it does not address the root cause of email unreliability.

Practice this question →

37

Multi-Selectmedium

A data analyst is evaluating data quality issues during acquisition. Which TWO issues are most likely to arise from merging data from different sources? (Select exactly 2)

Select 2 answers

A.User access permissions

B.Duplicate records

C.Slow network speed

D.High storage cost

E.Formatting inconsistencies

AnswersB, E

Common when merging overlapping data.

Why this answer

Options B and D are correct because merging data from different sources often results in duplicate records (same entity represented differently) and formatting inconsistencies (date formats, units). Option A is wrong because high storage cost is not a direct data quality issue. Option C is wrong because slow network speed is a performance issue.

Option E is wrong because user access permissions are an authorization issue.

Practice this question →

38

MCQeasy

A data analyst is tasked with collecting data from a web API that returns JSON. The API requires an API key in the header. Which method should be used to authenticate?

A.Use a session cookie

B.Add the API key in the HTTP header as 'Authorization: Bearer <key>'

C.Store the API key in the database and reference it

D.Include the API key in the URL query string

AnswerB

Standard bearer token authentication is secure and widely used.

Why this answer

The correct method is to include the API key in the HTTP header using the 'Authorization: Bearer <key>' format. This is the standard approach for token-based authentication in REST APIs, as defined by RFC 6750. It keeps the credential out of URLs and logs, and is the expected mechanism for API key authentication in modern web APIs.

Exam trap

CompTIA often tests the distinction between authentication methods, and the trap here is that candidates may confuse storing credentials (Option C) with transmitting them, or think that query strings (Option D) are acceptable because they work technically, ignoring security and standard practices.

How to eliminate wrong answers

Option A is wrong because session cookies are used for stateful web application sessions, not for stateless API authentication with a fixed API key; cookies are typically managed by the server and browser, not suitable for programmatic API calls. Option C is wrong because storing the API key in a database and referencing it describes a storage mechanism, not an authentication method sent in the request; the key must be transmitted with each API call, not just stored. Option D is wrong because including the API key in the URL query string exposes the key in server logs, browser history, and is less secure; it violates best practices and is not the standard method for API key authentication.

Practice this question →

39

MCQmedium

An organization is integrating data from multiple sources into a data warehouse. They need to handle differences in data granularity (e.g., daily vs. hourly sales data). Which technique is most appropriate?

A.Data aggregation

B.Data normalization

C.Data deduplication

D.Data profiling

AnswerA

Aggregation rolls up data to a consistent level.

Why this answer

Data aggregation is the correct technique because it allows the organization to roll up hourly sales data to a daily granularity, ensuring consistency when integrating sources with different levels of detail. By applying aggregation functions (e.g., SUM, AVG) during the ETL process, the data warehouse can store all data at a common grain, which is essential for accurate reporting and analysis.

Exam trap

The trap here is that candidates may confuse data normalization (a schema design concept) with the need to standardize data granularity, leading them to incorrectly select normalization instead of aggregation.

How to eliminate wrong answers

Option B is wrong because data normalization is a database design technique used to reduce redundancy and dependency by organizing columns and tables, not to reconcile differences in data granularity. Option C is wrong because data deduplication focuses on identifying and removing duplicate records, which does not address the mismatch in time-based granularity between daily and hourly data. Option D is wrong because data profiling is an exploratory process to assess data quality and structure, but it does not transform or harmonize data to a common granularity level.

Practice this question →

40

MCQeasy

Refer to the exhibit. What data quality issue is indicated?

A.Data inconsistency

B.Non-standardized data entry

C.Outlier

D.Data duplication

AnswerB

The use of 'N/A' in a numeric field indicates lack of standardization.

Why this answer

The error shows that a non-numeric value 'N/A' is present in a numeric column, indicating non-standardized data entry. Duplication, inconsistency, or outliers are not directly shown.

Practice this question →

41

MCQhard

A data team is using web scraping to collect competitor pricing data. The target website has anti-scraping measures like CAPTCHAs and rate limiting. Which approach is most effective?

A.Use a single IP address

B.Disregard robots.txt

C.Use rotating proxies and respectful delays

D.Increase request frequency

AnswerC

Mimics human behavior and avoids detection.

Why this answer

Option D is correct because using rotating proxies and respectful delays mimics human behavior and avoids detection. Option A is wrong because increasing request frequency triggers rate limiting. Option B is wrong because using a single IP address is easily blocked.

Option C is wrong because disregarding robots.txt may violate terms and lead to legal issues or IP bans.

Practice this question →

42

MCQeasy

A healthcare organization collects patient questionnaire data via paper forms at clinics. The forms are scanned and sent to a central office, where staff manually enter data into an electronic system. This process is slow and error-prone. The organization wants to reduce manual entry errors and speed up data availability. Which method should they adopt?

A.Continue manual entry but double-check all entries

B.Use optical character recognition (OCR) to digitize the forms and automatically populate the database

C.Send forms to an external data processing company

D.Require patients to fill out forms online at home

AnswerB

OCR automates data extraction from scanned forms, reducing errors and increasing speed.

Why this answer

Optical Character Recognition (OCR) can convert scanned images to text automatically, reducing manual entry errors and speeding up the process. Requiring patients to fill out online forms may not be feasible for all patients, especially those without internet access. Continuing manual entry with double-checking is still slow and labor-intensive.

Sending to an external company introduces additional cost and potential privacy concerns.

Practice this question →

43

MCQmedium

An e-commerce company wants to integrate product pricing data from competitor websites to adjust its own prices dynamically. They plan to scrape pricing pages every hour. However, the competitors' websites have anti-scraping measures such as IP blocking and CAPTCHAs. The company's legal team also advises caution regarding terms of service. Which data acquisition strategy is both effective and compliant?

A.Use a public data aggregator that already provides competitor pricing with permission

B.Use a rotating proxy service and human-like browser automation to bypass blocks

C.Negotiate with competitors to obtain pricing data via API agreements

D.Instruct staff to manually record prices once a week

AnswerC

An API agreement is legal, compliant, and provides structured data access.

Why this answer

Negotiating with competitors to obtain pricing data via API agreements is the most compliant approach, as it avoids violating terms of service and ensures reliable data access. Using rotating proxies and automation to bypass anti-scraping measures may be effective but could violate laws or terms of service. Manual recording is too slow and not dynamic.

Using a public data aggregator may not provide the specific competitor data needed and could be costly.

Practice this question →

44

Matchingmedium

Match each database concept to its definition.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Unique identifier for each record in a table

Field that links to primary key in another table

Structure to speed up data retrieval

Virtual table based on a query result

Process to reduce data redundancy

Why these pairings

These are fundamental database concepts.

Practice this question →

45

MCQmedium

During data acquisition, a data engineer uses a tool to extract data from a source system incrementally based on a timestamp column. Which method is being used?

A.Change data capture (CDC)

B.Snapshot extraction

C.Full extraction

D.Manual extraction

AnswerA

CDC uses timestamps or logs to extract only changed data.

Why this answer

Change data capture (CDC) captures modifications since the last extraction. Full extraction retrieves all data each time, snapshot extracts a point-in-time copy, and manual is not automated.

Practice this question →

46

MCQhard

A company is merging two databases from different departments. In Database A, customer IDs are integers. In Database B, customer IDs are alphanumeric strings. To merge, the data analyst must reconcile these differences. Which step should be taken first?

A.Drop the ID column and use a surrogate key

B.Convert all IDs to integers using CAST

C.Perform data profiling to understand the ID formats and relationships

D.Create a mapping table based on the first character

AnswerC

Profiling helps determine the best strategy for reconciliation.

Why this answer

Option C is correct because data profiling is the essential first step before any transformation or mapping. It allows the analyst to examine the actual formats, patterns, and relationships in both ID columns (e.g., whether Database B's alphanumeric IDs contain embedded numeric sequences or consistent prefixes). Without profiling, any conversion or mapping would be based on assumptions that could lead to data loss or incorrect merges.

Exam trap

The trap here is that candidates assume immediate conversion (Option B) is the simplest solution, but the exam tests the principle that data profiling must precede any transformation to avoid irreversible data corruption.

How to eliminate wrong answers

Option A is wrong because dropping the ID column and using a surrogate key discards the existing business meaning and relationships, which may be critical for linking records across departments. Option B is wrong because converting all IDs to integers using CAST will fail on alphanumeric strings that contain non-numeric characters, causing errors or data loss. Option D is wrong because creating a mapping table based solely on the first character is arbitrary and ignores the full ID structure, leading to incorrect or incomplete mappings.

Practice this question →

47

Multi-Selecteasy

Which THREE data sources are suitable for web scraping? (Select three.)

Select 3 answers

A.HTML pages

B.JSON APIs

C.CSV files

D.Database connections

E.PDF documents

AnswersA, B, E

HTML is the primary source for web scraping.

Why this answer

HTML pages are suitable for web scraping because they contain structured or semi-structured data in markup format that can be parsed using libraries like BeautifulSoup or Scrapy. Web scrapers extract information from the DOM tree by targeting specific tags, classes, or attributes, making HTML a primary source for scraping.

Exam trap

The trap here is that candidates may confuse 'web scraping' with any form of data extraction, but the exam specifically tests the understanding that scraping involves HTTP-based retrieval of web content, not direct file downloads or database queries.

Practice this question →

48

MCQhard

A data pipeline log shows the above error. Which data transformation should be applied during acquisition?

A.Skip rows that cause errors

B.Preprocess the string to remove non-numeric characters, then convert to DECIMAL

C.Use CAST(transaction_amount AS DECIMAL(10,2)) in SQL

D.Change the target column type to VARCHAR

AnswerB

Removing symbols before conversion ensures successful casting.

Why this answer

Option B is correct because the error indicates that the pipeline encountered a string with non-numeric characters (e.g., '$1,234.56') when trying to load it into a DECIMAL column. Preprocessing the string to remove non-numeric characters (like currency symbols, commas) before conversion ensures the data is clean and parseable, which is a standard data transformation during acquisition to handle dirty source data.

Exam trap

The trap here is that candidates assume CAST in SQL can handle any string-to-number conversion, but CAST strictly requires a valid numeric string and will throw an error for non-numeric characters, making preprocessing essential.

How to eliminate wrong answers

Option A is wrong because skipping rows that cause errors would result in data loss and is not a proper transformation; it ignores the root cause of the dirty data. Option C is wrong because using CAST(transaction_amount AS DECIMAL(10,2)) in SQL would still fail if the string contains non-numeric characters, as CAST does not automatically strip them. Option D is wrong because changing the target column type to VARCHAR would avoid the conversion error but defeats the purpose of storing numeric data for calculations, leading to data integrity and performance issues.

Practice this question →

49

MCQhard

A financial analyst is integrating data from multiple stock exchanges. One exchange provides trade timestamps in UTC, another in Eastern Time. The analyst needs accurate time synchronization for time-series analysis. What is the best approach?

A.Keep original timezones and add a timezone offset column

B.Use the local time of the analyst's location

C.Convert all timestamps to a single timezone (e.g., UTC) during ETL

D.Ignore timezone differences if analysis is intraday

AnswerC

Converting to a common timezone ensures consistent timestamps for analysis.

Why this answer

Option D (convert to UTC) is the standard. Option A (keep original with offset) adds complexity. Option B (local time) is inconsistent.

Option C (ignore) leads to errors.

Practice this question →

50

MCQhard

Refer to the exhibit. An analyst sees this log during data acquisition. What action should be taken first?

A.Modify the ETL mapping for data types

B.Reject the entire dataset

C.Ignore warnings and continue

D.Correct the date string in the source

AnswerA

Adjusting the mapping resolves the type mismatch for all rows.

Why this answer

The log shows a data type mismatch during ETL (Extract, Transform, Load) processing, where a date field is being read as a string. The correct first action is to modify the ETL mapping for data types to ensure the date string is properly cast or converted to the target date format, preventing data loss or corruption. This aligns with standard data acquisition best practices: adjust the transformation layer to handle source data anomalies before rejecting or altering the source.

Exam trap

CompTIA often tests the misconception that you should always fix the source data first, but in data acquisition, the ETL layer is the standard place to handle format conversions without altering the original source.

How to eliminate wrong answers

Option B is wrong because rejecting the entire dataset is an overreaction; a single data type mismatch can be resolved by adjusting the ETL mapping without discarding potentially valid data. Option C is wrong because ignoring warnings can lead to downstream errors, such as failed joins or incorrect date calculations, violating data integrity requirements. Option D is wrong because correcting the date string in the source is not always feasible (e.g., if the source is a third-party system or read-only), and the ETL layer is the appropriate place to handle such transformations.

Practice this question →

51

MCQeasy

A retail company's data analytics team needs to acquire point-of-sale (POS) transaction data from 200 stores daily. Each store sends a CSV file via email at the end of the day. The files often arrive late, have inconsistent column names (e.g., "StoreID", "Store_ID", "store_id"), and occasionally contain corrupted rows. The team manually processes these files, leading to frequent errors and delays. The company wants to automate the acquisition process to ensure data is available by 9 AM the next business day with high quality. Which approach best addresses these issues?

A.Create a script to automatically download email attachments, validate and standardize columns, and flag corrupted rows for review

B.Hire a data entry contractor to manually check and re-enter data

C.Ask stores to use a standardized web form to enter data directly into a cloud database

D.Implement a VPN so stores can connect to the central database and write transactions in real time

AnswerA

This automates the entire process, handles inconsistencies, and ensures timely availability with quality checks.

Why this answer

Option A is correct because it directly addresses all three issues: automating the retrieval of email attachments (handling late arrivals), standardizing inconsistent column names via a script (e.g., mapping 'StoreID', 'Store_ID', 'store_id' to a canonical schema), and implementing validation logic to flag corrupted rows for manual review. This approach ensures data is processed reliably by 9 AM without manual intervention, meeting the automation and quality requirements.

Exam trap

The trap here is that candidates may choose Option C or D because they seem more 'modern' or 'direct,' but they fail to recognize that the question specifically requires handling existing CSV files and late arrivals, which a script-based ETL approach (Option A) directly solves without requiring stores to change their behavior or infrastructure.

How to eliminate wrong answers

Option B is wrong because hiring a data entry contractor introduces manual processing, which is the root cause of delays and errors, and does not automate the acquisition process. Option C is wrong because asking stores to use a standardized web form shifts the burden to 200 stores, which is impractical to enforce uniformly and does not address the existing CSV files or late arrivals; it also introduces new integration complexity without solving the immediate data pipeline issue. Option D is wrong because implementing a VPN for real-time writes requires significant network infrastructure changes, assumes stores have stable high-speed internet, and does not handle the existing CSV files or the need for batch processing by 9 AM; real-time writes also increase the risk of data corruption without validation.

Practice this question →

52

MCQhard

An organization is acquiring data from an external vendor. The vendor provides a flat file with inconsistent delimiters and missing values. Which step should be performed first in data acquisition?

A.Data integration

B.Data profiling

C.Data transformation

D.Data cleansing

AnswerB

Profiling reveals structure, quality, and inconsistencies first.

Why this answer

Option C is correct because data profiling examines the data to understand its structure, quality, and issues before any cleansing or transformation. Option A is wrong because data cleansing is performed after profiling. Option B is wrong because data integration combines data from multiple sources.

Option D is wrong because data transformation occurs after profiling.

Practice this question →

53

Multi-Selecteasy

A data analyst is validating a dataset acquired from an external source. Which TWO actions are appropriate for data quality assessment?

Select 2 answers

A.Check for missing values in critical fields

B.Delete any rows with null values without review

C.Validate data format against expected schema

D.Immediately load all data into production

E.Transform data to match target system without verification

AnswersA, C

Missing value checks are fundamental to data quality.

Why this answer

Checking for missing values in critical fields is a fundamental data quality assessment step because missing data can indicate incomplete records, data corruption, or extraction errors. Identifying these gaps early allows the analyst to decide on appropriate handling strategies, such as imputation or rejection, before further processing. This aligns with data profiling best practices in the mining and acquisition phase.

Exam trap

The trap here is that candidates may confuse data cleaning (which includes deletion or transformation) with data quality assessment, which is the diagnostic step that should occur before any irreversible actions like deletion or production loading.

Practice this question →

54

MCQeasy

A marketing company is building a customer segmentation model. The data team has access to two sources: a CRM database with customer demographics and purchase history, and a third-party data provider that offers social media activity scores. The CRM data is updated daily, while the third-party data is refreshed weekly on Sundays. The analyst needs to create a unified dataset for the model training scheduled for Wednesday morning. The analyst runs a SQL query to join the two tables on CustomerID, but the resulting dataset has far fewer rows than expected. Upon investigation, the analyst finds that many customers in the CRM do not have matching records in the third-party data. Additionally, some customers in the third-party data have multiple entries due to unresolved duplicates. The analyst must produce the most complete dataset possible while maintaining data quality. Which course of action should the analyst take?

A.First deduplicate the third-party data by keeping the most recent record per CustomerID, then perform a LEFT JOIN from CRM to the deduplicated third-party data.

B.Perform an INNER JOIN on CustomerID and then remove duplicates from the result.

C.Use only the third-party data because it provides the social media scores needed for segmentation.

D.Perform a LEFT JOIN from the third-party data to CRM, then aggregate duplicates by averaging scores.

AnswerA

This preserves all CRM customers and handles duplicates correctly.

Why this answer

Option A is correct because it first resolves the duplicate issue in the third-party data by keeping the most recent record per CustomerID, ensuring each customer has a single, current social media score. Then, a LEFT JOIN from CRM to the deduplicated third-party data preserves all CRM customers, maximizing completeness while maintaining data quality. This approach aligns with the goal of producing the most complete dataset for model training, as the CRM is the primary source with daily updates.

Exam trap

The trap here is that candidates may choose an INNER JOIN (Option B) thinking it ensures data quality by only including matched records, but they overlook the requirement for completeness, which necessitates preserving all CRM customers even without third-party matches.

How to eliminate wrong answers

Option B is wrong because an INNER JOIN would exclude CRM customers without matching third-party records, reducing dataset completeness, and removing duplicates after the join does not address the root cause of multiple entries in the third-party data. Option C is wrong because using only third-party data discards the CRM's daily-updated demographics and purchase history, which are essential for segmentation and would result in an incomplete dataset. Option D is wrong because a LEFT JOIN from third-party data to CRM would prioritize third-party customers, potentially losing CRM-only customers, and averaging scores across duplicates introduces data quality issues by conflating multiple records into a single value without considering recency or validity.

Practice this question →

55

MCQmedium

What is the primary purpose of the HAVING clause in the query shown?

A.Sort the results in descending order

B.Join two tables

C.Filter rows before grouping

D.Filter groups after aggregation

AnswerD

HAVING filters groups that meet the aggregate condition.

Why this answer

The HAVING clause is used to filter groups after the GROUP BY clause has aggregated the data. In SQL, WHERE filters individual rows before aggregation, while HAVING applies conditions to the results of aggregate functions like SUM, COUNT, or AVG. Option D is correct because the query uses HAVING to restrict which grouped results appear in the final output.

Exam trap

The trap here is confusing WHERE and HAVING: candidates often pick 'Filter rows before grouping' because they think all filtering happens before aggregation, but HAVING specifically filters groups after aggregation, not individual rows.

How to eliminate wrong answers

Option A is wrong because sorting is performed by the ORDER BY clause, not HAVING; HAVING has no sorting functionality. Option B is wrong because joining tables is done with JOIN (or FROM with comma-separated tables) and ON conditions, not with HAVING. Option C is wrong because filtering rows before grouping is the role of the WHERE clause; HAVING operates after aggregation, on groups, not on individual rows.

Practice this question →

56

MCQeasy

A data analyst needs to collect customer sentiment data from social media platforms. Which data acquisition method is most appropriate?

A.Conduct a survey

B.Organize focus groups

C.Use web scraping

D.Query the internal CRM

AnswerC

Web scraping automates extraction of data from social media platforms.

Why this answer

Web scraping is the most appropriate method because it allows the data analyst to programmatically extract unstructured customer sentiment data (e.g., posts, comments, reviews) directly from social media platforms using HTTP requests and HTML parsing. Unlike surveys or focus groups, scraping can collect large volumes of real-time, publicly available data without relying on self-reported or curated responses.

Exam trap

CompTIA often tests the distinction between primary data collection (surveys, focus groups) and secondary data acquisition (web scraping, APIs), where candidates mistakenly choose a primary method for a task that requires large-scale, unsolicited external data.

How to eliminate wrong answers

Option A is wrong because conducting a survey collects self-reported, structured data from a controlled sample, which is not suitable for capturing organic, unsolicited sentiment from social media platforms in real time. Option B is wrong because organizing focus groups gathers qualitative feedback from a small, moderated group, which lacks the scale and authenticity of public social media sentiment and introduces moderator bias. Option D is wrong because querying the internal CRM retrieves structured customer data from internal systems (e.g., purchase history, support tickets), not the unstructured, external social media content needed for sentiment analysis.

Practice this question →

57

Multi-Selecteasy

Which TWO are common methods for acquiring internal data? (Choose two.)

Select 2 answers

A.Social media APIs

B.Transaction logs

C.Government databases

D.ERP systems

E.Web scraping

AnswersB, D

Transaction logs record internal system activities.

Why this answer

Transaction logs are a primary source of internal data because they record every interaction or event within a system, such as database changes, user access, or application errors. This data is generated and stored internally by the organization's own infrastructure, making it a classic example of internal data acquisition.

Exam trap

The trap here is that candidates may confuse 'internal data' with 'publicly available data' or 'data from third-party sources,' leading them to select social media APIs or government databases, which are external, not internal.

Practice this question →

58

MCQhard

A data team is integrating customer data from three sources. After joining, they find that the count of unique customers is lower than expected. What is the most likely cause?

A.Inconsistent key definitions.

B.Missing values in join keys.

C.Data truncation during transfer.

D.Duplicate entries across sources.

AnswerA

Mismatched key formats cause join failures, reducing matches.

Why this answer

When joining customer data from multiple sources, inconsistent key definitions (e.g., one source uses integer IDs while another uses string IDs, or different formats like 'CUST-001' vs '1001') cause the join to fail to match records that actually represent the same customer. This results in fewer unique customers than expected because the join treats mismatched keys as different entities, effectively dropping or misaligning records. The data team likely used an inner join or a left join that only retains matches based on exact key equality, so any key inconsistency reduces the count of matched unique customers.

Exam trap

The trap here is that candidates often assume missing values or duplicates are the primary cause of a lower unique count, but Cisco tests the nuance that inconsistent key definitions—not missing data—are the most common reason for unexpected join results in multi-source integration scenarios.

How to eliminate wrong answers

Option B is wrong because missing values in join keys would typically cause rows to be excluded from the join (e.g., NULL keys in SQL inner joins are not matched), which could reduce the total row count but not specifically the count of unique customers—missing keys usually lead to fewer rows overall, not a lower unique customer count after join. Option C is wrong because data truncation during transfer (e.g., cutting off characters from a VARCHAR field) would likely cause data loss or corruption, but it would not systematically reduce the count of unique customers; it might introduce mismatches or duplicates, but the primary effect is not a lower unique count. Option D is wrong because duplicate entries across sources would actually increase the count of unique customers if duplicates are not deduplicated, or if they are deduplicated, the unique count might be accurate; duplicates do not inherently cause a lower unique count—they cause inflated counts or require deduplication logic.

Practice this question →

59

MCQmedium

A data analyst is tasked with combining customer data from a CRM system and a billing system. The CRM uses a GUID for customer ID, while billing uses an integer. Which approach should the analyst use to ensure a reliable merge?

A.Standardize the customer ID format and use it as the join key.

B.Use the customer name as the join key.

C.Merge using a cross-join and then filter manually.

D.Perform a fuzzy match on the customer address.

AnswerA

Standardizing keys ensures a consistent, unique identifier for accurate merging.

Why this answer

Option A is correct because standardizing the customer ID format (e.g., converting the billing integer to a GUID or mapping both to a common string key) ensures a consistent join key across heterogeneous systems. This eliminates type mismatch errors and guarantees that each customer record can be matched reliably, as GUIDs are globally unique and integers are typically sequential, so direct comparison would fail without transformation.

Exam trap

The trap here is that candidates may assume customer name or address are sufficient join keys due to their human readability, underestimating the importance of unique, system-agnostic identifiers for reliable data merging.

How to eliminate wrong answers

Option B is wrong because customer names are not guaranteed to be unique (e.g., multiple customers named 'John Smith') and may have formatting inconsistencies (e.g., case, spaces), leading to incorrect or missed matches. Option C is wrong because a cross-join produces a Cartesian product of all rows, which is computationally expensive and requires manual filtering that is error-prone and does not leverage any reliable key for accurate merging. Option D is wrong because fuzzy matching on addresses is imprecise and computationally intensive; addresses can have variations (e.g., 'St.' vs 'Street') and may not uniquely identify a customer (e.g., multiple customers at the same address), making it unreliable for a deterministic merge.

Practice this question →

60

MCQmedium

A data analyst needs to combine sales data from multiple regional databases with different schemas. Which process is best?

A.Data federation

B.ETL (Extract, Transform, Load)

C.Data replication

D.Data virtualization

AnswerB

Designed to extract, transform, and load data from heterogeneous sources.

Why this answer

Option B is correct because ETL (Extract, Transform, Load) is designed to extract data from various sources, transform it to a common schema, and load it into a target system. Option A is wrong because data replication copies data without transformation. Option C is wrong because data virtualization provides real-time access without storing.

Option D is wrong because data federation also provides virtual integration but is less suited for combining data into a single store.

Practice this question →

61

MCQmedium

A company wants to collect real-time clickstream data from its website. Which acquisition method is most suitable?

A.Streaming API

B.Web scraping

C.Batch processing nightly

D.Manual entry

AnswerA

Enables continuous, low-latency data ingestion.

Why this answer

A streaming API is the most suitable method for collecting real-time clickstream data because it enables continuous, low-latency ingestion of events as they occur. Unlike batch or manual methods, a streaming API (e.g., using WebSockets or HTTP/2 Server-Sent Events) pushes each click event immediately to the data pipeline, satisfying the real-time requirement.

Exam trap

CompTIA often tests the distinction between 'real-time' and 'near-real-time' or 'batch' methods, and the trap here is that candidates may confuse web scraping (which can be automated frequently) with true streaming, not realizing that scraping is still a pull-based, scheduled operation that cannot match the push-based immediacy of a streaming API.

How to eliminate wrong answers

Option B (Web scraping) is wrong because it is a pull-based technique that typically retrieves static HTML pages at intervals, not real-time event streams, and is inefficient for high-frequency click data. Option C (Batch processing nightly) is wrong because it introduces a delay of up to 24 hours, failing the real-time requirement. Option D (Manual entry) is wrong because it is error-prone, non-scalable, and cannot capture high-velocity clickstream data in real time.

Practice this question →

62

MCQmedium

Refer to the exhibit. A data engineer is setting up data acquisition from an S3 bucket with this policy. What does the policy enforce?

A.Only objects with key 'data-bucket' can be accessed

B.Only GET requests are allowed

C.The bucket is publicly accessible

D.Access is restricted to a specific IP range

AnswerD

The condition limits access to IPs in the 10.0.0.0/8 range.

Why this answer

The S3 bucket policy includes a condition using `aws:SourceIp` with a CIDR block, which restricts access to requests originating from the specified IP range. This enforces network-level access control, allowing only clients within that IP range to perform the allowed actions (like GET requests) on the bucket. Option D correctly identifies this IP-based restriction.

Exam trap

CompTIA often tests the misconception that a Deny statement with a condition on IP addresses makes the bucket publicly accessible, when in fact it restricts access to only the specified IP range.

How to eliminate wrong answers

Option A is wrong because the policy does not restrict access based on object keys; the `arn:aws:s3:::data-bucket/*` in the Resource element specifies all objects in the bucket, not a specific key. Option B is wrong because the policy allows both GET and PUT requests (as shown in the Action list), not only GET requests. Option C is wrong because the policy explicitly denies access from IPs outside the specified range, making the bucket private to that IP range, not publicly accessible.

Practice this question →

63

MCQeasy

A data analyst needs to extract data from an API that returns JSON. The analyst wants to convert the JSON output into a tabular format for analysis. Which function in a scripting language is commonly used for this purpose?

A.json.loads()

B.to_csv()

C.read_json()

D.json_normalize()

AnswerD

This function normalizes semi-structured JSON data into a flat table.

Why this answer

Option D is correct because `json_normalize()` is a function in the pandas library specifically designed to flatten semi-structured JSON data (including nested lists and dictionaries) into a tabular DataFrame. This makes it the ideal tool for converting API responses with complex nesting into rows and columns for analysis, unlike simpler JSON parsing functions.

Exam trap

The trap here is that candidates confuse `read_json()` (which works only for flat JSON) with `json_normalize()` (which handles nested structures), leading them to choose option C when the API response contains hierarchical data.

How to eliminate wrong answers

Option A is wrong because `json.loads()` only parses a JSON string into a Python dictionary or list; it does not flatten nested structures or produce a tabular format. Option B is wrong because `to_csv()` is a pandas method for exporting a DataFrame to a CSV file, not for converting JSON to a table. Option C is wrong because `read_json()` in pandas reads a JSON file or string into a DataFrame but only handles simple, flat JSON structures; it fails with deeply nested JSON (e.g., arrays of objects with sub-objects) without additional normalization.

Practice this question →

64

Multi-Selecteasy

Which TWO are examples of internal data sources? (Select exactly 2)

Select 2 answers

A.APIs

B.Relational databases

C.Sensor readings

D.Social media comments

E.Flat files

AnswersB, E

Common internal source.

Why this answer

Relational databases are internal data sources because they store structured data generated and controlled within an organization's own systems. They are typically managed by internal IT teams and accessed via SQL queries, making them a classic example of an internal data repository.

Exam trap

CompTIA often tests the distinction between data sources and data access methods, so candidates mistakenly select APIs or sensor readings as internal sources when they are actually mechanisms or external origin points.

Practice this question →

65

MCQmedium

Refer to the exhibit. If the date column is stored as a string in 'MM/DD/YYYY' format, what will be the result?

A.Incorrect results because string comparison is lexicographic.

B.NULL values

C.Error because DATE type is expected.

D.Correct results because string comparison works for dates.

AnswerA

The different format causes lexicographic comparison to fail.

Why this answer

Option A is correct because when dates are stored as strings in 'MM/DD/YYYY' format, string comparison is lexicographic (character-by-character). This means that '01/02/2023' (January 2) would be considered greater than '12/31/2022' because '0' > '1' at the first character, leading to incorrect chronological ordering. The comparison does not interpret the string as a date value.

Exam trap

CompTIA often tests the misconception that string comparison of dates in 'MM/DD/YYYY' format will yield correct chronological order, but the trap is that lexicographic comparison compares month first, not year, leading to incorrect results.

How to eliminate wrong answers

Option B is wrong because string comparison does not produce NULL values; it simply compares strings lexicographically and returns a valid boolean result. Option C is wrong because no error occurs; the database or application will perform string comparison without expecting a DATE type, as the column is defined as a string. Option D is wrong because string comparison does not work correctly for dates in this format; lexicographic order does not match chronological order for 'MM/DD/YYYY' strings.

Practice this question →

66

MCQeasy

A data analyst needs to merge two customer tables from different sources. One table uses 'CUST_ID' as the primary key, the other uses 'CustomerID'. To ensure accurate merging, the analyst should first:

A.Perform a fuzzy match on names

B.Normalize the key column names to a common format

C.Remove duplicate rows from both tables

D.Aggregate data by region

AnswerB

Standardizing key names allows for accurate merging without data loss.

Why this answer

Normalizing key column names to a common format (Option B) is the correct first step because the merge operation requires a consistent join key. Without aligning 'CUST_ID' and 'CustomerID' to a single name and data type, the database or ETL tool will treat them as different columns, resulting in a cross join or an error. This step ensures referential integrity and enables an accurate inner or outer join based on the primary key.

Exam trap

The trap here is that candidates assume deduplication (Option C) is the most critical first step, but without first standardizing the join keys, any deduplication logic would operate on mismatched or incomplete data, leading to incorrect results.

How to eliminate wrong answers

Option A is wrong because performing a fuzzy match on names is an advanced, resource-intensive technique used only when exact key values are unavailable or inconsistent; it is unnecessary when the tables already have primary key columns that can be standardized. Option C is wrong because removing duplicate rows before aligning key names could inadvertently delete legitimate records that only appear duplicated due to key naming differences, and deduplication should occur after the merge or as a separate quality step. Option D is wrong because aggregating data by region is a post-merge analytical operation that has no bearing on resolving key column mismatches and would corrupt the granularity needed for accurate joining.

Practice this question →

67

MCQeasy

Refer to the exhibit. An analyst runs this query before acquiring data from a PostgreSQL database. What is the primary purpose of this query?

A.To verify data types

B.To check for data freshness

C.To find primary keys

D.To identify duplicate tables

AnswerB

The 'last_analyzed' column shows when statistics were last updated, indicating freshness.

Why this answer

The query `SELECT current_timestamp;` returns the current date and time from the PostgreSQL server. By comparing this value with the timestamp of the most recent record in a target table, an analyst can determine if the data is up-to-date or stale. This is a standard technique to check data freshness before proceeding with data acquisition.

Exam trap

The trap here is that candidates may confuse `current_timestamp` with a function used for data type verification or schema inspection, when in fact it is solely a temporal function used to gauge data freshness.

How to eliminate wrong answers

Option A is wrong because verifying data types requires querying the `information_schema.columns` table or using `pg_typeof()`, not `current_timestamp`. Option C is wrong because finding primary keys involves querying `information_schema.table_constraints` or `pg_indexes`, not a simple timestamp function. Option D is wrong because identifying duplicate tables would require comparing table names or schemas via `information_schema.tables`, not a timestamp query.

Practice this question →

68

MCQeasy

A data analyst is tasked with gathering data from a legacy system that only exports CSV files. The files contain headers but no data types. Which tool would best facilitate initial data exploration?

A.Hadoop

B.Tableau

C.SQL database

D.Python pandas

AnswerD

Provides powerful data structures and functions for CSV exploration.

Why this answer

Option A is correct because Python pandas provides powerful data structures and functions for exploring and manipulating CSV data, including detecting data types. Option B is wrong because SQL database requires importing data first. Option C is wrong because Hadoop is overkill for CSV exploration.

Option D is wrong because Tableau is visualization-focused and not ideal for initial data exploration without pre-processing.

Practice this question →

69

Drag & Dropmedium

Drag and drop the steps to perform a data audit in the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps

Order

Why this order

Data audit begins with inventory, quality assessment, compliance check, documentation, and recommendations.

Practice this question →

70

MCQhard

A data engineer is designing a data pipeline to ingest streaming data from IoT sensors. The sensors send data every second, and the pipeline must handle bursts of up to 10,000 messages per second. Which approach is most appropriate for capturing this data before processing?

A.Directly write each message to a relational database

B.Load directly into a data warehouse

C.Use a message queue to buffer the incoming data

D.Store data in flat files and process in nightly batches

AnswerC

A message queue handles high throughput and provides reliable buffering.

Why this answer

Option C is correct because a message queue (e.g., Apache Kafka, Amazon Kinesis, or RabbitMQ) provides an asynchronous buffer that decouples the high-velocity ingestion (up to 10,000 messages/second) from downstream processing. This allows the pipeline to absorb burst traffic without overwhelming the processing layer, ensures data durability, and supports replayability in case of failures.

Exam trap

CompTIA often tests the misconception that relational databases or data warehouses can handle real-time streaming ingestion at scale, when in fact they require a buffering layer like a message queue to absorb bursts and decouple ingestion from processing.

How to eliminate wrong answers

Option A is wrong because directly writing each message to a relational database (RDBMS) at 10,000 messages/second would cause severe write contention, lock contention, and I/O bottlenecks, leading to dropped data and unacceptable latency. Option B is wrong because loading directly into a data warehouse (e.g., Snowflake, Redshift) is designed for batch or micro-batch ingestion, not for real-time streaming at this scale; it would incur high costs and fail to handle bursty throughput without prior buffering. Option D is wrong because storing data in flat files and processing in nightly batches introduces unacceptable latency (up to 24 hours) for streaming IoT data, and the file system cannot reliably handle 10,000 writes per second without data loss or corruption.

Practice this question →

71

MCQeasy

A data analyst is importing a CSV file that contains a mixture of numeric and text fields. What is the most common issue when importing?

A.Duplicate rows

B.Missing header row

C.Data types being incorrectly inferred

D.File size limitation

AnswerC

CSV import tools often guess types incorrectly, leading to conversion errors.

Why this answer

Data type inference often fails, causing numbers to be read as text or vice versa. File size limitations, missing headers, and duplicate rows are less common or not specific to mixed types.

Practice this question →

72

MCQeasy

A marketing team wants to analyze customer sentiment from social media posts. Which data acquisition method is most appropriate?

A.Internal database query

B.Physical sensor data

C.Web scraping from public social media APIs

D.Survey questionnaire

AnswerC

Allows direct access to public posts for sentiment analysis.

Why this answer

Option D is correct because web scraping from public social media APIs allows direct access to public posts for sentiment analysis. Option A is wrong because internal databases do not contain social media data. Option B is wrong because surveys are not real-time from social media.

Option C is wrong because physical sensors are unrelated.

Practice this question →

73

MCQhard

An e-commerce company is merging customer data from three legacy systems. Two systems use email as unique identifier, but one system allows multiple customers per email. The third uses phone number. To create a unified customer view, the analyst should first:

A.Request the IT team to modify the legacy system

B.Build a customer matching rule that uses multiple attributes (email, phone, name) with a confidence score

C.Use email as primary key and ignore conflicts

D.Assign new unique IDs and discard existing identifiers

AnswerB

Multi-attribute matching handles non-unique identifiers and improves accuracy.

Why this answer

Option B is correct because merging data from systems with different identifier schemas requires a probabilistic matching approach. Using multiple attributes (email, phone, name) with a confidence score allows the analyst to resolve conflicts where email is not unique and phone numbers may be missing or formatted differently, creating a unified customer view without forcing a single key.

Exam trap

The trap here is that candidates assume a single unique identifier (email) can be forced as a primary key, ignoring the real-world data quality issue of non-unique emails, which the question explicitly states.

How to eliminate wrong answers

Option A is wrong because modifying legacy systems is often impractical, costly, and outside the analyst's scope; the question asks what the analyst should do first, not a long-term IT project. Option C is wrong because using email as primary key and ignoring conflicts would lose data integrity when one email maps to multiple customers, violating the goal of a unified view. Option D is wrong because assigning new unique IDs and discarding existing identifiers eliminates the ability to link records back to source systems and loses valuable matching context, making deduplication impossible.

Practice this question →

74

Multi-Selectmedium

Which THREE are best practices for data profiling during acquisition? (Choose three.)

Select 3 answers

A.Immediately normalize data

B.Check for completeness

C.Assess data types

D.Identify outliers

E.Skip validation for trusted sources

AnswersB, C, D

Ensuring all required fields are populated is essential.

Why this answer

Checking for completeness (Option B) is a best practice during data acquisition because it ensures that all required fields and records are present before further processing. Incomplete data can lead to incorrect analysis or failed transformations, so profiling for missing values or nulls is a fundamental validation step.

Exam trap

The trap here is that candidates confuse 'best practices for acquisition' with 'best practices for transformation,' leading them to select normalization (Option A) as an immediate step rather than a later processing stage.

Practice this question →

75

Matchingmedium

Match each data analysis technique to its primary purpose.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Model relationships between variables

Group similar data points without labels

Analyze data points collected over time

Compare means across multiple groups

Test association between categorical variables

Why these pairings

These are common statistical techniques for data analysis.

Practice this question →

Page 1 of 2 · 99 questions totalNext →

Ready to test yourself?

Try a timed practice session using only Mining Acquiring Data questions.

Start 20-question session