CCNA Mining and Acquiring Data Questions

24 of 99 questions · Page 2/2 · Mining and Acquiring Data · Answers revealed

76
MCQmedium

An analyst is reviewing the above SQL query used to acquire data. What does this query retrieve?

A.Customers who placed more than 5 orders in 2023
B.All customers who placed at least 5 orders in 2023
C.The total number of orders per customer in 2023
D.Customers who placed exactly 5 orders in 2023
AnswerA

The HAVING clause filters for counts greater than 5.

Why this answer

The SQL query uses a HAVING clause with COUNT(*) > 5 to filter customers who placed more than 5 orders in 2023. The WHERE clause restricts records to the year 2023, and the GROUP BY customer_id aggregates orders per customer. The condition '> 5' explicitly excludes customers with exactly 5 or fewer orders, making option A correct.

Exam trap

The trap here is confusing the comparison operator '>' with '>=', leading candidates to mistakenly include customers with exactly 5 orders when the query explicitly excludes them.

How to eliminate wrong answers

Option B is wrong because 'at least 5 orders' would require the condition COUNT(*) >= 5, not > 5. Option C is wrong because the query returns customer IDs, not the total number of orders per customer; the COUNT is used only for filtering, not as a selected column. Option D is wrong because 'exactly 5 orders' would require COUNT(*) = 5, not > 5.

77
Multi-Selectmedium

A company is acquiring social media data via a public API. Which TWO considerations are important for ensuring ethical and legal compliance?

Select 2 answers
A.Share raw data with third parties for additional insights
B.Use the data for any internal analysis without restrictions
C.Anonymize personal identifiable information (PII) before storage
D.Cache data indefinitely to avoid repeated API calls
E.Comply with the platform's terms of service
AnswersC, E

Anonymization protects individual privacy and complies with regulations.

Why this answer

Option C is correct because anonymizing PII before storage is a fundamental data privacy requirement under regulations like GDPR and CCPA. When acquiring data via a public API, the company must ensure that personal identifiers (e.g., names, email addresses, IP addresses) are removed or obfuscated to prevent re-identification, reducing legal liability and ethical risk.

Exam trap

The trap here is that candidates may confuse 'caching for efficiency' (Option D) with ethical compliance, overlooking that indefinite storage violates data minimization principles and platform terms, while 'internal analysis' (Option B) seems harmless but ignores explicit usage restrictions in the API's terms of service.

78
MCQhard

Refer to the exhibit. What is the most likely cause of the extraction failure?

A.The source table is locked
B.The network firewall is blocking the port
C.The extraction query is too complex
D.The database server is down
AnswerB

Causes connection to hang until timeout.

Why this answer

Option B is correct because connection timeouts with consistent 30-second delays suggest the network firewall is blocking the port, causing the connection to hang until timeout. Option A is wrong because if the server were down, the error would be connection refused immediately. Option C is wrong because a complex query would cause a slow query, not a connection timeout.

Option D is wrong because a locked table would cause a lock wait timeout, not a connection timeout.

79
Multi-Selectmedium

A data analyst needs to identify duplicate customer records. Which TWO methods are commonly used? (Select two.)

Select 2 answers
A.Fuzzy matching using Levenshtein distance
B.Sorting and comparing adjacent rows
C.Visual inspection of random sample
D.Using a hash function on primary key
E.Exact match on all fields
AnswersA, B

Levenshtein distance catches spelling differences.

Why this answer

Fuzzy matching using Levenshtein distance (Option A) is correct because it measures the edit distance between two strings, allowing identification of duplicates even when there are minor typographical differences, such as 'Jon Smith' vs. 'John Smith'. This is essential for deduplicating customer records where names, addresses, or other fields may have slight variations without being exact matches.

Exam trap

The trap here is that candidates often choose 'Exact match on all fields' (Option E) thinking it is a reliable deduplication method, but in practice it fails to catch real-world duplicates that have any minor variation, and the exam expects you to recognize that fuzzy matching and sorted adjacency comparisons are the standard techniques for duplicate detection.

80
MCQeasy

A data analyst receives the above JSON snippet from a web API. The analyst needs to extract the email addresses for all customers. Which JSONPath expression should be used?

A.$.customers[0].email
B.$..email
C.$.customers[*].email
D.$.customers.email
AnswerC

This expression selects email from every customer object.

Why this answer

Option C is correct because the JSONPath expression `$.customers[*].email` uses the wildcard `[*]` to select all elements in the `customers` array and then accesses the `email` property of each element. This matches the requirement to extract email addresses for all customers from the JSON snippet.

Exam trap

The trap here is that candidates often confuse the deep scan operator `..` with the array wildcard `[*]`, thinking `$..email` will neatly extract all customer emails, but it actually retrieves every `email` property at any depth, including from non-customer objects, leading to incorrect data extraction.

How to eliminate wrong answers

Option A is wrong because `$.customers[0].email` only retrieves the email address of the first customer in the array, not all customers. Option B is wrong because `$..email` uses the deep scan operator `..` which recursively searches the entire JSON tree for any property named `email`, potentially returning emails from nested objects or arrays that are not customers (e.g., from an `orders` or `address` object), leading to incorrect or extra results. Option D is wrong because `$.customers.email` attempts to access `email` directly on the `customers` array object, but arrays in JSONPath do not have a property named `email`; this expression would return `null` or an empty result unless the array itself has an `email` property, which it does not.

81
MCQeasy

A data team needs to extract data from a legacy system that only supports flat file exports. Which data acquisition method is most appropriate?

A.Database replication
B.API call
C.Web scraping
D.File transfer via SFTP
AnswerD

SFTP enables secure transfer of flat files, aligning with the system's export capability.

Why this answer

The legacy system only supports flat file exports, meaning it cannot provide direct database or API access. SFTP (SSH File Transfer Protocol) is the most appropriate method because it securely transfers flat files over a network, aligning with the system's export capabilities while ensuring data integrity and encryption during transit.

Exam trap

The trap here is that candidates may confuse 'flat file exports' with a need for real-time or API-based methods, overlooking that SFTP is the standard secure file transfer protocol for batch-oriented legacy systems.

How to eliminate wrong answers

Option A is wrong because database replication requires the source system to support a database engine with replication features (e.g., transactional logs or CDC), which a legacy flat-file-only system lacks. Option B is wrong because an API call requires the legacy system to expose a programmatic interface (e.g., REST or SOAP), which is not available if it only supports flat file exports. Option C is wrong because web scraping is used to extract data from web pages via HTTP, not from a legacy system that exports flat files via a file transfer protocol.

82
MCQhard

A data analyst discovers that a dataset contains multiple records for the same customer with different spellings (e.g., 'Jon' vs 'John'). Which data preparation step should be applied first?

A.Merge all records into one per customer.
B.Remove duplicates based on exact match.
C.Standardize text fields using a lookup table.
D.Flag records for manual review.
AnswerC

Standardization harmonizes variations like 'Jon' and 'John'.

Why this answer

Option C is correct because the first step when dealing with inconsistent text values (like 'Jon' vs 'John') is to standardize the data using a lookup table or reference mapping. This ensures that all variations are normalized to a canonical form before any merging or deduplication is attempted, preventing data loss and preserving referential integrity.

Exam trap

The trap here is that candidates often jump to 'remove duplicates' (Option B) because they think of exact-match deduplication, but the question specifically tests the understanding that data quality issues like inconsistent spellings must be resolved through standardization before any deduplication logic can be applied.

How to eliminate wrong answers

Option A is wrong because merging records before standardizing spellings would combine data based on non-uniform keys, likely creating erroneous composite records or losing the ability to correctly identify which records belong to the same customer. Option B is wrong because removing duplicates based on exact match would treat 'Jon' and 'John' as different records, failing to identify them as the same customer and leaving the inconsistency unresolved. Option D is wrong because flagging records for manual review is a downstream action that should only be taken after automated standardization has been attempted; skipping standardization first would result in an unnecessarily large and inefficient manual review workload.

83
MCQeasy

A data analyst is using SQL to extract data. The analyst wants to retrieve all records from a table named 'sales' where the 'amount' column is greater than 100. Which SQL clause should be used?

A.WHERE
B.ORDER BY
C.GROUP BY
D.HAVING
AnswerA

WHERE clause filters rows based on a condition.

Why this answer

The WHERE clause in SQL is used to filter records based on a specified condition, such as 'amount > 100'. It is applied directly to the rows in the 'sales' table before any grouping or ordering, making it the correct choice for retrieving only records where the amount exceeds 100.

Exam trap

The trap here is that candidates often confuse HAVING with WHERE, thinking both can filter rows, but HAVING is only valid after GROUP BY and for aggregate conditions, while WHERE filters individual rows before any grouping.

How to eliminate wrong answers

Option B (ORDER BY) is wrong because it is used to sort the result set by one or more columns, not to filter rows based on a condition. Option C (GROUP BY) is wrong because it groups rows that have the same values in specified columns into summary rows, often for use with aggregate functions, and does not filter individual records. Option D (HAVING) is wrong because it is used to filter groups after the GROUP BY clause has been applied, typically with aggregate functions, and cannot be used to filter individual rows before grouping.

84
MCQeasy

A retail company wants to analyze customer purchase patterns to identify products frequently bought together. Which data mining technique is most appropriate?

A.Classification
B.Clustering
C.Regression
D.Association rules
AnswerD

Association rules identify frequent itemsets and co-occurrence patterns.

Why this answer

Association rules (market basket analysis) discover co-occurrence relationships. Regression predicts numeric values, clustering groups similar items, classification assigns categories.

85
Multi-Selecteasy

Which TWO of the following are valid SQL clauses used to filter and sort data?

Select 2 answers
A.DELETE
B.WHERE
C.ORDER BY
D.UPDATE
E.INSERT
AnswersB, C

WHERE filters rows based on conditions.

Why this answer

The WHERE clause is used to filter rows based on specified conditions, while the ORDER BY clause sorts the result set in ascending or descending order. Both are standard SQL clauses for data filtering and sorting, respectively.

Exam trap

CompTIA often tests the distinction between SQL DML statements (DELETE, UPDATE, INSERT) and query clauses (WHERE, ORDER BY), trapping candidates who confuse data manipulation commands with data retrieval or sorting operations.

86
Drag & Dropmedium

Drag and drop the steps to perform a data backup using the 3-2-1 rule in the correct order.

Drag steps to the numbered slots on the right, or tap a step then tap a slot.

Steps
Order

Why this order

The 3-2-1 rule involves multiple copies, different media, offsite storage, and regular testing.

87
MCQhard

A data engineer is tasked with acquiring data from a third-party vendor that provides daily file drops via SFTP. The files are large (10 GB each). The pipeline must load data into a data warehouse. Which approach optimizes for speed and reliability?

A.Download the file to a staging server, then bulk insert into warehouse
B.Stream the file directly from SFTP into warehouse using a data pipeline tool
C.Have the vendor push data via API instead of SFTP
D.Split the file into smaller chunks and load concurrently
AnswerB

Streaming minimizes latency and storage overhead.

Why this answer

Option B is correct because streaming the file directly from SFTP into the warehouse using a data pipeline tool (e.g., Apache NiFi, Airbyte, or Fivetran) eliminates the intermediate staging step, reducing disk I/O and latency. This approach leverages incremental processing and parallel streams to handle large 10 GB files efficiently, while built-in retry and checkpoint mechanisms ensure reliability against network interruptions.

Exam trap

The trap here is that candidates assume 'download then load' (Option A) is the most reliable approach, but the question specifically asks for speed and reliability, and streaming avoids the I/O bottleneck and single-point-of-failure of a staging server.

How to eliminate wrong answers

Option A is wrong because downloading the file to a staging server introduces an unnecessary intermediate write and read cycle, doubling I/O time and adding a single point of failure; bulk insert after full download also delays loading until the entire file is present, which is suboptimal for speed. Option C is wrong because having the vendor push data via API instead of SFTP does not inherently optimize speed or reliability for large daily file drops—APIs often have payload size limits (e.g., 10 MB) and require chunking, adding complexity and potential throttling, while SFTP is already a reliable file transfer protocol. Option D is wrong because splitting the file into smaller chunks and loading concurrently can cause resource contention (e.g., connection pool exhaustion, lock contention) and requires careful coordination to maintain data consistency; it does not address the fundamental bottleneck of downloading the entire file before processing.

88
MCQhard

A data analyst sees this error in the ETL logs. What is the most likely cause?

A.The materialized view log was updated after the last refresh
B.The source table was dropped
C.The analyst does not have permission to refresh the view
D.There is a network connection timeout
AnswerA

The log is newer, indicating changes that need a full refresh.

Why this answer

The error indicates that the materialized view's underlying data has changed since its last refresh, specifically because the materialized view log was updated. Materialized views rely on logs to track changes for fast refreshes; if the log is updated after the last refresh, the view's snapshot becomes stale and cannot be incrementally refreshed without a complete refresh. This is a common cause of refresh failures in Oracle databases.

Exam trap

CompTIA often tests the distinction between fast refresh and complete refresh errors, and the trap here is that candidates confuse a log update with a source table drop or permission issue, not realizing that the error message specifically points to a log timestamp mismatch.

How to eliminate wrong answers

Option B is wrong because dropping the source table would cause a different error (e.g., 'table or view does not exist') rather than a log-related error. Option C is wrong because a permission issue would typically result in an 'insufficient privileges' error, not a log mismatch. Option D is wrong because a network timeout would produce a connection error (e.g., ORA-12170 or ORA-03113), not a materialized view log inconsistency.

89
Multi-Selecthard

After merging two datasets, an analyst finds that the resulting dataset has many null values in some columns. Which TWO steps should the analyst take to address this? (Select two.)

Select 2 answers
A.Ignore nulls and proceed.
B.Impute nulls with the median.
C.Remove all rows with nulls.
D.Replace nulls with a placeholder value like 'Unknown'.
E.Investigate the cause of nulls.
AnswersB, E

Median imputation preserves dataset size and reduces bias.

Why this answer

Option B is correct because imputing nulls with the median is a standard technique for handling missing numerical data, especially when the distribution is skewed or contains outliers. The median is robust to extreme values and preserves the central tendency of the column, making it a safe choice for many analytical models. This approach avoids data loss while maintaining statistical integrity.

Exam trap

The trap here is that candidates may think 'Ignore nulls and proceed' is acceptable, but the exam tests the understanding that nulls must be actively handled to ensure data quality and model validity, not simply overlooked.

90
MCQeasy

A data analyst is tasked with collecting data from multiple spreadsheets provided by different departments. Each spreadsheet has different column names and formats. What is the best first step?

A.Develop a data dictionary and standardize column names
B.Discard any mismatched data
C.Use a machine learning model to clean data
D.Immediately load all data into a database
AnswerA

Standardization ensures all data sources align, making subsequent loading and analysis consistent.

Why this answer

Developing a data dictionary and standardizing column names ensures consistency across all data sources before loading, reducing errors and facilitating integration. Immediately loading data can cause inconsistencies. Discarding mismatched data loses potentially valuable information.

Using a machine learning model is an unnecessary and complex first step.

91
MCQmedium

An e-commerce company is acquiring product data from multiple supplier APIs. The APIs return JSON with inconsistent field naming conventions. Which data acquisition technique should be applied?

A.Data compression
B.Data mapping and transformation
C.Data deduplication
D.Data aggregation
AnswerB

Standardizes field names and structures.

Why this answer

Data mapping and transformation is the correct technique because the JSON responses from different supplier APIs use inconsistent field naming conventions (e.g., 'product_id' vs. 'ProductID'). This technique defines a schema to map source fields to a standardized target format, ensuring data consistency before loading into the company's system. Without transformation, downstream processes like analytics or inventory management would fail due to mismatched field names.

Exam trap

The trap here is that candidates confuse data transformation with data aggregation or deduplication, assuming any processing step can fix schema inconsistencies, but only mapping and transformation directly address field naming and structure mismatches.

How to eliminate wrong answers

Option A is wrong because data compression reduces storage size or transfer bandwidth, but does not address structural inconsistencies in field naming. Option C is wrong because data deduplication removes duplicate records based on content, but does not reconcile different field names or schemas. Option D is wrong because data aggregation summarizes or combines data (e.g., sums, averages), but does not resolve naming conflicts or schema mismatches.

92
Multi-Selectmedium

Which TWO are valid data acquisition methods? (Select two.)

Select 2 answers
A.Web scraping
B.Data normalization
C.API calls
D.Data encryption
E.Data profiling
AnswersA, C

Web scraping extracts data from websites and is a common acquisition method.

Why this answer

Web scraping and API calls are direct methods to acquire data from external sources. Data profiling and data normalization are data preparation techniques, not acquisition. Data encryption is a security measure.

93
Multi-Selecteasy

A data analyst is performing data acquisition from multiple source files. Which TWO data profiling tasks should the analyst complete before loading the data into the target system?

Select 2 answers
A.Create a dashboard for stakeholders
B.Verify data types and formats
C.Build a linear regression model
D.Perform cluster analysis
E.Identify missing values and nulls
AnswersB, E

Verifying data types ensures consistency and prevents errors during loading.

Why this answer

Identifying missing values and verifying data types are fundamental data profiling tasks that help ensure data quality before acquisition. Building models, creating dashboards, or clustering are not part of profiling.

94
MCQeasy

A small business wants to acquire customer feedback through a short questionnaire emailed after purchase. Which data acquisition method does this represent?

A.Transaction log
B.Interview
C.Survey
D.Observation
AnswerC

A questionnaire is a classic survey tool for collecting feedback.

Why this answer

The correct answer is C because a survey is a structured data collection method where respondents answer predefined questions, typically via a form or questionnaire. In this scenario, the business is using a short questionnaire emailed after purchase to gather customer feedback, which directly aligns with the definition of a survey as a data acquisition method.

Exam trap

The trap here is that candidates may confuse a survey with a transaction log because both can be automated and delivered electronically, but a transaction log captures system events, not user-provided feedback.

How to eliminate wrong answers

Option A is wrong because a transaction log records system-level events such as database changes, user logins, or API calls, not subjective customer feedback via a questionnaire. Option B is wrong because an interview involves a direct, synchronous conversation between an interviewer and a respondent, often with open-ended questions, whereas the scenario describes an asynchronous, self-administered questionnaire. Option D is wrong because observation involves watching and recording behavior or events without direct interaction, whereas the scenario explicitly involves asking customers for their opinions through a questionnaire.

95
MCQhard

A data analyst is using a public API to collect historical weather data. The API has a rate limit of 100 requests per minute, but the analyst needs to retrieve 10,000 records as quickly as possible. What strategy should be used?

A.Increase the request rate
B.Use multiple API keys
C.Paginate with appropriate delays
D.Download a precompiled dataset
AnswerC

Pagination allows systematic retrieval; delays ensure compliance with rate limits.

Why this answer

Option C is correct because paginating with appropriate delays respects the API's rate limit of 100 requests per minute while maximizing throughput. By splitting the 10,000 records into pages (e.g., 100 records per page) and sending requests at a rate just under the limit (e.g., one request every 0.6 seconds), the analyst can retrieve all data in approximately 100 minutes without triggering HTTP 429 rate-limit errors.

Exam trap

The trap here is that candidates may assume 'as quickly as possible' means sending requests as fast as possible (Option A) or using multiple keys (Option B), overlooking that rate limits are enforced per key or IP and that proper pagination with delays is the only compliant way to maximize throughput.

How to eliminate wrong answers

Option A is wrong because increasing the request rate beyond 100 requests per minute would violate the API's rate limit, resulting in HTTP 429 (Too Many Requests) responses or temporary IP bans. Option B is wrong because using multiple API keys to circumvent rate limits violates the API's terms of service and could lead to account suspension or revocation of access. Option D is wrong because downloading a precompiled dataset may not be available, may not contain the specific historical weather data needed, or may be outdated, and the question explicitly states the analyst is using a public API to collect data.

96
MCQmedium

Refer to the exhibit. A data analyst is trying to extract data from a SQL Server database but receives the error. Which configuration change should the analyst recommend to the database administrator?

A.Change the server firewall to allow port 1433
B.Enable Mixed Mode authentication on the SQL Server
C.Use a different extraction tool that supports Windows authentication
D.Grant the 'dataminer' user SELECT permissions
AnswerB

Mixed Mode allows SQL authentication, which matches the login attempt.

Why this answer

The error indicates that the server is configured for Windows authentication only, but the login attempt used SQL authentication. Enabling Mixed Mode authentication allows both Windows and SQL authentication, resolving the issue.

97
MCQmedium

A company is merging two customer databases from different acquisitions. They need to identify duplicate records. Which data profiling technique is most effective?

A.Fuzzy matching on name and address
B.Manually compare all records
C.Exact match on customer names
D.Use primary keys from each database
AnswerA

Fuzzy matching handles variations and is appropriate for deduplication.

Why this answer

Fuzzy matching on name and address is the most effective technique because customer databases from different acquisitions often contain variations in spelling, formatting, and abbreviations (e.g., 'Bob' vs. 'Robert', 'St.' vs. 'Street'). Exact matching would miss these duplicates, while fuzzy matching uses algorithms like Levenshtein distance or Jaro-Winkler to quantify similarity and identify near-matches, ensuring comprehensive deduplication.

Exam trap

The trap here is that candidates assume exact matching or primary keys are sufficient for deduplication, overlooking the real-world data inconsistencies that fuzzy matching is designed to handle.

How to eliminate wrong answers

Option B is wrong because manually comparing all records is impractical and error-prone for large datasets, lacking scalability and consistency. Option C is wrong because exact match on customer names fails to capture duplicates caused by typos, nicknames, or inconsistent formatting (e.g., 'Jon' vs. 'John'). Option D is wrong because primary keys from each database are unique within their own system but cannot identify cross-database duplicates, as the same customer may have different primary keys in each source.

98
MCQmedium

A data engineer is designing an ETL pipeline to extract sales data from a legacy on-premise database and load it into a cloud data warehouse. The database is slow and queries during business hours affect performance. Which extraction strategy minimizes impact?

A.Query the database with SELECT * every hour
B.Incremental extraction using Change Data Capture (CDC)
C.Full table extraction nightly
D.Use a database log shipping
AnswerB

CDC minimizes database load by extracting only changed data, reducing performance impact.

Why this answer

Incremental extraction using Change Data Capture (CDC) minimizes impact on the legacy on-premise database by reading only the changed rows (inserts, updates, deletes) from transaction logs or change tables, rather than issuing heavy SELECT queries. This avoids full table scans or frequent queries during business hours, preserving database performance for operational workloads.

Exam trap

The trap here is that candidates confuse 'log shipping' (a high-availability technique) with 'Change Data Capture' (an extraction method), or assume that any periodic query (like hourly SELECT *) is acceptable without considering the cumulative performance impact on a slow legacy database.

How to eliminate wrong answers

Option A is wrong because querying the database with SELECT * every hour performs full table scans on the legacy database, which is slow and would degrade performance during business hours, directly contradicting the goal of minimizing impact. Option C is wrong because full table extraction nightly still requires a complete scan of the entire table, which can be resource-intensive and may not complete within a reasonable window if the database is slow, and it does not capture intra-day changes without additional overhead. Option D is wrong because database log shipping is a disaster recovery technique that continuously copies transaction logs to a standby server, not an extraction strategy for ETL; it does not provide a queryable change stream and would require additional processing to parse logs for CDC.

99
MCQhard

Based on the exhibit, what is the most likely cause of the import failure?

A.The file is empty or contains only headers.
B.The price field includes non-numeric characters that cannot be parsed.
C.The source file is corrupted or in an unsupported format.
D.A data quality issue: the date field contains an invalid date.
AnswerD

The date '2024-02-30' is invalid and caused the import to halt.

Why this answer

Option D is correct because the exhibit shows a date field containing '2023-02-30', which is an invalid date (February never has 30 days). This data quality issue causes the import to fail, as the system likely validates date values against calendar rules before inserting them into the target table. The error is not due to file emptiness, non-numeric characters, or corruption, but specifically a semantic data integrity violation.

Exam trap

The trap here is that candidates may overlook semantic data quality issues (like invalid dates) and instead focus on syntactic problems (like file format or non-numeric characters), even though the exhibit clearly shows a date that does not exist in the calendar.

How to eliminate wrong answers

Option A is wrong because the file contains multiple rows of data beyond headers, as evidenced by the visible records in the exhibit. Option B is wrong because the price field shows numeric values (e.g., 19.99, 29.99) without any non-numeric characters that would cause parsing failures. Option C is wrong because the source file is displayed in a standard CSV format with proper delimiters and readable content, indicating it is not corrupted or in an unsupported format.

← PreviousPage 2 of 2 · 99 questions total

Ready to test yourself?

Try a timed practice session using only Mining and Acquiring Data questions.