Knowledge + Practice

CompTIA Data+ (DA0-001): Questions 376–450

509 questions total · 7 pages · All types, answers revealed

Page 6 of 7

376

Multi-Selectmedium

A company is acquiring social media data via a public API. Which TWO considerations are important for ensuring ethical and legal compliance?

Select 2 answers

A.Share raw data with third parties for additional insights

B.Use the data for any internal analysis without restrictions

C.Anonymize personally identifiable information (PII) before storage

D.Cache data indefinitely to avoid repeated API calls

E.Comply with the platform's terms of service

AnswersC, E

Anonymization protects individual privacy and complies with regulations.

Why this answer

Option C is correct because anonymizing PII before storage is a fundamental data privacy practice under regulations like GDPR and CCPA. When acquiring data via a public API, the company must ensure that personal identifiers (e.g., names, email addresses, IP addresses) are removed or obfuscated to prevent re-identification, reducing legal liability and ethical risk. Option E is correct because the platform's terms of service define how data obtained through the API may be collected, stored, and used.
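As a rough illustration of removing identifiers before storage, here is a minimal sketch. The field names, the sample record, and the salt are all hypothetical. Note that a salted hash is pseudonymization rather than full anonymization, so whether it satisfies a given regulation is a separate question.

```python
import hashlib

# Hypothetical field names; a real API payload will differ.
DIRECT_IDENTIFIERS = {"name", "email", "ip_address"}

def pseudonymize(record: dict, salt: str) -> dict:
    """Drop direct identifiers and keep only a salted hash as a join key."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "email" in record:
        digest = hashlib.sha256((salt + record["email"].lower()).encode("utf-8"))
        cleaned["user_key"] = digest.hexdigest()
    return cleaned

# Hypothetical social media record.
post = {"name": "Ada Example", "email": "ada@example.com",
        "ip_address": "203.0.113.7", "text": "Great product!"}
print(pseudonymize(post, salt="rotate-me"))
```

The stored record keeps the post text and an opaque key, while the name, email address, and IP address never reach storage.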

Exam trap

The trap here is that candidates may confuse 'caching for efficiency' (Option D) with ethical compliance, overlooking that indefinite storage violates data minimization principles and platform terms, while 'internal analysis' (Option B) seems harmless but ignores explicit usage restrictions in the API's terms of service.

377

MCQeasy

A data analyst is creating a report for a marketing campaign. The campaign data includes customer names, email addresses, and purchase history. Which of the following best describes the 'customer name' data type?

A.Nominal

B.Quantitative

C.Ordinal

D.Discrete

AnswerA

Nominal is categorical without order.

Why this answer

Customer names are categorical labels that identify individuals without any inherent order or numerical value. This fits the definition of nominal data, which is used for naming or classifying variables. In data analysis, nominal data can be stored as strings and used for grouping or filtering, but arithmetic operations are meaningless.

Exam trap

CompTIA often tests the distinction between nominal and ordinal data by presenting a label that could be mistaken for having an order (e.g., 'customer name' might be confused with 'rank' or 'tier'), but the trap here is that names are purely categorical with no intrinsic ranking.

How to eliminate wrong answers

Option B is wrong because quantitative data represents numerical measurements or counts (e.g., purchase amount), not text labels like names. Option C is wrong because ordinal data has a meaningful order or rank (e.g., customer satisfaction rating), but customer names have no inherent sequence. Option D is wrong because discrete data consists of countable numerical values (e.g., number of purchases), whereas customer names are non-numeric categories.

378

MCQhard

Refer to the exhibit. What is the most likely cause of the extraction failure?

A.The source table is locked

B.The network firewall is blocking the port

C.The extraction query is too complex

D.The database server is down

AnswerB

Causes connection to hang until timeout.

Why this answer

Option B is correct because connection timeouts with consistent 30-second delays suggest the network firewall is blocking the port, causing each connection attempt to hang until it times out. Option A is wrong because a locked table would cause a lock wait timeout, not a connection timeout. Option C is wrong because a complex query would run slowly after connecting, not fail with a connection timeout.

Option D is wrong because if the server were down, the connection would typically be refused immediately rather than hanging.

379

MCQmedium

A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?

A.SARIMA

B.Simple exponential smoothing

C.Holt-Winters exponential smoothing

D.ARIMA

AnswerC

Holt-Winters includes trend and seasonality components, making it suitable for this data.

Why this answer

The Holt-Winters exponential smoothing model (option C) is the most appropriate because it explicitly captures both trend and seasonality components, which are present in the sales data (upward trend and yearly seasonal patterns). Unlike simple exponential smoothing, Holt-Winters includes additive or multiplicative seasonal terms, making it ideal for data with clear, repeating seasonal cycles over a 12-month horizon.

Exam trap

The trap here is that candidates often choose ARIMA or SARIMA because they are more 'advanced,' but the question specifically describes clear trend and seasonality without requiring stationarity or differencing, making Holt-Winters the most direct and appropriate choice.

How to eliminate wrong answers

Option A (SARIMA) is wrong because, while SARIMA can model trend and seasonality, it does so through regular and seasonal differencing to reach stationarity and involves more complex parameter selection (p, d, q, P, D, Q, s); for a straightforward forecasting task with clear trend and seasonality, Holt-Winters is simpler and often more robust. Option B (Simple exponential smoothing) is wrong because it only handles level (no trend or seasonality), so it would fail to capture the upward trend and yearly seasonal patterns in the sales data. Option D (ARIMA) is wrong because non-seasonal ARIMA can handle trend through differencing but has no seasonal differencing or seasonal AR terms, so it cannot account for the repeating yearly patterns in the data.

380

Multi-Selectmedium

In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)

Select 2 answers

A.Linearity: the relationship between predictors and response is linear

B.Large sample size

C.Normality of errors

D.Homoscedasticity: errors have constant variance

E.Independence of errors

AnswersA, D

Nonlinear relationships can bias coefficient estimates.

Why this answer

For unbiased coefficient estimates in multiple linear regression, the linearity assumption (A) ensures that the model correctly specifies the functional form between predictors and the response. Homoscedasticity (D) means that the variance of errors is constant across all levels of the predictors, which the Gauss-Markov theorem requires for ordinary least squares (OLS) to be the best linear unbiased estimator. Strictly speaking, heteroscedasticity makes OLS inefficient and its standard errors unreliable rather than biasing the coefficients, but A and D is the pairing this question expects.

Exam trap

CompTIA often tests the distinction between the assumptions this question treats as required for unbiasedness (linearity and homoscedasticity) and those needed mainly for inference (normality, independence, large sample size), causing candidates to mistakenly select normality or independence as critical for unbiased coefficients.

381

MCQeasy

Refer to the exhibit. A stakeholder complains that the line chart exaggerates the changes in sales. What is the most likely cause?

A.The y-axis does not start at zero

B.There are too few data points

C.The data labels are incorrect

D.The chart type should be a bar chart

AnswerA

Starting at a non-zero value exaggerates differences.

Why this answer

Setting beginAtZero to false truncates the y-axis, making small changes appear larger.
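To see why a truncated axis exaggerates change, it can help to compare how far two points sit above the axis baseline. The sketch below uses made-up sales figures (the exhibit's data is not reproduced here), and `apparent_ratio` is a hypothetical helper rather than part of any charting library.

```python
def apparent_ratio(low: float, high: float, axis_min: float) -> float:
    """Ratio of the plotted heights of two values above the axis baseline."""
    return (high - axis_min) / (low - axis_min)

# Hypothetical sales figures: a 5% real increase from 100 to 105.
print(apparent_ratio(100, 105, axis_min=0))   # 1.05
print(apparent_ratio(100, 105, axis_min=95))  # 2.0
```

With the axis starting at 95, the second point is drawn twice as high as the first even though the underlying increase is only 5%.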

382

MCQhard

While reviewing a dashboard, an analyst notices that the data in a trend line chart does not match the underlying data due to a filter setting. The dashboard is used for weekly executive meetings. What should the analyst do?

A.Ignore the discrepancy if it is small.

B.Wait for someone to complain before acting.

C.Immediately remove the dashboard and send raw data.

D.Document the issue and fix the filter before the next meeting.

AnswerD

Proactive approach maintains data integrity and trust.

Why this answer

Option D is correct because the analyst has identified a data integrity issue caused by a filter setting that directly impacts the accuracy of the trend line chart. The dashboard is used for weekly executive meetings, so the analyst must document the discrepancy and correct the filter before the next meeting to ensure data-driven decisions are based on accurate visualizations. This aligns with best practices in data governance and the principle of maintaining trust in reporting tools.

Exam trap

The trap here is that candidates may assume small discrepancies are acceptable or that waiting for complaints is a valid approach, but the exam emphasizes proactive data integrity and the importance of maintaining accurate visualizations for scheduled stakeholder meetings.

How to eliminate wrong answers

Option A is wrong because ignoring even a small discrepancy in a dashboard used for executive decision-making can lead to compounded errors in trend analysis and erode trust in the data; any deviation from the underlying data must be investigated and corrected. Option B is wrong because waiting for someone to complain is reactive and unprofessional; the analyst should proactively ensure data accuracy, especially for a recurring weekly meeting where stakeholders rely on consistent, correct visualizations. Option C is wrong because immediately removing the dashboard and sending raw data disrupts the established reporting workflow and forces executives to interpret unaggregated data, which is inefficient and likely to introduce new errors; the proper action is to fix the filter and restore the correct trend line chart.

383

Multi-Selectmedium

A data analyst needs to identify duplicate customer records. Which TWO methods are commonly used? (Select two.)

Select 2 answers

A.Fuzzy matching using Levenshtein distance

B.Sorting and comparing adjacent rows

C.Visual inspection of random sample

D.Using a hash function on primary key

E.Exact match on all fields

AnswersA, B

Levenshtein distance catches spelling differences.

Why this answer

Fuzzy matching using Levenshtein distance (Option A) is correct because it measures the edit distance between two strings, allowing identification of duplicates even when there are minor typographical differences, such as 'Jon Smith' vs. 'John Smith'. This is essential for deduplicating customer records where names, addresses, or other fields may have slight variations without being exact matches.
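For reference, the edit distance described above can be computed with a short dynamic-programming routine. This is a minimal sketch of the standard algorithm, not a production deduplication tool; real pipelines usually add a similarity threshold and blocking rules on top of it.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("Jon Smith", "John Smith"))  # 1
```

A distance of 1 between the two names signals a likely duplicate that an exact match would miss.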

Exam trap

The trap here is that candidates often choose 'Exact match on all fields' (Option E) thinking it is a reliable deduplication method, but in practice it fails to catch real-world duplicates that have any minor variation, and the exam expects you to recognize that fuzzy matching and sorted adjacency comparisons are the standard techniques for duplicate detection.

384

MCQeasy

A data analyst receives the above JSON snippet from a web API. The analyst needs to extract the email addresses for all customers. Which JSONPath expression should be used?

A.$.customers[0].email

B.$..email

C.$.customers[*].email

D.$.customers.email

AnswerC

This expression selects email from every customer object.

Why this answer

Option C is correct because the JSONPath expression `$.customers[*].email` uses the wildcard `[*]` to select all elements in the `customers` array and then accesses the `email` property of each element. This matches the requirement to extract email addresses for all customers from the JSON snippet.

Exam trap

The trap here is that candidates often confuse the deep scan operator `..` with the array wildcard `[*]`, thinking `$..email` will neatly extract all customer emails, but it actually retrieves every `email` property at any depth, including from non-customer objects, leading to incorrect data extraction.

How to eliminate wrong answers

Option A is wrong because `$.customers[0].email` only retrieves the email address of the first customer in the array, not all customers. Option B is wrong because `$..email` uses the deep scan operator `..` which recursively searches the entire JSON tree for any property named `email`, potentially returning emails from nested objects or arrays that are not customers (e.g., from an `orders` or `address` object), leading to incorrect or extra results. Option D is wrong because `$.customers.email` attempts to access `email` directly on the `customers` array object, but arrays in JSONPath do not have a property named `email`; this expression would return `null` or an empty result unless the array itself has an `email` property, which it does not.
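The difference between these expressions can be sketched in plain Python without a JSONPath library. The payload below is hypothetical (the exhibit's JSON is not reproduced in this section), and `deep_scan` is an illustrative stand-in for the `..` operator rather than a real JSONPath implementation.

```python
import json

# Hypothetical payload; the exhibit's actual JSON is not shown in this section.
doc = json.loads("""
{
  "customers": [
    {"name": "A", "email": "a@example.com"},
    {"name": "B", "email": "b@example.com"}
  ],
  "support": {"email": "help@example.com"}
}
""")

# Equivalent of $.customers[*].email: every element of the customers array.
wildcard = [c["email"] for c in doc["customers"]]

# Equivalent of $.customers[0].email: the first customer only.
first_only = doc["customers"][0]["email"]

def deep_scan(node, key):
    """Rough equivalent of $..key: collect the key at any depth."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(deep_scan(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(deep_scan(item, key))
    return found

print(wildcard)                 # customer emails only
print(deep_scan(doc, "email"))  # also picks up the support address
```

In this made-up document the deep scan returns three addresses, one of which does not belong to a customer.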

385

Multi-Selecthard

Which TWO of the following are primary benefits of implementing a data governance program?

Select 2 answers

A.Faster data processing speed

B.Increased data volume

C.Improved data quality and consistency

D.Lower storage costs

E.Reduced data redundancy

AnswersC, E

Governance establishes standards that enhance quality and consistency.

Why this answer

Option C is correct because a primary benefit of a data governance program is improved data quality and consistency. Data governance establishes policies, standards, and procedures for data management, ensuring that data is accurate, complete, and reliable across the organization. This directly enhances decision-making and operational efficiency by reducing errors and inconsistencies in data assets.

Exam trap

The trap here is that candidates may confuse data governance with data management or data engineering tasks, mistakenly thinking it directly improves performance or reduces costs, when its core value is in quality, consistency, and compliance.

386

MCQeasy

A data team needs to extract data from a legacy system that only supports flat file exports. Which data acquisition method is most appropriate?

A.Database replication

B.API call

C.Web scraping

D.File transfer via SFTP

AnswerD

SFTP enables secure transfer of flat files, aligning with the system's export capability.

Why this answer

The legacy system only supports flat file exports, meaning it cannot provide direct database or API access. SFTP (SSH File Transfer Protocol) is the most appropriate method because it securely transfers flat files over a network, aligning with the system's export capabilities while ensuring data integrity and encryption during transit.

Exam trap

The trap here is that candidates may confuse 'flat file exports' with a need for real-time or API-based methods, overlooking that SFTP is the standard secure file transfer protocol for batch-oriented legacy systems.

How to eliminate wrong answers

Option A is wrong because database replication requires the source system to support a database engine with replication features (e.g., transactional logs or CDC), which a legacy flat-file-only system lacks. Option B is wrong because an API call requires the legacy system to expose a programmatic interface (e.g., REST or SOAP), which is not available if it only supports flat file exports. Option C is wrong because web scraping is used to extract data from web pages via HTTP, not from a legacy system that exports flat files via a file transfer protocol.

387

MCQeasy

Refer to the exhibit. Which data quality dimension is compromised by the missing value for Charlie's salary?

A.Uniqueness

B.Completeness

C.Timeliness

D.Accuracy

AnswerB

Correct. The salary field is missing, so data is incomplete.

Why this answer

Completeness measures whether all required data is present. Charlie's missing salary value means the record is incomplete, directly violating this dimension. In data quality frameworks, completeness is assessed by the proportion of non-null values in a field, and a null salary here fails that check.
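The proportion-of-non-null check mentioned above might look like the following sketch. The rows are hypothetical stand-ins for the exhibit, which is not reproduced here.

```python
# Hypothetical rows standing in for the exhibit.
rows = [
    {"name": "Alice", "salary": 52000},
    {"name": "Bob", "salary": 48000},
    {"name": "Charlie", "salary": None},
]

def completeness(rows, field):
    """Share of rows where the field is present and not None."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)

print(f"{completeness(rows, 'salary'):.0%}")  # 67%
```

Two of three salary values are present, so the field falls short of full completeness even though no stored value is wrong.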

Exam trap

CompTIA often tests the distinction between 'missing' (completeness) and 'wrong' (accuracy), leading candidates to confuse a null value with an incorrect value.

How to eliminate wrong answers

Option A is wrong because uniqueness refers to the absence of duplicate records or values, not missing data; a missing salary does not create a duplicate. Option C is wrong because timeliness concerns whether data is up-to-date or available when needed, not whether a value is present or absent. Option D is wrong because accuracy measures correctness of values against a reference source; a missing value is not an inaccurate value—it is an absent one.

388

MCQeasy

A dashboard designer wants to highlight the sales performance of individual sales representatives compared to team averages. Which chart type is most suitable for this comparison?

A.Scatter plot

B.Bar chart with average line

C.Pie chart

D.Line chart

AnswerB

Bar chart allows easy comparison of each rep's sales to the average line.

Why this answer

A bar chart is ideal for comparing individual values to an average because it clearly shows the magnitude of each value relative to a reference line. Options A, C, and D are less suitable for direct comparison of multiple entities.

389

Multi-Selecthard

Which THREE are best practices for designing a dashboard for executive consumption?

Select 3 answers

A.Include detailed raw data tables for transparency

B.Ensure the dashboard is responsive for mobile devices

C.Use a separate chart for each metric to avoid clutter

D.Provide interactive filters for time periods and regions

E.Display the most critical KPIs at the top

AnswersB, D, E

Executives often view dashboards on tablets or phones.

Why this answer

Options B, D, and E are correct. Ensure mobile responsiveness (B), provide interactive filters (D), and prioritize key metrics at the top (E). Option A is wrong because summary views are more useful for executives than detailed raw data tables.

Option C is wrong because a separate chart for every metric tends to add clutter rather than remove it.

390

MCQmedium

A data analyst notices that a line chart showing monthly sales over the past two years has a steep drop in one month. Upon investigation, the analyst discovers that a new sales region was added mid-month and the data was not normalized. Which of the following best practices should the analyst apply to communicate this insight accurately?

A.Remove the month with the drop from the report.

B.Use a bar chart instead to show the data.

C.Normalize the sales data by region and explain the data anomaly in the report.

D.Present the data as-is and let stakeholders interpret the drop.

AnswerC

Normalization corrects the artifact, and explanation provides transparency.

Why this answer

Option C is correct because the core issue is that the sales data is not normalized by region, causing a misleading drop when a new region was added mid-month. By normalizing the data (e.g., calculating per-region averages or percentages) and explicitly noting the anomaly in the report, the analyst ensures accurate communication of insights. This aligns with the DA0-001 domain of Communicating Data Insights, where transparency and data integrity are paramount.

Exam trap

CompTIA often tests the misconception that simply changing the chart type (Option B) or omitting problematic data (Option A) is acceptable, when the correct approach is to normalize and explain the anomaly to maintain data integrity and clarity.

How to eliminate wrong answers

Option A is wrong because removing the month with the drop would hide the anomaly and distort the trend, violating the principle of data integrity and transparency. Option B is wrong because switching to a bar chart does not address the underlying issue of unnormalized data; the same misleading drop would still appear, just in a different visual format. Option D is wrong because presenting the data as-is without explanation shifts the burden of interpretation to stakeholders, who may draw incorrect conclusions due to the unnormalized data, which is a failure in clear communication.

391

MCQeasy

Refer to the exhibit. A data analyst is unable to access the dataset during weekends. Based on the policy, what is the most likely reason?

A.The analyst's role is not in the list

B.The policy is misconfigured

C.The access condition requires being on the internal network

D.The condition restricts access to business hours only

AnswerD

Weekends are outside business hours, so access is denied.

Why this answer

Option D is correct because the policy explicitly includes a condition that restricts access to business hours only (e.g., Monday–Friday, 9 AM–5 PM). Since the analyst is attempting to access the dataset during a weekend, the condition denies access regardless of the analyst's role or network location. This is a common attribute-based access control (ABAC) rule where time-based conditions are evaluated before granting permissions.
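A time-based condition of this kind can be sketched as follows. The role name, days, and hours are assumptions taken from the examples in the explanation, not from the exhibit, and `is_allowed` is a hypothetical helper rather than any real policy engine's API.

```python
from datetime import datetime, time

# Assumed policy values; the exhibit's actual conditions are not shown here.
BUSINESS_DAYS = range(0, 5)            # Monday=0 .. Friday=4
OPEN, CLOSE = time(9, 0), time(17, 0)
ALLOWED_ROLES = {"Data Analyst"}

def is_allowed(role: str, when: datetime) -> bool:
    """Grant access only if the role is listed AND the time condition holds."""
    in_hours = when.weekday() in BUSINESS_DAYS and OPEN <= when.time() < CLOSE
    return role in ALLOWED_ROLES and in_hours

print(is_allowed("Data Analyst", datetime(2024, 6, 8, 10, 0)))   # Saturday: False
print(is_allowed("Data Analyst", datetime(2024, 6, 10, 10, 0)))  # Monday: True
```

The role check passes in both calls; only the time condition differs, which mirrors the weekend denial in the question.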

Exam trap

The trap here is that candidates may assume the issue is a missing role or network restriction, but the policy explicitly shows a time-based condition that overrides other permissions, and the weekend access attempt directly violates that condition.

How to eliminate wrong answers

Option A is wrong because the exhibit shows the analyst's role is listed in the policy (e.g., 'Data Analyst' is present in the role list), so the issue is not a missing role. Option B is wrong because the policy is correctly configured with a time-based condition; misconfiguration would imply a syntax error or invalid attribute, but the policy is syntactically valid and the denial is expected behavior. Option C is wrong because the policy does not include a network location condition (e.g., 'sourceIp' or 'internalNetwork' attribute); the only condition is time-based, so being on the internal network would not grant access during weekends.

392

MCQhard

A data analyst discovers that a dataset contains multiple records for the same customer with different spellings (e.g., 'Jon' vs 'John'). Which data preparation step should be applied first?

A.Merge all records into one per customer.

B.Remove duplicates based on exact match.

C.Standardize text fields using a lookup table.

D.Flag records for manual review.

AnswerC

Standardization harmonizes variations like 'Jon' and 'John'.

Why this answer

Option C is correct because the first step when dealing with inconsistent text values (like 'Jon' vs 'John') is to standardize the data using a lookup table or reference mapping. This ensures that all variations are normalized to a canonical form before any merging or deduplication is attempted, preventing data loss and preserving referential integrity.
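A lookup-table standardization step could be sketched like this. The mapping and sample values are hypothetical; in practice the table would come from a curated reference source, since a variant such as 'Jon' can also be a legitimate name.

```python
# Hypothetical lookup table mapping known variants to a canonical spelling.
CANONICAL = {"jon": "John", "jonh": "John", "john": "John"}

def standardize(name: str) -> str:
    """Trim whitespace, then map known variants to their canonical form."""
    key = name.strip().lower()
    return CANONICAL.get(key, name.strip())

records = ["Jon", " john ", "John", "Maria"]
cleaned = [standardize(r) for r in records]
print(cleaned)               # ['John', 'John', 'John', 'Maria']
print(sorted(set(cleaned)))  # ['John', 'Maria']
```

Once the variants share one spelling, an exact-match deduplication pass can treat them as the same customer.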

Exam trap

The trap here is that candidates often jump to 'remove duplicates' (Option B) because they think of exact-match deduplication, but the question specifically tests the understanding that data quality issues like inconsistent spellings must be resolved through standardization before any deduplication logic can be applied.

How to eliminate wrong answers

Option A is wrong because merging records before standardizing spellings would combine data based on non-uniform keys, likely creating erroneous composite records or losing the ability to correctly identify which records belong to the same customer. Option B is wrong because removing duplicates based on exact match would treat 'Jon' and 'John' as different records, failing to identify them as the same customer and leaving the inconsistency unresolved. Option D is wrong because flagging records for manual review is a downstream action that should only be taken after automated standardization has been attempted; skipping standardization first would result in an unnecessarily large and inefficient manual review workload.

393

Multi-Selectmedium

Which THREE of the following are common characteristics of unstructured data?

Select 3 answers

A.Easily queried using SQL

B.Often stored in NoSQL databases or data lakes

C.Can include text, images, and video

D.Stored in relational tables

E.Lacks a predefined schema

AnswersB, C, E

NoSQL and data lakes are designed to store unstructured data at scale.

Why this answer

Options B, C, and E are correct. Unstructured data lacks a predefined schema, can include various media types, and is often stored in NoSQL databases or data lakes. A is wrong because SQL queries are not designed for unstructured data.

D is wrong because unstructured data is not stored in relational tables.

394

MCQeasy

A data analyst is using SQL to extract data. The analyst wants to retrieve all records from a table named 'sales' where the 'amount' column is greater than 100. Which SQL clause should be used?

A.WHERE

B.ORDER BY

C.GROUP BY

D.HAVING

AnswerA

WHERE clause filters rows based on a condition.

Why this answer

The WHERE clause in SQL is used to filter records based on a specified condition, such as 'amount > 100'. It is applied directly to the rows in the 'sales' table before any grouping or ordering, making it the correct choice for retrieving only records where the amount exceeds 100.
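The row-level filter can be tried against an in-memory SQLite database. The `sales` table and `amount` column come from the question; the `id` and `region` columns and the sample rows are invented for illustration, and the second query is included only to contrast WHERE with HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "East", 50), (2, "East", 150), (3, "West", 300)])

# WHERE filters individual rows before any grouping.
rows = conn.execute(
    "SELECT id, amount FROM sales WHERE amount > 100 ORDER BY id"
).fetchall()
print(rows)  # [(2, 150.0), (3, 300.0)]

# HAVING filters groups after GROUP BY, using an aggregate.
groups = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region HAVING SUM(amount) > 250 ORDER BY region"
).fetchall()
print(groups)  # [('West', 300.0)]
```

The first query drops the single row at or below 100; the second drops an entire region whose total is too small.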

Exam trap

The trap here is that candidates often confuse HAVING with WHERE, thinking both can filter rows, but HAVING is only valid after GROUP BY and for aggregate conditions, while WHERE filters individual rows before any grouping.

How to eliminate wrong answers

Option B (ORDER BY) is wrong because it is used to sort the result set by one or more columns, not to filter rows based on a condition. Option C (GROUP BY) is wrong because it groups rows that have the same values in specified columns into summary rows, often for use with aggregate functions, and does not filter individual records. Option D (HAVING) is wrong because it is used to filter groups after the GROUP BY clause has been applied, typically with aggregate functions, and cannot be used to filter individual rows before grouping.

395

MCQmedium

A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?

A.Box plot

B.Scatter plot

C.Line chart

D.Histogram

AnswerD

Histograms display the frequency distribution of a single numeric variable across bins.

Why this answer

A histogram is the most appropriate chart for visualizing the distribution of a single discrete variable, such as satisfaction ratings on a 1-5 scale. It groups the responses into bins (each rating value) and displays the frequency of each bin using bars, clearly showing the shape, central tendency, and spread of the data.

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, but the key distinction is that a histogram is used for ordered numeric data (binned continuous values or, as here, an ordered rating scale) where bin order matters, while a bar chart is for categorical (nominal) data with no inherent order.

How to eliminate wrong answers

Option A is wrong because a box plot summarizes data using five-number statistics (min, Q1, median, Q3, max) and is better for comparing distributions across groups, not for showing the detailed frequency distribution of a single ordinal variable. Option B is wrong because a scatter plot is used to visualize the relationship between two continuous variables, not the distribution of a single categorical or ordinal variable. Option C is wrong because a line chart is typically used to display trends over time or sequential data, not the frequency distribution of discrete survey responses.

396

MCQeasy

Which type of chart is most suitable for showing the distribution of a single continuous variable?

A.Histogram

B.Line chart

C.Scatter plot

D.Bar chart

AnswerA

Histograms display the frequency distribution of a continuous variable.

Why this answer

A histogram is the most suitable chart for showing the distribution of a single continuous variable because it groups the data into bins (intervals) along the x-axis and displays the frequency or count of observations within each bin on the y-axis. This allows you to see the shape, spread, and central tendency of the data, such as whether it is normally distributed, skewed, or has multiple modes. For a single continuous variable, a histogram directly visualizes the probability density or frequency distribution, which is its primary purpose in data analysis.
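The binning step behind a histogram can be sketched with the standard library alone. The values and the bin width of 10 are made up for illustration; plotting libraries choose bins with their own rules.

```python
from collections import Counter

# Hypothetical continuous measurements (for example, order values).
values = [3.2, 7.8, 12.5, 14.1, 15.0, 18.9, 21.3, 22.7, 29.4]
bin_width = 10

# Assign each value to the left edge of its bin: [0,10), [10,20), [20,30).
counts = Counter(int(v // bin_width) * bin_width for v in values)

# Print a text histogram, one row per bin in ascending order.
for left in sorted(counts):
    print(f"[{left:>2}, {left + bin_width:>2}) {'#' * counts[left]}")
```

Each row's bar length is the count of observations in that interval, which is exactly what the y-axis of a histogram shows.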

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, assuming both are interchangeable for showing frequencies, but a histogram is specifically for continuous data with no gaps between bars, while a bar chart is for discrete categories with gaps.

How to eliminate wrong answers

Option B is wrong because a line chart is designed to show trends over time or ordered categories by connecting data points with lines, not to display the distribution of a single continuous variable. Option C is wrong because a scatter plot is used to show the relationship between two continuous variables by plotting points on a Cartesian plane, not to visualize the distribution of a single variable. Option D is wrong because a bar chart is used for categorical data, where each bar represents a distinct category, and it does not handle continuous variables properly since it would require arbitrary binning without the statistical properties of a histogram.

397

MCQhard

An organization has multiple systems that store customer information inconsistently. To create a single authoritative view of customer data, they implement a process that identifies and merges duplicate records. This is an example of which data management discipline?

A.Data governance

B.Data warehousing

C.Data quality

D.Master Data Management (MDM)

AnswerD

MDM focuses on creating a single, consistent view of master data entities like customers.

Why this answer

Option D is correct because Master Data Management (MDM) creates a golden record by merging duplicates. Option A is wrong because data governance provides oversight but does not directly merge records. Option B is wrong because data warehousing consolidates data for reporting.

Option C is wrong because data quality is a component but not the specific discipline for master data.

398

Multi-Selecteasy

Which TWO of the following are examples of data transformation? (Choose TWO.)

Select 2 answers

A.Normalizing data to eliminate redundancy

B.Creating a backup of the database

C.Converting string dates to date format

D.Generating summary statistics

E.Removing duplicate records

AnswersA, C

Normalization is a transformation.

Why this answer

Option A is correct because data normalization is a transformation process that reorganizes data to reduce redundancy and improve integrity, typically by decomposing tables into smaller, related tables (e.g., achieving 3NF in relational databases). This changes the structure and representation of the data, which is a core example of data transformation.
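Option C, the other correct answer, is also a change of representation: a string becomes a typed date. A minimal sketch, assuming the source strings use a US-style MM/DD/YYYY layout (the question does not specify a format):

```python
from datetime import datetime, date

def parse_date(text: str, fmt: str = "%m/%d/%Y") -> date:
    """Convert a string such as '03/15/2024' to a date object."""
    return datetime.strptime(text, fmt).date()

# Hypothetical raw values as they might arrive in a text column.
raw = ["01/31/2024", "03/15/2024"]
parsed = [parse_date(s) for s in raw]
print(parsed[1] - parsed[0])  # 44 days, 0:00:00
```

Once the values are real dates, arithmetic and sorting behave correctly, which is not true of the original strings.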

Exam trap

CompTIA often tests the distinction between data transformation (changing format/structure) and data cleansing (removing errors/duplicates) or data analysis (generating summaries), leading candidates to mistakenly select removal of duplicates or summary statistics as transformations.

399

MCQeasy

A data analyst is presenting findings to a non-technical audience. Which data visualization technique is most appropriate?

A.Box plot

B.Heat map

C.Scatter plot

D.Bar chart

AnswerD

Bar charts are straightforward and show comparisons clearly, making them suitable for non-technical audiences.

Why this answer

Bar charts are the most appropriate visualization for a non-technical audience because they present categorical comparisons in a simple, intuitive format that requires no statistical background to interpret. Unlike box plots, heat maps, or scatter plots, bar charts clearly show relative magnitudes with minimal cognitive load, making them ideal for communicating data insights to stakeholders who may not be familiar with data analysis conventions.

Exam trap

The trap here is that candidates often choose scatter plots or box plots because they seem more 'analytical,' but the question specifically targets audience comprehension, not analytical depth. CompTIA tests whether you prioritize clarity over complexity when the audience lacks technical background.

How to eliminate wrong answers

Option A is wrong because box plots display data distribution through quartiles and outliers, which require understanding of statistical concepts like median, interquartile range, and whiskers — concepts that are not intuitive for a non-technical audience. Option B is wrong because heat maps use color gradients to represent data density or correlation, which can be ambiguous without a legend and are better suited for identifying patterns in large matrices rather than straightforward comparisons. Option C is wrong because scatter plots show relationships between two continuous variables and rely on the audience's ability to interpret correlation or clustering, which is not immediately accessible to non-technical viewers.

Full explanation →

400

MCQhard

A data analyst discovers a data discrepancy in a report that has already been distributed to executives. The discrepancy is due to a known bug that will be fixed next month. What is the most appropriate action?

A.Send a corrected report immediately with an apology

B.Wait for the next scheduled update

C.Ignore it since it will be fixed soon

D.Inform the executives of the error and its impact

AnswerD

Transparency builds trust and allows informed decision-making.

Why this answer

Option D is correct because informing executives of the error and its impact maintains transparency and trust. Option C (ignore it) is unethical. Option A (send a corrected report immediately) may cause confusion without context.

Option B (wait for the next scheduled update) delays communication.

Full explanation →

401

MCQmedium

A data analyst is working with a dataset that contains customer names and addresses. Some records have missing state codes. Which data quality issue is this?

A.Duplication

B.Incompleteness

C.Outliers

D.Inconsistency

AnswerB

Missing state codes make the record incomplete.

Why this answer

Incompleteness is the correct answer because missing state codes in customer address records represent a lack of required data. This is a classic example of incomplete data, where fields that should contain values are left null or blank, reducing the dataset's usability for analysis.
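A quick way to quantify incompleteness is to count how many records lack a value in a required field. The sketch below assumes records are plain dictionaries and treats a missing key, None, and blank text as missing; the field names and sample rows are hypothetical.

```python
def missing_rate(records, field):
    """Share of records where `field` is absent, None, or blank."""
    if not records:
        raise ValueError("records must not be empty")
    missing = sum(1 for r in records if not (r.get(field) or "").strip())
    return missing / len(records)


# Hypothetical customer rows; three of the four have no usable state code.
customers = [
    {"name": "Ana", "state": "TX"},
    {"name": "Ben", "state": ""},
    {"name": "Cy", "state": None},
    {"name": "Di"},
]

print(missing_rate(customers, "state"))  # 0.75
```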

Exam trap

The trap here is that candidates may confuse incompleteness with inconsistency, but incompleteness is about missing data (nulls), while inconsistency is about contradictory data across records.

How to eliminate wrong answers

Option A is wrong because duplication refers to duplicate records (e.g., same customer appearing multiple times), not missing values. Option C is wrong because outliers are data points that deviate significantly from the norm (e.g., an unusually high age), not absent data. Option D is wrong because inconsistency involves contradictory or conflicting data (e.g., same customer with different state codes in different records), not missing values.

Full explanation →

402

MCQmedium

A telecommunications company is experiencing issues with its customer satisfaction survey data. The data is collected from multiple channels: phone, email, and web forms. Each channel uses a different scale for ratings: phone uses 1-10, email uses 1-5, and web uses 1-7. Additionally, some survey responses contain missing values for demographic fields. The data analyst needs to calculate an overall satisfaction score that is comparable across all channels. The company's leadership wants a single metric that minimizes distortion from the different scales. Which approach should the analyst use to standardize the ratings?

A.Normalize each rating to a 0-100 scale using min-max normalization.

B.Calculate the average rating separately for each channel and then compare the averages.

C.Convert all ratings to a binary metric of satisfied (above midpoint) or unsatisfied.

D.Convert all ratings to a 1-10 scale by multiplying email ratings by 2 and web by 1.43.

AnswerA

Correct: Min-max normalization maps each scale to a common range, preserving relative differences.

Why this answer

Option A is correct because min-max normalization rescales each rating to a common 0-100 range using the formula (x - min) / (max - min) * 100. This preserves the relative distribution of responses within each channel while eliminating the effect of different scale lengths, making the scores directly comparable. It minimizes distortion better than simple multiplication or binary conversion, as it accounts for the full range of each original scale.
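The formula above can be sketched as follows. This version assumes min and max are the endpoints of each channel's rating scale (1-10, 1-5, 1-7) rather than the observed minimum and maximum; the function name and sample responses are illustrative only.

```python
def min_max_to_100(rating: float, scale_min: float, scale_max: float) -> float:
    """Rescale a rating to 0-100 with (x - min) / (max - min) * 100."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    return (rating - scale_min) / (scale_max - scale_min) * 100


# Scale endpoints per channel, taken from the question.
SCALES = {"phone": (1, 10), "email": (1, 5), "web": (1, 7)}

# Hypothetical responses: (channel, raw rating).
responses = [("phone", 10), ("email", 3), ("web", 1)]
scores = [min_max_to_100(rating, *SCALES[channel]) for channel, rating in responses]

print(scores)  # [100.0, 50.0, 0.0]
```

Note that the midpoint of the email scale (3) lands at 50, and the lowest web rating (1) lands at 0, which a plain multiplication factor would not achieve.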

Exam trap

The trap here is that candidates may think simple multiplication (Option D) is sufficient for scale conversion, but CompTIA tests the understanding that linear scaling without considering the full range and distribution can introduce distortion, whereas min-max normalization is the proper technique for creating a comparable metric across different scales.

How to eliminate wrong answers

Option B is wrong because calculating separate averages per channel does not standardize the ratings; it only produces channel-specific means that remain on different scales, making direct comparison invalid. Option C is wrong because converting to a binary satisfied/unsatisfied metric discards granularity and loses information about the degree of satisfaction, which can distort the overall score and reduce statistical power. Option D is wrong because multiplying email ratings by 2 and web by 1.43 assumes linear proportionality between scales, which is arbitrary and does not account for differences in distribution shape or endpoints, potentially introducing systematic bias.

Full explanation →

403

MCQeasy

A retail company wants to analyze customer purchase patterns to identify products frequently bought together. Which data mining technique is most appropriate?

A.Classification

B.Clustering

C.Regression

D.Association rules

AnswerD

Association rules identify frequent itemsets and co-occurrence patterns.

Why this answer

Association rules (market basket analysis) discover co-occurrence relationships. Regression predicts numeric values, clustering groups similar items, classification assigns categories.
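To make the idea of co-occurrence concrete, here is a minimal sketch that computes support and confidence for item pairs by direct counting. The baskets are invented for illustration, and real market basket analysis would normally use a dedicated algorithm such as Apriori rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(pair):
    """Fraction of baskets that contain both items (pair given in sorted order)."""
    return pair_counts[pair] / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]


print(support(("bread", "butter")))   # 0.5
print(confidence("butter", "bread"))  # 1.0
```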

Full explanation →

404

Multi-Selectmedium

Which THREE of the following are common steps in data cleaning?

Select 3 answers

A.Removing outliers without justification

B.Imputing missing values

C.Standardizing data formats

D.Removing duplicate records

E.Increasing sample size

AnswersB, C, D

Missing values are often imputed to maintain dataset completeness.

Why this answer

Imputing missing values is a common data cleaning step because real-world datasets often have gaps due to data collection errors or system failures. Techniques like mean/median imputation, regression imputation, or using algorithms like k-NN help preserve sample size and avoid bias that would result from simply dropping rows. This ensures the dataset remains usable for analysis without introducing significant distortion.
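Median imputation, one of the techniques mentioned above, can be sketched in a few lines. The example assumes missing values are represented as None and uses made-up ages; whether the median is an appropriate fill value depends on the dataset.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute when every value is missing")
    fill = median(observed)
    return [fill if v is None else v for v in values]


ages = [34, None, 29, 41, None, 38]  # hypothetical column with two gaps
print(impute_median(ages))  # [34, 36.0, 29, 41, 36.0, 38]
```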

Exam trap

CompTIA often tests the distinction between data cleaning steps and data collection or preprocessing steps, so the trap here is confusing 'increasing sample size' (a data augmentation or collection activity) with actual cleaning tasks like imputation, standardization, and deduplication.

Full explanation →

405

MCQhard

A data team is preparing a dashboard for executives. The team wants to highlight key performance indicators (KPIs) that are below target. Which of the following visualization techniques would most effectively draw attention to underperforming metrics without causing confusion?

A.Remove underperforming KPIs from the dashboard to avoid confusion.

B.Use a scatter plot to show the relationship between KPIs.

C.Apply conditional formatting to turn KPI values red when below target.

D.Use a pie chart showing the proportion of each KPI.

AnswerC

Red highlights call attention to issues immediately.

Why this answer

Conditional formatting that turns KPI values red when below target is the most effective technique because it leverages pre-attentive visual processing — the human eye naturally notices color changes (especially red) before other visual elements. This allows executives to instantly identify underperforming metrics without needing to interpret complex chart types, reducing cognitive load and confusion.

Exam trap

CompTIA often tests the misconception that removing or hiding negative data simplifies communication, when in fact effective dashboards must highlight both successes and failures to drive data-driven decisions.

How to eliminate wrong answers

Option A is wrong because removing underperforming KPIs from the dashboard hides critical negative information, violating the principle of transparency in data communication and preventing executives from taking corrective action. Option B is wrong because a scatter plot is designed to show relationships or correlations between two variables, not to highlight individual KPI performance against a target; it would introduce unnecessary complexity and fail to draw immediate attention to underperforming metrics. Option D is wrong because a pie chart shows proportions of a whole, which is irrelevant for comparing KPIs against targets; it cannot effectively indicate which metrics are below target and often leads to misinterpretation due to difficulty in comparing angles and areas.

Full explanation →

406

MCQhard

A data analyst is building a dashboard that will be used by both executives and operational managers. The executives need high-level summaries, while managers need granular details. Which dashboard design principle should the analyst apply?

A.Use a single chart that shows both summary and detail simultaneously.

B.Display all available data on one page for transparency.

C.Design the dashboard with drill-down capabilities from summary to detail.

D.Create two separate dashboards for each audience.

AnswerC

Drill-down allows executives to see overview and managers to access details on demand.

Why this answer

Option C is correct because drill-down capabilities allow users to start with a high-level summary (e.g., total revenue by region) and then interactively navigate to granular details (e.g., individual transactions) without overwhelming either audience. This design principle supports both executive and operational manager needs within a single dashboard, maintaining clarity and performance by loading only the required level of detail on demand.

Exam trap

The trap here is that candidates often confuse 'drill-down' with 'separate dashboards' (Option D), thinking that different audiences require entirely different views, when in fact a single interactive dashboard with drill-down capabilities is the standard best practice for serving both summary and detail needs efficiently.

How to eliminate wrong answers

Option A is wrong because a single chart attempting to show both summary and detail simultaneously would likely become cluttered, violate data-ink ratio principles, and fail to provide clear high-level insights for executives or actionable granularity for managers. Option B is wrong because displaying all available data on one page creates information overload, reduces readability, and violates the principle of progressive disclosure, making it difficult for both audiences to quickly find relevant insights. Option D is wrong because creating two separate dashboards introduces maintenance overhead, potential data inconsistency, and forces users to switch contexts, whereas a single dashboard with drill-down capabilities is more efficient and user-friendly.

Full explanation →

407

Multi-Selectmedium

Which TWO of the following are examples of quantitative data? (Choose TWO.)

Select 2 answers

A.Product color

B.Age in years

C.Customer satisfaction rating (Poor, Fair, Good)

D.Country of origin

E.Shoe size

AnswersB, E

Age is a numeric, quantitative variable.

Why this answer

Option B is correct because age in years is a numerical measurement that can be counted or measured on a ratio scale, making it quantitative data. Quantitative data represents quantities that can be expressed numerically and subjected to mathematical operations, such as calculating the average age of a group.

Exam trap

The trap here is that candidates often confuse ordinal data (like customer satisfaction ratings) with quantitative data because the categories have an order, but they are still qualitative since the values are not numeric measurements.

Full explanation →

408

Matchingmedium

Match each data security concept to its description.

Concepts

Matches

Convert data into code to prevent unauthorized access

Replace sensitive data with non-sensitive placeholder

Hide parts of data (e.g., show only last 4 digits)

Restrict data access based on user permissions

Log of all data access and modifications

Why these pairings

Data security is critical for protecting sensitive information.

Full explanation →

409

MCQhard

A company is designing a data lake to store raw sensor data from IoT devices. The data arrives as JSON objects with varying schemas. Which storage approach is most appropriate?

A.Ingest into a relational database with a predefined schema

B.Store each JSON object as a separate file in a compressed columnar format

C.Convert all JSON to Avro with a fixed schema before storing

D.Store raw JSON files in a distributed file system and apply schema-on-read

AnswerD

Schema-on-read allows handling varying schemas without upfront transformation.

Why this answer

Option D is correct because a data lake is designed to store raw data in its native format, and IoT sensor data with varying schemas is best handled by storing raw JSON files in a distributed file system (e.g., HDFS or Amazon S3). This approach leverages schema-on-read, where the schema is applied at query time rather than at write time, allowing flexibility for heterogeneous JSON objects without data loss or transformation overhead.
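Schema-on-read can be illustrated with a small sketch: the raw JSON lines are kept exactly as they arrived, and the reader decides which fields it wants at query time. The device records and field names below are hypothetical, and an in-memory list stands in for files on a distributed file system.

```python
import json

# Raw records stored as-is; note that the two devices report different fields.
RAW_LINES = [
    '{"device": "a1", "temp_c": 21.5}',
    '{"device": "b2", "humidity": 40, "battery": 0.9}',
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: keep the requested fields, None if absent."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


rows = list(read_with_schema(RAW_LINES, ["device", "temp_c"]))
print(rows)
# [{'device': 'a1', 'temp_c': 21.5}, {'device': 'b2', 'temp_c': None}]
```

A different query could request humidity and battery from the same stored lines without any rewrite of the data, which is the flexibility the explanation refers to.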

Exam trap

The trap here is that candidates confuse 'schema-on-read' with 'schema-on-write' and assume that converting to a structured format like Avro or columnar storage is always better for performance, ignoring the requirement to store raw, varying-schema data as-is.

How to eliminate wrong answers

Option A is wrong because relational databases require a predefined schema and enforce ACID constraints, which cannot accommodate JSON objects with varying schemas without costly schema migrations or data loss. Option B is wrong because storing each JSON object as a separate file in a compressed columnar format (e.g., Parquet or ORC) is inefficient for small, variable-schema records; columnar formats are optimized for analytical queries on large, homogeneous datasets, not for raw ingestion of many small, schema-varying JSON objects. Option C is wrong because converting all JSON to Avro with a fixed schema before storing defeats the purpose of a data lake, which is to preserve raw data; Avro requires a predefined schema at write time, and forcing a fixed schema on varying JSON objects would either lose data or require complex schema evolution management.

Full explanation →

410

MCQhard

A large e-commerce company has a data warehouse that stores daily sales transactions. The data analytics team creates a weekly report for the executive team that includes total revenue, number of orders, and average order value. Recently, the report showed a sudden 20% drop in total revenue compared to the previous week. Upon investigation, the data analyst discovers that a major promotional event ended the previous week, and the current week's data includes a new product launch with lower prices. The analyst needs to present this data to the executives in a way that accurately reflects the business situation. The analyst must also provide recommendations for actions. Which of the following is the best course of action?

A.Remove the new product launch data from the report to show a clearer comparison.

B.Present only the data from the new product launch to show growth potential.

C.Present the data with annotations explaining the promotion end and new product launch, and recommend analyzing customer retention and cross-selling opportunities.

D.Present the drop as-is and recommend cutting costs to compensate for the revenue decline.

AnswerC

This provides full context and actionable insights.

Why this answer

Option C is correct because it adheres to the principle of transparent data communication: the analyst must present the full dataset with contextual annotations (promotion end, new product launch) to avoid misleading executives. This approach enables informed decision-making, and the recommendation to analyze customer retention and cross-selling opportunities directly addresses the underlying business dynamics—understanding whether the revenue drop is temporary or signals a shift in customer behavior. In a data warehouse context, this ensures that the weekly report reflects the true state of the business rather than a manipulated or incomplete view.

Exam trap

The trap here is that candidates may think removing or cherry-picking data (Options A and B) simplifies the story for executives, but the exam tests the ethical and professional standard of presenting complete, contextualized data—even when it shows a negative trend—while offering actionable recommendations based on the full picture.

How to eliminate wrong answers

Option A is wrong because removing the new product launch data constitutes data manipulation that violates the integrity of the report; it would hide a significant business event and prevent executives from understanding the full revenue picture. Option B is wrong because presenting only the new product launch data is a form of selection bias that ignores the overall revenue decline and the impact of the promotion end, leading to an overly optimistic and incomplete analysis. Option D is wrong because presenting the drop as-is without context or analysis fails to provide actionable insights; recommending cost cuts without understanding the root cause (promotion end and new product pricing) could lead to misguided strategic decisions.

Full explanation →

411

MCQmedium

A data engineer needs to store logs from web servers that have varying fields. The logs are in JSON format. Which data type describes this JSON data?

A.Binary data

B.Structured data

C.Semi-structured data

D.Unstructured data

AnswerC

JSON allows schema flexibility with key-value pairs, fitting the semi-structured definition.

Why this answer

JSON data with varying fields is classified as semi-structured data because it has organizational properties (key-value pairs, nested structures) but does not conform to a rigid schema like a relational table. The logs from web servers may have different fields per record, which is a hallmark of semi-structured data, as it allows flexibility while still being self-describing.

Exam trap

The trap here is that candidates confuse 'structured' with any data that has a format, but JSON's lack of a fixed schema and varying fields disqualifies it from being structured data, which requires a rigid, predefined schema like a relational database table.

How to eliminate wrong answers

Option A is wrong because binary data refers to raw bytes or encoded formats (e.g., images, executables) that lack any inherent structure or human-readable format, whereas JSON is text-based and has explicit key-value organization. Option B is wrong because structured data requires a fixed schema with predefined fields and data types (e.g., rows in a SQL table), but JSON logs with varying fields violate this strict schema requirement. Option D is wrong because unstructured data has no predefined format or organization (e.g., plain text, video files), while JSON has a defined syntax with keys, values, and nesting, providing a clear structure.

Full explanation →

412

MCQeasy

A data analyst needs to present the distribution of customer ages to a non-technical audience. Which type of chart would be most appropriate?

A.Scatter plot

B.Histogram

C.Pie chart

D.Line chart

AnswerB

Histograms show distribution of continuous data.

Why this answer

A histogram is the most appropriate chart for displaying the distribution of a single continuous variable, such as customer ages, to a non-technical audience. It groups ages into bins and shows the frequency of customers within each bin, making the shape, center, and spread of the distribution immediately visible. This aligns with the DA0-001 objective of selecting visualizations that clearly communicate data insights to stakeholders.
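The binning step behind a histogram can be sketched without a plotting library. The example below assumes fixed 10-year bins and invented ages, and prints a simple text histogram; bins with no customers are simply omitted in this version.

```python
from collections import Counter


def bin_ages(ages, width=10):
    """Count ages per fixed-width bin, e.g. 20-29, 30-39."""
    counts = Counter((age // width) * width for age in ages)
    return {
        f"{start}-{start + width - 1}": counts[start] for start in sorted(counts)
    }


ages = [23, 27, 31, 35, 38, 42, 45, 61]  # hypothetical customer ages
for label, n in bin_ages(ages).items():
    print(f"{label:>6} | {'#' * n}")
```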

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, thinking a bar chart can show distribution, but a bar chart is for categorical data, not continuous numeric data grouped into bins.

How to eliminate wrong answers

Option A is wrong because a scatter plot is used to show the relationship between two continuous variables, not the distribution of a single variable like age. Option C is wrong because a pie chart is designed for parts-of-a-whole categorical data, not for displaying the frequency distribution of a continuous numeric variable. Option D is wrong because a line chart is typically used to show trends over time or ordered categories, not the distribution of a single continuous variable.

Full explanation →

413

MCQhard

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?

A.Class imbalance

B.Non‑linear decision boundary

C.Multicollinearity among predictor variables

D.Overfitting due to too many features

AnswerA

With only 500 churners out of 10,000, the model predicts most as non-churners, achieving high accuracy but low precision for the minority class.

Why this answer

The dataset has only 500 churners out of 10,000 records (5% churn rate), which is a classic class imbalance. Logistic regression can achieve high accuracy by simply predicting the majority class (non-churn) for all records, yielding 95% accuracy even without learning anything about churn. The very low precision (15%) for the churn class indicates that most of the positive predictions are false positives, a direct consequence of the model being biased toward the majority class due to imbalance.
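The majority-class effect described above is easy to reproduce. This sketch builds labels with the same 500-in-10,000 split and scores a trivial "model" that always predicts non-churn; the helper functions are written out by hand for illustration rather than taken from a library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model identified."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(
        predictions_for_positives
    )


# 1 = churner, 0 = non-churner, matching the 5% churn rate in the question.
y_true = [1] * 500 + [0] * 9500
y_pred = [0] * 10000  # always predict the majority class

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The baseline scores 95% accuracy while finding no churners at all, which is why minority-class metrics matter more than accuracy on imbalanced data.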

Exam trap

CompTIA often tests the misconception that high accuracy always means a good model, hiding the fact that with imbalanced data, accuracy is misleading and metrics like precision, recall, or F1-score for the minority class are critical.

How to eliminate wrong answers

Option B is wrong because logistic regression inherently models a linear decision boundary; while non-linear boundaries can be approximated with feature engineering (e.g., polynomial terms), the core issue here is class imbalance, not boundary shape. Option C is wrong because multicollinearity inflates coefficient standard errors but does not cause the extreme precision drop seen here; it affects interpretability, not the fundamental accuracy-imbalance trade-off. Option D is wrong because overfitting would typically yield high training accuracy but poor generalization, not a specific low precision for the minority class while maintaining high overall accuracy; the model is actually underfitting the minority class.

Full explanation →

414

MCQeasy

A retail analyst wants to visualize monthly sales over the past year to identify seasonal patterns. Which chart type is best?

A.Pie chart

B.Histogram

C.Line chart

D.Box plot

AnswerC

Line chart clearly shows trends and seasonal patterns over time.

Why this answer

A line chart is best for visualizing monthly sales over time because it explicitly shows trends and seasonal patterns across a continuous time axis. The sequential ordering of months allows the line to connect data points, making it easy to identify peaks, troughs, and recurring cycles. This aligns with the DA0-001 objective of selecting appropriate visualizations for time-series data.

Exam trap

The trap here is that candidates may pick a histogram because it also plots numeric data, but a histogram bins values without regard to their order in time, while a line chart uses a sequential axis such as months to show trends.

How to eliminate wrong answers

Option A is wrong because a pie chart is designed to show parts of a whole at a single point in time, not trends across multiple time periods; it cannot convey seasonal patterns or month-over-month changes. Option B is wrong because a histogram displays the distribution of a continuous variable by binning values, but it does not preserve the temporal order of months, making it impossible to identify seasonal patterns. Option D is wrong because a box plot summarizes data distribution (median, quartiles, outliers) for one or more groups, but it lacks a time axis and cannot show the sequential progression of sales across months.

Full explanation →

415

Multi-Selecteasy

Which TWO of the following are valid SQL clauses used to filter and sort data?

Select 2 answers

A.DELETE

B.WHERE

C.ORDER BY

D.UPDATE

E.INSERT

AnswersB, C

WHERE filters rows based on conditions.

Why this answer

The WHERE clause is used to filter rows based on specified conditions, while the ORDER BY clause sorts the result set in ascending or descending order. Both are standard SQL clauses for data filtering and sorting, respectively.
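The two clauses can be seen together in a small runnable sketch using Python's built-in sqlite3 module. The orders table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "East", 120.0), (2, "West", 80.0), (3, "East", 45.5), (4, "East", 300.0)],
)

# WHERE keeps only East-region rows; ORDER BY sorts them by total, largest first.
rows = conn.execute(
    "SELECT id, total FROM orders WHERE region = ? ORDER BY total DESC",
    ("East",),
).fetchall()
conn.close()

print(rows)  # [(4, 300.0), (1, 120.0), (3, 45.5)]
```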

Exam trap

CompTIA often tests the distinction between SQL DML statements (DELETE, UPDATE, INSERT) and query clauses (WHERE, ORDER BY), trapping candidates who confuse data manipulation commands with data retrieval or sorting operations.

Full explanation →

416

Multi-Selecthard

Which THREE factors should be considered when choosing a chart type for a dataset?

Select 3 answers

A.The animation capabilities of the software

B.The data types (categorical, numerical, time series)

C.The number of variables to display

D.The key insight or message to convey

E.The color scheme of the company logo

AnswersB, C, D

Data type determines suitable chart types.

Why this answer

Option B is correct because the data type (categorical, numerical, time series) determines which chart types are semantically valid. For example, a line chart requires a continuous numerical or time-series axis, while a bar chart works with categorical data. Choosing a chart that mismatches the data type can misrepresent the underlying distribution or trends.

Exam trap

The trap here is that candidates often confuse aesthetic or software-specific features (like animation or branding) with the fundamental data characteristics that dictate chart appropriateness, leading them to select options that are about polish rather than analytical correctness.

Full explanation →

417

MCQmedium

A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?

A.K-means clustering

B.K-nearest neighbors

C.Linear regression

D.Decision tree

AnswerA

K-means is an unsupervised clustering algorithm suitable for segmentation.

Why this answer

K-means clustering is the correct choice because it is an unsupervised learning algorithm that groups unlabeled data into clusters based on feature similarity. Since the marketing team has no prior labels for customer segments, K-means can partition customers by purchasing behavior patterns, such as frequency and monetary value, without needing predefined categories.
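For intuition, here is a minimal sketch of the K-means loop (Lloyd's algorithm) in plain Python. The customer points, the choice of two clusters, and the fixed starting centroids are all assumptions made to keep the example deterministic; in practice features would usually be scaled first and a library implementation used.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Repeat: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (purchases per month, average spend). No labels are given.
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])

print(clusters[0])  # [(1, 20), (2, 25), (1, 30)]
print(clusters[1])  # [(9, 200), (10, 220), (8, 210)]
```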

Exam trap

The trap here is that candidates often confuse unsupervised clustering (K-means) with supervised classification (K-nearest neighbors) because both involve 'K' and grouping, but KNN requires labeled data and predicts labels, while K-means discovers inherent structures without labels.

How to eliminate wrong answers

Option B is wrong because K-nearest neighbors is a supervised learning algorithm that requires labeled training data to classify or predict outcomes, making it unsuitable for unlabeled segmentation. Option C is wrong because linear regression is a supervised regression algorithm used to predict a continuous target variable, not to discover hidden groupings in unlabeled data. Option D is wrong because decision trees are typically used for supervised classification or regression tasks, relying on labeled data to split on features, and cannot perform unsupervised clustering without prior labels.

Full explanation →

418

MCQhard

A data scientist creates a box plot of employee salaries and notices many outliers above the upper whisker. What action should be taken to best understand the salary distribution?

A.Replace the box plot with a histogram of the salaries

B.Remove all outliers to create a more typical box plot

C.Trim the top 5% of salaries and recreate the box plot

D.Investigate the outliers to determine if they are data entry errors or valid extremes

AnswerD

Outliers should be investigated to ensure data quality and contextual understanding.

Why this answer

Option D is correct because investigating outliers may reveal legitimate high salaries or data errors. Option B is wrong because removing outliers without cause can bias results. Option C is wrong because trimming changes the distribution.

Option A is wrong because a histogram may not highlight outliers clearly.
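Before investigating, the analyst first needs a list of which values fall beyond the whiskers. The sketch below flags them with the common 1.5 × IQR rule, which is an assumption here since the question does not say how the whiskers were drawn; the salary figures (in thousands) are invented.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] for manual review."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


salaries = [42, 45, 47, 50, 52, 55, 58, 60, 250]  # hypothetical, in thousands
print(flag_outliers(salaries))  # [250]
```

The function only flags values; deciding whether 250 is an executive salary or a typing error still requires checking the source record.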

Full explanation →

419

MCQhard

A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?

A.Ridge regression

B.Forward selection

C.Principal Component Analysis (PCA)

D.Lasso regression

AnswerC

PCA reduces dimensionality by creating new features that capture maximum variance.

Why this answer

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms the original features into a set of orthogonal components, ordered by the variance they capture. Given 1000 features and only 500 samples, PCA is ideal because it reduces the feature space while retaining the maximum variance, helping to avoid overfitting and the curse of dimensionality.

Exam trap

CompTIA often tests the distinction between supervised feature selection (Lasso, Forward selection) and unsupervised dimensionality reduction (PCA), trapping candidates who confuse regularization with variance-based reduction.

How to eliminate wrong answers

Option A is wrong because Ridge regression is a regularization technique that shrinks coefficients but does not reduce the number of features; it retains all features with penalized weights. Option B is wrong because Forward selection is a supervised feature selection method that selects features based on their predictive power, not on variance retention, and it can be computationally expensive with 1000 features. Option D is wrong because Lasso regression performs feature selection by shrinking some coefficients to zero, but it is a supervised method that selects features based on target correlation, not on maximizing variance retention, and may not be optimal for unsupervised dimensionality reduction.

Full explanation →

420

MCQmedium

The exhibit shows a JSON configuration for a line chart that displays average response time over time. However, the chart shows jagged lines with many spikes. Which configuration change would smooth the visualization?

A.Change "aggregation" to "average"

B.Change the query to include a moving average

C.Change "interpolation" to "monotone"

D.Change chartType to "bar"

AnswerC

Monotone interpolation produces a smoother curve through points.

Why this answer

Setting "interpolation" to "monotone" (or "smooth") draws a smooth curve through the existing points, which reduces jaggedness without changing the underlying data. The aggregation is already applied in the query, so changing it to "average" in the widget would double-aggregate. Changing the chart type or the data source is not necessary.

421

MCQeasy

A company must comply with GDPR when reporting customer data. What must be included in the report?

A.Customer names and emails

B.The source of the data

C.A notice about data processing purposes

D.Data retention period

AnswerC

GDPR mandates that data subjects be informed of the purposes of processing, especially when data is shared.

Why this answer

Under GDPR, when reporting customer data, the controller must provide a notice about the purposes of data processing to ensure transparency and lawful basis for processing. This is required by Article 13 of the GDPR, which mandates that data subjects be informed of the specific purposes for which their personal data is being processed. Including this notice in the report demonstrates compliance with the accountability principle.

Exam trap

The trap here is that candidates often confuse operational data elements (like names, source, or retention) with the mandatory transparency obligations under GDPR, assuming that including any data field satisfies compliance, whereas the core requirement is the notice about processing purposes.

How to eliminate wrong answers

Option A is wrong because customer names and emails are personal data that may be included in the report, but GDPR does not require their inclusion; in fact, pseudonymization or anonymization is often preferred to minimize data exposure. Option B is wrong because while the source of data may be relevant for transparency, GDPR does not mandate including the data source in every report; the focus is on processing purposes and lawful basis. Option D is wrong because the data retention period is a policy requirement under GDPR Article 5(1)(e), but it is not a mandatory element to be included in a customer data report; it is typically documented in the privacy notice or data retention policy, not the report itself.

422

Multi-Selectmedium

A data analyst is troubleshooting a dashboard that displays slowly. Which two of the following are common causes of dashboard performance issues? (Select TWO.)

Select 2 answers

A.Minimal use of filters.

B.Use of summary tables.

C.Large unaggregated datasets.

D.Complex calculated fields with many dependencies.

E.Excessive use of live data connections.

AnswersC, D

Correct. Large unaggregated datasets increase load on the dashboard.

Why this answer

Option C is correct because large unaggregated datasets force the dashboard to process every row of raw data at query time, which dramatically increases load times and memory consumption. Aggregated or pre-summarized tables reduce the volume of data scanned, improving performance.

Exam trap

The trap here is that candidates often confuse 'summary tables' (which improve performance) with 'large unaggregated datasets' (which degrade it), and they may overlook that complex calculated fields with dependencies are a hidden performance killer.
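
The difference between scanning raw rows and reading a summary table can be sketched in a few lines. The row layout and the function name `build_summary` are assumptions for illustration only; BI tools implement this with extracts or materialized aggregates rather than in-memory dictionaries.

```python
from collections import defaultdict

# Hypothetical raw sales rows: (region, order_date, amount).
raw_rows = [
    ("North", "2024-01-01", 120.0),
    ("North", "2024-01-01", 80.0),
    ("South", "2024-01-01", 50.0),
    ("North", "2024-01-02", 200.0),
    ("South", "2024-01-02", 75.0),
    ("South", "2024-01-02", 25.0),
]

def build_summary(rows):
    """Pre-aggregate raw rows into one total per (region, date)."""
    totals = defaultdict(float)
    for region, order_date, amount in rows:
        totals[(region, order_date)] += amount
    return dict(totals)

# The dashboard then reads 4 summary rows instead of 6 raw rows;
# on real data the reduction is typically several orders of magnitude.
summary = build_summary(raw_rows)
```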

423

MCQmedium

A logistics company has data on delivery times (continuous) and distance traveled (continuous). They want to visualize the relationship between these two variables. Which chart type is most appropriate?

A.Histogram

B.Bar chart

C.Line chart

D.Scatter plot

AnswerD

Scatter plots are ideal for visualizing the correlation between two continuous variables.

Why this answer

Option D is correct because scatter plots show the relationship between two continuous variables. Option A (histogram) shows the frequency distribution of a single variable. Option B (bar chart) is for categorical comparisons. Option C (line chart) is for trends over time.

424

Multi-Selecthard

A data team is creating a report for the board of directors. The board wants to see both the big picture and the ability to explore details for specific regions. Which TWO approaches should the team implement?

Select 2 answers

A.Use a single summary metric for all regions.

B.Include an interactive dashboard with filters by region.

C.Embed a table of data for each region in an appendix.

D.Provide a static executive summary with no drill-down.

E.Create a hierarchical report with linked pages.

AnswersB, E

Interactive dashboards allow users to filter and drill into specific regions on demand.

Why this answer

Option B is correct because interactive dashboards with filters allow the board to see high-level summaries while drilling into region-specific details on demand. This approach aligns with modern BI tools like Tableau or Power BI, which support dynamic filtering and cross-filtering without requiring separate report pages.

Exam trap

CompTIA often tests the distinction between static data presentation (tables, summaries) and interactive exploration features (filters, drill-through), trapping candidates who choose static options like an appendix or single metric when the question explicitly asks for both big-picture and detail exploration.

425

MCQhard

A data analyst is presenting a weekly sales report to a diverse audience that includes sales managers, marketing team, and finance. The analyst wants to ensure the data is understood correctly. Which communication strategy is most effective?

A.Use a single complex chart to show all data at once.

B.Create a slide deck with key insights and contextual explanations.

C.Email the raw data table to all stakeholders.

D.Use technical terms such as 'variance' and 'standard deviation'.

AnswerB

Contextual explanations cater to different roles and improve understanding.

Why this answer

Option B is correct because a slide deck with key insights and contextual explanations tailors the presentation to a diverse audience, ensuring that sales managers, marketing, and finance can each grasp the relevant data without being overwhelmed. This approach aligns with best practices for data communication, where visual summaries and narrative context improve comprehension over raw data or overly technical jargon.

Exam trap

The trap here is that candidates may overvalue technical precision (Option D) or data completeness (Option C), forgetting that the primary goal is audience comprehension, not data exhaustiveness.

How to eliminate wrong answers

Option A is wrong because a single complex chart (e.g., a multi-series radar or stacked area chart) can obscure individual trends and overload viewers, leading to misinterpretation across roles with different analytical backgrounds. Option C is wrong because emailing raw data tables forces stakeholders to derive insights themselves, risking errors and inefficiency, especially for non-technical teams like marketing. Option D is wrong because using technical terms like 'variance' and 'standard deviation' without explanation alienates audience members who lack statistical training, violating the principle of audience-appropriate communication.

426

MCQhard

A database table has columns: OrderID (primary key), ProductID, CustomerID, CustomerName, OrderDate, ProductName. All products are purchased only by the customer who placed the order. Which normal form violation exists if CustomerName depends on CustomerID?

A.Boyce-Codd normal form (BCNF)

B.Third normal form (3NF)

C.Second normal form (2NF)

D.First normal form (1NF)

AnswerB

CustomerName depends on CustomerID, which is not a candidate key, creating a transitive dependency and violating 3NF.

Why this answer

The table violates Third Normal Form (3NF) because CustomerName depends on CustomerID, which is not a candidate key (the primary key is OrderID). 3NF requires that every non-key attribute be non-transitively dependent on the primary key; here, CustomerName is transitively dependent on OrderID via CustomerID. Since CustomerID is a non-key attribute (it is not part of the primary key), this transitive dependency breaks 3NF.

Exam trap

The trap here is that candidates often confuse transitive dependencies (3NF violation) with partial dependencies (2NF violation) or think that any dependency on a non-key attribute automatically violates BCNF, but the specific scenario of CustomerName depending on CustomerID is a textbook transitive dependency that breaks 3NF first.

How to eliminate wrong answers

Option A is wrong because Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF that requires every determinant to be a candidate key; while this table also violates BCNF, the question asks which normal form violation exists, and the dependency described is a classic 3NF violation (transitive dependency), not a BCNF-specific one. Option C is wrong because Second Normal Form (2NF) is violated only when a non-key attribute depends on a proper subset of a composite primary key; here the primary key is a single column (OrderID), so no partial dependency exists, and 2NF is satisfied. Option D is wrong because First Normal Form (1NF) is violated only if there are repeating groups or non-atomic values; the table as described has atomic columns and no repeating groups, so 1NF is satisfied.
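
One way to remove the transitive dependency is to move CustomerName into its own table keyed by CustomerID. The sketch below uses an in-memory SQLite database; the table names and sample rows are assumptions, and ProductName (which depends on ProductID in the same way) is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- CustomerName now depends only on the key of its own table.
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        product_id  INTEGER NOT NULL,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme Ltd');
    INSERT INTO orders VALUES (100, 7, 1, '2024-03-01'), (101, 9, 1, '2024-03-05');
""")

# A rename is a single-row update; every order sees it through the join.
conn.execute(
    "UPDATE customers SET customer_name = 'Acme Limited' WHERE customer_id = 1"
)
names = conn.execute("""
    SELECT o.order_id, c.customer_name
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
```

In the original single-table design the same rename would have to touch every order row for that customer, which is the update anomaly that 3NF is meant to prevent.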

427

MCQmedium

A data team is building a predictive model. They have data on 'Number of employees' (whole numbers) and 'Revenue' (currency). Which statement correctly compares these data types?

A.Number of employees is discrete; revenue is continuous

B.Both are continuous data

C.Both are ratio data

D.Number of employees is qualitative; revenue is quantitative

AnswerA

Employees are counted in whole units (discrete), while revenue can have fractional values (continuous).

Why this answer

Option A is correct because 'Number of employees' is a count of distinct entities, making it discrete data (only whole numbers), while 'Revenue' can take any value within a range (including decimals), making it continuous data. Discrete data arises from counting, whereas continuous data arises from measurement.

Exam trap

The trap here is that candidates confuse the measurement scale (ratio) with the data type (discrete vs. continuous), leading them to pick option C even though the question specifically asks about data type classification.

How to eliminate wrong answers

Option B is wrong because 'Number of employees' is not continuous; it is discrete as it can only take integer values (e.g., 10, 11, not 10.5). Option C is wrong because while both are ratio data (they have a true zero point), the question asks about data types (discrete vs. continuous), not measurement scales. Option D is wrong because both 'Number of employees' and 'Revenue' are quantitative (numerical) data, not qualitative (categorical).

428

Drag & Dropmedium

Drag and drop the steps to perform a data backup using the 3-2-1 rule in the correct order.

Why this order

The 3-2-1 rule involves multiple copies, different media, offsite storage, and regular testing.

429

MCQhard

A data modeler is designing a dimensional model for a sales analytics system. The fact table contains sales transactions, and the dimension tables include product, customer, and time. To reduce data redundancy, the modeler normalizes the dimension tables into multiple related tables. Which schema is being implemented?

A.Vault schema

B.Star schema

C.Galaxy schema

D.Snowflake schema

AnswerD

Snowflake schema normalizes dimension tables to reduce redundancy.

Why this answer

The snowflake schema is a dimensional model where dimension tables are normalized into multiple related tables to reduce data redundancy. In this scenario, the product, customer, and time dimensions are split into sub-dimensions (e.g., product category, customer geography, time hierarchy), which is the defining characteristic of a snowflake schema. This contrasts with a star schema where dimensions remain denormalized.

Exam trap

CompTIA often tests the distinction between star and snowflake schemas by emphasizing normalization of dimensions; the trap here is that candidates may confuse 'normalized dimensions' with a star schema, which actually uses denormalized dimensions for simplicity and performance.

How to eliminate wrong answers

Option A is wrong because a vault schema (Data Vault) is a hybrid modeling approach focused on auditability and flexibility using hubs, links, and satellites, not on normalizing dimension tables for a sales analytics fact table. Option B is wrong because a star schema keeps dimension tables denormalized (single table per dimension) to optimize query performance, which directly contradicts the normalization described in the question. Option C is wrong because a galaxy schema (also called a fact constellation) contains multiple fact tables sharing dimension tables, not the normalization of a single fact table’s dimensions.
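
A minimal snowflake layout can be sketched with SQLite: the product dimension is split so that the category lives in its own table. All table names, column names, and rows here are hypothetical and are not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sub-dimension: category is normalized out of the product dimension.
    CREATE TABLE dim_category (
        category_id   INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES (10, 'Router', 1), (11, 'Switch', 1), (12, 'Antivirus', 2);
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 11, 50.0), (3, 12, 30.0);
""")

# Reaching the category name takes two joins instead of the one a star schema needs.
sales_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
```

The extra join is the usual trade-off: less redundancy in the dimensions in exchange for more complex queries.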

430

MCQeasy

The exhibit shows log entries. A data analyst wants to visualize the frequency of each error type over time. Which chart type is most appropriate?

A.Time series line chart

B.Bar chart of error types

C.Scatter plot of timestamp vs error code

D.Pie chart of error types

AnswerA

Shows frequency trend over time.

Why this answer

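A time series line chart plots the number of events per time interval, with one line per error type, which makes it the best fit for showing frequency over time. A bar chart of error types (Option B) shows only static totals. A pie chart of error types (Option D) shows each type's share and ignores time. A scatter plot of timestamp versus error code (Option C) plots individual events rather than frequencies, so it is not suitable.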
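
The data behind such a chart is a count per time bucket and error type. Since the exhibit is not reproduced here, the log entries below are invented, and `count_errors_per_hour` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log entries: (ISO timestamp, error type).
log_entries = [
    ("2024-05-01T09:05:00", "TIMEOUT"),
    ("2024-05-01T09:40:00", "TIMEOUT"),
    ("2024-05-01T09:55:00", "AUTH_FAILED"),
    ("2024-05-01T10:10:00", "TIMEOUT"),
    ("2024-05-01T10:20:00", "AUTH_FAILED"),
    ("2024-05-01T10:45:00", "AUTH_FAILED"),
]

def count_errors_per_hour(entries):
    """Count events per (hour bucket, error type)."""
    counts = Counter()
    for timestamp, error_type in entries:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour.isoformat(), error_type)] += 1
    return counts

hourly_counts = count_errors_per_hour(log_entries)
```

Each error type becomes one line on the chart, with the hour bucket on the x-axis and the count on the y-axis.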

431

MCQmedium

A data analyst is troubleshooting a dashboard that displays incorrect totals for sales by region. The data source queries are correct. Which of the following is the most likely cause?

A.The visualizations are using a different aggregation level.

B.The data model includes duplicate records.

C.The dashboard is using a live connection instead of an extract.

D.The filter context is inadvertently excluding some regions.

AnswerD

Correct. Filters can exclude data without obvious indication, causing incorrect totals.

Why this answer

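Option D is correct because filter context can inadvertently exclude certain regions, leading to incorrect totals. Options A, B, and C are less likely: a different aggregation level (A) changes the granularity of the figures rather than silently dropping regions; duplicate records (B) would normally have surfaced when the source queries were verified as correct; and a live connection (C) affects performance and data freshness, not accuracy.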

432

MCQeasy

A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?

A.Concatenate

B.Append

C.Merge

D.Aggregate

AnswerB

Append adds rows from one dataset to another with same columns.

Why this answer

Option B (Append) is correct because appending is the standard operation for combining two datasets with identical columns but different rows, stacking the rows from one dataset onto the other. In SQL, this is achieved with the UNION or UNION ALL operator; in Python pandas, it is done with `pd.concat()` along axis=0 (the older `DataFrame.append()` method was deprecated and then removed in pandas 2.0). This operation preserves the column structure while extending the row count.

Exam trap

The trap here is that candidates confuse 'concatenate' (which can mean row-wise or column-wise) with 'append' (which specifically means row-wise stacking), leading them to choose Option A when the question explicitly requires combining rows.

How to eliminate wrong answers

Option A (Concatenate) is wrong because concatenation is a general term that can refer to combining along any axis (rows or columns), and in many contexts (e.g., SQL string functions, pandas with axis=1), it implies joining side-by-side rather than stacking rows; the question specifically requires row-wise stacking, which is append. Option C (Merge) is wrong because merge is used to combine datasets based on a common key column (like a SQL JOIN), not to simply stack rows when columns are identical. Option D (Aggregate) is wrong because aggregation involves summarizing data (e.g., SUM, AVG, COUNT) across groups, not combining separate datasets.
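
The append operation can be tried out with SQLite from the Python standard library. The table names and rows below are made up for illustration; the point is the difference between UNION ALL, which keeps every row, and UNION, which also removes duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two datasets with the same columns but different rows.
    CREATE TABLE sales_q1 (order_id INTEGER, amount REAL);
    CREATE TABLE sales_q2 (order_id INTEGER, amount REAL);
    INSERT INTO sales_q1 VALUES (1, 10.0), (2, 20.0);
    INSERT INTO sales_q2 VALUES (3, 30.0), (2, 20.0);
""")

# UNION ALL stacks the rows and keeps duplicates (a plain append).
appended = conn.execute("""
    SELECT order_id, amount FROM sales_q1
    UNION ALL
    SELECT order_id, amount FROM sales_q2
""").fetchall()

# UNION stacks the rows and then drops exact duplicates.
deduplicated = conn.execute("""
    SELECT order_id, amount FROM sales_q1
    UNION
    SELECT order_id, amount FROM sales_q2
""").fetchall()
```

Which of the two is appropriate depends on whether repeated rows are genuine records or accidental overlap between the sources.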

433

MCQmedium

A data analyst encounters the above error log when trying to connect to a database. The analyst needs to explain the issue to the database administrator. Which of the following correctly describes the problem?

A.The database connection pool has reached its maximum limit.

B.The database table is corrupted.

C.The database server is out of disk space.

D.The database authentication credentials are invalid.

AnswerA

The log explicitly says 'Connection pool exhausted'.

Why this answer

The error log indicates a 'connection pool exhausted' or 'too many connections' message, which occurs when the database connection pool has reached its maximum limit. This means all available connections are in use, and no new connections can be established until existing ones are released. The analyst should explain to the DBA that the application is attempting to open more connections than the pool allows, often due to a connection leak or insufficient pool size.

Exam trap

The trap here is that candidates confuse connection pool exhaustion with authentication or disk space issues, but the error log's specific wording (e.g., 'cannot acquire connection from pool') directly points to a connection limit problem.

How to eliminate wrong answers

Option B is wrong because a corrupted table typically produces errors like 'table corruption' or 'index corruption', not connection pool exhaustion. Option C is wrong because out-of-disk-space errors manifest as 'disk full' or 'no space left on device', not connection limit errors. Option D is wrong because invalid authentication credentials result in 'access denied' or 'login failed' errors, not connection pool limit messages.
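
Pool exhaustion is easy to reproduce with a toy pool. The `ConnectionPool` class and `PoolExhaustedError` below are hypothetical and use strings as stand-ins for real connections; production pools usually wait for a timeout before failing rather than raising immediately.

```python
import queue

class PoolExhaustedError(Exception):
    """Raised when no pooled connection is free."""

class ConnectionPool:
    def __init__(self, max_size):
        self._free = queue.Queue(maxsize=max_size)
        for n in range(max_size):
            self._free.put(f"connection-{n}")  # stand-ins for real connections

    def acquire(self):
        try:
            return self._free.get(block=False)
        except queue.Empty:
            raise PoolExhaustedError("connection pool exhausted") from None

    def release(self, connection):
        self._free.put(connection)

pool = ConnectionPool(max_size=2)
first = pool.acquire()
second = pool.acquire()

# A third request fails because both connections are still checked out.
try:
    pool.acquire()
    exhausted = False
except PoolExhaustedError:
    exhausted = True

# Releasing a connection makes the pool usable again.
pool.release(first)
reacquired = pool.acquire()
```

A connection leak is the case where `release` is never called, so the pool drains even under light load.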

434

MCQhard

Refer to the exhibit. A data analyst is unable to run a query on the customers table after October 1, 2023. What is the reason?

A.The resource name is incorrect

B.The policy allows access only before October 1, 2023

C.The action should be READ not SELECT

D.The policy denies access after October 1, 2023

AnswerB

After October 1, the condition fails, and the Allow effect no longer applies, resulting in denial.

Why this answer

The policy explicitly allows access only before October 1, 2023, meaning any query attempt on or after that date is denied. This is a time-based access control condition, often implemented using AWS IAM or Azure RBAC policies with a `Condition` block that checks the `aws:CurrentTime` or equivalent attribute. Since the query fails after October 1, 2023, the policy's effective date restriction is the direct cause.

Exam trap

CompTIA often tests the distinction between an explicit deny and an implicit deny caused by a missing allow condition; the trap here is that candidates mistakenly think the policy contains an explicit 'deny after date' statement, when in reality it simply grants access only before the date, relying on the default implicit deny for all other times.

How to eliminate wrong answers

Option A is wrong because the resource name being incorrect would cause a different error (e.g., 'Table not found' or 'Invalid resource'), not a time-based denial. Option C is wrong because `SELECT` is the correct SQL action for reading data; the policy uses `SELECT` as the action identifier, not `READ`, and changing it would not resolve the time restriction. Option D is wrong because it misstates the policy logic: the policy does not explicitly deny access after October 1, 2023; instead, it grants access only before that date, which implicitly denies access after it. The distinction matters because an explicit deny would override any allow, but here the absence of an allow after the date is the issue.
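
The implicit-deny behaviour can be sketched as a small evaluator. The exhibit is not reproduced here, so the `policy` dictionary and the `is_allowed` function are assumptions for illustration and do not mirror any specific cloud provider's policy syntax.

```python
from datetime import datetime, timezone

# Hypothetical policy: a single Allow statement with a time condition.
policy = {
    "effect": "Allow",
    "action": "SELECT",
    "resource": "customers",
    "allow_before": datetime(2023, 10, 1, tzinfo=timezone.utc),
}

def is_allowed(policy, action, resource, now):
    """Return True only if the Allow statement matches; otherwise deny by default."""
    return (
        policy["effect"] == "Allow"
        and policy["action"] == action
        and policy["resource"] == resource
        and now < policy["allow_before"]  # the time condition
    )

allowed_in_september = is_allowed(
    policy, "SELECT", "customers", datetime(2023, 9, 30, tzinfo=timezone.utc)
)
allowed_in_october = is_allowed(
    policy, "SELECT", "customers", datetime(2023, 10, 2, tzinfo=timezone.utc)
)
```

Note that nothing in the policy says "Deny"; the October request fails simply because no Allow statement matches it.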

435

MCQeasy

A dashboard needs to show sales trends for each of five regions over the past year. The intended audience wants to compare trends easily. Which chart type is best?

A.Line chart with multiple lines

B.Pie chart

C.Stacked bar chart

D.Area chart

AnswerA

A line chart with multiple lines clearly shows each region's trend over time.

Why this answer

Option A is correct because a line chart with one line per region allows direct comparison of trends across regions. A stacked bar chart would show composition, not trends.

436

Multi-Selectmedium

Which TWO of the following are best practices for designing a data dashboard?

Select 2 answers

A.Include animated transitions between data views.

B.Use consistent color schemes to indicate performance levels.

C.Use 3D effects to make charts more visually appealing.

D.Place the most important KPIs at the top of the dashboard.

E.Include as many charts as possible to provide comprehensive data.

AnswersB, D

Consistent colors help users quickly interpret data.

Why this answer

Option B is correct because consistent color schemes (e.g., red for critical, yellow for warning, green for normal) allow users to instantly interpret performance levels without cognitive overload. This aligns with dashboard design principles that prioritize clarity and rapid pattern recognition over decorative elements.

Exam trap

The trap here is that candidates confuse 'visually appealing' with 'effective communication' — CompTIA often tests that decorative elements like 3D effects and animations reduce data accuracy and user comprehension, even though they may look impressive.

437

MCQeasy

A healthcare database stores patient records. Each patient has a unique patient_id, and the database includes a table 'visits' with visit_id, patient_id, visit_date, and diagnosis_code. To ensure data integrity, which constraint should be applied to the patient_id column in the 'visits' table?

A.Unique constraint

B.Foreign key

C.Primary key

D.Check constraint

AnswerB

Foreign key enforces referential integrity.

Why this answer

Option B is correct because a foreign key constraint ensures that patient_id in visits references a valid patient_id in the patient table. Option A is wrong because a unique constraint only prevents duplicate values, which would wrongly limit each patient to a single visit. Option C is wrong because a primary key enforces uniqueness within its own table; in 'visits' that role belongs to visit_id. Option D is wrong because a check constraint validates values against a condition and cannot verify that the patient exists.
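
Referential integrity can be demonstrated with SQLite from the Python standard library. The table definitions and sample rows are assumptions based on the columns named in the question; note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY);
    CREATE TABLE visits (
        visit_id       INTEGER PRIMARY KEY,
        patient_id     INTEGER NOT NULL REFERENCES patients(patient_id),
        visit_date     TEXT,
        diagnosis_code TEXT
    );
    INSERT INTO patients VALUES (1);
    INSERT INTO visits VALUES (500, 1, '2024-02-10', 'J10');
""")

# A visit for a patient that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO visits VALUES (501, 999, '2024-02-11', 'J11')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

visit_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
```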

438

MCQmedium

A data analyst needs to create a dashboard that updates automatically every hour. The data source is a large database. Which approach minimizes performance impact?

A.Query the entire database each time

B.Use incremental refresh only for new or changed data

C.Export the data to Excel and import

D.Create a static report monthly

AnswerB

Incremental refresh minimizes database load.

Why this answer

Option B is correct because incremental refresh queries only new or changed records since the last refresh, drastically reducing data transfer and processing load on the large database. This approach uses change-tracking mechanisms (e.g., timestamps, CDC) to avoid full-table scans, minimizing performance impact while maintaining near-real-time updates.

Exam trap

CompTIA often tests the misconception that 'more data is better' or that full refreshes are simpler and equally acceptable, but the trap here is ignoring the performance cost of full database scans on large datasets in favor of the more efficient incremental approach.

How to eliminate wrong answers

Option A is wrong because querying the entire database each hour performs a full table scan on a large database, causing excessive I/O, CPU, and memory usage that degrades performance for all users. Option C is wrong because exporting the entire database to Excel and importing it adds unnecessary data transformation overhead, loses real-time capability, and still requires a full data pull. Option D is wrong because a static monthly report does not meet the requirement for automatic hourly updates and provides stale data, making it functionally incorrect for the use case.
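
A timestamp watermark is one common way to implement incremental refresh. The sketch below assumes each source row carries an `updated_at` value; the row layout and the `incremental_refresh` function are hypothetical, and in practice the filter would run as a WHERE clause in the database rather than in Python.

```python
# Hypothetical source rows; ISO-8601 strings sort chronologically.
source_rows = [
    {"id": 1, "updated_at": "2024-06-01T08:00:00", "amount": 10},
    {"id": 2, "updated_at": "2024-06-01T09:30:00", "amount": 20},
    {"id": 3, "updated_at": "2024-06-01T10:15:00", "amount": 30},
]

def incremental_refresh(rows, cache, watermark):
    """Apply only rows changed after the watermark; return (count, new watermark)."""
    changed = [row for row in rows if row["updated_at"] > watermark]
    for row in changed:
        cache[row["id"]] = row  # insert new rows, overwrite changed ones
    new_watermark = max((row["updated_at"] for row in changed), default=watermark)
    return len(changed), new_watermark

cache = {}
# First hourly run: only rows newer than the stored watermark are loaded.
loaded_first, watermark = incremental_refresh(source_rows, cache, "2024-06-01T09:00:00")
# Second run with no new changes loads nothing.
loaded_second, watermark = incremental_refresh(source_rows, cache, watermark)
```

A watermark of this kind does not detect deleted rows, which is one reason change data capture (CDC) is sometimes preferred.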

439

Multi-Selecthard

Which THREE of the following are valid data quality dimensions? (Choose THREE.)

Select 3 answers

A.Encryption

B.Redundancy

C.Completeness

D.Timeliness

E.Accuracy

AnswersC, D, E

Completeness is a data quality dimension.

Why this answer

Completeness is a core data quality dimension that measures whether all required data is present. In the context of the DA0-001 exam, completeness ensures that no fields or records are missing, which is fundamental for reliable analysis and reporting.

Exam trap

CompTIA often tests the distinction between data quality dimensions and data management techniques, so candidates may mistakenly select encryption or redundancy because they sound like important data concepts, but they are not part of the standard quality dimensions.
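
Completeness is straightforward to measure as the share of records with a usable value in a field. The sample records and the `completeness` function below are illustrative assumptions, including the choice to treat empty strings as missing.

```python
# Hypothetical customer records; None and "" both count as missing.
records = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 4, "email": ""},
]

def completeness(rows, field):
    """Fraction of rows where the field is present and non-empty."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)

email_completeness = completeness(records, "email")
id_completeness = completeness(records, "customer_id")
```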

440

MCQhard

A data engineer is tasked with acquiring data from a third-party vendor that provides daily file drops via SFTP. The files are large (10 GB each). The pipeline must load data into a data warehouse. Which approach optimizes for speed and reliability?

A.Download the file to a staging server, then bulk insert into warehouse

B.Stream the file directly from SFTP into warehouse using a data pipeline tool

C.Have the vendor push data via API instead of SFTP

D.Split the file into smaller chunks and load concurrently

AnswerB

Streaming minimizes latency and storage overhead.

Why this answer

Option B is correct because streaming the file directly from SFTP into the warehouse using a data pipeline tool (e.g., Apache NiFi, Airbyte, or Fivetran) eliminates the intermediate staging step, reducing disk I/O and latency. This approach leverages incremental processing and parallel streams to handle large 10 GB files efficiently, while built-in retry and checkpoint mechanisms ensure reliability against network interruptions.

Exam trap

The trap here is that candidates assume 'download then load' (Option A) is the most reliable approach, but the question specifically asks for speed and reliability, and streaming avoids the I/O bottleneck and single-point-of-failure of a staging server.

How to eliminate wrong answers

Option A is wrong because downloading the file to a staging server introduces an unnecessary intermediate write and read cycle, doubling I/O time and adding a single point of failure; bulk insert after full download also delays loading until the entire file is present, which is suboptimal for speed. Option C is wrong because having the vendor push data via API instead of SFTP does not inherently optimize speed or reliability for large daily file drops—APIs often have payload size limits (e.g., 10 MB) and require chunking, adding complexity and potential throttling, while SFTP is already a reliable file transfer protocol. Option D is wrong because splitting the file into smaller chunks and loading concurrently can cause resource contention (e.g., connection pool exhaustion, lock contention) and requires careful coordination to maintain data consistency; it does not address the fundamental bottleneck of downloading the entire file before processing.

441

MCQhard

Refer to the exhibit. A data analyst is reviewing a data quality report. Which of the following actions should the analyst take first?

A.Delete the 1200 records with null emails.

B.Fill null emails with a placeholder.

C.Investigate the source system to understand why emails are missing.

D.Ignore the nulls as they are not critical.

AnswerC

Correct. Root cause analysis should precede any corrective action.

Why this answer

Option C is correct because the first step in data quality remediation is root cause analysis. Without understanding why 1200 records have null emails (e.g., a source system bug, a failed ETL join, or a missing required field), any corrective action like deletion or placeholder insertion risks introducing bias or masking a systemic issue. Investigating the source system aligns with the data governance principle of 'fix the source, not the symptom.'

Exam trap

CompTIA often tests the principle that 'fix the source, not the symptom'—the trap here is that candidates jump to data cleansing actions (delete, fill, ignore) without first diagnosing why the nulls exist, which is a classic data quality management mistake.

How to eliminate wrong answers

Option A is wrong because deleting 1200 records with null emails reduces dataset size and may discard valid records if the nulls are due to a temporary system glitch, not actual missing data. Option B is wrong because filling null emails with a placeholder (e.g., 'unknown@domain.com') introduces false data that can skew analysis, violate email format constraints, and mislead downstream processes. Option D is wrong because ignoring nulls assumes they are non-critical without verification; in many contexts (e.g., customer communications, deduplication), missing emails are critical and can lead to incomplete insights or compliance issues.

442

MCQhard

A healthcare organization must ensure patient data privacy when sharing reports with external auditors. Which practice is most important?

A.Encrypt the report file

B.Obtain consent from patients

C.Aggregate data at low granularity

D.Use pseudonymization

AnswerD

Pseudonymization de-identifies data while retaining analytical value, meeting HIPAA requirements for sharing with auditors.

Why this answer

Pseudonymization replaces identifying information with pseudonyms, preserving data utility while protecting privacy. Aggregation reduces granularity but may still reveal identities in small groups; encryption protects the file in transit and at rest, but an auditor who decrypts it still sees the identifiers; obtaining consent from every patient is impractical for large datasets.
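
One common way to pseudonymize an identifier is a keyed hash, so that the same patient always maps to the same token but the token cannot be reversed without the key. The sketch below is an assumption-laden illustration: the key value, field names, and 16-character truncation are all hypothetical choices, and a real deployment would keep the key in a secrets manager and follow its own regulatory guidance.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # placeholder, never hard-code in practice

def pseudonymize(patient_id, key=SECRET_KEY):
    """Map an identifier to a stable token using HMAC-SHA256."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability in reports

# Hypothetical visit records containing a direct identifier.
visits = [
    {"patient_id": "MRN-001", "diagnosis_code": "J10"},
    {"patient_id": "MRN-002", "diagnosis_code": "E11"},
    {"patient_id": "MRN-001", "diagnosis_code": "I10"},
]

# The shared report carries the token instead of the identifier.
shared = [
    {"patient_ref": pseudonymize(v["patient_id"]), "diagnosis_code": v["diagnosis_code"]}
    for v in visits
]
```

Because the mapping is stable, auditors can still see that the first and third visits belong to the same patient without learning who that patient is.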

443

MCQmedium

A manufacturing company has two primary data systems: an ERP system that stores production orders with fields like OrderID, ProductID, Quantity, and ProductionDate, and a CRM system that stores customer sales with fields like SaleID, CustomerID, ProductID, SaleDate, and Amount. The data analyst needs to create a unified view of product performance by joining these tables. However, the ProductID field in the ERP uses a 5-character alphanumeric code (e.g., 'P1234'), while the CRM uses a 6-character code (e.g., 'PR1234'). Additionally, some products have multiple entries due to slight variations in naming. The analyst wants to ensure accurate matching without losing data. Which action should the analyst take first to address the data inconsistency?

A.Create a mapping table that standardizes ProductID formats between ERP and CRM.

B.Perform data profiling to identify all unique ProductID values and their frequencies.

C.Aggregate data by product name and ignore ProductID mismatches.

D.Use a fuzzy matching algorithm to join on similar ProductID strings.

AnswerA

Correct: Standardization of keys is necessary before joining.

Why this answer

Option A is correct because creating a mapping table allows the analyst to explicitly define the relationship between the 5-character ERP ProductID and the 6-character CRM ProductID, ensuring accurate joins without data loss. This approach standardizes the inconsistent formats and handles variations by providing a controlled, deterministic lookup, which is essential for maintaining referential integrity in a unified view.

Exam trap

The trap here is that candidates may choose fuzzy matching (Option D) thinking it handles all variations, but CompTIA often tests the principle that deterministic mapping is preferred over probabilistic methods when the inconsistency is systematic and can be resolved with a known transformation.

How to eliminate wrong answers

Option B is wrong because data profiling only identifies the unique values and their frequencies but does not resolve the format mismatch; it merely highlights the problem without providing a mechanism to align the keys for joining. Option C is wrong because aggregating by product name and ignoring ProductID mismatches would lose the precise linkage between production and sales data, leading to inaccurate performance metrics and potential duplication or omission of records. Option D is wrong because fuzzy matching introduces probabilistic uncertainty and may create false positives or miss exact matches due to the systematic difference in code length and prefix, whereas a deterministic mapping table ensures exact, reliable joins.
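
A mapping table can be sketched as a simple lookup from ERP codes to CRM codes. The codes, quantities, amounts, and the `join_on_mapping` function below are invented for illustration; the useful detail is that unmapped products are reported rather than silently dropped.

```python
# Hypothetical mapping table: ERP 5-character code -> CRM 6-character code.
id_map = {"P1234": "PR1234", "P5678": "PR5678"}

erp_orders = [
    {"ProductID": "P1234", "Quantity": 100},
    {"ProductID": "P5678", "Quantity": 40},
    {"ProductID": "P9999", "Quantity": 5},  # no mapping defined yet
]
crm_sales = [
    {"ProductID": "PR1234", "Amount": 2500.0},
    {"ProductID": "PR5678", "Amount": 900.0},
]

def join_on_mapping(orders, sales, mapping):
    """Join ERP orders to CRM sales via the mapping; collect unmapped ERP IDs."""
    sales_by_id = {}
    for sale in sales:
        crm_id = sale["ProductID"]
        sales_by_id[crm_id] = sales_by_id.get(crm_id, 0.0) + sale["Amount"]
    matched, unmatched = [], []
    for order in orders:
        crm_id = mapping.get(order["ProductID"])
        if crm_id is None:
            unmatched.append(order["ProductID"])  # flag for review, do not discard
            continue
        matched.append({
            "erp_id": order["ProductID"],
            "crm_id": crm_id,
            "quantity": order["Quantity"],
            "sales_amount": sales_by_id.get(crm_id, 0.0),
        })
    return matched, unmatched

matched, unmatched = join_on_mapping(erp_orders, crm_sales, id_map)
```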

444

MCQ · medium

A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?

A. Bar chart

B. Histogram

C. Scatter plot

D. Box plot

Answer: D

Box plot displays distribution across groups.

Why this answer

A box plot (option D) is the most suitable chart for visualizing the distribution of a continuous variable across different categories because it displays the median, quartiles, and potential outliers for each group, enabling direct comparison of spread and central tendency. Unlike a histogram, which shows the distribution of a single continuous variable without categorical grouping, the box plot inherently supports categorical axes. This makes it ideal for exploratory data analysis when assessing how a metric like revenue varies by region or product category.
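The statistics a box plot draws for each category can be computed directly, which may help make the comparison concrete. The sketch below uses only the standard library. The revenue_by_region figures are made up, and the 1.5 × IQR outlier fence is a common convention rather than something the question specifies.

```python
from statistics import quantiles

# Hypothetical revenue figures per region.
revenue_by_region = {
    "North": [10, 12, 14, 16, 18, 20, 22],
    "South": [5, 7, 9, 11, 13, 15, 60],
}

def box_stats(values):
    """Return the quartiles and outliers a box plot would display."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

summary = {region: box_stats(vals) for region, vals in revenue_by_region.items()}
for region, stats in summary.items():
    print(region, stats)
```

Placing these per-category summaries side by side is what lets a reader compare central tendency, spread, and outliers across groups at a glance.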

Exam trap

CompTIA often tests the distinction between histograms and box plots by presenting a scenario where a candidate mistakenly chooses a histogram for grouped categorical data, overlooking that histograms require a continuous x-axis and cannot inherently separate categories without additional faceting.

How to eliminate wrong answers

Option A is wrong because a bar chart is designed for comparing categorical data using discrete counts or sums, not for showing the distribution of a continuous variable across categories. Option B is wrong because a histogram visualizes the distribution of a single continuous variable using bins, but it does not natively separate data into distinct categories; you would need faceting or multiple histograms, which is less efficient than a box plot. Option C is wrong because a scatter plot is used to examine the relationship between two continuous variables, not to compare distributions of one continuous variable across categories.

445

MCQ · hard

During a presentation, a stakeholder questions the validity of a data insight because the sample size appears small. The analyst knows the sample is statistically significant. What is the best way to address this concern?

A. Ignore the question and continue the presentation.

B. Explain the margin of error and confidence interval used.

C. Ask the stakeholder to trust the analysis and move on.

D. Agree to collect more data before finalizing the report.

Answer: B

This provides statistical context to reassure the stakeholder.

Why this answer

Option B is correct because it directly addresses the stakeholder's concern by explaining the statistical concepts of margin of error and confidence interval, which demonstrate that the sample size is sufficient for the desired level of precision. This approach validates the data insight's reliability without dismissing the stakeholder's valid question, aligning with best practices in communicating data insights.
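As a worked illustration of what the analyst might show, the margin of error for a sample proportion can be computed in a few lines. This sketch assumes the normal approximation and a 95% confidence level (z ≈ 1.96). The sample size of 400 and the observed proportion of 0.5 are hypothetical numbers, not values from the question.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a proportion under the normal approximation."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical survey: 400 respondents, 50% observed proportion.
p_hat, n = 0.5, 400
moe = margin_of_error(p_hat, n)
ci = (p_hat - moe, p_hat + moe)

print(f"margin of error: {moe:.3f}")
print(f"95% confidence interval: {ci[0]:.3f} to {ci[1]:.3f}")
```

Stating the result as "50%, plus or minus about 5 points, at 95% confidence" gives the stakeholder a concrete sense of how precise the estimate is for the sample that was actually collected.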

Exam trap

The trap here is that candidates may assume a small sample is always invalid, but the DA0-001 exam tests the understanding that a sample's adequacy is judged by the margin of error and confidence level it supports, not by its raw size alone.

How to eliminate wrong answers

Option A is wrong because ignoring the question undermines trust and fails to address a legitimate concern about data validity, which is critical in data-driven presentations. Option C is wrong because asking for blind trust is unprofessional and does not provide the technical justification needed to alleviate doubts about sample size significance. Option D is wrong because agreeing to collect more data is unnecessary when the sample is already statistically significant, and it delays decision-making without addressing the underlying statistical reasoning.

446

MCQ · medium

A data analyst is designing a dashboard for executives. Which best practice should be followed?

A. Use 3D effects to make charts more engaging

B. Include every data point in the dashboard

C. Minimize clutter and use a clear visual hierarchy

D. Use a rainbow color palette to highlight all data points

Answer: C

Clean design improves readability and decision-making.

Why this answer

Reducing clutter helps viewers focus on key insights without distraction.

447

MCQ · medium

An analyst is presenting findings to a non-technical audience. The data shows a 20% increase in customer churn after a price change. Which presentation approach is BEST?

A. Explain the p-value

B. Provide the raw data table

C. Use a simple bar chart comparing churn before and after

D. Show a complex statistical model

Answer: C

A bar chart is simple, visual, and directly shows the comparison without jargon.

Why this answer

Option C is correct because a simple bar chart visually and intuitively communicates the 20% increase in churn to a non-technical audience without requiring statistical literacy. This approach aligns with best practices for presenting data insights to stakeholders who need clear, actionable takeaways rather than technical details.

Exam trap

The trap here is that candidates often overcomplicate the presentation by choosing technical options (like p-values or models) to demonstrate rigor, forgetting that the exam prioritizes audience-appropriate communication over statistical depth.

How to eliminate wrong answers

Option A is wrong because explaining a p-value introduces statistical significance testing, which is unnecessary and confusing for a non-technical audience that only needs to understand the magnitude of the change. Option B is wrong because providing the raw data table overwhelms the audience with numbers and fails to highlight the key insight (the 20% increase) effectively. Option D is wrong because showing a complex statistical model is inappropriate for a non-technical audience, as it obscures the simple before-and-after comparison and may lead to misinterpretation or disengagement.

448

Multi-Select · hard

A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)

Select 3 answers

A. Correcting inconsistent data

B. Normalization

C. Handling missing values

D. Feature engineering

E. Removing duplicate records

Answers: A, C, E

Standardizing formats and fixing typos are cleaning tasks.

Why this answer

Correcting inconsistent data (Option A) is a core data cleaning step because it ensures that values follow a consistent format, such as standardizing date formats (e.g., 'MM/DD/YYYY' vs 'DD-MM-YYYY') or fixing capitalization (e.g., 'USA' vs 'usa'). This process directly addresses data quality issues that arise from human entry errors or system differences, making the dataset reliable for analysis.
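The three cleaning steps named in the answer can be sketched together in plain Python. The records, field names, and the two accepted date formats below are hypothetical. Filling a missing spend with 0.0 is only one possible policy, chosen here for brevity; flagging or imputing the value may be more appropriate in practice.

```python
from datetime import datetime

# Hypothetical raw records with inconsistent formats, a gap, and a duplicate.
raw = [
    {"id": 1, "country": "USA", "signup": "03/15/2024", "spend": 120.0},
    {"id": 2, "country": "usa", "signup": "15-03-2024", "spend": None},
    {"id": 1, "country": "USA", "signup": "03/15/2024", "spend": 120.0},
]

def parse_date(text):
    """Try each known format and return an ISO date string, or None."""
    for fmt in ("%m/%d/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean(records):
    seen, cleaned = set(), []
    for r in records:
        row = {
            "id": r["id"],
            "country": r["country"].strip().upper(),      # fix inconsistent case
            "signup": parse_date(r["signup"]),            # standardize dates
            "spend": 0.0 if r["spend"] is None else r["spend"],  # fill missing
        }
        key = tuple(row.values())
        if key not in seen:                               # drop exact duplicates
            seen.add(key)
            cleaned.append(row)
    return cleaned

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Note that the duplicate check runs after standardization, so records that differ only in formatting are also recognized as duplicates.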

Exam trap

The trap here is that candidates confuse data cleaning with data transformation or feature engineering, leading them to select normalization or feature engineering as cleaning steps, when in fact cleaning strictly addresses data quality issues like consistency, completeness, and uniqueness.

449

MCQ · hard

A data analyst sees this error in the ETL logs. What is the most likely cause?

A. The materialized view log was updated after the last refresh

B. The source table was dropped

C. The analyst does not have permission to refresh the view

D. There is a network connection timeout

Answer: A

The log is newer, indicating changes that need a full refresh.

Why this answer

The error indicates that the materialized view's underlying data has changed since its last refresh, specifically because the materialized view log was updated. Materialized views rely on logs to track changes for fast refreshes; if the log is updated after the last refresh, the view's snapshot becomes stale and cannot be incrementally refreshed without a complete refresh. This is a common cause of refresh failures in Oracle databases.

Exam trap

CompTIA often tests the distinction between fast refresh and complete refresh errors, and the trap here is that candidates confuse a log update with a source table drop or permission issue, not realizing that the error message specifically points to a log timestamp mismatch.

How to eliminate wrong answers

Option B is wrong because dropping the source table would cause a different error (e.g., 'table or view does not exist') rather than a log-related error. Option C is wrong because a permission issue would typically result in an 'insufficient privileges' error, not a log mismatch. Option D is wrong because a network timeout would produce a connection error (e.g., ORA-12170 or ORA-03113), not a materialized view log inconsistency.

450

Multi-Select · medium

Which THREE of the following are characteristics of a relational database?

Select 3 answers

A. Enforces referential integrity through foreign keys

B. Stores data in key-value pairs

C. Supports NoSQL document storage

D. Uses Structured Query Language (SQL) for data manipulation

E. Data is organized into tables with rows and columns

Answers: A, D, E

Referential integrity ensures relationships.

Why this answer

Option A is correct because relational databases enforce referential integrity through foreign keys, which ensure that relationships between tables remain consistent. A foreign key in a child table must match a primary key value in the parent table, preventing orphaned records and maintaining data integrity.
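Foreign-key enforcement can be observed with Python's built-in sqlite3 module. The customers and orders tables here are hypothetical. One SQLite-specific detail: foreign-key checks are off by default and must be enabled per connection with PRAGMA foreign_keys = ON; other relational databases typically enforce them without this step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
)
""")

conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # parent row exists: accepted

try:
    # Customer 99 does not exist, so this row would be an orphan.
    conn.execute("INSERT INTO orders VALUES (11, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)
```

The second insert fails because no matching primary key exists in the parent table, which is the orphan-prevention behavior described above.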

Exam trap

The trap here is that candidates may confuse key-value stores or document databases with relational databases, especially when they hear terms like 'keys' or 'documents' in other contexts, but relational databases strictly use tables, rows, columns, and SQL.
