CompTIA Data+ DA0-001 (DA0-001) — Questions 175

509 questions total · 7 pages · All types, answers revealed

1
MCQeasy

A healthcare provider needs to integrate patient data from multiple clinics into a single data warehouse. Which process is used to extract, transform, and load the data?

A.ELT
B.ETL
C.OLAP
D.OLTP
AnswerB

ETL extracts data, transforms it, and loads it into the warehouse, suitable for structured integration.

Why this answer

ETL (Extract, Transform, Load) is the correct process because the healthcare provider must first extract data from multiple source clinics, then transform it (e.g., standardize formats, clean duplicates, apply business rules) before loading it into the target data warehouse. This ensures data quality and consistency, which is critical for clinical analytics and reporting.

Exam trap

The trap here is confusing ETL with ELT, where candidates assume ELT is always better due to modern big data tools, but the question explicitly describes a traditional data warehouse integration requiring pre-load transformations.

How to eliminate wrong answers

Option A is wrong because ELT (Extract, Load, Transform) loads raw data into the target system first and transforms it later, which is less suitable for a data warehouse requiring pre-integrated, clean data from multiple sources; it is more common in big data environments like Hadoop. Option C is wrong because OLAP (Online Analytical Processing) is a category of database systems optimized for complex queries and multidimensional analysis, not a data integration process. Option D is wrong because OLTP (Online Transaction Processing) is designed for high-volume transactional operations (e.g., recording patient visits), not for extracting, transforming, and loading data into a warehouse.
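The extract, transform, load order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clinic records, the field names, the date formats, and the in-memory `warehouse` list are all hypothetical and are not part of the question.

```python
from datetime import datetime

def extract():
    """Extract: gather raw rows from each source (hypothetical clinic feeds)."""
    clinic_a = [{"patient_id": "17", "name": " ann lee ", "visit": "2024-01-05"}]
    clinic_b = [
        {"patient_id": "17", "name": "Ann Lee", "visit": "05/01/2024"},
        {"patient_id": "42", "name": "BO CHEN", "visit": "07/01/2024"},
    ]
    # Pair each row with the date format its source is assumed to use.
    return [(row, "%Y-%m-%d") for row in clinic_a] + [
        (row, "%d/%m/%Y") for row in clinic_b
    ]

def transform(raw):
    """Transform: standardize names and dates, then drop duplicate visits."""
    seen, clean = set(), []
    for row, date_format in raw:
        record = {
            "patient_id": int(row["patient_id"]),
            "name": row["name"].strip().title(),
            "visit": datetime.strptime(row["visit"], date_format).date().isoformat(),
        }
        key = (record["patient_id"], record["visit"])
        if key not in seen:
            seen.add(key)
            clean.append(record)
    return clean

def load(rows, warehouse):
    """Load: append the cleaned rows to the target table."""
    warehouse.extend(rows)
    return warehouse

# Transformation happens before loading, which is what distinguishes ETL from ELT.
warehouse = load(transform(extract()), [])
```

In an ELT variant, `load` would run on the raw rows first and the cleanup would happen inside the target system afterwards.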

2
MCQhard

Refer to the exhibit. Before running the code, the original salary column had 50 missing values. The median was calculated as 52000. After imputation, which of the following statements is true?

A.The mean decreased significantly
B.The standard deviation increased
C.The median remains unchanged
D.The minimum value decreased
AnswerC

Since missing values are replaced by the median, the median of the dataset does not change.

Why this answer

Imputing missing values with the median (52000) replaces only the 50 missing entries with that value, leaving all original non-missing values unchanged. Since the median is a positional statistic, adding values equal to the current median does not shift the middle position of the sorted data, so the median remains unchanged. This is why option C is correct.

Exam trap

CompTIA often tests the misconception that imputing with the median will change the median itself, when in fact adding values equal to the current median leaves the median unchanged because it is a rank-based statistic.

How to eliminate wrong answers

Option A is wrong because imputing with the median does not significantly change the mean; the mean may shift slightly toward the median but not decrease significantly unless the missing values were extremely high. Option B is wrong because adding values exactly at the median reduces variance (since imputed values are all equal to the median), so the standard deviation decreases, not increases. Option D is wrong because the minimum value is unaffected—imputation only adds values at the median, which is far above the minimum, so the minimum remains the same.
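The effect of median imputation on each statistic can be checked on a small sample. The exhibit's code is not reproduced here, so the sketch below uses a hypothetical seven-row salary list (with `None` marking missing values) chosen so that the observed median is 52000.

```python
import statistics

# Hypothetical salary column; None marks a missing value.
salaries = [30000, 45000, 52000, 60000, 110000, None, None]

observed = [s for s in salaries if s is not None]
median_before = statistics.median(observed)

# Median imputation: every missing entry becomes the observed median.
imputed = [median_before if s is None else s for s in salaries]

median_after = statistics.median(imputed)
stdev_before = statistics.stdev(observed)
stdev_after = statistics.stdev(imputed)
mean_before = statistics.mean(observed)
mean_after = statistics.mean(imputed)
```

On this sample the median stays at 52000, the minimum stays at 30000, the standard deviation shrinks, and the mean moves slightly toward the median, which matches the reasoning for each option.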

3
MCQeasy

A data analyst wants to show the relative proportions of defects by type in a manufacturing process. There are 6 defect types. Which chart is most appropriate?

A.Line chart
B.Pie chart
C.Stacked bar chart
D.Scatter plot
AnswerB

Pie chart effectively shows each defect type's share of total.

Why this answer

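A pie chart is the standard choice for showing parts of a whole when there are only a few categories, and six defect types is within that range. A stacked bar chart can also show composition, but it is better suited to comparing composition across several groups. Line charts show change over time and scatter plots show the relationship between two numeric variables, so neither conveys proportions.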

4
MCQhard

A healthcare organization is subject to strict data privacy regulations requiring the classification of all data assets. The data governance team has identified three data sensitivity levels: Public, Internal, and Restricted. They have a new data pipeline importing patient health records from multiple clinics. The records include patient names, diagnoses, treatment codes, and insurance information. The team must ensure that the classification is applied correctly and that restricted data (e.g., diagnoses) is not exposed to unauthorized personnel. However, the pipeline uses automated tagging based on metadata rules, and some fields are misclassified. What is the most effective immediate action to improve classification accuracy?

A.Encrypt all data at rest and in transit regardless of classification.
B.Require manual review and reclassification of all incoming records.
C.Expand the metadata rule set to include more keywords and patterns.
D.Implement data loss prevention (DLP) tools that inspect content and enforce classification rules.
AnswerD

Correct: DLP can reclassify based on actual content, improving accuracy.

Why this answer

Option D is correct. Implementing data loss prevention (DLP) tools with content inspection can automatically detect sensitive patterns (e.g., diagnosis codes) and apply the correct classification, even if metadata rules are incomplete. Option B (manual review) is too slow for a pipeline.

Option C (expanding metadata rules) can help but may still miss cases; DLP provides an additional check. Option A (encrypting all data) is a security measure but does not fix classification. DLP directly addresses misclassification by analyzing content.
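The difference between metadata-based tagging and content inspection can be illustrated with a toy classifier. This is a rough sketch, not how any particular DLP product works: the rule table, the field names, and the simplified diagnosis-code pattern (loosely modeled on ICD-10 style codes) are all assumptions made for the example.

```python
import re

# Simplified, hypothetical pattern: a capital letter, two digits, and an
# optional decimal part. Real DLP rule sets are far richer than this.
DIAGNOSIS_CODE = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,4})?\b")

# Hypothetical metadata rules: classification assigned from the field name alone.
METADATA_RULES = {"diagnosis": "Restricted", "clinic_name": "Public"}

def classify(field_name, value):
    """Start from the metadata rule, then escalate if the content looks sensitive."""
    label = METADATA_RULES.get(field_name, "Internal")
    if DIAGNOSIS_CODE.search(value):
        label = "Restricted"
    return label

record = {"clinic_name": "Northside", "visit_notes": "Follow-up for E11.9, stable"}
labels = {name: classify(name, value) for name, value in record.items()}
```

The metadata rules alone would tag `visit_notes` as Internal because the field name is not in the rule table; inspecting the value is what catches the embedded diagnosis code.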

5
Multi-Selecteasy

Which TWO of the following are effective techniques for presenting data to a non-technical audience?

Select 2 answers
A.Explain the statistical methods used in the analysis.
B.Include detailed data tables for reference.
C.Highlight the most important insights using callouts.
D.Use many different colors to distinguish data points.
E.Use simple language and avoid jargon.
AnswersC, E

Callouts draw attention to key findings.

Why this answer

Option C is correct because highlighting key insights with callouts directly addresses the needs of a non-technical audience by drawing attention to the most important findings without requiring them to interpret complex data. This technique aligns with best practices for data storytelling, where visual emphasis on critical points improves comprehension and retention for stakeholders who may not have a technical background.

Exam trap

CompTIA often tests the misconception that non-technical audiences need more data (tables, statistics) to understand insights, when in fact they need less—focusing on simplicity, visual emphasis, and clear language—so candidates mistakenly choose options A, B, or D thinking they are thorough.

6
MCQhard

A data analyst creates a dashboard for executives that shows a key metric trending downward. The analyst notices that the metric is highly volatile day-to-day. Which visualization type is most appropriate to show the underlying trend?

A.7-day moving average line chart.
B.Pie chart showing proportion of days.
C.Scatter plot with regression line.
D.Daily bar chart.
AnswerA

Correct. Moving averages filter out short-term fluctuations to show the long-term trend.

Why this answer

Option A is correct because a moving average line chart smooths out daily volatility to reveal the trend. A daily bar chart shows noise; a pie chart is inappropriate for trends; a scatter plot with regression line could show correlation but not trend over time as clearly.
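A 7-day moving average is simple to compute by hand. The sketch below uses a trailing window and a hypothetical ten-day series built to be noisy day-to-day while drifting downward; the function name and the data are illustrative only.

```python
def moving_average(values, window=7):
    """Trailing moving average: one value per full window of `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[end - window + 1 : end + 1]) / window
        for end in range(window - 1, len(values))
    ]

# Hypothetical daily metric: volatile from day to day, with a downward drift.
daily = [100, 118, 81, 104, 72, 105, 78, 86, 104, 67]

smoothed = moving_average(daily)  # [94.0, 92.0, 90.0, 88.0]
```

The raw series jumps up and down, while the smoothed series falls steadily, which is the pattern an executive dashboard needs to surface.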

7
MCQmedium

An analyst needs to present quarterly sales data to the board. The CEO wants to see both overall trend and breakdown by region. Which dashboard layout is most effective?

A.A single line chart with all regions
B.A KPI card with total sales
C.A combination of a line chart for total and a stacked area chart for regional breakdown
D.A table with all quarterly figures
AnswerC

This layout clearly shows the overall trend and regional contributions in a cohesive way.

Why this answer

Option C is correct because it simultaneously satisfies the CEO's dual requirement: a line chart clearly shows the overall quarterly sales trend, while a stacked area chart breaks down total sales by region, allowing the board to see both the aggregate performance and the contribution of each region over time. This combination leverages the strengths of each chart type—line for trend clarity and stacked area for part-to-whole relationships—without overloading the viewer with data.

Exam trap

The trap here is that candidates often choose a single line chart (Option A) thinking it shows both trend and breakdown, but they overlook that multiple overlapping lines make it hard to see the aggregate trend, which is the CEO's primary need.

How to eliminate wrong answers

Option A is wrong because a single line chart with all regions would create visual clutter and make it difficult to discern the overall trend from the regional lines, especially if regions have overlapping values; it fails to provide a clear aggregate view. Option B is wrong because a KPI card with total sales only shows a single number, which cannot convey the quarterly trend or regional breakdown required by the CEO. Option D is wrong because a table with all quarterly figures forces the board to manually parse numbers to identify trends and regional contributions, which is inefficient for a high-level presentation and violates the principle of data visualization for quick insight.

8
MCQhard

A government agency's data analyst is commissioned to produce a report on public transportation usage trends. The report will be read by policymakers, transit planners, and the general public. The data includes ridership numbers, delay rates, and demographic breakdowns. The analyst needs to ensure the report is accessible and persuasive, especially to non-technical readers. The goal is to advocate for increased funding in underserved areas. The report must be data-driven but also tell a compelling story. What strategy should the analyst prioritize?

A.Provide raw data in appendices only.
B.Create a narrative that highlights the impact of delayed trains on low-income commuters.
C.Use complex statistical analysis to show significance of trends.
D.Focus solely on ridership numbers without context.
AnswerB

Makes data relatable and persuasive, driving home the need for funding.

Why this answer

Option B is correct because it directly addresses the need to make data accessible and persuasive to non-technical readers by weaving a narrative around a specific, relatable impact (delayed trains on low-income commuters). This approach aligns with the goal of advocating for increased funding in underserved areas, as it humanizes the data and creates a compelling story that policymakers and the public can understand and act upon, without requiring technical expertise.

Exam trap

The trap here is that candidates often choose Option C (complex statistical analysis) because they equate 'data-driven' with technical rigor, failing to recognize that the exam's focus on 'communicating data insights' prioritizes accessibility and persuasion over statistical complexity for non-technical stakeholders.

How to eliminate wrong answers

Option A is wrong because providing raw data only in appendices fails to make the report accessible or persuasive; it buries the key insights and requires readers to perform their own analysis, which is ineffective for non-technical audiences. Option C is wrong because using complex statistical analysis (e.g., p-values, regression coefficients) would alienate non-technical readers like the general public and many policymakers, making the report inaccessible and undermining its persuasive power. Option D is wrong because focusing solely on ridership numbers without context (e.g., demographic breakdowns, delay rates) provides no narrative or actionable insight, failing to tell a compelling story or advocate for specific funding needs.

9
MCQhard

A company uses a dashboard to monitor server uptime. The data is collected every minute, but the dashboard only refreshes every hour. Users see gaps in the line chart. What is the most likely cause, and how should it be fixed?

A.The line chart should interpolate missing data points
B.Switch to a bar chart to avoid gaps
C.Increase the dashboard refresh rate to match data collection frequency
D.Use a different data series with the same refresh rate
AnswerC

Refreshing every minute eliminates gaps because data is fetched in near real-time.

Why this answer

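The gaps appear because the dashboard pulls data only once per hour, so the chart has nothing to draw for the minutes between refreshes. Matching the refresh rate to the one-minute collection interval fixes the cause. Interpolation (Option A) would draw estimated values over the gaps instead of the collected data, switching to a bar chart (Option B) leaves the same data missing, and substituting a different series (Option D) does not display the uptime data users need.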

10
Multi-Selectmedium

A data analyst is creating a dashboard for a retail company. The dashboard should provide insights into sales performance across multiple dimensions. Which TWO chart types are best suited for showing the contribution of each product category to total sales?

Select 2 answers
A.Scatter plot
B.Histogram
C.Pie chart
D.Line chart
E.Stacked bar chart
AnswersC, E

Pie charts effectively show each category's proportion of the total.

Why this answer

A pie chart is ideal for showing the contribution of each product category to total sales because it visually represents parts of a whole, making it easy to compare proportions at a glance. The stacked bar chart also effectively shows category contributions within a total, allowing for both absolute and relative comparisons across different time periods or segments. Both chart types directly address the need to visualize proportional breakdowns of a single aggregate metric.

Exam trap

The trap here is that candidates often choose a line chart (Option D) for any sales data because they associate sales with trends, overlooking that the question specifically asks for contribution to total sales, not change over time.

11
Multi-Selecteasy

A data analyst discovers an anomaly in a dataset. Which two actions should be taken before reporting? (Choose TWO.)

Select 2 answers
A.Assume the anomaly is real and report it
B.Immediately alert all stakeholders
C.Verify the data source and extraction process
D.Check for data entry errors or technical glitches
E.Remove the anomaly without documentation
AnswersC, D

This confirms the anomaly is not due to data collection issues.

Why this answer

Option C is correct because before reporting an anomaly, the data analyst must verify the data source and extraction process to ensure the anomaly is not due to a pipeline error, such as a misconfigured ETL job or a corrupted data feed. This step confirms data integrity and prevents false alarms based on extraction artifacts rather than genuine data issues.

Exam trap

The trap here is that candidates may confuse 'immediate reporting' with proactive communication, but CompTIA tests the understanding that data validation must precede any stakeholder notification to maintain data credibility.

12
MCQhard

A large retail company is integrating customer data from two separate CRM systems into a new data warehouse. System A stores customer IDs as integers (e.g., 12345), while System B stores them as alphanumeric strings (e.g., 'CUST-12345-X'). Additionally, some customers exist in both systems but with slight name variations (e.g., 'John Smith' vs 'Jon Smith'). The data warehouse requires a unified customer table with a single unique identifier for each customer. The analyst needs to design the data acquisition process. Which of the following is the most appropriate first step?

A.Use a simple crosswalk table based on exact name matches to link records
B.Load all data from both systems into a staging table, then run a fuzzy matching algorithm to identify duplicates
C.Perform data profiling to analyze data distributions, data types, and quality issues in each source
D.Standardize all customer IDs to a common format (e.g., UUIDs) and then merge the tables
AnswerC

Profiling provides the necessary insights to plan transformations, handle inconsistencies, and design the matching strategy.

Why this answer

Option C is correct because data profiling is the foundational first step in any data integration project. It systematically assesses source data types, formats, completeness, and quality issues (e.g., integer vs. alphanumeric IDs, name variations) before designing transformation logic. Without profiling, subsequent steps like fuzzy matching or ID standardization risk being built on incorrect assumptions about the data.

Exam trap

The trap here is that candidates often jump to a technical solution (fuzzy matching or ID standardization) without recognizing that data profiling is the prerequisite step that validates source assumptions and prevents costly rework.

How to eliminate wrong answers

Option A is wrong because exact name matches cannot resolve the known name variations (e.g., 'John Smith' vs 'Jon Smith'), leading to missed linkages and duplicate customers. Option B is wrong because loading all data into a staging table before profiling risks propagating unknown data quality issues (e.g., inconsistent ID formats, nulls) into the staging area, making fuzzy matching less reliable and harder to tune. Option D is wrong because standardizing IDs to a common format (e.g., UUIDs) without first profiling the source data ignores the need to understand existing relationships and quality issues, and may break referential integrity if applied prematurely.
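A first profiling pass can be as simple as counting rows, missing values, distinct values, and data types per column. The sketch below is a minimal illustration; the `profile` helper and the sample ID lists are hypothetical, with values shaped like the examples in the question.

```python
from collections import Counter

def profile(values):
    """Summarize one column: row count, missing values, distinct values, types."""
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "types": dict(Counter(type(v).__name__ for v in present)),
    }

# Hypothetical samples of the customer ID column from each CRM system.
system_a_ids = [12345, 12346, None, 12345]
system_b_ids = ["CUST-12345-X", "CUST-12347-Y", "CUST-12345-X"]

profile_a = profile(system_a_ids)
profile_b = profile(system_b_ids)
```

Even this small summary surfaces the type mismatch, the missing value, and the repeated IDs that the later matching and standardization steps would have to handle.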

13
MCQeasy

A data analyst creates a line chart showing monthly sales over the past year. The chart uses a y-axis starting at $100,000 instead of zero. What is the most likely misinterpretation a viewer might have?

A.The differences between months are exaggerated, making small changes look large.
B.The sales appear to be decreasing when they are actually increasing.
C.The chart is correctly scaled, so no misinterpretation occurs.
D.The sales appear to be increasing when they are actually decreasing.
AnswerA

A non-zero baseline exaggerates differences, which can mislead viewers about the magnitude of change.

Why this answer

Option A is correct because starting the y-axis at $100,000 instead of zero truncates the baseline, which visually exaggerates the relative differences between monthly sales values. This is a common data visualization pitfall that can mislead viewers into perceiving small fluctuations as significant trends, violating the principle of using a zero baseline for bar and line charts to accurately represent proportional change.

Exam trap

The trap here is that candidates may think a truncated y-axis only affects bar charts or that it reverses trends, but CompTIA often tests the specific misinterpretation that small changes appear exaggerated due to the loss of a zero baseline, not that the direction of the trend is flipped.

How to eliminate wrong answers

Option B is wrong because a truncated y-axis does not inherently reverse the direction of a trend; it only amplifies the visual magnitude of changes, so sales that are actually increasing would still appear to increase, just more dramatically. Option C is wrong because the chart is not correctly scaled for accurate proportional interpretation; starting the y-axis at a non-zero value is a deliberate distortion that can mislead viewers, and best practices for data visualization recommend a zero baseline for line charts showing magnitude. Option D is wrong because a truncated y-axis does not reverse the direction of a trend; if sales are actually decreasing, they would still appear to decrease, but the visual drop would be exaggerated, not inverted.
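The size of the exaggeration can be worked out with simple arithmetic by comparing how tall two points are drawn above the axis baseline. The monthly figures below are hypothetical; only the $100,000 baseline comes from the question.

```python
def drawn_height_ratio(low, high, baseline):
    """Ratio of the plotted heights of two values above the axis baseline."""
    return (high - baseline) / (low - baseline)

# Hypothetical monthly sales figures.
january, february = 110_000, 120_000

# With a zero baseline, February is drawn about 9% taller than January.
true_ratio = drawn_height_ratio(january, february, baseline=0)

# With the axis starting at $100,000, February is drawn twice as tall.
truncated_ratio = drawn_height_ratio(january, february, baseline=100_000)
```

The direction of the change is the same in both cases; only its apparent magnitude differs.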

14
MCQmedium

A data analyst is performing a hypothesis test with a significance level of 0.05. The p-value obtained is 0.03. What should the analyst conclude?

A.Reject the null hypothesis
B.Fail to reject the null hypothesis
C.Accept the null hypothesis
D.The result is practically significant
AnswerA

p < alpha indicates statistically significant result.

Why this answer

Since the p-value (0.03) is less than the significance level (0.05), the result is statistically significant. This means the observed data provides sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The analyst should conclude that there is a statistically significant effect or difference.

Exam trap

The trap here is that candidates often confuse 'fail to reject' with 'accept' the null hypothesis, or they mistakenly think a p-value less than α means the null hypothesis is proven false with certainty, rather than just providing sufficient evidence to reject it.

How to eliminate wrong answers

Option B is wrong because failing to reject the null hypothesis occurs only when the p-value is greater than or equal to the significance level (p ≥ 0.05), not when it is smaller. Option C is wrong because hypothesis testing never 'accepts' the null hypothesis; we either reject it or fail to reject it, as acceptance implies proof of truth, which is not a valid statistical conclusion. Option D is wrong because practical significance is a separate consideration from statistical significance; a statistically significant result (p < 0.05) does not automatically imply practical importance, and the question only asks about the hypothesis test conclusion.
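The decision rule used above reduces to a single comparison. The helper below is a small sketch (the function name is illustrative) that returns the conventional wording for each outcome and never reports "accept".

```python
def decide(p_value, alpha=0.05):
    """Apply the standard rule: reject H0 only when p is strictly below alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

conclusion = decide(0.03, alpha=0.05)
```

Whether the detected effect is large enough to matter in practice is a separate question that this comparison does not answer.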

15
Multi-Selectmedium

Which TWO of the following are common assumptions of linear regression?

Select 2 answers
A.Independence of observations
B.No multicollinearity
C.Linearity of the relationship
D.Normality of the dependent variable
E.Homoscedasticity
AnswersC, E

Assumes linear relationship between predictors and outcome.

Why this answer

Linear regression assumes that the relationship between the independent and dependent variables is linear (option C). This means the model expects that a unit change in the predictor results in a constant change in the outcome, which is the core assumption for ordinary least squares (OLS) estimation to produce unbiased coefficients.

Exam trap

The trap here is that candidates often confuse the assumption of normality of residuals with normality of the dependent variable, leading them to incorrectly select option D instead of recognizing that homoscedasticity (option E) is a core assumption.

16
MCQeasy

A data analyst needs to combine two datasets: one contains customer information (customer_id, name, address) and the other contains order information (order_id, customer_id, order_date). The analyst wants to include all customers, even those who have not placed orders. Which type of join should be used?

A.FULL OUTER JOIN
B.INNER JOIN
C.LEFT JOIN
D.RIGHT JOIN
AnswerC

LEFT JOIN includes all customers, with order data where available.

Why this answer

A LEFT JOIN returns all rows from the left table (customers) and the matching rows from the right table (orders). If a customer has no orders, the order columns will contain NULLs. This satisfies the requirement to include all customers, even those without orders.

Exam trap

The trap here is that candidates often confuse LEFT JOIN with INNER JOIN, assuming all customers must have orders, or they pick FULL OUTER JOIN thinking it includes all customers, but it also includes unmatched orders, which is not required.

How to eliminate wrong answers

Option A is wrong because a FULL OUTER JOIN returns all rows from both tables, which would include unmatched orders (if any) — unnecessary for this requirement. Option B is wrong because an INNER JOIN returns only rows with matching keys in both tables, excluding customers who have never placed an order. Option D is wrong because a RIGHT JOIN returns all rows from the right table (orders) and only matching rows from the left table (customers), which would exclude customers without orders.
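The LEFT JOIN behavior can be verified with an in-memory SQLite database from Python's standard library. The column names follow the question; the table names `customers` and `orders` and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ana', '1 Elm St'), (2, 'Ben', '2 Oak Ave');
    INSERT INTO orders VALUES (10, 1, '2024-03-01');
    """
)

# Customers is the left table, so every customer appears in the result.
rows = conn.execute(
    """
    SELECT c.customer_id, c.name, o.order_id, o.order_date
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
    """
).fetchall()
conn.close()
```

Ben has no orders, so his row comes back with `None` (SQL NULL) in the order columns instead of being dropped as it would be with an INNER JOIN.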

17
Multi-Selecthard

A data analyst is creating a data story about sales performance. Which THREE elements are essential for effective data storytelling? (Choose THREE.)

Select 3 answers
A.Raw data tables for reference.
B.A clear narrative with a beginning, middle, and end.
C.Context and background information.
D.Use of multiple chart types to show variety.
E.A call to action.
AnswersB, C, E

Provides structure and guides the audience.

Why this answer

Option B is correct because a clear narrative with a beginning, middle, and end is the structural backbone of effective data storytelling. It guides the audience through the data insights in a logical, engaging sequence, transforming raw numbers into a compelling story that drives understanding and retention.

Exam trap

CompTIA often tests the distinction between supporting elements (like raw data tables or chart variety) and the core structural components (narrative, context, call to action) that define effective data storytelling.

18
MCQmedium

Refer to the exhibit. What does the query return?

A.All orders grouped by customer ID.
B.Customers who have placed at least 5 orders.
C.Customers who have placed more than 5 orders.
D.All customers who have placed orders.
AnswerC

HAVING COUNT(*) > 5 ensures only customers with more than 5 orders are included.

Why this answer

The query uses a HAVING clause with COUNT(*) > 5, which filters groups (by customer ID) to only those with more than 5 orders. The GROUP BY customer ID ensures the count is per customer, so the result is customers who have placed more than 5 orders. Option C is correct because the condition is strictly greater than 5, not at least 5.

Exam trap

CompTIA often tests the distinction between 'at least' (>=) and 'more than' (>) in HAVING clauses, and candidates may misread the condition as including exactly 5 orders.

How to eliminate wrong answers

Option A is wrong because the query does not return all orders; it returns aggregated results (counts) per customer, not individual order rows. Option B is wrong because the condition is COUNT(*) > 5, not COUNT(*) >= 5; 'at least 5' would include exactly 5, which is excluded by the strict greater-than operator. Option D is wrong because the HAVING clause filters out customers with 5 or fewer orders; the query does not return all customers who have placed orders, only those exceeding the threshold.
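The exhibit itself is not reproduced here, so the sketch below assumes a query of the general shape the explanation describes (GROUP BY customer, HAVING COUNT(*) > 5) and runs it against a hypothetical `orders` table in SQLite to contrast `>` with `>=`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")

# Hypothetical data: customer 1 has 6 orders, customer 2 exactly 5, customer 3 only 1.
order_counts = {1: 6, 2: 5, 3: 1}
conn.executemany(
    "INSERT INTO orders (customer_id) VALUES (?)",
    [(cid,) for cid, n in order_counts.items() for _ in range(n)],
)

query = """
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) {op} 5
    ORDER BY customer_id
"""
more_than_five = conn.execute(query.format(op=">")).fetchall()
at_least_five = conn.execute(query.format(op=">=")).fetchall()
conn.close()
```

Customer 2, with exactly five orders, appears only in the `>=` result, which is the distinction between options B and C.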

19
Matchingmedium

Match each data quality dimension to its description.

Concepts
Matches

Accuracy: Degree to which data correctly reflects real-world values

Completeness: Extent to which all required data is present

Consistency: Absence of contradictions across data sources

Timeliness: Data is up-to-date and available when needed

Uniqueness: No duplicate records exist within the dataset

Why these pairings

These are key data quality dimensions tested in Data+.

20
MCQhard

A data analyst notices that a column labeled 'Income' contains values like '$50,000' and '$75,000', but also 'High' and 'Low'. What data concept issue is occurring?

A.Mixing quantitative and qualitative data
B.Mixing discrete and continuous data
C.Mixing nominal and ordinal data
D.Mixing structured and unstructured data
AnswerA

Income should be quantitative, but text labels are qualitative.

Why this answer

The 'Income' column contains both numeric values (e.g., '$50,000', '$75,000') which are quantitative data, and categorical labels ('High', 'Low') which are qualitative data. Mixing these two distinct data types in a single column violates data consistency principles and prevents proper statistical analysis or machine learning processing. This is a classic example of mixing quantitative and qualitative data.

Exam trap

CompTIA often tests the distinction between data type categories (quantitative vs. qualitative) versus subtypes (discrete/continuous or nominal/ordinal), so candidates mistakenly pick a subtype option when the core issue is the fundamental type mismatch.

How to eliminate wrong answers

Option B is wrong because discrete and continuous data are both subtypes of quantitative data (e.g., number of children vs. height), but the issue here is mixing numbers with text labels, not distinguishing between countable and measurable values. Option C is wrong because nominal and ordinal data are both categorical (qualitative) subtypes (e.g., colors vs. rankings), but the column includes actual numeric income values, not just ordered categories. Option D is wrong because structured data refers to organized formats like tables (which this column is part of), while unstructured data refers to free-form text or media; the problem is not about format but about inconsistent data types within a structured field.
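A quick way to confirm this kind of problem is to split a column into values that parse as amounts and values that do not. The sketch below is illustrative only: the `split_by_kind` helper and the currency pattern are assumptions, and the sample values are the ones quoted in the question.

```python
import re

# Assumed format for amounts: optional dollar sign, digits with optional commas,
# and an optional decimal part.
AMOUNT = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")

def split_by_kind(values):
    """Separate quantitative entries (parsed to float) from qualitative labels."""
    numeric, labels = [], []
    for value in values:
        text = value.strip()
        if AMOUNT.fullmatch(text):
            numeric.append(float(text.replace("$", "").replace(",", "")))
        else:
            labels.append(text)
    return numeric, labels

income = ["$50,000", "$75,000", "High", "Low"]
numeric, labels = split_by_kind(income)
```

A non-empty `labels` list signals that the column mixes quantitative and qualitative values and needs to be cleaned or split before analysis.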

21
Multi-Selecthard

A company is designing a dashboard for real-time monitoring. Which THREE considerations are most critical?

Select 3 answers
A.Color palette aesthetics
B.Alert thresholds
C.Mobile responsiveness
D.Drill-down capability
E.Data refresh frequency
AnswersB, C, E

Thresholds trigger notifications when metrics go out of range, enabling prompt action.

Why this answer

Alert thresholds (B) are critical for real-time monitoring because they define the conditions that trigger notifications when metrics exceed or fall below acceptable ranges. Without thresholds, the dashboard cannot proactively alert operators to anomalies, defeating the purpose of real-time oversight. This directly supports the domain of communicating data insights by ensuring actionable alerts are delivered promptly.

Exam trap

CompTIA often tests the misconception that aesthetic or exploratory features (like color palettes or drill-downs) are as critical as operational necessities (like thresholds and refresh frequency), leading candidates to overlook the core requirements for real-time monitoring.

22
MCQeasy

A data analyst needs to present findings about customer churn to business stakeholders. The analysis identified that churn is highest among customers who have called customer support more than three times in the last month. Which of the following is the best way to communicate this insight?

A.A scatter plot to show the relationship between support calls and churn.
B.A pie chart showing the proportion of churned vs. retained customers.
C.A bar chart comparing churn rates for different support call counts.
D.A table of raw churn data by customer ID.
AnswerC

A bar chart effectively shows the relationship between a categorical variable (call count bins) and churn rate.

Why this answer

Option C is correct because a bar chart directly compares churn rates across discrete categories of support call counts (e.g., 0, 1, 2, 3, 4+ calls), making it easy for stakeholders to see the spike at 'more than three calls'. This aligns with the insight that churn is highest among customers with >3 support calls, and a bar chart is the standard visualization for comparing a continuous metric (churn rate) across categorical bins.

Exam trap

The trap here is that candidates may choose a scatter plot (Option A) because they think it shows 'relationship', but they fail to recognize that a scatter plot is inappropriate for a binary dependent variable and discrete independent variable, whereas a bar chart is the correct choice for comparing rates across categories.

How to eliminate wrong answers

Option A is wrong because a scatter plot is used to show the relationship between two continuous variables, but here the independent variable (number of support calls) is discrete and the dependent variable (churn) is binary, so a scatter plot would produce overlapping points and fail to clearly communicate the categorical threshold of 'more than three calls'. Option B is wrong because a pie chart only shows the overall proportion of churned vs. retained customers, which does not convey the relationship between support call frequency and churn, missing the key insight entirely. Option D is wrong because a table of raw churn data by customer ID presents unaggregated, granular data that obscures the pattern and is not suitable for a high-level stakeholder presentation; it would require the audience to manually compute churn rates per call count.
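The values behind such a bar chart are churn rates aggregated per support-call bin. The sketch below computes them from hypothetical customer records; the data and the `call_bin` helper are made up for illustration, with a "4+" bin standing in for "more than three calls".

```python
from collections import defaultdict

# Hypothetical records: (support calls in the last month, churned?).
customers = [
    (0, False), (0, False),
    (1, False), (1, False), (1, False), (1, True),
    (2, False), (2, True), (2, False), (2, False),
    (3, True), (3, False),
    (4, True), (5, True), (7, False), (4, True),
]

def call_bin(calls):
    """Group call counts into the categories shown on the chart's x-axis."""
    return "4+" if calls > 3 else str(calls)

totals, churned = defaultdict(int), defaultdict(int)
for calls, did_churn in customers:
    label = call_bin(calls)
    totals[label] += 1
    churned[label] += int(did_churn)

# One bar per bin: the share of customers in that bin who churned.
churn_rates = {label: churned[label] / totals[label] for label in sorted(totals)}
```

Each key in `churn_rates` becomes one bar, so the tallest bar at "4+" makes the insight visible without the audience reading raw rows.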

23
MCQeasy

A marketing team needs to store customer feedback from social media posts, including text, images, and emojis. Which data concept is most appropriate for this storage?

A.Unstructured data in a NoSQL document database
B.Structured data in a relational database
C.Unstructured data in a relational database
D.Semi-structured data in an XML database
AnswerA

NoSQL document databases store unstructured data such as text, images, and emojis without a fixed schema.

Why this answer

Customer feedback from social media includes text, images, and emojis, which lack a predefined schema and are best stored as unstructured data. NoSQL document databases (e.g., MongoDB) store such data in flexible JSON-like documents, allowing each record to have varying fields and data types without requiring a fixed schema.

Exam trap

CompTIA often tests the misconception that 'unstructured data' cannot be stored in any database, when in fact NoSQL document databases are purpose-built for it, while relational databases require rigid schemas that fail with variable content.

How to eliminate wrong answers

Option B is wrong because structured data in a relational database requires a fixed schema with predefined columns and data types, which cannot efficiently handle variable-length text, images, and emojis without complex workarounds like BLOBs. Option C is wrong because relational databases are designed for structured data; storing unstructured data in them forces schema rigidity and poor performance for heterogeneous content. Option D is wrong because XML databases are semi-structured and impose hierarchical markup, which is unnecessary overhead for social media posts that are naturally schema-less and better served by document stores.

24
MCQhard

A data scientist is analyzing a dataset with 100 variables and 5,000 records. The dataset has several missing values and a few extreme outliers. The goal is to build a regression model to predict a continuous target. Which combination of preprocessing steps is most likely to improve model performance?

A.Impute missing values with median, apply robust scaling, and then log transform skewed variables
B.Impute missing values with mean, then use PCA for dimensionality reduction
C.Drop all rows with missing values, then apply min-max scaling
D.Remove outliers using Z-score, then apply standard scaling
AnswerA

Median imputation is robust, robust scaling handles outliers, log transform handles skewness.

Why this answer

Option A is correct because imputing missing values with the median is robust to outliers, robust scaling handles extreme values by using median and IQR, and log transformation reduces skewness in predictors. This combination preserves data integrity and stabilizes variance, which is critical for regression models on a dataset with 100 variables and 5,000 records.

Exam trap

CompTIA often tests the misconception that mean imputation and standard scaling are universally safe, but the trap here is that outliers and skewness require robust methods like median imputation and robust scaling to avoid distorting the model.

How to eliminate wrong answers

Option B is wrong because imputing with the mean is sensitive to outliers, which can distort the distribution and negatively affect PCA, and PCA may discard important variance related to the target. Option C is wrong because dropping all rows with missing values reduces the already limited 5,000 records, potentially losing significant information and introducing bias, and min-max scaling is not robust to outliers. Option D is wrong because removing outliers using Z-score assumes a normal distribution, which may not hold with skewed variables, and standard scaling is also sensitive to outliers, leading to poor model performance.
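
To see why the median-based steps are preferred when outliers are present, consider the rough sketch below. The `income` values and helper names are assumptions made for illustration; only the techniques (median imputation, scaling by median and IQR) come from the explanation above.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical column with one missing value and one extreme outlier.
income = [30_000, 32_000, None, 35_000, 38_000, 1_000_000]

observed = [v for v in income if v is not None]
mean_fill = statistics.mean(observed)   # pulled far upward by the outlier
filled = impute_median(income)          # uses the median instead
scaled = robust_scale(filled)

print("mean would impute:", mean_fill)
print("median imputes:   ", filled[2])
print("robust-scaled:    ", [round(v, 2) for v in scaled])
```

With this sample the mean is dragged to a value no typical record comes close to, while the median stays among the ordinary values, which is the behavior the explanation relies on.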

25
MCQmedium

In a dataset with variables on different scales (e.g., age in years and income in dollars), which preprocessing step is necessary before applying k-means clustering?

A.Feature selection
B.Dimensionality reduction
C.Normalization (scaling)
D.One-hot encoding
AnswerC

Normalization ensures each feature contributes equally to distance calculations.

Why this answer

K-means clustering relies on Euclidean distance to measure similarity between data points. When variables like age (in years) and income (in dollars) are on different scales, the variable with larger numeric values (income) will dominate the distance calculation, skewing the clustering results. Normalization (scaling), such as min-max scaling or z-score standardization, rescales all features to a comparable range (e.g., [0,1] or mean=0, variance=1), ensuring each feature contributes equally to the distance computation.

Exam trap

The trap here is that candidates may confuse normalization with other preprocessing steps like feature selection or dimensionality reduction, thinking that removing irrelevant features or reducing dimensions will automatically fix scale differences, but k-means specifically requires scaling to ensure equal feature influence in distance calculations.

How to eliminate wrong answers

Option A is wrong because feature selection is about choosing a subset of relevant features to reduce noise or improve model performance, but it does not address the issue of differing scales among features, which is required before k-means. Option B is wrong because dimensionality reduction (e.g., PCA) reduces the number of features, but it does not inherently scale the data; scaling is typically performed before dimensionality reduction, not as a substitute for it. Option D is wrong because one-hot encoding is used to convert categorical variables into numerical format, not to handle numerical variables on different scales; applying one-hot encoding to already numerical features would be incorrect and does not solve the scaling problem.
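
The effect of scale on Euclidean distance can be checked directly. The sketch below uses invented ages and incomes (assumptions for illustration) and z-score standardization as one of the scaling choices mentioned above.

```python
import math
import statistics

def z_score(column):
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(v - mean) / sd for v in column]

# Hypothetical customers: person 1 is 35 years older than person 0,
# person 2 is almost the same age but earns 2,000 more.
ages = [25, 60, 26, 40, 45]
incomes = [50_000, 50_500, 52_000, 90_000, 30_000]

raw = list(zip(ages, incomes))
scaled = list(zip(z_score(ages), z_score(incomes)))

raw_d01 = math.dist(raw[0], raw[1])
raw_d02 = math.dist(raw[0], raw[2])
scaled_d01 = math.dist(scaled[0], scaled[1])
scaled_d02 = math.dist(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"scaled: d(0,1)={scaled_d01:.3f}  d(0,2)={scaled_d02:.3f}")
```

On the raw values the 35-year age gap barely registers next to a small income difference; after scaling, the ordering of the two distances reverses, so age gets a real say in which points count as neighbors.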

26
MCQhard

A data analyst at a retail company is building a multiple linear regression model to forecast weekly sales. The dataset contains 50 predictor variables, including store size, promotional spend, holiday indicators, and many others. After training the model, the analyst observes an R-squared of 0.99 on the training set but only 0.55 on the holdout test set. Which action should the analyst take first to address this discrepancy?

A.Remove highly correlated predictor variables and apply regularization (e.g., Ridge or Lasso).
B.Add more predictor variables to increase the training R-squared further.
C.Use k-fold cross-validation with a different random seed to get a more reliable test set estimate.
D.Increase the number of hidden layers in the model to capture more complexity.
AnswerA

Regularization and feature selection reduce overfitting by penalizing large coefficients and removing redundant predictors.

Why this answer

The high R-squared of 0.99 on training data versus 0.55 on test data is a classic sign of overfitting, where the model has learned noise and specific patterns in the training set that do not generalize. Removing highly correlated predictors reduces multicollinearity and model complexity, while regularization (Ridge or Lasso) penalizes large coefficients, shrinking them to prevent overfitting. This is the most direct first step to improve generalization.

Exam trap

The trap here is that candidates may think a high R-squared is always good, or they may confuse overfitting with underfitting and choose to add more complexity (Option D) or more predictors (Option B), rather than recognizing the need to reduce model complexity and apply regularization.

How to eliminate wrong answers

Option B is wrong because adding more predictor variables would increase the training R-squared but worsen overfitting, making the test set performance even lower. Option C is wrong because k-fold cross-validation with a different random seed does not address the fundamental overfitting issue; it only provides a different estimate of test error but does not change the model's tendency to overfit. Option D is wrong because increasing the number of hidden layers (a neural network technique) is irrelevant for a multiple linear regression model and would introduce unnecessary complexity, likely exacerbating overfitting.
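
How a Ridge penalty shrinks coefficients can be illustrated with the single-predictor case, where the penalized slope has a simple closed form: the covariance term divided by the variance term plus the penalty. The data and the `ridge_slope` name below are assumptions for illustration; a real 50-predictor model would use a library implementation.

```python
import statistics

def ridge_slope(x, y, penalty):
    """Slope of a one-predictor ridge fit on centered data:
    sum(xy) / (sum(x^2) + penalty). penalty=0 gives ordinary least squares."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / (sxx + penalty)

# Hypothetical data: promotional spend vs. weekly sales (arbitrary units).
spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]

for penalty in (0.0, 1.0, 10.0):
    print(f"penalty={penalty:>4}: slope={ridge_slope(spend, sales, penalty):.3f}")
```

Larger penalties pull the slope toward zero, which is the mechanism that limits how closely the model can chase noise in the training set.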

27
MCQeasy

A marketing analyst needs to combine customer data from a CRM database with social media engagement data from a third-party API. Which data acquisition method is most appropriate?

A.Web scraping
B.Manual data entry
C.API integration
D.Batch file upload
AnswerC

API integration provides structured, real-time access to third-party data, which is ideal for social media engagement data.

Why this answer

API integration is the most appropriate method because it allows the analyst to programmatically retrieve structured social media engagement data directly from the third-party service's RESTful or GraphQL API endpoints. This approach ensures real-time or near-real-time data synchronization, supports authentication (e.g., OAuth 2.0), and returns data in standardized formats like JSON or XML, which can be directly ingested into the CRM system without manual intervention.

Exam trap

The trap here is that candidates may confuse web scraping with API integration, assuming both can retrieve web data, but the question specifically requires combining structured data from a third-party API, where web scraping would be unreliable, unauthorized, and technically inappropriate for programmatic data acquisition.

How to eliminate wrong answers

Option A is wrong because web scraping is used to extract unstructured data from HTML pages, which is inefficient, brittle, and often violates the third-party API's terms of service; it is not designed for reliable, authenticated access to structured social media metrics. Option B is wrong because manual data entry is error-prone, time-consuming, and impractical for large volumes of social media engagement data, and it lacks any automated validation or consistency checks. Option D is wrong because batch file upload assumes the data is already exported into a file (e.g., CSV) and delivered manually, which introduces latency and requires the third-party to support file exports, whereas the API provides direct, on-demand access to live data.
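
Once an API returns JSON, combining it with CRM records is mostly a keyed lookup. The sketch below skips the network call entirely and works on a hard-coded response body; the payload shape, field names, and CRM records are all hypothetical.

```python
import json

# Hypothetical API response body; the field names are assumptions.
api_response = """
{"data": [
  {"customer_id": "C001", "likes": 12, "shares": 3},
  {"customer_id": "C002", "likes": 5, "shares": 0}
]}
"""

# Hypothetical CRM extract keyed by customer ID.
crm_records = {
    "C001": {"name": "Avery", "segment": "retail"},
    "C002": {"name": "Blake", "segment": "wholesale"},
    "C003": {"name": "Casey", "segment": "retail"},
}

def merge_engagement(crm, payload):
    """Attach engagement metrics to each CRM record; customers missing
    from the API response get None rather than a made-up value."""
    engagement = {row["customer_id"]: row for row in json.loads(payload)["data"]}
    merged = []
    for customer_id, record in crm.items():
        social = engagement.get(customer_id, {})
        merged.append({
            "customer_id": customer_id,
            **record,
            "likes": social.get("likes"),
            "shares": social.get("shares"),
        })
    return merged

merged = merge_engagement(crm_records, api_response)
for row in merged:
    print(row)
```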

28
Multi-Selecthard

Which THREE are best practices for acquiring data via web scraping? (Select exactly 3)

Select 3 answers
A.Use multiple IP addresses
B.Respect robots.txt
C.Identify yourself with a user-agent
D.Scrape all data without regard to terms
E.Limit request rate
AnswersB, C, E

Legal and ethical best practice.

Why this answer

Options B, C, and E are correct: respecting robots.txt is the legal and ethical baseline, identifying yourself with a user-agent provides transparency, and limiting the request rate avoids overloading servers. Option A is wrong because using multiple IP addresses is not a best practice in itself; it is typically used to avoid blocking and may violate the site's terms. Option D is wrong because scraping all data without regard to terms is unethical and possibly illegal.
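
The three practices (honoring robots.txt, sending an identifying user-agent, and throttling requests) can be sketched with Python's standard `urllib.robotparser`. The robots.txt content, user-agent string, and URLs below are hypothetical, and no requests are actually sent.

```python
import time
import urllib.robotparser

# Hypothetical identity; a real scraper would send this as its User-Agent header.
USER_AGENT = "example-research-bot/0.1 (contact: data-team@example.com)"

# Hypothetical robots.txt content; a real scraper would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def allowed_urls(urls, delay_seconds=0.0):
    """Yield only the URLs robots.txt permits for this user-agent,
    pausing after each one to limit the request rate."""
    for url in urls:
        if robots.can_fetch(USER_AGENT, url):
            yield url
            time.sleep(delay_seconds)

candidates = [
    "https://example.com/products",
    "https://example.com/private/admin",
]
to_fetch = list(allowed_urls(candidates))
print(to_fetch)
```

In real use `delay_seconds` would be set to a polite non-zero value; it defaults to zero here only so the example runs instantly.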

29
MCQhard

A data analyst is creating a report on customer satisfaction scores across different regions. The analyst wants to highlight regions that are significantly below average. Which of the following statistical methods is most appropriate for identifying these outliers?

A.Bar chart with average line.
B.Pie chart of satisfaction categories.
C.Box plot with interquartile range (IQR) to identify outliers.
D.Scatter plot of satisfaction vs. region.
AnswerC

Correct. IQR-based box plots are a standard method for identifying statistical outliers.

Why this answer

A box plot with interquartile range (IQR) is the most appropriate method because it explicitly identifies outliers as data points falling below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. This directly addresses the analyst's goal of highlighting regions significantly below average, as the IQR method is a standard statistical technique for detecting extreme values in a distribution.

Exam trap

The trap here is that candidates may choose a bar chart with an average line (Option A) because it visually shows deviations, but it lacks a formal statistical criterion to define 'significantly below average,' which the IQR-based box plot provides.

How to eliminate wrong answers

Option A is wrong because a bar chart with an average line only shows the mean and individual region values, but does not provide a statistical threshold to determine which regions are significantly below average; it merely visualizes deviations without identifying outliers. Option B is wrong because a pie chart of satisfaction categories shows proportions of categorical data, not numerical scores across regions, and cannot identify outliers or deviations from the mean. Option D is wrong because a scatter plot of satisfaction vs. region treats region as a categorical variable on one axis, which does not produce a meaningful distribution for outlier detection; it would simply plot points per region without any statistical measure of dispersion or outlier boundaries.
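
The Q1 - 1.5*IQR rule that a box plot draws can also be computed directly. In this sketch the regional scores are invented, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
import statistics

# Hypothetical average satisfaction score per region.
scores = {
    "North": 82, "South": 80, "East": 79, "West": 81,
    "Central": 83, "Coastal": 78, "Island": 55,
}

def low_outliers(region_scores):
    """Return regions whose score falls below Q1 - 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(region_scores.values(), n=4, method="inclusive")
    lower_fence = q1 - 1.5 * (q3 - q1)
    return [region for region, score in region_scores.items() if score < lower_fence]

print(low_outliers(scores))
```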

30
Multi-Selecteasy

Which TWO of the following are characteristics of structured data? (Choose TWO.)

Select 2 answers
A.Stored in rows and columns
B.Lacks a fixed schema
C.Has a predefined data model
D.Uses tags to define elements
E.Consists of free-form text
AnswersA, C

Tabular storage is a hallmark of structured data like relational databases.

Why this answer

Option A is correct because structured data is organized into rows and columns, typically within relational databases or spreadsheets, where each column represents a specific attribute and each row represents a record. This tabular format enables efficient querying, sorting, and indexing using languages like SQL. Relational databases that hold such data also typically enforce consistency through ACID (Atomicity, Consistency, Isolation, Durability) transactions. Option C is correct because structured data conforms to a predefined data model (schema) that fixes each field's name and data type before any data is stored.

Exam trap

The trap here is that candidates often confuse semi-structured data (which uses tags or labels) with structured data, or they incorrectly assume structured data can lack a schema, when in fact a predefined schema is its defining requirement.

31
Multi-Selecthard

Which THREE of the following are appropriate methods to handle outliers in a dataset?

Select 3 answers
A.Transforming the data using log transformation
B.Removing the outlier records
C.Capping the outlier values at a certain percentile
D.Binning continuous variables
E.Imputing outliers with the mean
AnswersA, B, C

Transformation can reduce the impact of outliers.

Why this answer

Option A is correct because log transformation compresses the scale of data, reducing the impact of extreme values and making the distribution more symmetric. This is a standard technique for handling skewed data where outliers are present, as it preserves the relative order of observations while mitigating outlier influence.

Exam trap

The trap here is that candidates may confuse data preprocessing techniques like binning or imputation with outlier handling methods, but binning is for discretization and mean imputation is not robust for outliers, while the correct methods (transformation, removal, capping) directly address outlier impact.
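
The three accepted methods can be compared side by side on one small column. The values and bounds below are assumptions for illustration; in practice the bounds would come from a chosen percentile or an IQR fence.

```python
import math

# Hypothetical order values with one extreme outlier.
values = [12, 15, 14, 13, 16, 400]

def log_transform(data):
    """Compress the scale; the order of observations is preserved."""
    return [math.log10(v) for v in data]

def remove_outside(data, lower, upper):
    """Drop records outside the accepted range."""
    return [v for v in data if lower <= v <= upper]

def cap(data, lower, upper):
    """Keep every record but clamp extreme values to the bounds."""
    return [min(max(v, lower), upper) for v in data]

print("log:    ", [round(v, 2) for v in log_transform(values)])
print("removed:", remove_outside(values, 12, 16))
print("capped: ", cap(values, 12, 16))
```

Removal shortens the dataset, capping keeps its length but alters the extreme value, and the log transform keeps every record while shrinking the gap between typical and extreme values.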

32
MCQeasy

A marketing team wants to collect data on competitor pricing for similar products. Which data source is most appropriate?

A.Customer surveys
B.Internal ERP system
C.External public web scraping
D.Internal sales data
AnswerC

Web scraping can collect competitor pricing from public websites.

Why this answer

External public web scraping is the most appropriate data source because competitor pricing is publicly available on websites, and web scraping allows automated extraction of this structured or unstructured data. This approach directly addresses the need for external competitive intelligence without relying on internal or customer-reported data.

Exam trap

The trap here is that candidates may confuse internal data sources (ERP, sales) with external data needs, or mistakenly think customer surveys can provide accurate, unbiased competitor pricing data.

How to eliminate wrong answers

Option A is wrong because customer surveys collect subjective opinions and self-reported data, not objective, real-time competitor pricing from external sources. Option B is wrong because an internal ERP system contains only the company's own operational and financial data, not competitor pricing information. Option D is wrong because internal sales data reflects the company's own transactions and pricing, not competitor pricing.

33
Matchingmedium

Match each data governance role to its responsibility.

Concepts
Matches

Ensures data quality and adherence to policies

Manages technical environment and data access

Has accountability for specific data assets

Sets strategic direction for data management

Designs data structures and integration processes

Why these pairings

These roles are defined in data governance frameworks.

34
MCQhard

A financial analytics firm needs to acquire historical stock market tick data (millions of records per day) from a data vendor. The vendor provides data via FTP in binary format. The firm's existing infrastructure uses on-premise servers with limited storage and processing power. They need to stream the data into a cloud data lake for analysis. However, the binary format is proprietary and requires a licensed decoder. The budget is constrained. Which approach best meets the data acquisition requirements?

A.Negotiate with the vendor to provide an API that outputs JSON
B.Purchase a license for the decoder and set up an ETL job to convert and upload
C.Request the vendor to send data in CSV format via email
D.Use a third-party cloud service that already decodes and normalizes the data for a subscription fee
AnswerB

This allows processing of the binary format and integration with the cloud data lake, with a one-time cost.

Why this answer

Purchasing the decoder license and setting up an ETL job to convert and upload to the cloud directly addresses the format issue and enables streaming to the data lake. Requesting CSV via email is impractical for millions of records. Negotiating for a JSON API is a good idea but may not be available or cost more than the decoder.

Using a third-party service adds recurring costs and may introduce dependency.

35
MCQmedium

To consolidate data from multiple operational databases into a central repository for reporting, a company decides to transform data before loading it into the target system. Which data integration approach is being used?

A.ETL (Extract, Transform, Load)
B.Data virtualization
C.Change data capture
D.ELT (Extract, Load, Transform)
AnswerA

ETL transforms data during the integration process before loading into the target.

Why this answer

The scenario describes transforming data before loading it into the target system, which is the defining characteristic of ETL (Extract, Transform, Load). In ETL, data is extracted from source systems, transformed in a staging area (e.g., cleaning, aggregating, joining), and then loaded into the central repository. This approach is commonly used when the target system (e.g., a data warehouse) requires pre-processed, high-quality data for reporting.

Exam trap

The trap here is that candidates often confuse ETL with ELT, assuming that any transformation before loading is ELT, but the key distinction is that ELT loads raw data first and transforms it later inside the target system, whereas ETL transforms data before it reaches the target.

How to eliminate wrong answers

Option B (Data virtualization) is wrong because it does not physically move or transform data before loading; instead, it creates a virtual layer that queries source systems in real-time, leaving data in place. Option C (Change data capture) is wrong because it is a technique for identifying and capturing only changed data from source systems, not a complete integration approach that includes transformation before loading. Option D (ELT) is wrong because it loads raw data into the target system first and then transforms it within the target, which contradicts the 'transform before loading' requirement in the question.
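
The ordering that defines ETL (clean first, load second) can be shown with an in-memory SQLite database standing in for the central repository. The source rows, table name, and cleaning rules are hypothetical.

```python
import sqlite3

# Extract: hypothetical rows pulled from two operational databases,
# with inconsistent casing, stray whitespace, and one overlapping customer.
source_a = [("c001", " Alice ", "2024-01-05"), ("c002", "Bob", "2024-01-06")]
source_b = [("C002", "Bob", "2024-01-06"), ("C003", "carol", "2024-01-07")]

def transform(rows):
    """Transform: standardize IDs and names and drop duplicates
    before anything reaches the target system."""
    cleaned = {(cid.upper(), name.strip().title(), day) for cid, name, day in rows}
    return sorted(cleaned)

# Load: only the cleaned rows are written to the target.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, first_seen TEXT)"
)
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", transform(source_a + source_b)
)

loaded = warehouse.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(loaded)
```

In an ELT design the raw rows would be inserted first and the cleanup would run afterwards as SQL inside the target.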

36
MCQhard

Refer to the exhibit. Which data quality dimension is being violated?

A.Uniqueness
B.Consistency
C.Timeliness
D.Completeness
AnswerB

Consistency requires that data values not contradict one another; conflicting names for the same customer ID violate this.

Why this answer

The exhibit shows the same customer ID (C001) associated with two different customer names ('John Smith' and 'Jon Smith'), which violates the consistency dimension. Consistency requires that data values be free from contradiction and adhere to the same representation rules across the dataset. Here, the same identifier maps to two conflicting names, so the records contradict each other.

Exam trap

The trap here is that candidates confuse consistency with uniqueness, assuming any conflict between rows must be a duplicate record issue, when in fact consistency violations involve contradictory values for the same identifier across multiple records.

How to eliminate wrong answers

Option A is wrong because uniqueness is about ensuring no duplicate records exist for the same entity, but here the issue is conflicting attribute values for the same ID, not duplicate rows. Option C is wrong because timeliness concerns whether data is up-to-date and available when needed, which is not indicated by the name mismatch. Option D is wrong because completeness checks for missing values, but both records have all fields populated; the problem is contradictory data, not absent data.
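
A consistency check of this kind amounts to grouping by identifier and flagging any ID that maps to more than one value. The C001 names below are the ones described for the exhibit; the C002 rows and the `conflicting_ids` helper are added assumptions.

```python
from collections import defaultdict

records = [
    ("C001", "John Smith"),
    ("C001", "Jon Smith"),
    ("C002", "Ana Lopez"),   # hypothetical extra rows: same ID, same name
    ("C002", "Ana Lopez"),
]

def conflicting_ids(rows):
    """Return each ID that is associated with more than one distinct name."""
    names_by_id = defaultdict(set)
    for customer_id, name in rows:
        names_by_id[customer_id].add(name)
    return {cid: sorted(names) for cid, names in names_by_id.items() if len(names) > 1}

print(conflicting_ids(records))
```

The repeated C002 rows are not flagged here because they agree with each other; they would be a uniqueness problem, which is a separate check.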

37
Multi-Selecthard

Which THREE of the following are NoSQL database types?

Select 3 answers
A.Document
B.Hierarchical
C.Relational
D.Key-Value
E.Graph
AnswersA, D, E

Document stores (e.g., MongoDB) are NoSQL.

Why this answer

Option A is correct because document databases, such as MongoDB, store data in flexible, JSON-like documents (BSON in MongoDB's case). This allows for nested structures and schema-less designs, making them a core NoSQL category distinct from relational models.

Exam trap

CompTIA often tests the distinction between legacy database models (hierarchical) and modern NoSQL categories, leading candidates to mistakenly include hierarchical as a NoSQL type due to its non-relational nature.

38
Multi-Selectmedium

Which TWO of the following are common methods for acquiring data from external sources?

Select 2 answers
A.Data warehousing
B.Manual data entry
C.Public APIs
D.Web scraping
E.Direct database connection to an internal server
AnswersC, D

APIs provide structured access to external data.

Why this answer

Public APIs (C) are a common method for acquiring data from external sources because they provide a standardized, programmatic interface (often RESTful over HTTP/HTTPS) for requesting and receiving structured data, such as JSON or XML, from third-party services like social media platforms or weather services. Web scraping (D) is another common method that involves programmatically extracting data from web pages by parsing HTML or DOM structures, often using tools like BeautifulSoup or Selenium, when no API is available.

Exam trap

The trap here is that candidates may confuse data warehousing (a storage/management process) with data acquisition methods, or think manual data entry is a valid external acquisition method, when the exam specifically tests automated, programmatic techniques for pulling data from outside the organization.

39
MCQmedium

A healthcare analytics team is building a dashboard to monitor patient vitals. They receive data from two sources: Source 1 provides 'heart rate' as an integer (beats per minute), and Source 2 provides 'blood pressure' as a ratio (systolic/diastolic, e.g., 120/80). The team wants to create a combined metric called 'cardiac stress index' that uses both heart rate and systolic blood pressure. However, they notice that heart rate data occasionally contains negative values due to sensor errors. The data governance policy requires that all data be valid and meaningful. Which action best addresses the data quality issue while preserving the data types?

A.Convert heart rate to absolute values (remove the negative sign)
B.Keep negative values but set them to NULL to indicate missing data
C.Change heart rate from integer to categorical (e.g., low, normal, high) to avoid negative issues
D.Remove all records with negative heart rate values as they are invalid
AnswerD

Negative heart rates are not physiologically possible, so deletion is appropriate for data quality.

Why this answer

Option D is correct because negative heart rate values are physiologically impossible and violate the data governance policy requiring valid and meaningful data. Removing these records ensures the dashboard only contains accurate, actionable data without altering the original integer data type of heart rate, preserving its numerical integrity for the 'cardiac stress index' calculation.

Exam trap

The trap here is that candidates may choose Option A (converting to absolute values) thinking it 'fixes' the data, but this introduces false data and violates data validity, whereas the correct approach is to remove invalid records to maintain data integrity.

How to eliminate wrong answers

Option A is wrong because converting negative heart rates to absolute values introduces false data, masking sensor errors and potentially skewing the cardiac stress index with artificially inflated readings. Option B is wrong because setting negative values to NULL retains invalid records in the dataset, which can cause calculation errors or missing data handling issues in the dashboard without addressing the root cause of sensor errors. Option C is wrong because changing heart rate from integer to categorical loses granularity and prevents the precise numerical computation required for the cardiac stress index, violating the requirement to preserve data types.
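
Removing invalid rows while leaving heart rate as an integer might look like the sketch below. The readings are invented, and the 'cardiac stress index' itself is not computed because the question does not define its formula; the code only prepares the two inputs.

```python
# Hypothetical merged readings from the two sources.
readings = [
    {"patient": "P1", "heart_rate": 72, "blood_pressure": "120/80"},
    {"patient": "P2", "heart_rate": -5, "blood_pressure": "118/76"},  # sensor error
    {"patient": "P3", "heart_rate": 95, "blood_pressure": "135/88"},
]

def drop_invalid(rows):
    """Keep only rows with a physiologically possible (positive) heart rate."""
    return [row for row in rows if row["heart_rate"] > 0]

def systolic(blood_pressure):
    """Extract the systolic value from a 'systolic/diastolic' string."""
    return int(blood_pressure.split("/")[0])

clean = drop_invalid(readings)
inputs = [(row["heart_rate"], systolic(row["blood_pressure"])) for row in clean]
print(inputs)
```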

40
MCQeasy

A data analyst is creating a data story for a marketing campaign results. Which of the following narrative structures is most effective for engaging the audience?

A.Use a question-and-answer format without a clear flow.
B.Present all data points chronologically.
C.Start with the methodology, then data, then results.
D.Start with a key insight or finding, then provide supporting evidence.
AnswerD

Correct. This engages the audience immediately and builds the story around the insight.

Why this answer

Option D is correct because starting with a key insight captures attention, and the supporting evidence that follows builds a compelling story. The other options are less effective: Option A lacks a clear narrative flow, Option B is chronological and may fall flat, and Option C starts with methodology, which may lose the audience.

41
Multi-Selecteasy

Which TWO visualization types are suitable for showing the distribution of a single continuous variable?

Select 2 answers
A.Box plot
B.Line chart
C.Histogram
D.Scatter plot
E.Pie chart
AnswersA, C

Box plots display summary statistics and distribution shape.

Why this answer

A box plot is correct because it graphically depicts the distribution of a single continuous variable through its five-number summary (minimum, first quartile, median, third quartile, maximum), clearly showing spread, central tendency, and outliers. A histogram is correct because it bins the continuous variable into intervals and displays the frequency of data points within each bin, providing a direct view of the underlying probability distribution.

Exam trap

CompTIA often tests the distinction between visualization types by presenting a line chart as a distractor, tempting candidates to confuse trend visualization with distribution analysis, especially when the continuous variable is time-based.

42
MCQmedium

A data analyst needs to determine whether the mean sales of two different regions are significantly different. The samples are independent and the data is normally distributed. Which statistical test should be used?

A.Chi-square test for independence
B.ANOVA
C.Independent samples t-test
D.Paired t-test
AnswerC

This test compares means of two independent groups with normal distribution.

Why this answer

The independent samples t-test is the correct choice because the scenario involves comparing the means of two independent groups (two different regions) with normally distributed data. This test specifically assesses whether the difference between the two sample means is statistically significant. It comes in a pooled-variance form and an unequal-variance (Welch) form; Levene's test can be used to check whether the group variances are similar enough to justify the pooled form.

Exam trap

CompTIA often tests the distinction between independent and paired t-tests, trapping candidates who overlook the 'independent samples' condition and mistakenly choose the paired t-test for any two-group comparison.

How to eliminate wrong answers

Option A is wrong because the Chi-square test for independence is used for categorical data to assess associations between two variables, not for comparing means of continuous data. Option B is wrong because ANOVA is used to compare means among three or more groups, not exactly two independent groups. Option D is wrong because the paired t-test is used for dependent samples (e.g., before-and-after measurements on the same subjects), not for independent samples from different regions.
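
The test statistic itself is simple to compute. The sketch below uses the unequal-variance (Welch) form and invented sales figures; it stops at the t statistic, since converting it to a p-value needs a t-distribution table or a statistics library.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for two independent samples without assuming equal variances:
    (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)."""
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error

# Hypothetical weekly sales (in thousands) for two regions.
region_a = [10, 12, 14, 16, 18]
region_b = [20, 22, 24, 26, 28]

t_stat = welch_t(region_a, region_b)
print(f"t = {t_stat:.2f}")
```

A paired t-test would instead work on the per-row differences, which only makes sense when each value in one sample has a natural partner in the other.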

43
MCQmedium

A data analyst needs to visualize the relationship between two continuous variables, such as sales revenue and advertising spend, to identify potential correlation. Which chart type is most appropriate?

A.Pie chart
B.Line chart
C.Bar chart
D.Scatter plot
AnswerD

Scatter plots show the relationship between two continuous variables and can reveal correlations.

Why this answer

Option D is correct because a scatter plot is specifically designed to show the relationship between two continuous variables and can reveal correlations, clusters, or outliers. Option C is wrong because bar charts compare discrete categories, not continuous relationships. Option B is wrong because line charts show trends over time, not correlation between two variables.

Option A is wrong because pie charts show proportions of a whole.

44
MCQmedium

A dashboard automatically refreshes every hour, but users report stale data. What is the most likely issue?

A.The dashboard is not published
B.The refresh interval is too long
C.The dashboard uses cached data
D.The data source connection is broken
AnswerB

An hourly refresh may be too slow if users need more up-to-date information.

Why this answer

The most likely issue is that the refresh interval is too long. If the dashboard refreshes every hour but users are seeing stale data, the data source may update more frequently than the dashboard's refresh cycle, causing a lag between data changes and dashboard updates. This is a common scheduling mismatch in BI tools like Tableau or Power BI where the refresh interval must align with data source update frequency.

Exam trap

The trap here is that candidates may confuse 'cached data' (which is a normal performance feature) with 'stale data' (which is a scheduling issue), leading them to choose option C instead of recognizing that the refresh interval is the root cause.

How to eliminate wrong answers

Option A is wrong because an unpublished dashboard would not be accessible to users at all, not just show stale data; the issue is about data freshness, not visibility. Option C is wrong because cached data is a normal part of dashboard performance and does not inherently cause staleness; caching can actually improve load times, and the problem is the refresh schedule, not the cache itself. Option D is wrong because a broken data source connection would result in no data or error messages, not stale data; the dashboard is still displaying data, just outdated data.

45
MCQmedium

A data analyst is preparing a dataset for a predictive model. The dataset contains a feature 'age' with values ranging from 18 to 80, and a feature 'income' ranging from 20,000 to 200,000. To ensure both features contribute equally to distance-based algorithms, which transformation should the analyst apply?

A.Min-max normalization
B.Log transformation
C.Standardization (z-score)
D.Box-Cox transformation
AnswerC

Standardization ensures each feature has mean 0 and std 1, providing equal weight in distance calculations.

Why this answer

Standardization (z-score) transforms features to have a mean of 0 and a standard deviation of 1, which ensures that both 'age' (18–80) and 'income' (20,000–200,000) contribute equally to distance-based algorithms like k-NN or k-means. Unlike min-max normalization, whose output range is set entirely by the minimum and maximum values, standardization is less sensitive to outliers (though not immune, since extreme values also pull the mean and standard deviation), making it the usual choice when the data may contain extreme values.

Exam trap

The trap here is that candidates often confuse min-max normalization with standardization, assuming that scaling to a fixed range is sufficient for distance-based algorithms, without considering the impact of outliers or the need for zero mean and unit variance.

How to eliminate wrong answers

Option A is wrong because min-max normalization scales features to a fixed range (e.g., [0,1]), but it is highly sensitive to outliers: a single extreme value compresses the remaining values into a narrow band, so the features may not contribute equally. Option B is wrong because log transformation is used to reduce skewness in positively skewed data, not to put features with different scales on a common scale; it changes the shape of the distribution and would not make 'age' and 'income' comparable for distance-based algorithms. Option D is wrong because Box-Cox transformation is designed to make data more normally distributed and requires all values to be positive, but it does not standardize features to a common scale; applying it to 'age' and 'income' would not ensure equal contribution to distance metrics.
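
The z-score calculation described above can be sketched in a few lines. This is a minimal illustration, not part of the exam material: the sample values are hypothetical (chosen only to fall within the ranges in the question), and `standardize` is an illustrative helper name.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical sample values within the ranges given in the question.
ages = [18, 30, 45, 60, 80]
incomes = [20_000, 50_000, 90_000, 150_000, 200_000]

z_ages = standardize(ages)
z_incomes = standardize(incomes)

# After standardization, both features have mean ~0 and standard deviation ~1,
# so neither dominates a Euclidean distance purely because of its units.
print(round(mean(z_ages), 6), round(pstdev(z_ages), 6))
print(round(mean(z_incomes), 6), round(pstdev(z_incomes), 6))
```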

46
MCQmedium

Refer to the exhibit. What is the impact of the validation result?

A.The staging table is missing the 'age' column, which may cause query errors
B.The validation passed successfully
C.Only duplicates were found
D.The row counts match, so data is complete
AnswerA

Queries expecting the 'age' column will fail in staging.

Why this answer

The validation result shows that the staging table has a different schema than the target table, specifically missing the 'age' column. This mismatch will cause query errors when attempting to insert or query data that references the 'age' column, as the staging table lacks the required column definition. The validation result explicitly flags this schema discrepancy, making option A correct.

Exam trap

CompTIA often tests the misconception that matching row counts alone guarantee data completeness, ignoring critical schema mismatches that cause query failures.

How to eliminate wrong answers

Option B is wrong because the validation result clearly indicates a schema mismatch (missing 'age' column), so the validation did not pass successfully. Option C is wrong because while duplicates may be present, the validation result specifically highlights a missing column, not just duplicates. Option D is wrong because even though row counts match, the schema mismatch means data is incomplete and queries will fail due to the missing 'age' column.
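
The exhibit is not reproduced here, so the following is only a sketch of the kind of schema check that would flag a missing column. The column lists and the `missing_columns` helper are hypothetical; only the missing 'age' column comes from the question.

```python
def missing_columns(target_columns, staging_columns):
    """Return target columns that are absent from staging, in target order."""
    staging = set(staging_columns)
    return [col for col in target_columns if col not in staging]

# Hypothetical schemas; only the missing 'age' column comes from the question.
target_columns = ["id", "name", "age"]
staging_columns = ["id", "name"]

gaps = missing_columns(target_columns, staging_columns)
if gaps:
    # A row-count check alone would not catch this kind of mismatch.
    print("Schema mismatch, missing in staging:", gaps)
```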

47
MCQeasy

A retail analyst needs to determine the most popular product category. The dataset includes columns: ProductID, Category, SalesDate, QuantitySold, UnitPrice. Which column contains qualitative data?

A.SalesDate
B.QuantitySold
C.UnitPrice
D.Category
AnswerD

Correct. Category is a qualitative variable as it describes a product attribute.

Why this answer

Qualitative data (also called categorical data) represents non-numeric categories or labels. The 'Category' column contains text values such as 'Electronics' or 'Clothing', which are descriptive and cannot be used in arithmetic operations. This makes it the only qualitative column in the dataset.

Exam trap

The trap here is that candidates often mistake dates (SalesDate) for qualitative data because they are not numeric, but dates are actually quantitative interval data with a meaningful order and equal intervals.

How to eliminate wrong answers

Option A is wrong because SalesDate represents a point in time, which is quantitative (interval) data, not qualitative. Option B is wrong because QuantitySold is a numeric count, making it quantitative (discrete) data. Option C is wrong because UnitPrice is a numeric monetary value, making it quantitative (continuous) data.

48
Multi-Selecteasy

Which TWO are common mistakes when creating data visualizations?

Select 2 answers
A.Using excessive 3D effects that obscure data
B.Sorting categories alphabetically in a bar chart
C.Choosing a color-blind friendly palette
D.Including a legend to identify chart elements
E.Starting the y-axis at a value other than zero
AnswersA, E

3D effects can distort proportions and reduce readability.

Why this answer

Options A and E are correct. Excessive 3D effects (A) distort proportions and obscure the data, and a y-axis that does not start at zero (E) can exaggerate differences and mislead the reader. Option B is wrong because sorting categories alphabetically can be helpful for lookup.

Option C is wrong because color-blind friendly palettes are a best practice. Option D is wrong because including a legend helps readers identify chart elements.

49
MCQeasy

A data architect is designing a schema for a product catalog where each product has a variable number of attributes. Which NoSQL database type is most appropriate?

A.Graph database
B.Document store
C.Key-value store
D.Relational database
AnswerB

Document stores allow flexible schemas, perfect for variable attributes.

Why this answer

A document store (e.g., MongoDB, Couchbase) is the most appropriate choice because it stores data in flexible, self-describing documents (typically JSON or BSON), allowing each product to have a variable number of attributes without requiring a predefined schema. This directly matches the requirement of a product catalog where attributes can differ per product, unlike rigid relational tables that would require complex EAV (Entity-Attribute-Value) patterns or frequent schema migrations.

Exam trap

The trap here is that candidates often confuse 'variable attributes' with 'relationships' and incorrectly choose a graph database, or they assume key-value stores are flexible enough, overlooking the need for queryability on individual attributes.

How to eliminate wrong answers

Option A is wrong because graph databases (e.g., Neo4j) are optimized for highly connected data and relationship traversal, not for storing documents with variable attributes; they would force you to model each attribute as a node or relationship, adding unnecessary complexity. Option C is wrong because key-value stores (e.g., Redis, DynamoDB) treat the entire product as an opaque value, making it impossible to query or index individual attributes without application-level parsing, which defeats the purpose of a catalog. Option D is wrong because relational databases require a fixed schema per table; handling variable attributes would necessitate either many nullable columns, frequent ALTER TABLE statements, or a cumbersome EAV pattern, all of which degrade performance and maintainability.
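
To make the flexible-schema idea concrete, the sketch below models documents as plain Python dictionaries rather than using a real document database. The catalog entries and the `find_by_attribute` helper are hypothetical and only illustrate that each product can carry its own set of attributes while still being queryable by attribute.

```python
import json

# Hypothetical catalog: each document carries only the attributes it needs.
catalog = [
    {"sku": "TV-100", "category": "Electronics", "screen_inches": 55, "hdmi_ports": 3},
    {"sku": "SH-200", "category": "Clothing", "size": "M", "color": "blue"},
    {"sku": "BK-300", "category": "Books", "author": "A. Writer"},
]

def find_by_attribute(documents, attribute, value):
    """Return documents where `attribute` exists and equals `value`."""
    return [doc for doc in documents if doc.get(attribute) == value]

# Documents without a 'color' attribute are simply skipped; no schema change needed.
blue_items = find_by_attribute(catalog, "color", "blue")
print(json.dumps(blue_items, indent=2))
```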

50
Multi-Selecteasy

A data analyst is designing a dashboard for non-technical managers. Which TWO design principles should be applied? (Choose TWO.)

Select 2 answers
A.Use pie charts for all comparisons.
B.Place the most important metrics at the top.
C.Include complex statistical terms in labels.
D.Provide interactive filters for drill-down.
E.Use consistent color schemes.
AnswersB, E

Prioritizes key information for quick understanding.

Why this answer

Option B is correct because placing the most important metrics at the top follows the principle of visual hierarchy, ensuring that non-technical managers immediately see key performance indicators without scrolling. This aligns with dashboard design best practices for executive audiences, where attention span is limited and decisions rely on top-level data first.

Exam trap

CompTIA often tests the misconception that interactive features like drill-down filters are always beneficial for all audiences, but the trap here is that non-technical managers need simplicity and immediate insight, not exploratory complexity.

51
MCQeasy

A data analyst needs to join two tables in a SQL database: Orders and Customers. The analyst wants to include all orders, even if there is no matching customer record. Which type of join should be used?

A.RIGHT JOIN
B.FULL OUTER JOIN
C.LEFT JOIN
D.INNER JOIN
AnswerC

LEFT JOIN returns all orders, including those without matching customers.

Why this answer

A LEFT JOIN returns all rows from the left table (Orders) and the matching rows from the right table (Customers). If there is no match, NULL values are returned for the right table's columns. This satisfies the requirement to include all orders, even those without a matching customer record.

Exam trap

The trap here is that candidates often confuse LEFT JOIN with RIGHT JOIN, mistakenly thinking they need to 'keep all customers' instead of 'keep all orders,' or they overcomplicate the requirement by choosing FULL OUTER JOIN when only one side needs to be preserved.

How to eliminate wrong answers

Option A (RIGHT JOIN) is wrong because it returns all rows from the right table (Customers) and matching rows from the left table (Orders), which would include all customers, not all orders. Option B (FULL OUTER JOIN) is wrong because it returns all rows from both tables, including unmatched rows from both sides, which is unnecessary when the requirement is specifically to keep all orders. Option D (INNER JOIN) is wrong because it returns only rows where there is a match in both tables, which would exclude orders without a matching customer record.
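
The LEFT JOIN behavior described above can be tried with Python's built-in `sqlite3` module. This is a small sketch; the table contents are hypothetical, and order 103 deliberately references a customer that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO Orders VALUES (101, 1), (102, 2), (103, 99);  -- 99 has no customer
""")

# Orders is the left table, so every order is kept.
rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM Orders AS o
    LEFT JOIN Customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for order_id, name in rows:
    print(order_id, name)  # name is None (SQL NULL) when no customer matches
conn.close()
```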

52
MCQmedium

A business analyst wants to compare the proportion of total sales contributed by each product category in the current year. Which visualization is most suitable?

A.Pie chart
B.Scatter plot
C.Line chart
D.Histogram
AnswerA

Pie charts are ideal for displaying proportions of a whole.

Why this answer

Option A is correct because a pie chart effectively shows parts of a whole when the number of categories is small. Option B (scatter plot) is for relationships between two variables. Option C (line chart) is for trends over time.

Option D (histogram) is for the distribution of a continuous variable.

53
Multi-Selecthard

A data scientist is merging retail transaction data from online and in-store sources. Which THREE steps are required to ensure data consistency?

Select 3 answers
A.Ensure product IDs are standardized across sources
B.Convert all monetary amounts to a common currency
C.Remove all transactions with missing customer ID
D.Synchronize timestamps to a single time zone
E.Merge data using only store location
AnswersA, B, D

Standard IDs prevent mismatches.

Why this answer

Options A (standardize product IDs), B (convert monetary amounts to a common currency), and D (synchronize timestamps to a single time zone) are essential. Option C (remove transactions with missing customer ID) may discard useful data. Option E (merge using only store location) is insufficient.
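
The three consistency steps can be sketched as a single normalization function. Everything specific here is an assumption made for illustration: the exchange rates, the record layout, the ID cleanup rule (trim and uppercase), and the `normalize` helper name.

```python
from datetime import datetime, timezone

# Hypothetical exchange rates and records, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}

def normalize(record):
    """Return a copy with a standardized product ID, USD amount, and UTC timestamp."""
    ts = datetime.fromisoformat(record["timestamp"])  # must include a UTC offset
    return {
        "product_id": record["product_id"].strip().upper(),
        "amount_usd": round(record["amount"] * RATES_TO_USD[record["currency"]], 2),
        "timestamp_utc": ts.astimezone(timezone.utc),
    }

online = {"product_id": " ab-12 ", "amount": 10.0, "currency": "EUR",
          "timestamp": "2024-03-01T09:00:00+01:00"}
in_store = {"product_id": "AB-12", "amount": 11.0, "currency": "USD",
            "timestamp": "2024-03-01T03:00:00-05:00"}

a, b = normalize(online), normalize(in_store)
# Once normalized, the two records agree on product ID and on the moment in time.
print(a["product_id"] == b["product_id"], a["timestamp_utc"] == b["timestamp_utc"])
```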

54
MCQhard

A data engineer needs to acquire data from a legacy mainframe system that does not support modern APIs or direct database connectivity. Which approach is most feasible?

A.Re-platform the mainframe to a modern system
B.Use a database gateway
C.Use FTP to transfer flat files
D.Manual data entry
AnswerC

FTP is a standard, simple method for file transfer from legacy systems.

Why this answer

Option C is correct because FTP (File Transfer Protocol, RFC 959) is a widely supported, low-overhead method for transferring flat files (e.g., CSV, EBCDIC-encoded text) from legacy mainframe systems that lack modern APIs or direct database connectivity. Mainframes like IBM z/OS natively support FTP, allowing the data engineer to schedule periodic file exports without requiring system modernization or complex middleware.

Exam trap

The trap here is that candidates may assume a database gateway (Option B) is always the best integration approach, but the question explicitly denies direct database connectivity, making FTP the only practical option that leverages existing mainframe capabilities without major infrastructure changes.

How to eliminate wrong answers

Option A is wrong because re-platforming the mainframe to a modern system is a costly, high-risk, and time-consuming project that far exceeds the scope of a simple data acquisition task; it introduces unnecessary complexity and potential downtime. Option B is wrong because a database gateway typically requires the mainframe to support ODBC/JDBC or similar database connectivity protocols, which the question explicitly states is not available. Option D is wrong because manual data entry is error-prone, unscalable, and impractical for any reasonable volume of data, violating basic data integrity and efficiency requirements.

55
Multi-Selectmedium

A data analyst is creating a dashboard to display monthly sales trends for the past two years. The dataset includes monthly sales figures with seasonal fluctuations. The analyst wants to highlight both the overall trend and the seasonal patterns effectively. Which TWO chart types are most appropriate for this purpose? (Select two.)

Select 2 answers
A.Line chart
B.Stacked area chart
C.Scatter plot
D.Bar chart
E.Pie chart
AnswersA, D

Correct. Line charts are ideal for showing trends over time, such as monthly sales trends.

Why this answer

A line chart (A) clearly shows the overall trend over time, while a bar chart (D) allows easy comparison of monthly values, revealing seasonal peaks and troughs. A pie chart (E) is for parts of a whole and not suitable for trends. A scatter plot (C) is for correlations, not time series.

A stacked area chart (B) can show cumulative trends but makes individual monthly comparisons difficult.

56
MCQeasy

Refer to the exhibit. A data analyst wants to create a visualization that best shows the trend of sales over time for each department. Which chart type should be used?

A.Stacked bar chart.
B.Pie chart for each quarter.
C.Line chart with multiple lines.
D.Grouped bar chart.
AnswerC

A line chart with multiple lines clearly shows the trend for each department over time.

Why this answer

A line chart with multiple lines is the best choice because it clearly shows the trend of sales over time for each department, with time on the x-axis and sales on the y-axis. Each line represents a department, making it easy to compare trends across departments while preserving the continuous nature of time. This aligns with the goal of visualizing trends, as line charts excel at showing changes over a continuous interval.

Exam trap

CompTIA often tests the distinction between showing trends over time versus comparing discrete categories; the trap here is that candidates may choose a grouped bar chart (Option D) because it can display multiple departments, but they overlook that bars are better for comparing values at specific points rather than showing the continuous flow of time.

How to eliminate wrong answers

Option A is wrong because a stacked bar chart shows part-to-whole relationships over time, but it obscures individual department trends by stacking values on top of each other, making it difficult to compare the trend of each department separately. Option B is wrong because a pie chart for each quarter shows proportions within a single time period, not trends over time; pie charts are designed for static composition, not continuous temporal changes. Option D is wrong because a grouped bar chart compares discrete categories side by side, but it does not effectively convey the continuous trend of sales over time; the gaps between bars can make it harder to perceive the overall direction of change for each department.

57
Multi-Selecthard

A data analyst is presenting a complex statistical analysis to a group of data scientists. The audience is highly knowledgeable. Which TWO approaches are most appropriate? (Choose two.)

Select 2 answers
A.Avoid mentioning uncertainty to maintain confidence
B.Use basic visualizations like pie charts
C.Include technical details and methodology
D.Present assumptions and limitations of the analysis
E.Simplify the findings to avoid confusion
AnswersC, D

Technical details are expected and valued.

Why this answer

Option C is correct because data scientists expect rigorous technical depth; including methodology and technical details aligns with their expertise and allows them to evaluate the analysis's validity. In a highly knowledgeable audience, omitting such details would undermine credibility and hinder peer review.

Exam trap

CompTIA often tests the misconception that simplifying findings is always best for any audience, but the trap here is that highly knowledgeable audiences require technical precision and transparency, not oversimplification.

58
Multi-Selectmedium

A data analyst is preparing a data storytelling presentation for a non-technical audience. Which THREE techniques are most effective for communicating insights?

Select 3 answers
A.Using relevant visuals such as charts and graphs.
B.Including raw data tables for reference.
C.Adding complex statistical terms to demonstrate expertise.
D.Highlighting the most important finding with annotations.
E.Using a clear narrative with a beginning, middle, and end.
AnswersA, D, E

Visuals make data more accessible and memorable.

Why this answer

Option A is correct because data storytelling for non-technical audiences relies on visuals like charts and graphs to make complex data patterns immediately understandable, reducing cognitive load and enabling faster insight absorption. Effective visuals should be simple, clearly labeled, and directly tied to the narrative, avoiding clutter that could confuse the audience.

Exam trap

The trap here is that candidates often confuse 'data completeness' with 'effective communication,' selecting raw data tables (Option B) thinking they provide transparency, when in fact they hinder comprehension for non-technical stakeholders.

59
MCQmedium

A sales dashboard shows monthly revenue but the bars are very tall for some months and very short for others, making comparisons difficult. Which visualization modification would best improve readability?

A.Change bar colors to gradient
B.Apply a logarithmic scale on the y-axis
C.Add more horizontal gridlines
D.Use a 3D bar chart for depth
AnswerB

Log scale compresses wide ranges so differences are visible.

Why this answer

A logarithmic scale plots values by order of magnitude, so equal ratios (for example, a tenfold increase) take up equal distance on the y-axis. This makes it easier to compare relative changes across months with vastly different revenue figures. It is particularly useful when the data spans several orders of magnitude, as it prevents tall bars from dominating the view and short bars from being barely visible.

Exam trap

CompTIA often tests the misconception that adding decorative elements (like colors or 3D effects) improves readability, when the real issue is the scale of the data, and candidates may overlook the logarithmic scale as a legitimate axis transformation.

How to eliminate wrong answers

Option A is wrong because changing bar colors to a gradient does not address the scale disparity; it only adds visual noise without improving the comparability of bar heights. Option C is wrong because adding more horizontal gridlines does not change the axis scaling; it merely adds reference lines that do not help when the tall bars already dwarf the short ones. Option D is wrong because using a 3D bar chart introduces perspective distortion that can misrepresent the actual bar heights, making comparisons even more difficult rather than improving readability.
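
A quick calculation shows why the log scale helps. The revenue figures below are hypothetical and chosen to span several orders of magnitude; the code only compares axis positions and does not draw a chart.

```python
import math

# Hypothetical monthly revenue spanning several orders of magnitude.
revenue = {"Jan": 1_000, "Feb": 10_000, "Mar": 1_000_000}

# On a linear axis the largest value is 1000x the smallest.
linear_ratio = max(revenue.values()) / min(revenue.values())

# On a log10 axis each tenfold increase moves one unit up the axis.
log_positions = {month: math.log10(value) for month, value in revenue.items()}
log_ratio = max(log_positions.values()) / min(log_positions.values())

print(linear_ratio)
print(log_positions)
print(log_ratio)
```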

60
Multi-Selecthard

A data analyst is troubleshooting a map visualization that shows null values for some regions. Which TWO actions should the analyst take to resolve the issue?

Select 2 answers
A.Check that latitude and longitude fields are properly geocoded.
B.Verify that the data source includes all region names.
C.Add a filter to exclude null values.
D.Change the mark type from Map to Pie.
E.Remove the color encoding from profit.
AnswersA, B

Improper geocoding can cause null regions; verifying geocoding resolves the issue.

Why this answer

Option A is correct because improper geocoding is a common cause of null values in map visualizations. Option B is correct because regions that are missing from the data source, or whose names do not match, cannot be plotted. Option D is wrong because changing the mark type does not fix missing data.

Option E is wrong because removing the color encoding hides the symptom but doesn't address the root cause. Option C is wrong because filtering out nulls may hide missing data issues instead of resolving them.

61
MCQhard

A data analyst is creating a presentation for the board of directors. The board members have varying levels of data literacy. The analyst wants to ensure that the key insight—that customer satisfaction scores have declined by 15% due to longer wait times—is understood by everyone. Which approach is best?

A.Include a complex statistical model showing the correlation.
B.Show a scatter plot of wait time vs. satisfaction.
C.Provide raw data in a spreadsheet for review.
D.Use a simple annotated line chart with a clear callout on the decline.
AnswerD

An annotated line chart clearly shows the trend and the decline, with annotations guiding viewers to the key insight.

Why this answer

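Option D is correct because a simple annotated line chart with a clear callout on the decline is intuitive and draws attention to the key insight. Option A (complex statistical model) and Option B (scatter plot) demand more data literacy than every board member may have, and Option C (raw data in a spreadsheet) leaves the audience to find the insight themselves.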

62
MCQmedium

A logistics company receives GPS tracking data from fleet vehicles at 1-second intervals via a cellular network. The data is used to optimize routes and monitor driver behavior. Recently, the data acquisition system has been missing updates for some vehicles when they pass through tunnels or remote areas. The data team notices gaps during these periods. The company needs a solution to ensure near-real-time data continuity. What should they do?

A.Use a hybrid approach that combines cellular and Wi-Fi networks
B.Implement a store-and-forward mechanism that buffers data on the vehicle's onboard unit and uploads when connectivity resumes
C.Increase the frequency of data transmission to every 0.5 seconds
D.Switch to a satellite-based GPS system
AnswerB

Buffering ensures data is not lost and is transmitted later, providing continuity despite temporary outages.

Why this answer

Option B is correct because a store-and-forward mechanism buffers GPS data locally on the vehicle's onboard unit during connectivity loss (e.g., in tunnels) and automatically uploads the backlog when cellular connectivity resumes. This ensures data continuity without requiring real-time transmission, directly addressing the intermittent connectivity issue while maintaining near-real-time updates.

Exam trap

The trap here is that candidates confuse the data source (GPS) with the transmission method, thinking satellite GPS solves connectivity issues, when the real problem is the cellular network's coverage gaps, not the positioning technology.

How to eliminate wrong answers

Option A is wrong because Wi-Fi networks are not suitable for fleet vehicles in motion; they have limited range and are not available in tunnels or remote areas, so combining them with cellular does not solve the core problem of coverage gaps. Option C is wrong because increasing transmission frequency to 0.5 seconds would exacerbate data loss during connectivity gaps and increase bandwidth/cost without addressing the root cause of missing updates. Option D is wrong because switching to satellite-based GPS only changes the positioning source, not the data transmission method; the vehicle still needs a network to send data, and satellite communication (e.g., Iridium) is expensive, high-latency, and not typically used for high-frequency GPS telemetry in logistics.
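
A store-and-forward buffer can be sketched as a small queue that holds readings while offline and drains in order once connectivity returns. The class name, the `send` callable, the `online` flag, and the sample coordinates are all hypothetical; a real onboard unit would persist the buffer to storage and detect connectivity itself.

```python
from collections import deque

class StoreAndForwardBuffer:
    """Buffers readings while offline and flushes them in order once online."""

    def __init__(self, send, max_size=10_000):
        self._send = send                       # callable that uploads one reading
        self._pending = deque(maxlen=max_size)  # oldest readings drop if full

    def record(self, reading, online):
        self._pending.append(reading)
        if online:
            self.flush()

    def flush(self):
        while self._pending:
            self._send(self._pending.popleft())

uploaded = []
buffer = StoreAndForwardBuffer(send=uploaded.append)
buffer.record({"t": 1, "lat": 40.0, "lon": -74.0}, online=True)
buffer.record({"t": 2, "lat": 40.1, "lon": -74.1}, online=False)  # in a tunnel
buffer.record({"t": 3, "lat": 40.2, "lon": -74.2}, online=False)
buffer.record({"t": 4, "lat": 40.3, "lon": -74.3}, online=True)   # backlog uploads
print([r["t"] for r in uploaded])
```

The bounded `deque` is a design choice for the sketch: it caps memory on the device at the cost of dropping the oldest readings during a very long outage.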

63
MCQhard

A time series dataset has several missing months of data. Which chart type will present the most honest picture of the trend?

A.Area chart with interpolation
B.Line chart with gaps for missing months
C.Bar chart with zero values for missing months
D.Scatter plot with connected lines
AnswerB

Gaps indicate unknown values without interpolation.

Why this answer

Line charts with gaps explicitly show missing periods, avoiding false continuity.

64
MCQeasy

A data analyst needs to present the results of a customer segmentation analysis to the marketing team. The analysis identified four segments based on purchasing behavior. Which visualization is most effective for showing the characteristics of each segment?

A.Histogram
B.Heatmap
C.Radar chart
D.Scatter plot
AnswerC

Radar charts display multiple variables for each segment on a common scale.

Why this answer

A radar chart is the most effective visualization for comparing multiple quantitative variables across different categories, such as the purchasing behavior characteristics of each customer segment. It allows the marketing team to see the profile of each segment at a glance by plotting each characteristic on a separate axis radiating from a central point, making it easy to identify strengths, weaknesses, and similarities between segments.

Exam trap

The trap here is that candidates often choose a scatter plot or heatmap because they are more common in exploratory analysis, but the question specifically asks for showing the characteristics (multiple attributes) of each segment, which is best served by a radar chart's multi-axis comparison.

How to eliminate wrong answers

Option A is wrong because a histogram is used to show the distribution of a single continuous variable (e.g., frequency of purchase amounts) and cannot display multiple characteristics for multiple segments simultaneously. Option B is wrong because a heatmap is best for showing the magnitude of a single value across two categorical dimensions (e.g., segment vs. time period) but does not allow direct comparison of multiple distinct characteristics per segment. Option D is wrong because a scatter plot is designed to show the relationship between two continuous variables (e.g., age vs. spending) and cannot effectively display the multi-attribute profile of each segment.

65
MCQmedium

A company's database has a table 'orders' with columns: order_id, customer_id, order_date, and total_amount. A data analyst needs to identify customers who have placed more than 5 orders in the past year. Which data concept should be used to group orders by customer and count them?

A.Joining with other tables
B.Filtering with WHERE clause
C.Sorting with ORDER BY
D.Aggregation with GROUP BY
AnswerD

GROUP BY groups rows and aggregation functions compute counts.

Why this answer

Option D is correct because the requirement to count orders per customer requires grouping rows by customer_id and then applying a count function. The GROUP BY clause in SQL aggregates rows that share a common value (customer_id) into summary rows, and the COUNT function tallies the number of orders per group. This is the standard approach for such 'per-customer' aggregations.

Exam trap

The trap here is that candidates confuse filtering (WHERE) with aggregation (GROUP BY), thinking that a WHERE clause alone can count orders per customer, when in fact WHERE only filters rows and cannot produce grouped counts.

How to eliminate wrong answers

Option A is wrong because joining with other tables merges columns from multiple tables but does not group or count rows; it would not produce a count of orders per customer. Option B is wrong because filtering with a WHERE clause restricts rows before any grouping but does not aggregate or count; it cannot produce a count of orders per customer. Option C is wrong because sorting with ORDER BY only arranges the result set order and has no effect on grouping or counting rows.
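
The grouping and counting described above can be tried with Python's built-in `sqlite3` module. The rows are hypothetical, and a fixed cutoff date stands in for "the past year"; the `HAVING` clause applies the "more than 5 orders" condition after grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, "
             "order_date TEXT, total_amount REAL)")

# Hypothetical data: customer 1 has six orders, customer 2 has two.
sample = [(1, "2024-01-05"), (1, "2024-02-05"), (1, "2024-03-05"),
          (1, "2024-04-05"), (1, "2024-05-05"), (1, "2024-06-05"),
          (2, "2024-01-10"), (2, "2024-02-10")]
conn.executemany(
    "INSERT INTO orders (customer_id, order_date, total_amount) VALUES (?, ?, 10.0)",
    sample)

frequent = conn.execute("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    WHERE order_date >= '2024-01-01'   -- fixed cutoff standing in for "past year"
    GROUP BY customer_id               -- one summary row per customer
    HAVING COUNT(*) > 5                -- filter on the aggregated count
""").fetchall()
print(frequent)
conn.close()
```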

66
MCQhard

A data team is communicating findings from a machine learning model that predicts equipment failure. The model has high accuracy but low recall. Which of the following statements is the most accurate way to communicate the model's performance to the maintenance team?

A."The model has a high precision, so when it alerts, it is usually correct, but it may miss some failures."
B."The model rarely misses a failure, but may have false positives."
C."The model has a high precision but low recall, so it misses many failures."
D."The model is highly reliable and catches almost all failures."
AnswerC

Correct. This accurately communicates the trade-off between precision and recall.

Why this answer

Option C is correct because it directly states that the model has high precision but low recall, which means that when the model predicts a failure, it is likely correct (few false positives), but it fails to identify many actual failures (many false negatives). This is the most accurate way to communicate the trade-off to the maintenance team, as it clearly indicates that the model will miss many failures despite its high accuracy.

Exam trap

CompTIA often tests the confusion between accuracy and recall; candidates mistakenly assume high accuracy implies high recall, but accuracy can be high even with low recall if the class imbalance is severe (e.g., many non-failure cases dominate the metric).

How to eliminate wrong answers

Option A is wrong because it describes high precision correctly but omits the critical low recall issue; saying 'it may miss some failures' understates the severity of low recall, which means the model misses many failures, not just some. Option B is wrong because it describes high recall ('rarely misses a failure') and high false positives, which is the opposite of the given scenario (high accuracy, low recall). Option D is wrong because it claims the model 'catches almost all failures,' which directly contradicts low recall; a model with low recall misses a significant portion of actual failures.
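
A small worked example shows how accuracy can stay high while recall is low when failures are rare. The confusion-matrix counts are hypothetical and chosen only to illustrate the class-imbalance effect; `classification_metrics` is an illustrative helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts: 1,000 machines, 50 real failures, only 10 of them caught.
accuracy, precision, recall = classification_metrics(tp=10, fp=1, fn=40, tn=949)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, accuracy is 959/1000, precision is 10/11, and recall is 10/50, so the model looks strong on accuracy while missing most real failures.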

67
MCQmedium

A retail company is migrating its on-premises data warehouse to a cloud data warehouse. The current ETL process extracts data from a transactional database (SQL Server) and a web analytics system (JSON logs). The ETL runs nightly and takes 6 hours. The business requires that the new cloud warehouse support real-time reporting with data latency of less than 15 minutes. The data engineer proposes using change data capture (CDC) from the SQL Server database and streaming the JSON logs via a message queue. However, management is concerned about cost and complexity. The engineer must design a solution that meets the latency requirement while minimizing operational overhead. Which approach should the engineer recommend?

A.Export the SQL Server data to flat files every 15 minutes and use a cloud storage trigger to load
B.Continue with nightly batch loads but increase the frequency to every hour
C.Implement CDC for the SQL Server database and stream the JSON logs via a message queue to the cloud warehouse
D.Use a data virtualization tool to query the source systems directly without moving data
AnswerC

CDC provides real-time changes; streaming handles JSON logs with low latency.

Why this answer

Option C is correct because CDC captures only changed rows from SQL Server, minimizing data volume and enabling near-real-time ingestion, while streaming JSON logs via a message queue (e.g., Apache Kafka or Amazon Kinesis) provides sub-15-minute latency. This combination meets the latency requirement without the overhead of full batch exports or complex virtualization, addressing management's cost and complexity concerns.

Exam trap

The trap here is that candidates may choose Option A or D because they seem simpler, but they fail to meet the strict latency requirement or introduce hidden operational complexity, while Option C's CDC and streaming approach is the only one that balances low latency with minimal overhead.

How to eliminate wrong answers

Option A is wrong because exporting SQL Server data to flat files every 15 minutes introduces latency from file generation, cloud storage upload, and trigger-based loading, which can easily exceed the 15-minute requirement and adds operational overhead for file management. Option B is wrong because increasing nightly batch loads to hourly still results in up to 60-minute latency, failing the 15-minute requirement, and does not address the need for real-time streaming of JSON logs. Option D is wrong because data virtualization queries source systems directly, which can cause performance degradation on the transactional SQL Server and web analytics system, and does not provide a persistent, low-latency data pipeline to the cloud warehouse.

68
MCQmedium

A data analyst at a marketing firm is tasked with segmenting customers based on their purchasing behavior. The dataset contains 10,000 customers with features such as annual spend, frequency of purchases, recency of last purchase, and average order value. The analyst decides to use k-means clustering. After standardizing the features, the analyst runs k-means with k=3, k=4, and k=5, and computes the silhouette score for each: k=3: 0.45, k=4: 0.52, k=5: 0.48. The analyst also plots the elbow curve and observes that the within-cluster sum of squares (WCSS) decreases sharply from k=2 to k=4, then levels off. Based on these results, what is the most appropriate number of clusters?

A.k=4
B.k=2
C.k=3
D.k=5
AnswerA

k=4 has the highest silhouette score and sits at the elbow of the WCSS curve.

Why this answer

The silhouette score is highest at k=4 (0.52), indicating that clusters are well-separated and cohesive. The elbow curve shows WCSS decreasing sharply up to k=4 and then leveling off, suggesting that k=4 captures the optimal trade-off between model complexity and variance explained. Together, these metrics point to k=4 as the most appropriate number of clusters.
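The selection logic can be sketched as follows. The `silhouette_score` function below is a simplified pure-Python version for one-dimensional points, written for illustration; the toy points and labels are hypothetical, and only the three scores in `scores_by_k` come from the question.

```python
from statistics import mean

def silhouette_score(points: list[float], labels: list[int]) -> float:
    """Mean silhouette coefficient for 1-D points, using absolute difference as distance."""
    clusters: dict[int, list[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster (cohesion)
        a = mean(abs(points[i] - points[j]) for j in own)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            mean(abs(points[i] - points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Hypothetical toy data: two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])

# Scores reported in the question; pick the k with the highest one.
scores_by_k = {3: 0.45, 4: 0.52, 5: 0.48}
best_k = max(scores_by_k, key=scores_by_k.get)
```

A labeling that matches the natural groups scores close to 1, while a labeling that mixes the groups scores much lower, which is why the highest score is taken as the preferred k.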

Exam trap

The trap here is that candidates might rely solely on the elbow curve and pick k=3 or k=5, ignoring the silhouette score which directly measures cluster quality and clearly favors k=4.

How to eliminate wrong answers

Option B (k=2) is wrong because the elbow curve shows a sharp decrease in WCSS from k=2 to k=4, meaning k=2 would underfit the data and miss meaningful segmentation. Option C (k=3) is wrong because its silhouette score (0.45) is lower than k=4 (0.52), indicating poorer cluster separation and cohesion. Option D (k=5) is wrong because its silhouette score (0.48) is lower than k=4, and the elbow curve shows WCSS leveling off after k=4, so adding a fifth cluster introduces unnecessary complexity without significant improvement.

69
Multi-Selectmedium

Which TWO of the following chart types are appropriate for showing the distribution of a continuous variable? (Choose 2.)

Select 2 answers
A.Box plot
B.Bar chart
C.Pie chart
D.Line chart
E.Histogram
AnswersA, E

A box plot summarizes the distribution with quartiles and outliers; a histogram shows its shape across binned intervals.

Why this answer

A box plot is correct because it visually summarizes the distribution of a continuous variable through its five-number summary (minimum, first quartile, median, third quartile, maximum), clearly showing spread, central tendency, and potential outliers. A histogram is also correct because it groups the continuous values into contiguous intervals (bins) and shows how many observations fall in each, revealing the shape of the distribution.

Exam trap

The trap here is that candidates often confuse bar charts with histograms, mistakenly thinking bar charts can show continuous distributions, but bar charts require categorical x-axis values and have gaps between bars, while histograms use continuous intervals with no gaps.
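The binning that separates a histogram from a bar chart can be sketched in a few lines. The `histogram_counts` helper and the sample ages are hypothetical, and for brevity empty bins are omitted from the result.

```python
def histogram_counts(values, bin_width, start):
    """Count values in contiguous equal-width bins [start, start + w), [start + w, start + 2w), ..."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width  # left edge of the bin that contains v
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

ages = [21, 23, 25, 31, 34, 38, 39, 42, 47, 58]  # illustrative sample
bins = histogram_counts(ages, bin_width=10, start=20)

# Text rendering: each bin starts where the previous one ends, with no gaps.
for left, n in bins.items():
    print(f"{left:>3}-{left + 10:<3} {'#' * n}")
```

The x-axis here is a continuous range cut into intervals, whereas a bar chart would have one bar per discrete category.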

70
MCQmedium

A data analyst is tasked with extracting data from a legacy system that outputs fixed-width text files. The analyst needs to parse these files into a structured format. Which tool or method is most appropriate for this task?

A.A spreadsheet application
B.An ETL tool with a graphical interface
C.A scripting language such as Python
D.SQL
AnswerC

Python provides libraries and string manipulation ideal for parsing fixed-width files.

Why this answer

Python is the most appropriate choice because fixed-width text files require precise column slicing based on character positions, which Python's string slicing and libraries like `struct` or `pandas.read_fwf` handle natively. Unlike graphical ETL tools or spreadsheets, Python provides programmatic control to define exact field widths, handle edge cases such as short or inconsistently padded records, and process large files efficiently without manual intervention.
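A minimal sketch of position-based slicing is shown below. The field names and offsets in `LAYOUT` are hypothetical; a real legacy file would come with its own record layout.

```python
# Hypothetical layout: (field name, start offset, end offset), end exclusive.
LAYOUT = [("customer_id", 0, 6), ("name", 6, 26), ("balance", 26, 36)]

def parse_fixed_width(line: str, layout=LAYOUT) -> dict[str, str]:
    """Slice one record by character position and strip the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}

# Build an illustrative 36-character record: 6 + 20 + 10 columns.
sample = "000042" + "Ada Lovelace".ljust(20) + "1250.75".rjust(10)
record = parse_fixed_width(sample)
```

For whole files, `pandas.read_fwf` accepts the same kind of layout through its `colspecs` or `widths` parameters.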

Exam trap

The trap here is that candidates assume a graphical ETL tool is always the best choice for data extraction, but the question specifically tests the ability to handle delimiter-free legacy formats, where scripting provides the necessary precision and automation.

How to eliminate wrong answers

Option A is wrong because, although spreadsheet applications like Excel can import fixed-width text, the column breaks must be set manually for each import, which is error-prone and impractical for large or recurring datasets. Option B is wrong because while ETL tools can parse fixed-width files, they typically require defining column widths in a graphical interface, which is less flexible and harder to automate than a scripting language for legacy systems with inconsistent formatting. Option D is wrong because SQL operates on structured data within a database and cannot directly parse raw fixed-width text files; the data would first have to be pre-processed into a table format.

71
Drag & Dropmedium

Drag and drop the steps to conduct a hypothesis test in the correct order.

Why this order

A hypothesis test proceeds in this order: state the null and alternative hypotheses, set the significance level (alpha), collect the data, compute the test statistic, and make a decision about the null hypothesis.
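The order can be illustrated with a two-sided one-sample z-test. This is a sketch under stated assumptions: the population standard deviation `sigma` is treated as known, and the sample values, `mu0`, and `sigma` are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test; returns (z, p_value, reject_null)."""
    # Step 1: hypotheses. H0: population mean == mu0, H1: population mean != mu0.
    # Step 2: alpha is fixed by the caller before the data is examined.
    # Step 3: the collected data arrives as `sample`.
    # Step 4: compute the test statistic and its p-value.
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 5: decide by comparing the p-value with alpha.
    return z, p_value, p_value < alpha

sample = [52, 55, 49, 58, 54, 57, 53, 56, 51, 55]  # illustrative data, mean 54
z, p_value, reject = one_sample_z_test(sample, mu0=50, sigma=4)
```

With these illustrative numbers the statistic is about 3.16, so the null hypothesis is rejected at alpha = 0.05.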

72
MCQmedium

A data analyst is tasked with visualizing the distribution of customer ages across different regions. The dataset contains outliers. Which chart type best displays the distribution and highlights outliers?

A.Box plot
B.Violin plot
C.Histogram
D.Bar chart
AnswerA

Box plot displays distribution spread and outliers as individual points beyond whiskers.

Why this answer

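A box plot explicitly shows the median, quartiles, and outliers, making it ideal for showing a distribution and detecting outliers. A histogram shows the shape of the distribution but does not mark outliers individually. A violin plot adds a density estimate but is more complex and typically does not flag outliers as separate points. A bar chart compares categories and does not show a distribution.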
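The points a box plot draws beyond its whiskers are commonly found with the 1.5 × IQR rule, sketched below. The ages are hypothetical, and `statistics.quantiles` uses its default quartile method; other tools may compute slightly different quartiles.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return (q1, median, q3, outliers) using the 1.5 * IQR fence rule."""
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q2, q3, outliers

ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 95]  # illustrative sample
q1, q2, q3, outliers = iqr_outliers(ages)
```

Running the same function once per region would give the numbers behind a side-by-side box plot.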

73
MCQmedium

During data acquisition, an analyst notices that the data from an external vendor has inconsistent date formats. What is the first step the analyst should take?

A.Contact the vendor to request corrected data
B.Immediately transform dates to a standard format
C.Perform data profiling
D.Reject the entire dataset
AnswerC

Profiling identifies inconsistencies and guides next steps.

Why this answer

Data profiling is the initial step to understand the structure, quality, and issues in the data. Rejecting or transforming without profiling may lead to errors, and contacting the vendor is premature without understanding the scope.
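A profiling pass over a date column might look like the sketch below. The candidate formats and the vendor values are assumptions for illustration; a real profile would use the formats actually seen in the data. Ambiguous values such as 01/02/2024 are counted under the first format that parses them.

```python
from collections import Counter
from datetime import datetime

# Assumed candidate formats; extend the list for the data at hand.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"]

def profile_date_formats(values):
    """Count how many values match each candidate format, without changing the data."""
    counts = Counter()
    for value in values:
        for fmt in CANDIDATE_FORMATS:
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                continue
            counts[fmt] += 1
            break
        else:
            # No candidate format parsed this value.
            counts["unrecognized"] += 1
    return counts

vendor_dates = ["2024-01-15", "01/16/2024", "17-Jan-2024", "2024-01-18", "n/a"]
profile = profile_date_formats(vendor_dates)
```

The resulting counts show the scope of the problem, which informs whether to transform the dates, contact the vendor, or both.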

74
MCQeasy

An analyst creates a pie chart showing market share of four companies: A (45%), B (30%), C (15%), D (10%). A stakeholder complains that it is difficult to compare C and D. Which alternative chart should the analyst recommend?

A.Horizontal bar chart
B.Line chart
C.Stacked bar chart
D.Scatter plot
AnswerA

Horizontal bar chart allows easy length comparison between categories.

Why this answer

A horizontal bar chart allows easier comparison of individual values, especially when the differences are small, because bar lengths are read against a common axis. The pie chart's reliance on angles makes small slices hard to compare. Options B, C, and D are not as effective for comparing discrete values.

75
Multi-Selecthard

Which THREE of the following are best practices when performing data extraction for a data pipeline?

Select 3 answers
A.Performing a full refresh every time
B.Implementing error handling and logging
C.Documenting the extraction process
D.Ignoring data quality issues during extraction
E.Using incremental extraction where possible
AnswersB, C, E

Error handling and logging, documentation, and incremental extraction keep an extraction process reliable, maintainable, and efficient.

Why this answer

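Option B is correct because implementing error handling and logging is a critical best practice in data pipeline extraction. It ensures that failures (e.g., network timeouts, authentication errors, or schema mismatches) are captured and can be diagnosed without data loss or silent corruption, which is essential for maintaining pipeline reliability and auditability. Option C is correct because documenting the extraction process lets others maintain, audit, and troubleshoot the pipeline. Option E is correct because incremental extraction pulls only new or changed records, reducing load on the source system and shortening run times.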

Exam trap

CompTIA often tests the misconception that full refreshes are always safer or simpler, but the trap is that they ignore the operational cost and scalability issues, while incremental extraction with proper error handling is the standard in production pipelines.
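Incremental extraction combined with error handling and logging can be sketched as follows. The `updated_at` column, the watermark value, and the in-memory source rows are hypothetical; a production pipeline would query the source system and persist the watermark between runs.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("extract")

def extract_incremental(rows, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    try:
        changed = [r for r in rows if r["updated_at"] > last_watermark]
    except (KeyError, TypeError) as exc:
        # Log the failure and keep the old watermark so no changes are skipped on retry.
        log.error("extraction failed, watermark unchanged: %s", exc)
        return [], last_watermark
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    log.info("extracted %d of %d rows; watermark %s -> %s",
             len(changed), len(rows), last_watermark, new_watermark)
    return changed, new_watermark

# Illustrative source rows; ISO 8601 timestamps compare correctly as strings.
source = [
    {"id": 1, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "updated_at": "2024-01-02T08:30:00"},
    {"id": 3, "updated_at": "2024-01-03T09:15:00"},
]
batch, watermark = extract_incremental(source, "2024-01-01T12:00:00")
```

Only rows changed since the last run are returned, and a malformed row leaves the watermark untouched rather than silently dropping data.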
