CompTIA Data+ DA0-001 Questions 676–750 | Page 10/14

676

MCQmedium

When communicating uncertainty in a report, which of the following is the most appropriate way to convey the reliability of a survey result showing 75% customer satisfaction?

A."The satisfaction rate might be lower or higher."

B."75% of customers are satisfied."

C."We are 95% confident that the true satisfaction rate is between 72% and 78%."

D."The margin of error is 3%."

AnswerC

Provides a confidence interval, clearly expressing uncertainty.

Why this answer

Confidence intervals are standard for communicating uncertainty around a point estimate.
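As a rough illustration of where an interval like the one in option C can come from, the sketch below applies the normal-approximation (Wald) formula for a proportion. The sample size of 800 is an assumption chosen for illustration; the question does not state how many customers were surveyed.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical survey: 75% satisfied out of an assumed 800 respondents.
low, high = proportion_ci(0.75, 800)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

Under that assumed sample size the interval comes out at roughly 72% to 78%. A smaller sample would widen it, which is why the interval says more than the point estimate alone.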

677

MCQeasy

A marketing team wants to analyze customer sentiment from social media posts. Which data acquisition method is most appropriate?

A.Internal database query

B.Physical sensor data

C.Web scraping from public social media APIs

D.Survey questionnaire

AnswerC

Allows direct access to public posts for sentiment analysis.

Why this answer

Option C is correct because web scraping from public social media APIs allows direct access to public posts for sentiment analysis. Option A is wrong because internal databases do not contain social media data. Option B is wrong because physical sensors are unrelated to social media content. Option D is wrong because survey responses are not drawn from social media posts.

678

MCQhard

An e-commerce company is merging customer data from three legacy systems. Two systems use email as unique identifier, but one system allows multiple customers per email. The third uses phone number. To create a unified customer view, the analyst should first:

A.Request the IT team to modify the legacy system

B.Build a customer matching rule that uses multiple attributes (email, phone, name) with a confidence score

C.Use email as primary key and ignore conflicts

D.Assign new unique IDs and discard existing identifiers

AnswerB

Multi-attribute matching handles non-unique identifiers and improves accuracy.

Why this answer

Option B is correct because merging data from systems with different identifier schemas requires a probabilistic matching approach. Using multiple attributes (email, phone, name) with a confidence score allows the analyst to resolve conflicts where email is not unique and phone numbers may be missing or formatted differently, creating a unified customer view without forcing a single key.

Exam trap

The trap here is that candidates assume a single unique identifier (email) can be forced as a primary key, ignoring the real-world data quality issue of non-unique emails, which the question explicitly states.

How to eliminate wrong answers

Option A is wrong because modifying legacy systems is often impractical, costly, and outside the analyst's scope; the question asks what the analyst should do first, not a long-term IT project. Option C is wrong because using email as primary key and ignoring conflicts would lose data integrity when one email maps to multiple customers, violating the goal of a unified view. Option D is wrong because assigning new unique IDs and discarding existing identifiers eliminates the ability to link records back to source systems and loses valuable matching context, making deduplication impossible.
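A multi-attribute matching rule of the kind described in option B might look like the sketch below. The weights (0.4, 0.4, 0.2), the record layout, and the function names are all hypothetical; a real project would tune the weights and add fuzzy name comparison.

```python
def normalize_phone(phone):
    """Keep digits only so '(555) 010-2000' and '555-010-2000' compare equal."""
    return "".join(ch for ch in (phone or "") if ch.isdigit())

def match_score(a, b):
    """Return a 0-1 confidence that two records describe the same customer.

    The weights are illustrative assumptions, not values from the question.
    """
    score = 0.0
    if a.get("email") and a["email"].lower() == (b.get("email") or "").lower():
        score += 0.4
    phone_a, phone_b = normalize_phone(a.get("phone")), normalize_phone(b.get("phone"))
    if phone_a and phone_a == phone_b:
        score += 0.4
    if a.get("name") and a["name"].strip().lower() == (b.get("name") or "").strip().lower():
        score += 0.2
    return score

rec_a = {"email": "pat@example.com", "phone": "(555) 010-2000", "name": "Pat Lee"}
rec_b = {"email": "PAT@example.com", "phone": "555-010-2000", "name": "Pat Lee"}
rec_c = {"email": "pat@example.com", "phone": "555-010-3000", "name": "Sam Lee"}

print(match_score(rec_a, rec_b))  # email, phone, and name all agree
print(match_score(rec_a, rec_c))  # only the shared email agrees
```

The second pair shows why email alone is not enough here: two different customers share an address, and the low score keeps them from being merged automatically.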

679

MCQmedium

A data governance team is implementing a program to ensure consistent definitions and quality of customer data across the organization. They assign a senior manager to be accountable for the data asset. Which role does this manager fulfill?

A.Data analyst

B.Data custodian

C.Data owner

D.Data steward

AnswerC

Data owner is accountable for a specific data domain.

Why this answer

The data owner is the senior manager accountable for a specific data asset, including its quality, definition, and compliance. In the DA0-001 context, the data owner has ultimate responsibility for the data, not just day-to-day management. This role ensures consistent definitions and quality across the organization, aligning with the governance team's objectives.

Exam trap

The trap here is confusing the data owner's accountability with the data steward's operational duties, leading candidates to pick 'Data steward' because they associate governance with hands-on management rather than executive responsibility.

How to eliminate wrong answers

Option A is wrong because a data analyst focuses on analyzing and interpreting data, not on accountability for data definitions or quality. Option B is wrong because a data custodian is responsible for the technical environment and security of data, not for defining or governing its meaning. Option D is wrong because a data steward handles day-to-day data governance tasks like metadata management and quality monitoring, but does not hold the ultimate accountability that a senior manager does.

680

MCQhard

A data analyst is reviewing the error log from a nightly batch load. What is the most likely cause of the error?

A.A row with the same primary key was already loaded in a previous batch.

B.The data type of order_id is incorrect.

C.The source and target schemas are mismatched.

D.The order_id field contains null values.

AnswerA

The error explicitly says duplicate key.

Why this answer

The error log from a nightly batch load indicates a primary key violation. This occurs when a row with the same primary key value already exists in the target table from a previous batch load. Since batch loads typically use INSERT operations, attempting to insert a duplicate primary key will raise a constraint violation error, halting the load process.
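The error log itself is not reproduced here, so the following is only a sketch of how a duplicate-key failure can arise, using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample values are assumptions; the exact error text differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, total REAL)")

# First nightly batch loads order 1001.
conn.execute("INSERT INTO orders VALUES (1001, 250.0)")

# A later batch tries to insert the same primary key again.
try:
    conn.execute("INSERT INTO orders VALUES (1001, 250.0)")
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

print(error)
```

The second insert is rejected with an integrity error that names the key column, which is the kind of message that points to option A rather than to a type or schema problem.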

Exam trap

The trap here is that candidates confuse a primary key violation with a data type mismatch or schema mismatch, but the error log's specific reference to a duplicate key points directly to the primary key constraint.

How to eliminate wrong answers

Option B is wrong because an incorrect data type for order_id would cause a data type conversion error or truncation error, not a primary key violation. Option C is wrong because a schema mismatch (e.g., missing columns or different column order) would produce a column mapping error or a 'column not found' error, not a duplicate key error. Option D is wrong because null values in order_id would violate a NOT NULL constraint if the primary key column is defined as NOT NULL, but the error message specifically points to a duplicate key violation, not a null constraint violation.

681

MCQeasy

A data analyst is preparing a presentation for executive leadership. The analyst wants to highlight the correlation between marketing spend and revenue over the past year. Which visualization type is most appropriate for showing this relationship?

A.Scatter plot

B.Pie chart

C.Bar chart

D.Histogram

AnswerA

Scatter plots effectively display correlations between two continuous variables.

Why this answer

A scatter plot is the most appropriate visualization for showing the relationship between two continuous variables—marketing spend and revenue—because it plots individual data points on an X-Y axis, allowing the analyst to visually assess correlation, trends, and outliers. This directly supports the goal of highlighting correlation, as the pattern of points (e.g., upward slope) indicates the strength and direction of the relationship.

Exam trap

The trap here is that candidates often confuse a bar chart or histogram with a scatter plot because they think any chart with axes can show relationships, but only a scatter plot directly plots paired continuous data to reveal correlation without aggregation.

How to eliminate wrong answers

Option B (Pie chart) is wrong because pie charts are designed to show parts of a whole (proportions) for categorical data, not the relationship between two continuous variables. Option C (Bar chart) is wrong because bar charts compare discrete categories or aggregated values, not the correlation between two continuous metrics; they would require binning or summarizing the data, losing the granularity needed for correlation analysis. Option D (Histogram) is wrong because histograms display the distribution of a single continuous variable (e.g., frequency of revenue values), not the relationship between two variables.

682

MCQeasy

In SQL, which string function would you use to remove leading and trailing spaces from a column named 'city'?

A.TRIM

B.RTRIM

C.LTRIM

D.CLEAN

AnswerA

Correct. TRIM removes both leading and trailing spaces.

Why this answer

TRIM removes leading and trailing spaces (or other specified characters) from a string. TRIM(city) returns the city without extra spaces.
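To see the difference between the three functions, the sketch below runs them side by side in SQLite through Python's sqlite3 module. The `customers` table and the sample value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (city TEXT)")
conn.execute("INSERT INTO customers VALUES ('  Austin  ')")

# TRIM strips both sides; LTRIM and RTRIM strip one side each.
row = conn.execute(
    "SELECT TRIM(city), LTRIM(city), RTRIM(city) FROM customers"
).fetchone()
print([repr(value) for value in row])
```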

683

Multi-Selectmedium

A data analyst is creating a dashboard in Tableau to monitor sales performance. The dashboard will be used by executives to quickly identify trends. Which THREE design principles should the analyst apply?

Select 3 answers

A.Maximize data-ink ratio

B.Clear labels and titles

C.Visual hierarchy

D.Consistent color coding

E.Include as many interactive filters as possible

AnswersB, C, D

Makes the dashboard easy to understand at a glance.

Why this answer

Visual hierarchy ensures key metrics stand out; consistent color coding aids interpretation; clear labels prevent confusion. Data-ink ratio is important but not as critical for executive dashboards as clarity; including many interactive filters may clutter.

684

MCQhard

A sales dashboard shows a map with many overlapping markers in the same city, making it hard to read. What is the best improvement?

A.Add tooltips to show details on hover

B.Aggregate the data by region and use a choropleth map

C.Use a bubble chart instead of a map

D.Use different marker colors for each store

AnswerB

Aggregating reduces density and choropleth shows region-level values.

Why this answer

Option B is correct because aggregating sales data by region and using a choropleth map eliminates visual clutter from overlapping markers by shading entire geographic areas based on a metric (e.g., total sales). This approach leverages spatial aggregation to provide a clear, high-level view of regional performance, which is the best practice when individual point markers become unreadable due to density.

Exam trap

The trap here is that candidates may choose tooltips (Option A) thinking interactivity solves the problem, but the question asks for the 'best improvement' to readability, and tooltips do not address the fundamental issue of overlapping markers obscuring the visualization.

How to eliminate wrong answers

Option A is wrong because tooltips only provide details on hover and do not solve the core problem of overlapping markers obscuring data; they add interactivity but do not reduce visual density. Option C is wrong because a bubble chart, while useful for comparing values, is not a map-based visualization and would lose the geographic context that the dashboard intends to convey. Option D is wrong because using different marker colors for each store does not address overlapping markers; it only adds visual differentiation without reducing clutter, and in dense areas, colored markers still overlap and remain unreadable.

685

MCQhard

Refer to the exhibit. A database administrator notices that queries filtering on both CustomerID and OrderDate are slow. Which single change would most likely improve performance for such queries?

A.Partition the table by OrderDate

B.Convert TotalAmount to VARCHAR

C.Add a composite index on (CustomerID, OrderDate)

D.Remove the primary key constraint

AnswerC

A composite index can satisfy both conditions in one index seek.

Why this answer

A composite index on (CustomerID, OrderDate) allows the database to use a single index to filter on both columns, which is more efficient than using separate indexes and combining results.
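The exhibit is not reproduced here, so the sketch below assumes a simple `orders` table. The column names CustomerID, OrderDate, and TotalAmount come from the question; OrderID, the column types, and the index name are assumptions. It uses SQLite's EXPLAIN QUERY PLAN to check whether the planner picks the composite index; other engines expose query plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, "
    "OrderDate TEXT, TotalAmount REAL)"
)
# Composite index covering both filter columns.
conn.execute(
    "CREATE INDEX idx_orders_customer_date ON orders (CustomerID, OrderDate)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM orders WHERE CustomerID = ? AND OrderDate = ?",
    (42, "2024-03-15"),
).fetchall()
# The last column of each plan row holds the human-readable step.
details = " ".join(row[3] for row in plan)
print(details)
```

If the plan mentions the index name, both predicates are being resolved through the one index rather than through a full table scan.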

686

MCQmedium

A data quality report shows that 95% of records have all required fields completed, but 20% of the completed fields contain values that are outside valid ranges. Which data quality dimension is most affected?

A.Consistency

B.Accuracy

C.Timeliness

D.Completeness

AnswerB

Accuracy is compromised because values outside valid ranges are incorrect.

Why this answer

Accuracy measures how well data reflects real-world values or a defined standard. Here, 20% of completed fields contain values outside valid ranges, meaning the data is present but incorrect, directly degrading accuracy. Completeness (95% filled) is high, but the core issue is that the values themselves are wrong, not missing or late.
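The distinction between a field being present and a field being within range can be measured separately, as in this small sketch. The records and the 0 to 120 age range are invented for illustration.

```python
records = [
    {"age": 34},
    {"age": 212},   # present but outside the valid range
    {"age": None},  # missing
    {"age": 58},
    {"age": -3},    # present but outside the valid range
]

VALID_AGE = range(0, 121)  # assumed valid range for illustration

present = [r["age"] for r in records if r["age"] is not None]
completeness = len(present) / len(records)
in_range = sum(1 for age in present if age in VALID_AGE) / len(present)

print(f"completeness={completeness:.0%}, in valid range={in_range:.0%}")
```

A dataset can score well on the first number and poorly on the second, which is the situation the question describes.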

Exam trap

The trap here is that candidates see '95% of records have all required fields completed' and immediately think 'Completeness is high, so that dimension is fine,' but then incorrectly assume the 20% out-of-range values also affect Completeness, when in fact Accuracy is the dimension that suffers when present data is invalid.

How to eliminate wrong answers

Option A (Consistency) is wrong because consistency checks for logical coherence across datasets or over time (e.g., same customer ID format in two tables), not whether individual field values fall within valid ranges. Option C (Timeliness) is wrong because timeliness concerns whether data is available when needed or within a required time window, not the correctness of values. Option D (Completeness) is wrong because completeness measures the presence of data (95% of records have all required fields), which is high; the problem is with the quality of the present data, not its absence.

687

MCQeasy

Which chart type is best for showing the number of website visitors at each stage of a conversion funnel, from initial visit to purchase?

A.Stacked bar chart

B.Treemap

C.Funnel chart

D.Waterfall chart

AnswerC

Correct. Funnel charts are ideal for conversion funnels.

Why this answer

A funnel chart is specifically designed to visualize the progressive reduction in volume across stages of a linear process, such as a conversion funnel. It clearly shows the number of visitors at each stage (e.g., initial visit, product view, add to cart, purchase) and the drop-off between them, making it the optimal choice for this scenario.

Exam trap

The trap here is that candidates often confuse a funnel chart with a waterfall chart because both show sequential steps, but a waterfall chart is for cumulative changes (additions/subtractions), not for displaying the count at each stage of a funnel.

How to eliminate wrong answers

Option A is wrong because a stacked bar chart is used to compare parts of a whole across categories, not to show the sequential reduction in a funnel; it would obscure the drop-off between stages. Option B is wrong because a treemap displays hierarchical data as nested rectangles based on proportion, which is not suitable for a linear, sequential process like a conversion funnel. Option D is wrong because a waterfall chart is designed to show the cumulative effect of sequential positive and negative values (e.g., financial statements), not the simple count of visitors at each stage of a funnel.

688

MCQmedium

A stock analyst is analyzing monthly sales data for a retail company and observes a consistent pattern of high sales every December. This pattern is most likely an example of which time series component?

A.Irregular

B.Cyclical

C.Seasonality

D.Trend

AnswerC

Correct: regular pattern within a fixed period.

Why this answer

Seasonality refers to regular, predictable patterns that repeat at fixed intervals (e.g., yearly, monthly). The consistent December peak indicates a seasonal pattern.

689

MCQhard

A data team is preparing a quarterly business review for the CEO. The report must include both high-level summaries and the ability for the CEO to drill down into specific departments. Which reporting technique best meets this requirement?

A.A slide deck with one slide per department.

B.An interactive dashboard with drill-down capabilities.

C.A static PDF with a summary page and appendices.

D.A data dump in Excel with filters.

AnswerB

Interactive dashboards allow users to start with a summary and click to see underlying details for specific departments.

Why this answer

An interactive dashboard with drill-down capabilities (Option B) is the correct choice because it directly addresses the requirement for both high-level summaries and the ability to explore specific departments. Dashboards allow the CEO to view aggregated KPIs at a glance and then click through to detailed views for each department, providing a seamless, user-driven exploration experience without switching between separate reports or slides.

Exam trap

The trap here is that candidates often choose a static PDF (Option C) or a slide deck (Option A) because they associate 'report' with printed or presentation materials, but the question explicitly requires 'drill-down' capability, which is a hallmark of interactive business intelligence tools, not static documents.

How to eliminate wrong answers

Option A is wrong because a slide deck with one slide per department forces a linear, static presentation; the CEO cannot dynamically drill down from a summary view into a specific department without manually navigating slides, which breaks the requirement for interactive drill-down. Option C is wrong because a static PDF with a summary page and appendices is non-interactive; the CEO would have to jump to appendix pages manually, which is not a true drill-down capability and lacks the real-time filtering or cross-filtering that an interactive dashboard provides. Option D is wrong because a data dump in Excel with filters is a raw data file that requires the CEO to understand the data structure and apply filters manually; it does not offer a curated high-level summary or a guided drill-down path, and it risks overwhelming the user with granular data without pre-built aggregations.

690

Multi-Selectmedium

Which THREE are best practices for data profiling during acquisition? (Choose three.)

Select 3 answers

A.Immediately normalize data

B.Check for completeness

C.Assess data types

D.Identify outliers

E.Skip validation for trusted sources

AnswersB, C, D

Ensuring all required fields are populated is essential.

Why this answer

Checking for completeness (Option B) is a best practice during data acquisition because it ensures that all required fields and records are present before further processing. Incomplete data can lead to incorrect analysis or failed transformations, so profiling for missing values or nulls is a fundamental validation step.
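The three correct options (completeness, data types, outliers) can be sketched as three quick checks on a freshly acquired column. The sample values are hypothetical, and the 1.5 × IQR rule is one common outlier heuristic among several.

```python
import statistics

# Hypothetical raw 'amount' column as it arrives from a source file.
raw_amounts = ["20.5", "", "19.0", "abc", "21.0", "20.0", "19.5", "21.5", "20.0", "500.0"]

def to_float(text):
    """Return the value as a float, or None if it cannot be parsed."""
    try:
        return float(text)
    except ValueError:
        return None

# Completeness: how many values are missing?
missing_count = sum(1 for value in raw_amounts if value == "")

# Data types: which non-empty values are not numeric?
bad_type = [value for value in raw_amounts if value and to_float(value) is None]

# Outliers: apply the 1.5 * IQR rule to the values that did parse.
amounts = [to_float(value) for value in raw_amounts if to_float(value) is not None]
q1, _, q3 = statistics.quantiles(amounts, n=4)
iqr = q3 - q1
outliers = [a for a in amounts if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print(missing_count, bad_type, outliers)
```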

Exam trap

The trap here is that candidates confuse 'best practices for acquisition' with 'best practices for transformation,' leading them to select normalization (Option A) as an immediate step rather than a later processing stage.

691

MCQeasy

In simple linear regression, the coefficient of determination R² measures:

A.The probability that the slope is zero

B.The slope of the regression line

C.The proportion of variance in the dependent variable explained by the independent variable

D.The strength and direction of the linear relationship

AnswerC

Correct interpretation of R².

Why this answer

R² indicates the proportion of variance in the dependent variable explained by the independent variable.

692

MCQhard

A data analyst is using a recursive CTE to traverse an organizational hierarchy. What is the purpose of the anchor member in the recursive CTE?

A.It provides the initial seed or starting rows for the recursion.

B.It filters the final output of the recursive CTE.

C.It specifies how to join the CTE with itself recursively.

D.It defines the termination condition for the recursion.

AnswerA

The anchor member returns the base result set.

Why this answer

The anchor member initializes the recursion with the base result set.
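A minimal sketch of the two parts of a recursive CTE, run in SQLite through Python's sqlite3 module. The `employees` table, its column names, and the four sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2)],
)

rows = conn.execute(
    """
    WITH RECURSIVE org(employee_id, level) AS (
        -- Anchor member: seed the recursion with the top manager.
        SELECT employee_id, 1 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: add employees whose manager is already in org.
        SELECT e.employee_id, org.level + 1
        FROM employees AS e
        JOIN org ON e.manager_id = org.employee_id
    )
    SELECT employee_id, level FROM org ORDER BY employee_id
    """
).fetchall()
print(rows)
```

The recursion stops on its own when the recursive member returns no new rows, so the anchor member's only job is to supply the starting set.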

693

Multi-Selecthard

A data analyst is evaluating data quality issues in a customer database. Which TWO actions are best practices for ensuring data consistency?

Select 2 answers

A.Allowing null values for foreign keys

B.Standardizing date formats across all tables

C.Implementing referential integrity constraints

D.Enabling cascading updates on primary keys

E.Using data profiling to identify duplicate records

AnswersB, C

Correct: Uniform formats ensure consistency in temporal data.

Why this answer

Standardizing date formats across all tables (Option B) ensures that date values are stored and interpreted uniformly, eliminating inconsistencies that arise from mixed formats (e.g., MM/DD/YYYY vs. DD-MM-YY). This practice directly supports data consistency by enforcing a single representation, which is critical for accurate querying, reporting, and integration across systems.
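One way to standardize mixed date formats is to parse each known format and re-emit ISO 8601, as sketched below. The list of formats is an assumption about what the source tables contain, and `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime

# Formats assumed to appear in the source tables (illustrative only).
KNOWN_FORMATS = ("%m/%d/%Y", "%d-%m-%y", "%Y-%m-%d")

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, trying each known format in order."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

print(to_iso_date("03/15/2024"))
print(to_iso_date("15-03-24"))
```

Both calls yield the same ISO date. Guessing the format per value only works when the formats cannot be confused with one another; in practice the format should be recorded per source system.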

Exam trap

CompTIA often tests the distinction between data quality dimensions (e.g., consistency vs. accuracy), leading candidates to confuse data profiling (which identifies duplicates) with a direct method for enforcing consistency.

694

MCQeasy

In a regression analysis, the coefficient of determination (R²) is 0.85. How should this value be interpreted?

A.85% of the data points lie on the regression line

B.The slope of the regression line is 0.85

C.85% of the variance in the dependent variable is explained by the model

D.85% of the independent variables are significant

AnswerC

Correct interpretation of R².

Why this answer

R² represents the proportion of variance in the dependent variable that is explained by the independent variable(s). An R² of 0.85 means the model explains 85% of the variability.
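The definition can be checked by hand on a tiny dataset. The sketch below fits a least-squares line and computes R² as one minus the ratio of residual to total sum of squares; the five data points are made up for illustration.

```python
def r_squared(xs, ys):
    """R-squared for a simple least-squares line fitted to xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # Unexplained variation: squared distance from each point to the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distance from each point to the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

print(round(r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 2))
```

For this sample the fitted line leaves 2.4 of the 6.0 total squared variation unexplained, so R² is 0.6.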

695

MCQmedium

A data analyst creates a report and wants to ensure it tells a compelling story. Which element is most important for data storytelling?

A.Using only one chart type for consistency.

B.Avoiding any visual elements to keep focus on text.

C.Including all data without filtering.

D.Using a narrative arc with context.

AnswerD

Engages audience and makes findings memorable.

Why this answer

Option D is correct because data storytelling relies on a narrative arc—introducing context, building tension through data insights, and resolving with actionable conclusions—to engage the audience and make the data memorable. Without a narrative, even the most accurate data fails to drive understanding or decision-making. This aligns with the DA0-001 objective of communicating data insights effectively.

Exam trap

The trap here is that candidates often confuse 'data storytelling' with 'data presentation' and assume that including all data (Option C) is thorough, when in fact the exam emphasizes that a compelling story requires filtering and context to avoid overwhelming the audience.

How to eliminate wrong answers

Option A is wrong because using only one chart type ignores the fact that different data relationships (e.g., trends vs. distributions) require different visual encodings; forcing consistency sacrifices clarity. Option B is wrong because avoiding visual elements contradicts the principle that humans process visual information faster than text; data storytelling relies on charts, graphs, and annotations to highlight key patterns. Option C is wrong because including all data without filtering leads to cognitive overload and obscures the main message; effective storytelling requires selective inclusion based on the narrative's focus.

696

Multi-Selecthard

A data analyst is preparing a report on customer satisfaction scores. To comply with GDPR, which THREE actions must be taken? (Select THREE.)

Select 3 answers

A.Retain data indefinitely for analysis

B.Ensure aggregates do not identify individuals

C.Include customer names for context

D.Anonymize personally identifiable information

E.Establish data retention periods for the report data

AnswersB, D, E

Correct. Aggregates must be safe from re-identification.

Why this answer

GDPR requires anonymization of PII, ensuring aggregate data does not allow re-identification, and respecting data retention policies.

697

MCQmedium

A data analyst is creating a presentation for executives to explain why customer churn has increased over the last quarter. The analyst wants to present the story in a compelling way. Which narrative structure is most appropriate?

A.Problem, Hypothesis, Test

B.Background, Analysis, Recommendation

C.Situation, Complication, Resolution

D.Data, Visualization, Conclusion

AnswerC

This structure effectively sets the context, presents the problem, and offers a solution.

Why this answer

The Situation-Complication-Resolution structure is ideal for executive presentations because it first establishes the context (situation), then introduces the problem (complication—increased churn), and finally proposes a solution (resolution). This narrative arc aligns with how executives process strategic issues, making the data story compelling and actionable. In contrast, other structures are better suited for technical reports or hypothesis testing, not high-level storytelling.

Exam trap

CompTIA often tests the distinction between narrative structures for different audiences; the trap here is that candidates treat 'Background, Analysis, Recommendation' (a common technical report format) as appropriate for executives, when in fact it lacks the persuasive arc needed for strategic decision-making.

How to eliminate wrong answers

Option A is wrong because 'Problem, Hypothesis, Test' is a scientific method structure used for experimental validation, not for presenting a business narrative to executives. Option B is wrong because 'Background, Analysis, Recommendation' is a linear report format that lacks the dramatic tension needed to engage an executive audience on a problem like churn. Option D is wrong because 'Data, Visualization, Conclusion' is a data-centric sequence that prioritizes outputs over storytelling, failing to frame the business impact and resolution in a compelling way.

698

MCQmedium

A data analyst needs to compare the salary distribution across five departments. Which visualization is most appropriate?

A.Line chart

B.Side-by-side box plot

C.Scatter plot

D.Stacked bar chart

AnswerB

Box plots display distribution statistics for each group.

Why this answer

A side-by-side box plot (option B) is the most appropriate visualization for comparing salary distributions across multiple departments because it displays the median, quartiles, and potential outliers for each group simultaneously. This allows the analyst to assess central tendency, spread, and skewness across all five departments in a single, compact chart.

Exam trap

The trap here is that candidates often confuse 'comparing distributions' with 'showing trends' or 'showing relationships,' leading them to incorrectly select a line chart or scatter plot instead of recognizing that a box plot is purpose-built for distribution comparison across groups.

How to eliminate wrong answers

Option A is wrong because a line chart is designed to show trends over a continuous interval (e.g., time series) and is not suitable for comparing distributions of categorical groups like departments. Option C is wrong because a scatter plot visualizes the relationship between two continuous variables, not the distribution of a single variable across categories. Option D is wrong because a stacked bar chart is used to show the composition of parts to a whole across categories, not the distribution (e.g., quartiles, outliers) of a continuous variable like salary.

699

MCQeasy

A data analyst wants to compare the sales revenue of five different product categories for the current month. Which chart type is most suitable for this comparison?

A.Bar chart

B.Histogram

C.Pie chart

D.Line chart

AnswerA

Bar charts compare categories easily.

Why this answer

A bar chart is ideal for comparing discrete categories across a single metric like sales revenue.

700

Multi-Selectmedium

An analyst wants to use Python (pandas) to compute the average sales amount per region from a DataFrame 'df' with columns 'region' and 'sales'. Which TWO pandas operations are needed? (Select TWO).

Select 2 answers

A.df.fillna(0)

B.df.pivot_table(index='region', values='sales', aggfunc='mean')

C.df['sales'].apply(np.sqrt)

D.df.merge(df2, on='region')

E.df.groupby('region')['sales'].mean()

AnswersB, E

Pivot table with mean aggregation.

Why this answer

To compute the average per group, you can use groupby() followed by mean(), or pivot_table() with aggfunc='mean'; either one produces the result on its own. merge() combines DataFrames, apply(np.sqrt) transforms each value rather than aggregating by region, and fillna() only replaces missing values.
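A short sketch of the two correct options on a made-up DataFrame, assuming pandas is installed. The region names and sales figures are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {"region": ["East", "East", "West", "West"], "sales": [100.0, 200.0, 50.0, 150.0]}
)

# Option E: group rows by region, then average the sales column.
by_groupby = df.groupby("region")["sales"].mean()

# Option B: the same aggregation expressed as a pivot table.
by_pivot = df.pivot_table(index="region", values="sales", aggfunc="mean")

print(by_groupby)
print(by_pivot)
```

The groupby form returns a Series and the pivot_table form returns a DataFrame, but the per-region averages are the same.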

701

MCQhard

In a table with columns 'employee_id' and 'manager_id', a data analyst needs to retrieve the hierarchy level of each employee, where the top manager has manager_id NULL. Which SQL feature is best suited?

A.A window function with ROW_NUMBER()

B.A recursive CTE

C.A GROUP BY clause with aggregation

D.A self-join with a LEFT JOIN

AnswerB

Recursive CTE can iterate through levels to assign hierarchy depth.

Why this answer

Recursive CTE can traverse hierarchical data to compute levels.

702

MCQmedium

A data analyst is reviewing sales data and wants to find orders where the order total is between $100 and $500, inclusive. Which WHERE clause is correct?

A.total > 100 AND total < 500

B.total IN (100, 500)

C.total BETWEEN 100 AND 500

D.total >= 100 OR total <= 500

AnswerC

BETWEEN includes both boundary values.

Why this answer

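BETWEEN is inclusive of both endpoints, so it is equivalent to total >= 100 AND total <= 500. Option A excludes the boundary values, option B matches only totals of exactly 100 or 500, and option D is true for every non-null total.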
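The four WHERE clauses can be compared directly on a few boundary values. This sketch uses SQLite through Python's sqlite3 module; the `orders` table and its five totals are invented to bracket the endpoints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (total INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?)", [(99,), (100,), (300,), (500,), (501,)]
)

def totals(where_clause):
    """Return the totals matched by a WHERE clause (fixed strings only)."""
    sql = f"SELECT total FROM orders WHERE {where_clause} ORDER BY total"
    return [row[0] for row in conn.execute(sql)]

print(totals("total BETWEEN 100 AND 500"))     # option C
print(totals("total > 100 AND total < 500"))   # option A
print(totals("total IN (100, 500)"))           # option B
print(totals("total >= 100 OR total <= 500"))  # option D
```

Only the BETWEEN clause returns 100, 300, and 500 while leaving out 99 and 501.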

703

MCQeasy

A sales VP wants a quick summary of last month's revenue change and key drivers. Which report section is most relevant?

A.Executive summary

B.Data dictionary

C.Methodology notes

D.Row-level data

AnswerA

Correct. Executive summary gives headline and context.

Why this answer

Executive summaries provide high-level numbers and context for quick decision-making.

704

MCQhard

A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?

A.Recall (sensitivity)

B.F1-score

C.Precision

D.Accuracy

AnswerA

Recall measures how many actual churners were correctly found, directly addressing the focus.

Why this answer

Recall (sensitivity) is the most appropriate metric because it measures the proportion of actual churners correctly identified by the model. Since the dataset is imbalanced (only 10% churners) and the analyst's focus is on correctly identifying churners, recall directly addresses the cost of missing positive cases (false negatives). Accuracy would be misleading due to class imbalance, while precision and F1-score prioritize different trade-offs.

Exam trap

The trap here is that candidates often default to accuracy as the default metric, failing to recognize that class imbalance renders accuracy misleading, and that the question's explicit focus on 'correctly identifying churners' points directly to recall, not precision or F1-score.

How to eliminate wrong answers

Option B (F1-score) is wrong because it balances precision and recall, but the analyst's primary goal is to maximize identification of churners, not to balance false positives and false negatives; F1-score would penalize a model that achieves high recall at the expense of precision, which may be acceptable in this scenario. Option C (Precision) is wrong because it measures the proportion of predicted churners that are actual churners, focusing on false positives rather than false negatives; the analyst wants to minimize missed churners, not necessarily avoid false alarms. Option D (Accuracy) is wrong because with only 10% churners, a naive model predicting all non-churners would achieve 90% accuracy, masking poor performance on the minority class; accuracy is inappropriate for imbalanced classification problems.
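The naive model mentioned above can be worked through in a few lines. The sketch assumes 100 hypothetical customers with the 10% churn rate from the question and a model that never predicts churn.

```python
# 100 hypothetical customers: 10 churners (1) and 90 non-churners (0).
actual = [1] * 10 + [0] * 90
# A naive model that never predicts churn.
predicted = [0] * 100

true_pos = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
false_neg = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
correct = sum(a == p for a, p in zip(actual, predicted))

accuracy = correct / len(actual)
recall = true_pos / (true_pos + false_neg)
print(f"accuracy={accuracy:.0%}, recall={recall:.0%}")
```

Accuracy comes out at 90% while recall is 0%, because none of the ten churners were found. That gap is why recall is the better fit when the goal is to catch churners.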


705

MCQeasy

A data analytics team has created a report for stakeholders. The report includes complex statistical terms and raw data tables. Stakeholders are confused and ask for clarification. Which of the following should the team do to improve communication?

A.Schedule a follow-up meeting to explain each term individually.

B.Provide a glossary of statistical terms and keep the report as is.

C.Remove all data and only give conclusions.

D.Simplify the report by using clear visualizations and plain language summaries.

AnswerD

Correct. This directly addresses the confusion by making the report accessible.

Why this answer

Option D is correct because effective data communication requires tailoring the message to the audience. By replacing complex statistical terms and raw data tables with clear visualizations and plain language summaries, the team makes insights accessible to stakeholders who may lack technical expertise, directly addressing the confusion.

Exam trap

The trap here is that candidates may think providing more explanation (Option A) or more data (Option B) is always better, but the DA0-001 exam emphasizes that communication must be tailored to the audience's level of understanding, not just the completeness of the information.

How to eliminate wrong answers

Option A is wrong because scheduling a follow-up meeting to explain each term individually is inefficient and does not improve the report itself; stakeholders should be able to understand the report without needing a separate tutorial. Option B is wrong because providing a glossary while keeping the report as is forces stakeholders to constantly cross-reference terms, which does not simplify the communication and still leaves raw data tables that are hard to interpret. Option C is wrong because removing all data and only giving conclusions removes the evidence and context needed for stakeholders to trust and verify the insights, which undermines transparency and data-driven decision-making.


706

MCQmedium

A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?

A.Linear regression

B.Time series analysis

C.K-means clustering

D.Logistic regression

AnswerB

Time series analysis explicitly models seasonal patterns.

Why this answer

Time series analysis is specifically designed to model data points indexed in time order, making it ideal for capturing and forecasting seasonal patterns. Unlike regression models, it accounts for autocorrelation, trends, and seasonality components, which are critical for accurate sales prediction from historical data.

Exam trap

The trap here is that candidates see 'predict future sales' and mistakenly choose linear regression, overlooking that time series methods are required when data has temporal dependencies and seasonality.

How to eliminate wrong answers

Option A is wrong because linear regression assumes independence of observations and cannot model time-dependent structures like seasonality or autocorrelation. Option C is wrong because K-means clustering is an unsupervised learning method used for grouping similar data points, not for forecasting future values. Option D is wrong because logistic regression is used for binary classification problems, not for predicting continuous numeric sales figures.


707

MCQhard

A data governance team is establishing policies to ensure data quality. They define rules for data accuracy, completeness, and consistency. Which data governance function is primarily responsible for defining and enforcing these rules?

A.Data stewardship

B.Data ownership

C.Data quality management

D.Master data management

AnswerC

Data quality management is responsible for defining and enforcing quality rules.

Why this answer

Data quality management is the function that sets standards and processes to ensure data is accurate, complete, and consistent. Data stewardship often involves implementing these rules, but the overall responsibility lies with data quality management.


708

Multi-Selectmedium

A retail company wants to segment its customers based on purchase history. Which THREE methods are appropriate for customer segmentation?

Select 3 answers

A.RFM analysis

B.Linear regression

C.K-means clustering

D.t-test

E.Hierarchical clustering

AnswersA, C, E

RFM analysis segments customers based on recency, frequency, and monetary value.

Why this answer

K-means clustering, hierarchical clustering, and RFM analysis are common segmentation techniques. Linear regression and t-test are not segmentation methods.


709

MCQhard

An analyst needs to create a report in Power BI that shows year-to-date sales compared to the same period last year, with the ability to drill down from year to quarter. Which DAX function combination is most appropriate for the year-to-date calculation?

A.CALCULATE and ALL

B.SUM and FILTER

C.TOTALYTD and SAMEPERIODLASTYEAR

D.RANKX and TOPN

AnswerC

These time intelligence functions compute YTD and compare to the same period last year.

Why this answer

TOTALYTD calculates year-to-date values, and SAMEPERIODLASTYEAR shifts the comparison to the prior year. Together they allow YTD vs prior YTD comparison.


710

MCQmedium

The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?

A.Count total orders per customer regardless of date

B.Calculate average order count per customer for 2023

C.Find products with more than 5 orders in 2023

D.Identify customers who placed more than 5 orders in 2023

AnswerD

The query filters by 2023 date and having count > 5.

Why this answer

The query groups orders by customer_id and filters using a HAVING clause with COUNT(*) > 5, which counts the number of orders per customer. The WHERE clause restricts orders to those placed in 2023, so the result identifies customers who placed more than 5 orders in that year. This matches option D exactly.

Exam trap

CompTIA often tests the distinction between WHERE and HAVING, and the trap here is confusing a count of orders per customer with a count of products or an average, leading candidates to pick option B or C.

How to eliminate wrong answers

Option A is wrong because the WHERE clause filters for order_date in 2023, so the count is not regardless of date. Option B is wrong because the query counts orders per customer, not the average order count per customer. Option C is wrong because the query operates on an 'orders' table with no product-related column; it counts orders per customer, not products.


711

MCQmedium

An analyst is reviewing the above SQL query used to acquire data. What does this query retrieve?

A.Customers who placed more than 5 orders in 2023

B.All customers who placed at least 5 orders in 2023

C.The total number of orders per customer in 2023

D.Customers who placed exactly 5 orders in 2023

AnswerA

The HAVING clause filters for counts greater than 5.

Why this answer

The SQL query uses a HAVING clause with COUNT(*) > 5 to filter customers who placed more than 5 orders in 2023. The WHERE clause restricts records to the year 2023, and the GROUP BY customer_id aggregates orders per customer. The condition '> 5' explicitly excludes customers with exactly 5 or fewer orders, making option A correct.

Exam trap

The trap here is confusing the comparison operator '>' with '>=', leading candidates to mistakenly include customers with exactly 5 orders when the query explicitly excludes them.

How to eliminate wrong answers

Option B is wrong because 'at least 5 orders' would require the condition COUNT(*) >= 5, not > 5. Option C is wrong because the query returns customer IDs, not the total number of orders per customer; the COUNT is used only for filtering, not as a selected column. Option D is wrong because 'exactly 5 orders' would require COUNT(*) = 5, not > 5.
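The exhibit is not reproduced here, so the query below is a hypothetical reconstruction based only on the explanation (WHERE on the 2023 date range, GROUP BY customer_id, HAVING COUNT(*) > 5). It runs against an in-memory SQLite table with made-up rows to show how '>' and '>=' differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)

# Hypothetical data: customer 1 has 6 orders in 2023, customer 2 has exactly 5,
# and customer 3 has 6 orders that all fall in 2022.
rows = []
rows += [(1, f"2023-01-{day:02d}") for day in range(1, 7)]
rows += [(2, f"2023-02-{day:02d}") for day in range(1, 6)]
rows += [(3, f"2022-03-{day:02d}") for day in range(1, 7)]
conn.executemany(
    "INSERT INTO orders (customer_id, order_date) VALUES (?, ?)", rows
)

# The operator is substituted from fixed strings for illustration only.
QUERY = """
SELECT customer_id
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) {op} 5
ORDER BY customer_id
"""

more_than_5 = [row[0] for row in conn.execute(QUERY.format(op=">"))]
at_least_5 = [row[0] for row in conn.execute(QUERY.format(op=">="))]
print("COUNT(*) >  5:", more_than_5)
print("COUNT(*) >= 5:", at_least_5)
```

Customer 2 (exactly 5 orders) appears only in the '>=' result, and customer 3 is removed by the WHERE clause before any counting happens.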


712

MCQmedium

A data analyst is building a dashboard for executives and wants to ensure the most important metric, total revenue, is immediately visible. Which design principle should the analyst apply?

A.Appropriate precision

B.Consistent color coding

C.Visual hierarchy

D.Data-ink ratio

AnswerC

Correct. Visual hierarchy guides the viewer's eye to the most important metric first.

Why this answer

Visual hierarchy means arranging elements so that the most important information is most prominent, often by placing it in the top-left or center and making it larger.


713

Multi-Selectmedium

An analyst is preparing data for an A/B test and wants to ensure valid results. Which TWO of the following should be considered when calculating the required sample size?

Select 2 answers

A.Data dimensionality

B.Desired effect size

C.Skewness of data

D.Number of features

E.Statistical power

AnswersB, E

Correct: effect size is a key input.

Why this answer

Sample size calculation depends on desired effect size and statistical power, among other factors like significance level.
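As a rough illustration of how these inputs interact, the sketch below uses the common normal-approximation formula for comparing two proportions. The formula choice, the function name, and the conversion rates are assumptions for illustration, not something the question specifies.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion test.

    Assumes the normal approximation:
    n = (z_(1 - alpha/2) + z_power)^2 * (p1(1 - p1) + p2(1 - p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)          # statistical power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control                 # desired effect size
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical conversion rates: 10% baseline, different target lifts.
small_effect = sample_size_per_group(0.10, 0.11)
large_effect = sample_size_per_group(0.10, 0.15)
high_power = sample_size_per_group(0.10, 0.11, power=0.90)
print(small_effect, large_effect, high_power)
```

A smaller effect size or a higher power both increase the required sample size, which is why these two inputs must be fixed before the test starts.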


714

Multi-Selectmedium

A data analyst is extracting data from a web page using web scraping techniques. The data will be used for market research. Which TWO of the following are common challenges associated with web scraping?

Select 2 answers

A.Limited API rate limits

B.Legal and ethical restrictions

C.Website structure changes

D.High latency of data transfer

E.Inconsistent data formatting

AnswersB, C

Many websites prohibit scraping in their terms of service, and legal issues may arise.

Why this answer

Option B is correct because web scraping often involves accessing data that may be protected by copyright, terms of service, or privacy regulations such as GDPR or the Computer Fraud and Abuse Act (CFAA). Even if data is publicly accessible, repurposing it for market research without permission can lead to legal liability or ethical violations, making this a fundamental challenge.

Exam trap

CompTIA often tests the distinction between API-related challenges (rate limits, authentication) and web-scraping-specific challenges (structure changes, legal/ethical issues), so candidates mistakenly select 'Limited API rate limits' because they confuse web scraping with API consumption.


715

MCQmedium

A dataset contains sales transactions with columns 'order_date', 'amount', and 'region'. The analyst wants to calculate the total sales per region for orders placed in 2023, but only include regions where total sales exceed $10,000. Which SQL clause should be used to filter the aggregated results?

A.HAVING

B.WHERE

C.GROUP BY

D.FILTER

AnswerA

HAVING filters aggregated results after GROUP BY.

Why this answer

The HAVING clause filters groups after aggregation, whereas WHERE filters rows before grouping.
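A minimal sketch of the scenario, assuming a hypothetical table named sales and made-up rows, shows the two filters working at different stages:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_date TEXT, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2023-03-01", 7000.0, "North"),
        ("2023-07-15", 5000.0, "North"),  # North 2023 total: 12000
        ("2023-05-20", 4000.0, "South"),  # South 2023 total: 4000
        ("2022-11-02", 9000.0, "South"),  # dropped by WHERE (not 2023)
        ("2023-09-09", 10000.0, "East"),  # exactly 10000, so not > 10000
    ],
)

result = conn.execute(
    """
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'  -- row filter, before grouping
    GROUP BY region
    HAVING SUM(amount) > 10000                              -- group filter, after aggregation
    """
).fetchall()
print(result)
```

Only North survives both filters. Putting SUM(amount) > 10000 in the WHERE clause would be an error, because the aggregate does not exist yet when rows are filtered.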


716

MCQeasy

A data analyst wants to compare the means of three different training methods on employee productivity. Which statistical test is most appropriate?

A.Correlation analysis

B.ANOVA

C.Chi-square test

D.t-test

AnswerB

ANOVA compares means across multiple groups.

Why this answer

ANOVA (Analysis of Variance) is used to compare means of three or more groups.
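The core of a one-way ANOVA is the F-statistic, the ratio of between-group variance to within-group variance. The sketch below computes it by hand on made-up productivity scores; turning F into a p-value needs the F-distribution, which is left out here.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F-statistic for a list of groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical productivity scores for three training methods.
method_a = [4, 5, 6]
method_b = [6, 7, 8]
method_c = [8, 9, 10]

f_stat = one_way_anova_f([method_a, method_b, method_c])
f_same = one_way_anova_f([method_a, method_a, method_a])
print(f_stat, f_same)
```

A large F suggests that at least one group mean differs; when all group means are identical, the between-group term is zero and so is F.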


717

MCQeasy

A data analyst wants to retrieve the top 5 highest-paid employees from a table named 'employees' that has columns 'employee_id', 'salary', and 'name'. Which SQL query should they use?

A.SELECT TOP 5 name, salary FROM employees ORDER BY salary DESC;

B.SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 5;

C.SELECT name, salary FROM employees ORDER BY salary ASC LIMIT 5;

D.SELECT name, salary FROM employees WHERE ROWNUM <= 5 ORDER BY salary DESC;

AnswerB

Correct syntax.

Why this answer

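ORDER BY salary DESC sorts from highest to lowest, and LIMIT 5 restricts the result to the first 5 rows. LIMIT is the syntax used by MySQL, PostgreSQL, and SQLite; SQL Server uses TOP instead, so the question assumes a LIMIT-style dialect.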
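The query from option B can be tried against an in-memory SQLite database. The employee names and salaries below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [
        ("Ana", 90000), ("Ben", 72000), ("Cy", 110000), ("Dee", 65000),
        ("Eli", 98000), ("Fay", 83000), ("Gus", 120000),
    ],
)

# Sort highest-first, then keep only the first five rows.
top5 = conn.execute(
    "SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 5"
).fetchall()
top5_names = [name for name, _ in top5]
print(top5_names)
```

Changing DESC to ASC, as in option C, would return the five lowest-paid employees instead.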


718

MCQmedium

A data team needs to communicate insights about customer churn to the sales team. The insights include confidence intervals and p-values. The sales team is not familiar with statistics. Which of the following should the data team do?

A.Explain the practical implications of the results without statistical jargon.

B.Assume the sales team will learn the terms over time.

C.Use technical terms but provide written definitions.

D.Include a detailed statistical appendix.

AnswerA

Correct. This makes the insights accessible and actionable for the sales team.

Why this answer

Option A is correct because the sales team lacks statistical background, so presenting confidence intervals and p-values directly would cause confusion. The data team should translate these results into practical business implications—such as 'customers with a 30-day inactivity are 40% more likely to churn'—without using terms like p-value or confidence interval. This aligns with the DA0-001 objective of tailoring communication to the audience's expertise level.

Exam trap

The trap here is that candidates often choose Option C (providing definitions) thinking it balances accuracy and clarity, but the DA0-001 exam emphasizes audience adaptation—definitions still require the audience to learn technical terms, which is less effective than plain-language explanations.

How to eliminate wrong answers

Option B is wrong because assuming the sales team will learn statistical terms over time is unrealistic and risks misinterpretation of critical insights, leading to poor business decisions. Option C is wrong because providing written definitions of technical terms still forces the sales team to process unfamiliar jargon, which can slow understanding and reduce engagement. Option D is wrong because a detailed statistical appendix is excessive for a non-technical audience and may overwhelm them, defeating the purpose of clear communication.


719

MCQhard

Refer to the exhibit. Which conclusion can be drawn from this data quality report?

A.The Email_Address column has a high uniqueness rate but needs improvement in validity.

B.The column is fully consistent but has low completeness.

C.The column has low validity and low uniqueness.

D.The column requires immediate action to improve completeness.

AnswerA

Uniqueness is 97%, but validity is only 85%, meaning some emails may be in invalid format.

Why this answer

Option A is correct because the data quality report shows that the Email_Address column has a high uniqueness rate (97%), indicating few duplicate entries, but a lower validity score (85%), meaning some entries fail format checks such as a missing '@' or domain. The column is largely unique but contains invalid data, so it needs improvement in validity.

Exam trap

CompTIA often tests the distinction between uniqueness and validity, trapping candidates who assume high uniqueness implies high quality, when in fact validity is a separate dimension that can be poor even with perfect uniqueness.

How to eliminate wrong answers

Option B is wrong because the report indicates low validity, not full consistency; consistency refers to adherence to a standard format, which is violated here. Option C is wrong because the report shows high uniqueness (not low uniqueness), so the claim of 'low uniqueness' is factually incorrect. Option D is wrong because completeness (non-null values) appears high or acceptable; the issue is with validity, not missing data.


720

MCQeasy

A marketing team wants to explore the relationship between advertising spend (in dollars) and resulting revenue. Which chart type is most suitable?

A.Line chart

B.Table

C.Pie chart

D.Scatter plot

AnswerD

Scatter plot displays relationship between two numerical variables.

Why this answer

Scatter plots reveal correlation and distribution of two continuous variables.


721

MCQhard

A data analyst at a retail company is building a dashboard for store managers to track sales performance. The data comes from three sources: point-of-sale (POS) systems, inventory, and customer loyalty. The POS table contains columns transaction_id, store_id, date, product_id, quantity, and price. The inventory table has product_id, store_id, stock_level, and reorder_point. The loyalty table has customer_id, transaction_id, and points_earned. The analyst creates a star schema with a sales_fact fact table containing all rows from POS, dimension tables for store, product, date, and customer. To calculate average transaction value, the analyst uses the formula SUM(quantity * price) / COUNT(*). Store managers report that the average transaction value appears too low, especially for stores with multiple registers. The analyst realizes that because each product sold in a transaction creates a separate row in sales_fact, a single transaction with multiple items contributes multiple rows. The current calculation divides by the number of rows rather than the number of distinct transactions. Which of the following is the best course of action to correct the average transaction value metric? (Choose one.)

A.Use the MEDIAN function instead of AVG

B.Aggregate the data at the transaction level before calculating the average

C.Use a different data model that denormalizes transaction totals into a new fact table

D.Create a calculated field that sums sales per transaction (quantity * price) and then averages across distinct transaction IDs

AnswerD

This correctly computes average transaction value by first summing per transaction.

Why this answer

Option D is correct because it calculates total sales per transaction (summing product-level rows) and then averages across distinct transactions, fixing the over-counting issue. Option B is too vague and does not specify how to aggregate. Option C is not required since the star schema is appropriate.

Option A uses the median, which does not address the counting issue.
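The over-counting can be reproduced with a few rows. The sketch below builds a hypothetical, simplified sales_fact table (only the columns needed here) and compares the per-row formula with two per-transaction versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "transaction_id INTEGER, product_id INTEGER, quantity INTEGER, price REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        (1, 101, 2, 10.0),  # transaction 1, line total 20
        (1, 102, 1, 30.0),  # transaction 1, line total 30
        (1, 103, 1, 10.0),  # transaction 1, line total 10 (transaction total 60)
        (2, 101, 4, 10.0),  # transaction 2, line total 40 (transaction total 40)
    ],
)

# Original formula: divides by the number of rows (line items).
per_row = conn.execute(
    "SELECT SUM(quantity * price) / COUNT(*) FROM sales_fact"
).fetchone()[0]

# Corrected: divide by the number of distinct transactions.
per_txn = conn.execute(
    "SELECT SUM(quantity * price) / COUNT(DISTINCT transaction_id) FROM sales_fact"
).fetchone()[0]

# Equivalent two-step form: total per transaction, then average the totals.
two_step = conn.execute(
    """
    SELECT AVG(txn_total)
    FROM (
        SELECT transaction_id, SUM(quantity * price) AS txn_total
        FROM sales_fact
        GROUP BY transaction_id
    )
    """
).fetchone()[0]
print(per_row, per_txn, two_step)
```

With four line items across two transactions, the per-row formula reports half the true average transaction value.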


722

MCQmedium

Refer to the exhibit. Which type of data is the field "region"?

A.Qualitative

B.Continuous

C.Quantitative

D.Discrete

AnswerA

Correct. Region is a descriptive category.

Why this answer

The field 'region' contains categorical labels (e.g., 'North', 'South', 'East', 'West') that represent distinct groups or categories, not numerical measurements. Qualitative data (also called categorical data) describes attributes or characteristics that are named rather than measured on a numeric scale. Since 'region' assigns a name to a geographic area without any inherent numeric value or order, it is a classic example of qualitative (specifically nominal) data.

Exam trap

The trap here is that candidates may confuse 'region' with a numeric code (e.g., region ID 1, 2, 3) and incorrectly classify it as discrete quantitative data, but the field 'region' as shown contains text labels, making it qualitative.

How to eliminate wrong answers

Option B is wrong because continuous data represents measurements that can take any value within a range (e.g., temperature, time), but 'region' consists of discrete labels with no numeric continuum. Option C is wrong because quantitative data involves numerical values that can be counted or measured (e.g., sales amount, age), whereas 'region' is a non-numeric category. Option D is wrong because discrete data is a subset of quantitative data that takes countable integer values (e.g., number of customers), but 'region' is not numeric at all.


723

MCQmedium

A data analyst wants to show the relationship between advertising spend and sales revenue for 50 stores. Which chart type is most appropriate?

A.Line chart

B.Scatter plot

C.Bar chart

D.Pie chart

AnswerB

Scatter plots show the relationship between two continuous variables.

Why this answer

A scatter plot is the most appropriate chart for showing the relationship between two continuous variables—advertising spend and sales revenue—across 50 stores. Each point on the plot represents one store, allowing the analyst to visually assess correlation, trends, or outliers. This aligns with the DA0-001 objective of selecting visualizations that best represent bivariate relationships.

Exam trap

The trap here is that candidates often confuse a line chart (which connects points in sequence) with a scatter plot (which treats points as independent observations), leading them to incorrectly choose a line chart when no temporal or ordered dimension exists.

How to eliminate wrong answers

Option A is wrong because a line chart is typically used to display trends over time or ordered categories, not to show the relationship between two independent continuous variables like advertising spend and sales revenue. Option C is wrong because a bar chart compares discrete categories or groups, not the correlation between two continuous metrics across 50 individual stores. Option D is wrong because a pie chart shows proportions of a whole for categorical data, which is irrelevant for analyzing the relationship between two numerical variables.


724

Multi-Selectmedium

A data analyst is reviewing a dataset of customer transactions and wants to assess data quality by profiling the 'order_date' column. Which TWO profiling tasks are most appropriate for this date column? (Select TWO).

Select 2 answers

A.Pattern analysis (e.g., format consistency)

B.Count of null values

C.Variance

D.Cardinality (number of unique values)

E.Data type verification

AnswersB, E

Null count is a standard profiling check for any column.

Why this answer

Profiling a date column typically includes checking for null values and verifying the data type. Cardinality and pattern analysis are more relevant for categorical or string columns; variance is for numeric data.


725

Multi-Selectmedium

An analyst is choosing a chart to show the correlation between two continuous variables. Which TWO chart types could be used? (Select two.)

Select 2 answers

A.Bubble chart

B.Scatter plot

C.Pie chart

D.Waterfall chart

E.Histogram

AnswersA, B

Bubble charts are scatter plots with a third variable; they also show correlation.

Why this answer

A bubble chart is an extension of a scatter plot that can show the correlation between two continuous variables on the x- and y-axes, while a third variable is represented by the size of the bubbles. For the specific purpose of showing correlation between exactly two continuous variables, the bubble chart is valid because the bubble size is optional and does not interfere with the primary x-y relationship. This makes it a correct choice for visualizing the relationship between two continuous variables.

Exam trap

CompTIA often tests the misconception that a histogram can show relationships between two variables, but it only displays the frequency distribution of a single continuous variable, making it a common distractor for candidates who confuse it with a bar chart or scatter plot.


726

Multi-Selectmedium

Which TWO of the following are examples of semi-structured data?

Select 2 answers

A.XML document

B.JSON object

C.Relational table

D.Plain text file

E.CSV file

AnswersA, B

XML uses tags and has flexible schema, semi-structured.

Why this answer

XML and JSON have tags/keys but no rigid schema, making them semi-structured. CSV is structured, relational tables are structured, plain text is unstructured.
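To make "tags/keys but no rigid schema" concrete, the sketch below parses a small made-up JSON array and XML document in which the second record simply omits a field that the first one has.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads: the second customer has no tag information.
json_text = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
xml_text = (
    "<customers>"
    "<customer id='1'><name>Ana</name><tag>vip</tag></customer>"
    "<customer id='2'><name>Ben</name></customer>"
    "</customers>"
)

# JSON: each object carries its own keys, and records may differ.
records = json.loads(json_text)
json_keys = [sorted(record.keys()) for record in records]

# XML: each element carries its own child tags, and elements may differ.
root = ET.fromstring(xml_text)
xml_children = [[child.tag for child in customer] for customer in root]

print(json_keys)
print(xml_children)
```

Both formats describe their own structure record by record, whereas a relational table would require every row to fit the same fixed columns.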


727

MCQmedium

A data analyst creates a scatter plot showing the relationship between advertising spend and revenue. The plot shows a strong positive correlation. Which of the following should the analyst include in the report to ensure accurate communication?

A.Include a note that correlation does not imply causation.

B.Replace the scatter plot with a bar chart.

C.Remove any outliers from the plot.

D.Add a trend line to the scatter plot.

AnswerA

This prevents misinterpretation of the relationship.

Why this answer

Option A is correct because correlation does not imply causation, and this caveat is essential. Option B is wrong because the scatter plot already shows the relationship. Option C is wrong because removing points could bias the analysis.

Option D is wrong because a trend line is not necessary for every scatter plot.


728

MCQmedium

A data analyst uses a CTE to simplify a complex query. Which keyword is used to define a CTE?

A.DEFINE

B.CTE

C.DECLARE

D.WITH

AnswerD

CTEs are defined using the WITH keyword.

Why this answer

The WITH clause introduces a CTE.
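A minimal sketch of a CTE, using a hypothetical orders table and a made-up CTE name (customer_totals), run through SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 30.0), (3, 200.0)],
)

# WITH names an intermediate result that the main query can then read
# as if it were a table.
big_spenders = conn.execute(
    """
    WITH customer_totals AS (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    )
    SELECT customer_id, total
    FROM customer_totals
    WHERE total > 100
    ORDER BY customer_id
    """
).fetchall()
print(big_spenders)
```

The same result could be written with a subquery; the CTE mainly improves readability when the intermediate step has a meaningful name.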


729

Multi-Selectmedium

An analyst is planning an A/B test to compare two website designs. Which TWO factors should be considered when calculating the required sample size?

Select 2 answers

A.Data type of the outcome variable

B.Desired effect size

C.Statistical power

D.Color scheme of the designs

E.Number of missing values

AnswersB, C

Correct.

Why this answer

Statistical power and desired effect size are key inputs for sample size calculations.


730

MCQeasy

Which stage of the data lifecycle involves converting raw data into a usable format, such as cleaning or validating?

A.Archival

B.Processing

C.Ingestion

D.Storage

AnswerB

Processing includes cleaning and transforming raw data.

Why this answer

Processing is the stage where raw data is transformed into a usable format through cleaning, validation, normalization, or aggregation. This step ensures data quality and consistency before analysis or storage, directly matching the question's description.

Exam trap

The trap here is confusing ingestion (data arrival) with processing (data transformation), as both occur early in the lifecycle but serve distinct purposes.

How to eliminate wrong answers

Option A is wrong because archival refers to moving data to long-term storage for compliance or historical purposes, not cleaning or validating. Option C is wrong because ingestion is the initial capture or import of raw data from sources, not its transformation. Option D is wrong because storage is the persistent retention of data in databases or filesystems, not the conversion into a usable format.


731

MCQmedium

A data analyst wants to retrieve data from a REST API that returns JSON. Which step is part of the data lifecycle for this activity?

A.Data archival

B.Data sharing

C.Data deletion

D.Data ingestion

AnswerD

Ingestion is the initial step of bringing data from a source.

Why this answer

Ingestion is the process of bringing data into a system for further processing.


732

MCQeasy

A hospital wants to analyze patient readmission rates. The data contains daily patient visits. What is the level of granularity?

A.Patient

B.Visit

C.Day

D.Hospital

AnswerB

Correct. Each record captures one visit.

Why this answer

The level of granularity refers to the finest detail captured in the dataset. Since the data contains daily patient visits, each record represents a single visit event, not the patient or the day itself. Therefore, 'Visit' is the correct granularity because each row corresponds to one visit occurrence.

Exam trap

The trap here is confusing the subject of analysis (patient readmission rates) with the actual data granularity (each row is a visit), leading candidates to incorrectly select 'Patient' instead of 'Visit'.

How to eliminate wrong answers

Option A is wrong because 'Patient' would be the granularity if the data summarized all visits per patient (e.g., one row per patient with aggregated readmission counts), but here each visit is a separate record. Option C is wrong because 'Day' would be the granularity if the data aggregated all visits per day (e.g., total visits per day), but the data contains individual visit records, not daily summaries. Option D is wrong because 'Hospital' would be the granularity if the data aggregated across the entire hospital (e.g., total readmission rate for the hospital), but the data is at the individual visit level.


733

MCQhard

A data analyst is cleaning a dataset with missing values in a time series of daily temperatures. The missing values occur sporadically. Which imputation method is most appropriate to maintain the temporal trend?

A.Forward-fill

B.Mean imputation

C.Median imputation

D.Interpolation

AnswerD

Correct: uses neighboring values to estimate missing points, preserving trend.

Why this answer

Interpolation estimates missing values by using surrounding data points and is suitable for time series with a trend. Forward-fill carries the last observation forward, which may not capture trend well. Mean imputation ignores order.
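The three approaches can be compared on a short made-up temperature series. The sketch below assumes evenly spaced daily readings and plain linear interpolation; the helper names are hypothetical, and gaps at the very start or end of the series are left untouched by the interpolation helper.

```python
from statistics import mean


def forward_fill(values):
    """Replace each missing value with the last observed value."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def mean_fill(values):
    """Replace each missing value with the mean of the observed values."""
    average = mean(v for v in values if v is not None)
    return [average if v is None else v for v in values]


def linear_interpolate(values):
    """Fill gaps between two known points along a straight line."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical daily temperatures with sporadic gaps and an upward trend.
temps = [10.0, None, 14.0, None, None, 20.0]

ffilled = forward_fill(temps)
interpolated = linear_interpolate(temps)
mean_filled = mean_fill(temps)
print(ffilled)
print(interpolated)
print(mean_filled)
```

Interpolation follows the rising trend through each gap, forward-fill produces flat steps, and mean imputation inserts the same value everywhere regardless of position in time.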


734

MCQmedium

A retail company analyzes customer purchase data to improve inventory management. They store daily transaction records in a relational database and monthly aggregate reports in a data warehouse. Which difference between these storage methods best explains why the warehouse is more suitable for trend analysis?

A.The database uses a star schema while the warehouse uses a normalized schema.

B.The database enforces ACID transactions, while the warehouse uses eventual consistency.

C.The database is optimized for write-heavy OLTP, while the warehouse is optimized for read-heavy OLAP.

D.The database stores only current data, while the warehouse stores historical data.

AnswerC

Correct: OLTP supports many writes; OLAP supports complex reads.

Why this answer

Option C is correct because OLTP databases are optimized for high-frequency write operations (INSERT/UPDATE/DELETE) and ACID compliance, making them ideal for transaction processing but poor for complex analytical queries. In contrast, a data warehouse is optimized for read-heavy OLAP workloads, using columnar storage, pre-aggregated tables, and indexing strategies that enable fast aggregation and trend analysis over large historical datasets. This architectural difference directly supports the retail company's need to analyze purchase trends over time.

Exam trap

CompTIA often tests the misconception that 'data warehouses only store historical data' (Option D) as the primary reason for trend analysis suitability, but the real differentiator is the workload optimization (OLTP vs. OLAP), not merely the presence of history.

How to eliminate wrong answers

Option A is wrong because a star schema (with fact and dimension tables) is actually typical of data warehouses for analytical queries, while OLTP databases usually use normalized schemas to reduce redundancy and maintain data integrity. Option B is wrong because data warehouses often support ACID or snapshot isolation for consistency, and eventual consistency is more characteristic of NoSQL systems, not traditional data warehouses. Option D is wrong because relational databases can store historical data as well; the key difference is not the presence of history but the optimization for read-heavy analytical queries versus write-heavy transactional processing.


735

MCQeasy

A marketing team conducted a customer satisfaction survey for five different departments (Sales, Support, Billing, Shipping, Returns). The survey asked customers to rate their satisfaction on a scale of 1 (Very Dissatisfied) to 5 (Very Satisfied). The data is ordinal and the team wants to visualize the distribution of responses for each department to quickly see which department has the most 'Very Satisfied' customers and which has the most 'Very Dissatisfied'. They also want to compare the spread of responses across departments. Which chart type should they use?

A.Stacked bar chart with departments on x-axis and counts of each rating stacked

B.Line chart with departments on x-axis and average rating on y-axis

C.Box plot for each department

D.Scatter plot with department as category and satisfaction score as value

AnswerA

Stacked bars show the full distribution of ordinal responses, highlighting proportions of top and bottom ratings.

Why this answer

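Option A is correct because a stacked bar chart shows the count or proportion of each rating for each department, making it easy to compare the 'Very Satisfied' and 'Very Dissatisfied' segments and the overall spread of responses. Option B is wrong because a line chart of average ratings hides the distribution and implies a trend across departments that have no natural order. Option C is wrong because a box plot is designed for continuous data and summarizes a five-point ordinal scale poorly.

Option D is wrong because a scatter plot is meant for showing relationships between two numeric variables.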
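As a rough illustration of the data behind a stacked bar chart, the sketch below counts how many times each rating appears per department. The responses and the helper name `rating_counts` are made up for this example; only the 1 to 5 scale and the department names come from the question.

```python
from collections import Counter

# Hypothetical survey responses: (department, rating on the 1-5 ordinal scale).
responses = [
    ("Sales", 5), ("Sales", 4), ("Sales", 5), ("Sales", 1),
    ("Support", 2), ("Support", 1), ("Support", 1), ("Support", 3),
    ("Billing", 3), ("Billing", 5), ("Billing", 4),
]

RATINGS = [1, 2, 3, 4, 5]


def rating_counts(rows):
    """Return {department: [count of rating 1, ..., count of rating 5]}."""
    counts = {}
    for dept, rating in rows:
        counts.setdefault(dept, Counter())[rating] += 1
    return {dept: [c[r] for r in RATINGS] for dept, c in counts.items()}


# Each list is one stacked bar: one segment per rating level.
stacks = rating_counts(responses)

# The top and bottom segments answer the two questions in the scenario.
most_very_satisfied = max(stacks, key=lambda d: stacks[d][-1])
most_very_dissatisfied = max(stacks, key=lambda d: stacks[d][0])
```

Each per-department list can be passed to a plotting library as the segments of one bar; dividing by the department total would give a 100% stacked version.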

736

MCQhard

An analyst writes a SQL query that uses a window function: SELECT employee_id, salary, LAG(salary, 1) OVER (ORDER BY salary DESC) AS prev_salary FROM employees. What does the LAG function return for the row with the highest salary?

A.The same salary value

B.NULL

C.The next highest salary

D.Zero

AnswerB

LAG returns NULL if there is no preceding row.

Why this answer

LAG returns the previous row's value in the ordered partition. For the first row (highest salary), there is no previous row, so it returns NULL.
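The behavior can be checked with a small sketch that runs the query from the question against an in-memory SQLite database. The three sample rows are invented for illustration, and window functions assume SQLite 3.25 or later.

```python
import sqlite3

# In-memory database with a hypothetical three-row employees table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, 50000), (2, 90000), (3, 70000)],
)

# Same window function as in the question; the outer ORDER BY only fixes
# the order in which the result rows are returned.
rows = conn.execute(
    "SELECT employee_id, salary, "
    "LAG(salary, 1) OVER (ORDER BY salary DESC) AS prev_salary "
    "FROM employees ORDER BY salary DESC"
).fetchall()
conn.close()

# The first row (highest salary) has no preceding row, so prev_salary is
# None, which is how Python represents SQL NULL.
for row in rows:
    print(row)
```

A third argument to LAG, such as `LAG(salary, 1, 0)`, would supply a default value instead of NULL.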

737

MCQmedium

An analyst is sampling a large customer database to estimate the average purchase amount. To ensure that the sample proportionally represents different customer segments (e.g., age groups), which sampling method should be used?

A.Systematic sampling

B.Simple random sampling

C.Cluster sampling

D.Stratified sampling

AnswerD

Stratified sampling ensures proportional representation from each stratum.

Why this answer

Stratified sampling divides the population into strata (e.g., age groups) and samples proportionally from each stratum.
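A minimal sketch of proportional stratified sampling, using only the standard library. The customer records, the `age_group` labels, and the helper name `stratified_sample` are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by key(record)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)

    sample = []
    for members in strata.values():
        # Proportional allocation: each stratum contributes in proportion
        # to its size, with at least one record per stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 60% aged 18-34, 30% aged 35-54, 10% aged 55+.
customers = (
    [{"id": i, "age_group": "18-34"} for i in range(600)]
    + [{"id": i, "age_group": "35-54"} for i in range(600, 900)]
    + [{"id": i, "age_group": "55+"} for i in range(900, 1000)]
)

sample = stratified_sample(customers, key=lambda c: c["age_group"], fraction=0.1)
```

With a 10% fraction the sample keeps the 60/30/10 split of the population, which simple random sampling only achieves on average.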

738

MCQhard

A data engineer is designing a system to store raw sensor data from thousands of IoT devices. The data will be used later for various analytics projects, but the schema is not yet defined. Which storage solution is most appropriate?

A.Data lake

B.Data mart

C.Data warehouse

D.Relational database

AnswerA

Data lakes store raw data in any format and allow schema-on-read.

Why this answer

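A data lake (typically built on object storage such as Amazon S3 or Azure Data Lake Storage) stores raw data in its native format without requiring a predefined schema, making it suitable for IoT data whose structure will be decided later at read time.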

739

Multi-Selectmedium

A company is acquiring social media data via a public API. Which TWO considerations are important for ensuring ethical and legal compliance?

Select 2 answers

A.Share raw data with third parties for additional insights

B.Use the data for any internal analysis without restrictions

C.Anonymize personally identifiable information (PII) before storage

D.Cache data indefinitely to avoid repeated API calls

E.Comply with the platform's terms of service

AnswersC, E

Anonymization protects individual privacy and complies with regulations.

Why this answer

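Option C is correct because anonymizing PII before storage is a fundamental data privacy practice under regulations like GDPR and CCPA. When acquiring data via a public API, the company must ensure that personal identifiers (e.g., names, email addresses, IP addresses) are removed or obfuscated to prevent re-identification, reducing legal liability and ethical risk. Option E is correct because the platform's terms of service define how data obtained through its API may be collected, stored, and used; ignoring them exposes the company to legal and contractual risk.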

Exam trap

The trap here is that candidates may confuse 'caching for efficiency' (Option D) with ethical compliance, overlooking that indefinite storage violates data minimization principles and platform terms, while 'internal analysis' (Option B) seems harmless but ignores explicit usage restrictions in the API's terms of service.
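One possible way to obfuscate identifiers before storage is keyed hashing, sketched below. The field list, the sample post, and the helper name `pseudonymize` are hypothetical. Note that keyed hashing is pseudonymization rather than full anonymization; whether it satisfies a given regulation is a legal question that this sketch does not settle.

```python
import hashlib
import hmac

# Hypothetical list of fields treated as PII in the incoming records.
PII_FIELDS = ("name", "email", "ip_address")


def pseudonymize(record, secret_key):
    """Return a copy of record with PII fields replaced by keyed SHA-256 digests."""
    cleaned = dict(record)
    for field in PII_FIELDS:
        if field in cleaned:
            digest = hmac.new(
                secret_key, str(cleaned[field]).encode("utf-8"), hashlib.sha256
            )
            cleaned[field] = digest.hexdigest()
    return cleaned


# Made-up post as it might arrive from a social media API.
post = {"name": "Ada Example", "email": "ada@example.com", "text": "Great service!"}

# The key is a placeholder; a real key would be kept out of source code.
safe_post = pseudonymize(post, secret_key=b"replace-with-a-real-secret")
```

The same input always maps to the same digest under a given key, so records can still be grouped per user without storing the raw identifier.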

740

MCQmedium

An OLTP system processes thousands of transactions per second. Which property ensures that a transaction is fully completed or fully rolled back, preventing partial updates?

A.Isolation

B.Durability

C.Atomicity

D.Consistency

AnswerC

Atomicity ensures all operations in a transaction complete or none do.

Why this answer

Atomicity guarantees that a transaction is treated as a single unit, completed entirely or not at all.
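Atomicity can be seen in a short sketch with an in-memory SQLite database. The `accounts` table, the balances, and the `transfer` helper are invented for illustration; the CHECK constraint is just a convenient way to make the second statement fail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The credit succeeds, the debit violates the CHECK constraint, and the
# rollback undoes the credit as well: no partial update survives.
failed = transfer(conn, 1, 2, 500)
after_failure = dict(conn.execute("SELECT id, balance FROM accounts"))

# A valid transfer applies both updates together.
succeeded = transfer(conn, 1, 2, 30)
after_success = dict(conn.execute("SELECT id, balance FROM accounts"))
```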

741

MCQmedium

A data analyst is reviewing a dataset containing house prices. The mean price is $350,000 and the median is $280,000. Which of the following best describes the distribution of house prices?

A.The distribution is right-skewed.

B.The distribution is symmetric.

C.The distribution is left-skewed.

D.The distribution is bimodal.

AnswerA

Mean > median indicates right skew.

Why this answer

When the mean is greater than the median, the distribution is right-skewed (positively skewed) because higher values pull the mean upward.
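A small worked example, with seven made-up prices chosen so that the mean and median match the figures in the question, shows how a single expensive house produces the gap.

```python
from statistics import mean, median

# Hypothetical prices: six typical houses and one very expensive outlier.
prices = [210_000, 240_000, 260_000, 280_000, 300_000, 320_000, 840_000]

avg_price = mean(prices)    # pulled upward by the 840,000 outlier
mid_price = median(prices)  # middle value, unaffected by the outlier's size

# Mean above median is the usual signature of a right (positive) skew.
right_skewed = avg_price > mid_price
```

Here the mean works out to 350,000 and the median to 280,000, so the long tail is on the high side.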

742

MCQeasy

A data analyst is creating a report for a marketing campaign. The campaign data includes customer names, email addresses, and purchase history. Which of the following best describes the 'customer name' data type?

A.Nominal

B.Quantitative

C.Ordinal

D.Discrete

AnswerA

Nominal is categorical without order.

Why this answer

Customer names are categorical labels that identify individuals without any inherent order or numerical value. This fits the definition of nominal data, which is used for naming or classifying variables. In data analysis, nominal data can be stored as strings and used for grouping or filtering, but arithmetic operations are meaningless.

Exam trap

CompTIA often tests the distinction between nominal and ordinal data by presenting a label that could be mistaken for having an order (e.g., 'customer name' might be confused with 'rank' or 'tier'), but the trap here is that names are purely categorical with no intrinsic ranking.

How to eliminate wrong answers

Option B is wrong because quantitative data represents numerical measurements or counts (e.g., purchase amount), not text labels like names. Option C is wrong because ordinal data has a meaningful order or rank (e.g., customer satisfaction rating), but customer names have no inherent sequence. Option D is wrong because discrete data consists of countable numerical values (e.g., number of purchases), whereas customer names are non-numeric categories.

743

MCQhard

Refer to the exhibit. What is the most likely cause of the extraction failure?

A.The source table is locked

B.The network firewall is blocking the port

C.The extraction query is too complex

D.The database server is down

AnswerB

Causes connection to hang until timeout.

Why this answer

Option B is correct because connection timeouts with consistent 30-second delays suggest the network firewall is blocking the port, causing the connection to hang until timeout. Option A is wrong because a locked table would cause a lock wait timeout, not a connection timeout. Option C is wrong because a complex query would cause a slow query, not a connection timeout.

Option D is wrong because if the server were down, the connection would typically be refused immediately rather than hanging.

744

MCQmedium

A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?

A.SARIMA

B.Simple exponential smoothing

C.Holt-Winters exponential smoothing

D.ARIMA

AnswerC

Holt-Winters includes trend and seasonality components, making it suitable for this data.

Why this answer

The Holt-Winters exponential smoothing model (option C) is the most appropriate because it explicitly captures both trend and seasonality components, which are present in the sales data (upward trend and yearly seasonal patterns). Unlike simple exponential smoothing, Holt-Winters includes additive or multiplicative seasonal terms, making it ideal for data with clear, repeating seasonal cycles over a 12-month horizon.

Exam trap

The trap here is that candidates often choose ARIMA or SARIMA because they are more 'advanced,' but the question specifically describes clear trend and seasonality without requiring stationarity or differencing, making Holt-Winters the most direct and appropriate choice.

How to eliminate wrong answers

Option A (SARIMA) is wrong because, while SARIMA can model trend and seasonality, it relies on differencing to make the data stationary and involves more complex parameter selection (p, d, q, P, D, Q, s); for a straightforward forecasting task with clear trend and seasonality, Holt-Winters is simpler and often more robust. Option B (Simple exponential smoothing) is wrong because it only handles level (no trend or seasonality), so it would fail to capture the upward trend and yearly seasonal patterns in the sales data. Option D (ARIMA) is wrong because a non-seasonal ARIMA model can handle trend through differencing but has no seasonal terms; without seasonal differencing or seasonal AR terms, it cannot account for the repeating yearly patterns in the data.

745

Multi-Selectmedium

In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)

Select 2 answers

A.Linearity: the relationship between predictors and response is linear

B.Large sample size

C.Normality of errors

D.Homoscedasticity: errors have constant variance

E.Independence of errors

AnswersA, D

Nonlinear relationships can bias coefficient estimates.

Why this answer

For unbiased coefficient estimates in multiple linear regression, the linearity assumption (A) ensures that the model correctly specifies the functional form between predictors and the response; a misspecified form biases the coefficients. Homoscedasticity (D), meaning constant error variance across all levels of the predictors, is one of the Gauss-Markov conditions under which ordinary least squares (OLS) is the best linear unbiased estimator. Strictly speaking, heteroscedasticity makes OLS inefficient and distorts standard errors rather than biasing the coefficients themselves, but D is the option the answer key pairs with linearity.

Exam trap

CompTIA often tests the distinction between the assumptions keyed here as critical for the coefficient estimates (linearity and homoscedasticity) and those that matter mainly for inference (normality, independence, large sample size), causing candidates to mistakenly select normality or independence.

746

MCQeasy

Refer to the exhibit. A stakeholder complains that the line chart exaggerates the changes in sales. What is the most likely cause?

A.The y-axis does not start at zero

B.There are too few data points

C.The data labels are incorrect

D.The chart type should be a bar chart

AnswerA

Starting at a non-zero value exaggerates differences.

Why this answer

Setting beginAtZero to false truncates the y-axis, making small changes appear larger.

747

MCQhard

While reviewing a dashboard, an analyst notices that the data in a trend line chart does not match the underlying data due to a filter setting. The dashboard is used for weekly executive meetings. What should the analyst do?

A.Ignore the discrepancy if it is small.

B.Wait for someone to complain before acting.

C.Immediately remove the dashboard and send raw data.

D.Document the issue and fix the filter before the next meeting.

AnswerD

Proactive approach maintains data integrity and trust.

Why this answer

Option D is correct because the analyst has identified a data integrity issue caused by a filter setting that directly impacts the accuracy of the trend line chart. The dashboard is used for weekly executive meetings, so the analyst must document the discrepancy and correct the filter before the next meeting to ensure data-driven decisions are based on accurate visualizations. This aligns with best practices in data governance and the principle of maintaining trust in reporting tools.

Exam trap

The trap here is that candidates may assume small discrepancies are acceptable or that waiting for complaints is a valid approach, but the exam emphasizes proactive data integrity and the importance of maintaining accurate visualizations for scheduled stakeholder meetings.

How to eliminate wrong answers

Option A is wrong because ignoring even a small discrepancy in a dashboard used for executive decision-making can lead to compounded errors in trend analysis and erode trust in the data; any deviation from the underlying data must be investigated and corrected. Option B is wrong because waiting for someone to complain is reactive and unprofessional; the analyst should proactively ensure data accuracy, especially for a recurring weekly meeting where stakeholders rely on consistent, correct visualizations. Option C is wrong because immediately removing the dashboard and sending raw data disrupts the established reporting workflow and forces executives to interpret unaggregated data, which is inefficient and likely to introduce new errors; the proper action is to fix the filter and restore the correct trend line chart.

748

Multi-Selectmedium

A data analyst needs to identify duplicate customer records. Which TWO methods are commonly used? (Select two.)

Select 2 answers

A.Fuzzy matching using Levenshtein distance

B.Sorting and comparing adjacent rows

C.Visual inspection of random sample

D.Using a hash function on primary key

E.Exact match on all fields

AnswersA, B

Levenshtein distance catches spelling differences.

Why this answer

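Fuzzy matching using Levenshtein distance (Option A) is correct because it measures the edit distance between two strings, allowing identification of duplicates even when there are minor typographical differences, such as 'Jon Smith' vs. 'John Smith'. This is essential for deduplicating customer records where names, addresses, or other fields may have slight variations without being exact matches. Sorting and comparing adjacent rows (Option B) is correct because sorting on a field such as name or email places likely duplicates next to each other, so each row only needs to be compared with its neighbors.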

Exam trap

The trap here is that candidates often choose 'Exact match on all fields' (Option E) thinking it is a reliable deduplication method, but in practice it fails to catch real-world duplicates that have any minor variation, and the exam expects you to recognize that fuzzy matching and sorted adjacency comparisons are the standard techniques for duplicate detection.
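Both techniques can be combined in a few lines. The sketch below implements the standard dynamic-programming Levenshtein distance and applies it to neighboring names after sorting; the sample names, the `likely_duplicates` helper, and the distance threshold of 1 are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def likely_duplicates(names, max_distance=1):
    """Sort names, then flag adjacent pairs within max_distance edits."""
    ordered = sorted(names, key=str.lower)
    return [
        (x, y)
        for x, y in zip(ordered, ordered[1:])
        if levenshtein(x.lower(), y.lower()) <= max_distance
    ]


# Hypothetical customer names containing one near-duplicate.
names = ["John Smith", "Maria Garcia", "Jon Smith", "Wei Chen"]
pairs = likely_duplicates(names)
```

Adjacent comparison is cheap but can miss duplicates whose difference falls early in the sort key (for example a typo in the first letter), which is one reason the two methods are often used together with other matching rules.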

749

Multi-Selecthard

An analyst is using a CTE to compute hierarchical data. Which TWO statements about recursive CTEs are true?

Select 2 answers

A.The recursive member must reference the CTE name

B.The anchor member is the first SELECT that does not reference the CTE

C.Recursive CTEs can only be used for numerical sequences

D.Recursive CTEs must include a UNION ALL operator

E.Recursive CTEs cannot be used with GROUP BY

AnswersB, D

Anchor provides initial rows.

Why this answer

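Recursive CTEs consist of an anchor member (the initial SELECT, which does not reference the CTE and provides the starting rows) and a recursive member that references the CTE itself; the two are combined with UNION ALL.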
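The structure can be sketched against an in-memory SQLite database. The `staff` table, its rows, and the CTE name `chain` are hypothetical. As a side note, SQLite also accepts plain UNION between the two members, whereas some engines such as SQL Server require UNION ALL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [(1, "Ava", None), (2, "Ben", 1), (3, "Cy", 2), (4, "Dee", 1)],
)

query = """
WITH RECURSIVE chain(id, name, depth) AS (
    -- Anchor member: the starting rows; it does not reference the CTE.
    SELECT id, name, 0 FROM staff WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: joins back to the CTE to find the next level.
    SELECT s.id, s.name, c.depth + 1
    FROM staff AS s
    JOIN chain AS c ON s.manager_id = c.id
)
SELECT name, depth FROM chain ORDER BY depth, name
"""

hierarchy = conn.execute(query).fetchall()
conn.close()
```

Recursion stops when the recursive member returns no new rows, which here happens once every employee has been reached.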

750

MCQeasy

A data analyst receives the above JSON snippet from a web API. The analyst needs to extract the email addresses for all customers. Which JSONPath expression should be used?

A.$.customers[0].email

B.$..email

C.$.customers[*].email

D.$.customers.email

AnswerC

This expression selects email from every customer object.

Why this answer

Option C is correct because the JSONPath expression `$.customers[*].email` uses the wildcard `[*]` to select all elements in the `customers` array and then accesses the `email` property of each element. This matches the requirement to extract email addresses for all customers from the JSON snippet.

Exam trap

The trap here is that candidates often confuse the deep scan operator `..` with the array wildcard `[*]`, thinking `$..email` will neatly extract all customer emails, but it actually retrieves every `email` property at any depth, including from non-customer objects, leading to incorrect data extraction.

How to eliminate wrong answers

Option A is wrong because `$.customers[0].email` only retrieves the email address of the first customer in the array, not all customers. Option B is wrong because `$..email` uses the deep scan operator `..` which recursively searches the entire JSON tree for any property named `email`, potentially returning emails from nested objects or arrays that are not customers (e.g., from an `orders` or `address` object), leading to incorrect or extra results. Option D is wrong because `$.customers.email` attempts to access `email` directly on the `customers` array object, but arrays in JSONPath do not have a property named `email`; this expression would return `null` or an empty result unless the array itself has an `email` property, which it does not.
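The difference between `$.customers[*].email` and `$..email` can be mimicked in plain Python without a JSONPath library. The JSON payload below is invented for this sketch (the snippet from the question is not reproduced here), and `deep_scan` is a hypothetical helper that imitates the recursive descent operator.

```python
import json

# Hypothetical API response with a non-customer object that also has an email.
payload = json.loads("""
{
  "customers": [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Lin", "email": "lin@example.com"}
  ],
  "support": {"email": "help@example.com"}
}
""")

# Equivalent of $.customers[*].email: every element of the array, then its email.
customer_emails = [customer["email"] for customer in payload["customers"]]


def deep_scan(node, key):
    """Collect every value stored under key at any depth, like $..key."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(deep_scan(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(deep_scan(item, key))
    return found


# Equivalent of $..email: also picks up the support address.
all_emails = deep_scan(payload, "email")
```

In this made-up payload the deep scan returns three addresses while the wildcard expression returns only the two customer emails.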
