
CompTIA Data+ (DA0-001): Questions 301–375

509 questions total · 7 pages · All types, answers revealed


Page 5 of 7

301

MCQ · hard

A data visualization specialist needs to display the relationship between advertising spend and revenue for 50 product categories over 12 months. The data has many overlapping points. Which chart type best reveals the correlation and density?

A.Heatmap with revenue binned

B.Line chart for each category

C.Bubble chart

D.Scatter plot with alpha blending

Answer: D

Transparency allows overlapping points to show density while preserving the relationship.

Why this answer

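A scatter plot with transparency (alpha blending) reveals both correlation and point density because overlapping points accumulate into darker areas. A bubble chart encodes a third variable through marker size, which the scenario does not call for and which makes overlap worse. A heatmap with binned revenue shows density but loses the resolution of the individual points. A separate line chart for each category shows trends over time, not the correlation between spend and revenue.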


302

MCQ · medium

A data analyst needs to combine sales data from multiple regional databases with different schemas. Which process is best?

A.Data federation

B.ETL (Extract, Transform, Load)

C.Data replication

D.Data virtualization

Answer: B

Designed to extract, transform, and load data from heterogeneous sources.

Why this answer

Option B is correct because ETL (Extract, Transform, Load) is designed to extract data from various sources, transform it to a common schema, and load it into a target system. Option C is wrong because data replication copies data without transforming it, so the schema differences remain. Option D is wrong because data virtualization provides real-time access to the sources without storing a combined copy.

Option A is wrong because data federation also provides virtual integration but is less suited to combining data into a single store.
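As a rough illustration of the transform step, the sketch below maps two regional extracts with different column names onto one common schema. The field names, sample rows, and the `warehouse` list standing in for the target table are all hypothetical and are not taken from the question.

```python
# Hypothetical extracts from two regional databases with different schemas.
east_rows = [{"order_id": 1, "amt_usd": 250.0, "sold_on": "2024-03-01"}]
west_rows = [{"id": 77, "total": "99.50", "date": "2024-03-02"}]


def transform_east(row):
    """Rename East-region fields to the common schema."""
    return {
        "order_id": row["order_id"],
        "amount": float(row["amt_usd"]),
        "order_date": row["sold_on"],
        "region": "East",
    }


def transform_west(row):
    """Rename West-region fields and convert the text amount to a number."""
    return {
        "order_id": row["id"],
        "amount": float(row["total"]),
        "order_date": row["date"],
        "region": "West",
    }


# "Load": a plain list stands in for the target warehouse table.
warehouse = []
warehouse.extend(transform_east(r) for r in east_rows)
warehouse.extend(transform_west(r) for r in west_rows)

for record in warehouse:
    print(record)
```

After the transform, every record carries the same four keys, which is what allows the regional data to be queried together.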


303

MCQ · medium

A healthcare company's data analyst is tasked with presenting patient readmission rates to a group of doctors and nurses. The data shows that readmission rates are higher among patients with certain chronic conditions. The audience has limited data literacy, but they are familiar with clinical terms. The analyst wants to communicate the insights effectively and encourage discussion on preventive measures. The presentation will last 15 minutes, and the audience expects actionable insights. What should the analyst do?

A.Present a table of raw numbers for each condition.

B.Show a complex regression model output to demonstrate correlation.

C.Use a simple bar chart comparing readmission rates by condition.

D.Show a scatter plot of age vs. readmission rate.

Answer: C

Clearly shows which conditions have highest readmission, enabling discussion.

Why this answer

Option C is correct because a simple bar chart directly compares readmission rates across chronic conditions using a visual format that is intuitive for an audience with limited data literacy. It avoids overwhelming them with raw numbers or complex statistical outputs, enabling quick comprehension and fostering discussion on actionable preventive measures within the 15-minute timeframe.

Exam trap

CompTIA often tests the principle that the best visualization for an audience with low data literacy is the simplest one that directly maps the insight, and the trap here is that candidates may overcomplicate the choice by selecting a scatter plot (Option D) because it looks 'analytical,' even though it fails to address the specific categorical comparison needed.

How to eliminate wrong answers

Option A is wrong because presenting a table of raw numbers requires the audience to manually interpret and compare values, which is inefficient for a 15-minute presentation and assumes a level of data literacy the audience lacks. Option B is wrong because showing a complex regression model output introduces statistical jargon (e.g., coefficients, p-values) that the audience cannot interpret, distracting from the core insight of higher readmission rates by condition. Option D is wrong because a scatter plot of age vs. readmission rate does not directly address the key insight about chronic conditions; it shifts focus to a different variable (age) and may confuse the audience with overplotting or lack of clear categorical comparison.


304

MCQ · medium

A finance department wants to compare actual expenses against budgeted amounts for each of 10 departments over the past 12 months. They need to show the variance (over/under budget) as well as the trend of expenses over time. The data includes monthly actual and budget figures for each department. The audience is the CFO who needs a quick overview. Which visualization approach should be used?

A.A dual-axis chart with bars for actual and budget by month for each department, and a line for variance

B.A waterfall chart showing cumulative variance over the year

C.A table with conditional formatting to highlight over-budget cells

D.Stacked bar chart with one bar per month, segments showing each department's actual and budget

Answer: A

Dual-axis allows direct comparison of actual vs budget over time, and the line highlights variance trend.

Why this answer

Option A is correct because a dual-axis chart (bars for budget vs. actual per month, line for variance) shows both the comparison and the trend clearly. Option D is wrong because stacking every department's actual and budget figures into one bar per month obscures the individual comparisons and the trend. Option C is wrong because a table, even with conditional formatting, is not a quick visual overview.

Option B is wrong because a waterfall chart shows cumulative changes, not month-by-month trends.


305

Multi-Select · medium

A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)

Select 2 answers

A.K-means clustering

B.Linear regression

C.Logistic regression

D.Decision trees

E.Apriori algorithm

Answers: C, D

Logistic regression models binary outcomes and is appropriate for classification.

Why this answer

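Logistic regression is appropriate because it models the probability of a binary outcome (churn yes/no) using a logistic function, making it a standard choice for binary classification tasks. It outputs a value between 0 and 1, which can be thresholded to predict the class label. Decision trees are also appropriate because they are a supervised technique that splits records on feature values until each leaf predicts one of the two classes.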

Exam trap

The trap here is that candidates may confuse unsupervised clustering (K-means) or association rule mining (Apriori) with supervised classification, or mistakenly think linear regression can be adapted for binary outcomes without transformation.
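The thresholding step described for logistic regression can be sketched as follows. The coefficients and feature names are hypothetical values picked for illustration, not a fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration (not fitted to data).
INTERCEPT = -1.0
COEF_SUPPORT_CALLS = 0.8   # more support calls -> higher churn score
COEF_TENURE_YEARS = -0.5   # longer tenure -> lower churn score


def churn_probability(support_calls: int, tenure_years: float) -> float:
    score = (INTERCEPT
             + COEF_SUPPORT_CALLS * support_calls
             + COEF_TENURE_YEARS * tenure_years)
    return sigmoid(score)


def predict_churn(support_calls: int, tenure_years: float,
                  threshold: float = 0.5) -> bool:
    """Turn the probability into a yes/no label by thresholding."""
    return churn_probability(support_calls, tenure_years) >= threshold


print(predict_churn(support_calls=4, tenure_years=1))  # score = 1.7
print(predict_churn(support_calls=0, tenure_years=3))  # score = -2.5
```

A linear regression would return the raw score instead, which is not bounded between 0 and 1 and therefore cannot be read as a probability.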


306

MCQ · medium

A company wants to collect real-time clickstream data from its website. Which acquisition method is most suitable?

A.Streaming API

B.Web scraping

C.Batch processing nightly

D.Manual entry

Answer: A

Enables continuous, low-latency data ingestion.

Why this answer

A streaming API is the most suitable method for collecting real-time clickstream data because it enables continuous, low-latency ingestion of events as they occur. Unlike batch or manual methods, a streaming API (e.g., one built on WebSockets or Server-Sent Events) pushes each click event to the data pipeline as it happens, satisfying the real-time requirement.

Exam trap

CompTIA often tests the distinction between 'real-time' and 'near-real-time' or 'batch' methods, and the trap here is that candidates may confuse web scraping (which can be automated frequently) with true streaming, not realizing that scraping is still a pull-based, scheduled operation that cannot match the push-based immediacy of a streaming API.

How to eliminate wrong answers

Option B (Web scraping) is wrong because it is a pull-based technique that typically retrieves static HTML pages at intervals, not real-time event streams, and is inefficient for high-frequency click data. Option C (Batch processing nightly) is wrong because it introduces a delay of up to 24 hours, failing the real-time requirement. Option D (Manual entry) is wrong because it is error-prone, non-scalable, and cannot capture high-velocity clickstream data in real time.


307

MCQ · medium

In a customer database, each row represents a customer with columns: CustomerID, Name, Address, Phone. What does the column "Name" represent?

A.Instance

B.Entity

C.Attribute

D.Record

Answer: C

Correct. Name is an attribute of the customer entity.

Why this answer

In the context of a relational database, a column represents an attribute of an entity. The 'Name' column stores a specific characteristic (the customer's name) for each row, making it an attribute. This aligns with the data modeling concept where attributes define the properties of an entity.

Exam trap

The trap here is that candidates confuse 'attribute' with 'record' because they think of a row as containing all attributes, but the question specifically asks what a single column represents, not the row itself.

How to eliminate wrong answers

Option A is wrong because an instance refers to a single occurrence of an entity (e.g., a specific customer row), not a column. Option B is wrong because an entity is a table-level concept representing a real-world object (e.g., the Customer table), not a column within it. Option D is wrong because a record is a row in the table, which contains values for all attributes, not a single column like 'Name'.


308

MCQ · easy

Refer to the exhibit. A data pipeline is failing to parse this log entry. What is the most likely cause of the error?

A.Missing comma between fields

B.Incorrect data type for age

C.Extra whitespace

D.Unquoted string for country

Answer: A

There should be a comma after '30' before the next key.

Why this answer

The log entry shows fields separated by spaces, but the country value 'United States' contains a space, which breaks the expected delimiter-based parsing. Without a comma or other explicit delimiter between fields, the parser cannot distinguish where the age field ends and the country field begins, causing a parsing failure. A comma-separated format (CSV) would require a comma between each field to avoid ambiguity.

Exam trap

CompTIA often tests the concept that missing or inconsistent delimiters cause parsing failures, and the trap here is that candidates may focus on the unquoted string or extra whitespace rather than recognizing that the fundamental issue is the absence of a required comma between fields.

How to eliminate wrong answers

Option B is wrong because the age field '30' is a valid integer and would parse correctly if the fields were properly delimited; the error is not due to data type mismatch. Option C is wrong because extra whitespace is not the issue—the spaces are part of the intended delimiter or the country value, and the parser is failing due to the lack of a comma, not due to excessive whitespace. Option D is wrong because the country 'United States' is not unquoted in a way that causes the error; the core problem is the missing comma between fields, not the lack of quotes around the string.


309

MCQ · hard

A financial services company is migrating its customer data from a legacy on-premises relational database to a cloud-based data warehouse. The legacy database uses a denormalized schema with a single table 'customer_master' that contains all customer attributes, including repeated groups for multiple accounts per customer (account1_type, account1_balance, account2_type, account2_balance, etc.). The data warehouse team wants to implement a normalized star schema with separate dimension and fact tables. During the ETL process, the team encounters an error: 'Data truncation: string data right truncation' when loading account_type values into the dim_account table. The account_type column in dim_account is defined as VARCHAR(10), but the source data contains account types like 'SavingsPlus' (11 characters) and 'CheckingPremium' (15 characters). The team must resolve this issue without losing data. Which course of action should the team take?

A.Truncate the account_type values to 10 characters during ETL.

B.Change the data type of dim_account.account_type to TEXT.

C.Ignore the error and continue loading with NULL values for truncated rows.

D.Increase the VARCHAR length of dim_account.account_type to accommodate the longest account type.

Answer: D

This resolves truncation without data loss.

Why this answer

Option D is correct because increasing the VARCHAR length of dim_account.account_type to accommodate the longest account type (e.g., VARCHAR(15) for 'CheckingPremium') resolves the data truncation error without data loss. This aligns with the star schema design principle of preserving source data integrity while ensuring the column definition matches the actual data length. The team must avoid truncation or NULL insertion to maintain accurate dimensional attributes for analytics.

Exam trap

The trap here is that candidates may choose truncation (Option A) or NULL insertion (Option C) as quick fixes, overlooking the requirement to preserve data integrity, or mistakenly think TEXT (Option B) is a safe catch-all without considering performance implications in a data warehouse context.

How to eliminate wrong answers

Option A is wrong because truncating account_type values to 10 characters would lose data, violating the requirement to resolve the issue without data loss. Option B is wrong because changing the data type to TEXT is unnecessary and, in many database engines, can add overhead for indexing and querying compared with a bounded VARCHAR column. Option C is wrong because ignoring the error and loading NULL values for the affected rows would discard valid account_type data and undermine the accuracy of any analysis by account type.
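Before widening the column, it can help to profile the source values and confirm the longest one. The sketch below assumes a small hypothetical sample that includes the two account types named in the question; a real check would scan the full source column.

```python
# Hypothetical sample of source values; only 'SavingsPlus' and
# 'CheckingPremium' come from the question text.
source_account_types = ["Checking", "Savings", "SavingsPlus", "CheckingPremium"]

CURRENT_LIMIT = 10  # dim_account.account_type is VARCHAR(10) in the scenario

# Values that would be truncated under the current definition.
too_long = [v for v in source_account_types if len(v) > CURRENT_LIMIT]

# Minimum VARCHAR length that fits every value in the sample.
required_length = max(len(v) for v in source_account_types)

print(too_long)
print(required_length)
```

With this sample the required length is 15, which matches the VARCHAR(15) mentioned in the explanation; in practice a team might choose a somewhat larger limit to leave room for future account types.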


310

MCQ · medium

A data analyst needs to compare sales data from the company's internal CRM with public demographic data from a government census. Which data concept best describes this scenario?

A.Internal vs. External data

B.Primary vs. Secondary data

C.Structured vs. Unstructured data

D.Quantitative vs. Qualitative data

Answer: A

CRM data is internal; census data is external, directly contrasting sources.

Why this answer

The scenario involves comparing internal CRM data (generated and owned by the company) with external government census data (publicly sourced from outside the organization). This directly maps to the Internal vs. External data concept, where internal data is collected within the enterprise (e.g., sales transactions, customer records) and external data is acquired from third-party sources (e.g., census bureaus, market research firms).

The key distinction is the data's origin and ownership, not its structure, collection method, or measurement type.

Exam trap

CompTIA often tests the Internal vs. External data concept by presenting a scenario where the key differentiator is the data's source (inside vs. outside the organization), tempting candidates to confuse it with Primary vs. Secondary data, which focuses on whether the data was collected firsthand or repurposed.

How to eliminate wrong answers

Option B (Primary vs. Secondary data) is wrong because both datasets could be primary (collected firsthand by the CRM or census) or secondary (repurposed from another source), but the question focuses on the origin relative to the organization, not the collection method. Option C (Structured vs. Unstructured data) is wrong because both CRM sales data and census demographic data are typically structured (e.g., tables with rows and columns), so the contrast is not about format but about source. Option D (Quantitative vs. Qualitative data) is wrong because both datasets contain quantitative values (e.g., sales figures, population counts) and possibly qualitative labels (e.g., region names), but the core distinction in the scenario is internal versus external sourcing, not measurement scale.


311

MCQ · easy

Refer to the exhibit. The data shows sales by product category for two years. Which product category had the highest percentage increase from 2023 to 2024?

A.Clothing

B.Books

C.Electronics

D.Home & Garden

Answer: D

Home & Garden had the highest percentage increase, 50%.

Why this answer

Home & Garden had the highest percentage increase from 2023 to 2024 because its sales grew from $20,000 to $30,000, a 50% increase, which is greater than the percentage increases for Clothing (25%), Books (33.3%), and Electronics (40%). The percentage change is calculated as ((2024 value - 2023 value) / 2023 value) * 100, and Home & Garden yields the highest result.

Exam trap

CompTIA often tests the distinction between absolute and percentage increase. Clothing has the largest sales totals, but its $10,000 gain is the same dollar amount as the gains for Home & Garden and Electronics and works out to only 25%, so candidates who look at raw dollars instead of percentage change pick the wrong category.

How to eliminate wrong answers

Option A is wrong because Clothing increased from $40,000 to $50,000, a 25% increase, which is lower than Home & Garden's 50%. Option B is wrong because Books increased from $15,000 to $20,000, a 33.3% increase, which is lower than Home & Garden's 50%. Option C is wrong because Electronics increased from $25,000 to $35,000, a 40% increase, which is lower than Home & Garden's 50%.
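The percentage-change comparison can be checked with a few lines of code. The figures below are the ones quoted in the explanation above; the exhibit itself is not reproduced here, so treat them as assumed inputs.

```python
# (2023 sales, 2024 sales) per category, as quoted in the explanation.
sales = {
    "Clothing": (40_000, 50_000),
    "Books": (15_000, 20_000),
    "Electronics": (25_000, 35_000),
    "Home & Garden": (20_000, 30_000),
}


def pct_change(old: float, new: float) -> float:
    """Percentage change relative to the starting value."""
    return (new - old) / old * 100


changes = {name: pct_change(old, new) for name, (old, new) in sales.items()}
best = max(changes, key=changes.get)

for name, change in changes.items():
    print(f"{name}: {change:.1f}%")
print("Highest percentage increase:", best)
```

Dividing by the 2023 value is what separates the categories: three of them gain the same $10,000, but the smaller the starting base, the larger the percentage.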


312

MCQ · medium

Refer to the exhibit. A data engineer is setting up data acquisition from an S3 bucket with this policy. What does the policy enforce?

A.Only objects with key 'data-bucket' can be accessed

B.Only GET requests are allowed

C.The bucket is publicly accessible

D.Access is restricted to a specific IP range

Answer: D

The condition limits access to IPs in the 10.0.0.0/8 range.

Why this answer

The S3 bucket policy includes a condition using `aws:SourceIp` with a CIDR block, which restricts access to requests originating from the specified IP range. This enforces network-level access control, allowing only clients within that IP range to perform the allowed actions (like GET requests) on the bucket. Option D correctly identifies this IP-based restriction.

Exam trap

CompTIA often tests the misconception that a Deny statement with a condition on IP addresses makes the bucket publicly accessible, when in fact it restricts access to only the specified IP range.

How to eliminate wrong answers

Option A is wrong because the policy does not restrict access based on object keys; the `arn:aws:s3:::data-bucket/*` in the Resource element specifies all objects in the bucket, not a specific key. Option B is wrong because the policy allows both GET and PUT requests (as shown in the Action list), not only GET requests. Option C is wrong because the policy explicitly denies access from IPs outside the specified range, making the bucket private to that IP range, not publicly accessible.
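To make the CIDR restriction concrete, the sketch below tests whether an address falls inside the 10.0.0.0/8 range cited in the answer summary. It only illustrates range membership with Python's standard `ipaddress` module and is not how AWS evaluates policies; the sample addresses are hypothetical.

```python
import ipaddress

# Range quoted in the answer summary for this question.
ALLOWED_RANGE = ipaddress.ip_network("10.0.0.0/8")


def is_allowed(source_ip: str) -> bool:
    """Return True when the address lies inside the allowed CIDR block."""
    return ipaddress.ip_address(source_ip) in ALLOWED_RANGE


print(is_allowed("10.42.7.19"))   # inside 10.0.0.0/8
print(is_allowed("203.0.113.5"))  # outside the range
```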


313

MCQ · easy

A data analyst creates a bar chart showing monthly sales for the current year. The CEO asks to see the same data but grouped by region. Which chart type is most appropriate for this new requirement?

A.Grouped bar chart

B.Pie chart

C.Stacked bar chart

D.Line chart

Answer: A

Grouped bar charts compare values across categories and subgroups.

Why this answer

A grouped bar chart is the most appropriate choice because it allows the CEO to compare monthly sales across different regions side by side within each month. This chart type effectively displays two categorical variables (month and region) and one quantitative variable (sales), making it easy to see both regional performance and monthly trends simultaneously.

Exam trap

CompTIA often tests the distinction between grouped and stacked bar charts, trapping candidates who think any multi-category bar chart is equivalent, when in fact grouped bars are for comparing individual category values and stacked bars are for comparing totals and proportions.

How to eliminate wrong answers

Option B (Pie chart) is wrong because pie charts are designed to show parts of a whole for a single categorical variable at a single point in time, not to compare multiple categories (regions) across a time series (months). Option C (Stacked bar chart) is wrong because while it can show regions within months, it emphasizes the total sales per month and the proportion each region contributes, rather than allowing direct comparison of individual region sales across months. Option D (Line chart) is wrong because line charts are best for showing continuous trends over time for one or more series, but they do not effectively compare discrete categories like regions within each month; a grouped bar chart provides clearer categorical comparison.


314

MCQ · easy

A data analyst needs to extract data from an API that returns JSON. The analyst wants to convert the JSON output into a tabular format for analysis. Which function in a scripting language is commonly used for this purpose?

A.json.loads()

B.to_csv()

C.read_json()

D.json_normalize()

Answer: D

This function normalizes semi-structured JSON data into a flat table.

Why this answer

Option D is correct because `json_normalize()` is a function in the pandas library specifically designed to flatten semi-structured JSON data (including nested lists and dictionaries) into a tabular DataFrame. This makes it the ideal tool for converting API responses with complex nesting into rows and columns for analysis, unlike simpler JSON parsing functions.

Exam trap

The trap here is that candidates confuse `read_json()` (which loads JSON into a DataFrame but leaves nested objects unflattened) with `json_normalize()` (which expands nested structures into columns), leading them to choose option C when the API response contains hierarchical data.

How to eliminate wrong answers

Option A is wrong because `json.loads()` only parses a JSON string into a Python dictionary or list; it does not flatten nested structures or produce a tabular format. Option B is wrong because `to_csv()` is a pandas method for exporting a DataFrame to a CSV file, not for converting JSON to a table. Option C is wrong because `read_json()` in pandas reads a JSON file or string into a DataFrame but does not flatten nested structures; sub-objects typically end up as dictionaries inside a single column, so further normalization is still needed.
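A minimal sketch of the flattening step, assuming pandas is installed and using a hypothetical API response with one level of nesting:

```python
import json

import pandas as pd

# Hypothetical API response; the field names are illustrative only.
api_response = """[
  {"id": 1, "user": {"name": "Ana", "city": "Lima"}},
  {"id": 2, "user": {"name": "Ben", "city": "Oslo"}}
]"""

# json.loads() gives plain Python lists and dictionaries.
records = json.loads(api_response)

# json_normalize() expands the nested "user" object into dotted columns.
table = pd.json_normalize(records)

print(table.columns.tolist())
print(table)
```

Each nested key becomes its own column (`user.name`, `user.city`), so the result has one row per record and can be analyzed like any other table.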


315

MCQ · medium

A scatter plot of advertising spend vs. revenue shows no clear correlation, but the analyst suspects a relationship exists. Which addition to the plot could help reveal a hidden trend?

A.Change to a bar chart

B.Increase the marker size

C.Remove data points with low spend

D.Add a trendline

Answer: D

A trendline helps model and visualize the underlying relationship.

Why this answer

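Option D is correct because adding a trendline (e.g., a linear regression fit) can reveal patterns in noisy data. Changing to a bar chart discards the point-level relationship, increasing the marker size only adds overlap, and removing low-spend points throws away data that may be part of the trend.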
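A linear trendline is an ordinary least-squares fit through the points. The sketch below computes the slope and intercept by hand on a small hypothetical dataset; charting tools perform the equivalent calculation when a trendline is added.

```python
def fit_trendline(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, noisy spend/revenue pairs (arbitrary units).
spend = [1, 2, 3, 4, 5]
revenue = [3, 5, 4, 8, 7]

slope, intercept = fit_trendline(spend, revenue)
print(f"revenue = {slope:.2f} * spend + {intercept:.2f}")
```

A positive slope on data that looks scattered to the eye is the kind of hidden trend the question refers to, although a fitted line alone does not show how strong the relationship is.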


316

Multi-Select · easy

Which TWO are examples of internal data sources? (Select exactly 2)

Select 2 answers

A.APIs

B.Relational databases

C.Sensor readings

D.Social media comments

E.Flat files

Answers: B, E

Common internal source.

Why this answer

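Relational databases are internal data sources because they store structured data generated and controlled within an organization's own systems. They are typically managed by internal IT teams and accessed via SQL queries, making them a classic example of an internal data repository. Flat files (for example, CSV exports kept on internal file shares) count as internal for the same reason: they are created and stored within the organization.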

Exam trap

CompTIA often tests the distinction between data sources and data access methods, so candidates mistakenly select APIs or sensor readings as internal sources when they are actually mechanisms or external origin points.


317

MCQ · hard

Refer to the exhibit. A data analyst runs this query to identify high-value customers. However, the result does not include customers with exactly 5 orders. Which data concept does the HAVING clause illustrate?

A.Data sorting with ORDER BY

B.Data joining with INNER JOIN

C.Data aggregation with filtering on aggregated values

D.Data filtering on row-level conditions

Answer: C

HAVING filters after GROUP BY, operating on aggregated results.

Why this answer

HAVING filters groups after aggregation, unlike WHERE which filters rows before aggregation. This demonstrates data aggregation with filtering on aggregated values.
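The behavior can be reproduced with an in-memory SQLite database. The exhibit query is not shown here, so the sketch assumes it uses `HAVING COUNT(*) > 5`, which would explain why customers with exactly 5 orders are missing; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Customer 1 has exactly 5 orders, customer 2 has 6, customer 3 has 2.
rows = [(1, 20.0)] * 5 + [(2, 20.0)] * 6 + [(3, 500.0)] * 2
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# HAVING filters the groups produced by GROUP BY, using the aggregate value.
strict = conn.execute(
    "SELECT customer_id FROM orders "
    "GROUP BY customer_id HAVING COUNT(*) > 5 ORDER BY customer_id"
).fetchall()

inclusive = conn.execute(
    "SELECT customer_id FROM orders "
    "GROUP BY customer_id HAVING COUNT(*) >= 5 ORDER BY customer_id"
).fetchall()

print(strict)     # customers with more than 5 orders
print(inclusive)  # customers with 5 or more orders
conn.close()
```

A WHERE clause could not express either condition, because `COUNT(*)` does not exist until the rows have been grouped.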


318

Multi-Select · easy

A data analyst is preparing a presentation to share findings with non-technical stakeholders. Which TWO practices should the analyst follow to effectively communicate data insights? (Choose two.)

Select 2 answers

A.Tailor the message to the audience

B.Provide raw data tables

C.Use technical jargon to demonstrate expertise

D.Include all data anomalies and outliers

E.Focus on actionable insights

Answers: A, E

Different stakeholders have different priorities and levels of understanding.

Why this answer

Option A (tailor the message to the audience) is correct because different stakeholders have different priorities and levels of understanding. Option E (focus on actionable insights) is correct because stakeholders need clear recommendations. Option C is wrong because technical jargon can confuse non-technical audiences.

Option B is wrong because raw data tables are overwhelming and do not convey insight on their own. Option D is wrong because including all anomalies and outliers can distract from the key findings.


319

Multi-Select · medium

Which TWO are best practices for designing effective dashboards? (Select exactly two.)

Select 2 answers

A.Use 3D charts for visual appeal.

B.Use pie charts for comparing parts of a whole.

C.Include every possible metric to provide completeness.

D.Use a single page to avoid scrolling.

E.Use a consistent color scheme across visualizations.

Answers: D, E

A single page allows immediate overview.

Why this answer

Options D and E are correct. A single-page dashboard provides an at-a-glance view, and a consistent color scheme improves readability. 3D charts (A) can distort perception; including every metric (C) creates clutter; pie charts (B) are often misused.


320

MCQ · medium

Refer to the exhibit. If the date column is stored as a string in 'MM/DD/YYYY' format, what will be the result?

A.Incorrect results because string comparison is lexicographic.

B.NULL values

C.Error because DATE type is expected.

D.Correct results because string comparison works for dates.

Answer: A

Lexicographic ordering of 'MM/DD/YYYY' strings does not follow chronological order.

Why this answer

Option A is correct because when dates are stored as strings in 'MM/DD/YYYY' format, comparison is lexicographic (character by character). The month is compared first, so '01/02/2023' (January 2, 2023) sorts before '12/31/2022' (December 31, 2022) because '0' < '1' at the first character, even though it is the later date. The comparison never interprets the string as a date value.

Exam trap

CompTIA often tests the misconception that string comparison of dates in 'MM/DD/YYYY' format will yield correct chronological order, but the trap is that lexicographic comparison compares month first, not year, leading to incorrect results.

How to eliminate wrong answers

Option B is wrong because string comparison does not produce NULL values; it simply compares strings lexicographically and returns a valid boolean result. Option C is wrong because no error occurs; the database or application will perform string comparison without expecting a DATE type, as the column is defined as a string. Option D is wrong because string comparison does not work correctly for dates in this format; lexicographic order does not match chronological order for 'MM/DD/YYYY' strings.
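The mismatch between lexicographic and chronological order is easy to demonstrate. The sketch below compares two sample dates first as strings and then as parsed dates; the dates are illustrative and not taken from the exhibit.

```python
from datetime import datetime

earlier = "12/31/2022"  # December 31, 2022
later = "01/02/2023"    # January 2, 2023


def parse(text: str) -> datetime:
    """Parse an 'MM/DD/YYYY' string into a real date value."""
    return datetime.strptime(text, "%m/%d/%Y")


# String comparison looks at characters left to right, so the month decides.
string_says_later = later > earlier

# Date comparison respects the calendar.
date_says_later = parse(later) > parse(earlier)

print(string_says_later)
print(date_says_later)
```

The string comparison reports that January 2, 2023 is not after December 31, 2022, while the parsed comparison gets it right. Storing dates in a DATE column, or as 'YYYY-MM-DD' strings, avoids the problem because the most significant part comes first.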


321

MCQ · hard

A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?

A.Reduce maximum depth

B.Increase minimum samples split

C.Increase number of leaves

D.Use a smaller dataset

Answer: A

Reducing max depth stops the tree from growing too deep, simplifying the model and reducing variance.

Why this answer

Reducing maximum depth limits the number of splits in the decision tree, which directly reduces model complexity and variance. A high-variance model is overfitting to training data, and capping depth prevents the tree from learning overly specific patterns that do not generalize.

Exam trap

CompTIA often tests the misconception that increasing model complexity (e.g., more leaves) reduces overfitting, when in reality it increases variance; the trap here is that candidates may confuse 'minimum samples split' as the only regularization technique, overlooking that reducing max depth is a more direct and effective hyperparameter for high variance.

How to eliminate wrong answers

Option B is not the best answer: increasing minimum samples split does reduce overfitting by requiring more samples before a node can be split, but the question asks for the most effective adjustment, and capping depth limits the tree's complexity more directly. Option C is wrong because increasing the number of leaves increases model complexity, which would exacerbate overfitting and increase variance, not reduce it. Option D is wrong because using a smaller dataset is a data-level change rather than a hyperparameter adjustment, and it typically worsens overfitting because less data leads to more unstable splits.


322

MCQ · hard

A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?

A.Use raw data because k-means works with mixed types

B.Standardize continuous variables and one-hot encode categorical variables

C.Apply PCA first to reduce dimensionality

D.Remove categorical variables entirely

Answer: B

Standardization ensures equal weight; one-hot encoding converts categories to binary vectors.

Why this answer

Option B is correct because k-means clustering relies on Euclidean distance, which is sensitive to the scale of features. Standardizing continuous variables (e.g., spending amount, frequency) ensures they contribute equally to distance calculations, while one-hot encoding categorical variables (e.g., region, gender) converts them into numerical form without implying ordinal relationships, allowing k-means to process mixed data types correctly.

Exam trap

The trap here is that candidates assume k-means can natively handle mixed data types because it is a common clustering algorithm, but it strictly requires numerical input and scale normalization to avoid skewed distance calculations.

How to eliminate wrong answers

Option A is wrong because k-means cannot directly handle categorical variables; it requires numerical input and assumes continuous features, so using raw mixed-type data would produce meaningless distance calculations. Option C is wrong because PCA is a dimensionality reduction technique applied after preprocessing, not a substitute for standardizing and encoding; it may be used optionally but is not the required preparation step. Option D is wrong because removing categorical variables discards valuable demographic information that could improve segmentation, and k-means can incorporate them after proper encoding.
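The two preparation steps can be sketched with the standard library alone. The customer records below are hypothetical, and a real project would more likely use a library such as scikit-learn for both the preprocessing and the clustering.

```python
from statistics import mean, pstdev

# Hypothetical customers with two continuous and one categorical feature.
customers = [
    {"spend": 120.0, "frequency": 2, "region": "North"},
    {"spend": 480.0, "frequency": 9, "region": "South"},
    {"spend": 300.0, "frequency": 4, "region": "North"},
]


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode each category as a 0/1 vector, one position per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


spend_z = standardize([c["spend"] for c in customers])
freq_z = standardize([c["frequency"] for c in customers])
region_1h = one_hot([c["region"] for c in customers])

# One numeric feature vector per customer, ready for a distance-based method.
features = [[s, f, *r] for s, f, r in zip(spend_z, freq_z, region_1h)]
for row in features:
    print(row)
```

Without standardization, spend (hundreds of units) would dominate frequency (single digits) in every Euclidean distance; after it, both features vary on the same scale.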


323

MCQ · easy

A data analyst wants to compare the sales performance of four different stores over the same time period. Which chart type is most suitable?

A.Line chart with multiple lines

B.Grouped bar chart

C.Stacked bar chart

D.Pie chart with multiple pies

Answer: B

Grouped bars allow side-by-side comparison of stores.

Why this answer

A grouped bar chart is the most suitable choice because it allows direct comparison of discrete categories (four stores) across a common time period, with each group representing a time interval and individual bars representing each store's sales. This chart type excels at side-by-side comparisons of multiple entities over the same categorical axis, making it ideal for the analyst's goal.

Exam trap

The trap here is that candidates often choose a line chart (Option A) because they associate 'time period' with trend analysis, but the question explicitly asks to compare sales performance over the same time period, not to show changes over time, making the grouped bar chart the correct choice for direct comparison of discrete entities.

How to eliminate wrong answers

Option A is wrong because a line chart with multiple lines is designed to show trends over continuous time, but here the time period is fixed and the focus is on comparing discrete stores, not tracking changes over time. Option C is wrong because a stacked bar chart shows part-to-whole relationships within each time period, which obscures individual store performance and makes direct comparisons between stores difficult. Option D is wrong because a pie chart with multiple pies is meant to show proportions of a whole for each store separately, but comparing sales across stores requires a common baseline, which pie charts do not provide.

Full explanation →

324

MCQeasy

A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?

A.Logistic regression

B.Ridge regression

C.Linear regression

D.Lasso regression

AnswerA

Logistic regression is used for binary classification, suitable for churn prediction.

Why this answer

Logistic regression is the correct choice because the target variable, customer churn, is binary (churn vs. no churn). Logistic regression models the probability of a binary outcome using a sigmoid function, making it suitable for classification tasks with both categorical and continuous predictors.

Exam trap

CompTIA often tests the misconception that 'regression' in the option name implies it is only for continuous outcomes, leading candidates to overlook logistic regression as a valid classification technique.

How to eliminate wrong answers

Option B (Ridge regression) is wrong because it is a regularized form of linear regression used for continuous outcomes, not binary classification. Option C (Linear regression) is wrong because it predicts a continuous value and is inappropriate for a binary dependent variable; it can produce probabilities outside [0,1] and violates the assumption of normally distributed errors. Option D (Lasso regression) is wrong because, like Ridge, it is a regularized linear regression for continuous targets and performs feature selection via L1 penalty, but it does not handle binary classification.
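
To see why a linear score is not a probability while the sigmoid output is, consider the sketch below. The coefficients `intercept` and `w_tenure` are invented for illustration and are not fitted to any data.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the shapes involved.
intercept, w_tenure = 1.5, -0.1

def linear_score(tenure_months):
    """Raw linear combination; it can fall outside [0, 1]."""
    return intercept + w_tenure * tenure_months

def churn_probability(tenure_months):
    """Logistic regression passes the linear score through the sigmoid."""
    return sigmoid(linear_score(tenure_months))

for tenure in (1, 24, 60):
    print(tenure, round(linear_score(tenure), 2), round(churn_probability(tenure), 3))
```

For long tenures the raw score becomes negative, which makes no sense as a probability, while the sigmoid output stays strictly between 0 and 1.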

Full explanation →

325

MCQeasy

During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?

A.Log transformation

B.Reciprocal transformation

C.No transformation needed

D.Square root transformation

AnswerA

Log transformation effectively reduces right skewness.

Why this answer

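Option A is correct. A log transformation is commonly used to reduce right skew because it compresses large values more than small ones. A square root transformation (Option D) is milder and better suited to moderate skew. Box-Cox can also work, but log is simpler.

Note that a log transformation requires strictly positive values.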
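
The effect of a log transformation on skew can be checked numerically. The sketch below uses a made-up `incomes` list and a simple moment-based `skewness` helper (both assumptions for illustration) to compare skewness before and after the transformation.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Hypothetical right-skewed data: most values are small, a few are very large.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

# The log compresses large values far more than small ones.
logged = [math.log(v) for v in incomes]

print("skew before:", round(skewness(incomes), 2))
print("skew after: ", round(skewness(logged), 2))
```

A skewness closer to zero indicates a more symmetric distribution.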

Full explanation →

326

MCQmedium

A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?

A.Hierarchical clustering with Gower distance

B.K-modes clustering

C.DBSCAN with Euclidean distance

D.K-means clustering

AnswerA

Gower distance can handle mixed data types by computing a dissimilarity matrix that combines numeric and categorical attributes.

Why this answer

Hierarchical clustering with Gower distance is best suited for mixed data types because Gower distance computes a dissimilarity measure that handles both numeric and categorical variables by normalizing numeric differences and using a simple matching coefficient for categorical ones. This allows the algorithm to create a distance matrix that equally weights all variable types, making it ideal for segmenting customers with purchase history, demographics, and website behavior data.

Exam trap

The trap here is that candidates often assume K-means or DBSCAN can handle mixed data by simply encoding categorical variables, but they overlook that Euclidean distance on encoded data distorts the geometry and fails to preserve the natural dissimilarity structure of categorical variables.

How to eliminate wrong answers

Option B (K-modes clustering) is wrong because it is designed exclusively for categorical data and cannot handle numeric variables like purchase history or website behavior metrics. Option C (DBSCAN with Euclidean distance) is wrong because Euclidean distance is only meaningful for numeric data and cannot properly measure dissimilarity between categorical variables, leading to distorted clusters. Option D (K-means clustering) is wrong because it relies on Euclidean distance and assumes numeric, continuous data; it cannot directly incorporate categorical variables without encoding, and even with encoding, it is sensitive to scaling and does not naturally handle mixed types.
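
A minimal sketch of the Gower calculation for two records follows. The customer records, the field names, and the `ranges` values are hypothetical, and the function assumes equal weights and no missing values.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records.

    Numeric features contribute |a - b| / range; categorical features
    contribute 0 if the values match and 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical customers with two numeric and two categorical features.
alice = {"annual_spend": 500.0, "visits": 10, "region": "west", "plan": "basic"}
bob = {"annual_spend": 1500.0, "visits": 10, "region": "east", "plan": "basic"}

# Range (max - min) of each numeric feature across the whole dataset.
ranges = {"annual_spend": 2000.0, "visits": 40}

d = gower_distance(alice, bob, ranges)  # (0.5 + 0 + 1 + 0) / 4
```

Computing this value for every pair of customers yields the dissimilarity matrix that hierarchical clustering takes as input.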

Full explanation →

327

MCQeasy

A data analyst needs to merge two customer tables from different sources. One table uses 'CUST_ID' as the primary key, the other uses 'CustomerID'. To ensure accurate merging, the analyst should first:

A.Perform a fuzzy match on names

B.Normalize the key column names to a common format

C.Remove duplicate rows from both tables

D.Aggregate data by region

AnswerB

Standardizing key names allows for accurate merging without data loss.

Why this answer

Normalizing key column names to a common format (Option B) is the correct first step because the merge operation requires a consistent join key. Without aligning 'CUST_ID' and 'CustomerID' to a single name and data type, the database or ETL tool will treat them as different columns, resulting in a cross join or an error. This step ensures referential integrity and enables an accurate inner or outer join based on the primary key.

Exam trap

The trap here is that candidates assume deduplication (Option C) is the most critical first step, but without first standardizing the join keys, any deduplication logic would operate on mismatched or incomplete data, leading to incorrect results.

How to eliminate wrong answers

Option A is wrong because performing a fuzzy match on names is an advanced, resource-intensive technique used only when exact key values are unavailable or inconsistent; it is unnecessary when the tables already have primary key columns that can be standardized. Option C is wrong because removing duplicate rows before aligning key names could inadvertently delete legitimate records that only appear duplicated due to key naming differences, and deduplication should occur after the merge or as a separate quality step. Option D is wrong because aggregating data by region is a post-merge analytical operation that has no bearing on resolving key column mismatches and would corrupt the granularity needed for accurate joining.
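
As an illustration of aligning the key name and type before joining, here is a small sketch. The sample rows, the target name `customer_id`, and the `normalize_key` helper are all hypothetical.

```python
# Hypothetical extracts from the two sources; note the different key names
# and the different key types (int versus str).
crm_rows = [{"CUST_ID": 101, "name": "Ana"}, {"CUST_ID": 102, "name": "Ben"}]
billing_rows = [
    {"CustomerID": "101", "balance": 40.0},
    {"CustomerID": "103", "balance": 15.5},
]

def normalize_key(rows, old_name, new_name="customer_id"):
    """Return copies of rows with the key column renamed and cast to str."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the source rows are left untouched
        row[new_name] = str(row.pop(old_name)).strip()
        out.append(row)
    return out

crm = normalize_key(crm_rows, "CUST_ID")
billing = normalize_key(billing_rows, "CustomerID")

# Inner join on the now-consistent key.
billing_by_id = {row["customer_id"]: row for row in billing}
merged = [
    {**row, **billing_by_id[row["customer_id"]]}
    for row in crm
    if row["customer_id"] in billing_by_id
]
```

Only customer 101 appears in both sources, so the inner join returns a single combined row.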

Full explanation →

328

MCQeasy

An analyst creates a dashboard with multiple visualizations. Which feature allows users to change the data displayed across all charts simultaneously?

A.Linked chart

B.Drill-down

C.Filter or slicer

D.Data segmentation

AnswerC

Filters and slicers allow users to interactively change the data displayed on multiple charts.

Why this answer

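Option C is correct because a filter (or slicer) can apply to multiple visuals at once. Option A (linked chart) is a chart type, not a control for changing the data shown. Option B (drill-down) gives a more detailed, limited view of the data within a visual.

Option D (data segmentation) is the practice of dividing data into segments.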

Full explanation →

329

MCQeasy

Refer to the exhibit. An analyst runs this query before acquiring data from a PostgreSQL database. What is the primary purpose of this query?

A.To verify data types

B.To check for data freshness

C.To find primary keys

D.To identify duplicate tables

AnswerB

The 'last_analyzed' column shows when statistics were last updated, indicating freshness.

Why this answer

The query `SELECT current_timestamp;` returns the current date and time from the PostgreSQL server. By comparing this value with the timestamp of the most recent record in a target table, an analyst can determine if the data is up-to-date or stale. This is a standard technique to check data freshness before proceeding with data acquisition.

Exam trap

The trap here is that candidates may confuse `current_timestamp` with a function used for data type verification or schema inspection, when in fact it is solely a temporal function used to gauge data freshness.

How to eliminate wrong answers

Option A is wrong because verifying data types requires querying the `information_schema.columns` table or using `pg_typeof()`, not `current_timestamp`. Option C is wrong because finding primary keys involves querying `information_schema.table_constraints` or `pg_indexes`, not a simple timestamp function. Option D is wrong because identifying duplicate tables would require comparing table names or schemas via `information_schema.tables`, not a timestamp query.
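
The comparison described in the explanation (server time versus the newest record's timestamp) can be sketched as follows. The `is_fresh` helper, the 24-hour threshold, and the sample timestamps are assumptions for illustration. In practice both values would come from the database.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(latest_record_time, now, max_age=timedelta(hours=24)):
    """Data counts as fresh if the newest record is no older than max_age."""
    return now - latest_record_time <= max_age

# Hypothetical values: 'now' stands in for the server's current timestamp,
# 'latest' for the timestamp of the most recent row in the target table.
now = datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)
latest = datetime(2024, 5, 1, 23, 30, tzinfo=timezone.utc)

print("fresh" if is_fresh(latest, now) else "stale")
```

Using timezone-aware timestamps on both sides avoids false staleness caused by mismatched time zones.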

Full explanation →

330

MCQhard

A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?

A.Decrease the training data size

B.Use a polynomial transformation

C.Increase the number of features

D.Apply L2 regularization (Ridge)

AnswerD

Ridge regularization adds a penalty to large coefficients, reducing variance and combating overfitting.

Why this answer

The model is overfitting the training data, as evidenced by high performance on training data but poor performance on validation data. L2 regularization (Ridge) adds a penalty term proportional to the square of the coefficients, which shrinks them and reduces model complexity, thereby improving generalization to unseen data.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting, and candidates mistakenly choose polynomial transformation or adding features thinking they will improve fit, when in fact they increase model complexity and worsen overfitting.

How to eliminate wrong answers

Option A is wrong because decreasing the training data size would exacerbate overfitting by providing the model with even less information to learn generalizable patterns. Option B is wrong because polynomial transformation increases model complexity and feature interactions, which typically worsens overfitting rather than addressing it. Option C is wrong because increasing the number of features adds more predictors, which increases the risk of overfitting and does not directly penalize large coefficients.
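
The shrinking effect of the L2 penalty is easiest to see in the one-feature case. The sketch below assumes a model with no intercept, y = w * x, where minimizing the squared error plus `lam * w**2` gives a closed-form slope. The data points and the penalty value are made up.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for y = w * x with an L2 penalty lam * w**2.

    Minimizing sum((y - w*x)**2) + lam * w**2 gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator

# Hypothetical training data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols = ridge_slope(xs, ys, lam=0.0)    # no penalty: ordinary least squares
ridge = ridge_slope(xs, ys, lam=5.0)  # the penalty pulls the slope toward 0

print(round(ols, 3), round(ridge, 3))
```

A larger `lam` shrinks the coefficient further, trading a little bias for lower variance.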

Full explanation →

331

MCQeasy

A data analyst is tasked with gathering data from a legacy system that only exports CSV files. The files contain headers but no data types. Which tool would best facilitate initial data exploration?

A.Hadoop

B.Tableau

C.SQL database

D.Python pandas

AnswerD

Provides powerful data structures and functions for CSV exploration.

Why this answer

Option D is correct because Python pandas provides powerful data structures and functions for exploring and manipulating CSV data, including detecting data types. Option C is wrong because a SQL database requires importing the data first. Option A is wrong because Hadoop is overkill for CSV exploration.

Option B is wrong because Tableau is visualization-focused and not ideal for initial data exploration without pre-processing.

Full explanation →

332

Multi-Selectmedium

A data team must implement a data retention policy to reduce storage costs while meeting legal requirements. Which TWO actions best achieve this?

Select 2 answers

A.Set data retention limits with automated deletion

B.Use data compression

C.Increase primary storage capacity

D.Implement data deduplication

E.Archive historical data to tape or cloud archive

AnswersA, E

Ensures data is deleted after a defined period, complying with legal requirements.

Why this answer

Archiving old data to cheaper storage reduces primary storage costs, and setting retention limits ensures data is deleted when no longer needed, balancing cost and compliance.
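
A retention rule of this kind can be expressed as a simple age check. The thresholds below (`ARCHIVE_AFTER`, `DELETE_AFTER`) are hypothetical; actual periods depend on the applicable legal requirements.

```python
from datetime import date, timedelta

# Hypothetical policy thresholds.
ARCHIVE_AFTER = timedelta(days=365)      # move to cheaper storage after 1 year
DELETE_AFTER = timedelta(days=7 * 365)   # delete after roughly 7 years

def retention_action(created, today):
    """Decide what the policy requires for a record of a given age."""
    age = today - created
    if age > DELETE_AFTER:
        return "delete"
    if age > ARCHIVE_AFTER:
        return "archive"
    return "keep"

today = date(2024, 6, 1)
for created in (date(2024, 1, 15), date(2021, 3, 1), date(2015, 3, 1)):
    print(created, retention_action(created, today))
```

Running a check like this on a schedule is one way to automate both the archiving and the deletion steps.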

Full explanation →

333

MCQhard

A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?

A.Duplicate customer IDs

B.Missing values in total_charges

C.Inconsistent data in total_charges

D.Outliers in monthly_charges

AnswerC

The total_charges for the first customer is equal to monthly_charges, suggesting a calculation error.

Why this answer

Option C is correct because the exhibit shows that the 'total_charges' column contains entries like '1,234.56' and '1234.56', which are inconsistent numeric formats. Logistic regression in Python (e.g., using scikit-learn) requires all feature values to be numeric and consistent; mixed formats will cause parsing errors or incorrect model training. The analyst must standardize these values to a uniform numeric type (e.g., float) before proceeding.

Exam trap

CompTIA often tests the distinction between data quality issues that prevent model execution (like inconsistent data types) versus issues that degrade model performance (like outliers or missing values), and candidates frequently overlook the former because they focus on statistical concerns rather than data preprocessing fundamentals.

How to eliminate wrong answers

Option A is wrong because duplicate customer IDs are a data integrity issue that can cause data leakage or overfitting, but the exhibit does not show any duplicate IDs, and this is not the most immediate problem for model training. Option B is wrong because missing values in 'total_charges' are not indicated in the exhibit; the issue is inconsistent formatting, not absence of data. Option D is wrong because outliers in 'monthly_charges' are not visible in the exhibit, and while outliers can affect logistic regression, they are a secondary concern compared to the fundamental data type inconsistency that prevents the model from even reading the data correctly.
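
Standardizing mixed formats such as '1,234.56' and '1234.56' might look like the sketch below. It assumes the comma is a thousands separator (US-style formatting). The `parse_charge` helper and the sample values are hypothetical.

```python
def parse_charge(text):
    """Convert strings such as '1,234.56' or ' 1234.56 ' to float.

    Assumes ',' is a thousands separator, not a decimal mark.
    """
    return float(text.replace(",", "").strip())

# Hypothetical raw column values in inconsistent formats.
raw_total_charges = ["1,234.56", "1234.56", " 89.10 "]

total_charges = [parse_charge(t) for t in raw_total_charges]
print(total_charges)
```

Once every value is a float, the column can be passed to a modeling library without parsing errors.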

Full explanation →

334

MCQhard

A data analyst is reviewing an S3 bucket policy that controls access to a data lake. The analyst wants to communicate that the current policy restricts data downloads to a specific IP range. Which of the following best describes the policy's effect?

A.The policy allows all users to download data.

B.The policy denies data downloads to all users.

C.The policy restricts data downloads to a specific IP range.

D.The policy allows data uploads from a specific IP range.

AnswerC

The Deny with condition effectively allows only that IP range.

Why this answer

Option C is correct because the S3 bucket policy explicitly restricts data downloads to a specific IP range by using a condition in the policy statement, such as `aws:SourceIp` with a CIDR block. This means only requests originating from IP addresses within that range are allowed to perform the `s3:GetObject` action, effectively limiting downloads to authorized network locations.

Exam trap

The trap here is that candidates may confuse the policy's effect with a blanket denial (Option B) or misinterpret the restriction as applying to uploads (Option D), when the policy specifically targets downloads and allows them only from a defined IP range.

How to eliminate wrong answers

Option A is wrong because it describes an allow-all effect, which contradicts the explicit IP-based restriction in the policy; a policy that restricts to a specific IP range does not allow all users. Option B is wrong because it states the policy denies downloads to all users, but the policy actually allows downloads from the specified IP range while implicitly or explicitly denying others—it does not universally deny. Option D is wrong because it refers to data uploads (e.g., `s3:PutObject`), but the question specifically addresses downloads (`s3:GetObject`), and the policy effect is about restricting downloads, not uploads.
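
For orientation, a policy of this shape could look like the sketch below, written as a Python dictionary. The exhibit is not reproduced here, so the bucket name `example-data-lake`, the statement ID, and the CIDR block are hypothetical. The `download_denied` function only mimics the condition locally and is not an AWS API.

```python
import ipaddress
import json

ALLOWED_CIDR = "203.0.113.0/24"  # hypothetical range reserved for documentation

# Deny s3:GetObject for any request that does NOT come from the allowed range.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDownloadsOutsideRange",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-data-lake/*",
            "Condition": {"NotIpAddress": {"aws:SourceIp": ALLOWED_CIDR}},
        }
    ],
}

def download_denied(source_ip):
    """Mimic the condition: deny when the caller is outside the range."""
    network = ipaddress.ip_network(ALLOWED_CIDR)
    return ipaddress.ip_address(source_ip) not in network

print(json.dumps(policy, indent=2))
```

A Deny statement on its own grants nothing. Callers inside the range still need an Allow from another statement or an identity policy.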

Full explanation →

335

Drag & Dropmedium

Drag and drop the steps to perform a data audit in the correct order.

Why this order

A data audit begins with an inventory, then proceeds through quality assessment, compliance check, documentation, and recommendations.

Full explanation →

336

MCQeasy

A retail company is merging customer data from three separate systems: an e-commerce platform, a point-of-sale (POS) system, and a loyalty program. The e-commerce platform stores customer names in "FirstName LastName" format, the POS system stores names as "LastName, FirstName", and the loyalty program stores names in separate "first_name" and "last_name" fields. The data analyst needs to create a unified customer master table. After initial merging, there are 20% more records than expected, including duplicates with slight name variations (e.g., "John Smith" vs "John A. Smith"). To ensure accurate consolidation, which data concept should the analyst prioritize applying first?

A.Data profiling

B.Data standardization

C.Data indexing

D.Data encryption

AnswerB

Standardizing name formats to a common convention reduces variations and allows accurate matching and deduplication.

Why this answer

Data standardization is the correct first step because it resolves the inconsistent name formats (e.g., 'FirstName LastName', 'LastName, FirstName', and separate fields) into a single, consistent representation. By applying a standardized format (e.g., 'FirstName LastName'), the analyst can then accurately identify and merge duplicates like 'John Smith' and 'John A. Smith' using fuzzy matching or exact matching on the standardized values.

This ensures the unified customer master table has the correct number of records without the 20% inflation caused by formatting variations.

Exam trap

The trap here is that candidates confuse data profiling (which only identifies issues) with data standardization (which actively resolves format inconsistencies), leading them to choose A instead of B, even though profiling alone cannot fix the duplicate records caused by name variations.

How to eliminate wrong answers

Option A is wrong because data profiling is an exploratory process that assesses data quality and structure (e.g., detecting nulls, patterns, or anomalies), but it does not transform or resolve the inconsistent name formats that cause duplicate records. Option C is wrong because data indexing improves query performance by creating sorted structures (e.g., B-trees or hash indexes) on columns, but it does not address the underlying data inconsistency or deduplication needed for accurate consolidation. Option D is wrong because data encryption protects data at rest or in transit (e.g., using AES-256 or TLS 1.3), but it has no role in standardizing name formats or removing duplicates from merged datasets.
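
The standardization step for the three name layouts can be sketched as below. The record shapes and the `standardize_name` helper are hypothetical, and the sketch only unifies the layout. Variations such as a middle initial would still need a later matching step.

```python
def standardize_name(record):
    """Return 'FirstName LastName' from any of the three source layouts."""
    if "first_name" in record:
        # Loyalty program: separate first_name and last_name fields.
        name = f"{record['first_name']} {record['last_name']}"
    elif "," in record["name"]:
        # POS system: 'LastName, FirstName'.
        last, first = record["name"].split(",", 1)
        name = f"{first} {last}"
    else:
        # E-commerce platform: already 'FirstName LastName'.
        name = record["name"]
    # Collapse repeated or stray whitespace.
    return " ".join(name.split())

# Hypothetical records for the same customer in each system.
sources = [
    {"name": "John Smith"},
    {"name": "Smith, John"},
    {"first_name": "John", "last_name": "Smith"},
]

standardized = {standardize_name(r) for r in sources}
print(standardized)
```

All three records collapse to one standardized value, which is what makes exact-match deduplication possible.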

Full explanation →

337

MCQmedium

A report uses red and green colors to indicate performance; a stakeholder who is colorblind complains. What is the BEST remedy?

A.Remove color entirely

B.Change colors to blue and orange

C.Add patterns or labels in addition to color

D.Use a different chart type

AnswerC

Patterns and labels provide redundant encoding that works for all users.

Why this answer

Option C is correct because adding patterns or labels alongside color ensures that the data encoding is accessible to colorblind viewers without removing the informational value of color for those who can see it. This approach follows WCAG (Web Content Accessibility Guidelines) 2.1 Success Criterion 1.4.1, which requires that color is not the sole means of conveying information. In data visualization tools like Tableau or Power BI, this is implemented by adding shape markers or text annotations to series, preserving the original color scheme for non-colorblind users while providing redundant encoding.

Exam trap

The trap here is that candidates assume the best remedy is to simply change colors to a more colorblind-friendly palette (like blue and orange), but CompTIA often tests the deeper principle that color alone is never sufficient for accessibility—redundant encoding via patterns or labels is required to meet WCAG standards.

How to eliminate wrong answers

Option A is wrong because removing color entirely eliminates a key visual encoding channel, which can reduce the chart's effectiveness for the majority of viewers and may violate best practices for data visualization that leverage color for quick pattern recognition. Option B is wrong because while blue and orange are more colorblind-friendly than red and green, they still rely solely on color differentiation, which fails for users with complete color blindness (achromatopsia) or other deficiencies; the WCAG requirement is to not use color as the only means of distinction. Option D is wrong because changing the chart type does not address the core issue of color-dependent encoding; a different chart type (e.g., bar chart instead of pie chart) still uses color to differentiate categories, so the accessibility problem persists unless patterns or labels are added.

Full explanation →

338

MCQhard

Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?

A.bathrooms

B.sqft_living

C.Intercept

D.bedrooms

AnswerB

sqft_living has the highest absolute t-value (10.0) indicating strong effect.

Why this answer

Standardized coefficients (beta weights) allow comparison of the relative strength of independent variables by measuring the number of standard deviations the dependent variable changes per one standard deviation change in the predictor. In the regression output, sqft_living has the highest absolute standardized coefficient, indicating it has the strongest effect on price. The intercept is not an independent variable and its coefficient is not standardized for comparison.

Exam trap

The trap here is that candidates mistakenly compare unstandardized coefficients or p-values instead of standardized coefficients, leading them to choose a variable like bathrooms or bedrooms that appears significant but has a weaker standardized effect.

How to eliminate wrong answers

Option A is wrong because bathrooms may have a statistically significant coefficient, but its standardized coefficient is smaller than that of sqft_living, meaning it has a weaker relative effect on price. Option C is wrong because the intercept is a constant term representing the predicted price when all independent variables are zero; it is not an independent variable and its coefficient is not standardized for effect comparison. Option D is wrong because bedrooms, while possibly significant, has a lower absolute standardized coefficient than sqft_living, indicating a weaker influence on price per standard deviation change.
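
A standardized coefficient can be derived from an unstandardized one by rescaling with the standard deviations of the predictor and the outcome. The sketch below uses invented data and invented raw coefficients (`b_sqft`, `b_bedrooms`), not the values from the exhibit, to show how the ranking can flip after standardizing.

```python
from statistics import stdev

def standardized_beta(b, x_values, y_values):
    """beta = b * sd(x) / sd(y): change in y, in SDs, per 1 SD change in x."""
    return b * stdev(x_values) / stdev(y_values)

# Hypothetical housing data.
price = [200000, 250000, 320000, 400000, 480000]
sqft_living = [1000, 1400, 1800, 2400, 3000]
bedrooms = [2, 3, 3, 4, 4]

# Hypothetical unstandardized coefficients (dollars per unit of predictor).
b_sqft, b_bedrooms = 120.0, 15000.0

beta_sqft = standardized_beta(b_sqft, sqft_living, price)
beta_bedrooms = standardized_beta(b_bedrooms, bedrooms, price)

print(round(beta_sqft, 3), round(beta_bedrooms, 3))
```

In this made-up case the raw bedrooms coefficient is far larger, yet sqft_living has the larger standardized effect, because square footage varies much more in its own units.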

Full explanation →

339

Multi-Selectmedium

Which TWO chart types are best suited to show the proportion of total sales contributed by each product category? (Select exactly two.)

Select 2 answers

A.Histogram

B.Scatter plot

C.Stacked bar chart

D.Pie chart

E.Line chart

AnswersC, D

Stacked bars show composition across categories.

Why this answer

Options C and D are correct. Pie charts and stacked bar charts effectively show parts of a whole. Line charts show trends; scatter plots show relationships; histograms show distributions.

Full explanation →

340

MCQhard

Refer to the exhibit. What is the best course of action for the data analyst?

A.Change the default value to 'NA'

B.Delete rows with missing 'Region'

C.Investigate source data to understand why 'Region' is missing

D.Ignore the warning as it is handled

AnswerC

Root cause analysis helps improve data quality at the source.

Why this answer

Option C is correct because missing data in a critical field like 'Region' indicates a potential data quality issue that requires root cause analysis. The data analyst should investigate the source system to determine whether the missing values result from a collection error, a transformation bug, or a legitimate null, rather than making assumptions by deleting rows or filling defaults. This aligns with the principle of data provenance and ensures that downstream insights are based on accurate, traceable data.

Exam trap

The trap here is that candidates assume missing data should be immediately cleaned by deletion or default values, but CompTIA tests the understanding that a data analyst's first responsibility is to investigate the source of the anomaly before applying any transformation.

How to eliminate wrong answers

Option A is wrong because blindly changing the default value to 'NA' introduces artificial data that may skew aggregations or filter logic, and it does not address the underlying reason for the missing values. Option B is wrong because deleting rows with missing 'Region' could discard valid records if the missing data is due to a temporary system glitch or a mapping error, leading to biased analysis and loss of potentially important context. Option D is wrong because ignoring the warning assumes the missing data is harmless, but in practice, missing region information can break geographic segmentation, violate data integrity constraints, and cause incorrect reporting in dashboards or ETL pipelines.

Full explanation →

341

MCQhard

A data engineer is designing a data pipeline to ingest streaming data from IoT sensors. The sensors send data every second, and the pipeline must handle bursts of up to 10,000 messages per second. Which approach is most appropriate for capturing this data before processing?

A.Directly write each message to a relational database

B.Load directly into a data warehouse

C.Use a message queue to buffer the incoming data

D.Store data in flat files and process in nightly batches

AnswerC

A message queue handles high throughput and provides reliable buffering.

Why this answer

Option C is correct because a message queue (e.g., Apache Kafka, Amazon Kinesis, or RabbitMQ) provides an asynchronous buffer that decouples the high-velocity ingestion (up to 10,000 messages/second) from downstream processing. This allows the pipeline to absorb burst traffic without overwhelming the processing layer, ensures data durability, and supports replayability in case of failures.

Exam trap

CompTIA often tests the misconception that relational databases or data warehouses can handle real-time streaming ingestion at scale, when in fact they require a buffering layer like a message queue to absorb bursts and decouple ingestion from processing.

How to eliminate wrong answers

Option A is wrong because directly writing each message to a relational database (RDBMS) at 10,000 messages/second would cause severe write contention, lock contention, and I/O bottlenecks, leading to dropped data and unacceptable latency. Option B is wrong because loading directly into a data warehouse (e.g., Snowflake, Redshift) is designed for batch or micro-batch ingestion, not for real-time streaming at this scale; it would incur high costs and fail to handle bursty throughput without prior buffering. Option D is wrong because storing data in flat files and processing in nightly batches introduces unacceptable latency (up to 24 hours) for streaming IoT data, and the file system cannot reliably handle 10,000 writes per second without data loss or corruption.
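
The decoupling idea can be illustrated with an in-process queue. This is only a stand-in: Python's `queue.Queue` is not a message broker, and the message shape, buffer size, and `consumer` function are assumptions. A production pipeline would use a broker such as those named above.

```python
import queue
import threading

# In-process stand-in for a message broker's buffer.
buffer = queue.Queue(maxsize=20000)
processed = []

def consumer():
    """Drain the buffer at its own pace until the sentinel arrives."""
    while True:
        msg = buffer.get()
        if msg is None:  # sentinel: no more messages
            break
        processed.append(msg)  # placeholder for real processing

worker = threading.Thread(target=consumer)
worker.start()

# Simulated burst: the producer only enqueues and never waits on processing.
for i in range(10000):
    buffer.put({"sensor_id": i % 50, "reading": i})

buffer.put(None)
worker.join()
print(len(processed), "messages processed")
```

The producer finishes its burst regardless of how slowly the consumer works, which is the property that protects downstream systems from spikes.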

Full explanation →

342

Multi-Selecthard

Which THREE of the following are valid methods for handling missing data?

Select 3 answers

A.Using a placeholder like 'Unknown' for categorical data

B.Ignoring missing values and proceeding with analysis

C.Replacing missing values with the mean of the column

D.Sorting the data to bring missing values to the top

E.Deleting rows with missing values

AnswersA, C, E

Placeholder is a valid approach.

Why this answer

Option A is correct because using a placeholder like 'Unknown' for categorical missing data preserves the dataset's structure and allows analysis to proceed without discarding rows. This method is particularly valid for nominal data where the missing category can be treated as a distinct value, enabling downstream operations like one-hot encoding or frequency analysis. Option C (mean imputation) and Option E (deleting rows with missing values) are likewise standard techniques, each with its own trade-offs.

Exam trap

The trap here is that candidates may confuse 'handling missing data' with 'preprocessing steps'—sorting (Option D) is a data organization technique, not a valid method for dealing with missing values, and ignoring missing data (Option B) is often mistakenly considered acceptable in quick analyses, but it violates best practices for robust data science workflows.
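
The three valid methods can be sketched side by side. The sample `rows` and field names are hypothetical, with `None` standing in for a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"region": "east", "sales": 100.0},
    {"region": None, "sales": 80.0},
    {"region": "west", "sales": None},
]

# Method 1: placeholder for a missing categorical value.
labeled = [{**r, "region": r["region"] or "Unknown"} for r in rows]

# Method 2: replace a missing numeric value with the column mean.
known_sales = [r["sales"] for r in rows if r["sales"] is not None]
sales_mean = mean(known_sales)
imputed = [
    {**r, "sales": sales_mean if r["sales"] is None else r["sales"]}
    for r in rows
]

# Method 3: delete rows that contain any missing value.
complete = [r for r in rows if None not in r.values()]
```

Each method changes the data differently, so the choice depends on how much data is missing and why.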

Full explanation →

343

Multi-Selecthard

When presenting data insights, which three practices help ensure clarity? (Choose THREE.)

Select 3 answers

A.Include all data points in the chart

B.Overlay multiple chart types on one axis

C.Use consistent color coding

D.Label axes clearly

E.Provide context and benchmarks

AnswersC, D, E

Consistency helps audiences quickly interpret visual elements.

Why this answer

Option C is correct because consistent color coding across charts and dashboards reduces cognitive load, allowing viewers to quickly associate colors with specific categories or metrics. This practice aligns with data visualization best practices for clarity, as it leverages pre-attentive processing to speed up interpretation without requiring the audience to re-learn the mapping for each chart.

Exam trap

The trap here is that candidates may think including all data points (Option A) ensures completeness, but the exam tests the understanding that clarity often requires simplification and focus, not exhaustive detail.

Full explanation →

344

MCQeasy

A data analyst is asked to create a visualization showing the trend of monthly revenue over the last five years. Which chart type is most suitable?

A.Pie chart.

B.Histogram.

C.Line chart.

D.Bar chart.

AnswerC

Line charts effectively show changes over time with continuous data.

Why this answer

A line chart is the most suitable choice because it explicitly shows the trend of monthly revenue over a continuous time period (five years). The x-axis represents time (months), and the y-axis represents revenue, allowing the viewer to easily identify patterns, seasonality, and overall direction. This makes it ideal for time-series data visualization.

Exam trap

The trap here is that candidates may pick a histogram because revenue is a continuous variable, but a histogram bins continuous data into ranges and loses the temporal sequence, while a line chart preserves the order of time-series data.

How to eliminate wrong answers

Option A is wrong because a pie chart is used to show parts of a whole (proportions) at a single point in time, not trends over a continuous period. Option B is wrong because a histogram displays the distribution of a continuous variable by binning data into intervals, but it does not inherently show a trend over time—it would obscure the sequential order of months. Option D is wrong because a bar chart compares discrete categories or values at specific points, but it does not emphasize the continuous flow and trend direction as effectively as a line chart for time-series data.

Full explanation →

345

MCQeasy

A data analyst is importing a CSV file that contains a mixture of numeric and text fields. What is the most common issue when importing?

A.Duplicate rows

B.Missing header row

C.Data types being incorrectly inferred

D.File size limitation

AnswerC

CSV import tools often guess types incorrectly, leading to conversion errors.

Why this answer

Data type inference often fails, causing numbers to be read as text or vice versa. File size limitations, missing headers, and duplicate rows are less common or not specific to mixed types.
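
A minimal sketch of the problem, using Python's standard csv module and made-up sample data. The column names and the naive `infer` helper are hypothetical and do not describe any particular import tool. It shows a ZIP code that looks numeric losing its leading zero when a type is guessed for it, and how declaring the column as text avoids that.

```python
import csv
import io

# Hypothetical sample file: zip_code looks numeric but is really text.
RAW = "customer,zip_code,amount\nAvery,02134,19.99\nBlake,90210,5.00\n"

def infer(value):
    """Naive type guess, similar in spirit to what many import tools do."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

def load(raw, text_columns=()):
    """Read CSV rows, guessing types except for columns declared as text."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw)):
        rows.append({k: (v if k in text_columns else infer(v)) for k, v in row.items()})
    return rows

guessed = load(RAW)
declared = load(RAW, text_columns={"zip_code"})
print(guessed[0]["zip_code"])   # guessed as an integer, leading zero lost
print(declared[0]["zip_code"])  # kept as text
```

Declaring column types up front, rather than relying on inference, is the usual remedy.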

Full explanation →

346

MCQeasy

When creating a data story, what should be the FIRST step?

A.Design the layout of the report

B.Gather all available data

C.Choose visualization tools

D.Identify the key message

AnswerD

The key message frames the story and directs the entire process.

Why this answer

The first step in creating a data story is to identify the key message because the entire narrative, data selection, and visualization choices must align with a central insight. Without a clear message, subsequent steps like gathering data or designing layouts risk being unfocused and failing to communicate actionable findings. This ensures the story is driven by the insight, not by the data or tools.

Exam trap

CompTIA often tests the misconception that data gathering or tool selection is the logical starting point, tempting candidates to skip the critical step of defining the core insight first.

How to eliminate wrong answers

Option A is wrong because designing the layout of the report should occur after the key message is defined; layout is a presentation concern, not the foundational step. Option B is wrong because gathering all available data without a guiding message leads to analysis paralysis and irrelevant information; data should be collected with the key message in mind. Option C is wrong because choosing visualization tools depends on the data and the story to be told; selecting tools prematurely can limit the narrative or force inappropriate chart types.

Full explanation →

347

MCQeasy

A marketing team wants to compare the growth rates of four product categories over the last quarter. Which chart type would best display this?

A.Pie chart

B.Stacked bar chart

C.Line chart

D.Bar chart

AnswerD

Bar charts effectively compare discrete values like growth percentages.

Why this answer

Option D is correct because a bar chart comparing growth percentages across categories is clear and straightforward. Other options are less effective for comparing discrete metrics.

Full explanation →

348

MCQeasy

A marketing team wants to analyze customer sentiment from social media posts. Which data acquisition method is most appropriate?

A.Internal database query

B.Physical sensor data

C.Web scraping from public social media APIs

D.Survey questionnaire

AnswerC

Allows direct access to public posts for sentiment analysis.

Why this answer

Option C is correct because web scraping from public social media APIs allows direct access to public posts for sentiment analysis. Option A is wrong because internal databases do not contain social media data. Option B is wrong because physical sensors are unrelated.

Option D is wrong because survey questionnaires do not capture social media posts in real time.

Full explanation →

349

MCQhard

An e-commerce company is merging customer data from three legacy systems. Two systems use email as unique identifier, but one system allows multiple customers per email. The third uses phone number. To create a unified customer view, the analyst should first:

A.Request the IT team to modify the legacy system

B.Build a customer matching rule that uses multiple attributes (email, phone, name) with a confidence score

C.Use email as primary key and ignore conflicts

D.Assign new unique IDs and discard existing identifiers

AnswerB

Multi-attribute matching handles non-unique identifiers and improves accuracy.

Why this answer

Option B is correct because merging data from systems with different identifier schemas requires a probabilistic matching approach. Using multiple attributes (email, phone, name) with a confidence score allows the analyst to resolve conflicts where email is not unique and phone numbers may be missing or formatted differently, creating a unified customer view without forcing a single key.

Exam trap

The trap here is that candidates assume a single unique identifier (email) can be forced as a primary key, ignoring the real-world data quality issue of non-unique emails, which the question explicitly states.

How to eliminate wrong answers

Option A is wrong because modifying legacy systems is often impractical, costly, and outside the analyst's scope; the question asks what the analyst should do first, not a long-term IT project. Option C is wrong because using email as primary key and ignoring conflicts would lose data integrity when one email maps to multiple customers, violating the goal of a unified view. Option D is wrong because assigning new unique IDs and discarding existing identifiers eliminates the ability to link records back to source systems and loses valuable matching context, making deduplication impossible.
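
One way a multi-attribute matching rule with a confidence score might look is sketched below. The weights, field names, and sample records are assumptions made for illustration. A real project would tune the weights against manually reviewed matches and would likely use fuzzy name comparison rather than exact equality.

```python
def normalize_phone(phone):
    """Keep digits only so '(555) 010-2000' and '555-010-2000' compare equal."""
    return "".join(ch for ch in (phone or "") if ch.isdigit())

# Hypothetical weights, chosen only for this example.
WEIGHTS = {"email": 0.5, "phone": 0.3, "name": 0.2}

def match_confidence(a, b):
    """Return a 0..1 score from agreement on email, phone, and name."""
    score = 0.0
    if a.get("email") and a["email"].lower() == (b.get("email") or "").lower():
        score += WEIGHTS["email"]
    phone_a, phone_b = normalize_phone(a.get("phone")), normalize_phone(b.get("phone"))
    if phone_a and phone_a == phone_b:
        score += WEIGHTS["phone"]
    if a.get("name") and a["name"].strip().lower() == (b.get("name") or "").strip().lower():
        score += WEIGHTS["name"]
    return round(score, 2)

# Two different people sharing one email, and one person stored two ways.
shared_email_a = {"email": "family@example.com", "phone": "555-010-2000", "name": "Dana Lee"}
shared_email_b = {"email": "family@example.com", "phone": "555-010-3000", "name": "Sam Lee"}
same_person = {"email": "FAMILY@example.com", "phone": "(555) 010-2000", "name": "dana lee"}

email_only = match_confidence(shared_email_a, shared_email_b)
all_three = match_confidence(shared_email_a, same_person)
print(email_only, all_three)
```

A shared email alone yields only a partial score, so the two customers behind one address are not merged automatically, which is the conflict the question describes.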

Full explanation →

350

MCQhard

A data analyst is reviewing the error log from a nightly batch load. What is the most likely cause of the error?

A.A row with the same primary key was already loaded in a previous batch.

B.The data type of order_id is incorrect.

C.The source and target schemas are mismatched.

D.The order_id field contains null values.

AnswerA

The error explicitly says duplicate key.

Why this answer

The error log from a nightly batch load indicates a primary key violation. This occurs when a row with the same primary key value already exists in the target table from a previous batch load. Since batch loads typically use INSERT operations, attempting to insert a duplicate primary key will raise a constraint violation error, halting the load process.

Exam trap

The trap here is that candidates confuse a primary key violation with a data type mismatch or schema mismatch, but the error log's specific reference to a duplicate key points directly to the primary key constraint.

How to eliminate wrong answers

Option B is wrong because an incorrect data type for order_id would cause a data type conversion error or truncation error, not a primary key violation. Option C is wrong because a schema mismatch (e.g., missing columns or different column order) would produce a column mapping error or a 'column not found' error, not a duplicate key error. Option D is wrong because null values in order_id would violate a NOT NULL constraint if the primary key column is defined as NOT NULL, but the error message specifically points to a duplicate key violation, not a null constraint violation.
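
The exhibit's error log is not reproduced here, so the following is only an illustration of the failure mode, using an in-memory SQLite database and a hypothetical `orders` table. SQLite words the message differently from other engines, which often say "duplicate key", but the cause is the same: a second INSERT of an existing primary key value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")

# Previous night's batch loaded order 1001.
conn.execute("INSERT INTO orders VALUES (1001, 25.0)")

# Tonight's batch accidentally contains the same order again.
try:
    conn.execute("INSERT INTO orders VALUES (1001, 25.0)")
    error = None
except sqlite3.IntegrityError as exc:
    error = str(exc)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(error)      # constraint violation message from the engine
print(row_count)  # the duplicate row was rejected
```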

Full explanation →

351

MCQeasy

A data analyst is preparing a presentation for executive leadership. The analyst wants to highlight the correlation between marketing spend and revenue over the past year. Which visualization type is most appropriate for showing this relationship?

A.Scatter plot

B.Pie chart

C.Bar chart

D.Histogram

AnswerA

Scatter plots effectively display correlations between two continuous variables.

Why this answer

A scatter plot is the most appropriate visualization for showing the relationship between two continuous variables—marketing spend and revenue—because it plots individual data points on an X-Y axis, allowing the analyst to visually assess correlation, trends, and outliers. This directly supports the goal of highlighting correlation, as the pattern of points (e.g., upward slope) indicates the strength and direction of the relationship.

Exam trap

The trap here is that candidates often confuse a bar chart or histogram with a scatter plot because they think any chart with axes can show relationships, but only a scatter plot directly plots paired continuous data to reveal correlation without aggregation.

How to eliminate wrong answers

Option B (Pie chart) is wrong because pie charts are designed to show parts of a whole (proportions) for categorical data, not the relationship between two continuous variables. Option C (Bar chart) is wrong because bar charts compare discrete categories or aggregated values, not the correlation between two continuous metrics; they would require binning or summarizing the data, losing the granularity needed for correlation analysis. Option D (Histogram) is wrong because histograms display the distribution of a single continuous variable (e.g., frequency of revenue values), not the relationship between two variables.

Full explanation →

352

MCQhard

A sales dashboard shows a map with many overlapping markers in the same city, making it hard to read. What is the best improvement?

A.Add tooltips to show details on hover

B.Aggregate the data by region and use a choropleth map

C.Use a bubble chart instead of a map

D.Use different marker colors for each store

AnswerB

Aggregating reduces density and choropleth shows region-level values.

Why this answer

Option B is correct because aggregating sales data by region and using a choropleth map eliminates visual clutter from overlapping markers by shading entire geographic areas based on a metric (e.g., total sales). This approach leverages spatial aggregation to provide a clear, high-level view of regional performance, which is the best practice when individual point markers become unreadable due to density.

Exam trap

The trap here is that candidates may choose tooltips (Option A) thinking interactivity solves the problem, but the question asks for the 'best improvement' to readability, and tooltips do not address the fundamental issue of overlapping markers obscuring the visualization.

How to eliminate wrong answers

Option A is wrong because tooltips only provide details on hover and do not solve the core problem of overlapping markers obscuring data; they add interactivity but do not reduce visual density. Option C is wrong because a bubble chart, while useful for comparing values, is not a map-based visualization and would lose the geographic context that the dashboard intends to convey. Option D is wrong because using different marker colors for each store does not address overlapping markers; it only adds visual differentiation without reducing clutter, and in dense areas, colored markers still overlap and remain unreadable.

Full explanation →

353

MCQhard

Refer to the exhibit. A database administrator notices that queries filtering on both CustomerID and OrderDate are slow. Which single change would most likely improve performance for such queries?

A.Partition the table by OrderDate

B.Convert TotalAmount to VARCHAR

C.Add a composite index on (CustomerID, OrderDate)

D.Remove the primary key constraint

AnswerC

A composite index can satisfy both conditions in one index seek.

Why this answer

A composite index on (CustomerID, OrderDate) allows the database to use a single index to filter on both columns, which is more efficient than using separate indexes and combining results.
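
The exhibit is not shown, so the sketch below assumes a simple `Orders` table containing the columns named in the question plus a hypothetical `OrderID` key. It creates the composite index in SQLite and asks the engine for its query plan. Other database engines report plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders ("
    "OrderID INTEGER PRIMARY KEY, CustomerID INTEGER, "
    "OrderDate TEXT, TotalAmount REAL)"
)
# Composite index: CustomerID first, then OrderDate.
conn.execute("CREATE INDEX idx_orders_customer_date ON Orders (CustomerID, OrderDate)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT OrderID, TotalAmount FROM Orders "
    "WHERE CustomerID = ? AND OrderDate >= ?",
    (42, "2023-01-01"),
).fetchall()

# The last column of each plan row is a human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The plan text should mention `idx_orders_customer_date`, indicating that both filter conditions are served by the one index.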

Full explanation →

354

MCQmedium

A data quality report shows that 95% of records have all required fields completed, but 20% of the completed fields contain values that are outside valid ranges. Which data quality dimension is most affected?

A.Consistency

B.Accuracy

C.Timeliness

D.Completeness

AnswerB

Accuracy is compromised because values outside valid ranges are incorrect.

Why this answer

Accuracy measures how well data reflects real-world values or a defined standard. Here, 20% of completed fields contain values outside valid ranges, meaning the data is present but incorrect, directly degrading accuracy. Completeness (95% filled) is high, but the core issue is that the values themselves are wrong, not missing or late.

Exam trap

The trap here is that the '95% of records have all required fields completed' figure draws attention to Completeness, tempting candidates to attribute the 20% out-of-range values to that dimension as well, when in fact Accuracy is the dimension that suffers when present data is invalid.

How to eliminate wrong answers

Option A (Consistency) is wrong because consistency checks for logical coherence across datasets or over time (e.g., same customer ID format in two tables), not whether individual field values fall within valid ranges. Option C (Timeliness) is wrong because timeliness concerns whether data is available when needed or within a required time window, not the correctness of values. Option D (Completeness) is wrong because completeness measures the presence of data (95% of records have all required fields), which is high; the problem is with the quality of the present data, not its absence.
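
The difference between the two measurements can be sketched with a few made-up records. The field name and the valid range of 0 to 120 are assumptions for this example. Completeness counts whether a value is present, and the range check asks whether a present value is plausible.

```python
# Hypothetical records: age is required and must fall between 0 and 120.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 250},   # present but out of range
    {"id": 3, "age": None},  # missing
    {"id": 4, "age": 58},
    {"id": 5, "age": -3},    # present but out of range
]

present = [r for r in records if r["age"] is not None]
in_range = [r for r in present if 0 <= r["age"] <= 120]

completeness = len(present) / len(records)       # share of records with a value
range_pass_rate = len(in_range) / len(present)   # share of present values that are plausible

print(f"completeness={completeness:.0%}, in-range={range_pass_rate:.0%}")
```

A dataset can score well on the first number and poorly on the second, which is the situation in the question.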

Full explanation →

355

MCQhard

A data team is preparing a quarterly business review for the CEO. The report must include both high-level summaries and the ability for the CEO to drill down into specific departments. Which reporting technique best meets this requirement?

A.A slide deck with one slide per department.

B.An interactive dashboard with drill-down capabilities.

C.A static PDF with a summary page and appendices.

D.A data dump in Excel with filters.

AnswerB

Interactive dashboards allow users to start with a summary and click to see underlying details for specific departments.

Why this answer

An interactive dashboard with drill-down capabilities (Option B) is the correct choice because it directly addresses the requirement for both high-level summaries and the ability to explore specific departments. Dashboards allow the CEO to view aggregated KPIs at a glance and then click through to detailed views for each department, providing a seamless, user-driven exploration experience without switching between separate reports or slides.

Exam trap

The trap here is that candidates often choose a static PDF (Option C) or a slide deck (Option A) because they associate 'report' with printed or presentation materials, but the question explicitly requires 'drill-down' capability, which is a hallmark of interactive business intelligence tools, not static documents.

How to eliminate wrong answers

Option A is wrong because a slide deck with one slide per department forces a linear, static presentation; the CEO cannot dynamically drill down from a summary view into a specific department without manually navigating slides, which breaks the requirement for interactive drill-down. Option C is wrong because a static PDF with a summary page and appendices is non-interactive; the CEO would have to jump to appendix pages manually, which is not a true drill-down capability and lacks the real-time filtering or cross-filtering that an interactive dashboard provides. Option D is wrong because a data dump in Excel with filters is a raw data file that requires the CEO to understand the data structure and apply filters manually; it does not offer a curated high-level summary or a guided drill-down path, and it risks overwhelming the user with granular data without pre-built aggregations.

Full explanation →

356

Multi-Selectmedium

Which THREE are best practices for data profiling during acquisition? (Choose three.)

Select 3 answers

A.Immediately normalize data

B.Check for completeness

C.Assess data types

D.Identify outliers

E.Skip validation for trusted sources

AnswersB, C, D

Ensuring all required fields are populated is essential.

Why this answer

Checking for completeness (Option B) is a best practice during data acquisition because it ensures that all required fields and records are present before further processing. Incomplete data can lead to incorrect analysis or failed transformations, so profiling for missing values or nulls is a fundamental validation step.

Exam trap

The trap here is that candidates confuse 'best practices for acquisition' with 'best practices for transformation,' leading them to select normalization (Option A) as an immediate step rather than a later processing stage.

Full explanation →

357

Multi-Selecthard

A data analyst is evaluating data quality issues in a customer database. Which TWO actions are best practices for ensuring data consistency?

Select 2 answers

A.Allowing null values for foreign keys

B.Standardizing date formats across all tables

C.Implementing referential integrity constraints

D.Enabling cascading updates on primary keys

E.Using data profiling to identify duplicate records

AnswersB, C

Correct: Uniform formats ensure consistency in temporal data.

Why this answer

Standardizing date formats across all tables (Option B) ensures that date values are stored and interpreted uniformly, eliminating inconsistencies that arise from mixed formats (e.g., MM/DD/YYYY vs. DD-MM-YY). This practice directly supports data consistency by enforcing a single representation, which is critical for accurate querying, reporting, and integration across systems.

Exam trap

CompTIA often tests the distinction between data quality dimensions (e.g., consistency vs. accuracy), leading candidates to confuse data profiling (which identifies duplicates) with a direct method for enforcing consistency.
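
Standardizing date formats (Option B) might be sketched as follows, assuming the source tables use only the three formats listed in `KNOWN_FORMATS`. That list is hypothetical. Parsing alone cannot resolve genuinely ambiguous strings such as 03/07/2023, so the format each source uses has to be known in advance.

```python
from datetime import datetime

# Hypothetical list of formats seen in the source tables.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%y")

def to_iso(value):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

standardized = [to_iso(v) for v in ("2023-03-07", "03/07/2023", "07-03-23")]
print(standardized)
```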

Full explanation →

358

MCQmedium

A data analyst creates a report and wants to ensure it tells a compelling story. Which element is most important for data storytelling?

A.Using only one chart type for consistency.

B.Avoiding any visual elements to keep focus on text.

C.Including all data without filtering.

D.Using a narrative arc with context.

AnswerD

Engages audience and makes findings memorable.

Why this answer

Option D is correct because data storytelling relies on a narrative arc—introducing context, building tension through data insights, and resolving with actionable conclusions—to engage the audience and make the data memorable. Without a narrative, even the most accurate data fails to drive understanding or decision-making. This aligns with the DA0-001 objective of communicating data insights effectively.

Exam trap

The trap here is that candidates often confuse 'data storytelling' with 'data presentation' and assume that including all data (Option C) is thorough, when in fact the exam emphasizes that a compelling story requires filtering and context to avoid overwhelming the audience.

How to eliminate wrong answers

Option A is wrong because using only one chart type ignores the fact that different data relationships (e.g., trends vs. distributions) require different visual encodings; forcing consistency sacrifices clarity. Option B is wrong because avoiding visual elements contradicts the principle that humans process visual information faster than text; data storytelling relies on charts, graphs, and annotations to highlight key patterns. Option C is wrong because including all data without filtering leads to cognitive overload and obscures the main message; effective storytelling requires selective inclusion based on the narrative's focus.

Full explanation →

359

Matchingmedium

Match each data analysis technique to its primary purpose.

Concepts

Matches

Model relationships between variables

Group similar data points without labels

Analyze data points collected over time

Compare means across multiple groups

Test association between categorical variables

Why these pairings

These are common statistical techniques for data analysis.

Full explanation →

360

MCQmedium

A data analyst needs to compare the salary distribution across five departments. Which visualization is most appropriate?

A.Line chart

B.Side-by-side box plot

C.Scatter plot

D.Stacked bar chart

AnswerB

Box plots display distribution statistics for each group.

Why this answer

A side-by-side box plot (option B) is the most appropriate visualization for comparing salary distributions across multiple departments because it displays the median, quartiles, and potential outliers for each group simultaneously. This allows the analyst to assess central tendency, spread, and skewness across all five departments in a single, compact chart.

Exam trap

The trap here is that candidates often confuse 'comparing distributions' with 'showing trends' or 'showing relationships,' leading them to incorrectly select a line chart or scatter plot instead of recognizing that a box plot is purpose-built for distribution comparison across groups.

How to eliminate wrong answers

Option A is wrong because a line chart is designed to show trends over a continuous interval (e.g., time series) and is not suitable for comparing distributions of categorical groups like departments. Option C is wrong because a scatter plot visualizes the relationship between two continuous variables, not the distribution of a single variable across categories. Option D is wrong because a stacked bar chart is used to show the composition of parts to a whole across categories, not the distribution (e.g., quartiles, outliers) of a continuous variable like salary.
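
The statistics a box plot draws for each group can be computed directly. The sketch below uses invented salary figures for two departments and Python's `statistics.quantiles`. Quartile conventions vary between tools, so exact Q1 and Q3 values may differ slightly from a charting library's output.

```python
import statistics

# Hypothetical salaries (in thousands) for two of the departments.
salaries = {
    "Engineering": [72, 85, 90, 95, 110, 130, 210],
    "Support": [38, 41, 44, 45, 47, 52, 60],
}

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

summaries = {dept: five_number_summary(vals) for dept, vals in salaries.items()}
for dept, summary in summaries.items():
    print(dept, summary)
```

Placing these summaries side by side is what lets a reader compare center, spread, and extremes across departments at a glance.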

Full explanation →

361

MCQhard

A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?

A.Recall (sensitivity)

B.F1-score

C.Precision

D.Accuracy

AnswerA

Recall measures how many actual churners were correctly found, directly addressing the focus.

Why this answer

Recall (sensitivity) is the most appropriate metric because it measures the proportion of actual churners correctly identified by the model. Since the dataset is imbalanced (only 10% churners) and the analyst's focus is on correctly identifying churners, recall directly addresses the cost of missing positive cases (false negatives). Accuracy would be misleading due to class imbalance, while precision and F1-score prioritize different trade-offs.

Exam trap

The trap here is that candidates often reach for accuracy as the default metric, failing to recognize that class imbalance renders accuracy misleading, and that the question's explicit focus on 'correctly identifying churners' points directly to recall, not precision or F1-score.

How to eliminate wrong answers

Option B (F1-score) is wrong because it balances precision and recall, but the analyst's primary goal is to maximize identification of churners, not to balance false positives and false negatives; F1-score would penalize a model that achieves high recall at the expense of precision, which may be acceptable in this scenario. Option C (Precision) is wrong because it measures the proportion of predicted churners that are actual churners, focusing on false positives rather than false negatives; the analyst wants to minimize missed churners, not necessarily avoid false alarms. Option D (Accuracy) is wrong because with only 10% churners, a naive model predicting all non-churners would achieve 90% accuracy, masking poor performance on the minority class; accuracy is inappropriate for imbalanced classification problems.
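
The arithmetic behind this can be worked through with a made-up test set of 100 customers, 10 of whom churn. Both sets of predictions below are invented for illustration: one model predicts "no churn" for everyone, the other flags 20 customers and catches 8 of the 10 churners.

```python
# Hypothetical test set: 100 customers, 10 of whom actually churned (label 1).
actual = [1] * 10 + [0] * 90

# A naive model that predicts "no churn" for everyone.
naive_pred = [0] * 100

# A model that flags 20 customers and catches 8 of the 10 churners.
model_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 78

def metrics(y_true, y_pred):
    """Compute accuracy, recall, and precision from paired labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}

naive = metrics(actual, naive_pred)
model = metrics(actual, model_pred)
print(naive)
print(model)
```

The naive model scores higher on accuracy (0.90 versus 0.86) while finding no churners at all, whereas recall separates the two models clearly (0.0 versus 0.8).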

Full explanation →

362

MCQeasy

A data analytics team has created a report for stakeholders. The report includes complex statistical terms and raw data tables. Stakeholders are confused and ask for clarification. Which of the following should the team do to improve communication?

A.Schedule a follow-up meeting to explain each term individually.

B.Provide a glossary of statistical terms and keep the report as is.

C.Remove all data and only give conclusions.

D.Simplify the report by using clear visualizations and plain language summaries.

AnswerD

Correct. This directly addresses the confusion by making the report accessible.

Why this answer

Option D is correct because effective data communication requires tailoring the message to the audience. By replacing complex statistical terms and raw data tables with clear visualizations and plain language summaries, the team makes insights accessible to stakeholders who may lack technical expertise, directly addressing the confusion.

Exam trap

The trap here is that candidates may think providing more explanation (Option A) or more data (Option B) is always better, but the DA0-001 exam emphasizes that communication must be tailored to the audience's level of understanding, not just the completeness of the information.

How to eliminate wrong answers

Option A is wrong because scheduling a follow-up meeting to explain each term individually is inefficient and does not improve the report itself; stakeholders should be able to understand the report without needing a separate tutorial. Option B is wrong because providing a glossary while keeping the report as is forces stakeholders to constantly cross-reference terms, which does not simplify the communication and still leaves raw data tables that are hard to interpret. Option C is wrong because removing all data and only giving conclusions removes the evidence and context needed for stakeholders to trust and verify the insights, which undermines transparency and data-driven decision-making.

Full explanation →

363

MCQmedium

A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?

A.Linear regression

B.Time series analysis

C.K-means clustering

D.Logistic regression

AnswerB

Time series analysis explicitly models seasonal patterns.

Why this answer

Time series analysis is specifically designed to model data points indexed in time order, making it ideal for capturing and forecasting seasonal patterns. Unlike regression models, it accounts for autocorrelation, trends, and seasonality components, which are critical for accurate sales prediction from historical data.

Exam trap

The trap here is that candidates see 'predict future sales' and mistakenly choose linear regression, overlooking that time series methods are required when data has temporal dependencies and seasonality.

How to eliminate wrong answers

Option A is wrong because linear regression assumes independence of observations and cannot model time-dependent structures like seasonality or autocorrelation. Option C is wrong because K-means clustering is an unsupervised learning method used for grouping similar data points, not for forecasting future values. Option D is wrong because logistic regression is used for binary classification problems, not for predicting continuous numeric sales figures.
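
As a rough illustration of using seasonality in a forecast, the sketch below applies a seasonal naive baseline, which predicts each future month as the value from the same month one year earlier. This is a deliberately simple stand-in and not the only time series method. The sales figures are invented.

```python
# Hypothetical monthly sales for two years, with a December peak each year.
sales = [
    100, 95, 105, 110, 115, 120, 125, 120, 115, 130, 160, 220,   # year 1
    110, 104, 116, 121, 127, 132, 138, 132, 127, 143, 176, 242,  # year 2
]
SEASON = 12  # months per seasonal cycle

def seasonal_naive(history, steps, season=SEASON):
    """Forecast each future period as the value one season earlier."""
    forecast = []
    for h in range(steps):
        forecast.append(history[len(history) - season + (h % season)])
    return forecast

next_year = seasonal_naive(sales, 12)
print(next_year)
```

The forecast keeps the December peak because it carries the seasonal pattern forward by month.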

Full explanation →

364

MCQmedium

The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?

A.Count total orders per customer regardless of date

B.Calculate average order count per customer for 2023

C.Find products with more than 5 orders in 2023

D.Identify customers who placed more than 5 orders in 2023

AnswerD

The query filters by 2023 date and having count > 5.

Why this answer

The query groups orders by customer_id and filters using a HAVING clause with COUNT(*) > 5, which counts the number of orders per customer. The WHERE clause restricts orders to those placed in 2023, so the result identifies customers who placed more than 5 orders in that year. This matches option D exactly.

Exam trap

CompTIA often tests the distinction between WHERE and HAVING, and the trap here is confusing a count of orders per customer with a count of products or an average, leading candidates to pick option B or C.

How to eliminate wrong answers

Option A is wrong because the WHERE clause filters for order_date in 2023, so the count is not regardless of date. Option B is wrong because the query counts orders per customer, not the average order count per customer. Option C is wrong because the query operates on an 'orders' table with no product-related column; it counts orders per customer, not products.
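
The exhibit's query is not reproduced here. The following is a hypothetical query that fits the description (a 2023 filter in WHERE, GROUP BY customer_id, HAVING COUNT(*) > 5), run against invented rows in SQLite. It is not the exhibit's actual text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)

rows = []
order_id = 1
# customer 1: 6 orders in 2023; customer 2: 5 orders in 2023;
# customer 3: 6 orders, but all in 2022.
for customer_id, year, n in [(1, 2023, 6), (2, 2023, 5), (3, 2022, 6)]:
    for i in range(n):
        rows.append((order_id, customer_id, f"{year}-0{i + 1}-15"))
        order_id += 1
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

QUERY = """
SELECT customer_id
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'  -- row filter, before grouping
GROUP BY customer_id
HAVING COUNT(*) > 5                                             -- group filter, after grouping
ORDER BY customer_id
"""
result = [r[0] for r in conn.execute(QUERY)]
print(result)
```

Customer 3 is removed by WHERE (wrong year) and customer 2 by HAVING (only 5 orders), which shows the two clauses filtering at different stages.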

Full explanation →

365

MCQmedium

An analyst is reviewing the above SQL query used to acquire data. What does this query retrieve?

A.Customers who placed more than 5 orders in 2023

B.All customers who placed at least 5 orders in 2023

C.The total number of orders per customer in 2023

D.Customers who placed exactly 5 orders in 2023

AnswerA

The HAVING clause filters for counts greater than 5.

Why this answer

The SQL query uses a HAVING clause with COUNT(*) > 5 to filter customers who placed more than 5 orders in 2023. The WHERE clause restricts records to the year 2023, and the GROUP BY customer_id aggregates orders per customer. The condition '> 5' explicitly excludes customers with exactly 5 or fewer orders, making option A correct.

Exam trap

The trap here is confusing the comparison operator '>' with '>=', leading candidates to mistakenly include customers with exactly 5 orders when the query explicitly excludes them.

How to eliminate wrong answers

Option B is wrong because 'at least 5 orders' would require the condition COUNT(*) >= 5, not > 5. Option C is wrong because the query returns customer IDs, not the total number of orders per customer; the COUNT is used only for filtering, not as a selected column. Option D is wrong because 'exactly 5 orders' would require COUNT(*) = 5, not > 5.
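
The effect of the comparison operator can be checked by running the same grouped query with three different HAVING conditions. The table contents and the `customers_having` helper are made up for this sketch, and the query shape is assumed from the explanation above rather than taken from the exhibit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
# Hypothetical data: customer 1 has 4 orders, customer 2 has 5, customer 3 has 6, all in 2023.
data = [(c, f"2023-01-{d:02d}") for c, n in [(1, 4), (2, 5), (3, 6)] for d in range(1, n + 1)]
conn.executemany("INSERT INTO orders (customer_id, order_date) VALUES (?, ?)", data)

def customers_having(condition):
    """Run the same grouped query with a different HAVING condition."""
    sql = (
        "SELECT customer_id FROM orders "
        "WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31' "
        f"GROUP BY customer_id HAVING {condition} ORDER BY customer_id"
    )
    return [row[0] for row in conn.execute(sql)]

more_than_5 = customers_having("COUNT(*) > 5")
at_least_5 = customers_having("COUNT(*) >= 5")
exactly_5 = customers_having("COUNT(*) = 5")
print(more_than_5, at_least_5, exactly_5)
```

Only the `>=` and `=` variants return the customer with exactly 5 orders.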

Full explanation →

366

MCQmedium

A data team needs to communicate insights about customer churn to the sales team. The insights include confidence intervals and p-values. The sales team is not familiar with statistics. Which of the following should the data team do?

A.Explain the practical implications of the results without statistical jargon.

B.Assume the sales team will learn the terms over time.

C.Use technical terms but provide written definitions.

D.Include a detailed statistical appendix.

AnswerA

Correct. This makes the insights accessible and actionable for the sales team.

Why this answer

Option A is correct because the sales team lacks statistical background, so presenting confidence intervals and p-values directly would cause confusion. The data team should translate these results into practical business implications, such as 'customers who have been inactive for 30 days are 40% more likely to churn', without using terms like p-value or confidence interval. This aligns with the DA0-001 objective of tailoring communication to the audience's expertise level.

Exam trap

The trap here is that candidates often choose Option C (providing definitions) thinking it balances accuracy and clarity, but the DA0-001 exam emphasizes audience adaptation—definitions still require the audience to learn technical terms, which is less effective than plain-language explanations.

How to eliminate wrong answers

Option B is wrong because assuming the sales team will learn statistical terms over time is unrealistic and risks misinterpretation of critical insights, leading to poor business decisions. Option C is wrong because providing written definitions of technical terms still forces the sales team to process unfamiliar jargon, which can slow understanding and reduce engagement. Option D is wrong because a detailed statistical appendix is excessive for a non-technical audience and may overwhelm them, defeating the purpose of clear communication.

Full explanation →

367

MCQhard

Refer to the exhibit. Which conclusion can be drawn from this data quality report?

A.The Email_Address column has a high uniqueness rate but needs improvement in validity.

B.The column is fully consistent but has low completeness.

C.The column has low validity and low uniqueness.

D.The column requires immediate action to improve completeness.

AnswerA

Uniqueness is 97%, but validity is only 85%, meaning some emails may be in invalid format.

Why this answer

Option A is correct because the data quality report shows that the Email_Address column has a high uniqueness rate (97%), indicating few duplicate entries, but a lower validity score (85%), meaning some entries fail format checks such as a missing '@' or domain. The column is largely unique but contains invalid data, so validity is the dimension that needs improvement.

Exam trap

CompTIA often tests the distinction between uniqueness and validity, trapping candidates who assume high uniqueness implies high quality, when in fact validity is a separate dimension that can be poor even with perfect uniqueness.

How to eliminate wrong answers

Option B is wrong because the weak dimension in the report is validity, not completeness, and a validity score of 85% does not support calling the column fully consistent. Option C is wrong because the report shows high uniqueness, so the claim of 'low uniqueness' is factually incorrect. Option D is wrong because completeness (the share of non-null values) appears high or acceptable; the issue is with validity, not missing data.
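
As a rough illustration of why uniqueness and validity are scored separately, the sketch below computes both for a small, made-up list of email values. The sample data, the simplified regular expression, and the function names are assumptions for illustration; they are not taken from the exhibit.

```python
import re

# Hypothetical sample values; the real exhibit data is not shown here.
emails = [
    "ana@example.com",
    "bob@example.com",
    "carol.example.com",  # missing '@'
    "dave@",              # missing domain
    "erin@example.org",
]

# Deliberately simple format check (an assumption, not a full email validator).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def uniqueness_rate(values):
    """Share of non-null values that are distinct."""
    present = [v for v in values if v]
    return len(set(present)) / len(present)


def validity_rate(values):
    """Share of non-null values that match the expected format."""
    present = [v for v in values if v]
    valid = [v for v in present if EMAIL_PATTERN.match(v)]
    return len(valid) / len(present)


print(f"uniqueness: {uniqueness_rate(emails):.0%}")
print(f"validity:   {validity_rate(emails):.0%}")
```

In this toy list every value is distinct, yet two of the five fail the format check, so uniqueness is high while validity is not.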

Full explanation →

368

MCQeasy

A marketing team wants to explore the relationship between advertising spend (in dollars) and resulting revenue. Which chart type is most suitable?

A.Line chart

B.Table

C.Pie chart

D.Scatter plot

AnswerD

A scatter plot displays the relationship between two numerical variables.

Why this answer

Scatter plots reveal correlation and distribution of two continuous variables.

Full explanation →

369

MCQhard

A data analyst at a retail company is building a dashboard for store managers to track sales performance. The data comes from three sources: point-of-sale (POS) systems, inventory, and customer loyalty. The POS table contains columns transaction_id, store_id, date, product_id, quantity, and price. The inventory table has product_id, store_id, stock_level, and reorder_point. The loyalty table has customer_id, transaction_id, and points_earned. The analyst creates a star schema with a sales_fact fact table containing all rows from POS, dimension tables for store, product, date, and customer. To calculate average transaction value, the analyst uses the formula SUM(quantity * price) / COUNT(*). Store managers report that the average transaction value appears too low, especially for stores with multiple registers. The analyst realizes that because each product sold in a transaction creates a separate row in sales_fact, a single transaction with multiple items contributes multiple rows. The current calculation divides by the number of rows rather than the number of distinct transactions. Which of the following is the best course of action to correct the average transaction value metric? (Choose one.)

A.Use the MEDIAN function instead of AVG

B.Aggregate the data at the transaction level before calculating the average

C.Use a different data model that denormalizes transaction totals into a new fact table

D.Create a calculated field that sums sales per transaction (quantity * price) and then averages across distinct transaction IDs

AnswerD

This correctly computes average transaction value by first summing per transaction.

Why this answer

Option D is correct because it calculates total sales per transaction (summing the product-level rows) and then averages across distinct transactions, fixing the over-counting issue. Option B is too vague and does not specify how to aggregate. Option C is not required since the star schema is appropriate.

Option A uses the median, which does not address the counting issue.
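
The row-versus-transaction issue in this question can be sketched as follows. The rows and values are hypothetical and only mimic the shape of sales_fact described above (one row per product sold).

```python
from collections import defaultdict

# Hypothetical line-item rows shaped like sales_fact: one row per product sold.
sales_fact = [
    {"transaction_id": "T1", "quantity": 2, "price": 5.00},
    {"transaction_id": "T1", "quantity": 1, "price": 10.00},
    {"transaction_id": "T2", "quantity": 1, "price": 30.00},
]

total_sales = sum(r["quantity"] * r["price"] for r in sales_fact)

# Flawed metric: divides by line-item rows, like SUM(quantity * price) / COUNT(*).
avg_per_row = total_sales / len(sales_fact)

# Corrected metric: sum each transaction first, then average across transactions.
per_transaction = defaultdict(float)
for r in sales_fact:
    per_transaction[r["transaction_id"]] += r["quantity"] * r["price"]
avg_per_transaction = sum(per_transaction.values()) / len(per_transaction)

print(f"per row:         {avg_per_row:.2f}")
print(f"per transaction: {avg_per_transaction:.2f}")
```

With three rows but only two transactions, the row-based figure comes out lower than the transaction-based one. In SQL, a comparable fix would be to divide by COUNT(DISTINCT transaction_id) instead of COUNT(*).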

Full explanation →

370

MCQmedium

Refer to the exhibit. Which type of data is the field "region"?

A.Qualitative

B.Continuous

C.Quantitative

D.Discrete

AnswerA

Correct. Region is a descriptive category.

Why this answer

The field 'region' contains categorical labels (e.g., 'North', 'South', 'East', 'West') that represent distinct groups or categories, not numerical measurements. Qualitative data (also called categorical data) describes attributes or characteristics that can be named, and in the ordinal case ranked, but not measured on a numeric scale. Since 'region' assigns a name to a geographic area without any inherent numeric value or order, it is a classic example of nominal qualitative data.

Exam trap

The trap here is that candidates may confuse 'region' with a numeric code (e.g., region ID 1, 2, 3) and incorrectly classify it as discrete quantitative data, but the field 'region' as shown contains text labels, making it qualitative.

How to eliminate wrong answers

Option B is wrong because continuous data represents measurements that can take any value within a range (e.g., temperature, time), but 'region' consists of discrete labels with no numeric continuum. Option C is wrong because quantitative data involves numerical values that can be counted or measured (e.g., sales amount, age), whereas 'region' is a non-numeric category. Option D is wrong because discrete data is a subset of quantitative data that takes countable integer values (e.g., number of customers), but 'region' is not numeric at all.

Full explanation →

371

Multi-Selectmedium

Which TWO of the following are examples of semi-structured data?

Select 2 answers

A.XML document

B.JSON object

C.Relational table

D.Plain text file

E.CSV file

AnswersA, B

XML and JSON organize data with tags or keys but keep a flexible schema, which makes them semi-structured.

Why this answer

XML and JSON have tags/keys but no rigid schema, making them semi-structured. CSV is structured, relational tables are structured, plain text is unstructured.
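
A small sketch of what "keys but no rigid schema" can look like in practice. The two records below are invented for illustration: both are valid JSON, yet they do not share the same set of fields.

```python
import json

# Hypothetical records: valid JSON, but without a fixed set of fields.
raw = """
[
  {"id": 1, "name": "Ana", "phones": ["555-0100", "555-0101"]},
  {"id": 2, "name": "Bob", "address": {"city": "Austin"}}
]
"""

records = json.loads(raw)

# Keys label each value (structure), but the key set varies per record.
for record in records:
    print(sorted(record.keys()))

shared_keys = set(records[0]) & set(records[1])
all_keys = set(records[0]) | set(records[1])
print("shared:", sorted(shared_keys))
print("all:   ", sorted(all_keys))
```

A relational table would force both rows into the same columns, whereas here each record carries only the fields it needs.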

Full explanation →

372

MCQmedium

A data analyst creates a scatter plot showing the relationship between advertising spend and revenue. The plot shows a strong positive correlation. Which of the following should the analyst include in the report to ensure accurate communication?

A.Include a note that correlation does not imply causation.

B.Replace the scatter plot with a bar chart.

C.Remove any outliers from the plot.

D.Add a trend line to the scatter plot.

AnswerA

This prevents misinterpretation of the relationship.

Why this answer

Option A is correct because correlation does not imply causation, and stating this caveat keeps readers from concluding that advertising spend alone drives revenue. Option B is wrong because the scatter plot already shows the relationship. Option C is wrong because removing points could bias the analysis.

Option D is wrong because a trend line is not necessary for every scatter plot.
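
To make the caveat concrete, here is a minimal sketch that computes a Pearson correlation coefficient by hand. The figures are invented, and pearson_r is a hypothetical helper written for this example. The coefficient measures how strongly two variables move together; it says nothing about which one, if either, causes the other.

```python
from math import sqrt

# Hypothetical monthly figures, invented for illustration.
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 41, 62, 79, 105]


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(ad_spend, revenue)
print(f"r = {r:.3f}")  # strength of the linear association only
```

A value close to 1 would support the phrase "strong positive correlation" in a report, but a third factor such as seasonality could still be driving both series.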

Full explanation →

373

MCQeasy

A hospital wants to analyze patient readmission rates. The data contains daily patient visits. What is the level of granularity?

A.Patient

B.Visit

C.Day

D.Hospital

AnswerB

Correct. Each record captures one visit.

Why this answer

The level of granularity refers to the finest detail captured in the dataset. Since the data contains daily patient visits, each record represents a single visit event, not the patient or the day itself. Therefore, 'Visit' is the correct granularity because each row corresponds to one visit occurrence.

Exam trap

The trap here is confusing the subject of analysis (patient readmission rates) with the actual data granularity (each row is a visit), leading candidates to incorrectly select 'Patient' instead of 'Visit'.

How to eliminate wrong answers

Option A is wrong because 'Patient' would be the granularity if the data summarized all visits per patient (e.g., one row per patient with aggregated readmission counts), but here each visit is a separate record. Option C is wrong because 'Day' would be the granularity if the data aggregated all visits per day (e.g., total visits per day), but the data contains individual visit records, not daily summaries. Option D is wrong because 'Hospital' would be the granularity if the data aggregated across the entire hospital (e.g., total readmission rate for the hospital), but the data is at the individual visit level.
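
The difference between grains can be sketched with a few made-up visit records. The field names and values are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical visit-level records: one row per visit (the grain in the question).
visits = [
    {"visit_id": 1, "patient_id": "P1", "date": "2024-01-01"},
    {"visit_id": 2, "patient_id": "P2", "date": "2024-01-01"},
    {"visit_id": 3, "patient_id": "P1", "date": "2024-01-15"},
]

# Rolling up to a coarser grain loses the individual visit detail.
visits_per_patient = Counter(v["patient_id"] for v in visits)  # patient grain
visits_per_day = Counter(v["date"] for v in visits)            # day grain

print("visit rows:  ", len(visits))
print("patient rows:", len(visits_per_patient))
print("day rows:    ", len(visits_per_day))
```

The visit-level data can always be aggregated up to patient or day level, but the reverse is not possible, which is why the finest recorded level defines the granularity.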

Full explanation →

374

MCQmedium

A retail company analyzes customer purchase data to improve inventory management. They store daily transaction records in a relational database and monthly aggregate reports in a data warehouse. Which difference between these storage methods best explains why the warehouse is more suitable for trend analysis?

A.The database uses a star schema while the warehouse uses a normalized schema.

B.The database enforces ACID transactions, while the warehouse uses eventual consistency.

C.The database is optimized for write-heavy OLTP, while the warehouse is optimized for read-heavy OLAP.

D.The database stores only current data, while the warehouse stores historical data.

AnswerC

Correct: OLTP supports many writes; OLAP supports complex reads.

Why this answer

Option C is correct because OLTP databases are optimized for high-frequency write operations (INSERT/UPDATE/DELETE) and ACID compliance, making them ideal for transaction processing but poor for complex analytical queries. In contrast, a data warehouse is optimized for read-heavy OLAP workloads, using columnar storage, pre-aggregated tables, and indexing strategies that enable fast aggregation and trend analysis over large historical datasets. This architectural difference directly supports the retail company's need to analyze purchase trends over time.

Exam trap

CompTIA often tests the misconception that 'data warehouses only store historical data' (Option D) as the primary reason for trend analysis suitability, but the real differentiator is the workload optimization (OLTP vs. OLAP), not merely the presence of history.

How to eliminate wrong answers

Option A is wrong because a star schema (with fact and dimension tables) is actually typical of data warehouses for analytical queries, while OLTP databases usually use normalized schemas to reduce redundancy and maintain data integrity. Option B is wrong because data warehouses often support ACID or snapshot isolation for consistency, and eventual consistency is more characteristic of NoSQL systems, not traditional data warehouses. Option D is wrong because relational databases can store historical data as well; the key difference is not the presence of history but the optimization for read-heavy analytical queries versus write-heavy transactional processing.

Full explanation →

375

MCQeasy

A marketing team conducted a customer satisfaction survey for five different departments (Sales, Support, Billing, Shipping, Returns). The survey asked customers to rate their satisfaction on a scale of 1 (Very Dissatisfied) to 5 (Very Satisfied). The data is ordinal and the team wants to visualize the distribution of responses for each department to quickly see which department has the most 'Very Satisfied' customers and which has the most 'Very Dissatisfied'. They also want to compare the spread of responses across departments. Which chart type should they use?

A.Stacked bar chart with departments on x-axis and counts of each rating stacked

B.Line chart with departments on x-axis and average rating on y-axis

C.Box plot for each department

D.Scatter plot with department as category and satisfaction score as value

AnswerA

Stacked bars show the full distribution of ordinal responses, highlighting proportions of top and bottom ratings.

Why this answer

Option A is correct because a stacked bar chart can show the proportion of each rating for each department, making it easy to see satisfaction levels. Option C is wrong because a box plot is designed for continuous data and does not show the count of each rating. Option B is wrong because a line chart is for trends.

Option D is wrong because a scatter plot is for relationships.
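
A stacked bar chart is built from a department-by-rating count table. The sketch below produces that table from hypothetical survey responses; the data and variable names are assumptions, and only three of the five departments are included to keep it short.

```python
from collections import Counter

# Hypothetical (department, rating) survey responses.
responses = [
    ("Sales", 5), ("Sales", 4), ("Sales", 5),
    ("Support", 1), ("Support", 2), ("Support", 5),
    ("Billing", 3), ("Billing", 1), ("Billing", 1),
]

counts = Counter(responses)
departments = sorted({dept for dept, _ in responses})
ratings = range(1, 6)

# One row per department, one column per rating: each row becomes one stacked bar.
table = {d: [counts[(d, r)] for r in ratings] for d in departments}

for dept, row in table.items():
    print(f"{dept:<8} {row}")

most_very_satisfied = max(departments, key=lambda d: counts[(d, 5)])
most_very_dissatisfied = max(departments, key=lambda d: counts[(d, 1)])
print(most_very_satisfied, most_very_dissatisfied)
```

Each list holds the counts for ratings 1 through 5, so the first and last segments of every bar correspond to 'Very Dissatisfied' and 'Very Satisfied'.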

Full explanation →
