CompTIA Data+ (DA0-001): Questions 76–150

509 questions total · 7 pages · All types, answers revealed

76

MCQ · medium

During a data presentation, an audience member questions the accuracy of the data shown. Which of the following is the best way for the analyst to respond?

A. Provide documentation of data sources and transformation steps

B. Change the topic

C. Offer to send the raw data later

D. Dismiss the question and continue

Answer: A

Documentation validates data accuracy and shows integrity.

Why this answer

Providing documentation of data sources and transformation steps directly addresses the audience member's concern about accuracy by demonstrating transparency and traceability. This approach aligns with best practices in data governance, as it allows the audience to verify the data lineage and any ETL processes that may have introduced errors. It also builds trust by showing the analyst has a clear understanding of the data pipeline.

Exam trap

The trap here is that candidates may choose Option C, thinking that providing raw data is sufficient, but they overlook that raw data without transformation documentation does not prove accuracy and may even raise more questions about how the data was prepared.

How to eliminate wrong answers

Option B is wrong because changing the topic avoids the question entirely, which undermines the credibility of the analyst and fails to address the legitimate concern about data accuracy. Option C is wrong because offering to send raw data later delays the response and does not provide immediate clarification; raw data alone may also be insufficient without context on how it was processed. Option D is wrong because dismissing the question and continuing is dismissive and unprofessional, likely eroding audience trust and suggesting the analyst cannot defend the data's integrity.

77

MCQ · medium

Refer to the exhibit. A data analyst is reviewing the job log. Which of the following best explains the reduction in record count?

A. The null region exclusion caused the reduction.

B. The job applied a filter to only include top regions.

C. The job aggregated data by region and date, reducing granularity.

D. The job failed to process half the data.

Answer: C

Correct. Aggregation collapses many rows into summary rows.

Why this answer

Option C is correct because the job aggregated data by region and date, reducing granularity from transactional to summary level. Option A is incorrect because the null-region exclusion removed only 50 records, which is minor. Option B is incorrect because no filter is mentioned in the log. Option D is incorrect because the job completed successfully.
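The effect can be sketched in a few lines of Python. The rows, region names, and amounts below are hypothetical stand-ins, since the exhibit's job log is not reproduced here; the point is only that a null exclusion removes a handful of rows, while grouping by region and date collapses many rows into one per group.

```python
from collections import defaultdict

# Hypothetical transaction-level rows: (region, date, amount).
transactions = [
    ("East", "2024-01-01", 120.0),
    ("East", "2024-01-01", 80.0),
    ("East", "2024-01-02", 50.0),
    ("West", "2024-01-01", 200.0),
    ("West", "2024-01-01", 25.0),
    (None, "2024-01-01", 10.0),  # null region, excluded before aggregation
]

# Step 1: drop rows with a null region (a small reduction).
valid = [row for row in transactions if row[0] is not None]

# Step 2: aggregate by (region, date); each group becomes one summary row.
totals = defaultdict(float)
for region, date, amount in valid:
    totals[(region, date)] += amount

print(len(transactions), len(valid), len(totals))
```

Most of the drop in row count comes from step 2, not step 1, which mirrors the reasoning behind the answer.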

78

Multi-Select · medium

Which TWO actions are best practices for creating effective data visualizations?

Select 2 answers

A. Avoid using more than five slices in a pie chart

B. Maximize data-ink ratio by removing all whitespace

C. Always include gridlines with high contrast

D. Use 3D effects to make charts look professional

E. Use color to represent data values consistently

Answers: A, E

Too many slices make pie charts unreadable.

Why this answer

Option A is correct because pie charts with more than five slices become cluttered and difficult to read, making it hard for viewers to compare proportions accurately. Limiting slices to five or fewer ensures the chart remains clear and effectively communicates the relative sizes of categories. This best practice aligns with data visualization principles that prioritize clarity and cognitive ease.

Exam trap

CompTIA often tests the misconception that maximizing data-ink ratio means eliminating all whitespace, when in fact whitespace is a critical design element for readability and should be preserved judiciously.

79

MCQ · easy

A data analyst runs the Python code shown. What is the result of executing this code?

A. It reads the data, adds a calculated column, and shows the first 5 rows

B. It throws an error because 'total' column already exists

C. It reads the data and displays all rows

D. It reads the data and displays summary statistics

Answer: A

The code reads the file, adds the 'total' column, and previews the first five rows.

Why this answer

The code reads a CSV file into a pandas DataFrame, then creates a new column 'total' by summing columns 'col1' and 'col2'. Finally, `head()` returns the first 5 rows. Option A correctly describes this sequence of operations.

Exam trap

The trap here is that candidates may think `head()` shows all rows or that adding a column with an existing name throws an error, but pandas silently overwrites the column.

How to eliminate wrong answers

Option B is wrong because pandas does not raise an error when a column is assigned under an existing name; if 'total' already existed, the assignment would silently overwrite it. Option C is wrong because `head()` without an argument defaults to 5 rows, not all rows. Option D is wrong because `head()` displays rows, not summary statistics (which would require `.describe()`).
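The exhibit's code is not reproduced here, so the following is only a sketch of the sequence the explanation describes. The column names 'col1', 'col2', and 'total' come from the explanation; the inline CSV text and its values are hypothetical, and an in-memory buffer stands in for the file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content with seven data rows.
csv_text = "col1,col2\n1,10\n2,20\n3,30\n4,40\n5,50\n6,60\n7,70\n"

df = pd.read_csv(io.StringIO(csv_text))

# Add a calculated column.
df["total"] = df["col1"] + df["col2"]

# Assigning to the same name again overwrites the column; no error is raised.
df["total"] = df["col1"] + df["col2"]

# head() with no argument returns the first 5 rows, not all 7.
preview = df.head()
print(preview)
```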

80

MCQ · easy

A data analyst is cleaning a dataset and finds missing values in a categorical variable representing customer region. Which imputation method is most appropriate?

A. Drop rows with missing values

B. Mode imputation

C. Mean imputation

D. Median imputation

Answer: B

Mode is appropriate for categorical variables.

Why this answer

Mode imputation is the most appropriate method for a categorical variable because it replaces missing values with the most frequently occurring category, preserving the distribution of the data. Unlike mean or median imputation, which are designed for numerical data, mode imputation maintains the categorical nature of the variable and avoids introducing invalid values. This approach is simple and effective when missing data is random and the category is well-represented.

Exam trap

The trap here is that candidates often confuse imputation methods across data types, incorrectly applying mean or median imputation to categorical variables because they focus on central tendency without considering data type appropriateness.

How to eliminate wrong answers

Option A is wrong because dropping rows with missing values can lead to significant data loss and potential bias, especially if the missingness is not completely random, reducing the dataset's representativeness. Option C is wrong because mean imputation is only appropriate for numerical data, not categorical variables, as calculating the mean of categories is meaningless and would produce non-categorical values. Option D is wrong because median imputation is also designed for numerical data and cannot be applied to categorical variables, as the median requires ordered numerical values to compute.
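A minimal sketch of mode imputation on a categorical column, using only the standard library. The region values are hypothetical, and `None` is assumed to mark a missing entry.

```python
from collections import Counter

# Hypothetical customer-region column with two missing values.
regions = ["North", "South", None, "North", "West", None, "North"]

# Find the most frequent category among the observed values.
observed = [r for r in regions if r is not None]
mode_region = Counter(observed).most_common(1)[0][0]

# Replace each missing value with that category.
imputed = [r if r is not None else mode_region for r in regions]

print(mode_region)
print(imputed)
```

A mean or median has no meaning for labels like these, which is why the mode is the only central-tendency measure that applies.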

81

Multi-Select · medium

A data analyst is comparing characteristics of structured and unstructured data. Which TWO of the following are characteristics of structured data? (Choose two.)

Select 2 answers

A. Data is typically stored as raw text

B. Data lacks a fixed format

C. Data is stored in predefined schemas

D. Data often requires NoSQL databases for storage

E. Data can be easily queried using SQL

Answers: C, E

Structured data follows a predefined schema, such as tables in a relational database.

Why this answer

Structured data is organized into predefined schemas, such as tables with rows and columns, which enforce a consistent data format and relationships. This rigid structure allows structured data to be easily queried using SQL, as SQL is designed to operate on relational database management systems (RDBMS) that rely on these schemas. Option C is correct because a predefined schema is a defining characteristic of structured data, enabling efficient storage, retrieval, and integrity constraints.

Exam trap

The trap here is that candidates often confuse 'lack of fixed format' (unstructured) with 'flexibility in storage' (NoSQL), leading them to select options B or D, which describe unstructured or semi-structured data, not structured data.

82

MCQ · medium

Refer to the exhibit. A data analyst is troubleshooting a failed dashboard refresh. The error log shows repeated SQL syntax errors. Which of the following is the most likely cause?

A. The database server is offline.

B. The query contains a syntax mistake.

C. The user does not have permissions to access the table.

D. The network connection timed out.

Answer: B

ORA-00933 is a SQL syntax error, indicating the query is not properly formed.

Why this answer

The error log explicitly states 'repeated SQL syntax errors,' which directly indicates that the SQL query being executed is malformed. A syntax mistake in the query (e.g., missing keyword, incorrect clause order, or mismatched parentheses) will cause the database to reject the statement before any execution begins, leading to the exact error described.

Exam trap

CompTIA often tests the distinction between error types (syntax vs. runtime vs. connectivity) to see if candidates can map the exact error message to its root cause, rather than guessing based on general troubleshooting assumptions.

How to eliminate wrong answers

Option A is wrong because if the database server were offline, the error would be a connection timeout or 'cannot connect to server' message, not a SQL syntax error. Option C is wrong because a permissions issue would produce an 'access denied' or 'permission denied' error, not a syntax error. Option D is wrong because a network timeout would result in a timeout or connection reset error, not a SQL syntax error.
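Mapping an error message to its root cause can be illustrated with Python's built-in `sqlite3` module. This is a stand-in only: the exhibit references an Oracle error code, and SQLite's messages are worded differently. The table name and queries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")


def run(query):
    """Run a query and return either 'ok' or the database's error text."""
    try:
        conn.execute(query)
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)


# Misspelled keyword: the statement is rejected at parse time.
syntax_result = run("SELECT region FORM sales")

# Well-formed statement, but the table does not exist: a different error.
missing_result = run("SELECT * FROM missing_table")

print(syntax_result)
print(missing_result)
```

The two failures produce different messages, so reading the message text, not guessing, identifies the cause.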

83

MCQ · easy

A data analyst needs to visualize the relationship between two continuous variables: advertising spend (in dollars) and monthly sales (in units). Which chart type is most appropriate?

A. Scatter plot

B. Pie chart

C. Bar chart

D. Line chart

Answer: A

Scatter plots are used to display the relationship between two continuous variables.

Why this answer

A scatter plot is the most appropriate chart for visualizing the relationship between two continuous variables, such as advertising spend (dollars) and monthly sales (units). It displays individual data points on an x-y axis, allowing the analyst to observe correlation, clustering, or outliers between the two numeric fields. This aligns with the DA0-001 objective of selecting visualizations based on data type and analytical goal.

Exam trap

The trap here is that candidates often choose a line chart (Option D) because they mistakenly think 'relationship' implies a trend over time, but line charts require a sequential or time-based x-axis, not two independent continuous variables.

How to eliminate wrong answers

Option B (Pie chart) is wrong because pie charts are designed to show proportions of a whole for categorical data, not the relationship between two continuous variables. Option C (Bar chart) is wrong because bar charts compare discrete categories or aggregated values, not the direct correlation between two continuous numeric fields. Option D (Line chart) is wrong because line charts are typically used to show trends over time or ordered sequences, not the general relationship between two independent continuous variables.

84

Matching · medium

Match each data visualization type to its best use case.

Concepts

Matches

Compare quantities across categories

Show relationship between two numeric variables

Display distribution of a single continuous variable

Show magnitude of values across two dimensions

Summarize distribution and identify outliers

Why these pairings

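Each description maps to a standard chart type: a bar chart compares quantities across categories, a scatter plot shows the relationship between two numeric variables, a histogram displays the distribution of a single continuous variable, a heatmap shows the magnitude of values across two dimensions, and a box plot summarizes a distribution and identifies outliers.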

85

MCQ · medium

Refer to the exhibit. A data analyst receives this error when running a data load script. What is the most likely cause?

A. The email field is too long

B. The database connection is lost

C. The customer_id 12345 already exists in the table

D. The name field is null

Answer: C

The error indicates duplicate primary key value.

Why this answer

Option C is correct. The error message explicitly states 'Duplicate entry '12345' for key 'PRIMARY'', meaning a record with customer_id 12345 already exists in the table. Option A (email too long) would give a different error. Option B (connection lost) would also give a different error. Option D (null name) would not cause a primary key violation.
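A duplicate primary key can be reproduced with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `customers` table layout and the sample names are hypothetical, and SQLite words its message differently from the error shown in the exhibit, but the constraint being violated is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.execute("INSERT INTO customers VALUES (12345, 'Ada', 'ada@example.com')")

# Loading a second row with the same customer_id violates the primary key.
try:
    conn.execute("INSERT INTO customers VALUES (12345, 'Bob', 'bob@example.com')")
    outcome = "inserted"
except sqlite3.IntegrityError as exc:
    outcome = f"rejected: {exc}"

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(outcome)
print(row_count)
```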

86

MCQ · easy

A data analyst is extracting data from a relational database using SQL. Which clause is essential for limiting the rows retrieved to only those needed?

A. GROUP BY

B. ORDER BY

C. WHERE

D. HAVING

Answer: C

Filters rows based on conditions.

Why this answer

Option C is correct because the WHERE clause filters rows based on conditions before they are returned. Option A is wrong because GROUP BY groups rows for aggregation. Option B is wrong because ORDER BY sorts results. Option D is wrong because HAVING filters groups after aggregation.
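The difference between WHERE and HAVING can be sketched with Python's built-in `sqlite3` module. The `orders` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 100.0), ("East", 300.0), ("West", 50.0), ("West", 75.0), ("North", 500.0)],
)

# WHERE filters individual rows before any grouping happens.
where_rows = conn.execute(
    "SELECT region, amount FROM orders WHERE amount >= 100 ORDER BY amount"
).fetchall()

# HAVING filters whole groups after aggregation.
having_rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "GROUP BY region HAVING SUM(amount) >= 200 ORDER BY region"
).fetchall()

print(where_rows)
print(having_rows)
```

Only the WHERE query limits which source rows are retrieved; the HAVING query still reads every row in order to compute the group totals.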

87

MCQ · hard

A financial institution is merging transaction data from two different systems. System A stores currency amounts as integers in cents, and System B stores as decimals in dollars. What is the best way to integrate the data?

A. Convert System A amounts to dollars by dividing by 100.

B. Keep both as is and use a transformation layer.

C. Store all amounts as strings to preserve precision.

D. Convert System B amounts to cents by multiplying by 100.

Answer: A

This standardizes all amounts to dollar decimal format.

Why this answer

Option A is correct because converting System A's integer cents to dollars by dividing by 100 ensures both datasets share a consistent unit (dollars) and numeric data type (decimal). This direct transformation eliminates ambiguity in aggregation and reporting, as financial calculations require uniform precision and scale. Using a transformation layer or storing as strings would introduce unnecessary complexity or risk of rounding errors.

Exam trap

The trap here is that candidates may assume keeping both formats (Option B) is simpler or that converting to cents (Option D) is safer, but they overlook the critical requirement for a single, consistent unit to enable direct arithmetic and avoid precision loss in financial data integration.

How to eliminate wrong answers

Option B is wrong because keeping both formats as-is forces every downstream query or application to repeatedly apply conversion logic, increasing complexity, maintenance overhead, and the risk of inconsistent results. Option C is wrong because storing currency amounts as strings prevents arithmetic operations (e.g., SUM, AVG) without explicit casting, degrades query performance, and can lead to sorting or comparison errors due to lexical ordering. Option D is wrong because converting System B's decimal dollars to cents by multiplying by 100 would lose fractional cent precision (e.g., $1.234 becomes 123 cents, truncating 0.4 cents), which is unacceptable for financial data integrity.
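The conversion in Option A can be sketched with Python's `decimal` module, which avoids binary floating-point rounding. The sample amounts are hypothetical, and System B's values are assumed to arrive as decimal strings.

```python
from decimal import Decimal

# Hypothetical amounts: System A in integer cents, System B in decimal dollars.
system_a_cents = [1999, 250, 5]
system_b_dollars = ["19.99", "2.50", "0.05"]

# Convert System A to dollars by dividing by 100, keeping exact decimal values.
a_as_dollars = [Decimal(cents) / Decimal(100) for cents in system_a_cents]
b_as_dollars = [Decimal(amount) for amount in system_b_dollars]

# With one unit and one numeric type, the two sources can be combined directly.
combined_total = sum(a_as_dollars + b_as_dollars, Decimal("0"))

print(a_as_dollars == b_as_dollars)
print(combined_total)
```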

88

MCQ · easy

A data engineer is configuring access to a data lake in Amazon S3. What does the JSON policy shown allow?

A. Change user permissions

B. Read objects from the bucket

C. Delete objects from the bucket

D. Write objects to the bucket

Answer: B

GetObject allows reading.

Why this answer

The JSON policy shown grants the `s3:GetObject` action, which allows reading objects from the specified S3 bucket. This is a standard AWS IAM policy that explicitly permits the `GetObject` API call, enabling the data engineer to retrieve objects from the data lake.

Exam trap

The trap here is that candidates may confuse `s3:GetObject` with broader permissions like `s3:PutObject` or `s3:DeleteObject`, or assume that any S3 policy allows all actions, when in fact each action must be explicitly listed.

How to eliminate wrong answers

Option A is wrong because changing user permissions requires IAM actions such as `iam:AttachUserPolicy` or `iam:PutUserPolicy`, which are not included in this S3-specific policy. Option C is wrong because deleting objects requires the `s3:DeleteObject` action, which is not listed in the policy. Option D is wrong because writing objects requires the `s3:PutObject` action, which is also absent from the policy.
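The exhibit's policy is not reproduced here, so the document below is an assumed example of a read-only policy; the bucket name `example-data-lake` is hypothetical. The helper is a deliberately simplified reading of the policy that ignores wildcards, Deny statements, and conditions.

```python
import json

# Assumed example of a read-only S3 policy (bucket name is hypothetical).
policy_json = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-data-lake/*"
    }
  ]
}
"""

policy = json.loads(policy_json)


def allowed_actions(policy_doc):
    """Collect actions from Allow statements (simplified: no wildcards or Deny)."""
    actions = set()
    for statement in policy_doc["Statement"]:
        if statement["Effect"] != "Allow":
            continue
        action = statement["Action"]
        actions.update([action] if isinstance(action, str) else action)
    return actions


granted = allowed_actions(policy)
print(granted)
print("s3:PutObject" in granted, "s3:DeleteObject" in granted)
```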

89

MCQ · hard

A social media monitoring company collects public tweets using the Twitter API. The API has a tiered access: free tier allows 500,000 tweets per month, and paid tier allows 2 million tweets per month. The company needs to collect 1.5 million tweets per month for analysis. They are on a free tier but have been exceeding the limit, causing account suspension. They need a sustainable solution without significantly increasing costs. What should they do?

A. Request an academic research exemption

B. Reduce the collection to exactly 500,000 tweets per month by sampling

C. Use multiple developer accounts to stay within free limits

D. Upgrade to the paid tier

Answer: C

Multiple accounts can split the load, staying within free limits and avoiding costs.

Why this answer

Using multiple developer accounts to distribute the collection load can allow access to more tweets while staying within each account's free limit. This avoids the cost of upgrading to a paid tier. Reducing collection to 500,000 tweets would cause loss of critical data. Requesting an academic exemption is unlikely to succeed because the company is commercial. Upgrading to the paid tier increases costs significantly. In practice, check the platform's developer terms before taking this approach, because many API providers prohibit using multiple accounts to get around rate limits.

90

MCQ · medium

A data scientist is building a model to predict customer churn. The company's internal CRM system provides customer demographics and transaction history. They also purchase demographic data from a third-party vendor. How should the purchased data be classified?

A. Secondary data

B. Internal data

C. Structured data

D. Primary data

Answer: A

Correct. Secondary data is collected by another entity and reused.

Why this answer

Purchased demographic data from a third-party vendor is classified as secondary data because it was originally collected by another entity for a different purpose and is being reused by the data scientist for churn prediction. Secondary data contrasts with primary data, which is collected firsthand for the specific analysis at hand. This classification is independent of whether the data is structured or unstructured.

Exam trap

The trap here is that candidates confuse 'secondary data' with 'structured data' because purchased data is often delivered in a structured format like CSV, but the classification is based on data origin and collection purpose, not its structure.

How to eliminate wrong answers

Option B (Internal data) is wrong because the purchased data originates from an external vendor, not from the company's own CRM or internal systems. Option C (Structured data) is wrong because the classification of data as primary or secondary is about its origin and collection purpose, not its format; purchased data could be structured or unstructured. Option D (Primary data) is wrong because primary data is collected directly by the researcher for the specific study, whereas this data was pre-existing and collected by a third party.

91

MCQ · medium

A stakeholder asks for the exact number of customers who churned last month. Which metric should the analyst report?

A. Churn trend

B. Churn rate percentage

C. Count of churned customers

D. Churn probability

Answer: C

This directly gives the exact number requested.

Why this answer

The stakeholder explicitly asks for the 'exact number' of customers who churned, which is a discrete count. Option C, 'Count of churned customers,' directly provides this integer value without any normalization or ratio. The analyst should report the raw metric that matches the request's specificity.

Exam trap

The trap here is that candidates often confuse 'churn rate percentage' (a relative metric) with the 'exact count' (an absolute metric), assuming the stakeholder wants the rate when they explicitly ask for the number.

How to eliminate wrong answers

Option A is wrong because a 'churn trend' shows the direction or pattern over time (e.g., increasing or decreasing), not a single exact number. Option B is wrong because 'churn rate percentage' is a ratio (churned customers divided by total customers), which normalizes the count and does not give the exact number requested. Option D is wrong because 'churn probability' is a predictive model output (e.g., a score between 0 and 1) indicating likelihood of future churn, not a historical count of past churned customers.
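The distinction between the absolute count and the relative rate can be shown with a small calculation. The customer IDs and the starting customer total are hypothetical.

```python
# Hypothetical data for last month.
customers_at_start = 200
churned_ids = {"C017", "C104", "C133", "C190"}

# Absolute metric: the exact number the stakeholder asked for.
churned_count = len(churned_ids)

# Relative metric: normalizes the count by the customer base.
churn_rate_pct = churned_count / customers_at_start * 100

print(churned_count)
print(churn_rate_pct)
```

The rate cannot be turned back into the count without also knowing the size of the customer base, which is why it does not answer the stakeholder's question.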

92

Multi-Select · hard

Which THREE of the following are common mistakes when creating data visualizations? (Choose 3.)

Select 3 answers

A. Choosing the correct chart type for the data

B. Using a 3D pie chart

C. Labeling axes clearly

D. Using a non-zero baseline for bar charts

E. Truncating the y-axis to exaggerate differences

Answers: B, D, E

3D distorts angles and makes comparison harder.

Why this answer

Option B is correct because 3D pie charts distort the perception of proportions by adding a false depth dimension, making it difficult for viewers to accurately compare slice sizes. This violates the principle of data-ink ratio and is widely discouraged in data visualization best practices.

Exam trap

CompTIA often tests the misconception that adding visual flair (like 3D effects) improves a chart, when in reality it reduces accuracy; the trap here is that candidates may think 3D pie charts are acceptable because they look 'professional' or 'modern'.

93

Multi-Select · easy

Which TWO color choices are appropriate for a categorical data visualization? (Select two.)

Select 2 answers

A. Distinct hues

B. Sequential color scheme

C. Monochrome

D. Rainbow gradient

E. Colorblind-friendly palette

Answers: A, E

Distinct hues clearly separate categories.

Why this answer

Distinct hues (A) are appropriate for categorical data because they use different colors to represent distinct categories without implying any order or magnitude. This aligns with best practices in data visualization where categorical variables require qualitative color schemes that maximize perceptual separation between groups.

Exam trap

CompTIA often tests the misconception that any color scheme can be used for any data type, but the trap here is confusing sequential or rainbow schemes (which imply order) with categorical data that requires distinct, unordered hues.

94

MCQ · easy

A business user asks a data analyst to include several charts in a weekly report. The user wants to see the trend of sales over the last 12 months at a glance. Which chart type should the analyst use?

A. Line chart

B. Stacked bar chart

C. Treemap

D. Pie chart

Answer: A

Line charts clearly show changes and trends over continuous time periods.

Why this answer

A line chart is the correct choice because it is specifically designed to display trends over continuous time intervals, such as sales over 12 months. The x-axis represents time (months), and the y-axis represents sales values, allowing the user to quickly see upward, downward, or cyclical patterns. This aligns with the requirement to visualize a trend at a glance, which is a core strength of line charts in data visualization.

Exam trap

The trap here is that candidates often confuse a stacked bar chart's ability to show cumulative totals over time with a clear trend line, but the stacked segments actually make it harder to discern the overall sales trajectory at a glance.

How to eliminate wrong answers

Option B (Stacked bar chart) is wrong because it emphasizes part-to-whole relationships across categories over time, not a single trend line; it can obscure the overall sales trend due to stacked segments. Option C (Treemap) is wrong because it uses nested rectangles to show hierarchical proportions, making it unsuitable for time-series trend analysis. Option D (Pie chart) is wrong because it shows proportions of a whole at a single point in time, not changes over a continuous period like 12 months.

95

MCQ · medium

A company needs to visualize monthly sales revenue for the past five years to identify seasonal trends. Which chart type is most appropriate?

A. Line chart with months on the x-axis and revenue on the y-axis

B. Stacked bar chart showing each year as a segment

C. Scatter plot with revenue vs. month number

D. Pie chart for each year showing revenue distribution

Answer: A

Line chart is best for displaying continuous data over time and highlighting trends.

Why this answer

Option A (Line chart) is correct because line charts are ideal for showing trends over time. Option B (Stacked bar chart) could work but is less effective for continuous time series. Option C (Scatter plot) is for relationships between two variables. Option D (Pie chart) is for proportions, not trends.

96

MCQ · easy

A data analyst needs to communicate findings to a non-technical audience that is concerned with overall performance but not interested in details. Which approach is best?

A. Provide a summary dashboard with key KPIs

B. Include complex model outputs

C. Share raw data tables

D. Use detailed statistical jargon

Answer: A

A dashboard with KPIs gives a concise overview of performance.

Why this answer

A summary dashboard with key KPIs is best because it distills complex data into visual, high-level metrics that non-technical stakeholders can quickly grasp. This approach aligns with the principle of data storytelling, where the focus is on actionable insights rather than technical details. Dashboards using tools like Tableau or Power BI allow for interactive filtering without overwhelming the audience.

Exam trap

The trap here is that candidates may overestimate the audience's technical comfort and choose raw data or jargon, forgetting that the question explicitly states the audience is 'non-technical' and 'not interested in details.'

How to eliminate wrong answers

Option B is wrong because complex model outputs (e.g., regression coefficients or decision tree splits) require statistical literacy and obscure the main performance narrative, causing confusion. Option C is wrong because raw data tables present unaggregated, granular information that is difficult to interpret and irrelevant for high-level performance review. Option D is wrong because detailed statistical jargon (e.g., p-values, confidence intervals) alienates non-technical audiences and violates the principle of communicating insights in plain language.

97

MCQ · medium

A dashboard uses a heatmap to show sales density by hour and day of week. Users report that the color scale is confusing because some low values appear similar to high values. Which design change improves clarity?

A. Use a sequential color scale with more contrast

B. Increase the number of color steps

C. Switch to a diverging color scale

D. Change to a single hue gradient

Answer: C

Diverging scale uses two contrasting colors from a midpoint, making differences more apparent.

Why this answer

Using a diverging color scale with a neutral midpoint improves differentiation between low, medium, and high values. A sequential scale may still cause confusion. Increasing color steps or using a single hue does not address the issue.

98

Multi-Select · hard

Which THREE data quality dimensions are commonly assessed in a data profiling task?

Select 3 answers

A. Scalability

B. Consistency

C. Uniqueness

D. Availability

E. Completeness

Answers: B, C, E

Consistency ensures uniform data representation, a common profiling check.

Why this answer

Consistency is a core data quality dimension assessed in data profiling because it evaluates whether data values are free from contradiction and adhere to the same representation rules across records. In profiling tools like Informatica or Talend, consistency checks identify violations such as 'NY' vs 'New York' in a state column, ensuring semantic uniformity.

Exam trap

CompTIA often tests the distinction between data quality dimensions (completeness, consistency, uniqueness) and system-level attributes (scalability, availability), leading candidates to mistakenly select non-quality terms like 'Availability' or 'Scalability' because they sound relevant to data management.
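The three data quality dimensions can be sketched as simple checks over a handful of records. The records, field names, and the two-letter state rule are hypothetical; real profiling tools apply far richer rules.

```python
# Hypothetical customer records with deliberate quality problems.
records = [
    {"id": 1, "state": "NY", "email": "a@example.com"},
    {"id": 2, "state": "New York", "email": None},
    {"id": 2, "state": "CA", "email": "c@example.com"},
    {"id": 4, "state": "CA", "email": "d@example.com"},
]

total = len(records)

# Completeness: share of records with a non-missing email.
completeness = sum(r["email"] is not None for r in records) / total

# Uniqueness: share of distinct id values among all records.
uniqueness = len({r["id"] for r in records}) / total

# Consistency: states that break the assumed two-letter uppercase format.
inconsistent_states = [
    r["state"] for r in records
    if not (len(r["state"]) == 2 and r["state"].isupper())
]

print(completeness, uniqueness, inconsistent_states)
```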

99

Multi-Select · medium

Which TWO of the following are considered structured data?

Select 2 answers

A. A PDF report with free-form text

B. A relational database table

C. A JPEG image of a product

D. A JSON file with nested key-value pairs

E. A CSV file containing sales records

Answers: B, E

Tables have a fixed schema.

Why this answer

Option B is correct because a relational database table stores data in a predefined schema of rows and columns, where each column has a fixed data type. This rigid structure allows for efficient querying, indexing, and relational operations, making it a classic example of structured data.

Exam trap

The trap here is that candidates often mistake semi-structured data (like JSON) for structured data because it has key-value pairs, but the DA0-001 exam strictly defines structured data as having a fixed, predefined schema—typically found in relational databases or CSV files with consistent column headers.

100

Matching · medium

Match each data sampling method to its description.

Concepts

Matches

Each member has equal chance of selection

Population divided into subgroups; random sample from each

Randomly select entire groups (clusters)

Select every k-th element from a list

Sample based on ease of access

Why these pairings

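Each description maps to a standard sampling method: simple random sampling gives every member an equal chance of selection, stratified sampling draws a random sample from each subgroup, cluster sampling randomly selects entire groups, systematic sampling takes every k-th element from a list, and convenience sampling relies on ease of access.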
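The five descriptions correspond to simple random, stratified, cluster, systematic, and convenience sampling. A minimal sketch of each, using Python's `random` module on a hypothetical population of 100 numbered members; the strata, cluster size, and sample sizes are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so repeated runs draw the same samples
population = list(range(1, 101))

# Simple random: every member has an equal chance of selection.
simple = rng.sample(population, 10)

# Stratified: split into subgroups, then sample randomly from each one.
strata = {"low": population[:50], "high": population[50:]}
stratified = [m for group in strata.values() for m in rng.sample(group, 5)]

# Cluster: randomly pick entire groups and keep all of their members.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [m for chosen in rng.sample(clusters, 2) for m in chosen]

# Systematic: take every k-th element after a random start.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Convenience: take whatever is easiest to reach, here the first ten.
convenience = population[:10]

print(len(simple), len(stratified), len(cluster_sample), len(systematic))
```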

101

MCQ · medium

A data analyst is creating a report to compare the performance of different sales regions. The report will be used by regional managers to identify areas needing improvement. Which of the following visualization techniques would be most effective?

A. A bar chart comparing each region's sales

B. A line chart showing overall company sales

C. A pie chart showing each region's contribution

D. A scatter plot of sales vs. expenses

Answer: A

Bar charts enable easy comparison of values across categories.

Why this answer

A bar chart is most effective because it allows direct, side-by-side comparison of discrete categories (sales regions) using a common baseline, making it easy for regional managers to quickly identify which regions are underperforming. The vertical bars encode exact values with high perceptual accuracy, supporting the report's goal of highlighting areas needing improvement.

Exam trap

The trap here is that candidates often choose a pie chart (Option C) because they think 'contribution to the whole' is the goal, but the question asks for comparing performance across regions, which requires a common baseline — a task for which pie charts are notoriously poor.

How to eliminate wrong answers

Option B is wrong because a line chart is designed to show trends over continuous time intervals, not to compare discrete categories like sales regions; it would obscure regional differences by aggregating data into a single overall trend. Option C is wrong because a pie chart shows parts of a whole, making it difficult to compare individual region performance accurately due to the lack of a common baseline and poor perceptual precision for small differences. Option D is wrong because a scatter plot is used to explore the relationship between two continuous variables (e.g., correlation between sales and expenses), not to compare performance across distinct categories like regions.

102

MCQ · medium

A data team is designing an ETL process to extract data from an operational database daily. The database experiences heavy write loads during business hours. What is the best practice to minimize impact on operations?

A. Extract directly from the primary database with high priority

B. Run the extraction during peak hours to ensure data freshness

C. Schedule extraction at midnight when load is low

D. Use replication or a read replica to extract data

Answer: D

Read replicas are designed for such purposes and do not affect the primary.

Why this answer

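Option D (use a read replica) is best because it offloads the extraction workload from the primary database. Option A (extract directly from the primary) competes with production writes and degrades performance. Option B (run during peak hours) adds load exactly when the database is busiest. Option C (schedule at midnight) reduces contention but still runs against the primary.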

Full explanation →

103

MCQeasy

After a marketing campaign, sales increased by 15%. The analyst wants to understand which customer segment contributed most to the increase. Which type of analysis is this?

A.Predictive analysis

B.Diagnostic analysis

C.Prescriptive analysis

D.Descriptive analysis

AnswerB

Diagnostic analysis investigates the cause of the outcome—here, which segment drove the increase.

Why this answer

Diagnostic analysis is used to understand the root cause of an event or change. In this scenario, the analyst already knows sales increased by 15% and wants to determine which customer segment drove that increase, which is a classic diagnostic question. This type of analysis goes beyond describing what happened to explain why it happened.

Exam trap

The trap here is confusing diagnostic analysis with descriptive analysis, as both deal with past data, but descriptive only summarizes what happened while diagnostic explains why it happened.

How to eliminate wrong answers

Option A is wrong because predictive analysis uses historical data to forecast future outcomes, not to explain past changes. Option C is wrong because prescriptive analysis recommends actions or decisions to achieve a desired outcome, not to diagnose the cause of a past event. Option D is wrong because descriptive analysis summarizes what happened (e.g., 'sales increased by 15%') but does not investigate which segment contributed most to the increase.

Full explanation →

104

MCQmedium

A data engineer is designing a data warehouse for a retail company. The fact table must record each sale transaction, including product ID, store ID, date, and quantity sold. The product details (name, category, price) are stored in a separate table. This design is an example of which data modeling concept?

A.Star schema

B.Data lake

C.Normalization

D.Snowflake schema

AnswerA

Correct: fact table linked to dimension tables.

Why this answer

This design is a classic star schema, where a central fact table (sales transactions) contains foreign keys to dimension tables (product, store, date). The fact table stores quantitative measures (quantity sold) and foreign keys, while dimension tables hold descriptive attributes (product name, category, price). This separation optimizes query performance for OLAP workloads by reducing joins and enabling straightforward aggregations.

Exam trap

The trap here is that candidates confuse star schema with snowflake schema, but the key differentiator is whether dimension tables are further normalized (snowflake) or kept denormalized (star), and this question's single product table clearly indicates a star schema.

How to eliminate wrong answers

Option B is wrong because a data lake stores raw, unprocessed data in its native format (e.g., CSV, Parquet) without a predefined schema, whereas this design explicitly separates facts and dimensions with a structured schema. Option C is wrong because normalization would split data into many related tables to eliminate redundancy (e.g., separating product category into its own table), but here product details are kept in a single dimension table, which is denormalized. Option D is wrong because a snowflake schema further normalizes dimension tables into sub-dimensions (e.g., splitting product category into a separate table), but this design keeps product details in one table, making it a star schema, not a snowflake.
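The fact/dimension split described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`fact_sales`, `dim_product`) and the sample rows are hypothetical, chosen only to mirror the question; the store and date dimensions are omitted for brevity.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name TEXT, category TEXT, price REAL          -- descriptive attributes
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id INTEGER, sale_date TEXT,
    quantity INTEGER                               -- the measure
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics', 900.0),
    (2, 'T-shirt', 'Clothing', 15.0);
INSERT INTO fact_sales VALUES
    (1, 1, 10, '2024-01-05', 2),
    (2, 2, 10, '2024-01-05', 5),
    (3, 2, 11, '2024-01-06', 3);
""")

# One join from the fact table to a single denormalized dimension table.
rows = conn.execute("""
    SELECT p.category, SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

In a snowflake schema, `category` would move into its own table and the same query would need a second join.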

Full explanation →

105

MCQeasy

A data analyst is designing a dashboard for executives. Which best practice should be followed regarding the placement of key performance indicators (KPIs)?

A.Include as many charts as possible on a single screen to avoid scrolling

B.Hide KPIs behind filters to reduce initial load time

C.Use distinct colors to highlight all KPIs equally

D.Place the most important KPIs in the top-left corner

AnswerD

Users typically scan from top-left, so important metrics should be placed there.

Why this answer

Option D is correct because placing the most important KPIs at the top-left follows natural reading patterns. Option A is wrong because cramming many charts onto one screen reduces readability. Option B is wrong because hiding KPIs behind filters defeats the purpose of a dashboard. Option C is wrong because highlighting all KPIs equally gives none of them emphasis, and color coding alone is insufficient.

Full explanation →

106

MCQhard

An e-commerce company stores customer support emails in a text database, product images in a blob store, and sales transactions in a SQL table. Which data store holds only structured data?

A.Blob store

B.Text database

C.SQL table

D.None

AnswerC

Correct. SQL tables have rows and columns with defined data types.

Why this answer

Structured data conforms to a predefined schema with rows and columns, enforcing data types and relationships. A SQL table is the canonical example of a structured data store because it organizes data into tables with fixed schemas, supports ACID transactions, and enables relational queries via SQL. In contrast, blob stores and text databases store unstructured or semi-structured data without a rigid schema.

Exam trap

The trap here is that candidates confuse 'structured data' with any data that has some organization (like tags in a blob store or fields in a text document), but only a SQL table enforces a rigid, predefined schema with typed columns and relational constraints, which is the defining characteristic of structured data.

How to eliminate wrong answers

Option A is wrong because a blob store (e.g., Amazon S3, Azure Blob Storage) stores binary large objects such as images, videos, or documents as opaque blobs with no inherent schema or structure — it is designed for unstructured data. Option B is wrong because a text database (e.g., a NoSQL document store like MongoDB or a plain text file repository) stores free-form text or semi-structured documents (e.g., JSON, XML) that lack a fixed, predefined schema and are not organized into rows and columns. Option D is wrong because the SQL table explicitly holds structured data, so 'None' is incorrect.

Full explanation →

107

Drag & Dropmedium

Drag and drop the steps to create a data visualization dashboard in the correct order.


Steps

Order

Why this order

Dashboard creation starts with planning, then chart selection, layout design, building, and testing.

Full explanation →

108

Multi-Selectmedium

Which TWO of the following are commonly used techniques for handling missing data in a dataset? (Select TWO).

Select 2 answers

A.Mean imputation

B.Mode imputation

C.Dropping columns with missing data

D.Dropping rows with missing data

E.Regression imputation

AnswersA, E

Mean imputation replaces missing values with the mean of the column.

Why this answer

Mean imputation is a commonly used technique for handling missing numerical data where the missing value is replaced with the mean of the observed values for that feature. It preserves the sample size and is simple to implement, though it can reduce variance and distort relationships if data is not missing completely at random.
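As a minimal sketch of mean imputation, assuming missing values are represented as `None` and using made-up ages:

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [30, None, 40, 50, None]   # hypothetical column with two gaps
imputed = impute_mean(ages)
print(imputed)
```

Every gap receives the same value, which is why the technique keeps the sample size but shrinks the column's variance.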

Exam trap

CompTIA often tests the distinction between common imputation methods (mean, median, mode, regression) and data removal techniques, trapping candidates who confuse 'dropping rows' as a primary technique when imputation is more widely recommended for preserving data integrity.

Full explanation →

109

MCQmedium

A data analyst creates a dashboard showing average order value by region. The chart indicates that one region has an unusually high average. Investigation reveals that the region has very few orders, but one large purchase inflates the average. Which data transformation should the analyst apply to improve the visualization?

A.Apply logarithmic scaling to the y-axis

B.Use median instead of mean for aggregation

C.Change the chart type to a pie chart

D.Remove all outliers from the dataset

AnswerB

Median is less sensitive to outliers than mean.

Why this answer

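Option B is correct because the median is far less sensitive to a single extreme value than the mean, so it gives a more representative order value for a region with few orders. The other options do not address the root cause: a log-scaled axis and a pie chart change only the presentation, and removing all outliers discards valid purchases.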
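A small worked example makes the effect concrete. The order values below are invented for illustration: three ordinary orders and one large purchase in a low-volume region.

```python
from statistics import mean, median

# Hypothetical order values for one region with very few orders.
orders = [40, 50, 60, 5000]

avg = mean(orders)     # pulled upward by the single large purchase
mid = median(orders)   # middle of the sorted values, barely affected

print(f"mean={avg}, median={mid}")
```

Here the mean is 1287.5 while the median is 55.0, so a chart aggregated by median reflects the typical order instead of the one unusual purchase.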

Full explanation →

110

MCQmedium

A data analyst creates a dashboard that includes a map showing sales by region. The map uses a continuous color gradient from light yellow to dark blue. Some regions with very high sales appear as dark blue, but many regions with moderate sales appear similar. Which improvement would most enhance the readability?

A.Switch to a diverging color scheme with a neutral midpoint.

B.Increase the map size to show more detail.

C.Add data labels to each region showing exact sales numbers.

D.Use a discrete color scale with distinct bins for sales ranges.

AnswerD

Discrete bins with distinct colors make it easier to differentiate between ranges, reducing visual ambiguity.

Why this answer

Option D is correct because a discrete color scale with distinct bins for sales ranges eliminates the ambiguity caused by a continuous gradient, where moderate sales values blend together. By grouping sales into defined intervals (e.g., $0–$10K, $10K–$50K, etc.), each region is assigned a unique, easily distinguishable color, making it immediately clear which sales bracket a region falls into without requiring precise color differentiation.

Exam trap

The trap here is that candidates may think a diverging color scheme (Option A) is always better for readability, but CompTIA often tests the distinction between continuous vs. discrete scales—the core issue is that a continuous gradient causes perceptual blending in the mid-range, which only a discrete scale with distinct bins can resolve.

How to eliminate wrong answers

Option A is wrong because a diverging color scheme is designed to highlight deviation from a neutral midpoint (e.g., above vs. below average), but the problem here is not about showing positive/negative divergence—it is about distinguishing moderate sales values that appear similar in a continuous gradient. Option B is wrong because increasing the map size does not change the underlying color mapping; it only enlarges the visual elements, so regions with similar moderate sales will still appear indistinguishable. Option C is wrong because adding data labels provides exact numbers but does not improve the readability of the color encoding itself; users would still struggle to visually compare regions at a glance, and labels can clutter the map, especially with many regions.
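One way to picture a discrete scale is as a lookup from sales value to bin. The sketch below assumes three hypothetical bin edges and four color names; real thresholds would come from the data and the audience.

```python
from bisect import bisect_right

# Hypothetical bin edges (lower bounds of bins 2-4) and one color per bin.
BIN_EDGES = [10_000, 50_000, 100_000]
BIN_COLORS = ["light yellow", "light green", "teal", "dark blue"]

def color_for(sales):
    """Map a sales figure to the color of the bin it falls into."""
    return BIN_COLORS[bisect_right(BIN_EDGES, sales)]

for value in (5_000, 10_000, 75_000, 250_000):
    print(value, "->", color_for(value))
```

Two regions with moderate but different sales now land in visibly different colors whenever they fall in different bins, rather than in two nearly identical shades of a gradient.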

Full explanation →

111

Multi-Selecteasy

Which TWO are common pitfalls when communicating data insights?

Select 2 answers

A.Explaining assumptions clearly

B.Using misleading scales on charts

C.Including a clear call to action

D.Providing context for the data

E.Overloading the audience with too many visuals

AnswersB, E

Manipulating scales can distort the data and mislead the audience.

Why this answer

Option B is correct because using misleading scales on charts (e.g., truncating the y-axis so that the baseline is not zero) distorts the visual representation of data, leading to incorrect interpretations. This violates the principle of data integrity in data visualization, as it can exaggerate or minimize trends, making it a common pitfall when communicating data insights.

Exam trap

CompTIA often tests the distinction between best practices and common pitfalls, so the trap here is that candidates may confuse beneficial actions (like explaining assumptions or providing context) with pitfalls, leading them to select those as wrong answers instead of recognizing them as correct practices.

Full explanation →

112

MCQmedium

A data analyst is pulling data from a production database for a report. The database contains customer orders with a column 'order_date'. The analyst notices that some orders have dates in the future. Which data quality issue does this represent?

A.Invalid data type

B.Inconsistent data

C.Missing data

D.Violation of business rules

AnswerD

Future orders are not valid per business rules, indicating a data quality issue.

Why this answer

Option D is correct because future order dates violate a business rule that order_date must be in the past or present. This is a classic data integrity issue where the data does not conform to domain-specific constraints, such as 'order_date <= CURRENT_DATE'. The analyst should flag this as a violation of business rules, not a data type or consistency problem.

Exam trap

The trap here is that candidates confuse 'invalid data type' (Option A) with 'invalid data value' — the data is of the correct type but violates a logical business rule, which is a distinct quality issue often tested in DA0-001.

How to eliminate wrong answers

Option A is wrong because the column 'order_date' is of a valid date data type (e.g., DATE or TIMESTAMP), so there is no data type mismatch. Option B is wrong because inconsistent data refers to contradictory values across related columns (e.g., different date formats), not a single column containing future dates. Option C is wrong because missing data would involve NULL or empty values, not dates that are present but invalid according to business logic.
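The rule 'order_date <= CURRENT_DATE' can be checked with a few lines of code. This is a sketch with made-up rows; the `today` value is passed in explicitly so the check is reproducible.

```python
from datetime import date

def future_dated(orders, today):
    """Return the orders that violate the rule order_date <= today."""
    return [o for o in orders if o["order_date"] > today]

# Hypothetical sample rows: valid DATE values, one of which lies in the future.
orders = [
    {"order_id": 1, "order_date": date(2024, 1, 15)},
    {"order_id": 2, "order_date": date(2031, 6, 1)},
]
violations = future_dated(orders, today=date(2024, 6, 30))
print([o["order_id"] for o in violations])
```

Both rows have the correct data type and neither is missing, which is why the issue is a business-rule violation and not a type or completeness problem.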

Full explanation →

113

MCQhard

Refer to the exhibit. Based on the data profiling results, what is a likely data quality issue?

A.Completeness

B.Accuracy

C.Validity

D.Consistency

AnswerC

Correct. Values like 0 and 150 violate reasonable constraints.

Why this answer

A minimum of 0 and a maximum of 150 are implausible for customer ages. A 0-year-old or 150-year-old customer falls outside any reasonable range, which points to invalid data and therefore a validity issue.

Full explanation →

114

MCQmedium

A retail company with 500 stores across North America wants to visualize its sales performance. The dataset includes store ID, region (Northeast, Southeast, Midwest, West), product category (Electronics, Clothing, Home Goods), monthly sales (in dollars), and date (from January 2018 to December 2023). The data has missing values for about 5% of store-month combinations, and a few stores have reported sales that are 10 times higher than the average for their region due to grand opening events. The goal is to create a dashboard that shows monthly sales trends for each region and product category, and allows users to identify which categories are driving growth. Which approach should the analyst take?

A.Use a stacked bar chart showing total sales by month, with each bar segmented by region and category

B.Create a line chart with month on the x-axis, sales on the y-axis, and separate lines for each region and category; check for outliers and consider annotating them

C.Create a scatter plot of sales vs. month with dots colored by region

D.Remove all stores with outlier sales and then create a line chart of the cleansed data

AnswerB

Line charts excel at showing trends over time; grouping by region and category allows comparison; outliers should be investigated and annotated, not removed.

Why this answer

Option B is correct because a line chart with time on the x-axis and separate lines for each region and category shows trends clearly. Handling outliers separately (e.g., by annotation) preserves data integrity. Option A is wrong because combining all regions and categories in a stacked bar makes it hard to see individual trends. Option C is wrong because a scatter plot is not appropriate for time series. Option D is wrong because removing outliers discards valid grand opening data.

Full explanation →

115

MCQmedium

A healthcare organization acquires data from multiple hospitals with different patient record systems. The data includes patient IDs but no common identifier across systems. Which technique should be used to link records?

A.Merge all records without deduplication

B.Generate random unique IDs for each system

C.Manually match records for all patients

D.Probabilistic record linkage using name, DOB, and ZIP

AnswerD

Probabilistic linkage uses multiple attributes to find matches with high confidence.

Why this answer

Option D (probabilistic record linkage) is designed for situations where no shared identifier exists. Option A (merge without deduplication) creates duplicate patient records. Option B (generate random IDs) loses the connections between records. Option C (manual matching) is not scalable.
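The idea of scoring candidate pairs on several attributes can be sketched as follows. This is a deliberately simplified weighted-similarity score, not a full probabilistic model such as Fellegi-Sunter; the weights, threshold, and patient records are all invented for illustration.

```python
from difflib import SequenceMatcher

# Illustrative weights; a real linkage model would estimate these from data.
WEIGHTS = {"name": 0.5, "dob": 0.3, "zip": 0.2}

def match_score(a, b):
    """Combine fuzzy name similarity with exact DOB and ZIP agreement."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    dob_sim = 1.0 if a["dob"] == b["dob"] else 0.0
    zip_sim = 1.0 if a["zip"] == b["zip"] else 0.0
    return (WEIGHTS["name"] * name_sim
            + WEIGHTS["dob"] * dob_sim
            + WEIGHTS["zip"] * zip_sim)

hospital_a = {"name": "Jonathan Smith", "dob": "1980-04-12", "zip": "30301"}
hospital_b = {"name": "Jonathon Smith", "dob": "1980-04-12", "zip": "30301"}
unrelated = {"name": "Maria Garcia", "dob": "1975-09-30", "zip": "10001"}

THRESHOLD = 0.85  # hypothetical cut-off for declaring a match
print(match_score(hospital_a, hospital_b) > THRESHOLD)
print(match_score(hospital_a, unrelated) > THRESHOLD)
```

The misspelled name still scores as a likely match because date of birth and ZIP agree, which an exact join on name would miss.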

Full explanation →

116

MCQeasy

An organization wants to assign responsibility for data quality and metadata management. Which role is primarily accountable for defining data standards and ensuring data quality across a specific domain?

A.Data analyst

B.Data owner

C.Data steward

D.Data custodian

AnswerC

A data steward ensures data quality, standards, and metadata management for a specific domain.

Why this answer

The data steward is the role primarily accountable for defining data standards and ensuring data quality within a specific domain. This aligns with the DAMA-DMBOK framework, where the data steward acts as the business-side owner of data content, establishing rules for data entry, validation, and metadata management to maintain consistency and accuracy.

Exam trap

The trap here is confusing the data steward with the data owner or data custodian, as many candidates mistakenly think the owner handles domain-level quality or that the custodian defines standards, when in fact the steward is the bridge between business requirements and technical enforcement.

How to eliminate wrong answers

Option A is wrong because a data analyst focuses on querying, analyzing, and reporting data, not on defining standards or governing data quality across a domain. Option B is wrong because a data owner is typically a senior executive accountable for data assets at an enterprise level, not for day-to-day domain-specific standards and quality enforcement. Option D is wrong because a data custodian handles technical implementation, storage, and security, but does not define business-level data standards or quality rules.

Full explanation →

117

MCQeasy

A data analyst creates a dashboard for executives to monitor quarterly sales. Which best practice ensures the dashboard is effective?

A.Place the most important metric in the top-left corner with simple charts.

B.Use a dark background with bright colors for contrast.

C.Include raw data tables for detailed analysis.

D.Use as many charts as possible to show all data.

AnswerA

Leverages natural scanning pattern for quick insight.

Why this answer

Option A is correct because placing the most important metric in the top-left corner leverages the natural reading pattern (left-to-right, top-to-bottom) to immediately draw the executive's attention to the key insight. Using simple charts (e.g., bar or line charts) reduces cognitive load, enabling rapid comprehension of quarterly sales trends without distracting details. This aligns with dashboard design principles that prioritize clarity and actionability over data density.

Exam trap

CompTIA often tests the misconception that more data or flashy visuals improve a dashboard, when in fact effective data communication relies on minimalism and strategic placement of the most critical insight.

How to eliminate wrong answers

Option B is wrong because a dark background with bright colors can cause eye strain and reduce readability, especially in well-lit executive meeting rooms; effective dashboards typically use light backgrounds with high-contrast, accessible color schemes. Option C is wrong because including raw data tables in an executive dashboard defeats its purpose—executives need summarized insights, not granular data, which should be available in a separate drill-down report. Option D is wrong because using as many charts as possible leads to clutter and information overload, obscuring the key sales metrics and making the dashboard ineffective for quick decision-making.

Full explanation →

118

MCQhard

A research firm is acquiring data from public government databases via API. The API rate limits at 100 requests per minute. They need to download 10,000 records, but each request returns a maximum of 100 records. What is the most efficient approach to ensure complete acquisition without being blocked?

A.Use a retry logic with exponential backoff and pagination

B.Request a data dump from the government via email

C.Download one record per second

D.Send all requests simultaneously in parallel

AnswerA

This approach respects the rate limit, handles failures gracefully, and ensures complete data acquisition.

Why this answer

Pagination with retry logic using exponential backoff allows the firm to send requests in a controlled manner, respecting the rate limit and handling potential failures. Sending all requests in parallel would likely exceed the rate limit and cause blocking. Downloading one record per second is too slow.

Requesting a data dump via email is inefficient and may not be supported.
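The arithmetic is simple: 10,000 records at 100 records per request is 100 requests, which is exactly the per-minute limit. A sketch of pagination with exponential backoff follows. The `fetch_page` callable, the `RateLimited` exception, and the fake API used to exercise it are all hypothetical stand-ins; a real client would call the actual endpoint and would typically also pace its requests.

```python
import time

class RateLimited(Exception):
    """Raised by the hypothetical client when the API rejects a request."""

def fetch_all(fetch_page, total_records, page_size=100,
              max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Page through the data set, backing off 1s, 2s, 4s, ... on rejection."""
    records = []
    for offset in range(0, total_records, page_size):
        for attempt in range(max_retries + 1):
            try:
                records.extend(fetch_page(offset, page_size))
                break                      # page done, move to the next one
            except RateLimited:
                if attempt == max_retries:
                    raise                  # give up after the last retry
                sleep(base_delay * 2 ** attempt)
    return records

# Fake API for demonstration: every tenth call is rejected.
calls = {"n": 0}
def fake_fetch_page(offset, limit):
    calls["n"] += 1
    if calls["n"] % 10 == 0:
        raise RateLimited()
    return list(range(offset, offset + limit))

delays = []                                # record delays instead of sleeping
data = fetch_all(fake_fetch_page, total_records=10_000, sleep=delays.append)
print(len(data), len(delays))
```

Because each rejected page is retried until it succeeds, no page is skipped, which is what guarantees complete acquisition.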

Full explanation →

119

MCQeasy

A data analyst needs to create a visual that shows the distribution of customer ages across different regions. Which chart type is most appropriate?

A.Line chart

B.Stacked bar chart

C.Scatter plot

D.Pie chart

AnswerB

A stacked bar chart can display the distribution of age groups within each region, making comparisons easy.

Why this answer

A stacked bar chart is most appropriate because it allows the analyst to compare the distribution of customer ages (typically grouped into bins) across multiple regions simultaneously. Each bar represents a region, and the segments within the bar show the proportion or count of each age group, making it easy to see both the overall distribution and regional differences.

Exam trap

The trap here is that candidates often choose a pie chart because they think of 'distribution' as a single whole, forgetting that the question requires comparison across multiple regions, which a pie chart cannot handle.

How to eliminate wrong answers

Option A is wrong because a line chart is designed to show trends over a continuous variable (e.g., time), not the distribution of categorical age groups across regions. Option C is wrong because a scatter plot is used to show the relationship between two continuous variables, not the distribution of a single categorical variable across regions. Option D is wrong because a pie chart can only show the composition of a whole for a single category (e.g., age distribution for one region), but it cannot effectively compare distributions across multiple regions.

Full explanation →

120

MCQhard

During a presentation, a stakeholder questions the validity of a correlation found. What is the best response?

A.Correlation does not imply causation, but we can perform further analysis.

B.We can accept the correlation as true.

C.We used a large sample so it's valid.

D.The p-value is low, so it's significant.

AnswerA

This response is honest and proposes next steps.

Why this answer

Option A is correct because it directly addresses the stakeholder's concern about validity by acknowledging the fundamental statistical principle that correlation does not imply causation. It then proposes a constructive next step—further analysis—which aligns with best practices in data communication, where validating insights requires additional testing (e.g., controlled experiments or causal inference methods). This response demonstrates both technical honesty and a commitment to rigorous data-driven decision-making.

Exam trap

The trap here is that candidates often confuse statistical significance (p-value) or sample size with validity of a correlation, overlooking the core principle that correlation does not imply causation, which is a classic pitfall in data interpretation questions.

How to eliminate wrong answers

Option B is wrong because accepting a correlation as true without scrutiny ignores the possibility of spurious correlations, confounding variables, or sampling bias, which undermines data integrity. Option C is wrong because a large sample size reduces sampling error but does not guarantee that a correlation is meaningful or causal; it can still be due to chance or hidden confounders. Option D is wrong because a low p-value indicates statistical significance (i.e., the correlation is unlikely to be due to random chance), but it does not prove practical importance or causation, and significance can be inflated with large samples.

Full explanation →

121

MCQhard

A healthcare organization is building a data warehouse to support population health analytics. The data sources include: (1) an electronic health record (EHR) system with a relational database containing patient demographics, diagnoses, and medications; (2) a claims system that generates CSV files daily; (3) patient-generated health data from mobile apps via a REST API returning JSON. The data engineer needs to design a data acquisition process that runs nightly. The EHR system has a change tracking mechanism that logs changes with timestamps. The claims CSV files are appended daily. The API supports filtering by date. The data warehouse uses a star schema with fact and dimension tables. The engineer must ensure data consistency and minimize load times. Which approach should the engineer take?

A.Perform a full extraction of all data from all sources every night and load directly into the data warehouse

B.Extract only new and changed EHR data using change tracking, extract the full claims CSV (since it's append-only), and extract API data filtered by the last extraction date

C.Use a staging area to land all raw data first, then transform and load

D.Extract the EHR data using change tracking, extract the full claims CSV, and extract the API data using a full dump

AnswerB

This minimizes data transfer and load time while capturing all changes.

Why this answer

Option B is correct because it uses incremental extraction for the EHR system (via change tracking) and the API (via date filtering), while performing a full extraction of the claims CSV since it is append-only and small enough to reload nightly. This minimizes load times by avoiding full re-extraction of large, slowly changing datasets, and ensures data consistency by capturing only new or modified records. The star schema in the data warehouse is then populated efficiently from these targeted extracts.

Exam trap

The trap here is that candidates may assume a staging area (Option C) is always required for data consistency, but the question specifically asks for the acquisition approach to minimize load times, and incremental extraction (Option B) directly achieves that without mandating a staging area.

How to eliminate wrong answers

Option A is wrong because performing a full extraction of all data every night would be extremely inefficient, causing unnecessarily long load times and high resource consumption, especially for large relational databases like the EHR system. Option C is wrong because while using a staging area is a best practice for data quality and transformation, it does not address the core requirement of minimizing load times through incremental extraction; the question specifically asks for the acquisition approach, not the ETL pipeline design. Option D is wrong because extracting a full dump of the API data every night ignores the API's built-in date filtering capability, leading to redundant data transfer and longer load times compared to incremental extraction.
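Incremental extraction is often implemented with a "watermark": the timestamp of the last successful run. The sketch below assumes change-log rows carrying a `changed_at` field, which is a hypothetical name; the same filter applies to an API that accepts a date parameter.

```python
from datetime import datetime

def extract_changed(rows, last_extracted_at):
    """Return rows changed after the previous run, plus the new watermark."""
    changed = [r for r in rows if r["changed_at"] > last_extracted_at]
    new_watermark = max((r["changed_at"] for r in changed),
                        default=last_extracted_at)
    return changed, new_watermark

# Hypothetical change-tracking log from the source system.
change_log = [
    {"patient_id": 1, "changed_at": datetime(2024, 5, 1, 8, 0)},
    {"patient_id": 2, "changed_at": datetime(2024, 5, 2, 14, 30)},
    {"patient_id": 3, "changed_at": datetime(2024, 5, 3, 9, 15)},
]
last_run = datetime(2024, 5, 2, 0, 0)
changed, watermark = extract_changed(change_log, last_run)
print([r["patient_id"] for r in changed], watermark)
```

Only two of the three rows are transferred, and the returned watermark becomes the filter for the next nightly run.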

Full explanation →

122

MCQeasy

A data analyst receives a dataset with a column 'salary' that contains values like '45,000', '55,000', and '65,000'. The analyst notices that the values are stored as text. Which data concept should be applied to convert the salary column from text to numeric format for analysis?

A.Data imputation

B.Data type conversion

C.Data validation

D.Data normalization

AnswerB

Conversion changes data type, e.g., string to integer.

Why this answer

Data type conversion is the correct concept because the salary values are stored as text (string) but need to be converted to a numeric type (e.g., integer or float) for mathematical operations like aggregation or averaging. In tools like Python (pandas `astype(float)`), SQL (`CAST(salary AS INTEGER)`), or Excel (`VALUE()` function), this explicit conversion ensures the data is treated as numbers, not strings. Because values such as '45,000' contain a thousands separator, the commas usually have to be removed first; pandas `astype(float)`, for example, raises an error on '45,000'. Without conversion, operations like `SUM` or `AVG` would fail or produce incorrect results.

Exam trap

CompTIA often tests the distinction between data transformation (type conversion) and data preparation techniques like imputation or normalization, trapping candidates who confuse 'changing format' with 'filling gaps' or 'scaling values'.

How to eliminate wrong answers

Option A is wrong because data imputation deals with filling missing values (e.g., using mean or median), not changing the data type of existing values. Option C is wrong because data validation checks whether data meets predefined rules (e.g., range or format constraints), but it does not transform text to numeric format. Option D is wrong because data normalization rescales numeric values to a standard range (e.g., 0–1 or z-scores), which assumes the data is already numeric, not converting text to numbers.
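A minimal sketch of the conversion in plain Python, using the three values from the question. The helper name `to_number` is made up for illustration.

```python
def to_number(text):
    """Convert a salary string such as '45,000' to an int."""
    # Strip the thousands separator first; int('45,000') would raise ValueError.
    return int(text.replace(",", "").strip())

salaries_text = ["45,000", "55,000", "65,000"]
salaries = [to_number(s) for s in salaries_text]
average = sum(salaries) / len(salaries)
print(salaries, average)
```

Once the column is numeric, aggregations such as the average behave as expected.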

Full explanation →

123

MCQmedium

A researcher wants to study the effect of a new drug. She collects data directly from clinical trial participants. Later, she compares her findings with historical data from medical journals. Which contrast best describes her data sources?

A.Internal vs. External

B.Quantitative vs. Qualitative

C.Structured vs. Unstructured

D.Primary vs. Secondary

AnswerD

Primary data is collected firsthand; secondary data is obtained from existing sources.

Why this answer

Option D is correct because the researcher is directly collecting data from clinical trial participants (primary data) and then comparing it with historical data from medical journals (secondary data). Primary data is original data collected firsthand for a specific purpose, while secondary data is pre-existing data collected by others for different purposes. This contrast directly maps to the primary vs. secondary data classification in data management.

Exam trap

The trap here is that candidates confuse 'internal vs. external' (Option A) with 'primary vs. secondary' because both involve a contrast between data from the researcher's own work versus outside sources, but the DA0-001 exam specifically tests the distinction based on whether the data was collected firsthand (primary) or reused from existing records (secondary).

How to eliminate wrong answers

Option A is wrong because internal vs. external refers to whether data originates within or outside an organization, not to how it was collected. The clinical trial data could be called internal to the study and the journal data external, but the contrast the question describes is firsthand collection versus reuse of existing records. Option B is wrong because quantitative vs. qualitative describes data types (numerical vs. categorical/textual), not the source of data; both the clinical trial data and historical journal data could be quantitative or qualitative. Option C is wrong because structured vs. unstructured refers to data format (e.g., tables vs. free text), not the source; both data sources could be structured (e.g., trial results in a database) or unstructured (e.g., narrative journal articles).

Full explanation →

124

Multi-Selecteasy

Which TWO of the following are best practices when presenting data insights to a non-technical audience?

Select 2 answers

A.Focus on actionable insights and recommendations

B.Use visualizations like bar charts and line graphs

C.Use technical jargon to demonstrate expertise

D.Present raw data tables for transparency

E.Include detailed statistical formulas

AnswersA, B

Actionable insights help the audience make decisions.

Why this answer

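Option A is correct because non-technical audiences need clear, actionable insights to make decisions. Presenting recommendations directly from the data ensures the insights are useful and drive business outcomes, which is a core principle of effective data communication. Option B is correct because familiar visualizations such as bar charts and line graphs let the audience see comparisons and trends at a glance without reading raw numbers.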

Exam trap

CompTIA often tests the misconception that technical depth equals credibility, but the DA0-001 exam emphasizes that effective communication means simplifying complexity for the audience, not showcasing every analytical detail.

Full explanation →

125

MCQmedium

An analyst needs to combine two datasets from different sources that share a common key but have different levels of granularity. Dataset A has daily sales per store, Dataset B has hourly foot traffic per store. The analyst wants to analyze correlation. Which approach is appropriate?

A.Aggregate Dataset B to daily level before merging

B.Use an outer join and keep all rows

C.Disaggregate Dataset A to hourly level by dividing daily sales by hours

D.Join on store and date without aggregation

AnswerA

Aggregating the more granular dataset to match the less granular is the standard approach.

Why this answer

Aggregating Dataset B (hourly foot traffic) to the daily level ensures both datasets share the same granularity before merging on the common key (store and date). This allows a valid correlation analysis between daily sales and daily foot traffic without introducing artificial patterns or data duplication. Merging at mismatched granularities would violate the assumption that each row represents a comparable unit of observation.

Exam trap

CompTIA often tests the misconception that disaggregating (splitting) the coarser dataset is acceptable, but this introduces artificial data and relies on an unjustified assumption that sales are spread uniformly across hours, whereas aggregation preserves the actual measured values.

How to eliminate wrong answers

Option B is wrong because an outer join without aggregation would produce multiple rows per store-date (one for each hour) when joined with daily sales, inflating the number of rows and creating a many-to-one relationship that distorts correlation calculations. Option C is wrong because disaggregating daily sales by simply dividing by hours (e.g., 24) assumes uniform sales distribution, which is rarely true and introduces artificial hourly values that do not reflect actual sales patterns. Option D is wrong because joining on store and date without aggregation retains hourly granularity from Dataset B, causing each daily sales row to repeat for every hour, leading to duplicate data and invalid statistical analysis.
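
The aggregate-then-merge approach can be sketched in a few lines of plain Python. The store IDs, dates, and figures below are made-up sample values, not data from the question; the point is only that the hourly rows are summed to one row per store and date before they are matched to daily sales.

```python
from collections import defaultdict

# Hypothetical sample data: daily sales per store and hourly foot traffic per store.
daily_sales = {
    ("S1", "2024-01-01"): 1200.0,
    ("S1", "2024-01-02"): 900.0,
}
hourly_traffic = [
    # (store, date, hour, visitors)
    ("S1", "2024-01-01", 9, 40),
    ("S1", "2024-01-01", 10, 60),
    ("S1", "2024-01-02", 9, 30),
    ("S1", "2024-01-02", 10, 45),
]

# Step 1: aggregate the finer-grained dataset (hourly) up to the daily level.
daily_traffic = defaultdict(int)
for store, date, _hour, visitors in hourly_traffic:
    daily_traffic[(store, date)] += visitors

# Step 2: merge on the shared key (store, date), giving one row per store-day.
merged = [
    {"store": store, "date": date, "sales": sales,
     "visitors": daily_traffic[(store, date)]}
    for (store, date), sales in sorted(daily_sales.items())
    if (store, date) in daily_traffic
]

for row in merged:
    print(row)
```

Because each store-day appears exactly once in `merged`, a correlation computed on `sales` and `visitors` compares like with like.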

Full explanation →

126

Multi-Selecteasy

A data analyst is designing a dashboard for senior executives who need to quickly monitor key business metrics. Which TWO design principles should the analyst follow? (Choose two.)

Select 2 answers

A.Include detailed data tables for reference

B.Display only the most important KPIs

C.Use consistent formatting and clear labels

D.Add complex interactive filters

E.Use as many colors as possible to make it visually appealing

AnswersB, C

Focus on critical metrics for quick decision-making.

Why this answer

Option B is correct because senior executives need to monitor key business metrics at a glance, not be overwhelmed with extraneous data. Displaying only the most important KPIs ensures the dashboard is focused and actionable, aligning with the principle of delivering concise, high-level insights for quick decision-making. Option C is correct because consistent formatting and clear labels let executives read each metric correctly without having to decode the layout.

Exam trap

CompTIA often tests the misconception that more data and interactivity always improve a dashboard, when in fact, for executive audiences, simplicity and focus on the most important KPIs are paramount.

Full explanation →

127

MCQeasy

A data analyst is designing a data model for a sales data warehouse. The model should optimize query performance for aggregations by minimizing joins and duplicating data where necessary. Which schema design should the analyst use?

A.Entity-relationship model

B.Snowflake schema

C.3NF normalized model

D.Star schema

AnswerD

Star schema denormalizes dimensions, minimizing joins and optimizing aggregate queries.

Why this answer

A star schema keeps each dimension denormalized in a single table joined directly to the fact table, which reduces joins and improves query speed for aggregates. A snowflake schema normalizes dimensions into multiple related tables, increasing joins. Entity-relationship and 3NF models are optimized for transactional systems, not analytical queries.
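
As a rough illustration, the sketch below builds a tiny star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_product`), columns, and rows are hypothetical; the thing to notice is that the category lives inside the product dimension, so the aggregate needs only one join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT          -- category kept in the dimension (denormalized)
);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_key INTEGER REFERENCES dim_product(product_key),
    amount REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Electronics'),
    (2, 'Phone', 'Electronics'),
    (3, 'Desk', 'Furniture');
INSERT INTO fact_sales (product_key, amount) VALUES
    (1, 1000.0), (2, 500.0), (3, 200.0), (3, 300.0);
""")

# One join from the fact table to the dimension is enough to aggregate by category.
rows = conn.execute("""
    SELECT d.category, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(rows)
```

In a snowflake design the category would sit in its own table, and the same query would need a second join.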

Full explanation →

128

MCQhard

A data analyst uses linear regression to model the relationship between advertising spend and sales. The residual plot shows a clear U-shaped pattern. What assumption is violated?

A.Independence of residuals

B.Homoscedasticity

C.Normality of residuals

D.Linearity

AnswerD

A U-shaped pattern means the relationship is not linear; the model is missing a nonlinear term.

Why this answer

The U-shaped pattern in the residual plot indicates that the relationship between advertising spend and sales is not linear; the model fails to capture the curvature in the data. Linear regression assumes a straight-line relationship between predictors and the response, so a systematic pattern like a U-shape directly violates the linearity assumption. This means the model is misspecified and requires a transformation or a nonlinear modeling approach.

Exam trap

CompTIA often tests the distinction between residual pattern shapes and their corresponding assumptions, so the trap here is that candidates confuse a curved pattern (nonlinearity) with heteroscedasticity or non-normality, leading them to pick B or C instead of D.

How to eliminate wrong answers

Option A is wrong because independence of residuals refers to errors being uncorrelated with each other, often violated in time-series data, but a U-shaped pattern does not imply autocorrelation. Option B is wrong because homoscedasticity means constant variance of residuals across fitted values; its violation (heteroscedasticity) appears as a funnel or cone shape, not a U-shaped curve. Option C is wrong because normality of residuals concerns the distribution of errors (checked via Q-Q plot or histogram), not the pattern of residuals versus fitted values; a U-shaped pattern does not directly indicate non-normality.
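
To see where the U shape comes from, the sketch below fits a straight line by ordinary least squares to made-up data that actually follows a quadratic curve. The data are an assumption chosen for illustration; the residuals come out positive at both ends and negative in the middle, which is the pattern described in the question.

```python
# Hypothetical data with a curved (quadratic) relationship: y = x^2.
xs = [float(x) for x in range(-5, 6)]
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a straight line y = intercept + slope * x.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

# Residual = observed value minus the value predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x={x:5.1f}  residual={r:6.1f}")
```

Adding a squared term to the model, or transforming the predictor, is what removes this systematic pattern.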

Full explanation →

129

MCQmedium

A company requires real-time masking of credit card numbers for customer support agents while allowing full access for accountants. Which technique should be implemented?

A.Dynamic data masking

B.Tokenization

C.Static data masking

D.Data encryption

AnswerA

Dynamic masking masks data on-the-fly based on user roles, perfect for this requirement.

Why this answer

Dynamic data masking (DDM) applies masking rules at query runtime based on user privileges, allowing accountants full access while customer support agents see only masked credit card numbers. Unlike static masking, DDM does not alter the underlying stored data, making it ideal for real-time, role-based obfuscation without duplicating or transforming the database.

Exam trap

CompTIA often tests the misconception that encryption or tokenization can provide real-time, role-based masking, but these technologies either require decryption (exposing the full value) or introduce latency and storage overhead, making dynamic data masking the only correct choice for this use case.

How to eliminate wrong answers

Option B (Tokenization) is wrong because it replaces sensitive data with a non-sensitive token stored in a separate vault, requiring a detokenization process that adds latency and is not designed for real-time, role-based masking within the same database. Option C (Static data masking) is wrong because it creates a permanent, masked copy of the data in a non-production environment, which cannot provide real-time, on-the-fly masking for live queries. Option D (Data encryption) is wrong because encryption protects data at rest or in transit but does not provide role-based masking at query time; decryption keys grant full access, not partial masking.
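
The idea of masking at read time can be sketched as follows. This is a toy illustration in application code, not how any particular database implements dynamic data masking; the role names, record layout, and function names are all hypothetical.

```python
# Hypothetical role names and record layout, for illustration only.
FULL_ACCESS_ROLES = {"accountant"}


def mask_card(number: str) -> str:
    """Hide all but the last four digits of a card number."""
    return "*" * (len(number) - 4) + number[-4:]


def read_card(record: dict, role: str) -> str:
    """Return the card number as the given role is allowed to see it.

    The stored value is never modified; masking happens at read time.
    """
    card = record["card_number"]
    return card if role in FULL_ACCESS_ROLES else mask_card(card)


record = {"customer_id": 42, "card_number": "4111111111111111"}

print(read_card(record, "accountant"))     # full value
print(read_card(record, "support_agent"))  # masked value
```

The same stored record yields different results depending on who asks, which is the property that distinguishes dynamic from static masking.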

Full explanation →

130

MCQhard

A data architect is designing an ETL pipeline to ingest streaming data from IoT sensors. The data must be available for real-time analytics. Which acquisition method is best?

A.Real-time streaming via API

B.Poll sensors every hour

C.Manually upload sensor logs

D.Batch load daily CSV files

AnswerA

Streaming provides continuous, low-latency data flow.

Why this answer

Real-time streaming via API is the best method because IoT sensors generate continuous data that must be ingested with very low latency for real-time analytics. Streaming interfaces (e.g., WebSocket or MQTT) enable event-driven ingestion, allowing the ETL pipeline to process each sensor reading as it arrives, which is essential for time-sensitive use cases like anomaly detection or live monitoring.

Exam trap

The trap here is that candidates may confuse 'real-time' with 'frequent batch' and choose hourly polling (Option B), not realizing that real-time analytics requires continuous, low-latency ingestion, not just periodic updates.

How to eliminate wrong answers

Option B is wrong because polling sensors every hour introduces latency of up to 60 minutes, which violates the real-time analytics requirement and can cause data staleness for time-critical decisions. Option C is wrong because manually uploading sensor logs is not automated, introduces human error, and cannot achieve the low-latency ingestion needed for streaming data. Option D is wrong because batch loading daily CSV files imposes a 24-hour delay, making the data unavailable for real-time analytics and contradicting the explicit requirement for immediate data availability.

Full explanation →

131

MCQhard

Refer to the exhibit. A data analyst is reviewing the configuration of an executive dashboard. The dashboard refreshes daily at 6:00 AM. Which of the following best describes a potential issue with this dashboard for executive use?

A.The data sources do not include all necessary tables.

B.The dashboard uses a table visualization which may not be suitable for quick insights.

C.The alert condition for revenue is set too low.

D.The refresh schedule is too frequent.

AnswerB

Executives typically prefer visual summaries (charts) over raw tables for rapid comprehension.

Why this answer

Option B is correct because an executive dashboard should provide quick, at-a-glance insights, and a table visualization forces the viewer to read through rows of data rather than immediately grasping trends or outliers. For high-level decision-making, visualizations like line charts, bar charts, or KPI tiles are more effective at conveying key metrics without cognitive overload.

Exam trap

CompTIA often tests the principle that the suitability of a visualization depends on the audience and purpose, and the trap here is assuming that a table is always acceptable because it shows all data, ignoring that executives need rapid, high-level insights rather than granular detail.

How to eliminate wrong answers

Option A is wrong because the question does not provide any information about missing tables or data sources; the exhibit only shows the dashboard configuration, and there is no indication of incomplete data. Option C is wrong because the alert condition for revenue being set too low is not inherently an issue—it depends on business thresholds, and the question does not specify that the alert is misconfigured or causing false alarms. Option D is wrong because a daily refresh at 6:00 AM is a common and reasonable schedule for an executive dashboard, ensuring data is current for morning reviews without being overly frequent or resource-intensive.

Full explanation →

132

MCQhard

A healthcare analytics team is building a predictive model to identify patients at high risk of readmission within 30 days of discharge. The dataset includes 50,000 patient records with 200 features, including demographics, vital signs, lab results, and historical admissions. The target variable is binary (readmitted or not). The team uses a logistic regression model and achieves an AUC of 0.72 on the test set. However, the model's calibration is poor: for patients predicted to have a 70% risk, the actual readmission rate is only 40%. The team wants to improve calibration without significantly reducing discrimination (AUC). The data scientist suggests applying Platt scaling. However, the team lead is concerned that Platt scaling may reduce the model's ability to rank patients correctly. Which of the following is the best course of action?

A.Remove poorly calibrated predictions by discarding all patients with predicted risk between 0.3 and 0.7.

B.Ignore calibration because AUC is the only metric that matters for readmission risk models.

C.Apply Platt scaling on a held-out validation set to recalibrate the predicted probabilities without refitting the original model.

D.Switch to a random forest model, which inherently produces better-calibrated probabilities.

AnswerC

Platt scaling is designed to improve calibration while maintaining AUC.

Why this answer

Platt scaling is a post-processing technique that fits a logistic regression model on the predicted probabilities from the original model using a held-out validation set. This recalibrates the probabilities without altering the ranking of patients (the AUC remains unchanged), directly addressing the poor calibration while preserving discrimination. Option C correctly describes this procedure.

Exam trap

The trap here is that candidates may think Platt scaling changes the model's ranking (AUC), but in reality it applies a monotonic transformation that preserves rank order, so discrimination is unaffected.

How to eliminate wrong answers

Option A is wrong because discarding patients with predicted risk between 0.3 and 0.7 removes a large portion of the data and does not fix the underlying miscalibration; it merely hides the problem and reduces the model's utility. Option B is wrong because AUC measures only rank ordering, not probability accuracy; for clinical risk models, well-calibrated probabilities are critical for decision-making (e.g., resource allocation). Option D is wrong because random forest models often produce poorly calibrated probabilities, since averaging tree outputs tends to pull predictions away from 0 and 1; they frequently need their own calibration step (e.g., isotonic regression) and do not inherently guarantee better calibration than logistic regression.
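
The claim that recalibration leaves ranking untouched can be checked with a small sketch. The scores, labels, and the coefficients `a` and `b` below are hypothetical; in practice `a` and `b` would be fitted on the held-out validation set, and the rank-preservation argument assumes the fitted slope `a` is positive.

```python
import math


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def platt(score, a, b):
    """Platt-style recalibration: a sigmoid of a linear function of the score."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical uncalibrated scores and true outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical coefficients (not fitted here); a > 0 keeps the mapping increasing.
a, b = 3.0, -2.5
calibrated = [platt(s, a, b) for s in scores]

print("AUC before:", auc(scores, labels))
print("AUC after: ", auc(calibrated, labels))
```

The probabilities change, but every pair of patients keeps the same relative order, so the AUC is identical before and after.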

Full explanation →

133

Multi-Selecthard

Which THREE are common challenges when acquiring data from external APIs? (Choose three.)

Select 3 answers

A.Authentication requirements

B.Consistent data schemas

C.Rate limiting

D.Guaranteed uptime

E.Data volume constraints

AnswersA, C, E

Most APIs require keys or tokens for access.

Why this answer

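Authentication requirements are a common challenge because external APIs typically require valid credentials (e.g., API keys, OAuth 2.0 tokens, or JWT) to access protected resources. Without proper authentication, requests are rejected with HTTP 401 Unauthorized or 403 Forbidden errors, and managing token expiration, refresh cycles, and secure storage adds significant complexity to data acquisition pipelines. Rate limiting (Option C) caps the number of requests allowed in a time window, so pipelines must throttle or retry, typically after an HTTP 429 Too Many Requests response. Data volume constraints (Option E), such as page-size limits or quotas, force large extracts to be split across many paginated calls.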

Exam trap

The trap here is that candidates confuse 'consistent data schemas' (which APIs typically provide) with 'inconsistent data quality' (which is a separate challenge), leading them to incorrectly select Option B instead of recognizing that schema consistency is actually a benefit of using APIs.
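
Rate limiting, one of the correct options above, is usually handled by retrying with an increasing delay. The sketch below shows that pattern against a stand-in function rather than a real API; `RateLimited`, `fetch_with_backoff`, and `fake_fetch` are hypothetical names introduced only for this illustration.

```python
import time


class RateLimited(Exception):
    """Raised by the hypothetical client when the API answers HTTP 429."""


def fetch_with_backoff(fetch, max_retries=3, base_delay=0.01, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait with exponential backoff and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)


# Stand-in for a real API call: fails twice with a rate-limit error, then succeeds.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return {"status": 200, "data": [1, 2, 3]}


result = fetch_with_backoff(fake_fetch)
print(result, "after", calls["count"], "calls")
```

A real pipeline would typically also honor any retry hint the provider returns and cap the total wait time.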

Full explanation →

134

MCQmedium

A healthcare analytics team is building a classification model to predict patient readmission within 30 days. The dataset contains 10,000 records with 30 features, including demographics, vital signs, lab results, and medication history. The target variable is imbalanced: 85% no readmission, 15% readmission. The team used logistic regression with default settings and achieved an accuracy of 85%, but the model predicted 'no readmission' for all patients. The lead analyst suspects the model is not learning due to class imbalance. The team has time to implement one corrective action before the next model review. Which action should the team take?

A.Remove features with low variance to reduce noise

B.Apply SMOTE to oversample the readmission class

C.Use accuracy as the evaluation metric to monitor improvement

D.Switch to a random forest model with default settings

AnswerB

SMOTE generates synthetic samples, balancing the classes and allowing the model to learn from the minority class.

Why this answer

Option B is correct because SMOTE (Synthetic Minority Oversampling Technique) directly addresses the class imbalance by generating synthetic samples for the minority class (readmission). This forces the logistic regression model to learn decision boundaries that separate the two classes, rather than defaulting to the majority class prediction. With 85% majority and 15% minority, accuracy alone is misleading, and SMOTE is a proven technique to improve recall for the minority class.

Exam trap

The trap here is that candidates often choose accuracy as a metric (Option C) because it seems intuitive, but in imbalanced datasets, accuracy is misleading and does not reflect model performance for the minority class.

How to eliminate wrong answers

Option A is wrong because removing low-variance features does not address class imbalance; it only reduces noise or redundant features, but the model will still predict the majority class if the imbalance is not handled. Option C is wrong because using accuracy as the evaluation metric is exactly the problem—it will remain high (85%) even if the model predicts all 'no readmission', so it does not monitor improvement for the minority class. Option D is wrong because switching to a random forest model with default settings does not inherently solve class imbalance; random forest can also be biased toward the majority class without techniques like class weighting or resampling.
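
The numbers in the question can be reproduced directly. The sketch below builds labels with the same 85/15 split and scores a model that always predicts 'no readmission'; the label lists are constructed for illustration, not taken from a real dataset.

```python
# Hypothetical labels matching the 85/15 split in the question: 1 = readmitted.
y_true = [0] * 85 + [1] * 15
y_pred = [0] * 100  # a model that always predicts "no readmission"

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual readmissions that the model caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy looks respectable while recall for the readmission class is zero, which is why accuracy alone cannot be used to monitor improvement here.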

Full explanation →

135

MCQmedium

A logistics company tracks delivery times and customer satisfaction scores. The data analyst finds that delivery times have increased over the past quarter, correlating with a drop in satisfaction. The analyst needs to present this to the operations team, which is interested in root cause analysis. The team wants to identify whether the increase is due to specific regions, routes, or time periods. The analyst has access to granular data including timestamps, route IDs, and region codes. The presentation should lead to actionable insights for process improvement. What visualization should the analyst use as the primary chart?

A.A scatter plot of delivery time vs. satisfaction.

B.A line chart showing delivery times and satisfaction over time.

C.A histogram of delivery times.

D.A pie chart showing proportion of late deliveries.

AnswerB

Clearly visualizes trends and correlation, enabling root cause analysis.

Why this answer

Option B is correct because a line chart with dual axes (or separate panels) can clearly show the trend of delivery times and satisfaction scores over the same time period, directly addressing the operations team's need to identify whether the increase is due to specific time periods. This visualization allows the team to correlate changes in delivery times with satisfaction drops over time, supporting root cause analysis by highlighting temporal patterns. The granular timestamp data makes a time-series line chart the most effective primary chart for revealing trends and potential seasonality.

Exam trap

The trap here is that candidates often choose a scatter plot (Option A) because it shows correlation, but the question specifically requires identifying root causes by region, route, or time period, which a scatter plot cannot address without additional dimensions.

How to eliminate wrong answers

Option A is wrong because a scatter plot of delivery time vs. satisfaction shows correlation but does not incorporate time, region, or route dimensions, making it impossible to identify whether increases are due to specific regions, routes, or time periods. Option C is wrong because a histogram of delivery times shows the distribution of delivery times but provides no temporal context or correlation with satisfaction, failing to address the root cause analysis requirement for time-based trends. Option D is wrong because a pie chart showing the proportion of late deliveries is a static snapshot that ignores time trends, regional breakdowns, and route-level granularity, offering no actionable insights for process improvement.

Full explanation →

136

Multi-Selecteasy

A data analyst is designing a dashboard for a sales team. Which TWO of the following are best practices for dashboard design?

Select 2 answers

A.Use complex visualizations to impress users.

B.Include as many KPIs as possible on one screen.

C.Use consistent color coding for similar metrics.

D.Place the most important information at the top or left.

E.Use a single chart type for all visuals.

AnswersC, D

Consistent color coding helps users quickly associate colors with metrics.

Why this answer

Option C is correct because consistent color coding for similar metrics reduces cognitive load and helps users quickly interpret data without re-learning visual cues. In dashboard design, this aligns with the Gestalt principle of similarity, ensuring that revenue metrics, for example, always appear in the same color across charts. This practice is recommended by data visualization experts like Stephen Few and is straightforward to apply in tools like Tableau and Power BI. Option D is correct because users tend to scan a screen from the top left, so placing the most important information there ensures it is seen first.

Exam trap

The trap here is that candidates often confuse 'impressive visuals' with effective communication, or assume that more data equals better insights, when in fact simplicity and consistency are the hallmarks of professional dashboard design.

Full explanation →

137

Multi-Selecthard

Which THREE of the following are appropriate ways to handle outliers when communicating data insights?

Select 3 answers

A.Document the outlier and its potential impact in the report.

B.Ignore the outlier and proceed with the analysis.

C.Investigate the cause of the outlier.

D.Use a box plot to visualize the distribution including outliers.

E.Remove the outlier from the dataset to clean the data.

AnswersA, C, D

Documentation provides transparency and context.

Why this answer

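Option A is correct because documenting the outlier and its potential impact in the report is a best practice for transparent and ethical data communication. It allows stakeholders to understand the anomaly's influence on the analysis and make informed decisions, rather than hiding or misrepresenting the data. Option C is correct because investigating the cause shows whether the outlier is a data error or a genuine extreme value, which determines how it should be treated. Option D is correct because a box plot displays the distribution and plots outliers as individual points, making them visible rather than hidden.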

Exam trap

The trap here is that candidates may think removing outliers is always a standard data cleaning step, but the exam emphasizes that outliers must be investigated and documented rather than automatically deleted, as they can carry significant meaning.
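
One common way to flag outliers for investigation, and the rule a box plot typically uses for its whiskers, is the 1.5 × IQR fence. The sketch below applies it with Python's standard `statistics` module to made-up values; the data and the 1.5 multiplier are assumptions for illustration, and the outlier is flagged rather than deleted.

```python
import statistics

# Hypothetical measurements with one extreme value.
values = [10, 12, 11, 13, 12, 14, 11, 95]

# Quartiles of the observed data (default 'exclusive' method).
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, rather than delete, anything outside the fences a box plot would draw.
outliers = [v for v in values if v < lower or v > upper]

print("fences:", lower, upper)
print("flagged for investigation:", outliers)
```

The original list is left intact, so the flagged value can still be investigated and documented in the report.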

Full explanation →

138

MCQmedium

A retail company stores customer purchase history in a relational database. The database contains a table 'transactions' with columns: transaction_id, customer_id, product_id, quantity, price, and transaction_date. A data analyst needs to create a report that shows total revenue per customer for the last quarter. Which data concept describes the relationship between customer_id and total revenue?

A.Foreign key

B.Composite attribute

C.Derived attribute

D.Atomic attribute

AnswerC

Total revenue is calculated from other attributes, making it derived.

Why this answer

Total revenue is calculated by summing (quantity * price) for each customer, making it a derived attribute because it is computed from existing stored data (quantity and price) rather than stored directly. In the context of the 'transactions' table, customer_id is a stored key, but total_revenue is not stored; it is derived via aggregation, which matches the definition of a derived attribute in database design.

Exam trap

CompTIA often tests the confusion between a derived attribute (computed from other attributes) and a foreign key (a referential constraint), leading candidates to incorrectly select 'foreign key' because customer_id appears in multiple tables.

How to eliminate wrong answers

Option A is wrong because a foreign key is a column that references a primary key in another table to enforce referential integrity; customer_id in the transactions table is a foreign key referencing the customers table, but total revenue is not a key; it is a computed value. Option B is wrong because a composite attribute is an attribute that can be divided into smaller sub-parts (e.g., address into street, city, zip); total revenue is a single calculated value, not composed of multiple atomic sub-attributes. Option D is wrong because an atomic attribute is an indivisible value that is stored directly (e.g., price, quantity); total revenue is not stored but computed from other attributes, so 'derived' is the term that describes it.
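
The derivation can be made concrete with an in-memory SQLite table that has the columns named in the question. The rows and the quarter boundaries below are hypothetical; note that no `total_revenue` column is ever stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        product_id INTEGER,
        quantity INTEGER,
        price REAL,
        transaction_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 101, 5, 2, 10.0, "2024-01-15"),
        (2, 101, 6, 1, 25.0, "2024-02-03"),
        (3, 102, 5, 3, 10.0, "2024-03-20"),
        (4, 102, 7, 1, 99.0, "2023-12-31"),  # outside the hypothetical quarter
    ],
)

# total_revenue is not a stored column; it is derived from quantity and price.
revenue = conn.execute("""
    SELECT customer_id, SUM(quantity * price) AS total_revenue
    FROM transactions
    WHERE transaction_date BETWEEN '2024-01-01' AND '2024-03-31'
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

print(revenue)
```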

Full explanation →

139

MCQhard

An organization needs to acquire data from a third-party vendor. The data will be used for regulatory reporting. Which of the following should be the primary consideration before acquiring the data?

A.Legal and compliance requirements

B.Volume of data

C.Data format

D.Cost of the data

AnswerA

Regulatory reporting requires adherence to data governance and privacy laws.

Why this answer

When acquiring data for regulatory reporting, legal and compliance requirements must be the primary consideration because the data must adhere to specific laws (e.g., GDPR, HIPAA, SOX) and industry regulations. Failing to ensure compliance can result in legal penalties, fines, or rejection of the report by regulatory bodies. This overrides technical or cost concerns, as non-compliant data is unusable for its intended purpose.

Exam trap

The trap here is that candidates prioritize technical or cost factors (volume, format, price) over the foundational legal and compliance gate, mistakenly assuming any data can be adapted later without verifying regulatory fitness first.

How to eliminate wrong answers

Option B is wrong because the volume of data is a secondary operational concern (e.g., storage, processing bandwidth) but does not address whether the data legally satisfies regulatory mandates. Option C is wrong because data format (e.g., CSV, JSON, XML) is a technical integration detail that can be transformed later, not a primary legal or compliance gate. Option D is wrong because cost is a business negotiation factor; even free data must first meet regulatory requirements to be used for reporting.

Full explanation →

140

MCQmedium

A data analyst is building a dataset from multiple sources and needs to ensure data quality. During the data acquisition phase, which activity is most important to perform?

A.Data visualization

B.Data cleaning

C.Data profiling

D.Data modeling

AnswerC

Profiling assesses data quality and structure before further processing.

Why this answer

Data profiling is the most important activity during the data acquisition phase because it involves examining source data to understand its structure, content, and quality issues before integration. This step identifies missing values, data types, duplicates, and inconsistencies early, preventing downstream errors in analysis. Without profiling, subsequent cleaning and modeling may be based on flawed assumptions about the data.

Exam trap

CompTIA often tests the distinction between data profiling (discovery/assessment) and data cleaning (correction), leading candidates to mistakenly choose cleaning as the first step during acquisition when profiling must come first to identify what needs cleaning.

How to eliminate wrong answers

Option A is wrong because data visualization is a presentation and exploratory analysis technique used after data is acquired and cleaned, not during acquisition. Option B is wrong because data cleaning is a corrective process that typically follows data profiling; performing cleaning without first profiling can waste effort on unknown issues or miss critical quality problems. Option D is wrong because data modeling defines relationships and structures for storage or analysis, which occurs after data is acquired and understood, not during the initial acquisition phase.
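
A very small profiling pass might look like the sketch below. The records and column names are made up for illustration; dedicated profiling tools report far more, but the checks shown (missing values, mixed types, duplicate rows) are the kinds of issues profiling is meant to surface before any cleaning begins.

```python
from collections import Counter

# Hypothetical records pulled from a source system.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": None, "age": 29},                # exact duplicate row
    {"id": 3, "email": "c@example.com", "age": "41"},   # age stored as text
]
columns = ["id", "email", "age"]

# Missing values per column.
missing = {c: sum(r[c] is None for r in rows) for c in columns}

# Python types observed per column (more than one type signals inconsistency).
types_seen = {
    c: {type(r[c]).__name__ for r in rows if r[c] is not None} for c in columns
}

# Fully duplicated rows.
row_counts = Counter(tuple(r[c] for c in columns) for r in rows)
duplicate_rows = sum(count - 1 for count in row_counts.values())

print("missing:", missing)
print("types:", types_seen)
print("duplicate rows:", duplicate_rows)
```

Nothing is corrected at this stage; the output only tells the analyst what the later cleaning step will have to address.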

Full explanation →

141

MCQhard

A company is analyzing customer feedback sentiment. The dataset is highly imbalanced with 95% positive and 5% negative comments. Which technique should the analyst use to address class imbalance before modeling?

A.Use accuracy as the evaluation metric

B.Undersample the majority class

C.Oversample the majority class

D.Use SMOTE

AnswerD

SMOTE generates synthetic minority samples to balance classes.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the correct choice because it generates synthetic samples for the minority class (negative comments) by interpolating between existing minority instances, rather than simply duplicating them. This addresses the 95:5 imbalance without the information loss of undersampling or the overfitting risk of naive oversampling.

Exam trap

The trap here is that candidates often confuse oversampling the minority class with oversampling the majority class, or they incorrectly assume that simply using a different evaluation metric (like accuracy) can fix the imbalance problem without modifying the dataset.

How to eliminate wrong answers

Option A is wrong because accuracy is a misleading metric for imbalanced datasets; a model predicting all comments as positive would achieve 95% accuracy but fail to identify any negative comments. Option B is wrong because undersampling the majority class discards a large amount of potentially useful data, which can lead to loss of important patterns and reduced model performance. Option C is wrong because oversampling the majority class would exacerbate the imbalance, making the model even more biased toward the majority class.
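
The interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified illustration, not the full algorithm: real SMOTE interpolates toward one of the k nearest minority neighbours, whereas the hypothetical `smote_like` function below picks the partner point at random. The feature vectors are made up.

```python
import random


def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between pairs of minority points.

    Simplification: real SMOTE interpolates toward one of the k nearest
    minority neighbours; here the partner is chosen at random.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic


# Hypothetical 2-D feature vectors for the minority (negative-comment) class.
minority = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0)]
new_points = smote_like(minority, n_new=5)

for point in new_points:
    print(point)
```

Each synthetic point lies on a line segment between two real minority points, so the new samples are plausible variations rather than exact copies.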

Full explanation →

142

Multi-Selecthard

Which THREE are challenges in acquiring data from external sources? (Select three.)

Select 3 answers

A.Data redundancy

B.Unauthorized access

C.Licensing restrictions

D.Rate limiting

E.Data format inconsistency

AnswersC, D, E

External data may have legal restrictions on usage, sharing, or redistribution.

Why this answer

Data format inconsistency occurs when integrating data from different sources. Rate limiting is a common API restriction that limits how much data can be accessed. Licensing restrictions may limit the use or redistribution of acquired data.

Data redundancy is an internal data quality issue, not a challenge specific to acquisition. Unauthorized access is a security concern but not a typical acquisition challenge.

Full explanation →

143

Multi-Selecteasy

Which THREE of the following are examples of descriptive statistics? (Select THREE.)

Select 3 answers

A.Correlation coefficient

B.Mean

C.P-value

D.Regression coefficient

E.Standard deviation

AnswersA, B, E

Correlation coefficient describes the strength of a linear relationship, a descriptive statistic.

Why this answer

The correlation coefficient (A) is a descriptive statistic because it quantifies the strength and direction of a linear relationship between two variables using a single number (ranging from -1 to +1) without making inferences about a larger population. It simply describes the observed data's association, which is the core function of descriptive statistics. The mean (B) summarizes central tendency and the standard deviation (E) summarizes spread; both describe the observed data directly.

Exam trap

CompTIA often tests the distinction between descriptive and inferential statistics by including p-values and regression coefficients as distractors, exploiting the common misconception that any numerical summary of data is descriptive.
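
All three correct options can be computed from the observed values alone, with no reference to a wider population. The sketch below does so for made-up paired observations, using the standard `statistics` module for the mean and sample standard deviation and a hand-written Pearson formula for the correlation.

```python
import statistics

# Hypothetical paired observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mean_x = statistics.mean(x)
stdev_x = statistics.stdev(x)  # sample standard deviation

# Pearson correlation computed by hand from the observed values.
mean_y = statistics.mean(y)
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
ss_x = sum((a - mean_x) ** 2 for a in x)
ss_y = sum((b - mean_y) ** 2 for b in y)
r = cov / (ss_x ** 0.5 * ss_y ** 0.5)

print(f"mean = {mean_x:.2f}, stdev = {stdev_x:.3f}, r = {r:.3f}")
```

A p-value, by contrast, would require a hypothesis about the population the sample came from, which is what makes it inferential.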

144

MCQmedium

A retail company's data analyst developed a dashboard for store managers to monitor daily sales performance. The dashboard includes numerous metrics such as sales by hour, product category, employee, and customer demographics, along with trend lines and forecast graphs. Despite the comprehensive data, store managers are ignoring the dashboard because they find it cluttered and confusing. They prefer to rely on their intuition and verbal updates from shift leads. The analyst needs to improve communication of data insights to ensure the dashboard is used effectively. Which of the following actions should the analyst take FIRST?

A.Send the raw data in a spreadsheet instead

B.Simplify the dashboard by focusing on key metrics and using clear visual hierarchy

C.Schedule a training session to explain all metrics

D.Add more data points to provide a comprehensive view

AnswerB

This directly addresses the managers' feedback and makes insights easier to grasp.

Why this answer

The core issue is that the dashboard is cluttered and confusing, which directly undermines its usability. Option B addresses this by simplifying the dashboard to focus on key metrics and using a clear visual hierarchy, which is the foundational step in effective data communication. Without first reducing cognitive load, no amount of training or additional data will make the dashboard useful for time-constrained store managers.

Exam trap

The trap here is that candidates may confuse 'comprehensive data' with 'effective communication,' leading them to choose options that add more information (D) or provide raw data (A), rather than recognizing that clarity and focus are the primary drivers of dashboard adoption.

How to eliminate wrong answers

Option A is wrong because sending raw data in a spreadsheet would exacerbate the problem by overwhelming managers with unstructured, granular data, requiring them to perform their own analysis—the opposite of a dashboard's purpose. Option C is wrong because scheduling a training session to explain all metrics assumes the problem is a lack of understanding, not the dashboard's poor design; training a user to navigate a cluttered interface is inefficient and does not fix the root cause. Option D is wrong because adding more data points would increase clutter and confusion, directly contradicting the user feedback that the dashboard is already too complex.

145

MCQhard

The exhibit shows a SQL query result intended for a bar chart of revenue by region. The chart shows only the top 10 regions, even though the query returns all regions. What is the most likely cause?

A.The GROUP BY clause is incorrect

B.The visualization tool has a default limit on the number of categories displayed

C.The query is missing a WHERE clause

D.The ORDER BY clause is ignored in the chart

AnswerB

Many tools limit categories to avoid clutter unless configured otherwise.

Why this answer

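The query has no LIMIT clause, so it returns all rows; the truncation must therefore happen in the charting layer. Option B is correct because the data is not truncated at the source, and the visualization tool is most likely set, or defaults, to show only the top 10 categories.

Options A, C, and D are incorrect because the query itself is correct. An incorrect GROUP BY or a missing WHERE clause would change which rows are returned rather than cap the chart at 10 bars, and an ignored ORDER BY would change the order of the bars, not their number.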
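The gap between what a query returns and what a chart displays can be sketched as follows. `categories_to_plot` and its `max_categories` default are hypothetical stand-ins for a charting tool's behaviour, not the API of any real product, and the rows are invented.

```python
# Hypothetical query result: one (region, revenue) row per region, 15 regions in total.
query_rows = [(f"Region {i:02d}", 1000 * i) for i in range(1, 16)]

def categories_to_plot(rows, max_categories=10):
    """Mimic a chart tool that keeps only the largest categories by default."""
    ranked = sorted(rows, key=lambda row: row[1], reverse=True)
    return ranked if max_categories is None else ranked[:max_categories]

plotted = categories_to_plot(query_rows)            # default limit applies
plotted_all = categories_to_plot(query_rows, None)  # limit switched off

print(len(query_rows), len(plotted), len(plotted_all))
```

Comparing the row count of the result set with the number of bars drawn is a quick way to confirm that the limit lives in the visualization settings rather than in the SQL.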

146

Multi-Selecthard

Which TWO of the following are valid techniques for validating the performance of a predictive model?

Select 2 answers

A.Bootstrapping

B.Feature scaling

C.Train-test split

D.K-fold cross-validation

E.Increasing training data

AnswersC, D

Splitting data into training and testing sets is a basic validation approach.

Why this answer

The train-test split (Option C) is a fundamental technique for validating predictive model performance: the dataset is partitioned into separate training and testing subsets so that the model is evaluated on data it has not seen. This helps reveal overfitting and gives an estimate of how well the model generalizes, making it standard practice in supervised learning workflows. K-fold cross-validation (Option D) extends the same idea by rotating the held-out portion across k folds so that every record is used for testing exactly once.

Exam trap

CompTIA often tests the distinction between data preprocessing techniques (like feature scaling) and actual model validation methods, leading candidates to mistakenly select feature scaling as a validation technique because it is a common step in the modeling pipeline.
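Both validation techniques come down to how rows are partitioned, which the following standard-library sketch illustrates. The helper names `train_test_split` and `k_fold_indices` are written for this example (they are not imported from any library), and the ten-row dataset is a placeholder.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs so each row is tested exactly once."""
    indices = list(range(n_rows))
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

rows = list(range(10))  # placeholder dataset of 10 records
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(rows), k=5))

print(len(train), len(test), len(folds))
```

Feature scaling, by comparison, transforms the inputs but never sets data aside for evaluation, which is why it is preprocessing rather than validation.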

147

MCQmedium

A company's sales dashboard shows that the current month's revenue is $1.2M, which is 10% below the target of $1.33M. The analyst wants to highlight this shortfall. Which method of data presentation is most effective?

A.Show a trend line of the last 12 months.

B.Provide a table of all monthly targets and actuals.

C.Display the actual revenue only.

D.Use a bullet chart showing actual vs. target.

AnswerD

A bullet chart provides a concise visual comparison of actual value to a target, highlighting the shortfall.

Why this answer

A bullet chart is the most effective method because it is specifically designed to show performance against a target, combining a bar for the actual value ($1.2M) with a reference line or marker for the target ($1.33M). This allows the analyst to immediately visualize the 10% shortfall in a compact, high-density format without needing to compare separate numbers or interpret a trend. It directly addresses the goal of highlighting the variance, which is a core principle in data presentation for performance dashboards.

Exam trap

CompTIA often tests the misconception that a trend line or table provides sufficient context for a single variance, when in fact the bullet chart is the optimal choice for directly comparing actual vs. target in a single, focused visual.

How to eliminate wrong answers

Option A is wrong because a trend line of the last 12 months shows historical patterns but does not explicitly highlight the current month's shortfall against the target; it buries the key insight in a broader time series. Option B is wrong because a table of all monthly targets and actuals requires the viewer to manually scan and compare numbers, which is less efficient and less visually immediate for highlighting a single variance than a bullet chart. Option C is wrong because displaying only the actual revenue ($1.2M) omits the target entirely, making it impossible to identify the shortfall without external context.
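The variance that a bullet chart encodes is simple arithmetic on the two figures from the question. The sketch below works it through; the variable names are chosen for illustration.

```python
actual = 1_200_000  # current month's revenue
target = 1_330_000  # monthly target

shortfall = target - actual
shortfall_pct = shortfall / target * 100   # gap as a share of the target
attainment_pct = actual / target * 100     # how far the bar reaches toward the marker

print(f"shortfall: ${shortfall:,} ({shortfall_pct:.1f}% below target)")
print(f"attainment: {attainment_pct:.1f}% of target")
```

With these inputs the gap works out to about 9.8% of the target, which the question rounds to 10%.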

148

MCQhard

A data governance team is drafting a policy for handling personally identifiable information (PII). According to data governance best practices, which document should define the classification levels and handling procedures?

A.Data dictionary

B.Data classification policy

C.Data quality report

D.Data flow diagram

AnswerB

A data classification policy categorizes data by sensitivity and outlines handling rules.

Why this answer

The data classification policy is the authoritative document that defines classification levels (e.g., public, internal, confidential, restricted) and specifies handling procedures for each category, including PII. This aligns with data governance best practices, as it establishes the rules for labeling, storing, transmitting, and disposing of sensitive data. A data dictionary describes metadata and schema, not classification rules.

Exam trap

The trap here is that candidates confuse the data dictionary (which describes data structure) with the data classification policy (which governs data sensitivity and handling), leading them to select the dictionary as the document that defines classification levels.

How to eliminate wrong answers

Option A is wrong because a data dictionary documents metadata such as field names, data types, and definitions, but it does not define classification levels or handling procedures for PII. Option C is wrong because a data quality report measures data accuracy, completeness, and consistency, not security or classification policies. Option D is wrong because a data flow diagram visually maps how data moves between systems, but it does not prescribe classification levels or handling rules.

149

MCQmedium

A data analyst creates a dashboard for operational metrics. The operations team reports that the dashboard is confusing because it shows too many metrics on one screen. Which design principle should the analyst apply?

A.Apply progressive disclosure

B.Increase white space

C.Use a single chart type

D.Add more filters

AnswerA

Progressive disclosure shows a summary first and allows drilling into details.

Why this answer

The correct answer is A because progressive disclosure is a design principle that presents only the most critical information initially, with the option to reveal additional details as needed. This directly addresses the operations team's complaint of too many metrics on one screen by reducing cognitive load and allowing users to drill down into specific metrics when required. In dashboard design, this is often implemented through expandable sections, hover-over tooltips, or click-through layers.

Exam trap

The trap here is that candidates often confuse 'reducing clutter' (white space) with 'reducing information overload' (progressive disclosure), or they mistakenly believe that adding more filters will simplify the initial view, when in fact filters only change what is shown without addressing the core issue of too many metrics displayed at once.

How to eliminate wrong answers

Option B is wrong because increasing white space improves visual clarity and reduces clutter, but it does not solve the problem of too many metrics being displayed simultaneously; it merely spaces them out. Option C is wrong because using a single chart type does not reduce the number of metrics shown; it may even force inappropriate visualization of diverse data types, leading to misinterpretation. Option D is wrong because adding more filters gives users control over what data is displayed, but it does not address the initial overload of visible metrics; filters are a complementary feature, not a primary solution for reducing on-screen complexity.

150

MCQmedium

An analyst builds a dashboard with a gauge showing 'Current Inventory Level' as a percentage. Stakeholders find the gauge misleading because it always shows near 100% even when inventory is low. What is the most likely issue?

A.The maximum value of the gauge is set too high

B.The gauge updates too slowly

C.The gauge uses green for all values

D.The gauge needle is too small

AnswerA

The gauge's scale configuration determines the percentage shown, so a mismatched maximum makes the reading misleading.

Why this answer

The percentage a gauge displays is the current value measured against the gauge's configured scale, so a reading that stays near 100% while inventory is low points to a scale that does not match the real inventory range. Option A is the only choice that concerns the scale; update speed, color, and needle size do not change the value displayed. Note the direction of the effect when checking the configuration: on a scale of 0 to 10,000 units, an inventory of 100 units reads as 1%, so an oversized maximum pushes the reading toward 0%, whereas a reading stuck near 100% usually means the values are reaching or exceeding the configured maximum.

Exam trap

The trap here is that candidates may focus on visual or performance issues (like update speed or color) instead of recognizing that the gauge's scale configuration is the root cause of the misleading percentage display.

How to eliminate wrong answers

Option B is wrong because a slow update rate would cause stale data, not a consistently high reading; the gauge would eventually show the correct low value after refreshing. Option C is wrong because using green for all values is a color-coding issue that affects interpretation of thresholds, not the gauge's percentage calculation or scale. Option D is wrong because the needle size is a visual design choice that does not impact the underlying data representation or the percentage value displayed.
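How the scale bounds drive a gauge reading can be sketched in a few lines. `gauge_percent` is a hypothetical helper written for this example, and the stock level and scale maxima are assumed values, not figures from the exhibit.

```python
def gauge_percent(value, gauge_min, gauge_max):
    """Return the gauge reading as a percentage of its scale, clamped to 0-100."""
    fraction = (value - gauge_min) / (gauge_max - gauge_min)
    return max(0.0, min(1.0, fraction)) * 100

stock = 100  # units currently on hand (assumed)

reading_wide = gauge_percent(stock, 0, 10_000)  # maximum far above the stock level
reading_fit = gauge_percent(stock, 0, 1_000)    # maximum close to real capacity
reading_narrow = gauge_percent(stock, 0, 50)    # maximum below the stock level, so it clamps

print(reading_wide, reading_fit, reading_narrow)
```

The same stock level produces very different readings depending on the configured bounds, so checking the minimum and maximum against real capacity is the first diagnostic step for a misleading gauge.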
