Knowledge + Practice

CompTIA Data+ (DA0-001): Questions 901–975

982 questions total · 14 pages · All types, answers revealed

901

MCQ · hard

A dataset contains a column 'Education Level' with values: 'High School', 'Bachelor', 'Master', 'PhD'. An analyst computes the average by assigning numbers 1-4. Which data concept is being violated?

A. Misclassifying data as structured

B. Treating ordinal data as interval

C. Treating nominal data as ordinal

D. Treating ratio data as interval

Answer: B

Assigning numbers and averaging assumes equal intervals, which ordinal data lacks.

Why this answer

The analyst assigned numeric values (1-4) to 'Education Level' categories and computed an average. This treats the ordinal data as if it were interval data, assuming equal spacing between categories (e.g., the difference between 'High School' and 'Bachelor' is the same as between 'Master' and 'PhD'), which is not valid. Ordinal data only preserves order, not magnitude or equal intervals, so calculating a mean is inappropriate.

Exam trap

CompTIA often tests the distinction between ordinal and interval scales by presenting a scenario where a mean is computed on ranked categories, tempting candidates to think the error is about nominal vs. ordinal (Option C) rather than the misuse of arithmetic operations on ordinal data.

How to eliminate wrong answers

Option A is wrong because misclassifying data as structured refers to incorrectly labeling unstructured data (e.g., text) as structured, but the dataset already has a structured column; the violation is about measurement scale, not structure. Option C is wrong because treating nominal data as ordinal would involve imposing an order on unordered categories (e.g., colors), but 'Education Level' already has a natural order, so the error is not about misordering but about assuming equal intervals. Option D is wrong because treating ratio data as interval would ignore a true zero point (e.g., income), but 'Education Level' has no meaningful zero, so the violation is not about ratio vs. interval but about ordinal vs. interval.

902

MCQ · medium

During exploratory data analysis, you calculate the IQR for a numeric column and find that several data points fall below Q1 - 1.5*IQR. These points are likely:

A. Normal variations within the distribution

B. The mode of the dataset

C. The median of the dataset

D. Outliers

Answer: D

The IQR method identifies outliers.

Why this answer

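The IQR method flags any point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as a potential outlier, so points below the lower fence fall into that category. A flagged point is not necessarily an error; it is simply unusually far from the middle 50% of the data.

These points are not the median or the mode, which are summary statistics rather than values defined by the fences.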
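
The fence calculation can be sketched in a few lines of Python. This is a minimal illustration, not the only valid method: the sample values are hypothetical, and quartiles are computed with the standard library's "inclusive" interpolation, so other tools may place the fences slightly differently.

```python
import statistics

def iqr_fences(values):
    """Return (lower_fence, upper_fence) using the 1.5 * IQR rule."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical sample: most values sit near 50, two readings are very low.
sample = [2, 5, 48, 49, 50, 50, 51, 52, 53, 54, 55, 56]
lower, upper = iqr_fences(sample)

# Anything outside the fences is flagged as a potential outlier.
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

Here only the two low readings fall below the lower fence, which matches the scenario in the question.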

903

MCQ · hard

You are a data analyst at a retail company. Your team has deployed a new dashboard that displays real-time inventory levels across all warehouses. The dashboard uses a traffic light color scheme (green = sufficient, yellow = low, red = out of stock). After one week, the warehouse managers complain that the dashboard is overwhelming because it shows all 500+ products at once, and they cannot quickly identify which items need immediate attention. They request a way to prioritize items that are red or yellow. Additionally, the dashboard updates every second, causing the screen to flicker constantly, which they find distracting. You need to redesign the dashboard to address these issues. Which course of action should you take?

A. Add a filter to show only items with red or yellow status, and reduce the refresh rate to every 5 minutes.

B. Replace the dashboard with a static table that shows all products sorted alphabetically, and increase the refresh rate to every 2 seconds.

C. Remove the color scheme and display only numerical values, and keep the refresh rate at 1 second.

D. Add additional colors to distinguish more inventory levels and increase the refresh rate to every 500 milliseconds.

Answer: A

Filter reduces clutter; slower refresh eliminates flicker.

Why this answer

Option A is correct because it directly addresses both complaints: filtering to red and yellow items reduces cognitive overload by hiding the products that need no attention, and reducing the refresh rate from every second to every 5 minutes eliminates the distracting flicker while keeping inventory data reasonably current. This balances usability with the need for timely alerts.

Exam trap

The trap here is that candidates may think increasing refresh rate improves data accuracy, but the real issue is user distraction and information overload, not data latency.

How to eliminate wrong answers

Option B is wrong because sorting alphabetically does not help prioritize critical items (red/yellow), and increasing the refresh rate to every 2 seconds would worsen the flicker issue. Option C is wrong because removing the color scheme removes the quick visual prioritization that the traffic light system provides, and keeping the 1-second refresh rate does not solve the flicker problem. Option D is wrong because adding more colors would increase visual complexity, not reduce it, and increasing the refresh rate to every 500 milliseconds would make the flicker even more distracting.

904

MCQ · hard

A hospital's analytics team has created a dashboard for tracking patient wait times across departments. The dashboard uses a stacked bar chart showing average wait time per department, with each bar segmented by severity level (Low, Medium, High). However, management complains that it is difficult to compare total wait times across departments or identify which department has the highest average wait time. The data itself is accurate and complete. The analyst needs to redesign the visualization to address these concerns. Which course of action should the analyst take?

A. Replace the stacked bar chart with multiple line charts, one per department.

B. Use a heat map with departments on one axis and severity levels on the other, with color intensity representing wait time.

C. Change the chart to a grouped bar chart, with each department having separate bars for each severity level placed side-by-side.

D. Switch to a pie chart showing the proportion of total wait time each department contributes.

Answer: C

This allows direct comparison of totals and individual segments across departments.

Why this answer

Option C is correct. A grouped bar chart places separate bars for each severity level side by side within each department, so every bar starts from a common baseline. That allows direct comparison of wait times across departments and makes the department with the highest average wait time easy to identify. Option A is incorrect because line charts are not suited to categorical comparisons, and one chart per department would clutter the view.

Option D is incorrect because a pie chart shows only each department's share of a total and does not convey severity levels. Option B is incorrect because a heat map shows patterns but makes exact values harder to read and does not clearly compare total wait times across departments.

905

MCQ · easy

A data analyst is summarizing the central tendency of a dataset with extreme outliers. Which measure is most robust to outliers?

A. Standard deviation

B. Median

C. Mean

D. Range

Answer: B

Median is robust to outliers.

Why this answer

The median depends only on the middle of the sorted data, so extreme values barely move it, whereas the mean is pulled toward them.
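
A short Python sketch makes the difference concrete. The salary figures below are hypothetical and chosen only to show how a single extreme value shifts each measure.

```python
import statistics

# Hypothetical salaries in thousands; 900 is a deliberately extreme value.
salaries = [42, 45, 47, 50, 52]
with_outlier = salaries + [900]

mean_before = statistics.mean(salaries)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(salaries)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```

The mean roughly quadruples once the outlier is added, while the median moves by only 1.5.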

906

MCQ · easy

A data analyst is working with a dataset containing customer information. The dataset includes a column 'full_name' which stores first and last names together. To perform analysis on first names separately, which data concept describes the process of splitting 'full_name' into 'first_name' and 'last_name'?

A. Data deduplication

B. Data summarization

C. Data normalization

D. Data aggregation

Answer: C

Normalization reduces redundancy and breaks down attributes.

Why this answer

Option C is correct because data normalization is the process of organizing data to reduce redundancy and improve integrity, which includes splitting composite attributes like 'full_name' into atomic values ('first_name', 'last_name'). This aligns with the first normal form (1NF) principle in database design, where each column should contain indivisible values. The data analyst is decomposing a single field into multiple, more granular fields to enable separate analysis.

Exam trap

The trap here is that candidates confuse data normalization with data aggregation or summarization, because both involve restructuring data, but normalization focuses on reducing redundancy and achieving atomicity, not on computing summary statistics.

How to eliminate wrong answers

Option A is wrong because data deduplication refers to identifying and removing duplicate records or entries, not splitting a single column into multiple columns. Option B is wrong because data summarization involves aggregating or condensing data (e.g., calculating averages or totals) to provide a high-level view, not decomposing a field. Option D is wrong because data aggregation combines multiple data points into a single summary value (e.g., sum, count), which is the opposite of splitting a field into more granular components.
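
As a rough illustration of making a composite field atomic, the sketch below splits on the first space. The function name and sample rows are hypothetical, and the rule "first token is the first name" is an assumption that real name data often violates.

```python
def split_full_name(full_name):
    """Split 'First Last' into (first_name, last_name) on the first space."""
    # partition returns (before, separator, after); everything after the
    # first space is treated as the last name.
    first_name, _, last_name = full_name.strip().partition(" ")
    return first_name, last_name.strip()

# Hypothetical customer rows with a composite 'full_name' column.
rows = [{"full_name": "Ada Lovelace"}, {"full_name": "Grace Brewster Hopper"}]
for row in rows:
    row["first_name"], row["last_name"] = split_full_name(row["full_name"])

print(rows)
```

After the split, each row carries separate `first_name` and `last_name` values that can be grouped or filtered independently.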

907

MCQ · medium

An analyst is creating a report that includes multiple charts. To ensure the audience quickly grasps the key insight, which principle of data storytelling should be applied?

A. Avoid using titles to reduce clutter

B. Include a legend for every chart

C. Use a title that states the main insight

D. Place the chart before any explanation

Answer: C

A title that summarizes the finding helps the audience understand the message immediately.

Why this answer

Starting with a clear title that summarizes the takeaway guides the audience to the main insight. Using a legend only or placing the chart first does not provide context. Avoiding titles reduces clarity.

908

Multi-Select · medium

Which TWO are valid data acquisition methods? (Select two.)

Select 2 answers

A. Web scraping

B. Data normalization

C. API calls

D. Data encryption

E. Data profiling

Answers: A, C

Web scraping extracts data from websites and is a common acquisition method.

Why this answer

Web scraping and API calls are direct methods to acquire data from external sources. Data profiling and data normalization are data preparation techniques, not acquisition. Data encryption is a security measure.

909

MCQ · easy

You are using pandas in Python to clean a dataset. You notice several rows with missing values in the 'age' column. Which method would you use to remove those rows?

A. df.drop_duplicates()

B. df.dropna()

C. df.fillna(0)

D. df.isna()

Answer: B

Removes rows with any missing values by default.

Why this answer

df.dropna() removes rows with any missing values by default. df.fillna() fills missing values, df.isna() returns a boolean mask, df.drop_duplicates() removes duplicate rows.
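
The sketch below contrasts the default behavior with a call restricted to the 'age' column. It assumes pandas is installed; the DataFrame contents are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'age' is missing for two rows, 'city' for one.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [34, None, 29, None],
    "city": ["Oslo", "Rome", None, "Lima"],
})

# Default: drop every row that has a missing value in ANY column.
all_complete = df.dropna()

# Restrict the check to 'age' so rows missing only 'city' are kept.
age_present = df.dropna(subset=["age"])

print(len(all_complete), len(age_present))
```

With the default call only one row survives, whereas `subset=["age"]` keeps the row that is missing only a city. When the goal is specifically to remove rows lacking an age, the `subset` argument avoids discarding more data than intended.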

910

MCQ · medium

A data analyst is creating a report on customer satisfaction scores. The analyst wants to ensure that regional managers can only see data for their own region. Which security measure should be applied?

A. Data encryption

B. Row-level security

C. Single version of truth

D. Data masking

Answer: B

Row-level security filters data based on user attributes.

Why this answer

Row-level security restricts data visibility so that users see only the rows relevant to them, such as their region.

911

MCQ · medium

A database administrator is designing a normalized database to reduce data redundancy. They have a table with columns: OrderID, ProductID, ProductName, and Quantity. The table is currently in 1NF. To move to 2NF, which issue must be resolved?

A. The table has repeating groups

B. ProductName depends only on ProductID, causing a partial dependency

C. Quantity depends on both OrderID and ProductID

D. The table has a transitive dependency

Answer: B

Partial dependency on part of a composite key violates 2NF.

Why this answer

To achieve 2NF, the table must be in 1NF and have no partial dependencies, where a non-key column depends on only part of a composite primary key. Here, ProductName depends solely on ProductID, not on the full composite key (OrderID, ProductID), creating a partial dependency. Moving ProductName to a separate Product table keyed by ProductID resolves this and brings the design to 2NF.

Exam trap

CompTIA often tests the distinction between partial dependencies (2NF) and transitive dependencies (3NF); candidates mistakenly choose a transitive dependency when the real issue is a partial dependency on a composite key.

How to eliminate wrong answers

Option A is wrong because repeating groups are a 1NF violation, and the table is already stated to be in 1NF, so this issue is already resolved. Option C is wrong because Quantity depending on both OrderID and ProductID is a full functional dependency on the composite key, which is acceptable and does not violate 2NF. Option D is wrong because a transitive dependency (where a non-key column depends on another non-key column) is a 3NF violation, not a 2NF issue.

912

MCQ · medium

A data analyst is cleaning a dataset and finds that a numeric field has several missing values. The variable is normally distributed. Which imputation method is most appropriate?

A. Median imputation

B. Mean imputation

C. Mode imputation

D. Forward-fill

Answer: B

Mean is appropriate for symmetric distributions.

Why this answer

For a normally distributed variable the mean and median nearly coincide, so mean imputation is a common choice. It preserves the column mean, although it understates the variance.
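
A minimal sketch of mean imputation, using only the standard library. The function name and the height values are hypothetical, and missing entries are represented as `None`.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical heights in cm with two missing readings.
heights_cm = [170, None, 165, 175, None, 170]
imputed = impute_mean(heights_cm)

observed = [v for v in heights_cm if v is not None]
print(imputed)
print(statistics.mean(observed), statistics.mean(imputed))
print(statistics.pstdev(observed), statistics.pstdev(imputed))
```

The mean is identical before and after imputation, while the spread of the imputed column is smaller because every filled value sits exactly at the center.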

913

Multi-Select · hard

Which TWO of the following are best practices when creating a data visualization for a presentation to stakeholders?

Select 2 answers

A. Including interactive filters when possible to allow stakeholders to explore data

B. Using a consistent color scheme that aligns with company branding

C. Placing the legend at the bottom of the chart

D. Using 3D effects to make the chart visually appealing

E. Using only one type of chart throughout the presentation for consistency

Answers: A, B

Interactivity lets stakeholders focus on relevant data.

Why this answer

Option A is correct because interactive filters (e.g., slicers, parameter controls) empower stakeholders to drill down into specific data subsets, ask ad-hoc questions, and uncover insights without overwhelming the initial view. This aligns with the principle of user-driven exploration, which is a best practice for stakeholder presentations where different audience members may have varying interests. Including interactivity also reduces the need for multiple static charts, keeping the presentation concise while still offering depth.

Exam trap

CompTIA often tests the misconception that visual embellishments like 3D effects or consistent chart types are always beneficial, when in fact they can obscure data accuracy and limit communicative effectiveness.

914

Multi-Select · hard

A data analyst is designing a database for a retail application. Which TWO of the following are valid reasons to use a NoSQL document database like MongoDB instead of a relational database? (Select 2)

Select 2 answers

A. The application requires high-speed transactional consistency

B. The data structure evolves frequently

C. The data is hierarchical, such as orders with line items

D. The data has a fixed schema with many relationships

E. The application needs complex joins across multiple tables

Answers: B, C

Document stores allow schema flexibility.

Why this answer

Option B is correct because NoSQL document databases like MongoDB are schema-flexible, allowing the data structure to evolve over time without requiring migrations or downtime. This is ideal for agile development where application requirements change frequently, as documents can have varying fields without breaking existing records.

Exam trap

The trap here is that candidates often assume NoSQL databases are always faster or more consistent, but the exam tests the specific trade-offs: document databases excel at flexible schemas and hierarchical data, not at transactional consistency or complex joins.

915

MCQ · easy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following measures of central tendency is least affected by extreme outliers?

A. Median

B. Range

C. Mode

D. Mean

Answer: A

The median is not affected by extreme values.

Why this answer

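The median is resistant to outliers because it is the middle value of the sorted data, whereas the mean is pulled toward extreme values. The mode is usually unaffected as well, but it is unstable or undefined for continuous data, and the range is a measure of spread rather than central tendency.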

916

MCQ · hard

A data analyst uses the elbow method to determine the number of clusters for k-means. The plot shows a sharp bend at k=3 and a small bend at k=5. What is the recommended number of clusters?

A. 5

B. The method is inconclusive.

C. 2

D. 3

Answer: D

The sharp bend suggests 3 clusters.

Why this answer

The elbow method suggests choosing k where the decrease in inertia becomes marginal; the sharp bend at 3 indicates the optimal k.

917

MCQ · hard

An analyst is creating a data story about the decline in customer satisfaction scores. The data shows that satisfaction dropped after a software update. Which narrative structure best helps the audience understand the cause and effect?

A. Start with the current scores, then show historical trend

B. Begin with the software update event, then show satisfaction before and after

C. Present all data points without a story

D. Use a pie chart of satisfaction categories

Answer: B

This chronological structure clearly shows the cause and its effect.

Why this answer

Option B is correct because it follows a chronological cause-and-effect narrative, starting with the software update event and then showing satisfaction scores before and after. This structure directly maps to the audience's need to understand the causal relationship, as it highlights the intervention point and the resulting change in the metric. In data storytelling, this is known as the 'before-and-after' or 'change-over-time' narrative, which is most effective for demonstrating impact.

Exam trap

The trap here is that candidates often choose Option A because they think starting with the current state is more engaging, but CompTIA tests the understanding that a chronological cause-and-effect narrative is required to clearly demonstrate the impact of a specific event, not just a general trend.

How to eliminate wrong answers

Option A is wrong because starting with current scores and then showing historical trend presents the data in reverse chronological order, which obscures the cause-and-effect relationship and forces the audience to infer the impact of the software update from the end of the story. Option C is wrong because presenting all data points without a story fails to provide context or a narrative arc, leaving the audience to interpret the cause-and-effect on their own, which is inefficient and can lead to misinterpretation. Option D is wrong because a pie chart of satisfaction categories shows a static distribution of current satisfaction levels, not the temporal change before and after the software update, and thus cannot illustrate cause and effect.

918

Matching · medium

Match each data type to its example.

Concepts

Matches

Country of origin (e.g., USA, Canada)

Customer satisfaction rating (1-5)

Temperature in Celsius

Annual income in dollars

Gender (Male, Female, Other)

Why these pairings

These are fundamental data types in statistics.

919

MCQ · easy

A hospital's patient records system must process thousands of small transactions per second. Which type of database system is best suited for this workload?

A. Data mart

B. OLTP

C. Data warehouse

D. OLAP

Answer: B

OLTP handles many concurrent short transactions efficiently.

Why this answer

OLTP (Online Transaction Processing) systems are designed to handle a high volume of small, concurrent transactions with low latency and high concurrency. This makes them ideal for a hospital patient records system that must process thousands of small transactions per second, such as patient check-ins, prescription updates, and billing entries.

Exam trap

The trap here is that candidates often confuse OLTP with OLAP, mistakenly thinking that 'processing many transactions' implies analytical processing, when in fact OLTP is the correct choice for high-frequency, small, write-heavy workloads.

How to eliminate wrong answers

Option A is wrong because a data mart is a subset of a data warehouse focused on a specific business line (e.g., cardiology), not designed for high-throughput transactional processing. Option C is wrong because a data warehouse is optimized for complex analytical queries on large historical datasets, not for handling thousands of small, real-time transactions per second. Option D is wrong because OLAP (Online Analytical Processing) is used for multidimensional analysis and reporting, not for high-frequency transactional workloads.

920

Multi-Select · medium

Which TWO data types are considered quantitative? (Select two.)

Select 2 answers

A. Customer satisfaction rating (1-5)

B. Temperature in Celsius

C. Product color

D. Zip code

E. Employee ID

Answers: A, B

Correct. Although ordinal, it is a numeric rating and considered quantitative.

Why this answer

Customer satisfaction rating (1-5) is strictly an ordinal scale, but it is recorded as numbers and is commonly treated as quantitative in practice, which is how this question classifies it. Averages of such ratings should be interpreted with care because the gaps between levels are not guaranteed to be equal. Temperature in Celsius is quantitative: it is a continuous numerical measurement on an interval scale, so differences between values are meaningful. The remaining options are labels rather than measurements.

Exam trap

The trap here is that candidates often mistake numeric-looking identifiers (like zip codes or employee IDs) for quantitative data, failing to recognize that they are nominal categorical variables with no mathematical significance.

921

Multi-Select · hard

A data analyst is creating a report in Power BI and needs to calculate the total sales for the current year and the previous year. Which TWO DAX functions can be used to retrieve the previous year's sales?

Select 2 answers

A. DATEADD

B. TOTALYTD

C. SAMEPERIODLASTYEAR

D. FILTER

E. PREVIOUSYEAR

Answers: A, C

Can shift dates by -1 year to get the previous year.

Why this answer

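SAMEPERIODLASTYEAR and DATEADD (with an interval of -1 year) both return the dates in the current filter context shifted back one year. TOTALYTD accumulates the current year to date, and FILTER is a general table function, not a time intelligence function. PREVIOUSYEAR does exist in DAX, but it returns every date in the prior year rather than the period parallel to the current selection, which is presumably why this answer key excludes it.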

922

MCQ · hard

A data scientist has a dataset with 50 variables and wants to identify clusters of similar observations. Which visualization technique is most suitable for reducing dimensionality to 2D while preserving cluster structure?

A. Heatmap of correlations

B. Scatter matrix (pairplot)

C. Parallel coordinates plot

D. Scatter plot of first two principal components

Answer: D

PCA reduces dimensions while preserving variance, enabling visualization of clusters.

Why this answer

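Option D is correct because PCA is designed for dimensionality reduction: plotting the first two principal components projects all 50 variables onto the two directions of greatest variance, where clusters often become visible (t-SNE is a common alternative). The other options show pairwise or per-variable views and become hard to read with 50 variables.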

923

MCQ · hard

An analyst needs to compute a running total of sales for each department, ordered by date. Which window function is most appropriate?

A. ROW_NUMBER() OVER (PARTITION BY department ORDER BY date)

B. SUM(sales) OVER (ORDER BY date)

C. SUM(sales) OVER (PARTITION BY department ORDER BY date)

D. LAG(sales, 1) OVER (PARTITION BY department ORDER BY date)

Answer: C

Partitions by department and orders by date to compute running total per department.

Why this answer

SUM() with OVER(PARTITION BY department ORDER BY date) computes a running total within each department.
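
The query can be tried against an in-memory SQLite database from Python. This is a sketch under assumptions: the table name `sales_log` and its rows are hypothetical, and the bundled SQLite must be version 3.25 or later, which is when window functions were introduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_log (department TEXT, date TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales_log VALUES (?, ?, ?)",
    [
        ("Toys", "2024-01-01", 100),
        ("Toys", "2024-01-02", 50),
        ("Books", "2024-01-01", 30),
        ("Books", "2024-01-03", 70),
    ],
)

# PARTITION BY restarts the sum for each department;
# ORDER BY date makes it accumulate row by row within the partition.
running_totals = conn.execute(
    """
    SELECT department, date, sales,
           SUM(sales) OVER (PARTITION BY department ORDER BY date) AS running_total
    FROM sales_log
    ORDER BY department, date
    """
).fetchall()

for row in running_totals:
    print(row)
```

Dropping the `PARTITION BY` clause, as in option B, would accumulate across all departments instead of restarting the total for each one.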

924

MCQ · easy

An e-commerce company wants to provide real-time personalized product recommendations based on customer browsing behavior. Currently, they have a traditional data warehouse that processes batch updates every night. The marketing team complains that recommendations are outdated within hours because customers see yesterday's data. The data engineer needs to modify the architecture to support near-real-time analytics. The budget is limited, and the existing warehouse infrastructure must be reused as much as possible. Which architectural change would best meet the requirement?

A. Replace the warehouse with an in-memory database for real-time processing.

B. Add more nodes to the warehouse cluster to speed up batch processing.

C. Implement a streaming data pipeline (e.g., Apache Kafka) that feeds a real-time recommendation engine.

D. Increase the frequency of batch load from nightly to every hour.

Answer: C

Correct: Streaming enables real-time analytics without replacing the warehouse.

Why this answer

Option C is correct because implementing a streaming data pipeline like Apache Kafka enables the ingestion and processing of customer browsing events in near real-time, feeding a dedicated recommendation engine that can update recommendations within seconds or minutes. This approach reuses the existing data warehouse for historical analytics and batch reporting while adding a lightweight streaming layer for low-latency recommendations, aligning with the limited budget and reuse requirement.

Exam trap

The trap here is that candidates may assume increasing batch frequency (Option D) is sufficient for near-real-time needs, but CompTIA tests the understanding that 'near-real-time' typically requires sub-minute latency, which batch processing cannot achieve due to scheduling overhead and resource contention.

How to eliminate wrong answers

Option A is wrong because replacing the warehouse with an in-memory database would discard the existing infrastructure entirely, incurring high migration costs and losing the warehouse's batch processing capabilities for other workloads, which violates the constraint to reuse the existing warehouse. Option B is wrong because adding more nodes to the warehouse cluster only improves the throughput of batch processing, but does not reduce the latency of data freshness—recommendations would still be based on data that is at least hours old, failing the near-real-time requirement. Option D is wrong because increasing batch frequency to every hour still introduces a delay of up to 60 minutes, which is insufficient for real-time personalization; moreover, frequent batch loads can cause resource contention and degrade warehouse performance for other queries.

925

Multi-Select · medium

A data analyst is performing hypothesis testing to compare the mean sales of two store locations. Which TWO conditions must be satisfied to use a two‑sample t‑test? (Select TWO.)

Select 2 answers

A. The data is paired between the two locations

B. The sample sizes are equal

C. The data is approximately normally distributed

D. The variances of the two populations are equal

E. The two samples are independent of each other

Answers: C, E

Normality is assumed for the t-test, though it is robust for large samples.

Why this answer

Option C is correct because the two-sample t-test assumes that the data in each group are approximately normally distributed. This is a key parametric assumption; if the sample sizes are large (typically n > 30), the Central Limit Theorem can relax this requirement, but for smaller samples, normality must hold to ensure valid test statistics and p-values.

Exam trap

CompTIA often tests the misconception that equal sample sizes or equal variances are required for a two-sample t-test, but the actual core assumptions are independence and normality (or large sample sizes via CLT).

926

MCQ · medium

A data analyst is performing data profiling on a customer table. Which metric provides the number of unique values in a column?

A. Row count

B. Cardinality

C. Standard deviation

D. Null count

Answer: B

Cardinality is the number of distinct values.

Why this answer

Cardinality refers to the number of distinct values in a column.
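
A small profiling sketch shows how cardinality differs from the other metrics listed in the options. The function name and the sample column are hypothetical, and nulls are represented as `None` and excluded from the distinct count.

```python
def profile_column(values):
    """Return basic profiling metrics for one column."""
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),                    # total rows, nulls included
        "null_count": len(values) - len(non_null),   # missing entries
        "cardinality": len(set(non_null)),           # distinct non-null values
    }

# Hypothetical 'state' column from a customer table.
states = ["NY", "CA", "NY", None, "TX", "CA"]
profile = profile_column(states)
print(profile)
```

Six rows and one null yield a cardinality of three, since only NY, CA, and TX are distinct.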

927

MCQ · medium

An analyst is performing a linear regression and obtains an R-squared value of 0.85. Which of the following is the best interpretation?

A. 85% of the residuals are zero.

B. 85% of the data points lie on the regression line.

C. There is an 85% chance that the relationship is causal.

D. The model explains 85% of the variability in the dependent variable.

Answer: D

This is the correct interpretation of R-squared.

Why this answer

R-squared indicates the proportion of variance in the dependent variable explained by the independent variable(s). 0.85 means 85% explained.
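
The definition can be checked by computing R-squared by hand for a one-variable least-squares fit, as 1 minus the ratio of residual to total sum of squares. The data points and variable names below are hypothetical.

```python
def r_squared(xs, ys):
    """R-squared of a simple least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Unexplained variation (residuals) versus total variation around the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: advertising spend versus revenue.
ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 5]
r2 = r_squared(ad_spend, revenue)
print(round(r2, 3))
```

For this sample the fitted line accounts for about 60% of the variation in revenue, even though none of the five points lies exactly on the line, which is why options A and B misread the statistic.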

928

MCQ · hard

A data analyst is working with a relational database that contains a table of customer orders. To optimize query performance for a report that filters by order date and customer ID, the analyst wants to create an index. Which type of index would be most effective for queries that filter on both columns?

A. B-tree index on order_date

B. Hash index on customer_id

C. Composite index on (order_date, customer_id)

D. Clustered index on order_id

Answer: C

A composite index on both columns allows the database to use the index for queries filtering on both columns, improving performance.

Why this answer

Option C is correct because a composite B-tree index on (order_date, customer_id) lets the database resolve both predicates from one index instead of looking up rows by one column and filtering the rest. With equality filters on both columns, the index can seek straight to the matching entries. If the date filter is a range, many engines seek on the leading order_date column and then check customer_id within the index entries, which still avoids visiting table rows that fail either condition.

Exam trap

The trap here is that candidates often choose a single-column index (A or B) thinking it will be sufficient, not realizing that a composite index is required to avoid a 'filter' step that scans many rows after the index lookup.

How to eliminate wrong answers

Option A is wrong because a single-column B-tree index on order_date can only efficiently filter by date; any additional filter on customer_id would require a separate lookup or a full scan of the date-matched rows, leading to poor performance. Option B is wrong because a hash index on customer_id only supports equality lookups and cannot handle range queries on order_date, making it unsuitable for date-range filtering. Option D is wrong because a clustered index on order_id physically reorders the table by order_id, which does not help with filtering on order_date or customer_id and may even degrade performance for these queries due to unnecessary key lookups.
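
One way to see the composite index being picked up is to ask SQLite for its query plan. This is an illustrative sketch only: the `orders` schema and index name are hypothetical, the query uses equality filters on both columns, and other database engines report plans in different formats and may choose differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT, "
    "customer_id INTEGER, amount REAL)"
)
# Composite index covering both filter columns.
conn.execute(
    "CREATE INDEX idx_orders_date_customer ON orders (order_date, customer_id)"
)

# EXPLAIN QUERY PLAN describes how SQLite intends to run the query;
# the last field of each returned row is a human-readable detail string.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT amount FROM orders WHERE order_date = ? AND customer_id = ?",
    ("2024-01-15", 42),
).fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)
print(plan_text)
```

The plan text should name `idx_orders_date_customer`, indicating an index search rather than a full table scan.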

929

MCQ · medium

After presenting a data analysis to stakeholders, they ask for additional details on a specific trend. The analyst needs to provide a follow-up report. What is the best approach?

A. Create a new dashboard with all the data.

B. Send the raw data file to the stakeholders.

C. Provide a detailed written report with charts and explanations.

D. Hold another meeting to verbally explain the trend.

Answer: C

Allows self-paced review and provides clear context.

Why this answer

Option C is correct because a detailed written report with charts and explanations allows stakeholders to review at their own pace and provides depth. Option A is wrong because creating a new dashboard for every request is inefficient. Option B is wrong because raw data files lack context and interpretation.

Option D is wrong because holding another meeting delays information.

930

MCQhard

An analyst creates a histogram of customer transaction amounts but observes that the distribution looks bimodal. Upon review, the analyst realizes that two different customer segments (retail and wholesale) were combined. Which action best addresses this?

A.Create two separate histograms, one for each segment

B.Use a single histogram with two colors for segments

C.Use a box plot instead of a histogram

D.Increase the number of bins to see more detail

AnswerA

Separate histograms clearly show each segment's distribution and the reason for bimodality.

Why this answer

Splitting the data into two separate histograms by segment reveals the distinct distributions. Changing bin width may hide or exaggerate features. Combining into one chart with color is possible but may still obscure the bimodal nature if overlapping.

A box plot shows medians but not shape.

931

MCQmedium

A data analyst is building a self-service reporting environment. Which of the following is the primary benefit of this approach?

A.It ensures all reports use the same data source.

B.It reduces the number of ad hoc report requests to the analytics team.

C.It automatically generates executive summaries.

D.It improves data security by limiting access.

AnswerB

Users can answer their own questions without involving analysts.

Why this answer

Self-service BI empowers business users to create their own reports, reducing ad hoc requests to the analytics team.

932

MCQeasy

Refer to the exhibit. An ETL pipeline failed. What is the most likely root cause?

A.Data format mismatch

B.Authentication failure

C.Insufficient disk space

D.Network connectivity issues

AnswerD

Timeouts and connection resets are classic signs of network problems.

Why this answer

The exhibit shows an ETL pipeline failure with a timeout error, which is a classic symptom of network connectivity issues. When the pipeline cannot reach the source or target system due to a network interruption, the process hangs until a timeout threshold is exceeded, causing the job to fail. This is the most likely root cause because timeouts are directly associated with network problems, not with data format, authentication, or storage issues.

Exam trap

The trap here is that candidates may confuse a timeout error with a data format mismatch or authentication failure, but the exam tests the understanding that timeouts and connection resets point most directly to network connectivity issues, not to data content or access permissions.

How to eliminate wrong answers

Option A is wrong because a data format mismatch would typically result in a parsing or schema validation error, not a timeout. Option B is wrong because an authentication failure would produce a 401 Unauthorized or 403 Forbidden error code, not a timeout. Option C is wrong because insufficient disk space would generate a 'disk full' or 'no space left on device' error, not a timeout.

933

MCQhard

A data scientist is building a model to predict customer churn (yes/no). After training a logistic regression model, the coefficient for 'monthly charges' is 0.05 with a p-value of 0.03. Which interpretation is correct at α=0.05?

A.The model's R-squared is 0.05.

B.For every unit increase in monthly charges, the odds of churn increase by about 5%.

C.Monthly charges decrease the probability of churn.

D.Monthly charges have no significant effect on churn.

AnswerB

The coefficient 0.05 in logistic regression represents log-odds; exp(0.05)≈1.05, a 5% increase in odds.

Why this answer

The p-value < 0.05 indicates a statistically significant relationship; the positive coefficient means higher charges increase the log-odds of churn.
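
The arithmetic behind option B can be checked with a minimal sketch: exponentiating the coefficient turns a change in log-odds into an odds ratio. The variable names are illustrative, and the only input is the coefficient stated in the question.

```python
import math

coefficient = 0.05  # log-odds change per one-unit increase in monthly charges (from the question)

odds_ratio = math.exp(coefficient)        # multiplicative change in the odds of churn
percent_change = (odds_ratio - 1) * 100   # the same change expressed as a percentage

print(f"odds ratio: {odds_ratio:.4f}")
print(f"change in odds per unit: {percent_change:.1f}%")
```

Note that the roughly 5% figure applies to the odds of churn, not to the probability of churn.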

934

MCQhard

A retail company has merged with another firm and now needs to create a unified customer data warehouse. The existing systems use different data classification methods: System A stores customer income as a categorical range (e.g., '$0-$50k', '$50k-$100k', '$100k+') while System B stores exact income as a decimal number. A data analyst must combine these into a single table. The goal is to perform statistical analysis that includes calculating average income, but the categorical data from System A loses precision. The analyst proposes converting System B's exact values into the same ranges as System A to ensure consistency. However, the data governance team wants to preserve as much detail as possible. Which course of action should the analyst recommend?

A.Store both columns separately and treat them as independent attributes

B.Convert System B's exact income to ranges matching System A, then combine

C.Impute System A's categorical data with the midpoint of each range to create a continuous numeric field, then combine with System B's exact values

D.Use only System B's data and discard System A because it is less precise

AnswerC

This preserves detail from System B and creates a usable numeric field from System A for analysis.

Why this answer

Option C is correct because imputing the midpoint of each income range converts System A's categorical data into a continuous numeric field, allowing it to be combined with System B's exact decimal values. This approach preserves the granularity of System B's data while enabling statistical calculations like average income across the unified dataset, balancing the data governance team's requirement for detail with the need for consistency.
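
A minimal sketch of midpoint imputation, under stated assumptions: the sample records are hypothetical, and the open-ended '$100k+' bucket has no true midpoint, so the value used for it here is an assumed placeholder that a real project would need to justify.

```python
# Midpoints for System A's bounded ranges. The open-ended '$100k+' bucket has
# no true midpoint, so the value below is an assumed placeholder.
RANGE_MIDPOINTS = {
    "$0-$50k": 25_000.0,
    "$50k-$100k": 75_000.0,
    "$100k+": 125_000.0,  # assumption: treat the bucket as if it were $100k-$150k
}

def unify_income(system_a_ranges, system_b_values):
    """Return (income, source) pairs: imputed midpoints from A, exact values from B."""
    unified = [(RANGE_MIDPOINTS[r], "A_imputed") for r in system_a_ranges]
    unified += [(float(v), "B_exact") for v in system_b_values]
    return unified

# Hypothetical sample records, for illustration only.
rows = unify_income(["$0-$50k", "$50k-$100k"], [42_500.00, 88_000.00])
average_income = sum(income for income, _ in rows) / len(rows)
print(rows)
print(average_income)
```

Keeping a source flag next to each value lets later analysis distinguish imputed figures from exact ones.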

Exam trap

The trap here is that candidates may choose Option B, thinking consistency requires downgrading all data to the lowest common denominator, but the exam tests the ability to preserve precision while achieving integration through transformation techniques like midpoint imputation.

How to eliminate wrong answers

Option A is wrong because storing both columns separately as independent attributes fails to create a unified customer data warehouse and prevents direct statistical analysis across the combined dataset, such as calculating a single average income. Option B is wrong because converting System B's exact decimal values into the same categorical ranges as System A discards precision unnecessarily, violating the data governance team's goal to preserve as much detail as possible. Option D is wrong because discarding System A's data entirely ignores valuable customer information from the merged firm, leading to data loss and an incomplete unified warehouse.

935

Multi-Selecthard

A data analyst is performing K-means clustering on customer data. Which THREE of the following are steps in the K-means algorithm?

Select 3 answers

A.Perform eigenvalue decomposition.

B.Calculate the correlation matrix.

C.Initialize k centroids randomly.

D.Update centroids by computing the mean of all points assigned to each centroid.

E.Assign each data point to the nearest centroid.

AnswersC, D, E

Initialization, assignment, and centroid update are the three steps of K-means.

Why this answer

K-means involves initializing centroids, assigning points to nearest centroid, and updating centroids as the mean of assigned points.
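
The three steps can be sketched in plain Python. This is an illustrative implementation, not a production one: the function name, the seeding strategy (sampling k data points), the stopping rule, and the sample data are all assumptions.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialize k centroids by picking k distinct data points at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Step 2: assign each point to the nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical two-feature customer data forming two obvious groups.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Eigenvalue decomposition and correlation matrices (options A and B) appear nowhere in the loop; they belong to techniques such as principal component analysis.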

936

MCQhard

A data analyst is tasked with presenting the results of a customer churn analysis to a group of marketing managers. The model identified that customers with low engagement scores are 3 times more likely to churn. The analyst wants to effectively communicate this finding to persuade the managers to focus on engagement. Which of the following is the most compelling way to present this insight?

A.Show a table with raw churn rates by engagement score

B.Use a bar chart comparing churn rates of low vs. high engagement

C.Show a scatter plot of engagement score vs. churn

D.Calculate the potential revenue increase if engagement improves by 10%

AnswerD

Revenue impact directly connects to business objectives.

Why this answer

Option D is correct because translating the finding into potential revenue impact resonates with managers' business goals. Option B (bar chart) is factual but less persuasive. Option C (scatter plot) is not compelling.

Option A (table) is less engaging.

937

MCQeasy

A dataset contains a column 'Age' with values: [22, 25, 25, 30, 35, 40, 45]. What is the interquartile range (IQR)?

A.15

B.10

C.20

D.25

AnswerA

Correct IQR = Q3 - Q1 = 40 - 25 = 15.

Why this answer

The median is 30. Q1 is the median of the lower half (22, 25, 25) = 25; Q3 is the median of the upper half (35, 40, 45) = 40; IQR = 40 - 25 = 15.
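
The calculation can be reproduced with a short sketch that uses the median-of-halves rule described above. The function name is illustrative.

```python
from statistics import median

def iqr_median_of_halves(values):
    """IQR using the median-of-halves rule (the overall median is excluded when n is odd)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

ages = [22, 25, 25, 30, 35, 40, 45]
q1, q3, iqr = iqr_median_of_halves(ages)
print(q1, q3, iqr)  # 25 40 15
```

Quartile conventions differ between tools. Linear interpolation between order statistics, the default in some libraries, gives Q3 = 37.5 and an IQR of 12.5 for the same data, so it is worth confirming which rule a given tool applies.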

938

MCQmedium

A data analyst is conducting an A/B test on a website's landing page. The null hypothesis is that there is no difference in conversion rates between the control and treatment groups. After collecting data, the analyst calculates a p-value of 0.03. Using a significance level of α = 0.05, what is the correct conclusion?

A.Accept the null hypothesis; the difference is due to chance.

B.Reject the null hypothesis; the treatment group has a higher conversion rate.

C.Fail to reject the null hypothesis; there is no evidence of a difference.

D.The result is inconclusive because the p-value is close to 0.05.

AnswerB

The p-value indicates statistical significance, but direction must be checked from data.

Why this answer

Since p < α, the null hypothesis is rejected, indicating a statistically significant difference in conversion rates.
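
As a rough sketch of how such a p-value might be produced, the following applies a pooled two-proportion z-test using only the standard library. The visitor and conversion counts are hypothetical, and the choice of test is an assumption; the question does not say which test the analyst used.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share the same conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF via the error function; doubled tail area for a two-sided test.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

ALPHA = 0.05
# Hypothetical counts: 200 of 5,000 control visitors and 250 of 5,000 treatment visitors convert.
z, p = two_proportion_p_value(200, 5000, 250, 5000)
decision = "reject H0" if p < ALPHA else "fail to reject H0"
direction = "treatment higher" if z > 0 else "treatment lower"
print(f"z={z:.2f}, p={p:.4f}: {decision} ({direction})")
```

The p-value alone only says the rates differ; the sign of z (or the observed rates) shows which group converted better.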

939

MCQmedium

A data analyst creates a report showing sales by product category. The analyst notices that one category has a very high sales figure due to a one-time bulk order. Which of the following is the best way to communicate this insight to stakeholders?

A.Delete the bulk order from the dataset.

B.Add a note to the chart explaining the bulk order.

C.Remove the category with the bulk order from the report.

D.Use a pie chart to show the proportion of each category.

AnswerB

Annotation provides context for the anomaly.

Why this answer

Option B is correct because it maintains data integrity while providing necessary context. Adding a note to the chart allows stakeholders to understand the anomaly without distorting the underlying data. This approach aligns with best practices in data communication, where transparency about outliers is critical for accurate interpretation.

Exam trap

The trap here is that candidates may think removing or hiding data (options A and C) simplifies the message, but the exam tests the principle that data communication must be transparent and preserve the full context for decision-making.

How to eliminate wrong answers

Option A is wrong because deleting the bulk order from the dataset introduces bias and misrepresents actual sales performance, violating data integrity principles. Option C is wrong because removing the entire category hides the outlier entirely, preventing stakeholders from understanding its impact on overall trends. Option D is wrong because a pie chart would visually exaggerate the proportion of the bulk order category, potentially misleading stakeholders about normal sales distribution.

940

MCQmedium

A dashboard designer is creating a sales performance dashboard and wants to minimize non-data ink to improve clarity. Which action best follows the principle of maximizing the data-ink ratio?

A.Removing gridlines that do not add value to the chart

B.Including a detailed company logo on every chart

C.Adding a background image to make the dashboard more visually appealing

D.Using 3-D effects on bars to make them stand out

AnswerA

Correct. Removing unnecessary gridlines reduces chartjunk and increases the proportion of ink used for data.

Why this answer

Option A is correct because it directly follows Tufte's data-ink ratio principle by removing non-data ink (gridlines that don't aid interpretation) while preserving the core data. This maximizes the proportion of ink devoted to actual sales metrics, improving clarity without sacrificing information.

Exam trap

The trap here is that candidates may confuse 'visually appealing' (options B, C, D) with effective data communication, not realizing that decorative elements reduce the data-ink ratio and can obscure insights in a professional dashboard.

How to eliminate wrong answers

Option B is wrong because a detailed company logo on every chart adds non-data ink that distracts from the sales performance data, violating the data-ink ratio principle. Option C is wrong because a background image introduces decorative non-data ink that reduces the clarity of the data visualization, contrary to the principle. Option D is wrong because 3-D effects on bars add chartjunk (non-data ink) that can distort perception of bar heights and make comparisons harder, directly opposing the goal of maximizing data-ink.

941

MCQmedium

You are a data analyst at a logistics company. You have created a dashboard to monitor delivery performance. The dashboard includes a scatter plot showing delivery time (hours) vs. distance (miles) for each delivery, with points colored by delivery region (A, B, C, D, E). Users have reported that the scatter plot is cluttered because there are over 10,000 points, making it hard to see patterns. Additionally, the color legend for the five regions uses similar shades of blue, making it difficult to distinguish which region a point belongs to. You need to improve the scatter plot to reduce overplotting and improve region differentiation. Which approach is most effective?

A.Use a hexagonal binning plot (hexbin) with color representing region

B.Create five separate scatter plots (small multiples) for each region

C.Set point opacity to 30% and use a qualitative color palette with distinct hues for each region

D.Convert to a bubble chart by adding package weight as bubble size

AnswerC

Alpha blending reveals density; distinct colors improve region identification.

Why this answer

Option C is correct because reducing opacity (alpha blending) mitigates overplotting by making overlapping points more transparent, while switching to a qualitative color palette (e.g., distinct hues like red, green, blue) ensures each of the five regions is easily distinguishable. This directly addresses both user complaints without losing the overall distribution context.

Exam trap

The trap here is that candidates often choose small multiples (Option B) thinking they reduce clutter, but the question specifically asks to improve differentiation and reduce overplotting in a single view, and small multiples fragment the data, making cross-region comparison harder.

How to eliminate wrong answers

Option A is wrong because a hexagonal binning plot aggregates points into hexagons based on density, but it cannot encode a categorical variable like region with color inside a single plot—hexbin typically uses a single color gradient for density, not multiple regions. Option B is wrong because creating five separate scatter plots (small multiples) forces users to mentally compare across separate charts, losing the ability to see relative spatial patterns and interactions between regions in a single view. Option D is wrong because converting to a bubble chart by adding package weight as bubble size does not reduce overplotting (it can worsen it with larger circles) and does not improve region differentiation; it introduces a third continuous variable that is not related to the reported issues.

942

MCQmedium

An organization uses a data warehouse for analytics. The data team wants to load data from source systems into the warehouse. They choose to load raw data first and then perform transformations within the warehouse. Which approach are they using?

A.ELT

B.Data lake

C.Data mart

D.ETL

AnswerA

ELT loads raw data first, then transforms it within the warehouse.

Why this answer

ELT (Extract, Load, Transform) involves extracting data, loading it into the target system (e.g., data warehouse), and then transforming it there. This is common with modern cloud warehouses like Snowflake or BigQuery that handle transformations efficiently.
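
The load-then-transform order can be illustrated with a small sketch. An in-memory SQLite database stands in for the warehouse here, which is an assumption made so the example runs anywhere; the table names, column names, and rows are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (an assumption for
# illustration; table and column names are hypothetical).
warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched, everything stored as text.
warehouse.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)")
source_rows = [("1001", " 19.99", "2023-01-02"), ("1002", "5.00 ", "2023-01-03")]
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", source_rows)

# Transform: clean and type the data with SQL running inside the warehouse.
warehouse.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER)  AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           order_date
    FROM raw_orders
    """
)
orders = warehouse.execute("SELECT order_id, amount FROM orders ORDER BY order_id").fetchall()
print(orders)
```

In ETL, the trimming and casting would instead happen in a separate processing step before anything reached the warehouse.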

943

MCQmedium

A data analyst receives a dataset with inconsistent date formats (e.g., "01/02/2023", "2023-01-02", "Jan 2, 2023"). Which data quality dimension is most directly affected?

A.Accuracy

B.Consistency

C.Completeness

D.Timeliness

AnswerB

Inconsistent formats directly impact data consistency.

Why this answer

Consistency refers to the uniformity of data representation. Inconsistent date formats violate consistency, not accuracy, completeness, or timeliness.
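
One way to restore consistency is to parse each known format and emit a single representation. The sketch below assumes ISO 8601 (YYYY-MM-DD) as the target and reads "01/02/2023" as month/day/year; both are assumptions, and a day-first source would need a different format string.

```python
from datetime import datetime

# Candidate formats, tried in order. Reading "01/02/2023" as month/day/year is
# an assumption; a day-first source would need "%d/%m/%Y" instead.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y")

def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

samples = ["01/02/2023", "2023-01-02", "Jan 2, 2023"]
normalized = [to_iso_date(s) for s in samples]
print(normalized)
```

Raising on unrecognized input, rather than guessing, keeps a consistency fix from silently introducing accuracy problems.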

944

MCQhard

An analyst creates a dashboard with a large amount of decorative graphics and background images. According to CompTIA Data+, which design principle is being violated?

A.Visual hierarchy

B.Data-ink ratio

C.Appropriate precision

D.Consistent color coding

AnswerB

The data-ink ratio is reduced by unnecessary decorative elements.

Why this answer

The data-ink ratio, a concept from Edward Tufte, states that a visualization should maximize the proportion of ink used to represent actual data relative to total ink used. By adding excessive decorative graphics and background images, the analyst reduces this ratio, making the dashboard less efficient and potentially obscuring the data. CompTIA Data+ emphasizes minimizing non-data ink to improve clarity and insight extraction.

Exam trap

The trap here is that candidates confuse 'decorative graphics' with 'visual hierarchy' or 'color coding,' but CompTIA Data+ specifically tests the data-ink ratio as a principle for minimizing non-data elements to enhance readability.

How to eliminate wrong answers

Option A is wrong because visual hierarchy refers to the arrangement of elements to guide the viewer's eye by importance, which is not directly violated by decorative graphics; the issue is about ink efficiency, not order. Option C is wrong because appropriate precision concerns the level of detail in data values (e.g., rounding or decimal places), not the use of decorative elements. Option D is wrong because consistent color coding ensures that colors have the same meaning across the dashboard, which is unrelated to the overuse of non-data ink.

945

Multi-Selecthard

A data governance team is establishing policies. Which three activities are part of data governance? (Select THREE.)

Select 3 answers

A.Data quality management

B.Data ownership assignment

C.Data indexing

D.Data steward designation

E.Data normalization

AnswersA, B, D

Ensuring data quality is a core governance function.

Why this answer

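Data quality management is a core activity of data governance because it ensures that data meets defined standards for accuracy, completeness, consistency, and timeliness. Governance policies mandate monitoring and remediation processes to maintain data quality across the organization. Assigning data owners and designating data stewards are likewise governance activities because they establish accountability for data. Data indexing and data normalization are technical implementation tasks, not governance activities.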

Exam trap

CompTIA often tests the distinction between data governance (policies, roles, quality) and data management (technical implementation like indexing and normalization), leading candidates to confuse operational tasks with governance activities.

946

MCQhard

A dataset contains salaries ranging from $25,000 to $2,500,000, with most salaries under $100,000. Which chart type best shows the distribution without distortion from extreme values?

A.Box plot

B.Scatter plot

C.Histogram with equal bin widths

D.Line chart

AnswerA

Box plots show median, quartiles, and outliers, handling skew well.

Why this answer

Option A is correct because box plots display the median and quartiles with outliers plotted separately, so the summary of the bulk of the data is not distorted by extreme values. A histogram with equal bin widths may compress the bulk of the data into one or two bins.

947

Multi-Selecteasy

A data analyst is performing data acquisition from multiple source files. Which TWO data profiling tasks should the analyst complete before loading the data into the target system?

Select 2 answers

A.Create a dashboard for stakeholders

B.Verify data types and formats

C.Build a linear regression model

D.Perform cluster analysis

E.Identify missing values and nulls

AnswersB, E

Verifying data types ensures consistency and prevents errors during loading.

Why this answer

Identifying missing values and verifying data types are fundamental data profiling tasks that help ensure data quality before acquisition. Building models, creating dashboards, or clustering are not part of profiling.

948

MCQeasy

A small business wants to acquire customer feedback through a short questionnaire emailed after purchase. Which data acquisition method does this represent?

A.Transaction log

B.Interview

C.Survey

D.Observation

AnswerC

A questionnaire is a classic survey tool for collecting feedback.

Why this answer

The correct answer is C because a survey is a structured data collection method where respondents answer predefined questions, typically via a form or questionnaire. In this scenario, the business is using a short questionnaire emailed after purchase to gather customer feedback, which directly aligns with the definition of a survey as a data acquisition method.

Exam trap

The trap here is that candidates may confuse a survey with a transaction log because both can be automated and delivered electronically, but a transaction log captures system events, not user-provided feedback.

How to eliminate wrong answers

Option A is wrong because a transaction log records system-level events such as database changes, user logins, or API calls, not subjective customer feedback via a questionnaire. Option B is wrong because an interview involves a direct, synchronous conversation between an interviewer and a respondent, often with open-ended questions, whereas the scenario describes an asynchronous, self-administered questionnaire. Option D is wrong because observation involves watching and recording behavior or events without direct interaction, whereas the scenario explicitly involves asking customers for their opinions through a questionnaire.

949

MCQhard

Refer to the exhibit. A data architect is designing a data dictionary for a relational database. Based on the exhibit, which data concept is being illustrated?

A.Data constraints

B.Data aggregation

C.Data normalization

D.Data cardinality

AnswerA

The exhibit specifies field properties like nullable and unique, which are constraints on the data.

Why this answer

The exhibit shows a table definition with column attributes such as NOT NULL, UNIQUE, and PRIMARY KEY, which are data constraints that enforce rules on the data values. Data constraints ensure data integrity by restricting what data can be stored in a column, such as preventing null values or duplicate entries. This directly aligns with the concept of data constraints, making option A correct.
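
Since the exhibit is not reproduced here, the following sketch uses a hypothetical table to show how NOT NULL, UNIQUE, and PRIMARY KEY constraints reject invalid rows. SQLite is used only because it ships with Python; the table, columns, and data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: each column rule below is a data constraint.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- unique row identifier
        email       TEXT NOT NULL UNIQUE,  -- required, no duplicates
        full_name   TEXT NOT NULL          -- required
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'ana@example.com', 'Ana Ruiz')")

rejected = []
bad_rows = [
    (2, "ana@example.com", "Duplicate Email"),  # violates UNIQUE
    (3, None, "Missing Email"),                 # violates NOT NULL
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejected.append(str(exc))

print(rejected)
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Both bad rows are refused by the database itself, which is the point of declaring constraints in the data dictionary rather than relying on application code.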

Exam trap

The trap here is that candidates may confuse data constraints with data cardinality, because both involve 'rules' in a database, but cardinality specifically describes the nature of relationships between tables, not the column-level restrictions shown in the exhibit.

How to eliminate wrong answers

Option B is wrong because data aggregation involves summarizing or combining data from multiple rows (e.g., using SUM, AVG), which is not illustrated in the table definition. Option C is wrong because data normalization is a process of organizing data to reduce redundancy and dependency, typically involving splitting tables into related tables, not defining column-level constraints. Option D is wrong because data cardinality refers to the relationship between tables (e.g., one-to-many), not the rules applied to individual columns in a table definition.

950

Multi-Selecteasy

Which TWO of the following are true about correlation and causation? (Select TWO).

Select 2 answers

A.Correlation measures both linear and nonlinear relationships

B.Causation can always be inferred from a controlled experiment without randomization

C.Correlation does not imply causation

D.If two variables are highly correlated, one must cause the other

E.A statistically significant correlation may still be due to chance or confounding variables

AnswersC, E

This is a fundamental concept.

Why this answer

Option C is correct because correlation measures the strength and direction of a linear relationship between two variables, but it does not imply that one variable causes the other. Causation requires controlled experiments with randomization to rule out confounding variables and establish a cause-effect relationship. Option E is correct because even a statistically significant correlation can arise by chance or from a confounding variable.

Exam trap

CompTIA often tests the classic 'correlation does not imply causation' fallacy, where candidates mistakenly think that a statistically significant correlation automatically proves a causal relationship, ignoring the role of chance and confounding variables.

951

MCQhard

A data analyst is using a public API to collect historical weather data. The API has a rate limit of 100 requests per minute, but the analyst needs to retrieve 10,000 records as quickly as possible. What strategy should be used?

A.Increase the request rate

B.Use multiple API keys

C.Paginate with appropriate delays

D.Download a precompiled dataset

AnswerC

Pagination allows systematic retrieval; delays ensure compliance with rate limits.

Why this answer

Option C is correct because paginating with appropriate delays respects the API's rate limit of 100 requests per minute while maximizing throughput. The analyst requests the data in pages and spaces requests at or just under the limit (one request every 0.6 seconds). Total time depends on page size: at 100 records per page, 10,000 records need only 100 requests, which takes about one minute; at one record per request, 10,000 requests would take about 100 minutes. Either way, pacing the requests avoids HTTP 429 rate-limit errors.
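
A minimal sketch of paced pagination follows. The `fetch_page(offset, limit)` callable is hypothetical and stands in for whatever client wraps the real API; the page size of 100 is also an assumption, since the question does not state what the API allows.

```python
import time

def fetch_all(fetch_page, total_records, page_size=100, max_per_minute=100, sleep=time.sleep):
    """Collect records page by page without exceeding max_per_minute requests.

    fetch_page(offset, limit) is a hypothetical callable wrapping the real API;
    it should return a list of records for that page.
    """
    min_interval = 60.0 / max_per_minute   # 0.6 s between requests at 100/minute
    records = []
    for offset in range(0, total_records, page_size):
        started = time.monotonic()
        records.extend(fetch_page(offset, min(page_size, total_records - offset)))
        elapsed = time.monotonic() - started
        if offset + page_size < total_records and elapsed < min_interval:
            sleep(min_interval - elapsed)   # wait out the rest of the interval
    return records

# Stand-in for the API so the sketch runs offline.
def fake_fetch_page(offset, limit):
    return [{"id": i} for i in range(offset, offset + limit)]

calls = []  # records the requested sleep durations instead of actually sleeping
data = fetch_all(fake_fetch_page, total_records=10_000, sleep=calls.append)
print(len(data), "records in", len(calls) + 1, "requests")
```

Passing `sleep` as a parameter keeps the pacing logic testable without waiting; in real use the default `time.sleep` applies the delay.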

Exam trap

The trap here is that candidates may assume 'as quickly as possible' means sending requests as fast as possible (Option A) or using multiple keys (Option B), overlooking that rate limits are enforced per key or IP and that proper pagination with delays is the only compliant way to maximize throughput.

How to eliminate wrong answers

Option A is wrong because increasing the request rate beyond 100 requests per minute would violate the API's rate limit, resulting in HTTP 429 (Too Many Requests) responses or temporary IP bans. Option B is wrong because using multiple API keys to circumvent rate limits violates the API's terms of service and could lead to account suspension or revocation of access. Option D is wrong because downloading a precompiled dataset may not be available, may not contain the specific historical weather data needed, or may be outdated, and the question explicitly states the analyst is using a public API to collect data.

952

MCQmedium

A business analyst needs to show the monthly sales trend for the past three years. Which chart type is most appropriate?

A.Histogram

B.Pie chart

C.Line chart

D.Scatter plot

AnswerC

Line charts effectively show trends over time.

Why this answer

Line charts are ideal for displaying trends over time, making them suitable for showing monthly sales trends.

953

MCQmedium

Refer to the exhibit. A data analyst is trying to extract data from a SQL Server database but receives the error. Which configuration change should the analyst recommend to the database administrator?

A.Change the server firewall to allow port 1433

B.Enable Mixed Mode authentication on the SQL Server

C.Use a different extraction tool that supports Windows authentication

D.Grant the 'dataminer' user SELECT permissions

AnswerB

Mixed Mode allows SQL authentication, which matches the login attempt.

Why this answer

The error indicates that the server is configured for Windows authentication only, but the login attempt used SQL authentication. Enabling Mixed Mode authentication allows both Windows and SQL authentication, resolving the issue.

954

MCQhard

A data analyst is using a window function to assign a rank to each employee within their department based on salary, with ties receiving the same rank and leaving gaps. Which function should be used?

A.DENSE_RANK()

B.RANK()

C.ROW_NUMBER()

D.NTILE()

AnswerB

RANK() assigns same rank to ties and leaves gaps.

Why this answer

RANK() assigns the same rank to ties and leaves gaps (e.g., 1,1,3). DENSE_RANK() does not leave gaps.
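
The difference is easy to see side by side. This sketch runs the three ranking functions through Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions); the employee data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 90000), ("Ben", "Sales", 90000), ("Cy", "Sales", 70000),
     ("Dee", "IT", 80000), ("Eli", "IT", 75000)],
)

rows = conn.execute(
    """
    SELECT name, department, salary,
           RANK()       OVER w AS rnk,        -- ties share a rank, next rank skips
           DENSE_RANK() OVER w AS dense_rnk,  -- ties share a rank, no gaps
           ROW_NUMBER() OVER w AS row_num     -- always unique, ties broken arbitrarily
    FROM employees
    WINDOW w AS (PARTITION BY department ORDER BY salary DESC)
    ORDER BY department, salary DESC, name
    """
).fetchall()
for row in rows:
    print(row)
```

In the Sales department, Ana and Ben tie at rank 1, so Cy receives 3 from RANK() but 2 from DENSE_RANK().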

955

MCQeasy

A retail company operates 50 stores across the country. The data analyst has been asked to create a dashboard to visualize monthly sales trends over the past two years and compare the performance of the top 5 stores. The dataset includes store name, date, and daily sales amount. Initial exploration reveals that some stores have missing sales data for certain months due to system outages, and there are occasional extreme values caused by promotional events (e.g., Black Friday sales are 10x normal). The analyst needs to choose an appropriate visualization approach that accurately represents the trends and comparisons while handling these data quality issues. What should the analyst do to best meet the requirements?

A.Use a line chart for each store's monthly sales and exclude outliers from the dataset.

B.Use a bar chart showing monthly sales for each month and filter out stores with missing data.

C.Use a line chart for the overall monthly sales trend (aggregated across all stores) and a separate bar chart for the top 5 stores by total sales, with tooltips explaining outlier values.

D.Use a stacked bar chart with all 50 stores, each bar representing a month.

AnswerC

Combines trend and comparison effectively; aggregation handles missing data, and tooltips provide context.

Why this answer

Option C is correct because it combines a line chart for overall trend (aggregating all stores) and bar charts for top 5 stores, which allows comparison while handling missing data through aggregation. Tooltips can provide context for outliers. Option A is wrong because using separate line charts for each store becomes cluttered with 50 stores, and excluding outliers may omit valuable information about promotions.

Option B is wrong because filtering out stores with missing data could remove important stores and distort the trend. Option D is wrong because a stacked bar chart with all 50 stores would be difficult to read and doesn't show trends over time effectively.

956

MCQmedium

A data scientist is preparing data for a K-means clustering algorithm. The dataset contains features measured in different units (e.g., income in dollars and age in years). Which preprocessing step is most critical before running K-means?

A.Remove outliers

B.Encode categorical variables

C.Standardize or normalize the features

D.Perform feature selection

AnswerC

Scaling ensures equal weighting; both min-max and Z-score are common.

Why this answer

K-means is sensitive to the scale of features because it uses Euclidean distance. Min-max normalization or standardization ensures all features contribute equally.
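
Both scaling options can be sketched with the standard library. The function names and the sample income and age values are hypothetical; the z-score version uses the population standard deviation, which is one common convention.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize to mean 0 and standard deviation 1 (population formula)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical customers: income in dollars dwarfs age in years until scaled.
income = [30_000, 60_000, 90_000, 120_000]
age = [25, 35, 45, 55]

scaled_income, scaled_age = z_score(income), z_score(age)
print(scaled_income)
print(min_max(age))
```

Before scaling, a $30,000 income gap would swamp a 30-year age gap in any Euclidean distance; after scaling, both features span comparable ranges and contribute on equal footing.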

957

MCQmedium

A large online retailer stores customer orders in a PostgreSQL database. Each order has a unique order ID, and the database is normalized to 3NF. Which type of data is this?

A.Semi-structured data

B.Structured data

C.Unstructured data

D.Metadata

AnswerB

Relational databases store structured data with fixed schemas, rows, and columns.

Why this answer

The data is structured because it resides in a normalized PostgreSQL database with a unique order ID and conforms to a fixed schema (3NF). Structured data is organized into rows and columns with defined data types, enabling efficient SQL querying and ACID compliance. PostgreSQL's relational model enforces this structure through tables, constraints, and indexes.

Exam trap

The trap here is that candidates confuse 'structured data' with 'metadata' or assume that any database containing JSON fields is semi-structured, but the question specifies a normalized 3NF schema, which inherently means structured data regardless of any JSON columns.

How to eliminate wrong answers

Option A is wrong because semi-structured data (e.g., JSON, XML) does not require a fixed schema and is typically stored in NoSQL databases or as JSONB in PostgreSQL, not in a normalized 3NF relational schema. Option C is wrong because unstructured data (e.g., images, videos, free text) lacks a predefined data model and cannot be directly stored in normalized relational tables without transformation. Option D is wrong because metadata is data about data (e.g., table schemas, column descriptions), not the actual customer order records themselves.

958

MCQmedium

A business analyst wants to show the relationship between advertising spend (independent variable) and sales revenue (dependent variable) for 12 different regions. Each region will be represented by a point. Which chart type is most appropriate?

A.Line chart

B.Bar chart

C.Scatter plot

D.Histogram

AnswerC

Correct. Scatter plots reveal correlation between two variables.

Why this answer

A scatter plot is the most appropriate chart type because it directly visualizes the relationship between two continuous variables—advertising spend (independent) and sales revenue (dependent)—by plotting each of the 12 regions as a point. This allows the analyst to assess correlation, trends, and outliers, which is the core purpose of scatter plots in data visualization.

Exam trap

The trap here is that candidates often confuse scatter plots with line charts, mistakenly thinking a line connecting points is needed to show a relationship, but the exam tests the understanding that scatter plots are the standard for bivariate continuous data without an inherent sequence.

How to eliminate wrong answers

Option A is wrong because a line chart is typically used to show trends over time or other sequential data, not the relationship between two continuous variables measured across distinct regions. Option B is wrong because a bar chart compares categorical data or discrete values, not the continuous relationship between two numeric variables. Option D is wrong because a histogram displays the distribution of a single continuous variable by grouping data into bins, not the relationship between two variables.

959

MCQhard

An analyst is fitting a polynomial regression model and wants to choose the degree that minimizes overfitting. Which technique should the analyst use?

A.Lasso regression (L1)

B.Principal component analysis (PCA)

C.Stepwise selection

D.Ridge regression (L2)

AnswerD

Ridge regression penalizes large coefficients, which is effective for reducing overfitting in polynomial models without removing features.

Why this answer

Ridge regression (L2) adds a penalty proportional to the square of the magnitude of coefficients, which shrinks them toward zero but does not eliminate them. This regularization reduces variance and helps prevent overfitting in polynomial regression by controlling the influence of higher-degree terms, making it the correct technique for minimizing overfitting while retaining all features.
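
The shrinkage described above can be sketched for the simplest case. This assumes a single centered feature with no intercept, where the ridge slope has the closed form Σxy / (Σx² + λ); the data and the function name `ridge_slope` are hypothetical, and a real polynomial model would have one coefficient per degree.

```python
def ridge_slope(xs, ys, lam):
    """Slope b minimizing sum((y - b*x)^2) + lam * b^2 for centered data."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical centered data (both columns have mean zero).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]

ols = ridge_slope(xs, ys, lam=0.0)      # no penalty: ordinary least squares
shrunk = ridge_slope(xs, ys, lam=10.0)  # penalty pulls the slope toward zero
```

Here the penalty of 10 halves the slope without setting it to zero, which is the behavior that separates ridge from lasso.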

Exam trap

The trap here is that candidates often confuse Lasso (L1) with Ridge (L2), mistakenly thinking Lasso's coefficient elimination is always better for overfitting, when in fact Ridge's smooth shrinkage is more appropriate for polynomial models where all degrees should be retained but controlled.

How to eliminate wrong answers

Option A is wrong because Lasso regression (L1) performs feature selection by shrinking some coefficients exactly to zero, which is more suited for sparse models rather than simply minimizing overfitting in a polynomial context where all degrees may be needed. Option B is wrong because Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms features into uncorrelated components, but it does not directly address overfitting in polynomial regression and can lose interpretability of the polynomial terms. Option C is wrong because stepwise selection is a variable selection method that adds or removes predictors based on statistical criteria (e.g., AIC, p-values), but it can be unstable and does not inherently regularize coefficients to combat overfitting as effectively as ridge regression.

960

MCQmedium

A company is merging two customer databases from different acquisitions. They need to identify duplicate records. Which data profiling technique is most effective?

A.Fuzzy matching on name and address

B.Manually compare all records

C.Exact match on customer names

D.Use primary keys from each database

AnswerA

Fuzzy matching handles variations and is appropriate for deduplication.

Why this answer

Fuzzy matching on name and address is the most effective technique because customer databases from different acquisitions often contain variations in spelling, formatting, and abbreviations (e.g., 'Bob' vs. 'Robert', 'St.' vs. 'Street'). Exact matching would miss these duplicates, while fuzzy matching uses algorithms like Levenshtein distance or Jaro-Winkler to quantify similarity and identify near-matches, ensuring comprehensive deduplication.
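
As a rough sketch of how Levenshtein distance quantifies similarity, the following computes the edit distance between two strings and turns it into a 0 to 1 score. The `similarity` helper and its normalization are illustrative assumptions; production deduplication tools typically combine several fields and thresholds.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Hypothetical score: 1.0 for identical strings, lower as edits grow."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

score = similarity("Jon Smith", "John Smith")
```

Edit distance handles typos and small spelling differences well. Nickname pairs such as 'Bob' and 'Robert' differ by several edits, so they usually need an additional lookup table or a different matching rule.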

Exam trap

The trap here is that candidates assume exact matching or primary keys are sufficient for deduplication, overlooking the real-world data inconsistencies that fuzzy matching is designed to handle.

How to eliminate wrong answers

Option B is wrong because manually comparing all records is impractical and error-prone for large datasets, lacking scalability and consistency. Option C is wrong because exact match on customer names fails to capture duplicates caused by typos, nicknames, or inconsistent formatting (e.g., 'Jon' vs. 'John'). Option D is wrong because primary keys from each database are unique within their own system but cannot identify cross-database duplicates, as the same customer may have different primary keys in each source.

961

MCQhard

A data analyst is presenting results from a customer segmentation analysis to a non-technical audience. The segmentation was performed using k-means clustering, and the analyst wants to explain the characteristics of each cluster. Which visualization is most appropriate for this audience?

A.A parallel coordinates plot displaying all variables used in clustering.

B.A heatmap of the distance matrix between cluster centroids.

C.A bar chart comparing the average values of key metrics for each cluster.

D.A scatter plot with two principal components showing cluster boundaries.

AnswerC

Bar charts are simple, familiar, and effectively compare averages across categories.

Why this answer

A bar chart comparing average values of key metrics per cluster is the most appropriate choice because it directly translates the numerical output of k-means clustering into a simple, intuitive visual that non-technical audiences can easily interpret. Unlike complex multivariate plots, a bar chart focuses on the distinct characteristics of each cluster without requiring knowledge of clustering algorithms or dimensionality reduction. This aligns with the DA0-001 objective of communicating data insights effectively to stakeholders.

Exam trap

The trap here is that candidates may choose a technically sophisticated visualization like a parallel coordinates plot or PCA scatter plot, mistakenly believing it demonstrates deeper analytical skill, when the question specifically tests the ability to tailor visualizations to a non-technical audience's comprehension level.

How to eliminate wrong answers

Option A is wrong because a parallel coordinates plot displays all variables simultaneously, which overwhelms non-technical audiences with overlapping lines and multidimensional complexity, making it unsuitable for clear communication of cluster characteristics. Option B is wrong because a heatmap of the distance matrix between cluster centroids shows abstract mathematical relationships (e.g., Euclidean distances) that are meaningless to a non-technical audience and does not convey the actual metric values defining each cluster. Option D is wrong because a scatter plot with two principal components requires understanding of PCA and dimensionality reduction, which is not intuitive for non-technical stakeholders, and the cluster boundaries may be ambiguous without proper labeling of the original metrics.

962

MCQmedium

A data scientist is building a predictive model to forecast monthly sales. The data shows a linear trend with no seasonality. Which regression technique is most appropriate?

A.Polynomial regression

B.Logistic regression

C.Linear regression

D.Ridge regression

AnswerC

Linear regression directly models a linear relationship between independent and dependent variables.

Why this answer

Linear regression is the most appropriate technique because the data shows a linear trend with no seasonality, making a straight-line model the simplest and most effective fit. It directly models the relationship between the independent variable (e.g., time) and the dependent variable (monthly sales) using a linear equation, minimizing the sum of squared residuals.
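
Minimizing the sum of squared residuals for a single predictor has a closed-form solution, sketched below. The monthly sales figures and the `fit_line` helper are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical sales with a perfectly linear trend of +10 per month.
months = [1, 2, 3, 4, 5, 6]
sales = [110, 120, 130, 140, 150, 160]

slope, intercept = fit_line(months, sales)
forecast_month_7 = intercept + slope * 7
```

For this data the fitted line is sales = 100 + 10 × month, so the forecast for month 7 is 170.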

Exam trap

The trap here is that candidates often confuse 'linear trend' with 'linear in parameters' and incorrectly choose polynomial regression, thinking it adds flexibility, when the question explicitly states no seasonality and a linear trend, making simple linear regression the optimal choice.

How to eliminate wrong answers

Option A is wrong because polynomial regression introduces higher-degree terms (e.g., x², x³) to model curvature, which is unnecessary and risks overfitting when the trend is explicitly linear. Option B is wrong because logistic regression is used for binary classification problems (e.g., predicting yes/no outcomes), not for forecasting continuous numeric values like monthly sales. Option D is wrong because ridge regression is a regularization technique designed to handle multicollinearity or overfitting by adding an L2 penalty, but it is not a distinct regression type for linear trends and would be overkill when a simple linear model suffices.

963

MCQmedium

A data engineer is designing an ETL pipeline to extract sales data from a legacy on-premise database and load it into a cloud data warehouse. The database is slow and queries during business hours affect performance. Which extraction strategy minimizes impact?

A.Query the database with SELECT * every hour

B.Incremental extraction using Change Data Capture (CDC)

C.Full table extraction nightly

D.Use database log shipping

AnswerB

CDC minimizes database load by extracting only changed data, reducing performance impact.

Why this answer

Incremental extraction using Change Data Capture (CDC) minimizes impact on the legacy on-premise database by reading only the changed rows (inserts, updates, deletes) from transaction logs or change tables, rather than issuing heavy SELECT queries. This avoids full table scans or frequent queries during business hours, preserving database performance for operational workloads.

Exam trap

The trap here is that candidates confuse 'log shipping' (a high-availability technique) with 'Change Data Capture' (an extraction method), or assume that any periodic query (like hourly SELECT *) is acceptable without considering the cumulative performance impact on a slow legacy database.

How to eliminate wrong answers

Option A is wrong because querying the database with SELECT * every hour performs full table scans on the legacy database, which is slow and would degrade performance during business hours, directly contradicting the goal of minimizing impact. Option C is wrong because full table extraction nightly still requires a complete scan of the entire table, which can be resource-intensive and may not complete within a reasonable window if the database is slow, and it does not capture intra-day changes without additional overhead. Option D is wrong because database log shipping is a disaster recovery technique that continuously copies transaction logs to a standby server, not an extraction strategy for ETL; it does not provide a queryable change stream and would require additional processing to parse logs for CDC.

964

MCQeasy

A retail company has a dashboard that tracks daily sales across all stores. The dashboard is used by the CEO during monthly board meetings. Recently, the CEO complained that the dashboard is confusing and takes too long to interpret. The data analyst is asked to redesign the dashboard to improve usability. The analyst has access to all sales data. The CEO prefers a high-level overview with the ability to drill down into regions if needed. The analyst must decide on a course of action. What should the analyst do first?

A.Revamp the dashboard with as many visualizations as possible.

B.Conduct a survey of all employees to understand their needs.

C.Create a static PDF report instead of a dashboard.

D.Interview the CEO to determine key performance indicators and preferred layout.

AnswerD

Directly addresses the user's frustration and gathers precise requirements.

Why this answer

The correct first step is to interview the CEO to determine key performance indicators and preferred layout. This ensures the dashboard is designed around the specific high-level metrics the CEO needs for monthly board meetings, with the ability to drill down into regions. Starting with stakeholder requirements prevents wasted effort on irrelevant visualizations or formats.

Exam trap

The trap here is that candidates may assume more data or broader input (options A or B) is always better, when in fact the correct approach is to focus on the specific needs of the primary user to reduce confusion and improve usability.

How to eliminate wrong answers

Option A is wrong because revamping with as many visualizations as possible would increase clutter and confusion, directly contradicting the CEO's complaint about the dashboard being confusing and taking too long to interpret. Option B is wrong because surveying all employees is unnecessary; the primary user is the CEO, and the dashboard is for board meetings, not general employee use. Option C is wrong because creating a static PDF report removes interactivity and the ability to drill down into regions, which the CEO explicitly prefers.

965

MCQmedium

In a logistic regression model predicting customer churn (1 = churn, 0 = not churn), the coefficient for 'contract length' is -0.5. Which of the following is the correct interpretation?

A.For each unit increase in contract length, the log-odds of churn decrease by 0.5.

B.Longer contract length increases the odds of churn.

C.The probability of churn decreases by 50% for each unit increase in contract length.

D.Contract length is not a significant predictor.

AnswerA

Correct interpretation of logistic regression coefficient.

Why this answer

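In logistic regression, a coefficient is the change in the log-odds of the outcome for a one-unit increase in the predictor. A coefficient of -0.5 lowers the log-odds of churn by 0.5 per unit of contract length, which lowers the probability of churn but not by a fixed 50%.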
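
A short sketch makes the distinction between log-odds, odds, and probability concrete. The coefficient -0.5 comes from the question; the intercept of 0.0 is an assumption chosen only so that probabilities can be computed.

```python
import math

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-z))

coef = -0.5       # coefficient for contract length, from the question
intercept = 0.0   # hypothetical; the question does not give one

# Each extra unit multiplies the odds of churn by exp(coef), about 0.607.
odds_ratio = math.exp(coef)

# Probabilities at contract length 1 and 2 under the assumed intercept.
p_at_1 = sigmoid(intercept + coef * 1)
p_at_2 = sigmoid(intercept + coef * 2)

# The odds shrink by a constant factor; the probability does not drop by 50%.
odds_at_1 = p_at_1 / (1 - p_at_1)
odds_at_2 = p_at_2 / (1 - p_at_2)
```

Under these assumptions the churn probability falls from about 0.38 to about 0.27, while the odds are multiplied by the same factor of about 0.607 at every step. This is why option C is a misreading of the coefficient.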

966

MCQmedium

A data analyst is performing exploratory data analysis on a dataset containing house prices. They want to identify outliers in the 'price' column using the IQR method. The first quartile (Q1) is $200,000, the third quartile (Q3) is $350,000, and the IQR is $150,000. What is the upper bound for identifying outliers?

A.$500,000

B.$575,000

C.$425,000

D.$650,000

AnswerB

Correct upper bound.

Why this answer

The IQR method defines outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. Upper bound = 350,000 + (1.5 * 150,000) = 350,000 + 225,000 = $575,000.
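
The same arithmetic can be written as a small helper. The function name `iqr_bounds` is illustrative; the quartile values are the ones given in the question.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(200_000, 350_000)
# upper is 575000.0; lower is -25000.0, so no house price can fall below it
```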

967

MCQhard

An analyst notices that a dashboard displays conflicting data from two sources. What is the first step in troubleshooting?

A.Re-run all queries

B.Notify stakeholders of potential error

C.Replace the data source with the one believed to be correct

D.Check data transformation steps and join logic

AnswerD

Examining transformations and joins helps identify discrepancies between sources.

Why this answer

Option D is correct because the first step in troubleshooting conflicting data from multiple sources is to verify the data transformation steps and join logic. Conflicts often arise from incorrect joins (e.g., using INNER JOIN instead of LEFT JOIN), mismatched keys, or data type conversions that alter values. Checking these steps isolates the root cause before any other action, such as re-running queries or notifying stakeholders.
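
One way to check join logic is to compare an INNER JOIN total with a LEFT JOIN total and list the rows that fail to match. The sketch below uses an in-memory SQLite database; the `orders` and `stores` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, store_id INTEGER, amount REAL);
CREATE TABLE stores (store_id INTEGER, region TEXT);
INSERT INTO orders VALUES (1, 10, 100.0), (2, 20, 50.0), (3, 30, 75.0);
INSERT INTO stores VALUES (10, 'East'), (20, 'West');
""")

# INNER JOIN silently drops order 3 because store 30 has no match.
inner_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "INNER JOIN stores s ON o.store_id = s.store_id"
).fetchone()[0]

# LEFT JOIN keeps every order, matched or not.
left_total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "LEFT JOIN stores s ON o.store_id = s.store_id"
).fetchone()[0]

# Rows with no matching store explain the gap between the two totals.
unmatched = conn.execute(
    "SELECT o.order_id FROM orders o "
    "LEFT JOIN stores s ON o.store_id = s.store_id "
    "WHERE s.store_id IS NULL"
).fetchall()
```

In this example the two totals differ by the amount of the single unmatched order, which points to a missing key rather than a problem in either source.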

Exam trap

The trap here is that candidates often jump to re-running queries or notifying stakeholders first, mistaking a symptom (conflicting data) for a root cause, instead of methodically inspecting the transformation and join logic where the discrepancy likely originates.

How to eliminate wrong answers

Option A is wrong because re-running all queries blindly repeats the same potentially flawed logic without diagnosing the source of the conflict; it wastes time and may overwrite intermediate results. Option B is wrong because notifying stakeholders prematurely escalates an unverified issue, causing unnecessary alarm and eroding trust before the analyst has confirmed the error. Option C is wrong because replacing the data source without understanding the conflict ignores the possibility that the transformation logic or join conditions are the actual problem, and it may introduce new inconsistencies.

968

MCQmedium

A data analyst is using SQL to filter a sales table for transactions that occurred in either 'Q1' or 'Q3' of 2023 and have a sale amount greater than $100. Which WHERE clause correctly implements this condition?

A.WHERE quarter IN ('Q1','Q3') OR amount > 100

B.WHERE quarter = 'Q1' OR quarter = 'Q3' AND amount > 100

C.WHERE quarter = 'Q1' AND amount > 100 OR quarter = 'Q3' AND amount > 100

D.WHERE (quarter = 'Q1' OR quarter = 'Q3') AND amount > 100

AnswerD

Correct: groups the OR and then AND with amount condition.

Why this answer

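Option D uses parentheses to group the two quarter conditions and then applies AND with the amount condition. SQL evaluates AND before OR, so option B parses as quarter = 'Q1' OR (quarter = 'Q3' AND amount > 100) and returns every Q1 row regardless of amount, while option A returns any row that satisfies either condition. Option C returns the same rows as D because each AND is evaluated first, but it repeats the amount condition; D states the intent most directly.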
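
The four WHERE clauses can be compared directly against a small table. This is a sketch using an in-memory SQLite database; the `sales` rows are hypothetical and chosen so that each kind of row (Q1 or Q3 above and below 100, plus a large Q2 sale) appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "Q1", 150), (2, "Q1", 50), (3, "Q3", 200), (4, "Q3", 80), (5, "Q2", 300)],
)

# The WHERE clauses from options A through D, copied from the question.
clauses = {
    "A": "quarter IN ('Q1','Q3') OR amount > 100",
    "B": "quarter = 'Q1' OR quarter = 'Q3' AND amount > 100",
    "C": "quarter = 'Q1' AND amount > 100 OR quarter = 'Q3' AND amount > 100",
    "D": "(quarter = 'Q1' OR quarter = 'Q3') AND amount > 100",
}

# Collect the matching ids for each option.
results = {
    label: [row[0] for row in conn.execute(
        f"SELECT id FROM sales WHERE {where} ORDER BY id")]
    for label, where in clauses.items()
}
```

On this data, A returns all five rows and B returns ids 1, 2, and 3, because AND is evaluated before OR. C and D both return ids 1 and 3.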

969

MCQmedium

A data analyst needs to share a weekly sales report with the marketing team. The report includes aggregated data from the data warehouse. To simplify access, the analyst creates a virtual table that encapsulates the complex query. Which database object should the analyst create?

A.Trigger

B.View

C.Stored procedure

D.Index

AnswerB

A view is a virtual table that simplifies querying by hiding complexity.

Why this answer

A view is a virtual table that encapsulates a complex query, allowing users to access aggregated data without needing to understand the underlying SQL. In this scenario, the analyst creates a view to simplify access to the weekly sales report, as it presents pre-defined, aggregated data from the data warehouse as if it were a table.
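
A minimal sketch of this idea, using an in-memory SQLite database: the aggregation is defined once as a view, and consumers query it like a table. The `sales` table, the `weekly_sales` view, and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (week INTEGER, region TEXT, amount REAL);
INSERT INTO sales VALUES (1, 'East', 100.0), (1, 'West', 50.0), (2, 'East', 70.0);

-- The view stores the query definition, not a copy of the data.
CREATE VIEW weekly_sales AS
SELECT week, SUM(amount) AS total
FROM sales
GROUP BY week;
""")

# The marketing team only needs a plain SELECT against the view.
report = conn.execute(
    "SELECT week, total FROM weekly_sales ORDER BY week"
).fetchall()
```

Because the view is evaluated at query time, new rows in `sales` appear in the report without any change to the view.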

Exam trap

The trap here is that candidates may confuse a view with a stored procedure, thinking both can encapsulate logic, but only a view behaves as a virtual table that can be directly queried with SELECT, while a stored procedure requires explicit execution and does not return a result set in the same way.

How to eliminate wrong answers

Option A is wrong because a trigger is procedural code that automatically executes in response to certain events (e.g., INSERT, UPDATE, DELETE) on a table, not a virtual table for simplifying query access. Option C is wrong because a stored procedure is a set of precompiled SQL statements that can accept parameters and perform operations, but it does not act as a virtual table that can be queried directly with SELECT statements. Option D is wrong because an index is a database structure that improves the speed of data retrieval operations on a table, but it is not a virtual table or a query encapsulation object.

970

MCQhard

Based on the exhibit, what is the most likely cause of the import failure?

A.The file is empty or contains only headers.

B.The price field includes non-numeric characters that cannot be parsed.

C.The source file is corrupted or in an unsupported format.

D.A data quality issue: the date field contains an invalid date.

AnswerD

The date field in the exhibit contains February 30, which does not exist, so the import halted.

Why this answer

Option D is correct because the exhibit shows a date field containing '2023-02-30', which is an invalid date (February never has 30 days). This data quality issue causes the import to fail, as the system likely validates date values against calendar rules before inserting them into the target table. The error is not due to file emptiness, non-numeric characters, or corruption, but specifically a semantic data integrity violation.
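
Calendar validation of this kind can be sketched with the standard library, which rejects dates that are well-formed but do not exist. The helper name `is_valid_iso_date` is illustrative; the actual import tool in the exhibit is not identified, so this only shows the general check.

```python
from datetime import date

def is_valid_iso_date(text: str) -> bool:
    """True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False

bad = is_valid_iso_date("2023-02-30")   # February 30 does not exist
good = is_valid_iso_date("2023-02-28")
```

The string '2023-02-30' matches the expected pattern, so a format-only check would accept it. Only a check against the calendar catches the error.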

Exam trap

The trap here is that candidates may overlook semantic data quality issues (like invalid dates) and instead focus on syntactic problems (like file format or non-numeric characters), even though the exhibit clearly shows a date that does not exist in the calendar.

How to eliminate wrong answers

Option A is wrong because the file contains multiple rows of data beyond headers, as evidenced by the visible records in the exhibit. Option B is wrong because the price field shows numeric values (e.g., 19.99, 29.99) without any non-numeric characters that would cause parsing failures. Option C is wrong because the source file is displayed in a standard CSV format with proper delimiters and readable content, indicating it is not corrupted or in an unsupported format.

971

Multi-Selectmedium

A data analyst is designing a dashboard for executives. Which TWO practices improve dashboard effectiveness? (Select two.)

Select 2 answers

A.Including clear labels and titles for all charts

B.Adding decorative images to make the dashboard visually appealing

C.Displaying data with high precision (e.g., 4 decimal places)

D.Using consistent color coding for similar metrics across charts

E.Using many different colors to highlight every data point

AnswersA, D

Clear labels are essential for understanding.

Why this answer

Clear labels and titles help interpretation; consistent color coding aids quick comprehension.

972

MCQmedium

A data analyst needs to show the relationship between advertising spend (in dollars) and monthly sales revenue (in dollars) for the past 12 months. Which chart type is most appropriate?

A.Scatter plot

B.Line chart

C.Bar chart

D.Pie chart

AnswerA

Correct. Scatter plots display the relationship between two continuous variables.

Why this answer

A scatter plot is used to visualize the correlation between two continuous variables.

973

MCQmedium

A data analyst discovers that the sales data for the current quarter shows a 15% increase in revenue. However, the analyst notes that the data does not include returns from the last week due to a system lag. How should the analyst communicate this uncertainty?

A.Delay the report until returns are processed.

B.Report the 15% increase as final, because the returns are insignificant.

C.Include a caveat that returns from the last week are missing and revenue may be overstated.

D.State that the revenue increase is exactly 15% and provide a confidence interval.

AnswerC

Disclosing the limitation is the correct approach.

Why this answer

When data has known limitations, such as missing returns, the analyst should disclose them so that readers do not draw misleading conclusions.

974

MCQeasy

Which database index type is most commonly used for exact-match lookups and range queries in a B-tree structure?

A.B-tree index

B.Hash index

C.Clustered index

D.Bitmap index

AnswerA

B-tree indexes support both exact-match and range queries.

Why this answer

A B-tree index is the correct answer because it maintains sorted data in a balanced tree structure, enabling both exact-match lookups (via equality searches) and efficient range queries (via ordered traversal of leaf nodes). This dual capability makes it the standard index type in relational databases like MySQL, PostgreSQL, and Oracle for general-purpose querying.

Exam trap

The trap here is that candidates often confuse 'clustered index' as a separate index type, but it is actually a physical implementation of a B-tree where the leaf nodes contain the full row data, not a different algorithmic structure.

How to eliminate wrong answers

Option B (Hash index) is wrong because hash indexes use a hash function to map keys to bucket locations, which is extremely fast for exact-match lookups but does not support range queries (e.g., BETWEEN, >, <) since the hash order does not preserve key order. Option C (Clustered index) is wrong because while a clustered index physically reorders table data based on the index key and can support range queries, it is not a distinct index type but rather a storage organization; the underlying structure is still a B-tree, and the question asks for the index type most commonly used for both operations, which is the B-tree itself. Option D (Bitmap index) is wrong because bitmap indexes store bitmaps for each distinct key value and are optimized for low-cardinality columns and complex boolean queries, not for efficient range scans or exact-match lookups in high-cardinality scenarios.
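
The difference between ordered and hashed lookups can be illustrated with standard-library structures. Note the substitution: a sorted list with binary search stands in for the ordered leaf level of a B-tree, and a dict stands in for a hash index. Neither is an actual database index, and the keys are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Sorted keys: a stand-in for the ordered leaves of a B-tree.
keys = [5, 12, 18, 23, 31, 47]

def exact_match(sorted_keys, key):
    """Binary search for a single key."""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key

def range_query(sorted_keys, lo, hi):
    """All keys with lo <= key <= hi, found by two binary searches."""
    return sorted_keys[bisect_left(sorted_keys, lo):bisect_right(sorted_keys, hi)]

# A dict: a stand-in for a hash index. Fast equality lookup, no key order.
hash_index = {k: f"row-{k}" for k in keys}

in_range = range_query(keys, 10, 25)
```

The ordered structure answers both kinds of query. The dict answers `hash_index[23]` quickly, but a range such as 10 to 25 would require checking every key.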

975

Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to handle outliers. Which TWO of the following are common methods for treating outliers?

Select 2 answers

A.Removal

B.Capping

C.Normalization

D.Imputation

E.Standardization

AnswersA, B

Removing outlier records is a common approach.

Why this answer

Capping (winsorizing) limits extreme values, and removal simply deletes outlier rows. Transformation (e.g., log) can also reduce impact but is not listed here; normalization and imputation are not primary outlier treatments.
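
The two treatments can be sketched side by side. The prices and the fixed bounds below are hypothetical; in practice the bounds often come from percentiles or from IQR fences.

```python
def remove_outliers(values, lower, upper):
    """Removal: drop any value outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

def cap_outliers(values, lower, upper):
    """Capping: pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical house prices with one extreme value.
prices = [210_000, 250_000, 300_000, 340_000, 900_000]
lower, upper = -25_000, 575_000  # assumed bounds

removed = remove_outliers(prices, lower, upper)
capped = cap_outliers(prices, lower, upper)
```

Removal shortens the dataset to four rows, while capping keeps all five rows and replaces 900,000 with 575,000. The choice depends on whether the row count needs to be preserved.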
