Knowledge + Practice

CompTIA Data+ (DA0-001): Questions 826–900

982 questions total · 14 pages · All types, answers revealed

826

MCQmedium

A data analyst is troubleshooting a dashboard that displays incorrect totals for sales by region. The data source queries are correct. Which of the following is the most likely cause?

A.The visualizations are using a different aggregation level.

B.The data model includes duplicate records.

C.The dashboard is using a live connection instead of an extract.

D.The filter context is inadvertently excluding some regions.

AnswerD

Correct. Filters can exclude data without obvious indication, causing incorrect totals.

Why this answer

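Option D is correct because filter context can inadvertently exclude certain regions, which produces incorrect totals even when the underlying queries are right. The other options are less likely: A, a different aggregation level, changes the granularity at which values are shown rather than dropping data; B, duplicate records, would inflate totals but would normally surface when the source queries are validated; C, a live connection instead of an extract, mainly affects data freshness and performance, not the accuracy of totals.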

827

Multi-Selecthard

An analyst is creating an executive summary for a quarterly business review. Which THREE components are essential for an effective executive summary?

Select 3 answers

A.Actionable recommendations

B.Context (e.g., compared to prior quarter)

C.Detailed data tables

D.SQL queries used to extract data

E.Headline number (e.g., revenue growth of 15%)

AnswersA, B, E

Guides decision-making.

Why this answer

An executive summary should start with a headline number, provide context, and include actionable recommendations.

828

Multi-Selecteasy

Which TWO of the following are characteristics of OLTP systems? (Select 2)

Select 2 answers

A.Typically uses a denormalized schema

B.Optimized for complex analytical queries

C.Stores historical data for trend analysis

D.Designed for high transaction throughput

E.Supports ACID transactions

AnswersD, E

OLTP handles many concurrent transactions.

Why this answer

OLTP systems are designed for high transaction throughput, handling large volumes of short, atomic transactions efficiently, which makes option D correct. They also rely on ACID transactions (atomicity, consistency, isolation, durability) to keep data consistent under concurrent writes, which makes option E correct.

Exam trap

The trap here is that candidates often confuse OLTP with OLAP, mistakenly selecting denormalized schemas or analytical optimization as OLTP characteristics, when in fact OLTP emphasizes normalized schemas and high transaction throughput with ACID compliance.
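
The atomicity part of ACID can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the balances, and the simulated failure are hypothetical; the point is only that a transfer which fails midway leaves no partial update behind.

```python
import sqlite3

# Hypothetical two-account ledger in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
            if fail_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
        return True
    except RuntimeError:
        return False


ok = transfer(conn, 1, 2, 30, fail_midway=True)
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(ok, balances)
```

Because the debit is rolled back together with the failed credit, both balances stay at their starting values.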

829

MCQmedium

A company has a large data warehouse running on Snowflake. They receive daily CSV files from multiple sources and load them directly into the warehouse, then run SQL transformations to clean and aggregate the data. Which data integration approach does this describe?

A.ELT

B.Data streaming

C.ETL

D.CDC

AnswerA

ELT loads raw data, then transforms in the warehouse, typical for modern cloud warehouses.

Why this answer

This describes ELT (Extract, Load, Transform) because the raw CSV files are first loaded directly into Snowflake, and then SQL transformations are applied within the warehouse. Unlike ETL, where data is transformed before loading, ELT leverages Snowflake's compute power to perform transformations after ingestion, which is efficient for large-scale batch processing.

Exam trap

The trap here is that candidates confuse ELT with ETL because both involve transformations, but the key distinction is the order of loading versus transforming; CompTIA often tests this by describing the sequence of operations to see if you recognize that loading raw data first is the hallmark of ELT.

How to eliminate wrong answers

Option B is wrong because data streaming involves continuous, real-time ingestion (e.g., using Kafka or Kinesis), not daily batch CSV file loads. Option C is wrong because ETL would transform the data before loading into Snowflake, but the question states raw CSV files are loaded directly and then transformed afterward. Option D is wrong because CDC (Change Data Capture) captures incremental changes from source databases (e.g., via Debezium or Oracle GoldenGate), not daily full-file CSV imports.
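
The load-then-transform order can be sketched as follows. This is an illustration only: SQLite stands in for the cloud warehouse, and the `raw_orders` and `sales_by_region` tables and the CSV contents are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV drop; note the untrimmed, mixed-case region values.
raw_csv = "order_id,region,amount\n1, east ,10.50\n2,WEST,4.25\n3,east,5.00\n"

conn = sqlite3.connect(":memory:")

# Extract + Load: land the rows unchanged in a staging table (everything as TEXT).
conn.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
rows = list(csv.DictReader(io.StringIO(raw_csv)))
conn.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :region, :amount)", rows
)

# Transform: clean and aggregate inside the database, after loading, with SQL.
conn.execute(
    """
    CREATE TABLE sales_by_region AS
    SELECT UPPER(TRIM(region)) AS region,
           SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_orders
    GROUP BY UPPER(TRIM(region))
    """
)
totals = conn.execute(
    "SELECT region, total_amount FROM sales_by_region ORDER BY region"
).fetchall()
print(totals)
```

In an ETL pipeline the trimming, casting, and aggregation would instead happen in application code before the first `INSERT`.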

830

MCQeasy

A data engineer needs to extract data from a REST API and load it into a data warehouse. The data is received in JSON format. Which data type best describes JSON?

A.Transactional

B.Semi-structured

C.Unstructured

D.Structured

AnswerB

JSON is semi-structured as it has organizational properties (key-value pairs) but no rigid schema.

Why this answer

JSON (JavaScript Object Notation) is classified as a semi-structured data type because it uses a flexible, self-describing schema with key-value pairs and nested structures, but does not enforce a rigid tabular schema like relational databases. In the context of extracting data from a REST API, JSON allows for varying fields and hierarchical data, which aligns with the semi-structured category.

Exam trap

The trap here is that candidates confuse the presence of structure (keys and values) with being fully structured, overlooking that JSON lacks a fixed schema and allows variability, which places it in the semi-structured category.

How to eliminate wrong answers

Option A is wrong because transactional data refers to records of business transactions (e.g., sales, orders) typically stored in structured formats with ACID properties, not to the format of the data itself. Option C is wrong because unstructured data lacks any predefined structure or schema (e.g., raw text, images, video), whereas JSON has a defined syntax with keys, values, and nesting. Option D is wrong because structured data requires a fixed schema (e.g., rows and columns in a relational table), while JSON allows optional fields and varying data types, making it semi-structured.
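
A short sketch of what "semi-structured" means in practice: the two records below (hypothetical API output) share some keys but not all, and one contains nested values. Flattening them into a fixed column list is a step the loader has to perform; the format itself does not enforce it.

```python
import json

# Hypothetical API response: the two records do not share the same fields.
payload = """
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Lin", "tags": ["vip"], "address": {"city": "Oslo"}}
]
"""

records = json.loads(payload)

# Each record describes its own fields; there is no schema shared by all rows.
field_sets = [sorted(record.keys()) for record in records]

# To load into a table, pick a fixed column list and fill the gaps with None.
all_fields = sorted({key for record in records for key in record})
flat = [{field: record.get(field) for field in all_fields} for record in records]

print(field_sets)
print(all_fields)
```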

831

MCQeasy

A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?

A.Concatenate

B.Append

C.Merge

D.Aggregate

AnswerB

Append adds rows from one dataset to another with same columns.

Why this answer

Option B (Append) is correct because appending is the standard operation for combining two datasets with identical columns but different rows, stacking the rows from one dataset onto the other. In SQL, this is achieved with the UNION or UNION ALL operator, and in Python pandas, it is done via `pd.concat()` with axis=0 (the older `DataFrame.append()` method was deprecated and then removed in pandas 2.0). This operation preserves the column structure while extending the row count.

Exam trap

The trap here is that candidates confuse 'concatenate' (which can mean row-wise or column-wise) with 'append' (which specifically means row-wise stacking), leading them to choose Option A when the question explicitly requires combining rows.

How to eliminate wrong answers

Option A (Concatenate) is wrong because concatenation is a general term that can refer to combining along any axis (rows or columns), and in many contexts (e.g., SQL string functions, pandas with axis=1), it implies joining side-by-side rather than stacking rows; the question specifically requires row-wise stacking, which is append. Option C (Merge) is wrong because merge is used to combine datasets based on a common key column (like a SQL JOIN), not to simply stack rows when columns are identical. Option D (Aggregate) is wrong because aggregation involves summarizing data (e.g., SUM, AVG, COUNT) across groups, not combining separate datasets.
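
The difference between append and merge can be sketched with plain Python lists of dictionaries. The datasets and the `regions` lookup are hypothetical, and real work would more likely use SQL or a dataframe library.

```python
# Two hypothetical datasets with the same columns but different rows.
q1_orders = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
q2_orders = [{"order_id": 3, "amount": 30}]

# A separate lookup keyed on order_id (used only for the merge contrast).
regions = {1: "east", 2: "west", 3: "east"}

# Append: stack rows; the column set stays the same, the row count grows.
appended = q1_orders + q2_orders

# Merge: match on a key; the row count stays the same, the column set grows.
merged = [{**row, "region": regions[row["order_id"]]} for row in appended]

print(len(appended), sorted(appended[0]))
print(sorted(merged[0]))
```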

832

MCQmedium

An analyst recommends a pricing change based on data showing price elasticity. The recommendation includes expected revenue impact. What is this an example of?

A.Data-driven recommendation

B.Uncertainty communication

C.Self-service analysis

D.Executive summary

AnswerA

Correct. It includes evidence and expected impact.

Why this answer

Data-driven recommendations follow the structure: evidence -> insight -> recommendation -> expected impact.

833

MCQmedium

A data analyst encounters the above error log when trying to connect to a database. The analyst needs to explain the issue to the database administrator. Which of the following correctly describes the problem?

A.The database connection pool has reached its maximum limit.

B.The database table is corrupted.

C.The database server is out of disk space.

D.The database authentication credentials are invalid.

AnswerA

The log explicitly says 'Connection pool exhausted'.

Why this answer

The error log indicates a 'connection pool exhausted' or 'too many connections' message, which occurs when the database connection pool has reached its maximum limit. This means all available connections are in use, and no new connections can be established until existing ones are released. The analyst should explain to the DBA that the application is attempting to open more connections than the pool allows, often due to a connection leak or insufficient pool size.

Exam trap

The trap here is that candidates confuse connection pool exhaustion with authentication or disk space issues, but the error log's specific wording (e.g., 'cannot acquire connection from pool') directly points to a connection limit problem.

How to eliminate wrong answers

Option B is wrong because a corrupted table typically produces errors like 'table corruption' or 'index corruption', not connection pool exhaustion. Option C is wrong because out-of-disk-space errors manifest as 'disk full' or 'no space left on device', not connection limit errors. Option D is wrong because invalid authentication credentials result in 'access denied' or 'login failed' errors, not connection pool limit messages.

834

MCQeasy

In a simple linear regression model y = 2.5 + 1.2x, what is the predicted value of y when x = 10?

A.12.0

B.13.7

C.14.5

D.10.0

AnswerC

Correct calculation.

Why this answer

Plug x=10: y = 2.5 + 1.2*10 = 2.5 + 12 = 14.5.
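
The same substitution as a small function, with the intercept and slope taken from the question. The function name `predict` is just an illustrative choice.

```python
def predict(x, intercept=2.5, slope=1.2):
    """Return the fitted value of y for a simple linear regression."""
    return intercept + slope * x


y_hat = predict(10)
print(round(y_hat, 2))
```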

835

MCQmedium

A data analyst wants to show the monthly sales trend for the past two years with an emphasis on cumulative growth. Which chart type is most appropriate?

A.Scatter plot

B.Bar chart

C.Area chart

D.Line chart

AnswerC

Area charts highlight cumulative totals over time.

Why this answer

An area chart is the most appropriate choice because it combines a line chart's ability to show trends over time with a filled area beneath the line, which visually emphasizes cumulative growth. For monthly sales data over two years, the area fill makes it easy to see both the month-over-month trend and the total accumulated sales volume, which is exactly what the data analyst needs.

Exam trap

The trap here is that candidates often choose a line chart (Option D) because it shows trends, but they overlook that the area chart is specifically designed to emphasize cumulative growth through its filled region, which is the key requirement in the question.

How to eliminate wrong answers

Option A is wrong because a scatter plot is used to show the relationship between two numerical variables (e.g., correlation), not to display a single time series trend or cumulative growth. Option B is wrong because a bar chart is best for comparing discrete categories or individual values, but it does not inherently emphasize cumulative growth or continuous trend over time. Option D is wrong because while a line chart effectively shows trends over time, it lacks the filled area that visually emphasizes cumulative growth; the area chart is specifically designed to highlight the magnitude of change and accumulation.

836

MCQhard

Refer to the exhibit. A data analyst is unable to run a query on the customers table after October 1, 2023. What is the reason?

A.The resource name is incorrect

B.The policy allows access only before October 1, 2023

C.The action should be READ not SELECT

D.The policy denies access after October 1, 2023

AnswerB

After October 1, the condition fails, and the Allow effect no longer applies, resulting in denial.

Why this answer

The policy explicitly allows access only before October 1, 2023, meaning any query attempt on or after that date is denied. This is a time-based access control condition, often implemented using AWS IAM or Azure RBAC policies with a `Condition` block that checks the `aws:CurrentTime` or equivalent attribute. Since the query fails after October 1, 2023, the policy's effective date restriction is the direct cause.

Exam trap

CompTIA often tests the distinction between an explicit deny and an implicit deny caused by a missing allow condition; the trap here is that candidates mistakenly think the policy contains an explicit 'deny after date' statement, when in reality it simply grants access only before the date, relying on the default implicit deny for all other times.

How to eliminate wrong answers

Option A is wrong because the resource name being incorrect would cause a different error (e.g., 'Table not found' or 'Invalid resource'), not a time-based denial. Option C is wrong because `SELECT` is the correct SQL action for reading data; the policy uses `SELECT` as the action identifier, not `READ`, and changing it would not resolve the time restriction. Option D is wrong because it misstates the policy logic: the policy does not explicitly deny access after October 1, 2023; instead, it grants access only before that date, which implicitly denies access after it.

The distinction matters because an explicit deny would override any allow, but here the absence of an allow after the date is the issue.
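
The implicit-deny behaviour can be sketched with a toy evaluator. The exhibit is not reproduced here, so the `policy` dictionary, its field names, and the `is_allowed` function are all hypothetical and do not follow any particular vendor's policy syntax.

```python
from datetime import datetime, timezone

# Hypothetical policy: a single Allow statement with a date condition.
policy = {
    "effect": "Allow",
    "action": "SELECT",
    "resource": "customers",
    "condition": {"date_less_than": datetime(2023, 10, 1, tzinfo=timezone.utc)},
}


def is_allowed(policy, action, resource, now):
    """Return True only if the Allow statement matches; otherwise deny by default."""
    statement_matches = (
        policy["effect"] == "Allow"
        and policy["action"] == action
        and policy["resource"] == resource
        and now < policy["condition"]["date_less_than"]
    )
    # There is no explicit Deny anywhere: a non-matching Allow simply grants nothing.
    return statement_matches


before = is_allowed(policy, "SELECT", "customers", datetime(2023, 9, 30, tzinfo=timezone.utc))
after = is_allowed(policy, "SELECT", "customers", datetime(2023, 10, 2, tzinfo=timezone.utc))
print(before, after)
```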

837

MCQeasy

A dashboard needs to show sales trends for each of five regions over the past year. The intended audience wants to compare trends easily. Which chart type is best?

A.Line chart with multiple lines

B.Pie chart

C.Stacked bar chart

D.Area chart

AnswerA

A line chart with multiple lines clearly shows each region's trend over time.

Why this answer

Option A is correct because a line chart with one line per region allows direct comparison of trends across regions. A stacked bar would show composition, not trends.

838

MCQhard

In the data lifecycle, which phase involves converting raw data into a usable format for analysis?

A.Ingestion

B.Analysis

C.Archival

D.Processing

AnswerD

Processing transforms raw data into a usable format.

Why this answer

Option D is correct because the processing phase in the data lifecycle is specifically where raw data is cleaned, transformed, and structured into a usable format for analysis. This includes operations such as parsing, normalization, deduplication, and conversion into formats like Parquet or Avro, which are optimized for query engines like Apache Spark or Presto.

Exam trap

The trap here is that candidates often confuse 'ingestion' with 'processing' because both involve moving data, but ingestion is about raw data capture, while processing is about transformation and cleaning before analysis.

How to eliminate wrong answers

Option A is wrong because ingestion refers to the initial collection and import of raw data from sources (e.g., via Apache Kafka or Flume) into a storage system, not its transformation into a usable format. Option B is wrong because analysis is the phase where processed data is queried, visualized, or modeled to derive insights, not where raw data is converted. Option C is wrong because archival involves moving older or infrequently accessed data to long-term storage (e.g., Amazon S3 Glacier or tape) for compliance or cost savings, not for preparing data for analysis.

839

Multi-Selectmedium

A data analyst wants to segment customers based on purchasing behavior such as frequency, monetary value, and recency. Which TWO clustering evaluation methods can help determine the optimal number of clusters? (Select two.)

Select 2 answers

A.Correlation coefficient

B.ANOVA

C.Silhouette score

D.t-test

E.Elbow method

AnswersC, E

Measures how similar an object is to its own cluster vs others.

Why this answer

The elbow method uses within-cluster sum of squares, and the silhouette score measures cohesion and separation. Both help choose k. Correlation coefficient is for association, not clustering. ANOVA and t-test are for hypothesis testing.
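
Both measures can be computed by hand on a tiny example. In the sketch below the one-dimensional points and the cluster labels for k = 2, 3, and 4 are hypothetical and assigned by hand rather than produced by a clustering algorithm; in practice a library would fit the clusters and compute the scores.

```python
def wcss(points, labels):
    """Within-cluster sum of squares (the quantity plotted in the elbow method)."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, l in zip(points, labels) if l == cluster]
        centre = sum(members) / len(members)
        total += sum((p - centre) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Average silhouette: (b - a) / max(a, b) per point, 0 for singleton clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [abs(p - q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            sum(abs(p - q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical spend values forming three visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
labels_by_k = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],
}

wcss_by_k = {k: wcss(points, labels) for k, labels in labels_by_k.items()}
sil_by_k = {k: mean_silhouette(points, labels) for k, labels in labels_by_k.items()}
best_k = max(sil_by_k, key=sil_by_k.get)
print(wcss_by_k, sil_by_k, best_k)
```

The within-cluster sum of squares drops sharply from k = 2 to k = 3 and barely moves afterwards (the "elbow"), and the mean silhouette peaks at the same k.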

840

MCQhard

A database has a table that violates 2NF because it contains a composite primary key and some attributes depend only on part of that key. Which normal form would be violated next if the table is not addressed?

A.2NF

B.BCNF

C.3NF

D.1NF

AnswerA

2NF is violated by partial dependencies on a composite key.

Why this answer

The scenario describes a partial dependency: with a composite primary key, some attributes depend on only part of that key. That is the definition of a 2NF violation, which is why the answer key marks 2NF. Each higher normal form builds on the one below it, so a table that fails 2NF also cannot satisfy 3NF or BCNF until the partial dependency is removed.

Exam trap

The trap here is the word 'next'. Read literally, it points to 3NF, the form immediately above 2NF, which requires 2NF plus the absence of transitive dependencies. The answer key instead treats 2NF as the form whose violation the scenario describes, so candidates who reason from the wording toward 3NF will disagree with the key.

How to eliminate wrong answers

Option B is wrong because BCNF is a stricter version of 3NF; it is not the form directly affected by a partial dependency. Option C is the answer that the literal wording 'violated next' suggests, because 3NF presupposes 2NF, but the answer key does not accept it. Option D is wrong because the scenario gives no indication of a 1NF problem such as non-atomic values, and a 1NF violation would occur before 2NF, not after.

841

MCQhard

A data analyst is asked to compare the average sales across three different store locations. The data is normally distributed and variances are approximately equal. Which statistical test is most appropriate?

A.ANOVA

B.Pearson correlation

C.Chi-square test

D.Two-sample t-test

AnswerA

ANOVA is designed for comparing means of 3+ groups.

Why this answer

ANOVA is used to compare means of three or more groups when assumptions of normality and equal variance are met.
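
The F statistic behind a one-way ANOVA can be computed by hand as the ratio of between-group to within-group variance. The sketch below uses hypothetical sales figures for three stores and stops at the F statistic and its degrees of freedom; obtaining a p-value requires the F distribution, which a statistics library would supply.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Hypothetical daily sales for three store locations.
store_a = [10, 12, 11]
store_b = [14, 15, 16]
store_c = [20, 19, 21]

f_stat, df_between, df_within = one_way_anova_f([store_a, store_b, store_c])
print(round(f_stat, 2), df_between, df_within)
```

A large F relative to the critical value for (2, 6) degrees of freedom suggests that at least one store mean differs from the others.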

842

Multi-Selectmedium

Which TWO of the following are best practices for designing a data dashboard?

Select 2 answers

A.Include animated transitions between data views.

B.Use consistent color schemes to indicate performance levels.

C.Use 3D effects to make charts more visually appealing.

D.Place the most important KPIs at the top of the dashboard.

E.Include as many charts as possible to provide comprehensive data.

AnswersB, D

Consistent colors help users quickly interpret data.

Why this answer

Option B is correct because consistent color schemes (e.g., red for critical, yellow for warning, green for normal) allow users to instantly interpret performance levels without cognitive overload. This aligns with dashboard design principles that prioritize clarity and rapid pattern recognition over decorative elements.

Exam trap

The trap here is that candidates confuse 'visually appealing' with 'effective communication' — CompTIA often tests that decorative elements like 3D effects and animations reduce data accuracy and user comprehension, even though they may look impressive.

843

MCQeasy

A healthcare database stores patient records. Each patient has a unique patient_id, and the database includes a table 'visits' with visit_id, patient_id, visit_date, and diagnosis_code. To ensure data integrity, which constraint should be applied to the patient_id column in the 'visits' table?

A.Unique constraint

B.Foreign key

C.Primary key

D.Check constraint

AnswerB

Foreign key enforces referential integrity.

Why this answer

Option B is correct because a foreign key constraint ensures that patient_id in visits references a valid patient_id in the patient table. Option A is wrong because a unique constraint would prevent duplicates, so each patient could have only one visit. Option C is wrong because a primary key enforces uniqueness within its own table, and visits is already identified by visit_id. Option D is wrong because a check constraint validates values against a condition, not against rows in another table.
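
A minimal sketch of referential integrity using Python's built-in `sqlite3` module. The table layout follows the question; the `patients` table name and the sample rows are assumptions. SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite

conn.execute("CREATE TABLE patients (patient_id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE visits (
        visit_id       INTEGER PRIMARY KEY,
        patient_id     INTEGER NOT NULL REFERENCES patients(patient_id),
        visit_date     TEXT,
        diagnosis_code TEXT
    )
    """
)

conn.execute("INSERT INTO patients VALUES (1)")
# Valid: patient 1 exists, and the same patient may have several visits.
conn.execute("INSERT INTO visits VALUES (100, 1, '2024-01-05', 'J10')")
conn.execute("INSERT INTO visits VALUES (101, 1, '2024-02-11', 'J10')")

# Invalid: patient 999 does not exist, so the foreign key rejects the row.
try:
    conn.execute("INSERT INTO visits VALUES (102, 999, '2024-02-12', 'J10')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

visit_count = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(orphan_rejected, visit_count)
```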

844

MCQhard

An analyst is performing a logistic regression to predict customer churn (yes/no). The model outputs a probability of 0.75 for a particular customer. Which of the following best describes the interpretation?

A.The model predicts that the customer will not churn

B.There is a 75% chance that the customer will churn

C.The customer will definitely churn because the probability is above 0.5

D.The odds of churning are 0.75 to 1

AnswerB

Correct interpretation of logistic regression output.

Why this answer

Logistic regression outputs the probability that the event (churn) occurs, given the input features.
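
The relationship between probability and odds shows why option D misreads the output. The sketch below is plain arithmetic; the 0.5 decision threshold is an assumed convention, not something stated in the question.

```python
import math


def odds_from_probability(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)


def sigmoid(z):
    """Logistic function: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


p_churn = 0.75
odds = odds_from_probability(p_churn)   # 3 to 1, not 0.75 to 1
log_odds = math.log(odds)               # the linear predictor of the model
round_trip = sigmoid(log_odds)          # recovers the probability

# A class label needs a threshold; 0.5 is a common but arbitrary choice.
predicted_churn = p_churn >= 0.5
print(odds, round(round_trip, 2), predicted_churn)
```

The label is a thresholded prediction, so a probability above 0.5 still leaves a 25% chance that this customer does not churn.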

845

MCQmedium

A data analyst needs to create a dashboard that updates automatically every hour. The data source is a large database. Which approach minimizes performance impact?

A.Query the entire database each time

B.Use incremental refresh only for new or changed data

C.Export the data to Excel and import

D.Create a static report monthly

AnswerB

Incremental refresh minimizes database load.

Why this answer

Option B is correct because incremental refresh queries only new or changed records since the last refresh, drastically reducing data transfer and processing load on the large database. This approach uses change-tracking mechanisms (e.g., timestamps, CDC) to avoid full-table scans, minimizing performance impact while maintaining near-real-time updates.

Exam trap

CompTIA often tests the misconception that 'more data is better' or that full refreshes are simpler and equally acceptable, but the trap here is ignoring the performance cost of full database scans on large datasets in favor of the more efficient incremental approach.

How to eliminate wrong answers

Option A is wrong because querying the entire database each hour performs a full table scan on a large database, causing excessive I/O, CPU, and memory usage that degrades performance for all users. Option C is wrong because exporting the entire database to Excel and importing it adds unnecessary data transformation overhead, loses real-time capability, and still requires a full data pull. Option D is wrong because a static monthly report does not meet the requirement for automatic hourly updates and provides stale data, making it functionally incorrect for the use case.
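
One common way to implement incremental refresh is a timestamp watermark. The sketch below assumes a hypothetical `sales` table with an `updated_at` column stored as ISO 8601 text, and uses SQLite in place of the large database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T09:00:00"),
        (2, 20.0, "2024-01-01T09:30:00"),
        (3, 30.0, "2024-01-01T10:15:00"),
    ],
)


def incremental_refresh(conn, last_watermark):
    """Fetch only rows changed since the previous refresh and advance the watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM sales "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# The previous hourly refresh ran at 10:00, so only row 3 is new.
changed_rows, watermark = incremental_refresh(conn, "2024-01-01T10:00:00")
print(changed_rows, watermark)
```

An index on `updated_at` would let the database answer this query without scanning the whole table.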

846

Multi-Selecthard

Which THREE of the following are valid data quality dimensions? (Choose THREE.)

Select 3 answers

A.Encryption

B.Redundancy

C.Completeness

D.Timeliness

E.Accuracy

AnswersC, D, E

Completeness is a data quality dimension.

Why this answer

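Completeness is a core data quality dimension that measures whether all required data is present. In the context of the DA0-001 exam, completeness ensures that no fields or records are missing, which is fundamental for reliable analysis and reporting. Timeliness measures whether data is current enough for its intended use, and accuracy measures whether values correctly reflect the real-world entities they describe.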

Exam trap

CompTIA often tests the distinction between data quality dimensions and data management techniques, so candidates may mistakenly select encryption or redundancy because they sound like important data concepts, but they are not part of the standard quality dimensions.

847

MCQeasy

An operations manager needs to monitor manufacturing line status every 15 minutes to identify bottlenecks. Which type of report should be used?

A.Operational report

B.Ad hoc report

C.Analytical report

D.Scheduled report

AnswerA

Correct. Operational reports are designed for real-time operational monitoring.

Why this answer

Operational reports provide real-time or near-real-time data for monitoring ongoing operations.

848

Multi-Selectmedium

An analyst is conducting an A/B test on a new checkout process. To calculate sample size, which THREE factors must be considered?

Select 3 answers

A.Number of control groups

B.Desired effect size

C.Significance level (alpha)

D.Statistical power

E.Population standard deviation

AnswersB, C, D

Effect size is the minimum practical difference to detect.

Why this answer

Statistical power, significance level (alpha), and desired effect size (minimum detectable effect) are essential for sample size calculation.
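
A sketch of how the three factors combine, using the normal-approximation formula for a two-sided comparison of two groups with a standardized effect size (Cohen's d). The function name and the default values are illustrative assumptions; dedicated power-analysis tools apply small corrections to this approximation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group: 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = sample_size_per_group(0.5)  # medium standardized effect
n_small = sample_size_per_group(0.2)   # smaller effects need far more users
print(n_medium, n_small)
```

Because the effect size here is standardized, the variability of the metric is already folded into it.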

849

MCQhard

A data engineer is tasked with acquiring data from a third-party vendor that provides daily file drops via SFTP. The files are large (10 GB each). The pipeline must load data into a data warehouse. Which approach optimizes for speed and reliability?

A.Download the file to a staging server, then bulk insert into warehouse

B.Stream the file directly from SFTP into warehouse using a data pipeline tool

C.Have the vendor push data via API instead of SFTP

D.Split the file into smaller chunks and load concurrently

AnswerB

Streaming minimizes latency and storage overhead.

Why this answer

Option B is correct because streaming the file directly from SFTP into the warehouse using a data pipeline tool (e.g., Apache NiFi, Airbyte, or Fivetran) eliminates the intermediate staging step, reducing disk I/O and latency. This approach leverages incremental processing and parallel streams to handle large 10 GB files efficiently, while built-in retry and checkpoint mechanisms ensure reliability against network interruptions.

Exam trap

The trap here is that candidates assume 'download then load' (Option A) is the most reliable approach, but the question specifically asks for speed and reliability, and streaming avoids the I/O bottleneck and single-point-of-failure of a staging server.

How to eliminate wrong answers

Option A is wrong because downloading the file to a staging server introduces an unnecessary intermediate write and read cycle, doubling I/O time and adding a single point of failure; bulk insert after full download also delays loading until the entire file is present, which is suboptimal for speed. Option C is wrong because having the vendor push data via API instead of SFTP does not inherently optimize speed or reliability for large daily file drops—APIs often have payload size limits (e.g., 10 MB) and require chunking, adding complexity and potential throttling, while SFTP is already a reliable file transfer protocol. Option D is wrong because splitting the file into smaller chunks and loading concurrently can cause resource contention (e.g., connection pool exhaustion, lock contention) and requires careful coordination to maintain data consistency; it does not address the fundamental bottleneck of downloading the entire file before processing.

850

MCQhard

Refer to the exhibit. A data analyst is reviewing a data quality report. Which of the following actions should the analyst take first?

A.Delete the 1200 records with null emails.

B.Fill null emails with a placeholder.

C.Investigate the source system to understand why emails are missing.

D.Ignore the nulls as they are not critical.

AnswerC

Correct. Root cause analysis should precede any corrective action.

Why this answer

Option C is correct because the first step in data quality remediation is root cause analysis. Without understanding why 1200 records have null emails (e.g., a source system bug, a failed ETL join, or a missing required field), any corrective action like deletion or placeholder insertion risks introducing bias or masking a systemic issue. Investigating the source system aligns with the data governance principle of 'fix the source, not the symptom.'

Exam trap

CompTIA often tests the principle that 'fix the source, not the symptom'—the trap here is that candidates jump to data cleansing actions (delete, fill, ignore) without first diagnosing why the nulls exist, which is a classic data quality management mistake.

How to eliminate wrong answers

Option A is wrong because deleting 1200 records with null emails reduces dataset size and may discard valid records if the nulls are due to a temporary system glitch, not actual missing data. Option B is wrong because filling null emails with a placeholder (e.g., 'unknown@domain.com') introduces false data that can skew analysis, violate email format constraints, and mislead downstream processes. Option D is wrong because ignoring nulls assumes they are non-critical without verification; in many contexts (e.g., customer communications, deduplication), missing emails are critical and can lead to incomplete insights or compliance issues.

851

MCQhard

A healthcare organization must ensure patient data privacy when sharing reports with external auditors. Which practice is most important?

A.Encrypt the report file

B.Obtain consent from patients

C.Aggregate data at low granularity

D.Use pseudonymization

AnswerD

Pseudonymization reduces identifiability while retaining analytical value, which supports privacy obligations such as HIPAA when sharing with auditors.

Why this answer

Pseudonymization replaces identifying information with pseudonyms, preserving data utility while protecting privacy. Aggregation reduces granularity, but small groups may still reveal identities; encryption protects the file in transit and at rest, but the recipient still sees the identifiers once it is decrypted; obtaining consent is impractical for large datasets.
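
One way to pseudonymize an identifier is a keyed hash, sketched below with Python's `hmac` and `hashlib` modules. The key, the record layout, and the 16-character truncation are hypothetical choices; a real deployment would keep the key in a secrets manager and follow its own regulatory guidance.

```python
import hashlib
import hmac

# Hypothetical secret; whoever holds it can re-link pseudonyms to patients.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymize(patient_id, key=SECRET_KEY):
    """Return a stable pseudonym: same input and key always give the same token."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


records = [
    {"patient_id": "P-001", "diagnosis_code": "J10"},
    {"patient_id": "P-002", "diagnosis_code": "E11"},
    {"patient_id": "P-001", "diagnosis_code": "I10"},
]

# The shared report carries the pseudonym instead of the real identifier.
shared = [
    {"patient_ref": pseudonymize(r["patient_id"]), "diagnosis_code": r["diagnosis_code"]}
    for r in records
]
print(shared)
```

Because the pseudonym is stable, auditors can still count visits per patient without seeing who the patient is.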

852

Multi-Selectmedium

A data analyst wants to retrieve the top 5 highest-paid employees from the 'employees' table. Which SQL clauses could be used to achieve this? (Select TWO.)

Select 2 answers

A.ORDER BY salary DESC

B.HAVING salary

C.ORDER BY salary ASC

D.LIMIT 5

E.GROUP BY salary

AnswersA, D

Sorts highest to lowest.

Why this answer

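ORDER BY salary DESC sorts salaries from highest to lowest, and LIMIT 5 keeps only the first five rows. Other SQL dialects express the row limit with TOP or FETCH FIRST instead of LIMIT.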
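
The two clauses combine as in the sketch below, run against SQLite through Python's `sqlite3` module. The employee names and salaries are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [
        ("Ana", 90000), ("Ben", 72000), ("Cho", 110000), ("Dev", 65000),
        ("Eli", 98000), ("Fay", 58000), ("Gus", 105000),
    ],
)

# Sort highest first, then keep only the first five rows.
top_five = conn.execute(
    "SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 5"
).fetchall()
print(top_five)
```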

853

Multi-Selecthard

A data analyst is preparing a presentation for the executive team to explain why quarterly revenue fell short of targets. They want to use storytelling with data. Which THREE elements should be included in the narrative arc? (Choose three.)

Select 3 answers

A.Raw data tables with every transaction

B.Resolution: the recommended action or outcome

C.Situation: background and context

D.Detailed explanation of data cleaning steps

E.Complication: the problem or challenge

AnswersB, C, E

Correct. The resolution provides the path forward.

Why this answer

Situation sets context, complication introduces the problem, and resolution provides the solution. Data details are not part of the high-level narrative arc.

854

MCQmedium

A manufacturing company has two primary data systems: an ERP system that stores production orders with fields like OrderID, ProductID, Quantity, and ProductionDate, and a CRM system that stores customer sales with fields like SaleID, CustomerID, ProductID, SaleDate, and Amount. The data analyst needs to create a unified view of product performance by joining these tables. However, the ProductID field in the ERP uses a 5-character alphanumeric code (e.g., 'P1234'), while the CRM uses a 6-character code (e.g., 'PR1234'). Additionally, some products have multiple entries due to slight variations in naming. The analyst wants to ensure accurate matching without losing data. Which action should the analyst take first to address the data inconsistency?

A.Create a mapping table that standardizes ProductID formats between ERP and CRM.

B.Perform data profiling to identify all unique ProductID values and their frequencies.

C.Aggregate data by product name and ignore ProductID mismatches.

D.Use a fuzzy matching algorithm to join on similar ProductID strings.

AnswerA

Correct: Standardization of keys is necessary before joining.

Why this answer

Option A is correct because creating a mapping table allows the analyst to explicitly define the relationship between the 5-character ERP ProductID and the 6-character CRM ProductID, ensuring accurate joins without data loss. This approach standardizes the inconsistent formats and handles variations by providing a controlled, deterministic lookup, which is essential for maintaining referential integrity in a unified view.

Exam trap

The trap here is that candidates may choose fuzzy matching (Option D) thinking it handles all variations, but CompTIA often tests the principle that deterministic mapping is preferred over probabilistic methods when the inconsistency is systematic and can be resolved with a known transformation.

How to eliminate wrong answers

Option B is wrong because data profiling only identifies the unique values and their frequencies but does not resolve the format mismatch; it merely highlights the problem without providing a mechanism to align the keys for joining. Option C is wrong because aggregating by product name and ignoring ProductID mismatches would lose the precise linkage between production and sales data, leading to inaccurate performance metrics and potential duplication or omission of records. Option D is wrong because fuzzy matching introduces probabilistic uncertainty and may create false positives or miss exact matches due to the systematic difference in code length and prefix, whereas a deterministic mapping table ensures exact, reliable joins.
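
The mapping-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the `product_map` contents, and the `unify` helper are hypothetical, and the question does not say how CRM codes actually correspond to ERP codes, so the mapping is written out explicitly rather than derived from a prefix rule.

```python
# Hypothetical sample rows; field names follow the question, values are invented.
erp_orders = [
    {"OrderID": 1, "ProductID": "P1234", "Quantity": 50},
    {"OrderID": 2, "ProductID": "P5678", "Quantity": 20},
]
crm_sales = [
    {"SaleID": 10, "ProductID": "PR1234", "Amount": 900.0},
    {"SaleID": 11, "ProductID": "PR5678", "Amount": 300.0},
    {"SaleID": 12, "ProductID": "PR9999", "Amount": 75.0},  # no ERP counterpart
]

# Hypothetical mapping table: one explicit entry per CRM code -> ERP code.
product_map = {"PR1234": "P1234", "PR5678": "P5678"}


def unify(erp_rows, crm_rows, mapping):
    """Total quantity produced and amount sold per ERP ProductID."""
    summary = {}
    unmatched = []
    for row in erp_rows:
        entry = summary.setdefault(row["ProductID"], {"quantity": 0, "amount": 0.0})
        entry["quantity"] += row["Quantity"]
    for row in crm_rows:
        erp_id = mapping.get(row["ProductID"])
        if erp_id is None:
            # Keep unmatched sales for review instead of silently dropping them.
            unmatched.append(row)
            continue
        entry = summary.setdefault(erp_id, {"quantity": 0, "amount": 0.0})
        entry["amount"] += row["Amount"]
    return summary, unmatched


summary, unmatched = unify(erp_orders, crm_sales, product_map)
```

Returning the unmatched rows separately reflects the "without losing data" requirement: a code that is missing from the mapping is surfaced for follow-up rather than discarded by the join.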

855

MCQmedium

A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?

A.Bar chart

B.Histogram

C.Scatter plot

D.Box plot

AnswerD

Box plot displays distribution across groups.

Why this answer

A box plot (option D) is the most suitable chart for visualizing the distribution of a continuous variable across different categories because it displays the median, quartiles, and potential outliers for each group, enabling direct comparison of spread and central tendency. Unlike a histogram, which shows the distribution of a single continuous variable without categorical grouping, the box plot inherently supports categorical axes. This makes it ideal for exploratory data analysis when assessing how a metric like revenue varies by region or product category.

Exam trap

CompTIA often tests the distinction between histograms and box plots by presenting a scenario where a candidate mistakenly chooses a histogram for grouped categorical data, overlooking that histograms require a continuous x-axis and cannot inherently separate categories without additional faceting.

How to eliminate wrong answers

Option A is wrong because a bar chart is designed for comparing categorical data using discrete counts or sums, not for showing the distribution of a continuous variable across categories. Option B is wrong because a histogram visualizes the distribution of a single continuous variable using bins, but it does not natively separate data into distinct categories; you would need faceting or multiple histograms, which is less efficient than a box plot. Option C is wrong because a scatter plot is used to examine the relationship between two continuous variables, not to compare distributions of one continuous variable across categories.

856

MCQhard

During a presentation, a stakeholder questions the validity of a data insight because the sample size appears small. The analyst knows the sample is statistically significant. What is the best way to address this concern?

A.Ignore the question and continue the presentation.

B.Explain the margin of error and confidence interval used.

C.Ask the stakeholder to trust the analysis and move on.

D.Agree to collect more data before finalizing the report.

AnswerB

This provides statistical context to reassure the stakeholder.

Why this answer

Option B is correct because it directly addresses the stakeholder's concern by explaining the statistical concepts of margin of error and confidence interval, which demonstrate that the sample size is sufficient for the desired level of precision. This approach validates the data insight's reliability without dismissing the stakeholder's valid question, aligning with best practices in communicating data insights.

Exam trap

The trap here is that candidates may assume a small sample is always invalid, but the DA0-001 exam tests the understanding that the reliability of an estimate is judged by its margin of error and confidence interval, not by sample size alone.

How to eliminate wrong answers

Option A is wrong because ignoring the question undermines trust and fails to address a legitimate concern about data validity, which is critical in data-driven presentations. Option C is wrong because asking for blind trust is unprofessional and does not provide the technical justification needed to alleviate doubts about sample size significance. Option D is wrong because agreeing to collect more data is unnecessary when the sample is already statistically significant, and it delays decision-making without addressing the underlying statistical reasoning.
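
As a rough illustration of what the analyst could show the stakeholder, the sketch below computes a margin of error and confidence interval for a sample proportion. It assumes a simple random sample and the normal approximation; the survey figures (400 respondents, 60%) and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion (normal approximation)."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)


def confidence_interval(p_hat, n, confidence=0.95):
    """Lower and upper bounds of the confidence interval."""
    moe = proportion_margin_of_error(p_hat, n, confidence)
    return p_hat - moe, p_hat + moe


# Hypothetical survey: 400 respondents, 60% answered "yes".
moe = proportion_margin_of_error(0.60, 400)
low, high = confidence_interval(0.60, 400)
print(f"60% +/- {moe:.1%} (95% CI: {low:.1%} to {high:.1%})")
```

With these assumed inputs the margin of error is a little under five percentage points, which is the kind of concrete statement that answers a "the sample looks small" objection.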

857

MCQmedium

A data analyst is designing a dashboard for executives. Which best practice should be followed?

A.Use 3D effects to make charts more engaging

B.Include every data point in the dashboard

C.Minimize clutter and use clear visual hierarchy

D.Use rainbow color palette to highlight all data points

AnswerC

Clean design improves readability and decision-making.

Why this answer

Reducing clutter helps viewers focus on key insights without distraction.

858

MCQeasy

A business user wants to explore sales data without relying on the data team. Which report type empowers the user to create their own analysis?

A.Self-service report

B.Ad hoc report

C.Operational report

D.Scheduled report

AnswerA

Correct. Self-service enables user-driven analysis.

Why this answer

Self-service reports allow users to query and analyze data independently using BI tools.

859

MCQmedium

A data analyst is profiling a new dataset containing customer information. When assessing data quality, which metric would be most appropriate to determine if the 'email' column contains valid email addresses?

A.Pattern analysis

B.Null count

C.Cardinality

D.Row count

AnswerA

Pattern analysis can verify if values match the typical email format.

Why this answer

Pattern analysis (e.g., using regular expressions) can validate whether strings match the expected format of an email address. Row counts, null counts, and cardinality do not validate format.
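
A minimal sketch of pattern analysis on an 'email' column is shown below. The regular expression is deliberately simple and is an assumption for illustration (real email syntax is more permissive), and the sample values and the `profile_email_column` helper are hypothetical.

```python
import re

# Simplified pattern: text, an "@", text, a dot, and an alphabetic suffix.
# This is an illustrative check, not a full email-address validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")


def profile_email_column(values):
    """Count values that match the pattern, fail it, or are missing."""
    result = {"match": 0, "no_match": 0, "null": 0}
    for value in values:
        if value is None or value.strip() == "":
            result["null"] += 1
        elif EMAIL_PATTERN.match(value.strip()):
            result["match"] += 1
        else:
            result["no_match"] += 1
    return result


# Hypothetical column values.
emails = [
    "ana@example.com",
    "bob@example",          # missing top-level domain
    "carol.example.com",    # missing "@"
    None,
    "dee@mail.example.org",
]
profile = profile_email_column(emails)
```

Reporting nulls separately from pattern failures keeps completeness and validity apart, which matches how the other options (null count, cardinality, row count) measure different things.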

860

MCQmedium

An analyst is presenting findings to a non-technical audience. The data shows a 20% increase in customer churn after a price change. Which presentation approach is BEST?

A.Explain the p-value

B.Provide the raw data table

C.Use a simple bar chart comparing churn before and after

D.Show a complex statistical model

AnswerC

A bar chart is simple, visual, and directly shows the comparison without jargon.

Why this answer

Option C is correct because a simple bar chart visually and intuitively communicates the 20% increase in churn to a non-technical audience without requiring statistical literacy. This approach aligns with best practices for presenting data insights to stakeholders who need clear, actionable takeaways rather than technical details.

Exam trap

The trap here is that candidates often overcomplicate the presentation by choosing technical options (like p-values or models) to demonstrate rigor, forgetting that the exam prioritizes audience-appropriate communication over statistical depth.

How to eliminate wrong answers

Option A is wrong because explaining a p-value introduces statistical significance testing, which is unnecessary and confusing for a non-technical audience that only needs to understand the magnitude of the change. Option B is wrong because providing the raw data table overwhelms the audience with numbers and fails to highlight the key insight (the 20% increase) effectively. Option D is wrong because showing a complex statistical model is inappropriate for a non-technical audience, as it obscures the simple before-and-after comparison and may lead to misinterpretation or disengagement.

861

MCQhard

A data analyst is using the IQR method to identify outliers in a dataset. The first quartile (Q1) is 25 and the third quartile (Q3) is 45. What is the upper bound for identifying outliers?

A.85

B.65

C.75

D.55

AnswerC

Correct: Q3 + 1.5*(Q3-Q1).

Why this answer

Upper bound = Q3 + 1.5 * IQR; IQR = Q3 - Q1 = 20; 1.5*20 = 30; 45+30 = 75.
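
The same arithmetic can be checked with a short Python sketch. The function names and the sample values passed to `find_outliers` are hypothetical; the 1.5 multiplier is the conventional IQR fence used in the question.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the lower and upper outlier fences for the IQR method."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(q1, q3, k)
    return [v for v in values if v < lower or v > upper]


# Quartiles from the question: Q1 = 25, Q3 = 45.
lower, upper = iqr_bounds(25, 45)

# Hypothetical observations checked against those fences.
outliers = find_outliers([12, 30, 44, 76, 90], 25, 45)
```

Here `upper` evaluates to 75.0 and `lower` to -5.0, so only values above 75 or below -5 are flagged.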

862

Multi-Selecthard

A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)

Select 3 answers

A.Correcting inconsistent data

B.Normalization

C.Handling missing values

D.Feature engineering

E.Removing duplicate records

AnswersA, C, E

Standardizing formats and fixing typos are cleaning tasks.

Why this answer

Correcting inconsistent data (Option A) is a core data cleaning step because it ensures that values follow a consistent format, such as standardizing date formats (e.g., 'MM/DD/YYYY' vs 'DD-MM-YYYY') or fixing capitalization (e.g., 'USA' vs 'usa'). This process directly addresses data quality issues that arise from human entry errors or system differences, making the dataset reliable for analysis.

Exam trap

The trap here is that candidates confuse data cleaning with data transformation or feature engineering, leading them to select normalization or feature engineering as cleaning steps, when in fact cleaning strictly addresses data quality issues like consistency, completeness, and uniqueness.

863

MCQhard

A data analyst sees this error in the ETL logs. What is the most likely cause?

A.The materialized view log was updated after the last refresh

B.The source table was dropped

C.The analyst does not have permission to refresh the view

D.There is a network connection timeout

AnswerA

The log is newer, indicating changes that need a full refresh.

Why this answer

The error indicates that the materialized view's underlying data has changed since its last refresh, specifically because the materialized view log was updated. Materialized views rely on logs to track changes for fast refreshes; if the log is updated after the last refresh, the view's snapshot becomes stale and cannot be incrementally refreshed without a complete refresh. This is a common cause of refresh failures in Oracle databases.

Exam trap

CompTIA often tests the distinction between fast refresh and complete refresh errors, and the trap here is that candidates confuse a log update with a source table drop or permission issue, not realizing that the error message specifically points to a log timestamp mismatch.

How to eliminate wrong answers

Option B is wrong because dropping the source table would cause a different error (e.g., 'table or view does not exist') rather than a log-related error. Option C is wrong because a permission issue would typically result in an 'insufficient privileges' error, not a log mismatch. Option D is wrong because a network timeout would produce a connection error (e.g., ORA-12170 or ORA-03113), not a materialized view log inconsistency.

864

Multi-Selectmedium

Which THREE of the following are characteristics of a relational database?

Select 3 answers

A.Enforces referential integrity through foreign keys

B.Stores data in key-value pairs

C.Supports NoSQL document storage

D.Uses Structured Query Language (SQL) for data manipulation

E.Data is organized into tables with rows and columns

AnswersA, D, E

Referential integrity ensures relationships.

Why this answer

Option A is correct because relational databases enforce referential integrity through foreign keys, which ensure that relationships between tables remain consistent. A foreign key in a child table must match a primary key value in the parent table, preventing orphaned records and maintaining data integrity.

Exam trap

The trap here is that candidates may confuse key-value stores or document databases with relational databases, especially when they hear terms like 'keys' or 'documents' in other contexts, but relational databases strictly use tables, rows, columns, and SQL.

865

MCQmedium

A company uses an OLTP system for processing customer transactions. Which characteristic is most important for this system to ensure that each transaction is processed reliably, even if multiple users access the system simultaneously?

A.It uses a columnar storage format

B.It stores data in a denormalized schema

C.It supports complex analytical queries

D.It follows ACID properties

AnswerD

ACID ensures transactions are processed reliably and consistently.

Why this answer

ACID properties (Atomicity, Consistency, Isolation, Durability) are essential for OLTP systems to ensure reliable transaction processing.
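
Two of these properties, atomicity and consistency, can be demonstrated with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: the `accounts` table, its CHECK constraint, and the `transfer` helper are hypothetical, and isolation and durability are not exercised here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move funds between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False


ok = transfer(conn, 1, 2, 30)        # valid transfer
failed = transfer(conn, 1, 2, 500)   # would overdraw account 1
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

In the failed transfer the credit to account 2 runs first, yet it is undone when the debit violates the constraint, so no partial transaction is left behind.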

866

MCQmedium

A data analyst wants to visualize the distribution of employee salaries across departments and identify any outliers. Which chart type would best show quartiles, median, and potential outliers?

A.Heat map

B.Histogram

C.Box plot

D.Scatter plot

AnswerC

Box plots show five-number summary and outliers.

Why this answer

A box plot displays the median, quartiles (Q1 and Q3), and outliers for each group, making it ideal for comparing distributions across departments and spotting outliers.

867

Multi-Selectmedium

A data analyst is preparing a report to present to a mixed audience of technical and non-technical stakeholders. Which THREE techniques should the analyst use to ensure effective communication? (Choose three.)

Select 3 answers

A.Tailor the narrative to address different concerns

B.Use only one chart type for consistency

C.Use technical jargon to demonstrate expertise

D.Provide high-level summaries for non-technical audience

E.Include detailed technical appendices for those interested

AnswersA, D, E

Addressing diverse interests makes the report relevant to all.

Why this answer

Option A is correct because tailoring the narrative to address different concerns ensures that both technical and non-technical stakeholders receive relevant insights. For non-technical audiences, the analyst should focus on business impact and high-level trends, while for technical audiences, deeper data nuances can be included. This approach aligns with the DA0-001 domain of Communicating Data Insights, where audience analysis is critical for effective data storytelling.

Exam trap

The trap here is that candidates often confuse 'consistency' with 'clarity,' mistakenly believing that using a single chart type (Option B) simplifies the message, when in fact it can hide critical patterns that require different visual encodings.

868

MCQhard

An analyst creates a scatter plot with three variables: X, Y, and a third variable represented by the size of the markers. This chart is called a:

A.Heat map

B.Treemap

C.Bubble chart

D.Waterfall chart

AnswerC

Bubble charts are scatter plots with a third dimension represented by bubble size.

Why this answer

A bubble chart is a variation of a scatter plot where a third numeric variable is encoded by the size (area) of the markers. This allows the visualization of three dimensions of data simultaneously on a two-dimensional plane, making it the correct choice for the described chart.

Exam trap

The trap here is that candidates confuse a bubble chart with a heat map because both can represent three variables, but the heat map uses color gradients on a grid, not marker size on a scatter plot.

How to eliminate wrong answers

Option A is wrong because a heat map uses color intensity to represent the magnitude of a third variable across two categorical axes, not marker size. Option B is wrong because a treemap uses nested rectangles (tiles) to display hierarchical data, with area encoding a quantitative value, and does not use X/Y coordinates or marker sizes. Option D is wrong because a waterfall chart shows cumulative effects of sequential positive or negative values, typically in a financial context, and does not involve scatter plot markers or a third variable encoded by size.

869

MCQeasy

Which statistical test should be used to determine if there is a significant association between two categorical variables, such as gender and product preference?

A.ANOVA

B.Chi-square test

C.Pearson correlation

D.t-test

AnswerB

Correct test for categorical variables.

Why this answer

The chi-square test of independence is used to test association between two categorical variables.
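
A minimal sketch of the calculation, using only the Python standard library, is shown below. The observed counts are hypothetical, the statistic is the plain Pearson chi-square without a continuity correction, and the comparison uses the standard critical value of 3.841 for 1 degree of freedom at α = 0.05 rather than computing a p-value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are gender groups, columns are preferred product.
observed = [
    [30, 20],
    [20, 30],
]
stat = chi_square_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_DF1 = 3.841
significant = stat > CRITICAL_VALUE_DF1
```

Every expected count in this table is 25, so each cell contributes 1 and the statistic is 4.0, which exceeds 3.841 and indicates a significant association at the 5% level for this invented data.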

870

MCQmedium

A data analyst uses the following query: SELECT department, AVG(salary) AS avg_salary FROM employees GROUP BY department HAVING AVG(salary) > 50000. What is the purpose of the HAVING clause in this query?

A.To ensure only departments with more than 50000 employees are shown

B.To filter individual employee records before grouping

C.To sort departments by average salary

D.To filter groups (departments) based on the average salary

AnswerD

HAVING filters groups after GROUP BY.

Why this answer

HAVING filters groups after aggregation, similar to WHERE but for aggregated results.
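
The difference between HAVING and WHERE can be seen by running both against a small table. The sketch below uses Python's `sqlite3` module with an in-memory database; the `employees` rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Engineering", 70000),
        ("Ben", "Engineering", 60000),
        ("Cal", "Support", 40000),
        ("Dee", "Support", 55000),
    ],
)

# HAVING: aggregate first, then keep only groups whose average exceeds 50000.
having_rows = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary "
    "FROM employees GROUP BY department HAVING AVG(salary) > 50000"
).fetchall()

# WHERE: drop individual rows first, then aggregate what remains.
where_rows = conn.execute(
    "SELECT department, AVG(salary) AS avg_salary "
    "FROM employees WHERE salary > 50000 "
    "GROUP BY department ORDER BY department"
).fetchall()
```

With this data the HAVING query returns only Engineering (average 65000), while the WHERE query also returns Support, because removing Cal's row before grouping raises that department's average to 55000.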

871

MCQmedium

A data analyst needs to display the distribution of customer ages in a dataset. The ages range from 18 to 85, and the analyst wants to see the shape of the distribution and identify any outliers. Which chart type should be used?

A.Scatter plot

B.Bar chart

C.Box plot

D.Histogram

AnswerD

Histograms show the distribution shape of continuous data.

Why this answer

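A histogram shows the distribution of a continuous variable, while a box plot summarizes quartiles and outliers. The question asks primarily for the shape of the distribution, so a histogram is the more appropriate choice; outliers still appear as isolated bars far from the main body of the data.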

872

Multi-Selecthard

After merging two datasets, an analyst finds that the resulting dataset has many null values in some columns. Which TWO steps should the analyst take to address this? (Select two.)

Select 2 answers

A.Ignore nulls and proceed.

B.Impute nulls with the median.

C.Remove all rows with nulls.

D.Replace nulls with a placeholder value like 'Unknown'.

E.Investigate the cause of nulls.

AnswersB, E

Median imputation preserves dataset size and reduces bias.

Why this answer

Option B is correct because imputing nulls with the median is a standard technique for handling missing numerical data, especially when the distribution is skewed or contains outliers. The median is robust to extreme values and preserves the central tendency of the column, making it a safe choice for many analytical models. This approach avoids data loss while maintaining statistical integrity. Option E is correct because nulls that appear after a merge often signal unmatched keys or an unsuitable join type, so the cause should be understood before a fix is chosen.

Exam trap

The trap here is that candidates may think 'Ignore nulls and proceed' is acceptable, but the exam tests the understanding that nulls must be actively handled to ensure data quality and model validity, not simply overlooked.
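
Median imputation itself is simple to express. The following sketch assumes a numeric column where missing values are represented as `None`; the `impute_median` helper and the sample values are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Returns the imputed list and the number of values that were filled.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to compute a median from; return the input unchanged.
        return list(values), 0
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    return imputed, len(values) - len(observed)


# Hypothetical merged column; None marks rows with no match in the other dataset.
order_totals = [120, None, 95, 4000, None, 110]
imputed, filled_count = impute_median(order_totals)
```

The extreme value 4000 barely affects the fill value here (the median of the observed values is 115), whereas a mean-based fill would be pulled above 1000. Returning `filled_count` also gives the analyst a number to report when investigating why the nulls appeared.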

873

MCQhard

A table Orders has OrderID (primary key), CustomerID, and CustomerEmail. During analysis, it is found that CustomerID uniquely identifies CustomerEmail. Which normal form is violated if both CustomerID and CustomerEmail are stored in this table?

A.Second normal form (2NF)

B.Third normal form (3NF)

C.No violation

D.First normal form (1NF)

AnswerB

CustomerEmail depends on CustomerID, which is a non-key attribute, creating a transitive dependency violating 3NF.

Why this answer

The table violates Third Normal Form (3NF) because CustomerEmail depends on the primary key (OrderID) only transitively, through CustomerID, which is not a candidate key of the Orders table. In 3NF, every non-key attribute must depend directly on the key, not on another non-key attribute. Since CustomerID uniquely identifies CustomerEmail, the chain OrderID → CustomerID → CustomerEmail is a transitive dependency.

Exam trap

The trap here is that candidates often confuse transitive dependencies with partial dependencies, mistakenly thinking that because CustomerID is not part of the primary key, the violation is 2NF rather than 3NF.

How to eliminate wrong answers

Option A is wrong because Second Normal Form (2NF) requires that all non-key attributes are fully functionally dependent on the entire primary key; here, the primary key is a single column (OrderID), so there is no partial dependency, and 2NF is satisfied. Option C is wrong because a violation does exist — the transitive dependency between CustomerID and CustomerEmail breaks 3NF. Option D is wrong because First Normal Form (1NF) is not violated; the table has atomic values and a primary key, so it meets 1NF requirements.
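
One common way to remove the transitive dependency is to move CustomerEmail into its own table keyed by CustomerID. The sketch below shows that decomposition with Python's `sqlite3` module; the column types and sample rows are assumptions, since the question specifies only the column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- CustomerEmail now depends only on the key of its own table.
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        CustomerEmail TEXT NOT NULL
    );
    -- Orders keeps only attributes that depend directly on OrderID.
    CREATE TABLE Orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID)
    );
    INSERT INTO Customers VALUES (1, 'ana@example.com');
    INSERT INTO Orders VALUES (100, 1), (101, 1);
    """
)

# The email is stored once, so a single UPDATE changes it for every order.
conn.execute(
    "UPDATE Customers SET CustomerEmail = 'ana@new.example' WHERE CustomerID = 1"
)
rows = conn.execute(
    "SELECT o.OrderID, c.CustomerEmail "
    "FROM Orders o JOIN Customers c ON c.CustomerID = o.CustomerID "
    "ORDER BY o.OrderID"
).fetchall()
```

In the original single-table design the same change would have to be applied to every order row for that customer, which is the update anomaly that 3NF is meant to prevent.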

874

MCQeasy

A data analyst runs the following query: SELECT DISTINCT city FROM customers. What is the primary purpose of using the DISTINCT keyword in this query?

A.To sort the cities alphabetically

B.To count the number of cities

C.To filter cities that start with a specific letter

D.To remove duplicate city names

AnswerD

DISTINCT eliminates duplicate rows, returning unique city values.

Why this answer

DISTINCT removes duplicate rows from the result set. In this query, it returns each unique city name only once.

875

MCQmedium

A dashboard designer is creating a KPI dashboard for executives. Which of the following is a leading indicator?

A.Number of qualified leads

B.Net profit margin

C.Monthly revenue

D.Customer churn rate

AnswerA

Leads are a leading indicator of future sales.

Why this answer

Number of qualified leads is a leading indicator because it predicts future sales, while monthly revenue, net profit margin, and customer churn rate are lagging indicators that report outcomes after they occur.

876

MCQhard

A data scientist runs a linear regression model to predict customer spending based on income. The R-squared value is 0.45 and the p-value for the slope coefficient is 0.03. At a significance level of α=0.05, which of the following conclusions is correct?

A.The slope is not statistically significant, and the model explains 55% of the variance.

B.The slope is statistically significant, and the model explains 45% of the variance.

C.The slope is statistically significant, and the model explains 55% of the variance.

D.The slope is not statistically significant, and the model explains 45% of the variance.

AnswerB

Correct: p<0.05 indicates significance; R²=0.45 indicates explained variance.

Why this answer

The p-value (0.03) is less than α (0.05), so the slope is statistically significant. R²=0.45 means the model explains 45% of the variance.

877

MCQmedium

A company is implementing a data lifecycle management policy. Which stage occurs immediately after data is created?

A.Storage

B.Deletion

C.Archival

D.Analysis

AnswerA

Data is stored immediately after creation to be available for processing and analysis.

Why this answer

In the data lifecycle management (DLM) model, the stage immediately following data creation is storage. Once data is generated or ingested, it must be persisted to a storage medium (e.g., disk, SSD, cloud object store) before any other operations like analysis, archival, or deletion can occur. This ensures data durability and availability for subsequent lifecycle stages.

Exam trap

CompTIA often tests the misconception that analysis or processing is the immediate next step after data creation, but the correct sequence in DLM always begins with storage to ensure data persistence.

How to eliminate wrong answers

Option B (Deletion) is wrong because deletion is a final stage in the lifecycle, occurring only after data is no longer needed and retention policies have expired. Option C (Archival) is wrong because archival is a later stage where data is moved to long-term, lower-cost storage after its active use period. Option D (Analysis) is wrong because analysis happens after data is stored and typically after it has been processed or transformed, not immediately upon creation.

878

Multi-Selectmedium

Which TWO of the following are examples of unstructured data? (Select 2)

Select 2 answers

A.MP4 video

B.CSV file

C.XML file

D.JPEG image

E.JSON document

AnswersA, D

Video files are unstructured.

Why this answer

Option A is correct because MP4 video files contain binary data that lacks a predefined schema or tabular structure, making them a classic example of unstructured data. Unlike structured data, MP4 files store audiovisual content in a container format that cannot be easily queried or analyzed without specialized processing.

Exam trap

The trap here is that candidates often confuse structured or semi-structured formats (CSV, XML, JSON) with unstructured data, forgetting that those formats organize their content with delimiters, tags, or keys, unlike raw binary media or free-form text.

879

MCQhard

A financial analyst at a bank is preparing a report on loan default risks to the risk management committee. The committee includes both technical (quantitative analysts) and non-technical (business managers) members. The analyst has built a logistic regression model that outputs probability scores for default. The model's performance is good, but the committee wants to understand the key drivers of default. The analyst needs to communicate both the model's accuracy and the impact of each feature. The report should be concise and persuasive, leading to policy changes. What is the best approach?

A.Provide a technical white paper.

B.Use a waterfall chart showing the impact of each feature on a sample prediction.

C.Present a feature importance bar chart and a table of coefficients.

D.Show the confusion matrix and AUC-ROC curve.

AnswerB

Intuitive visualization that explains contributions clearly to all audiences.

Why this answer

Option B is correct because a waterfall chart visually decomposes a single prediction into the additive contributions of each feature, making it intuitive for both technical and non-technical stakeholders to see which factors drive default risk. This approach directly addresses the committee's need to understand key drivers while keeping the report concise and persuasive for policy changes, unlike abstract metrics or tables.

Exam trap

The trap here is that candidates often pick Option C (feature importance bar chart and coefficients) thinking it is the most direct way to show feature impact, but they overlook that coefficients are on the log-odds scale and not easily interpretable by non-technical managers, whereas a waterfall chart provides a concrete, additive explanation for a single prediction.

How to eliminate wrong answers

Option A is wrong because a technical white paper is too detailed and jargon-heavy for non-technical business managers, failing the requirement for a concise and persuasive report. Option C is wrong because a feature importance bar chart and coefficient table require statistical literacy to interpret correctly, and coefficients in logistic regression are on the log-odds scale, which is not intuitive for non-technical audiences. Option D is wrong because a confusion matrix and AUC-ROC curve only communicate overall model accuracy and discrimination, not the impact of individual features on predictions, which is what the committee explicitly asked for.

880

Multi-Selecthard

Which THREE are considered best practices in dashboard design? (Select three.)

Select 3 answers

A.Using heat maps to visualize correlation

B.Using 3D charts to add depth

C.Maximizing the data-ink ratio

D.Providing interactive filters for exploration

E.Including every data point in the dashboard

AnswersA, C, D

Heat maps effectively show correlation matrices.

Why this answer

Option A is correct because heat maps effectively visualize correlation by using color intensity to represent the strength of the relationship between each pair of variables, making patterns and outliers immediately apparent. This aligns with best practices for dashboard design, which prioritize clarity and rapid insight over decorative elements.

Exam trap

CompTIA often tests the misconception that adding visual flair (like 3D effects) or exhaustive data improves a dashboard, when in reality these choices degrade readability and violate core principles of effective data visualization.

881

MCQeasy

A data analyst wants to compare the revenue across five different product categories. Which chart type is best suited?

A.Scatter plot

B.Pie chart

C.Line chart

D.Bar chart

AnswerD

Bar charts are best for comparing categories.

Why this answer

Bar charts are ideal for comparing categories because they display discrete values side by side.

882

MCQhard

A data analyst is using Power BI to create a report that shows sales by region. The data includes duplicate rows for some transactions due to a data entry error. The analyst needs to count only unique transactions. Which DAX function should be used to create a measure for unique count?

A.COUNT

B.DISTINCTCOUNT

C.COUNTROWS

D.SUMX

AnswerB

DISTINCTCOUNT counts distinct values in a column, ignoring duplicates.

Why this answer

DISTINCTCOUNT is the DAX function specifically for counting unique values in a column.

883

MCQeasy

A data analyst is building a linear regression model to predict sales based on advertising spend. The analyst notices that the residuals are not normally distributed and have a non‑constant variance. Which of the following transformations is most appropriate to apply to the dependent variable?

A.Standardization (z-score)

B.Normalization (min-max scaling)

C.Logarithmic transformation

D.Square root transformation

AnswerC

Log transformation is commonly used to stabilize variance and make residuals more normally distributed.

Why this answer

The logarithmic transformation is the most appropriate choice because it stabilizes non‑constant variance (heteroscedasticity) and helps make the residuals more normally distributed, which are key assumptions for linear regression. By compressing the scale of the dependent variable (sales), it reduces the impact of large values and often linearizes multiplicative relationships, such as diminishing returns from advertising spend.

Exam trap

CompTIA often tests the misconception that any scaling technique (standardization or normalization) can fix heteroscedasticity or non‑normality, but these methods only change the range or center of the data, not the shape of the residual distribution or the variance structure.

How to eliminate wrong answers

Option A is wrong because standardization (z-score) centers and scales the data to mean 0 and standard deviation 1, but it does not address heteroscedasticity or non‑normal residuals; it merely changes the units of the dependent variable without altering the shape of the distribution. Option B is wrong because normalization (min-max scaling) rescales the data to a fixed range (e.g., 0 to 1), which also fails to correct non‑constant variance or non‑normality; it is primarily used for feature scaling in algorithms like neural networks, not for satisfying regression assumptions. Option D is wrong because the square root transformation is typically used for count data (e.g., Poisson-distributed outcomes) to stabilize variance, but it is less effective than the log transformation when the variance increases proportionally with the mean, which is common in sales data; the log transformation is the standard choice for multiplicative relationships and heteroscedasticity.
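
The variance-stabilizing effect of the log transformation can be illustrated with a small standard-library sketch. The sales figures are hypothetical and constructed so that the spread grows in proportion to the level, which is the situation the explanation describes.

```python
from math import log
from statistics import stdev

# Hypothetical sales at three spend levels; the spread scales with the level.
sales_by_level = {
    "low": [90, 100, 110],
    "medium": [900, 1000, 1100],
    "high": [9000, 10000, 11000],
}

# Standard deviation on the original scale grows tenfold at each level.
raw_spread = {
    level: stdev(values) for level, values in sales_by_level.items()
}

# After taking logs, the spread is essentially the same at every level.
log_spread = {
    level: stdev([log(v) for v in values])
    for level, values in sales_by_level.items()
}
```

Because the log turns a constant percentage deviation into a constant additive one, the three groups end up with matching spread, which is what "stabilizing the variance" means. Standardization or min-max scaling applied to the whole column would rescale all three groups by the same factor and leave their relative spreads unchanged.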

884

MCQeasy

A data architect needs to store raw data from various sources, including social media feeds and log files, for future analysis. The data may be used for machine learning and ad-hoc queries. Which storage solution is most appropriate for storing raw data in its native format?

A.Data lake

B.Data mart

C.Relational database

D.Data warehouse

AnswerA

Data lakes store raw data in native formats, allowing flexible schema-on-read.

Why this answer

A data lake is designed to store raw data in its native format, including unstructured and semi-structured data from sources like social media feeds and log files. It supports schema-on-read, making it ideal for future machine learning and ad-hoc queries without requiring upfront transformation. This aligns directly with the requirement to preserve raw data for flexible analysis.

Exam trap

The trap here is that candidates confuse a data lake with a data warehouse, assuming both are for analytics, but the key distinction is that a data warehouse requires structured, transformed data while a data lake preserves raw, native-format data.

How to eliminate wrong answers

Option B is wrong because a data mart is a subset of a data warehouse optimized for a specific business domain, not for storing raw, diverse data in native format. Option C is wrong because a relational database enforces a rigid schema and ACID constraints, making it unsuitable for unstructured data like social media feeds and log files. Option D is wrong because a data warehouse stores processed, structured data optimized for reporting and BI, not raw data in its native format.

885

MCQ · easy

A data analyst is tasked with collecting data from multiple spreadsheets provided by different departments. Each spreadsheet has different column names and formats. What is the best first step?

A. Develop a data dictionary and standardize column names

B. Discard any mismatched data

C. Use a machine learning model to clean data

D. Immediately load all data into a database

Answer: A

Standardization ensures all data sources align, making subsequent loading and analysis consistent.

Why this answer

Developing a data dictionary and standardizing column names ensures consistency across all data sources before loading, reducing errors and facilitating integration. Loading the data immediately (Option D) carries the inconsistent names and formats into the database. Discarding mismatched data (Option B) loses potentially valuable information. Using a machine learning model (Option C) is an unnecessarily complex first step.

886

MCQ · easy

A data analyst notices that customer addresses in the database contain invalid ZIP codes. Which data quality dimension is being violated?

A. Validity

B. Timeliness

C. Consistency

D. Completeness

Answer: A

Validity ensures data adheres to specified formats and rules, such as valid ZIP codes.

Why this answer

A is correct because validity refers to the degree to which data conforms to its defined format, rules, or constraints. Invalid ZIP codes (e.g., a five-digit code containing letters or a non-existent postal code) directly violate the format and domain rules expected for that field, making this a validity issue.

Exam trap

The trap here is that candidates confuse 'validity' with 'completeness' or 'consistency,' mistakenly thinking a missing or mismatched ZIP code is a completeness or consistency issue, when in fact the violation is about the data not conforming to the required format or rule set.

How to eliminate wrong answers

Option B (Timeliness) is wrong because timeliness concerns whether data is available when needed, not whether individual values match expected formats. Option C (Consistency) is wrong because consistency checks for logical coherence across related data sets or fields (e.g., ZIP code matching city/state), not the intrinsic correctness of a single value. Option D (Completeness) is wrong because completeness measures whether all required data is present (e.g., missing ZIP codes), not whether present data is correctly formatted.
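
A validity check of this kind can be sketched as a simple format rule. The sketch below assumes the US convention of five digits with an optional four-digit extension; the records and the `is_valid_zip` helper are hypothetical. A format rule like this cannot tell whether a well-formed code actually exists, which would need a lookup against a reference list.

```python
import re

# Assumed rule: a US ZIP code is 5 digits, optionally followed by "-" and 4 digits.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def is_valid_zip(value):
    """Return True when the value is a string matching the assumed ZIP format."""
    return isinstance(value, str) and ZIP_PATTERN.fullmatch(value) is not None


# Hypothetical records for illustration only.
addresses = [
    {"customer": "C1", "zip": "30301"},
    {"customer": "C2", "zip": "3030A"},        # letters in the code: validity issue
    {"customer": "C3", "zip": "12345-6789"},
    {"customer": "C4", "zip": None},           # missing value: completeness issue
]

# Present but badly formatted values violate validity.
invalid = [a["customer"] for a in addresses
           if a["zip"] is not None and not is_valid_zip(a["zip"])]
# Absent values are counted separately, under completeness.
missing = [a["customer"] for a in addresses if a["zip"] is None]

print("invalid:", invalid)
print("missing:", missing)
```

Keeping the two lists separate mirrors the distinction in the options: a malformed value is a validity problem, while an absent value is a completeness problem.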

887

MCQ · medium

An analyst is creating a report to show the relationship between advertising spend and website traffic over the past 12 months. The data has a few outliers due to special promotional events. Which chart type should the analyst use to clearly show the trend while minimizing the impact of outliers?

A. Pie chart

B. Bar chart

C. Heatmap

D. Scatter plot with a trend line

Answer: D

Scatter plots show the relationship and outliers; a trend line summarizes the pattern.

Why this answer

A scatter plot with a trend line (Option D) is the best choice because it plots each data point individually, so the analyst can see the overall relationship between advertising spend and website traffic, while the trend line (often a linear regression line) summarizes the general direction of the data. The promotional-event outliers remain visible as individual points but do not dominate the reading of the chart, because the eye follows the fitted line rather than the extremes. Note that an ordinary least-squares line can still be pulled toward extreme values, so the line reduces the visual impact of outliers rather than removing their influence.

Exam trap

The trap here is that candidates often choose a bar chart (Option B) thinking it clearly shows trends over time, but they overlook that bar charts treat each period as a separate category and do not inherently reduce outlier impact, whereas a scatter plot with a trend line explicitly models the relationship and keeps the focus on the overall pattern rather than on individual extreme values.

How to eliminate wrong answers

Option A is wrong because a pie chart shows proportions of a whole at a single point in time, not a trend over 12 months, and it cannot handle outliers or continuous variables like advertising spend and traffic. Option B is wrong because a bar chart compares discrete categories or time periods, but it treats each bar independently and does not inherently minimize outlier impact; outliers can distort the scale and make normal variations hard to see. Option C is wrong because a heatmap visualizes density or intensity across two dimensions using color gradients, which is useful for correlation matrices or geographic data, but it does not effectively show a continuous trend over time and can obscure the specific relationship between spend and traffic.

888

MCQ · medium

A data analyst uses a CTE to find employees who earn more than the average salary in their department. Which SQL clause is used to define the CTE?

A. DECLARE

B. WITH

C. DEFINE

D. CTE

Answer: B

Correct. WITH defines a CTE.

Why this answer

Common Table Expressions (CTEs) are defined using the WITH keyword, followed by the CTE name and AS (query).
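
The scenario in the question can be sketched with Python's built-in `sqlite3` module. The `employees` table, its columns, and the sample rows are hypothetical; the CTE name `dept_avg` is likewise chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 50000),
        ("Ben", "Sales", 70000),
        ("Cy", "IT", 80000),
        ("Di", "IT", 90000),
        ("Ed", "IT", 100000),
    ],
)

query = """
WITH dept_avg AS (              -- WITH introduces the CTE named dept_avg
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name
FROM employees AS e
JOIN dept_avg AS d ON d.department = e.department
WHERE e.salary > d.avg_salary   -- keep only above-average earners
ORDER BY e.name
"""

above_average = [row[0] for row in conn.execute(query)]
print(above_average)
```

The CTE computes one average per department, and the main query joins back to it so each employee is compared with the average of their own department.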

889

MCQ · easy

Which data quality dimension is most concerned with whether data values fall within a defined domain or acceptable range?

A. Completeness

B. Consistency

C. Validity

D. Accuracy

Answer: C

Validity checks if data follows format and range rules.

Why this answer

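Validity refers to whether data values conform to defined rules or constraints, such as an allowed domain, format, or range. Accuracy, by contrast, concerns whether a value matches the real-world fact it represents; a value can fall inside the acceptable range and still be inaccurate.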

890

MCQ · hard

In Tableau, a data analyst creates a calculated field to compute the average sales per customer. The analyst wants this calculation to remain constant regardless of the level of detail in the view. Which Tableau feature should be used?

A. Table calculation

B. Level of Detail expression

C. Parameter

D. Filter

Answer: B

LOD expressions control granularity independent of the view.

Why this answer

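Level of Detail (LOD) expressions allow calculations to be performed at a specific granularity independent of the view. A FIXED LOD expression can compute sales at the customer level, so the average sales per customer stays constant no matter which dimensions are added to or removed from the view.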

891

MCQ · hard

A data scientist applies K-means clustering to a customer dataset. The elbow method suggests using 4 clusters. After running K-means with k=4, the within-cluster sum of squares (WCSS) is plotted against k, and the elbow is at k=4. What does this indicate?

A. Increasing k beyond 4 would not significantly reduce WCSS.

B. The data naturally forms 4 clusters with no noise.

C. The algorithm converged to a local minimum.

D. The model has overfit the data.

Answer: A

The elbow point is where the rate of decrease sharply changes.

Why this answer

The elbow method suggests that increasing k beyond 4 yields diminishing returns in reducing WCSS; k=4 is a good trade-off.
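
The diminishing-returns pattern can be sketched with a small, self-contained K-means on one-dimensional data. Everything here is an assumption for illustration: the `customer_spend` values are made up, the `kmeans_wcss` helper is hypothetical, and the deterministic starting centroids are a simplification (library implementations typically use random or k-means++ initialization).

```python
def kmeans_wcss(points, k, max_iter=100):
    """Run a basic 1-D Lloyd's algorithm and return the final WCSS."""
    data = sorted(points)
    n = len(data)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centroids = [data[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # WCSS: squared distance from every point to its nearest centroid.
    return sum(min((x - c) ** 2 for c in centroids) for x in data)


# Hypothetical customer spend values forming four well-separated groups.
customer_spend = [1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33]

wcss_by_k = {k: kmeans_wcss(customer_spend, k) for k in range(1, 7)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On this toy data WCSS falls steeply up to k=4 and only marginally afterwards, which is the bend the elbow method looks for.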

892

MCQ · hard

Refer to the exhibit. What is the best corrective action to resolve this error?

A. Convert the 'revenue' column to numeric data type during ETL

B. Change the chart type to a bar chart

C. Remove the 'revenue' column from the visualization

D. Use a string-compatible chart type

Answer: A

Casting to numeric solves the type mismatch.

Why this answer

The error indicates a data type mismatch; converting the column to numeric in the ETL process ensures compatibility.
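
The exhibit is not reproduced here, so the following is only a generic sketch of casting a text column to numbers during a transform step. The `to_number` helper, the sample strings, and the choice to strip "$" and "," are all assumptions, not details taken from the question.

```python
def to_number(raw):
    """Convert a revenue string such as "$1,200.50" to float.

    Returns None when the text cannot be parsed, so bad rows can be
    reviewed instead of silently becoming zero.
    """
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Hypothetical 'revenue' values that arrived as text.
raw_revenue = ["1200", "$1,200.50", " 300 ", "n/a"]
revenue = [to_number(v) for v in raw_revenue]
print(revenue)

# Only successfully converted values can be aggregated or charted as numbers.
total = sum(v for v in revenue if v is not None)
print(total)
```

Doing the conversion in the ETL step means every downstream chart receives a numeric column, rather than each visualization having to work around text values.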

893

Multi-Select · hard

Which THREE of the following are properties of ratio data? (Choose THREE.)

Select 3 answers

A. Data can be categorized into groups

B. Allows negative values

C. Supports multiplication and division

D. Intervals between values are equal

E. Has a meaningful zero point

Answers: C, D, E

Ratio data allows meaningful ratios (e.g., twice as heavy).

Why this answer

Ratio data has equal intervals between values and a true, meaningful zero point that indicates the absence of the measured attribute. Together these properties make multiplication and division meaningful, so ratios can be computed (e.g., one value is twice another), which is the defining property of ratio scales in measurement theory.

Exam trap

The trap here is that candidates confuse the 'meaningful zero' property with the ability to have negative values, or they think categorization is a defining feature of ratio data, when it is actually a property shared by all measurement scales.
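
The role of the meaningful zero can be sketched by comparing a ratio-scale quantity with an interval-scale one. The values below are made up for illustration; the Celsius-to-Kelvin offset of 273.15 is the standard conversion.

```python
# Weight in kilograms is ratio data: zero means "no weight",
# so a ratio of two weights is meaningful.
weight_a_kg, weight_b_kg = 10.0, 20.0
weight_ratio = weight_b_kg / weight_a_kg  # "twice as heavy" is a valid statement

# Temperature in Celsius is interval data: 0 degrees C is an arbitrary
# reference point, so this arithmetic ratio has no physical meaning.
temp_a_c, temp_b_c = 10.0, 20.0
celsius_ratio = temp_b_c / temp_a_c  # numerically 2.0, but NOT "twice as hot"

# Kelvin has a true zero, so the same two temperatures give a meaningful ratio.
kelvin_ratio = (temp_b_c + 273.15) / (temp_a_c + 273.15)

print(weight_ratio, celsius_ratio, round(kelvin_ratio, 3))
```

Both scales have equal intervals, but only the one with a true zero supports statements such as "twice as much".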

894

MCQ · medium

A data analyst is examining the relationship between advertising spend (in thousands) and sales (in thousands). The Pearson correlation coefficient is computed as r = -0.85. Which of the following interpretations is correct?

A. There is no linear relationship.

B. There is a strong positive linear relationship between advertising spend and sales.

C. There is a weak negative linear relationship.

D. There is a strong negative linear relationship.

Answer: D

Correct: r close to -1 indicates strong negative.

Why this answer

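Pearson's r measures the strength and direction of a linear relationship. A value of r = -0.85 indicates a strong negative linear relationship (as one variable increases, the other tends to decrease): the magnitude |r| = 0.85 is close to 1, so the relationship is strong, and the sign is negative.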
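
How the sign and magnitude of r arise can be sketched by computing it directly from the definition. The `spend` and `sales` values are hypothetical and are not the data behind the r = -0.85 in the question; `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Hypothetical data in which sales tend to fall as spend rises.
spend = [1, 2, 3, 4, 5]
sales = [10, 8, 9, 5, 4]

r = pearson_r(spend, sales)
direction = "negative" if r < 0 else "positive"
print(round(r, 3), direction, "magnitude:", round(abs(r), 3))
```

The sign gives the direction of the relationship and the absolute value gives its strength, which is why the two are read separately.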

895

MCQ · hard

A data analyst needs to calculate the running total of sales for each product over time. Which window function clause is essential for this calculation?

A. ORDER BY sale_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW

B. PARTITION BY product_id ORDER BY sale_date

C. PARTITION BY product_id

D. ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

Answer: D

Correct frame for running total.

Why this answer

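ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, combined with ORDER BY sale_date, defines a frame that runs from the first row up to the current row, which is what makes the total cumulative. To restart the running total for each product, the same window would also include PARTITION BY product_id.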
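
A per-product running total can be sketched with Python's built-in `sqlite3` module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The `sales` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2024-01-01", 10),
        ("A", "2024-01-02", 20),
        ("A", "2024-01-03", 5),
        ("B", "2024-01-01", 7),
        ("B", "2024-01-02", 3),
    ],
)

query = """
SELECT product_id,
       sale_date,
       SUM(amount) OVER (
           PARTITION BY product_id      -- restart the total for each product
           ORDER BY sale_date           -- accumulate in date order
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM sales
ORDER BY product_id, sale_date
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The frame clause makes each row's sum cover everything from the start of its partition up to that row, so product A accumulates 10, 30, 35 and product B starts again at 7.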

896

MCQ · hard

A data analyst trains a complex model that achieves 99% accuracy on training data but only 65% on new data. What is the most likely issue?

A. Underfitting

B. Overfitting

C. Multicollinearity

D. High bias

Answer: B

The model performs well on training but poorly on test data, a classic sign of overfitting.

Why this answer

The model performs exceptionally well on training data (99% accuracy) but poorly on new data (65% accuracy), which is the classic symptom of overfitting. Overfitting occurs when the model learns noise and specific patterns in the training data rather than generalizing to unseen data, often due to excessive complexity (e.g., too many parameters or deep layers). This results in high variance and poor performance on validation or test sets.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting by presenting a large gap between training and test accuracy, tempting candidates to choose high bias or multicollinearity due to confusion about bias-variance tradeoff or correlation issues.

How to eliminate wrong answers

Option A is wrong because underfitting would show poor performance on both training and new data (e.g., low accuracy on both), not high training accuracy with low test accuracy. Option C is wrong because multicollinearity refers to high correlation among predictor variables in regression models, which inflates coefficient standard errors but does not directly cause a large gap between training and test accuracy. Option D is wrong because high bias typically leads to underfitting, where the model is too simple and performs poorly on both training and test data, not the specific pattern of high training accuracy and low test accuracy seen here.

897

MCQ · medium

An e-commerce company is acquiring product data from multiple supplier APIs. The APIs return JSON with inconsistent field naming conventions. Which data acquisition technique should be applied?

A. Data compression

B. Data mapping and transformation

C. Data deduplication

D. Data aggregation

Answer: B

Standardizes field names and structures.

Why this answer

Data mapping and transformation is the correct technique because the JSON responses from different supplier APIs use inconsistent field naming conventions (e.g., 'product_id' vs. 'ProductID'). This technique defines a schema to map source fields to a standardized target format, ensuring data consistency before loading into the company's system. Without transformation, downstream processes like analytics or inventory management would fail due to mismatched field names.

Exam trap

The trap here is that candidates confuse data transformation with data aggregation or deduplication, assuming any processing step can fix schema inconsistencies, but only mapping and transformation directly address field naming and structure mismatches.

How to eliminate wrong answers

Option A is wrong because data compression reduces storage size or transfer bandwidth, but does not address structural inconsistencies in field naming. Option C is wrong because data deduplication removes duplicate records based on content, but does not reconcile different field names or schemas. Option D is wrong because data aggregation summarizes or combines data (e.g., sums, averages), but does not resolve naming conflicts or schema mismatches.
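
Field mapping of this kind can be sketched as a lookup from source names to a target schema. The `FIELD_MAP` table, the `standardize` helper, the supplier records, and every field name other than 'product_id' and 'ProductID' are hypothetical.

```python
# Hypothetical mapping from supplier-specific field names to one target schema.
FIELD_MAP = {
    "product_id": "product_id",
    "ProductID": "product_id",
    "productId": "product_id",
    "price": "price",
    "unitPrice": "price",
}


def standardize(record):
    """Rename known source fields; collect unknown ones under '_unmapped'."""
    out, unmapped = {}, {}
    for key, value in record.items():
        target = FIELD_MAP.get(key)
        if target is None:
            unmapped[key] = value   # keep for review rather than dropping
        else:
            out[target] = value
    if unmapped:
        out["_unmapped"] = unmapped
    return out


# Two hypothetical API responses with different naming conventions.
supplier_a = {"product_id": "P-1", "price": 9.99}
supplier_b = {"ProductID": "P-2", "unitPrice": 4.5, "colour": "red"}

standardized = [standardize(r) for r in (supplier_a, supplier_b)]
print(standardized)
```

Keeping unmapped fields aside, instead of discarding them, makes it easier to spot a new supplier field that should be added to the mapping.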

898

MCQ · easy

A data analyst is cleaning a dataset and finds that the 'age' column has several missing values. Which of the following is a valid method for handling missing numerical data?

A. Delete the entire column

B. Ignore the missing values

C. Impute with the mean

D. Replace with zeros

Answer: C

Correct: mean imputation is a standard technique.

Why this answer

Mean imputation is a common method for handling missing numerical data, though median or mode can also be used.
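
Mean imputation can be sketched in a few lines. The `ages` list is made up for illustration, with `None` standing in for a missing value.

```python
import statistics

# Hypothetical 'age' column; None marks a missing value.
ages = [34, None, 28, 45, None, 33]

# Compute the mean from the observed values only.
observed = [a for a in ages if a is not None]
mean_age = statistics.mean(observed)

# Fill each gap with that mean, leaving observed values untouched.
imputed = [a if a is not None else mean_age for a in ages]
print(mean_age, imputed)

# For comparison: replacing gaps with zeros drags the average down.
zero_filled_mean = statistics.mean(a if a is not None else 0 for a in ages)
print(zero_filled_mean)
```

Filling with the mean leaves the column average unchanged, whereas filling with zeros (Option D) distorts it.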

899

MCQ · hard

A dataset contains transaction amounts with a few extremely high values. The analyst wants to reduce the impact of these outliers on the average. Which measure of central tendency is most robust?

A. Mean

B. Median

C. Mode

D. Standard deviation

Answer: B

Median is robust to outliers.

Why this answer

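The median is largely unaffected by a few extreme values because it depends only on the middle of the ordered data, while the mean is pulled toward them. Standard deviation is a measure of spread, not of central tendency.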
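
The difference in sensitivity can be sketched with a handful of made-up transaction amounts, adding one extreme value and comparing how far each measure moves.

```python
import statistics

# Hypothetical transaction amounts, then the same data with one extreme value.
amounts = [20, 25, 30, 35, 40]
with_outlier = amounts + [5000]

mean_before, mean_after = statistics.mean(amounts), statistics.mean(with_outlier)
median_before, median_after = statistics.median(amounts), statistics.median(with_outlier)

# The mean jumps by hundreds; the median moves by only a few units.
print("mean:", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```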

900

Multi-Select · medium

A data analyst is using Tableau to build a dashboard. Which THREE features are available in Tableau for creating interactive dashboards?

Select 3 answers

A. Dashboard actions

B. Calculated fields

C. DAX measures

D. Parameters

E. Power Query

Answers: A, B, D

Actions enable interactivity like filtering or navigating between sheets.

Why this answer

Parameters, calculated fields, and dashboard actions are key Tableau features for interactivity. LOD expressions are also available but not required for interactivity per se. DAX measures and Power Query belong to Microsoft Power BI, not Tableau.
