CompTIA Data+ DA0-001: Questions 751–825

982 questions total · 14 pages · All types, answers revealed

Page 11 of 14
751
MCQmedium

In Looker Studio, what is the difference between dimensions and metrics?

A.Dimensions are used for aggregation; metrics are for grouping
B.Dimensions are numerical; metrics are categorical
C.Both can be categorical or numerical
D.Dimensions are categorical; metrics are numerical
AnswerD

This matches Looker Studio terminology.

Why this answer

In Looker Studio, dimensions are fields that contain categorical data (e.g., text, dates, or geographic names) used to group and segment data, while metrics are numerical fields (e.g., counts, sums, averages) that can be aggregated. Option D is correct because this distinction is fundamental to how Looker Studio processes and visualizes data: dimensions define the rows or categories in a chart, and metrics provide the quantitative values to be measured.

Exam trap

The trap here is that candidates often confuse general data types (numeric vs. string) with the semantic roles in Looker Studio, leading them to choose option C. The exam expects you to know that, in the context of this tool, dimensions are used for grouping (categorical) and metrics for aggregation (numerical).

How to eliminate wrong answers

Option A is wrong because it reverses the roles: dimensions are used for grouping and segmenting data, not aggregation, while metrics are the fields that are aggregated (e.g., SUM, COUNT, AVG). Option B is wrong because it incorrectly states that dimensions are numerical and metrics are categorical; in reality, dimensions are typically categorical (text, date, boolean) and metrics are numerical. Option C is wrong because while both can technically be categorical or numerical in raw data, Looker Studio enforces a strict semantic distinction: dimensions are treated as grouping keys (categorical) and metrics as aggregatable values (numerical), and mixing them leads to incorrect chart behavior.

752
Multi-Selecthard

A data analyst is using a recursive CTE to traverse a hierarchical organizational chart. Which THREE components are required to define a recursive CTE? (Select THREE.)

Select 3 answers
A.ORDER BY clause
B.Anchor member
C.WITH clause
D.Recursive member
E.UNION ALL operator
AnswersB, D, E

The anchor member defines the initial result set.

Why this answer

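A recursive CTE requires an anchor member (the initial query), a recursive member (which references the CTE itself), and the UNION ALL operator to combine them. The WITH clause introduces every CTE, recursive or not, so the question treats it as the outer syntax rather than a component of the recursion itself. An ORDER BY clause is optional.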
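The three components can be seen together in a small runnable sketch. It uses Python's built-in sqlite3 module; the employees table, its rows, and the org_chart CTE name are hypothetical and chosen only for illustration. Dialects differ on the RECURSIVE keyword (SQL Server omits it, PostgreSQL requires it).

```python
import sqlite3

# Hypothetical org chart: each employee points at a manager (NULL = top).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ava', NULL),
        (2, 'Ben', 1),
        (3, 'Cho', 1),
        (4, 'Dev', 2);
""")

query = """
WITH RECURSIVE org_chart(id, name, depth) AS (
    -- Anchor member: starting rows, does not reference org_chart
    SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: references org_chart to reach the next level down
    SELECT e.id, e.name, oc.depth + 1
    FROM employees AS e
    JOIN org_chart AS oc ON e.manager_id = oc.id
)
SELECT name, depth FROM org_chart ORDER BY depth, name;
"""

rows = conn.execute(query).fetchall()
print(rows)
```

The recursion stops on its own here because the join eventually finds no further reports; a hierarchy containing a cycle would need an explicit depth limit.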

753
MCQmedium

A sales manager wants to see the conversion rates at each stage of the sales pipeline, from initial contact to closed deal. Which chart type is most appropriate?

A.Waterfall chart
B.Funnel chart
C.Treemap
D.Bar chart
AnswerB

Funnel charts visualize pipeline stages and conversion.

Why this answer

A funnel chart is designed to show progressive reductions as stages narrow.

754
MCQeasy

A sales analyst wants to show total sales by product category, with each category's contribution to the total. Which chart type is best?

A.Scatter plot
B.Box plot
C.Stacked bar chart
D.Line chart
AnswerC

Stacked bars show total and composition.

Why this answer

A stacked bar chart shows both the total and the part-to-whole breakdown by category.

755
Multi-Selecthard

Which TWO of the following are primary benefits of implementing a data governance program?

Select 2 answers
A.Faster data processing speed
B.Increased data volume
C.Improved data quality and consistency
D.Lower storage costs
E.Reduced data redundancy
AnswersC, E

Governance establishes standards that enhance quality and consistency.

Why this answer

Option C is correct because a primary benefit of a data governance program is improved data quality and consistency. Data governance establishes policies, standards, and procedures for data management, ensuring that data is accurate, complete, and reliable across the organization. This directly enhances decision-making and operational efficiency by reducing errors and inconsistencies in data assets.

Exam trap

The trap here is that candidates may confuse data governance with data management or data engineering tasks, mistakenly thinking it directly improves performance or reduces costs, when its core value is in quality, consistency, and compliance.

756
MCQeasy

A data team needs to extract data from a legacy system that only supports flat file exports. Which data acquisition method is most appropriate?

A.Database replication
B.API call
C.Web scraping
D.File transfer via SFTP
AnswerD

SFTP enables secure transfer of flat files, aligning with the system's export capability.

Why this answer

The legacy system only supports flat file exports, meaning it cannot provide direct database or API access. SFTP (SSH File Transfer Protocol) is the most appropriate method because it securely transfers flat files over a network, aligning with the system's export capabilities while ensuring data integrity and encryption during transit.

Exam trap

The trap here is that candidates may confuse 'flat file exports' with a need for real-time or API-based methods, overlooking that SFTP is the standard secure file transfer protocol for batch-oriented legacy systems.

How to eliminate wrong answers

Option A is wrong because database replication requires the source system to support a database engine with replication features (e.g., transactional logs or CDC), which a legacy flat-file-only system lacks. Option B is wrong because an API call requires the legacy system to expose a programmatic interface (e.g., REST or SOAP), which is not available if it only supports flat file exports. Option C is wrong because web scraping is used to extract data from web pages via HTTP, not from a legacy system that exports flat files via a file transfer protocol.

757
MCQeasy

Refer to the exhibit. Which data quality dimension is compromised by the missing value for Charlie's salary?

A.Uniqueness
B.Completeness
C.Timeliness
D.Accuracy
AnswerB

Correct. The salary field is missing, so data is incomplete.

Why this answer

Completeness measures whether all required data is present. Charlie's missing salary value means the record is incomplete, directly violating this dimension. In data quality frameworks, completeness is assessed by the proportion of non-null values in a field, and a null salary here fails that check.

Exam trap

CompTIA often tests the distinction between 'missing' (completeness) and 'wrong' (accuracy), leading candidates to confuse a null value with an incorrect value.

How to eliminate wrong answers

Option A is wrong because uniqueness refers to the absence of duplicate records or values, not missing data; a missing salary does not create a duplicate. Option C is wrong because timeliness concerns whether data is up-to-date or available when needed, not whether a value is present or absent. Option D is wrong because accuracy measures correctness of values against a reference source; a missing value is not an inaccurate value—it is an absent one.
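As a rough illustration of how completeness can be measured as the proportion of non-null values, here is a minimal Python sketch. The exhibit is not reproduced on this page, so the records below (including the other names and salary figures) are invented for the example.

```python
# Hypothetical stand-in for the exhibit: Charlie's salary is missing.
records = [
    {"name": "Alice", "salary": 72000},
    {"name": "Bob", "salary": 65000},
    {"name": "Charlie", "salary": None},
]


def completeness(rows, field):
    """Return the share of rows whose `field` is present (not None)."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(field) is not None)
    return present / len(rows)


salary_completeness = completeness(records, "salary")
name_completeness = completeness(records, "name")
print(f"salary: {salary_completeness:.0%}, name: {name_completeness:.0%}")
```

A null counts against completeness only; it says nothing about whether the values that are present are accurate.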

758
MCQeasy

A dashboard designer wants to highlight the sales performance of individual sales representatives compared to team averages. Which chart type is most suitable for this comparison?

A.Scatter plot
B.Bar chart with average line
C.Pie chart
D.Line chart
AnswerB

Bar chart allows easy comparison of each rep's sales to the average line.

Why this answer

A bar chart with an average line is ideal for comparing individual values to an average because it clearly shows the magnitude of each value relative to a reference line. Options A, C, and D are less suitable for direct comparison of multiple entities.

759
MCQeasy

An analyst is presenting a recommendation to increase marketing spend. Which statement best follows the data-driven recommendation structure (evidence → insight → recommendation → expected impact)?

A.A 10% increase in marketing spend is recommended because we have budget.
B.The data shows sales are up, so we should spend more on marketing.
C.Based on a 5% lift in sales from previous campaigns, we recommend a 10% increase in marketing spend, expecting a 7% revenue growth.
D.We should increase marketing spend by 10% because it might boost sales.
AnswerC

This includes evidence, insight, recommendation, and expected impact.

Why this answer

The correct structure provides evidence, insight derived from it, a recommendation, and the expected impact.

760
Multi-Selecthard

Which THREE are best practices for designing a dashboard for executive consumption?

Select 3 answers
A.Include detailed raw data tables for transparency
B.Ensure the dashboard is responsive for mobile devices
C.Use a separate chart for each metric to avoid clutter
D.Provide interactive filters for time periods and regions
E.Display the most critical KPIs at the top
AnswersB, D, E

Executives often view dashboards on tablets or phones.

Why this answer

Options B, D, and E are correct: ensure mobile responsiveness (B), provide interactive filters (D), and display the most critical KPIs at the top (E). Option A is wrong because summary views are more useful for executives than detailed raw data tables.

Option C is wrong because a separate chart for every metric tends to add clutter rather than avoid it.

761
MCQhard

A data analyst needs to perform stratified sampling on a customer database to ensure proportional representation across three regions: North (40%), South (30%), and West (30%). The total sample size required is 1,000. How many customers should be sampled from the North region?

A.333
B.500
C.300
D.400
AnswerD

Correct: 40% of 1000.

Why this answer

Stratified sampling with proportional allocation: sample size per stratum = (stratum proportion) * total sample size. North = 0.40 * 1000 = 400.
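The same arithmetic for all three regions, as a short Python sketch. The function name proportional_allocation is hypothetical; the proportions and total come from the question.

```python
def proportional_allocation(total_sample, proportions):
    """Allocate a total sample across strata in proportion to each share."""
    return {
        stratum: round(total_sample * share)
        for stratum, share in proportions.items()
    }


allocation = proportional_allocation(
    1000, {"North": 0.40, "South": 0.30, "West": 0.30}
)
print(allocation)  # {'North': 400, 'South': 300, 'West': 300}
```

With less convenient proportions, rounding each stratum separately may leave the total a unit or two off the target, so a real allocation would need a rule for distributing the remainder.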

762
Multi-Selectmedium

Which TWO of the following are appropriate uses of min-max normalisation?

Select 2 answers
A.Transforming data to have mean 0 and standard deviation 1
B.Scaling features to a range of 0 to 1
C.Preparing data for linear regression with normally distributed residuals
D.Preparing data for k-nearest neighbours algorithm
E.Handling missing values
AnswersB, D

Correct: Min-max normalisation scales to [0,1].

Why this answer

Min-max normalisation scales data to a fixed range (often 0-1), useful for distance-based algorithms like k-NN and neural networks. Standardisation (Z-score) is better for algorithms assuming Gaussian distribution.
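The contrast between the two techniques can be sketched in a few lines of Python. The function names and the sample ages are assumptions made for illustration; the formulas are the standard ones, (x - min) / (max - min) for min-max and (x - mean) / standard deviation for the Z-score.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardise values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 60]  # hypothetical feature values
scaled = min_max_scale(ages)
standardised = z_score(ages)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Option A in the question describes what z_score does, which is why it is not a use of min-max normalisation.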

763
MCQmedium

A data analyst notices that a line chart showing monthly sales over the past two years has a steep drop in one month. Upon investigation, the analyst discovers that a new sales region was added mid-month and the data was not normalized. Which of the following best practices should the analyst apply to communicate this insight accurately?

A.Remove the month with the drop from the report.
B.Use a bar chart instead to show the data.
C.Normalize the sales data by region and explain the data anomaly in the report.
D.Present the data as-is and let stakeholders interpret the drop.
AnswerC

Normalization corrects the artifact, and explanation provides transparency.

Why this answer

Option C is correct because the core issue is that the sales data is not normalized by region, causing a misleading drop when a new region was added mid-month. By normalizing the data (e.g., calculating per-region averages or percentages) and explicitly noting the anomaly in the report, the analyst ensures accurate communication of insights. This aligns with the DA0-001 domain of Communicating Data Insights, where transparency and data integrity are paramount.

Exam trap

CompTIA often tests the misconception that simply changing the chart type (Option B) or omitting problematic data (Option A) is acceptable, when the correct approach is to normalize and explain the anomaly to maintain data integrity and clarity.

How to eliminate wrong answers

Option A is wrong because removing the month with the drop would hide the anomaly and distort the trend, violating the principle of data integrity and transparency. Option B is wrong because switching to a bar chart does not address the underlying issue of unnormalized data; the same misleading drop would still appear, just in a different visual format. Option D is wrong because presenting the data as-is without explanation shifts the burden of interpretation to stakeholders, who may draw incorrect conclusions due to the unnormalized data, which is a failure in clear communication.

764
MCQhard

A data analyst needs to create a recursive CTE to traverse a hierarchical employee-manager table. Which of the following is a key requirement for a recursive CTE?

A.The CTE must include a WHERE clause in the recursive member
B.The CTE must use the RECURSIVE keyword in the WITH clause
C.The recursive CTE must have at least one anchor member that does not reference the CTE
D.The recursive member must use UNION instead of UNION ALL
AnswerC

The anchor member provides the starting set; the recursive member references the CTE.

Why this answer

A recursive CTE must have an anchor member (non-recursive) and a recursive member that references the CTE name, connected by UNION ALL.

765
MCQeasy

Refer to the exhibit. A data analyst is unable to access the dataset during weekends. Based on the policy, what is the most likely reason?

A.The analyst's role is not in the list
B.The policy is misconfigured
C.The access condition requires being on the internal network
D.The condition restricts access to business hours only
AnswerD

Weekends are outside business hours, so access is denied.

Why this answer

Option D is correct because the policy explicitly includes a condition that restricts access to business hours only (e.g., Monday–Friday, 9 AM–5 PM). Since the analyst is attempting to access the dataset during a weekend, the condition denies access regardless of the analyst's role or network location. This is a common attribute-based access control (ABAC) rule where time-based conditions are evaluated before granting permissions.

Exam trap

The trap here is that candidates may assume the issue is a missing role or network restriction, but the policy explicitly shows a time-based condition that overrides other permissions, and the weekend access attempt directly violates that condition.

How to eliminate wrong answers

Option A is wrong because the exhibit shows the analyst's role is listed in the policy (e.g., 'Data Analyst' is present in the role list), so the issue is not a missing role. Option B is wrong because the policy is correctly configured with a time-based condition; misconfiguration would imply a syntax error or invalid attribute, but the policy is syntactically valid and the denial is expected behavior. Option C is wrong because the policy does not include a network location condition (e.g., 'sourceIp' or 'internalNetwork' attribute); the only condition is time-based, so being on the internal network would not grant access during weekends.
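A time-based access condition of this kind might be evaluated roughly as follows. This is only a sketch: the exhibit is not reproduced on this page, so the POLICY dictionary, its field names, the role list, and the 9-to-5 window are all assumptions rather than the actual policy.

```python
from datetime import datetime

# Hypothetical policy; the real exhibit's structure and attributes are unknown.
POLICY = {
    "roles": {"Data Analyst", "Data Engineer"},
    "business_days": {0, 1, 2, 3, 4},  # Monday=0 ... Friday=4
    "start_hour": 9,
    "end_hour": 17,
}


def is_access_allowed(role, when, policy=POLICY):
    """Grant access only if the role is listed AND the time condition holds."""
    if role not in policy["roles"]:
        return False
    on_business_day = when.weekday() in policy["business_days"]
    in_business_hours = policy["start_hour"] <= when.hour < policy["end_hour"]
    return on_business_day and in_business_hours


saturday = datetime(2024, 6, 1, 10, 0)  # a Saturday
monday = datetime(2024, 6, 3, 10, 0)    # the following Monday
print(is_access_allowed("Data Analyst", saturday))  # False
print(is_access_allowed("Data Analyst", monday))    # True
```

The role check passes in both calls, so the weekend denial comes from the time condition alone, which mirrors the reasoning for option D.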

766
MCQhard

A data analyst discovers that a dataset contains multiple records for the same customer with different spellings (e.g., 'Jon' vs 'John'). Which data preparation step should be applied first?

A.Merge all records into one per customer.
B.Remove duplicates based on exact match.
C.Standardize text fields using a lookup table.
D.Flag records for manual review.
AnswerC

Standardization harmonizes variations like 'Jon' and 'John'.

Why this answer

Option C is correct because the first step when dealing with inconsistent text values (like 'Jon' vs 'John') is to standardize the data using a lookup table or reference mapping. This ensures that all variations are normalized to a canonical form before any merging or deduplication is attempted, preventing data loss and preserving referential integrity.

Exam trap

The trap here is that candidates often jump to 'remove duplicates' (Option B) because they think of exact-match deduplication, but the question specifically tests the understanding that data quality issues like inconsistent spellings must be resolved through standardization before any deduplication logic can be applied.

How to eliminate wrong answers

Option A is wrong because merging records before standardizing spellings would combine data based on non-uniform keys, likely creating erroneous composite records or losing the ability to correctly identify which records belong to the same customer. Option B is wrong because removing duplicates based on exact match would treat 'Jon' and 'John' as different records, failing to identify them as the same customer and leaving the inconsistency unresolved. Option D is wrong because flagging records for manual review is a downstream action that should only be taken after automated standardization has been attempted; skipping standardization first would result in an unnecessarily large and inefficient manual review workload.
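The ordering (standardize first, deduplicate second) can be illustrated with a small Python sketch. The lookup table, the customer records, and the choice of (first_name, email) as the match key are all hypothetical.

```python
# Hypothetical lookup table mapping known variants to a canonical spelling.
NAME_LOOKUP = {"jon": "John", "jonh": "John", "john": "John"}


def standardize_name(raw, lookup=NAME_LOOKUP):
    """Map a raw name to its canonical form; unknown names pass through trimmed."""
    cleaned = raw.strip()
    return lookup.get(cleaned.lower(), cleaned)


def exact_dedupe(rows):
    """Keep the first row for each exact (first_name, email) combination."""
    seen, unique = set(), []
    for row in rows:
        key = (row["first_name"], row["email"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


customers = [
    {"first_name": "Jon", "email": "j.smith@example.com"},
    {"first_name": "John", "email": "j.smith@example.com"},
    {"first_name": " john ", "email": "j.smith@example.com"},
]

# Deduplicating the raw data misses the variants entirely.
raw_result = exact_dedupe(customers)

# Standardizing first lets the same exact-match logic collapse them.
standardized = [
    {**c, "first_name": standardize_name(c["first_name"])} for c in customers
]
clean_result = exact_dedupe(standardized)
print(len(raw_result), len(clean_result))  # 3 1
```

Names that the lookup table does not cover pass through unchanged, and those leftovers are the natural candidates for the manual review described in option D.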

767
Multi-Selectmedium

Which THREE of the following are common characteristics of unstructured data?

Select 3 answers
A.Easily queried using SQL
B.Often stored in NoSQL databases or data lakes
C.Can include text, images, and video
D.Stored in relational tables
E.Lacks a predefined schema
AnswersB, C, E

NoSQL and data lakes are designed to store unstructured data at scale.

Why this answer

Options B, C, and E are correct. Unstructured data lacks a predefined schema, can include various media types, and is often stored in NoSQL databases or data lakes. A is wrong because SQL queries are not designed for unstructured data.

D is wrong because unstructured data is not stored in relational tables.

768
MCQeasy

A data analyst is using SQL to extract data. The analyst wants to retrieve all records from a table named 'sales' where the 'amount' column is greater than 100. Which SQL clause should be used?

A.WHERE
B.ORDER BY
C.GROUP BY
D.HAVING
AnswerA

WHERE clause filters rows based on a condition.

Why this answer

The WHERE clause in SQL is used to filter records based on a specified condition, such as 'amount > 100'. It is applied directly to the rows in the 'sales' table before any grouping or ordering, making it the correct choice for retrieving only records where the amount exceeds 100.

Exam trap

The trap here is that candidates often confuse HAVING with WHERE, thinking both can filter rows, but HAVING is only valid after GROUP BY and for aggregate conditions, while WHERE filters individual rows before any grouping.

How to eliminate wrong answers

Option B (ORDER BY) is wrong because it is used to sort the result set by one or more columns, not to filter rows based on a condition. Option C (GROUP BY) is wrong because it groups rows that have the same values in specified columns into summary rows, often for use with aggregate functions, and does not filter individual records. Option D (HAVING) is wrong because it is used to filter groups after the GROUP BY clause has been applied, typically with aggregate functions, and cannot be used to filter individual rows before grouping.
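The difference between filtering rows with WHERE and filtering groups with HAVING can be checked with Python's built-in sqlite3 module. The sales table name and amount column come from the question; the region column and the sample rows are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO sales (region, amount) VALUES
        ('East', 50), ('East', 150), ('West', 120), ('West', 300), ('North', 80);
""")

# WHERE filters individual rows before any grouping takes place.
row_filter = conn.execute(
    "SELECT region, amount FROM sales WHERE amount > 100 ORDER BY amount"
).fetchall()

# HAVING filters whole groups after GROUP BY, usually on an aggregate.
group_filter = conn.execute(
    "SELECT region, SUM(amount) FROM sales "
    "GROUP BY region HAVING SUM(amount) > 100 ORDER BY region"
).fetchall()

print(row_filter)
print(group_filter)
```

East's 50 row is dropped by WHERE, yet East still passes the HAVING filter because its group total is 200. North is excluded by both.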

769
MCQeasy

A data analyst needs to show the relationship between advertising spend (in dollars) and the number of website visits. Both variables are continuous. Which chart type is most suitable?

A.Line chart
B.Scatter plot
C.Box plot
D.Bar chart
AnswerB

Scatter plots show the relationship between two continuous variables and can reveal correlation patterns.

Why this answer

Scatter plots are ideal for visualizing the correlation between two continuous variables.

770
MCQeasy

Which data quality dimension ensures that data represents the real-world scenario correctly and without errors?

A.Completeness
B.Consistency
C.Accuracy
D.Timeliness
AnswerC

Accuracy is about correctness and error-free data.

Why this answer

Accuracy means the data correctly reflects reality.

771
MCQmedium

A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?

A.Box plot
B.Scatter plot
C.Line chart
D.Histogram
AnswerD

Histograms display the frequency distribution of a single numeric variable across bins.

Why this answer

A histogram is the most appropriate chart for visualizing the distribution of a single discrete variable, such as satisfaction ratings on a 1-5 scale. It groups the responses into bins (each rating value) and displays the frequency of each bin using bars, clearly showing the shape, central tendency, and spread of the data.

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, but the key distinction is that a histogram is used for quantitative (ordinal or continuous) data where bin order matters, while a bar chart is for categorical (nominal) data with no inherent order.

How to eliminate wrong answers

Option A is wrong because a box plot summarizes data using five-number statistics (min, Q1, median, Q3, max) and is better for comparing distributions across groups, not for showing the detailed frequency distribution of a single ordinal variable. Option B is wrong because a scatter plot is used to visualize the relationship between two continuous variables, not the distribution of a single categorical or ordinal variable. Option C is wrong because a line chart is typically used to display trends over time or sequential data, not the frequency distribution of discrete survey responses.

772
MCQhard

In A/B testing, which factor is increased by having a larger sample size?

A.P-value
B.Effect size
C.Type I error rate
D.Statistical power
AnswerD

Power increases with sample size.

Why this answer

Larger sample size increases statistical power (ability to detect a true effect).
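The relationship can be made concrete with a normal-approximation sketch in Python. It assumes a two-sided, two-sample z-test with equal group sizes and a known standard deviation, and it ignores the negligible far tail; the effect size, sigma, and sample sizes are illustrative values, not figures from the exam.

```python
from statistics import NormalDist


def approx_power(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (far tail ignored)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sigma * (2 / n_per_group) ** 0.5
    return std_normal.cdf(abs(effect) / standard_error - z_crit)


# Same true effect and same alpha; only the sample size changes.
powers = {
    n: approx_power(effect=0.2, sigma=1.0, n_per_group=n)
    for n in (50, 200, 800)
}
for n, power in powers.items():
    print(f"n per group = {n:4d}  power = {power:.2f}")
```

Alpha (the Type I error rate) is fixed by the analyst and the true effect size does not depend on the sample, which is why neither option B nor option C is the factor that grows.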

773
MCQhard

A financial analyst is creating a quarterly report compliant with SOX. Which requirement is most critical for the report's audit trail?

A.Row-level security
B.Audit trail of data changes
C.Data lineage
D.Data dictionary
AnswerB

Correct. SOX mandates audit trails for financial reports.

Why this answer

SOX requires audit trails to track changes and ensure data integrity for financial reports.

774
MCQmedium

A data analyst wants to assign a unique sequential integer to each row in a result set, starting at 1, based on the order of the 'sales_amount' column descending. Which window function should be used?

A.DENSE_RANK() OVER (ORDER BY sales_amount DESC)
B.RANK() OVER (ORDER BY sales_amount DESC)
C.NTILE(1) OVER (ORDER BY sales_amount DESC)
D.ROW_NUMBER() OVER (ORDER BY sales_amount DESC)
AnswerD

Correct: assigns sequential numbers based on the order.

Why this answer

ROW_NUMBER() assigns a unique sequential integer to each row within a partition, starting at 1.
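The difference from RANK() and DENSE_RANK() only shows up when values tie, as in this sketch using Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (the first release with window functions). The sales rows are hypothetical, and the extra rep tiebreaker in the ROW_NUMBER() ordering is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (rep TEXT, sales_amount INTEGER);
    INSERT INTO sales VALUES ('Ann', 500), ('Bo', 500), ('Cy', 300), ('Di', 100);
""")

rows = conn.execute("""
    SELECT rep,
           sales_amount,
           ROW_NUMBER() OVER (ORDER BY sales_amount DESC, rep) AS row_num,
           RANK()       OVER (ORDER BY sales_amount DESC)      AS rnk,
           DENSE_RANK() OVER (ORDER BY sales_amount DESC)      AS dense_rnk
    FROM sales
    ORDER BY row_num
""").fetchall()

for row in rows:
    print(row)
```

Ann and Bo tie at 500: ROW_NUMBER() still gives them 1 and 2, RANK() gives both 1 and then skips to 3, and DENSE_RANK() gives both 1 and continues with 2.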

775
MCQeasy

Which type of chart is most suitable for showing the distribution of a single continuous variable?

A.Histogram
B.Line chart
C.Scatter plot
D.Bar chart
AnswerA

Histograms display the frequency distribution of a continuous variable.

Why this answer

A histogram is the most suitable chart for showing the distribution of a single continuous variable because it groups the data into bins (intervals) along the x-axis and displays the frequency or count of observations within each bin on the y-axis. This allows you to see the shape, spread, and central tendency of the data, such as whether it is normally distributed, skewed, or has multiple modes. For a single continuous variable, a histogram directly visualizes the probability density or frequency distribution, which is its primary purpose in data analysis.

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, assuming both are interchangeable for showing frequencies, but a histogram is specifically for continuous data with no gaps between bars, while a bar chart is for discrete categories with gaps.

How to eliminate wrong answers

Option B is wrong because a line chart is designed to show trends over time or ordered categories by connecting data points with lines, not to display the distribution of a single continuous variable. Option C is wrong because a scatter plot is used to show the relationship between two continuous variables by plotting points on a Cartesian plane, not to visualize the distribution of a single variable. Option D is wrong because a bar chart is used for categorical data, where each bar represents a distinct category, and it does not handle continuous variables properly since it would require arbitrary binning without the statistical properties of a histogram.

776
MCQhard

An organization has multiple systems that store customer information inconsistently. To create a single authoritative view of customer data, they implement a process that identifies and merges duplicate records. This is an example of which data management discipline?

A.Data governance
B.Data warehousing
C.Data quality
D.Master Data Management (MDM)
AnswerD

MDM focuses on creating a single, consistent view of master data entities like customers.

Why this answer

Option D is correct because Master Data Management (MDM) creates a golden record by merging duplicates. Option A is wrong because data governance provides oversight but does not directly merge records. Option B is wrong because data warehousing consolidates data for reporting.

Option C is wrong because data quality is a component but not the specific discipline for master data.

777
Multi-Selecteasy

A data analyst needs to retrieve the top 5 most expensive products from a 'products' table sorted by price descending. Which TWO SQL clauses are required to achieve this? (Select TWO).

Select 2 answers
A.HAVING COUNT(*) > 1
B.WHERE price > 100
C.ORDER BY price DESC
D.GROUP BY price
E.LIMIT 5
AnswersC, E

Sorts prices from highest to lowest.

Why this answer

ORDER BY DESC sorts prices descending, and LIMIT restricts to 5 rows. GROUP BY and HAVING are for aggregation; WHERE is for filtering but not needed here.
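A quick way to see the two clauses working together is Python's built-in sqlite3 module. The products table and price column come from the question; the product names and prices are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (name TEXT, price REAL);
    INSERT INTO products VALUES
        ('A', 10), ('B', 250), ('C', 75), ('D', 999),
        ('E', 40), ('F', 120), ('G', 5);
""")

# ORDER BY sorts first; LIMIT then keeps only the first five sorted rows.
top_five = conn.execute(
    "SELECT name, price FROM products ORDER BY price DESC LIMIT 5"
).fetchall()
print(top_five)
```

LIMIT is the syntax used by MySQL, PostgreSQL, and SQLite; SQL Server expresses the same idea with TOP, and standard SQL uses FETCH FIRST.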

778
MCQmedium

A table named Orders has columns OrderID, CustomerID, OrderDate, and TotalAmount. Which column should be the primary key to uniquely identify each order?

A.OrderDate
B.OrderID
C.TotalAmount
D.CustomerID
AnswerB

OrderID is unique per order.

Why this answer

The OrderID column is the correct choice for the primary key because it contains unique values for each order, ensuring that each row can be uniquely identified. A primary key must be unique, non-null, and stable; OrderID satisfies all these requirements, whereas the other columns do not guarantee uniqueness or are subject to change.

Exam trap

The trap here is that candidates may confuse a column that is frequently used for filtering or grouping (like CustomerID or OrderDate) with one that guarantees uniqueness, overlooking the fundamental primary key requirement of uniqueness and non-nullability.

How to eliminate wrong answers

Option A is wrong because OrderDate is not unique; multiple orders can occur on the same date, and it can also be null, violating primary key constraints. Option C is wrong because TotalAmount can have duplicate values (e.g., two orders with the same total) and is not inherently unique or stable. Option D is wrong because CustomerID is not unique per order; a single customer can place many orders, so it cannot uniquely identify each order row.

779
Multi-Selecteasy

Which TWO of the following are examples of data transformation? (Choose TWO.)

Select 2 answers
A.Normalizing data to eliminate redundancy
B.Creating a backup of the database
C.Converting string dates to date format
D.Generating summary statistics
E.Removing duplicate records
AnswersA, C

Normalization is a transformation.

Why this answer

Option A is correct because data normalization is a transformation process that reorganizes data to reduce redundancy and improve integrity, typically by decomposing tables into smaller, related tables (e.g., achieving 3NF in relational databases). This changes the structure and representation of the data, which is a core example of data transformation.

Exam trap

CompTIA often tests the distinction between data transformation (changing format/structure) and data cleansing (removing errors/duplicates) or data analysis (generating summaries), leading candidates to mistakenly select removal of duplicates or summary statistics as transformations.
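Option C, converting string dates to a date type, is the kind of transformation that changes representation without changing meaning. A minimal Python sketch follows; the MM/DD/YYYY input format and the sample strings are assumptions, since the question does not specify a format.

```python
from datetime import date, datetime


def parse_date(text, fmt="%m/%d/%Y"):
    """Convert a date string in the assumed format into a date object."""
    return datetime.strptime(text.strip(), fmt).date()


raw_dates = ["01/15/2024", " 12/03/2023 "]  # hypothetical source values
parsed = [parse_date(value) for value in raw_dates]
print(parsed)
```

Once the values are real dates they sort and compare chronologically, which plain strings in this format would not do.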

780
MCQeasy

A data analyst is presenting findings to a non-technical audience. Which data visualization technique is most appropriate?

A.Box plot
B.Heat map
C.Scatter plot
D.Bar chart
AnswerD

Bar charts are straightforward and show comparisons clearly, making them suitable for non-technical audiences.

Why this answer

Bar charts are the most appropriate visualization for a non-technical audience because they present categorical comparisons in a simple, intuitive format that requires no statistical background to interpret. Unlike box plots, heat maps, or scatter plots, bar charts clearly show relative magnitudes with minimal cognitive load, making them ideal for communicating data insights to stakeholders who may not be familiar with data analysis conventions.

Exam trap

The trap here is that candidates often choose scatter plots or box plots because they seem more 'analytical,' but the question specifically targets audience comprehension, not analytical depth. CompTIA tests whether you prioritize clarity over complexity when the audience lacks technical background.

How to eliminate wrong answers

Option A is wrong because box plots display data distribution through quartiles and outliers, which require understanding of statistical concepts like median, interquartile range, and whiskers — concepts that are not intuitive for a non-technical audience. Option B is wrong because heat maps use color gradients to represent data density or correlation, which can be ambiguous without a legend and are better suited for identifying patterns in large matrices rather than straightforward comparisons. Option C is wrong because scatter plots show relationships between two continuous variables and rely on the audience's ability to interpret correlation or clustering, which is not immediately accessible to non-technical viewers.

781
Multi-Selecthard

A sales analyst is building a Power BI report to analyze quarterly performance. Which THREE are appropriate uses of DAX or Power Query? (Select three.)

Select 3 answers
A.Using the SUMX function to sum a column without iteration
B.Using Power Query to remove duplicate rows from the data source
C.Using TOTALYTD to calculate year-to-date sales
D.Using the CALCULATE function to change the filter context in a measure
E.Using the RELATED function to create a calculated column from a related table
AnswersB, C, D

Power Query is ideal for data cleaning and transformation.

Why this answer

Option B is correct because Power Query is specifically designed for data transformation and cleaning tasks, such as removing duplicate rows from a data source, before the data is loaded into the data model. This is a standard ETL operation that Power Query handles efficiently using its 'Remove Duplicates' feature, which is not a DAX function.

Exam trap

CompTIA often tests the distinction between data transformation tools (Power Query) and data analysis functions (DAX), leading candidates to mistakenly think that DAX functions like SUMX can perform data cleaning tasks, when in fact they are for analytical calculations within the data model.

782
MCQhard

A data analyst discovers a data discrepancy in a report that has already been distributed to executives. The discrepancy is due to a known bug that will be fixed next month. What is the most appropriate action?

A.Send a corrected report immediately with an apology
B.Wait for the next scheduled update
C.Ignore it since it will be fixed soon
D.Inform the executives of the error and its impact
AnswerD

Transparency builds trust and allows informed decision-making.

Why this answer

Option D is correct because informing executives of the error and its impact maintains transparency and trust. Option C (ignore) is unethical. Option A (send corrected report immediately) may cause confusion without context.

Option B (wait) delays communication.

783
MCQmedium

A data analyst is working with a dataset that contains customer names and addresses. Some records have missing state codes. Which data quality issue is this?

A.Duplication
B.Incompleteness
C.Outliers
D.Inconsistency
AnswerB

Missing state codes make the record incomplete.

Why this answer

Incompleteness is the correct answer because missing state codes in customer address records represent a lack of required data. This is a classic example of incomplete data, where fields that should contain values are left null or blank, reducing the dataset's usability for analysis.

Exam trap

The trap here is that candidates may confuse incompleteness with inconsistency, but incompleteness is about missing data (nulls), while inconsistency is about contradictory data across records.

How to eliminate wrong answers

Option A is wrong because duplication refers to duplicate records (e.g., same customer appearing multiple times), not missing values. Option C is wrong because outliers are data points that deviate significantly from the norm (e.g., an unusually high age), not absent data. Option D is wrong because inconsistency involves contradictory or conflicting data (e.g., same customer with different state codes in different records), not missing values.

784
MCQmedium

A telecommunications company is experiencing issues with its customer satisfaction survey data. The data is collected from multiple channels: phone, email, and web forms. Each channel uses a different scale for ratings: phone uses 1-10, email uses 1-5, and web uses 1-7. Additionally, some survey responses contain missing values for demographic fields. The data analyst needs to calculate an overall satisfaction score that is comparable across all channels. The company's leadership wants a single metric that minimizes distortion from the different scales. Which approach should the analyst use to standardize the ratings?

A.Normalize each rating to a 0-100 scale using min-max normalization.
B.Calculate the average rating separately for each channel and then compare the averages.
C.Convert all ratings to a binary metric of satisfied (above midpoint) or unsatisfied.
D.Convert all ratings to a 1-10 scale by multiplying email ratings by 2 and web by 1.43.
AnswerA

Correct: Min-max normalization maps each scale to a common range, preserving relative differences.

Why this answer

Option A is correct because min-max normalization rescales each rating to a common 0-100 range using the formula (x - min) / (max - min) * 100. This preserves the relative distribution of responses within each channel while eliminating the effect of different scale lengths, making the scores directly comparable. It minimizes distortion better than simple multiplication or binary conversion, as it accounts for the full range of each original scale.

Exam trap

The trap here is that candidates may think simple multiplication (Option D) is sufficient for scale conversion, but CompTIA tests the understanding that multiplying by a constant does not align the minimums of the scales and can introduce distortion, whereas min-max normalization uses both endpoints of each scale to create a comparable metric.

How to eliminate wrong answers

Option B is wrong because calculating separate averages per channel does not standardize the ratings; it only produces channel-specific means that remain on different scales, making direct comparison invalid. Option C is wrong because converting to a binary satisfied/unsatisfied metric discards granularity and loses information about the degree of satisfaction, which can distort the overall score and reduce statistical power. Option D is wrong because multiplying email ratings by 2 maps the 1-5 scale onto 2-10 rather than 1-10, and multiplying web ratings by 1.43 maps 1-7 onto roughly 1.43-10, so the scale minimums no longer line up with the phone scale and low scores from those channels are inflated, introducing systematic bias.
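As a minimal sketch of the min-max formula from the explanation, the snippet below rescales ratings from the three channels to 0-100. The scale endpoints come from the question; the sample responses and the names `min_max_to_100`, `SCALES`, and `responses` are hypothetical, chosen only for illustration.

```python
def min_max_to_100(rating: float, scale_min: float, scale_max: float) -> float:
    """Rescale a rating from [scale_min, scale_max] to the 0-100 range."""
    return (rating - scale_min) / (scale_max - scale_min) * 100


# Scale endpoints per channel, taken from the question text.
SCALES = {"phone": (1, 10), "email": (1, 5), "web": (1, 7)}

# Hypothetical survey responses: (channel, raw rating).
responses = [("phone", 10), ("email", 5), ("web", 7), ("email", 1), ("web", 4)]

# Normalize every response using the endpoints of its own channel.
normalized = [
    (channel, min_max_to_100(rating, *SCALES[channel]))
    for channel, rating in responses
]

# A single overall score: the mean of the normalized ratings.
overall_score = sum(score for _, score in normalized) / len(normalized)

# For contrast, Option D's multiplier sends the lowest email rating to 2, not 1.
lowest_email_scaled = 1 * 2
```

With this approach the top rating on every channel becomes 100 and the bottom rating becomes 0, which is the alignment that plain multiplication does not provide.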

785
MCQhard

You have a table 'Orders' with columns order_id, customer_id, order_date, and amount. You need to write a query that returns each customer's most recent order date and the amount for that order. Which approach is correct?

A.SELECT customer_id, MAX(order_date), amount FROM Orders GROUP BY customer_id
B.SELECT customer_id, order_date, amount FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn FROM Orders) t WHERE rn = 1
C.SELECT customer_id, FIRST_VALUE(order_date) OVER (PARTITION BY customer_id ORDER BY order_date DESC), FIRST_VALUE(amount) OVER (PARTITION BY customer_id ORDER BY order_date DESC) FROM Orders
D.SELECT customer_id, order_date, amount FROM Orders WHERE order_date IN (SELECT MAX(order_date) FROM Orders GROUP BY customer_id)
AnswerB

Correctly identifies the most recent order per customer.

Why this answer

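Using a window function with ROW_NUMBER() to rank orders per customer by date descending, then filtering for rn = 1, gives the most recent order details. Option C's FIRST_VALUE() expressions compute the right values but return one row per order rather than one row per customer, so the query would still need deduplication. Option A's GROUP BY with MAX(order_date) cannot return the corresponding amount, because amount is neither aggregated nor grouped. Option D matches any order whose date equals some customer's latest date, so it can return orders that are not that customer's most recent.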
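The following sketch runs Option B and Option D against a small in-memory SQLite database through Python's `sqlite3` module. The sample rows are hypothetical, and window functions assume the underlying SQLite library is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (order_id INTEGER, customer_id INTEGER, "
    "order_date TEXT, amount REAL)"
)
# Hypothetical data. Order 5 falls on the date that is customer 102's latest.
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2024-01-05", 50.0),
        (2, 101, "2024-03-10", 75.0),
        (3, 102, "2024-02-01", 20.0),
        (4, 102, "2024-01-15", 90.0),
        (5, 101, "2024-02-01", 10.0),
    ],
)

# Option B: rank each customer's orders by date, newest first, keep rank 1.
latest = conn.execute(
    """
    SELECT customer_id, order_date, amount
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_date DESC
        ) AS rn
        FROM Orders
    ) t
    WHERE rn = 1
    ORDER BY customer_id
    """
).fetchall()

# Option D: the IN list holds every customer's max date, so order 5 slips through.
option_d = conn.execute(
    """
    SELECT customer_id, order_date, amount
    FROM Orders
    WHERE order_date IN (SELECT MAX(order_date) FROM Orders GROUP BY customer_id)
    """
).fetchall()
```

Here `latest` holds one row per customer, while `option_d` holds three rows because customer 101's older order shares a date with customer 102's latest order.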

786
Multi-Selectmedium

Which TWO chart types are appropriate for visualizing the distribution of a continuous variable? (Select two.)

Select 2 answers
A.Bar chart
B.Box plot
C.Line chart
D.Histogram
E.Pie chart
AnswersB, D

Box plots show quartiles, median, and outliers.

Why this answer

A box plot (B) is appropriate for visualizing the distribution of a continuous variable because it displays the median, quartiles, and potential outliers, providing a five-number summary of the data. A histogram (D) is also appropriate as it groups continuous data into bins and shows the frequency distribution, revealing the shape, central tendency, and spread of the variable.

Exam trap

The trap here is that candidates often confuse a bar chart with a histogram, thinking both can show distribution, but a bar chart is for categorical data while a histogram is specifically for continuous data with no gaps between bars.
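To make the two correct options concrete, this sketch computes the numbers a box plot and a histogram are drawn from, using only the standard library. The `ages` values are hypothetical, the quartiles use the "inclusive" method of `statistics.quantiles`, and the bin width of 10 is an arbitrary choice.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical continuous variable (already sorted for readability).
ages = [22, 25, 27, 31, 34, 38, 41, 45, 52]

# Box plot ingredients: the five-number summary.
q1, q2, q3 = quantiles(ages, n=4, method="inclusive")
five_number = (min(ages), q1, q2, q3, max(ages))

# Histogram ingredients: counts per bin, keyed by each bin's lower edge.
bin_width = 10
bins = Counter((age // bin_width) * bin_width for age in ages)

# Print a rough text histogram, one row per bin.
for lower_edge in sorted(bins):
    print(f"{lower_edge}-{lower_edge + bin_width - 1}: {'#' * bins[lower_edge]}")
```

Other quartile conventions exist, so a charting tool may place the box edges slightly differently for the same data.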

787
MCQeasy

A retail company wants to analyze customer purchase patterns to identify products frequently bought together. Which data mining technique is most appropriate?

A.Classification
B.Clustering
C.Regression
D.Association rules
AnswerD

Association rules identify frequent itemsets and co-occurrence patterns.

Why this answer

Association rules (market basket analysis) discover co-occurrence relationships. Regression predicts numeric values, clustering groups similar items, classification assigns categories.
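Co-occurrence in market basket analysis is usually quantified with support and confidence. The sketch below computes both for a rule "bread implies butter" over a handful of hypothetical transactions; the helper names `support` and `confidence` are illustrative, not taken from any library.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk", "eggs"},
]


def support(itemset: set) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent: set, consequent: set) -> float:
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


pair_support = support({"bread", "butter"})          # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"butter"})  # 2 of the 3 bread baskets
```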

788
MCQeasy

A data analyst needs to retrieve only unique job titles from the 'employees' table. Which SQL keyword should be used in the SELECT clause?

A.TOP
B.DISTINCT
C.UNIQUE
D.FILTER
AnswerB

DISTINCT returns unique rows.

Why this answer

The DISTINCT keyword removes duplicate rows from the result set, returning only unique values.
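A quick way to see DISTINCT in action is an in-memory SQLite table queried from Python. The `employees` rows and the column names `name` and `job_title` are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, job_title TEXT)")
# Hypothetical rows: two employees share the same job title.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ana", "Analyst"), ("Ben", "Engineer"), ("Cy", "Analyst")],
)

# Without DISTINCT, every row's title comes back, duplicates included.
all_titles = conn.execute("SELECT job_title FROM employees").fetchall()

# With DISTINCT, each title appears once.
unique_titles = conn.execute(
    "SELECT DISTINCT job_title FROM employees ORDER BY job_title"
).fetchall()
```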

789
Multi-Selectmedium

Which THREE of the following are common steps in data cleaning?

Select 3 answers
A.Removing outliers without justification
B.Imputing missing values
C.Standardizing data formats
D.Removing duplicate records
E.Increasing sample size
AnswersB, C, D

Missing values are often imputed to maintain dataset completeness.

Why this answer

Imputing missing values is a common data cleaning step because real-world datasets often have gaps due to data collection errors or system failures. Techniques like mean/median imputation, regression imputation, or using algorithms like k-NN help preserve sample size and avoid bias that would result from simply dropping rows. This ensures the dataset remains usable for analysis without introducing significant distortion.

Exam trap

CompTIA often tests the distinction between data cleaning steps and data collection or preprocessing steps, so the trap here is confusing 'increasing sample size' (a data augmentation or collection activity) with actual cleaning tasks like imputation, standardization, and deduplication.
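The three correct steps can be sketched in plain Python as follows. The records are hypothetical, and median imputation is only one of the techniques the explanation mentions; the right choice depends on the data.

```python
from statistics import median

# Hypothetical raw records with inconsistent formats, a duplicate, and a gap.
raw = [
    {"id": 1, "state": " ca", "age": 30},
    {"id": 2, "state": "NY", "age": None},
    {"id": 1, "state": "CA ", "age": 30},  # same record as the first once standardized
    {"id": 3, "state": "tx", "age": 50},
]

# Step 1: standardize formats (trim whitespace, upper-case state codes).
standardized = [{**r, "state": r["state"].strip().upper()} for r in raw]

# Step 2: remove duplicate records, keeping the first occurrence.
seen = set()
deduped = []
for r in standardized:
    key = (r["id"], r["state"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Step 3: impute missing ages with the median of the known ages.
fill_value = median(r["age"] for r in deduped if r["age"] is not None)
cleaned = [
    {**r, "age": fill_value if r["age"] is None else r["age"]} for r in deduped
]
```

Standardizing before deduplicating matters here: the first and third records only become identical after the state code is cleaned up.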

790
MCQmedium

A data analyst is designing a dashboard and wants to maximize the data-ink ratio. Which action aligns with this principle?

A.Removing unnecessary gridlines
B.Using 3D effects for bars
C.Adding a background image
D.Using bright colors for all elements
AnswerA

Removing unnecessary gridlines reduces non-data ink and improves the data-ink ratio.

Why this answer

Data-ink ratio focuses on removing non-data ink (chartjunk) and emphasizing data. Using minimal gridlines reduces non-data ink.

791
MCQhard

A data team is preparing a dashboard for executives. The team wants to highlight key performance indicators (KPIs) that are below target. Which of the following visualization techniques would most effectively draw attention to underperforming metrics without causing confusion?

A.Remove underperforming KPIs from the dashboard to avoid confusion.
B.Use a scatter plot to show the relationship between KPIs.
C.Apply conditional formatting to turn KPI values red when below target.
D.Use a pie chart showing the proportion of each KPI.
AnswerC

Red highlights call attention to issues immediately.

Why this answer

Conditional formatting that turns KPI values red when below target is the most effective technique because it leverages pre-attentive visual processing — the human eye naturally notices color changes (especially red) before other visual elements. This allows executives to instantly identify underperforming metrics without needing to interpret complex chart types, reducing cognitive load and confusion.

Exam trap

CompTIA often tests the misconception that removing or hiding negative data simplifies communication, when in fact effective dashboards must highlight both successes and failures to drive data-driven decisions.

How to eliminate wrong answers

Option A is wrong because removing underperforming KPIs from the dashboard hides critical negative information, violating the principle of transparency in data communication and preventing executives from taking corrective action. Option B is wrong because a scatter plot is designed to show relationships or correlations between two variables, not to highlight individual KPI performance against a target; it would introduce unnecessary complexity and fail to draw immediate attention to underperforming metrics. Option D is wrong because a pie chart shows proportions of a whole, which is irrelevant for comparing KPIs against targets; it cannot effectively indicate which metrics are below target and often leads to misinterpretation due to difficulty in comparing angles and areas.

792
MCQhard

A data analyst is building a dashboard that will be used by both executives and operational managers. The executives need high-level summaries, while managers need granular details. Which dashboard design principle should the analyst apply?

A.Use a single chart that shows both summary and detail simultaneously.
B.Display all available data on one page for transparency.
C.Design the dashboard with drill-down capabilities from summary to detail.
D.Create two separate dashboards for each audience.
AnswerC

Drill-down allows executives to see overview and managers to access details on demand.

Why this answer

Option C is correct because drill-down capabilities allow users to start with a high-level summary (e.g., total revenue by region) and then interactively navigate to granular details (e.g., individual transactions) without overwhelming either audience. This design principle supports both executive and operational manager needs within a single dashboard, maintaining clarity and performance by loading only the required level of detail on demand.

Exam trap

The trap here is that candidates often confuse 'drill-down' with 'separate dashboards' (Option D), thinking that different audiences require entirely different views, when in fact a single interactive dashboard with drill-down capabilities is the standard best practice for serving both summary and detail needs efficiently.

How to eliminate wrong answers

Option A is wrong because a single chart attempting to show both summary and detail simultaneously would likely become cluttered, violate data-ink ratio principles, and fail to provide clear high-level insights for executives or actionable granularity for managers. Option B is wrong because displaying all available data on one page creates information overload, reduces readability, and violates the principle of progressive disclosure, making it difficult for both audiences to quickly find relevant insights. Option D is wrong because creating two separate dashboards introduces maintenance overhead, potential data inconsistency, and forces users to switch contexts, whereas a single dashboard with drill-down capabilities is more efficient and user-friendly.

793
Multi-Selectmedium

Which TWO of the following are examples of quantitative data? (Choose TWO.)

Select 2 answers
A.Product color
B.Age in years
C.Customer satisfaction rating (Poor, Fair, Good)
D.Country of origin
E.Shoe size
AnswersB, E

Age is a numeric, quantitative variable.

Why this answer

Option B is correct because age in years is a numerical measurement that can be counted or measured on a ratio scale, making it quantitative data. Quantitative data represents quantities that can be expressed numerically and subjected to mathematical operations, such as calculating the average age of a group.

Exam trap

The trap here is that candidates often confuse ordinal data (like customer satisfaction ratings) with quantitative data because the categories have an order, but they are still qualitative since the values are not numeric measurements.

794
MCQhard

A company is designing a data lake to store raw sensor data from IoT devices. The data arrives as JSON objects with varying schemas. Which storage approach is most appropriate?

A.Ingest into a relational database with a predefined schema
B.Store each JSON object as a separate file in a compressed columnar format
C.Convert all JSON to Avro with a fixed schema before storing
D.Store raw JSON files in a distributed file system and apply schema-on-read
AnswerD

Schema-on-read allows handling varying schemas without upfront transformation.

Why this answer

Option D is correct because a data lake is designed to store raw data in its native format, and IoT sensor data with varying schemas is best handled by storing raw JSON files in a distributed file system (e.g., HDFS or Amazon S3). This approach leverages schema-on-read, where the schema is applied at query time rather than at write time, allowing flexibility for heterogeneous JSON objects without data loss or transformation overhead.

Exam trap

The trap here is that candidates confuse 'schema-on-read' with 'schema-on-write' and assume that converting to a structured format like Avro or columnar storage is always better for performance, ignoring the requirement to store raw, varying-schema data as-is.

How to eliminate wrong answers

Option A is wrong because relational databases require a predefined schema and enforce ACID constraints, which cannot accommodate JSON objects with varying schemas without costly schema migrations or data loss. Option B is wrong because storing each JSON object as a separate file in a compressed columnar format (e.g., Parquet or ORC) is inefficient for small, variable-schema records; columnar formats are optimized for analytical queries on large, homogeneous datasets, not for raw ingestion of many small, schema-varying JSON objects. Option C is wrong because converting all JSON to Avro with a fixed schema before storing defeats the purpose of a data lake, which is to preserve raw data; Avro requires a predefined schema at write time, and forcing a fixed schema on varying JSON objects would either lose data or require complex schema evolution management.
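Schema-on-read can be illustrated with a few raw JSON lines whose fields differ. In this sketch the raw text is kept untouched and a schema (a list of field names) is applied only when the data is read; the sensor fields and the helper `read_with_schema` are hypothetical.

```python
import json

# Hypothetical raw sensor payloads stored as-is; each has a different shape.
raw_lines = [
    '{"device_id": "a1", "temp_c": 21.5}',
    '{"device_id": "b2", "humidity": 40, "battery": 0.87}',
    '{"device_id": "c3", "temp_c": 19.0, "humidity": 55}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto `fields` at read time; absent fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


# Two different queries impose two different schemas on the same raw data.
temperature_view = list(read_with_schema(raw_lines, ["device_id", "temp_c"]))
humidity_view = list(read_with_schema(raw_lines, ["device_id", "humidity"]))
```

Because nothing is discarded at write time, a field such as `battery` remains available to any later query that asks for it.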

795
MCQeasy

Which data quality dimension is violated if a customer record has a missing phone number?

A.Consistency
B.Accuracy
C.Completeness
D.Validity
AnswerC

Completeness measures missing values.

Why this answer

Completeness refers to the extent to which data is not missing.

796
MCQmedium

A data analyst needs to combine rows from two tables based on a related column, but only wants rows that have matching values in both tables. Which join type should the analyst use?

A.RIGHT JOIN
B.INNER JOIN
C.FULL OUTER JOIN
D.LEFT JOIN
AnswerB

INNER JOIN returns only matching rows.

Why this answer

INNER JOIN returns only rows with matching values in both tables, which matches the requirement.
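The difference between INNER JOIN and LEFT JOIN shows up clearly on a tiny example. The `customers` and `orders` tables below are hypothetical and live in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
# Hypothetical data: Ben (customer 2) has no orders.
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben"), (3, "Cy")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 3)])

# INNER JOIN: only rows with a match in both tables.
inner_rows = conn.execute(
    """
    SELECT c.name, o.order_id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
    """
).fetchall()

# LEFT JOIN: every customer is kept; unmatched ones get NULL (None in Python).
left_rows = conn.execute(
    """
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id, o.order_id
    """
).fetchall()
```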

797
MCQhard

A large e-commerce company has a data warehouse that stores daily sales transactions. The data analytics team creates a weekly report for the executive team that includes total revenue, number of orders, and average order value. Recently, the report showed a sudden 20% drop in total revenue compared to the previous week. Upon investigation, the data analyst discovers that a major promotional event ended the previous week, and the current week's data includes a new product launch with lower prices. The analyst needs to present this data to the executives in a way that accurately reflects the business situation. The analyst must also provide recommendations for actions. Which of the following is the best course of action?

A.Remove the new product launch data from the report to show a clearer comparison.
B.Present only the data from the new product launch to show growth potential.
C.Present the data with annotations explaining the promotion end and new product launch, and recommend analyzing customer retention and cross-selling opportunities.
D.Present the drop as-is and recommend cutting costs to compensate for the revenue decline.
AnswerC

This provides full context and actionable insights.

Why this answer

Option C is correct because it adheres to the principle of transparent data communication: the analyst must present the full dataset with contextual annotations (promotion end, new product launch) to avoid misleading executives. This approach enables informed decision-making, and the recommendation to analyze customer retention and cross-selling opportunities directly addresses the underlying business dynamics—understanding whether the revenue drop is temporary or signals a shift in customer behavior. In a data warehouse context, this ensures that the weekly report reflects the true state of the business rather than a manipulated or incomplete view.

Exam trap

The trap here is that candidates may think removing or cherry-picking data (Options A and B) simplifies the story for executives, but the exam tests the ethical and professional standard of presenting complete, contextualized data—even when it shows a negative trend—while offering actionable recommendations based on the full picture.

How to eliminate wrong answers

Option A is wrong because removing the new product launch data constitutes data manipulation that violates the integrity of the report; it would hide a significant business event and prevent executives from understanding the full revenue picture. Option B is wrong because presenting only the new product launch data is a form of selection bias that ignores the overall revenue decline and the impact of the promotion end, leading to an overly optimistic and incomplete analysis. Option D is wrong because presenting the drop as-is without context or analysis fails to provide actionable insights; recommending cost cuts without understanding the root cause (promotion end and new product pricing) could lead to misguided strategic decisions.

798
MCQmedium

A data engineer needs to store logs from web servers that have varying fields. The logs are in JSON format. Which data type describes this JSON data?

A.Binary data
B.Structured data
C.Semi-structured data
D.Unstructured data
AnswerC

JSON allows schema flexibility with key-value pairs, fitting the semi-structured definition.

Why this answer

JSON data with varying fields is classified as semi-structured data because it has organizational properties (key-value pairs, nested structures) but does not conform to a rigid schema like a relational table. The logs from web servers may have different fields per record, which is a hallmark of semi-structured data, as it allows flexibility while still being self-describing.

Exam trap

The trap here is that candidates confuse 'structured' with any data that has a format, but JSON's lack of a fixed schema and varying fields disqualifies it from being structured data, which requires a rigid, predefined schema like a relational database table.

How to eliminate wrong answers

Option A is wrong because binary data refers to raw bytes or encoded formats (e.g., images, executables) that lack any inherent structure or human-readable format, whereas JSON is text-based and has explicit key-value organization. Option B is wrong because structured data requires a fixed schema with predefined fields and data types (e.g., rows in a SQL table), but JSON logs with varying fields violate this strict schema requirement. Option D is wrong because unstructured data has no predefined format or organization (e.g., plain text, video files), while JSON has a defined syntax with keys, values, and nesting, providing a clear structure.

799
MCQhard

A data analyst is presenting a story about declining sales. The narrative arc should include which three elements in order?

A.Resolution → Complication → Situation
B.Situation → Complication → Resolution
C.Situation → Resolution → Complication
D.Complication → Situation → Resolution
AnswerB

This is the standard narrative arc.

Why this answer

Option B is correct because a compelling data story about declining sales follows the classic narrative arc: Situation (establish context, e.g., 'Sales were steady in Q1'), Complication (introduce the conflict, e.g., 'Then a 20% drop occurred in Q2'), and Resolution (present the insight or action, e.g., 'We identified the cause and implemented a new pricing strategy'). This order mirrors the 'Situation-Complication-Resolution' framework used in data storytelling to guide the audience logically from context to problem to solution.

Exam trap

The trap here is that candidates often confuse the narrative order with a simple 'problem-solution' structure, mistakenly placing Complication first (Option D) or skipping the Situation entirely, but the exam requires the full Situation → Complication → Resolution sequence to ensure a complete and logical data story.

How to eliminate wrong answers

Option A is wrong because starting with Resolution (the solution) before establishing the Situation or Complication confuses the audience; they need context first to understand why the resolution matters. Option C is wrong because placing Resolution before Complication skips the core conflict that drives the narrative, making the story feel incomplete and the resolution unsupported. Option D is wrong because beginning with Complication without first setting the Situation leaves the audience without necessary background, making the problem seem arbitrary or disconnected from the data.

800
MCQeasy

A data analyst needs to present the distribution of customer ages to a non-technical audience. Which type of chart would be most appropriate?

A.Scatter plot
B.Histogram
C.Pie chart
D.Line chart
AnswerB

Histograms show distribution of continuous data.

Why this answer

A histogram is the most appropriate chart for displaying the distribution of a single continuous variable, such as customer ages, to a non-technical audience. It groups ages into bins and shows the frequency of customers within each bin, making the shape, center, and spread of the distribution immediately visible. This aligns with the DA0-001 objective of selecting visualizations that clearly communicate data insights to stakeholders.

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, thinking a bar chart can show distribution, but a bar chart is for categorical data, not continuous numeric data grouped into bins.

How to eliminate wrong answers

Option A is wrong because a scatter plot is used to show the relationship between two continuous variables, not the distribution of a single variable like age. Option C is wrong because a pie chart is designed for parts-of-a-whole categorical data, not for displaying the frequency distribution of a continuous numeric variable. Option D is wrong because a line chart is typically used to show trends over time or ordered categories, not the distribution of a single continuous variable.

801
MCQhard

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?

A.Class imbalance
B.Non‑linear decision boundary
C.Multicollinearity among predictor variables
D.Overfitting due to too many features
AnswerA

With only 500 churners out of 10,000, the model predicts most as non-churners, achieving high accuracy but low precision for the minority class.

Why this answer

The dataset has only 500 churners out of 10,000 records (5% churn rate), which is a classic class imbalance. Logistic regression can achieve high accuracy by simply predicting the majority class (non-churn) for all records, yielding 95% accuracy even without learning anything about churn. The very low precision (15%) for the churn class indicates that most of the positive predictions are false positives, a direct consequence of the model being biased toward the majority class due to imbalance.

Exam trap

CompTIA often tests the misconception that high accuracy always means a good model, hiding the fact that with imbalanced data, accuracy is misleading and metrics like precision, recall, or F1-score for the minority class are critical.

How to eliminate wrong answers

Option B is wrong because logistic regression inherently models a linear decision boundary; while non-linear boundaries can be approximated with feature engineering (e.g., polynomial terms), the core issue here is class imbalance, not boundary shape. Option C is wrong because multicollinearity inflates coefficient standard errors but does not cause the extreme precision drop seen here; it affects interpretability, not the fundamental accuracy-imbalance trade-off. Option D is wrong because overfitting would typically yield high training accuracy but poor generalization, not a specific low precision for the minority class while maintaining high overall accuracy; the model is actually underfitting the minority class.
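The 95% baseline mentioned in the explanation can be checked with a few lines of arithmetic. This sketch scores a trivial model that predicts "non-churn" for every record, using the class counts from the question; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, and recall from confusion-matrix counts.

    Precision or recall is None when its denominator is zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Always predicting non-churn: all 500 churners are missed (false negatives)
# and all 9,500 non-churners are correct (true negatives).
baseline = classification_metrics(tp=0, fp=0, fn=500, tn=9500)
```

The baseline reaches 95% accuracy while catching no churners at all, which is why accuracy alone says little on imbalanced data.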

802
MCQmedium

A data analyst is creating a dashboard and wants to maximize the data-ink ratio. Which action supports this principle?

A.Including detailed data tables alongside charts.
B.Removing gridlines and reducing chart borders.
C.Using 3D effects to make bars stand out.
D.Adding a background image to make the dashboard visually appealing.
AnswerB

This removes redundant ink and focuses on data.

Why this answer

Removing unnecessary gridlines and decorative elements reduces non-data ink, thereby increasing the data-ink ratio. The data-ink ratio is the proportion of ink used to display data versus total ink used in the chart.

803
MCQeasy

A retail analyst wants to visualize monthly sales over the past year to identify seasonal patterns. Which chart type is best?

A.Pie chart
B.Histogram
C.Line chart
D.Box plot
AnswerC

Line chart clearly shows trends and seasonal patterns over time.

Why this answer

A line chart is best for visualizing monthly sales over time because it explicitly shows trends and seasonal patterns across a continuous time axis. The sequential ordering of months allows the line to connect data points, making it easy to identify peaks, troughs, and recurring cycles. This aligns with the DA0-001 objective of selecting appropriate visualizations for time-series data.

Exam trap

The trap here is that candidates may pick a histogram because it also plots numeric values along an axis, but a histogram bins continuous data without regard to time order, while a line chart uses a sequential axis such as time to show trends.

How to eliminate wrong answers

Option A is wrong because a pie chart is designed to show parts of a whole at a single point in time, not trends across multiple time periods; it cannot convey seasonal patterns or month-over-month changes. Option B is wrong because a histogram displays the distribution of a continuous variable by binning values, but it does not preserve the temporal order of months, making it impossible to identify seasonal patterns. Option D is wrong because a box plot summarizes data distribution (median, quartiles, outliers) for one or more groups, but it lacks a time axis and cannot show the sequential progression of sales across months.

804
Multi-Selectmedium

An analyst wants to compare the average sales revenue across three different store locations. Which TWO statistical methods are appropriate for this comparison?

Select 2 answers
A.Two-sample t-test
B.ANOVA
C.Multiple regression
D.Descriptive statistics
E.Chi-square test
AnswersB, C

Correct: ANOVA compares means across three or more groups.

Why this answer

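ANOVA compares the means of three or more groups. Multiple regression with store location encoded as dummy variables tests the same differences in means, since one-way ANOVA is a special case of the linear model. A two-sample t-test compares only two groups at a time. The chi-square test assesses independence between categorical variables.

Descriptive statistics summarise the data but do not compare multiple groups inferentially.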
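As a rough illustration of what ANOVA computes, the sketch below derives the one-way F statistic by hand for three hypothetical stores. It stops at the F value: turning it into a p-value needs the F distribution, which the standard library does not provide.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic: between-group mean square divided by within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical weekly revenue (in thousands) for three store locations.
store_a = [10, 12, 14]
store_b = [20, 22, 24]
store_c = [30, 32, 34]

f_statistic = one_way_anova_f([store_a, store_b, store_c])
```

A large F value means the store means differ by much more than the spread within each store would explain.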

805
Multi-Selecteasy

Which TWO of the following are valid SQL clauses used to filter and sort data?

Select 2 answers
A.DELETE
B.WHERE
C.ORDER BY
D.UPDATE
E.INSERT
AnswersB, C

WHERE filters rows based on conditions.

Why this answer

The WHERE clause is used to filter rows based on specified conditions, while the ORDER BY clause sorts the result set in ascending or descending order. Both are standard SQL clauses for data filtering and sorting, respectively.

Exam trap

CompTIA often tests the distinction between SQL DML statements (DELETE, UPDATE, INSERT) and query clauses (WHERE, ORDER BY), trapping candidates who confuse data manipulation commands with data retrieval or sorting operations.
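The two clauses can be seen working together in a single query. The `sales` table and its rows are hypothetical, created in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
# Hypothetical rows.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("East", 120), ("West", 80), ("North", 200), ("South", 150)],
)

# WHERE filters rows first; ORDER BY then sorts what remains.
top_regions = conn.execute(
    """
    SELECT region, amount
    FROM sales
    WHERE amount > 100
    ORDER BY amount DESC
    """
).fetchall()
```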

806
Multi-Selecthard

Which THREE factors should be considered when choosing a chart type for a dataset?

Select 3 answers
A.The animation capabilities of the software
B.The data types (categorical, numerical, time series)
C.The number of variables to display
D.The key insight or message to convey
E.The color scheme of the company logo
AnswersB, C, D

Data type determines suitable chart types.

Why this answer

Option B is correct because the data type (categorical, numerical, time series) determines which chart types are semantically valid. For example, a line chart requires a continuous numerical or time-series axis, while a bar chart works with categorical data. Choosing a chart that mismatches the data type can misrepresent the underlying distribution or trends.

Exam trap

The trap here is that candidates often confuse aesthetic or software-specific features (like animation or branding) with the fundamental data characteristics that dictate chart appropriateness, leading them to select options that are about polish rather than analytical correctness.

807
MCQmedium

A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?

A.K-means clustering
B.K-nearest neighbors
C.Linear regression
D.Decision tree
AnswerA

K-means is an unsupervised clustering algorithm suitable for segmentation.

Why this answer

K-means clustering is the correct choice because it is an unsupervised learning algorithm that groups unlabeled data into clusters based on feature similarity. Since the marketing team has no prior labels for customer segments, K-means can partition customers by purchasing behavior patterns, such as frequency and monetary value, without needing predefined categories.

Exam trap

The trap here is that candidates often confuse unsupervised clustering (K-means) with supervised classification (K-nearest neighbors) because both involve 'K' and grouping, but KNN requires labeled data and predicts labels, while K-means discovers inherent structures without labels.

How to eliminate wrong answers

Option B is wrong because K-nearest neighbors is a supervised learning algorithm that requires labeled training data to classify or predict outcomes, making it unsuitable for unlabeled segmentation. Option C is wrong because linear regression is a supervised regression algorithm used to predict a continuous target variable, not to discover hidden groupings in unlabeled data. Option D is wrong because decision trees are typically used for supervised classification or regression tasks, relying on labeled data to split on features, and cannot perform unsupervised clustering without prior labels.
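
The assign-then-update loop behind K-means can be sketched in a few lines of plain Python. This is only an illustrative sketch: the customer features (orders per year, average spend), the choice of k = 2, and the function name kmeans are assumptions made for the example, and real work would normally scale the features first and use a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Hypothetical unlabeled customers: (orders per year, average spend)
customers = [(2, 20.0), (3, 25.0), (1, 18.0), (40, 210.0), (38, 190.0), (42, 205.0)]
labels, centers = kmeans(customers, k=2)
print(labels)
```

No labels are supplied anywhere; the two groups emerge from feature similarity alone, which is what separates this from K-nearest neighbors.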

808
MCQmedium

A data analyst needs to combine customer information from a CRM table and order information from an orders table, returning only customers who have placed at least one order. Which type of join should the analyst use?

A.RIGHT JOIN
B.INNER JOIN
C.FULL OUTER JOIN
D.LEFT JOIN
AnswerB

INNER JOIN returns only rows where the join condition is met, i.e., customers with orders.

Why this answer

An INNER JOIN between the CRM table and the orders table returns only rows where there is a match in both tables based on the join key (e.g., customer ID). This satisfies the requirement to return only customers who have placed at least one order, because any customer without an order in the orders table will be excluded from the result set.

Exam trap

CompTIA often tests the misconception that a LEFT JOIN will include only customers with orders because it 'joins' the tables, but the trap is that a LEFT JOIN preserves all rows from the left table, including those with no matches, so it does not filter out customers without orders.

How to eliminate wrong answers

Option A (RIGHT JOIN) is wrong because it returns all rows from the orders table and matching rows from the CRM table, which could include orders without a matching customer (if referential integrity is not enforced) and would not limit results to only customers with orders. Option C (FULL OUTER JOIN) is wrong because it returns all rows from both tables, including customers without orders and orders without customers, which violates the requirement to return only customers who have placed at least one order. Option D (LEFT JOIN) is wrong because it returns all rows from the CRM table and matching rows from the orders table, which would include customers with zero orders (where the orders columns are NULL), failing to filter out customers without orders.
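
The difference between INNER JOIN and LEFT JOIN can be checked with a small in-memory SQLite database. The table names, column names, and rows below are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Chloe');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 3, 75.0);
""")

# INNER JOIN keeps only customers that match at least one order row.
inner_rows = conn.execute("""
    SELECT DISTINCT c.name
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.name
""").fetchall()

# LEFT JOIN keeps every customer; those without orders get a count of 0.
left_rows = conn.execute("""
    SELECT c.name, COUNT(o.order_id)
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
    ORDER BY c.name
""").fetchall()

print(inner_rows)  # [('Ana',), ('Chloe',)]
print(left_rows)   # [('Ana', 2), ('Ben', 0), ('Chloe', 1)]
```

Ben has no orders, so he appears only in the LEFT JOIN result.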

809
MCQhard

A data scientist creates a box plot of employee salaries and notices many outliers above the upper whisker. What action should be taken to best understand the salary distribution?

A.Replace the box plot with a histogram of the salaries
B.Remove all outliers to create a more typical box plot
C.Trim the top 5% of salaries and recreate the box plot
D.Investigate the outliers to determine if they are data entry errors or valid extremes
AnswerD

Outliers should be investigated to ensure data quality and contextual understanding.

Why this answer

Option D is correct because investigating the outliers reveals whether they are legitimate high salaries or data entry errors, which determines how they should be handled. Option B is wrong because removing outliers without cause can bias results. Option C is wrong because trimming the top 5% changes the distribution instead of explaining it.

Option A is wrong because a histogram may not highlight individual outliers as clearly as a box plot does.
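
One way to start such an investigation is to list the values that fall above the box plot's upper whisker, using the common convention of Q3 + 1.5 × IQR. The salary figures below are made up for illustration, and the exact fence depends on the quantile method used.

```python
import statistics

# Hypothetical salaries; the last two sit far above the rest.
salaries = [42000, 45000, 47000, 50000, 52000, 55000, 58000, 61000, 250000, 5200000]

q1, _, q3 = statistics.quantiles(salaries, n=4)  # quartile cut points
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # conventional upper whisker limit

# Flag values for review instead of deleting them.
flagged = [s for s in salaries if s > upper_fence]
print(flagged)  # [250000, 5200000]
```

The flagged values are then checked against source records: 250000 may be a valid executive salary, while 5200000 may be a keying error with an extra digit.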

810
MCQhard

A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?

A.Ridge regression
B.Forward selection
C.Principal Component Analysis (PCA)
D.Lasso regression
AnswerC

PCA reduces dimensionality by creating new features that capture maximum variance.

Why this answer

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms the original features into a set of orthogonal components, ordered by the variance they capture. Given 1000 features and only 500 samples, PCA is ideal because it reduces the feature space while retaining the maximum variance, helping to avoid overfitting and the curse of dimensionality.

Exam trap

CompTIA often tests the distinction between supervised feature selection (Lasso, Forward selection) and unsupervised dimensionality reduction (PCA), trapping candidates who confuse regularization with variance-based reduction.

How to eliminate wrong answers

Option A is wrong because Ridge regression is a regularization technique that shrinks coefficients but does not reduce the number of features; it retains all features with penalized weights. Option B is wrong because Forward selection is a supervised feature selection method that selects features based on their predictive power, not on variance retention, and it can be computationally expensive with 1000 features. Option D is wrong because Lasso regression performs feature selection by shrinking some coefficients to zero, but it is a supervised method that selects features based on target correlation, not on maximizing variance retention, and may not be optimal for unsupervised dimensionality reduction.
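
A rough sketch of PCA on data of the same shape (500 samples, 1000 features) is shown below, assuming NumPy is available. The synthetic data, the five hidden factors, and the 95% variance threshold are assumptions made for the example, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 1000 features driven by 5 hidden factors plus a little noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 1000))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 1000))

# PCA via SVD of the mean-centered data; rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Smallest number of components whose cumulative variance reaches 95%.
n_components = int(np.searchsorted(np.cumsum(explained_ratio), 0.95) + 1)
X_reduced = X_centered @ Vt[:n_components].T

print(X.shape, "->", X_reduced.shape)
```

The target variable is never used, which is why PCA counts as unsupervised, unlike Lasso or forward selection.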

811
MCQmedium

The exhibit shows a JSON configuration for a line chart that displays average response time over time. However, the chart shows jagged lines with many spikes. Which configuration change would smooth the visualization?

A.Change "aggregation" to "average"
B.Change the query to include a moving average
C.Change "interpolation" to "monotone"
D.Change chartType to "bar"
AnswerC

Monotone interpolation produces a smoother curve through points.

Why this answer

Setting interpolation to 'monotone' or 'smooth' draws a smoother curve through the same data points, which reduces jaggedness. The aggregation is already applied in the query, so changing it to 'average' in the widget would double-aggregate. Changing the chart type (Option D) or rewriting the query (Option B) is not necessary.

812
MCQeasy

A company must comply with GDPR when reporting customer data. What must be included in the report?

A.Customer names and emails
B.The source of the data
C.A notice about data processing purposes
D.Data retention period
AnswerC

GDPR mandates that data subjects be informed of the purposes of processing, especially when data is shared.

Why this answer

Under GDPR, when reporting customer data, the controller must provide a notice about the purposes of data processing to ensure transparency and lawful basis for processing. This is required by Article 13 of the GDPR, which mandates that data subjects be informed of the specific purposes for which their personal data is being processed. Including this notice in the report demonstrates compliance with the accountability principle.

Exam trap

The trap here is that candidates often confuse operational data elements (like names, source, or retention) with the mandatory transparency obligations under GDPR, assuming that including any data field satisfies compliance, whereas the core requirement is the notice about processing purposes.

How to eliminate wrong answers

Option A is wrong because customer names and emails are personal data that may be included in the report, but GDPR does not require their inclusion; in fact, pseudonymization or anonymization is often preferred to minimize data exposure. Option B is wrong because while the source of data may be relevant for transparency, GDPR does not mandate including the data source in every report; the focus is on processing purposes and lawful basis. Option D is wrong because the data retention period is a policy requirement under GDPR Article 5(1)(e), but it is not a mandatory element to be included in a customer data report; it is typically documented in the privacy notice or data retention policy, not the report itself.

813
Multi-Selecthard

A data analyst is using Python pandas to perform exploratory data analysis. Which THREE methods are commonly used to assess data quality and distributions?

Select 3 answers
A.df.transpose()
B.df.describe()
C.df.info()
D.df.sort_values()
E.df.value_counts()
AnswersB, C, E

df.describe() provides summary statistics for numerical columns.

Why this answer

describe() gives summary statistics, info() shows data types and non-null counts, and value_counts() shows frequency distributions.
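
The three calls can be tried on a small DataFrame, assuming pandas is installed. The column names and values are invented for illustration and include a missing value and an extreme value on purpose.

```python
import pandas as pd

# Hypothetical data with one missing region, one missing count, and one extreme count.
df = pd.DataFrame({
    "region": ["East", "West", "East", None, "West", "East"],
    "units": [10, 12, None, 8, 300, 11],
})

summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
df.info()                # prints dtypes and non-null counts per column
region_counts = df["region"].value_counts(dropna=False)  # frequency of each category

print(summary)
print(region_counts)
```

Here the count row of describe() and the non-null counts from info() expose the missing values, while the max of 300 against a median near 11 hints at an outlier worth checking.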

814
Multi-Selectmedium

A data analyst is troubleshooting a dashboard that displays slowly. Which two of the following are common causes of dashboard performance issues? (Select TWO.)

Select 2 answers
A.Minimal use of filters.
B.Use of summary tables.
C.Large unaggregated datasets.
D.Complex calculated fields with many dependencies.
E.Excessive use of live data connections.
AnswersC, D

Correct. Large unaggregated datasets increase load on the dashboard.

Why this answer

Option C is correct because large unaggregated datasets force the dashboard to process every row of raw data at query time, which dramatically increases load times and memory consumption. Aggregated or pre-summarized tables reduce the volume of data scanned, improving performance.

Exam trap

The trap here is that candidates often confuse 'summary tables' (which improve performance) with 'large unaggregated datasets' (which degrade it), and they may overlook that complex calculated fields with dependencies are a hidden performance killer.

815
Multi-Selecteasy

A data analyst wants to export a summary report from a DataFrame in pandas. Which THREE methods are commonly used for data export?

Select 3 answers
A.to_csv()
B.to_pickle()
C.to_excel()
D.to_sql()
E.to_json()
AnswersA, C, E

to_csv() exports the DataFrame to a CSV file.

Why this answer

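pandas provides to_csv(), to_excel(), and to_json() for exporting data to the file formats most commonly used for sharing reports. to_pickle() and to_sql() also exist in pandas, but they serialize to a Python-specific binary format and write to a database table, respectively, so they are not the expected answers for a shareable summary report.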
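
A short sketch of two of these exports follows, assuming pandas is installed. The DataFrame contents are hypothetical. Both to_csv() and to_json() return a string when no file path is passed, which keeps the example free of file I/O; to_excel() is left as a comment because it needs an Excel engine such as openpyxl.

```python
import json

import pandas as pd

# Hypothetical summary report.
summary = pd.DataFrame({"region": ["East", "West"], "total_sales": [1250.0, 980.5]})

csv_text = summary.to_csv(index=False)         # CSV as a string (no path given)
json_text = summary.to_json(orient="records")  # list of row objects as JSON

# summary.to_excel("summary.xlsx", index=False)  # requires an engine such as openpyxl

records = json.loads(json_text)
print(csv_text)
print(records)
```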

816
MCQmedium

A retail company generates a daily PDF report showing the previous day's sales by region and product category. The report is automatically emailed to store managers at 6:00 AM. Which type of report is this?

A.Scheduled report
B.Self-service report
C.Ad hoc report
D.Operational report
AnswerA

This is correct because the report is automated and delivered regularly.

Why this answer

Scheduled reports are automated and delivered on a regular cadence. This daily sales report is scheduled and automatically delivered.

817
MCQmedium

A logistics company has data on delivery times (continuous) and distance traveled (continuous). They want to visualize the relationship between these two variables. Which chart type is most appropriate?

A.Histogram
B.Bar chart
C.Line chart
D.Scatter plot
AnswerD

Scatter plots are ideal for visualizing the correlation between two continuous variables.

Why this answer

Option D is correct because scatter plots show the relationship between two continuous variables. Option A (histogram) shows the frequency distribution of a single variable. Option B (bar chart) is for categorical comparisons.

Option C (line chart) is for trends over time.

818
Multi-Selecthard

A data team is creating a report for the board of directors. The board wants to see both the big picture and the ability to explore details for specific regions. Which TWO approaches should the team implement?

Select 2 answers
A.Use a single summary metric for all regions.
B.Include an interactive dashboard with filters by region.
C.Embed a table of data for each region in an appendix.
D.Provide a static executive summary with no drill-down.
E.Create a hierarchical report with linked pages.
AnswersB, E

Interactive dashboards allow users to filter and drill into specific regions on demand.

Why this answer

Option B is correct because interactive dashboards with filters allow the board to see high-level summaries while drilling into region-specific details on demand. This approach aligns with modern BI tools like Tableau or Power BI, which support dynamic filtering and cross-filtering without requiring separate report pages.

Exam trap

CompTIA often tests the distinction between static data presentation (tables, summaries) and interactive exploration features (filters, drill-through), trapping candidates who choose static options like an appendix or single metric when the question explicitly asks for both big-picture and detail exploration.

819
MCQhard

A data analyst is presenting a weekly sales report to a diverse audience that includes sales managers, marketing team, and finance. The analyst wants to ensure the data is understood correctly. Which communication strategy is most effective?

A.Use a single complex chart to show all data at once.
B.Create a slide deck with key insights and contextual explanations.
C.Email the raw data table to all stakeholders.
D.Use technical terms such as 'variance' and 'standard deviation'.
AnswerB

Contextual explanations cater to different roles and improve understanding.

Why this answer

Option B is correct because a slide deck with key insights and contextual explanations tailors the presentation to a diverse audience, ensuring that sales managers, marketing, and finance can each grasp the relevant data without being overwhelmed. This approach aligns with best practices for data communication, where visual summaries and narrative context improve comprehension over raw data or overly technical jargon.

Exam trap

The trap here is that candidates may overvalue technical precision (Option D) or data completeness (Option C), forgetting that the primary goal is audience comprehension, not data exhaustiveness.

How to eliminate wrong answers

Option A is wrong because a single complex chart (e.g., a multi-series radar or stacked area chart) can obscure individual trends and overload viewers, leading to misinterpretation across roles with different analytical backgrounds. Option C is wrong because emailing raw data tables forces stakeholders to derive insights themselves, risking errors and inefficiency, especially for non-technical teams like marketing. Option D is wrong because using technical terms like 'variance' and 'standard deviation' without explanation alienates audience members who lack statistical training, violating the principle of audience-appropriate communication.

820
MCQhard

A financial analyst is preparing a report that must comply with the Sarbanes-Oxley (SOX) Act. What is the most critical requirement for this report?

A.Row-level security
B.Data anonymization
C.Data dictionary
D.Audit trail
AnswerD

SOX mandates audit trails for financial reporting to ensure accountability.

Why this answer

SOX requires audit trails to ensure the integrity and traceability of financial data for compliance.

821
MCQhard

A data analyst is presenting a recommendation to reduce inventory costs. The evidence shows that overstocking occurs in 30% of warehouses. Which of the following best structures the recommendation?

A."Implement a just-in-time system."
B."Our inventory costs are too high. We need to fix this."
C."Overstocking occurs in 30% of warehouses. This indicates poor demand forecasting. We recommend implementing a just-in-time system to reduce inventory holding costs by 15%."
D."We should reduce inventory by 20%. This will save costs."
AnswerC

Follows evidence → insight → recommendation → expected impact.

Why this answer

A proper data-driven recommendation follows the sequence: evidence → insight → recommendation → expected impact.

822
MCQhard

A database table has columns: OrderID (primary key), ProductID, CustomerID, CustomerName, OrderDate, ProductName. All products are purchased only by the customer who placed the order. Which normal form violation exists if CustomerName depends on CustomerID?

A.Boyce-Codd normal form (BCNF)
B.Third normal form (3NF)
C.Second normal form (2NF)
D.First normal form (1NF)
AnswerB

CustomerName depends on CustomerID, which is not a candidate key, creating a transitive dependency and violating 3NF.

Why this answer

The table violates Third Normal Form (3NF) because CustomerName depends on CustomerID, which is not a candidate key (the primary key is OrderID). 3NF requires that every non-key attribute be non-transitively dependent on the primary key; here, CustomerName is transitively dependent on OrderID via CustomerID. Since CustomerID is a non-key attribute (it is not part of the primary key), this transitive dependency breaks 3NF.

Exam trap

The trap here is that candidates often confuse transitive dependencies (3NF violation) with partial dependencies (2NF violation) or think that any dependency on a non-key attribute automatically violates BCNF, but the specific scenario of CustomerName depending on CustomerID is a textbook transitive dependency that breaks 3NF first.

How to eliminate wrong answers

Option A is wrong because Boyce-Codd Normal Form (BCNF) is a stricter version of 3NF that requires every determinant to be a candidate key; while this table also violates BCNF, the question asks which normal form violation exists, and the dependency described is a classic 3NF violation (transitive dependency), not a BCNF-specific one. Option C is wrong because Second Normal Form (2NF) is violated only when a non-key attribute depends on a proper subset of a composite primary key; here the primary key is a single column (OrderID), so no partial dependency exists, and 2NF is satisfied. Option D is wrong because First Normal Form (1NF) is violated only if there are repeating groups or non-atomic values; the table as described has atomic columns and no repeating groups, so 1NF is satisfied.
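
Removing the transitive dependency means moving CustomerName into its own table keyed by CustomerID. The sketch below uses an in-memory SQLite database; the sample rows are hypothetical, and ProductID and ProductName are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Customer attributes are stored once, keyed by CustomerID.
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL
    );
    -- Orders keeps only the foreign key, not the customer's name.
    CREATE TABLE Orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
        OrderDate TEXT NOT NULL
    );
    INSERT INTO Customers VALUES (1, 'Acme Ltd');
    INSERT INTO Orders VALUES (100, 1, '2024-01-05'), (101, 1, '2024-02-09');
""")

# Renaming the customer now touches exactly one row.
conn.execute("UPDATE Customers SET CustomerName = 'Acme Limited' WHERE CustomerID = 1")

rows = conn.execute("""
    SELECT o.OrderID, c.CustomerName
    FROM Orders AS o
    JOIN Customers AS c ON c.CustomerID = o.CustomerID
    ORDER BY o.OrderID
""").fetchall()
print(rows)  # [(100, 'Acme Limited'), (101, 'Acme Limited')]
```

In the original single-table design the same rename would have to be repeated on every order row, which is the update anomaly that 3NF is meant to prevent.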

823
MCQmedium

A data team is building a predictive model. They have data on 'Number of employees' (whole numbers) and 'Revenue' (currency). Which statement correctly compares these data types?

A.Number of employees is discrete; revenue is continuous
B.Both are continuous data
C.Both are ratio data
D.Number of employees is qualitative; revenue is quantitative
AnswerA

Employees are counted in whole units (discrete), while revenue can have fractional values (continuous).

Why this answer

Option A is correct because 'Number of employees' is a count of distinct entities, making it discrete data (only whole numbers), while 'Revenue' can take any value within a range (including decimals), making it continuous data. Discrete data arises from counting, whereas continuous data arises from measurement.

Exam trap

The trap here is that candidates confuse the measurement scale (ratio) with the data type (discrete vs. continuous), leading them to pick option C even though the question specifically asks about data type classification.

How to eliminate wrong answers

Option B is wrong because 'Number of employees' is not continuous; it is discrete as it can only take integer values (e.g., 10, 11, not 10.5). Option C is wrong because while both are ratio data (they have a true zero point), the question asks about data types (discrete vs. continuous), not measurement scales. Option D is wrong because both 'Number of employees' and 'Revenue' are quantitative (numerical) data, not qualitative (categorical).

824
MCQhard

A data modeler is designing a dimensional model for a sales analytics system. The fact table contains sales transactions, and the dimension tables include product, customer, and time. To reduce data redundancy, the modeler normalizes the dimension tables into multiple related tables. Which schema is being implemented?

A.Vault schema
B.Star schema
C.Galaxy schema
D.Snowflake schema
AnswerD

Snowflake schema normalizes dimension tables to reduce redundancy.

Why this answer

The snowflake schema is a dimensional model where dimension tables are normalized into multiple related tables to reduce data redundancy. In this scenario, the product, customer, and time dimensions are split into sub-dimensions (e.g., product category, customer geography, time hierarchy), which is the defining characteristic of a snowflake schema. This contrasts with a star schema where dimensions remain denormalized.

Exam trap

CompTIA often tests the distinction between star and snowflake schemas by emphasizing normalization of dimensions; the trap here is that candidates may confuse 'normalized dimensions' with a star schema, which actually uses denormalized dimensions for simplicity and performance.

How to eliminate wrong answers

Option A is wrong because a vault schema (Data Vault) is a hybrid modeling approach focused on auditability and flexibility using hubs, links, and satellites, not on normalizing dimension tables for a sales analytics fact table. Option B is wrong because a star schema keeps dimension tables denormalized (single table per dimension) to optimize query performance, which directly contradicts the normalization described in the question. Option C is wrong because a galaxy schema (also called a fact constellation) contains multiple fact tables sharing dimension tables, not the normalization of a single fact table’s dimensions.
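
A minimal snowflake layout can be sketched with SQLite, where the product dimension is normalized into a separate category table. All table names, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Sub-dimension split out of the product dimension (the snowflake step).
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES (10, 'Router', 1), (11, 'Switch', 1), (12, 'Antivirus', 2);
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (2, 11, 50.0), (3, 12, 30.0), (4, 10, 100.0);
""")

# Reaching the category name takes two joins instead of one.
sales_by_category = conn.execute("""
    SELECT cat.category_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS cat ON cat.category_id = p.category_id
    GROUP BY cat.category_name
    ORDER BY cat.category_name
""").fetchall()
print(sales_by_category)  # [('Hardware', 250.0), ('Software', 30.0)]
```

In a star schema, category_name would sit directly in dim_product, trading some redundancy for one fewer join.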

825
MCQeasy

The exhibit shows log entries. A data analyst wants to visualize the frequency of each error type over time. Which chart type is most appropriate?

A.Time series line chart
B.Bar chart of error types
C.Scatter plot of timestamp vs error code
D.Pie chart of error types
AnswerA

Shows frequency trend over time.

Why this answer

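A time series line chart plots the count of events per time interval, with one line per error type, so it is best for showing frequency over time. A bar chart of error types shows total counts but is static with respect to time. A pie chart of error types shows proportions and ignores time.

A scatter plot of timestamp versus error code plots individual events rather than frequencies, so it is not suitable.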
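
The data behind such a chart is a count of events per time bucket for each error type. The exhibit is not reproduced here, so the log entries, their timestamp format, and the hourly bucket size below are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log entries: (timestamp, error type)
log_entries = [
    ("2024-03-01 09:05:12", "TIMEOUT"),
    ("2024-03-01 09:40:03", "TIMEOUT"),
    ("2024-03-01 09:55:47", "AUTH_FAILED"),
    ("2024-03-01 10:10:30", "TIMEOUT"),
    ("2024-03-01 10:15:00", "AUTH_FAILED"),
    ("2024-03-01 10:59:59", "AUTH_FAILED"),
]

# Count events per (hour bucket, error type).
counts = Counter()
for timestamp, error_type in log_entries:
    hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:00")
    counts[(hour, error_type)] += 1

# Reshape into one time-ordered series per error type: one line on the chart each.
series = {}
for (hour, error_type), n in sorted(counts.items()):
    series.setdefault(error_type, []).append((hour, n))

print(series)
```

Each key in series becomes one line on the chart, with the hour on the x-axis and the count on the y-axis.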
