CompTIA Data+ (DA0-001): Questions 601-675

982 questions total · 14 pages · All types, answers revealed

Page 9 of 14

601
MCQmedium

A finance department wants to compare actual expenses against budgeted amounts for each of 10 departments over the past 12 months. They need to show the variance (over/under budget) as well as the trend of expenses over time. The data includes monthly actual and budget figures for each department. The audience is the CFO who needs a quick overview. Which visualization approach should be used?

A.A dual-axis chart with bars for actual and budget by month for each department, and a line for variance
B.A waterfall chart showing cumulative variance over the year
C.A table with conditional formatting to highlight over-budget cells
D.Stacked bar chart with one bar per month, segments showing each department's actual and budget
AnswerA

Dual-axis allows direct comparison of actual vs budget over time, and the line highlights variance trend.

Why this answer

Option A is correct because a dual-axis chart (bars for budget vs. actual per month, line for variance) shows both comparison and trend clearly. Option B is wrong because a waterfall chart shows cumulative changes, not month-by-month trends. Option C is wrong because a table does not give a quick visual overview.

Option D is wrong because stacking every department's actual and budget figures into one bar per month obscures the department-level comparison and the trend.

602
Multi-Selectmedium

A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)

Select 2 answers
A.K-means clustering
B.Linear regression
C.Logistic regression
D.Decision trees
E.Apriori algorithm
AnswersC, D

Logistic regression models binary outcomes and is appropriate for classification.

Why this answer

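Logistic regression is appropriate because it models the probability of a binary outcome (churn yes/no) using a logistic function, making it a standard choice for binary classification tasks. It outputs a value between 0 and 1, which can be thresholded to predict the class label. Decision trees are also appropriate because they split the data on feature values until each leaf assigns a class label, so they handle a yes/no target directly.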
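The thresholding step can be sketched in a few lines. This is a minimal illustration only: the feature name `months_inactive`, the coefficients, and the 0.5 cutoff are made-up values, not fitted to any real churn data.

```python
import math

def churn_probability(months_inactive: float,
                      intercept: float = -2.0,
                      slope: float = 0.8) -> float:
    """Logistic function: maps a linear score to a probability in (0, 1).

    The intercept and slope are hypothetical, chosen only for illustration.
    """
    score = intercept + slope * months_inactive
    return 1.0 / (1.0 + math.exp(-score))

def predict_churn(months_inactive: float, threshold: float = 0.5) -> str:
    """Turn the probability into a yes/no class label."""
    return "yes" if churn_probability(months_inactive) >= threshold else "no"

print(predict_churn(0))  # low score, probability below the threshold
print(predict_churn(5))  # high score, probability above the threshold
```

Linear regression has no such bounded output, which is why it is not suited to a binary target without a transformation of this kind.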

Exam trap

The trap here is that candidates may confuse unsupervised clustering (K-means) or association rule mining (Apriori) with supervised classification, or mistakenly think linear regression can be adapted for binary outcomes without transformation.

603
MCQmedium

A company wants to collect real-time clickstream data from its website. Which acquisition method is most suitable?

A.Streaming API
B.Web scraping
C.Batch processing nightly
D.Manual entry
AnswerA

Enables continuous, low-latency data ingestion.

Why this answer

A streaming API is the most suitable method for collecting real-time clickstream data because it enables continuous, low-latency ingestion of events as they occur. Unlike batch or manual methods, a streaming API (e.g., using WebSockets or Server-Sent Events) pushes each click event immediately to the data pipeline, satisfying the real-time requirement.

Exam trap

CompTIA often tests the distinction between 'real-time' and 'near-real-time' or 'batch' methods, and the trap here is that candidates may confuse web scraping (which can be automated frequently) with true streaming, not realizing that scraping is still a pull-based, scheduled operation that cannot match the push-based immediacy of a streaming API.

How to eliminate wrong answers

Option B (Web scraping) is wrong because it is a pull-based technique that typically retrieves static HTML pages at intervals, not real-time event streams, and is inefficient for high-frequency click data. Option C (Batch processing nightly) is wrong because it introduces a delay of up to 24 hours, failing the real-time requirement. Option D (Manual entry) is wrong because it is error-prone, non-scalable, and cannot capture high-velocity clickstream data in real time.

604
MCQmedium

In a customer database, each row represents a customer with columns: CustomerID, Name, Address, Phone. What does the column "Name" represent?

A.Instance
B.Entity
C.Attribute
D.Record
AnswerC

Correct. Name is an attribute of the customer entity.

Why this answer

In the context of a relational database, a column represents an attribute of an entity. The 'Name' column stores a specific characteristic (the customer's name) for each row, making it an attribute. This aligns with the data modeling concept where attributes define the properties of an entity.

Exam trap

The trap here is that candidates confuse 'attribute' with 'record' because they think of a row as containing all attributes, but the question specifically asks what a single column represents, not the row itself.

How to eliminate wrong answers

Option A is wrong because an instance refers to a single occurrence of an entity (e.g., a specific customer row), not a column. Option B is wrong because an entity is a table-level concept representing a real-world object (e.g., the Customer table), not a column within it. Option D is wrong because a record is a row in the table, which contains values for all attributes, not a single column like 'Name'.

605
MCQeasy

Refer to the exhibit. A data pipeline is failing to parse this log entry. What is the most likely cause of the error?

A.Missing comma between fields
B.Incorrect data type for age
C.Extra whitespace
D.Unquoted string for country
AnswerA

There should be a comma after '30' before the next key.

Why this answer

The log entry shows fields separated by spaces, but the country value 'United States' contains a space, which breaks the expected delimiter-based parsing. Without a comma or other explicit delimiter between fields, the parser cannot distinguish where the age field ends and the country field begins, causing a parsing failure. A comma-separated format (CSV) would require a comma between each field to avoid ambiguity.

Exam trap

CompTIA often tests the concept that missing or inconsistent delimiters cause parsing failures, and the trap here is that candidates may focus on the unquoted string or extra whitespace rather than recognizing that the fundamental issue is the absence of a required comma between fields.

How to eliminate wrong answers

Option B is wrong because the age field '30' is a valid integer and would parse correctly if the fields were properly delimited; the error is not due to data type mismatch. Option C is wrong because extra whitespace is not the issue—the spaces are part of the intended delimiter or the country value, and the parser is failing due to the lack of a comma, not due to excessive whitespace. Option D is wrong because the country 'United States' is not unquoted in a way that causes the error; the core problem is the missing comma between fields, not the lack of quotes around the string.

606
Multi-Selectmedium

A data governance team is establishing policies for data quality. Which THREE of the following are common dimensions of data quality? (Select 3)

Select 3 answers
A.Consistency
B.Completeness
C.Accuracy
D.Velocity
E.Volume
AnswersA, B, C

Data is uniform across systems.

Why this answer

Consistency is a common dimension of data quality because it ensures that data values are uniform across different datasets or systems, preventing contradictions. For example, if a customer's address is stored as '123 Main St' in one database and '123 Main Street' in another, consistency rules would flag this discrepancy. This dimension is critical for reliable reporting and integration.

Exam trap

The trap here is that candidates confuse the characteristics of big data (velocity, volume, variety) with the dimensions of data quality, leading them to select velocity or volume instead of the correct quality-focused options.

607
MCQhard

A financial services company is migrating its customer data from a legacy on-premises relational database to a cloud-based data warehouse. The legacy database uses a denormalized schema with a single table 'customer_master' that contains all customer attributes, including repeated groups for multiple accounts per customer (account1_type, account1_balance, account2_type, account2_balance, etc.). The data warehouse team wants to implement a normalized star schema with separate dimension and fact tables. During the ETL process, the team encounters an error: 'Data truncation: string data right truncation' when loading account_type values into the dim_account table. The account_type column in dim_account is defined as VARCHAR(10), but the source data contains account types like 'SavingsPlus' (11 characters) and 'CheckingPremium' (15 characters). The team must resolve this issue without losing data. Which course of action should the team take?

A.Truncate the account_type values to 10 characters during ETL.
B.Change the data type of dim_account.account_type to TEXT.
C.Ignore the error and continue loading with NULL values for truncated rows.
D.Increase the VARCHAR length of dim_account.account_type to accommodate the longest account type.
AnswerD

This resolves truncation without data loss.

Why this answer

Option D is correct because increasing the VARCHAR length of dim_account.account_type to accommodate the longest account type (e.g., VARCHAR(15) for 'CheckingPremium') resolves the data truncation error without data loss. This aligns with the star schema design principle of preserving source data integrity while ensuring the column definition matches the actual data length. The team must avoid truncation or NULL insertion to maintain accurate dimensional attributes for analytics.
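Before resizing the column, the team would typically profile the source values to find the longest one. A minimal sketch, assuming the values have been pulled into a Python list; 'Savings' and 'Checking' are hypothetical sample values, and only 'SavingsPlus' and 'CheckingPremium' come from the question.

```python
# Hypothetical sample of source account types; only the last two
# values are quoted in the question.
account_types = ["Savings", "Checking", "SavingsPlus", "CheckingPremium"]

current_limit = 10  # dim_account.account_type is defined as VARCHAR(10)

# Values that would trigger the truncation error at the current length.
too_long = [value for value in account_types if len(value) > current_limit]

# Smallest VARCHAR length that fits every value in the sample.
required_length = max(len(value) for value in account_types)

print(too_long)
print(required_length)
```

With this sample the required length is 15, matching the VARCHAR(15) mentioned above; in practice the team would profile the full source column rather than a sample.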

Exam trap

The trap here is that candidates may choose truncation (Option A) or NULL insertion (Option C) as quick fixes, overlooking the requirement to preserve data integrity, or mistakenly think TEXT (Option B) is a safe catch-all without considering performance implications in a data warehouse context.

How to eliminate wrong answers

Option A is wrong because truncating account_type values to 10 characters would lose data, violating the requirement to resolve the issue without data loss. Option B is wrong because changing the data type to TEXT is unnecessary: the values are short, and in many database systems a TEXT column carries extra overhead for indexing and querying compared with a suitably sized VARCHAR. Option C is wrong because ignoring the error and loading NULL values for the affected rows would discard valid account_type data and undermine the accuracy of downstream analytics.

608
Multi-Selecthard

A data analyst is building a narrative around a quarterly sales decline. The story should follow a narrative arc. Which THREE elements should be included in the story?

Select 3 answers
A.Random color scheme for every chart
B.Resolution: By diversifying suppliers, sales recovered in November.
C.Situation: Q3 sales were on track to meet targets.
D.Detailed description of each product's sales breakdown
E.Complication: A supply chain disruption caused a drop in October.
AnswersB, C, E

Resolution shows how the issue was addressed.

Why this answer

A narrative arc in data storytelling typically includes situation (context), complication (problem), and resolution (recommendation or action).

609
MCQmedium

A data analyst needs to compare sales data from the company's internal CRM with public demographic data from a government census. Which data concept best describes this scenario?

A.Internal vs. External data
B.Primary vs. Secondary data
C.Structured vs. Unstructured data
D.Quantitative vs. Qualitative data
AnswerA

CRM data is internal; census data is external, directly contrasting sources.

Why this answer

The scenario involves comparing internal CRM data (generated and owned by the company) with external government census data (publicly sourced from outside the organization). This directly maps to the Internal vs. External data concept, where internal data is collected within the enterprise (e.g., sales transactions, customer records) and external data is acquired from third-party sources (e.g., census bureaus, market research firms).

The key distinction is the data's origin and ownership, not its structure, collection method, or measurement type.

Exam trap

CompTIA often tests the Internal vs. External data concept by presenting a scenario where the key differentiator is the data's source (inside vs. outside the organization), tempting candidates to confuse it with Primary vs. Secondary data, which focuses on whether the data was collected firsthand or repurposed.

How to eliminate wrong answers

Option B (Primary vs. Secondary data) is wrong because both datasets could be primary (collected firsthand by the CRM or census) or secondary (repurposed from another source), but the question focuses on the origin relative to the organization, not the collection method. Option C (Structured vs. Unstructured data) is wrong because both CRM sales data and census demographic data are typically structured (e.g., tables with rows and columns), so the contrast is not about format but about source. Option D (Quantitative vs. Qualitative data) is wrong because both datasets contain quantitative values (e.g., sales figures, population counts) and possibly qualitative labels (e.g., region names), but the core distinction in the scenario is internal versus external sourcing, not measurement scale.

610
MCQeasy

Refer to the exhibit. The data shows sales by product category for two years. Which product category had the highest percentage increase from 2023 to 2024?

A.Clothing
B.Books
C.Electronics
D.Home & Garden
AnswerD

Home & Garden had the highest percentage increase, 50%.

Why this answer

Home & Garden had the highest percentage increase from 2023 to 2024 because its sales grew from $20,000 to $30,000, a 50% increase, which is greater than the percentage increases for Clothing (25%), Books (33.3%), and Electronics (40%). The percentage change is calculated as ((2024 value - 2023 value) / 2023 value) * 100, and Home & Garden yields the highest result.
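The percentage-change formula can be checked with a short script. This is a sketch using the sales figures quoted in the explanations for this question; the exhibit itself is not reproduced here, so the dictionary layout is an assumption.

```python
# (2023 sales, 2024 sales) per category, as quoted in the explanations.
sales = {
    "Clothing": (40_000, 50_000),
    "Books": (15_000, 20_000),
    "Electronics": (25_000, 35_000),
    "Home & Garden": (20_000, 30_000),
}

def pct_change(old: float, new: float) -> float:
    """((new - old) / old) * 100"""
    return (new - old) / old * 100

changes = {category: pct_change(old, new) for category, (old, new) in sales.items()}
best = max(changes, key=changes.get)

for category, change in changes.items():
    print(f"{category}: {change:.1f}%")
print("Highest percentage increase:", best)
```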

Exam trap

CompTIA often tests the distinction between absolute increase and percentage increase. Here Clothing, Electronics, and Home & Garden each grew by the same $10,000, so candidates who look at dollar amounts, or at the largest category (Clothing), can miss that the question asks for the highest percentage increase.

How to eliminate wrong answers

Option A is wrong because Clothing increased from $40,000 to $50,000, a 25% increase, which is lower than Home & Garden's 50%. Option B is wrong because Books increased from $15,000 to $20,000, a 33.3% increase, which is lower than Home & Garden's 50%. Option C is wrong because Electronics increased from $25,000 to $35,000, a 40% increase, which is lower than Home & Garden's 50%.

611
MCQmedium

An analyst needs to show the part-to-whole relationship of market share among four competitors. Which chart type is most appropriate, considering best practices?

A.Pie chart
B.Scatter plot
C.Histogram
D.Box plot
AnswerA

Pie charts effectively show proportions with a limited number of categories.

Why this answer

A pie or donut chart is suitable for part-to-whole with few categories (5-7 slices max). Here, four competitors fit well.

612
MCQmedium

A data analyst wants to ensure a sample proportionally represents different regions in a population. Which sampling method should be used?

A.Simple random sampling
B.Cluster sampling
C.Systematic sampling
D.Stratified sampling
AnswerD

Stratified sampling ensures proportional representation from each stratum.

Why this answer

Stratified sampling divides the population into strata (regions) and samples proportionally from each.
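A minimal sketch of proportional stratified sampling, assuming each record is a dictionary with a `region` field. The function name `stratified_sample`, the population, and the 10% sampling fraction are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum (group of rows sharing `key`)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 North, 30 South, 10 West customers.
population = (
    [{"id": i, "region": "North"} for i in range(60)]
    + [{"id": i, "region": "South"} for i in range(60, 90)]
    + [{"id": i, "region": "West"} for i in range(90, 100)]
)

sample = stratified_sample(population, "region", 0.1)
region_counts = Counter(row["region"] for row in sample)
print(region_counts)
```

The sample keeps the 60/30/10 regional mix of the population, which simple random sampling does not guarantee.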

613
MCQhard

In time series decomposition, a pattern that repeats at regular intervals (e.g., weekly, yearly) is called:

A.Cyclical
B.Irregular
C.Trend
D.Seasonality
AnswerD

Seasonality has fixed and known periods.

Why this answer

Seasonality refers to regular, periodic patterns in time series data.

614
MCQmedium

Refer to the exhibit. A data engineer is setting up data acquisition from an S3 bucket with this policy. What does the policy enforce?

A.Only objects with key 'data-bucket' can be accessed
B.Only GET requests are allowed
C.The bucket is publicly accessible
D.Access is restricted to a specific IP range
AnswerD

The condition limits access to IPs in the 10.0.0.0/8 range.

Why this answer

The S3 bucket policy includes a condition using `aws:SourceIp` with a CIDR block, which restricts access to requests originating from the specified IP range. This enforces network-level access control, allowing only clients within that IP range to perform the allowed actions (like GET requests) on the bucket. Option D correctly identifies this IP-based restriction.

Exam trap

CompTIA often tests the misconception that a Deny statement with a condition on IP addresses makes the bucket publicly accessible, when in fact it restricts access to only the specified IP range.

How to eliminate wrong answers

Option A is wrong because the policy does not restrict access based on object keys; the `arn:aws:s3:::data-bucket/*` in the Resource element specifies all objects in the bucket, not a specific key. Option B is wrong because the policy allows both GET and PUT requests (as shown in the Action list), not only GET requests. Option C is wrong because the policy explicitly denies access from IPs outside the specified range, making the bucket private to that IP range, not publicly accessible.

615
MCQmedium

A data analyst is cleaning a dataset and finds that some cells in the 'email' column contain leading spaces. Which string function should be used to remove these spaces?

A.TRIM
B.LTRIM
C.REPLACE
D.SUBSTRING
AnswerA

TRIM removes both leading and trailing spaces.

Why this answer

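TRIM removes both leading and trailing spaces from a string, so it clears the leading spaces in the 'email' column. LTRIM removes only leading spaces and would also handle this specific case; TRIM is the more general cleaning choice because it catches stray trailing spaces as well.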
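The same behavior can be seen with Python's string methods, used here as rough analogues of the SQL functions: `str.strip()` behaves like TRIM and `str.lstrip()` like LTRIM. The sample address is hypothetical.

```python
raw_email = "   user@example.com  "  # hypothetical value with stray spaces

left_trimmed = raw_email.lstrip()  # leading spaces removed only (like LTRIM)
trimmed = raw_email.strip()        # leading and trailing removed (like TRIM)

print(repr(left_trimmed))
print(repr(trimmed))
```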

616
MCQeasy

A data analyst creates a bar chart showing monthly sales for the current year. The CEO asks to see the same data but grouped by region. Which chart type is most appropriate for this new requirement?

A.Grouped bar chart
B.Pie chart
C.Stacked bar chart
D.Line chart
AnswerA

Grouped bar charts compare values across categories and subgroups.

Why this answer

A grouped bar chart is the most appropriate choice because it allows the CEO to compare monthly sales across different regions side by side within each month. This chart type effectively displays two categorical variables (month and region) and one quantitative variable (sales), making it easy to see both regional performance and monthly trends simultaneously.

Exam trap

CompTIA often tests the distinction between grouped and stacked bar charts, trapping candidates who think any multi-category bar chart is equivalent, when in fact grouped bars are for comparing individual category values and stacked bars are for comparing totals and proportions.

How to eliminate wrong answers

Option B (Pie chart) is wrong because pie charts are designed to show parts of a whole for a single categorical variable at a single point in time, not to compare multiple categories (regions) across a time series (months). Option C (Stacked bar chart) is wrong because while it can show regions within months, it emphasizes the total sales per month and the proportion each region contributes, rather than allowing direct comparison of individual region sales across months. Option D (Line chart) is wrong because line charts are best for showing continuous trends over time for one or more series, but they do not effectively compare discrete categories like regions within each month; a grouped bar chart provides clearer categorical comparison.

617
Multi-Selecthard

An analyst is using SQL to analyze employee data. Which THREE of the following are valid uses of the WHERE clause? (Select three.)

Select 3 answers
A.Sort the result set by hire_date
B.Filter groups after aggregation using HAVING
C.Filter rows where manager_id is NULL using IS NULL
D.Filter rows where the name starts with 'J' using LIKE
E.Filter rows where salary is between 50,000 and 70,000 using BETWEEN
AnswersC, D, E

IS NULL is used in WHERE to check for NULL values.

Why this answer

WHERE can filter using LIKE, BETWEEN, and IS NULL. HAVING is for aggregated results, and ORDER BY is for sorting.
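The three valid WHERE filters can be tried against an in-memory SQLite database. This is a sketch under assumptions: the `employees` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, salary INTEGER, manager_id INTEGER, hire_date TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Jane", 60000, 1, "2020-03-01"),
        ("Omar", 80000, None, "2018-07-15"),
        ("Julia", 45000, 1, "2022-01-10"),
    ],
)

# LIKE: names starting with 'J'. Sorting is done by ORDER BY, not WHERE.
starts_with_j = conn.execute(
    "SELECT name FROM employees WHERE name LIKE 'J%' ORDER BY name"
).fetchall()

# BETWEEN: inclusive salary range.
mid_salary = conn.execute(
    "SELECT name FROM employees WHERE salary BETWEEN 50000 AND 70000"
).fetchall()

# IS NULL: rows with no manager (= NULL would never match).
no_manager = conn.execute(
    "SELECT name FROM employees WHERE manager_id IS NULL"
).fetchall()

print(starts_with_j, mid_salary, no_manager)
```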

618
MCQhard

In Power BI, a developer needs to create a measure that calculates total sales for the same period last year. Which DAX function should be used?

A.TOTALYTD
B.PARALLELPERIOD
C.DATEADD
D.SAMEPERIODLASTYEAR
AnswerD

This function returns the same date range from the previous year.

Why this answer

SAMEPERIODLASTYEAR is a time intelligence function that returns dates from the previous year.

619
MCQhard

A data analyst is creating a dashboard in Tableau to show monthly sales trends. The data source includes daily sales figures. The analyst needs to calculate the cumulative sales for each month and compare it to the same period last year. Which Tableau feature should be used for the year-over-year comparison?

A.Calculated fields
B.Level of Detail (LOD) expressions
C.Table calculations
D.Parameters
AnswerC

Table calculations can compute running total and date comparison (e.g., PREVIOUS_VALUE) easily.

Why this answer

Tableau's table calculations can compute running totals and year-over-year comparisons without modifying the data source.

620
MCQeasy

A data analyst needs to extract data from an API that returns JSON. The analyst wants to convert the JSON output into a tabular format for analysis. Which function in a scripting language is commonly used for this purpose?

A.json.loads()
B.to_csv()
C.read_json()
D.json_normalize()
AnswerD

This function normalizes semi-structured JSON data into a flat table.

Why this answer

Option D is correct because `json_normalize()` is a function in the pandas library specifically designed to flatten semi-structured JSON data (including nested lists and dictionaries) into a tabular DataFrame. This makes it the ideal tool for converting API responses with complex nesting into rows and columns for analysis, unlike simpler JSON parsing functions.
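To make the idea of flattening concrete, here is a standard-library sketch of what normalizing nested JSON means. It is not pandas' implementation: the helper `flatten` and the sample payload are hypothetical, and with pandas installed the parsed list would typically be passed to `pandas.json_normalize` instead.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Recursively turn nested dictionaries into one flat dict of columns."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat

# Hypothetical API response with nested objects.
api_response = '[{"id": 1, "customer": {"name": "Ana", "address": {"city": "Lima"}}}]'

parsed = json.loads(api_response)           # nested Python lists and dicts
rows = [flatten(item) for item in parsed]   # one flat dict per table row

print(rows)
```

Each flat dictionary corresponds to one row, and its dotted keys correspond to the column names of the resulting table.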

Exam trap

The trap here is that candidates confuse `read_json()` (which loads JSON into a DataFrame but does not flatten nested objects) with `json_normalize()` (which handles nested structures), leading them to choose option C when the API response contains hierarchical data.

How to eliminate wrong answers

Option A is wrong because `json.loads()` only parses a JSON string into a Python dictionary or list; it does not flatten nested structures or produce a tabular format. Option B is wrong because `to_csv()` is a pandas method for exporting a DataFrame to a CSV file, not for converting JSON to a table. Option C is wrong because `read_json()` in pandas reads a JSON file or string into a DataFrame but does not flatten nested data; nested values (e.g., arrays of objects with sub-objects) end up as dictionaries or lists inside a single column and need additional normalization.

621
MCQmedium

A scatter plot of advertising spend vs. revenue shows no clear correlation, but the analyst suspects a relationship exists. Which addition to the plot could help reveal a hidden trend?

A.Change to a bar chart
B.Increase the marker size
C.Remove data points with low spend
D.Add a trendline
AnswerD

A trendline helps model and visualize the underlying relationship.

Why this answer

Option D is correct because adding a trendline (e.g., a linear regression line) can reveal patterns in noisy data. The other options either obscure the relationship or do not address it.

622
MCQeasy

A data analyst wants to visualize the monthly sales trend for the past year. Which chart type is most appropriate?

A.Bar chart
B.Pie chart
C.Line chart
D.Scatter plot
AnswerC

Line charts effectively display trends over time.

Why this answer

Line charts are best for showing trends over time.

623
MCQmedium

A database administrator wants to ensure that every value in a column matches values in a primary key column of another table. Which constraint enforces this rule?

A.Unique constraint
B.Primary key
C.Check constraint
D.Foreign key
AnswerD

Foreign key enforces referential integrity.

Why this answer

A foreign key constraint ensures referential integrity by requiring that values in a column match values in the primary key of another table.
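The constraint can be seen rejecting an orphan row in an in-memory SQLite database. A minimal sketch: the `customers` and `orders` tables are hypothetical, and note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` is set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(customer_id))"
)

conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (100, 1)")  # accepted: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (101, 99)")  # no customer 99
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
```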

624
MCQhard

An analyst is reviewing a table that stores customer orders. The table contains columns: OrderID, CustomerName, Product1, Product1Qty, Product2, Product2Qty. This design violates which normal form?

A.No violation
B.Third normal form (3NF)
C.Second normal form (2NF)
D.First normal form (1NF)
AnswerD

Repeating groups violate 1NF.

Why this answer

The table violates First Normal Form (1NF) because it contains repeating groups (Product1, Product1Qty, Product2, Product2Qty) instead of storing each product in a separate row. 1NF requires that each column contains atomic values and that there are no repeating groups or arrays. The presence of multiple product columns for a single order breaks this atomicity and normalization rule.
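A sketch of how the repeating groups could be moved into one row per product. The sample orders and the target column names (`Product`, `Qty`) are assumptions for illustration; only the source column names come from the question.

```python
# Hypothetical rows in the original wide layout.
wide_orders = [
    {"OrderID": 1, "CustomerName": "Ana",
     "Product1": "Pen", "Product1Qty": 2, "Product2": "Ink", "Product2Qty": 1},
    {"OrderID": 2, "CustomerName": "Ben",
     "Product1": "Pad", "Product1Qty": 5, "Product2": None, "Product2Qty": None},
]

# One row per (order, product) instead of numbered column pairs.
order_items = []
for order in wide_orders:
    for slot in (1, 2):
        product = order[f"Product{slot}"]
        if product is not None:  # skip unused product slots
            order_items.append(
                {"OrderID": order["OrderID"],
                 "Product": product,
                 "Qty": order[f"Product{slot}Qty"]}
            )

print(order_items)
```

In the reshaped layout an order can hold any number of products without adding columns, and empty product slots no longer need NULL placeholders.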

Exam trap

The trap here is that candidates often think the table is already in 1NF because it has a primary key (OrderID), but they overlook the repeating group columns that violate the atomicity requirement of 1NF.

How to eliminate wrong answers

Option A is wrong because the table clearly violates normalization rules due to repeating groups, so a violation exists. Option B is wrong because Third Normal Form (3NF) requires that the table already be in 2NF and have no transitive dependencies; the immediate violation is at the 1NF level, not 3NF. Option C is wrong because Second Normal Form (2NF) requires that the table first satisfy 1NF and then have no partial dependencies; since the table fails 1NF, it cannot be evaluated for 2NF.

625
Multi-Selecteasy

Which TWO are examples of internal data sources? (Select exactly 2)

Select 2 answers
A.APIs
B.Relational databases
C.Sensor readings
D.Social media comments
E.Flat files
AnswersB, E

Common internal source.

Why this answer

Relational databases are internal data sources because they store structured data generated and controlled within an organization's own systems. They are typically managed by internal IT teams and accessed via SQL queries, making them a classic example of an internal data repository.

Exam trap

CompTIA often tests the distinction between data sources and data access methods, so candidates mistakenly select APIs or sensor readings as internal sources when they are actually mechanisms or external origin points.

626
MCQmedium

A simple linear regression model predicts sales (y) from advertising spend (x). The equation is y = 2.5x + 10, and R² = 0.81. Which interpretation is correct?

A.The correlation between sales and advertising is 0.81.
B.When advertising is $0, sales are $2.5.
C.81% of the variation in sales is explained by advertising spend.
D.For every $1 increase in advertising, sales increase by $10 on average.
AnswerC

R² = 0.81 means 81% explained.

Why this answer

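The slope of 2.5 means that each $1 increase in advertising spend (x) increases predicted sales (y) by 2.5 on average, so option D, which quotes the intercept instead, is wrong. The intercept of 10 is the predicted sales when advertising is $0, so option B is wrong. R² of 0.81 means 81% of the variance in y is explained by x; the correlation is the square root of R², which is 0.9, so option A is wrong.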
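The quantities in the equation can be checked numerically. A small sketch using only the equation and R² given in the question; the function name `predicted_sales` is hypothetical.

```python
import math

def predicted_sales(ad_spend: float) -> float:
    """Regression line from the question: y = 2.5x + 10."""
    return 2.5 * ad_spend + 10

r_squared = 0.81

intercept = predicted_sales(0)                   # predicted sales at $0 advertising
slope = predicted_sales(1) - predicted_sales(0)  # change in y per $1 of x
# In simple linear regression, r = sqrt(R^2) with the sign of the slope.
correlation = math.sqrt(r_squared)

print(intercept, slope, round(correlation, 2))
```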

627
MCQmedium

In a time series analysis, a retail analyst observes consistent peaks in sales every December and troughs every February. This pattern repeats annually. Which component of time series does this represent?

A.Irregular
B.Seasonality
C.Trend
D.Cyclical
AnswerB

Seasonality is predictable and repeats over fixed intervals.

Why this answer

Seasonality refers to regular patterns that repeat over fixed periods, such as months or quarters.

628
MCQhard

Refer to the exhibit. A data analyst runs this query to identify high-value customers. However, the result does not include customers with exactly 5 orders. Which data concept does the HAVING clause illustrate?

A.Data sorting with ORDER BY
B.Data joining with INNER JOIN
C.Data aggregation with filtering on aggregated values
D.Data filtering on row-level conditions
AnswerC

HAVING filters after GROUP BY, operating on aggregated results.

Why this answer

HAVING filters groups after aggregation, unlike WHERE which filters rows before aggregation. This demonstrates data aggregation with filtering on aggregated values.
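The exhibit is not reproduced here, so the following is a sketch under assumptions: an `orders` table and a `HAVING COUNT(*) > 5` condition, which would explain why customers with exactly 5 orders are excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")

# Hypothetical data: customer 1 has exactly 5 orders, customer 2 has 6.
rows = [(1, 10.0)] * 5 + [(2, 10.0)] * 6
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# HAVING filters on the aggregated count, after GROUP BY has formed the groups.
high_value = conn.execute(
    "SELECT customer_id, COUNT(*) AS order_count "
    "FROM orders GROUP BY customer_id HAVING COUNT(*) > 5"
).fetchall()

print(high_value)
```

Changing the condition to `>= 5` would bring customer 1 back into the result.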

629
Multi-Selecteasy

A data analyst is preparing a presentation to share findings with non-technical stakeholders. Which TWO practices should the analyst follow to effectively communicate data insights? (Choose two.)

Select 2 answers
A.Tailor the message to the audience
B.Provide raw data tables
C.Use technical jargon to demonstrate expertise
D.Include all data anomalies and outliers
E.Focus on actionable insights
AnswersA, E

Different stakeholders have different priorities and levels of understanding.

Why this answer

Option A (tailor the message to the audience) is correct because different stakeholders have different priorities. Option E (focus on actionable insights) is correct because stakeholders need clear recommendations. Option B is wrong because raw data tables are overwhelming and not insightful.

Option C is wrong because technical jargon can confuse non-technical audiences. Option D is wrong because including all anomalies can distract from key findings.

630
Multi-Selectmedium

A dataset contains outliers in a feature that will be used for linear regression. Which two outlier treatment methods are appropriate? (Choose TWO)

Select 2 answers
A.Cap the outliers at a percentile (e.g., 99th percentile)
B.Use min-max normalization
C.Increase the sample size
D.Remove the outlier rows
E.Replace outliers with the mean
AnswersA, D

Capping limits extreme values.

Why this answer

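Capping outliers at a percentile or removing the outlier rows reduces their influence on the fitted regression line. Min-max normalization only rescales the feature and leaves the outliers in place, and replacing outliers with the mean relies on a statistic that the outliers themselves distort.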
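A minimal sketch of percentile capping, assuming a nearest-rank percentile and a plain list of values. The helper `cap_at_percentile` and the sample data are hypothetical.

```python
def cap_at_percentile(values, percentile=99):
    """Replace values above the given percentile with the percentile value.

    Uses the nearest-rank method; `percentile` is an integer from 1 to 100.
    """
    ordered = sorted(values)
    rank = (percentile * len(ordered) + 99) // 100  # ceiling of percentile% of n
    cap = ordered[max(rank, 1) - 1]
    return [min(value, cap) for value in values]

# Hypothetical feature: 99 ordinary values and one extreme outlier.
spend = list(range(1, 100)) + [10_000]
capped = cap_at_percentile(spend)

print(max(spend), max(capped))
```

Capping keeps every row in the dataset, whereas removing the outlier rows shrinks it; both limit the pull of extreme values on the regression line.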

631
Multi-Selectmedium

Which TWO are best practices for designing effective dashboards? (Select exactly two.)

Select 2 answers
A.Use 3D charts for visual appeal.
B.Use pie charts for comparing parts of a whole.
C.Include every possible metric to provide completeness.
D.Use a single page to avoid scrolling.
E.Use a consistent color scheme across visualizations.
AnswersD, E

A single page allows immediate overview.

Why this answer

Options D and E are correct. A single-page dashboard provides an at-a-glance view, and consistent color schemes improve readability. 3D charts (A) can distort perception; including every metric (C) clutters; pie charts (B) are often misused.

632
MCQmedium

Refer to the exhibit. If the date column is stored as a string in 'MM/DD/YYYY' format, what will be the result?

A.Incorrect results because string comparison is lexicographic.
B.NULL values
C.Error because DATE type is expected.
D.Correct results because string comparison works for dates.
AnswerA

The different format causes lexicographic comparison to fail.

Why this answer

Option A is correct because when dates are stored as strings in 'MM/DD/YYYY' format, string comparison is lexicographic (character-by-character). This means that '01/02/2023' (January 2, 2023) is considered less than '12/31/2022' because '0' < '1' at the first character, even though it is the later date, leading to incorrect chronological ordering. The comparison does not interpret the string as a date value.
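The effect is easy to reproduce. The sketch below uses Python string comparison, which is also character-by-character, as a stand-in for the database comparison; the query in the exhibit is not reproduced here.

```python
from datetime import datetime

earlier = "12/31/2022"
later = "01/02/2023"

# Lexicographic comparison: '0' sorts before '1', so the later date looks smaller.
string_says_later_is_smaller = later < earlier

def parse(text: str) -> datetime:
    """Interpret an MM/DD/YYYY string as a real date value."""
    return datetime.strptime(text, "%m/%d/%Y")

# Comparing parsed dates gives the correct chronological order.
date_says_later_is_greater = parse(later) > parse(earlier)

print(string_says_later_is_smaller, date_says_later_is_greater)
```

Converting the column to a date type, or storing strings in 'YYYY-MM-DD' order, makes the comparison order match the chronological order.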

Exam trap

CompTIA often tests the misconception that string comparison of dates in 'MM/DD/YYYY' format will yield correct chronological order, but the trap is that lexicographic comparison compares month first, not year, leading to incorrect results.

How to eliminate wrong answers

Option B is wrong because string comparison does not produce NULL values; it simply compares strings lexicographically and returns a valid boolean result. Option C is wrong because no error occurs; the database or application will perform string comparison without expecting a DATE type, as the column is defined as a string. Option D is wrong because string comparison does not work correctly for dates in this format; lexicographic order does not match chronological order for 'MM/DD/YYYY' strings.

633
MCQhard

A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?

A.Reduce maximum depth
B.Increase minimum samples split
C.Increase number of leaves
D.Use a smaller dataset
AnswerA

Reducing max depth stops the tree from growing too deep, simplifying the model and reducing variance.

Why this answer

Reducing maximum depth limits the number of splits in the decision tree, which directly reduces model complexity and variance. A high-variance model is overfitting to training data, and capping depth prevents the tree from learning overly specific patterns that do not generalize.

Exam trap

CompTIA often tests the misconception that increasing model complexity (e.g., more leaves) reduces overfitting, when in reality it increases variance. The trap here is that candidates may treat 'minimum samples split' as the only regularization option and overlook that reducing max depth is a more direct and effective hyperparameter for high variance.

How to eliminate wrong answers

Option B is wrong because increasing minimum samples split actually reduces overfitting by requiring more samples per split, which is also effective but not the most direct adjustment for high variance; the question asks for the most effective hyperparameter adjustment, and reducing depth is more aggressive. Option C is wrong because increasing the number of leaves increases model complexity, which would exacerbate overfitting and increase variance, not reduce it. Option D is wrong because using a smaller dataset would increase variance (less data leads to more unstable splits) and is not a hyperparameter adjustment; it is a data-level change that typically worsens overfitting.

634
MCQhard

A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?

A.Use raw data because k-means works with mixed types
B.Standardize continuous variables and one-hot encode categorical variables
C.Apply PCA first to reduce dimensionality
D.Remove categorical variables entirely
AnswerB

Standardization ensures equal weight; one-hot encoding converts categories to binary vectors.

Why this answer

Option B is correct because k-means clustering relies on Euclidean distance, which is sensitive to the scale of features. Standardizing continuous variables (e.g., spending amount, frequency) ensures they contribute equally to distance calculations, while one-hot encoding categorical variables (e.g., region, gender) converts them into numerical form without implying ordinal relationships, allowing k-means to process mixed data types correctly.

Exam trap

The trap here is that candidates assume k-means can natively handle mixed data types because it is a common clustering algorithm, but it strictly requires numerical input and scale normalization to avoid skewed distance calculations.

How to eliminate wrong answers

Option A is wrong because k-means cannot directly handle categorical variables; it requires numerical input and assumes continuous features, so using raw mixed-type data would produce meaningless distance calculations. Option C is wrong because PCA is a dimensionality reduction technique applied after preprocessing, not a substitute for standardizing and encoding; it may be used optionally but is not the required preparation step. Option D is wrong because removing categorical variables discards valuable demographic information that could improve segmentation, and k-means can incorporate them after proper encoding.
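
The preparation step can be sketched in plain Python as follows. The helper names (`standardize`, `one_hot`) and the sample values are assumptions for illustration only; a real project would more likely use a library scaler and encoder.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """One binary column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical customer data.
spending = [100.0, 200.0, 300.0]
region = ["east", "west", "east"]

scaled = standardize(spending)
encoded = one_hot(region)

# Each row is now fully numeric and could be passed to a k-means routine.
rows = [[s] + e for s, e in zip(scaled, encoded)]
```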

635
Multi-Selecthard

A data analyst is reviewing the quality of a sales report before publishing. Which THREE elements should the analyst include to ensure report quality?

Select 3 answers
A.Data freshness timestamp
B.Source attribution
C.Row-level security settings
D.Limitations and assumptions
E.Methodology notes
AnswersA, B, D

Indicates how current the data is.

Why this answer

Report quality includes documenting when data was last updated, any limitations or assumptions, and the source of the data to ensure transparency and reliability.

636
MCQeasy

A data analyst wants to compare the sales performance of four different stores over the same time period. Which chart type is most suitable?

A.Line chart with multiple lines
B.Grouped bar chart
C.Stacked bar chart
D.Pie chart with multiple pies
AnswerB

Grouped bars allow side-by-side comparison of stores.

Why this answer

A grouped bar chart is the most suitable choice because it allows direct comparison of discrete categories (four stores) across a common time period, with each group representing a time interval and individual bars representing each store's sales. This chart type excels at side-by-side comparisons of multiple entities over the same categorical axis, making it ideal for the analyst's goal.

Exam trap

The trap here is that candidates often choose a line chart (Option A) because they associate 'time period' with trend analysis, but the question explicitly asks to compare sales performance over the same time period, not to show changes over time, making the grouped bar chart the correct choice for direct comparison of discrete entities.

How to eliminate wrong answers

Option A is wrong because a line chart with multiple lines is designed to show trends over continuous time, but here the time period is fixed and the focus is on comparing discrete stores, not tracking changes over time. Option C is wrong because a stacked bar chart shows part-to-whole relationships within each time period, which obscures individual store performance and makes direct comparisons between stores difficult. Option D is wrong because a pie chart with multiple pies is meant to show proportions of a whole for each store separately, but comparing sales across stores requires a common baseline, which pie charts do not provide.

637
MCQeasy

A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?

A.Logistic regression
B.Ridge regression
C.Linear regression
D.Lasso regression
AnswerA

Logistic regression is used for binary classification, suitable for churn prediction.

Why this answer

Logistic regression is the correct choice because the target variable, customer churn, is binary (churn vs. no churn). Logistic regression models the probability of a binary outcome using a sigmoid function, making it suitable for classification tasks with both categorical and continuous predictors.

Exam trap

CompTIA often tests the misconception that 'regression' in the option name implies it is only for continuous outcomes, leading candidates to overlook logistic regression as a valid classification technique.

How to eliminate wrong answers

Option B (Ridge regression) is wrong because it is a regularized form of linear regression used for continuous outcomes, not binary classification. Option C (Linear regression) is wrong because it predicts a continuous value and is inappropriate for a binary dependent variable; it can produce probabilities outside [0,1] and violates the assumption of normally distributed errors. Option D (Lasso regression) is wrong because, like Ridge, it is a regularized linear regression for continuous targets and performs feature selection via L1 penalty, but it does not handle binary classification.
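
As a rough illustration of how the sigmoid turns a linear score into a churn probability, here is a small sketch. The coefficients and feature names are hypothetical and not fitted to any data.

```python
import math

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
intercept = -1.0
coef_tenure_years = -0.5
coef_is_monthly_plan = 1.5  # 1 if the (encoded) plan type is monthly, else 0

def churn_probability(tenure_years, is_monthly_plan):
    z = (intercept
         + coef_tenure_years * tenure_years
         + coef_is_monthly_plan * is_monthly_plan)
    return sigmoid(z)

p = churn_probability(tenure_years=1.0, is_monthly_plan=1)
predicted_churn = p >= 0.5  # threshold the probability to get a class label
```

Whatever the inputs, the output stays strictly between 0 and 1, which is the property plain linear regression lacks.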

638
MCQeasy

During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?

A.Log transformation
B.Reciprocal transformation
C.No transformation needed
D.Square root transformation
AnswerA

Log transformation effectively reduces right skewness.

Why this answer

Log transformation is commonly used to reduce strong right skew. Square root transformation suits moderate skew. Box-Cox can also work, but log is simpler.

Option A is correct.
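
A small sketch of the effect, assuming a made-up right-skewed sample of strictly positive values (the log is undefined for zero or negative inputs). The `skewness` helper is a simple moment-based estimate written for this example.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Hypothetical right-skewed data: most values are small, a few are large.
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
logged = [math.log(v) for v in raw]

print(skewness(raw) > skewness(logged))  # True: the log pulls in the long tail
```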

639
MCQmedium

A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?

A.Hierarchical clustering with Gower distance
B.K-modes clustering
C.DBSCAN with Euclidean distance
D.K-means clustering
AnswerA

Gower distance can handle mixed data types by computing a dissimilarity matrix that combines numeric and categorical attributes.

Why this answer

Hierarchical clustering with Gower distance is best suited for mixed data types because Gower distance computes a dissimilarity measure that handles both numeric and categorical variables by normalizing numeric differences and using a simple matching coefficient for categorical ones. This allows the algorithm to create a distance matrix that equally weights all variable types, making it ideal for segmenting customers with purchase history, demographics, and website behavior data.

Exam trap

The trap here is that candidates often assume K-means or DBSCAN can handle mixed data by simply encoding categorical variables, but they overlook that Euclidean distance on encoded data distorts the geometry and fails to preserve the natural dissimilarity structure of categorical variables.

How to eliminate wrong answers

Option B (K-modes clustering) is wrong because it is designed exclusively for categorical data and cannot handle numeric variables like purchase history or website behavior metrics. Option C (DBSCAN with Euclidean distance) is wrong because Euclidean distance is only meaningful for numeric data and cannot properly measure dissimilarity between categorical variables, leading to distorted clusters. Option D (K-means clustering) is wrong because it relies on Euclidean distance and assumes numeric, continuous data; it cannot directly incorporate categorical variables without encoding, and even with encoding, it is sensitive to scaling and does not naturally handle mixed types.
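
The Gower calculation described above can be sketched for a single pair of records. This is a simplified version with equal feature weights; the feature names, values, and ranges are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two records.

    Numeric features: absolute difference divided by the feature's range.
    Categorical features: 0 if the values match, 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

# Hypothetical customers and feature ranges (max - min over the dataset).
c1 = {"spend": 100.0, "visits": 2, "region": "east"}
c2 = {"spend": 300.0, "visits": 4, "region": "west"}
ranges = {"spend": 400.0, "visits": 10}

d = gower_distance(c1, c2, ranges)  # (0.5 + 0.2 + 1.0) / 3
```

Computing this for every pair of customers gives the dissimilarity matrix that hierarchical clustering works from.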

640
MCQeasy

A data analyst needs to merge two customer tables from different sources. One table uses 'CUST_ID' as the primary key, the other uses 'CustomerID'. To ensure accurate merging, the analyst should first:

A.Perform a fuzzy match on names
B.Normalize the key column names to a common format
C.Remove duplicate rows from both tables
D.Aggregate data by region
AnswerB

Standardizing key names allows for accurate merging without data loss.

Why this answer

Normalizing key column names to a common format (Option B) is the correct first step because the merge operation requires a consistent join key. Without aligning 'CUST_ID' and 'CustomerID' to a single name and data type, the database or ETL tool will treat them as different columns, resulting in a cross join or an error. This step ensures referential integrity and enables an accurate inner or outer join based on the primary key.

Exam trap

The trap here is that candidates assume deduplication (Option C) is the most critical first step, but without first standardizing the join keys, any deduplication logic would operate on mismatched or incomplete data, leading to incorrect results.

How to eliminate wrong answers

Option A is wrong because performing a fuzzy match on names is an advanced, resource-intensive technique used only when exact key values are unavailable or inconsistent; it is unnecessary when the tables already have primary key columns that can be standardized. Option C is wrong because removing duplicate rows before aligning key names could inadvertently delete legitimate records that only appear duplicated due to key naming differences, and deduplication should occur after the merge or as a separate quality step. Option D is wrong because aggregating data by region is a post-merge analytical operation that has no bearing on resolving key column mismatches and would corrupt the granularity needed for accurate joining.
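
A minimal sketch of aligning the key names before joining, using plain dictionaries instead of a database or ETL tool. The sample rows and the target name `customer_id` are assumptions for illustration.

```python
def rename_key(rows, old, new):
    """Return copies of the rows with column `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

# Hypothetical source tables with differently named key columns.
crm = [{"CUST_ID": 1, "name": "Ana"}, {"CUST_ID": 2, "name": "Bo"}]
billing = [{"CustomerID": 2, "balance": 50.0}, {"CustomerID": 3, "balance": 10.0}]

crm_std = rename_key(crm, "CUST_ID", "customer_id")
billing_std = rename_key(billing, "CustomerID", "customer_id")

# Inner join on the now-common key.
billing_by_id = {row["customer_id"]: row for row in billing_std}
merged = [
    {**row, **billing_by_id[row["customer_id"]]}
    for row in crm_std
    if row["customer_id"] in billing_by_id
]
```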

641
MCQeasy

An analyst creates a dashboard with multiple visualizations. Which feature allows users to change the data displayed across all charts simultaneously?

A.Linked chart
B.Drill-down
C.Filter or slicer
D.Data segmentation
AnswerC

Filters and slicers allow users to interactively change the data displayed on multiple charts.

Why this answer

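Option C is correct because a filter (or slicer) can apply to multiple visuals at once. Option A (linked chart) is a chart feature, not a dashboard-wide control. Option B (drill-down) narrows a single view to a more detailed level.

Option D (data segmentation) is dividing data into segments.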

642
MCQmedium

A sales manager receives a daily report at 8 AM via email showing yesterday's sales by region. This is an example of which report type?

A.Ad hoc report
B.Scheduled report
C.Self-service report
D.Operational report
AnswerB

Scheduled reports are automated and delivered on a fixed schedule.

Why this answer

The report is automatically delivered on a regular schedule, which defines a scheduled report.

643
MCQeasy

Refer to the exhibit. An analyst runs this query before acquiring data from a PostgreSQL database. What is the primary purpose of this query?

A.To verify data types
B.To check for data freshness
C.To find primary keys
D.To identify duplicate tables
AnswerB

The 'last_analyzed' column shows when statistics were last updated, indicating freshness.

Why this answer

The query `SELECT current_timestamp;` returns the current date and time from the PostgreSQL server. By comparing this value with the timestamp of the most recent record in a target table, an analyst can determine if the data is up-to-date or stale. This is a standard technique to check data freshness before proceeding with data acquisition.

Exam trap

The trap here is that candidates may confuse `current_timestamp` with a function used for data type verification or schema inspection, when in fact it is solely a temporal function used to gauge data freshness.

How to eliminate wrong answers

Option A is wrong because verifying data types requires querying the `information_schema.columns` table or using `pg_typeof()`, not `current_timestamp`. Option C is wrong because finding primary keys involves querying `information_schema.table_constraints` or `pg_indexes`, not a simple timestamp function. Option D is wrong because identifying duplicate tables would require comparing table names or schemas via `information_schema.tables`, not a timestamp query.

644
MCQhard

A data analyst is creating a report to compare the performance of sales regions across multiple years. The report will be used by regional managers to identify trends. Which visualization approach best supports this?

A.A single line chart with all regions overlaid
B.A bar chart with years on x-axis and regions as grouped bars
C.A stacked area chart with all regions
D.Small multiples of line charts, one per region
AnswerD

Small multiples facilitate comparison while maintaining clarity.

Why this answer

Small multiples (trellis charts) allow comparison of multiple time series in a compact, consistent layout, making it easy to spot trends across regions.

645
MCQhard

A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?

A.Decrease the training data size
B.Use a polynomial transformation
C.Increase the number of features
D.Apply L2 regularization (Ridge)
AnswerD

Ridge regularization adds a penalty to large coefficients, reducing variance and combating overfitting.

Why this answer

The model is overfitting the training data, as evidenced by high performance on training data but poor performance on validation data. L2 regularization (Ridge) adds a penalty term proportional to the square of the coefficients, which shrinks them and reduces model complexity, thereby improving generalization to unseen data.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting, and candidates mistakenly choose polynomial transformation or adding features thinking they will improve fit, when in fact they increase model complexity and worsen overfitting.

How to eliminate wrong answers

Option A is wrong because decreasing the training data size would exacerbate overfitting by providing the model with even less information to learn generalizable patterns. Option B is wrong because polynomial transformation increases model complexity and feature interactions, which typically worsens overfitting rather than addressing it. Option C is wrong because increasing the number of features adds more predictors, which increases the risk of overfitting and does not directly penalize large coefficients.
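
To see the shrinkage effect in isolation, here is a sketch of ridge regression reduced to one feature with no intercept, a simplification assumed purely for illustration. In that case the penalized slope has the closed form sum(x*y) / (sum(x*x) + lambda).

```python
def ridge_slope(xs, ys, lam):
    """Closed-form slope for single-feature, no-intercept ridge regression."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

ols = ridge_slope(xs, ys, lam=0.0)     # 28 / 14 = 2.0, ordinary least squares
ridge = ridge_slope(xs, ys, lam=14.0)  # 28 / 28 = 1.0, coefficient shrunk
```

A larger penalty pulls the coefficient further toward zero, trading a little bias for lower variance.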

646
MCQeasy

A data analyst is tasked with gathering data from a legacy system that only exports CSV files. The files contain headers but no data types. Which tool would best facilitate initial data exploration?

A.Hadoop
B.Tableau
C.SQL database
D.Python pandas
AnswerD

Provides powerful data structures and functions for CSV exploration.

Why this answer

Option D is correct because Python pandas provides powerful data structures and functions for exploring and manipulating CSV data, including detecting data types. Option C is wrong because a SQL database requires importing the data first. Option A is wrong because Hadoop is overkill for CSV exploration.

Option B is wrong because Tableau is visualization-focused and not ideal for initial data exploration without pre-processing.

647
Multi-Selectmedium

A data team must implement a data retention policy to reduce storage costs while meeting legal requirements. Which TWO actions best achieve this?

Select 2 answers
A.Set data retention limits with automated deletion
B.Use data compression
C.Increase primary storage capacity
D.Implement data deduplication
E.Archive historical data to tape or cloud archive
AnswersA, E

Ensures data is deleted after a defined period, complying with legal requirements.

Why this answer

Archiving old data to cheaper storage reduces primary storage costs, and setting retention limits ensures data is deleted when no longer needed, balancing cost and compliance.

648
MCQmedium

During EDA, an analyst calculates the Z-score for each data point in a dataset. A data point with a Z-score of 3.5 is identified. What does this indicate?

A.The data point has a high frequency
B.The data point is exactly at the mean
C.The data point is likely an outlier
D.The data point is within the interquartile range
AnswerC

A Z-score above 3 or below -3 is often considered an outlier.

Why this answer

A Z-score of 3.5 means the value is 3.5 standard deviations from the mean, commonly considered an outlier (threshold often >3 or <-3).
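
A short sketch of the calculation on made-up data. The threshold of 3 is the common rule of thumb mentioned above, not a fixed standard.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical sample: nineteen typical values and one extreme value.
data = [10] * 19 + [50]

scores = z_scores(data)
outliers = [v for v, z in zip(data, scores) if abs(z) > 3]

print(outliers)  # [50]
```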

649
MCQhard

A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?

A.Duplicate customer IDs
B.Missing values in total_charges
C.Inconsistent data in total_charges
D.Outliers in monthly_charges
AnswerC

The total_charges for the first customer is equal to monthly_charges, suggesting a calculation error.

Why this answer

Option C is correct because the exhibit shows that the 'total_charges' column contains entries like '1,234.56' and '1234.56', which are inconsistent numeric formats. Logistic regression in Python (e.g., using scikit-learn) requires all feature values to be numeric and consistent; mixed formats will cause parsing errors or incorrect model training. The analyst must standardize these values to a uniform numeric type (e.g., float) before proceeding.

Exam trap

CompTIA often tests the distinction between data quality issues that prevent model execution (like inconsistent data types) versus issues that degrade model performance (like outliers or missing values), and candidates frequently overlook the former because they focus on statistical concerns rather than data preprocessing fundamentals.

How to eliminate wrong answers

Option A is wrong because duplicate customer IDs are a data integrity issue that can cause data leakage or overfitting, but the exhibit does not show any duplicate IDs, and this is not the most immediate problem for model training. Option B is wrong because missing values in 'total_charges' are not indicated in the exhibit; the issue is inconsistent formatting, not absence of data. Option D is wrong because outliers in 'monthly_charges' are not visible in the exhibit, and while outliers can affect logistic regression, they are a secondary concern compared to the fundamental data type inconsistency that prevents the model from even reading the data correctly.

650
MCQmedium

A dataset contains a feature 'Age' with values ranging from 18 to 95. To prepare data for a k-nearest neighbors algorithm, which transformation should be applied to 'Age'?

A.Z-score standardization
B.Min-max normalization
C.No transformation needed
D.Log transformation
AnswerB

Min-max normalization ensures all features contribute equally to distance calculations.

Why this answer

Min-max normalization scales features to a fixed range (e.g., 0-1), which is appropriate for distance-based algorithms like k-NN.
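
A minimal sketch of the rescaling, using three hypothetical ages that include the stated minimum (18) and maximum (95).

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 40, 95]
scaled = min_max(ages)  # 18 -> 0.0, 40 -> 22/77, 95 -> 1.0
```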

651
MCQhard

A data analyst is reviewing an S3 bucket policy that controls access to a data lake. The analyst wants to communicate that the current policy restricts data downloads to a specific IP range. Which of the following best describes the policy's effect?

A.The policy allows all users to download data.
B.The policy denies data downloads to all users.
C.The policy restricts data downloads to a specific IP range.
D.The policy allows data uploads from a specific IP range.
AnswerC

The Deny with condition effectively allows only that IP range.

Why this answer

Option C is correct because the S3 bucket policy explicitly restricts data downloads to a specific IP range by using a condition in the policy statement, such as `aws:SourceIp` with a CIDR block. This means only requests originating from IP addresses within that range are allowed to perform the `s3:GetObject` action, effectively limiting downloads to authorized network locations.

Exam trap

The trap here is that candidates may confuse the policy's effect with a blanket denial (Option B) or misinterpret the restriction as applying to uploads (Option D), when the policy specifically targets downloads and allows them only from a defined IP range.

How to eliminate wrong answers

Option A is wrong because it describes an allow-all effect, which contradicts the explicit IP-based restriction in the policy; a policy that restricts to a specific IP range does not allow all users. Option B is wrong because it states the policy denies downloads to all users, but the policy actually allows downloads from the specified IP range while implicitly or explicitly denying others—it does not universally deny. Option D is wrong because it refers to data uploads (e.g., `s3:PutObject`), but the question specifically addresses downloads (`s3:GetObject`), and the policy effect is about restricting downloads, not uploads.

652
Multi-Selectmedium

Which TWO of the following are leading indicators that can help predict future performance?

Select 2 answers
A.Net profit margin
B.Monthly revenue
C.Website traffic
D.Number of qualified leads
E.Customer churn rate
AnswersC, D

Increased traffic may lead to more sales.

Why this answer

Leading indicators predict future outcomes; website traffic and number of qualified leads are forward-looking.

653
MCQmedium

A data analyst wants to compare the average revenue per customer between two marketing campaigns (A and B). The analyst is unsure if the data follows a normal distribution. Which statistical test is most appropriate for comparing the means of the two groups?

A.Two-sample t-test
B.Pearson correlation
C.Chi-square test
D.ANOVA
AnswerA

The two-sample t-test compares means of two independent groups.

Why this answer

For comparing means of two independent groups, the t-test is the standard parametric test. If normality is violated, a non-parametric alternative like Mann-Whitney U could be used, but the t-test is robust for moderate sample sizes.
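
As an illustration, the test statistic can be computed by hand. This sketch uses Welch's variant, which does not assume equal variances; that choice and the sample revenues are assumptions, and a statistics library would normally also supply the p-value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic: difference in means over its standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical revenue per customer for each campaign.
campaign_a = [10.0, 12.0, 14.0]
campaign_b = [20.0, 22.0, 24.0]

t_stat = welch_t(campaign_a, campaign_b)
```

A large absolute value of `t_stat` suggests the difference in means is unlikely to be due to chance alone.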

654
MCQeasy

A retail company is merging customer data from three separate systems: an e-commerce platform, a point-of-sale (POS) system, and a loyalty program. The e-commerce platform stores customer names in "FirstName LastName" format, the POS system stores names as "LastName, FirstName", and the loyalty program stores names in separate "first_name" and "last_name" fields. The data analyst needs to create a unified customer master table. After initial merging, there are 20% more records than expected, including duplicates with slight name variations (e.g., "John Smith" vs "John A. Smith"). To ensure accurate consolidation, which data concept should the analyst prioritize applying first?

A.Data profiling
B.Data standardization
C.Data indexing
D.Data encryption
AnswerB

Standardizing name formats to a common convention reduces variations and allows accurate matching and deduplication.

Why this answer

Data standardization is the correct first step because it resolves the inconsistent name formats (e.g., 'FirstName LastName', 'LastName, FirstName', and separate fields) into a single, consistent representation. By applying a standardized format (e.g., 'FirstName LastName'), the analyst can then accurately identify and merge duplicates like 'John Smith' and 'John A. Smith' using fuzzy matching or exact matching on the standardized values.

This ensures the unified customer master table has the correct number of records without the 20% inflation caused by formatting variations.

Exam trap

The trap here is that candidates confuse data profiling (which only identifies issues) with data standardization (which actively resolves format inconsistencies), leading them to choose A instead of B, even though profiling alone cannot fix the duplicate records caused by name variations.

How to eliminate wrong answers

Option A is wrong because data profiling is an exploratory process that assesses data quality and structure (e.g., detecting nulls, patterns, or anomalies), but it does not transform or resolve the inconsistent name formats that cause duplicate records. Option C is wrong because data indexing improves query performance by creating sorted structures (e.g., B-trees or hash indexes) on columns, but it does not address the underlying data inconsistency or deduplication needed for accurate consolidation. Option D is wrong because data encryption protects data at rest or in transit (e.g., using AES-256 or TLS 1.3), but it has no role in standardizing name formats or removing duplicates from merged datasets.
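
The standardization step can be sketched as below. The function names are hypothetical, and the target convention ("FirstName LastName") is one possible choice; variations such as a middle initial would still need fuzzy matching afterwards.

```python
def standardize_name(raw):
    """Convert 'LastName, FirstName' or 'FirstName LastName' to 'FirstName LastName'."""
    raw = raw.strip()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        return f"{first} {last}"
    return " ".join(raw.split())  # collapse repeated whitespace

def from_fields(first_name, last_name):
    """Combine separate first/last fields into the same convention."""
    return f"{first_name.strip()} {last_name.strip()}"

ecommerce = standardize_name("John  Smith")
pos = standardize_name("Smith, John")
loyalty = from_fields("John", "Smith ")

print(ecommerce == pos == loyalty)  # True
```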

655
MCQmedium

A report uses red and green colors to indicate performance; a stakeholder who is colorblind complains. What is the BEST remedy?

A.Remove color entirely
B.Change colors to blue and orange
C.Add patterns or labels in addition to color
D.Use a different chart type
AnswerC

Patterns and labels provide redundant encoding that works for all users.

Why this answer

Option C is correct because adding patterns or labels alongside color ensures that the data encoding is accessible to colorblind viewers without removing the informational value of color for those who can see it. This approach follows WCAG (Web Content Accessibility Guidelines) 2.1 Success Criterion 1.4.1, which requires that color is not the sole means of conveying information. In data visualization tools like Tableau or Power BI, this is implemented by adding shape markers or text annotations to series, preserving the original color scheme for non-colorblind users while providing redundant encoding.

Exam trap

The trap here is that candidates assume the best remedy is to simply change colors to a more colorblind-friendly palette (like blue and orange), but CompTIA often tests the deeper principle that color alone is never sufficient for accessibility—redundant encoding via patterns or labels is required to meet WCAG standards.

How to eliminate wrong answers

Option A is wrong because removing color entirely eliminates a key visual encoding channel, which can reduce the chart's effectiveness for the majority of viewers and may violate best practices for data visualization that leverage color for quick pattern recognition. Option B is wrong because while blue and orange are more colorblind-friendly than red and green, they still rely solely on color differentiation, which fails for users with complete color blindness (achromatopsia) or other deficiencies; the WCAG requirement is to not use color as the only means of distinction. Option D is wrong because changing the chart type does not address the core issue of color-dependent encoding; a different chart type (e.g., bar chart instead of pie chart) still uses color to differentiate categories, so the accessibility problem persists unless patterns or labels are added.

656
MCQhard

A data analyst has a time series of monthly sales data. They observe that sales are consistently higher every December and lower every January. Which component of time series does this pattern represent?

A.Irregular
B.Cyclical
C.Seasonality
D.Trend
AnswerC

Seasonality refers to fixed periodic patterns within a year.

Why this answer

Regular patterns that repeat within one year are seasonality.

657
MCQmedium

A data analyst is creating a dashboard in Looker Studio for an e-commerce company. They want to display the average order value by product category, and also allow users to filter by date range. Which combination of elements should be used?

A.Dimension: Order Value; Metric: Product Category; Control: Date Range
B.Dimension: Date; Metric: Product Category; Control: None
C.Dimension: Average Order Value; Metric: Product Category; Control: Date Range
D.Dimension: Product Category; Metric: Average Order Value; Control: Date Range
AnswerD

Product category is a dimension, average order value is a metric, and date range control allows filtering.

Why this answer

In Looker Studio, dimensions are fields like product category, metrics are aggregations like average order value, and date range controls provide filtering.

658
Multi-Selecthard

A data analyst in a healthcare organization needs to visualize patient wait times across different departments. The data includes department names (categorical), wait times in minutes (continuous), and time of day. Which THREE chart types could be effectively used to explore this data?

Select 3 answers
A.Funnel chart
B.Scatter plot
C.Histogram
D.Heat map
E.Box plot
AnswersB, D, E

Can show relationship between time of day and wait time.

Why this answer

Box plots show distribution and outliers; scatter plots can show wait time vs. time of day; heat maps can show average wait time by department and time period. Histograms show distribution of a single variable, funnel charts are for pipelines.

659
MCQhard

Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?

A.bathrooms
B.sqft_living
C.Intercept
D.bedrooms
AnswerB

sqft_living has the highest absolute t-value (10.0) indicating strong effect.

Why this answer

Standardized coefficients (beta weights) allow comparison of the relative strength of independent variables by measuring the number of standard deviations the dependent variable changes per one standard deviation change in the predictor. In the regression output, sqft_living has the highest absolute standardized coefficient, indicating it has the strongest effect on price. The intercept is not an independent variable and its coefficient is not standardized for comparison.

Exam trap

The trap here is that candidates mistakenly compare unstandardized coefficients or p-values instead of standardized coefficients, leading them to choose a variable like bathrooms or bedrooms that appears significant but has a weaker standardized effect.

How to eliminate wrong answers

Option A is wrong because bathrooms may have a statistically significant coefficient, but its standardized coefficient is smaller than that of sqft_living, meaning it has a weaker relative effect on price. Option C is wrong because the intercept is a constant term representing the predicted price when all independent variables are zero; it is not an independent variable and its coefficient is not standardized for effect comparison. Option D is wrong because bedrooms, while possibly significant, has a lower absolute standardized coefficient than sqft_living, indicating a weaker influence on price per standard deviation change.
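
As a rough illustration of why raw and standardized coefficients can rank predictors differently, the sketch below converts unstandardized coefficients to beta weights with the usual relation beta = b * sd(x) / sd(y). All numbers are hypothetical and are not taken from the exhibit; the helper name `standardized_betas` is also an assumption made for this example.

```python
# Hypothetical unstandardized coefficients (price change per one unit of x).
# These values are illustrative only; they do not come from the exhibit.
coefs = {"sqft_living": 150.0, "bedrooms": 10000.0, "bathrooms": 20000.0}

# Hypothetical standard deviations of each predictor and of price.
sd_x = {"sqft_living": 500.0, "bedrooms": 1.0, "bathrooms": 0.5}
sd_y = 100000.0


def standardized_betas(coefs, sd_x, sd_y):
    """Convert raw coefficients to beta weights: b * sd(x) / sd(y)."""
    return {name: b * sd_x[name] / sd_y for name, b in coefs.items()}


betas = standardized_betas(coefs, sd_x, sd_y)

# The predictor with the largest raw coefficient ...
largest_raw = max(coefs, key=lambda k: abs(coefs[k]))
# ... is not necessarily the one with the largest standardized effect.
strongest = max(betas, key=lambda k: abs(betas[k]))

print(largest_raw, strongest)
```

In this made-up data, bathrooms has the largest raw coefficient only because it is measured in large units of price per bathroom, while sqft_living has the largest effect once every variable is put on a standard-deviation scale.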

660
Multi-Selectmedium

A data analyst is performing data profiling on a customer table. Which TWO metrics are commonly used to assess the completeness of a column? (Select TWO.)

Select 2 answers
A.Row count
B.Null count
C.Cardinality
D.Mean
E.Standard deviation
AnswersA, B

Total rows; used to compute percentage complete.

Why this answer

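Completeness is assessed from null count and row count: null count gives the number of missing values, row count gives the total number of rows, and together they yield the percentage complete, (row count - null count) / row count. Cardinality measures distinct values, while mean and standard deviation describe the values that are present.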
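
A minimal sketch of the calculation, assuming the column has been loaded into a Python list where missing values appear as `None`. The function name `completeness` and the sample email column are hypothetical.

```python
def completeness(values):
    """Return (row_count, null_count, percent_complete) for one column."""
    row_count = len(values)
    null_count = sum(1 for v in values if v is None)
    # Guard against an empty column to avoid dividing by zero.
    pct = 100.0 * (row_count - null_count) / row_count if row_count else 0.0
    return row_count, null_count, pct


# Hypothetical customer email column with two missing entries.
emails = ["a@example.com", None, "c@example.com", None, "e@example.com"]
rows, nulls, pct = completeness(emails)
print(rows, nulls, pct)  # 5 2 60.0
```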

661
Multi-Selectmedium

Which TWO chart types are best suited to show the proportion of total sales contributed by each product category? (Select exactly two.)

Select 2 answers
A.Histogram
B.Scatter plot
C.Stacked bar chart
D.Pie chart
E.Line chart
AnswersC, D

Stacked bars show composition across categories.

Why this answer

Options C and D are correct. Pie charts and stacked bar charts effectively show parts of a whole. Line charts show trends; scatter plots show relationships; histograms show distributions.
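
Both chart types plot each category's share of the total. A small sketch of that underlying calculation, using made-up category names and sales figures:

```python
# Hypothetical sales totals per product category.
sales = {"Electronics": 500.0, "Clothing": 300.0, "Grocery": 200.0}

total = sum(sales.values())

# Share of total sales, in percent, for each category.
shares = {category: 100.0 * amount / total for category, amount in sales.items()}

for category, share in shares.items():
    print(f"{category}: {share:.1f}%")
```

The shares sum to 100%, which is the property that makes a pie chart or a single stacked bar a sensible encoding.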

662
MCQhard

Refer to the exhibit. What is the best course of action for the data analyst?

A.Change the default value to 'NA'
B.Delete rows with missing 'Region'
C.Investigate source data to understand why 'Region' is missing
D.Ignore the warning as it is handled
AnswerC

Root cause analysis helps improve data quality at the source.

Why this answer

Option C is correct because missing data in a critical field like 'Region' indicates a potential data quality issue that requires root cause analysis. The data analyst should investigate the source system to determine whether the missing values result from a collection error, a transformation bug, or a legitimate null, rather than making assumptions by deleting rows or filling defaults. This aligns with the principle of data provenance and ensures that downstream insights are based on accurate, traceable data.

Exam trap

The trap here is that candidates assume missing data should be immediately cleaned by deletion or default values, but CompTIA tests the understanding that a data analyst's first responsibility is to investigate the source of the anomaly before applying any transformation.

How to eliminate wrong answers

Option A is wrong because blindly changing the default value to 'NA' introduces artificial data that may skew aggregations or filter logic, and it does not address the underlying reason for the missing values. Option B is wrong because deleting rows with missing 'Region' could discard valid records if the missing data is due to a temporary system glitch or a mapping error, leading to biased analysis and loss of potentially important context. Option D is wrong because ignoring the warning assumes the missing data is harmless, but in practice, missing region information can break geographic segmentation, violate data integrity constraints, and cause incorrect reporting in dashboards or ETL pipelines.

663
MCQmedium

A company is building a data pipeline to ingest sensor data from IoT devices. The data arrives continuously in small batches and must be processed in real-time for monitoring. Which type of data source best describes this scenario?

A.Transactional database
B.Streaming data
C.Web scraping
D.Flat file
AnswerB

IoT sensors produce streaming data that is continuous and requires real-time processing.

Why this answer

B is correct because the scenario describes data arriving continuously in small batches that must be processed in real-time for monitoring. This is the defining characteristic of streaming data, which is typically ingested via technologies like Apache Kafka, Amazon Kinesis, or MQTT brokers, enabling low-latency processing and immediate alerting.

Exam trap

The trap here is that candidates may confuse 'real-time' with 'fast batch processing' and incorrectly choose a transactional database, not recognizing that streaming data sources are specifically designed for continuous, unbounded data flows with sub-second latency requirements.

How to eliminate wrong answers

Option A is wrong because a transactional database (e.g., PostgreSQL, MySQL) is designed for ACID-compliant, query-based storage and retrieval, not for continuous real-time ingestion of sensor data; it would introduce latency and cannot handle unbounded streams efficiently. Option C is wrong because web scraping is a technique for extracting data from web pages via HTTP requests (e.g., using BeautifulSoup or Scrapy), which is batch-oriented and not suited for real-time IoT sensor data. Option D is wrong because a flat file (e.g., CSV, JSON file) is a static storage format that requires manual or scheduled batch loads, making it incapable of supporting real-time processing or continuous ingestion.

664
MCQhard

A data engineer is designing a data pipeline to ingest streaming data from IoT sensors. The sensors send data every second, and the pipeline must handle bursts of up to 10,000 messages per second. Which approach is most appropriate for capturing this data before processing?

A.Directly write each message to a relational database
B.Load directly into a data warehouse
C.Use a message queue to buffer the incoming data
D.Store data in flat files and process in nightly batches
AnswerC

A message queue handles high throughput and provides reliable buffering.

Why this answer

Option C is correct because a message queue (e.g., Apache Kafka, Amazon Kinesis, or RabbitMQ) provides an asynchronous buffer that decouples the high-velocity ingestion (up to 10,000 messages/second) from downstream processing. This allows the pipeline to absorb burst traffic without overwhelming the processing layer, ensures data durability, and supports replayability in case of failures.

Exam trap

CompTIA often tests the misconception that relational databases or data warehouses can handle real-time streaming ingestion at scale, when in fact they require a buffering layer like a message queue to absorb bursts and decouple ingestion from processing.

How to eliminate wrong answers

Option A is wrong because directly writing each message to a relational database (RDBMS) at 10,000 messages/second is likely to cause write contention and I/O bottlenecks, risking dropped data and unacceptable latency. Option B is wrong because loading directly into a data warehouse (e.g., Snowflake, Redshift) is designed for batch or micro-batch ingestion, not for real-time streaming at this scale; without prior buffering it is costly and handles bursty throughput poorly. Option D is wrong because storing data in flat files and processing in nightly batches introduces unacceptable latency (up to 24 hours) for streaming IoT data and provides no mechanism for absorbing bursts or replaying messages after a failure.
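
The buffering idea can be sketched with Python's standard-library `queue.Queue`. This is only an in-process stand-in for a real broker such as Kafka or RabbitMQ: it offers no durability or replay. The message shape and the burst size of 1,000 are assumptions for the example.

```python
import queue
import threading

# Bounded in-memory buffer; a stand-in for a message broker.
buffer = queue.Queue(maxsize=10_000)
processed = []


def consumer():
    """Drain the buffer at the consumer's own pace until a sentinel arrives."""
    while True:
        msg = buffer.get()
        if msg is None:  # sentinel: the producer has finished
            break
        processed.append(msg)


worker = threading.Thread(target=consumer)
worker.start()

# Producer: a burst of hypothetical sensor messages.
# put() returns as soon as the message is queued, so ingestion
# is decoupled from how fast the consumer processes messages.
for i in range(1000):
    buffer.put({"sensor_id": i % 10, "reading": i})

buffer.put(None)  # tell the consumer to stop
worker.join()

print(len(processed))  # 1000
```

The producer never waits on processing unless the buffer fills up, which is the decoupling the correct answer relies on.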

665
Multi-Selecthard

Which THREE of the following are valid methods for handling missing data?

Select 3 answers
A.Using a placeholder like 'Unknown' for categorical data
B.Ignoring missing values and proceeding with analysis
C.Replacing missing values with the mean of the column
D.Sorting the data to bring missing values to the top
E.Deleting rows with missing values
AnswersA, C, E

Placeholder is a valid approach.

Why this answer

Option A is correct because using a placeholder like 'Unknown' for categorical missing data preserves the dataset's structure and allows analysis to proceed without discarding rows. This method is particularly suited to nominal data, where the missing category can be treated as a distinct value, enabling downstream operations like one-hot encoding or frequency analysis. Option C (mean imputation) and Option E (deleting rows with missing values) are likewise standard techniques, although each can distort the data if values are not missing at random.

Exam trap

The trap here is that candidates may confuse 'handling missing data' with general preprocessing steps. Sorting (Option D) is a data organization technique, not a method for dealing with missing values. Ignoring missing data (Option B) is often mistakenly considered acceptable in quick analyses, but it leaves the gaps unaddressed and violates best practices for robust analysis.
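
The three valid methods can be sketched in plain Python as follows. The rows, field names, and values are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean

# Hypothetical records with one missing categorical and one missing numeric value.
rows = [
    {"region": "East", "amount": 10.0},
    {"region": None, "amount": 20.0},
    {"region": "West", "amount": None},
]

# Option A: replace a missing category with an explicit placeholder.
with_placeholder = [
    {**r, "region": "Unknown" if r["region"] is None else r["region"]}
    for r in rows
]

# Option C: replace a missing number with the mean of the observed values.
observed = [r["amount"] for r in rows if r["amount"] is not None]
col_mean = mean(observed)
imputed = [
    {**r, "amount": col_mean if r["amount"] is None else r["amount"]}
    for r in rows
]

# Option E: drop any row that contains a missing value (listwise deletion).
complete = [r for r in rows if all(v is not None for v in r.values())]

print(with_placeholder[1]["region"], imputed[2]["amount"], len(complete))
```

Note that deletion keeps only one of the three rows here, which shows how quickly it can shrink a small dataset.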

666
Multi-Selecthard

When presenting data insights, which three practices help ensure clarity? (Choose THREE.)

Select 3 answers
A.Include all data points in the chart
B.Overlay multiple chart types on one axis
C.Use consistent color coding
D.Label axes clearly
E.Provide context and benchmarks
AnswersC, D, E

Consistency helps audiences quickly interpret visual elements.

Why this answer

Option C is correct because consistent color coding across charts and dashboards reduces cognitive load, allowing viewers to quickly associate colors with specific categories or metrics. This practice aligns with data visualization best practices for clarity, as it leverages pre-attentive processing to speed up interpretation without requiring the audience to re-learn the mapping for each chart.

Exam trap

The trap here is that candidates may think including all data points (Option A) ensures completeness, but the exam tests the understanding that clarity often requires simplification and focus, not exhaustive detail.

667
MCQeasy

A data analyst is asked to create a visualization showing the trend of monthly revenue over the last five years. Which chart type is most suitable?

A.Pie chart.
B.Histogram.
C.Line chart.
D.Bar chart.
AnswerC

Line charts effectively show changes over time with continuous data.

Why this answer

A line chart is the most suitable choice because it explicitly shows the trend of monthly revenue over a continuous time period (five years). The x-axis represents time (months), and the y-axis represents revenue, allowing the viewer to easily identify patterns, seasonality, and overall direction. This makes it ideal for time-series data visualization.

Exam trap

The trap here is that candidates may pick a histogram because it also plots a continuous variable, but a histogram bins values into ranges and loses the temporal sequence, while a line chart preserves the order of time-series data.

How to eliminate wrong answers

Option A is wrong because a pie chart is used to show parts of a whole (proportions) at a single point in time, not trends over a continuous period. Option B is wrong because a histogram displays the distribution of a continuous variable by binning data into intervals, but it does not inherently show a trend over time—it would obscure the sequential order of months. Option D is wrong because a bar chart compares discrete categories or values at specific points, but it does not emphasize the continuous flow and trend direction as effectively as a line chart for time-series data.

668
MCQeasy

A data analyst is importing a CSV file that contains a mixture of numeric and text fields. What is the most common issue when importing?

A.Duplicate rows
B.Missing header row
C.Data types being incorrectly inferred
D.File size limitation
AnswerC

CSV import tools often guess types incorrectly, leading to conversion errors.

Why this answer

Data type inference often fails, causing numbers to be read as text or vice versa. File size limitations, missing headers, and duplicate rows are less common or not specific to mixed types.
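
A short sketch of the problem using Python's standard `csv` module, which returns every field as a string and leaves type conversion to the caller. The column names and values are made up; the ZIP code is chosen to show how treating a text field as numeric silently changes the data.

```python
import csv
import io

# Hypothetical CSV content with numeric and text fields mixed together.
raw = "customer_id,zip_code,amount\n1,02134,19.99\n2,10001,5.00\n"

# csv.DictReader yields every value as a string.
rows = list(csv.DictReader(io.StringIO(raw)))

# Converting each column explicitly avoids relying on a tool's guess.
typed = [
    {
        "customer_id": int(r["customer_id"]),
        "zip_code": r["zip_code"],        # keep as text to preserve the leading zero
        "amount": float(r["amount"]),
    }
    for r in rows
]

# What a naive numeric inference would do to the ZIP code.
naive_zip = int(rows[0]["zip_code"])

print(typed[0]["zip_code"], naive_zip)  # 02134 2134
```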

669
MCQmedium

An analyst is conducting an A/B test to compare two website designs. The null hypothesis is that there is no difference in conversion rates. The p-value obtained is 0.03, and the significance threshold is 0.05. What should the analyst conclude?

A.Reject the null hypothesis; there is a significant difference.
B.Accept the alternative hypothesis that the new design is better.
C.The test is inconclusive; need a larger sample size.
D.Fail to reject the null hypothesis; there is no significant difference.
AnswerA

Correct: p < α, reject null.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.
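
One common way to obtain such a p-value is a two-sided, two-proportion z-test; the question does not say which test was used, so that choice is an assumption here. The function names and the visitor and conversion counts are hypothetical.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both designs have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def decide(p_value, alpha=0.05):
    """Apply the decision rule from the explanation above."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical counts: 100 of 1,000 visitors convert on A, 130 of 1,000 on B.
p = two_proportion_p_value(100, 1000, 130, 1000)
print(decide(p))
print(decide(0.03))  # the p-value given in the question
```

Rejecting the null in a two-sided test shows only that the rates differ, which is why option B ("the new design is better") claims more than the result supports.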

670
MCQeasy

An organization needs to store raw data from IoT sensors in its native format for future analysis. Which storage solution is best suited for this purpose?

A.Relational database
B.Data lake
C.Data mart
D.Data warehouse
AnswerB

Data lakes store raw data in native format.

Why this answer

A data lake is designed to store raw data in its native format, including unstructured and semi-structured data from IoT sensors, without requiring a predefined schema. This allows the organization to preserve the original data for future analysis, unlike traditional databases that enforce structure upon ingestion.

Exam trap

The trap here is that candidates often confuse a data warehouse with a data lake, assuming both are for storage, but a data warehouse requires ETL and structured schemas, making it unsuitable for raw, native-format IoT data.

How to eliminate wrong answers

Option A is wrong because a relational database requires a predefined schema and is optimized for structured data, not raw, native-format IoT sensor data. Option C is wrong because a data mart is a subset of a data warehouse focused on a specific business domain, not designed for storing raw, unprocessed data. Option D is wrong because a data warehouse stores processed, structured, and transformed data for analytical queries, not raw data in its native format.

671
Multi-Selectmedium

A data analyst is building a Power BI report to track KPIs for a retail chain. Which TWO of the following are considered leading indicators? (Choose two.)

Select 2 answers
A.Number of employees
B.Customer satisfaction score
C.Number of website visits
D.Profit margin
E.Total sales revenue
AnswersB, C

High satisfaction often leads to repeat purchases and referrals.

Why this answer

Leading indicators predict future performance. Number of website visits and customer satisfaction score can foreshadow future sales, whereas total sales and profit margin are lagging indicators. Employee count is operational but not typically leading.

672
MCQeasy

Which of the following is an example of unstructured data?

A.A JSON file
B.An image file
C.A relational database table
D.A CSV file with rows and columns
AnswerB

Images are unstructured.

Why this answer

Unstructured data has no predefined schema. Images are a classic example of unstructured data.

673
MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following best describes how these measures are used in descriptive statistics?

A.To identify outliers using standard deviation
B.To test hypotheses about population parameters
C.To describe the central tendency of the data
D.To determine the probability of an event
AnswerC

Mean, median, and mode are measures of central tendency.

Why this answer

Descriptive statistics summarize data using measures like mean, median, and mode to describe central tendency.
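
For a concrete case, Python's standard `statistics` module computes all three measures. The sample values below are made up.

```python
import statistics

# Hypothetical sample of wait times in minutes.
data = [2, 3, 3, 5, 7]

mean_value = statistics.mean(data)      # arithmetic average
median_value = statistics.median(data)  # middle value when sorted
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)  # 4 3 3
```

Here the mean (4) sits above the median (3) because the largest value pulls the average upward, a small example of why the three measures are reported together.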

674
MCQeasy

When creating a data story, what should be the FIRST step?

A.Design the layout of the report
B.Gather all available data
C.Choose visualization tools
D.Identify the key message
AnswerD

The key message frames the story and directs the entire process.

Why this answer

The first step in creating a data story is to identify the key message because the entire narrative, data selection, and visualization choices must align with a central insight. Without a clear message, subsequent steps like gathering data or designing layouts risk being unfocused and failing to communicate actionable findings. This ensures the story is driven by the insight, not by the data or tools.

Exam trap

CompTIA often tests the misconception that data gathering or tool selection is the logical starting point, tempting candidates to skip the critical step of defining the core insight first.

How to eliminate wrong answers

Option A is wrong because designing the layout of the report should occur after the key message is defined; layout is a presentation concern, not the foundational step. Option B is wrong because gathering all available data without a guiding message leads to analysis paralysis and irrelevant information; data should be collected with the key message in mind. Option C is wrong because choosing visualization tools depends on the data and the story to be told; selecting tools prematurely can limit the narrative or force inappropriate chart types.

675
MCQeasy

A marketing team wants to compare the growth rates of four product categories over the last quarter. Which chart type would best display this?

A.Pie chart
B.Stacked bar chart
C.Line chart
D.Bar chart
AnswerD

Bar charts effectively compare discrete values like growth percentages.

Why this answer

Option D is correct because a bar chart places each category's growth percentage side by side on a common axis, making the comparison clear and straightforward. The other options are less effective for comparing a single discrete metric across categories.
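
The values such a bar chart would plot are one growth percentage per category. A minimal sketch, assuming growth is measured from the start to the end of the quarter; the category names and revenue figures are hypothetical.

```python
# Hypothetical (start_of_quarter, end_of_quarter) revenue per category.
revenue = {
    "Category A": (100.0, 120.0),
    "Category B": (200.0, 210.0),
    "Category C": (50.0, 45.0),
    "Category D": (80.0, 100.0),
}


def growth_rate(start, end):
    """Percentage change from start to end."""
    return 100.0 * (end - start) / start


growth = {name: growth_rate(start, end) for name, (start, end) in revenue.items()}

# One bar per category, sorted from fastest to slowest growth.
for name, pct in sorted(growth.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {pct:+.1f}%")
```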
