CCNA Analyzing Modeling Data Questions

75 of 84 questions · Page 1/2 · Analyzing Modeling Data topic · Answers revealed

1
MCQ · hard

Refer to the exhibit. Before running the code, the original salary column had 50 missing values. The median was calculated as 52000. After imputation, which of the following statements is true?

A. The mean decreased significantly
B. The standard deviation increased
C. The median remains unchanged
D. The minimum value decreased
Answer: C

Since missing values are replaced by the median, the median of the dataset does not change.

Why this answer

Imputing missing values with the median (52000) replaces only the 50 missing entries with that value, leaving all original non-missing values unchanged. Since the median is a positional statistic, adding values equal to the current median does not shift the middle position of the sorted data, so the median remains unchanged. This is why option C is correct.

Exam trap

CompTIA often tests the misconception that imputing with the median will change the median itself, when in fact adding values equal to the current median leaves the median unchanged because it is a rank-based statistic.

How to eliminate wrong answers

Option A is wrong because imputing with the median does not significantly change the mean; the mean can shift slightly toward the median, but a large decrease would require the observed mean to sit far above 52000. Option B is wrong because the imputed values all sit at the median, near the center of the data, so they typically reduce the variance and the standard deviation decreases rather than increases. Option D is wrong because the minimum value is unaffected: imputation only adds values at the median, which cannot be below the minimum, so the minimum remains the same.
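
The effect described above can be checked on a small example. This is a minimal sketch using only Python's standard library; the salary figures are made up for illustration (they are not the exhibit's data) and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical salaries (illustrative only); None marks a missing value.
salaries = [30000, 41000, 52000, 52000, 60000, 75000, 120000, None, None, None]

observed = [s for s in salaries if s is not None]
median_before = statistics.median(observed)

# Replace every missing entry with the median of the observed values.
imputed = [median_before if s is None else s for s in salaries]

median_after = statistics.median(imputed)
stdev_before = statistics.stdev(observed)
stdev_after = statistics.stdev(imputed)

print(median_before == median_after)   # the median does not move
print(stdev_after < stdev_before)      # the spread shrinks
print(min(observed) == min(imputed))   # the minimum is untouched
```

On this sample all three checks hold, which matches the reasoning for options B, C, and D.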

2
MCQ · medium

A data analyst is performing a hypothesis test with a significance level of 0.05. The p-value obtained is 0.03. What should the analyst conclude?

A. Reject the null hypothesis
B. Fail to reject the null hypothesis
C. Accept the null hypothesis
D. The result is practically significant
Answer: A

A p-value below alpha indicates a statistically significant result.

Why this answer

Since the p-value (0.03) is less than the significance level (0.05), the result is statistically significant. This means the observed data provides sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The analyst should conclude that there is a statistically significant effect or difference.

Exam trap

The trap here is that candidates often confuse 'fail to reject' with 'accept' the null hypothesis, or they mistakenly think a p-value less than α means the null hypothesis is proven false with certainty, rather than just providing sufficient evidence to reject it.

How to eliminate wrong answers

Option B is wrong because failing to reject the null hypothesis occurs only when the p-value is greater than or equal to the significance level (p ≥ 0.05), not when it is smaller. Option C is wrong because hypothesis testing never 'accepts' the null hypothesis; we either reject it or fail to reject it, as acceptance implies proof of truth, which is not a valid statistical conclusion. Option D is wrong because practical significance is a separate consideration from statistical significance; a statistically significant result (p < 0.05) does not automatically imply practical importance, and the question only asks about the hypothesis test conclusion.
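
The decision rule used in this question can be written as a short function. This is only a sketch: the function name `decide` and its wording are illustrative, and the `p < alpha` convention follows the explanation above.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conventional wording for a significance-test decision."""
    if p_value < alpha:
        return "reject the null hypothesis"
    # Never "accept": a large p-value only means the evidence is insufficient.
    return "fail to reject the null hypothesis"


print(decide(0.03))  # the scenario in the question
```

Note that the function says nothing about practical significance, which has to be judged separately from the size of the effect.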

3
Multi-Select · medium

Which TWO of the following are common assumptions of linear regression?

Select 2 answers
A. Independence of observations
B. No multicollinearity
C. Linearity of the relationship
D. Normality of the dependent variable
E. Homoscedasticity
Answers: C, E

Linear regression assumes a linear relationship between predictors and outcome, with constant residual variance.

Why this answer

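Linear regression assumes that the relationship between the independent and dependent variables is linear (option C). This means the model expects that a unit change in the predictor results in a constant change in the outcome, which is the core assumption for ordinary least squares (OLS) estimation to produce unbiased coefficients. It also assumes homoscedasticity (option E): the residuals have constant variance across all levels of the predictors, which the usual standard errors and significance tests depend on.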

Exam trap

The trap here is that candidates often confuse the assumption of normality of residuals with normality of the dependent variable, leading them to incorrectly select option D instead of recognizing that homoscedasticity (option E) is a core assumption.

4
MCQ · hard

A data scientist is analyzing a dataset with 100 variables and 5,000 records. The dataset has several missing values and a few extreme outliers. The goal is to build a regression model to predict a continuous target. Which combination of preprocessing steps is most likely to improve model performance?

A. Impute missing values with median, apply robust scaling, and then log transform skewed variables
B. Impute missing values with mean, then use PCA for dimensionality reduction
C. Drop all rows with missing values, then apply min-max scaling
D. Remove outliers using Z-score, then apply standard scaling
Answer: A

Median imputation is robust, robust scaling handles outliers, log transform handles skewness.

Why this answer

Option A is correct because imputing missing values with the median is robust to outliers, robust scaling handles extreme values by using median and IQR, and log transformation reduces skewness in predictors. This combination preserves data integrity and stabilizes variance, which is critical for regression models on a dataset with 100 variables and 5,000 records.

Exam trap

CompTIA often tests the misconception that mean imputation and standard scaling are universally safe, but the trap here is that outliers and skewness require robust methods like median imputation and robust scaling to avoid distorting the model.

How to eliminate wrong answers

Option B is wrong because imputing with the mean is sensitive to outliers, which can distort the distribution and negatively affect PCA, and PCA may discard important variance related to the target. Option C is wrong because dropping all rows with missing values reduces the already limited 5,000 records, potentially losing significant information and introducing bias, and min-max scaling is not robust to outliers. Option D is wrong because removing outliers using Z-score assumes a normal distribution, which may not hold with skewed variables, and standard scaling is also sensitive to outliers, leading to poor model performance.
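
The three techniques named in option A can be sketched with the standard library alone. The helper names and the income values below are hypothetical. The sketch applies the log transform to the non-negative raw column before robust scaling, because a logarithm is undefined for the negative values that scaling around the median produces; that ordering is a choice made for this illustration.

```python
import math
import statistics


def impute_median(values):
    """Fill None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def log_transform(values):
    """log(1 + x) compresses large values; requires x >= 0."""
    return [math.log1p(v) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - med) / (q3 - q1) for v in values]


# Hypothetical skewed column with one missing value and one extreme outlier.
income = [32000, 35000, None, 41000, 38000, 36000, 900000, 40000]

filled = impute_median(income)
scaled = robust_scale(log_transform(filled))
print([round(v, 2) for v in scaled])
```

Because the median and IQR ignore the tails, the single 900000 entry does not drag the center or the scale of the other values.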

5
MCQ · medium

In a dataset with variables on different scales (e.g., age in years and income in dollars), which preprocessing step is necessary before applying k-means clustering?

A. Feature selection
B. Dimensionality reduction
C. Normalization (scaling)
D. One-hot encoding
Answer: C

Normalization ensures each feature contributes equally to distance calculations.

Why this answer

K-means clustering relies on Euclidean distance to measure similarity between data points. When variables like age (in years) and income (in dollars) are on different scales, the variable with larger numeric values (income) will dominate the distance calculation, skewing the clustering results. Normalization (scaling), such as min-max scaling or z-score standardization, rescales all features to a comparable range (e.g., [0,1] or mean=0, variance=1), ensuring each feature contributes equally to the distance computation.

Exam trap

The trap here is that candidates may confuse normalization with other preprocessing steps like feature selection or dimensionality reduction, thinking that removing irrelevant features or reducing dimensions will automatically fix scale differences, but k-means specifically requires scaling to ensure equal feature influence in distance calculations.

How to eliminate wrong answers

Option A is wrong because feature selection is about choosing a subset of relevant features to reduce noise or improve model performance, but it does not address the issue of differing scales among features, which is required before k-means. Option B is wrong because dimensionality reduction (e.g., PCA) reduces the number of features, but it does not inherently scale the data; scaling is typically performed before dimensionality reduction, not as a substitute for it. Option D is wrong because one-hot encoding is used to convert categorical variables into numerical format, not to handle numerical variables on different scales; applying one-hot encoding to already numerical features would be incorrect and does not solve the scaling problem.
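
To see how income dominates an unscaled Euclidean distance, the sketch below measures what share of the squared distance between two customers comes from the age column, before and after z-score scaling. The three customers and the helper names are hypothetical.

```python
import statistics

# Hypothetical customers: (age in years, income in dollars).
customers = [(25, 50000), (60, 51000), (26, 58000)]


def zscore_columns(rows):
    """Standardize each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]


def age_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the age column."""
    d_age = (a[0] - b[0]) ** 2
    d_income = (a[1] - b[1]) ** 2
    return d_age / (d_age + d_income)


scaled = zscore_columns(customers)
raw_share = age_share(customers[0], customers[1])
scaled_share = age_share(scaled[0], scaled[1])

print(f"age share of distance, raw:    {raw_share:.4f}")
print(f"age share of distance, scaled: {scaled_share:.4f}")
```

In the raw data a 35-year age gap is almost invisible next to a 1,000-dollar income gap; after scaling, the age gap accounts for most of the distance between the same two customers.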

6
MCQ · hard

A data analyst at a retail company is building a multiple linear regression model to forecast weekly sales. The dataset contains 50 predictor variables, including store size, promotional spend, holiday indicators, and many others. After training the model, the analyst observes an R-squared of 0.99 on the training set but only 0.55 on the holdout test set. Which action should the analyst take first to address this discrepancy?

A. Remove highly correlated predictor variables and apply regularization (e.g., Ridge or Lasso).
B. Add more predictor variables to increase the training R-squared further.
C. Use k-fold cross-validation with a different random seed to get a more reliable test set estimate.
D. Increase the number of hidden layers in the model to capture more complexity.
Answer: A

Regularization and feature selection reduce overfitting by penalizing large coefficients and removing redundant predictors.

Why this answer

The high R-squared of 0.99 on training data versus 0.55 on test data is a classic sign of overfitting, where the model has learned noise and specific patterns in the training set that do not generalize. Removing highly correlated predictors reduces multicollinearity and model complexity, while regularization (Ridge or Lasso) penalizes large coefficients, shrinking them to prevent overfitting. This is the most direct first step to improve generalization.

Exam trap

The trap here is that candidates may think a high R-squared is always good, or they may confuse overfitting with underfitting and choose to add more complexity (Option D) or more predictors (Option B), rather than recognizing the need to reduce model complexity and apply regularization.

How to eliminate wrong answers

Option B is wrong because adding more predictor variables would increase the training R-squared but worsen overfitting, making the test set performance even lower. Option C is wrong because k-fold cross-validation with a different random seed does not address the fundamental overfitting issue; it only provides a different estimate of test error but does not change the model's tendency to overfit. Option D is wrong because increasing the number of hidden layers (a neural network technique) is irrelevant for a multiple linear regression model and would introduce unnecessary complexity, likely exacerbating overfitting.

7
Multi-Select · hard

Which THREE of the following are appropriate methods to handle outliers in a dataset?

Select 3 answers
A. Transforming the data using log transformation
B. Removing the outlier records
C. Capping the outlier values at a certain percentile
D. Binning continuous variables
E. Imputing outliers with the mean
Answers: A, B, C

Transformation can reduce the impact of outliers.

Why this answer

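Option A is correct because log transformation compresses the scale of data, reducing the impact of extreme values and making the distribution more symmetric. This is a standard technique for handling skewed data where outliers are present, as it preserves the relative order of observations while mitigating outlier influence. Removing the outlier records (option B) is appropriate when the extreme values are errors or fall outside the population of interest, and capping at a chosen percentile (option C) limits their influence while keeping every record.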

Exam trap

The trap here is that candidates may confuse data preprocessing techniques like binning or imputation with outlier handling methods, but binning is for discretization and mean imputation is not robust for outliers, while the correct methods (transformation, removal, capping) directly address outlier impact.
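
Two of the methods above, percentile capping and log transformation, can be sketched with the standard library. The function name, the 5th/95th percentile cutoffs, and the data are assumptions chosen for illustration.

```python
import math
import statistics


def cap_at_percentiles(values, lower_pct=5, upper_pct=95):
    """Clamp values to the [lower_pct, upper_pct] percentile range (winsorizing)."""
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical data: 1..20 plus one extreme value.
values = list(range(1, 21)) + [500]

capped = cap_at_percentiles(values)
logged = [math.log1p(v) for v in values]

print(max(values), max(capped))   # the extreme value is pulled in
print(round(max(logged), 2))      # the log keeps it, but much closer to the rest
```

Capping keeps all 21 records and only changes the tails, while the log transform keeps the original ordering and shrinks the gap between the outlier and the bulk of the data.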

8
Matching · medium

Match each data governance role to its responsibility.

Concepts
Matches

Ensures data quality and adherence to policies

Manages technical environment and data access

Has accountability for specific data assets

Sets strategic direction for data management

Designs data structures and integration processes

Why these pairings

These roles are defined in data governance frameworks.

9
MCQ · hard

Refer to the exhibit. Which data quality dimension is being violated?

A. Uniqueness
B. Consistency
C. Timeliness
D. Completeness
Answer: B

Consistency requires that values describing the same entity agree across records; conflicting values for one identifier violate this.

Why this answer

The exhibit shows the same customer ID (C001) associated with two different customer names ('John Smith' and 'Jon Smith'), which violates the consistency dimension. Consistency requires that data values be free from contradiction and adhere to the same representation rules across the dataset. Here, the two names for the same identifier contradict each other, so the dataset cannot say which value is correct.

Exam trap

The trap here is that candidates confuse consistency with uniqueness, assuming any conflict between rows must be a duplicate record issue, when in fact consistency violations involve contradictory values for the same identifier across multiple records.

How to eliminate wrong answers

Option A is wrong because uniqueness is about ensuring no duplicate records exist for the same entity, but here the issue is conflicting attribute values for the same ID, not duplicate rows. Option C is wrong because timeliness concerns whether data is up-to-date and available when needed, which is not indicated by the name mismatch. Option D is wrong because completeness checks for missing values, but both records have all fields populated; the problem is contradictory data, not absent data.

10
MCQ · medium

A data analyst needs to determine whether the mean sales of two different regions are significantly different. The samples are independent and the data is normally distributed. Which statistical test should be used?

A. Chi-square test for independence
B. ANOVA
C. Independent samples t-test
D. Paired t-test
Answer: C

This test compares means of two independent groups with normal distribution.

Why this answer

The independent samples t-test is the correct choice because the scenario involves comparing the means of two independent groups (two different regions) with normally distributed data. This test specifically assesses whether the difference between the two sample means is statistically significant. The pooled-variance (Student's) form applies when the group variances are equal and Welch's form when they are not; Levene's test is one common way to check.

Exam trap

CompTIA often tests the distinction between independent and paired t-tests, trapping candidates who overlook the 'independent samples' condition and mistakenly choose the paired t-test for any two-group comparison.

How to eliminate wrong answers

Option A is wrong because the Chi-square test for independence is used for categorical data to assess associations between two variables, not for comparing means of continuous data. Option B is wrong because ANOVA is used to compare means among three or more groups, not exactly two independent groups. Option D is wrong because the paired t-test is used for dependent samples (e.g., before-and-after measurements on the same subjects), not for independent samples from different regions.
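
As an illustration of what the test computes, the sketch below derives the t statistic and degrees of freedom for the unequal-variance (Welch) form of the independent samples t-test. The function name and the sales figures are hypothetical, and turning the statistic into a p-value would need a t-distribution, which the standard library does not provide.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Standard error of the difference between the two sample means.
    se_sq = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_sq)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se_sq ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical weekly sales for two regions.
region_a = [1, 2, 3, 4, 5]
region_b = [2, 3, 4, 5, 6]

t_stat, dof = welch_t(region_a, region_b)
print(round(t_stat, 3), round(dof, 3))
```

A paired t-test would instead work on the per-subject differences, which is why it does not apply to two unrelated regions.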

11
MCQ · medium

A data analyst is preparing a dataset for a predictive model. The dataset contains a feature 'age' with values ranging from 18 to 80, and a feature 'income' ranging from 20,000 to 200,000. To ensure both features contribute equally to distance-based algorithms, which transformation should the analyst apply?

A. Min-max normalization
B. Log transformation
C. Standardization (z-score)
D. Box-Cox transformation
Answer: C

Standardization ensures each feature has mean 0 and std 1, providing equal weight in distance calculations.

Why this answer

Standardization (z-score) transforms features to have a mean of 0 and a standard deviation of 1, so 'age' (18–80) and 'income' (20,000–200,000) end up with the same spread and contribute comparably to distance-based algorithms like k-NN or k-means. Standardization is still influenced by outliers, because the mean and standard deviation are, but less severely than min-max normalization, where a single extreme value sets the range and compresses every other value. That makes it the usual choice when the data may contain extreme values.

Exam trap

The trap here is that candidates often confuse min-max normalization with standardization, assuming that scaling to a fixed range is sufficient for distance-based algorithms, without considering the impact of outliers or the need for zero mean and unit variance.

How to eliminate wrong answers

Option A is wrong because min-max normalization scales features to a fixed range (e.g., [0,1]), but it is highly sensitive to outliers: a single extreme value defines the range and squeezes the remaining values together, so the features no longer contribute equally. Option B is wrong because log transformation is used to reduce skewness in positively skewed data, not to standardize features with different scales; it changes the shape of the distribution and would not make 'age' and 'income' comparable for distance-based algorithms. Option D is wrong because Box-Cox transformation is designed to make data more normally distributed and requires all values to be positive, but it does not standardize features to a common scale; applying it to 'age' and 'income' would not ensure equal contribution to distance metrics.
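
The two scalers compared in this question can be written in a few lines. This is a sketch with hypothetical income values that include one extreme entry; the function names are illustrative.

```python
import statistics


def min_max(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical incomes; the last value is an outlier.
incomes = [20000, 30000, 40000, 50000, 200000]

mm = min_max(incomes)
zs = z_score(incomes)

print([round(v, 3) for v in mm])  # four values squeezed near 0, outlier at 1
print([round(v, 3) for v in zs])  # unit spread, outlier about two sd above the mean
```

Both results are shifted by the outlier, but min-max pins the whole range to it, while the z-scores still have unit spread overall.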

12
MCQ · easy

A data analyst needs to join two tables in a SQL database: Orders and Customers. The analyst wants to include all orders, even if there is no matching customer record. Which type of join should be used?

A. RIGHT JOIN
B. FULL OUTER JOIN
C. LEFT JOIN
D. INNER JOIN
Answer: C

LEFT JOIN returns all orders, including those without matching customers.

Why this answer

A LEFT JOIN returns all rows from the left table (Orders) and the matching rows from the right table (Customers). If there is no match, NULL values are returned for the right table's columns. This satisfies the requirement to include all orders, even those without a matching customer record.

Exam trap

The trap here is that candidates often confuse LEFT JOIN with RIGHT JOIN, mistakenly thinking they need to 'keep all customers' instead of 'keep all orders,' or they overcomplicate the requirement by choosing FULL OUTER JOIN when only one side needs to be preserved.

How to eliminate wrong answers

Option A (RIGHT JOIN) is wrong because it returns all rows from the right table (Customers) and matching rows from the left table (Orders), which would include all customers, not all orders. Option B (FULL OUTER JOIN) is wrong because it returns all rows from both tables, including unmatched rows from both sides, which is unnecessary when the requirement is specifically to keep all orders. Option D (INNER JOIN) is wrong because it returns only rows where there is a match in both tables, which would exclude orders without a matching customer record.
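
The behavior can be reproduced with an in-memory SQLite database from Python's standard library. The column names and rows are hypothetical; only the table names Orders and Customers come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace');
    -- Order 103 points at a customer_id that has no Customers row.
    INSERT INTO Orders VALUES (101, 1, 25.0), (102, 2, 40.0), (103, 99, 15.0);
    """
)

# Orders is the left table, so every order is kept.
rows = conn.execute(
    """
    SELECT o.order_id, c.name
    FROM Orders AS o
    LEFT JOIN Customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
    """
).fetchall()
conn.close()

print(rows)
```

Order 103 appears with `None` (SQL NULL) in the name column; replacing `LEFT JOIN` with `INNER JOIN` in the same query would drop that row.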

13
MCQ · medium

A data analyst at a marketing firm is tasked with segmenting customers based on their purchasing behavior. The dataset contains 10,000 customers with features such as annual spend, frequency of purchases, recency of last purchase, and average order value. The analyst decides to use k-means clustering. After standardizing the features, the analyst runs k-means with k=3, k=4, and k=5, and computes the silhouette score for each: k=3: 0.45, k=4: 0.52, k=5: 0.48. The analyst also plots the elbow curve and observes that the within-cluster sum of squares (WCSS) decreases sharply from k=2 to k=4, then levels off. Based on these results, what is the most appropriate number of clusters?

A. k=4
B. k=2
C. k=3
D. k=5
Answer: A

Highest silhouette score and elbow point.

Why this answer

The silhouette score is highest at k=4 (0.52), indicating that clusters are well-separated and cohesive. The elbow curve shows WCSS decreasing sharply up to k=4 and then leveling off, suggesting that k=4 captures the optimal trade-off between model complexity and variance explained. Together, these metrics point to k=4 as the most appropriate number of clusters.

Exam trap

The trap here is that candidates might rely solely on the elbow curve and pick k=3 or k=5, ignoring the silhouette score which directly measures cluster quality and clearly favors k=4.

How to eliminate wrong answers

Option B (k=2) is wrong because the elbow curve shows a sharp decrease in WCSS from k=2 to k=4, meaning k=2 would underfit the data and miss meaningful segmentation. Option C (k=3) is wrong because its silhouette score (0.45) is lower than k=4 (0.52), indicating poorer cluster separation and cohesion. Option D (k=5) is wrong because its silhouette score (0.48) is lower than k=4, and the elbow curve shows WCSS leveling off after k=4, so adding a fifth cluster introduces unnecessary complexity without significant improvement.
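
Given the silhouette scores stated in the question, picking the best k is a one-line lookup. The variable names are illustrative; the scores are the ones from the scenario.

```python
# Silhouette scores reported in the question, keyed by number of clusters.
silhouette_by_k = {3: 0.45, 4: 0.52, 5: 0.48}

# Choose the k whose score is highest.
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(best_k)
```

In practice this choice would be cross-checked against the elbow curve, as the explanation above does.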

14
MCQ · easy

A data analyst runs the Python code shown. What is the result of executing this code?

A. It reads the data, adds a calculated column, and shows the first 5 rows
B. It throws an error because 'total' column already exists
C. It reads the data and displays all rows
D. It reads the data and displays summary statistics
Answer: A

The code reads the file, adds the 'total' column, and calls `head()` to show the first 5 rows.

Why this answer

The code reads a CSV file into a pandas DataFrame, then creates a new column 'total' by summing columns 'col1' and 'col2'. Finally, `head()` returns the first 5 rows. Option A correctly describes this sequence of operations.

Exam trap

The trap here is that candidates may think `head()` shows all rows or that adding a column with an existing name throws an error, but pandas silently overwrites the column.

How to eliminate wrong answers

Option B is wrong because assigning to a column name that already exists simply overwrites that column; if 'total' were already present, pandas would replace it without raising an error. Option C is wrong because `head()` without an argument defaults to 5 rows, not all rows. Option D is wrong because `head()` displays rows, not summary statistics (which would require `.describe()`).

15
MCQ · easy

A data analyst is cleaning a dataset and finds missing values in a categorical variable representing customer region. Which imputation method is most appropriate?

A. Drop rows with missing values
B. Mode imputation
C. Mean imputation
D. Median imputation
Answer: B

Mode is appropriate for categorical variables.

Why this answer

Mode imputation is the most appropriate method for a categorical variable because it replaces missing values with the most frequently occurring category, so every imputed value is a valid category. Mean and median imputation are designed for numerical data and cannot be computed on unordered labels such as regions. This approach is simple and works best when relatively few values are missing, the data is missing at random, and one category is clearly dominant.

Exam trap

The trap here is that candidates often confuse imputation methods across data types, incorrectly applying mean or median imputation to categorical variables because they focus on central tendency without considering data type appropriateness.

How to eliminate wrong answers

Option A is wrong because dropping rows with missing values can lead to significant data loss and potential bias, especially if the missingness is not completely random, reducing the dataset's representativeness. Option C is wrong because mean imputation is only appropriate for numerical data, not categorical variables, as calculating the mean of categories is meaningless and would produce non-categorical values. Option D is wrong because median imputation is also designed for numerical data and cannot be applied to categorical variables, as the median requires ordered numerical values to compute.
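
A minimal sketch of mode imputation for a categorical column, using only the standard library. The region labels are hypothetical and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical customer regions; None marks a missing value.
regions = ["West", "East", None, "West", "South", None, "West"]

observed = [r for r in regions if r is not None]
mode_region = statistics.mode(observed)  # most frequent observed category

# Fill each gap with the most frequent category.
imputed = [mode_region if r is None else r for r in regions]
print(mode_region, imputed)
```

Every filled value is one of the existing categories, which is what mean or median imputation could not guarantee for labels like these.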

16
Matching · medium

Match each data sampling method to its description.

Concepts
Matches

Each member has equal chance of selection

Population divided into subgroups; random sample from each

Randomly select entire groups (clusters)

Select every k-th element from a list

Sample based on ease of access

Why these pairings

Sampling methods are important for data collection.
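
The five descriptions above correspond to what are usually called simple random, stratified, cluster, systematic, and convenience sampling. Those labels are the standard textbook names and are an assumption here, since the concept list is not shown. A rough sketch of each on a hypothetical population of 100 numbered members:

```python
import random

rng = random.Random(42)          # fixed seed so the sketch is repeatable
population = list(range(1, 101))

# Simple random: each member has an equal chance of selection.
simple = rng.sample(population, 10)

# Stratified: split into subgroups, then draw randomly from each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in rng.sample(group, 5)]

# Cluster: randomly pick entire groups and keep all their members.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [x for c in rng.sample(clusters, 2) for x in c]

# Systematic: every k-th element from a random starting point.
k = len(population) // 10
start = rng.randrange(k)
systematic = population[start::k]

# Convenience: whatever is easiest to reach, here simply the first ten.
convenience = population[:10]

print(simple, stratified, cluster_sample, systematic, convenience, sep="\n")
```

The strata boundaries and cluster size are arbitrary choices made for the illustration.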

17
MCQ · easy

After a marketing campaign, sales increased by 15%. The analyst wants to understand which customer segment contributed most to the increase. Which type of analysis is this?

A. Predictive analysis
B. Diagnostic analysis
C. Prescriptive analysis
D. Descriptive analysis
Answer: B

Diagnostic analysis investigates the cause of the outcome—here, which segment drove the increase.

Why this answer

Diagnostic analysis is used to understand the root cause of an event or change. In this scenario, the analyst already knows sales increased by 15% and wants to determine which customer segment drove that increase, which is a classic diagnostic question. This type of analysis goes beyond describing what happened to explain why it happened.

Exam trap

The trap here is confusing diagnostic analysis with descriptive analysis, as both deal with past data, but descriptive only summarizes what happened while diagnostic explains why it happened.

How to eliminate wrong answers

Option A is wrong because predictive analysis uses historical data to forecast future outcomes, not to explain past changes. Option C is wrong because prescriptive analysis recommends actions or decisions to achieve a desired outcome, not to diagnose the cause of a past event. Option D is wrong because descriptive analysis summarizes what happened (e.g., 'sales increased by 15%') but does not investigate which segment contributed most to the increase.

18
Multi-Select · medium

Which TWO of the following are commonly used techniques for handling missing data in a dataset? (Select TWO).

Select 2 answers
A. Mean imputation
B. Mode imputation
C. Dropping columns with missing data
D. Dropping rows with missing data
E. Regression imputation
Answers: A, E

Mean imputation replaces missing values with the mean of the column.

Why this answer

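Mean imputation is a commonly used technique for handling missing numerical data where the missing value is replaced with the mean of the observed values for that feature. It preserves the sample size and is simple to implement, though it can reduce variance and distort relationships if data is not missing completely at random. Regression imputation (option E) instead predicts each missing value from the other variables in the record, which preserves relationships between features better than a single constant does.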

Exam trap

CompTIA often tests the distinction between common imputation methods (mean, median, mode, regression) and data removal techniques, trapping candidates who confuse 'dropping rows' as a primary technique when imputation is more widely recommended for preserving data integrity.

19
MCQ · easy

A data analyst needs to create a visual that shows the distribution of customer ages across different regions. Which chart type is most appropriate?

A. Line chart
B. Stacked bar chart
C. Scatter plot
D. Pie chart
Answer: B

A stacked bar chart can display the distribution of age groups within each region, making comparisons easy.

Why this answer

A stacked bar chart is most appropriate because it allows the analyst to compare the distribution of customer ages (typically grouped into bins) across multiple regions simultaneously. Each bar represents a region, and the segments within the bar show the proportion or count of each age group, making it easy to see both the overall distribution and regional differences.

Exam trap

The trap here is that candidates often choose a pie chart because they think of 'distribution' as a single whole, forgetting that the question requires comparison across multiple regions, which a pie chart cannot handle.

How to eliminate wrong answers

Option A is wrong because a line chart is designed to show trends over a continuous variable (e.g., time), not the distribution of categorical age groups across regions. Option C is wrong because a scatter plot is used to show the relationship between two continuous variables, not the distribution of a single categorical variable across regions. Option D is wrong because a pie chart can only show the composition of a whole for a single category (e.g., age distribution for one region), but it cannot effectively compare distributions across multiple regions.

20
MCQ · easy

A data analyst is designing a data model for a sales data warehouse. The model should optimize query performance for aggregations by minimizing joins and duplicating data where necessary. Which schema design should the analyst use?

A. Entity-relationship model
B. Snowflake schema
C. 3NF normalized model
D. Star schema
Answer: D

Star schema denormalizes dimensions, minimizing joins and optimizing aggregate queries.

Why this answer

Star schema keeps each dimension in a single denormalized table joined directly to the fact table, reducing joins and improving query speed for aggregates. Snowflake schema normalizes dimensions into multiple related tables, which increases joins. Entity-relationship and 3NF models are optimized for transactional systems, not analytical queries.

21
MCQ · hard

A data analyst uses linear regression to model the relationship between advertising spend and sales. The residual plot shows a clear U-shaped pattern. What assumption is violated?

A. Independence of residuals
B. Homoscedasticity
C. Normality of residuals
D. Linearity
Answer: D

A U-shaped pattern means the relationship is not linear; the model is missing a nonlinear term.

Why this answer

The U-shaped pattern in the residual plot indicates that the relationship between advertising spend and sales is not linear; the model fails to capture the curvature in the data. Linear regression assumes a straight-line relationship between predictors and the response, so a systematic pattern like a U-shape directly violates the linearity assumption. This means the model is misspecified and requires a transformation or a nonlinear modeling approach.

Exam trap

CompTIA often tests the distinction between residual pattern shapes and their corresponding assumptions, so the trap here is that candidates confuse a curved pattern (nonlinearity) with heteroscedasticity or non-normality, leading them to pick B or C instead of D.

How to eliminate wrong answers

Option A is wrong because independence of residuals refers to errors being uncorrelated with each other, often violated in time-series data, but a U-shaped pattern does not imply autocorrelation. Option B is wrong because homoscedasticity means constant variance of residuals across fitted values; a violation (heteroscedasticity) would appear as a funnel or cone shape, not a U-shaped curve. Option C is wrong because normality of residuals concerns the distribution of errors (checked via Q-Q plot or histogram), not the pattern of residuals versus fitted values; a U-shaped pattern does not directly indicate non-normality.

22
MCQ · hard

A healthcare analytics team is building a predictive model to identify patients at high risk of readmission within 30 days of discharge. The dataset includes 50,000 patient records with 200 features, including demographics, vital signs, lab results, and historical admissions. The target variable is binary (readmitted or not). The team uses a logistic regression model and achieves an AUC of 0.72 on the test set. However, the model's calibration is poor: for patients predicted to have a 70% risk, the actual readmission rate is only 40%. The team wants to improve calibration without significantly reducing discrimination (AUC). The data scientist suggests applying Platt scaling. However, the team lead is concerned that Platt scaling may reduce the model's ability to rank patients correctly. Which of the following is the best course of action?

A. Remove poorly calibrated predictions by discarding all patients with predicted risk between 0.3 and 0.7.
B. Ignore calibration because AUC is the only metric that matters for readmission risk models.
C. Apply Platt scaling on a held-out validation set to recalibrate the predicted probabilities without refitting the original model.
D. Switch to a random forest model, which inherently produces better-calibrated probabilities.
Answer: C

Platt scaling is designed to improve calibration while maintaining AUC.

Why this answer

Platt scaling is a post-processing technique that fits a logistic regression model on the predicted probabilities from the original model using a held-out validation set. This recalibrates the probabilities without altering the ranking of patients (the AUC remains unchanged), directly addressing the poor calibration while preserving discrimination. Option C correctly describes this procedure.
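The rank-preservation claim can be checked with a small sketch. The code below is only an illustration under stated assumptions: the validation scores and labels are made up, `fit_platt` and `auc` are hypothetical helper names, and the sigmoid is fitted with plain gradient descent rather than any particular library routine.

```python
import math

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical held-out validation set: raw model scores and true outcomes.
val_scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.70, 0.60, 0.50, 0.40, 0.30]
val_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

a, b = fit_platt(val_scores, val_labels)
calibrated = [1.0 / (1.0 + math.exp(-(a * s + b))) for s in val_scores]

auc_before = auc(val_labels, val_scores)
auc_after = auc(val_labels, calibrated)
print(f"a={a:.3f}, b={b:.3f}, AUC before={auc_before:.3f}, after={auc_after:.3f}")
```

As long as the fitted slope `a` is positive, the sigmoid is strictly increasing, so the ordering of patients (and therefore the AUC) is the same before and after recalibration; only the probability values change.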

Exam trap

The trap here is that candidates may think Platt scaling changes the model's ranking (AUC), but in reality it applies a monotonic transformation that preserves rank order, so discrimination is unaffected.

How to eliminate wrong answers

Option A is wrong because discarding patients with predicted risk between 0.3 and 0.7 removes a large portion of the data and does not fix the underlying miscalibration; it merely hides the problem and reduces the model's utility. Option B is wrong because AUC measures only rank ordering, not probability accuracy; for clinical risk models, well-calibrated probabilities are critical for decision-making (e.g., resource allocation). Option D is wrong because random forest models are known to produce poorly calibrated probabilities due to their averaging of decision tree outputs, often requiring their own calibration (e.g., isotonic regression) and do not inherently guarantee better calibration than logistic regression.

23
MCQmedium

A healthcare analytics team is building a classification model to predict patient readmission within 30 days. The dataset contains 10,000 records with 30 features, including demographics, vital signs, lab results, and medication history. The target variable is imbalanced: 85% no readmission, 15% readmission. The team used logistic regression with default settings and achieved an accuracy of 85%, but the model predicted 'no readmission' for all patients. The lead analyst suspects the model is not learning due to class imbalance. The team has time to implement one corrective action before the next model review. Which action should the team take?

A.Remove features with low variance to reduce noise
B.Apply SMOTE to oversample the readmission class
C.Use accuracy as the evaluation metric to monitor improvement
D.Switch to a random forest model with default settings
AnswerB

SMOTE generates synthetic samples, balancing the classes and allowing the model to learn from the minority class.

Why this answer

Option B is correct because SMOTE (Synthetic Minority Oversampling Technique) directly addresses the class imbalance by generating synthetic samples for the minority class (readmission). This forces the logistic regression model to learn decision boundaries that separate the two classes, rather than defaulting to the majority class prediction. With 85% majority and 15% minority, accuracy alone is misleading, and SMOTE is a proven technique to improve recall for the minority class.

Exam trap

The trap here is that candidates often choose accuracy as a metric (Option C) because it seems intuitive, but in imbalanced datasets, accuracy is misleading and does not reflect model performance for the minority class.

How to eliminate wrong answers

Option A is wrong because removing low-variance features does not address class imbalance; it only reduces noise or redundant features, but the model will still predict the majority class if the imbalance is not handled. Option C is wrong because using accuracy as the evaluation metric is exactly the problem—it will remain high (85%) even if the model predicts all 'no readmission', so it does not monitor improvement for the minority class. Option D is wrong because switching to a random forest model with default settings does not inherently solve class imbalance; random forest can also be biased toward the majority class without techniques like class weighting or resampling.

24
MCQhard

A company is analyzing customer feedback sentiment. The dataset is highly imbalanced with 95% positive and 5% negative comments. Which technique should the analyst use to address class imbalance before modeling?

A.Use accuracy as the evaluation metric
B.Undersample the majority class
C.Oversample the majority class
D.Use SMOTE
AnswerD

SMOTE generates synthetic minority samples to balance classes.

Why this answer

SMOTE (Synthetic Minority Oversampling Technique) is the correct choice because it generates synthetic samples for the minority class (negative comments) by interpolating between existing minority instances, rather than simply duplicating them. This addresses the 95:5 imbalance without the information loss of undersampling or the overfitting risk of naive oversampling.
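The interpolation idea can be sketched in a few lines. This is a simplified illustration, not the reference SMOTE implementation: `smote_like` is a hypothetical helper, the four minority points are invented, and in practice resampling would be applied to the training split only.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class feature vectors (e.g. two numeric features).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because each synthetic point lies between two real minority points, the new samples are similar to the existing ones without being exact duplicates.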

Exam trap

The trap here is that candidates often confuse oversampling the minority class with oversampling the majority class, or they incorrectly assume that simply using a different evaluation metric (like accuracy) can fix the imbalance problem without modifying the dataset.

How to eliminate wrong answers

Option A is wrong because accuracy is a misleading metric for imbalanced datasets; a model predicting all comments as positive would achieve 95% accuracy but fail to identify any negative comments. Option B is wrong because undersampling the majority class discards a large amount of potentially useful data, which can lead to loss of important patterns and reduced model performance. Option C is wrong because oversampling the majority class would exacerbate the imbalance, making the model even more biased toward the majority class.

25
Multi-Selecteasy

Which THREE of the following are examples of descriptive statistics? (Select THREE.)

Select 3 answers
A.Correlation coefficient
B.Mean
C.P-value
D.Regression coefficient
E.Standard deviation
AnswersA, B, E

Correlation coefficient describes the strength of a linear relationship, a descriptive statistic.

Why this answer

The correlation coefficient (A) is a descriptive statistic because it quantifies the strength and direction of a linear relationship between two variables using a single number (ranging from -1 to +1) without making inferences about a larger population. It simply describes the observed data's association, which is the core function of descriptive statistics.

Exam trap

CompTIA often tests the distinction between descriptive and inferential statistics by including p-values and regression coefficients as distractors, exploiting the common misconception that any numerical summary of data is descriptive.

26
Multi-Selecthard

Which TWO of the following are valid techniques for validating the performance of a predictive model?

Select 2 answers
A.Bootstrapping
B.Feature scaling
C.Train-test split
D.K-fold cross-validation
E.Increasing training data
AnswersC, D

Splitting data into training and testing sets is a basic validation approach.

Why this answer

The train-test split (Option C) is a fundamental technique for validating predictive model performance: the dataset is partitioned into separate training and testing subsets so the model is evaluated on unseen data to gauge generalization. This exposes overfitting and gives an unbiased estimate of model accuracy, provided the test set is not used during training or tuning, making it a standard practice in supervised learning workflows.

Exam trap

CompTIA often tests the distinction between data preprocessing techniques (like feature scaling) and actual model validation methods, leading candidates to mistakenly select feature scaling as a validation technique because it is a common step in the modeling pipeline.

27
MCQmedium

Refer to the exhibit. Which type of ensemble method is being used?

A.Boosting
B.Stacking
C.Voting
D.Bagging
AnswerD

Random forest uses bagging (bootstrap aggregating) to create multiple decision trees.

Why this answer

The exhibit shows multiple base models (Model 1, Model 2, Model 3) trained in parallel on bootstrap samples of the data, and their predictions are combined via averaging (regression) or majority voting (classification). This parallel training with resampled data and equal-weight aggregation is the defining characteristic of bagging (Bootstrap Aggregating).
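The two ingredients named here, bootstrap resampling and equal-weight voting, can be sketched as follows. The exhibit is not reproduced on this page, so the data, the three votes, and the helper names are all hypothetical; no actual base models are trained.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (some repeat, some are left out)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the label predicted by the most base models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
data = list(range(10))  # stand-in for 10 training rows

# One bootstrap sample per base model; each model would be trained independently.
samples = [bootstrap_sample(data, rng) for _ in range(3)]

# Hypothetical predictions from Model 1, Model 2, Model 3 for one record.
votes = ["readmit", "no readmit", "readmit"]
final = majority_vote(votes)
print(samples, final)
```

Boosting would differ in the first step: instead of independent resamples, each model would be fitted in sequence with more weight on the records the previous model got wrong.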

Exam trap

CompTIA often tests the distinction between bagging and boosting by showing parallel vs. sequential training diagrams, and the trap here is confusing the parallel bootstrap resampling with the sequential error-correction approach of boosting.

How to eliminate wrong answers

Option A is wrong because boosting trains models sequentially, where each subsequent model focuses on correcting the errors of the previous one, not in parallel on bootstrap samples. Option B is wrong because stacking uses a meta-learner to combine predictions from diverse base models, not simple averaging or majority voting. Option C is wrong because voting typically combines predictions from different model types (e.g., logistic regression, SVM) trained on the same dataset, not from the same model type trained on bootstrap samples.

28
MCQmedium

A data analyst is reviewing a SQL query that joins three large tables. The query takes over an hour to run. The analyst notices that the WHERE clause filters on indexed columns in only two tables. Which of the following should the analyst do first to improve performance?

A.Use subqueries instead of joins
B.Check the query execution plan and optimize join order
C.Add indexes to all columns used in joins
D.Increase server memory
AnswerB

Analyzing the execution plan reveals performance bottlenecks and suggests whether indexes, join order, or other optimizations are needed.

Why this answer

The query execution plan reveals how the database engine processes joins and filters. By checking the plan, the analyst can identify the most selective filter and rearrange the join order to reduce the number of rows processed early, which is the most impactful first step. Optimizing join order leverages existing indexes without requiring schema changes or hardware upgrades.

Exam trap

CompTIA often tests the misconception that adding indexes or hardware is the immediate fix, when in fact analyzing the execution plan and adjusting join order is the cheapest and most effective first step.

How to eliminate wrong answers

Option A is wrong because subqueries often perform worse than joins in large-table scenarios, as they can lead to correlated subquery execution and repeated scans. Option C is wrong because adding indexes to all join columns is unnecessary and may degrade write performance; the analyst should first verify if existing indexes are being used efficiently via the execution plan. Option D is wrong because increasing server memory is a reactive, costly measure that does not address the root cause of inefficient query processing, such as poor join order or missing index usage.

29
Multi-Selecthard

Which THREE of the following are assumptions of linear regression? (Select THREE).

Select 3 answers
A.Normal distribution of independent variables
B.Multicollinearity among independent variables
C.Independence of errors
D.Homoscedasticity (constant variance of errors)
E.Linearity between independent and dependent variables
AnswersC, D, E

Errors should be independent.

Why this answer

Independence of errors is a core assumption of linear regression, meaning the residuals (errors) should not be correlated with each other. This is critical for valid inference because correlated errors violate one of the Gauss-Markov assumptions, leading to biased standard errors and unreliable hypothesis tests. In time series data, this assumption is often violated due to autocorrelation, which can be detected using the Durbin-Watson test.

Exam trap

The trap here is that candidates confuse the normality assumption for errors with a normality assumption for the independent variables, leading them to incorrectly select Option A.

30
Multi-Selecteasy

Which TWO of the following are examples of supervised learning algorithms?

Select 2 answers
A.Linear regression
B.K-means clustering
C.Principal component analysis (PCA)
D.Decision trees
E.Apriori algorithm
AnswersA, D

Supervised regression algorithm.

Why this answer

Linear regression is a supervised learning algorithm because it learns a mapping from input features to a continuous target variable using labeled training data. The model minimizes the difference between predicted and actual values (e.g., via ordinary least squares) to make predictions on new data.

Exam trap

CompTIA often tests the distinction between supervised and unsupervised learning by including clustering (K-means) and association (Apriori) as distractors, which candidates mistakenly think are supervised because they involve pattern discovery.

31
MCQeasy

A data analyst needs to identify the most frequently occurring value in a dataset. Which measure of central tendency should they use?

A.Mode
B.Standard deviation
C.Median
D.Mean
AnswerA

Mode is the most frequently occurring value.

Why this answer

The mode is the measure of central tendency that identifies the most frequently occurring value in a dataset. Unlike the mean or median, the mode directly counts the frequency of each distinct value and returns the value with the highest count, making it the correct choice for this specific requirement.

Exam trap

The trap here is that candidates often confuse 'most frequently occurring' with 'average' or 'middle value' and incorrectly choose mean or median, especially when the dataset is numeric and they assume central tendency always refers to mean.

How to eliminate wrong answers

Option B (Standard deviation) is wrong because it measures the dispersion or spread of data points around the mean, not the frequency of occurrence of any single value. Option C (Median) is wrong because it identifies the middle value when the dataset is sorted, which does not indicate which value appears most often. Option D (Mean) is wrong because it calculates the arithmetic average of all values, which can be skewed by outliers and does not reflect frequency of occurrence.

32
MCQmedium

A company wants to segment its customers into distinct groups based on purchasing behavior. Which algorithm is best suited for this task?

A.Decision tree
B.Logistic regression
C.K-means clustering
D.Linear regression
AnswerC

K-means clustering groups similar customers together based on features.

Why this answer

K-means clustering is an unsupervised learning algorithm that partitions data into K distinct clusters based on feature similarity, making it ideal for segmenting customers by purchasing behavior without predefined labels. It groups customers who exhibit similar purchasing patterns, enabling the company to identify natural segments for targeted marketing.

Exam trap

The trap here is that candidates often confuse supervised learning algorithms (like decision trees or logistic regression) with unsupervised clustering, mistakenly thinking that any algorithm that 'groups' data can be used for segmentation without recognizing the need for unlabeled data.

How to eliminate wrong answers

Option A is wrong because a decision tree is a supervised learning algorithm used for classification or regression, requiring labeled training data to predict outcomes, not for discovering unknown groupings in unlabeled data. Option B is wrong because logistic regression is a supervised classification algorithm for binary or multinomial outcomes, relying on labeled target variables, and cannot perform unsupervised clustering. Option D is wrong because linear regression is a supervised regression algorithm that models the relationship between a dependent variable and one or more independent variables, and it is not designed to segment data into distinct groups without predefined categories.

33
MCQmedium

A retail company wants to predict sales based on advertising spend and season. Which data modeling technique should the analyst use?

A.Simple linear regression
B.Multiple linear regression
C.Logistic regression
D.K-means clustering
AnswerB

Multiple linear regression handles two or more predictors and predicts a continuous outcome.

Why this answer

Multiple linear regression is the correct technique because the analyst needs to model a continuous outcome (sales) based on two or more predictor variables: advertising spend (continuous) and season (categorical, typically encoded as dummy variables). This allows the model to capture the independent effect of each predictor on sales, which simple linear regression cannot do because it only handles one predictor.

Exam trap

The trap here is that candidates often confuse simple linear regression with multiple linear regression, thinking that 'linear regression' alone suffices, but the exam specifically tests whether you recognize that multiple predictors require multiple regression.

How to eliminate wrong answers

Option A is wrong because simple linear regression can only model the relationship between one independent variable and the dependent variable, but here we have two predictors (advertising spend and season). Option C is wrong because logistic regression is used for binary or categorical outcome variables (e.g., yes/no), not for continuous outcomes like sales. Option D is wrong because K-means clustering is an unsupervised learning technique used to group similar data points, not to predict a continuous target variable.

34
Multi-Selecteasy

Which TWO of the following are dimensional modeling techniques commonly used in data warehouses?

Select 2 answers
A.Entity-relationship diagram
B.Snowflake schema
C.Star schema
D.Scatter plot
E.Histogram
AnswersB, C

Snowflake schema is a dimensional modeling technique where dimensions are normalized.

Why this answer

The snowflake schema is a dimensional modeling technique where dimension tables are normalized into multiple related tables, reducing data redundancy. This structure is commonly used in data warehouses to save storage and simplify maintenance of dimension data, at the cost of additional joins at query time compared with a star schema.

Exam trap

The trap here is that candidates may confuse general data modeling concepts (like ERDs) or data visualization tools (like scatter plots and histograms) with specific dimensional modeling techniques used in data warehouses.

35
MCQhard

A data scientist trains a regression model and observes high variance with low bias. Which technique is most appropriate to reduce variance?

A.Apply Ridge regularization
B.Increase polynomial features
C.Use a smaller training set
D.Remove correlated features
AnswerA

Ridge adds penalty to coefficients, reducing overfitting and variance.

Why this answer

Ridge regularization (L2) reduces variance by adding a penalty term proportional to the square of the coefficients, which shrinks them toward zero without eliminating them. This directly addresses high variance (overfitting) by constraining the model's complexity, while low bias indicates the model fits the training data well. The regularization parameter λ controls the trade-off between bias and variance.
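The shrinking effect of λ is easiest to see in the simplest case. The sketch below assumes a single feature and no intercept, where the ridge estimate has the closed form beta = Σxy / (Σx² + λ); the data points and the name `ridge_slope` are made up for illustration.

```python
def ridge_slope(x, y, lam):
    """Closed-form ridge estimate for one feature without an intercept:
    beta = sum(x*y) / (sum(x^2) + lam). lam = 0 gives ordinary least squares."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

# Hypothetical data roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>5}: beta={ridge_slope(x, y, lam):.3f}")
```

Larger λ pulls the coefficient toward zero but never exactly to zero, which is the behavior described above: lower variance at the price of some added bias.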

Exam trap

CompTIA often tests the misconception that reducing variance requires removing features or simplifying the model, but Ridge regularization is the correct technique because it penalizes coefficient magnitude without discarding predictors.

How to eliminate wrong answers

Option B is wrong because increasing polynomial features adds higher-order terms, which increases model complexity and typically increases variance, not reduces it. Option C is wrong because using a smaller training set reduces the amount of data available for learning, which generally increases variance due to less stable coefficient estimates. Option D is wrong because removing correlated features can reduce multicollinearity but does not directly penalize coefficient magnitudes; it may even increase variance if important predictors are dropped.

36
MCQeasy

A marketing analyst wants to segment customers based on their purchase history, including total spent, number of transactions, and average order value. The analyst runs k-means clustering with k=5 on the raw data but notices that the cluster assignments change significantly every time the algorithm is executed. What should the analyst do first to obtain consistent and meaningful clusters?

A.Normalize the features and set a fixed random seed for the initial centroids.
B.Switch to hierarchical clustering, which does not require specifying k.
C.Increase the number of clusters to k=10 to capture more detail.
D.Use principal component analysis (PCA) to reduce the number of features to two.
AnswerA

Normalization ensures all features contribute equally, and a fixed seed ensures reproducible results.

Why this answer

The instability in cluster assignments is caused by the algorithm's sensitivity to the scale of features and the random initialization of centroids. Normalizing the features ensures that each variable contributes equally to the distance calculations, while setting a fixed random seed makes the initial centroid selection deterministic, leading to reproducible results.

Exam trap

The trap here is that candidates may think the instability is due to the choice of k or the algorithm itself, rather than recognizing that k-means is sensitive to feature scaling and random initialization, which are the first things to address for consistency.

How to eliminate wrong answers

Option B is wrong because, although hierarchical clustering does not require specifying k up front, it is still sensitive to feature scale, so switching algorithms leaves the raw, unscaled features distorting the distance calculations. Option C is wrong because increasing k to 10 would likely increase instability and overfit noise, not resolve the fundamental problem of non-deterministic centroids. Option D is wrong because PCA reduces dimensionality but does not stabilize the k-means algorithm; the cluster assignments would still vary with different random seeds unless combined with normalization and a fixed seed.

37
MCQeasy

Refer to the exhibit. Which clause is used to aggregate the data by department?

A.HAVING
B.WHERE
C.ORDER BY
D.GROUP BY
AnswerD

GROUP BY groups rows by department, allowing COUNT to compute per-department totals.

Why this answer

The GROUP BY clause is used to aggregate data by department because it groups rows that have the same values in the specified column(s), allowing aggregate functions like SUM, AVG, or COUNT to be applied per group. In SQL, without GROUP BY, aggregate functions would operate on the entire result set, not per department.
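Since the exhibit is not reproduced on this page, here is a stand-in sketch that shows where each of the four clauses acts. The `employees` table, its columns, and its rows are hypothetical; the query runs against an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Sales", 50000), ("Ben", "Sales", 60000),
     ("Cy", "IT", 70000), ("Di", "IT", 80000), ("Eve", "HR", 55000)],
)

rows = conn.execute(
    """
    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 0            -- row-level filter, applied before grouping
    GROUP BY department         -- one output row per department
    HAVING COUNT(*) >= 2        -- group-level filter, applied after aggregation
    ORDER BY department         -- sorting only; no aggregation
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

HR has a single employee, so it is grouped like the others and then dropped by HAVING; removing the GROUP BY line would instead make the aggregates apply to the whole table.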

Exam trap

CompTIA often tests the distinction between WHERE (row-level filter) and HAVING (group-level filter), leading candidates to confuse HAVING with GROUP BY when the question asks for the clause that performs aggregation.

How to eliminate wrong answers

Option A is wrong because HAVING is used to filter groups after aggregation, not to define the grouping itself. Option B is wrong because WHERE filters individual rows before aggregation and cannot group data by department. Option C is wrong because ORDER BY sorts the result set but does not perform any aggregation or grouping.

38
MCQeasy

You are a data analyst at a logistics company. The operations manager wants to reduce delivery delays. You have historical data including order date, delivery date, distance, weather conditions, and driver ID. Initial analysis shows that the average delivery time has increased over the past six months. You suspect that weather is a contributing factor, but you need to confirm. The company also wants to build a model to predict delivery times to better manage customer expectations. The data contains missing values for weather conditions in about 10% of records, and some driver IDs are incorrect. You have limited time and resources. What should you do first?

A.Immediately focus on time series analysis to look for patterns
B.Start by cleaning the data: correct driver IDs and decide how to handle missing weather data, then perform exploratory data analysis
C.Collect more data to fill missing values
D.Build a predictive model using all available data after imputing missing weather data
AnswerB

Cleaning ensures data integrity, and EDA guides modeling choices.

Why this answer

Option B is correct because data cleaning and exploratory data analysis (EDA) are foundational steps before any modeling or time series work. With missing weather data (10%) and incorrect driver IDs, proceeding without cleaning would introduce bias and errors. EDA will reveal patterns, correlations, and data quality issues, enabling informed decisions on imputation and feature engineering for the predictive model.

Exam trap

CompTIA often tests the misconception that you can jump directly to modeling or advanced analysis without first ensuring data quality, ignoring the 'garbage in, garbage out' principle.

How to eliminate wrong answers

Option A is wrong because time series analysis assumes clean, consistent data; applying it directly with missing values and incorrect IDs would yield unreliable patterns and waste resources. Option C is wrong because collecting more data is time-consuming and does not address the existing incorrect driver IDs or the need to understand current data quality; it also assumes missing values are random, which may not hold. Option D is wrong because building a predictive model on uncleaned data with imputed weather values without prior EDA risks overfitting, misinterpretation of feature importance, and propagation of errors from incorrect IDs.

39
MCQeasy

A data analyst needs to summarize customer satisfaction scores. The data contains a few extremely low scores that skew the distribution. Which measure of central tendency is most appropriate?

A.Range
B.Mode
C.Median
D.Mean
AnswerC

The median is robust to outliers and provides a better central value for skewed data.

Why this answer

The median is the most appropriate measure of central tendency when data contains extreme outliers, such as the very low customer satisfaction scores described. Unlike the mean, the median is resistant to skew because it depends only on the middle value(s) of the sorted dataset, not on the magnitude of extreme values. This makes it the standard choice for summarizing ordinal or skewed interval/ratio data in data analysis.
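A quick numeric check makes the difference concrete. The satisfaction scores below are invented for illustration: most customers rate 7 to 9, and two rate 1.

```python
import statistics

# Hypothetical satisfaction scores on a 1-10 scale with two extreme low values.
scores = [8, 9, 7, 8, 9, 8, 1, 1]

mean_score = statistics.mean(scores)      # pulled down by the two 1s
median_score = statistics.median(scores)  # middle of the sorted values

print(f"mean={mean_score}, median={median_score}")
```

The two low scores drag the mean well below what a typical customer reported, while the median stays with the bulk of the data.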

Exam trap

The trap here is that candidates often default to the mean as the 'average' without considering outlier impact, but the exam tests the understanding that the mean is non-robust and the median is the correct choice for skewed data in the Analyzing and Modeling domain.

How to eliminate wrong answers

Option A (Range) is wrong because it is a measure of dispersion (the difference between the maximum and minimum values), not a measure of central tendency, and it is heavily influenced by outliers. Option B (Mode) is wrong because it identifies the most frequently occurring score, which may not represent the center of the distribution and can be misleading when outliers are present but not frequent. Option D (Mean) is wrong because it is sensitive to extreme values; the few extremely low scores will pull the arithmetic mean downward, misrepresenting the typical customer satisfaction experience.

40
MCQhard

A data analyst is building a model to predict customer churn. The dataset has 10,000 records with 500 churned customers. The model predicts churn with 95% accuracy, but only identifies 10% of actual churners. Which metric best highlights this issue?

A.Accuracy
B.F1 score
C.Recall
D.Precision
AnswerC

Recall is low (10%), showing the model fails to detect churners.

Why this answer

Recall (also known as sensitivity or true positive rate) measures the proportion of actual positives correctly identified. With only 10% of actual churners detected, the model has a recall of 0.1, which directly highlights the failure to capture churners despite high overall accuracy.
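The confusion matrix implied by the question can be reconstructed from the figures given (10,000 records, 500 churners, 95% accuracy, 10% recall). The sketch below is just that arithmetic; the variable names are illustrative.

```python
total, churners = 10_000, 500
accuracy, recall = 0.95, 0.10

tp = round(recall * churners)        # churners the model caught
fn = churners - tp                   # churners the model missed
tn = round(accuracy * total) - tp    # correct predictions that are not TP
fp = (total - churners) - tn         # non-churners flagged as churners

precision = tp / (tp + fp)

print(f"TP={tp}, FN={fn}, TN={tn}, FP={fp}")
print(f"precision={precision:.2f}, recall={tp / (tp + fn):.2f}")
```

Almost all of the correct predictions are true negatives, which is why accuracy looks strong while 450 of the 500 churners go undetected.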

Exam trap

The trap here is that candidates may choose accuracy because it is a familiar and seemingly high value (95%), failing to recognize that in imbalanced datasets, accuracy can be deceptive and does not reflect poor performance on the minority class.

How to eliminate wrong answers

Option A is wrong because accuracy (95%) is misleading in imbalanced datasets; it can be high even if the model fails to detect churners, as the majority class (non-churners) dominates. Option B is wrong because the F1 score is the harmonic mean of precision and recall; while it would be low here, it does not directly isolate the issue of missing churners—recall is the metric that specifically measures detection of the positive class. Option D is wrong because precision measures the proportion of predicted churners that are actual churners; it does not reflect how many actual churners were missed, which is the core problem.

41
MCQeasy

A dataset contains a column 'Income' with values in different scales (some in thousands, some in hundreds). What is the best way to standardize this column for use in a machine learning model?

A.Apply min-max scaling to range [0,1]
B.Apply standard scaling (Z-score normalization)
C.Apply log transformation
D.Remove the column
AnswerB

Standard scaling centers and scales data, suitable for inconsistent scales.

Why this answer

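Standard scaling (Z-score) subtracts the mean and divides by the standard deviation, so the column has zero mean and unit variance and its values become comparable. Min-max scaling also rescales the column, but it is more sensitive to outliers because a single extreme value compresses the rest of the range. Option B is correct.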
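The two scalers can be compared on a small column. This is a minimal sketch with invented income values (one deliberate outlier) and hypothetical helper names; it uses the population standard deviation, which is one common convention.

```python
import statistics

def z_score(values):
    """Standard scaling: subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical incomes; the last value is an outlier.
income = [32_000, 45_000, 51_000, 58_000, 400_000]

z = z_score(income)
mm = min_max(income)
print([round(v, 3) for v in z])
print([round(v, 3) for v in mm])
```

With min-max scaling, the outlier takes the value 1 and the four ordinary incomes are squeezed into a narrow band near 0.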

42
MCQhard

After building a binary classification model, the data analyst obtains the following confusion matrix: True Positives=80, True Negatives=100, False Positives=20, False Negatives=30. What is the F1 score?

A.0.76
B.0.73
C.0.80
D.0.69
AnswerA

Precision=0.8, Recall≈0.727, F1≈0.76.

Why this answer

The F1 score is the harmonic mean of precision and recall. Precision = TP/(TP+FP) = 80/(80+20) = 0.80. Recall = TP/(TP+FN) = 80/(80+30) ≈ 0.7273.

F1 = 2 * (0.80 * 0.7273) / (0.80 + 0.7273) ≈ 0.7619, which rounds to 0.76. Option A is correct.
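The same arithmetic can be reproduced directly from the four confusion-matrix counts given in the question; accuracy is computed alongside for comparison.

```python
tp, tn, fp, fn = 80, 100, 20, 30

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f}, recall={recall:.4f}")
print(f"F1={f1:.4f}, accuracy={accuracy:.4f}")
```

An equivalent shortcut is F1 = 2·TP / (2·TP + FP + FN) = 160 / 210, which avoids rounding the intermediate precision and recall.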

Exam trap

CompTIA often tests the distinction between precision, recall, and F1, and the trap here is that candidates mistakenly use accuracy or a simple average instead of the harmonic mean, or they confuse recall with F1.

How to eliminate wrong answers

Option B (0.73) is wrong because it approximates recall (0.727) instead of computing the harmonic mean. Option C (0.80) is wrong because it uses precision alone, ignoring recall. Option D (0.69) is wrong because no correct calculation produces it: the harmonic mean is ≈ 0.76, and even the common mistakes give other values (the arithmetic mean of precision and recall, (0.80 + 0.727)/2, is also ≈ 0.76, and accuracy, (TP+TN)/(TP+TN+FP+FN) = 180/230, is ≈ 0.78).

43
Multi-Selecteasy

A data analyst is preparing to build a predictive model. Which TWO steps are essential to ensure model validity? (Choose two.)

Select 2 answers
A.Increase model complexity
B.Perform cross-validation
C.Avoid feature selection
D.Use the entire dataset for training
E.Split data into training and testing sets
AnswersB, E

Cross-validation provides a more reliable estimate of model performance.

Why this answer

Cross-validation is essential for model validity because it partitions the data into k folds, training on k-1 folds and validating on the remaining fold, which provides a robust estimate of model performance and helps detect overfitting. Repeating this so that each fold serves once as the validation set shows how well the model generalizes to unseen data, making it a standard practice in predictive modeling.
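The fold mechanics can be sketched without any modeling library. The helper below is a hypothetical illustration that only produces index splits; it does not shuffle, which a real workflow would usually do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged; a final held-out test set (Option E) is still kept aside for the last check.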

Exam trap

The trap here is that candidates may think using the entire dataset for training (Option D) is acceptable because it maximizes data for learning, but they overlook the necessity of a separate testing set to validate model performance and avoid overfitting.

44
MCQeasy

A marketing team wants to segment customers into distinct groups based on purchasing behavior. The data includes numeric features such as frequency, monetary value, and recency. Which unsupervised learning algorithm should be used?

A.Decision tree
B.K-means clustering
C.Linear regression
D.Association rules
AnswerB

K-means is an unsupervised clustering algorithm suitable for grouping customers based on numeric attributes.

Why this answer

K-means clustering is the correct choice because it is an unsupervised learning algorithm that partitions data into K distinct clusters based on feature similarity. For segmenting customers by purchasing behavior (frequency, monetary value, recency), K-means groups customers with similar numeric patterns without requiring labeled outcomes, making it ideal for exploratory segmentation.

Exam trap

The trap here is that candidates may confuse unsupervised clustering (K-means) with supervised classification (decision tree) or regression (linear regression), mistakenly thinking any algorithm that 'groups' data must be supervised, or that association rules are for segmentation rather than transaction pattern mining.

How to eliminate wrong answers

Option A is wrong because a decision tree is a supervised learning algorithm used for classification or regression, requiring labeled target variables, not for unsupervised segmentation of unlabeled customer data. Option C is wrong because linear regression is a supervised learning algorithm that models the relationship between independent and dependent variables, predicting a continuous output, not for discovering hidden groups in unlabeled data. Option D is wrong because association rules are used for market basket analysis to find frequent itemsets and co-occurrence patterns (e.g., products bought together), not for clustering customers into distinct groups based on numeric features.

45
Multi-Selecthard

A data scientist is cleaning a dataset and notices missing values in several columns. Which THREE techniques are appropriate for handling missing data? (Select THREE.)

Select 3 answers
A.Replace missing values with the mean or median
B.Ignore missing values and proceed with analysis
C.Predict missing values using regression
D.Remove rows with missing values
E.Always replace missing values with zero
AnswersA, C, D

Imputation with mean/median is a common technique for numeric data.

Why this answer

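Option A is correct because replacing missing values with the mean (for normally distributed data) or median (for skewed data) is a standard imputation technique that preserves the column's central tendency and keeps the sample size intact. It is most defensible when values are missing at random and the proportion of missing data is low, because filling many gaps with a single value shrinks the column's variance. Option C is correct because regression imputation predicts each missing value from the other columns, and Option D is correct because removing incomplete rows is acceptable when only a small share of rows is affected.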

Exam trap

CompTIA often tests the misconception that ignoring missing values (Option B) is acceptable, but the DA0-001 exam expects candidates to recognize that most analytical tools require explicit handling of nulls, and simply proceeding without action leads to runtime errors or flawed results.
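Two of the techniques above, median imputation and row removal, can be sketched as follows. The `salaries` list is made-up illustrative data, and `None` stands in for a missing value.

```python
from statistics import median

# Hypothetical column with two missing entries and one extreme value.
salaries = [48000, None, 52000, 61000, None, 39000, 250000]

# Row removal: keep only the observed values.
observed = [s for s in salaries if s is not None]

# Median imputation: fill each gap with the median of the observed values.
fill_value = median(observed)
imputed = [fill_value if s is None else s for s in salaries]

print("rows kept after removal:", len(observed))
print("median used for imputation:", fill_value)
print("imputed column:", imputed)
```

The median is used here rather than the mean because the extreme value (250000) would pull the mean well above the typical salary.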

46
Drag & Dropmedium

Drag and drop the steps to implement a data classification policy in the correct order.

Why this order

Classification involves defining levels, assigning ownership, labeling, access control, and training.

47
MCQhard

A financial analyst is building a model to predict stock price movements. The data is time series with daily prices. The analyst wants to use a regression model but notices that the residuals are autocorrelated. What adjustment should be made?

A.Use a time series model like ARIMA instead
B.Use cross-validation to validate the model
C.Add more predictors to the regression model
D.Transform the data to remove autocorrelation (e.g., differencing)
AnswerA

ARIMA models capture autocorrelation through autoregressive and moving average components.

Why this answer

When residuals from a regression model on time series data exhibit autocorrelation, the standard ordinary least squares (OLS) assumptions are violated, leading to biased standard errors and unreliable inference. An ARIMA model is specifically designed to handle autocorrelated time series by explicitly modeling the autoregressive (AR) and moving average (MA) components, making it the correct adjustment to capture the temporal dependencies in stock price movements.

Exam trap

The trap here is that candidates often confuse data transformation (like differencing) with model selection, thinking that simply removing autocorrelation from the data is sufficient, when in fact the model itself must be changed to a time series framework like ARIMA to properly account for the temporal structure.

How to eliminate wrong answers

Option B is wrong because cross-validation is a model validation technique that does not address autocorrelation in residuals; it would still produce unreliable performance estimates if the underlying model violates independence assumptions. Option C is wrong because adding more predictors does not fix autocorrelated residuals; it may even introduce multicollinearity or overfitting without correcting the temporal dependency structure. Option D is wrong because while differencing can remove certain types of autocorrelation (e.g., unit roots), it is a data transformation step often used within ARIMA modeling, not a standalone adjustment; simply transforming the data without changing the model framework does not resolve the fundamental issue that the regression model assumes independent errors.

48
MCQeasy

During ETL, a data analyst discovers that a date column contains values like '01/02/2023' and '2023-01-02'. Which of the following is the best practice to ensure consistent date format before analysis?

A.Keep both formats and handle during analysis
B.Use regular expressions to parse and convert each format
C.Remove records with inconsistent date formats
D.Apply a standardized date parsing function to convert all dates
AnswerD

Using a standardized date parsing function (e.g., TO_DATE in SQL or pd.to_datetime in Python) ensures all dates are in a consistent format.

Why this answer

Option D is correct because applying a standardized date parsing function (e.g., `TO_DATE` in SQL or `pd.to_datetime` in Python) ensures all date values are converted to a single, consistent format regardless of the original representation. This is a fundamental ETL best practice to avoid ambiguity and enable accurate date-based filtering, aggregation, and joins during analysis.

Exam trap

The trap here is that candidates may choose Option B (regular expressions) thinking it offers fine-grained control, but they overlook that dedicated date parsing functions are more reliable, simpler, and handle edge cases like leap years or time zones that regex cannot easily manage.

How to eliminate wrong answers

Option A is wrong because keeping both formats forces the analyst to handle multiple date patterns during every query, increasing complexity and risk of errors in comparisons or calculations. Option B is wrong because using regular expressions to parse dates is fragile, error-prone, and unnecessary when dedicated date parsing functions exist that handle locale and format variations robustly. Option C is wrong because removing records with inconsistent date formats discards potentially valid data, leading to incomplete analysis and biased results.
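A minimal sketch of the standardized-parsing approach, using the standard library's `datetime.strptime` rather than the `pd.to_datetime` or `TO_DATE` functions named above so that it runs without extra packages. It assumes slashed dates are month/day/year; if the source system writes day/month/year, the second format string would need to change. The names `KNOWN_FORMATS` and `standardize_date` are illustrative.

```python
from datetime import datetime

# Assumption: slashed dates are month/day/year (US style).
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_date(text: str) -> str:
    """Parse a date in any known format and return it as ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {text!r}")


raw = ["01/02/2023", "2023-01-02"]
clean = [standardize_date(value) for value in raw]
print(clean)
```

Values that match no known format raise an error instead of being silently dropped, which surfaces bad records during ETL rather than during analysis.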

49
Multi-Selecteasy

A data analyst is building a linear regression model to predict sales based on advertising spend across TV, radio, and newspaper channels. Which TWO diagnostics should the analyst perform to validate the model assumptions?

Select 2 answers
A.Durbin-Watson test for autocorrelation
B.Q-Q plot to assess normality of residuals
C.Variance inflation factor (VIF) for multicollinearity
D.Cook's distance to identify influential points
E.Residual plots to check for homoscedasticity
AnswersB, E

Q-Q plot checks normality assumption.

Why this answer

Option B is correct because a Q-Q plot is used to assess whether the residuals of a linear regression model are approximately normally distributed, which is a key assumption for valid inference (e.g., p-values and confidence intervals). Option E is correct because residual plots (e.g., fitted vs. residuals) are the standard diagnostic to check for homoscedasticity—constant variance of errors across all levels of the independent variables—another core assumption of ordinary least squares regression.

Exam trap

CompTIA often tests the distinction between assumption validation (normality and homoscedasticity) and other regression diagnostics (autocorrelation, multicollinearity, influence) to see if candidates confuse model-building checks with residual assumption checks.

50
MCQmedium

A company has a dataset with 100 features. The data analyst wants to reduce dimensionality while preserving as much variance as possible. Which technique should be used?

A.PCA (Principal Component Analysis)
B.LDA (Linear Discriminant Analysis)
C.Autoencoders
D.t-SNE
AnswerA

PCA finds the directions of maximum variance and projects data onto them, preserving as much variance as possible.

Why this answer

PCA is the correct choice because it is an unsupervised linear dimensionality reduction technique that projects the data onto orthogonal components ordered by the variance they capture. By selecting the top principal components, the analyst can retain the maximum possible variance in the dataset while reducing the number of features from 100 to a smaller set, directly addressing the goal of preserving variance.

Exam trap

The trap here is that candidates often confuse PCA with LDA because both are linear transformations, but LDA requires labeled data and maximizes class separation, not variance, making it unsuitable for this unsupervised variance-preservation goal.

How to eliminate wrong answers

Option B (LDA) is wrong because LDA is a supervised technique that maximizes class separability, not variance preservation, and requires labeled target classes, which are not mentioned in the scenario. Option C (Autoencoders) is wrong because while autoencoders can reduce dimensionality, they are neural-network-based, require significant tuning and data, and are not the standard first-choice technique for simple variance-preserving reduction; PCA is more straightforward and computationally efficient for this task. Option D (t-SNE) is wrong because t-SNE is a nonlinear visualization technique primarily used for exploring high-dimensional data in 2D or 3D plots; it does not preserve global variance structure and cannot be used to transform new data or reduce dimensionality for modeling.

51
MCQeasy

A data analyst calculates a correlation coefficient of -0.85 between temperature and heating costs. What does this indicate?

A.No correlation
B.Strong positive correlation
C.Strong negative correlation
D.Weak negative correlation
AnswerC

The negative sign shows an inverse relationship, and -0.85 is close to -1, indicating strength.

Why this answer

A correlation coefficient of -0.85 indicates a strong negative linear relationship between temperature and heating costs. As temperature increases, heating costs tend to decrease, and the magnitude of 0.85 (close to 1) confirms the strength of this inverse association.

Exam trap

CompTIA often tests the misinterpretation of the sign of the correlation coefficient, where candidates confuse a strong negative correlation with a weak one or mistakenly think a negative value implies no relationship.

How to eliminate wrong answers

Option A is wrong because a correlation coefficient of -0.85 is far from 0, indicating a clear relationship, not no correlation. Option B is wrong because a positive correlation would have a coefficient greater than 0, but -0.85 is negative, showing an inverse relationship. Option D is wrong because a weak negative correlation would have a coefficient closer to 0 (e.g., -0.2 to -0.4), whereas -0.85 is near -1, indicating a strong negative correlation.
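To make the sign and magnitude concrete, the sketch below computes Pearson's r from first principles. The temperature and heating-cost figures are invented for illustration and are not the data behind the -0.85 in the question.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical observations: warmer days, lower heating bills.
temperature = [0, 5, 10, 15, 20, 25]
heating_cost = [210, 190, 150, 160, 90, 70]

r = pearson_r(temperature, heating_cost)
print(f"r = {r:.2f}")
```

The sign of r gives the direction of the relationship and its absolute value gives the strength, so a value near -1 is strong, not weak.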

52
MCQmedium

Refer to the exhibit. An analyst runs the following query: SELECT product_id, AVG(quantity) FROM sales GROUP BY product_id HAVING AVG(quantity) > 8; Which product_id(s) will be returned?

A.P001 and P003
B.P001 only
C.P002 only
D.P003 only
AnswerA

P001 average is 9 and P003 average is 12, both >8.

Why this answer

The query groups sales by product_id and keeps only the groups whose average quantity exceeds 8. According to the exhibit, P001 averages 9 and P003 averages 12, so both are returned. The HAVING clause operates on aggregated data after GROUP BY, unlike WHERE, which filters rows before aggregation.

Exam trap

CompTIA often tests the distinction between WHERE and HAVING, and the trap here is that candidates mistakenly think HAVING filters individual rows or that AVG(quantity) > 8 applies to each row, leading them to select only one product_id instead of recognizing the grouped result.

How to eliminate wrong answers

Option B is wrong because P001 is not the only product that satisfies the condition; P003 also has an average quantity above 8, so both are returned. Option C is wrong because P002's average quantity is 8 or less, so it is excluded by the HAVING clause. Option D is wrong because P003 is returned, but P001 also meets the condition, so the result is not limited to P003 only.
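The behavior of HAVING can be reproduced with an in-memory SQLite database. The rows below are hypothetical stand-ins for the exhibit, chosen only so that the averages match those stated in the explanation (9 for P001, 12 for P003, and 8 or less for P002); an ORDER BY is added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("P001", 8), ("P001", 10),    # average 9
        ("P002", 9), ("P002", 3),     # average 6, despite one row above 8
        ("P003", 11), ("P003", 13),   # average 12
    ],
)

rows = conn.execute(
    """
    SELECT product_id, AVG(quantity)
    FROM sales
    GROUP BY product_id
    HAVING AVG(quantity) > 8
    ORDER BY product_id
    """
).fetchall()
conn.close()

print(rows)
```

P002 has an individual row with quantity 9, yet it is excluded because HAVING tests the group average, not each row.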

53
MCQhard

After training a decision tree, the tree has depth 20 and 100% accuracy on training data but only 60% on test data. Which hyperparameter adjustment is most likely to improve generalization?

A.Increase number of estimators
B.Decrease minimum samples per split
C.Increase minimum samples per leaf
D.Increase maximum depth
AnswerC

Increasing min_samples_leaf prevents the tree from fitting noise by requiring more samples in each leaf, reducing overfitting.

Why this answer

The model is overfitting: 100% training accuracy vs. 60% test accuracy with a depth-20 tree. Increasing minimum samples per leaf forces the tree to be simpler by requiring more samples in each leaf, reducing variance and improving generalization. This directly combats the overfitting caused by the overly deep tree.

Exam trap

The trap here is that candidates often confuse hyperparameters that reduce overfitting with those that increase model complexity, mistakenly choosing options like 'increase maximum depth' or 'decrease minimum samples per split' thinking they will improve accuracy.

How to eliminate wrong answers

Option A is wrong because increasing the number of estimators applies to ensemble methods like Random Forest or Gradient Boosting, not to a single decision tree; it would not affect this tree's overfitting. Option B is wrong because decreasing minimum samples per split allows the tree to split on smaller subsets, making it even more complex and worsening overfitting. Option D is wrong because increasing maximum depth would allow the tree to grow even deeper, exacerbating the overfitting problem rather than reducing it.

54
MCQeasy

Refer to the exhibit. A data analyst wants to grant read access to an entire S3 bucket named 'data-lake'. Which of the following best describes what this policy does?

A.Allows both read and write access to the bucket
B.Allows only specific users to read objects
C.Allows read access to a specific folder within the bucket
D.Allows read access to all objects in the data-lake bucket
AnswerD

The policy grants s3:GetObject on the entire bucket, enabling read access to all objects.

Why this answer

This policy grants read access to all objects within the 'data-lake' S3 bucket. In AWS S3, a bucket-level policy that allows the 's3:GetObject' action without a condition restricting the resource to a specific prefix or folder effectively permits reading every object in the bucket. Option D correctly identifies this behavior.

Exam trap

The trap here is that candidates often confuse a bucket-level policy that grants access to all objects with one that restricts access to a specific folder or user, overlooking the absence of a condition or principal specification in the policy statement.

How to eliminate wrong answers

Option A is wrong because the policy only grants read access (s3:GetObject), not write access (s3:PutObject). Option B is wrong because the policy does not specify any user or principal restriction; it applies broadly (e.g., to all principals if the Principal is '*'). Option C is wrong because the policy does not include a condition limiting access to a specific folder (prefix); it applies to the entire bucket (arn:aws:s3:::data-lake/*).

55
Multi-Selectmedium

A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)

Select 2 answers
A.K-means clustering
B.Linear regression
C.Logistic regression
D.Decision trees
E.Apriori algorithm
AnswersC, D

Logistic regression models binary outcomes and is appropriate for classification.

Why this answer

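Logistic regression is appropriate because it models the probability of a binary outcome (churn yes/no) using a logistic function, making it a standard choice for binary classification tasks. It outputs a value between 0 and 1, which can be thresholded to predict the class label. Decision trees are also appropriate because they split on feature values to assign each customer to a class and handle a binary target directly.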

Exam trap

The trap here is that candidates may confuse unsupervised clustering (K-means) or association rule mining (Apriori) with supervised classification, or mistakenly think linear regression can be adapted for binary outcomes without transformation.

56
MCQhard

A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?

A.Reduce maximum depth
B.Increase minimum samples split
C.Increase number of leaves
D.Use a smaller dataset
AnswerA

Reducing max depth stops the tree from growing too deep, simplifying the model and reducing variance.

Why this answer

Reducing maximum depth limits the number of splits in the decision tree, which directly reduces model complexity and variance. A high-variance model is overfitting to training data, and capping depth prevents the tree from learning overly specific patterns that do not generalize.

Exam trap

CompTIA often tests the misconception that increasing model complexity (e.g., more leaves) reduces overfitting, when in reality it increases variance; the trap here is that candidates may confuse 'minimum samples split' as the only regularization technique, overlooking that reducing max depth is a more direct and effective hyperparameter for high variance.

How to eliminate wrong answers

Option B is wrong because, although increasing minimum samples split does reduce overfitting by requiring more samples before a node can split, it constrains the tree less directly than capping its depth; the question asks for the most effective adjustment. Option C is wrong because increasing the number of leaves increases model complexity, which would exacerbate overfitting and increase variance, not reduce it. Option D is wrong because using a smaller dataset would increase variance (less data leads to more unstable splits) and is not a hyperparameter adjustment; it is a data-level change that typically worsens overfitting.

57
MCQhard

A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?

A.Use raw data because k-means works with mixed types
B.Standardize continuous variables and one-hot encode categorical variables
C.Apply PCA first to reduce dimensionality
D.Remove categorical variables entirely
AnswerB

Standardization ensures equal weight; one-hot encoding converts categories to binary vectors.

Why this answer

Option B is correct because k-means clustering relies on Euclidean distance, which is sensitive to the scale of features. Standardizing continuous variables (e.g., spending amount, frequency) ensures they contribute equally to distance calculations, while one-hot encoding categorical variables (e.g., region, gender) converts them into numerical form without implying ordinal relationships, allowing k-means to process mixed data types correctly.

Exam trap

The trap here is that candidates assume k-means can natively handle mixed data types because it is a common clustering algorithm, but it strictly requires numerical input and scale normalization to avoid skewed distance calculations.

How to eliminate wrong answers

Option A is wrong because k-means cannot directly handle categorical variables; it requires numerical input and assumes continuous features, so using raw mixed-type data would produce meaningless distance calculations. Option C is wrong because PCA is a dimensionality reduction technique applied after preprocessing, not a substitute for standardizing and encoding; it may be used optionally but is not the required preparation step. Option D is wrong because removing categorical variables discards valuable demographic information that could improve segmentation, and k-means can incorporate them after proper encoding.
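The preparation step in Option B can be sketched without any external library. The customer records, column names, and helper functions below are hypothetical; the point is only to show z-score standardization for continuous columns and one-hot encoding for a categorical column.

```python
from statistics import mean, pstdev

# Hypothetical customer records with mixed data types.
customers = [
    {"spend": 120.0, "frequency": 2, "region": "north"},
    {"spend": 950.0, "frequency": 14, "region": "south"},
    {"spend": 430.0, "frequency": 6, "region": "north"},
    {"spend": 60.0, "frequency": 1, "region": "west"},
]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn each category into a 0/1 indicator vector with no implied order."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


spend_z = standardize([c["spend"] for c in customers])
freq_z = standardize([c["frequency"] for c in customers])
region_1h = one_hot([c["region"] for c in customers])

# One numeric feature vector per customer, ready for a distance-based algorithm.
features = [[s, f, *r] for s, f, r in zip(spend_z, freq_z, region_1h)]
for row in features:
    print([round(x, 2) for x in row])
```

Without standardization, the spend column (hundreds of units) would dominate the Euclidean distance over the frequency column (single digits).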

58
MCQeasy

A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?

A.Logistic regression
B.Ridge regression
C.Linear regression
D.Lasso regression
AnswerA

Logistic regression is used for binary classification, suitable for churn prediction.

Why this answer

Logistic regression is the correct choice because the target variable, customer churn, is binary (churn vs. no churn). Logistic regression models the probability of a binary outcome using a sigmoid function, making it suitable for classification tasks with both categorical and continuous predictors.

Exam trap

CompTIA often tests the misconception that 'regression' in the option name implies it is only for continuous outcomes, leading candidates to overlook logistic regression as a valid classification technique.

How to eliminate wrong answers

Option B (Ridge regression) is wrong because it is a regularized form of linear regression used for continuous outcomes, not binary classification. Option C (Linear regression) is wrong because it predicts a continuous value and is inappropriate for a binary dependent variable; it can produce probabilities outside [0,1] and violates the assumption of normally distributed errors. Option D (Lasso regression) is wrong because, like Ridge, it is a regularized linear regression for continuous targets and performs feature selection via L1 penalty, but it does not handle binary classification.
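The difference between a linear score and a logistic probability can be shown in a few lines. The weights, bias, and customer values below are invented for illustration and are not fitted to any data.

```python
from math import exp


def sigmoid(z: float) -> float:
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients for usage and tenure, plus an intercept.
weights = {"usage": 0.02, "tenure": -0.15}
bias = -0.5

customer = {"usage": 100, "tenure": 2}

# Linear score: this is what a plain linear regression would output.
z = bias + sum(weights[name] * customer[name] for name in weights)

# Logistic regression passes the same score through the sigmoid.
p = sigmoid(z)

print(f"linear score: {z:.2f}")
print(f"churn probability: {p:.2f}")
```

Here the linear score exceeds 1, which cannot be read as a probability, while the sigmoid output always stays between 0 and 1 and can be thresholded (for example at 0.5) to predict churn.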

59
MCQeasy

During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?

A.Log transformation
B.Reciprocal transformation
C.No transformation needed
D.Square root transformation
AnswerA

Log transformation effectively reduces right skewness.

Why this answer

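Log transformation compresses large values more than small ones, so it pulls in the long right tail and is the usual choice for heavily right-skewed data. Square root transformation is milder and better suited to moderate skew. Box-Cox can also work, but log is simpler.

Option A is correct.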
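The effect of a log transformation on skewness can be checked numerically. The `target` values below are made up to have a long right tail, and the `skewness` helper uses the simple population moment formula; both are assumptions for the sake of the example.

```python
from math import log
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (positive = right-skewed)."""
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# Hypothetical right-skewed target: most values are small, a few are very large.
target = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = [log(v) for v in target]   # requires strictly positive values

print(f"skewness before: {skewness(target):.2f}")
print(f"skewness after log: {skewness(logged):.2f}")
```

The log transformation only applies to positive values; data containing zeros is often shifted first (for example with log(1 + x)).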

60
MCQmedium

A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?

A.Hierarchical clustering with Gower distance
B.K-modes clustering
C.DBSCAN with Euclidean distance
D.K-means clustering
AnswerA

Gower distance can handle mixed data types by computing a dissimilarity matrix that combines numeric and categorical attributes.

Why this answer

Hierarchical clustering with Gower distance is best suited for mixed data types because Gower distance computes a dissimilarity measure that handles both numeric and categorical variables by normalizing numeric differences and using a simple matching coefficient for categorical ones. This allows the algorithm to create a distance matrix that equally weights all variable types, making it ideal for segmenting customers with purchase history, demographics, and website behavior data.

Exam trap

The trap here is that candidates often assume K-means or DBSCAN can handle mixed data by simply encoding categorical variables, but they overlook that Euclidean distance on encoded data distorts the geometry and fails to preserve the natural dissimilarity structure of categorical variables.

How to eliminate wrong answers

Option B (K-modes clustering) is wrong because it is designed exclusively for categorical data and cannot handle numeric variables like purchase history or website behavior metrics. Option C (DBSCAN with Euclidean distance) is wrong because Euclidean distance is only meaningful for numeric data and cannot properly measure dissimilarity between categorical variables, leading to distorted clusters. Option D (K-means clustering) is wrong because it relies on Euclidean distance and assumes numeric, continuous data; it cannot directly incorporate categorical variables without encoding, and even with encoding, it is sensitive to scaling and does not naturally handle mixed types.
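A minimal sketch of Gower distance for two records with mixed types, following the description above: numeric differences are scaled by the column's range and categorical columns contribute 0 for a match and 1 for a mismatch. The records, column names, and the equal weighting of all columns are assumptions for illustration.

```python
def gower_distance(a, b, numeric_ranges, categorical_keys):
    """Average per-column dissimilarity, each component between 0 and 1."""
    parts = []
    for key, value_range in numeric_ranges.items():
        # Range-normalized absolute difference for numeric columns.
        parts.append(abs(a[key] - b[key]) / value_range if value_range else 0.0)
    for key in categorical_keys:
        # Simple matching for categorical columns.
        parts.append(0.0 if a[key] == b[key] else 1.0)
    return sum(parts) / len(parts)


# Hypothetical customers with numeric and categorical attributes.
customers = [
    {"spend": 100, "visits": 2, "region": "north"},
    {"spend": 300, "visits": 10, "region": "south"},
    {"spend": 200, "visits": 6, "region": "north"},
]

numeric_ranges = {
    key: max(c[key] for c in customers) - min(c[key] for c in customers)
    for key in ("spend", "visits")
}

d_01 = gower_distance(customers[0], customers[1], numeric_ranges, ["region"])
d_02 = gower_distance(customers[0], customers[2], numeric_ranges, ["region"])
print(f"customer 0 vs 1: {d_01:.2f}")
print(f"customer 0 vs 2: {d_02:.2f}")
```

The pairwise distances computed this way form the dissimilarity matrix that hierarchical clustering takes as input.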

61
MCQhard

A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?

A.Decrease the training data size
B.Use a polynomial transformation
C.Increase the number of features
D.Apply L2 regularization (Ridge)
AnswerD

Ridge regularization adds a penalty to large coefficients, reducing variance and combating overfitting.

Why this answer

The model is overfitting the training data, as evidenced by high performance on training data but poor performance on validation data. L2 regularization (Ridge) adds a penalty term proportional to the square of the coefficients, which shrinks them and reduces model complexity, thereby improving generalization to unseen data.

Exam trap

CompTIA often tests the distinction between overfitting and underfitting, and candidates mistakenly choose polynomial transformation or adding features thinking they will improve fit, when in fact they increase model complexity and worsen overfitting.

How to eliminate wrong answers

Option A is wrong because decreasing the training data size would exacerbate overfitting by providing the model with even less information to learn generalizable patterns. Option B is wrong because polynomial transformation increases model complexity and feature interactions, which typically worsens overfitting rather than addressing it. Option C is wrong because increasing the number of features adds more predictors, which increases the risk of overfitting and does not directly penalize large coefficients.
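The shrinking effect of the L2 penalty can be seen in the simplest possible case: one feature and no intercept, where the Ridge solution has a closed form. The data points and the penalty strength `alpha = 10` are arbitrary choices for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Minimize sum((y - w*x)^2) + alpha * w^2 for one feature, no intercept.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x^2) + alpha).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


# Hypothetical training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

ols_slope = ridge_slope(xs, ys, alpha=0.0)      # no penalty: ordinary least squares
ridge_coef = ridge_slope(xs, ys, alpha=10.0)    # penalty pulls the slope toward 0

print(f"OLS slope:   {ols_slope:.3f}")
print(f"Ridge slope: {ridge_coef:.3f}")
```

The penalty appears in the denominator, so a larger `alpha` always yields a coefficient closer to zero, trading a little bias for lower variance.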

62
MCQhard

A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?

A.Duplicate customer IDs
B.Missing values in total_charges
C.Inconsistent data in total_charges
D.Outliers in monthly_charges
AnswerC

The total_charges for the first customer is equal to monthly_charges, suggesting a calculation error.

Why this answer

Option C is correct because the exhibit shows that the 'total_charges' column contains entries like '1,234.56' and '1234.56', which are inconsistent numeric formats. Logistic regression in Python (e.g., using scikit-learn) requires all feature values to be numeric and consistent; mixed formats will cause parsing errors or incorrect model training. The analyst must standardize these values to a uniform numeric type (e.g., float) before proceeding.

Exam trap

CompTIA often tests the distinction between data quality issues that prevent model execution (like inconsistent data types) versus issues that degrade model performance (like outliers or missing values), and candidates frequently overlook the former because they focus on statistical concerns rather than data preprocessing fundamentals.

How to eliminate wrong answers

Option A is wrong because duplicate customer IDs are a data integrity issue that can cause data leakage or overfitting, but the exhibit does not show any duplicate IDs, and this is not the most immediate problem for model training. Option B is wrong because missing values in 'total_charges' are not indicated in the exhibit; the issue is inconsistent formatting, not absence of data. Option D is wrong because outliers in 'monthly_charges' are not visible in the exhibit, and while outliers can affect logistic regression, they are a secondary concern compared to the fundamental data type inconsistency that prevents the model from even reading the data correctly.

63
MCQhard

Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?

A.bathrooms
B.sqft_living
C.Intercept
D.bedrooms
AnswerB

sqft_living has the highest absolute t-value (10.0) indicating strong effect.

Why this answer

Standardized coefficients (beta weights) allow comparison of the relative strength of independent variables by measuring the number of standard deviations the dependent variable changes per one standard deviation change in the predictor. In the regression output, sqft_living has the highest absolute standardized coefficient, indicating it has the strongest effect on price. The intercept is not an independent variable and its coefficient is not standardized for comparison.

Exam trap

The trap here is that candidates mistakenly compare unstandardized coefficients or p-values instead of standardized coefficients, leading them to choose a variable like bathrooms or bedrooms that appears significant but has a weaker standardized effect.

How to eliminate wrong answers

Option A is wrong because bathrooms may have a statistically significant coefficient, but its standardized coefficient is smaller than that of sqft_living, meaning it has a weaker relative effect on price. Option C is wrong because the intercept is a constant term representing the predicted price when all independent variables are zero; it is not an independent variable and its coefficient is not standardized for effect comparison. Option D is wrong because bedrooms, while possibly significant, has a lower absolute standardized coefficient than sqft_living, indicating a weaker influence on price per standard deviation change.

64
MCQhard

A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?

A.Recall (sensitivity)
B.F1-score
C.Precision
D.Accuracy
AnswerA

Recall measures how many actual churners were correctly found, directly addressing the focus.

Why this answer

Recall (sensitivity) is the most appropriate metric because it measures the proportion of actual churners correctly identified by the model. Since the dataset is imbalanced (only 10% churners) and the analyst's focus is on correctly identifying churners, recall directly addresses the cost of missing positive cases (false negatives). Accuracy would be misleading due to class imbalance, while precision and F1-score prioritize different trade-offs.

Exam trap

The trap here is that candidates often default to accuracy as the default metric, failing to recognize that class imbalance renders accuracy misleading, and that the question's explicit focus on 'correctly identifying churners' points directly to recall, not precision or F1-score.

How to eliminate wrong answers

Option B (F1-score) is wrong because it balances precision and recall, but the analyst's primary goal is to maximize identification of churners, not to balance false positives and false negatives; F1-score would penalize a model that achieves high recall at the expense of precision, which may be acceptable in this scenario. Option C (Precision) is wrong because it measures the proportion of predicted churners that are actual churners, focusing on false positives rather than false negatives; the analyst wants to minimize missed churners, not necessarily avoid false alarms. Option D (Accuracy) is wrong because with only 10% churners, a naive model predicting all non-churners would achieve 90% accuracy, masking poor performance on the minority class; accuracy is inappropriate for imbalanced classification problems.
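
To make the contrast concrete, the sketch below computes all four metrics from confusion-matrix counts. The counts are hypothetical (1,000 customers, 100 of them churners) and are not taken from the question.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical naive model: predicts "no churn" for every customer.
naive = classification_metrics(tp=0, fp=0, fn=100, tn=900)

# Hypothetical model: finds 80 of the 100 churners, with 120 false alarms.
model = classification_metrics(tp=80, fp=120, fn=20, tn=780)

print("naive:", naive)
print("model:", model)
```

Under these assumed counts the naive model has the higher accuracy but a recall of zero, while the second model gives up some accuracy and precision to find most of the churners.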

65
MCQmedium

A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?

A.Linear regression
B.Time series analysis
C.K-means clustering
D.Logistic regression
AnswerB

Time series analysis explicitly models seasonal patterns.

Why this answer

Time series analysis is specifically designed to model data points indexed in time order, making it ideal for capturing and forecasting seasonal patterns. Unlike regression models, it accounts for autocorrelation, trends, and seasonality components, which are critical for accurate sales prediction from historical data.

Exam trap

The trap here is that candidates see 'predict future sales' and mistakenly choose linear regression, overlooking that time series methods are required when data has temporal dependencies and seasonality.

How to eliminate wrong answers

Option A is wrong because a plain linear regression assumes independent observations and does not, by itself, model time-dependent structure such as seasonality or autocorrelation. Option C is wrong because K-means clustering is an unsupervised learning method used for grouping similar data points, not for forecasting future values. Option D is wrong because logistic regression is used for binary classification problems, not for predicting continuous numeric sales figures.

66
MCQmedium

The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?

A.Count total orders per customer regardless of date
B.Calculate average order count per customer for 2023
C.Find products with more than 5 orders in 2023
D.Identify customers who placed more than 5 orders in 2023
AnswerD

The query filters by 2023 date and having count > 5.

Why this answer

The query groups orders by customer_id and filters using a HAVING clause with COUNT(*) > 5, which counts the number of orders per customer. The WHERE clause restricts orders to those placed in 2023, so the result identifies customers who placed more than 5 orders in that year. This matches option D exactly.

Exam trap

CompTIA often tests the distinction between WHERE and HAVING, and the trap here is confusing a count of orders per customer with a count of products or an average, leading candidates to pick option B or C.

How to eliminate wrong answers

Option A is wrong because the WHERE clause filters for order_date in 2023, so the count is not regardless of date. Option B is wrong because the query counts orders per customer, not the average order count per customer. Option C is wrong because the query operates on an 'orders' table with no product-related column; it counts orders per customer, not products.
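
The exhibit is not reproduced here, so the query below is only an assumed reconstruction consistent with the explanation (a WHERE filter on 2023 dates, GROUP BY customer_id, HAVING COUNT(*) > 5). The sketch runs it against an in-memory SQLite table filled with invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT)"
)

# Invented data: (customer_id, year, number of orders in that year).
rows, order_id = [], 1
for customer_id, year, n in [(1, 2023, 6), (2, 2023, 5), (2, 2022, 3), (3, 2022, 7)]:
    for i in range(n):
        rows.append((order_id, customer_id, f"{year}-{(i % 12) + 1:02d}-15"))
        order_id += 1
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# WHERE filters rows before grouping; HAVING filters the groups afterwards.
query = """
SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'
GROUP BY customer_id
HAVING COUNT(*) > 5
"""
result = conn.execute(query).fetchall()
print(result)
```

Customer 2 has eight orders overall but only five in 2023, and customer 3 has seven orders, all in 2022, so only customer 1 passes both filters.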

67
MCQmedium

A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?

A.SARIMA
B.Simple exponential smoothing
C.Holt-Winters exponential smoothing
D.ARIMA
AnswerC

Holt-Winters includes trend and seasonality components, making it suitable for this data.

Why this answer

The Holt-Winters exponential smoothing model (option C) is the most appropriate because it explicitly captures both trend and seasonality components, which are present in the sales data (upward trend and yearly seasonal patterns). Unlike simple exponential smoothing, Holt-Winters includes additive or multiplicative seasonal terms, making it ideal for data with clear, repeating seasonal cycles over a 12-month horizon.

Exam trap

The trap here is that candidates often choose ARIMA or SARIMA because they are more 'advanced,' but the question specifically describes clear trend and seasonality without requiring stationarity or differencing, making Holt-Winters the most direct and appropriate choice.

How to eliminate wrong answers

Option A (SARIMA) is wrong because while SARIMA can model trend and seasonality, it relies on regular and seasonal differencing to make the series stationary and involves more complex parameter selection (p, d, q, P, D, Q, s); for a straightforward forecasting task with clear trend and seasonality, Holt-Winters is simpler and often more robust. Option B (Simple exponential smoothing) is wrong because it models only the level (no trend or seasonality), so it would fail to capture the upward trend and yearly seasonal patterns in the sales data. Option D (ARIMA) is wrong because it can handle trend through differencing but has no seasonal terms; without seasonal differencing or seasonal AR terms, it cannot account for the repeating yearly patterns in the data.
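
The level, trend, and seasonal components described above can be sketched as a small additive Holt-Winters routine. This is a minimal illustration, not a production implementation: the initialisation scheme, the smoothing parameters, and the synthetic sales series are all assumptions made for the example.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` steps ahead with additive Holt-Winters smoothing."""
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons to initialise")

    # Initialise the trend from the change in seasonal means, the level at the
    # end of the first season, and seasonal offsets from the detrended season.
    mean_first = sum(series[:m]) / m
    mean_second = sum(series[m:2 * m]) / m
    trend = (mean_second - mean_first) / m
    level = mean_first + trend * (m - 1) / 2
    seasonals = [series[i] - (mean_first + trend * (i - (m - 1) / 2))
                 for i in range(m)]

    # Update level, trend, and the matching seasonal offset for each new point.
    for t in range(m, len(series)):
        y, s_prev, prev_level = series[t], seasonals[t % m], level
        level = alpha * (y - s_prev) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (y - level) + (1 - gamma) * s_prev

    n = len(series)
    return [level + h * trend + seasonals[(n + h - 1) % m]
            for h in range(1, horizon + 1)]


# Hypothetical monthly sales: +2 units per month plus a repeating yearly pattern.
pattern = [-10, -8, -3, 0, 4, 9, 12, 9, 3, -2, -6, -8]
sales = [100 + 2 * t + pattern[t % 12] for t in range(36)]
forecast = holt_winters_additive(sales, 12, alpha=0.3, beta=0.1, gamma=0.2,
                                 horizon=12)
print([round(value, 1) for value in forecast])
```

Because the synthetic series is exactly a linear trend plus a repeating pattern, the forecast simply continues it. On real data the smoothing parameters would normally be estimated rather than fixed by hand.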

68
Multi-Selectmedium

In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)

Select 2 answers
A.Linearity: the relationship between predictors and response is linear
B.Large sample size
C.Normality of errors
D.Homoscedasticity: errors have constant variance
E.Independence of errors
AnswersA, D

Nonlinear relationships can bias coefficient estimates.

Why this answer

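For unbiased coefficient estimates in multiple linear regression, the linearity assumption (A) ensures that the model correctly specifies the functional form between predictors and the response; a misspecified form biases the coefficients. Homoscedasticity (D) means that the variance of the errors is constant across all levels of the predictors and is one of the Gauss-Markov conditions. Strictly speaking, ordinary least squares (OLS) coefficients remain unbiased under heteroscedasticity; what is lost is efficiency (OLS is no longer the best linear unbiased estimator) and the validity of the usual standard errors. For exam purposes, linearity and homoscedasticity are the expected pair.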

Exam trap

CompTIA often tests the distinction between assumptions required for unbiasedness (linearity and homoscedasticity) versus those needed for efficiency or inference (normality, independence, large sample size), causing candidates to mistakenly select normality or independence as critical for unbiased coefficients.

69
MCQmedium

A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?

A.Box plot
B.Scatter plot
C.Line chart
D.Histogram
AnswerD

Histograms display the frequency distribution of a single numeric variable across bins.

Why this answer

A histogram is the most appropriate chart for visualizing the distribution of a single discrete variable, such as satisfaction ratings on a 1-5 scale. It groups the responses into bins (each rating value) and displays the frequency of each bin using bars, clearly showing the shape, central tendency, and spread of the data.

Exam trap

The trap here is that candidates often confuse a histogram with a bar chart, but the key distinction is that a histogram is used for data on an ordered numeric scale (discrete or continuous) where bin order matters, while a bar chart is for categorical (nominal) data with no inherent order.

How to eliminate wrong answers

Option A is wrong because a box plot summarizes data using five-number statistics (min, Q1, median, Q3, max) and is better for comparing distributions across groups, not for showing the detailed frequency distribution of a single ordinal variable. Option B is wrong because a scatter plot is used to visualize the relationship between two continuous variables, not the distribution of a single categorical or ordinal variable. Option C is wrong because a line chart is typically used to display trends over time or sequential data, not the frequency distribution of discrete survey responses.
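
As a rough illustration of what the histogram encodes, the sketch below counts how often each rating occurs and prints a text bar per bin. The survey responses are invented for the example.

```python
from collections import Counter

# Hypothetical survey responses on a 1-5 satisfaction scale.
ratings = [5, 4, 4, 3, 5, 2, 4, 1, 3, 4, 5, 4, 3, 2, 4]

# One bin per rating value; the bar length is the frequency of that rating.
counts = Counter(ratings)
for rating in range(1, 6):
    print(f"{rating} | {'#' * counts[rating]} ({counts[rating]})")
```

Keeping the bins in scale order (1 through 5) is what lets the reader see the shape of the distribution, here skewed toward the higher ratings.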

70
Multi-Selectmedium

Which THREE of the following are common steps in data cleaning?

Select 3 answers
A.Removing outliers without justification
B.Imputing missing values
C.Standardizing data formats
D.Removing duplicate records
E.Increasing sample size
AnswersB, C, D

Missing values are often imputed to maintain dataset completeness.

Why this answer

Imputing missing values is a common data cleaning step because real-world datasets often have gaps due to data collection errors or system failures. Techniques like mean/median imputation, regression imputation, or using algorithms like k-NN help preserve sample size and avoid bias that would result from simply dropping rows. This ensures the dataset remains usable for analysis without introducing significant distortion.

Exam trap

CompTIA often tests the distinction between data cleaning steps and data collection or preprocessing steps, so the trap here is confusing 'increasing sample size' (a data augmentation or collection activity) with actual cleaning tasks like imputation, standardization, and deduplication.
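
The three cleaning steps named in the answer can be sketched on a handful of records. The field names and values below are hypothetical, and median imputation is just one of the imputation options mentioned above.

```python
from statistics import median

# Hypothetical raw records with a duplicate, inconsistent formats, and a gap.
raw = [
    {"id": 1, "country": "us ", "salary": 50000},
    {"id": 2, "country": "US", "salary": None},
    {"id": 2, "country": "US", "salary": None},   # duplicate record
    {"id": 3, "country": "Us", "salary": 54000},
    {"id": 4, "country": "ca", "salary": 61000},
]

# 1. Remove duplicate records, keeping the first occurrence of each id.
seen, cleaned = set(), []
for row in raw:
    if row["id"] not in seen:
        seen.add(row["id"])
        cleaned.append(dict(row))

# 2. Standardize formats: trim whitespace and use upper-case country codes.
for row in cleaned:
    row["country"] = row["country"].strip().upper()

# 3. Impute missing salaries with the median of the observed values.
fill_value = median(r["salary"] for r in cleaned if r["salary"] is not None)
for row in cleaned:
    if row["salary"] is None:
        row["salary"] = fill_value

print(cleaned)
```

None of these steps adds new observations, which is why increasing the sample size does not belong in the list.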

71
MCQhard

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?

A.Class imbalance
B.Non‑linear decision boundary
C.Multicollinearity among predictor variables
D.Overfitting due to too many features
AnswerA

With only 500 churners out of 10,000, the model predicts most as non-churners, achieving high accuracy but low precision for the minority class.

Why this answer

The dataset has only 500 churners out of 10,000 records (5% churn rate), which is a classic class imbalance. Logistic regression can achieve high accuracy by simply predicting the majority class (non-churn) for all records, yielding 95% accuracy even without learning anything about churn. The very low precision (15%) for the churn class indicates that most of the positive predictions are false positives, a direct consequence of the model being biased toward the majority class due to imbalance.

Exam trap

CompTIA often tests the misconception that high accuracy always means a good model, hiding the fact that with imbalanced data, accuracy is misleading and metrics like precision, recall, or F1-score for the minority class are critical.

How to eliminate wrong answers

Option B is wrong because logistic regression inherently models a linear decision boundary; while non-linear boundaries can be approximated with feature engineering (e.g., polynomial terms), the core issue here is class imbalance, not boundary shape. Option C is wrong because multicollinearity inflates coefficient standard errors but does not cause the extreme precision drop seen here; it affects interpretability, not the fundamental accuracy-imbalance trade-off. Option D is wrong because overfitting would typically yield high training accuracy but poor generalization, not a specific low precision for the minority class while maintaining high overall accuracy; the model is actually underfitting the minority class.

72
MCQmedium

A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?

A.K-means clustering
B.K-nearest neighbors
C.Linear regression
D.Decision tree
AnswerA

K-means is an unsupervised clustering algorithm suitable for segmentation.

Why this answer

K-means clustering is the correct choice because it is an unsupervised learning algorithm that groups unlabeled data into clusters based on feature similarity. Since the marketing team has no prior labels for customer segments, K-means can partition customers by purchasing behavior patterns, such as frequency and monetary value, without needing predefined categories.

Exam trap

The trap here is that candidates often confuse unsupervised clustering (K-means) with supervised classification (K-nearest neighbors) because both involve 'K' and grouping, but KNN requires labeled data and predicts labels, while K-means discovers inherent structures without labels.

How to eliminate wrong answers

Option B is wrong because K-nearest neighbors is a supervised learning algorithm that requires labeled training data to classify or predict outcomes, making it unsuitable for unlabeled segmentation. Option C is wrong because linear regression is a supervised regression algorithm used to predict a continuous target variable, not to discover hidden groupings in unlabeled data. Option D is wrong because decision trees are typically used for supervised classification or regression tasks, relying on labeled data to split on features, and cannot perform unsupervised clustering without prior labels.
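
A minimal K-means sketch shows how segments emerge without labels: points are assigned to the nearest centroid, then each centroid moves to the mean of its points. The customer data (purchase frequency, spend) and the choice of starting centroids are assumptions made for illustration.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid-update steps a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, [nearest(p, centroids) for p in points]

# Hypothetical customers: (purchases per month, average monthly spend).
customers = [(2, 20), (3, 25), (1, 15), (10, 200), (12, 220), (11, 210)]
centroids, labels = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids, labels)
```

No segment labels are supplied anywhere; the two groups come purely from distances between customers. In practice the features would usually be scaled first so that spend does not dominate the distance.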

73
MCQhard

A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?

A.Ridge regression
B.Forward selection
C.Principal Component Analysis (PCA)
D.Lasso regression
AnswerC

PCA reduces dimensionality by creating new features that capture maximum variance.

Why this answer

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms the original features into a set of orthogonal components, ordered by the variance they capture. Given 1000 features and only 500 samples, PCA is ideal because it reduces the feature space while retaining the maximum variance, helping to avoid overfitting and the curse of dimensionality.

Exam trap

CompTIA often tests the distinction between supervised feature selection (Lasso, Forward selection) and unsupervised dimensionality reduction (PCA), trapping candidates who confuse regularization with variance-based reduction.

How to eliminate wrong answers

Option A is wrong because Ridge regression is a regularization technique that shrinks coefficients but does not reduce the number of features; it retains all features with penalized weights. Option B is wrong because Forward selection is a supervised feature selection method that selects features based on their predictive power, not on variance retention, and it can be computationally expensive with 1000 features. Option D is wrong because Lasso regression performs feature selection by shrinking some coefficients to zero, but it is a supervised method that selects features based on target correlation, not on maximizing variance retention, and may not be optimal for unsupervised dimensionality reduction.

74
MCQeasy

A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?

A.Concatenate
B.Append
C.Merge
D.Aggregate
AnswerB

Append adds rows from one dataset to another with same columns.

Why this answer

Option B (Append) is correct because appending is the standard operation for combining two datasets with identical columns but different rows, stacking the rows from one dataset onto the other. In SQL, this is achieved with the UNION ALL operator (or UNION, which also removes duplicate rows), and in Python pandas, it is done with `pd.concat()` along axis=0; the older `DataFrame.append()` method was deprecated and then removed in pandas 2.0. This operation preserves the column structure while extending the row count.

Exam trap

The trap here is that candidates confuse 'concatenate' (which can mean row-wise or column-wise) with 'append' (which specifically means row-wise stacking), leading them to choose Option A when the question explicitly requires combining rows.

How to eliminate wrong answers

Option A (Concatenate) is wrong because concatenation is a general term that can refer to combining along any axis (rows or columns), and in many contexts (e.g., SQL string functions, pandas with axis=1), it implies joining side-by-side rather than stacking rows; the question specifically requires row-wise stacking, which is append. Option C (Merge) is wrong because merge is used to combine datasets based on a common key column (like a SQL JOIN), not to simply stack rows when columns are identical. Option D (Aggregate) is wrong because aggregation involves summarizing data (e.g., SUM, AVG, COUNT) across groups, not combining separate datasets.
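
Row-wise stacking can be sketched with SQLite from Python. The table names and rows are hypothetical; the point is only that both tables share the same columns and the result has more rows, not more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_q1 (order_id INTEGER, amount REAL);
CREATE TABLE sales_q2 (order_id INTEGER, amount REAL);
INSERT INTO sales_q1 VALUES (1, 10.0), (2, 20.0);
INSERT INTO sales_q2 VALUES (2, 20.0), (3, 30.0);
""")

# UNION ALL appends every row from the second table to the first.
stacked = conn.execute("""
    SELECT order_id, amount FROM sales_q1
    UNION ALL
    SELECT order_id, amount FROM sales_q2
""").fetchall()

# UNION appends rows too, but drops exact duplicates.
distinct = conn.execute("""
    SELECT order_id, amount FROM sales_q1
    UNION
    SELECT order_id, amount FROM sales_q2
    ORDER BY order_id
""").fetchall()

print(len(stacked), len(distinct))
```

The row (2, 20.0) appears in both tables, so the UNION ALL result keeps it twice while the UNION result keeps it once.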

75
MCQmedium

A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?

A.Bar chart
B.Histogram
C.Scatter plot
D.Box plot
AnswerD

Box plot displays distribution across groups.

Why this answer

A box plot (option D) is the most suitable chart for visualizing the distribution of a continuous variable across different categories because it displays the median, quartiles, and potential outliers for each group, enabling direct comparison of spread and central tendency. Unlike a histogram, which shows the distribution of a single continuous variable without categorical grouping, the box plot inherently supports categorical axes. This makes it ideal for exploratory data analysis when assessing how a metric like revenue varies by region or product category.

Exam trap

CompTIA often tests the distinction between histograms and box plots by presenting a scenario where a candidate mistakenly chooses a histogram for grouped categorical data, overlooking that histograms require a continuous x-axis and cannot inherently separate categories without additional faceting.

How to eliminate wrong answers

Option A is wrong because a bar chart is designed for comparing categorical data using discrete counts or sums, not for showing the distribution of a continuous variable across categories. Option B is wrong because a histogram visualizes the distribution of a single continuous variable using bins, but it does not natively separate data into distinct categories; you would need faceting or multiple histograms, which is less efficient than a box plot. Option C is wrong because a scatter plot is used to examine the relationship between two continuous variables, not to compare distributions of one continuous variable across categories.
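
What a box plot draws for each category is essentially a five-number summary. The sketch below computes one per group using Python's `statistics.quantiles`; the revenue figures and region names are invented, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is built from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical revenue (in thousands) by region.
revenue = {
    "North": [10, 20, 30, 40, 50],
    "South": [15, 18, 20, 22, 90],
}

summaries = {region: five_number_summary(vals) for region, vals in revenue.items()}
for region, summary in summaries.items():
    print(region, summary)
```

Placing the two summaries side by side shows that South has a lower median and a narrower interquartile range than North, plus one unusually large value, which is the kind of per-category comparison the question describes.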
