CCNA Dap Analysing Data Questions

MCQhard

An analyst runs an A/B test with 1000 users per group and observes a conversion rate of 5% in the control and 6% in the treatment. The p-value is 0.12. What should the analyst conclude?

A.The difference is not statistically significant at the 0.05 level.

B.The sample size is too small to detect an effect.

C.The treatment significantly outperforms control.

D.There is a 12% chance the treatment is better.

AnswerA

Correct interpretation.

Why this answer

Since p-value > 0.05, we fail to reject the null hypothesis; the observed difference is not statistically significant.

MCQeasy

In A/B testing, the null hypothesis typically states that:

A.There is no difference between the control and treatment groups

B.The treatment group will perform better than the control group

C.The sample size is sufficient for the test

D.There is a significant difference between the control and treatment groups

AnswerA

Correct definition of null hypothesis.

Why this answer

The null hypothesis (H0) is a statement of no effect or no difference between groups.

MCQeasy

An analyst computed the mean, median, and mode of a dataset and found they are all equal. Which of the following best describes the distribution?

A.Bimodal

B.Negatively skewed

C.Positively skewed

D.Symmetric

AnswerD

Symmetric distributions have equal mean, median, and mode.

Why this answer

When the mean, median, and mode are equal, the data are consistent with a symmetric, unimodal distribution, such as a normal distribution. Skewed distributions pull the mean away from the median and mode.

MCQhard

A time series of monthly sales data exhibits a clear upward trend over several years, with consistent peaks each December. Which components are present in this series?

A.Trend and seasonality

B.Cyclical and irregular components only

C.Seasonality and cyclical components only

D.Trend and irregular components only

AnswerA

Correct identification.

Why this answer

The upward trend is a trend component, and the consistent December peaks indicate seasonality.

MCQhard

A dataset contains a feature with values ranging from 10 to 1000. The analyst applies min-max normalization to scale the feature between 0 and 1. What is the normalized value of 520?

A.0.515

B.0.510

C.0.480

D.0.520

AnswerA

Calculation yields 0.515.

Why this answer

Min-max normalization formula: (x - min) / (max - min) = (520 - 10) / (1000 - 10) = 510 / 990 ≈ 0.515.
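
The arithmetic can be checked with a short Python sketch. The helper name `min_max_scale` is hypothetical and not part of any library.

```python
def min_max_scale(x, lo, hi):
    """Scale x from the range [lo, hi] to the range [0, 1]."""
    if hi == lo:
        raise ValueError("max and min must differ")
    return (x - lo) / (hi - lo)


# Values from the question: feature range 10 to 1000, raw value 520.
normalized = min_max_scale(520, 10, 1000)
print(round(normalized, 3))  # 0.515
```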

MCQmedium

An analyst is comparing the average sales of two different store locations using a t-test. The p-value obtained is 0.03, and the significance level is 0.05. What should the analyst conclude?

A.Fail to reject the null hypothesis; no significant difference

B.The test is inconclusive because the p-value is too low

C.Reject the null hypothesis; there is a significant difference

D.Accept the null hypothesis; the means are equal

AnswerC

Correct interpretation.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference in mean sales between the two locations.

Multi-Selectmedium

A data analyst is preparing to run an A/B test comparing two email subject lines. Which TWO of the following should the analyst define before the test begins?

Select 2 answers

A.The exact lift in conversion rate

B.The time series decomposition

C.The p-value after the test

D.The null and alternative hypotheses

E.The sample size required for the desired power

AnswersD, E

Needed to frame the test and interpret results.

Why this answer

Before A/B testing, define null and alternative hypotheses, and determine sample size needed for desired statistical power and effect size.

MCQmedium

A retailer wants to test if a new website layout increases the average time spent on the site. They split traffic: control group (old layout) and treatment group (new layout). Which statistical test is most appropriate to compare the average time spent between the two groups?

A.ANOVA

B.Pearson correlation

C.Chi-square test

D.Two-sample t-test

AnswerD

Compares means of two independent groups.

Why this answer

A t-test is used to compare means of two independent groups.

MCQhard

A data analyst runs an A/B test on a new website layout. The test yields a p-value of 0.04 with the null hypothesis being no difference in conversion rates. The significance threshold is α=0.05. Which of the following is the correct conclusion?

A.The result is not significant; accept the alternative hypothesis.

B.Reject the null hypothesis; the new layout is proven to increase conversions.

C.Reject the null hypothesis; there is a statistically significant difference in conversion rates.

D.Fail to reject the null hypothesis; there is no evidence of a difference.

AnswerC

Correct interpretation: statistically significant difference exists.

Why this answer

Since p-value (0.04) < α (0.05), we reject the null hypothesis and conclude there is a statistically significant difference. However, statistical significance does not guarantee practical significance.

Multi-Selectmedium

A data analyst is performing a chi-square test of independence on a 2x2 contingency table. The p-value is 0.04. At α=0.05, which THREE of the following statements are correct?

Select 3 answers

A.There is a statistically significant association between the two variables.

B.The test indicates a strong association between variables.

C.The variables are not independent.

D.The null hypothesis is rejected.

E.The result is not statistically significant.

AnswersA, C, D

Correct: Significant association exists.

Why this answer

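Since p < α, we reject the null hypothesis of independence, meaning there is a statistically significant association. The chi-square test does not measure the strength of that association (Cramér's V does), so option B does not follow, and it does not identify which specific categories drive the result.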

MCQmedium

A dataset contains height measurements in centimeters and inches. An analyst wants to apply k-means clustering. Which data transformation should be applied before clustering?

A.Log transformation

B.Z-score standardization

C.Min-max normalization

D.No transformation needed

AnswerC

Normalization ensures equal weight from all features.

Why this answer

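Min-max normalization scales each feature to a common range, often [0, 1], so that no feature dominates the distance calculations in k-means. Z-score standardization achieves a comparable effect; this question keys min-max normalization.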

MCQhard

A data analyst is testing whether a new website layout increases conversion rate. The p-value from the test is 0.03. Using a significance level of 0.05, what is the correct conclusion?

A.Reject the null hypothesis; the new layout significantly increases conversion rate

B.The test is inconclusive because p-value is greater than 0.01

C.Accept the null hypothesis; the new layout has no effect

D.Fail to reject the null hypothesis; the new layout does not increase conversion rate

AnswerA

Correct interpretation.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis and conclude there is a statistically significant difference.

Multi-Selectmedium

Which TWO of the following data quality dimensions are most directly affected by duplicate records?

Select 2 answers

A.Timeliness

B.Consistency

C.Uniqueness

D.Accuracy

E.Completeness

AnswersC, D

Correct: Duplicates violate uniqueness.

Why this answer

Duplicates harm accuracy (incorrect counts) and uniqueness (duplicate entries). Completeness, consistency, timeliness are less directly affected.

Multi-Selectmedium

An analyst is preparing data for a clustering algorithm that uses Euclidean distance. Which TWO data preprocessing techniques should be applied to ensure all features contribute equally?

Select 2 answers

A.Min-max normalization

B.Z-score standardization

C.Log transformation

D.Principal component analysis

E.One-hot encoding

AnswersA, B

Scales features to [0,1] range.

Why this answer

Min-max normalization and Z-score standardization both scale features to comparable ranges, preventing features with larger scales from dominating distance calculations.

MCQmedium

An analyst calculates a Pearson correlation coefficient of -0.8 between advertising spend and customer churn rate. Which interpretation is correct?

A.There is a weak positive relationship.

B.Advertising spend causes churn to decrease.

C.64% of the variance in churn is explained by spend.

D.Increasing advertising spend is associated with decreasing churn rate.

AnswerD

Negative correlation: one goes up, other down.

Why this answer

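A negative correlation means that as one variable increases, the other tends to decrease, and a coefficient of -0.8 (close to -1) indicates a strong relationship. Correlation alone does not establish causation, which rules out option B.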

MCQmedium

A data analyst wants to segment customers into groups based on their purchasing behavior. The dataset includes numerical features such as annual income and purchase frequency. Which algorithm is most appropriate for this task?

A.Linear regression

B.K-means clustering

C.Logistic regression

D.Chi-square test

AnswerB

Correct: K-means is unsupervised clustering for segmentation.

Why this answer

K-means clustering is a common algorithm for customer segmentation based on numerical features.

Multi-Selecthard

A data scientist is conducting an A/B test with a significance level of 0.05. Which three factors should be considered when calculating the required sample size? (Choose THREE)

Select 3 answers

A.Seasonality of the data

B.Statistical power (e.g., 0.80)

C.Minimum detectable effect size

D.Number of clusters in k-means

E.Significance level (α)

AnswersB, C, E

Higher power requires larger sample.

Why this answer

Sample size calculation depends on desired power, effect size, and significance level.
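
To show how the three inputs interact, here is a minimal sketch that assumes a two-sided test comparing two proportions and uses the common normal-approximation formula. The function name and the 5% versus 6% conversion rates are illustrative assumptions, not part of the question.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of binomial variances
    effect = abs(p1 - p2)                          # minimum detectable effect
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed baseline 5% and target 6% conversion rates.
n = sample_size_per_group(0.05, 0.06)
print(n)
```

Under these assumptions the result is roughly eight thousand users per group. Raising the power, lowering α, or shrinking the detectable effect each increases the required sample.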

MCQhard

A logistic regression model is used to predict the probability of customer churn. The model's coefficient for the feature 'customer support calls' is 0.8 with a p-value of 0.001. Which interpretation is correct?

A.For each additional support call, the log-odds of churn increase by 0.8, and this effect is statistically significant.

B.The odds of churn are multiplied by 0.8 for each additional call.

C.Support calls have no significant effect on churn.

D.For each additional support call, the probability of churn increases by 80%.

AnswerA

Correct interpretation of logistic regression coefficient.

Why this answer

In logistic regression, a positive coefficient indicates that as the predictor increases, the log-odds of the outcome increase. The p-value being less than 0.05 indicates the effect is statistically significant.
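
A short sketch makes the distinction between log-odds, odds, and probability concrete. The coefficient 0.8 comes from the question; the intercept of -2.0 is a hypothetical value chosen only for illustration.

```python
import math


def sigmoid(z):
    """Map log-odds to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


coefficient = 0.8                    # log-odds change per extra support call
odds_ratio = math.exp(coefficient)   # multiplicative change in the odds
print(round(odds_ratio, 2))          # 2.23

intercept = -2.0                     # hypothetical baseline log-odds
p_before = sigmoid(intercept)
p_after = sigmoid(intercept + coefficient)
print(round(p_before, 3), round(p_after, 3))
```

Each additional call multiplies the odds by about 2.23, not 0.8 (option B), and the change in probability depends on the starting point, so it is not a flat 80% increase (option D).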

MCQhard

In time series decomposition, a data analyst separates a retail sales series into trend, seasonal, and residual components. After decomposition, the residual component shows no pattern and is random. Which of the following best describes the seasonal component?

A.Cyclical variations lasting more than a year.

B.Irregular fluctuations that cannot be predicted.

C.Regular patterns that repeat at fixed intervals.

D.A long-term increase or decrease in sales.

AnswerC

Correct: seasonality is regular periodic patterns.

Why this answer

Seasonality refers to regular, periodic patterns that repeat at fixed intervals (e.g., monthly, quarterly).

MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which measure of central tendency is most affected by extreme outliers?

A.Mean

B.Range

C.Median

D.Mode

AnswerA

The mean is the average and is pulled toward extreme values.

Why this answer

The mean is sensitive to extreme values because it includes all data points in its calculation, whereas median and mode are more robust.

MCQmedium

A retail company wants to identify customer segments based on purchase history and demographics. Which technique is most appropriate for this task?

A.Linear regression

B.K-means clustering

C.Chi-square test

D.Logistic regression

AnswerB

K-means groups similar customers into clusters.

Why this answer

K-means clustering is an unsupervised learning technique designed to segment data into groups based on similarity.

MCQmedium

A data scientist is using K-means clustering with k=3. After the first iteration, the centroids are recalculated. Which step occurs next in the algorithm?

A.Calculate the sum of squared errors

B.Stop the algorithm because k is fixed

C.Compute the elbow curve

D.Assign each point to the nearest centroid

AnswerD

After centroid update, points are reassigned based on distance.

Why this answer

In K-means, after recalculating centroids, each point is reassigned to the nearest centroid, then centroids are updated again, iterating until convergence.

MCQmedium

In a multiple regression model, one predictor has a high p-value (0.45). What should the analyst consider doing?

A.Transform the predictor

B.Keep the predictor regardless

C.Remove the predictor from the model

D.Increase the sample size

AnswerC

The variable is not significant.

Why this answer

High p-value indicates the predictor is not statistically significant; it may be removed to simplify the model.

MCQmedium

A data analyst is examining the distribution of customer ages in a dataset. The ages are: 22, 25, 29, 30, 31, 34, 35, 37, 40, 42, 45, 50, 55, 60, 65. Which measure of central tendency would be least affected if one age were incorrectly recorded as 120?

A.Mode

B.Median

C.Mean

D.Range

AnswerB

The median is largely unaffected by outliers.

Why this answer

The median is resistant to outliers because it is the middle value when data are sorted. The mean is sensitive to extreme values, and the mode may not change but is not a robust measure of central tendency. The range is a measure of spread, not central tendency.
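
The effect can be seen directly on the ages from the question. This sketch assumes the erroneous value of 120 is appended as an extra record; replacing an existing value instead would shift the numbers slightly but not the conclusion.

```python
from statistics import mean, median

ages = [22, 25, 29, 30, 31, 34, 35, 37, 40, 42, 45, 50, 55, 60, 65]
with_outlier = ages + [120]  # assumption: the bad value is an extra record

# Original data: mean 40, median 37.
print(mean(ages), median(ages))

# With the outlier: mean rises to 45, median only to 38.5.
print(mean(with_outlier), median(with_outlier))
```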

MCQhard

In a multiple regression model with three predictors, the coefficient for one predictor is 5.2 with a p-value of 0.001. Which of the following is the best interpretation?

A.The predictor explains 5.2% of the variance in the dependent variable.

B.A one-unit increase in the predictor decreases the dependent variable by 5.2 units, on average.

C.The model is not a good fit because one predictor is significant.

D.The predictor has a statistically significant effect on the dependent variable, controlling for other variables.

AnswerD

p < 0.05 indicates significance, and 'holding constant' is key.

Why this answer

The coefficient indicates the change in the dependent variable for a one-unit increase in the predictor, holding other predictors constant.

MCQmedium

A company wants to determine if there is a significant difference in the average sales revenue between two different store layouts. They collect sales data from 30 stores with Layout A and 30 stores with Layout B. Which statistical test is most appropriate for comparing the means of these two independent groups?

A.ANOVA

B.Chi-square test

C.Paired t-test

D.Two-sample t-test

AnswerD

Correct for comparing means of two independent groups.

Why this answer

The two-sample t-test (independent t-test) compares the means of two independent groups. A paired t-test would be for dependent samples, ANOVA for three or more groups, and chi-square for categorical variables.

MCQmedium

A retail company wants to test whether a new website layout increases the conversion rate compared to the current layout. They randomly assign visitors to either the control or treatment group. Which statistical test is most appropriate to compare the conversion rates?

A.Two-sample t-test

B.Chi-square test

C.ANOVA

D.Logistic regression

AnswerA

Compares means of two independent groups.

Why this answer

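The keyed answer treats each visitor's conversion as a 0/1 value, so a conversion rate is a group mean and a two-sample t-test applies; with large samples this closely approximates a two-proportion z-test. Because the outcome is binary, a chi-square test of independence (option B) is also a standard choice for comparing conversion rates.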
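
Because a conversion is a yes/no outcome, a conversion rate is a proportion, and one common way to compare two of them is a two-proportion z-test, which is closely related to the chi-square test on the 2x2 table. The sketch below is one such approach under a normal approximation; the function name and the visitor counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)    # rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Hypothetical counts: 100/2000 conversions in control, 140/2000 in treatment.
z, p_value = two_proportion_z_test(100, 2000, 140, 2000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```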

MCQhard

A data scientist builds a logistic regression model to predict customer churn (yes/no). The model outputs a probability of 0.75 for a particular customer. Which of the following best describes this output?

A.The customer will definitely churn.

B.There is a 75% chance the customer will churn.

C.The odds of churning are 0.75 to 1.

D.The model is 75% accurate.

AnswerB

Probability interpretation.

Why this answer

Logistic regression outputs a probability between 0 and 1, interpreted as the likelihood of the positive class (churn = yes).

Multi-Selectmedium

A data analyst is evaluating data quality for a customer database. Which TWO dimensions of data quality are most directly affected by duplicate customer records?

Select 2 answers

A.Consistency

B.Accuracy

C.Timeliness

D.Completeness

E.Uniqueness

AnswersB, E

Duplicates can cause inaccurate counts and misrepresent entity.

Why this answer

Duplicates reduce accuracy (counts and aggregates are inflated) and violate uniqueness (each entity should appear only once).

MCQeasy

A dataset contains the ages of 100 customers. The analyst wants to transform the ages to a 0-1 range for use in a distance-based algorithm. Which technique should be used?

A.Square root transformation

B.Log transformation

C.Z-score normalization

D.Min-max normalization

AnswerD

Min-max scales to a range, e.g., 0-1.

Why this answer

Min-max normalization scales features to a fixed range, typically 0-1.

Multi-Selecthard

An analyst is conducting an A/B test on a new website layout. Which TWO of the following must be defined before the test begins?

Select 2 answers

A.The final conversion rates for each group

B.The actual p-value from the test

C.The confidence interval for the lift

D.The significance threshold (alpha)

E.The sample size required for adequate statistical power

AnswersD, E

Must be set beforehand.

Why this answer

Sample size and significance level must be set a priori to ensure proper test design.

MCQmedium

A marketing team runs an A/B test comparing two webpage designs. The null hypothesis states there is no difference in conversion rates. The p-value is 0.08 at α=0.05. Which is the correct interpretation?

A.The null hypothesis is rejected, indicating the designs are different.

B.The alternative hypothesis is accepted, showing the new design is better.

C.There is insufficient evidence to conclude a difference between the designs.

D.There is a statistically significant difference between the designs.

AnswerC

We fail to reject the null hypothesis due to high p-value.

Why this answer

Since p > α, we fail to reject the null hypothesis, meaning no statistically significant difference was found.

MCQeasy

Which measure best describes the spread of the middle 50% of a dataset?

A.IQR

B.Range

C.Standard deviation

D.Variance

AnswerA

IQR is robust and covers middle 50%.

Why this answer

Interquartile range (IQR) is the range between Q1 and Q3, covering the middle 50%.
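
As a small illustration, the IQR can be computed with Python's standard library. The data are hypothetical, and the `"inclusive"` quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample

# n=4 returns the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q3, iqr)
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4: the middle half of the values spans four units.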

MCQmedium

An analyst uses K-means clustering on customer purchase data. After plotting the within-cluster sum of squares for different values of k, they observe an elbow at k=4. What is the most appropriate number of clusters?

A.4

B.6

C.5

D.3

AnswerA

The elbow indicates the optimal k.

Why this answer

The elbow method suggests choosing the k at which the within-cluster sum of squares (WSS) stops dropping sharply and begins to level off; that point is the elbow, here k=4.

MCQmedium

A dataset has missing values in the 'age' column. The distribution of age is approximately normal with few outliers. Which imputation method is most appropriate?

A.Mean imputation

B.Forward-fill

C.Delete all rows with missing data

D.Mode imputation

AnswerA

Mean imputation is appropriate for normal distribution.

Why this answer

For normally distributed data, mean imputation is reasonable and preserves the mean.

MCQeasy

A data analyst is comparing the average test scores of students who attended a tutoring program versus those who did not. Which statistical test is most appropriate for determining if there is a significant difference between the means of these two independent groups?

A.Paired t-test

B.Chi-square test

C.Two-sample t-test

D.ANOVA

AnswerC

Correct: independent samples t-test compares means of two groups.

Why this answer

The independent samples t-test is used to compare the means of two independent groups.

MCQhard

In a linear regression model predicting house prices, the coefficient for the number of bedrooms is $30,000 and the intercept is $50,000. If a house has 3 bedrooms, what is the predicted price?

A.$80,000

B.$150,000

C.$90,000

D.$140,000

AnswerD

Correct: 30000*3 + 50000 = 140000.

Why this answer

Using y = mx + b, predicted price = 30000 * 3 + 50000 = $140,000.
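
The same calculation as a tiny Python sketch; `predict_price` is a hypothetical helper, with the slope and intercept taken from the question.

```python
def predict_price(bedrooms, slope=30_000, intercept=50_000):
    """Predicted price from a one-predictor linear model: slope * x + intercept."""
    return slope * bedrooms + intercept


print(predict_price(3))  # 140000
```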

MCQmedium

In time series analysis, which component represents regular patterns that repeat over fixed periods, such as daily or yearly?

A.Seasonality

B.Trend

C.Cyclical

D.Irregular

AnswerA

Correct definition.

Why this answer

Seasonality refers to patterns that repeat at regular intervals.

MCQhard

A data scientist is building a K-means clustering model for customer segmentation. After plotting the within-cluster sum of squares (WCSS) against the number of clusters (k), she observes that the WCSS decreases sharply until k=5 and then levels off. Which value of k should she choose based on the elbow method?

A.k=5

B.k=6

C.k=4

D.k=3

AnswerA

Correct elbow point.

Why this answer

The elbow method suggests selecting the number of clusters at the point where the WCSS starts to diminish less rapidly, forming an 'elbow'. Here, the elbow is at k=5, where adding more clusters yields diminishing returns.

Multi-Selectmedium

A data analyst is preparing a dataset for a machine learning algorithm that assumes normally distributed features. Which TWO data transformation methods should the analyst consider to achieve this?

Select 2 answers

A.Square root transformation

B.Log transformation

C.One-hot encoding

D.Z-score standardization

E.Min-max normalization

AnswersB, D

Can reduce skewness, making data more normal.

Why this answer

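Log transformation can reduce right skew, bringing a feature closer to normal, though it does not guarantee normality. Z-score standardization rescales a feature to mean 0 and standard deviation 1 without changing the shape of its distribution, so it yields a standard normal feature only when the original is already approximately normal.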

MCQmedium

A dataset contains employee salaries ranging from $30,000 to $200,000. An analyst wants to scale the salaries to a range of 0 to 1 for use in a distance-based clustering algorithm. Which method should they use?

A.Log transformation

B.Robust scaling

C.Min-max normalization

D.Z-score standardization

AnswerC

Scales to 0-1 using (x - min)/(max - min).

Why this answer

Min-max normalization scales data to a 0-1 range.

MCQeasy

In an A/B test, the null hypothesis states that there is no difference between the conversion rates of the control and treatment groups. After collecting data, the p-value is 0.03. Using a significance level α = 0.05, what should the analyst conclude?

A.Reject the null hypothesis; there is a significant difference

B.Accept the alternative hypothesis; the treatment is better

C.The test is inconclusive

D.Fail to reject the null hypothesis; no significant difference

AnswerA

Correct conclusion.

Why this answer

Since the p-value (0.03) is less than α (0.05), the null hypothesis is rejected, indicating a statistically significant difference between the groups.

MCQmedium

A data scientist is performing a hypothesis test with a significance level α=0.05. The p-value obtained is 0.03. What should the scientist conclude?

A.Reject the null hypothesis because the p-value is less than the significance level.

B.Fail to reject the null hypothesis because the p-value is greater than 0.01.

C.The test is inconclusive, need a larger sample size.

D.Accept the null hypothesis because the p-value is small.

AnswerA

A p-value less than α leads to rejection of the null hypothesis.

Why this answer

Since 0.03 < 0.05, we reject the null hypothesis, indicating statistically significant evidence against it.

Multi-Selectmedium

Which TWO of the following are components of time series data?

Select 2 answers

A.Mean

B.Variance

C.Trend

D.Seasonality

E.Median

AnswersC, D

Correct: Trend is a long-term direction.

Why this answer

Trend and seasonality are classic components of time series. Mean, median, and variance are statistical measures but not components of time series decomposition.

MCQmedium

A data analyst wants to understand the relationship between advertising spend and sales revenue. The analyst calculates a Pearson correlation coefficient of 0.85. Which of the following is the best interpretation?

A.There is a strong positive linear relationship between advertising spend and sales.

B.85% of the variation in sales is explained by advertising spend.

C.Increasing advertising spend by $1 will increase sales by $0.85.

D.There is a strong negative linear relationship between advertising spend and sales.

AnswerA

r=0.85 indicates strong positive linear relationship.

Why this answer

Pearson r ranges from -1 to +1; 0.85 indicates a strong positive linear relationship. The share of variance explained is r² = 0.85² ≈ 0.72, not 85%, which rules out option B, and correlation does not imply causation.

Multi-Selecteasy

Which TWO of the following are measures of central tendency?

Select 2 answers

A.Median

B.Range

C.Variance

D.Standard deviation

E.Mean

AnswersA, E

Correct: Median is a measure of central tendency.

Why this answer

Mean, median, and mode are measures of central tendency. Range and standard deviation measure dispersion.

MCQeasy

A data analyst wants to use a Z-score to standardize a dataset. The variable has a mean of 50 and a standard deviation of 10. What is the Z-score for a raw value of 70?

A.0.5

B.20

C.-2

D.2

AnswerD

Correct Z-score.

Why this answer

Z = (X - mean) / std = (70 - 50) / 10 = 2.
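
The formula translates directly into code. The helper name `z_score` is hypothetical; the inputs are the values from the question.

```python
def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


print(z_score(70, 50, 10))  # 2.0
```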

MCQmedium

A data analyst notices that a dataset of customer ages has several missing values. Which method for handling missing data is most appropriate if the data is missing completely at random and the analyst wants to preserve sample size?

A.Forward-fill using the previous value

B.Impute with the mean age

C.Replace missing values with zero

D.Delete all rows with missing data

AnswerB

Mean imputation is simple and preserves sample size.

Why this answer

Mean imputation replaces missing values with the mean, which preserves the sample size and the mean itself. It understates the variance, but when data are missing completely at random (MCAR) it is a common, simple approach.

Multi-Selecteasy

A data analyst is cleaning a dataset with missing values. Which TWO of the following are acceptable methods for handling missing numerical data?

Select 2 answers

A.Min-max normalisation

B.Forward-fill

C.Mode imputation

D.Mean imputation

E.Deletion of rows with missing values

AnswersD, E

Correct: Replacing missing with mean is acceptable.

Why this answer

Mean imputation and deletion (listwise) are common methods. Mode imputation is for categorical, and forward-fill is for time series; min-max is normalisation.

Multi-Selecthard

A data analyst is building a logistic regression model to predict whether a customer will churn (yes/no). Which TWO statements about logistic regression are correct?

Select 2 answers

A.It is used only for time series forecasting.

B.The dependent variable is continuous.

C.The output is a probability between 0 and 1.

D.It requires normally distributed errors.

E.It assumes a linear relationship between predictors and the log-odds of the outcome.

AnswersC, E

Logistic regression predicts probabilities.

Why this answer

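Logistic regression passes a linear combination of the predictors through the sigmoid (logistic) function, so its output is a probability between 0 and 1. The model is linear in the log-odds of the outcome, which is why each coefficient represents a change in log-odds. It is a classification method for a binary dependent variable and does not require normally distributed errors.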

MCQmedium

An analyst runs a simple linear regression with an R² value of 0.85. Which interpretation is correct?

A.85% of the variance in the dependent variable is explained by the independent variable.

B.The slope of the regression line is 0.85.

C.The independent variable is 85% correlated with the dependent variable.

D.85% of the data points lie on the regression line.

AnswerA

R² is the coefficient of determination, indicating explained variance.

Why this answer

R² represents the proportion of variance in the dependent variable explained by the independent variable. 0.85 means 85% is explained.

Multi-Selecthard

An analyst is performing K-means clustering on customer data. The elbow method shows a clear bend at k=4. Which THREE of the following are true about K-means clustering with k=4?

Select 3 answers

A.The number of clusters is determined to be 4.

B.The algorithm will always produce the same clusters regardless of initial centroids.

C.The centroids are recomputed iteratively until convergence.

D.Categorical variables should be standardised before clustering.

E.The algorithm minimises the sum of squared distances between points and their assigned centroid.

AnswersA, C, E

Correct: Elbow method indicates k=4.

Why this answer

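K-means initialises centroids randomly, so results can vary between runs, which rules out option B. The elbow method suggests 4 clusters, and the algorithm recomputes centroids iteratively to minimise the within-cluster sum of squares. K-means works best with numeric data and assumes roughly spherical clusters, so standardising categorical variables (option D) is not meaningful.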

MCQmedium

A marketing analyst wants to predict whether a customer will churn (yes/no) based on account age and monthly charges. Which regression technique is most appropriate?

A.Logistic regression

B.Simple linear regression

C.Multiple linear regression

D.K-means clustering

AnswerA

Logistic regression handles binary outcomes.

Why this answer

Logistic regression is used for binary classification problems, outputting probabilities.

MCQhard

A data scientist is analyzing a dataset with multiple features and wants to apply k-means clustering to segment customers. She chooses k = 4 based on the elbow method. During the iteration process, which of the following correctly describes a step in the k-means algorithm?

A.Compute the covariance matrix and use principal components to initialize centroids.

B.Use hierarchical clustering to determine initial centroids.

C.Randomly assign centroids and then compute distances to the cluster medians.

D.Assign each point to the nearest centroid based on Euclidean distance, then update centroids as the mean of points in each cluster.

AnswerD

This is the standard k-means iteration.

Why this answer

K-means iteratively assigns each point to the nearest centroid, then recalculates centroids as the mean of points in the cluster.
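
The assign-then-update loop can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function name, the random initialisation from data points, the fixed seed, and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialise from random points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]                  # keep centroid of empty cluster
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this toy data the loop settles on one centroid per group. On real data the outcome can depend on the initial centroids, which is why libraries typically run several initialisations.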

MCQmedium

An analyst compares average sales across three different store locations using a statistical test. Which test is most appropriate?

A.ANOVA

B.t-test

C.Correlation analysis

D.Chi-square test

AnswerA

ANOVA compares means of three or more groups.

Why this answer

ANOVA compares means across three or more groups.

Multi-Selecthard

A data analyst is evaluating the quality of a customer database. Which THREE of the following are dimensions of data quality?

Select 3 answers

A.Completeness

B.Correlation

C.Timeliness

D.Accuracy

E.Variance

AnswersA, C, D

Completeness: whether all required data is present.

Why this answer

Accuracy, completeness, and timeliness are standard data quality dimensions.

Practice this question →

MCQeasy

Which data quality dimension ensures that data represents the real-world object or event correctly?

A.Accuracy

B.Completeness

C.Consistency

D.Timeliness

AnswerA

Correct definition.

Why this answer

Accuracy refers to how well data reflects reality.

Practice this question →

MCQmedium

A data analyst is performing time series analysis on monthly sales data and notices a consistent pattern of higher sales every December. Which component of time series does this represent?

A.Trend

B.Irregular component

C.Seasonality

D.Cyclical

AnswerC

Seasonality is a regular, periodic pattern.

Why this answer

Seasonality refers to regular patterns that repeat at fixed intervals, such as yearly.

Practice this question →

Multi-Selecthard

A data analyst is cleaning a dataset and identifies several outliers. Which TWO methods are appropriate for handling outliers?

Select 2 answers

A.Capping

B.Mean imputation

C.Removal

D.Min-max normalization

E.Forward-fill

AnswersA, C

Capping replaces outliers with a threshold value.

Why this answer

Capping (winsorizing) and removal are common outlier treatments. Mean imputation is for missing values, and min-max normalization is scaling.
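
The two treatments can be contrasted in a short sketch. The helper names `cap` and `remove` and the thresholds 0 and 10 are assumptions chosen for the example, not values from the question.

```python
def cap(values, lower, upper):
    """Capping: clamp each value into the range [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

def remove(values, lower, upper):
    """Removal: drop values that fall outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]

data = [3, 5, 4, 6, 250]      # 250 is the outlier
capped = cap(data, 0, 10)     # the outlier is replaced by the upper threshold
kept = remove(data, 0, 10)    # the outlier is dropped entirely
```

Capping keeps the row count unchanged, while removal shortens the dataset.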

Practice this question →

MCQmedium

A data analyst is examining sales data for a retail chain and notices that the mean monthly sales is $50,000 while the median is $35,000. Which of the following best describes the distribution of the sales data?

A.The distribution is right-skewed.

B.The distribution is bimodal.

C.The distribution is left-skewed.

D.The distribution is symmetrical.

AnswerA

Correct: mean > median indicates right skew.

Why this answer

When the mean is greater than the median, the distribution is right-skewed (positively skewed) because the mean is pulled towards the higher values by outliers or a long right tail.
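
A small made-up sample shows the effect. The figures below are hypothetical and were chosen only so that the mean and median match the question's $50,000 and $35,000.

```python
from statistics import mean, median

# Hypothetical monthly sales: one very large month creates a long right tail.
sales = [20_000, 30_000, 35_000, 40_000, 125_000]

avg = mean(sales)     # pulled upward by the 125,000 month
mid = median(sales)   # middle value, unaffected by the size of the largest month
right_skewed = avg > mid
```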

Practice this question →

MCQhard

A data analyst is cleaning a dataset and finds that 5% of values in the 'income' column are missing. The analyst decides to impute missing values using the mean of the non-missing values. Which potential issue should the analyst be most concerned about?

A.The imputation may reduce the variance and distort the distribution.

B.The imputation is not valid because the missing rate is too low.

C.The imputation will increase the standard deviation of the variable.

D.The imputation will create outliers.

AnswerA

Mean imputation pulls values toward the mean, reducing variance and potentially biasing results.

Why this answer

Mean imputation reduces variance and can distort relationships, especially if data is skewed. It may also bias estimates if missingness is not random.
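
The variance reduction is easy to see on a toy column. The income values and the number of missing entries below are invented for illustration.

```python
from statistics import mean, variance

observed = [20, 30, 40, 50, 60]   # hypothetical non-missing incomes (in thousands)
n_missing = 2                     # assumed number of missing entries

# Mean imputation: every missing entry receives the observed mean.
imputed = observed + [mean(observed)] * n_missing

var_before = variance(observed)
var_after = variance(imputed)
```

The mean is unchanged, but the imputed values add no spread, so the sample variance shrinks.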

Practice this question →

MCQmedium

A dataset contains a variable 'Income' with many missing values. The analyst decides to impute missing values with the median income of the non-missing values. Which type of imputation is this?

A.Interpolation

B.Deletion

C.Median imputation

D.Forward-fill imputation

AnswerC

Correct term.

Why this answer

Replacing missing values with the median of the observed values is median imputation, one of the simple mean/median/mode imputation methods.

Practice this question →

MCQeasy

A data analyst needs to identify outliers in a dataset. Which of the following is a common method based on the interquartile range (IQR)?

A.Values more than 2 standard deviations from the mean

B.Values that are negative

C.Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR

D.Values below the 5th percentile or above the 95th percentile

AnswerC

Correct: IQR method.

Why this answer

A common rule is to consider any data point below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as an outlier.
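
The rule can be sketched with the standard library. The function name `iqr_outliers` and the data are hypothetical, and quartile conventions differ between tools, so the exact fences may vary slightly from what other software reports.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 13, 14, 15, 16, 18, 60]
outliers = iqr_outliers(data)
```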

Practice this question →

MCQmedium

A data analyst is examining the relationship between advertising spend (in dollars) and revenue (in dollars). The Pearson correlation coefficient r is calculated as +0.92. Which of the following interpretations is correct?

A.There is a strong negative linear relationship.

B.There is no linear relationship.

C.There is a strong positive linear relationship.

D.92% of the variation in revenue is explained by advertising spend.

AnswerC

Close to +1 indicates strong positive.

Why this answer

r = +0.92 indicates a strong positive linear relationship. Option D confuses r with r²: the share of variation explained is r² = 0.92² ≈ 0.85, not 0.92.
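
For reference, a minimal sketch of how Pearson's r is computed, and how it relates to the proportion of variance explained. The function name `pearson_r` and the spend/revenue figures are assumptions for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)

# Hypothetical advertising spend and revenue.
spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 5, 4, 7]
r = pearson_r(spend, revenue)

# In simple linear regression, the variance explained is r squared, not r.
explained = 0.92 ** 2
```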

Practice this question →

MCQhard

A data analyst is performing a chi-square test of independence on a contingency table of customer satisfaction (satisfied vs. dissatisfied) and product type (A, B, C). The test yields a p-value of 0.04 with α = 0.05. What is the correct conclusion?

A.There is no evidence of an association between satisfaction and product type.

B.There is a significant association between satisfaction and product type.

C.The test is invalid because the expected counts are too low.

D.Satisfaction and product type are independent.

AnswerB

Correct: reject null, conclude association.

Why this answer

Since p-value < α, we reject the null hypothesis of independence, meaning there is a significant association between satisfaction and product type.
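
A sketch of the underlying calculation for a 2×3 table follows. The counts are made up and do not reproduce the question's p-value of 0.04. The p-value shortcut relies on the table having 2 degrees of freedom, for which the chi-square tail probability is exactly exp(-x/2); other table shapes need a statistics library.

```python
import math

def chi_square_stat(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts. Rows: satisfied, dissatisfied. Columns: products A, B, C.
table = [[40, 30, 50], [20, 30, 10]]
stat = chi_square_stat(table)

# df = (2 - 1) * (3 - 1) = 2, so the tail probability is exp(-stat / 2).
p_value = math.exp(-stat / 2)
reject_independence = p_value < 0.05
```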

Practice this question →

Multi-Selecthard

A company runs an A/B test to compare a new website layout (treatment) against the current layout (control). The conversion rate for the control is 5% and for the treatment is 5.5%. The p-value is 0.06 at α=0.05. Which THREE of the following conclusions are valid?

Select 3 answers

A.There is not enough evidence to conclude that the new layout is better.

B.The test has sufficient power to detect the observed effect.

C.The observed lift of 0.5% may be due to random chance.

D.The new layout significantly increases conversion rate.

E.A larger sample size might reveal a significant difference if one exists.

AnswersA, C, E

Correct: Fail to reject null.

Why this answer

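Because the p-value (0.06) is greater than α (0.05), we fail to reject the null hypothesis: the difference is not statistically significant, and the observed lift of 0.5 percentage points may be due to chance. Failing to reject does not show that the layouts perform equally; the sample may be too small and the test underpowered, so a larger sample might reveal a significant difference if one exists.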
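
The role of sample size can be illustrated with a two-proportion z-test. The question does not state the group sizes, so the sample sizes below are hypothetical, and the function name `two_proportion_p_value` is an assumption for this sketch.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# Same 5% vs 5.5% conversion rates at two hypothetical sample sizes.
p_small = two_proportion_p_value(500, 10_000, 550, 10_000)
p_large = two_proportion_p_value(5_000, 100_000, 5_500, 100_000)
```

With identical conversion rates, the larger sample gives a much smaller p-value, which is why an underpowered test can miss a real effect.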

Practice this question →

MCQmedium

A dataset contains features with vastly different scales (e.g., age 0-100 and income 0-1,000,000). Which data transformation should be applied before using a K-nearest neighbors algorithm?

A.No transformation is needed

B.Min-max normalization

C.Log transformation

D.Z-score standardization

AnswerB

Min-max scales features to a fixed range (0-1), suitable for distance-based methods.

Why this answer

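Distance-based algorithms like KNN require features on similar scales; otherwise the income range would dominate the distance calculation. Min-max normalization rescales each feature to the range 0 to 1. Z-score standardization is also widely used for this purpose, so the distinction here rests on the fixed 0-1 range.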
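
Min-max normalization itself is a one-line formula, (x - min) / (max - min). The sketch below uses made-up age and income values and assumes each feature has at least two distinct values, since a constant column would divide by zero.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 40, 60]                     # hypothetical ages
incomes = [20_000, 60_000, 100_000]     # hypothetical incomes

scaled_ages = min_max(ages)
scaled_incomes = min_max(incomes)
```

After scaling, both features contribute on the same 0 to 1 range when distances are computed.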

Practice this question →

Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to ensure data quality. Which TWO of the following are dimensions of data quality?

Select 2 answers

A.Volume

B.Velocity

C.Consistency

D.Variety

E.Accuracy

AnswersC, E

Correct: consistency ensures data is uniform across sources.

Why this answer

Accuracy and consistency are recognized dimensions of data quality in CompTIA Data+.

Practice this question →

MCQmedium

A data analyst wants to test if the proportion of customers who prefer Product A over Product B is different from 50%. She surveys 200 customers and finds that 120 prefer Product A. Which statistical test should she use?

A.Chi-square test of independence

B.One-sample z-test for proportions

C.ANOVA

D.Two-sample t-test

AnswerB

Correct for testing a single proportion against a hypothesized value.

Why this answer

A one-sample z-test for proportions compares a sample proportion to a hypothesized population proportion. Here, the null is p=0.5. A chi-square test for goodness-of-fit could also be used, but the z-test is standard for a single proportion.
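
Using the question's own numbers (120 of 200 customers, null proportion 0.5), the test statistic can be worked through as follows. The variable names are illustrative, and the p-value uses the normal approximation.

```python
import math

n = 200            # customers surveyed
successes = 120    # customers who prefer Product A
p0 = 0.5           # proportion under the null hypothesis

p_hat = successes / n
# The standard error uses the null proportion p0, not the sample proportion.
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))
```

Here z is about 2.83, so the two-sided p-value is well below 0.05.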

Practice this question →

Multi-Selectmedium

A data team is preparing data for a clustering analysis. Which THREE of the following steps are commonly part of data cleaning?

Select 3 answers

A.Removing duplicate records

B.Imputing missing values

C.Calculating the mean

D.Training a regression model

E.Capping outliers at the 5th and 95th percentiles

AnswersA, B, E

Deduplication is cleaning.

Why this answer

Data cleaning includes handling missing values, outlier treatment, and deduplication.

Practice this question →

MCQmedium

A data scientist is performing K-means clustering on customer data. She plots the within-cluster sum of squares (WCSS) for different values of k and observes an 'elbow' at k=4. What does this indicate?

A.The optimal number of clusters is 4

B.The algorithm should be run with k=3 to avoid overfitting

C.The data contains exactly 4 outliers

D.The WCSS is minimized at k=4, indicating perfect clustering

AnswerA

The elbow point indicates a good trade-off between cluster compactness and number of clusters.

Why this answer

The elbow method suggests that adding more clusters beyond k=4 yields diminishing returns, so k=4 is a suitable number of clusters.
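
The quantity plotted in an elbow chart can be sketched as below. The function name `wcss` and the points are hypothetical, and the cluster assignments are supplied by hand here rather than produced by running k-means.

```python
def wcss(clusters):
    """Within-cluster sum of squares: squared distances to each cluster's mean."""
    total = 0.0
    for members in clusters:
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid))
                     for p in members)
    return total

# The same four points grouped as one cluster, then as two.
wcss_k1 = wcss([[(1, 1), (1, 2), (8, 8), (9, 8)]])
wcss_k2 = wcss([[(1, 1), (1, 2)], [(8, 8), (9, 8)]])
```

WCSS always falls as k grows; the elbow is where the drop flattens out, not where WCSS reaches its minimum.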

Practice this question →

MCQmedium

A data analyst is evaluating a multiple regression model with three predictors. The R² value is 0.85. Which of the following is the best interpretation of R²?

A.85% of the variance in the outcome is explained by the predictors.

B.85% of the predicted values are correct.

C.The model has a high bias.

D.The model has a strong correlation of 0.85.

AnswerA

Correct: R² measures explained variance.

Why this answer

R² represents the proportion of variance in the dependent variable explained by the independent variables. 0.85 means 85% of the variance is explained.
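
The definition translates directly into code as 1 minus the ratio of residual to total sum of squares. The function name `r_squared` and the actual/predicted values are invented for this sketch.

```python
def r_squared(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)            # total
    return 1 - ss_res / ss_tot

actual = [10, 20, 30, 40]        # hypothetical observed outcomes
predicted = [12, 18, 33, 37]     # hypothetical model predictions
r2 = r_squared(actual, predicted)
```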

Practice this question →

Multi-Selecthard

A logistic regression model predicts customer churn (0=no churn, 1=churn). The model outputs probabilities. Which THREE of the following statements about logistic regression are correct?

Select 3 answers

A.The model output is a probability between 0 and 1.

B.The coefficient of determination R² is used to assess model fit.

C.The coefficients represent the change in log-odds for a one-unit change in the predictor.

D.Logistic regression is used for binary classification.

E.The model uses the linear regression equation y = mx + b directly.

AnswersA, C, D

Correct: The sigmoid function keeps the output strictly between 0 and 1.

Why this answer

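Logistic regression outputs probabilities by passing a linear combination of the predictors through the logistic (sigmoid) function, which maps it to a value between 0 and 1. Each coefficient is the change in log-odds for a one-unit change in its predictor. R² is a linear regression measure; logistic models are assessed with pseudo-R² or classification metrics rather than standard R².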
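
A minimal sketch of the sigmoid mapping and the log-odds interpretation of a coefficient follows. The intercept, the coefficient, and the helper names are hypothetical values picked for the example, not a fitted model.

```python
import math

def predict_probability(x, intercept, coef):
    """Map the linear predictor through the sigmoid to get a probability."""
    linear = intercept + coef * x
    return 1 / (1 + math.exp(-linear))

def log_odds(p):
    """Convert a probability to log-odds (the logit)."""
    return math.log(p / (1 - p))

intercept, coef = -3.0, 0.5   # assumed model parameters
p_at_4 = predict_probability(4, intercept, coef)
p_at_5 = predict_probability(5, intercept, coef)

# A one-unit increase in x shifts the log-odds by exactly the coefficient.
shift = log_odds(p_at_5) - log_odds(p_at_4)
```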

Practice this question →

Multi-Selecthard

A data analyst is performing a chi-square test for independence between two categorical variables. Which THREE of the following are necessary conditions for the test to be valid?

Select 3 answers

A.Variances are equal across groups

B.Data is normally distributed

C.Sample is randomly selected

D.Observations are independent

E.Expected frequency in each cell is at least 5

AnswersC, D, E

Correct condition.

Why this answer

The chi-square test requires expected frequencies ≥5, random sampling, and independence of observations.

Practice this question →

MCQeasy

A retail company wants to analyze monthly sales data over the past three years to identify long-term trends. Which component of time series analysis is most relevant for this goal?

A.Irregular component

B.Cyclical component

C.Seasonality

D.Trend

AnswerD

Trend shows the overall long-term direction of the time series.

Why this answer

The trend component represents the long-term direction of the data, which is exactly what the company wants to identify.
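
One common way to expose the trend in monthly data is a 12-month moving average, which averages out a yearly seasonal pattern. The series below is synthetic (a steady rise of 2 per month plus a December spike of 50), and the function name `moving_average` is an assumption for this sketch.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Three years of synthetic monthly sales: linear growth plus a December peak.
sales = [100 + 2 * t + (50 if t % 12 == 11 else 0) for t in range(36)]

trend = moving_average(sales, 12)
# Month-to-month change in the smoothed series.
steps = [b - a for a, b in zip(trend, trend[1:])]
```

Every 12-month window contains exactly one December, so the spike cancels out and the smoothed series rises at the underlying rate of 2 per month.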

Practice this question →