CCNA Dap Analysing Data Questions

72 of 147 questions · Page 2/2 · Dap Analysing Data topic · Answers revealed

76
Multi-Selecthard

A company is planning an A/B test to compare two website designs. Which THREE of the following must be determined before the test begins to ensure valid results? (Select three.)

Select 3 answers
A.The desired effect size
B.The p-value of the test
C.Which hypothesis is true
D.The minimum sample size required
E.The significance level (α)
AnswersA, D, E

Helps determine sample size.

Why this answer

Sample size (derived from power and effect size), the significance level (α), and the desired effect size are all fixed in advance when the test is designed. The p-value is an outcome of the test, not a pre-test parameter. The null and alternative hypotheses are stated beforehand, but which of them is true cannot be known in advance; that is what the test is meant to assess.

The correct choices are therefore the minimum sample size, the significance level, and the desired effect size.
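
As an illustration of how these pre-specified inputs fit together, the sketch below uses the standard normal-approximation formula for a two-sided comparison of two group means. The function name `sample_size_per_group`, the default power of 0.80, and the standardized effect size of 0.5 are assumptions chosen for the example, not values from the question.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). Uses the normal
    approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

# Assumed inputs: medium effect (d = 0.5), alpha = 0.05, power = 0.80
n = sample_size_per_group(effect_size=0.5)
print(n)
```

Smaller effect sizes, a stricter α, or higher power all push the required sample size up, which is why they have to be settled before data collection starts.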

77
MCQmedium

A data analyst is cleaning a dataset and finds that the 'age' column has several missing values. Which method of handling missing values is least likely to introduce bias if the missingness is completely at random?

A.Mean imputation
B.Listwise deletion
C.Mode imputation
D.Forward-fill
AnswerB

If MCAR, listwise deletion gives unbiased estimates, though with less power.

Why this answer

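Listwise deletion (removing rows with missing values) yields unbiased estimates when data is missing completely at random (MCAR), at the cost of a smaller sample and less power. Among the options it is the least likely to introduce bias under MCAR; mean and mode imputation, for example, artificially shrink the variance of the column.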
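
A small simulation can make the trade-off concrete. This is only a sketch on synthetic data: the age distribution, the 30% missingness rate, and the random seed are all assumptions made for the example.

```python
import random
from statistics import mean, variance

random.seed(0)
ages = [random.gauss(40, 12) for _ in range(1000)]  # synthetic 'age' column

# MCAR: every value has the same chance of going missing, whatever its size
observed = [a if random.random() > 0.3 else None for a in ages]

# Listwise deletion: keep only the rows where age is present
deleted = [a for a in observed if a is not None]

# Mean imputation: fill each gap with the mean of the observed values
fill = mean(deleted)
imputed = [a if a is not None else fill for a in observed]

var_deleted = variance(deleted)
var_imputed = variance(imputed)
print(len(deleted), var_deleted, var_imputed)
```

Both approaches leave the mean unchanged, but every imputed value sits exactly at the mean, so the imputed column has a smaller variance than the data really supports.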

78
MCQhard

A data analyst is comparing the means of two independent groups using a t-test. The sample sizes are small and the data is not normally distributed. Which condition is violated for a valid t-test?

A.Normality
B.Equal variances
C.Independence of observations
D.Sample size larger than 30
AnswerA

Normality is an assumption of t-test.

Why this answer

The t-test assumes normality of the data, especially with small samples. Violation of normality can affect the validity.

79
Multi-Selectmedium

A data analyst is performing data cleaning on a dataset and identifies several outliers in the 'age' column. Which TWO methods are appropriate for handling these outliers? (Select two.)

Select 2 answers
A.Capping
B.Mean imputation
C.Transformation
D.Removal
E.Binning
AnswersA, D

Capping limits outliers to a specified percentile.

Why this answer

Capping limits extreme values to a threshold, and removal deletes outlier records. Transformation (e.g., log) can reduce impact but is more for skewness. Imputation and binning are for missing data or discretization, not directly for outliers.
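
The two selected methods can be sketched as follows. The ages, the 90th-percentile cap, and the helper name `percentile_nearest_rank` are hypothetical choices for illustration; real projects would pick the threshold from domain knowledge.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank rule."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

ages = [23, 25, 31, 35, 38, 41, 44, 52, 60, 140]  # 140 is an implausible age

cap = percentile_nearest_rank(ages, 90)

# Capping: values above the threshold are clipped to it, rows are kept
capped = [min(a, cap) for a in ages]

# Removal: rows above the threshold are dropped entirely
removed = [a for a in ages if a <= cap]

print(cap, capped, removed)
```

Capping keeps the row count intact, while removal shrinks the dataset; which is preferable depends on whether the outlier is an error or a genuine extreme.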

80
MCQeasy

In an A/B test, the null hypothesis states that there is no difference between the control and treatment groups. After running the test, the p-value is 0.04. Assuming α = 0.05, what is the correct conclusion?

A.Fail to reject the null hypothesis
B.Reject the null hypothesis
C.Accept the null hypothesis
D.The test is invalid because the p-value is too low
AnswerB

Correct conclusion.

Why this answer

Since p-value (0.04) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.
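
The decision rule, and one common way a p-value for an A/B test might be obtained, can be sketched like this. The pooled two-proportion z-test is an assumed choice of test, and the conversion counts are invented for the example; the question itself only supplies p = 0.04 and α = 0.05.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share the same conversion rate."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical counts: 200/2000 conversions in control, 260/2000 in treatment
p = two_proportion_p_value(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(decide(p))
print(decide(0.04))  # the p-value given in the question
```

Note that the rule never returns "accept H0": a large p-value only means the data did not provide enough evidence against the null.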

81
Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to address data quality issues. Which TWO of the following are common data cleaning tasks?

Select 2 answers
A.Performing hypothesis testing
B.Imputing missing values
C.Building a regression model
D.Calculating correlation coefficients
E.Deduplicating records
AnswersB, E

Correct.

Why this answer

Handling missing values and removing duplicates are standard data cleaning tasks.

82
Multi-Selectmedium

A data analyst is cleaning a customer dataset. Which two actions are appropriate for handling duplicate records? (Choose TWO)

Select 2 answers
A.Impute missing values with mean
B.Delete any row with a duplicate email address
C.Remove all rows with identical values in every field
D.Apply Z-score standardization
E.Use a fuzzy matching algorithm to identify near-duplicates
AnswersC, E

Exact duplicates can be removed safely.

Why this answer

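Removing rows that are identical in every field eliminates exact duplicates safely, and fuzzy matching identifies near-duplicates such as misspelled names. Deleting every row that merely shares an email address risks discarding distinct customers.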
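
A minimal sketch of both steps, using only the standard library. The customer records, the 0.9 similarity threshold, and the `similarity` helper are assumptions made for illustration; production deduplication usually relies on dedicated record-linkage tooling.

```python
from difflib import SequenceMatcher

records = [
    ("Jon Smith", "jon@example.com"),
    ("Jon Smith", "jon@example.com"),    # exact duplicate of the first row
    ("John Smith", "jon@example.com"),   # near-duplicate: name spelled differently
    ("Ana Lopez", "ana@example.com"),
]

# Step 1: drop rows identical in every field, keeping the first occurrence
unique = list(dict.fromkeys(records))

def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Step 2: flag pairs whose names are similar enough to deserve manual review
THRESHOLD = 0.9  # assumed cut-off; tune for real data
near_duplicates = [
    (a, b)
    for i, a in enumerate(unique)
    for b in unique[i + 1:]
    if similarity(a[0], b[0]) >= THRESHOLD
]
print(unique)
print(near_duplicates)
```

Flagged pairs are candidates for review rather than automatic deletion, since a high similarity score does not prove two rows describe the same customer.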

83
MCQmedium

A data analyst is preparing features for a machine learning model that uses distance-based algorithms (e.g., K-means, KNN). The dataset contains numerical features with different scales: age (0-100), income (20,000-200,000), and credit score (300-850). Which data transformation technique is most appropriate to ensure all features contribute equally to the distance calculations?

A.Z-score standardization
B.Min-max normalization
C.One-hot encoding
D.Log transformation
AnswerB

Correct: scales all features to [0,1] so distances are not dominated by large-scale features.

Why this answer

Min-max normalization rescales each feature to a fixed range (e.g., 0 to 1), so no feature dominates the distance calculation simply because of its units. Standardization is better for algorithms assuming Gaussian distributions.
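
The rescaling is a one-line formula, (x − min) / (max − min). The sketch below applies it to made-up values within the ranges from the question; the helper name `min_max` is hypothetical, and it assumes the column is not constant (max > min).

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 110_000, 200_000]  # illustrative values only
ages = [0, 25, 100]

print(min_max(incomes))
print(min_max(ages))
```

After rescaling, a difference across the full income range and a difference across the full age range both count as 1.0 in a distance calculation.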

84
MCQmedium

A data analyst is testing whether the average sales amount differs between two regions. Which statistical test is most appropriate?

A.Chi-square test
B.ANOVA
C.Two-sample t-test
D.Paired t-test
AnswerC

Compares means of two independent groups.

Why this answer

A two-sample t-test compares the means of two independent groups.
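
To show what the test actually computes, here is a sketch of the t statistic in its Welch form, which does not assume equal variances. The choice of the Welch variant, the function name `welch_t`, and the sales figures are assumptions for the example; the p-value would additionally require the t distribution, which is not computed here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """t statistic for two independent samples without assuming equal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Hypothetical sales amounts for two regions
north = [10, 20, 30]
south = [20, 30, 40]

t = welch_t(north, south)
print(t)
```

The statistic is the difference in sample means divided by its standard error, so larger samples or less noisy data make the same difference easier to detect.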

85
MCQmedium

A data scientist builds a simple linear regression model to predict house prices based on square footage. The model yields an R-squared value of 0.85. Which statement accurately interprets this result?

A.The slope of the regression line is 0.85
B.85% of the data points lie exactly on the regression line
C.The model explains 85% of the variability in house prices
D.There is an 85% chance that square footage causes higher prices
AnswerC

Correct interpretation of R-squared.

Why this answer

R-squared of 0.85 means 85% of the variance in house prices is explained by square footage.
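
For readers who want to see where the number comes from, R-squared can be computed as 1 − SS_res / SS_tot after fitting the line by least squares. The data points and the function name `r_squared` below are invented for illustration and are not taken from the question.

```python
from statistics import mean

def r_squared(xs, ys):
    """Fit y = intercept + slope * x by least squares and return R-squared."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Unexplained variation: squared distances from the fitted line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared distances from the mean of y
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
print(r_squared(xs, ys))
```

In this toy dataset no point lies exactly on the fitted line, yet R-squared is well above zero, which illustrates why option B is a misreading.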

86
Multi-Selecthard

A data analyst is performing a chi-square test of independence on a contingency table of customer satisfaction (satisfied, neutral, dissatisfied) by region (North, South, East, West). Which THREE of the following are necessary assumptions for the test?

Select 3 answers
A.The two variables are categorical
B.The sample size is greater than 30
C.Expected frequencies are at least 5 in each cell (or in most cells)
D.The observations are independent
E.The data must be normally distributed
AnswersA, C, D

Chi-square tests association between categorical variables.

Why this answer

Chi-square test requires categorical variables, expected frequencies >=5 in at least 80% of cells, and independence of observations.
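
The expected-frequency condition can be checked before running the test. In the sketch below each expected count is (row total × column total) / grand total; the satisfaction-by-region counts and both helper names are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def share_of_cells_at_least(expected, minimum=5):
    """Fraction of cells whose expected count reaches the minimum."""
    cells = [e for row in expected for e in row]
    return sum(e >= minimum for e in cells) / len(cells)

# Rows: satisfied, neutral, dissatisfied; columns: North, South, East, West
observed = [
    [40, 35, 30, 45],
    [20, 25, 15, 10],
    [10, 5, 8, 7],
]

expected = expected_counts(observed)
share = share_of_cells_at_least(expected)
print(share)  # compare against the 80% rule of thumb
```

The check applies to expected counts, not observed ones, so a cell with an observed count of 5 can still pass comfortably.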

87
MCQeasy

A dataset contains customer records with a column for 'Phone Number' that should be unique. However, the analyst finds several duplicate phone numbers. Which data quality dimension is primarily affected?

A.Completeness
B.Accuracy
C.Uniqueness
D.Consistency
AnswerC

Correct: duplicates violate uniqueness.

Why this answer

Uniqueness refers to the expectation that each record or attribute value should be unique. Duplicate phone numbers violate uniqueness.

88
MCQmedium

A marketing team runs an A/B test on email subject lines. The p-value is 0.03 with α = 0.05. Which of the following is the correct interpretation?

A.The result is not statistically significant at the 95% confidence level.
B.The probability that the null hypothesis is true is 3%.
C.Fail to reject the null hypothesis; no significant difference.
D.Reject the null hypothesis; there is a statistically significant difference.
AnswerD

p < α provides evidence against the null.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.

89
MCQhard

In logistic regression, the output is a probability between 0 and 1. If the predicted probability for a customer churning is 0.7 and the decision threshold is 0.5, what is the predicted class?

A.Not churn (class 0)
B.Churn (class 1)
C.Both classes equally likely
D.Uncertain, need more data
AnswerB

Probability above threshold predicts the positive class.

Why this answer

Since 0.7 > 0.5, the predicted class is churn (usually coded as 1).

90
Multi-Selectmedium

Which TWO of the following are true about Pearson correlation coefficient (r)?

Select 2 answers
A.An r of 0 means no relationship exists
B.It ranges from 0 to 1
C.It measures the strength and direction of a linear relationship
D.A value of +1 indicates a perfect positive linear relationship
E.It can be used for categorical variables
AnswersC, D

Correct.

Why this answer

Pearson r ranges from -1 to 1, measuring linear relationship; +1 indicates perfect positive linear correlation.
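
A short sketch shows both the range of r and why option A is wrong: a perfectly curved relationship can still give r = 0. The function name `pearson_r` and the data are illustrative assumptions.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])      # perfect positive line
r_down = pearson_r(xs, [-3 * x for x in xs])       # perfect negative line
r_curve = pearson_r(xs, [x ** 2 for x in xs])      # exact, but non-linear
print(r_up, r_down, r_curve)
```

In the last case y is completely determined by x, yet r is zero because the relationship is not linear.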

91
MCQmedium

A data analyst is analyzing customer purchase amounts. The dataset contains several extreme high values due to luxury purchases. Which measure of central tendency is most robust to these outliers?

A.Range
B.Mean
C.Mode
D.Median
AnswerD

The median is robust to outliers.

Why this answer

The median is not affected by extreme values, making it robust to outliers.
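
A quick check with made-up purchase amounts illustrates the difference; the values, including the single luxury purchase of 5000, are hypothetical.

```python
from statistics import mean, median

purchases = [20, 25, 30, 35, 40]
with_luxury = purchases + [5000]  # one extreme luxury purchase

print(mean(purchases), median(purchases))
print(mean(with_luxury), median(with_luxury))
```

Adding one extreme value moves the mean by hundreds while the median shifts only slightly, from the middle value to the average of the two middle values.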

92
MCQmedium

A financial analyst wants to compare the mean annual returns of three different investment strategies. Which statistical test is most appropriate?

A.Chi-square test
B.Paired t-test
C.One-way ANOVA
D.Two-sample t-test
AnswerC

ANOVA can compare means of three or more independent groups.

Why this answer

ANOVA is used to compare means of three or more groups.
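
The F statistic behind one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on invented annual returns; the function name `one_way_f` and the data are assumptions, and turning F into a p-value would additionally need the F distribution.

```python
from statistics import mean

def one_way_f(groups):
    """F statistic for one-way ANOVA: between-group MS over within-group MS."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical annual returns (%) for three strategies
strategies = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat = one_way_f(strategies)
print(f_stat)
```

Running three separate two-sample t-tests instead would inflate the overall Type I error rate, which is the usual argument for a single ANOVA.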

93
MCQmedium

A data analyst is preparing data for a k-nearest neighbors algorithm. The features include age (0-100) and income (0-200,000). Which technique should be applied to ensure the distance metric is not dominated by income?

A.Min-max normalization
B.Log transformation
C.Z-score standardization
D.One-hot encoding
AnswerA

Correct: min-max normalization scales to [0,1], preventing features with larger ranges from dominating.

Why this answer

Min-max normalization scales features to a 0-1 range, ensuring each feature contributes equally to distance calculations.

94
MCQmedium

A data analyst is working with a dataset that includes a column 'income' with values ranging from 20,000 to 150,000. To standardize this variable for a linear regression that assumes normally distributed residuals, which method should be used?

A.Log transformation
B.Min-max normalization
C.Square root transformation
D.Z-score standardization
AnswerD

Correct: Z-score centers and scales to unit variance, suitable for normality assumptions.

Why this answer

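Z-score standardization rescales a variable to have mean 0 and standard deviation 1, which is what "standardize" means in this question. It changes only the location and scale of the variable, not the shape of its distribution, so it does not by itself make regression residuals normal.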
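
The transformation is z = (x − mean) / standard deviation. The sketch below uses the population standard deviation and invented income values within the range from the question; the helper name `z_scores` is hypothetical.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Center values on their mean and scale by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 45_000, 60_000, 85_000, 150_000]  # illustrative values
z = z_scores(incomes)
print(z)
```

Unlike min-max normalization, the result is not confined to a fixed range; values simply express how many standard deviations each income lies from the mean.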

95
MCQeasy

Which data cleaning method involves replacing a missing value with the average of the available values in that column?

A.Mean imputation
B.Interpolation
C.Listwise deletion
D.Forward-fill
AnswerA

Mean imputation uses column average.

Why this answer

Mean imputation replaces missing values with the column mean.

96
MCQhard

A data analyst is performing a multiple linear regression with three predictors. The model output shows an R-squared of 0.85 and an adjusted R-squared of 0.80. Which of the following is the best interpretation of the difference between these two values?

A.The model is overfitted, so all predictors should be removed
B.The model has high multicollinearity
C.The residuals are not normally distributed
D.One or more predictors may not be contributing meaningfully
AnswerD

The drop from R-squared to adjusted R-squared indicates that some predictors reduce model efficiency.

Why this answer

Adjusted R-squared penalizes for adding predictors that do not improve the model significantly; a gap suggests some predictors may be irrelevant or the sample size is small.
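
The penalty follows from the formula adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1). The question does not state the sample size, so the n = 13 used below is an assumption, chosen because it happens to reproduce the 0.85 to 0.80 gap with three predictors.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

small_sample = adjusted_r_squared(0.85, n=13, p=3)    # assumed n
large_sample = adjusted_r_squared(0.85, n=1000, p=3)  # assumed n
print(small_sample, large_sample)
```

With a large sample the same three predictors barely move the adjusted value, which is why a noticeable gap points to weak predictors, a small sample, or both.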

97
MCQmedium

A marketing team uses K-means clustering to segment customers based on purchase history. To determine the optimal number of clusters, they plot the within-cluster sum of squares (WCSS) against k and look for an elbow. What is the purpose of this method?

A.To find the point where the rate of decrease in WCSS slows down
B.To identify the value of k that minimizes WCSS
C.To determine the initial centroids for the algorithm
D.To ensure all clusters have equal size
AnswerA

Correct description of the elbow method.

Why this answer

The elbow method helps choose k where adding more clusters yields diminishing returns in reducing variance.
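
One simple way to locate the elbow numerically is to look for the k where the drop in WCSS shrinks the most. This is only one heuristic among several, and both the WCSS values and the function name `elbow_k` below are hypothetical.

```python
# Hypothetical WCSS values for k = 1..6
wcss = {1: 1000, 2: 600, 3: 300, 4: 270, 5: 250, 6: 240}

def elbow_k(wcss_by_k):
    """Return the k where the decrease in WCSS slows down the most."""
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wcss_by_k[prev_k] - wcss_by_k[k]
        drop_after = wcss_by_k[k] - wcss_by_k[next_k]
        bend = drop_before - drop_after  # large bend = sharp slowdown
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

print(elbow_k(wcss))
print(min(wcss, key=wcss.get))  # k that minimizes WCSS
```

WCSS keeps falling as k grows, so simply minimizing it (option B) always favours the largest k tried; the elbow looks for diminishing returns instead.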

98
MCQmedium

An analyst wants to compare the mean sales revenue across three different store regions. The data is normally distributed and variances are equal. Which statistical test is most appropriate?

A.Two-sample t-test
B.ANOVA
C.Paired t-test
D.Chi-square test
AnswerB

ANOVA is appropriate for three groups.

Why this answer

ANOVA (Analysis of Variance) is used to compare means of three or more groups.

99
Multi-Selectmedium

A researcher is designing an A/B test to compare two website layouts. Which TWO elements are essential for determining the required sample size?

Select 2 answers
A.Sample mean
B.Statistical power
C.Confidence interval width
D.Desired effect size
E.P-value
AnswersB, D

Power affects the probability of detecting an effect.

Why this answer

Statistical power and desired effect size are key inputs for sample size calculation.

100
MCQhard

A data analyst is cleaning a dataset and finds that some records have duplicate entries based on customer ID. Which data quality dimension is most directly affected by these duplicates?

A.Timeliness
B.Consistency
C.Accuracy
D.Uniqueness
AnswerD

Duplicates directly impact uniqueness.

Why this answer

Duplicates violate the uniqueness dimension, which requires each entity to be represented only once.

101
MCQhard

In time series decomposition, a pattern that repeats at regular intervals (e.g., weekly, yearly) is called:

A.Cyclical
B.Irregular
C.Trend
D.Seasonality
AnswerD

Seasonality has fixed and known periods.

Why this answer

Seasonality refers to regular, periodic patterns in time series data.

102
MCQmedium

A simple linear regression model predicts sales (y) from advertising spend (x). The equation is y = 2.5x + 10, and R² = 0.81. Which interpretation is correct?

A.The correlation between sales and advertising is 0.81.
B.When advertising is $0, sales are $2.5.
C.81% of the variation in sales is explained by advertising spend.
D.For every $1 increase in advertising, sales increase by $10 on average.
AnswerC

R² = 0.81 means 81% explained.

Why this answer

Slope indicates that each unit increase in x increases y by 2.5 units. R² of 0.81 means 81% of variance in y is explained by x.
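
The distractors can be checked directly against the equation. In this sketch the function name `predict_sales` is hypothetical; the coefficients and R² come from the question.

```python
import math

def predict_sales(ad_spend):
    """Regression equation from the question: y = 2.5x + 10."""
    return 2.5 * ad_spend + 10

intercept = predict_sales(0)                  # predicted sales at $0 advertising
slope = predict_sales(1) - predict_sales(0)   # change in sales per extra $1
r = math.sqrt(0.81)                           # |r| in simple regression; sign follows the slope
print(intercept, slope, r)
```

So sales at zero advertising are 10 (not 2.5), each extra dollar adds 2.5 (not 10), and the correlation is about 0.9 (not 0.81).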

103
MCQmedium

In a time series analysis, a retail analyst observes consistent peaks in sales every December and troughs every February. This pattern repeats annually. Which component of time series does this represent?

A.Irregular
B.Seasonality
C.Trend
D.Cyclical
AnswerB

Seasonality is predictable and repeats over fixed intervals.

Why this answer

Seasonality refers to regular patterns that repeat over fixed periods, such as months or quarters.

104
Multi-Selectmedium

A dataset contains outliers in a feature that will be used for linear regression. Which two outlier treatment methods are appropriate? (Choose TWO)

Select 2 answers
A.Cap the outliers at a percentile (e.g., 99th percentile)
B.Use min-max normalization
C.Increase the sample size
D.Remove the outlier rows
E.Replace outliers with the mean
AnswersA, D

Capping limits extreme values.

Why this answer

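Capping outliers at a percentile or removing the outlier rows reduces their influence on the regression fit. Min-max normalization only rescales the values and leaves the outliers in place.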

105
MCQmedium

A dataset contains a feature 'Age' with values ranging from 18 to 95. To prepare data for a k-nearest neighbors algorithm, which transformation should be applied to 'Age'?

A.Z-score standardization
B.Min-max normalization
C.No transformation needed
D.Log transformation
AnswerB

Min-max normalization ensures all features contribute equally to distance calculations.

Why this answer

Min-max normalization scales features to a fixed range (e.g., 0-1), which is appropriate for distance-based algorithms like k-NN.

106
MCQmedium

A data analyst wants to compare the average revenue per customer between two marketing campaigns (A and B). The analyst is unsure if the data follows a normal distribution. Which statistical test is most appropriate for comparing the means of the two groups?

A.Two-sample t-test
B.Pearson correlation
C.Chi-square test
D.ANOVA
AnswerA

The two-sample t-test compares means of two independent groups.

Why this answer

For comparing means of two independent groups, the t-test is the standard parametric test. If normality is violated, a non-parametric alternative like Mann-Whitney U could be used, but the t-test is robust for moderate sample sizes.

107
MCQhard

A data analyst has a time series of monthly sales data. They observe that sales are consistently higher every December and lower every January. Which component of time series does this pattern represent?

A.Irregular
B.Cyclical
C.Seasonality
D.Trend
AnswerC

Seasonality refers to fixed periodic patterns within a year.

Why this answer

Regular patterns that repeat within one year are seasonality.

108
MCQmedium

An analyst is conducting an A/B test to compare two website designs. The null hypothesis is that there is no difference in conversion rates. The p-value obtained is 0.03, and the significance threshold is 0.05. What should the analyst conclude?

A.Reject the null hypothesis; there is a significant difference.
B.Accept the alternative hypothesis that the new design is better.
C.The test is inconclusive; need a larger sample size.
D.Fail to reject the null hypothesis; there is no significant difference.
AnswerA

Correct: p < α, reject null.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.

109
MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following best describes how these measures are used in descriptive statistics?

A.To identify outliers using standard deviation
B.To test hypotheses about population parameters
C.To describe the central tendency of the data
D.To determine the probability of an event
AnswerC

Mean, median, and mode are measures of central tendency.

Why this answer

Descriptive statistics summarize data using measures like mean, median, and mode to describe central tendency.

110
MCQmedium

A stock analyst is analyzing monthly sales data for a retail company and observes a consistent pattern of high sales every December. This pattern is most likely an example of which time series component?

A.Irregular
B.Cyclical
C.Seasonality
D.Trend
AnswerC

Correct: regular pattern within a fixed period.

Why this answer

Seasonality refers to regular, predictable patterns that repeat at fixed intervals (e.g., yearly, monthly). The consistent December peak indicates a seasonal pattern.

111
MCQeasy

In simple linear regression, the coefficient of determination R² measures:

A.The probability that the slope is zero
B.The slope of the regression line
C.The proportion of variance in the dependent variable explained by the independent variable
D.The strength and direction of the linear relationship
AnswerC

Correct interpretation of R².

Why this answer

R² indicates the proportion of variance in the dependent variable explained by the independent variable.

112
MCQeasy

In a regression analysis, the coefficient of determination (R²) is 0.85. How should this value be interpreted?

A.85% of the data points lie on the regression line
B.The slope of the regression line is 0.85
C.85% of the variance in the dependent variable is explained by the model
D.85% of the independent variables are significant
AnswerC

Correct interpretation of R².

Why this answer

R² represents the proportion of variance in the dependent variable that is explained by the independent variable(s). An R² of 0.85 means the model explains 85% of the variability.

113
Multi-Selectmedium

A retail company wants to segment its customers based on purchase history. Which THREE methods are appropriate for customer segmentation?

Select 3 answers
A.RFM analysis
B.Linear regression
C.K-means clustering
D.t-test
E.Hierarchical clustering
AnswersA, C, E

Segments based on recency, frequency, monetary value.

Why this answer

K-means clustering, hierarchical clustering, and RFM analysis are common segmentation techniques. Linear regression and t-test are not segmentation methods.

114
Multi-Selectmedium

An analyst is preparing data for an A/B test and wants to ensure valid results. Which TWO of the following should be considered when calculating the required sample size?

Select 2 answers
A.Data dimensionality
B.Desired effect size
C.Skewness of data
D.Number of features
E.Statistical power
AnswersB, E

Correct: effect size is a key input.

Why this answer

Sample size calculation depends on desired effect size and statistical power, among other factors like significance level.

115
MCQeasy

A data analyst wants to compare the means of three different training methods on employee productivity. Which statistical test is most appropriate?

A.Correlation analysis
B.ANOVA
C.Chi-square test
D.t-test
AnswerB

ANOVA compares means across multiple groups.

Why this answer

ANOVA (Analysis of Variance) is used to compare means of three or more groups.

116
Multi-Selectmedium

An analyst is planning an A/B test to compare two website designs. Which TWO factors should be considered when calculating the required sample size?

Select 2 answers
A.Data type of the outcome variable
B.Desired effect size
C.Statistical power
D.Color scheme of the designs
E.Number of missing values
AnswersB, C

Correct.

Why this answer

Statistical power and desired effect size are key inputs for sample size calculations.

117
MCQhard

A data analyst is cleaning a dataset with missing values in a time series of daily temperatures. The missing values occur sporadically. Which imputation method is most appropriate to maintain the temporal trend?

A.Forward-fill
B.Mean imputation
C.Median imputation
D.Interpolation
AnswerD

Correct: uses neighboring values to estimate missing points, preserving trend.

Why this answer

Interpolation estimates missing values by using surrounding data points and is suitable for time series with a trend. Forward-fill carries the last observation forward, which may not capture trend well. Mean imputation ignores order.
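
The difference between the two methods is easy to see on a short series. Both helper functions and the temperature values below are illustrative assumptions; the interpolation shown is plain linear interpolation between the nearest known neighbours, and leading or trailing gaps are left untouched.

```python
def forward_fill(series):
    """Replace each gap with the most recent known value."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def linear_interpolate(series):
    """Fill interior gaps on a straight line between the surrounding values."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

temps = [10.0, None, 14.0, None, None, 20.0]  # hypothetical daily temperatures
print(forward_fill(temps))
print(linear_interpolate(temps))
```

Forward-fill produces flat steps during a warming period, while interpolation follows the upward movement between known readings.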

118
MCQmedium

A data analyst is reviewing a dataset containing house prices. The mean price is $350,000 and the median is $280,000. Which of the following best describes the distribution of house prices?

A.The distribution is right-skewed.
B.The distribution is symmetric.
C.The distribution is left-skewed.
D.The distribution is bimodal.
AnswerA

Correct: Mean > median indicates right skew.

Why this answer

When the mean is greater than the median, the distribution is right-skewed (positively skewed) because higher values pull the mean upward.

119
Multi-Selectmedium

Which TWO of the following are appropriate uses of min-max normalisation?

Select 2 answers
A.Transforming data to have mean 0 and standard deviation 1
B.Scaling features to a range of 0 to 1
C.Preparing data for linear regression with normally distributed residuals
D.Preparing data for k-nearest neighbours algorithm
E.Handling missing values
AnswersB, D

Correct: Min-max normalisation scales to [0,1].

Why this answer

Min-max normalisation scales data to a fixed range (often 0-1), useful for distance-based algorithms like k-NN and neural networks. Standardisation (Z-score) is better for algorithms assuming Gaussian distribution.

120
MCQeasy

Which data quality dimension ensures that data represents the real-world scenario correctly and without errors?

A.Completeness
B.Consistency
C.Accuracy
D.Timeliness
AnswerC

Accuracy is about correctness and error-free data.

Why this answer

Accuracy means the data correctly reflects reality.

121
MCQhard

In A/B testing, which factor is increased by having a larger sample size?

A.P-value
B.Effect size
C.Type I error rate
D.Statistical power
AnswerD

Power increases with sample size.

Why this answer

Larger sample size increases statistical power (ability to detect a true effect).
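
The relationship can be sketched with a normal approximation for a two-sided, two-sample comparison of means. The function name `approx_power`, the effect size of 0.5, and the sample sizes are assumptions for the example, and the approximation ignores the negligible chance of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a given d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return NormalDist().cdf(shift - z_alpha)

for n in (20, 63, 200):
    print(n, round(approx_power(n, effect_size=0.5), 3))
```

The significance level stays fixed at α throughout, which is why the Type I error rate (option C) does not grow with the sample.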

122
MCQeasy

Which data quality dimension is violated if a customer record has a missing phone number?

A.Consistency
B.Accuracy
C.Completeness
D.Validity
AnswerC

Completeness measures missing values.

Why this answer

Completeness refers to the extent to which data is not missing.

123
Multi-Selectmedium

An analyst wants to compare the average sales revenue across three different store locations. Which TWO statistical methods are appropriate for this comparison?

Select 2 answers
A.Two-sample t-test
B.ANOVA
C.Multiple regression
D.Descriptive statistics
E.Chi-square test
AnswersB, C

Correct: ANOVA compares means across three or more groups.

Why this answer

ANOVA compares means of three or more groups. Multiple regression with dummy (indicator) variables for the store locations tests the same differences in group means. A two-sample t-test compares only two groups. Chi-square tests association between categorical variables.

Descriptive stats summarise but don't compare multiple groups inferentially.

124
MCQeasy

In a simple linear regression model y = 2.5 + 1.2x, what is the predicted value of y when x = 10?

A.12.0
B.13.7
C.14.5
D.10.0
AnswerC

Correct calculation.

Why this answer

Plug x=10: y = 2.5 + 1.2*10 = 2.5 + 12 = 14.5.

125
Multi-Selectmedium

A data analyst wants to segment customers based on purchasing behavior such as frequency, monetary value, and recency. Which TWO clustering evaluation methods can help determine the optimal number of clusters? (Select two.)

Select 2 answers
A.Correlation coefficient
B.ANOVA
C.Silhouette score
D.t-test
E.Elbow method
AnswersC, E

Measures how similar an object is to its own cluster vs others.

Why this answer

The elbow method uses within-cluster sum of squares, and the silhouette score measures cohesion and separation. Both help choose k. Correlation coefficient is for association, not clustering.

ANOVA and t-test are for hypothesis testing.

126
MCQhard

A data analyst is asked to compare the average sales across three different store locations. The data is normally distributed and variances are approximately equal. Which statistical test is most appropriate?

A.ANOVA
B.Pearson correlation
C.Chi-square test
D.Two-sample t-test
AnswerA

ANOVA is designed for comparing means of 3+ groups.

Why this answer

ANOVA is used to compare means of three or more groups when assumptions of normality and equal variance are met.

127
MCQhard

An analyst is performing a logistic regression to predict customer churn (yes/no). The model outputs a probability of 0.75 for a particular customer. Which of the following best describes the interpretation?

A.The model predicts that the customer will not churn
B.There is a 75% chance that the customer will churn
C.The customer will definitely churn because the probability is above 0.5
D.The odds of churning are 0.75 to 1
AnswerB

Correct interpretation of logistic regression output.

Why this answer

Logistic regression outputs the probability that the event (churn) occurs, given the input features.
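
Probability, odds, and predicted class are related but distinct, which is what options C and D confuse. The helper names below are hypothetical, and the 0.5 default threshold is a common convention rather than something the question fixes.

```python
def odds(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)

def predicted_class(p, threshold=0.5):
    """Turn a probability into a class label using a decision threshold."""
    return 1 if p >= threshold else 0

p_churn = 0.75
print(odds(p_churn))                            # odds in favour of churn
print(predicted_class(p_churn))                 # label at the default threshold
print(predicted_class(p_churn, threshold=0.8))  # label at a stricter threshold
```

A probability of 0.75 corresponds to odds of 3 to 1, and the predicted label depends on the chosen threshold, so the customer is likely, not certain, to churn.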

128
Multi-Selectmedium

An analyst is conducting an A/B test on a new checkout process. To calculate sample size, which THREE factors must be considered?

Select 3 answers
A.Number of control groups
B.Desired effect size
C.Significance level (alpha)
D.Statistical power
E.Population standard deviation
AnswersB, C, D

Effect size is the minimum practical difference to detect.

Why this answer

Statistical power, significance level (alpha), and desired effect size (minimum detectable effect) are essential for sample size calculation.

129
MCQeasy

Which statistical test should be used to determine if there is a significant association between two categorical variables, such as gender and product preference?

A.ANOVA
B.Chi-square test
C.Pearson correlation
D.t-test
AnswerB

Correct test for categorical variables.

Why this answer

The chi-square test of independence is used to test association between two categorical variables.

130
MCQhard

A data scientist runs a linear regression model to predict customer spending based on income. The R-squared value is 0.45 and the p-value for the slope coefficient is 0.03. At a significance level of α=0.05, which of the following conclusions is correct?

A.The slope is not statistically significant, and the model explains 55% of the variance.
B.The slope is statistically significant, and the model explains 45% of the variance.
C.The slope is statistically significant, and the model explains 55% of the variance.
D.The slope is not statistically significant, and the model explains 45% of the variance.
AnswerB

Correct: p<0.05 indicates significance; R²=0.45 indicates explained variance.

Why this answer

The p-value (0.03) is less than α (0.05), so the slope is statistically significant. R²=0.45 means the model explains 45% of the variance.

131
MCQeasy

Which data quality dimension is most concerned with whether data values fall within a defined domain or acceptable range?

A.Completeness
B.Consistency
C.Validity
D.Accuracy
AnswerC

Validity checks if data follows format and range rules.

Why this answer

Validity refers to whether data values conform to defined rules or constraints.

132
MCQhard

A data scientist applies K-means clustering to a customer dataset. The within-cluster sum of squares (WCSS) is plotted against k, and the elbow appears at k=4. What does this indicate?

A.Increasing k beyond 4 would not significantly reduce WCSS.
B.The data naturally forms 4 clusters with no noise.
C.The algorithm converged to a local minimum.
D.The model has overfit the data.
AnswerA

The elbow point is where the rate of decrease sharply changes.

Why this answer

The elbow method suggests that increasing k beyond 4 yields diminishing returns in reducing WCSS; k=4 is a good trade-off.

133
MCQmedium

A data analyst is examining the relationship between advertising spend (in thousands) and sales (in thousands). The Pearson correlation coefficient is computed as r = -0.85. Which of the following interpretations is correct?

A.There is no linear relationship.
B.There is a strong positive linear relationship between advertising spend and sales.
C.There is a weak negative linear relationship.
D.There is a strong negative linear relationship.
AnswerD

Correct: r close to -1 indicates strong negative.

Why this answer

Pearson r measures linear correlation: -0.85 indicates a strong negative linear relationship (as one variable increases, the other decreases). The magnitude |-0.85| = 0.85 is close to 1, so the relationship is strong.

134
MCQeasy

A data analyst is cleaning a dataset and finds that the 'age' column has several missing values. Which of the following is a valid method for handling missing numerical data?

A.Delete the entire column
B.Ignore the missing values
C.Impute with the mean
D.Replace with zeros
AnswerC

Correct: mean imputation is a standard technique.

Why this answer

Mean imputation is a common method for handling missing numerical data, though median or mode can also be used.

135
MCQeasy

A data analyst is summarizing the central tendency of a dataset with extreme outliers. Which measure is most robust to outliers?

A.Standard deviation
B.Median
C.Mean
D.Range
AnswerB

Median is robust to outliers.

Why this answer

The median is not affected by extreme values, unlike the mean.

136
MCQmedium

A data analyst is cleaning a dataset and finds that a numeric field has several missing values. The variable is normally distributed. Which imputation method is most appropriate?

A.Median imputation
B.Mean imputation
C.Mode imputation
D.Forward-fill
AnswerB

Mean is appropriate for symmetric distributions.

Why this answer

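For normally distributed (symmetric) data, the mean and median coincide, so mean imputation is the conventional choice and preserves the sample mean. It does, however, reduce the variance of the imputed column.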
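
The following sketch shows mean imputation on a hypothetical 'age' column, with `None` standing in for a missing value. It is an illustration under those assumptions, not a recommendation for every dataset.

```python
import statistics

# Hypothetical 'age' column; None marks a missing value.
ages = [22, None, 30, 35, None, 41]

observed = [a for a in ages if a is not None]
fill = statistics.mean(observed)
imputed = [fill if a is None else a for a in ages]

# The mean is preserved, but the spread shrinks because every
# filled value sits exactly at the mean.
print(statistics.mean(imputed))
print(statistics.pvariance(observed), statistics.pvariance(imputed))
```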

137
MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following measures of central tendency is least affected by extreme outliers?

A.Median
B.Range
C.Mode
D.Mean
AnswerA

The median is not affected by extreme values.

Why this answer

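The median is resistant to outliers because it depends only on the middle value of the ordered data, whereas the mean is pulled toward extreme values. The mode is also largely unaffected by outliers, but it is a poor summary of continuous data. Range is a measure of spread, not central tendency.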

138
MCQhard

A data analyst uses the elbow method to determine the number of clusters for k-means. The plot shows a sharp bend at k=3 and a small bend at k=5. What is the recommended number of clusters?

A.5
B.The method is inconclusive.
C.2
D.3
AnswerD

The sharp bend suggests 3 clusters.

Why this answer

The elbow method suggests choosing k where the decrease in inertia becomes marginal; the sharp bend at 3 indicates the optimal k.
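
One informal way to locate the bend is to compare the drop in WCSS going into each k with the drop going out of it. The sketch below uses hypothetical WCSS values, invented for illustration and shaped to resemble the plot described in the question. The `elbow_strength` helper is likewise an assumption, not a standard library function.

```python
# Hypothetical WCSS (inertia) values for k = 1..6, chosen for illustration only.
wcss = {1: 1000, 2: 600, 3: 250, 4: 200, 5: 120, 6: 100}

def elbow_strength(k):
    """How much the drop in WCSS slows down at k (a discrete second difference)."""
    drop_into_k = wcss[k - 1] - wcss[k]
    drop_after_k = wcss[k] - wcss[k + 1]
    return drop_into_k - drop_after_k

candidates = range(2, 6)  # interior k values that have both neighbours
strengths = {k: elbow_strength(k) for k in candidates}
best_k = max(strengths, key=strengths.get)
print(strengths, best_k)
```

With these values the sharpest bend is at k=3 and a weaker one appears at k=5, matching the reasoning above. In practice the elbow is a heuristic and is often checked against other criteria.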

139
MCQmedium

An analyst is performing a linear regression and obtains an R-squared value of 0.85. Which of the following is the best interpretation?

A.85% of the residuals are zero.
B.85% of the data points lie on the regression line.
C.There is an 85% chance that the relationship is causal.
D.The model explains 85% of the variability in the dependent variable.
AnswerD

This is the correct interpretation of R-squared.

Why this answer

R-squared indicates the proportion of variance in the dependent variable explained by the independent variable(s). 0.85 means 85% explained.
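
As a sketch of where the number comes from, R-squared can be computed as 1 minus the ratio of residual to total sum of squares. The data below are hypothetical and the `r_squared` helper is written for this illustration only, for the single-predictor case.

```python
def r_squared(xs, ys):
    """R-squared of a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Ordinary least squares slope and intercept for one predictor.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data with a noisy upward trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 5, 4, 8, 7, 11]
r2 = r_squared(x, y)
print(round(r2, 3))
```

For this particular toy data the result comes out close to 0.85, meaning the fitted line accounts for roughly 85% of the variation in y.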

140
MCQhard

A data scientist is building a model to predict customer churn (yes/no). After training a logistic regression model, the coefficient for 'monthly charges' is 0.05 with a p-value of 0.03. Which interpretation is correct at α=0.05?

A.The model's R-squared is 0.05.
B.For every unit increase in monthly charges, the odds of churn increase by about 5%.
C.Monthly charges decrease the probability of churn.
D.Monthly charges have no significant effect on churn.
AnswerB

The coefficient 0.05 is the change in log-odds per unit increase in monthly charges; exp(0.05) ≈ 1.051, about a 5% increase in the odds.

Why this answer

The p-value < 0.05 indicates a statistically significant relationship; the positive coefficient means higher charges increase the log-odds of churn.
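
The conversion from a log-odds coefficient to an odds ratio is a single exponentiation. A minimal sketch, using the coefficient from the question:

```python
import math

coefficient = 0.05  # log-odds change per unit of monthly charges (from the question)

odds_ratio = math.exp(coefficient)
percent_change_in_odds = (odds_ratio - 1) * 100
print(f"odds ratio: {odds_ratio:.4f}, change in odds: {percent_change_in_odds:.1f}%")
```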

141
Multi-Selecthard

A data analyst is performing K-means clustering on customer data. Which THREE of the following are steps in the K-means algorithm?

Select 3 answers
A.Perform eigenvalue decomposition.
B.Calculate the correlation matrix.
C.Initialize k centroids randomly.
D.Update centroids by computing the mean of all points assigned to each centroid.
E.Assign each data point to the nearest centroid.
AnswersC, D, E

Correct: K-means initializes centroids, assigns points to the nearest centroid, and updates the centroids.

Why this answer

K-means involves initializing centroids, assigning points to nearest centroid, and updating centroids as the mean of assigned points.
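
The three steps can be sketched in plain Python as follows. This is a minimal teaching version, not a production implementation: the `kmeans` function, its `seed` parameter, and the sample points are all assumptions made for illustration. Real libraries typically use smarter initialization.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize k centroids by picking k distinct points at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2: assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical, well-separated 2-D points.
points = [(1.0, 1.0), (1.5, 1.5), (2.0, 2.0), (9.0, 9.0), (9.5, 9.5), (10.0, 10.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

Steps 2 and 3 repeat until the centroids stop moving. On less tidy data the result can depend on the random initialization.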

142
MCQeasy

A dataset contains a column 'Age' with values: [22, 25, 25, 30, 35, 40, 45]. What is the interquartile range (IQR)?

A.15
B.10
C.20
D.25
AnswerA

Correct IQR = Q3 - Q1 = 40 - 25 = 15.

Why this answer

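Using the median-of-halves method (the overall median, 30, is excluded from both halves): Q1 is the median of the lower half (22, 25, 25) = 25; Q3 is the median of the upper half (35, 40, 45) = 40; IQR = 40 - 25 = 15. Other quartile conventions can give a different value.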
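
Quartile definitions vary between tools, so it can help to check which convention a library uses. The sketch below uses Python's standard `statistics.quantiles` with both of its methods on the data from the question.

```python
import statistics

ages = [22, 25, 25, 30, 35, 40, 45]

# 'exclusive' matches the median-of-halves result for this data;
# 'inclusive' interpolates differently and gives a smaller Q3.
q1_ex, _, q3_ex = statistics.quantiles(ages, n=4, method="exclusive")
q1_in, _, q3_in = statistics.quantiles(ages, n=4, method="inclusive")

iqr_exclusive = q3_ex - q1_ex
iqr_inclusive = q3_in - q1_in
print(iqr_exclusive, iqr_inclusive)
```

The exclusive method reproduces the answer of 15, while the inclusive method yields 12.5 for the same data.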

143
MCQmedium

A data analyst is conducting an A/B test on a website's landing page. The null hypothesis is that there is no difference in conversion rates between the control and treatment groups. After collecting data, the analyst calculates a p-value of 0.03. Using a significance level of α = 0.05, what is the correct conclusion?

A.Accept the null hypothesis; the difference is due to chance.
B.Reject the null hypothesis; the treatment group has a higher conversion rate.
C.Fail to reject the null hypothesis; there is no evidence of a difference.
D.The result is inconclusive because the p-value is close to 0.05.
AnswerB

The p-value indicates statistical significance, but direction must be checked from data.

Why this answer

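Since p = 0.03 < α = 0.05, the null hypothesis is rejected, indicating a statistically significant difference in conversion rates. The p-value alone does not say which group converts better, so option B's claim that the treatment is higher assumes the observed treatment rate exceeds the control rate.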
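
A two-proportion z-test is one common way such a p-value is obtained. The sketch below assumes that test and uses hypothetical conversion counts; the question does not state which test or what data produced its p-value of 0.03.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal tail: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value

# Hypothetical counts: control converts 200/2000, treatment 245/2000.
diff, z, p_value = two_proportion_z_test(200, 2000, 245, 2000)
alpha = 0.05
reject_null = p_value < alpha
print(f"diff={diff:.4f}, z={z:.2f}, p={p_value:.3f}, reject={reject_null}")
```

The sign of `diff` gives the direction of the effect; the p-value only says whether a difference of that size is plausible under the null hypothesis.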

144
MCQmedium

A data scientist is preparing data for a K-means clustering algorithm. The dataset contains features measured in different units (e.g., income in dollars and age in years). Which preprocessing step is most critical before running K-means?

A.Remove outliers
B.Encode categorical variables
C.Standardize or normalize the features
D.Perform feature selection
AnswerC

Scaling ensures equal weighting; both min-max and Z-score are common.

Why this answer

K-means is sensitive to the scale of features because it uses Euclidean distance. Min-max normalization or standardization ensures all features contribute equally.
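
A minimal sketch of Z-score standardization follows. The `income` and `age` values are hypothetical and the `standardize` helper is written for this illustration.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical customers: income in dollars and age in years.
income = [30000, 45000, 60000, 120000]
age = [25, 32, 41, 58]

income_z = standardize(income)
age_z = standardize(age)
print([round(v, 2) for v in income_z], [round(v, 2) for v in age_z])
```

After scaling, both features have mean 0 and standard deviation 1, so neither dominates the Euclidean distance simply because of its units.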

145
MCQmedium

In a logistic regression model predicting customer churn (1 = churn, 0 = not churn), the coefficient for 'contract length' is -0.5. Which of the following is the correct interpretation?

A.For each unit increase in contract length, the log-odds of churn decrease by 0.5.
B.Longer contract length increases the odds of churn.
C.The probability of churn decreases by 50% for each unit increase in contract length.
D.Contract length is not a significant predictor.
AnswerA

Correct interpretation of logistic regression coefficient.

Why this answer

In logistic regression, coefficients represent the log-odds change. A negative coefficient decreases the log-odds, meaning lower probability of churn.
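
To see why option C is wrong, it helps to convert the coefficient to an odds ratio and then to probabilities. In the sketch below, the baseline log-odds values are hypothetical; only the coefficient comes from the question.

```python
import math

coefficient = -0.5  # log-odds change per unit of contract length (from the question)
odds_ratio = math.exp(coefficient)  # multiplicative change in odds per unit

def probability(log_odds):
    """Logistic (sigmoid) function: convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical baseline log-odds; the drop in probability depends on where you start.
for baseline in (-2.0, 0.0, 2.0):
    before = probability(baseline)
    after = probability(baseline + coefficient)
    print(f"{before:.3f} -> {after:.3f}")
```

The odds are multiplied by about 0.61 per unit, but the change in probability varies with the baseline. A fixed "50% decrease in probability" is therefore not a valid reading.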

146
Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to handle outliers. Which TWO of the following are common methods for treating outliers?

Select 2 answers
A.Removal
B.Capping
C.Normalization
D.Imputation
E.Standardization
AnswersA, B

Removing outlier records is a common approach.

Why this answer

Capping (winsorizing) limits extreme values, and removal deletes outlier rows. Transformation (e.g., log) can also reduce the impact of outliers but is not listed here; normalization, standardization, and imputation are not primary outlier treatments.
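
Both treatments can be sketched with the 1.5 × IQR rule, which is one common convention for flagging outliers and is not specified by the question. The values below are hypothetical.

```python
import statistics

# Hypothetical values with one extreme point.
values = [10, 12, 13, 15, 16, 18, 95]

# Tukey fences: 1.5 * IQR beyond the quartiles (one common convention).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = [v for v in values if lower <= v <= upper]   # removal: drop outliers
capped = [min(max(v, lower), upper) for v in values]   # capping: clamp to the fences
print(removed, capped)
```

Removal shortens the dataset, while capping keeps every row and pulls the extreme value back to the fence.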

147
MCQeasy

A data analyst calculates the mean, median, and mode of a sales dataset and finds they are all equal. Which type of distribution does this indicate?

A.Normal distribution
B.Skewed right
C.Bimodal distribution
D.Skewed left
AnswerA

Normal distribution has equal mean, median, and mode.

Why this answer

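When mean, median, and mode are equal, the data are consistent with a symmetric, unimodal, bell-shaped (normal) distribution. Skewed distributions pull the mean away from the median and mode, and a bimodal distribution has two modes.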
