
CCNA Dap Analysing Data Questions

72 of 147 questions · Page 2/2 · Dap Analysing Data topic · Answers revealed


76

Multi-Selecthard

A company is planning an A/B test to compare two website designs. Which THREE of the following must be determined before the test begins to ensure valid results? (Select three.)

Select 3 answers

A.The desired effect size

B.The p-value of the test

C.Which hypothesis is true

D.The minimum sample size required

E.The significance level (α)

AnswersA, D, E

Helps determine sample size.

Why this answer

Sample size (based on power and effect size), significance level (α), and desired effect size are all specified in advance when the test is designed. The p-value is an outcome of the test, not a pre-test parameter. The null and alternative hypotheses must also be stated beforehand, but which of them is true cannot be known in advance; that is what the test is meant to assess.

The correct choices are therefore the minimum sample size, the significance level, and the desired effect size.
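
As a rough illustration of how these pre-specified quantities fit together, the sketch below uses the common normal-approximation formula for comparing two conversion rates. The function name, the 10% baseline, and the target rates are assumptions made for the example, not values from the question.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control               # desired effect size
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical inputs: 10% baseline conversion, hoping to detect 12% or 15%.
n_small_effect = sample_size_per_group(0.10, 0.12)
n_large_effect = sample_size_per_group(0.10, 0.15)
```

A smaller effect size, a stricter α, or a higher power target each push the required sample size up.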


77

MCQmedium

A data analyst is cleaning a dataset and finds that the 'age' column has several missing values. Which method of handling missing values is least likely to introduce bias if the missingness is completely at random?

A.Mean imputation

B.Listwise deletion

C.Mode imputation

D.Forward-fill

AnswerB

If MCAR, listwise deletion gives unbiased estimates, though with less power.

Why this answer

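Listwise deletion (removing rows with missing values) reduces the sample size, but when data are missing completely at random (MCAR) the remaining rows are still a random sample, so estimates stay unbiased. Mean and mode imputation shrink the variance and weaken correlations, and forward-fill assumes a meaningful row order, so listwise deletion is the least likely of the options to introduce bias.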
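
A small sketch of the trade-off, using a made-up 'age' column: both approaches give the same mean here, but mean imputation understates the spread because every filled value sits exactly at the mean.

```python
from statistics import mean, variance

# Hypothetical 'age' column; None marks a missing value.
ages = [23, 31, None, 45, 52, None, 38, 29]

# Listwise deletion: keep only the observed values.
observed = [a for a in ages if a is not None]

# Mean imputation: fill each gap with the mean of the observed values.
fill = mean(observed)
imputed = [fill if a is None else a for a in ages]

listwise_var = variance(observed)
imputed_var = variance(imputed)
```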


78

MCQhard

A data analyst is comparing the means of two independent groups using a t-test. The sample sizes are small and the data is not normally distributed. Which condition is violated for a valid t-test?

A.Normality

B.Equal variances

C.Independence of observations

D.Sample size larger than 30

AnswerA

Normality is an assumption of t-test.

Why this answer

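The t-test assumes the data in each group are approximately normally distributed, and this matters most with small samples, where the central limit theorem offers little protection. Violating normality can make the resulting p-value unreliable.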


79

Multi-Selectmedium

A data analyst is performing data cleaning on a dataset and identifies several outliers in the 'age' column. Which TWO methods are appropriate for handling these outliers? (Select two.)

Select 2 answers

A.Capping

B.Mean imputation

C.Transformation

D.Removal

E.Binning

AnswersA, D

Capping limits outliers to a specified percentile.

Why this answer

Capping limits extreme values to a threshold, and removal deletes outlier records. Transformation (e.g., log) can reduce impact but is more for skewness. Imputation and binning are for missing data or discretization, not directly for outliers.
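
The two accepted treatments can be sketched as follows. The threshold used here is the 1.5 × IQR fence, which is an assumption for the example; a percentile cap works the same way with a different cutoff. The data are made up.

```python
from statistics import quantiles

# Hypothetical 'age' values; 120 is the suspicious entry.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 120]

q1, _, q3 = quantiles(ages, n=4)      # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: pull values outside the fences back to the nearest fence.
capped = [min(max(a, lower), upper) for a in ages]

# Removal: drop the records that fall outside the fences.
removed = [a for a in ages if lower <= a <= upper]
```

Capping keeps every record but changes the extreme value; removal keeps values intact but loses the record.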


80

MCQeasy

In an A/B test, the null hypothesis states that there is no difference between the control and treatment groups. After running the test, the p-value is 0.04. Assuming α = 0.05, what is the correct conclusion?

A.Fail to reject the null hypothesis

B.Reject the null hypothesis

C.Accept the null hypothesis

D.The test is invalid because the p-value is too low

AnswerB

Correct conclusion.

Why this answer

Since p-value (0.04) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.
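
The decision rule can be sketched in a few lines. The p-value helper below is a pooled two-proportion z-test, one common choice for conversion-rate A/B tests and an assumption here; the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for H0: both groups share the same conversion rate."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical counts: 200/2000 conversions (control) vs 250/2000 (treatment).
p_value = two_proportion_p_value(200, 2000, 250, 2000)
decision = decide(p_value)
```

Note that the rule never returns "accept H0": a large p-value only means the data do not provide enough evidence against the null.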


81

Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to address data quality issues. Which TWO of the following are common data cleaning tasks?

Select 2 answers

A.Performing hypothesis testing

B.Imputing missing values

C.Building a regression model

D.Calculating correlation coefficients

E.Deduplicating records

AnswersB, E

Correct.

Why this answer

Handling missing values and removing duplicates are standard data cleaning tasks.


82

Multi-Selectmedium

A data analyst is cleaning a customer dataset. Which two actions are appropriate for handling duplicate records? (Choose TWO)

Select 2 answers

A.Impute missing values with mean

B.Delete any row with a duplicate email address

C.Remove all rows with identical values in every field

D.Apply Z-score standardization

E.Use a fuzzy matching algorithm to identify near-duplicates

AnswersC, E

Exact duplicates can be removed safely.

Why this answer

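Rows that are identical in every field are exact duplicates and can be removed safely, and fuzzy matching finds near-duplicates such as the same customer entered with slightly different spellings. Deleting every row that shares an email address is too aggressive because distinct customers (for example, family members) may share one; mean imputation and Z-score standardization do not address duplicates.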
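
A minimal sketch of both steps using only the Python standard library. The records and the 0.85 similarity threshold are assumptions for illustration; flagged pairs would normally be reviewed rather than deleted automatically.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records: (name, email).
records = [
    ("John Smith", "john.smith@example.com"),
    ("John Smith", "john.smith@example.com"),   # exact duplicate
    ("Jon Smith", "jon.smith@example.com"),     # likely near-duplicate
    ("Alice Wong", "alice.wong@example.com"),
]

# Step 1: drop rows identical in every field, keeping first occurrences.
unique_records = list(dict.fromkeys(records))


def similarity(a, b):
    """Similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Step 2: flag near-duplicate names for review (0.85 is an assumed threshold).
THRESHOLD = 0.85
near_duplicates = [
    (a[0], b[0])
    for a, b in combinations(unique_records, 2)
    if similarity(a[0], b[0]) >= THRESHOLD
]
```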


83

MCQmedium

A data analyst is preparing features for a machine learning model that uses distance-based algorithms (e.g., K-means, KNN). The dataset contains numerical features with different scales: age (0-100), income (20,000-200,000), and credit score (300-850). Which data transformation technique is most appropriate to ensure all features contribute equally to the distance calculations?

A.Z-score standardization

B.Min-max normalization

C.One-hot encoding

D.Log transformation

AnswerB

Correct: scales all features to [0,1] so distances are not dominated by large-scale features.

Why this answer

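Min-max normalization rescales each feature to a fixed range (e.g., 0 to 1), so no feature dominates the distance calculation simply because of its units. Z-score standardization also puts features on a comparable scale but does not bound them to a common range; it is often preferred for algorithms that assume roughly Gaussian inputs.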
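
To see why scaling matters for distance-based methods, the sketch below compares Euclidean distances before and after min-max normalization. The three customers are made up; in the raw data a 35-year age gap barely registers next to a $1,000 income gap.

```python
from math import dist


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical customers: (age, income).
ages = [25, 60, 26]
incomes = [50_000, 51_000, 90_000]

raw = list(zip(ages, incomes))
scaled = list(zip(min_max(ages), min_max(incomes)))

# Distances from the first customer to the other two, before and after scaling.
raw_d = [dist(raw[0], raw[1]), dist(raw[0], raw[2])]
scaled_d = [dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])]
```

Before scaling, the second customer looks far closer to the first than the third does, purely because of income units; after scaling, the large age gap and the large income gap count about equally.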


84

MCQmedium

A data analyst is testing whether the average sales amount differs between two regions. Which statistical test is most appropriate?

A.Chi-square test

B.ANOVA

C.Two-sample t-test

D.Paired t-test

AnswerC

Compares means of two independent groups.

Why this answer

A two-sample t-test compares the means of two independent groups.


85

MCQmedium

A data scientist builds a simple linear regression model to predict house prices based on square footage. The model yields an R-squared value of 0.85. Which statement accurately interprets this result?

A.The slope of the regression line is 0.85

B.85% of the data points lie exactly on the regression line

C.The model explains 85% of the variability in house prices

D.There is an 85% chance that square footage causes higher prices

AnswerC

Correct interpretation of R-squared.

Why this answer

R-squared of 0.85 means 85% of the variance in house prices is explained by square footage.
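
For reference, R-squared can be computed by hand as 1 minus the ratio of residual to total sum of squares. The sketch below fits a least-squares line to made-up data; the function names and values are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: square footage and sale price (in $1000s).
sqft = [1000, 1500, 2000, 2500, 3000]
price = [210, 290, 405, 480, 610]
r2 = r_squared(sqft, price)
```

R-squared says nothing about the slope, about how many points lie on the line, or about causation, which is why the other options are wrong.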


86

Multi-Selecthard

A data analyst is performing a chi-square test of independence on a contingency table of customer satisfaction (satisfied, neutral, dissatisfied) by region (North, South, East, West). Which THREE of the following are necessary assumptions for the test?

Select 3 answers

A.The two variables are categorical

B.The sample size is greater than 30

C.Expected frequencies are at least 5 in each cell (or in most cells)

D.The observations are independent

E.The data must be normally distributed

AnswersA, C, D

Chi-square tests association between categorical variables.

Why this answer

The chi-square test requires categorical variables, expected frequencies of at least 5 in at least 80% of cells, and independent observations.
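
The expected-frequency condition can be checked directly from the contingency table. The counts below are hypothetical; the sketch computes expected counts under independence, the test statistic, and the degrees of freedom, but not the p-value.

```python
# Hypothetical counts: rows are satisfaction levels, columns are regions.
observed = [
    [30, 25, 20, 25],   # satisfied
    [15, 20, 15, 10],   # neutral
    [10, 15, 10, 5],    # dissatisfied
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Rule-of-thumb check: expected counts of at least 5.
all_expected_ok = all(e >= 5 for row in expected for e in row)

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
```

Note that the check applies to expected counts, not observed ones: the cell with an observed count of 5 still has an expected count of 8.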


87

MCQeasy

A dataset contains customer records with a column for 'Phone Number' that should be unique. However, the analyst finds several duplicate phone numbers. Which data quality dimension is primarily affected?

A.Completeness

B.Accuracy

C.Uniqueness

D.Consistency

AnswerC

Correct: duplicates violate uniqueness.

Why this answer

Uniqueness refers to the expectation that each record or attribute value should be unique. Duplicate phone numbers violate uniqueness.


88

MCQmedium

A marketing team runs an A/B test on email subject lines. The p-value is 0.03 with α = 0.05. Which of the following is the correct interpretation?

A.The result is not statistically significant at the 95% confidence level.

B.The probability that the null hypothesis is true is 3%.

C.Fail to reject the null hypothesis; no significant difference.

D.Reject the null hypothesis; there is a statistically significant difference.

AnswerD

p < α provides evidence against the null.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.


89

MCQhard

In logistic regression, the output is a probability between 0 and 1. If the predicted probability for a customer churning is 0.7 and the decision threshold is 0.5, what is the predicted class?

A.Not churn (class 0)

B.Churn (class 1)

C.Both classes equally likely

D.Uncertain, need more data

AnswerB

Probability above threshold predicts the positive class.

Why this answer

Since 0.7 > 0.5, the predicted class is churn (usually coded as 1).
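
The thresholding step is simple enough to sketch directly. The function names are illustrative; the sigmoid is included only to show where the probability comes from.

```python
from math import exp


def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))


def predict_class(probability, threshold=0.5):
    """Return 1 (churn) when the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


predicted = predict_class(0.7)                  # the case in the question
stricter = predict_class(0.7, threshold=0.8)    # same customer, higher bar
```

The predicted class depends on the threshold as well as the probability: the same 0.7 becomes class 0 if the threshold is raised to 0.8.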


90

Multi-Selectmedium

Which TWO of the following are true about Pearson correlation coefficient (r)?

Select 2 answers

A.An r of 0 means no relationship exists

B.It ranges from 0 to 1

C.It measures the strength and direction of a linear relationship

D.A value of +1 indicates a perfect positive linear relationship

E.It can be used for categorical variables

AnswersC, D

Correct.

Why this answer

Pearson r ranges from -1 to 1, measuring linear relationship; +1 indicates perfect positive linear correlation.
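
A short sketch shows why option A is false: r can be 0 even when one variable is an exact function of the other, as long as the relationship is not linear. The helper below is a plain implementation of the textbook formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [-2, -1, 0, 1, 2]
r_linear = pearson_r(xs, [2 * x + 1 for x in xs])     # perfect positive line
r_negative = pearson_r(xs, [-3 * x for x in xs])      # perfect negative line
r_curve = pearson_r(xs, [x ** 2 for x in xs])         # exact but non-linear
```

Here `r_curve` is 0 although y = x² is a perfect (non-linear) relationship.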


91

MCQmedium

A data analyst is analyzing customer purchase amounts. The dataset contains several extreme high values due to luxury purchases. Which measure of central tendency is most robust to these outliers?

A.Range

B.Mean

C.Mode

D.Median

AnswerD

The median is robust to outliers.

Why this answer

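The median depends only on the middle of the ordered data, so a few extreme values barely move it, making it robust to outliers. Range (option A) is a measure of spread, not of central tendency.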
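
A quick check with made-up purchase amounts: adding one luxury purchase shifts the mean by hundreds but the median by only a couple of units.

```python
from statistics import mean, median

# Hypothetical purchase amounts; the last value is a luxury purchase.
typical = [40, 45, 50, 55, 60]
with_outlier = typical + [5000]

mean_shift = mean(with_outlier) - mean(typical)        # 875 - 50
median_shift = median(with_outlier) - median(typical)  # 52.5 - 50
```

The same effect explains why a mean well above the median points to a right-skewed distribution.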


92

MCQmedium

A financial analyst wants to compare the mean annual returns of three different investment strategies. Which statistical test is most appropriate?

A.Chi-square test

B.Paired t-test

C.One-way ANOVA

D.Two-sample t-test

AnswerC

ANOVA can compare means of three or more independent groups.

Why this answer

ANOVA is used to compare means of three or more groups.
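
As a sketch of what one-way ANOVA computes, the function below returns the F statistic as the ratio of between-group to within-group mean squares. It stops short of the p-value, which needs the F distribution; the data are hypothetical.

```python
def one_way_anova_f(*groups):
    """Return the F statistic and its degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group variation: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group variation: spread of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical annual returns (%) for three strategies.
f_stat, df_b, df_w = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
```

A single F test across all groups avoids the inflated Type I error rate that comes from running several pairwise t-tests.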


93

MCQmedium

A data analyst is preparing data for a k-nearest neighbors algorithm. The features include age (0-100) and income (0-200,000). Which technique should be applied to ensure the distance metric is not dominated by income?

A.Min-max normalization

B.Log transformation

C.Z-score standardization

D.One-hot encoding

AnswerA

Correct: min-max normalization scales to [0,1], preventing features with larger ranges from dominating.

Why this answer

Min-max normalization scales features to a 0-1 range, ensuring each feature contributes equally to distance calculations.


94

MCQmedium

A data analyst is working with a dataset that includes a column 'income' with values ranging from 20,000 to 150,000. To standardize this variable for a linear regression that assumes normally distributed residuals, which method should be used?

A.Log transformation

B.Min-max normalization

C.Square root transformation

D.Z-score standardization

AnswerD

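Correct: Z-score standardization centers the variable at 0 and scales it to unit variance.

Why this answer

Z-score standardization transforms data to have mean 0 and standard deviation 1, which is what "standardize" refers to here. It rescales the variable without changing the shape of its distribution, so it does not by itself make the data or the residuals normal; log and square-root transformations change the shape, and min-max normalization rescales to a fixed range instead.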


95

MCQeasy

Which data cleaning method involves replacing a missing value with the average of the available values in that column?

A.Mean imputation

B.Interpolation

C.Listwise deletion

D.Forward-fill

AnswerA

Mean imputation uses column average.

Why this answer

Mean imputation replaces missing values with the column mean.


96

MCQhard

A data analyst is performing a multiple linear regression with three predictors. The model output shows an R-squared of 0.85 and an adjusted R-squared of 0.80. Which of the following is the best interpretation of the difference between these two values?

A.The model is overfitted, so all predictors should be removed

B.The model has high multicollinearity

C.The residuals are not normally distributed

D.One or more predictors may not be contributing meaningfully

AnswerD

The drop from R-squared to adjusted R-squared suggests that some predictors add little explanatory power.

Why this answer

Adjusted R-squared penalizes for adding predictors that do not improve the model significantly; a gap suggests some predictors may be irrelevant or the sample size is small.
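
The penalty can be made concrete with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The question does not give the sample size; n = 13 below is an assumption chosen because it reproduces the 0.85 to 0.80 gap with three predictors.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# n_obs = 13 is an assumption chosen to reproduce the 0.85 -> 0.80 gap.
small_sample = adjusted_r_squared(0.85, n_obs=13, n_predictors=3)
large_sample = adjusted_r_squared(0.85, n_obs=1000, n_predictors=3)
```

With 1,000 observations the same R² and predictor count give an adjusted value very close to 0.85, so a sizeable gap reflects the number of predictors relative to the sample size.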


97

MCQmedium

A marketing team uses K-means clustering to segment customers based on purchase history. To determine the optimal number of clusters, they plot the within-cluster sum of squares (WCSS) against k and look for an elbow. What is the purpose of this method?

A.To find the point where the rate of decrease in WCSS slows down

B.To identify the value of k that minimizes WCSS

C.To determine the initial centroids for the algorithm

D.To ensure all clusters have equal size

AnswerA

Correct description of the elbow method.

Why this answer

The elbow method helps choose k where adding more clusters yields diminishing returns in reducing variance.
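
The elbow is usually read off a plot, but the idea can be sketched numerically: find the k where the drop in WCSS shrinks the most. The WCSS values below are invented for illustration, and the "largest bend" rule is one simple heuristic, not a standard definition.

```python
# Hypothetical WCSS values for k = 1..6 (not computed from real data).
wcss = {1: 1000, 2: 700, 3: 450, 4: 200, 5: 185, 6: 175}


def elbow(wcss_by_k):
    """Pick the k where the drop in WCSS slows down the most."""
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wcss_by_k[prev_k] - wcss_by_k[k]
        drop_after = wcss_by_k[k] - wcss_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


best_k = elbow(wcss)
```

WCSS keeps falling as k grows, so simply minimizing it would always pick the largest k; that is why option B is wrong.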


98

MCQmedium

An analyst wants to compare the mean sales revenue across three different store regions. The data is normally distributed and variances are equal. Which statistical test is most appropriate?

A.Two-sample t-test

B.ANOVA

C.Paired t-test

D.Chi-square test

AnswerB

ANOVA is appropriate for three groups.

Why this answer

ANOVA (Analysis of Variance) is used to compare means of three or more groups.


99

Multi-Selectmedium

A researcher is designing an A/B test to compare two website layouts. Which TWO elements are essential for determining the required sample size?

Select 2 answers

A.Sample mean

B.Statistical power

C.Confidence interval width

D.Desired effect size

E.P-value

AnswersB, D

Power affects the probability of detecting an effect.

Why this answer

Statistical power and desired effect size are key inputs for sample size calculation.


100

MCQhard

A data analyst is cleaning a dataset and finds that some records have duplicate entries based on customer ID. Which data quality dimension is most directly affected by these duplicates?

A.Timeliness

B.Consistency

C.Accuracy

D.Uniqueness

AnswerD

Duplicates directly impact uniqueness.

Why this answer

Duplicates violate the uniqueness dimension, which requires each entity to be represented only once.


101

MCQhard

In time series decomposition, a pattern that repeats at regular intervals (e.g., weekly, yearly) is called:

A.Cyclical

B.Irregular

C.Trend

D.Seasonality

AnswerD

Seasonality has fixed and known periods.

Why this answer

Seasonality refers to regular, periodic patterns in time series data.


102

MCQmedium

A simple linear regression model predicts sales (y) from advertising spend (x). The equation is y = 2.5x + 10, and R² = 0.81. Which interpretation is correct?

A.The correlation between sales and advertising is 0.81.

B.When advertising is $0, sales are $2.5.

C.81% of the variation in sales is explained by advertising spend.

D.For every $1 increase in advertising, sales increase by $10 on average.

AnswerC

R² = 0.81 means 81% explained.

Why this answer

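The slope means each $1 increase in advertising raises predicted sales by $2.5 on average, and the intercept means predicted sales are $10 when advertising is $0. R² of 0.81 means 81% of the variance in y is explained by x; the correlation is r = √0.81 = 0.9 (positive, because the slope is positive), not 0.81.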


103

MCQmedium

In a time series analysis, a retail analyst observes consistent peaks in sales every December and troughs every February. This pattern repeats annually. Which component of time series does this represent?

A.Irregular

B.Seasonality

C.Trend

D.Cyclical

AnswerB

Seasonality is predictable and repeats over fixed intervals.

Why this answer

Seasonality refers to regular patterns that repeat over fixed periods, such as months or quarters.
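
One simple way to surface a seasonal pattern is to remove each year's average level and then average what is left by calendar month. The sketch below does this on a synthetic series with a built-in December bump and February dip; it is a rough illustration, not a full decomposition.

```python
from collections import defaultdict

# Hypothetical monthly sales over three years: a mild upward trend plus a
# December bump and a February dip.
SEASONAL = {12: 50, 2: -30}
sales = [
    (year, month, 100 + 12 * year + month + SEASONAL.get(month, 0))
    for year in range(3)
    for month in range(1, 13)
]

# Remove each year's average level, then average the remainder by month.
by_year = defaultdict(list)
for year, _, value in sales:
    by_year[year].append(value)
year_mean = {year: sum(v) / len(v) for year, v in by_year.items()}

by_month = defaultdict(list)
for year, month, value in sales:
    by_month[month].append(value - year_mean[year])
seasonal_index = {m: sum(v) / len(v) for m, v in by_month.items()}

peak_month = max(seasonal_index, key=seasonal_index.get)
trough_month = min(seasonal_index, key=seasonal_index.get)
```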


104

Multi-Selectmedium

A dataset contains outliers in a feature that will be used for linear regression. Which two outlier treatment methods are appropriate? (Choose TWO)

Select 2 answers

A.Cap the outliers at a percentile (e.g., 99th percentile)

B.Use min-max normalization

C.Increase the sample size

D.Remove the outlier rows

E.Replace outliers with the mean

AnswersA, D

Capping limits extreme values.

Why this answer

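Capping outliers at a percentile and removing the outlier rows both reduce their influence on the fitted line. Min-max normalization only rescales the feature and keeps the outliers' relative position, and a larger sample does not remove them.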


105

MCQmedium

A dataset contains a feature 'Age' with values ranging from 18 to 95. To prepare data for a k-nearest neighbors algorithm, which transformation should be applied to 'Age'?

A.Z-score standardization

B.Min-max normalization

C.No transformation needed

D.Log transformation

AnswerB

Min-max normalization ensures all features contribute equally to distance calculations.

Why this answer

Min-max normalization scales features to a fixed range (e.g., 0-1), which is appropriate for distance-based algorithms like k-NN.


106

MCQmedium

A data analyst wants to compare the average revenue per customer between two marketing campaigns (A and B). The analyst is unsure if the data follows a normal distribution. Which statistical test is most appropriate for comparing the means of the two groups?

A.Two-sample t-test

B.Pearson correlation

C.Chi-square test

D.ANOVA

AnswerA

The two-sample t-test compares means of two independent groups.

Why this answer

For comparing means of two independent groups, the t-test is the standard parametric test. If normality is violated, a non-parametric alternative like Mann-Whitney U could be used, but the t-test is robust for moderate sample sizes.


107

MCQhard

A data analyst has a time series of monthly sales data. They observe that sales are consistently higher every December and lower every January. Which component of time series does this pattern represent?

A.Irregular

B.Cyclical

C.Seasonality

D.Trend

AnswerC

Seasonality refers to fixed periodic patterns within a year.

Why this answer

Regular patterns that repeat within one year are seasonality.


108

MCQmedium

An analyst is conducting an A/B test to compare two website designs. The null hypothesis is that there is no difference in conversion rates. The p-value obtained is 0.03, and the significance threshold is 0.05. What should the analyst conclude?

A.Reject the null hypothesis; there is a significant difference.

B.Accept the alternative hypothesis that the new design is better.

C.The test is inconclusive; need a larger sample size.

D.Fail to reject the null hypothesis; there is no significant difference.

AnswerA

Correct: p < α, reject null.

Why this answer

Since p-value (0.03) < α (0.05), we reject the null hypothesis, indicating a statistically significant difference.


109

MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following best describes how these measures are used in descriptive statistics?

A.To identify outliers using standard deviation

B.To test hypotheses about population parameters

C.To describe the central tendency of the data

D.To determine the probability of an event

AnswerC

Mean, median, and mode are measures of central tendency.

Why this answer

Descriptive statistics summarize data using measures like mean, median, and mode to describe central tendency.


110

MCQmedium

A stock analyst is analyzing monthly sales data for a retail company and observes a consistent pattern of high sales every December. This pattern is most likely an example of which time series component?

A.Irregular

B.Cyclical

C.Seasonality

D.Trend

AnswerC

Correct: regular pattern within a fixed period.

Why this answer

Seasonality refers to regular, predictable patterns that repeat at fixed intervals (e.g., yearly, monthly). The consistent December peak indicates a seasonal pattern.


111

MCQeasy

In simple linear regression, the coefficient of determination R² measures:

A.The probability that the slope is zero

B.The slope of the regression line

C.The proportion of variance in the dependent variable explained by the independent variable

D.The strength and direction of the linear relationship

AnswerC

Correct interpretation of R².

Why this answer

R² indicates the proportion of variance in the dependent variable explained by the independent variable.


112

MCQeasy

In a regression analysis, the coefficient of determination (R²) is 0.85. How should this value be interpreted?

A.85% of the data points lie on the regression line

B.The slope of the regression line is 0.85

C.85% of the variance in the dependent variable is explained by the model

D.85% of the independent variables are significant

AnswerC

Correct interpretation of R².

Why this answer

R² represents the proportion of variance in the dependent variable that is explained by the independent variable(s). An R² of 0.85 means the model explains 85% of the variability.


113

Multi-Selectmedium

A retail company wants to segment its customers based on purchase history. Which THREE methods are appropriate for customer segmentation?

Select 3 answers

A.RFM analysis

B.Linear regression

C.K-means clustering

D.t-test

E.Hierarchical clustering

AnswersA, C, E

Segments based on recency, frequency, monetary value.

Why this answer

K-means clustering, hierarchical clustering, and RFM analysis are common segmentation techniques. Linear regression and t-test are not segmentation methods.


114

Multi-Selectmedium

An analyst is preparing data for an A/B test and wants to ensure valid results. Which TWO of the following should be considered when calculating the required sample size?

Select 2 answers

A.Data dimensionality

B.Desired effect size

C.Skewness of data

D.Number of features

E.Statistical power

AnswersB, E

Correct: effect size is a key input.

Why this answer

Sample size calculation depends on desired effect size and statistical power, among other factors like significance level.


115

MCQeasy

A data analyst wants to compare the means of three different training methods on employee productivity. Which statistical test is most appropriate?

A.Correlation analysis

B.ANOVA

C.Chi-square test

D.t-test

AnswerB

ANOVA compares means across multiple groups.

Why this answer

ANOVA (Analysis of Variance) is used to compare means of three or more groups.


116

Multi-Selectmedium

An analyst is planning an A/B test to compare two website designs. Which TWO factors should be considered when calculating the required sample size?

Select 2 answers

A.Data type of the outcome variable

B.Desired effect size

C.Statistical power

D.Color scheme of the designs

E.Number of missing values

AnswersB, C

Correct.

Why this answer

Statistical power and desired effect size are key inputs for sample size calculations.


117

MCQhard

A data analyst is cleaning a dataset with missing values in a time series of daily temperatures. The missing values occur sporadically. Which imputation method is most appropriate to maintain the temporal trend?

A.Forward-fill

B.Mean imputation

C.Median imputation

D.Interpolation

AnswerD

Correct: uses neighboring values to estimate missing points, preserving trend.

Why this answer

Interpolation estimates missing values by using surrounding data points and is suitable for time series with a trend. Forward-fill carries the last observation forward, which may not capture trend well. Mean imputation ignores order.
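
The difference between the two methods is easy to see on a short series. The sketch below implements forward-fill and linear interpolation by hand on made-up temperatures; linear interpolation is one of several interpolation schemes and is assumed here for simplicity.

```python
def forward_fill(values):
    """Replace each None with the most recent observed value."""
    filled, last = [], None
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


def interpolate(values):
    """Fill interior runs of None on a straight line between their neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


# Hypothetical daily temperatures with two missing readings.
temps = [10.0, None, None, 16.0, 15.0]
ffilled = forward_fill(temps)
interpolated = interpolate(temps)
```

Forward-fill holds the series flat at 10.0 and then jumps to 16.0, while interpolation follows the rise through 12.0 and 14.0.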


118

MCQmedium

A data analyst is reviewing a dataset containing house prices. The mean price is $350,000 and the median is $280,000. Which of the following best describes the distribution of house prices?

A.The distribution is right-skewed.

B.The distribution is symmetric.

C.The distribution is left-skewed.

D.The distribution is bimodal.

AnswerA

Correct: Mean > median indicates right skew.

Why this answer

When the mean is greater than the median, the distribution is right-skewed (positively skewed) because higher values pull the mean upward.


119

Multi-Selectmedium

Which TWO of the following are appropriate uses of min-max normalisation?

Select 2 answers

A.Transforming data to have mean 0 and standard deviation 1

B.Scaling features to a range of 0 to 1

C.Preparing data for linear regression with normally distributed residuals

D.Preparing data for k-nearest neighbours algorithm

E.Handling missing values

AnswersB, D

Correct: Min-max normalisation scales to [0,1].

Why this answer

Min-max normalisation scales data to a fixed range (often 0-1), useful for distance-based algorithms like k-NN and neural networks. Standardisation (Z-score) is better for algorithms assuming Gaussian distribution.


120

MCQeasy

Which data quality dimension ensures that data represents the real-world scenario correctly and without errors?

A.Completeness

B.Consistency

C.Accuracy

D.Timeliness

AnswerC

Accuracy is about correctness and error-free data.

Why this answer

Accuracy means the data correctly reflects reality.


121

MCQhard

In A/B testing, which factor is increased by having a larger sample size?

A.P-value

B.Effect size

C.Type I error rate

D.Statistical power

AnswerD

Power increases with sample size.

Why this answer

Larger sample size increases statistical power (ability to detect a true effect).
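
The relationship can be sketched with a normal approximation for a two-proportion test. The conversion rates and sample sizes are hypothetical, and the formula ignores the negligible opposite-tail probability.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-proportion z-test."""
    se = sqrt(
        (p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
        / n_per_group
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_treatment - p_control) / se
    # Chance of landing beyond the critical value when the effect is real
    # (the negligible opposite-tail term is ignored).
    return NormalDist().cdf(z_effect - z_crit)


# Hypothetical rates: 10% control vs 12% treatment, at three sample sizes.
powers = [approx_power(0.10, 0.12, n) for n in (500, 2000, 8000)]
```

The true effect size and α stay fixed across the three runs; only the standard error shrinks as n grows, which is what raises power.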


122

MCQeasy

Which data quality dimension is violated if a customer record has a missing phone number?

A.Consistency

B.Accuracy

C.Completeness

D.Validity

AnswerC

Completeness measures missing values.

Why this answer

Completeness refers to the extent to which data is not missing.


123

Multi-Selectmedium

An analyst wants to compare the average sales revenue across three different store locations. Which TWO statistical methods are appropriate for this comparison?

Select 2 answers

A.Two-sample t-test

B.ANOVA

C.Multiple regression

D.Descriptive statistics

E.Chi-square test

AnswersB, C

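Correct: ANOVA compares means across three or more groups, and multiple regression with dummy-coded location variables tests the same hypothesis.

Why this answer

ANOVA compares means of three or more groups. Multiple regression with indicator (dummy) variables for store location is equivalent to one-way ANOVA, so it is also appropriate. A two-sample t-test compares only two groups, and the chi-square test assesses independence between categorical variables. Descriptive statistics summarise the data but don't compare multiple groups inferentially.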


124

MCQeasy

In a simple linear regression model y = 2.5 + 1.2x, what is the predicted value of y when x = 10?

A.12.0

B.13.7

C.14.5

D.10.0

AnswerC

Correct calculation.

Why this answer

Plug x=10: y = 2.5 + 1.2*10 = 2.5 + 12 = 14.5.


125

Multi-Selectmedium

A data analyst wants to segment customers based on purchasing behavior such as frequency, monetary value, and recency. Which TWO clustering evaluation methods can help determine the optimal number of clusters? (Select two.)

Select 2 answers

A.Correlation coefficient

B.ANOVA

C.Silhouette score

D.t-test

E.Elbow method

AnswersC, E

The silhouette score measures how similar an object is to its own cluster compared with other clusters.

Why this answer

The elbow method uses within-cluster sum of squares, and the silhouette score measures cohesion and separation. Both help choose k. The correlation coefficient measures association, not clustering quality, and ANOVA and the t-test are for hypothesis testing.


126

MCQhard

A data analyst is asked to compare the average sales across three different store locations. The data is normally distributed and variances are approximately equal. Which statistical test is most appropriate?

A.ANOVA

B.Pearson correlation

C.Chi-square test

D.Two-sample t-test

AnswerA

ANOVA is designed for comparing means of 3+ groups.

Why this answer

ANOVA is used to compare means of three or more groups when assumptions of normality and equal variance are met.


127

MCQhard

An analyst is performing a logistic regression to predict customer churn (yes/no). The model outputs a probability of 0.75 for a particular customer. Which of the following best describes the interpretation?

A.The model predicts that the customer will not churn

B.There is a 75% chance that the customer will churn

C.The customer will definitely churn because the probability is above 0.5

D.The odds of churning are 0.75 to 1

AnswerB

Correct interpretation of logistic regression output.

Why this answer

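Logistic regression outputs the estimated probability that the event (churn) occurs, given the input features. A probability of 0.75 corresponds to odds of 0.75 / 0.25 = 3 to 1, not 0.75 to 1, and a probability above 0.5 does not make churn certain.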
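
The distinction between probability and odds is a one-line conversion, sketched here with an illustrative helper.

```python
from math import log


def odds(probability):
    """Convert a probability into odds in favour of the event."""
    return probability / (1 - probability)


churn_odds = odds(0.75)            # 0.75 / 0.25
churn_log_odds = log(churn_odds)   # the scale on which logistic regression is linear
```

A probability of 0.75 therefore means odds of 3 to 1; odds of 0.75 to 1 would correspond to a probability of about 0.43.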


128

Multi-Selectmedium

An analyst is conducting an A/B test on a new checkout process. To calculate sample size, which THREE factors must be considered?

Select 3 answers

A.Number of control groups

B.Desired effect size

C.Significance level (alpha)

D.Statistical power

E.Population standard deviation

AnswersB, C, D

Effect size is the minimum practical difference to detect.

Why this answer

Statistical power, significance level (alpha), and desired effect size (minimum detectable effect) are essential for sample size calculation.


129

MCQeasy

Which statistical test should be used to determine if there is a significant association between two categorical variables, such as gender and product preference?

A.ANOVA

B.Chi-square test

C.Pearson correlation

D.t-test

AnswerB

Correct test for categorical variables.

Why this answer

The chi-square test of independence is used to test association between two categorical variables.


130

MCQhard

A data scientist runs a linear regression model to predict customer spending based on income. The R-squared value is 0.45 and the p-value for the slope coefficient is 0.03. At a significance level of α=0.05, which of the following conclusions is correct?

A.The slope is not statistically significant, and the model explains 55% of the variance.

B.The slope is statistically significant, and the model explains 45% of the variance.

C.The slope is statistically significant, and the model explains 55% of the variance.

D.The slope is not statistically significant, and the model explains 45% of the variance.

AnswerB

Correct: p<0.05 indicates significance; R²=0.45 indicates explained variance.

Why this answer

The p-value (0.03) is less than α (0.05), so the slope is statistically significant. R²=0.45 means the model explains 45% of the variance.


131

MCQeasy

Which data quality dimension is most concerned with whether data values fall within a defined domain or acceptable range?

A.Completeness

B.Consistency

C.Validity

D.Accuracy

AnswerC

Validity checks if data follows format and range rules.

Why this answer

Validity refers to whether data values conform to defined rules or constraints.


132

MCQhard

A data scientist applies K-means clustering to a customer dataset for several values of k. When the within-cluster sum of squares (WCSS) is plotted against k, the elbow appears at k=4. What does this indicate?

A.Increasing k beyond 4 would not significantly reduce WCSS.

B.The data naturally forms 4 clusters with no noise.

C.The algorithm converged to a local minimum.

D.The model has overfit the data.

AnswerA

The elbow point is where the rate of decrease sharply changes.

Why this answer

The elbow method suggests that increasing k beyond 4 yields diminishing returns in reducing WCSS; k=4 is a good trade-off.

133

MCQmedium

A data analyst is examining the relationship between advertising spend (in thousands) and sales (in thousands). The Pearson correlation coefficient is computed as r = -0.85. Which of the following interpretations is correct?

A.There is no linear relationship.

B.There is a strong positive linear relationship between advertising spend and sales.

C.There is a weak negative linear relationship.

D.There is a strong negative linear relationship.

AnswerD

Correct: r close to -1 indicates a strong negative linear relationship.

Why this answer

Pearson r measures linear correlation: r = -0.85 indicates a strong negative linear relationship (as one variable increases, the other tends to decrease). The magnitude |r| = 0.85 is close to 1, so the relationship is strong; the sign gives the direction.
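
As a rough illustration, Pearson r can be computed from its definition and then read for sign and magnitude. The data are made up, and the 0.7 and 0.3 cut-offs in `describe` are a common rule of thumb assumed here, not a fixed standard.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Label strength by |r| (assumed thresholds) and direction by sign."""
    strength = "strong" if abs(r) >= 0.7 else "moderate" if abs(r) >= 0.3 else "weak"
    direction = "negative" if r < 0 else "positive"
    return f"{strength} {direction}"


# Hypothetical data in which y falls as x rises.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 3]
print(describe(pearson_r(x, y)))
print(describe(-0.85))
```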

134

MCQeasy

A data analyst is cleaning a dataset and finds that the 'age' column has several missing values. Which of the following is a valid method for handling missing numerical data?

A.Delete the entire column

B.Ignore the missing values

C.Impute with the mean

D.Replace with zeros

AnswerC

Correct: mean imputation is a standard technique.

Why this answer

Mean imputation is a common method for handling missing numerical data, though median or mode can also be used.

135

MCQeasy

A data analyst is summarizing the central tendency of a dataset with extreme outliers. Which measure is most robust to outliers?

A.Standard deviation

B.Median

C.Mean

D.Range

AnswerB

Median is robust to outliers.

Why this answer

The median depends only on the middle of the ordered data, so extreme values barely move it, unlike the mean.

136

MCQmedium

A data analyst is cleaning a dataset and finds that a numeric field has several missing values. The variable is normally distributed. Which imputation method is most appropriate?

A.Median imputation

B.Mean imputation

C.Mode imputation

D.Forward-fill

AnswerB

Mean is appropriate for symmetric distributions.

Why this answer

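For normally distributed data, the mean is a representative central value because the distribution is symmetric. Mean imputation is common and preserves the mean of the column, although it reduces the variance.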
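
A minimal sketch of mean imputation, assuming missing values are represented as `None` (the function name and sample values are illustrative):

```python
from statistics import mean, pvariance


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [20, None, 30, 40, None]
filled = impute_mean(ages)
print(filled)
print(mean(filled), pvariance(filled))
```

In this sketch the mean of the filled column equals the mean of the observed values, while the variance is smaller than that of the observed values alone, because every imputed point sits exactly at the centre.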

137

MCQeasy

A data analyst calculates the mean, median, and mode of a dataset. Which of the following measures of central tendency is least affected by extreme outliers?

A.Median

B.Range

C.Mode

D.Mean

AnswerA

The median is not affected by extreme values.

Why this answer

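The median is resistant to outliers because it is the middle value of the ordered data, whereas the mean is pulled toward extreme values. The mode is also largely unaffected by outliers, but it is a less useful summary for continuous data. Range is a measure of spread, not of central tendency.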
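
The difference is easy to see on a small made-up sample in which one value is replaced by an extreme outlier:

```python
from statistics import mean, median

# Hypothetical data: identical except for the last value.
base = [10, 12, 13, 14, 15]
with_outlier = [10, 12, 13, 14, 1000]

print(mean(base), median(base))
print(mean(with_outlier), median(with_outlier))
```

The median stays at the same middle value in both lists, while the mean is dragged far above every typical observation.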

138

MCQhard

A data analyst uses the elbow method to determine the number of clusters for k-means. The plot shows a sharp bend at k=3 and a small bend at k=5. What is the recommended number of clusters?

A.5

B.The method is inconclusive.

C.2

D.3

AnswerD

The sharp bend suggests 3 clusters.

Why this answer

The elbow method suggests choosing the k beyond which the decrease in inertia (WCSS) becomes marginal. The sharp bend at k=3 marks that point; the smaller bend at k=5 adds comparatively little.
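
One simple way to locate the elbow numerically is to look for the largest second difference in WCSS. This is only one heuristic among several, and the WCSS values below are invented to mimic a sharp bend at k=3 and a small one at k=5.

```python
def elbow_k(wcss_by_k):
    """Return the k with the largest second difference in WCSS.

    Assumes wcss_by_k maps consecutive integer k values to WCSS.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = wcss_by_k[prev_k] - wcss_by_k[k]
        drop_after = wcss_by_k[k] - wcss_by_k[next_k]
        bend = drop_before - drop_after  # large when the curve flattens at k
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical WCSS values, not taken from any real dataset.
wcss = {1: 1000, 2: 600, 3: 250, 4: 220, 5: 160, 6: 150}
print(elbow_k(wcss))
```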

139

MCQmedium

An analyst is performing a linear regression and obtains an R-squared value of 0.85. Which of the following is the best interpretation?

A.85% of the residuals are zero.

B.85% of the data points lie on the regression line.

C.There is an 85% chance that the relationship is causal.

D.The model explains 85% of the variability in the dependent variable.

AnswerD

This is the correct interpretation of R-squared.

Why this answer

R-squared indicates the proportion of variance in the dependent variable explained by the independent variable(s). 0.85 means 85% explained.
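
The definition can be sketched directly as one minus the ratio of residual to total sum of squares. The actual and predicted values below are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in actual)                # total
    return 1 - ss_res / ss_tot


# Hypothetical observations and model predictions.
actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 37, 50]
print(r_squared(actual, predicted))
```

A value of 1 means the predictions match every observation; it says nothing about how many individual points lie on the line or whether the relationship is causal.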

140

MCQhard

A data scientist is building a model to predict customer churn (yes/no). After training a logistic regression model, the coefficient for 'monthly charges' is 0.05 with a p-value of 0.03. Which interpretation is correct at α=0.05?

A.The model's R-squared is 0.05.

B.For every unit increase in monthly charges, the odds of churn increase by about 5%.

C.Monthly charges decrease the probability of churn.

D.Monthly charges have no significant effect on churn.

AnswerB

The coefficient 0.05 in logistic regression represents log-odds; exp(0.05)≈1.05, a 5% increase in odds.

Why this answer

The p-value < 0.05 indicates a statistically significant relationship; the positive coefficient means higher charges increase the log-odds of churn.
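
The conversion from a log-odds coefficient to a percentage change in odds can be sketched as follows (the helper names are illustrative):

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in odds for a `delta`-unit increase in the predictor."""
    return math.exp(coefficient * delta)


def percent_change_in_odds(coefficient, delta=1.0):
    return (odds_ratio(coefficient, delta) - 1) * 100


print(percent_change_in_odds(0.05))       # one-unit increase in monthly charges
print(percent_change_in_odds(0.05, 10))   # ten-unit increase compounds multiplicatively
```

The "about 5%" reading applies to the odds, not to the probability of churn, and it compounds rather than adds over larger increases.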

141

Multi-Selecthard

A data analyst is performing K-means clustering on customer data. Which THREE of the following are steps in the K-means algorithm?

Select 3 answers

A.Perform eigenvalue decomposition.

B.Calculate the correlation matrix.

C.Initialize k centroids randomly.

D.Update centroids by computing the mean of all points assigned to each centroid.

E.Assign each data point to the nearest centroid.

AnswersC, D, E

Correct: initialization, assignment, and centroid update are the core steps of K-means.

Why this answer

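K-means involves initializing centroids, assigning each point to its nearest centroid, and updating each centroid as the mean of its assigned points; the assignment and update steps repeat until the centroids stop changing. Eigenvalue decomposition and the correlation matrix belong to techniques such as PCA, not K-means.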
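
The three steps can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the fixed seed, the iteration cap, and the toy points are all assumptions, and an empty cluster simply keeps its previous centroid.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize k centroids by picking k distinct data points at random.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        # Step 2: assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```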

142

MCQeasy

A dataset contains a column 'Age' with values: [22, 25, 25, 30, 35, 40, 45]. What is the interquartile range (IQR)?

A.15

B.10

C.20

D.25

AnswerA

Correct: IQR = Q3 - Q1 = 40 - 25 = 15.

Why this answer

The median of the ordered data is 30. Q1 is the median of the lower half (22, 25, 25) = 25; Q3 is the median of the upper half (35, 40, 45) = 40; IQR = 40 - 25 = 15.
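
The median-of-halves calculation used above can be sketched as follows. Note that this is one quartile convention among several; interpolation-based methods in common libraries can return a different Q3 for the same data. The function name is illustrative.

```python
from statistics import median


def iqr_median_of_halves(values):
    """Quartiles as medians of the lower and upper halves.

    For an odd number of values the overall median is excluded from both halves.
    Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1


ages = [22, 25, 25, 30, 35, 40, 45]
print(iqr_median_of_halves(ages))
```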

143

MCQmedium

A data analyst is conducting an A/B test on a website's landing page. The null hypothesis is that there is no difference in conversion rates between the control and treatment groups. After collecting data, the analyst calculates a p-value of 0.03. Using a significance level of α = 0.05, what is the correct conclusion?

A.Accept the null hypothesis; the difference is due to chance.

B.Reject the null hypothesis; the treatment group has a higher conversion rate.

C.Fail to reject the null hypothesis; there is no evidence of a difference.

D.The result is inconclusive because the p-value is close to 0.05.

AnswerB

The p-value is below α, so the difference is statistically significant; which group converts better must be read from the observed rates.

Why this answer

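Since p < α, the null hypothesis is rejected, indicating a statistically significant difference in conversion rates. Option B is the only choice that rejects the null; its claim that the treatment rate is higher additionally assumes the observed treatment rate exceeds the control rate.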
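
One common way to obtain such a p-value is a two-proportion z-test, sketched here with the standard library. The question does not say which test was used, and the visitor and conversion counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control = A, treatment = B.
z, p_value = two_proportion_test(200, 2000, 245, 2000)
alpha = 0.05
reject_null = p_value < alpha
print(z, p_value, reject_null)
```

The sign of `z` carries the direction: a positive value here means the treatment group's observed rate is higher than the control's.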

144

MCQmedium

A data scientist is preparing data for a K-means clustering algorithm. The dataset contains features measured in different units (e.g., income in dollars and age in years). Which preprocessing step is most critical before running K-means?

A.Remove outliers

B.Encode categorical variables

C.Standardize or normalize the features

D.Perform feature selection

AnswerC

Scaling ensures equal weighting; both min-max and Z-score are common.

Why this answer

K-means is sensitive to the scale of features because it uses Euclidean distance. Min-max normalization or standardization ensures all features contribute equally.
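
Both scaling approaches mentioned can be sketched in a few lines. The income and age values are made up, and the population standard deviation is an assumed choice for the z-score.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Min-max normalization onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical features on very different scales.
incomes = [30000, 45000, 60000, 120000]
ages = [22, 35, 41, 58]
print(standardize(incomes))
print(min_max(ages))
```

Without scaling, a difference of a few thousand dollars in income would dominate the Euclidean distance and age would have almost no influence on the clusters.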

145

MCQmedium

In a logistic regression model predicting customer churn (1 = churn, 0 = not churn), the coefficient for 'contract length' is -0.5. Which of the following is the correct interpretation?

A.For each unit increase in contract length, the log-odds of churn decrease by 0.5.

B.Longer contract length increases the odds of churn.

C.The probability of churn decreases by 50% for each unit increase in contract length.

D.Contract length is not a significant predictor.

AnswerA

Correct interpretation of logistic regression coefficient.

Why this answer

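In logistic regression, coefficients represent the change in log-odds per unit of the predictor. A negative coefficient decreases the log-odds, meaning a lower probability of churn. Equivalently, each additional unit multiplies the odds by exp(-0.5) ≈ 0.61; the change in probability is not a fixed 50% because it depends on the starting probability.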
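
A short sketch shows why option C is wrong: the same -0.5 shift in log-odds changes the probability by different amounts depending on the baseline. The baseline log-odds values are arbitrary illustrations.

```python
import math


def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))


def churn_probability_shift(baseline_log_odds, coefficient=-0.5):
    """Probability of churn before and after a one-unit increase in contract length."""
    before = sigmoid(baseline_log_odds)
    after = sigmoid(baseline_log_odds + coefficient)
    return before, after


for baseline in (0.0, 2.0):  # assumed starting log-odds
    before, after = churn_probability_shift(baseline)
    print(baseline, round(before, 4), round(after, 4))
```

In both cases the odds are multiplied by the same factor, exp(-0.5), even though the drop in probability differs.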

146

Multi-Selectmedium

A data analyst is preparing a dataset for analysis and needs to handle outliers. Which TWO of the following are common methods for treating outliers?

Select 2 answers

A.Removal

B.Capping

C.Normalization

D.Imputation

E.Standardization

AnswersA, B

Removal and capping both act directly on the extreme values.

Why this answer

Capping (winsorizing) limits extreme values to a chosen bound, and removal deletes the outlier rows. Transformation (e.g., log) can also reduce their impact but is not listed here. Normalization and standardization only rescale the data, and imputation is meant for missing values, so none of these is a primary outlier treatment.
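
Both treatments can be sketched with explicit bounds. The sales figures and the 0 to 50 limits are hypothetical; in practice the bounds would come from a rule such as percentiles or IQR fences.

```python
def cap(values, lower, upper):
    """Capping: pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


def remove_outliers(values, lower, upper):
    """Removal: drop values outside [lower, upper]."""
    return [v for v in values if lower <= v <= upper]


# Made-up data with one extreme value.
sales = [12, 15, 14, 13, 400]
print(cap(sales, 0, 50))
print(remove_outliers(sales, 0, 50))
```

Capping keeps the row count unchanged, while removal shrinks the dataset.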

147

MCQeasy

A data analyst calculates the mean, median, and mode of a sales dataset and finds they are all equal. Which type of distribution does this indicate?

A.Normal distribution

B.Skewed right

C.Bimodal distribution

D.Skewed left

AnswerA

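A normal distribution has equal mean, median, and mode.

Why this answer

Equal mean, median, and mode are characteristic of a symmetric, unimodal distribution, and of the options listed the normal distribution is the one that fits. Skewed distributions pull the mean away from the median, and a bimodal distribution has two modes.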
