CCNA Analyzing Modeling Data Questions — Page 2 of 2

Multi-Select (hard)

A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)

A. Correcting inconsistent data

B. Normalization

C. Handling missing values

D. Feature engineering

E. Removing duplicate records

Answers: A, C, E

Standardizing formats and fixing typos are cleaning tasks.

Why this answer

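Correcting inconsistent data (Option A) is a core data cleaning step because it ensures that values follow a consistent format, such as standardizing date formats (e.g., 'MM/DD/YYYY' vs 'DD-MM-YYYY') or fixing capitalization (e.g., 'USA' vs 'usa'). This directly addresses data quality issues that arise from human entry errors or system differences, making the dataset reliable for analysis. Handling missing values (Option C) and removing duplicate records (Option E) are cleaning steps for the same reason: they restore completeness and uniqueness. Normalization (Option B) and feature engineering (Option D) transform data that is already clean.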
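
The three cleaning steps can be sketched in a few lines of standard-library Python. The records, field names, and the choice of the median as a fill value are assumptions made for illustration only; they do not come from the question.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; field names and values are illustrative only.
raw = [
    {"id": 1, "country": "USA", "signup": "03/15/2024", "spend": 120.0},
    {"id": 2, "country": "usa ", "signup": "16-03-2024", "spend": None},
    {"id": 1, "country": "USA", "signup": "03/15/2024", "spend": 120.0},  # duplicate
    {"id": 3, "country": "Canada", "signup": "03/17/2024", "spend": 80.0},
]

# Assumed input formats: MM/DD/YYYY and DD-MM-YYYY.
DATE_FORMATS = ("%m/%d/%Y", "%d-%m-%Y")


def to_iso(text):
    """Parse a date in any of the assumed formats and return YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean(records):
    # Step 1 (Option A): correct inconsistent formats.
    standardized = [
        {**r, "country": r["country"].strip().upper(), "signup": to_iso(r["signup"])}
        for r in records
    ]
    # Step 2 (Option E): remove exact duplicates, keeping the first occurrence.
    seen, unique = set(), []
    for r in standardized:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # Step 3 (Option C): fill missing spend with the median of known values.
    fill = median(r["spend"] for r in unique if r["spend"] is not None)
    return [{**r, "spend": fill if r["spend"] is None else r["spend"]} for r in unique]


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Duplicates are removed before the fill value is computed so that a repeated row does not count twice toward the median.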

Exam trap

The trap is confusing data cleaning with data transformation or feature engineering, which leads candidates to select normalization or feature engineering as cleaning steps. Cleaning strictly addresses data quality issues such as consistency, completeness, and uniqueness.

MCQ (easy)

A data analyst is building a linear regression model to predict sales based on advertising spend. The analyst notices that the residuals are not normally distributed and have a non‑constant variance. Which of the following transformations is most appropriate to apply to the dependent variable?

A. Standardization (z-score)

B. Normalization (min-max scaling)

C. Logarithmic transformation

D. Square root transformation

Answer: C

Log transformation is commonly used to stabilize variance and make residuals more normally distributed.

Why this answer

The logarithmic transformation is the most appropriate choice because it stabilizes non-constant variance (heteroscedasticity) and helps make the residuals more normally distributed; constant variance and normally distributed residuals are both assumptions of linear regression. By compressing the scale of the dependent variable (sales), it reduces the impact of large values, and it turns multiplicative effects into additive ones, which suits data whose spread grows with its level.
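
As a rough illustration of the variance-stabilizing effect, the sketch below simulates sales with multiplicative noise and compares how the residual spread changes across the range of spend, before and after taking logs. The data-generating formula, the seed, and the helper names `fit_line` and `spread_ratio` are assumptions for this example, not part of the question.

```python
import math
import random
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def spread_ratio(xs, ys):
    """Residual std. dev. in the upper half of x divided by that in the lower half.

    Assumes xs is sorted in ascending order. A ratio near 1 suggests
    roughly constant variance; a ratio far above 1 suggests the spread
    grows with x.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    half = len(xs) // 2
    return pstdev(residuals[half:]) / pstdev(residuals[:half])


# Simulated data (assumption): noise multiplies the sales level.
rng = random.Random(0)
spend = list(range(1, 201))
sales = [math.exp(0.02 * x + rng.gauss(0, 0.3)) for x in spend]

raw_ratio = spread_ratio(spend, sales)
log_ratio = spread_ratio(spend, [math.log(y) for y in sales])

print(f"spread ratio, raw sales: {raw_ratio:.2f}")
print(f"spread ratio, log sales: {log_ratio:.2f}")
```

On data generated this way, the raw ratio should be well above 1 while the log ratio should sit close to 1. Real sales data may behave differently, so residual plots remain the more reliable check.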

Exam trap

CompTIA often tests the misconception that any scaling technique (standardization or normalization) can fix heteroscedasticity or non‑normality, but these methods only change the range or center of the data, not the shape of the residual distribution or the variance structure.

How to eliminate wrong answers

Option A is wrong because standardization (z-score) centers and scales the data to mean 0 and standard deviation 1, but it does not address heteroscedasticity or non-normal residuals; it merely changes the units of the dependent variable without altering the shape of the distribution.

Option B is wrong because normalization (min-max scaling) rescales the data to a fixed range (e.g., 0 to 1), which also fails to correct non-constant variance or non-normality; it is primarily used for feature scaling in algorithms like neural networks, not for satisfying regression assumptions.

Option D is wrong because the square root transformation is typically used for count data (e.g., Poisson-distributed outcomes) to stabilize variance, but it is less effective than the log transformation when the variance increases proportionally with the mean, which is common in sales data; the log transformation is the standard choice for multiplicative relationships and heteroscedasticity.
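
The claim that scaling leaves the shape of a distribution untouched can be checked numerically. The sketch below measures skewness on a small, made-up right-skewed sample before and after each transformation; it looks at the variable itself rather than regression residuals, and the sample values and the `skewness` helper are assumptions for illustration.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])


# Hypothetical right-skewed sample.
data = [1, 2, 2, 3, 3, 3, 4, 10, 25, 60]

m, s = mean(data), pstdev(data)
lo, hi = min(data), max(data)

zscores = [(v - m) / s for v in data]          # Option A
minmax = [(v - lo) / (hi - lo) for v in data]  # Option B
logged = [math.log(v) for v in data]           # Option C

print(f"raw:     {skewness(data):.3f}")
print(f"z-score: {skewness(zscores):.3f}")
print(f"min-max: {skewness(minmax):.3f}")
print(f"log:     {skewness(logged):.3f}")
```

Standardization and min-max scaling are linear rescalings, so the skewness they report matches the raw data; only the log transform changes it.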

MCQ (hard)

A data analyst trains a complex model that achieves 99% accuracy on training data but only 65% on new data. What is the most likely issue?

A. Underfitting

B. Overfitting

C. Multicollinearity

D. High bias

Answer: B

The model performs well on training but poorly on test data, a classic sign of overfitting.

Why this answer

The model performs exceptionally well on training data (99% accuracy) but poorly on new data (65% accuracy), which is the classic symptom of overfitting. Overfitting occurs when the model learns noise and specific patterns in the training data rather than generalizing to unseen data, often due to excessive complexity (e.g., too many parameters or deep layers). This results in high variance and poor performance on validation or test sets.
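
A minimal way to reproduce this pattern is a 1-nearest-neighbour classifier, which memorizes every training point, including the noisy ones. The sketch below is an assumption-laden toy: the data generator, the 30% label-noise rate, and the function names are invented for illustration and are not the model from the question.

```python
import random


def nearest_label(point, train_x, train_y):
    """1-nearest-neighbour prediction: copy the label of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: (train_x[i] - point) ** 2)
    return train_y[best]


def accuracy(xs, ys, train_x, train_y):
    """Share of points in (xs, ys) that the memorized training set predicts correctly."""
    hits = sum(nearest_label(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def make_data(rng, n, noise=0.3):
    """True rule: label is 1 when x > 0.5; a fraction `noise` of labels is flipped."""
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < noise) for x in xs]
    return xs, ys


rng = random.Random(42)
train_x, train_y = make_data(rng, 200)
test_x, test_y = make_data(rng, 200)

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(test_x, test_y, train_x, train_y)

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Each training point is its own nearest neighbour, so training accuracy is perfect, while the memorized noise does not carry over to new data. With this amount of label noise, a simple threshold rule at 0.5 would be expected to score around 70% on both sets, which is the lower-variance alternative.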

Exam trap

CompTIA often tests the distinction between overfitting and underfitting by presenting a large gap between training and test accuracy, tempting candidates to choose high bias or multicollinearity due to confusion about bias-variance tradeoff or correlation issues.

How to eliminate wrong answers

Option A is wrong because underfitting would show poor performance on both training and new data (e.g., low accuracy on both), not high training accuracy with low test accuracy.

Option C is wrong because multicollinearity refers to high correlation among predictor variables in regression models, which inflates coefficient standard errors but does not directly cause a large gap between training and test accuracy.

Option D is wrong because high bias typically leads to underfitting, where the model is too simple and performs poorly on both training and test data, not the specific pattern of high training accuracy and low test accuracy seen here.

Multi-Select (medium)

A data analyst is performing hypothesis testing to compare the mean sales of two store locations. Which TWO conditions must be satisfied to use a two‑sample t‑test? (Select TWO.)

A. The data is paired between the two locations

B. The sample sizes are equal

C. The data is approximately normally distributed

D. The variances of the two populations are equal

E. The two samples are independent of each other

Answers: C, E

Normality is assumed for the t-test, though it is robust for large samples.

Why this answer

Option C is correct because the two-sample t-test assumes that the data in each group are approximately normally distributed. This is a key parametric assumption: with large samples (a common rule of thumb is n > 30 per group), the Central Limit Theorem relaxes the requirement, but with smaller samples normality must hold for the test statistic and p-value to be valid. Option E is correct because the test treats the two groups as independent samples; paired observations (Option A) call for a paired t-test instead.
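
To show that neither equal sample sizes nor equal variances are needed, the sketch below computes Welch's version of the two-sample t statistic by hand. The sales figures and the function name `welch_t` are hypothetical, and the code stops at the statistic and degrees of freedom rather than a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Assumes the two samples are independent and each is roughly normal.
    Sample sizes and variances may differ.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    v_a = variance(sample_a) / n_a  # squared standard error of mean A
    v_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical daily sales (in thousands) for two store locations.
store_a = [1.0, 2.0, 3.0, 4.0, 5.0]
store_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_stat, dof = welch_t(store_a, store_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

If SciPy is available, `scipy.stats.ttest_ind(store_a, store_b, equal_var=False)` performs the same test and also returns a p-value.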

Exam trap

CompTIA often tests the misconception that equal sample sizes or equal variances are required for a two-sample t-test, but the actual core assumptions are independence and normality (or large sample sizes via CLT).

Drag & Drop (medium)

Drag and drop the steps to normalize a database table from 1NF to 3NF in the correct order.

Why this order

Normalization proceeds from 1NF to 2NF to 3NF, then table creation and foreign keys.

Multi-Select (easy)

Which TWO of the following are true about correlation and causation? (Select TWO).

A. Correlation measures both linear and nonlinear relationships

B. Causation can always be inferred from a controlled experiment without randomization

C. Correlation does not imply causation

D. If two variables are highly correlated, one must cause the other

E. A statistically significant correlation may still be due to chance or confounding variables

Answers: C, E

Correlation does not imply causation, and even a statistically significant correlation can arise from chance or confounding.

Why this answer

Option C is correct because correlation measures the strength and direction of a linear relationship between two variables, but it does not show that one variable causes the other. Establishing causation generally requires a controlled experiment with randomization, which rules out confounding variables. Option E is correct for the same reason: a correlation can be statistically significant and still arise from chance or from a confounder that drives both variables.
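
The point that Pearson correlation captures only linear association (which is why Option A is false) can be seen with a tiny example. The sketch below computes the coefficient by hand for a perfectly linear relationship and for a perfectly determined but curved one; the data points and the `pearson_r` helper are chosen for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]  # y depends on x linearly
curved = [x ** 2 for x in xs]     # y depends on x exactly, but not linearly

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)

print(f"r for y = 2x + 1: {r_linear:.3f}")
print(f"r for y = x^2:    {r_curved:.3f}")
```

The curved case has a coefficient of zero even though y is fully determined by x, so a low correlation does not rule out a relationship, just as a high one does not establish a cause.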

Exam trap

CompTIA often tests the classic 'correlation does not imply causation' fallacy, where candidates mistakenly think that a statistically significant correlation automatically proves a causal relationship, ignoring the role of chance and confounding variables.

MCQ (hard)

An analyst is fitting a polynomial regression model and wants to choose the degree that minimizes overfitting. Which technique should the analyst use?

A. Lasso regression (L1)

B. Principal component analysis (PCA)

C. Stepwise selection

D. Ridge regression (L2)

Answer: D

Ridge regression penalizes large coefficients, which is effective for reducing overfitting in polynomial models without removing features.

Why this answer

Ridge regression (L2) adds a penalty proportional to the square of the magnitude of coefficients, which shrinks them toward zero but does not eliminate them. This regularization reduces variance and helps prevent overfitting in polynomial regression by controlling the influence of higher-degree terms, making it the correct technique for minimizing overfitting while retaining all features.
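
The shrinkage effect can be sketched with the closed-form ridge solution in NumPy. The simulated data, the polynomial degree, and the penalty strength `lam=1.0` are assumptions chosen for illustration; for simplicity this version also penalizes the intercept, which a production implementation usually would not.

```python
import numpy as np


def polynomial_features(x, degree):
    """Design matrix with columns x^0, x^1, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)


def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y.

    Simplification (assumption): the penalty applies to every
    coefficient, including the intercept.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


# Simulated noisy data (assumption).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

X = polynomial_features(x, degree=9)
w_ols = ridge_fit(X, y, lam=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, lam=1.0)  # L2 penalty shrinks the coefficients

print("coefficient norm without penalty:", np.linalg.norm(w_ols))
print("coefficient norm with L2 penalty:", np.linalg.norm(w_ridge))
```

The penalized fit keeps all ten polynomial terms but with a smaller overall coefficient norm, which is the "shrink but do not eliminate" behaviour described above. In practice the penalty strength would be chosen by cross-validation.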

Exam trap

The trap here is that candidates often confuse Lasso (L1) with Ridge (L2), mistakenly thinking Lasso's coefficient elimination is always better for overfitting, when in fact Ridge's smooth shrinkage is more appropriate for polynomial models where all degrees should be retained but controlled.

How to eliminate wrong answers

Option A is wrong because Lasso regression (L1) performs feature selection by shrinking some coefficients exactly to zero, which is more suited for sparse models rather than simply minimizing overfitting in a polynomial context where all degrees may be needed.

Option B is wrong because Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms features into uncorrelated components, but it does not directly address overfitting in polynomial regression and can lose interpretability of the polynomial terms.

Option C is wrong because stepwise selection is a variable selection method that adds or removes predictors based on statistical criteria (e.g., AIC, p-values), but it can be unstable and does not inherently regularize coefficients to combat overfitting as effectively as ridge regression.

MCQ (medium)

A data scientist is building a predictive model to forecast monthly sales. The data shows a linear trend with no seasonality. Which regression technique is most appropriate?

A. Polynomial regression

B. Logistic regression

C. Linear regression

D. Ridge regression

Answer: C

Linear regression directly models a linear relationship between independent and dependent variables.

Why this answer

Linear regression is the most appropriate technique because the data shows a linear trend with no seasonality, making a straight-line model the simplest and most effective fit. It directly models the relationship between the independent variable (e.g., time) and the dependent variable (monthly sales) using a linear equation, minimizing the sum of squared residuals.
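
For a single predictor, the least-squares line has a simple closed form. The sketch below fits it to six months of made-up sales figures and forecasts the seventh month; the numbers and the names `fit_line` and `forecast_month_7` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


# Hypothetical monthly sales with a roughly linear trend.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 138, 161, 179, 202, 220]

intercept, slope = fit_line(months, sales)
forecast_month_7 = intercept + slope * 7

print(f"sales = {intercept:.1f} + {slope:.2f} * month")
print(f"forecast for month 7: {forecast_month_7:.1f}")
```

The slope is the estimated increase in sales per month, which is the quantity the straight-line model is meant to capture when there is no seasonality to account for.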

Exam trap

The trap here is that candidates often confuse 'linear trend' with 'linear in parameters' and incorrectly choose polynomial regression, thinking it adds flexibility, when the question explicitly states no seasonality and a linear trend, making simple linear regression the optimal choice.

How to eliminate wrong answers

Option A is wrong because polynomial regression introduces higher-degree terms (e.g., x², x³) to model curvature, which is unnecessary and risks overfitting when the trend is explicitly linear.

Option B is wrong because logistic regression is used for binary classification problems (e.g., predicting yes/no outcomes), not for forecasting continuous numeric values like monthly sales.

Option D is wrong because ridge regression is a regularization technique designed to handle multicollinearity or overfitting by adding an L2 penalty, but it is not a distinct regression type for linear trends and would be overkill when a simple linear model suffices.

MCQ (medium)

A healthcare analytics team is analyzing patient readmission rates. They have a dataset with thousands of records including patient age, diagnosis, length of stay, number of prior admissions, and discharge date. The goal is to identify key factors influencing readmission and create a model to predict high-risk patients. The data is imbalanced: only 5% of patients are readmitted within 30 days. The team plans to use logistic regression. What is the most appropriate approach?

A. Use the dataset as is because logistic regression handles imbalance

B. Remove most of the non-readmitted patients to balance the dataset

C. Use accuracy as the evaluation metric

D. Apply oversampling techniques like SMOTE to the training set

Answer: D

Oversampling balances the classes, improving model performance on the minority class.

Why this answer

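With imbalanced data, logistic regression can be biased toward the majority class. Oversampling the minority class (e.g., with SMOTE) on the training set only helps the model learn the patterns associated with readmission while leaving the evaluation data at its real class distribution.

The other options fall short: using the data as is often fails to predict the minority class, removing majority samples discards valuable data, and accuracy is misleading because a model that never predicts readmission would still score about 95%.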
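
Two of these points can be made concrete with a short sketch: why accuracy misleads at a 5% readmission rate, and what balancing a training set looks like. Note that the code uses plain random oversampling (duplicating minority rows), a simpler stand-in for SMOTE, which instead synthesizes new minority points by interpolation. The patient counts, the integer row placeholders, and the function names are assumptions for illustration.

```python
import random


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual positives (readmissions) that the model catches."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Assumes label 1 is the minority class. Apply to the training split only.
    """
    minority = [i for i, label in enumerate(labels) if label == 1]
    majority = [i for i, label in enumerate(labels) if label == 0]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# 1,000 hypothetical training patients, 5% readmitted within 30 days.
labels = [1] * 50 + [0] * 950
rows = list(range(len(labels)))  # stand-in for real feature rows

# A "model" that never predicts readmission.
always_negative = [0] * len(labels)
baseline_accuracy = accuracy(labels, always_negative)
baseline_recall = recall(labels, always_negative)
print(f"accuracy: {baseline_accuracy:.2f}, recall: {baseline_recall:.2f}")

balanced_rows, balanced_labels = random_oversample(rows, labels, random.Random(0))
print("class counts after oversampling:",
      balanced_labels.count(0), balanced_labels.count(1))
```

The baseline looks strong on accuracy yet identifies no high-risk patients, which is why recall, precision, or a similar minority-focused metric is the more informative choice here.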
