Practice DA0-001 Analyzing and Modeling Data questions with full explanations on every answer.
Analyzing and Modeling Data — choose a session length
Free · No account required
Click any question to see the full explanation and answer options, or start a focused practice session above.
1. A data analyst needs to identify the most frequently occurring value in a dataset. Which measure of central tendency should they use?
2. A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?
3. A data analyst is building a model to predict customer churn. The dataset has 10,000 records with 500 churned customers. The model predicts churn with 95% accuracy, but only identifies 10% of actual churners. Which metric best highlights this issue?
4. A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?
5. A data analyst is performing a hypothesis test with a significance level of 0.05. The p-value obtained is 0.03. What should the analyst conclude?
6. A data scientist trains a regression model and observes high variance with low bias. Which technique is most appropriate to reduce variance?
7. A data analyst is cleaning a dataset and finds missing values in a categorical variable representing customer region. Which imputation method is most appropriate?
8. A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?
9. A company is analyzing customer feedback sentiment. The dataset is highly imbalanced with 95% positive and 5% negative comments. Which technique should the analyst use to address class imbalance before modeling?
10. Which TWO of the following are common assumptions of linear regression?
11. Which THREE of the following are appropriate methods to handle outliers in a dataset?
12. Which TWO of the following are examples of supervised learning algorithms?
13. A healthcare analytics team is building a predictive model to identify patients at high risk of readmission within 30 days of discharge. The dataset includes 50,000 patient records with 200 features, including demographics, vital signs, lab results, and historical admissions. The target variable is binary (readmitted or not). The team uses a logistic regression model and achieves an AUC of 0.72 on the test set. However, the model's calibration is poor: for patients predicted to have a 70% risk, the actual readmission rate is only 40%. The team wants to improve calibration without significantly reducing discrimination (AUC). The data scientist suggests applying Platt scaling. However, the team lead is concerned that Platt scaling may reduce the model's ability to rank patients correctly. Which of the following is the best course of action?
14. A data analyst at a marketing firm is tasked with segmenting customers based on their purchasing behavior. The dataset contains 10,000 customers with features such as annual spend, frequency of purchases, recency of last purchase, and average order value. The analyst decides to use k-means clustering. After standardizing the features, the analyst runs k-means with k=3, k=4, and k=5, and computes the silhouette score for each: k=3: 0.45, k=4: 0.52, k=5: 0.48. The analyst also plots the elbow curve and observes that the within-cluster sum of squares (WCSS) decreases sharply from k=2 to k=4, then levels off. Based on these results, what is the most appropriate number of clusters?
15. A data analyst is building a linear regression model to predict sales based on advertising spend across TV, radio, and newspaper channels. Which TWO diagnostics should the analyst perform to validate the model assumptions?
16. A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?
17. A healthcare analytics team is building a classification model to predict patient readmission within 30 days. The dataset contains 10,000 records with 30 features, including demographics, vital signs, lab results, and medication history. The target variable is imbalanced: 85% no readmission, 15% readmission. The team used logistic regression with default settings and achieved an accuracy of 85%, but the model predicted 'no readmission' for all patients. The lead analyst suspects the model is not learning due to class imbalance. The team has time to implement one corrective action before the next model review. Which action should the team take?
18. Drag and drop the steps to normalize a database table from 1NF to 3NF in the correct order.
19. Drag and drop the steps to implement a data classification policy in the correct order.
20. Match each data governance role to its responsibility.
21. Match each data sampling method to its description.
22. A data analyst is designing a data model for a sales data warehouse. The model should optimize query performance for aggregations by minimizing joins and duplicating data where necessary. Which schema design should the analyst use?
23. A data scientist is building a predictive model to forecast monthly sales. The data shows a linear trend with no seasonality. Which regression technique is most appropriate?
24. A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?
25. A marketing team wants to segment customers into distinct groups based on purchasing behavior. The data includes numeric features such as frequency, monetary value, and recency. Which unsupervised learning algorithm should be used?
26. A data analyst is preparing a dataset for a predictive model. The dataset contains a feature 'age' with values ranging from 18 to 80, and a feature 'income' ranging from 20,000 to 200,000. To ensure both features contribute equally to distance-based algorithms, which transformation should the analyst apply?
27. A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?
28. A data analyst needs to join two tables in a SQL database: Orders and Customers. The analyst wants to include all orders, even if there is no matching customer record. Which type of join should be used?
29. A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?
30. A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?
31. Which TWO of the following are dimensional modeling techniques commonly used in data warehouses?
32. Which THREE of the following are common steps in data cleaning?
33. Which TWO of the following are valid techniques for validating the performance of a predictive model?
34. Refer to the exhibit. Which clause is used to aggregate the data by department?
35. Refer to the exhibit. Which type of ensemble method is being used?
36. Refer to the exhibit. Which data quality dimension is being violated?
37. A data analyst is building a linear regression model to predict sales based on advertising spend. The analyst notices that the residuals are not normally distributed and have a non‑constant variance. Which of the following transformations is most appropriate to apply to the dependent variable?
38. A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?
39. A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?
40. During ETL, a data analyst discovers that a date column contains values like '01/02/2023' and '2023-01-02'. Which of the following is the best practice to ensure consistent date format before analysis?
41. A data analyst is reviewing a SQL query that joins three large tables. The query takes over an hour to run. The analyst notices that the WHERE clause filters on indexed columns in only two tables. Which of the following should the analyst do first to improve performance?
42. An analyst is fitting a polynomial regression model and wants to choose the degree that minimizes overfitting. Which technique should the analyst use?
43. A data analyst needs to create a visual that shows the distribution of customer ages across different regions. Which chart type is most appropriate?
44. A company has a dataset with 100 features. The data analyst wants to reduce dimensionality while preserving as much variance as possible. Which technique should be used?
45. After training a decision tree, the tree has depth 20 and 100% accuracy on training data but only 60% on test data. Which hyperparameter adjustment is most likely to improve generalization?
46. A data analyst is performing hypothesis testing to compare the mean sales of two store locations. Which TWO conditions must be satisfied to use a two‑sample t‑test? (Select TWO.)
47. A data scientist is cleaning a dataset and notices missing values in several columns. Which THREE techniques are appropriate for handling missing data? (Select THREE.)
48. Which THREE of the following are examples of descriptive statistics? (Select THREE.)
49. Refer to the exhibit. An analyst runs the following query: SELECT product_id, AVG(quantity) FROM sales GROUP BY product_id HAVING AVG(quantity) > 8; Which product_id(s) will be returned?
50. Refer to the exhibit. Before running the code, the original salary column had 50 missing values. The median was calculated as 52000. After imputation, which of the following statements is true?
51. Refer to the exhibit. A data analyst wants to grant read access to an entire S3 bucket named 'data-lake'. Which of the following best describes what this policy does?
52. A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?
53. A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?
54. A data scientist is analyzing a dataset with 100 variables and 5,000 records. The dataset has several missing values and a few extreme outliers. The goal is to build a regression model to predict a continuous target. Which combination of preprocessing steps is most likely to improve model performance?
55. During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?
56. A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?
57. After building a binary classification model, the data analyst obtains the following confusion matrix: True Positives=80, True Negatives=100, False Positives=20, False Negatives=30. What is the F1 score?
58. A dataset contains a column 'Income' with values in different scales (some in thousands, some in hundreds). What is the best way to standardize this column for use in a machine learning model?
59. A data analyst needs to determine whether the mean sales of two different regions are significantly different. The samples are independent and the data is normally distributed. Which statistical test should be used?
60. A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?
61. Which TWO of the following are commonly used techniques for handling missing data in a dataset? (Select TWO).
62. Which THREE of the following are assumptions of linear regression? (Select THREE).
63. Which TWO of the following are true about correlation and causation? (Select TWO).
64. The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?
65. Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?
66. A data analyst runs the Python code shown. What is the result of executing this code?
67. A data analyst needs to summarize customer satisfaction scores. The data contains a few extremely low scores that skew the distribution. Which measure of central tendency is most appropriate?
68. A retail company wants to predict sales based on advertising spend and season. Which data modeling technique should the analyst use?
69. A data analyst trains a complex model that achieves 99% accuracy on training data but only 65% on new data. What is the most likely issue?
70. After a marketing campaign, sales increased by 15%. The analyst wants to understand which customer segment contributed most to the increase. Which type of analysis is this?
71. In a dataset with variables on different scales (e.g., age in years and income in dollars), which preprocessing step is necessary before applying k-means clustering?
72. A data analyst uses linear regression to model the relationship between advertising spend and sales. The residual plot shows a clear U-shaped pattern. What assumption is violated?
73. A data analyst calculates a correlation coefficient of -0.85 between temperature and heating costs. What does this indicate?
74. A company wants to segment its customers into distinct groups based on purchasing behavior. Which algorithm is best suited for this task?
75. A data analyst is preparing to build a predictive model. Which TWO steps are essential to ensure model validity? (Choose two.)
76. In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)
77. A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)
78. You are a data analyst at a logistics company. The operations manager wants to reduce delivery delays. You have historical data including order date, delivery date, distance, weather conditions, and driver ID. Initial analysis shows that the average delivery time has increased over the past six months. You suspect that weather is a contributing factor, but you need to confirm. The company also wants to build a model to predict delivery times to better manage customer expectations. The data contains missing values for weather conditions in about 10% of records, and some driver IDs are incorrect. You have limited time and resources. What should you do first?
79. A healthcare analytics team is analyzing patient readmission rates. They have a dataset with thousands of records including patient age, diagnosis, length of stay, number of prior admissions, and discharge date. The goal is to identify key factors influencing readmission and create a model to predict high-risk patients. The data is imbalanced: only 5% of patients are readmitted within 30 days. The team plans to use logistic regression. What is the most appropriate approach?
80. A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?
81. A financial analyst is building a model to predict stock price movements. The data is time series with daily prices. The analyst wants to use a regression model but notices that the residuals are autocorrelated. What adjustment should be made?
82. A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)
83. A data analyst at a retail company is building a multiple linear regression model to forecast weekly sales. The dataset contains 50 predictor variables, including store size, promotional spend, holiday indicators, and many others. After training the model, the analyst observes an R-squared of 0.99 on the training set but only 0.55 on the holdout test set. Which action should the analyst take first to address this discrepancy?
84. A marketing analyst wants to segment customers based on their purchase history, including total spent, number of transactions, and average order value. The analyst runs k-means clustering with k=5 on the raw data but notices that the cluster assignments change significantly every time the algorithm is executed. What should the analyst do first to obtain consistent and meaningful clusters?
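Several of the questions above turn on confusion-matrix metrics (accuracy, precision, recall, F1). As a self-check aid, here is a minimal Python sketch of the standard definitions, applied to the counts given in the F1 question (True Positives=80, True Negatives=100, False Positives=20, False Negatives=30). The function name `classification_metrics` and its return layout are illustrative assumptions, not part of the question bank or of any library.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from raw confusion-matrix counts.

    Hypothetical helper for illustration; it guards against division by zero
    by returning 0.0 when a denominator is empty.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of everything predicted positive, how much was truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Counts taken from the confusion-matrix question in the list above.
    metrics = classification_metrics(tp=80, tn=100, fp=20, fn=30)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

The same helper can be pointed at the imbalanced churn scenarios in the list: with few positives, accuracy can stay high while recall or precision collapses, which is why those questions ask for a metric other than accuracy.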
The Analyzing and Modeling Data domain covers the key concepts tested in this area of the DA0-001 exam blueprint published by CompTIA. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all DA0-001 domains — no account required.
The Courseiva DA0-001 question bank contains 84 questions in the Analyzing and Modeling Data domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
The session launcher on this page draws questions exclusively from the Analyzing and Modeling Data domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.