How many Analyzing and Modeling Data questions are on the DA0-001 exam?

The Analyzing and Modeling Data domain is one of the weighted domains on the DA0-001 exam. The Courseiva question bank has 84 practice questions for this domain.

Free DA0-001 Analyzing and Modeling Data Practice Questions (2026)

Q: How can I practice Analyzing and Modeling Data questions for DA0-001?

Click any of the 84 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Analyzing and Modeling Data domain.

Practice Analyzing and Modeling Data questions

10Q 20Q 30Q 50Q

All DA0-001 Analyzing and Modeling Data questions (84)

Start session

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data analyst needs to identify the most frequently occurring value in a dataset. Which measure of central tendency should they use?

A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?

A data analyst is building a model to predict customer churn. The dataset has 10,000 records with 500 churned customers. The model predicts churn with 95% accuracy, but only identifies 10% of actual churners. Which metric best highlights this issue?

A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?

A data analyst is performing a hypothesis test with a significance level of 0.05. The p-value obtained is 0.03. What should the analyst conclude?

A data scientist trains a regression model and observes high variance with low bias. Which technique is most appropriate to reduce variance?

A data analyst is cleaning a dataset and finds missing values in a categorical variable representing customer region. Which imputation method is most appropriate?

A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?

A company is analyzing customer feedback sentiment. The dataset is highly imbalanced with 95% positive and 5% negative comments. Which technique should the analyst use to address class imbalance before modeling?

Which TWO of the following are common assumptions of linear regression?

Which THREE of the following are appropriate methods to handle outliers in a dataset?

Which TWO of the following are examples of supervised learning algorithms?

A healthcare analytics team is building a predictive model to identify patients at high risk of readmission within 30 days of discharge. The dataset includes 50,000 patient records with 200 features, including demographics, vital signs, lab results, and historical admissions. The target variable is binary (readmitted or not). The team uses a logistic regression model and achieves an AUC of 0.72 on the test set. However, the model's calibration is poor: for patients predicted to have a 70% risk, the actual readmission rate is only 40%. The team wants to improve calibration without significantly reducing discrimination (AUC). The data scientist suggests applying Platt scaling. However, the team lead is concerned that Platt scaling may reduce the model's ability to rank patients correctly. Which of the following is the best course of action?

A data analyst at a marketing firm is tasked with segmenting customers based on their purchasing behavior. The dataset contains 10,000 customers with features such as annual spend, frequency of purchases, recency of last purchase, and average order value. The analyst decides to use k-means clustering. After standardizing the features, the analyst runs k-means with k=3, k=4, and k=5, and computes the silhouette score for each: k=3: 0.45, k=4: 0.52, k=5: 0.48. The analyst also plots the elbow curve and observes that the within-cluster sum of squares (WCSS) decreases sharply from k=2 to k=4, then levels off. Based on these results, what is the most appropriate number of clusters?

A data analyst is building a linear regression model to predict sales based on advertising spend across TV, radio, and newspaper channels. Which TWO diagnostics should the analyst perform to validate the model assumptions?

A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?

A healthcare analytics team is building a classification model to predict patient readmission within 30 days. The dataset contains 10,000 records with 30 features, including demographics, vital signs, lab results, and medication history. The target variable is imbalanced: 85% no readmission, 15% readmission. The team used logistic regression with default settings and achieved an accuracy of 85%, but the model predicted 'no readmission' for all patients. The lead analyst suspects the model is not learning due to class imbalance. The team has time to implement one corrective action before the next model review. Which action should the team take?

Drag and drop the steps to normalize a database table from 1NF to 3NF in the correct order.

Drag and drop the steps to implement a data classification policy in the correct order.

Match each data governance role to its responsibility.

Match each data sampling method to its description.

A data analyst is designing a data model for a sales data warehouse. The model should optimize query performance for aggregations by minimizing joins and duplicating data where necessary. Which schema design should the analyst use?

A data scientist is building a predictive model to forecast monthly sales. The data shows a linear trend with no seasonality. Which regression technique is most appropriate?

A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?

A marketing team wants to segment customers into distinct groups based on purchasing behavior. The data includes numeric features such as frequency, monetary value, and recency. Which unsupervised learning algorithm should be used?

A data analyst is preparing a dataset for a predictive model. The dataset contains a feature 'age' with values ranging from 18 to 80, and a feature 'income' ranging from 20,000 to 200,000. To ensure both features contribute equally to distance-based algorithms, which transformation should the analyst apply?

A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?

A data analyst needs to join two tables in a SQL database: Orders and Customers. The analyst wants to include all orders, even if there is no matching customer record. Which type of join should be used?

A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?

A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?

Which TWO of the following are dimensional modeling techniques commonly used in data warehouses?

Which THREE of the following are common steps in data cleaning?

Which TWO of the following are valid techniques for validating the performance of a predictive model?

Refer to the exhibit. Which clause is used to aggregate the data by department?

Refer to the exhibit. Which type of ensemble method is being used?

Refer to the exhibit. Which data quality dimension is being violated?

A data analyst is building a linear regression model to predict sales based on advertising spend. The analyst notices that the residuals are not normally distributed and have a non‑constant variance. Which of the following transformations is most appropriate to apply to the dependent variable?

A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?

During ETL, a data analyst discovers that a date column contains values like '01/02/2023' and '2023-01-02'. Which of the following is the best practice to ensure consistent date format before analysis?

A data analyst is reviewing a SQL query that joins three large tables. The query takes over an hour to run. The analyst notices that the WHERE clause filters on indexed columns in only two tables. Which of the following should the analyst do first to improve performance?

An analyst is fitting a polynomial regression model and wants to choose the degree that minimizes overfitting. Which technique should the analyst use?

A data analyst needs to create a visual that shows the distribution of customer ages across different regions. Which chart type is most appropriate?

A company has a dataset with 100 features. The data analyst wants to reduce dimensionality while preserving as much variance as possible. Which technique should be used?

After training a decision tree, the tree has depth 20 and 100% accuracy on training data but only 60% on test data. Which hyperparameter adjustment is most likely to improve generalization?

A data analyst is performing hypothesis testing to compare the mean sales of two store locations. Which TWO conditions must be satisfied to use a two‑sample t‑test? (Select TWO.)

A data scientist is cleaning a dataset and notices missing values in several columns. Which THREE techniques are appropriate for handling missing data? (Select THREE.)

Which THREE of the following are examples of descriptive statistics? (Select THREE.)

Refer to the exhibit. An analyst runs the following query: SELECT product_id, AVG(quantity) FROM sales GROUP BY product_id HAVING AVG(quantity) > 8; Which product_id(s) will be returned?

Refer to the exhibit. Before running the code, the original salary column had 50 missing values. The median was calculated as 52000. After imputation, which of the following statements is true?

Refer to the exhibit. A data analyst wants to grant read access to an entire S3 bucket named 'data-lake'. Which of the following best describes what this policy does?

A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?

A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?

A data scientist is analyzing a dataset with 100 variables and 5,000 records. The dataset has several missing values and a few extreme outliers. The goal is to build a regression model to predict a continuous target. Which combination of preprocessing steps is most likely to improve model performance?

During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?

A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?

After building a binary classification model, the data analyst obtains the following confusion matrix: True Positives=80, True Negatives=100, False Positives=20, False Negatives=30. What is the F1 score?

A dataset contains a column 'Income' with values in different scales (some in thousands, some in hundreds). What is the best way to standardize this column for use in a machine learning model?

A data analyst needs to determine whether the mean sales of two different regions are significantly different. The samples are independent and the data is normally distributed. Which statistical test should be used?

A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?

Which TWO of the following are commonly used techniques for handling missing data in a dataset? (Select TWO).

Which THREE of the following are assumptions of linear regression? (Select THREE).

Which TWO of the following are true about correlation and causation? (Select TWO).

The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?

Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?

A data analyst runs the Python code shown. What is the result of executing this code?

A data analyst needs to summarize customer satisfaction scores. The data contains a few extremely low scores that skew the distribution. Which measure of central tendency is most appropriate?

A retail company wants to predict sales based on advertising spend and season. Which data modeling technique should the analyst use?

A data analyst trains a complex model that achieves 99% accuracy on training data but only 65% on new data. What is the most likely issue?

After a marketing campaign, sales increased by 15%. The analyst wants to understand which customer segment contributed most to the increase. Which type of analysis is this?

In a dataset with variables on different scales (e.g., age in years and income in dollars), which preprocessing step is necessary before applying k-means clustering?

A data analyst uses linear regression to model the relationship between advertising spend and sales. The residual plot shows a clear U-shaped pattern. What assumption is violated?

A data analyst calculates a correlation coefficient of -0.85 between temperature and heating costs. What does this indicate?

A company wants to segment its customers into distinct groups based on purchasing behavior. Which algorithm is best suited for this task?

A data analyst is preparing to build a predictive model. Which TWO steps are essential to ensure model validity? (Choose two.)

In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)

A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)

You are a data analyst at a logistics company. The operations manager wants to reduce delivery delays. You have historical data including order date, delivery date, distance, weather conditions, and driver ID. Initial analysis shows that the average delivery time has increased over the past six months. You suspect that weather is a contributing factor, but you need to confirm. The company also wants to build a model to predict delivery times to better manage customer expectations. The data contains missing values for weather conditions in about 10% of records, and some driver IDs are incorrect. You have limited time and resources. What should you do first?

A healthcare analytics team is analyzing patient readmission rates. They have a dataset with thousands of records including patient age, diagnosis, length of stay, number of prior admissions, and discharge date. The goal is to identify key factors influencing readmission and create a model to predict high-risk patients. The data is imbalanced: only 5% of patients are readmitted within 30 days. The team plans to use logistic regression. What is the most appropriate approach?

A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?

A financial analyst is building a model to predict stock price movements. The data is time series with daily prices. The analyst wants to use a regression model but notices that the residuals are autocorrelated. What adjustment should be made?

A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)

A data analyst at a retail company is building a multiple linear regression model to forecast weekly sales. The dataset contains 50 predictor variables, including store size, promotional spend, holiday indicators, and many others. After training the model, the analyst observes an R-squared of 0.99 on the training set but only 0.55 on the holdout test set. Which action should the analyst take first to address this discrepancy?

A marketing analyst wants to segment customers based on their purchase history, including total spent, number of transactions, and average order value. The analyst runs k-means clustering with k=5 on the raw data but notices that the cluster assignments change significantly every time the algorithm is executed. What should the analyst do first to obtain consistent and meaningful clusters?

Practice all 84 Analyzing and Modeling Data questions

Other DA0-001 exam domains

Comparing and Contrasting Data Concepts Mining and Acquiring Data Visualizing Data Communicating Data Insights

Frequently asked questions

What does the Analyzing and Modeling Data domain cover on the DA0-001 exam?

The Analyzing and Modeling Data domain covers the key concepts tested in this area of the DA0-001 exam blueprint published by CompTIA. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all DA0-001 domains — no account required.

How many Analyzing and Modeling Data questions are in the DA0-001 question bank?

The Courseiva DA0-001 question bank contains 84 questions in the Analyzing and Modeling Data domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Analyzing and Modeling Data for DA0-001?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Analyzing and Modeling Data questions for DA0-001?

Yes — the session launcher on this page draws questions exclusively from the Analyzing and Modeling Data domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.

Free forever · No credit card required

Track your DA0-001 domain progress

Save your results, see per-domain analytics, and get readiness scores — free, for every certification.

Free forever · Every certification included