DA0-001 Analyzing and Modeling Data — All Questions With Answers

Question 1 (easy, multiple choice)

A data analyst needs to identify the most frequently occurring value in a dataset. Which measure of central tendency should they use?

Question 2 (medium, multiple choice)

A retail company wants to predict future sales based on historical data. Which modeling approach is most appropriate if the data shows a clear seasonal pattern?

Question 3 (hard, multiple choice)

A data analyst is building a model to predict customer churn. The dataset has 10,000 records with 500 churned customers. The model predicts churn with 95% accuracy, but only identifies 10% of actual churners. Which metric best highlights this issue?
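
To make the scenario concrete, the sketch below reconstructs one confusion matrix that is consistent with the stated figures (10,000 records, 500 churners, 95% accuracy, 10% of churners found). The counts are derived from those numbers rather than given in the question, and the function name `churn_metrics` is hypothetical.

```python
def churn_metrics(total, positives, accuracy, recall):
    """Derive confusion-matrix counts from summary figures (illustrative only)."""
    tp = round(positives * recall)        # churners correctly flagged
    fn = positives - tp                   # churners the model missed
    correct = round(total * accuracy)     # all correct predictions
    tn = correct - tp                     # non-churners correctly cleared
    fp = (total - positives) - tn         # non-churners wrongly flagged
    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

m = churn_metrics(total=10_000, positives=500, accuracy=0.95, recall=0.10)
print(m)
```

Under these assumptions, 450 of the 500 churners are missed even though overall accuracy stays at 95%, which is the gap a recall-style metric exposes.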

Question 4 (easy, multiple choice)

A data analyst needs to combine two datasets that have the same columns but different rows. Which operation should they use?

Question 5 (medium, multiple choice)

A data analyst is performing a hypothesis test with a significance level of 0.05. The p-value obtained is 0.03. What should the analyst conclude?

Question 6 (hard, multiple choice)

A data scientist trains a regression model and observes high variance with low bias. Which technique is most appropriate to reduce variance?

Question 7 (easy, multiple choice)

A data analyst is cleaning a dataset and finds missing values in a categorical variable representing customer region. Which imputation method is most appropriate?

Question 8 (medium, multiple choice)

A data analyst needs to visualize the distribution of a continuous variable across different categories. Which chart type is most suitable?

Question 9 (hard, multiple choice)

A company is analyzing customer feedback sentiment. The dataset is highly imbalanced with 95% positive and 5% negative comments. Which technique should the analyst use to address class imbalance before modeling?

Question 10 (medium, multi select)

Which TWO of the following are common assumptions of linear regression?

Question 11 (hard, multi select)

Which THREE of the following are appropriate methods to handle outliers in a dataset?

Question 12 (easy, multi select)

Which TWO of the following are examples of supervised learning algorithms?

Question 13 (hard, multiple choice)

A healthcare analytics team is building a predictive model to identify patients at high risk of readmission within 30 days of discharge. The dataset includes 50,000 patient records with 200 features, including demographics, vital signs, lab results, and historical admissions. The target variable is binary (readmitted or not). The team uses a logistic regression model and achieves an AUC of 0.72 on the test set. However, the model's calibration is poor: for patients predicted to have a 70% risk, the actual readmission rate is only 40%. The team wants to improve calibration without significantly reducing discrimination (AUC). The data scientist suggests applying Platt scaling. However, the team lead is concerned that Platt scaling may reduce the model's ability to rank patients correctly. Which of the following is the best course of action?

Question 14 (medium, multiple choice)

A data analyst at a marketing firm is tasked with segmenting customers based on their purchasing behavior. The dataset contains 10,000 customers with features such as annual spend, frequency of purchases, recency of last purchase, and average order value. The analyst decides to use k-means clustering. After standardizing the features, the analyst runs k-means with k=3, k=4, and k=5, and computes the silhouette score for each: k=3: 0.45, k=4: 0.52, k=5: 0.48. The analyst also plots the elbow curve and observes that the within-cluster sum of squares (WCSS) decreases sharply from k=2 to k=4, then levels off. Based on these results, what is the most appropriate number of clusters?
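
As a small sketch of how the two diagnostics in this scenario can be compared, the snippet below picks the k with the highest silhouette score from the values quoted in the question. Only the three scores come from the question; the variable names are illustrative.

```python
# Silhouette scores quoted in the question, keyed by number of clusters.
silhouette = {3: 0.45, 4: 0.52, 5: 0.48}

# The silhouette score is "higher is better", so take the argmax.
best_k = max(silhouette, key=silhouette.get)
print(best_k)  # 4
```

The elbow described in the question (WCSS flattening after k=4) points to the same value, so the two heuristics agree here. In practice they can disagree, and the choice then depends on how the segments will be used.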

Question 15 (easy, multi select)

A data analyst is building a linear regression model to predict sales based on advertising spend across TV, radio, and newspaper channels. Which TWO diagnostics should the analyst perform to validate the model assumptions?

Question 16 (hard, multiple choice)

A data analyst is preparing a logistic regression model to predict customer churn. After examining the exhibit, which data quality issue should the analyst address first?

Exhibit (not included)

Question 17 (medium, multiple choice)

A healthcare analytics team is building a classification model to predict patient readmission within 30 days. The dataset contains 10,000 records with 30 features, including demographics, vital signs, lab results, and medication history. The target variable is imbalanced: 85% no readmission, 15% readmission. The team used logistic regression with default settings and achieved an accuracy of 85%, but the model predicted 'no readmission' for all patients. The lead analyst suspects the model is not learning due to class imbalance. The team has time to implement one corrective action before the next model review. Which action should the team take?
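
The behaviour described above can be reproduced with plain Python. This is a minimal sketch that uses synthetic labels matching the 85%/15% split; it shows why an all-negative predictor scores 85% accuracy while finding no readmissions, and computes inverse-frequency class weights as one common rebalancing heuristic. The weight formula `n / (n_classes * count)` is the one scikit-learn documents for `class_weight="balanced"`; the data itself is made up for illustration.

```python
# Synthetic labels: 1 = readmitted (15%), 0 = not readmitted (85%).
labels = [1] * 1_500 + [0] * 8_500

# A degenerate model that always predicts "no readmission".
preds = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
true_pos = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
recall = true_pos / sum(labels)

# Inverse-frequency class weights: rarer classes get larger weights.
n, n_classes = len(labels), 2
weights = {c: n / (n_classes * labels.count(c)) for c in (0, 1)}

print(accuracy, recall, weights)
```

With these weights, each readmission counts roughly 5.7 times as much as a non-readmission in a weighted loss, which is one way to push a model away from the all-negative solution.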

Question 18 (medium, drag order)

Drag and drop the steps to normalize a database table from 1NF to 3NF in the correct order.

Question 19 (medium, drag order)

Drag and drop the steps to implement a data classification policy in the correct order.

Question 20 (medium, matching)

Match each data governance role to its responsibility.

Ensures data quality and adherence to policies

Manages technical environment and data access

Has accountability for specific data assets

Sets strategic direction for data management

Designs data structures and integration processes

Question 21 (medium, matching)

Match each data sampling method to its description.

Each member has equal chance of selection

Population divided into subgroups; random sample from each

Randomly select entire groups (clusters)

Select every k-th element from a list

Sample based on ease of access
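
The five descriptions above can each be sketched in a few lines of standard-library Python. The population, group names, and function names below are hypothetical and chosen only to illustrate how the methods differ mechanically.

```python
import random

def simple_random(population, n, rng):
    """Every member has an equal chance of selection."""
    return rng.sample(population, n)

def stratified(groups, n_per_group, rng):
    """Draw a random sample from every subgroup (stratum)."""
    return {name: rng.sample(members, n_per_group)
            for name, members in groups.items()}

def cluster(groups, n_clusters, rng):
    """Randomly pick whole groups and keep all of their members."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return [m for name in chosen for m in groups[name]]

def systematic(population, k, start=0):
    """Take every k-th element of an ordered list."""
    return population[start::k]

def convenience(population, n):
    """Take whatever is easiest to reach, here simply the first n items."""
    return population[:n]

rng = random.Random(0)  # seeded so repeated runs give the same draws
population = list(range(100))
groups = {"north": population[:50], "south": population[50:]}

print(systematic(population, k=10))
print(len(simple_random(population, 10, rng)))
```

Note that only the first four involve a defined selection rule over the whole population; the convenience sample depends entirely on which items happen to be accessible.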

Question 22 (easy, multiple choice)

A data analyst is designing a data model for a sales data warehouse. The model should optimize query performance for aggregations by minimizing joins and duplicating data where necessary. Which schema design should the analyst use?

Question 23 (medium, multiple choice)

A data scientist is building a predictive model to forecast monthly sales. The data shows a linear trend with no seasonality. Which regression technique is most appropriate?

Question 24 (hard, multiple choice)

A data analyst is working with a dataset containing house prices. After building a multiple linear regression model, the analyst observes that the model performs well on training data but poorly on validation data. Which technique is most appropriate to address this issue?

Question 25 (easy, multiple choice)

A marketing team wants to segment customers into distinct groups based on purchasing behavior. The data includes numeric features such as frequency, monetary value, and recency. Which unsupervised learning algorithm should be used?

Question 26 (medium, multiple choice)

A data analyst is preparing a dataset for a predictive model. The dataset contains a feature 'age' with values ranging from 18 to 80, and a feature 'income' ranging from 20,000 to 200,000. To ensure both features contribute equally to distance-based algorithms, which transformation should the analyst apply?
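
To illustrate why scale matters for distance-based algorithms, here is a minimal sketch of two common rescaling transformations, min-max normalization and z-score standardization, applied to made-up `ages` and `incomes` lists that span the ranges given in the question. The sample values and function names are assumptions for illustration.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

ages = [18, 35, 52, 80]
incomes = [20_000, 60_000, 110_000, 200_000]

print(min_max(ages))
print(min_max(incomes))
```

After either transformation both features occupy comparable ranges, so a Euclidean distance is no longer dominated by the income column.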

Question 27 (hard, multiple choice)

A data analyst is building a binary classification model to predict customer churn. The dataset is imbalanced, with only 10% churners. The analyst wants to evaluate model performance with a focus on correctly identifying churners. Which metric is most appropriate?

Question 28 (easy, multiple choice)

A data analyst needs to join two tables in a SQL database: Orders and Customers. The analyst wants to include all orders, even if there is no matching customer record. Which type of join should be used?
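
The requirement in this question (keep every order, matched or not) can be tried out with Python's built-in `sqlite3` module. The table contents and column names below are invented for the sketch; only the table names `Orders` and `Customers` come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace');
    -- Order 102 points at a customer that does not exist.
    INSERT INTO Orders VALUES (100, 1), (101, 2), (102, 99);
""")

# Orders is the left table, so every order survives the join.
left_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM Orders AS o
    LEFT JOIN Customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

# An inner join, for contrast, drops the unmatched order.
inner_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM Orders AS o
    JOIN Customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

print(left_rows)
print(inner_rows)
```

The unmatched order appears in the left-join result with `None` (SQL `NULL`) in the customer column.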

Question 29 (medium, multiple choice)

A data analyst is analyzing survey responses where respondents rated satisfaction on a scale of 1-5. The analyst wants to visualize the distribution of responses. Which chart type is most appropriate?

Question 30 (hard, multiple choice)

A data scientist is tuning a decision tree model to prevent overfitting. The model currently has a high variance. Which hyperparameter adjustment is most effective?

Question 31 (easy, multi select)

Which TWO of the following are dimensional modeling techniques commonly used in data warehouses?

Question 32 (medium, multi select)

Which THREE of the following are common steps in data cleaning?

Question 33 (hard, multi select)

Which TWO of the following are valid techniques for validating the performance of a predictive model?

Question 34 (easy, multiple choice)

Refer to the exhibit. Which clause is used to aggregate the data by department?

Exhibit

SELECT department, COUNT(*) as employee_count FROM employees WHERE hire_year > 2020 GROUP BY department HAVING COUNT(*) > 5;
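
To see how the clauses in the exhibit interact, the query can be run against a small in-memory SQLite table. The rows below are fabricated for illustration; only the query text matches the exhibit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, hire_year INTEGER)")

# Hypothetical data: Sales has 6 post-2020 hires, IT has 5, HR has 4.
rows = (
    [("Sales", 2021)] * 6 + [("Sales", 2019)] * 2
    + [("IT", 2022)] * 5
    + [("HR", 2018)] * 3 + [("HR", 2021)] * 4
)
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

result = conn.execute("""
    SELECT department, COUNT(*) AS employee_count
    FROM employees
    WHERE hire_year > 2020      -- filters individual rows first
    GROUP BY department         -- then forms one group per department
    HAVING COUNT(*) > 5         -- then filters whole groups
""").fetchall()

print(result)  # [('Sales', 6)]
```

WHERE removes rows before grouping, GROUP BY does the aggregation by department, and HAVING filters the aggregated groups, which is why IT (exactly 5 hires) and HR (4 post-2020 hires) drop out.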

Question 35 (medium, multiple choice)

Refer to the exhibit. Which type of ensemble method is being used?

Exhibit

{"model_type": "random_forest", "n_estimators": 100, "max_depth": 5, "criterion": "gini"}

Question 36 (hard, multiple choice)

Refer to the exhibit. Which data quality dimension is being violated?

Exhibit

2024-01-15 10:23:45 ERROR: DataTypeMismatchException - Column 'age' contains mixed data types: INT and VARCHAR. Pipeline 'user_profile_etl' failed.

Question 37 (easy, multiple choice)

A data analyst is building a linear regression model to predict sales based on advertising spend. The analyst notices that the residuals are not normally distributed and have a non‑constant variance. Which of the following transformations is most appropriate to apply to the dependent variable?

Question 38 (medium, multiple choice)

A company’s marketing team wants to segment customers based on purchase history, demographics, and website behavior. The data includes both numeric and categorical variables. Which clustering algorithm is best suited for handling mixed data types?

Question 39 (hard, multiple choice)

A data scientist is building a classification model to predict customer churn. The dataset has 10,000 records with 500 churners. The scientist uses logistic regression and achieves 98% accuracy, but the precision for churn class is only 15%. Which of the following is the most likely cause?

Question 40 (easy, multiple choice)

During ETL, a data analyst discovers that a date column contains values like '01/02/2023' and '2023-01-02'. Which of the following is the best practice to ensure consistent date format before analysis?
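
One way to bring both formats from the question into a single representation is to parse each value against a list of known patterns and emit ISO 8601 (`YYYY-MM-DD`). This is a sketch only: it assumes slashed dates are month/day/year, which the question does not state, and the helper name `to_iso` is hypothetical.

```python
from datetime import datetime

# Assumption: slashed dates are MM/DD/YYYY. If the source system writes
# DD/MM/YYYY instead, swap the second pattern to "%d/%m/%Y".
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def to_iso(text):
    """Return the date as an ISO 8601 string, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

print(to_iso("01/02/2023"), to_iso("2023-01-02"))
```

Values that match none of the patterns raise an error instead of being guessed, so unexpected formats surface during ETL and are not silently misread.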

Question 41 (medium, multiple choice)

A data analyst is reviewing a SQL query that joins three large tables. The query takes over an hour to run. The analyst notices that the WHERE clause filters on indexed columns in only two tables. Which of the following should the analyst do first to improve performance?

Question 42 (hard, multiple choice)

An analyst is fitting a polynomial regression model and wants to choose the degree that minimizes overfitting. Which technique should the analyst use?

Question 43 (easy, multiple choice)

A data analyst needs to create a visual that shows the distribution of customer ages across different regions. Which chart type is most appropriate?

Question 44 (medium, multiple choice)

A company has a dataset with 100 features. The data analyst wants to reduce dimensionality while preserving as much variance as possible. Which technique should be used?

Question 45 (hard, multiple choice)

After training a decision tree, the tree has depth 20 and 100% accuracy on training data but only 60% on test data. Which hyperparameter adjustment is most likely to improve generalization?

Question 46 (medium, multi select)

A data analyst is performing hypothesis testing to compare the mean sales of two store locations. Which TWO conditions must be satisfied to use a two‑sample t‑test? (Select TWO.)

Question 47 (hard, multi select)

A data scientist is cleaning a dataset and notices missing values in several columns. Which THREE techniques are appropriate for handling missing data? (Select THREE.)

Question 48 (easy, multi select)

Which THREE of the following are examples of descriptive statistics? (Select THREE.)

Question 49 (medium, multiple choice)

Refer to the exhibit. An analyst runs the following query: SELECT product_id, AVG(quantity) FROM sales GROUP BY product_id HAVING AVG(quantity) > 8; Which product_id(s) will be returned?

Exhibit (not included)

Question 50 (hard, multiple choice)

Refer to the exhibit. Before running the code, the original salary column had 50 missing values. The median was calculated as 52000. After imputation, which of the following statements is true?

Exhibit

Python pandas code and output:
```
import pandas as pd
df = pd.read_csv('employees.csv')
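# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas.
df['salary'] = df['salary'].fillna(df['salary'].median())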
print(df['salary'].describe())
```
Output:
```
count    1000.000000
mean     55000.000000
std      15000.000000
min      25000.000000
25%      45000.000000
50%      52000.000000
75%      65000.000000
max     120000.000000
Name: salary, dtype: float64
```
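
The effect of median imputation on summary statistics can be checked on a toy list without pandas. The salaries below are hypothetical and unrelated to `employees.csv`; the sketch only illustrates which statistics move and which stay put when missing values are replaced by the median.

```python
from statistics import mean, median, stdev

# Hypothetical salaries with two missing entries (None).
salaries = [40_000, 48_000, 52_000, None, 60_000, None, 95_000]

observed = [s for s in salaries if s is not None]
fill = median(observed)  # 52_000 for this toy data

imputed = [fill if s is None else s for s in salaries]

print(len(observed), len(imputed))          # count rises from 5 to 7
print(median(observed), median(imputed))    # median is unchanged
print(mean(observed), mean(imputed))        # mean shifts toward the median
print(stdev(observed) > stdev(imputed))     # spread shrinks in this example
```

In this example the count increases by the number of filled values, the median stays the same, and the standard deviation drops because the new points sit at the center of the distribution.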

Question 51 (easy, multiple choice)

Refer to the exhibit. A data analyst wants to grant read access to an entire S3 bucket named 'data-lake'. Which of the following best describes what this policy does?

Exhibit

JSON policy:
```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::data-lake/*"
    }
  ]
}
```

Question 52 (easy, multiple choice)

A data analyst wants to predict customer churn based on categorical features like region and plan type, and continuous features like usage and tenure. Which regression type should be used?

Question 53 (medium, multiple choice)

A retail company wants to forecast monthly sales for the next 12 months. Sales data shows a clear upward trend and seasonal patterns that repeat yearly. Which time series model is most appropriate?

Question 54 (hard, multiple choice)

A data scientist is analyzing a dataset with 100 variables and 5,000 records. The dataset has several missing values and a few extreme outliers. The goal is to build a regression model to predict a continuous target. Which combination of preprocessing steps is most likely to improve model performance?

Question 55 (easy, multiple choice)

During data exploration, an analyst notices that the target variable has a heavily right-skewed distribution. Which data transformation would be most appropriate to make the distribution more symmetric?

Question 56 (medium, multiple choice)

A marketing team wants to segment customers into groups based on purchasing behavior without prior labels. Which algorithm should the data analyst use?

Question 57 (hard, multiple choice)

After building a binary classification model, the data analyst obtains the following confusion matrix: True Positives=80, True Negatives=100, False Positives=20, False Negatives=30. What is the F1 score?
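
The arithmetic for this confusion matrix can be worked through as follows. The counts are the ones given in the question; the function name is illustrative. True negatives do not enter the F1 calculation.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # 80 / 100 = 0.8
    recall = tp / (tp + fn)      # 80 / 110, about 0.727
    return 2 * precision * recall / (precision + recall)

f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(f1, 4))  # 0.7619
```

The same value follows from the equivalent closed form 2·TP / (2·TP + FP + FN) = 160 / 210.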

Question 58 (easy, multiple choice)

A dataset contains a column 'Income' with values in different scales (some in thousands, some in hundreds). What is the best way to standardize this column for use in a machine learning model?

Question 59 (medium, multiple choice)

A data analyst needs to determine whether the mean sales of two different regions are significantly different. The samples are independent and the data is normally distributed. Which statistical test should be used?

Question 60 (hard, multiple choice)

A data scientist is working with a dataset containing 1000 features and 500 samples. The goal is to build a predictive model. Which technique should be used to reduce the number of features while retaining most of the variance?

Question 61 (medium, multi select)

Which TWO of the following are commonly used techniques for handling missing data in a dataset? (Select TWO.)

Question 62 (hard, multi select)

Which THREE of the following are assumptions of linear regression? (Select THREE.)

Question 63 (easy, multi select)

Which TWO of the following are true about correlation and causation? (Select TWO.)

Question 64 (medium, multiple choice)

The exhibit shows an SQL query executed on an 'orders' table that contains 'order_id', 'customer_id', and 'order_date'. What is the purpose of this query?

Exhibit

SELECT customer_id, COUNT(order_id) AS order_count
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'
GROUP BY customer_id
HAVING COUNT(order_id) > 5;

Question 65 (hard, multiple choice)

Given the linear regression output, which independent variable has the strongest effect on price, based on standardized coefficients?

Exhibit

Call:
lm(formula = price ~ sqft_living + bedrooms + bathrooms, data = housing)

Residuals:
    Min      1Q  Median      3Q     Max
-1.2345 -0.3456 -0.0123  0.3456  2.3456

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.123456   0.012345  10.000  <2e-16 ***
sqft_living   0.001234   0.000123  10.000  <2e-16 ***
bedrooms     -0.056789   0.012345  -4.600  4.23e-06 ***
bathrooms     0.234567   0.045678   5.135  3.45e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4567 on 496 degrees of freedom
Multiple R-squared:  0.789, Adjusted R-squared:  0.787
F-statistic: 617.8 on 3 and 496 DF,  p-value: < 2.2e-16

Question 66 (easy, multiple choice)

A data analyst runs the Python code shown. What is the result of executing this code?

Exhibit

import pandas as pd
df = pd.read_csv('data.csv')
df['total'] = df['price'] * df['quantity']
df.head()

Question 67 (easy, multiple choice)

A data analyst needs to summarize customer satisfaction scores. The data contains a few extremely low scores that skew the distribution. Which measure of central tendency is most appropriate?

Question 68 (medium, multiple choice)

A retail company wants to predict sales based on advertising spend and season. Which data modeling technique should the analyst use?

Question 69 (hard, multiple choice)

A data analyst trains a complex model that achieves 99% accuracy on training data but only 65% on new data. What is the most likely issue?

Question 70 (easy, multiple choice)

After a marketing campaign, sales increased by 15%. The analyst wants to understand which customer segment contributed most to the increase. Which type of analysis is this?

Question 71 (medium, multiple choice)

In a dataset with variables on different scales (e.g., age in years and income in dollars), which preprocessing step is necessary before applying k-means clustering?

Question 72 (hard, multiple choice)

A data analyst uses linear regression to model the relationship between advertising spend and sales. The residual plot shows a clear U-shaped pattern. What assumption is violated?

Question 73 (easy, multiple choice)

A data analyst calculates a correlation coefficient of -0.85 between temperature and heating costs. What does this indicate?

Question 74 (medium, multiple choice)

A company wants to segment its customers into distinct groups based on purchasing behavior. Which algorithm is best suited for this task?

Question 75 (easy, multi select)

A data analyst is preparing to build a predictive model. Which TWO steps are essential to ensure model validity? (Choose two.)

Question 76 (medium, multi select)

In multiple linear regression, which TWO assumptions are critical for unbiased coefficient estimates? (Choose two.)

Question 77 (hard, multi select)

A data analyst is performing data cleaning. Which THREE steps are part of this process? (Choose three.)

Question 78 (easy, multiple choice)

You are a data analyst at a logistics company. The operations manager wants to reduce delivery delays. You have historical data including order date, delivery date, distance, weather conditions, and driver ID. Initial analysis shows that the average delivery time has increased over the past six months. You suspect that weather is a contributing factor, but you need to confirm. The company also wants to build a model to predict delivery times to better manage customer expectations. The data contains missing values for weather conditions in about 10% of records, and some driver IDs are incorrect. You have limited time and resources. What should you do first?

Question 79 (medium, multiple choice)

A healthcare analytics team is analyzing patient readmission rates. They have a dataset with thousands of records including patient age, diagnosis, length of stay, number of prior admissions, and discharge date. The goal is to identify key factors influencing readmission and create a model to predict high-risk patients. The data is imbalanced: only 5% of patients are readmitted within 30 days. The team plans to use logistic regression. What is the most appropriate approach?

Question 80 (hard, multiple choice)

A marketing analyst wants to segment customers based on purchasing behavior and demographics. The dataset includes continuous variables (spending amount, frequency) and categorical variables (region, gender). The analyst decides to use k-means clustering. What should the analyst do to prepare the data?

Question 81 (hard, multiple choice)

A financial analyst is building a model to predict stock price movements. The data is time series with daily prices. The analyst wants to use a regression model but notices that the residuals are autocorrelated. What adjustment should be made?

Question 82 (medium, multi select)

A data analyst is building a supervised learning model to predict customer churn. The target variable is binary (churn = yes/no). Which TWO modeling techniques are appropriate for this task? (Select two.)

Question 83 (hard, multiple choice)

A data analyst at a retail company is building a multiple linear regression model to forecast weekly sales. The dataset contains 50 predictor variables, including store size, promotional spend, holiday indicators, and many others. After training the model, the analyst observes an R-squared of 0.99 on the training set but only 0.55 on the holdout test set. Which action should the analyst take first to address this discrepancy?

Question 84 (easy, multiple choice)

A marketing analyst wants to segment customers based on their purchase history, including total spent, number of transactions, and average order value. The analyst runs k-means clustering with k=5 on the raw data but notices that the cluster assignments change significantly every time the algorithm is executed. What should the analyst do first to obtain consistent and meaningful clusters?
