CompTIA Data+ DA0-001 DA0-001 Questions 976–982 | Page 14/14

976

Multi-Selecteasy

A data analyst is using pandas to clean a DataFrame that contains missing values in the 'age' and 'income' columns. Which THREE pandas methods are appropriate for handling missing data? (Select THREE).

Select 3 answers

A.dropna()

B.pivot_table()

C.merge()

D.apply() with a custom function

E.fillna()

AnswersA, D, E

Removes rows with missing values.

Why this answer

Common pandas methods for missing data include dropna (remove rows with NaN), fillna (replace NaN with a value), and apply with a custom function. Merge is for combining DataFrames; pivot_table is for reshaping.

Full explanation →

977

MCQmedium

A healthcare analytics team is analyzing patient readmission rates. They have a dataset with thousands of records including patient age, diagnosis, length of stay, number of prior admissions, and discharge date. The goal is to identify key factors influencing readmission and create a model to predict high-risk patients. The data is imbalanced: only 5% of patients are readmitted within 30 days. The team plans to use logistic regression. What is the most appropriate approach?

A.Use the dataset as is because logistic regression handles imbalance

B.Remove most of the non-readmitted patients to balance the dataset

C.Use accuracy as the evaluation metric

D.Apply oversampling techniques like SMOTE to the training set

AnswerD

Oversampling balances the classes, improving model performance on the minority class.

Why this answer

With imbalanced data, logistic regression can be biased toward the majority class. Oversampling the minority class (e.g., SMOTE) helps the model learn patterns for readmission. Using accuracy as a metric would be misleading.

Removing majority samples discards valuable data. Using data as-is often fails to predict the minority class.

Full explanation →

978

Multi-Selecthard

An e-commerce company wants to analyze sales performance across product categories. The dataset includes transaction amounts and a column 'category' with values (Electronics, Clothing, Home). The analyst decides to use stratified sampling to ensure proportional representation. Which THREE steps are required to implement this? (Select THREE).

Select 3 answers

A.Calculate the proportion of each category in the population

B.Take a random sample from each stratum with size proportional to its population proportion

C.Divide the dataset into three strata based on category

D.Select every 10th transaction from the entire dataset

E.Combine all categories into a single group and perform simple random sampling

AnswersA, B, C

Proportions are needed to determine sample sizes per stratum.

Why this answer

Stratified sampling requires dividing the population into strata (categories), then randomly sampling from each stratum in proportion to its size. Combining strata or simple random sampling without stratification would not achieve proportional representation.

Full explanation →

979

MCQmedium

A data analyst wants to find customers whose last name starts with 'Mc' and have made purchases in 2023. The purchase table has a purchase_date column. Which SQL query accomplishes this?

A.SELECT * FROM customers WHERE last_name LIKE 'Mc_' AND YEAR(purchase_date) = 2023;

B.SELECT * FROM customers WHERE last_name LIKE '%Mc%' AND purchase_date = 2023;

C.SELECT * FROM customers WHERE last_name = 'Mc%' AND YEAR(purchase_date) = 2023;

D.SELECT * FROM customers c JOIN purchases p ON c.id = p.customer_id WHERE last_name LIKE 'Mc%' AND p.purchase_date BETWEEN '2023-01-01' AND '2023-12-31';

AnswerD

Correct use of LIKE, JOIN, and date range.

Why this answer

The LIKE operator with 'Mc%' matches names starting with 'Mc', and YEAR() extracts the year from a date.

Full explanation →

980

MCQhard

A data engineer is designing a data warehouse for a multinational corporation. The company has sales data from different regions with varying currencies and date formats. To ensure consistency, which data concept should be applied to standardize the data before loading into the warehouse?

A.Data cleansing

B.Data transformation

C.Data profiling

D.Data masking

AnswerB

Transformation includes standardization of formats.

Why this answer

Data transformation is the correct concept because it involves converting data from source formats (e.g., different currencies and date formats) into a consistent, standardized format before loading into the data warehouse. This process includes applying conversion rules, such as using ISO 8601 for dates and a single base currency (e.g., USD) with exchange rate tables, ensuring uniformity across all regional data. Without transformation, the warehouse would contain incompatible data types, breaking referential integrity and analytical queries.

Exam trap

CompTIA often tests the distinction between data cleansing and data transformation, where candidates mistakenly choose cleansing because they think fixing formats is about 'cleaning' data, but cleansing addresses errors and missing values, not structural conversions like currency or date standardization.

How to eliminate wrong answers

Option A is wrong because data cleansing focuses on detecting and correcting inaccuracies, inconsistencies, or missing values (e.g., removing duplicates or fixing typos), not on converting data types or formats like currencies and dates. Option C is wrong because data profiling is an exploratory process that analyzes source data to understand its structure, quality, and relationships (e.g., checking data types or null percentages), but it does not perform any standardization or conversion. Option D is wrong because data masking is a security technique used to obfuscate sensitive information (e.g., replacing credit card numbers with tokens) for privacy or compliance, and it has no role in standardizing currencies or date formats.

Full explanation →

981

MCQeasy

A data analyst calculates the mean, median, and mode of a sales dataset and finds they are all equal. Which type of distribution does this indicate?

A.Normal distribution

B.Skewed right

C.Bimodal distribution

D.Skewed left

AnswerA

Normal distribution has equal mean, median, and mode.

Why this answer

When mean, median, and mode are equal, the distribution is symmetric and typically bell-shaped (normal).

Full explanation →

982

MCQhard

A logistics company is analyzing truck delivery times. Which variable is discrete?

A.Number of stops

B.Time taken in hours

C.Fuel consumption in liters

D.Distance traveled

AnswerA

Correct. The number of stops is a count and therefore discrete.

Why this answer

A discrete variable is one that takes on a countable number of distinct values, often integers. The number of stops a truck makes is a count (e.g., 0, 1, 2, 3) and cannot be a fraction, making it a classic discrete variable in data analysis.

Exam trap

The trap here is that candidates confuse 'recorded as an integer' with 'discrete'—for example, thinking distance in whole kilometers is discrete, when the underlying measurement scale is continuous.

How to eliminate wrong answers

Option B is wrong because time taken in hours is a continuous variable—it can be measured to any fractional precision (e.g., 2.5 hours, 3.75 hours). Option C is wrong because fuel consumption in liters is continuous; it can take any value within a range (e.g., 45.3 liters). Option D is wrong because distance traveled is continuous, as it can be measured in fractional units (e.g., 120.7 km).

Full explanation →

CompTIA Data+ DA0-001 (DA0-001) — Questions 976–982