CCNA Dap Mining Data Questions — Page 1 of 2

MCQhard

An analyst is using Python pandas and has a DataFrame 'sales' with columns 'date', 'product', 'revenue'. They need to create a pivot table showing total revenue per product per month. Which pandas function is most appropriate?

A.sales.groupby(['product', 'month']).sum()

B.sales.pivot(index='product', columns='month', values='revenue')

C.sales.pivot_table(index='product', columns='month', values='revenue', aggfunc='sum')

D.sales.melt(id_vars=['product'], value_vars=['month', 'revenue'])

AnswerC

pivot_table creates a matrix with products as rows and months as columns.

Why this answer

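pivot_table reshapes the data and aggregates every duplicate index/column pair with aggfunc, so several sales of one product in the same month are summed into one cell. pivot reshapes without aggregating and raises an error when a product/month pair occurs more than once, while groupby returns the same totals in long format rather than as a matrix. All of the options assume a 'month' column has already been derived from 'date'.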
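
A minimal sketch of the keyed answer, assuming pandas is installed. The sample rows are hypothetical; only the column names come from the question, and the 'month' column is derived here because the original DataFrame only has 'date'.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-10"]),
    "product": ["A", "A", "A", "B"],
    "revenue": [100.0, 50.0, 70.0, 30.0],
})

# The answer options assume a 'month' column, so derive one from 'date'.
sales["month"] = sales["date"].dt.to_period("M").astype(str)

# Products as rows, months as columns, revenue summed per cell.
monthly = sales.pivot_table(
    index="product",
    columns="month",
    values="revenue",
    aggfunc="sum",
    fill_value=0,  # show 0 instead of NaN where a product had no sales
)
print(monthly)
```

Product A appears twice in January, which is the case where pivot_table with aggfunc='sum' is needed rather than pivot.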

Multi-Selecthard

A data analyst needs to identify the top 3 most frequent product categories from a sales table. Which SQL techniques can be used to achieve this? (Choose two.)

Select 2 answers

A.SELECT category, COUNT(*) as cnt FROM sales GROUP BY category QUALIFY DENSE_RANK() OVER (ORDER BY cnt DESC) <= 3

B.GROUP BY category ORDER BY COUNT(*) DESC LIMIT 3

C.SELECT category FROM sales LEFT JOIN (SELECT category FROM sales GROUP BY category HAVING COUNT(*) > 3) AS t ON sales.category = t.category

D.SELECT DISTINCT category FROM sales ORDER BY category DESC LIMIT 3

E.SELECT category, COUNT(*) FROM sales GROUP BY category HAVING COUNT(*) > 3

AnswersA, B

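Using the window function DENSE_RANK() with QUALIFY returns the top 3 categories including ties. QUALIFY is a dialect extension (for example Snowflake, BigQuery, and Teradata); in other databases, compute the rank in a subquery or CTE and filter it in the outer query.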

Why this answer

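Both approaches are valid: GROUP BY with ORDER BY COUNT(*) DESC and LIMIT 3, or a window function such as DENSE_RANK() that ranks categories by count and keeps ranks 1 to 3. They differ on ties: LIMIT 3 returns exactly three rows, while DENSE_RANK() returns every category tied within the top three ranks. HAVING COUNT(*) > 3 (options C and E) keeps categories with more than three rows, which is not the same as the top 3, and the LEFT JOIN in option C adds nothing. Option D sorts category names alphabetically and ignores frequency.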
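
The two techniques can be compared side by side. This is a sketch using Python's built-in sqlite3 module with hypothetical rows; SQLite has no QUALIFY, so the DENSE_RANK() version filters in an outer query instead, and it assumes an SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT)")
# Hypothetical data: 'games' and 'music' tie for third place.
rows = ([("books",)] * 5 + [("toys",)] * 4 + [("games",)] * 3
        + [("music",)] * 3 + [("garden",)] * 1)
conn.executemany("INSERT INTO sales VALUES (?)", rows)

# Technique 1: GROUP BY + ORDER BY + LIMIT returns exactly three rows.
top3_limit = conn.execute("""
    SELECT category, COUNT(*) AS cnt
    FROM sales
    GROUP BY category
    ORDER BY cnt DESC, category
    LIMIT 3
""").fetchall()

# Technique 2: DENSE_RANK() keeps every category tied within the top 3 ranks.
top3_ties = conn.execute("""
    WITH counts AS (
        SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category
    ),
    ranked AS (
        SELECT category, cnt, DENSE_RANK() OVER (ORDER BY cnt DESC) AS rnk
        FROM counts
    )
    SELECT category, cnt FROM ranked WHERE rnk <= 3
    ORDER BY cnt DESC, category
""").fetchall()

print(top3_limit)
print(top3_ties)
```

With this data the LIMIT query drops 'music', while the DENSE_RANK() query returns four rows because of the tie.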

MCQmedium

A data analyst needs to count the number of orders placed by each customer, but only for customers who have placed more than 5 orders. Which SQL clause should be used to filter the aggregated results?

A.FILTER

B.HAVING

C.WHERE

D.LIMIT

AnswerB

Correct. HAVING filters aggregated results.

Why this answer

HAVING is used to filter groups after aggregation. The query would use GROUP BY customer_id, then HAVING COUNT(*) > 5.
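
The query described above can be sketched with Python's built-in sqlite3 module. The table layout and rows are hypothetical; the example also shows that an aggregate placed in WHERE is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER)")
# Hypothetical data: customer 1 has 6 orders, customer 2 has 5, customer 3 has 2.
customer_ids = [1] * 6 + [2] * 5 + [3] * 2
conn.executemany("INSERT INTO orders (customer_id) VALUES (?)",
                 [(c,) for c in customer_ids])

# HAVING filters the groups produced by GROUP BY.
frequent = conn.execute("""
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 5
""").fetchall()
print(frequent)

# An aggregate in WHERE is an error, because WHERE runs before grouping.
try:
    conn.execute("SELECT customer_id FROM orders WHERE COUNT(*) > 5 GROUP BY customer_id")
    where_rejected = False
except sqlite3.Error:
    where_rejected = True
print(where_rejected)
```

Only customer 1 passes the filter; customer 2 has exactly 5 orders, and the condition is strictly greater than 5.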

MCQmedium

In a dataset of employee salaries, the analyst notices one value that is significantly higher than the rest. Using the IQR method, which values are typically considered outliers?

A.Values beyond Q1 - 3*IQR or Q3 + 3*IQR

B.Values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR

C.Values beyond mean ± 2 standard deviations

D.Values beyond min and max

AnswerB

Standard IQR outlier definition.

Why this answer

Outliers are values less than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR.

Multi-Selectmedium

A data analyst wants to sample a large dataset of customer transactions. Which TWO sampling methods are probability-based and ensure every element has a known chance of being selected? (Select TWO.)

Select 2 answers

A.Simple random sampling

B.Stratified sampling

C.Convenience sampling

D.Systematic sampling

E.Cluster sampling

AnswersA, B

Every element has a known, nonzero chance of selection.

Why this answer

Simple random sampling and stratified sampling are the keyed answers. Both are probability methods: each element has a known chance of selection, equal for every element under simple random sampling and determined by the stratum's allocation under stratified sampling. Systematic sampling and cluster sampling are also probability-based, so more than two options are defensible here; convenience sampling is the only non-probability method listed.

MCQmedium

An analyst needs to retrieve the year from an order_date column (datetime type). Which function should be used in SQL?

A.FORMAT(order_date, 'yyyy')

B.DATEADD(YEAR, order_date, 0)

C.YEAR(order_date)

D.EXTRACT(YEAR FROM order_date)

AnswerD

EXTRACT is the standard SQL function for date parts.

Why this answer

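EXTRACT(YEAR FROM order_date) is the standard SQL way to get the year part. YEAR(order_date) also works in MySQL and SQL Server, but it is a vendor function rather than standard SQL.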

MCQmedium

During data profiling, an analyst wants to identify the number of distinct values in a column. Which SQL function should be used?

A.DISTINCT(column)

B.COUNT(DISTINCT column)

C.COUNT(*)

D.COUNT(column)

AnswerB

Returns distinct count.

Why this answer

COUNT(DISTINCT column) returns the number of unique non-null values.

MCQhard

A data analyst is writing a query to rank products by total sales amount within each category. They want ties to have the same rank and no gaps in the ranking sequence. Which window function should they use?

A.ROW_NUMBER()

B.DENSE_RANK()

C.RANK()

D.NTILE()

AnswerB

Correct. DENSE_RANK() gives same rank to ties and no gaps.

Why this answer

DENSE_RANK() assigns the same rank to ties and does not skip subsequent ranks. RANK() also assigns same rank to ties but skips numbers, creating gaps.
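
The difference between the three ranking functions is easiest to see on tied values. The sketch below uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or later); the table name and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_sales (category TEXT, product TEXT, total_sales REAL)")
# Hypothetical data: p1 and p2 tie for first place within category A.
conn.executemany("INSERT INTO product_sales VALUES (?, ?, ?)", [
    ("A", "p1", 300), ("A", "p2", 300), ("A", "p3", 100), ("B", "p4", 50),
])

ranks = conn.execute("""
    SELECT product,
           ROW_NUMBER() OVER (PARTITION BY category
                              ORDER BY total_sales DESC, product) AS row_num,
           RANK()       OVER (PARTITION BY category
                              ORDER BY total_sales DESC) AS rnk,
           DENSE_RANK() OVER (PARTITION BY category
                              ORDER BY total_sales DESC) AS dense_rnk
    FROM product_sales
    ORDER BY category, total_sales DESC, product
""").fetchall()

for row in ranks:
    print(row)
```

For p3, RANK() returns 3 (a gap after the tie) while DENSE_RANK() returns 2. ROW_NUMBER() never repeats a number, so 'product' is added to its ORDER BY here only to make the tie-break deterministic.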

Multi-Selectmedium

An analyst needs to aggregate sales data by region and product, then sort the results by total sales in descending order. Which SQL clauses are required? (Select THREE).

Select 3 answers

A.GROUP BY

B.HAVING

C.SUM

D.ORDER BY

E.DESC

AnswersA, D, E

Required to define groups for aggregation.

Why this answer

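GROUP BY defines the groups to aggregate and ORDER BY sorts the result; DESC is the keyword that makes ORDER BY sort in descending order. SUM is an aggregate function rather than a clause, and HAVING is only needed when groups must be filtered.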

MCQmedium

An analyst needs to count the number of orders per customer but only for customers who have placed more than 5 orders. Which SQL construct allows filtering after aggregation?

A.WHERE COUNT(*) > 5

B.LIMIT 5

C.HAVING COUNT(*) > 5

D.ORDER BY COUNT(*) > 5

AnswerC

HAVING filters groups after aggregation.

Why this answer

HAVING is used to filter groups based on aggregate conditions, unlike WHERE which filters before aggregation.

MCQmedium

A data analyst is using pandas in Python to clean a dataset. Which method is most appropriate to replace missing numerical values with the median of the column?

A.df.fillna(df.median())

B.df.interpolate()

C.df.dropna()

D.df.replace(np.nan, df.mean())

AnswerA

fillna with median replaces nulls with median.

Why this answer

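df.median() returns the median of each column, and fillna() uses that result to fill each column's missing values with its own median. In pandas 2.0 and later, use df.median(numeric_only=True) when the DataFrame also contains non-numeric columns. Option D fills with the mean, not the median; interpolate() estimates values from neighbouring rows, and dropna() removes rows instead of filling them.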

MCQmedium

A data analyst needs to sample 1000 customers from a database of 100,000 customers for a survey, ensuring every customer has an equal chance of selection. Which sampling method is most appropriate?

A.Cluster sampling

B.Stratified sampling

C.Systematic sampling

D.Simple random sampling

AnswerD

Equal probability for all.

Why this answer

Simple random sampling gives each individual an equal chance of being selected.

Multi-Selectmedium

A data analyst needs to identify duplicate customer records based on email and phone number. Which SQL techniques can be used to find duplicates? (Select TWO).

Select 2 answers

A.SELECT email, phone FROM customers ORDER BY email, phone

B.SELECT DISTINCT email, phone FROM customers

C.SELECT email, phone, ROW_NUMBER() OVER (PARTITION BY email, phone ORDER BY customer_id) AS rn FROM customers WHERE rn > 1

D.Use a CTE to assign ROW_NUMBER() and then select rows where rn > 1

E.SELECT email, phone, COUNT(*) FROM customers GROUP BY email, phone HAVING COUNT(*) > 1

AnswersD, E

A CTE with ROW_NUMBER() can identify duplicates by filtering on rn > 1.

Why this answer

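GROUP BY email, phone with HAVING COUNT(*) > 1 returns each duplicated combination. ROW_NUMBER() with PARTITION BY email, phone numbers the rows within each combination, so rows with rn > 1 are the extra copies. Option C fails because a window function alias cannot be referenced in the WHERE clause of the same SELECT; the row number has to be computed in a CTE or subquery first, as in option D. DISTINCT and ORDER BY alone do not flag duplicates.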
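
Both keyed techniques can be sketched with Python's built-in sqlite3 module (SQLite 3.25 or later for ROW_NUMBER()). The rows are hypothetical; the CTE name 'numbered' is an illustrative choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT, phone TEXT)")
# Hypothetical data: customers 1, 2 and 4 share the same email and phone.
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "a@example.com", "111"),
    (2, "a@example.com", "111"),
    (3, "b@example.com", "222"),
    (4, "a@example.com", "111"),
    (5, "b@example.com", "333"),
])

# Technique 1: which (email, phone) combinations occur more than once?
dup_groups = conn.execute("""
    SELECT email, phone, COUNT(*) AS n
    FROM customers
    GROUP BY email, phone
    HAVING COUNT(*) > 1
""").fetchall()

# Technique 2: which individual rows are the extra copies?
extra_rows = conn.execute("""
    WITH numbered AS (
        SELECT customer_id,
               ROW_NUMBER() OVER (PARTITION BY email, phone
                                  ORDER BY customer_id) AS rn
        FROM customers
    )
    SELECT customer_id FROM numbered WHERE rn > 1
    ORDER BY customer_id
""").fetchall()

print(dup_groups)
print(extra_rows)
```

The GROUP BY form reports the duplicated combination once with its count; the ROW_NUMBER() form identifies the specific rows to review or delete while leaving the first copy.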

MCQmedium

A data analyst wants to randomly select 100 customers from a database for a survey, ensuring that the sample reflects the proportion of male and female customers in the population. Which sampling method is most appropriate?

A.Stratified sampling

B.Cluster sampling

C.Simple random sampling

D.Systematic sampling

AnswerA

Stratified sampling by gender ensures proportional representation.

Why this answer

Stratified sampling ensures proportional representation of subgroups (strata).

MCQeasy

Which data sampling method involves selecting every k-th element from a list after a random start?

A.Systematic sampling

B.Cluster sampling

C.Stratified sampling

D.Simple random sampling

AnswerA

Correct: selects every k-th element.

Why this answer

Systematic sampling selects every k-th item after a random start.

MCQeasy

Which SQL function can be used to extract the year from a date column 'order_date'?

A.DATEDIFF(year, order_date)

B.DATEADD(year, order_date)

C.YEAR(order_date)

D.FORMAT(order_date, 'yyyy')

AnswerC

Correct: YEAR returns the year as an integer.

Why this answer

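The YEAR function extracts the year portion from a date. It is available in dialects such as MySQL and SQL Server; the standard SQL equivalent is EXTRACT(YEAR FROM order_date).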

Multi-Selectmedium

A data analyst is performing data profiling on a customer table. Which TWO metrics are most useful for understanding the completeness of the data? (Choose two.)

Select 2 answers

A.Minimum and maximum values

B.Null count per column

C.Row count

D.Cardinality

E.Mean value

AnswersB, C

Directly measures missing values.

Why this answer

Row count gives total records, null count gives missing values per column, both help assess completeness. Cardinality is for uniqueness, min/max for range, mean for central tendency.

Multi-Selectmedium

An analyst needs to identify outliers in a numeric column 'transaction_amount' using the interquartile range (IQR) method. Which TWO steps are part of this process? (Select TWO).

Select 2 answers

A.Subtract 1.5 times the IQR from Q1 and add 1.5 times the IQR to Q3 to define bounds

B.Calculate the median of the column

C.Calculate the first quartile (Q1) and third quartile (Q3)

D.Sort the data and remove the top and bottom 5%

E.Compute the mean and standard deviation of the column

AnswersA, C

These bounds are used to flag outliers.

Why this answer

The IQR method involves calculating Q1 and Q3 to find IQR, then defining lower and upper bounds as Q1 - 1.5*IQR and Q3 + 1.5*IQR. Computing mean and standard deviation is for Z-score method; calculating median alone is insufficient.
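
The steps above can be sketched in pandas as follows. The values are hypothetical, and quantile() is used with its default linear interpolation; other quartile conventions can give slightly different bounds.

```python
import pandas as pd

# Hypothetical transaction amounts with one obviously large value.
transaction_amount = pd.Series([10, 12, 14, 15, 16, 18, 20, 95])

# Step 1: first and third quartiles.
q1 = transaction_amount.quantile(0.25)
q3 = transaction_amount.quantile(0.75)

# Step 2: interquartile range and the 1.5 * IQR bounds.
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Step 3: flag values outside the bounds.
outliers = transaction_amount[(transaction_amount < lower) | (transaction_amount > upper)]
print(q1, q3, iqr, lower, upper)
print(outliers.tolist())
```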

MCQmedium

You are analyzing sales data and need to calculate the moving average of monthly sales over the previous 3 months for each month. Which type of function is best suited for this task?

A.String function

B.Window function with OVER()

C.Aggregate function with GROUP BY

D.Date function

AnswerB

Window functions operate on a set of rows related to the current row, perfect for moving averages.

Why this answer

Window functions, specifically using OVER() with ORDER BY and a frame specification, can compute moving averages. Aggregate functions alone cannot access previous rows without a self-join. String and date functions are irrelevant.

MCQmedium

A data analyst is performing data profiling on a customer dataset. Which metric would best reveal the number of distinct values in the 'state' column?

A.Mean

B.Row count

C.Cardinality

D.Null count

AnswerC

Cardinality is the count of distinct values.

Why this answer

Cardinality refers to the number of unique values in a column, which directly indicates distinct states.

MCQeasy

A data analyst wants to create a temporary result set that can be referenced within a single SQL statement. Which feature should be used?

A.Subquery

B.CTE

C.Temporary table

D.Derived table

AnswerB

CTE with WITH clause is the correct feature.

Why this answer

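A CTE (Common Table Expression), defined with the WITH clause, creates a temporary named result set that exists only for the duration of one statement and can be referenced by name, more than once, within it. Subqueries and derived tables are also scoped to a single statement but are written inline where they are used, while a temporary table persists beyond one statement.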

MCQmedium

A data analyst wants to concatenate first_name and last_name columns with a space in between. Which string function combination should be used in SQL?

A.first_name + ' ' + last_name

B.SUBSTRING(first_name, 1, 1) + '.' + last_name

C.CONCAT(first_name, last_name)

D.CONCAT(first_name, ' ', last_name)

AnswerD

This adds a space between the two names.

Why this answer

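CONCAT joins its arguments in order, so passing ' ' as the middle argument produces 'First Last'. Option C omits the space, and the + operator in option A concatenates strings only in some dialects (for example SQL Server).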

Multi-Selectmedium

A data analyst is performing data profiling on a customer table. Which TWO of the following are key metrics to assess data quality? (Select TWO.)

Select 2 answers

A.Row count

B.Minimum and maximum values

C.Cardinality

D.Data type verification

E.Null count

AnswersB, E

Min and max help identify out-of-range values or anomalies.

Why this answer

Null counts indicate missing values, and min/max values can reveal outliers or out-of-range entries. Row count alone does not assess quality. Cardinality and data type verification are also useful profiling checks, but the key treats null count and min/max as the two most direct quality metrics among the options.

MCQhard

In a table 'employee_hierarchy' with columns 'employee_id', 'manager_id', and 'employee_name', an analyst needs to generate a list of all employees under a specific manager, including multiple levels of subordinates. Which SQL construct is most appropriate for querying this hierarchical data efficiently?

A.Recursive CTE

B.Window function with PARTITION BY

C.Subquery in WHERE clause

D.Self-JOIN with WHERE clause

AnswerA

Recursive CTEs iterate through levels, ideal for hierarchies.

Why this answer

Recursive CTEs are designed to handle hierarchical data by repeatedly joining a CTE to itself until all levels are included.

MCQeasy

Which SQL aggregate function would an analyst use to calculate the average value of a numeric column?

A.SUM

B.AVG

C.COUNT

D.MEDIAN

AnswerB

AVG computes the average.

Why this answer

AVG calculates the arithmetic mean of a numeric column.

MCQeasy

A data analyst uses Python's pandas library to read a CSV file into a DataFrame. Which function is used to read the file?

A.pd.import_csv()

B.pd.read_excel()

C.pd.load_csv()

D.pd.read_csv()

AnswerD

This is the correct function to read CSV files into a DataFrame.

Why this answer

pd.read_csv() is the standard pandas function to read a CSV file.

MCQeasy

In a sales database, an analyst needs to retrieve all orders where the order amount is between $100 and $500. Which WHERE clause should be used?

A.WHERE amount IN (100, 500)

B.WHERE amount >= 100 AND amount <= 500

C.WHERE amount BETWEEN 100 AND 500

D.WHERE amount > 100 AND amount < 500

AnswerC

BETWEEN filters values within the inclusive range.

Why this answer

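The BETWEEN operator is inclusive on both ends and is the standard, concise way to filter a range of values. Option B (amount >= 100 AND amount <= 500) is logically equivalent but more verbose; option D excludes the boundary values, and option A matches only the two values 100 and 500.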

MCQmedium

A data analyst wants to retrieve the top 5 highest-paid employees from an 'employees' table, including ties. Which SQL clause should be used?

A.TOP 5 WITH TIES

B.ORDER BY salary DESC LIMIT 5

C.HAVING salary > 50000

D.WHERE ROWNUM <= 5

AnswerA

TOP 5 WITH TIES (SQL Server) includes all rows that tie for the 5th position.

Why this answer

LIMIT 5 (or TOP 5 without WITH TIES) cuts the result at exactly five rows, so employees tied with the fifth-highest salary are dropped. SQL Server's TOP 5 WITH TIES, used together with ORDER BY salary DESC, keeps them; the standard SQL equivalent is FETCH FIRST 5 ROWS WITH TIES. Filtering on RANK() <= 5 in a subquery also works but is more verbose.

MCQmedium

A data analyst wants to find the top 5 products by total sales amount, but only for products that have been sold more than 50 times. Which SQL query accomplishes this?

A.SELECT product_id, SUM(sales_amount) FROM sales GROUP BY product_id HAVING COUNT(*) > 50 ORDER BY SUM(sales_amount) DESC LIMIT 5

B.SELECT product_id, SUM(sales_amount) FROM sales WHERE COUNT(*) > 50 GROUP BY product_id ORDER BY SUM(sales_amount) DESC LIMIT 5

C.SELECT product_id, SUM(sales_amount) FROM sales GROUP BY product_id HAVING COUNT(*) > 50 ORDER BY SUM(sales_amount) ASC LIMIT 5

D.SELECT product_id, SUM(sales_amount) FROM sales GROUP BY product_id WHERE COUNT(*) > 50 ORDER BY SUM(sales_amount) DESC LIMIT 5

AnswerA

Correct use of HAVING, ORDER BY, and LIMIT.

Why this answer

HAVING filters after aggregation, then ORDER BY and LIMIT give the top 5.

MCQmedium

A data analyst needs to sample 10% of customers from each of three regions (North, South, Central) to ensure proportional representation. Which sampling method should be used?

A.Systematic sampling

B.Cluster sampling

C.Stratified sampling

D.Simple random sampling

AnswerC

Ensures each region is represented proportionally.

Why this answer

Stratified sampling divides the population into strata (here, regions) and samples proportionally from each. Simple random sampling would not guarantee proportional representation, systematic sampling selects every k-th element, and cluster sampling selects entire groups at random.
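
A minimal pandas sketch of proportional stratified sampling, assuming pandas 1.1 or later (for groupby(...).sample). The customer counts per region and the random_state value are hypothetical.

```python
import pandas as pd

# Hypothetical population: 50 North, 30 South, 20 Central customers.
customers = pd.DataFrame({
    "customer_id": range(1, 101),
    "region": ["North"] * 50 + ["South"] * 30 + ["Central"] * 20,
})

# Draw 10% at random from within each region (stratum).
sample = customers.groupby("region").sample(frac=0.10, random_state=42)

counts = sample["region"].value_counts().to_dict()
print(counts)
```

Each region contributes 10% of its own size, so the sample keeps the 50/30/20 split of the population.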

MCQmedium

A dataset contains a 'salary' column. The analyst wants to identify outliers using the IQR method. If Q1 = 40,000 and Q3 = 70,000, what is the upper threshold for a non-outlier?

A.130,000

B.85,000

C.115,000

D.100,000

AnswerC

Correct calculation: 70,000 + 1.5*30,000 = 115,000.

Why this answer

Upper threshold = Q3 + 1.5 * IQR. IQR = 70,000 - 40,000 = 30,000. So upper = 70,000 + 45,000 = 115,000.
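
The same arithmetic as a short check in Python, using the Q1 and Q3 values from the question; the lower bound is included only for completeness.

```python
q1, q3 = 40_000, 70_000

iqr = q3 - q1              # interquartile range
lower = q1 - 1.5 * iqr     # lower threshold
upper = q3 + 1.5 * iqr     # upper threshold for a non-outlier

print(iqr, lower, upper)
```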

MCQeasy

A data analyst needs to retrieve all unique job titles from the employees table. Which SQL clause should be used with the SELECT statement?

A.TOP

B.UNIQUE

C.DISTINCT

D.ORDER BY

AnswerC

DISTINCT filters out duplicate rows.

Why this answer

The DISTINCT keyword is used to return only distinct (different) values.

MCQeasy

A table 'orders' contains columns 'order_id', 'customer_id', 'order_date', and 'total'. An analyst needs to find orders placed between January 1, 2023 and December 31, 2023. Which WHERE clause is correct?

A.WHERE order_date > '2023-01-01' AND order_date < '2023-12-31'

B.WHERE order_date IN ('2023-01-01', '2023-12-31')

C.WHERE order_date >= '2023-01-01' OR order_date <= '2023-12-31'

D.WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31'

AnswerD

BETWEEN is inclusive and concise.

Why this answer

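The BETWEEN operator is inclusive on both ends, so it covers January 1 through December 31. If order_date stores a time component, rows later than midnight on December 31 fall outside the range; in that case use order_date >= '2023-01-01' AND order_date < '2024-01-01'. Option A excludes both boundary dates, option B matches only the two listed dates, and option C is true for every non-NULL date because it uses OR.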

MCQeasy

A data analyst wants to identify customers whose last name starts with 'Mc' from the 'customers' table. Which WHERE clause condition should be used?

A.last_name LIKE 'Mc_'

B.last_name LIKE 'Mc%'

C.last_name IN ('Mc%')

D.last_name = 'Mc%'

AnswerB

Correct: % matches any sequence of characters after 'Mc'.

Why this answer

The LIKE operator with '%' wildcard matches any sequence of characters after 'Mc'.

Multi-Selecthard

A data analyst is investigating a correlation between two continuous variables. Which THREE of the following are appropriate steps in this exploratory data analysis? (Select THREE.)

Select 3 answers

A.Calculate the Pearson correlation coefficient

B.Create a scatter plot

C.Perform a t-test

D.Check for outliers using box plots

E.Create a contingency table

AnswersA, B, D

Quantifies linear correlation.

Why this answer

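A scatter plot visualizes the relationship, the Pearson correlation coefficient quantifies its linear strength and direction, and box plots flag outliers that could distort the coefficient. A t-test compares group means and a contingency table summarizes categorical variables, so neither fits an exploration of two continuous variables.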

MCQhard

In a table 'sales_team' with columns 'salesperson', 'quarter', and 'revenue', an analyst wants to assign a rank to each salesperson within their quarter based on revenue, with the highest revenue getting rank 1. However, if two salespeople have the same revenue, they should receive the same rank, and the next rank should be the next consecutive integer (no gaps). Which window function should be used?

A.RANK()

B.NTILE(4)

C.DENSE_RANK()

D.ROW_NUMBER()

AnswerC

DENSE_RANK() assigns consecutive ranks even with ties.

Why this answer

DENSE_RANK() assigns ranks with no gaps for ties, whereas RANK() leaves gaps.

Multi-Selectmedium

A data analyst is exploring a sales dataset and wants to identify columns that are likely to be foreign keys. Which TWO characteristics would indicate a foreign key?

Select 2 answers

A.The column name ends with '_id'

B.The column contains NULL values

C.The column is of integer data type

D.The column values are a subset of a primary key column in another table

E.The column has a UNIQUE constraint

AnswersA, D

Often foreign keys are named with '_id' suffix.

Why this answer

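Foreign keys reference a primary key in another table, so their values are a subset of that key's values. An '_id' suffix is a common naming convention and a useful hint, though not proof on its own. NULLs, an integer type, or a UNIQUE constraint do not indicate a foreign key.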

MCQhard

A data analyst is using pandas to clean a DataFrame. They need to replace missing values in the 'age' column with the median age. Which method should they use?

A.df['age'].replace(np.nan, df['age'].mean())

B.df['age'].dropna()

C.df['age'].fillna(df['age'].median())

D.df['age'].interpolate()

AnswerC

Correctly fills NaN with median.

Why this answer

fillna() with median() fills NaN values with the median of the column.
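
A minimal sketch of the keyed answer, assuming pandas and NumPy are installed. The names and ages are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di", "Ed"],
    "age": [25, np.nan, 40, 31, np.nan],
})

# median() skips NaN by default, so this is the median of 25, 31 and 40.
median_age = df["age"].median()

# fillna returns a new Series; assign it back to update the column.
df["age"] = df["age"].fillna(median_age)
print(median_age)
print(df["age"].tolist())
```

Assigning the result back to the column is needed because fillna does not modify the column in place by default.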

MCQmedium

A data analyst needs to extract the year from a column named 'order_date' in a SQL database. The database supports standard SQL functions. Which function should they use?

A.GET_YEAR(order_date)

B.YEAR(order_date)

C.DATE_PART('year', order_date)

D.EXTRACT(YEAR FROM order_date)

AnswerD

Correct standard SQL syntax.

Why this answer

The EXTRACT function is standard SQL for extracting date parts. EXTRACT(YEAR FROM order_date) returns the year.

MCQmedium

A data analyst wants to generate a report showing employee names and their department names, but some employees are not assigned to any department. The analyst wants to include all employees. Which JOIN type should be used?

A.INNER JOIN

B.LEFT JOIN

C.CROSS JOIN

D.RIGHT JOIN

AnswerB

LEFT JOIN includes all employees, even those without a department.

Why this answer

LEFT JOIN includes all rows from the left table (employees) even if no match in departments.

MCQeasy

A data analyst needs to count the number of distinct product categories in a table named 'products'. Which SQL function should be used in the SELECT clause?

A.COUNT(category)

B.DISTINCT COUNT(category)

C.COUNT(DISTINCT category)

D.COUNT(*) WHERE category IS NOT NULL

AnswerC

This counts only unique non-null categories.

Why this answer

COUNT(DISTINCT column) counts unique non-null values in a column.

MCQmedium

A data analyst wants to extract the year from a date column 'order_date' in a SQL database. Which function should be used?

A.YEAR(order_date)

B.DATEADD(year, order_date, 0)

C.DATEDIFF(year, order_date, GETDATE())

D.GETDATE()

AnswerA

Returns the year portion of the date.

Why this answer

The YEAR() function extracts the year from a date. DATEADD adds an interval to a date, DATEDIFF calculates the difference between two dates, and GETDATE() returns the current date and time.

MCQeasy

In a dataset of customer orders, you need to count the number of distinct customers who have placed orders. Which SQL aggregate function should you use?

A.DISTINCT COUNT(customer_id)

B.COUNT(customer_id)

C.COUNT(DISTINCT customer_id)

D.COUNT(*)

AnswerC

Correctly counts unique customer IDs.

Why this answer

COUNT(DISTINCT column) counts the number of unique non-null values in a column. COUNT(*) counts all rows including duplicates, COUNT(column) counts non-null values including duplicates. DISTINCT alone is not an aggregate function.

MCQmedium

A dataset contains a column 'birthdate' in 'YYYY-MM-DD' format. The analyst needs to calculate the average age of customers as of today. Which combination of functions is most appropriate?

A.AVG(YEAR(GETDATE()) - YEAR(birthdate))

B.DATEDIFF(year, birthdate, GETDATE())

C.YEAR(GETDATE()) - YEAR(birthdate)

D.EXTRACT(YEAR FROM GETDATE()) - EXTRACT(YEAR FROM birthdate)

AnswerB

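DATEDIFF with the year datepart returns the number of year boundaries crossed, which is a common approximation of age.

Why this answer

DATEDIFF(year, birthdate, GETDATE()) is SQL Server (T-SQL) syntax rather than standard SQL. It counts the year boundaries crossed between the two dates, which gives the same result as YEAR(GETDATE()) - YEAR(birthdate): both ignore the month and day, so the age is overstated by one for anyone whose birthday has not yet occurred this year. On its own the expression returns one value per row; to get the average age across customers it still has to be wrapped in AVG(), as in AVG(DATEDIFF(year, birthdate, GETDATE())).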

MCQhard

A data analyst is working with a sales table that contains columns: sale_id, product_id, sale_date, and amount. They need to calculate a 7-day moving average of sales amount for each product, ordered by sale_date. Which window function syntax should they use?

A.AVG(amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

B.AVG(amount) OVER (PARTITION BY product_id ORDER BY sale_date)

C.AVG(amount) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

D.SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

AnswerA

Correct. This computes the average of the current and previous 6 rows per product.

Why this answer

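A moving average requires averaging over a frame of rows. AVG() with PARTITION BY product_id keeps each product's rows separate, ORDER BY sale_date puts them in time order, and the frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW averages the current row and the six before it. The frame counts rows, not calendar days, so this is a true 7-day average only when each product has exactly one row per day. Option B has no frame and gives a running average, option C mixes all products together, and option D returns a sum.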
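
The keyed syntax can be tried with Python's built-in sqlite3 module (SQLite 3.25 or later). The rows are hypothetical and deliberately contain one row per product per day, which is the assumption under which a 7-row frame equals a 7-day window.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY, product_id INTEGER, sale_date TEXT, amount REAL)""")

# Hypothetical data: product 1 sells 10, 20, ..., 80 over eight days;
# product 2 has only two days of sales.
rows = [(1, f"2024-01-{day:02d}", 10.0 * day) for day in range(1, 9)]
rows += [(2, "2024-01-01", 100.0), (2, "2024-01-02", 200.0)]
conn.executemany(
    "INSERT INTO sales (product_id, sale_date, amount) VALUES (?, ?, ?)", rows)

moving = conn.execute("""
    SELECT product_id, sale_date,
           AVG(amount) OVER (
               PARTITION BY product_id
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS avg_7
    FROM sales
    ORDER BY product_id, sale_date
""").fetchall()

for row in moving:
    print(row)
```

For the first six rows of a product the frame holds fewer than seven rows, so the average is taken over whatever rows are available.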

MCQeasy

A data analyst needs to count the number of customers who have placed at least one order. Which SQL query should be used?

A.SELECT DISTINCT COUNT(customer_id) FROM orders

B.SELECT SUM(customer_id) FROM orders

C.SELECT COUNT(customer_id) FROM orders

D.SELECT COUNT(DISTINCT customer_id) FROM orders

AnswerD

This counts unique customer IDs from orders.

Why this answer

COUNT(DISTINCT customer_id) returns the number of unique customers, which correctly counts those with at least one order.

MCQeasy

A data analyst needs to retrieve all unique job titles from an employees table. Which SQL keyword should be used in the SELECT clause?

A.UNIQUE

B.REMOVE DUPLICATES

C.DISTINCT

D.FILTER

AnswerC

Correct. DISTINCT filters out duplicate rows.

Why this answer

DISTINCT removes duplicate rows from the result set, returning only unique values. In this case, SELECT DISTINCT job_title would return each job title only once.

Multi-Selecthard

A data analyst uses a Common Table Expression (CTE) to query hierarchical employee data (manager_id references employee_id). Which THREE statements about recursive CTEs are correct? (Select THREE).

Select 3 answers

A.The anchor member is the first part of the CTE that does not reference the CTE itself

B.Recursive CTEs cannot be used to generate a series of numbers

C.A recursive CTE must use the keyword RECURSIVE in the WITH clause

D.The recursive member cannot reference the CTE name

E.UNION ALL is typically used to combine the anchor and recursive members

AnswersA, C, E

The anchor member is the non-recursive initial query.

Why this answer

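A recursive CTE has two parts, typically combined with UNION ALL: the anchor member, which is defined first and does not reference the CTE, and the recursive member, which references the CTE by name and is re-evaluated until it returns no new rows. PostgreSQL and MySQL require the RECURSIVE keyword after WITH; SQL Server and Oracle use plain WITH. Options B and D are false: the recursive member must reference the CTE name, and recursive CTEs are commonly used to generate number series.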
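
A sketch of the anchor and recursive members, run through Python's built-in sqlite3 module. The employee rows, the CTE name 'reports', and the 'depth' column are hypothetical; the table and column names follow the employee_hierarchy question earlier on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee_hierarchy (
    employee_id INTEGER PRIMARY KEY, manager_id INTEGER, employee_name TEXT)""")
# Hypothetical org chart: Ava -> (Ben, Cal); Ben -> Dee -> Eli; Cal -> Fay.
conn.executemany("INSERT INTO employee_hierarchy VALUES (?, ?, ?)", [
    (1, None, "Ava"), (2, 1, "Ben"), (3, 1, "Cal"),
    (4, 2, "Dee"), (5, 4, "Eli"), (6, 3, "Fay"),
])

manager_id = 2  # list everyone below Ben, at any depth

subordinates = conn.execute("""
    WITH RECURSIVE reports(employee_id, employee_name, depth) AS (
        -- Anchor member: direct reports; does not reference the CTE.
        SELECT employee_id, employee_name, 1
        FROM employee_hierarchy
        WHERE manager_id = ?
        UNION ALL
        -- Recursive member: joins the table to the CTE itself.
        SELECT e.employee_id, e.employee_name, r.depth + 1
        FROM employee_hierarchy AS e
        JOIN reports AS r ON e.manager_id = r.employee_id
    )
    SELECT employee_name, depth FROM reports
    ORDER BY depth, employee_name
""", (manager_id,)).fetchall()

print(subordinates)
```

The recursion stops on its own when a level has no further reports, so the depth of the hierarchy does not need to be known in advance.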

MCQmedium

A data analyst is using pandas in Python to merge two DataFrames: sales (columns: sale_id, product_id, amount) and products (columns: product_id, product_name). Which pandas function should they use to combine these DataFrames on the 'product_id' column?

A.combine()

B.merge()

C.join()

D.concat()

AnswerB

Correct. merge() is designed for database-style joins.

Why this answer

The pandas merge function is used to combine DataFrames on common columns. The syntax is pd.merge(sales, products, on='product_id').
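
A short sketch of the call mentioned above, with hypothetical rows. It also shows how='left', an assumption beyond the question, to illustrate what happens to sales whose product_id has no match.

```python
import pandas as pd

# Hypothetical data: product_id 30 has no entry in 'products'.
sales = pd.DataFrame({
    "sale_id": [1, 2, 3],
    "product_id": [10, 20, 30],
    "amount": [5.0, 7.5, 2.0],
})
products = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Pen", "Ink"],
})

# Default is an inner join: only matching product_id values are kept.
inner = pd.merge(sales, products, on="product_id")

# A left join keeps every sale and fills missing product names with NaN.
left = pd.merge(sales, products, on="product_id", how="left")

print(inner)
print(left)
```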

MCQeasy

A data analyst is performing data profiling on a customer table. Which metric would best help identify missing values in the 'phone' column?

A.Cardinality

B.Null count

C.Mean

D.Row count

AnswerB

Null count shows number of records with missing phone values.

Why this answer

Null count directly measures missing values.

Multi-Selecteasy

A data analyst needs to sample records from a large dataset for a quick analysis. Which TWO sampling methods are examples of probability sampling?

Select 2 answers

A.Snowball sampling

B.Simple random sampling

C.Systematic sampling

D.Convenience sampling

E.Quota sampling

AnswersB, C

Every element has an equal probability of selection.

Why this answer

Simple random sampling and systematic sampling are probability-based methods where every element has a known chance of selection.

MCQeasy

An analyst wants to identify outliers in a dataset using the IQR method. Which values are typically considered outliers?

A.Values below the mean or above the mean

B.Values below Q1 - IQR or above Q3 + IQR

C.Values below Q2 - 2*IQR or above Q2 + 2*IQR

D.Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR

AnswerD

Standard IQR outlier definition.

Why this answer

Outliers are values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

MCQeasy

In pandas, you have a DataFrame 'df' with columns 'product' and 'sales'. You want to calculate the total sales per product. Which method should you use?

A.df['sales'].apply(sum)

B.df.pivot_table(values='sales', index='product', aggfunc='sum')

C.df.groupby('product')['sales'].sum()

D.df.merge(df, on='product')

AnswerC

Correctly aggregates sales by product.

Why this answer

df.groupby('product')['sales'].sum() groups by product and sums sales. df.pivot_table with aggfunc='sum' returns the same totals but is more verbose for a single grouping key. df.merge is for joining DataFrames, and df['sales'].apply(sum) applies a function to the values of one column without grouping by product.

MCQeasy

In SQL, you want to retrieve all products whose names start with 'Pro'. Which WHERE clause should you use?

A.WHERE product_name LIKE '%Pro%'

B.WHERE product_name LIKE 'Pro_'

C.WHERE product_name = 'Pro'

D.WHERE product_name LIKE 'Pro%'

AnswerD

Matches product names starting with 'Pro'.

Why this answer

LIKE with pattern 'Pro%' matches strings starting with 'Pro' followed by any characters. '%Pro%' matches any string containing 'Pro', 'Pro_' matches 'Pro' plus one character, and 'Pro' is exact match.

MCQeasy

A data analyst wants to combine first_name and last_name columns into a single full_name column in a SQL query. Which string function should be used?

A.CONCAT()

B.UPPER()

C.LENGTH()

D.SUBSTRING()

AnswerA

CONCAT() concatenates strings.

Why this answer

CONCAT() joins two or more strings together.

MCQhard

You have a hierarchical table 'Employees' with columns emp_id, emp_name, manager_id (referencing emp_id). You need to generate a full reporting chain from a given employee up to the CEO. Which SQL construct is most appropriate?

A.Recursive CTE with UNION ALL

B.Non-recursive CTE

C.Window function with PARTITION BY

D.Self-join with multiple JOINs

AnswerA

Recursively joins the table to itself to traverse the hierarchy.

Why this answer

Recursive CTEs are designed for hierarchical data, allowing iteration through parent-child relationships. Non-recursive CTEs cannot loop. A self-join with multiple levels is possible but requires knowing the depth in advance, and window functions are not suitable for tree traversal.

MCQmedium

An analyst is performing EDA and wants to measure the strength and direction of linear relationship between two continuous variables. Which statistical measure should they compute?

A.Correlation

B.Standard deviation

C.Mean

D.Mode

AnswerA

Correlation measures linear relationship.

Why this answer

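The correlation coefficient (Pearson's r) ranges from -1 to +1: its sign gives the direction of the linear relationship and its magnitude gives the strength. Standard deviation, mean, and mode each describe a single variable, not a relationship between two.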
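
As a rough illustration, Pearson's r can be computed directly from its definition. The pearson_r helper and the three small series below are hypothetical, chosen so that one pair rises together and the other moves in opposite directions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]    # rises in step with ad_spend
discount = [10, 8, 6, 4, 2]   # falls as ad_spend rises

r_pos = pearson_r(ad_spend, revenue)    # close to +1: strong positive
r_neg = pearson_r(ad_spend, discount)   # close to -1: strong negative
print(round(r_pos, 3), round(r_neg, 3))
```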

MCQmedium

A data analyst is using pandas to read a CSV file named 'sales.csv'. Which line of code correctly reads the file into a DataFrame?

A.import csv; df = csv.read('sales.csv')

B.import pandas as pd; df = pd.read('sales.csv')

C.import numpy as np; df = np.read_csv('sales.csv')

D.import pandas as pd; df = pd.read_csv('sales.csv')

AnswerD

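pd.read_csv is the pandas function for reading CSV files.

Why this answer

The pandas function read_csv reads a CSV file into a DataFrame. The csv module has no read function, pandas has no top-level pd.read function, and NumPy has no read_csv function.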

MCQmedium

A data analyst is profiling a dataset and finds that the 'email' column contains some NULL values. Which SQL query can be used to count how many rows have a NULL email?

A.SELECT COUNT(email) FROM table WHERE email = NULL

B.SELECT SUM(CASE WHEN email IS NULL THEN 1 END) FROM table

C.SELECT COUNT(ISNULL(email)) FROM table

D.SELECT COUNT(*) FROM table WHERE email IS NULL

AnswerD

Correct: counts all rows with null email.

Why this answer

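COUNT(*) counts every row that passes the filter, and WHERE email IS NULL keeps only rows with a NULL email. Option A fails because email = NULL never evaluates to true, and COUNT(email) skips NULLs anyway. Option B returns the same count when at least one NULL exists, but returns NULL rather than 0 when there are none. Option C relies on ISNULL, which is vendor-specific; where the one-argument form exists (for example, MySQL) it returns 0 or 1, so COUNT counts every row.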
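
The behavior of options A, B, and D can be checked with a small sketch in Python's sqlite3 module. The customers table and its four rows are hypothetical. Option C is left out because SQLite has no ISNULL() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
# Hypothetical data: rows 2 and 4 have a NULL email.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (3, "c@example.com"), (4, None)],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Option A: "= NULL" is never true, so no rows pass the filter.
equals_null = scalar("SELECT COUNT(email) FROM customers WHERE email = NULL")
# Option B: works when at least one NULL exists.
sum_case = scalar(
    "SELECT SUM(CASE WHEN email IS NULL THEN 1 END) FROM customers"
)
# Option D: the straightforward count.
count_star = scalar("SELECT COUNT(*) FROM customers WHERE email IS NULL")

# Restrict to rows without NULL emails to see how B and D differ.
sum_case_no_nulls = scalar(
    "SELECT SUM(CASE WHEN email IS NULL THEN 1 END) "
    "FROM customers WHERE id IN (1, 3)"
)
count_star_no_nulls = scalar(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL AND id IN (1, 3)"
)

print(equals_null, sum_case, count_star)       # 0 2 2
print(sum_case_no_nulls, count_star_no_nulls)  # None 0
```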

MCQmedium

A data analyst runs the query: SELECT AVG(salary) FROM employees GROUP BY department HAVING AVG(salary) > 60000. What is the purpose of the HAVING clause?

A.It orders departments by average salary descending.

B.It filters departments where the average salary exceeds $60,000.

C.It returns only the department with the maximum average salary.

D.It filters individual employee rows with salary > 60000 before grouping.

AnswerB

HAVING filters groups based on aggregate conditions.

Why this answer

HAVING filters groups after aggregation, unlike WHERE which filters rows before aggregation.
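
The difference between the two filters shows up clearly on a small example. This is a sketch in Python's sqlite3 module with a made-up employees table; the department column is added to the SELECT list only to make the output readable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
# Hypothetical salaries for three departments.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("A", "Eng", 90000), ("B", "Eng", 50000),
     ("C", "Sales", 55000), ("D", "Sales", 58000),
     ("E", "Ops", 65000)],
)

# HAVING: average over ALL employees per department, then drop low groups.
having_rows = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department HAVING AVG(salary) > 60000 "
    "ORDER BY department"
).fetchall()

# WHERE: drop low-paid employees first, then average whoever is left.
where_rows = conn.execute(
    "SELECT department, AVG(salary) FROM employees "
    "WHERE salary > 60000 GROUP BY department "
    "ORDER BY department"
).fetchall()

print(having_rows)  # [('Eng', 70000.0), ('Ops', 65000.0)]
print(where_rows)   # [('Eng', 90000.0), ('Ops', 65000.0)]
```

Both queries return the same departments here, but the Eng average differs because WHERE removed one Eng employee before the average was taken.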

Multi-Selectmedium

A data analyst is cleaning text data in a SQL database. Which THREE string functions are commonly used to standardize and clean text? (Choose three.)

Select 3 answers

A.REPLACE

B.UPPER

C.LENGTH

D.TRIM

E.CONCAT

AnswersA, B, D

Replaces occurrences of a substring.

Why this answer

TRIM removes leading and trailing spaces, UPPER (or LOWER) standardizes case, and REPLACE substitutes substrings. CONCAT joins strings and LENGTH returns a string's length; neither standardizes existing values.

Multi-Selecthard

A data analyst is performing EDA on a dataset with numerical features. Which methods are appropriate for identifying outliers? (Select TWO).

Select 2 answers

A.Mean imputation

B.Pearson correlation coefficient

C.Z-score method

D.Standard deviation alone

E.Interquartile range (IQR) method

AnswersC, E

Points with |Z| > 3 are often considered outliers.

Why this answer

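The IQR method flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR as outliers. The Z-score method flags values whose |Z| exceeds a threshold (commonly 3). Mean imputation handles missing values, Pearson correlation measures association between two variables, and standard deviation alone describes spread without flagging individual points.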
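
Both rules can be sketched with Python's statistics module. The data list is hypothetical, and quartile values depend on the interpolation method (this uses the default of statistics.quantiles), so other tools may report slightly different fences.

```python
import statistics

# Hypothetical measurements: 20 typical values plus one extreme value.
data = [10, 11, 12, 13] * 5 + [100]

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Z-score method: flag values more than 3 standard deviations from the mean.
mean = statistics.fmean(data)
stdev = statistics.pstdev(data)
z_outliers = [x for x in data if abs((x - mean) / stdev) > 3]

print(iqr_outliers, z_outliers)  # [100] [100]
```

In small samples a single extreme value inflates the standard deviation, which can keep its Z-score below 3; the IQR fences are less sensitive to that effect.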

MCQmedium

A data analyst runs a query to count the number of customers in each city. The query uses COUNT(*) and GROUP BY city. However, the result includes NULL for some cities. What will COUNT(*) return for a group where the city is NULL?

A.NULL

B.0

C.The number of rows with NULL city

D.The number of non-NULL cities

AnswerC

COUNT(*) includes all rows, including those with NULL in the grouped column.

Why this answer

COUNT(*) counts all rows in a group, regardless of NULL values in any column. If the city is NULL, all rows in that group are counted.
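
A quick way to see this is to put COUNT(*) next to COUNT(city) in the same query. The sketch below uses Python's sqlite3 module with a hypothetical customers table in which three rows have no city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
# Hypothetical data: three customers have a NULL city.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "Oslo"), ("Bob", "Oslo"), ("Cy", None),
     ("Di", None), ("Ed", None), ("Flo", "Rome")],
)

# GROUP BY places all NULL cities in one group.
# COUNT(*) counts the rows; COUNT(city) counts only non-NULL city values.
counts = {
    city: (count_star, count_city)
    for city, count_star, count_city in conn.execute(
        "SELECT city, COUNT(*), COUNT(city) FROM customers GROUP BY city"
    )
}

print(counts[None])    # (3, 0)
print(counts["Oslo"])  # (2, 2)
```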

Multi-Selectmedium

A data analyst is validating referential integrity between orders and customers tables. Which TWO of the following checks should the analyst perform?

Select 2 answers

A.Check that every order has a non-null order_id

B.Check that no customer is deleted while having orders

C.Check that every customer_id in orders exists in customers

D.Check that customer names are unique

E.Check that order amounts are positive

AnswersB, C

Ensures no orphaned records.

Why this answer

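Referential integrity ensures that every foreign key value matches an existing primary key and that no orphaned records remain. The other checks cover entity integrity, uniqueness, and value validity rather than the relationship between the two tables.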
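
One common way to look for orphaned orders is an anti-join: LEFT JOIN orders to customers and keep the rows with no match. The sketch below uses Python's sqlite3 module; the table contents and the customer_id column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
# Hypothetical data: order 12 points at customer 3, which does not exist.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 2), (12, 3)]
)

# Orders whose customer_id has no matching row in customers.
orphaned_orders = [
    row[0]
    for row in conn.execute(
        "SELECT o.order_id FROM orders AS o "
        "LEFT JOIN customers AS c ON c.customer_id = o.customer_id "
        "WHERE c.customer_id IS NULL "
        "ORDER BY o.order_id"
    )
]

print(orphaned_orders)  # [12]
```

An empty result means every customer_id in orders exists in customers.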

Multi-Selectmedium

A data analyst needs to perform a stratified random sample of a customer database. Which TWO steps are essential for this sampling method? (Select two.)

Select 2 answers

A.Use simple random sampling on the whole population

B.Randomly select entire clusters of customers

C.Randomly select a proportional number from each stratum

D.Divide the population into homogeneous subgroups (strata)

E.Select every nth customer from a list

AnswersC, D

Proportional selection ensures representation.

Why this answer

Stratified sampling requires dividing the population into strata and then randomly sampling from each stratum.
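
The two steps can be sketched in plain Python. The stratified_sample helper, the region field, and the 60/30/10 split are all hypothetical; the helper takes at least one record per stratum, which is a design choice rather than part of the definition.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Split records into strata by `key`, then sample each stratum randomly."""
    rng = random.Random(seed)

    # Step 1: divide the population into strata.
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)

    # Step 2: draw a proportional random sample from every stratum.
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 60 North, 30 South, 10 West customers.
customers = (
    [{"id": i, "region": "North"} for i in range(60)]
    + [{"id": 60 + i, "region": "South"} for i in range(30)]
    + [{"id": 90 + i, "region": "West"} for i in range(10)]
)

sample = stratified_sample(customers, "region", fraction=0.1)
region_counts = Counter(c["region"] for c in sample)
print(dict(region_counts))  # {'North': 6, 'South': 3, 'West': 1}
```

The sample keeps the 6:3:1 ratio of the population, which a simple random sample of ten customers would not guarantee.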

MCQmedium

A data analyst needs to create a new column 'full_name' by concatenating 'first_name' and 'last_name' with a space. Which SQL function should be used in the SELECT clause?

A.COMBINE(first_name, last_name)

B.CONCAT(first_name, ' ', last_name)

C.JOIN(first_name, last_name)

D.first_name + ' ' + last_name

AnswerB

Correct: CONCAT joins strings.

Why this answer

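CONCAT concatenates its arguments, so CONCAT(first_name, ' ', last_name) yields the full name with a space in between. Some DBMSs also provide operators: || is the SQL-standard concatenation operator, and + works in SQL Server, but the question asks for a function. COMBINE is not a SQL function, and JOIN combines tables rather than strings.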

Multi-Selectmedium

A data analyst is conducting exploratory data analysis (EDA) on a dataset. Which TWO tasks are typically performed during EDA? (Select two.)

Select 2 answers

A.Create a sampling plan

B.Build a predictive regression model

C.Deploy the model to production

D.Identify outliers using the IQR method

E.Calculate correlation between variables

AnswersD, E

Outlier identification is part of EDA.

Why this answer

Outlier detection and correlation analysis are key EDA activities. Model building and data sampling are separate steps.

MCQmedium

A data quality assessment reveals that a column named 'email' contains values like 'user@example' (missing domain extension). Which data profiling technique would best identify such pattern violations?

A.Pattern analysis

B.Cardinality analysis

C.Referential integrity check

D.Data type verification

AnswerA

Identifies values that do not conform to expected formats.

Why this answer

Pattern analysis involves checking values against expected patterns (e.g., regex for email format). Cardinality counts distinct values, referential integrity checks relationships between tables, and data type verification checks data types.
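
A pattern check of this kind is often written as a regular expression. The sketch below uses Python's re module; EMAIL_PATTERN is a deliberately simplified, hypothetical pattern (it does not implement the full email address grammar), and the sample values are made up.

```python
import re

# Simplified pattern: local part, "@", domain, a dot, and a 2+ letter extension.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")

emails = [
    "user@example.com",
    "user@example",      # missing domain extension
    "admin@site.org",
    "no-at-sign.com",    # missing "@"
]

# Values that do not conform to the expected format.
violations = [e for e in emails if not EMAIL_PATTERN.fullmatch(e)]
print(violations)  # ['user@example', 'no-at-sign.com']
```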

MCQhard

A data analyst is writing a query to rank products by total sales within each category, showing dense rank and avoiding gaps. Which window function should be used?

A.ROW_NUMBER()

B.DENSE_RANK()

C.NTILE()

D.RANK()

AnswerB

DENSE_RANK() ranks without gaps.

Why this answer

DENSE_RANK() assigns ranks without gaps, so tied values get the same rank and the next rank is the next consecutive number.
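
The gap behavior is easiest to see with RANK() and DENSE_RANK() side by side. This sketch uses Python's sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added; the sales rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (category TEXT, product TEXT, total_sales INTEGER)"
)
# Hypothetical totals: Speaker and Headset tie within the Audio category.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Audio", "Speaker", 500), ("Audio", "Headset", 500),
     ("Audio", "Mic", 300), ("Video", "Camera", 900),
     ("Video", "Tripod", 200)],
)

# Rank products inside each category by total sales, highest first.
ranks = {
    product: (rank, dense_rank)
    for product, rank, dense_rank in conn.execute(
        "SELECT product, "
        "RANK() OVER (PARTITION BY category ORDER BY total_sales DESC), "
        "DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) "
        "FROM sales"
    )
}

print(ranks["Speaker"], ranks["Headset"])  # (1, 1) (1, 1)
print(ranks["Mic"])                        # (3, 2)
```

After the two-way tie, RANK() jumps to 3 while DENSE_RANK() continues with 2.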

MCQmedium

A data analyst wants to ensure a sample proportionally represents different regions in a population. Which sampling method should be used?

A.Simple random sampling

B.Cluster sampling

C.Systematic sampling

D.Stratified sampling

AnswerD

Stratified sampling ensures proportional representation from each stratum.

Why this answer

Stratified sampling divides the population into strata (regions) and samples proportionally from each.

MCQmedium

A data analyst is cleaning a dataset and finds that some cells in the 'email' column contain leading spaces. Which string function should be used to remove these spaces?

A.TRIM

B.LTRIM

C.REPLACE

D.SUBSTRING

AnswerA

TRIM removes both leading and trailing spaces.

Why this answer

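TRIM removes leading and trailing spaces from a string, so it clears the leading spaces and any trailing ones that may also be present. LTRIM removes only leading spaces and would also satisfy the requirement as literally stated; TRIM is the more general cleaning choice.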
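
The three trimming functions can be compared on one padded value. This is a sketch using Python's sqlite3 module, where TRIM, LTRIM, and RTRIM are built in; the padded email string is a made-up example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical value with two leading and two trailing spaces.
padded = "  a@b.com  "

trimmed, left_trimmed, right_trimmed = conn.execute(
    "SELECT TRIM(?), LTRIM(?), RTRIM(?)", (padded, padded, padded)
).fetchone()

print(repr(trimmed))        # 'a@b.com'
print(repr(left_trimmed))   # 'a@b.com  '
print(repr(right_trimmed))  # '  a@b.com'
```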

Multi-Selecthard

An analyst is using SQL to analyze employee data. Which THREE of the following are valid uses of the WHERE clause? (Select three.)

Select 3 answers

A.Sort the result set by hire_date

B.Filter groups after aggregation using HAVING

C.Filter rows where manager_id is NULL using IS NULL

D.Filter rows where the name starts with 'J' using LIKE

E.Filter rows where salary is between 50,000 and 70,000 using BETWEEN

AnswersC, D, E

IS NULL is used in WHERE to check for NULL values.

Why this answer

WHERE can filter using LIKE, BETWEEN, and IS NULL. HAVING is for aggregated results, and ORDER BY is for sorting.

MCQmedium

During EDA, an analyst calculates the Z-score for each data point in a dataset. A data point with a Z-score of 3.5 is identified. What does this indicate?

A.The data point has a high frequency

B.The data point is exactly at the mean

C.The data point is likely an outlier

D.The data point is within the interquartile range

AnswerC

A Z-score above 3 or below -3 is often considered an outlier.

Why this answer

A Z-score of 3.5 means the value is 3.5 standard deviations from the mean, commonly considered an outlier (threshold often >3 or <-3).

Multi-Selectmedium

A data analyst is performing data profiling on a customer table. Which TWO metrics are commonly used to assess the completeness of a column? (Select TWO.)

Select 2 answers

A.Row count

B.Null count

C.Cardinality

D.Mean

E.Standard deviation

AnswersA, B

Total rows; used to compute percentage complete.

Why this answer

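Completeness is measured from the null count and the row count: the null count gives the number of missing values, the row count gives the total, and (row count - null count) / row count is the share of the column that is populated.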
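
As a small worked example, the two metrics combine into a completeness ratio as follows. The email values are hypothetical, and None stands in for a database NULL.

```python
# Hypothetical column values; None represents NULL.
emails = ["a@example.com", None, "b@example.com", None, "c@example.com"]

row_count = len(emails)                          # total rows
null_count = sum(1 for e in emails if e is None) # missing values

# Share of rows that actually hold a value.
completeness = (row_count - null_count) / row_count
print(row_count, null_count, completeness)  # 5 2 0.6
```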

MCQeasy

In SQL, which string function would you use to remove leading and trailing spaces from a column named 'city'?

A.TRIM

B.RTRIM

C.LTRIM

D.CLEAN

AnswerA

Correct. TRIM removes both leading and trailing spaces.

Why this answer

TRIM removes leading and trailing spaces (or other specified characters) from a string. TRIM(city) returns the city without extra spaces.
