CCNA Define data structures and implement SQL for Business Intelligence Questions — Page 3 of 3

151

MCQ, hard

A retail company uses BigQuery to store sales data. The 'sales' table has 10 billion rows and is partitioned by transaction_date (daily). The BI dashboard runs a query that aggregates sales by product_category for the last 30 days. The query is slow and expensive. Which improvement is most effective?

A. Cluster the table on product_category

B. Change partitioning to monthly

C. Denormalize the product_category into the sales table

D. Use a materialized view with aggregation on product_category

Answer: A

Clustering on product_category organizes data within each partition so that queries filtering/aggregating on that column scan fewer blocks.

Why this answer

Option A is correct because clustering on product_category sorts the rows inside each daily partition by that column. Partition pruning already limits the query to the 30 most recent partitions; clustering then co-locates rows with the same category, so BigQuery can skip blocks when the query filters on product_category and can aggregate each category from contiguous data instead of rows scattered across the partition. Both effects target the non-time column that the dashboard query groups by.

Exam trap

Google Cloud often tests the distinction between partitioning, which limits the data read by time range, and clustering, which organizes data within each partition for column-based pruning. Candidates tend to pick a partitioning change or a materialized view without recognizing that clustering is what addresses a slow aggregation on a non-time column.

How to eliminate wrong answers

Option B is wrong because monthly partitions are coarser than daily ones: a 30-day window usually straddles two calendar months, so the query would read both monthly partitions in full instead of only the 30 daily partitions it needs, which raises both cost and latency. Option C is wrong because the query already aggregates by product_category on the sales table, which implies the column is stored there; the bottleneck is how the data is organized for pruning, not normalization. Option D is wrong as stated because a view aggregated only on product_category drops transaction_date, so it cannot answer a rolling last-30-days question; the view would also need to group by date, and it does nothing to improve how the base table itself is organized.
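
The effect of partition granularity described for Option B can be sketched with a small calculation. This is a toy model, not BigQuery itself: it assumes a hypothetical 1,000 rows per day over 120 days and assumes that any partition overlapping the 30-day window is read in full.

```python
from collections import Counter
from datetime import date, timedelta

ROWS_PER_DAY = 1_000  # hypothetical, uniform daily volume

# 120 days of data starting 2024-01-01 (illustrative dates only).
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(120)]
window_end = days[-1]
window_start = window_end - timedelta(days=29)  # 30-day window, inclusive


def rows_scanned(partition_of):
    """Rows read when every partition touching the window is read in full."""
    sizes = Counter()
    for d in days:
        sizes[partition_of(d)] += ROWS_PER_DAY
    touched = {partition_of(d) for d in days if window_start <= d <= window_end}
    return sum(sizes[p] for p in touched)


daily_scan = rows_scanned(lambda d: d)                      # one partition per day
monthly_scan = rows_scanned(lambda d: (d.year, d.month))    # one partition per month

print(daily_scan, monthly_scan)
```

In this setup the window runs from 31 March to 29 April, so monthly partitioning reads all of March plus the April data, roughly twice what the daily layout reads.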


152

MCQ, hard

A data engineer creates a clustered table in BigQuery with clustering order: country, city, product_id. The BI team frequently runs a query that filters on city and product_id but rarely on country. What is the most likely performance issue?

A. BigQuery allows only one clustering column per table.

B. The query does not filter on the first clustering column (country), so block pruning is minimal.

C. The table should be partitioned by country instead of clustered.

D. The query filters on too many clustering columns, causing overhead.

Answer: B

Clustering is optimized when filters include the leftmost clustering column.

Why this answer

BigQuery clustered tables use block pruning to skip reading blocks that don't match the query's filter. Pruning is most effective when the filter includes the first clustering column (country). Without it, BigQuery must scan more blocks, leading to higher query costs and slower performance.

Exam trap

Google Cloud often tests the misconception that any filter on clustering columns is equally effective, but the key is that pruning requires the first column in the clustering order to be filtered for maximum benefit.

How to eliminate wrong answers

Option A is wrong because BigQuery allows up to four clustering columns per table, not just one. Option C is wrong because partitioning by country would not help if the query rarely filters on country; partitioning is most beneficial for queries that filter on the partition column. Option D is wrong because filtering on multiple clustering columns does not cause overhead; it actually improves pruning, but the missing first column is the issue.
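
Why the leading clustering column matters can be illustrated with a simplified model of block pruning. The sketch below is not BigQuery's storage format: it assumes rows sorted by (country, city, product_id), fixed-size blocks, and per-block min/max metadata, and it uses made-up integer values for all three columns.

```python
from itertools import product

# Hypothetical data: 4 countries x 5 cities x 10 products, as integer codes.
rows = sorted(product(range(4), range(5), range(10)))  # sorted in clustering order

BLOCK_SIZE = 25  # hypothetical block size
blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]

COLUMNS = {"country": 0, "city": 1, "product_id": 2}


def blocks_read(**filters):
    """Count blocks whose min/max range could contain every filtered value."""
    count = 0
    for block in blocks:
        may_match = all(
            min(r[COLUMNS[name]] for r in block)
            <= value
            <= max(r[COLUMNS[name]] for r in block)
            for name, value in filters.items()
        )
        count += may_match
    return count


with_country = blocks_read(country=1, city=2, product_id=3)
without_country = blocks_read(city=2, product_id=3)

print(len(blocks), with_country, without_country)
```

Because the data is sorted by country first, every block covers a wide range of city and product_id values. A filter that omits country therefore cannot rule out any block in this model, while adding the country filter narrows the read to the blocks for that country.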


153

MCQ, hard

A BI dashboard query is slow and expensive. The query performs multiple joins on large tables and uses window functions. The data engineer suggests using materialized views. However, the query uses non-deterministic functions. Which limitation applies?

A. Materialized views cannot include non-deterministic functions

B. Materialized views cannot be updated automatically

C. Materialized views cannot be created with joins

D. Materialized views only support simple aggregation

Answer: A

Materialized views require deterministic expressions so that the stored result stays consistent with the base tables as they change.

Why this answer

A materialized view physically stores the result of its defining query, like a table. If that query includes non-deterministic functions (e.g., RAND(), CURRENT_TIMESTAMP(), CURRENT_DATE()), the stored result can go stale even when the base tables have not changed, because the function returns a different value each time it is evaluated. BigQuery therefore rejects non-deterministic functions in materialized view definitions, and some other engines, such as Snowflake, impose a similar restriction.

Exam trap

Google Cloud often tests the misconception that materialized views are 'static' and cannot be refreshed, or that they only support simple aggregations, when the real limitation is the prohibition of non-deterministic functions to ensure data consistency.

How to eliminate wrong answers

Option B is wrong because materialized views can be refreshed automatically; BigQuery refreshes them in the background by default, and other engines offer scheduled or on-commit refresh, though the result is not always updated in real time. Option C is wrong because materialized views can include joins and are often used to pre-join large tables for performance. Option D is wrong because materialized views are not limited to simple aggregation; the exact set of supported joins, aggregations, and window functions varies by engine and view type, but the restriction that applies in this scenario is the non-deterministic function.
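
The staleness problem can be shown without a database. The sketch below is a hypothetical stand-in: `recent_events` plays the role of a query whose filter depends on the current time, `now` is passed explicitly as an integer tick in place of a real clock function, and the event data is made up.

```python
# Hypothetical event log: (event_id, event_time), with time as an integer tick.
EVENTS = [("a", 1), ("b", 5), ("c", 9)]


def recent_events(now, window=5):
    """Stand-in for a query such as: WHERE event_time >= <current time> - window."""
    return [event_id for event_id, t in EVENTS if now - t <= window]


# "Materialize" the result at tick 10, then re-evaluate the same query at tick 14.
stored_result = recent_events(now=10)
fresh_result = recent_events(now=14)

# EVENTS was never modified, yet the stored result no longer matches the query.
print(stored_result, fresh_result)
```

An engine that refreshes a materialized view only when base tables change has no trigger for this kind of drift, which is one way to see why such functions are disallowed in the definition.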


154

MCQ, medium

A BI query uses COUNT(column) to count non-null values and COUNT(*) to count all rows. The analyst expects both counts to be equal, but COUNT(column) returns a smaller number. What is the most likely explanation?

A. The query has a WHERE clause that filters some rows.

B. COUNT(*) is faster, so it's not accurate.

C. COUNT(*) counts duplicate rows, while COUNT(column) does not.

D. The column contains NULL values, which are not counted by COUNT(column).

Answer: D

COUNT(column) only counts non-null values.

Why this answer

COUNT(column) ignores NULL values in the specified column, while COUNT(*) counts every row in the result set regardless of NULLs. If the column contains any NULLs, COUNT(column) will return a lower number. This is a fundamental SQL behavior defined in the ANSI SQL standard and is consistent across all major BI platforms (e.g., Tableau, Power BI, Looker) that generate SQL queries.

Exam trap

Google Cloud often tests the subtle distinction between COUNT(*) and COUNT(column) by embedding NULL values in the column, tempting candidates to incorrectly attribute the difference to duplicates or filtering.

How to eliminate wrong answers

Option A is wrong because a WHERE clause filters rows before aggregation, so COUNT(column) and COUNT(*) are both computed over the same filtered rows; the filter alone cannot make the two counts differ. Option B is wrong because COUNT(*) is not less accurate; both functions return exact counts over the same rows, and any performance difference has no bearing on correctness. Option C is wrong because both COUNT(*) and COUNT(column) count duplicates; COUNT(column) counts every non-null occurrence of the column, so duplicates do not cause a difference.
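
The behavior is easy to reproduce with a small in-memory table. The example below uses SQLite through Python's standard `sqlite3` module rather than BigQuery, and the `orders` table and its `coupon_code` column are hypothetical; the NULL handling of COUNT shown here is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, coupon_code TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "SAVE10"),
        (2, None),      # NULL coupon_code
        (3, "SAVE10"),  # duplicate value, still counted by COUNT(column)
        (4, None),      # NULL coupon_code
        (5, "VIP"),
    ],
)

total_rows, non_null, distinct_values = conn.execute(
    "SELECT COUNT(*), COUNT(coupon_code), COUNT(DISTINCT coupon_code) FROM orders"
).fetchone()

print(total_rows, non_null, distinct_values)
conn.close()
```

The gap between the first two numbers equals the number of NULLs in the column. Only COUNT(DISTINCT column) removes duplicates, which is why Option C does not explain the discrepancy.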


155

Multi-Select, hard

A company is designing a data model for a BI dashboard that requires real-time updates and historical analysis. Which THREE practices should be followed?

Select 3 answers

A. Use clustering on frequently filtered columns.

B. Use streaming inserts for real-time data.

C. Create a separate table for each day's data.

D. Use the default BigQuery table expiration setting.

E. Use partitioning by ingestion time for continuous data.

Answers: A, B, E

Clustering orders data within partitions, improving filter performance.

Why this answer

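Option A is correct because clustering on frequently filtered columns in BigQuery organizes data into blocks based on the values of those columns, allowing queries that filter on those columns to skip irrelevant blocks entirely. This reduces the amount of data scanned, which improves query performance and lowers cost for a dashboard that needs fast historical analysis. Option B is correct because streaming inserts make rows available for querying shortly after they arrive, which meets the real-time requirement. Option E is correct because ingestion-time partitioning places continuously arriving rows into date-based partitions automatically, so historical queries can be pruned to the relevant dates.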

Exam trap

Google Cloud often tests the misconception that creating separate tables for daily data is a good practice for time-series data, when in fact BigQuery's partitioning and clustering features are designed to handle such data more efficiently and with less administrative overhead.
