CCNA Data Structures Sql Bi Questions — Page 2 of 3

Matchingmedium

Match each Cloud SQL high-availability feature to its description.

Concepts

Matches

Synchronous replication across two zones

Standby instance in a different zone for automatic failover

Asynchronous replica for read offloading

Promotion of standby on primary failure

Point-in-time recovery and disaster recovery

Why these pairings

These features ensure availability and durability for Cloud SQL instances.

MCQhard

The query above fails with 'Resources exceeded: UDF out of memory' on a large table. What is the best way to fix this?

A.Rewrite the function as a SQL UDF to avoid JavaScript overhead

B.Add a GROUP BY clause to reduce the number of rows processed

C.Convert the temporary UDF to a persistent UDF

D.Increase the memory allocation for JavaScript UDFs

AnswerA

SQL UDFs run natively in BigQuery's execution engine and do not have the same memory constraints.

Why this answer

Option A is correct because JavaScript UDFs in BigQuery run in a sandbox with a fixed, limited amount of memory per UDF instance. When processing a large table, the UDF may exceed this limit due to per-row overhead or large intermediate results. Rewriting the function as a SQL UDF eliminates the JavaScript overhead: the logic runs natively within BigQuery's distributed execution engine, which is not bound by the sandbox's memory limit.

Exam trap

Google Cloud often tests the misconception that memory errors in UDFs can be fixed by increasing resources or changing UDF persistence, when the real limitation is the fixed JavaScript sandbox memory that can only be avoided by using SQL UDFs.

How to eliminate wrong answers

Option B is wrong because adding a GROUP BY clause does not reduce the number of rows processed by the UDF; it only aggregates results after the UDF runs, so the memory issue persists. Option C is wrong because converting a temporary UDF to a persistent UDF does not change the execution environment or memory limits; both types of JavaScript UDFs share the same sandbox memory constraints. Option D is wrong because BigQuery does not allow users to increase memory allocation for JavaScript UDFs; the sandbox memory is fixed and cannot be adjusted.

MCQhard

A BI team is designing a BigQuery table for a sales dashboard that queries daily sales by product category and region. The dashboard often filters on a specific date range and a specific region. Which combination of partitioning and clustering should be used?

A.Partition by region, cluster by date

B.Use only clustering on date and region without partitioning

C.Partition by date, cluster by region

D.Partition by month, cluster by date

AnswerC

Partitioning by date limits the scan to the requested date range; clustering by region organizes data so that irrelevant blocks are skipped.

Why this answer

Partitioning by date (e.g., on a DATE or TIMESTAMP column) allows BigQuery to prune entire partitions when the dashboard filters on a specific date range, reducing the amount of data scanned. Clustering by region then sorts the data within each partition by region, enabling efficient block-level pruning when the dashboard filters on a specific region. This combination optimizes both the date range and region filters, which are the most common query patterns for this sales dashboard.

Exam trap

Google Cloud often tests two misconceptions: that partitioning can be applied to any column type (like a string region column), and that clustering alone is sufficient for date range filtering. Partitioning must be on a DATE, TIMESTAMP, or DATETIME column, an integer range, or ingestion time, and clustering complements but does not replace partitioning for range-based pruning.

How to eliminate wrong answers

Option A is wrong because BigQuery cannot partition on a string column like region (partitioning supports DATE, TIMESTAMP, and DATETIME columns, integer ranges, and ingestion time), and clustering by date would not provide the same pruning benefit for date range filters as partitioning by date does. Option B is wrong because clustering alone gives no partition pruning: block-level pruning on the clustered columns can still reduce the scan, but less predictably than skipping whole date partitions, so cost and performance are typically worse than with a date-partitioned table. Option D is wrong because partitioning by month is too coarse for a dashboard that often filters on a specific date range (e.g., a few days or weeks), so entire monthly partitions are scanned even when only a few days are needed, and clustering by date does nothing for the region filter.
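
The pruning behavior described above can be sketched with a toy model in Python. This is only an illustration, not how BigQuery stores data: the `partitions` dictionary, the row counts, and the `rows_read` helper are all assumptions made up for the example, and the cluster pruning is idealized so that only the matching region's block is read.

```python
from datetime import date, timedelta

REGIONS = ["APAC", "EMEA", "LATAM", "NA"]
ROWS_PER_REGION_PER_DAY = 25  # hypothetical volume

# One partition per day; rows inside a partition are sorted (clustered) by region.
partitions = {
    date(2024, 1, 1) + timedelta(days=offset): [
        (region, 10.0) for region in REGIONS for _ in range(ROWS_PER_REGION_PER_DAY)
    ]
    for offset in range(30)
}


def rows_read(start, end, region=None):
    """Count rows touched by a query filtering on a date range and, optionally, a region."""
    total = 0
    for day, rows in partitions.items():
        if not (start <= day <= end):
            continue  # partition pruning: the whole day is skipped
        if region is None:
            total += len(rows)
        else:
            # Idealized cluster pruning: only the block for this region is read.
            total += sum(1 for row_region, _ in rows if row_region == region)
    return total


table_size = sum(len(rows) for rows in partitions.values())
week_all_regions = rows_read(date(2024, 1, 8), date(2024, 1, 14))
week_one_region = rows_read(date(2024, 1, 8), date(2024, 1, 14), region="EMEA")

print(table_size, week_all_regions, week_one_region)
```

In this model the date filter alone cuts the rows read from the full table to one week of partitions, and the region filter then narrows each of those partitions to a single block.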

MCQmedium

A data engineer runs a BigQuery query that joins a large fact table with a small lookup table. The query processes 1 TB of data and takes 30 seconds. The engineer wants to reduce the amount of data processed. Which optimization technique is MOST effective?

A.Increase the number of slots available for the query.

B.Use a WITH clause to pre-filter the fact table before joining.

C.Cluster the lookup table on the join key.

D.Materialize the lookup table as a separate table with the same data.

AnswerB

Pre-filtering reduces the amount of data from the fact table that needs to be joined.

Why this answer

Option B is correct because pre-filtering the fact table with a WITH clause (CTE) restricts the fact table before the join occurs. Since the fact table is large (1 TB), applying filters early minimizes the data shuffled and joined. Bytes billed fall when the filter lets BigQuery skip data, for example by pruning partitions or clustered blocks or by selecting fewer columns from its columnar storage; of the four options, this is the only one that can reduce the data processed.

Exam trap

The trap here is that candidates confuse query performance (speed) with data processed (cost), often choosing to increase slots (Option A) which only reduces elapsed time but does not lower the bytes billed.

How to eliminate wrong answers

Option A is wrong because increasing slots only speeds up query execution (reduces elapsed time) but does not reduce the amount of data processed; the query still scans 1 TB. Option C is wrong because clustering the lookup table on the join key improves join performance by reducing shuffle, but the lookup table is already small, so the impact on data processed is negligible; the bottleneck is the large fact table. Option D is wrong because materializing the lookup table as a separate table with the same data does not change the amount of data processed; it only duplicates storage without reducing the fact table scan.

MCQeasy

In BigQuery, a BI analyst wants to store financial data with high precision and avoid rounding errors. Which data type should be used for currency columns?

A.NUMERIC

B.FLOAT64

C.INT64

D.STRING

AnswerA

NUMERIC is a fixed-point decimal type designed for financial precision.

Why this answer

NUMERIC (also known as DECIMAL) is the correct choice because it stores exact decimal values with up to 38 digits of precision and 9 digits of scale, making it ideal for financial data where rounding errors from binary floating-point representation are unacceptable. In BigQuery, NUMERIC uses fixed-point arithmetic, ensuring that calculations like tax or interest accruals remain exact to the supported decimal places.

Exam trap

Google Cloud often tests the misconception that FLOAT64 is acceptable for currency because it 'has enough precision,' but the trap is that binary floating-point types inherently cannot represent many decimal fractions exactly, causing cumulative rounding errors in financial data.

How to eliminate wrong answers

Option B is wrong because FLOAT64 is a binary floating-point type that approximates values, leading to rounding errors in financial calculations due to its base-2 representation (e.g., 0.1 cannot be represented exactly). Option C is wrong because INT64 stores only whole integers, losing the fractional cents required for currency columns. Option D is wrong because STRING stores text, not numeric values, and would require costly and error-prone conversions for any arithmetic operations.
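
The rounding problem is easy to reproduce outside BigQuery. The sketch below uses Python's `float` (a binary floating-point type, like FLOAT64) and `decimal.Decimal` (an exact decimal type, used here only as an analogue of NUMERIC); the variable names are illustrative.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2

# Exact decimal arithmetic keeps the value a ledger would expect.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)                # False
print(decimal_sum == Decimal("0.3"))   # True
```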

MCQmedium

A data engineering team ingests JSON logs into BigQuery using a streaming pipeline. Queries need to extract specific fields from nested arrays. Which SQL construct should be used to efficiently transform the nested data into a flat table for BI?

A.ARRAY_AGG with STRUCT

B.STRUCT with nested field access

C.SELECT * EXCEPT with UNNEST

D.UNNEST with CROSS JOIN

AnswerD

UNNEST flattens arrays into rows, allowing access to nested fields.

Why this answer

Option D is correct because `UNNEST` with `CROSS JOIN` is the standard SQL construct in BigQuery to flatten nested arrays (repeated fields) into a flat table. When JSON logs contain arrays of structs, `CROSS JOIN UNNEST(array_column)` expands each array element into its own row, allowing BI tools to access individual fields directly. This is the most efficient and idiomatic way to transform nested data into a relational format for querying.

Exam trap

Google Cloud often tests the confusion between aggregation (`ARRAY_AGG`) and unnesting (`UNNEST`), where candidates mistakenly think `ARRAY_AGG` can flatten data because it deals with arrays, but it actually does the reverse operation.

How to eliminate wrong answers

Option A is wrong because `ARRAY_AGG` with `STRUCT` does the opposite—it aggregates rows into nested arrays, not flattens them. Option B is wrong because `STRUCT` with nested field access only retrieves scalar values from a single struct, not from array elements, and cannot unnest multiple rows. Option C is wrong because `SELECT * EXCEPT` is used to exclude columns from a SELECT *, not to flatten arrays; it does not involve `UNNEST` in a meaningful way for array expansion.
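
As a rough analogue of what `CROSS JOIN UNNEST` does, the Python sketch below flattens a repeated field into one row per array element. The `log_rows` data and the field names (`request_id`, `events`, `name`, `ms`) are hypothetical, and the SQL in the comment is only the approximate BigQuery counterpart.

```python
# Hypothetical log rows: each row carries a repeated field (an array of structs).
log_rows = [
    {"request_id": "r1", "events": [{"name": "click", "ms": 12}, {"name": "view", "ms": 30}]},
    {"request_id": "r2", "events": [{"name": "click", "ms": 7}]},
    {"request_id": "r3", "events": []},
]

# Roughly equivalent to:
#   SELECT request_id, e.name, e.ms FROM logs CROSS JOIN UNNEST(events) AS e
flat_rows = [
    (row["request_id"], event["name"], event["ms"])
    for row in log_rows
    for event in row["events"]
]

for flat_row in flat_rows:
    print(flat_row)
```

Row `r3` has an empty array and therefore contributes no output rows, which matches the behavior of a cross join against an empty array.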

MCQeasy

The exhibit shows IAM policy for a BigQuery dataset. The BI team reports they can query tables but cannot create views. What is the missing role?

A.roles/bigquery.admin

B.roles/bigquery.metadataViewer

C.roles/bigquery.dataEditor

D.roles/bigquery.user

AnswerC

DataEditor includes permissions to create tables and views.

Why this answer

The BI team can query tables but cannot create views, which requires write access to the dataset. The `roles/bigquery.dataEditor` role grants permissions to read, create, update, and delete tables and views in the dataset, including the `bigquery.tables.create` and `bigquery.tables.update` permissions necessary for view creation. The existing query capability indicates they have at least `roles/bigquery.dataViewer`, but view creation demands the additional write permissions provided by `dataEditor`.

Exam trap

The trap here is that candidates confuse the ability to query tables (which only requires `dataViewer` or `user`) with the write permissions needed to create views, leading them to incorrectly select `roles/bigquery.user` or `roles/bigquery.metadataViewer`.

How to eliminate wrong answers

Option A is wrong because `roles/bigquery.admin` grants full control over BigQuery resources, including dataset deletion and IAM policy management, which is excessive and not the minimal missing role for view creation. Option B is wrong because `roles/bigquery.metadataViewer` only allows viewing dataset and table metadata (e.g., table names, schemas) but does not include the `bigquery.tables.create` permission needed to create views. Option D is wrong because `roles/bigquery.user` enables running queries and listing datasets but does not grant write permissions such as `bigquery.tables.create` or `bigquery.tables.update`, which are required for creating views.

MCQmedium

You are a database engineer at a retail company. The company uses BigQuery for BI, with a fact table 'sales_fact' partitioned by order_date and containing 100 million rows. There is a dimension table 'products' with 10,000 rows. The BI team reports that the following query takes over 5 minutes to run: SELECT p.category, SUM(s.amount) FROM sales_fact s JOIN products p ON s.product_id = p.product_id WHERE s.order_date >= '2024-01-01' AND s.order_date < '2024-04-01' GROUP BY p.category. The table 'products' is not partitioned or clustered. 'sales_fact' is partitioned by order_date but not clustered. The query only scans 3 months of data (about 25 million rows). However, the join seems slow. What is the most likely cause and what single action would you take to improve performance?

A.Cluster the 'sales_fact' table on product_id

B.Use a cross-join to avoid the join

C.Add an index on 'products.product_id'

D.Partition the 'products' table

AnswerA

Clustering on the join key reduces shuffle and speeds up join.

Why this answer

The query is slow because the join on `product_id` requires shuffling 25 million rows from `sales_fact` across nodes to match with `products`. Clustering `sales_fact` on `product_id` co-locates rows with the same `product_id` within each partition, reducing shuffle overhead and enabling more efficient broadcast or hash joins in BigQuery. This is the most impactful single action because it directly addresses the join performance bottleneck without changing the query logic.

Exam trap

Google Cloud often tests the misconception that indexes or partitioning small tables solve join performance issues, when the real solution in BigQuery is clustering the large fact table on the join key to minimize data shuffling.

How to eliminate wrong answers

Option B is wrong because a cross-join would produce a Cartesian product of 25 million × 10,000 rows, which is computationally prohibitive and would make the query far slower, not faster. Option C is wrong because BigQuery does not support traditional indexes; it uses columnar storage and clustering for data organization, so adding an index is not a valid action. Option D is wrong because partitioning the `products` table (only 10,000 rows) provides no benefit for a small dimension table; the bottleneck is the large fact table join, not the products table scan.
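
The size of the Cartesian product in Option B can be checked with simple arithmetic. The sketch below uses the row counts from the question, then contrasts a cross join with a keyed join on two tiny, made-up tables (`sales` and `products` here are illustrative lists, not the real tables).

```python
from itertools import product

# Row counts taken from the question.
fact_rows = 25_000_000
dimension_rows = 10_000
cross_join_rows = fact_rows * dimension_rows

# Tiny hypothetical tables: (product_id, amount) and (product_id, category).
sales = [("p1", 10), ("p2", 20), ("p1", 5)]
products = [("p1", "toys"), ("p2", "books")]

cross_joined = list(product(sales, products))                      # every pairing
inner_joined = [(s, p) for s, p in cross_joined if s[0] == p[0]]   # matching keys only

print(cross_join_rows, len(cross_joined), len(inner_joined))
```

Because `product_id` is unique in the dimension table, the keyed join returns at most one row per sales row, while the cross join multiplies the two row counts.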

MCQmedium

Which SQL function in BigQuery is best for replacing NULL values in a numeric column with a default value?

A.NULLIF

B.NVL

C.IFNULL

D.COALESCE

AnswerD

COALESCE is standard, flexible, and preferred for portability. It can handle multiple columns.

Why this answer

Option D, COALESCE, is correct because it returns the first non-NULL value from a list of expressions, making it ideal for replacing NULLs in a numeric column with a default value. In BigQuery, COALESCE is the standard, flexible function that can handle multiple arguments, unlike IFNULL, which accepts only two. It is part of the ANSI SQL standard and is the recommended approach for NULL handling in numeric columns.

Exam trap

Google Cloud often tests the distinction between IFNULL and COALESCE, trapping candidates who think IFNULL is always the best choice because it's simpler, when COALESCE is more versatile and ANSI-compliant for multiple fallback values.

How to eliminate wrong answers

Option A is wrong because NULLIF returns NULL if two expressions are equal, not a default value for NULLs; it's used for conditional NULL creation, not replacement. Option B is wrong because NVL is not a valid function in BigQuery; it exists in Oracle and other databases but BigQuery does not support it, making it a distractor for candidates familiar with other SQL dialects. Option C is wrong because IFNULL, while valid in BigQuery and capable of replacing a single NULL with a default, is less flexible than COALESCE as it only accepts two arguments; the question asks for the 'best' function, and COALESCE is preferred for its ability to handle multiple fallback values and its ANSI compliance.
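
The three valid functions can be compared side by side. The sketch below runs them in SQLite through Python's `sqlite3` module, purely as a stand-in: SQLite is not BigQuery, but `COALESCE`, `IFNULL`, and `NULLIF` behave the same way for this case. The `payments` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, discount REAL, promo_discount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, 5.0, None), (2, None, 2.5), (3, None, None)],
)

rows = conn.execute(
    """
    SELECT id,
           COALESCE(discount, promo_discount, 0) AS coalesced,  -- first non-NULL of several
           IFNULL(discount, 0) AS ifnull_default,               -- exactly two arguments
           NULLIF(discount, 5.0) AS nullif_result               -- NULL when the values are equal
    FROM payments
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

Row 2 shows the difference: `COALESCE` falls through to `promo_discount`, while `IFNULL` can only fall back to the single default.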

Drag & Dropmedium

Order the steps to export data from Cloud Bigtable to Cloud Storage using Dataflow.

Why this order

First create the Cloud Storage bucket, then set up the Dataflow job from a template, configure it, run it, and verify the exported output.

MCQeasy

A marketing team needs to analyze customer behavior using BigQuery. They want to create a table that stores the first and last purchase date for each customer from the `orders` table. Which SQL approach should they use?

A.SELECT customer_id, (SELECT order_date FROM orders ORDER BY order_date LIMIT 1) AS first_purchase, ...

B.SELECT o1.customer_id, o1.order_date AS first_purchase, o2.order_date AS last_purchase FROM orders o1 JOIN orders o2 ON o1.customer_id = o2.customer_id

C.SELECT customer_id, order_date AS first_purchase, ... FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS rn) WHERE rn = 1

D.SELECT customer_id, MIN(order_date) AS first_purchase, MAX(order_date) AS last_purchase FROM orders GROUP BY customer_id

AnswerD

Simple and efficient aggregation.

Why this answer

Option D is correct because it uses aggregate functions MIN() and MAX() with GROUP BY customer_id to directly compute the first and last purchase dates from the orders table. This is the most efficient and idiomatic SQL approach in BigQuery, leveraging the database engine's built-in aggregation to avoid self-joins or subqueries.

Exam trap

Google Cloud often tests the misconception that window functions or self-joins are necessary for per-group min/max calculations, when in fact simple aggregation with GROUP BY is the correct and efficient solution.

How to eliminate wrong answers

Option A is wrong because the subquery lacks a correlation to the outer customer_id, returning the same global first order date for all customers instead of per-customer values. Option B is wrong because the self-join without aggregation or date filtering produces a Cartesian product of all order pairs per customer, not the first and last dates. Option C is wrong because it only captures the first purchase (ROW_NUMBER() = 1) and omits the last purchase date entirely, failing to meet the requirement for both dates.
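
The difference between Option D and Option A shows up even on a four-row table. The sketch below runs both patterns in SQLite via Python's `sqlite3` module as a stand-in for BigQuery; the sample `orders` rows are made up, and Option A's query is completed with `DISTINCT` only so that the output stays small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id TEXT, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("c1", "2024-01-05"), ("c1", "2024-03-10"), ("c1", "2024-02-01"), ("c2", "2024-02-14")],
)

# Option D: one aggregation pass per customer.
per_customer = conn.execute(
    """
    SELECT customer_id, MIN(order_date) AS first_purchase, MAX(order_date) AS last_purchase
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

# Option A pattern: the subquery is not correlated with the outer customer_id.
uncorrelated = conn.execute(
    """
    SELECT DISTINCT customer_id,
           (SELECT order_date FROM orders ORDER BY order_date LIMIT 1) AS first_purchase
    FROM orders
    ORDER BY customer_id
    """
).fetchall()

print(per_customer)
print(uncorrelated)
```

With the uncorrelated subquery, customer `c2` is assigned `c1`'s earliest order date, which is the failure described for Option A.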

MCQmedium

A data engineer notices that a scheduled query exporting BigQuery data to Cloud Storage is failing with a timeout error. The dataset contains 500 million rows. What should they do?

A.Use SELECT * without filters.

B.Change the export format from CSV to Avro.

C.Increase the query timeout setting.

D.Export each partition separately.

AnswerD

Smaller exports avoid timeout limits.

Why this answer

Option D is correct because exporting a 500-million-row table in a single scheduled query can exceed BigQuery's 6-hour query execution time limit. By exporting each partition separately, you reduce the data volume per export job, allowing each to complete within the limit. This approach relies on the table being partitioned, so that the export can be split into smaller jobs that can also run in parallel.

Exam trap

Google Cloud often tests the misconception that timeout errors can be resolved by increasing a timeout setting, but in BigQuery, export job timeouts are fixed and cannot be changed, so the correct approach is to reduce the data per export job.

How to eliminate wrong answers

Option A is wrong because using SELECT * without filters does not reduce the data volume; it exports all 500 million rows, which is the root cause of the timeout. Option B is wrong because changing the export format from CSV to Avro does not address the timeout; the job still has to process the same volume of data. Option C is wrong because the execution time limit the job is hitting cannot be raised by the user; the fix is to reduce the work done per job.
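
One way to drive per-partition exports is to generate one date window per day and run a separate job for each. The sketch below only builds the windows and the matching filter text; it does not call BigQuery. The helper `daily_windows` and the column name `partition_date` are hypothetical, chosen for illustration.

```python
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (window_start, window_end) pairs, one per day, covering [start, end)."""
    current = start
    while current < end:
        yield current, current + timedelta(days=1)
        current += timedelta(days=1)


windows = list(daily_windows(date(2024, 1, 1), date(2024, 1, 4)))

# Each window becomes the date filter for one export job (filter text is illustrative only).
filters = [
    f"partition_date >= '{window_start.isoformat()}' AND partition_date < '{window_end.isoformat()}'"
    for window_start, window_end in windows
]

for export_filter in filters:
    print(export_filter)
```

Half-open windows avoid exporting the same day twice when consecutive jobs are run.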

MCQeasy

A company is designing a BigQuery data model for a business intelligence dashboard that shows sales by region and product. The data is refreshed daily. Which schema design is MOST cost-effective and performant for this use case?

A.A table with nested repeated columns for regions and products within each sale.

B.A star schema with a fact table for sales and separate dimension tables for region and product.

C.A fully normalized schema with separate tables for each attribute.

D.A single flat table containing all sales, region, and product columns.

AnswerB

Star schemas are optimized for BI workloads, reducing data scanned and improving query performance.

Why this answer

Option B is correct because a star schema with a fact table for sales and dimension tables for region and product is optimized for analytical queries in BigQuery. Option A is wrong because nested repeated columns suit hierarchical data, not simple dimensional analysis by region and product. Option C is wrong because a fully normalized schema requires many joins, which is not ideal for BI queries and increases complexity.

Option D is wrong because a single flat table repeats region and product attributes on every sales row, leading to higher storage costs and more data scanned whenever those repeated attributes are queried.

MCQhard

A BigQuery table is partitioned by ingestion time (pseudo column _PARTITIONTIME) and uses the default partition expiration of 90 days. A data engineer runs a DELETE statement to remove rows older than 100 days. Why does this query process more bytes than expected?

A.The table is not partitioned; it is clustered.

B.The DELETE statement does not use a WHERE clause on a clustering column.

C.The DELETE statement filters on a custom timestamp column instead of _PARTITIONTIME.

D.The DELETE statement must scan all partitions because it uses a condition that does not prune partitions.

AnswerD

Without a filter on _PARTITIONTIME or a partition column, the query scans all partitions.

Why this answer

Option D is correct because the DELETE statement uses a condition that does not reference the partitioning column (_PARTITIONTIME) in a way that allows partition pruning. Since the table is partitioned by ingestion time, BigQuery must scan all partitions to evaluate the filter, even though the condition logically targets rows older than 100 days. This results in processing more bytes than expected, as the default partition expiration of 90 days does not reduce the scan scope when the WHERE clause does not leverage the partitioning column.

Exam trap

Google Cloud often tests the misconception that a time-based filter on any timestamp column will trigger partition pruning, when in fact only filters on the specific partitioning column (like _PARTITIONTIME) enable partition elimination.

How to eliminate wrong answers

Option A is wrong because the table is explicitly described as partitioned by ingestion time. Option B is wrong because clustering columns are irrelevant for partition pruning; the issue is partition-level filtering, not clustering. Option C is wrong because the question never says which column the filter uses: filtering on a custom timestamp column would indeed prevent pruning, since that column is not the partitioning column, but it is only one instance of the general cause stated in Option D, a predicate that does not prune on _PARTITIONTIME.
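
The effect of filtering on the wrong column can be illustrated with a toy model in Python. This is a sketch under made-up assumptions, not BigQuery's storage layout: the `partitions` dictionary is keyed by ingestion day, and `event_date` is a hypothetical custom timestamp column stored inside each row.

```python
from datetime import date, timedelta

TODAY = date(2024, 6, 1)  # fixed so the example is reproducible

# Partition key = ingestion day. Each row also carries a hypothetical event_date column.
partitions = {
    TODAY - timedelta(days=age): [
        {"event_date": TODAY - timedelta(days=age + lag)} for lag in (0, 1, 2)
    ]
    for age in range(120)
}

cutoff = TODAY - timedelta(days=100)

# Filter on the partition key: only partitions older than the cutoff are opened.
opened_with_partition_filter = [day for day in partitions if day < cutoff]

# Filter on event_date: any partition could hold a matching row, so all are opened.
opened_with_column_filter = []
matching_rows = 0
for day, rows in partitions.items():
    opened_with_column_filter.append(day)
    matching_rows += sum(1 for row in rows if row["event_date"] < cutoff)

print(len(opened_with_partition_filter), len(opened_with_column_filter), matching_rows)
```

The column filter has to open every partition to find its matches, which is why the bytes processed track the whole table rather than the rows that end up deleted.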

MCQhard

A Looker developer configured a new connection to BigQuery as shown. The connection test fails with the error above. What is the most likely cause?

A.The dataset mydataset does not exist in the project

B.The BigQuery query quota has been exceeded for the project

C.The Looker instance is located in a different region than the BigQuery dataset

D.The Looker service account lacks the required BigQuery roles on the dataset

AnswerD

The error 'Access Denied' indicates missing IAM permissions for the service account.

Why this answer

Option D is correct because the error indicates a permissions issue during the connection test. Looker uses a service account to authenticate to BigQuery, and if that service account lacks the required BigQuery roles (e.g., BigQuery Data Viewer, BigQuery Job User) on the dataset, the connection test will fail with an access denied error. The error message shown in the question (not provided here but implied) typically states 'Access Denied' or 'Permission denied' when the service account does not have the necessary IAM permissions on the dataset or project.

Exam trap

Google Cloud often tests the misconception that a region mismatch causes connection failures. BigQuery datasets do have a location, but location does not affect authentication; an access denied error almost always points to missing IAM permissions on the service account.

How to eliminate wrong answers

Option A is wrong because if the dataset did not exist, the error would be 'Not found: Dataset myproject:mydataset' rather than a permissions error. Option B is wrong because exceeding the BigQuery query quota results in a 'Quota exceeded' error, not a permissions-related failure. Option C is wrong because a location mismatch between the Looker instance and the dataset does not produce an access denied error; Looker can connect to BigQuery datasets in any location as long as network connectivity exists.

MCQmedium

A company is designing a BigQuery data warehouse for BI dashboards. They have a fact table with billions of rows and need to optimize query performance for common filters on date and customer_id. Which table design strategy is most effective?

A.Use a clustered table on date only.

B.Use a non-partitioned table with indexing on customer_id.

C.Use a materialized view that aggregates by date.

D.Use a partitioned table on date with clustering on customer_id.

AnswerD

Partitioning prunes date ranges, clustering narrows scans within partitions.

Why this answer

Option D is correct because partitioning the table on `date` allows BigQuery to prune entire partitions when filtering by date, drastically reducing the data scanned. Clustering on `customer_id` then sorts data within each partition, enabling block-level pruning for queries that filter on `customer_id`. This combination minimizes both I/O and cost for the described BI workload.

Exam trap

The trap here is that candidates often assume clustering alone is sufficient for date-range filtering, overlooking that partitioning is required to physically separate data by date and enable partition pruning, which is a fundamental BigQuery optimization for time-series data.

How to eliminate wrong answers

Option A is wrong because clustering on `date` alone provides no partition pruning and leaves filters on `customer_id` unoptimized; block-level pruning on the clustered column is less predictable than skipping whole date partitions, leading to higher costs and slower performance. Option B is wrong because BigQuery does not support traditional indexing; it uses columnar storage and pruning via partitioning/clustering, so a non-partitioned table with 'indexing' is not a valid strategy. Option C is wrong because a materialized view aggregating by date would pre-summarize data but cannot efficiently support ad-hoc filters on `customer_id` without scanning all underlying rows; it also adds storage and maintenance overhead without addressing the need for row-level filtering on `customer_id`.

MCQmedium

A company runs a retail BI dashboard on BigQuery. The fact_sales table is partitioned by DAY and clustered by product_id. The table is 10 TB. Recently, analysts complain that queries filtering on a specific product_id and a month of data take over 10 minutes. The query uses a subquery to find top products. What should the engineer do?

A.Create a materialized view for the subquery.

B.Add an ORDER BY product_id to the subquery.

C.Change partition type to HOUR.

D.Re-cluster the table with product_id as the first clustering column and date as the second.

AnswerA

Materialized view stores precomputed results, reducing query time and cost.

Why this answer

Option A is correct because creating a materialized view for the subquery that identifies top products pre-computes and stores the results, which are incrementally refreshed by BigQuery. This avoids re-scanning the entire 10 TB fact_sales table each time the query runs, drastically reducing query time for the analysts' frequent filtering on product_id and a month of data.

Exam trap

Google Cloud often tests the misconception that clustering or partitioning changes alone can solve performance issues for subqueries, but the real bottleneck is the repeated full-table scan, which only a materialized view or similar pre-computation can eliminate.

How to eliminate wrong answers

Option B is wrong because adding ORDER BY product_id to the subquery does not improve performance; it only sorts the output, which adds overhead without reducing the data scanned or leveraging clustering. Option C is wrong because changing partition type to HOUR would create many small partitions, increasing partition management overhead and potentially degrading query performance due to metadata operations, while the analysts query a month of data, not hourly slices. Option D is wrong because the table is already clustered by product_id and date filtering is already handled by the DAY partitioning, so adding date as a second clustering column would provide little benefit; clustering is also maintained automatically by BigQuery.

MCQeasy

A healthcare company needs to run BI queries on patient data. The table is in BigQuery and contains 5 billion rows. Queries often filter on patient_id and date. But the table is not partitioned or clustered. Analysts run queries that scan the entire table. The data is updated daily. What is the most cost-effective way to improve performance?

A.Partition the table by patient_id.

B.Use a view that only selects recent data.

C.Cluster the table by date.

D.Partition by date and cluster by patient_id.

AnswerD

Partitioning prunes by date, clustering narrows by patient_id, reducing scanned bytes significantly.

Why this answer

Partitioning by date (e.g., ingestion or event date) allows BigQuery to prune entire partitions when queries filter on date, drastically reducing the data scanned. Clustering by patient_id within each partition further organizes the data so that queries filtering on patient_id can skip irrelevant blocks via block-level metadata. Together, this minimizes bytes billed and improves query performance without requiring table redesign or additional storage costs.

Exam trap

Google Cloud often tests the misconception that clustering alone is sufficient for performance gains, but without partitioning there is no partition-level pruning, so date filters cannot skip whole segments of the table and costs remain high.

How to eliminate wrong answers

Option A is wrong because partitioning by patient_id does not align with the common date-based filter pattern; BigQuery partitions only on DATE, TIMESTAMP, or DATETIME columns, integer ranges, or ingestion time, so a string patient_id could not be a partition key at all, and a high-cardinality identifier is better served by clustering. Option B is wrong because a view that only selects recent data does not reduce the underlying table scan; BigQuery still processes all data in the table unless the view is materialized, and even then it would not address the full-table scan issue for historical queries. Option C is wrong because clustering by date alone provides no partition pruning and does nothing for the patient_id filter, so queries on a single patient still read far more data than necessary.

Multi-Selectmedium

Which TWO best practices should be followed when modeling data for a Looker BI dashboard to optimize query performance?

Select 2 answers

A.Use derived tables for all complex logic

B.Use persistent derived tables (PDTs) to materialize intermediate results

C.Use native derived tables to leverage BigQuery's UDFs

D.Use materialized views in the underlying database

E.Use symmetric aggregates to correctly aggregate measures across joins

AnswersB, E

PDTs are stored and refreshed periodically, improving query speed.

Why this answer

Option B is correct because Persistent Derived Tables (PDTs) materialize intermediate query results into physical tables in the underlying database (e.g., BigQuery). This avoids re-executing complex logic on every user interaction, drastically reducing query latency and cost. PDTs are a core Looker optimization for repeated, heavy transformations.

Exam trap

Google Cloud often tests the distinction between persistent and native derived tables, trapping candidates who think all derived tables improve performance, when only persistent ones (PDTs) materialize results for repeated use.


MCQhard

A company is migrating their on-premises data warehouse to BigQuery for BI. They have a fact table with billions of rows and many dimension tables. The current queries perform well in the on-prem system but are slow in BigQuery. The queries contain multiple JOINs and subqueries. Which optimization should they implement first?

A.Use clustering on all join keys.

B.Use BigQuery's automatic query rewriting.

C.Convert subqueries to CTEs.

D.Denormalize the dimension tables into the fact table.

AnswerD

Denormalization eliminates JOINs, which are expensive in BigQuery, improving performance significantly.

Why this answer

Denormalizing dimension tables into the fact table is the most impactful first optimization because it eliminates the need for expensive JOIN operations across billions of rows. In BigQuery, JOINs on large fact tables with multiple dimension tables can cause significant data shuffling and increased slot consumption, whereas denormalization reduces query complexity and leverages BigQuery's columnar storage and compression more efficiently. This directly addresses the root cause of slow performance in a BI workload where subqueries and JOINs are prevalent.
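A minimal sketch of what that denormalization step could look like follows. The `warehouse` dataset and every table and column name here are hypothetical; the question does not specify a schema.

```sql
-- Hypothetical names throughout; this only illustrates the technique.
-- Materialize the joins once so that BI queries read a single table.
CREATE OR REPLACE TABLE warehouse.sales_denormalized AS
SELECT
  f.order_id,
  f.order_date,
  f.amount,
  p.product_name,
  p.category,
  c.customer_region
FROM warehouse.sales_fact AS f
LEFT JOIN warehouse.product_dim AS p
  ON f.product_id = p.product_id
LEFT JOIN warehouse.customer_dim AS c
  ON f.customer_id = c.customer_id;

-- A dashboard query now needs no JOIN at all.
SELECT category, SUM(amount) AS total_sales
FROM warehouse.sales_denormalized
GROUP BY category;
```

The trade-off is that the denormalized table has to be rebuilt or updated whenever dimension attributes change.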

Exam trap

Google Cloud often tests the misconception that query-level optimizations (like clustering, CTEs, or automatic rewriting) can solve performance issues caused by schema design, when in fact the most impactful first step is to reduce JOIN complexity through denormalization for BigQuery's architecture.

How to eliminate wrong answers

Option A is wrong because clustering on all join keys does not eliminate the JOIN operations themselves; it only improves the efficiency of filtering and sorting within each table, but the shuffle and data redistribution required for JOINs across billions of rows remains a bottleneck. Option B is wrong because BigQuery's automatic query rewriting is a built-in optimizer that already applies heuristics and cost-based optimizations, but it cannot fundamentally restructure the schema to avoid JOINs; it works within the existing query structure. Option C is wrong because converting subqueries to CTEs (Common Table Expressions) is a syntactic change that does not alter the execution plan or reduce the computational cost of JOINs and subqueries; BigQuery treats CTEs similarly to subqueries under the hood.


MCQeasy

A company is designing a data warehouse for BI. They need to support both detailed transaction analysis and high-level aggregated reports. Which schema design best balances storage and query performance?

A.Fully denormalized single table

B.Wide column store with no schema

C.Star schema with fact and dimension tables

D.Snowflake schema with normalized dimensions

AnswerC

Star schema is standard for BI, enabling fast aggregations and easy reporting.

Why this answer

The star schema is the optimal design for balancing storage and query performance in a BI data warehouse because it separates transactional data into fact tables (for detailed analysis) and dimension tables (for context), enabling fast aggregations via star joins while avoiding the storage overhead of full denormalization. This structure directly supports both granular transaction queries and high-level rollups without the complexity or performance penalty of snowflake schemas or the redundancy of fully denormalized tables.
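To make the two access patterns concrete, here is a small sketch against an assumed star schema. The fact table `warehouse.sales_fact` and the dimensions `warehouse.date_dim` and `warehouse.product_dim`, along with their columns, are invented for illustration.

```sql
-- Hypothetical star schema: one fact table joined to two dimensions.

-- High-level aggregated report: monthly sales by category.
SELECT
  d.calendar_month,
  p.category,
  SUM(f.amount) AS total_sales
FROM warehouse.sales_fact AS f
JOIN warehouse.date_dim AS d
  ON f.date_key = d.date_key
JOIN warehouse.product_dim AS p
  ON f.product_key = p.product_key
GROUP BY d.calendar_month, p.category;

-- Detailed transaction analysis: individual orders for one day.
SELECT f.order_id, f.amount, p.product_name
FROM warehouse.sales_fact AS f
JOIN warehouse.date_dim AS d
  ON f.date_key = d.date_key
JOIN warehouse.product_dim AS p
  ON f.product_key = p.product_key
WHERE d.calendar_date = DATE '2023-01-15';
```

Both queries use the same fact table and each dimension is one join away, which is the property that distinguishes a star schema from a snowflake schema.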

Exam trap

Google Cloud often tests the misconception that snowflake schemas are always better for storage efficiency, but the trap here is that the question explicitly balances storage and query performance, and the star schema provides the best trade-off by avoiding excessive joins while keeping dimensions manageable.

How to eliminate wrong answers

Option A is wrong because a fully denormalized single table introduces massive data redundancy and update anomalies, leading to excessive storage consumption and slower query performance due to larger table scans, especially for high-level aggregations. Option B is wrong because a wide column store with no schema lacks the relational integrity and indexing capabilities required for efficient BI joins and aggregations, making it unsuitable for consistent, schema-on-write data warehouse workloads. Option D is wrong because a snowflake schema with normalized dimensions increases the number of join operations across multiple tables, degrading query performance for high-level reports without providing significant storage savings over a star schema in typical BI scenarios.


Multi-Selecteasy

Which TWO of the following are best practices when designing data structures for business intelligence in BigQuery?

Select 2 answers

A.Partition tables on a column that aligns with common filter criteria

B.Store raw logs directly in fact tables without any aggregation

C.Use NULLable columns extensively to save storage

D.Use a single wide table for all data to simplify schema

E.Denormalize dimension attributes into fact tables to reduce joins

AnswersA, E

Partitioning limits scanned partitions.

Why this answer

Partitioning tables on a column that aligns with common filter criteria (e.g., a date or timestamp column) allows BigQuery to prune partitions during query execution, drastically reducing the amount of data scanned and improving query performance and cost efficiency. This is a core best practice for optimizing BI workloads in BigQuery.

Exam trap

Google Cloud often tests the misconception that denormalization is always bad, but in BigQuery for BI, denormalizing dimension attributes into fact tables is a recognized best practice to reduce JOIN overhead and improve query performance.


MCQmedium

The user runs a BigQuery query on a non-partitioned table and receives the error shown. Which optimization should be applied first to resolve the issue?

A.Partition the table by the event_date column

B.Increase the BigQuery reservation slot count

C.Create a materialized view that pre-aggregates the data

D.Cluster the table by event_date

AnswerA

Partitioning limits scans to relevant date ranges, reducing resource consumption.

Why this answer

The error indicates that the query is scanning too much data, likely exceeding the free tier or slot quota. Partitioning the non-partitioned table by `event_date` allows BigQuery to perform partition pruning, scanning only the relevant date range instead of the entire table. This directly reduces the data processed, which is the most effective first optimization for cost and performance.
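One possible way to apply this fix is to copy the existing table into a partitioned one. The sketch below assumes a dataset named `analytics`, a source table named `events`, and that `event_date` is a DATE column; none of these details are given in the question.

```sql
-- Assumption: analytics.events exists and event_date is of type DATE.
CREATE TABLE analytics.events_partitioned
PARTITION BY event_date
AS
SELECT *
FROM analytics.events;

-- With the date filter on the partitioning column, only the
-- partitions for the last 7 days are read.
SELECT COUNT(*) AS events_last_week
FROM analytics.events_partitioned
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
```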

Exam trap

Google Cloud often tests the distinction between partitioning (which prunes entire storage shards) and clustering (which only sorts within shards), leading candidates to mistakenly choose clustering as a solution for reducing data scanned when partitioning is required first.

How to eliminate wrong answers

Option B is wrong because increasing the reservation slot count only adds compute resources but does not reduce the amount of data scanned; the query would still fail if the issue is data volume limits. Option C is wrong because a materialized view that pre-aggregates the data must still scan the base table when it is created and refreshed, helps only queries that match its definition, and does not address the root cause of scanning too much raw data. Option D is wrong because clustering by `event_date` only sorts data into blocks; it can reduce the data read for range-based filters, but the reduction is best-effort and does not provide partition pruning, so it is not the first fix for a query that scans too much data.


MCQhard

A company has a BigQuery dataset with many views. They need to ensure that only the latest 30 days of data is used in BI reports for performance. The source table is partitioned by ingestion_time. Which approach reduces query cost and improves performance?

A.Use BigQuery BI Engine to cache results

B.Create a view with WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

C.Create a materialized view with the date filter

D.Use a scheduled query to copy the last 30 days to a separate table

AnswerC

Materialized views precompute and store the filtered results, reducing query cost and improving performance through incremental updates.

Why this answer

Option C is correct because a materialized view precomputes and stores the filtered result set, allowing BigQuery to serve BI queries directly from the materialized view's storage without scanning the entire source table. This eliminates the need to re-process the full table on every query, significantly reducing query cost and improving performance for the 30-day sliding window.

Exam trap

Google Cloud often tests the distinction between standard views (which are just saved queries) and materialized views (which store results), leading candidates to incorrectly choose a standard view with a WHERE clause, thinking it will reduce cost, when in fact it does not reduce data scanned.

How to eliminate wrong answers

Option A is wrong because BI Engine caches query results in memory, but it does not reduce the amount of data scanned on the first query or when the cache is invalidated; the source table must still be fully scanned initially, and the 30-day filter is not automatically applied. Option B is wrong because a standard view with a WHERE clause on _PARTITIONTIME does not precompute or store results; each query against the view still scans all partitions that match the filter, and BigQuery must evaluate the filter on every execution, which does not reduce cost or improve performance compared to querying the table directly. Option D is wrong because a scheduled query that copies the last 30 days to a separate table introduces data duplication, additional storage costs, and maintenance overhead (e.g., scheduling, cleanup of old data), and it does not provide the automatic, real-time sliding window that a materialized view offers.


100

MCQmedium

A financial company runs BI queries on a BigQuery table that is partitioned by ingestion time. The table is 1 TB and receives streaming inserts every minute. Analysts query the last 24 hours of data. The queries are slow. The table is clustered by transaction_id. What is the likely cause?

A.Streaming buffer causes delays.

B.Queries use SELECT *.

C.Partition expiration is set too short.

D.The cluster column is not used in queries.

AnswerD

Without a filter on transaction_id, clustering provides no benefit; data within partitions is unordered.

Why this answer

Option D is correct because the queries likely filter on the time range (last 24 hours) but not on transaction_id, so the clustering key is not used; clustering only helps when the cluster column appears in a filter. Option A (streaming buffer) is not the cause because streamed data is committed quickly. Option C (partition expiration) is not mentioned in the scenario.

Option B (SELECT *) may increase the bytes scanned, but it is not the primary cause.


101

Multi-Selecthard

Which THREE are valid considerations when designing BigQuery tables for BI reporting?

Select 3 answers

A.Use nested and repeated fields to avoid JOINs

B.Create indexes on frequently queried columns

C.Use partitioning on date columns to reduce query cost

D.Cluster tables on high-cardinality columns used in filters

E.Denormalize dimension tables into fact tables for common queries

AnswersC, D, E

Partitioning is a key cost-control feature.

Why this answer

Option C is correct because partitioning BigQuery tables by date columns (e.g., using _PARTITIONTIME or a DATE/TIMESTAMP column) allows the query engine to prune entire partitions during query execution. This significantly reduces the amount of data scanned, directly lowering query costs (since BigQuery charges per byte processed) and improving performance for time-range filters.

Exam trap

Google Cloud often tests the misconception that traditional relational database features like indexes apply to BigQuery, but BigQuery's architecture relies on partitioning and clustering instead of indexes for query optimization.


102

MCQhard

A data analyst runs a query that joins two large tables on a high-cardinality column with many NULL values. Which action is most likely to resolve the error?

A.Use a DISTINCT clause on the join key.

B.Increase the query timeout setting.

C.Add a WHERE clause to filter out NULLs from the join key.

D.Use a UNION ALL to combine tables.

AnswerC

Filtering NULLs reduces row count and shuffle.

Why this answer

Option C is correct because filtering out NULLs from the join key with a WHERE clause removes rows that could never match anyway: in a standard SQL equality join, NULL = NULL evaluates to NULL rather than TRUE, so NULL keys never join. Dropping them up front reduces the number of rows that must be shuffled for the join and avoids the performance degradation or errors caused by processing a large number of NULLs in a high-cardinality column.
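The filter can be sketched as follows. The tables `sales.orders` and `sales.customers` and the join column `customer_ref` are hypothetical stand-ins for the two large tables in the question.

```sql
-- Hypothetical tables and columns, for illustration only.
-- NULL keys cannot satisfy the equality condition of an inner join,
-- so removing them does not change the result; it only reduces
-- the rows that have to be shuffled.
SELECT o.order_id, c.customer_name
FROM sales.orders AS o
JOIN sales.customers AS c
  ON o.customer_ref = c.customer_ref
WHERE o.customer_ref IS NOT NULL
  AND c.customer_ref IS NOT NULL;
```

For an outer join the filter would change the result, so this sketch applies to the inner-join case only.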

Exam trap

The trap here is that candidates may think increasing the timeout (Option B) is a universal fix for any query error, when in reality the error is often due to resource exhaustion from NULL handling, not insufficient execution time.

How to eliminate wrong answers

Option A is wrong because using DISTINCT on the join key does not resolve the issue of NULLs in the join; it only removes duplicate values, which does not address the underlying problem of NULL mismatches or performance. Option B is wrong because increasing the query timeout setting only allows the query to run longer without failing, but does not fix the root cause of the error (e.g., excessive memory or disk usage from NULL handling). Option D is wrong because UNION ALL combines results from two queries vertically, not horizontally; it does not perform a join and therefore cannot resolve errors related to joining on a high-cardinality column with NULLs.


103

MCQhard

A company has a BigQuery table with a TIMESTAMP column and wants to query data for a specific date range efficiently. Which WHERE clause ensures partition pruning if the table is partitioned by that TIMESTAMP column?

A.WHERE timestamp_col BETWEEN TIMESTAMP('2023-01-01') AND TIMESTAMP('2023-01-31')

B.WHERE TIMESTAMP_TRUNC(timestamp_col, DAY) BETWEEN '2023-01-01' AND '2023-01-31'

C.WHERE timestamp_col >= '2023-01-01' AND timestamp_col < '2023-02-01'

D.WHERE DATE(timestamp_col) BETWEEN '2023-01-01' AND '2023-01-31'

AnswerA

Direct comparison on the partition column allows BigQuery to prune partitions based on the range.

Why this answer

Option A is correct because it directly references the TIMESTAMP column without wrapping it in a function, allowing BigQuery's partition pruning to eliminate irrelevant partitions. When a table is partitioned by a TIMESTAMP column, the query engine can compare the partition boundaries directly against the literal TIMESTAMP values in the WHERE clause, scanning only the partitions that fall within the specified range.
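Written out as a full query, the pattern from option A might look like this sketch. The table name `analytics.events_by_ts` is an assumption; only `timestamp_col` comes from the question.

```sql
-- Assumption: analytics.events_by_ts is partitioned on timestamp_col.
-- The partitioning column is compared directly with TIMESTAMP values.
SELECT COUNT(*) AS row_count
FROM analytics.events_by_ts
WHERE timestamp_col BETWEEN TIMESTAMP('2023-01-01') AND TIMESTAMP('2023-01-31');

-- Note: TIMESTAMP('2023-01-31') is midnight at the start of January 31 (UTC),
-- so the BETWEEN form excludes most of that day. To cover the whole month,
-- a half-open range with typed literals can be used instead:
SELECT COUNT(*) AS row_count
FROM analytics.events_by_ts
WHERE timestamp_col >= TIMESTAMP '2023-01-01'
  AND timestamp_col <  TIMESTAMP '2023-02-01';
```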

Exam trap

Google Cloud often tests the misconception that any filter on a partitioned column will trigger pruning, but the trap here is that wrapping the partition column in a function (like DATE, TIMESTAMP_TRUNC, or implicit casts) disables pruning, so only a bare column reference with compatible literal types guarantees efficient partition elimination.

How to eliminate wrong answers

Option B is wrong because TIMESTAMP_TRUNC(timestamp_col, DAY) is a function applied to the partition column, which prevents partition pruning; BigQuery must evaluate the function for every row, scanning all partitions. Option C is wrong because comparing a TIMESTAMP column to a string literal ('2023-01-01') forces an implicit type conversion, which can disable partition pruning and may lead to incorrect results due to timezone or format assumptions. Option D is wrong because DATE(timestamp_col) is a function that extracts the date portion, and like other functions on the partition column, it disables partition pruning, causing a full table scan.


104

MCQhard

A company uses BigQuery for BI reporting with a star schema. The fact table 'sales' is partitioned by date and clustered by 'product_id'. The dimensions 'product' and 'customer' are updated nightly via merge statements. Recently, a report that joins 'sales' with 'product' on 'product_id' and filters on sale_date for the last 7 days started timing out. The query plan shows a 'SCAN' of the entire 'product' table. Which optimization should be applied to improve performance?

A.Partition the 'product' table by 'product_id'

B.Partition the 'sales' table by 'product_id' instead of date

C.Remove clustering from the 'sales' table

D.Cluster the 'product' table on 'product_id'

AnswerD

Clustering on product_id improves join performance by collocating rows with the same product_id, reducing data scanned.

Why this answer

Option D is correct because clustering the 'product' table on 'product_id' physically co-locates rows with the same product_id into the same blocks, drastically reducing the amount of data scanned when the report joins on that column. The query plan's full SCAN of the 'product' table indicates that BigQuery must read every row, even though only a subset of products are referenced by the last 7 days of sales. Clustering on product_id enables block-level pruning, so only the relevant blocks are read, eliminating the full table scan.
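A sketch of how the dimension table could be clustered is shown below. The dataset name `retail`, the new table name `product_clustered`, and the columns `sale_date`, `amount`, and `product_name` are assumptions added for the example.

```sql
-- Hypothetical dataset and column names.
-- Build a clustered copy of the dimension table.
CREATE TABLE retail.product_clustered
CLUSTER BY product_id
AS
SELECT *
FROM retail.product;

-- The report query then joins against the clustered copy.
SELECT p.product_name, SUM(s.amount) AS revenue
FROM retail.sales AS s
JOIN retail.product_clustered AS p
  ON s.product_id = p.product_id
WHERE s.sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY p.product_name;
```

Because the dimension is updated nightly by MERGE statements, those statements would need to target the clustered table once it replaces the original.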

Exam trap

Google Cloud often tests the misconception that partitioning is the universal solution for all performance issues, but here the problem is a full scan of the dimension table during a join, which clustering on the join key solves without the limitations and overhead of partitioning.

How to eliminate wrong answers

Option A is wrong because partitioning the 'product' table by 'product_id' is not supported in BigQuery — partitioning requires a date, timestamp, or integer range column, not an arbitrary ID, and it would create an excessive number of partitions, degrading performance. Option B is wrong because partitioning the 'sales' table by 'product_id' instead of date would break the existing date-based pruning for the last-7-days filter, likely increasing the scan size and defeating the purpose of the optimization. Option C is wrong because removing clustering from the 'sales' table would worsen performance by eliminating the existing block-level pruning on product_id, making the join even slower.


105

Multi-Selecthard

A financial services company uses BigQuery for BI reporting. They need to design a data model that ensures data consistency and avoids duplicate records in the fact table. Which three practices should they follow? (Choose three.)

Select 3 answers

A.Use the OVERWRITE partition option for incremental loads.

B.Apply a unique constraint on the fact table.

C.Use a daily load job that replaces the entire table with WRITE_TRUNCATE.

D.Implement a staging table with a unique identifier and use INSERT ... SELECT DISTINCT.

E.Use DML statements with MERGE to upsert data.

AnswersA, D, E

Overwriting specific partitions avoids duplicates within those partitions.

Why this answer

Option A is correct because using the OVERWRITE partition option for incremental loads ensures that only the specific partition being loaded is replaced, preventing duplicate records within that partition while preserving data in other partitions. This approach maintains data consistency by avoiding full table overwrites and is efficient for incremental updates in BigQuery.

Exam trap

The trap here is that candidates often assume BigQuery supports traditional database constraints like unique constraints (Option B) or that full table overwrites (Option C) are acceptable for incremental loads, when in fact BigQuery's architecture requires partition-level or DML-based deduplication strategies.


106

Multi-Selecteasy

Which TWO actions improve query performance and reduce cost in BigQuery for BI workloads?

Select 2 answers

A.Cluster tables on columns used in GROUP BY

B.Partition tables on columns frequently used in WHERE clauses

C.Load data using batch loads instead of streaming

D.Store data in CSV format

E.Use SELECT * in all queries

AnswersA, B

Clustering improves aggregation performance.

Why this answer

Clustering tables on columns used in GROUP BY improves query performance by physically co-locating rows with similar values, reducing the amount of data scanned during aggregation. Partitioning on columns frequently used in WHERE clauses allows BigQuery to prune entire partitions from the scan, directly reducing both cost (bytes billed) and query execution time. These two optimizations are specifically recommended for BI workloads where repeated, selective queries are common.

Exam trap

Google Cloud often tests the misconception that any data loading method (batch vs. streaming) or any file format (CSV) directly improves query performance, when in fact only storage and query-time optimizations like partitioning and clustering reduce bytes scanned.


107

MCQhard

A large e-commerce platform uses BigQuery for business intelligence. They have a fact table `orders` (10 TB, partitioned by order_date, clustered by customer_id) and a dimension table `customers` (2 TB, not partitioned, not clustered). The BI team runs a daily dashboard query that joins these tables on customer_id and filters on order_date = CURRENT_DATE() and customer_country = 'US'. The query currently scans the full `customers` table and 2 GB of the `orders` table, taking 30 seconds. The business wants to reduce cost and latency. The `customers` table has 500 million rows and is updated incrementally every hour. Which action will most effectively reduce the amount of data scanned and query time?

A.Cluster the `customers` table on customer_id.

B.Denormalize customer country and other attributes into the `orders` table.

C.Create a materialized view that joins `orders` and `customers` on customer_id.

D.Partition the `customers` table by customer_id.

AnswerA

Clustering by customer_id enables block-level pruning during the join, drastically reducing data scanned.

Why this answer

Clustering the `customers` table on `customer_id` will physically co-locate rows with the same `customer_id`, allowing the query to use block-level pruning when joining with the filtered `orders` table. Since the query filters `orders` by `order_date = CURRENT_DATE()` (2 GB scanned) and then joins on `customer_id`, BigQuery can skip reading most of the `customers` table if it is clustered on the join key, drastically reducing the 2 TB full scan and lowering both cost and latency.

Exam trap

Google Cloud often tests the misconception that partitioning is always the best optimization for large tables, but here partitioning by `customer_id` is invalid in BigQuery, and the real performance gain comes from clustering on the join key to enable block-level pruning.

How to eliminate wrong answers

Option B is wrong because denormalizing customer attributes into the `orders` table would increase storage costs and data duplication (10 TB fact table would grow significantly), and while it might avoid the join, it does not address the root cause of scanning the full `customers` table; it also complicates incremental updates. Option C is wrong because a materialized view that joins both tables would need to be refreshed every hour to reflect incremental customer updates, and it would still require scanning the full `customers` table during creation or refresh, not reducing the per-query scan for the current daily filter. Option D is wrong because partitioning the `customers` table by `customer_id` is not supported in BigQuery (partitioning must be on a date/timestamp or integer range column), and even if possible, it would not help since the query does not filter on a partition column for `customers`.


108

Multi-Selecteasy

Which TWO are effective strategies to control costs when running BI queries on BigQuery? (Choose two.)

Select 2 answers

A.Set a maximum bytes billed limit for user projects.

B.Create materialized copies of tables for each dashboard.

C.Schedule queries to run every minute to keep the cache warm.

D.Enable BI Engine for all tables to speed up queries.

E.Use flat-rate reservations for predictable workloads.

AnswersA, E

It prevents queries from scanning too much data.

Why this answer

Options A and E are correct. Setting a custom cost control (maximum bytes billed) prevents runaway queries. Using flat-rate reservations provides predictable pricing and can lower costs for steady workloads.

Option B is wrong because copying tables increases storage costs. Option C is wrong because scheduled queries add to cost if not optimized. Option D is wrong because enabling BI Engine incurs additional reservation costs.


109

MCQeasy

A BI team wants to create a report that shows daily active users for the last 7 days. Which SQL construct is most appropriate for fast performance on a large dataset?

A.SELECT COUNT(DISTINCT user_id) ... WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)

B.SELECT DISTINCT user_id ...

C.SELECT COUNT(user_id) ... GROUP BY user_id

D.SELECT APPROX_COUNT_DISTINCT(user_id) ... WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)

AnswerD

Approximate distinct is fast and sufficient for trend analysis.

Why this answer

Option D is correct because APPROX_COUNT_DISTINCT uses the HyperLogLog (HLL) family of algorithms, which estimates distinct counts with a small error while using significantly less memory and running faster than COUNT(DISTINCT) on large datasets. This is ideal for a daily active users report over 7 days where exact precision is not critical.
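The elided parts of option D could be filled in along these lines. The table `analytics.user_events` and the column name `event_date` are hypothetical; the question itself shows only a column called `date`.

```sql
-- Hypothetical table and column names.
-- One approximate distinct-user count per day for the last 7 days.
SELECT
  event_date,
  APPROX_COUNT_DISTINCT(user_id) AS approx_daily_active_users
FROM analytics.user_events
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY event_date
ORDER BY event_date;
```

If exact figures are required, for example for billing, COUNT(DISTINCT user_id) can be substituted at a higher resource cost.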

Exam trap

Google Cloud often tests the misconception that COUNT(DISTINCT) is always the correct choice for distinct counts, ignoring the performance implications on large datasets where approximate counting functions are the appropriate BI solution.

How to eliminate wrong answers

Option A is wrong because COUNT(DISTINCT user_id) requires sorting or hashing all unique user_id values, which becomes extremely slow and memory-intensive on large datasets. Option B is wrong because SELECT DISTINCT user_id returns all individual user IDs without counting them, failing to produce the required daily active user count. Option C is wrong because COUNT(user_id) counts all rows including duplicates, not distinct users, and GROUP BY user_id would produce per-user counts rather than a single daily total.


110

Multi-Selectmedium

A BI team is troubleshooting a slow BigQuery query. Which TWO actions can help identify the bottleneck?

Select 2 answers

A.Review the query execution plan in the BigQuery UI.

B.Increase the number of slots to maximum.

C.Remove all WHERE clauses to simplify.

D.Rewrite the query to use only CTEs.

E.Check the bytes processed and shuffle bytes.

AnswersA, E

Execution plan reveals stages, timing, and data shuffling.

Why this answer

Reviewing the query execution plan in the BigQuery UI (Option A) is correct because it provides a visual breakdown of query stages, including shuffle operations, data distribution, and stage-level timing. This allows the BI team to pinpoint which stage is consuming the most time or resources, such as a skewed join or a slow aggregation, directly identifying the bottleneck.
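Besides the UI, the same statistics can be read with SQL from the jobs metadata views. The sketch below assumes the jobs ran in the `US` multi-region and that the caller has permission to list jobs in the project; adjust the region qualifier as needed.

```sql
-- Assumption: jobs ran in the US multi-region.
-- Lists the most slot-hungry queries of the last day together with
-- the bytes they processed and the bytes shuffled across stages.
SELECT
  job_id,
  total_bytes_processed,
  total_slot_ms,
  (
    SELECT SUM(stage.shuffle_output_bytes)
    FROM UNNEST(job_stages) AS stage
  ) AS total_shuffle_output_bytes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_slot_ms DESC
LIMIT 10;
```

A query with modest bytes processed but very large shuffle output usually points at a join or aggregation stage rather than at the table scan.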

Exam trap

Google Cloud often tests the misconception that adding more resources (slots) or simplifying the query (removing WHERE clauses) is a diagnostic step, when in fact these actions change the query's behavior rather than identifying the existing bottleneck.


111

Multi-Selectmedium

Which TWO are best practices for designing a star schema in BigQuery for BI? (Choose two.)

Select 2 answers

A.Store dimension attributes in a single denormalized dimension table instead of multiple normalized tables.

B.Partition fact tables by low-cardinality columns like gender.

C.Pre-aggregate all measures at every possible grain in the fact table.

D.Avoid using joins entirely by storing all data in one wide table.

E.Use surrogate keys for dimension tables instead of natural keys.

AnswersA, E

Denormalization reduces join complexity.

Why this answer

Option A is correct because in BigQuery, storing dimension attributes in a single denormalized dimension table (star schema) reduces the number of joins required in BI queries, improving query performance and simplifying SQL. BigQuery's columnar storage and distributed architecture handle denormalized dimensions efficiently, avoiding the overhead of multiple normalized tables that would require complex joins and slow down analytical queries.

Exam trap

Google Cloud often tests the misconception that denormalization is always bad, but in BigQuery's architecture, denormalized dimension tables are a best practice for BI workloads, unlike traditional OLTP databases.


112

MCQhard

A company stores sensor data in BigQuery. They have a table 'sensor_readings' with columns: sensor_id, reading_time, value. The table is partitioned by reading_time (hourly) and clustered by sensor_id. A BI query aggregates average value per sensor for the last week. The query still scans many bytes. What is the most likely cause?

A.The query uses SELECT * instead of specific columns

B.Clustering on sensor_id is ineffective

C.The table is not using columnar storage

D.Partition granularity is too fine for the query range

AnswerD

Hourly partitions for a week means 168 partitions scanned; coarser partitioning (daily) would scan 7 partitions, reducing bytes.

Why this answer

Option D is correct because the query scans a full week of data (168 hourly partitions), and each partition must be read entirely even though only a subset of sensors may be active. Hourly partitioning over a 7-day range means the query engine must scan all 168 partitions, which can result in a large number of bytes being processed. Clustering on sensor_id helps within each partition but does not reduce the number of partitions scanned; the fine granularity of hourly partitioning is the primary cause of excessive bytes scanned.
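If the team decided to switch to daily partitions as the explanation suggests, the change could be sketched as below. The dataset name `iot` and the new table name are assumptions, `reading_time` is assumed to be a TIMESTAMP column, and whether bytes scanned actually drop depends on the data and the query.

```sql
-- Assumptions: iot.sensor_readings exists and reading_time is a TIMESTAMP.
-- Rebuild the table with daily instead of hourly partitions.
CREATE TABLE iot.sensor_readings_daily
PARTITION BY TIMESTAMP_TRUNC(reading_time, DAY)
CLUSTER BY sensor_id
AS
SELECT sensor_id, reading_time, value
FROM iot.sensor_readings;

-- Weekly aggregate; the filter touches 7 or 8 daily partitions
-- (depending on the time of day) instead of 168 hourly ones.
SELECT sensor_id, AVG(value) AS avg_value
FROM iot.sensor_readings_daily
WHERE reading_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY sensor_id;
```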

Exam trap

Google Cloud often tests the misconception that clustering alone solves all performance issues, but the trap here is that clustering only helps when the query filters or aggregates on the clustered column—without such a filter, clustering does not reduce bytes scanned, and overly fine partitioning is the real culprit.

How to eliminate wrong answers

Option A is wrong because using SELECT * instead of specific columns would increase the bytes scanned, but the question states the query aggregates average value per sensor, which likely already selects only the needed columns; the core issue is partition pruning, not column projection. Option B is wrong because clustering on sensor_id is effective for reducing bytes scanned within each partition when filtering by sensor_id, but the query does not filter on sensor_id—it aggregates across all sensors—so clustering provides no benefit here. Option C is wrong because BigQuery always uses columnar storage (Capacitor format); the table is inherently columnar, so this is not a possible cause.

113

Multi-Selecteasy

Which TWO BigQuery features are specifically designed to accelerate BI dashboard query performance? (Choose TWO.)

Select 2 answers

A.Wildcard tables

B.Clustering

C.User-defined functions (UDFs)

D.Cached results

E.Column-level security

AnswersB, D

Clustering reduces data scanned by sorting data within partitions, speeding up filter-based queries.

Why this answer

Clustering (B) physically co-locates rows with similar values in the same storage blocks, allowing BigQuery to skip entire blocks when processing queries with filters on clustered columns. This dramatically reduces the amount of data scanned, directly accelerating BI dashboard queries that often filter by date, region, or customer ID. Cached results (D) store the output of recent queries for up to 24 hours, so repeated dashboard refreshes or concurrent user requests can be served instantly without re-scanning any data.

Exam trap

Google Cloud often tests the misconception that any feature that 'organizes' or 'processes' data (like wildcard tables or UDFs) improves performance, when in fact only features that reduce data scanned (clustering) or avoid re-execution (cached results) directly accelerate BI dashboards.

114

MCQeasy

A database engineer is designing a data model for a BI dashboard that tracks daily sales by product category. The data source is a transactional database with a normalized schema. Which BigQuery feature should they use to update the fact table incrementally each day?

A.Streaming inserts

B.BigQuery Data Transfer Service

C.Scheduled queries with MERGE statements

D.Load jobs with WRITE_TRUNCATE

AnswerC

MERGE combines INSERT and UPDATE to handle incremental changes efficiently.

Why this answer

Scheduled queries with MERGE statements allow incremental updates by inserting new rows and updating existing ones based on a unique key, such as date and product category. This avoids full table reloads, making it efficient for daily fact table refreshes from a normalized transactional source.
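The insert-or-update behaviour described above can be sketched in plain Python. This is not BigQuery code: the fact table is modelled as a dict keyed by (sale_date, product_category), and the function and field names are illustrative assumptions.

```python
def merge_daily_sales(fact, staged):
    """Upsert staged rows into fact, keyed by (sale_date, product_category)."""
    inserted = updated = 0
    for row in staged:
        key = (row["sale_date"], row["product_category"])
        if key in fact:
            # Corresponds to WHEN MATCHED THEN UPDATE
            fact[key] = row["total_sales"]
            updated += 1
        else:
            # Corresponds to WHEN NOT MATCHED THEN INSERT
            fact[key] = row["total_sales"]
            inserted += 1
    return inserted, updated

# Existing fact table with one row, plus one corrected row and one new row.
fact = {("2024-01-01", "toys"): 100.0}
staged = [
    {"sale_date": "2024-01-01", "product_category": "toys", "total_sales": 120.0},
    {"sale_date": "2024-01-02", "product_category": "toys", "total_sales": 80.0},
]
result = merge_daily_sales(fact, staged)
print(result)  # (1, 1)
```

Only the staged rows are touched; rows already in the fact table that are absent from the daily batch stay as they were, which is the property WRITE_TRUNCATE lacks.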

Exam trap

The trap here is that candidates confuse 'incremental load' with 'streaming' (Option A), not realizing that streaming inserts are for real-time events, not batch updates from a transactional database.

How to eliminate wrong answers

Option A is wrong because streaming inserts are designed for real-time, row-by-row ingestion, not for batch updates of a fact table from a transactional database. Option B is wrong because BigQuery Data Transfer Service is aimed at automated imports from external sources (e.g., Google Ads, Amazon S3); a transfer alone does not apply the insert-or-update logic that MERGE provides. Option D is wrong because WRITE_TRUNCATE replaces the entire table on each load, which is inefficient and discards existing rows, whereas incremental updates require preserving them.

115

MCQeasy

A company uses BigQuery for BI. They need to create a table that stores daily sales data with millions of rows. The query pattern is to aggregate sales by month for specific product categories. Which table design is most cost-effective and performant?

A.Non-partitioned table with clustering on product_category

B.Partitioned table by date with clustering on product_category

C.Non-partitioned, non-clustered table with manual sharding by date

D.Partitioned table by product_category with clustering on date

AnswerB

Partitioning prunes irrelevant date ranges; clustering reduces data scanned for category filters.

Why this answer

Partitioning by date allows BigQuery to prune entire partitions when querying monthly aggregates, drastically reducing the data scanned. Clustering on product_category further organizes data within each partition, enabling efficient block-level pruning for category filters. This combination minimizes both cost (bytes billed) and query latency for the described workload.

Exam trap

Google Cloud often tests the misconception that clustering alone is sufficient for performance, ignoring that partitioning is essential for time-range queries to enable storage-level pruning and cost control.

How to eliminate wrong answers

Option A is wrong because a non-partitioned table forces BigQuery to scan all rows even for a single month, leading to higher costs and slower performance despite clustering on product_category. Option C is wrong because manual sharding (e.g., table names like sales_20250101) is a legacy pattern that requires complex query logic (UNION ALL) and loses automatic partition pruning, plus BigQuery discourages sharding in favor of native partitioning. Option D is wrong because partitioning by product_category would create many small partitions (one per category), which is inefficient for date-range queries; clustering on date cannot compensate for the lack of date-based partition pruning, so monthly aggregations would still scan all partitions.

116

MCQmedium

A retail company uses Cloud Spanner for their OLTP system and wants to run BI queries on the same data without impacting transactional performance. Which solution should they implement?

A.Create a federated BigQuery query that reads from Spanner

B.Export Spanner data to Cloud Storage and then load into BigQuery manually

C.Use Cloud Dataflow to stream Spanner changes into BigQuery

D.Run BI queries directly on Spanner using read-only transactions

AnswerC

Dataflow captures changes from Spanner and loads them into BigQuery, separating BI workloads.

Why this answer

Option C is correct because Cloud Dataflow can read the Cloud Spanner change streams and stream mutations into BigQuery in near real-time, enabling BI queries on fresh data without adding read load to the Spanner instance. This decouples the analytical workload from the transactional workload, preserving OLTP performance.

Exam trap

The trap here is that candidates assume read-only transactions are safe for BI workloads, but they still consume Spanner's CPU and memory resources, which can degrade transactional performance under concurrent analytical queries.

How to eliminate wrong answers

Option A is wrong because federated BigQuery queries against Spanner execute reads directly on the Spanner instance, which can consume CPU and impact transactional latency, especially under heavy BI query loads. Option B is wrong because manual exports to Cloud Storage and batch loads into BigQuery introduce significant latency and operational overhead, making it unsuitable for near-real-time BI requirements. Option D is wrong because even read-only transactions on Spanner consume instance resources and can contend with transactional writes, degrading OLTP performance under concurrent BI query loads.

117

MCQmedium

A BI report requires a running total of sales over the last 30 days for each product. The data is in a BigQuery table with columns: sale_date, product_id, amount. Which SQL window function is most efficient?

A.Use GROUP BY with SUM(amount)

B.Use SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW)

C.Use SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

D.Use a correlated subquery to sum over previous dates

AnswerC

This window function efficiently computes a running total across all rows up to the current row.

Why this answer

Option C is the keyed answer because `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` is the only option that produces a running (cumulative) total in a single pass. Note, however, that it sums all prior sales rather than only the last 30 days, and that none of the options includes `PARTITION BY product_id`, which a per-product total requires. For a true 30-day rolling sum with exactly one row per day and no gaps, `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` works. When dates can be missing, a `RANGE` frame over a numeric ordering key is needed, for example `ORDER BY UNIX_DATE(sale_date) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW`, because BigQuery's documented `RANGE` frames take numeric bounds over a numeric `ORDER BY` expression rather than `INTERVAL` bounds.

Exam trap

Google Cloud often tests the distinction between `ROWS` and `RANGE` frame specifications. The trap here is that candidates confuse a fixed row count (`ROWS BETWEEN 30 PRECEDING`) with a time-based window (a `RANGE` frame over a day-number ordering key) and choose Option B, even though it does not correctly implement a 30-day rolling sum.

How to eliminate wrong answers

Option A is wrong because GROUP BY with SUM(amount) aggregates sales per day or per product, but it cannot produce a running total across dates; it loses the row-level context needed for cumulative calculations. Option B is wrong because `ROWS BETWEEN 30 PRECEDING AND CURRENT ROW` sums exactly 31 rows (30 preceding + current), which is a fixed row count, not a time-based window of 30 days; if dates are missing or irregular, this will not correctly represent sales over the last 30 calendar days. Option D is wrong because a correlated subquery to sum over previous dates is inefficient and scales poorly; it requires a separate subquery execution for each row, leading to O(n²) performance, whereas a window function operates in a single pass over the data.
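The difference between a row-count frame and a calendar-based frame is easiest to see on data with gaps. The sketch below imitates both frames in plain Python for a single product; the helper names and sample data are assumptions made for illustration, and it assumes at most one row per date.

```python
from datetime import date, timedelta

def rolling_sum_rows(sales, n_preceding):
    """Mimic ROWS BETWEEN n PRECEDING AND CURRENT ROW (fixed row count)."""
    ordered = sorted(sales)
    result = []
    for i, (day, _) in enumerate(ordered):
        window = ordered[max(0, i - n_preceding): i + 1]
        result.append((day, sum(amount for _, amount in window)))
    return result

def rolling_sum_days(sales, days_preceding):
    """Mimic a RANGE frame over a day number (calendar-based window)."""
    ordered = sorted(sales)
    result = []
    for day, _ in ordered:
        start = day - timedelta(days=days_preceding)
        total = sum(amount for d, amount in ordered if start <= d <= day)
        result.append((day, total))
    return result

# Two sales in early January, then nothing until 1 March.
sales = [(date(2024, 1, 1), 10), (date(2024, 1, 2), 20), (date(2024, 3, 1), 5)]
rows_result = rolling_sum_rows(sales, 29)
days_result = rolling_sum_days(sales, 29)
print(rows_result[-1][1], days_result[-1][1])  # 35 5
```

On 1 March the row-count frame still includes the January sales because they are the nearest preceding rows, while the day-based frame correctly excludes them.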

118

MCQmedium

A company uses BigQuery materialized views to pre-aggregate sales data for a BI dashboard. The dashboard requires near-real-time data, but the materialized view currently reflects data up to 30 minutes old. What is the most effective way to reduce the refresh interval without significantly increasing costs?

A.Reduce the max_staleness parameter of the materialized view.

B.Disable automatic refresh and schedule a manual refresh every minute.

C.Use a streaming buffer with the base table to reduce latency.

D.Create additional materialized views with overlapping time windows.

AnswerA

A lower max_staleness tightens the bound on how old the data served from the view may be.

Why this answer

The `max_staleness` option sets the maximum acceptable age of the data that a query may read from a BigQuery materialized view. If the last refresh falls within that interval, BigQuery answers from the view alone; if not, it reads the base table to return results within the allowed staleness. Lowering the value therefore yields fresher dashboard data without a manual refresh schedule or additional streaming infrastructure. The option is designed to balance freshness against cost, which makes it the most effective choice here. How often the view itself is refreshed is governed separately by `refresh_interval_minutes`.
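A simplified model of the staleness check may help. The function below is a hypothetical sketch of the decision, not BigQuery's implementation; the name `read_path` and the returned labels are assumptions.

```python
from datetime import datetime, timedelta

def read_path(last_refresh, now, max_staleness):
    """Return which data a query reads under the simplified staleness model."""
    if now - last_refresh <= max_staleness:
        return "materialized view only"
    return "base table consulted"

last_refresh = datetime(2024, 1, 1, 12, 0)
now = last_refresh + timedelta(minutes=20)

# With a 30-minute bound, a 20-minute-old view is acceptable.
relaxed = read_path(last_refresh, now, timedelta(minutes=30))
# With a 5-minute bound, the same view is too old to be used alone.
strict = read_path(last_refresh, now, timedelta(minutes=5))
print(relaxed, "|", strict)
```

A tighter bound trades cheaper view-only reads for fresher results, which is the freshness-versus-cost balance the option exposes.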

Exam trap

Google Cloud often tests the misconception that reducing staleness requires manual scheduling or additional streaming, when in fact the `max_staleness` option is the built-in, cost-aware mechanism for bounding data age in BigQuery materialized views.

How to eliminate wrong answers

Option B is wrong because disabling automatic refresh and scheduling a manual refresh every minute would significantly increase costs due to repeated full recomputation of the materialized view, and it also introduces operational complexity without leveraging BigQuery's built-in incremental refresh mechanism. Option C is wrong because using a streaming buffer with the base table reduces latency for new data ingestion but does not affect the refresh interval of the materialized view itself; the view still relies on its own staleness setting. Option D is wrong because creating additional materialized views with overlapping time windows does not reduce the refresh interval for any single view; it increases storage and processing costs without improving freshness, as each view would still have its own staleness constraint.

119

MCQeasy

Refer to the exhibit. What is the effect of the partition_expiration_days option?

A.The table's storage cost is reduced by 365%

B.Queries that reference data older than 365 days will fail

C.Partitions older than 365 days are automatically deleted

D.The table will be partitioned into 365 partitions

AnswerC

The option enables automatic partition expiration, deleting old partitions to free storage.

Why this answer

The `partition_expiration_days` option in BigQuery automatically drops partitions that are older than the specified number of days, reducing storage costs and simplifying lifecycle management. When set to 365, any partition with a date older than 365 days from the current date is deleted by BigQuery's background maintenance process.
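As a rough illustration of the expiration rule, the sketch below selects the daily partitions that would be older than the cutoff. The helper name is hypothetical, and the exact boundary behaviour (strictly older than the cutoff date) is an assumption of this model rather than a statement about BigQuery's internals.

```python
from datetime import date, timedelta

def expired_partitions(partition_dates, today, expiration_days):
    """Return partitions whose date is older than today minus expiration_days."""
    cutoff = today - timedelta(days=expiration_days)
    return [p for p in partition_dates if p < cutoff]

partitions = [date(2023, 1, 1), date(2023, 6, 1), date(2024, 1, 1)]
expired = expired_partitions(partitions, today=date(2024, 3, 1), expiration_days=365)
print(expired)  # [datetime.date(2023, 1, 1)]
```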

Exam trap

Google Cloud often tests the distinction between automatic deletion (expiration) and query failure—candidates mistakenly think expired partitions cause errors, but BigQuery simply treats them as non-existent, returning empty results for those date ranges.

How to eliminate wrong answers

Option A is wrong because storage cost is reduced by the amount of data in expired partitions, not by a fixed percentage like 365%; the percentage depends on the table's total size. Option B is wrong because queries referencing data older than 365 days will simply return no rows from those expired partitions, but the query itself will not fail—it will succeed with an empty result for the expired range. Option D is wrong because the option does not control the number of partitions; it controls the expiration age of partitions, while the number of partitions is determined by the partitioning column's granularity and the data's date range.

120

MCQhard

A company has a BigQuery table that stores JSON data in a single column. They want to allow BI analysts to query nested fields using standard SQL. What is the best approach to make the data more query-friendly for BI tools?

A.Unnest the JSON into multiple columns using a persistent table with a flattened schema.

B.Use BigQuery's automatic schema detection to infer the structure.

C.Create a view that uses JSON_QUERY and JSON_VALUE functions to expose nested fields as columns.

D.Use the EXTRACT function to parse JSON fields in each query.

AnswerA

A flattened table stores JSON fields as columns once, enabling efficient columnar scanning and BI tool compatibility.

Why this answer

Option A is correct because flattening the JSON into a persistent table with a flattened schema eliminates the need for runtime parsing, allowing BI tools to query the formerly nested fields directly with standard SQL. This approach improves query performance by avoiding repeated JSON function calls and stores each field as an ordinary column that BigQuery can scan selectively and that can be used for partitioning or clustering, which is critical for interactive BI workloads.

Exam trap

Google Cloud often tests the misconception that a view or function-based approach is sufficient for performance, when in fact persistent schema flattening is required for BI tools to achieve optimal query performance and schema compatibility.

How to eliminate wrong answers

Option B is wrong because BigQuery's automatic schema detection only works during table creation from external data sources (e.g., Cloud Storage) and cannot retroactively infer or restructure an existing table with a single JSON column. Option C is wrong because a view using JSON_QUERY and JSON_VALUE still requires runtime parsing of the JSON string for every query, which degrades performance and prevents BI tools from leveraging column-level optimizations like partitioning or clustering. Option D is wrong because the EXTRACT function in BigQuery is designed for extracting date/time parts, not for parsing JSON fields; using it would be syntactically incorrect and non-functional.

121

MCQeasy

Refer to the exhibit. A BI analyst runs a query to get total sales for the last 7 days. The query filters on sale_date BETWEEN '2023-01-01' AND '2023-01-07'. What is the primary benefit of the partitioning defined in the table?

A.It reduces the amount of data scanned by pruning partitions.

B.It automatically creates indexes on sale_date.

C.It allows the query to use clustering.

D.It enables streaming inserts.

AnswerA

Partition pruning scans only relevant partitions, minimizing data processing.

Why this answer

Partitioning in BigQuery (and similar data warehouses) physically divides the table into segments based on the partition column (sale_date). When the query filters on sale_date BETWEEN '2023-01-01' AND '2023-01-07', the query engine can perform partition pruning, scanning only the partitions that match the date range instead of the entire table. This dramatically reduces the amount of data read, lowering query cost and improving performance.

Exam trap

Google Cloud often tests the distinction between partitioning (which prunes data at the storage level) and clustering (which sorts data within partitions), leading candidates to mistakenly choose clustering as the primary benefit when the question explicitly asks about the partitioning definition.

How to eliminate wrong answers

Option B is wrong because partitioning does not automatically create indexes; BigQuery uses a columnar storage format and does not rely on traditional indexes. Option C is wrong because clustering is a separate feature that co-locates data within partitions based on sort order, but the primary benefit described here is partition pruning, not clustering. Option D is wrong because streaming inserts are a method for ingesting real-time data and are unrelated to the query performance benefit of partition pruning.

122

MCQhard

A financial institution uses Cloud SQL for MySQL to handle transaction processing. They need to generate daily BI reports that aggregate millions of transactions per account. The BI queries are CPU-intensive and degrade OLTP performance. What is the most effective solution?

A.Schedule reports during off-peak hours only

B.Create a Cloud SQL read replica and run reports against it

C.Use Cloud SQL's high availability configuration

D.Upgrade the primary instance to a higher machine type

AnswerB

A read replica offloads read queries from the primary, preserving OLTP performance.

Why this answer

Creating a Cloud SQL read replica allows you to offload BI reporting queries to a separate instance that replicates data from the primary using MySQL's asynchronous replication. This isolates the CPU-intensive aggregation queries from the OLTP workload, preventing performance degradation on the primary instance while still providing near-real-time data for reports.

Exam trap

Google Cloud often tests the misconception that high availability (HA) instances can serve read traffic, when in fact the standby in an HA configuration is passive and cannot be used for read offloading.

How to eliminate wrong answers

Option A is wrong because scheduling reports during off-peak hours only reduces contention but does not eliminate the CPU load from the primary instance, which can still impact OLTP performance if reports run concurrently with any other workload. Option C is wrong because Cloud SQL's high availability configuration uses a standby instance in a different zone for failover, not for read scaling; it does not offload query processing and the standby cannot serve read traffic. Option D is wrong because upgrading the primary instance to a higher machine type increases capacity but does not isolate the BI workload, so CPU-intensive queries will still compete with OLTP transactions for resources on the same instance.

123

MCQhard

A BI dashboard query is taking too long because it reads all columns from a large table. The dashboard only needs a few columns. What is the best practice?

A.Create a view that selects specific columns.

B.Create a table with only the needed columns.

C.Use a subquery to filter columns in the FROM clause.

D.Use a LIMIT clause to reduce rows.

AnswerA

Views with column selection allow column pruning.

Why this answer

Creating a view that selects specific columns is the best practice because it allows the BI dashboard to query only the necessary columns without altering the underlying table structure. Views provide a logical abstraction layer, enabling column pruning at the query level while preserving data integrity and access control. This approach reduces I/O and memory consumption by avoiding full table scans on unnecessary columns, directly addressing the performance bottleneck.

Exam trap

Google Cloud often tests the misconception that a subquery or LIMIT can optimize column-level performance, when in fact they only affect row filtering or query structure, not the column scan width.

How to eliminate wrong answers

Option B is wrong because creating a separate table duplicates data, leading to storage overhead, synchronization issues, and potential data staleness; it violates normalization principles and increases maintenance complexity. Option C is wrong because a subquery in the FROM clause does not inherently reduce column reads; the outer query still processes all columns from the subquery unless explicitly pruned, and it may not optimize execution plans as effectively as a view. Option D is wrong because a LIMIT clause restricts rows, not columns; it does not reduce the amount of data read per row, so the query still scans all columns from the large table, failing to address the root cause of slow performance.

124

Multi-Selecthard

Which THREE of the following are best practices for designing BigQuery tables for business intelligence reporting?

Select 3 answers

A.Partition tables by a date or timestamp column used in WHERE clauses.

B.Store data in many small tables to reduce the amount of data scanned per query.

C.Normalize data to reduce data redundancy.

D.Use nested repeated columns to store arrays of related data.

E.Cluster tables on columns that are frequently used in filters or group by clauses.

AnswersA, D, E

Partitioning limits scanned data and reduces costs.

Why this answer

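Partitioning tables by a date or timestamp column used in WHERE clauses (A) allows BigQuery to prune partitions, scanning only the relevant data instead of the entire table. This reduces query costs and improves performance, making it a best practice for BI reporting where queries often filter by time ranges. Nested repeated columns (D) keep related records in the same row and avoid joins at query time, and clustering (E) lets BigQuery skip blocks when queries filter or group on the clustered columns.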

Exam trap

Google Cloud often tests the misconception that normalization or many small tables are best for BigQuery, when in fact denormalization and larger, partitioned/clustered tables are optimal for BI workloads due to BigQuery's distributed architecture and pricing model.

125

MCQmedium

A company stores user events in BigQuery as nested repeated fields. They want to use Looker to build dashboards on individual events. Which SQL pattern should they use in a derived table to flatten the data?

A.SELECT fields FROM table WHERE events IS NOT NULL

B.SELECT fields FROM table, UNNEST(events) AS event

C.SELECT ARRAY_AGG(events) FROM table

D.SELECT events.* FROM table

AnswerB

CROSS JOIN UNNEST flattens the events array into rows, allowing access to event fields.

Why this answer

Option B is correct because UNNEST(events) in BigQuery SQL flattens the nested repeated field 'events' into individual rows, enabling Looker to treat each event as a separate record for dashboarding. This is the standard pattern for denormalizing arrays in BigQuery derived tables, as it converts each array element into its own row while preserving the parent record's fields.
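The row expansion that UNNEST performs can be imitated in plain Python. The sketch below is only an analogy: the row layout and the field names (`user_id`, `name`) are assumptions, since the question does not give the table schema.

```python
def unnest_events(rows):
    """Emit one flat record per array element, repeating the parent fields."""
    flat = []
    for row in rows:
        for event in row["events"]:
            flat.append({"user_id": row["user_id"], **event})
    return flat

rows = [
    {"user_id": 1, "events": [{"name": "click"}, {"name": "view"}]},
    {"user_id": 2, "events": []},  # no events, so no output rows
]
flat = unnest_events(rows)
print(flat)
```

As with the comma (cross) join in Option B, a parent row whose array is empty contributes no rows to the flattened result.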

Exam trap

Google Cloud often tests the misconception that simply selecting the nested field (option D) or filtering it (option A) will flatten the data, when in fact only UNNEST (or explicit CROSS JOIN UNNEST) achieves row-level expansion in BigQuery SQL.

How to eliminate wrong answers

Option A is wrong because WHERE events IS NOT NULL does not flatten nested repeated fields; it only filters rows where the entire 'events' array is non-null, leaving the nested structure intact and unusable for per-event analysis. Option C is wrong because ARRAY_AGG(events) does the opposite of flattening: it aggregates rows into an array, which would further nest the data and break the per-event requirement. Option D is wrong because SELECT events.* tries to expand the fields of 'events', but without UNNEST BigQuery treats 'events' as a single array column, so the query does not yield one row per event.

126

MCQeasy

Refer to the exhibit. Given the table definition and two queries, which statement about query performance is correct?

A.Query 1 will scan less data than Query 2 because it uses both partition pruning and clustering.

B.Query 2 will scan less data than Query 1 because it only needs to read one partition.

C.Query 1 will scan the same amount of data as Query 2 because both use partition pruning.

D.Both queries will perform a full table scan because the table is partitioned.

AnswerA

Query 1 filters on partition column and cluster column, enabling both pruning and block elimination.

Why this answer

Query 1 uses both partition pruning (filtering on the partition key `event_date`) and clustering (filtering on the clustering column `user_id`), allowing it to skip irrelevant partitions and, within the target partition, read only the blocks that can contain the requested `user_id`. Query 2 uses only partition pruning on `event_date` but lacks a clustering filter, so it must scan all blocks in the partition. Therefore, Query 1 scans less data than Query 2.
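The two levels of pruning can be modelled with a toy storage layout. The exhibit is not reproduced here, so the block metadata, sizes, and filter values below are invented purely for illustration; `bytes_scanned` is a hypothetical helper, not a BigQuery API.

```python
def bytes_scanned(blocks, event_date, user_id=None):
    """Sum the sizes of blocks that survive partition and cluster pruning."""
    total = 0
    for block in blocks:
        if block["partition"] != event_date:
            continue  # partition pruning: wrong date, skip entirely
        if user_id is not None and not (block["min_user"] <= user_id <= block["max_user"]):
            continue  # cluster pruning: user_id cannot be in this block
        total += block["bytes"]
    return total

blocks = [
    {"partition": "2024-01-01", "min_user": 1, "max_user": 100, "bytes": 10},
    {"partition": "2024-01-01", "min_user": 101, "max_user": 200, "bytes": 10},
    {"partition": "2024-01-01", "min_user": 201, "max_user": 300, "bytes": 10},
    {"partition": "2024-01-02", "min_user": 1, "max_user": 300, "bytes": 30},
]

query1 = bytes_scanned(blocks, "2024-01-01", user_id=150)  # date + user filter
query2 = bytes_scanned(blocks, "2024-01-01")               # date filter only
print(query1, query2)  # 10 30
```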

Exam trap

Google Cloud often tests the misconception that partition pruning alone is sufficient for optimal performance, ignoring that clustering further reduces data scanned within a partition when filters on clustering columns are present.

How to eliminate wrong answers

Option B is wrong because Query 2 does not scan less data than Query 1; it scans more data within the same partition because it lacks a clustering filter. Option C is wrong because the two queries do not scan the same amount of data; Query 1 benefits from both partition pruning and clustering, reducing the scan further. Option D is wrong because both queries use partition pruning on `event_date`, so they do not perform a full table scan; they only scan the relevant partition(s).

127

Multi-Selecthard

Which TWO of the following are valid approaches when troubleshooting a slow BI query in BigQuery that includes a complex JOIN between a large fact table and multiple dimension tables?

Select 2 answers

A.Ensure the fact table is clustered on the join key

B.Split the fact table into multiple smaller tables by region

C.Filter the fact table before the JOIN to reduce the number of rows

D.Move the data to Cloud SQL for faster joins

E.Add indexes on the join columns

AnswersA, C

Clustering improves join efficiency by colocating data.

Why this answer

Option A is correct because clustering on the join key in BigQuery physically co-locates rows with the same key value within the same block, reducing the amount of data scanned during the JOIN. This is especially effective for large fact tables, as it minimizes the need to shuffle data across slots, directly improving query performance. Option C is correct because filtering the fact table before the JOIN reduces the number of rows that have to be shuffled and joined against the dimension tables.

Exam trap

The trap here is that candidates familiar with traditional databases may assume indexes (Option E) or moving to an OLTP system (Option D) are valid optimizations, but BigQuery's serverless, columnar architecture requires different techniques like clustering and predicate pushdown.

128

MCQeasy

A company uses BigQuery for BI dashboards. Users report that queries on the sales table take longer than expected. The table contains daily transaction data and is not partitioned. Which action will most improve query performance while minimizing cost?

A.Increase the BigQuery reservation slot count

B.Partition the table by the transaction date column

C.Cluster the table by the transaction date column

D.Denormalize the table by including dimension attributes

AnswerB

Partitioning limits data scanned to relevant partitions, improving performance and reducing cost.

Why this answer

Partitioning by the transaction date column lets queries that filter on date scan only the relevant partitions, improving performance and lowering cost. Clustering alone may not reduce scanned bytes as effectively, and denormalization can help but may increase storage costs. Increasing reservation slots raises cost without reducing the data each query scans.

129

MCQeasy

A data engineer needs to design a table to store time-series sensor data arriving every second. The data will be queried mainly for the last hour over a specific device. Which table design minimizes query costs?

A.Partition by ingestion_time, cluster by timestamp

B.Partition by ingestion_time, no clustering

C.No partitioning, cluster by device_id

D.Partition by ingestion_time, cluster by device_id

AnswerD

Partitioning enables time-range pruning; clustering on device_id speeds up per-device lookups.

Why this answer

Option D minimizes query costs because partitioning by ingestion_time allows the query engine to skip partitions outside the last hour, while clustering by device_id further narrows the scan to only the relevant device's data within those partitions. This combination reduces the amount of data read and the number of files scanned, which is critical for high-frequency time-series data.

Exam trap

Google Cloud often tests the misconception that clustering by the same column as partitioning provides extra benefit, but in reality it is redundant and can increase maintenance overhead without improving query performance.

How to eliminate wrong answers

Option A is wrong because clustering by timestamp within a partition by ingestion_time is redundant—since the partition already organizes data by time, clustering by the same column adds no additional pruning benefit and wastes clustering resources. Option B is wrong because without clustering, queries filtering on device_id must scan all rows in the relevant partitions, leading to full partition scans and higher query costs. Option C is wrong because no partitioning means every query must scan the entire table, even when filtering on the last hour, resulting in maximum data read and cost.

130

Multi-Selecteasy

A company wants to create a BI dashboard that shows daily active users. The data is stored in a BigQuery table with columns: user_id, activity_date, and event_type. Which two optimizations would help reduce query costs? (Choose two.)

Select 2 answers

A.Cluster the table by event_type.

B.Use SELECT * and filter in the BI tool.

C.Use a materialized view with COUNT(DISTINCT user_id) grouped by activity_date.

D.Avoid using the LIMIT clause.

E.Partition the table by activity_date.

AnswersC, E

A materialized view caches the aggregation, avoiding repeated computation.

Why this answer

Option C is correct because a materialized view precomputes the COUNT(DISTINCT user_id) grouped by activity_date, so queries against it read only the pre-aggregated results rather than scanning the entire base table. This drastically reduces the amount of data processed, lowering query costs in BigQuery's on-demand pricing model where cost is proportional to bytes processed. Option E is correct because partitioning by activity_date lets queries that filter on a date range scan only the matching partitions.

Exam trap

Google Cloud often tests the misconception that clustering alone reduces query cost for any aggregation, but clustering only reduces cost when the query filters or groups by the cluster key, not when the aggregation is on a different column like activity_date.

131

MCQhard

A company's BI dashboard queries a BigQuery table that is 20 TB and uses clustering on date and country. The query filters on date and country and also aggregates by category. The query takes 30 seconds. They want to reduce latency to under 5 seconds. What should they do?

A.Partition the table by date.

B.Add clustering by category.

C.Increase query priority.

D.Create a materialized view that aggregates by date, country, and category.

AnswerD

Materialized view stores the aggregated result, so query scans only the view.

Why this answer

The correct answer is D because a materialized view precomputes and stores the aggregation by date, country, and category, eliminating the need to scan the full 20 TB table on every query. This reduces query latency dramatically by serving pre-aggregated results, directly addressing the filter and aggregation requirements. Partitioning or clustering alone cannot achieve sub-5-second latency on a 20 TB table because they still require scanning all matching partitions or clusters and performing the aggregation at query time.

Exam trap

The trap here is that candidates often assume partitioning or clustering alone can achieve drastic latency reductions, but they overlook that aggregation over a large dataset still requires significant computation, whereas a materialized view precomputes the result, which is the only way to guarantee sub-5-second latency for this workload.

How to eliminate wrong answers

Option A is wrong because partitioning by date limits the scan to the matching date partitions, but the query must still read and aggregate every detail row in those partitions at query time. Option B is wrong because clustering by category co-locates related rows but does not precompute the aggregation; the query still scans and aggregates the filtered rows on every run. Option C is wrong because query priority affects only scheduling; it changes neither the amount of data scanned nor the work required to aggregate it.
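
The effect of pre-aggregation can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: SQLite has no materialized views, so a summary table built once stands in for one, and the table and column names (sales, sales_summary, day, country, category, amount) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, country TEXT, category TEXT, amount INTEGER)")

# Hypothetical data: 2 days x 2 countries x 2 categories x 100 detail rows each.
rows = [
    (day, country, category, 1)
    for day in ("2024-01-01", "2024-01-02")
    for country in ("US", "DE")
    for category in ("toys", "books")
    for _ in range(100)
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Stand-in for a materialized view: the aggregation is computed once and stored.
conn.execute("""
    CREATE TABLE sales_summary AS
    SELECT day, country, category, SUM(amount) AS total
    FROM sales
    GROUP BY day, country, category
""")

dashboard_filter = ("2024-01-01", "US")

# Base table: every matching detail row is aggregated at query time.
from_base = conn.execute(
    "SELECT category, SUM(amount) FROM sales WHERE day = ? AND country = ? "
    "GROUP BY category ORDER BY category", dashboard_filter).fetchall()

# Summary table: the same answer comes from already-aggregated rows.
from_summary = conn.execute(
    "SELECT category, total FROM sales_summary WHERE day = ? AND country = ? "
    "ORDER BY category", dashboard_filter).fetchall()

base_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
summary_rows = conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
print(from_base == from_summary, base_rows, summary_rows)
```

Both queries return the same totals, but the summary holds one row per (day, country, category) combination instead of one row per sale, which is the property a materialized view exploits.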

132

MCQ (easy)

Refer to the exhibit. The BI team creates a view to summarize sales. When they query the view with an additional WHERE clause on region, they notice that the underlying query still processes the same amount of data regardless of the filter. What is the most likely reason?

A.The view is a materialized view that refreshes every 30 minutes.

B.The view's WHERE clause on date is too restrictive, causing a full scan.

C.The view uses authorized views, which prevent predicate pushdown.

D.The view is a logical view, not a materialized view, so filters on the view do not reduce the scanned data.

Answer: D

Logical views execute the defining query each time; filters are applied after the view query.

Why this answer

Option D is correct because a logical (standard) view stores no data; its defining query runs every time the view is queried, so the outer filter on region is applied to that query's output rather than shrinking the data read from the base table. Option A is wrong because the view is not a materialized view. Option B is wrong because the date filter is part of the view definition, and a restrictive filter does not cause a full scan. Option C is wrong because the view is a standard view, not an authorized view.
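
A minimal sketch of what "logical view" means, using Python's sqlite3 as a stand-in for BigQuery (the sales table and the sales_by_region view are hypothetical names). It shows only that a view stores query text rather than results; it does not model BigQuery's bytes-processed accounting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2024-03-01", "EMEA", 10), ("2024-03-01", "APAC", 20)],
)

# A logical view stores only the query text, not its result.
conn.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE sale_date >= '2024-01-01'
    GROUP BY region
""")

query = "SELECT total FROM sales_by_region WHERE region = 'EMEA'"
before = conn.execute(query).fetchone()[0]

# A new base-table row shows up immediately, because the view's SELECT runs again.
conn.execute("INSERT INTO sales VALUES ('2024-03-02', 'EMEA', 5)")
after = conn.execute(query).fetchone()[0]

print(before, after)
```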

133

Multi-Select (medium)

A company uses BigQuery for BI analytics. They want to improve query performance for a table with 10 TB of data. Which two actions should they take? (Choose two.)

Select 2 answers

A.Limit the number of columns queried using SELECT * with EXCEPT.

B.Use a wildcard table to combine multiple tables.

C.Partition by a column with a high granularity.

D.Cluster on columns used in filters and aggregations.

E.Use a clustered column as the partition key.

Answers: A, D

Reducing columns scanned decreases processed bytes and cost.

Why this answer

Option A is correct because SELECT * with EXCEPT limits the columns scanned, reducing I/O and improving query performance in BigQuery. BigQuery charges by the amount of data processed, so reading fewer columns lowers both cost and execution time. Option D is correct because clustering on the columns used in filters and aggregations lets BigQuery skip blocks whose values cannot match the query.

Exam trap

Google Cloud often tests the distinction between partitioning and clustering, where candidates mistakenly think that high-granularity partitioning or using a clustered column as a partition key improves performance, when in fact it introduces overhead and defeats the purpose of each feature.

134

MCQ (medium)

A retail company uses BigQuery to store sales transactions. The BI team needs to create a monthly customer lifetime value (CLV) report that aggregates purchase history across multiple tables. Which BigQuery feature should they use to define the data structure for this report?

A.Create a materialized view with the aggregation query

B.Create a view that joins and aggregates the tables

C.Create an external table pointing to the raw data files

D.Create a new table to store the aggregated data using INSERT SELECT

Answer: B

A view provides a logical virtual table that hides complexity and ensures the BI team always sees the latest data.

Why this answer

Option B is correct because a view in BigQuery allows the BI team to define a logical data structure that joins and aggregates multiple tables without storing the results. This ensures the monthly CLV report always reflects the latest data, as views are re-evaluated at query time, which is ideal for recurring reports that need up-to-date aggregations.

Exam trap

Google Cloud often tests the distinction between views and materialized views, trapping candidates who assume materialized views are always better for performance without considering the need for real-time data freshness in recurring reports.

How to eliminate wrong answers

Option A is wrong because a materialized view stores pre-computed results that depend on refreshes to stay current, making it a weaker fit for a report that must always reflect the most recent purchase history. Option C is wrong because an external table only points to raw data files (e.g., in Cloud Storage); it can be queried with SQL, but it does not define the joined, aggregated structure the report needs. Option D is wrong because creating a new table with INSERT SELECT stores a static snapshot of the data, which would require manual re-execution to update the CLV report, defeating the purpose of a dynamic, recurring report.

135

MCQ (medium)

A data analyst runs a query joining several large tables and gets 'Resources exceeded' error. They need to reduce memory usage without changing the query logic. What should they do?

A.Use a subquery to pre-aggregate the largest table before joining

B.Use APPROX_COUNT_DISTINCT for counting distinct values

C.Increase the slot reservation

D.Use SELECT * in the subquery to ensure all columns are available

Answer: A

Pre-aggregation reduces the row count and columns, decreasing shuffle and memory.

Why this answer

Option A is correct because pre-aggregating the largest table in a subquery reduces the amount of data that needs to be shuffled and joined in memory. In BigQuery, this minimizes the bytes processed and the memory footprint of the join operation, directly addressing the 'Resources exceeded' error without altering the overall query logic.

Exam trap

The trap here is that candidates often confuse increasing resources (slots) with reducing memory usage, or they think that approximate functions like APPROX_COUNT_DISTINCT can fix join memory errors, when in fact they only affect aggregation accuracy.

How to eliminate wrong answers

Option B is wrong because APPROX_COUNT_DISTINCT reduces the accuracy of distinct counts but does not reduce the memory usage of a join operation; it only optimizes a specific aggregation function. Option C is wrong because increasing the slot reservation increases the available compute resources (slots) but does not reduce the memory usage per query; it may delay the error but does not fix the underlying memory bottleneck. Option D is wrong because using SELECT * in a subquery retrieves all columns, which increases the data volume and memory consumption, making the 'Resources exceeded' error worse.
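
The pre-aggregation idea can be sketched with Python's sqlite3 as a stand-in for BigQuery. The customers and orders tables and their columns are hypothetical; the point is only that aggregating the large table first shrinks what enters the join without changing the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "retail"), (2, "retail"), (3, "wholesale")])
# 300 order rows spread evenly over only 3 customers.
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(1 + i % 3, 10) for i in range(300)])

# Join first, aggregate afterwards: all 300 order rows enter the join.
join_then_aggregate = conn.execute("""
    SELECT c.segment, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c USING (customer_id)
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()

# Pre-aggregate in a subquery: only one row per customer enters the join.
aggregate_then_join = conn.execute("""
    SELECT c.segment, SUM(o.total)
    FROM (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) AS o
    JOIN customers AS c USING (customer_id)
    GROUP BY c.segment
    ORDER BY c.segment
""").fetchall()

rows_into_join = conn.execute(
    "SELECT COUNT(*) FROM (SELECT customer_id FROM orders GROUP BY customer_id)"
).fetchone()[0]

print(join_then_aggregate == aggregate_then_join, rows_into_join)
```

The rewrite is safe here because SUM can be computed in two stages; an aggregate that cannot be split this way would need a different approach.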

136

MCQ (hard)

A BI manager needs to restrict access to sensitive sales data so that salespeople can only see their own region's data. Which BigQuery feature should be used to implement row-level security without duplicating tables?

A.Use column-level security to hide sensitive columns

B.Use BigQuery row-level access policies

C.Create an authorized view that uses SESSION_USER() in a WHERE clause to filter rows

D.Create separate IAM roles for each region

Answer: B

Row-level access policies filter the rows each principal can read from a single table, with no copies of the data.

Why this answer

Option B is correct because BigQuery provides row-level security natively: a row access policy, created with CREATE ROW ACCESS POLICY, attaches a filter expression to a table and grants it to specific users or groups. A policy that filters on the region column lets every salesperson query the same table and see only their own region's rows, so no table is duplicated.

Exam trap

The trap here is that an authorized view filtering on SESSION_USER() also restricts rows, so candidates select Option C. That pattern works, but it adds a view, and usually a user-to-region mapping table, to maintain, while the question asks for the BigQuery feature that implements row-level security.

How to eliminate wrong answers

Option A is wrong because column-level security hides entire columns (e.g., salary), not rows, so it cannot restrict which rows a salesperson sees based on region. Option C is not the best answer because an authorized view with SESSION_USER() is a workaround built from a view and a filter rather than the dedicated row-level security feature. Option D is wrong because IAM roles control access at the dataset or table level, not at the row level, and creating separate roles per region would require duplicating tables or complex, unscalable management.
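
Whichever mechanism enforces it, the effect is a per-user row filter over one shared table. The sketch below imitates that filter with Python's sqlite3. It is an illustration only: the user_regions mapping table and the visible_sales helper are hypothetical, and the user_email parameter stands in for the caller identity that BigQuery would supply itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    CREATE TABLE user_regions (user_email TEXT, region TEXT);
    INSERT INTO sales VALUES ('EMEA', 100), ('EMEA', 50), ('APAC', 70);
    INSERT INTO user_regions VALUES
        ('alice@example.com', 'EMEA'),
        ('bob@example.com', 'APAC');
""")

def visible_sales(user_email):
    """Return only the sales rows whose region is mapped to the given user."""
    return conn.execute(
        """
        SELECT region, amount
        FROM sales
        WHERE region IN (
            SELECT region FROM user_regions WHERE user_email = ?
        )
        ORDER BY amount
        """,
        (user_email,),
    ).fetchall()

print(visible_sales("alice@example.com"))
print(visible_sales("bob@example.com"))
print(visible_sales("mallory@example.com"))
```

A user with no mapping sees no rows at all, which is the behavior a row-level filter is expected to have by default.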

137

Multi-Select (medium)

Which TWO statements are true about designing a star schema for BI reporting?

Select 2 answers

A.Fact tables store descriptive attributes like product names

B.Dimension tables are denormalized to reduce the number of joins

C.Fact tables use natural keys to enforce referential integrity

D.Fact tables contain quantitative measures

E.Dimension tables are normalized to minimize redundancy

Answers: B, D

Denormalized dimensions allow joining directly to the fact table without additional joins.

Why this answer

Option B is correct because dimension tables in a star schema are intentionally denormalized to reduce the number of joins required for BI queries. Fact tables can then join directly to each dimension table without traversing a chain of normalized tables, which is a key design principle for OLAP reporting. Option D is correct because fact tables hold the quantitative measures, such as quantity and revenue, that queries aggregate, alongside foreign keys to the dimensions.

Exam trap

Google Cloud often tests the misconception that dimension tables should be normalized for data integrity, but in star schemas for BI, denormalization is intentional to optimize query performance over normalization.

138

MCQ (medium)

A company uses BigQuery for real-time BI. They have a table with streaming inserts. Analysts run queries that need to see data within seconds. However, they notice that streaming data appears with a delay of up to 2 minutes. What is the most likely reason?

A.The query uses cached results.

B.The table is partitioned by hour.

C.The streaming buffer's flush interval is set to 2 minutes.

D.The table has a clustering key.

Answer: C

By default, BigQuery flushes streaming buffers every 90 seconds; configuration can change this.

Why this answer

Option C is correct because BigQuery's streaming buffer has a default flush interval of up to 90 seconds, but it can be configured. When the flush interval is set to 2 minutes, data written via streaming inserts remains in the buffer for that duration before being committed to the table, causing a delay of up to 2 minutes before it becomes visible to queries. This matches the symptom described in the question.

Exam trap

Google Cloud often tests the misconception that partitioning or clustering directly affects data freshness, when in fact they only impact storage organization and query performance, not the latency of streaming data visibility.

How to eliminate wrong answers

Option A is wrong because cached results only affect query performance, not the freshness of streaming data; cached results are served from a temporary cache and do not delay the visibility of newly streamed data. Option B is wrong because partitioning by hour does not inherently introduce a delay; it organizes data into partitions but does not control when streaming data becomes available for queries. Option D is wrong because a clustering key improves query performance by sorting data within partitions, but it has no impact on the latency of streaming data appearing in query results.

139

MCQ (medium)

A company is using BigQuery for BI and needs to reduce costs for a large historical dataset that is infrequently queried. Which approach should they take?

A.Use materialized views for common aggregations.

B.Use clustered tables.

C.Partition by ingestion time and set expiration on partitions older than 90 days.

D.Use a view with a WHERE clause filtering recent data.

Answer: C

Expired partitions are deleted, reducing storage costs.

Why this answer

Option C is correct because partitioning by ingestion time allows BigQuery to automatically manage data lifecycle by setting partition expiration. This reduces storage costs for historical data that is infrequently queried, as partitions older than 90 days are deleted without manual intervention. This approach directly addresses the need to reduce costs for a large historical dataset while maintaining query performance on recent data.

Exam trap

Google Cloud often tests the distinction between cost reduction and performance optimization, leading candidates to choose clustering or materialized views (which improve query speed) instead of the storage lifecycle management solution that directly reduces costs.

How to eliminate wrong answers

Option A is wrong because materialized views improve query performance for common aggregations but do not reduce storage costs for historical data; they actually incur additional storage costs for the precomputed results. Option B is wrong because clustered tables optimize query performance by sorting data within partitions but do not reduce storage costs or automatically expire old data. Option D is wrong because a view with a WHERE clause filtering recent data only limits the data scanned at query time, but the underlying historical data remains in storage and continues to incur costs.

140

Multi-Select (hard)

A company wants to reduce BigQuery query costs for their BI workloads. Which THREE actions effectively lower the amount of data processed per query? (Choose THREE.)

Select 3 answers

A.Use partitioned tables on date column

B.Use LIMIT in subqueries to reduce output

C.Use clustered tables on frequently filtered columns

D.Use SELECT * to avoid missing columns

E.Use materialized views that match common query patterns

Answers: A, C, E

Partitioning limits query scans to relevant partitions, cutting bytes.

Why this answer

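Option A is correct because a partitioned table lets a query filter on the partition column (e.g., a date column) in its WHERE clause, so BigQuery can prune entire partitions from the scan. This directly reduces the data read and billed and is a primary cost-control mechanism in BigQuery. Option C is correct because clustering on frequently filtered columns lets BigQuery skip blocks whose values cannot match the filter. Option E is correct because a materialized view that matches a common query pattern answers the query from precomputed results instead of rescanning the base table.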

Exam trap

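Google Cloud often tests the misconception that row-limiting clauses like LIMIT reduce data processing costs. In BigQuery, LIMIT generally trims rows only after the data has been read; bytes scanned fall when columns, partitions, or clustered blocks are pruned, or when precomputed results are read instead.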
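
The difference between pruning and limiting can be illustrated with a toy model in Python. Everything in it is an assumption made for the example: the in-memory table layout, the 8-byte cost per value, and the bytes_scanned helper. It is not how BigQuery stores or bills data.

```python
# Toy model of a partitioned, columnar table: partition -> column -> values.
# Each value is assumed to cost 8 bytes; real storage and billing differ.
BYTES_PER_VALUE = 8

table = {
    day: {
        "user_id": list(range(1000)),
        "event_type": ["click"] * 1000,
        "payload": ["x"] * 1000,
    }
    for day in ("2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04")
}

def bytes_scanned(columns, days=None, limit=None):
    """Bytes read when selecting `columns` from the partitions in `days`.

    `limit` is accepted but ignored on purpose: in this model rows are
    trimmed only after the selected columns have been read.
    """
    partitions = table.keys() if days is None else days
    return sum(
        len(table[day][col]) * BYTES_PER_VALUE
        for day in partitions
        for col in columns
    )

all_columns = ["user_id", "event_type", "payload"]
full_scan = bytes_scanned(all_columns)                         # SELECT *
with_limit = bytes_scanned(all_columns, limit=10)              # SELECT * ... LIMIT 10
column_pruned = bytes_scanned(["user_id"])                     # SELECT user_id
partition_pruned = bytes_scanned(["user_id"], ["2024-01-04"])  # ... WHERE day = '2024-01-04'

print(full_scan, with_limit, column_pruned, partition_pruned)
```

In this model the LIMIT variant reads exactly as much as the full scan, while dropping columns and then partitions each cut the total.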

141

MCQ (hard)

A company uses BigQuery BI Engine for sub-second query performance. However, some queries are hitting the BI Engine memory limit. Which action should be taken?

A.Cluster the tables more granularly.

B.Increase BI Engine capacity allocation.

C.Use a reservation with a higher slot count.

D.Optimize the dimension tables by denormalizing.

Answer: B

Allocating more memory to BI Engine allows caching larger datasets.

Why this answer

BI Engine is an in-memory analysis service that accelerates queries by caching data in memory. When a query needs more memory than the allocated capacity provides, it cannot be fully accelerated and falls back to standard BigQuery execution, which is slower. Increasing the BI Engine capacity allocation directly addresses this by providing more memory for caching, enabling sub-second query performance for larger datasets.

Exam trap

Google Cloud often tests the misconception that increasing slot count (compute) solves memory bottlenecks, but BI Engine memory is a separate resource that must be explicitly allocated; candidates confuse slot-based reservations with in-memory caching.

How to eliminate wrong answers

Option A is wrong because clustering tables more granularly improves partition pruning and data skipping but does not increase the memory available to BI Engine; it may even increase memory pressure by creating more fine-grained data segments. Option C is wrong because a reservation with a higher slot count increases query concurrency and compute resources, not the in-memory cache size for BI Engine; slots and BI Engine memory are separate resources. Option D is wrong because denormalizing dimension tables reduces join complexity but does not expand BI Engine's memory limit; it could actually increase the data volume cached, exacerbating the memory issue.

142

MCQ (medium)

Refer to the exhibit. The query joins two large tables and aggregates results. Which optimization would most likely reduce the high shuffle bytes in Stage 3?

A.Add a WHERE clause to filter rows before the join.

B.Ensure both tables are clustered on the join key.

C.Use a broadcast join hint to force one table to be broadcast.

D.Add an ORDER BY clause to sort the data before aggregation.

Answer: A

Filtering early reduces the data that needs to be shuffled.

Why this answer

Option A is correct because filtering rows before the join reduces the amount of data that has to be shuffled. Option B is wrong because clustering on the join key may reduce the shuffle but does not eliminate it. Option C is wrong because a broadcast join helps only when one of the tables is small, and both tables here are large. Option D is wrong because ORDER BY is not the cause of the shuffle; the join is, and sorting before aggregation only adds work.

143

MCQ (medium)

A BI analyst wrote a query that computes the running total of sales over time for each product. The query uses a window function with an ORDER BY clause. The results are correct, but the query processes a large amount of data and is slow. What is the most efficient way to optimize this query?

A.Use the LAG function instead of a window function.

B.Materialize the running total in a separate table using a scheduled query.

C.Use the ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame.

D.Add a PARTITION BY clause to the window function.

Answer: D

Partitioning by product limits the window operation to individual product groups, reducing sorting and shuffle.

Why this answer

Option D is correct because adding a PARTITION BY clause to the window function allows the running total to be computed independently for each product, which reduces the data set the window function must sort and aggregate over. Without PARTITION BY, the query computes a single running total across all products, forcing the database engine to process the entire table as one partition, which is inefficient for large datasets. Partitioning by product ensures that the ORDER BY and frame operations are scoped to each product group, significantly reducing memory and CPU usage.

Exam trap

Google Cloud often tests the misconception that spelling out the frame (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) improves performance. A window with ORDER BY already defaults to a cumulative frame, so the key optimization for a running total over multiple groups is to add a PARTITION BY clause to limit the scope of the window function.

How to eliminate wrong answers

Option A is wrong because the LAG function accesses a previous row's value but does not compute a running total; it would require additional logic to accumulate values, which would be more complex and no more efficient. Option B is wrong because materializing the running total in a separate table with a scheduled query does not optimize the existing query; it introduces data staleness and maintenance overhead, and the original query still runs slowly until the materialized table is built. Option C is wrong because a window with ORDER BY and no explicit frame already defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which accumulates from the first row to the current one; writing the ROWS form changes only how ties in the ordering key are handled, not the amount of data processed.
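
A per-product running total can be sketched with Python's sqlite3, which supports window functions from SQLite 3.25 onward. The sales table and its columns are hypothetical, and SQLite stands in for BigQuery here; the window syntax shown is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("A", "2024-01-01", 10),
        ("B", "2024-01-01", 5),
        ("A", "2024-01-02", 20),
        ("B", "2024-01-02", 7),
    ],
)

# PARTITION BY restarts the running total for each product; ORDER BY sets the
# accumulation order inside each partition.
running_totals = conn.execute("""
    SELECT product, sale_date,
           SUM(amount) OVER (
               PARTITION BY product
               ORDER BY sale_date
           ) AS running_total
    FROM sales
    ORDER BY product, sale_date
""").fetchall()

for row in running_totals:
    print(row)
```

Each product's total accumulates independently, so the engine only has to order rows within a product rather than across the whole table.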

144

MCQ (easy)

A company is designing a data warehouse for business intelligence reporting. They want to organize data into fact and dimension tables to support fast aggregations. Which schema design is most appropriate for this purpose?

A.Star schema

B.Third Normal Form (3NF) schema

C.Snowflake schema

D.Entity-relationship schema

Answer: A

Star schema denormalizes dimensions into a single table per dimension, enabling fast aggregation and simple joins.

Why this answer

The star schema is most appropriate for business intelligence reporting because it denormalizes dimension tables around a central fact table, enabling fast aggregations and simple queries. This design minimizes the number of joins required for analytical queries, which is critical for performance in OLAP workloads. In contrast, normalized schemas like 3NF or snowflake increase join complexity and degrade query speed.

Exam trap

Google Cloud often tests the misconception that a snowflake schema is better for BI because it saves storage, but the exam emphasizes that query performance and simplicity for aggregation are the primary goals, making the star schema the correct choice.

How to eliminate wrong answers

Option B is wrong because a Third Normal Form (3NF) schema is highly normalized to eliminate data redundancy, which is optimal for OLTP transaction processing but introduces many joins that slow down BI aggregations. Option C is wrong because a snowflake schema normalizes dimension tables into sub-dimensions, reducing storage but increasing join depth and query complexity, which can hurt performance in high-volume reporting. Option D is wrong because an entity-relationship schema is a generic modeling approach used for database design, not a specific schema optimized for BI fact-dimension aggregation; it lacks the denormalized structure needed for fast star-join queries.
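
A minimal star schema can be sketched with Python's sqlite3. The table names (fact_sales, dim_product, dim_date), the columns, and the sample rows are all invented for illustration; the shape is what matters: one fact table of measures, with each dimension a single join away.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per member.
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        quantity INTEGER,
        revenue INTEGER
    );

    INSERT INTO dim_product VALUES (1, 'Robot', 'toys'), (2, 'Atlas', 'books');
    INSERT INTO dim_date VALUES (20240115, '2024-01-15', '2024-01'),
                                (20240210, '2024-02-10', '2024-02');
    INSERT INTO fact_sales VALUES (1, 20240115, 2, 40), (2, 20240115, 1, 15),
                                  (1, 20240210, 3, 60);
""")

# Each dimension is exactly one join away from the fact table.
revenue_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.revenue) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

print(revenue_by_category_month)
```

In a snowflake design, category would live in its own table and the same report would need an extra join through dim_product.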

145

MCQ (hard)

Refer to the exhibit. The query scans 500 GB even though it filters on the partitioning column event_date and only needs data from 30 days. What is the most likely reason?

A.COUNT(DISTINCT) often results in full table scan to ensure accuracy, even with partitions.

B.The query lacks a LIMIT clause.

C.The clustering on user_id is causing a full table scan.

D.The table is not actually partitioned by event_date; the filter is on a non-partitioned column.

Answer: A

Distinct aggregations can require scanning all data to ensure correctness.

Why this answer

Option A is correct because COUNT(DISTINCT) in many SQL engines, including those used in data warehousing like Google BigQuery or Snowflake, often requires a full scan of all partitions to ensure global uniqueness. Even with a filter on the partitioning column, the engine cannot guarantee that distinct values are confined to the filtered partitions without scanning all data, especially if the distinct operation spans across partitions or if the engine's optimizer lacks partition pruning for distinct aggregations.

Exam trap

Google Cloud often tests the misconception that partition pruning always applies to aggregation functions, but the trap here is that COUNT(DISTINCT) bypasses partition pruning because it requires global deduplication, leading to a full table scan even with a partition filter.

How to eliminate wrong answers

Option B is wrong because a LIMIT clause does not affect the scan size of an aggregation query; it only limits the number of rows returned after processing, not the data read. Option C is wrong because clustering on user_id does not cause a full table scan; clustering reorganizes data within partitions for better compression and query performance, but it does not override partition pruning or force a full scan. Option D is wrong because the question states that the filter is on the partitioning column event_date, so the table is partitioned by that column.

146

Multi-Select (medium)

A company uses Cloud SQL for PostgreSQL for its BI database. Queries involving joins on large tables are slow. Which TWO strategies should they implement to improve join performance? (Choose TWO.)

Select 2 answers

A.Denormalize tables to reduce the number of joins

B.Add indexes on the columns used in JOIN conditions

C.Increase the number of CPU cores on the instance

D.Create read replicas for the join queries

E.Use connection pooling to reduce connection overhead

Answers: A, B

Denormalization physically stores related data together, avoiding joins.

Why this answer

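Option A is correct because denormalizing tables reduces the number of joins required in queries by combining related data into fewer tables. This directly reduces the computational overhead of join operations in Cloud SQL for PostgreSQL, which is especially beneficial for large BI datasets where join performance is critical. Option B is correct because indexes on the columns used in JOIN conditions let PostgreSQL locate matching rows without scanning entire tables.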

Exam trap

The trap here is that candidates often confuse scaling resources (CPU, replicas) with query optimization techniques, failing to recognize that denormalization and indexing directly address the join performance bottleneck at the data structure level.

147

MCQ (medium)

A company tracks customer demographics that change over time (e.g., address). They need to maintain historical accuracy in BI reports. Which approach correctly implements a Type 2 slowly changing dimension?

A.Store only the current value and rely on the fact table's timestamp to infer history

B.Add effective start and end date columns for each dimension attribute

C.Store only the current value in the dimension table and use an audit log for changes

D.Overwrite the old value with the new value

Answer: B

This standard SCD Type 2 pattern allows querying the state of the dimension at any point in time.

Why this answer

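Option B is correct because a Type 2 slowly changing dimension adds a new row for each change and records effective start and end dates, so a report can join to the version that was in effect on any given date. Option A is wrong because a dimension that stores only the current value has no record of earlier values, and the fact table's timestamp cannot recover them. Option C is wrong because an audit log keeps the changes outside the dimension, so reconstructing a historical snapshot requires complex queries. Option D is wrong because overwriting the old value (a Type 1 change) discards history.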
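
A Type 2 dimension and the date-range join it enables can be sketched with Python's sqlite3. The schema is an assumption made for the example: dim_customer, fact_orders, the valid_from and valid_to column names, the exclusive end date, and the '9999-12-31' sentinel for the current version are common conventions, not something the question prescribes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per version of a customer; valid_to is exclusive and the
    -- current version uses a far-future sentinel date.
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
        customer_id INTEGER,                -- business key, shared by versions
        city TEXT,
        valid_from TEXT,
        valid_to TEXT
    );
    CREATE TABLE fact_orders (customer_id INTEGER, order_date TEXT, amount INTEGER);

    INSERT INTO dim_customer VALUES
        (1, 42, 'Lyon',  '2023-01-01', '2024-03-01'),
        (2, 42, 'Paris', '2024-03-01', '9999-12-31');
    INSERT INTO fact_orders VALUES
        (42, '2024-02-10', 30),
        (42, '2024-04-05', 50);
""")

# Join each order to the dimension version in effect on the order date.
orders_with_city = conn.execute("""
    SELECT f.order_date, d.city, f.amount
    FROM fact_orders AS f
    JOIN dim_customer AS d
      ON d.customer_id = f.customer_id
     AND f.order_date >= d.valid_from
     AND f.order_date < d.valid_to
    ORDER BY f.order_date
""").fetchall()

print(orders_with_city)
```

The February order is reported under the old city and the April order under the new one, which is the historical accuracy that overwriting would lose.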

148

MCQ (medium)

Refer to the exhibit. A BI query is performing slowly. The query plan shows a large shuffle in the aggregate stage. The table is not partitioned or clustered. Which optimization would most directly reduce the shuffle size?

A.Converting the query to use a window function.

B.Using a materialized view.

C.Adding a WHERE clause to filter recent data.

D.Clustering the table on the grouping columns.

Answer: D

Clustering by grouping columns pre-orders data, minimizing shuffle during aggregation.

Why this answer

Clustering the table on the grouping columns physically co-locates rows with the same group key values within the same storage units (e.g., files or partitions). This allows the query engine to perform partial aggregation locally before the shuffle, dramatically reducing the amount of data that must be moved across the network during the aggregate stage. In systems like BigQuery or Spark SQL, clustering on grouping columns directly minimizes shuffle size by enabling pre-aggregation at the storage layer.

Exam trap

Google Cloud often tests the distinction between reducing data scanned (filtering) versus reducing data shuffled (clustering/partitioning), and candidates mistakenly choose a WHERE clause because they think less input data equals less shuffle, but shuffle size depends on the grouping key distribution, not the total data volume.

How to eliminate wrong answers

Option A is wrong because converting to a window function does not reduce shuffle size; window functions still require partitioning and ordering, often causing an even larger shuffle. Option B is wrong because a materialized view pre-computes and stores the query result, but it does not reduce the shuffle of the original query; it avoids the query entirely, which is a different optimization strategy. Option C is wrong because adding a WHERE clause to filter recent data reduces the total data scanned but does not directly reduce the shuffle size for the remaining data; the shuffle still occurs on the filtered dataset, and the grouping columns remain unoptimized.

149

Multi-Select (medium)

Which THREE components are required to compute a 7-day moving average of daily sales using a window function? (Choose three.)

Select 3 answers

A.PARTITION BY product

B.WINDOW clause

C.AVG() function

D.ROWS BETWEEN 6 PRECEDING AND CURRENT ROW

E.ORDER BY date

Answers: C, D, E

AVG calculates the average.

Why this answer

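Option C is correct because AVG() is the aggregate function that computes the arithmetic mean of the sales values in the window frame. Option D is correct because ROWS BETWEEN 6 PRECEDING AND CURRENT ROW limits the frame to the current row and the six rows before it, seven rows in total. Option E is correct because ORDER BY date defines which rows count as preceding. PARTITION BY product is needed only when the average is computed separately for each product.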

Exam trap

Google Cloud often tests the misconception that the WINDOW clause is mandatory for window functions, when in fact it is only a convenience for reusing a window specification, and the frame can be defined directly in the OVER clause.
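
The three required pieces can be seen together in a short sketch using Python's sqlite3 (window functions need SQLite 3.25 or later). The daily_sales table and its sample values are hypothetical, and the example assumes exactly one row per day, since a ROWS frame counts rows rather than calendar days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount REAL)")
# Ten consecutive days with amounts 10, 20, ..., 100.
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [(f"2024-01-{day:02d}", day * 10.0) for day in range(1, 11)],
)

# AVG() + ORDER BY + a 7-row frame, all defined directly in the OVER clause.
moving_avg = conn.execute("""
    SELECT sale_date,
           AVG(amount) OVER (
               ORDER BY sale_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS avg_7d
    FROM daily_sales
    ORDER BY sale_date
""").fetchall()

for sale_date, avg_7d in moving_avg:
    print(sale_date, avg_7d)
```

The first six rows average over fewer than seven days because the frame simply contains fewer rows; reports often hide or flag those leading values.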

150

MCQ (medium)

A logistics company uses BigQuery to track shipments. The `shipments` table has columns `id`, `status`, `created_date`, and `delivery_date`. They need a query that returns the number of shipments that were delivered within 5 days of creation for each month of 2024. Which SQL construct is most appropriate?

A.SELECT EXTRACT(MONTH FROM created_date) AS month, COUNT(*) FROM shipments WHERE TIMESTAMP_DIFF(delivery_date, created_date, HOUR) <= 120 AND EXTRACT(YEAR FROM created_date) = 2024 GROUP BY month

B.SELECT EXTRACT(MONTH FROM created_date) AS month, COUNTIF(DATETIME_DIFF(delivery_date, created_date, DAY) <= 5) FROM shipments WHERE EXTRACT(YEAR FROM created_date) = 2024 GROUP BY month

C.SELECT EXTRACT(MONTH FROM created_date) AS month, COUNT(*) FROM shipments WHERE DATETIME_DIFF(delivery_date, created_date, DAY) <= 5 AND EXTRACT(YEAR FROM created_date) = 2024 GROUP BY month

D.SELECT EXTRACT(MONTH FROM created_date) AS month, COUNT(*) FROM shipments WHERE DATE_DIFF(delivery_date, created_date, DAY) <= 5 AND EXTRACT(YEAR FROM created_date) = 2024 GROUP BY month

Answer: C

Correct function and clear intent.

Why this answer

Option C is correct because it uses `DATETIME_DIFF` with `DAY` precision to accurately compute the difference between `delivery_date` and `created_date` in days, and filters for shipments delivered within 5 days (i.e., <= 5 days). The `WHERE` clause also restricts to the year 2024, and the `GROUP BY month` with `EXTRACT(MONTH FROM created_date)` correctly aggregates counts per month. This matches the requirement precisely.

Exam trap

Google Cloud often tests the distinction between `DATE_DIFF`, `DATETIME_DIFF`, and `TIMESTAMP_DIFF`. Candidates mistakenly choose `DATE_DIFF` without considering the actual data types of the columns, or they use `TIMESTAMP_DIFF` with hours, assuming that 120 elapsed hours is interchangeable with a five-day difference.

How to eliminate wrong answers

Option A is wrong because `TIMESTAMP_DIFF` is written for `TIMESTAMP` arguments, and comparing elapsed hours against 120 expresses the five-day rule only indirectly, which makes the query harder to read and easier to get wrong at the boundaries. Option B is not the best choice: `COUNTIF` is a valid BigQuery aggregate function, but it moves the condition into the SELECT list, so every 2024 shipment is grouped before the condition is tested, whereas filtering in the `WHERE` clause states the requirement directly. Option D is wrong because `DATE_DIFF` is written for `DATE` arguments, and the answer key treats these columns as `DATETIME` values, for which `DATETIME_DIFF` is the matching function.
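
The filter-then-group shape of the correct option can be tried out with Python's sqlite3. This is a sketch under stated assumptions: SQLite has no `DATETIME_DIFF`, so a `julianday()` difference stands in for it, dates are stored as ISO-8601 text, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipments (id INTEGER, status TEXT, created_date TEXT, delivery_date TEXT)"
)
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?, ?)",
    [
        (1, "delivered", "2024-01-10", "2024-01-13"),  # 3 days: counted
        (2, "delivered", "2024-01-20", "2024-01-30"),  # 10 days: excluded
        (3, "delivered", "2024-02-01", "2024-02-06"),  # 5 days: counted
        (4, "delivered", "2023-12-30", "2024-01-02"),  # created in 2023: excluded
    ],
)

# julianday() differences stand in for BigQuery's *_DIFF(..., DAY) functions.
fast_deliveries_by_month = conn.execute("""
    SELECT CAST(strftime('%m', created_date) AS INTEGER) AS month, COUNT(*)
    FROM shipments
    WHERE julianday(delivery_date) - julianday(created_date) <= 5
      AND strftime('%Y', created_date) = '2024'
    GROUP BY month
    ORDER BY month
""").fetchall()

print(fast_deliveries_by_month)
```

Because the condition sits in the WHERE clause, months with no qualifying shipments produce no row at all; a conditional aggregate in the SELECT list would instead report them with a count of zero.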
