CCNA Define data structures and implement SQL for Business Intelligence Questions


1
MCQ · hard

A data team uses BigQuery for ad-hoc BI queries. They have a table with 100 columns. Analysts often select many columns. The table is partitioned by event_date. Queries are slow and expensive. What two-step optimization should they implement? (Note: This is a single correct answer among four options that combine two steps.)

A. Cluster the table by commonly used columns and limit the selected columns in queries.
B. Convert the table to an Avro format and use partitioned tables.
C. Partition by event_date and use column-level security.
D. Cluster the table by event_date and use SELECT *.
Answer: A

Clustering narrows scans within partitions; selecting only needed columns reduces bytes processed.

Why this answer

Clustering on the columns that analysts most often filter or aggregate on sorts the data within each partition, so queries that filter on those columns can skip blocks and process fewer bytes. Because BigQuery storage is columnar, listing only the columns a query needs (instead of SELECT *) also cuts the bytes read, since unreferenced columns are never scanned. Together, these two steps address the cost and latency of scanning many columns across a large partitioned table.
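
The effect of limiting columns can be sketched with simple arithmetic. The following is a rough illustration only, assuming a hypothetical table of 100 fixed-width 8-byte columns and 1 million rows in the scanned partitions; real BigQuery billing depends on actual column types and compression details not modeled here.

```python
# Hypothetical illustration: estimate bytes read from a columnar table.
def bytes_scanned(row_count, column_widths, selected):
    """Sum the per-row width of only the selected columns, times the row count."""
    return row_count * sum(column_widths[name] for name in selected)

# Assumed table: 100 columns of 8 bytes each, 1 million rows in the scanned partitions.
widths = {f"col_{i}": 8 for i in range(100)}
rows = 1_000_000

# SELECT * touches every column; a narrow SELECT touches only five.
all_columns = bytes_scanned(rows, widths, widths.keys())
five_columns = bytes_scanned(rows, widths, ["col_0", "col_1", "col_2", "col_3", "col_4"])

print(all_columns)   # 800000000
print(five_columns)  # 40000000
```

Under these assumptions the narrow query reads one twentieth of the bytes, before any additional savings from clustering.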

Exam trap

Google Cloud often tests the misconception that partitioning alone is sufficient for all query optimizations. The trap here is that partitioning only reduces the scan by date range, not by column count, so candidates overlook the need to also limit columns or cluster on non-partition columns.

How to eliminate wrong answers

Option B is wrong because converting to Avro does not optimize query performance or cost in BigQuery; Avro is a file format used for loading and exporting data, not a query optimization technique, and the table is already partitioned, so partitioning adds nothing new and does not reduce the column scan overhead. Option C is wrong because column-level security controls access but does not reduce the amount of data scanned or improve query performance; it adds administrative overhead without addressing the cost or speed issue. Option D is wrong because clustering by event_date adds little when the table is already partitioned by event_date, and SELECT * is the opposite of optimization: it forces a scan of all 100 columns, increasing cost and latency.

2
MCQ · hard

An analyst writes a SQL query that joins a fact table with multiple dimension tables. The query runs slowly due to shuffling. Which optimization technique should be applied?

A. Cluster the fact table on the dimension join keys.
B. Use a subquery in the FROM clause to pre-aggregate.
C. Use a LIMIT clause to restrict rows.
D. Use a window function to precompute values.
Answer: A

Clustering on join keys minimizes data movement.

Why this answer

Shuffling occurs when rows must be redistributed across workers during a join because rows with matching join keys are not stored together. Clustering the fact table on the dimension join keys co-locates rows with the same key values, which reduces the data moved during the join. The same idea appears as bucketing in distributed SQL engines such as Spark SQL and Hive.

Exam trap

Google Cloud often tests the misconception that reducing row count (via aggregation or LIMIT) solves shuffle performance, when the real bottleneck is data movement across nodes during the join itself.

How to eliminate wrong answers

Option B is wrong because pre-aggregating in a subquery reduces row count but does not address the root cause of shuffling during the join; the join still requires redistribution unless the subquery result is small enough to broadcast. Option C is wrong because a LIMIT clause only restricts the final output rows, not the intermediate data shuffled during the join; the full join still executes. Option D is wrong because window functions operate on already partitioned data and do not reduce shuffling; they can even introduce additional shuffles if the PARTITION BY clause differs from the join keys.

3
MCQ · hard

A financial services company uses BigQuery for risk analysis. They have a table `market_data` with columns `symbol`, `date`, `price`, and `volume`. The query pattern involves window functions over the last 30 days for many symbols. The table is partitioned by date and clustered by symbol. However, analysts report that queries are slow and expensive. What is the most likely cause?

A. Clustering does not create indexes on symbol
B. Clustering on symbol may cause many blocks to be scanned because symbols are not sorted
C. Partitioning causes data skew across partitions
D. Partitioning by date is not granular enough
Answer: B

If data is ingested without sorting by symbol, clustering effectiveness decreases, leading to many blocks being scanned.

Why this answer

Option B is correct because clustering in BigQuery is not an index and does not guarantee a strict global sort order: blocks are organized by clustering column values, but newly ingested data can be only weakly sorted until BigQuery's automatic background reclustering catches up. When a query runs window functions over a rolling 30-day window for many symbols, BigQuery must scan every block in those partitions that may contain any of the requested symbols. If blocks hold overlapping ranges of symbols, few of them can be pruned, leading to excessive block scans and high query costs.
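
Block pruning can be pictured with a small simulation. This is a conceptual sketch, not BigQuery's actual storage layout: the helper names `make_blocks` and `blocks_to_scan` and the block size are hypothetical, and each block is reduced to the minimum and maximum symbol it contains.

```python
def make_blocks(symbols, block_size):
    """Split a list of symbol values into blocks and record each block's min/max."""
    blocks = []
    for start in range(0, len(symbols), block_size):
        chunk = symbols[start:start + block_size]
        blocks.append((min(chunk), max(chunk)))
    return blocks

def blocks_to_scan(blocks, wanted):
    """Count blocks whose [min, max] range could contain the wanted symbol."""
    return sum(1 for low, high in blocks if low <= wanted <= high)

# Assumed data: 4 symbols, 4 rows each, arriving interleaved.
arrival_order = ["AAPL", "GOOG", "MSFT", "TSLA"] * 4
well_clustered = make_blocks(sorted(arrival_order), block_size=4)
weakly_clustered = make_blocks(arrival_order, block_size=4)

print(blocks_to_scan(well_clustered, "GOOG"))    # 1
print(blocks_to_scan(weakly_clustered, "GOOG"))  # 4
```

When every block spans the full range of symbols, a filter on one symbol cannot skip any block, which is the behavior the correct option describes.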

Exam trap

The trap here is that candidates assume clustering works like an index with guaranteed ordering, but the value ranges of blocks can overlap, especially for recently ingested data, which leads to inefficient block pruning for queries over many values of a high-cardinality column.

How to eliminate wrong answers

Option A is wrong because clustering was never meant to create an index; it is a storage optimization that organizes data into blocks based on cluster column values, so the absence of an index on symbol does not explain the slowdown. Option C is wrong because partitioning by date does not inherently cause data skew; data skew is more likely from uneven distribution of symbol values, not from date partitioning. Option D is wrong because partitioning by date is already granular enough for a 30-day window; the issue is not the partition granularity but the clustering inefficiency for queries that span many symbols across multiple partitions.

4
MCQ · easy

A BI developer needs to display sales data in a dashboard that shows sales in local time zones. The source data stores all timestamps in UTC. Which is the best practice for handling time zone conversions?

A. Store timestamps in UTC and convert to local time in the BI tool's application layer
B. Store all timestamps in UTC and convert them to the desired time zone in SQL queries
C. Store timestamps as text strings with time zone offset to avoid conversion
D. Store both UTC and local time in separate columns
Answer: B

This ensures a single source of truth and leverages SQL functions for accurate conversion.

Why this answer

Option B is correct because storing timestamps in UTC and converting them in SQL queries keeps the conversion logic centralized, auditable, and consistent across all BI reports. This approach relies on the database engine's time zone functions (for example, the time zone argument of DATETIME() or FORMAT_TIMESTAMP() in BigQuery, AT TIME ZONE in SQL Server, or CONVERT_TZ in MySQL) to handle daylight saving time transitions accurately, avoiding application-layer conversions that may be inconsistent or not applied uniformly.
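
Why a named time zone matters more than a fixed offset can be shown with a short sketch. It uses Python's standard `zoneinfo` module rather than SQL, purely as an illustration; the helper `to_local` and the sample dates are assumptions, and the example requires the system time zone database to be available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_local(utc_ts, zone_name):
    """Convert a UTC timestamp to a named zone, applying that zone's DST rules."""
    return utc_ts.astimezone(ZoneInfo(zone_name))

# The same UTC wall-clock time on a winter day and a summer day.
winter = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
summer = datetime(2024, 7, 15, 12, 0, tzinfo=timezone.utc)

print(to_local(winter, "America/New_York").hour)  # 7
print(to_local(summer, "America/New_York").hour)  # 8
```

The stored value never changes; only the conversion applied at query time differs, which is why keeping UTC as the single stored representation is safe.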

Exam trap

Google Cloud often tests the misconception that converting time zones in the application layer is simpler and more flexible, but the trap is that this approach introduces inconsistency when multiple BI tools or direct database queries access the same data, and it fails to leverage the database's robust time zone handling for daylight saving time transitions.

How to eliminate wrong answers

Option A is wrong because converting in the BI tool's application layer can lead to inconsistencies if multiple tools access the same data, and it offloads conversion logic to the presentation tier, which may not handle daylight saving time changes correctly without additional configuration. Option C is wrong because storing timestamps as text strings with time zone offsets breaks date arithmetic, indexing, and sorting, and makes it impossible to use native temporal functions for filtering or aggregation. Option D is wrong because storing both UTC and local time in separate columns duplicates data, increases storage overhead, and risks synchronization errors when time zone rules change (e.g., daylight saving time policy updates).

5
MCQ · hard

A BI team needs to analyze user behavior with sessionization. Each event has a timestamp and session ID. The table 'sessions' contains columns: session_id, user_id, event_time, event_name. The team wants the first event time per session. Which query is most efficient?

A. SELECT session_id, ARRAY_AGG(event_time ORDER BY event_time LIMIT 1) FROM sessions GROUP BY session_id
B. SELECT a.session_id, a.event_time FROM sessions a INNER JOIN (SELECT session_id, MIN(event_time) min_ts FROM sessions GROUP BY session_id) b ON a.session_id = b.session_id AND a.event_time = b.min_ts
C. SELECT session_id, MIN(event_time) FROM sessions GROUP BY session_id
D. SELECT session_id, event_time FROM sessions QUALIFY ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time) = 1
Answer: D

QUALIFY filters to the first row per session, efficient with window functions.

Why this answer

Option D is correct because it uses the QUALIFY clause with ROW_NUMBER() to filter on the window function result directly, avoiding a self-join or a wrapping subquery. BigQuery supports QUALIFY (as do platforms such as Snowflake), so the engine computes the window function once and keeps only the first event per session.

Exam trap

Google Cloud often tests the misconception that a simple GROUP BY with MIN is always the most efficient, but the trap here is that the exam expects candidates to recognize QUALIFY with ROW_NUMBER() as a more modern and efficient pattern for sessionization, especially when additional per-session calculations are needed.

How to eliminate wrong answers

Option A is wrong because ARRAY_AGG with LIMIT 1 returns an array containing a single element, not a scalar value, and is less efficient than MIN or ROW_NUMBER. Option B is wrong because it performs a self-join on both session_id and event_time, which is redundant and less efficient than a simple GROUP BY or window function; it can also return several rows for one session if multiple events share the earliest timestamp. Option C is marked wrong by the answer key even though it returns the correct first event time per session and, when only the timestamp is needed, MIN with GROUP BY is typically at least as cheap as a window function. The key favors QUALIFY with ROW_NUMBER() because it extends to returning other columns of the first event (such as event_name), which MIN alone cannot do.
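
The relationship between the GROUP BY and window-function approaches can be checked on a toy dataset. This sketch runs standard SQL through Python's built-in `sqlite3` module as a stand-in for BigQuery; SQLite has no QUALIFY clause, so the ROW_NUMBER() filter is written as a wrapping subquery, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (session_id TEXT, user_id TEXT, event_time TEXT, event_name TEXT)"
)
conn.executemany("INSERT INTO sessions VALUES (?, ?, ?, ?)", [
    ("s1", "u1", "2024-01-01 10:05:00", "click"),
    ("s1", "u1", "2024-01-01 10:00:00", "login"),
    ("s2", "u2", "2024-01-01 11:30:00", "login"),
    ("s2", "u2", "2024-01-01 11:45:00", "purchase"),
])

# Approach of option C: one aggregate per session.
min_rows = conn.execute(
    "SELECT session_id, MIN(event_time) FROM sessions "
    "GROUP BY session_id ORDER BY session_id"
).fetchall()

# Approach of option D, with the QUALIFY filter expressed as a subquery.
ranked_rows = conn.execute("""
    SELECT session_id, event_time, event_name
    FROM (
        SELECT session_id, event_time, event_name,
               ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time) AS rn
        FROM sessions
    ) AS ranked
    WHERE rn = 1
    ORDER BY session_id
""").fetchall()

print(min_rows)
print(ranked_rows)
```

Both queries agree on the first event time; only the window-function version can also carry along event_name from that first row.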

6
MCQ · medium

Refer to the exhibit. What is the likely cause of this error?

A. The table is a view
B. The query does not include WHERE clause with partition column
C. The table is not partitioned
D. The user does not have permission to query the table
Answer: B

The error states no filter over the partition column, meaning the query tries to scan all partitions, which is blocked by a query optimizer or cost control.

Why this answer

The error occurs because the query reads a partitioned table without filtering on the partition column. In BigQuery, a partitioned table can be created with the require_partition_filter option, in which case any query that lacks a filter on the partition column is rejected rather than being allowed to scan every partition. The fix is to add a WHERE predicate on the partition column, which also enables partition pruning.

Exam trap

Google Cloud often tests the misconception that any table can be queried without a WHERE clause. For partitioned tables that require a partition filter, the query must filter on the partition column; otherwise it is rejected instead of scanning every partition.

How to eliminate wrong answers

Option A is wrong because a view would not cause this specific error; views can be queried without a WHERE clause, and the error message would differ (e.g., 'invalid object' or 'view does not exist'). Option C is wrong because if the table were not partitioned, there would be no partition-related error; the error specifically indicates a partition-related issue. Option D is wrong because permission errors typically produce 'insufficient privileges' or 'access denied' messages, not the error shown in the exhibit.

7
MCQ · medium

A company uses BigQuery with a table 'orders' that has a column 'items' of type ARRAY<STRUCT<product_id STRING, quantity INT64>>. An analyst needs to find orders that contain a specific product, 'ABC'. Which query is most efficient?

A. SELECT * FROM orders WHERE EXISTS (SELECT 1 FROM UNNEST(items) WHERE product_id = 'ABC')
B. SELECT * FROM orders WHERE ARRAY_LENGTH(items) > 0
C. SELECT * FROM orders WHERE 'ABC' IN UNNEST(items)
D. SELECT o.*, item FROM orders o, UNNEST(items) item WHERE item.product_id = 'ABC'
Answer: A

EXISTS with UNNEST is the standard pattern for array membership.

Why this answer

Option A is correct because it uses a correlated subquery with `UNNEST` and `EXISTS`, which only needs to find one matching product_id within each row's array. This is the standard pattern for checking array membership in BigQuery: it avoids row multiplication, so each matching order is returned exactly once.
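
The difference between an existence check and a flattening join can be mimicked in plain Python. This is an analogy rather than BigQuery code: the sample `orders` data and both helper functions are hypothetical, with `any()` standing in for `EXISTS` and a nested loop standing in for the implicit `CROSS JOIN UNNEST`.

```python
# Invented sample data: order 1 lists product 'ABC' twice.
orders = [
    {"order_id": 1, "items": [{"product_id": "ABC", "quantity": 2},
                              {"product_id": "ABC", "quantity": 1}]},
    {"order_id": 2, "items": [{"product_id": "XYZ", "quantity": 5}]},
    {"order_id": 3, "items": []},
]

def orders_with_product_exists(orders, product_id):
    """EXISTS-style check: each order is kept at most once; any() stops at the first match."""
    return [o["order_id"] for o in orders
            if any(item["product_id"] == product_id for item in o["items"])]

def orders_with_product_flatten(orders, product_id):
    """Join-style flattening: one output row per matching array element."""
    return [o["order_id"] for o in orders for item in o["items"]
            if item["product_id"] == product_id]

print(orders_with_product_exists(orders, "ABC"))   # [1]
print(orders_with_product_flatten(orders, "ABC"))  # [1, 1]
```

The flattened version reports order 1 twice, which is the duplication that would need extra deduplication in SQL.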

Exam trap

Google Cloud often tests the misconception that `IN UNNEST` works directly with struct arrays, when in fact it requires a scalar field extraction, and that `CROSS JOIN UNNEST` is always the correct way to filter array contents, ignoring the performance penalty of row multiplication.

How to eliminate wrong answers

Option B is wrong because `ARRAY_LENGTH(items) > 0` only checks if the array is non-empty, not whether it contains the specific product 'ABC'. Option C is wrong because `'ABC' IN UNNEST(items)` compares a STRING with STRUCT elements: `IN UNNEST` needs an array of scalars, but `items` is an array of structs, so the query fails with a type mismatch error rather than a syntax error. Option D is wrong because the implicit `CROSS JOIN` with `UNNEST` produces one output row per matching array element, so an order that lists 'ABC' more than once appears several times; removing those duplicates requires extra deduplication, making the query slower and more resource-intensive than the `EXISTS` approach.

8
MCQ · hard

Refer to the exhibit. The query used DATE_TRUNC(order_date, MONTH) as month. order_date is a TIMESTAMP column. What is the data type of the month column in the result?

A. STRING
B. DATE
C. DATETIME
D. TIMESTAMP
Answer: D

DATE_TRUNC of a TIMESTAMP returns a TIMESTAMP with time set to 00:00:00.

Why this answer

In BigQuery (the SQL engine for the PCDE exam), DATE_TRUNC with a TIMESTAMP input and MONTH granularity returns a TIMESTAMP value, not a DATE or DATETIME. The function truncates the timestamp to the first day of the month at 00:00:00 UTC, preserving the TIMESTAMP data type. Therefore, the month column in the result is of type TIMESTAMP.
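
The "output type matches input type" behavior can be illustrated outside SQL. The sketch below is a Python analogy, not BigQuery's implementation: the hypothetical helper `trunc_to_month` truncates a timezone-aware datetime to the start of its month and returns the same kind of value it was given.

```python
from datetime import datetime, timezone

def trunc_to_month(ts):
    """Reset everything below the month; the result keeps the input's type and tzinfo."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

order_ts = datetime(2024, 3, 17, 14, 35, 20, tzinfo=timezone.utc)
month = trunc_to_month(order_ts)

print(month.isoformat())     # 2024-03-01T00:00:00+00:00
print(type(month).__name__)  # datetime
```

The truncated value is still a full timestamp with a time zone, not a bare date, which mirrors the reasoning behind the correct answer.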

Exam trap

The trap here is that candidates often assume DATE_TRUNC returns a DATE because of the word 'DATE' in the function name, but in BigQuery the output type matches the input type, so a TIMESTAMP input yields a TIMESTAMP output.

How to eliminate wrong answers

Option A is wrong because DATE_TRUNC does not return a STRING; it returns a temporal type, not a text representation. Option B is wrong because DATE_TRUNC on a TIMESTAMP column returns a TIMESTAMP, not a DATE; a DATE would lack the time component entirely. Option C is wrong because DATETIME is a different type that does not include timezone context, whereas BigQuery's DATE_TRUNC on a TIMESTAMP preserves the TIMESTAMP type with timezone awareness.

9
MCQ · easy

A company uses BigQuery to generate daily sales reports. The query aggregates sales by product category and region. The table 'sales_raw' is 500 GB and is updated every hour with new transactions. The report runs slowly. What is the most cost-effective method to improve query performance without changing the existing table schema?

A. Partition the table by product category
B. Create a separate summary table using scheduled queries
C. Create a materialized view that aggregates sales by product category and region
D. Cluster the table by region
Answer: C

Materialized views automatically maintain pre-computed aggregates, significantly reducing query cost and latency.

Why this answer

Option C is correct because a materialized view in BigQuery pre-computes and stores the aggregated results, so subsequent queries read the pre-aggregated data instead of scanning the entire 500 GB 'sales_raw' table. This reduces both the data scanned and the query execution time. BigQuery refreshes the view automatically and incrementally as the base table receives its hourly updates, so the cost is limited to the materialized view's storage, the incremental refreshes, and the much smaller reads against the view, rather than repeated full table scans.
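
The idea of incremental maintenance can be sketched in a few lines. This is a conceptual model only, not how BigQuery implements materialized views: the functions `full_aggregate` and `apply_delta` and the sample rows are hypothetical, and amounts are integers to keep the comparison exact.

```python
def full_aggregate(rows):
    """Recompute totals per (category, region) by scanning every row."""
    totals = {}
    for category, region, amount in rows:
        totals[(category, region)] = totals.get((category, region), 0) + amount
    return totals

def apply_delta(totals, new_rows):
    """Fold only the newly arrived rows into the existing totals."""
    updated = dict(totals)
    for category, region, amount in new_rows:
        updated[(category, region)] = updated.get((category, region), 0) + amount
    return updated

base_rows = [("toys", "east", 100), ("toys", "west", 40), ("books", "east", 25)]
hourly_rows = [("toys", "east", 10), ("books", "west", 5)]

stored = full_aggregate(base_rows)          # built once
refreshed = apply_delta(stored, hourly_rows)  # touches only the new rows

print(refreshed[("toys", "east")])  # 110
```

The refreshed totals equal a full recomputation over all rows, but the refresh reads only the hourly delta.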

Exam trap

Google Cloud often tests the distinction between partitioning/clustering (which optimize data scanning but do not pre-compute results) and materialized views (which store pre-computed results), leading candidates to choose partitioning or clustering as a 'quick fix' without realizing they do not eliminate the need for full aggregation scans.

How to eliminate wrong answers

Option A is wrong because partitioning by product category is not supported in BigQuery (partitioning is based on date, timestamp, or integer range, not on string columns like product category), and even if it were, partitioning alone does not pre-aggregate data, so the query would still need to scan all partitions to compute the aggregation. Option B is wrong because creating a separate summary table using scheduled queries introduces additional complexity and cost for manual refresh scheduling, and it does not provide automatic incremental updates like a materialized view, leading to potential data staleness and extra storage costs for the duplicate table. Option D is wrong because clustering the table by region only improves the performance of queries that filter or sort by region, but it does not pre-compute the aggregation; the query would still scan all rows in the clustered blocks to perform the GROUP BY, so it does not reduce the data scanned for the aggregation itself.

10
MCQ · medium

A company uses BigQuery for BI reporting. They have a large table 'events' with nested and repeated fields (ARRAY<STRUCT>). Analysts often query unnested data, which is slow. What is the best practice to improve query performance without changing the source schema?

A. Create a view that unnests the data
B. Redesign the table to be flat
C. Use a subquery with UNNEST and cache the results
D. Create a materialized view that flattens the nested data
Answer: D

Materialized views are persisted and automatically refreshed, reducing query time.

Why this answer

Option D is correct because a materialized view in BigQuery can precompute and store the results of an UNNEST operation on nested fields, significantly reducing query time for repeated flattening queries. Unlike a regular view, a materialized view persists the flattened data and is automatically refreshed, so analysts query pre-joined, pre-flattened results without altering the source schema. This directly addresses the performance issue while preserving the original nested structure for other use cases.

Exam trap

Google Cloud often tests the distinction between a view (which is just a saved query) and a materialized view (which physically stores results), leading candidates to mistakenly choose the view option as a quick fix without considering performance implications.

How to eliminate wrong answers

Option A is wrong because a view only stores the SQL query definition, not the results; each query against the view still executes the UNNEST operation at runtime, providing no performance improvement. Option B is wrong because it violates the requirement to not change the source schema, and redesigning the table to be flat would require altering the ingestion pipeline and breaking existing queries that rely on nested fields. Option C is wrong because BigQuery caches only the final result of an identical query, not intermediate subquery results, so a subquery with UNNEST is re-executed whenever the surrounding query changes; manual caching via temporary tables is not a best practice for ongoing analyst queries.

11
MCQ · medium

A data analyst is running a BigQuery query that joins multiple tables to generate a BI report. The query is slow and uses many LEFT JOINs. What is the best approach to improve performance without changing the business logic?

A. Denormalize the data using nested repeated fields to avoid joins
B. Add indexes on the join columns
C. Replace LEFT JOINs with INNER JOINs where possible
D. Increase the number of BigQuery slots
Answer: A

Using nested repeated fields reduces joins and improves query performance by storing related data together.

Why this answer

Denormalizing data using nested repeated fields in BigQuery reduces the number of JOIN operations, which are expensive in a distributed, columnar storage system. By storing related data in a single table with REPEATED fields, the query avoids shuffling large datasets across slots, directly improving performance while preserving the original business logic.

Exam trap

Google Cloud often tests the misconception that traditional database optimization techniques like indexing or increasing resources apply to BigQuery, when in fact the correct approach is to leverage BigQuery's native schema design features like nested and repeated fields.

How to eliminate wrong answers

Option B is wrong because BigQuery does not support traditional indexes; it uses columnar storage and clustering/partitioning for performance, so adding indexes is not applicable. Option C is wrong because replacing LEFT JOINs with INNER JOINs changes the business logic by excluding rows that do not have matching records in the joined table, which may alter the BI report results. Option D is wrong because increasing the number of BigQuery slots only addresses resource contention, not the root cause of slow JOINs; it is a costly workaround that does not optimize the query structure.

12
MCQ · hard

A BI team uses BigQuery BI Engine to accelerate dashboards. They have a 100 GB table and enable BI Engine with a reservation of 10 GB. Some queries on this table are still slow. What is the most likely reason?

A. The query selects columns that are not fully cached due to the small reservation size.
B. BI Engine only works with SQL views, not direct tables.
C. The table uses clustering, which BI Engine ignores.
D. The table is partitioned, which BI Engine does not support.
Answer: A

BI Engine reserves memory for caching columns; insufficient memory leads to partial caching.

Why this answer

BI Engine accelerates queries by caching columns in memory. With a 100 GB table and only a 10 GB reservation, the cache can hold only a fraction of the table's columns. Queries that reference columns not fully cached will fall back to BigQuery's standard execution, causing slow performance.

Exam trap

Google Cloud often tests the misconception that BI Engine caches entire tables, when in reality it caches only columns up to the reservation limit, and queries referencing uncached columns will be slow.

How to eliminate wrong answers

Option B is wrong because BI Engine works with both tables and SQL views, not exclusively with views. Option C is wrong because BI Engine fully supports clustered tables and can leverage clustering metadata for efficient pruning. Option D is wrong because BI Engine supports partitioned tables and can use partition pruning to reduce the data scanned.

13
MCQ · medium

A financial institution uses BigQuery for BI reporting. They have a table 'transactions' (10 TB) partitioned by transaction_date and clustered by customer_id. A common report filters on customer_id and last 30 days. The report is slow. Which change would most improve query performance for this specific report?

A. Change partition column to customer_id
B. Remove clustering and rely only on partitioning
C. Add clustering on transaction_date in addition to customer_id
D. Manually recluster the table daily
Answer: C

Clustering on the partition column can further optimize queries that filter on both customer_id and date range.

Why this answer

Option C is correct because adding clustering on transaction_date alongside customer_id improves query performance for the specific report that filters on both customer_id and the last 30 days. BigQuery uses clustering to sort data within partitions, so clustering by transaction_date ensures that within each partition, the rows for the last 30 days are colocated, reducing the amount of data scanned. This complements the existing partition pruning by further narrowing the scan to relevant blocks.

Exam trap

Google Cloud often tests the misconception that partitioning alone is sufficient for all filter patterns, but the trap here is that clustering on the filter column (transaction_date) is needed to optimize queries that filter on both partition and clustering columns, especially when the partition column is not the primary filter.

How to eliminate wrong answers

Option A is wrong because changing the partition column to customer_id would prevent partition pruning for the date filter (last 30 days), forcing a full table scan of 10 TB and degrading performance. Option B is wrong because removing clustering entirely would eliminate the benefit of sorted blocks within partitions, increasing the amount of data scanned even with partition pruning. Option D is wrong because manual daily reclustering is unnecessary: BigQuery reclusters clustered tables automatically in the background, so a manual rewrite adds cost without additional performance gains for this query pattern.

14
Multi-Select · hard

A data team uses BigQuery and wants to ensure data freshness for BI reports with low latency. Which three techniques can help achieve near-real-time updates? (Select THREE).

Select 3 answers
A. Create a scheduled query that rewrites the entire table every hour
B. Use a live view that queries the source table directly
C. Use BigQuery's BI Engine for caching
D. Use streaming inserts to load data in real-time
E. Schedule a query every 15 minutes to refresh a materialized view
Answers: B, D, E

A view always returns the latest data from the base table, so it reflects streaming inserts immediately.

Why this answer

Option B is correct because a live view (also known as a logical view) queries the source table directly each time it is accessed, ensuring that BI reports always see the most current data without any materialization delay. This provides near-real-time freshness by avoiding periodic refresh cycles.

Exam trap

The trap here is that candidates often confuse caching mechanisms (like BI Engine) with data freshness techniques, not realizing that caching improves query speed but does not update the underlying data; they may also mistakenly think that periodic full table rewrites (Option A) are acceptable for near-real-time, when in fact they introduce significant latency and cost.

15
MCQ · hard

A BI team uses a complex SQL query with multiple Common Table Expressions (CTEs) that are referenced several times within the main query. The query performs poorly. What is the best optimization strategy?

A. Add indexes on the tables used in the CTEs
B. Use temporary tables or table snapshots to materialize the CTE results
C. Reuse the same CTE names as often as possible in the query
D. Replace CTEs with derived tables in the FROM clause
Answer: B

Materializing the result once and referencing the temporary table avoids repeated computation.

Why this answer

Option B is correct because non-recursive CTEs are not materialized in BigQuery (SQL Server behaves similarly); a CTE that is referenced several times is evaluated once per reference, so the same logic runs repeatedly. Using temporary tables or table snapshots materializes the intermediate result set once, which avoids redundant scans and significantly improves performance for complex queries with multiple CTE references.
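
The rewrite from a multiply-referenced CTE to a temporary table can be sketched as follows. The example runs on Python's built-in `sqlite3` module as a stand-in engine, so it demonstrates the query pattern and result equivalence rather than BigQuery's execution behavior; the `sales` table, its rows, and the name `totals_tmp` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 50), ("west", 70), ("north", 30)])

# Original shape: the CTE 'totals' is referenced twice in one query.
cte_result = conn.execute("""
    WITH totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT t.region FROM totals AS t
    WHERE t.total > (SELECT AVG(total) FROM totals)
    ORDER BY t.region
""").fetchall()

# Rewritten shape: compute the intermediate result once, then reuse it.
conn.execute("""
    CREATE TEMP TABLE totals_tmp AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")
temp_result = conn.execute("""
    SELECT t.region FROM totals_tmp AS t
    WHERE t.total > (SELECT AVG(total) FROM totals_tmp)
    ORDER BY t.region
""").fetchall()

print(cte_result)   # [('east',)]
print(temp_result)  # [('east',)]
```

The two versions return the same rows; the temporary table simply guarantees the aggregation over `sales` happens once, regardless of how many later statements read it.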

Exam trap

Google Cloud often tests the misconception that CTEs are automatically materialized or cached, leading candidates to overlook the need for explicit temporary tables when performance is critical.

How to eliminate wrong answers

Option A is wrong because adding indexes on base tables does not address the core issue of repeated CTE evaluation; indexes can help but are not a targeted fix for the redundant execution of CTE logic. Option C is wrong because reusing the same CTE name multiple times does not change execution behavior; each reference still triggers a separate evaluation of the CTE definition. Option D is wrong because replacing CTEs with derived tables in the FROM clause does not change the execution plan; derived tables are also non-materialized and will be re-evaluated on each reference, offering no performance benefit.

16
MCQ · medium

A data engineer is designing a BI solution in BigQuery for a retail chain. They need to support queries that aggregate sales by store, product, and date across millions of transactions. The data is loaded in near real-time from Cloud Pub/Sub. Which table design provides the best balance of query performance and cost?

A. Partition by store_id, cluster by product_id
B. Partition by date, cluster by store_id and product_id
C. Unpartitioned table with clustering on store_id and product_id
D. Use materialized views with aggregation on store_id, product_id, and date
Answer: B

Partitioning by date enables efficient pruning for time-range queries, and clustering on store_id and product_id speeds up common aggregations.

Why this answer

Option B is correct because partitioning by date enables BigQuery to prune entire partitions when querying by date range, which is the most common filter in sales aggregation queries. Clustering on store_id and product_id further reduces the data scanned within each partition by colocating rows with similar store and product values. This design minimizes both query cost (bytes billed) and latency, while supporting near-real-time ingestion from Pub/Sub without requiring table rewrites.
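
Partition pruning by date can be pictured with a minimal sketch. It assumes one hypothetical partition per day and a made-up helper `partitions_to_scan`; it models only which partitions a date-range filter would touch, not BigQuery's actual metadata handling.

```python
from datetime import date, timedelta

def partitions_to_scan(partition_dates, start, end):
    """Keep only the daily partitions that fall inside the filtered date range."""
    return [d for d in partition_dates if start <= d <= end]

# Assumed table: 365 daily partitions starting on 2024-01-01.
all_partitions = [date(2024, 1, 1) + timedelta(days=i) for i in range(365)]

# A report filtered to one week touches 7 partitions instead of 365.
one_week = partitions_to_scan(all_partitions, date(2024, 6, 1), date(2024, 6, 7))

print(len(all_partitions), len(one_week))  # 365 7
```

Clustering on store_id and product_id then narrows the scan further inside each of the remaining partitions.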

Exam trap

Google Cloud often tests the misconception that partitioning can be applied to any column type (like store_id) or that clustering alone is sufficient for cost control, when in fact BigQuery requires partitioning on a time-unit or integer-range column and clustering is a complementary optimization, not a replacement.

How to eliminate wrong answers

Option A is wrong because store_id can serve as a partition column only through integer-range partitioning, and only if it is an INTEGER column; even then, partitioning by store would not prune data for the date-range filters these queries rely on, and clustering on product_id alone cannot provide the same cost reduction as date-based partitioning. Option C is wrong because an unpartitioned table with clustering only still requires scanning the entire table for queries that filter by date, leading to higher costs and slower performance compared to a partitioned design. Option D is wrong because materialized views are automatically refreshed and incur additional storage costs; they do not replace the need for an efficient base table design, and they cannot be used as the primary ingestion target for near-real-time data from Pub/Sub.

17
MCQ · hard

A gaming company ingests player clickstream data in real time via Cloud Pub/Sub. They need to aggregate events per player session in BigQuery with exactly-once semantics. Which architecture minimizes latency and cost?

A. Use Cloud Functions to write each message directly to BigQuery
B. Use Cloud Dataflow with exactly-once processing to BigQuery
C. Use Cloud Pub/Sub subscription to write to BigQuery directly
D. Use Cloud Dataproc to run Spark streaming jobs
Answer: B

Dataflow provides exactly-once semantics, low latency, and is cost-effective for this volume.

Why this answer

Cloud Dataflow with exactly-once processing is the correct choice because it provides a unified stream and batch processing model that guarantees exactly-once semantics when writing to BigQuery via the BigQuery I/O connector. This minimizes latency by processing events in micro-batches or streaming mode while avoiding duplicate data, and it is cost-effective as Dataflow auto-scales based on the Pub/Sub throughput.

Exam trap

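Google Cloud often tests whether candidates treat delivery into BigQuery as equivalent to processing. A Pub/Sub BigQuery subscription can write messages to a table without a separate subscriber, but it delivers at-least-once and cannot aggregate events per session, so a processing layer such as Dataflow is still required for this scenario.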

How to eliminate wrong answers

Option A is wrong because Cloud Functions writing directly to BigQuery cannot guarantee exactly-once semantics; a function may be retried on failure, leading to duplicate rows, and it lacks built-in deduplication or checkpointing for streaming data. Option C is wrong because a Pub/Sub subscription that writes directly to BigQuery only delivers raw messages with at-least-once semantics; it can produce duplicate rows and cannot aggregate events per player session, so it does not meet the exactly-once requirement. Option D is wrong because Cloud Dataproc running Spark streaming jobs introduces higher operational overhead and latency compared to Dataflow, and while Spark can achieve exactly-once semantics, it requires more manual configuration and does not integrate as seamlessly with BigQuery's streaming buffer as Dataflow does.
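
The duplicate-row risk described for retried writes can be illustrated with a small sketch. This is a toy model in plain Python, not how Dataflow or BigQuery implement deduplication; the class names `NaiveSink` and `IdempotentSink` and the message IDs are hypothetical.

```python
# Toy model only: real Dataflow/BigQuery deduplication is more involved.
class NaiveSink:
    """Appends every delivered message, so a retry produces a duplicate row."""

    def __init__(self):
        self.rows = []

    def write(self, message_id, payload):
        self.rows.append(payload)


class IdempotentSink:
    """Remembers message IDs and ignores any ID it has already written."""

    def __init__(self):
        self.rows = []
        self._seen = set()

    def write(self, message_id, payload):
        if message_id in self._seen:
            return  # redelivery of an already-written message
        self._seen.add(message_id)
        self.rows.append(payload)


# At-least-once delivery: message "m2" arrives twice (a simulated retry).
deliveries = [
    ("m1", {"player": "p1", "clicks": 3}),
    ("m2", {"player": "p2", "clicks": 5}),
    ("m2", {"player": "p2", "clicks": 5}),
]

naive, idempotent = NaiveSink(), IdempotentSink()
for message_id, payload in deliveries:
    naive.write(message_id, payload)
    idempotent.write(message_id, payload)

naive_total = sum(row["clicks"] for row in naive.rows)
dedup_total = sum(row["clicks"] for row in idempotent.rows)
```

The naive sink over-counts clicks because the retried message lands twice, while the idempotent sink's total matches the two distinct messages.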

18
Multi-Selectmedium

A company is designing a BigQuery data warehouse for sales analytics. They want to minimize query costs when aggregating daily sales by region and product. Which two methods are effective? (Select TWO).

Select 2 answers
A.Creating a materialized view with GROUP BY region, product, day
B.Using a view that queries the raw data with WHERE clause
C.Storing pre-aggregated results in a separate table and updating nightly
D.Creating indexes on the raw table
E.Using a clustered table on (region, product) with partition by day
AnswersA, E

Materialized views store precomputed results and are automatically refreshed, reducing query cost and time.

Why this answer

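Option A is correct because a materialized view in BigQuery pre-computes and stores the results of the GROUP BY query on region, product, and day. When the underlying data changes, the materialized view is incrementally refreshed, so queries that match the view's aggregation are served directly from the stored results, avoiding full table scans and reducing query costs (bytes processed). This is ideal for recurring aggregation patterns like daily sales summaries. Option E is correct because partitioning by day lets BigQuery prune partitions outside the requested dates, and clustering on (region, product) lets it skip blocks within each partition that do not match the requested regions or products.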

Exam trap

Google Cloud often tests the distinction between a view (which is just a saved query) and a materialized view (which stores pre-computed results), leading candidates to incorrectly select Option B as a cost-saving measure.

19
MCQmedium

A retail company uses BigQuery to analyze sales data. They need to create a weekly report showing total sales per product category for the last 4 weeks, but the query is taking too long and exceeding slot resources. The sales table has over 2 billion rows and is partitioned by date. Which design change would most improve query performance and reduce slot consumption?

A.Increase the number of available slots in the reservation.
B.Cluster the table by product_category within the existing date partitions.
C.Create a materialized view that pre-aggregates sales by category and date.
D.Partition the table by product_category instead of date.
AnswerB

Clustering by product_category allows the query to skip irrelevant blocks, reducing data scanned and slot usage.

Why this answer

Option B is correct because clustering the table by product_category within the existing date partitions organizes the data physically so that queries filtering or grouping by product_category can skip irrelevant blocks. This reduces the amount of data scanned and the slot consumption, directly addressing the performance issue without requiring additional resources.

Exam trap

Google Cloud often tests the misconception that adding more slots (Option A) is the primary solution for slow queries, when in reality data skipping techniques like clustering or partitioning are more cost-effective and fundamental to performance optimization in BigQuery.

How to eliminate wrong answers

Option A is wrong because increasing slots only adds more parallel processing capacity but does not reduce the amount of data scanned; the query would still process all 2 billion rows, leading to unnecessary slot consumption. Option C is wrong because a materialized view pre-aggregates by category and date, but it still requires scanning the base table for updates and does not optimize the existing partitioned table's scan efficiency for the weekly report; it also incurs additional storage and maintenance costs. Option D is wrong because partitioning by product_category instead of date would create a large number of small partitions (one per category), which is inefficient for range-based queries (e.g., last 4 weeks) and can lead to partition explosion, increasing metadata overhead and query latency.

20
Multi-Selecthard

A multinational corporation uses BigQuery to combine sales data from multiple regions. Each region stores data in separate tables with identical schemas. The BI team needs to create a unified view for a dashboard that queries data by region and product. Which TWO strategies should the data engineer implement to optimize query performance and reduce costs?

Select 2 answers
A.Partition the table by date and cluster by region and product
B.Use a wildcard table with a filter on _TABLE_SUFFIX to query only required region tables
C.Create a view with UNION ALL of all region tables
D.Create materialized views for each region
E.Store all data in a single table with region as a column
AnswersA, B

Reduces data scanned for common filter conditions.

Why this answer

Option A is correct because partitioning the table by date and clustering by region and product allows BigQuery to use partition pruning and clustering block elimination to scan only the relevant data for queries filtered by region and product. This directly reduces the amount of data read, lowering query costs and improving performance. Clustering also sorts data within partitions, enabling efficient filtering without full scans. Option B is correct because a wildcard table with a filter on _TABLE_SUFFIX restricts the query to the region tables whose names match, so tables for other regions are not scanned at all.

Exam trap

Google Cloud often tests the misconception that a UNION ALL view alone provides performance benefits, when in fact it does not reduce data scanned unless combined with table-level filters like _TABLE_SUFFIX or underlying partitioned/clustered tables.

21
MCQeasy

A BI analyst wants to create a report that displays total revenue by product category and month, with ability to drill down to individual products. Which schema design supports this in BigQuery?

A.Denormalized table with repeated fields
B.Single wide table with all dimensions and measures
C.Star schema with fact table and dimension tables
D.Snowflake schema with normalized dimensions
AnswerC

Star schema is optimized for BI: fact table stores measures, dimensions store attributes, enabling flexible aggregation and drill-down.

Why this answer

Option C is correct because a star schema with a fact table and dimension tables allows efficient aggregation and drill-down through joins. A snowflake schema is over-normalized for BigQuery, a single wide table causes duplication and slow aggregation, and repeated fields are not suitable for drill-down.

22
MCQeasy

A startup is building a BI stack on Google Cloud. They have moderate data volumes and need to run ad-hoc analytical queries and real-time dashboards. Which Google Cloud database service is most appropriate for this workload?

A.BigQuery
B.Cloud Spanner
C.Firestore
D.Cloud SQL
AnswerA

BigQuery is purpose-built for analytical queries and BI.

Why this answer

BigQuery is a serverless, highly scalable data warehouse designed for analytical queries and real-time dashboards. It supports ad-hoc SQL queries on large datasets with fast execution via its columnar storage and distributed query engine, making it ideal for BI workloads with moderate data volumes.

Exam trap

The trap here is confusing transactional databases (Cloud Spanner, Cloud SQL) or NoSQL databases (Firestore) with analytical data warehouses, leading candidates to pick a familiar OLTP service instead of recognizing BigQuery's specific suitability for ad-hoc analytics and BI dashboards.

How to eliminate wrong answers

Option B is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database optimized for transactional (OLTP) workloads, not ad-hoc analytical queries or real-time dashboards. Option C is wrong because Firestore is a NoSQL document database designed for mobile and web app real-time synchronization, not for complex analytical SQL queries or BI dashboards. Option D is wrong because Cloud SQL is a managed relational database for traditional OLTP workloads (e.g., MySQL, PostgreSQL) and lacks the columnar storage and massive parallelism needed for efficient ad-hoc analytics on moderate data volumes.

23
MCQmedium

A data analyst reports that a BI dashboard query on BigQuery is taking over 30 seconds to execute. The table is partitioned by date and clustered by customer_id. The query filters on a specific date range and aggregates sales by customer. What is the most likely cause of the slow performance?

A.The query does not include a filter on the clustering column, so clustering provides no benefit.
B.The query uses a LEFT JOIN that requires a broadcast join, increasing network overhead.
C.The query filters on a date column that is not the partition column, causing a full table scan.
D.The table does not have a primary key, so BigQuery cannot use index scans.
AnswerC

Partition pruning only works when the filter is on the partition column; otherwise, all partitions are scanned.

Why this answer

Option C is correct because partition pruning applies only when the filter is on the partitioning column. If the dashboard's date range is applied to a different date column (for example, an order date on a table partitioned by ingestion date), BigQuery cannot eliminate any partitions and scans the whole table, which explains the 30-second runtime. A filter on the partitioning column itself would prune partitions and would not produce this symptom.

Exam trap

The trap here is that candidates assume any date filter on a date-partitioned table triggers pruning. Pruning depends on which column is filtered. Option A is tempting because the query has no customer_id filter, but missing block pruning within partitions matters far less than scanning every partition.

How to eliminate wrong answers

Option A is wrong because clustering provides benefits only when the query filters on the clustering column; without a filter on customer_id, BigQuery cannot prune clusters, but the query still benefits from partition pruning on date, so the primary performance issue is not clustering. Option B is wrong because the question does not mention any JOIN operation, and a broadcast join would only occur if a large table is joined with a small table, which is not indicated in the scenario. Option D is wrong because BigQuery does not use indexes or primary keys; it uses columnar storage and partitioning/clustering for performance, so the absence of a primary key is irrelevant.
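
The difference between filtering on the partitioning column and filtering on some other date column can be sketched with a toy storage model. This is an illustration in plain Python, not BigQuery's actual storage engine; the column name `ship_date` and the `scan` helper are assumptions made for the example.

```python
from datetime import date, timedelta

# Toy storage layout: one list of rows per daily partition, keyed by event_date.
start = date(2024, 1, 1)
partitions = {}
for offset in range(30):
    day = start + timedelta(days=offset)
    # ship_date is a second date column that is NOT the partitioning column.
    partitions[day] = [
        {"event_date": day, "ship_date": day + timedelta(days=2),
         "customer_id": c, "sales": 10}
        for c in range(100)
    ]


def scan(partition_filter, row_filter):
    """Return (rows_read, rows_matched); pruned partitions are never read."""
    rows_read = rows_matched = 0
    for day, rows in partitions.items():
        if not partition_filter(day):
            continue  # pruned using partition metadata only
        rows_read += len(rows)
        rows_matched += sum(1 for r in rows if row_filter(r))
    return rows_read, rows_matched


first_day, last_day = date(2024, 1, 10), date(2024, 1, 16)

# Filter on the partition column: only 7 of the 30 partitions are read.
pruned = scan(lambda d: first_day <= d <= last_day, lambda r: True)

# Filter on ship_date: nothing can be pruned, so every partition is read.
unpruned = scan(lambda d: True,
                lambda r: first_day <= r["ship_date"] <= last_day)
```

Both scans return the same number of matching rows, but the second reads every partition to find them.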

24
MCQeasy

A data analyst needs to create a reporting table that aggregates sales data by month. They want to ensure the table is optimized for querying by month and product category. Which table design best supports this?

A.Use a table with clustering on product_category only.
B.Use a flat table with no partitioning.
C.Use a view that selects month and product_category.
D.Partition by month and cluster by product_category.
AnswerD

Partitioning prunes months; clustering filters categories.

Why this answer

Option D is correct because partitioning by month physically separates data into monthly segments, allowing query pruning to skip irrelevant partitions when filtering by month. Clustering by product_category within each partition co-locates rows with the same category, reducing the amount of data scanned for queries that filter on both month and category. This design optimizes both I/O and scan efficiency for the described workload.

Exam trap

The trap here is that candidates often confuse a view with a materialized view or assume that any SQL object can improve performance without physical data reorganization, leading them to select Option C despite views having no storage or indexing capabilities.

How to eliminate wrong answers

Option A is wrong because clustering only on product_category without partitioning does not provide the month-level data isolation needed for efficient monthly queries; all months remain in the same storage unit, forcing full scans for any month filter. Option B is wrong because a flat table with no partitioning or clustering offers no data skipping or pruning, leading to full table scans on every query, which is highly inefficient for aggregated reporting. Option C is wrong because a view is just a stored query definition and does not physically reorganize or partition data; it cannot improve query performance on its own and still requires scanning the underlying table.

25
MCQmedium

A user runs the query above on a large table and receives an out-of-memory error. What is the most likely cause?

A.The table is a materialized view that cannot handle ORDER BY
B.The query uses COUNT(*) without a GROUP BY
C.The ORDER BY clause forces sorting of the entire dataset in memory on a single worker
D.The table is not partitioned, so full table scan causes memory overflow
AnswerC

Sorting large datasets requires memory proportional to the data size; if it exceeds available memory, the query fails.

Why this answer

Option C is correct because a global ORDER BY is a blocking operation: in a distributed SQL engine such as BigQuery, the final sort of the full result has to be completed on a single worker, and a large dataset can exceed that worker's memory limit. This is a common cause of out-of-memory or resources-exceeded errors in MPP (Massively Parallel Processing) systems, because scans, filters, and aggregations can be spread across workers whereas the final ordering step cannot.

Exam trap

Google Cloud often tests the misconception that any full table scan causes memory errors, but the real trap is that ORDER BY is a blocking operation that centralizes data, making it the primary culprit for out-of-memory errors in distributed systems.

How to eliminate wrong answers

Option A is wrong because materialized views can handle ORDER BY; the error is not related to materialized view limitations but to the sorting operation itself. Option B is wrong because COUNT(*) without GROUP BY returns a single scalar value, which does not cause memory overflow; it is an aggregation that can be computed in parallel without sorting. Option D is wrong because while a full table scan can be resource-intensive, it does not inherently cause out-of-memory errors; the memory overflow is specifically triggered by the ORDER BY clause forcing a single-node sort, not by the scan itself.

26
MCQeasy

What should be adjusted to improve performance and resolve the connection error?

A.Disable automatic failover to reduce overhead
B.Change the instance type to a higher memory machine
C.Increase max_connections and implement connection pooling
D.Increase the disk size to handle more I/O
AnswerC

The error indicates that the connection limit is reached; increasing it together with pooling addresses both the limit and performance.

Why this answer

The connection error is likely due to the database reaching its maximum connection limit, which causes new connection attempts to be rejected. Increasing `max_connections` allows more concurrent client connections, while implementing connection pooling (e.g., using PgBouncer or similar) reuses existing connections efficiently, reducing overhead and preventing connection exhaustion. This directly resolves the error without requiring hardware changes.

Exam trap

Google Cloud often tests the misconception that connection errors are hardware-related (memory or disk), when in fact they are typically caused by exceeding the configured connection limit, which is a software configuration parameter.

How to eliminate wrong answers

Option A is wrong because disabling automatic failover does not address connection limits or errors; failover is a high-availability feature that ensures continuity during node failure, not a performance tuning parameter. Option B is wrong because changing the instance type to a higher memory machine may improve query performance but does not resolve connection errors caused by hitting `max_connections`; memory alone does not raise the configured limit. Option D is wrong because increasing disk size addresses I/O throughput and storage capacity, but connection errors are unrelated to disk space or I/O; they stem from a server-side connection limit.
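
Connection pooling can be sketched as a fixed set of connections that callers borrow and return. This is a minimal in-process illustration, assuming `sqlite3` as a stand-in for the real database; the `ConnectionPool` class is hypothetical, and a tool such as PgBouncer does the equivalent job outside the application.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: callers borrow a connection and hand it back when done."""

    def __init__(self, size, connect):
        self.created = 0
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())
            self.created += 1

    @contextmanager
    def connection(self, timeout=5.0):
        # Blocks (up to timeout) when every connection is currently borrowed.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it for reuse instead of closing


pool = ConnectionPool(size=2, connect=lambda: sqlite3.connect(":memory:"))

results = []
for i in range(10):
    with pool.connection() as conn:
        results.append(conn.execute("SELECT ? * 2", (i,)).fetchone()[0])
```

Ten queries run here while only two connections are ever opened, which is why pooling keeps an application below the server's connection limit.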

27
Multi-Selectmedium

Which THREE of the following SQL techniques are commonly used to improve BI query performance in BigQuery?

Select 3 answers
A.Select all columns using SELECT * to avoid missing data
B.Avoid JOINs by storing all relevant data in a single table
C.Use self-joins to compare rows within the same table
D.Apply filters in the WHERE clause as early as possible
E.Use APPROX_COUNT_DISTINCT instead of COUNT(DISTINCT) when exact counts are not needed
AnswersB, D, E

Denormalization eliminates JOIN overhead.

Why this answer

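Option B is correct because denormalizing data into a single table avoids expensive JOIN operations, which in BigQuery can cause significant performance degradation due to shuffling and data redistribution across slots. By storing all relevant data in one table, you reduce the need for large-scale data shuffling, leading to faster query execution and lower slot consumption. Option D is correct because filtering early reduces the number of rows that later stages have to join, shuffle, or aggregate. Option E is correct because APPROX_COUNT_DISTINCT returns an approximate result using far less memory and time than an exact COUNT(DISTINCT), which is acceptable when exact counts are not needed.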

Exam trap

Google Cloud often tests the misconception that 'SELECT *' is safe for ad-hoc queries, but in BigQuery it directly increases bytes billed and query latency due to full column scans, making it a poor practice for performance optimization.

28
MCQhard

A company uses BigQuery for BI reporting. They have a materialized view that refreshes automatically to provide pre-aggregated sales data. Recently, the materialized view stopped reflecting new data inserted into the base table. The base table is a streaming buffer table with ingestion-time partitioning. What is the most likely reason?

A.The materialized view does not support streaming buffer tables.
B.The automatic refresh interval has been exceeded due to high query load.
C.The materialized view has reached the maximum number of partitions allowed.
D.The base table's schema has changed, making the materialized view incompatible.
AnswerA

Materialized views require data to be committed to storage; streaming buffer data is not yet committed.

Why this answer

Materialized views in BigQuery do not support base tables that use a streaming buffer, such as ingestion-time partitioned tables. The streaming buffer contains data that has not yet been committed to managed storage, and materialized views can only read from committed storage. Therefore, when new data is inserted into the streaming buffer, the materialized view cannot reflect it until the data is flushed from the buffer, which can cause the view to appear stale or stop reflecting new data entirely.

Exam trap

Google Cloud often tests the misconception that materialized views automatically reflect all data in the base table, including uncommitted streaming buffer data, when in fact they only read from committed storage.

How to eliminate wrong answers

Option B is wrong because the automatic refresh interval is not exceeded due to high query load; BigQuery materialized views refresh based on a system-defined interval (typically within 5 minutes of base table changes) and are not affected by query load. Option C is wrong because materialized views do not have a maximum number of partitions limit that would cause them to stop reflecting new data; partition limits apply to tables, not materialized views. Option D is wrong because schema changes to the base table would cause the materialized view to become invalid or require a manual refresh, but the question states the view stopped reflecting new data, not that it became invalid, and schema changes are not the most likely cause in this streaming buffer scenario.

29
MCQeasy

A SQL query with multiple JOINs is returning duplicate rows. What is the most likely cause?

A.Using INNER JOIN instead of LEFT JOIN.
B.There is a one-to-many relationship between tables.
C.Missing ORDER BY clause.
D.Using UNION instead of UNION ALL.
AnswerB

One-to-many joins multiply rows from the one side.

Why this answer

When a SQL query with multiple JOINs returns duplicate rows, the most likely cause is a one-to-many relationship between the tables being joined. Each matching row in the 'many' side of the join multiplies the rows from the 'one' side, producing duplicates. This is a fundamental behavior of JOIN operations in SQL, where the result set is the Cartesian product of matching rows across the joined tables.

Exam trap

Google Cloud often tests the misconception that duplicate rows are caused by the type of JOIN (e.g., INNER vs LEFT) or by missing sorting, rather than understanding that duplicates arise from the cardinality of the relationship between the joined tables.

How to eliminate wrong answers

Option A is wrong because using INNER JOIN instead of LEFT JOIN does not inherently cause duplicates; it only filters out non-matching rows, which can actually reduce duplicates. Option C is wrong because the ORDER BY clause only affects the sorting of the result set, not the number of rows returned. Option D is wrong because UNION removes duplicates by default (acting like UNION ALL with a DISTINCT step), while UNION ALL preserves all rows including duplicates; the question is about duplicate rows from JOINs, not from set operations.
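
The row multiplication from a one-to-many join can be reproduced with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine; the `orders` and `order_items` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE order_items (order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 'A'), (1, 'B'), (1, 'C'), (2, 'D');
""")

# Joining one order to its many items repeats the order row once per item.
joined = conn.execute("""
    SELECT o.order_id, o.amount
    FROM orders AS o
    JOIN order_items AS i ON i.order_id = o.order_id
    ORDER BY o.order_id
""").fetchall()

# Summing after the join therefore counts order 1's amount three times.
inflated_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN order_items AS i ON i.order_id = o.order_id
""").fetchone()[0]

# Collapsing the 'many' side to one row per key before joining avoids the fan-out.
correct_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders AS o
    JOIN (SELECT order_id, COUNT(*) AS item_count
          FROM order_items
          GROUP BY order_id) AS i
      ON i.order_id = o.order_id
""").fetchone()[0]
```

Pre-aggregating the many side is one common remedy; which fix is appropriate depends on what the report actually needs from the detail table.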

30
MCQmedium

A data analyst needs to create a rolling 30-day average of daily revenue. Which window function clause is required?

A.UNBOUNDED PRECEDING
B.RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW
C.PARTITION BY month
D.ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
AnswerD

This selects exactly 30 rows (current + 29 preceding) for the rolling average.

Why this answer

Option D is correct because `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` defines a physical window of exactly 30 rows (the current row plus the 29 preceding rows), which is the standard SQL approach for a rolling 30-day average when each row represents one day of revenue. The frame always covers 30 rows, so it corresponds to a 30-day window only when the data has exactly one row per day with no gaps, as in a daily revenue table.

Exam trap

Google Cloud often tests the distinction between `ROWS` and `RANGE` window frames, where candidates mistakenly choose `RANGE` with an interval because it sounds more intuitive for date-based rolling averages, but the exam expects the precise `ROWS` syntax for a fixed row count.

How to eliminate wrong answers

Option A is wrong because `UNBOUNDED PRECEDING` includes all rows from the start of the partition, not just the last 30 days, which would compute a cumulative average rather than a rolling 30-day average. Option B is wrong because `RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW` is not accepted as written by every SQL dialect (PostgreSQL, for example, expects `RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW`), and even where it is accepted, a `RANGE` frame is defined by value rather than by row count: it spans 31 calendar days (the 30 preceding days plus the current one) and includes every row that shares a date, so it does not give an exact 30-row window. Option C is wrong because `PARTITION BY month` groups data by calendar month, which does not create a rolling window; it resets the average at each month boundary, making it a monthly average rather than a continuous rolling average.
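
The `ROWS BETWEEN 29 PRECEDING AND CURRENT ROW` frame can be tried out with a short sketch. It uses Python's built-in `sqlite3` module (window functions need SQLite 3.25 or later) as a stand-in engine; the `daily_revenue` table and its synthetic values are assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL)")

# 40 consecutive days, one row per day; revenue is simply 1, 2, ..., 40.
start = date(2024, 1, 1)
conn.executemany(
    "INSERT INTO daily_revenue VALUES (?, ?)",
    [((start + timedelta(days=n)).isoformat(), float(n + 1)) for n in range(40)],
)

rows = conn.execute("""
    SELECT day,
           AVG(revenue) OVER (
               ORDER BY day
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM daily_revenue
    ORDER BY day
""").fetchall()

rolling = dict(rows)  # maps ISO date string -> rolling average
```

For the first 29 days the frame holds fewer than 30 rows, so the average covers only the days available so far; from day 30 onward it covers exactly 30 rows.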

31
MCQmedium

A company has a BigQuery table partitioned by ingestion time. They want to create a BI report showing month-over-month revenue growth. To minimize query cost, what should they do?

A.Use a WHERE clause with _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH) and LAG
B.Use DATE_TRUNC on the ingestion timestamp without filtering partitions
C.Use LAG without a partition filter
D.Use a wildcard table with UNION ALL over monthly tables
AnswerA

This filters to only the necessary partitions for the last 13 months (to compute month-over-month) and uses LAG for growth.

Why this answer

Option A is correct because it uses a WHERE clause with _PARTITIONDATE >= DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH) to prune partitions, ensuring BigQuery scans only the necessary 13 months of data. The LAG function then computes month-over-month revenue growth efficiently. This minimizes query cost by reducing the amount of data processed, which is critical for ingestion-time partitioned tables.

Exam trap

Google Cloud often tests the misconception that any date function or window function alone reduces cost, but without explicit partition pruning (e.g., _PARTITIONDATE filter), BigQuery still scans all partitions, negating cost benefits.

How to eliminate wrong answers

Option B is wrong because DATE_TRUNC on the ingestion timestamp without a partition filter does not prune partitions; BigQuery would still scan all partitions, leading to higher costs. Option C is wrong because using LAG without a partition filter forces a full table scan, negating any cost savings from partitioning. Option D is wrong because using a wildcard table with UNION ALL over monthly tables is an anti-pattern; it requires manual table management and does not leverage BigQuery's native partitioning, often resulting in higher costs and complexity.
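
The LAG step of the month-over-month calculation can be sketched as follows. It uses Python's built-in `sqlite3` module (SQLite 3.25 or later) as a stand-in engine, so the BigQuery-specific `_PARTITIONDATE` filter is not reproduced; the `monthly_revenue` table and its figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monthly_revenue (month TEXT PRIMARY KEY, revenue REAL);
    INSERT INTO monthly_revenue VALUES
        ('2024-01', 100.0), ('2024-02', 120.0), ('2024-03', 90.0);
""")

# LAG(revenue) reads the previous month's value; the first month has none,
# so its growth is NULL (None in Python).
growth = conn.execute("""
    SELECT month,
           revenue,
           (revenue - LAG(revenue) OVER (ORDER BY month))
               / LAG(revenue) OVER (ORDER BY month) AS mom_growth
    FROM monthly_revenue
    ORDER BY month
""").fetchall()
```

Because the earliest month in the window has no predecessor, filtering to 13 months yields 12 growth figures, which is consistent with the 13-month interval in Option A.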

32
MCQhard

Refer to the exhibit. A data engineer created a materialized view on a table that receives streaming inserts. When they query the materialized view, they get this error. What is the most likely cause?

A.The materialized view definition includes a JOIN that is not supported.
B.The materialized view has reached its maximum size limit.
C.The materialized view cannot read data from the streaming buffer.
D.The base table has a schema change that the materialized view cannot adapt to.
AnswerC

Materialized views require data to be committed; streaming buffer data is not yet readable by materialized views.

Why this answer

The error occurs because materialized views in BigQuery cannot directly read data from the streaming buffer. When a base table receives streaming inserts, the data resides in the streaming buffer for up to 90 minutes before being committed to storage. Materialized views only reflect committed data, so querying them during this window returns an error indicating that the view cannot access the streaming buffer.

Exam trap

Google Cloud often tests the misconception that materialized views can access all data in the base table immediately, including uncommitted streaming data, when in reality they only reflect committed data and cannot read from the streaming buffer.

How to eliminate wrong answers

Option A is wrong because materialized views in BigQuery support JOINs, including with other materialized views, as long as they meet the documented limitations (e.g., no self-joins, no cross-join of non-partitioned tables). Option B is wrong because materialized views in BigQuery do not have a fixed maximum size limit; they are managed storage objects that scale with the underlying base table. Option D is wrong because schema changes to the base table (e.g., adding or dropping columns) are automatically propagated to the materialized view, and the view will adapt as long as the change does not break the view definition (e.g., dropping a column used in the SELECT list).

33
Multi-Selecteasy

A BigQuery dataset contains a table with a STRUCT column for customer address. The BI team needs to query the city field from the struct. Which two approaches are valid? (Select TWO).

Select 2 answers
A.SELECT UNNEST(address) as city FROM table
B.SELECT JSON_EXTRACT(TO_JSON(address), '$.city') FROM table
C.SELECT address.city FROM table
D.SELECT address['city'] FROM table
E.SELECT address.city.standard FROM table
AnswersB, C

Converting the struct to JSON and extracting the city field is a valid but more verbose method.

Why this answer

Option B is correct because `JSON_EXTRACT(TO_JSON(address), '$.city')` converts the STRUCT to a JSON value and then extracts the `city` field using JSONPath syntax. Option C is correct because BigQuery allows direct field access on a STRUCT column using dot notation (`address.city`), which is the standard SQL syntax for nested fields.

Exam trap

Google Cloud often tests the distinction between STRUCT and ARRAY types, and the trap here is that candidates confuse `UNNEST` (for ARRAYs) with dot notation (for STRUCTs), or mistakenly apply bracket syntax from other SQL dialects like PostgreSQL or MySQL.

34
MCQmedium

A company runs near-real-time dashboards on BigQuery that query a table partitioned by day and clustered by user_id. The most common query filters on user_id and then aggregates sales over the last 7 days. However, many queries still scan full partitions. What is the most likely cause?

A.The dashboard is configured to refresh every 5 minutes, causing too many queries.
B.The table uses a wide-column schema with many repeated fields.
C.The table is partitioned by hour, not by day.
D.The table is not clustered on user_id, or the clustering expression does not match the filter.
AnswerD

Clustering on user_id allows BigQuery to prune blocks within partitions when filtering on that column.

Why this answer

Option D is correct because the most common cause of full partition scans despite partitioning by day and clustering by user_id is that the clustering expression does not match the filter predicate. In BigQuery, clustering only prunes blocks within a partition when the filter column exactly matches the clustering key; if the filter uses a different expression (e.g., a cast or function) or if clustering is not properly defined, BigQuery falls back to scanning the entire partition. This results in the described behavior where queries still scan full partitions even though the table is partitioned and clustered.

Exam trap

Google Cloud often tests the misconception that partitioning alone guarantees query efficiency, but the trap here is that clustering must exactly match the filter predicate to avoid full partition scans, and candidates may overlook the need for precise column matching in the WHERE clause.

How to eliminate wrong answers

Option A is wrong because query frequency (every 5 minutes) does not cause full partition scans; it may increase slot contention or cost but does not affect the pruning behavior of partitioning or clustering. Option B is wrong because wide-column schemas with repeated fields can increase storage and processing overhead but do not prevent partition pruning or clustering from working correctly; the issue is about filter matching, not schema complexity. Option C is wrong because the question explicitly states the table is partitioned by day, so partitioning by hour would be a different configuration; even if it were hourly, the core problem of full partition scans would still point to clustering mismatch, not the partition granularity.

35
MCQeasy

A startup is building a BI system on Cloud SQL (PostgreSQL) for small-to-medium datasets. The data warehouse includes a fact table 'sales_fact' with millions of rows and dimension tables. The BI team reports that 'sales_fact' queries are slow despite proper indexing. What design change would most likely improve performance?

A.Use a read replica to offload queries
B.Denormalize frequently joined dimension columns into the fact table
C.Switch to Cloud Spanner for better scalability
D.Add more indexes on every column used in WHERE clauses
AnswerB

This reduces the number of joins needed for BI queries.

Why this answer

Denormalizing frequently joined dimension columns into the fact table reduces the number of JOIN operations required for BI queries. In PostgreSQL on Cloud SQL, even with proper indexing, JOINs between a large fact table and multiple dimension tables can cause significant overhead due to tuple reconstruction and buffer pool churn. By storing commonly accessed dimension attributes directly in the fact table, queries become single-table scans or index lookups, dramatically reducing query latency for small-to-medium datasets.

Exam trap

Google Cloud often tests the misconception that more indexes or read replicas universally solve query performance issues, when in fact the root cause is often the JOIN overhead in star-schema designs, which denormalization directly addresses.

How to eliminate wrong answers

Option A is wrong because a read replica offloads read traffic but does not improve the performance of individual queries; the replica runs the same slow query plan on the same schema. Option C is wrong because Cloud Spanner is designed for globally distributed, horizontally scalable workloads with strong consistency, not for optimizing star-schema JOIN performance on small-to-medium datasets; it introduces higher latency and cost without addressing the JOIN overhead. Option D is wrong because adding more indexes on every column used in WHERE clauses can lead to index bloat, increased write overhead, and the query planner may still choose sequential scans or inefficient index joins if the fact table is large and the WHERE clauses are not selective enough.

36
MCQmedium

The exhibit shows query metadata for a query that scans 10 GB. Given the table is 100 GB and partitioned by hire_date, why did the query scan 10 GB and not less?

A.The filter on hire_date is not selective enough to prune most partitions
B.Clustering on department is not being used because the query has ORDER BY
C.The query uses GROUP BY, which forces a full table scan
D.The table is not clustered properly
AnswerA

If the date range covers many days, many partitions are scanned.

Why this answer

Option A is correct because partition pruning in BigQuery depends on how selective the filter predicate is. If the filter on `hire_date` matches many partitions, the query scans all of them. The table is 100 GB and partitioned by `hire_date`, so a 10 GB scan means the filter pruned about 90 GB of partitions but was not selective enough to prune more; for example, the predicate may be a broad date range rather than an equality condition on a single date.

Exam trap

Google Cloud often tests the misconception that any filter on a partition column automatically prunes to a minimal scan, ignoring that the selectivity of the predicate (e.g., range vs. equality) determines how many partitions are actually skipped.

How to eliminate wrong answers

Option B is wrong because clustering on `department` is unrelated to partition pruning; clustering improves data skipping for non-partition columns, and the query's `ORDER BY` does not disable that benefit. Option C is wrong because `GROUP BY` does not force a full table scan in BigQuery; partition pruning occurs before aggregation, so if the filter is selective, only the relevant partitions are scanned. Option D is wrong because the table is partitioned by `hire_date`, and the scan size (10 GB) is consistent with proper partitioning; improper clustering would affect data skipping within partitions, not the partition-level scan size.

37
MCQhard

You are a cloud database engineer for a financial services firm. The firm uses Cloud SQL for PostgreSQL to support a BI reporting tool. The main table 'transactions' has 500 million rows and is growing daily. Reports often run aggregations over date ranges and group by account_id. The 'transactions' table has indexes on date and account_id separately. Despite these indexes, the reporting queries are slow, often taking over 30 minutes. The database is deployed on a high-memory machine with 32 vCPUs and 256 GB RAM. You notice that the queries perform sequential scans instead of using indexes. What is the most likely reason, and what single change would you make to improve performance?

A.Partition the table by date using PostgreSQL declarative partitioning
B.Create a composite index on (date, account_id)
C.Increase the shared_buffers setting to 128 GB
D.Disable sequential scans by setting enable_seqscan = off
AnswerB

A composite index that matches the query's WHERE and GROUP BY can drastically reduce the data scanned.

Why this answer

The correct answer is B because the reporting queries filter by date ranges and group by account_id, but the two separate single-column indexes cannot serve both conditions efficiently. PostgreSQL's query planner often chooses a sequential scan because it estimates that reading the entire table is cheaper than combining the two indexes with a bitmap scan. A composite index on (date, account_id) lets PostgreSQL locate the rows in the date range with a single index scan and read account_id from the same index entries, so it no longer has to combine two indexes.

Exam trap

Google Cloud often tests the misconception that adding separate indexes on each column is sufficient for multi-column queries, but the trap here is that PostgreSQL cannot efficiently combine separate indexes for both filtering and grouping without a composite index that matches the query's access pattern.

How to eliminate wrong answers

Option A is wrong because partitioning by date would only help if queries consistently filter on a single partition boundary, but the slow queries also group by account_id, and partitioning does not directly improve grouping performance without additional indexing. Option C is wrong because increasing shared_buffers beyond a certain point (e.g., 128 GB on a 256 GB machine) can cause PostgreSQL to spend more time managing the buffer pool and may lead to reduced performance due to kernel-level caching overhead; the issue is index usage, not memory size. Option D is wrong because disabling sequential scans with enable_seqscan = off is a dangerous global setting that can force the planner to use inefficient index scans even when a sequential scan would be faster, and it does not address the root cause of missing a suitable composite index.
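
As a rough illustration of the composite-index idea, the sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The column name `txn_date`, the index name `idx_txn_date_account`, and the sample rows are assumptions made for the example, and SQLite's planner is not PostgreSQL's, so the printed plan only shows how one might check which index a query uses.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (txn_date TEXT, account_id INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("2024-01-01", 1, 10),
        ("2024-01-02", 2, 20),
        ("2024-01-03", 1, 30),
        ("2024-02-01", 1, 99),  # outside the reporting range below
    ],
)

# One composite index covering the filter column and the grouping column.
conn.execute(
    "CREATE INDEX idx_txn_date_account ON transactions (txn_date, account_id)"
)

REPORT_SQL = """
SELECT account_id, SUM(amount) AS total
FROM transactions
WHERE txn_date BETWEEN ? AND ?
GROUP BY account_id
ORDER BY account_id
"""
params = ("2024-01-01", "2024-01-31")

# Inspect the plan the engine chose (PostgreSQL would use EXPLAIN instead).
plan = conn.execute("EXPLAIN QUERY PLAN " + REPORT_SQL, params).fetchall()
for step in plan:
    print(step[-1])

report = conn.execute(REPORT_SQL, params).fetchall()
print(report)
```

On a real PostgreSQL instance the same check would be done with `EXPLAIN` against the production table, since the planner's choice depends on table statistics and selectivity.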

38
Drag & Dropmedium

Order the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service (DMS).

Why this order

First prepare the source database, then create a connection profile, create the migration job, start the migration, and finally promote the destination Cloud SQL instance.

39
MCQhard

A financial services company uses BigQuery to run complex analytical queries on trading data. They notice that a particular query joining a large fact table (10 TB) with a small dimension table (100 MB) is slow. The fact table is partitioned by date and clustered by symbol. The dimension table is not partitioned. The query filters on a specific date range and a few symbols. Which optimization is MOST likely to improve query performance?

A.Denormalize the dimension table into the fact table.
B.Enable automatic query rewriting to use clustering keys for pruning on the dimension table join.
C.Partition the dimension table by its primary key.
D.Cluster the dimension table on its primary key.
AnswerB

This allows BigQuery to prune clusters in the fact table based on the join condition with the dimension table.

Why this answer

Option B is correct because BigQuery's automatic query rewriting can leverage clustering keys from the fact table to prune the join, even though the dimension table is not clustered. When the query filters on a specific date range and symbols, BigQuery can use the fact table's clustering on symbol to skip irrelevant blocks during the join, reducing data scanned and improving performance. This optimization is automatic and does not require manual denormalization or repartitioning.

Exam trap

The trap here is that candidates assume clustering or partitioning must be applied to both tables in a join, when in fact BigQuery can use clustering from only the large fact table to prune the join, making options C and D unnecessary and option A an over-engineered solution.

How to eliminate wrong answers

Option A is wrong because denormalizing a 100 MB dimension table into a 10 TB fact table would massively increase storage and processing costs, and is unnecessary when clustering and pruning can achieve the same performance gain without data duplication. Option C is wrong because partitioning the dimension table by its primary key would create many small partitions (e.g., one per row), which is inefficient and does not help with join pruning; BigQuery partitions are best for date-based or integer-range pruning, not for high-cardinality keys. Option D is wrong because clustering the dimension table on its primary key would not improve the join performance significantly, as the dimension table is already small (100 MB) and the bottleneck is scanning the large fact table; clustering is most beneficial on large tables to reduce the amount of data read during filtering and joins.

40
MCQeasy

A BI developer is designing a BigQuery dataset for a sales dashboard. Which column naming convention is considered a best practice for column names in BI reports?

A.Use names with spaces (e.g., Total Revenue).
B.Use descriptive, snake_case names (e.g., total_revenue).
C.Use short, cryptic abbreviations (e.g., tr).
D.Use camelCase names (e.g., totalRevenue).
AnswerB

Snake_case is readable and avoids quoting issues.

Why this answer

BigQuery column names are case-insensitive but must follow standard SQL naming rules. Using descriptive snake_case (e.g., total_revenue) improves readability, avoids ambiguity, and is consistent with BigQuery's own system tables and best practices for BI tools like Looker or Tableau, which often expect clean, underscore-separated identifiers.

Exam trap

Google Cloud often tests the misconception that spaces or camelCase are acceptable for readability, but the trap is that BigQuery requires backtick quoting for spaces and does not enforce a specific case convention, making snake_case the safest and most portable choice for BI reporting.

How to eliminate wrong answers

Option A is wrong because spaces in column names require backtick quoting (e.g., `Total Revenue`) in BigQuery SQL, which adds unnecessary complexity and can break automated queries or BI tool integrations. Option C is wrong because short, cryptic abbreviations (e.g., tr) reduce clarity and maintainability, making it difficult for other developers or business users to understand the data without external documentation. Option D is wrong because camelCase (e.g., totalRevenue) is not a standard convention in BigQuery; while technically allowed, it can cause confusion with case-insensitive comparisons and is less readable in SQL than snake_case.

41
Multi-Selectmedium

A company uses BigQuery to run business intelligence reports. The data engineer needs to implement a star schema for a sales data warehouse. Which THREE are best practices when designing the tables?

Select 3 answers
A.Use natural keys in dimension tables for simplicity
B.Use a primary key on fact tables to enforce uniqueness
C.Store pre-aggregated data in dimension tables
D.Denormalize dimension tables to include descriptive attributes
E.Partition fact tables by date and cluster by frequently filtered columns
AnswersB, D, E

Ensures each row is unique and allows efficient joins.

Why this answer

Option B is correct because each fact-table row should be uniquely identifiable by a primary key, so that duplicate sales transactions do not skew aggregations such as SUM or COUNT. BigQuery does not enforce primary keys: a PRIMARY KEY constraint declared in DDL must be marked NOT ENFORCED, so the loading pipeline (for example, a MERGE statement) remains responsible for keeping rows unique. The declared key still documents the grain of the table, and the query optimizer can use it to optimize joins.

Exam trap

Google Cloud often tests the misconception that dimension tables should be highly normalized or contain pre-aggregated data, but the PCDE exam emphasizes denormalizing dimensions for BI readability and storing aggregates only in fact tables or materialized views.

42
Matchingmedium

Match each Cloud SQL tier to its description.

Burstable, low-cost for small workloads

Shared-core, moderate performance

Standard machine with 1 vCPU and 3.75 GB RAM

High memory machine with 2 vCPUs and 13 GB RAM

High CPU machine with 4 vCPUs and 3.6 GB RAM

Why these pairings

These tiers reflect different vCPU and memory configurations for Cloud SQL.

43
MCQmedium

A BI team runs a daily query on a BigQuery table 'events' partitioned by event_date. The query filters on event_date = CURRENT_DATE() and counts rows by event_type. The query is slow. Upon review, the table has 500 partitions but clustering is not set. Which action reduces query cost and latency?

A.Recreate the table with only the last 30 days of data
B.Use a wildcard table for daily ingestion
C.Increase the partition expiration to 365 days
D.Add clustering on event_type
AnswerD

Clustering on event_type organizes data by that column within each partition, speeding up count and group by.

Why this answer

Adding clustering on `event_type` physically co-locates rows with the same event type within each partition. This allows BigQuery to use block-level pruning when reading data, drastically reducing the number of bytes scanned for the COUNT(*) GROUP BY query. Since the query already filters on a single partition (`event_date = CURRENT_DATE()`), the performance bottleneck is scanning all rows in that partition; clustering eliminates that overhead without changing the table's structure or retention.

Exam trap

Google Cloud often tests the misconception that reducing data volume (e.g., by deleting old partitions or using wildcards) is the primary way to fix query performance, when in fact the correct solution is to optimize data access patterns within the existing partitions using clustering.

How to eliminate wrong answers

Option A is wrong because recreating the table with only 30 days of data does not address the root cause—the query already reads only one partition, so reducing the number of partitions has no effect on the bytes scanned for that single day. Option B is wrong because using a wildcard table for daily ingestion is a pattern for querying multiple tables, not a performance optimization; it would not reduce latency or cost for a query that already targets a single partition. Option C is wrong because increasing partition expiration to 365 days retains more data, which increases storage costs and does nothing to reduce the scan size or improve query performance for a query that already filters on a single partition.

44
MCQhard

A BI team in a large enterprise uses Looker connected to BigQuery. The data model has a primary table 'sales_fact' with billions of rows and multiple dimensions. The team notices that Looker queries often time out. Which approach would most likely resolve this without changing the data model?

A.Request Google Support to increase BigQuery timeout
B.Create a materialized view in BigQuery for the most common aggregations
C.Increase BigQuery slot capacity
D.Switch Looker to use SQL Runner only
AnswerB

Materialized views precompute aggregates and are automatically refreshed, reducing query time without model changes.

Why this answer

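Option B is correct because a BigQuery materialized view precomputes the most common aggregations, so Looker queries read a small precomputed result instead of scanning billions of rows in 'sales_fact'. BigQuery refreshes the view automatically, and the underlying data model stays unchanged. Looker's persistent derived tables (PDTs) can pre-aggregate data in a similar way, but they are not among the options.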

45
MCQhard

An e-commerce company uses BigQuery for BI. They have a large orders table with columns: order_id, customer_id, order_date, amount, status. Queries frequently aggregate total amount by customer and month. The current table is not partitioned. Users complain about high costs. The table is 2 TB and grows by 50 GB daily. Which action reduces query costs most?

A.Partition the table by order_date and cluster by customer_id.
B.Use a wildcard table with daily shards.
C.Create a materialized view that aggregates by customer and month.
D.Set a maximum bytes billed limit on the project.
AnswerC

Materialized view stores the aggregation, converting queries to small scans of precomputed data.

Why this answer

Option C is correct because a materialized view pre-aggregates the total amount by customer and month, eliminating the need to scan the full 2 TB table for every query. This drastically reduces the bytes processed per query, directly lowering BigQuery costs. Since the table grows by 50 GB daily, the materialized view incrementally updates, ensuring fresh results without reprocessing historical data.

Exam trap

Google Cloud often tests the misconception that partitioning alone solves all cost issues, but the trap here is that partitioning reduces scan for date-range queries, not for aggregation queries that span many partitions; a materialized view is the correct cost-reduction strategy for pre-aggregated results.

How to eliminate wrong answers

Option A is wrong because partitioning by order_date and clustering by customer_id reduces bytes scanned for date-range filters, but queries aggregating by customer and month still require scanning all partitions that match the month, which can be large. Option B is wrong because wildcard tables with daily shards require manual management and each query must union or scan multiple shards, leading to higher costs and complexity compared to a single partitioned table. Option D is wrong because setting a maximum bytes billed limit only caps costs but does not reduce the bytes processed; queries that exceed the limit will fail, not become cheaper.
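
The effect of pre-aggregation can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so the example builds a plain summary table by hand; unlike a BigQuery materialized view it does not refresh itself. The table name `customer_month_totals` and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER, customer_id INTEGER, order_date TEXT, "
    "amount INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2024-01-03", 40, "shipped"),
        (2, 1, "2024-01-19", 60, "shipped"),
        (3, 2, "2024-01-07", 25, "shipped"),
        (4, 1, "2024-02-02", 10, "pending"),
    ],
)

# The aggregation the dashboards keep repeating: total amount per customer and month.
AGGREGATE_SQL = """
SELECT customer_id,
       strftime('%Y-%m', order_date) AS order_month,
       SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id, order_month
"""

# Manual snapshot standing in for a materialized view (no automatic refresh here).
conn.execute(f"CREATE TABLE customer_month_totals AS {AGGREGATE_SQL}")

summary = conn.execute(
    "SELECT customer_id, order_month, total_amount "
    "FROM customer_month_totals ORDER BY customer_id, order_month"
).fetchall()
base_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(summary)
print(f"base rows: {base_rows}, summary rows: {len(summary)}")
```

Reports read the small summary instead of the base table; on a 2 TB table the gap between the two row counts is what drives the cost difference.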

46
MCQeasy

A BI analyst needs to calculate a running total of sales by region over time in BigQuery. Which SQL window function should be used?

A.RANK() OVER (PARTITION BY region ORDER BY date)
B.SUM(sales) OVER (PARTITION BY region ORDER BY date)
C.ROW_NUMBER() OVER (PARTITION BY region ORDER BY date)
D.COUNT(sales) OVER (PARTITION BY region ORDER BY date)
AnswerB

This correctly computes a running total per region.

Why this answer

Option B is correct because the SUM() window function with an ORDER BY clause in the OVER() clause computes a running total (cumulative sum) over the specified partition. In BigQuery, when you include ORDER BY inside a window function's OVER() clause, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which produces the running total for each region ordered by date.

Exam trap

Google Cloud often tests the distinction between aggregate functions (SUM, COUNT) and ranking functions (RANK, ROW_NUMBER) in window functions, and the trap here is that candidates confuse RANK() or ROW_NUMBER() with the ability to compute a running total, not realizing that only SUM() with an ORDER BY clause produces a cumulative sum.

How to eliminate wrong answers

Option A is wrong because RANK() assigns a rank to each row based on the ordering, not a running total of sales. Option C is wrong because ROW_NUMBER() assigns a sequential integer to each row, not a cumulative sum. Option D is wrong because COUNT(sales) counts the number of non-null sales values up to the current row, not the sum of sales.
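
A minimal sketch of the running total, using Python's built-in `sqlite3` module as a stand-in for BigQuery (this assumes a SQLite build with window-function support, version 3.25 or later). The column name `sale_date` and the sample rows are illustrative.

```python
import sqlite3

# Hypothetical sample data; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_date TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "2024-01-01", 100.0),
        ("east", "2024-01-02", 50.0),
        ("east", "2024-01-03", 25.0),
        ("west", "2024-01-01", 10.0),
        ("west", "2024-01-02", 20.0),
    ],
)

# SUM(...) OVER with PARTITION BY + ORDER BY accumulates per region;
# the total restarts whenever the region changes.
running_totals = conn.execute(
    """
    SELECT region, sale_date,
           SUM(sales) OVER (PARTITION BY region ORDER BY sale_date) AS running_total
    FROM sales
    ORDER BY region, sale_date
    """
).fetchall()

for row in running_totals:
    print(row)
```

Swapping `SUM(sales)` for `RANK()`, `ROW_NUMBER()`, or `COUNT(sales)` in the same `OVER` clause yields positions or counts, not a cumulative sum.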

47
MCQmedium

A retail company stores sales transactions in BigQuery. They want to create a materialized view that aggregates daily sales by product category, but they need the view to refresh automatically within 5 minutes of new data being inserted. The source table is partitioned by transaction_date and has a streaming buffer. What should they do to ensure the materialized view refreshes quickly enough?

A.Set max_staleness on the base table to 5 minutes.
B.Disable streaming inserts and use batch loads only.
C.Increase the streaming buffer size on the base table.
D.Set the materialized view's max_staleness interval to 5 minutes and allow relaxed consistency.
AnswerD

With max_staleness set to 5 minutes, queries are answered from the materialized view as long as it was refreshed within the last 5 minutes, meeting the requirement.

Why this answer

Option D is correct because setting the `max_staleness` option on the materialized view to 5 minutes tells BigQuery how stale the view's results are allowed to be. As long as the view was last refreshed within that window, BigQuery answers queries directly from the materialized view without reading the base table; otherwise it falls back to the base table. Accepting this relaxed consistency keeps results at most 5 minutes behind newly inserted data, including rows that arrive through streaming, which meets the requirement.

Exam trap

Google Cloud often tests the misconception that `max_staleness` is set on the base table or that streaming buffer size can be manually tuned, when in fact `max_staleness` is a materialized view property that relaxes consistency to achieve faster refresh.

How to eliminate wrong answers

Option A is wrong because setting `max_staleness` on the base table does not control how often the materialized view is refreshed or how stale the view's results may be; the option has to be set on the materialized view itself. Option B is wrong because disabling streaming inserts and using batch loads only would eliminate the streaming buffer but would introduce latency from batch job scheduling, making it impossible to achieve sub-5-minute refreshes. Option C is wrong because the streaming buffer size is not configurable by users; BigQuery manages it automatically, and increasing it would not speed up materialized view refresh.

48
MCQmedium

A company is using BigQuery and needs to implement row-level security so that sales representatives only see their own region's data. Which approach should they use?

A.Use BigQuery column-level security to filter by region
B.Create separate tables for each region and union in views
C.Use authorized views with WHERE clause filtering by session user's region
D.Use IAM conditions at the dataset level
AnswerC

Authorized views can apply row-level filters using SESSION_USER() and a mapping table, ensuring users only see their data.

Why this answer

Option C is correct because BigQuery authorized views allow you to enforce row-level security by embedding a WHERE clause that filters data based on the session user's region (e.g., using SESSION_USER() or a mapping table). This ensures each sales representative sees only their own region's data without exposing the underlying tables directly.

Exam trap

Google Cloud often tests the distinction between column-level security (which restricts columns) and row-level security (which restricts rows), leading candidates to mistakenly choose column-level options when row filtering is required.

How to eliminate wrong answers

Option A is wrong because BigQuery column-level security restricts access to specific columns, not rows; it cannot filter by region values. Option B is wrong because creating separate tables per region and unioning them in views is unscalable, violates data normalization, and does not dynamically filter by the current user. Option D is wrong because IAM conditions at the dataset level control access to entire datasets or tables, not individual rows within a table.
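
The filtering pattern behind such a view can be sketched with Python's built-in `sqlite3` module. SQLite has no `SESSION_USER()`, so a one-row table named `current_session` plays that role here; that table, the `user_region` mapping table, the view name, and the e-mail addresses are all assumptions made for the example. Authorizing the view on the dataset is a BigQuery configuration step and is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, amount INTEGER);
    CREATE TABLE user_region (user_email TEXT, region TEXT);
    -- Stand-in for SESSION_USER(): holds the e-mail of the current caller.
    CREATE TABLE current_session (user_email TEXT);

    INSERT INTO sales VALUES ('east', 100), ('east', 50), ('west', 70);
    INSERT INTO user_region VALUES
        ('ana@example.com', 'east'),
        ('bo@example.com', 'west');

    -- Users query the view, never the sales table directly.
    CREATE VIEW my_region_sales AS
    SELECT s.region, s.amount
    FROM sales AS s
    JOIN user_region AS m ON m.region = s.region
    JOIN current_session AS c ON c.user_email = m.user_email;
    """
)


def rows_visible_to(user_email):
    """Simulate a session for user_email and read through the filtering view."""
    conn.execute("DELETE FROM current_session")
    conn.execute("INSERT INTO current_session VALUES (?)", (user_email,))
    return conn.execute(
        "SELECT region, amount FROM my_region_sales ORDER BY amount"
    ).fetchall()


ana_rows = rows_visible_to("ana@example.com")
bo_rows = rows_visible_to("bo@example.com")
print(ana_rows)
print(bo_rows)
```

In BigQuery the join to `current_session` would be replaced by a comparison against `SESSION_USER()`, so the filter follows whoever runs the query.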

49
MCQmedium

A data engineer is writing a SQL query in BigQuery to calculate the running total of sales per product over time. The table 'sales' has columns product_id, sale_date, and amount. The result must include the cumulative sum ordered by sale_date for each product. Which SQL construct should be used?

A.GROUP BY product_id, sale_date with SUM(amount)
B.SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
C.ROW_NUMBER() OVER (ORDER BY sale_date)
D.LAG(amount, 1, 0) OVER (ORDER BY sale_date)
AnswerB

This window function correctly computes a running total per product.

Why this answer

Option B is correct because it uses a window function with a PARTITION BY clause to reset the running total per product and an ORDER BY with a ROWS frame to compute the cumulative sum over time. This is the standard SQL construct in BigQuery for calculating running totals within partitions.

Exam trap

Google Cloud often tests the distinction between aggregate functions with GROUP BY and window functions with OVER, where candidates mistakenly choose GROUP BY thinking it produces a running total, but it only collapses rows.

How to eliminate wrong answers

Option A is wrong because GROUP BY with SUM(amount) aggregates sales into a single total per product and date, not a running cumulative sum over time. Option C is wrong because ROW_NUMBER() assigns sequential row numbers but does not compute any sum or cumulative value. Option D is wrong because LAG() accesses a previous row's value but does not accumulate sums across rows.
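
The explicit `ROWS` frame in option B matters when several rows share the same `sale_date`. The sketch below, again using Python's `sqlite3` as a stand-in for BigQuery with made-up sample rows, compares it with the default frame that applies when only `ORDER BY` is given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product_id INTEGER, sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 10),
        (1, "2024-01-01", 10),  # second sale on the same date (a tie)
        (1, "2024-01-02", 20),
    ],
)

# rows_total: explicit ROWS frame, accumulates one physical row at a time.
# range_total: default frame with ORDER BY, which includes all rows that
# share the current row's sale_date.
frame_comparison = conn.execute(
    """
    SELECT sale_date, amount,
           SUM(amount) OVER (
               PARTITION BY product_id ORDER BY sale_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS rows_total,
           SUM(amount) OVER (
               PARTITION BY product_id ORDER BY sale_date
           ) AS range_total
    FROM sales
    ORDER BY sale_date, rows_total
    """
).fetchall()

for row in frame_comparison:
    print(row)
```

With unique dates per product the two columns agree; with ties, the `ROWS` frame steps row by row while the default frame jumps to the total for the whole date.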

50
Multi-Selectmedium

Which TWO of the following are valid ways to improve the performance of a BigQuery query that joins two large tables?

Select 2 answers
A.Apply WHERE clauses to filter each table before the join.
B.Create a materialized view that pre-joins the tables.
C.Use the 'JOIN EACH' clause.
D.Denormalize the tables into a single table.
E.Set the query option 'USE_CACHE=TRUE'.
AnswersA, B

Reducing data before joining improves performance.

Why this answer

Option A is correct because applying WHERE clauses before the join (e.g., using subqueries or CTEs to pre-filter each table) reduces the amount of data shuffled and processed during the join phase. BigQuery's query engine can push down filters to the storage layer, minimizing the bytes read and improving performance significantly.

Exam trap

Google Cloud often tests the misconception that 'JOIN EACH' is still required for large joins, when in fact it is a deprecated syntax and modern BigQuery handles large joins automatically without any special clause.

51
MCQeasy

A BI developer needs to write a query that calculates total sales by month for the current year. They create a Common Table Expression (CTE) to define monthly aggregates, then reference it in a final SELECT. What is the main benefit of using a CTE over a subquery in this scenario?

A.CTEs are always faster than subqueries.
B.CTEs reduce the amount of memory used by the query.
C.CTEs automatically cache results for subsequent queries.
D.CTEs enhance query readability and maintainability.
AnswerD

CTEs allow you to break down complex queries into named steps.

Why this answer

Option D is correct because CTEs improve query readability and maintainability by allowing you to define a named temporary result set once and reference it multiple times in the final SELECT. In this scenario, the CTE clearly separates the monthly aggregation logic from the final output, making the query easier to understand and modify compared to nesting subqueries.

Exam trap

Google Cloud often tests the misconception that CTEs provide performance benefits like caching or reduced memory, when in fact their primary advantage is structural clarity and reusability within a single query.

How to eliminate wrong answers

Option A is wrong because CTEs are not inherently faster than subqueries; performance depends on the query optimizer and execution plan, and in many cases CTEs are not materialized or optimized differently. Option B is wrong because CTEs do not reduce memory usage; in fact, a CTE that is referenced multiple times may be re-evaluated each time unless the database engine materializes it, potentially increasing memory and CPU usage. Option C is wrong because CTEs do not automatically cache results for subsequent queries; they are scoped to a single statement and are not persisted or shared across separate executions.
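
A small sketch of the pattern, run through Python's built-in `sqlite3` module. The table layout, the CTE name `monthly_sales`, and the sample rows are assumptions; the year is passed as a parameter in place of the current year so the example stays reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-05", 100),
        ("2024-01-20", 50),
        ("2024-02-03", 70),
        ("2023-12-31", 999),  # previous year, filtered out below
    ],
)

# The CTE names the aggregation step; the final SELECT only shapes the output.
MONTHLY_SQL = """
WITH monthly_sales AS (
    SELECT strftime('%Y-%m', sale_date) AS month,
           SUM(amount) AS total
    FROM sales
    WHERE strftime('%Y', sale_date) = ?
    GROUP BY month
)
SELECT month, total
FROM monthly_sales
ORDER BY month
"""

monthly = conn.execute(MONTHLY_SQL, ("2024",)).fetchall()
print(monthly)
```

The same logic written as a nested subquery returns identical rows; the CTE only makes the aggregation step easier to find and change.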

52
MCQhard

A company uses Cloud SQL for PostgreSQL to store transactional data and BigQuery for analytics. They need to sync a subset of tables from Cloud SQL to BigQuery daily for BI reporting. The tables are updated incrementally (INSERT, UPDATE, DELETE). Which approach is MOST reliable and cost-effective?

A.Use Datastream to stream changes from Cloud SQL to BigQuery in near real-time.
B.Write a custom cron job on App Engine to extract changes and load them into BigQuery.
C.Create BigQuery federated queries that directly read from Cloud SQL.
D.Export the Cloud SQL tables to Cloud Storage as CSV files daily, then load them into BigQuery.
AnswerA

Datastream is a managed CDC service that handles incremental changes efficiently.

Why this answer

Datastream is purpose-built for exactly this use case: it captures CDC (Change Data Capture) events from Cloud SQL for PostgreSQL (using the PostgreSQL logical replication slot and the pgoutput plugin) and streams them directly into BigQuery via a streaming ingestion pipeline. This approach handles INSERT, UPDATE, and DELETE operations reliably without custom code, and it is cost-effective because it avoids full table exports and leverages BigQuery's streaming buffer for near-real-time updates.

Exam trap

Google Cloud often tests the misconception that batch exports (Option D) are the simplest and most reliable approach, but the trap here is that incremental CDC with Datastream is actually more reliable and cost-effective for tables with frequent updates and deletes, because it avoids full table scans and manual change tracking.

How to eliminate wrong answers

Option B is wrong because a custom cron job on App Engine would require implementing complex change tracking (e.g., using timestamps or triggers) and cannot reliably capture DELETE operations without additional overhead, making it less reliable and more costly to maintain. Option C is wrong because BigQuery federated queries read from Cloud SQL directly at query time, which bypasses BigQuery's storage and performance optimizations, incurs high latency, and is not suitable for daily syncing or handling incremental changes. Option D is wrong because daily full CSV exports are inefficient for incrementally updated tables (they waste storage and compute on unchanged rows), cannot capture DELETEs without additional logic, and the daily batch load introduces a 24-hour delay, which is less reliable and more expensive than streaming CDC.

53
MCQeasy

A company needs to store raw event logs for future BI analysis. The logs are semistructured with varying fields. Which BigQuery data type should they use to store the event payload?

A.ARRAY
B.STRING
C.FLOAT64
D.JSON
AnswerD

JSON type allows storing and querying semistructured data with nested fields.

Why this answer

Option D is correct because BigQuery's JSON data type is designed to store semistructured data with varying fields, such as raw event logs. It allows schema flexibility, supports querying nested fields with JSON functions such as `JSON_VALUE` and `JSON_QUERY`, and avoids the need to predefine a rigid schema, which is ideal for BI analysis of event payloads.

Exam trap

Google Cloud often tests the misconception that STRING is sufficient for semistructured data, but the trap is that STRING lacks native querying capabilities and incurs higher costs for parsing, whereas JSON provides built-in functions and better performance for BI workloads.

How to eliminate wrong answers

Option A is wrong because ARRAY is used to store ordered lists of elements of the same data type, not for semistructured payloads with varying fields. Option B is wrong because STRING would store the payload as a plain text blob, losing the ability to query individual fields without complex parsing and increasing storage and processing overhead. Option C is wrong because FLOAT64 is a numeric data type for floating-point numbers, completely unsuitable for storing event payloads that contain diverse field types.

54
MCQmedium

The query returns results but takes a long time. The orders table has 500M rows with order_date as a timestamp and revenue as float. How can the query be optimized?

A.Add a clustering key on order_date.
B.Partition the table by month on order_date.
C.Use a wildcard table over multiple date-sharded tables.
D.Use a materialized view that caches the query result.
AnswerB

Partition pruning limits data scanned to relevant months.

Why this answer

Partitioning the table by month on order_date (Option B) is correct because it physically separates the data into monthly partitions, allowing the query engine to prune partitions that do not match the query's time range. This dramatically reduces the amount of data scanned, which is the primary cause of slow performance on a 500M-row table. In BigQuery, partitioning by a timestamp column like order_date is a native, cost-effective optimization that directly addresses the scan bottleneck.

Exam trap

The trap here is that candidates often confuse clustering with partitioning, assuming that sorting data (clustering) provides the same scan reduction as physically separating data (partitioning), but clustering only improves block pruning within already-scanned data, not the initial scan elimination.

How to eliminate wrong answers

Option A is wrong because adding a clustering key on order_date does not physically separate data into independent storage blocks; it only sorts data within existing partitions or the entire table, so it cannot reduce the amount of data scanned as effectively as partitioning. Option C is wrong because using a wildcard table over multiple date-sharded tables is a legacy approach that requires manual table management and incurs additional overhead for query planning and metadata operations, whereas native partitioning is simpler and more performant. Option D is wrong because a materialized view caches the query result but does not reduce the scan cost for the base table; it is useful for repeated aggregations, not for optimizing a single ad-hoc query that filters by order_date.
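The pruning argument can be made concrete with a toy Python model. This is not BigQuery code: the rows, the `month_key` helper, and the `scan` function are invented for illustration, and the model only counts how many rows a date-range filter has to read when rows are bucketed by month.

```python
from collections import defaultdict
from datetime import date

def month_key(d: date) -> str:
    """Partition id for a monthly-partitioned table, e.g. '2024-03'."""
    return f"{d.year:04d}-{d.month:02d}"

# Hypothetical orders: one row per day of 2024 (a leap year, 366 rows).
first_day = date(2024, 1, 1).toordinal()
orders = [(date.fromordinal(first_day + i), 10.0) for i in range(366)]

# "Partition" the table: bucket rows by the month of order_date.
partitions = defaultdict(list)
for order_date, revenue in orders:
    partitions[month_key(order_date)].append((order_date, revenue))

def scan(start: date, end: date):
    """Return (rows_read, total_revenue) for start <= order_date <= end,
    reading only the partitions whose month overlaps the range."""
    lo, hi = month_key(start), month_key(end)
    rows_read, total = 0, 0.0
    for key, rows in partitions.items():
        if not (lo <= key <= hi):  # partition pruned: its rows are never read
            continue
        rows_read += len(rows)
        total += sum(r for d, r in rows if start <= d <= end)
    return rows_read, total

rows_read, total = scan(date(2024, 3, 10), date(2024, 3, 20))
print(rows_read, len(orders), total)  # 31 rows read out of 366
```

Only the March bucket is read for a filter that falls inside March; without the bucketing, the same filter would have to inspect all 366 rows.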

55
MCQhard

A BI team uses BigQuery to report on customer orders. The 'customers' dimension table is updated nightly with Type 2 Slowly Changing Dimensions (SCD). However, some reports show incorrect historical aggregates because the fact table references only the current customer key. Which approach resolves this issue?

A.Update the fact table nightly to replace old customer keys with the current key
B.Store the surrogate customer key from the dimension table in the fact table at transaction time
C.Denormalize customer attributes into the fact table
D.Use the natural customer ID in the fact table and join with the dimension using a BETWEEN condition on effective dates
AnswerB

This ensures the fact always points to the correct version of the customer.

Why this answer

Option B is correct because with Type 2 SCD, each customer row has a unique surrogate key that represents a specific version of the customer's attributes over time. Storing that surrogate key in the fact table at transaction time ensures that historical facts are permanently linked to the correct customer attributes as they existed at the time of the order. This prevents incorrect aggregates when the dimension table is updated, as the fact table will always join to the precise version of the customer record that was active when the transaction occurred.

Exam trap

Google Cloud often tests the misconception that updating the fact table with current keys (Option A) is acceptable for Type 2 SCD, when in reality it silently converts the design to Type 1 and destroys historical accuracy.

How to eliminate wrong answers

Option A is wrong because updating the fact table nightly to replace old customer keys with the current key destroys historical accuracy, effectively converting the Type 2 SCD into a Type 1 overwrite and breaking the ability to report on past customer attributes. Option C is wrong because denormalizing customer attributes into the fact table duplicates data, increases storage costs, and requires updating all historical fact rows whenever a customer attribute changes, which is impractical and error-prone in BigQuery's append-heavy architecture. Option D is wrong because using the natural customer ID with a BETWEEN condition on effective dates is a valid approach for Type 2 SCDs, but it requires the fact table to store the transaction timestamp; the question states the fact table references only the current customer key, so this option does not resolve the issue without also modifying the fact table schema to include a timestamp.
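The difference between Option B and Option A can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. The table names, columns, and the two sample orders are hypothetical; only the join pattern is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id TEXT,                 -- natural key
    region      TEXT,
    is_current  INTEGER
);
CREATE TABLE fact_orders (
    order_id    INTEGER PRIMARY KEY,
    customer_sk INTEGER,              -- version active at order time
    amount      INTEGER
);
-- Customer C1 moved from EMEA to APAC: two Type 2 versions.
INSERT INTO dim_customer VALUES (1, 'C1', 'EMEA', 0), (2, 'C1', 'APAC', 1);
-- One order placed under each version.
INSERT INTO fact_orders VALUES (100, 1, 50), (101, 2, 70);
""")

# Join on the surrogate key: each order is attributed to the region
# the customer belonged to when the order was placed.
by_version = dict(conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_orders f JOIN dim_customer d USING (customer_sk)
    GROUP BY d.region
""").fetchall())

# Resolving every fact to the current version (the effect of Option A)
# rewrites history: all revenue lands in the current region.
by_current = dict(conn.execute("""
    SELECT cur.region, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer old ON old.customer_sk = f.customer_sk
    JOIN dim_customer cur ON cur.customer_id = old.customer_id
                         AND cur.is_current = 1
    GROUP BY cur.region
""").fetchall())

print(by_version)  # {'EMEA': 50, 'APAC': 70} (key order may vary)
print(by_current)  # {'APAC': 120}
```

The surrogate-key join keeps the EMEA order in EMEA, while the current-key version reports all 120 under APAC.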

56
MCQmedium

A BI team finds that their BigQuery query that aggregates sales by region runs slower than expected, even with appropriate clustering and partitioning. The query filters on a date range and then groups by region. The table is partitioned by date and clustered by region. What can the team do to improve query performance without increasing cost?

A.Increase the number of clusters to include more columns.
B.Change the partition type to ingestion-time partitioning.
C.Add an ORDER BY clause to the query.
D.Use a materialized view that pre-aggregates sales by region and date.
AnswerD

Materialized views provide pre-computed results, reducing query time and data processed.

Why this answer

Option D is correct because a materialized view in BigQuery can pre-aggregate sales by region and date, so the query reads a small set of precomputed rows instead of re-aggregating the base table. BigQuery maintains the view automatically and processes only incremental changes on refresh. The view does add some storage and refresh cost, but each dashboard query processes far fewer bytes, which makes it the only option here that speeds up the query without raising query cost.

Exam trap

The trap here is that candidates often think adding more clustering columns or sorting will improve aggregation performance, but they fail to recognize that pre-aggregation via materialized views is the only option that reduces the data scanned without increasing cost.

How to eliminate wrong answers

Option A is wrong because increasing the number of clusters to include more columns does not improve performance for a query that already filters on a partitioned column and groups by a clustered column; additional clustering columns can increase write overhead and may not reduce the data scanned. Option B is wrong because changing to ingestion-time partitioning does not provide any benefit over the existing date-based partitioning; ingestion-time partitioning is typically used when no timestamp column exists, and it would not improve query performance for date-range filters. Option C is wrong because adding an ORDER BY clause does not reduce the amount of data scanned or processed; it only sorts the final result, which adds overhead without addressing the root cause of slow aggregation.
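As a rough illustration of pre-aggregation, the sketch below uses Python's `sqlite3` module. SQLite has no materialized views, so an ordinary summary table stands in for one here; the table names and the generated sales rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount INTEGER)")

# Hypothetical data: 2 days x 3 regions x 1000 sales of 1 unit each.
rows = [(d, r, 1)
        for d in ("2024-01-01", "2024-01-02")
        for r in ("north", "south", "west")
        for _ in range(1000)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Pre-aggregate once (a stand-in for what a materialized view maintains).
conn.execute("""
    CREATE TABLE sales_by_region_day AS
    SELECT sale_date, region, SUM(amount) AS total
    FROM sales
    GROUP BY sale_date, region
""")

query = ("SELECT region, SUM({col}) FROM {tbl} "
         "WHERE sale_date >= '2024-01-01' GROUP BY region")
from_base = dict(conn.execute(query.format(col="amount", tbl="sales")).fetchall())
from_summary = dict(
    conn.execute(query.format(col="total", tbl="sales_by_region_day")).fetchall())

base_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
summary_rows = conn.execute("SELECT COUNT(*) FROM sales_by_region_day").fetchone()[0]
print(from_base == from_summary, base_rows, summary_rows)  # True 6000 6
```

Both queries return the same totals, but the second one reads 6 rows instead of 6000, which is the effect the materialized view is chosen for.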

57
MCQeasy

A company is building a business intelligence dashboard on BigQuery to analyze daily sales data. The table contains a TIMESTAMP column 'order_ts' and a string column 'region'. The BI team frequently filters by month and region. Which table design best optimizes query performance and cost?

A.Use a separate table for each region
B.Clustering by order_ts and region without partitioning
C.Partition the table by date (month) and cluster by region
D.Partition the table by region and cluster by order_ts
AnswerC

Partitioning on the date granularity used in filters and clustering on region minimizes scanned data.

Why this answer

Partitioning by month and clustering by region reduces the data scanned for common filters, improving performance and cost.

58
Matchingmedium

Match each BigQuery DDL statement to its function.

CREATE TABLE: Creates a new table

ALTER TABLE: Modifies table schema or options

DROP TABLE: Deletes a table

CREATE VIEW: Creates a logical view

CREATE MATERIALIZED VIEW: Creates a precomputed view for faster queries

Why these pairings

DDL statements are used to define and manage database objects in BigQuery.

59
Multi-Selecteasy

A data engineer is creating a reporting layer in BigQuery for BI tools. Which TWO practices improve query performance?

Select 2 answers
A.Use approximate aggregate functions when exact accuracy is not needed.
B.Use SELECT * in queries.
C.Use ORDER BY in subqueries unnecessarily.
D.Store all data in a single table without partitioning.
E.Denormalize tables to reduce joins.
AnswersA, E

Approximate functions like APPROX_COUNT_DISTINCT use less resources.

Why this answer

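Option A is correct because BigQuery's approximate aggregate functions (e.g., APPROX_COUNT_DISTINCT, APPROX_QUANTILES) use sketch-based algorithms (HyperLogLog++ in the case of APPROX_COUNT_DISTINCT) to return an estimate with a small error while using far less memory and shuffle than an exact aggregation. They read the same columns as the exact version, so the gain is in execution time and resource use rather than bytes scanned, a trade-off that suits BI dashboards where exact counts are not critical. Option E is correct because denormalizing moves the join work to load time, so reporting queries read one table instead of joining several at query time.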

Exam trap

Google Cloud often tests the misconception that SELECT * is acceptable in production BI queries, but the trap is that it defeats BigQuery's columnar storage and billing model, leading to unnecessary cost and slower performance.

60
MCQeasy

A marketing team uses a BigQuery BI dashboard to analyze campaign performance. The table campaign_performance is 5 TB, partitioned by date, clustered by campaign_id. Queries filter on date range and campaign_id, and are fast. However, one query that joins this table with a user_dimensions table (10 GB, not partitioned) takes too long. The join is on user_id. What is the best improvement?

A.Denormalize user_dimensions into campaign_performance.
B.Cluster user_dimensions by user_id.
C.Partition user_dimensions by date.
D.Use a broadcast join hint.
AnswerA

Denormalizing adds user_dimension columns to the large table, avoiding the expensive join.

Why this answer

Option A is correct because user_dimensions is small (10 GB) relative to campaign_performance, so denormalizing it into the large table eliminates the join entirely. Option B (cluster user_dimensions by user_id) may reduce shuffle, but the join still occurs. Option C (partition user_dimensions by date) does not help a join on user_id, and the join still occurs. Option D (broadcast join hint) would at best force a broadcast, but the join still occurs.

61
MCQeasy

A data engineer is building a BI reporting layer in BigQuery. The source data includes JSON logs with nested fields. Analysts need to query nested arrays efficiently. Which approach is best?

A.Use SQL and UNNEST to directly query nested arrays.
B.Load the data into separate tables for each array.
C.Flatten all nested fields into separate tables.
D.Create a view that flattens the data.
AnswerA

UNNEST expands arrays efficiently without physically flattening storage.

Why this answer

Option A is correct because BigQuery natively supports nested and repeated fields via the UNNEST operator, which flattens arrays into rows for SQL-based querying. This approach leverages BigQuery's columnar storage and efficient array handling, allowing analysts to query nested arrays directly without data duplication or additional ETL, which is optimal for BI reporting performance.

Exam trap

The trap here is that candidates assume flattening data into separate tables or views is always necessary for SQL compatibility, but BigQuery's UNNEST provides native, efficient array querying without data restructuring.

How to eliminate wrong answers

Option B is wrong because loading nested arrays into separate tables introduces data redundancy and requires complex JOIN operations, increasing query latency and maintenance overhead compared to BigQuery's native nested structure. Option C is wrong because flattening all nested fields into separate tables discards the relational context of nested data, leading to data duplication and loss of query efficiency that UNNEST provides. Option D is wrong because creating a view that flattens data does not change the underlying storage; it still requires UNNEST at query time and adds no performance benefit, while a view can obscure the schema and complicate debugging.
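What UNNEST does to a repeated field can be mimicked in plain Python. The sketch below is an analogy rather than BigQuery code: the `events` rows and the `items` field are hypothetical, and the list comprehension plays the role of the comma join with UNNEST shown in the comment.

```python
# Hypothetical event rows with a repeated field, similar to a BigQuery
# column of type ARRAY<STRUCT<sku STRING, qty INT64>> named `items`.
events = [
    {"event_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"event_id": 2, "items": [{"sku": "A", "qty": 5}]},
    {"event_id": 3, "items": []},  # empty array: contributes no rows
]

# Equivalent in spirit to:
#   SELECT e.event_id, item.sku, item.qty
#   FROM events AS e, UNNEST(e.items) AS item
flattened = [
    (e["event_id"], item["sku"], item["qty"])
    for e in events
    for item in e["items"]
]

# Aggregate over the flattened rows, as a BI query would after UNNEST.
qty_by_sku = {}
for _, sku, qty in flattened:
    qty_by_sku[sku] = qty_by_sku.get(sku, 0) + qty

print(flattened)   # [(1, 'A', 2), (1, 'B', 1), (2, 'A', 5)]
print(qty_by_sku)  # {'A': 7, 'B': 1}
```

The nested rows stay stored once per event; the flattening happens only at query time, which is why no separate tables are needed.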

62
Multi-Selecteasy

Which TWO strategies reduce query costs for ad-hoc analysis in BigQuery? (Choose two.)

Select 2 answers
A.Use LIMIT 10 to preview data.
B.Use clustered tables on frequently filtered columns.
C.Use a flat table without partitioning.
D.Use SELECT * in all queries.
E.Use materialized views for common aggregations.
AnswersB, E

Clustering allows pruning of blocks.

Why this answer

Option B is correct because clustered tables in BigQuery physically sort data based on the specified columns, which allows the query engine to skip entire blocks of data that don't match filter predicates. This reduces the amount of data scanned and thus lowers query costs for ad-hoc analysis. Option E is correct because materialized views precompute and store the results of common aggregations, so queries against them only read the precomputed results rather than scanning the base table, significantly reducing bytes processed.

Exam trap

Google Cloud often tests the misconception that LIMIT reduces cost (on a non-clustered table BigQuery still bills for every byte in the referenced columns, whatever the LIMIT) and that a flat table without partitioning is cheaper (it is not, because every query then scans the whole table regardless of its filters).

63
MCQeasy

A data engineer is designing a BigQuery schema for a time-series dataset of IoT sensor readings. The queries will filter primarily on a timestamp column and also on sensor_id. To optimize query performance and cost, which table design is best?

A.Partition by timestamp, cluster by sensor_id
B.Partition by sensor_id, cluster by timestamp
C.Partition by timestamp, cluster by timestamp
D.No partitioning, cluster by timestamp
AnswerA

Reduces scan to relevant partitions and optimizes filtering on sensor_id.

Why this answer

Partitioning by timestamp allows BigQuery to prune entire partitions when queries filter on the timestamp column, reducing the amount of data scanned and thus lowering cost and improving performance. Clustering by sensor_id further organizes data within each partition, enabling block-level pruning for queries that filter on sensor_id. This combination optimizes for the primary filter (timestamp) and secondary filter (sensor_id) without the overhead of excessive partitions.

Exam trap

Google Cloud often tests the misconception that clustering can replace partitioning for time-based filtering. In reality, partitioning lets BigQuery skip entire partitions before any data is read, while clustering prunes blocks within the data that remains.

How to eliminate wrong answers

Option B is wrong because partitioning by sensor_id would create one partition per sensor (and is only possible at all with integer-range partitioning on an integer column), which can produce a very large number of small partitions, run into BigQuery's per-table partition limit, and does nothing for the primary timestamp filter. Option C is wrong because clustering by the same timestamp column the table is already partitioned on adds little, and it leaves the sensor_id filter with no pruning support. Option D is wrong because without partitioning BigQuery cannot eliminate whole date ranges up front; clustering alone gives only best-effort block-level pruning and leaves the sensor_id filter unoptimized.

64
MCQeasy

A BI team queries this table with a WHERE clause that filters on product_id but does not include a sale_date filter. What is the outcome?

A.The query fails with an error.
B.The query runs successfully and only scans partitions containing product_id values.
C.The query runs successfully and scans only the latest partition.
D.The query runs successfully but scans all partitions.
AnswerA

require_partition_filter=true causes query to fail without a partition filter.

Why this answer

As the answer summary notes, the table is defined with require_partition_filter = true on its partitioning column, `sale_date`. With that option set, BigQuery rejects any query whose WHERE clause does not include a filter on the partitioning column that can be used for partition elimination. Filtering only on `product_id` does not satisfy the requirement, so the query fails with an error instead of falling back to a scan of all partitions. Option A is therefore correct.

Exam trap

Google Cloud often tests the misconception that partition pruning automatically applies to any column in the WHERE clause, leading candidates to choose Option B. In reality, partition pruning only works on the partitioning column, and when the table requires a partition filter, omitting that column makes the query fail.

How to eliminate wrong answers

Option B is wrong because scanning only partitions containing specific `product_id` values would require partition pruning on `product_id`, which is not a partition key; partition pruning only works on the partition column (`sale_date`). Option C is wrong because scanning only the latest partition assumes an implicit default that does not exist; without a `sale_date` filter, the engine has no basis to select a single partition. Option D is wrong because it describes a table without require_partition_filter; there the query would run and scan every partition, but with the option set to true BigQuery rejects the query before any data is read.
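The two behaviours (reject versus scan everything) can be contrasted in a toy Python model. This is not the BigQuery API: `ToyPartitionedTable`, `PartitionFilterRequired`, and the sample rows are hypothetical names invented for the sketch, and only equality filters are modelled.

```python
class PartitionFilterRequired(Exception):
    """Raised when a query omits the partition column from its filters."""

class ToyPartitionedTable:
    def __init__(self, partition_column, require_partition_filter=False):
        self.partition_column = partition_column
        self.require_partition_filter = require_partition_filter
        self.partitions = {}  # partition value -> list of row dicts

    def insert(self, row):
        self.partitions.setdefault(row[self.partition_column], []).append(row)

    def query(self, **filters):
        """Return (matching_rows, partitions_scanned) for equality filters."""
        if self.partition_column in filters:
            keys = [filters[self.partition_column]]  # prune to one partition
        elif self.require_partition_filter:
            raise PartitionFilterRequired(
                f"filter on {self.partition_column!r} is required")
        else:
            keys = list(self.partitions)             # full scan
        rows = [r for k in keys for r in self.partitions.get(k, [])
                if all(r[c] == v for c, v in filters.items())]
        return rows, len(keys)

def load(table):
    for day in ("2024-01-01", "2024-01-02", "2024-01-03"):
        table.insert({"sale_date": day, "product_id": 7})
    return table

strict = load(ToyPartitionedTable("sale_date", require_partition_filter=True))
lenient = load(ToyPartitionedTable("sale_date"))

_, scanned = lenient.query(product_id=7)  # allowed, but scans every partition
try:
    strict.query(product_id=7)
    outcome = "ran"
except PartitionFilterRequired:
    outcome = "rejected"
print(scanned, outcome)  # 3 rejected
```

Adding `sale_date="2024-01-02"` to the strict query makes it succeed and touch a single partition, which mirrors how a `sale_date` filter would satisfy the requirement.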

65
MCQeasy

A company is designing a star schema for a BI dashboard that tracks sales performance. The dashboard needs to aggregate sales by product, store, and date. Which schema design is most appropriate?

A.Store all data in a single table using nested JSON arrays for product and store details
B.Create a single wide table with all attributes (product, store, date, sales)
C.Create a fact table with foreign keys to dimension tables for product, store, and date
D.Use a fully normalized snowflake schema with separate tables for each level of hierarchy
AnswerC

A star schema with fact and dimension tables is the standard for BI reporting, enabling fast aggregations.

Why this answer

Option C is correct because a star schema uses a central fact table with foreign keys to dimension tables, which is optimal for BI aggregation queries. Option A is wrong because storing product and store details as nested JSON arrays in a single table is not suitable for efficient SQL aggregation. Option B is wrong because a single wide table with all attributes leads to data redundancy and is not the star schema the question asks for. Option D is wrong because a fully normalized snowflake schema introduces extra joins that can slow BI queries.

66
MCQmedium

A company uses BigQuery for BI reporting. They have a table 'orders' with columns: order_id, customer_id, order_date, amount, status. The BI team frequently runs queries that filter on order_date and group by customer_id to compute total sales per customer. Which partitioning and clustering strategy optimizes query performance and cost?

A.Partition by order_date, cluster by status
B.Do not partition, cluster by customer_id
C.Partition by customer_id, cluster by order_date
D.Partition by order_date, cluster by customer_id
AnswerD

Partitioning on order_date prunes partitions for date filters; clustering on customer_id improves group by performance.

Why this answer

Option D is correct because partitioning by order_date allows BigQuery to prune partitions for queries filtering on order_date, reducing the amount of data scanned. Clustering by customer_id organizes data within each partition so that GROUP BY customer_id queries can efficiently read only relevant blocks, minimizing shuffle and cost. This combination directly aligns with the BI team's query pattern of filtering by date and aggregating by customer.

Exam trap

Google Cloud often tests the misconception that clustering alone is sufficient for performance, ignoring that partitioning is essential for date-range filters to avoid full table scans, or that clustering on a high-cardinality column like customer_id is ideal for GROUP BY but must be paired with a partition key that matches the filter pattern.

How to eliminate wrong answers

Option A is wrong because clustering by status does not optimize the GROUP BY on customer_id, and status is not used in filtering or grouping, so it provides no benefit for the described workload. Option B is wrong because without partitioning, queries filtering on order_date must scan the entire table, increasing cost and latency, even if clustering by customer_id helps the GROUP BY. Option C is wrong because partitioning by customer_id is not practical (high cardinality, many small partitions) and does not help date-range filtering, while clustering by order_date does not optimize the GROUP BY on customer_id.

67
MCQeasy

You are a database engineer for an e-commerce company. The company uses BigQuery for its BI and analytics. The data pipeline stages raw event data into a table 'raw_events' with columns: event_id, user_id, event_time, event_type, and a JSON string 'event_data'. The BI team wants to query this data for user behavior analysis, but the JSON parsing makes queries slow. They need to perform frequent queries that extract specific fields from the JSON and filter by event_time. The table 'raw_events' is not partitioned and has 2 billion rows. What is the most effective single step to improve query performance and reduce cost?

A.Create a view that extracts JSON fields into columns
B.Partition the table on event_time and cluster on event_type
C.Increase BigQuery slots to maximum
D.Use a materialized view to precompute common queries
AnswerB

Partitioning reduces scanned data; clustering helps with event_type filters.

Why this answer

Partitioning the table on event_time allows BigQuery to prune entire partitions when queries filter by event_time, drastically reducing the amount of data scanned. Clustering on event_type further organizes data within each partition, enabling block-level pruning for queries that filter or aggregate by event_type. This does not make JSON parsing itself any cheaper, but the JSON functions now run only on rows in the surviving partitions and blocks, which makes it the most effective single step for an unpartitioned 2-billion-row table.

Exam trap

Google Cloud often tests the misconception that a view or materialized view alone can solve performance issues, but the trap here is that without physical data reorganization (partitioning and clustering), the underlying full table scan and JSON parsing remain the bottleneck.

How to eliminate wrong answers

Option A is wrong because a view does not physically reorganize data; it only stores a query definition, so the underlying table still requires full scans and JSON parsing on every query, providing no performance or cost benefit. Option C is wrong because increasing BigQuery slots only improves concurrency and execution speed for compute-bound queries, but does not reduce the amount of data scanned; the bottleneck here is I/O from scanning billions of rows, not CPU. Option D is wrong because a materialized view would precompute results, but it still requires the base table to be partitioned and clustered to be efficient; without partitioning, the materialized view would need to scan the entire table on refresh, and it cannot dynamically prune partitions for ad-hoc filters on event_time.

68
Multi-Selecthard

Which TWO optimizations best address slow join performance caused by excessive broadcasting in BigQuery? (Choose two.)

Select 2 answers
A.Use a large query timeout.
B.Set the dimension table to be very large to prevent broadcast.
C.Increase the number of slots.
D.Use a materialized view that pre-joins the tables.
E.Cluster the fact table on the join key.
AnswersD, E

Materialized views avoid runtime joins.

Why this answer

Option D is correct because a materialized view can pre-compute and store the join result, eliminating the need to re-execute the join at query time. This avoids the overhead of broadcasting the dimension table repeatedly, as the materialized view is incrementally refreshed and queried directly, reducing both shuffle and broadcast costs.

Exam trap

Google Cloud often tests the misconception that increasing resources (slots or timeout) or making a table larger can fix join performance issues, when the correct approach is to restructure the data or use pre-computed results like materialized views.

69
MCQeasy

A Dataflow streaming pipeline that writes to a BigQuery table fails with the error above. Which change should be made to the table schema to prevent this error?

A.Add a clustering column
B.Partition the table by ingestion time
C.Increase the streaming buffer size in the table definition
D.Change the table to use a wildcard table pattern
AnswerB

Partitioning spreads writes across multiple partition buffers, preventing overflow.

Why this answer

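Partitioning the table by ingestion time (e.g., _PARTITIONTIME) distributes the streaming buffer across multiple partitions, avoiding the per-partition buffer limit. Option C is wrong because the streaming buffer is managed by BigQuery, and its size is not something the table definition lets you raise. Option A is wrong because clustering does not affect the streaming buffer. Option D is wrong because a wildcard table pattern is a way of querying multiple tables and is unrelated to streaming writes.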

70
MCQhard

A financial company uses Cloud SQL for PostgreSQL to store transaction data. They need to create a materialized view that aggregates daily sales for a BI dashboard. The underlying transaction table is updated continuously. Which approach ensures the materialized view remains up to date without manual intervention?

A.Use BigQuery federated query to directly query the Cloud SQL table
B.Use a Cloud SQL read replica and create the materialized view on the replica
C.Schedule a Cloud Function via Cloud Scheduler to run REFRESH MATERIALIZED VIEW periodically
D.Add a trigger on the base table to refresh the materialized view on each update
AnswerC

This provides automated periodic refreshes without manual effort.

Why this answer

Option C is correct because PostgreSQL, and therefore Cloud SQL for PostgreSQL, does not refresh materialized views automatically; the view only changes when REFRESH MATERIALIZED VIEW is run. Of the options listed, the one that keeps the view up to date without manual intervention is a Cloud Scheduler job that invokes a Cloud Function to execute REFRESH MATERIALIZED VIEW on a fixed schedule. This approach balances freshness with resource cost, as refreshing on every transaction would be too expensive.

Exam trap

Google Cloud often tests the misconception that materialized views in PostgreSQL can be automatically refreshed via triggers or that read replicas support materialized view creation, leading candidates to pick options that ignore the fundamental write-lock and replication limitations of Cloud SQL for PostgreSQL.

How to eliminate wrong answers

Option A is wrong because BigQuery federated queries read the Cloud SQL table directly without creating a materialized view, so they do not provide the pre-aggregated, fast-query performance that a materialized view offers, and they still incur query-time overhead. Option B is wrong because a Cloud SQL read replica is a read-only copy of the database; creating or refreshing a materialized view writes data, so neither can be done on the replica. Option D is wrong because a trigger that refreshes the materialized view on each update recomputes the whole view for every change; on a continuously updated table that overhead would be prohibitive and would slow every write.

71
MCQeasy

A financial BI application stores monetary values such as revenue and tax amounts. Which BigQuery data type should be used to ensure accuracy in calculations?

A.Use STRING and parse numbers as needed
B.Use INT64 and store amounts in cents
C.Use FLOAT64
D.Use NUMERIC or BIGNUMERIC
AnswerD

Exact numeric types guarantee precision for decimals, essential for financial data.

Why this answer

Option D is correct because NUMERIC and BIGNUMERIC are exact numeric types with fixed precision and scale, designed to avoid floating-point rounding errors. In BigQuery, monetary calculations require exact decimal arithmetic, and these types provide up to 38 (NUMERIC) or 76 (BIGNUMERIC) digits of precision, ensuring accuracy for revenue and tax computations.

Exam trap

Google Cloud often tests the misconception that FLOAT64 is acceptable for financial data because it handles decimals, but the trap is that floating-point arithmetic is inherently imprecise for exact monetary calculations, leading to subtle rounding errors that fail audit requirements.

How to eliminate wrong answers

Option A is wrong because storing monetary values as STRING forces parsing on every query, introduces conversion overhead, and loses the ability to perform direct arithmetic operations without explicit casting, which is inefficient and error-prone. Option B is wrong because storing amounts in cents as INT64, while exact and with ample range (INT64 holds values up to about 9.2e18), requires every query and report to apply the scaling factor manually and cannot represent the fractions of a cent that arise in tax and rate calculations. Option C is wrong because FLOAT64 is a floating-point type that introduces binary rounding errors (e.g., 0.1 + 0.2 != 0.3), which can cause cumulative inaccuracies in financial calculations and violate accounting standards.
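The rounding behaviour behind these options is easy to reproduce in Python, whose `float` is the same IEEE 754 double as FLOAT64 and whose `decimal.Decimal` offers exact decimal arithmetic comparable in spirit to NUMERIC. The amounts and the 8.25% tax rate below are made-up illustration values.

```python
from decimal import Decimal

# Binary floating point (the representation FLOAT64 uses).
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Exact decimal arithmetic, the property NUMERIC/BIGNUMERIC provide.
decimal_sum = Decimal("0.10") + Decimal("0.20")
print(decimal_sum == Decimal("0.30"))  # True

# Integer cents are also exact, and INT64 has ample range for money:
INT64_MAX = 2**63 - 1
trillion_dollars_in_cents = 10**12 * 100
print(trillion_dollars_in_cents < INT64_MAX)  # True

# ...but a fractional-cent result needs a rounding rule applied by hand.
tax_cents = Decimal(1999) * Decimal("0.0825")  # 8.25% tax on $19.99, in cents
print(tax_cents)  # 164.9175
```

The last line shows why integer cents alone are awkward: the tax comes out as 164.9175 cents, which an INT64 column cannot hold without an explicit rounding decision.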

72
Drag & Dropmedium

Order the steps to perform a disaster recovery drill for a Cloud Spanner database using backups.

Why this order

Backup first, then restore to another region, verify, update apps, test.

73
Multi-Selecthard

Which THREE methods are effective for improving query performance in BigQuery for BI workloads?

Select 3 answers
A.Clustering on frequently filtered columns
B.Replacing joins with subqueries
C.Partitioning on a date column
D.Using SELECT * in queries
E.Using pre-aggregated summary tables
AnswersA, C, E

Clustering allows BigQuery to skip reading blocks that don't match filter conditions.

Why this answer

Option A is correct because clustering on frequently filtered columns physically co-locates related data within blocks, significantly reducing the amount of data scanned for queries with filter predicates. This is especially effective for BI workloads that often filter on columns such as customer ID or transaction type, as it avoids full table scans and improves query performance without additional storage costs. Options C and E are correct for similar reasons: partitioning on a date column prunes whole partitions for date filters, and pre-aggregated summary tables let dashboards read a few precomputed rows instead of the raw data.

Exam trap

Google Cloud often tests the misconception that rewriting joins as subqueries is an optimization. In BigQuery, a subquery still has to read and process the same data, so the rewrite does not by itself reduce bytes scanned or the work done.

74
Multi-Selecthard

Which THREE techniques can improve query performance in BigQuery for BI workloads? (Choose three.)

Select 3 answers
A.Use approximate aggregation functions when exact results are not required.
B.Avoid SELECT * in production queries; select only needed columns.
C.Use SELECT * with LIMIT to preview data.
D.Use ORDER BY on large result sets without LIMIT.
E.Cluster the table on columns frequently used in WHERE clauses.
AnswersA, B, E

Approximate functions use less memory and are faster.

Why this answer

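Option A is correct because approximate aggregation functions (e.g., APPROX_COUNT_DISTINCT, APPROX_QUANTILES) in BigQuery use sketch-based algorithms (HyperLogLog++ in the case of APPROX_COUNT_DISTINCT) to return close estimates with significantly lower resource consumption and faster execution. For BI workloads where exact precision is not critical (e.g., dashboard trend figures), this reduces query cost and latency. Option B is correct because BigQuery's columnar storage reads only the columns a query references, and Option E is correct because clustering lets BigQuery skip blocks that cannot match the WHERE filter.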
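
The trade-off behind approximate distinct counts can be shown with a small sketch. The code below uses a k-minimum-values estimator, a simpler technique than HyperLogLog++ that is swapped in here only because it fits in a few lines; it is not how BigQuery implements APPROX_COUNT_DISTINCT. The function names and the default `k=1024` are assumptions for illustration.

```python
import hashlib
import heapq

def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1]."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def approx_count_distinct(values, k=1024):
    """Estimate the distinct count while remembering only the k smallest hashes."""
    smallest = []    # max-heap (stored negated) of the k smallest hashes seen
    members = set()  # hashes currently in the heap, used to ignore duplicates
    for v in values:
        h = hash01(v)
        if h in members:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -h)
            members.add(h)
        elif h < -smallest[0]:
            evicted = -heapq.heapreplace(smallest, -h)
            members.discard(evicted)
            members.add(h)
    if len(smallest) < k:
        return len(smallest)             # fewer than k distinct values: exact
    return int((k - 1) / -smallest[0])   # k-th smallest hash gives the estimate

values = [i % 20_000 for i in range(100_000)]
exact = len(set(values))
estimate = approx_count_distinct(values)
print(exact, estimate)
```

The estimator keeps at most `k` hashes regardless of input size, which is the memory saving the explanation refers to; the price is a relative error of a few percent at this `k`.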

Exam trap

Google Cloud often tests two misconceptions. The first is that SELECT * with LIMIT is a performance optimization; on a table that is not clustered, it still bills for a full scan of every column. The second is that ORDER BY without LIMIT is acceptable on large result sets; the final sort then has to run on a single worker and can fail with a resources-exceeded error, whereas adding LIMIT lets each worker keep only its top rows.
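
Why LIMIT helps a distributed ORDER BY can be illustrated with a toy model. The sketch below is not BigQuery's execution engine; the four shards, the random values, and the variable names are assumptions. With a limit, each worker forwards only its own top rows, so the final step merges a handful of candidates instead of sorting every row.

```python
import heapq
import random

random.seed(1)
# Hypothetical setup: four workers each hold an unsorted shard of values.
shards = [[random.random() for _ in range(10_000)] for _ in range(4)]
limit = 10

# With LIMIT: each worker keeps its own top rows, and the final step
# merges at most len(shards) * limit candidates.
candidates = [row for shard in shards for row in heapq.nlargest(limit, shard)]
top_with_limit = heapq.nlargest(limit, candidates)

# Without LIMIT: every row has to reach the final sort.
all_rows = [row for shard in shards for row in shard]
fully_sorted = sorted(all_rows, reverse=True)

print(len(candidates), "rows reach the final step with LIMIT")
print(len(all_rows), "rows reach the final step without LIMIT")
```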

75
Multi-Selecthard

A financial services company needs to design a BigQuery data model for real-time fraud detection. Data arrives from multiple streaming sources and must be joined with historical customer profiles (10 TB) and transaction lookup tables (500 GB). Which TWO design considerations are most important to minimize query latency and cost?

Select 2 answers
A.Use time-based partitioning on the historical customer table and cluster on customer_id.
B.Partition streaming data by ingestion time and cluster by customer_id and transaction_type.
C.Schedule a nightly script to recluster tables based on query patterns.
D.Use a single table for all streaming data without partitioning to avoid partition management overhead.
E.Denormalize all historical and lookup data into a single wide table.
AnswersA, B

Time-based partitioning reduces the data scanned for recent customers, and clustering on the join key speeds up the join.

Why this answer

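Option A is correct because time-based partitioning on the historical customer table (10 TB) allows BigQuery to prune irrelevant partitions during queries, reducing the amount of data scanned and thus lowering cost and latency. Clustering on customer_id further optimizes joins with streaming data by co-locating related rows, which can reduce shuffle overhead. Option B is correct because partitioning streaming data by ingestion time confines queries over recent events to the newest partitions, and clustering by customer_id and transaction_type lets BigQuery skip blocks that do not match the filter or join key.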

Exam trap

Google Cloud often tests two misconceptions. The first is that manual reclustering is required for performance; BigQuery reclusters tables automatically in the background. The second is that denormalization is always beneficial for joins; that view ignores the storage and maintenance costs of a single wide table at this scale.
