Google Professional Cloud Database Engineer PCDE Questions 301–375 | Page 5/7

301

MCQeasy

A startup uses Cloud SQL (MySQL) for a blogging platform. The schema has a table 'posts' with columns: post_id (auto-increment PK), title, content, author_id, created_at. The application frequently runs a query to display the latest 10 posts from a specific author: SELECT * FROM posts WHERE author_id = ? ORDER BY created_at DESC LIMIT 10. This query is slow when an author has thousands of posts. The team wants to optimize this query without changing the application code. What schema change will be most effective?

A.Add a composite index on (author_id, created_at DESC).

B.Partition the table by author_id using range partitioning.

C.Increase the query cache size in Cloud SQL.

D.Migrate the posts table to Cloud Spanner and use interleaved indexes.

AnswerA

This index directly supports the query, allowing an index range scan and limit.

Why this answer

Option A is correct. A composite index on (author_id, created_at) allows the database to efficiently find the posts for a given author ordered by created_at without scanning all rows. Option B (query cache) is not a schema change.

Option C (Spanner) is a different database. Option D (partitioning) could help but ordering across partitions is complex and not as effective.

Full explanation →

302

MCQeasy

A Database Engineer is responsible for managing a Cloud SQL for MySQL instance. The engineer needs to ensure that automated backups are retained for 14 days and that point-in-time recovery (PITR) is enabled. Which configuration should the engineer set?

A.Create scheduled manual backups each day and retain them for 14 days; PITR is not needed because backups are daily.

B.Enable automated backups, set backup retention to 14 days, and enable binary logging.

C.Enable automated backups with a retention of 14 days; PITR is automatically enabled.

D.Enable automated backups with 14-day retention and set up a cross-region replica for disaster recovery.

AnswerB

Automated backups with binary logging enable PITR.

Why this answer

Option B is correct because enabling automated backups with a 14-day retention ensures backup files are kept for the required duration, and enabling binary logging (which is required for PITR) allows point-in-time recovery by replaying transaction logs against a base backup. In Cloud SQL for MySQL, PITR is not automatically enabled with automated backups; binary logging must be explicitly enabled.

Exam trap

The trap here is that candidates assume enabling automated backups automatically enables point-in-time recovery, but Cloud SQL for MySQL requires binary logging to be separately enabled for PITR functionality.

How to eliminate wrong answers

Option A is wrong because scheduled manual backups do not provide the continuous transaction log coverage needed for point-in-time recovery, and PITR is still required for granular recovery to a specific timestamp, not just daily backups. Option C is wrong because PITR is not automatically enabled when automated backups are enabled; binary logging must be explicitly turned on to support PITR. Option D is wrong because a cross-region replica provides disaster recovery and high availability, not point-in-time recovery; PITR requires binary logging, not replication.

Full explanation →

303

MCQeasy

A company is designing a BigQuery data model for a business intelligence dashboard that shows sales by region and product. The data is refreshed daily. Which schema design is MOST cost-effective and performant for this use case?

A.A table with nested repeated columns for regions and products within each sale.

B.A star schema with a fact table for sales and separate dimension tables for region and product.

C.A fully normalized schema with separate tables for each attribute.

D.A single flat table containing all sales, region, and product columns.

AnswerB

Star schemas are optimized for BI workloads, reducing data scanned and improving query performance.

Why this answer

Option B is correct because a star schema with a fact table for sales and dimension tables for region and product is optimized for analytical queries in BigQuery. Option A is wrong because a flat table with all columns leads to higher storage costs and slower queries due to scanning unnecessary columns. Option C is wrong because a wide table with nested columns is better for hierarchical data, not for simple dimensional analysis.

Option D is wrong because a normalized schema with many joins is not ideal for BI queries and increases complexity.

Full explanation →

304

MCQmedium

Refer to the exhibit. A developer tries to connect to a Cloud SQL instance from a VM using the public IP. The connection fails with this error. What should the developer do to fix the connection?

A.Enable SSL-only mode on the Cloud SQL instance.

B.Add the client IP to the authorized networks.

C.Connect using the Cloud SQL Proxy.

D.Change the instance to require SSL for private connections.

AnswerC

The Cloud SQL Proxy handles SSL encryption and IAM authentication, bypassing the need for client-side SSL configuration.

Why this answer

The error indicates that the Cloud SQL instance does not have an authorized network allowing the VM's public IP. However, the correct fix is to use the Cloud SQL Proxy, which establishes a secure, authenticated tunnel to the instance without needing to authorize the VM's IP. The proxy handles IAM-based authentication and encryption, bypassing the public IP network authorization requirement entirely.

Exam trap

Google Cloud often tests the misconception that adding the client IP to authorized networks is the only way to fix a public IP connection failure, but the trap is that the Cloud SQL Proxy is the secure, recommended alternative that avoids IP management and works even when the VM's IP is not static or known.

How to eliminate wrong answers

Option A is wrong because enabling SSL-only mode enforces encryption for connections but does not bypass the authorized networks check; the connection would still fail if the client IP is not authorized. Option B is wrong because adding the client IP to authorized networks would work in principle, but the question implies the developer is using a VM with a dynamic or non-static public IP, making this approach impractical and insecure; the Cloud SQL Proxy is the recommended solution for such scenarios. Option D is wrong because requiring SSL for private connections applies only to private IP connections, not public IP connections, and does not resolve the public IP authorization issue.

Full explanation →

305

MCQeasy

A company uses Cloud Spanner with a multi-region configuration (nam3) for a global application. They notice that write latency has increased significantly during peak hours. After investigation, they find that the number of splits has increased from 10 to 50, and the CPU utilization on most nodes is below 10%. However, writes are being throttled due to excessive hot spots on a few nodes. What should they do?

A.Redesign the primary key to avoid monotonic increases

B.Enable interleaved tables

C.Increase the number of nodes

D.Use a read replica

AnswerA

Using a non-monotonic key (e.g., hashed or UUID) spreads writes across all splits, reducing hot spots and throttling.

Why this answer

Option C is correct because hot spots in Spanner are often caused by monotonically increasing primary keys, which concentrate writes on a few splits. Redesigning the key to distribute writes (e.g., using a hash prefix) spreads the load evenly. Option A (increase nodes) does not fix the hot spot; the bottleneck is lock contention on specific splits.

Option B (interleaved tables) helps with child table joins but not with primary key distribution. Option D (read replicas) is for read scaling, not write throughput.

Full explanation →

306

MCQeasy

A mobile app stores user profiles in Firestore. Users are spread globally. Which data model ensures low latency reads and writes?

A.A single collection containing all user documents

B.One document per user in a single collection with composite indexes

C.One collection per geographic region

D.Subcollections under a geographic region collection (e.g., /regions/{region}/users/{user})

AnswerD

Using subcollections under region documents distributes writes across regions, improving latency.

Why this answer

Option D is correct because it uses geographic region as the top-level collection key, which enables Firestore to co-locate user documents within the same region. This minimizes latency by ensuring that reads and writes for users in the same geographic area are served from a nearby Firestore instance, leveraging Firestore's automatic multi-region replication and strong consistency within a location.

Exam trap

The trap here is that candidates confuse composite indexes with data locality, assuming indexes solve latency issues, when in fact Firestore's performance depends on document grouping and proximity to the client's location.

How to eliminate wrong answers

Option A is wrong because a single collection containing all user documents forces Firestore to distribute documents across multiple regions, increasing read and write latency for globally distributed users due to cross-region data access. Option B is wrong because composite indexes do not affect data locality; they only optimize query performance, not reduce latency for geographically dispersed users. Option C is wrong because while it groups users by region, it still stores all user documents in a single collection per region, which does not provide the same locality benefits as using subcollections under a region document, and it can lead to hot-spotting on the region document itself.

Full explanation →

307

MCQhard

A BigQuery table is partitioned by ingestion time (pseudo column _PARTITIONTIME) and uses the default partition expiration of 90 days. A data engineer runs a DELETE statement to remove rows older than 100 days. Why does this query process more bytes than expected?

A.The table is not partitioned; it is clustered.

B.The DELETE statement does not use a WHERE clause on a clustering column.

C.The DELETE statement filters on a custom timestamp column instead of _PARTITIONTIME.

D.The DELETE statement must scan all partitions because it uses a condition that does not prune partitions.

AnswerD

Without a filter on _PARTITIONTIME or a partition column, the query scans all partitions.

Why this answer

Option D is correct because the DELETE statement uses a condition that does not reference the partitioning column (_PARTITIONTIME) in a way that allows partition pruning. Since the table is partitioned by ingestion time, BigQuery must scan all partitions to evaluate the filter, even though the condition logically targets rows older than 100 days. This results in processing more bytes than expected, as the default partition expiration of 90 days does not reduce the scan scope when the WHERE clause does not leverage the partitioning column.

Exam trap

Google Cloud often tests the misconception that a time-based filter on any timestamp column will trigger partition pruning, when in fact only filters on the specific partitioning column (like _PARTITIONTIME) enable partition elimination.

How to eliminate wrong answers

Option A is wrong because the table is explicitly described as partitioned by ingestion time, so it is partitioned, not just clustered. Option B is wrong because clustering columns are irrelevant for partition pruning; the issue is about partition-level filtering, not clustering. Option C is wrong because filtering on a custom timestamp column instead of _PARTITIONTIME would not cause partition pruning; however, the question states the DELETE removes rows older than 100 days, and if that custom column is used, it still would not prune partitions unless it is the partitioning column, but the core reason for scanning all partitions is the lack of a filter on _PARTITIONTIME, not the use of a custom column per se.

Full explanation →

308

MCQhard

A Looker developer configured a new connection to BigQuery as shown. The connection test fails with the error above. What is the most likely cause?

A.The dataset mydataset does not exist in the project

B.The BigQuery query quota has been exceeded for the project

C.The Looker instance is located in a different region than the BigQuery dataset

D.The Looker service account lacks the required BigQuery roles on the dataset

AnswerD

The error 'Access Denied' indicates missing IAM permissions for the service account.

Why this answer

Option D is correct because the error indicates a permissions issue during the connection test. Looker uses a service account to authenticate to BigQuery, and if that service account lacks the required BigQuery roles (e.g., BigQuery Data Viewer, BigQuery Job User) on the dataset, the connection test will fail with an access denied error. The error message shown in the question (not provided here but implied) typically states 'Access Denied' or 'Permission denied' when the service account does not have the necessary IAM permissions on the dataset or project.

Exam trap

Google Cloud often tests the misconception that region mismatch causes connection failures, but BigQuery datasets are global and region does not affect authentication; the real issue is almost always IAM permissions on the service account.

How to eliminate wrong answers

Option A is wrong because if the dataset did not exist, the error would be 'Not found: Dataset myproject:mydataset' rather than a permissions error. Option B is wrong because exceeding the BigQuery query quota results in a 'Quota exceeded' error, not a permissions-related failure. Option C is wrong because BigQuery datasets are global resources and region mismatch does not cause connection test failures; Looker can connect to BigQuery datasets in any region as long as network connectivity exists.

Full explanation →

309

MCQmedium

A company is designing a BigQuery data warehouse for BI dashboards. They have a fact table with billions of rows and need to optimize query performance for common filters on date and customer_id. Which table design strategy is most effective?

A.Use a clustered table on date only.

B.Use a non-partitioned table with indexing on customer_id.

C.Use a materialized view that aggregates by date.

D.Use a partitioned table on date with clustering on customer_id.

AnswerD

Partitioning prunes date ranges, clustering narrows scans within partitions.

Why this answer

Option D is correct because partitioning the table on `date` allows BigQuery to prune entire partitions when filtering by date, drastically reducing the data scanned. Clustering on `customer_id` then sorts data within each partition, enabling block-level pruning for queries that filter on `customer_id`. This combination minimizes both I/O and cost for the described BI workload.

Exam trap

The trap here is that candidates often assume clustering alone is sufficient for date-range filtering, overlooking that partitioning is required to physically separate data by date and enable partition pruning, which is a fundamental BigQuery optimization for time-series data.

How to eliminate wrong answers

Option A is wrong because clustering on `date` alone does not provide partition pruning; without partitioning, BigQuery must scan the entire table even if only a date range is needed, leading to higher costs and slower performance. Option B is wrong because BigQuery does not support traditional indexing; it uses columnar storage and pruning via partitioning/clustering, so a non-partitioned table with 'indexing' is not a valid strategy. Option C is wrong because a materialized view aggregating by date would pre-summarize data but cannot efficiently support ad-hoc filters on `customer_id` without scanning all underlying rows; it also adds storage and maintenance overhead without addressing the need for row-level filtering on `customer_id`.

Full explanation →

310

MCQeasy

You are monitoring a Memorystore for Redis instance serving as a cache for an e-commerce application. The cache hit ratio has dropped from 95% to 70%. Which action is most likely to restore the hit ratio?

A.Reduce the network latency between the application and Redis by moving them to the same zone.

B.Change the eviction policy to 'noeviction' to prevent key removal.

C.Increase the maxmemory setting to accommodate more cache entries.

D.Enable AOF persistence to ensure data survives restarts.

AnswerC

Larger memory reduces evictions, improving hit ratio.

Why this answer

A drop in cache hit ratio from 95% to 70% indicates that the cache is evicting frequently accessed keys to make room for new ones. Increasing the maxmemory setting allows Redis to store more entries, reducing evictions and restoring the hit ratio. This is the most direct way to address the capacity issue without changing application behavior.

Exam trap

The trap here is that candidates confuse eviction policy changes (like 'noeviction') with capacity increases, or assume that persistence or network optimization can fix a hit ratio problem caused by insufficient memory.

How to eliminate wrong answers

Option A is wrong because network latency affects response times, not the cache hit ratio; moving to the same zone reduces latency but does not prevent evictions or increase the number of cached keys. Option B is wrong because setting 'noeviction' causes Redis to return errors on write operations when memory is full, which can break the application and does not restore existing evicted keys. Option D is wrong because AOF persistence ensures data durability across restarts but does not affect the eviction behavior or the number of keys in memory; it may even reduce available memory for caching due to overhead.

Full explanation →

311

MCQhard

A financial services company uses Cloud Spanner with a database that has multiple tables with interleaved relationships. They need to enforce a strict consistency requirement across two related tables that are not interleaved. Which method ensures global strong consistency?

A.Use Spanner's built-in atomicity by executing the updates in a single read-write transaction.

B.Use Cloud Pub/Sub to eventually synchronize the tables.

C.Use a commit timestamp-based approach to synchronize writes.

D.Use a client-side distributed transaction across the two tables.

AnswerA

Spanner supports multi-table transactions with global strong consistency.

Why this answer

Spanner provides global ACID transactions across all tables within a database. Option B (single read-write transaction) is the correct approach. Option A (client-side) is unnecessary and less reliable.

Option C (commit timestamps) does not provide atomicity. Option D (Pub/Sub) is eventual.

Full explanation →

312

MCQhard

Refer to the exhibit. You notice replication latency is 15ms. What is the most likely cause of this latency?

A.The table load is not uniform

B.Storage utilization is at 70%

C.High CPU utilization (85%) on the cluster

D.The number of nodes is above the recommended count

AnswerC

High CPU can slow down replication operations.

Why this answer

High CPU utilization (85%) on the cluster is the most likely cause of replication latency because the replication process (often using protocols like Raft or Paxos in distributed databases) is CPU-intensive. When CPU resources are saturated, the cluster cannot process replication requests in a timely manner, leading to increased latency. In many distributed database systems, replication involves serialization, checksumming, and network I/O, all of which compete for CPU cycles.

Exam trap

Google Cloud often tests the misconception that storage utilization or node count are the primary causes of replication latency, when in fact CPU saturation is the most direct bottleneck for the replication protocol's processing pipeline.

How to eliminate wrong answers

Option A is wrong because non-uniform table load typically causes hot spots or uneven data distribution, which can lead to performance degradation but does not directly cause replication latency; replication latency is a cluster-wide issue related to the replication protocol's processing speed. Option B is wrong because storage utilization at 70% is within normal operational limits and does not directly impact replication latency; replication latency is more sensitive to CPU and network bandwidth than to storage capacity. Option D is wrong because the number of nodes being above the recommended count can increase network overhead and coordination delays, but the most direct and common cause of replication latency in a cluster with high CPU is CPU saturation, not node count alone.

Full explanation →

313

MCQeasy

A startup is using Cloud Spanner for a global user base. They need to design a schema that minimizes interleaved table joins for common access patterns. Which schema design principle should they prioritize?

A.Normalize all tables to reduce data redundancy.

B.Store data in separate databases per region.

C.Use secondary indexes on all foreign key columns.

D.Use composite primary keys to colocate related data.

AnswerD

Correct. Composite primary keys enable interleaving, colocating rows and minimizing joins.

Why this answer

Interleaved tables in Spanner allow colocation of parent-child rows, reducing cross-node joins. Option A uses composite primary keys to colocate related data, which is the core principle of interleaving. Option B (normalization) increases joins.

Option C (secondary indexes) helps but is not as fundamental. Option D (separate databases) increases complexity.

Full explanation →

314

MCQmedium

A company runs a retail BI dashboard on BigQuery. The fact_sales table is partitioned by DAY and clustered by product_id. The table is 10 TB. Recently, analysts complain that queries filtering on a specific product_id and a month of data take over 10 minutes. The query uses a subquery to find top products. What should the engineer do?

A.Create a materialized view for the subquery.

B.Add an ORDER BY product_id to the subquery.

C.Change partition type to HOUR.

D.Re-cluster the table with product_id as the first clustering column and date as the second.

AnswerA

Materialized view stores precomputed results, reducing query time and cost.

Why this answer

Option A is correct because creating a materialized view for the subquery that identifies top products pre-computes and stores the results, which are incrementally refreshed by BigQuery. This avoids re-scanning the entire 10 TB fact_sales table each time the query runs, drastically reducing query time for the analysts' frequent filtering on product_id and a month of data.

Exam trap

Google Cloud often tests the misconception that clustering or partitioning changes alone can solve performance issues for subqueries, but the real bottleneck is the repeated full-table scan, which only a materialized view or similar pre-computation can eliminate.

How to eliminate wrong answers

Option B is wrong because adding ORDER BY product_id to the subquery does not improve performance; it only sorts the output, which adds overhead without reducing the data scanned or leveraging clustering. Option C is wrong because changing partition type to HOUR would create many small partitions, increasing partition management overhead and potentially degrading query performance due to metadata operations, while the analysts query a month of data, not hourly slices. Option D is wrong because re-clustering with product_id as the first clustering column and date as the second is already the current clustering order (product_id first, DAY partition second), so this change would not provide any benefit and clustering is automatically maintained by BigQuery.

Full explanation →

315

MCQeasy

An e-commerce platform requires strong consistency across global regions. Which database service should they choose?

A.Cloud Bigtable

B.Firestore

C.Cloud Spanner

D.Cloud SQL

AnswerC

Cloud Spanner provides globally distributed strong consistency.

Why this answer

Cloud Spanner is the correct choice because it provides strong consistency across global regions via synchronous replication and TrueTime, ensuring ACID transactions with external consistency. This meets the e-commerce platform's requirement for globally consistent reads and writes without eventual consistency trade-offs.

Exam trap

The trap here is that candidates often confuse 'strong consistency' with 'eventual consistency' and pick Firestore for its real-time capabilities, overlooking that Firestore's multi-region mode sacrifices strong consistency for availability.

How to eliminate wrong answers

Option A is wrong because Cloud Bigtable is a NoSQL wide-column database designed for high-throughput analytical workloads, not strong consistency across regions—it offers only eventual consistency for replicated data. Option B is wrong because Firestore provides strong consistency within a single region but uses eventual consistency for multi-region deployments, failing the global strong consistency requirement. Option D is wrong because Cloud SQL is a regional relational database service that does not support multi-region replication with strong consistency; it relies on asynchronous replication for cross-region failover, which can lead to data loss or stale reads.

Full explanation →

316

MCQeasy

A company is migrating their on-premises PostgreSQL database to Cloud SQL. They want to minimize downtime during the migration. Which approach should they use?

A.Use Database Migration Service with continuous replication

B.Use pg_dump and pg_restore

C.Export data to CSV and import into Cloud SQL

D.Use a third-party ETL tool

AnswerA

Database Migration Service supports continuous replication (CDC) to minimize downtime.

Why this answer

Database Migration Service (DMS) with continuous replication is the correct approach because it uses change data capture (CDC) to replicate ongoing transactions from the source PostgreSQL database to Cloud SQL, enabling a near-zero downtime migration. DMS handles schema conversion, data validation, and automated failover, which minimizes the cutover window to seconds or minutes.

Exam trap

The trap here is that candidates often assume any backup-and-restore method (like pg_dump) is sufficient for migration, but the PCDE exam specifically tests the requirement for minimal downtime, which only continuous replication can achieve.

How to eliminate wrong answers

Option B is wrong because pg_dump and pg_restore perform a logical backup and restore, which requires taking the source database offline or in read-only mode during the dump, causing significant downtime. Option C is wrong because exporting data to CSV and importing into Cloud SQL is a manual, batch-oriented process that does not support continuous replication, leading to extended downtime and potential data inconsistency. Option D is wrong because a third-party ETL tool typically extracts data in batches and cannot provide the continuous, low-latency replication needed for minimal downtime, and it introduces additional complexity and cost without native integration with Cloud SQL.

Full explanation →

317

MCQeasy

A company plans to migrate an on-premises PostgreSQL database to Cloud SQL. The database is 2 TB in size and requires minimal downtime. Which migration approach should they use?

A.Export the database using pg_dump and import into Cloud SQL using psql.

B.Use Datastream to stream data into Cloud SQL.

C.Use Database Migration Service with a continuous migration job.

D.Copy the data files to Cloud Storage and use gcloud to load into Bigtable.

AnswerC

Database Migration Service supports virtually zero-downtime migrations through continuous replication.

Why this answer

Database Migration Service (DMS) with a continuous migration job is the correct approach because it supports minimal-downtime migrations from on-premises PostgreSQL to Cloud SQL. DMS uses logical replication (via PostgreSQL's pgoutput plugin) to continuously sync changes from the source to the target, allowing a short cutover window. For a 2 TB database, this avoids the lengthy downtime required by a full dump-and-load method.

Exam trap

The trap here is that candidates may choose pg_dump (Option A) because it is familiar and works for smaller databases, but they overlook the minimal-downtime requirement and the impracticality of exporting 2 TB without significant service interruption.

How to eliminate wrong answers

Option A is wrong because pg_dump and psql require a full export and import, which would take hours or days for a 2 TB database, causing significant downtime. Option B is wrong because Datastream is designed for streaming change data capture (CDC) to BigQuery or Cloud Storage, not for direct ingestion into Cloud SQL. Option D is wrong because copying data files to Cloud Storage and loading into Bigtable is for NoSQL workloads, not for migrating a relational PostgreSQL database to Cloud SQL.

Full explanation →

318

MCQmedium

Refer to the exhibit. The company wants to achieve a 99.99% SLA for this Cloud SQL instance. What should they do?

A.Change to a different tier.

B.Change the availability type to REGIONAL.

C.Enable automatic backups.

D.Increase the number of CPUs.

AnswerB

REGIONAL availability uses zonal replications and offers a 99.99% SLA.

Full explanation →

319

MCQeasy

A healthcare company needs to run BI queries on patient data. The table is in BigQuery and contains 5 billion rows. Queries often filter on patient_id and date. But the table is not partitioned or clustered. Analysts run queries that scan the entire table. The data is updated daily. What is the most cost-effective way to improve performance?

A.Partition the table by patient_id.

B.Use a view that only selects recent data.

C.Cluster the table by date.

D.Partition by date and cluster by patient_id.

AnswerD

Partitioning prunes by date, clustering narrows by patient_id, reducing scanned bytes significantly.

Why this answer

Partitioning by date (e.g., ingestion or event date) allows BigQuery to prune entire partitions when queries filter on date, drastically reducing the data scanned. Clustering by patient_id within each partition further organizes the data so that queries filtering on patient_id can skip irrelevant blocks via block-level metadata. Together, this minimizes bytes billed and improves query performance without requiring table redesign or additional storage costs.

Exam trap

Google Cloud often tests the misconception that clustering alone is sufficient for performance gains, but without partitioning, clustering cannot prune storage at the partition level, so full-table scans still occur and costs remain high.

How to eliminate wrong answers

Option A is wrong because partitioning by patient_id is not supported in BigQuery (partitioning columns must be of type DATE, TIMESTAMP, or INTEGER range) and would not align with the common date-based filter pattern. Option B is wrong because a view that only selects recent data does not reduce the underlying table scan; BigQuery still processes all data in the table unless the view is materialized, and even then it would not address the full-table scan issue for historical queries. Option C is wrong because clustering alone without partitioning still requires scanning all partitions (the entire table) if no partition filter is applied; clustering only helps within a partition, so without a partition filter the query still incurs full-table costs.

Full explanation →

320

Multi-Selectmedium

Which TWO best practices should be followed when modeling data for a Looker BI dashboard to optimize query performance?

Select 2 answers

A.Use derived tables for all complex logic

B.Use persistent derived tables (PDTs) to materialize intermediate results

C.Use native derived tables to leverage BigQuery's UDFs

D.Use materialized views in the underlying database

E.Use symmetric aggregates to correctly aggregate measures across joins

AnswersB, E

PDTs are stored and refreshed periodically, improving query speed.

Why this answer

Option B is correct because Persistent Derived Tables (PDTs) materialize intermediate query results into physical tables in the underlying database (e.g., BigQuery). This avoids re-executing complex logic on every user interaction, drastically reducing query latency and cost. PDTs are a core Looker optimization for repeated, heavy transformations.

Exam trap

Google Cloud often tests the distinction between persistent and native derived tables, trapping candidates who think all derived tables improve performance, when only persistent ones (PDTs) materialize results for repeated use.

Full explanation →

321

MCQhard

A company is migrating their on-premises data warehouse to BigQuery for BI. They have a fact table with billions of rows and many dimension tables. The current queries perform well in the on-prem system but are slow in BigQuery. The queries contain multiple JOINs and subqueries. Which optimization should they implement first?

A.Use clustering on all join keys.

B.Use BigQuery's automatic query rewriting.

C.Convert subqueries to CTEs.

D.Denormalize the dimension tables into the fact table.

AnswerD

Denormalization eliminates JOINs, which are expensive in BigQuery, improving performance significantly.

Why this answer

Denormalizing dimension tables into the fact table is the most impactful first optimization because it eliminates the need for expensive JOIN operations across billions of rows. In BigQuery, JOINs on large fact tables with multiple dimension tables can cause significant data shuffling and increased slot consumption, whereas denormalization reduces query complexity and leverages BigQuery's columnar storage and compression more efficiently. This directly addresses the root cause of slow performance in a BI workload where subqueries and JOINs are prevalent.

Exam trap

Google Cloud often tests the misconception that query-level optimizations (like clustering, CTEs, or automatic rewriting) can solve performance issues caused by schema design, when in fact the most impactful first step is to reduce JOIN complexity through denormalization for BigQuery's architecture.

How to eliminate wrong answers

Option A is wrong because clustering on all join keys does not eliminate the JOIN operations themselves; it only improves the efficiency of filtering and sorting within each table, but the shuffle and data redistribution required for JOINs across billions of rows remains a bottleneck. Option B is wrong because BigQuery's automatic query rewriting is a built-in optimizer that already applies heuristics and cost-based optimizations, but it cannot fundamentally restructure the schema to avoid JOINs; it works within the existing query structure. Option C is wrong because converting subqueries to CTEs (Common Table Expressions) is a syntactic change that does not alter the execution plan or reduce the computational cost of JOINs and subqueries; BigQuery treats CTEs similarly to subqueries under the hood.

Full explanation →

322

MCQeasy

A company is designing a data warehouse for BI. They need to support both detailed transaction analysis and high-level aggregated reports. Which schema design best balances storage and query performance?

A.Fully denormalized single table

B.Wide column store with no schema

C.Star schema with fact and dimension tables

D.Snowflake schema with normalized dimensions

AnswerC

Star schema is standard for BI, enabling fast aggregations and easy reporting.

Why this answer

The star schema is the optimal design for balancing storage and query performance in a BI data warehouse because it separates transactional data into fact tables (for detailed analysis) and dimension tables (for context), enabling fast aggregations via star joins while avoiding the storage overhead of full denormalization. This structure directly supports both granular transaction queries and high-level rollups without the complexity or performance penalty of snowflake schemas or the redundancy of fully denormalized tables.

Exam trap

Google Cloud often tests the misconception that snowflake schemas are always better for storage efficiency, but the trap here is that the question explicitly balances storage and query performance, and the star schema provides the best trade-off by avoiding excessive joins while keeping dimensions manageable.

How to eliminate wrong answers

Option A is wrong because a fully denormalized single table introduces massive data redundancy and update anomalies, leading to excessive storage consumption and slower query performance due to larger table scans, especially for high-level aggregations. Option B is wrong because a wide column store with no schema lacks the relational integrity and indexing capabilities required for efficient BI joins and aggregations, making it unsuitable for consistent, schema-on-write data warehouse workloads. Option D is wrong because a snowflake schema with normalized dimensions increases the number of join operations across multiple tables, degrading query performance for high-level reports without providing significant storage savings over a star schema in typical BI scenarios.

Full explanation →

323

MCQmedium

A company runs a critical application on Cloud SQL for PostgreSQL. The database engineer needs to ensure that if the primary instance fails, a standby instance in a different region can take over with minimal data loss. Which configuration should the Database Engineer implement?

A.Set up a second Cloud SQL instance and configure application-level dual-writes to both instances.

B.Configure high availability (HA) within the same region using a regional persistent disk.

C.Create a cross-region replica with asynchronous replication and manually promote it during a disaster.

D.Create a cross-region replica with synchronous replication and enable automatic failover.

AnswerC

Cross-region replica with async replication is the standard DR configuration; manual promotion gives control.

Why this answer

Option C is correct because Cloud SQL for PostgreSQL supports cross-region replicas with asynchronous replication, which allows a standby instance in a different region to be promoted manually during a disaster. This minimizes data loss by replicating changes asynchronously, though some transactions may be lost if the primary fails before replication completes. Automatic failover is not supported for cross-region replicas in Cloud SQL, so manual promotion is required.

Exam trap

Google Cloud often tests the misconception that synchronous replication and automatic failover are available for cross-region replicas, but Cloud SQL only supports asynchronous replication for cross-region replicas and requires manual promotion.

How to eliminate wrong answers

Option A is wrong because application-level dual-writes introduce complexity, potential inconsistency, and do not leverage Cloud SQL's built-in replication, making it error-prone and not a standard disaster recovery solution. Option B is wrong because high availability (HA) within the same region using a regional persistent disk only protects against zonal failures, not regional disasters, and does not provide cross-region failover. Option D is wrong because Cloud SQL for PostgreSQL does not support synchronous replication for cross-region replicas, and automatic failover is not available for cross-region replicas; synchronous replication would also introduce unacceptable latency across regions.

Full explanation →

324

Multi-Selectmedium

Which TWO schema design practices help reduce write contention in Cloud Spanner?

Select 2 answers

A.Use a hash prefix in the primary key to distribute writes across splits.

B.Use a timestamp prefix in the primary key to sort by time.

C.Use interleaved tables to keep related rows together.

D.Design the schema so that hot rows are split into multiple rows with different keys.

E.Decrease the number of splits by using a less granular primary key.

AnswersA, D

Hashing prevents sequential writes from hitting the same split.

Why this answer

Options A and D are correct. Option A: Using a monotonically increasing primary key causes hotspotting; instead, use a hash prefix. Option D: Splitting hot rows into multiple rows with different keys spreads writes.

Option B is wrong because interleaving can increase contention if parent is hot. Option C is wrong because decreasing splits reduces parallelism. Option E is wrong because timestamp prefix causes hotspotting.

Full explanation →

325

MCQhard

A Bigtable instance stores time-series data with row keys formatted as `#deviceID#timestamp`. The application often queries recent data for a specific device. Monitoring shows high read latency when scanning multiple devices. The row key design is causing hotspotting. What is the best redesign?

A.Use separate tables per device.

B.Use a hash prefix before deviceID.

C.Store timestamps in column qualifiers.

D.Prefix the row key with the deviceID and reverse the timestamp.

AnswerB

Hashing distributes rows evenly across the keyspace, alleviating hotspotting while preserving locality for device queries.

Why this answer

Option B is correct because prepending a hash prefix (e.g., a cryptographic hash of the deviceID) to the row key distributes writes and reads evenly across Bigtable tablets, eliminating hotspotting caused by sequential deviceID-based keys. This ensures that queries for recent data (which would otherwise concentrate on a single tablet) are spread across multiple nodes, reducing read latency.

Exam trap

Google Cloud often tests the misconception that reversing the timestamp (option D) solves hotspotting, but candidates fail to realize that the leading key component (deviceID) is still sequential, so all recent data for a device remains on the same tablet, and only a hash prefix (option B) truly distributes the load.

How to eliminate wrong answers

Option A is wrong because using separate tables per device does not solve hotspotting; it merely shifts the problem to table-level contention and increases operational overhead, as Bigtable is optimized for a single wide table, not many small tables. Option C is wrong because storing timestamps in column qualifiers does not address row key hotspotting; it only changes the schema structure without distributing the load, and queries still scan the same hot row key range. Option D is wrong because prefixing with deviceID and reversing the timestamp still results in sequential deviceID-based keys, which cause hotspotting; reversing the timestamp only helps if the deviceID is already distributed, but here the deviceID is the leading component, so all recent data for a device still falls on the same tablet.

Full explanation →

326

Multi-Selecteasy

Which TWO of the following are best practices when designing data structures for business intelligence in BigQuery?

Select 2 answers

A.Partition tables on a column that aligns with common filter criteria

B.Store raw logs directly in fact tables without any aggregation

C.Use NULLable columns extensively to save storage

D.Use a single wide table for all data to simplify schema

E.Denormalize dimension attributes into fact tables to reduce joins

AnswersA, E

Partitioning limits scanned partitions.

Why this answer

Partitioning tables on a column that aligns with common filter criteria (e.g., a date or timestamp column) allows BigQuery to prune partitions during query execution, drastically reducing the amount of data scanned and improving query performance and cost efficiency. This is a core best practice for optimizing BI workloads in BigQuery.

Exam trap

Google Cloud often tests the misconception that denormalization is always bad, but in BigQuery for BI, denormalizing dimension attributes into fact tables is a recognized best practice to reduce JOIN overhead and improve query performance.

Full explanation →

327

MCQhard

A slow query log entry shows the above for a Cloud SQL for MySQL instance. Which index would most improve performance?

A.Index on products(product_id)

B.Index on orders(order_date)

C.Composite index on orders(product_id, order_date)

D.Composite index on orders(order_date, product_id)

AnswerC

This index allows the join to quickly find matching product_ids and then apply the date range, reducing the number of rows examined.

Why this answer

The query likely filters or joins on product_id and then sorts or filters by order_date, so a composite index on orders(product_id, order_date) allows the database to satisfy both conditions with a single index scan, avoiding a filesort or extra lookups. In MySQL, a composite index with the most selective column first (product_id) followed by the range/order column (order_date) is optimal for queries that filter on product_id and then order or filter by order_date.

Exam trap

Google Cloud often tests the leftmost prefix rule and the importance of column order in composite indexes, trapping candidates who think any composite index covering both columns is equally effective regardless of column order.

How to eliminate wrong answers

Option A is wrong because indexing only product_id on the products table does not help with filtering or ordering on the orders table's order_date column, and the query likely involves the orders table. Option B is wrong because indexing only order_date on orders does not help with filtering on product_id, which is typically the more selective filter. Option D is wrong because a composite index on orders(order_date, product_id) would be less efficient if the query filters on product_id first, as MySQL cannot use the second column of the index when the first column is not used in a equality condition, leading to a full index scan or extra sorting.

Full explanation →

328

MCQhard

A company uses Cloud Memorystore for Redis as a cache for their web application. They want to ensure that cache data survives a failover event with minimal data loss. The current instance has a standard tier (with replication) and persistence disabled. What change should they make?

A.Switch to the basic tier without replication but with high memory.

B.Enable persistence (AOF) on the instance.

C.Increase the instance memory size to hold more data.

D.Add a read replica to the instance.

AnswerB

Persistence ensures data is written to disk and can be recovered after failover.

Why this answer

Enabling AOF (Append-Only File) persistence on a Cloud Memorystore for Redis standard tier instance ensures that write operations are durably logged to disk. In the event of a failover, the promoted replica can replay the AOF to recover the most recent writes, minimizing data loss beyond what the default in-memory replication provides.

Exam trap

The trap here is that candidates assume replication alone guarantees data durability, but replication only copies data in memory and does not protect against loss of uncommitted writes during a failover without disk-based persistence enabled.

How to eliminate wrong answers

Option A is wrong because switching to the basic tier removes replication entirely, which increases the risk of data loss during any failure and does not address persistence. Option C is wrong because increasing memory size only allows more data to be cached in RAM, but does not make that data durable across a failover event. Option D is wrong because adding a read replica does not enable persistence; replicas in standard tier already exist for high availability, but without AOF they still lose data on failover if persistence is disabled.

Full explanation →

329

MCQeasy

Refer to the exhibit. You are analyzing a slow query in Cloud SQL for PostgreSQL. The execution plan shows a sequential scan. Which index should you create to most effectively improve query performance?

A.CREATE INDEX idx_orders_partial ON orders(created_at) WHERE user_id = 123;

B.CREATE INDEX idx_orders_created_at ON orders(created_at);

C.CREATE INDEX idx_orders_created_user ON orders(created_at, user_id);

D.CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);

AnswerD

Allows index seek on user_id then range scan on created_at.

Why this answer

Option D is correct because the query likely filters on `user_id` and then sorts or filters by `created_at`. A composite index on `(user_id, created_at)` allows PostgreSQL to first narrow down by `user_id` using index seek, then efficiently access rows in `created_at` order, avoiding a sequential scan. This matches the most common pattern for slow queries involving equality on `user_id` and range or ordering on `created_at`.

Exam trap

Google Cloud often tests the misconception that any composite index with the right columns will work, but the column order matters critically — candidates pick `(created_at, user_id)` thinking it covers both, not realizing the leading column must match the equality filter for optimal performance.

How to eliminate wrong answers

Option A is wrong because a partial index with a hardcoded `user_id = 123` only benefits queries for that specific user, not the general slow query; it also ignores the `created_at` column needed for ordering or filtering. Option B is wrong because an index on `created_at` alone does not help if the query filters on `user_id` first — PostgreSQL may still perform a sequential scan or need to filter many rows. Option C is wrong because the column order `(created_at, user_id)` is suboptimal: if the query filters on `user_id` (equality) and then orders by `created_at`, the leading column should be `user_id` to allow index seek; leading with `created_at` forces a full index scan or inefficient filtering.

Full explanation →

330

MCQeasy

A team executed the above DDL to create interleaved tables in Cloud Spanner. They need to query all orders for a specific customer. Which query will be most efficient?

A.SELECT * FROM Orders WHERE CustomerId = 1234 AND OrderDate = '2023-01-01';

B.SELECT * FROM Customers JOIN Orders ON Customers.CustomerId = Orders.CustomerId WHERE Customers.CustomerId = 1234;

C.SELECT * FROM Orders WHERE CustomerId = 1234;

D.SELECT * FROM Orders WHERE OrderId = 5678;

AnswerC

Interleaving colocates all orders for a customer, making this query very efficient.

Why this answer

Since Orders are interleaved under Customers on CustomerId, filtering by CustomerId (A) allows Spanner to directly access the colocated rows. Option B filters only by OrderId, which may require a full scan. Option C adds an extra condition but still benefits from CustomerId.

Option D uses a join, which is unnecessary because interleaving already provides the relationship.

Full explanation →

331

Multi-Selectmedium

You are troubleshooting a slow-performing query on Cloud Spanner. The query scans a large table with a secondary index. Which TWO metrics from the Query Insights dashboard would most directly indicate the source of the performance issue?

Select 2 answers

A.CPU time

B.Rows scanned

C.Storage utilization

D.Commit latency

E.Lock wait time

AnswersA, B

High CPU time indicates the query is computationally expensive.

Why this answer

CPU time (A) is correct because high CPU usage indicates that the query is performing expensive operations like sorting, joining, or complex filtering, which can slow performance even if the index is used. Rows scanned (B) is correct because scanning a large number of rows, even with a secondary index, suggests the index is not selective enough or the query is retrieving many rows, leading to excessive I/O and latency. Both metrics directly point to query execution inefficiency.

Exam trap

Google Cloud often tests the distinction between metrics that indicate query execution inefficiency (CPU time, rows scanned) versus metrics related to storage or write contention, leading candidates to mistakenly select storage utilization or lock wait time for a read-only query performance issue.

Full explanation →

332

MCQmedium

The user runs a BigQuery query on a non-partitioned table and receives the error shown. Which optimization should be applied first to resolve the issue?

A.Partition the table by the event_date column

B.Increase the BigQuery reservation slot count

C.Create a materialized view that pre-aggregates the data

D.Cluster the table by event_date

AnswerA

Partitioning limits scans to relevant date ranges, reducing resource consumption.

Why this answer

The error indicates that the query is scanning too much data, likely exceeding the free tier or slot quota. Partitioning the non-partitioned table by `event_date` allows BigQuery to perform partition pruning, scanning only the relevant date range instead of the entire table. This directly reduces the data processed, which is the most effective first optimization for cost and performance.

Exam trap

Google Cloud often tests the distinction between partitioning (which prunes entire storage shards) and clustering (which only sorts within shards), leading candidates to mistakenly choose clustering as a solution for reducing data scanned when partitioning is required first.

How to eliminate wrong answers

Option B is wrong because increasing the reservation slot count only adds compute resources but does not reduce the amount of data scanned; the query would still fail if the issue is data volume limits. Option C is wrong because creating a materialized view pre-aggregates data but still requires scanning the base table unless the view is used with query rewriting, and it does not address the root cause of scanning too much raw data. Option D is wrong because clustering by `event_date` improves query performance by reducing the data read for range-based filters, but it does not enable partition pruning; clustering only sorts data within partitions, and without partitioning, the entire table is still scanned.

Full explanation →

333

MCQhard

Refer to the exhibit. A company creates these Cloud Spanner tables. What happens when a customer record is deleted?

A.The deletion fails if there are orders.

B.The orders are deleted only if the order date is older than 30 days.

C.All orders for that customer are automatically deleted.

D.The orders remain orphaned.

AnswerC

Cascade delete removes all child rows associated with the deleted parent row.

Why this answer

Option C is correct because Cloud Spanner enforces referential integrity through interleaved tables. When a parent row in the Customers table is deleted, all child rows in the Orders table that are interleaved under that customer are automatically deleted via a cascading delete. This behavior is inherent to the interleaved table structure, not an explicit ON DELETE CASCADE clause.

Exam trap

The trap here is that candidates may assume Cloud Spanner behaves like traditional relational databases (e.g., requiring explicit ON DELETE CASCADE or failing on foreign key violations), but interleaved tables automatically cascade deletes without any additional syntax.

How to eliminate wrong answers

Option A is wrong because Cloud Spanner interleaved tables automatically delete child rows, so the deletion does not fail even if orders exist. Option B is wrong because there is no time-based condition in the table schema; deletion of orders is unconditional and not filtered by order date. Option D is wrong because orphaned rows cannot occur in interleaved tables; the parent-child relationship ensures child rows are removed when the parent is deleted.

Full explanation →

334

MCQhard

A company has a BigQuery dataset with many views. They need to ensure that only the latest 30 days of data is used in BI reports for performance. The source table is partitioned by ingestion_time. Which approach reduces query cost and improves performance?

A.Use BigQuery BI Engine to cache results

B.Create a view with WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)

C.Create a materialized view with the date filter

D.Use a scheduled query to copy the last 30 days to a separate table

AnswerC

Materialized views precompute and store the filtered results, reducing query cost and improving performance through incremental updates.

Why this answer

Option C is correct because a materialized view precomputes and stores the filtered result set, allowing BigQuery to serve BI queries directly from the materialized view's storage without scanning the entire source table. This eliminates the need to re-process the full table on every query, significantly reducing query cost and improving performance for the 30-day sliding window.

Exam trap

Google Cloud often tests the distinction between standard views (which are just saved queries) and materialized views (which store results), leading candidates to incorrectly choose a standard view with a WHERE clause, thinking it will reduce cost, when in fact it does not reduce data scanned.

How to eliminate wrong answers

Option A is wrong because BI Engine caches query results in memory, but it does not reduce the amount of data scanned on the first query or when the cache is invalidated; the source table must still be fully scanned initially, and the 30-day filter is not automatically applied. Option B is wrong because a standard view with a WHERE clause on _PARTITIONTIME does not precompute or store results; each query against the view still scans all partitions that match the filter, and BigQuery must evaluate the filter on every execution, which does not reduce cost or improve performance compared to querying the table directly. Option D is wrong because a scheduled query that copies the last 30 days to a separate table introduces data duplication, additional storage costs, and maintenance overhead (e.g., scheduling, cleanup of old data), and it does not provide the automatic, real-time sliding window that a materialized view offers.

Full explanation →

335

MCQhard

You are building a globally distributed leaderboard application that requires strongly consistent reads with latency under 10 ms and high write throughput. Which Google Cloud database service is most suitable?

A.Cloud Bigtable

B.Cloud Spanner

C.Memorystore

D.Firestore

AnswerB

Spanner offers strong consistency across regions, low latency, and high write throughput.

Why this answer

Cloud Spanner is the correct choice because it provides strongly consistent reads across globally distributed regions with latency under 10 ms, while also supporting high write throughput. It uses TrueTime and synchronous replication to ensure ACID transactions and global consistency, meeting the exact requirements of a globally distributed leaderboard application.

Exam trap

The trap here is that candidates often confuse 'low latency' with 'strong consistency' and choose Cloud Bigtable or Memorystore for their speed, overlooking the critical requirement for globally consistent reads that only Spanner can provide.

How to eliminate wrong answers

Option A is wrong because Cloud Bigtable is a NoSQL wide-column database designed for high throughput and low latency, but it offers only eventual consistency, not strongly consistent reads. Option C is wrong because Memorystore is an in-memory cache (Redis/Memcached) that provides low latency but lacks native global distribution and strong consistency across regions; it is typically used as a caching layer, not a primary globally consistent database. Option D is wrong because Firestore provides strong consistency within a single region but offers eventual consistency for multi-region deployments, and its write throughput is limited compared to Spanner, making it unsuitable for a globally distributed leaderboard with high write throughput.

Full explanation →

336

MCQeasy

Refer to the exhibit. You are reviewing a Firestore security rules file. What is the main security flaw in the database schema design that these rules expose?

A.The rules do not protect against brute force attacks

B.The senderId field is not indexed

C.The delete rule allows admin to delete any message

D.Users can set the visibility field, allowing them to make messages public

AnswerD

The create rule does not restrict the visibility value, so users can bypass intended privacy.

Why this answer

Option B is correct: the schema includes a 'visibility' field that users can set when creating documents. Since the create rule only checks that the senderId matches the authenticated user, users can set visibility to 'public' for any message they create, potentially exposing private messages. Option A is not a security flaw.

Option C is not a flaw. Option D is vague and not specifically about the schema.

Full explanation →

337

MCQmedium

A financial company runs BI queries on a BigQuery table that is partitioned by ingestion time. The table is 1 TB and receives streaming inserts every minute. Analysts query the last 24 hours of data. The queries are slow. The table is clustered by transaction_id. What is the likely cause?

A.Streaming buffer causes delays.

B.Queries use SELECT *.

C.Partition expiration is set too short.

D.The cluster column is not used in queries.

AnswerD

Without a filter on transaction_id, clustering provides no benefit; data within partitions is unordered.

Why this answer

Option A is correct because queries likely filter on the time range (last 24 hours) but not on transaction_id, so the clustering key is not used; clustering only helps when the cluster column is in the filter. Option B (streaming buffer) is not a cause because data is committed quickly. Option C (partition expiration) is not mentioned.

Option D (SELECT *) may affect but not primary cause.

Full explanation →

338

MCQhard

A developer reports that their application cannot connect to a Cloud SQL instance using private IP, but public IP works. The Cloud SQL instance is in VPC peering with the application's VPC. The application is in the same region. What is the most likely cause?

A.The VPC peering connection is not established.

B.The private IP range of the Cloud SQL instance conflicts with the application's VPC.

C.The Cloud SQL proxy is not running.

D.The Cloud SQL instance has 'require SSL' enabled.

AnswerB

IP overlap in peered VPCs causes routing issues, preventing private IP connectivity while public IP remains unaffected.

Why this answer

Option B is correct because when a Cloud SQL instance is configured with a private IP address that overlaps with the application's VPC CIDR range, the VPC peering connection cannot route traffic correctly. This is due to the fact that VPC peering requires non-overlapping IP ranges to establish proper routing tables; overlapping ranges cause route conflicts and connectivity failures. Since public IP works, the issue is isolated to private IP routing, making IP range conflict the most likely cause.

Exam trap

Google Cloud often tests the misconception that VPC peering automatically handles overlapping IP ranges, when in fact overlapping ranges cause routing failures that prevent private IP connectivity even if the peering connection itself is established.

How to eliminate wrong answers

Option A is wrong because if the VPC peering connection were not established, public IP would also fail (since the application would be in a different network), and the question states public IP works. Option C is wrong because the Cloud SQL proxy is a tool for connecting to Cloud SQL via public IP or IAM authentication, but it is not required for private IP connectivity; the application can connect directly to the private IP without the proxy. Option D is wrong because requiring SSL affects encryption of the connection, not the ability to establish a TCP connection; if SSL were required, the connection would fail with an SSL error, not a complete inability to connect via private IP.

Full explanation →

339

Multi-Selectmedium

A database engineer is designing a schema for a Cloud Spanner database. Which three practices should they follow to ensure good performance? (Choose three.)

Select 3 answers

A.Use split points to distribute data across nodes.

B.Use locking read (SELECT ... FOR UPDATE) for all transactional reads.

C.Design primary keys to avoid monotonically increasing values near the beginning of the key.

D.Use interleaved tables for parent-child relationships to colocate data.

E.Create secondary indexes on every column to speed up queries.

AnswersA, C, D

Explicit splits help avoid hot spots.

Why this answer

Option A is correct because explicitly defining split points in Cloud Spanner allows you to control how data is distributed across nodes, which can prevent hot spots and improve read/write throughput. By specifying split boundaries, you ensure that frequently accessed data is spread evenly, avoiding performance bottlenecks.

Exam trap

Google Cloud often tests the misconception that all transactional reads require locking to ensure consistency, but Cloud Spanner's snapshot isolation provides serializable reads without locks, making SELECT ... FOR UPDATE an anti-pattern for most workloads.

Full explanation →

340

MCQhard

Refer to the exhibit. A database administrator notices that the Spanner instance has only 3 nodes, but the application experiences high read latency during peak hours. The team needs to improve performance without over-provisioning. What should they do?

A.Increase node count to 6

B.Use an interleaved table

C.Enable point-in-time recovery

D.Change to a multi-region configuration

E.Create a secondary index

AnswerE

Secondary indexes enable faster lookups and avoid full table scans, improving read latency.

Why this answer

Option C is correct because creating a secondary index can reduce full table scans, lowering read latency without requiring additional nodes. Option A (increase nodes) is over-provisioning if current CPU is low. Option B (interleaved table) may improve join performance but not general reads.

Option D (PITR) increases storage cost without improving latency. Option E (multi-region) adds write latency and cost.

Full explanation →

341

Multi-Selecthard

Which THREE are valid considerations when designing BigQuery tables for BI reporting?

Select 3 answers

A.Use nested and repeated fields to avoid JOINs

B.Create indexes on frequently queried columns

C.Use partitioning on date columns to reduce query cost

D.Cluster tables on high-cardinality columns used in filters

E.Denormalize dimension tables into fact tables for common queries

AnswersC, D, E

Partitioning is a key cost-control feature.

Why this answer

Option C is correct because partitioning BigQuery tables by date columns (e.g., using _PARTITIONTIME or a DATE/TIMESTAMP column) allows the query engine to prune entire partitions during query execution. This significantly reduces the amount of data scanned, directly lowering query costs (since BigQuery charges per byte processed) and improving performance for time-range filters.

Exam trap

Google Cloud often tests the misconception that traditional relational database features like indexes apply to BigQuery, but BigQuery's architecture relies on partitioning and clustering instead of indexes for query optimization.

Full explanation →

342

Multi-Selecthard

Which THREE considerations are critical when migrating from Cassandra to Cloud Bigtable?

Select 3 answers

A.Using CQL for queries

B.Denormalizing data to avoid joins

C.Maintaining eventual consistency model

D.Row key design for even distribution

E.Using secondary indexes for efficient filtering

AnswersB, C, D

Bigtable is a wide-column store and does not support joins; data must be denormalized.

Why this answer

Option B is correct because Cloud Bigtable is a NoSQL wide-column database that does not support joins. Denormalization is a standard practice in Bigtable to model relational data into a single table, ensuring efficient single-row lookups and avoiding the performance penalty of multi-table queries that would require application-level joins.

Exam trap

Google Cloud often tests the misconception that CQL is a universal NoSQL query language, but in reality it is proprietary to Cassandra and not compatible with Bigtable's API.

Full explanation →

343

MCQmedium

A financial services company runs a critical application on Cloud SQL for PostgreSQL. They require point-in-time recovery (PITR) with the ability to recover to any second within the past 7 days. However, their current backup configuration only allows recovery to the previous 7 days, but not within seconds. What should they do to enable PITR?

A.Enable point-in-time recovery and set the transaction log retention to 7 days.

B.Use the Cloud SQL query insight feature to replay queries.

C.Enable binary logging and set the binary log retention period to 7 days.

D.Increase the number of automated backups to 7 per day.

AnswerA

PITR in Cloud SQL for PostgreSQL uses transaction logs (WAL) retained for the specified period.

Why this answer

Option A is correct because enabling point-in-time recovery (PITR) on Cloud SQL for PostgreSQL automatically uses write-ahead log (WAL) archiving to allow recovery to any second within a specified retention period. Setting the transaction log retention to 7 days ensures that the archived WAL segments are kept for exactly 7 days, enabling recovery to any point within that window. This directly satisfies the requirement for second-granularity recovery over the past 7 days.

Exam trap

The trap here is that candidates confuse the number of automated backups (full backups) with the retention of transaction logs required for PITR, leading them to select Option D, or they mistakenly apply MySQL binary logging concepts (Option C) to a PostgreSQL environment.

How to eliminate wrong answers

Option B is wrong because Cloud SQL Query Insights is a performance monitoring and diagnostic feature that captures query metrics and execution plans; it does not replay queries or provide any recovery capability. Option C is wrong because binary logging is a MySQL-specific feature; Cloud SQL for PostgreSQL uses WAL (write-ahead logging) for PITR, not binary logs, and there is no 'binary log retention period' setting for PostgreSQL. Option D is wrong because increasing the number of automated backups (e.g., to 7 per day) only increases the frequency of full backups, not the retention of transaction logs; PITR requires transaction log retention, not more full backups.

Full explanation →

344

Matchingmedium

Match each Cloud SQL backup type to its retention policy.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Configurable retention up to 365 days

Retained until manually deleted

Retained for point-in-time recovery window

Stored in a different region for disaster recovery

Defined start time and window for automated backups

Why these pairings

These backup types and policies ensure data protection and recovery options.

Full explanation →

345

MCQhard

A data analyst runs a query that joins two large tables on a high-cardinality column with many NULL values. Which action is most likely to resolve the error?

A.Use a DISTINCT clause on the join key.

B.Increase the query timeout setting.

C.Add a WHERE clause to filter out NULLs from the join key.

D.Use a UNION ALL to combine tables.

AnswerC

Filtering NULLs reduces row count and shuffle.

Why this answer

Option C is correct because filtering out NULLs from the join key with a WHERE clause prevents the database from attempting to match NULL values, which cannot be equated in a standard SQL join (since NULL != NULL). This reduces the cardinality of the join operation and avoids potential performance degradation or errors caused by the large number of NULLs being processed in a high-cardinality column.

Exam trap

The trap here is that candidates may think increasing the timeout (Option B) is a universal fix for any query error, when in reality the error is often due to resource exhaustion from NULL handling, not insufficient execution time.

How to eliminate wrong answers

Option A is wrong because using DISTINCT on the join key does not resolve the issue of NULLs in the join; it only removes duplicate non-NULL values from the result set, which does not address the underlying problem of NULL mismatches or performance. Option B is wrong because increasing the query timeout setting only allows the query to run longer without failing, but does not fix the root cause of the error (e.g., excessive memory or disk usage from NULL handling). Option D is wrong because UNION ALL combines results from two queries vertically, not horizontally; it does not perform a join and therefore cannot resolve errors related to joining on a high-cardinality column with NULLs.

Full explanation →

346

MCQhard

A company has a BigQuery table with a TIMESTAMP column and wants to query data for a specific date range efficiently. Which WHERE clause ensures partition pruning if the table is partitioned by that TIMESTAMP column?

A.WHERE timestamp_col BETWEEN TIMESTAMP('2023-01-01') AND TIMESTAMP('2023-01-31')

B.WHERE TIMESTAMP_TRUNC(timestamp_col, DAY) BETWEEN '2023-01-01' AND '2023-01-31'

C.WHERE timestamp_col >= '2023-01-01' AND timestamp_col < '2023-02-01'

D.WHERE DATE(timestamp_col) BETWEEN '2023-01-01' AND '2023-01-31'

AnswerA

Direct comparison on the partition column allows BigQuery to prune partitions based on the range.

Why this answer

Option A is correct because it directly references the TIMESTAMP column without wrapping it in a function, allowing BigQuery's partition pruning to eliminate irrelevant partitions. When a table is partitioned by a TIMESTAMP column, the query engine can compare the partition boundaries directly against the literal TIMESTAMP values in the WHERE clause, scanning only the partitions that fall within the specified range.

Exam trap

Google Cloud often tests the misconception that any filter on a partitioned column will trigger pruning, but the trap here is that wrapping the partition column in a function (like DATE, TIMESTAMP_TRUNC, or implicit casts) disables pruning, so only a bare column reference with compatible literal types guarantees efficient partition elimination.

How to eliminate wrong answers

Option B is wrong because TIMESTAMP_TRUNC(timestamp_col, DAY) is a function applied to the partition column, which prevents partition pruning; BigQuery must evaluate the function for every row, scanning all partitions. Option C is wrong because comparing a TIMESTAMP column to a string literal ('2023-01-01') forces an implicit type conversion, which can disable partition pruning and may lead to incorrect results due to timezone or format assumptions. Option D is wrong because DATE(timestamp_col) is a function that extracts the date portion, and like other functions on the partition column, it disables partition pruning, causing a full table scan.

Full explanation →

347

MCQhard

Your Cloud SQL for PostgreSQL instance is experiencing high CPU utilization during peak hours. You notice that the query `SELECT * FROM orders WHERE order_date >= '2024-01-01'` is frequently run against a table with 10 million rows. The table has a B-tree index on `order_date`. What is the most likely cause of the high CPU usage, and how should you address it?

A.The index is fragmented; rebuild the index to improve performance.

B.Increase the instance machine type to provide more CPU capacity.

C.The query retrieves all columns, causing significant heap lookup overhead; rewrite the query to select only required columns.

D.The index on `order_date` is not being used; add a hint to force index usage.

AnswerC

Selecting only needed columns reduces I/O and CPU, as fewer heap lookups are required.

Why this answer

Option C is correct because the query uses `SELECT *`, which forces PostgreSQL to fetch all columns from the heap (the main table storage) even though the index on `order_date` can efficiently locate the matching rows. This results in significant heap lookup (also known as bitmap heap scan or index scan with tuple retrieval) overhead, consuming CPU cycles for each row fetched. Reducing the selected columns to only those needed minimizes I/O and CPU usage by avoiding unnecessary data retrieval from the heap.

Exam trap

Google Cloud often tests the misconception that high CPU is always due to missing or unused indexes, but here the index is used and the real problem is the overhead of fetching all columns from the heap, which candidates overlook when they focus solely on index usage.

How to eliminate wrong answers

Option A is wrong because B-tree index fragmentation in PostgreSQL is typically not a primary cause of high CPU usage; while index bloat can affect performance, the main issue here is the `SELECT *` causing excessive heap lookups, not index fragmentation. Option B is wrong because increasing the instance machine type treats the symptom (high CPU) rather than the root cause (inefficient query design), and it incurs unnecessary cost without addressing the underlying query pattern. Option D is wrong because the index on `order_date` is very likely being used (PostgreSQL's planner will use it for a range scan on a large table), but the high CPU stems from the heap lookups for all columns, not from the index being ignored; forcing index usage would not reduce the heap access overhead.

Full explanation →

348

Multi-Selectmedium

A team is designing a schema for a user activity logging system using Bigtable. Each log entry includes a user ID, activity type, timestamp, and details. The access pattern requires retrieving all activities for a specific user within a time range. Which TWO row key designs are suitable? (Choose TWO.)

Select 2 answers

A.timestamp#user_id

B.random_uuid

C.reverse_timestamp

D.user_id#activity_type#timestamp

E.user_id#timestamp

AnswersD, E

Allows filtering by activity type within a user.

Why this answer

Option D (user_id#activity_type#timestamp) is correct because it groups all activities for a user under a single row key prefix, enabling efficient row range scans for a specific user. The activity_type suffix allows filtering by activity type if needed, while the timestamp ensures uniqueness and ordered storage. Option E (user_id#timestamp) is correct because it directly supports the access pattern of retrieving all activities for a user within a time range by scanning rows with the user_id prefix and filtering on the timestamp component.

Exam trap

Google Cloud often tests the misconception that a timestamp-first key is optimal for time-range queries, but the actual requirement is user-specific retrieval, which demands a user-first key design to avoid full-table scans.

Full explanation →

349

MCQhard

A company uses BigQuery for BI reporting with a star schema. The fact table 'sales' is partitioned by date and clustered by 'product_id'. The dimensions 'product' and 'customer' are updated nightly via merge statements. Recently, a report that joins 'sales' with 'product' on 'product_id' and filters on sale_date for the last 7 days started timing out. The query plan shows a 'SCAN' of the entire 'product' table. Which optimization should be applied to improve performance?

A.Partition the 'product' table by 'product_id'

B.Partition the 'sales' table by 'product_id' instead of date

C.Remove clustering from the 'sales' table

D.Cluster the 'product' table on 'product_id'

AnswerD

Clustering on product_id improves join performance by collocating rows with the same product_id, reducing data scanned.

Why this answer

Option D is correct because clustering the 'product' table on 'product_id' physically co-locates rows with the same product_id into the same blocks, drastically reducing the amount of data scanned when the report joins on that column. The query plan's full SCAN of the 'product' table indicates that BigQuery must read every row, even though only a subset of products are referenced by the last 7 days of sales. Clustering on product_id enables block-level pruning, so only the relevant blocks are read, eliminating the full table scan.

Exam trap

Google Cloud often tests the misconception that partitioning is the universal solution for all performance issues, but here the problem is a full scan of the dimension table during a join, which clustering on the join key solves without the limitations and overhead of partitioning.

How to eliminate wrong answers

Option A is wrong because partitioning the 'product' table by 'product_id' is not supported in BigQuery — partitioning requires a date, timestamp, or integer range column, not an arbitrary ID, and it would create an excessive number of partitions, degrading performance. Option B is wrong because partitioning the 'sales' table by 'product_id' instead of date would break the existing date-based pruning for the last-7-days filter, likely increasing the scan size and defeating the purpose of the optimization. Option C is wrong because removing clustering from the 'sales' table would worsen performance by eliminating the existing block-level pruning on product_id, making the join even slower.

Full explanation →

350

Multi-Selecthard

A financial services company uses BigQuery for BI reporting. They need to design a data model that ensures data consistency and avoids duplicate records in the fact table. Which three practices should they follow? (Choose three.)

Select 3 answers

A.Use the OVERWRITE partition option for incremental loads.

B.Apply a unique constraint on the fact table.

C.Use a daily load job that replaces the entire table with WRITE_TRUNCATE.

D.Implement a staging table with a unique identifier and use INSERT ... SELECT DISTINCT.

E.Use DML statements with MERGE to upsert data.

AnswersA, D, E

Overwriting specific partitions avoids duplicates within those partitions.

Why this answer

Option A is correct because using the OVERWRITE partition option for incremental loads ensures that only the specific partition being loaded is replaced, preventing duplicate records within that partition while preserving data in other partitions. This approach maintains data consistency by avoiding full table overwrites and is efficient for incremental updates in BigQuery.

Exam trap

The trap here is that candidates often assume BigQuery supports traditional database constraints like unique constraints (Option B) or that full table overwrites (Option C) are acceptable for incremental loads, when in fact BigQuery's architecture requires partition-level or DML-based deduplication strategies.

Full explanation →

351

Multi-Selecteasy

Which TWO actions improve query performance and reduce cost in BigQuery for BI workloads?

Select 2 answers

A.Cluster tables on columns used in GROUP BY

B.Partition tables on columns frequently used in WHERE clauses

C.Load data using batch loads instead of streaming

D.Store data in CSV format

E.Use SELECT * in all queries

AnswersA, B

Clustering improves aggregation performance.

Why this answer

Clustering tables on columns used in GROUP BY improves query performance by physically co-locating rows with similar values, reducing the amount of data scanned during aggregation. Partitioning on columns frequently used in WHERE clauses allows BigQuery to prune entire partitions from the scan, directly reducing both cost (bytes billed) and query execution time. These two optimizations are specifically recommended for BI workloads where repeated, selective queries are common.

Exam trap

Google Cloud often tests the misconception that any data loading method (batch vs. streaming) or any file format (CSV) directly improves query performance, when in fact only storage and query-time optimizations like partitioning and clustering reduce bytes scanned.

Full explanation →

352

MCQmedium

An e-commerce platform uses Cloud Bigtable for real-time analytics on customer behavior. The table uses a row key of 'customer_id#timestamp' (customer ID followed by reverse timestamp). Queries for a specific customer's recent events are fast, but queries that filter by event type (e.g., 'purchase') across many customers are slow. What schema change can improve query performance for event-type filtering?

A.Create a separate column family for each event type.

B.Add a secondary index on the event_type column.

C.Use a separate Bigtable instance for each event type.

D.Change the row key to 'event_type#customer_id#timestamp'.

AnswerD

This allows efficient range scans for a specific event type across all customers.

Why this answer

Option A is correct because by making the row key start with the event type, scans can efficiently filter by event type. Option B is incorrect because Bigtable does not support secondary indexes natively (you can use row key design or column families, but not indexes like relational databases). Option C (adding a column family) does not help with filtering.

Option D (using a different database) is an architecture change.

Full explanation →

353

MCQhard

You are optimizing a Cloud SQL for MySQL instance for an OLTP application. You observe frequent buffer pool contention and high disk reads per second. The instance has 16 vCPUs and 120 GB memory. What is the most effective initial tuning action?

A.Increase the max_connections parameter to handle more concurrent sessions.

B.Reduce max_heap_table_size to avoid in-memory temp tables spilling to disk.

C.Increase innodb_buffer_pool_size to 80% of available memory.

D.Enable query cache to cache SELECT results.

AnswerC

This allows more data in memory, reducing disk I/O.

Why this answer

The correct answer is C because buffer pool contention and high disk reads per second are classic symptoms of an undersized InnoDB buffer pool. For a dedicated Cloud SQL for MySQL OLTP instance with 120 GB memory, increasing innodb_buffer_pool_size to 80% (96 GB) maximizes cached data and index pages in memory, reducing physical disk I/O and contention. This directly addresses the root cause—insufficient memory for the working set—without introducing side effects like connection overhead or query cache invalidation.

Exam trap

Google Cloud often tests the misconception that increasing max_connections or enabling the query cache is a universal performance fix, but the trap here is that buffer pool contention and high disk reads specifically point to an undersized innodb_buffer_pool_size, not to connection limits or caching of SELECT results.

How to eliminate wrong answers

Option A is wrong because increasing max_connections does not reduce buffer pool contention or disk reads; it may actually worsen contention by allowing more concurrent sessions to compete for the same limited buffer pool resources. Option B is wrong because reducing max_heap_table_size would force more temporary tables to disk, increasing disk reads per second, which is the opposite of the desired effect. Option D is wrong because the query cache is deprecated and removed in MySQL 8.0; even in earlier versions, it introduces mutex contention and is ineffective for write-heavy OLTP workloads, failing to address buffer pool contention or disk I/O.

Full explanation →

354

MCQhard

A large e-commerce platform uses BigQuery for business intelligence. They have a fact table `orders` (10 TB, partitioned by order_date, clustered by customer_id) and a dimension table `customers` (2 TB, not partitioned, not clustered). The BI team runs a daily dashboard query that joins these tables on customer_id and filters on order_date = CURRENT_DATE() and customer_country = 'US'. The query currently scans the full `customers` table and 2 GB of the `orders` table, taking 30 seconds. The business wants to reduce cost and latency. The `customers` table has 500 million rows and is updated incrementally every hour. Which action will most effectively reduce the amount of data scanned and query time?

A.Cluster the `customers` table on customer_id.

B.Denormalize customer country and other attributes into the `orders` table.

C.Create a materialized view that joins `orders` and `customers` on customer_id.

D.Partition the `customers` table by customer_id.

AnswerA

Clustering by customer_id enables block-level pruning during the join, drastically reducing data scanned.

Why this answer

Clustering the `customers` table on `customer_id` will physically co-locate rows with the same `customer_id`, allowing the query to use block-level pruning when joining with the filtered `orders` table. Since the query filters `orders` by `order_date = CURRENT_DATE()` (2 GB scanned) and then joins on `customer_id`, BigQuery can skip reading most of the `customers` table if it is clustered on the join key, drastically reducing the 2 TB full scan and lowering both cost and latency.

Exam trap

Google Cloud often tests the misconception that partitioning is always the best optimization for large tables, but here partitioning by `customer_id` is invalid in BigQuery, and the real performance gain comes from clustering on the join key to enable block-level pruning.

How to eliminate wrong answers

Option B is wrong because denormalizing customer attributes into the `orders` table would increase storage costs and data duplication (10 TB fact table would grow significantly), and while it might avoid the join, it does not address the root cause of scanning the full `customers` table; it also complicates incremental updates. Option C is wrong because a materialized view that joins both tables would need to be refreshed every hour to reflect incremental customer updates, and it would still require scanning the full `customers` table during creation or refresh, not reducing the per-query scan for the current daily filter. Option D is wrong because partitioning the `customers` table by `customer_id` is not supported in BigQuery (partitioning must be on a date/timestamp or integer range column), and even if possible, it would not help since the query does not filter on a partition column for `customers`.

Full explanation →

355

MCQmedium

An e-commerce application uses Cloud Spanner for its global inventory database. The application experiences high write latency during peak hours. After reviewing the schema, the database engineer notices that the primary key is an auto-incrementing integer. What is the most likely cause of the high write latency?

A.The application is using read-only transactions instead of read-write transactions.

B.The database is configured with a backup retention period of 2 seconds.

C.The application is using read replicas that are out of date.

D.The monotonically increasing primary key creates a hotspot.

AnswerD

Spanner distributes data by key range; sequential keys cause all new rows to be written to a single node, creating a hotspot.

Why this answer

Cloud Spanner distributes data across splits based on the primary key range. A monotonically increasing integer primary key, such as an auto-incrementing ID, causes all new writes to target the same split (the highest key range), creating a hotspot. This single split becomes a bottleneck, leading to high write latency during peak hours, as Spanner cannot parallelize the writes across multiple nodes.

Exam trap

Google Cloud often tests the misconception that auto-incrementing keys are always optimal for performance, but in distributed databases like Spanner, they cause hotspots; candidates may incorrectly attribute latency to backup settings or read replicas instead of the key design flaw.

How to eliminate wrong answers

Option A is wrong because read-only transactions do not cause write latency; they are used for reading data and do not impact write performance. Option B is wrong because a backup retention period of 2 seconds is not a valid configuration (minimum retention is typically 1 day) and has no direct effect on write latency. Option C is wrong because read replicas (read-only nodes) are used for scaling reads, not writes; stale replicas do not cause high write latency.

Full explanation →

356

Matchingmedium

Match each Google Cloud tool to its purpose in database management.

Drag a concept onto its matching description — or click a concept then click the description.

Concepts

Matches

Web-based UI for managing resources

Command-line tool for managing Google Cloud services

Browser-based terminal with pre-installed tools

Infrastructure as code for provisioning databases

Observability and alerting for database performance

Why these pairings

These tools are essential for database administration in Google Cloud.

Full explanation →

357

Multi-Selecteasy

Which TWO are effective strategies to control costs when running BI queries on BigQuery? (Choose two.)

Select 2 answers

A.Set a maximum bytes billed limit for user projects.

B.Create materialized copies of tables for each dashboard.

C.Schedule queries to run every minute to keep the cache warm.

D.Enable BI Engine for all tables to speed up queries.

E.Use flat-rate reservations for predictable workloads.

AnswersA, E

It prevents queries from scanning too much data.

Why this answer

Options C and D are correct. Setting a custom cost control (max bytes billed) prevents runaway queries. Using flat-rate reservations provides predictable pricing and can lower costs for steady workloads.

Option A is wrong because scheduled queries add to cost if not optimized. Option B is wrong because enabling BI Engine incurs additional reservation costs. Option E is wrong because copying tables increases storage costs.

Full explanation →

358

MCQmedium

Refer to the exhibit. Which of the following statements is true regarding this schema design?

A.Deleting a Singer row will automatically delete all associated Album rows.

B.The Albums table cannot have any secondary indexes because of the INTERLEAVE clause.

C.The Albums table's rows are physically stored independent of the Singer table.

D.The Albums table's primary key must include the SingerId column only.

E.The ON DELETE CASCADE clause ensures that deleting an Album row will delete the corresponding Singer row.

AnswerA

The ON DELETE CASCADE clause enforces this behavior.

Why this answer

Option A is correct because the `ON DELETE CASCADE` clause on the foreign key from `Albums` to `Singer` ensures that when a row in the `Singer` table is deleted, all rows in the `Albums` table that reference that singer are automatically deleted. This is a standard referential integrity behavior in relational databases, and in Cloud Spanner (the technology context for PCDE), it is enforced at the database level to maintain consistency.

Exam trap

Google Cloud often tests the direction of `ON DELETE CASCADE` — candidates mistakenly think it deletes the parent when a child is deleted, but it only propagates from parent to child.

How to eliminate wrong answers

Option B is wrong because the `INTERLEAVE` clause does not prevent secondary indexes on the `Albums` table; Cloud Spanner allows secondary indexes on interleaved tables, though they must be created with the `INTERLEAVE IN` option to maintain locality. Option C is wrong because the `INTERLEAVE` clause physically stores child rows (Albums) adjacent to their parent row (Singer) in the same split, not independently. Option D is wrong because the `Albums` table's primary key must include `SingerId` as the first column (due to interleaving), but it can and typically does include additional columns (e.g., `AlbumId`) to uniquely identify rows.

Option E is wrong because `ON DELETE CASCADE` propagates deletion from the parent (Singer) to the child (Albums), not the reverse; deleting an `Album` row does not delete the corresponding `Singer` row.

Full explanation →

359

MCQmedium

Refer to the exhibit. Which BigQuery SQL query correctly flattens the items into rows?

A.SELECT * FROM orders WHERE items IS NOT NULL

B.SELECT * FROM orders, UNNEST(items) AS items

C.SELECT * FROM orders INNER JOIN items ON true

D.SELECT * FROM orders CROSS JOIN UNNEST(items) AS items

AnswerD

UNNEST with CROSS JOIN correctly flattens the nested field.

Why this answer

Option A is correct: using CROSS JOIN UNNEST(items) expands the repeated record into separate rows. Option B only filters null items. Option C is invalid syntax.

Option D is missing CROSS JOIN.

Full explanation →

360

Multi-Selecthard

You are planning capacity for a Cloud Spanner instance. Which TWO factors directly affect the number of nodes required?

Select 2 answers

A.Number of users

B.Read throughput in queries per second (QPS)

C.Number of indexes

D.Write throughput in queries per second (QPS)

E.Storage size in GB

AnswersB, D

Read QPS directly determines CPU and node requirements.

Why this answer

Cloud Spanner node capacity is primarily determined by compute and I/O requirements, which are directly driven by read and write throughput (QPS). Each node provides a fixed amount of processing power and throughput; therefore, to handle a given QPS, you must provision enough nodes to meet the peak read and write demand. Storage size (Option E) is not a direct factor because Spanner automatically uses available node resources for storage, and you can add nodes for throughput without exceeding storage limits.

Exam trap

The trap here is that candidates often assume storage size is a primary driver for node count, but Spanner decouples throughput and storage, so you must focus on QPS requirements first, especially in exam scenarios where throughput is the bottleneck.

Full explanation →

361

MCQeasy

A BI team wants to create a report that shows daily active users for the last 7 days. Which SQL construct is most appropriate for fast performance on a large dataset?

A.SELECT COUNT(DISTINCT user_id) ... WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)

B.SELECT DISTINCT user_id ...

C.SELECT COUNT(user_id) ... GROUP BY user_id

D.SELECT APPROX_COUNT_DISTINCT(user_id) ... WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)

AnswerD

Approximate distinct is fast and sufficient for trend analysis.

Why this answer

Option D is correct because APPROX_COUNT_DISTINCT uses HyperLogLog (HLL) algorithm, which provides near-exact distinct counts with significantly less memory and faster performance than COUNT(DISTINCT) on large datasets. This is ideal for a daily active users report over 7 days where exact precision is not critical.

Exam trap

Google Cloud often tests the misconception that COUNT(DISTINCT) is always the correct choice for distinct counts, ignoring the performance implications on large datasets where approximate counting functions are the appropriate BI solution.

How to eliminate wrong answers

Option A is wrong because COUNT(DISTINCT user_id) requires sorting or hashing all unique user_id values, which becomes extremely slow and memory-intensive on large datasets. Option B is wrong because SELECT DISTINCT user_id returns all individual user IDs without counting them, failing to produce the required daily active user count. Option C is wrong because COUNT(user_id) counts all rows including duplicates, not distinct users, and GROUP BY user_id would produce per-user counts rather than a single daily total.

Full explanation →

362

Multi-Selectmedium

A BI team is troubleshooting a slow BigQuery query. Which TWO actions can help identify the bottleneck?

Select 2 answers

A.Review the query execution plan in the BigQuery UI.

B.Increase the number of slots to maximum.

C.Remove all WHERE clauses to simplify.

D.Rewrite the query to use only CTEs.

E.Check the bytes processed and shuffle bytes.

AnswersA, E

Execution plan reveals stages, timing, and data shuffling.

Why this answer

Reviewing the query execution plan in the BigQuery UI (Option A) is correct because it provides a visual breakdown of query stages, including shuffle operations, data distribution, and stage-level timing. This allows the BI team to pinpoint which stage is consuming the most time or resources, such as a skewed join or a slow aggregation, directly identifying the bottleneck.

Exam trap

Google Cloud often tests the misconception that adding more resources (slots) or simplifying the query (removing WHERE clauses) is a diagnostic step, when in fact these actions change the query's behavior rather than identifying the existing bottleneck.

Full explanation →

363

Multi-Selectmedium

A Database Engineer is deploying a Cloud SQL for PostgreSQL instance for a financial services application that requires high availability and automatic failover. The engineer also needs to ensure that backups are taken daily and retained for 30 days. Which TWO actions should the engineer take? (Choose two.)

Select 2 answers

A.Create a cross-region replica and configure automatic failover.

B.Enable automated backups and set backup retention to 30 days.

C.Enable high availability (HA) configuration on the instance with a regional persistent disk.

D.Schedule a Cloud Scheduler job to export the database to Cloud Storage daily.

E.Enable point-in-time recovery (PITR) with a 7-day retention.

AnswersB, C

Automated backups with 30-day retention satisfies the requirement.

Why this answer

Option B is correct because Cloud SQL automated backups can be configured with a retention period of up to 365 days, and setting it to 30 days meets the requirement for daily backups with 30-day retention. Option C is correct because enabling high availability (HA) on a Cloud SQL instance with a regional persistent disk provides automatic failover to a standby instance in a different zone within the same region, ensuring high availability for the financial services application.

Exam trap

Google Cloud often tests the distinction between high availability (automatic failover within a region) and disaster recovery (cross-region replicas), leading candidates to incorrectly select cross-region replicas for failover when they are read-only and require manual promotion.

Full explanation →

364

MCQmedium

A Cloud SQL instance stores financial data. They need to meet a 1-hour RPO and 30-minute RTO. What backup configuration should they use?

A.Export once a day to Cloud Storage.

B.Automatic backups plus binary logging to enable point-in-time recovery (PITR).

C.Enable high availability.

D.Automatic backups with a schedule of 1 hour.

AnswerB

PITR with binary logs allows recovery to any point in the last 7 days, meeting the 1-hour RPO.

Why this answer

Option B is correct because automatic backups combined with binary logging enable point-in-time recovery (PITR), which allows restoring the database to any point within the backup retention period. This configuration meets the 1-hour RPO by recovering transactions committed within the last hour, and the 30-minute RTO by using the most recent full backup plus binary logs to restore quickly.

Exam trap

Google Cloud often tests the distinction between high availability (HA) and backup/recovery; candidates mistakenly think HA alone satisfies RPO/RTO requirements, but HA only ensures uptime, not point-in-time data recovery.

How to eliminate wrong answers

Option A is wrong because exporting once a day to Cloud Storage provides an RPO of up to 24 hours, far exceeding the required 1-hour RPO, and restoring from an export can take significantly longer than 30 minutes. Option C is wrong because enabling high availability (HA) provides failover to a standby instance in case of zone failure, but does not address data backup or recovery point objectives; it does not protect against data corruption or accidental deletion. Option D is wrong because automatic backups with a 1-hour schedule create full backups every hour, but without binary logging, you can only restore to the exact backup timestamps, not to any point within the hour, so the effective RPO could be up to 1 hour plus the time to complete the backup, and recovery time may exceed 30 minutes due to the need to restore the full backup.

Full explanation →

365

Multi-Selectmedium

Which TWO are best practices for designing a star schema in BigQuery for BI? (Choose two.)

Select 2 answers

A.Store dimension attributes in a single denormalized dimension table instead of multiple normalized tables.

B.Partition fact tables by low-cardinality columns like gender.

C.Pre-aggregate all measures at every possible grain in the fact table.

D.Avoid using joins entirely by storing all data in one wide table.

E.Use surrogate keys for dimension tables instead of natural keys.

AnswersA, E

Denormalization reduces join complexity.

Why this answer

Option A is correct because in BigQuery, storing dimension attributes in a single denormalized dimension table (star schema) reduces the number of joins required in BI queries, improving query performance and simplifying SQL. BigQuery's columnar storage and distributed architecture handle denormalized dimensions efficiently, avoiding the overhead of multiple normalized tables that would require complex joins and slow down analytical queries.

Exam trap

Google Cloud often tests the misconception that denormalization is always bad, but in BigQuery's architecture, denormalized dimension tables are a best practice for BI workloads, unlike traditional OLTP databases.

Full explanation →

366

MCQhard

Your company runs a global application on Cloud Spanner. You notice that recent schema changes have caused a significant increase in latency for cross-node transactions. The previous schema used interleaved tables for parent-child relationships, but the new schema uses separate tables with foreign keys. What is the most likely cause of the increased latency?

A.The new schema uses foreign keys that require cross-node transactions.

B.The new schema does not use commit timestamps for versioning.

C.The new schema lacks secondary indexes on foreign key columns.

D.The Spanner instance was not resized after the schema change.

AnswerA

Foreign keys between separate tables can lead to distributed transactions across nodes.

Why this answer

The new schema uses separate tables with foreign keys instead of interleaved tables. In Cloud Spanner, interleaved tables guarantee that parent and child rows are co-located on the same split, allowing local joins without cross-node communication. Foreign keys between non-interleaved tables can reference rows stored on different splits, forcing distributed transactions that require two-phase commit across nodes, which significantly increases latency.

Exam trap

The trap here is that candidates may think foreign keys inherently cause performance issues due to constraint checking, but the real cause is the loss of data locality and resulting cross-split coordination in Spanner's distributed architecture.

How to eliminate wrong answers

Option B is wrong because commit timestamps are used for versioning and consistency, not for reducing cross-node transaction latency; omitting them would not cause the described latency increase. Option C is wrong because secondary indexes on foreign key columns improve query performance but do not eliminate the need for cross-node coordination when the referenced rows are on different splits. Option D is wrong because resizing the Spanner instance (adding/removing nodes) affects throughput and storage capacity, not the fundamental latency of cross-node transactions caused by non-interleaved schemas.

Full explanation →

367

MCQeasy

A company uses Cloud SQL for SQL Server. They want to store JSON data in a column and query it efficiently. What should they do?

A.Store each JSON field as a separate column.

B.Store JSON in an nvarchar(max) column and use JSON_VALUE in queries.

C.Use a TEXT column with no indexing.

D.Store JSON as a binary column and parse in application.

AnswerB

SQL Server's JSON support allows querying inside nvarchar(max) columns.

Why this answer

Option A is correct because SQL Server supports JSON functions like JSON_VALUE. Using nvarchar(max) with JSON functions allows querying. Option B is wrong because each value in a separate column is not flexible.

Option C is wrong because a single TEXT column cannot be efficiently queried. Option D is wrong because storing JSON as binary adds complexity.

Full explanation →

368

MCQhard

Refer to the exhibit. The application requires low-latency reads for users in Europe. The current cluster is in us-central1. What should they do?

A.Add a new cluster in a European region (e.g., europe-west1).

B.Increase the number of nodes in the existing cluster.

C.Use a multi-cluster instance with existing cluster.

D.Change the storage type to SSD in the existing cluster.

AnswerA

Adding a cluster in Europe allows reads to be served from a nearby location, reducing latency.

Why this answer

Adding a new cluster in a European region (e.g., europe-west1) is correct because it places data physically closer to users, reducing network latency for read operations. In a multi-region deployment, the application can read from the nearest cluster, achieving low-latency reads without changing the existing cluster's configuration. This approach leverages geographic proximity to minimize round-trip time (RTT) for European users.

Exam trap

Google Cloud often tests the misconception that scaling up (more nodes or faster storage) can solve geographic latency issues, when in fact only adding a regional cluster addresses the fundamental physics of network propagation delay.

How to eliminate wrong answers

Option B is wrong because increasing the number of nodes in the existing us-central1 cluster does not reduce the physical distance between European users and the data; network latency is dominated by propagation delay, not node count. Option C is wrong because a multi-cluster instance (e.g., in Cloud Spanner) is designed for global strong consistency and high availability, but it does not inherently provide low-latency reads for a specific region unless a new cluster is added in that region; the existing cluster alone cannot serve European users with low latency. Option D is wrong because changing the storage type to SSD improves I/O performance (e.g., lower disk latency) but does not address the network latency caused by geographic distance; the bottleneck for European users is the long-haul network path, not storage speed.

Full explanation →

369

MCQhard

A company stores sensor data in BigQuery. They have a table 'sensor_readings' with columns: sensor_id, reading_time, value. The table is partitioned by reading_time (hourly) and clustered by sensor_id. A BI query aggregates average value per sensor for the last week. The query still scans many bytes. What is the most likely cause?

A.The query uses SELECT * instead of specific columns

B.Clustering on sensor_id is ineffective

C.The table is not using columnar storage

D.Partition granularity is too fine for the query range

AnswerD

Hourly partitions for a week means 168 partitions scanned; coarser partitioning (daily) would scan 7 partitions, reducing bytes.

Why this answer

Option D is correct because the query scans a full week of data (168 hourly partitions), and each partition must be read entirely even though only a subset of sensors may be active. Hourly partitioning over a 7-day range means the query engine must scan all 168 partitions, which can result in a large number of bytes being processed. Clustering on sensor_id helps within each partition but does not reduce the number of partitions scanned; the fine granularity of hourly partitioning is the primary cause of excessive bytes scanned.

Exam trap

Google Cloud often tests the misconception that clustering alone solves all performance issues, but the trap here is that clustering only helps when the query filters or aggregates on the clustered column—without such a filter, clustering does not reduce bytes scanned, and overly fine partitioning is the real culprit.

How to eliminate wrong answers

Option A is wrong because using SELECT * instead of specific columns would increase the bytes scanned, but the question states the query aggregates average value per sensor, which likely already selects only the needed columns; the core issue is partition pruning, not column projection. Option B is wrong because clustering on sensor_id is effective for reducing bytes scanned within each partition when filtering by sensor_id, but the query does not filter on sensor_id—it aggregates across all sensors—so clustering provides no benefit here. Option C is wrong because BigQuery always uses columnar storage (Capacitor format); the table is inherently columnar, so this is not a possible cause.

Full explanation →

370

Multi-Selecthard

A company uses Cloud Spanner with a schema that includes a table 'Events' with primary key (EventId, Timestamp). They need to run range queries on Timestamp across all events. They notice slow queries. Which two actions can improve query performance? (Choose two.)

Select 2 answers

A.Create a secondary index on Timestamp.

B.Create a covering index that includes all queried columns.

C.Add a hash prefix to EventId to distribute writes.

D.Use interleaving with a parent table on EventId.

E.Change the primary key to (Timestamp, EventId).

AnswersA, B

A secondary index on Timestamp allows efficient range scans.

Why this answer

Creating a secondary index on Timestamp (A) enables efficient range scans. Creating a covering index (D) that includes all queried columns avoids table lookups. Option B helps writes but not reads.

Option C is about interleaving, not relevant. Option E changes the primary key, which could help but may cause hot spots; not the best immediate action.

Full explanation →

371

Multi-Selecteasy

Which TWO BigQuery features are specifically designed to accelerate BI dashboard query performance? (Choose TWO.)

Select 2 answers

A.Wildcard tables

B.Clustering

C.User-defined functions (UDFs)

D.Cached results

E.Column-level security

AnswersB, D

Clustering reduces data scanned by sorting data within partitions, speeding up filter-based queries.

Why this answer

Clustering (B) physically co-locates rows with similar values in the same storage blocks, allowing BigQuery to skip entire blocks when processing queries with filters on clustered columns. This dramatically reduces the amount of data scanned, directly accelerating BI dashboard queries that often filter by date, region, or customer ID. Cached results (D) store the output of recent queries for up to 24 hours, so repeated dashboard refreshes or concurrent user requests can be served instantly without re-scanning any data.

Exam trap

Google Cloud often tests the misconception that any feature that 'organizes' or 'processes' data (like wildcard tables or UDFs) improves performance, when in fact only features that reduce data scanned (clustering) or avoid re-execution (cached results) directly accelerate BI dashboards.

Full explanation →

372

MCQeasy

A database engineer is designing a data model for a BI dashboard that tracks daily sales by product category. The data source is a transactional database with a normalized schema. Which BigQuery feature should they use to update the fact table incrementally each day?

A.Streaming inserts

B.BigQuery Data Transfer Service

C.Scheduled queries with MERGE statements

D.Load jobs with WRITE_TRUNCATE

AnswerC

MERGE combines INSERT and UPDATE to handle incremental changes efficiently.

Why this answer

Scheduled queries with MERGE statements allow incremental updates by inserting new rows and updating existing ones based on a unique key, such as date and product category. This avoids full table reloads, making it efficient for daily fact table refreshes from a normalized transactional source.

Exam trap

The trap here is that candidates confuse 'incremental load' with 'streaming' (Option A), not realizing that streaming inserts are for real-time events, not batch updates from a transactional database.

How to eliminate wrong answers

Option A is wrong because streaming inserts are designed for real-time, row-by-row data ingestion, not for batch updating a fact table incrementally from a transactional database. Option B is wrong because BigQuery Data Transfer Service is used for automated imports from external SaaS sources (e.g., Google Ads, Amazon S3), not for executing custom SQL logic like MERGE against existing tables. Option D is wrong because WRITE_TRUNCATE replaces the entire table each load, which is inefficient and loses historical data, whereas incremental updates require preserving existing rows.

Full explanation →

373

MCQeasy

A company uses BigQuery for BI. They need to create a table that stores daily sales data with millions of rows. The query pattern is to aggregate sales by month for specific product categories. Which table design is most cost-effective and performant?

A.Non-partitioned table with clustering on product_category

B.Partitioned table by date with clustering on product_category

C.Non-partitioned, non-clustered table with manual sharding by date

D.Partitioned table by product_category with clustering on date

AnswerB

Partitioning prunes irrelevant date ranges; clustering reduces data scanned for category filters.

Why this answer

Partitioning by date allows BigQuery to prune entire partitions when querying monthly aggregates, drastically reducing the data scanned. Clustering on product_category further organizes data within each partition, enabling efficient block-level pruning for category filters. This combination minimizes both cost (bytes billed) and query latency for the described workload.

Exam trap

Google Cloud often tests the misconception that clustering alone is sufficient for performance, ignoring that partitioning is essential for time-range queries to enable storage-level pruning and cost control.

How to eliminate wrong answers

Option A is wrong because a non-partitioned table forces BigQuery to scan all rows even for a single month, leading to higher costs and slower performance despite clustering on product_category. Option C is wrong because manual sharding (e.g., table names like sales_20250101) is a legacy pattern that requires complex query logic (UNION ALL) and loses automatic partition pruning, plus BigQuery discourages sharding in favor of native partitioning. Option D is wrong because partitioning by product_category would create many small partitions (one per category), which is inefficient for date-range queries; clustering on date cannot compensate for the lack of date-based partition pruning, so monthly aggregations would still scan all partitions.

Full explanation →

374

MCQmedium

A retail company uses Cloud Spanner for their OLTP system and wants to run BI queries on the same data without impacting transactional performance. Which solution should they implement?

A.Create a federated BigQuery query that reads from Spanner

B.Export Spanner data to Cloud Storage and then load into BigQuery manually

C.Use Cloud Dataflow to stream Spanner changes into BigQuery

D.Run BI queries directly on Spanner using read-only transactions

AnswerC

Dataflow captures changes from Spanner and loads them into BigQuery, separating BI workloads.

Why this answer

Option C is correct because Cloud Dataflow can read the Cloud Spanner change streams and stream mutations into BigQuery in near real-time, enabling BI queries on fresh data without adding read load to the Spanner instance. This decouples the analytical workload from the transactional workload, preserving OLTP performance.

Exam trap

The trap here is that candidates assume read-only transactions are safe for BI workloads, but they still consume Spanner's CPU and memory resources, which can degrade transactional performance under concurrent analytical queries.

How to eliminate wrong answers

Option A is wrong because federated BigQuery queries against Spanner execute reads directly on the Spanner instance, which can consume CPU and impact transactional latency, especially under heavy BI query loads. Option B is wrong because manual exports to Cloud Storage and batch loads into BigQuery introduce significant latency and operational overhead, making it unsuitable for near-real-time BI requirements. Option D is wrong because even read-only transactions on Spanner consume instance resources and can contend with transactional writes, degrading OLTP performance under concurrent BI query loads.

Full explanation →

375

MCQeasy

A company is migrating a PostgreSQL database to Cloud SQL for PostgreSQL. They want to ensure minimal downtime during the migration. Which migration strategy should they use?

A.Set up application-level dual writes to both databases and switch over

B.Use Database Migration Service (DMS) with continuous replication

C.Create a read replica in Cloud SQL and promote it

D.Export the database using pg_dump and import into Cloud SQL

AnswerB

DMS provides minimal downtime via change data capture and replication.

Why this answer

Database Migration Service (DMS) with continuous replication is the correct strategy because it uses change data capture (CDC) to replicate ongoing transactions from the source PostgreSQL database to Cloud SQL with minimal lag. This allows the application to remain fully operational during the migration, and the cutover can be performed in seconds by stopping writes to the source and promoting the target, achieving near-zero downtime.

Exam trap

The trap here is that candidates often confuse 'read replica promotion' (which only works within Cloud SQL) with cross-environment migration, or they assume that pg_dump can be used with minimal downtime by running it on a replica, but the export still locks tables or requires a consistent snapshot that interrupts writes.

How to eliminate wrong answers

Option A is wrong because application-level dual writes require modifying application code to write to both databases simultaneously, which introduces complexity, potential data inconsistency, and does not guarantee minimal downtime during the actual cutover. Option C is wrong because creating a read replica in Cloud SQL and promoting it is not a supported migration path from an external PostgreSQL database; Cloud SQL read replicas can only be created from a Cloud SQL primary instance, not from an on-premises or external source. Option D is wrong because exporting with pg_dump and importing into Cloud SQL is a batch, offline process that requires the source database to be read-only or stopped during the export to ensure consistency, resulting in significant downtime.

Full explanation →

Google Professional Cloud Database Engineer (PCDE) — Questions 301–375