Google Professional Data Engineer (PDE) — Questions 751–825

990 questions total · 14 pages · All types, answers revealed

751

MCQ · easy

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?

A. Use a single-node cluster with standard VMs.

B. Use a cluster with local SSDs for faster I/O.

C. Use a cluster with a mix of standard and preemptible VMs.

D. Use a cluster with n1-highmem-32 instances and 1000 cores.

Answer: C

Preemptible VMs reduce cost significantly while providing sufficient compute.

Why this answer

Option C is correct because preemptible VMs can cost up to about 80% less than standard VMs, and keeping some standard VMs in the cluster lets the job survive preemptions over its 6-hour run. Since the job reads from and writes to Cloud Storage (not local HDFS), local SSDs are unnecessary, and a single-node cluster would lack the parallelism needed to process 500 TB within 6 hours. Running the master and primary workers on standard VMs and adding preemptible VMs as secondary workers minimizes cost while still ensuring the job completes.
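
As a rough illustration of the cost argument, the sketch below compares an all-standard worker pool with a mixed pool. The hourly price, the 80% discount, and the worker counts are placeholder assumptions chosen for the arithmetic, not published rates or a recommended cluster size.

```python
def cluster_cost(standard_workers, preemptible_workers, hours,
                 standard_price=1.00, preemptible_discount=0.80):
    """Estimate worker cost for a job, in the same unit as standard_price.

    standard_price is an assumed hourly price per standard VM;
    preemptible_discount is the assumed fractional saving for preemptible VMs.
    """
    preemptible_price = standard_price * (1 - preemptible_discount)
    hourly = (standard_workers * standard_price
              + preemptible_workers * preemptible_price)
    return hourly * hours


# Hypothetical 100-worker cluster running the 6-hour job.
all_standard = cluster_cost(100, 0, hours=6)
mixed = cluster_cost(20, 80, hours=6)
print(all_standard, mixed)
```

Under these assumed numbers the mixed pool costs a little over a third of the all-standard pool. The real ratio depends on current pricing and on how much work is repeated after preemptions.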

Exam trap

Google Cloud often tests the misconception that local SSDs always improve performance for data processing jobs, but in Dataproc, when data resides in Cloud Storage, the bottleneck is network throughput, not local disk speed, making SSDs an unnecessary cost.

How to eliminate wrong answers

Option A is wrong because a single-node cluster cannot process 500 TB in 6 hours due to limited CPU and memory resources, and it lacks fault tolerance if the node fails. Option B is wrong because local SSDs add cost without benefit when reading/writing from Cloud Storage, as the bottleneck is network I/O, not disk I/O; Dataproc uses Cloud Storage as the primary data source, not HDFS. Option D is wrong because using 1000 cores with n1-highmem-32 instances is over-provisioned and expensive, and the job's 6-hour runtime does not justify such a large cluster; it also ignores the cost savings of preemptible VMs.

752

MCQ · medium

You are running a streaming pipeline with Dataflow that reads from Pub/Sub and writes to BigQuery. You notice that the system lag metric is increasing over time, indicating that messages are taking longer to process. What is the most likely cause and how should you address it?

A. The source Pub/Sub topic has insufficient throughput; increase the number of partitions.

B. The Dataflow workers are CPU-bound; increase the number of workers or adjust autoscaling settings.

C. The BigQuery destination table has too many columns; reduce the number of columns.

D. The pipeline uses a batch transform that should be replaced with a streaming transform.

Answer: B

High system lag suggests worker resources are insufficient; adding workers reduces lag.

Why this answer

Increasing system lag often means the pipeline is CPU-bound, failing to keep up with the incoming data rate. Updating the pipeline with a higher number of workers (or enabling autoscaling) can resolve this.
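
The relationship between input rate, worker capacity, and growing lag can be sketched with simple arithmetic. The example assumes throughput scales linearly with worker count and uses made-up rates; real pipelines rarely scale this cleanly.

```python
import math


def workers_needed(input_rate, per_worker_rate):
    """Smallest worker count whose combined throughput covers the input rate."""
    return math.ceil(input_rate / per_worker_rate)


def backlog_after(seconds, input_rate, workers, per_worker_rate):
    """Messages left unprocessed after `seconds`, starting from an empty backlog."""
    capacity = workers * per_worker_rate
    return max(0, (input_rate - capacity) * seconds)


# Hypothetical rates: 5,000 msg/s arriving, 400 msg/s per worker.
print(workers_needed(5000, 400))
print(backlog_after(60, 5000, workers=10, per_worker_rate=400))
```

With too few workers the backlog grows every second, which is what a steadily rising system lag reflects.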

753

MCQ · hard

A financial services company operates a real-time fraud detection pipeline using Apache Beam running on Google Cloud Dataflow. The pipeline reads transactions from Pub/Sub, enriches them with customer data from Bigtable, runs a machine learning model with side inputs from a Redis cluster, and writes results to BigQuery for downstream reporting. The data must be processed with exactly-once semantics to avoid duplicate fraud alerts or missing transactions. The pipeline currently uses a global window with 5-minute accumulation, but the team is experiencing high latency and occasional duplicates when the model side input is updated (triggered every 15 minutes via a WatchTransform). Additionally, the pipeline has a dead letter queue that outputs failed records to a separate Pub/Sub topic, but these records are never reprocessed. The team needs to ensure high reliability and data quality. Which course of action should the team take to improve solution quality?

A. Use fixed windows with a 10-minute duration and session gap of 2 minutes, disable side input caching, and log all dead letter records to Cloud Storage for manual inspection.

B. Switch to a batch processing approach that runs every minute using Cloud Composer, with data loaded from Pub/Sub into BigQuery and then processed with Dataproc to run the model.

C. Implement sliding windows of 5 minutes with a 2-minute allowed lateness, use side inputs with periodic refreshes using the .withUpdateFrequency transformation, and set up a Cloud Function to automatically replay dead letter records back to the main Pub/Sub topic after fixing the issue.

D. Keep the global window but use a custom trigger with early firings every 30 seconds and a late-firing threshold of 1 minute, and configure the side input to be broadcast every 5 minutes using a Read transform.

Answer: C

Sliding windows with allowed lateness handle late data without blocking, periodic side input refreshes reduce latency, and automatic replay of dead letters ensures data quality.

Why this answer

Option C is correct because sliding windows with allowed lateness capture late-arriving transactions without blocking the window, and refreshing the side input periodically (e.g., .withUpdateFrequency) reduces the latency caused by model updates. Reprocessing dead letter records (e.g., via a Cloud Function that replays them to the main topic) ensures data completeness. Option A is incorrect because fixed windows with session gaps do not help with side input latency and may cause data loss.

Option D is incorrect because a global window with custom triggers can produce duplicates unless configured carefully, and the defaults may not achieve exactly-once results. Option B is incorrect because it replaces streaming with batch processing, which is unsuitable for real-time detection and introduces latency.
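
To make the windowing terms concrete, here is a minimal pure-Python sketch of how a timestamp maps onto 5-minute sliding windows and how a 2-minute allowed lateness keeps a window open. It does not use the Beam SDK, and the 60-second slide period is an assumption, since the question gives only the window size and the lateness.

```python
def sliding_windows(timestamp, size=300, period=60):
    """Return the [start, end) sliding windows (in seconds) containing `timestamp`."""
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def is_window_open(window_end, watermark, allowed_lateness=120):
    """A window still accepts data until the watermark passes end + lateness."""
    return watermark < window_end + allowed_lateness


windows = sliding_windows(1000)
print(windows)
print(is_window_open(window_end=1020, watermark=1100))
```

Each event lands in size / period overlapping windows, so a shorter period gives fresher results at the cost of more duplicated work per element.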

754

MCQ · medium

A retail company needs to generate product recommendations for millions of users every few hours. The model is a small scikit-learn model. Which prediction method should be used to minimize infrastructure cost while meeting the latency requirements?

A. Use Cloud Run to host the model and invoke it for each user request.

B. Export the model as a container and run on Google Kubernetes Engine with cluster autoscaling.

C. Deploy the model to a Vertex AI endpoint with a single replica for online predictions.

D. Use a Vertex AI batch prediction job that reads from BigQuery and writes results back to BigQuery or Cloud Storage.

Answer: D

Batch prediction is designed for such use cases and is cost-efficient for large datasets processed periodically.

Why this answer

Option D is correct because batch prediction is the most cost-effective approach for generating recommendations for millions of users every few hours. Vertex AI batch prediction jobs process large datasets in parallel without maintaining always-on infrastructure, and they can read from BigQuery and write results directly to BigQuery or Cloud Storage, minimizing compute costs while meeting the latency requirement of 'every few hours' (not real-time).

Exam trap

Google Cloud often tests the distinction between online (real-time) and batch (asynchronous) prediction patterns, and the trap here is that candidates assume 'predictions' always require a live endpoint, overlooking that batch jobs are the correct choice when latency requirements are in hours and the workload is massive and periodic.

How to eliminate wrong answers

Option A is wrong because Cloud Run invokes the model per user request, which would require millions of individual invocations every few hours, leading to high request-based costs and potential cold-start latency issues that are unnecessary for a batch workload. Option B is wrong because Google Kubernetes Engine with cluster autoscaling is overkill for a small scikit-learn model and introduces cluster management overhead and always-on node costs, even with autoscaling, making it more expensive than a serverless batch solution. Option C is wrong because a Vertex AI endpoint with a single replica is designed for online (real-time) predictions, which would be idle most of the time between the batch windows, incurring continuous compute costs for a single replica that is not needed for a scheduled batch job.

755

MCQ · easy

A company runs batch jobs on Dataproc. They need to ensure that if a job fails, it automatically retries with exponential backoff. What is the recommended approach?

A. Schedule a cron job to check job status and restart manually.

B. Use a Cloud Function triggered by Stackdriver alerts to restart the job.

C. Use Dataproc Workflow Templates with the maxAttempts parameter set to 3.

D. Create a Cloud Composer DAG that monitors job status and retries on failure.

Answer: C

Workflow Templates natively support retries with configurable backoff, making it the simplest and most robust solution.

Why this answer

Option C is correct because Dataproc Workflow Templates natively support automatic retries with exponential backoff via the `maxAttempts` parameter. This allows you to define a workflow that, upon failure, will retry the job with increasing delays between attempts, meeting the requirement without custom scripting or external services.
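
For readers unfamiliar with the retry pattern itself, the following is a generic sketch of retrying with exponential backoff. It is independent of any Dataproc API: `run_with_backoff`, `base_delay`, and the doubling schedule are illustrative choices, and `max_attempts` merely echoes the option's wording.

```python
import time


def run_with_backoff(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `job` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.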

Exam trap

The trap here is that candidates may over-engineer the solution by choosing Cloud Composer or Cloud Functions, not realizing that Dataproc Workflow Templates already provide a built-in, serverless retry mechanism with exponential backoff that requires no additional services or custom code.

How to eliminate wrong answers

Option A is wrong because a cron job that manually checks and restarts jobs is not automated retry with exponential backoff; it introduces latency and operational overhead, and does not leverage Dataproc's built-in retry mechanisms. Option B is wrong because while a Cloud Function triggered by Stackdriver alerts could restart a job, it does not provide exponential backoff out of the box and requires custom logic to implement delays, making it less reliable and more complex than the native workflow template feature. Option D is wrong because a Cloud Composer DAG can monitor and retry jobs, but it is an external orchestration layer that adds unnecessary complexity and cost; Dataproc Workflow Templates already provide the required retry capability natively.

756

MCQ · easy

A team is setting up a Dataflow pipeline for a time-sensitive ETL job that must complete within a specific time window. Which monitoring metric should they use to determine if the pipeline is on track to finish on time?

A. The number of failed elements and retries.

B. The system lag metric, which measures the time between event occurrence and processing.

C. The number of elements processed in the current window.

D. The job's estimated time to completion shown in the Dataflow monitoring interface.

Answer: D

This metric directly estimates remaining time based on throughput.

Why this answer

Option D is correct because the Dataflow monitoring interface provides an estimated time to completion for the pipeline, which is the most direct metric for determining if the job will finish within the required time window. This estimate is calculated based on current throughput, backlog, and resource utilization, making it the appropriate choice for time-sensitive ETL jobs. Other metrics like system lag or element counts do not directly predict job completion time.

Exam trap

Google Cloud often tests the distinction between metrics that measure current performance (like system lag or element count) versus metrics that predict future completion (like estimated time to completion), leading candidates to pick a metric that sounds relevant but does not answer the specific question about finishing on time.

How to eliminate wrong answers

Option A is wrong because the number of failed elements and retries indicates data quality or processing errors, not the pipeline's progress toward completion within a time window. Option B is wrong because system lag measures the delay between event occurrence and processing, which is useful for streaming latency but does not provide an estimated finish time for a batch or bounded pipeline. Option C is wrong because the number of elements processed in the current window shows throughput but not whether the remaining workload can be completed before the deadline, as it ignores the backlog and processing rate.

757

MCQ · easy

A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?

A. Partition by customer_id and cluster by transaction_date

B. Cluster by both customer_id and transaction_date

C. Use a materialized view with customer_id and transaction_date

D. Partition by transaction_date and cluster by customer_id

Answer: D

Partitioning by date allows BigQuery to prune partitions for queries with date ranges. Clustering by customer_id sorts data within partitions, speeding up customer-level queries.

Why this answer

Clustering sorts data based on one or more columns, improving query performance for filters and sorts on those columns. Partitioning by date/timestamp can further improve performance for time-range queries. For this scenario, clustering on customer_id and partitioning by transaction_date is optimal.
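
A table with this layout could be declared roughly as follows. The sketch only builds the DDL string; the dataset and table name, the `amount` column, and the assumption that `transaction_date` is a DATE column are all hypothetical.

```python
def build_table_ddl(table="my_dataset.transactions"):
    """Return BigQuery DDL for a date-partitioned, customer-clustered table."""
    return (
        f"CREATE TABLE {table} (\n"
        "  customer_id STRING,\n"
        "  transaction_date DATE,\n"
        "  amount NUMERIC\n"
        ")\n"
        "PARTITION BY transaction_date\n"  # prunes partitions on date filters
        "CLUSTER BY customer_id"           # co-locates rows for each customer
    )


ddl = build_table_ddl()
print(ddl)
```

If `transaction_date` were a TIMESTAMP instead, the partition expression would need to be adapted (for example by partitioning on its date).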

758

MCQ · medium

A data engineer needs to query data across BigQuery (in Google Cloud) and Snowflake (in AWS) without moving the data. Which service should they use?

A. Dataflow

B. Cloud SQL

C. Vertex AI Feature Store

D. BigQuery Omni

Answer: D

BigQuery Omni supports multi-cloud analytics without data movement.

Why this answer

BigQuery Omni allows querying data across multiple clouds using BigQuery's interface, with compute running in the respective cloud. Data stays in place.

759

MCQ · hard

A streaming pipeline ingests events from Pub/Sub, enriches them via a slow REST API call, and writes the result to BigQuery. The API has a limit of 10 requests per second per client. The pipeline processes 1000 messages per second. Which approach minimizes latency while respecting API limits?

A. Use a global window with a trigger that fires every second, and inside the DoFn limit concurrent API calls to 10.

B. Fan out the stream to multiple REST API instances using Pub/Sub topic splitting.

C. Use a Dataflow Flex Template to run multiple pipelines, each processing a subset of messages.

D. Assign each message a random key and use a sliding window of 10 seconds; the API call will be distributed across workers.

Answer: A

Groups messages into batches per second, then controls concurrency to stay within the 10 req/s limit.

Why this answer

Firing a trigger every second on the global window groups roughly 1,000 messages per pane, and the DoFn then caps API calls at 10 concurrent requests (e.g., via a fixed-size thread pool). This respects the limit while batching work. Beam does not throttle automatically, so the cap must be enforced in the DoFn itself; note that funneling all messages through a single key in a global window would create a bottleneck.

Fanning out to multiple API endpoints does not help if the limit is per client. Dataflow Flex Templates are irrelevant to throttling.
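
A cap on concurrent calls bounds in-flight requests rather than calls per second, so an explicit per-second limiter is often paired with it. The sketch below is a minimal single-process limiter that spaces calls at least 1/rate seconds apart. The class name and injectable `clock` are illustrative, and it does not coordinate across multiple workers, which a distributed pipeline would also need.

```python
import time


class RateLimiter:
    """Space calls at least 1/rate seconds apart (at most `rate` per second)."""

    def __init__(self, rate=10, clock=time.monotonic):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.next_allowed = clock()

    def try_acquire(self):
        """Return True if a call may be made now, and reserve the next slot."""
        now = self.clock()
        if now >= self.next_allowed:
            self.next_allowed = now + self.min_interval
            return True
        return False
```

A caller would check `try_acquire()` before each API request and wait or re-queue the message when it returns False.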

760

MCQ · easy

You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?

A. Cloud Scheduler

B. Pub/Sub Lite

C. Pub/Sub

D. Cloud Tasks

Answer: B

Pub/Sub Lite is cheaper and suitable for applications that can tolerate some message loss.

Why this answer

Pub/Sub Lite offers a lower-cost option compared to Pub/Sub, with reduced reliability (e.g., at-least-once delivery but not exactly-once). It is designed for cost-sensitive streaming workloads where occasional loss is acceptable.

761

MCQ · hard

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

A. Use a different Spark example class.

B. Increase the number of worker nodes in the cluster.

C. Add --properties spark.executor.memory=8g to the command.

D. Add --driver-memory 8g to the command.

Answer: C

Increases executor heap space.

Why this answer

The out-of-memory error indicates that the Spark executors do not have enough memory to process the data. Adding `--properties spark.executor.memory=8g` increases the memory allocated to each executor, directly addressing the root cause. This property overrides the default executor memory (typically 1g or 4g depending on the cluster configuration) and is the standard way to tune executor memory in Spark on Dataproc.
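
Since the exhibit is not reproduced here, the sketch below assembles a plausible submission command with the property added. The cluster name, region, example class, and jar path are placeholders, not the values from the exhibit.

```python
import shlex


def build_submit_command(cluster, region, main_class, jar, executor_memory="8g"):
    """Build a `gcloud dataproc jobs submit spark` command with executor memory set."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={main_class}",
        f"--jars={jar}",
        # Spark properties are passed as comma-separated key=value pairs.
        f"--properties=spark.executor.memory={executor_memory}",
    ]


cmd = build_submit_command(
    cluster="example-cluster",  # hypothetical cluster name
    region="us-central1",       # placeholder region
    main_class="org.apache.spark.examples.SparkPi",
    jar="file:///usr/lib/spark/examples/jars/spark-examples.jar",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag separate, which avoids quoting problems if it is later passed to a process launcher.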

Exam trap

Google Cloud often tests the distinction between driver memory and executor memory, and candidates mistakenly choose `--driver-memory` because they confuse the driver's role with the executors' memory needs, or they assume that increasing cluster size (more nodes) automatically increases per-executor memory.

How to eliminate wrong answers

Option A is wrong because changing the Spark example class does not affect memory allocation; the error is a resource exhaustion issue, not a logic or classpath problem. Option B is wrong because increasing the number of worker nodes distributes the workload across more machines but does not increase the memory per executor; the existing executors would still run out of memory if the data partitions are too large. Option D is wrong because `--driver-memory` controls the memory of the Spark driver process, not the executors; the out-of-memory error occurs in the executors (task execution), not in the driver (which handles scheduling and results collection).

762

MCQ · easy

A company needs a fully managed, globally distributed relational database with strong consistency, external consistency, and 99.999% SLA for a financial transaction processing system. Which Google Cloud service should they use?

A. Firestore

B. Cloud Spanner

C. Bigtable

D. Cloud SQL

Answer: B

Cloud Spanner is globally distributed, strongly consistent, and offers 99.999% SLA.

Why this answer

Cloud Spanner is the correct choice because it is a fully managed, globally distributed relational database service that provides strong consistency, external consistency (true serializable transactions across regions), and a 99.999% SLA. These features are essential for a financial transaction processing system that requires ACID compliance and global scalability without sacrificing consistency.

Exam trap

The trap here is that candidates often confuse Cloud Spanner with Bigtable or Firestore because all three scale horizontally and can replicate across regions, but only Spanner combines the relational model, externally consistent transactions, and the 99.999% SLA required for financial transactions.

How to eliminate wrong answers

Option A (Firestore) is wrong because it is a NoSQL document database: it has no relational schema or SQL joins, so it does not fit a relational transaction-processing system. Option C (Bigtable) is wrong because it is a wide-column NoSQL database designed for high-throughput, low-latency workloads, not relational transactions, and it does not provide multi-row ACID transactions. Option D (Cloud SQL) is wrong because it is a regional relational database service: the primary instance runs in a single region, so it cannot provide global distribution with strong consistency or a 99.999% SLA.

763

MCQ · medium

Your team uses Cloud Dataproc for Spark ML training jobs. You want to reduce costs for non-critical, fault-tolerant training jobs. Which Dataproc feature should you use for worker nodes?

A. Use preemptible instances for worker nodes.

B. Use custom machine types with more memory.

C. Use SSDs instead of HDDs for persistent disks.

D. Use committed use discounts for 1-year or 3-year terms.

Answer: A

Preemptible instances cost ~60-80% less and are suitable for fault-tolerant batch jobs.

Why this answer

Preemptible instances are short-lived, lower-cost VMs that Cloud Dataproc can use for worker nodes. Because the training jobs are non-critical and fault-tolerant (e.g., they can handle node failures via Spark's built-in resilience), preemptible instances significantly reduce costs while still completing the workload. This directly addresses the requirement to reduce costs for fault-tolerant jobs.

Exam trap

Google Cloud often tests the distinction between cost-saving features that require commitment (committed use discounts) versus those that exploit workload characteristics (preemptible instances), and candidates mistakenly choose committed use discounts because they think 'discount' always means lower cost, ignoring the fault-tolerance requirement.

How to eliminate wrong answers

Option B is wrong because custom machine types with more memory increase cost per node, which contradicts the goal of reducing costs. Option C is wrong because SSDs are more expensive than HDDs, and while they improve I/O performance, the question focuses on cost reduction, not performance. Option D is wrong because committed use discounts require a 1-year or 3-year commitment and are typically applied to all instances in a project, not specifically to worker nodes in a Dataproc cluster; they also do not leverage the fault-tolerant nature of the jobs to achieve the lowest possible cost.

764

MCQ · hard

A financial services company uses Vertex AI to serve a fraud detection model. The model was trained on historical data that is updated daily. The team wants to automate retraining when data drift is detected. Which approach best operationalizes this requirement with minimal manual intervention?

A. Use Cloud Monitoring alerts on prediction latency to trigger a retraining pipeline.

B. Manually monitor model performance metrics in Vertex AI Experiments and retrain when accuracy drops.

C. Use scheduled Vertex AI Pipelines to retrain the model every night, then deploy automatically.

D. Enable Vertex AI Model Monitoring for feature drift and skew, then create a Cloud Function that triggers a Vertex AI Pipeline to retrain and deploy the model after validation.

Answer: D

This automates detection of data drift, triggers retraining only when needed, and includes validation before deployment.

Why this answer

Option D is correct because it uses Vertex AI Model Monitoring to automatically detect feature drift or skew, then triggers a Cloud Function that invokes a Vertex AI Pipeline to retrain and redeploy the model after validation. This approach minimizes manual intervention by automating both the detection of data drift and the subsequent retraining and deployment lifecycle.
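
The "retrain only when drift is detected" decision can be illustrated with a simple distance between a baseline and a current feature distribution. This is a conceptual sketch, not the Vertex AI Model Monitoring API; the payment-type feature, the distance measure, and the 0.3 threshold are assumptions made for the example.

```python
def l_infinity_distance(baseline, current):
    """Largest absolute difference in category share between two distributions."""
    categories = set(baseline) | set(current)
    return max(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in categories)


def should_retrain(baseline, current, threshold=0.3):
    """Trigger retraining only when the drift score exceeds the threshold."""
    return l_infinity_distance(baseline, current) > threshold


# Hypothetical share of transactions by payment type.
training = {"card": 0.70, "wire": 0.30}
serving = {"card": 0.35, "wire": 0.60, "crypto": 0.05}
print(should_retrain(training, serving))
```

In an event-driven setup, a check like this would gate the pipeline trigger, so stable data produces no retraining runs.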

Exam trap

Google Cloud often tests the distinction between scheduled retraining (Option C) and event-driven retraining triggered by actual drift detection (Option D), where candidates mistakenly choose the simpler scheduled approach without recognizing that it ignores the requirement to retrain only when drift is detected.

How to eliminate wrong answers

Option A is wrong because prediction latency is unrelated to data drift; monitoring latency only detects performance issues, not changes in data distribution. Option B is wrong because manually monitoring metrics in Vertex AI Experiments requires human intervention and does not automate retraining, contradicting the requirement for minimal manual intervention. Option C is wrong because scheduled nightly retraining ignores whether data drift has actually occurred, leading to unnecessary retraining and potential deployment of models that are not improved, and it does not use drift detection as the trigger.

765

MCQ · medium

A data engineer is responsible for a batch ETL pipeline that runs daily using Cloud Composer and Dataproc. The pipeline extracts data from Cloud SQL, transforms it with Spark, and loads it into BigQuery. Last night, the pipeline failed because the Spark job ran out of memory. The team needs a solution that prevents future failures without manual intervention. Which approach should they use?

A. Enable Dataproc autoscaling and configure memory-based scaling

B. Use Cloud Functions to retry the job

C. Use a larger machine type for Dataproc

D. Split the Spark job into multiple stages

Answer: A

Autoscaling adjusts cluster size based on memory usage, preventing OOM.

Why this answer

Option A is correct because Dataproc autoscaling with memory-based scaling dynamically adjusts the cluster size based on the memory utilization of running jobs. This prevents out-of-memory failures by automatically adding worker nodes when memory pressure increases, without requiring manual intervention or pre-provisioning oversized clusters. It directly addresses the root cause—insufficient memory during peak processing—while maintaining cost efficiency.

Exam trap

Google Cloud often tests the misconception that retrying a failed job or manually resizing resources is a sufficient solution, when in fact dynamic, automated scaling is required to handle variable workloads without manual intervention.

How to eliminate wrong answers

Option B is wrong because retrying the failed job with Cloud Functions does not fix the underlying memory issue; the job will simply fail again on retry if the same memory constraints persist. Option C is wrong because using a larger machine type is a static, manual fix that may waste resources during normal operation and still fail if future data volumes exceed the chosen machine's capacity. Option D is wrong because splitting the Spark job into multiple stages does not inherently reduce memory usage per stage; it only reorganizes execution steps and may even increase overhead without addressing memory pressure.

766

MCQ · easy

Your organization has a data lake on Cloud Storage with millions of small files (average 10 KB). You need to build a batch processing pipeline using Cloud Dataproc that runs a Spark job to transform the data and output results to BigQuery. The pipeline currently takes 4 hours to run because Spark spends a large amount of time listing files and managing tasks. You want to reduce the run time without changing the cluster size. Which action should you take?

A. Convert the input files from CSV to Parquet format

B. Use Spark coalesce to reduce the number of output partitions

C. Increase the number of Spark partitions to process more files in parallel

D. Enable the Spark Dynamic Resource Allocation and combine small files using a separate job before the main transformation

Answer: D

Combining files reduces task count and listing overhead.

Why this answer

Option D is correct because the primary bottleneck is the overhead of listing millions of small files and managing many Spark tasks. By combining small files into larger ones using a separate job before the main transformation, you reduce the number of files Spark must list and the number of tasks required, which directly cuts the 4-hour runtime. Enabling Spark Dynamic Resource Allocation ensures resources are used efficiently during this preprocessing step without changing the cluster size.
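
The effect of compaction on file and task counts can be estimated with a short calculation. The sketch assumes roughly one input task per small file, a 128 MiB target file size, and a hypothetical count of 5 million files; none of these figures come from the question beyond the 10 KB average.

```python
import math


def tasks_before_and_after(file_count, avg_file_bytes,
                           target_bytes=128 * 1024 * 1024):
    """Compare file counts with one file per task vs. compacted files."""
    total_bytes = file_count * avg_file_bytes
    compacted_files = max(1, math.ceil(total_bytes / target_bytes))
    return file_count, compacted_files


before, after = tasks_before_and_after(5_000_000, 10 * 1024)
print(before, after)
```

Under these assumptions millions of listings and tasks shrink to a few hundred, which is why compaction targets the bottleneck directly.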

Exam trap

The trap here is that candidates focus on data format or partitioning tuning (A, B, C) instead of recognizing that the root cause is the sheer number of small files causing excessive file listing and task overhead, which requires a preprocessing step to consolidate files.

How to eliminate wrong answers

Option A is wrong because converting CSV to Parquet improves read performance and compression but does not address the overhead of listing millions of small files or the task management cost; the bottleneck is file count, not format. Option B is wrong because using Spark coalesce reduces the number of output partitions, which only affects the write phase to BigQuery and does nothing to reduce the input file listing or task scheduling overhead. Option C is wrong because increasing the number of Spark partitions would create even more tasks, exacerbating the overhead from managing millions of small files and likely increasing runtime, not reducing it.

767

Multi-Select · hard

Your company has a Dataproc cluster that runs Spark jobs. You need to choose between RDDs, DataFrames, and Datasets for a new job that performs complex aggregations on structured data. Which TWO statements are correct regarding performance and ease of use?

Select 2 answers

A. DataFrames and Datasets are both available in PySpark.

B. DataFrames store data in a columnar format, allowing better compression.

C. RDDs are easier to use than DataFrames for complex aggregations.

D. DataFrames are optimized by Spark's Catalyst optimizer, leading to faster execution.

E. Datasets provide compile-time type safety and are always faster than DataFrames.

Answers: B, D

DataFrames use Spark's internal binary format (Tungsten) with columnar storage, enabling efficient compression and serialization.

Why this answer

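DataFrames are optimized by the Catalyst optimizer and the Tungsten execution engine, giving better performance than RDDs for structured data. Datasets combine type safety with optimized execution, but for most analytics workloads DataFrames are sufficient and simpler. Option A is wrong because the typed Dataset API exists only in Scala and Java, not PySpark; Option C is wrong because RDDs require hand-written aggregation logic; Option E is wrong because Datasets are not always faster than DataFrames.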

768

MCQ · hard

A Dataflow streaming pipeline reads from Pub/Sub, applies a ParDo that uses a side input from a BigQuery table (refreshed hourly), and writes to BigQuery. The side input is large and causes increased latency and worker OOM errors. Which design change solves this?

A. Use a stateful ParDo and store the lookup data in an external cache like Cloud Bigtable, performing lookups per element.

B. Increase the side input broadcast frequency to update more often.

C. Split the pipeline into two: one to load the side input, the other to process main input.

D. Use smaller worker machine types to distribute memory across more workers.

Answer: A

External cache reduces per-worker memory footprint and scales well.

Why this answer

Option A is correct because moving the large lookup data to an external cache like Cloud Bigtable offloads memory pressure from workers, eliminating OOM errors. The side input broadcast approach keeps the entire dataset in each worker's memory, which causes OOM when the data is large. Using an external cache allows per-element lookups without storing the entire dataset in memory, reducing latency by avoiding broadcast overhead.
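
The per-element lookup pattern can be sketched without any cloud client: a small bounded cache sits in front of an external read, so each worker holds only its recently used keys. `CachedLookup` and the `fetch` callable are hypothetical stand-ins for a real store client, and the sketch omits expiry, which an hourly-refreshed table would also need.

```python
from collections import OrderedDict


class CachedLookup:
    """Per-element lookups against an external store, with a bounded LRU cache."""

    def __init__(self, fetch, max_entries=1000):
        self.fetch = fetch  # callable: key -> value (e.g. a store read)
        self.max_entries = max_entries
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        value = self.fetch(key)
        self.cache[key] = value
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)  # evict least recently used
        return value
```

Memory use is bounded by `max_entries` regardless of how large the lookup table grows, in contrast to a broadcast side input.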

Exam trap

Google Cloud often tests the misconception that increasing resources (like worker size or frequency) solves memory issues, when the real solution is to avoid storing large datasets in memory altogether by using an external lookup service.

How to eliminate wrong answers

Option B is wrong because increasing the broadcast frequency would make the OOM and latency problems worse, as it would reload the large dataset into memory more often without reducing memory footprint. Option C is wrong because splitting the pipeline into two pipelines does not solve the fundamental issue of storing the large side input in memory; the side input would still need to be broadcast or cached, and the two pipelines would require coordination, adding complexity without addressing memory pressure. Option D is wrong because using smaller worker machine types reduces available memory per worker, which would exacerbate OOM errors and increase latency due to more frequent garbage collection and slower processing.

769

MCQmedium

A team is designing an event-driven data pipeline. They need to process messages from Cloud Pub/Sub, transform them, and write to BigQuery. The messages have variable volume and spikes. What is the best serverless compute option for this workload?

A.Cloud Functions triggered by Pub/Sub

B.Compute Engine with a Pub/Sub client library

C.Cloud Run invoked via Eventarc

D.Cloud Dataflow with a streaming pipeline

AnswerD

Dataflow can handle variable volume, autoscale, and directly read from Pub/Sub and write to BigQuery.

Why this answer

Cloud Dataflow with a streaming pipeline is the best serverless compute option because it is purpose-built for unbounded, variable-volume data streams from Pub/Sub and provides exactly-once processing semantics, auto-scaling, and built-in BigQuery sink integration via the Beam SDK. Unlike simpler compute options, Dataflow handles backpressure, windowing, and state management natively, making it ideal for spikes and high-throughput transformations without manual scaling or idempotency concerns.

Exam trap

Google Cloud often tests the misconception that any serverless compute (like Cloud Functions or Cloud Run) can serve as a streaming data pipeline. These services do scale automatically, but they handle each message in isolation and lack native support for unbounded data, windowing, stateful processing, and checkpointing, which Dataflow provides as a fully managed stream processor.

How to eliminate wrong answers

Option A is wrong because Cloud Functions triggered by Pub/Sub is designed for lightweight, short-lived event processing (max 9 minutes timeout) and cannot handle sustained high-throughput streaming transformations or complex stateful operations like windowing and joins, leading to data loss or timeouts under spikes. Option B is wrong because Compute Engine with a Pub/Sub client library is not serverless—it requires manual provisioning, scaling, and management of VMs, and it lacks native integration with BigQuery for streaming writes, adding operational overhead. Option C is wrong because Cloud Run invoked via Eventarc is a request-response compute model with a 60-minute timeout and concurrency limits; it does not natively support unbounded streaming, checkpointing, or exactly-once processing for Pub/Sub messages, making it unsuitable for variable-volume data pipelines.

770

MCQmedium

A company wants to use AutoML Tables to build a classification model on a dataset with 100 features and 500,000 rows. They need to deploy the model for online predictions with low latency (<100 ms). Which deployment option should they choose?

A.Export the model as a TF SavedModel and deploy on Cloud Run

B.Deploy the model on AI Platform Prediction

C.Deploy the model to an endpoint in Vertex AI using the AutoML endpoint service

D.Use batch prediction in Vertex AI

AnswerC

Vertex AI provides a managed endpoint for AutoML models with low latency prediction.

Why this answer

AutoML Tables supports online prediction endpoints that are deployed on a dedicated cluster, providing low latency for real-time predictions.

771

MCQhard

Based on the exhibit, what is the most likely cause of duplicate rows despite using the same event_id as insertId?

A.BigQuery's streaming buffer deduplication is best-effort and may not catch duplicates within a short time window.

B.The Dataflow pipeline is retrying inserts due to network errors, and the same event_id is not being used in retries.

C.The pipeline is writing more than 100,000 rows per second, exceeding BigQuery's streaming quota.

D.The table is partitioned by timestamp, so BigQuery cannot deduplicate across partitions.

AnswerA

Deduplication on insertId is best-effort, so duplicate inserts can still slip through.

Why this answer

BigQuery's streaming buffer uses best-effort deduplication based on the `insertId` field. Deduplication is attempted only within a short time window (BigQuery documents remembering an `insertId` for at least one minute), and even inside that window it is not guaranteed, especially under high throughput or network retries. Rows inserted with the same `event_id` mapped to `insertId` can therefore still appear more than once. This is a documented limitation of BigQuery streaming, which does not guarantee exactly-once semantics.

Exam trap

Google Cloud often tests the misconception that BigQuery's streaming deduplication is a strong guarantee, when in fact it is best-effort and can fail under concurrent writes or short time windows.

How to eliminate wrong answers

Option B is wrong because if the same `event_id` is not used in retries, BigQuery would treat them as distinct rows and not deduplicate, but the question states the same `event_id` is used as `insertId`; the issue is that deduplication is best-effort, not that the ID is missing. Option C is wrong because exceeding the streaming quota (default 100,000 rows per second per table) would cause ingestion errors or throttling, not duplicate rows; duplicates arise from the buffer's deduplication behavior, not quota limits. Option D is wrong because BigQuery can deduplicate across partitions within the streaming buffer; partitioning does not disable deduplication, and duplicates can occur even in a single partition due to the buffer's best-effort nature.

772

MCQmedium

A team wants to use Cloud Storage to build a data lake with separate zones for raw, curated, and processed data. They need to automatically move objects older than 30 days from the raw zone to a cheaper storage class. How can they achieve this?

A.Set a bucket retention policy that forces deletion after 30 days

B.Write a Cloud Function to delete objects older than 30 days

C.Use gsutil rsync to move objects between buckets

D.Configure a Cloud Storage object lifecycle rule with SetStorageClass action

AnswerD

Lifecycle rules automate class transitions based on age.

Why this answer

Option D is correct because Cloud Storage object lifecycle management rules can automatically transition objects from one storage class to a cheaper one (e.g., from Standard to Nearline or Coldline) based on age. By configuring a rule with the `SetStorageClass` action and a `Condition` of `age: 30`, objects in the raw zone bucket older than 30 days are moved to a lower-cost class without manual intervention or additional compute services.
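As a minimal sketch, the rule described above can be written out as the lifecycle configuration JSON that Cloud Storage accepts. The helper name `storage_class_rule` is hypothetical, and `NEARLINE` is an assumed target class, since the question only asks for a cheaper one.

```python
import json


def storage_class_rule(age_days, storage_class):
    """Build one lifecycle rule that changes storage class once an object reaches age_days."""
    return {
        "action": {"type": "SetStorageClass", "storageClass": storage_class},
        "condition": {"age": age_days},
    }


# NEARLINE is an assumed target class; the question only says "cheaper".
lifecycle_config = {"rule": [storage_class_rule(30, "NEARLINE")]}
lifecycle_json = json.dumps(lifecycle_config, indent=2)
print(lifecycle_json)
```

Saved to a file, a configuration of this shape can be applied to the raw-zone bucket with `gsutil lifecycle set`; the bucket and file names are left to the reader.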

Exam trap

Google Cloud often tests the distinction between lifecycle rules that change storage class versus retention policies that enforce immutability, and candidates may confuse 'move to cheaper storage' with 'delete' or 'retain'.

How to eliminate wrong answers

Option A is wrong because a retention policy prevents deletion or modification of objects until the retention period expires; it does not move objects to a cheaper storage class and would lock the data, not transition it. Option B is wrong because while a Cloud Function could delete objects, the requirement is to move them to a cheaper storage class, not delete them; using a function for this is also less efficient and more complex than a native lifecycle rule. Option C is wrong because `gsutil rsync` synchronizes content between buckets but does not automatically trigger based on age; it requires manual or scheduled execution and does not natively support storage class transitions based on object age.

773

MCQhard

A team deploys a Cloud Run service that processes user-uploaded files. Some requests time out after 60 minutes. They need to handle large files reliably without losing tasks. What is the best solution?

A.Containerize the processing logic and trigger it via Cloud Tasks.

B.Increase the request timeout to 3600 seconds.

C.Use Cloud Functions instead of Cloud Run.

D.Split the file into chunks and process them concurrently.

AnswerA

Cloud Tasks decouples the work from the user's request and retries failed tasks, so work is not lost when a request times out.

Why this answer

Cloud Run has a maximum request timeout of 60 minutes (3600 seconds), so the timeout cannot be raised any further. By containerizing the processing logic and triggering it via Cloud Tasks, you decouple the processing from the user's synchronous upload request: the upload handler enqueues a task and returns immediately, and the queue dispatches the work in the background. Large files are then processed reliably without losing tasks, because Cloud Tasks persists each task and retries it automatically according to the queue's retry configuration.

Exam trap

The trap here is that candidates assume increasing the timeout or switching to Cloud Functions will solve the problem, but the timeout is already at Cloud Run's 60-minute maximum and Cloud Functions offers no higher limit, whereas Cloud Tasks decouples execution from the synchronous request lifecycle and retries work that fails.

How to eliminate wrong answers

Option B is wrong because 3600 seconds is Cloud Run's maximum request timeout (60 minutes), which is the limit the requests are already hitting; it cannot be raised further, so the requests would still time out. Option C is wrong because Cloud Functions offers no longer timeout (9 minutes for 1st gen, up to 60 minutes for 2nd gen HTTP functions), so it does not solve the timeout issue for long-running file processing. Option D is wrong because splitting the file into chunks and processing them concurrently does not address the fundamental timeout limit; each chunk's processing still must complete within the 60-minute timeout, and managing chunk reassembly adds complexity without guaranteeing reliability.

774

MCQmedium

Your team is responsible for operationalizing a series of machine learning models that are trained and deployed using Vertex AI Pipelines. The pipeline consists of several steps including data preprocessing, training with hyperparameter tuning, model evaluation, and deployment to an endpoint. Recently, the pipeline has been failing intermittently at the model evaluation step with an error indicating insufficient memory. The evaluation step uses a custom container with a memory limit of 4 GB. The training step uses 8 GB and completes successfully. You need to resolve the failure without drastically increasing costs. What should you do?

A.Increase the memory limit for the evaluation custom container to 8 GB to match the training step.

B.Optimize the evaluation code to use streaming or incremental processing to reduce peak memory usage.

C.Reduce the batch size used in the evaluation step to lower memory consumption.

D.Use a smaller machine type for the evaluation step to force lower memory usage.

AnswerB

Optimizing the code is a cost-effective long-term solution that addresses the root cause.

775

MCQmedium

An organization needs to restrict access to BigQuery and Cloud Storage so that data can only be accessed from within a specific VPC network and cannot be exfiltrated. Which Google Cloud feature should they use?

A.Private Service Access

B.VPC Service Controls

C.VPC firewall rules

D.IAM conditions

AnswerB

Creates a security perimeter to prevent data exfiltration.

Why this answer

VPC Service Controls (option B) is the correct choice because it creates a security perimeter around Google Cloud services like BigQuery and Cloud Storage, preventing data exfiltration even from within a VPC. It enforces context-aware access based on the VPC network, ensuring data can only be accessed from authorized VPC sources and blocking unauthorized transfers outside the perimeter.

Exam trap

The trap here is that candidates confuse VPC firewall rules (which control network traffic) with VPC Service Controls (which control data access at the API layer), leading them to choose firewall rules because they think 'restricting access to a VPC' is purely a network-level concern.

How to eliminate wrong answers

Option A is wrong because Private Service Access is used to enable private connectivity from a VPC to Google-managed services (e.g., Cloud SQL, Memorystore) via internal IPs, but it does not provide exfiltration prevention or restrict data movement across services. Option C is wrong because VPC firewall rules control network traffic at the packet level (IP addresses, ports, protocols) but cannot prevent data exfiltration via API calls or service-to-service transfers, as they operate at layers 3/4, not at the application layer. Option D is wrong because IAM conditions allow fine-grained access control based on attributes like IP address or time, but they do not create a perimeter around services; they can restrict who can call an API but cannot block data movement between services or prevent exfiltration via authorized credentials.

776

Multi-Selecthard

You are designing a Cloud Spanner schema for a global e-commerce application. The database will include a Customers table and an Orders table. To optimise performance for queries that join Customers with their Orders, which THREE design choices are recommended? (Choose 3.)

Select 3 answers

A.Use a single table for both Customers and Orders and filter by CustomerId.

B.Use interleaved tables: make Orders an interleaved table under Customers.

C.Denormalise the schema by embedding order details into the Customers table as repeated fields.

D.Include CustomerId as the first part of the Orders primary key.

E.Create a secondary index on OrderDate in the Orders table.

AnswersB, D, E

Interleaved tables store Orders rows near their parent Customer, improving join performance.

Why this answer

Spanner interleaved tables store child rows physically close to their parent rows, improving join performance. Interleaving requires the parent's primary key to be the first part of the child's primary key. A secondary index on a non-key column such as OrderDate supports efficient lookups on that column.
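Why the leading `CustomerId` matters can be illustrated by sorting keys. This is only an approximation of a key-ordered layout in plain Python, not Spanner code; the sample keys and the helper `rows_for_customer` are hypothetical.

```python
from bisect import bisect_left

# Rows keyed the way an interleaved schema would key them:
# Customers: (CustomerId,)   Orders: (CustomerId, OrderId)
customers = [(2,), (1,)]
orders = [(2, 501), (1, 101), (1, 102), (2, 500)]

# Sorting by primary key approximates the key-ordered layout: each parent row
# is followed by its child rows because they share the CustomerId prefix.
layout = sorted(customers + orders)


def rows_for_customer(layout, customer_id):
    """Return the contiguous slice holding one customer and all of its orders."""
    start = bisect_left(layout, (customer_id,))
    end = bisect_left(layout, (customer_id + 1,))
    return layout[start:end]


customer_one = rows_for_customer(layout, 1)
```

A customer and its orders form one contiguous range, so a join between the two tables for that customer reads adjacent rows instead of scattered ones.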

777

Multi-Selecthard

Which THREE practices are recommended when designing a Cloud Data Fusion pipeline to ensure efficient execution and monitoring? (Choose three.)

Select 3 answers

A.Manually partition input files to control parallelism.

B.Limit the memory and disk usage per stage to avoid Dataproc node resource exhaustion.

C.Use a dedicated Dataproc cluster for each production pipeline to avoid resource contention.

D.Schedule pipeline runs using Cloud Scheduler and Pub/Sub triggers to avoid manual starts.

E.Set up custom metrics and alerts for pipeline backpressure and latency.

AnswersB, C, E

Resource limits prevent OOM errors and improve stability.

Why this answer

Option B is correct because Cloud Data Fusion pipelines run on Dataproc clusters, and limiting memory and disk usage per stage prevents resource exhaustion on worker nodes. This ensures that no single stage consumes all available resources, which could cause the pipeline to fail or degrade performance. Proper resource limits help maintain stable execution and avoid out-of-memory errors.

Exam trap

Google Cloud often tests the misconception that manual partitioning (Option A) gives better control, but Cloud Data Fusion's auto-partitioning is more efficient and recommended; candidates may also overlook that scheduling (Option D) is about automation, not execution efficiency or monitoring.

778

MCQmedium

A data pipeline uses Cloud Composer to orchestrate Dataflow and BigQuery jobs. The pipeline fails intermittently with dependency errors. Which design change can improve reliability?

A.Use retries with exponential backoff

B.Switch to Cloud Functions for orchestration

C.Increase worker count in Dataflow

D.Use a simpler DAG with fewer dependencies

AnswerA

Retries with backoff handle transient failures, improving reliability.

Why this answer

Cloud Composer (Apache Airflow) tasks can fail due to transient issues like API rate limits or resource contention. Implementing retries with exponential backoff allows the DAG to automatically re-attempt failed tasks with increasing delays, reducing the impact of intermittent failures without manual intervention. This is a standard Airflow pattern for improving reliability in orchestrated pipelines.
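In Airflow this behaviour is normally configured through operator arguments such as `retries`, `retry_delay`, `retry_exponential_backoff`, and `max_retry_delay` rather than written by hand. The sketch below shows the underlying idea in plain Python; `run_with_retries` and `flaky_task` are hypothetical names, and the delay formula is a simplification that omits the jitter a real scheduler may add.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt (capped) and try again."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise                      # retries exhausted: surface the error
            sleep(min(base_delay * 2 ** attempt, max_delay))


# Hypothetical flaky task: fails twice with a transient error, then succeeds.
attempts = {"count": 0}


def flaky_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient API error")
    return "done"


waits = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky_task, retries=3, base_delay=1.0, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the example fast to run; the growing delays give a rate-limited or briefly unavailable API time to recover before the next attempt.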

Exam trap

Google Cloud often tests the distinction between scaling compute resources (Dataflow workers) and improving orchestration reliability (retries), leading candidates to mistakenly choose option C when the problem is transient task failures, not resource bottlenecks.

How to eliminate wrong answers

Option B is wrong because Cloud Functions is a serverless compute service, not a workflow orchestrator; it lacks built-in support for managing task dependencies, retries, and scheduling across multiple services like Dataflow and BigQuery. Option C is wrong because increasing the Dataflow worker count addresses throughput and latency, not dependency errors in the orchestration layer; dependency errors stem from task sequencing or transient failures in Airflow, not from Dataflow parallelism. Option D is wrong because simplifying the DAG reduces complexity but does not handle intermittent failures; the core issue is transient errors, not the number of dependencies, and removing dependencies may break business logic.

779

MCQeasy

A data engineer needs to orchestrate a complex data pipeline that involves multiple steps including data extraction from Cloud Storage, transformation using Dataflow, and loading into BigQuery. The pipeline has dependencies between tasks and requires monitoring and retries. Which Google Cloud service should be used for orchestration?

A.Workflows

B.Cloud Scheduler

C.Cloud Composer

D.Cloud Tasks

AnswerC

Cloud Composer (Airflow) is designed for orchestrating complex pipelines with dependencies.

Why this answer

Cloud Composer is a managed Apache Airflow service that provides a robust platform for orchestrating complex workflows with task dependencies, retries, and monitoring.

780

MCQhard

A financial institution needs to deploy a TensorFlow model for fraud detection with strict latency requirements (<100ms). The model uses custom ops that are not available in standard TF Serving. What is the most appropriate serving solution?

A.Export the model as a SavedModel and serve on Vertex AI Prediction

B.Use Cloud Run with a custom container that includes the model and pre-loads the library

C.Use NVIDIA Triton Inference Server with a custom backend

D.Package the model with Docker using TF Serving and add custom ops via TensorFlow's custom op registration

AnswerC

NVIDIA Triton supports custom backends and is designed for high-performance inference with low latency.

Why this answer

Option C is correct because NVIDIA Triton Inference Server supports custom backends written in C++ or Python, allowing the integration of custom ops that are not available in standard TensorFlow Serving. This enables the model to meet strict latency requirements (<100ms) by leveraging GPU acceleration and optimized inference pipelines, while avoiding the limitations of TF Serving's fixed op registry.

Exam trap

The trap here is that candidates assume TF Serving's custom op registration (Option D) is straightforward, but Google Cloud tests the understanding that TF Serving does not support dynamic loading of custom ops without a custom build, making Triton's backend architecture the correct choice for production-grade latency requirements.

How to eliminate wrong answers

Option A is wrong because Vertex AI Prediction relies on standard TF Serving or custom containers, but exporting as a SavedModel does not automatically include custom ops; Vertex AI would fail to load the model if the custom ops are not registered in its runtime. Option B is wrong because Cloud Run with a custom container can serve the model, but it lacks the specialized inference optimization features (e.g., dynamic batching, model concurrency) needed to guarantee <100ms latency under load, and it does not natively support custom backends for ops. Option D is wrong because TF Serving's custom op registration requires recompiling TF Serving from source with the custom ops linked, which is complex and not supported via standard Docker images; even if done, TF Serving's architecture is less flexible than Triton's custom backend for handling non-standard ops efficiently.

781

MCQhard

A Dataflow pipeline reads from Cloud Pub/Sub and writes to Cloud Storage. The pipeline needs to guarantee exactly-once processing despite worker failures. Which configuration ensures exactly-once semantics?

A.Use a side input from a deduplication dataset

B.Set the pipeline to use a global window with no early triggers

C.Insert a Reshuffle transform after reading

D.Enable exactly-once delivery on the Pub/Sub subscription and use an idempotent sink

AnswerD

Pub/Sub exactly-once delivery and an idempotent Storage write (e.g., using file naming) ensure no duplicates.

Why this answer

Option D is correct because Pub/Sub subscriptions can be configured with exactly-once delivery (using the `enableExactlyOnceDelivery` flag), which ensures that each message is delivered to the subscriber exactly once. Combining this with an idempotent sink (e.g., Cloud Storage with unique filenames or deduplication logic) guarantees that even if a worker fails and the pipeline retries, the output will not contain duplicates. This is the only option that directly addresses both the source and sink to achieve end-to-end exactly-once semantics.
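The idempotent-sink half of this answer can be sketched with deterministic object names. This is a plain-Python illustration in which a dictionary stands in for a Cloud Storage bucket; `object_name`, `write_shard`, and the naming scheme are assumptions, not the naming used by any particular Beam sink.

```python
def object_name(window_start, shard, prefix="output"):
    """Deterministic name: the same window and shard always map to the same object."""
    return f"{prefix}/{window_start}/shard-{shard:05d}.json"


def write_shard(bucket, window_start, shard, lines):
    # A dict stands in for a bucket; writing the same name twice overwrites
    # the object instead of creating a second copy.
    bucket[object_name(window_start, shard)] = "\n".join(lines)


bucket = {}
batch = ['{"id": 1}', '{"id": 2}']
write_shard(bucket, "2024-01-01T00:00", 0, batch)
write_shard(bucket, "2024-01-01T00:00", 0, batch)   # simulated retry after a worker failure
```

Because the retry targets the same object name, the second write replaces the first rather than adding duplicate output.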

Exam trap

Google Cloud often tests the misconception that a single transform (like Reshuffle) or windowing strategy can guarantee exactly-once processing, when in reality it requires both source-level exactly-once delivery and an idempotent sink to handle retries from worker failures.

How to eliminate wrong answers

Option A is wrong because using a side input from a deduplication dataset does not prevent duplicate processing at the source; it only attempts to deduplicate after the fact, which is not a guarantee of exactly-once processing and adds complexity and latency. Option B is wrong because a global window with no early triggers controls when results are emitted, but it does not prevent duplicate messages from being processed due to worker failures or retries. Option C is wrong because a Reshuffle transform (implemented as a GroupByKey followed by an ungrouping step) can help with fault tolerance by breaking fusion, but it does not provide exactly-once semantics; it only ensures that elements are redistributed, not that duplicates are eliminated.

782

MCQeasy

A data engineer needs to design a data processing system that ingests large volumes of sensor data from IoT devices. The data should be stored in a schema-less format and allow for real-time analytics. Which Google Cloud service is most appropriate?

A.Cloud Spanner

B.Firestore

C.Cloud Bigtable

D.Cloud SQL

AnswerC

Bigtable is schema-less, highly scalable, and ideal for time-series sensor data.

Why this answer

Cloud Bigtable is the most appropriate choice because it is a fully managed, scalable NoSQL database designed for large-scale analytical and operational workloads. It supports schema-less storage of time-series sensor data, exposes an HBase-compatible API, and integrates with analytics tools such as Dataflow and BigQuery, meeting the requirements for high-throughput ingestion and low-latency queries.

Exam trap

The trap here is that candidates often confuse Cloud Bigtable with Firestore or Cloud SQL because they all offer NoSQL or relational storage, but fail to recognize that Bigtable is purpose-built for high-throughput, schema-less time-series data and real-time analytics, while the others are optimized for transactional or mobile workloads.

How to eliminate wrong answers

Option A is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database that enforces a fixed schema, making it unsuitable for schema-less IoT data and overkill for real-time analytics at scale. Option B is wrong because Firestore is a document-oriented NoSQL database optimized for mobile and web app real-time synchronization, not for high-throughput ingestion of large volumes of sensor data or analytical workloads. Option D is wrong because Cloud SQL is a managed relational database service (MySQL, PostgreSQL, SQL Server) that requires a predefined schema and cannot handle the petabyte-scale, high-write throughput demands of IoT sensor data without significant performance degradation.

783

MCQeasy

You are designing a streaming Dataflow pipeline that reads from Cloud Pub/Sub. Some data may arrive late due to network delays. You need to ensure that late-arriving data is still processed, but after a certain point, it should be discarded to avoid unbounded state. What is the best practice?

A.Switch to a batch pipeline

B.Use fixed windows without allowed lateness

C.Discard all late-arriving data

D.Set a watermark and allowed lateness

AnswerD

Allowed lateness enables processing of late data within a configurable period, balancing completeness and latency.

Why this answer

Option D is correct because in streaming Dataflow pipelines, setting a watermark and allowed lateness provides a mechanism to handle late-arriving data from Pub/Sub without unbounded state growth. The watermark defines the point after which data is considered late, and allowed lateness specifies how long to wait for late data before discarding it, balancing completeness and state management.
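The interaction between the watermark and allowed lateness can be sketched as a small classifier. This is a simplified plain-Python model with fixed windows and integer timestamps in seconds; the function name `classify` and the exact boundary comparisons are assumptions and do not reproduce Beam's implementation.

```python
def classify(event_time, watermark, window_size, allowed_lateness):
    """Classify an element against its fixed window, the watermark, and allowed lateness."""
    window_end = (event_time // window_size + 1) * window_size
    if watermark < window_end:
        return "on_time"                 # the window is still open
    if watermark < window_end + allowed_lateness:
        return "late_but_processed"      # window closed, but state is still retained
    return "dropped"                     # state was garbage-collected; element is discarded


# 60-second windows with 120 seconds of allowed lateness (illustrative values).
on_time = classify(event_time=30, watermark=50, window_size=60, allowed_lateness=120)
late = classify(event_time=30, watermark=100, window_size=60, allowed_lateness=120)
dropped = classify(event_time=30, watermark=200, window_size=60, allowed_lateness=120)
```

Window state only has to be kept until the watermark passes the window end plus the allowed lateness, which is what keeps state bounded.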

Exam trap

The trap here is that candidates often confuse 'allowed lateness' with simply discarding late data, failing to recognize that it provides a controlled buffer for late arrivals while still bounding state growth.

How to eliminate wrong answers

Option A is wrong because switching to a batch pipeline would lose the streaming, low-latency processing requirement and cannot handle late-arriving data in real time. Option B is wrong because fixed windows without allowed lateness would immediately discard any data arriving after the window end, even if it is only slightly delayed, leading to data loss. Option C is wrong because discarding all late-arriving data is too aggressive and ignores the need to process data that arrives within a reasonable delay, which is common in distributed systems like Pub/Sub.

784

MCQeasy

You need to stream real-time user click events from your application into BigQuery for immediate analysis. The events must be available for query within seconds. Which approach is recommended?

A.Use Pub/Sub to Dataflow to BigQuery with the Storage Write API for high-throughput streaming.

B.Use Cloud Data Fusion to ingest streaming data from Pub/Sub into BigQuery.

C.Use Cloud Functions to receive events from Pub/Sub and insert them into BigQuery using the legacy streaming API.

D.Use Pub/Sub with a BigQuery subscription to directly write events into BigQuery.

AnswerA

This is the recommended architecture: Pub/Sub for ingestion, Dataflow for stream processing, and Storage Write API for low-latency streaming writes.

Why this answer

Pub/Sub to Dataflow to BigQuery using the Storage Write API provides the highest throughput and reliability with near-real-time latency. Google recommends the Storage Write API over the legacy streaming API (`tabledata.insertAll`) for new pipelines. A Pub/Sub BigQuery subscription can write messages directly to a table, but it offers no transformation step, so Dataflow remains the more general recommendation. Cloud Functions is not suitable for high-throughput streaming.

785

MCQmedium

A Cloud SQL instance for PostgreSQL is experiencing heavy read traffic. The team wants to offload read queries while maintaining data consistency. Which solution meets their needs?

A.Create read replicas and direct read queries to them.

B.Increase the tier of the Cloud SQL instance.

C.Create a Cloud SQL failover replica.

D.Use Cloud Memorystore as a cache in front of Cloud SQL.

AnswerA

Read replicas handle read traffic, reducing load on the primary instance.

Why this answer

Read replicas in Cloud SQL for PostgreSQL allow you to offload read traffic from the primary instance by creating one or more asynchronous replicas that serve read queries. This maintains data consistency because replicas use PostgreSQL's native streaming replication, ensuring that all committed transactions on the primary are eventually reflected on the replica, providing a consistent snapshot for read operations.

Exam trap

Google Cloud often tests the distinction between read replicas (for offloading reads) and failover replicas (for high availability), so candidates may confuse the two and incorrectly select the failover replica option.

How to eliminate wrong answers

Option B is wrong because increasing the tier of the Cloud SQL instance only scales the primary instance vertically, which does not offload read traffic — it simply gives the same instance more resources, which may not be sufficient under heavy read loads and does not separate read and write workloads. Option C is wrong because a Cloud SQL failover replica is designed for high availability and automatic failover in case of primary instance failure, not for offloading read queries; it is a synchronous standby that does not serve read traffic. Option D is wrong because Cloud Memorystore as a cache can reduce read load but does not maintain strong data consistency with Cloud SQL; cached data may become stale, and the cache is not a direct offload of read queries from the database — it introduces eventual consistency and requires application-level cache invalidation logic.

786

MCQhard

A company is migrating an on-premises PostgreSQL database to Google Cloud. The database runs complex analytical queries mixed with OLTP workloads. They need PostgreSQL compatibility and want to improve analytical query performance without changing the application. Which database should they choose?

A.BigQuery

B.Cloud Spanner

C.Cloud SQL for PostgreSQL

D.AlloyDB

AnswerD

AlloyDB is PostgreSQL-compatible and uses a columnar engine to accelerate analytical queries while supporting OLTP.

Why this answer

AlloyDB is the correct choice because it is a fully managed PostgreSQL-compatible database service specifically designed for demanding transactional and analytical workloads. It combines the PostgreSQL ecosystem with a columnar engine and adaptive caching to accelerate analytical queries by up to 100x over standard PostgreSQL, all without requiring application changes. This makes it ideal for mixed OLTP and complex analytical queries while maintaining PostgreSQL compatibility.

Exam trap

The trap here is that candidates often choose Cloud SQL for PostgreSQL because it is the most familiar PostgreSQL option, overlooking that AlloyDB is specifically engineered for mixed OLTP and analytical workloads with PostgreSQL compatibility, while Cloud SQL lacks the advanced analytical acceleration features.

How to eliminate wrong answers

Option A is wrong because BigQuery is a serverless data warehouse that is not PostgreSQL-compatible and requires application changes to use its SQL dialect; it is designed for large-scale analytics, not OLTP workloads. Option B is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database that uses a proprietary SQL dialect, not PostgreSQL, and is optimized for horizontal scalability and high availability, not for improving analytical query performance on mixed workloads. Option C is wrong because Cloud SQL for PostgreSQL is a fully managed PostgreSQL service but lacks the built-in columnar engine and adaptive caching needed to significantly accelerate complex analytical queries; it is best suited for standard OLTP workloads, not mixed analytical and transactional demands.

787

MCQmedium

You are building a multi-cloud analytics solution to join data from Google Cloud and AWS S3. You need to query the S3 data using BigQuery without moving it. Which Google Cloud service should you use?

A.Looker

B.Dataproc

C.BigQuery Data Transfer Service

D.BigQuery Omni

AnswerD

BigQuery Omni enables cross-cloud analytics by querying data in AWS S3 and Azure without moving it.

Why this answer

BigQuery Omni allows querying data stored in AWS S3 and Azure Blob Storage using BigQuery SQL, without data movement. The query compute runs in the AWS or Azure region where the data resides.

788

MCQeasy

You have a BigQuery table with sales data and want to pivot product categories into columns. Which SQL clause should you use?

A.UNPIVOT

B.PIVOT

C.ARRAY_AGG with CROSS JOIN

D.STRUCT

AnswerB

PIVOT transforms rows into columns.

Why this answer

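PIVOT is the BigQuery operator that rotates rows into columns, aggregating the values that land in each new column. UNPIVOT does the reverse, rotating columns into rows. ARRAY_AGG and STRUCT build arrays and structs; neither pivots data on its own.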
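
To make the row-to-column rotation concrete, here is a minimal plain-Python sketch of what a query such as `SELECT * FROM sales PIVOT(SUM(amount) FOR category IN ('books', 'toys'))` computes. The table, column names, and values are hypothetical, and the sketch fills missing combinations with 0 where BigQuery would return NULL.

```python
from collections import defaultdict

# Hypothetical sales rows: (quarter, category, amount).
sales = [
    ("Q1", "books", 100),
    ("Q1", "toys", 40),
    ("Q2", "books", 70),
    ("Q2", "toys", 90),
    ("Q2", "toys", 10),
]

def pivot_sum(rows, categories):
    """Rotate category values into columns, summing amounts per quarter."""
    # Simplification: absent combinations default to 0 (BigQuery yields NULL).
    table = defaultdict(lambda: {c: 0 for c in categories})
    for quarter, category, amount in rows:
        if category in categories:
            table[quarter][category] += amount
    return dict(table)

pivoted = pivot_sum(sales, ["books", "toys"])
for quarter, columns in sorted(pivoted.items()):
    print(quarter, columns)
```

Each distinct category listed in the `IN` clause becomes its own column, and the remaining column (quarter here) identifies the output row.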

789

MCQeasy

You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?

A.Cloud Dataflow with Beam SQL

B.Cloud Dataproc with Dataproc on GKE

C.Cloud BigQuery with external tables on Cloud Storage

D.Cloud Dataproc with Cloud Storage and Dataproc Metastore

AnswerD

Dataproc runs Spark and Hive, and Dataproc Metastore provides a shared Hive metastore. Cloud Storage replaces HDFS.

Why this answer

Dataproc is the managed Hadoop/Spark service that allows you to run existing Spark and Hive code without modification. Dataproc Metastore provides a fully managed Hive metastore that can be shared across multiple Dataproc clusters. Cloud Storage is used instead of HDFS for storing data, but the metastore is the key component.

790

MCQeasy

A data engineer is running a Dataproc cluster for a batch ETL job that needs to process 10 TB of data. The job is memory-intensive. The cluster currently uses n1-standard-4 workers. Performance is poor. What is the most cost-effective change to improve performance?

A.Use high-memory machine types (n1-highmem-4)

B.Use preemptible workers to reduce cost

C.Switch to n2-standard-4 machine types

D.Add more n1-standard-4 workers

AnswerA

High-memory machines provide more memory per core, better for memory-bound jobs.

Why this answer

The job is memory-intensive, and n1-standard-4 workers have 15 GB of RAM, which may be insufficient for the workload, causing excessive disk spill or OOM errors. Switching to n1-highmem-4 provides 26 GB of RAM per worker (a 73% increase) without increasing the vCPU count, directly addressing the memory bottleneck at a lower cost than adding more workers. This is the most cost-effective change because it adds memory without paying for additional vCPUs.

Exam trap

The trap here is that candidates often assume adding more workers (scaling out) is always the best way to improve performance, but for memory-intensive jobs, scaling up (using high-memory instances) is more cost-effective because it addresses the root cause—per-worker memory pressure—without wasting resources on additional vCPUs.

How to eliminate wrong answers

Option B is wrong because preemptible workers reduce cost but do not improve performance for a memory-intensive job; they are suitable for fault-tolerant, stateless workloads, not for memory-bound ETL tasks that may fail if preempted. Option C is wrong because n2-standard-4 machine types offer similar memory (16 GB) to n1-standard-4 (15 GB) and only provide a modest CPU performance improvement via newer architecture, which does not address the memory bottleneck. Option D is wrong because adding more n1-standard-4 workers increases total vCPUs and cost but does not increase per-worker memory, so the memory-intensive job will still suffer from the same per-worker memory constraints, leading to inefficient resource utilization.
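
The memory figures quoted above can be checked with a few lines of arithmetic. This sketch hard-codes the vCPU and RAM values cited in the explanation (they are copied from the text, not fetched from any API) and computes memory per vCPU and the relative increase.

```python
# RAM (GB) and vCPU counts as quoted in the explanation above.
MACHINE_TYPES = {
    "n1-standard-4": {"vcpus": 4, "ram_gb": 15},
    "n1-highmem-4": {"vcpus": 4, "ram_gb": 26},
    "n2-standard-4": {"vcpus": 4, "ram_gb": 16},
}

def ram_per_vcpu(machine_type: str) -> float:
    """Memory available per vCPU for the given machine type."""
    spec = MACHINE_TYPES[machine_type]
    return spec["ram_gb"] / spec["vcpus"]

def ram_increase_pct(old: str, new: str) -> float:
    """Percentage RAM increase when moving from `old` to `new`."""
    old_ram = MACHINE_TYPES[old]["ram_gb"]
    new_ram = MACHINE_TYPES[new]["ram_gb"]
    return (new_ram - old_ram) / old_ram * 100

for name in MACHINE_TYPES:
    print(f"{name}: {ram_per_vcpu(name):.2f} GB per vCPU")
print(f"n1-highmem-4 vs n1-standard-4: +{ram_increase_pct('n1-standard-4', 'n1-highmem-4'):.0f}% RAM")
```

With the same four vCPUs, the high-memory type moves from 3.75 GB to 6.5 GB per vCPU, which is the 73% increase cited above; n2-standard-4 adds only about 7%.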

791

MCQmedium

You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?

A.Cloud Dataprep

B.Cloud Dataflow

C.Dataproc

D.Cloud Data Fusion

AnswerA

Dataprep is specifically designed for visual data wrangling with scheduling capabilities.

Why this answer

Cloud Dataprep (Trifacta) provides a visual interface for data wrangling, allowing users to create recipes and schedule jobs.

792

MCQhard

A company uses Cloud Dataflow to process financial transactions from Pub/Sub to BigQuery. The pipeline must ensure exactly-once semantics. Recently, they noticed duplicate rows in BigQuery. The source publishes with at-least-once delivery. The Dataflow pipeline uses idempotent writes. What is the most likely cause?

A.The pipeline uses file loads as a sink

B.The pipeline's watermark is misconfigured

C.The pipeline uses GlobalWindows

D.The pipeline has autoscaling enabled

AnswerB

A misconfigured watermark can cause late data to be processed again, producing duplicates.

Why this answer

The most likely cause is a misconfigured watermark. In Dataflow, the watermark tracks event time progress and determines when to trigger window results. If the watermark is misconfigured (e.g., too aggressive or based on incorrect timestamps), late-arriving data may be processed in multiple windows, leading to duplicate rows even with idempotent writes.

Since the source uses at-least-once delivery, late data can be re-published, and a faulty watermark can cause it to be written again.

Exam trap

The trap here is that candidates assume idempotent writes alone guarantee exactly-once, but they overlook that watermark misconfiguration can cause the same event to be processed in multiple windows, leading to duplicates despite idempotent sinks.

How to eliminate wrong answers

Option A is wrong because file loads can cause duplicates only if a load job is retried, and the question states that the pipeline uses idempotent writes; nothing in the scenario identifies file loads as the sink, and Dataflow's BigQuery streaming inserts attach insert IDs for best-effort de-duplication. Option C is wrong because GlobalWindows do not cause duplicates; they place all data in a single window, and idempotent writes would still prevent duplicate rows. Option D is wrong because autoscaling adjusts the worker count but does not inherently cause duplicate writes; Dataflow handles state and checkpointing correctly during scaling.

793

Multi-Selectmedium

You need to select two BigQuery features that improve query performance by reducing the amount of data read. Which two options accomplish this? (Choose TWO)

Select 2 answers

A.Clustering on commonly filtered columns

B.BI Engine reservation

C.Materialized views

D.Partitioning on a DATE column

E.Approximate aggregation functions

AnswersA, D

Clustering enables block-level pruning, reducing data read.

Why this answer

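Partitioning prunes entire partitions that fall outside the filter, and clustering prunes blocks within the data that remains, so both reduce the bytes read. BI Engine accelerates queries by caching data in memory; it does not reduce the scan size. Materialized views can reduce scans, but they are precomputed results, not a feature of the table being queried.

Approximate aggregation functions reduce computation, not the data read.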

794

MCQmedium

A data engineering team uses Cloud Data Fusion to build ETL pipelines. They have a pipeline that reads from Cloud SQL, transforms data using Wrangler, and writes to BigQuery. The pipeline fails intermittently with a 'connection timeout' error from Cloud SQL. What is the best way to handle this?

A.Use Cloud NAT to provide a static IP for Data Fusion to whitelist.

B.Configure the Cloud SQL connector in Data Fusion to use retry logic and increase the connection timeout.

C.Increase the number of Data Fusion nodes to distribute the load.

D.Migrate Cloud SQL to Cloud Spanner to handle higher concurrency.

AnswerB

Retries and longer timeouts handle transient failures.

Why this answer

Option B is correct because Cloud Data Fusion's Cloud SQL connector can be configured with retry logic and an increased connection timeout to handle transient network issues. This directly addresses the intermittent 'connection timeout' error without requiring architectural changes, as the error is likely due to brief network latency or resource contention, not a persistent connectivity problem.

Exam trap

The trap here is that candidates often assume connectivity issues require network-level fixes (like static IPs or NAT) or scaling, rather than recognizing that transient timeouts are best handled by application-level retry and timeout configuration.

How to eliminate wrong answers

Option A is wrong because using Cloud NAT to provide a static IP for whitelisting addresses IP-based access control, but the error is a connection timeout, not an authorization failure; whitelisting does not resolve transient network delays. Option C is wrong because increasing the number of Data Fusion nodes distributes compute load but does not fix connection timeouts to Cloud SQL, which are caused by network or database-side issues, not pipeline parallelism. Option D is wrong because migrating to Cloud Spanner is an overengineered solution for a transient timeout; it introduces unnecessary complexity and cost, and does not address the root cause of intermittent connectivity.

795

MCQhard

A company uses Dataplex to manage data lakes on Google Cloud. They want to enforce data quality rules on a BigQuery table, such as ensuring that a 'email' column is not null and matches a regex pattern. Which Dataplex feature should they use?

A.Dataplex Universal Catalog

B.Dataplex Lake

C.Dataplex Data Quality

D.Dataplex Data Lineage

AnswerC

Correct: Data Quality is the feature for defining and running quality rules.

Why this answer

Dataplex Data Quality is a feature that allows you to define and run data quality checks on BigQuery tables. You can create a Data Quality Task using the Dataplex UI or API to specify rules like NOT_NULL and REGEX.
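
As an illustration of what NOT_NULL and REGEX rules check, the following plain-Python sketch evaluates both on a handful of rows. It does not call Dataplex; the sample rows, the helper names, and the deliberately simple email pattern are all assumptions made for the example.

```python
import re

# Hypothetical rows standing in for a BigQuery table.
rows = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "not-an-email"},
]

# Deliberately simple pattern for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def not_null(value) -> bool:
    """NOT_NULL-style rule: the value must be present."""
    return value is not None

def matches_regex(value) -> bool:
    """REGEX-style rule: the value must be present and match the pattern."""
    return value is not None and EMAIL_PATTERN.fullmatch(value) is not None

def evaluate(rows, column, rule):
    """Return (passed_count, failing_ids) for one rule on one column."""
    failing = [r["id"] for r in rows if not rule(r[column])]
    return len(rows) - len(failing), failing

print("NOT_NULL:", evaluate(rows, "email", not_null))
print("REGEX:", evaluate(rows, "email", matches_regex))
```

A managed data quality scan reports the same kind of result per rule: how many rows passed and which ones failed.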

796

Multi-Selectmedium

A company wants to ensure high availability for their Cloud SQL instance. Which TWO actions are most appropriate? (Choose two.)

Select 2 answers

A.Create a read replica in a different region.

B.Configure a failover replica in the same region.

C.Enable automatic backups with a retention period of 7 days.

D.Increase the instance's memory and storage size.

E.Set up horizontal scaling with multiple read replicas.

AnswersA, C

A cross-region read replica can be promoted to a standalone instance in a disaster, providing DR.

Why this answer

Options A and C are correct: a read replica in a different region provides disaster recovery, and automatic backups allow the instance to be restored and underpin point-in-time recovery. Option B (failover replica in the same region) provides HA but not DR. Option D (increased memory and storage) does not improve availability.

Option E (horizontal scaling with read replicas) does not provide failover for writes.

797

Multi-Selectmedium

Which TWO actions can help reduce the latency of a Vertex AI endpoint serving a large neural network model?

Select 2 answers

A.Use a larger machine type with more CPU cores

B.Enable model compression with quantization

C.Increase the number of model versions deployed on the same endpoint

D.Deploy the model on a machine type with GPU accelerators

E.Use a smaller batch size for prediction requests

AnswersD, E

GPUs speed up neural network inference.

Why this answer

Option D is correct because GPU accelerators are specifically designed to handle the parallel computations required by large neural networks, significantly reducing inference latency compared to CPU-only machines. Vertex AI endpoints with GPUs can process multiple predictions concurrently, which is critical for deep learning models where matrix operations dominate the workload. Option E also helps: a smaller batch per prediction request means less computation per request, so each response returns sooner.

Exam trap

Google Cloud often tests the misconception that more CPU cores or model compression always reduce latency, but the trap here is that for large neural networks, the primary bottleneck is parallel compute capability, which only GPUs or TPUs can address effectively.

798

MCQhard

A data engineer is designing a Bigtable row key for a time-series application that records temperature sensor readings every second. To avoid hotspotting, they want to distribute writes across all nodes. Which row key design is best?

A.[timestamp reversed]#[sensor_id]

B.[sensor_id]#[timestamp]

C.[hash of sensor_id]#[timestamp]

D.[timestamp]#[sensor_id]

AnswerC

Hashing distributes writes across tablets, avoiding hotspotting.

Why this answer

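Hotspotting occurs when sequential keys send writes to a single tablet server. Prepending a hash of the sensor ID spreads writes across the key space while keeping each sensor's readings contiguous.

A timestamp-first key causes hotspotting because every sensor writes to the same narrow key range each second; reversing the timestamp changes the sort order without removing that concentration. Sensor ID followed by timestamp can still concentrate writes if there are few sensors or their IDs are sequential.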

799

MCQmedium

A retail company is building a recommendation engine that requires processing customer clickstream data in near real-time. The data is ingested via Pub/Sub, and must be joined with a lookup table of product details (updated daily) before being used for model inference. Which design pattern should they use?

A.Enrich the stream by querying BigQuery for each event using a Cloud Function.

B.Use a Dataflow pipeline that reads from Pub/Sub and uses a side input from a regularly refreshed PCollection loaded from Cloud Storage.

C.Store product details in Cloud Memorystore (Redis) and have the streaming application look up each event.

D.Write events to BigQuery and use scheduled queries to join with the product table in batch.

AnswerB

Side inputs enable efficient streaming-batch joins within Dataflow.

Why this answer

Option B is correct because Dataflow can read streaming data from Pub/Sub and use a side input from a regularly refreshed PCollection loaded from Cloud Storage. This pattern allows the product lookup table (updated daily) to be periodically reloaded into the pipeline as a side input, enabling efficient, low-latency enrichment of each event without per-event external calls or batch delays.

Exam trap

Google Cloud often tests the distinction between streaming enrichment patterns that require external lookups (which add latency and cost) versus using side inputs for static or slowly-changing reference data, leading candidates to mistakenly choose a cache-based solution like Redis when the data is already available in Cloud Storage.

How to eliminate wrong answers

Option A is wrong because querying BigQuery for each event via a Cloud Function would introduce high latency and cost due to per-event query overhead, and BigQuery is not designed for real-time point lookups. Option C is wrong because while Cloud Memorystore (Redis) provides low-latency lookups, it requires managing a separate cache and does not natively integrate with the daily-updated Cloud Storage file; the pattern also lacks the automatic refresh mechanism that side inputs provide. Option D is wrong because writing events to BigQuery and using scheduled queries for batch joins introduces significant latency (minutes to hours), which violates the near real-time requirement for the recommendation engine.

800

MCQmedium

A media company ingests video files from partners via a REST API. Files are stored in Cloud Storage, and metadata is written to Firestore. A Cloud Function is triggered on object finalize to transcode video using Transcoder API. Sometimes, the function fails because the file is still being uploaded when triggered. How should this be fixed?

A.Implement a Cloud Composer workflow to poll for file existence.

B.Require partners to use resumable uploads.

C.Increase the Cloud Functions timeout to allow time for the upload to finish.

D.Use Cloud Pub/Sub notifications for Cloud Storage and trigger the function from the subscription.

AnswerD

Pub/Sub notifications are sent after object finalization.

Why this answer

Option D is correct because Cloud Storage object finalize notifications are sent only after the entire file has been written and committed. By using Pub/Sub notifications for Cloud Storage and triggering the Cloud Function from the subscription, you decouple the trigger from the upload process, ensuring the function only runs when the file is fully available. This eliminates the race condition where the function is triggered before the upload completes.

Exam trap

The trap here is that candidates assume 'object finalize' means the upload is complete, but in practice, the event can fire before the upload is fully committed, leading to the misconception that increasing timeouts or changing upload methods will fix the issue.

How to eliminate wrong answers

Option A is wrong because implementing a Cloud Composer workflow to poll for file existence adds unnecessary complexity, latency, and cost; polling is an inefficient solution compared to event-driven triggers. Option B is wrong because requiring partners to use resumable uploads does not change the fact that the Cloud Function is triggered on object finalize before the upload is fully committed; resumable uploads affect the upload mechanism, not the timing of the finalize event. Option C is wrong because increasing the Cloud Functions timeout does not address the root cause—the function is triggered prematurely; the function will still fail if the file is incomplete, regardless of how long it runs.

801

MCQhard

A multinational e-commerce company runs a real-time recommendation system. The architecture: user click events are sent via HTTP to a Cloud Run service, which publishes them to a Cloud Pub/Sub topic. A Dataflow streaming pipeline reads from the subscription, joins with user profile data from Firestore, computes recommendations using a TensorFlow model (loaded as a side input), and writes results to a Redis cache (Memorystore) for low-latency serving. The pipeline is deployed in us-central1. Recently, the team noticed that recommendation latency has increased from 50ms to 500ms, and the pipeline's backlog is growing. The Dataflow monitoring shows high CPU utilization on workers, and the SystemLag metric is 2 minutes and increasing. The Redis cluster shows no performance issues. The Firestore queries are within normal latency. The team suspects the TensorFlow model inference is the bottleneck. The model is a large neural network (500MB) loaded in each worker's memory. The pipeline uses 10 n1-standard-4 workers. The pipeline is using Dataflow's streaming engine. The team wants to reduce latency without increasing cost significantly. What should they do?

A.Increase the number of workers by adding a secondary worker group with preemptible VMs.

B.Switch to a batch pipeline that runs every minute to reduce frequency of inference.

C.Increase the machine type of workers to n1-highmem-8 to provide more memory for the model.

D.Remove the model side input and call Cloud Run for inference using a separate service.

AnswerA

More workers parallelize inference, preemptible VMs keep cost low.

Why this answer

Option A is correct because adding preemptible VMs as a secondary worker group allows horizontal scaling at lower cost, distributing the TensorFlow model inference load across more workers. This reduces CPU utilization per worker and decreases the SystemLag without significantly increasing cost, as preemptible VMs are much cheaper than regular instances. The bottleneck is CPU-bound model inference, not memory, so more workers directly address the high CPU utilization and growing backlog.

Exam trap

The trap here is that candidates assume a memory issue (option C) because the model is large, but the real bottleneck is CPU utilization from repeated inference, not memory exhaustion.

How to eliminate wrong answers

Option B is wrong because switching to a batch pipeline would increase latency (from seconds to minutes) and is unsuitable for real-time recommendations; the team needs low-latency streaming, not batch. Option C is wrong because the issue is high CPU utilization, not memory pressure; the model is 500MB and n1-standard-4 has 15GB RAM, which is sufficient, so increasing memory does not address the CPU bottleneck and increases cost unnecessarily. Option D is wrong because removing the model side input and calling Cloud Run for inference adds network latency and cost per request, likely worsening latency and increasing cost, and does not leverage Dataflow's in-memory model loading for efficiency.

802

MCQhard

A data scientist wants to import a pre-trained TensorFlow model into BigQuery ML for batch predictions. The model is stored in a Cloud Storage bucket. Which statement is correct?

A.Use CREATE MODEL with model_type='tensorflow' and model_path='gs://bucket/model'.

B.Use CREATE MODEL with model_type='imported_tensorflow' and model_path='gs://bucket/model'.

C.First upload the model to Vertex AI Model Registry, then reference it in BigQuery ML.

D.Use the ML.IMPORT_MODEL function to load the model into BigQuery.

AnswerA

This is the correct syntax for importing a TensorFlow model.

Why this answer

BigQuery ML supports importing TensorFlow models via CREATE MODEL with model_type='tensorflow' and model_path pointing to the SavedModel directory in Cloud Storage.
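
As a rough sketch, the statement can be assembled as shown below. The dataset, model, and bucket names are placeholders, the helper function is hypothetical, and the trailing `/*` on the path is an assumption about how the SavedModel directory is referenced; the resulting string would be run as a BigQuery query.

```python
def import_tensorflow_model_ddl(model_name: str, gcs_path: str) -> str:
    """Build a CREATE MODEL statement that imports a TensorFlow SavedModel."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (MODEL_TYPE = 'TENSORFLOW', MODEL_PATH = '{gcs_path}')"
    )

# Placeholder dataset, model, and bucket names.
ddl = import_tensorflow_model_ddl("my_dataset.imported_tf_model", "gs://bucket/model/*")
print(ddl)
```

Once the model exists, batch predictions are issued with `ML.PREDICT` against it like any other BigQuery ML model.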

803

MCQhard

Your company runs a batch data processing pipeline using Cloud Dataproc and Cloud Composer. The pipeline processes hundreds of terabytes of data daily. Recently, the pipeline has been failing intermittently due to Dataproc cluster creation errors: 'Insufficient resources to create cluster in zone us-central1-f.' The project has a global quota of 1000 vCPUs for Compute Engine. The team usually uses n2-standard-8 (8 vCPU) worker nodes. You notice that the error occurs during peak usage times. You need to ensure the pipeline runs reliably without increasing the global quota. Which action should you take?

A.Increase the global Compute Engine quota to 2000 vCPUs

B.Switch to using preemptible VMs only, which have higher availability

C.Use fewer workers with larger machine types, such as n2-standard-64

D.Configure the Dataproc cluster to use multiple zones via the --zone argument with a zonal list

AnswerD

Spreading across zones avoids zonal capacity issues.

Why this answer

Option D is correct because the 'Insufficient resources' error reflects capacity in a single zone (us-central1-f), not the project's quota. Allowing the cluster to be placed in any of several zones in the same region lets Dataproc use a zone that still has capacity at peak times, with no quota increase. The cluster itself still runs in one zone; the list only widens the set of candidates. Dataproc's Auto Zone placement, which selects a zone when only a region is specified, provides the same behavior.

Exam trap

The trap here is that candidates often assume the only solution to resource exhaustion is to increase quotas or switch to preemptible VMs, overlooking the zonal distribution feature that directly addresses the 'Insufficient resources' error without changing the global quota.

How to eliminate wrong answers

Option A is wrong because the question explicitly states you must not increase the global quota; raising it to 2000 vCPUs would violate that constraint and does not address the zonal resource exhaustion issue. Option B is wrong because preemptible VMs have lower availability (they can be reclaimed at any time) and are not suitable as the only worker type for a reliable production pipeline processing hundreds of terabytes daily; they also do not solve the zone-specific capacity shortage. Option C is wrong because using fewer workers with larger machine types (e.g., n2-standard-64) does not reduce the total vCPU count required for the workload; it may even increase the risk of hitting the global quota per cluster creation request and does not mitigate the zonal resource exhaustion.
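
The point about option C can be checked with simple arithmetic: larger machines change the worker count, not the vCPU total. In this sketch the 1000 vCPU quota comes from the question, while the 512 vCPU job size is a made-up figure for illustration.

```python
QUOTA_VCPUS = 1000  # quota stated in the question

def max_workers(vcpus_per_worker: int, quota: int = QUOTA_VCPUS) -> int:
    """Largest number of workers of this size that fits under the quota."""
    return quota // vcpus_per_worker

def total_vcpus(workers: int, vcpus_per_worker: int) -> int:
    """Total vCPUs consumed by a worker pool."""
    return workers * vcpus_per_worker

# Hypothetical job sized at 512 vCPUs (not a figure from the question).
job_vcpus = 512
for name, size in [("n2-standard-8", 8), ("n2-standard-64", 64)]:
    workers = job_vcpus // size
    print(name, "workers:", workers,
          "total vCPUs:", total_vcpus(workers, size),
          "max under quota:", max_workers(size))
```

Both shapes consume the same 512 vCPUs, so neither quota usage nor the demand placed on the constrained zone changes.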

804

MCQhard

You are designing a system to serve predictions from a large language model (LLM) with a latency SLO of 500ms. The model does not fit on a single GPU and requires model parallelism. You are considering using Vertex AI Endpoints with a custom container. What additional setup is required to achieve the latency target?

A.Compile the model using TensorFlow XLA to optimize for single GPU execution.

B.Deploy the model across multiple endpoints and use a load balancer to send requests to different parts of the model.

C.Use Vertex AI Prediction as a service for LLMs, which automatically handles hardware selection.

D.Use a machine type with multiple GPUs and configure the container to use tensor parallelism.

AnswerD

Leveraging multiple GPUs on one node via model parallelism (e.g., tensor parallelism) is the standard approach to fit large models and meet latency.

Why this answer

Option D is correct because the model does not fit on a single GPU and requires model parallelism. Using a machine type with multiple GPUs and configuring the container to use tensor parallelism allows the model to be split across GPUs within a single instance, enabling efficient parallel computation to meet the 500ms latency SLO. Tensor parallelism splits individual tensor operations across GPUs so that all of them work on the same request at once, which lowers per-request latency compared to pipeline parallelism but depends on fast GPU-to-GPU communication within the instance. It is a standard approach for large models on multi-GPU instances in Vertex AI.

Exam trap

Google Cloud often tests the misconception that load balancing across multiple endpoints is equivalent to model parallelism, but this ignores the critical requirement for low-latency inter-GPU communication within a single instance, which tensor parallelism provides.

How to eliminate wrong answers

Option A is wrong because TensorFlow XLA compiles and optimizes computation graphs for single-device execution, but the model does not fit on a single GPU, so XLA cannot solve the memory constraint or enable model parallelism across multiple GPUs. Option B is wrong because deploying the model across multiple endpoints and using a load balancer would require splitting the model into separate services, introducing significant network latency between endpoints and breaking the model's internal state, making it impractical for the tight 500ms SLO. Option C is wrong because Vertex AI Prediction as a service for LLMs does not automatically handle custom model parallelism configurations; it provides pre-built endpoints for specific model architectures but does not support arbitrary custom containers that require tensor parallelism setup.

805

Multi-Selectmedium

A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?

Select 2 answers

A.Set max delivery attempts to 5 on the subscription.

B.Set the subscription's ack deadline to 600 seconds.

C.Enable message ordering on the subscription.

D.Create a dead letter topic and attach it to the subscription.

E.Set the subscription type to push.

AnswersA, D

Required to trigger dead letter after attempts.

Why this answer

Dead letter topics require setting max delivery attempts and specifying a dead letter topic.
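
The two settings live together in the subscription's dead letter policy. The sketch below shows them in the shape of the Pub/Sub REST subscription resource, followed by a toy model of redelivery; the project and topic names are placeholders, and the `deliver` function is a simplified illustration, not how the service is implemented.

```python
# Subscription settings in the shape of the Pub/Sub REST resource.
# Project and topic names are placeholders.
subscription_config = {
    "topic": "projects/my-project/topics/events",
    "deadLetterPolicy": {
        "deadLetterTopic": "projects/my-project/topics/events-dead-letter",
        "maxDeliveryAttempts": 5,
    },
}

def deliver(message, handler, config):
    """Toy model of redelivery: retry until ack or attempts are exhausted."""
    policy = config["deadLetterPolicy"]
    for attempt in range(1, policy["maxDeliveryAttempts"] + 1):
        try:
            handler(message)
            return ("acked", attempt)
        except Exception:
            continue  # a nack or a missed ack deadline leads to redelivery
    return ("dead-lettered", policy["deadLetterTopic"])

def always_fails(message):
    raise ValueError("cannot process")

print(deliver({"id": "m1"}, always_fails, subscription_config))
print(deliver({"id": "m2"}, lambda m: None, subscription_config))
```

Five is also the smallest value Pub/Sub accepts for the maximum delivery attempts, so the requirement in the question sits at the lower bound of the setting.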

806

MCQhard

A company wants to use Cloud DLP to inspect data in BigQuery for sensitive information and de-identify it by masking credit card numbers. They want to perform this on a schedule. Which approach should they take?

A.Use Dataplex data quality rules with a custom SQL regex

B.Use Cloud Data Loss Prevention API with Cloud Composer

C.Use BigQuery column-level security with classification

D.Use Cloud DLP inspect and de-identify jobs triggered by Cloud Scheduler

AnswerD

DLP supports scheduled inspection and de-identification via Cloud Scheduler.

Why this answer

Cloud DLP can inspect BigQuery tables and de-identify using transforms like masking. Scheduling can be done via Cloud Scheduler.

807

Multi-Selectmedium

You are building a data pipeline that ingests data from on-premises into Cloud Storage, then processes it with Dataproc, and finally loads into BigQuery. You need to schedule the pipeline to run daily. The pipeline must handle occasional failures gracefully. Which THREE Google Cloud services should you use together to achieve this? (Choose 3)

Select 3 answers

A.Cloud Storage

B.Dataproc

C.Cloud Composer

D.Dataflow

E.Pub/Sub

AnswersA, B, C

Storage is the landing zone for raw data.

Why this answer

Cloud Composer orchestrates the whole pipeline. Cloud Storage is the staging area. Dataproc processes data.

BigQuery is the destination. Dataflow is not needed. Pub/Sub is for messaging, not scheduling.

808

MCQeasy

A data engineer needs to query a BigQuery table that contains an array of structs. They want to expand the array into separate rows for each element. Which SQL function should they use?

A.STRUCT

B.UNNEST

C.ARRAY_AGG

D.SPLIT

AnswerB

UNNEST expands an array into rows; it is typically used with CROSS JOIN.

Why this answer

UNNEST is used to flatten arrays into a set of rows. CROSS JOIN UNNEST is standard. STRUCT is for creating structs, ARRAY_AGG is for aggregation, and SPLIT is for strings.

The question asks to expand an array, which is exactly UNNEST.
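
A plain-Python sketch of the same flattening may help: each element of the array becomes its own output row, joined to its parent row's other columns. The order data and field names are hypothetical.

```python
# Hypothetical rows: each order holds an array of item structs.
orders = [
    {"order_id": 1, "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"order_id": 2, "items": [{"sku": "C", "qty": 5}]},
    {"order_id": 3, "items": []},
]

def cross_join_unnest(rows, array_field):
    """Emit one output row per array element, like CROSS JOIN UNNEST."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != array_field}
        for element in row[array_field]:
            yield {**parent, **element}

flattened = list(cross_join_unnest(orders, "items"))
for row in flattened:
    print(row)
```

Order 3 produces no output because its array is empty, which mirrors CROSS JOIN behavior; a LEFT JOIN against the unnested array would keep that row with NULL item fields.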

809

MCQhard

You are designing a row key for Cloud Bigtable to store user activity logs. Each log entry has a timestamp (millisecond precision) and a user ID. There will be millions of writes per second from many users. To avoid hotspotting, which row key design is BEST?

A.timestamp_millis#hash(userID)

B.timestamp_millis#userID

C.userID#timestamp_millis

D.hash(userID)#userID#timestamp_millis

AnswerD

Hashing the userID distributes writes evenly. Including userID and timestamp enables efficient queries per user over time.

Why this answer

Option D is best because it uses a hash of the user ID as the row key prefix, which distributes writes across all Bigtable nodes and avoids hotspotting. Appending the user ID and timestamp ensures uniqueness and supports efficient queries for a specific user's logs. This design prevents the sequential timestamp from creating a single hot node, which is critical for handling millions of writes per second.

Exam trap

Google Cloud often tests the misconception that placing the most selective or unique field first (like timestamp) is best for queries, but in Bigtable the row key design must prioritize write distribution over read optimization to avoid hotspotting.

How to eliminate wrong answers

Option A is wrong because placing the timestamp first causes all writes for the same millisecond to hit a single tablet server, creating a hotspot. Option B is wrong because using the raw timestamp as the prefix leads to sequential writes that overload one node, negating Bigtable's horizontal scaling. Option C is wrong because a raw userID prefix distributes writes only if the user IDs are themselves well spread; when IDs are sequential or otherwise predictable, neighboring users land in the same key range, which is the skew the hash prefix in option D is meant to prevent.
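
A minimal sketch of the option D key layout follows. The choice of SHA-256, the four-character prefix length, and the zero-padded timestamp are assumptions made for illustration; the question does not prescribe a hash function or an encoding.

```python
import hashlib

def make_row_key(user_id: str, timestamp_millis: int, prefix_len: int = 4) -> str:
    """Build hash(userID)#userID#timestamp_millis as a string row key."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Zero-pad the timestamp so lexicographic order matches numeric order.
    return f"{digest[:prefix_len]}#{user_id}#{timestamp_millis:013d}"

keys = [
    make_row_key("user-1001", 1700000000000),
    make_row_key("user-1001", 1700000000500),
    make_row_key("user-1002", 1700000000000),
]
for key in sorted(keys):
    print(key)
```

All keys for one user share the same prefix, so a prefix scan returns that user's log entries in time order, while different users are scattered across the key space by the hash.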

810

MCQhard

A company uses BigQuery to store event data. They need to load data from multiple sources with different schemas and expect frequent schema changes. Which approach provides the most flexibility for schema evolution while minimizing load failures and performance impact?

A.Load data as JSON files in Cloud Storage and use external tables

B.Use the Storage Write API in buffered mode with schema auto-detection

C.Use legacy streaming inserts with schema auto-detect enabled

D.Use Dataflow to preprocess and write to BigQuery using Storage Write API in committed mode

AnswerB

Buffered mode allows schema updates and auto-detection, reducing failures and handling schema evolution well.

Why this answer

Using the Storage Write API with buffered mode allows schema auto-detection and flexible schema updates without failing loads, and provides better performance than legacy streaming inserts.

811

Multi-Selectmedium

A data engineer is designing a batch processing pipeline that runs daily. The pipeline reads CSV files from GCS, transforms them using Python, and writes the results to BigQuery. They need to parameterize the pipeline for different environments and run it on a schedule. Which TWO components should they use? (Choose 2)

Select 2 answers

A.Cloud Composer

B.Dataproc

C.Dataflow Flex Template

D.Storage Transfer Service

E.Cloud Functions

AnswersA, C

Cloud Composer schedules and orchestrates the pipeline.

Why this answer

Cloud Composer (A) is correct because it is a managed Apache Airflow service that natively supports scheduling, parameterization, and orchestration of batch pipelines. It allows you to define DAGs that run daily, pass environment-specific parameters via Airflow variables or environment configurations, and trigger Python transforms or Dataflow jobs on a schedule. A Dataflow Flex Template (C) packages the Python pipeline so that each run can be launched with environment-specific parameters.

Exam trap

Google Cloud often tests the distinction between orchestration/scheduling services (Cloud Composer) and compute/processing services (Dataproc, Cloud Functions), leading candidates to mistakenly choose Dataproc for scheduling or Cloud Functions for batch processing.

812

Matchingmedium

Match each Google Cloud monitoring/logging service to its function.

Concepts

Matches

Metrics and alerting for cloud resources

Centralized log storage and analysis

Aggregates and analyzes application errors

Records administrative and data access activities

Why these pairings

Services for observability and compliance.

813

MCQhard

A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?

A.Use dataset-level ACLs to deny the partner access to the original table and grant access to a view.

B.Export the filtered data to a new BigQuery dataset and grant the partner access to that dataset.

C.Create an authorized view in the same dataset, excluding PII columns, and share only the view with the partner's BigQuery account.

D.Create a materialized view and grant the partner the bigquery.dataViewer role on the dataset.

AnswerC

Authorized views allow precise control. The view is defined in the dataset and can be shared with specific users. The partner can query the view but cannot access the underlying table, because access is granted on the view only. The company retains full control.

Why this answer

Authorized views allow you to share a view with specific users/groups while restricting access to the underlying table. The view can be defined to exclude PII columns and can be shared with the partner's account. The partner queries the view directly, but cannot access the base table.

This maintains data control and security.

814

MCQmedium

A data pipeline uses Cloud Pub/Sub to ingest events and Cloud Functions to transform and write to BigQuery. The system is experiencing data loss during Pub/Sub subscription outages. Which design change improves reliability?

A.Use Dataflow with at-least-once delivery and checkpointing

B.Use a pull subscription with a custom app that polls frequently

C.Use long ack deadlines to keep messages in the subscription

D.Increase the timeout in Cloud Functions

AnswerA

Dataflow checkpoints its progress and acknowledges Pub/Sub messages only after processing them, so unacknowledged messages are replayed rather than lost.

Why this answer

Dataflow with at-least-once delivery and checkpointing ensures that messages are not lost during Pub/Sub subscription outages because Dataflow tracks processing progress via checkpoints and can replay unacknowledged messages from the last checkpoint. This decouples the processing from the subscription's transient failures, providing fault-tolerant, exactly-once or at-least-once semantics depending on the sink.

Exam trap

Google Cloud often tests the misconception that increasing timeouts or ack deadlines alone can prevent data loss, when in reality they only delay the inevitable loss without a replay mechanism like checkpointing or a persistent buffer.

How to eliminate wrong answers

Option B is wrong because a pull subscription with a custom app that polls frequently does not inherently provide reliability; the app would have to implement its own checkpointing, retries, and scaling, and a crash after acknowledging a message but before writing it to BigQuery still loses that message. Option C is wrong because long ack deadlines only delay redelivery; Pub/Sub redelivers a message whose deadline expires without an ack, so a longer deadline adds no durability, and unacknowledged messages are still dropped once the subscription's retention period passes. Option D is wrong because increasing the timeout in Cloud Functions does not address data loss from subscription outages; it only allows the function to run longer before timing out, and does not provide replay or checkpointing mechanisms.

815

Multi-Selectmedium

A company stores sensitive customer data in BigQuery. They need to implement column-level security to restrict access to personally identifiable information (PII) columns. Which two BigQuery features can they use together? (Choose TWO)

Select 2 answers

A.BigQuery row-level security

B.BigQuery foreign key constraints

C.BigQuery data masking

D.Authorized views

E.BigQuery column-level access control using policy tags

AnswersD, E

Policy tags restrict access to tagged columns, and authorized views can expose only a subset of columns.

Why this answer

Column-level security is enforced with policy tags (defined in Data Catalog taxonomies) attached to the PII columns. Authorized views can additionally expose only the non-PII columns to users who have no access to the underlying table.

816

MCQmedium

An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?

A.Preemptible instances cannot be used with the standard cluster mode

B.Jobs must be designed to handle node preemption, and overall job runtime may increase

C.Preemptible instances are only available in certain regions

D.Jobs will automatically restart from the last checkpoint without any performance impact

AnswerB

Preemption can cause task re-execution, so fault tolerance is required and runtime may increase.

Why this answer

Preemptible VMs can be terminated at any time, so Spark jobs must be fault-tolerant. Dataproc handles this by automatically rescheduling failed tasks, but the job may take longer.

817

MCQeasy

A company needs to deploy a trained model for real-time predictions with low latency. Which Vertex AI resource should they use?

A.Cloud TPU

B.Vertex AI Batch Prediction

C.Vertex AI Endpoints

D.Cloud Run

AnswerC

Endpoints provide real-time model serving with low latency.

Why this answer

Vertex AI Endpoints are designed for online prediction, providing a managed service that hosts models for real-time inference with low latency. They automatically scale resources and handle traffic routing, making them the correct choice for deploying a trained model that needs to respond to individual prediction requests quickly.

Exam trap

Google Cloud often tests the distinction between batch and online prediction, and the trap here is that candidates confuse Vertex AI Batch Prediction (which is for offline, large-scale inference) with the real-time serving capability of Vertex AI Endpoints, leading them to select option B.

How to eliminate wrong answers

Option A is wrong because Cloud TPUs are specialized hardware accelerators for training and batch inference, not a deployment service for real-time predictions; they require manual management and are not designed for low-latency serving of individual requests. Option B is wrong because Vertex AI Batch Prediction is intended for asynchronous, high-throughput predictions on large datasets, not for real-time, low-latency responses; it processes jobs in batches and returns results to a storage location. Option D is wrong because Cloud Run is a serverless compute platform for containerized applications, but it lacks the native model hosting, versioning, and traffic splitting capabilities that Vertex AI Endpoints provide for machine learning models.

818

Multi-Selecteasy

Your company is evaluating managed messaging services for a new event-driven application. The application requires pub/sub semantics, high throughput (millions of messages per second), and integration with Google Cloud services like Cloud Functions and Dataflow. Which TWO services should you consider? (Choose two.)

Select 2 answers

A.Cloud Pub/Sub

B.Cloud Scheduler

C.Cloud Pub/Sub Lite

D.Cloud Functions

E.Cloud Tasks

AnswersA, C

Cloud Pub/Sub is a fully managed, scalable pub/sub messaging service with native Google Cloud integration.

Why this answer

Cloud Pub/Sub (A) is the correct choice because it provides fully managed, highly scalable pub/sub messaging with at-least-once delivery by default (exactly-once delivery is an optional subscription setting) and support for millions of messages per second. It integrates natively with Cloud Functions and Dataflow, making it ideal for event-driven architectures requiring high throughput and decoupled communication. Pub/Sub Lite (C) offers the same pub/sub semantics at lower cost in exchange for pre-provisioned capacity.

Exam trap

Google Cloud often tests the distinction between managed messaging services (Pub/Sub vs. Pub/Sub Lite) and other Google Cloud services like Cloud Tasks or Cloud Scheduler, where candidates mistakenly select compute or scheduling services for messaging needs.

819

MCQmedium

A company runs a Dataproc cluster for ETL jobs that process data nightly. They want to reduce costs while maintaining performance. Which strategy is MOST effective?

A.Use committed use discounts for all VMs

B.Enable Dataproc auto-scaling

C.Use preemptible VMs for all nodes including master

D.Use preemptible VMs for worker nodes only

AnswerD

Workers can be preemptible because batch jobs can tolerate interruptions; master remains on-demand for reliability.

Why this answer

Preemptible VMs are cheaper and suitable for fault-tolerant batch jobs. They can be used for worker nodes in Dataproc.

820

Multi-Selecthard

A company uses Dataplex to manage data quality across multiple BigQuery datasets. They need to define data quality rules that check for null values in critical columns and enforce uniqueness constraints. Which two Dataplex features should they use? (Choose TWO)

Select 2 answers

A.Dataplex Lake

B.Dataplex Data Lineage

C.Dataplex Data Quality Rules

D.Dataplex Data Quality Tasks

E.Dataplex Universal Catalog

AnswersC, D

Data Quality Rules allow defining custom checks like null and uniqueness.

Why this answer

Dataplex Data Quality can define rules (including null check and uniqueness) and schedule them as Data Quality Tasks. Data Quality Rules are defined in YAML and can be attached to entities.

821

MCQmedium

You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?

A.Sliding windows of length 5 minutes with a period of 1 minute

B.Global windows with a trigger firing every minute

C.Fixed windows of 5 minutes

D.Session windows with a gap duration of 1 minute

AnswerA

Sliding windows produce overlapping windows every minute, exactly what is needed.

Why this answer

Sliding windows of length 5 minutes with a period of 1 minute give the desired overlapping windows: every minute, you get the average over the last 5 minutes.
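
To see why each reading contributes to five overlapping results, the window assignment can be sketched in plain Python. This is a simplified illustration of the idea, not Dataflow's implementation; the function name `sliding_windows` and the use of integer seconds are assumptions for the example.

```python
from typing import List, Tuple


def sliding_windows(
    timestamp_s: int, size_s: int = 300, period_s: int = 60
) -> List[Tuple[int, int]]:
    """Return every half-open window [start, end) that contains timestamp_s.

    Windows start on multiples of period_s and last size_s seconds, so an
    element falls into size_s / period_s windows (5 with the defaults).
    """
    # Latest window start at or before the timestamp.
    last_start = timestamp_s - (timestamp_s % period_s)
    # Walk backwards one period at a time while the window still covers it.
    starts = range(last_start, timestamp_s - size_s, -period_s)
    return [(start, start + size_s) for start in starts]


if __name__ == "__main__":
    for window in sliding_windows(125):
        print(window)
```

A reading at second 125 lands in five windows, the newest starting at 120 and the oldest at -120, so a new 5-minute average becomes available for every 1-minute step.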

822

MCQhard

You are a data engineer at a financial services company. You manage a batch pipeline that processes daily trade settlement reports. The pipeline runs on Cloud Dataproc using PySpark jobs triggered by Cloud Composer (Airflow). Recent trades have increased by 3x, and the pipeline now frequently fails with 'OutOfMemoryError' in the executor logs. You have already increased the executor memory from 4g to 8g, but the problem persists. The cluster uses standard worker nodes (n1-standard-4) with 15 GB RAM per node. You need to make the pipeline stable and cost-efficient. What should you do?

A.Use n1-highmem-4 instances for the cluster to get 26 GB RAM per node and increase executor memory to 12g.

B.Migrate the PySpark jobs to Cloud Dataflow with the Apache Beam SDK to benefit from auto-scaling.

C.Increase the number of executors and reduce the executor memory to 4g, then add preemptible secondary workers to lower cost.

D.Enable cluster autoscaling and set minimum to 5 workers, maximum to 20 workers.

AnswerC

Adding more executor instances distributes memory and reduces per executor load; preemptible workers lower costs.

Why this answer

Option C is correct because the OutOfMemoryError persists even after increasing executor memory to 8g, indicating that the issue is not simply insufficient memory per executor but rather that the total memory across all executors is insufficient for the 3x data volume. By increasing the number of executors (parallelism) and reducing executor memory back to 4g, you distribute the data processing load across more JVMs, reducing the memory pressure per executor. Adding preemptible secondary workers lowers cost while providing the additional compute capacity needed to handle the increased data volume efficiently.

Exam trap

Google Cloud often tests the misconception that increasing executor memory alone solves OutOfMemoryErrors, when in reality the issue is often insufficient parallelism or misconfigured memory overhead, and the correct solution involves balancing executor count, memory, and cost-efficient instance types like preemptible VMs.

How to eliminate wrong answers

Option A is wrong because simply using n1-highmem-4 instances with 26 GB RAM and increasing executor memory to 12g does not address the root cause—the pipeline needs more parallelism, not just more memory per executor; the OutOfMemoryError can still occur if the data skew or shuffle operations overwhelm a single executor. Option B is wrong because migrating to Cloud Dataflow with Apache Beam SDK is a significant architectural change that does not directly solve the memory issue; Dataflow auto-scaling can help with throughput but does not guarantee stability if the pipeline's memory configuration is fundamentally misaligned with the data volume. Option D is wrong because enabling cluster autoscaling with a minimum of 5 workers and maximum of 20 workers does not address the executor memory configuration; autoscaling adds nodes but if the executor memory is still too high per node (e.g., 8g on a 15 GB node), the system may still run out of memory due to overhead from the OS, YARN, and other daemons, and it does not optimize cost by using preemptible instances.
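
The trade-off between executor size and executor count can be made concrete with some rough arithmetic. The sketch below assumes Spark's default YARN memory overhead (10% of executor memory, with a 384 MB floor) and an illustrative 12 GB of YARN-allocatable memory per 15 GB node; the real allocatable amount depends on the cluster configuration, and YARN's own rounding of container sizes is ignored here.

```python
import math


def container_mb(executor_memory_mb: int) -> int:
    """Memory YARN must grant per executor: heap plus off-heap overhead."""
    # Spark's default overhead is 10% of executor memory, at least 384 MB.
    overhead_mb = max(384, math.ceil(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb


def executors_per_node(yarn_memory_mb: int, executor_memory_mb: int) -> int:
    """How many executor containers fit in one node's YARN memory."""
    return yarn_memory_mb // container_mb(executor_memory_mb)


if __name__ == "__main__":
    yarn_memory_mb = 12 * 1024  # assumed allocatable memory per node
    for heap_mb in (8 * 1024, 4 * 1024):
        count = executors_per_node(yarn_memory_mb, heap_mb)
        print(f"{heap_mb} MB heap -> {count} executor(s) per node")
```

Under these assumptions an 8g executor fits only once per node, while 4g executors fit twice, which is the extra parallelism that Option C relies on.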

823

MCQmedium

A Dataflow pipeline reads events from Pub/Sub and transforms them. Some events contain invalid product IDs that should be filtered out. The list of valid product IDs is stored in a frequently updated BigQuery table. What is the best approach to filter out invalid events?

A.Read the BigQuery table as a side input and refresh it periodically using a global window with a periodic trigger

B.Use a Combine.PerKey to group by product ID and then filter

C.Use a custom pipeline option to read the valid IDs at startup and cache them

D.Use a ParDo with a side input that is a MapSideInput of valid IDs, and refresh it on each element

AnswerA

This approach allows the side input to be updated without restarting the pipeline, and the trigger ensures periodic refresh.

Why this answer

Option A is correct because reading the BigQuery table as a side input with a global window and periodic trigger allows the pipeline to refresh the list of valid product IDs at a configurable interval without reprocessing the entire stream. This pattern is idiomatic for Beam/Dataflow when the reference data changes frequently and must be kept reasonably current while maintaining low latency for streaming events.

Exam trap

Google Cloud often tests the misconception that side inputs are static or that per-element refresh is feasible, leading candidates to choose Option D, but in reality side inputs are materialized once per window/trigger and cannot be efficiently updated per element.

How to eliminate wrong answers

Option B is wrong because Combine.PerKey is designed for aggregating values per key (e.g., summing counts), not for filtering based on an external lookup; it would not incorporate the BigQuery table at all. Option C is wrong because custom pipeline options are evaluated at pipeline construction time and cannot be updated during pipeline execution, so the cached list would become stale as soon as the BigQuery table is updated. Option D is wrong because refreshing the side input on each element would cause excessive BigQuery read operations, leading to high latency and cost; MapSideInput is read-only once materialized and does not support per-element refresh.
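
The difference between a periodic refresh and a per-element refresh can be illustrated outside Beam with a plain-Python analogy. This is not the Beam side-input API; `RefreshingLookup`, its `load` callback, and the injectable `clock` are hypothetical names used only to show the refresh behaviour.

```python
import time
from typing import Callable, Iterable, Set


class RefreshingLookup:
    """Cache a set of valid IDs and reload it at most once per interval."""

    def __init__(
        self,
        load: Callable[[], Iterable[str]],
        refresh_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._values: Set[str] = set(load())
        self._loaded_at = clock()

    def contains(self, key: str) -> bool:
        now = self._clock()
        # Reload only when the interval has elapsed, not on every element.
        if now - self._loaded_at >= self._refresh_seconds:
            self._values = set(self._load())
            self._loaded_at = now
        return key in self._values


if __name__ == "__main__":
    lookup = RefreshingLookup(lambda: {"p1", "p2"}, refresh_seconds=60)
    events = ["p1", "bad-id", "p2"]
    print([e for e in events if lookup.contains(e)])
```

Every element is checked against the cached set, and the expensive load runs once per interval, which mirrors why Option A scales and Option D does not.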

824

MCQhard

A company needs to process sensitive data in BigQuery with column-level security. They want to allow analysts to see aggregated data but not individual records. Which approach should they use?

A.Use table-level access controls

B.Use column-level access controls with masking

C.Use authorized views with aggregation functions

D.Use Cloud Data Loss Prevention to de-identify data

AnswerC

Authorized views can present aggregated data while hiding raw details.

Why this answer

Option C is correct because authorized views in BigQuery allow you to define SQL queries that aggregate data (e.g., using SUM, COUNT, AVG) and expose only the aggregated results to analysts, while hiding individual records. This approach enforces column-level security by granting access to the view rather than the underlying table, ensuring analysts cannot query the raw data directly. It meets the requirement of seeing aggregated data without seeing individual records, leveraging BigQuery's native authorization and SQL capabilities.

Exam trap

Google Cloud often tests the distinction between column-level masking (which still allows row-level access) and authorized views (which enforce aggregation at the query level), leading candidates to pick B because they confuse masking with aggregation-based security.

How to eliminate wrong answers

Option A is wrong because table-level access controls grant access to entire tables, which would allow analysts to see individual records, not just aggregated data, violating the requirement. Option B is wrong because column-level access controls with masking can hide specific column values (e.g., by replacing them with NULL or a mask), but they still allow analysts to query individual rows and see non-masked columns, potentially exposing record-level details; they do not inherently restrict access to only aggregated results. Option D is wrong because Cloud Data Loss Prevention (DLP) is used for de-identifying data at rest or in transit (e.g., via inspection and transformation jobs), but it does not provide real-time, query-level aggregation controls within BigQuery; analysts would still have access to the underlying de-identified table, which could contain individual records.

825

MCQeasy

A team is deploying a model on AI Platform Prediction. They want to monitor for data drift to maintain model quality. Which service should they use?

A.Cloud DLP

B.AI Platform Continuous Evaluation

C.Cloud Monitoring

D.Cloud Audit Logs

AnswerB

This service provides monitoring for model predictions and drift analysis.

Why this answer

AI Platform Continuous Evaluation (CE) is the correct service because it is specifically designed to monitor deployed models for data drift and feature skew. It automatically compares the distribution of incoming prediction requests against the training data distribution, alerting when statistically significant drift is detected, which directly addresses the need to maintain model quality over time.

Exam trap

Google Cloud often tests the misconception that general-purpose monitoring or logging services (like Cloud Monitoring or Audit Logs) are sufficient for ML-specific drift detection, when in fact only a dedicated ML evaluation service like AI Platform Continuous Evaluation provides the necessary statistical comparison against training data.

How to eliminate wrong answers

Option A is wrong because Cloud DLP (Data Loss Prevention) is used to inspect, classify, and de-identify sensitive data, not to monitor statistical distributions of model features. Option C is wrong because Cloud Monitoring collects metrics and logs for infrastructure and application performance but lacks built-in statistical drift detection for ML model features. Option D is wrong because Cloud Audit Logs record administrative actions and access to resources, not the distributional properties of prediction data.
