Knowledge + Practice

Google Professional Data Engineer (PDE) — Questions 301–375

499 questions total · 7 pages · All types, answers revealed



301

MCQ · medium

A company uses BigQuery to run reporting queries on a table that is partitioned by date and clustered by customer_id. Queries filtering by customer_id and a date range are performing poorly. What is the most likely cause?

A. The project lacks sufficient BigQuery slot capacity

B. The table is too large for BigQuery

C. Clustering column order should be date first, then customer_id

D. The date range filter is too wide, causing scans of many partitions

Answer: D

Wide date ranges nullify the benefit of clustering; BigQuery scans many partitions.

Why this answer

Option D is correct because when a table is partitioned by date and clustered by customer_id, queries that filter on both columns can still perform poorly if the date range filter is too wide, causing BigQuery to scan many partitions. Even with clustering, scanning a large number of partitions negates the benefit of clustering, as clustering only reduces the data scanned within each partition. The query optimizer must read all partitions that fall within the date range, and if that range is broad, the scan overhead dominates.

Exam trap

The trap here is that candidates often assume clustering alone guarantees fast queries on any filter combination, without understanding that partition pruning happens first and a wide date range undermines the benefit of clustering.

How to eliminate wrong answers

Option A is wrong because insufficient slot capacity would cause slow query execution or queuing, not specifically poor performance on partitioned and clustered tables; the issue here is data scanning inefficiency, not resource contention. Option B is wrong because BigQuery is designed to handle very large tables, and 'too large' is not a meaningful limitation; the problem is query design, not table size. Option C is wrong because date is the partitioning column, not a clustering column: partition pruning already handles the date filter, and customer_id is the only clustering column, so there is no clustering order to rearrange. Adding date ahead of customer_id in the clustering order would not help, because clustering prunes most effectively when a query filters on a leading prefix of the clustering columns.
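As a rough illustration of why a wide date range dominates, the sketch below counts how many daily partitions a date filter touches. It is a toy model, not BigQuery's cost model: the function name `partitions_scanned` and the one-partition-per-day assumption are illustrative only.

```python
from datetime import date


def partitions_scanned(start: date, end: date) -> int:
    """Number of daily partitions touched by an inclusive date-range filter.

    Toy model: assumes one partition per calendar day. Clustering can only
    prune blocks inside each of these partitions.
    """
    if end < start:
        return 0
    return (end - start).days + 1


# One week versus two full years of daily partitions.
narrow = partitions_scanned(date(2024, 1, 1), date(2024, 1, 7))
wide = partitions_scanned(date(2023, 1, 1), date(2024, 12, 31))
print(narrow, wide)
```

Under this assumption, the clustering on customer_id has to do its pruning separately inside every partition the date filter selects, so narrowing the date range is the first lever to pull.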


302

Multi-Select · hard

A streaming pipeline uses Cloud Pub/Sub and Dataflow to process financial transactions. The pipeline must guarantee that each transaction is processed exactly once and in order per customer key. Which two configurations are necessary? (Choose two.)

Select 2 answers

A. Use a session window with max gap duration

B. Use a keyed state with a value state per customer

C. Use Dataflow stateful processing with event time ordering

D. Use a Pub/Sub topic with ordering keys

E. Use a global window with a trigger

Answers: C, D

Dataflow stateful processing with event time ordering allows processing events per key in the order they were generated, with exactly-once guarantees.

Why this answer

Options C and D are correct. Pub/Sub ordering keys (D) ensure messages with the same ordering key are delivered in order. Dataflow stateful processing with event time ordering (C) allows processing events in order while maintaining exactly-once semantics.

Option E (global window with trigger) does not guarantee order. Option B (keyed state) is required but is encompassed by C. Option A (session window) is not about ordering.
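The combination of per-key ordering and duplicate suppression can be sketched in plain Python. This is a batch toy, not Beam or Pub/Sub code: the tuple layout and the function name `process_in_order` are assumptions made for illustration.

```python
from collections import defaultdict


def process_in_order(events):
    """Group events by customer key, drop duplicate IDs, sort by event time.

    Each event is a (customer_key, event_id, event_time, amount) tuple.
    """
    seen_ids = set()
    per_key = defaultdict(list)
    for key, event_id, event_time, amount in events:
        if event_id in seen_ids:  # redelivered message: skip it
            continue
        seen_ids.add(event_id)
        per_key[key].append((event_time, event_id, amount))
    # Sorting each customer's list by event time restores per-key order.
    return {key: sorted(items) for key, items in per_key.items()}


events = [
    ("cust-1", "e2", 20, 5.0),
    ("cust-1", "e1", 10, 7.5),  # arrives late but has an earlier event time
    ("cust-2", "e3", 15, 1.0),
    ("cust-1", "e2", 20, 5.0),  # duplicate delivery
]
result = process_in_order(events)
print(result)
```

In a real streaming job the buffer and the set of seen IDs would live in per-key state and be flushed as the watermark advances; the sketch only shows the two guarantees being combined.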


303

Multi-Select · medium

Which THREE components are typically part of a Vertex AI Pipeline for automated model retraining and deployment?

Select 3 answers

A. Cloud Monitoring alerting component

B. Cloud Storage artifact storage component

C. Training component (e.g., CustomContainerTrainingJob)

D. Model evaluation component (e.g., evaluating on a test set)

E. Deployment component (e.g., deploying model to endpoint)

Answers: C, D, E

Training is the core step.

Why this answer

Option C is correct because a training component, such as a `CustomContainerTrainingJob`, is the core step in a Vertex AI Pipeline that executes the model training logic. It defines the container image, machine configuration, and hyperparameters, enabling automated retraining when triggered by a schedule or event.

Exam trap

Google Cloud often tests the distinction between pipeline components (which are executable tasks in the DAG) and supporting infrastructure (like Cloud Monitoring or Cloud Storage), leading candidates to select options that are related to the pipeline's operation but not actual components within the pipeline definition.


304

Matching · medium

Match each machine learning term to its description.


Concepts

Matches

Model trained on labeled data

Model trained on unlabeled data

Agent learns by interacting with environment

Model performs well on training data but poorly on new data

Why these pairings

Key ML concepts commonly tested in PDE exam.


305

Multi-Select · hard

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Select 3 answers

A. Set up monitoring alerts for system lag and data freshness.

B. Use static side inputs that are loaded once at pipeline start.

C. Implement watermark estimation to handle late data.

D. Use global windows with early triggers for low latency.

E. Use idempotent sinks to ensure exactly-once processing.

Answers: A, C, E

Monitoring is critical for streaming pipelines.

Why this answer

Option A is correct because monitoring alerts for system lag and data freshness are essential for maintaining operational visibility in real-time Dataflow pipelines. System lag (the time between data ingestion and processing) and data freshness (how current the processed output is) directly impact the pipeline's ability to meet latency SLAs. Without these alerts, issues like worker backpressure or Pub/Sub subscription backlog can go unnoticed, leading to stale or lost data.

Exam trap

Google Cloud often tests the misconception that static side inputs are acceptable for streaming pipelines, but they are only appropriate for batch or bounded data; real-time pipelines require side inputs that can be periodically refreshed (e.g., via a streaming source or a periodic lookup).


306

MCQ · hard

A company runs a streaming data pipeline on Google Cloud using Cloud Pub/Sub, Cloud Dataflow, and BigQuery. The pipeline processes real-time sensor data for predictive maintenance. Recently, the Dataflow job's lag has increased from seconds to minutes, and the system shows backpressure. The pipeline uses fixed windows of 1 minute and writes results to BigQuery. The data volume has doubled. The team has already increased the number of workers. What should they do next?

A. Enable Streaming Engine and use Upsert to BigQuery

B. Decrease the window duration

C. Use session windows instead of fixed windows

D. Use Cloud Storage as temporary sink

Answer: A

Streaming Engine reduces overhead and Upsert makes BigQuery writes more efficient.

Why this answer

The correct answer is A because enabling Streaming Engine offloads the heavy shuffle and state management from the worker VMs to the backend service, reducing the impact of backpressure. Using Upsert to BigQuery allows the pipeline to handle late-arriving data within the fixed windows without requiring a full table rewrite, which is critical when data volume has doubled and lag has increased.

Exam trap

The trap here is that candidates often assume increasing workers or changing window sizes will fix backpressure, but the real bottleneck is often the shuffle and state management in Dataflow, which Streaming Engine directly addresses.

How to eliminate wrong answers

Option B is wrong because decreasing the window duration would increase the number of windows and the frequency of writes, exacerbating the backpressure and lag rather than solving it. Option C is wrong because session windows are designed for grouping events based on gaps of inactivity, which is not relevant to the fixed-window requirement for predictive maintenance sensor data; they would not reduce backpressure. Option D is wrong because using Cloud Storage as a temporary sink adds an extra write step and does not address the root cause of backpressure in the Dataflow pipeline; it would increase latency and complexity.


307

MCQ · easy

A streaming Dataflow job is processing messages from Cloud Pub/Sub. The job is underutilizing resources and the throughput is lower than expected. Which parameter should be adjusted to increase parallelism?

A. Change the workerMachineType to a higher CPU machine

B. Increase the number of workers via maxNumWorkers

C. Set the streaming engine to Dataflow Streaming Engine

D. Set autoscalingAlgorithm to THROUGHPUT_BASED

Answer: B

More workers allow more parallelism.

Why this answer

The job is underutilizing resources, meaning the existing workers are not fully loaded. Increasing the number of workers via maxNumWorkers directly increases parallelism by allowing Dataflow to distribute work across more VMs, which can increase throughput without changing the per-worker resource profile. This parameter controls the upper bound on the number of workers, enabling the autoscaler to scale out when there is backlog.

Exam trap

Google Cloud often tests the misconception that increasing per-worker resources (CPU/memory) is the primary way to improve throughput in a streaming job, when in fact underutilization indicates the need to scale out workers rather than scale up individual workers.

How to eliminate wrong answers

Option A is wrong because changing workerMachineType to a higher CPU machine increases per-worker compute capacity but does not address underutilization; if workers are idle, adding more CPU per worker will not increase parallelism or throughput. Option C is wrong because Dataflow Streaming Engine is a service that offloads shuffle and state management to the backend, reducing per-worker overhead and improving scalability, but it does not directly increase parallelism; it changes the execution model. Option D is wrong because setting autoscalingAlgorithm to THROUGHPUT_BASED is already the default for streaming jobs; it enables autoscaling based on throughput metrics, but without adjusting maxNumWorkers, the autoscaler cannot scale beyond the default limit, so throughput remains capped.
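For reference, the two settings discussed above can be passed as pipeline flags. The sketch below builds them for the Beam Python SDK, which spells the options in snake_case (`--max_num_workers`, `--autoscaling_algorithm`) rather than the camelCase names used in the question. The helper name `dataflow_scaling_flags` and the value 20 are hypothetical.

```python
def dataflow_scaling_flags(max_workers: int) -> list[str]:
    """Build Beam Python pipeline flags that raise the autoscaling ceiling."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return [
        "--runner=DataflowRunner",
        "--streaming",
        # Lets the service add workers based on backlog and throughput.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        # Upper bound the autoscaler may scale out to.
        f"--max_num_workers={max_workers}",
    ]


flags = dataflow_scaling_flags(20)
print(" ".join(flags))
```

The list would typically be handed to the pipeline's options object at launch; the autoscaling flag alone changes nothing if the worker ceiling stays where it was.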


308

MCQ · hard

A healthcare startup is deploying a natural language processing (NLP) model for extracting medical entities from clinical notes. The model is a fine-tuned BERT model served on Vertex AI Prediction using a custom container. The team observes that prediction latency is around 500ms per request, but they need to handle up to 100 requests per second (QPS) with end-to-end latency under 200ms. The model currently runs on n1-standard-4 machines (4 vCPU, 15 GB memory). During load testing, CPU utilization reaches 90% and memory usage is 12 GB. The team is considering options to meet the requirements. Which action should they take?

A. Use a machine type with a GPU, such as n1-standard-4 with an NVIDIA Tesla T4 accelerator, and optimize the model with TensorRT.

B. Switch to n1-highmem-4 machines to provide more memory for the model.

C. Deploy the model using TensorFlow Serving with CPU-only nodes and increase the number of replicas.

D. Move the model to Cloud Run with automatic scaling to handle the QPS.

Answer: A

GPU accelerates BERT inference and TensorRT further optimizes latency.

Why this answer

Option A is correct because the bottleneck is CPU-bound inference (90% CPU utilization) with memory well within limits (12 GB of 15 GB). Adding a GPU (NVIDIA Tesla T4) and optimizing with TensorRT reduces per-request latency via hardware acceleration and graph optimizations, enabling sub-200ms inference at 100 QPS. This directly addresses the latency requirement without changing the machine family or scaling strategy.

Exam trap

Google Cloud often tests the misconception that scaling horizontally (more replicas or Cloud Run) solves latency problems, when the real issue is per-request compute bottleneck that requires hardware acceleration or model optimization.

How to eliminate wrong answers

Option B is wrong because memory is not the bottleneck (12 GB used out of 15 GB); increasing memory does not reduce CPU-bound inference latency. Option C is wrong because TensorFlow Serving on CPU-only nodes still relies on CPU compute: adding replicas spreads the load across more machines and raises throughput, but it does not shorten the roughly 500 ms that each individual request takes, so the 200 ms target is still missed. Option D is wrong because Cloud Run's automatic scaling handles QPS but does not reduce per-request latency; the model's inference time remains CPU-bound, and Cloud Run's cold starts and CPU-only instances would not meet the 200ms latency target.
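The difference between throughput and per-request latency can be made concrete with Little's law (requests in flight = arrival rate × time in system). The sketch below is a back-of-the-envelope calculation only; the function name `replicas_needed` and the assumption of one concurrent request per replica are illustrative, not measurements from the scenario.

```python
import math


def replicas_needed(qps: int, latency_ms: int, concurrency_per_replica: int) -> int:
    """Replicas required to keep up with `qps` at a given per-request latency."""
    in_flight = qps * latency_ms / 1000  # Little's law
    return math.ceil(in_flight / concurrency_per_replica)


# At the observed 500 ms per request, 100 QPS means 50 requests in flight.
cpu_replicas = replicas_needed(qps=100, latency_ms=500, concurrency_per_replica=1)
# If inference could be brought down to the 200 ms target, 20 would be in flight.
target_replicas = replicas_needed(qps=100, latency_ms=200, concurrency_per_replica=1)
print(cpu_replicas, target_replicas)
```

More replicas satisfy the first number, but every request still spends 500 ms in the model; only faster inference moves the latency itself.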


309

MCQ · medium

A data engineering team needs to process a large volume of CSV files stored in Cloud Storage using Dataproc. The files are generated hourly and each contains millions of rows. They want to minimize the number of Dataproc cluster nodes to reduce cost while processing within an hour. Which configuration should they recommend?

A. Use a cluster with preemptible worker nodes only.

B. Use a cluster with local SSDs for temporary storage.

C. Use a cluster with a few large worker nodes and use Spark static allocation.

D. Use a cluster with many small worker nodes and use Spark dynamic allocation.

Answer: D

Dynamic allocation adjusts resources based on workload; small nodes provide granular scaling.

Why this answer

Option D is correct because using many small worker nodes with Spark dynamic allocation allows the cluster to scale resources precisely to the workload, minimizing idle capacity and cost. Dynamic allocation enables executors to be added or removed based on the processing demands of the hourly CSV files, ensuring the job completes within the hour without over-provisioning nodes.

Exam trap

Google Cloud often tests the misconception that larger nodes are always more cost-effective for big data processing, but in practice, many small nodes with dynamic allocation reduce idle resource waste and better match the parallelism needs of distributed file processing.

How to eliminate wrong answers

Option A is wrong because preemptible workers can be reclaimed by Google Cloud at any time, risking job failure or delays when processing millions of rows per hour, and a Dataproc cluster cannot rely on them as its only workers. Option B is wrong because local SSDs improve I/O performance for shuffle operations but do not directly reduce the number of nodes or cost; they add cost per node and are not a configuration for minimizing node count. Option C is wrong because using a few large worker nodes with Spark static allocation reserves a fixed number of executors regardless of actual workload, leading to underutilization and higher cost if the job does not need all resources, and it does not adapt to the hourly data volume variations.
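Dynamic allocation is switched on through standard Spark properties. The sketch below assembles them into the comma-separated `key=value` form that a job submission's properties flag accepts. The helper name and the executor bounds (2 and 40) are hypothetical values chosen for illustration.

```python
def dynamic_allocation_properties(min_executors: int, max_executors: int) -> dict[str, str]:
    """Spark properties that let the executor count follow the workload."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


props = dynamic_allocation_properties(2, 40)
properties_flag = ",".join(f"{key}={value}" for key, value in props.items())
print(properties_flag)
```

With static allocation the executor count would be pinned instead, which is what leaves capacity idle between the hourly file drops.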


310

MCQ · medium

Refer to the exhibit. A team configured a Cloud Monitoring alerting policy as shown. They recently started receiving false positive alerts. What is the most likely cause?

A. The duration of 60 seconds is too short, making the alert sensitive to brief spikes.

B. The alignment period of 60 seconds is too short, causing noise.

C. The threshold of 10 is too low.

D. The aggregator should be ALIGN_SUM instead of ALIGN_RATE.

Answer: A

A short duration means a spike lasting just over 60 seconds will trigger an alert; a longer duration (e.g., 300s) would reduce sensitivity.

Why this answer

Option A is correct because the duration is set to 60 seconds, meaning any 60-second window with a rate >10 will trigger an alert. If the error count is bursty, brief spikes cause false positives. Increasing the duration would smooth out transient spikes.

Option B (alignment period) affects granularity but does not cause false positives. Option C (threshold) might be low, but the primary issue is the short duration. Option D (aligner) is wrong because ALIGN_RATE is appropriate for a rate condition.
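The effect of the duration setting can be sketched with a small simulation. This is not the Cloud Monitoring evaluation engine: `alert_fires` is a hypothetical helper that treats each list entry as one aligned 60-second sample and fires only when the threshold is exceeded for the required number of consecutive samples.

```python
def alert_fires(rates, threshold, duration_points):
    """True if `rates` stays above `threshold` for `duration_points` samples in a row."""
    run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0
        if run >= duration_points:
            return True
    return False


# Two isolated one-minute spikes in otherwise quiet traffic.
per_minute_rates = [2, 3, 25, 4, 2, 30, 3, 1]

fires_60s = alert_fires(per_minute_rates, threshold=10, duration_points=1)
fires_300s = alert_fires(per_minute_rates, threshold=10, duration_points=5)
print(fires_60s, fires_300s)
```

With a one-sample duration either spike is enough to fire; requiring five consecutive samples ignores both, which is the smoothing the explanation describes.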


311

MCQ · medium

A company uses Cloud Composer to orchestrate a daily ETL pipeline that includes multiple Dataproc jobs. The pipeline processes sensitive financial data. The security team requires that all data in transit be encrypted, and all Cloud Storage buckets used by the pipeline should have uniform bucket-level access enabled and VPC Service Controls. The pipeline currently uses a single Cloud Composer environment in us-east1. The Dataproc clusters are created using the standard image and use custom service accounts with minimal permissions. The pipeline runs successfully during testing, but in production, the Dataproc jobs fail with 'Access Denied' errors when trying to write to a Cloud Storage bucket. The bucket has uniform bucket-level access enabled and is inside a VPC Service Controls perimeter. The Dataproc service account has the Storage Object Admin role at the project level. What is the most likely cause of the access denied error?

A. The service account does not have the Storage Object Admin role on the bucket.

B. Data in transit encryption is not enabled for the Cloud Storage bucket.

C. Uniform bucket-level access prevents writes from service accounts.

D. The Dataproc cluster is not in the VPC Service Controls perimeter.

Answer: D

VPC Service Controls deny access from resources outside the perimeter.

Why this answer

The Dataproc cluster is created outside the VPC Service Controls perimeter, so even though the service account has the Storage Object Admin role at the project level, requests from the cluster are blocked by the perimeter's ingress/egress rules. VPC Service Controls enforce a security boundary that prevents resources outside the perimeter from accessing protected services like Cloud Storage, regardless of IAM permissions. The 'Access Denied' error in production, despite successful testing, strongly indicates a perimeter configuration mismatch.

Exam trap

Google Cloud often tests the distinction between IAM permissions and VPC Service Controls boundaries, tricking candidates into thinking a project-level IAM role is sufficient when the real blocker is network-level perimeter enforcement.

How to eliminate wrong answers

Option A is wrong because the service account has the Storage Object Admin role at the project level, which grants write access to all buckets in the project, including this one; uniform bucket-level access does not override project-level IAM roles. Option B is wrong because data in transit encryption is automatically enforced by Google Cloud for all API calls to Cloud Storage (using HTTPS/TLS), and the question states the pipeline already encrypts data in transit, so this is not the cause of the error. Option C is wrong because uniform bucket-level access does not prevent writes from service accounts; it simply disables ACLs and requires all access decisions to be made via IAM policies, which the service account already has via its project-level role.


312

Multi-Select · medium

Which TWO actions should be taken to optimize a Dataflow streaming pipeline that is experiencing high system lag and backpressure? (Choose two.)

Select 2 answers

A. Use a higher memory machine type for all workers.

B. Increase the number of worker threads by adjusting the streaming worker's parallelism hint.

C. Enable autoscaling and increase the maximum number of workers.

D. Reduce the number of workers to decrease cost.

E. Set maxNumWorkers to 1 to force single-worker processing.

Answers: B, C

More threads can increase throughput per worker.

Why this answer

Option B is correct because increasing the parallelism hint allows each worker to process more bundles concurrently, which can reduce backpressure by improving throughput without adding more workers. Option C is correct because enabling autoscaling and increasing the maximum number of workers allows the pipeline to dynamically scale out to handle increased load, directly mitigating high system lag and backpressure.

Exam trap

Google Cloud often tests the misconception that simply adding more memory or reducing workers will solve backpressure, when in fact the correct approaches involve increasing parallelism or scaling out the worker pool.


313

Drag & Drop · medium

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.


Steps

Order

Why this order

Push subscriptions send messages to a configured HTTPS endpoint.


314

Multi-Select · hard

You are designing a streaming pipeline that must guarantee exactly-once processing. Which three services or features can help achieve this? (Choose THREE.)

Select 3 answers

A. Cloud Functions for post-processing

B. BigQuery streaming inserts with a unique key for deduplication

C. Cloud Spanner for deduplication state across the pipeline

D. Cloud Pub/Sub with duplicate detection (using message IDs)

E. Dataflow with idempotent write operations to BigQuery

Answers: C, D, E

Using Cloud Spanner as a global state store allows tracking processed event IDs for deduplication.

Why this answer

Cloud Spanner is correct because it provides globally distributed, strongly consistent transactions that can be used to maintain deduplication state across the entire streaming pipeline. By storing a unique key for each processed event in Spanner, the pipeline can atomically check and record whether an event has already been handled, ensuring exactly-once semantics even in the face of retries or failures.

Exam trap

Google Cloud often tests the misconception that BigQuery streaming inserts can guarantee exactly-once processing via a unique key, when in fact BigQuery only supports at-least-once delivery and requires external deduplication mechanisms like Cloud Spanner or Dataflow with idempotent writes.
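The idea behind an idempotent write can be shown with an in-memory stand-in. This is a sketch under stated assumptions, not Spanner or BigQuery client code: the class `IdempotentSink` is hypothetical, and a real implementation would perform the keyed write inside a transaction.

```python
class IdempotentSink:
    """In-memory stand-in for a sink keyed by event ID."""

    def __init__(self):
        self.rows = {}

    def write(self, event_id, payload):
        """Insert-or-overwrite by key, so a retry leaves the same final state."""
        self.rows[event_id] = payload


sink = IdempotentSink()
# "e1" is delivered twice, as happens under at-least-once delivery.
for event_id, payload in [("e1", 10), ("e2", 20), ("e1", 10)]:
    sink.write(event_id, payload)

total = sum(sink.rows.values())
print(len(sink.rows), total)
```

Because the write is keyed on the event ID, replaying a message changes nothing, so at-least-once delivery upstream still yields an exactly-once result in the sink.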


315

Drag & Drop · medium

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.


Steps

Order

Why this order

IAP verifies identity and authorization before allowing access to the application.


316

MCQ · easy

A company processes CSV files that are uploaded to Cloud Storage by external partners. Each file is around 500 MB, and they need to be parsed and loaded into BigQuery. The processing must start as soon as the file arrives. What is the most efficient serverless architecture?

A. Cloud Storage triggers a Cloud Function that publishes events to Pub/Sub; a Dataflow streaming pipeline reads from Pub/Sub and writes to BigQuery.

B. Use Cloud Scheduler to periodically check for new files and process them with Dataflow batch jobs.

C. Cloud Storage triggers a Dataproc job that reads the file and loads it into BigQuery.

D. Cloud Storage triggers a Cloud Function that directly loads the data into BigQuery using the BigQuery API.

Answer: A

Serverless and scales well with file uploads.

Why this answer

Option A is correct because it combines Cloud Storage event-driven triggers with Pub/Sub for reliable asynchronous message delivery, and uses Dataflow streaming with autoscaling to handle 500 MB files efficiently. This serverless architecture ensures processing starts immediately upon file arrival, scales to handle large files without manual intervention, and leverages BigQuery's streaming inserts for near-real-time data loading.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle large file processing directly, but the 9-minute timeout and memory limits make them unsuitable for files over a few hundred MB, pushing candidates toward the seemingly simpler Option D.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler polling introduces latency and inefficiency, as it checks for new files on a fixed schedule rather than reacting instantly, which violates the requirement that processing must start as soon as the file arrives. Option C is wrong because Dataproc is a managed Hadoop/Spark service that requires cluster provisioning and startup time, adding overhead for a simple CSV-to-BigQuery load; it is not serverless and not the most efficient for this use case. Option D is wrong because an event-triggered Cloud Function has a 9-minute timeout and a bounded memory allocation, so parsing a 500 MB CSV file inside the function and pushing it through the BigQuery API risks exceeding those limits and failing.


317

MCQ · medium

A gaming company uses Avro schemas for its streaming event data. They anticipate adding new optional fields to events over time. They need to ensure backward compatibility so that existing pipelines continue to work. Which strategy should they adopt?

A. Use Avro with a schema registry that enforces backward-compatible changes

B. Use JSON instead of Avro and ignore unknown fields

C. Use Protocol Buffers with breaking changes

D. Use FlatBuffers for performance

Answer: A

Avro's schema evolution rules allow adding optional fields without breaking existing consumers, and a schema registry enables version management.

Why this answer

Option A is correct because Avro, combined with a schema registry, allows schema evolution with backward compatibility. The registry enforces rules such as adding optional fields with defaults, ensuring that consumers using older schemas can still deserialize new data without breaking. This directly addresses the requirement for existing pipelines to continue working as new optional fields are added.

Exam trap

Google Cloud often tests the misconception that any serialization format (like JSON or Protocol Buffers) inherently supports backward compatibility, but the key is the combination of a schema registry with enforced evolution rules, which only Avro explicitly provides in this context.

How to eliminate wrong answers

Option B is wrong because JSON lacks a schema enforcement mechanism; while ignoring unknown fields is possible, JSON does not provide built-in compatibility guarantees or schema evolution rules, making it error-prone in large-scale streaming systems. Option C is wrong because Protocol Buffers can support backward compatibility, but the option specifies 'breaking changes,' which would violate the requirement for backward compatibility. Option D is wrong because FlatBuffers prioritize performance (zero-copy deserialization) but do not inherently enforce backward-compatible schema evolution, and they are less suited for streaming event data with frequent schema changes.
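The rule that makes an added optional field safe can be sketched without any Avro library. The `resolve` function below is a simplified, hypothetical stand-in for Avro's schema resolution: it projects a decoded record onto the reader's fields and fills in declared defaults. The record name `GameEvent` and its fields are invented for illustration.

```python
def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a decoded record onto the reader's fields, filling defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out


schema_v1 = {"type": "record", "name": "GameEvent", "fields": [
    {"name": "player_id", "type": "string"},
    {"name": "score", "type": "int"},
]}
# v2 adds an optional field with a default, which keeps the change compatible.
schema_v2 = {"type": "record", "name": "GameEvent", "fields": [
    {"name": "player_id", "type": "string"},
    {"name": "score", "type": "int"},
    {"name": "region", "type": ["null", "string"], "default": None},
]}

new_reader_old_data = resolve({"player_id": "p1", "score": 42}, schema_v2)
old_reader_new_data = resolve({"player_id": "p2", "score": 7, "region": "eu"}, schema_v1)
print(new_reader_old_data, old_reader_new_data)
```

Dropping the default from the new field would make the first call raise, which is the kind of change a registry configured for compatibility checks is meant to reject.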


318

Multi-Select · hard

Which THREE steps are essential for implementing a continuous training pipeline with Vertex AI?

Select 3 answers

A. If the new model passes evaluation, deploy it to a production endpoint.

B. Manually approve each new model version before deployment.

C. Deploy the original model once and set it to auto-update.

D. Set up a trigger to start a training pipeline when new training data is available (e.g., via Cloud Storage events).

E. Include a step in the pipeline that evaluates the new model against a validation set.

Answers: A, D, E

Automated deployment upon passing evaluation completes the continuous pipeline.

Why this answer

A continuous training pipeline involves automated retraining, evaluation, and deployment when new data or model improvements occur. Manual approval is optional, not essential. One-time manual deployment is not continuous.

The three essential steps are: trigger on new data, train, and evaluate/promote.
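The train, evaluate, and conditionally deploy flow can be sketched as plain functions. This is not Vertex AI SDK code: `run_pipeline`, `should_deploy`, and the metric values are hypothetical, and the callables stand in for real pipeline components.

```python
def should_deploy(candidate_metric: float, production_metric: float) -> bool:
    """Promotion gate: the candidate must be at least as good as production."""
    return candidate_metric >= production_metric


def run_pipeline(train, evaluate, deploy, production_metric: float) -> str:
    """Run one retraining cycle, as a new-data trigger would."""
    model = train()
    metric = evaluate(model)
    if should_deploy(metric, production_metric):
        deploy(model)
        return "deployed"
    return "rejected"


deployed = []
good = run_pipeline(lambda: "model-v2", lambda m: 0.91, deployed.append, 0.88)
bad = run_pipeline(lambda: "model-v3", lambda m: 0.80, deployed.append, 0.91)
print(good, bad, deployed)
```

The gate is what lets deployment run unattended: a model that fails evaluation never reaches the endpoint, so no manual approval step is needed for the loop to be safe.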


319

MCQ · hard

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?

A. Increase the number of workers and enable autoscaling to distribute the load.

B. Reduce the number of workers to minimize coordination overhead.

C. Use a global window with a trigger to reduce state size.

D. Change the windowing to a fixed 5-minute window to reduce computations.

Answer: A

More workers spread the CPU load of the sliding-window aggregation.

Why this answer

The 'worker failed to provide a heartbeat' error combined with high CPU usage indicates that workers are overloaded and cannot process data fast enough to maintain their heartbeat to the Dataflow service. Increasing the number of workers and enabling autoscaling distributes the computational load across more machines, reducing per-worker CPU pressure and allowing heartbeats to be sent on time. This directly addresses the root cause of resource exhaustion.

Exam trap

Google Cloud often tests the misconception that reducing workers or changing window types is a universal fix for resource exhaustion, when in fact the immediate solution for heartbeat failures due to high CPU is to scale out the worker pool.

How to eliminate wrong answers

Option B is wrong because reducing the number of workers would concentrate the same workload on fewer machines, increasing per-worker CPU usage and worsening the heartbeat failure. Option C is wrong because using a global window with a trigger does not reduce state size for sliding windows; it would accumulate all events into a single unbounded window, potentially increasing memory pressure and CPU overhead. Option D is wrong because it changes what the pipeline computes: fixed 5-minute windows do cut the work per element (each event lands in one window instead of five), but they emit non-overlapping results every 5 minutes rather than a rolling count every minute, and workers can still be CPU-starved if the underlying load is unchanged.
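To see how much work the sliding window adds, the sketch below lists the windows that contain a given timestamp. It is a simplified model of window assignment, not Beam's implementation; the function name `sliding_windows` and the use of integer seconds are assumptions.

```python
def sliding_windows(ts: int, size: int = 300, period: int = 60) -> list[tuple[int, int]]:
    """Return the [start, end) windows, in seconds, that contain timestamp `ts`."""
    last_start = ts - (ts % period)  # most recent window start at or before ts
    # Walk back one period at a time while the window still covers ts.
    return [(start, start + size) for start in range(last_start, ts - size, -period)]


sliding = sliding_windows(125)                      # 5-minute window, 1-minute period
fixed = sliding_windows(125, size=300, period=300)  # fixed 5-minute window
print(len(sliding), len(fixed))
```

Every event is aggregated five times under the sliding configuration and once under the fixed one, which is why the sliding window costs more CPU but also why swapping it out changes the results the pipeline produces.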


320

MCQ · medium

A team notices that the latency for online predictions from a Vertex AI endpoint has increased significantly over the past hour. The model is a large TensorFlow model deployed with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The CPU utilization of the deployed instances is consistently above 85%. What is the most likely cause of the increased latency?

A. The network latency between the client and the endpoint has increased due to regional issues.

B. The model is deployed with GPU acceleration, but the instances are using incorrect CUDA drivers.

C. The model is too large for the instance memory, causing disk swapping.

D. The model is CPU-bound, and the current replicas are saturated, causing queuing.

Answer: D

High CPU utilization indicates the replicas are at capacity, leading to request queuing and higher latency.

Why this answer

The correct answer is D because the consistently high CPU utilization (above 85%) indicates that the existing replicas are saturated, unable to process incoming requests quickly enough. When all replicas are busy, new requests are queued, which directly increases latency. Automatic scaling can add more replicas up to maxReplicaCount=10, but if the scaling is slow or the traffic spike is sudden, queuing occurs first, causing the observed latency increase.

Exam trap

Google Cloud often tests the distinction between symptoms of CPU saturation (queuing/latency) versus memory or GPU issues; the trap here is that candidates may incorrectly attribute latency to network or hardware driver problems when the clear indicator is sustained high CPU utilization on existing instances.

How to eliminate wrong answers

Option A is wrong because network latency between client and endpoint is not indicated by CPU utilization of deployed instances; regional network issues would affect all requests uniformly, not correlate with high CPU. Option B is wrong because incorrect CUDA drivers would cause GPU-related errors or failures, not consistently high CPU utilization; the model would likely fail to run or produce errors, not just increase latency. Option C is wrong because disk swapping due to insufficient memory would manifest as high disk I/O and memory pressure, not primarily high CPU utilization; the symptom described is CPU-bound, not memory-bound.


321

MCQeasy

A company wants to implement a data lake on Google Cloud to store raw sensor data (unstructured binary files) and allow data scientists to run SQL queries on processed data. They expect to store terabytes of data and have different access patterns. Which combination of GCP services best meets these requirements?

A.Bigtable for raw data and Cloud Spanner for processed data

B.Cloud Storage for both raw and processed data

C.Cloud SQL for raw data and Cloud Dataproc for processing

D.Cloud Storage for raw data and BigQuery for processed data

AnswerD

Cloud Storage stores any file type cost-effectively, and BigQuery provides fast SQL queries on structured data.

Why this answer

Cloud Storage is the ideal service for storing raw, unstructured binary sensor data at petabyte scale, offering low-cost, durable object storage with multiple access tiers. BigQuery is a serverless, highly scalable data warehouse that allows data scientists to run SQL queries on processed data, with features like columnar storage and automatic optimization for analytical workloads. This combination directly addresses the need for raw storage and SQL-based analytics on processed data.

Exam trap

Google Cloud often tests the misconception that Cloud Storage can serve as a queryable database for SQL, when in fact it requires an external query engine like BigQuery or Dataproc for SQL access.

How to eliminate wrong answers

Option A is wrong because Bigtable is a NoSQL wide-column database optimized for real-time, low-latency access, not for storing raw unstructured binary files, and Cloud Spanner is a globally distributed relational database for transactional workloads, not for analytical SQL queries on processed data. Option B is wrong because while Cloud Storage can store both raw and processed data, it does not natively support SQL queries; data scientists would need an additional service like BigQuery or Dataproc to run SQL. Option C is wrong because Cloud SQL is a relational database for structured data, not designed for raw unstructured binary files, and Cloud Dataproc is a managed Spark/Hadoop service for processing, not a SQL query engine for processed data.


322

MCQmedium

A company uses Cloud Composer to orchestrate data pipelines. One DAG fails intermittently with the error: 'Task received SIGTERM signal.' The task runs a long-running Dataproc job. What is the most likely cause?

A.The Dataproc cluster was preempted by Google Cloud.

B.The Dataproc job failed due to an error in the code.

C.The Cloud Composer environment ran out of disk space.

D.The Airflow task timed out due to the default execution timeout.

AnswerD

SIGTERM indicates the task was killed, possibly due to timeout.

Why this answer

The default Airflow task execution timeout is 28 days in Cloud Composer, but individual tasks can have a shorter `execution_timeout` set in the DAG definition. When a long-running Dataproc job exceeds this timeout, Airflow sends a SIGTERM signal to the task to kill it, resulting in the observed error. This is the most likely cause because the error message directly indicates a forced termination by the Airflow scheduler, not an infrastructure or code failure.
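As a rough sketch of where such a timeout is configured: `execution_timeout` is an Airflow operator argument that takes a `timedelta` and is commonly passed through a DAG's `default_args`. The durations and the helper name `exceeds_timeout` are hypothetical, chosen only to show why a job that sometimes outruns its limit fails intermittently.

```python
from datetime import timedelta

# Arguments applied to every task in a DAG; execution_timeout caps task runtime.
default_args = {
    "retries": 1,
    "execution_timeout": timedelta(hours=2),  # hypothetical limit
}


def exceeds_timeout(expected_runtime: timedelta, args: dict) -> bool:
    """Return True if a task expected to run this long would hit its timeout."""
    timeout = args.get("execution_timeout")
    # No timeout configured means the task is never killed for running long.
    return timeout is not None and expected_runtime > timeout


# A Dataproc job that occasionally takes 3 hours would exceed the 2-hour limit.
print(exceeds_timeout(timedelta(hours=3), default_args))
print(exceeds_timeout(timedelta(hours=1), default_args))
```

Raising the limit for the long-running task, or setting it per task rather than in `default_args`, is one way to keep the guard for short tasks without killing the Dataproc job.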

Exam trap

The trap here is that candidates often attribute SIGTERM errors to infrastructure issues like cluster preemption or disk space, when in fact the error is a direct result of Airflow's task timeout mechanism, which is a common misconfiguration in long-running pipeline tasks.

How to eliminate wrong answers

Option A is wrong because Dataproc cluster preemption would cause a different error (e.g., 'Cluster not found' or 'Job failed due to node loss'), not a SIGTERM signal from Airflow. Option B is wrong because a code error in the Dataproc job would produce a job failure status and a different error message (e.g., 'Job failed with exit code 1'), not a SIGTERM from the orchestrator. Option C is wrong because running out of disk space in the Cloud Composer environment would cause worker crashes or DAG parsing errors, not a targeted SIGTERM to a specific task.


323

MCQhard

You are designing a disaster recovery strategy for a critical streaming data processing pipeline. The pipeline reads from Cloud Pub/Sub, processes with Dataflow streaming, and writes to BigQuery. The required RPO is less than 1 minute, and RTO is less than 5 minutes. Which architecture should you implement?

A.Use cross-region replication with two separate Dataflow pipelines reading from a Pub/Sub cross-region subscription and writing to a BigQuery cross-region dataset

B.Run the pipeline using Dataflow batch mode with a 1-minute trigger and store intermediate results in Cloud Storage

C.Deploy resources in a single region with regular backups to Cloud Storage

D.Use a single Dataflow pipeline with a standby cluster in another region, but failover is manual

AnswerA

Cross-region replication ensures data is available in another region with minimal latency, meeting RPO and RTO.

Why this answer

Option A is correct because cross-region replication for Pub/Sub ensures messages are available in a secondary region with sub-second latency, and a separate Dataflow pipeline reading from a cross-region subscription provides active-active processing. BigQuery cross-region dataset replication (using the 'cross-region' dataset location, e.g., EU or US multi-region, or a specific dual-region configuration) ensures data durability and availability within the RPO of <1 minute. This architecture meets both RPO and RTO by eliminating single points of failure and enabling automatic failover without manual intervention.

Exam trap

The trap here is that candidates often assume a single pipeline with a standby cluster is sufficient, but they overlook that manual failover cannot meet the strict RTO of <5 minutes, and that cross-region replication must be active-active (not active-passive) to achieve sub-minute RPO.

How to eliminate wrong answers

Option B is wrong because Dataflow batch mode with a 1-minute trigger cannot achieve sub-minute RPO; batch processing introduces inherent latency and does not provide continuous streaming, so the RPO of <1 minute is not guaranteed. Option C is wrong because deploying in a single region with regular backups to Cloud Storage fails to meet the RTO of <5 minutes; restoring from backups takes significantly longer than 5 minutes, and there is no active standby to fail over to. Option D is wrong because a manual failover process cannot achieve the RTO of <5 minutes; manual intervention introduces unpredictable delays, and a standby cluster without automatic failover violates the RTO requirement.


324

Multi-Selectmedium

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

Select 2 answers

A.Use a pull subscription with a 10-second acknowledgment deadline.

B.Enable Dataflow Streaming Engine.

C.Enable exactly-once processing sinks (e.g., BigQuery with guaranteed row-level insertion).

D.Disable autoscaling to prevent worker churn.

E.Use micro-batch processing with a small batch size.

AnswersB, C

Streaming Engine offloads state management to the backend, improving reliability.

Why this answer

Option B is correct because enabling Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing the impact of worker scaling and preemption. This improves reliability by providing consistent performance and fault tolerance for streaming pipelines, especially those with high throughput or stateful processing.

Exam trap

The trap here is that candidates often confuse reliability with throughput or latency, and may incorrectly choose micro-batching or disabling autoscaling as reliability improvements, when in fact Dataflow's reliability comes from its managed backend services like Streaming Engine.


325

MCQhard

You are designing a data pipeline that must process sensitive customer data with strict access controls. The data is ingested via Cloud Pub/Sub, processed by Cloud Dataflow, and stored in BigQuery. The security team requires that data is encrypted at rest and in transit, and that access is limited to specific service accounts. Which implementation strategy meets all requirements?

A.Use Cloud KMS for BigQuery only; leave Dataflow with default encryption

B.Use VPC Service Controls and Cloud Armor for network security

C.Use default Google-managed encryption keys and IAM roles only

D.Use CMEK for Pub/Sub, Dataflow, and BigQuery, and VPC-SC with per-service service accounts

AnswerD

CMEK ensures encryption control; VPC-SC and service accounts enforce access.

Why this answer

Option D is correct because it combines customer-managed encryption keys (CMEK) for all three services (Pub/Sub, Dataflow, BigQuery), so data at rest is encrypted with keys the customer controls, with VPC Service Controls (VPC-SC) and per-service service accounts, which enforce a service perimeter and least-privilege access. Encryption in transit is not a CMEK feature; Google Cloud encrypts traffic between these services with TLS by default. Together these satisfy the requirements for encryption at rest and in transit and for access limited to specific service accounts.
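A minimal sketch of how the key and the worker identity might be passed to a Dataflow job. `--dataflow_kms_key` and `--service_account_email` are Dataflow pipeline options; the project, key ring, key, and service account names are placeholders, and the two helper functions are illustrative rather than part of any SDK.

```python
def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Build the full Cloud KMS resource name for a CMEK key."""
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


def dataflow_security_args(kms_key: str, worker_sa: str) -> list[str]:
    """Pipeline options that set the CMEK key and a dedicated worker identity."""
    return [
        f"--dataflow_kms_key={kms_key}",
        f"--service_account_email={worker_sa}",
    ]


# Placeholder names; substitute real resources.
key = kms_key_name("my-project", "us-central1", "pipeline-ring", "pipeline-key")
args = dataflow_security_args(
    key, "dataflow-worker@my-project.iam.gserviceaccount.com"
)
print("\n".join(args))
```

Pub/Sub topics and BigQuery datasets take the same key resource name in their own encryption settings, so one helper can feed all three services.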

Exam trap

Google Cloud often tests the misconception that network security tools like VPC Service Controls or Cloud Armor alone satisfy encryption requirements, or that default encryption is sufficient when customer-managed keys are explicitly required.

How to eliminate wrong answers

Option A is wrong because it only applies Cloud KMS to BigQuery, leaving Dataflow with default Google-managed encryption, which does not meet the requirement for customer-controlled encryption at rest across all services. Option B is wrong because VPC Service Controls and Cloud Armor provide network security and perimeter controls but do not address data encryption at rest or in transit, which is a separate requirement. Option C is wrong because default Google-managed encryption keys and IAM roles alone do not provide customer-controlled encryption keys (CMEK) or the granular access controls enforced by VPC-SC with per-service service accounts.


326

MCQhard

A company uses Vertex AI Feature Store for serving features. They have a high-throughput online serving requirement. Which configuration should they use?

A.Cloud Storage with high-memory instances

B.Bigtable as serving source

C.Firestore

D.Vertex AI Feature Store with online serving enabled

AnswerD

Vertex AI Feature Store is purpose-built for high-throughput online feature serving.

Why this answer

Vertex AI Feature Store with online serving enabled is the correct choice because it is specifically designed for low-latency, high-throughput retrieval of feature values for online predictions. It uses a managed Bigtable backend optimized for real-time serving, ensuring consistent performance under high request loads without requiring manual infrastructure management.

Exam trap

Google Cloud often tests the misconception that any low-latency database (like Bigtable or Firestore) can directly replace Vertex AI Feature Store, ignoring the managed orchestration, feature registry, and point-in-time lookup capabilities that are essential for consistent online serving in ML workflows.

How to eliminate wrong answers

Option A is wrong because Cloud Storage is a blob storage service with high latency and no indexing for real-time feature lookups, making it unsuitable for high-throughput online serving. Option B is wrong because Bigtable is a NoSQL database that can serve features, but it requires manual configuration, scaling, and integration with Vertex AI Feature Store, whereas the Feature Store provides a managed, optimized serving layer with built-in consistency and monitoring. Option C is wrong because Firestore is a document database designed for mobile and web apps with moderate throughput, not for the sub-millisecond latency and high concurrency required by ML feature serving at scale.


327

MCQmedium

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A.Use Cloud Dataproc Serverless for all Spark jobs.

B.Migrate jobs to Cloud Dataflow.

C.Run Spark on Compute Engine instances with startup scripts.

D.Use Dataproc clusters with auto-scaling and preemptible VMs.

AnswerD

Reduces cost and operational overhead.

Why this answer

Option D is correct because Dataproc clusters with auto-scaling and preemptible VMs directly address the need to reduce operational overhead and minimize costs for on-premises Spark migrations. Auto-scaling dynamically adjusts cluster size based on workload, while preemptible VMs (which cost 60-80% less than standard VMs) handle fault-tolerant tasks, making this the most cost-effective and operationally efficient architecture for Spark on Dataproc.

Exam trap

The trap here is that candidates often choose Cloud Dataproc Serverless (Option A) thinking it eliminates all operational overhead, but they overlook that it lacks the cost-saving benefits of preemptible VMs and may not support all Spark features, making auto-scaling clusters with preemptible VMs the more appropriate choice for minimizing costs in a migration scenario.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc Serverless is designed for batch Spark workloads without cluster management, but it lacks the flexibility and cost optimization of preemptible VMs for long-running or complex jobs, and may not support all Spark configurations or libraries used in on-premises environments. Option B is wrong because Cloud Dataflow is a different processing engine (Apache Beam) that requires rewriting Spark jobs into Beam pipelines, adding migration complexity and operational overhead, not reducing it. Option C is wrong because running Spark on Compute Engine instances with startup scripts requires manual cluster management, scaling, and fault tolerance, increasing operational overhead and negating the benefits of a managed service like Dataproc.


328

MCQhard

What is the root cause of this error and the correct solution?

A.The BigQuery table requires authorized view access.

B.The user running the job needs the BigQuery Admin role.

C.The Dataflow service account needs the BigQuery User role.

D.The Dataflow worker service account needs the BigQuery Data Viewer role.

AnswerD

BigQuery Data Viewer includes the required getData permission.

Why this answer

Option D is correct because Dataflow workers execute under a specific service account (the Compute Engine default service account or a custom one), and that service account must have the BigQuery Data Viewer role to read data from BigQuery tables. Without this permission, the workers cannot access the source data, causing the job to fail with access errors. The BigQuery User role is insufficient for reading table data, and the BigQuery Admin role is overly permissive and not required for this task.

Exam trap

Google Cloud often tests the distinction between the Dataflow service account (the service agent that manages the job) and the Dataflow worker service account (which performs data operations), leading candidates to incorrectly assign permissions to the service agent instead of the worker account.

How to eliminate wrong answers

Option A is wrong because authorized view access is a mechanism to share query results without granting direct table access, but the error here is about the Dataflow service account lacking read permissions on the BigQuery table, not about view authorization. Option B is wrong because the BigQuery Admin role grants full control over BigQuery resources, which is excessive and not necessary; the user running the job does not need admin rights—only the worker service account needs read access. Option C is wrong because the BigQuery User role allows running queries and creating datasets but does not grant read access to table data; the Dataflow service account (which orchestrates the job) does not directly read data—the worker service account does.


329

MCQhard

Refer to the exhibit. A BigQuery dataset is shared with the group 'analysts@example.com' using the IAM policy shown. A user who is a member of this group reports that they cannot run queries on the dataset, though they can see the tables. What is the most likely reason?

A.The group needs the 'roles/bigquery.jobUser' role at the project level.

B.The user is using an incorrect client library version.

C.The user's account is not activated in the group membership.

D.The dataset has an organization policy that denies query access.

AnswerA

DataViewer provides read access but not job submission; jobUser must be granted at the project level to run queries.

Why this answer

The role 'roles/bigquery.dataViewer' allows viewing table metadata and data but does not allow running queries; users also need 'roles/bigquery.jobUser' at the project level to submit query jobs.
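The gap can be illustrated with a small permission check. This is a sketch only: the permission sets are abridged (the real predefined roles grant more), and the model ignores where each role is granted, whereas in practice `roles/bigquery.jobUser` must be bound at the project level. The helper name `missing_for_query` is hypothetical.

```python
# Abridged permission sets; the real predefined roles grant more permissions.
ROLE_PERMISSIONS = {
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.getData",
        "bigquery.tables.list",
    },
    "roles/bigquery.jobUser": {"bigquery.jobs.create"},
}

# Running a query needs a job in a project plus read access to the table data.
QUERY_PERMISSIONS = {"bigquery.jobs.create", "bigquery.tables.getData"}


def missing_for_query(granted_roles: list[str]) -> set[str]:
    """Return the permissions still missing before a query can run."""
    granted = set()
    for role in granted_roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return QUERY_PERMISSIONS - granted


print(missing_for_query(["roles/bigquery.dataViewer"]))
print(missing_for_query(["roles/bigquery.dataViewer", "roles/bigquery.jobUser"]))
```

With only the viewer role, `bigquery.jobs.create` is missing, which matches the symptom of seeing tables but being unable to query them.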


330

Multi-Selecteasy

A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?

Select 2 answers

A.Deploy the model to multiple Google Cloud regions for failover.

B.Deploy the model to a single zone to minimize cross-zone latency.

C.Use Cloud Batch for asynchronous prediction.

D.Optimize the model by pruning or quantizing to reduce size.

E.Store the model in Cloud Storage and load it on each request.

AnswersA, D

Why this answer

Deploying the model to multiple Google Cloud regions ensures high availability and failover capability. If one region becomes unavailable, traffic can be routed to another region, maintaining sub-100ms latency by using regional load balancing and Cloud DNS. This aligns with the requirement for a highly available, low-latency fraud detection system.

Exam trap

Google Cloud often tests the misconception that single-zone deployment minimizes latency, but the real trade-off is between availability and negligible intra-region latency, making multi-region deployment the correct choice for high availability.


331

MCQmedium

A team uses Vertex AI AutoML Tables to train a model. They need to deploy the model for real-time predictions with high availability. Which deployment configuration should they use?

A.Export as a Cloud Function

B.Deploy to a Vertex AI Endpoint with 1 replica

C.Use a Vertex AI Batch Prediction job

D.Deploy to a Vertex AI Endpoint with multiple replicas and auto-scaling

AnswerD

Multiple replicas provide HA.

Why this answer

For real-time predictions with high availability, you need a deployment that can handle traffic spikes and failover. Deploying to a Vertex AI Endpoint with multiple replicas and auto-scaling ensures that the model is served from multiple instances, providing redundancy and the ability to scale up or down based on demand. This configuration meets the high-availability requirement by distributing load and automatically recovering from instance failures.

Exam trap

The trap here is that candidates often confuse batch prediction with real-time serving, or assume that a single replica is sufficient for high availability, not realizing that high availability requires redundancy and automatic scaling.

How to eliminate wrong answers

Option A is wrong because exporting as a Cloud Function is not a deployment method for Vertex AI AutoML Tables models; Cloud Functions are for serverless event-driven code, not for hosting ML model endpoints with real-time prediction capabilities. Option B is wrong because deploying to a Vertex AI Endpoint with only 1 replica provides no redundancy or high availability; if that single instance fails or becomes overloaded, predictions will be unavailable. Option C is wrong because a Vertex AI Batch Prediction job is designed for asynchronous, offline predictions on large datasets, not for real-time, low-latency serving.


332

Multi-Selectmedium

A team needs to optimize online prediction cost for a model that has unpredictable traffic spikes. Which TWO strategies are most effective?

Select 2 answers

A.Enable autoscaling with a low min_replica_count and high max_replica_count

B.Set up Model Monitoring to trigger scaling

C.Deploy the model on a single high-memory machine

D.Use a smaller model version

E.Use batch prediction during high traffic

AnswersA, D

Autoscaling provides elasticity, scaling from a low base to handle spikes.

Why this answer

Option A is correct because autoscaling with a low min_replica_count and high max_replica_count allows the deployment to handle unpredictable traffic spikes by dynamically adjusting the number of replicas. This ensures cost efficiency during low traffic while providing capacity to scale out rapidly when demand surges, a key requirement for online prediction serving.
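A back-of-the-envelope comparison shows the cost effect. The sketch below is idealized: it assumes scaling is instantaneous and that billing follows replica-hours, and the traffic profile and per-replica throughput are hypothetical numbers, not measurements.

```python
import math


def replicas_for(qps: float, qps_per_replica: float, min_r: int, max_r: int) -> int:
    """Replicas needed for a load, clamped to the autoscaling bounds."""
    return max(min_r, min(max_r, math.ceil(qps / qps_per_replica)))


def replica_hours(hourly_qps: list[float], qps_per_replica: float,
                  min_r: int, max_r: int) -> int:
    """Total replica-hours across a sequence of one-hour intervals."""
    return sum(replicas_for(q, qps_per_replica, min_r, max_r) for q in hourly_qps)


# Hypothetical day: quiet traffic with a two-hour spike.
traffic = [10.0] * 22 + [400.0] * 2

autoscaled = replica_hours(traffic, qps_per_replica=50.0, min_r=1, max_r=10)
fixed_peak = replica_hours(traffic, qps_per_replica=50.0, min_r=8, max_r=8)
print(autoscaled, fixed_peak)
```

Provisioning for the peak all day costs several times more replica-hours than scaling from a low minimum, and a smaller model (Option D) raises `qps_per_replica`, which lowers both figures.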

Exam trap

Google Cloud often tests the distinction between monitoring (observability) and scaling (infrastructure action), leading candidates to incorrectly select Model Monitoring as a scaling trigger.


333

Multi-Selectmedium

A team runs a production application on Compute Engine. They want to ensure high availability and quality. Which three best practices should they implement? (Choose three.)

Select 3 answers

A.Use health checks and load balancing.

B.Use Cloud SQL read replicas for database load.

C.Enable OS Login for SSH access.

D.Use regional persistent disks for stateful data.

E.Use managed instance groups (MIGs) with autoscaling.

AnswersA, D, E

Health checks ensure only healthy instances receive traffic; load balancing provides fault tolerance.

Why this answer

Use managed instance groups for autoscaling and autohealing, regional persistent disks for durable high-availability storage, and health checks with load balancing to distribute traffic to healthy instances.


334

MCQmedium

Refer to the exhibit. A data scientist deploys a model using this configuration. Users report that after a few hours of inactivity, the first prediction request takes over 30 seconds. What is the most likely cause?

A.The automatic scaling configuration allows scaling down to zero replicas, causing a cold start on the first request.

B.The network latency between the client and the endpoint is high due to regional distance.

C.The endpoint is misconfigured with the wrong regional endpoint.

D.The model is too large and exceeds the instance memory.

AnswerA

minReplicaCount: 0 permits scaling to zero, and after inactivity, the first request must wait for a new replica to start.

Why this answer

Option A is correct because the automatic scaling configuration that allows scaling down to zero replicas means that after a period of inactivity, all model replicas are terminated. When a new prediction request arrives, the endpoint must provision a new replica from scratch, which involves loading the model artifacts, initializing the inference container, and performing health checks. This cold start process typically takes 30 seconds or more, matching the reported behavior.

Exam trap

Google Cloud often tests the distinction between cold start latency (caused by scaling to zero) and persistent performance issues like network latency or resource exhaustion, so candidates must recognize that a delay only after inactivity points to replica provisioning, not a constant problem.

How to eliminate wrong answers

Option B is wrong because network latency due to regional distance would cause consistent high latency on every request, not just the first request after a period of inactivity. Option C is wrong because a misconfigured regional endpoint would result in persistent errors or high latency on all requests, not a delay only after inactivity. Option D is wrong because if the model exceeded instance memory, the endpoint would fail to serve predictions consistently or return out-of-memory errors, not exhibit a delay only on the first request after inactivity.


335

Multi-Selectmedium

Which TWO security best practices should be applied to secure data in transit for a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery? (Choose 2)

Select 2 answers

A.Use Cloud Key Management Service (Cloud KMS) to encrypt data in transit

B.Enable TLS encryption on all endpoints

C.Use VPC Service Controls to create a service perimeter

D.Use Cloud Armor to protect against DDoS

E.Use private IP addresses for Dataflow workers

AnswersB, C

TLS encrypts data moving between Google Cloud services; it is enabled by default, but the configuration should still be verified.

Why this answer

Option B is correct because TLS (Transport Layer Security) encryption ensures that data is encrypted during transmission between endpoints, such as between Cloud Pub/Sub and Dataflow workers, and between Dataflow workers and BigQuery. This is a fundamental security best practice for protecting data in transit against eavesdropping and man-in-the-middle attacks.

Exam trap

The trap here is that candidates often confuse encryption at rest (Cloud KMS) with encryption in transit, or assume that using private IPs alone secures data in transit without needing TLS.


336

MCQeasy

A team has multiple versions of a model and wants to manage them centrally, including tracking metadata and promoting versions to production. Which tool should they use?

A.Cloud Storage

B.BigQuery

C.GitHub

D.Vertex AI Model Registry

AnswerD

Centralized model versioning and metadata.

Why this answer

Vertex AI Model Registry is designed for managing model versions, metadata, and deployment. Cloud Storage is storage only. BigQuery is for analytics. GitHub is for source code.


337

MCQhard

A data pipeline uses Cloud Pub/Sub to ingest events, then a Dataflow job writes to Cloud Storage in Avro format. The Dataflow job uses Global windows with a 10-minute trigger. The data is later loaded into BigQuery. They notice duplicate rows in BigQuery because the trigger produced multiple panes. What should the Dataflow pipeline change to eliminate duplicates?

A.Enable exactly-once sink to BigQuery via Dataflow

B.Use a sharded output to Cloud Storage with unique filenames

C.Write to a staging table and use a MERGE statement in BigQuery

D.Use a session window instead of global window

AnswerA

Dataflow's exactly-once sink to BigQuery uses record IDs to deduplicate, preventing duplicates caused by trigger panes.

Why this answer

Option A is correct because enabling exactly-once sinks in Dataflow ensures that each record is written to the sink only once, even if the pipeline produces multiple panes due to triggers. In this scenario, the 10-minute trigger on a global window causes multiple output panes, leading to duplicate rows in BigQuery. Exactly-once sinks use idempotent writes and deduplication mechanisms to prevent duplicates, directly addressing the issue without changing the windowing or trigger logic.
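The idea behind deduplication by record ID can be sketched in a few lines. This is a conceptual illustration using an in-memory dictionary as a stand-in for the sink, not Dataflow's actual implementation; the record fields and the function name `write_idempotent` are hypothetical.

```python
def write_idempotent(sink: dict, records: list[dict]) -> None:
    """Upsert each record by its unique ID so replays do not add rows."""
    for record in records:
        sink[record["id"]] = record


# Two panes from the same window; the second repeats an earlier record.
pane_1 = [{"id": "evt-1", "amount": 10}, {"id": "evt-2", "amount": 25}]
pane_2 = [{"id": "evt-2", "amount": 25}, {"id": "evt-3", "amount": 7}]

table: dict = {}
write_idempotent(table, pane_1)
write_idempotent(table, pane_2)
print(len(table))
```

Because the write is keyed on the record ID, emitting `evt-2` in both panes leaves a single row, whereas a plain append would have stored it twice.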

Exam trap

Google Cloud often tests the misconception that changing windowing or output file naming can solve duplicate data issues, when the real solution is to enable exactly-once processing guarantees at the sink level.

How to eliminate wrong answers

Option B is wrong because sharded output with unique filenames only prevents file-level collisions in Cloud Storage, but does not eliminate duplicate rows within the Avro files; duplicates from multiple panes still exist. Option C is wrong because writing to a staging table and using a MERGE statement is a workaround that does not fix the root cause in the Dataflow pipeline; it adds complexity and latency, and is not a Dataflow-native solution. Option D is wrong because session windows group events based on activity gaps, not time intervals; they do not prevent duplicate panes from triggers and are inappropriate for a global-windowed pipeline that needs to deduplicate across all data.


338

Multi-Selecteasy

Which TWO actions can help reduce prediction latency for a Vertex AI endpoint?

Select 2 answers

A.Increase the number of features

B.Optimize the model architecture to reduce size

C.Use a custom prediction container with optimized dependencies

D.Use a larger machine type with more vCPUs

E.Set min replicas to 0 to save cost

AnswersB, C

Smaller models predict faster.

Why this answer

Optimizing the model architecture to reduce size directly decreases the computational load during inference, which lowers prediction latency. Smaller models require fewer floating-point operations (FLOPs) per prediction, enabling faster response times on Vertex AI endpoints.

Exam trap

Google Cloud often tests the misconception that adding more compute resources (larger machine types) always reduces latency, when in fact it can increase overhead and does not address the root cause of slow inference, which is model complexity.


339

MCQeasy

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A.Dataproc Serverless with PySpark

B.Dataflow with batch mode

C.Cloud Data Fusion

D.BigQuery Data Transfer Service

AnswerA

Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.

Why this answer

Dataproc Serverless with PySpark is the most cost-effective choice because it eliminates cluster management overhead and automatically scales resources based on workload, charging only for the processing time used. For 10 GB CSV files processed daily within a 24-hour window, the serverless model avoids the fixed costs of a persistent cluster, making it ideal for batch, non-time-sensitive jobs. PySpark's native support for CSV parsing and BigQuery integration via the Spark BigQuery connector ensures efficient data loading without additional services.

Exam trap

The trap here is that candidates often choose Dataflow (Option B) because it is a popular batch processing service, but they overlook that Dataproc Serverless is more cost-effective for non-time-sensitive, large CSV batch jobs due to its serverless pricing model and native Spark support for CSV processing.

How to eliminate wrong answers

Option B is wrong because Dataflow with batch mode, while capable, uses a streaming-optimized runner that incurs higher per-job overhead and cost for simple batch CSV processing, especially when the data is not time-sensitive and can tolerate longer processing windows. Option C is wrong because Cloud Data Fusion is a visual ETL tool designed for complex data pipelines and integration scenarios, not for cost-effective batch processing of large CSV files; it adds unnecessary abstraction and cost for a straightforward load operation. Option D is wrong because BigQuery Data Transfer Service handles scheduled imports from SaaS applications (e.g., Google Ads, YouTube) and scheduled loads of files from Cloud Storage, but it cannot apply custom transformations or PySpark logic, making it unsuitable when raw CSV files must be processed before loading.

340

Multi-Selectmedium

A company deploys an ML model using Vertex AI Pipelines. They want to ensure reproducibility and traceability. Which TWO practices should they implement?

Select 2 answers

A.Pin all dependency versions

B.Record dataset version using Vertex AI Dataset

C.Use custom containers for every step

D.Store pipeline run metadata in Vertex AI Experiments

E.Use Kubeflow Pipelines instead

AnswersA, D

Pinning versions ensures consistent environments across runs.

Why this answer

Pinning all dependency versions (Option A) ensures that every pipeline run uses the exact same library versions, eliminating variability from package updates. This is a fundamental practice for reproducibility because even a minor version bump can change model behavior or break code. In Vertex AI Pipelines, dependencies are typically specified in a `requirements.txt` or `Dockerfile`, and pinning them (e.g., `tensorflow==2.12.0`) guarantees consistent execution environments across runs.
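As a rough illustration of what "pinned" means in practice, the sketch below scans the text of a `requirements.txt` file and reports any line that does not use an exact `==` version. The function name `unpinned_requirements` and the sample package list are hypothetical, and the regular expression is a simplification that ignores URLs, environment markers, and hashes.

```python
import re

# Matches "name==version" with optional extras, e.g. "tensorflow==2.12.0".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.+!_-]+$")


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements file content used only for this demo.
example = """
tensorflow==2.12.0
pandas>=1.5      # a range, not a pin
scikit-learn
"""
print(unpinned_requirements(example))  # ['pandas>=1.5', 'scikit-learn']
```

A check like this could run as an early step before a pipeline is compiled, so that an unpinned dependency fails fast instead of silently changing the environment between runs.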

Exam trap

Google Cloud often tests the misconception that dataset versioning (Option B) is a core requirement for reproducibility in Vertex AI Pipelines, but the exam emphasizes that dependency pinning and experiment metadata storage are the two primary practices for ensuring reproducibility and traceability in ML pipelines.

341

MCQeasy

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

A.roles/bigquery.admin

B.roles/bigquery.user

C.roles/bigquery.jobUser

D.roles/bigquery.dataEditor

AnswerD

Includes bigquery.tables.get.

Why this answer

The error indicates the service account lacks the `bigquery.tables.get` permission, which is required to read table metadata. `roles/bigquery.dataEditor` includes this permission along with `bigquery.tables.getData`, `bigquery.tables.update`, and `bigquery.tables.export`, making it the least-privileged role among the options that resolves the access denied error for a Dataflow job reading from a BigQuery table.

Exam trap

Google Cloud often tests the misconception that `roles/bigquery.user` or `roles/bigquery.jobUser` provide sufficient read access for Dataflow jobs, when in fact they lack the specific `bigquery.tables.get` permission needed for table metadata retrieval.

How to eliminate wrong answers

Option A is wrong because `roles/bigquery.admin` grants full control over BigQuery resources, including dataset deletion and IAM policy management, which is excessive and violates the principle of least privilege for a Dataflow job that only needs to read table data. Option B is wrong because `roles/bigquery.user` provides `bigquery.datasets.get` and `bigquery.jobs.create` but does not include `bigquery.tables.get`, so it would not resolve the specific permission error. Option C is wrong because `roles/bigquery.jobUser` only allows creating and managing jobs (e.g., queries) but does not grant any direct table read permissions like `bigquery.tables.get`.

342

MCQmedium

A company is building a real-time streaming pipeline to ingest clickstream events from web servers, enrich them with user profile data from Cloud Bigtable, and aggregate metrics into BigQuery. The expected throughput is 10,000 events per second with occasional spikes up to 50,000. The data must be processed with low latency (seconds) and exactly-once semantics. Which Google Cloud service should be the core processing engine?

A.Cloud Dataflow (Apache Beam runner)

B.Cloud Pub/Sub with Cloud Functions

C.Cloud Dataproc with Apache Spark Streaming

D.Cloud Data Fusion

AnswerA

Dataflow provides auto-scaling, exactly-once semantics, low latency, and native integration with BigQuery and Bigtable.

Why this answer

Cloud Dataflow, as a managed Apache Beam runner, is the correct choice because it provides exactly-once processing semantics, low-latency streaming (sub-second to seconds), and autoscaling to handle throughput spikes from 10,000 to 50,000 events per second. Its unified batch and streaming model allows you to enrich clickstream events with user profile data from Cloud Bigtable via side inputs or asynchronous lookups, and write aggregated metrics to BigQuery with exactly-once guarantees using the Beam BigQuery I/O connector.

Exam trap

Google Cloud often tests the misconception that Cloud Pub/Sub with Cloud Functions is sufficient for low-latency streaming, but candidates overlook that Cloud Functions lacks stateful processing and exactly-once semantics, making it unsuitable for aggregation and enrichment at high throughput.

How to eliminate wrong answers

Option B (Cloud Pub/Sub with Cloud Functions) is wrong because Cloud Functions (1st gen) has a maximum timeout of 9 minutes and does not support exactly-once processing semantics; it is at-least-once by default and lacks checkpointing for stateful operations like aggregation. Option C (Cloud Dataproc with Apache Spark Streaming) is wrong because Spark Streaming's micro-batch architecture adds batch-interval latency, and both exactly-once semantics and scaling for traffic spikes require additional configuration and cluster management that the managed service does not handle natively. Option D (Cloud Data Fusion) is wrong because it is a visual ETL tool designed primarily for batch-oriented data integration; its pipelines are not suited to seconds-level latency with exactly-once aggregation on high-throughput event streams.

343

MCQhard

A company uses Cloud Dataproc to run Spark jobs on ephemeral clusters. The input data is in Cloud Storage and output is also to Cloud Storage. The cluster is created and deleted daily. The cost is high due to spinning up nodes. Which change can reduce cost without sacrificing performance?

A.Use standard VMs with a larger number of smaller machines

B.Use Cloud Dataflow instead

C.Use a combination of standard and preemptible VMs for worker nodes

D.Use preemptible VMs for all nodes

AnswerC

Preemptible VMs for workers reduce cost significantly; standard VMs for the master and a few worker nodes ensure reliability.

Why this answer

Option C is correct because using a combination of standard and preemptible VMs for worker nodes reduces cost significantly while maintaining performance. Preemptible VMs are up to 80% cheaper than standard VMs, and since Spark is fault-tolerant and reschedules tasks lost to a preempted node on the remaining workers, the job can still complete. Standard VMs for the master and a core set of workers keep the cluster stable, while preemptible workers handle the bulk of data processing.
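The cost argument can be made concrete with a small calculation. This is only a sketch: the unit price of 1.0 per standard worker-hour is a placeholder, and the 80% discount is the "up to" figure quoted above, not a guaranteed rate.

```python
def cluster_hourly_cost(standard_workers: int, preemptible_workers: int,
                        standard_price: float = 1.0,
                        preemptible_discount: float = 0.80) -> float:
    """Hourly worker cost in arbitrary price units (placeholder prices)."""
    preemptible_price = standard_price * (1 - preemptible_discount)
    return standard_workers * standard_price + preemptible_workers * preemptible_price


all_standard = cluster_hourly_cost(10, 0)
mixed = cluster_hourly_cost(2, 8)
print(round(all_standard, 2), round(mixed, 2))  # 10.0 3.6
```

Under these assumptions, a 2 standard + 8 preemptible layout costs about 36% of an all-standard cluster of the same size while keeping a stable core of workers.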

Exam trap

Google Cloud often tests the misconception that preemptible VMs can be used for all nodes, but the trap here is that the master node must be a standard VM to avoid cluster instability, while workers can safely use preemptible VMs due to Spark's fault tolerance.

How to eliminate wrong answers

Option A is wrong because using a larger number of smaller machines increases overhead from inter-node communication and task scheduling, potentially degrading performance and not necessarily reducing cost. Option B is wrong because Cloud Dataflow is a different service for batch and stream processing, not a direct replacement for Spark on Dataproc; migrating would require rewriting jobs and may not preserve existing Spark-specific logic or performance characteristics. Option D is wrong because using preemptible VMs for all nodes, including the master node, risks cluster failure if the master is preempted, as Dataproc does not automatically recover the master; this sacrifices reliability and can cause job failures.

344

MCQhard

A healthcare company uses Vertex AI to deploy a medical image classification model. The model is deployed on a private endpoint with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The model uses a custom container with a GPU for inference. Recently, during peak business hours (9 AM - 5 PM), users report that prediction requests frequently time out after 60 seconds, and the error rate increases. The team checks Cloud Monitoring and observes that CPU utilization averages 40%, GPU utilization averages 30%, and the number of replicas stays at 2. There are no errors in the container logs. The model serves a few hundred requests per second during peak. The team suspects the issue is not resource saturation but something else. What should they do to resolve the problem?

A.Switch from online prediction to batch prediction using Vertex AI Batch Prediction.

B.Increase the minReplicaCount to 5 to ensure more replicas are always available.

C.Increase the request timeout setting on the load balancer to 120 seconds.

D.Optimize the prediction container to handle requests faster by reducing image pre-processing and using async I/O.

AnswerD

Improving request handling efficiency directly addresses the timeout. Likely the container is blocking on I/O or serialization.

Why this answer

Option D is correct because the symptoms—low CPU/GPU utilization, replicas stuck at 2, and timeouts—indicate that the container is taking too long to process each request, not that resources are saturated. Optimizing the container (e.g., reducing image pre-processing, using async I/O) reduces per-request latency, allowing the model to handle the same request rate within the 60-second timeout. This directly addresses the root cause without changing scaling or timeout settings.
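To show why async I/O helps when requests spend most of their time waiting, here is a minimal sketch using Python's standard `asyncio`. The functions `fetch_image` and `handle_request` are hypothetical stand-ins: the 50 ms sleep simulates a network wait, and no real model is called.

```python
import asyncio
import time


async def fetch_image(request_id: int) -> bytes:
    """Hypothetical I/O step (e.g. downloading an image) simulated with a sleep."""
    await asyncio.sleep(0.05)  # while this request waits, the event loop serves others
    return f"image-{request_id}".encode()


async def handle_request(request_id: int) -> str:
    image = await fetch_image(request_id)
    # Placeholder for inference; a real container would call the model here.
    return f"prediction for {image.decode()}"


async def serve(request_ids):
    # gather() overlaps the I/O waits of all requests instead of serializing them.
    return await asyncio.gather(*(handle_request(r) for r in request_ids))


start = time.perf_counter()
results = asyncio.run(serve(range(5)))
elapsed = time.perf_counter() - start
print(results[0], f"{elapsed:.2f}s")
```

Handled one after another, five such requests would take at least 0.25 s; overlapping the waits brings the total close to a single wait. Async I/O does not speed up compute-bound pre-processing, which still has to be reduced or moved off the request path.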

Exam trap

The trap here is that candidates react to the timeouts by adding replicas or raising the timeout, when the low CPU/GPU utilization shows that capacity is not the constraint; the bottleneck is per-request latency within the container, which autoscaling cannot fix.

How to eliminate wrong answers

Option A is wrong because switching to batch prediction would not solve real-time inference timeouts; batch prediction is for offline, non-latency-sensitive workloads and would break the real-time requirement. Option B is wrong because increasing minReplicaCount to 5 does not address the fact that existing replicas are underutilized (30-40% CPU/GPU) and requests are timing out due to slow processing, not lack of replicas. Option C is wrong because increasing the load balancer timeout to 120 seconds would only mask the symptom; the container still cannot process requests fast enough, and the underlying latency issue would persist, potentially causing cascading failures.

345

MCQeasy

A startup is deploying a machine learning model for real-time fraud detection. They need low latency and automatic scaling during peak hours. Which Google Cloud service should they use?

A.Cloud Functions

B.Batch Prediction on Vertex AI

C.Cloud AI Platform Prediction with custom containers

D.Vertex AI Endpoints

AnswerD

Vertex AI Endpoints provides managed online prediction with automatic scaling and low latency.

Why this answer

Vertex AI Endpoints is the managed service for online predictions with autoscaling, ideal for real-time low-latency requirements.

346

MCQhard

Your company uses Vertex AI Pipelines to automate the ML lifecycle. The pipeline includes training, evaluation, and deployment steps. You want to ensure that if a pipeline run fails due to a transient error (e.g., resource quota shortage), it automatically retries before marking the run as failed. What is the best way to implement this?

A.Configure Vertex AI Pipelines to automatically restart failed runs.

B.In the pipeline component code, implement retry logic using exponential backoff for specific exceptions.

C.Set a high timeout value for the pipeline so that transient errors resolve before timeout.

D.Use Cloud Tasks to schedule pipeline runs and retry upon failure.

AnswerB

Retrying within the component handles transient failures gracefully without failing the entire pipeline.

Why this answer

Vertex AI Pipelines does not automatically restart a failed run, so transient errors are best handled at the step level. You can wrap each step's logic to catch transient errors and retry with exponential backoff, or use a retry mechanism in the container itself. A task-level retry policy can also be specified through the Kubeflow Pipelines SDK.

Modifying pipeline code is the most direct way.
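A minimal sketch of retry logic with exponential backoff that a component could wrap around its work. `TransientError` and `retry_with_backoff` are hypothetical names introduced for illustration; real component code would catch whichever exception types its client library raises for quota or availability errors.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a quota shortage)."""


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the step fail
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter avoids many steps retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


calls = {"n": 0}


def flaky_step():
    """Demo step that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("quota temporarily exhausted")
    return "ok"


# sleep is replaced with a no-op so the demo does not actually wait.
print(retry_with_backoff(flaky_step, sleep=lambda seconds: None))  # ok
```

Only errors marked as transient are retried; any other exception propagates immediately, so genuine bugs still fail the step on the first attempt.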

347

MCQeasy

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

A.Migrate the pipeline to Cloud Dataflow with Apache Beam for auto-scaling

B.Read inventory data from BigQuery and pre-join in BigQuery, then export to Cloud Storage as ORC files

C.Write intermediate results to Cloud SQL instead of BigQuery for faster access

D.Increase the number of worker nodes in the Dataproc cluster

AnswerB

Reduces data shuffle in Spark and speeds up processing.

Why this answer

Option B is correct because it offloads the join operation to BigQuery, which is optimized for large-scale analytics and can process the join much faster than Spark. By pre-joining and exporting the result as ORC files (a columnar format optimized for Spark), the pipeline avoids the expensive shuffle and data transfer between Cloud Storage and BigQuery, significantly reducing the overall processing time to meet the 30-minute target.

Exam trap

The trap here is that candidates often assume that simply scaling up the existing infrastructure (more workers or auto-scaling) is the most effective optimization, but Google Cloud tests the understanding that architectural changes to reduce data movement and leverage service-specific strengths (like BigQuery for joins) are far more impactful than brute-force scaling.

How to eliminate wrong answers

Option A is wrong because migrating to Cloud Dataflow with Apache Beam introduces auto-scaling but does not address the fundamental bottleneck of joining large datasets across Cloud Storage and BigQuery; the join operation would still require significant data movement and processing, likely not achieving the required speedup. Option C is wrong because writing intermediate results to Cloud SQL instead of BigQuery would actually slow down the pipeline, as Cloud SQL is a transactional database not designed for high-throughput batch writes, and it would introduce additional latency and potential contention. Option D is wrong because simply increasing the number of worker nodes in the Dataproc cluster may improve parallelism but does not eliminate the costly shuffle and data transfer inherent in the join between Cloud Storage and BigQuery; it would also increase costs without guaranteeing the 6x performance improvement needed.

348

MCQmedium

After migrating a production Cloud SQL for PostgreSQL database to a larger machine type, the team notices slower queries. What is the best step to identify the cause?

A.Reindex all tables to improve index efficiency.

B.Enable query caching through the database flags.

C.Enable pg_stat_statements and review query execution times.

D.Increase max_connections to handle more concurrent queries.

AnswerC

This extension captures per-query statistics, allowing identification of regressed queries.

Why this answer

Option C is correct because pg_stat_statements is a PostgreSQL extension that provides detailed query execution statistics, including total execution time, number of calls, and I/O metrics. After migrating to a larger machine type, slower queries often stem from plan changes due to different hardware characteristics or configuration settings; reviewing pg_stat_statements output helps pinpoint which queries are underperforming and why.

Exam trap

Google Cloud often tests the misconception that performance issues after a migration are always due to indexing or connection limits, when in fact the most effective first step is to gather query-level metrics using built-in tools like pg_stat_statements.

How to eliminate wrong answers

Option A is wrong because reindexing all tables is a maintenance task that can improve index bloat but does not address the root cause of slower queries after a migration; it is a reactive measure without diagnostic value. Option B is wrong because Cloud SQL for PostgreSQL does not support a generic 'query caching' database flag; PostgreSQL relies on shared buffers and the buffer cache, and enabling any such flag would not provide diagnostic insight into query performance. Option D is wrong because increasing max_connections can actually degrade performance by increasing context switching and memory contention; it does not help identify why queries are slower and may worsen the issue.

349

Multi-Selectmedium

Which THREE features of Cloud Pub/Sub guarantee at-least-once delivery and enable exactly-once processing downstream? (Choose three.)

Select 3 answers

A.Subscriber-retry policy with exponential backoff.

B.Exactly-once delivery source feature (enabled by default in current gcloud).

C.Message ordering by message key.

D.Cloud Dataproc integration for message replay.

E.Acknowledgment deadlines and message persistence.

AnswersA, B, E

Retries ensure messages are eventually delivered on failure.

Why this answer

Option A is correct because a subscriber-retry policy with exponential backoff ensures that messages that fail to be processed are retried with increasing delays, preventing transient failures from causing message loss. This mechanism, combined with Pub/Sub's persistent storage, guarantees that each message is delivered at least once, as the subscriber will keep retrying until it acknowledges the message.

Exam trap

Google Cloud often tests the misconception that message ordering or replay features contribute to delivery guarantees, when in fact ordering only governs sequence and message replay through Cloud Dataproc is not a Pub/Sub capability; the key trap is confusing 'exactly-once delivery' (a subscription-level Pub/Sub setting) with 'exactly-once processing' (which also depends on the subscriber, for example through idempotent handling).
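Because retries under at-least-once delivery can hand the same message to a subscriber more than once, downstream exactly-once processing usually relies on idempotent handling. The sketch below simulates this in plain Python with no Pub/Sub client involved; `IdempotentHandler` and the message tuples are hypothetical, and a real subscriber would keep the set of seen IDs in durable storage.

```python
class IdempotentHandler:
    """Applies each message at most once, keyed by message ID."""

    def __init__(self):
        self.seen_ids = set()  # in production this would be durable storage
        self.total = 0

    def handle(self, message_id: str, amount: int) -> bool:
        """Apply the message unless it was already applied; return True to ack."""
        if message_id in self.seen_ids:
            return True  # duplicate: ack again, but do not re-apply
        self.total += amount
        self.seen_ids.add(message_id)
        return True


handler = IdempotentHandler()
# The second "m1" models a redelivery after a missed acknowledgment deadline.
for message_id, amount in [("m1", 10), ("m2", 5), ("m1", 10)]:
    handler.handle(message_id, amount)
print(handler.total)  # 15
```

The duplicate is acknowledged but not applied, so the running total reflects each message once even though three deliveries occurred.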

350

MCQeasy

An e-commerce company processes real-time clickstream data using Pub/Sub and Dataflow. They want to ensure that if a Dataflow worker fails, the pipeline can resume processing from the point of failure without data loss. Which feature should they enable?

A.At-least-once delivery mode

B.Exactly-once processing mode

C.Snapshot-based recovery

D.Streaming engine

AnswerC

Allows periodic saving of pipeline state and resumption from saved snapshots.

Why this answer

Snapshot-based recovery (Option C) is the correct feature because Dataflow snapshots capture the entire pipeline state, including the current position in each Pub/Sub subscription and the state of all transforms. If a worker fails, the pipeline can be resumed from the exact snapshot point, ensuring no data loss and exactly-once processing semantics for the recovered data.

Exam trap

Google Cloud often tests the misconception that exactly-once processing alone guarantees failure recovery, but it only prevents duplicates during normal operation, not resumption after a worker crash.

How to eliminate wrong answers

Option A is wrong because at-least-once delivery mode ensures messages are delivered at least once but does not provide a mechanism to resume from a specific point of failure; it may cause duplicate processing but not lossless recovery. Option B is wrong because exactly-once processing mode is a processing guarantee that prevents duplicates but does not inherently provide a recovery mechanism to resume from a failure point; it relies on other features like snapshots for stateful resumption. Option D is wrong because Streaming Engine is a Dataflow feature that moves state and shuffle data to a backend service to reduce worker resource usage, but it does not directly provide a point-of-failure recovery mechanism; snapshots are required for that.

351

MCQeasy

A startup is using Cloud Build to automate the training and deployment of their machine learning models. The workflow is defined in cloudbuild.yaml and includes steps to: 1) Run a training job on AI Platform Training, 2) Build a custom prediction container, 3) Deploy the container to Cloud Run for serving. The deployment step fails intermittently with the error: 'Cloud Run service already exists and is not owned by the calling user.' You need to fix this so that deployments are reliable. What should you do?

A.Ensure the Cloud Build service account has the 'run.services.update' permission on the Cloud Run service.

B.Delete the existing Cloud Run service manually before each build.

C.Use 'gcloud run deploy --replace' in the build step to force replace the existing service.

D.Use Cloud Run for Anthos instead of fully managed Cloud Run to avoid ownership issues.

AnswerA

The error suggests a permissions issue; granting the correct role to the Cloud Build service account resolves it.

352

MCQeasy

Your team uses Cloud Dataproc to run a Spark ML training job. The job is failing with an error: 'Container killed by YARN for exceeding memory limits.' What should you do to fix this?

A.Increase the spark.executor.memory property

B.Use preemptible VMs for faster execution

C.Increase the number of worker nodes

D.Enable the external shuffle service

AnswerA

This directly addresses the memory limit for each executor.

Why this answer

The error 'Container killed by YARN for exceeding memory limits' indicates that the Spark executor process is using more memory than the YARN container allows. Increasing `spark.executor.memory` allocates a larger YARN container for each executor, providing the necessary headroom for the Spark application's memory demands, including overhead for off-heap memory and JVM internals.
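The relationship between executor memory and container size can be sketched with Spark's documented defaults: the container request is roughly `spark.executor.memory` plus `spark.executor.memoryOverhead`, where the overhead defaults to the larger of 384 MiB and 10% of executor memory. The helper name `yarn_container_mib` is hypothetical, and the sketch ignores YARN's rounding to its allocation increment as well as any extra PySpark or off-heap settings.

```python
MIN_OVERHEAD_MIB = 384   # Spark's default floor for executor memory overhead
OVERHEAD_FACTOR = 0.10   # Spark's default overhead fraction of executor memory


def yarn_container_mib(executor_memory_mib: int) -> int:
    """Approximate container size requested for one executor, in MiB."""
    overhead = max(MIN_OVERHEAD_MIB, int(executor_memory_mib * OVERHEAD_FACTOR))
    return executor_memory_mib + overhead


for heap in (2048, 4096, 8192):
    print(heap, "->", yarn_container_mib(heap))
# 2048 -> 2432
# 4096 -> 4505
# 8192 -> 9011
```

If the job is killed because of off-heap usage rather than heap usage, raising `spark.executor.memoryOverhead` directly may be a more targeted fix than raising `spark.executor.memory` alone.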

Exam trap

The trap here is that candidates often confuse scaling horizontally (adding nodes) with scaling vertically (increasing per-node resources), and assume more nodes will fix memory limits when the issue is per-container allocation.

How to eliminate wrong answers

Option B is wrong because preemptible VMs are cheaper but can be terminated at any time, which does not address memory limits and can actually cause more failures due to preemption. Option C is wrong because increasing the number of worker nodes adds more executors but does not increase the memory per executor; the existing executors will still exceed their container limits. Option D is wrong because the external shuffle service helps with shuffle data persistence and reduces executor memory pressure during shuffle operations, but it does not increase the per-executor memory allocation; the root cause is insufficient container memory, not shuffle management.

353

MCQhard

A data science team uses AI Platform Training with hyperparameter tuning. They observe that some trials fail due to transient errors. To improve solution quality and reduce costs, what should they do?

A.Enable early stopping using a Bayesian optimization algorithm.

B.Set the maxFailedTrials parameter to a high value (e.g., 10).

C.Use larger machine types for each trial.

D.Increase the number of parallel trials.

AnswerB

This allows the tuning job to tolerate transient failures and continue searching without aborting, improving completion rate and model quality.

Why this answer

Option B is correct because setting maxFailedTrials to a high value allows more trials to complete despite transient failures, improving the chance of finding a good model without aborting the tuning job and paying to run it again. Option D increases parallelism but still pays for failed trials. Option A (early stopping) prunes unpromising trials, but does not address transient errors.

Option C increases cost per trial without solving the failure issue.

354

MCQeasy

Your team needs to store time-series data from millions of IoT devices. Each device sends a reading every 5 minutes, and the total data volume is about 2 TB per month. The most common query pattern is retrieving all readings for a specific device over a time range (e.g., last 24 hours). Which storage service should you choose?

A.Cloud Storage (objects per device per time interval)

B.BigQuery

C.Cloud Bigtable

D.Cloud Spanner

AnswerC

Bigtable is ideal for time-series data with high write throughput and row-key-based range scans for device/time.

Why this answer

Cloud Bigtable is a fully managed, scalable NoSQL database designed for high-throughput, low-latency time-series data. It supports single-row key lookups and range scans, making it ideal for retrieving all readings for a specific device over a time range (e.g., last 24 hours) from millions of IoT devices generating 2 TB/month. Its row key design (e.g., device_id + timestamp) enables efficient time-range queries without full table scans, unlike object storage or analytical warehouses.
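The row key idea can be illustrated without a Bigtable client. In this sketch a sorted Python list stands in for Bigtable's lexicographically ordered rows, and a time-range read for one device becomes a contiguous slice. The key format `device_id#timestamp`, the device names, and the helper functions are assumptions chosen for the demo.

```python
import bisect


def row_key(device_id: str, epoch_seconds: int) -> str:
    """Build a 'device_id#timestamp' key; zero-padding keeps string order equal to time order."""
    return f"{device_id}#{epoch_seconds:010d}"


# A sorted list of keys stands in for Bigtable's ordered row space.
rows = sorted(
    row_key(device, ts)
    for device in ("dev-001", "dev-002", "dev-003")
    for ts in range(1_700_000_000, 1_700_000_000 + 1500, 300)  # one reading every 5 minutes
)


def scan(device_id: str, start_ts: int, end_ts: int) -> list[str]:
    """Return keys for one device with start_ts <= timestamp < end_ts."""
    lo = bisect.bisect_left(rows, row_key(device_id, start_ts))
    hi = bisect.bisect_left(rows, row_key(device_id, end_ts))
    return rows[lo:hi]


print(scan("dev-002", 1_700_000_300, 1_700_000_900))
# ['dev-002#1700000300', 'dev-002#1700000600']
```

Leading with the device ID keeps each device's readings adjacent, so the scan touches only the requested range. A key that led with the timestamp would scatter one device's readings across the table and concentrate new writes on a single hot range.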

Exam trap

Google Cloud often tests the misconception that BigQuery is suitable for operational, low-latency time-series queries, but the trap here is that BigQuery is an analytical warehouse optimized for large-scale batch queries, not for repeated, sub-second per-device range scans, which is a classic NoSQL (Bigtable) workload.

How to eliminate wrong answers

Option A is wrong because Cloud Storage (object storage) is optimized for immutable blob storage and lacks native indexing for time-range queries; retrieving all readings for a device over a time range would require listing and filtering millions of objects, which is slow and costly. Option B is wrong because BigQuery is a serverless data warehouse designed for analytical SQL queries on large datasets, not for real-time, high-throughput point lookups or range scans with millisecond-level latency; it would incur high query costs and latency for repeated per-device time-range retrievals. Option D is wrong because Cloud Spanner is a globally distributed relational database with strong consistency and ACID transactions, which is overkill for time-series IoT data and would be more expensive than Bigtable for high-volume, simple key-value range scans.

355

MCQmedium

A user named Charlie needs to deploy a model to a Vertex AI Endpoint and also create training jobs. Which role should be assigned to Charlie?

A.roles/aiplatform.user

B.roles/owner

C.roles/aiplatform.modelUser

D.roles/editor

AnswerA

aiplatform.user allows creating models, deploying endpoints, and running training jobs.

Why this answer

Correct: A. aiplatform.user includes permissions to create models and deploy endpoints. Option C is wrong because modelUser is read-only for predictions. Option D is wrong because editor includes unrelated permissions.

Option B is wrong because owner is too broad.

356

MCQeasy

A company runs a nightly Dataproc batch job to process large log files. The job is idempotent and can tolerate node failures if restarted. Minimizing cost is critical. What is the most cost-effective cluster design?

A.Use preemptible instances for all nodes and enable automatic restart

B.Use standard instances with autoscaling based on YARN memory

C.Use all preemptible instances and configure the cluster to delete after the job completes

D.Use a single-node cluster with a high-memory machine type

AnswerA

Preemptible instances are 60-80% cheaper, and automatic restart allows the job to continue after a preemption.

Why this answer

Preemptible instances cost roughly 60-80% less than standard instances, making them the most cost-effective choice for fault-tolerant, idempotent batch jobs. Enabling automatic restart ensures that if a preemptible instance is terminated (which can happen at any time), Dataproc will automatically recreate it, maintaining cluster capacity without manual intervention. This design minimizes cost while preserving the job's ability to complete despite node failures.

Exam trap

Google Cloud often tests the misconception that deleting the cluster after the job completes is the primary cost-saving measure, but the trap here is that without automatic restart, preemptible instances alone can cause job failure due to node preemption, negating cost benefits.

How to eliminate wrong answers

Option B is wrong because standard instances are significantly more expensive than preemptible instances, and autoscaling based on YARN memory does not reduce cost as effectively as using preemptible instances for a fault-tolerant batch job. Option C is wrong because configuring the cluster to delete after the job completes is a good practice for cost savings, but using all preemptible instances without enabling automatic restart risks job failure if preemptible instances are reclaimed, as the cluster may lose nodes and become unable to complete the job. Option D is wrong because a single-node cluster with a high-memory machine type is not cost-effective for processing large log files; it lacks fault tolerance and scalability, and high-memory instances are expensive compared to using multiple preemptible instances.

357

MCQmedium

A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?

A.Reduce the minimum number of nodes to zero to allow scale-to-zero during low traffic.

B.Place a Google Cloud Load Balancer in front of the Vertex AI endpoint to distribute requests across multiple endpoints.

C.Configure horizontal autoscaling with a higher maximum number of nodes and set a CPU utilization target.

D.Implement A/B testing by splitting traffic between two model versions to distribute load.

AnswerC

Autoscaling allows the endpoint to add nodes during high traffic, reducing latency and preventing timeouts.

Why this answer

Option C is correct because configuring horizontal autoscaling with a higher maximum number of nodes and a CPU utilization target allows Vertex AI Prediction to automatically add more nodes during peak traffic, distributing the inference load and reducing latency. This directly addresses the root cause—insufficient compute resources under high demand—without requiring architectural changes or sacrificing availability.
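To build intuition for how a CPU utilization target drives node count, the sketch below uses a simple proportional rule of the kind used by Kubernetes-style horizontal autoscalers. This is an assumption made for illustration: the source does not describe Vertex AI's actual scaling algorithm, and `desired_nodes` is a hypothetical helper.

```python
import math


def desired_nodes(current_nodes: int, cpu_percent: int, target_percent: int,
                  min_nodes: int, max_nodes: int) -> int:
    """Proportional scaling rule (assumed), clamped to the configured node range."""
    wanted = math.ceil(current_nodes * cpu_percent / target_percent)
    return max(min_nodes, min(max_nodes, wanted))


print(desired_nodes(2, 90, 60, 1, 10))  # 3: load above target, scale out
print(desired_nodes(2, 95, 60, 1, 3))   # 3: would want 4, capped by the maximum
print(desired_nodes(4, 20, 60, 2, 10))  # 2: load below target, scale in to the minimum
```

The second call shows why the maximum matters: if it is set too low, the endpoint cannot add nodes during peak traffic no matter how high utilization climbs.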

Exam trap

The trap here is that candidates often confuse load balancing (Option B) with autoscaling, thinking that distributing requests across multiple endpoints is the same as adding more compute capacity, but Vertex AI endpoints are single resources that cannot be fronted by a load balancer to increase capacity—they require autoscaling to add nodes.

How to eliminate wrong answers

Option A is wrong because reducing the minimum number of nodes to zero would cause cold starts when traffic arrives, increasing latency rather than reducing it, and scale-to-zero is not suitable for a customer-facing application requiring real-time predictions. Option B is wrong because placing a Google Cloud Load Balancer in front of a single Vertex AI endpoint does not distribute requests across multiple endpoints—it would only add unnecessary network hops and complexity without solving the resource bottleneck. Option D is wrong because A/B testing splits traffic between model versions for evaluation purposes, not for load distribution; it does not increase the total compute capacity available to handle peak traffic.

358

MCQmedium

Your organization uses Vertex AI Feature Store to serve features for a real-time fraud detection model. The model is deployed on a Vertex AI endpoint. After a data pipeline update, the model's online predictions became inconsistent. What is the most likely cause?

A.The model's prediction server is running out of memory.

B.The feature store's online serving values are not synchronized with the batch feature values used during training.

C.The model was retrained with a different training dataset.

D.The online serving endpoint's model version was accidentally rolled back.

AnswerB

If the pipeline update changed how features are computed or stored, online serving might use out-of-sync values, leading to inconsistent predictions.

Why this answer

In Vertex AI Feature Store, batch feature values used during model training and online serving values are stored separately. If a data pipeline update changes the batch feature values but the online serving values are not updated or synchronized, the model will receive different feature values at inference time than it was trained on, leading to inconsistent predictions. This mismatch, known as training-serving skew, is a common cause of inconsistent predictions after a pipeline change.

Exam trap

The trap here is that candidates may confuse a data pipeline update with a model retraining or version rollback, but the key is recognizing that feature store synchronization between batch and online stores is a distinct operational concern that directly causes prediction inconsistency.

How to eliminate wrong answers

Option A is wrong because running out of memory on the prediction server would cause errors or timeouts, not inconsistent predictions; the model would either fail or produce no output, not produce varying results. Option C is wrong because retraining with a different dataset would produce a new model version, but the question states predictions became inconsistent after a data pipeline update, not after a retraining event; a retrained model would be deployed as a new version, not cause inconsistency in the existing model's outputs. Option D is wrong because a rollback of the model version would revert to a previous consistent state, not introduce inconsistency; the predictions would be consistent with the older model version, not inconsistent.

359

MCQeasy

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

A.Add a secondary worker group using preemptible VMs and increase the number of workers.

B.Enable local SSDs on all worker nodes.

C.Increase the master node's machine type to n1-highmem-32.

D.Use Cloud Composer to schedule the job with a higher priority.

AnswerA

Preemptible VMs are cost-effective and add parallelism.

Why this answer

Adding a secondary worker group with preemptible VMs is the most cost-effective way to improve performance because it allows you to scale out the cluster horizontally with compute instances that are significantly cheaper (up to 80% discount) than regular VMs. This directly addresses the bottleneck of processing capacity during peak hours without requiring any pipeline redesign, as Cloud Dataproc can automatically distribute work across additional workers.

Exam trap

The trap here is that candidates assume scaling up the master node or improving local storage will help, but the exam tests understanding that horizontal scaling with cheap, ephemeral workers is the most cost-effective approach for batch processing workloads that are CPU-bound and fault-tolerant.

How to eliminate wrong answers

Option B is wrong because enabling local SSDs on all worker nodes improves I/O performance for intermediate data, but the pipeline reads from Cloud Storage and writes to BigQuery, which are network-based operations; the bottleneck is CPU/memory for transformation, not local disk speed, making this an expensive upgrade with minimal impact. Option C is wrong because increasing the master node's machine type to n1-highmem-32 only improves the coordination and management of the cluster, not the actual data processing capacity; the master node does not perform data transformation work, so this does not address the performance bottleneck. Option D is wrong because Cloud Composer is a workflow orchestration tool that schedules and monitors jobs, but it does not directly improve the runtime performance of the ETL pipeline; setting a higher priority only affects scheduling order, not execution speed.

360

MCQhard

A data engineering team uses Cloud Composer (Airflow) for workflow orchestration. They notice DAG runs frequently fail, and the error indicates insufficient Airflow workers. The team wants to ensure reliable execution. Which approach best addresses the issue?

A.Switch from Cloud Composer to Cloud Scheduler for simpler workloads.

B.Reduce the concurrency of all DAGs to fit within available workers.

C.Use the GKE-based Composer environment, which provides autoscaling of Airflow workers.

D.Increase the parallelism setting in the Airflow configuration.

AnswerC

GKE-based Composer auto-scales worker pods, handling variable loads effectively.

Why this answer

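Option C is correct because the error points to a shortage of Airflow workers, and a Composer environment that autoscales its workers on GKE adds worker capacity as DAG concurrency rises. Reducing concurrency (Option B) avoids the failures only by throttling work, and raising the parallelism setting (Option D) lets more tasks be scheduled without adding workers to run them.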

361

MCQmedium

A Dataflow batch job fails consistently with the error shown. The job uses a custom container image and runs in a VPC with a private IP. What should the engineer do to resolve the issue?

A.Request a CPU quota increase in the region.

B.Verify that the VPC has Private Google Access enabled and that Cloud NAT is configured for outbound internet access if needed.

C.Rebuild the custom container image and upload it to Container Registry.

D.Check that the custom image is based on the latest Dataflow SDK version.

AnswerB

In a private VPC, workers need connectivity to Dataflow API and container registry.

Why this answer

The error indicates that the Dataflow batch job cannot access required resources (e.g., container image, dependencies) because the VPC with private IPs lacks outbound internet connectivity. Option B is correct because enabling Private Google Access allows the VMs to reach Google APIs (like Container Registry) via the Google network, and Cloud NAT provides outbound internet access for non-Google APIs or external dependencies. Without these, the job fails to pull the custom container image or download necessary artifacts.

Exam trap

The trap here is that candidates often assume the error is due to the container image or SDK version, overlooking the VPC networking prerequisites (Private Google Access and Cloud NAT) that are required for Dataflow jobs using private IPs.

How to eliminate wrong answers

Option A is wrong because a CPU quota increase would not resolve connectivity issues; the error is about network access, not resource limits. Option C is wrong because rebuilding the container image does not fix the underlying network configuration problem; the image itself is not the cause of the failure. Option D is wrong because the Dataflow SDK version in the custom image is irrelevant to VPC networking; the job fails due to lack of outbound connectivity, not SDK compatibility.

362

MCQhard

You manage a large-scale machine learning system that recommends products to users. The model is a deep neural network trained on TensorFlow and deployed on Vertex AI Endpoint with global load balancing. The model receives over 10,000 requests per second. Recently, the team added a new feature: the user's current geographic location (latitude/longitude). After deploying the updated model, you notice that the average prediction latency has doubled, and the error rate has increased, particularly for requests from regions far from the model's primary training data (North America). You suspect the location feature is causing issues. What should you do to diagnose and mitigate the problem?

A.Remove the location feature from the model and retrain without it to restore performance.

B.Increase the number of replicas for the endpoint to handle the increased latency.

C.Switch to a regional endpoint in North America to reduce latency for the majority of users.

D.Examine the latency breakdown using Cloud Monitoring to see if the location feature is causing computationally expensive operations, then consider feature engineering like bucketing coordinates.

AnswerD

Understanding the latency source and engineering the feature properly can resolve the issue without sacrificing model accuracy.

363

MCQeasy

A company needs to process streaming data from IoT devices with sub-second latency and exactly-once processing guarantees. Which Google Cloud service should they use?

A.BigQuery

B.Cloud Dataproc

C.Cloud Dataflow

D.Cloud Pub/Sub

AnswerC

Dataflow supports streaming with auto-scaling and exactly-once processing, meeting the requirements.

Why this answer

Cloud Dataflow is the correct choice because it provides a unified stream and batch processing model with exactly-once processing guarantees and sub-second latency via its Apache Beam SDK. It supports event-time processing, watermarks, and triggers to handle out-of-order data from IoT devices while ensuring each record is processed exactly once, even in the case of failures.

Exam trap

Google Cloud often tests the distinction between data ingestion (Pub/Sub) and data processing (Dataflow), so the trap here is that candidates confuse Pub/Sub's streaming ingestion capability with the processing guarantees needed for exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because BigQuery is a serverless data warehouse designed for analytical queries on large datasets, not for real-time stream processing with sub-second latency and exactly-once guarantees; it can ingest streaming data but does not provide the fine-grained per-record processing semantics required. Option B is wrong because Cloud Dataproc is a managed Hadoop/Spark service that can process streaming data via Spark Streaming, but it does not natively guarantee exactly-once processing out of the box and typically has higher latency due to micro-batching. Option D is wrong because Cloud Pub/Sub is a messaging and ingestion service that provides at-least-once delivery by default and does not perform data processing; it is a transport layer, not a processing engine.

364

MCQhard

A company is building a continuous training pipeline that retrains a model daily using new data from a feature store. The training data must include features computed up to the timestamp of each training run. Which architecture should be used to ensure time-consistent feature values without label leakage?

A.Train on a fixed window of the most recent features without considering timestamps.

B.Use Vertex AI Feature Store with point-in-time lookup enabled to retrieve features as of the training timestamp.

C.Store all features in a Cloud SQL database and perform a join at training time.

D.Use Pub/Sub to stream new features into Cloud Storage and train on the latest snapshot.

AnswerB

Point-in-time lookups ensure that for each training example, features are retrieved as they existed at the prediction time, preventing leakage.

Why this answer

Option B is correct because Vertex AI Feature Store's point-in-time lookup retrieves the exact feature values as they existed at the specified training timestamp, ensuring time-consistency and preventing label leakage. This mechanism avoids using future data that would not have been available at the time of prediction, which is critical for realistic model evaluation and production performance.

Exam trap

Google Cloud often tests the misconception that simply using the most recent data or a snapshot is sufficient for time-consistency, but the key requirement is to retrieve features as of the exact training timestamp to prevent label leakage, which only point-in-time lookup guarantees.

How to eliminate wrong answers

Option A is wrong because training on a fixed window of the most recent features without considering timestamps can introduce label leakage by including future feature values relative to the label timestamp, and it ignores the temporal ordering required for time-series data. Option C is wrong because storing all features in Cloud SQL and performing a join at training time lacks point-in-time semantics, meaning the join may inadvertently use features from after the label timestamp, causing leakage and inconsistent feature values. Option D is wrong because using Pub/Sub to stream new features into Cloud Storage and training on the latest snapshot does not guarantee that features are retrieved as of the exact training timestamp; the snapshot may include data that arrived after the label was generated, leading to leakage.
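The point-in-time semantics described above can be sketched in plain Python. This is a minimal illustration of the idea, not the Vertex AI Feature Store API; the `history` data and the `feature_as_of` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity: (timestamp, value), sorted by timestamp.
history = [(100, 0.2), (200, 0.5), (300, 0.9)]

def feature_as_of(history, ts):
    """Return the latest value written at or before ts, or None if none exists yet."""
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, ts)
    return history[i - 1][1] if i else None

# A training example stamped t=250 must only see the value written at t=200.
print(feature_as_of(history, 250))  # 0.5, not the later 0.9
print(feature_as_of(history, 50))   # None: no value existed yet
```

Joining on "the latest snapshot" would return 0.9 for the first example, which is exactly the leakage the question asks to avoid.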

365

MCQmedium

A financial services firm uses Cloud Pub/Sub to ingest real-time market data. The data is processed by a Cloud Dataflow streaming pipeline that aggregates trades per symbol and writes to BigQuery. The pipeline currently uses a single global window with a trigger that fires every minute. The firm now needs to support late data up to 5 minutes and also wants to reduce the number of writes to BigQuery to avoid hitting the table limit of 1,500 inserts per second. The current pipeline writes every minute, which is acceptable for inserts per second, but after adding late data handling, the number of writes doubles. How can you redesign the pipeline to handle late data while keeping write volume low?

A.Use fixed windows of 5 minutes with allowed lateness 5 minutes and trigger every 30 seconds

B.Increase the global window duration to 10 minutes and keep the same trigger

C.Discard all late data and keep the current windowing

D.Use session windows with a gap duration of 5 minutes and a count-based trigger that fires after accumulating 1000 elements

AnswerD

Session windows group events; count-based trigger reduces writes by batching.

Why this answer

Option D is correct because session windows naturally group events into bursts of activity separated by a gap duration (5 minutes), which reduces the number of writes by accumulating many trades per symbol before emitting a pane. Adding a count-based trigger that fires after 1000 elements further limits write frequency, keeping the insert rate well below BigQuery's 1,500 per second limit while still allowing late data up to the gap duration. This design handles late data implicitly within the session gap and avoids the write amplification seen with fixed windows and frequent triggers.

Exam trap

The trap here is that candidates assume fixed windows with allowed lateness are the only way to handle late data, overlooking that session windows naturally accommodate late arrivals while reducing write frequency through event grouping and count-based triggers.

How to eliminate wrong answers

Option A is wrong because fixed windows of 5 minutes with a trigger every 30 seconds would increase the number of writes rather than reduce them: each window emits up to 10 panes per key (300 s / 30 s), plus further panes while the 5 minutes of allowed lateness remain open, exacerbating the BigQuery insert rate issue. Option B is wrong because increasing the global window duration to 10 minutes does not change the trigger frequency (still every minute), so the number of writes remains the same and late data handling is not addressed. Option C is wrong because discarding late data violates the requirement to support late data up to 5 minutes and is not a valid redesign for the stated need.

366

MCQhard

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

A.Use a Cloud Bigtable table as a side input via a RichSDF.

B.Use a side input from a PCollection and broadcast it.

C.Increase the number of workers to distribute the side input.

D.Increase the worker memory to 16 GB per worker.

AnswerA

Bigtable provides scalable key-value lookups without loading all data into memory.

Why this answer

Option A is correct because using a Cloud Bigtable table as a side input via a RichSDF (Rich Splittable DoFn) allows the pipeline to perform point lookups on the large (10 GB) lookup table without loading it entirely into worker memory. This avoids OOM errors and reduces latency by leveraging Bigtable's low-latency, scalable key-value storage, which is ideal for high-throughput streaming pipelines that require frequent, random access to a large, frequently updated dataset.

Exam trap

The trap here is that candidates often assume increasing resources (memory or workers) is the solution to memory pressure, but the real issue is the architectural pattern of broadcasting a large, frequently updated dataset—requiring a shift to an external, queryable store like Bigtable.

How to eliminate wrong answers

Option B is wrong because broadcasting a 10 GB PCollection as a side input would require every worker to hold the entire lookup table in memory, causing OOM errors and high latency due to serialization and shuffle overhead. Option C is wrong because increasing the number of workers does not reduce the per-worker memory footprint of a broadcast side input; each worker still needs to load the full 10 GB table, so OOM errors persist. Option D is wrong because simply increasing worker memory to 16 GB per worker is a temporary workaround that does not scale—if the lookup table grows or multiple side inputs are used, OOM errors will recur, and it does not address the fundamental issue of loading the entire dataset into memory.
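The difference between broadcasting the table and looking rows up on demand can be sketched as follows. This is a minimal sketch under stated assumptions: `EXTERNAL_TABLE` is a hypothetical in-memory stand-in for an external key-value store such as Bigtable, and `lookup` and `enrich` are illustrative names, not Beam or Bigtable APIs.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical stand-in for an external key-value store such as Bigtable.
EXTERNAL_TABLE = {f"sku-{i}": f"category-{i % 5}" for i in range(100_000)}
lookups = 0  # counts how many requests actually reach the store

@lru_cache(maxsize=1_000)  # bounded cache: worker memory does not grow with table size
def lookup(key: str) -> Optional[str]:
    """Fetch one row on demand instead of holding the full table in worker memory."""
    global lookups
    lookups += 1
    return EXTERNAL_TABLE.get(key)

def enrich(event: dict) -> dict:
    """Attach the looked-up category to a single streaming event."""
    return {**event, "category": lookup(event["sku"])}

events = [{"sku": "sku-7"}, {"sku": "sku-7"}, {"sku": "sku-12"}]
enriched = [enrich(e) for e in events]
print(enriched[0]["category"], lookups)  # category-2 2
```

Because the table in the question is refreshed hourly, a real pipeline would also need to expire cached entries; the plain `lru_cache` here never does, which is a simplification.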

367

MCQmedium

A company is designing a streaming pipeline using Dataflow to process real-time clickstream data. The pipeline reads from Pub/Sub, performs user sessionization using Apache Beam's Session window, and writes to BigQuery. The team notices that the pipeline's lag is growing and the worker utilization is low. What is the most likely cause and recommended fix?

A.Too many workers are created; reduce the number of workers.

B.The pipeline is not using autoscaling; enable autoscaling.

C.Insufficient disk space per worker; increase the boot disk size.

D.The session window gap duration is too large, causing excessive state per key; reduce the gap duration.

AnswerD

Large gap leads to long-lived state, causing lag and low utilization.

Why this answer

D is correct because a large session window gap duration causes Dataflow to maintain excessive state per key (user session), leading to high memory pressure and slow processing. This results in growing pipeline lag despite low worker utilization, as workers spend more time managing state than processing data. Reducing the gap duration limits the state size and improves throughput.

Exam trap

Google Cloud often tests the misconception that low worker utilization means too many workers, but the real cause is often state bloat from session windows, not resource overprovisioning.

How to eliminate wrong answers

Option A is wrong because low worker utilization indicates workers are underutilized, not overprovisioned; reducing workers would worsen lag. Option B is wrong because autoscaling is enabled by default in Dataflow streaming pipelines, and low utilization suggests the issue is not scaling but state management. Option C is wrong because insufficient disk space typically causes worker failures or out-of-disk errors, not low utilization with growing lag; the symptom here points to state size, not disk I/O.
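The effect of the gap duration on session length can be sketched with a simplified sessionizer. This is an illustration of the windowing idea only, not Apache Beam's implementation; the `sessionize` helper and the click timestamps are hypothetical.

```python
def sessionize(timestamps, gap):
    """Group event times into sessions; a session closes after `gap` seconds of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the gap: extend the open session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions

clicks = [0, 40, 90, 400, 430, 1000]  # seconds, for one hypothetical user

print(len(sessionize(clicks, gap=60)))   # 3 short sessions
print(len(sessionize(clicks, gap=600)))  # 1 session spanning all events
```

With the larger gap, every event for the key stays in a single open session, so the state held for that key keeps growing until the gap finally elapses.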

368

MCQeasy

Your team is using Cloud Data Fusion to build batch ETL pipelines that load data from Cloud Storage into BigQuery. You have several pipelines that run daily. Recently, one pipeline started failing with a 'Permission denied' error when trying to read a new CSV file uploaded to a specific Cloud Storage bucket. Other pipelines using the same bucket succeed. The failing pipeline has a Cloud Storage source plugin that uses a service account with the roles/storage.objectViewer role. The bucket has uniform bucket-level access enabled. What is likely causing the issue?

A.Create a custom IAM role with storage.buckets.get and storage.objects.get permissions and assign it to the service account.

B.Check that the service account used by the failing pipeline's Data Fusion instance has the correct permissions, and ensure that the service account is the same as the one used by working pipelines.

C.Disable uniform bucket-level access and add bucket ACLs for the service account.

D.Add the service account as a member of the Cloud Storage bucket with the roles/storage.objectViewer role.

AnswerB

The root cause is likely a different service account or misconfiguration in the failing pipeline's Data Fusion instance.

Why this answer

The correct answer is B because the error is likely due to the Data Fusion instance's service account, not the source plugin's service account. In Cloud Data Fusion, the pipeline execution uses the service account attached to the Data Fusion instance itself to access Cloud Storage, even if the source plugin specifies a different service account. Since other pipelines using the same bucket succeed, the issue is that the failing pipeline's Data Fusion instance uses a service account that lacks the roles/storage.objectViewer role on the bucket, while working pipelines use an instance with the correct permissions.

Exam trap

Google Cloud often tests the misconception that the service account specified in a plugin (e.g., Cloud Storage source) is the one used for authentication, when in fact the Data Fusion instance's service account is the effective identity for all pipeline operations.

How to eliminate wrong answers

Option A is wrong because the roles/storage.objectViewer role already includes storage.objects.get permission, and storage.buckets.get is not required for reading objects; adding a custom role is unnecessary and does not address the root cause. Option C is wrong because disabling uniform bucket-level access and using ACLs is an outdated approach that contradicts best practices; the issue is not about access control mode but about which service account is being used. Option D is wrong because the service account used by the source plugin already has roles/storage.objectViewer on the bucket (as stated), but the pipeline fails because the Data Fusion instance's service account, not the plugin's service account, is the one making the request.

369

MCQhard

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, performs a fixed window of 10 seconds, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

A.Set the pipeline option --maxNumWorkers to a value between 5 and 10.

B.Increase the window duration to 30 seconds to reduce the number of windows.

C.Redesign the pipeline to use a side input for the dimension table instead of a lookup.

D.Increase the number of Bigtable nodes to reduce lookup latency.

AnswerA

Prevents over-scaling and shuffle overhead.

Why this answer

The Bigtable nodes are below 50% CPU and the logs show no errors, so neither the lookups nor a failing step explains the increasing system lag and unemitted windows. With the default autoscaling range of 2 to 20 workers, the worker count can swing widely, and each resize redistributes in-flight work and state across workers, adding shuffle overhead and delaying window emission. Capping maxNumWorkers at 5 to 10 keeps the worker pool stable, preventing over-scaling and allowing the pipeline to catch up and emit windows reliably.

Exam trap

Google Cloud often tests the misconception that Bigtable or side inputs are the bottleneck when the real issue is autoscaling behavior, leading candidates to choose scaling Bigtable or redesigning the join strategy instead of adjusting autoscaling limits.

How to eliminate wrong answers

Option B is wrong because increasing the window duration to 30 seconds would only delay window emission, not resolve the root cause of increasing system lag or data loss; it could even worsen latency by accumulating more data per window. Option C is wrong because using a side input for a slowly-changing dimension table would require periodic re-reading of the entire table, increasing memory pressure and shuffle overhead, and would not fix a worker-scaling bottleneck. Option D is wrong because Bigtable CPU is below 50%, indicating the lookup latency is not the issue; adding nodes would be unnecessary and would not address the pipeline’s inability to keep up with the streaming throughput.

370

MCQmedium

Your Dataflow streaming pipeline is processing financial transactions and writing results to BigQuery. You need to monitor the pipeline for data freshness (end-to-end latency) and alert if it exceeds 5 minutes. The pipeline uses fixed windows of 1 minute. Which metrics should you use for alerting?

A.System Lag metric from Dataflow monitoring.

B.Data Freshness metric from BigQuery monitoring.

C.Element Count metric from Dataflow monitoring.

D.Worker Threads Utilization metric from Dataflow monitoring.

AnswerA

System Lag tracks the delay between event time and processing time; if it exceeds 5 minutes, alert.

Why this answer

Dataflow's 'System Lag' metric measures the difference between event time and processing time, indicating how far behind the pipeline is. For windowed pipelines, this reflects overall latency. Option C (Element Count) shows throughput, not latency.

Option B (Data Freshness) is a BigQuery-specific metric for table currency. Option D (Worker Threads Utilization) relates to parallelism.

371

MCQhard

A company stores IoT sensor readings in BigQuery. The table is partitioned by day and clustered by sensor_id. Query performance has degraded as data grows; many queries filter by a date range and a single sensor_id. Which optimization should be applied first?

A.Remove clustering on sensor_id as it may cause overhead.

B.Add a WHERE clause to filter by partition date even if the query already filters by a date range.

C.Increase the number of BigQuery slots assigned to the project.

D.Recluster the table to ensure data is sorted by sensor_id within each partition.

AnswerD

Clustering improves filter performance by reducing scanned data.

Why this answer

Option D is correct because reclustering the table ensures that within each daily partition, the data is physically sorted by sensor_id. This optimizes the performance of queries that filter by a date range and a single sensor_id, as BigQuery can use the clustering metadata to prune blocks and read only the relevant data, reducing the amount of data scanned and improving query speed.

Exam trap

Google Cloud often tests the misconception that adding more compute resources (slots) or redundant WHERE clauses will fix performance issues caused by poor data layout, when the correct first step is to optimize data organization through clustering and partitioning.

How to eliminate wrong answers

Option A is wrong because removing clustering on sensor_id would eliminate the physical sorting that helps prune blocks for queries filtering by sensor_id, likely worsening performance. Option B is wrong because adding a WHERE clause to filter by partition date is redundant if the query already filters by a date range; BigQuery automatically performs partition pruning based on the date filter, so this would not improve performance. Option C is wrong because increasing the number of BigQuery slots addresses compute resource contention, not the underlying data layout issue; if the query is scanning too much data due to poor clustering, more slots will not reduce the bytes processed.
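Block pruning within a partition can be illustrated with a toy model in which each storage block records the minimum and maximum sensor_id it contains. This is a simplified sketch, not BigQuery's storage format; `make_blocks`, `blocks_scanned`, and the data are hypothetical.

```python
def make_blocks(ids, block_size):
    """Split rows into fixed-size blocks and record each block's (min, max) sensor_id."""
    chunks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
    return [(min(c), max(c)) for c in chunks]

def blocks_scanned(blocks, sensor_id):
    """Count blocks whose [min, max] range could contain the requested sensor_id."""
    return sum(1 for lo, hi in blocks if lo <= sensor_id <= hi)

# One hypothetical daily partition: 1,000 rows from 100 sensors, interleaved on arrival.
unsorted_ids = [i % 100 for i in range(1000)]
clustered_ids = sorted(unsorted_ids)  # same rows, sorted by sensor_id

print(blocks_scanned(make_blocks(unsorted_ids, 100), 42))   # 10: every block may hold sensor 42
print(blocks_scanned(make_blocks(clustered_ids, 100), 42))  # 1: only one block can hold it
```

More slots would read the ten unsorted blocks faster, but only the sorted layout reduces the number of blocks that have to be read at all.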

372

MCQhard

The query above runs slowly on the 10 TB table. Which optimization would most improve performance?

A.Use a subquery to filter item.category first

B.Cluster the table by customer_id

C.Create a materialized view that pre-aggregates by customer_id and item category

D.Partition the table by order_date

AnswerC

A materialized view pre-computes the COUNT for each (customer_id, category), so the query reads a small pre-aggregated table.

Why this answer

Option C is correct because a materialized view can pre-compute and store the aggregated results by customer_id and item category, eliminating the need to scan the full 10 TB table for each query. This dramatically reduces I/O and computation time, especially when the underlying aggregation is expensive and the query pattern is predictable.

Exam trap

Google Cloud often tests the misconception that partitioning or clustering alone can accelerate arbitrary aggregation queries, when in fact they only help with filter-based pruning or specific join patterns, not with reducing the full scan required for grouping without a WHERE clause.

How to eliminate wrong answers

Option A is wrong because using a subquery to filter item.category first does not reduce the scan size; the database still must read the entire 10 TB table to evaluate the subquery, and the optimizer may not push the filter down effectively. Option B is wrong because clustering by customer_id improves range scans and joins on that column, but it does not help with aggregation queries that group by customer_id and item category; the table still must be fully scanned to compute the aggregates. Option D is wrong because partitioning by order_date only prunes partitions when queries filter on order_date; the query in question does not filter by date, so all partitions would be scanned, providing no performance benefit.

373

MCQeasy

A team is using Kubeflow Pipelines on Google Kubernetes Engine to orchestrate ML workflows. They need to track parameters, metrics, and artifacts for each run. Which tool should they integrate?

A.Cloud Monitoring

B.Cloud Logging

C.BigQuery

D.Vertex ML Metadata

AnswerD

Vertex ML Metadata is designed to track ML artifacts, parameters, and metrics across pipeline runs.

Why this answer

Vertex ML Metadata is the correct choice because it is purpose-built for tracking parameters, metrics, and artifacts in ML workflows, and it integrates natively with Kubeflow Pipelines on Google Kubernetes Engine. It stores metadata for each pipeline run, enabling lineage tracking, comparison, and reproducibility of experiments.

Exam trap

Google Cloud often tests the distinction between general-purpose monitoring/logging tools and ML-specific metadata stores, so the trap here is that candidates may confuse Cloud Monitoring or Cloud Logging with a tool that can track ML metrics, when in fact they lack the structured schema and lineage capabilities required for ML workflow orchestration.

How to eliminate wrong answers

Option A is wrong because Cloud Monitoring is designed for infrastructure and application performance monitoring (e.g., CPU utilization, latency), not for tracking ML-specific parameters, metrics, and artifacts. Option B is wrong because Cloud Logging collects and stores log data (e.g., text logs from applications), not structured ML metadata like hyperparameters or model artifacts. Option C is wrong because BigQuery is a serverless data warehouse for analytical queries on large datasets, not a metadata store for ML pipeline runs.


374

MCQhard

You are designing a streaming data pipeline that must guarantee exactly-once processing semantics for financial transactions. The pipeline reads from Cloud Pub/Sub and writes to Cloud Bigtable. Each transaction has a unique transaction ID. Which features do you need to implement to ensure exactly-once semantics end-to-end?

A.Use Cloud Pub/Sub with synchronous pull and manually commit offsets after successfully writing to Bigtable.

B.Use Dataflow with exactly-once processing, and ensure the Bigtable sink uses idempotent mutations based on the transaction ID.

C.Use Dataflow with at-least-once processing and implement deduplication in a windowed transform using the transaction ID.

D.Use Cloud Pub/Sub with exactly-once delivery enabled, and write to Bigtable using single-row transactions.

AnswerB

Dataflow deduplicates records using unique identifiers; idempotent Bigtable writes (for example, a SetCell with an explicit timestamp on a row keyed by the transaction ID) ensure that a retried mutation leaves the same result.

Why this answer

Option B is correct because Dataflow's exactly-once processing guarantees that each record is processed precisely once, and idempotent Bigtable mutations (keyed by transaction ID) ensure that even if a mutation is retried, the result is the same. This combination provides end-to-end exactly-once semantics: Dataflow handles source-side deduplication and checkpointing, while Bigtable's idempotent writes prevent duplicates at the sink.
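The idempotent-sink idea can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Bigtable client code: `ToyTable`, `write_transaction`, and the field `event_time_micros` are hypothetical, and the table only mimics the fact that cells are versioned by timestamp.

```python
class ToyTable:
    """In-memory stand-in for a wide-column table; cells are versioned by timestamp."""

    def __init__(self):
        self.cells = {}   # (row_key, column, timestamp_micros) -> value
        self._clock = 0   # simulated server clock

    def set_cell(self, row_key, column, value, timestamp_micros=None):
        if timestamp_micros is None:
            # Simulates a server-assigned timestamp: every attempt gets a new one.
            self._clock += 1
            timestamp_micros = self._clock
        self.cells[(row_key, column, timestamp_micros)] = value


def write_transaction(table, txn, idempotent=True):
    # Row key derived from the unique transaction ID (assumed key design).
    row_key = f"txn#{txn['id']}"
    # An explicit timestamp taken from the record makes a retry overwrite the same cell.
    ts = txn["event_time_micros"] if idempotent else None
    table.set_cell(row_key, "amount", txn["amount"], ts)


txn = {"id": "T-1001", "amount": 250, "event_time_micros": 1_700_000_000_000_000}

safe = ToyTable()
for _ in range(3):          # the same write retried three times
    write_transaction(safe, txn, idempotent=True)

unsafe = ToyTable()
for _ in range(3):
    write_transaction(unsafe, txn, idempotent=False)

print(len(safe.cells), len(unsafe.cells))
```

In this model the retried idempotent write leaves a single cell, while the variant that lets the table assign timestamps accumulates one cell version per attempt.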

Exam trap

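Google Cloud often tests the misconception that enabling Pub/Sub's exactly-once delivery, or acknowledging messages manually after the write, is enough for end-to-end exactly-once semantics. Pub/Sub delivers at least once by default. Its optional exactly-once delivery feature applies only to pull subscriptions and only prevents redelivery of a message that was successfully acknowledged; it does not deduplicate messages that a publisher sent twice, and it cannot make the Bigtable write and the acknowledgment atomic. A processing framework such as Dataflow plus an idempotent sink is still required.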

How to eliminate wrong answers

Option A is wrong because Pub/Sub has no offsets to commit; subscribers acknowledge messages, and with default at-least-once delivery a message can be redelivered if the subscriber fails after writing to Bigtable but before the acknowledgment is processed, so the write can be applied twice. Option C is wrong because at-least-once processing in Dataflow allows duplicates, and windowed deduplication on the transaction ID only removes duplicates that arrive within the same window; it does not protect against retried sink writes. Option D is wrong because Pub/Sub's exactly-once delivery (an optional setting on pull subscriptions) only prevents redelivery of acknowledged messages; it does not cover duplicates published upstream, and the Bigtable write and the acknowledgment are not atomic. Single-row transactions make each write atomic, but they do not make a retried write idempotent.


375

MCQeasy

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?

A.Use a single-node cluster with standard VMs.

B.Use a cluster with local SSDs for faster I/O.

C.Use a cluster with a mix of standard and preemptible VMs.

D.Use a cluster with n1-highmem-32 instances and 1000 cores.

AnswerC

Preemptible VMs reduce cost significantly while providing sufficient compute.

Why this answer

Option C is correct because preemptible VMs can cost up to roughly 80% less than standard VMs, and mixing them with standard VMs keeps the job running if preemptible nodes are reclaimed during the 6-hour run. In Dataproc, the master and primary workers must be standard VMs, while preemptible VMs can only be added as secondary workers. Since the job reads from and writes to Cloud Storage (not local HDFS), local SSDs are unnecessary, and a single-node cluster would lack the parallelism needed to process 500 TB in 6 hours.
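The cost argument can be checked with simple arithmetic. The sketch below uses made-up numbers: the hourly rate, the 80% discount, and the 100-worker cluster split are all assumptions chosen for illustration, and it ignores the Dataproc service fee, disks, and any rework caused by preemptions.

```python
def cluster_cost(standard_workers, preemptible_workers, hours,
                 standard_rate, preemptible_discount):
    """Return VM cost for a job; rates are per VM-hour in arbitrary units."""
    preemptible_rate = standard_rate * (1 - preemptible_discount)
    return hours * (standard_workers * standard_rate
                    + preemptible_workers * preemptible_rate)


HOURS = 6        # job duration from the question
RATE = 1.0       # hypothetical price of one standard VM-hour
DISCOUNT = 0.8   # assumed preemptible discount

# Same total worker count, two different mixes (both splits are assumptions).
all_standard = cluster_cost(100, 0, HOURS, RATE, DISCOUNT)
mixed = cluster_cost(20, 80, HOURS, RATE, DISCOUNT)
savings = 1 - mixed / all_standard

print(f"all standard: {all_standard:.0f}, mixed: {mixed:.0f}, savings: {savings:.0%}")
```

Under these assumptions the mixed cluster costs about a third of the all-standard one. The real saving depends on current pricing and on how often secondary workers are reclaimed.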

Exam trap

Google Cloud often tests the misconception that local SSDs always improve performance for data processing jobs, but in Dataproc, when data resides in Cloud Storage, the bottleneck is network throughput, not local disk speed, making SSDs an unnecessary cost.

How to eliminate wrong answers

Option A is wrong because a single-node cluster cannot process 500 TB in 6 hours with the CPU and memory of one machine, and it has no fault tolerance if the node fails. Option B is wrong because local SSDs add cost without a matching benefit when the job reads from and writes to Cloud Storage rather than HDFS; the bottleneck is network I/O, not disk I/O. Option D is wrong because 1000 cores of n1-highmem-32 instances at standard pricing is expensive and likely over-provisioned for the job, and it ignores the cost savings of preemptible VMs.

