Knowledge + Practice

Google Professional Data Engineer (PDE) — Questions 601–675

990 questions total · 14 pages · All types, answers revealed


Page 9 of 14

601

Multi-Select · hard

You have a BigQuery table with a REQUIRED column that you now need to allow NULL values. You also need to add two new nullable columns. Which THREE steps are required to achieve this schema evolution? (Choose 3)

Select 3 answers

A. Add the new columns using ALTER TABLE ADD COLUMN.

B. Use the ALTER TABLE SET OPTIONS statement to change the column mode to NULLABLE.

C. Update the IAM permissions on the table.

D. Use CREATE OR REPLACE TABLE with the new schema and import data.

E. Export the table to Cloud Storage in Avro format.

Answers: B, D, E

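A REQUIRED column is relaxed to NULLABLE with ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL; SET DATA TYPE and SET OPTIONS do not change a column's mode.

Why this answer

BigQuery can apply both changes in place. New nullable columns can be added with ALTER TABLE ADD COLUMN, and a REQUIRED column can be relaxed to NULLABLE with ALTER TABLE ... ALTER COLUMN ... DROP NOT NULL or with `bq update` and a relaxed schema file. The export (E) and CREATE OR REPLACE TABLE (D) steps in the answer key describe the recreate-the-table path, which is only needed for changes BigQuery cannot apply in place, such as tightening a column from NULLABLE to REQUIRED.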


602

MCQ · hard

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A. Check Stackdriver logging for error messages.

B. Disable exactly-once processing in Dataflow.

C. Increase the number of Dataflow workers.

D. Switch to BigQuery streaming inserts.

Answer: A

Identifies root cause.

Why this answer

Option A is correct because Stackdriver (now Cloud Logging) is the first place to investigate when a Dataflow pipeline experiences high latency and data loss. Dataflow automatically logs errors, worker failures, and system messages to Cloud Logging, which can reveal root causes such as insufficient resources, stuck steps, or Pub/Sub subscription issues. Checking logs first avoids premature scaling or configuration changes that may not address the actual problem.

Exam trap

Google Cloud often tests the principle of 'diagnose before you optimize' — the trap here is that candidates jump to scaling or switching technologies (options C and D) without first checking logs, which is the fundamental first step in any troubleshooting workflow.

How to eliminate wrong answers

Option B is wrong because disabling exactly-once processing in Dataflow would not fix high latency or data loss; it could actually increase data duplication and make debugging harder, while the core issue remains unaddressed. Option C is wrong because increasing the number of Dataflow workers without first diagnosing the bottleneck (e.g., a hot key, slow transform, or Pub/Sub backlog) can waste resources and may not resolve the underlying cause of latency or loss. Option D is wrong because switching to BigQuery streaming inserts does not address pipeline-level failures; streaming inserts have their own quotas, error handling, and latency characteristics, and the problem likely lies in the Dataflow processing logic or resource allocation, not the sink.


603

MCQ · hard

A Dataflow streaming pipeline processes events from Pub/Sub and writes to BigQuery using a dynamically generated table destination based on the event type. The pipeline is experiencing high latency, and the worker CPU utilization is low. Which action is most likely to reduce latency?

A. Increase the batch size parameter in the BigQuery sink to write larger batches.

B. Reduce the number of workers to increase CPU utilization per worker.

C. Enable Dataflow Streaming Engine to improve throughput and reduce latency.

D. Increase the worker disk size to reduce I/O wait time.

Answer: C

C is correct because Streaming Engine moves state to the backend service, reducing worker overhead and improving latency.

Why this answer

Option C is correct because Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing per-worker overhead and enabling better resource utilization. This directly addresses the symptom of high latency with low CPU utilization, which indicates workers are bottlenecked on shuffle or state management rather than compute.

Exam trap

The trap here is that candidates often assume low CPU utilization means workers are underutilized and should be scaled down (Option B), when in fact low CPU with high latency indicates a bottleneck in shuffle or state management that is not compute-bound.

How to eliminate wrong answers

Option A is wrong because increasing batch size in the BigQuery sink can actually increase latency for streaming pipelines, as larger batches require more time to fill before writing, and the issue here is not sink throughput but worker inefficiency. Option B is wrong because reducing the number of workers would decrease parallelism and likely worsen latency, and low CPU utilization suggests workers are not compute-bound but rather waiting on I/O or shuffle. Option D is wrong because increasing worker disk size does not reduce I/O wait time for streaming pipelines; disk I/O is not the bottleneck when CPU is low and the pipeline uses Pub/Sub and BigQuery, which are network-bound.


604

MCQ · medium

A company uses BigQuery to run reporting queries on a table that is partitioned by date and clustered by customer_id. Queries filtering by customer_id and a date range are performing poorly. What is the most likely cause?

A. The project lacks sufficient BigQuery slot capacity

B. The table is too large for BigQuery

C. Clustering column order should be date first, then customer_id

D. The date range filter is too wide, causing scans of many partitions

Answer: D

Wide date ranges nullify the benefit of clustering; BigQuery scans many partitions.

Why this answer

Option D is correct because when a table is partitioned by date and clustered by customer_id, queries that filter on both columns can still perform poorly if the date range filter is too wide, causing BigQuery to scan many partitions. Even with clustering, scanning a large number of partitions negates the benefit of clustering, as clustering only reduces the data scanned within each partition. The query optimizer must read all partitions that fall within the date range, and if that range is broad, the scan overhead dominates.

Exam trap

The trap here is that candidates often assume clustering alone guarantees fast queries on any filter combination, without understanding that partition pruning happens first and a wide date range undermines the benefit of clustering.

How to eliminate wrong answers

Option A is wrong because insufficient slot capacity would cause slow execution or queuing for every query, not specifically for this filter pattern; the issue here is the volume of data scanned, not resource contention. Option B is wrong because BigQuery is designed for very large tables, and 'too large' is not a meaningful limitation; the problem is query design, not table size. Option C is wrong because the table is clustered on customer_id alone, which already matches the filter; date is handled by partitioning, so making it the leading clustering column would not help and would weaken pruning on customer_id, because clustering is most effective for filters on the leading clustering columns.
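The interaction between partition pruning and clustering can be sketched with a toy cost model. This is a rough illustration, not BigQuery's actual planner: the row count per partition, the number of clustered blocks, and the assumption that a customer_id filter narrows each partition to one block are all hypothetical.

```python
from datetime import date

ROWS_PER_PARTITION = 1_000_000   # hypothetical daily row count
BLOCKS_PER_PARTITION = 100       # hypothetical number of clustered blocks per day


def estimate_rows_scanned(start: date, end: date, clustered_filter: bool) -> int:
    """Estimate rows read for a query over the inclusive range [start, end]."""
    # Partition pruning: only the daily partitions inside the range are read.
    partitions = (end - start).days + 1
    if clustered_filter:
        # Clustering: assume the customer_id filter narrows each partition
        # to a single block.
        rows_per_partition = ROWS_PER_PARTITION // BLOCKS_PER_PARTITION
    else:
        rows_per_partition = ROWS_PER_PARTITION
    return partitions * rows_per_partition


narrow = estimate_rows_scanned(date(2024, 1, 1), date(2024, 1, 7), clustered_filter=True)
wide = estimate_rows_scanned(date(2023, 1, 1), date(2024, 12, 31), clustered_filter=True)
print(narrow, wide)
```

In this model the clustering saving is a constant factor per partition, so the total still grows linearly with the number of partitions the date filter keeps.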


605

Multi-Select · medium

A company is migrating their on-premises data warehouse to BigQuery. They have a mix of batch and streaming ingestion. The data team wants to optimize query costs. Which THREE practices should they adopt?

Select 3 answers

A. Switch to flat-rate pricing to cap slot usage.

B. Use materialized views for frequently executed aggregations.

C. Partition tables by a date or timestamp column.

D. Limit the number of concurrent queries by setting a maximum slot capacity.

E. Cluster tables on columns that are frequently used in filters and joins.

Answers: B, C, E

Materialized views automatically refresh and are used by the query optimizer to speed up queries and reduce scanned bytes.

Why this answer

Partitioning by date reduces bytes scanned. Using clustered tables improves performance for filter/join queries. Using materialized views can precompute aggregations and reduce scans.

Flat-rate pricing is about reservation management, not cost optimization per query. Limiting slots is not a cost optimization; it may cause throttling.


606

Multi-Select · hard

A streaming pipeline uses Cloud Pub/Sub and Dataflow to process financial transactions. The pipeline must guarantee that each transaction is processed exactly once and in order per customer key. Which two configurations are necessary? (Choose two.)

Select 2 answers

A. Use a session window with max gap duration

B. Use a keyed state with a value state per customer

C. Use Dataflow stateful processing with event time ordering

D. Use a Pub/Sub topic with ordering keys

E. Use a global window with a trigger

Answers: C, D

Dataflow stateful processing with event time ordering allows processing events per key in the order they were generated, with exactly-once guarantees.

Why this answer

Option C is correct because Dataflow's stateful processing with event time ordering allows you to maintain per-key state and process elements in the order of their event timestamps, which is essential for guaranteeing exactly-once processing and in-order handling per customer key. This ensures that each transaction is processed once and in the correct sequence, even in the presence of out-of-order data arrival.

Exam trap

Google Cloud often tests the misconception that session windows or global windows can provide per-key ordering, but these windowing strategies are designed for different use cases and do not enforce event time ordering or exactly-once processing per key.
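The idea behind per-key stateful processing with event time ordering can be sketched in plain Python. This is a conceptual stand-in, not the Apache Beam or Dataflow API: the class name, the transaction ID field, and the explicit watermark argument are all hypothetical.

```python
from collections import defaultdict
import heapq


class PerKeyOrderedProcessor:
    """Buffers events per key and releases them in event-time order, once each."""

    def __init__(self):
        self._buffers = defaultdict(list)   # key -> min-heap of (event_time, txn_id)
        self._seen = defaultdict(set)       # key -> txn_ids already accepted

    def add(self, key, event_time, txn_id):
        # Drop redeliveries so each transaction is accepted only once.
        if txn_id in self._seen[key]:
            return
        self._seen[key].add(txn_id)
        heapq.heappush(self._buffers[key], (event_time, txn_id))

    def flush(self, key, watermark):
        """Emit buffered events for `key` with event_time <= watermark, in order."""
        out = []
        buf = self._buffers[key]
        while buf and buf[0][0] <= watermark:
            out.append(heapq.heappop(buf))
        return out


proc = PerKeyOrderedProcessor()
proc.add("cust-1", 3, "t3")
proc.add("cust-1", 1, "t1")
proc.add("cust-1", 1, "t1")   # duplicate delivery, ignored
proc.add("cust-1", 2, "t2")
ordered = proc.flush("cust-1", watermark=2)
```

Events arrive out of order and with a duplicate, yet each key's output is deduplicated and sorted by event time; events newer than the watermark stay buffered until a later flush.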


607

Multi-Select · medium

Which THREE components are typically part of a Vertex AI Pipeline for automated model retraining and deployment?

Select 3 answers

A. Cloud Monitoring alerting component

B. Cloud Storage artifact storage component

C. Training component (e.g., CustomContainerTrainingJob)

D. Model evaluation component (e.g., evaluating on a test set)

E. Deployment component (e.g., deploying model to endpoint)

Answers: C, D, E

Training is the core step.

Why this answer

Option C is correct because a training component, such as a `CustomContainerTrainingJob`, is the core step in a Vertex AI Pipeline that executes the model training logic. It defines the container image, machine configuration, and hyperparameters, enabling automated retraining when triggered by a schedule or event.

Exam trap

Google Cloud often tests the distinction between pipeline components (which are executable tasks in the DAG) and supporting infrastructure (like Cloud Monitoring or Cloud Storage), leading candidates to select options that are related to the pipeline's operation but not actual components within the pipeline definition.


608

MCQ · easy

You need to react to changes in a GCS bucket (e.g., new object creation) and trigger a Cloud Run service to process the new file. Which Google Cloud service should you use to route the event?

A. Pub/Sub directly with a Cloud Run subscription

B. Cloud Tasks

C. Eventarc

D. Cloud Scheduler

Answer: C

Eventarc handles events from GCS and other sources, routing them to Cloud Run.

Why this answer

Eventarc is the correct choice because it is purpose-built to route events from Google Cloud sources (like Cloud Storage) to Cloud Run. It supports direct Cloud Storage events, Cloud Audit Logs events, and Pub/Sub triggers, allowing you to react to object creation events without custom middleware. Eventarc handles the event routing, filtering, and delivery to your Cloud Run service automatically.

Exam trap

Google Cloud often tests the misconception that Pub/Sub is the direct answer for any event routing; Eventarc is the managed service that simplifies the integration between GCS and Cloud Run, making it the correct choice over raw Pub/Sub.

How to eliminate wrong answers

Option A is wrong because using Pub/Sub directly requires you to configure the bucket notification, topic, and subscription yourself, and Cloud Run can only receive messages through a push subscription; Eventarc abstracts this complexity and provides native integration with Cloud Storage events. Option B is wrong because Cloud Tasks is a task queue for asynchronous execution of HTTP requests, not designed for event-driven routing from GCS; it would require you to manually create tasks in response to events, adding unnecessary overhead. Option D is wrong because Cloud Scheduler is a cron job scheduler for periodic tasks, not an event router; it cannot react to real-time object creation events in a GCS bucket.
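On the receiving side, the Cloud Run service gets an HTTP request describing the new object. A minimal sketch of the parsing step follows, assuming the event arrives as a binary-mode CloudEvent (attributes in `ce-*` headers, object metadata as a JSON body with `bucket` and `name` fields); the function name is hypothetical and no web framework is shown.

```python
import json

# CloudEvent type emitted when an object is created or overwritten.
FINALIZED = "google.cloud.storage.object.v1.finalized"


def object_uri_from_event(headers: dict, body: bytes):
    """Return the gs:// URI of a newly created object, or None for other events."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    if normalized.get("ce-type") != FINALIZED:
        return None
    payload = json.loads(body)
    return f"gs://{payload['bucket']}/{payload['name']}"


uri = object_uri_from_event(
    {"Ce-Type": FINALIZED},
    b'{"bucket": "example-bucket", "name": "incoming/data.csv"}',
)
print(uri)
```

Returning None for other event types keeps the handler safe if the trigger filter is later widened.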


609

Matching · medium

Match each machine learning term to its description.

Concepts and matches

Supervised learning: Model trained on labeled data

Unsupervised learning: Model trained on unlabeled data

Reinforcement learning: Agent learns by interacting with environment

Overfitting: Model performs well on training data but poorly on new data

Why these pairings

Key ML concepts commonly tested in PDE exam.


610

Multi-Select · hard

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Select 3 answers

A. Set up monitoring alerts for system lag and data freshness.

B. Use static side inputs that are loaded once at pipeline start.

C. Implement watermark estimation to handle late data.

D. Use global windows with early triggers for low latency.

E. Use idempotent sinks to ensure exactly-once processing.

Answers: A, C, E

Monitoring is critical for streaming pipelines.

Why this answer

Option A is correct because monitoring alerts for system lag and data freshness are essential for maintaining operational visibility in real-time Dataflow pipelines. System lag (the time between data ingestion and processing) and data freshness (how current the processed output is) directly impact the pipeline's ability to meet latency SLAs. Without these alerts, issues like worker backpressure or Pub/Sub subscription backlog can go unnoticed, leading to stale or lost data.

Exam trap

Google Cloud often tests the misconception that static side inputs are acceptable for streaming pipelines, but they are only appropriate for batch or bounded data; real-time pipelines require side inputs that can be periodically refreshed (e.g., via a streaming source or a periodic lookup).
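The idempotent-sink practice from option E can be illustrated with a small in-memory stand-in. This is a sketch under assumptions, not a BigQuery or Beam connector: the class, the content-derived row ID, and the sample record are hypothetical.

```python
import hashlib


class IdempotentSink:
    """In-memory stand-in for a sink that upserts rows by a deterministic ID."""

    def __init__(self):
        self.rows = {}

    @staticmethod
    def row_id(record: dict) -> str:
        # Derive the ID from the record's content so a retry produces the same ID.
        canonical = repr(sorted(record.items())).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def write(self, record: dict) -> None:
        # Writing the same record twice overwrites the same key.
        self.rows[self.row_id(record)] = record


sink = IdempotentSink()
event = {"sensor": "s-1", "ts": 1700000000, "value": 21.5}
sink.write(event)
sink.write(dict(event))   # simulated retry after a worker failure
print(len(sink.rows))
```

Because the ID depends only on the record, replaying a bundle after a failure cannot create duplicate rows, which is what makes retries safe.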


611

MCQ · medium

A company wants to train a machine learning model to predict customer churn using BigQuery ML. The dataset has a severe class imbalance (only 2% churn). Which approach should the data engineer take to handle this imbalance within BigQuery ML?

A. Use SMOTE directly in BigQuery SQL before training

B. Set the CLASS_WEIGHTS option to 'balanced' in the CREATE MODEL statement

C. Create a custom Vertex AI model using TensorFlow and use the class_weight parameter

D. Oversample the minority class using a SQL query that duplicates rows

Answer: B

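Class weighting makes errors on the rare churn class count more heavily during training. In current BigQuery ML syntax, automatic balancing is requested with AUTO_CLASS_WEIGHTS = TRUE, while CLASS_WEIGHTS takes explicit per-label weights.

Why this answer

BigQuery ML supports class weights in the CREATE MODEL statement, which adjust the loss function to penalize misclassifications of the minority class more heavily. This keeps the fix inside BigQuery ML, with no resampling or external training code.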
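The effect of weighting inversely to class frequency can be worked through with a little arithmetic. The formula below is the common "balanced" heuristic (total / (number of classes × class count)); whether BigQuery ML uses exactly this formula internally is an assumption, and the row counts are made up to match the 2% churn rate.

```python
def balanced_class_weights(counts: dict) -> dict:
    """Return weight[c] = total / (num_classes * count[c]) for each class."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Hypothetical dataset: 100,000 customers, 2% of whom churned.
weights = balanced_class_weights({"churn": 2_000, "no_churn": 98_000})
print(weights)
```

With these weights each churn row counts 25 times in the loss, so the two classes contribute equally in total (25 × 2,000 = 50,000 for each side).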


612

Multi-Select · hard

A data analyst wants to compute the rank of sales per region and also the difference in sales between consecutive months for each region. Which BigQuery analytic functions should they use? (Select TWO)

Select 2 answers

A. RANK()

B. ROW_NUMBER()

C. LAG()

D. LEAD()

E. NTILE()

Answers: A, C

RANK() computes the rank of sales per region.

Why this answer

RANK() computes the rank of rows within a partition. LAG() accesses data from a previous row in the same result set, which can be used to compute differences. ROW_NUMBER() assigns unique sequential integers, not rank.

NTILE() distributes rows into buckets. LEAD() accesses next row, not previous.
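RANK() and LAG() can be tried locally, since both window functions are also available in SQLite (3.25 or later). The sketch below uses Python's built-in sqlite3 module as a stand-in for BigQuery; the table, column names, and figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "2024-01", 100),
        ("east", "2024-02", 150),
        ("east", "2024-03", 120),
        ("west", "2024-01", 200),
        ("west", "2024-02", 180),
    ],
)

rows = conn.execute(
    """
    SELECT
      region,
      month,
      amount,
      -- Rank months within each region by sales, highest first.
      RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS sales_rank,
      -- Difference from the previous month in the same region (NULL for the first).
      amount - LAG(amount) OVER (PARTITION BY region ORDER BY month) AS month_over_month
    FROM sales
    ORDER BY region, month
    """
).fetchall()

for row in rows:
    print(row)
```

The first month of each region has no previous row, so LAG() returns NULL there and the difference is NULL as well.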


613

Multi-Select · medium

You want to optimize BigQuery costs for a large dataset that is frequently queried by time range. You also need to ensure that predictable workloads have dedicated slot capacity. Which TWO strategies should you combine? (Choose 2)

Select 2 answers

A. Use query caching

B. Partition the table by date

C. Purchase committed use reservations for baseline capacity

D. Create a materialized view for the entire table

E. Enable autoscaling slots

Answers: B, C

Partitioning by date limits the bytes scanned per query, reducing cost.

Why this answer

Partitioned tables reduce bytes scanned by time-range queries. Committed use reservations provide predictable slot capacity at a discount. Autoscaling does not provide dedicated capacity; caching is automatic.


614

MCQ · medium

You are migrating an existing Kafka cluster to Google Cloud using Dataproc. The cluster handles high-throughput streaming data with strict ordering requirements per partition. Which choice of Dataproc configuration is most appropriate?

A. Use Dataflow with Kafka IO instead of Dataproc.

B. Use Dataproc with local SSDs for better performance, and enable autoscaling.

C. Use Dataproc with preemptible workers to reduce cost, and attach standard persistent disks.

D. Use Dataproc with non-preemptible workers and persistent SSD storage for brokers.

Answer: D

Non-preemptible workers provide stability for Kafka brokers, and SSDs offer low latency for high-throughput streaming.

Why this answer

Option D is correct because Kafka brokers in a Dataproc cluster require persistent, non-preemptible workers to maintain data durability and strict ordering per partition. Preemptible workers can be terminated at any time, causing data loss or rebalancing that violates ordering guarantees. Persistent SSD storage provides the low-latency I/O needed for high-throughput Kafka workloads, while non-preemptible instances ensure broker stability and consistent replication.

Exam trap

Google Cloud often tests the misconception that preemptible VMs or local SSDs are acceptable for stateful, ordered workloads like Kafka, when in fact they violate durability and ordering guarantees due to ephemeral storage and abrupt termination.

How to eliminate wrong answers

Option A is wrong because Dataflow with Kafka IO is a serverless stream processing service, not a Kafka cluster migration target; the question asks about migrating an existing Kafka cluster to Dataproc, not replacing it with a different processing paradigm. Option B is wrong because local SSDs are ephemeral and lose data on instance termination, which is incompatible with Kafka's durability and ordering requirements; autoscaling can cause partition rebalancing that disrupts strict ordering. Option C is wrong because preemptible workers can be terminated at any time, leading to data loss and partition leader re-elections that break ordering guarantees; standard persistent disks have higher latency than SSDs, degrading Kafka's throughput.


615

Multi-Select · medium

A company wants to use BigQuery ML to build a recommendation system for movies. The data includes user IDs, movie IDs, and ratings. Which BigQuery ML model types are suitable for this? (Select TWO)

Select 2 answers

A. ARIMA_PLUS

B. k-means

C. AutoML Tables

D. Boosted tree classifier

E. Matrix factorization

Answers: D, E

Boosted trees can be used to predict ratings as a classification problem.

Why this answer

Matrix factorization (via implicit or explicit feedback) is specifically designed for recommendation systems. Boosted tree classifiers can also be used for predicting ratings as a classification problem. AutoML Tables is not a model type in BigQuery ML.

ARIMA_PLUS is for time-series. k-means is for clustering, not recommendations directly.


616

MCQ · hard

A company runs a streaming data pipeline on Google Cloud using Cloud Pub/Sub, Cloud Dataflow, and BigQuery. The pipeline processes real-time sensor data for predictive maintenance. Recently, the Dataflow job's lag has increased from seconds to minutes, and the system shows backpressure. The pipeline uses fixed windows of 1 minute and writes results to BigQuery. The data volume has doubled. The team has already increased the number of workers. What should they do next?

A. Enable Streaming Engine and use Upsert to BigQuery

B. Decrease the window duration

C. Use session windows instead of fixed windows

D. Use Cloud Storage as temporary sink

Answer: A

Streaming Engine reduces overhead and Upsert makes BigQuery writes more efficient.

Why this answer

The correct answer is A because enabling Streaming Engine offloads the heavy shuffle and state management from the worker VMs to the backend service, reducing the impact of backpressure. Using Upsert to BigQuery allows the pipeline to handle late-arriving data within the fixed windows without requiring a full table rewrite, which is critical when data volume has doubled and lag has increased.

Exam trap

The trap here is that candidates often assume increasing workers or changing window sizes will fix backpressure, but the real bottleneck is often the shuffle and state management in Dataflow, which Streaming Engine directly addresses.

How to eliminate wrong answers

Option B is wrong because decreasing the window duration would increase the number of windows and the frequency of writes, exacerbating the backpressure and lag rather than solving it. Option C is wrong because session windows are designed for grouping events based on gaps of inactivity, which is not relevant to the fixed-window requirement for predictive maintenance sensor data; they would not reduce backpressure. Option D is wrong because using Cloud Storage as a temporary sink adds an extra write step and does not address the root cause of backpressure in the Dataflow pipeline; it would increase latency and complexity.


617

MCQ · medium

A data engineer needs to load 2 TB of Avro files stored in Cloud Storage into BigQuery on a daily schedule. The schema is static and the data should overwrite the existing table each day. What is the most efficient way to accomplish this?

A. Create a BigQuery Data Transfer Service from Cloud Storage

B. Create a Dataflow pipeline to read Avro files and stream to BigQuery

C. Mount Cloud Storage as a filesystem and use SELECT INTO

D. Use bq load with --replace flag in a cron job

Answer: D

Why this answer

The bq load command with the --replace flag is the most efficient method because it directly loads Avro files from Cloud Storage into BigQuery in a single, serverless operation without requiring any intermediate processing. Since the schema is static and the data overwrites the existing table daily, a simple cron job invoking bq load --replace is the simplest, fastest, and most cost-effective solution, avoiding the overhead of additional services like Dataflow or Data Transfer Service.

Exam trap

Google Cloud often tests the misconception that managed services like Data Transfer Service or Dataflow are always the best choice for scheduled loads. For a single daily overwrite with a static schema, a plain batch load job is enough, and Dataflow adds unnecessary overhead.

How to eliminate wrong answers

Option A is not the best fit because a Cloud Storage transfer in the BigQuery Data Transfer Service adds a transfer configuration to create and maintain, while a single bq load --replace call already performs the overwrite as a batch load job. Option B is wrong because a Dataflow pipeline introduces unnecessary complexity, cost, and latency for a straightforward batch load; streaming to BigQuery is not needed for daily overwrites of static Avro files, and batch loads are more efficient. Option C is wrong because mounting Cloud Storage as a filesystem (e.g., via gcsfuse) and using SELECT INTO is not a native BigQuery operation; BigQuery does not support SELECT INTO from a mounted filesystem, and this approach would require an external processing layer, defeating efficiency.
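The command a scheduled job would run can be assembled as follows. This is only a sketch that builds the argument list without executing it; the dataset, table, and bucket names are hypothetical, and it assumes the `bq` CLI is installed and authenticated wherever the job runs.

```python
import shlex


def build_bq_load_command(table: str, source_uri: str) -> list:
    """Build the argv for a bq load that overwrites `table` from Avro files."""
    return [
        "bq", "load",
        "--replace",              # overwrite the destination table
        "--source_format=AVRO",   # Avro files carry their own schema
        table,
        source_uri,
    ]


cmd = build_bq_load_command(
    "analytics.daily_events",              # hypothetical dataset.table
    "gs://example-bucket/exports/*.avro",  # hypothetical source path
)
print(shlex.join(cmd))
```

A scheduler could pass the list to `subprocess.run(cmd, check=True)`; keeping it as a list avoids shell quoting problems with the wildcard in the URI.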


618

Multi-Select · hard

Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)

Select 3 answers

A. Change the worker machine type to a higher CPU/memory configuration

B. Decrease the window duration to reduce data per window

C. Enable Dataflow Streaming Engine

D. Increase the number of workers in the pipeline

E. Add additional Pub/Sub subscriptions to the same topic

Answers: A, C, D

More CPU/memory per worker can speed up processing if the transform is compute-intensive.

Why this answer

Increasing the number of workers allows the pipeline to process more data in parallel. Using streaming engine can improve throughput and reduce latency by offloading state management. Adjusting the worker machine type to use more CPU/memory can help if the processing is compute-intensive.

Adding more subscriptions would not help because the pipeline reads from a single subscription. Changing the window size affects business logic but not necessarily performance. Combining these three optimizations addresses common bottlenecks.


619

MCQ · easy

A streaming Dataflow job is processing messages from Cloud Pub/Sub. The job is underutilizing resources and the throughput is lower than expected. Which parameter should be adjusted to increase parallelism?

A. Change the workerMachineType to a higher CPU machine

B. Increase the number of workers via maxNumWorkers

C. Set the streaming engine to Dataflow Streaming Engine

D. Set autoscalingAlgorithm to THROUGHPUT_BASED

Answer: B

More workers allow more parallelism.

Why this answer

The job is underutilizing resources, meaning the existing workers are not fully loaded. Increasing the number of workers via maxNumWorkers directly increases parallelism by allowing Dataflow to distribute work across more VMs, which can increase throughput without changing the per-worker resource profile. This parameter controls the upper bound on the number of workers, enabling the autoscaler to scale out when there is backlog.

Exam trap

Google Cloud often tests the misconception that increasing per-worker resources (CPU/memory) is the primary way to improve throughput in a streaming job, when in fact underutilization indicates the need to scale out workers rather than scale up individual workers.

How to eliminate wrong answers

Option A is wrong because changing workerMachineType to a higher CPU machine increases per-worker compute capacity but does not address underutilization; if workers are idle, adding more CPU per worker will not increase parallelism or throughput. Option C is wrong because Dataflow Streaming Engine is a service that offloads shuffle and state management to the backend, reducing per-worker overhead and improving scalability, but it does not directly increase parallelism; it changes the execution model. Option D is wrong because setting autoscalingAlgorithm to THROUGHPUT_BASED is already the default for streaming jobs; it enables autoscaling based on throughput metrics, but without adjusting maxNumWorkers, the autoscaler cannot scale beyond the default limit, so throughput remains capped.


620

MCQ · hard

A healthcare startup is deploying a natural language processing (NLP) model for extracting medical entities from clinical notes. The model is a fine-tuned BERT model served on Vertex AI Prediction using a custom container. The team observes that prediction latency is around 500ms per request, but they need to handle up to 100 requests per second (QPS) with end-to-end latency under 200ms. The model currently runs on n1-standard-4 machines (4 vCPU, 15 GB memory). During load testing, CPU utilization reaches 90% and memory usage is 12 GB. The team is considering options to meet the requirements. Which action should they take?

A. Use a machine type with a GPU, such as n1-standard-4 with a NVIDIA Tesla T4 accelerator, and optimize the model with TensorRT.

B. Switch to n1-highmem-4 machines to provide more memory for the model.

C. Deploy the model using TensorFlow Serving with CPU-only nodes and increase the number of replicas.

D. Move the model to Cloud Run with automatic scaling to handle the QPS.

Answer: A

GPU accelerates BERT inference and TensorRT further optimizes latency.

Why this answer

Option A is correct because the bottleneck is CPU-bound inference (90% CPU utilization) with memory well within limits (12 GB of 15 GB). Adding a GPU (NVIDIA Tesla T4) and optimizing with TensorRT reduces per-request latency via hardware acceleration and graph optimizations, enabling sub-200ms inference at 100 QPS. This directly addresses the latency requirement without changing the machine family or scaling strategy.

Exam trap

Google Cloud often tests the misconception that scaling horizontally (more replicas or Cloud Run) solves latency problems, when the real issue is per-request compute bottleneck that requires hardware acceleration or model optimization.

How to eliminate wrong answers

Option B is wrong because memory is not the bottleneck (12 GB used out of 15 GB); increasing memory does not reduce CPU-bound inference latency. Option C is wrong because TensorFlow Serving on CPU-only nodes still relies on CPU compute, and increasing replicas adds cost and complexity without addressing the fundamental latency per request; the CPU utilization is already saturated, so more replicas would require horizontal scaling but still not guarantee sub-200ms latency per request. Option D is wrong because Cloud Run's automatic scaling handles QPS but does not reduce per-request latency; the model's inference time remains CPU-bound, and Cloud Run's cold starts and CPU-only instances would not meet the 200ms latency target.
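The difference between throughput and per-request latency can be made concrete with Little's law (requests in flight = arrival rate × latency). The figures come from the question; the per-replica concurrency of 4 is a hypothetical value chosen only for illustration.

```python
import math


def in_flight_requests(qps: float, latency_ms: float) -> float:
    """Little's law: average number of requests being served at once."""
    return qps * latency_ms / 1000


def replicas_needed(qps: float, latency_ms: float, concurrency_per_replica: int) -> int:
    """Replicas required to hold the in-flight load at a given per-replica concurrency."""
    return math.ceil(in_flight_requests(qps, latency_ms) / concurrency_per_replica)


current = in_flight_requests(100, 500)   # at today's 500 ms latency
target = in_flight_requests(100, 200)    # at the required 200 ms latency
print(current, target, replicas_needed(100, 500, 4))
```

Adding replicas lets the service absorb the 50 in-flight requests, but each request still takes about 500 ms on CPU; only faster inference per request brings latency under the 200 ms target.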


621

MCQ · medium

A company wants to use dbt to transform data in BigQuery. Their source data is loaded daily into staging tables. They need to run dbt transformations on a schedule and only process tables that have changed. Which dbt feature should they use?

A. dbt snapshots

B. dbt incremental models

C. dbt seeds

D. dbt sources

Answer: B

Incremental models only process new/changed records, reducing cost and runtime.

Why this answer

dbt incremental models allow processing only new or changed records based on a configured timestamp or unique key. dbt snapshots capture historical changes. dbt seeds load CSV files. dbt sources are for configuration, not incremental processing.
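The logic an incremental model applies on each run can be sketched in plain Python. This is a conceptual stand-in rather than dbt code: the `updated_at` timestamp, the `id` unique key, and the in-memory "tables" are hypothetical.

```python
def incremental_run(target: dict, source_rows: list) -> int:
    """Merge only rows newer than the target's high-water mark; return rows processed."""
    # High-water mark: the latest updated_at already present in the target.
    high_water = max((r["updated_at"] for r in target.values()), default=None)
    new_rows = [
        r for r in source_rows
        if high_water is None or r["updated_at"] > high_water
    ]
    for row in new_rows:
        # Upsert by unique key so a changed record replaces its old version.
        target[row["id"]] = row
    return len(new_rows)


target_table = {}
staging = [{"id": 1, "updated_at": 1}, {"id": 2, "updated_at": 2}]
first_run = incremental_run(target_table, staging)    # full load on the first run

staging.append({"id": 1, "updated_at": 3})             # record 1 changes upstream
second_run = incremental_run(target_table, staging)   # only the changed row is processed
print(first_run, second_run)
```

The first run processes everything because the target is empty; later runs touch only rows past the high-water mark, which is where the cost and runtime savings come from.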


622

Multi-Select · medium

Your company uses Cloud Composer to orchestrate a data pipeline that includes Dataproc Spark jobs and BigQuery load operations. You need to pass the output file path from the Spark job to the next BigQuery task in the DAG. Which two mechanisms can you use to share data between tasks? (Choose TWO.)

Select 2 answers

A. Store the output path as a Cloud Composer variable.

B. Publish the output path to a Pub/Sub topic and subscribe in the next task.

C. Write the output path to a Cloud Storage object and read it in the next task.

D. Use BigQuery as an intermediary to store the output path.

E. Use Airflow XComs to push the output path from the Spark task and pull it in the BigQuery task.

Answers: C, E

Cloud Storage is a durable store that both tasks can access.

Why this answer

Airflow XComs allow tasks to exchange small amounts of data (e.g., file paths) by pushing and pulling values. Cloud Storage can be used as an intermediate store: the Spark job writes output to GCS, and the BigQuery task reads from that location. BigQuery does not directly communicate with Dataproc.

Cloud Composer variables are for global configuration, not task-to-task. Pub/Sub is not needed for simple file path sharing.
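A minimal sketch of the XCom push/pull pattern follows. The two callables use the `xcom_push` and `xcom_pull` methods that Airflow exposes on a task instance; everything else is an assumption made so the example runs without Airflow installed: `FakeTaskInstance` is a hypothetical stand-in, and the bucket path and task IDs are invented.

```python
class FakeTaskInstance:
    """Hypothetical stand-in for Airflow's TaskInstance (XCom methods only)."""

    def __init__(self, task_id, store):
        self.task_id = task_id
        self._store = store  # shared dict playing the role of the XCom backend

    def xcom_push(self, key, value):
        self._store[(self.task_id, key)] = value

    def xcom_pull(self, task_ids, key):
        return self._store.get((task_ids, key))


def run_spark_job(ti):
    # In a real DAG the path would come from the Dataproc job's configuration.
    output_path = "gs://example-bucket/spark-output/2024-01-01/"  # hypothetical
    ti.xcom_push(key="output_path", value=output_path)


def load_to_bigquery(ti):
    # Pull the path pushed by the upstream task and build a source URI from it.
    path = ti.xcom_pull(task_ids="run_spark_job", key="output_path")
    return f"{path}*.parquet"


store = {}
run_spark_job(FakeTaskInstance("run_spark_job", store))
source_uri = load_to_bigquery(FakeTaskInstance("load_to_bigquery", store))
```

Only the small path string travels through XCom; the data itself stays in Cloud Storage.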

623

MCQmedium

A data engineering team needs to process a large volume of CSV files stored in Cloud Storage using Dataproc. The files are generated hourly and each contains millions of rows. They want to minimize the number of Dataproc cluster nodes to reduce cost while processing within an hour. Which configuration should they recommend?

A.Use a cluster with preemptible worker nodes only.

B.Use a cluster with local SSDs for temporary storage.

C.Use a cluster with a few large worker nodes and use Spark static allocation.

D.Use a cluster with many small worker nodes and use Spark dynamic allocation.

AnswerD

Dynamic allocation adjusts resources based on workload; small nodes provide granular scaling.

Why this answer

Option D is correct because using many small worker nodes with Spark dynamic allocation allows the cluster to scale resources precisely to the workload, minimizing idle capacity and cost. Dynamic allocation enables executors to be added or removed based on the processing demands of the hourly CSV files, ensuring the job completes within the hour without over-provisioning nodes.
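As a rough illustration, the Spark properties that control dynamic allocation can be assembled into the comma-separated `key=value` form that a job's properties setting expects. The property names are standard Spark settings; the helper name and the executor bounds are assumptions chosen for the example.

```python
def dynamic_allocation_properties(min_executors, max_executors):
    """Return Spark dynamic-allocation settings as a key=value,key=value string."""
    props = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    return ",".join(f"{key}={value}" for key, value in props.items())

# Hypothetical bounds: keep a small floor, allow bursts for the hourly batch.
properties = dynamic_allocation_properties(min_executors=2, max_executors=20)
```

The floor keeps a small baseline available, while the ceiling caps how far the job can scale out when an hourly batch is unusually large.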

Exam trap

Google Cloud often tests the misconception that larger nodes are always more cost-effective for big data processing, but in practice, many small nodes with dynamic allocation reduce idle resource waste and better match the parallelism needs of distributed file processing.

How to eliminate wrong answers

Option A is wrong because preemptible workers can be reclaimed by Google Cloud at any time, risking job failure or delays when processing millions of rows per hour, and Dataproc does not allow a cluster to run with preemptible (secondary) workers alone. Option B is wrong because local SSDs improve I/O performance for shuffle operations but do not reduce the number of nodes; they add cost per node and do nothing to minimize node count. Option C is wrong because Spark static allocation reserves a fixed number of executors on the few large nodes regardless of actual workload, leading to underutilization and higher cost when the job does not need all of them, and it cannot adapt to variations in hourly data volume.

624

MCQmedium

Refer to the exhibit. A team configured a Cloud Monitoring alerting policy as shown. They recently started receiving false positive alerts. What is the most likely cause?

A.The duration of 60 seconds is too short, making the alert sensitive to brief spikes.

B.The alignment period of 60 seconds is too short, causing noise.

C.The threshold of 10 is too low.

D.The aggregator should be ALIGN_SUM instead of ALIGN_RATE.

AnswerA

A short duration means a spike lasting just over 60 seconds will trigger an alert; a longer duration (e.g., 300s) would reduce sensitivity.

Why this answer

A 60-second duration means the alert fires if the condition is met for just one minute. This is too short to distinguish transient spikes from sustained issues, causing false positives. Increasing the duration would require the metric to breach the threshold for a longer, more meaningful period.
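The effect of the duration setting can be sketched with a simplified model: one aligned data point per 60-second alignment period, and an alert that fires only when the threshold is breached for a given number of consecutive points. This is an approximation for illustration, not Cloud Monitoring's exact evaluation logic, and the sample values are invented.

```python
def alert_fires(samples, threshold, duration_periods):
    """True if samples exceed threshold for duration_periods consecutive points."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= duration_periods:
            return True
    return False

# Hypothetical series with two isolated one-minute spikes above the threshold.
samples = [3, 4, 25, 5, 3, 4, 30, 4]
fires_at_60s = alert_fires(samples, threshold=10, duration_periods=1)
fires_at_300s = alert_fires(samples, threshold=10, duration_periods=5)
```

With a one-period duration the isolated spikes trigger the alert; requiring five consecutive breaches (300 seconds) filters them out.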

Exam trap

Google Cloud often tests the distinction between alignment period (how data is aggregated) and duration (how long the condition must persist), tempting candidates to blame the alignment period when the real issue is the insufficient duration.

How to eliminate wrong answers

Option B is wrong because the alignment period of 60 seconds is standard for aggregating data into regular intervals; a shorter alignment period could increase noise, but the primary cause of false positives here is the short duration, not the alignment. Option C is wrong because a threshold of 10 is not inherently too low; the false positives are due to the alert triggering on brief spikes, not because the threshold value is misconfigured. Option D is wrong because ALIGN_RATE is appropriate for metrics that measure change over time (e.g., requests per second), and using ALIGN_SUM would sum rates incorrectly, potentially masking spikes rather than causing false positives.

625

MCQmedium

A company uses Cloud Composer to orchestrate a daily ETL pipeline that includes multiple Dataproc jobs. The pipeline processes sensitive financial data. The security team requires that all data in transit be encrypted, and all Cloud Storage buckets used by the pipeline should have uniform bucket-level access enabled and VPC Service Controls. The pipeline currently uses a single Cloud Composer environment in us-east1. The Dataproc clusters are created using the standard image and use custom service accounts with minimal permissions. The pipeline runs successfully during testing, but in production, the Dataproc jobs fail with 'Access Denied' errors when trying to write to a Cloud Storage bucket. The bucket has uniform bucket-level access enabled and is inside a VPC Service Controls perimeter. The Dataproc service account has the Storage Object Admin role at the project level. What is the most likely cause of the access denied error?

A.The service account does not have the Storage Object Admin role on the bucket.

B.Data in transit encryption is not enabled for the Cloud Storage bucket.

C.Uniform bucket-level access prevents writes from service accounts.

D.The Dataproc cluster is not in the VPC Service Controls perimeter.

AnswerD

VPC Service Controls deny access from resources outside the perimeter.

Why this answer

The Dataproc cluster is created outside the VPC Service Controls perimeter, so even though the service account has the Storage Object Admin role at the project level, requests from the cluster are blocked by the perimeter's ingress/egress rules. VPC Service Controls enforce a security boundary that prevents resources outside the perimeter from accessing protected services like Cloud Storage, regardless of IAM permissions. The 'Access Denied' error in production, despite successful testing, strongly indicates a perimeter configuration mismatch.

Exam trap

Google Cloud often tests the distinction between IAM permissions and VPC Service Controls boundaries, tricking candidates into thinking a project-level IAM role is sufficient when the real blocker is network-level perimeter enforcement.

How to eliminate wrong answers

Option A is wrong because the service account has the Storage Object Admin role at the project level, which grants write access to all buckets in the project, including this one; uniform bucket-level access does not override project-level IAM roles. Option B is wrong because data in transit encryption is automatically enforced by Google Cloud for all API calls to Cloud Storage (using HTTPS/TLS), and the question states the pipeline already encrypts data in transit, so this is not the cause of the error. Option C is wrong because uniform bucket-level access does not prevent writes from service accounts; it simply disables ACLs and requires all access decisions to be made via IAM policies, which the service account already has via its project-level role.

626

Multi-Selectmedium

Which TWO actions should be taken to optimize a Dataflow streaming pipeline that is experiencing high system lag and backpressure? (Choose two.)

Select 2 answers

A.Use a higher memory machine type for all workers.

B.Increase the number of worker threads by adjusting the streaming worker's parallelism hint.

C.Enable autoscaling and increase the maximum number of workers.

D.Reduce the number of workers to decrease cost.

E.Set maxNumWorkers to 1 to force single-worker processing.

AnswersB, C

More threads can increase throughput per worker.

Why this answer

Option B is correct because increasing the parallelism hint allows each worker to process more bundles concurrently, which can reduce backpressure by improving throughput without adding more workers. Option C is correct because enabling autoscaling and increasing the maximum number of workers allows the pipeline to dynamically scale out to handle increased load, directly mitigating high system lag and backpressure.

Exam trap

Google Cloud often tests the misconception that simply adding more memory or reducing workers will solve backpressure, when in fact the correct approaches involve increasing parallelism or scaling out the worker pool.

627

MCQeasy

A company wants to stream real-time clickstream data from a website into BigQuery for near-real-time analytics. They expect peaks of 10,000 events per second. Which combination of services is most suitable for ingestion?

A.Cloud Storage → Cloud Functions → BigQuery

B.Direct Web → Dataflow → BigQuery

C.Pub/Sub → Dataflow → BigQuery (Storage Write API)

D.Pub/Sub → Dataflow → BigQuery (legacy streaming inserts)

AnswerC

This is the modern recommended architecture: Pub/Sub for ingestion, Dataflow for processing, Storage Write API for high-throughput streaming ingestion into BigQuery.

Why this answer

Pub/Sub is designed for ingesting high-throughput event streams, Dataflow can process and transform the data in real time, and the BigQuery Storage Write API provides exactly-once semantics and higher throughput than legacy streaming inserts. Option C uses the correct pipeline. Option A routes data through Cloud Storage and Cloud Functions, which suits file-based, event-triggered loads rather than a continuous stream peaking at 10,000 events per second.

Option B has no messaging layer between the website and Dataflow, so nothing buffers traffic peaks or decouples producers from the pipeline. Option D uses legacy streaming inserts, which offer lower throughput than the Storage Write API and only best-effort deduplication.

628

MCQeasy

Which Google Cloud service is a serverless, highly scalable data warehouse for analytical queries, supporting SQL and integration with BI tools?

A.Firestore

B.Cloud SQL

C.Cloud Spanner

D.BigQuery

AnswerD

Correct: BigQuery is the serverless analytics warehouse.

Why this answer

BigQuery is a serverless, highly scalable data warehouse designed for analytical queries over large datasets. It supports standard SQL and integrates seamlessly with BI tools like Looker and Tableau, making it the correct choice for this use case.

Exam trap

The trap here is that candidates may confuse Cloud Spanner's global scale and SQL support with data warehousing, but Spanner is optimized for transactional consistency, not analytical query performance or BI tool integration.

How to eliminate wrong answers

Option A is wrong because Firestore is a NoSQL document database for mobile and web app development, not a data warehouse for analytical SQL queries. Option B is wrong because Cloud SQL is a fully managed relational database for OLTP workloads, not a serverless data warehouse optimized for large-scale analytics. Option C is wrong because Cloud Spanner is a globally distributed, strongly consistent relational database service for transactional workloads, not a data warehouse designed for analytical queries and BI integration.

629

MCQhard

A data engineer needs to alert when a Pub/Sub subscription has messages older than 1 hour. Which Cloud Monitoring metric and filter should they use?

A.Metric: topic/send_message_operation_count; filter: topic_id

B.Metric: subscription/ack_message_count; filter: subscription_id

C.Metric: subscription/num_undelivered_messages; filter: subscription_id

D.Metric: subscription/oldest_unacked_message_age; filter: subscription_id

AnswerD

Correct metric and filter for alerting on message age.

Why this answer

The metric subscription/oldest_unacked_message_age gives the age of the oldest unacknowledged message. Filtering by subscription ID targets the specific subscription.
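A sketch of how the metric and filter might be combined into a Cloud Monitoring filter string is shown below. The metric type and resource label follow Pub/Sub's published metric naming; the helper name and the subscription ID `sensor-events-sub` are assumptions for illustration. The metric reports age in seconds, so a 1-hour condition corresponds to a threshold of 3600.

```python
def oldest_unacked_age_filter(subscription_id):
    """Build a Monitoring filter for one subscription's oldest unacked message age."""
    return (
        'metric.type="pubsub.googleapis.com/subscription/oldest_unacked_message_age" '
        'AND resource.type="pubsub_subscription" '
        f'AND resource.labels.subscription_id="{subscription_id}"'
    )

THRESHOLD_SECONDS = 60 * 60  # 1 hour, expressed in the metric's unit (seconds)
alert_filter = oldest_unacked_age_filter("sensor-events-sub")  # hypothetical ID
```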

630

Drag & Dropmedium

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

Why this order

Push subscriptions send messages to a configured HTTPS endpoint.

631

Multi-Selecthard

You are designing a streaming pipeline that must guarantee exactly-once processing. Which three services or features can help achieve this? (Choose THREE.)

Select 3 answers

A.Cloud Functions for post-processing

B.BigQuery streaming inserts with a unique key for deduplication

C.Cloud Spanner for deduplication state across the pipeline

D.Cloud Pub/Sub with duplicate detection (using message IDs)

E.Dataflow with idempotent write operations to BigQuery

AnswersC, D, E

Using Cloud Spanner as a global state store allows tracking processed event IDs for deduplication.

Why this answer

Cloud Spanner is correct because it provides globally distributed, strongly consistent transactions that can be used to maintain deduplication state across the entire streaming pipeline. By storing a unique key for each processed event in Spanner, the pipeline can atomically check and record whether an event has already been handled, ensuring exactly-once semantics even in the face of retries or failures.
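The check-and-record pattern described above can be sketched as follows. An in-memory set stands in for the deduplication table; in the scenario described, that state would live in a strongly consistent store such as Cloud Spanner, and the check and write would happen inside one transaction. The event shape and function name are assumptions for the example.

```python
def process_once(events, seen_ids, sink):
    """Apply each event at most once, using seen_ids as deduplication state."""
    for event in events:
        if event["id"] in seen_ids:
            continue  # redelivered event: already handled, skip it
        sink.append(event["payload"])  # the side effect we want exactly once
        seen_ids.add(event["id"])      # record the ID after handling

# Hypothetical stream in which event "a" is delivered twice.
events = [
    {"id": "a", "payload": 1},
    {"id": "b", "payload": 2},
    {"id": "a", "payload": 1},
]
seen_ids, sink = set(), []
process_once(events, seen_ids, sink)
```

In this single-process sketch the check and the record are not atomic; the transactional guarantee is exactly what the external store is meant to supply.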

Exam trap

Google Cloud often tests the misconception that BigQuery streaming inserts can guarantee exactly-once processing via a unique key, when in fact BigQuery only supports at-least-once delivery and requires external deduplication mechanisms like Cloud Spanner or Dataflow with idempotent writes.

632

Drag & Dropmedium

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.

Why this order

IAP verifies identity and authorization before allowing access to the application.

633

MCQeasy

A company processes CSV files that are uploaded to Cloud Storage by external partners. Each file is around 500 MB, and they need to be parsed and loaded into BigQuery. The processing must start as soon as the file arrives. What is the most efficient serverless architecture?

A.Cloud Storage triggers a Cloud Function that publishes events to Pub/Sub; a Dataflow streaming pipeline reads from Pub/Sub and writes to BigQuery.

B.Use Cloud Scheduler to periodically check for new files and process them with Dataflow batch jobs.

C.Cloud Storage triggers a Dataproc job that reads the file and loads it into BigQuery.

D.Cloud Storage triggers a Cloud Function that directly loads the data into BigQuery using the BigQuery API.

AnswerA

Serverless and scales well with file uploads.

Why this answer

Option A is correct because it combines Cloud Storage event-driven triggers with Pub/Sub for reliable asynchronous message delivery, and uses Dataflow streaming with autoscaling to handle 500 MB files efficiently. This serverless architecture ensures processing starts immediately upon file arrival, scales to handle large files without manual intervention, and leverages BigQuery's streaming inserts for near-real-time data loading.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can handle large file processing directly, but the 9-minute timeout and memory limits make them unsuitable for files over a few hundred MB, pushing candidates toward the seemingly simpler Option D.

How to eliminate wrong answers

Option B is wrong because Cloud Scheduler polling introduces latency and inefficiency, as it checks for new files on a fixed schedule rather than reacting instantly, which violates the requirement that processing must start as soon as the file arrives. Option C is wrong because Dataproc is a managed Hadoop/Spark service that requires cluster provisioning and startup time, adding overhead for a simple CSV-to-BigQuery load; it is not serverless and not the most efficient for this use case. Option D is wrong because Cloud Functions have a 9-minute timeout and 2 GB memory limit, making them unsuitable for parsing and loading a 500 MB CSV file directly via the BigQuery API, which would likely exceed these constraints and cause failures.

634

MCQmedium

A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?

A.Kappa architecture with Pub/Sub and Dataflow for both real-time and batch processing

B.Lambda architecture with Pub/Sub for streaming and Cloud Storage for batch, processed by Dataflow

C.Batch processing only with Dataflow and Cloud Storage, ignoring real-time needs

D.Kappa architecture with Pub/Sub Lite and Dataflow Serverless

AnswerA

Kappa architecture uses a single streaming pipeline, and Dataflow can replay from Pub/Sub for batch reanalysis, minimizing management.

Why this answer

Kappa architecture processes all data as a stream, avoiding separate batch/speed layers. Pub/Sub ingests streaming data, and Dataflow with Apache Beam can handle both real-time and batch (replay) pipelines, minimizing infrastructure management.

635

Multi-Selecteasy

A company needs to store and analyze large amounts of unstructured data (images, videos) and structured data (CSV logs) in a cost-effective manner. The data should be accessible for analytics with BigQuery. Which two services should they use? (Choose TWO.)

Select 2 answers

A.Cloud SQL

B.Cloud Spanner

C.BigQuery

D.Cloud Storage

E.Firestore

AnswersC, D

BigQuery can query data stored in Cloud Storage via external tables, enabling analytics.

Why this answer

Cloud Storage is the best option for storing unstructured and structured files cost-effectively. BigQuery can analyze this data directly via external tables or after loading, making it a powerful analytics platform.

636

MCQhard

An IoT application writes sensor readings to Cloud Bigtable with a row key of 'deviceID#timestamp'. The team notices high write latency and hotspots on a few nodes. Which row key design change would most likely improve performance?

A.Add a random prefix to the row key (e.g., hash of deviceID modulo 1000)

B.Reverse the key to 'timestamp#deviceID'

C.Use a single column family with many columns

D.Store all data in one row per device

AnswerA

Hashing the device ID distributes writes across tablet servers, reducing hotspots.

Why this answer

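Hotspots occur when many writes land in a narrow, contiguous range of row keys, because Bigtable stores rows in sorted key order and a narrow range is served by only a few nodes. Leading with the timestamp (option B) would make this worse, since all current writes would share the same key prefix. Prefixing the key with a hash of the device ID modulo a fixed number of buckets spreads writes across the key space. Because that prefix is derived from the device ID rather than chosen at random, it is also more systematic than a purely random salt: readers can recompute it to locate a device's rows.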
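One possible reading of option A, sketched in Python: derive a bucket from a hash of the device ID and place it in front of the original key. The function name, the `#` separator layout, and the choice of SHA-256 are assumptions for illustration; the bucket count of 1000 comes from the option text.

```python
import hashlib

def salted_row_key(device_id, timestamp, buckets=1000):
    """Build 'bucket#deviceID#timestamp' using a deterministic hash bucket."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % buckets
    width = len(str(buckets - 1))  # zero-pad so prefixes have a fixed length
    return f"{bucket:0{width}d}#{device_id}#{timestamp}"

# The same device always maps to the same bucket, so its rows stay contiguous.
key_a = salted_row_key("device-42", 1700000000)
key_b = salted_row_key("device-42", 1700000060)
```

Because the prefix is a pure function of the device ID, a reader can recompute it and scan a single device's rows with one prefix.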

637

MCQmedium

You are monitoring a Dataproc cluster and notice that the cluster utilisation is high, but jobs are running slowly. The cluster uses preemptible workers for cost savings. What is the most likely cause of the performance degradation?

A.The primary workers are using standard disks instead of SSDs.

B.The preemptible workers are being preempted frequently, causing task retries and slowdowns.

C.The cluster is under-provisioned; increase the number of preemptible workers.

D.The cluster is using an older image version; upgrade to the latest.

AnswerB

Preemptible workers have a high chance of termination, which affects job performance.

Why this answer

Preemptible workers can be terminated at any time, causing task retries and slower execution. High utilisation combined with slow jobs is consistent with the cluster repeatedly recomputing work that was lost on preempted workers.

638

MCQmedium

A company runs Apache Kafka on Dataproc for real-time event streaming. They want to archive the Kafka topics to Cloud Storage for long-term retention and later analysis in BigQuery. Which approach is the most cost-effective and operationally simple?

A.Use Apache Spark streaming on Dataproc to read from Kafka and write to GCS

B.Use Kafka MirrorMaker to replicate topics to a second cluster that writes to GCS

C.Use the Pub/Sub connector to publish Kafka messages to Pub/Sub, then a Dataflow job to write to GCS

D.Use Kafka Connect with the GCS Sink Connector to write directly to Cloud Storage

AnswerD

Kafka Connect GCS Sink Connector is purpose-built, simple to configure, and runs on the same Dataproc cluster.

Why this answer

Option D is correct because Kafka Connect with the GCS Sink Connector is purpose-built for exactly this use case: it streams Kafka topics directly to Cloud Storage in Avro, Parquet, or JSON format without requiring intermediate processing clusters or services. This approach minimizes operational overhead (no Spark or Dataflow jobs to manage) and is cost-effective because it runs as a lightweight connector within the existing Kafka deployment on Dataproc.

Exam trap

Google Cloud often tests the misconception that streaming data to Cloud Storage requires a full streaming pipeline (Spark, Dataflow) or an intermediary service like Pub/Sub, when in fact Kafka Connect provides a native, lightweight, and cost-effective sink directly to GCS.

How to eliminate wrong answers

Option A is wrong because using Apache Spark streaming on Dataproc to read from Kafka and write to GCS introduces unnecessary compute overhead, latency, and operational complexity (managing Spark jobs, checkpointing, and resource scaling) compared to a direct connector. Option B is wrong because Kafka MirrorMaker is designed for cross-cluster replication, not for writing to GCS; it would require an additional sink to write to GCS, adding complexity and cost without any benefit. Option C is wrong because routing Kafka messages through Pub/Sub adds latency, extra cost (Pub/Sub egress and Dataflow processing), and operational complexity (managing a Pub/Sub topic, subscription, and Dataflow pipeline) when a direct connector to GCS exists.

639

MCQmedium

A gaming company uses Avro schemas for its streaming event data. They anticipate adding new optional fields to events over time. They need to ensure backward compatibility so that existing pipelines continue to work. Which strategy should they adopt?

A.Use Avro with a schema registry that enforces backward-compatible changes

B.Use JSON instead of Avro and ignore unknown fields

C.Use Protocol Buffers with breaking changes

D.Use FlatBuffers for performance

AnswerA

Avro's schema evolution rules allow adding optional fields without breaking existing consumers, and a schema registry enables version management.

Why this answer

Option A is correct because Avro, combined with a schema registry, allows schema evolution with backward compatibility. The registry enforces rules such as adding optional fields with defaults, ensuring that consumers using older schemas can still deserialize new data without breaking. This directly addresses the requirement for existing pipelines to continue working as new optional fields are added.
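The schema resolution behaviour described above can be sketched without an Avro library. The `resolve` function below is a simplified stand-in for Avro's reader/writer schema resolution, and the field names (`player_id`, `score`, `region`) are hypothetical: a field missing from the data is filled from the reader schema's default, and a field unknown to the reader is dropped.

```python
def resolve(record, reader_schema):
    """Project a record onto reader_schema, filling defaults for missing fields."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

schema_v1 = {"fields": [{"name": "player_id"}, {"name": "score"}]}
# v2 adds an optional field with a default, the kind of change a registry allows.
schema_v2 = {"fields": [{"name": "player_id"}, {"name": "score"},
                        {"name": "region", "default": None}]}

old_event = {"player_id": "p1", "score": 10}
new_event = {"player_id": "p2", "score": 7, "region": "eu"}

new_reads_old = resolve(old_event, schema_v2)  # default fills the new field
old_reads_new = resolve(new_event, schema_v1)  # unknown field is ignored
```

Adding the same field without a default would make `resolve(old_event, schema_v2)` raise, which is the kind of change a compatibility-enforcing registry is meant to reject.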

Exam trap

Google Cloud often tests the misconception that any serialization format (like JSON or Protocol Buffers) inherently supports backward compatibility, but the key is the combination of a schema registry with enforced evolution rules, which only Avro explicitly provides in this context.

How to eliminate wrong answers

Option B is wrong because JSON lacks a schema enforcement mechanism; while ignoring unknown fields is possible, JSON does not provide built-in compatibility guarantees or schema evolution rules, making it error-prone in large-scale streaming systems. Option C is wrong because Protocol Buffers can support backward compatibility, but the option specifies 'breaking changes,' which would violate the requirement for backward compatibility. Option D is wrong because FlatBuffers prioritize performance (zero-copy deserialization) but do not inherently enforce backward-compatible schema evolution, and they are less suited for streaming event data with frequent schema changes.

640

Multi-Selecthard

Which THREE steps are essential for implementing a continuous training pipeline with Vertex AI?

Select 3 answers

A.If the new model passes evaluation, deploy it to a production endpoint.

B.Manually approve each new model version before deployment.

C.Deploy the original model once and set it to auto-update.

D.Set up a trigger to start a training pipeline when new training data is available (e.g., via Cloud Storage events).

E.Include a step in the pipeline that evaluates the new model against a validation set.

AnswersA, D, E

Automated deployment upon passing evaluation completes the continuous pipeline.

Why this answer

Option A is correct because a continuous training pipeline aims to automate model updates. After a new model is trained and evaluated, deploying it to a production endpoint (e.g., using Vertex AI Endpoints) is the essential final step to serve predictions from the improved model. This completes the automation loop without manual intervention, assuming the evaluation passes predefined thresholds.

Exam trap

Google Cloud often tests the distinction between essential automation steps and optional manual controls, so candidates mistakenly include manual approval (B) or believe models can auto-update (C) when Vertex AI requires explicit pipeline triggers and deployments.

641

MCQhard

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?

A.Increase the number of workers and enable autoscaling to distribute the load.

B.Reduce the number of workers to minimize coordination overhead.

C.Use a global window with a trigger to reduce state size.

D.Change the windowing to a fixed 5-minute window to reduce computations.

AnswerA

More workers spread the CPU load of the windowed aggregation across more machines.

Why this answer

The 'worker failed to provide a heartbeat' error combined with high CPU usage indicates that workers are overloaded and cannot process data fast enough to maintain their heartbeat to the Dataflow service. Increasing the number of workers and enabling autoscaling distributes the computational load across more machines, reducing per-worker CPU pressure and allowing heartbeats to be sent on time. This directly addresses the root cause of resource exhaustion.
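Part of the CPU load in this scenario comes from the windowing itself: with a 5-minute window sliding every minute, each event belongs to several overlapping windows and is aggregated once per window. The sketch below is a simplified model of sliding-window assignment, not the Beam SDK; timestamps and sizes are in seconds, and the function name is an assumption.

```python
def sliding_window_starts(timestamp, size, period):
    """Return start times of every sliding window [start, start + size) containing timestamp."""
    # Latest window start at or before the timestamp, aligned to the period.
    last_start = timestamp - (timestamp % period)
    # Step backwards one period at a time while the window still covers the timestamp.
    return list(range(last_start, timestamp - size, -period))

# An event at t=130s with a 300s window and a 60s period.
starts = sliding_window_starts(timestamp=130, size=300, period=60)
windows_per_event = len(starts)
```

Each event lands in size / period = 5 windows here, so the aggregation work per event is multiplied accordingly, which is the load that additional workers have to absorb.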

Exam trap

Google Cloud often tests the misconception that reducing workers or changing window types is a universal fix for resource exhaustion, when in fact the immediate solution for heartbeat failures due to high CPU is to scale out the worker pool.

How to eliminate wrong answers

Option B is wrong because reducing the number of workers would concentrate the same workload on fewer machines, increasing per-worker CPU usage and worsening the heartbeat failure. Option C is wrong because a global window with a trigger does not reduce state size; it accumulates all events into a single unbounded window, potentially increasing memory pressure and CPU overhead. Option D is wrong because a fixed 5-minute window changes the semantics of the aggregation (non-overlapping windows that emit every 5 minutes instead of every minute), so it alters the pipeline's results rather than resolving the workers' resource exhaustion.

642

MCQmedium

A team notices that the latency for online predictions from a Vertex AI endpoint has increased significantly over the past hour. The model is a large TensorFlow model deployed with automatic scaling (minReplicaCount=2, maxReplicaCount=10). The CPU utilization of the deployed instances is consistently above 85%. What is the most likely cause of the increased latency?

A.The network latency between the client and the endpoint has increased due to regional issues.

B.The model is deployed with GPU acceleration, but the instances are using incorrect CUDA drivers.

C.The model is too large for the instance memory, causing disk swapping.

D.The model is CPU-bound, and the current replicas are saturated, causing queuing.

AnswerD

High CPU utilization indicates the replicas are at capacity, leading to request queuing and higher latency.

Why this answer

The correct answer is D because the consistently high CPU utilization (above 85%) indicates that the existing replicas are saturated, unable to process incoming requests quickly enough. When all replicas are busy, new requests are queued, which directly increases latency. Automatic scaling can add more replicas up to maxReplicaCount=10, but if the scaling is slow or the traffic spike is sudden, queuing occurs first, causing the observed latency increase.

Exam trap

Google Cloud often tests the distinction between symptoms of CPU saturation (queuing/latency) versus memory or GPU issues; the trap here is that candidates may incorrectly attribute latency to network or hardware driver problems when the clear indicator is sustained high CPU utilization on existing instances.

How to eliminate wrong answers

Option A is wrong because network latency between client and endpoint is not indicated by CPU utilization of deployed instances; regional network issues would affect all requests uniformly, not correlate with high CPU. Option B is wrong because incorrect CUDA drivers would cause GPU-related errors or failures, not consistently high CPU utilization; the model would likely fail to run or produce errors, not just increase latency. Option C is wrong because disk swapping due to insufficient memory would manifest as high disk I/O and memory pressure, not primarily high CPU utilization; the symptom described is CPU-bound, not memory-bound.

643

MCQeasy

A company wants to implement a data lake on Google Cloud to store raw sensor data (unstructured binary files) and allow data scientists to run SQL queries on processed data. They expect to store terabytes of data and have different access patterns. Which combination of GCP services best meets these requirements?

A.Bigtable for raw data and Cloud Spanner for processed data

B.Cloud Storage for both raw and processed data

C.Cloud SQL for raw data and Cloud Dataproc for processing

D.Cloud Storage for raw data and BigQuery for processed data

AnswerD

Cloud Storage stores any file type cost-effectively, and BigQuery provides fast SQL queries on structured data.

Why this answer

Cloud Storage is the ideal service for storing raw, unstructured binary sensor data at petabyte scale, offering low-cost, durable object storage with multiple access tiers. BigQuery is a serverless, highly scalable data warehouse that allows data scientists to run SQL queries on processed data, with features like columnar storage and automatic optimization for analytical workloads. This combination directly addresses the need for raw storage and SQL-based analytics on processed data.

Exam trap

Google Cloud often tests the misconception that Cloud Storage can serve as a queryable database for SQL, when in fact it requires an external query engine like BigQuery or Dataproc for SQL access.

How to eliminate wrong answers

Option A is wrong because Bigtable is a NoSQL wide-column database optimized for real-time, low-latency access, not for storing raw unstructured binary files, and Cloud Spanner is a globally distributed relational database for transactional workloads, not for analytical SQL queries on processed data. Option B is wrong because while Cloud Storage can store both raw and processed data, it does not natively support SQL queries; data scientists would need an additional service like BigQuery or Dataproc to run SQL. Option C is wrong because Cloud SQL is a relational database for structured data, not designed for raw unstructured binary files, and Cloud Dataproc is a managed Spark/Hadoop service for processing, not a SQL query engine for processed data.

Full explanation →

644

MCQeasy

A data engineer wants to automatically move objects from Standard storage class to Nearline after 30 days, and then to Archive after 365 days. Which Cloud Storage feature should they configure?

A.Object Versioning

B.Retention Policy

C.Bucket Lock

D.Object Lifecycle rule with SetStorageClass actions

AnswerD

Lifecycle rules can change storage class based on object age.

Why this answer

Option D is correct because Object Lifecycle rules in Google Cloud Storage allow you to automatically transition objects between storage classes (e.g., from Standard to Nearline after 30 days, then to Archive after 365 days) using the SetStorageClass action. This feature is specifically designed for automated lifecycle management, including deletion and class transitions, based on object age or other conditions.
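As a rough illustration of the two transitions described above, the sketch below builds a lifecycle configuration in the JSON shape that Cloud Storage accepts. The helper name `build_lifecycle_config` is hypothetical, and the `matchesStorageClass` conditions are an optional refinement assumed here to keep each rule scoped to the class it should move from.

```python
import json

def build_lifecycle_config():
    """Return a lifecycle configuration with two SetStorageClass rules."""
    return {
        "rule": [
            {
                # Standard -> Nearline once the object is 30 days old.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]},
            },
            {
                # Nearline -> Archive once the object is 365 days old.
                "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
                "condition": {"age": 365, "matchesStorageClass": ["NEARLINE"]},
            },
        ]
    }

if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and applied to a bucket.
    print(json.dumps(build_lifecycle_config(), indent=2))
```

The printed JSON could be saved to a file and applied to a bucket, for example with `gsutil lifecycle set`.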

Exam trap

Google Cloud often tests the distinction between lifecycle management (which changes storage classes) and retention/versioning features (which protect data but do not automate class transitions), leading candidates to confuse Object Versioning or Retention Policy with lifecycle rules.

How to eliminate wrong answers

Option A is wrong because Object Versioning is a feature that preserves non-current object versions to protect against accidental deletion or overwriting; it does not automate storage class transitions. Option B is wrong because Retention Policy is used to enforce a minimum retention period on objects, preventing deletion or modification, but it cannot change storage classes over time. Option C is wrong because Bucket Lock is a mechanism to permanently lock a retention policy, making it immutable; it does not provide any lifecycle-based storage class transitions.

Full explanation →

645

MCQmedium

A company uses Cloud Composer to orchestrate data pipelines. One DAG fails intermittently with the error: 'Task received SIGTERM signal.' The task runs a long-running Dataproc job. What is the most likely cause?

A.The Dataproc cluster was preempted by Google Cloud.

B.The Dataproc job failed due to an error in the code.

C.The Cloud Composer environment ran out of disk space.

D.The Airflow task timed out due to the default execution timeout.

AnswerD

SIGTERM indicates the task was killed, possibly due to timeout.

Why this answer

The default Airflow task execution timeout is 28 days in Cloud Composer, but individual tasks can have a shorter `execution_timeout` set in the DAG definition. When a long-running Dataproc job exceeds this timeout, Airflow sends a SIGTERM signal to the task to kill it, resulting in the observed error. This is the most likely cause because the error message directly indicates a forced termination by the Airflow scheduler, not an infrastructure or code failure.

Exam trap

The trap here is that candidates often attribute SIGTERM errors to infrastructure issues like cluster preemption or disk space, when in fact the error is a direct result of Airflow's task timeout mechanism, which is a common misconfiguration in long-running pipeline tasks.

How to eliminate wrong answers

Option A is wrong because Dataproc cluster preemption would cause a different error (e.g., 'Cluster not found' or 'Job failed due to node loss'), not a SIGTERM signal from Airflow. Option B is wrong because a code error in the Dataproc job would produce a job failure status and a different error message (e.g., 'Job failed with exit code 1'), not a SIGTERM from the orchestrator. Option C is wrong because running out of disk space in the Cloud Composer environment would cause worker crashes or DAG parsing errors, not a targeted SIGTERM to a specific task.

Full explanation →

646

MCQhard

Your Dataflow pipeline reads from Pub/Sub, performs transformations, and writes to BigQuery. You notice that the pipeline's autoscaling is not keeping up with sudden spikes in traffic, causing increased lag. The pipeline uses Classic Templates. Which change would most effectively improve autoscaling responsiveness?

A.Enable Dataflow Streaming Engine on the pipeline.

B.Switch to Dataflow Prime with Vertical Autoscaling enabled.

C.Increase the initial number of workers to handle the spike.

D.Use Flex Templates instead of Classic Templates.

AnswerA

Streaming Engine improves autoscaling by decoupling compute from state, allowing workers to scale more quickly.

Why this answer

Enabling Dataflow Streaming Engine reduces the overhead of checkpointing and state management by offloading them to the service side, which allows the pipeline to scale more quickly in response to sudden traffic spikes. This directly addresses the autoscaling lag because Streaming Engine decouples compute from state, enabling faster worker adjustments without the bottleneck of persistent disk-based shuffle.

Exam trap

Google Cloud often tests the misconception that Flex Templates improve runtime performance or autoscaling, when in fact they only affect deployment flexibility, not the underlying execution engine's scaling behavior.

How to eliminate wrong answers

Option B is wrong because Dataflow Prime with Vertical Autoscaling adjusts the CPU/memory of existing workers, not the number of workers, so it does not improve horizontal autoscaling responsiveness to sudden traffic spikes. Option C is wrong because increasing the initial number of workers only sets a starting point; it does not improve the pipeline's ability to scale up dynamically during a spike, and it may waste resources during low traffic. Option D is wrong because Flex Templates only affect how the pipeline is deployed and parameterized, not the runtime autoscaling behavior; Classic Templates and Flex Templates share the same autoscaling mechanisms.

Full explanation →

647

MCQhard

You are designing a disaster recovery strategy for a critical streaming data processing pipeline. The pipeline reads from Cloud Pub/Sub, processes with Dataflow streaming, and writes to BigQuery. The required RPO is less than 1 minute, and RTO is less than 5 minutes. Which architecture should you implement?

A.Use cross-region replication with two separate Dataflow pipelines reading from a Pub/Sub cross-region subscription and writing to a BigQuery cross-region dataset

B.Run the pipeline using Dataflow batch mode with a 1-minute trigger and store intermediate results in Cloud Storage

C.Deploy resources in a single region with regular backups to Cloud Storage

D.Use a single Dataflow pipeline with a standby cluster in another region, but failover is manual

AnswerA

Cross-region replication ensures data is available in another region with minimal latency, meeting RPO and RTO.

Why this answer

Option A is correct because cross-region replication for Pub/Sub ensures messages are available in a secondary region with sub-second latency, and a separate Dataflow pipeline reading from a cross-region subscription provides active-active processing. BigQuery cross-region dataset replication (using the 'cross-region' dataset location, e.g., EU or US multi-region, or a specific dual-region configuration) ensures data durability and availability within the RPO of <1 minute. This architecture meets both RPO and RTO by eliminating single points of failure and enabling automatic failover without manual intervention.

Exam trap

The trap here is that candidates often assume a single pipeline with a standby cluster is sufficient, but they overlook that manual failover cannot meet the strict RTO of <5 minutes, and that cross-region replication must be active-active (not active-passive) to achieve sub-minute RPO.

How to eliminate wrong answers

Option B is wrong because Dataflow batch mode with a 1-minute trigger cannot achieve sub-minute RPO; batch processing introduces inherent latency and does not provide continuous streaming, so the RPO of <1 minute is not guaranteed. Option C is wrong because deploying in a single region with regular backups to Cloud Storage fails to meet the RTO of <5 minutes; restoring from backups takes significantly longer than 5 minutes, and there is no active standby to fail over to. Option D is wrong because a manual failover process cannot achieve the RTO of <5 minutes; manual intervention introduces unpredictable delays, and a standby cluster without automatic failover violates the RTO requirement.

Full explanation →

648

MCQmedium

A company runs a critical batch pipeline using Cloud Dataflow. The pipeline processes financial transactions and runs every hour. Recently, some runs have failed due to transient errors (e.g., network timeouts). The engineer wants to automatically retry failed runs without manual intervention. The pipeline is launched from a Cloud Composer DAG using DataflowPythonOperator. What is the BEST way to handle retries?

A.Add a DataflowJobStatusSensor in the DAG that waits for job completion and retries if failed.

B.Set the 'retries' parameter in the DAG's default_args to a positive integer.

C.Configure the Dataflow pipeline to automatically retry on failure using the --numberOfWorkerHarnessThreads option.

D.Use a Cloud Function triggered by Cloud Scheduler to re-launch the pipeline if the Dataflow job fails.

AnswerB

This allows Airflow to retry the entire task (which launches the Dataflow job) if it fails due to transient errors.

Why this answer

Option B is correct because Cloud Composer (Apache Airflow) natively supports task-level retries via the 'retries' parameter in default_args. When a DataflowPythonOperator fails due to a transient error, Airflow automatically re-executes the task up to the specified number of retries, without requiring custom sensors or external triggers. This is the simplest and most reliable mechanism for handling transient failures in a DAG-driven pipeline.
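A minimal sketch of what such a `default_args` dictionary might look like is shown below. The specific values (3 retries, a 5-minute delay, a 2-hour timeout) are assumptions chosen for illustration, and `max_attempts` is a hypothetical helper, not part of Airflow. In a real DAG the dictionary would be passed to the DAG constructor.

```python
from datetime import timedelta

# Hypothetical values; tune them to the pipeline's runtime and failure pattern.
default_args = {
    "retries": 3,                             # retry a failed task up to 3 more times
    "retry_delay": timedelta(minutes=5),      # wait between attempts
    "execution_timeout": timedelta(hours=2),  # limit on how long one attempt may run
}

def max_attempts(args):
    """Total attempts Airflow makes: the first try plus the configured retries."""
    return 1 + args.get("retries", 0)
```

With these settings a task that keeps hitting transient errors is attempted four times in total before the DAG run is marked failed.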

Exam trap

The trap here is that candidates confuse Dataflow's internal retry behavior (Dataflow retries failed work items inside a running job) with Airflow task-level retries, or assume that a sensor or external trigger is required to detect and retry failures. Airflow's native `retries` parameter is the simplest and most appropriate way to relaunch a DAG-managed pipeline after a transient error.

How to eliminate wrong answers

Option A is wrong because a DataflowJobStatusSensor only monitors job status and does not automatically retry the pipeline; it would require additional branching logic to relaunch the job, adding unnecessary complexity. Option C is wrong because --numberOfWorkerHarnessThreads controls the number of threads per Dataflow worker, not retry behavior; Dataflow retries failed work items within a running job on its own, but this flag does not relaunch a failed job. Option D is wrong because using a Cloud Function and Cloud Scheduler introduces an external dependency and latency, whereas Airflow's built-in retry mechanism is more direct and integrated with the DAG's execution context.

Full explanation →

649

Multi-Selectmedium

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

Select 2 answers

A.Use a pull subscription with a 10-second acknowledgment deadline.

B.Enable Dataflow Streaming Engine.

C.Enable exactly-once processing sinks (e.g., BigQuery with guaranteed row-level insertion).

D.Disable autoscaling to prevent worker churn.

E.Use micro-batch processing with a small batch size.

AnswersB, C

Streaming Engine offloads state management to the backend, improving reliability.

Why this answer

Option B is correct because enabling Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing the impact of worker scaling and preemption. This improves reliability by providing consistent performance and fault tolerance for streaming pipelines, especially those with high throughput or stateful processing.

Exam trap

The trap here is that candidates often confuse reliability with throughput or latency, and may incorrectly choose micro-batching or disabling autoscaling as reliability improvements, when in fact Dataflow's reliability comes from its managed backend services like Streaming Engine.

Full explanation →

650

MCQhard

You are designing a data pipeline that must process sensitive customer data with strict access controls. The data is ingested via Cloud Pub/Sub, processed by Cloud Dataflow, and stored in BigQuery. The security team requires that data is encrypted at rest and in transit, and that access is limited to specific service accounts. Which implementation strategy meets all requirements?

A.Use Cloud KMS for BigQuery only; leave Dataflow with default encryption

B.Use VPC Service Controls and Cloud Armor for network security

C.Use default Google-managed encryption keys and IAM roles only

D.Use CMEK for Pub/Sub, Dataflow, and BigQuery, and VPC-SC with per-service service accounts

AnswerD

CMEK ensures encryption control; VPC-SC and service accounts enforce access.

Why this answer

Option D is correct because it combines Customer-Managed Encryption Keys (CMEK) for all three services (Pub/Sub, Dataflow, BigQuery), so data at rest is encrypted with keys the customer controls, with VPC Service Controls (VPC-SC) and per-service service accounts to enforce a service perimeter and least-privilege access. Encryption in transit does not depend on CMEK: Google Cloud encrypts traffic to and between these services by default (TLS). Together these satisfy the encryption requirements and restrict access to the specified service accounts.

Exam trap

Google Cloud often tests the misconception that network security tools like VPC Service Controls or Cloud Armor alone satisfy encryption requirements, or that default encryption is sufficient when customer-managed keys are explicitly required.

How to eliminate wrong answers

Option A is wrong because it only applies Cloud KMS to BigQuery, leaving Dataflow with default Google-managed encryption, which does not meet the requirement for customer-controlled encryption at rest across all services. Option B is wrong because VPC Service Controls and Cloud Armor provide network security and perimeter controls but do not address data encryption at rest or in transit, which is a separate requirement. Option C is wrong because default Google-managed encryption keys and IAM roles alone do not provide customer-controlled encryption keys (CMEK) or the granular access controls enforced by VPC-SC with per-service service accounts.

Full explanation →

651

MCQeasy

An organization uses BigQuery on-demand pricing. To control costs, they want to estimate the bytes processed by a query before running it. Which command or method should they use?

A.Use the bq query --dry_run command

B.Use bq ls to list table sizes

C.Use BigQuery reservations to get cost estimate

D.Use INFORMATION_SCHEMA.JOBS_BY_PROJECT to view past costs

AnswerA

Dry run provides byte estimate without running the query.

Why this answer

BigQuery dry run estimates bytes processed without executing the query. It can be done via CLI with --dry_run flag or in the console.

Full explanation →

652

MCQmedium

You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?

A.Partition by customer_id and cluster by transaction_date

B.Partition by ingestion_time and cluster by customer_id

C.Cluster by transaction_date and customer_id without partitioning

D.Partition by transaction_date and cluster by customer_id

AnswerD

This design minimizes scanned bytes by pruning partitions on date and cluster blocks on customer_id.

Why this answer

Partitioning by transaction_date allows queries to scan only relevant partitions. Clustering by customer_id sorts data within each partition by customer_id, further reducing the amount of data scanned for queries filtering on customer_id. This combination is best for time-range queries with frequent customer_id filters.
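The effect of partition pruning can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the function names are hypothetical, rows are plain dictionaries, and the scan count reflects partition pruning only (real clustering would additionally let BigQuery skip blocks inside each partition).

```python
from collections import defaultdict
from datetime import date, timedelta

def build_partitions(rows):
    """Group rows into daily partitions, each sorted (clustered) by customer_id."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["transaction_date"]].append(row)
    for day in partitions:
        partitions[day].sort(key=lambda r: r["customer_id"])
    return partitions

def query_recent(partitions, customer_id, today, days=30):
    """Return matching rows and the number of rows read after partition pruning."""
    cutoff = today - timedelta(days=days)
    scanned, matches = 0, []
    for day, rows in partitions.items():
        if day < cutoff:
            continue  # partition pruned: its rows are never read
        scanned += len(rows)
        matches.extend(r for r in rows if r["customer_id"] == customer_id)
    return matches, scanned
```

With 100 days of data, a 30-day filter reads only the most recent partitions, while an unpartitioned table would have to read every row.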

Full explanation →

653

MCQhard

A company uses Vertex AI Feature Store for serving features. They have a high-throughput online serving requirement. Which configuration should they use?

A.Cloud Storage with high-memory instances

B.Bigtable as serving source

C.Firestore

D.Vertex AI Feature Store with online serving enabled

AnswerD

Vertex AI Feature Store is purpose-built for high-throughput online feature serving.

Why this answer

Vertex AI Feature Store with online serving enabled is the correct choice because it is specifically designed for low-latency, high-throughput retrieval of feature values for online predictions. It uses a managed Bigtable backend optimized for real-time serving, ensuring consistent performance under high request loads without requiring manual infrastructure management.

Exam trap

Google Cloud often tests the misconception that any low-latency database (like Bigtable or Firestore) can directly replace Vertex AI Feature Store, ignoring the managed orchestration, feature registry, and point-in-time lookup capabilities that are essential for consistent online serving in ML workflows.

How to eliminate wrong answers

Option A is wrong because Cloud Storage is a blob storage service with high latency and no indexing for real-time feature lookups, making it unsuitable for high-throughput online serving. Option B is wrong because Bigtable is a NoSQL database that can serve features, but it requires manual configuration, scaling, and integration with Vertex AI Feature Store, whereas the Feature Store provides a managed, optimized serving layer with built-in consistency and monitoring. Option C is wrong because Firestore is a document database designed for mobile and web apps with moderate throughput, not for the sub-millisecond latency and high concurrency required by ML feature serving at scale.

Full explanation →

654

MCQmedium

You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?

A.Use Pub/Sub with a pull subscription and set the message retention duration to 7 days

B.Use Cloud Pub/Sub Lite with a smaller retention period

C.Use Pub/Sub with a push subscription and increase the acknowledgment deadline

D.Use Pub/Sub with exactly-once delivery and Dataflow with at-least-once processing

AnswerA

Pull subscriptions allow Dataflow to control the pace. 7-day retention lets Dataflow catch up after spikes.

Why this answer

Pub/Sub stores messages for up to 7 days, allowing Dataflow to catch up. Dataflow uses checkpointing to track progress. This combination ensures no data loss.

Full explanation →

655

MCQeasy

You have a BigQuery table 'orders' with columns order_id, customer_id, order_amount, and order_date. You need to rank customers by total spend per month, assigning the rank 1 to the highest spender. Which SQL function should you use in a window clause?

A.NTILE()

B.DENSE_RANK()

C.ROW_NUMBER()

D.RANK()

AnswerD

RANK() assigns the same rank to ties and leaves gaps; appropriate for ranking top spenders.

Why this answer

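RANK() gives tied customers the same rank and leaves a gap after the tie (1, 1, 3), which suits a top-spender ranking. DENSE_RANK() also gives ties the same rank but leaves no gaps (1, 1, 2). ROW_NUMBER() assigns a unique number to every row, so tied customers would get different ranks.

The stem does not say how ties should be handled, so RANK(), the conventional choice for ranking with ties, is the expected answer.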
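The difference between the three numbering functions is easiest to see on a small example with a tie. The sketch below reimplements their semantics in plain Python for one month of totals; `rank_functions` is a hypothetical helper, and the order in which tied rows receive their ROW_NUMBER is arbitrary in SQL (here it follows input order).

```python
def rank_functions(totals):
    """Return {customer: (row_number, rank, dense_rank)} ordered by spend descending."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    result = {}
    rank = dense = 0
    previous = None
    for position, (customer, spend) in enumerate(ordered, start=1):
        if spend != previous:
            rank = position  # RANK jumps to the row position after a tie
            dense += 1       # DENSE_RANK advances by exactly one
            previous = spend
        result[customer] = (position, rank, dense)
    return result

if __name__ == "__main__":
    # Two customers tie for the highest spend.
    print(rank_functions({"a": 500, "b": 500, "c": 300}))
```

For the sample data, customers a and b share rank 1 under both RANK and DENSE_RANK, while customer c is ranked 3 by RANK and 2 by DENSE_RANK.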

Full explanation →

656

MCQmedium

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A.Use Cloud Dataproc Serverless for all Spark jobs.

B.Migrate jobs to Cloud Dataflow.

C.Run Spark on Compute Engine instances with startup scripts.

D.Use Dataproc clusters with auto-scaling and preemptible VMs.

AnswerD

Reduces cost and operational overhead.

Why this answer

Option D is correct because Dataproc clusters with auto-scaling and preemptible VMs directly address the need to reduce operational overhead and minimize costs for on-premises Spark migrations. Auto-scaling dynamically adjusts cluster size based on workload, while preemptible VMs (which cost 60-80% less than standard VMs) handle fault-tolerant tasks, making this the most cost-effective and operationally efficient architecture for Spark on Dataproc.

Exam trap

The trap here is that candidates often choose Cloud Dataproc Serverless (Option A) thinking it eliminates all operational overhead, but they overlook that it lacks the cost-saving benefits of preemptible VMs and may not support all Spark features, making auto-scaling clusters with preemptible VMs the more appropriate choice for minimizing costs in a migration scenario.

How to eliminate wrong answers

Option A is wrong because Cloud Dataproc Serverless is designed for batch Spark workloads without cluster management, but it lacks the flexibility and cost optimization of preemptible VMs for long-running or complex jobs, and may not support all Spark configurations or libraries used in on-premises environments. Option B is wrong because Cloud Dataflow is a different processing engine (Apache Beam) that requires rewriting Spark jobs into Beam pipelines, adding migration complexity and operational overhead, not reducing it. Option C is wrong because running Spark on Compute Engine instances with startup scripts requires manual cluster management, scaling, and fault tolerance, increasing operational overhead and negating the benefits of a managed service like Dataproc.

Full explanation →

657

MCQmedium

You have a BigQuery table that is used by multiple teams. To save costs, you want to provide a consistent view of the data as of a specific point in time without creating full copies. Which BigQuery feature should you use?

A.Authorized views

B.Materialized views

C.Table snapshot

D.Table clone

AnswerC

Table snapshots are read-only, point-in-time copies that cost only storage and are ideal for sharing consistent views.

Why this answer

BigQuery table snapshots provide a point-in-time copy of a table that incurs only storage costs for the snapshot (no additional slot usage). They are read-only and can be used to share data without duplicating the base table.

Full explanation →

658

MCQhard

What is the root cause of this error and the correct solution?

A.The BigQuery table requires authorized view access.

B.The user running the job needs the BigQuery Admin role.

C.The Dataflow service account needs the BigQuery User role.

D.The Dataflow worker service account needs the BigQuery Data Viewer role.

AnswerD

BigQuery Data Viewer includes the required getData permission.

Why this answer

Option D is correct because Dataflow workers execute under a specific service account (the Compute Engine default service account or a custom one), and that service account must have the BigQuery Data Viewer role to read data from BigQuery tables. Without this permission, the workers cannot access the source data, causing the job to fail with access errors. The BigQuery User role is insufficient for reading table data, and the BigQuery Admin role is overly permissive and not required for this task.

Exam trap

Google Cloud often tests the distinction between the Dataflow service agent (which creates and manages job resources) and the Dataflow worker service account (which performs data operations), leading candidates to incorrectly assign permissions to the service agent instead of the worker account.

How to eliminate wrong answers

Option A is wrong because authorized view access is a mechanism to share query results without granting direct table access, but the error here is about the Dataflow service account lacking read permissions on the BigQuery table, not about view authorization. Option B is wrong because the BigQuery Admin role grants full control over BigQuery resources, which is excessive and not necessary; the user running the job does not need admin rights—only the worker service account needs read access. Option C is wrong because the BigQuery User role allows running queries and creating datasets but does not grant read access to table data; the Dataflow service account (which orchestrates the job) does not directly read data—the worker service account does.

Full explanation →

659

MCQmedium

A company uses Pub/Sub to ingest clickstream data. Each message contains a JSON payload with a nested array of user actions. The data must be written to BigQuery, with each action in the array becoming a separate row. Which BigQuery feature or approach should be used to achieve this transformation?

A.Use a Dataflow pipeline with a ParDo that explodes the array

B.Load the JSON as-is into BigQuery and use UNNEST in a query

C.Use a BigQuery scripting loop to iterate over the array

D.Preprocess the data with a Dataflow pipeline and write to BigQuery

AnswerB

UNNEST can be used in a query to flatten the array into rows, which is a cost-effective approach.

Why this answer

BigQuery's UNNEST function is designed to flatten arrays into separate rows, which is exactly what is needed here.
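To make the flattening concrete, the sketch below models in plain Python what a cross join with UNNEST produces: one output row per array element, with the parent fields repeated. The field names `user_id` and `actions` and the helper `explode_actions` are assumptions for illustration; the source does not give the payload schema.

```python
import json

def explode_actions(message_json):
    """Turn one message with an `actions` array into one row per action."""
    message = json.loads(message_json)
    return [
        # Repeat the parent field on every row, then add the action's own fields.
        {"user_id": message["user_id"], **action}
        for action in message.get("actions", [])
    ]
```

As with a cross join against an empty array, a message with no actions yields no rows in this sketch.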

Full explanation →

660

MCQhard

A company needs to continuously synchronize customer data changes from an on-premises Oracle database to BigQuery for near-real-time analytics. The Oracle database has Change Data Capture (CDC) enabled. Which Google Cloud service should be used to stream these changes with minimal latency and schema evolution support?

A.Deploy a Dataflow pipeline with a JDBC source and Pub/Sub

B.Use Cloud SQL with a read replica and enable binary logging

C.Use Transfer Appliance to copy Oracle data periodically

D.Use Datastream to stream CDC changes from Oracle to BigQuery

AnswerD

Datastream directly supports Oracle CDC and streams to BigQuery with schema evolution.

Why this answer

Datastream is designed to stream CDC from Oracle (and MySQL/PostgreSQL) directly to BigQuery or GCS, supporting schema evolution and low-latency replication.

Full explanation →

661

Multi-Selecteasy

You need to deploy a reusable Dataflow pipeline that can be executed with different parameters from Cloud Composer. Which TWO components should you use? (Choose 2)

Select 2 answers

A.Direct runner

B.Dataflow Flex Template

C.Cloud Composer with DataflowStartFlexTemplateOperator

D.Dataflow Classic Template

E.Cloud Scheduler

AnswersB, C

Flex Templates are reusable and parameterizable.

Why this answer

Dataflow Flex Templates let you package a pipeline as a reusable template that accepts runtime parameters. Cloud Composer can launch these templates with the DataflowStartFlexTemplateOperator. The Direct runner executes a pipeline locally for testing and is not a deployment mechanism.

Full explanation →

662

MCQeasy

A data engineer needs to ingest on-premises Oracle CDC data into BigQuery in near real-time with minimal operational overhead. Which service should they use?

A.Pub/Sub + Dataflow

B.Storage Transfer Service

C.Transfer Appliance

D.Datastream

AnswerD

Datastream is purpose-built for serverless CDC from databases to Google Cloud destinations like BigQuery and GCS.

Why this answer

Datastream is purpose-built for streaming change data capture (CDC) from Oracle and other sources into BigQuery with near-real-time latency and minimal operational overhead. It handles schema propagation, checkpointing, and automatic retries, eliminating the need to manage custom ingestion pipelines.

Exam trap

Google Cloud often tests the distinction between batch migration tools (Storage Transfer Service, Transfer Appliance) and streaming CDC services (Datastream), leading candidates to choose a batch option when the question explicitly requires near-real-time ingestion.

How to eliminate wrong answers

Option A is wrong because Pub/Sub + Dataflow requires building and maintaining a custom pipeline to handle Oracle CDC, including log mining and transformation logic, which increases operational overhead compared to a managed service. Option B is wrong because Storage Transfer Service is designed for bulk batch transfers of files from cloud or on-premises storage to Google Cloud, not for streaming CDC from a live database. Option C is wrong because Transfer Appliance is a physical device for offline, high-volume data migration, which cannot provide near-real-time streaming and introduces significant latency.

Full explanation →

663

MCQmedium

A financial analytics team uses Looker to explore BigQuery data. They need to allow business users to filter by a custom date range that is not tied to an existing dimension. The date range must be user-input at query time. What is the best approach in Looker?

A.Create an explore with a custom filter field in the Looker UI

B.Use a filter parameter directly on the date dimension

C.Add a dimension with a yesno filter that toggles the date range

D.Create a parameter in LookML using Liquid templating

AnswerD

Parameters allow user input at query time, rendered as filter controls, and can be used in conditions.

Why this answer

Looker uses Liquid templating in LookML to create parameters that render as filter controls at runtime. Users can input values that are then injected into the SQL. Creating a dimension with a yesno filter requires predefined values.

The filter parameter on a dimension only allows selecting from existing values, not arbitrary input.

Full explanation →

664

MCQmedium

An organization needs to trigger a Cloud Run service whenever a new file is uploaded to a specific Cloud Storage bucket. Which service should they use to set up this event-driven architecture?

A.Eventarc with a trigger for Cloud Storage events

B.Pub/Sub notifications on the bucket with a push subscription to Cloud Run

C.Cloud Scheduler calling Cloud Run on a schedule

D.Cloud Functions with a GCS trigger

AnswerA

Why this answer

Eventarc can capture Cloud Storage events (e.g., object finalization, which fires when a new object is written) and route them to Cloud Run, Cloud Functions, or Workflows. It delivers events in the CloudEvents format.

Full explanation →

665

Multi-Selectmedium

A company is building a real-time anomaly detection pipeline using Dataflow. Events are ingested from Pub/Sub, and the pipeline must compute a sliding window average every minute over a 1-hour window. Which TWO configurations are required for this pipeline? (Choose 2)

Select 2 answers

A.Set the pipeline to use event time for watermarking.

B.Use a Sliding window of 1 hour with a 1-minute slide.

C.Use a Fixed window of 1 minute.

D.Use stateful processing with a custom timer.

E.Set the pipeline to use processing time for watermarking.

AnswersA, B

Event time ensures windows based on actual event occurrence time, necessary for correct sliding window semantics.

Why this answer

A sliding window of 1-hour length with a 1-minute slide period fits the requirement (every minute, compute over last hour). Fixed window of 1 minute would compute only per-minute, not sliding. Using stateful processing with timers is an alternative but not standard for sliding windows.

Dataflow's default watermark is based on event time; processing time would cause incorrect results. The window type and period are the key.
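The window assignment described here can be sketched in plain Python. The function below is a simplified model, not the Beam implementation: it assumes timestamps in whole seconds and a zero offset, and returns every window that contains a given event time.

```python
def sliding_windows(timestamp, size=3600, period=60):
    """Return the [start, end) windows, in seconds, that contain the event timestamp."""
    # Latest window start at or before the event, aligned to the slide period.
    last_start = timestamp - (timestamp % period)
    windows = []
    start = last_start
    # Walk back one period at a time while the window still covers the event.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return windows
```

With a 1-hour window and a 1-minute slide, each event falls into 60 overlapping windows, which is why the per-minute output reflects the trailing hour of data.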

Full explanation →

666

Multi-Selectmedium

A data engineer needs to build a Dataflow pipeline that reads JSON messages from Pub/Sub, transforms them (including filtering, grouping, and enrichment), and writes the results to BigQuery. The pipeline must handle schema evolution in the input messages and minimize data loss. Which THREE settings or features should the engineer use? (Choose THREE.)

Select 3 answers

A.Use side inputs to enrich the data with reference data from BigQuery

B.Set the `withAllowedLateness` to 0 for windowing to minimize latency

C.Set up a dead letter queue (DLQ) for messages that fail to parse or validate

D.Enable autoscaling to handle spikes in message volume

E.Enable Streaming Engine to reduce checkpoint size

AnswersA, C, D

Side inputs allow joining with slowly changing reference data.

Why this answer

To handle schema evolution, a dead letter queue (option C) is essential: it captures messages that do not conform to the current schema so they are not lost. Side inputs (option A) are a good practice for enriching records with reference data. Autoscaling (option D) lets the pipeline absorb varying throughput.

Option B is incorrect: setting the allowed lateness to 0 drops late-arriving data, which increases data loss rather than minimizing it. Option E is incorrect: Streaming Engine is a separate feature that moves streaming state and shuffle out of the worker VMs, but it is not directly related to schema evolution or to minimizing data loss.
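The dead letter queue idea can be sketched in plain Python as a routing step that never throws a message away. This is a minimal illustration under stated assumptions, not Dataflow code: the field names in `REQUIRED_FIELDS` and the function `route` are hypothetical, and in a real Beam pipeline the two lists would typically be two tagged outputs of one transform, with the dead letters written to a separate sink.

```python
import json

REQUIRED_FIELDS = {"device_id", "ts"}  # hypothetical schema, for illustration only


def route(raw_messages):
    """Split raw JSON strings into (valid_records, dead_letters)."""
    valid, dead_letters = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"missing fields: {sorted(missing)}")
            valid.append(record)  # unknown extra fields are tolerated
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Keep the original payload and the reason so it can be replayed later.
            dead_letters.append({"payload": raw, "error": str(err)})
    return valid, dead_letters


good, bad = route(['{"device_id": "a1", "ts": 1, "fw": "2.0"}', '{"ts": 2}', "not json"])
print(len(good), len(bad))  # prints "1 2"
```

Tolerating extra fields while rejecting only records that lack required ones is one way to let producers add fields without breaking the pipeline. Keeping the raw payload in the dead letter record means nothing is lost when the schema is later updated.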

667

MCQhard

Refer to the exhibit. A BigQuery dataset is shared with the group 'analysts@example.com' using the IAM policy shown. A user who is a member of this group reports that they cannot run queries on the dataset, though they can see the tables. What is the most likely reason?

A.The group needs the 'roles/bigquery.jobUser' role at the project level.

B.The user is using an incorrect client library version.

C.The user's account is not activated in the group membership.

D.The dataset has an organization policy that denies query access.

AnswerA

DataViewer provides read access but not job submission; jobUser must be granted at the project level to run queries.

Why this answer

The IAM policy grants the 'roles/bigquery.dataViewer' role at the dataset level, which allows the user to see tables but not run queries. To run queries, the user also needs the 'roles/bigquery.jobUser' role at the project level, because BigQuery query jobs are project-scoped resources. Without this role, the user lacks permission to create query jobs, even though they can view dataset metadata.
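The split between dataset-level and project-level access can be modelled with a short Python sketch. It is a deliberately simplified model, not a call to any Google Cloud API: `ROLE_PERMISSIONS` lists only a subset of the permissions in each predefined role, and the helper names are hypothetical.

```python
# Simplified subset of the permissions carried by each predefined role.
ROLE_PERMISSIONS = {
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
        "bigquery.tables.getData",
    },
    "roles/bigquery.jobUser": {"bigquery.jobs.create"},
}


def permissions(roles):
    """Union of the permissions granted by a list of roles."""
    granted = set()
    for role in roles:
        granted |= ROLE_PERMISSIONS.get(role, set())
    return granted


def can_see_tables(dataset_roles, project_roles):
    # Project-level grants are inherited by the datasets in the project.
    return "bigquery.tables.list" in permissions(dataset_roles) | permissions(project_roles)


def can_run_query(dataset_roles, project_roles):
    # Creating the query job is checked on the project, reading rows on the data.
    may_create_job = "bigquery.jobs.create" in permissions(project_roles)
    may_read = "bigquery.tables.getData" in permissions(dataset_roles) | permissions(project_roles)
    return may_create_job and may_read


viewer_only = (["roles/bigquery.dataViewer"], [])
fixed = (["roles/bigquery.dataViewer"], ["roles/bigquery.jobUser"])
print(can_see_tables(*viewer_only), can_run_query(*viewer_only), can_run_query(*fixed))
# prints "True False True"
```

The first case reproduces the reported symptom: tables are visible but queries fail until the job-creation permission is granted on the project.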

Exam trap

Google Cloud often tests the distinction between dataset-level and project-level roles in BigQuery, trapping candidates who assume that dataset-level view permissions are sufficient to run queries.

How to eliminate wrong answers

Option B is wrong because client library version does not affect IAM permissions; authentication and authorization are handled by Google Cloud IAM, not by the library version. Option C is wrong because if the user's account were not activated in the group membership, they would not be able to see the tables at all, as the dataset-level view permission would not apply. Option D is wrong because an organization policy that denies query access would typically block all query operations for all users, not just this user, and the user can see tables, which contradicts a blanket deny on queries.

668

Multi-Selecteasy

A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?

Select 2 answers

A.Deploy the model to multiple Google Cloud regions for failover.

B.Deploy the model to a single zone to minimize cross-zone latency.

C.Use Cloud Batch for asynchronous prediction.

D.Optimize the model by pruning or quantizing to reduce size.

E.Store the model in Cloud Storage and load it on each request.

AnswersA, D

Why this answer

Deploying the model to multiple Google Cloud regions (option A) provides high availability and failover: if one region becomes unavailable, load balancing can route traffic to another region. Pruning or quantizing the model (option D) shrinks it and reduces inference time, which helps keep each prediction under the 100ms budget. Together these meet the requirement for a highly available, low-latency fraud detection system.
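To make the size reduction from quantization concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It illustrates the arithmetic only; the function names and sample weights are hypothetical, and a real model would be quantized with the tooling of its ML framework rather than by hand.

```python
from array import array


def quantize_int8(weights):
    """Symmetric linear quantization of floats to int8 values in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.5]  # hypothetical model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

float32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * q.itemsize
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(float32_bytes, int8_bytes, worst_error <= scale / 2 + 1e-9)
```

Each weight is stored in one byte instead of four, at the price of a rounding error bounded by half the scale. Whether that accuracy loss is acceptable for fraud detection has to be validated on real data.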

Exam trap

Google Cloud often tests the misconception that single-zone deployment minimizes latency, but the real trade-off is between availability and negligible intra-region latency, making multi-region deployment the correct choice for high availability.

669

MCQmedium

A company uses Google Ads and wants to automatically load their advertising data into BigQuery daily. They also need to transform the data with SQL and schedule a recurring query. Which combination of services meets these requirements with minimal operational overhead?

A.Cloud Functions triggered by Cloud Scheduler to call Google Ads API and load into BigQuery

B.Cloud Composer to extract Google Ads API and Dataflow to transform

C.Storage Transfer Service to move CSV files to GCS, then load into BigQuery

D.BigQuery Data Transfer Service for Google Ads and scheduled queries

AnswerD

Direct integration with scheduled queries for transformation.

Why this answer

BigQuery Data Transfer Service can automatically load Google Ads data; scheduled queries handle transformation.

670

MCQmedium

A team uses Vertex AI AutoML Tables to train a model. They need to deploy the model for real-time predictions with high availability. Which deployment configuration should they use?

A.Export as a Cloud Function

B.Deploy to a Vertex AI Endpoint with 1 replica

C.Use a Vertex AI Batch Prediction job

D.Deploy to a Vertex AI Endpoint with multiple replicas and auto-scaling

AnswerD

Multiple replicas provide HA.

Why this answer

For real-time predictions with high availability, you need a deployment that can handle traffic spikes and failover. Deploying to a Vertex AI Endpoint with multiple replicas and auto-scaling ensures that the model is served from multiple instances, providing redundancy and the ability to scale up or down based on demand. This configuration meets the high-availability requirement by distributing load and automatically recovering from instance failures.
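A deployment along these lines might be configured as in the sketch below. It assumes the `google-cloud-aiplatform` Python SDK, whose `Model.deploy` method accepts `machine_type`, `min_replica_count` and `max_replica_count`. The replica counts, the machine type and the helper names `replica_settings` and `deploy` are illustrative assumptions, not values from the question.

```python
def replica_settings(min_replicas=2, max_replicas=6, machine_type="n1-standard-4"):
    """Build keyword arguments for a redundant, autoscaled deployment."""
    if min_replicas < 2:
        raise ValueError("a single replica gives no redundancy")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")
    return {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,
        "max_replica_count": max_replicas,
    }


def deploy(model_resource_name, **overrides):
    """Deploy an uploaded model; needs google-cloud-aiplatform and credentials."""
    # Imported lazily so replica_settings() can be used without the SDK installed.
    from google.cloud import aiplatform

    model = aiplatform.Model(model_resource_name)
    return model.deploy(**replica_settings(**overrides))


print(replica_settings())
```

Keeping the minimum at two or more replicas is what provides redundancy; the maximum only caps how far autoscaling may grow during a spike.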

Exam trap

The trap here is that candidates often confuse batch prediction with real-time serving, or assume that a single replica is sufficient for high availability, not realizing that high availability requires redundancy and automatic scaling.

How to eliminate wrong answers

Option A is wrong because exporting as a Cloud Function is not a deployment method for Vertex AI AutoML Tables models; Cloud Functions are for serverless event-driven code, not for hosting ML model endpoints with real-time prediction capabilities. Option B is wrong because deploying to a Vertex AI Endpoint with only 1 replica provides no redundancy or high availability; if that single instance fails or becomes overloaded, predictions will be unavailable. Option C is wrong because a Vertex AI Batch Prediction job is designed for asynchronous, offline predictions on large datasets, not for real-time, low-latency serving.

671

MCQmedium

A financial services company receives real-time stock trade data via Pub/Sub. They need to enrich each trade with reference data from a Cloud SQL table and write the results to BigQuery for real-time analytics. The enrichment must handle late-arriving data and ensure exactly-once processing. Which Dataflow streaming pipeline configuration should be used?

A.Use a Dataflow Flex Template that reads from Pub/Sub, joins in memory, and writes to BigQuery using legacy streaming inserts

B.Use Pub/Sub to BigQuery template with streaming inserts and a side input from Cloud SQL

C.Build a custom Dataflow pipeline using the Storage Write API with exactly-once semantics and a side input from Cloud SQL

D.Deploy a Dataproc Spark Streaming job that reads from Pub/Sub, enriches via JDBC, and writes to BigQuery

AnswerC

Storage Write API with exactly-once ensures no duplicates, and side input allows enrichment from Cloud SQL.

Why this answer

Using the Storage Write API with exactly-once semantics and side inputs to join with reference data provides the required enrichment and exactly-once guarantees.
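The two ideas in this answer, enrichment from a reference lookup and duplicate-free writes, can be sketched with a toy model in plain Python. It is not the BigQuery client: `OffsetStream` is a hypothetical stand-in that imitates how a write stream with explicit offsets rejects a retried batch, and the `reference` dictionary stands in for a side input loaded from Cloud SQL.

```python
class OffsetStream:
    """Toy stand-in for a write stream that accepts appends at explicit offsets."""

    def __init__(self):
        self.rows = []

    def append(self, rows, offset):
        if offset < len(self.rows):
            return "ALREADY_EXISTS"  # a retry of data that already landed
        if offset > len(self.rows):
            return "OUT_OF_RANGE"    # a gap: earlier rows are missing
        self.rows.extend(rows)
        return "OK"


reference = {"GOOG": "Alphabet Inc."}  # hypothetical reference data snapshot
trades = [{"symbol": "GOOG", "qty": 10}]

# Enrich each trade with a lookup against the in-memory reference data.
enriched = [{**trade, "name": reference.get(trade["symbol"])} for trade in trades]

stream = OffsetStream()
first = stream.append(enriched, offset=0)
retry = stream.append(enriched, offset=0)  # same batch resent after a timeout
print(first, retry, len(stream.rows))      # prints "OK ALREADY_EXISTS 1"
```

Because the retried batch carries the same offset, it is rejected instead of being written twice, which is the mechanism behind exactly-once delivery in this model.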

672

Multi-Selectmedium

A team needs to optimize online prediction cost for a model that has unpredictable traffic spikes. Which TWO strategies are most effective?

Select 2 answers

A.Enable autoscaling with a low min_replica_count and high max_replica_count

B.Set up Model Monitoring to trigger scaling

C.Deploy the model on a single high-memory machine

D.Use a smaller model version

E.Use batch prediction during high traffic

AnswersA, D

Autoscaling provides elasticity, scaling from a low base to handle spikes.

Why this answer

Option A is correct because autoscaling with a low min_replica_count and high max_replica_count allows the deployment to handle unpredictable traffic spikes by dynamically adjusting the number of replicas. This keeps cost low during quiet periods while providing capacity to scale out when demand surges, a key requirement for online prediction serving. Option D is correct because a smaller model version needs less compute per prediction, so each replica can serve more requests or run on a cheaper machine.
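A rough back-of-the-envelope comparison shows why both choices reduce cost. The sketch below is a simplified model in plain Python: the traffic profile, the per-replica throughput and the function names are hypothetical, and real autoscaling reacts to utilization metrics with some delay rather than matching load instantly.

```python
import math


def replicas_needed(qps, qps_per_replica, min_replicas, max_replicas):
    """Replicas an ideal autoscaler would run for the given load."""
    wanted = math.ceil(qps / qps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def replica_hours(hourly_qps, qps_per_replica, min_replicas, max_replicas):
    """Total replica-hours billed over a series of hourly load samples."""
    return sum(
        replicas_needed(qps, qps_per_replica, min_replicas, max_replicas)
        for qps in hourly_qps
    )


traffic = [20] * 22 + [900, 400]  # hypothetical day: mostly quiet, one spike

fixed_at_peak = replica_hours(traffic, 100, min_replicas=9, max_replicas=9)
autoscaled = replica_hours(traffic, 100, min_replicas=1, max_replicas=10)
smaller_model = replica_hours(traffic, 200, min_replicas=1, max_replicas=10)
print(fixed_at_peak, autoscaled, smaller_model)  # prints "216 35 29"
```

In this toy day, provisioning for the peak costs 216 replica-hours, autoscaling from a low minimum costs 35, and a smaller model that doubles per-replica throughput brings it down to 29.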

Exam trap

Google Cloud often tests the distinction between monitoring (observability) and scaling (infrastructure action), leading candidates to incorrectly select Model Monitoring as a scaling trigger.

673

MCQhard

A financial services company needs to ingest real-time trade data from multiple sources into BigQuery for immediate fraud detection. The data volume is high (1 million messages per second) and each message must be available for queries within seconds. They are considering the Storage Write API. Which stream mode should they choose to balance data availability and cost?

A.Legacy streaming inserts

B.Pending mode

C.Buffered mode

D.Committed mode

AnswerD

Committed mode makes each row available for queries as soon as the write is acknowledged, with no separate flush or commit step.

Why this answer

Committed mode (option D) is correct because it is the Storage Write API stream type designed for streaming workloads: records become readable as soon as they are written to the stream, which meets the requirement that each message be queryable within seconds. It needs no extra flush or commit call per batch, which keeps the client simple at 1 million messages per second, and the Storage Write API is priced lower per GB than legacy streaming inserts.

Exam trap

Google Cloud often tests whether candidates can tell the three stream types apart, and the names are misleading. 'Buffered' sounds like the low-latency streaming choice, but buffered rows stay invisible until the client flushes them. 'Committed' sounds as if it needs an explicit commit, but it is the mode in which rows are visible immediately.

How to eliminate wrong answers

Option A is wrong because legacy streaming inserts are not a Storage Write API stream mode; they use the older tabledata.insertAll method, which costs more per GB ingested. Option B is wrong because pending mode holds records until the stream is finalized and committed, which suits batch-style loads and adds latency that is unsuitable for immediate fraud detection. Option C is wrong because buffered mode keeps rows hidden until the client explicitly flushes them up to an offset, which adds a step and a delay without lowering cost.

674

Multi-Selectmedium

A team runs a production application on Compute Engine. They want to ensure high availability and quality. Which three best practices should they implement? (Choose three.)

Select 3 answers

A.Use health checks and load balancing.

B.Use Cloud SQL read replicas for database load.

C.Enable OS Login for SSH access.

D.Use regional persistent disks for stateful data.

E.Use managed instance groups (MIGs) with autoscaling.

AnswersA, D, E

Health checks ensure only healthy instances receive traffic; load balancing provides fault tolerance.

Why this answer

Health checks and load balancing distribute traffic across healthy instances, automatically routing requests away from failed instances to maintain availability. This is a core pattern for fault-tolerant Compute Engine deployments, as health checks (e.g., HTTP, TCP, or SSL) probe instance responsiveness and load balancers (e.g., External HTTP(S) Load Balancer) use the health status to direct traffic only to healthy backends.
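The interaction between health checks and load balancing can be sketched in a few lines of Python. This is a conceptual illustration, not how Google Cloud load balancers are implemented: the instance names, the `status` table and the helper functions are hypothetical, and round-robin is only one of several possible balancing policies.

```python
from itertools import cycle


def healthy_backends(backends, probe):
    """Keep only the backends whose health probe succeeds."""
    return [backend for backend in backends if probe(backend)]


def route_requests(count, backends, probe):
    """Assign `count` requests round-robin across healthy backends."""
    pool = healthy_backends(backends, probe)
    if not pool:
        raise RuntimeError("no healthy backends available")
    rotation = cycle(pool)
    return [next(rotation) for _ in range(count)]


backends = ["vm-a", "vm-b", "vm-c"]                    # hypothetical instances
status = {"vm-a": True, "vm-b": False, "vm-c": True}   # vm-b fails its probe

print(route_requests(4, backends, status.get))
# prints "['vm-a', 'vm-c', 'vm-a', 'vm-c']"
```

The failed instance receives no traffic, so users see no errors while it is repaired or replaced.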

Exam trap

Google Cloud often tests the distinction between security features (like OS Login) and availability/quality features, tempting candidates to select a security option when the question explicitly asks for high availability and quality best practices.

675

MCQmedium

Refer to the exhibit. A data scientist deploys a model using this configuration. Users report that after a few hours of inactivity, the first prediction request takes over 30 seconds. What is the most likely cause?

A.The automatic scaling configuration allows scaling down to zero replicas, causing a cold start on the first request.

B.The network latency between the client and the endpoint is high due to regional distance.

C.The endpoint is misconfigured with the wrong regional endpoint.

D.The model is too large and exceeds the instance memory.

AnswerA

minReplicaCount: 0 permits scaling to zero, and after inactivity, the first request must wait for a new replica to start.

Why this answer

Option A is correct because the automatic scaling configuration that allows scaling down to zero replicas means that after a period of inactivity, all model replicas are terminated. When a new prediction request arrives, the endpoint must provision a new replica from scratch, which involves loading the model artifacts, initializing the inference container, and performing health checks. This cold start process typically takes 30 seconds or more, matching the reported behavior.

Exam trap

Google Cloud often tests the distinction between cold start latency (caused by scaling to zero) and persistent performance issues like network latency or resource exhaustion, so candidates must recognize that a delay only after inactivity points to replica provisioning, not a constant problem.

How to eliminate wrong answers

Option B is wrong because network latency due to regional distance would cause consistent high latency on every request, not just the first request after a period of inactivity. Option C is wrong because a misconfigured regional endpoint would result in persistent errors or high latency on all requests, not a delay only after inactivity. Option D is wrong because if the model exceeded instance memory, the endpoint would fail to serve predictions consistently or return out-of-memory errors, not exhibit a delay only on the first request after inactivity.

Google Professional Data Engineer PDE Questions 601–675 | Page 9/14 | Courseiva