Google Professional Data Engineer (PDE) — Questions 826–900

990 questions total · 14 pages · All types, answers revealed

Page 12 of 14

826

MCQ · easy

A company needs to stream data from a fleet of IoT devices to BigQuery for near-real-time analytics. The data volume is unpredictable and can spike during certain events. Which Google Cloud service should be used as the ingestion point to handle variable throughput with minimal operational overhead?

A. Cloud Datastore

B. Cloud Functions

C. Cloud Storage

D. Cloud Pub/Sub

Answer: D

Cloud Pub/Sub ingests variable-volume data and decouples producers from consumers.

Why this answer

Cloud Pub/Sub is the correct choice because it is a fully managed, scalable messaging service designed to decouple data producers from consumers, so it absorbs unpredictable, spiky throughput without manual scaling. It can ingest millions of messages per second and retains them until a subscriber (for example, a Dataflow pipeline or a BigQuery subscription) delivers them to BigQuery, which supports near-real-time analytics with minimal operational overhead.

Exam trap

Google Cloud often tests the misconception that Cloud Functions can serve as a direct ingestion point for streaming data. Cloud Functions scales automatically, but it provides no durable message buffer, so events can be lost or throttled when traffic spikes or a downstream system slows down. Pub/Sub is the correct decoupling layer.

How to eliminate wrong answers

Option A is wrong because Cloud Datastore is a NoSQL document database for storing structured data, not a streaming ingestion service; it cannot handle variable-throughput message ingestion or buffer spikes. Option B is wrong because Cloud Functions is a serverless compute platform for event-driven code execution, not a durable ingestion buffer; it has no built-in message buffering, so absorbing throughput spikes reliably would require custom queuing and retry logic. Option C is wrong because Cloud Storage is an object storage service suited to batch data, not near-real-time streaming ingestion; it introduces latency and requires additional components (e.g., Cloud Functions or Pub/Sub notifications) to trigger downstream processing.

827

MCQ · easy

Based on the exhibit, what is the most likely cause of the out-of-memory error?

A. The BigQuery output table schema does not match the transformed data, causing write failures.

B. The Pub/Sub subscription is not acknowledging messages quickly enough, causing a backlog.

C. The worker machine type has insufficient memory for the message size and throughput.

D. The fixed window duration of 1 minute is too short, causing excessive state overhead.

Answer: C

Large messages (50 KB) and high throughput (1000/sec) require more memory; n1-standard-4 may be undersized.

Why this answer

The out-of-memory error in a Dataflow pipeline is most likely caused by the worker machine type having insufficient memory for the message size and throughput. When messages are large or the throughput is high, each worker must hold data in memory for processing, windowing, and shuffling. If the worker's memory is too small, the JVM heap runs out of memory, leading to an OOM error.

Exam trap

Google Cloud often tests the misconception that OOM errors are caused by schema mismatches or Pub/Sub backlogs. Those problems surface as write errors and a growing unacknowledged-message count respectively, whereas an OOM on the worker points to insufficient worker memory for the data volume.

How to eliminate wrong answers

Option A is wrong because a schema mismatch between the BigQuery output table and the transformed data would cause write failures or errors in the BigQuery IO connector, not an out-of-memory error on the worker. Option B is wrong because a Pub/Sub subscription not acknowledging messages quickly enough would cause a backlog and increase unacknowledged message count, but it would not directly cause an out-of-memory error on the Dataflow worker; the pipeline would still process messages at its own pace, and the backlog would be in Pub/Sub, not in worker memory. Option D is wrong because a fixed window duration of 1 minute being too short would increase state overhead only if the pipeline uses stateful processing or triggers that accumulate state across windows; for a simple streaming pipeline, shorter windows actually reduce the amount of data held in memory per window, not cause OOM.
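A rough back-of-the-envelope check can make the sizing argument concrete. This is only a sketch: the message size and throughput come from the answer summary, the 1-minute window from option D, and the 15 GB figure is the memory of an n1-standard-4. The assumption that a whole window of raw payload sits on one worker at once is illustrative, not something the exhibit confirms.

```python
# Back-of-the-envelope estimate of the raw payload arriving per window.
MESSAGE_KB = 50            # message size from the answer summary
MESSAGES_PER_SEC = 1000    # throughput from the answer summary
WINDOW_SEC = 60            # 1-minute fixed window (option D)
WORKER_RAM_GB = 15         # memory of one n1-standard-4 worker


def window_payload_gb(message_kb, messages_per_sec, window_sec):
    """Raw payload arriving during one window, in GB (1 GB = 1,000,000 KB)."""
    return message_kb * messages_per_sec * window_sec / 1_000_000


payload_gb = window_payload_gb(MESSAGE_KB, MESSAGES_PER_SEC, WINDOW_SEC)
share_of_ram = payload_gb / WORKER_RAM_GB
print(f"{payload_gb:.1f} GB per window, {share_of_ram:.0%} of one worker's RAM")
```

Raw payload alone comes to about 3 GB per window in this model. Deserialized objects, shuffle buffers, and JVM overhead usually take more space than the raw bytes, which is why a larger machine type or more workers is the plausible fix.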

828

MCQ · hard

A Dataflow pipeline reads from Pub/Sub, applies a keyed stateful ParDo that uses state variables to deduplicate events based on event ID, and writes to BigQuery. During a pipeline update, some events are duplicated in BigQuery. The state is not preserved across updates. Which configuration ensures exactly-once semantics during updates?

A. Drain the pipeline and start the updated pipeline; all in-flight data will be processed.

B. Cancel the pipeline and restart it; Pub/Sub subscriptions will be rewound.

C. Use the Storage Write API's exactly-once delivery mode.

D. Take a snapshot of the pipeline before updating, then start the new pipeline from the snapshot.

Answer: D

Snapshots preserve the state of the pipeline, including deduplication state, allowing the new pipeline to resume without reprocessing duplicates.

Why this answer

Draining the pipeline (option A) finishes processing in-flight data and then stops the job, but the updated pipeline starts with empty state. Because the deduplication state is lost, duplicates can still occur if the new pipeline sees events that were already committed. Cancelling (option B) discards in-flight data and state as well. To preserve state, take a snapshot before the update and start the new pipeline from that snapshot.

BigQuery's Storage Write API with exactly-once semantics (option C) protects the sink against retried writes, but it cannot deduplicate by event ID once the stateful ParDo's state is gone. As long as the deduplication state is recovered from the snapshot, duplicates are avoided.

829

MCQ · medium

A company stores IoT sensor data in BigQuery. Queries that filter on a timestamp column and a device_id column are slow even though the table is partitioned by day. What should the data engineer do to improve query performance?

A. Increase the partition size to monthly

B. Switch to ingestion-time partitioning instead of column-based

C. Enable automatic query rewriting with BI Engine

D. Cluster the table on device_id

Answer: D

Clustering organizes data within partitions, improving filter performance.

Why this answer

Clustering on device_id organizes the data within each day partition by device_id, allowing BigQuery to prune blocks during queries that filter on that column. This reduces the amount of data scanned and improves query performance without changing the partitioning scheme. Partitioning alone only limits scans by time range; clustering adds intra-partition sorting for non-time-based filters.

Exam trap

Google Cloud often tests the distinction between partitioning, which prunes on the partition column (here, time), and clustering, which prunes on other filter columns. The trap is assuming that partitioning alone is sufficient for every filter column, which leads candidates to change the partition strategy rather than add clustering.

How to eliminate wrong answers

Option A is wrong because switching to monthly partitions would make each partition larger and increase the data scanned for queries that filter on a specific day, worsening performance. Option B is wrong because ingestion-time partitioning partitions on a pseudo-column (_PARTITIONTIME) and does nothing to optimize filtering on device_id. Option C is wrong because BI Engine accelerates interactive queries by caching data in memory; it does not reorganize the table or reduce the data that must be read to filter on device_id, so it does not address the root cause of the slow queries.
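The effect of clustering on block pruning can be illustrated with a toy model. This is not how BigQuery stores data internally; the block size, device IDs, and helper names are all made up for illustration. The only idea it borrows is that a block can be skipped when the filter value falls outside the block's minimum and maximum.

```python
def split_into_blocks(values, block_size):
    """Chop a list of device_id values into fixed-size storage blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def blocks_to_scan(blocks, device_id):
    """Count blocks whose [min, max] range could contain device_id."""
    return sum(1 for block in blocks if min(block) <= device_id <= max(block))


# One day partition: 8 devices, 4 readings each, arriving interleaved.
arrival_order = [device for _ in range(4) for device in range(8)]

unclustered = split_into_blocks(arrival_order, block_size=8)
clustered = split_into_blocks(sorted(arrival_order), block_size=8)

print(blocks_to_scan(unclustered, device_id=5), "of", len(unclustered))
print(blocks_to_scan(clustered, device_id=5), "of", len(clustered))
```

In this model every unclustered block contains every device, so a filter on one device reads all four blocks, while the sorted layout confines that device to a single block.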

830

MCQ · medium

A data scientist needs to provide explanations for each prediction made by a deployed AutoML model to comply with regulatory requirements. Which Vertex AI feature should they use?

A. Vertex AI Model Monitoring

B. Vertex AI Vizier

C. Vertex AI Explainable AI

D. Vertex AI Feature Store

Answer: C

Provides per-prediction explanations.

Why this answer

Vertex AI Explainable AI is the correct feature because it provides feature attributions and explanations for each prediction, enabling compliance with regulatory requirements that demand interpretability. It uses techniques like Shapley value approximations or integrated gradients to quantify the contribution of each input feature to the model's output, which is essential for auditing and transparency in deployed AutoML models.

Exam trap

Google Cloud often tests the distinction between monitoring (detecting drift) and explaining (interpreting predictions), so candidates mistakenly choose Model Monitoring when the question explicitly asks for per-prediction explanations for regulatory compliance.

How to eliminate wrong answers

Option A is wrong because Vertex AI Model Monitoring is designed to detect prediction drift, data drift, and feature skew over time, not to provide per-prediction explanations. Option B is wrong because Vertex AI Vizier is a hyperparameter tuning and optimization service that helps find the best model architecture or parameters, not a tool for explaining individual predictions. Option D is wrong because Vertex AI Feature Store is a centralized repository for storing, serving, and sharing feature data, but it does not generate explanations for model predictions.

831

MCQ · medium

You need to automate retraining of a model when new training data becomes available every week. The training pipeline runs on Vertex AI Pipelines and is triggered by Cloud Composer. After retraining, you want to evaluate the new model against a golden dataset. If the model's accuracy improves by at least 1%, it should be automatically deployed to the staging endpoint. What is the best way to implement the decision logic?

A. Use Cloud Functions to compare metrics and call the endpoint if conditions are met.

B. Add a conditional step in the Vertex AI Pipeline to evaluate the model and deploy if the accuracy improvement threshold is met.

C. After training, run a batch prediction job on the golden dataset and compare metrics manually.

D. Use Vertex AI Experiments to log metrics and set up an alert to manually deploy.

Answer: B

Pipelines can include a condition step to check metrics and decide deployment.

Why this answer

Option B is correct because Vertex AI Pipelines supports conditional execution natively through the Kubeflow Pipelines SDK's `dsl.Condition` control-flow construct (`dsl.If` in recent KFP v2 releases), allowing you to evaluate the new model's accuracy against the golden dataset within the same pipeline and deploy only if the improvement threshold (≥1%) is met. This approach keeps the entire retraining, evaluation, and deployment workflow automated, auditable, and self-contained within a single orchestrated pipeline, avoiding external triggers or manual steps.

Exam trap

Google Cloud often tests the misconception that external services like Cloud Functions are needed for decision logic, when in fact Vertex AI Pipelines' native conditional steps are the simpler, more integrated, and recommended approach for automated model evaluation and deployment within a pipeline.

How to eliminate wrong answers

Option A is wrong because Cloud Functions would introduce an external, event-driven component that adds latency, complexity, and potential failure points; Vertex AI Pipelines already provides built-in conditional logic for this exact use case, making an extra function unnecessary. Option C is wrong because running a batch prediction job and manually comparing metrics defeats the automation goal and introduces human error and delay, which is not suitable for a weekly retraining cadence. Option D is wrong because Vertex AI Experiments is designed for tracking and comparing experiments, not for automated decision-making or deployment; relying on alerts for manual deployment contradicts the requirement for automatic retraining and deployment.
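The decision that a conditional pipeline step encodes boils down to one comparison. The sketch below is plain Python rather than pipeline code, and `should_deploy` and its parameters are hypothetical names. It assumes "1%" means one absolute percentage point of accuracy, which the question does not spell out.

```python
def should_deploy(new_accuracy, baseline_accuracy, min_improvement=0.01):
    """Return True when the new model beats the baseline by the required margin.

    Assumes "1%" means one absolute percentage point of accuracy.
    """
    return (new_accuracy - baseline_accuracy) >= min_improvement


print(should_deploy(0.92, 0.90))   # improvement of 0.02
print(should_deploy(0.905, 0.90))  # improvement of 0.005
```

In a real pipeline the evaluation step would emit the two accuracy values as outputs, and the deployment step would sit inside the conditional block that tests this comparison.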

832

Multi-Select · medium

A company wants to implement model monitoring for a deployed classification model. Which three types of monitoring should they set up? (Select 3)

Select 3 answers

A. Infrastructure cost monitoring

B. Training-serving skew

C. Prediction drift

D. Input feature drift

E. Model version comparison

Answers: B, C, D

Skew detection identifies differences between training and serving data.

Why this answer

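Training-serving skew (B) is correct because it detects discrepancies between the data used for training and the data the model sees in production, which can cause performance degradation. This is a critical monitoring type for classification models to ensure the model's assumptions remain valid in the live environment. Prediction drift (C) and input feature drift (D) complete the set: they track how the distributions of the model's outputs and inputs change over time in production.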

Exam trap

Google Cloud often tests the distinction between model monitoring (which focuses on data and prediction quality) and operational or lifecycle management tasks, leading candidates to mistakenly select infrastructure cost monitoring or version comparison as monitoring types.

833

Multi-Select · easy

A data team uses Cloud Composer to orchestrate Airflow DAGs. They need to ensure that a downstream task runs only if at least two out of three upstream sensor tasks succeed. Which TWO configurations should they combine?

Select 2 answers

A. Set trigger_rule to 'none_failed_or_skipped' and use a condition.

B. Set trigger_rule to 'one_success'.

C. Set trigger_rule to 'all_done'.

D. Set trigger_rule to 'none_failed'.

E. Use a PythonOperator to check the number of successes.

Answers: A, E

Combined with a condition, this ensures at least two succeeded.

Why this answer

Option A is correct because the 'none_failed_or_skipped' trigger rule (renamed 'none_failed_min_one_success' in newer Airflow releases) runs the downstream task when no upstream task has failed and at least one has succeeded. Combined with a condition (option E, e.g., a PythonOperator or BranchPythonOperator) that checks whether at least two of the three sensor tasks succeeded, this ensures the downstream task runs only when the required threshold is met. This approach combines Airflow's built-in trigger rules with explicit conditional logic to implement a quorum-based dependency.

Exam trap

Google Cloud often tests the misconception that a single trigger rule like 'one_success' or 'none_failed' can directly enforce a quorum condition, when in fact you must combine a trigger rule with explicit conditional logic to count successes.
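The counting step that no single trigger rule provides might look like the following. It is a plain-Python sketch of the check a PythonOperator callable could perform; the task IDs and the `quorum_met` helper are hypothetical, and in Airflow the states would be read from the DAG run's task instances rather than from a hand-written dictionary.

```python
def quorum_met(upstream_states, required=2):
    """Return True when at least `required` upstream tasks ended in 'success'."""
    successes = sum(1 for state in upstream_states.values() if state == "success")
    return successes >= required


# Hypothetical sensor task IDs mapped to their final states.
states = {"sensor_a": "success", "sensor_b": "skipped", "sensor_c": "success"}
print(quorum_met(states))
```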

834

Multi-Select · medium

A data engineer is designing a Cloud Bigtable schema for high-volume time-series data. Which TWO practices should they follow to avoid performance issues?

Select 2 answers

A. Place the timestamp as the first component of the row key

B. Create as many column families as possible

C. Use a hashed prefix in the row key to distribute writes

D. Group related columns into column families

E. Store all columns in a single column family

Answers: C, D

Hashing avoids sequential hot-spotting.

Why this answer

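A hashed prefix in the row key spreads writes across nodes and avoids hot-spotting, and grouping related columns into column families keeps data that is read together stored together. A timestamp-first row key sends all new writes to the same part of the key space and causes hot-spotting. Putting every column into a single family regardless of access pattern is inefficient.

A large number of column families also adds overhead.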
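One way to build such a row key is sketched below, assuming a `<bucket>#<device_id>#<timestamp>` layout. The layout, the bucket count of 16, and the `make_row_key` helper are illustrative choices, not a Bigtable API. The prefix is derived from the device ID rather than the timestamp, so rows for one device stay contiguous and can still be range-scanned by time.

```python
import hashlib


def make_row_key(device_id, timestamp, buckets=16):
    """Build '<bucket>#<device_id>#<timestamp>' with a stable hashed prefix."""
    digest = hashlib.sha256(device_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets  # same device always maps to the same bucket
    return f"{bucket:02d}#{device_id}#{timestamp}"


key = make_row_key("sensor-42", "2024-01-01T00:00:00Z")
print(key)
```

Because the bucket is a pure function of the device ID, a reader can recompute the prefix for any device and scan just that key range.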

835

MCQ · medium

A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?

A. Use a custom image with more memory

B. Set spark.driver.memory to a higher value

C. Increase the number of workers by setting --num-workers

D. Set spark.executor.memory to a higher value, e.g., 8g

Answer: D

This directly increases memory per executor, fixing OOM errors.

Why this answer

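In Dataproc Serverless, Spark properties are set with the --properties flag. The spark.executor.memory property controls the memory allocated to each executor, so raising it (for example, to 8g) addresses OOM errors that occur in executors. By contrast, spark.driver.memory affects only the driver, and adding workers increases parallelism without giving any single executor more memory.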

836

Multi-Select · hard

You are building a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline must handle late-arriving data and ensure that the windowing and triggering are correct. Which THREE configurations should you consider? (Choose 3)

Select 3 answers

A. Enable Dataflow Streaming Engine for exactly-once processing.

B. Use side inputs to enrich streaming data with static data.

C. Use the BigQuery Storage Write API with committed mode to ensure exactly-once writes.

D. Set an allowed lateness duration to handle late-arriving data.

E. Configure a triggering frequency to control how often results are emitted.

Answers: C, D, E

Why this answer

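Option C is correct because the BigQuery Storage Write API with committed mode provides exactly-once write semantics, which is essential for ensuring that late-arriving data processed by the pipeline does not result in duplicate rows in BigQuery. This mode uses stream offsets to track writes, guaranteeing that each record is written exactly once even if the pipeline retries. Options D and E address the windowing requirement directly: allowed lateness keeps a window open for data that arrives after the watermark has passed, and the trigger configuration controls when results for a window are emitted.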

Exam trap

Google Cloud often tests the misconception that Dataflow Streaming Engine alone provides exactly-once processing. In reality, the source and sink semantics (such as the Storage Write API) determine whether the pipeline is exactly-once end to end, not the engine itself.
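How allowed lateness decides the fate of a late element can be sketched with a simplified model. This is not Beam code: `classify` is a hypothetical helper, all values are plain seconds, and real runners apply more detailed boundary and trigger rules than the two comparisons shown here.

```python
def classify(window_end, watermark, allowed_lateness):
    """Classify an arriving element of a window under a simplified lateness model.

    window_end, watermark, and allowed_lateness are all expressed in seconds.
    """
    if watermark < window_end:
        return "on_time"              # the window has not closed yet
    if watermark < window_end + allowed_lateness:
        return "late_but_accepted"    # the window is kept open for late data
    return "dropped"                  # the window's state has been discarded


print(classify(window_end=60, watermark=50, allowed_lateness=120))
print(classify(window_end=60, watermark=100, allowed_lateness=120))
print(classify(window_end=60, watermark=200, allowed_lateness=120))
```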

837

MCQ · medium

A company uses Cloud Composer for pipeline orchestration. They need to define task dependencies where Task B and Task C can run in parallel after Task A, and Task D must run after both B and C complete. How should they define the DAG?

A. A >> B; B >> D; A >> C; C >> D

B. A >> B >> C >> D

C. A >> [B, C] >> D

D. A.set_downstream(B); B.set_upstream(C); C.set_downstream(D)

Answer: C

Correct: A executes, then B and C in parallel, then D after both.

Why this answer

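Using the bitshift operators, A >> [B, C] >> D makes B and C downstream of A, so they can run in parallel, and makes D downstream of both, so it starts only after B and C complete. Option A declares the same edges one statement at a time; C is the concise, idiomatic form.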
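Why a list can appear in the middle of a `>>` chain is easier to see with a toy model. The `Task` class below is a stand-in written for this illustration, not Airflow's operator class; it only records upstream names so the resulting dependencies can be inspected.

```python
class Task:
    """Toy stand-in for an Airflow operator that records upstream task names."""

    def __init__(self, name):
        self.name = name
        self.upstream = set()

    def __rshift__(self, other):
        # task >> other, where other is a Task or a list of Tasks
        targets = other if isinstance(other, list) else [other]
        for downstream in targets:
            downstream.upstream.add(self.name)
        return other

    def __rrshift__(self, others):
        # [t1, t2] >> task: lists have no >>, so Python calls this method
        for upstream in others:
            self.upstream.add(upstream.name)
        return self


a, b, c, d = (Task(name) for name in "ABCD")
a >> [b, c] >> d
print(sorted(b.upstream), sorted(c.upstream), sorted(d.upstream))
```

The first `>>` returns the list `[b, c]`, and the second one falls back to the right-hand task's reflected method, which is what lets D depend on both B and C.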

838

MCQ · medium

You are designing a streaming pipeline that ingests events from Pub/Sub, enriches them with a machine learning model, and writes the results to BigQuery. The ML model is deployed on Cloud Run and has a high latency (500ms per request). You need to minimize the impact of slow ML inference on the overall pipeline throughput. Which approach should you take?

A. Use Dataflow to write events to Pub/Sub, then use a separate Dataflow pipeline that batches calls to Cloud Run.

B. Increase the number of Dataflow workers to compensate for the latency.

C. Use Cloud Functions to call Cloud Run and write directly to BigQuery.

D. Use Dataflow's ParDo with synchronous calls to Cloud Run for each element.

Answer: A

Decoupling via Pub/Sub allows batching and async processing, improving throughput.

Why this answer

Option A is correct because it uses Dataflow to batch events before sending them to Cloud Run, which amortizes the 500ms per-request latency over multiple events, significantly increasing throughput. By writing events to Pub/Sub and then processing them in a separate Dataflow pipeline with batched calls, you decouple the ingestion from the inference and avoid blocking on each individual request.

Exam trap

The trap here is that candidates assume parallelism (more workers) or faster invocation methods (Cloud Functions) can overcome high per-request latency, when the real solution is to batch requests to reduce the number of round trips.

How to eliminate wrong answers

Option B is wrong because increasing the number of Dataflow workers does not reduce the per-element latency of synchronous calls; it only adds parallelism, which can lead to excessive concurrent calls to Cloud Run and potential throttling or cost spikes. Option C is wrong because Cloud Functions are not designed for high-throughput streaming pipelines and would still make synchronous calls to Cloud Run for each event, suffering the same latency bottleneck. Option D is wrong because using ParDo with synchronous calls per element means each element waits 500ms before the next element is processed, severely limiting throughput and not leveraging batching.
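The benefit of batching can be estimated with simple arithmetic. The sketch below uses the 500 ms latency from the question; the event count, the batch size of 50, and the helper names are illustrative. It also assumes, optimistically, that a batched request takes the same 500 ms as a single-event request, which depends on the model and is not stated in the question.

```python
import math

LATENCY_SEC = 0.5  # per-request latency from the question


def round_trips(num_events, batch_size):
    """Number of requests needed to send num_events in batches."""
    return math.ceil(num_events / batch_size)


def sequential_time_sec(num_events, batch_size, latency_sec=LATENCY_SEC):
    """Wall-clock time for one caller issuing requests back to back.

    Assumes a batched request takes the same time as a single-event request.
    """
    return round_trips(num_events, batch_size) * latency_sec


print(sequential_time_sec(1000, batch_size=1))    # one event per request
print(sequential_time_sec(1000, batch_size=50))   # 50 events per request
```

Under these assumptions one caller needs 500 seconds for 1,000 events without batching and 10 seconds with batches of 50.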

839

MCQ · medium

A data engineer needs to create a Dataflow pipeline that reads from Pub/Sub, applies a Python transformation, and writes to BigQuery. The pipeline should be reusable across environments with different parameters. Which deployment method is most appropriate?

A. Classic Template

B. Flex Template

C. Direct pipeline submission with gcloud dataflow jobs run

D. Cloud Composer to trigger Dataflow jobs

Answer: B

Flex Templates support any SDK (including Python) and allow runtime parameters.

Why this answer

Flex Templates (Option B) are the most appropriate deployment method because they package the pipeline, including the Python transformation code and its dependencies, as a Docker image, making the pipeline reusable across environments with different runtime parameters. The job graph is built when the template is launched, so parameters can be supplied at runtime through the Dataflow UI or API without special handling. Classic Templates can also run custom code, but they stage a fixed job graph and accept runtime parameters only through ValueProvider, which makes them less flexible for a multi-environment deployment strategy.

Exam trap

The trap here is that candidates treat Classic Templates and Flex Templates as interchangeable. Both can run custom pipelines, but Classic Templates freeze the job graph at staging time and accept runtime parameters only through ValueProvider, whereas Flex Templates build the graph at launch from a container image, making them the better choice for custom, reusable pipelines.

How to eliminate wrong answers

Option A is wrong because Classic Templates serialize the job graph when the template is staged and support runtime parameters only through the ValueProvider interface, which not every transform accepts; that makes them harder to parameterize for different environments than Flex Templates. Option C is wrong because direct pipeline submission does not provide a reusable, parameterized template mechanism; each submission requires the full pipeline code and configuration, making it unsuitable for repeated deployment across environments. Option D is wrong because Cloud Composer is an orchestration tool for scheduling and monitoring workflows, not a deployment method for creating reusable, parameterized Dataflow templates; it can trigger Dataflow jobs but does not solve the need for a template that can be reused with different parameters.

840

Multi-Select · easy

A company wants to use BigQuery for analytics. They need to meet compliance requirements by encrypting data at rest with a key they control. Which TWO actions should they take? (Choose 2.)

Select 2 answers

A. Set the Cloud KMS key as the default encryption key for the BigQuery dataset.

B. Create a Cloud Storage bucket and load data there.

C. Use VPC Service Controls to restrict access to the dataset.

D. Create a key ring and cryptographic key in Cloud KMS.

E. Enable BigQuery column-level encryption using AEAD functions.

Answers: A, D

Setting the dataset default encryption key encrypts tables newly created in the dataset with the CMEK.

Why this answer

BigQuery supports customer-managed encryption keys (CMEK) for encrypting data at rest. You create a key ring and key in Cloud KMS, then set that key as the default encryption key for the BigQuery dataset. Tables created in the dataset after that point are encrypted with the key unless a different key is specified.
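When the key is attached to a dataset, it is referenced by its full Cloud KMS resource name. The helper below only assembles that name; `kms_key_name` and the project, location, key ring, and key values are hypothetical placeholders.

```python
def kms_key_name(project, location, key_ring, key):
    """Build the full Cloud KMS resource name for a key."""
    return (
        f"projects/{project}/locations/{location}"
        f"/keyRings/{key_ring}/cryptoKeys/{key}"
    )


# Placeholder identifiers for illustration only.
print(kms_key_name("my-project", "us", "bq-keyring", "bq-key"))
```

The key's location generally needs to match the location of the dataset it protects, so the `location` argument is worth checking against the dataset before the key is applied.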

841

MCQ · hard

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?

A. Vertex AI Feature Store

B. Vertex AI Model Evaluation

C. Cloud Build trigger

D. Cloud Monitoring alert

Answer: B

Evaluates model and can block deployment if threshold not met.

Why this answer

Vertex AI Model Evaluation provides built-in evaluation metrics and threshold-based validation that can be used as a pipeline condition to gate model deployment. By adding a Model Evaluation component, the pipeline can compare model performance against a predefined threshold and only proceed to deploy if the metrics (e.g., AUC, precision, recall) meet or exceed the required value.

Exam trap

The trap here is that candidates may confuse monitoring (Cloud Monitoring) or feature management (Feature Store) with the evaluation step needed to gate deployment, but only Model Evaluation provides the threshold-based conditional logic within the pipeline itself.

How to eliminate wrong answers

Option A is wrong because Vertex AI Feature Store is a centralized repository for storing, serving, and sharing feature data, not for evaluating model performance or enforcing deployment thresholds. Option C is wrong because Cloud Build trigger is used to automate builds and tests of source code, not to evaluate trained model metrics within a Vertex AI Pipeline. Option D is wrong because Cloud Monitoring alert is designed to notify operators about system or application anomalies, not to serve as a pipeline gate that conditionally deploys models based on evaluation results.

842

Multi-Select · medium

A company wants to build a reporting pipeline where data is collected from IoT devices, stored raw in Cloud Storage, and then processed into BigQuery for analytics. They need to ensure data is encrypted at rest using customer-managed keys. Which THREE steps should they take? (Choose 3 correct options)

Select 3 answers

A. Delete the Cloud KMS key after data is loaded to BigQuery

B. Enable CMEK on the Cloud Storage bucket by specifying the KMS key

C. Configure the BigQuery dataset to use a CMEK key

D. Use Google-managed encryption keys

E. Create a key ring and key in Cloud Key Management Service

Answers: B, C, E

You can set a default KMS key for a bucket.

Why this answer

Option B is correct because enabling CMEK on a Cloud Storage bucket by specifying a KMS key ensures that objects stored in the bucket are encrypted at rest using a customer-managed key. This is done by setting the bucket's default encryption to use a specific Cloud KMS key, and any object uploaded without its own encryption key inherits this setting. Option E supplies the key itself, which must exist in Cloud KMS before it can be referenced, and option C applies the same customer-managed protection to the processed data in BigQuery.

Exam trap

Google Cloud often tests the misconception that you can delete the KMS key after the data is loaded to save costs. Destroying the key makes the data encrypted with it permanently unreadable, which also violates retention requirements.

843

MCQ · easy

The push endpoint is returning 500 errors. What is the most likely cause?

A. The push endpoint requires authentication but none is set

B. The topic has no messages

C. The push endpoint is not a valid HTTPS URL

D. The ack deadline is too short

Answer: A

If the endpoint expects an Authorization header, requests without it will fail with 500 or 401.

Why this answer

The push endpoint likely requires authentication, but none is configured, causing the 500 errors.

844

MCQ · medium

A company stores sensitive data in BigQuery and needs to encrypt certain columns with customer-managed encryption keys (CMEK) while using BigQuery's analytics capabilities. What should they do?

A. Store the sensitive data in Cloud Storage with CMEK and use external tables.

B. Use BigQuery column-level security with data classification.

C. Create a BigQuery table with CMEK enabled; it will automatically encrypt all columns.

D. Use the AEAD encryption functions in BigQuery to encrypt specific columns during query time.

Answer: D

AEAD functions allow column-level encryption/decryption with customer-managed keys, enabling granular control.

Why this answer

Option D is correct because BigQuery's AEAD encryption functions let you encrypt specific columns with keysets that you control, and those keysets can be wrapped by a Cloud KMS key so that the customer manages the key material. The remaining columns are left as they are, so BigQuery's full analytics capabilities stay available on them. This approach meets the requirement of encrypting only certain columns with customer-managed keys without losing the ability to run analytical queries on the rest of the table.

Exam trap

The trap here is that candidates confuse table-level CMEK encryption (which encrypts all data at rest) with the ability to selectively encrypt specific columns, leading them to choose Option C, when in fact BigQuery requires using AEAD functions for column-level encryption with customer-managed keys.

How to eliminate wrong answers

Option A is wrong because storing data in Cloud Storage with CMEK and using external tables does not encrypt specific columns within BigQuery; it encrypts the entire file at rest, and external tables cannot enforce column-level encryption natively. Option B is wrong because BigQuery column-level security controls access to columns through policy tags (optionally with data masking), but it does not encrypt the data with customer-managed keys; it only restricts visibility. Option C is wrong because enabling CMEK on a BigQuery table encrypts the entire table at rest, not specific columns; you cannot selectively apply CMEK to only certain columns within a table.

845

MCQ · medium

You are designing a Cloud Composer workflow that loads data from Cloud Storage into BigQuery, runs a Dataflow job to transform the data, and then triggers a Dataproc Spark job. After each step, you need to conditionally branch based on success or failure. Which Airflow feature allows you to pass messages between tasks to enable dynamic branching?

A. Sensors

B. XComs

C. TaskFlow API

D. DAG dependencies

Answer: B

XComs are the standard mechanism for passing messages between Airflow tasks, enabling branching based on results.

Why this answer

XComs (cross-communications) in Airflow allow tasks to exchange small amounts of data, such as status or metadata. This data can be used by BranchPythonOperator to conditionally choose downstream tasks.
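The branching decision itself is a small function that maps a status value to the ID of the task to follow. The sketch below is plain Python; `choose_next_task` and both task IDs are hypothetical. In an Airflow DAG the status would be pulled from XCom inside the callable given to BranchPythonOperator, rather than passed in as an argument.

```python
def choose_next_task(load_status):
    """Return the task_id to follow, as a branch callable would."""
    if load_status == "success":
        return "run_dataflow_transform"   # hypothetical downstream task ID
    return "notify_failure"               # hypothetical failure-handling task ID


print(choose_next_task("success"))
print(choose_next_task("failed"))
```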

846

MCQ · hard

Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?

A. The pipeline is using a different version of the Wrangler plugin

B. The Cloud Storage bucket is in a different region than the Data Fusion instance

C. The Wrangler plugin is not deployed in the Cloud Data Fusion instance

D. The service account used in the pipeline does not have permissions to write to BigQuery

Answer: C

The Cloud Data Fusion Studio uses a different environment than the pipeline runtime. The Wrangler plugin must be deployed in the runtime environment; otherwise, directives fail.

Why this answer

When a pipeline works in the Studio but fails at runtime, the usual suspects are environment differences such as runtime arguments, service account permissions, or plugin versions. Here the error names the Wrangler directive, which points to missing plugin artifacts: the recipe was validated interactively in the Studio, while the deployed pipeline runs in a separate execution environment with its own set of plugins.

If the Wrangler plugin is not deployed to the runtime environment, the directive will fail.

847

MCQ · medium

A financial services company deploys a regression model to predict loan default risk. The model is served using Vertex AI Endpoints with autoscaling. After deployment, latency increases significantly during peak hours, causing timeouts. The model uses scikit-learn and has a large feature set. Which action should the team take to reduce latency while maintaining prediction accuracy?

A. Switch to batch prediction for all requests.

B. Increase the minimum number of replicas in the endpoint to handle peak load.

C. Increase the memory allocation for the serving container.

D. Apply feature selection to reduce the number of input features.

Answer: D

Reducing features decreases model size and inference time.

Why this answer

Option D is correct because the latency spike is caused by the large feature set, which increases the time for preprocessing and inference in the scikit-learn model. Reducing the number of input features via feature selection directly decreases the computational load per request, lowering latency without sacrificing accuracy if the selected features retain predictive power. This addresses the root cause, unlike scaling or resource changes that only mask the symptom.

Exam trap

The trap here is that candidates often confuse scaling solutions (increasing replicas or memory) with performance optimization, but the question specifically asks for reducing latency per request, which requires addressing the computational bottleneck—feature reduction—rather than adding more resources.

How to eliminate wrong answers

Option A is wrong because switching to batch prediction does not reduce per-request latency; it processes requests asynchronously in bulk, which is unsuitable for real-time serving and would still cause timeouts during peak hours. Option B is wrong because increasing the minimum number of replicas only adds more instances to handle concurrent requests, but each individual request still suffers from the same high latency due to the large feature set—autoscaling already adds replicas under load, so this does not fix the per-request processing time. Option C is wrong because increasing memory allocation for the serving container helps with out-of-memory errors but does not reduce the CPU-bound computation time required to process a large feature set; the bottleneck is compute, not memory.


848

MCQeasy

Your company uses Cloud Dataflow to process streaming data from Pub/Sub. The pipeline occasionally fails with a 'worker terminated unexpectedly' error. What is the most likely cause of this error?

A.Insufficient memory per worker causing OOM errors

B.Incorrect VPC firewall rules blocking internal communication

C.Staging location bucket lacks write permissions

D.Pub/Sub subscription throughput quota exceeded

AnswerA

OOM errors cause workers to terminate unexpectedly.

Why this answer

The 'worker terminated unexpectedly' error in Cloud Dataflow typically indicates that a worker process ran out of memory (OOM) and was killed by the operating system. This occurs when the pipeline's memory requirements exceed the configured worker machine type's memory capacity, often due to large windowing accumulations, skewed data, or inefficient state handling.

Exam trap

Google Cloud often tests the distinction between infrastructure-level errors (like OOM) and configuration or permission errors, so candidates may incorrectly attribute the generic 'worker terminated' message to network or IAM issues rather than resource exhaustion.

How to eliminate wrong answers

Option B is wrong because VPC firewall rules blocking internal communication would cause connectivity errors like 'unable to connect to shuffle service' or 'worker cannot reach Dataflow service', not a generic termination error. Option C is wrong because staging location bucket lacking write permissions would cause a pipeline submission failure with a permission denied error, not a runtime worker termination. Option D is wrong because Pub/Sub subscription throughput quota exceeded would result in Pub/Sub-specific errors such as 'RESOURCE_EXHAUSTED' or backlog buildup, not a worker termination.


849

MCQeasy

A company trains a custom model using TensorFlow and wants to deploy it to Vertex AI for low-latency predictions. The model is large (2 GB). Which deployment option should they choose?

A.Use Vertex AI Batch Prediction job

B.Deploy as a Cloud Function

C.Deploy to Vertex AI Endpoint with a custom container

D.Deploy to Cloud Run with minimum instances

AnswerC

A Vertex AI endpoint with a custom container keeps a large model loaded in memory and serves it online.

Why this answer

Option C is correct because deploying a large (2 GB) model to Vertex AI Endpoint with a custom container allows you to package the model, its dependencies, and a serving framework (e.g., TensorFlow Serving) into a Docker image. This approach supports low-latency predictions by keeping the model loaded in memory across requests, and it can scale to handle real-time inference traffic, unlike batch or serverless options that have cold-start or size limitations.

Exam trap

Google Cloud often tests the misconception that Cloud Run or Cloud Functions can handle large models for real-time inference, ignoring their size limits, cold-start latency, and lack of native Vertex AI integration for model management and scaling.

How to eliminate wrong answers

Option A is wrong because Vertex AI Batch Prediction is designed for asynchronous, high-throughput processing of large datasets, not for low-latency real-time predictions; it processes jobs in batches and does not maintain a persistent endpoint. Option B is wrong because Cloud Functions is built for short-lived, event-driven code: its deployment-size and memory limits are tight for a 2 GB model, and instances that scale to zero would reload the model on a cold start, which works against low-latency inference. Option D is wrong because Cloud Run, although it can serve containers and keep minimum instances warm, lacks native integration with Vertex AI's model registry and optimized serving infrastructure, so model management and scaling would have to be handled separately.


850

MCQhard

A company runs a real-time fraud detection model using Cloud Dataflow for streaming inference. The model is updated every hour with new training data. The team wants to minimize downtime and ensure that both old and new model versions are available during the update. Which deployment strategy should they use?

A.A/B testing: route a small percentage of traffic to the new model and compare performance.

B.Rolling deployment: gradually replace instances of the old model with the new model.

C.Blue/green deployment: deploy the new model to a separate endpoint, then switch all traffic at once.

D.Canary deployment: deploy the new model alongside the old one, gradually increase traffic to the new model while monitoring.

AnswerD

Canary deployment ensures both versions are available and traffic is shifted gradually, minimizing downtime and risk.

Why this answer

Canary deployment is the correct strategy because it allows the new model to be deployed alongside the old one, with traffic gradually shifted to the new version while monitoring for errors or performance degradation. This minimizes downtime and ensures both versions are available during the update, which is critical for a real-time fraud detection system where continuous availability and risk mitigation are paramount.

Exam trap

The trap here is that candidates confuse A/B testing (a statistical evaluation method) with canary deployment (a release strategy), or assume blue/green deployment is always best for zero-downtime updates without considering the requirement for gradual traffic shifting and availability of both versions during the update.

How to eliminate wrong answers

Option A is wrong because A/B testing is a statistical method for comparing model performance, not a deployment strategy for minimizing downtime or ensuring availability during updates. Option B is wrong because rolling deployment gradually replaces instances, which can cause a brief period where only the new model is available, violating the requirement that both old and new versions be available during the update. Option C is wrong because blue/green deployment switches all traffic at once after the new model is deployed, which introduces a cutover risk and does not allow gradual traffic shifting or monitoring during the transition.
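
The gradual traffic shift behind a canary rollout can be sketched in a few lines. This is a minimal illustration, assuming each request carries a stable ID; the function names `route` and `canary_share` and the labels "old" and "new" are hypothetical, and the code models the idea only, not the Dataflow or Vertex AI API.

```python
import hashlib

def route(request_id: str, canary_percent: int) -> str:
    """Return 'new' for roughly canary_percent% of request IDs, else 'old'."""
    # Hash the ID so the same request always lands in the same bucket (0-99).
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "new" if bucket < canary_percent else "old"

def canary_share(request_ids, canary_percent):
    """Fraction of the given requests that the new version would serve."""
    hits = sum(1 for r in request_ids if route(r, canary_percent) == "new")
    return hits / len(request_ids)

# Raise the canary percentage step by step while both versions stay deployed.
ids = [f"txn-{i}" for i in range(10_000)]
for percent in (0, 10, 50, 100):
    print(percent, round(canary_share(ids, percent), 3))
```

Hashing the request ID keeps routing deterministic, so a given transaction sees the same model version for as long as the percentage is unchanged.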


851

MCQeasy

A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?

A.Authorized view

B.Clustered table

C.Materialized view

D.Partitioned table with partition expiration

AnswerD

Partition expiration automatically deletes partitions older than a specified number of days. This is ideal for time-based data retention.

Why this answer

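A partitioned table with partition expiration automatically deletes each partition once it is older than the configured age, so 30 days of retention needs no manual cleanup. Clustering changes how data is organized within a table and does not affect expiration. Materialized views hold precomputed query results and are not a data lifecycle tool. Authorized views control access to query results, not retention.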
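
As a rough illustration, the helper below builds the kind of BigQuery DDL that sets a partition expiration. The table name `mydataset.events`, the columns, and the helper `partitioned_table_ddl` are hypothetical; `partition_expiration_days` is the table option the statement relies on.

```python
def partitioned_table_ddl(table: str, ts_column: str, expiration_days: int) -> str:
    """Build a CREATE TABLE statement with daily partitions that expire."""
    if expiration_days <= 0:
        raise ValueError("expiration_days must be positive")
    return (
        f"CREATE TABLE {table} (\n"
        f"  {ts_column} TIMESTAMP,\n"
        f"  payload STRING\n"
        f")\n"
        f"PARTITION BY DATE({ts_column})\n"
        f"OPTIONS (partition_expiration_days = {expiration_days});"
    )

# Hypothetical table that keeps 30 days of data.
ddl = partitioned_table_ddl("mydataset.events", "event_ts", 30)
print(ddl)
```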


852

MCQmedium

A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?

A.Partition by ingestion time and cluster by customer_id

B.Use a materialized view that filters by date and customer_id

C.Cluster by date column and partition by customer_id

D.Partition by date column and cluster by customer_id

AnswerD

Partitioning reduces scan to relevant dates; clustering improves filtering on customer_id.

Why this answer

Partitioning by date allows pruning irrelevant partitions; clustering on customer_id orders data within partitions for efficient filtering. Clustering alone doesn't prune partitions. Ingestion-time partitioning is based on arrival time, not logical date.
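
A toy model can make the two effects concrete. In the sketch below, a dictionary keyed by date stands in for partitions and a sorted list stands in for clustered storage; the data and the `query` function are invented for illustration and do not reflect how BigQuery is implemented.

```python
from bisect import bisect_left, bisect_right
from datetime import date

# Toy table: one list of rows per date partition, each sorted by customer_id
# (the sort order stands in for clustering).
table = {
    date(2024, 1, 1): [("c1", 10), ("c2", 20), ("c3", 30)],
    date(2024, 1, 2): [("c1", 11), ("c3", 31)],
    date(2024, 1, 3): [("c2", 22), ("c2", 23), ("c4", 40)],
}

def query(table, start, end, customer_id):
    """Return matching rows and the number of rows read from the matching runs."""
    matches, touched = [], 0
    for day, rows in table.items():
        if not (start <= day <= end):
            continue  # partition pruning: skip partitions outside the date range
        keys = [r[0] for r in rows]
        lo = bisect_left(keys, customer_id)
        hi = bisect_right(keys, customer_id)
        touched += hi - lo  # clustering: jump straight to the matching run
        matches.extend(rows[lo:hi])
    return matches, touched

rows, touched = query(table, date(2024, 1, 2), date(2024, 1, 3), "c2")
print(rows, touched)
```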


853

Multi-Selecteasy

A company is developing a streaming Dataflow pipeline to process real-time sensor data. To ensure data quality, the team wants to detect malformed records and late data. Which two practices should they implement? (Choose two.)

Select 2 answers

A.Use Beam’s PAssert to validate each element in the pipeline.

B.Enable Dataflow’s built-in schema validation on the PCollection.

C.Configure a dead letter queue for unprocessable records.

D.Use Cloud Monitoring alerting on Dataflow system lag metric.

E.Run a separate batch pipeline to re-process data for validation.

AnswersC, D

A dead letter queue stores malformed records for later analysis, ensuring no data is silently lost.

Why this answer

Option C is correct because a dead letter queue (DLQ) is a standard pattern in streaming pipelines for isolating malformed or unprocessable records without blocking the main data flow. In Dataflow, this is typically implemented by writing bad records to a separate output (e.g., a Pub/Sub topic or Cloud Storage bucket) for later analysis or reprocessing. Option D is correct because the Dataflow system lag metric in Cloud Monitoring measures the time between when data enters the pipeline and when it is processed, making it an effective way to detect late data and trigger alerts for SLA violations.

Exam trap

Google Cloud often tests the misconception that PAssert can be used in production pipelines, but it is strictly a testing utility, and candidates may also confuse schema validation with Dataflow's built-in type checking, which does not exist for arbitrary record validation.
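
The dead letter pattern can be sketched in plain Python, independent of any Beam API. The record fields `sensor_id` and `reading` and both function names are assumptions made for the example; in a real pipeline the dead-letter list would be a separate output such as a Pub/Sub topic or Cloud Storage location.

```python
import json

def parse_sensor_record(raw: str) -> dict:
    """Parse one raw message; raise ValueError or TypeError if it is malformed."""
    record = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    if "sensor_id" not in record or "reading" not in record:
        raise ValueError("missing required field")
    record["reading"] = float(record["reading"])
    return record

def split_records(raw_messages):
    """Route good records to the main output and bad ones to a dead-letter list."""
    good, dead_letter = [], []
    for raw in raw_messages:
        try:
            good.append(parse_sensor_record(raw))
        except (ValueError, TypeError) as err:
            # Keep the original payload and the error so it can be replayed later.
            dead_letter.append({"raw": raw, "error": str(err)})
    return good, dead_letter

good, dead = split_records(
    ['{"sensor_id": "a", "reading": "1.5"}', "not json", '{"sensor_id": "b"}']
)
print(len(good), len(dead))
```

Keeping the raw payload alongside the error message is what makes later analysis or reprocessing possible without blocking the main flow.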


854

MCQhard

A company processes financial transactions using Cloud Dataflow. They need to ensure that late-arriving data is handled correctly for fraud detection. The pipeline uses event time processing. Which approach should they use to handle late data?

A.Sliding windows with early firing

B.Session windows with gap duration

C.Fixed windows with allowed lateness

D.Global windows with triggers

AnswerC

Allowed lateness includes late events in the correct window.

Why this answer

Option C is correct because fixed windows with allowed lateness are the standard approach in Cloud Dataflow (Apache Beam) for handling late-arriving data in event-time processing. By specifying an allowed lateness duration, the pipeline retains the window state for that period, allowing late events to be correctly assigned to their original window and triggering recomputation of results. This ensures fraud detection pipelines can account for delayed transactions without missing or misordering data.

Exam trap

Google Cloud often tests the misconception that sliding or session windows inherently handle late data, when in fact only explicit allowed lateness (or a similar mechanism) provides the necessary state retention and watermark adjustment for late-arriving events.

How to eliminate wrong answers

Option A is wrong because sliding windows with early firing are designed to produce speculative results before the window closes, not to handle late-arriving data; early firing does not extend the window to accept late events. Option B is wrong because session windows with gap duration are used to group events into sessions based on inactivity gaps, not to manage late data; they do not provide a mechanism to accept events that arrive after the session has closed. Option D is wrong because global windows with triggers are typically used for unbounded aggregations where all data belongs to a single window, but they do not naturally handle late-arriving data within specific time boundaries required for fraud detection; they lack the per-window lateness cutoff that fixed windows offer.
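
The effect of allowed lateness on a fixed window can be modeled with simple arithmetic. The sketch below is a simplified model, not Beam code: times are plain integers in seconds, the function names are hypothetical, and the exact boundary handling in a real runner may differ.

```python
def window_start(event_time: int, size: int) -> int:
    """Start of the fixed window [start, start + size) containing event_time."""
    return event_time - (event_time % size)

def accept(event_time: int, watermark: int, size: int, allowed_lateness: int):
    """Return (window_start, status) for one event given the current watermark."""
    start = window_start(event_time, size)
    end = start + size
    if watermark < end:
        return start, "on_time"
    if watermark < end + allowed_lateness:
        return start, "late_but_accepted"  # window state is still retained
    return start, "dropped"  # window state has been discarded

# 60-second windows with 30 seconds of allowed lateness.
for event_time, watermark in [(65, 100), (65, 130), (65, 151)]:
    print(event_time, watermark, accept(event_time, watermark, 60, 30))
```

In all three cases the event maps to the window starting at 60; only the watermark position decides whether it still counts.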


855

MCQhard

An e-commerce company deploys a recommendation model on Vertex AI Endpoints. The endpoint receives a high volume of requests with a large payload. They notice high latency and occasional timeouts. Which action should they take to improve performance without sacrificing accuracy?

A.Enable request batching on the endpoint

B.Switch to a smaller machine type

C.Reduce the model size by pruning

D.Increase the number of replicas

AnswerA

Batching improves throughput by combining requests, reducing overhead and latency without affecting model accuracy.

Why this answer

Enabling request batching on the Vertex AI endpoint allows multiple inference requests to be grouped into a single prediction call, reducing per-request overhead and improving throughput. This directly addresses high latency and timeouts caused by a high volume of large payloads without altering the model or its accuracy.

Exam trap

Google Cloud often tests the misconception that scaling replicas or reducing model size is the default fix for latency, but the trap here is that batching addresses throughput without sacrificing accuracy, whereas pruning or smaller machines would degrade performance or accuracy.

How to eliminate wrong answers

Option B is wrong because switching to a smaller machine type reduces compute resources, which would increase latency and worsen timeouts under high request volume. Option C is wrong because reducing model size by pruning can degrade prediction accuracy, which the question explicitly states must not be sacrificed. Option D is wrong because increasing the number of replicas adds cost and may not resolve timeouts if the bottleneck is per-request processing overhead rather than concurrency limits.


856

MCQhard

A company uses Kafka on Dataproc to ingest streaming data. They want to process the data with Spark Structured Streaming and write results to BigQuery. The team is using Dataproc clusters. Which approach minimizes cost while maintaining performance?

A.Use a Dataproc cluster with all preemptible VMs

B.Use a single-node Dataproc cluster

C.Use a Dataproc cluster with standard master nodes and preemptible worker nodes

D.Use a Dataproc cluster with standard nodes and enable autoscaling

AnswerC

Workers can be preemptible; master should be standard for stability.

Why this answer

Preemptible VMs are cost-effective for worker nodes; master nodes should be standard for reliability.


857

Multi-Selectmedium

A company needs to stream real-time user activity data from their application into BigQuery for immediate dashboarding. They want to minimize latency (under 5 seconds) and ensure exactly-once delivery. Which TWO options should they consider? (Choose 2)

Select 2 answers

A.Use Cloud Functions to receive events and call the BigQuery REST API

B.Use BigQuery Storage Write API in committed mode

C.Use BigQuery legacy streaming inserts directly from the application

D.Use Apache Kafka on Dataproc and write to BigQuery via the BigQuery Kafka connector

E.Stream data to Pub/Sub, then use Dataflow to write to BigQuery with exactly-once guarantees

AnswersB, E

Committed mode provides exactly-once semantics and low latency (sub-second).

Why this answer

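Option B is correct because the BigQuery Storage Write API in committed mode provides exactly-once delivery semantics and low-latency streaming (typically under 5 seconds) directly into BigQuery. It is designed for real-time ingestion with strong consistency guarantees, making it ideal for immediate dashboarding. Option E is also correct: Pub/Sub buffers the events, and a streaming Dataflow job can write them to BigQuery with exactly-once guarantees, at the cost of running a pipeline.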

Exam trap

Google Cloud often tests the distinction between legacy streaming inserts (at-least-once) and the Storage Write API (exactly-once), and candidates mistakenly choose legacy inserts because they are simpler to implement, ignoring the exactly-once requirement.
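
One way to picture how offset-based appends avoid duplicates is the toy stream below. It is a simplified stand-in, not the Storage Write API client: the class `ToyCommittedStream` and its return values are invented for illustration, on the assumption that the writer supplies the offset at which each batch should land.

```python
class ToyCommittedStream:
    """Simplified stand-in for a write stream that tracks row offsets."""

    def __init__(self):
        self.rows = []

    def append(self, batch, offset):
        """Append batch at offset; ignore it if that offset was already written."""
        if offset < len(self.rows):
            return "duplicate_ignored"  # a retry of a batch that already landed
        if offset > len(self.rows):
            raise ValueError("gap: offset is beyond the end of the stream")
        self.rows.extend(batch)
        return "appended"

stream = ToyCommittedStream()
print(stream.append(["r0", "r1"], offset=0))
print(stream.append(["r0", "r1"], offset=0))  # client retry after a timeout
print(stream.append(["r2"], offset=2))
print(stream.rows)
```

A retry after a timeout carries the same offset as the first attempt, so the stream can tell it apart from new data.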


858

MCQeasy

A data engineer needs to automatically delete objects from a Cloud Storage bucket after 30 days and archive them to nearline storage after 7 days. Which configuration should they use?

A.Set a lifecycle rule to SetStorageClass to nearline after 30 days only

B.Set a lifecycle rule to delete objects after 7 days only

C.Set a lifecycle rule to SetStorageClass to nearline after 7 days and delete after 30 days

D.Set a lifecycle rule to delete objects after 7 days and SetStorageClass to nearline after 30 days

AnswerC

Correct: archive after 7 days, delete after 30.

Why this answer

Option C is correct because it implements a lifecycle rule that first transitions objects to Nearline storage after 7 days (reducing costs for infrequently accessed data) and then deletes them after 30 days. This matches the requirement to archive after 7 days and delete after 30 days, using the `SetStorageClass` and `Delete` actions in the correct chronological order.

Exam trap

Google Cloud often tests the order of lifecycle actions: candidates mistakenly think deletion should come before archiving, but the correct sequence is to archive first (to reduce cost) and delete later, as objects cannot be archived after deletion.

How to eliminate wrong answers

Option A is wrong because it only sets the storage class to Nearline after 30 days, missing the deletion requirement entirely and incorrectly archiving after 30 days instead of 7. Option B is wrong because it only deletes objects after 7 days, ignoring the archive-to-Nearline step and deleting data too early. Option D is wrong because it reverses the order: it deletes objects after 7 days (before they can be archived) and then attempts to set storage class to Nearline after 30 days, which is impossible since the objects are already deleted.
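
A lifecycle configuration matching option C might look like the following. The dictionary follows the JSON shape used by Cloud Storage lifecycle configuration files; the helper `eligible_actions` is hypothetical and only checks which rules' age conditions are met, ignoring any precedence between actions.

```python
import json

# Two rules: move to Nearline at 7 days, delete at 30 days.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 7},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 30},
        },
    ]
}

def eligible_actions(config, age_days):
    """List the action types whose age condition is met for an object."""
    return [
        rule["action"]["type"]
        for rule in config["rule"]
        if age_days >= rule["condition"]["age"]
    ]

print(json.dumps(lifecycle_config, indent=2))
print(eligible_actions(lifecycle_config, 10))
```

The printed JSON could be saved to a file and applied to a bucket with a tool such as `gsutil lifecycle set`.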


859

MCQmedium

A company wants to automate model retraining and deployment whenever new training data becomes available. Which service should be used to orchestrate the end-to-end workflow?

A.Cloud Build

B.Vertex AI Pipelines

C.Cloud Scheduler

D.Cloud Composer

AnswerB

Designed for ML pipeline orchestration with prebuilt components.

Why this answer

Vertex AI Pipelines is the correct choice because it is a managed service specifically designed to orchestrate and automate end-to-end ML workflows, including model retraining and deployment triggered by new data. It allows you to define pipelines as a directed acyclic graph (DAG) of steps using the Kubeflow Pipelines SDK or pre-built components, and it integrates natively with other Vertex AI services for training, evaluation, and deployment.

Exam trap

The trap here is that candidates often confuse Cloud Composer (a general-purpose Airflow service) with Vertex AI Pipelines, but the exam expects you to recognize that Vertex AI Pipelines is the ML-specific, fully managed solution for end-to-end ML workflow orchestration, while Cloud Composer requires more manual setup and lacks native Vertex AI integration.

How to eliminate wrong answers

Option A is wrong because Cloud Build is a CI/CD service focused on building, testing, and deploying software artifacts (e.g., container images), not on orchestrating ML workflows with steps like data validation, model training, and deployment. Option C is wrong because Cloud Scheduler is a cron job service that triggers actions on a time-based schedule, not on the event of new training data becoming available, and it lacks the workflow orchestration capabilities needed for complex ML pipelines. Option D is wrong because Cloud Composer is a managed Apache Airflow service that can orchestrate workflows, but it is a general-purpose workflow orchestrator, not purpose-built for ML pipelines; Vertex AI Pipelines provides tighter integration with Vertex AI components, managed execution, and artifact tracking, making it the more appropriate choice for this specific ML automation scenario.


860

MCQhard

You are optimizing a BigQuery query that runs on a large table (hundreds of TB). The table is partitioned by date and frequently queried with filters on a specific customer_id column and date range. Queries are slow even after partitioning. Which optimization should you apply?

A.Increase the number of BigQuery slots

B.Columnar clustering on customer_id

C.Create materialized views for each customer

D.Denormalize the table to reduce joins

AnswerB

Clustering sorts data within each partition by customer_id, enabling block pruning for queries filtering on that column.

Why this answer

Clustering on customer_id sorts the data within each date partition, so BigQuery can prune blocks that cannot contain the requested customer. Partitioning alone only helps the date filter. Materialized views can speed up pre-aggregated queries but not ad hoc customer_id filters, and one view per customer does not scale. Denormalization reduces joins, but the slow queries here are filters rather than joins. Increasing slots adds cost and does not change how much data each query has to scan.


861

MCQmedium

A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?

A.Use Pub/Sub with exactly-once delivery enabled and Dataflow with exactly-once processing

B.Use Pub/Sub with a unique message ID and Dataflow with idempotent writes or Dataflow's exactly-once sink

C.Use Pub/Sub with message deduplication and Dataflow with at-least-once processing

D.Use Pub/Sub with a dead letter topic and Dataflow with automatic retries

AnswerB

By using a unique ID, you can deduplicate in Dataflow. Dataflow's exactly-once sinks also help ensure no duplicates.

Why this answer

Pub/Sub delivers messages at least once by default, so a subscriber can receive the same message more than once. To achieve exactly-once processing, the pipeline must either make its writes idempotent or use one of Dataflow's exactly-once sinks. Deduplicating on a unique message ID is a common way to do this.
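
Deduplication by unique ID can be sketched as follows. The message shape (`id`, `data`) and the function `process_once` are assumptions for the example; a real pipeline would keep the seen IDs in managed state, not in an in-memory set.

```python
def process_once(messages, seen_ids, sink):
    """Apply each message to sink at most once, keyed by its unique ID."""
    for msg in messages:
        if msg["id"] in seen_ids:
            continue  # redelivery of a message that was already processed
        sink.append(msg["data"])
        seen_ids.add(msg["id"])

seen, sink = set(), []
# At-least-once delivery: message "m1" arrives twice.
deliveries = [
    {"id": "m1", "data": 10},
    {"id": "m2", "data": 20},
    {"id": "m1", "data": 10},
]
process_once(deliveries, seen, sink)
print(sink)
```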


862

MCQeasy

A company uses Dataflow to process streaming data from Pub/Sub. They notice increased processing latency. What is the most likely cause?

A.Insufficient workers

B.Pub/Sub subscription issue

C.Too many shards

D.Wrong machine type

AnswerA

Insufficient workers create backpressure and increased latency as the pipeline cannot keep up with throughput.

Why this answer

In Dataflow, processing latency increases most commonly due to insufficient workers, as the streaming pipeline cannot keep up with the incoming data rate when the number of Compute Engine instances is too low. This causes backpressure from Pub/Sub, leading to growing unacknowledged messages and higher end-to-end latency. Autoscaling may be delayed or limited by max worker count settings, making manual or configuration-based worker scaling the primary corrective action.

Exam trap

Google Cloud often tests the misconception that Pub/Sub subscription issues (like ack deadline) are the primary cause of latency, but the trap here is that latency in Dataflow is almost always a worker scaling problem, not a Pub/Sub configuration issue.

How to eliminate wrong answers

Option B is wrong because a Pub/Sub subscription issue (e.g., expired pull request or misconfigured ack deadline) would cause message delivery failures or duplicates, not a gradual increase in processing latency across the pipeline. Option C is wrong because too many shards (i.e., excessive parallelism) can cause overhead but typically leads to underutilization or increased cost, not increased latency; latency from too many shards is rare and usually secondary to worker count. Option D is wrong because the wrong machine type (e.g., low CPU or memory) could degrade per-worker performance, but the most likely and direct cause of increased latency in a streaming Dataflow job is insufficient worker count, not machine type, as Dataflow’s autoscaling primarily adjusts worker count rather than machine type.


863

MCQhard

A data engineer needs to split time-series data for training a forecasting model. The data is sorted by timestamp. The engineer wants to avoid leakage where future data influences training. Which data splitting approach should they use?

A.Use k-fold cross-validation with random assignment

B.Use stratified splitting on the target variable

C.Perform a random 80/20 split on the entire dataset

D.Use a time-series aware split: first 80% of data by timestamp for training, last 20% for testing

AnswerD

This preserves temporal order and avoids leakage.

Why this answer

For time series, the safe split trains on an earlier contiguous block and tests on a later one, preserving temporal order. A random split leaks future observations into training. K-fold cross-validation on time series needs a special scheme such as forward chaining, not standard random k-fold. Stratified splitting balances target classes and does nothing to preserve time order.
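
A chronological split takes only a few lines. This is a minimal sketch, assuming each row is a dictionary with a sortable `ts` field; the function name `time_series_split` is hypothetical.

```python
def time_series_split(rows, train_fraction=0.8):
    """Split rows into (train, test) by timestamp order, oldest first."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

rows = [{"ts": t, "y": t * 2} for t in range(10)]
train, test = time_series_split(rows)
# Every training timestamp precedes every test timestamp, so nothing leaks.
print(len(train), len(test), train[-1]["ts"], test[0]["ts"])
```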


864

Multi-Selectmedium

A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)

Select 3 answers

A.Dataflow

B.Pub/Sub

C.Kafka Connect (with Pub/Sub connector)

D.Cloud NAT

E.Cloud Functions

AnswersA, B, C

Dataflow processes the streaming data for aggregation.

Why this answer

Pub/Sub is the target messaging service and Dataflow performs the stream aggregation, so the scenario itself requires both. Kafka Connect with the Pub/Sub connector is a standard, configuration-only way to stream data from Kafka into Pub/Sub, which satisfies the 'minimal custom code' requirement.

Dataflow could instead read from Kafka directly through its Kafka I/O connector, but that means writing pipeline code for the ingestion step and bypasses Pub/Sub for it. Cloud Functions could relay messages, but it requires custom code and adds complexity. Cloud NAT is a networking service and plays no part in moving data from Kafka to Pub/Sub.


865

MCQhard

A company runs a critical real-time data pipeline using Dataflow that ingests events from Cloud Pub/Sub, performs aggregations using sliding windows, and writes results to BigQuery. The pipeline is deployed in us-central1. The pipeline's latency has increased recently, and the Dataflow monitoring shows that the 'system lag' metric is consistently above 5 minutes. The pipeline is using Streaming Engine and has 10 workers with 4 vCPUs each. The pipeline processes approximately 100,000 events per second. The team has verified that the source Pub/Sub topic has sufficient publish throughput and the BigQuery table has no quota issues. The pipeline logs show that some workers are experiencing GC overhead limit exceeded errors. The pipeline code uses stateful processing with a custom keyed state for deduplication. What is the most likely cause of the increased latency?

A.The number of workers is insufficient; increasing to 20 workers will reduce latency.

B.The stateful processing is causing large state sizes that lead to GC overhead; use a more efficient state backend or increase worker memory.

C.The sliding window duration is too long; reducing it to 1 minute will improve performance.

D.The deduplication logic is causing a bottleneck; removing it will reduce latency.

AnswerB

GC overhead indicates memory pressure from large state; shrinking the state footprint or increasing worker memory addresses it.

Why this answer

The GC overhead limit exceeded errors indicate that workers are spending too much time garbage collecting, which is a classic symptom of excessive heap memory usage. Stateful processing with custom keyed state for deduplication can cause large per-key state sizes, especially with sliding windows that maintain overlapping state for each key. This forces the JVM to constantly garbage collect, increasing system lag beyond 5 minutes.

Using a more efficient state backend (e.g., reducing state size or using Dataflow's built-in deduplication) or increasing worker memory directly addresses the root cause.

Exam trap

Google Cloud often tests the misconception that scaling workers (Option A) is the universal fix for latency, when in reality memory-related issues like GC overhead require tuning state management or worker resources, not just parallelism.

How to eliminate wrong answers

Option A is wrong because adding workers treats the symptom: it spreads keys across more machines but leaves the growth of the deduplication state in place, so memory pressure returns as the state grows. Option C is wrong because the logs do not implicate the window duration; shortening the window changes the aggregation results and leaves the custom deduplication state untouched. Option D is wrong because removing deduplication would compromise data correctness; the bottleneck is not the logic itself but the memory footprint of the state, which can be mitigated without removing the feature.
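
One way to keep deduplication state from growing without bound is to expire old IDs. The sketch below is plain Python, not Beam code: the class `BoundedDeduper` is hypothetical and assumes timestamps arrive in non-decreasing order. In a Beam pipeline the same idea would typically be expressed with keyed state plus an expiry timer.

```python
from collections import OrderedDict

class BoundedDeduper:
    """Remember event IDs only for ttl seconds so state cannot grow forever."""

    def __init__(self, ttl: int):
        self.ttl = ttl
        self.seen = OrderedDict()  # event_id -> time first seen, oldest first

    def is_duplicate(self, event_id: str, now: int) -> bool:
        # Evict IDs that are at least ttl seconds old before checking.
        while self.seen:
            oldest_id, first_seen = next(iter(self.seen.items()))
            if now - first_seen < self.ttl:
                break
            del self.seen[oldest_id]
        if event_id in self.seen:
            return True
        self.seen[event_id] = now
        return False

dedup = BoundedDeduper(ttl=60)
results = [
    dedup.is_duplicate("a", now=0),
    dedup.is_duplicate("a", now=10),   # inside the TTL: duplicate
    dedup.is_duplicate("b", now=20),
    dedup.is_duplicate("a", now=70),   # "a" has expired, so it is treated as new
]
print(results, len(dedup.seen))
```

The trade-off is explicit: a duplicate arriving after the TTL is no longer caught, so the TTL should cover the longest redelivery delay the pipeline expects.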


866

MCQmedium

A company is deploying a large-scale streaming application on Google Kubernetes Engine. They need to ensure the application can handle sudden traffic spikes without dropping data. Which architectural pattern is most appropriate?

A.Implement custom retry logic with exponential backoff in the application.

B.Use Cloud SQL as a temporary buffer and process from there.

C.Pre-provision 3x the expected peak capacity to handle spikes.

D.Use a Pub/Sub topic as a buffer and autoscale consumer pods based on Pub/Sub subscription backlog.

AnswerD

Pub/Sub provides a highly scalable buffer; autoscaling consumers based on backlog ensures capacity matches demand.

Why this answer

Option D is correct because Pub/Sub provides a durable, scalable, and asynchronous message buffer that decouples the producer from the consumer. By autoscaling consumer pods on the Pub/Sub subscription backlog (for example, exposing 'pubsub.googleapis.com/subscription/num_undelivered_messages' as an external metric to the Horizontal Pod Autoscaler), the application can elastically handle traffic spikes without data loss, because messages are persisted until acknowledged.

Exam trap

The trap here is that candidates confuse buffering with retry logic or database storage, failing to recognize that Pub/Sub is the Google Cloud-native service specifically designed for decoupling and buffering in event-driven architectures.

How to eliminate wrong answers

Option A is wrong because custom retry logic with exponential backoff addresses transient failures but does not provide a buffer for sudden traffic spikes; if the producer outpaces the consumer, data is still dropped or rejected. Option B is wrong because Cloud SQL is not designed as a message buffer; it is a relational database with limited throughput and connection scaling, and using it as a temporary buffer would create a bottleneck and risk data loss under high load. Option C is wrong because pre-provisioning 3x the expected peak capacity leads to significant cost overprovisioning and still cannot guarantee handling of unexpected spikes beyond that factor; it violates the cloud-native principle of elastic scaling.

867

MCQmedium

A company uses Looker to define business logic in LookML. They need to create a new measure that calculates the average order value, defined as total revenue divided by number of orders. Which LookML syntax should they use?

A.measure: avg_order_value { type: sum; sql: ${revenue} / ${order_count} ;; }

B.measure: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }

C.dimension: avg_order_value { type: number; sql: ${revenue} / ${order_count} ;; }

D.dimension: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }

AnswerB

Correct syntax for a measure that computes an average of a ratio.

Why this answer

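A measure in LookML is declared with a type and a sql expression. Average order value is an aggregate, so it must be a measure rather than a dimension, which rules out options C and D. Of the two measure options, B uses type: average: measure: avg_order_value { type: average; sql: ${revenue} / ${order_count} ;; }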

868

MCQhard

A company's Dataflow pipeline uses the PubSubIO source to read messages and writes to BigQuery via the BigQueryIO sink. The pipeline is running in Streaming mode with exactly-once semantics enabled. Occasionally, duplicate rows appear in BigQuery. What is the most likely reason?

A.The user-provided record ID for deduplication in BigQuery's streaming inserts is not being set for all messages, leading to duplicate rows.

B.The pipeline is using the WriteResult method with WRITE_APPEND in batch mode, which can cause duplicates if retries happen.

C.The pipeline is experiencing the 'dataflow streaming log processing' bug, causing duplicate logs to be written.

D.The PubSubIO source is configured with a dead-letter queue and messages are being redelivered without proper deduplication.

AnswerA

BigQueryIO uses insertId for deduplication; if it's missing or inconsistent, duplicates can occur.

Why this answer

In Dataflow streaming pipelines with exactly-once semantics, BigQuery's streaming inserts use user-provided record IDs for deduplication. If the record ID is not set for all messages, BigQuery cannot identify duplicates, and retries or redeliveries from Pub/Sub can result in duplicate rows. This is the most common cause of duplicates in this scenario.
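
Why a missing or unstable record ID leads to duplicates can be shown with a toy model. The sketch below is plain Python, not the BigQuery client: `stable_insert_id` and `BestEffortDedupSink` are hypothetical names, and the sink remembers IDs forever, whereas the real service deduplicates on a best-effort, time-bounded basis.

```python
import hashlib
import json
from typing import Optional


def stable_insert_id(message_id: str, payload: dict) -> str:
    """Derive the same ID for the same message on every retry (hypothetical helper)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{message_id}|{canonical}".encode("utf-8"))
    return digest.hexdigest()[:32]


class BestEffortDedupSink:
    """Toy stand-in for a sink that drops rows whose ID it has already seen."""

    def __init__(self) -> None:
        self.rows = []
        self._seen = set()

    def insert(self, row: dict, insert_id: Optional[str]) -> None:
        # Rows without an ID cannot be recognised as retries, so they always land.
        if insert_id is not None:
            if insert_id in self._seen:
                return
            self._seen.add(insert_id)
        self.rows.append(row)


if __name__ == "__main__":
    sink = BestEffortDedupSink()
    row = {"user": "u1", "amount": 5}
    rid = stable_insert_id("m-1", row)
    sink.insert(row, rid)
    sink.insert(row, rid)    # retry with the same ID: dropped
    sink.insert(row, None)   # retry without an ID: duplicated
    print(len(sink.rows))
```

The design point is that the ID must be derived from the message itself, so every retry of the same message carries the same value.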

Exam trap

Google Cloud often tests the misconception that exactly-once semantics in Dataflow automatically deduplicates at the sink, but in reality, BigQuery requires explicit user-provided record IDs for deduplication during streaming inserts.

How to eliminate wrong answers

Option B is wrong because WRITE_APPEND in batch mode is not relevant to a streaming pipeline with exactly-once semantics; the question specifies streaming mode, and batch mode duplicates would not explain streaming-specific behavior. Option C is wrong because there is no known 'dataflow streaming log processing' bug that causes duplicate logs; this is a fabricated term. Option D is wrong because a dead-letter queue handles failed messages after retries are exhausted, not redelivery; Pub/Sub redelivery without deduplication is already addressed by the user-provided record ID mechanism, and the dead-letter queue does not cause duplicates.

869

MCQmedium

A company has a trained model stored in Vertex AI Model Registry. They want to automate retraining when new training data arrives in Cloud Storage. Which approach is most efficient?

A.Use Cloud Functions triggered by Cloud Storage events to start a Vertex AI Training job

B.Use Dataflow to continuously update the model

C.Use Cloud Scheduler to trigger a Cloud Build retraining step

D.Schedule a weekly Cloud Composer DAG to check for new data and retrain

AnswerA

Cloud Functions provide real-time event-driven triggers to initiate retraining immediately when new data appears.

Why this answer

Cloud Functions can be triggered directly by Cloud Storage events (e.g., object finalize) and can then start a Vertex AI training job through the Vertex AI API. This creates an event-driven, serverless pipeline that retrains the model as soon as new data arrives, without polling or manual intervention, making it the most efficient and cost-effective approach.
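
The shape of such a handler can be sketched without any cloud libraries. This is a minimal outline under stated assumptions: the event is a dict carrying the object's `bucket` and `name`, `submit_job` is a hypothetical callable standing in for whatever Vertex AI client call the team uses, and the job spec fields are illustrative rather than a real API schema.

```python
from typing import Callable, Optional


def on_object_finalized(event: dict,
                        submit_job: Callable[[dict], str],
                        prefix: str = "training-data/") -> Optional[str]:
    """Start a retraining job when a new training file lands.

    `submit_job` is a hypothetical stand-in for the real training-job call;
    it receives an illustrative job spec and returns a job identifier.
    """
    bucket = event.get("bucket")
    name = event.get("name", "")
    # Ignore events for objects outside the training-data prefix.
    if not bucket or not name.startswith(prefix):
        return None
    job_spec = {
        "display_name": f"retrain-{name.rsplit('/', 1)[-1]}",
        "training_data_uri": f"gs://{bucket}/{name}",
    }
    return submit_job(job_spec)


if __name__ == "__main__":
    submitted = []

    def fake_submit(spec: dict) -> str:
        submitted.append(spec)
        return "job-1"

    print(on_object_finalized({"bucket": "b", "name": "training-data/day1.csv"}, fake_submit))
    print(on_object_finalized({"bucket": "b", "name": "logs/app.log"}, fake_submit))
```

Injecting the submit call keeps the trigger logic testable on its own; the prefix filter avoids retraining on unrelated uploads to the same bucket.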

Exam trap

Google Cloud often tests the distinction between event-driven (Cloud Functions) and scheduled (Cloud Scheduler, Cloud Composer) approaches, and candidates mistakenly choose a scheduled option thinking it is simpler, missing the requirement for immediate reaction to new data.

How to eliminate wrong answers

Option B is wrong because Dataflow is a stream/batch data processing service for transforming data, not for orchestrating model retraining; it would require custom code to trigger training and lacks native integration with Vertex AI Model Registry. Option C is wrong because Cloud Scheduler triggers jobs on a fixed schedule, not on data arrival events, so it cannot react to new data in real time and may waste resources on unnecessary retraining. Option D is wrong because a weekly Cloud Composer DAG introduces latency (up to a week) and operational overhead for a simple event-driven task, and it is less efficient than a serverless function that fires instantly on data arrival.

870

Multi-Selecthard

Which THREE actions reduce the cost of a Cloud Composer environment?

Select 3 answers

A.Delete old and unused DAG files to reduce scheduler load

B.Use standard network tier instead of premium

C.Set up a maintenance window to shut down the environment during idle hours

D.Use a smaller environment size (e.g., small instead of medium)

E.Increase the number of schedulers for higher throughput

AnswersA, C, D

Less load means fewer resources needed.

Why this answer

Option A is correct because deleting old and unused DAG files reduces the number of DAGs the scheduler must parse and evaluate. The Cloud Composer scheduler scans the DAG folder every 30 seconds by default; fewer DAG files mean lower CPU and memory consumption, directly reducing the cost of the environment's compute resources.

Exam trap

The trap here is that candidates confuse scaling up (Option E) with cost optimization, not realizing that adding schedulers increases resource consumption and cost, while the correct cost-saving actions involve reducing resource usage or shutting down idle capacity.

871

MCQhard

You have a BigQuery table 'events' with a TIMESTAMP column 'event_time'. You need to compute, for each event, the difference in seconds from the previous event of the same user. Which window function should you use?

A.FIRST_VALUE(event_time) OVER (PARTITION BY user_id ORDER BY event_time)

B.LEAD(event_time) OVER (PARTITION BY user_id ORDER BY event_time)

C.LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)

D.ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time)

AnswerC

LAG accesses the previous event, then you can use TIMESTAMP_DIFF to compute difference.

Why this answer

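LAG() returns a value from the previous row in the partition, so LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) yields the previous event's timestamp for the same user. Passing event_time and that value to TIMESTAMP_DIFF with the SECOND part gives the gap in seconds. LEAD() reads the next row instead, FIRST_VALUE() returns the first row of the window frame, and ROW_NUMBER() only numbers the rows, so none of them gives the previous event.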
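
What LAG combined with TIMESTAMP_DIFF computes can be mirrored in plain Python. This is a sketch of the semantics only, not BigQuery code; the function name and the sample data are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def seconds_since_previous(events):
    """Emulate LAG(event_time) per user, then the difference in seconds.

    `events` is an iterable of (user_id, event_time) pairs. Returns a list of
    (user_id, event_time, gap_seconds), where gap_seconds is None for a user's
    first event, just as LAG yields NULL on the first row of a partition.
    """
    by_user = defaultdict(list)          # PARTITION BY user_id
    for user_id, event_time in events:
        by_user[user_id].append(event_time)

    result = []
    for user_id in sorted(by_user):
        previous = None
        for event_time in sorted(by_user[user_id]):   # ORDER BY event_time
            gap = None if previous is None else int((event_time - previous).total_seconds())
            result.append((user_id, event_time, gap))
            previous = event_time
    return result


if __name__ == "__main__":
    sample = [
        ("u1", datetime(2024, 1, 1, 10, 0, 30)),
        ("u2", datetime(2024, 1, 1, 10, 5, 0)),
        ("u1", datetime(2024, 1, 1, 10, 0, 0)),
    ]
    for row in seconds_since_previous(sample):
        print(row)
```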

872

MCQhard

A company uses Vertex AI Pipelines to orchestrate ML workflows. They want to automatically retrain the model when new data arrives, but only if the model's performance drops below a threshold. Which approach is best?

A.Use BigQuery scheduled queries to trigger pipeline

B.Trigger a pipeline on a schedule

C.Use Vertex AI Model Monitor to detect skew and trigger retraining

D.Use Cloud Functions to evaluate performance and trigger pipeline

AnswerC

Model Monitor can detect performance degradation and automatically trigger retraining pipelines.

Why this answer

Option C is correct because Vertex AI Model Monitor is specifically designed to detect prediction drift and data skew in deployed models. When the monitor identifies that model performance has dropped below a defined threshold, it can automatically trigger a retraining pipeline via a Cloud Function or Pub/Sub notification, ensuring retraining occurs only when necessary rather than on a fixed schedule.

Exam trap

Google Cloud often tests the distinction between scheduled retraining (Option B) and event-driven retraining triggered by actual model degradation (Option C), where candidates mistakenly choose a schedule-based approach because they overlook the requirement to retrain 'only if' performance drops below a threshold.

How to eliminate wrong answers

Option A is wrong because BigQuery scheduled queries are used for running SQL queries on a schedule, not for triggering ML pipelines based on model performance metrics. Option B is wrong because triggering a pipeline on a schedule would retrain the model at fixed intervals regardless of whether performance has degraded, wasting resources and potentially deploying unnecessary model versions. Option D is wrong because while Cloud Functions can evaluate performance and trigger a pipeline, this approach requires custom code to monitor model performance and lacks the built-in skew/drift detection capabilities that Vertex AI Model Monitor provides out-of-the-box.

873

MCQhard

A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?

A.Use a side input to read one stream as a map and enrich the other stream

B.Use Flatten to merge the streams and then Partition

C.Use CoGroupByKey on the two streams using a common key like user_id

D.Use Union to combine both streams into one and then apply GroupByKey

AnswerC

CoGroupByKey joins multiple streams by key within the same window.

Why this answer

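CoGroupByKey joins multiple PCollections by key, so with user_id as the common key the two streams can be joined within the same window. A side input suits enriching a stream with a small, slowly changing dataset rather than joining two unbounded streams. Flatten only merges PCollections of the same type into one without matching elements by key, and Beam's core SDK has no Union transform; Flatten plays that role.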
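
The result shape of a CoGroupByKey join can be illustrated in plain Python. This sketch does not use the Beam SDK and ignores windowing; `co_group_by_key` and the stream names are hypothetical, and the point is only that every key ends up with one list of values per input.

```python
from collections import defaultdict


def co_group_by_key(**tagged):
    """Group several keyed collections by key.

    `tagged` maps an input name to an iterable of (key, value) pairs.
    Returns {key: {input_name: [values]}}; inputs with no value for a key
    contribute an empty list, as in a full outer join.
    """
    grouped = defaultdict(lambda: {name: [] for name in tagged})
    for name, pairs in tagged.items():
        for key, value in pairs:
            grouped[key][name].append(value)
    return dict(grouped)


if __name__ == "__main__":
    logins = [("u1", "login@10:00"), ("u2", "login@10:01")]
    payments = [("u1", "pay:20"), ("u1", "pay:900")]
    for user_id, groups in co_group_by_key(logins=logins, payments=payments).items():
        print(user_id, groups)
```

A fraud rule can then inspect both lists for a user together, which Flatten alone would not allow because it never matches elements by key.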

874

MCQmedium

Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?

A.Cloud Data Fusion

B.Cloud Dataflow

C.Cloud Dataproc Serverless

D.Cloud Dataproc (standard cluster)

AnswerD

Dataproc standard clusters support both batch and streaming Spark jobs with minimal code changes. It is managed, though not fully serverless.

Why this answer

Cloud Dataflow uses Apache Beam, which is a different programming model from Spark, so it would require a rewrite. Dataproc is the managed Spark service that runs existing Spark code with minimal changes. Dataproc Serverless removes cluster management, but it currently supports only batch workloads, not streaming.

The question asks for batch now and streaming in the future. A standard Dataproc cluster supports both, using Spark Structured Streaming for the streaming side, although it is not fully serverless. The best answer is therefore Dataproc (standard cluster).

875

MCQhard

A healthcare company processes patient data using a Dataflow pipeline that reads from Cloud Storage, transforms data, and writes to BigQuery. They need to ensure that the processing is idempotent to handle failures and retries without duplicating records. The data arrives in daily batches and may be re-delivered if earlier processing failed. What approach should they take to guarantee exactly-once processing in BigQuery?

A.Use BigQuery's streaming inserts with InsertId to deduplicate

B.Ingest data via Pub/Sub and use a Dataflow pipeline with exactly-once processing

C.Use Dataflow's built-in exactly-once semantics and write to BigQuery via load jobs

D.Write data to a staging BigQuery table, then use a MERGE statement to upsert into the final table

AnswerD

MERGE ensures idempotency by matching on unique keys.

Why this answer

Option D is correct because BigQuery load jobs are not idempotent by default; if a load job is retried, it can create duplicate rows. By writing to a staging table first and then using a MERGE statement (or an INSERT that filters with WHERE NOT EXISTS) to upsert into the final table, you can deduplicate on a unique key. This approach gives exactly-once results even when the same batch is re-delivered, because the MERGE only inserts rows that do not already exist in the target table.
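
The staging-plus-MERGE pattern can be tried locally. The sketch below uses SQLite from the Python standard library as a stand-in for BigQuery, with `INSERT OR REPLACE` playing the role of MERGE; the `MERGE_SQL` string shows what the BigQuery statement might look like, and every dataset, table, and column name in it is hypothetical.

```python
import sqlite3

# Illustrative BigQuery statement; dataset, table, and column names are hypothetical.
MERGE_SQL = """
MERGE dataset.final AS t
USING dataset.staging AS s
ON t.record_id = s.record_id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT (record_id, payload) VALUES (s.record_id, s.payload)
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a staging and a final table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (record_id TEXT, payload TEXT)")
    conn.execute("CREATE TABLE final (record_id TEXT PRIMARY KEY, payload TEXT)")
    return conn


def apply_batch(conn: sqlite3.Connection, batch) -> None:
    """Load a batch into staging, then upsert it into final keyed on record_id."""
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)
    # SQLite stand-in for MERGE: replace the row on a primary-key conflict.
    conn.execute("INSERT OR REPLACE INTO final SELECT record_id, payload FROM staging")
    conn.commit()


if __name__ == "__main__":
    conn = make_db()
    batch = [("r1", "a"), ("r2", "b")]
    apply_batch(conn, batch)
    apply_batch(conn, batch)  # the same batch re-delivered after a failure
    print(conn.execute("SELECT COUNT(*) FROM final").fetchone()[0])
```

Re-applying the same batch leaves the final table unchanged, which is the idempotency property the question asks for.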

Exam trap

The trap here is that candidates often assume Dataflow's exactly-once semantics automatically extend to the sink (BigQuery), but in reality, BigQuery load jobs are not idempotent, so you must implement a deduplication strategy like staging + MERGE to guarantee exactly-once processing.

How to eliminate wrong answers

Option A is wrong because BigQuery streaming inserts with InsertId provide best-effort deduplication within the streaming buffer, but duplicates can still occur if the InsertId is reused after the deduplication window (typically a few minutes) or if the insert fails and is retried with a different InsertId. Option B is wrong because Pub/Sub with Dataflow's exactly-once processing ensures that each message is processed exactly once within the pipeline, but it does not guarantee idempotent writes to BigQuery; if the pipeline fails after writing to BigQuery but before acknowledging the message, a retry could cause duplicate rows. Option C is wrong because Dataflow's built-in exactly-once semantics apply to the pipeline's internal state and shuffle operations, but BigQuery load jobs are not idempotent; if a load job is retried (e.g., due to a worker failure), the same data can be loaded multiple times, resulting in duplicates.

876

MCQhard

A Dataflow streaming pipeline is experiencing high latency and frequent OOM errors when processing variable-sized JSON messages from Pub/Sub. The team suspects that the autoscaling is not effective. Which feature should they enable to improve resource utilization?

A.Horizontal autoscaling

B.Dataflow Prime

C.FlexRS

D.Streaming Engine

AnswerB

Dataflow Prime offers vertical scaling and right-fitting, which helps with variable-sized messages and OOM errors.

Why this answer

Dataflow Prime is the correct choice because it adds resource management beyond horizontal autoscaling: vertical autoscaling adjusts the memory available to workers based on the pipeline's demands, and right fitting lets individual steps request the resources they need. That addresses both the high latency and the OOM errors caused by variable-sized JSON messages without manual tuning.

Exam trap

Google Cloud often tests the misconception that Streaming Engine solves all streaming performance issues, but it specifically addresses shuffle and state persistence, not worker memory management for variable payloads.

How to eliminate wrong answers

Option A is wrong because Horizontal autoscaling is a basic feature already enabled by default in Dataflow; it only scales the number of workers horizontally and does not address memory inefficiencies or OOM errors caused by variable-sized messages. Option C is wrong because FlexRS is designed for batch pipelines with flexible scheduling to reduce costs, not for streaming pipelines requiring low latency and real-time processing. Option D is wrong because Streaming Engine offloads shuffle and state storage to backend services to reduce disk I/O and checkpoint latency, but it does not directly manage per-worker memory allocation or prevent OOM errors from variable-sized payloads.

877

MCQhard

A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?

A.Increase the number of workers in the pipeline.

B.Switch from ParDo to MapElements.

C.Use a side input to batch elements and make fewer API calls.

D.Use a GroupByKey to group elements with the same key and then make one API call per group.

AnswerC

Batching reduces API calls, improving throughput.

Why this answer

Option C batches elements so that each external API call covers many elements instead of one. Fewer calls means less per-call overhead, which raises throughput without changing the results.
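
The effect of batching on call count can be sketched in plain Python. This is an illustration of the idea only, not Beam code (in Beam, a transform such as GroupIntoBatches is one way to form batches); `call_api` is a hypothetical function that accepts a list of elements and returns results in the same order.

```python
from itertools import islice


def batched(elements, batch_size):
    """Yield successive lists of at most batch_size elements."""
    iterator = iter(elements)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def enrich_all(elements, call_api, batch_size=50):
    """Call the (hypothetical) bulk API once per batch and keep element order."""
    results = []
    for chunk in batched(elements, batch_size):
        results.extend(call_api(chunk))
    return results


if __name__ == "__main__":
    calls = []

    def fake_api(chunk):
        calls.append(len(chunk))
        return [x * 2 for x in chunk]

    out = enrich_all(range(120), fake_api, batch_size=50)
    print(len(out), "results from", len(calls), "calls")
```

With 120 elements and a batch size of 50, the API is called three times instead of 120, and every element still receives its own result.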

878

MCQmedium

What is the most likely cause of data duplication after this command?

A.The Pub/Sub source is not exactly-once.

B.The pipeline uses at-least-once semantics.

C.The snapshot was taken before scaling.

D.The BigQuery sink is not idempotent.

AnswerD

If the sink is not idempotent, duplicate data can be written when workers are re-added or when job state is replayed.

Why this answer

Option D is correct because BigQuery sinks in Dataflow are not idempotent by default; if the pipeline retries writes (e.g., due to worker failures or checkpoint issues), duplicate rows can be inserted into the BigQuery table. This is a known limitation: BigQuery does not support deduplication at the sink level unless you implement custom deduplication logic or use a staging table with merge operations. The command likely triggered a retry scenario, and the non-idempotent sink caused the duplication.

Exam trap

Google Cloud often tests the misconception that at-least-once semantics alone cause duplication, but the real trap is that the sink's idempotency (or lack thereof) is the decisive factor when retries occur.

How to eliminate wrong answers

Option A is wrong because Dataflow's Pub/Sub source deduplicates redelivered messages by message ID, and nothing in the question points to the source as the cause. Option B is wrong because at-least-once semantics are a pipeline processing mode, not a direct cause of data duplication; they can lead to duplicates if the sink is not idempotent, but the question asks for the 'most likely cause' and the sink's idempotency is the immediate factor. Option C is wrong because taking a snapshot before scaling does not inherently cause data duplication; snapshots preserve pipeline state for resumption, and scaling only affects parallelism, not data integrity.

879

MCQeasy

A data engineer needs to transfer 500 TB of on-premises data to Google Cloud Storage. The data is stored on NAS devices and the network bandwidth is limited to 100 Mbps. What is the most cost-effective and timely transfer method?

A.Use Storage Transfer Service over the internet

B.Use a VPN connection and rsync

C.Use gsutil cp in parallel

D.Use Transfer Appliance

AnswerD

Transfer Appliance is designed for offline petabyte-scale transfers, avoiding bandwidth limitations.

Why this answer

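At 100 Mbps, transferring 500 TB over the network would take roughly 460 days even at full line rate (about 510 days if the figure means tebibytes), and longer once protocol overhead is included. Transfer Appliance is designed for petabyte-scale offline transfer: Google ships a physical appliance to your data center, you load the data, and you ship it back. The other options all depend on the same 100 Mbps link, so they are not feasible.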
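
The transfer-time estimate is simple arithmetic and worth checking by hand. The sketch below assumes the link runs at full line rate unless an `efficiency` factor is given; whether "500 TB" means decimal terabytes or binary tebibytes shifts the result by about 10%, so both are shown. The helper name and constants are illustrative.

```python
TB = 10**12      # decimal terabyte, in bytes
TIB = 2**40      # binary tebibyte, in bytes
MBPS = 10**6     # megabit per second, in bits per second


def transfer_days(size_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link at the given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    print(round(transfer_days(500 * TB, 100 * MBPS)))    # 463
    print(round(transfer_days(500 * TIB, 100 * MBPS)))   # 509
```

Either way the transfer takes well over a year on a 100 Mbps link, which is why an offline appliance is the practical choice.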

880

Multi-Selectmedium

A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?

Select 2 answers

A.ParDo

B.Windowing

C.CoGroupByKey

D.GroupByKey

E.Side inputs

AnswersA, E

ParDo is used for per-element transformation, such as looking up enrichment data.

Why this answer

Side inputs allow reading from Bigtable and Spanner in a non-blocking way. ParDo is for per-element processing where enrichment occurs. GroupByKey and Windowing are not needed for this enrichment step.

881

MCQhard

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

A.Create a single large persistent Dataproc cluster to handle the peak load.

B.Use Cloud Data Fusion to visually design the pipeline and run it on Dataproc.

C.Use a Dataproc cluster with preemptible worker nodes and autoscaling enabled.

D.Migrate the pipeline to Dataflow with Apache Beam and use flexRS for cost savings.

AnswerC

Preemptible VMs are cost-effective, and autoscaling handles growth.

Why this answer

Option C is correct because preemptible worker nodes significantly reduce cost (up to 80% discount) while autoscaling dynamically adjusts cluster size to match the growing workload, ensuring performance without over-provisioning. This combination handles the 10x data growth efficiently by scaling out during peak loads and scaling in during lulls, using preemptible instances for fault-tolerant tasks like transformation.

Exam trap

The trap here is that candidates often choose Dataflow (Option D) assuming it is always the best for cost and performance, but the question specifically involves Dataproc and batch ETL from Cloud Storage to BigQuery, where preemptible nodes with autoscaling provide a more direct and cost-effective solution without requiring a pipeline rewrite.

How to eliminate wrong answers

Option A is wrong because a single large persistent cluster incurs high costs even when idle, and cannot efficiently handle a 10x growth without manual resizing, leading to either underutilization or performance bottlenecks. Option B is wrong because Cloud Data Fusion is a visual design tool that adds complexity and cost (via Dataproc provisioning) without inherent autoscaling or preemptible node benefits, and is not optimized for batch ETL cost control. Option D is wrong because Dataflow with flexRS is designed for batch workloads with flexible scheduling, but it requires rewriting the pipeline in Apache Beam, which adds migration overhead and may not leverage existing Dataproc investments; flexRS offers cost savings but with potential execution delays, making it less balanced for immediate performance needs.

882

MCQeasy

Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?

A.Dataproc on GKE

B.Dataflow

C.Dataproc Serverless

D.Cloud Data Fusion

AnswerC

Serverless Spark is a feature of Dataproc Serverless.

Why this answer

Dataproc Serverless allows running Spark workloads without managing clusters.

883

Multi-Selecthard

A company wants to implement a robust MLOps lifecycle on Google Cloud. Which THREE components are essential?

Select 3 answers

A.Vertex AI Model Registry for versioning

B.Vertex AI Pipelines for orchestration

C.Pub/Sub for event-driven retraining

D.Cloud Build for CI/CD

E.Cloud SQL for model metadata

AnswersA, B, D

Model Registry centralizes model version management and deployment.

Why this answer

Vertex AI Model Registry is essential for versioning because it provides a centralized repository to track, manage, and deploy different versions of trained ML models. This ensures reproducibility, auditability, and the ability to roll back to previous versions, which is critical for a robust MLOps lifecycle.

Exam trap

The trap here is that candidates may confuse optional supporting services (like Pub/Sub for event triggers or Cloud SQL for metadata) with the essential components required for a robust MLOps lifecycle, which are versioning, orchestration, and CI/CD.

884

MCQhard

A healthcare company deploys a model for diagnosing medical images on Vertex AI using a custom container with a TensorFlow model. The model uses a mixture of GPUs (NVIDIA T4) and CPUs. After deployment, you notice that prediction latency is highly variable: sometimes under 100ms, sometimes over 10 seconds. Investigation shows that the variability correlates with the number of concurrent requests. The endpoint has a min replicas of 1 and max replicas of 3, with target CPU utilization set to 80%. You also observe that GPU utilization remains low (<20%) even during high load. What is the most likely cause of the latency variability?

A.The autoscaling metric (CPU utilization) is not appropriate for a GPU-bound workload; the endpoint does not scale based on GPU utilization.

B.The model is not fully utilizing GPUs due to inefficient data loading from CPU.

C.The container is not configured to use the GPU correctly.

D.The GPU machine type is too small for the model.

AnswerA

Standard autoscaling uses CPU; for GPU workloads, you should use custom metrics like GPU utilization or request count.

Why this answer

Option A is correct because Vertex AI scales on CPU utilization by default, but a GPU-bound workload may keep CPU utilization low, so autoscaling never triggers. During high load the single replica is then overwhelmed, which causes the high latency. Option B (inefficient data loading) could contribute but is not the primary cause.

Option D (GPU too small) would cause consistently high latency. Option C (GPU not configured correctly) would cause continuous errors, not variable latency.

885

MCQeasy

You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?

A.Preemptible worker nodes

B.Standard worker nodes

C.High-memory worker nodes

D.Sole-tenant nodes

AnswerA

Preemptible VMs offer up to 80% discount and are suitable for fault-tolerant workloads.

Why this answer

Preemptible VMs are significantly cheaper but can be terminated at any time. Since the job is fault-tolerant, preemptible workers can be used for cost savings.

886

MCQhard

A financial services company needs to process high-frequency trading data with strict ordering guarantees. They use Pub/Sub with ordering keys and Dataflow. The pipeline occasionally produces out-of-order results. What is the most likely cause?

A.Dataflow does not preserve order when using multiple workers

B.Dataflow uses at-least-once processing, which can reorder events

C.Pub/Sub does not guarantee message ordering

D.The window trigger allows late data to be included after the main output

AnswerD

Late data can be emitted in a different pane, causing apparent out-of-order results.

Why this answer

Option D is correct because Dataflow's default window trigger behavior allows late data to arrive after the main pane is emitted. When using Pub/Sub with ordering keys, late-arriving events (e.g., due to network delays or retries) can be assigned to the correct window but emitted in a separate pane, causing the final output to appear out-of-order relative to the event time. This is a known behavior when combining event-time windows with late data handling.

Exam trap

Google Cloud often tests the misconception that Pub/Sub's lack of ordering guarantees is the primary cause of out-of-order results in Dataflow, when in fact the issue is typically the window trigger and late data handling within Dataflow itself.

How to eliminate wrong answers

Option A is wrong because Dataflow can preserve order within a key when using a single worker per key, but the question's scenario involves ordering keys and the issue is not about multiple workers reordering events—Dataflow's shuffle and grouping operations maintain order per key. Option B is wrong because at-least-once processing guarantees delivery but does not inherently reorder events; reordering is caused by late data or window triggers, not by the processing semantics alone. Option C is wrong because Pub/Sub does guarantee message ordering when messages are published to the same ordering key and within the same region, as long as the subscriber acknowledges messages in order; the question states they use ordering keys, so Pub/Sub ordering is not the root cause.

887

MCQmedium

A data engineer is using Apache Spark on Dataproc to process a large dataset. They need to perform complex aggregation and transformation with high performance. The dataset has a known schema and they want to take advantage of Catalyst optimizer. Which Spark API should they use?

A.Spark SQL only

B.DataFrames

C.Datasets

D.RDDs

AnswerB

DataFrames have Catalyst optimizer, which improves performance for complex transformations.

Why this answer

DataFrames provide high-level API with Catalyst optimizer for performance, making them ideal for complex aggregations and transformations on structured data.

888

MCQeasy

Which BigQuery feature allows you to read data directly from Cloud Storage without loading it into BigQuery storage?

A.External tables

B.BI Engine

C.Federated queries

D.Authorized views

AnswerA

External tables reference data in Cloud Storage and can be queried directly.

Why this answer

External tables in BigQuery allow querying data stored in Cloud Storage (e.g., CSV, Parquet, ORC) without loading. Authorized views restrict access, federated queries allow querying other databases, and BI Engine is for acceleration.

889

MCQeasy

Your company has deployed a machine learning model on Vertex AI Endpoint to serve real-time predictions for a mobile application. The model was trained using TensorFlow and the prediction requests include raw images that are preprocessed by the client before sending. Recently, the application developers reported that the predictions are becoming less accurate over time. They suspect the issue is related to changes in the client-side preprocessing code. You need to verify this hypothesis and monitor for future regressions. What should you do?

A.Retrain the model using the latest client data to adapt to any changes in preprocessing.

B.Roll back to a previous model version that was known to work well and disable automatic retraining.

C.Ask the developers to provide the exact preprocessing code and manually compare it with the training pipeline's preprocessing.

D.Enable Vertex AI Model Monitoring for feature attribution and set up alerting on skew detection.

AnswerD

Model Monitoring can detect training-serving skew by comparing feature distributions; this would catch preprocessing changes effectively.

Why this answer

Option D is correct because Vertex AI Model Monitoring can automatically detect skew between the training data distribution and the live prediction data distribution. By enabling feature attribution and alerting on skew detection, you can quantitatively verify whether changes in client-side preprocessing are causing prediction drift, without manual code comparison or disruptive rollbacks.

Exam trap

Google Cloud often tests the misconception that manual code comparison or retraining is the best way to diagnose preprocessing drift, when in fact automated monitoring with statistical skew detection is the correct operational practice for continuous verification.

How to eliminate wrong answers

Option A is wrong because retraining the model on the latest client data would adapt to the preprocessing changes, but it would not verify the hypothesis that preprocessing changes caused the accuracy drop; it would mask the root cause and potentially introduce new biases. Option B is wrong because rolling back to a previous model version and disabling retraining is a reactive, non-diagnostic approach that does not confirm whether preprocessing changes are the issue and may ignore other legitimate improvements. Option C is wrong because manually comparing preprocessing code is error-prone, does not scale, and cannot detect subtle distribution shifts that occur in production; it also provides no ongoing monitoring for future regressions.

890

MCQeasy

A company is ingesting real-time sensor data from thousands of devices into Cloud Pub/Sub. They need to process this data with low latency (seconds) and exactly-once semantics. Which data processing service should they use?

A.Cloud Run with Pub/Sub push

B.Cloud Functions triggered by Pub/Sub

C.Dataflow streaming with exactly-once processing

D.Dataproc with Spark Streaming

AnswerC

Dataflow provides exactly-once processing for streaming data with low latency, ideal for real-time sensor data.

Why this answer

Dataflow streaming with exactly-once processing is the correct choice because it provides exactly-once semantics for Pub/Sub sources via checkpointing and idempotent sinks, and it meets the low-latency (seconds) requirement through its streaming engine that minimizes per-element overhead. Cloud Dataflow's integration with Pub/Sub ensures that each message is processed exactly once, even in the presence of failures, by using snapshots and consistent state management.
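
The gap between at-least-once delivery and an exactly-once result can be illustrated with a small simulation. This is a sketch of the general deduplication idea only, not how Dataflow is implemented internally; the class name, message IDs, and readings are hypothetical.

```python
class IdempotentSink:
    """Applies each message at most once, keyed by a unique message ID."""

    def __init__(self):
        self._seen_ids = set()
        self.totals = {}

    def write(self, message_id, device_id, reading):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.totals[device_id] = self.totals.get(device_id, 0.0) + reading
        return True


# Simulated at-least-once delivery: message "m2" arrives twice.
deliveries = [
    ("m1", "sensor-a", 1.5),
    ("m2", "sensor-a", 2.0),
    ("m2", "sensor-a", 2.0),  # redelivery after a missed acknowledgement
    ("m3", "sensor-b", 4.0),
]

sink = IdempotentSink()
applied = [sink.write(*delivery) for delivery in deliveries]
```

Without the ID check, the redelivered message would be counted twice. A stateless function or container invoked per message has nowhere to keep that set unless it adds external storage, which is the extra logic the explanation refers to.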

Exam trap

Google Cloud often tests the misconception that serverless services like Cloud Functions or Cloud Run inherently provide exactly-once processing, when in fact they rely on Pub/Sub's at-least-once delivery and require additional logic to achieve exactly-once semantics.

How to eliminate wrong answers

Option A is wrong because Cloud Run with Pub/Sub push does not guarantee exactly-once processing; Pub/Sub push delivery is at-least-once, and Cloud Run's stateless containers cannot enforce exactly-once semantics without external coordination. Option B is wrong because Cloud Functions triggered by Pub/Sub also uses at-least-once delivery from Pub/Sub and lacks built-in mechanisms for exactly-once processing; it is designed for lightweight, event-driven tasks, not for stateful streaming with exactly-once guarantees. Option D is wrong because Dataproc with Spark Streaming provides at-least-once or exactly-once semantics only with additional configuration (e.g., checkpointing and idempotent sinks), but it introduces higher latency (typically seconds to minutes) due to micro-batching and is not optimized for sub-second or low-latency streaming compared to Dataflow's streaming engine.

891

MCQmedium

A data engineering team uses Cloud Pub/Sub to ingest clickstream events and Cloud Dataflow to process them. They need to maintain strict event ordering per user session, and the processing output must be written to a BigQuery table with exactly-once semantics. Which configuration should the team implement?

A.Enable message ordering in Pub/Sub with a session ID as the ordering key, and in Dataflow use a global window with a custom trigger that fires on watermark and uses a BigQuery sink with 'exactly-once' mode enabled.

B.Use a Pub/Sub pull subscription with a subscriber that acknowledges messages immediately after processing, and a Dataflow pipeline with a sliding window.

C.Assign a unique session ID as the message ordering key in Pub/Sub, use a Dataflow pipeline with session windows and .withAllowedLateness(0), and write to BigQuery using a batch load.

D.Use a Pub/Sub push subscription with an acknowledgment deadline of 600 seconds and enable exactly-once delivery on the subscription.

AnswerA

A is correct because Pub/Sub ordering keys maintain order per session, and Dataflow's exactly-once sink to BigQuery prevents duplicates when combined with deterministic triggers.

Why this answer

Option A is correct because it combines Pub/Sub message ordering (using a session ID as the ordering key) with Dataflow's exactly-once sink to BigQuery. The global window with a watermark-based trigger ensures all events for a session are processed in order before writing, while the BigQuery 'exactly-once' mode prevents duplicate rows even if the pipeline retries. This satisfies both strict per-session ordering and exactly-once semantics.

Exam trap

Google Cloud often tests the misconception that Pub/Sub's exactly-once delivery subscription alone guarantees end-to-end exactly-once processing, ignoring that Dataflow's sink configuration and windowing strategy are required for ordering and deduplication in the output.

How to eliminate wrong answers

Option B is wrong because acknowledging messages immediately after processing (auto-ack) can cause message loss if the pipeline fails before writing to BigQuery, breaking exactly-once semantics; sliding windows do not maintain per-session ordering. Option C is wrong because session windows in Dataflow group events by session gaps, not by a fixed ordering key, and .withAllowedLateness(0) drops late events, risking incomplete sessions; batch loads to BigQuery do not provide exactly-once write semantics (they can produce duplicates on retry). Option D is wrong because exactly-once delivery on a Pub/Sub subscription covers only delivery from Pub/Sub to the subscriber, not processing or writes downstream, and it is supported only on pull subscriptions, not push; a 600-second acknowledgment deadline does not guarantee ordering or exactly-once writes to BigQuery.

892

MCQmedium

A data engineer needs to create a Dataflow pipeline template that can be reused across multiple environments (dev, staging, prod) with different parameters (e.g., input Pub/Sub topic, output BigQuery table). Which template type should they use?

A.Dataflow Prime

B.Flex Template

C.Classic Template

D.Cloud Composer workflow template

AnswerB

Flex Templates support custom Docker images and runtime parameters, making them suitable for multi-environment reuse.

Why this answer

Flex Templates (B) are the correct choice because they package a Dataflow pipeline as a Docker image, allowing environment-specific parameters (e.g., Pub/Sub topic, BigQuery table) to be passed at runtime via the --parameters flag. The pipeline graph is built at launch time, so the same template can be reused across dev, staging, and prod without modifying the template code. Classic Templates, by contrast, fix the job graph when the template is staged and accept runtime parameters only through the ValueProvider interface.
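
As a rough illustration of one template serving several environments, the sketch below builds the argument list for `gcloud dataflow flex-template run` from a per-environment parameter map. The project IDs, bucket path, job names, and the parameter names `inputTopic` and `outputTable` are hypothetical; a real template defines its own parameter names in its metadata.

```python
def flex_template_run_command(job_name, template_spec_path, region, parameters):
    """Build the argument list for `gcloud dataflow flex-template run`."""
    # --parameters takes comma-separated KEY=VALUE pairs; values that
    # themselves contain commas would need escaping, which is not handled here.
    joined = ",".join(f"{key}={value}" for key, value in sorted(parameters.items()))
    return [
        "gcloud", "dataflow", "flex-template", "run", job_name,
        f"--template-file-gcs-location={template_spec_path}",
        f"--region={region}",
        f"--parameters={joined}",
    ]


# Hypothetical per-environment settings; only the parameters differ.
ENVIRONMENTS = {
    "dev": {
        "inputTopic": "projects/example-dev/topics/clicks",
        "outputTable": "example-dev:analytics.clicks",
    },
    "prod": {
        "inputTopic": "projects/example-prod/topics/clicks",
        "outputTable": "example-prod:analytics.clicks",
    },
}

commands = {
    env: flex_template_run_command(
        f"clicks-{env}",
        "gs://example-templates/clicks.json",
        "us-central1",
        params,
    )
    for env, params in ENVIRONMENTS.items()
}
```

Each resulting list can be passed to `subprocess.run`. The template spec file stays the same for every environment, which is the reuse property the question asks about.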

Exam trap

Google Cloud often tests the distinction between Classic Templates (job graph fixed at staging, runtime parameters limited to ValueProvider) and Flex Templates (graph built at launch with arbitrary parameters), trapping candidates who assume all templates support the same level of parameter flexibility.

How to eliminate wrong answers

Option A is wrong because Dataflow Prime is a managed service for optimizing resource utilization and autoscaling, not a template type for parameterized reuse. Option C is wrong because Classic Templates fix the job graph at staging time and support runtime parameters only through ValueProvider, making them less flexible for multi-environment reuse. Option D is wrong because Cloud Composer is an Apache Airflow orchestration service used to schedule and monitor workflows, not a Dataflow template type for parameterized pipeline reuse.

893

MCQmedium

An organization uses Cloud Storage to store backup files. They want to automatically delete files older than 90 days, and after deletion, move remaining files to Nearline storage if not accessed for 30 days. Which Cloud Storage feature should they configure?

A.Object Versioning

B.Retention Policies

C.Bucket Lock

D.Object Lifecycle Management

AnswerD

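Lifecycle rules can delete objects after a specified age and change an object's storage class. Note that lifecycle conditions are based on object properties such as age, not on last access time; access-based transitions are what Autoclass provides.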

Why this answer

Object Lifecycle Management (D) is the correct feature because it lets you define rules that automatically transition objects to colder storage classes (such as Nearline) and delete objects once they reach a set age. In this scenario, one rule deletes objects older than 90 days and another moves the remaining objects to Nearline. Lifecycle conditions are evaluated against object properties such as age and storage class; where transitions must follow actual access patterns, Autoclass is the feature that tracks access. Of the options listed, only lifecycle rules automate both actions without manual intervention.
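
A lifecycle configuration for this scenario might look like the sketch below, which builds the JSON document in Python. It assumes object age stands in for the "not accessed for 30 days" requirement, since lifecycle conditions such as `age` are based on object properties; the variable names are illustrative.

```python
import json

# Assumed policy: move STANDARD objects to NEARLINE at 30 days of age,
# then delete any object once it is 90 days old.
lifecycle_config = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {"age": 30, "matchesStorageClass": ["STANDARD"]},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 90},
        },
    ]
}

# Serialized form, suitable for saving as a lifecycle configuration file.
lifecycle_json = json.dumps(lifecycle_config, indent=2)
```

The `matchesStorageClass` condition keeps the transition rule from reapplying to objects that are already in Nearline.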

Exam trap

Google Cloud often tests the distinction between lifecycle management (which automates transitions and deletions) and retention-related features (like Bucket Lock or Retention Policies), so the trap here is that candidates confuse 'automatically deleting old files' with 'preventing deletion,' leading them to incorrectly choose a retention-focused option.

How to eliminate wrong answers

Option A is wrong because Object Versioning is used to preserve, retrieve, and restore every version of an object in a bucket, not to automate deletion or storage class transitions based on age or access patterns. Option B is wrong because Retention Policies are used to enforce a minimum retention period for objects, preventing their deletion or overwrite, which is the opposite of automatically deleting old files. Option C is wrong because Bucket Lock is a feature that locks a bucket's retention policy, making it immutable and preventing any changes to the retention settings; it does not provide automated lifecycle actions like deletion or storage class transitions.

894

Multi-Selecteasy

Which TWO roles are required to allow a service account to run a Dataflow job and write results to BigQuery? (Choose two.)

Select 2 answers

A.roles/pubsub.subscriber

B.roles/dataflow.worker

C.roles/bigquery.dataEditor

D.roles/storage.objectAdmin

E.roles/dataflow.admin

AnswersB, C

Required for the worker service account to run the job.

Why this answer

Option B is correct because the roles/dataflow.worker role grants the service account the necessary permissions to execute Dataflow worker tasks, such as reading from sources and writing to sinks. Option C is correct because roles/bigquery.dataEditor allows the service account to insert rows into BigQuery tables, which is required for the Dataflow job to write results.

Exam trap

The trap here is that candidates often select roles/dataflow.admin thinking it is needed to run a job, but the exam tests that the worker role is sufficient for execution, while admin is for management tasks like creating or updating jobs.

895

MCQeasy

You need to orchestrate a simple, linear workflow that calls several Cloud Functions and API endpoints sequentially with conditional logic. The workflow should be defined as code and have minimal overhead. Which GCP service should you use?

A.Cloud Tasks

B.Workflows

C.Dataflow

D.Cloud Composer

AnswerB

Workflows is serverless, YAML/JSON-based, and perfect for simple orchestrations.

Why this answer

Workflows is a serverless orchestration service that uses YAML/JSON to define workflows. It is ideal for simpler, linear or conditional orchestrations without the need for full Airflow infrastructure.

896

Multi-Selectmedium

A company is migrating an on-premises PostgreSQL database to Google Cloud. They need a fully managed database that is compatible with PostgreSQL and can handle both transactional and analytical workloads with high performance. Which two database services meet these requirements? (Choose TWO.)

Select 2 answers

A.Cloud Spanner

B.Cloud SQL for PostgreSQL

C.BigQuery

D.AlloyDB

E.Firestore

AnswersB, D

Fully managed, PostgreSQL-compatible, supports OLTP and some analytical queries.

Why this answer

Cloud SQL for PostgreSQL is a fully managed database service that is compatible with PostgreSQL, making it suitable for transactional workloads. AlloyDB is also a fully managed PostgreSQL-compatible database that is optimized for high performance on both transactional and analytical workloads, offering up to 100x faster query performance for analytical queries compared to standard PostgreSQL.

Exam trap

Google Cloud often tests the distinction between general-purpose managed databases (Cloud SQL) and specialized high-performance databases (AlloyDB). The trap here is that candidates may think BigQuery or Spanner qualify because they support SQL, but BigQuery is an analytical warehouse rather than a PostgreSQL-compatible transactional database, and Spanner, although it offers a PostgreSQL interface, is not a drop-in PostgreSQL engine.

897

MCQhard

A company runs a large Dataflow pipeline that aggregates user activity data from Pub/Sub into BigQuery every 10 minutes using fixed windows. Recently, the daily summary reports have shown 5-10% lower user engagement for certain segments compared to historical trends. The pipeline is completing successfully with no errors in Cloud Monitoring, and the Dataflow job dashboard shows all steps in green. There are no alarms. The team suspects data is being dropped or missed. They have verified that the Pub/Sub topic is receiving data correctly. After reviewing the pipeline code, they find that the pipeline uses a global window with a default 10-minute trigger, and writes results to a single BigQuery table partitioned by date. They also use exactly-once processing mode. Which of the following is the most likely cause and the best course of action to diagnose and fix the data quality issue?

A.Implement a retry mechanism in the Pub/Sub subscription to ensure no messages are lost.

B.Enable Cloud Logging for all pipeline steps and analyze the logs for dropped elements.

C.Add a global window with a late-data trigger to capture any data arriving after the window ends.

D.Use Dataflow’s built-in metrics to compare the number of elements read from Pub/Sub and written to BigQuery for each window.

AnswerD

This identifies exactly where data is lost, enabling targeted debugging without overhead.

Why this answer

Option D is correct because it localizes the loss before any fix is attempted. The pipeline uses a global window with a 10-minute trigger; because the global window never closes, late-arriving data is still included, so lateness is an unlikely cause. The most direct diagnosis is to compare the number of elements read from Pub/Sub with the number of elements written to BigQuery for each 10-minute firing, using the per-step element counts that Dataflow reports. A gap between the two shows that data is lost between reading and writing, a common symptom when streaming inserts fail for individual rows because of schema mismatches or quota limits while the job itself stays healthy.
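
The comparison itself is simple arithmetic once the counts are available. The sketch below assumes the per-window counts have already been collected into two dictionaries; the window labels, the counts, and the function name are made up for illustration, and retrieving the numbers from monitoring is not shown.

```python
def find_window_gaps(read_counts, written_counts, tolerance=0.0):
    """Return the windows where fewer elements were written than read."""
    gaps = {}
    for window, read in sorted(read_counts.items()):
        written = written_counts.get(window, 0)
        missing = read - written
        # Report a window only when its loss ratio exceeds the tolerance.
        if read > 0 and missing / read > tolerance:
            gaps[window] = {
                "read": read,
                "written": written,
                "missing": missing,
                "loss_ratio": missing / read,
            }
    return gaps


# Hypothetical counts for three consecutive 10-minute firings.
elements_read = {"10:00": 1000, "10:10": 1200, "10:20": 900}
rows_written = {"10:00": 1000, "10:10": 1100, "10:20": 900}

gaps = find_window_gaps(elements_read, rows_written)
```

A window that appears in `gaps` narrows the investigation to the sink for that period, for example to rows rejected by BigQuery during that firing.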

Exam trap

The trap here is that candidates assume 'exactly-once processing' guarantees no data loss, but in reality, exactly-once only ensures no duplicates, not that all data is successfully written to the sink; silent failures in streaming inserts to BigQuery can cause data to be dropped without triggering pipeline errors.

How to eliminate wrong answers

Option A is wrong because Pub/Sub subscriptions already have built-in retry mechanisms (e.g., at-least-once delivery) and the issue is not about message loss from Pub/Sub; the team verified the topic is receiving data correctly. Option B is wrong because enabling Cloud Logging for all pipeline steps would generate excessive logs and is not the most efficient diagnostic approach; Dataflow already provides built-in per-step metrics that can directly compare element counts without needing to parse logs. Option C is wrong because the pipeline already uses a global window with a 10-minute trigger, which inherently captures late data (since the global window never closes); adding a late-data trigger is redundant and does not address the potential data loss between Pub/Sub and BigQuery.

898

MCQeasy

A data engineer is building a Dataflow pipeline that reads from BigQuery, transforms data using Apache Beam, and writes results to Cloud Storage in Avro format. They need to ensure the pipeline can be easily redeployed with different parameters without modifying code. Which deployment method should they use?

A.Dataflow Flex Templates

B.Direct deployment using the gcloud command with parameters

C.Dataflow Classic Templates

D.Deploy as a Cloud Function triggered by Cloud Scheduler

AnswerA

Flex Templates use Docker images and support arbitrary pipeline options, including custom parameters.

Why this answer

Dataflow Flex Templates allow you to package a pipeline as a Docker image and pass runtime parameters, enabling parameterized deployments without code changes.

899

MCQmedium

A data engineer uses Cloud Composer to orchestrate a daily batch pipeline. A downstream task should only start after an upstream BigQuery load job finishes successfully and a specific file appears in Cloud Storage. Which combination of operators should the engineer use in the Airflow DAG?

A.BigQueryInsertJobOperator with wait_for_downstream=True

B.BigQueryInsertJobOperator and GCSObjectExistenceSensor with upstream dependency

C.DataflowPythonOperator and GCSObjectExistenceSensor

D.BigQueryOperator and FileSensor with downstream dependency

AnswerB

Correct: BigQueryInsertJobOperator performs the load, GCSObjectExistenceSensor polls for the file, and upstream dependency ensures order.

Why this answer

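The BigQueryInsertJobOperator runs the load job (BigQueryOperator is an older operator it replaces), and the GCSObjectExistenceSensor (named GoogleCloudStorageObjectExistenceSensor in older Airflow releases) waits for the file. Declaring both tasks as upstream dependencies of the downstream task ensures it starts only after the load succeeds and the file exists.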

900

MCQeasy

A company deploys a new machine learning model for real-time predictions using Vertex AI. The model is stored in a Cloud Storage bucket and deployed to an endpoint. To ensure traceability and rollback capability, which practice should be followed?

A.Deploy multiple versions of the model to the same endpoint using traffic splitting and set the primary version to 100% traffic.

B.Use the same model name for all deployments and overwrite the existing model.

C.Store the model in a Cloud Storage bucket with a fixed name and rely on Cloud Build for rollback.

D.Create a new model resource in Vertex AI for each version and deploy the specific version to an endpoint.

AnswerD

This allows version tracking, easy rollback by redeploying a previous version, and maintains a clean deployment history.

Why this answer

Option D is correct because creating a new model resource in Vertex AI for each version ensures that each model iteration is independently tracked, versioned, and can be deployed to an endpoint with full rollback capability. This practice aligns with Vertex AI's model versioning and endpoint deployment model, where each model resource has a unique ID and can be deployed or undeployed without affecting other versions, enabling precise traceability and rollback.

Exam trap

Google Cloud often tests the misconception that traffic splitting alone (Option A) provides sufficient versioning and rollback, but the trap is that traffic splitting still operates within a single model resource, which does not preserve independent version history or allow clean rollback to a prior model resource without manual intervention.

How to eliminate wrong answers

Option A is wrong because deploying multiple versions to the same endpoint with traffic splitting and setting the primary version to 100% traffic does not inherently create separate model resources for each version; it still relies on a single model resource with aliases, which can complicate rollback if the model resource itself is overwritten or corrupted. Option B is wrong because using the same model name for all deployments and overwriting the existing model destroys the previous version's metadata and artifacts, making rollback impossible without manual restoration from backups. Option C is wrong because storing the model in a Cloud Storage bucket with a fixed name and relying on Cloud Build for rollback does not provide native Vertex AI model versioning or endpoint deployment tracking; Cloud Build is a CI/CD tool, not a model registry, and overwriting the bucket contents loses previous versions.
