Knowledge + Practice

Google Professional Data Engineer (PDE) — Questions 226–300

499 questions total · 7 pages · All types, answers revealed

226

MCQ · medium

A company is migrating their on-premises Apache Spark jobs to Dataproc. They want to minimize code changes and take advantage of serverless infrastructure. Which Dataproc feature should they use?

A. Dataproc clusters with preemptible VMs

B. Dataproc Workflow Templates

C. Dataproc Serverless Spark

D. Dataproc Jobs API with custom machine types

Answer: C

Serverless Spark runs jobs without cluster management and is compatible with existing Spark code.

Why this answer

Dataproc Serverless Spark is the correct choice because it allows the company to run Spark workloads without provisioning or managing clusters, minimizing code changes by using the same Spark APIs and libraries. This serverless infrastructure automatically scales resources and handles failures, aligning with the goal of reducing operational overhead while maintaining compatibility with existing Spark jobs.

Exam trap

Google Cloud often tests the distinction between 'serverless' and 'managed' services; the trap here is that candidates may confuse Dataproc Workflow Templates or Jobs API with serverless capabilities, but those still require cluster management, whereas Dataproc Serverless Spark truly abstracts the infrastructure.

How to eliminate wrong answers

Option A is wrong because preemptible VMs are cost-effective but still require managing a cluster and do not provide serverless infrastructure; they are prone to termination, which can disrupt jobs without proper checkpointing. Option B is wrong because Workflow Templates orchestrate job sequences on existing clusters but do not eliminate cluster management or provide serverless execution. Option D is wrong because the Dataproc Jobs API with custom machine types still requires a running cluster to submit jobs, thus not achieving serverless infrastructure or minimizing cluster management.

227

Multi-Select · medium

Which TWO best practices should be followed when managing multiple model versions on Vertex AI Endpoints for a production system?

Select 2 answers

A. Always keep all historical versions deployed to enable fast rollback.

B. If two versions share the same endpoint, they must have exactly the same machine type.

C. Use traffic splitting to gradually shift traffic to a new version while monitoring performance.

D. Upload each model version as a new model resource and deploy to a separate endpoint for isolation.

E. Use the same endpoint for multiple versions and adjust min_replica_count, max_replica_count for each version.

Answers: C, D

Traffic splitting enables canary deployments and safe rollback.

Why this answer

Option C is correct because Vertex AI Endpoints support traffic splitting, allowing you to route a percentage of inference requests to a new model version while the rest goes to the existing version. This enables gradual rollout, monitoring of performance metrics (e.g., latency, error rate), and safe rollback without downtime. It is a best practice for production systems to validate a new version before fully cutting over.

Exam trap

Google Cloud often tests the misconception that multiple versions on the same endpoint must share identical infrastructure settings (like machine type), but Vertex AI allows heterogeneous configurations per version, and traffic splitting is the correct method for gradual rollouts, not keeping all versions or adjusting autoscaling parameters.
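
The gradual rollout described above can be pictured as weighted routing. The sketch below is a plain-Python simulation and does not call the Vertex AI SDK; `route_request`, `TRAFFIC_SPLIT`, and the version labels are hypothetical names, and the hash-based bucketing is an assumption chosen to keep the example deterministic.

```python
import hashlib

# Hypothetical split: 90% of requests to the current version, 10% to the canary.
TRAFFIC_SPLIT = {"model-v1": 90, "model-v2": 10}


def route_request(request_id: str, split: dict[str, int]) -> str:
    """Pick a model version for a request according to percentage weights."""
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    # Hash the request ID into a stable bucket in [0, 100).
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for version, percent in split.items():
        upper += percent
        if bucket < upper:
            return version
    raise AssertionError("unreachable when percentages sum to 100")


counts = {"model-v1": 0, "model-v2": 0}
for i in range(10_000):
    counts[route_request(f"req-{i}", TRAFFIC_SPLIT)] += 1
print(counts)  # roughly a 90/10 split
```

Shifting traffic then amounts to changing the percentages step by step (for example 90/10, then 50/50, then 0/100) while watching latency and error rate, and setting the new version back to 0 to roll back.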

228

MCQ · medium

A financial services company uses a Dataflow streaming pipeline to process real-time stock trades. The pipeline reads from Pub/Sub, enriches with reference data from Cloud Bigtable, and writes to BigQuery. Recently, they noticed an increase in processing latency during market open hours. Investigation shows that the pipeline is data-skewed: a few stock symbols generate 90% of the traffic. The team wants to reduce latency without changing the pipeline structure. What should they do?

A. Increase the Pub/Sub subscription flow control to buffer less data

B. Use event-time windows based on trade timestamp to spread data

C. Enable Dataflow Streaming Engine to dynamically repartition work

D. Increase the number of workers and use more CPU

Answer: C

Streaming Engine handles hot keys by splitting processing across workers.

Why this answer

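Option C is correct because Streaming Engine moves shuffle and state storage out of the worker VMs and into the Dataflow service backend, which lets the service rebalance work more responsively and gives better handling of hot keys. Option A is wrong because changing Pub/Sub flow control only alters how much data is buffered; it does not redistribute the skewed keys. Option B is wrong because the pipeline already shuffles data by key; switching to a different window does not fix the skew.

Option D is wrong because adding workers and CPU may not help when the hot-key bottleneck sits within a single worker.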
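
To see why extra workers alone do not relieve a hot key, the following plain-Python sketch simulates key-based assignment of trades to workers. It is an illustration under stated assumptions, not Dataflow code: `worker_for`, `max_worker_share`, the symbol names, and the 90/10 skew are all hypothetical.

```python
import zlib
from collections import Counter


def worker_for(key: str, num_workers: int) -> int:
    """Assign a key to a worker with a stable hash, as key-based grouping does."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def max_worker_share(trades: list[str], num_workers: int) -> float:
    """Return the fraction of all trades handled by the busiest worker."""
    load = Counter(worker_for(symbol, num_workers) for symbol in trades)
    return max(load.values()) / len(trades)


# Hypothetical skew: one symbol produces 90% of the traffic.
trades = ["HOT"] * 9_000 + [f"SYM{i}" for i in range(1_000)]

for n in (4, 16, 64):
    # The busiest worker keeps at least 90% of the load at every cluster size.
    print(n, round(max_worker_share(trades, n), 3))
```

Because every record for the hot symbol hashes to the same worker, the busiest worker's share never drops below 90% no matter how many workers are added.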

229

MCQ · medium

An MLOps team wants to implement continuous deployment of ML models using Cloud Build and Vertex AI. They have a GitHub repository with training code. What should they use?

A. Deploy using Cloud Run

B. Vertex AI Pipelines integrated with Cloud Build

C. Cloud Functions to monitor GitHub

D. Cloud Build trigger with a custom step to run Vertex AI Training job and deploy

Answer: D

Cloud Build can be configured to trigger on GitHub pushes and run training/deployment steps.

Why this answer

Option D is correct: a Cloud Build trigger can fire on pushes to the GitHub repository and run custom steps that launch a Vertex AI Training job and then deploy the resulting model. Option B is wrong because Vertex AI Pipelines is an orchestration service, not a CI/CD system; Cloud Build is the CI/CD tool. Option C is wrong because Cloud Functions is event-driven but not designed for CI/CD pipelines.

Option A is wrong because Cloud Run is for serverless containers, not for training and deploying ML models.

230

MCQ · medium

Your company is building a real-time fraud detection system using Google Cloud. Transactions are streamed into Pub/Sub, and you need to process them with low latency (under 100ms per event) and aggregate data over sliding windows. Which Google Cloud service is best suited for this processing logic?

A. Dataflow

B. BigQuery streaming inserts with scheduled queries

C. Dataproc with Spark Streaming

D. Cloud Functions

Answer: A

Dataflow provides exactly-once, low-latency stream processing with native sliding window support.

Why this answer

Dataflow is the best choice because it provides a unified stream and batch processing model with native support for Pub/Sub, exactly-once semantics, and low-latency sliding window aggregations. Its autoscaling and millisecond-level checkpointing enable sub-100ms per event processing, which is critical for real-time fraud detection.

Exam trap

Google Cloud often tests the misconception that BigQuery streaming inserts can handle real-time per-event processing, but candidates overlook that scheduled queries add latency and BigQuery is not designed for stateful per-event aggregations with sliding windows.

How to eliminate wrong answers

Option B is wrong because BigQuery streaming inserts with scheduled queries cannot achieve sub-100ms latency per event; scheduled queries run on a periodic basis (e.g., every minute), introducing significant delay, and BigQuery is optimized for analytical queries, not per-event low-latency processing. Option C is wrong because Dataproc with Spark Streaming introduces higher startup and shuffle overhead, typically achieving latencies in the seconds range, and requires manual cluster management, making it unsuitable for consistent sub-100ms per event. Option D is wrong because Cloud Functions has a maximum timeout of 9 minutes and is designed for stateless, short-lived tasks; it lacks built-in support for stateful sliding window aggregations and cannot maintain per-key state across events without external services.
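
The sliding-window aggregation mentioned above can be sketched in plain Python. This is a minimal illustration of how one event falls into several overlapping windows, not Dataflow or Apache Beam code; `sliding_windows`, the 60-second size, the 30-second period, and the sample events are assumptions made for the example.

```python
from collections import defaultdict


def sliding_windows(timestamp: int, size: int, period: int) -> list[tuple[int, int]]:
    """Return the [start, end) windows of length `size`, one starting every
    `period` seconds, that contain `timestamp`."""
    windows = []
    start = timestamp - (timestamp % period)  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


# Hypothetical (timestamp_seconds, amount) transactions.
events = [(5, 10.0), (40, 20.0), (65, 5.0)]

totals = defaultdict(float)
for ts, amount in events:
    for window in sliding_windows(ts, size=60, period=30):
        totals[window] += amount

for window, total in sorted(totals.items()):
    print(window, total)
```

With a 60-second size and a 30-second period, every event contributes to two windows, which is why a sliding-window aggregation needs per-key state that a stateless function cannot hold on its own.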

231

MCQ · hard

Your team manages a multi-model ensemble deployed on Vertex AI Endpoint. The ensemble consists of three models: a neural network (NN), a gradient boosted tree (GBT), and a logistic regression (LR). They are deployed as separate endpoints and traffic is split using a traffic split configuration. Recently, the overall accuracy dropped from 92% to 85%. Monitoring shows that the NN model's latency has increased significantly, causing it to miss timeouts and fall back to default predictions. The other two models are performing normally. The NN model is the most complex and handles the majority of the traffic. You need to restore accuracy quickly. What should you do first?

A. Increase the timeout for predictions on the NN endpoint to avoid fallback.

B. Enable fallback logic to use the GBT model when NN times out, ensuring no prediction is missed.

C. Temporarily reduce the traffic percentage to the NN model to 0% and redistribute to GBT and LR until the NN issue is resolved.

D. Relaunch the NN model with a larger machine type and more replicas to reduce latency.

Answer: C

This immediately stops the problematic model from serving and restores accuracy using the other models.

232

MCQ · medium

What is the most likely cause of this error?

A. The JSONL file is missing some required fields.

B. The input images have a different number of channels than expected.

C. The instances are in JSON format instead of JSONL.

D. Each JSONL line contains a single image tensor without a batch dimension.

Answer: D

The model expects a batch dimension; each line should contain a batch of images.

Why this answer

The error occurs because each line in a JSONL file is expected to be a self-contained JSON object representing a single inference request. When the line contains a raw image tensor without a batch dimension, the model's serving framework (e.g., TensorFlow Serving or TorchServe) cannot perform batched inference, as it expects input tensors to have shape (batch_size, channels, height, width) or (batch_size, height, width, channels). The missing batch dimension causes a shape mismatch error during model execution.

Exam trap

Google Cloud often tests the distinction between data format errors (like JSON vs JSONL) and tensor shape errors, trapping candidates who confuse a missing batch dimension with a missing field or channel mismatch.

How to eliminate wrong answers

Option A is wrong because missing required fields would typically cause a parsing or validation error, not a tensor shape mismatch; the error described is specifically about the batch dimension, not missing fields. Option B is wrong because an incorrect number of channels would produce a channel mismatch error, not a missing batch dimension error; the model expects a specific channel count, but the error message would reference channel depth, not batch size. Option C is wrong because JSON format instead of JSONL would cause a file parsing error (e.g., expecting one JSON object per line but finding a JSON array), not a tensor shape error; the serving framework would fail to load the file entirely.
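
The shape difference behind a missing batch dimension can be checked with a few lines of plain Python. This is only a sketch: the `image` field name, the 2x2 RGB image, and the (batch, height, width, channels) layout are assumptions, and whether wrapping the tensor is the right fix depends on the input signature of the model in question.

```python
import json


def shape(tensor) -> tuple[int, ...]:
    """Return the shape of a non-empty, rectangular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


# A tiny hypothetical 2x2 RGB image laid out as (height, width, channels).
image = [[[0, 0, 0], [255, 255, 255]],
         [[255, 0, 0], [0, 0, 255]]]

line_without_batch = json.dumps({"image": image})
line_with_batch = json.dumps({"image": [image]})  # wrap once to add a batch axis

print(shape(json.loads(line_without_batch)["image"]))  # (2, 2, 3)
print(shape(json.loads(line_with_batch)["image"]))     # (1, 2, 2, 3)
```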

233

Multi-Select · hard

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

Select 3 answers

A. Migrate to BigQuery for all analytics.

B. Use Cloud Dataproc with Spark and Hive components.

C. Store data in Cloud Storage instead of HDFS.

D. Rewrite Spark jobs as Dataflow pipelines.

E. Use Dataproc Jobs API to submit jobs.

Answers: B, C, E

Compatible with existing code.

Why this answer

Option B is correct because Cloud Dataproc is a managed Spark and Hadoop service that supports the same Spark and Hive components used on-premises, allowing existing Spark jobs to run with minimal modification. It provides native integration with Cloud Storage, which can replace HDFS without changing job logic, and the Dataproc Jobs API enables programmatic job submission, preserving existing workflows.

Exam trap

The trap here is that candidates may assume BigQuery or Dataflow are the only Google Cloud data processing options, overlooking that Dataproc is specifically designed for minimal-change migrations of existing Spark/Hadoop workloads.

234

Multi-Select · hard

A company is building a real-time analytics pipeline with Pub/Sub and Dataflow. Which THREE best practices should they follow?

Select 3 answers

A. Use event time processing with watermarks and allowed lateness

B. Design idempotent sinks to handle duplicate outputs

C. Use exactly-once processing for all transforms

D. Use at-least-once delivery with deduplication in the pipeline

E. Use event time processing only for batch pipelines

Answers: A, B, D

Event time processing supports out-of-order data and ensures accurate windowing.

Why this answer

Option A is correct because in streaming pipelines, event time processing with watermarks and allowed lateness is essential for handling out-of-order data. Watermarks track the progress of event time, and allowed lateness specifies how long to wait for late-arriving data before considering it as late, ensuring accurate windowed aggregations.

Exam trap

Google Cloud often tests the misconception that exactly-once processing must be applied uniformly across all pipeline transforms, when in practice it is only required at sinks and can be replaced by at-least-once with deduplication for better performance.
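
The interaction between the watermark and allowed lateness can be reduced to a small decision rule. The function below is a simplified sketch, not the Apache Beam implementation: `classify` and the numeric values are hypothetical, and the exact boundary behavior in a real runner may differ.

```python
def classify(window_end: int, watermark: int, allowed_lateness: int) -> str:
    """Classify an element arriving for a window, given the current watermark.

    All arguments are event-time values in seconds.
    """
    if watermark < window_end:
        return "on time"  # the window has not closed yet
    if watermark < window_end + allowed_lateness:
        return "late but accepted"  # window closed, still within allowed lateness
    return "dropped"  # beyond allowed lateness, the window's state is gone


# Window ends at t=60 and allowed lateness is 30 seconds.
for watermark in (50, 70, 95):
    print(watermark, classify(window_end=60, watermark=watermark, allowed_lateness=30))
```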

235

MCQ · easy

A startup is building a real-time dashboard that shows aggregated metrics from social media feeds. They expect up to 10,000 events per second. The data must be near-real-time (< 30 seconds latency) and stored in BigQuery for historical analysis. They have limited experience managing infrastructure. The CTO suggests using Apache Kafka on Compute Engine for ingestion. However, the data engineer recommends a fully managed solution. Which approach should the team adopt?

A. Use Cloud Functions to ingest events directly into BigQuery

B. Use Apache Kafka on Compute Engine for ingestion, then use Dataflow to write to BigQuery

C. Use Cloud Pub/Sub for ingestion and Cloud Dataflow for streaming into BigQuery

D. Use App Engine to receive events and write to BigQuery

Answer: C

Fully managed, scales automatically, low operations overhead.

Why this answer

Option C is correct because Cloud Pub/Sub provides a fully managed, scalable ingestion service that can handle 10,000+ events per second without infrastructure management, and Cloud Dataflow offers exactly-once, auto-scaling streaming into BigQuery with sub-30-second latency. This combination meets the near-real-time requirement while eliminating operational overhead, aligning with the data engineer's recommendation for a fully managed solution.

Exam trap

The trap here is that candidates may choose Option B (Kafka on Compute Engine) because Kafka is a common streaming tool, but the question emphasizes limited infrastructure experience and a fully managed solution, making the self-managed Kafka approach a distraction that ignores operational overhead.

How to eliminate wrong answers

Option A is wrong because Cloud Functions has a maximum invocation timeout of 9 minutes and is designed for event-driven, short-lived tasks, not sustained high-throughput ingestion of 10,000 events per second; it would also lack buffering and retry mechanisms for streaming into BigQuery. Option B is wrong because managing Apache Kafka on Compute Engine requires significant operational expertise for cluster setup, partitioning, and monitoring, contradicting the team's limited experience and the goal of a fully managed solution. Option D is wrong because App Engine is a web application platform, not a streaming ingestion service; it would introduce HTTP overhead and scaling bottlenecks for high-velocity event streams, and writing directly to BigQuery from App Engine would risk data loss without a buffer.

236

MCQ · easy

Given the query plan, what is the most likely reason this query is efficient despite processing 10 billion rows?

A. The query uses a wildcard function.

B. The table is partitioned by sale_date.

C. The table is materialized.

D. The table is clustered by product_id.

Answer: B

Partition pruning removes irrelevant partitions, reducing scanned data from billions of rows to only those in the date range.

Why this answer

Option B is correct because partitioning by sale_date enables partition pruning, which allows the query engine to scan only the relevant partitions instead of the entire 10-billion-row table. This drastically reduces the amount of data read and processed, making the query efficient even with a large total row count.

Exam trap

Google Cloud often tests the distinction between partitioning (which reduces scanned rows via pruning) and clustering (which only improves sorting and compression within partitions), leading candidates to mistakenly choose clustering as the primary efficiency driver.

How to eliminate wrong answers

Option A is wrong because using a wildcard function (e.g., SELECT *) typically increases I/O and processing overhead by reading all columns, which would not improve efficiency. Option C is wrong because a materialized table is a precomputed snapshot that can speed up queries, but it does not inherently reduce the number of rows scanned; the efficiency gain here comes from partition pruning, not materialization. Option D is wrong because clustering by product_id organizes data within partitions for better compression and filter performance, but without partition pruning, the query would still need to scan all 10 billion rows, so clustering alone does not explain the efficiency.
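
Partition pruning can be illustrated with a toy table held in memory. This is a simulation of the idea only, not how BigQuery stores or scans data; the `partitions` layout, `query_total`, and the row counts are hypothetical.

```python
from datetime import date

# Hypothetical table: one list of rows per sale_date partition (30 x 1,000 rows).
partitions = {
    date(2024, 1, day): [{"product_id": day, "amount": 10}] * 1_000
    for day in range(1, 31)
}


def query_total(partitions: dict, start: date, end: date) -> tuple[int, int]:
    """Sum `amount` for sale_date in [start, end]; return (total, rows_scanned)."""
    total = 0
    rows_scanned = 0
    for sale_date, rows in partitions.items():
        if not (start <= sale_date <= end):
            continue  # pruned: the partition is skipped without reading its rows
        rows_scanned += len(rows)
        total += sum(row["amount"] for row in rows)
    return total, rows_scanned


total, scanned = query_total(partitions, date(2024, 1, 5), date(2024, 1, 6))
all_rows = sum(len(rows) for rows in partitions.values())
print(f"total={total}, scanned {scanned} of {all_rows} rows")
```

A two-day filter touches only two of the thirty partitions, so the work done tracks the size of the date range rather than the size of the table.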

237

MCQ · easy

A media company runs a batch data pipeline on Cloud Dataflow that ingests log files from Cloud Storage, transforms them, and writes results to BigQuery for analytics. The pipeline runs daily and has been stable for months. Recently, the source log format changed: a new optional field was added to some records. The pipeline started failing with ParseErrors for rows that contain the new field. The error logs show that the Dataflow job uses a hardcoded JSON schema that does not include the new field. The Dataflow pipeline logs are written to Stackdriver Logging, but no alerts are configured. The team wants to ensure that future schema changes do not break the pipeline and that failures are detected promptly. The team has limited experience with streaming and wants to keep the batch approach. Which course of action should the team take to improve solution quality?

A. Create a Cloud Monitoring alert on any PipelineError log entries from the Dataflow job, and set up a runbook to manually fix schema mismatches within one hour.

B. Schedule a Cloud Function to run every hour that checks the latest log file headers and compares them to the pipeline schema, sending an alert if differences are found.

C. Use BigQuery dry run queries to validate the schema before loading data, and if a mismatch is detected, block the pipeline run and notify the team via email.

D. Implement schema validation and evolution using a schema registry (e.g., Avro) in the Dataflow pipeline, and configure Stackdriver alerts on pipeline failure or error logs.

Answer: D

A schema registry allows the pipeline to handle new fields gracefully by using a flexible schema, and monitoring alerts ensure timely detection of any remaining issues.

Why this answer

Option D is correct because validating records against a schema managed in a schema registry (e.g., with Avro schemas), rather than against a hardcoded JSON schema, lets the pipeline accept new optional fields without failing. Additionally, configuring Stackdriver alerts on pipeline failure logs ensures prompt detection of issues. Option A is incorrect because it only addresses detection after the fact, not prevention.

Option B is incorrect because a Cloud Function that checks file headers every hour is reactive and inefficient. Option C is incorrect because a BigQuery dry run does not prevent the Dataflow parsing failures.
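
The idea of tolerating new optional fields can be sketched without any schema registry. The parser below is a plain-Python stand-in, not Avro and not Dataflow code; the field names and the `parse_record` helper are hypothetical.

```python
import json

# Hypothetical schema: required fields must be present, optional ones may be absent.
REQUIRED = {"timestamp", "user_id"}
OPTIONAL = {"referrer"}


def parse_record(line: str) -> dict:
    """Parse one JSON log line, tolerating unknown fields instead of failing."""
    record = json.loads(line)
    missing = REQUIRED - set(record)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    known = REQUIRED | OPTIONAL
    parsed = {name: record.get(name) for name in sorted(known)}
    # Keep unknown fields aside so they can be logged or alerted on.
    parsed["_unknown"] = {k: v for k, v in record.items() if k not in known}
    return parsed


row = parse_record('{"timestamp": "2024-01-01T00:00:00Z", "user_id": 7, "new_field": "x"}')
print(row)
```

A record carrying an unexpected field still parses, and the contents of `_unknown` give a monitoring alert something concrete to fire on.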

238

MCQ · medium

Your Dataflow streaming pipeline is reading from Cloud Pub/Sub and writing to BigQuery. Users report occasional data duplication in the BigQuery table. You verify the pipeline uses exactly-once processing and idempotent writes. The Dataflow monitoring shows no errors, but the pipeline has occasional worker restarts. What is the most likely cause of the duplicates?

A. The pipeline is using a global window with an early trigger, causing late data to be reprocessed.

B. The Pub/Sub subscription is configured with at-least-once delivery, causing duplicate messages.

C. The BigQuery table has a time-based partitioning column that is not aligned with the event timestamp.

D. The pipeline does not set the insertId parameter in the BigQuery streaming output.

Answer: D

BigQuery streaming inserts use insertId for deduplication. Without it, retried inserts may create duplicate rows.

Why this answer

Option D is correct because BigQuery's streaming API uses the `insertId` parameter to deduplicate records within the streaming buffer. Without a unique `insertId`, BigQuery cannot detect and discard duplicate inserts that may occur when Dataflow retries a write after a worker restart. Even with exactly-once processing in the pipeline, the BigQuery streaming endpoint itself is at-least-once, so the `insertId` is essential for deduplication.

Exam trap

Google Cloud often tests the misconception that exactly-once processing in the pipeline (Dataflow) automatically guarantees exactly-once delivery to the sink (BigQuery), ignoring that the sink itself may require explicit deduplication parameters like `insertId`.

How to eliminate wrong answers

Option A is wrong because a global window with an early trigger would cause multiple emissions per window, but the pipeline uses exactly-once processing and idempotent writes, so any late data would be handled without duplication. Option B is wrong because Pub/Sub subscriptions are inherently at-least-once, but Dataflow's exactly-once processing (via checkpointing and deduplication) handles this; the issue is downstream at BigQuery. Option C is wrong because time-based partitioning misalignment would cause data to land in the wrong partition, not duplicate rows; duplication is a separate concern related to insert identification.

239

Multi-Select · hard

An MLOps team manages a pipeline that retrains an XGBoost classifier weekly using BigQuery data. The pipeline is orchestrated with Cloud Composer and deploys the new model to Vertex AI Endpoint if validation metrics (AUC > 0.9) are met. Over the past month, the deployed model's AUC has dropped from 0.95 to 0.88, despite the training pipeline consistently reporting AUC > 0.9. Which THREE steps should the team take to diagnose and fix this issue?

Select 3 answers

A. Review the training pipeline's hyperparameter tuning configuration to ensure it is not overfitting to stale data.

B. Add a canary deployment step where new model version receives a small percentage of traffic before full rollout.

C. Compare feature distributions between the training data and online serving data using Vertex AI Model Monitoring.

D. Retrain the model using a longer training history to include older data that may still be relevant.

E. Implement model validation on the deployed endpoint by logging predictions and comparing against actuals for a sample of traffic using Vertex Explainable AI.

Answers: B, C, E

Canary testing can catch performance issues early before the model is fully deployed.

240

Multi-Select · medium

You are designing a streaming Dataflow pipeline that processes high-throughput data. Which two features can help minimize cost? (Choose TWO.)

Select 2 answers

A. Enable autoscaling based on CPU utilization

B. Use batch loads to BigQuery for streaming inserts

C. Enable Streaming Engine to decouple compute and storage

D. Use preemptible VMs for all workers

E. Use a global window and batch output to BigQuery every hour

Answers: A, C

Autoscaling adjusts the number of workers to meet demand, avoiding over-provisioning and reducing cost.

Why this answer

Option A is correct because enabling autoscaling based on CPU utilization allows the Dataflow pipeline to dynamically adjust the number of worker instances in response to the actual processing load. This prevents over-provisioning during low-throughput periods, directly reducing compute cost while maintaining performance during spikes.

Exam trap

Google Cloud often tests the misconception that preemptible VMs are always cost-effective for streaming workloads, but the trap here is that preemptible VMs are unsuitable for stateful streaming pipelines due to frequent preemption causing data reprocessing and instability.

241

MCQ · hard

A data analyst frequently queries a BigQuery table that contains an array of structs representing product purchases. The query below runs slowly:

    SELECT customer_id, COUNT(purchase) AS total_purchases
    FROM sales, UNNEST(purchases) AS purchase
    GROUP BY customer_id

What change would most improve query performance?

A. Create a materialized view that pre-aggregates by customer_id and purchase count

B. Partition the table by transaction date

C. Use a subquery to filter purchases first

D. Cluster the table by purchases.product_id

Answer: A

A materialized view pre-computes the aggregation, so queries read the view instead of scanning the full table.

Why this answer

The query runs slowly because it must unnest the `purchases` array for every row and then aggregate. A materialized view pre-aggregates the data by `customer_id` and purchase count, avoiding repeated full scans and unnesting. This is the most impactful optimization because it eliminates the compute cost of UNNEST and GROUP BY at query time.

Exam trap

Google Cloud often tests the misconception that any indexing or partitioning strategy (like clustering or partitioning) universally speeds up all queries, when in fact the fix must target the specific expensive operation — here, the UNNEST and GROUP BY — rather than adding a generic optimization.

How to eliminate wrong answers

Option B is wrong because partitioning by transaction date does not help this query — there is no WHERE clause filtering by date, so all partitions would still be scanned. Option C is wrong because a subquery to filter purchases first does not reduce the amount of data that must be unnested or aggregated; it adds a nested scan without addressing the core performance bottleneck. Option D is wrong because clustering by purchases.product_id would only improve queries that filter or group by that field, but this query groups by customer_id, not product_id.

242

MCQ · easy

You are deploying a machine learning model to production using Vertex AI. The model requires GPU acceleration for low-latency predictions. You need to minimize costs while ensuring availability during a defined business hours window (8 AM to 6 PM). Which deployment strategy should you use?

A. Deploy to an endpoint with manual scaling, set min nodes to zero and max nodes to 10, and use a cron job to adjust during business hours.

B. Use a custom prediction routine (CPR) that dynamically requests GPUs from the cluster.

C. Deploy to a dedicated endpoint with a GPU machine and configure autoscaling.

D. Use Cloud Functions to invoke the model, and let Google Cloud manage the underlying GPU infrastructure.

Answer: A

Manual scaling allows setting min to zero, stopping all nodes outside hours, and auto-scheduling via cron or Cloud Scheduler to scale up before 8 AM and down after 6 PM, minimizing cost.

Why this answer

Option A is correct because it uses manual scaling with a cron job to set min nodes to zero outside business hours (8 AM–6 PM) and scale up to a maximum of 10 nodes during business hours, ensuring GPU availability when needed while minimizing costs by running zero instances when the model is not required. This approach directly addresses the requirement for low-latency GPU predictions during a defined window without paying for idle GPU resources outside that window.

Exam trap

Google Cloud often tests the misconception that autoscaling alone is sufficient for cost optimization, but the trap here is that autoscaling with a GPU machine typically requires a minimum of one replica, which still incurs 24/7 GPU costs, whereas manual scaling with a cron job to set min nodes to zero is the only way to completely eliminate GPU costs outside the defined business hours.

How to eliminate wrong answers

Option B is wrong because a custom prediction routine (CPR) is a way to package custom logic for serving predictions, not a deployment strategy for managing GPU scaling or scheduling; it does not inherently control when GPUs are requested or released based on a business hours window. Option C is wrong because deploying to a dedicated endpoint with a GPU machine and autoscaling will keep at least one instance running continuously (autoscaling typically has a minimum of 1 node), incurring costs 24/7 even when the model is not needed outside business hours. Option D is wrong because Cloud Functions does not support GPU acceleration; it is a serverless compute platform for lightweight, stateless functions and cannot attach GPUs for model inference.
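
The schedule a cron job would enforce can be written as a small function. This is a sketch of the scheduling logic only, with no calls to Vertex AI or Cloud Scheduler; `desired_replicas` and the replica count are hypothetical, and the premise that the endpoint can run with zero replicas is taken from the explanation above rather than verified here.

```python
def desired_replicas(hour: int, business_replicas: int = 1) -> int:
    """Return the replica count for an hour of day (0-23), given an
    8 AM to 6 PM business window."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return business_replicas if 8 <= hour < 18 else 0


# GPU replica-hours per day with the schedule versus one replica running 24/7.
scheduled_hours = sum(desired_replicas(hour) for hour in range(24))
always_on_hours = 24
print(scheduled_hours, always_on_hours)  # 10 24
```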

243

MCQ · medium

A data team uses Cloud Dataproc to run nightly Spark jobs. The job volume has increased, and the cluster is often underutilized during the day. They want to reduce costs while ensuring jobs can scale when needed. Which strategy should they adopt?

A. Use preemptible workers for both primary and secondary nodes to minimize cost.

B. Manually scale the cluster up before nightly jobs and down after.

C. Use a cluster with a small number of primary workers and a large pool of preemptible workers, and enable autoscaling.

D. Use custom machine types with local SSDs for primary workers to improve I/O.

Answer: C

Preemptible workers are cheap, and autoscaling adjusts to load.

Why this answer

Option C is correct because it combines a small number of primary (non-preemptible) workers for reliability with a large pool of preemptible workers for cost-effective scaling, and enables autoscaling to dynamically adjust the cluster size based on workload. This minimizes cost during idle periods (preemptible instances are ~80% cheaper) while ensuring jobs can scale up quickly when needed, as autoscaling adds preemptible workers automatically. Preemptible workers are ideal for fault-tolerant Spark jobs that can handle node preemptions.

Exam trap

Google Cloud often tests the misconception that preemptible instances can be used for all nodes, but the trap here is that primary nodes require non-preemptible instances for cluster stability, while preemptible workers are only suitable for secondary (task) nodes in a fault-tolerant framework.

How to eliminate wrong answers

Option A is wrong because using preemptible workers for primary nodes is not allowed in Cloud Dataproc—primary nodes must be non-preemptible to ensure cluster stability and avoid data loss from coordinator failures. Option B is wrong because manual scaling is inefficient and error-prone for a nightly job pattern; it requires human intervention and cannot react to sudden workload spikes, leading to either underutilization or job delays. Option D is wrong because custom machine types with local SSDs improve I/O performance but do not address cost reduction or scaling needs; they increase cost without solving underutilization during the day.
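
The cost argument for mixing primary and preemptible workers is simple arithmetic. The sketch below uses the roughly 80% discount quoted above; the price of 1.00 per worker-hour and the worker counts are made-up numbers for illustration, not Google Cloud pricing.

```python
def hourly_cost(primary: int, secondary: int, price: float,
                preemptible_discount: float = 0.80) -> float:
    """Hourly cluster cost with standard primary workers and preemptible
    secondary workers billed at a discount off the same price."""
    return primary * price + secondary * price * (1 - preemptible_discount)


# Hypothetical 10-worker cluster at 1.00 per worker-hour.
all_standard = hourly_cost(primary=10, secondary=0, price=1.00)
mixed = hourly_cost(primary=2, secondary=8, price=1.00)
print(round(all_standard, 2), round(mixed, 2))
```

Under these assumptions the mixed cluster costs about a third of the all-standard one at full size, and autoscaling reduces the bill further by removing secondary workers when no jobs are running.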

244

Multi-Select · hard

A company trains a model using Cloud TPUs. The model is deployed to AI Platform Prediction using a custom container with TensorFlow. Which THREE considerations are most important when serving this model?

Select 3 answers

A.The model should be retrained using GPU to ensure identical performance on serving hardware.

B.The serving container must have the same TensorFlow version that was used during training to avoid compatibility issues.

C.The model should be quantized to reduce memory footprint before deployment.

D.The serving infrastructure must use GPU or CPU, as AI Platform Prediction does not support TPU serving.

E.The model must be exported as a TensorFlow SavedModel and packaged in a custom container with proper dependencies.

AnswersB, D, E

Version mismatch can cause errors or different behavior.

Why this answer

Option B is correct because TensorFlow models are tightly coupled to the specific version of TensorFlow used during training. Serving with a different version can lead to incompatibilities in graph serialization, op definitions, or checkpoint formats, causing runtime errors or silent prediction failures. AI Platform Prediction's custom container must therefore match the training environment's TensorFlow version to ensure the model loads and executes correctly.

Exam trap

Google Cloud often tests the misconception that hardware must match between training and serving, but the real requirement is software version compatibility, not hardware identity.

245

Multi-Selectmedium

A data warehouse in BigQuery is experiencing performance issues. Which THREE techniques can improve performance without moving data to a different storage system?

Select 3 answers

A.Partition by date

B.Cluster by common filter columns

C.Use streaming buffer

D.Use BigQuery slots

E.Use materialized views

AnswersA, B, E

Partitioning limits scans to relevant partitions.

Why this answer

Partitioning by date in BigQuery allows the query engine to prune entire partitions that do not match the query's date filter, significantly reducing the amount of data scanned and improving performance. This technique works without moving data to a different storage system because it is a metadata-level reorganization of the existing table.
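
Partition pruning can be illustrated without BigQuery at all. The sketch below is plain Python, not the BigQuery API: it groups hypothetical rows by an assumed `event_date` column and shows that a query filtered on one date only has to touch that date's partition.

```python
from collections import defaultdict
from datetime import date


def build_partitions(rows):
    """Group rows by their event_date, mimicking a date-partitioned table."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_date"]].append(row)
    return partitions


def query_one_day(partitions, wanted_date):
    """Return the matching rows and how many rows had to be scanned."""
    scanned = partitions.get(wanted_date, [])  # every other partition is pruned
    return scanned, len(scanned)


# Hypothetical table: 30 days of data, 100 rows per day.
rows = [{"event_date": date(2024, 1, day), "amount": day * 10}
        for day in range(1, 31) for _ in range(100)]

partitions = build_partitions(rows)
matches, rows_scanned = query_one_day(partitions, date(2024, 1, 15))
print(f"scanned {rows_scanned} of {len(rows)} rows")
```

Without partitioning, the same filter would have to examine all 3000 rows to find the 100 that match.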

Exam trap

Google Cloud often tests the misconception that streaming buffer (Option C) is a performance optimization, when in fact it is designed for near-real-time ingestion and can degrade query performance due to the small, unoptimized files it creates.

246

MCQmedium

A machine learning pipeline uses Vertex AI Pipelines. One component fails intermittently due to resource constraints. What is the best way to handle this?

A.Use retry policies in the component specification

B.Deploy the pipeline on a larger cluster

C.Increase the pipeline timeout

D.Use a different orchestrator

AnswerA

Retry policies handle intermittent failures by automatically retrying the component.

Why this answer

Option A is correct because Vertex AI Pipelines supports retry policies for individual pipeline tasks. With the Kubeflow Pipelines SDK, a retry count and backoff can be configured on the task that runs the component (for example, by calling `set_retry` on the task). This allows the pipeline to automatically re-execute a failed component when the failure is transient (e.g., resource exhaustion), without manual intervention. Retry policies are the standard mechanism for handling intermittent failures in a serverless orchestration environment like Vertex AI Pipelines.

Exam trap

Google Cloud often tests the misconception that scaling up infrastructure (Option B) is the primary fix for intermittent failures, when in fact retry policies are the correct, cost-efficient solution for transient resource constraints in a managed pipeline service.

How to eliminate wrong answers

Option B is wrong because deploying the pipeline on a larger cluster does not address the intermittent nature of the failure; it only increases resource capacity, which may not be cost-effective and does not handle transient resource spikes. Option C is wrong because increasing the pipeline timeout does not resolve resource constraints; it only gives the component more time to run, which will still fail if resources are insufficient. Option D is wrong because using a different orchestrator (e.g., Kubeflow Pipelines, Argo) does not inherently fix resource constraints; the issue is with the component's resource allocation, not the orchestration engine itself.
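
The behaviour a retry policy provides can be sketched in plain Python. This is a generic retry-with-exponential-backoff loop, not the Vertex AI or Kubeflow Pipelines API; `TransientResourceError`, `run_with_retries`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class TransientResourceError(Exception):
    """Hypothetical error raised when a step briefly runs out of resources."""


def run_with_retries(step, max_retries=3, base_delay=1.0,
                     backoff_factor=2.0, sleep=time.sleep):
    """Run step(); on a transient failure wait and retry, up to max_retries times."""
    attempt = 0
    while True:
        try:
            return step()
        except TransientResourceError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * backoff_factor ** attempt)
            attempt += 1


# Demo: a step that fails twice and then succeeds.
calls = {"count": 0}


def flaky_step():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise TransientResourceError("resources temporarily exhausted")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky_step, sleep=delays.append)
print(result, delays)
```

The step succeeds on the third attempt after two increasing waits, which is the outcome a retry policy aims for when the underlying problem is transient.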

247

MCQmedium

A retail company uses a machine learning model to predict inventory demand. The model is retrained weekly using Vertex AI Pipelines. Recently, the model's accuracy has degraded because the data distribution has shifted. Which action should you take to monitor and detect this drift automatically?

A.Enable Vertex AI Model Monitoring for the endpoint and configure alerting on feature drift

B.Set up alerts for when the model's mean absolute error exceeds a threshold on the evaluation dataset

C.Enable Cloud Logging for the prediction endpoint and search for error logs

D.Schedule a job to compare the distribution of incoming features with the training data using Cloud Dataflow

AnswerA

Model Monitoring automates drift detection.

Why this answer

Vertex AI Model Monitoring is purpose-built to automatically detect feature drift and prediction drift on deployed endpoints. By enabling it and configuring alerting on feature drift, you can proactively identify when the distribution of incoming features deviates from the training data, which directly addresses the root cause of accuracy degradation without manual intervention.

Exam trap

Google Cloud often tests the distinction between monitoring model performance metrics (like MAE) versus monitoring input data distributions (feature drift), and candidates mistakenly choose a performance-based alerting option because they think accuracy degradation is the only signal, ignoring that drift detection is the proactive mechanism to catch the root cause before accuracy drops.

How to eliminate wrong answers

Option B is wrong because setting alerts on mean absolute error (MAE) on the evaluation dataset only detects performance degradation after the fact, not the underlying data distribution shift; it also requires ground truth labels, which may not be available in real time. Option C is wrong because Cloud Logging for the prediction endpoint captures request/response logs and error messages, but it does not perform statistical drift analysis or compare feature distributions. Option D is wrong because scheduling a job with Cloud Dataflow to compare distributions is a custom, manual approach that lacks the automated, integrated monitoring and alerting capabilities of Vertex AI Model Monitoring, and it introduces unnecessary operational overhead.
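
To make "feature drift" concrete, the sketch below compares the distribution of one categorical feature at training time with its distribution in serving traffic. It uses the L-infinity distance (the largest difference in any category's share) as an assumed drift statistic; the feature values and the 0.2 threshold are hypothetical, and this is not a description of how Vertex AI Model Monitoring computes its scores.

```python
from collections import Counter


def category_distribution(values):
    """Share of each category among the observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(baseline, current):
    """Largest absolute difference in share across all categories."""
    categories = set(baseline) | set(current)
    return max(abs(baseline.get(c, 0.0) - current.get(c, 0.0))
               for c in categories)


# Hypothetical "channel" feature: training data versus recent serving traffic.
training = ["web"] * 70 + ["mobile"] * 30
serving = ["web"] * 40 + ["mobile"] * 60

drift = l_infinity_distance(category_distribution(training),
                            category_distribution(serving))

DRIFT_THRESHOLD = 0.2  # hypothetical alerting threshold
alert = drift > DRIFT_THRESHOLD
print(f"drift={drift:.2f}, alert={alert}")
```

Note that this check needs no ground-truth labels, which is why drift detection can fire before an error-based metric such as MAE is even measurable.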

248

Multi-Selectmedium

Which THREE metrics should be monitored for a deployed machine learning model in production?

Select 3 answers

A.Number of replicas

B.Prediction error rate

C.Data drift detection

D.Training time

E.Prediction latency

AnswersB, C, E

Prediction error rate is a direct measure of accuracy in production.

Why this answer

Prediction error rate (Option B) is a direct measure of model accuracy in production, reflecting how often the model's predictions deviate from actual outcomes. Monitoring this metric is essential for detecting model degradation, data quality issues, or concept drift that can silently reduce model performance over time.

Exam trap

Google Cloud often tests the distinction between operational metrics (like latency, error rate, drift) and development/infrastructure metrics (like training time, replica count) to see if candidates understand what is relevant for ongoing model monitoring versus model building or deployment scaling.

249

MCQeasy

A team developed a microservice that writes logs to stdout. They want to centralize logs for analysis. Which GCP service should they use to automatically collect and store logs?

A.Install the Cloud Logging agent on the VM running the microservice.

B.Publish logs to a Pub/Sub topic and later store them.

C.Write logs directly to Cloud Storage.

D.Use the Cloud Logging client library (google-cloud-logging) for the microservice's language.

AnswerD

The client library automatically sends structured logs to Cloud Logging, enabling centralized analysis.

Why this answer

Option D is correct because the Cloud Logging client library sends the microservice's log entries to Cloud Logging as structured logs, where they are stored centrally for analysis. Option A (Cloud Logging agent) is for VMs, not containers. Option B (Pub/Sub) is a messaging service, not a log collection and storage service. Option C (Cloud Storage) is object storage and does not collect logs automatically.

250

Multi-Selectmedium

A data science team has deployed a custom TensorFlow model on Vertex AI Prediction. They notice increasing prediction latency and a growing number of 503 errors during peak traffic hours. The model is served using a single regional endpoint with min replica count of 2 and max replica count of 10. Which TWO actions should the team take to address these issues?

Select 2 answers

A.Use a larger machine type (e.g., n1-highmem-8) instead of the current n1-standard-4 to improve per-replica throughput.

B.Enable autoscaling with a higher max replica count and configure a CPU utilization target of 60%.

C.Reduce the min replica count to 0 to allow the service to scale down to zero when not in use.

D.Deploy the model as a batch prediction job and move all online predictions to batch.

E.Switch to a global endpoint with automatic scaling to distribute traffic across multiple regions.

AnswersB, E

Increasing max replicas and tuning CPU utilization target helps handle peak load and reduce latency.

251

MCQhard

A company serves multiple models using Vertex AI endpoints. Each model has different latency and memory requirements. To minimize cost, the company wants to share underlying compute resources among models. Which approach should they use?

A.Deploy each model as a separate Cloud Run service and use a load balancer.

B.Use a single GKE cluster with multiple deployments and use Istio for routing.

C.Deploy all models to a single Vertex AI endpoint and configure traffic splitting.

D.Create separate endpoints for each model and use a load balancer to route traffic.

AnswerC

Vertex AI endpoints allow deploying multiple models behind one endpoint, sharing resources.

Why this answer

Vertex AI endpoints support traffic splitting, allowing you to deploy multiple models behind a single endpoint and route a percentage of traffic to each model. This enables resource sharing and cost optimization because the underlying compute infrastructure is shared among the models, unlike separate endpoints which would each require dedicated resources.

Exam trap

Google Cloud often tests the misconception that separate endpoints or services are required for different models, when in fact Vertex AI endpoints support multi-model deployment with traffic splitting to share resources and minimize cost.

How to eliminate wrong answers

Option A is wrong because deploying each model as a separate Cloud Run service and using a load balancer does not share underlying compute resources; each service runs in its own container instance, leading to higher cost and no direct model-level traffic splitting. Option B is wrong because using a single GKE cluster with multiple deployments and Istio for routing is overly complex for Vertex AI model serving, and it bypasses the managed Vertex AI endpoint capabilities that natively support traffic splitting and resource sharing. Option D is wrong because creating separate endpoints for each model and using a load balancer defeats the purpose of sharing compute resources; each endpoint would have its own dedicated resources, increasing cost and management overhead.

252

MCQmedium

A media company uses Cloud Data Loss Prevention (DLP) API to inspect and de-identify sensitive data before loading into BigQuery. They want to reduce costs by sampling the data during inspection. Which configuration should they use?

A.Use the 'ROWS' limit in the inspection job.

B.Set the sample method to 'RANDOM' with a percentage.

C.Use a hybrid inspection with a BigQuery sample table.

D.Use the 'BYTES_LIMIT' parameter.

AnswerB

DLP supports random sampling to inspect a subset of data, reducing cost.

Why this answer

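Option B is correct because Cloud DLP inspection jobs can sample a random subset of the data instead of scanning all of it. For BigQuery sources, this is configured in the storage options by choosing a random sampling method (`sampleMethod: RANDOM_START`) together with a row limit or row percentage (`rowsLimit` or `rowsLimitPercent`). This directly reduces the volume of data scanned, lowering costs while still covering different parts of the dataset for sensitive data discovery.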

Exam trap

The trap here is that candidates confuse 'limiting rows/bytes' (which scans sequentially from the start) with 'random sampling' (which distributes inspection across the entire dataset), leading them to pick options A or D, which do not achieve representative cost reduction.

How to eliminate wrong answers

Option A is wrong because the 'ROWS' limit in an inspection job caps the total number of rows scanned but does not sample randomly; it stops after scanning that many rows from the start, which can miss sensitive data in later rows and does not provide representative sampling. Option C is wrong because hybrid inspection with a BigQuery sample table requires manually creating and maintaining a separate table, adding complexity and storage costs, whereas the DLP API's built-in sampling is simpler and directly integrated. Option D is wrong because 'BYTES_LIMIT' limits the total bytes scanned but, like 'ROWS', scans sequentially from the beginning and does not perform random sampling, leading to biased results and potential cost inefficiency.
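
The difference between a sequential limit and random sampling is easy to demonstrate. The sketch below is plain Python rather than the DLP API: it builds a hypothetical table in which only the last 10% of rows are sensitive, then compares "first N rows" with a seeded random sample of the same size.

```python
import random


def top_n_sample(rows, limit):
    """Sequential limit: inspect only the first `limit` rows."""
    return rows[:limit]


def random_sample(rows, percent, seed=0):
    """Random sampling: inspect `percent` of the rows, drawn from the whole table."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    sample_size = max(1, len(rows) * percent // 100)
    return rng.sample(rows, sample_size)


# Hypothetical table: sensitive values only appear in the last 100 of 1000 rows.
rows = [{"id": i, "sensitive": i >= 900} for i in range(1000)]

top = top_n_sample(rows, 100)
rand = random_sample(rows, 10)

top_hits = sum(row["sensitive"] for row in top)
rand_hits = sum(row["sensitive"] for row in rand)
print(f"first-100 findings: {top_hits}, random-10% findings: {rand_hits}")
```

Both approaches inspect 100 rows and therefore cost about the same, but the sequential limit can never see the sensitive rows at the end of the table, while the random sample draws from every region of it.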

253

MCQmedium

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

A.The Cloud Storage bucket has been deleted.

B.The project has reached its Dataflow API quota.

C.The input data volume has increased significantly.

D.The BigQuery output table schema has changed.

AnswerB

Resource exhausted error indicates quota issue.

Why this answer

The 'Resources have been exhausted' error in Cloud Dataflow typically indicates that the project has reached its Dataflow API quota, such as the maximum number of concurrent jobs or API requests per minute. This is a common issue when multiple jobs run simultaneously or when the quota is set low by default. The error is distinct from resource exhaustion in the underlying compute or storage layers.

Exam trap

Google Cloud often tests the distinction between API quota exhaustion and resource exhaustion in the underlying infrastructure (e.g., Compute Engine CPU/memory), leading candidates to incorrectly attribute the error to increased data volume or schema changes.

How to eliminate wrong answers

Option A is wrong because deleting the Cloud Storage bucket would cause a 'bucket not found' or 'object not found' error, not a 'Resources have been exhausted' error. Option C is wrong because a significant increase in input data volume would lead to autoscaling limits or worker resource exhaustion (e.g., out of memory), but the specific 'Resources have been exhausted' message is tied to API quota limits, not data volume. Option D is wrong because a schema change in BigQuery would result in a schema mismatch or insertion error, not an API quota exhaustion error.

254

MCQeasy

Your team uses Vertex AI Feature Store to serve features for online predictions. A feature value changes frequently (e.g., user session clicks). Which type of feature should you use to ensure low-latency writes and reads?

A.Streaming feature

B.Batch feature

C.Feature view

D.Bigtable-backed feature

AnswerA

Streaming features are designed for low-latency, high-frequency updates and reads.

Why this answer

Option A is correct because streaming features are designed for values that are updated at high frequency. Option B is wrong because batch features are meant for data that changes rarely. Option C is wrong because 'feature view' is not a type of feature in Vertex AI. Option D is wrong because 'Bigtable-backed' is not a Feature Store feature type.

255

MCQmedium

Refer to the exhibit. A BigQuery dataset has the IAM policy shown above. An analyst is trying to run a SELECT query on a table in this dataset but receives an 'Access Denied' error. What is the most likely reason?

A.The analyst does not have permission to list datasets in the project.

B.The analyst only has the roles/bigquery.metadataViewer role, which does not allow reading table data.

C.The table is in a different region than the dataset, and the analyst's query is not cross-region compatible.

D.The analyst has not been granted the 'bigquery.jobs.create' permission to run queries.

AnswerB

B is correct because roles/bigquery.metadataViewer only allows viewing metadata, not querying data.

Why this answer

The roles/bigquery.metadataViewer role grants permissions to view table and dataset metadata (e.g., table names, schemas) but does not include the bigquery.tables.getData permission required to read table rows. Therefore, when the analyst runs a SELECT query, BigQuery denies access because the role lacks the data-reading privilege. This is the most likely reason for the 'Access Denied' error.

Exam trap

Google Cloud often tests the distinction between metadata-viewing roles and data-reading roles, trapping candidates who assume that being able to see table names and schemas implies permission to query the data.

How to eliminate wrong answers

Option A is wrong because listing datasets is not required to run a SELECT query; the error is about reading table data, not dataset enumeration. Option C is wrong because BigQuery does not enforce cross-region compatibility at the dataset-table level; tables reside within the same dataset and region, and cross-region queries are allowed with appropriate permissions. Option D is wrong because the 'bigquery.jobs.create' permission is needed to submit a query job, but the error specifically indicates a data access issue, not a job creation failure; the analyst likely has this permission if they can attempt a query.
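
The permission reasoning above can be expressed as a simple set check. The sketch below lists only a small, illustrative subset of each role's permissions (the real roles contain more), and `can_select` is a hypothetical helper, not part of any Google Cloud library.

```python
# Illustrative subsets only; the real roles grant additional permissions.
ROLE_PERMISSIONS = {
    "roles/bigquery.metadataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
    },
    "roles/bigquery.dataViewer": {
        "bigquery.datasets.get",
        "bigquery.tables.get",
        "bigquery.tables.list",
        "bigquery.tables.getData",
    },
    "roles/bigquery.jobUser": {
        "bigquery.jobs.create",
    },
}

# Permissions needed to run a SELECT: submit a job and read the table's rows.
NEEDED_FOR_SELECT = {"bigquery.jobs.create", "bigquery.tables.getData"}


def can_select(roles):
    """True if the combined roles cover every permission a SELECT needs."""
    granted = set().union(*(ROLE_PERMISSIONS[role] for role in roles))
    return NEEDED_FOR_SELECT <= granted


print(can_select(["roles/bigquery.metadataViewer", "roles/bigquery.jobUser"]))
print(can_select(["roles/bigquery.dataViewer", "roles/bigquery.jobUser"]))
```

The first combination fails only because `bigquery.tables.getData` is missing, which mirrors the scenario in the question: the analyst can submit the job but cannot read the rows.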

256

Multi-Selecthard

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

Select 3 answers

A.Integrate Cloud Pub/Sub as an intermediary to buffer and allow message retry

B.Use a try-catch block in the pipeline to retry processing failed records

C.Create a Cloud Monitoring alert on pipeline failures

D.Add schema validation before processing to reject invalid JSON records

E.Implement a dead-letter queue in the Dataflow pipeline to store failed records for later analysis

AnswersA, D, E

Pub/Sub can retry delivery of messages, improving reliability.

Why this answer

Option A is correct because integrating Cloud Pub/Sub as an intermediary decouples the ingestion of JSON files from the Dataflow pipeline. Pub/Sub provides at-least-once delivery and automatic retries for messages that are not acknowledged, which buffers against transient failures and malformed records. This allows the pipeline to pull messages at its own pace and retry processing without losing data.

Exam trap

The trap here is that candidates often confuse reactive monitoring (Option C) with proactive reliability improvements, or they assume a simple try-catch block (Option B) is sufficient in a distributed processing framework like Dataflow, where fault tolerance requires persistent retry mechanisms and dead-letter queues.
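
The validation and dead-letter steps can be sketched in plain Python. This is not Apache Beam code: it only shows the routing logic a pipeline step would apply to each record. The required field names and the `route_record` and `process` helpers are hypothetical.

```python
import json

REQUIRED_FIELDS = {"device_id", "reading"}  # hypothetical schema


def route_record(raw_line):
    """Return ("valid", record) or ("dead_letter", details) for one input line."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError as exc:
        return "dead_letter", {"raw": raw_line,
                               "error": f"malformed JSON: {exc.msg}"}
    if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
        return "dead_letter", {"raw": raw_line,
                               "error": "schema validation failed"}
    return "valid", record


def process(lines):
    """Route every line to the main output or the dead-letter output."""
    outputs = {"valid": [], "dead_letter": []}
    for line in lines:
        tag, payload = route_record(line)
        outputs[tag].append(payload)
    return outputs


lines = [
    '{"device_id": "a1", "reading": 21.5}',  # valid
    '{"device_id": "a2"',                     # malformed JSON
    '{"reading": 3}',                         # parses, but fails validation
]
result = process(lines)
print(len(result["valid"]), "valid,", len(result["dead_letter"]), "dead-lettered")
```

The key property is that a bad record never raises out of the processing step: it is kept, with its error, for later analysis, and the remaining records continue to flow.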

257

MCQeasy

A team needs to migrate an existing on-premises Hadoop Hive workload to Google Cloud. They want to minimize code changes and use a managed service for transient clusters. Which service should they choose?

A.Cloud Dataflow

B.Cloud Dataprep

C.Cloud Dataproc

D.BigQuery

AnswerC

Dataproc is fully compatible with Hadoop/Hive and offers ephemeral clusters with minimal code changes.

Why this answer

Cloud Dataproc is the correct choice because it is a managed Spark and Hadoop service that supports Hive workloads natively, allowing you to run existing Hive scripts with minimal changes. It also supports transient (ephemeral) clusters that are created for a job and deleted when the job finishes, which matches the stated requirement.

Exam trap

The trap here is that candidates often confuse Cloud Dataflow's ability to process batch data with Hadoop compatibility, but Dataflow does not support Hive or transient Hadoop clusters, making Dataproc the only correct option for minimizing code changes.

How to eliminate wrong answers

Option A is wrong because Cloud Dataflow is a unified stream and batch data processing service based on Apache Beam, not designed for Hive workloads or transient Hadoop clusters. Option B is wrong because Cloud Dataprep is a data preparation and cleaning service (based on Trifacta) that does not run Hive or provide transient clusters. Option D is wrong because BigQuery is a serverless data warehouse that does not support Hive execution engines or transient clusters; migrating Hive to BigQuery would require significant code changes.

258

Multi-Selectmedium

A data engineer is monitoring a Dataflow streaming pipeline and notices that the 'System Lag' metric is increasing. Which TWO actions should be taken to diagnose the issue?

Select 2 answers

A.Check the Dataflow monitoring UI for each stage's throughput and backlog.

B.Cancel the pipeline and restart with a larger initial worker count.

C.Increase the maximum number of workers to handle backlog.

D.Examine the worker logs for error messages or stack traces.

E.Increase the BigQuery quota for streaming inserts.

AnswersA, D

Identifies bottleneck stages.

Why this answer

Option A is correct because the Dataflow monitoring UI provides per-stage metrics such as throughput and backlog, which directly indicate where data is accumulating. By examining these metrics, you can identify the specific stage causing the increasing system lag, enabling targeted troubleshooting without unnecessary pipeline changes.

Exam trap

Google Cloud often tests the distinction between diagnostic actions and remedial actions; the trap here is that candidates confuse scaling up workers (a fix) with diagnosing the root cause of the lag.

259

MCQhard

A Dataflow streaming job is processing high-volume sensor data from thousands of IoT devices. The job uses global windows with a 10-minute processing time trigger. Recently, the job's CPU utilization is nearly 100% and it is falling behind. Which action is most likely to reduce CPU load while maintaining data freshness?

A.Increase the number of workers to distribute the load.

B.Change the trigger to event time with a 10-minute allowed lateness.

C.Replace GroupByKey with Combine.globally and use a fanout.

D.Use side inputs to broadcast a static lookup table to all workers.

AnswerC

Combine.globally with fanout reduces the number of unique keys tracked per worker, lowering CPU usage from grouping large numbers of keys.

Why this answer

Option C is correct because using `Combine.globally` with a fanout reduces the amount of data shuffled and merged in a single worker, lowering CPU load. In Dataflow, `GroupByKey` triggers a full shuffle and per-key aggregation, which is expensive for high-volume sensor data; `Combine.globally` with a fanout performs partial aggregation on each worker before a final merge, reducing network I/O and CPU cycles. This maintains data freshness because the 10-minute processing time trigger still fires on time, but with less per-element overhead.

Exam trap

Google Cloud often tests the misconception that scaling out workers (Option A) is the universal fix for performance issues, but the trap here is that the real bottleneck is the shuffle-heavy `GroupByKey` operation, not worker count.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers distributes load but does not address the root cause—the high CPU cost of per-key grouping and shuffling in `GroupByKey`; it may temporarily reduce backlog but adds cost and can still hit scaling limits. Option B is wrong because changing to event time with allowed lateness does not reduce CPU utilization; it only changes watermark semantics and may increase state size, worsening CPU pressure. Option D is wrong because using side inputs to broadcast a static lookup table does not reduce the CPU cost of the aggregation step; it adds memory overhead and does not address the shuffle bottleneck.
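
The idea behind combining with a fanout is partial aggregation: each intermediate shard reduces its own slice, and only the small partial results are merged at the end. The sketch below shows this in plain Python rather than with the Beam API; `combine_with_fanout` and the sample readings are hypothetical.

```python
def combine_with_fanout(values, fanout, combine_fn, merge_fn):
    """Pre-combine `fanout` shards independently, then merge the partial results."""
    shards = [values[i::fanout] for i in range(fanout)]
    partials = [combine_fn(shard) for shard in shards if shard]
    return merge_fn(partials)


def partial_mean(shard):
    """Accumulator for a mean: (sum, count) for one shard."""
    return (sum(shard), len(shard))


def merge_means(partials):
    """Merge (sum, count) accumulators into the overall mean."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count


readings = list(range(1, 1001))  # hypothetical sensor readings

total = combine_with_fanout(readings, fanout=8, combine_fn=sum, merge_fn=sum)
mean = combine_with_fanout(readings, fanout=8,
                           combine_fn=partial_mean, merge_fn=merge_means)
print(total, mean)
```

The final merge step only sees 8 partial results instead of 1000 raw values, which is why this pattern relieves the single hot worker that a global grouping would otherwise create. It requires the aggregation to be associative and commutative, as sum and (sum, count) are.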

260

MCQmedium

You deployed a model on Vertex AI Endpoints using a custom container. The model serves predictions but the latency is higher than expected. You suspect the container is not making full use of the CPU resources. What should you do to reduce latency?

A.Modify the container to use multi-threading or increase the number of workers in the prediction server (e.g., Gunicorn workers).

B.Enable response caching on the endpoint.

C.Change the machine type to a GPU-accelerated machine.

D.Increase the number of nodes by adjusting autoscaling limits.

AnswerA

Properly configuring concurrency allows each node to process multiple requests in parallel, reducing latency under load.

Why this answer

Option A is correct because high latency in a CPU-based custom container often stems from underutilizing available CPU cores. By increasing the number of workers (e.g., Gunicorn workers) or enabling multi-threading, you allow the prediction server to handle multiple requests concurrently, reducing queue time and improving throughput. This directly addresses the symptom of the container not making full use of CPU resources.

Exam trap

Google Cloud often tests the misconception that scaling out (adding more nodes) or upgrading hardware (GPU) is the default fix for latency, when the real issue is often software-level concurrency configuration within the container.

How to eliminate wrong answers

Option B is wrong because response caching reduces latency only for repeated identical requests, not for the general case of underutilized CPU resources; it does not improve concurrent request handling. Option C is wrong because switching to a GPU-accelerated machine would only help if the model benefits from GPU parallelism (e.g., deep learning models), but the question states the container is not making full use of CPU resources, implying the bottleneck is software configuration, not hardware type. Option D is wrong because increasing the number of nodes via autoscaling adds more instances but does not fix the per-instance CPU underutilization; it may even increase cost without addressing the root cause of inefficient request handling within each container.
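
As a rough illustration of option A, a Gunicorn configuration file is itself a Python module. The sketch below sizes the worker pool from the CPU count using the common "(2 x cores) + 1" starting point; the thread count, timeout, and the fallback port 8080 are assumptions to tune for a real model server, and it assumes the container reads its port from the `AIP_HTTP_PORT` environment variable.

```python
# gunicorn.conf.py (a sketch; tune the values for the actual model server)
import multiprocessing
import os

# Use every core: a common starting point is (2 x cores) + 1 worker processes.
cpu_count = multiprocessing.cpu_count()
workers = cpu_count * 2 + 1

# Threads per worker help when request handling spends time waiting on I/O.
threads = 2

# Listen on the port provided to the container, falling back to 8080.
bind = "0.0.0.0:" + os.environ.get("AIP_HTTP_PORT", "8080")

# Seconds a worker may spend on a request before it is restarted.
timeout = 60
```

Each worker process loads its own copy of the model, so memory per replica grows with the worker count; for large models, fewer workers with more threads may be the better trade-off.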

261

MCQhard

You are a data engineer at a financial services company. You have deployed a credit risk model on Vertex AI Endpoints using a custom container with a TensorFlow SavedModel. The model expects input features as a JSON object. Recently, the model has been returning high prediction latency and occasional 503 errors. You have enabled autoscaling with minNodes=2 and maxNodes=10. The model is CPU-only and uses n1-standard-4 machines. Monitoring shows that during peak hours, CPU utilization reaches 90% and memory is at 80%. The number of prediction requests per second peaks at 100. You suspect that the model is not scaling fast enough. Which action will most effectively reduce latency and eliminate 503 errors?

A.Increase maxNodes to 20 to allow more replicas during peak

B.Change the machine type to n1-standard-4 with a GPU (e.g., NVIDIA T4) and update the custom container to use GPU

C.Set minNodes to 5 to keep more replicas warm

D.Switch to n1-highmem-4 machines to provide more memory per node

AnswerB

GPU acceleration reduces per-request latency and can handle more requests per node.

Why this answer

Option B is correct because the high CPU utilization (90%) indicates that the model's inference is compute-bound. Offloading the computation to a GPU (NVIDIA T4) significantly accelerates TensorFlow model inference, reducing per-request latency and allowing each replica to handle more requests per second. This directly addresses the root cause of the 503 errors (requests timing out due to slow inference) and reduces the need for rapid scaling.

Exam trap

Google Cloud often tests the misconception that scaling out (increasing replicas) is always the solution to latency and 503 errors, when in fact the root cause may be per-replica performance (CPU vs. GPU) that scaling cannot fix.

How to eliminate wrong answers

Option A is wrong because increasing maxNodes to 20 does not address the fundamental bottleneck: each replica is CPU-bound at 90% utilization. More replicas would still be slow and may not scale quickly enough to handle sudden spikes, and they would increase cost without fixing latency. Option C is wrong because setting minNodes to 5 keeps more replicas warm but does not reduce the latency of each individual prediction; the replicas would still be CPU-bound, so 503 errors from slow inference would persist. Option D is wrong because memory is only at 80%, not a bottleneck; switching to n1-highmem-4 provides more memory but does not accelerate the CPU-bound computation, so latency and 503 errors would remain.

262

MCQeasy

Your Cloud Dataflow pipeline is failing due to a 'Permission denied' error when writing to a BigQuery table. The error persists even though the service account has bigquery.dataEditor role. What is the most likely missing permission?

A.pubsub.topics.publish on a notification topic

B.storage.objects.create on the staging bucket

C.bigquery.tables.get on the table

D.bigquery.tables.create on the dataset

AnswerD

Dataflow requires create permission if table is created automatically.

Why this answer

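Option D is correct because Dataflow needs the bigquery.tables.create permission on the dataset when the destination table does not exist and must be created automatically. Option A is wrong because Pub/Sub permissions are not involved in writing to BigQuery. Option B is wrong because permissions on the staging bucket govern staging files, not writes to the BigQuery table. Option C is wrong because read permissions on the table are not what a failing write is missing.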

263

Multi-Selecthard

A payment processing company needs to detect fraudulent transactions in real time. The system must have sub-second latency for high-value transactions and use a machine learning model. Which two components should be part of the architecture? (Choose TWO.)

Select 2 answers

A.Cloud Storage for transaction logs

B.Bigtable to store user profiles and transaction history for fast lookups

C.Dataflow for stream processing with sliding windows

D.Cloud SQL to store reference data

E.Cloud Functions for long-running batch model training

AnswersB, C

Bigtable offers single-digit millisecond latency for point lookups, essential for real-time fraud scoring.

Why this answer

Bigtable is a fully managed, scalable NoSQL database that provides consistent sub-10ms latency for high-throughput read/write operations, making it ideal for real-time lookups of user profiles and transaction history in fraud detection. Its ability to handle large volumes of data with low latency supports the sub-second requirement for high-value transactions.

Exam trap

Google Cloud often tests the distinction between storage services optimized for real-time access (Bigtable) versus batch/archive (Cloud Storage) and between stream processing (Dataflow) versus batch processing or short-lived compute (Cloud Functions).

264

MCQeasy

A startup wants to build a data lake on Google Cloud using Cloud Storage. They need to store raw data in its original format for future analysis. Which storage class should they use to optimize for cost given that data will be accessed occasionally after the first month?

A.Nearline storage class

B.Coldline storage class

C.Standard storage class

D.Archive storage class

AnswerA

Optimized for data accessed less than once a month, cost-effective.

Why this answer

Nearline storage class is the optimal choice because it offers low-cost storage for data accessed less than once a month, with a 30-day minimum storage duration. Since the data is accessed occasionally after the first month, Nearline provides significant cost savings over Standard while still offering low-latency access (milliseconds) suitable for analytics. Coldline and Archive have lower storage costs but impose higher retrieval fees and minimum storage durations (90 and 365 days respectively), making them more expensive for data that is accessed occasionally within the first year.

Exam trap

Google Cloud often tests the misconception that lower storage cost always means lower total cost, ignoring the impact of retrieval fees and minimum storage duration penalties, which can make Coldline or Archive more expensive for data accessed occasionally within the first year.

How to eliminate wrong answers

Option B (Coldline) is wrong because it is designed for data accessed less than once a quarter (90-day minimum storage duration) and has higher retrieval costs, making it more expensive than Nearline for data accessed occasionally after the first month. Option C (Standard) is wrong because it is optimized for frequently accessed data (no minimum storage duration) and has the highest storage cost, which is not cost-effective for data that is only accessed occasionally. Option D (Archive) is wrong because it is intended for long-term archival data accessed less than once a year (365-day minimum storage duration). Archive still serves objects with millisecond latency, but its retrieval fees are the highest of the four classes, making it unsuitable for occasional access within a year.
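The minimum-storage-duration penalty described above can be illustrated with a small sketch. It uses only the minimum durations quoted in this explanation (0, 30, 90, and 365 days); the function name is hypothetical, and no prices are modelled, so it shows billed days rather than cost.

```python
# Minimum storage durations (days) per Cloud Storage class, as quoted above.
MIN_STORAGE_DAYS = {"STANDARD": 0, "NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}


def billed_storage_days(storage_class: str, days_stored: int) -> int:
    """Days of storage charged for an object deleted after `days_stored` days.

    Deleting before the class minimum is still billed for the full minimum.
    """
    return max(days_stored, MIN_STORAGE_DAYS[storage_class])


if __name__ == "__main__":
    # An object removed after 45 days, compared across classes.
    for storage_class in MIN_STORAGE_DAYS:
        print(storage_class, billed_storage_days(storage_class, 45))
```

An object deleted after 45 days is billed for 45 days in Standard or Nearline, but for the full minimum in Coldline or Archive, which is one reason the cheapest per-GB class is not always the cheapest overall.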

Full explanation →

265

MCQmedium

A data engineer is designing a batch data pipeline that reads Avro files from Cloud Storage, transforms data using Apache Beam, and writes to BigQuery. The pipeline must handle daily runs and backfills. Which runner should they use?

A.FlinkRunner

B.DataflowRunner

C.SparkRunner

D.DirectRunner

AnswerB

DataflowRunner is a fully managed service that supports batch pipelines, backfills, and direct integration with GCS and BigQuery.

Why this answer

DataflowRunner is the correct choice because it is the fully managed service runner for Apache Beam on Google Cloud, optimized for batch and streaming pipelines. It automatically handles scaling, resource management, and exactly-once processing semantics, which are essential for reliable daily runs and backfills with Avro files from Cloud Storage and BigQuery sinks.

Exam trap

The trap here is that candidates may confuse the runner with the execution engine, assuming that any distributed runner (Flink, Spark) is suitable for production, when the question specifically tests knowledge of Google Cloud-native services and the need for managed infrastructure for batch pipelines with backfills.

How to eliminate wrong answers

Option A is wrong because FlinkRunner runs Beam pipelines on an Apache Flink cluster that the team would have to provision and operate, adding cluster management that Dataflow avoids. Option C is wrong because SparkRunner runs Beam pipelines on Apache Spark, which on Google Cloud typically means running a Dataproc cluster; that adds cluster sizing and lifecycle management that DataflowRunner handles automatically. Option D is wrong because DirectRunner is intended for local testing and development only, not for production workloads or handling large-scale daily runs and backfills.

Full explanation →

266

Multi-Selectmedium

Which TWO are best practices for managing a Cloud Dataflow pipeline in production?

Select 2 answers

A.Always use batch mode for streaming data to reduce cost

B.Disable autoscaling to keep compute costs predictable

C.Set up Cloud Monitoring alerts based on Dataflow job metrics

D.Use pipeline updates (update) to modify running streaming pipelines

E.Restart the pipeline when code changes are needed

AnswersC, D

Alerts help detect issues proactively.

Why this answer

Option C is correct because Cloud Monitoring alerts on Dataflow job metrics (e.g., system lag, watermark delay, or element count) enable proactive detection of pipeline health issues such as backpressure or stuck workers. This is a best practice for production pipelines to ensure reliability and timely intervention.

Exam trap

Google Cloud often tests the misconception that disabling autoscaling or restarting pipelines is acceptable for cost control or simplicity, when in fact these actions violate production best practices for reliability and data integrity.

Full explanation →

267

MCQhard

A financial services company uses Cloud Pub/Sub with ordering keys to process transactions in order. Some messages are failing processing and getting stuck. The team wants to ensure that if a message fails, it can be reprocessed later without blocking subsequent messages. What should they implement?

A.Create multiple subscriptions for the same topic

B.Use a pull subscription with flow control settings

C.Configure a dead letter topic and handle the failed message separately

D.Increase the acknowledgment deadline to 600 seconds

AnswerC

Dead letter topics isolate failures, allowing forwarding of messages for later reprocessing.

Why this answer

Option C is correct because a dead letter topic (DLT) allows failed messages to be moved aside after exhausting retry attempts, so they do not block the processing of subsequent ordered messages. In Cloud Pub/Sub, ordering keys require messages with the same key to be delivered in order; if a message fails and is not acknowledged, it blocks all later messages with the same key. With a dead letter topic configured, the failed message is forwarded to the DLT once it reaches the subscription's maximum number of delivery attempts (5 by default), and the original subscription can continue processing the next messages in order. The team can then reprocess the failed message from the DLT separately, without affecting the order of other messages.
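The unblocking behaviour can be sketched as a plain-Python simulation. This is not the Pub/Sub client API: the function, the handler, and the message names are hypothetical, and only the default of 5 delivery attempts comes from the explanation above.

```python
from collections import deque

MAX_DELIVERY_ATTEMPTS = 5  # default quoted in the explanation above


def process_in_order(messages, handler, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Process messages for one ordering key, dead-lettering persistent failures.

    `handler(msg)` returns True to acknowledge and False to reject.
    Returns a (processed, dead_lettered) pair of lists.
    """
    queue = deque(messages)
    processed, dead_lettered = [], []
    while queue:
        msg = queue[0]  # the head of the queue blocks everything behind it
        for _attempt in range(max_attempts):
            if handler(msg):
                processed.append(msg)
                break
        else:
            # Attempts exhausted: set the message aside so later ones can proceed.
            dead_lettered.append(msg)
        queue.popleft()
    return processed, dead_lettered


if __name__ == "__main__":
    always_fails = {"txn-2"}
    print(process_in_order(["txn-1", "txn-2", "txn-3"], lambda m: m not in always_fails))
```

Without the dead-letter branch, the loop would never get past the failing message, which is the stuck state the question describes.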

Exam trap

Google Cloud often tests the misconception that increasing the acknowledgment deadline or adding flow control can resolve stuck messages with ordering keys, but the real solution is to use a dead letter topic to offload the failing message and unblock the ordered stream.

How to eliminate wrong answers

Option A is wrong because creating multiple subscriptions for the same topic does not solve the blocking issue; each subscription independently receives all messages, but within a single subscription, ordering keys still cause a failed message to block subsequent messages with the same key. Option B is wrong because pull subscriptions with flow control settings only limit the rate of message delivery and do not handle failed messages that are stuck; they do not provide a mechanism to move failed messages out of the way to unblock ordering. Option D is wrong because increasing the acknowledgment deadline to 600 seconds only gives the subscriber more time to process a message before it is redelivered, but it does not prevent a persistently failing message from blocking subsequent ordered messages indefinitely.

Full explanation →

268

MCQhard

An organization uses Cloud Dataproc to run Spark jobs that process sensitive data. They need to ensure data is encrypted at rest and that only specific service accounts can access the data on cluster disks. What should they do?

A.Rely on the default encryption at rest and use VPC Service Controls to limit data exfiltration.

B.Use customer-supplied encryption keys (CSEK) and write a startup script to mount encrypted disks.

C.Enable encryption at rest using Google-managed encryption keys and grant all users the Dataproc Editor role.

D.Use customer-managed encryption keys (CMEK) for the cluster's persistent disks and assign a dedicated service account to the cluster with minimal IAM roles.

AnswerD

CMEK provides control over keys, and a dedicated service account restricts data access.

Why this answer

Option D is correct because using customer-managed encryption keys (CMEK) allows the organization to control and manage the encryption keys for persistent disks attached to the Dataproc cluster, ensuring data at rest is encrypted. Assigning a dedicated service account with minimal IAM roles ensures that only that service account can access the data on the cluster disks, following the principle of least privilege.

Exam trap

The trap here is that candidates often reach for CSEK, which Cloud Storage and Compute Engine accept but which is not Dataproc's documented option for cluster disks, instead of CMEK, or assume that default encryption combined with VPC Service Controls is sufficient for granular access control to disk data.

How to eliminate wrong answers

Option A is wrong because default encryption at rest uses Google-managed keys, which does not allow the organization to control key access or restrict which service accounts can access data on cluster disks; VPC Service Controls prevent data exfiltration but do not enforce service-account-level access to disk data. Option B is wrong because customer-supplied encryption keys (CSEK) require the team to supply raw key material to Cloud Storage or Compute Engine themselves, whereas Dataproc's documented option for encrypting cluster persistent disks is CMEK; mounting encrypted disks via a startup script is not a supported or recommended method for Dataproc clusters. Option C is wrong because granting all users the Dataproc Editor role would allow any user to access and modify cluster resources, violating the requirement that only specific service accounts can access data on cluster disks.

Full explanation →

269

Multi-Selecteasy

A company is deploying a machine learning model for fraud detection. The model is trained using TensorFlow and will be served on Vertex AI Prediction. The team wants to implement model monitoring to detect prediction drift. Which TWO actions should they take? (Choose 2)

Select 2 answers

A.Configure Vertex AI Model Monitoring to compare online prediction inputs against training data statistics.

B.Collect ground truth labels for all predictions to measure accuracy drift.

C.Set up a separate Cloud Monitoring alerting policy to watch for prediction errors.

D.Enable automatic model retraining in Vertex AI Model Monitoring when drift is detected.

E.Enable prediction drift monitoring to detect changes in model output distribution.

AnswersA, E

This detects training-serving skew in the input features, which is a common monitoring need.

Why this answer

Option A is correct because Vertex AI Model Monitoring can be configured to compare online prediction inputs against training data statistics to detect skew, which is a form of drift. This is a standard capability of Vertex AI Model Monitoring, where you specify a baseline dataset (typically training data) and the service automatically computes statistics on incoming prediction requests to identify distribution shifts.

Exam trap

Google Cloud often tests the distinction between monitoring for drift (which focuses on input/output distributions) versus monitoring for model accuracy (which requires ground truth labels), and candidates mistakenly think collecting ground truth is a prerequisite for drift detection.

Full explanation →

270

MCQmedium

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?

A.Session windowing is configured with too short a gap duration.

B.Bigtable schema design causes row key collisions.

C.Dataflow's default behavior discards late-arriving data.

D.Pub/Sub provides at-least-once delivery, and Dataflow does not deduplicate by default.

AnswerD

At-least-once delivery leads to duplicates without dedup in pipeline.

Why this answer

D is correct because Pub/Sub offers at-least-once delivery, meaning the same message may be delivered multiple times. Dataflow does not automatically deduplicate messages unless explicitly configured (e.g., using idempotent sinks or custom deduplication logic). Without deduplication, the same session event can be processed more than once, leading to duplicate session data in Bigtable.

Exam trap

Google Cloud often tests the misconception that Pub/Sub provides exactly-once delivery or that Dataflow automatically deduplicates messages from Pub/Sub, when in fact Pub/Sub is at-least-once and Dataflow requires explicit deduplication for idempotent processing.

How to eliminate wrong answers

Option A is wrong because a short gap duration would cause sessions to be split prematurely, leading to missing data (events not grouped into the same session), not duplicates. Option B is wrong because a write to an existing Bigtable row key updates that row rather than creating a second one, so key collisions could overwrite data but would not produce duplicate session records. Option C is wrong because dropped late data would explain missing sessions only: with the default allowed lateness of zero, elements that arrive after the watermark has passed the end of their window are discarded, which removes data rather than duplicating it.
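The explicit deduplication mentioned in the answer can be sketched in plain Python. This is an illustration of the idea, not Beam or Pub/Sub code; it assumes each delivery carries a unique message ID, and the names are hypothetical.

```python
def deduplicate(deliveries):
    """Yield each message once, keyed by its unique message ID.

    `deliveries` is an iterable of (message_id, payload) pairs in which
    at-least-once delivery may repeat a message_id.
    """
    seen = set()
    for message_id, payload in deliveries:
        if message_id in seen:
            continue  # redelivery of a message already processed
        seen.add(message_id)
        yield message_id, payload


if __name__ == "__main__":
    redelivered = [("m1", "click"), ("m2", "view"), ("m1", "click")]
    print(list(deduplicate(redelivered)))
```

A real streaming job would need to bound the `seen` state, for example by expiring IDs after a time window, since an unbounded set grows forever on an unbounded stream.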

Full explanation →

271

Multi-Selecteasy

Which TWO are benefits of using Vertex AI Endpoints for model serving?

Select 2 answers

A.Batch prediction support out of the box.

B.Integrated monitoring for prediction latency and error rates.

C.Automatic scaling based on traffic.

D.Automatic model retraining when drift is detected.

E.Built-in support for A/B testing without any additional configuration.

AnswersB, C

Vertex AI endpoints integrate with Cloud Monitoring for operational metrics.

Why this answer

Vertex AI Endpoints provide integrated monitoring for prediction latency and error rates out of the box, enabling you to track model performance and detect anomalies without additional instrumentation. This is a core operational feature that helps maintain service-level objectives (SLOs) and quickly identify degradation in production.

Exam trap

Google Cloud often tests the distinction between features that are 'built-in' versus those that require separate services or additional configuration, so candidates mistakenly assume batch prediction or automatic retraining are part of Endpoints when they are actually separate Vertex AI components.

Full explanation →

272

Multi-Selecthard

A team is deploying a complex model with multiple preprocessing steps. They want to ensure consistent preprocessing during training and serving. Which three approaches can achieve this? (Select 3)

Select 3 answers

A.Store preprocessing logic in a shared Python module

B.Use a separate preprocessing service called from the model

C.Use two separate pipelines for training and serving

D.Use Vertex AI Feature Transform Engine

E.Embed preprocessing logic in the model graph

AnswersA, D, E

A shared module ensures the same code is used in training and serving if properly versioned.

Why this answer

Option A is correct because storing preprocessing logic in a shared Python module ensures that the same code is used during both training and serving, eliminating drift between environments. This approach leverages version control and dependency management to guarantee consistency, which is critical for reproducibility in production ML pipelines.

Exam trap

Google Cloud often tests the misconception that a separate preprocessing service (Option B) is a good architectural pattern for consistency, when in fact it introduces a single point of failure and versioning complexity that undermines the goal of identical preprocessing.

Full explanation →

273

Multi-Selectmedium

A data engineer is designing a BigQuery table for time-series data that will be queried frequently by time range and also by a customer_id. Which TWO design decisions will improve query performance and manage costs? (Choose two.)

Select 2 answers

A.Partition the table by day on the timestamp column

B.Cluster the table on customer_id

C.Disable automatic reclustering to save costs

D.Set partition expiration to 1 year

E.Use nested repeated fields for customer data

AnswersA, B

Enables partition pruning for time-range queries.

Why this answer

Partitioning the table by day on the timestamp column allows BigQuery to prune partitions when queries filter by a time range, scanning only the relevant partitions instead of the entire table. This directly reduces the amount of data read, improving query performance and lowering costs.
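As a sketch of how the two correct options combine, the helper below builds a BigQuery DDL statement that partitions by day on a timestamp column and clusters on customer_id. The table name, the column names other than customer_id, and the helper itself are assumptions for illustration; the statement is only constructed as a string here, not executed.

```python
def build_create_table_ddl(table: str, ts_column: str, cluster_column: str) -> str:
    """Return BigQuery DDL for a day-partitioned, clustered table."""
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  {ts_column} TIMESTAMP NOT NULL,\n"
        f"  {cluster_column} STRING NOT NULL,\n"
        f"  value FLOAT64\n"
        f")\n"
        f"PARTITION BY DATE({ts_column})\n"   # enables partition pruning on time filters
        f"CLUSTER BY {cluster_column}"        # co-locates rows for customer_id filters
    )


if __name__ == "__main__":
    # Hypothetical project, dataset, and table names.
    print(build_create_table_ddl("my_project.metrics.readings", "event_ts", "customer_id"))
```

Queries that filter on both the timestamp column and customer_id can then skip whole partitions first and scan fewer blocks within the partitions that remain.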

Exam trap

Google Cloud often tests the misconception that disabling automatic reclustering saves costs. In reality, automatic reclustering runs in the background at no charge and is essential for maintaining clustering benefits, while partition expiration is a lifecycle management feature, not a performance optimization.

Full explanation →

274

MCQhard

A company has a batch prediction job that runs daily using AI Platform Batch Prediction. The job uses a TensorFlow model and processes 10 GB of data. Recently, the job started failing with the error 'The replica worker 0 exited with a non-zero exit code: Out of memory'. Which action should the team take to resolve this without rewriting the model?

A.Increase the number of workers (parallelism) to distribute the data across more machines.

B.Use a machine type with more memory, such as n1-highmem-8.

C.Reduce the batch size parameter in the prediction job configuration.

D.Optimize the model to use less memory by pruning or quantization.

AnswerB

Directly addresses the out-of-memory error by providing more RAM per worker.

Why this answer

The error 'Out of memory' on replica worker 0 indicates that the machine type assigned to the prediction job does not have enough RAM to load the model and process the 10 GB batch. Increasing the machine type to one with more memory (e.g., n1-highmem-8) directly addresses the memory constraint without requiring any code changes. This is the most straightforward fix because AI Platform Batch Prediction allows you to specify machine types in the job configuration, and the error is purely a resource allocation issue.

Exam trap

Google Cloud often tests the distinction between scaling horizontally (adding workers) and scaling vertically (increasing machine resources), where candidates mistakenly assume parallelism solves memory issues, but the error is per-worker memory exhaustion, not throughput.

How to eliminate wrong answers

Option A is wrong because increasing the number of workers (parallelism) distributes the data across more machines but does not increase the memory per worker; each replica still has the same limited memory, so the out-of-memory error would persist on each worker. Option C is wrong because reducing the batch size parameter controls how many predictions are processed per step, which can reduce peak memory usage per request, but the error occurs during model loading or initial data processing, not during per-step prediction; the 10 GB dataset and model size still require sufficient base memory. Option D is wrong because while pruning or quantization could reduce model memory footprint, the question explicitly states 'without rewriting the model,' and these techniques require modifying the model architecture or retraining, which is a form of rewriting.

Full explanation →

275

MCQeasy

A company needs to process large files (100GB each) from Cloud Storage using Dataproc. They want to minimize job execution time. Which configuration is most appropriate?

A.Use a single-node cluster

B.Use a cluster with preemptible worker nodes and high-CPU machine types

C.Use HDFS for input data to avoid network latency

D.Use a cluster with many standard worker nodes

AnswerB

Preemptible VMs reduce cost, high-CPU machines improve speed.

Why this answer

Option B is correct because preemptible worker nodes are significantly cheaper than standard nodes, allowing you to scale out the cluster with many more workers for the same cost, which directly reduces job execution time for embarrassingly parallel data processing tasks. High-CPU machine types suit compute-intensive Dataproc jobs like data transformation or machine learning because they provide a higher ratio of vCPUs to memory than standard types. This combination maximizes parallelism and minimizes wall-clock time for large-scale batch jobs.

Exam trap

The trap here is that candidates often assume standard worker nodes are always better for performance, ignoring the cost-benefit of preemptible nodes that allow scaling to many more workers for the same budget, which directly reduces execution time for parallelizable jobs.

How to eliminate wrong answers

Option A is wrong because a single-node cluster lacks parallelism, so processing 100GB files would be severely bottlenecked by a single machine's CPU and memory, leading to long execution times. Option C is wrong because HDFS is not used for input data from Cloud Storage; Dataproc reads directly from Cloud Storage via the gs:// connector, and using HDFS would require copying data first, adding network latency and storage overhead. Option D is wrong because using many standard worker nodes is less cost-effective than using preemptible nodes; standard nodes are more expensive, so for the same budget you can provision fewer workers, resulting in longer job execution times compared to a larger cluster of preemptible nodes.

Full explanation →

276

MCQmedium

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A.Use Customer-Managed Encryption Keys (CMEK) and enable VPC Service Controls.

B.Use Customer-Managed Encryption Keys (CMEK) and enable Cloud Audit Logs.

C.Use Default Encryption and enable Data Loss Prevention (DLP) API.

D.Use Customer-Supplied Encryption Keys (CSEK) and enable VPC Service Controls.

AnswerB

CMEK provides control over encryption keys, and Cloud Audit Logs record access to data.

Why this answer

Option B is correct because Customer-Managed Encryption Keys (CMEK) allow the team to control and manage the encryption keys used to protect data at rest in Cloud Storage and BigQuery, while enabling Cloud Audit Logs provides the necessary audit trail for access to both the data and the keys. This combination directly satisfies the requirements for encryption at rest and auditability.

Exam trap

Google Cloud often tests the distinction between encryption key management (CMEK vs. CSEK vs. Default) and security controls (VPC Service Controls vs. Audit Logs), leading candidates to conflate network perimeter controls with audit capabilities.

How to eliminate wrong answers

Option A is wrong because VPC Service Controls provide network-based security boundaries to prevent data exfiltration, but they do not provide audit logging of access to data or keys, which is a separate requirement. Option C is wrong because Default Encryption uses Google-managed keys, which do not give the team control over encryption keys, and the DLP API is for inspecting and classifying sensitive data, not for encryption at rest or audit logging. Option D is wrong because Customer-Supplied Encryption Keys (CSEK) require the customer to manage their own keys outside Google Cloud, which adds operational complexity and does not integrate with Cloud Audit Logs for key access auditing; VPC Service Controls again do not provide audit logging.

Full explanation →

277

MCQmedium

You configured a model deployment monitor on your Vertex AI endpoint as shown. What will happen when the feature 'age' has a skew of 0.4?

A.An alert will be sent to admin@example.com

B.The endpoint will automatically roll back to a previous model version

C.No alert will be sent because the skew threshold is 0.2 for income

D.An alert will be sent only if both features exceed their thresholds

AnswerA

Skew 0.4 exceeds threshold 0.3 for age.

Why this answer

Option A is correct because the monitoring configuration sets an alert threshold of 0.3 for the feature 'age', and a skew of 0.4 exceeds that threshold. Vertex AI Model Monitoring will trigger the configured alert action, which in this case is sending an email to admin@example.com. The alert is based on the specific feature's threshold, not on any other feature's threshold.

Exam trap

Google Cloud often tests the misconception that alerts require multiple features to exceed thresholds or that the system can automatically roll back models, when in reality each feature is evaluated independently and only notifications are sent.

How to eliminate wrong answers

Option B is wrong because Vertex AI Model Monitoring does not automatically roll back model deployments; it only sends alerts based on configured actions, and auto-rollback is not a supported feature in this context. Option C is wrong because the 0.2 threshold applies to 'income', not to 'age'; the skew for 'age' is 0.4, which exceeds its own threshold of 0.3, so an alert will be sent regardless of the 'income' feature's threshold. Option D is wrong because the alert is triggered per feature when its individual threshold is exceeded; there is no requirement for both features to exceed their thresholds simultaneously.
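The per-feature evaluation can be sketched as follows. This is not the Vertex AI API; the function is hypothetical, and the threshold values are assumed from the answer text because the monitoring configuration itself is not reproduced on this page.

```python
def features_to_alert(skew_by_feature, thresholds):
    """Return the features whose observed skew exceeds their own threshold.

    Each feature is checked independently of every other feature.
    """
    return [
        name
        for name, skew in skew_by_feature.items()
        if name in thresholds and skew > thresholds[name]
    ]


if __name__ == "__main__":
    thresholds = {"age": 0.3, "income": 0.2}   # assumed values, see lead-in
    observed = {"age": 0.4, "income": 0.1}     # hypothetical observations
    print(features_to_alert(observed, thresholds))
```

Because each comparison stands alone, a single feature crossing its threshold is enough to produce a notification.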

Full explanation →

278

Drag & Dropmedium

Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.

Why this order

Database Migration Service enables minimal-downtime migrations using replication.

Full explanation →

279

Multi-Selecthard

A company uses Cloud Composer to orchestrate Dataproc and BigQuery jobs. They need to implement retry logic for transient failures. Which THREE features can help?

Select 3 answers

A.Dataflow pipeline retries

B.DAG retry_delay

C.BigQuery job retries

D.Cloud Composer high availability

E.Task retries and retry_delay

AnswersB, C, E

Setting retry_delay in the DAG's default_args applies a delay between retries to every task in the DAG.

Why this answer

Option B is correct because Cloud Composer (Apache Airflow) allows setting `retry_delay` at the DAG level to define the time delay between task retries. This is a native Airflow feature that helps handle transient failures by automatically retrying failed tasks after a specified delay, reducing manual intervention.
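A minimal sketch of DAG-level retry settings, assuming the usual Airflow pattern of a `default_args` dictionary. The retry count and delay are illustrative values, and the snippet stops short of importing Airflow so that it runs anywhere.

```python
from datetime import timedelta

# Defaults inherited by every task in the DAG unless a task overrides them.
default_args = {
    "retries": 3,                          # illustrative: retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),   # illustrative: wait 5 minutes between attempts
}

# In a Composer environment this dictionary would be passed to the DAG
# constructor as DAG(..., default_args=default_args).

if __name__ == "__main__":
    print(default_args)
```

Individual operators can still set their own `retries` and `retry_delay`, which is the task-level mechanism named in Option E.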

Exam trap

The trap here is confusing infrastructure-level high availability (Option D) with application-level retry logic, leading candidates to select HA as a retry mechanism when it only ensures environment uptime, not task-level failure recovery.

Full explanation →

280

MCQmedium

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?

A.Use DataFrame.join with broadcast hint on the dimension DataFrame

B.Read the fact table and dimension table into separate DataFrames and use standard join

C.Read the dimension table as an RDD and collect as a map, then use map-side join

D.Increase the spark.sql.autoBroadcastJoinThreshold to a large value

AnswerA

Forces broadcast join regardless of table size.

Why this answer

Option A is correct because broadcasting the small dimension table using the broadcast hint (e.g., `broadcast(dimensionDF)`) forces Spark to replicate the dimension data to all executor nodes, eliminating the need for a shuffle during the join. This is ideal when the dimension table is small enough to fit in executor memory, and since the pipeline reads the latest CSV daily, the broadcast will automatically use the updated data without additional code changes.

Exam trap

The trap here is that candidates may think increasing `spark.sql.autoBroadcastJoinThreshold` is a safe global fix, but it can cause memory pressure and does not guarantee a broadcast join if the table size fluctuates, whereas the explicit broadcast hint provides deterministic behavior.

How to eliminate wrong answers

Option B is wrong because a standard join without any hint or optimization will trigger a full shuffle of both datasets, which is exactly the performance problem described. Option C is wrong because manually collecting the dimension table as an RDD and using a map-side join is an outdated, error-prone approach that bypasses Spark SQL's Catalyst optimizer and broadcast join optimizations; it also requires manual handling of updates and memory management. Option D is wrong because increasing `spark.sql.autoBroadcastJoinThreshold` globally may cause the dimension table to be broadcast automatically, but it does not guarantee the join uses a broadcast if the table size exceeds the threshold, and it can lead to out-of-memory errors if the threshold is set too high without considering executor memory limits.
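To see why broadcasting removes the shuffle, here is a plain-Python illustration of what a broadcast hash join does conceptually. It is not Spark code and not the manual RDD approach from Option C; in Spark the same effect comes from the broadcast hint on the DataFrame. All names and sample rows are hypothetical.

```python
def broadcast_hash_join(fact_partitions, dimension_rows):
    """Inner-join partitioned fact rows against a small dimension table.

    `fact_partitions` is a list of partitions, each a list of (key, value) pairs.
    `dimension_rows` is a list of (key, attribute) pairs small enough to copy everywhere.
    """
    lookup = dict(dimension_rows)  # the "broadcast": one full copy per worker
    joined = []
    for partition in fact_partitions:  # each partition joins locally, with no data movement
        for key, fact_value in partition:
            if key in lookup:
                joined.append((key, fact_value, lookup[key]))
    return joined


if __name__ == "__main__":
    facts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]
    dims = [(1, "x"), (2, "y")]
    print(broadcast_hash_join(facts, dims))
```

The fact rows never leave their partitions; only the small table is copied, which is why the approach depends on the dimension table fitting in executor memory.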

Full explanation →

281

MCQeasy

A data engineer tries to grant a service account read access to a Cloud Storage bucket using the IAM policy above. The service account still cannot read objects. What is the most likely reason?

A.The role does not include the necessary permission

B.The condition prevents access because the request time is after 2023

C.The service account is misspelled

D.The role should be roles/storage.admin

AnswerB

The condition expression requires request.time before 2023, which is likely no longer true.

Why this answer

Option B is correct because the IAM condition explicitly restricts access to requests made before January 1, 2023. Since the current time is after that date, the condition evaluates to false, denying the service account's read access regardless of the role binding. IAM conditions are evaluated at request time, and if the condition is not met, the permission is not granted.
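The request-time evaluation can be mimicked in a few lines. The policy is not reproduced on this page, so the expression is an assumption: a condition of the form request.time < timestamp("2023-01-01T00:00:00Z"). The function name is hypothetical.

```python
from datetime import datetime, timezone

# Assumed cutoff, mirroring request.time < timestamp("2023-01-01T00:00:00Z").
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)


def condition_allows(request_time: datetime) -> bool:
    """Return True if a request made at `request_time` satisfies the condition."""
    return request_time < CUTOFF


if __name__ == "__main__":
    print(condition_allows(datetime(2022, 12, 31, tzinfo=timezone.utc)))
    print(condition_allows(datetime.now(timezone.utc)))
```

The binding and role stay in the policy unchanged; only the outcome of the check flips once the clock passes the cutoff.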

Exam trap

Google Cloud often tests the subtlety that IAM conditions are evaluated at request time and can override a valid role binding, leading candidates to mistakenly focus on the role's permissions rather than the condition's effect.

How to eliminate wrong answers

Option A is wrong because roles/storage.objectViewer includes the storage.objects.get permission required to read objects, so the role does include the necessary permission. Option C is wrong because a misspelled service account would result in the role not being bound at all, but the question states the policy was applied, implying the service account name is correct. Option D is wrong because roles/storage.admin is an overly permissive role that includes many additional permissions beyond read access; the issue is not the role's permissions but the condition blocking access.

Full explanation →

282

MCQeasy

A company uses Cloud Monitoring to track application latency. They notice a spike in latency every 30 minutes. What is the best initial step to diagnose the issue?

A.Increase the number of instances to handle the load.

B.Enable Cloud Trace for all requests.

C.Check if scheduled jobs or cron tasks overlap.

D.Change the alert threshold to ignore the spikes.

AnswerC

Regularly recurring spikes suggest a scheduled job causing contention; investigating this is the most direct diagnostic step.

Why this answer

Recurring spikes at regular intervals often indicate a scheduled process (e.g., cron job, batch job) that runs every 30 minutes. Checking for overlapping scheduled jobs is the most efficient first step before scaling or other actions.

283

Multi-Selectmedium

Which TWO statements are correct about designing a data pipeline using Cloud Dataflow for processing unbounded data?

Select 2 answers

A.Watermarks are used to measure the progress of event time.

B.Triggers can only emit results at the end of a window.

C.Dataflow guarantees exactly-once processing for streaming pipelines.

D.Cloud Pub/Sub is the recommended source for streaming pipelines.

E.Fixed windows are always based on processing time.

AnswersA, D

Watermarks track event time progress.

Why this answer

Watermarks in Cloud Dataflow measure the progress of event time, indicating when all data up to a certain timestamp is expected to have arrived. This allows the pipeline to handle late-arriving data and determine when to close windows for unbounded data streams.

Exam trap

Google Cloud often tests the misconception that triggers only fire at window boundaries, when in fact Dataflow supports early, on-time, and late firings for flexible result emission.

284

MCQmedium

A Cloud Build pipeline is set up to train a model on Vertex AI. The build fails with the error: 'ERROR: (gcloud.ai-platform.jobs.submit.training) NOT_FOUND: The parent project does not exist.' The project ID and the service account are correctly configured. What is the most likely cause?

A.The region specified for the training job does not exist.

B.The training job requires a GPU, which is not available in the specified region.

C.The Cloud Build service account does not have the aiplatform.jobs.create permission on the project.

D.The training package is not uploaded to Cloud Storage before the pipeline runs.

AnswerC

Insufficient permissions can cause the project to appear as not found to the service account.

Why this answer

The error 'NOT_FOUND: The parent project does not exist' indicates that the Cloud Build service account lacks the necessary IAM permission to submit a training job to Vertex AI. Even though the project ID and service account are correctly configured, the Cloud Build service account must have the 'aiplatform.jobs.create' permission (or the 'Vertex AI User' role) on the project. Without this, the API call fails because the service account is not authorized to access the project resource.

Exam trap

Google Cloud often tests the misconception that a 'NOT_FOUND' error always means a missing resource (like a project ID or region), when in fact it can indicate an IAM permission issue where the service account is not authorized to see or use the project.

How to eliminate wrong answers

Option A is wrong because an invalid region would produce a different error, such as 'INVALID_ARGUMENT: Region not found' or 'PERMISSION_DENIED', not 'NOT_FOUND: The parent project does not exist'. Option B is wrong because GPU availability issues would result in a 'RESOURCE_EXHAUSTED' or 'ZONE_RESOURCE_POOL_EXHAUSTED' error, not a project-level not found error. Option D is wrong because a missing training package in Cloud Storage would cause a 'FILE_NOT_FOUND' or 'INVALID_ARGUMENT' error during job submission, not a project not found error.

285

MCQmedium

A company has deployed a machine learning model to AI Platform Prediction. The model uses a custom container with a TensorFlow SavedModel. After deployment, the prediction latency is higher than expected. Which action is most likely to reduce latency without significantly impacting model accuracy?

A.Convert the model to TensorFlow Lite and use a smaller model.

B.Increase the number of prediction nodes in the AI Platform Prediction cluster.

C.Enable XLA (Accelerated Linear Algebra) compilation on model loading.

D.Apply quantization to the model weights to reduce size.

AnswerC

XLA compiles and optimizes the TensorFlow graph, often improving latency without affecting accuracy.

Why this answer

Option C is correct because enabling XLA compilation on model load can optimize the computational graph for better performance with no accuracy loss. Options A, B, and D either risk reducing accuracy or are not applicable.

286

Multi-Selectmedium

Which THREE Google Cloud services are typically used together in a production ML pipeline?

Select 3 answers

A.Cloud Storage

B.Cloud Functions

C.Vertex AI Training

D.Vertex AI Prediction

E.BigQuery

AnswersA, C, D

For storing training data, model artifacts, etc.

Why this answer

Cloud Storage is correct because it serves as the central artifact repository in a production ML pipeline on Google Cloud. It stores training data, model artifacts, and prediction inputs/outputs, enabling seamless integration with Vertex AI Training for model training and Vertex AI Prediction for serving. Without Cloud Storage, there is no durable, scalable, and cost-effective way to manage the large datasets and model binaries required for production ML workflows.

Exam trap

The trap here is that candidates confuse 'services used in an ML pipeline' with 'services that can be used somewhere in ML' — Cloud Functions and BigQuery are often used in ML workflows (e.g., triggering retraining or storing features), but they are not the three core services that are typically used together in a production ML pipeline for training, storing artifacts, and serving predictions.

287

MCQmedium

A Dataflow pipeline reads log files from Cloud Storage, parses them into LogEvent objects, and writes to BigQuery. The pipeline fails with the above errors. What is the most likely cause?

A.The LogEvent class does not have a no-argument constructor.

B.The pipeline is missing required import statements for LogEvent.

C.The BigQuery table schema does not match the LogEvent fields.

D.The log files are not in the expected format, causing parsing failures.

AnswerA

Beam requires a no-arg constructor for Avro or Serializable coders.

Why this answer

Apache Beam's SDK requires that custom types used as PCollection elements (like LogEvent) have a no-argument constructor so that the framework can deserialize objects during distributed processing, especially when using the Dataflow runner. Without it, the pipeline fails at runtime with a serialization error because Beam's default coder (e.g., SerializableCoder) cannot reconstruct the object.

Exam trap

The trap here is that candidates confuse runtime serialization errors with compile-time import issues or schema mismatches, overlooking the fundamental requirement for a no-argument constructor in Beam's default coders.

How to eliminate wrong answers

Option B is wrong because missing import statements would cause a compile-time error, not a runtime pipeline failure with the described errors. Option C is wrong because a BigQuery table schema mismatch would produce a write-time error (e.g., schema mismatch), not a serialization failure during parsing. Option D is wrong because parsing failures from malformed log files would result in exceptions during the parse step, not a serialization error related to the LogEvent class itself.

288

MCQmedium

A data team runs regular analytical queries on a BigQuery table that stores 2 years of sales data (approximately 10 TB). Queries frequently filter on a `sale_date` column and also group by `product_id`. To optimize cost and performance, which design approach is most effective?

A.Do not partition; only cluster by `sale_date`.

B.Partition by `sale_date` and set a table expiration of 90 days.

C.Partition the table by `sale_date` and cluster by `product_id`.

D.Partition by `product_id` and cluster by `sale_date`.

AnswerC

Partitioning by date enables partition elimination on date filters; clustering by product_id co-locates rows with the same product_id within each partition, improving GROUP BY performance.

Why this answer

Option C is correct because partitioning by `sale_date` allows BigQuery to perform partition pruning, eliminating scans of irrelevant date ranges, while clustering by `product_id` physically co-locates rows with the same product ID within each partition. This combination minimizes the data scanned for queries that filter on `sale_date` and group by `product_id`, directly reducing both cost (bytes billed) and query latency.
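
The layout in option C can be sketched as BigQuery DDL assembled from Python. The table name `my_dataset.sales`, the `amount` column, and the column types are hypothetical; the sketch also assumes `sale_date` is a DATE column so that it can be named directly in `PARTITION BY`.

```python
def build_sales_ddl(table: str = "my_dataset.sales") -> str:
    """Assemble DDL for a table partitioned by sale_date and clustered by product_id."""
    return (
        f"CREATE TABLE `{table}` (\n"
        "  sale_date DATE,\n"
        "  product_id STRING,\n"
        "  amount NUMERIC\n"
        ")\n"
        "PARTITION BY sale_date\n"
        "CLUSTER BY product_id"
    )

print(build_sales_ddl())
```

With this layout, a query that filters on `sale_date` only reads the matching partitions, and rows for the same `product_id` sit close together within each one.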

Exam trap

Google Cloud often tests the misconception that partitioning by a high-cardinality column like `product_id` is acceptable, but the trap here is that BigQuery enforces a hard limit of 4,000 partitions per table, making such a design infeasible and forcing candidates to recognize that clustering is the correct mechanism for high-cardinality grouping columns.

How to eliminate wrong answers

Option A is wrong because without partitioning, BigQuery must scan the entire 10 TB table even for queries filtering on a narrow date range, leading to unnecessarily high costs and slower performance. Option B is wrong because setting a table expiration of 90 days would delete historical data needed for 2-year analysis, and partitioning alone without clustering does not optimize the GROUP BY on `product_id` within each partition. Option D is wrong because partitioning by `product_id` (a high-cardinality column) would create millions of tiny partitions, exceeding BigQuery's partition limit (4,000 partitions per table) and causing poor performance and management overhead.

289

Drag & Dropmedium

Drag and drop the steps to create a Cloud Storage bucket with uniform bucket-level access into the correct order.

Why this order

Uniform bucket-level access simplifies permissions by using IAM policies at the bucket level instead of ACLs.

290

Multi-Selecthard

A company is designing a data lake on Cloud Storage for analytics. They need to store data in various formats (Avro, Parquet, CSV) and enable efficient querying with BigQuery and Dataproc. Which THREE practices should they follow?

Select 3 answers

A.Use BigLake to create BigQuery tables that reference Cloud Storage data.

B.Store data in columnar formats like Parquet for analytics workloads.

C.Disable encryption on the bucket to improve read performance.

D.Partition data by date in a logical folder structure (e.g., /data/yyyy/mm/dd).

E.Store all data in CSV format for simplicity.

AnswersA, B, D

Enables querying data without loading.

Why this answer

BigLake allows you to create BigQuery tables that reference data stored in Cloud Storage, enabling unified governance and fine-grained access control without moving data. This is essential for a data lake architecture where BigQuery and Dataproc need to query the same underlying data in various formats like Avro, Parquet, and CSV.
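
Option D's date-based folder layout can be sketched with a small helper. The root prefix `data` follows the example in the option; the function name `object_prefix` is hypothetical.

```python
from datetime import date

def object_prefix(day: date, root: str = "data") -> str:
    """Build a yyyy/mm/dd prefix so each day's files share one folder."""
    return f"{root}/{day:%Y/%m/%d}/"

print(object_prefix(date(2024, 3, 7)))  # data/2024/03/07/
```

Zero-padded month and day values keep the prefixes in chronological order when listed lexicographically.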

Exam trap

Google Cloud often tests the misconception that disabling encryption improves performance, but Cloud Storage encryption is transparent and has no measurable impact on read throughput, so candidates should recognize that security controls are non-negotiable in cloud data lakes.

291

MCQeasy

A company uses Cloud Functions to process events from Cloud Storage. They notice that occasionally functions are not triggered. What should they check first to ensure solution quality?

A.Verify that the Cloud Storage bucket has notifications configured for the correct event type.

B.Check the logs for function execution.

C.Increase the function memory allocation.

D.Increase the function timeout.

AnswerA

A misconfigured notification will prevent the function from being triggered at all.

Why this answer

Option A is correct because the first step is to verify the Cloud Storage bucket notification configuration, as a misconfigured trigger will cause missed events. Option D (function timeout) does not cause missing triggers. Option C (memory) is unrelated. Option B (logs) is helpful, but only after verifying the trigger configuration.

292

MCQeasy

An organization wants to automate their batch data processing pipeline using Cloud Composer. The pipeline consists of multiple tasks: extract from Cloud Storage, transform with Dataflow, and load into BigQuery. Which Airflow operator should be used to run Dataflow jobs?

A.BigQueryInsertJobOperator

B.DataflowCreatePythonJobOperator

C.GCSToBigQueryOperator

D.DataprocSubmitJobOperator

AnswerB

This operator submits a Dataflow job written in Python.

Why this answer

B is correct because the DataflowCreatePythonJobOperator is specifically designed to submit and manage Apache Beam pipelines written in Python as Dataflow jobs in Google Cloud. This operator handles the creation of a Dataflow job from a Python file, which aligns with the requirement to run Dataflow transformations within a Cloud Composer DAG.

Exam trap

Google Cloud often tests the distinction between Dataflow and Dataproc operators, so the trap here is that candidates might confuse DataprocSubmitJobOperator (for Hadoop/Spark) with Dataflow operators, especially when the question mentions 'transform' without specifying the processing framework.

How to eliminate wrong answers

Option A is wrong because BigQueryInsertJobOperator is used to run BigQuery jobs (e.g., queries, load jobs), not to submit Dataflow pipelines. Option C is wrong because GCSToBigQueryOperator loads data directly from Cloud Storage to BigQuery without using Dataflow for transformation, bypassing the required transform step. Option D is wrong because DataprocSubmitJobOperator submits jobs to Dataproc (Hadoop/Spark clusters), not to Dataflow, which is a different processing service.

293

MCQhard

A company runs a data pipeline that ingests clickstream events from multiple websites into Cloud Pub/Sub, processes them with Dataflow to generate user sessions, and writes the results to BigQuery for analytics. The pipeline runs 24/7. Recently, the team noticed that some sessions are incomplete due to missing events, and data quality checks reveal that about 2% of sessions have gaps of more than 30 minutes. The pipeline uses fixed 30-minute windows for sessionization, with allowed lateness set to 10 minutes. They have Cloud Monitoring dashboards tracking system throughput and pipeline lag but do not have custom metrics tracking per-element delays or watermark progress. The team suspects two possible causes: (a) the Pub/Sub subscription accumulates backlog and some messages are delivered after the window end; (b) the Dataflow job has insufficient workers causing checkpoint failures. The team needs to determine the root cause and improve data quality. What is the best first course of action?

A.Change the Pub/Sub subscription to pull mode with more aggressive flow control settings.

B.Increase the number of Dataflow workers and set autoscaling to the maximum allowed.

C.Modify the Dataflow pipeline to use session windows instead of fixed windows, and increase allowed lateness to 60 minutes.

D.Set up a Dataflow monitoring dashboard that tracks the watermark delay and create an alert when it exceeds the allowed lateness.

AnswerD

This directly monitors the pipeline's ability to process events within the window, confirming if late data is the root cause.

Why this answer

To determine whether late-arriving messages are the issue, the team should monitor the Dataflow watermark delay, which indicates how far behind the pipeline is compared to the event time. Setting up a metric and alert on watermark delay > allowed lateness will confirm if late data is being dropped.
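
The relationship between the watermark, the window end, and allowed lateness can be illustrated with a simplified model. This is a plain Python sketch, not Dataflow or Beam code; the 30-minute window and 10-minute allowed lateness come from the scenario, while the function name `classify` and the minute-based timestamps are assumptions.

```python
WINDOW_MIN = 30            # fixed window size from the scenario
ALLOWED_LATENESS_MIN = 10  # allowed lateness from the scenario

def classify(event_minute: int, watermark_minute: int) -> str:
    """Classify an event by comparing the watermark with its window end."""
    window_end = (event_minute // WINDOW_MIN + 1) * WINDOW_MIN
    if watermark_minute < window_end:
        return "on_time"
    if watermark_minute < window_end + ALLOWED_LATENESS_MIN:
        return "late_accepted"
    return "dropped"

print(classify(25, 28))  # on_time
print(classify(25, 35))  # late_accepted
print(classify(25, 45))  # dropped
```

In this model, an event is lost only when the watermark has already moved more than the allowed lateness past the end of its window.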

294

MCQmedium

A BigQuery table contains streaming data from Cloud Pub/Sub. The table is partitioned by ingestion time. A user runs a query that accesses data from the last 5 minutes and gets correct results. After 90 minutes, the user runs the same query again but notices that some rows are missing. What is the most likely cause?

A.The query is using time travel to a snapshot before the streaming buffer was committed

B.The query is using cached results that exclude recent data

C.The schema of the table was modified after the initial query

D.The table has a partition expiration of 30 days

AnswerA

Time travel queries return data from a snapshot; if the snapshot is before the buffer is flushed, recent data is missing.

Why this answer

Option A is correct because BigQuery's streaming buffer provides low-latency access to recently ingested data, but this data is not immediately committed to managed storage. After the streaming buffer is flushed (typically within 90 minutes), the data becomes available in the table's base storage. If the user runs a query using time travel (e.g., `FOR SYSTEM_TIME AS OF`) to a snapshot taken before the buffer was committed, the query will only see data that was in managed storage at that snapshot time, missing rows that were still in the streaming buffer at that point.

Exam trap

Google Cloud often tests the misconception that cached results or schema changes are responsible for data inconsistencies, when the real issue is the separation between BigQuery's streaming buffer and managed storage, and how time travel queries only see committed data.

How to eliminate wrong answers

Option B is wrong because BigQuery caches query results only for identical queries within a 24-hour period, but the user ran the same query after 90 minutes; if cached results were used, they would include the same rows as the initial query, not missing rows. Option C is wrong because schema modifications do not cause rows to disappear from query results; they may affect column access or data types but do not remove existing rows. Option D is wrong because a partition expiration of 30 days would only remove partitions older than 30 days, not affect data from the last 5 minutes or 90 minutes.

295

Drag & Dropmedium

Drag and drop the steps to create a Cloud Composer environment for Apache Airflow into the correct order.

Why this order

Cloud Composer provides a managed Airflow environment for orchestrating workflows.

296

MCQhard

A team is using BigQuery to analyze petabyte-scale data. They notice that queries are slow and expensive due to full table scans. They have already partitioned by date. What additional optimization should they implement?

A.Use materialized views

B.Cluster by frequently filtered columns

C.Convert to native tables

D.Use query caching

AnswerB

Clustering reduces bytes read when filtering on those columns.

Why this answer

Clustering by frequently filtered columns (option B) organizes data within each partition based on the sort order of those columns. This allows BigQuery to prune blocks during query execution, significantly reducing the amount of data scanned and improving both performance and cost. Since the table is already partitioned by date, clustering adds a secondary ordering that targets the most common filter predicates, avoiding full table scans within each partition.

Exam trap

Google Cloud often tests the distinction between partitioning and clustering, where candidates mistakenly believe partitioning alone is sufficient for all filtering scenarios, but clustering is required to avoid full scans on non-date columns.

How to eliminate wrong answers

Option A is wrong because materialized views precompute and store query results, which can speed up repeated aggregations but do not reduce the scan cost of ad-hoc filters on raw data; they are not a substitute for physical data organization like clustering. Option C is wrong because BigQuery tables are already native (managed) tables; converting to native tables is not a valid operation and does not address scan efficiency. Option D is wrong because query caching only returns results for identical queries run within 24 hours, but it does not reduce the scan cost or improve performance for new or slightly different queries that still trigger full table scans.

297

MCQhard

Your Vertex AI model deployed on an endpoint is experiencing high tail latency during online predictions. The model uses a large embedding layer, and the input size varies. You have enabled automatic scaling with a minimum of 2 replicas and maximum of 10. What is the most likely cause of the latency spikes and the best first step to diagnose?

A.The model's SavedModel is too large due to the embedding layer; reduce embedding dimensions to lower latency.

B.The endpoint's target CPU utilization might be set too low, causing rapid scale-down and cold starts. Check Cloud Logging for scaling events.

C.The model uses a custom prediction routine that is not optimized; use tf.function to improve performance.

D.Enable model monitoring for online prediction and add a buffer to the endpoint's machine type.

AnswerB

If target utilization is low, replicas scale down quickly; cold starts on new requests cause latency. Logs show scaling.

Why this answer

High tail latency with variable input sizes and a large embedding layer often points to cold starts from aggressive scaling. When the target CPU utilization is set too low, the endpoint scales down quickly during lulls, and a subsequent burst of requests forces new replicas to spin up, causing latency spikes. Checking Cloud Logging for scaling events is the best first step because it directly reveals whether the endpoint is scaling down and then experiencing cold starts.

Exam trap

Google Cloud often tests the misconception that high tail latency is always due to model size or inference optimization, when in fact the most common cause in managed serving environments is autoscaling misconfiguration leading to cold starts.

How to eliminate wrong answers

Option A is wrong because reducing embedding dimensions would lower model accuracy and does not address the root cause of latency spikes from scaling dynamics; the model size is not the primary driver of tail latency in this scenario. Option C is wrong because while a custom prediction routine could be suboptimal, the question describes a standard model with a large embedding layer and variable input size, and the latency pattern (spikes) is more characteristic of cold starts than of per-request optimization issues; tf.function would help steady-state performance but not sudden spikes. Option D is wrong because model monitoring detects drift or anomalies but does not diagnose scaling-related latency, and adding a buffer to the machine type (e.g., increasing memory) does not fix the scaling policy that causes cold starts.

298

Multi-Selecteasy

Which TWO approaches are recommended for handling late-arriving data in a streaming Dataflow pipeline?

Select 2 answers

A.Use side inputs to provide default values for late data.

B.Use fixed windows with a duration of 1 second to minimize lateness.

C.Configure allowed lateness on the window to accept late data.

D.Set the trigger to fire only at the end of the window.

E.Use a filter transform to drop late-arriving elements.

AnswersA, C

Side inputs can supply missing data.

Why this answer

Option A is correct because side inputs in Apache Beam (the programming model underlying Dataflow) allow you to provide default values or supplementary data to handle late-arriving elements gracefully. When a late element arrives after the window has been emitted, a side input can supply a fallback value, ensuring the pipeline can still process the data without discarding it. This approach is recommended for handling late data in streaming pipelines where completeness is not critical.

Exam trap

Google Cloud often tests the misconception that simply using small windows or dropping late data is a valid handling strategy, when in fact the recommended approaches involve configuring allowed lateness and using side inputs for graceful fallback.

299

MCQhard

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A.Check Stackdriver logging for error messages.

B.Disable exactly-once processing in Dataflow.

C.Increase the number of Dataflow workers.

D.Switch to BigQuery streaming inserts.

AnswerA

Identifies root cause.

Why this answer

Option A is correct because Stackdriver (now Cloud Logging) is the first place to investigate when a Dataflow pipeline experiences high latency and data loss. Dataflow automatically logs errors, worker failures, and system messages to Cloud Logging, which can reveal root causes such as insufficient resources, stuck steps, or Pub/Sub subscription issues. Checking logs first avoids premature scaling or configuration changes that may not address the actual problem.

Exam trap

Google Cloud often tests the principle of 'diagnose before you optimize' — the trap here is that candidates jump to scaling or switching technologies (options C and D) without first checking logs, which is the fundamental first step in any troubleshooting workflow.

How to eliminate wrong answers

Option B is wrong because disabling exactly-once processing in Dataflow would not fix high latency or data loss; it could actually increase data duplication and make debugging harder, while the core issue remains unaddressed. Option C is wrong because increasing the number of Dataflow workers without first diagnosing the bottleneck (e.g., a hot key, slow transform, or Pub/Sub backlog) can waste resources and may not resolve the underlying cause of latency or loss. Option D is wrong because switching to BigQuery streaming inserts does not address pipeline-level failures; streaming inserts have their own quotas, error handling, and latency characteristics, and the problem likely lies in the Dataflow processing logic or resource allocation, not the sink.

300

MCQhard

A Dataflow streaming pipeline processes events from Pub/Sub and writes to BigQuery using a dynamically generated table destination based on the event type. The pipeline is experiencing high latency, and the worker CPU utilization is low. Which action is most likely to reduce latency?

A.Increase the batch size parameter in the BigQuery sink to write larger batches.

B.Reduce the number of workers to increase CPU utilization per worker.

C.Enable Dataflow Streaming Engine to improve throughput and reduce latency.

D.Increase the worker disk size to reduce I/O wait time.

AnswerC

Streaming Engine moves state to the backend, reducing worker overhead and improving latency.

Why this answer

Option C is correct because Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing per-worker overhead and enabling better resource utilization. This directly addresses the symptom of high latency with low CPU utilization, which indicates workers are bottlenecked on shuffle or state management rather than compute.

Exam trap

The trap here is that candidates often assume low CPU utilization means workers are underutilized and should be scaled down (Option B), when in fact low CPU with high latency indicates a bottleneck in shuffle or state management that is not compute-bound.

How to eliminate wrong answers

Option A is wrong because increasing batch size in the BigQuery sink can actually increase latency for streaming pipelines, as larger batches require more time to fill before writing, and the issue here is not sink throughput but worker inefficiency. Option B is wrong because reducing the number of workers would decrease parallelism and likely worsen latency, and low CPU utilization suggests workers are not compute-bound but rather waiting on I/O or shuffle. Option D is wrong because increasing worker disk size does not reduce I/O wait time for streaming pipelines; disk I/O is not the bottleneck when CPU is low and the pipeline uses Pub/Sub and BigQuery, which are network-bound.
