PDE Exam Questions and Answers

Domain 1: Designing data processing systems

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A. Use Cloud Dataproc Serverless for all Spark jobs.

B. Migrate jobs to Cloud Dataflow.

C. Run Spark on Compute Engine instances with startup scripts.

D. Use Dataproc clusters with auto-scaling and preemptible VMs. (Correct: Reduces cost and operational overhead.)

Why: Option D is correct because Dataproc clusters with auto-scaling and preemptible VMs directly address the need to reduce operational overhead and minimize costs for on-premises Spark migrations. Auto-scaling dynamically adjusts cluster size based on workload, while preemptible VMs (which cost 60-80% less than standard VMs) handle fault-tolerant tasks, making this the most cost-effective and operationally efficient architecture for Spark on Dataproc.

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A. Check Stackdriver logging for error messages. (Correct: Identifies root cause.)

B. Disable exactly-once processing in Dataflow.

C. Increase the number of Dataflow workers.

D. Switch to BigQuery streaming inserts.

Why: Option A is correct because Stackdriver (now Cloud Logging) is the first place to investigate when a Dataflow pipeline experiences high latency and data loss. Dataflow automatically logs errors, worker failures, and system messages to Cloud Logging, which can reveal root causes such as insufficient resources, stuck steps, or Pub/Sub subscription issues. Checking logs first avoids premature scaling or configuration changes that may not address the actual problem.

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

A. Cloud Pub/Sub, Cloud Dataproc, Cloud Storage

B. Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner

C. Cloud Pub/Sub, Cloud Dataflow, BigQuery (Correct: Best for real-time SQL analytics.)

D. Cloud Pub/Sub, Cloud Dataflow, Cloud Storage

Why: Option C is correct because Cloud Pub/Sub ingests real-time clickstream data, Cloud Dataflow processes it with low latency, and BigQuery provides a serverless, SQL-based data warehouse that is cost-effective for moderate data volumes due to its pay-per-query pricing and automatic scaling. This combination avoids the overhead of managing clusters (Dataproc) or expensive storage (Cloud Spanner) while directly supporting SQL analytics.

A financial company processes transactions in real-time and requires exactly-once processing semantics. They also need to reprocess historical data for backtesting. Which Google Cloud service should they use?

A. Cloud Pub/Sub

B. Cloud Functions

C. Cloud Dataproc

D. Cloud Dataflow (Correct: Supports exactly-once and batch/streaming.)

Why: Cloud Dataflow (D) is correct because it provides exactly-once processing semantics for streaming pipelines by checkpointing state and deduplicating retried records (a design that traces back to the MillWheel paper), and it supports both real-time streaming and batch processing for historical backtesting under a unified programming model. This allows the company to reprocess historical data using the same pipeline code, ensuring consistency across real-time and batch modes.
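The idea that exactly-once results survive redelivery and replay can be illustrated with a minimal sketch. It assumes each transaction carries a unique ID and deduplicates on it; the function name and sample data are hypothetical, and this is not how Dataflow is implemented internally, only an illustration of the effect.

```python
from typing import Dict, Iterable, Tuple


def apply_once(events: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Apply each (transaction_id, amount) at most once, even if redelivered."""
    ledger: Dict[str, float] = {}
    for txn_id, amount in events:
        # A redelivered event carries the same ID, so it is skipped here.
        if txn_id not in ledger:
            ledger[txn_id] = amount
    return ledger


# Hypothetical stream in which transaction t1 is delivered twice.
stream = [("t1", 100.0), ("t2", 50.0), ("t1", 100.0)]
ledger = apply_once(stream)
total = sum(ledger.values())  # t1 is counted once, not twice
```

Because the result depends only on the set of transaction IDs, replaying the same history for backtesting yields the same ledger as the live run.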

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

A. Ingest to Cloud Spanner, query directly.

B. Ingest to Cloud SQL, then export to Cloud Storage for queries.

C. Ingest to Cloud Storage, create BigQuery external tables. (Correct: Schema-on-read and SQL.)

D. Ingest to Cloud Storage, load into Dataproc for queries.

Why: BigQuery external tables allow schema-on-read by applying a table schema to the files in Cloud Storage when they are queried, enabling ad-hoc SQL queries without loading data into a separate system. This architecture directly supports the requirement for schema-on-read and SQL-based analysis, as BigQuery provides a serverless, scalable SQL engine.
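Schema-on-read itself can be shown with a small sketch in plain Python: the stored data stays as untyped text, and the reader supplies the schema when it reads. The sample data, column names, and helper function are hypothetical and stand in for files in a bucket; this is an illustration of the concept, not of BigQuery's behavior.

```python
import csv
import io
from typing import Callable, Dict, List

# Raw file content as it might sit in object storage: untyped text.
RAW = "device_id,temp_c,ts\nd1,21.5,2024-01-01\nd2,19.0,2024-01-02\n"

# The schema lives with the reader, not with the stored data.
SCHEMA: Dict[str, Callable[[str], object]] = {
    "device_id": str,
    "temp_c": float,
    "ts": str,
}


def read_with_schema(raw: str, schema: Dict[str, Callable[[str], object]]) -> List[Dict[str, object]]:
    """Parse raw CSV text and cast each column according to the schema."""
    reader = csv.DictReader(io.StringIO(raw))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


rows = read_with_schema(RAW, SCHEMA)
avg_temp = sum(r["temp_c"] for r in rows) / len(rows)
```

Changing the schema changes how the same stored bytes are interpreted, with no reload of the data.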

A company wants to stream data from Cloud Pub/Sub into BigQuery with minimal latency. They have a small team and limited operational resources. Which approach is best?

A. Write a custom application on Compute Engine that polls Pub/Sub and writes to BigQuery.

B. Create a Dataproc cluster running a Spark Streaming job.

C. Create a Cloud Function that writes to BigQuery.

D. Use a Dataflow pipeline with a BigQuery subscription. (Correct: Serverless and low maintenance.)

Why: Option D is correct because Dataflow provides a fully managed, serverless streaming service that reads messages from Pub/Sub and writes them to BigQuery with minimal latency. Dataflow handles autoscaling, checkpointing, and exactly-once processing, which suits the team's limited operational resources. The Google-provided Pub/Sub to BigQuery template removes the need for custom code or cluster management. Strictly speaking, a Pub/Sub BigQuery subscription is a Pub/Sub feature that writes to BigQuery directly, without a Dataflow pipeline; the option's wording combines the two, but both are managed approaches with low operational overhead.


Domain 2: Building and operationalizing data processing systems

A company is migrating its on-premises Apache Spark jobs to Dataproc. The jobs read from and write to Cloud Storage. After migration, the jobs are slower than expected. The Dataproc cluster uses standard worker machines with local SSDs. What is the most likely cause of the performance degradation?

A. The Spark shuffle service is not enabled on the cluster.

B. The local SSDs are not mounted or are misconfigured.

C. The Cloud Storage connector is not using the gRPC protocol.

D. The jobs use the Cloud Storage connector instead of HDFS, causing network latency. (Correct: Reading from Cloud Storage over network is slower than local HDFS reads.)

Why: D is correct because the performance degradation is most likely due to network latency when using the Cloud Storage connector instead of HDFS. Cloud Storage is an object store accessed over the network, while HDFS leverages local SSDs for data locality and faster I/O. In Dataproc, jobs that read/write to Cloud Storage incur higher latency compared to using HDFS on local SSDs, especially for shuffle-heavy Spark workloads.

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?

A. Increase the number of workers and enable autoscaling to distribute the load. (Correct: More workers can handle the CPU load from streaming inserts.)

B. Reduce the number of workers to minimize coordination overhead.

C. Use a global window with a trigger to reduce state size.

D. Change the windowing to a fixed 5-minute window to reduce computations.

Why: The 'worker failed to provide a heartbeat' error combined with high CPU usage indicates that workers are overloaded and cannot process data fast enough to maintain their heartbeat to the Dataflow service. Increasing the number of workers and enabling autoscaling distributes the computational load across more machines, reducing per-worker CPU pressure and allowing heartbeats to be sent on time. This directly addresses the root cause of resource exhaustion.
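To see where the CPU load in this scenario comes from, it helps to count how many windows each event lands in. The sketch below is plain Python rather than Apache Beam code; it assumes windows start at multiples of the period, and the function name is hypothetical.

```python
from typing import List, Tuple


def sliding_windows(ts: int, size: int = 300, period: int = 60) -> List[Tuple[int, int]]:
    """Return every [start, end) window, in seconds, that contains timestamp ts."""
    # Latest window start at or before ts, aligned to the period.
    start = ts - (ts % period)
    windows = []
    # Walk backwards while the window still covers ts (start > ts - size).
    while start > ts - size:
        windows.append((start, start + size))
        start -= period
    return windows


# An event at t=125s with a 5-minute window sliding every minute.
assigned = sliding_windows(125)
```

With a 5-minute window and a 1-minute period, each event is aggregated in five overlapping windows, so the per-event work is roughly five times that of a fixed window. That extra work is what additional workers absorb.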

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A. Dataproc Serverless with PySpark (Correct: Dataproc Serverless is cost-effective and suitable for batch processing of large CSVs.)

B. Dataflow with batch mode

C. Cloud Data Fusion

D. BigQuery Data Transfer Service

Why: Dataproc Serverless with PySpark is the most cost-effective choice because it eliminates cluster management overhead and automatically scales resources based on workload, charging only for the processing time used. For 10 GB CSV files processed daily within a 24-hour window, the serverless model avoids the fixed costs of a persistent cluster, making it ideal for batch, non-time-sensitive jobs. PySpark's native support for CSV parsing and BigQuery integration via the Spark BigQuery connector ensures efficient data loading without additional services.

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?

A. Create a long-running Dataproc cluster that remains idle and reuse it for each workflow. (Correct: Reusing an existing cluster eliminates the creation step and associated timeout.)

B. Implement a retry loop with exponential backoff in the DAG.

C. Use preemptible VMs for the cluster to reduce cost and improve creation speed.

D. Increase the cluster creation timeout in the Airflow configuration.

Why: Option A is correct because creating a long-running Dataproc cluster and reusing it eliminates the variable cluster creation time that causes timeouts. Cloud Composer (Airflow) can manage cluster lifecycle separately from the workflow, ensuring the cluster is always available when the Dataproc job runs. This approach decouples cluster provisioning from job execution, making the workflow resilient to creation delays.

A company is using Dataflow to stream data from Cloud Pub/Sub to BigQuery. The pipeline includes a custom ParDo transformation that enriches the data with external API calls. The pipeline is experiencing high latency and occasional failures due to API timeouts. What strategy should be employed to improve reliability and performance?

A. Remove the enrichment step and store raw data in BigQuery.

B. Use a global window to accumulate all data before enrichment.

C. Use a DoFn with stateful processing and batch API calls using asynchronous HTTP client. (Correct: Batching and async calls reduce per-element latency and handle timeouts gracefully.)

D. Increase the number of workers to parallelize API calls.

Why: Option C is correct because using a DoFn with stateful processing and an asynchronous HTTP client allows the pipeline to batch API calls and handle timeouts without blocking the main processing thread. This reduces latency by enabling concurrent requests and improves reliability through retry logic and state management, which is essential for external API enrichment in Dataflow.
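The batching, concurrency, and timeout handling described above can be sketched outside of Beam with asyncio. Everything here is an assumption for illustration: `fake_enrich_api` stands in for the external service, and the batch size, timeout, and backoff values are arbitrary. In a real pipeline this logic would sit inside a DoFn that buffers elements.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

EnrichApi = Callable[[List[str]], Awaitable[Dict[str, str]]]


async def fake_enrich_api(batch: List[str]) -> Dict[str, str]:
    """Hypothetical external API: one call enriches a whole batch of keys."""
    await asyncio.sleep(0.01)
    return {key: key.upper() for key in batch}


async def enrich_batch(api: EnrichApi, batch: List[str],
                       timeout_s: float = 1.0, max_attempts: int = 3) -> Dict[str, str]:
    """Call the API for one batch, retrying timeouts with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(api(batch), timeout=timeout_s)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0.01 * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


async def enrich_all(api: EnrichApi, keys: List[str], batch_size: int = 3) -> Dict[str, str]:
    """Split keys into batches and enrich all batches concurrently."""
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
    results = await asyncio.gather(*(enrich_batch(api, b) for b in batches))
    merged: Dict[str, str] = {}
    for result in results:
        merged.update(result)
    return merged


enriched = asyncio.run(enrich_all(fake_enrich_api, ["a", "b", "c", "d", "e"]))
```

Five elements cost two API calls instead of five, and a slow call delays only its own batch rather than blocking every element behind it.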

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?

A. Use a single-node cluster with standard VMs.

B. Use a cluster with local SSDs for faster I/O.

C. Use a cluster with a mix of standard and preemptible VMs. (Correct: Preemptible VMs reduce cost significantly while providing sufficient compute.)

D. Use a cluster with n1-highmem-32 instances and 1000 cores.

Why: Option C is correct because preemptible VMs cost up to about 80% less than standard VMs, and mixing them with standard VMs keeps the job fault tolerant over its 6-hour duration. Since the job reads from and writes to Cloud Storage (not local HDFS), local SSDs are unnecessary, and a single-node cluster would lack the parallelism needed to process 500 TB within 6 hours. Using standard VMs for the master and primary workers and preemptible VMs for secondary workers minimizes cost while ensuring the job completes.


Domain 3: Operationalizing machine learning models

A company deploys a machine learning model to Vertex AI for real-time predictions. After deployment, they notice that prediction latency spikes during peak traffic hours. Which approach should they take to reduce latency without sacrificing accuracy?

A. Configure auto-scaling with higher min and max instances (Correct: Auto-scaling handles traffic spikes.)

B. Reduce the number of input features

C. Switch from online to batch prediction

D. Use a larger machine type for the model

Why: Option A is correct because configuring auto-scaling with higher min and max instances ensures that Vertex AI has sufficient pre-warmed replicas to handle traffic spikes without cold-start latency. This approach maintains model accuracy because it does not alter the model architecture or inference logic, only the infrastructure capacity.

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?

A. Vertex AI Feature Store

B. Vertex AI Model Evaluation (Correct: Evaluates model and can block deployment if threshold not met.)

C. Cloud Build trigger

D. Cloud Monitoring alert

Why: Vertex AI Model Evaluation provides built-in evaluation metrics and threshold-based validation that can be used as a pipeline condition to gate model deployment. By adding a Model Evaluation component, the pipeline can compare model performance against a predefined threshold and only proceed to deploy if the metrics (e.g., AUC, precision, recall) meet or exceed the required value.
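The gating logic that such a pipeline condition expresses is simple enough to sketch in plain Python. The metric names, threshold values, and function name below are hypothetical, and the sketch does not use the Vertex AI SDK; it only shows the comparison that decides whether the deploy step runs.

```python
from typing import Dict


def should_deploy(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every thresholded metric meets or exceeds its minimum."""
    # A metric that was never reported is treated as failing its threshold.
    return all(
        metrics.get(name, float("-inf")) >= minimum
        for name, minimum in thresholds.items()
    )


# Hypothetical thresholds and two candidate models.
thresholds = {"auc": 0.90, "recall": 0.80}
candidate_ok = should_deploy({"auc": 0.93, "recall": 0.85}, thresholds)
candidate_bad = should_deploy({"auc": 0.93, "recall": 0.75}, thresholds)
```

Treating a missing metric as a failure keeps the gate conservative: a broken evaluation step cannot accidentally let a model through.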

A company trains a custom model using TensorFlow and wants to deploy it to Vertex AI for low-latency predictions. The model is large (2 GB). Which deployment option should they choose?

A. Use Vertex AI Batch Prediction job

B. Deploy as a Cloud Function

C. Deploy to Vertex AI Endpoint with a custom container (Correct: Custom containers allow large models.)

D. Deploy to Cloud Run with minimum instances

Why: Option C is correct because deploying a large (2 GB) model to Vertex AI Endpoint with a custom container allows you to package the model, its dependencies, and a serving framework (e.g., TensorFlow Serving) into a Docker image. This approach supports low-latency predictions by keeping the model loaded in memory across requests, and it can scale to handle real-time inference traffic, unlike batch or serverless options that have cold-start or size limitations.

A company uses Vertex AI to serve a model. They notice that some predictions are incorrect due to data drift. What is the best way to detect and retrain the model automatically?

A. Store predictions in BigQuery and run scheduled queries

B. Create a Cloud Monitoring dashboard

C. Set up Cloud Logging metrics to monitor predictions

D. Use Vertex AI Model Monitoring with alerts and retraining pipeline (Correct: Monitors drift and triggers retraining.)

Why: Option D is correct because Vertex AI Model Monitoring is specifically designed to detect data drift and feature skew in production models. It can be configured to send alerts and trigger an automated retraining pipeline via Cloud Functions or Vertex AI Pipelines, enabling continuous model improvement without manual intervention. This directly addresses the need for automatic detection and retraining in response to data drift.
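Drift detection comes down to comparing the distribution of a feature at serving time against a baseline and alerting when the gap passes a threshold. The sketch below uses the L-infinity distance between categorical distributions as one simple measure. The data, threshold, and function names are hypothetical, and this is an illustration of the idea rather than the service's implementation.

```python
from collections import Counter
from typing import Dict, Sequence


def distribution(values: Sequence[str]) -> Dict[str, float]:
    """Return the relative frequency of each category."""
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}


def linf_distance(baseline: Sequence[str], serving: Sequence[str]) -> float:
    """Largest absolute difference in category frequency between two samples."""
    p, q = distribution(baseline), distribution(serving)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def needs_retraining(baseline: Sequence[str], serving: Sequence[str],
                     threshold: float = 0.3) -> bool:
    """Flag drift when the distance exceeds the (hypothetical) threshold."""
    return linf_distance(baseline, serving) > threshold


# Hypothetical feature: traffic source at training time vs. in production.
train = ["web"] * 8 + ["app"] * 2
live = ["web"] * 4 + ["app"] * 6
drift = linf_distance(train, live)
retrain = needs_retraining(train, live)
```

In an automated setup, a True result here would be the signal that raises the alert and starts the retraining pipeline.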

A financial services company needs to explain predictions from a complex ensemble model for regulatory compliance. Which Vertex AI service should they use?

A. Vertex AI Explainable AI (Correct: Provides explanations via feature attributions.)

B. Vertex AI Vizier

C. Vertex AI Feature Store

D. Vertex AI Prediction

Why: Vertex AI Explainable AI is the correct service because it provides feature attributions and other explainability techniques (e.g., Shapley value approximations, integrated gradients) that help interpret predictions from complex ensemble models. This is essential for regulatory compliance, where the company must demonstrate how input features influence each prediction, ensuring transparency and auditability.
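A feature attribution can be made concrete with a tiny worked example. The sketch below computes exact Shapley values by averaging each feature's marginal contribution over every ordering, which is only feasible for a handful of features (production services rely on approximations). The toy model, feature names, and all-zero baseline are hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict


def shapley_values(model: Callable[[Dict[str, float]], float],
                   x: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley attributions of model(x) relative to model(baseline)."""
    features = list(x)
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for feature in order:
            # Switch this feature from its baseline value to its actual value.
            current[feature] = x[feature]
            updated = model(current)
            totals[feature] += updated - previous
            previous = updated
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_model(inp: Dict[str, float]) -> float:
    """Hypothetical risk score with an interaction between the two features."""
    return 2.0 * inp["income"] + 1.0 * inp["age"] + inp["income"] * inp["age"]


x = {"income": 3.0, "age": 2.0}
baseline = {"income": 0.0, "age": 0.0}
attributions = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction and the baseline prediction (14.0 here), which is the property that makes them usable as an audit trail for an individual prediction.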

A team wants to retrain a model weekly using new data stored in BigQuery. They want to minimize manual effort. Which approach should they use?

A. Use Cloud Scheduler to trigger a Cloud Function that retrains

B. Retrain manually in a notebook each week

C. Use Cloud Composer to orchestrate retraining

D. Create a Vertex AI Pipeline scheduled via Cloud Scheduler (Correct: Pipelines automate retraining end-to-end.)

Why: Vertex AI Pipelines allow you to define a repeatable, automated ML workflow that can be triggered on a schedule via Cloud Scheduler. This minimizes manual effort by handling data extraction from BigQuery, model retraining, and deployment without human intervention, while also providing versioning and monitoring capabilities.


Domain 4: Ensuring solution quality

A data pipeline ingests streaming data from Pub/Sub into BigQuery via Dataflow. Recently, the pipeline has been failing with 'deadline exceeded' errors. What is the most likely cause?

A. The BigQuery streaming quota is exceeded.

B. Dataflow workers are underutilized due to batch size settings.

C. Dataflow autoscaling is disabled.

D. The Pub/Sub subscription's acknowledgement deadline is too short for the processing time. (Correct: A short acknowledgment deadline causes messages to be redelivered, leading to repeated processing attempts and eventual deadline exceeded errors.)

Why: Option D is correct because 'deadline exceeded' errors in a Dataflow pipeline reading from Pub/Sub indicate that the subscriber is taking longer to process messages than the acknowledgement deadline allows. When the deadline expires, Pub/Sub redelivers the message, causing duplicate processing and eventual pipeline failure. This is a common issue when processing time exceeds the default 10-second acknowledgement deadline.

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A. Use Customer-Managed Encryption Keys (CMEK) and enable VPC Service Controls.

B. Use Customer-Managed Encryption Keys (CMEK) and enable Cloud Audit Logs. (Correct: CMEK provides control over encryption keys, and Cloud Audit Logs record access to data.)

C. Use Default Encryption and enable Data Loss Prevention (DLP) API.

D. Use Customer-Supplied Encryption Keys (CSEK) and enable VPC Service Controls.

Why: Option B is correct because Customer-Managed Encryption Keys (CMEK) allow the team to control and manage the encryption keys used to protect data at rest in Cloud Storage and BigQuery, while enabling Cloud Audit Logs provides the necessary audit trail for access to both the data and the keys. This combination directly satisfies the requirements for encryption at rest and auditability.

A company runs a batch processing job on Dataproc that uses Apache Spark to process 500 GB of data daily. The job completes successfully but takes 4 hours. The team wants to reduce the runtime to under 2 hours without increasing cost. What should they do?

A. Use preemptible VMs for worker nodes and increase the number of workers. (Correct: Preemptible VMs are cheaper, allowing more workers for the same cost, reducing runtime.)

B. Increase the master node's machine type to n2-standard-8.

C. Increase the machine type of worker nodes to n2-highmem-8.

D. Migrate the job to Dataflow with autoscaling enabled.

Why: Preemptible VMs cost significantly less than standard VMs (about 60-80% discount). By using preemptible VMs for worker nodes, you can increase the number of workers (and thus parallelism) without increasing cost. This directly reduces runtime by distributing the 500 GB workload across more executors, while the cost savings from preemptible VMs offset the additional nodes.
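The arithmetic behind "more workers for the same cost" can be checked with a short sketch. It makes several simplifying assumptions that are not in the original: every worker can be swapped for a preemptible one (a real Dataproc cluster keeps some standard primary workers), runtime scales linearly with worker count, and preemptions cost nothing. The starting worker count of 10 is hypothetical.

```python
def preemptible_workers_for_same_budget(standard_workers: int, discount_percent: int) -> int:
    """Workers affordable at the same hourly spend, given a percentage discount.

    The per-worker price cancels out, so only the discount matters.
    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return standard_workers * 100 // (100 - discount_percent)


def ideal_runtime_hours(current_hours: float, current_workers: int, new_workers: int) -> float:
    """Runtime under the (optimistic) assumption of perfectly linear scaling."""
    return current_hours * current_workers / new_workers


# Hypothetical cluster: 10 standard workers, job currently takes 4 hours.
workers_at_60 = preemptible_workers_for_same_budget(10, 60)
workers_at_80 = preemptible_workers_for_same_budget(10, 80)
runtime_at_60 = ideal_runtime_hours(4.0, 10, workers_at_60)
```

Even at the low end of the discount range, the same budget buys 2.5 times the workers, which under these assumptions brings the 4-hour job below the 2-hour target.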

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

A. Use a pull subscription with a 10-second acknowledgment deadline.

B. Enable Dataflow Streaming Engine. (Correct: Streaming Engine offloads state management to the backend, improving reliability.)

C. Enable exactly-once processing sinks (e.g., BigQuery with guaranteed row-level insertion). (Correct: Exactly-once processing prevents duplicate data.)

D. Disable autoscaling to prevent worker churn.

E. Use micro-batch processing with a small batch size.

Why: Option B is correct because enabling Dataflow Streaming Engine moves state and computation from worker VMs to the backend service, reducing the impact of worker scaling and preemption. This improves reliability by providing consistent performance and fault tolerance for streaming pipelines, especially those with high throughput or stateful processing. The second correct option, exactly-once processing sinks, prevents duplicate rows when messages are redelivered or work is retried.

A data analyst runs a complex SQL query in BigQuery that joins multiple large tables and receives a "resources exceeded" error. Which action is most likely to resolve the issue?

A. Use a larger number of workers in the query execution.

B. Use smaller tables by sampling data.

C. Add clustering on join columns.

D. Increase the number of slots allocated to the project. (Correct: More slots provide more memory and CPU, reducing resource exceeded errors.)

Why: The error indicates that the query exceeded the available slot resources in the BigQuery project. Increasing the number of slots allocated to the project (option D) directly addresses this by providing more compute capacity for parallel query execution, which is the correct action to resolve resource exhaustion in BigQuery's serverless architecture.

A company runs a real-time anomaly detection system on Google Cloud. Streaming data from IoT devices is ingested via Pub/Sub, processed by Dataflow (Apache Beam), and results are written to Bigtable for low-latency serving. Recently, the system has been experiencing increased latency and occasional data loss. The Dataflow pipeline shows high system lag and backlog in Pub/Sub. The Bigtable cluster has 3 nodes and is reporting high CPU utilization (over 90%). The team suspects the issue is with the pipeline configuration. They have already verified that there are no errors in the pipeline code and no network issues. Which action should they take to resolve the issue?

A. Increase the number of Bigtable nodes to handle the write throughput. (Correct: High CPU utilization suggests Bigtable is overwhelmed; adding nodes increases capacity.)

B. Change the Dataflow worker machine type to n2-standard-8.

C. Decrease the batch size in the Dataflow pipeline to reduce latency.

D. Increase the number of Dataflow workers to process messages faster.

Why: The high CPU utilization on Bigtable (over 90%) indicates that the cluster is saturated and cannot keep up with the write throughput from Dataflow. This causes backpressure in the pipeline, leading to increased system lag and backlog in Pub/Sub, and eventually data loss when Pub/Sub messages expire. Increasing the number of Bigtable nodes directly addresses the bottleneck by distributing the write load and reducing CPU pressure, which allows the pipeline to drain the backlog and reduce latency.


Frequently asked questions

How many questions are on the PDE exam?

The PDE exam has 60 questions and must be completed in 120 minutes. The passing score is 720/1000.

What types of questions appear on the PDE exam?

Scenario-based questions covering exam objectives with detailed answer explanations.

How are PDE questions organised by domain?

The exam covers 4 domains: Designing data processing systems, Building and operationalizing data processing systems, Operationalizing machine learning models, Ensuring solution quality. Questions are weighted by domain — higher-weight domains appear more on your actual exam.

Are these the actual PDE exam questions?

No. These are original exam-style practice questions written against the official Google Cloud PDE exam objectives. They are not copied from the real exam. Courseiva focuses on genuine understanding, not memorisation of braindumps.


Google Cloud · Free Practice Questions · Last reviewed May 2026


24 real exam-style questions organised by domain, each with the correct answer highlighted and a plain-English explanation of why it's right, and why the others are wrong.

60 exam questions

120 min time limit

Pass: 720/1000

4 exam domains
