PDE Practice Questions

Question 1

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

Accepted Answer

Dataproc Serverless with PySpark. Dataproc Serverless with PySpark is the most cost-effective choice because it eliminates cluster management overhead and automatically scales resources based on workload, charging only for the processing time used. For 10 GB CSV files processed daily within a 24-hour window, the serverless model avoids the fixed costs of a persistent cluster, making it ideal for batch, non-time-sensitive jobs. PySpark's native support for CSV parsing and BigQuery integration via the Spark BigQuery connector ensures efficient data loading without additional services.

Answer

Dataflow with batch mode

Answer

Cloud Data Fusion

Answer

BigQuery Data Transfer Service

Question 2

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

Accepted Answer

Use a Cloud Bigtable table as a side input via a RichSDF. This is correct because using a Cloud Bigtable table as a side input via a RichSDF (Rich Splittable DoFn) lets the pipeline perform point lookups on the large (10 GB) lookup table without loading it entirely into worker memory. This avoids OOM errors and reduces latency by relying on Bigtable's low-latency, scalable key-value storage, which suits high-throughput streaming pipelines that need frequent, random access to a large, frequently updated dataset.
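The memory argument above can be illustrated with a minimal sketch, assuming a per-key lookup function in place of a fully loaded table. The class name `BoundedLookupCache` and the `fetch` callable are hypothetical stand-ins for a call to an external key-value store; this is not the Bigtable client API or a Beam transform.

```python
from collections import OrderedDict
from typing import Callable, Optional


class BoundedLookupCache:
    """LRU cache in front of a point-lookup function.

    `fetch` is a hypothetical stand-in for a call to an external
    key-value store; it is not a real client API.
    """

    def __init__(self, fetch: Callable[[str], Optional[str]], max_entries: int):
        self._fetch = fetch
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, Optional[str]]" = OrderedDict()
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._entries:
            # Cache hit: mark the key as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: one point lookup against the external store.
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry to cap memory use.
            self._entries.popitem(last=False)
        return value

    def __len__(self) -> int:
        return len(self._entries)


# Toy in-memory dict standing in for the remote lookup table.
remote_table = {f"user-{i}": f"segment-{i % 3}" for i in range(1000)}
cache = BoundedLookupCache(remote_table.get, max_entries=100)

# Enrich a few streaming elements one key at a time.
enriched = [(key, cache.get(key)) for key in ["user-1", "user-2", "user-1"]]
```

Worker memory here is bounded by `max_entries` rather than by the size of the lookup table. The sketch omits entry expiry, which a table that changes hourly would need so that cached values do not go stale.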

Answer

Use a side input from a PCollection and broadcast it.

Answer

Increase the number of workers to distribute the side input.

Answer

Increase the worker memory to 16 GB per worker.

Question 3

Your company uses Vertex AI Pipelines to automate model retraining. The pipeline has three steps: data extraction from BigQuery, feature engineering using Dataflow, and model training using a custom container on Vertex AI Training. Recently, the pipeline has been failing intermittently at the Dataflow step with a 'The job encountered a transient error. Please retry.' message. You have enabled pipeline retries with 3 attempts. However, the pipeline still fails after 3 retries. You check the logs and find that the Dataflow job requires more resources than the default worker configuration provides. Which change should you make to reduce the failure rate?

Accepted Answer

Increase the Dataflow worker machine type to have more memory and CPU in the pipeline step configuration. This is correct because the logs show that the job needs more memory and CPU than the default Dataflow worker configuration provides. Increasing the worker machine type (e.g., using a custom machine type with more vCPUs and memory) lets the Dataflow job handle the feature engineering workload without hitting resource limits, which reduces the failures reported as transient errors. This directly addresses the root cause identified in the logs; adding retries does not, and adding workers increases parallelism without giving any single worker more memory.
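As a rough illustration of the change described above, the sketch below builds the worker-related flags for a Beam Python pipeline running on Dataflow. The helper name `dataflow_worker_args` and the values `n2-highmem-8` and `20` are illustrative assumptions, not settings taken from the question.

```python
from typing import List


def dataflow_worker_args(machine_type: str, max_workers: int) -> List[str]:
    """Build command-line flags for a Beam Python pipeline on Dataflow."""
    return [
        "--runner=DataflowRunner",
        # A larger machine type gives each worker more memory and CPU.
        f"--worker_machine_type={machine_type}",
        # Upper bound for autoscaling; does not change per-worker memory.
        f"--max_num_workers={max_workers}",
    ]


# Illustrative values only.
args = dataflow_worker_args("n2-highmem-8", max_workers=20)
```

In a real pipeline step, a list like `args` would typically be passed to Beam's `PipelineOptions` when the job is launched.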

Answer

Increase the number of Dataflow workers to improve parallelism

Answer

Increase the number of retries in the pipeline to 5

Answer

Replace Dataflow with Dataproc to run the feature engineering step

Question 4

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?

Accepted Answer

Vertex AI Model Evaluation. Vertex AI Model Evaluation provides built-in evaluation metrics and threshold-based validation that can be used as a pipeline condition to gate model deployment. By adding a Model Evaluation component, the pipeline can compare model performance against a predefined threshold and only proceed to deploy if the metrics (e.g., AUC, precision, recall) meet or exceed the required value.
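The gating logic described above can be sketched in plain Python. The function name `meets_thresholds` and the metric values are hypothetical; a real pipeline would read the metrics from the evaluation step's output rather than from a literal dictionary.

```python
from typing import Dict


def meets_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Return True only if every required metric is present and >= its threshold."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )


# Hypothetical thresholds and evaluation results.
thresholds = {"auc": 0.90, "recall": 0.80}
candidate = {"auc": 0.93, "precision": 0.88, "recall": 0.78}

# Recall is below its threshold, so this candidate should not be deployed.
deploy = meets_thresholds(candidate, thresholds)
```

A missing metric is treated as a failure, so a model is never deployed on the basis of an incomplete evaluation.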

Answer

Vertex AI Feature Store

Answer

Cloud Build trigger

Answer

Cloud Monitoring alert

Question 5

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

Accepted Answer

Cloud Pub/Sub, Cloud Dataflow, BigQuery. This combination is correct because Cloud Pub/Sub ingests the real-time clickstream data, Cloud Dataflow processes it with low latency, and BigQuery provides a serverless, SQL-based data warehouse that is cost-effective for moderate data volumes thanks to its on-demand (pay-per-query) pricing and automatic scaling. It avoids the overhead of managing clusters (Dataproc) and the expense of Cloud Spanner storage, while directly supporting SQL analytics.

Answer

Cloud Pub/Sub, Cloud Dataproc, Cloud Storage

Answer

Cloud Pub/Sub, Cloud Dataflow, Cloud Spanner

Answer

Cloud Pub/Sub, Cloud Dataflow, Cloud Storage

Question 6

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Accepted Answer

roles/bigquery.dataEditor. The error indicates the service account lacks the `bigquery.tables.get` permission, which is required to read table metadata. `roles/bigquery.dataEditor` includes this permission, along with others such as `bigquery.tables.getData`, `bigquery.tables.update`, and `bigquery.tables.export`, making it the least-privileged of the listed roles that resolves the access denied error for a Dataflow job reading from a BigQuery table.

Answer

roles/bigquery.admin

Answer

roles/bigquery.user

Answer

roles/bigquery.jobUser

Question 7

A data scientist uses Vertex AI Workbench notebooks for model development. They want to share the environment with team members while maintaining version control. Which approach should they use?

Accepted Answer

Use a user-managed notebook instance with multiple users. A user-managed notebook instance with multiple users is the correct approach because Vertex AI Workbench supports collaboration by allowing multiple users to access the same instance via IAM permissions, while the underlying Git integration enables version control. This setup provides a shared, persistent environment where team members can work on the same codebase without duplicating work, and changes can be tracked through Git repositories.

Answer

Use Cloud Shell and clone the repo

Answer

Share the notebook via Cloud Storage

Answer

Store notebooks in Cloud Source Repositories

Question 8

Drag and drop the steps to deploy a Cloud Dataflow pipeline from a template into the correct order.

Question 9

Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.

Question 10

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.

Question 11

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

Question 12

A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?

Accepted Answer

Configure horizontal autoscaling with a higher maximum number of nodes and set a CPU utilization target. This is correct because horizontal autoscaling with a higher maximum number of nodes and a CPU utilization target allows Vertex AI Prediction to add nodes automatically during peak traffic, distributing the inference load and reducing latency. This directly addresses the root cause, insufficient compute resources under high demand, without requiring architectural changes or sacrificing availability.
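To see why the maximum node count matters, here is a minimal sketch of a generic target-tracking rule. It is an assumption for illustration, not Vertex AI's documented autoscaling algorithm, and the function name `desired_replicas` and all numbers are hypothetical.

```python
import math


def desired_replicas(current: int, cpu_utilization: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Generic target-tracking rule: scale so utilization approaches the target."""
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    # Replicas needed to bring average utilization down to the target.
    wanted = math.ceil(current * cpu_utilization / target)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Same peak load (95% CPU on 2 nodes, 60% target), different maximums.
low_cap = desired_replicas(current=2, cpu_utilization=0.95, target=0.60,
                           min_replicas=2, max_replicas=2)
high_cap = desired_replicas(current=2, cpu_utilization=0.95, target=0.60,
                            min_replicas=2, max_replicas=10)
```

With the maximum pinned at 2, the rule cannot add capacity no matter how high utilization climbs; raising the maximum to 10 lets it scale out to 4 nodes for the same load.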

Answer

Reduce the minimum number of nodes to zero to allow scale-to-zero during low traffic.

Answer

Place a Google Cloud Load Balancer in front of the Vertex AI endpoint to distribute requests across multiple endpoints.

Answer

Implement A/B testing by splitting traffic between two model versions to distribute load.

Question 13

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

Accepted Answer

Use a Dataproc cluster with preemptible worker nodes and autoscaling enabled. This is correct because preemptible worker nodes significantly reduce cost (up to 80% discount) while autoscaling dynamically adjusts cluster size to match the growing workload, ensuring performance without over-provisioning. This combination handles the 10x data growth efficiently by scaling out during peak loads and scaling in during lulls, using preemptible instances for fault-tolerant tasks like transformation.
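The cost argument above can be made concrete with a small arithmetic sketch. The hourly rate, the worker counts, and the function name `hourly_cluster_cost` are hypothetical; the 80% figure is the upper bound quoted above, and the model ignores service fees, disks, and work repeated after preemption.

```python
def hourly_cluster_cost(primary_workers: int, secondary_workers: int,
                        on_demand_rate: float, discount: float) -> float:
    """Hourly worker cost when secondary workers are billed at a discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    secondary_rate = on_demand_rate * (1.0 - discount)
    return primary_workers * on_demand_rate + secondary_workers * secondary_rate


# Hypothetical numbers: 10 workers at 0.40 per hour each.
all_on_demand = hourly_cluster_cost(10, 0, on_demand_rate=0.40, discount=0.80)
# Same worker count, but 8 of the 10 are preemptible at an 80% discount.
mixed = hourly_cluster_cost(2, 8, on_demand_rate=0.40, discount=0.80)
savings = 1.0 - mixed / all_on_demand
```

Under these assumptions the mixed cluster costs roughly 64% less per hour than the all-on-demand one. Keeping a few non-preemptible primary workers is a common choice so that the job can keep making progress when preemptible nodes are reclaimed.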

Answer

Create a single large persistent Dataproc cluster to handle the peak load.

Answer

Use Cloud Data Fusion to visually design the pipeline and run it on Dataproc.

Answer

Migrate the pipeline to Dataflow with Apache Beam and use FlexRS for cost savings.

Question 14

A financial services company deploys a regression model to predict loan default risk. The model is served using Vertex AI Endpoints with autoscaling. After deployment, latency increases significantly during peak hours, causing timeouts. The model uses scikit-learn and has a large feature set. Which action should the team take to reduce latency while maintaining prediction accuracy?

Accepted Answer

Apply feature selection to reduce the number of input features. This is correct because the latency spike is caused by the large feature set, which increases the time for preprocessing and inference in the scikit-learn model. Reducing the number of input features via feature selection directly decreases the computational load per request, lowering latency without sacrificing accuracy if the selected features retain predictive power. This addresses the root cause, unlike scaling or resource changes that only mask the symptom.

Answer

Switch to batch prediction for all requests.

Answer

Increase the minimum number of replicas in the endpoint to handle peak load.

Answer

Increase the memory allocation for the serving container.

Question 15

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

Accepted Answer

Enable Dataflow Streaming Engine. This is correct because enabling Dataflow Streaming Engine moves streaming shuffle and state storage from the worker VMs into the Dataflow service backend, reducing the impact of worker scaling and preemption. This improves reliability by providing consistent performance and fault tolerance for streaming pipelines, especially those with high throughput or stateful processing.

Answer

Use a pull subscription with a 10-second acknowledgment deadline.

Answer

Disable autoscaling to prevent worker churn.

Answer

Use micro-batch processing with a small batch size.

Top PDE questions

Google Professional Data Engineer practice questions

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?

A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?

You have a batch prediction job on Vertex AI that processes millions of records. The job is failing with an out-of-memory error. What is the best way to resolve this?

Which TWO are best practices for monitoring a deployed machine learning model in production on Vertex AI?

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A data science team has built a model using scikit-learn. They want to operationalize it on Google Cloud without rewriting the code. Which approach should they take?
