Google Cloud exam questions

Google Professional Data Engineer PDE practice test

Practise PDE questions covering Dataflow, BigQuery, Pub/Sub, Dataproc, Cloud Composer, and Vertex AI, including pipeline design, IAM permissions, and troubleshooting jobs that fail or run slowly.

499 practice questions · 4 topics covered · Exam code: PDE · Vendor: Google Cloud

Study modes

Three ways to study

Start with the Study Sheet to learn the material, switch to Practice Tests for active recall, then take a Mock Exam to simulate the real thing.

Study Sheet

All 499 questions with correct answers and explanations already visible. Read at your own pace — no time pressure.

Practice Test

Answer first, then see feedback and explanation. Tracks your score per session. Best for active recall and identifying weak areas.

Mock Exam

Full timed simulation with countdown. Answers hidden until the end. Includes all question types just like the real exam.

Study Sheet

All 499 PDE questions with answers

Every question in the bank, paginated 75 per page. Correct answers and full explanations are revealed upfront — ideal for first-pass learning and pre-exam review.

7 pages · 75 questions per page · 499 total

Domain practice

Study PDE by domain

Each domain has its own study sheet and practice test. Target the areas where you're weakest instead of repeating questions you already know.

Related practice questions

Study PDE by topic

Topic pages go deep on individual concepts — each one covers a specific exam topic with questions, explanations, and study notes.

Courseiva uses original exam-style practice questions created for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations, not to memorise copied exam dumps.

Sample questions

Google Professional Data Engineer practice questions

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

Your company uses Vertex AI Pipelines to automate model retraining. The pipeline has three steps: data extraction from BigQuery, feature engineering using Dataflow, and model training using a custom container on Vertex AI Training. Recently, the pipeline has been failing intermittently at the Dataflow step with a 'The job encountered a transient error. Please retry.' message. You have enabled pipeline retries with 3 attempts. However, the pipeline still fails after 3 retries. You check the logs and find that the Dataflow job requires more resources than the default worker configuration provides. Which change should you make to reduce the failure rate?

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?

Question 5 · easy · multiple choice

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Exhibit

Refer to the exhibit.

```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```
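
When reading an exhibit like this, it can help to list which roles are actually bound to the member in question before looking at the answer choices. The following is a minimal Python sketch of that reading step, using only the policy shown in the exhibit. The helper name `roles_for_member` is hypothetical and not part of any Google Cloud library, and the sketch does not resolve which permissions a role contains or which role answers the question.

```python
import json

# The policy from the exhibit, copied verbatim.
POLICY_JSON = """
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
"""


def roles_for_member(policy: dict, member: str) -> list[str]:
    """Return the sorted roles whose bindings list the given member.

    Hypothetical helper for illustration only.
    """
    return sorted(
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    )


policy = json.loads(POLICY_JSON)
member = "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
print(roles_for_member(policy, member))  # ['roles/bigquery.dataViewer']
```

Listing the bound roles first keeps the answer tied to the evidence in the exhibit rather than to what you remember about a typical Dataflow setup.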

A data scientist uses Vertex AI Workbench notebooks for model development. They want to share the environment with team members while maintaining version control. Which approach should they use?

Drag and drop the steps to deploy a Cloud Dataflow pipeline from a template into the correct order.

Drag and drop the steps to migrate an on-premises MySQL database to Cloud SQL using Database Migration Service into the correct order.

Drag and drop the steps to set up Cloud IAP (Identity-Aware Proxy) for an App Engine app into the correct order.

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

A company has deployed a machine learning model on Vertex AI Prediction that serves real-time predictions for a customer-facing application. The model was trained using a custom container and is hosted on a single endpoint with a minimum number of nodes. Recently, the team noticed that during peak traffic, prediction latency increases significantly and some requests time out. The endpoint is configured with a baseline traffic split of 100% on the current model version. Which action should the team take to reduce latency and improve reliability?

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

A financial services company deploys a regression model to predict loan default risk. The model is served using Vertex AI Endpoints with autoscaling. After deployment, latency increases significantly during peak hours, causing timeouts. The model uses scikit-learn and has a large feature set. Which action should the team take to reduce latency while maintaining prediction accuracy?

Which TWO actions are recommended to improve the reliability of a Cloud Dataflow streaming pipeline that processes event data from Pub/Sub?

A team is designing a data lake on Google Cloud using Cloud Storage and BigQuery. They need to ensure that sensitive data (e.g., PII) is encrypted at rest and have the ability to audit access. Which approach meets these requirements?

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?
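
As background for this question, a fixed (tumbling) window assigns each event to exactly one non-overlapping interval based on its timestamp. The sketch below shows that assignment in plain Python for a 10-second window. It is an illustration of the general concept, assuming windows aligned to timestamp zero, and is not Dataflow or Apache Beam code. The function name `fixed_window` and the sample timestamps are made up for the example, and the sketch does not indicate the cause of the duplicate rows.

```python
from collections import Counter


def fixed_window(timestamp_s: float, size_s: int = 10) -> tuple[int, int]:
    """Return the [start, end) window, in seconds, that contains the timestamp.

    Assumes windows are aligned to timestamp 0 (no offset).
    """
    start = int(timestamp_s // size_s) * size_s
    return start, start + size_s


# Hypothetical event timestamps in seconds.
events = [0.5, 3.2, 9.99, 10.0, 17.4, 25.1]

# Count events per window, as a per-window aggregation would.
counts = Counter(fixed_window(t) for t in events)
for window, count in sorted(counts.items()):
    print(window, count)
```

Each timestamp maps to a single window, so an event at exactly 10.0 seconds falls in the window starting at 10, not the one ending there.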

A data engineering team is operationalizing a machine learning model for real-time fraud detection. The model must process transactions with sub-100ms latency and be highly available. Which TWO strategies should the team implement?

You have a batch prediction job on Vertex AI that processes millions of records. The job is failing with an out-of-memory error. What is the best way to resolve this?

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?

Your team is using Vertex AI Pipelines to orchestrate a model retraining workflow. The pipeline includes a data validation step, a training step, and a model evaluation step. You want to ensure that if the evaluation step fails due to low model performance, the pipeline stops and does not deploy the model. Which approach should you use?

Which TWO are best practices for monitoring a deployed machine learning model in production on Vertex AI?

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A data science team has built a model using scikit-learn. They want to operationalize it on Google Cloud without rewriting the code. Which approach should they take?

Exam question guide

How to use these PDE questions

Use these questions as active recall, not passive reading. Try the question first, review the answer choices, then open the explanation and connect the result back to the exam topic.

Quick answer

Exhibit-style questions test whether you can read an IAM policy, configuration, log output, diagram or table before choosing the best answer.

How to extract the relevant detail from an exhibit.

How IAM bindings, configuration or log output affects the answer.

How to avoid answering from memory before reading the evidence.

How to map the exhibit back to the exam objective.

These PDE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PDE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.