
Scenario-based practice

Hard Difficulty Questions

Practise original exam-style Google Professional Data Engineer scenarios covering every exam domain, with detailed explanations, wrong-answer analysis, and common exam traps.

Scenario questions: 20
Exam code: PDE
Vendor: Google Cloud

Scenario guide

How to approach hard difficulty questions

These are the questions most candidates get wrong. They require connecting multiple concepts, reading tricky output, or knowing edge-case behaviour that isn't on most study cards. Practising them trains you to operate under uncertainty — a necessary skill on the real exam.

Quick answer

Hard difficulty questions test whether you can apply the concept in context, not just recognise a definition.

How the topic appears in realistic exam-style scenarios.

Which detail in the question changes the correct answer.

How to eliminate plausible but wrong options.

How to connect the question back to the wider exam objective.

Related practice questions

Related PDE topic practice pages

Scenario questions usually connect to one or more exam topics. Use these links to review the underlying concepts behind the scenario.

Practice set

Practice scenarios

Question 1 (hard, multiple choice)

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

Question 2 (hard, multiple choice)

A data science team uses Vertex AI Pipelines to automate retraining. They want to ensure that only models with performance above a threshold are deployed. Which component should they add to the pipeline?
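
The idea of a performance gate in this scenario can be sketched in plain Python. This is a minimal illustration only, not Vertex AI Pipelines code: the function name `should_deploy`, the metric name `accuracy`, and the threshold value are all hypothetical, and a real pipeline would express the same check with whatever conditional construct its SDK provides.

```python
from typing import Mapping


def should_deploy(metrics: Mapping[str, float], metric_name: str, threshold: float) -> bool:
    """Return True only if the named metric is present and meets the threshold."""
    value = metrics.get(metric_name)
    if value is None:
        # A missing metric is treated as a failed gate rather than a pass.
        return False
    return value >= threshold


if __name__ == "__main__":
    candidate = {"accuracy": 0.91}  # hypothetical evaluation output
    if should_deploy(candidate, "accuracy", 0.85):
        print("gate passed: run the deployment step")
    else:
        print("gate failed: skip deployment")
```

The design point is that evaluation and deployment are separate steps, and deployment runs only when the gate returns True.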

Question 3 (hard, multiple choice)

A data scientist uses Vertex AI Workbench notebooks for model development. They want to share the environment with team members while maintaining version control. Which approach should they use?

Question 4 (hard, multiple choice)

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

Question 5 (hard, multiple choice)

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?
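
One general way to absorb variable provisioning time is to retry with an increasing wait instead of failing on the first timeout. The sketch below shows that pattern in plain Python under stated assumptions: `retry_with_backoff` and its parameters are hypothetical names, it does not call Cloud Composer or Dataproc, and it is not presented as the intended answer to the question. Orchestrators typically expose their own timeout and retry settings for the same purpose.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is reached.

    The wait doubles after each TimeoutError: base, 2*base, 4*base, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real waiting; the operation itself would be whatever call creates the cluster.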

Question 6 (hard, multi select)

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

Question 7 (hard, multiple choice)

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

Question 8 (hard, multiple choice)

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

Question 9 (hard, multiple choice)

A financial services firm processes sensitive transactions using Cloud Dataflow. The pipeline reads from Pub/Sub, performs stateful processing (e.g., fraud detection), and writes to Cloud Spanner. Compliance requires exactly-once processing semantics. Which configuration ensures exactly-once processing?
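
Exactly-once results at a sink are often reasoned about in terms of idempotent writes: if a record is redelivered, applying it a second time must not change the outcome. The sketch below illustrates that idea with an in-memory stand-in. The class `IdempotentLedger` and its fields are hypothetical, it uses no Dataflow or Spanner API, and a real sink would need to record the transaction ID and the write atomically.

```python
from typing import Dict, Set


class IdempotentLedger:
    """In-memory stand-in for a sink that applies each transaction ID at most once."""

    def __init__(self) -> None:
        self.balances: Dict[str, int] = {}
        self._applied_ids: Set[str] = set()

    def apply(self, txn_id: str, account: str, amount_cents: int) -> bool:
        """Apply the transaction and return True, or return False for a duplicate."""
        if txn_id in self._applied_ids:
            return False  # redelivered record: ignore it
        self.balances[account] = self.balances.get(account, 0) + amount_cents
        self._applied_ids.add(txn_id)
        return True


if __name__ == "__main__":
    ledger = IdempotentLedger()
    ledger.apply("txn-1", "acct-A", 500)
    ledger.apply("txn-1", "acct-A", 500)  # simulated redelivery
    print(ledger.balances)
```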

Question 10 (hard, multiple choice)

A healthcare company streams patient monitoring data to Cloud Pub/Sub. A Dataflow pipeline reads the stream, enriches with patient records from BigQuery, and writes to Bigtable for real-time queries. The BigQuery lookup is slow and causes pipeline lag. What is the best approach to improve performance?
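
When a per-element lookup against a slow store dominates latency, one common mitigation is to cache results for a bounded time so that repeated keys do not trigger repeated queries. The sketch below shows a small time-to-live cache in plain Python. `TtlCache`, the `loader` callback, and the 60-second TTL in the usage note are illustrative assumptions, not part of any Dataflow or BigQuery API, and the sketch is not claimed to be the intended answer.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class TtlCache(Generic[K, V]):
    """Cache loader results per key and reload them after ttl_s seconds."""

    def __init__(
        self,
        loader: Callable[[K], V],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[K, Tuple[float, V]] = {}

    def get(self, key: K) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]  # fresh hit: no call to the slow store
        value = self._loader(key)  # miss or expired: query the slow store once
        self._entries[key] = (now, value)
        return value
```

A caller would construct it as `TtlCache(loader=fetch_patient, ttl_s=60.0)`, where `fetch_patient` is a hypothetical function that performs the slow lookup. The trade-off is staleness: a record changed in the source is not seen until its cache entry expires.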

Question 11 (hard, multiple choice)

A company runs a batch processing job on Dataproc that uses Apache Spark to process 500 GB of data daily. The job completes successfully but takes 4 hours. The team wants to reduce the runtime to under 2 hours without increasing cost. What should they do?

Question 12 (hard, multi select)

Which THREE considerations are important when designing a data lake on Google Cloud using Cloud Storage?

Question 13 (hard, multi select)

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Question 14 (hard, multiple choice)

In the Vertex AI Pipeline component YAML exhibit, the component is designed to evaluate a model and produce metrics. If the threshold_accuracy is set to 0.85, what is the expected behavior of this component?

Exhibit

Refer to the exhibit.

```
# Vertex AI Pipeline component YAML
name: model-evaluation
inputs:
  model_path:
    type: String
  test_data_path:
    type: String
  threshold_accuracy:
    type: Float
    default: 0.85
outputs:
  evaluation_metrics:
    type: Metrics
implementation:
  container:
    image: gcr.io/my-project/eval:latest
    args: [
      --model_path, {inputValue: model_path},
      --test_data_path, {inputValue: test_data_path},
      --threshold_accuracy, {inputValue: threshold_accuracy},
      --output_path, {outputPath: evaluation_metrics}
    ]
```

Question 15 (hard, multiple choice)

You have deployed a TensorFlow model on Vertex AI Endpoints with autoscaling. The model receives high traffic during peak hours, but you notice that inference latency increases significantly during cold starts. Which strategy would best minimize cold-start latency without incurring unnecessary cost?

Question 16 (hard, multiple choice)

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, performs a fixed window of 10 seconds, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?
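
For readers less familiar with the term, a fixed window of 10 seconds assigns each element to exactly one non-overlapping 10-second interval based on its event timestamp. The arithmetic can be sketched as follows; `fixed_window` is a hypothetical helper written for illustration, not a Dataflow or Apache Beam function, and timestamps are assumed to be whole seconds.

```python
from typing import Tuple


def fixed_window(event_time_s: int, size_s: int = 10) -> Tuple[int, int]:
    """Return the [start, end) bounds of the fixed window containing event_time_s."""
    start = event_time_s - (event_time_s % size_s)
    return (start, start + size_s)


if __name__ == "__main__":
    for ts in (0, 9, 10, 25):
        print(ts, fixed_window(ts))
```

A window can only be emitted once the pipeline believes it has seen all the data for that interval, which is why stalled upstream progress shows up as windows that never fire.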

Question 17 (hard, multiple choice)

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

Exhibit

Refer to the exhibit.

```
  --cluster=my-cluster --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
Job [job-id] submitted.
```

Question 18 (hard, multiple choice)

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

Question 19 (hard, multiple choice)

A company uses Vertex AI to serve a model that requires GPU for inference. They want to minimize cost while handling variable traffic. Which strategy should they use?

Question 20 (hard, multiple choice)

A data science team is operationalizing a batch prediction job using Vertex AI Batch Prediction. The model uses a custom container that requires a specific GPU for inference. The job processes a large dataset stored in Cloud Storage. The team wants to minimize cost while ensuring the job completes within a 2-hour window. Which configuration should they choose?

These PDE practice questions are part of Courseiva's free Google Cloud certification practice question bank. Courseiva provides original exam-style PDE questions with detailed explanations, topic-based practice, mock exams, readiness tracking, and study analytics.