PDE · topic practice

Designing data processing systems practice questions

Practise the Designing data processing systems domain of the Google Professional Data Engineer (PDE) exam with original exam-style scenarios that include answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations, not to memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Designing data processing systems

What the exam tests

What to know about Designing data processing systems

Designing data processing systems questions test whether you can apply the concept in context, not just recognise a definition.

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Designing data processing systems exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Designing data processing systems questions

20 questions · select your answer, then reveal the explanation

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

Question 3 · easy · multiple choice

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

A financial company processes transactions in real-time and requires exactly-once processing semantics. They also need to reprocess historical data for backtesting. Which Google Cloud service should they use?

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

A company wants to stream data from Cloud Pub/Sub into BigQuery with minimal latency. They have a small team and limited operational resources. Which approach is best?

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

Question 8 · hard · multiple choice

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

A company is running a Cloud Dataflow streaming pipeline that aggregates events in 1-minute windows. They notice that the watermark is lagging significantly behind real-time. What is the most likely cause?
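
As background for this scenario, the sketch below shows in plain Python (not the Apache Beam API) how events map to 1-minute fixed windows, how watermark lag can be expressed, and when a window becomes eligible to fire. The function names and sample timestamps are hypothetical, and a real runner estimates the watermark with its own heuristics rather than taking it as an input.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute fixed windows, as in the scenario


def window_start(event_time: float, size: int = WINDOW_SECONDS) -> int:
    """Return the start (epoch seconds) of the fixed window containing event_time."""
    return int(event_time // size) * size


def assign_windows(event_times):
    """Group event timestamps by the start of their fixed window."""
    windows = defaultdict(list)
    for t in event_times:
        windows[window_start(t)].append(t)
    return dict(windows)


def watermark_lag(processing_time: float, watermark: float) -> float:
    """How far the watermark trails the wall clock, in seconds."""
    return processing_time - watermark


def closed_windows(windows, watermark, size: int = WINDOW_SECONDS):
    """Windows whose end the watermark has passed (assumes no allowed lateness)."""
    return sorted(start for start in windows if start + size <= watermark)


# Hypothetical event timestamps in seconds.
events = [5, 42, 61, 119, 130]
windows = assign_windows(events)
print(windows)
print(closed_windows(windows, watermark=120))
print(watermark_lag(processing_time=300, watermark=120))
```

In this simplified model, a watermark that trails the wall clock delays every window's output by the same amount, which is why watermark lag shows up directly as end-to-end latency.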

A data engineer is designing a batch processing system using Cloud Dataproc. Which TWO practices improve performance and reduce costs? (Choose TWO.)

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

A data pipeline uses Cloud Pub/Sub to ingest events, then a Cloud Dataflow job writes to BigQuery. The Dataflow job is failing with 'deadline exceeded' errors. Which TWO actions can resolve this? (Choose TWO.)

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

Exhibit

Refer to the exhibit.

```
cluster=my-cluster region=us-central1 \
class=org.apache.spark.examples.SparkPi jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
Job [job-id] submitted.
```

The exhibit shows a Cloud Logging query result. A data engineer sees this log for a streaming Dataflow job. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
resource.type="dataflow_step"
resource.labels.job_id="2023-01-01_000000-12345678"
"worker pool exhausted"
```

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

Exhibit

Refer to the exhibit.

```
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
```
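
When reading a policy like the one in the exhibit, it can help to list which roles are bound to the service account in question. The Python sketch below parses the same JSON and does exactly that; `roles_for_member` is a hypothetical helper, and the sketch does not resolve which permissions each role contains.

```python
import json

# The policy from the exhibit, reproduced as a string.
POLICY_JSON = """
{
  "bindings": [
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
      ]
    }
  ]
}
"""


def roles_for_member(policy: dict, member: str) -> list:
    """Return the sorted roles whose bindings include the given member."""
    return sorted(
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    )


policy = json.loads(POLICY_JSON)
member = "serviceAccount:dataflow-worker@PROJECT_ID.iam.gserviceaccount.com"
print(roles_for_member(policy, member))
```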

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, performs a fixed window of 10 seconds, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

A company uses Cloud Dataproc to run nightly Spark ETL jobs that process about 500 GB of data each night. The jobs currently take 4 hours to complete. The company wants to reduce the runtime to under 2 hours to meet a new SLA. The cluster is configured with 10 worker nodes (n1-standard-4) and 1 master node (n1-standard-4). The jobs are CPU-bound and use only default settings. The cluster is deleted after each job and recreated. The data is stored in Cloud Storage. The company is open to increasing cost but wants the most cost-effective solution to meet the SLA. Which approach should they take?
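
The numbers in this scenario lend themselves to a rough capacity estimate. The Python sketch below assumes the job is perfectly CPU-bound and that runtime scales inversely with total vCPUs, which is an idealisation: it ignores cluster startup time, shuffle overhead, and data skew. The helper name is hypothetical.

```python
import math

VCPUS_PER_N1_STANDARD_4 = 4  # an n1-standard-4 machine provides 4 vCPUs


def workers_for_target(workers: int, vcpus_per_worker: int,
                       current_hours: float, target_hours: float) -> int:
    """Estimate the worker count for a target runtime under linear scaling."""
    vcpu_hours = workers * vcpus_per_worker * current_hours  # total work done today
    vcpus_needed = vcpu_hours / target_hours
    return math.ceil(vcpus_needed / vcpus_per_worker)


# 10 workers x 4 vCPUs x 4 hours = 160 vCPU-hours of work.
estimate = workers_for_target(10, VCPUS_PER_N1_STANDARD_4, 4, 2)
print(estimate)  # prints 20
```

Under these assumptions, 20 workers of the same machine type is the break-even point for a 2-hour run, so finishing under 2 hours would need some headroom beyond that or a different machine type.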

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?
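
To make the idea of session aggregation concrete, the following plain-Python sketch groups event timestamps into sessions using an inactivity gap. It is not the Apache Beam API; the function name, gap values, and timestamps are hypothetical, and it ignores late or out-of-order arrival by sorting all events up front.

```python
def sessionize(timestamps, gap_seconds):
    """Group timestamps into sessions; a new session starts after gap_seconds of inactivity."""
    sessions = []
    for t in sorted(timestamps):
        # Extend the current session only if the event falls inside the gap.
        if sessions and t - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Hypothetical click timestamps (seconds) for one user.
clicks = [0, 10, 25, 100, 110]
print(sessionize(clicks, gap_seconds=30))
print(sessionize(clicks, gap_seconds=10))
```

The same clicks produce two sessions with a 30-second gap and five with a 10-second gap, which illustrates how sensitive session results are to the gap setting and to the order in which events are seen.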

A financial services firm processes sensitive transactions using Cloud Dataflow. The pipeline reads from Pub/Sub, performs stateful processing (e.g., fraud detection), and writes to Cloud Spanner. Compliance requires exactly-once processing semantics. Which configuration ensures exactly-once processing?

Focused Designing data processing systems sessions

Start a Designing data processing systems only practice session

Every question in these sessions is drawn from the Designing data processing systems domain — nothing else.

Related practice questions

Related PDE topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the PDE exam test about Designing data processing systems?
Designing data processing systems questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Designing data processing systems questions in a focused session?
Yes — the session launcher on this page draws every question from the Designing data processing systems domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other PDE topics?
Use the topic links above to move to related areas, or go back to the PDE question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the PDE exam covers. They are not copied from any real exam or dump site.