PDE · topic practice

Building and operationalizing data processing systems practice questions

Practise original exam-style questions for the Building and operationalizing data processing systems domain of the Google Professional Data Engineer exam, with answer choices, explanations, and analysis of common mistakes.

Courseiva uses original exam-style practice questions designed for learning and revision. The goal is to understand the concepts, recognise exam patterns, and improve through explanations, not to memorise copied exam dumps.

Reviewed by Johnson Ajibi · MSc IT Security
20 questions · Domain: Building and operationalizing data processing systems

What the exam tests

What to know about Building and operationalizing data processing systems

Building and operationalizing data processing systems questions test whether you can apply the concept in context, not just recognise a definition.

  • How the topic appears in realistic exam-style scenarios.
  • Which detail in the question changes the correct answer.
  • How to eliminate plausible but wrong options.
  • How to connect the question back to the wider exam objective.

Watch out for

Common Building and operationalizing data processing systems exam traps

  • Answering from memory before reading the full scenario.
  • Missing a constraint such as cost, availability, security, scope or command context.
  • Choosing a broad answer when the question asks for the most specific fix.
  • Ignoring why the wrong options are tempting.

Practice set

Building and operationalizing data processing systems questions

20 questions · select your answer, then reveal the explanation

A company is migrating its on-premises Apache Spark jobs to Dataproc. The jobs read from and write to Cloud Storage. After migration, the jobs are slower than expected. The Dataproc cluster uses standard worker machines with local SSDs. What is the most likely cause of the performance degradation?

A data pipeline ingests real-time events from Cloud Pub/Sub into BigQuery using Dataflow. The pipeline uses a sliding window of 5 minutes with a 1-minute period to aggregate event counts. Recently, the pipeline started failing with 'The worker failed to provide a heartbeat.' The Dataflow logs show high CPU usage on the workers. What is the best course of action to resolve the issue?
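
To make the windowing in this scenario concrete, the sketch below lists the sliding windows that contain a single event timestamp. It is plain Python rather than Beam code, the function name `sliding_windows_for` is hypothetical, and it assumes window starts are aligned to multiples of the period with no offset. With a 5-minute window and a 1-minute period, each event belongs to five overlapping windows, so every element is aggregated five times.

```python
def sliding_windows_for(timestamp_s, size_s=300, period_s=60):
    """Return the [start, end) sliding windows that contain timestamp_s.

    Assumes window starts are aligned to multiples of period_s (offset 0).
    """
    # Start of the latest window that begins at or before the timestamp.
    start = timestamp_s - (timestamp_s % period_s)
    windows = []
    # Step back one period at a time while the window still covers the timestamp.
    while start > timestamp_s - size_s:
        windows.append((start, start + size_s))
        start -= period_s
    return sorted(windows)


windows = sliding_windows_for(1000)
for start, end in windows:
    print(f"[{start}, {end})")
print(len(windows), "windows contain the event")
```

The count is simply the window size divided by the period, which is why overlapping windows multiply the aggregation work per event.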

A company wants to process large CSV files stored in Cloud Storage and load them into BigQuery. The files are generated daily and each file is about 10 GB. The data is not time-sensitive and can be processed within a 24-hour window. Which service is most cost-effective for this use case?

A financial services company uses Cloud Composer to orchestrate a daily workflow that includes a Dataproc job for risk analysis. The workflow sometimes fails because the Dataproc cluster creation times out. The cluster creation typically takes 3 minutes, but occasionally takes over 10 minutes. What is the most effective way to handle this variability?

A company is using Dataflow to stream data from Cloud Pub/Sub to BigQuery. The pipeline includes a custom ParDo transformation that enriches the data with external API calls. The pipeline is experiencing high latency and occasional failures due to API timeouts. What strategy should be employed to improve reliability and performance?

A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?
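
The figures in this scenario imply a sustained read rate that is worth working out before comparing configurations. The sketch below does the arithmetic in Python, assuming decimal units (1 TB = 1000 GB) and counting only a single read pass over the input; the function name is illustrative.

```python
def required_throughput_gb_per_s(dataset_tb, hours):
    """Average read rate (GB/s) needed to scan dataset_tb within the given hours.

    Uses decimal units: 1 TB = 1000 GB.
    """
    total_gb = dataset_tb * 1000
    seconds = hours * 3600
    return total_gb / seconds


rate = required_throughput_gb_per_s(500, 6)
print(f"{rate:.1f} GB/s aggregate read throughput")
```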

A company runs a Dataflow streaming pipeline that reads from Cloud Pub/Sub and writes to BigQuery. The pipeline uses a side input that is a large lookup table (10 GB) stored in Cloud Storage. The side input is updated hourly. The pipeline experiences high latency and OOM errors on workers. What is the best approach to resolve this?

A company uses Cloud Composer to orchestrate data pipelines. One DAG fails intermittently with the error: 'Task received SIGTERM signal.' The task runs a long-running Dataproc job. What is the most likely cause?

Which TWO factors should be considered when choosing between Cloud Dataflow and Dataproc for a batch processing pipeline?

Which THREE best practices should be followed when designing a Dataflow pipeline for real-time data processing?

Which TWO actions can reduce the cost of running a Dataproc cluster for a nightly batch job?

Refer to the exhibit. A Dataflow pipeline writes to BigQuery table employee_records. The pipeline was working yesterday but fails today. What is the most likely cause?

Exhibit

Error log from Dataflow job:

"""
Workflow failed. Causes: S3D3: BigQueryIO.Write/BatchLoads/Loads/AllocateLoadTable/ParDo(AllocateLoadTable) failed.
org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO$Write$BigQueryWriteException: BigQuery insertion failed: Response JSON: {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
      }
    ],
    "code": 400,
    "message": "Provided Schema does not match Table employee_records. Field last_name has type STRING but provided type INTEGER"
  }
}
"""
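
The error message in the exhibit reports a type disagreement on one field. A minimal sketch of that kind of check, in plain Python with hypothetical schema dictionaries, might look like the following. Only `last_name` and its two types come from the exhibit; `employee_id` is an invented field added for illustration, and the function is not part of any BigQuery or Beam API.

```python
def schema_mismatches(table_schema, provided_schema):
    """Return (field, table_type, provided_type) for fields whose types differ."""
    mismatches = []
    for field, table_type in table_schema.items():
        provided_type = provided_schema.get(field)
        # Only report fields present in both schemas with different types.
        if provided_type is not None and provided_type != table_type:
            mismatches.append((field, table_type, provided_type))
    return mismatches


# Hypothetical schemas; only last_name's types are taken from the exhibit.
table_schema = {"employee_id": "INTEGER", "last_name": "STRING"}
provided_schema = {"employee_id": "INTEGER", "last_name": "INTEGER"}

for field, table_type, provided_type in schema_mismatches(table_schema, provided_schema):
    print(f"Field {field} has type {table_type} but provided type {provided_type}")
```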

Refer to the exhibit. A Dataflow streaming pipeline subscribes to this Pub/Sub subscription. The pipeline occasionally takes more than 10 seconds to process a message. Which behavior will occur?

Exhibit

Cloud Pub/Sub subscription configuration:

{
  "name": "projects/my-project/subscriptions/my-sub",
  "topic": "projects/my-project/topics/my-topic",
  "pushConfig": {},
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {
    "ttl": "604800s"
  },
  "enableMessageOrdering": false,
  "retryPolicy": {
    "minimumBackoff": "10s",
    "maximumBackoff": "600s"
  },
  "deadLetterPolicy": {
    "deadLetterTopic": "projects/my-project/topics/dead-letter-topic",
    "maxDeliveryAttempts": 5
  }
}
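
The duration fields in the exhibit are given in seconds, which makes them easy to misread. The sketch below converts them to more familiar units and restates the acknowledgement and dead-letter settings side by side. It is plain Python over a trimmed copy of the exhibit; the helper name `seconds` is hypothetical.

```python
import json

# Trimmed copy of the subscription configuration from the exhibit.
config = json.loads("""
{
  "ackDeadlineSeconds": 10,
  "messageRetentionDuration": "86400s",
  "expirationPolicy": {"ttl": "604800s"},
  "retryPolicy": {"minimumBackoff": "10s", "maximumBackoff": "600s"},
  "deadLetterPolicy": {"maxDeliveryAttempts": 5}
}
""")


def seconds(duration):
    """Convert a duration string such as '86400s' to an integer number of seconds."""
    return int(duration.rstrip("s"))


summary = {
    "ack_deadline_s": config["ackDeadlineSeconds"],
    "retention_hours": seconds(config["messageRetentionDuration"]) / 3600,
    "expiration_days": seconds(config["expirationPolicy"]["ttl"]) / 86400,
    "max_backoff_minutes": seconds(config["retryPolicy"]["maximumBackoff"]) / 60,
    "max_delivery_attempts": config["deadLetterPolicy"]["maxDeliveryAttempts"],
}
for name, value in summary.items():
    print(f"{name}: {value}")
```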

A company runs a critical real-time data pipeline using Dataflow that ingests events from Cloud Pub/Sub, performs aggregations using sliding windows, and writes results to BigQuery. The pipeline is deployed in us-central1. The pipeline's latency has increased recently, and the Dataflow monitoring shows that the 'system lag' metric is consistently above 5 minutes. The pipeline is using Streaming Engine and has 10 workers with 4 vCPUs each. The pipeline processes approximately 100,000 events per second. The team has verified that the source Pub/Sub topic has sufficient publish throughput and the BigQuery table has no quota issues. The pipeline logs show that some workers are experiencing GC overhead limit exceeded errors. The pipeline code uses stateful processing with a custom keyed state for deduplication. What is the most likely cause of the increased latency?
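
The sizing figures in this scenario can be turned into a per-worker load, which helps when judging whether the symptoms point to raw capacity or to something in the pipeline code. The sketch assumes events are spread evenly across workers, an assumption that keyed processing can easily violate; the function name is illustrative.

```python
def per_worker_load(events_per_s, workers, vcpus_per_worker):
    """Return (events/s per worker, events/s per vCPU), assuming an even spread."""
    per_worker = events_per_s / workers
    per_vcpu = per_worker / vcpus_per_worker
    return per_worker, per_vcpu


per_worker, per_vcpu = per_worker_load(100_000, 10, 4)
print(f"{per_worker:.0f} events/s per worker, {per_vcpu:.0f} events/s per vCPU")
```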

A company uses Cloud Composer to orchestrate a daily ETL pipeline that includes multiple Dataproc jobs. The pipeline processes sensitive financial data. The security team requires that all data in transit be encrypted, and that all Cloud Storage buckets used by the pipeline have uniform bucket-level access enabled and be protected by VPC Service Controls. The pipeline currently uses a single Cloud Composer environment in us-east1. The Dataproc clusters are created using the standard image and use custom service accounts with minimal permissions. The pipeline runs successfully during testing, but in production, the Dataproc jobs fail with 'Access Denied' errors when trying to write to a Cloud Storage bucket. The bucket has uniform bucket-level access enabled and is inside a VPC Service Controls perimeter. The Dataproc service account has the Storage Object Admin role at the project level. What is the most likely cause of the access denied error?

A company is building a real-time streaming pipeline using Pub/Sub and Dataflow to process clickstream data. The pipeline writes aggregated metrics to BigQuery every 10 seconds using a fixed window. During peak traffic, some windows produce duplicate rows in BigQuery. What is the most likely cause?

A data engineer is designing a batch ETL pipeline that reads CSV files from Cloud Storage, transforms them using Dataproc, and writes the results to BigQuery. The data volume is expected to grow 10x in the next year. Which design approach best balances cost and performance?

A data team uses Cloud Composer to orchestrate Airflow DAGs. They need to ensure that a downstream task runs only if at least two out of three upstream sensor tasks succeed. Which TWO configurations should they combine?
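
The condition in this scenario, at least two of three upstream tasks succeeding, is a count over final task states. The sketch below expresses only that counting logic in plain Python. It is not Airflow code, the task names are hypothetical, and how such a check is wired into a DAG is left to the question.

```python
def enough_succeeded(states, required=2):
    """Return True if at least `required` upstream tasks ended in 'success'."""
    succeeded = sum(1 for state in states.values() if state == "success")
    return succeeded >= required


# Hypothetical final states for three upstream sensor tasks.
upstream_states = {"sensor_a": "success", "sensor_b": "failed", "sensor_c": "success"}
print(enough_succeeded(upstream_states))
```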

Refer to the exhibit. A Dataflow pipeline reads log files from Cloud Storage, parses them into LogEvent objects, and writes to BigQuery. The pipeline fails with the errors shown in the exhibit. What is the most likely cause?

Exhibit

Refer to the exhibit.

```
# Dataflow pipeline log snippet
2024-03-15 10:00:00 ERROR Transform 'ParseLogs': org.apache.beam.sdk.util.WindowedValue$CoderLoadingException: Unable to load coder for class com.example.LogEvent
2024-03-15 10:00:01 ERROR Transform 'ParseLogs': java.lang.NoSuchMethodError: com.example.LogEvent: method <init>()V not found
```

A financial services company runs a batch Dataflow pipeline daily to process transaction data. The pipeline reads from Cloud Storage, performs complex transformations, and writes to BigQuery. Recently, the pipeline has been failing intermittently with the error: 'Workflow failed. Causes: (9c3f7a2b1d4e): The worker missed 2000 data samples in the last 30 seconds. This can be caused by a variety of factors, including slow work items, network issues, or resource contention.' The team has already increased the number of workers and tried using e2-standard-8 machine types, but the issue persists. The pipeline processes approximately 500 GB of data per run and uses approximately 200 workers. The team suspects that the issue might be related to shuffle operations. What should the team do next to resolve the issue?

Focused Building and operationalizing data processing systems sessions

Start a practice session limited to Building and operationalizing data processing systems

Every question in these sessions is drawn from the Building and operationalizing data processing systems domain — nothing else.

Related practice questions

Related PDE topic practice pages

Move into related areas when this topic feels solid.

Frequently asked questions

What does the PDE exam test about Building and operationalizing data processing systems?
Building and operationalizing data processing systems questions test whether you can apply the concept in context, not just recognise a definition.
How should I use these practice questions?
Select your answer before revealing the explanation. Then read why each option is right or wrong — this active recall approach builds retention far faster than re-reading notes.
Can I practise just Building and operationalizing data processing systems questions in a focused session?
Yes — the session launcher on this page draws every question from the Building and operationalizing data processing systems domain. Use a 10-question session first to gauge your baseline, then move to 20 or 30 once the weak spots are clear.
Where can I practise other PDE topics?
Use the topic links above to move to related areas, or go back to the PDE question bank to see all topics.
Are these real exam questions or dumps?
These are original practice questions written to test the same concepts the PDE exam covers. They are not copied from any real exam or dump site.