How many Designing data processing systems questions are on the PDE exam?

The Designing data processing systems domain is one of the weighted domains on the PDE exam. The Courseiva question bank has 159 practice questions for this domain.

Free PDE Designing data processing systems Practice Questions (2026)

Q: What does the Designing data processing systems domain cover on the PDE exam?

The Designing data processing systems domain covers the key concepts and skills tested in this area of the PDE exam blueprint published by Google Cloud.

Q: How can I practice Designing data processing systems questions for PDE?

Click any of the 159 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Designing data processing systems domain.

All PDE Designing data processing systems questions (159)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A company is migrating on-premises Apache Spark jobs to Google Cloud Dataproc. They want to reduce operational overhead and minimize costs. Which architecture is most appropriate?

A data pipeline ingests sensor data from IoT devices via Cloud Pub/Sub, processes it with Cloud Dataflow, and writes to BigQuery. The pipeline is failing with high latency and data loss. Which troubleshooting step should be taken first?

A company needs to process real-time clickstream data and store it in a data warehouse for SQL-based analytics. The data volume is moderate. Which combination of Google Cloud services is most cost-effective?

A financial company processes transactions in real-time and requires exactly-once processing semantics. They also need to reprocess historical data for backtesting. Which Google Cloud service should they use?

A company is building a data lake on Cloud Storage with data from multiple sources. They need to apply schema-on-read and support ad-hoc SQL queries. Which architecture is most suitable?

A company wants to stream data from Cloud Pub/Sub into BigQuery with minimal latency. They have a small team and limited operational resources. Which approach is best?

A company has a batch ETL job that runs daily using Cloud Dataflow. The job reads from Cloud Storage, transforms data, and writes to BigQuery. Recently, the job started failing with 'Resources have been exhausted' errors. What is the most likely cause?

A company needs to process sensitive healthcare data with strict compliance requirements. They want to use Cloud Dataflow but must ensure data is encrypted end-to-end and audit logs are retained. Which combination of features should they enable?

A company is running a Cloud Dataflow streaming pipeline that aggregates events in 1-minute windows. They notice that the watermark is lagging significantly behind real-time. What is the most likely cause?

A data engineer is designing a batch processing system using Cloud Dataproc. Which TWO practices improve performance and reduce costs? (Choose TWO.)

A company is migrating an on-premises Hadoop cluster to Google Cloud. They need to run existing Spark jobs with minimal modification. Which THREE strategies should they consider? (Choose THREE.)

A data pipeline uses Cloud Pub/Sub to ingest events, then a Cloud Dataflow job writes to BigQuery. The Dataflow job is failing with 'deadline exceeded' errors. Which TWO actions can resolve this? (Choose TWO.)

The exhibit shows a Spark job submitted to Dataproc that fails with an out-of-memory error. Which change should be made to the submission command to resolve the issue?

The exhibit shows a Cloud Logging query result. A data engineer sees this log for a streaming Dataflow job. What is the most likely cause?

The exhibit shows an IAM policy for a BigQuery dataset. A Dataflow job is failing with 'Access Denied: Table ... User does not have bigquery.tables.get permission'. Which additional role should be granted to the service account?

A company runs a Cloud Dataflow streaming pipeline that reads from Cloud Pub/Sub, applies 10-second fixed windows, joins with a slowly-changing dimension table stored in Cloud Bigtable, and writes results to BigQuery. The pipeline has been running for months but recently started exhibiting increasing latency and occasional data loss. The pipeline uses default settings with autoscaling enabled (min 2, max 20 workers). The Bigtable cluster has 3 nodes. The dimensions are updated infrequently. The latency has grown from seconds to minutes. Examining the Dataflow monitoring UI, you see that the 'System Lag' metric is increasing, and some windows are not being emitted. The CPU utilization on Bigtable nodes is below 50%. There are no errors in the logs. Which action is most likely to resolve the issue?

A company uses Cloud Dataproc to run nightly Spark ETL jobs that process about 500 GB of data each night. The jobs currently take 4 hours to complete. The company wants to reduce the runtime to under 2 hours to meet a new SLA. The cluster is configured with 10 worker nodes (n1-standard-4) and 1 master node (n1-standard-4). The jobs are CPU-bound and use only default settings. The cluster is deleted after each job and recreated. The data is stored in Cloud Storage. The company is open to increasing cost but wants the most cost-effective solution to meet the SLA. Which approach should they take?

A company runs a batch ETL pipeline on Cloud Dataproc. During peak hours, the job takes longer than expected. The pipeline reads from Cloud Storage, transforms data, and writes to BigQuery. What is the most cost-effective way to improve performance without redesigning the pipeline?

A retail company processes real-time clickstream data using Cloud Pub/Sub and Dataflow. The pipeline aggregates events by user session and writes to Bigtable for low-latency queries. However, users report that session data is sometimes missing or duplicated. What is the most likely cause?

A financial services firm processes sensitive transactions using Cloud Dataflow. The pipeline reads from Pub/Sub, performs stateful processing (e.g., fraud detection), and writes to Cloud Spanner. Compliance requires exactly-once processing semantics. Which configuration ensures exactly-once processing?

A logistics company uses Cloud Functions to process incoming tracking events from IoT devices. Events are sent via HTTP triggers. During peak hours, some events fail with 500 errors. What is the best strategy to handle this reliably?

A media company ingests video files from partners via a REST API. Files are stored in Cloud Storage, and metadata is written to Firestore. A Cloud Function is triggered on object finalize to transcode video using Transcoder API. Sometimes, the function fails because the file is still being uploaded when triggered. How should this be fixed?

A healthcare company streams patient monitoring data to Cloud Pub/Sub. A Dataflow pipeline reads the stream, enriches with patient records from BigQuery, and writes to Bigtable for real-time queries. The BigQuery lookup is slow and causes pipeline lag. What is the best approach to improve performance?

A company uses Cloud Dataproc to run Spark ML jobs. The jobs are memory-intensive and often fail with OutOfMemory errors. Which action would most effectively reduce memory pressure without changing the Spark code?

Which TWO statements are correct about designing a data pipeline using Cloud Dataflow for processing unbounded data?

Which THREE considerations are important when designing a data lake on Google Cloud using Cloud Storage?

Which TWO approaches are recommended for handling late-arriving data in a streaming Dataflow pipeline?

A multinational e-commerce company runs a real-time recommendation system. The architecture: user click events are sent via HTTP to a Cloud Run service, which publishes them to a Cloud Pub/Sub topic. A Dataflow streaming pipeline reads from the subscription, joins with user profile data from Firestore, computes recommendations using a TensorFlow model (loaded as a side input), and writes results to a Redis cache (Memorystore) for low-latency serving. The pipeline is deployed in us-central1. Recently, the team noticed that recommendation latency has increased from 50ms to 500ms, and the pipeline's backlog is growing. The Dataflow monitoring shows high CPU utilization on workers, and the SystemLag metric is 2 minutes and increasing. The Redis cluster shows no performance issues. The Firestore queries are within normal latency. The team suspects the TensorFlow model inference is the bottleneck. The model is a large neural network (500MB) loaded in each worker's memory. The pipeline uses 10 n1-standard-4 workers. The pipeline is using Dataflow's streaming engine. The team wants to reduce latency without increasing cost significantly. What should they do?

Your company is building a real-time fraud detection system using Google Cloud. Transactions are streamed into Pub/Sub, and you need to process them with low latency (under 100ms per event) and aggregate data over sliding windows. Which Google Cloud service is best suited for this processing logic?

Which TWO statements about designing a data processing pipeline on Google Cloud are correct? (Choose 2.)

Based on the exhibit, what is the most likely cause of the out-of-memory error?

You are a data engineer at a global e-commerce company. Your team manages a real-time recommendation system that ingests user clickstream events from a Pub/Sub topic (topic-clickstream). The pipeline uses Dataflow to read events, join with user profile data from Cloud Bigtable, compute recommendations using a machine learning model hosted on Cloud Run, and write results to a BigQuery table for analytics. The pipeline has been running smoothly for months, but recently the Dataflow job started failing with the error: "Workflow failed. Causes: S01:ReadPubSub/Read+Transform/ParDo(ExtractUserID)+ ... (5a3b2c1d) The job failed because a worker encountered an out-of-memory error." The Dataflow job uses the Streaming Engine feature with a worker type of n2-standard-8 (8 vCPU, 32 GB memory) and autoscaling from 2 to 20 workers. The clickstream event rate has increased from 500 events/second to 5000 events/second over the past week. The user profile data in Bigtable has also grown, with average row size increasing from 1 KB to 10 KB due to additional fields. You need to resolve the out-of-memory errors without completely redesigning the pipeline. What should you do?

Drag and drop the steps to create a Cloud Storage bucket with uniform bucket-level access into the correct order.

Drag and drop the steps to configure a VPC network with private Google access for on-premises connectivity using Cloud VPN into the correct order.

Drag and drop the steps to set up a Pub/Sub topic with a push subscription to an HTTPS endpoint into the correct order.

Match each Google Cloud data service to its primary use case.

Match each Google Cloud service to its data processing capability.

Match each Google Cloud monitoring/logging service to its function.

A company uses Dataflow to process streaming data from Pub/Sub. They notice increased processing latency. What is the most likely cause?

A data pipeline uses Cloud Composer to orchestrate Dataflow and BigQuery jobs. The pipeline fails intermittently with dependency errors. Which design change can improve reliability?

A company needs to process sensitive data in BigQuery with column-level security. They want to allow analysts to see aggregated data but not individual records. Which approach should they use?

A company uses Dataproc for transient clusters. Which TWO actions can reduce costs?

A data engineer is migrating on-premises Hadoop jobs to Dataproc. Which TWO considerations are important?

A company is building a real-time analytics pipeline with Pub/Sub and Dataflow. Which THREE best practices should they follow?

A data engineer tries to grant a service account read access to a Cloud Storage bucket using the IAM policy above. The service account still cannot read objects. What is the most likely reason?

A BigQuery query fails with the error shown in the exhibit. What is the most likely cause?

A Dataflow pipeline as described in the exhibit has increasing lag. Which optimization is most likely to reduce the lag?

A company needs to process large files (100GB each) from Cloud Storage using Dataproc. They want to minimize job execution time. Which configuration is most appropriate?

A data pipeline uses Cloud Pub/Sub to ingest events and Cloud Functions to transform and write to BigQuery. The system is experiencing data loss during Pub/Sub subscription outages. Which design change improves reliability?

A company wants to implement a near-real-time lake architecture using Cloud Storage and BigQuery. They need to enable queries on data within 5 minutes of arrival. Which approach meets the requirement with minimal operational overhead?

A data engineer needs to design a batch pipeline that processes daily log files from Cloud Storage and writes aggregated results to BigQuery. Which service is most appropriate for this ETL job?

A company uses BigQuery to run reporting queries on a table that is partitioned by date and clustered by customer_id. Queries filtering by customer_id and a date range are performing poorly. What is the most likely cause?

A financial services company needs to process high-frequency trading data with strict ordering guarantees. They use Pub/Sub with ordering keys and Dataflow. The pipeline occasionally produces out-of-order results. What is the most likely cause?

An e-commerce company processes real-time clickstream data using Pub/Sub and Dataflow. They want to ensure that if a Dataflow worker fails, the pipeline can resume processing from the point of failure without data loss. Which feature should they enable?

A financial services company uses Cloud Composer to orchestrate daily batch jobs. One job extracts data from MongoDB to Cloud Storage, then loads into BigQuery, and finally runs a Dataflow pipeline for aggregations. The Dataflow job fails intermittently. They want to automatically restart only the failed Dataflow job without re-running the earlier extraction and load. Which Airflow operator configuration should they use?

A company uses Cloud Storage to store IoT sensor data in JSON format. The data is ingested using a Cloud Function triggered by Cloud Storage events. They notice that when many files are uploaded simultaneously, some files are not processed and the Cloud Function logs show 'function execution timeout'. What is the most likely cause and solution?

An online retailer uses BigQuery for analytics. They have a time-series table with 5 billion rows and new data arrives every day. They want to optimize query performance and reduce costs by ensuring that queries scan only the partitions they need. Which table design should they use?

A data engineering team needs to process a large volume of CSV files stored in Cloud Storage using Dataproc. The files are generated hourly and each contains millions of rows. They want to minimize the number of Dataproc cluster nodes to reduce cost while processing within an hour. Which configuration should they recommend?

A gaming company uses Pub/Sub to ingest player events and Dataflow for real-time analytics. They notice that the Pub/Sub subscription backlog is growing despite the Dataflow pipeline running continuously. The pipeline has a 1-hour window for aggregations. What is the most effective way to reduce the backlog?

A startup wants to build a data lake on Google Cloud using Cloud Storage. They need to store raw data in its original format for future analysis. Which storage class should they use to optimize for cost given that data will be accessed occasionally after the first month?

A media company uses Cloud Data Loss Prevention (DLP) API to inspect and de-identify sensitive data before loading into BigQuery. They want to reduce costs by sampling the data during inspection. Which configuration should they use?

A company runs a Dataflow streaming pipeline that reads from Pub/Sub and writes to BigQuery. They experience a sudden spike in data volume causing BigQuery write throughput to be exceeded, resulting in errors. Which strategy should they implement to handle this gracefully?

A company is designing a data processing pipeline for real-time sensor data. They want to ensure low latency and exactly-once processing semantics. Which two Google Cloud services should they combine to achieve this? (Choose 2)

A data warehouse team uses Cloud BigQuery for analytics. They want to optimize query performance and reduce costs. Which three actions should they take? (Choose 3)

A company uses Cloud Dataproc for large-scale Spark jobs. They notice that some jobs are failing due to insufficient memory on the worker nodes. They want to improve memory management without over-provisioning. Which three configurations should they apply? (Choose 3)

Given the query plan, what is the most likely reason this query is efficient despite processing 10 billion rows?

What is the most likely cause of data duplication after this command?

What is the root cause of this error and the correct solution?

A company is designing a streaming data pipeline to process real-time clickstream events. They need to aggregate events by session window with a 5-minute gap and enable exactly-once processing semantics. Which Google Cloud service should they use?
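To make the session-window idea in this question concrete, here is a minimal sketch in plain Python. It is not the Apache Beam or Dataflow API; the function name `sessionize` and the sample timestamps are hypothetical, and the only value taken from the question is the 5-minute gap. It assumes all events belong to a single user or key.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, gap=timedelta(minutes=5)):
    """Group event timestamps into sessions.

    A new session starts whenever the time since the previous event
    is at least `gap` (5 minutes by default).
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            # Still inside the inactivity gap: extend the current session.
            sessions[-1].append(ts)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append([ts])
    return sessions


# Hypothetical click times for one user, in minutes after noon.
base = datetime(2026, 1, 1, 12, 0)
clicks = [base + timedelta(minutes=m) for m in (0, 2, 4, 10, 12)]
sessions = sessionize(clicks)
print([len(s) for s in sessions])
```

With these sample clicks, the 6-minute pause between minute 4 and minute 10 closes the first session and opens a second one.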

A data engineer is designing a batch data pipeline that reads Avro files from Cloud Storage, transforms data using Apache Beam, and writes to BigQuery. The pipeline must handle daily runs and backfills. Which runner should they use?

A company processes IoT sensor data in near real-time. They ingest data via Cloud Pub/Sub, then a Dataflow streaming pipeline writes to Bigtable for low-latency queries. Recently, they observed increased Pub/Sub message backlog during traffic spikes. What is the most effective scaling strategy?

A team needs to migrate an existing on-premises Hadoop Hive workload to Google Cloud. They want to minimize code changes and use a managed service for transient clusters. Which service should they choose?

A financial company needs to process batch trades data daily and ensure that if a transformation step fails, the entire daily run is retried from the beginning. Which design pattern is appropriate?

A data pipeline uses Cloud Pub/Sub to ingest events, then a Dataflow job writes to Cloud Storage in Avro format. The Dataflow job uses Global windows with a 10-minute trigger. The data is later loaded into BigQuery. They notice duplicate rows in BigQuery because the trigger produced multiple panes. What should the Dataflow pipeline change to eliminate duplicates?

A company wants to analyze server logs stored in Cloud Storage using SQL. They need to get results in seconds without setting up any clusters. Which service should they use?

A data pipeline processes streaming data from Pub/Sub to BigQuery. The pipeline needs to handle late-arriving data that is up to 1 hour late. Which Dataflow feature should be used?
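The following is a simplified model, in plain Python, of how a watermark and an allowed-lateness setting interact for a single window. It is a sketch under stated assumptions, not Dataflow or Beam code: the function `classify` and its labels are hypothetical, and the 1-hour value is taken from the question.

```python
from datetime import datetime, timedelta


def classify(window_end, watermark, allowed_lateness=timedelta(hours=1)):
    """Classify an element that belongs to the window ending at `window_end`.

    - "on_time": the watermark has not yet passed the end of the window.
    - "late":    the watermark passed the window end, but the element is
                 still within the allowed lateness and can be processed.
    - "dropped": the watermark is beyond window end + allowed lateness.
    """
    if watermark < window_end:
        return "on_time"
    if watermark < window_end + allowed_lateness:
        return "late"
    return "dropped"


window_end = datetime(2026, 1, 1, 12, 0)
print(classify(window_end, datetime(2026, 1, 1, 11, 59)))
print(classify(window_end, datetime(2026, 1, 1, 12, 30)))
print(classify(window_end, datetime(2026, 1, 1, 13, 1)))
```

The three calls cover an element arriving before the watermark passes the window, one arriving 30 minutes after, and one arriving more than an hour after.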

A company uses Cloud Dataproc to run Spark jobs on ephemeral clusters. The input data is in Cloud Storage and output is also to Cloud Storage. The cluster is created and deleted daily. The cost is high due to spinning up nodes. Which change can reduce cost without sacrificing performance?

A company is designing a data lake on Google Cloud. They need to store raw data in multiple formats (CSV, Parquet, Avro) and allow various downstream processing frameworks. Which two storage solutions provide flexibility and scalability? (Choose two.)

A streaming pipeline uses Cloud Pub/Sub and Dataflow to process financial transactions. The pipeline must guarantee that each transaction is processed exactly once and in order per customer key. Which two configurations are necessary? (Choose two.)

A company is planning to migrate a legacy batch ETL pipeline to Google Cloud. The pipeline involves reading from a relational database, transforming data, and writing to a data warehouse. Which three Google Cloud services can be used as the orchestration layer? (Choose three.)

A data engineer runs this Dataflow template to load CSV files from Cloud Storage into BigQuery. The job fails with a 'File pattern not matching any files' error. What is the most likely cause?

A team configured a garbage collection rule on a Cloud Bigtable column family with max_age of 100 seconds. After 2 minutes, they notice that data older than 100 seconds is still present. What is the most likely reason?

A team has set up a push subscription to an HTTPS endpoint. They notice that messages are not being acknowledged and are resent every 10 seconds. What is the most likely issue?

A company processes real-time clickstream data from websites. They need to aggregate user sessions that may span multiple hours and handle events that arrive late due to network delays. The pipeline must avoid discarding late data. Which Dataflow feature should they configure?

A data analyst frequently queries a BigQuery table that contains an array of structs representing product purchases. The query below runs slowly:

    SELECT customer_id, COUNT(purchase) as total_purchases
    FROM sales, UNNEST(purchases) as purchase
    GROUP BY customer_id

What change would most improve query performance?

A Dataflow pipeline reads events from Pub/Sub and transforms them. Some events contain invalid product IDs that should be filtered out. The list of valid product IDs is stored in a frequently updated BigQuery table. What is the best approach to filter out invalid events?

A manufacturing company wants to detect anomalies in sensor data from thousands of IoT devices in real time. The data is streaming into Pub/Sub. The best solution should use a machine learning model served from AI Platform that scores sensor readings aggregated over 5-minute windows. Which pipeline design meets these requirements?

A company runs a nightly Dataproc batch job to process large log files. The job is idempotent and can tolerate node failures if restarted. Minimizing cost is critical. What is the most cost-effective cluster design?

A company wants to implement a data lake on Google Cloud to store raw sensor data (unstructured binary files) and allow data scientists to run SQL queries on processed data. They expect to store terabytes of data and have different access patterns. Which combination of GCP services best meets these requirements?

A data engineering team needs to build a data integration pipeline that involves connecting to multiple sources, performing data transformations with visual editing, and then running custom machine learning algorithms. The team has both data analysts and data scientists. Which approach is most suitable?

A gaming company uses Avro schemas for its streaming event data. They anticipate adding new optional fields to events over time. They need to ensure backward compatibility so that existing pipelines continue to work. Which strategy should they adopt?

A financial services company must comply with GDPR "right to be forgotten". They store customer transactions in BigQuery partitioned by date. When a user requests deletion, all their data must be removed within 48 hours. The deletion requests are received via a Pub/Sub topic. What is the most scalable and cost-effective approach?

You are designing a streaming Dataflow pipeline that processes high-throughput data. Which two features can help minimize cost? (Choose TWO.)

A payment processing company needs to detect fraudulent transactions in real time. The system must have sub-second latency for high-value transactions and use a machine learning model. Which two components should be part of the architecture? (Choose TWO.)

You are designing a streaming pipeline that must guarantee exactly-once processing. Which three services or features can help achieve this? (Choose THREE.)

What is the most likely cause of this error?

The query above runs slowly on the 10 TB table. Which optimization would most improve performance?

The push endpoint is returning 500 errors. What is the most likely cause?

A company is designing a streaming pipeline using Dataflow to process real-time clickstream data. The pipeline reads from Pub/Sub, performs user sessionization using Apache Beam's Session window, and writes to BigQuery. The team notices that the pipeline's lag is growing and the worker utilization is low. What is the most likely cause and recommended fix?

A company wants to ingest IoT sensor data from thousands of devices into BigQuery for near-real-time analytics. The data volume is approximately 10 GB per hour. Which combination of Google Cloud services should they use for a cost-effective and scalable solution?

A Dataflow streaming pipeline uses stateful transformations with per-key state and timers. After a deployment, the team observes that the pipeline is reprocessing events from the last 30 minutes every time it restarts. The pipeline's checkpoint is configured to persist every 10 seconds. Which change should be made to prevent unnecessary reprocessing?

A data team uses Cloud Dataproc to run nightly Spark jobs. The job volume has increased, and the cluster is often underutilized during the day. They want to reduce costs while ensuring jobs can scale when needed. Which strategy should they adopt?

A company processes CSV files that are uploaded to Cloud Storage by external partners. Each file is around 500 MB, and they need to be parsed and loaded into BigQuery. The processing must start as soon as the file arrives. What is the most efficient serverless architecture?

A company stores IoT sensor readings in BigQuery. The table is partitioned by day and clustered by sensor_id. Query performance has degraded as data grows; many queries filter by a date range and a single sensor_id. Which optimization should be applied first?

A data engineering team uses Cloud Data Fusion to build ETL pipelines. They have a pipeline that reads from Cloud SQL, transforms data using Wrangler, and writes to BigQuery. The pipeline fails intermittently with a 'connection timeout' error from Cloud SQL. What is the best way to handle this?

An organization wants to automate their batch data processing pipeline using Cloud Composer. The pipeline consists of multiple tasks: extract from Cloud Storage, transform with Dataflow, and load into BigQuery. Which Airflow operator should be used to run Dataflow jobs?

A Dataflow streaming pipeline reads from Pub/Sub, applies a ParDo that uses a side input from a BigQuery table (refreshed hourly), and writes to BigQuery. The side input is large and causes increased latency and worker OOM errors. Which design change solves this?

A data engineer is monitoring a Dataflow streaming pipeline and notices that the 'System Lag' metric is increasing. Which TWO actions should be taken to diagnose the issue?

A company is designing a data lake on Cloud Storage for analytics. They need to store data in various formats (Avro, Parquet, CSV) and enable efficient querying with BigQuery and Dataproc. Which THREE practices should they follow?

A company uses Pub/Sub to decouple services. They have a topic with two subscriptions: Subscription A is a push subscription that sends messages to a Cloud Function; Subscription B is a pull subscription used by a Dataflow job. They need to ensure that messages are processed in order for a specific device_id. Which TWO configurations should they apply?

A streaming Dataflow job is processing messages from Cloud Pub/Sub. The job is underutilizing resources and the throughput is lower than expected. Which parameter should be adjusted to increase parallelism?

A company stores IoT sensor data in BigQuery. Queries that filter on a timestamp column and a device_id column are slow even though the table is partitioned by day. What should the data engineer do to improve query performance?

A financial services company uses Cloud Pub/Sub with ordering keys to process transactions in order. Some messages are failing processing and getting stuck. The team wants to ensure that if a message fails, it can be reprocessed later without blocking subsequent messages. What should they implement?

A data engineer is running a Dataproc cluster for a batch ETL job that needs to process 10 TB of data. The job is memory-intensive. The cluster currently uses n1-standard-4 workers. Performance is poor. What is the most cost-effective change to improve performance?

A team is designing an event-driven data pipeline. They need to process messages from Cloud Pub/Sub, transform them, and write to BigQuery. The messages have variable volume and spikes. What is the best serverless compute option for this workload?

A Dataflow pipeline reads from Cloud Pub/Sub and writes to Cloud Storage. The pipeline needs to guarantee exactly-once processing despite worker failures. Which configuration ensures exactly-once semantics?

A data engineer needs to automatically move objects in a Cloud Storage bucket to Nearline storage after 7 days and delete them after 30 days. Which configuration should they use?
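As an illustration of what an age-based configuration for this scenario could look like, the sketch below builds a Cloud Storage lifecycle document in Python and prints it as JSON. The helper name `lifecycle_config` is hypothetical; the 7-day and 30-day values come from the question, and the `rule` / `action` / `condition` layout follows the JSON lifecycle format as I understand it, so check it against the current Cloud Storage documentation before applying it to a bucket.

```python
import json


def lifecycle_config(nearline_after_days=7, delete_after_days=30):
    """Build a lifecycle document with two age-based rules."""
    return {
        "rule": [
            {
                # Move objects to the Nearline storage class once they
                # are older than `nearline_after_days`.
                "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
                "condition": {"age": nearline_after_days},
            },
            {
                # Delete objects once they are older than `delete_after_days`.
                "action": {"type": "Delete"},
                "condition": {"age": delete_after_days},
            },
        ]
    }


config = lifecycle_config()
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and attached to a bucket with the tooling of your choice; that step is left out here because it depends on the environment.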

A BigQuery table contains streaming data from Cloud Pub/Sub. The table is partitioned by ingestion time. A user runs a query that accesses data from the last 5 minutes and gets correct results. After 90 minutes, the user runs the same query again but notices that some rows are missing. What is the most likely cause?

In Cloud Composer, a DAG has two tasks: task_A (runs an Apache Spark job on Dataproc) and task_B (loads data from Cloud Storage to BigQuery). task_B must start after task_A completes. The DAG is scheduled to run hourly. Sometimes task_B starts before task_A finishes because task_A's Dataproc job appears to complete in the Airflow metadata but the data is not yet available. What is the best way to ensure task_B only runs after the data is fully written?

Which TWO roles are required to allow a service account to run a Dataflow job and write results to BigQuery? (Choose two.)

A data engineer is designing a BigQuery table for time-series data that will be queried frequently by time range and also by a customer_id. Which TWO design decisions will improve query performance and manage costs? (Choose two.)

A company uses Cloud Pub/Sub with pull subscriptions to process orders. The application requires at-least-once delivery and the ability to process orders in order per customer_id. Which THREE features should they configure? (Choose three.)

A company uses Cloud Dataflow to process streaming data. They notice that the pipeline's throughput is lower than expected and the system is experiencing high latency. What is the most likely cause?

A data engineer needs to design a data processing system that ingests large volumes of sensor data from IoT devices. The data should be stored in a schema-less format and allow for real-time analytics. Which Google Cloud service is most appropriate?

A company is migrating their on-premises Apache Spark jobs to Dataproc. They want to minimize code changes and take advantage of serverless infrastructure. Which Dataproc feature should they use?

A data pipeline using Cloud Pub/Sub and Cloud Dataflow is experiencing duplicate messages. The source system publishes messages at least once. What Dataflow technique ensures exactly-once processing?
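For background on why at-least-once publishing produces duplicates and how a consumer can tolerate them, here is a minimal sketch of deduplication on a stable identifier. It is plain Python, not Dataflow code; the `msg_id` field, the sample payloads, and the in-memory `seen` set are assumptions made for illustration (a real pipeline would need bounded, durable state).

```python
def deduplicate(messages, seen=None):
    """Yield each (msg_id, payload) pair once, skipping repeated ids."""
    seen = set() if seen is None else seen
    for msg_id, payload in messages:
        if msg_id in seen:
            # The same message was delivered again: ignore it.
            continue
        seen.add(msg_id)
        yield msg_id, payload


# Hypothetical stream in which message "a1" is delivered twice.
incoming = [("a1", "order created"), ("b2", "order paid"), ("a1", "order created")]
unique = list(deduplicate(incoming))
print(unique)
```

Keying on an identifier assigned by the publisher, rather than on arrival order, is what makes the redelivered copy recognisable.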

A company processes financial transactions using Cloud Dataflow. They need to ensure that late-arriving data is handled correctly for fraud detection. The pipeline uses event time processing. Which approach should they use to handle late data?

128

A data engineer is designing a batch ETL pipeline using Cloud Composer and Dataflow. The pipeline must be self-healing and retry on failures. Which Composer feature should they configure?

129

A team is using BigQuery to analyze petabyte-scale data. They notice that queries are slow and expensive due to full table scans. They have already partitioned by date. What additional optimization should they implement?

130

A company needs to stream real-time user click events from a web application to BigQuery for analysis. Which Google Cloud architecture is most suitable?

131

A data pipeline reading from Cloud Storage and writing to BigQuery using Dataflow is experiencing high cost. The data is CSV and needs schema inference. What change reduces cost?

132

A data engineer is designing a streaming pipeline with Cloud Pub/Sub and Cloud Dataflow. They need to guarantee at-least-once delivery and handle occasional duplicates. Which TWO configurations should they implement?

133

A company uses Cloud Composer to orchestrate Dataproc and BigQuery jobs. They need to implement retry logic for transient failures. Which THREE features can help?

134

A data warehouse in BigQuery is experiencing performance issues. Which THREE techniques can improve performance without moving data to a different storage system?

135

A company runs a streaming data pipeline on Google Cloud using Cloud Pub/Sub, Cloud Dataflow, and BigQuery. The pipeline processes real-time sensor data for predictive maintenance. Recently, the Dataflow job's lag has increased from seconds to minutes, and the system shows backpressure. The pipeline uses fixed windows of 1 minute and writes results to BigQuery. The data volume has doubled. The team has already increased the number of workers. What should they do next?
Options:
A. Use session windows instead of fixed windows.
B. Enable Streaming Engine and use Upsert to BigQuery.
C. Decrease the window duration.
D. Use Cloud Storage as temporary sink.

136

A data engineer is responsible for a batch ETL pipeline that runs daily using Cloud Composer and Dataproc. The pipeline extracts data from Cloud SQL, transforms it with Spark, and loads to BigQuery. Last night, the pipeline failed because the Spark job ran out of memory. The team needs a solution that prevents future failures without manual intervention.
Options:
A. Use a larger machine type for Dataproc.
B. Enable Dataproc autoscaling and configure memory-based scaling.
C. Split the Spark job into multiple stages.
D. Use Cloud Functions to retry the job.

137

A company uses Cloud Dataflow to process financial transactions from Pub/Sub to BigQuery. The pipeline must ensure exactly-once semantics. Recently, they noticed duplicate rows in BigQuery. The source publishes with at-least-once delivery. The Dataflow pipeline uses idempotent writes. What is the most likely cause?
Options:
A. The pipeline uses GlobalWindows.
B. The pipeline has autoscaling enabled.
C. The pipeline uses file loads as a sink.
D. The pipeline's watermark is misconfigured.

138

A company needs to stream data from a fleet of IoT devices to BigQuery for near-real-time analytics. The data volume is unpredictable and can spike during certain events. Which Google Cloud service should be used as the ingestion point to handle variable throughput with minimal operational overhead?

139

A team runs a Dataflow streaming pipeline that reads from Pub/Sub, windows events by processing time, and writes to BigQuery. Some late-arriving events are being dropped. The requirement is to include all events that arrive within 10 minutes of the watermark. Which pipeline configuration should be used?

140

A company runs a batch data processing workload using Dataproc clusters that are auto-scaled based on YARN memory utilization. During peak times, jobs take much longer than expected. Analysis shows the cluster is not scaling up despite high YARN memory utilization. What is the most likely cause?

141

A company is designing a data processing system that must handle both batch and streaming workloads with unified pipeline code. Which two Google Cloud services are most suitable for implementing a unified batch and streaming pipeline? (Choose TWO.)

142

An organization is moving on-premises Hadoop workloads to Google Cloud. They need to minimize code changes and manage transient clusters for cost savings. Which two Google Cloud services should they consider? (Choose TWO.)

143

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

144

A startup is building a real-time dashboard that shows aggregated metrics from social media feeds. They expect up to 10,000 events per second. The data must be near-real-time (< 30 seconds latency) and stored in BigQuery for historical analysis. They have limited experience managing infrastructure. The CTO suggests using Apache Kafka on Compute Engine for ingestion. However, the data engineer recommends a fully managed solution. Which approach should the team adopt?

145

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

146

A financial services company uses a Dataflow streaming pipeline to process real-time stock trades. The pipeline reads from Pub/Sub, enriches with reference data from Cloud Bigtable, and writes to BigQuery. Recently, they noticed an increase in processing latency during market open hours. Investigation shows that the pipeline is data-skewed: a few stock symbols generate 90% of the traffic. The team wants to reduce latency without changing the pipeline structure. What should they do?

147

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?

148

A company has a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is failing with 'deadline exceeded' errors during peak hours. The team suspects that the pipeline cannot keep up with the incoming data rate. They also notice that although maxNumWorkers is set to 10, the pipeline only scales to 5 workers. What is the most likely cause of the inadequate scaling?

149

A healthcare company processes patient data using a Dataflow pipeline that reads from Cloud Storage, transforms data, and writes to BigQuery. They need to ensure that the processing is idempotent to handle failures and retries without duplicating records. The data arrives in daily batches and may be re-delivered if earlier processing failed. What approach should they take to guarantee exactly-once processing in BigQuery?

150

A company runs a Dataproc cluster with 10 worker nodes for a Spark streaming job that processes data from Pub/Sub (via Pub/Sub Lite) and writes to Cloud Storage. They observe that the job is producing many small files in Cloud Storage, leading to high costs and performance issues in downstream batch pipelines. The team wants to consolidate output files while maintaining low latency. What is the best solution?

151

A media company uses Cloud Dataflow to process video metadata from a Pub/Sub stream. The pipeline enriches metadata using a lookup table stored in Cloud Bigtable. Recently, they noticed increased latency and occasional 'Bigtable operation timeout' errors. The Bigtable instance has 3 nodes and the data is highly distributed. The Dataflow pipeline uses default settings. What is the most likely cause of the timeouts?

152

A company runs a production Dataflow streaming pipeline that reads from Pub/Sub, groups events by customer ID, and writes to BigQuery. The pipeline uses global windows with triggers. After a recent code change, the pipeline started generating duplicate events in BigQuery for the same customer ID. The previous version did not have duplicates. The team reviews the code and sees that the trigger was changed from 'afterProcessingTime' to 'afterWatermark'. What is the most likely reason for duplicates?

153

A company is designing a real-time clickstream analytics pipeline using Pub/Sub and Dataflow. The pipeline must handle late-arriving data (up to 1 hour) and ensure exactly-once processing. Which Dataflow feature should be configured to handle late data correctly?

154

A data team runs regular analytical queries on a BigQuery table that stores 2 years of sales data (approximately 10 TB). Queries frequently filter on a `sale_date` column and also group by `product_id`. To optimize cost and performance, which design approach is most effective?

155

A Dataflow streaming job is processing high-volume sensor data from thousands of IoT devices. The job uses global windows with a 10-minute processing time trigger. Recently, the job's CPU utilization has risen to nearly 100% and it is falling behind. Which action is most likely to reduce CPU load while maintaining data freshness?

156

A company is building a data lake on Cloud Storage for log analysis. Log files (CSV) arrive every 5 minutes from multiple sources. The files should be ingested into BigQuery for reporting within 15 minutes. Which approach best meets the requirements with minimal operational overhead?

157

A healthcare company stores patient records as JSON files in Cloud Storage for analysis. They want to design a data lake that enables querying the data with BigQuery while minimizing storage costs and maintaining data security. Which two actions should they take? (Choose two.)

158

A data engineer configures the above lifecycle rule on a Cloud Storage bucket that stores daily log files. After 60 days, they notice that files older than 30 days have been transitioned to Nearline, but files older than 90 days are still present. What is the most likely cause?

159

A large e-commerce company has migrated its on-premises Hadoop cluster to Google Cloud using Dataproc for batch processing. The cluster processes daily sales data from multiple sources, generates aggregated reports, and performs ad-hoc analysis. The migration is complete, but users report that jobs are running 30% slower than on-premises. The data is stored in Cloud Storage as Parquet files partitioned by date. The Dataproc cluster uses preemptible VMs for worker nodes, and the master node uses a standard VM. The jobs heavily rely on shuffling data between stages. The cluster's autoscaling is enabled with a minimum of 10 and a maximum of 50 workers. During job execution, CPU utilization on workers is low, but disk I/O is high, especially on local SSDs. The network utilization is moderate. The team suspects that the shuffle operation is causing the slowdown. Which action should the team take to improve job performance?

Practice all 159 Designing data processing systems questions

Other PDE exam domains

Building and operationalizing data processing systems

Operationalizing machine learning models

Ensuring solution quality

Frequently asked questions

What does the Designing data processing systems domain cover on the PDE exam?

The Designing data processing systems domain covers the key concepts tested in this area of the PDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains — no account required.

How many Designing data processing systems questions are in the PDE question bank?

The Courseiva PDE question bank contains 159 questions in the Designing data processing systems domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Designing data processing systems for PDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Designing data processing systems questions for PDE?

Yes — the session launcher on this page draws questions exclusively from the Designing data processing systems domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
