Practice PDE Designing Data Processing Systems questions with full explanations on every answer.
Designing Data Processing Systems — choose a session length
Click any question to see the full explanation and answer options, or start a focused practice session above.
1. A data engineer needs to design a stream processing pipeline that reads events from Pub/Sub, enriches them with data from a Cloud Storage file, and writes aggregated results to BigQuery. The pipeline must handle late-arriving events up to 1 hour. Which Dataflow feature should be used to manage late data?
2. A company uses Dataproc to run daily Spark ML jobs. The jobs run for 2 hours each day. The team wants to reduce costs without changing job characteristics. Which strategy is MOST cost-effective?
3. A financial services company streams trades into Pub/Sub and processes them with Dataflow. The pipeline must ensure exactly-once processing of each trade for regulatory compliance. However, Pub/Sub guarantees at-least-once delivery. Which combination of features should the Dataflow pipeline use to achieve exactly-once semantics?
4. A data engineer needs to create a BigQuery table that is partitioned by ingestion time and clustered by customer_id and transaction_date. They also want to limit access so that only users from a specific domain can query the table. Which approach should they use?
5. A startup needs a fully managed, serverless Spark service to run occasional data processing jobs without managing clusters. They want to pay only for the resources used during job execution. Which Google Cloud service should they use?
6. A company wants to use Cloud Data Fusion to build ETL pipelines. They need to connect to a legacy on-premises database using JDBC and also want to use prebuilt transforms from the Hub. Which two features should they use?
7. A company uses Pub/Sub with push subscriptions to deliver events to a Cloud Run service. Recently, the service has been returning HTTP 429 (Too Many Requests), causing messages to be retried and eventually sent to the dead letter topic. What is the MOST likely cause?
8. A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?
9. A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?
10. A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?
11. A company wants to use Dataprep to clean and transform raw CSV files stored in Cloud Storage before loading into BigQuery. The data quality checks show missing values and inconsistent date formats. Which Dataprep feature should they use to handle these issues?
12. A company needs a messaging service for event-driven applications that require low cost for high-throughput, but can tolerate occasional message loss. Which Pub/Sub product should they choose?
13. A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?
14. A company is migrating on-premises Hadoop Hive workloads to Google Cloud. They want to use Dataproc for Spark processing and require a managed Hive metastore that can be shared across multiple Dataproc clusters. Which TWO components should they use?
15. A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?
16. A company wants to design a data pipeline for real-time fraud detection. The system must process streaming financial transactions, enrich them with user profiles from a lookup table, and flag suspicious activities within seconds. Which architecture pattern would be MOST suitable?
17. You are designing a BigQuery data warehouse for a multi-tenant SaaS application. Each tenant's data must be isolated and queried only by that tenant. You need to minimise management overhead and allow tenants to be added dynamically. Which approach should you use?
18. You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?
19. A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?
20. You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?
21. You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?
22. Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?
23. You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?
24. You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?
25. A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?
26. You need to run a one-time data transformation job on a small CSV file (100 MB) using a visual, code-free interface. Which Google Cloud service is designed for this?
27. You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?
28. A company wants to build a real-time dashboard for monitoring application logs. The logs are ingested via Pub/Sub and must be processed with low latency (sub-second). You need to enrich the logs with user metadata from Cloud SQL and store the results in BigQuery for analysis. Which TWO services should be used for the stream processing? (Choose two.)
29. A data pipeline processes sensitive customer data. You need to ensure that only authorised users can query the data in BigQuery, and that the data is encrypted at rest and in transit. Which THREE steps should you take? (Choose three.)
30. You are designing a Dataflow pipeline for processing real-time clickstream data. The pipeline must group events into 30-second windows and handle late data up to 5 minutes. You want to output partial results every 10 seconds for low-latency monitoring. Which TWO configurations should you use? (Choose two.)
31. Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?
32. Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?
33. You are designing a data pipeline that processes streaming events with late-arriving data (up to 2 hours late). The pipeline must compute hourly aggregations and emit results as soon as possible, but must also accurately update results when late data arrives. You want to minimize overall processing cost. Which Dataflow windowing and trigger configuration should you use?
34. You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?
35. You are designing a batch data pipeline that runs daily to ingest data from an on-premises database into BigQuery. The ingestion volume is approximately 50 GB per day. The data must be available in BigQuery by 6 AM each day. The on-premises database supports change data capture (CDC) via logs. Which approach minimizes operational cost and complexity?
36. You have a BigQuery table that is partitioned by ingestion time and clustered on user_id. The table stores event logs and is queried frequently by user_id to analyze user behavior over the last 30 days. Queries are still scanning too many partitions. Which optimization should you apply first?
37. Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?
38. You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?
39. Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?
40. You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?
41. You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?
42. Your company is building a real-time anomaly detection system for financial transactions. The system must process streams of transactions and flag anomalies within seconds. The volume is moderate (5000 transactions per second). You want a fully managed solution that integrates with BigQuery for historical analysis. Which service should you use for stream processing?
43. Your organization is designing a data lake on Google Cloud using Cloud Storage. You need to choose a file format for storing raw data that supports schema evolution, is splittable for parallel processing, and is optimized for query performance in BigQuery. Which TWO formats meet these requirements? (Choose 2.)
44. Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)
45. Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)
46. A company needs to process streaming sensor data from millions of devices with sub-second latency, apply transformations, and write results to BigQuery for real-time dashboards. The data volume varies, and they want to avoid managing servers. Which service should they use?
47. A data engineer wants to create a BigQuery table that is partitioned by day and clustered by user_id and product_id. Which SQL statement should they use?
48. A company uses Cloud Pub/Sub to ingest events from multiple sources. They need to guarantee that each event is processed exactly once by downstream consumers. However, Pub/Sub guarantees at-least-once delivery. Which additional steps should they implement to achieve exactly-once processing?
49. A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?
50. Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?
51. A team wants to use Cloud Pub/Sub Lite for a high-throughput, low-cost messaging system. They need exactly-once delivery to subscribers. What should they know about Pub/Sub Lite's delivery guarantees?
52. A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?
53. A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?
54. Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?
55. A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?
56. A Dataproc cluster uses preemptible worker nodes to reduce costs. The cluster runs a long-running Spark job that occasionally experiences worker failures. How should the job be configured to handle preemptible worker failures gracefully?
57. A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?
58. A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?
59. A company is designing a data pipeline using the lambda architecture. They need to process both real-time streams and batch historical data. Which THREE components are essential for a lambda architecture on Google Cloud?
60. A company wants to use Dataproc Metastore to manage metadata for their Spark jobs. Which TWO benefits does Dataproc Metastore provide?
61. A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?
62. You are designing a BigQuery data warehouse for a retail company. Queries frequently filter on order_date and customer_id. To optimize query performance and cost, which table design should you use?
63. A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?
64. You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?
65. Your company uses Pub/Sub to ingest clickstream data. Messages must be processed in order for the same user_id. How should you configure the Pub/Sub subscription to guarantee ordering?
66. A Dataflow pipeline with multiple steps uses a side input from a slowly changing reference table stored in BigQuery. The side input is updated every hour. To avoid reprocessing the entire pipeline on each update, which approach should you use?
67. You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?
68. Your team wants to share a BigQuery dataset with another project while ensuring that users from that project can only query specific tables. Which BigQuery feature should you use?
69. A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?
70. You are building a real-time fraud detection system using Dataflow. Events from Pub/Sub need to be grouped by user_id within a 5-minute window to detect suspicious patterns. Some events may be delayed by up to 2 minutes. How should you configure the window and trigger to balance accuracy and latency?
71. You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?
72. A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?
73. You are designing a data pipeline for a financial services company that requires exactly-once processing semantics. Which TWO services or configurations provide exactly-once guarantees?
74. A media company processes video metadata using a Dataflow pipeline. They need to join two streaming sources: user activity (Pub/Sub) and video catalog updates (Pub/Sub). Which THREE transforms should be used in the pipeline?
75. You are designing a BigQuery data lake for a healthcare organization. The data includes patient records that must be access-controlled at the row level. Which TWO features should you use to meet this requirement?
76. A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?
77. A company has a BigQuery dataset containing sensitive customer data. They want to share a subset of this data with external partners, ensuring that partners can only see specific columns and rows. Which BigQuery feature should they use?
78. A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?
79. An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?
80. A company needs to process high-throughput streaming data with low latency. They are considering Cloud Pub/Sub for ingestion and Cloud Dataflow for processing. However, they are concerned about cost. Which alternative to Cloud Pub/Sub would reduce costs while still meeting the throughput requirements?
81. A data engineer is designing a pipeline that reads from Cloud Pub/Sub, aggregates events into 5-minute windows, and writes the results to BigQuery. The engineer wants to ensure that late-arriving data (up to 2 minutes late) is included in the correct window. Which Dataflow feature should they configure?
82. A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?
83. A company has a BigQuery table that is partitioned by ingestion time and clustered by the 'customer_id' column. They notice that queries filtering on 'customer_id' are not benefiting from clustering as expected. What is the most likely cause?
84. An organization is implementing a data lake on Google Cloud using Cloud Storage. They need to process both batch and streaming data with a unified pipeline. The team has experience with Apache Beam. Which architecture should they use to minimize operational overhead?
85. A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but are malformed according to the schema. What is the most likely reason for the messages being dead-lettered?
86. A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?
87. An engineer needs to create a Pub/Sub subscription that sends messages to an HTTPS endpoint. The endpoint must be able to acknowledge messages individually. Which type of subscription should they use?
88. A company is migrating their on-premises Hadoop workloads to Google Cloud. They want to use Dataproc for data processing and need to minimize costs for non-critical batch jobs that can tolerate interruptions. Which TWO configurations should they use?
89. A data engineering team is designing a streaming pipeline using Cloud Dataflow. They need to join two unbounded PCollections based on a common key. The join must handle late data up to 10 minutes. Which THREE components should they use?
90. An organization is using BigQuery for analytics. They have a table that is 500 GB and is frequently queried by 'date' and 'region'. They want to optimize query performance and reduce costs. Which TWO actions should they take?
91. A data pipeline ingests streaming events into Pub/Sub and needs to join them with a slowly updating reference table (few thousand rows) from a Cloud Storage CSV file. The pipeline runs on Dataflow with Apache Beam. Which approach is most cost-effective and operationally simple?
92. A Dataflow pipeline using Apache Beam processes unbounded data from Pub/Sub. The pipeline uses fixed windows of 1 minute and a trigger that fires early every 30 seconds and at watermark. The team observes that the output pane for window [10:00:00, 10:01:00) contains events with timestamps from 10:00:15 and 10:00:45, but also an event with timestamp 10:02:00. What is the most likely cause?
93. A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?
94. A company runs Apache Spark jobs on Dataproc. They want to reduce costs by using preemptible instances for worker nodes. The jobs are fault-tolerant and can handle occasional node loss. However, the cluster must remain available for interactive querying during business hours. Which Dataproc cluster configuration meets these requirements?
95. A data engineer needs to design a streaming pipeline that ingests events from multiple sources, enriches them with a lookup table stored in BigQuery (updated every hour), and writes the results to a BigQuery table for real-time dashboards. The pipeline must handle late-arriving data up to 1 hour. Which Dataflow feature should be configured to manage late data?
96. Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?
97. A company is using Pub/Sub to ingest clickstream events. They need to ensure that events are delivered to a subscriber at least once, but duplicates can be tolerated. They also need to filter events by type before processing. Which subscription configuration should be used?
98. A data pipeline uses Cloud Data Fusion to perform ETL jobs. The pipeline reads from BigQuery, transforms data using Wrangler, and writes to Cloud Storage. The team notices that the pipeline runs slower than expected. They suspect the Data Fusion instance is under-provisioned. Which action should be taken to improve performance?
99. A company wants to use Pub/Sub Lite to reduce costs for a high-throughput, low-latency streaming pipeline. However, they have a requirement to retain messages for up to 7 days for reprocessing. Which Pub/Sub Lite configuration supports this retention?
100. A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?
101. A data team is migrating an on-premises Hadoop cluster to Dataproc. The cluster runs a mix of long-running services (Hive, HBase) and transient Spark jobs. They want to minimize cost while maintaining performance. Which TWO strategies should they implement?
102. A company uses Pub/Sub to ingest events from multiple sources. They need to ensure that messages from a specific source are processed in order (per source partition). They also need to deduplicate messages. Which TWO features should they use?
103. A data engineer is designing a streaming pipeline using Dataflow with Apache Beam. The pipeline reads from Pub/Sub, performs a stateful transformation (e.g., session windowing), and writes to BigQuery. The pipeline must handle late data and ensure exactly-once semantics. Which THREE configurations are required?
104. A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?
105. A company uses Cloud Data Fusion for ETL pipelines. They need to transform sensitive data (PII) by masking certain columns before writing to BigQuery. They also need to ensure the pipeline can be monitored and restarted from failure points. Which THREE features should they use?
106. A company is designing a data pipeline that ingests real-time events from IoT devices and must handle late-arriving data (up to 1 hour late) while minimizing duplicate processing. They plan to use Dataflow with Pub/Sub. Which combination of windowing and trigger settings should they use?
107. A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?
108. A data engineering team is designing a streaming pipeline using Dataflow to process real-time clickstream data from a website. They need to aggregate user session metrics (e.g., number of sessions, average duration) every 5 minutes. The pipeline must handle late-arriving events (up to 2 minutes late) and ensure exactly-once processing semantics. Which TWO of the following should they configure? (Choose two.)
109. A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)
110. A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)
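Several of the questions above (for example question 92) rely on the half-open interval notation for fixed windows, such as [10:00:00, 10:01:00). As a study aid, here is a minimal sketch in plain Python of how an event time maps to a 1-minute fixed window. It does not use the Apache Beam API; the function name `fixed_window` and the calendar date are assumptions made for illustration, and only the times of day come from question 92.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def fixed_window(event_time, size=timedelta(minutes=1)):
    """Return the [start, end) fixed window that contains event_time."""
    # Distance from the start of the window the event falls into.
    offset = (event_time - EPOCH) % size
    start = event_time - offset
    return start, start + size


# Hypothetical date; the three times of day are the ones quoted in question 92.
day = datetime(2024, 1, 1)
events = [
    day.replace(hour=10, minute=0, second=15),
    day.replace(hour=10, minute=0, second=45),
    day.replace(hour=10, minute=2, second=0),
]

for ts in events:
    start, end = fixed_window(ts)
    print(ts.time(), "->", start.time(), "to", end.time())
```

Under this purely event-time mapping, the first two events share one window and the third lands in a later one, which is why the observation described in question 92 calls for an explanation.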
The Designing Data Processing Systems domain covers the key concepts tested in this area of the PDE exam blueprint published by Google Cloud. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains — no account required.
The Courseiva PDE question bank contains 110 questions in the Designing Data Processing Systems domain. Click any question to see the full explanation and answer breakdown.
Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.
Yes — the session launcher on this page draws questions exclusively from the Designing Data Processing Systems domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.