How many Designing Data Processing Systems questions are on the PDE exam?

The Designing Data Processing Systems domain is one of the weighted domains on the PDE exam. The Courseiva question bank has 110 practice questions for this domain.

Free PDE Designing Data Processing Systems Practice Questions (2026)

Q: How can I practice Designing Data Processing Systems questions for PDE?

Click any of the 110 questions listed on this page to see the full question and explanation, or use the session launcher to start a focused practice session of 10, 20, 30 or 50 questions drawn only from the Designing Data Processing Systems domain.

All PDE Designing Data Processing Systems questions (110)

Click any question to see the full explanation and answer options, or start a focused practice session above.

A data engineer needs to design a stream processing pipeline that reads events from Pub/Sub, enriches them with data from a Cloud Storage file, and writes aggregated results to BigQuery. The pipeline must handle late-arriving events up to 1 hour. Which Dataflow feature should be used to manage late data?

A company uses Dataproc to run daily Spark ML jobs. The jobs run for 2 hours each day. The team wants to reduce costs without changing job characteristics. Which strategy is MOST cost-effective?

A financial services company streams trades into Pub/Sub and processes them with Dataflow. The pipeline must ensure exactly-once processing of each trade for regulatory compliance. However, Pub/Sub guarantees at-least-once delivery. Which combination of features should the Dataflow pipeline use to achieve exactly-once semantics?

A data engineer needs to create a BigQuery table that is partitioned by ingestion time and clustered by customer_id and transaction_date. They also want to limit access so that only users from a specific domain can query the table. Which approach should they use?

A startup needs a fully managed, serverless Spark service to run occasional data processing jobs without managing clusters. They want to pay only for the resources used during job execution. Which Google Cloud service should they use?

A company wants to use Cloud Data Fusion to build ETL pipelines. They need to connect to a legacy on-premises database using JDBC and also want to use prebuilt transforms from the Hub. Which two features should they use?

A company uses Pub/Sub with push subscriptions to deliver events to a Cloud Run service. Recently, the service has been returning HTTP 429 (Too Many Requests), causing messages to be retried and eventually sent to the dead letter topic. What is the MOST likely cause?

A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?

A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?

A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?

A company wants to use Dataprep to clean and transform raw CSV files stored in Cloud Storage before loading into BigQuery. The data quality checks show missing values and inconsistent date formats. Which Dataprep feature should they use to handle these issues?

A company needs a messaging service for event-driven applications that require low cost for high-throughput, but can tolerate occasional message loss. Which Pub/Sub product should they choose?

A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?

A company is migrating on-premises Hadoop Hive workloads to Google Cloud. They want to use Dataproc for Spark processing and require a managed Hive metastore that can be shared across multiple Dataproc clusters. Which TWO components should they use?

A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?

A company wants to design a data pipeline for real-time fraud detection. The system must process streaming financial transactions, enrich them with user profiles from a lookup table, and flag suspicious activities within seconds. Which architecture pattern would be MOST suitable?

You are designing a BigQuery data warehouse for a multi-tenant SaaS application. Each tenant's data must be isolated and queried only by that tenant. You need to minimise management overhead and allow tenants to be added dynamically. Which approach should you use?

You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?

A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?

You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?

Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?

You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?

You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?

A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?

You need to run a one-time data transformation job on a small CSV file (100 MB) using a visual, code-free interface. Which Google Cloud service is designed for this?

You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?

A company wants to build a real-time dashboard for monitoring application logs. The logs are ingested via Pub/Sub and must be processed with low latency (sub-second). You need to enrich the logs with user metadata from Cloud SQL and store the results in BigQuery for analysis. Which TWO services should be used for the stream processing? (Choose two.)

A data pipeline processes sensitive customer data. You need to ensure that only authorised users can query the data in BigQuery, and that the data is encrypted at rest and in transit. Which THREE steps should you take? (Choose three.)

You are designing a Dataflow pipeline for processing real-time clickstream data. The pipeline must group events into 30-second windows and handle late data up to 5 minutes. You want to output partial results every 10 seconds for low-latency monitoring. Which TWO configurations should you use? (Choose two.)

Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?

Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?

You are designing a data pipeline that processes streaming events with late-arriving data (up to 2 hours late). The pipeline must compute hourly aggregations and emit results as soon as possible, but must also accurately update results when late data arrives. You want to minimize overall processing cost. Which Dataflow windowing and trigger configuration should you use?

You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?

You are designing a batch data pipeline that runs daily to ingest data from an on-premises database into BigQuery. The ingestion volume is approximately 50 GB per day. The data must be available in BigQuery by 6 AM each day. The on-premises database supports change data capture (CDC) via logs. Which approach minimizes operational cost and complexity?

You have a BigQuery table that is partitioned by ingestion time and clustered on user_id. The table stores event logs and is queried frequently by user_id to analyze user behavior over the last 30 days. Queries are still scanning too many partitions. Which optimization should you apply first?

Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?

You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?

Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?

You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?

You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?

Your company is building a real-time anomaly detection system for financial transactions. The system must process streams of transactions and flag anomalies within seconds. The volume is moderate (5000 transactions per second). You want a fully managed solution that integrates with BigQuery for historical analysis. Which service should you use for stream processing?

Your organization is designing a data lake on Google Cloud using Cloud Storage. You need to choose a file format for storing raw data that supports schema evolution, is splittable for parallel processing, and is optimized for query performance in BigQuery. Which TWO formats meet these requirements? (Choose 2.)

Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)

Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)

A company needs to process streaming sensor data from millions of devices with sub-second latency, apply transformations, and write results to BigQuery for real-time dashboards. The data volume varies, and they want to avoid managing servers. Which service should they use?

A data engineer wants to create a BigQuery table that is partitioned by day and clustered by user_id and product_id. Which SQL statement should they use?

A company uses Cloud Pub/Sub to ingest events from multiple sources. They need to guarantee that each event is processed exactly once by downstream consumers. However, Pub/Sub guarantees at-least-once delivery. Which additional steps should they implement to achieve exactly-once processing?

A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?

Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?

A team wants to use Cloud Pub/Sub Lite for a high-throughput, low-cost messaging system. They need exactly-once delivery to subscribers. What should they know about Pub/Sub Lite's delivery guarantees?

A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?

A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?

Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?

A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?

A Dataproc cluster uses preemptible worker nodes to reduce costs. The cluster runs a long-running Spark job that occasionally experiences worker failures. How should the job be configured to handle preemptible worker failures gracefully?

A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?

A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?

A company is designing a data pipeline using the lambda architecture. They need to process both real-time streams and batch historical data. Which THREE components are essential for a lambda architecture on Google Cloud?

A company wants to use Dataproc Metastore to manage metadata for their Spark jobs. Which TWO benefits does Dataproc Metastore provide?

A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?

You are designing a BigQuery data warehouse for a retail company. Queries frequently filter on order_date and customer_id. To optimize query performance and cost, which table design should you use?

A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?

You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?

Your company uses Pub/Sub to ingest clickstream data. Messages must be processed in order for the same user_id. How should you configure the Pub/Sub subscription to guarantee ordering?

A Dataflow pipeline with multiple steps uses a side input from a slowly changing reference table stored in BigQuery. The side input is updated every hour. To avoid reprocessing the entire pipeline on each update, which approach should you use?

You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?

Your team wants to share a BigQuery dataset with another project while ensuring that users from that project can only query specific tables. Which BigQuery feature should you use?

A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?

You are building a real-time fraud detection system using Dataflow. Events from Pub/Sub need to be grouped by user_id within a 5-minute window to detect suspicious patterns. Some events may be delayed by up to 2 minutes. How should you configure the window and trigger to balance accuracy and latency?

You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?

A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?

You are designing a data pipeline for a financial services company that requires exactly-once processing semantics. Which TWO services or configurations provide exactly-once guarantees?

A media company processes video metadata using a Dataflow pipeline. They need to join two streaming sources: user activity (Pub/Sub) and video catalog updates (Pub/Sub). Which THREE transforms should be used in the pipeline?

You are designing a BigQuery data lake for a healthcare organization. The data includes patient records that must be access-controlled at the row level. Which TWO features should you use to meet this requirement?

A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?

A company has a BigQuery dataset containing sensitive customer data. They want to share a subset of this data with external partners, ensuring that partners can only see specific columns and rows. Which BigQuery feature should they use?

A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?

An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?

A company needs to process high-throughput streaming data with low latency. They are considering Cloud Pub/Sub for ingestion and Cloud Dataflow for processing. However, they are concerned about cost. Which alternative to Cloud Pub/Sub would reduce costs while still meeting the throughput requirements?

A data engineer is designing a pipeline that reads from Cloud Pub/Sub, aggregates events into 5-minute windows, and writes the results to BigQuery. The engineer wants to ensure that late-arriving data (up to 2 minutes late) is included in the correct window. Which Dataflow feature should they configure?

A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?

A company has a BigQuery table that is partitioned by ingestion time and clustered by the 'customer_id' column. They notice that queries filtering on 'customer_id' are not benefiting from clustering as expected. What is the most likely cause?

An organization is implementing a data lake on Google Cloud using Cloud Storage. They need to process both batch and streaming data with a unified pipeline. The team has experience with Apache Beam. Which architecture should they use to minimize operational overhead?

A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but are malformed according to the schema. What is the most likely reason for the messages being dead-lettered?

A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?

An engineer needs to create a Pub/Sub subscription that sends messages to an HTTPS endpoint. The endpoint must be able to acknowledge messages individually. Which type of subscription should they use?

A company is migrating their on-premises Hadoop workloads to Google Cloud. They want to use Dataproc for data processing and need to minimize costs for non-critical batch jobs that can tolerate interruptions. Which TWO configurations should they use?

A data engineering team is designing a streaming pipeline using Cloud Dataflow. They need to join two unbounded PCollections based on a common key. The join must handle late data up to 10 minutes. Which THREE components should they use?

An organization is using BigQuery for analytics. They have a table that is 500 GB and is frequently queried by 'date' and 'region'. They want to optimize query performance and reduce costs. Which TWO actions should they take?

A data pipeline ingests streaming events into Pub/Sub and needs to join them with a slowly updating reference table (few thousand rows) from a Cloud Storage CSV file. The pipeline runs on Dataflow with Apache Beam. Which approach is most cost-effective and operationally simple?

A Dataflow pipeline using Apache Beam processes unbounded data from Pub/Sub. The pipeline uses fixed windows of 1 minute and a trigger that fires early every 30 seconds and at watermark. The team observes that the output pane for window [10:00:00, 10:01:00) contains events with timestamps from 10:00:15 and 10:00:45, but also an event with timestamp 10:02:00. What is the most likely cause?

A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?

A company runs Apache Spark jobs on Dataproc. They want to reduce costs by using preemptible instances for worker nodes. The jobs are fault-tolerant and can handle occasional node loss. However, the cluster must remain available for interactive querying during business hours. Which Dataproc cluster configuration meets these requirements?

A data engineer needs to design a streaming pipeline that ingests events from multiple sources, enriches them with a lookup table stored in BigQuery (updated every hour), and writes the results to a BigQuery table for real-time dashboards. The pipeline must handle late-arriving data up to 1 hour. Which Dataflow feature should be configured to manage late data?

Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?

A company is using Pub/Sub to ingest clickstream events. They need to ensure that events are delivered to a subscriber at least once, but duplicates can be tolerated. They also need to filter events by type before processing. Which subscription configuration should be used?

A data pipeline uses Cloud Data Fusion to perform ETL jobs. The pipeline reads from BigQuery, transforms data using Wrangler, and writes to Cloud Storage. The team notices that the pipeline runs slower than expected. They suspect the Data Fusion instance is under-provisioned. Which action should be taken to improve performance?

A company wants to use Pub/Sub Lite to reduce costs for a high-throughput, low-latency streaming pipeline. However, they have a requirement to retain messages for up to 7 days for reprocessing. Which Pub/Sub Lite configuration supports this retention?

A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?

A data team is migrating an on-premises Hadoop cluster to Dataproc. The cluster runs a mix of long-running services (Hive, HBase) and transient Spark jobs. They want to minimize cost while maintaining performance. Which TWO strategies should they implement?

A company uses Pub/Sub to ingest events from multiple sources. They need to ensure that messages from a specific source are processed in order (per source partition). They also need to deduplicate messages. Which TWO features should they use?

A data engineer is designing a streaming pipeline using Dataflow with Apache Beam. The pipeline reads from Pub/Sub, performs a stateful transformation (e.g., session windowing), and writes to BigQuery. The pipeline must handle late data and ensure exactly-once semantics. Which THREE configurations are required?

A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?

A company uses Cloud Data Fusion for ETL pipelines. They need to transform sensitive data (PII) by masking certain columns before writing to BigQuery. They also need to ensure the pipeline can be monitored and restarted from failure points. Which THREE features should they use?

A company is designing a data pipeline that ingests real-time events from IoT devices and must handle late-arriving data (up to 1 hour late) while minimizing duplicate processing. They plan to use Dataflow with Pub/Sub. Which combination of windowing and trigger settings should they use?

A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?

A data engineering team is designing a streaming pipeline using Dataflow to process real-time clickstream data from a website. They need to aggregate user session metrics (e.g., number of sessions, average duration) every 5 minutes. The pipeline must handle late-arriving events (up to 2 minutes late) and ensure exactly-once processing semantics. Which TWO of the following should they configure? (Choose two.)

A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)

A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)
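
Several of the questions above depend on how an event timestamp maps to fixed or sliding windows, and on how long a window keeps accepting late data. The following is a minimal plain-Python sketch of that arithmetic only. It is not Apache Beam or Dataflow code, the function names are hypothetical, timestamps are assumed to be integer seconds, and the lateness check is a simplified model of watermark behaviour.

```python
def fixed_window(ts, size):
    """Return the [start, end) fixed window of length `size` containing `ts`."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, period):
    """Return every [start, end) sliding window containing `ts`.

    Windows are `size` seconds long and a new one starts every `period` seconds.
    """
    windows = []
    start = ts - (ts % period)      # latest window start that is <= ts
    while start > ts - size:        # window still covers ts (end is exclusive)
        windows.append((start, start + size))
        start -= period
    return sorted(windows)


def accepts_late_event(window_end, watermark, allowed_lateness):
    """Simplified model (an assumption for illustration): a window accepts
    data until the watermark reaches window_end + allowed_lateness."""
    return watermark < window_end + allowed_lateness


if __name__ == "__main__":
    # 1-minute fixed windows: an event at t=130s falls in [120, 180).
    print(fixed_window(130, 60))
    # 5-minute windows starting every minute: each event belongs to 5 windows.
    print(sliding_windows(330, 300, 60))
    # 10-minute window ending at t=600s with 1 hour of allowed lateness.
    print(accepts_late_event(600, 600 + 1800, 3600))
```

In this model, a sliding window of size 5 minutes with a 1-minute period places each event in `size / period = 5` overlapping windows, which is why sliding windows cost more to compute than fixed windows of the same length.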

Practice all 110 Designing Data Processing Systems questions

Other PDE exam domains

Ingesting and Processing the Data

Storing the Data

Preparing and Using Data for Analysis

Maintaining and Automating Data Workloads

Building and operationalizing data processing systems

Operationalizing machine learning models

Ensuring solution quality

Frequently asked questions

What does the Designing Data Processing Systems domain cover on the PDE exam?

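The Designing Data Processing Systems domain covers the design topics in this area of the PDE exam blueprint published by Google Cloud. Judging by the questions listed on this page, these include Dataflow windowing and late-data handling, Pub/Sub delivery guarantees, BigQuery partitioning, clustering and access control, Dataproc cluster and cost choices, and Cloud Data Fusion pipelines. Courseiva provides free domain-focused practice, mock exams, missed-question review, and readiness tracking across all PDE domains, with no account required.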

How many Designing Data Processing Systems questions are in the PDE question bank?

The Courseiva PDE question bank contains 110 questions in the Designing Data Processing Systems domain. Click any question to see the full explanation and answer breakdown.

What is the best way to practice Designing Data Processing Systems for PDE?

Start with a 10-question focused session to identify your baseline accuracy in this domain. Read every explanation — even for questions you answer correctly — to understand the reasoning. Once you score consistently above 80%, move to a 20–30 question session to confirm depth before moving to the next domain.

Can I practice only Designing Data Processing Systems questions for PDE?

Yes — the session launcher on this page draws questions exclusively from the Designing Data Processing Systems domain. Choose 10, 20, 30, or 50 questions for a focused session, or click individual questions to review them one by one.
