PDE Designing Data Processing Systems — All Questions With Answers

Question 1 (easy, multiple choice)

A data engineer needs to design a stream processing pipeline that reads events from Pub/Sub, enriches them with data from a Cloud Storage file, and writes aggregated results to BigQuery. The pipeline must handle late-arriving events up to 1 hour. Which Dataflow feature should be used to manage late data?

Question 2 (medium, multiple choice)

A company uses Dataproc to run daily Spark ML jobs. The jobs run for 2 hours each day. The team wants to reduce costs without changing job characteristics. Which strategy is MOST cost-effective?

Question 3 (hard, multiple choice)

A financial services company streams trades into Pub/Sub and processes them with Dataflow. The pipeline must ensure exactly-once processing of each trade for regulatory compliance. However, Pub/Sub guarantees at-least-once delivery. Which combination of features should the Dataflow pipeline use to achieve exactly-once semantics?

Question 4 (medium, multiple choice)

A data engineer needs to create a BigQuery table that is partitioned by ingestion time and clustered by customer_id and transaction_date. They also want to limit access so that only users from a specific domain can query the table. Which approach should they use?

Question 5 (easy, multiple choice)

A startup needs a fully managed, serverless Spark service to run occasional data processing jobs without managing clusters. They want to pay only for the resources used during job execution. Which Google Cloud service should they use?

Question 6 (medium, multiple choice)

A company wants to use Cloud Data Fusion to build ETL pipelines. They need to connect to a legacy on-premises database using JDBC and also want to use prebuilt transforms from the Hub. Which two features should they use?

Question 7 (hard, multiple choice)

A company uses Pub/Sub with push subscriptions to deliver events to a Cloud Run service. Recently, the service has been returning HTTP 429 (Too Many Requests), causing messages to be retried and eventually sent to the dead letter topic. What is the MOST likely cause?

Question 8 (easy, multiple choice)

A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?
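
For intuition, the scenario above (5-minute fixed windows, then an average per key) can be sketched in plain Python without any Beam dependency. This is an illustration only, not Dataflow code: the event tuple shape `(key, event_ts_seconds, value)` and the function names are assumptions made for the example. In a real pipeline the same effect comes from applying a per-key combining transform after the windowing step.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute fixed windows, as in the question


def window_start(event_ts, size=WINDOW_SECONDS):
    """Return the start (epoch seconds) of the fixed window containing event_ts."""
    return event_ts - (event_ts % size)


def mean_per_key_and_window(events):
    """Average values per (key, window). events: (key, event_ts_seconds, value)."""
    sums = defaultdict(lambda: [0.0, 0])  # (key, window_start) -> [sum, count]
    for key, ts, value in events:
        acc = sums[(key, window_start(ts))]
        acc[0] += value
        acc[1] += 1
    return {kw: total / count for kw, (total, count) in sums.items()}


# Hypothetical sample data: key "a" has two events in window 0 and one in window 300.
events = [("a", 10, 2.0), ("a", 200, 4.0), ("a", 310, 10.0), ("b", 20, 1.0)]
result = mean_per_key_and_window(events)
```

Each element lands in exactly one window, and the mean is computed independently for every key and window pair.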

Question 9 (medium, multiple choice)

A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?

Question 10 (hard, multiple choice)

A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?

Question 11 (medium, multiple choice)

A company wants to use Dataprep to clean and transform raw CSV files stored in Cloud Storage before loading into BigQuery. The data quality checks show missing values and inconsistent date formats. Which Dataprep feature should they use to handle these issues?

Question 12 (easy, multiple choice)

A company needs a low-cost, high-throughput messaging service for event-driven applications that can tolerate occasional message loss. Which Pub/Sub product should they choose?

Question 13 (medium, multi-select)

A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?

Question 14 (hard, multi-select)

A company is migrating on-premises Hadoop Hive workloads to Google Cloud. They want to use Dataproc for Spark processing and require a managed Hive metastore that can be shared across multiple Dataproc clusters. Which TWO components should they use?

Question 15 (medium, multi-select)

A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?

Question 16 (medium, multiple choice)

A company wants to design a data pipeline for real-time fraud detection. The system must process streaming financial transactions, enrich them with user profiles from a lookup table, and flag suspicious activities within seconds. Which architecture pattern would be MOST suitable?

Question 17 (hard, multiple choice)

You are designing a BigQuery data warehouse for a multi-tenant SaaS application. Each tenant's data must be isolated and queried only by that tenant. You need to minimise management overhead and allow tenants to be added dynamically. Which approach should you use?

Question 18 (easy, multiple choice)

You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?

Question 19 (medium, multiple choice)

A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?

Question 20 (hard, multiple choice)

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?

Question 21 (medium, multiple choice)

You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?

Question 22 (easy, multiple choice)

Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?

Question 23 (medium, multiple choice)

You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?

Question 24 (medium, multiple choice)

You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?
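
To make "the last 5 minutes, updating every minute" concrete, here is a small plain-Python sketch of overlapping window assignment. It is not Beam code, and the function name and the use of epoch seconds are assumptions for illustration. It shows that with a 300-second window advancing every 60 seconds, each reading belongs to several windows at once.

```python
def sliding_windows(event_ts, size=300, period=60):
    """Return the [start, end) pairs of every sliding window containing event_ts."""
    # The latest window that can contain the event starts at the most recent
    # multiple of `period` at or before the event timestamp.
    last_start = event_ts - (event_ts % period)
    # Walk backwards one period at a time while the window still covers the event.
    starts = range(last_start, event_ts - size, -period)
    return [(s, s + size) for s in sorted(starts)]


# A reading at t=425s falls into size/period = 5 overlapping windows.
windows = sliding_windows(425)
```

With these parameters every event contributes to `size / period` windows, which is why overlapping windows cost more to compute than non-overlapping ones of the same length.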

Question 25 (hard, multiple choice)

A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?

Question 26 (easy, multiple choice)

You need to run a one-time data transformation job on a small CSV file (100 MB) using a visual, code-free interface. Which Google Cloud service is designed for this?

Question 27 (medium, multiple choice)

You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?

Question 28 (medium, multi-select)

A company wants to build a real-time dashboard for monitoring application logs. The logs are ingested via Pub/Sub and must be processed with low latency (sub-second). You need to enrich the logs with user metadata from Cloud SQL and store the results in BigQuery for analysis. Which TWO services should be used for the stream processing? (Choose two.)

Question 29 (hard, multi-select)

A data pipeline processes sensitive customer data. You need to ensure that only authorised users can query the data in BigQuery, and that the data is encrypted at rest and in transit. Which THREE steps should you take? (Choose three.)

Question 30 (medium, multi-select)

You are designing a Dataflow pipeline for processing real-time clickstream data. The pipeline must group events into 30-second windows and handle late data up to 5 minutes. You want to output partial results every 10 seconds for low-latency monitoring. Which TWO configurations should you use? (Choose two.)

Question 31 (easy, multiple choice)

Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?

Question 32 (medium, multiple choice)

Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?

Question 33 (hard, multiple choice)

You are designing a data pipeline that processes streaming events with late-arriving data (up to 2 hours late). The pipeline must compute hourly aggregations and emit results as soon as possible, but must also accurately update results when late data arrives. You want to minimize overall processing cost. Which Dataflow windowing and trigger configuration should you use?

Question 34 (easy, multiple choice)

You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?

Question 35 (medium, multiple choice)

You are designing a batch data pipeline that runs daily to ingest data from an on-premises database into BigQuery. The ingestion volume is approximately 50 GB per day. The data must be available in BigQuery by 6 AM each day. The on-premises database supports change data capture (CDC) via logs. Which approach minimizes operational cost and complexity?

Question 36 (medium, multiple choice)

You have a BigQuery table that is partitioned by ingestion time and clustered on user_id. The table stores event logs and is queried frequently by user_id to analyze user behavior over the last 30 days. Queries are still scanning too many partitions. Which optimization should you apply first?

Question 37 (hard, multiple choice)

Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?

Question 38 (easy, multiple choice)

You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?

Question 39 (medium, multiple choice)

Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?

Question 40 (hard, multiple choice)

You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?

Question 41 (medium, multiple choice)

You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?

Question 42 (easy, multiple choice)

Your company is building a real-time anomaly detection system for financial transactions. The system must process streams of transactions and flag anomalies within seconds. The volume is moderate (5000 transactions per second). You want a fully managed solution that integrates with BigQuery for historical analysis. Which service should you use for stream processing?

Question 43 (medium, multi-select)

Your organization is designing a data lake on Google Cloud using Cloud Storage. You need to choose a file format for storing raw data that supports schema evolution, is splittable for parallel processing, and is optimized for query performance in BigQuery. Which TWO formats meet these requirements? (Choose 2.)

Question 44 (hard, multi-select)

Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)

Question 45 (easy, multi-select)

Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)

Question 46 (medium, multiple choice)

A company needs to process streaming sensor data from millions of devices with sub-second latency, apply transformations, and write results to BigQuery for real-time dashboards. The data volume varies, and they want to avoid managing servers. Which service should they use?

Question 47 (easy, multiple choice)

A data engineer wants to create a BigQuery table that is partitioned by day and clustered by user_id and product_id. Which SQL statement should they use?
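
As a rough illustration of the statement shape this question is about, the sketch below assembles a BigQuery `CREATE TABLE` with a daily partition clause and a clustering clause. The table name `mydataset.events`, the `event_ts` column, and the helper function are hypothetical and not taken from the question; only the `PARTITION BY DATE(...)` and `CLUSTER BY ...` clauses are the point.

```python
def build_create_table_ddl(table, partition_column, cluster_columns):
    """Build a CREATE TABLE statement with daily partitioning and clustering.

    The column list is a hypothetical schema chosen for this example.
    """
    return (
        f"CREATE TABLE {table} (\n"
        f"  {partition_column} TIMESTAMP,\n"
        "  user_id STRING,\n"
        "  product_id STRING\n"
        ")\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


ddl = build_create_table_ddl("mydataset.events", "event_ts", ["user_id", "product_id"])
print(ddl)
```

The order of the clustering columns matters, since data is sorted by the first column, then the second.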

Question 48 (hard, multiple choice)

A company uses Cloud Pub/Sub to ingest events from multiple sources. They need to guarantee that each event is processed exactly once by downstream consumers. However, Pub/Sub guarantees at-least-once delivery. Which additional steps should they implement to achieve exactly-once processing?
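
One common way to get exactly-once effects on top of at-least-once delivery is an idempotent consumer that remembers which message IDs it has already applied. The sketch below is a plain-Python illustration under that assumption. The class, the in-memory `seen_ids` set, and the sample messages are hypothetical; a real system would need durable, bounded storage for the IDs.

```python
class DedupingConsumer:
    """Applies each message at most once, keyed by a unique message ID."""

    def __init__(self):
        self.seen_ids = set()  # in-memory only; illustration, not durable
        self.total = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False  # redelivery: skip so the side effect is not repeated
        self.seen_ids.add(message_id)
        self.total += amount
        return True


consumer = DedupingConsumer()
# "m1" is delivered twice, as at-least-once delivery allows.
deliveries = [("m1", 100), ("m2", 50), ("m1", 100)]
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]
```

The duplicate delivery of `m1` is acknowledged but not applied again, so the running total reflects each message once.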

Question 49 (medium, multiple choice)

A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?
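
The interplay between the watermark, the window end, and the lateness allowance can be sketched as a small decision function. This is a simplified plain-Python model, not the Beam API: the function name, the three labels, and the exact boundary handling are assumptions made for illustration, using the 1-minute window and 5-minute allowance from the question.

```python
def classify(event_ts, watermark, window_size=60, allowed_lateness=300):
    """Classify an arriving element relative to its event-time window.

    All arguments are in seconds. Simplified model for illustration only.
    """
    start = event_ts - (event_ts % window_size)
    end = start + window_size
    if watermark < end:
        return "on_time"        # window not yet considered complete
    if watermark < end + allowed_lateness:
        return "late_accepted"  # window reopened to include the element
    return "dropped"            # past the allowance; window state is gone


# An event stamped t=30 belongs to window [0, 60).
statuses = [classify(30, 50), classify(30, 200), classify(30, 400)]
```

Once the watermark passes the window end plus the allowance, no further panes are produced for that window.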

Question 50 (easy, multiple choice)

Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?

Question 51 (medium, multiple choice)

A team wants to use Cloud Pub/Sub Lite for a high-throughput, low-cost messaging system. They need exactly-once delivery to subscribers. What should they know about Pub/Sub Lite's delivery guarantees?

Question 52 (medium, multiple choice)

A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?

Question 53 (hard, multiple choice)

A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?
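
One general technique for a per-element external call bottleneck is to group elements into batches so that a single request serves many elements. The plain-Python sketch below illustrates only that idea. `fake_bulk_lookup` is a hypothetical stand-in for an external API that accepts many IDs at once, and the batch size of 4 is arbitrary.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fake_bulk_lookup(ids):
    """Hypothetical bulk API: one call returns results for all given IDs."""
    return {i: f"profile-{i}" for i in ids}


calls = 0
enriched = {}
for batch in batched(range(10), 4):
    calls += 1  # one external call per batch instead of one per element
    enriched.update(fake_bulk_lookup(batch))
```

Ten elements are served by three calls here, and every element still receives its own result, so correctness is unchanged.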

Question 54 (easy, multiple choice)

Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?

Question 55 (medium, multiple choice)

A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?

Question 56 (hard, multiple choice)

A Dataproc cluster uses preemptible worker nodes to reduce costs. The cluster runs a long-running Spark job that occasionally experiences worker failures. How should the job be configured to handle preemptible worker failures gracefully?

Question 57 (medium, multiple choice)

A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?

Question 58 (medium, multi-select)

A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?
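
The behaviour described above (retry up to 5 times, then divert the message) can be modelled with a short plain-Python simulation. This does not use the Pub/Sub client library. The `deliver` function, the handler convention of returning `True` to acknowledge, and the flaky handler are all assumptions made for illustration.

```python
MAX_DELIVERY_ATTEMPTS = 5  # the limit stated in the question


def deliver(message, handler, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Try handler up to max_attempts times; report where the message ended up."""
    for attempt in range(1, max_attempts + 1):
        if handler(message):  # True models a successful acknowledgement
            return ("acked", attempt)
    return ("dead_lettered", max_attempts)


def make_flaky(succeed_on):
    """Return a handler that fails until its `succeed_on`-th invocation."""
    state = {"calls": 0}

    def handler(message):
        state["calls"] += 1
        return state["calls"] >= succeed_on

    return handler


always_failing = deliver("order-1", lambda m: False)
recovers = deliver("order-2", make_flaky(3))
```

A message that never succeeds is diverted after the fifth attempt, while one that succeeds on the third attempt is acknowledged normally.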

Question 59 (hard, multi-select)

A company is designing a data pipeline using the lambda architecture. They need to process both real-time streams and batch historical data. Which THREE components are essential for a lambda architecture on Google Cloud?

Question 60 (medium, multi-select)

A company wants to use Dataproc Metastore to manage metadata for their Spark jobs. Which TWO benefits does Dataproc Metastore provide?

Question 61 (medium, multiple choice)

A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?

Question 62 (medium, multiple choice)

You are designing a BigQuery data warehouse for a retail company. Queries frequently filter on order_date and customer_id. To optimize query performance and cost, which table design should you use?

Question 63 (hard, multiple choice)

A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?

Question 64 (easy, multiple choice)

You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?

Question 65 (medium, multiple choice)

Your company uses Pub/Sub to ingest clickstream data. Messages must be processed in order for the same user_id. How should you configure the Pub/Sub subscription to guarantee ordering?

Question 66 (hard, multiple choice)

A Dataflow pipeline with multiple steps uses a side input from a slowly changing reference table stored in BigQuery. The side input is updated every hour. To avoid reprocessing the entire pipeline on each update, which approach should you use?

Question 67 (medium, multiple choice)

You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?

Question 68 (easy, multiple choice)

Your team wants to share a BigQuery dataset with another project while ensuring that users from that project can only query specific tables. Which BigQuery feature should you use?

Question 69 (medium, multiple choice)

A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?

Question 70 (hard, multiple choice)

You are building a real-time fraud detection system using Dataflow. Events from Pub/Sub need to be grouped by user_id within a 5-minute window to detect suspicious patterns. Some events may be delayed by up to 2 minutes. How should you configure the window and trigger to balance accuracy and latency?

Question 71 (easy, multiple choice)

You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?

Question 72 (medium, multiple choice)

A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?

Question 73 (medium, multi-select)

You are designing a data pipeline for a financial services company that requires exactly-once processing semantics. Which TWO services or configurations provide exactly-once guarantees?

Question 74 (hard, multi-select)

A media company processes video metadata using a Dataflow pipeline. They need to join two streaming sources: user activity (Pub/Sub) and video catalog updates (Pub/Sub). Which THREE transforms should be used in the pipeline?

Question 75 (medium, multi-select)

You are designing a BigQuery data lake for a healthcare organization. The data includes patient records that must be access-controlled at the row level. Which TWO features should you use to meet this requirement?

Question 76 (easy, multiple choice)

A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?

Question 77 (easy, multiple choice)

A company has a BigQuery dataset containing sensitive customer data. They want to share a subset of this data with external partners, ensuring that partners can only see specific columns and rows. Which BigQuery feature should they use?

Question 78 (medium, multiple choice)

A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?

Question 79 (medium, multiple choice)

An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?

Question 80 (medium, multiple choice)

A company needs to process high-throughput streaming data with low latency. They are considering Cloud Pub/Sub for ingestion and Cloud Dataflow for processing. However, they are concerned about cost. Which alternative to Cloud Pub/Sub would reduce costs while still meeting the throughput requirements?

Question 81 (medium, multiple choice)

A data engineer is designing a pipeline that reads from Cloud Pub/Sub, aggregates events into 5-minute windows, and writes the results to BigQuery. The engineer wants to ensure that late-arriving data (up to 2 minutes late) is included in the correct window. Which Dataflow feature should they configure?

Question 82 (medium, multiple choice)

A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?

Question 83 (hard, multiple choice)

A company has a BigQuery table that is partitioned by ingestion time and clustered by the 'customer_id' column. They notice that queries filtering on 'customer_id' are not benefiting from clustering as expected. What is the most likely cause?

Question 84 (hard, multiple choice)

An organization is implementing a data lake on Google Cloud using Cloud Storage. They need to process both batch and streaming data with a unified pipeline. The team has experience with Apache Beam. Which architecture should they use to minimize operational overhead?

Question 85 (hard, multiple choice)

A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but do not conform to the expected schema. What is the most likely reason for the messages being dead-lettered?

Question 86 (medium, multiple choice)

A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?

Question 87 (easy, multiple choice)

An engineer needs to create a Pub/Sub subscription that sends messages to an HTTPS endpoint. The endpoint must be able to acknowledge messages individually. Which type of subscription should they use?

Question 88 (medium, multi-select)

A company is migrating their on-premises Hadoop workloads to Google Cloud. They want to use Dataproc for data processing and need to minimize costs for non-critical batch jobs that can tolerate interruptions. Which TWO configurations should they use?

Question 89 (hard, multi-select)

A data engineering team is designing a streaming pipeline using Cloud Dataflow. They need to join two unbounded PCollections based on a common key. The join must handle late data up to 10 minutes. Which THREE components should they use?

Question 90 (medium, multi-select)

An organization is using BigQuery for analytics. They have a table that is 500 GB and is frequently queried by 'date' and 'region'. They want to optimize query performance and reduce costs. Which TWO actions should they take?

Question 91 (medium, multiple choice)

A data pipeline ingests streaming events into Pub/Sub and needs to join them with a slowly updating reference table (few thousand rows) from a Cloud Storage CSV file. The pipeline runs on Dataflow with Apache Beam. Which approach is most cost-effective and operationally simple?
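
For a reference table of only a few thousand rows, the join can be pictured as an in-memory dictionary lookup applied to each event. The plain-Python sketch below illustrates that idea only. It is not Beam code, and the CSV contents, the `product_id` and `category` column names, and the event shape are all hypothetical.

```python
import csv
import io

# Hypothetical reference data standing in for the Cloud Storage CSV file.
REFERENCE_CSV = "product_id,category\np1,books\np2,games\n"


def load_reference(csv_text):
    """Parse the CSV text into a dict mapping product_id to category."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["product_id"]: row["category"] for row in reader}


def enrich(event, reference):
    """Return a copy of the event with a category looked up in memory."""
    return {**event, "category": reference.get(event["product_id"], "unknown")}


reference = load_reference(REFERENCE_CSV)
events = [{"product_id": "p1", "qty": 2}, {"product_id": "p9", "qty": 1}]
enriched = [enrich(e, reference) for e in events]
```

Because the table is small, every worker can hold a full copy, and the lookup is refreshed by reloading the file when it changes.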

Question 92 (hard, multiple choice)

A Dataflow pipeline using Apache Beam processes unbounded data from Pub/Sub. The pipeline uses fixed windows of 1 minute and a trigger that fires early every 30 seconds and at watermark. The team observes that the output pane for window [10:00:00, 10:01:00) contains events with timestamps from 10:00:15 and 10:00:45, but also an event with timestamp 10:02:00. What is the most likely cause?

Question 93 (easy, multiple choice)

A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?

Question 94 (medium, multiple choice)

A company runs Apache Spark jobs on Dataproc. They want to reduce costs by using preemptible instances for worker nodes. The jobs are fault-tolerant and can handle occasional node loss. However, the cluster must remain available for interactive querying during business hours. Which Dataproc cluster configuration meets these requirements?

Question 95 (medium, multiple choice)

A data engineer needs to design a streaming pipeline that ingests events from multiple sources, enriches them with a lookup table stored in BigQuery (updated every hour), and writes the results to a BigQuery table for real-time dashboards. The pipeline must handle late-arriving data up to 1 hour. Which Dataflow feature should be configured to manage late data?
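
The "late data up to 1 hour" requirement can be pictured with a simplified model of how a window's end, the watermark, and a lateness bound interact. This is a sketch under stated assumptions, not Dataflow's actual implementation: the function name `is_dropped` and the example times are hypothetical, and real runners track more detail.

```python
from datetime import datetime, timedelta

def is_dropped(window_end, watermark, allowed_lateness):
    """Simplified model: a late element is dropped once the watermark has
    passed the end of its window plus the allowed lateness."""
    return watermark > window_end + allowed_lateness

window_end = datetime(2024, 1, 1, 10, 10)  # arbitrary example window end
lateness = timedelta(hours=1)

# Watermark 30 minutes past the window end: still inside the lateness bound.
kept = not is_dropped(window_end, datetime(2024, 1, 1, 10, 40), lateness)
# Watermark 65 minutes past the window end: beyond the lateness bound.
dropped = is_dropped(window_end, datetime(2024, 1, 1, 11, 15), lateness)
```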

Question 96 (easy, multiple choice)

Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?

Question 97 (medium, multiple choice)

A company is using Pub/Sub to ingest clickstream events. They need to ensure that events are delivered to a subscriber at least once, but duplicates can be tolerated. They also need to filter events by type before processing. Which subscription configuration should be used?

Question 98 (hard, multiple choice)

A data pipeline uses Cloud Data Fusion to perform ETL jobs. The pipeline reads from BigQuery, transforms data using Wrangler, and writes to Cloud Storage. The team notices that the pipeline runs slower than expected. They suspect the Data Fusion instance is under-provisioned. Which action should be taken to improve performance?

Question 99 (medium, multiple choice)

A company wants to use Pub/Sub Lite to reduce costs for a high-throughput, low-latency streaming pipeline. However, they have a requirement to retain messages for up to 7 days for reprocessing. Which Pub/Sub Lite configuration supports this retention?

Question 100 (easy, multiple choice)

A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?

Question 101 (hard, multi select)

A data team is migrating an on-premises Hadoop cluster to Dataproc. The cluster runs a mix of long-running services (Hive, HBase) and transient Spark jobs. They want to minimize cost while maintaining performance. Which TWO strategies should they implement?

Question 102 (medium, multi select)

A company uses Pub/Sub to ingest events from multiple sources. They need to ensure that messages from a specific source are processed in order (per source partition). They also need to deduplicate messages. Which TWO features should they use?
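
The two requirements in this scenario, per-source ordering and deduplication, can be modeled with a toy consumer. The sketch below is plain Python and does not use the Pub/Sub client library. The message fields (`id`, `ordering_key`, `data`) and the function name `process` are assumptions made up for illustration, and it assumes messages with the same key already arrive in order.

```python
from collections import defaultdict

def process(messages):
    """Toy consumer: skip redelivered message IDs and keep arrival order per ordering key."""
    seen_ids = set()
    per_key = defaultdict(list)
    for msg in messages:
        if msg["id"] in seen_ids:
            continue  # duplicate delivery of an already-processed message
        seen_ids.add(msg["id"])
        per_key[msg["ordering_key"]].append(msg["data"])
    return dict(per_key)

# Hypothetical deliveries, including one redelivery of message m1.
deliveries = [
    {"id": "m1", "ordering_key": "source-a", "data": "a1"},
    {"id": "m2", "ordering_key": "source-b", "data": "b1"},
    {"id": "m1", "ordering_key": "source-a", "data": "a1"},
    {"id": "m3", "ordering_key": "source-a", "data": "a2"},
]
result = process(deliveries)
```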

Question 103 (medium, multi select)

A data engineer is designing a streaming pipeline using Dataflow with Apache Beam. The pipeline reads from Pub/Sub, performs a stateful transformation (e.g., session windowing), and writes to BigQuery. The pipeline must handle late data and ensure exactly-once semantics. Which THREE configurations are required?

Question 104 (medium, multi select)

A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?

Question 105 (hard, multi select)

A company uses Cloud Data Fusion for ETL pipelines. They need to transform sensitive data (PII) by masking certain columns before writing to BigQuery. They also need to ensure the pipeline can be monitored and restarted from failure points. Which THREE features should they use?

Question 106 (medium, multiple choice)

A company is designing a data pipeline that ingests real-time events from IoT devices and must handle late-arriving data (up to 1 hour late) while minimizing duplicate processing. They plan to use Dataflow with Pub/Sub. Which combination of windowing and trigger settings should they use?

Question 107 (hard, multiple choice)

A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?

Question 108 (medium, multi select)

A data engineering team is designing a streaming pipeline using Dataflow to process real-time clickstream data from a website. They need to aggregate user session metrics (e.g., number of sessions, average duration) every 5 minutes. The pipeline must handle late-arriving events (up to 2 minutes late) and ensure exactly-once processing semantics. Which TWO of the following should they configure? (Choose two.)

Question 109 (hard, multi select)

A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)

Question 110 (medium, multi select)

A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)

Question 7hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A company uses Pub/Sub with push subscriptions to deliver events to a Cloud Run service. Recently, the service has been returning HTTP 429 (Too Many Requests), causing messages to be retried and eventually sent to the dead letter topic. What is the MOST likely cause?

Question 8easymultiple choice

Read the full Designing Data Processing Systems explanation →

A data engineer needs to process data in a Dataflow pipeline that reads from a Pub/Sub topic. The pipeline must group events into 5-minute windows and compute the average value per key. Which Beam transform should they use after windowing?

Question 9mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company uses BigQuery for analytics. They have a table that is queried frequently by date range. To reduce costs, they want to ensure queries only scan the relevant partitions. They also want to improve performance for queries filtering on a specific customer_id. Which table design should they use?

Question 10hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A data engineer is designing a real-time fraud detection system using Dataflow. The system must detect patterns across events from multiple users within a sliding window of 10 minutes. Events arrive on Pub/Sub topics per user. Which approach should they use to join the streams?

Question 11mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company wants to use Dataprep to clean and transform raw CSV files stored in Cloud Storage before loading into BigQuery. The data quality checks show missing values and inconsistent date formats. Which Dataprep feature should they use to handle these issues?

Question 12easymultiple choice

Read the full Designing Data Processing Systems explanation →

A company needs a messaging service for event-driven applications that require low cost for high-throughput, but can tolerate occasional message loss. Which Pub/Sub product should they choose?

Question 13mediummulti select

Read the full Designing Data Processing Systems explanation →

A retail company uses Dataflow to process real-time clickstream data. They need to enrich each event with customer profile data from Cloud Bigtable and session metadata from Cloud Spanner. Which two Dataflow features should they use?

Question 14hardmulti select

Read the full Designing Data Processing Systems explanation →

A company is migrating on-premises Hadoop Hive workloads to Google Cloud. They want to use Dataproc for Spark processing and require a managed Hive metastore that can be shared across multiple Dataproc clusters. Which TWO components should they use?

Question 15mediummulti select

Read the full Designing Data Processing Systems explanation →

A data engineer needs to design a BigQuery dataset for a multi-team environment. Each team should have read access only to specific tables, and the data must be protected from accidental deletion. Which THREE steps should they take?

Question 16mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company wants to design a data pipeline for real-time fraud detection. The system must process streaming financial transactions, enrich them with user profiles from a lookup table, and flag suspicious activities within seconds. Which architecture pattern would be MOST suitable?

Question 17hardmultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a BigQuery data warehouse for a multi-tenant SaaS application. Each tenant's data must be isolated and queried only by that tenant. You need to minimise management overhead and allow tenants to be added dynamically. Which approach should you use?

Question 18easymultiple choice

Read the full Designing Data Processing Systems explanation →

You need to process large-scale log files (hundreds of terabytes) using Apache Spark on Google Cloud. The job runs nightly and you want to minimise costs. Which Dataproc cluster configuration is MOST cost-effective?

Question 19mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A data pipeline ingests streaming events into Pub/Sub. You need to guarantee that each event is processed exactly once downstream in Dataflow. Which combination of Pub/Sub and Dataflow configurations should you use?

Question 20hardmultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a Dataflow pipeline that reads from Pub/Sub and writes to BigQuery. The pipeline must handle late-arriving data (up to 1 hour) and group events into 10-minute windows. Which configuration is correct?

Question 21mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You are moving an on-premises Hadoop workload to Google Cloud. The workload uses Hive for metadata and HDFS for storage. Which services should you use to minimise reconfiguration?

Question 22easymultiple choice

Read the full Designing Data Processing Systems explanation →

Which Google Cloud service provides a visual interface for building ETL pipelines using a drag-and-drop design and includes pre-built transforms from a marketplace?

Question 23mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a streaming pipeline that needs to handle sudden spikes in traffic without losing data. The pipeline uses Pub/Sub and Dataflow. Which configuration ensures data is not lost if Dataflow falls behind?

Question 24mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You need to analyse streaming data from thousands of IoT devices, each sending temperature readings every second. You want to calculate the average temperature per device over the last 5 minutes, updating every minute. Which windowing strategy should you use in Dataflow?

Question 25hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A company uses BigQuery with partitioned tables by ingestion time. They notice that queries scanning recent partitions are fast but queries scanning older partitions are slow. What is the most likely cause?

Question 26easymultiple choice

Read the full Designing Data Processing Systems explanation →

You need to run a one-time data transformation job on a small CSV file (100 MB) using a visual, code-free interface. Which Google Cloud service is designed for this?

Question 27mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a Dataflow pipeline that joins two unbounded PCollections from different sources. Which transform should you use?

Question 28mediummulti select

Read the full Designing Data Processing Systems explanation →

A company wants to build a real-time dashboard for monitoring application logs. The logs are ingested via Pub/Sub and must be processed with low latency (sub-second). You need to enrich the logs with user metadata from Cloud SQL and store the results in BigQuery for analysis. Which TWO services should be used for the stream processing? (Choose two.)

Question 29hardmulti select

Read the full Designing Data Processing Systems explanation →

A data pipeline processes sensitive customer data. You need to ensure that only authorised users can query the data in BigQuery, and that the data is encrypted at rest and in transit. Which THREE steps should you take? (Choose three.)

Question 30mediummulti select

Read the full Designing Data Processing Systems explanation →

You are designing a Dataflow pipeline for processing real-time clickstream data. The pipeline must group events into 30-second windows and handle late data up to 5 minutes. You want to output partial results every 10 seconds for low-latency monitoring. Which TWO configurations should you use? (Choose two.)

Question 31easymultiple choice

Read the full Designing Data Processing Systems explanation →

Your data engineering team needs to process a continuous stream of clickstream events from a website and update a real-time dashboard showing user activity over the last hour. The pipeline should have minimal operational overhead and support exactly-once processing semantics. Which Google Cloud service should you use?

Question 32mediummultiple choice

Read the full Designing Data Processing Systems explanation →

Your company ingests millions of events per second into a Pub/Sub topic. The downstream consumer must process events with minimal latency and high throughput. However, the consumer occasionally falls behind during traffic spikes, and you need to ensure no data loss while minimizing costs. Which subscription type and configuration should you choose?

Question 33hardmultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a data pipeline that processes streaming events with late-arriving data (up to 2 hours late). The pipeline must compute hourly aggregations and emit results as soon as possible, but must also accurately update results when late data arrives. You want to minimize overall processing cost. Which Dataflow windowing and trigger configuration should you use?

Question 34easymultiple choice

Read the full Designing Data Processing Systems explanation →

You are migrating on-premises Hadoop jobs to Google Cloud. The existing jobs use Spark for ETL and Hive for querying. You want to minimize changes to the existing code and maintain the ability to use Hive queries with the same metastore across multiple clusters. Which service combination should you use?

Question 35mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a batch data pipeline that runs daily to ingest data from an on-premises database into BigQuery. The ingestion volume is approximately 50 GB per day. The data must be available in BigQuery by 6 AM each day. The on-premises database supports change data capture (CDC) via logs. Which approach minimizes operational cost and complexity?

Question 36mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You have a BigQuery table that is partitioned by ingestion time and clustered on user_id. The table stores event logs and is queried frequently by user_id to analyze user behavior over the last 30 days. Queries are still scanning too many partitions. Which optimization should you apply first?

Question 37hardmultiple choice

Read the full Designing Data Processing Systems explanation →

Your company uses Cloud Data Fusion to build ETL pipelines. You have a pipeline that reads from Cloud Storage, transforms data using a custom Wrangler recipe, and writes to BigQuery. The pipeline is failing with an error indicating that the Wrangler directive is invalid. You have verified the recipe works in the Cloud Data Fusion Studio. What is the most likely cause of the failure?

Question 38easymultiple choice

Read the full Designing Data Processing Systems explanation →

You need to allow a data analyst to run queries on a BigQuery dataset but prevent them from modifying the data or deleting the dataset. Which IAM role should you grant?

Question 39mediummultiple choice

Read the full Designing Data Processing Systems explanation →

Your team is migrating a legacy batch processing system that uses Apache Spark on-premises. The migration must be completed with minimal code changes and support both batch and streaming in the future. You want to use a fully managed service. Which Google Cloud service is most appropriate?

Question 40hardmultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a Dataflow pipeline that reads from Pub/Sub, aggregates events into 10-minute windows, and writes the results to BigQuery. The pipeline must reliably handle late-arriving data (up to 1 hour) and prevent duplicate aggregations. Which combination of pipeline options should you use?

Question 41mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You need to create a BigQuery table that stores customer transaction data. The table will be queried frequently by a customer_id column to retrieve recent transactions (last 30 days). Which table design optimizes query performance and cost?

Question 42easymultiple choice

Read the full Designing Data Processing Systems explanation →

Your company is building a real-time anomaly detection system for financial transactions. The system must process streams of transactions and flag anomalies within seconds. The volume is moderate (5000 transactions per second). You want a fully managed solution that integrates with BigQuery for historical analysis. Which service should you use for stream processing?

Question 43mediummulti select

Read the full Designing Data Processing Systems explanation →

Your organization is designing a data lake on Google Cloud using Cloud Storage. You need to choose a file format for storing raw data that supports schema evolution, is splittable for parallel processing, and is optimized for query performance in BigQuery. Which TWO formats meet these requirements? (Choose 2.)

Question 44hardmulti select

Read the full Designing Data Processing Systems explanation →

Your company runs a Dataflow streaming pipeline that processes user activity from Pub/Sub and writes aggregated results to BigQuery. Lately, the pipeline is experiencing high latency and backlog growth during peak hours. You need to troubleshoot and improve performance. Which THREE actions should you take? (Choose 3.)

Question 45easymulti select

Read the full Designing Data Processing Systems explanation →

Your team is using Cloud Dataprep to clean and transform a dataset. Which TWO features of Cloud Dataprep help you understand data quality issues before running the pipeline? (Choose 2.)

Question 46mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company needs to process streaming sensor data from millions of devices with sub-second latency, apply transformations, and write results to BigQuery for real-time dashboards. The data volume varies, and they want to avoid managing servers. Which service should they use?

Question 47easymultiple choice

Read the full Designing Data Processing Systems explanation →

A data engineer wants to create a BigQuery table that is partitioned by day and clustered by user_id and product_id. Which SQL statement should they use?

Question 48hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A company uses Cloud Pub/Sub to ingest events from multiple sources. They need to guarantee that each event is processed exactly once by downstream consumers. However, Pub/Sub guarantees at-least-once delivery. Which additional steps should they implement to achieve exactly-once processing?

Question 49mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A data pipeline uses Dataflow to read from Pub/Sub, window messages into 1-minute fixed windows, and write to BigQuery. The pipeline occasionally has late-arriving data. How should they configure the pipeline to allow late data up to 5 minutes and then trigger a final pane?

Question 50easymultiple choice

Read the full Designing Data Processing Systems explanation →

Which Google Cloud service provides a fully managed, serverless Spark environment without requiring cluster provisioning?

Question 51mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A team wants to use Cloud Pub/Sub Lite for a high-throughput, low-cost messaging system. They need exactly-once delivery to subscribers. What should they know about Pub/Sub Lite's delivery guarantees?

Question 52mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company wants to use BigQuery materialized views to accelerate queries on a table that is updated every hour. Which statement about materialized views is true?

Question 53hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A Dataflow pipeline processes a high-volume stream of JSON events. The pipeline has a bottleneck where a ParDo transformation performs an external API call for each element, causing high latency. Which strategy would BEST improve throughput without sacrificing correctness?

Question 54easymultiple choice

Read the full Designing Data Processing Systems explanation →

Which BigQuery feature allows you to share query results with specific users without giving them direct access to the underlying tables?

Question 55mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company wants to use Cloud Data Fusion for ETL pipelines. They need to integrate with custom transformations not available in the marketplace. What should they do?

Question 56hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A Dataproc cluster uses preemptible worker nodes to reduce costs. The cluster runs a long-running Spark job that occasionally experiences worker failures. How should the job be configured to handle preemptible worker failures gracefully?

Question 57mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company needs to process data from a legacy system that outputs CSV files daily. They want to visually build transformations without writing code. Which Google Cloud service should they use?

Question 58mediummulti select

Read the full Designing Data Processing Systems explanation →

A company uses Cloud Pub/Sub for event ingestion. They want to ensure that if a subscriber fails to process a message after 5 attempts, the message is sent to a dead letter topic for analysis. Which TWO configurations are needed?

Question 59hardmulti select

Read the full Designing Data Processing Systems explanation →

A company is designing a data pipeline using the lambda architecture. They need to process both real-time streams and batch historical data. Which THREE components are essential for a lambda architecture on Google Cloud?

Question 60mediummulti select

Read the full Designing Data Processing Systems explanation →

A company wants to use Dataproc Metastore to manage metadata for their Spark jobs. Which TWO benefits does Dataproc Metastore provide?

Question 61mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company needs to process streaming sensor data and run both real-time analytics and batch reanalysis on historical data. They want to minimize infrastructure management. Which architecture and service combination is MOST suitable?

Question 62mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You are designing a BigQuery data warehouse for a retail company. Queries frequently filter on order_date and customer_id. To optimize query performance and cost, which table design should you use?

Question 63hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A Dataflow streaming pipeline reads from Pub/Sub, processes events with a fixed window of 1 minute, and writes to BigQuery. Some events arrive late due to network issues. You need to ensure late events are still included in the correct window but the pipeline must not wait indefinitely. What configuration should you use?

Question 64easymultiple choice

Read the full Designing Data Processing Systems explanation →

You need to process a large Spark ML training job on a Dataproc cluster. The job is fault-tolerant and can handle occasional node failures. To reduce costs, which type of worker nodes should you use?

Question 65mediummultiple choice

Read the full Designing Data Processing Systems explanation →

Your company uses Pub/Sub to ingest clickstream data. Messages must be processed in order for the same user_id. How should you configure the Pub/Sub subscription to guarantee ordering?

Question 66hardmultiple choice

Read the full Designing Data Processing Systems explanation →

A Dataflow pipeline with multiple steps uses a side input from a slowly changing reference table stored in BigQuery. The side input is updated every hour. To avoid reprocessing the entire pipeline on each update, which approach should you use?

Question 67mediummultiple choice

Read the full Designing Data Processing Systems explanation →

You need to transform and clean messy CSV data using a visual interface without writing code. The transformation should be scheduled to run weekly. Which Google Cloud service should you use?

Question 68easymultiple choice

Read the full Designing Data Processing Systems explanation →

Your team wants to share a BigQuery dataset with another project while ensuring that users from that project can only query specific tables. Which BigQuery feature should you use?

Question 69mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A company uses Dataproc Serverless for Spark batch jobs. They notice that some jobs are failing due to out-of-memory (OOM) errors. Which configuration parameter should they adjust to allocate more memory per executor?

Question 70hardmultiple choice

Read the full Designing Data Processing Systems explanation →

You are building a real-time fraud detection system using Dataflow. Events from Pub/Sub need to be grouped by user_id within a 5-minute window to detect suspicious patterns. Some events may be delayed by up to 2 minutes. How should you configure the window and trigger to balance accuracy and latency?

Question 71easymultiple choice

Read the full Designing Data Processing Systems explanation →

You need to choose a messaging service for a real-time streaming application that requires low cost and can tolerate occasional message loss. Which service is MOST suitable?

Question 72mediummultiple choice

Read the full Designing Data Processing Systems explanation →

A data engineer needs to run an existing Spark job on Google Cloud with minimal code changes. The job requires Hive metastore access. Which Dataproc feature should they use to provide a managed Hive metastore?

Question 73mediummulti select

Read the full Designing Data Processing Systems explanation →

You are designing a data pipeline for a financial services company that requires exactly-once processing semantics. Which TWO services or configurations provide exactly-once guarantees?

Question 74hardmulti select

Read the full Designing Data Processing Systems explanation →

A media company processes video metadata using a Dataflow pipeline. They need to join two streaming sources: user activity (Pub/Sub) and video catalog updates (Pub/Sub). Which THREE transforms should be used in the pipeline?

Question 75mediummulti select

Read the full Designing Data Processing Systems explanation →

You are designing a BigQuery data lake for a healthcare organization. The data includes patient records that must be access-controlled at the row level. Which TWO features should you use to meet this requirement?

Question 76easymultiple choice

Read the full Designing Data Processing Systems explanation →

A data engineer needs to process streaming data from thousands of IoT devices and generate real-time dashboards. The data volume is low but requires exactly-once processing semantics. Which Google Cloud service combination should they use?

Question 77 (easy, multiple choice)

A company has a BigQuery dataset containing sensitive customer data. They want to share a subset of this data with external partners, ensuring that partners can only see specific columns and rows. Which BigQuery feature should they use?

Question 78 (medium, multiple choice)

A data pipeline is built with Cloud Dataflow that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is experiencing high latency and occasional data loss during worker failures. The engineer wants to improve reliability and performance. Which two actions should they take?

Question 79 (medium, multiple choice)

An organization runs periodic Apache Spark jobs on Dataproc to process data from Cloud Storage. They want to reduce costs by using preemptible instances for worker nodes. What is a key consideration when using preemptible instances in Dataproc?

Question 80 (medium, multiple choice)

A company needs to process high-throughput streaming data with low latency. They are considering Cloud Pub/Sub for ingestion and Cloud Dataflow for processing. However, they are concerned about cost. Which alternative to Cloud Pub/Sub would reduce costs while still meeting the throughput requirements?

Question 81 (medium, multiple choice)

A data engineer is designing a pipeline that reads from Cloud Pub/Sub, aggregates events into 5-minute windows, and writes the results to BigQuery. The engineer wants to ensure that late-arriving data (up to 2 minutes late) is included in the correct window. Which Dataflow feature should they configure?
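
The timing in this scenario can be made concrete with a small model. The sketch below is plain Python, not the Apache Beam API; the names `window_start`, `still_accepted`, and `LATENESS_BOUND` are hypothetical, and the acceptance rule is a simplification that assumes windows are aligned to the Unix epoch and that an event is kept only while the watermark has not passed the window end plus the lateness bound.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)          # window size from the scenario
LATENESS_BOUND = timedelta(minutes=2)  # maximum lateness from the scenario
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(event_time: datetime) -> datetime:
    """Return the start of the epoch-aligned 5-minute window for an event time."""
    return event_time - (event_time - EPOCH) % WINDOW


def still_accepted(event_time: datetime, watermark: datetime) -> bool:
    """Simplified rule: keep the event while the watermark is before window end + bound."""
    window_end = window_start(event_time) + WINDOW
    return watermark < window_end + LATENESS_BOUND


if __name__ == "__main__":
    event = datetime(2024, 1, 1, 10, 3, 30, tzinfo=timezone.utc)
    for wm_minute, wm_second in [(6, 30), (7, 30)]:
        wm = datetime(2024, 1, 1, 10, wm_minute, wm_second, tzinfo=timezone.utc)
        print(wm.time(), still_accepted(event, wm))
```

In this model an event stamped 10:03:30 belongs to the window starting at 10:00:00 and is still accepted when the watermark reads 10:06:30, but not once it reads 10:07:30.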

Question 82 (medium, multiple choice)

A company is using Cloud Storage to store raw logs. They want to use Cloud Data Fusion to transform and load the data into BigQuery on a daily schedule. The transformations are complex and involve joining multiple datasets. What is the most efficient way to run these pipelines?

Question 83 (hard, multiple choice)

A company has a BigQuery table that is partitioned by ingestion time and clustered by the 'customer_id' column. They notice that queries filtering on 'customer_id' are not benefiting from clustering as expected. What is the most likely cause?

Question 84 (hard, multiple choice)

An organization is implementing a data lake on Google Cloud using Cloud Storage. They need to process both batch and streaming data with a unified pipeline. The team has experience with Apache Beam. Which architecture should they use to minimize operational overhead?

Question 85 (hard, multiple choice)

A data pipeline using Cloud Dataflow reads from a Pub/Sub subscription that has a dead letter topic configured. Some messages are being sent to the dead letter topic. Upon investigation, the engineer finds that the messages contain valid data but are malformed according to the schema. What is the most likely reason for the messages being dead-lettered?

Question 86 (medium, multiple choice)

A company uses Cloud Dataproc to run Spark ML training jobs. They want to persist the trained models and metadata in a Hive-compatible metastore. Which Dataproc feature should they use?

Question 87 (easy, multiple choice)

An engineer needs to create a Pub/Sub subscription that sends messages to an HTTPS endpoint. The endpoint must be able to acknowledge messages individually. Which type of subscription should they use?

Question 88 (medium, multi select)

A company is migrating their on-premises Hadoop workloads to Google Cloud. They want to use Dataproc for data processing and need to minimize costs for non-critical batch jobs that can tolerate interruptions. Which TWO configurations should they use?

Question 89 (hard, multi select)

A data engineering team is designing a streaming pipeline using Cloud Dataflow. They need to join two unbounded PCollections based on a common key. The join must handle late data up to 10 minutes. Which THREE components should they use?

Question 90 (medium, multi select)

An organization is using BigQuery for analytics. They have a table that is 500 GB and is frequently queried by 'date' and 'region'. They want to optimize query performance and reduce costs. Which TWO actions should they take?

Question 91 (medium, multiple choice)

A data pipeline ingests streaming events into Pub/Sub and needs to join them with a slowly updating reference table (few thousand rows) from a Cloud Storage CSV file. The pipeline runs on Dataflow with Apache Beam. Which approach is most cost-effective and operationally simple?

Question 92 (hard, multiple choice)

A Dataflow pipeline using Apache Beam processes unbounded data from Pub/Sub. The pipeline uses fixed windows of 1 minute and a trigger that fires early every 30 seconds and again at the watermark. The team observes that the output pane for window [10:00:00, 10:01:00) contains events with timestamps 10:00:15 and 10:00:45, but also an event with timestamp 10:02:00. What is the most likely cause?
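
For reference, the arithmetic of fixed-window assignment by timestamp can be sketched in a few lines. This is plain Python rather than the Apache Beam API, and `fixed_window` is a hypothetical helper that assumes windows are aligned to midnight; it only shows which 1-minute window each of the three timestamps in the scenario maps to when the assignment is driven purely by that timestamp.

```python
from datetime import datetime, timedelta


def fixed_window(ts: datetime, size: timedelta = timedelta(minutes=1)):
    """Return the [start, end) fixed window containing ts, aligned to midnight."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    start = ts - (ts - midnight) % size
    return start, start + size


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    for h, m, s in [(10, 0, 15), (10, 0, 45), (10, 2, 0)]:
        ts = day.replace(hour=h, minute=m, second=s)
        start, end = fixed_window(ts)
        print(ts.time(), "->", f"[{start.time()}, {end.time()})")
```

Under this assignment the 10:02:00 timestamp maps to [10:02:00, 10:03:00), not to [10:00:00, 10:01:00), which is what makes the observed pane anomalous.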

Question 93 (easy, multiple choice)

A developer wants to create a BigQuery table that automatically expires data older than 30 days to reduce storage costs. Which table design feature should be used?

Question 94 (medium, multiple choice)

A company runs Apache Spark jobs on Dataproc. They want to reduce costs by using preemptible instances for worker nodes. The jobs are fault-tolerant and can handle occasional node loss. However, the cluster must remain available for interactive querying during business hours. Which Dataproc cluster configuration meets these requirements?

Question 95 (medium, multiple choice)

A data engineer needs to design a streaming pipeline that ingests events from multiple sources, enriches them with a lookup table stored in BigQuery (updated every hour), and writes the results to a BigQuery table for real-time dashboards. The pipeline must handle late-arriving data up to 1 hour. Which Dataflow feature should be configured to manage late data?

Question 96 (easy, multiple choice)

Which Google Cloud service provides a serverless Spark environment where you can run Spark jobs without provisioning or managing a cluster?

Question 97 (medium, multiple choice)

A company is using Pub/Sub to ingest clickstream events. They need to ensure that events are delivered to a subscriber at least once, but duplicates can be tolerated. They also need to filter events by type before processing. Which subscription configuration should be used?

Question 98 (hard, multiple choice)

A data pipeline uses Cloud Data Fusion to perform ETL jobs. The pipeline reads from BigQuery, transforms data using Wrangler, and writes to Cloud Storage. The team notices that the pipeline runs slower than expected. They suspect the Data Fusion instance is under-provisioned. Which action should be taken to improve performance?

Question 99 (medium, multiple choice)

A company wants to use Pub/Sub Lite to reduce costs for a high-throughput, low-latency streaming pipeline. However, they have a requirement to retain messages for up to 7 days for reprocessing. Which Pub/Sub Lite configuration supports this retention?

Question 100 (easy, multiple choice)

A data engineer needs to create a BigQuery table that is optimized for queries that filter on a 'customer_id' column and sort by 'transaction_date'. The table will be used for interactive analysis. Which combination of table features should be used?

Question 101 (hard, multi select)

A data team is migrating an on-premises Hadoop cluster to Dataproc. The cluster runs a mix of long-running services (Hive, HBase) and transient Spark jobs. They want to minimize cost while maintaining performance. Which TWO strategies should they implement?

Question 102 (medium, multi select)

A company uses Pub/Sub to ingest events from multiple sources. They need to ensure that messages from a specific source are processed in order (per source partition). They also need to deduplicate messages. Which TWO features should they use?

Question 103 (medium, multi select)

A data engineer is designing a streaming pipeline using Dataflow with Apache Beam. The pipeline reads from Pub/Sub, performs a stateful transformation (e.g., session windowing), and writes to BigQuery. The pipeline must handle late data and ensure exactly-once semantics. Which THREE configurations are required?

Question 104 (medium, multi select)

A company is evaluating BigQuery for a data warehouse migration. They have a mix of reporting queries and ad-hoc analytical queries. They want to control query costs and prevent runaway queries. Which THREE strategies should they implement?

Question 105 (hard, multi select)

A company uses Cloud Data Fusion for ETL pipelines. They need to transform sensitive data (PII) by masking certain columns before writing to BigQuery. They also need to ensure the pipeline can be monitored and restarted from failure points. Which THREE features should they use?

Question 106 (medium, multiple choice)

A company is designing a data pipeline that ingests real-time events from IoT devices and must handle late-arriving data (up to 1 hour late) while minimizing duplicate processing. They plan to use Dataflow with Pub/Sub. Which combination of windowing and trigger settings should they use?

Question 107 (hard, multiple choice)

A financial services company has a BigQuery dataset containing sensitive customer data. They need to share a subset of this data (excluding PII columns) with an external analytics partner. The partner should be able to query the data using their own BigQuery account, but the company must maintain full control over the underlying table and ensure the partner cannot see or access the original table. Which approach should they use?

Question 108 (medium, multi select)

A data engineering team is designing a streaming pipeline using Dataflow to process real-time clickstream data from a website. They need to aggregate user session metrics (e.g., number of sessions, average duration) every 5 minutes. The pipeline must handle late-arriving events (up to 2 minutes late) and ensure exactly-once processing semantics. Which TWO of the following should they configure? (Choose two.)

Question 109 (hard, multi select)

A company is migrating their on-premises Hadoop/Spark workloads to Google Cloud. They need a fully managed service that supports existing Spark jobs with minimal code changes, allows autoscaling, and provides integration with Cloud Storage and BigQuery. The team also wants to avoid managing cluster infrastructure and pay only for what they use. Which TWO services meet these requirements? (Choose two.)

Question 110 (medium, multi select)

A data team is building a near-real-time dashboard that displays aggregated metrics from Kafka topics. They want to use Pub/Sub as a managed messaging service and Dataflow for stream processing. They need to ingest data from Kafka into Pub/Sub with minimal custom code. Which THREE Google Cloud services should they use together? (Choose three.)