PDE Designing data processing systems Practice Test 10 — 15 Questions

Question 1

A data engineer is responsible for a batch ETL pipeline that runs daily using Cloud Composer and Dataproc. The pipeline extracts data from Cloud SQL, transforms it with Spark, and loads to BigQuery. Last night, the pipeline failed because the Spark job ran out of memory. The team needs a solution that prevents future failures without manual intervention. Options: A. Use a larger machine type for Dataproc. B. Enable Dataproc autoscaling and configure memory-based scaling. C. Split the Spark job into multiple stages. D. Use Cloud Functions to retry the job.

Accepted Answer

Enable Dataproc autoscaling and configure memory-based scaling. Option B is correct because Dataproc autoscaling with memory-based scaling dynamically adjusts the cluster size based on the memory utilization of running jobs. This prevents out-of-memory failures by automatically adding worker nodes when memory pressure increases, without requiring manual intervention or pre-provisioning oversized clusters. It directly addresses the root cause (insufficient memory during peak processing) while maintaining cost efficiency.
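As a rough illustration of memory-based scaling, the sketch below models how an autoscaler might turn pending YARN memory into a number of workers to add. The function name `workers_to_add` and its parameters are hypothetical, and the arithmetic is a simplified assumption for illustration, not Dataproc's exact algorithm.

```python
import math


def workers_to_add(pending_memory_gb, memory_per_worker_gb,
                   scale_up_factor, current_workers, max_workers):
    """Hypothetical model: add enough workers to cover a fraction of pending memory."""
    if pending_memory_gb <= 0:
        return 0  # no memory pressure, nothing to add
    # Workers needed to absorb the chosen fraction of the pending memory.
    wanted = math.ceil(scale_up_factor * pending_memory_gb / memory_per_worker_gb)
    # Never exceed the configured upper bound on cluster size.
    headroom = max(0, max_workers - current_workers)
    return min(wanted, headroom)


# 120 GB pending, 24 GB per worker, 4 workers running, cap of 10 workers.
print(workers_to_add(120, 24, 1.0, 4, 10))
```

With a cap close to the current size, the same pending memory yields fewer added workers, which is why the upper bound in a policy matters as much as the scaling factor.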

Answer

Use Cloud Functions to retry the job

Answer

Use a larger machine type for Dataproc

Answer

Split the Spark job into multiple stages

Question 2

A company uses Cloud Dataflow to process financial transactions from Pub/Sub to BigQuery. The pipeline must ensure exactly-once semantics. Recently, they noticed duplicate rows in BigQuery. The source publishes with at-least-once. The Dataflow pipeline uses idempotent writes. What is the most likely cause? Options: A. The pipeline uses GlobalWindows. B. The pipeline has autoscaling enabled. C. The pipeline uses file loads as a sink. D. The pipeline's watermark is misconfigured.

Accepted Answer

The pipeline's watermark is misconfigured. The most likely cause is a misconfigured watermark. In Dataflow, the watermark tracks event time progress and determines when to trigger window results. If the watermark is misconfigured (e.g., too aggressive or based on incorrect timestamps), late-arriving data may be processed in multiple windows, leading to duplicate rows even with idempotent writes. Since the source uses at-least-once delivery, late data can be re-published, and a faulty watermark can cause it to be written again.
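For context on the idempotent writes the question mentions, here is a minimal plain-Python sketch of the idea: each record is written under a unique key, so a redelivery from an at-least-once source overwrites the earlier copy instead of adding a row. The field name `txn_id` and the in-memory dictionary standing in for a table are assumptions made for illustration.

```python
def apply_idempotent_writes(deliveries):
    """Upsert each delivered record by its unique key; redeliveries overwrite, never append."""
    table = {}
    for record in deliveries:
        table[record["txn_id"]] = record  # same key, same slot
    return list(table.values())


deliveries = [
    {"txn_id": "t1", "amount": 100},
    {"txn_id": "t2", "amount": 250},
    {"txn_id": "t1", "amount": 100},  # redelivered by the at-least-once source
]
rows = apply_idempotent_writes(deliveries)
print(len(rows))  # 2
```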

Answer

The pipeline uses file loads as a sink

Answer

The pipeline uses GlobalWindows

Answer

The pipeline has autoscaling enabled

Question 3

A company needs to stream data from a fleet of IoT devices to BigQuery for near-real-time analytics. The data volume is unpredictable and can spike during certain events. Which Google Cloud service should be used as the ingestion point to handle variable throughput with minimal operational overhead?

Accepted Answer

Cloud Pub/Sub. Cloud Pub/Sub is the correct choice because it is a fully managed, scalable messaging service designed to decouple data producers from consumers, handling unpredictable and spiky throughput without requiring manual scaling. It can ingest millions of messages per second and buffer them until BigQuery is ready to consume, ensuring near-real-time analytics with minimal operational overhead.

Answer

Cloud Datastore

Answer

Cloud Functions

Answer

Cloud Storage

Question 4

A team runs a Dataflow streaming pipeline that reads from Pub/Sub, windows events by processing time, and writes to BigQuery. Some late-arriving events are being dropped. The requirement is to include all events that arrive within 10 minutes of the watermark. Which pipeline configuration should be used?

Accepted Answer

Use fixed windows with .withAllowedLateness(Duration.standardMinutes(10)). Option B is correct because `withAllowedLateness(Duration.standardMinutes(10))` on a fixed window allows late-arriving events to be included up to 10 minutes after the watermark passes the window's end. This directly meets the requirement to retain events arriving within 10 minutes of the watermark, while still using processing-time windows as specified.
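The allowed-lateness rule can be modeled in a few lines of plain Python. This is a sketch of the rule as described above, not Beam code: the function `is_accepted` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def is_accepted(window_end, watermark, allowed_lateness):
    """Model: a late element is kept while the watermark has not passed
    window_end + allowed_lateness."""
    return watermark <= window_end + allowed_lateness


window_end = datetime(2024, 1, 1, 12, 0)  # arbitrary example window end
ten_minutes = timedelta(minutes=10)
ten_seconds = timedelta(seconds=10)

# Watermark is 7 minutes past the window end.
print(is_accepted(window_end, datetime(2024, 1, 1, 12, 7), ten_minutes))   # True
print(is_accepted(window_end, datetime(2024, 1, 1, 12, 7), ten_seconds))   # False
# Watermark is 11 minutes past the window end.
print(is_accepted(window_end, datetime(2024, 1, 1, 12, 11), ten_minutes))  # False
```

The second call shows why a 10-second allowance does not meet the requirement: an event arriving a few minutes late is already outside the accepted range.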

Answer

Use sliding windows with no allowed lateness

Answer

Use fixed windows with withAllowedLateness(Duration.standardSeconds(10))

Answer

Switch from processing time to event time and use default triggers

Question 5

A company runs a batch data processing workload using Dataproc clusters that are auto-scaled based on YARN memory utilization. During peak times, jobs take much longer than expected. Analysis shows the cluster is not scaling up despite high YARN memory utilization. What is the most likely cause?

Accepted Answer

The cluster is using primary workers only; auto-scaling only adds secondary workers. Dataproc clusters have two types of workers: primary workers (which run both HDFS and compute) and secondary workers (compute-only). The autoscaler can only add or remove secondary workers; it cannot scale primary workers. If the cluster uses only primary workers, the autoscaler has no secondary workers to add, so it cannot scale up even under high YARN memory utilization. This explains why the cluster remains static during peak times.

Answer

Spark dynamic allocation is disabled, preventing executors from using added workers

Answer

The cluster autoscaler is misconfigured to scale based on CPU, not memory

Answer

The autoscaler is set to scale down secondary workers, not up

Question 6

A company is designing a data processing system that must handle both batch and streaming workloads with unified pipeline code. Which two Google Cloud services are most suitable for implementing a unified batch and streaming pipeline? (Choose TWO.)

Accepted Answer

Apache Beam SDK. Apache Beam SDK (C) provides a unified programming model that allows developers to write a single pipeline that can execute in both batch and streaming modes without code changes. It abstracts the underlying execution engine, making it the correct choice for unified pipeline code.

Answer

Cloud Data Fusion

Answer

BigQuery

Answer

Cloud Dataproc

Question 7

An organization is moving on-premises Hadoop workloads to Google Cloud. They need to minimize code changes and manage transient clusters for cost savings. Which two Google Cloud services should they consider? (Choose TWO.)

Accepted Answer

Dataproc on GKE. Options B and D are correct: Dataproc is a managed Hadoop/Spark service that can run transient clusters, and Dataproc on GKE allows running Spark workloads on GKE for flexibility. Option A is wrong because Dataflow is not compatible with Hadoop. Option C is wrong because Compute Engine requires manual cluster setup. Option E is wrong because BigQuery is not Hadoop-compatible.

Answer

Compute Engine with self-managed Hadoop

Answer

BigQuery

Answer

Cloud Dataflow

Question 8

A data pipeline reads thousands of JSON files from Cloud Storage, processes them with Cloud Dataflow, and writes to BigQuery. The pipeline sometimes fails because of malformed JSON records. Which three steps should the data engineering team take to improve pipeline reliability? (Choose THREE.)

Accepted Answer

Integrate Cloud Pub/Sub as an intermediary to buffer and allow message retry. Option A is correct because integrating Cloud Pub/Sub as an intermediary decouples the ingestion of JSON files from the Dataflow pipeline. Pub/Sub provides at-least-once delivery and automatic retries for messages that are not acknowledged, which buffers against transient failures and malformed records. This allows the pipeline to pull messages at its own pace and retry processing without losing data.

Answer

Use a try-catch block in the pipeline to retry processing failed records

Answer

Create a Cloud Monitoring alert on pipeline failures
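The try-catch option for this question can be sketched in plain Python. The option speaks of retrying failed records; this sketch instead sets malformed records aside, on the assumption that a record that fails to parse once will fail the same way again. The function `parse_records` and the sample lines are hypothetical.

```python
import json


def parse_records(lines):
    """Parse each line as JSON; collect failures instead of raising."""
    good, bad = [], []
    for line in lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError as err:
            # Keep the raw input and the reason so it can be inspected later.
            bad.append({"raw": line, "error": str(err)})
    return good, bad


good, bad = parse_records(['{"id": 1}', '{"id": 2', '{"id": 3}'])
print(len(good), len(bad))  # 2 1
```

One malformed line no longer stops the run, and the rejected input is preserved for later inspection.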

Question 9

A startup is building a real-time dashboard that shows aggregated metrics from social media feeds. They expect up to 10,000 events per second. The data must be near-real-time (< 30 seconds latency) and stored in BigQuery for historical analysis. They have limited experience managing infrastructure. The CTO suggests using Apache Kafka on Compute Engine for ingestion. However, the data engineer recommends a fully managed solution. Which approach should the team adopt?

Accepted Answer

Use Cloud Pub/Sub for ingestion and Cloud Dataflow for streaming into BigQuery. Option C is correct because Cloud Pub/Sub provides a fully managed, scalable ingestion service that can handle 10,000+ events per second without infrastructure management, and Cloud Dataflow offers exactly-once, auto-scaling streaming into BigQuery with sub-30-second latency. This combination meets the near-real-time requirement while eliminating operational overhead, aligning with the data engineer's recommendation for a fully managed solution.

Answer

Use Cloud Functions to ingest events directly into BigQuery

Answer

Use Apache Kafka on Compute Engine for ingestion, then use Dataflow to write to BigQuery

Answer

Use App Engine to receive events and write to BigQuery

Question 10

A large retail company processes point-of-sale transactions from thousands of stores daily. The current batch pipeline runs on Cloud Dataproc using Spark and takes 3 hours to complete. The business wants to reduce processing time to under 30 minutes. The pipeline reads from Cloud Storage, joins with inventory data from BigQuery, performs aggregations, and writes to Cloud SQL for reporting. What is the most effective optimization?

Accepted Answer

Read inventory data from BigQuery and pre-join in BigQuery, then export to Cloud Storage as ORC files. Option B is correct because it offloads the join operation to BigQuery, which is optimized for large-scale analytics and can process the join much faster than Spark. By pre-joining and exporting the result as ORC files (a columnar format optimized for Spark), the pipeline avoids the expensive shuffle and data transfer between Cloud Storage and BigQuery, significantly reducing the overall processing time to meet the 30-minute target.

Answer

Migrate the pipeline to Cloud Dataflow with Apache Beam for auto-scaling

Answer

Write intermediate results to Cloud SQL instead of BigQuery for faster access

Answer

Increase the number of worker nodes in the Dataproc cluster

Question 11

A financial services company uses a Dataflow streaming pipeline to process real-time stock trades. The pipeline reads from Pub/Sub, enriches with reference data from Cloud Bigtable, and writes to BigQuery. Recently, they noticed an increase in processing latency during market open hours. Investigation shows that the pipeline is data-skewed: a few stock symbols generate 90% of the traffic. The team wants to reduce latency without changing the pipeline structure. What should they do?

Accepted Answer

Enable Dataflow Streaming Engine to dynamically repartition work. Option C is correct because Streaming Engine moves shuffle and state storage off the worker VMs and into the Dataflow service backend, which allows better handling of hot keys. Option A is wrong because adding workers may not help when the hot-key bottleneck sits within a single worker. Option B is wrong because reshuffling is already happening, and using a different window doesn't fix skew. Option D is wrong because waiting until there is no backlog is not a solution.

Answer

Increase the Pub/Sub subscription flow control to buffer less data

Answer

Use event-time windows based on trade timestamp to spread data

Answer

Increase the number of workers and use more CPU

Question 12

An e-commerce company runs a daily batch pipeline that processes clickstream data from Cloud Storage using Cloud Dataproc with Spark. The pipeline includes a join between a large fact table and a small dimension table. The dimension table is stored in Cloud Storage as a CSV file. The join is slow due to shuffling. The data engineer considers broadcasting the dimension table. However, the dimension table is updated daily and the pipeline reads the latest version. What is the best approach to implement this optimization?

Accepted Answer

Use DataFrame.join with broadcast hint on the dimension DataFrame. Option A is correct because broadcasting the small dimension table using the broadcast hint (e.g., `broadcast(dimensionDF)`) forces Spark to replicate the dimension data to all executor nodes, eliminating the need for a shuffle during the join. This is ideal when the dimension table is small enough to fit in executor memory, and since the pipeline reads the latest CSV daily, the broadcast will automatically use the updated data without additional code changes.
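To show what a broadcast join does conceptually, the plain-Python sketch below builds a lookup table from the small dimension data and lets every fact row probe it locally, so fact rows never need to be regrouped by key. The function `broadcast_hash_join` and the sample column names are hypothetical. This is a model of the technique, not Spark code.

```python
def broadcast_hash_join(fact_rows, dimension_rows, key):
    """Inner join in which each fact row probes a local copy of the small table."""
    lookup = {row[key]: row for row in dimension_rows}  # the "broadcast" copy
    joined = []
    for fact in fact_rows:
        dim = lookup.get(fact[key])
        if dim is not None:  # drop fact rows with no matching dimension row
            joined.append({**fact, **dim})
    return joined


facts = [
    {"product_id": 1, "clicks": 10},
    {"product_id": 2, "clicks": 4},
    {"product_id": 9, "clicks": 1},  # no matching dimension row
]
dimensions = [
    {"product_id": 1, "name": "kettle"},
    {"product_id": 2, "name": "toaster"},
]
result = broadcast_hash_join(facts, dimensions, "product_id")
print(result)
```

In PySpark the corresponding hint is written as `fact_df.join(broadcast(dim_df), "product_id")`, with `broadcast` imported from `pyspark.sql.functions`. The DataFrame names here are placeholders.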

Answer

Read the fact table and dimension table into separate DataFrames and use standard join

Answer

Read the dimension table as an RDD and collect as a map, then use map-side join

Answer

Increase the spark.sql.autoBroadcastJoinThreshold to a large value

Question 13

A company has a Dataflow pipeline that reads from Pub/Sub, applies transformations, and writes to BigQuery. The pipeline is failing with 'deadline exceeded' errors during peak hours. The team suspects that the pipeline cannot keep up with the incoming data rate. They also notice that the autoscaling algorithm sets maxNumWorkers to 10, but the pipeline only scales to 5 workers. What is the most likely cause of the inadequate scaling?

Accepted Answer

The pipeline is CPU-bound and the autoscaler evaluates that adding more workers would not improve throughput. Option D is correct because the autoscaler in Dataflow evaluates CPU utilization and throughput per worker. If the pipeline is CPU-bound, adding more workers does not reduce per-worker CPU load or improve throughput, so the autoscaler stops at 5 workers even though maxNumWorkers is 10. This is a classic symptom of a bottleneck that cannot be parallelized further, such as a single-threaded transformation or a hot key in a GroupByKey operation.

Answer

The maxNumWorkers setting is too low and should be reduced to trigger more aggressive scaling

Answer

BigQuery streaming quota is limiting the number of concurrent writes

Answer

The Pub/Sub subscription has a per-subscriber throughput limit of 5 workers

Question 14

A healthcare company processes patient data using a Dataflow pipeline that reads from Cloud Storage, transforms data, and writes to BigQuery. They need to ensure that the processing is idempotent to handle failures and retries without duplicating records. The data arrives in daily batches and may be re-delivered if earlier processing failed. What approach should they take to guarantee exactly-once processing in BigQuery?

Accepted Answer

Write data to a staging BigQuery table, then use a MERGE statement to upsert into the final table. Option D is correct because BigQuery load jobs are not idempotent by default; if a load job is retried, it can create duplicate rows. By writing to a staging table first and then using a MERGE statement (or an INSERT with a WHERE NOT EXISTS filter) to upsert into the final table, you can deduplicate based on a unique key. This approach guarantees exactly-once semantics even when the same batch is re-delivered, as the MERGE operation will only insert rows that do not already exist in the target table.
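The staging-then-upsert pattern can be tried locally with Python's built-in `sqlite3` module. SQLite stands in for BigQuery here purely for illustration, and because SQLite has no MERGE statement the sketch uses an INSERT guarded by NOT EXISTS to the same effect. Table and column names (`staging`, `final_records`, `record_id`, `payload`) are hypothetical.

```python
import sqlite3


def load_batch(conn, batch):
    """Load a batch into staging, then insert only rows whose key is new."""
    conn.execute("DELETE FROM staging")  # staging holds one batch at a time
    conn.executemany(
        "INSERT INTO staging (record_id, payload) VALUES (?, ?)", batch
    )
    conn.execute(
        """
        INSERT INTO final_records (record_id, payload)
        SELECT s.record_id, s.payload
        FROM staging AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM final_records AS f WHERE f.record_id = s.record_id
        )
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (record_id TEXT, payload TEXT)")
conn.execute("CREATE TABLE final_records (record_id TEXT PRIMARY KEY, payload TEXT)")

batch = [("p1", "visit"), ("p2", "lab")]
load_batch(conn, batch)
load_batch(conn, batch)  # the same batch is re-delivered
count = conn.execute("SELECT COUNT(*) FROM final_records").fetchone()[0]
print(count)  # 2
```

Loading the same batch twice leaves the final table unchanged, which is the property the question asks for.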

Answer

Use BigQuery's streaming inserts with InsertId to deduplicate

Answer

Ingest data via Pub/Sub and use a Dataflow pipeline with exactly-once processing

Answer

Use Dataflow's built-in exactly-once semantics and write to BigQuery via load jobs

Question 15

A company runs a Dataproc cluster with 10 worker nodes for a Spark streaming job that processes data from Pub/Sub (via Pub/Sub Lite) and writes to Cloud Storage. They observe that the job is producing many small files in Cloud Storage, leading to high costs and performance issues in downstream batch pipelines. The team wants to consolidate output files while maintaining low latency. What is the best solution?

Accepted Answer

Use windowed streaming with a longer window duration and Spark's file size configuration. Option B is correct because using a longer window duration in Spark Streaming allows more data to accumulate before each write, so each write produces fewer, larger files. Spark's output settings (e.g., `spark.sql.files.maxRecordsPerFile`, which caps the number of records per output file) then keep file sizes predictable. This reduces the number of small files in Cloud Storage while keeping latency low, without requiring an extra compaction job or a reduction in parallelism.
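The effect of window duration on file count can be estimated with a small model. It assumes, for illustration only, that every non-empty window writes exactly one file per output partition. The function `count_output_files` and the event pattern are hypothetical.

```python
def count_output_files(event_times_s, window_s, partitions):
    """Hypothetical model: each non-empty window writes one file per partition."""
    windows = {int(t // window_s) for t in event_times_s}
    return len(windows) * partitions


events = list(range(0, 600, 5))  # one event every 5 seconds for 10 minutes

short_windows = count_output_files(events, 10, 4)   # 10-second windows
long_windows = count_output_files(events, 120, 4)   # 2-minute windows
print(short_windows, long_windows)  # 240 20
```

Under these assumptions, moving from 10-second to 2-minute windows cuts the file count by a factor of twelve, at the price of up to two minutes of added output delay.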

Answer

Run a separate compaction job that periodically merges small files into larger ones

Answer

Reduce the number of workers to force more data per task

Answer

Switch from Dataproc to Dataflow, which has built-in file sharding optimization