Google Professional Data Engineer PDE Questions 451–499 | Page 7/7

451

MCQhard

A company runs a large Dataflow pipeline that aggregates user activity data from Pub/Sub into BigQuery every 10 minutes using fixed windows. Recently, the daily summary reports have shown 5-10% lower user engagement for certain segments compared to historical trends. The pipeline is completing successfully with no errors in Cloud Monitoring, and the Dataflow job dashboard shows all steps in green. There are no alarms. The team suspects data is being dropped or missed. They have verified that the Pub/Sub topic is receiving data correctly. After reviewing the pipeline code, they find that the pipeline uses a global window with a default 10-minute trigger, and writes results to a single BigQuery table partitioned by date. They also use exactly-once processing mode. Which of the following is the most likely cause and the best course of action to diagnose and fix the data quality issue?

A.Implement a retry mechanism in the Pub/Sub subscription to ensure no messages are lost.

B.Enable Cloud Logging for all pipeline steps and analyze the logs for dropped elements.

C.Add a global window with a late-data trigger to capture any data arriving after the window ends.

D.Use Dataflow’s built-in metrics to compare the number of elements read from Pub/Sub and written to BigQuery for each window.

AnswerD

This identifies exactly where data is lost, enabling targeted debugging without overhead.

Why this answer

Option D is correct because the pipeline uses a global window with a default 10-minute trigger, which means data is processed in micro-batches but the global window never closes, so late-arriving data is included. However, the team suspects data is being dropped, and the most direct way to diagnose this is to compare the number of elements read from Pub/Sub (using the Pub/Sub subscription's 'pubsub_subscription' metric) with the number of elements written to BigQuery (using the BigQuery sink's 'bigquery_rows_written' metric) for each window. This comparison will reveal if any data is lost between reading and writing, which is a common issue when using exactly-once processing mode with streaming inserts that may silently fail due to schema mismatches or quota limits.

Exam trap

The trap here is that candidates assume 'exactly-once processing' guarantees no data loss, but in reality, exactly-once only ensures no duplicates, not that all data is successfully written to the sink; silent failures in streaming inserts to BigQuery can cause data to be dropped without triggering pipeline errors.

How to eliminate wrong answers

Option A is wrong because Pub/Sub subscriptions already have built-in retry mechanisms (e.g., at-least-once delivery) and the issue is not about message loss from Pub/Sub; the team verified the topic is receiving data correctly. Option B is wrong because enabling Cloud Logging for all pipeline steps would generate excessive logs and is not the most efficient diagnostic approach; Dataflow already provides built-in metrics (e.g., 'pubsub_subscription' and 'bigquery_rows_written') that can directly compare element counts without needing to parse logs. Option C is wrong because the pipeline already uses a global window with a default 10-minute trigger, which inherently captures late data (since the global window never closes); adding a late-data trigger is redundant and does not address the potential data loss between Pub/Sub and BigQuery.

Google Professional Data Engineer (PDE) — Questions 451–499