A company runs a large Dataflow pipeline that aggregates user activity data from Pub/Sub into BigQuery every 10 minutes using fixed windows. Recently, the daily summary reports have shown 5-10% lower user engagement for certain segments compared to historical trends. The pipeline is completing successfully with no errors in Cloud Monitoring, and the Dataflow job dashboard shows all steps in green. There are no alarms. The team suspects data is being dropped or missed. They have verified that the Pub/Sub topic is receiving data correctly. After reviewing the pipeline code, they find that the pipeline uses a global window with a default 10-minute trigger, and writes results to a single BigQuery table partitioned by date. They also use exactly-once processing mode. Which of the following is the most likely cause and the best course of action to diagnose and fix the data quality issue?
This identifies exactly where data is lost, enabling targeted debugging without overhead.
Why this answer
Option D is correct because the pipeline uses a global window with a default 10-minute trigger, which means data is processed in micro-batches but the global window never closes, so late-arriving data is included. However, the team suspects data is being dropped, and the most direct way to diagnose this is to compare the number of elements read from Pub/Sub (using the Pub/Sub subscription's 'pubsub_subscription' metric) with the number of elements written to BigQuery (using the BigQuery sink's 'bigquery_rows_written' metric) for each window. This comparison will reveal if any data is lost between reading and writing, which is a common issue when using exactly-once processing mode with streaming inserts that may silently fail due to schema mismatches or quota limits.
Exam trap
The trap here is that candidates assume 'exactly-once processing' guarantees no data loss, but in reality, exactly-once only ensures no duplicates, not that all data is successfully written to the sink; silent failures in streaming inserts to BigQuery can cause data to be dropped without triggering pipeline errors.
How to eliminate wrong answers
Option A is wrong because Pub/Sub subscriptions already have built-in retry mechanisms (e.g., at-least-once delivery) and the issue is not about message loss from Pub/Sub; the team verified the topic is receiving data correctly. Option B is wrong because enabling Cloud Logging for all pipeline steps would generate excessive logs and is not the most efficient diagnostic approach; Dataflow already provides built-in metrics (e.g., 'pubsub_subscription' and 'bigquery_rows_written') that can directly compare element counts without needing to parse logs. Option C is wrong because the pipeline already uses a global window with a default 10-minute trigger, which inherently captures late data (since the global window never closes); adding a late-data trigger is redundant and does not address the potential data loss between Pub/Sub and BigQuery.