A data engineer needs to design a stream processing pipeline that reads events from Pub/Sub, enriches them with data from a Cloud Storage file, and writes aggregated results to BigQuery. The pipeline must handle late-arriving events up to 1 hour. Which Dataflow feature should be used to manage late data?
Trap 1: Triggers
Triggers control when a window's aggregated results are emitted; they do not decide whether an element counts as late. Lateness is measured against the watermark.
Trap 2: Side inputs
Side inputs are for enriching streams with static or slowly changing data, not for managing late data.
Trap 3: Windowing
Windowing groups elements into event-time windows; whether an element is late for its window is determined by the watermark.
- A
Triggers
Why wrong: Triggers control when a window's aggregated results are emitted; they do not decide whether an element counts as late. Lateness is measured against the watermark.
- B
Watermarks
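Why correct: The watermark is Dataflow's estimate of how far event time has progressed. An element is late when it arrives after the watermark has passed the end of its window. With an allowed lateness of 1 hour set on the window, such elements are still processed; only elements that arrive after the watermark passes the window end plus 1 hour are dropped.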
- C
Side inputs
Why wrong: Side inputs are for enriching streams with static or slowly changing data, not for managing late data.
- D
Windowing
Why wrong: Windowing groups elements into event-time windows; whether an element is late for its window is determined by the watermark.
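
The interaction between the watermark and allowed lateness can be sketched as a toy model. This is plain Python, not the Apache Beam or Dataflow API: the names `WINDOW_SIZE`, `window_end`, and `classify` are hypothetical, the 5-minute fixed window is an assumption (the question does not state a window size), and boundary handling is simplified. Only the 1-hour allowed lateness comes from the question.

```python
# Toy model of how a runner might classify an element against the watermark.
# Not the Beam API; all names here are illustrative assumptions.

WINDOW_SIZE = 300        # assumed 5-minute fixed windows, in seconds
ALLOWED_LATENESS = 3600  # 1 hour, from the question


def window_end(event_time: int) -> int:
    """Return the (exclusive) end of the fixed window containing event_time."""
    return (event_time // WINDOW_SIZE + 1) * WINDOW_SIZE


def classify(event_time: int, watermark: int) -> str:
    """Classify an element given the watermark at the moment it arrives."""
    end = window_end(event_time)
    if watermark < end:
        # The watermark has not yet passed the window end.
        return "on_time"
    if watermark < end + ALLOWED_LATENESS:
        # Late, but still inside the allowed-lateness horizon.
        return "late_accepted"
    # The watermark is past window end + allowed lateness.
    return "dropped"


if __name__ == "__main__":
    # An event stamped t=100 belongs to the window [0, 300).
    for wm in (200, 300, 3899, 3900):
        print(wm, classify(100, wm))
```

In this sketch the element is never judged by a trigger or by the window assignment alone: the window fixes the deadline, and the watermark decides which side of the deadline the element falls on.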