This chapter covers Cloud Dataflow, Google's fully managed stream and batch processing service based on Apache Beam. As a core component of the Data and Analytics domain, Dataflow appears in roughly 10-15% of ACE exam questions, often in scenarios involving real-time data pipelines, ETL, and integration with other Google Cloud services like Pub/Sub, BigQuery, and Cloud Storage. You will learn the unified batch/streaming model, key concepts like pipelines, PCollections, PTransforms, windows, watermarks, and triggers, as well as how to choose between Dataflow and alternatives like Dataproc or Cloud Composer.
Jump to a section
Imagine a car factory assembly line that can build cars from both pre-cut parts (batch) and custom orders that arrive in real time (stream). The factory has a conveyor belt system that can handle both modes. For batch, parts are stacked in a staging area, then the belt runs at full speed to assemble a fixed number of cars. For stream, a second, parallel belt runs continuously, accepting individual parts as they arrive and assembling them on the fly. The factory manager can decide to merge the two belts: when a custom part arrives, it can be added to the batch line for immediate assembly, or batch parts can be held to wait for streaming parts. The key is that the factory uses a unified scheduling system that treats both modes the same way at the assembly stations—each station just processes whatever item comes next, whether from the batch pile or the live feed. If the live feed is slow, the batch pile fills in; if the batch pile runs out, the live feed keeps going. This is exactly how Dataflow unifies batch and stream processing under a single Beam model: batch is a bounded stream with a defined start and end, while streaming is unbounded. The underlying execution engine handles both identically, using watermarks to track progress and triggers to emit results.
What is Cloud Dataflow?
Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within Google Cloud. Apache Beam is an open-source unified programming model that allows you to define both batch and streaming data processing jobs with the same code. Dataflow handles the underlying resource provisioning, autoscaling, and fault tolerance automatically. The service is designed for high-throughput, low-latency data processing, and it integrates natively with other GCP services like Pub/Sub, BigQuery, Cloud Storage, and Bigtable.
Why Unified Batch and Streaming?
Traditional data processing systems required separate codebases for batch (e.g., Hadoop MapReduce) and streaming (e.g., Apache Storm). This led to duplicated logic and operational complexity. Apache Beam's unified model, which Dataflow implements, treats batch as a special case of streaming: a batch pipeline processes a bounded dataset (finite size, known start and end), while a streaming pipeline processes an unbounded dataset (data arrives continuously, no fixed end). The same Beam primitives—PCollections, PTransforms, and pipeline I/O—work for both. Dataflow's execution engine uses the same runtime for both modes, optimizing for latency in streaming and throughput in batch.
Core Concepts
Pipeline: A directed acyclic graph (DAG) of steps that defines the data processing workflow. You construct a pipeline object, apply transforms, and run it.
PCollection: A distributed, immutable collection of data elements. It can be bounded (batch) or unbounded (streaming).
PTransform: A processing step that takes one or more PCollections as input and produces one or more PCollections as output. Examples: ParDo, GroupByKey, Combine, Flatten.
Pipeline I/O: Sources and sinks like ReadFromPubSub, WriteToBigQuery, ReadFromText, WriteToText.
Runner: The execution engine. Dataflow Runner is the managed service runner; there are also runners for Apache Spark, Flink, and direct local execution.
How Dataflow Works Internally
When you submit a pipeline, Dataflow performs the following steps:
Pipeline Construction: Your Beam code (Java, Python, or Go) defines the pipeline DAG. The SDK constructs a portable representation of the pipeline.
Optimization: The Dataflow service optimizes the execution plan, fusing adjacent transforms, reordering operations, and applying combiners.
Resource Allocation: Dataflow automatically provisions a pool of Compute Engine instances (workers) based on the pipeline's parallelism requirements. By default, it uses the n1-standard-2 machine type, but you can customize this.
Execution: Workers execute the pipeline stages. Each stage may be parallelized across multiple workers. Data handles are distributed across workers using a shuffle service.
Fault Tolerance: If a worker fails, Dataflow restarts the failed work on another worker. Streaming pipelines use checkpointing to avoid data loss.
Autoscaling: For batch pipelines, Dataflow dynamically adjusts the number of workers based on the remaining work. For streaming, it scales based on the throughput of the incoming data.
Windows, Watermarks, and Triggers
For unbounded (streaming) data, three concepts are critical:
Windows: Divide the unbounded PCollection into finite chunks based on time or other criteria. Common window types:
- Fixed windows: e.g., 1-minute windows - Sliding windows: e.g., 5-minute windows sliding every 1 minute - Sessions: windows that close after a period of inactivity - Global window: all data in one window (used in batch or with custom triggers) - Watermarks: An estimate of how complete the data is up to a certain point in time. It tracks the progress of event time. Dataflow uses a watermark to know when it's safe to close a window and emit results. Late data (data that arrives after the watermark) can be handled by allowed lateness. - Triggers: Determine when to emit the results of a window. Examples:
- AfterWatermark: emit when the watermark passes the end of the window - AfterProcessingTime: emit after a certain amount of processing time - AfterCount: emit after a certain number of elements - Repeatedly: combine with other triggers to emit early and on late firings
Configuration and Defaults
Worker Machine Type: Default n1-standard-2. You can specify with --machine-type.
Number of Workers: Default is computed automatically. You can set min and max with --num-workers and --max-workers.
Disk Size: Default 250 GB per worker. Adjust with --disk-size-gb.
Region: Dataflow runs in a regional endpoint. You must specify a region (e.g., us-central1).
Networking: Workers can use a VPC network. For private access, use Private Google Access.
Service Account: Dataflow uses a default Compute Engine service account unless you specify a custom one with --service-account.
Integration with Other GCP Services
Pub/Sub: Common source for streaming pipelines. Use ReadFromPubSub with a subscription.
BigQuery: Sink for both batch and streaming. Use WriteToBigQuery with write disposition (WRITE_APPEND, WRITE_TRUNCATE, WRITE_EMPTY).
Cloud Storage: Source for batch (text, Avro, Parquet) and sink for any pipeline.
Bigtable: Low-latency sink for streaming.
Cloud Spanner: Sink for transactional workloads.
Datastore: Good for moderate throughput.
Command-Line Interface
You can run a Dataflow pipeline using the gcloud command or the mvn/gradle exec plugin. Example Python pipeline submission:
python -m my_pipeline \
--runner DataflowRunner \
--project my-project \
--region us-central1 \
--staging_location gs://my-bucket/staging \
--temp_location gs://my-bucket/temp \
--template_location gs://my-bucket/templates/my_templateMonitoring and Logging
Dataflow Monitoring Interface: GCP Console provides a graphical view of pipeline progress, wall time, and element counts.
Stackdriver (Cloud Monitoring): Metrics like system latency, user time, and watermark lag.
Logging: Pipeline logs are sent to Cloud Logging. You can view them in the Console or export to BigQuery.
Common Pitfalls
Not setting a region: The pipeline will fail.
Forgetting to stage files: The pipeline code and dependencies must be staged in GCS.
Using too many workers for a small dataset: Over-parallelization can cause overhead.
Ignoring watermark lag: High lag indicates data is arriving late; adjust allowed lateness or windows.
Not handling schema changes in streaming to BigQuery: Use WRITE_TRUNCATE carefully.
Key Values and Defaults
Default worker machine type: n1-standard-2
Default disk size per worker: 250 GB
Default streaming engine: Dataflow Streaming Engine (since 2020)
Default autoscaling algorithm: THROUGHPUT_BASED
Maximum number of workers: 1000 (can be increased via quota request)
Minimum watermark granularity: 1 second
Allowed lateness default: 0 (must be explicitly set)
Summary
Dataflow abstracts away cluster management and provides a unified programming model for batch and streaming. Understanding windows, watermarks, and triggers is essential for streaming pipelines. The ACE exam tests your ability to choose the right service for a given scenario, configure pipelines correctly, and troubleshoot common issues.
Define pipeline in Beam SDK
Write your data processing logic using Apache Beam SDK (Java, Python, or Go). The pipeline is a DAG of PTransforms applied to PCollections. For example, a typical pipeline reads from a source (Pub/Sub for streaming, GCS for batch), applies transformations like ParDo for element-wise processing, GroupByKey for grouping, and writes to a sink (BigQuery, GCS). The code is identical for batch and streaming except for the source and windowing strategy. You must define windowing (e.g., FixedWindows.of(Duration.standardMinutes(1))) for unbounded PCollections. The pipeline object is constructed but not yet executed.
Set runner and submit job
Specify the DataflowRunner as the execution engine. You submit the pipeline using a command like `python -m my_pipeline --runner DataflowRunner --project my-project --region us-central1 --staging_location gs://my-bucket/staging --temp_location gs://my-bucket/temp`. The SDK packages your code and dependencies into a staging location in Cloud Storage. Dataflow then creates a job on the service. The job ID is returned. You can also use templates to pre-stage pipelines for repeated execution.
Resource provisioning and optimization
Dataflow analyzes the pipeline DAG and determines the optimal execution plan. It fuses adjacent transforms to reduce serialization overhead. It then provisions a pool of Compute Engine worker instances. By default, it uses n1-standard-2 machines. The number of workers is determined by the autoscaling algorithm (THROUGHPUT_BASED by default). For batch, workers are provisioned based on the estimated work; for streaming, a minimum number of workers is kept. The workers are configured with a boot disk (default 250 GB) and network settings.
Execution of pipeline stages
Workers execute the pipeline stages in parallel. Each stage is a group of fused transforms. Data elements flow through the stages. For batch, data is read from the source and processed in parallel splits. For streaming, data is read from Pub/Sub or other streaming sources, and windowing is applied. Workers communicate via a shuffle service for group-by operations. The pipeline progress is tracked: each element goes through a series of stages. If a worker fails, the work is retried on another worker. Checkpoints are taken periodically for stateful transforms.
Window assignment and watermark tracking
For unbounded PCollections, each element is assigned to one or more windows based on its event timestamp and the windowing strategy. Dataflow tracks a watermark, which is an estimate of the maximum event time that has been observed. The watermark advances as data arrives. When the watermark passes the end of a window, that window is considered complete (no more late data will arrive unless allowed lateness is set). Triggers define when to emit the window's results: e.g., after watermark, early firings, or on late data. Late data can still be processed within the allowed lateness period.
Output and cleanup
The results of each window (or global window for batch) are written to the specified sink (e.g., BigQuery table, GCS files). For streaming, the pipeline runs continuously until cancelled. For batch, the pipeline terminates when all data is processed. After completion, Dataflow automatically decommissions the worker instances, and logs are available in Cloud Logging. The staging and temp locations in GCS are not automatically cleaned up; you must manage them separately.
Enterprise Scenario 1: Real-Time Analytics for E-Commerce
A large e-commerce company needs to process clickstream data from millions of users in real time to update dashboards and trigger recommendations. They use Pub/Sub to ingest click events, then a Dataflow streaming pipeline that reads from a Pub/Sub subscription, applies windowing (1-minute fixed windows), computes aggregate metrics (page views, add-to-cart rates), and writes the results to BigQuery for dashboarding and to Cloud Bigtable for fast lookups. The pipeline uses the Dataflow Streaming Engine to handle spikes during flash sales. They set --max-workers=500 to cap costs. Common issues: watermark lag during traffic bursts — they adjust maxNumWorkers and use a larger machine type. They also set allowedLateness=5 minutes to capture late-arriving events from mobile clients with poor connectivity.
Enterprise Scenario 2: Batch ETL for Data Warehousing
A financial services firm runs nightly batch ETL jobs to transform raw transaction logs (stored as Avro files in GCS) into a denormalized format for BigQuery. They use a Dataflow batch pipeline with ReadFromAvro, ParDo for cleansing and enrichment, and WriteToBigQuery with WRITE_TRUNCATE to replace the daily table. The pipeline is run via a Cloud Composer (Airflow) DAG that triggers a Dataflow template. They use --machine-type=n1-highmem-4 to handle memory-intensive joins. Autoscaling is enabled with --max-workers=100. A common mistake: forgetting to set --temp_location causes the pipeline to fail. They also monitor shuffle costs and optimize by using combiners.
Enterprise Scenario 3: IoT Sensor Data Aggregation
An industrial IoT company collects sensor readings from thousands of devices via Pub/Sub. They need to detect anomalies in near real time. The Dataflow streaming pipeline reads sensor data, applies a sliding window (5-minute window sliding every 1 minute), computes average and standard deviation, and writes alerts to a Pub/Sub topic. They use stateful processing (ParDo with state) to track previous readings per sensor. They encounter issues with state size growing unbounded; they use stateful transforms with Timer to clear state periodically. They also set --disk-size-gb=100 to reduce cost, but find that shuffle performance degrades — they revert to 250 GB. They use Cloud Monitoring alerts on dataflow.googleapis.com/job/watermark_lag to detect processing delays.
What the ACE Tests
The ACE exam (Objective 3.3) focuses on your ability to decide when to use Dataflow versus alternatives like Dataproc or Cloud Composer, understand the unified batch/streaming model, and configure basic pipelines. Specific areas: - Choosing Dataflow for unified batch/streaming: Questions present a scenario requiring both batch and streaming with the same code — Dataflow is the answer. - Windows, watermarks, triggers: Know the difference between fixed, sliding, session windows. Understand that watermarks track event time completeness, not processing time. - Autoscaling and resource configuration: Defaults (machine type, disk size) and how to adjust them. - Integration with Pub/Sub and BigQuery: Common source/sink patterns. - Dataflow templates: Purpose (reuse, staging) and how to create them. - Streaming versus batch: Batch reads bounded data, streaming reads unbounded. Both use same Beam API.
Common Wrong Answers
Choosing Dataproc over Dataflow for a streaming pipeline: Candidates see "Hadoop" or "Spark" in the scenario and pick Dataproc. But Dataproc requires cluster management and is not serverless. Dataflow is the managed service for Beam, which natively supports streaming.
Selecting Cloud Composer to run a streaming pipeline: Composer orchestrates batch workflows; it does not execute streaming pipelines. Dataflow is the execution engine.
Thinking batch and streaming require different code: The exam stresses that Beam unifies them. Wrong answers say "recompile" or "rewrite" the pipeline.
Confusing watermarks with processing time: A question might ask "when does a window close?" The correct answer is "when the watermark passes the end of the window," not "after a fixed processing time."
Incorrectly setting `num_workers` for streaming: Some think you must set a fixed number; Dataflow autoscales by default. Setting --num-workers sets the initial number, but autoscaling can change it.
Specific Numbers and Terms
Default machine type: n1-standard-2
Default disk size: 250 GB
Streaming engine: Dataflow Streaming Engine (enabled by default since 2020)
Autoscaling algorithm: THROUGHPUT_BASED
Maximum workers: 1000 (default quota 100)
Window types: Fixed, Sliding, Sessions, Global
Trigger: AfterWatermark, AfterProcessingTime, AfterCount, Repeatedly
Allowed lateness: must be set explicitly, default 0
Pipeline submission: --runner DataflowRunner
Edge Cases and Exceptions
Global window with triggers: Even in streaming, you can use the global window with custom triggers (e.g., after count) to get microbatch behavior.
Dataflow Prime: A newer pricing model that offers cost savings for flexible workloads; not heavily tested on ACE but good to know.
FlexRS: For batch, you can use Flexible Resource Scheduling to reduce cost by allowing preemption.
Cross-region pipelines: Dataflow jobs run in a specific region; data movement between regions incurs network costs.
How to Eliminate Wrong Answers
Look for keywords: "unified batch and streaming" → Dataflow. "Managed cluster for Spark" → Dataproc. "Workflow orchestration" → Composer. "Real-time processing from Pub/Sub" → Dataflow. "Batch ETL from GCS to BigQuery" → Dataflow or Dataproc depending on complexity; Dataflow is simpler for straightforward ETL. Also, eliminate any answer that mentions manual cluster management if the question asks for serverless.
Dataflow is a fully managed service for executing Apache Beam pipelines, supporting both batch and streaming with the same code.
Batch processes bounded data; streaming processes unbounded data. The unified model treats batch as a special case.
Key streaming concepts: windows (fixed, sliding, session), watermarks (event time progress), and triggers (when to emit results).
Default worker machine type: n1-standard-2. Default disk size: 250 GB. Autoscaling algorithm: THROUGHPUT_BASED.
Dataflow integrates natively with Pub/Sub, BigQuery, Cloud Storage, Bigtable, and Spanner.
Use Dataflow over Dataproc for serverless, unified batch/streaming pipelines. Use Dataproc for Spark/Hadoop-specific workloads.
Dataflow templates allow you to pre-stage pipelines for repeated execution without recompilation.
Monitor pipelines via Cloud Monitoring metrics like watermark lag and system latency.
Always set a region when submitting a job, and ensure staging and temp locations are in Cloud Storage.
For streaming, set allowedLateness to handle late-arriving data; default is 0.
These come up on the exam all the time. Here's how to tell them apart.
Cloud Dataflow
Fully managed, serverless; no cluster management.
Unified batch and streaming model (Apache Beam).
Autoscaling based on pipeline throughput.
Best for ETL, real-time analytics, and simple transformations.
Pricing per second for worker resources.
Cloud Dataproc
Managed cluster for Hadoop/Spark; you manage cluster lifecycle.
Separate batch (Spark) and streaming (Spark Streaming) codebases.
Autoscaling via cluster scaling policies (not per-pipeline).
Best for complex ML, iterative algorithms, and legacy Hadoop jobs.
Pricing per cluster hour (including idle time).
Mistake
Dataflow only supports streaming, not batch.
Correct
Dataflow supports both batch and streaming using the same Apache Beam SDK. Batch is a special case of streaming where the data is bounded.
Mistake
You must use separate code for batch and streaming pipelines.
Correct
The Beam unified model allows the same pipeline code to run in both modes. Only the source (e.g., Pub/Sub vs. GCS) and windowing strategy differ.
Mistake
Watermarks are based on processing time.
Correct
Watermarks are based on event time, not processing time. They estimate how complete the data is up to a certain event time.
Mistake
Dataflow automatically cleans up staging files after the job completes.
Correct
Dataflow does not clean up staging or temp locations in Cloud Storage. You must manage them manually or set lifecycle policies.
Mistake
Setting --num-workers to a high value always improves performance.
Correct
Over-parallelization can cause overhead due to shuffle and coordination. Dataflow autoscaling is often better; setting max workers prevents runaway scaling.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Dataflow is a serverless service for Apache Beam pipelines, ideal for unified batch/streaming ETL. Dataproc is a managed Hadoop/Spark cluster service, better for complex analytics and legacy jobs. Dataflow requires no cluster management; Dataproc requires you to manage cluster lifecycle (though it can be auto-scaled).
Yes, Apache Beam allows the same pipeline code to run in both modes. The only differences are the source (bounded vs. unbounded) and the windowing strategy (required for streaming). You can conditionally set windowing based on the pipeline options.
A watermark is an estimate of the maximum event time that has been observed in the streaming data. It tracks how complete the data is up to a certain point in time. When the watermark passes the end of a window, that window is considered complete and results can be emitted (unless late data is allowed).
Dataflow uses a THROUGHPUT_BASED autoscaling algorithm by default. For batch, it adjusts the number of workers based on the remaining work. For streaming, it scales based on the throughput of the incoming data. You can set min and max workers using --num-workers and --max-workers.
Dataflow templates allow you to stage a pipeline in Cloud Storage so that it can be run repeatedly without recompiling the code. You create a template by specifying --template_location during job submission. Later, you can launch jobs from the template via the Console, gcloud, or API, passing runtime parameters.
Use the allowedLateness parameter on your windowing strategy. For example, `Window.into(FixedWindows.of(Duration.standardMinutes(1))).withAllowedLateness(Duration.standardMinutes(5))`. Late data within the allowed period will be processed and can trigger additional firings if you use appropriate triggers.
The default machine type is n1-standard-2 (2 vCPUs, 7.5 GB memory). The default disk size is 250 GB per worker. You can override these with --machine-type and --disk-size-gb.
You've just covered Cloud Dataflow for Stream and Batch — now see how well it sticks with free ACE practice questions. Full explanations included, no account needed.
Done with this chapter?