This chapter covers real-time analytics using Google Cloud Pub/Sub and Dataflow, two core services for building streaming data pipelines. For the GCDL exam, this topic is part of Domain 3 (Data Analytics and AI) and appears in approximately 10-15% of questions. You will need to understand the roles of Pub/Sub as a messaging middleware and Dataflow as a stream and batch processing engine, their key features, and how they integrate to enable real-time insights. The exam focuses on conceptual understanding rather than deep technical configuration, but you must know the differences between push and pull subscriptions, exactly-once vs. at-least-once delivery, and when to use Dataflow vs. other processing options.
Jump to a section
Imagine a massive warehouse where packages arrive at high speed from thousands of shipping docks. Each package has a label with a topic (e.g., 'electronics', 'clothing'). Instead of workers rushing to grab each package as it arrives, packages are placed onto a conveyor belt system called Pub/Sub. This belt can hold millions of packages in queues (subscriptions) and workers (subscribers) can pick them up at their own pace. If a worker is too slow, the package stays on the belt (message retention up to 7 days). If a worker crashes, the package remains until another worker picks it up (at-least-once delivery). Now, imagine a factory called Dataflow that sits at the end of the belt. Dataflow can process packages in real time, grouping them by destination, applying transformations like adding a barcode, and then sending them to different trucks. Dataflow uses the same programming model (Apache Beam) to handle both streaming (continuous belt) and batch (full warehouse dump) processing. If the belt gets too fast, Dataflow automatically adds more workers (autoscaling) to keep up. If a package fails processing, Dataflow can redirect it to a dead-letter queue for later inspection. This entire system ensures that data flows reliably from producers to consumers, with exactly-once processing semantics, even if components fail.
What is Pub/Sub?
Google Cloud Pub/Sub is a fully managed, real-time messaging service that allows you to send and receive messages between independent applications. It decouples senders (publishers) from receivers (subscribers) by introducing an intermediary topic. Publishers send messages to a topic, and subscribers receive messages from a subscription attached to that topic. Pub/Sub guarantees at-least-once delivery of messages, meaning a message may be delivered more than once, but it will never be lost. Messages are retained for up to 7 days if not acknowledged.
How Pub/Sub Works Internally
When a publisher sends a message to a topic, Pub/Sub assigns a unique message ID and stores the message redundantly across multiple zones. The message is then fanned out to all subscriptions on that topic. Each subscription maintains its own queue of messages. Subscribers pull messages from the subscription (pull mode) or receive them via a webhook (push mode). After processing, the subscriber sends an acknowledgment (ACK) to remove the message from the subscription queue. If the subscriber does not ACK within the acknowledgement deadline (default 10 seconds, configurable up to 600 seconds), the message is redelivered. This ensures at-least-once delivery. The maximum message size is 10 MB, and the default message retention is 7 days.
Key Components and Defaults
Topic: A named resource to which messages are sent. Publishers create messages and publish them to a topic.
Subscription: A named resource representing a stream of messages from a single topic. Subscriptions can be pull or push.
Message: A combination of data (payload, up to 10 MB) and attributes (key-value metadata).
Acknowledgment Deadline: The time a subscriber has to ACK a message after delivery. Default: 10 seconds. Max: 600 seconds.
Message Retention Duration: How long unacknowledged messages are kept. Default: 7 days. Range: 10 minutes to 7 days.
Exactly-once Delivery: Available for pull subscriptions with the enable_exactly_once_delivery flag. Ensures each message is delivered exactly once, at the cost of higher latency and lower throughput.
Ordering Keys: Messages with the same ordering key are delivered in order within a region. Not available for push subscriptions.
What is Dataflow?
Dataflow is a fully managed service for executing Apache Beam pipelines. Apache Beam is an open-source, unified programming model that allows you to define both batch and streaming data processing jobs with the same code. Dataflow handles the execution, scaling, and fault tolerance automatically. It is ideal for ETL (Extract, Transform, Load), real-time analytics, and event-driven applications.
How Dataflow Works Internally
Dataflow pipelines are defined as a Directed Acyclic Graph (DAG) of transforms (PTransforms) applied to collections of data (PCollections). For streaming, Dataflow uses a technique called "windowing" to group unbounded data into finite windows (e.g., fixed, sliding, session). Each window is processed when its watermark (an estimate of event time completeness) passes a threshold. Dataflow supports exactly-once processing guarantees for streaming, using checkpointing and commit logs. It uses Google Cloud Storage for temporary storage and can automatically scale workers (up to 1000) based on the backlog of unprocessed data.
Key Dataflow Concepts
Pipeline: A directed acyclic graph of steps that reads, transforms, and writes data.
PCollection: An immutable collection of data that can be bounded (batch) or unbounded (streaming).
PTransform: A data processing operation that transforms a PCollection into another PCollection.
Windowing: Divides an unbounded PCollection into finite chunks based on event time or processing time.
Watermark: An estimate of the completeness of event time data. Dataflow uses the watermark to trigger window processing and handle late data.
Trigger: Determines when to emit results for a window. Can be event-time, processing-time, or data-driven.
Autoscaling: Dataflow automatically adjusts the number of workers based on CPU utilization and backlog. The maximum number of workers can be set (default 1000).
Flexible Resource Scheduling (FlexRS): A batch-only pricing model that offers lower cost in exchange for potential delays (up to 6 hours).
Integration of Pub/Sub and Dataflow
A common pattern is to use Pub/Sub as a source for Dataflow pipelines. Dataflow subscribes to a Pub/Sub subscription (pull mode) and processes messages in real time. The pipeline can then write results to BigQuery, Cloud Storage, or another sink. Dataflow automatically acknowledges messages after successful processing, ensuring exactly-once semantics if configured. The integration supports both streaming and batch (using Pub/Sub's seek feature to replay messages).
Configuration and Verification
To create a Pub/Sub topic and subscription:
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topicTo run a Dataflow streaming pipeline with Pub/Sub source:
gcloud dataflow jobs run my-streaming-job \
--gcs-location gs://my-bucket/templates/streaming_beam_pipeline \
--parameters inputSubscription=projects/my-project/subscriptions/my-sub \
--region us-central1To verify the pipeline is running:
gcloud dataflow jobs list --region us-central1 --filter="STATE:RUNNING"For monitoring, use the Dataflow monitoring interface in the console, which shows step-by-step throughput, watermark lag, and system lag.
Interaction with Other Services
BigQuery: Dataflow can write directly to BigQuery for analytics.
Cloud Storage: Used for staging files and temporary storage.
Cloud Functions: Can be triggered by Pub/Sub messages for lightweight processing.
Cloud Run / GKE: Can host custom subscribers using Pub/Sub client libraries.
Data Catalog: Can be used to discover and manage Pub/Sub schemas.
Performance Considerations
Pub/Sub throughput scales horizontally; topics can handle millions of messages per second.
Dataflow autoscaling is reactive; it takes about 2-3 minutes to add workers.
For high throughput, use pull subscriptions with flow control to avoid overwhelming the subscriber.
Message ordering keys reduce parallelism; use only when necessary.
Exactly-once delivery in Pub/Sub reduces throughput by about 50% compared to at-least-once.
Common Pitfalls
Not setting an appropriate acknowledgment deadline can cause unnecessary redeliveries.
Using push subscriptions without a properly configured endpoint can lead to message loss.
Ignoring watermark lag in Dataflow can indicate that the pipeline is falling behind.
Not handling late data (e.g., setting allowed lateness) can cause data loss in streaming pipelines.
Publisher sends message to topic
The publisher application creates a message with a payload (e.g., JSON) and optional attributes. It calls the Pub/Sub API to publish the message to a specific topic. Pub/Sub assigns a unique message ID and stores the message redundantly across multiple zones. The publish request returns immediately with the message ID. The message is not yet delivered to any subscriber; it is held in the topic's storage.
Message fanned out to subscriptions
Pub/Sub internally fans out the message to all subscriptions attached to the topic. Each subscription gets its own copy of the message. The message is enqueued in the subscription's message queue. If a subscription is configured with a filter, only messages matching the filter are enqueued. The message remains in the queue until it is acknowledged or the retention period expires.
Subscriber pulls message from subscription
The subscriber (e.g., Dataflow worker) sends a pull request to the subscription. Pub/Sub returns a batch of messages (up to 1000 or 10 MB, whichever is smaller). The subscriber then processes the messages. While processing, the subscriber must acknowledge each message within the acknowledgment deadline (default 10 seconds). If the subscriber needs more time, it can modify the deadline (up to 600 seconds).
Subscriber acknowledges message
After successful processing, the subscriber sends an ACK request to Pub/Sub for each message. Pub/Sub then removes the message from the subscription queue. If the subscriber fails to ACK within the deadline, the message becomes eligible for redelivery. This ensures at-least-once delivery. For exactly-once delivery, Pub/Sub uses a deduplication mechanism based on the message ID and a lease system.
Dataflow pipeline processes message
Dataflow receives the message from Pub/Sub as a PCollection element. The pipeline applies transforms (e.g., parsing, filtering, windowing). For streaming, Dataflow assigns a timestamp (event time or publish time) and places the element into the appropriate window. The window is triggered when the watermark passes the window end, and results are emitted. Dataflow handles late data within a configurable allowed lateness period.
Dataflow writes results to sink
After processing, Dataflow writes the transformed data to a sink such as BigQuery, Cloud Storage, or another Pub/Sub topic. For BigQuery, Dataflow uses streaming inserts or file loads. Dataflow acknowledges the Pub/Sub message only after the output is committed. This ensures exactly-once processing from source to sink. If the pipeline fails, it can resume from the last checkpoint.
Enterprise Scenario 1: Real-Time Fraud Detection
A financial services company processes millions of credit card transactions per second. They use Pub/Sub to ingest transaction events from point-of-sale systems. Each transaction is published to a transactions topic. A Dataflow streaming pipeline subscribes to the topic, enriches the transaction with customer profile data from Cloud Bigtable, and applies a machine learning model to score fraud risk. High-risk transactions are published to a fraud-alerts topic, which triggers a Cloud Function to notify the customer via SMS. The pipeline uses sliding windows of 5 minutes to aggregate transaction counts per card. Dataflow's autoscaling handles traffic spikes during holiday seasons. The company configures Pub/Sub's exactly-once delivery to avoid duplicate fraud alerts. A common misconfiguration is setting the acknowledgment deadline too low, causing redeliveries and duplicate processing. The solution uses a deadline of 60 seconds to account for ML inference time.
Enterprise Scenario 2: IoT Sensor Data Aggregation
A manufacturing plant has thousands of IoT sensors sending temperature, pressure, and vibration data every second. Each sensor publishes to a Pub/Sub topic with a sensor ID attribute. A Dataflow pipeline reads from the subscription, groups data by sensor ID using session windows (gap duration 10 seconds), and calculates moving averages. The results are written to BigQuery for long-term storage and to a real-time dashboard via Data Studio. Dataflow's watermark helps handle out-of-order data due to network delays. The plant uses Pub/Sub's ordering keys to ensure sensor data is processed in sequence. A key challenge is managing message volume: each sensor sends 86,400 messages per day, so with 10,000 sensors, the pipeline must handle 864 million messages daily. Dataflow's autoscaling scales to 200 workers during peak. Misconfiguration of window size (too small) can cause excessive output, while too large a window delays insights.
Scenario 3: Clickstream Analytics for E-Commerce
An e-commerce platform tracks user clicks, page views, and purchases. Events are published to a Pub/Sub topic. A Dataflow pipeline performs sessionization (session windows with 30-minute gap) to group user activity. It then computes metrics like conversion rate and average session duration. The output is written to BigQuery for ad-hoc analysis and to a Pub/Sub topic for real-time personalization. The company uses Pub/Sub's filter feature to route only purchase events to a separate subscription for inventory updates. A common issue is that late-arriving data (e.g., from mobile devices offline) can cause session splits. Dataflow's allowed lateness (set to 1 hour) merges late events into the correct session. Without this, sessions are fragmented, leading to inaccurate metrics.
Exactly What GCDL Tests
The GCDL exam (Objective 3.1) focuses on understanding the role of Pub/Sub and Dataflow in real-time analytics. You will NOT be asked to write code or run commands. Instead, you must know:
Pub/Sub: What it is, how it decouples publishers and subscribers, at-least-once vs. exactly-once delivery, push vs. pull subscriptions, message retention (7 days), and use cases (event ingestion, streaming data).
Dataflow: What it is, Apache Beam integration, batch vs. streaming, autoscaling, exactly-once processing, and common use cases (ETL, real-time analytics).
Integration: How Pub/Sub and Dataflow work together (Dataflow as a subscriber), and how they differ from alternatives like Cloud Functions or Cloud Run for light processing.
Common Wrong Answers
"Pub/Sub guarantees exactly-once delivery by default." This is false. The default is at-least-once. Exactly-once is an optional feature that reduces throughput. Candidates often confuse the default behavior.
"Dataflow is only for batch processing." Dataflow handles both batch and streaming. The exam may present a scenario requiring real-time processing and offer Dataflow as an option; candidates might incorrectly choose a batch-only tool like Dataproc.
"Pub/Sub stores messages indefinitely." The default retention is 7 days, configurable but not indefinite. Candidates may think messages are stored forever.
"Dataflow requires manual scaling." Dataflow autoscales automatically. The exam might present a scenario where traffic spikes and test if you know Dataflow can handle it without manual intervention.
Specific Numbers and Terms
Message retention: 7 days (default), 10 minutes minimum.
Acknowledgment deadline: 10 seconds default, max 600 seconds.
Max message size: 10 MB.
Dataflow max workers: 1000 (default).
Watermark: used for event-time processing.
FlexRS: batch-only, lower cost, up to 6 hours delay.
Edge Cases
Push subscriptions: The endpoint must be a publicly accessible HTTPS endpoint. If the endpoint is down, messages are retried with exponential backoff and eventually expired.
Ordering keys: Only available for pull subscriptions; not for push. If ordering is required, push cannot be used.
Exactly-once delivery: Requires pull subscriptions and reduces throughput. It uses a lease-based system to prevent duplicate delivery.
How to Eliminate Wrong Answers
If a question mentions "real-time" or "streaming," eliminate batch-only options (e.g., Dataproc, BigQuery load jobs).
If a question asks about "message ordering" and offers push subscriptions, eliminate that option because push does not support ordering.
If a question asks about "guaranteed delivery" without specifying exactly-once, think at-least-once.
If a question mentions "autoscaling," Dataflow is the obvious choice over Cloud Functions (which scales per request but not for streaming).
Pub/Sub is a fully managed messaging service that decouples publishers and subscribers, with at-least-once delivery by default and optional exactly-once delivery.
Dataflow is a fully managed Apache Beam service for both batch and streaming data processing, with autoscaling and exactly-once processing.
Pub/Sub retains messages for up to 7 days (default) and the acknowledgment deadline can be set from 10 to 600 seconds.
Dataflow uses watermarks to track event time completeness and triggers window processing accordingly.
Pub/Sub push subscriptions do not support ordering keys or exactly-once delivery; pull subscriptions are required for those features.
Dataflow autoscales up to 1000 workers by default, based on CPU utilization and backlog.
FlexRS is a batch-only pricing option for Dataflow that offers lower cost with possible delays up to 6 hours.
The integration of Pub/Sub and Dataflow enables real-time analytics with end-to-end exactly-once semantics.
These come up on the exam all the time. Here's how to tell them apart.
Pub/Sub
Messaging middleware: decouples publishers and subscribers
Stores messages for up to 7 days
Delivers messages with at-least-once or exactly-once semantics
No built-in data transformation; subscribers must process messages
Ideal for event ingestion and fan-out
Dataflow
Data processing engine: transforms and analyzes data
Processes data in real-time or batch; does not store data long-term
Provides exactly-once processing for streaming pipelines
Built-in transforms (filter, group, join, window) via Apache Beam
Ideal for ETL, real-time analytics, and complex event processing
Pub/Sub Push
Messages delivered via HTTPS POST to a pre-configured endpoint
Subscriber must be publicly accessible
Automatic retries with exponential backoff
No support for ordering keys or exactly-once delivery
Simpler for lightweight subscribers (e.g., Cloud Functions)
Pub/Sub Pull
Subscriber initiates requests to pull messages
Subscriber can be behind a VPC or firewall
Subscriber controls flow rate and acknowledgment deadline
Supports ordering keys and exactly-once delivery
Better for high-throughput and complex processing
Mistake
Pub/Sub guarantees exactly-once delivery by default.
Correct
Pub/Sub's default delivery mode is at-least-once, meaning a message may be delivered more than once. Exactly-once delivery is an optional feature enabled on pull subscriptions with `enable_exactly_once_delivery`, which reduces throughput and increases latency.
Mistake
Dataflow can only process batch data, not streaming.
Correct
Dataflow is built on Apache Beam, which supports both batch and streaming pipelines with the same code. Dataflow can process unbounded (streaming) data from sources like Pub/Sub, and it provides exactly-once processing guarantees for streaming.
Mistake
Pub/Sub stores messages indefinitely.
Correct
Pub/Sub retains unacknowledged messages for a maximum of 7 days by default, configurable down to 10 minutes. Messages are not stored indefinitely; after the retention period, they are automatically deleted.
Mistake
Push subscriptions are more reliable than pull subscriptions.
Correct
Both push and pull are reliable, but they have different use cases. Push requires a publicly accessible HTTPS endpoint and automatically retries on failure. Pull gives the subscriber more control over the rate of consumption and supports exactly-once delivery and ordering keys. Reliability depends on proper configuration, not the mode.
Mistake
Dataflow requires manual scaling of workers.
Correct
Dataflow automatically scales workers up or down based on the backlog of unprocessed data and CPU utilization. You can set a maximum number of workers, but scaling is automatic. Manual scaling is not required.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Pub/Sub is a messaging service for reliably delivering messages from publishers to subscribers. It does not process or transform data. Dataflow is a data processing service that uses Apache Beam to transform and analyze data in batch or streaming mode. Pub/Sub handles ingestion and decoupling; Dataflow handles processing. They are often used together: Pub/Sub ingests data and Dataflow processes it.
Not by default. The default is at-least-once delivery. Exactly-once delivery is an optional feature that can be enabled on pull subscriptions. It reduces throughput and increases latency. For most use cases, at-least-once is sufficient, and idempotent processing handles duplicates.
Yes, Dataflow supports streaming (real-time) processing. It can read from Pub/Sub or other streaming sources, apply transformations using windows and triggers, and write results to sinks like BigQuery. Dataflow provides exactly-once processing guarantees for streaming pipelines.
A watermark is an estimate of the event time up to which all data has been received. Dataflow uses watermarks to trigger window processing and handle late data. If data arrives after the watermark, it is considered late and can be handled within a configurable allowed lateness period.
By default, Pub/Sub retains unacknowledged messages for 7 days. You can configure this value between 10 minutes and 7 days. After the retention period, messages are automatically deleted and cannot be recovered.
In push subscriptions, Pub/Sub sends messages to a pre-configured HTTPS endpoint. In pull subscriptions, the subscriber initiates requests to pull messages. Push is simpler but requires a public endpoint and does not support ordering keys or exactly-once delivery. Pull gives more control and supports advanced features.
Yes, Dataflow automatically scales the number of workers up or down based on the processing backlog and CPU utilization. You can set a maximum number of workers (default 1000) to limit scaling. Autoscaling helps handle traffic spikes without manual intervention.
You've just covered Real-Time Analytics with Pub/Sub and Dataflow — now see how well it sticks with free GCDL practice questions. Full explanations included, no account needed.
Done with this chapter?