ACEChapter 10 of 101Objective 3.3

BigQuery and Pub/Sub

This chapter covers BigQuery and Pub/Sub, two core Google Cloud data services that are heavily tested on the ACE exam. BigQuery is Google's serverless, highly scalable data warehouse for analytics, while Pub/Sub is a fully managed real-time messaging service. Understanding their architectures, key features, and integration patterns is critical because approximately 10-15% of ACE exam questions touch on these services, either directly or in scenarios involving data pipelines, streaming analytics, or event-driven architectures. This chapter will equip you with the precise technical details needed to tackle exam questions confidently.

25 min read
Intermediate
Updated May 31, 2026

BigQuery and Pub/Sub: Library and Mailroom

Imagine a large corporate library (BigQuery) that stores all books (data) in a highly organized, indexed manner. Researchers can walk in and ask complex questions like 'Find all books published after 2010 about cloud computing by authors with PhDs' — the library's catalog system (BigQuery's columnar storage and query engine) scans only the relevant sections (columns) instead of reading every book, returning answers in seconds. Now, the library also has a mailroom (Pub/Sub) that receives a constant stream of letters (messages) from external senders. The mailroom doesn't read or store the letters permanently; it just sorts them into labeled bins (topics) and notifies subscribed departments (subscribers) to pick up their mail. The key is that the library and mailroom are separate systems: the mailroom handles real-time, high-volume message ingestion, while the library is optimized for analytical queries on large datasets. To connect them, the library can set up a subscription that automatically picks up letters from a bin and files them into the appropriate books (tables) — this is the BigQuery Subscription feature. Without this integration, researchers would have to manually collect letters and enter them into books, causing delays and errors.

How It Actually Works

What is BigQuery and Why Does It Exist?

BigQuery is Google Cloud's serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility. It decouples compute from storage, allowing you to run SQL queries on petabytes of data without managing infrastructure. The key innovation is its columnar storage format (Capacitor) and a distributed query execution engine (Dremel) that can process billions of rows in seconds. BigQuery is not a traditional relational database; it is an analytical database optimized for read-heavy, aggregate queries, not transactional row-level operations. It supports standard SQL (ANSI 2011 compliant) and integrates with tools like Looker, Data Studio, and TensorFlow.

How BigQuery Works Internally

BigQuery's architecture has three main layers: storage, compute, and networking. Data is stored in a columnar format called Capacitor, which compresses data and enables efficient scanning of only the columns needed for a query. The compute layer uses Dremel, a tree architecture that breaks queries into small tasks and executes them in parallel across thousands of slots. Slots are virtual CPUs that represent compute capacity; you can purchase dedicated slots (flex slots, flat-rate reservations) or use on-demand pricing where you pay per byte processed. The networking layer uses Jupiter, Google's high-speed network, to shuffle data between nodes with minimal latency.

When you run a query, BigQuery first parses and optimizes the SQL, then reads only the necessary columns from storage, applies filters (predicate pushdown), and performs joins and aggregations in a distributed manner. The result is cached for 24 hours if the query is re-run with no changes. BigQuery automatically manages data distribution and replication for durability (99.9999999999% annual durability, 11 nines).

Key BigQuery Components, Values, and Defaults

Datasets: Top-level containers for tables, views, and routines. Datasets are regional or multi-regional (e.g., US, EU). Default expiration for tables is none unless set.

Tables: Standard (native BigQuery storage), External (data in Cloud Storage, Drive, or Bigtable), or View (virtual table defined by SQL query).

Partitioning: Divide a table into segments (by date, integer range, or ingestion time). Default partitioning is by ingestion time if enabled.

Clustering: Co-locate rows with similar values in the same blocks. Up to 4 columns can be clustered. Improves query performance and reduces bytes billed.

Slots: On-demand pricing uses shared slot pool; flat-rate pricing reserves dedicated slots (100 slots minimum for flex slots).

Maximum query size: 1 GB of data per slot (with flat-rate) or up to 6 TB per day on on-demand.

Time travel: Access data up to 7 days in the past (default 7 days; can be set to 2-7 days for tables).

Streaming buffer: When streaming data into BigQuery, data is written to a streaming buffer first (up to 90 minutes latency before being available for copy/export).

Configuration and Verification Commands

Creating a dataset:

bq mk --dataset --location=US my_project:my_dataset

Creating a partitioned table:

bq mk --table --schema "id:INTEGER,name:STRING,date:DATE" --time_partitioning_field date my_dataset.my_table

Running a query:

bq query --use_legacy_sql=false 'SELECT COUNT(*) FROM my_dataset.my_table WHERE date = "2023-01-01"'

Viewing job details:

bq show --job <job_id>

What is Pub/Sub and Why Does It Exist?

Pub/Sub is a fully managed, real-time messaging service that allows you to send and receive messages between independent applications. It decouples senders (publishers) from receivers (subscribers) through an asynchronous, durable message queue. It is designed for high throughput (millions of messages per second) and low latency (typically under 100ms). Pub/Sub is the backbone for event-driven architectures, streaming data pipelines, and IoT applications on Google Cloud.

How Pub/Sub Works Internally

Pub/Sub uses a pull/push model. Publishers send messages to a topic. A topic is a named resource to which messages are sent. Subscribers register interest in a topic via a subscription. A subscription is a named resource that represents a stream of messages from a single topic. Messages are stored in the subscription's backlog (up to 7 days by default) until acknowledged by the subscriber. Each subscription has an associated acknowledgment deadline (default 10 seconds, configurable up to 600 seconds). If a message is not acknowledged within the deadline, it is re-delivered. Pub/Sub guarantees at-least-once delivery; duplicates are possible. Messages are ordered only if you enable message ordering on the subscription (uses ordering keys).

Pub/Sub supports two types of subscriptions: pull (subscriber actively fetches messages) and push (Pub/Sub sends messages to a configured endpoint). For push, the endpoint must be an HTTPS server with a valid SSL certificate. Pub/Sub also supports dead letter topics (DLT) to capture messages that cannot be processed after a certain number of delivery attempts (default 5).

Key Pub/Sub Components, Values, and Defaults

Topic: A named resource. Messages are retained for up to 7 days if not delivered.

Subscription: A named resource. Has a message retention duration (default 7 days, max 7 days). Acknowledge deadline default 10 seconds, min 10 seconds, max 600 seconds.

Message: Contains a payload (base64-encoded, up to 10 MB) and optional attributes (key-value pairs).

Publisher: Sends messages to a topic. Uses a client library or REST/gRPC API.

Subscriber: Receives messages from a subscription. Can be a Cloud Function, App Engine, Compute Engine, or external server.

Flow control: On pull subscriptions, you can set max outstanding messages (default 1000) or max outstanding bytes (default 100 MB).

Dead letter topic: Optional; after max delivery attempts (default 5), messages are forwarded to a dead letter topic.

Configuration and Verification Commands

Creating a topic:

gcloud pubsub topics create my-topic

Creating a subscription:

gcloud pubsub subscriptions create my-sub --topic=my-topic --ack-deadline=60

Publishing a message:

gcloud pubsub topics publish my-topic --message="Hello" --attribute="key=value"

Pulling messages:

gcloud pubsub subscriptions pull my-sub --auto-ack --limit=5

How BigQuery and Pub/Sub Interact

The primary integration is through BigQuery Subscription (also called BigQuery write subscription). This allows you to create a Pub/Sub subscription that directly writes messages to a BigQuery table. The subscription must be created with the --bigquery-table flag. The table must exist and have a schema that matches the message attributes. Messages are written in near-real-time (latency typically seconds). This integration eliminates the need for a separate streaming data pipeline (e.g., Dataflow).

Creating a BigQuery subscription:

gcloud pubsub subscriptions create my-bq-sub --topic=my-topic --bigquery-table=my_project:my_dataset.my_table

Important: The BigQuery table must be created with a schema that includes columns for the message payload. By default, the payload is written to a column named data (BYTES type), and attributes are written to columns with matching names. You can also use --write-metadata to add columns like subscription_name, message_id, publish_time, etc.

Another integration is using Pub/Sub as a source for BigQuery via Cloud Dataflow or Datastream. Dataflow can read from Pub/Sub and write to BigQuery with transformations. Datastream can stream changes from databases to BigQuery via Pub/Sub.

Exam-Relevant Details

BigQuery is serverless — no clusters to manage.

BigQuery pricing: on-demand ($5 per TB processed, first 1 TB free per month) or flat-rate (slots).

Pub/Sub pricing: based on data volume (ingress and egress) and number of API calls.

Pub/Sub has a 10 MB message size limit.

Pub/Sub guarantees at-least-once delivery; duplicates possible.

BigQuery streaming inserts have a 1 MB per row limit and a 90-minute buffer before data is available for export.

BigQuery's time travel window is 7 days by default; can be set to 2-7 days.

Pub/Sub's acknowledgment deadline can be extended; if not acked, message is redelivered.

BigQuery supports external tables (Cloud Storage, Google Drive, Bigtable, Cloud SQL).

BigQuery clustering improves query performance for columns with high cardinality.

Pub/Sub ordering keys ensure messages with the same key are delivered in order to pull subscribers.

Common Pitfalls

Forgetting that Pub/Sub does not guarantee exactly-once delivery; design idempotent consumers.

Using BigQuery for transactional workloads (it's not designed for row-level updates).

Not setting table expiration in BigQuery, leading to uncontrolled storage costs.

Overlooking the 7-day retention limit for Pub/Sub messages; subscribers must acknowledge or messages are lost.

Misconfiguring the acknowledgment deadline; if too short, messages may be redelivered unnecessarily.

Step-by-Step: Setting Up a Streaming Data Pipeline with BigQuery and Pub/Sub

1. Create a Pub/Sub Topic Use gcloud pubsub topics create streaming-topic. This creates a named channel for publishers to send messages.

2. Create a BigQuery Dataset and Table Create a dataset with bq mk --dataset my_dataset. Then create a table with a schema that matches the expected message structure. For example:

bq mk --table --schema "event_id:STRING,event_type:STRING,timestamp:TIMESTAMP" my_dataset.events

3. Create a BigQuery Subscription Run:

gcloud pubsub subscriptions create bq-sub --topic=streaming-topic --bigquery-table=my_project:my_dataset.events --write-metadata

This creates a subscription that writes messages directly to the BigQuery table. The --write-metadata flag adds columns like subscription_name, message_id, publish_time, etc.

4. Publish Messages Use a publisher client or gcloud:

gcloud pubsub topics publish streaming-topic --message='{"event_id":"1","event_type":"click","timestamp":"2023-01-01T00:00:00Z"}'

5. Verify Data in BigQuery Query the table:

bq query --use_legacy_sql=false 'SELECT * FROM my_dataset.events LIMIT 10'

6. Monitor and Troubleshoot Use Cloud Monitoring for Pub/Sub subscription metrics (e.g., pubsub.googleapis.com/subscription/backlog_bytes) and BigQuery job history.

Real-World Section

Enterprise Scenario 1: Real-Time Analytics for E-Commerce

A large e-commerce platform processes clickstream data from millions of users. They use Pub/Sub to ingest events (page views, clicks, purchases) in real time. A BigQuery subscription writes these events directly into a partitioned and clustered BigQuery table. This enables analysts to run near-real-time queries on user behavior, such as 'What is the conversion rate in the last hour?'. In production, they handle over 500,000 messages per second. Key considerations: they set the acknowledgment deadline to 60 seconds to allow for burst processing, and they use dead letter topics to capture malformed events. A common misconfiguration is not setting table partitioning by date, causing full table scans on time-range queries, which increases costs and latency.

Enterprise Scenario 2: IoT Sensor Data Pipeline

A manufacturing company collects sensor data from thousands of IoT devices. Each device sends a message every second to a Pub/Sub topic. They need to store this data for historical analysis and anomaly detection. They create a BigQuery subscription to write the data into a BigQuery table partitioned by ingestion time (pseudo-column _PARTITIONTIME). They also set up a Dataflow pipeline to transform the data before writing to BigQuery for complex aggregations. In production, they observed that the streaming buffer latency (up to 90 minutes) caused delays for real-time dashboards; they mitigated by using a separate Dataflow pipeline that writes to a separate table with lower latency. A common mistake is using on-demand pricing for high-volume streaming, resulting in unpredictable costs; they switched to flat-rate reservations.

Enterprise Scenario 3: Log Aggregation and Compliance

A financial services company needs to aggregate logs from multiple microservices for compliance auditing. Each service publishes logs to a shared Pub/Sub topic. A BigQuery subscription writes all logs to a BigQuery table with a schema that includes log_level, service_name, message, and timestamp. They use clustering on service_name and log_level to optimize queries. They also set table expiration to 90 days to comply with data retention policies. In production, they encountered issues with message ordering — logs from the same service arrived out of order. They enabled message ordering on the subscription using an ordering key (service instance ID). However, ordering reduces throughput; they accepted the trade-off. A common pitfall is not using a dead letter topic; when a schema mismatch occurs, messages are silently dropped. They monitor the subscription's num_undelivered_messages metric.

Exam Focus Section

What the ACE Exam Tests

The ACE exam (Objective 3.3: Deploy and implement data solutions) tests your ability to choose the right Google Cloud data service for a given scenario. For BigQuery and Pub/Sub, you need to know:

BigQuery's serverless nature and when to use it over Cloud SQL or Spanner.

How to create datasets, tables, and queries using bq and gcloud.

Pub/Sub's messaging model, topics, subscriptions, and delivery guarantees.

The integration between BigQuery and Pub/Sub via BigQuery subscriptions.

Pricing models: on-demand vs. flat-rate for BigQuery; data volume vs. API calls for Pub/Sub.

Common Wrong Answers

1.

Choosing Cloud SQL for analytics: Candidates often pick Cloud SQL because it's familiar, but BigQuery is the correct answer for large-scale analytical queries. Cloud SQL is for OLTP workloads.

2.

Thinking Pub/Sub guarantees exactly-once delivery: The exam tests that Pub/Sub is at-least-once. Many choose exactly-once because it sounds more reliable.

3.

Assuming BigQuery supports transactions: BigQuery does not support multi-row transactions. Questions about updating individual rows should point to Cloud SQL or Spanner.

4.

Confusing Pub/Sub topics and subscriptions: A topic is for publishing; a subscription is for receiving. A common trap asks 'What do you create to receive messages?' — answer is subscription, not topic.

Specific Numbers and Terms

Pub/Sub message size: 10 MB.

BigQuery streaming buffer latency: up to 90 minutes.

BigQuery time travel: 7 days default.

Pub/Sub acknowledgment deadline: 10-600 seconds, default 10.

Pub/Sub message retention: up to 7 days.

BigQuery on-demand pricing: $5 per TB processed.

BigQuery flat-rate: 100 slots minimum for flex slots.

Edge Cases

If a Pub/Sub push endpoint returns a non-200 status, Pub/Sub retries with exponential backoff.

BigQuery external tables can query data in Cloud Storage without loading; but performance is slower.

BigQuery views are logical; they don't store data.

How to Eliminate Wrong Answers

If the question mentions 'real-time streaming', consider Pub/Sub + Dataflow or BigQuery subscription.

If the question mentions 'petabyte-scale analytics', BigQuery is usually the answer.

If the question mentions 'transactional writes', eliminate BigQuery.

If the question mentions 'message ordering', Pub/Sub ordering keys are needed.

Misconceptions

1. Myth: BigQuery is a traditional relational database. Reality: BigQuery is a columnar data warehouse optimized for analytical queries, not OLTP. It does not support transactions, foreign keys, or indexes.

2. Myth: Pub/Sub guarantees exactly-once delivery. Reality: Pub/Sub guarantees at-least-once delivery. Duplicates can occur; applications must be idempotent.

3. Myth: BigQuery streaming inserts are immediately available for export. Reality: Data in the streaming buffer is not available for export or copy operations for up to 90 minutes.

4. Myth: Pub/Sub topics can have multiple subscribers without creating subscriptions. Reality: Each subscriber must have its own subscription; a topic can have multiple subscriptions, but each subscription is independent.

5. Myth: BigQuery queries always scan all data. Reality: BigQuery uses columnar storage and predicate pushdown; it only scans columns referenced in the query and can prune partitions/clusters.

Comparisons

1. BigQuery vs. Cloud SQL - BigQuery: Serverless, petabyte-scale, columnar, SQL for analytics, pay per query. - Cloud SQL: Managed MySQL/PostgreSQL/SQL Server, OLTP, row-level operations, supports transactions, pay per instance.

2. Pub/Sub vs. Cloud Tasks - Pub/Sub: Asynchronous messaging, many-to-many, high throughput, real-time, at-least-once delivery. - Cloud Tasks: Task queues, targeted at HTTP-based tasks, at-least-once, but with a focus on reliable execution and retries.

3. BigQuery vs. Bigtable - BigQuery: Analytical, SQL, columnar, best for aggregations and reporting. - Bigtable: NoSQL wide-column, low latency, high throughput for time-series and IoT, not SQL.

Key Takeaways

BigQuery is serverless and separates compute and storage.

Pub/Sub is for real-time messaging with at-least-once delivery.

BigQuery subscriptions write Pub/Sub messages directly to BigQuery tables.

BigQuery pricing: on-demand ($5/TB) or flat-rate (slots).

Pub/Sub message size limit: 10 MB.

BigQuery streaming buffer latency: up to 90 minutes.

BigQuery time travel: 7 days default.

Pub/Sub acknowledgment deadline: 10-600 seconds.

Use BigQuery for analytics, Cloud SQL for transactions.

Use Pub/Sub for event-driven architectures.

FAQ

1. Q: What is the difference between a Pub/Sub topic and a subscription? A: A topic is a named channel to which publishers send messages. A subscription is a named resource that represents a stream of messages from a single topic. Subscribers must have a subscription to receive messages. A topic can have multiple subscriptions, each receiving all messages independently.

2. Q: Can BigQuery handle real-time streaming data? A: Yes, BigQuery supports streaming inserts and can receive data from Pub/Sub via BigQuery subscriptions. However, there is a streaming buffer latency of up to 90 minutes before data is available for export. For near-real-time analytics, use Dataflow to write directly to BigQuery.

3. Q: How do I avoid duplicate messages in Pub/Sub? A: Pub/Sub is at-least-once, so duplicates are possible. To handle duplicates, make your subscriber idempotent by using a unique message ID (e.g., message_id) to deduplicate. Alternatively, use Cloud Dataflow with deduplication transforms.

4. Q: What is the maximum message size in Pub/Sub? A: 10 MB. If you need to send larger payloads, consider using Cloud Storage and sending a reference URL in the message.

5. Q: How do I query data in BigQuery that is stored in Cloud Storage? A: Create an external table using bq mk --external_table_definition or via the console. You can query it like a native BigQuery table, but performance may be slower.

6. Q: What happens if a Pub/Sub push endpoint is down? A: Pub/Sub retries delivery with exponential backoff. If the endpoint continues to fail, messages can be forwarded to a dead letter topic after max delivery attempts (default 5).

7. Q: Can I change the partitioning of an existing BigQuery table? A: No, you cannot change partitioning after table creation. You must recreate the table with the new partitioning scheme and reinsert data.

Quiz

1. Q: A company needs to run SQL queries on petabytes of log data. Which Google Cloud service should they use? A: BigQuery. It is serverless, columnar, and designed for large-scale analytical queries. Cloud SQL is for smaller transactional workloads.

2. Q: A real-time application requires at-least-once message delivery with low latency. Which service should be used? A: Pub/Sub. It provides at-least-once delivery and low-latency messaging. Cloud Tasks is for task queues, not real-time messaging.

3. Q: What is the default acknowledgment deadline for a Pub/Sub subscription? A: 10 seconds. It can be configured up to 600 seconds.

4. Q: What is the maximum time a message can be retained in a Pub/Sub subscription? A: 7 days. After that, messages are deleted.

5. Q: Which BigQuery feature allows you to access data from the past 7 days? A: Time travel. It allows querying data as it existed up to 7 days ago (default).

6. Q: A developer wants to stream data from Pub/Sub directly into BigQuery without using Dataflow. What should they create? A: A BigQuery subscription. This writes messages directly to a BigQuery table.

7. Q: What is the minimum number of slots for a flex slot reservation? A: 100 slots.

8. Q: True or False: BigQuery supports multi-row transactions. A: False. BigQuery is not designed for transactional workloads; it supports only single-row mutations via DML.

9. Q: A Pub/Sub message is 15 MB. What happens? A: The message is rejected because the maximum size is 10 MB.

10. Q: Which command creates a Pub/Sub topic? A: gcloud pubsub topics create my-topic.

Walk-Through

1

Create Pub/Sub Topic

Use the gcloud command `gcloud pubsub topics create streaming-topic`. This creates a named resource that acts as a channel for publishers to send messages. The topic is regional or multi-regional. The command returns immediately; the topic is ready for use. No configuration is needed beyond the name. Publishers will use this topic name to send messages.

2

Create BigQuery Dataset and Table

First, create a dataset with `bq mk --dataset my_dataset`. Then, create a table with a schema that matches the expected message structure. For example: `bq mk --table --schema "event_id:STRING,event_type:STRING,timestamp:TIMESTAMP" my_dataset.events`. The table must exist before creating the BigQuery subscription. The schema should include columns for message attributes and optionally a `data` column for the payload.

3

Create BigQuery Subscription

Run `gcloud pubsub subscriptions create bq-sub --topic=streaming-topic --bigquery-table=my_project:my_dataset.events --write-metadata`. This creates a subscription that writes messages directly to the BigQuery table. The `--write-metadata` flag adds columns like `subscription_name`, `message_id`, `publish_time`, etc. The subscription starts delivering messages immediately.

4

Publish Messages

Use a publisher client or gcloud command: `gcloud pubsub topics publish streaming-topic --message='{"event_id":"1","event_type":"click","timestamp":"2023-01-01T00:00:00Z"}'`. The message is sent to the topic. Pub/Sub stores the message in the subscription backlog. The BigQuery subscription automatically writes the message to the BigQuery table.

5

Verify Data in BigQuery

Query the BigQuery table to confirm data is present: `bq query --use_legacy_sql=false 'SELECT * FROM my_dataset.events LIMIT 10'`. The results should show the published messages. There may be a slight delay (seconds) for the data to appear. Monitor the subscription's backlog metric to ensure messages are being processed.

What This Looks Like on the Job

Enterprise Scenario 1: Real-Time Analytics for E-Commerce

A large e-commerce platform processes clickstream data from millions of users. They use Pub/Sub to ingest events (page views, clicks, purchases) in real time. A BigQuery subscription writes these events directly into a partitioned and clustered BigQuery table. This enables analysts to run near-real-time queries on user behavior, such as 'What is the conversion rate in the last hour?'. In production, they handle over 500,000 messages per second. Key considerations: they set the acknowledgment deadline to 60 seconds to allow for burst processing, and they use dead letter topics to capture malformed events. A common misconfiguration is not setting table partitioning by date, causing full table scans on time-range queries, which increases costs and latency.

Enterprise Scenario 2: IoT Sensor Data Pipeline

A manufacturing company collects sensor data from thousands of IoT devices. Each device sends a message every second to a Pub/Sub topic. They need to store this data for historical analysis and anomaly detection. They create a BigQuery subscription to write the data into a BigQuery table partitioned by ingestion time (pseudo-column _PARTITIONTIME). They also set up a Dataflow pipeline to transform the data before writing to BigQuery for complex aggregations. In production, they observed that the streaming buffer latency (up to 90 minutes) caused delays for real-time dashboards; they mitigated by using a separate Dataflow pipeline that writes to a separate table with lower latency. A common mistake is using on-demand pricing for high-volume streaming, resulting in unpredictable costs; they switched to flat-rate reservations.

Enterprise Scenario 3: Log Aggregation and Compliance

A financial services company needs to aggregate logs from multiple microservices for compliance auditing. Each service publishes logs to a shared Pub/Sub topic. A BigQuery subscription writes all logs to a BigQuery table with a schema that includes log_level, service_name, message, and timestamp. They use clustering on service_name and log_level to optimize queries. They also set table expiration to 90 days to comply with data retention policies. In production, they encountered issues with message ordering — logs from the same service arrived out of order. They enabled message ordering on the subscription using an ordering key (service instance ID). However, ordering reduces throughput; they accepted the trade-off. A common pitfall is not using a dead letter topic; when a schema mismatch occurs, messages are silently dropped. They monitor the subscription's num_undelivered_messages metric.

How ACE Actually Tests This

What the ACE Exam Tests

The ACE exam (Objective 3.3: Deploy and implement data solutions) tests your ability to choose the right Google Cloud data service for a given scenario. For BigQuery and Pub/Sub, you need to know:

BigQuery's serverless nature and when to use it over Cloud SQL or Spanner.

How to create datasets, tables, and queries using bq and gcloud.

Pub/Sub's messaging model, topics, subscriptions, and delivery guarantees.

The integration between BigQuery and Pub/Sub via BigQuery subscriptions.

Pricing models: on-demand vs. flat-rate for BigQuery; data volume vs. API calls for Pub/Sub.

Common Wrong Answers

1.

Choosing Cloud SQL for analytics: Candidates often pick Cloud SQL because it's familiar, but BigQuery is the correct answer for large-scale analytical queries. Cloud SQL is for OLTP workloads.

2.

Thinking Pub/Sub guarantees exactly-once delivery: The exam tests that Pub/Sub is at-least-once. Many choose exactly-once because it sounds more reliable.

3.

Assuming BigQuery supports transactions: BigQuery does not support multi-row transactions. Questions about updating individual rows should point to Cloud SQL or Spanner.

4.

Confusing Pub/Sub topics and subscriptions: A topic is for publishing; a subscription is for receiving. A common trap asks 'What do you create to receive messages?' — answer is subscription, not topic.

Specific Numbers and Terms

Pub/Sub message size: 10 MB.

BigQuery streaming buffer latency: up to 90 minutes.

BigQuery time travel: 7 days default.

Pub/Sub acknowledgment deadline: 10-600 seconds, default 10.

Pub/Sub message retention: up to 7 days.

BigQuery on-demand pricing: $5 per TB processed.

BigQuery flat-rate: 100 slots minimum for flex slots.

Edge Cases

If a Pub/Sub push endpoint returns a non-200 status, Pub/Sub retries with exponential backoff.

BigQuery external tables can query data in Cloud Storage without loading; but performance is slower.

BigQuery views are logical; they don't store data.

How to Eliminate Wrong Answers

If the question mentions 'real-time streaming', consider Pub/Sub + Dataflow or BigQuery subscription.

If the question mentions 'petabyte-scale analytics', BigQuery is usually the answer.

If the question mentions 'transactional writes', eliminate BigQuery.

If the question mentions 'message ordering', Pub/Sub ordering keys are needed.

Key Takeaways

BigQuery is serverless and separates compute and storage.

Pub/Sub is for real-time messaging with at-least-once delivery.

BigQuery subscriptions write Pub/Sub messages directly to BigQuery tables.

BigQuery pricing: on-demand ($5/TB) or flat-rate (slots).

Pub/Sub message size limit: 10 MB.

BigQuery streaming buffer latency: up to 90 minutes.

BigQuery time travel: 7 days default.

Pub/Sub acknowledgment deadline: 10-600 seconds.

Use BigQuery for analytics, Cloud SQL for transactions.

Use Pub/Sub for event-driven architectures.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

BigQuery

Serverless, no infrastructure management

Columnar storage, optimized for analytical queries

Petabyte-scale, processes billions of rows

Pricing: on-demand ($5/TB) or flat-rate (slots)

No support for transactions or foreign keys

Cloud SQL

Managed MySQL/PostgreSQL/SQL Server

Row-based storage, optimized for OLTP

Limited to instance storage (up to 30 TB)

Pricing: per instance (vCPUs, memory, storage)

Full support for transactions, indexes, foreign keys

Pub/Sub

Asynchronous messaging, many-to-many

High throughput (millions of messages/sec)

Real-time, low latency (<100ms)

At-least-once delivery, no ordering by default

No built-in retries for push endpoints

Cloud Tasks

Task queues, targeted at HTTP-based tasks

Lower throughput, designed for background work

Not real-time, but reliable execution

At-least-once delivery with retries

Built-in retry with exponential backoff

Watch Out for These

Mistake

BigQuery is a traditional relational database.

Correct

BigQuery is a columnar data warehouse optimized for analytical queries, not OLTP. It does not support transactions, foreign keys, or indexes.

Mistake

Pub/Sub guarantees exactly-once delivery.

Correct

Pub/Sub guarantees at-least-once delivery. Duplicates can occur; applications must be idempotent.

Mistake

BigQuery streaming inserts are immediately available for export.

Correct

Data in the streaming buffer is not available for export or copy operations for up to 90 minutes.

Mistake

Pub/Sub topics can have multiple subscribers without creating subscriptions.

Correct

Each subscriber must have its own subscription; a topic can have multiple subscriptions, but each subscription is independent.

Mistake

BigQuery queries always scan all data.

Correct

BigQuery uses columnar storage and predicate pushdown; it only scans columns referenced in the query and can prune partitions/clusters.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between a Pub/Sub topic and a subscription?

A topic is a named channel to which publishers send messages. A subscription is a named resource that represents a stream of messages from a single topic. Subscribers must have a subscription to receive messages. A topic can have multiple subscriptions, each receiving all messages independently.

Can BigQuery handle real-time streaming data?

Yes, BigQuery supports streaming inserts and can receive data from Pub/Sub via BigQuery subscriptions. However, there is a streaming buffer latency of up to 90 minutes before data is available for export. For near-real-time analytics, use Dataflow to write directly to BigQuery.

How do I avoid duplicate messages in Pub/Sub?

Pub/Sub is at-least-once, so duplicates are possible. To handle duplicates, make your subscriber idempotent by using a unique message ID (e.g., `message_id`) to deduplicate. Alternatively, use Cloud Dataflow with deduplication transforms.

What is the maximum message size in Pub/Sub?

10 MB. If you need to send larger payloads, consider using Cloud Storage and sending a reference URL in the message.

How do I query data in BigQuery that is stored in Cloud Storage?

Create an external table using `bq mk --external_table_definition` or via the console. You can query it like a native BigQuery table, but performance may be slower.

What happens if a Pub/Sub push endpoint is down?

Pub/Sub retries delivery with exponential backoff. If the endpoint continues to fail, messages can be forwarded to a dead letter topic after max delivery attempts (default 5).

Can I change the partitioning of an existing BigQuery table?

No, you cannot change partitioning after table creation. You must recreate the table with the new partitioning scheme and reinsert data.

Terms Worth Knowing

Ready to put this to the test?

You've just covered BigQuery and Pub/Sub — now see how well it sticks with free ACE practice questions. Full explanations included, no account needed.

Done with this chapter?