This chapter covers BigQuery partitioning and clustering, two powerful features that optimize query performance and reduce costs by minimizing the data scanned. For the ACE exam, these topics appear in approximately 10–15% of questions, often integrated with cost optimization and data modeling scenarios. You will learn the internal mechanisms, configuration options, best practices, and common pitfalls to ensure you can design efficient BigQuery tables and answer exam questions correctly.
Jump to a section
Imagine a library with millions of books. Without any organization, finding a book requires scanning every shelf — a full table scan. Partitioning is like splitting the library into separate rooms by year: Room 2020, Room 2021, etc. If you only want books from 2021, you bypass all other rooms entirely, drastically reducing search space. Each room is a separate partition, and you can even store them in different buildings (regions). Clustering is like sorting books within each room by author's last name. Even if you need books from multiple years, within a room the books are physically ordered, so finding all books by 'Smith' in that year is fast — you jump to the 'S' section, read sequentially until the author changes, and stop. Without clustering, you'd have to scan every book in the room. Together, partitioning prunes entire rooms, and clustering organizes within each room for efficient range scans. But if you always search by author regardless of year, partitioning by year doesn't help — you still have to enter every room. Likewise, if you partition by a low-cardinality column like 'status' (active/inactive), you get only two rooms — not much pruning. The key is choosing the right partition key and cluster key based on your query patterns.
What Are Partitioning and Clustering?
BigQuery is a serverless, highly scalable data warehouse that separates compute from storage. Queries incur costs based on the amount of data processed (scanned). Partitioning and clustering are two techniques to limit the data scanned, thereby reducing cost and improving performance.
Partitioning divides a table into segments (partitions) based on a column (partition key). Queries that filter on the partition key can prune (skip) entire partitions, scanning only the relevant data.
Clustering physically orders data within each partition (or within the whole table if unpartitioned) based on one or more columns (cluster keys). This allows BigQuery to use block-level pruning: it can skip reading blocks that do not contain matching cluster key values.
How Partitioning Works Internally
When you create a partitioned table, BigQuery stores each partition as a separate storage unit. Metadata tracks which partitions exist. At query time, the query planner analyzes the WHERE clause to determine which partitions are needed. For example, if a table is partitioned by DATE and the query filters WHERE date = '2023-01-01', only that partition is read. If the filter is WHERE date BETWEEN '2023-01-01' AND '2023-01-31', only those 31 partitions are read.
Partitioning can be based on:
- Ingestion time (_PARTITIONTIME pseudo-column): Automatically partitions by the time data was ingested. Useful for streaming or append-only data.
- Date/timestamp column: A column of type DATE, TIMESTAMP, or DATETIME.
- Integer range: Partition by an integer column with specified range start, end, and interval. For example, RANGE_BUCKET(customer_id, START=0, END=1000000, INTERVAL=10000) creates 100 partitions each covering 10,000 IDs.
Key limitations:
A table can have at most 4000 partitions (if using ingestion time or date partitioning) or 4000 partitions for integer range (with a maximum of 4000 partitions).
Partition keys cannot be modified after table creation.
You cannot partition by multiple columns.
How Clustering Works Internally
Clustering sorts the data within each partition (or the entire table) by the cluster key columns. BigQuery automatically reorganizes data as it is written, using a sort-based approach. The sorted data is stored in blocks of approximately 100 MB each. Metadata for each block stores the minimum and maximum values of the cluster keys. When a query filters on cluster keys, BigQuery can skip blocks whose min/max range does not overlap the filter.
For example, a table clustered by customer_id and order_date. A query WHERE customer_id BETWEEN 100 AND 200 will only read blocks that might contain those IDs. If the data is well-clustered, many blocks will be skipped.
Clustering columns must be top-level columns (not nested) of type: STRING, BYTES, INT64, FLOAT64, NUMERIC, BIGNUMERIC, BOOL, DATE, TIMESTAMP, DATETIME, or TIME. You can specify up to 4 cluster columns. Order matters: the first column has the highest priority for sorting.
Partitioning vs. Clustering: When to Use Each
Partitioning is best for columns with high cardinality (many distinct values) that are commonly used in equality or range filters. Example: date (365 values per year). Partitioning prunes entire storage units, so it is very effective for eliminating large amounts of data.
Clustering is best for columns that are frequently used in filters but have lower cardinality or are not suitable for partitioning (e.g., customer_id with millions of values — cannot partition by that because 4000 partitions limit). Clustering also helps with queries that filter on multiple columns.
Common pattern: Partition by date, cluster by customer_id or region. This allows time-based pruning and then efficient lookup within each day.
Configuration and Verification Commands
Creating a partitioned table (date column):
CREATE TABLE mydataset.mytable
(
transaction_id INT64,
transaction_date DATE,
amount FLOAT64
)
PARTITION BY transaction_date
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = true
);Creating a partitioned table (ingestion time):
CREATE TABLE mydataset.mytable
(
transaction_id INT64,
amount FLOAT64
)
PARTITION BY _PARTITIONDATE
OPTIONS (partition_expiration_days = 365);Creating a clustered table:
CREATE TABLE mydataset.mytable
(
customer_id INT64,
order_date DATE,
amount FLOAT64
)
PARTITION BY order_date
CLUSTER BY customer_id
OPTIONS (partition_expiration_days = 365);Verifying partitioning and clustering:
Use INFORMATION_SCHEMA.TABLES:
SELECT table_name, partition_by, cluster_by
FROM `region-us.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'mytable';Or bq show:
bq show --format=prettyjson mydataset.mytableHow It Interacts with Related Technologies
Materialized views: Can be partitioned and clustered independently of the base table.
Query caching: BigQuery caches query results for 24 hours. Partitioning and clustering do not affect caching directly but improve performance for first-time queries.
Streaming buffer: Data streamed to a partitioned table goes to a streaming buffer and is not immediately partitioned. After the buffer is flushed (typically within 90 minutes), data is written to the correct partition. Clustering applies after the data is written.
Time travel: BigQuery supports time travel (accessing historical data up to 7 days). Partitioning can help limit scans for time travel queries.
Slot reservations: Partitioning and clustering reduce slot consumption by scanning less data, which can help stay within reservation limits.
Performance and Cost Considerations
Cost: Queries are billed by the bytes processed. Partitioning and clustering reduce bytes processed, lowering cost.
Performance: Less data scanned means faster queries. Clustering also reduces the number of blocks read, improving I/O.
Over-partitioning: More than a few hundred partitions can cause metadata overhead and slower query planning. Keep partitions under 4000.
Partition expiration: Set partition_expiration_days to automatically delete old partitions, saving storage costs.
Require partition filter: Enabling require_partition_filter prevents accidental full-table scans on partitioned tables.
Best practices: Choose partition key with high cardinality and frequent filter usage. For clustering, choose columns that are often used in range or equality filters, and list them in order of filtering frequency.
Common Mistakes
Partitioning by a low-cardinality column (e.g., status with 3 values) — only 3 partitions, minimal pruning.
Using clustering without partitioning on a large table — clustering helps but without partition pruning, you may still scan many blocks.
Choosing too many cluster columns (over 4 is not allowed) or ordering them poorly.
Forgetting to set require_partition_filter on production tables — a query without a partition filter will scan all partitions and be expensive.
Assuming clustering works immediately after data insertion — clustering is automatic but may take time for new data to be sorted.
Exam Relevance
The ACE exam tests your understanding of when to use partitioning vs. clustering, how to create them, and how they affect cost and performance. Be prepared to identify the correct partition key for a given query pattern, and to know the limitations (e.g., max 4000 partitions, max 4 cluster columns, cannot modify partition key after creation).
Analyze query patterns
Before creating a table, identify the most common filter columns in your queries. For a sales table, typical filters are `date` and `customer_id`. Determine which column has high cardinality and is used frequently for range or equality filters — that becomes the partition key. The next most filtered column(s) become cluster keys. This step is critical: choosing the wrong key leads to poor pruning and higher costs.
Create table with partitioning
Use the `CREATE TABLE` statement with `PARTITION BY` clause. For date partitioning, choose a DATE, TIMESTAMP, or DATETIME column. For ingestion time, use `_PARTITIONDATE`. For integer range, specify `RANGE_BUCKET`. Set `partition_expiration_days` to automatically manage storage. Enable `require_partition_filter` to prevent accidental full scans. The table will be physically divided into partitions, each stored separately.
Add clustering
Add `CLUSTER BY` clause listing up to 4 columns. The order matters: the first column has the highest sorting priority. BigQuery will sort data within each partition by these columns. This allows block-level pruning. For example, if you cluster by `customer_id`, queries filtering on `customer_id` will skip irrelevant blocks. Note: clustering is most effective when data is sorted, which happens automatically but may take time after ingestion.
Insert and monitor data
Insert data as usual. For streaming inserts, data goes to a streaming buffer and is not immediately partitioned or clustered. After the buffer flushes (within ~90 minutes), data is written to the correct partition and sorted. For batch loads, partitioning and clustering are applied immediately. Monitor the table using `INFORMATION_SCHEMA.TABLES` to verify the partition and cluster columns. Check `__TABLES_SUMMARY__` for partition counts.
Query and optimize
Write queries that include filters on the partition key and cluster keys to maximize pruning. Use `WHERE` clauses with exact values or ranges. Avoid functions on partition columns (e.g., `DATE(transaction_time)` instead of `transaction_date`) because they may prevent partition pruning. Use `EXPLAIN` to see how many partitions are scanned. Adjust partition and cluster keys if query patterns change — note that you cannot change keys after creation, so you may need to recreate the table.
Scenario 1: E-commerce Order Analytics
A large e-commerce company stores 2 TB of daily order data. Queries typically filter by order_date (last 30 days) and customer_id (to analyze specific customers). They partition by order_date and cluster by customer_id. This reduces scanned data from 2 TB to about 200 GB (last 30 days), and within that, block-level pruning further reduces to ~50 GB. Without clustering, scanning 200 GB costs ~$1 per query; with clustering, ~$0.25. They also set partition_expiration_days=365 to automatically drop partitions older than a year, saving storage costs. A common misconfiguration is forgetting to set require_partition_filter — a developer once ran a query without a date filter, scanning all 2 TB and costing $10. After enabling the filter, such queries fail with an error, alerting the team to add a filter.
Scenario 2: IoT Sensor Data
A smart building company ingests sensor readings every minute. The table is partitioned by ingestion time (_PARTITIONDATE) and clustered by sensor_id and building_id. Queries often ask for readings from a specific sensor in a specific building over the last hour. Partitioning by ingestion time prunes to the last few partitions (e.g., today's partition only). Clustering on (building_id, sensor_id) then allows efficient block-level pruning. However, they initially clustered by sensor_id only, which was suboptimal because queries always included building_id. After recreating the table with (building_id, sensor_id) in that order, performance improved 3x. They also learned that streaming data is not immediately clustered — a query right after ingestion might scan many blocks. They now wait 90 minutes after peak ingestion before running critical reports.
Scenario 3: Financial Transactions
A bank stores millions of transactions daily, with queries filtering by transaction_date and account_id. They partition by transaction_date and cluster by account_id. However, they also have queries that filter by transaction_type (e.g., 'withdrawal'). Since transaction_type has low cardinality (5 types), they considered partitioning by it, but that would create only 5 partitions — not effective. Instead, they added transaction_type as a third cluster column (after account_id). This works because clustering can handle low-cardinality columns by block pruning. But they discovered that clustering on three columns with the first column having millions of distinct values (account_id) makes the second and third columns less effective. They reordered to (transaction_type, account_id) for queries that always filter by type first. The trade-off: queries filtering only by account_id now scan more blocks. They accepted this based on actual query frequency analysis.
ACE Exam Focus: Objective 3.3
The ACE exam tests your ability to design efficient BigQuery tables using partitioning and clustering. Specific objectives include: - 3.3.1: Determine when to use partitioning vs. clustering. - 3.3.2: Create partitioned and clustered tables. - 3.3.3: Optimize query performance and cost. - 3.3.4: Manage partitions (expiration, filtering).
Common Wrong Answers
"Partitioning by a low-cardinality column like 'status' is effective." → Wrong. Low cardinality means few partitions, so minimal pruning. The exam expects you to choose a column with high cardinality (e.g., date, customer ID) for partitioning.
"Clustering can replace partitioning." → Wrong. Clustering organizes data within partitions but does not prune entire partitions. For large tables, partitioning is essential for significant cost savings.
"You can change the partition key after table creation." → Wrong. Partition key is immutable. To change it, you must recreate the table.
"Clustering works immediately after streaming inserts." → Wrong. Streaming data goes to a buffer and is not clustered until flushed (up to 90 minutes).
"You can partition by multiple columns." → Wrong. Only one partition column is allowed. Use clustering for additional columns.
Specific Numbers and Terms
Max partitions: 4000 (both date and integer range).
Max cluster columns: 4.
Partition expiration: Set in days; default is no expiration.
Require partition filter: Boolean option, default false.
Partition key types: DATE, TIMESTAMP, DATETIME, or integer (for range).
Cluster key types: STRING, BYTES, INT64, FLOAT64, NUMERIC, BIGNUMERIC, BOOL, DATE, TIMESTAMP, DATETIME, TIME.
Ingestion time pseudo-columns: _PARTITIONTIME (timestamp) and _PARTITIONDATE (date).
Edge Cases
NULL partition key: Rows with NULL partition key are stored in a special __NULL__ partition. Queries with IS NULL filter will scan only that partition.
Integer range partitioning: Must specify START, END, INTERVAL. END is exclusive. The number of partitions = (END - START) / INTERVAL, capped at 4000.
Partition decorators: For ingestion time partitioned tables, you can use $ syntax to query specific partitions (e.g., mytable$20230101).
Clustering on a column with many NULLs: NULLs are sorted first (if ascending) or last (if descending).
How to Eliminate Wrong Answers
If a question asks about reducing cost for a table that is frequently filtered by date and customer, the correct answer will involve partitioning by date and clustering by customer. Any answer suggesting only clustering or only partitioning by customer is wrong.
If a question mentions a table with 5000 distinct dates, partitioning by date is possible (5000 > 4000? No, 5000 > 4000, so you cannot partition by date because it would exceed 4000 partitions. The correct approach is to use integer range partitioning with a larger interval, or cluster by date without partitioning.
If a question asks about immediate clustering after streaming, the answer that mentions a delay (streaming buffer) is correct.
For cost optimization, always consider require_partition_filter and partition_expiration_days.
Partitioning reduces cost by pruning entire storage units; clustering reduces cost by pruning blocks within partitions.
Maximum partitions: 4000. Maximum cluster columns: 4.
Partition key must be DATE, TIMESTAMP, DATETIME, or integer (range). Cluster key must be a top-level column of supported type.
Partition key is immutable after table creation; cluster key is also immutable.
Set `require_partition_filter = true` to prevent accidental full table scans on partitioned tables.
Use `partition_expiration_days` to automatically delete old partitions and save storage costs.
Streaming inserts are not immediately partitioned or clustered; data is written after the streaming buffer flushes (up to 90 minutes).
For integer range partitioning, specify START, END (exclusive), and INTERVAL. Number of partitions = (END - START) / INTERVAL ≤ 4000.
Clustering works best when the cluster key columns are frequently used in filters and the data is well-sorted.
Choose partition key first based on high-cardinality filter columns, then cluster keys based on secondary filter columns.
These come up on the exam all the time. Here's how to tell them apart.
Partitioning
Divides table into separate storage units (partitions).
Prunes entire partitions based on filter on partition key.
Supports up to 4000 partitions.
Best for high-cardinality columns (e.g., date, customer ID).
Cannot be changed after table creation.
Clustering
Sorts data within each partition (or whole table) by cluster keys.
Prunes blocks within partitions based on filter on cluster keys.
Supports up to 4 cluster columns.
Best for columns used in filters but not suitable for partitioning (e.g., low cardinality, many distinct values).
Can be used with or without partitioning.
Mistake
Partitioning and clustering are the same thing.
Correct
Partitioning divides a table into separate storage units (partitions) based on a column, enabling partition pruning. Clustering sorts data within each partition (or the entire table) to allow block-level pruning. They are complementary but serve different purposes.
Mistake
You can partition by multiple columns.
Correct
BigQuery allows only one partition column per table. You can combine partitioning with clustering on multiple columns to achieve multi-dimensional pruning.
Mistake
Clustering guarantees that queries will always scan only relevant blocks.
Correct
Clustering is most effective when data is well-sorted. Over time, as data is inserted, the clustering may degrade. BigQuery automatically reclusters in the background, but it is not instantaneous. Queries on recently inserted data may scan more blocks.
Mistake
Partitioning by a column with low cardinality (e.g., boolean) is beneficial.
Correct
Low cardinality means few partitions (e.g., 2 for boolean). Partition pruning is minimal, so the overhead of managing partitions may outweigh the benefit. Use clustering instead for low-cardinality columns.
Mistake
Once a table is created, you can change its partition or cluster columns.
Correct
Partition and cluster columns are immutable after table creation. To change them, you must create a new table and copy the data (e.g., using a query or load job).
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
No, BigQuery allows only one partition column per table. However, you can combine partitioning with clustering to achieve multi-dimensional pruning. For example, partition by `date` and cluster by `customer_id` and `region`.
Both are pseudo-columns for ingestion-time partitioned tables. `_PARTITIONTIME` is a TIMESTAMP column representing the partition boundary (midnight of the partition date). `_PARTITIONDATE` is a DATE column equivalent to the date portion of `_PARTITIONTIME`. You can filter on either.
A partitioned table can have up to 4000 partitions. This applies to date, timestamp, and integer range partitioning. If you exceed 4000, you will get an error. For date partitioning, you can partition by day for about 11 years (365*11 ≈ 4015, so slightly over 4000). Use month or year partitioning for longer time spans.
Yes, you can create a clustered table without partitioning. The data is sorted across the entire table. This can help with block-level pruning, but without partition pruning, queries may still scan a large portion of the table if filters are not highly selective.
No, both are immutable. To change them, you must create a new table with the desired partitioning and clustering, then copy the data (e.g., using a `SELECT *` query or a load job).
If the table has `require_partition_filter = false` (default), the query will scan all partitions, potentially incurring high cost. If `require_partition_filter = true`, the query will fail with an error stating that a partition filter is required. You can use the `WHERE _PARTITIONTIME IS NOT NULL` trick to bypass this in some cases, but it will still scan all partitions.
Clustering itself does not incur additional storage costs. Storage is billed based on the compressed size of the data. Clustering may slightly increase storage due to sorting overhead, but the savings from reduced query costs typically far outweigh any minimal storage increase.
You've just covered BigQuery Partitioning and Clustering — now see how well it sticks with free ACE practice questions. Full explanations included, no account needed.
Done with this chapter?