This chapter covers data distribution methods in Azure Synapse Analytics dedicated SQL pools: hash, round-robin, and replicated tables. Understanding these distribution strategies is critical for the DP-900 exam, as approximately 10-15% of questions in the analytics domain test your ability to choose the correct distribution for a given workload scenario. We will explain the internal mechanism of each method, when to use them, and common exam traps.
Jump to a section
Imagine a large warehouse that receives shipments of books and must distribute them across three storage rooms for efficient retrieval. The warehouse manager has three strategies. First, Hash Distribution: The manager uses a rule based on the book's ISBN (like a hash function) to assign each book to a specific room. Every book with the same ISBN always goes to the same room, so if you need all books by the same author, they are together. This speeds up queries that join on that key but requires careful planning so that rooms are balanced. Second, Round-Robin Distribution: The manager simply places each incoming book into the next room in a cycle—first book to Room A, second to Room B, third to Room C, fourth back to Room A, and so on. This ensures the rooms are perfectly balanced, but if you want all books by one author, you have to check every room, which is slow for joins. Third, Replicated Distribution: The manager makes a copy of every book and puts a full set in each room. Now any room can answer any query independently, which is great for small reference tables like a list of publishers, but it wastes space and is impractical for large tables. In a data warehouse, these distribution strategies control how data is spread across compute nodes to balance performance and storage.
What is Data Distribution and Why Does It Matter?
In a massively parallel processing (MPP) architecture like Azure Synapse Analytics dedicated SQL pool, data is distributed across multiple compute nodes (called distributions). Each distribution is a unit of storage and processing that holds a portion of the table's data. The goal of distribution is to maximize query performance by minimizing data movement between nodes during query execution. Data movement is the most expensive operation in an MPP system—it involves shuffling rows across the network to collocate data needed for joins or aggregations. The distribution method determines how rows are assigned to distributions, directly impacting query performance and storage balance.
Hash Distribution
Hash distribution uses a deterministic hash function on one or more columns of the table (the distribution column or columns). When a row is inserted, the system computes a hash value of the distribution column(s) and maps that hash to a specific distribution. The same distribution column value always goes to the same distribution, ensuring that rows with the same key are colocated. This colocation is powerful for star schema scenarios where fact tables join with dimension tables on the same distribution key. For example, if a fact table and a dimension table both use the same CustomerID as the distribution column, then all rows for a given customer in both tables reside on the same node, avoiding data movement during joins.
Key characteristics:
The hash function uses a 32-bit hash value modulo the number of distributions (typically 60 in a dedicated SQL pool).
Distribution column should have high uniqueness (high cardinality) to avoid data skew. If many rows have the same distribution key value, they all go to one distribution, causing a hot spot.
Distribution column cannot be updated after insert (it is immutable).
Best for large fact tables and large dimension tables that are joined frequently.
Queries that filter on the distribution column can be pruned to a single distribution (distribution pruning), improving performance.
Round-Robin Distribution
Round-robin distribution distributes rows evenly across all distributions without any consideration of the data content. When a row is inserted, the system assigns it to the next distribution in a cyclic order. This ensures that each distribution gets approximately the same number of rows, providing perfect load balance for storage and for operations that scan the entire table (like full table scans). However, because there is no colocation of related rows, any join or aggregation that groups by a column will require data movement to shuffle rows to the correct distribution. Round-robin is the default distribution for tables created without specifying a distribution method.
Key characteristics:
Data is distributed in a round-robin fashion across all distributions (60 in a dedicated SQL pool).
No distribution column; all rows are treated equally.
Provides even data distribution regardless of data values.
Not suitable for large tables that are frequently joined or aggregated; leads to heavy data movement.
Often used for staging tables or temporary tables where the data is loaded quickly and then transformed into a hash-distributed table.
Also used for small tables that are not joined frequently (but replicated is often better for very small tables).
Replicated Tables
Replicated tables store a full copy of the table on each distribution node. This eliminates data movement entirely for queries that join the replicated table with other tables, because every node already has the complete data. Replication is ideal for small dimension tables (typically less than 2 GB compressed) that are frequently joined with large fact tables. The downside is that any write operation (INSERT, UPDATE, DELETE) must update all copies, which increases write latency. Also, storage overhead multiplies by the number of distributions (60 copies), so it is not suitable for large tables.
Key characteristics:
Table is copied to each of the 60 distributions.
Ideal for small dimension tables (size < 2 GB compressed) that are used in many joins.
Eliminates data movement for joins involving the replicated table.
Write operations are slower because they must update all copies.
Not suitable for large tables or tables with frequent writes.
Replicated tables are a good choice for lookup tables that change infrequently.
Choosing the Right Distribution
The choice depends on table size, usage pattern, and query workload. The general guidance from Microsoft is:
For large fact tables (size > 2 GB) that are joined with dimension tables, use hash distribution on a column that has high cardinality and is used in joins.
For large dimension tables (size > 2 GB) that are joined with fact tables, also use hash distribution on the join column.
For small dimension tables (size < 2 GB), use replicated distribution to avoid data movement.
For staging tables or tables that are not frequently queried, use round-robin distribution for fast loading.
Impact on Query Performance
Queries in a dedicated SQL pool are executed in parallel across all distributions. The query optimizer generates a plan that may require data movement if the required rows are not colocated. Data movement can be a broadcast move (sending a small table to all nodes) or a shuffle move (redistributing rows based on a join key). Hash-distributed tables on the join key avoid shuffle moves. Replicated tables avoid both broadcast and shuffle moves. Round-robin tables often require full shuffles, which are expensive.
Configuration and Verification
Distribution is defined at table creation using the WITH (DISTRIBUTION = ...) clause. For hash distribution, you must also specify a distribution column. Examples:
-- Hash distribution on CustomerID
CREATE TABLE FactSales (
SalesID INT NOT NULL,
CustomerID INT NOT NULL,
Amount DECIMAL(10,2)
)
WITH (DISTRIBUTION = HASH(CustomerID));
-- Round-robin distribution (default)
CREATE TABLE StagingData (
ID INT,
Name NVARCHAR(50)
)
WITH (DISTRIBUTION = ROUND_ROBIN);
-- Replicated table
CREATE TABLE DimProduct (
ProductID INT NOT NULL,
ProductName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE);To verify distribution, query sys.pdw_table_distribution_properties:
SELECT
t.name AS TableName,
p.distribution_policy_desc AS DistributionType
FROM sys.tables t
JOIN sys.pdw_table_distribution_properties p
ON t.object_id = p.object_id;To check data skew for hash-distributed tables:
DBCC PDW_SHOWSPACEUSED('FactSales');This command shows the row count per distribution. High variation indicates data skew.
Interaction with Table Types and Indexes
Distribution works together with table indexing (clustered columnstore, clustered index, heap). For hash-distributed tables, the distribution column is often the same as the partition column for partitioned tables, but not required. Replicated tables cannot be partitioned. Round-robin tables can be partitioned, but partitioning does not reduce data movement for joins.
Defaults and Limits
Default distribution: ROUND_ROBIN if not specified.
Number of distributions: 60 for dedicated SQL pool (fixed).
Replicated table size limit: 2 GB compressed (per table, not per distribution).
Distribution column for hash: must be a column with no updates allowed; cannot be a computed column or have certain data types like text/ntext/image.
Summary of Mechanism
Hash: Hash function on column value -> modulo 60 -> distribution ID. Rows with same key colocated.
Round-Robin: Sequence counter increments for each row -> distribution ID = counter % 60. Even distribution, no colocation.
Replicated: Full copy on each distribution. No colocation needed because all data is local.
Define Table with Distribution
When creating a table in a dedicated SQL pool, you specify the distribution method using the WITH (DISTRIBUTION = ...) clause. For hash distribution, you must also specify the distribution column(s) in parentheses. The system records this metadata in the catalog views. If no distribution is specified, round-robin is used by default. This step determines how future data inserts will be assigned to distributions.
Insert Rows into Table
During INSERT operations, for each row, the SQL pool engine evaluates the distribution method. For hash, it computes a hash of the distribution column value using a built-in hash function (e.g., BINARY_CHECKSUM or a custom algorithm) and then takes modulo 60 to get the target distribution ID. For round-robin, it uses a round-robin counter that increments per row, assigning the next distribution ID. For replicated, the row is inserted into all 60 distributions. The row is then stored in the local storage of the assigned distribution.
Execute Query with Joins
When a query joins two tables, the query optimizer analyzes the distribution columns. If both tables are hash-distributed on the same column with the same data type, the join can be performed locally on each distribution without data movement (colocated join). If not, the optimizer may choose to move data: either broadcast a small table to all nodes or shuffle one or both tables by the join key. Replicated tables avoid movement because every node already has the full data.
Monitor Data Skew
After loading data, you should check for data skew in hash-distributed tables using DBCC PDW_SHOWSPACEUSED. If one distribution has significantly more rows than others, the distribution column choice is poor. High skew leads to performance bottlenecks because the largest distribution becomes the slowest node. You may need to recreate the table with a different distribution column or use round-robin if skew is unavoidable.
Optimize Distribution Strategy
Based on monitoring and query patterns, you might change the distribution strategy. This requires recreating the table using CREATE TABLE AS SELECT (CTAS) with a new distribution option. You cannot alter distribution on an existing table. For replicated tables, if the table grows beyond 2 GB, you must switch to hash or round-robin. For round-robin tables that become heavily joined, consider changing to hash on the join column.
In a real-world Azure Synapse Analytics deployment, a retail company uses a dedicated SQL pool to analyze sales data. Their fact table, FactSales, has over 2 billion rows. They initially used round-robin distribution for fast loading, but queries joining with the Customer dimension took over 30 seconds due to data movement. After analyzing query patterns, they changed to hash distribution on CustomerID, which collocated rows from FactSales and DimCustomer. Query times dropped to under 2 seconds. However, they discovered significant data skew because a few customers had millions of rows. They mitigated this by using a composite distribution column (CustomerID + SaleDate) to distribute more evenly.
Another scenario: A financial services company uses replicated distribution for small dimension tables like DimCurrency (50 rows) and DimDate (10 years of dates, about 3650 rows). These tables are joined with a large fact table of transactions. Because they are replicated, every node has a local copy, and joins complete without any data movement. The company monitors table sizes using DMVs to ensure the replicated tables stay under 2 GB. When the DimProduct table grew to 3 GB due to new products, they switched it to hash distribution on ProductID to avoid the 2 GB replication limit.
A common misconfiguration is using round-robin for large fact tables that are frequently aggregated. For example, a marketing analytics team used round-robin for a 500 GB table and ran daily aggregation queries that grouped by CampaignID. Each query required shuffling the entire table across the network, taking 45 minutes. After switching to hash on CampaignID, the same query completed in 5 minutes. The team learned to always consider distribution when designing tables.
Performance considerations: Hash distribution can still cause data movement if the join column is not the distribution column. Replicated tables consume storage; with 60 distributions, a 1 GB replicated table uses 60 GB. Round-robin is best for staging tables that are truncated and reloaded frequently, as it provides even distribution without the overhead of hashing.
The DP-900 exam (objective 3.1: Describe how to choose the right distribution for a table) tests your ability to match distribution methods to use cases. Expect 2-3 questions on this topic. The exam does not require memorizing exact syntax but expects you to understand when to use each distribution.
Common Wrong Answers: 1. Choosing round-robin for a large fact table that is frequently joined with dimension tables. Candidates pick round-robin because it balances data evenly, but they forget that joins will be slow due to data movement. The correct answer is hash on the join column. 2. Choosing replicated for a large table (>2 GB). Candidates think replication eliminates data movement, but they ignore the size limit and storage overhead. The correct answer is hash or round-robin. 3. Choosing hash for a staging table that is loaded quickly and then dropped. Candidates think hash is always better, but for temporary data, round-robin is faster to load and provides even distribution. 4. Confusing hash distribution with partitioning. The exam may ask about both; remember that distribution spreads data across nodes, while partitioning splits data within a node.
Specific Numbers: - Number of distributions in a dedicated SQL pool: 60 (fixed). - Replicated table size limit: 2 GB compressed. - Default distribution: round-robin.
Edge Cases: - If a hash distribution column has low cardinality (e.g., Gender with only 2 values), data will be skewed (only 2 distributions get data). The exam may ask you to identify this as a poor choice. - If a table is small but frequently updated, replicated may cause write performance issues. The exam might ask for the best distribution for a small but frequently updated table; answer is hash or round-robin, not replicated. - When joining two tables, if both are hash-distributed on the same column, the join is colocated. If one is replicated, no movement is needed. If both are round-robin, a full shuffle is required.
How to Eliminate Wrong Answers: - If the scenario mentions a large fact table and joins, eliminate round-robin and replicated. - If the scenario mentions a small dimension table (under 2 GB), eliminate round-robin and hash in favor of replicated. - If the scenario mentions fast loading of staging data, eliminate hash (because hashing adds overhead) and replicated (because of write overhead). - If the scenario mentions data skew, look for a distribution column with high cardinality.
Three distribution methods: hash, round-robin, and replicated.
Hash distribution colocates rows with same distribution key, ideal for large fact and dimension tables joined on that key.
Round-robin distributes rows evenly but leads to data movement; best for staging tables.
Replicated tables store a full copy on each distribution; ideal for small dimension tables (< 2 GB compressed).
Number of distributions in a dedicated SQL pool is fixed at 60.
Default distribution is round-robin if not specified.
Distribution cannot be altered; must recreate table using CTAS.
Data skew occurs in hash distribution when distribution column has low cardinality.
Replicated tables have a 2 GB compressed size limit.
Distribution choice directly impacts query performance and storage balance.
These come up on the exam all the time. Here's how to tell them apart.
Hash Distribution
Rows with same distribution key go to same node (colocation).
Best for large tables that are frequently joined on the distribution column.
Distribution column must have high cardinality to avoid skew.
Query performance is high for filtered queries on distribution column (distribution pruning).
Slightly higher insert overhead due to hash computation.
Round-Robin Distribution
Rows are evenly distributed without regard to content.
Best for staging tables or tables not used in joins.
No distribution column; no skew issues.
Query performance is poor for joins and aggregations due to data movement.
Fastest insert performance because no hash computation.
Replicated Distribution
Full copy on every distribution (60 copies).
Eliminates data movement for joins involving this table.
Best for small dimension tables (< 2 GB compressed).
Write operations are slow (must update all copies).
Storage overhead is 60x the table size.
Hash Distribution (for small tables)
Data is split across distributions based on hash.
Joins on distribution column are colocated, but other joins require movement.
Suitable for large tables (> 2 GB) or tables with frequent writes.
Write operations are faster (only one distribution updated).
Storage overhead is 1x the table size (plus replication factor for HA).
Mistake
Hash distribution always provides even data distribution.
Correct
Hash distribution can cause data skew if the distribution column has low cardinality or many duplicate values. For example, a column with only two distinct values will send all rows to only two distributions. Even distribution requires a column with high uniqueness.
Mistake
Round-robin distribution is best for all tables because it distributes data evenly.
Correct
Round-robin provides even storage but leads to heavy data movement during joins and aggregations because related rows are scattered. It is only suitable for staging tables or tables that are not frequently queried with joins.
Mistake
Replicated tables are suitable for any table size as long as you have enough storage.
Correct
Azure Synapse enforces a 2 GB compressed limit for replicated tables. Exceeding this limit will cause errors. Also, write performance degrades because every insert/update must be applied to all 60 copies.
Mistake
You can change the distribution of a table using ALTER TABLE.
Correct
Distribution cannot be altered on an existing table. You must recreate the table using CREATE TABLE AS SELECT (CTAS) with the new distribution option.
Mistake
The distribution column for hash distribution can be updated after insert.
Correct
The distribution column is immutable. Once a row is inserted, the distribution column value cannot be changed. If you need to update it, you must delete and reinsert the row.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
The default distribution is round-robin. If you create a table without specifying a distribution method, Azure Synapse will distribute rows in a round-robin fashion across all 60 distributions. This ensures even distribution but is not optimal for tables that are frequently joined.
Choose a column that is used in joins with other tables, has high cardinality (many unique values), and does not have significant data skew. Avoid columns with few distinct values like Gender or Status. Also, the column should be immutable (no updates after insert). Common choices are CustomerID, ProductID, or DateKey (if high cardinality).
No. Azure Synapse enforces a 2 GB compressed size limit for replicated tables. If your table exceeds this limit, you must use hash or round-robin distribution. You can check the size using DBCC PDW_SHOWSPACEUSED or query the sys.dm_pdw_nodes_db_partition_stats DMV.
Distribution spreads data across compute nodes (distributions) to enable parallel processing. Partitioning splits data within a single distribution into smaller segments based on a partition key (e.g., date). Distribution is for parallelism across nodes; partitioning is for data management and pruning within a node. Both can be used together.
You cannot alter distribution on an existing table. You must create a new table with the desired distribution using CREATE TABLE AS SELECT (CTAS) and then rename the tables. For example: CREATE TABLE NewTable WITH (DISTRIBUTION = HASH(Column)) AS SELECT * FROM OldTable; DROP TABLE OldTable; RENAME OBJECT NewTable TO OldTable;
Data skew occurs when some distributions have significantly more rows than others in a hash-distributed table. This causes the query to wait for the largest distribution to finish, creating a bottleneck. You can detect skew using DBCC PDW_SHOWSPACEUSED. To fix it, choose a distribution column with higher cardinality or use a composite distribution key.
Replicated tables require every write operation (INSERT, UPDATE, DELETE) to be applied to all 60 copies. This increases write latency and can cause contention. For tables that are updated frequently, hash or round-robin distribution is better because only one copy (on one distribution) is modified.
You've just covered Data Warehouse Distribution: Hash, Round-Robin, Replicated — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?