A company uses Azure Synapse Analytics dedicated SQL pool for a large data warehouse. The fact table contains billions of rows and is hash-distributed on ProductID. Frequent queries join this fact table with a small Store dimension table (10,000 rows) and a medium-sized Product dimension table (500,000 rows). The queries aggregate sales by store and product for recent months, but run slowly due to data movement during joins. Which design change will most reduce data movement and improve query performance?
Correct. Replicating a small dimension table (less than 1 GB) copies it to all distributions, eliminating data movement when joining with the fact table. This directly addresses the performance issue.
Why this answer
Replicating the small Store dimension table (10,000 rows) across all compute nodes eliminates the need to shuffle data during joins with the fact table. In Azure Synapse dedicated SQL pool, replicated tables store a full copy on each distribution, so queries that join a replicated table with a distributed fact table avoid costly data movement, significantly improving performance for frequent aggregation queries.
Exam trap
The trap here is that candidates often think changing the distribution key or adding an index will solve data movement, but they overlook that replicating the small dimension table is the most direct and cost-effective way to eliminate shuffling for frequent joins.
How to eliminate wrong answers
Option B is wrong because changing the fact table to round-robin distribution would distribute rows randomly without any hash key, which would force full data movement for every join and aggregation, making performance worse. Option C is wrong because changing the distribution key to StoreID would co-locate fact rows with the same StoreID on the same distribution, but the Store dimension is small and already a candidate for replication; more importantly, the fact table is large and hash-distributed on ProductID for other workloads, and changing the key could break existing query patterns and still require movement for ProductID-based joins. Option D is wrong because adding a nonclustered index on StoreID in the fact table does not reduce data movement during joins; indexes improve local data access but do not affect the distribution-level data shuffling required when tables are on different distributions.