AZ-305Chapter 55 of 103Objective 2.4

Azure Synapse Analytics Architecture

This chapter covers Azure Synapse Analytics architecture, a unified analytics service that combines big data and data warehousing. For the AZ-305 exam, understanding Synapse's components and how they integrate with other Azure data services is critical, as it appears in approximately 10-15% of questions related to data storage and analytics. You'll learn the differences between dedicated SQL pools, serverless SQL pools, and Apache Spark pools, along with their optimal use cases and architectural decisions.

25 min read
Intermediate
Updated May 31, 2026

Synapse as a Data Refinery

Imagine a massive oil refinery that takes crude oil from multiple pipelines (data sources). The refinery has two main processing units: a high-throughput distillation column for bulk processing (dedicated SQL pool) and a flexible, on-demand catalytic cracker for specialized batches (serverless SQL pool). Crude oil first enters a sorting yard (data lake) where it's stored in raw form. When the refinery needs to produce specific products (insights), it either runs the crude through the continuous distillation column for steady, high-volume output or uses the cracker for quick, ad-hoc batches without reserving capacity. The refinery also has a Spark-based laboratory (Apache Spark pools) that can experiment with new chemical reactions (data transformations) using the same crude samples. A central control room (Synapse Studio) oversees all operations, allowing engineers to design workflows (pipelines) that move oil between units, schedule maintenance, and monitor quality. The key is that all units share the same raw material storage (data lake), so no data duplication occurs. Just as a refinery can switch between processing modes based on demand, Synapse lets you choose between provisioned compute (dedicated pool) and on-demand compute (serverless) for the same data, paying only for what you use.

How It Actually Works

What is Azure Synapse Analytics?

Azure Synapse Analytics is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It provides a unified experience to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. The service is built on a massively parallel processing (MPP) architecture that distributes data across multiple nodes to provide high-performance query execution.

Core Components

Synapse consists of several key components: - Dedicated SQL Pool: A provisioned, petabyte-scale relational data warehouse. You define a Service Level Objective (SLO) in Data Warehouse Units (DWU) ranging from DW100c to DW30000c. The underlying architecture uses compute nodes (up to 60) and control nodes. Queries are distributed across nodes for parallel execution. - Serverless SQL Pool: An on-demand query service that queries data directly from the data lake (Azure Data Lake Storage Gen2 or Blob Storage). It uses a pay-per-query model with no provisioning. The service automatically scales based on query complexity. - Apache Spark Pool: Fully managed Spark clusters for data engineering, data science, and machine learning. You define the number of nodes (3 to 200) and node size (Small: 4 vCPU, 32 GB; Medium: 8 vCPU, 64 GB; Large: 16 vCPU, 128 GB). Spark pools can be autoscaled or dynamic allocation. - Synapse Pipelines: A cloud-based ETL/ELT service similar to Azure Data Factory. Pipelines orchestrate data movement and transformation activities. - Synapse Studio: A single web-based IDE for managing all Synapse resources, developing notebooks, authoring pipelines, and visualizing data.

How Dedicated SQL Pool Works Internally

When you create a dedicated SQL pool, you specify a performance level (DWU). The service provisions compute nodes based on the DWU. Each compute node has a distribution ID and stores a subset of the data. The control node coordinates queries: it receives the query, generates a distributed query plan, and sends it to compute nodes. Compute nodes execute in parallel and return results to the control node, which aggregates and returns them to the client. Data is distributed across 60 distributions using hash, round-robin, or replicated distribution.

Hash Distribution: Distributes rows evenly across distributions using a hash function on a distribution key. This is ideal for large fact tables that are joined on the distribution key.

Round-Robin Distribution: Distributes rows evenly without any optimization. Useful for staging tables.

Replicated Distribution: Copies the entire table to each compute node. Ideal for small dimension tables to avoid shuffling during joins.

Query Execution Flow

1.

Client sends T-SQL query to the control node.

2.

Control node parses, validates, and generates a distributed query plan.

3.

Control node sends plan fragments to compute nodes.

4.

Compute nodes execute their fragments in parallel, possibly shuffling data between nodes if needed.

5.

Compute nodes send partial results to control node.

6.

Control node aggregates and returns final result to client.

Serverless SQL Pool Architecture

Serverless SQL pool is built on top of a distributed query engine that reads data directly from Azure Storage. It uses a compute layer that scales automatically based on query complexity and data volume. There is no need to manage infrastructure. The service supports T-SQL and can query Parquet, CSV, JSON, and Delta Lake formats.

Resource Class: Serverless SQL pool has a resource class (small, medium, large, xlarge) that determines the amount of memory and I/O allocated per query. You can set it via SET RESOURCE_CLASS = 'medium';

External Tables: You can create external tables that reference data in the data lake, enabling querying without data ingestion.

Cost Model: You pay per terabyte of data processed. The first 1 TB per month in the pay-as-you-go tier is free.

Apache Spark Pool Architecture

Spark pools in Synapse are fully managed Apache Spark clusters. You define the number of nodes and node size. The cluster consists of a driver node and executor nodes. The driver runs the SparkContext and schedules tasks on executors. Executors process data and store intermediate results.

Autoscale: You can enable autoscaling to automatically add or remove nodes based on workload.

Dynamic Allocation: Spark dynamically adjusts the number of executors based on the workload.

Libraries: You can install custom libraries using requirements.txt or conda environments.

Synapse Link

Synapse Link provides real-time analytics on operational data by integrating with Azure Cosmos DB and SQL Server. It uses Change Data Capture (CDC) to replicate changes from the operational store to Synapse without impacting the source.

Integration with Azure Data Lake Storage Gen2

Synapse uses ADLS Gen2 as the primary storage layer. Data can be stored in the default file system or in linked services. The hierarchical namespace of ADLS Gen2 enables efficient file operations. Synapse can query data in place without moving it.

Security

Workspace Firewall: Restrict access to the workspace by IP rules.

Managed Virtual Network: Isolate Synapse resources within a managed VNet.

Private Endpoints: Securely connect to Synapse from on-premises or other Azure services.

Azure AD Authentication: Use Azure AD for user authentication.

Column-Level Security: Restrict access to sensitive columns.

Dynamic Data Masking: Mask sensitive data in query results.

Row-Level Security: Restrict rows based on user context.

Performance Optimization for Dedicated SQL Pool

Table Design: Use hash distribution on join keys, replicate small tables, and avoid round-robin for large tables.

Partitioning: Partition large tables by date column to enable partition elimination.

Indexes: Use clustered columnstore indexes for large tables, clustered indexes for small tables.

Statistics: Keep statistics updated with UPDATE STATISTICS.

Materialized Views: Pre-compute complex aggregations.

Result Set Caching: Cache query results for repeated executions.

Monitoring and Management

DMVs: Use dynamic management views like sys.dm_pdw_exec_requests to monitor queries.

Azure Monitor: Configure diagnostic settings to send logs to Log Analytics.

Synapse Studio Monitoring Hub: View active queries, pipelines, and Spark applications.

Disaster Recovery

Geo-Backup: Dedicated SQL pool automatically performs geo-redundant backups every 8 hours to a paired region.

Restore Points: User-defined restore points can be created before major changes.

Cross-Region Restore: Restore a backup to a different region.

Common Commands

-- Create a dedicated SQL pool
CREATE DATABASE myPool WITH (SERVICE_OBJECTIVE = 'DW100c');

-- Scale a dedicated SQL pool
ALTER DATABASE myPool MODIFY (SERVICE_OBJECTIVE = 'DW500c');

-- Pause a dedicated SQL pool (stops billing for compute)
ALTER DATABASE myPool PAUSE;

-- Resume a dedicated SQL pool
ALTER DATABASE myPool RESUME;

-- Query an external table in serverless SQL pool
SELECT TOP 10 * FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/parquet/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;

Azure CLI Commands

# Create Synapse workspace
az synapse workspace create --name myworkspace --resource-group myrg --storage-account mystorage --file-system myfs

# Create a dedicated SQL pool
az synapse sql pool create --workspace-name myworkspace --resource-group myrg --name mypool --performance-level DW100c

# Pause a dedicated SQL pool
az synapse sql pool pause --workspace-name myworkspace --resource-group myrg --name mypool

Walk-Through

1

Create Synapse Workspace

The first step is provisioning a Synapse workspace in an Azure region. You select a resource group, workspace name, and specify the default data lake storage (ADLS Gen2) with a file system. The workspace is the top-level container for all Synapse resources. During creation, you can enable managed virtual network and configure firewall rules. The workspace URL will be `https://web.azuresynapse.net`.

2

Provision Dedicated SQL Pool

Within the workspace, create a dedicated SQL pool by specifying a name and performance level (DWU). The service provisions compute nodes and creates a control node. The pool is online and billing starts immediately. You can connect using SQL Server Management Studio (SSMS) or Azure Data Studio. The pool is an MPP database with 60 distributions. You can later pause or scale the pool.

3

Create External Data Sources

To query data in the data lake, create external data sources, file formats, and external tables. For serverless SQL pool, use `CREATE EXTERNAL DATA SOURCE` with the ADLS Gen2 URL. For dedicated SQL pool, use PolyBase to define external tables. This step enables querying data in place without loading into the pool.

4

Ingest Data via Pipeline

Use Synapse Pipelines to copy data from various sources (Azure SQL DB, Blob Storage, on-premises) into the data lake or dedicated SQL pool. Pipelines can include data flows for transformations. Activities run on an Integration Runtime (IR) – Azure IR for cloud sources, self-hosted IR for on-premises. You can schedule pipelines or trigger them manually.

5

Query Data with Serverless SQL

Use serverless SQL pool to run ad-hoc queries directly on data lake files. Write T-SQL queries using `OPENROWSET` or `SELECT` from external tables. The service automatically scales compute resources based on query complexity. Example: `SELECT COUNT(*) FROM OPENROWSET(BULK 'https://.../*.csv', FORMAT='CSV') AS r`. Pay per terabyte of data processed.

6

Analyze with Apache Spark

Create a Spark pool in the workspace. Develop and run notebooks using Python, Scala, or SQL. Spark reads data from the data lake and performs transformations like aggregations, joins, and machine learning. The results can be written back to the data lake or loaded into the dedicated SQL pool. Spark pools can be autoscaled.

7

Visualize in Power BI

Connect Power BI directly to Synapse workspaces. From Synapse Studio, you can create Power BI datasets and reports. DirectQuery mode allows real-time queries on dedicated or serverless SQL pools. Import mode is also supported for smaller datasets. Power BI reports can be embedded in Synapse Studio.

What This Looks Like on the Job

Enterprise Scenario 1: Retail Analytics Platform

A large retailer ingests terabytes of daily sales, inventory, and customer data from thousands of stores. They use Synapse with a dedicated SQL pool (DW1000c) for core reporting and a serverless SQL pool for ad-hoc analytics by data scientists. The data pipeline runs hourly, copying transactional data from on-premises SQL Server to ADLS Gen2 using self-hosted Integration Runtime. A Synapse pipeline then transforms the data using mapping data flows and loads into the dedicated SQL pool. Power BI dashboards provide real-time sales and inventory visibility. The dedicated SQL pool is scaled up during peak holiday seasons to DW3000c and scaled down or paused at night to save costs. The serverless SQL pool is used by analysts to query raw data for trend analysis without impacting the main warehouse. A common issue they faced was query performance degradation due to outdated statistics, which they resolved by scheduling UPDATE STATISTICS after each load.

Enterprise Scenario 2: Healthcare Data Lakehouse

A healthcare provider consolidates patient records, clinical trials, and genomic data. They use Synapse as a data lakehouse with ADLS Gen2 as the single source of truth. The dedicated SQL pool stores structured patient demographics and claims data, while genomic data in Parquet format is queried via serverless SQL pool. Apache Spark pools process genomic data using Python libraries. Synapse Link captures real-time changes from their operational Cosmos DB for patient vitals. The workspace is secured with managed VNet and private endpoints to ensure HIPAA compliance. They encountered a challenge with Spark job failures due to driver OOM errors, which they mitigated by increasing the driver memory and enabling dynamic allocation. The key lesson was to design file partitioning carefully – using too many small files (file fragmentation) killed performance in serverless SQL queries.

Enterprise Scenario 3: Financial Services Regulatory Reporting

A bank must generate daily regulatory reports across multiple business units. They use Synapse pipelines to extract data from various source systems (Oracle, SQL Server, flat files) and load into a dedicated SQL pool. The pool uses hash distribution on account ID to optimize joins. Materialized views pre-aggregate daily balances to speed up reports. They use row-level security to restrict data access by business unit. A major misconfiguration occurred when they initially used round-robin distribution for a large fact table, causing severe query performance. Switching to hash distribution improved query times by 10x. They also learned to avoid over-partitioning – partitioning by date at the month level was sufficient; daily partitioning caused too many small partitions.

How AZ-305 Actually Tests This

What AZ-305 Tests

Objective 2.4 (Design a data storage solution for relational data) includes Synapse Analytics as a key technology. The exam focuses on:

Choosing between dedicated SQL pool, serverless SQL pool, and Azure SQL Database based on workload patterns.

Understanding when to use PolyBase vs. COPY command for data loading.

Identifying appropriate distribution types (hash, round-robin, replicated) for different table sizes and join patterns.

Recognizing the cost implications: dedicated pools charge for provisioned compute even when idle; serverless charges per query processed.

Knowing that serverless SQL pool can query data in ADLS Gen2 and Blob Storage but not in Azure SQL DB or Cosmos DB (except via Synapse Link).

Common Wrong Answers

1.

'Serverless SQL pool is always cheaper' – Candidates often assume serverless is cheaper for all workloads. Reality: For steady, predictable workloads with frequent queries, a dedicated pool may be more cost-effective because serverless charges per TB processed, and high query volumes can exceed the cost of a reserved dedicated pool.

2.

'Dedicated SQL pool supports autoscaling' – While you can manually scale, dedicated SQL pool does not autoscale automatically. Candidates confuse it with Azure SQL Database serverless tier.

3.

'Synapse Link provides real-time replication to dedicated SQL pool' – Synapse Link currently supports Cosmos DB and SQL Server, but changes are replicated to a serverless SQL pool, not directly to a dedicated pool. You can then load into dedicated pool via pipeline.

4.

'PolyBase is the only way to load data into dedicated SQL pool' – The COPY command is now the recommended method for high-throughput loading. PolyBase is still supported but not optimal.

Specific Exam Numbers

DWU range: DW100c to DW30000c.

Maximum compute nodes: 60.

Serverless SQL pool free tier: first 1 TB per month.

Spark pool node sizes: Small (4 vCPU, 32 GB), Medium (8 vCPU, 64 GB), Large (16 vCPU, 128 GB).

Default autoscale range: 3 to 200 nodes.

Edge Cases

If a dedicated SQL pool is paused, you cannot query it but billing for storage continues.

Serverless SQL pool cannot query data in Azure SQL DB directly; you must use external tables or copy data.

Synapse pipelines can invoke Azure Functions and Databricks notebooks, not just Synapse activities.

How to Eliminate Wrong Answers

When you see a question about cost, ask: 'Is the workload predictable or intermittent?' Predictable → dedicated pool. Intermittent → serverless. For performance, look for 'large fact table joins' → hash distribution. For 'small dimension table' → replicated. If the question mentions 'real-time analytics on operational data' → Synapse Link with Cosmos DB.

Key Takeaways

Azure Synapse Analytics unifies data warehousing (dedicated SQL pool) and big data analytics (Spark) with on-demand querying (serverless SQL pool).

Dedicated SQL pool uses MPP with up to 60 compute nodes and 60 distributions; choose hash, round-robin, or replicated distribution based on table size and join patterns.

Serverless SQL pool charges per TB of data processed; first 1 TB per month is free.

The COPY command is the recommended method for high-throughput data loading into dedicated SQL pool, not PolyBase.

Synapse Link provides real-time analytics on Cosmos DB and SQL Server data, replicating to serverless SQL pool.

Dedicated SQL pool does not autoscale; you must manually scale or pause/resume.

For predictable, high-performance workloads, use dedicated SQL pool; for ad-hoc, cost-sensitive queries, use serverless SQL pool.

Spark pool node sizes: Small (4 vCPU, 32 GB), Medium (8 vCPU, 64 GB), Large (16 vCPU, 128 GB); default autoscale 3-200 nodes.

Security features include workspace firewall, managed VNet, private endpoints, column-level security, dynamic data masking, and row-level security.

Use materialized views, result set caching, and updated statistics to optimize dedicated SQL pool performance.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Dedicated SQL Pool

Provisioned compute – you pay for DWU even when idle.

Supports complex queries on petabyte-scale data with MPP.

Data must be loaded into the pool (though external tables are possible).

Can pause and resume to save compute costs.

Best for predictable, high-performance workloads.

Serverless SQL Pool

Pay-per-query – no provisioning, pay only for data processed.

Queries data directly from data lake without loading.

Scalable automatically but limited to 1 TB free per month.

Cannot be paused – always available.

Best for ad-hoc, exploratory, or infrequent queries.

Watch Out for These

Mistake

Dedicated SQL pool and serverless SQL pool are the same thing with different pricing.

Correct

They are architecturally different: dedicated SQL pool uses provisioned MPP compute with 60 distributions and requires data loaded into the pool; serverless SQL pool is an on-demand query engine that reads data directly from the data lake without data loading.

Mistake

You can use serverless SQL pool to query data in Azure SQL Database directly.

Correct

Serverless SQL pool can only query data in Azure Storage (ADLS Gen2, Blob Storage) and Cosmos DB via Synapse Link. To query Azure SQL DB, you must use external tables with a linked service or copy data.

Mistake

Dedicated SQL pool automatically scales based on load.

Correct

Dedicated SQL pool does not autoscale. You must manually scale or pause/resume. Azure SQL Database has a serverless tier that auto-scales, but Synapse dedicated pool does not.

Mistake

Synapse Link replicates data to a dedicated SQL pool in real time.

Correct

Synapse Link replicates changes to a serverless SQL pool, not directly to a dedicated pool. You can then use a pipeline to load into a dedicated pool, but that adds latency.

Mistake

PolyBase is the fastest way to load data into a dedicated SQL pool.

Correct

The COPY command is now the recommended method for loading large datasets into a dedicated SQL pool. It offers higher throughput and simpler syntax compared to PolyBase.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between dedicated SQL pool and serverless SQL pool in Azure Synapse?

Dedicated SQL pool is a provisioned, petabyte-scale data warehouse with MPP architecture. You pay for compute (DWU) even when idle. Serverless SQL pool is an on-demand query service that charges per terabyte of data processed. Dedicated pool requires data loading; serverless queries data directly from the data lake. Choose dedicated for predictable, high-performance workloads; serverless for ad-hoc or infrequent queries.

Can I use serverless SQL pool to query data in Azure SQL Database?

No, serverless SQL pool can only query data stored in Azure Data Lake Storage Gen2, Azure Blob Storage, or Cosmos DB (via Synapse Link). To query Azure SQL Database, you must use external tables with a linked service or copy the data into the data lake first.

How does data distribution work in dedicated SQL pool?

Dedicated SQL pool distributes data across 60 distributions using three methods: hash distribution (distributes by hash of a key column), round-robin (evenly without optimization), and replicated (full copy on each node). Hash is best for large fact tables joined on the distribution key; replicated for small dimension tables; round-robin for staging tables.

What is the COPY command and how is it different from PolyBase?

The COPY command is the recommended method for loading data into a dedicated SQL pool. It provides higher throughput, simpler syntax, and supports more file formats than PolyBase. Example: `COPY INTO dbo.Sales FROM 'https://...' WITH (FILE_TYPE='CSV', CREDENTIAL=(IDENTITY='...', SECRET='...'))`. PolyBase is older but still supported for external tables.

How do I optimize query performance in a dedicated SQL pool?

Optimize by: (1) Choosing appropriate distribution (hash for large tables, replicated for small). (2) Partitioning large tables on date columns for partition elimination. (3) Using clustered columnstore indexes. (4) Keeping statistics updated. (5) Using materialized views for pre-aggregation. (6) Caching results with result set caching. (7) Scaling the pool to a higher DWU for complex queries.

What is Synapse Link and when should I use it?

Synapse Link provides real-time analytics on operational data by replicating changes from Azure Cosmos DB or SQL Server to a serverless SQL pool in Synapse. Use it when you need near-real-time insights from transactional systems without impacting source performance. The data is available for querying within seconds of the change.

Can I pause a serverless SQL pool to save costs?

No, serverless SQL pool is always available and cannot be paused. You only pay for queries executed (per TB processed). To control costs, you can limit query concurrency and set budget alerts. In contrast, dedicated SQL pool can be paused to stop compute billing.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Azure Synapse Analytics Architecture — now see how well it sticks with free AZ-305 practice questions. Full explanations included, no account needed.

Done with this chapter?