This chapter covers Azure Synapse Analytics, a unified analytics service that combines data integration, enterprise data warehousing, and big data analytics. For the DP-900 exam, understanding Synapse's components (dedicated SQL pool, serverless SQL pool, Apache Spark pool, and Synapse pipelines) and its role in modern data architectures is critical. Questions on this topic account for approximately 10-15% of the analytics domain (Objective 3.1). You will need to identify when to use Synapse versus other Azure data services and describe its core capabilities.
Jump to a section
Imagine a large supermarket (Azure Synapse Analytics) that combines a warehouse (dedicated SQL pool for structured data) with a farmers' market (serverless SQL pool for on-demand queries) and a library (Apache Spark for data transformation). The supermarket has a single entrance (Synapse Studio) where you can plan your shopping (data integration pipelines). The warehouse stores pallets of boxed goods (relational tables with indexed data) and requires a membership (dedicated resources). The farmers' market has fresh produce (data in files like CSV or Parquet) that you can pick directly without membership (pay-per-query). The library has books (Spark notebooks) that you can use to cook (transform) raw ingredients into ready meals (processed datasets). The supermarket also has a delivery service (Synapse Link) that brings hot items (real-time data) straight from the farm (operational databases like Cosmos DB) to your kitchen (analytics). This integrated design means you can store, query, transform, and analyze data in one place, using the right tool for each task without leaving the store.
What is Azure Synapse Analytics?
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides a unified experience to ingest, explore, prepare, transform, manage, and serve data for immediate BI and machine learning needs. Synapse replaces the older Azure SQL Data Warehouse and adds significant capabilities including serverless on-demand query, Apache Spark integration, and deep integration with Power BI and Azure Machine Learning.
Why It Exists
Traditional data warehouses are optimized for structured, relational data and struggle with semi-structured or unstructured data. Conversely, big data platforms like Hadoop handle variety but lack the performance and governance of a warehouse. Azure Synapse bridges this gap by offering multiple compute engines that can operate on the same data lake. The key innovation: you can use dedicated SQL pools for high-performance relational queries, serverless SQL pools for ad-hoc queries over data lake files, and Apache Spark for data engineering and machine learning, all within a single service and with a single security and metadata layer.
How It Works Internally
Synapse is built on a separation of compute and storage. Data is stored in Azure Data Lake Storage Gen2 (ADLS Gen2) or in managed storage within Synapse. Compute resources are provisioned independently: - Dedicated SQL Pool: Uses a distributed query processing architecture. Data is distributed across 60 distributions using hash or round-robin distribution. Queries are executed in parallel across all nodes. The control node coordinates query execution and results aggregation. You can pause and resume the compute to save costs. - Serverless SQL Pool: This is a query service over files in your data lake. It does not store data; it simply reads files (Parquet, CSV, JSON, etc.) and returns results. It uses a pay-per-query model. The engine automatically scales based on query complexity and data volume. - Apache Spark Pool: Fully managed Spark clusters. You define the pool size (number of nodes and node size). Spark jobs run in a serverless manner; you pay for the compute used. Synapse integrates Spark with SQL via notebooks and can read/write data from the same data lake. - Synapse Pipelines: Similar to Azure Data Factory, pipelines orchestrate data movement and transformation. They can trigger activities such as copying data, running a Spark job, or executing a SQL script. Pipelines can be scheduled or event-driven.
Key Components, Values, and Defaults
Dedicated SQL Pool Performance Levels: Measured in Data Warehouse Units (DWU). Available levels: DW100c, DW200c, DW300c, DW400c, DW500c, DW600c, DW1000c, DW1200c, DW1500c, DW2000c, DW3000c, DW6000c, DW15000c. The 'c' stands for compute-optimized Gen2. DWU determines CPU, memory, and I/O. You can scale up or down in minutes.
Serverless SQL Pool: No provisioning needed. You query files directly using T-SQL. The cost is $5 per TB of data scanned (for standard tier). Reserved capacity pricing is available.
Apache Spark Pool: Node sizes: Small (4 vCPU, 32 GB RAM), Medium (8 vCPU, 64 GB RAM), Large (16 vCPU, 128 GB RAM). Autoscaling can be enabled. The default Spark version is 3.3 (as of 2024).
Synapse Pipelines: Activities include Copy Data, Data Flow, Execute SQL, Notebook, and others. Triggers can be schedule (e.g., every hour) or event-based (e.g., when a new file arrives in blob storage).
Synapse Link: Enables near real-time analytics on operational data from Azure Cosmos DB, SQL Server, and Dataverse. It uses change feed to replicate data continuously with latency under 1 minute.
Configuration and Verification Commands
To create a Synapse workspace using Azure CLI:
az synapse workspace create \
--name myworkspace \
--resource-group myResourceGroup \
--storage-account mydatalakestorage \
--file-system myfilesystem \
--sql-admin-login-user adminuser \
--sql-admin-login-password P@ssw0rd1234To create a dedicated SQL pool:
az synapse sql pool create \
--workspace-name myworkspace \
--name mypool \
--resource-group myResourceGroup \
--performance-level DW1000cTo query using serverless SQL pool:
SELECT TOP 10 *
FROM OPENROWSET(
BULK 'https://mydatalakestorage.dfs.core.windows.net/myfilesystem/data/*.parquet',
FORMAT='PARQUET'
) AS [result]How It Interacts with Related Technologies
Azure Data Lake Storage Gen2: The primary storage layer. Synapse workspaces automatically create a linked service to the ADLS Gen2 account. All engines (dedicated, serverless, Spark) can read/write to the same lake.
Power BI: Synapse has a built-in Power BI workspace integration. You can directly create Power BI datasets from Synapse SQL pools and publish reports.
Azure Machine Learning: You can train ML models using Spark MLlib or integrate with Azure ML for automated ML. The trained models can be deployed as endpoints and consumed within Synapse.
Azure Purview: Synapse integrates with Purview for data cataloging and lineage. You can register Synapse as a data source in Purview to track data movement and transformations.
Microsoft 365: Synapse Link for Dataverse allows real-time analytics on Dynamics 365 and Power Platform data.
Provision Synapse Workspace
First, create a Synapse workspace in the Azure portal or via CLI. You need an Azure subscription and a resource group. The workspace requires a storage account (ADLS Gen2) and a file system. You also specify SQL admin credentials for the built-in serverless SQL pool and any dedicated SQL pools you plan to create. The workspace acts as the central management hub for all Synapse resources.
Ingest Data Using Pipelines
Use Synapse Pipelines (or Copy Data tool) to move data from source systems (Azure Blob, SQL Server, Cosmos DB, etc.) into the data lake. You can schedule pipelines to run at specific intervals or trigger them based on events. The pipeline activity copies data into the ADLS Gen2 storage in the chosen format (Parquet, CSV, etc.). This step populates the data lake for subsequent analysis.
Explore and Prepare Data with Spark
Launch an Apache Spark pool and create a notebook. Use Python, Scala, or SQL to read raw data from the lake, perform transformations (filtering, aggregations, joins), and write the cleaned data back to the lake. Spark pools are auto-scaled and can be paused when not in use. This step is for data engineering tasks that require programmatic logic.
Query Data with Serverless SQL
Use the built-in serverless SQL pool to run ad-hoc queries directly on files in the data lake. You can use T-SQL to query Parquet, CSV, or JSON files without loading them into a database. This is ideal for data exploration and quick reporting. The cost is based on the amount of data scanned. You can also create views and external tables for reusability.
Load Data into Dedicated SQL Pool
For high-performance, interactive queries, load data from the data lake into a dedicated SQL pool. Use CTAS (CREATE TABLE AS SELECT) or PolyBase to efficiently load data. The dedicated SQL pool stores data in a distributed manner across 60 distributions. Once loaded, you can run complex aggregations and joins with low latency. You can pause the pool when not in use to save costs.
Enterprise Scenario 1: Retail Analytics
A large retail chain uses Azure Synapse to unify sales data from thousands of stores, inventory data from warehouses, and customer clickstream data from their e-commerce platform. They ingest data daily using Synapse Pipelines from on-premises SQL Server and Azure Blob Storage. The data is stored in ADLS Gen2 as Parquet files. Data engineers use Spark notebooks to clean and join the datasets, creating a unified sales fact table. Business analysts then use the serverless SQL pool to run ad-hoc queries, such as 'What was the sales lift from last week's promotion?' For dashboards, they load aggregated data into a dedicated SQL pool and connect it to Power BI. The dedicated pool is scaled to DW1000c during business hours and paused overnight. This architecture handles 10 TB of data and supports 500 concurrent users. A common misconfiguration is not partitioning large tables in the dedicated pool, leading to full scans and slow queries.
Enterprise Scenario 2: Real-Time IoT Analytics
A manufacturing company uses sensors to monitor equipment temperature and vibration. Data streams into Azure Event Hubs and is ingested via Synapse Pipelines into ADLS Gen2. They use Synapse Link for real-time analytics on Cosmos DB, which stores the latest sensor readings. Engineers run Spark streaming jobs to detect anomalies (e.g., temperature spikes) and alert the maintenance team. Historical data is stored in Parquet files and queried using serverless SQL for trend analysis. The dedicated SQL pool is used for monthly reporting. The key challenge is managing the cost of serverless SQL queries; they use data partitioning and file pruning to limit scanned data. They also set up budget alerts to avoid unexpected bills.
Enterprise Scenario 3: Healthcare Data Lake
A hospital network aggregates patient records, lab results, and imaging metadata. They use Synapse to comply with data governance requirements (HIPAA). All data is stored in ADLS Gen2 with Azure Purview for cataloging. Researchers use Spark notebooks to run machine learning models on de-identified data. The serverless SQL pool allows clinicians to query specific cohorts (e.g., 'patients with condition X and medication Y') without moving data. The dedicated SQL pool is used for operational reporting. A common pitfall is not setting up proper security: they use Azure AD passthrough and row-level security to ensure only authorized users see sensitive data. Misconfiguration of firewall rules can expose the workspace to unauthorized access.
What DP-900 Tests on This Topic
The exam objective 3.1 (Analytics) expects you to identify the appropriate analytics service for a given scenario. For Azure Synapse, you must know:
The components: dedicated SQL pool, serverless SQL pool, Apache Spark pool, and pipelines.
When to use dedicated vs. serverless: dedicated for predictable, high-performance workloads; serverless for ad-hoc, exploratory queries.
That Synapse is the evolution of Azure SQL Data Warehouse.
That Synapse Link enables real-time analytics on operational data.
Common Wrong Answers and Why
'Azure Synapse is just a data warehouse' – Wrong because Synapse is a unified analytics platform that includes data integration, Spark, and serverless querying, not just a warehouse.
'Serverless SQL pool stores data' – Wrong; serverless SQL pool queries files directly without storing data. Candidates confuse it with dedicated pool.
'Synapse Pipelines are the same as Azure Data Factory' – Partially true, but Synapse Pipelines are built-in and share the same engine. The exam expects you to know they are equivalent in functionality.
'You can use Synapse only with structured data' – Wrong; Synapse works with structured (tables), semi-structured (JSON), and unstructured (files) data via Spark.
Specific Numbers and Terms
DWU (Data Warehouse Units) levels: DW100c to DW15000c.
Serverless SQL cost: $5/TB scanned (standard tier).
Dedicated SQL pool distributions: 60.
Spark pool node sizes: Small (4 vCPU, 32 GB), Medium (8 vCPU, 64 GB), Large (16 vCPU, 128 GB).
Synapse Link latency: under 1 minute.
Edge Cases and Exceptions
The exam may ask: 'Which service should you use to run a one-time query on a 1 TB Parquet file?' Answer: Serverless SQL pool (not dedicated, because provisioning a dedicated pool is overkill). Another edge case: 'You need to transform data using Python and then train a model.' Answer: Apache Spark pool. Also, note that Synapse Link is specifically for Cosmos DB, SQL Server, and Dataverse – not for all databases.
How to Eliminate Wrong Answers
If the scenario mentions 'real-time analytics on operational data', look for 'Synapse Link'. If it mentions 'ad-hoc queries on data lake files', choose 'serverless SQL pool'. If it mentions 'high-performance, predictable queries', choose 'dedicated SQL pool'. If it mentions 'data transformation using code', choose 'Apache Spark pool'. Remember that Synapse is a single service that combines all these capabilities.
Azure Synapse Analytics unifies data integration, warehousing, and big data analytics in one service.
Dedicated SQL pool provides high-performance relational querying with DWU scaling from DW100c to DW15000c.
Serverless SQL pool queries files in ADLS Gen2 directly with pay-per-query pricing ($5/TB scanned).
Apache Spark pool in Synapse enables data engineering and machine learning using Python, Scala, or SQL.
Synapse Pipelines are built on Azure Data Factory technology for data orchestration.
Synapse Link provides near real-time analytics on operational data from Cosmos DB, SQL Server, and Dataverse with sub-minute latency.
Synapse replaces Azure SQL Data Warehouse and adds serverless and Spark capabilities.
Security integration includes Azure AD, managed identity, firewall rules, and private endpoints.
These come up on the exam all the time. Here's how to tell them apart.
Dedicated SQL Pool
Stores data in managed tables within the pool.
Requires provisioning and pays for allocated resources per hour.
Best for predictable, high-performance workloads with consistent query patterns.
Supports row-level security, partitioning, and indexes.
Can be paused to stop compute costs.
Serverless SQL Pool
Queries data directly from files in the data lake without storing it.
Pay-per-query based on data scanned ($5/TB).
Ideal for ad-hoc exploration, data discovery, and one-time queries.
No provisioning; automatically scales.
Cannot pause; costs are incurred only when queries run.
Mistake
Azure Synapse Analytics is the same as Azure SQL Data Warehouse.
Correct
Synapse is the evolution of Azure SQL Data Warehouse with added capabilities: serverless SQL, Apache Spark, pipelines, and deep integration with Power BI and Azure ML. SQL Data Warehouse only had dedicated SQL pool.
Mistake
Serverless SQL pool stores data permanently.
Correct
Serverless SQL pool does not store any data; it queries files directly from the data lake. There is no local storage. Data remains in ADLS Gen2.
Mistake
You must use dedicated SQL pool for all queries in Synapse.
Correct
You can use serverless SQL pool for ad-hoc queries and Spark for transformations. Dedicated SQL pool is for high-performance, interactive queries on loaded data.
Mistake
Synapse Link is the same as Azure Data Factory.
Correct
Synapse Link is a feature for real-time analytics on operational data (Cosmos DB, SQL Server, Dataverse). Azure Data Factory is a separate data integration service, though Synapse Pipelines are built on the same technology.
Mistake
Spark pools in Synapse are always running and incur cost continuously.
Correct
Spark pools can be set to autoscale and can be paused when not in use. You pay only for the compute used during job execution. You can also set a time-to-live (TTL) for idle sessions.
Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.
Dedicated SQL pool is a provisioned data warehouse that stores data in tables and charges per hour for allocated resources (DWU). It is optimized for predictable, high-performance queries. Serverless SQL pool is an on-demand query service that reads files directly from the data lake and charges per TB of data scanned. It requires no provisioning and is ideal for ad-hoc exploration. For the exam, remember: dedicated = stored, provisioned, high-performance; serverless = file-based, pay-per-query, exploratory.
Yes. You can create a Synapse workspace and use only the serverless SQL pool and Apache Spark pool for querying and transforming data in the data lake. The dedicated SQL pool is optional. This is common in data lake architectures where you want to avoid provisioning a warehouse. However, the workspace itself requires an ADLS Gen2 account. Exam tip: Synapse is not just a data warehouse; it's an analytics platform.
Synapse Link enables near real-time analytics on operational data from Azure Cosmos DB, SQL Server, and Dataverse. It uses the change feed to replicate data continuously into Synapse with latency under one minute. Use it when you need to run analytics on live operational data without impacting the source database. For example, analyzing e-commerce orders as they are placed. Exam tip: Synapse Link is for real-time, not batch, analytics.
Synapse has built-in integration with Power BI. You can directly create Power BI datasets from a dedicated SQL pool or serverless SQL pool. You can also publish Power BI reports from within Synapse Studio. This integration allows users to build dashboards on top of Synapse data without leaving the analytics environment. Exam tip: This integration is a key selling point for Synapse.
Yes, within Apache Spark pools. You can write Python code in notebooks to perform data transformations, machine learning, and other tasks. Synapse also supports Scala, .NET, and SQL. Python is commonly used with libraries like Pandas, PySpark, and scikit-learn. Exam tip: Python is supported via Spark, not in SQL pools.
Costs vary by component: dedicated SQL pool charges per hour based on DWU; serverless SQL pool charges per TB of data scanned ($5/TB); Apache Spark pool charges per vCore-hour for the compute used; Synapse Pipelines have charges per activity execution and data movement. Storage costs are separate for ADLS Gen2. You can pause dedicated SQL pool and Spark pools to save costs. Exam tip: Serverless is cost-effective for sporadic queries; dedicated is for steady workloads.
Yes. Azure SQL Data Warehouse has been rebranded as Azure Synapse Analytics. Existing SQL Data Warehouse instances are now considered Synapse workspaces with a dedicated SQL pool. You can upgrade to the full Synapse experience by enabling the workspace features. No data migration is needed. Exam tip: Synapse is the successor; SQL Data Warehouse is deprecated.
You've just covered Azure Synapse Analytics — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.
Done with this chapter?