This chapter covers Azure Synapse Analytics, a limitless analytics service that brings together big data and data warehousing. For AZ-900, understanding Synapse helps you identify how Azure supports large-scale data integration, exploration, and analysis. This objective (2.4) focuses on core Azure services, and Synapse is a key example of a modern cloud analytics platform. You'll learn its components, how it differs from traditional data warehouses, and the exam's focus points.
Jump to a section
Imagine a large manufacturing company that receives raw materials from dozens of suppliers (data sources). The company has a massive warehouse where they store all the raw materials (data lake), but the warehouse is chaotic—materials are piled randomly, and it's hard to find what you need. To turn these materials into finished products (insights), they need a factory with multiple specialized departments. Azure Synapse Analytics is that factory. It has a receiving dock (data ingestion) where raw data comes in from various sources (databases, files, streams). Then, a sorting and cleaning department (data preparation and transformation) organizes the materials, removing defects and standardizing formats. Next, the factory has a powerful assembly line (analytics engine) that can quickly combine materials into complex products (queries and reports). The factory also has a quality control lab (machine learning integration) that can test products and predict future demand. Finally, the finished goods are stored in a showroom (data warehousing) for customers (business analysts) to view and purchase (query and visualize). The factory manager (Azure Synapse Studio) oversees the entire operation from a single control room, adjusting workflows and monitoring performance. This unified approach means the company doesn't need separate buildings for each step—everything is under one roof, reducing delays and errors. In the same way, Synapse integrates big data and data warehousing into one service, so you don't have to stitch together separate tools for storage, processing, and analytics.
What is Azure Synapse Analytics?
Azure Synapse Analytics is a unified analytics platform that combines enterprise data warehousing, big data analytics, data integration, and machine learning. It allows you to ingest, prepare, manage, and serve data for immediate BI and machine learning needs. The service replaces the need for separate tools like SQL Server, Spark clusters, and data factories by integrating them into a single environment.
The Business Problem It Solves
Traditional data analytics often involves silos: a data warehouse for structured data, a Hadoop cluster for unstructured data, and separate ETL tools. This leads to complexity, data duplication, and slow insights. Synapse solves this by providing a single service that can handle both structured and unstructured data, batch and real-time processing, and all stages of analytics—from ingestion to visualization.
How It Works – Step by Step
Data Ingestion: Data flows into Synapse from various sources (Azure Blob Storage, Data Lake Storage, SQL databases, IoT streams) using pipelines (like Azure Data Factory).
Data Storage: Data is stored in a data lake (Azure Data Lake Storage Gen2) or within Synapse's built-in SQL pools.
Data Preparation: Using Synapse Pipelines or notebooks (with Spark or SQL), data is cleaned, transformed, and shaped.
Analytics: You run queries using serverless SQL (on-demand) or dedicated SQL pools (provisioned). Synapse also supports Apache Spark for big data processing.
Data Serving: Processed data is exposed to BI tools like Power BI, Azure Analysis Services, or custom applications.
Governance: Synapse integrates with Azure Purview for data cataloging and lineage tracking.
Key Components
- Synapse SQL: Two types of SQL endpoints: - Dedicated SQL Pool: Provisioned, persistent storage. Good for predictable, high-performance workloads. You pay for provisioned DWUs (Data Warehouse Units). - Serverless SQL Pool: On-demand, pay-per-query. Ideal for ad-hoc exploration, data lake queries, and prototyping. - Apache Spark Pool: Managed Spark clusters for big data processing. You define autoscaling and session timeouts. - Synapse Pipelines: Orchestration engine (built on Azure Data Factory) to create ETL/ELT workflows. - Synapse Studio: Web-based IDE for managing all resources, writing code, and visualizing data. - Synapse Link: Real-time analytics on operational data (e.g., from Cosmos DB) without ETL.
Tiers and Pricing Models
Synapse doesn't have traditional tiers like Basic/Standard. Instead, you pay for:
Dedicated SQL Pool: per DWU-hour (e.g., DW100c to DW30000c).
Serverless SQL: per TB of data scanned.
Apache Spark: per vCore-hour.
Pipelines: per activity run and data movement.
Comparison to On-Premises Equivalent
An on-premises data warehouse typically uses a single large server (e.g., Teradata, SQL Server PDW) with fixed storage and compute. Scaling requires hardware upgrades. Synapse separates compute and storage, allowing independent scaling. Additionally, on-premises solutions lack integrated Spark, serverless options, and cloud-native data lake connectivity.
Azure Portal and CLI Touchpoints
In the Azure portal, you create a Synapse workspace (like a logical container). Then you add SQL pools, Spark pools, and pipelines. The Azure CLI can automate creation:
az synapse workspace create --name MySynapseWorkspace \
--resource-group MyResourceGroup --location eastus \
--storage-account MyStorageAccount \
--file-system MyFileSystem --sql-admin-login-user admin \
--sql-admin-login-password P@ssw0rd1234You can also manage pools via CLI:
az synapse sql pool create --workspace-name MySynapseWorkspace \
--resource-group MyResourceGroup --name MySqlPool \
--performance-level DW100cConcrete Business Scenarios
Retail Analytics: A retailer ingests sales data from stores (structured) and social media feeds (unstructured). Synapse unifies the data, runs Spark to analyze sentiment, and serves aggregated reports via Power BI.
IoT Telemetry: A manufacturer collects sensor data from machines. Synapse ingests streaming data via Event Hubs, uses serverless SQL to query recent data, and stores historical data in a dedicated pool for long-term analysis.
Healthcare Claims: A health insurer processes millions of claims daily. Synapse pipelines transform raw claims, deduplicate records, and load into a dedicated SQL pool for fraud detection models.
Create a Synapse Workspace
In the Azure portal, search for 'Azure Synapse Analytics' and click 'Create'. You'll need to provide a resource group, workspace name, region, and select a Data Lake Storage Gen2 account (or create one). The workspace is the logical container for all Synapse resources. Behind the scenes, Azure provisions a managed resource group with a SQL server, Spark pools, and storage. The workspace name must be globally unique. After creation, you access Synapse Studio via a URL like https://web.azuresynapse.net.
Configure Data Storage
Synapse uses Azure Data Lake Storage Gen2 as its primary data lake. In the workspace, you can create containers and folders for raw, processed, and curated data. You can also link other storage accounts (Blob, ADLS Gen1) using linked services. Data is stored in Parquet, CSV, JSON, or other formats. For dedicated SQL pools, data is stored internally in columnar format. The separation of storage and compute means you can scale compute without moving data.
Ingest Data with Pipelines
In Synapse Studio, go to the 'Integrate' hub and create a pipeline. Pipelines are like Azure Data Factory pipelines: they contain activities (copy, data flow, notebook) that move and transform data. For example, you can copy data from an on-premises SQL Server to the data lake using a self-hosted integration runtime. Pipelines can be triggered on a schedule or event. Each activity run incurs cost based on execution time and data movement.
Query Data with Serverless SQL
Serverless SQL is built into every Synapse workspace. You can query files in the data lake using T-SQL without provisioning any resources. For example, to query a CSV: SELECT * FROM OPENROWSET(BULK 'https://mystorage.dfs.core.windows.net/container/folder/file.csv', FORMAT='CSV') AS [result]. You pay only for the data scanned (per TB). Serverless SQL is perfect for ad-hoc exploration and prototyping. It can also create views and external tables.
Create a Dedicated SQL Pool
For production workloads with predictable performance, create a dedicated SQL pool. Choose a performance level from DW100c (100 compute DWUs) to DW30000c. The pool provisions dedicated resources. You can pause the pool to save costs when not in use. Data is stored in the pool's internal storage. You can load data using PolyBase, COPY statement, or pipelines. Queries run in seconds on large datasets. Scaling up/down takes a few minutes.
Scenario 1: E-commerce Company Unifying Sales and Clickstream Data
An online retailer wants to combine structured sales data from their transactional database with unstructured clickstream logs from their website. They use Synapse to ingest both into a data lake (ADLS Gen2). Sales data is loaded nightly via a pipeline from SQL Server, while clickstream data streams in real-time via Azure Event Hubs. A Synapse notebook (Spark) joins the datasets, calculates conversion rates, and writes results to a dedicated SQL pool. Business analysts use Power BI to visualize dashboards. The team configures Synapse Pipelines to run hourly for clickstream and daily for sales. Cost considerations: they use serverless SQL for ad-hoc queries and a small dedicated pool (DW100c) for dashboards. If they misconfigure the pipeline (e.g., wrong file format), data may not load, causing stale reports.
Scenario 2: Financial Services Firm for Risk Analytics
A bank processes terabytes of transaction data for fraud detection. They use Synapse to run complex SQL queries and machine learning models. Data is ingested from multiple sources (mainframe, CRM) using pipelines. They create a dedicated SQL pool (DW1000c) for high-performance queries. Data scientists use Spark pools to train anomaly detection models. Governance is handled via Azure Purview integration. The team uses Synapse Studio to monitor pipeline runs and query performance. Common mistake: not pausing the dedicated pool overnight, leading to unnecessary costs. They set up autoscale for Spark pools to handle varying workloads.
Scenario 3: Healthcare Provider for Patient Analytics
A hospital system wants to analyze patient records, lab results, and IoT device data. They use Synapse to ingest data from on-premises SQL databases and HL7 feeds. Serverless SQL is used to explore data lake files. They build a pipeline that transforms data into a star schema and loads into a dedicated SQL pool. Power BI reports show patient outcomes. They integrate Azure Machine Learning to predict readmission risk. If they don't properly secure the data (e.g., missing column-level security), sensitive patient data could be exposed. They use Synapse's row-level security and dynamic data masking.
Exam Objective: 2.4 Describe core Azure services
AZ-900 expects you to know the purpose of Azure Synapse Analytics as a unified analytics service. You won't be asked to write queries or configure pipelines, but you must understand its role in big data and data warehousing.
Common Wrong Answers and Why
'Synapse is just a rebranded SQL Data Warehouse' – Wrong. Synapse combines SQL Data Warehouse with Spark, pipelines, and data lake integration. The exam tests that it's more than just a data warehouse.
'Synapse replaces Azure Analysis Services' – Wrong. Analysis Services is for tabular models and is complementary, not replaced.
'Serverless SQL pool is the same as dedicated SQL pool' – Wrong. Serverless is on-demand, pay-per-query; dedicated is provisioned and persistent.
'Synapse requires a separate data factory' – Wrong. Synapse has built-in pipelines; you don't need a separate Azure Data Factory.
Specific Terms and Values
Dedicated SQL Pool – provisioned, persistent, uses DWUs (Data Warehouse Units).
Serverless SQL Pool – on-demand, pay per TB scanned.
Synapse Pipelines – built on Azure Data Factory.
Synapse Studio – web-based IDE.
Synapse Link – real-time analytics on operational data.
Edge Cases and Tricky Distinctions
Synapse vs. Azure Data Lake: Data Lake is storage; Synapse provides compute and analytics on top of that storage.
Synapse vs. Azure HDInsight: HDInsight is open-source Hadoop/Spark; Synapse is a managed, integrated platform.
Synapse vs. Azure Databricks: Databricks is a Spark-based platform with collaborative notebooks; Synapse is more SQL-centric with integrated pipelines.
Memory Trick: 'SUDS' – Synapse Unifies Data and Services
Storage (data lake)
Unified SQL (serverless and dedicated)
Data integration (pipelines)
Spark (big data processing)
Decision Tree for Elimination
If a question asks about 'big data and data warehousing combined' or 'unified analytics', pick Synapse. If it's about 'real-time analytics on operational data', think Synapse Link. If it's about 'orchestrating ETL', think Synapse Pipelines. If it's about 'querying data lake files without provisioning', think serverless SQL.
Azure Synapse Analytics is a unified analytics platform combining big data and data warehousing.
It includes two SQL options: serverless (on-demand, pay-per-query) and dedicated (provisioned, persistent).
Synapse Pipelines are built-in for data integration, no separate Data Factory needed.
Synapse Studio is the web-based IDE for managing all resources.
Synapse Link enables real-time analytics on operational data (e.g., Cosmos DB) without ETL.
Dedicated SQL pools scale in DWUs (Data Warehouse Units) from DW100c to DW30000c.
Serverless SQL queries data lake files directly without provisioning.
These come up on the exam all the time. Here's how to tell them apart.
Azure Synapse Analytics
Unified analytics platform with SQL, Spark, and pipelines
Includes data warehousing (dedicated/ serverless SQL)
Provides a web-based IDE (Synapse Studio)
Best for end-to-end analytics including BI
Pricing includes DWU-hours, vCore-hours, and per-TB scanning
Azure Data Factory
Cloud-scale data integration and orchestration only
No built-in data warehousing or compute engines
Uses Data Factory Studio for monitoring
Best for ETL/ELT and data movement
Pricing based on activity runs and integration runtime hours
Mistake
Azure Synapse Analytics is just a new name for Azure SQL Data Warehouse.
Correct
While Synapse evolved from SQL Data Warehouse, it adds integrated Spark, serverless SQL, pipelines, and a unified studio. It's a superset, not a rename.
Mistake
You must choose between serverless and dedicated SQL pools; you can't use both.
Correct
A Synapse workspace can have both serverless and dedicated SQL pools simultaneously. They serve different use cases.
Mistake
Synapse Pipelines are a separate service and require Azure Data Factory.
Correct
Synapse Pipelines are built into Synapse and do not require a separate Data Factory instance, though they share the same engine.
Mistake
Synapse can only analyze data stored in Azure Data Lake Storage Gen2.
Correct
Synapse can connect to many sources (Blob, SQL DB, Cosmos DB, on-premises) via linked services and pipelines.
Mistake
Serverless SQL pool stores data persistently.
Correct
Serverless SQL pool does not store data; it queries files in the data lake on demand. Only dedicated SQL pools have persistent storage.
Serverless SQL is on-demand: you query files in the data lake and pay per TB scanned. It has no persistent storage. Dedicated SQL pool is provisioned: you allocate compute resources (DWUs) and store data internally. Dedicated pool offers predictable performance and supports advanced features like materialized views and row-level security. Use serverless for ad-hoc exploration and prototyping; use dedicated for production workloads.
No. Synapse includes built-in pipelines (Synapse Pipelines) that are based on Azure Data Factory. You can create and run ETL/ELT workflows directly within Synapse without a separate Data Factory instance. However, if you already use Data Factory, you can integrate it with Synapse.
Synapse Link enables near real-time analytics on operational data from sources like Azure Cosmos DB and SQL Server. It automatically syncs changes from the operational database to Synapse, eliminating the need for ETL. Use it when you need low-latency analytics on transactional data without impacting operational performance.
Yes. Synapse integrates seamlessly with Power BI. You can connect Power BI to a dedicated SQL pool, serverless SQL, or even directly to data lake files. Synapse Studio even has a built-in Power BI integration for creating reports.
Pricing is consumption-based: dedicated SQL pool charges per DWU-hour (e.g., DW100c costs ~$1.10/hr). Serverless SQL charges per TB of data scanned (e.g., $5/TB). Spark pools charge per vCore-hour. Pipelines charge per activity run and data movement. You can pause dedicated pools to stop costs.
Synapse is a unified analytics platform with strong SQL support, integrated pipelines, and data warehousing. Databricks is a Spark-based platform focused on collaborative data science and engineering. Synapse is better for SQL-centric BI and ETL; Databricks excels at machine learning and advanced analytics. Both can work together.
Yes. Synapse can ingest streaming data from Azure Event Hubs and IoT Hub using pipelines or Spark Structured Streaming. You can then query the data in real-time using serverless SQL or store it for later analysis. For low-latency (<1 second) streaming, consider Azure Stream Analytics.
You've just covered Azure Synapse Analytics — now see how well it sticks with free AZ-900 practice questions. Full explanations included, no account needed.
Done with this chapter?