DP-900 | Chapter 61 of 101 | Objective 3.1

Azure Synapse Spark Pools

This chapter covers Azure Synapse Spark Pools, a critical component of the Analytics domain in the DP-900 exam. You will learn what Spark pools are, how they execute big data processing jobs in parallel, and how they differ from SQL pools. Approximately 10-15% of exam questions touch on Synapse Analytics components, with Spark pools being a key topic. Mastery of this chapter will help you answer questions about workload types, scaling, and integration with other Azure services.

25 min read
Intermediate
Updated May 31, 2026

Synapse Spark Pools as a Shared Kitchen

Imagine a large commercial kitchen (the Azure Synapse workspace) that can handle different cooking tasks. Inside, there are several cooking stations (Spark pools) that can be set up with different numbers of stoves (nodes) and chefs (executors). Each chef is responsible for a specific part of a recipe (a data processing job). When a new order comes in (a Spark job is submitted), the kitchen manager (the Synapse Spark pool resource manager) allocates a certain number of stoves and chefs from the pool to work on that order.

The chefs work in parallel: one chops vegetables (filters data), another boils water (aggregates data), and a third sears meat (joins datasets). They communicate through a shared ingredient shelf (the Spark shuffle service) to pass intermediate results. If an order requires more chefs than are currently available, the kitchen manager can temporarily hire extra chefs (auto-scaling) or the order waits until chefs finish their current tasks (job queuing). The kitchen also has a policy: if a chef is idle for too long (idle timeout, default 10 minutes), they are sent home to save costs.

This mirrors how Apache Spark pools work in Azure Synapse: you define a pool with a node size and number, submit jobs that run on executors, and the pool auto-scales or auto-pauses based on configuration. The key difference is that in the real kitchen, each chef works independently, while in Spark, executors are coordinated by a driver node that orchestrates the entire job, similar to a head chef who assigns tasks and collects results.

How It Actually Works

What Are Azure Synapse Spark Pools?

Azure Synapse Spark pools (Apache Spark pools in Azure Synapse Analytics) are a managed Apache Spark service. They let data engineers and data scientists run Spark jobs, such as data transformation, machine learning, and stream processing, without managing a Spark cluster themselves. The pool abstracts the underlying compute resources (virtual machines) and provides a REST API and notebook interface for job submission.

Why Spark Pools Exist

Traditional big data processing required provisioning and managing dedicated Spark clusters, which was time-consuming and costly. Spark pools offer a serverless-like experience: you define a pool with a fixed or auto-scaling number of nodes, and jobs run on demand. The pool can be paused when not in use, saving costs. This is ideal for variable workloads where you don't need a cluster running 24/7.

How Spark Pools Work Internally

When you create a Spark pool in Azure Synapse, you specify:

Node size: Small (4 vCPU, 32 GB RAM), Medium (8 vCPU, 64 GB RAM), or Large (16 vCPU, 128 GB RAM).

Number of nodes: Minimum 3, maximum 200 (default is 10).

Auto-scaling: Enabled or disabled. If enabled, you set minimum and maximum nodes (e.g., 3-20).

Auto-pause: Enabled or disabled. If enabled, you set an idle timeout in minutes (default 10, minimum 5).
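The settings above can be sketched as a small validation helper. This is only an illustration of the limits quoted in this chapter: `PoolSpec` and `NODE_SIZES` are hypothetical names, not part of any Azure SDK.

```python
from dataclasses import dataclass

# Node sizes as listed in this chapter: (vCPUs, memory in GB) per node.
NODE_SIZES = {"Small": (4, 32), "Medium": (8, 64), "Large": (16, 128)}
MIN_NODES, MAX_NODES = 3, 200


@dataclass(frozen=True)
class PoolSpec:
    """Hypothetical description of a Spark pool; not an Azure SDK type."""
    node_size: str
    min_nodes: int
    max_nodes: int

    def __post_init__(self):
        if self.node_size not in NODE_SIZES:
            raise ValueError(f"unknown node size: {self.node_size}")
        if not (MIN_NODES <= self.min_nodes <= self.max_nodes <= MAX_NODES):
            raise ValueError("node counts must satisfy 3 <= min <= max <= 200")

    def capacity(self, nodes: int) -> tuple[int, int]:
        """Total (vCPUs, memory GB) when the pool runs `nodes` nodes."""
        if not (self.min_nodes <= nodes <= self.max_nodes):
            raise ValueError("node count outside the pool's auto-scale range")
        vcpus, mem_gb = NODE_SIZES[self.node_size]
        return nodes * vcpus, nodes * mem_gb


pool = PoolSpec(node_size="Medium", min_nodes=3, max_nodes=20)
print(pool.capacity(3))   # (24, 192)
print(pool.capacity(20))  # (160, 1280)
```

The two printed tuples show the capacity range an auto-scaling Medium pool with 3-20 nodes moves between.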

The pool consists of one driver node and multiple executor nodes (the nodes you specify). The driver node runs the SparkContext and schedules tasks. Executor nodes run the actual tasks in parallel. When a job is submitted (via notebook, pipeline, or SDK), the following happens:

1. Job Submission: The user submits a Spark job (e.g., a Python script or Scala JAR) to the pool's endpoint. The job is sent to the driver node.

2. Resource Allocation: The driver requests executors from the cluster manager (Apache Hadoop YARN in Synapse). The cluster manager allocates containers on executor nodes.

3. Task Scheduling: The driver divides the job into stages and tasks. Each task is a unit of work that runs on one partition of data. Tasks are assigned to executors.

4. Data Processing: Executors read data from the source (e.g., Azure Data Lake Storage Gen2) using the Hadoop FileSystem API. They perform transformations (map, filter, join) and write intermediate results to local or shuffle storage.

5. Shuffle: When a wide dependency (e.g., groupByKey) occurs, executors shuffle data across the network. This is the most expensive operation. Spark uses a sort-based shuffle to keep the number of shuffle files and the memory needed for them down.

6. Result Collection: The final results are collected by the driver or written to a sink (e.g., Azure Synapse dedicated SQL pool, Azure Storage).
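The six steps can be imitated in plain Python to make the stage boundary visible. This is a toy, single-process sketch and not Spark code: the function names, the sample `sales` rows, and the simple byte-sum hash are all assumptions made for illustration.

```python
from collections import defaultdict


def split_into_partitions(rows, num_partitions):
    """Round-robin the input rows into partitions (one task per partition)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def map_task(partition):
    """Stage 1 task: narrow transformation (filter) on a single partition."""
    return [(store, amount) for store, amount in partition if amount > 0]


def shuffle(map_outputs, num_reducers):
    """Wide dependency: route every key to exactly one reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer_id = sum(key.encode()) % num_reducers  # illustrative stable hash
            buckets[reducer_id][key].append(value)
    return buckets


def reduce_task(bucket):
    """Stage 2 task: aggregate all values for the keys this reducer owns."""
    return {key: sum(values) for key, values in bucket.items()}


def run_job(rows, num_partitions=3, num_reducers=2):
    """Driver role: plan the stages, run the tasks, merge the results."""
    map_outputs = [map_task(p) for p in split_into_partitions(rows, num_partitions)]
    result = {}
    for bucket in shuffle(map_outputs, num_reducers):
        result.update(reduce_task(bucket))
    return result


sales = [("north", 10), ("south", 5), ("north", -1), ("east", 7), ("south", 3)]
print(run_job(sales))  # totals: north=10, south=8, east=7 (key order may vary)
```

Changing `num_partitions` or `num_reducers` changes how the work is divided but not the totals, which is the property that lets Spark scale the same job across more executors.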

Key Components and Defaults

Apache Spark Runtime: Azure Synapse supports Spark 3.1, 3.2, and 3.3 (as of writing). The runtime includes Delta Lake, Spark SQL, MLlib, and GraphX.

Data Sources: Native connectors for Azure Blob Storage, ADLS Gen2, Azure Cosmos DB (via Spark connector), Azure SQL Database, and Synapse SQL pools.

Job Types: Batch jobs (via pipelines or notebooks) and streaming jobs (via Spark Structured Streaming).

Concurrency: By default, a Spark pool can run up to 50 concurrent jobs (configurable). Each job gets its own set of executors.

Auto-Scale: Based on load, the pool can add or remove nodes within the set min/max range. Scale-up/down takes 5-10 minutes.

Auto-Pause: After the idle timeout (default 10 minutes) with no jobs, the pool stops all nodes. When a new job arrives, the pool resumes (takes 2-5 minutes).
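The auto-scale and auto-pause rules above boil down to two small decisions, sketched below. The function names are hypothetical, the 10-minute default is the value stated in this chapter, and the real service applies its own internal logic.

```python
from datetime import datetime, timedelta


def target_nodes(requested_nodes, min_nodes, max_nodes):
    """Auto-scale keeps the node count inside the configured [min, max] range."""
    return max(min_nodes, min(requested_nodes, max_nodes))


def should_auto_pause(last_job_finished, now, running_jobs, idle_timeout_minutes=10):
    """Pause only when nothing is running and the idle timeout has elapsed."""
    if running_jobs > 0:
        return False  # an active job (e.g., a streaming query) keeps the pool up
    return now - last_job_finished >= timedelta(minutes=idle_timeout_minutes)


finished = datetime(2026, 1, 1, 12, 0)
print(target_nodes(50, min_nodes=3, max_nodes=20))                             # 20
print(should_auto_pause(finished, finished + timedelta(minutes=9), 0))         # False
print(should_auto_pause(finished, finished + timedelta(minutes=10), 0))        # True
print(should_auto_pause(finished, finished + timedelta(hours=5), 1))           # False
```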

Configuration and Verification Commands

You can submit jobs using the Synapse Studio UI, but also programmatically:

# Using Azure CLI to submit a Spark job
# (the main definition file is referenced by its storage path)
az synapse spark job submit \
  --workspace-name myworkspace \
  --spark-pool-name mysparkpool \
  --name myjob \
  --language PySpark \
  --main-definition-file abfss://container@storage.dfs.core.windows.net/scripts/myjob.py \
  --executors 2 \
  --executor-size Small

To monitor jobs:

# List jobs in a pool
az synapse spark job list --workspace-name myworkspace --spark-pool-name mysparkpool

# Get job details (the job is identified by its Livy ID)
az synapse spark job show --workspace-name myworkspace --spark-pool-name mysparkpool --livy-id 1

Interaction with Related Technologies

Synapse Pipelines: You can use a Spark job definition activity (or a Notebook activity) in a pipeline to run a Spark job as part of an ETL workflow.

Synapse SQL Pools: Spark pools can read from and write to dedicated SQL pools using the 'synapsesql' connector.

Azure Machine Learning: Spark pools can be used as compute targets for training machine learning models.

Power BI: Power BI typically reaches tables created by Spark pools through the serverless SQL pool, which exposes lake database tables via shared metadata, rather than connecting to the Spark pool directly.

Performance Considerations

Data Locality: To minimize data movement, ensure the Spark pool and the data source are in the same Azure region.

Partitioning: The default number of shuffle partitions (spark.sql.shuffle.partitions) is 200. For large datasets, increase this to 400 or more so that each partition is smaller and less likely to cause out-of-memory errors.

Caching: Use cache() or persist() for intermediate results that are reused.

Serialization: Use Kryo serialization for better performance (set spark.serializer to org.apache.spark.serializer.KryoSerializer).
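To see why the partition count matters, the arithmetic can be worked through directly. The sketch below assumes the shuffled volume is about 500 GB and that keys are evenly distributed; the 128 MB target is a hypothetical figure chosen for illustration, not a value from this chapter.

```python
import math


def avg_partition_size_gb(shuffle_gb: float, shuffle_partitions: int) -> float:
    """Average shuffle partition size, assuming evenly distributed keys."""
    return shuffle_gb / shuffle_partitions


def partitions_for_target(shuffle_gb: float, target_mb: int) -> int:
    """Partition count needed to keep each partition near `target_mb`."""
    return math.ceil(shuffle_gb * 1024 / target_mb)


for partitions in (200, 400, 1000):
    print(partitions, avg_partition_size_gb(500, partitions))
# 200 2.5
# 400 1.25
# 1000 0.5

print(partitions_for_target(500, 128))  # 4000
```

With the default of 200, each task would handle roughly 2.5 GB of shuffled data, which is why raising the setting is a common first step for large jobs.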

Walk-Through

1. Create a Spark Pool

In Azure Synapse Studio, navigate to Manage > Apache Spark pools > New. Enter a name (e.g., 'mysparkpool'), select node size (Small/Medium/Large), and set the number of nodes (minimum 3, default 10). Optionally enable auto-scaling (min 3, max 200) and auto-pause (idle timeout 5-15 minutes, default 10). The pool will be provisioned in a few minutes. Behind the scenes, Azure creates a set of VMs in a cluster managed by Azure Synapse. The driver node is always one, and executor nodes are the number you specify minus one (the driver uses one node).

2. Submit a Spark Job

You can submit a job via Synapse Studio (Develop > Spark job definitions) or programmatically. When you submit, the driver node receives the job definition. It creates a SparkContext and requests executors from the cluster manager. The driver splits the job into stages based on shuffle dependencies. Each stage contains tasks that each operate on one partition of data (default parallelism equals number of executors * cores per executor). Tasks are serialized and sent to executors over Spark's Netty-based RPC layer (Akka was used only in Spark 1.x).

3. Execute Tasks on Executors

Executors deserialize tasks and run them on their allocated cores. Each executor has a fixed memory (e.g., 32 GB for Small node) divided into execution memory (for shuffles, joins) and storage memory (for caching). As tasks run, they read data from the source (e.g., ADLS Gen2) using the Hadoop InputFormat. Intermediate results are stored in executor memory or spilled to disk if memory is insufficient. When a task completes, it sends the result back to the driver.
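The split between execution and storage memory follows from Spark's unified memory settings. The sketch below uses the open-source defaults (`spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5, 300 MB reserved); the 8 GB heap is a hypothetical executor size, and the boundary is soft in practice because execution can borrow unused storage memory.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying the fractions


def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate sizes of Spark's unified memory regions for one executor."""
    usable = (heap_mb - RESERVED_MB) * memory_fraction
    storage = usable * storage_fraction   # cached data protected from eviction
    execution = usable - storage          # shuffles, joins, sorts, aggregations
    return {"usable": usable, "storage": storage, "execution": execution}


regions = unified_memory_mb(8192)  # hypothetical 8 GB executor heap
print({name: round(mb) for name, mb in regions.items()})
# {'usable': 4735, 'storage': 2368, 'execution': 2368}
```

Roughly 40% of the heap is left outside the unified region for user data structures and Spark internals, which is one reason tasks can spill to disk sooner than the node's raw memory size suggests.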

4. Shuffle Data Across Nodes

For operations like groupByKey or reduceByKey, data must be repartitioned across executors. Spark performs a shuffle: each executor writes shuffle data to local disk (shuffle files) indexed by reducer ID. The driver then notifies reducers to fetch their partitions from the map executors. This is a costly operation because it involves disk I/O and network transfer. Spark uses a sort-based shuffle to reduce memory usage. The number of shuffle partitions defaults to 200, which can be tuned.

5. Collect Results and Auto-Pause

After all tasks complete, the driver collects the final result (if using collect()) or writes to a sink (e.g., Delta Lake table). The job status is updated in the Spark history server. If auto-pause is enabled, the pool monitors for idle time. After the configured idle timeout (default 10 minutes) with no active jobs, the pool stops all VMs. When a new job is submitted, the pool automatically restarts (takes 2-5 minutes). Note: auto-pause does not apply if a job is running continuously (e.g., streaming).

What This Looks Like on the Job

Scenario 1: ETL for a Retail Company

A large retailer uses Azure Synapse Spark pools to transform raw sales data from Azure Data Lake Storage Gen2 into curated datasets for reporting. Every night, a pipeline triggers a Spark job that reads 500 GB of Parquet files, cleanses the data (remove duplicates, fill nulls), joins with product catalog (another 50 GB), and writes the result to a dedicated SQL pool. The Spark pool is configured with Medium nodes (8 vCPU, 64 GB) and auto-scaling from 5 to 20 nodes. The job takes about 30 minutes. Key considerations: they set the shuffle partitions to 400 to avoid memory issues. They also enable auto-pause with a 15-minute idle timeout to save costs during the day when no jobs run. Without auto-pause, the pool would run 24/7 costing ~$100 per day; with auto-pause, it runs only ~8 hours per night, costing ~$30 per day.
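The cost comparison in this scenario is simple proportional arithmetic. The sketch below derives an hourly rate from the scenario's ~$100 always-on figure; it is not an Azure price, and `daily_cost` is a hypothetical helper.

```python
def daily_cost(always_on_daily_cost, active_hours_per_day):
    """Scale the always-on daily cost by the hours the pool actually runs."""
    hourly_rate = always_on_daily_cost / 24
    return hourly_rate * active_hours_per_day


print(round(daily_cost(100, 24), 2))  # 100.0
print(round(daily_cost(100, 8), 2))   # 33.33
```

Eight active hours come to about $33 per day, in line with the scenario's rough ~$30 estimate.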

Scenario 2: Interactive Data Exploration

A data science team uses Spark notebooks for ad-hoc analysis. They have a small Spark pool with 3 Small nodes (4 vCPU, 32 GB each). They use Python and Spark SQL to explore customer behavior data. The pool is set to auto-scale (3-10 nodes) to handle occasional large queries. Auto-pause is set to 5 minutes because the team often leaves notebooks idle. A common issue: when the pool is paused, the first query takes 3-5 minutes to resume, which frustrates users. To mitigate, they set a longer idle timeout (10 minutes) during business hours and use a scheduled pipeline to keep the pool warm.

Scenario 3: Real-Time Stream Processing

A financial services company uses Spark Structured Streaming on a Spark pool to process real-time stock trades from Azure Event Hubs. The pool is configured with Large nodes (16 vCPU, 128 GB) and auto-scaling from 5 to 30 nodes based on load. The streaming query runs continuously, so auto-pause is disabled. A misconfiguration: initially, they used the default shuffle partitions (200), causing out-of-memory errors during peak hours. They raised spark.sql.shuffle.partitions to 500 and increased executor memory by setting spark.executor.memory to 20g. The job now runs smoothly, processing 10,000 events per second.

How DP-900 Actually Tests This

DP-900 Exam Focus on Azure Synapse Spark Pools

This topic falls under Domain 3: Analytics (25-30% of exam), specifically objectives 3.1 (Describe analytical workloads) and 3.2 (Describe components of a modern data warehouse). The exam expects you to understand:

1. What Spark pools are used for: batch and real-time data processing, not OLTP and not relational data warehousing (that's SQL pools).

2. Scaling options: scale-up (larger nodes) vs. scale-out (more nodes). The exam may ask which to choose for parallel processing.

3. Auto-pause and auto-scale: know the default idle timeout (10 minutes) and that auto-pause saves costs.

4. Integration: Spark pools can read from ADLS Gen2 and Blob Storage, and write to Synapse SQL pools.

Common Wrong Answers

1. 'Spark pools are used for transacting data' – Wrong. Spark pools are for analytical processing, not transactional workloads. Transactional workloads use Azure SQL Database or Cosmos DB.

2. 'Spark pools support auto-pause with a default of 5 minutes' – Wrong. Default is 10 minutes. 5 is the minimum.

3. 'Spark pools can only process batch data' – Wrong. They also support streaming via Spark Structured Streaming.

4. 'Spark pools are the same as Databricks' – Wrong. While both use Apache Spark, Synapse Spark pools are integrated with Synapse Analytics and managed differently. The exam may ask to differentiate.

Specific Numbers and Terms

Default idle timeout for auto-pause: 10 minutes.

Minimum nodes: 3.

Maximum nodes: 200.

Node sizes: Small (4 vCPU, 32 GB), Medium (8 vCPU, 64 GB), Large (16 vCPU, 128 GB).

Supported runtimes: Spark 3.1, 3.2, 3.3 (but exam may not ask version specifics).

Shuffle partitions default: 200.

Edge Cases

Streaming jobs: a pool running a continuous streaming query is never idle, so auto-pause never triggers while the query runs. The exam might ask: 'Can a Spark pool with streaming jobs auto-pause?' Answer: No.

Multiple jobs: a pool can run multiple jobs concurrently (default 50). The exam may test that each job gets its own executors.

Pool deletion: if you delete a pool, all running jobs are terminated.

How to Eliminate Wrong Answers

If a question asks about Spark pool capabilities, eliminate options that mention relational data warehousing (use SQL pool) or transactional processing. Look for keywords: 'big data', 'parallel processing', 'in-memory', 'ETL', 'machine learning'. For scaling, if the question emphasizes parallelism (more cores), choose 'scale-out' (more nodes). If it emphasizes larger memory per node, choose 'scale-up' (larger node size).

Key Takeaways

Azure Synapse Spark pools are a managed Apache Spark service for big data processing, not for transactional workloads.

Spark pools support both batch and streaming jobs via Spark Structured Streaming.

Minimum nodes for a Spark pool is 3; maximum is 200.

Auto-pause default idle timeout is 10 minutes (minimum 5).

Node sizes: Small (4 vCPU/32GB), Medium (8 vCPU/64GB), Large (16 vCPU/128GB).

Default shuffle partitions = 200; tune for large datasets.

Spark pools integrate with Synapse pipelines, SQL pools, and ADLS Gen2.

Auto-scaling adjusts nodes between min and max based on load.

A pool running a continuous streaming job never goes idle, so auto-pause does not trigger.

Spark pools are charged per node-hour, independent of storage.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Azure Synapse Spark Pools

Used for big data processing (ETL, ML, streaming)

Uses Apache Spark (in-memory distributed computing)

Auto-pause and auto-scale for cost savings

Reads/writes to ADLS Gen2, Blob Storage, etc.

Pay per node-hour (compute + storage separate)

Azure Synapse SQL Pools (Dedicated)

Used for relational data warehousing (SQL queries)

Uses Massively Parallel Processing (MPP) engine

Can pause and resume manually (not auto-pause)

Stores data in tables (distributions)

Pay per DWU (Data Warehouse Unit) per hour

Watch Out for These

Mistake

Spark pools are the same as Azure Databricks.

Correct

Both use Apache Spark, but Azure Synapse Spark pools are a native part of Synapse Analytics, tightly integrated with Synapse pipelines and SQL pools. Databricks is a separate platform with its own workspace and pricing. The exam treats them as distinct.

Mistake

Spark pools can only be used for batch processing.

Correct

Spark pools support both batch and streaming (Structured Streaming). You can process real-time data from Event Hubs or IoT Hub.

Mistake

Auto-pause default idle timeout is 5 minutes.

Correct

The default is 10 minutes. The minimum configurable is 5 minutes. The exam often tests this default value.

Mistake

A Spark pool can have a minimum of 1 node.

Correct

The minimum is 3 nodes. This is a limit of Synapse Spark pools rather than of Apache Spark itself, which can run on a single machine; one node hosts the driver and the others host executors. The exam may ask this.

Mistake

Spark pools can be used for OLTP workloads.

Correct

Spark is designed for analytical processing (OLAP), not transactional (OLTP). For OLTP, use Azure SQL Database or Cosmos DB.

Frequently Asked Questions

What is the difference between a Spark pool and a SQL pool in Azure Synapse?

A Spark pool is for big data processing using Apache Spark (in-memory distributed computing) for ETL, machine learning, and streaming. A SQL pool (dedicated or serverless) is for relational data warehousing using T-SQL queries. Spark pools are ideal for complex transformations and iterative algorithms; SQL pools are for fast, scalable SQL analytics. On the DP-900 exam, remember that Spark = big data, SQL = data warehousing.

Can I run multiple jobs on the same Spark pool simultaneously?

Yes, by default a Spark pool can run up to 50 concurrent jobs. Each job gets its own set of executors (within the pool's total nodes). If the pool is fully utilized, new jobs are queued. You can configure the maximum number of concurrent jobs per pool. This is important for shared environments.

How does auto-pause work in Spark pools?

Auto-pause stops all compute nodes after a period of inactivity (idle timeout, default 10 minutes). When a new job is submitted, the pool automatically resumes (takes 2-5 minutes). This saves costs by not running idle compute. Note: streaming jobs disable auto-pause because they are always active. You can also manually pause a pool.

What is the default number of shuffle partitions in Spark?

The default is 200. This can be changed using `spark.sql.shuffle.partitions` configuration. For large datasets, increase this number to avoid out-of-memory errors. On the exam, know that the default is 200, and that tuning partitions is a common performance optimization.

Can Spark pools read data from Azure SQL Database?

Yes, Spark pools can read from Azure SQL Database using the JDBC connector. However, this is not recommended for large datasets due to performance. For better performance, use Azure Data Lake Storage Gen2 or Blob Storage. The exam may test that Spark pools can connect to various data sources including Azure SQL, Cosmos DB, and ADLS.

What is the minimum number of nodes required for a Spark pool?

The minimum is 3 nodes: one hosts the driver and the others host executors. This is a Synapse pool limit, not an Apache Spark requirement. If you try to create a pool with 1 or 2 nodes, it will fail. The exam may ask this as a trivia question.

How are Spark pools charged?

You pay per node-hour for the compute resources (vCPUs and memory) used by the pool, regardless of whether jobs are running. However, if auto-pause is enabled, you are not charged when the pool is paused. Storage (data in ADLS) is charged separately. The exam may ask about cost optimization using auto-pause.
