DP-900Chapter 100 of 101Objective 3.3

Apache Spark Core Concepts for DP-900

This chapter covers Apache Spark core concepts essential for the DP-900 exam, focusing on how Spark fits into Azure's analytics ecosystem. Spark is a unified analytics engine for large-scale data processing, and DP-900 expects you to understand its components, execution model, and integration with Azure services like Synapse Analytics and Databricks. Approximately 10-15% of exam questions touch on Spark-related topics, primarily in the analytics domain (objective 3.3). By the end of this chapter, you'll be able to explain Spark's architecture, RDDs, DataFrames, lazy evaluation, and the role of the driver and executors – all at the depth required for the exam.

25 min read
Intermediate
Updated May 31, 2026

Spark as a Master Chef in a Commercial Kitchen

Imagine a commercial kitchen preparing a banquet for 1,000 guests. The head chef (driver) receives the order (Spark application) and breaks it down into tasks: chop vegetables, sear meat, plate desserts. Each task is assigned to a station (executor) with its own stove and prep area (CPU and memory). The head chef doesn't cook; they orchestrate. They send a prep list (task) to the vegetable station, which chops carrots and returns the finished bowls (results). If the vegetable station runs out of cutting boards (memory), the head chef can split the task or move work to another station. The kitchen runs in stages: first all chopping (stage 1), then all cooking (stage 2). If a station drops a knife (failure), the head chef reassigns that task to another station using the original prep list (lineage). The key is that the head chef keeps all prep lists (RDD lineage) so any task can be recreated exactly. This is fundamentally different from a traditional kitchen where the head chef might stir the pot themselves (single-node processing) – here, the chef never touches the food, only directs the work. The banquet (query) completes faster because multiple stations work in parallel, and the kitchen scales by adding more stations (nodes) without changing the recipe (code).

How It Actually Works

What is Apache Spark?

Apache Spark is an open-source, distributed computing framework designed for fast, in-memory data processing. Unlike Hadoop MapReduce, which writes intermediate results to disk, Spark keeps data in memory, reducing I/O overhead and accelerating iterative algorithms and interactive queries. Spark provides APIs in Scala, Java, Python, and R, and supports batch processing, streaming, machine learning, and graph processing through its ecosystem libraries.

Spark Architecture: Driver and Executors

A Spark application runs as a set of independent processes coordinated by a driver program. The driver is the main control process that:

Converts a user's application into multiple tasks that can be executed in parallel.

Schedules tasks on executors across the cluster.

Maintains the DAG (Directed Acyclic Graph) of stages and tasks.

Tracks the state of all executors and tasks.

Executors are worker processes that:

Run tasks assigned by the driver.

Store data in memory or disk for cached RDDs.

Report progress and results back to the driver.

Each Spark cluster has one driver and multiple executors. The number of executors, their memory, and cores are configurable via parameters like spark.executor.instances, spark.executor.memory, and spark.executor.cores. For example, a common configuration on Azure Synapse might set:

spark.executor.instances=10
spark.executor.memory=8g
spark.executor.cores=4

This means 10 executors, each with 8 GB RAM and 4 CPU cores, capable of running 4 tasks in parallel each.

Resilient Distributed Datasets (RDDs)

RDD is the fundamental data structure of Spark. It is an immutable, partitioned collection of records that can be operated on in parallel. Key characteristics: - Resilient: If a partition is lost due to node failure, it can be recomputed from lineage (the sequence of transformations that created it). - Distributed: Data is split into partitions across multiple nodes. - Dataset: A collection of typed objects (e.g., RDD[String], RDD[(Int, String)]).

RDDs support two types of operations: 1. Transformations: Lazy operations that produce a new RDD (e.g., map, filter, flatMap, reduceByKey). They are not executed until an action is called. 2. Actions: Operations that trigger computation and return a value to the driver or write data to storage (e.g., count, collect, saveAsTextFile).

Example transformation:

rdd = sc.textFile("wasbs://container@storage.blob.core.windows.net/data.txt")
words = rdd.flatMap(lambda line: line.split(" "))
word_pairs = words.map(lambda w: (w, 1))

No computation happens until an action like word_pairs.reduceByKey(lambda a,b: a+b).collect() is called.

DataFrames and Spark SQL

DataFrames are a higher-level abstraction built on top of RDDs, providing a schema (column names and types) and allowing SQL-like queries. DataFrames are more efficient than RDDs due to Catalyst optimizer and Tungsten execution engine. In PySpark, you create a DataFrame from various sources:

df = spark.read.format("parquet").load("path/to/data")
df.createOrReplaceTempView("my_table")
spark.sql("SELECT column1, COUNT(*) FROM my_table GROUP BY column1").show()

DataFrames support both transformations (e.g., select, filter, groupBy, join) and actions (e.g., show, count, write).

Lazy Evaluation and DAG Scheduler

Spark uses lazy evaluation: transformations are not executed immediately but are recorded as a lineage graph. When an action is called, the driver builds a DAG of stages, where each stage contains a set of tasks that can be executed together (i.e., without shuffling data across nodes). The DAG Scheduler: 1. Identifies the optimal execution plan. 2. Partitions the DAG into stages based on shuffle dependencies. 3. Submits tasks to the Task Scheduler, which launches them on executors.

For example, consider:

rdd = sc.textFile("file.txt")
rdd2 = rdd.filter(lambda line: line.startswith("ERROR"))
rdd3 = rdd2.map(lambda line: line.split("\t")[1])
result = rdd3.count()

The DAG will have two stages: Stage 0 reads the file and filters; Stage 1 maps and counts. The shuffle boundary occurs because count is an action that aggregates across partitions.

Shuffle and Partitioning

A shuffle is a process of redistributing data across partitions, often required by operations like groupByKey, reduceByKey, join, and distinct. Shuffles are expensive because they involve disk I/O, network transfer, and serialization. Spark writes map output to local disk, then fetches it by reducers. The number of shuffle partitions is controlled by spark.sql.shuffle.partitions (default 200).

Partitioning determines how data is split across nodes. Default partitioning is based on block size (e.g., 128 MB for HDFS). You can repartition or coalesce:

df.repartition(100, "key_column")  # increases partitions, full shuffle
df.coalesce(10)  # decreases partitions, no full shuffle

Caching and Persistence

Caching stores intermediate results in memory to speed up iterative algorithms. You can persist an RDD or DataFrame at different storage levels: - MEMORY_ONLY: Deserialize objects in memory. If not enough memory, partitions are recomputed. - MEMORY_AND_DISK: Store in memory; spill to disk if necessary. - DISK_ONLY: Store on disk only. - MEMORY_ONLY_SER (Java/Scala): Serialized objects in memory, less memory but more CPU. - MEMORY_AND_DISK_SER: Like above but spills to disk.

Example:

df.persist(StorageLevel.MEMORY_AND_DISK)
# or
spark.conf.set("spark.sql.adaptive.enabled", "true")  # for automatic caching

Spark in Azure: Azure Synapse Analytics and Databricks

Azure Synapse Analytics provides a serverless Spark pool and dedicated Spark pools. You can create a Spark pool with specific node sizes (e.g., Medium: 4 vCores, 32 GB RAM). Serverless pools automatically scale based on workload. Azure Databricks is a managed Spark platform with optimized runtime (e.g., Databricks Runtime 10.4 LTS includes Spark 3.2.1). Both integrate with Azure Data Lake Storage Gen2 and Azure Blob Storage.

Key Configuration Parameters for DP-900

spark.executor.memory: Memory per executor (default 1g).

spark.executor.cores: CPU cores per executor (default 1).

spark.sql.shuffle.partitions: Number of partitions for shuffles (default 200).

spark.sql.adaptive.enabled: Enable adaptive query execution (default false, but recommended true).

spark.serializer: Class for serialization (default org.apache.spark.serializer.JavaSerializer; Kryo is faster).

Monitoring and Debugging

Spark provides a web UI on port 4040 (driver) that shows:

Stages and tasks

Storage (cached RDDs)

Environment (configuration)

Executors (memory, disk usage)

You can also access Spark logs via Azure Monitor or Databricks cluster logs.

Common Mistakes on the Exam

Confusing RDDs with DataFrames: DataFrames have schema and use Catalyst optimizer; RDDs are lower-level.

Thinking transformations execute immediately – they are lazy.

Assuming cache() persists forever – it persists only until evicted or the application ends.

Forgetting that collect() brings all data to the driver, which can cause OOM for large datasets.

Summary of Spark Execution Flow

1.

User submits a Spark application (e.g., PySpark script).

2.

Driver creates a SparkContext (or SparkSession).

3.

Driver requests resources from cluster manager (YARN, Kubernetes, or standalone).

4.

Executors are launched on worker nodes.

5.

Driver transforms user code into a DAG of stages.

6.

Tasks are scheduled to executors.

7.

Executors run tasks, cache data if instructed.

8.

Results are sent back to driver for actions.

9.

Application ends, executors are terminated.

Walk-Through

1

Submit Spark Application

The user submits a Spark application using `spark-submit` or through a notebook (Azure Synapse/Databricks). The `spark-submit` command specifies the main class or Python file, JARs, and configuration options. For example: `spark-submit --class com.example.MyApp --master yarn --deploy-mode cluster --executor-memory 4g --num-executors 10 myapp.jar`. The driver program starts on the cluster manager (e.g., YARN ResourceManager) and requests resources.

2

Driver Initializes SparkContext

The driver creates a SparkContext (or SparkSession in 2.0+). This object connects to the cluster manager and coordinates the application. The SparkContext sets up the DAG Scheduler, Task Scheduler, and Block Manager. It also registers the application with the cluster manager. In Azure Synapse, the SparkSession is automatically created when you start a notebook.

3

Cluster Manager Launches Executors

The cluster manager (e.g., YARN, Kubernetes, or Spark standalone) allocates containers for executors on worker nodes. Each executor is a JVM process with a set amount of memory and CPU cores. The number of executors is determined by `spark.executor.instances` (or dynamic allocation). Executors register with the driver and begin heartbeating to report status.

4

Driver Builds DAG and Creates Stages

When an action is called (e.g., `count()`, `collect()`), the driver examines the lineage of transformations and builds a DAG. The DAG Scheduler splits the DAG into stages. A stage boundary occurs at shuffle dependencies (e.g., `reduceByKey`, `join`). Each stage contains a set of tasks, one per partition. The driver then submits these tasks to the Task Scheduler.

5

Tasks Execute on Executors

The Task Scheduler sends tasks to executors via the network. Each executor deserializes the task code and applies it to its data partition. For example, a `map` task reads a partition from HDFS or memory, applies the function, and writes output to a local buffer. If the stage involves a shuffle, the executor writes shuffle data to local disk and notifies the driver. The driver then schedules the next stage's tasks, which fetch the shuffle data from the previous stage's executors.

6

Collect Results and Finalize

For actions like `collect()`, each executor sends its partial results back to the driver. The driver aggregates them and returns the final result to the user. For `saveAsTextFile()`, executors write directly to the output location. Once all tasks are complete, the driver stops the SparkContext and releases resources. The cluster manager deallocates the executors.

What This Looks Like on the Job

Scenario 1: ETL Pipeline for Retail Analytics

A large retailer uses Azure Synapse Analytics to process daily sales data from 10,000 stores. Raw data lands in Azure Data Lake Storage Gen2 as CSV files (approx. 5 TB per day). A Spark job runs nightly to:

Read the raw CSV files using spark.read.format("csv").option("header","true").load("path").

Clean and transform: filter invalid records, join with product catalog (stored as Parquet), aggregate sales by store and product.

Write the aggregated results back to ADLS in Parquet format partitioned by date.

The pipeline uses 50 executors with 8 GB memory each. Caching the product catalog DataFrame (which is only 2 GB) speeds up the join significantly. Without caching, the catalog is read from disk each time, adding 10 minutes to the 2-hour job. Misconfiguration: setting spark.sql.shuffle.partitions to 50 (too low) caused severe skew and OOM errors; increasing to 500 resolved the issue.

Scenario 2: Real-Time Fraud Detection with Structured Streaming

A financial services company uses Azure Databricks to detect fraudulent credit card transactions in near real-time. Transactions arrive as a stream from Azure Event Hubs (thousands per second). The Spark Structured Streaming job:

Reads from Event Hubs using spark.readStream.format("eventhubs").load().

Applies a sliding window aggregation (1-minute window, slide every 10 seconds) to count transactions per user.

Joins with a static lookup table of known fraud patterns (cached as a DataFrame).

Flags suspicious transactions and writes alerts to a Delta table.

The job runs with 8 executors, each with 4 cores. The checkpoint location is set to ADLS to allow recovery from failures. Key configuration: spark.sql.streaming.schemaInference set to true for flexible schema. A common pitfall is not setting spark.sql.streaming.checkpointLocation, which causes loss of state on restart.

Scenario 3: Interactive Data Exploration for Data Scientists

A healthcare research team uses Azure Synapse Serverless Spark to explore genomic data. They have 20 TB of variant call data in Parquet format. Data scientists run ad-hoc queries using Python notebooks. The serverless pool auto-scales from 5 to 50 executors based on workload. They use DataFrames and Spark SQL extensively. For example:

SELECT gene, COUNT(*) as variant_count
FROM variants
WHERE population = 'EUR'
GROUP BY gene
ORDER BY variant_count DESC

The query completes in 30 seconds thanks to the Catalyst optimizer and columnar storage. However, a common mistake is using collect() on large results, which overwhelms the driver memory. Instead, they use show(10) or write to a table.

How DP-900 Actually Tests This

What DP-900 Tests on This Topic

The DP-900 exam covers Spark under objective 3.3: Describe the analytics workload. Specifically, you need to:

Identify the components of a Spark cluster: driver, executors, cluster manager.

Understand the difference between RDDs and DataFrames.

Recognize that Spark uses in-memory processing for speed.

Know that transformations are lazy and actions trigger computation.

Understand how Azure Synapse Analytics and Azure Databricks provide Spark capabilities.

The exam focuses on conceptual understanding rather than deep configuration. You will NOT be asked to write Spark code or remember specific configuration parameters like spark.executor.memory. However, you should know the purpose of key parameters like spark.sql.shuffle.partitions.

Common Wrong Answers and Why Candidates Choose Them

1. Wrong: "Spark writes intermediate results to disk for fault tolerance." Why: Candidates confuse Spark with Hadoop MapReduce. Spark uses lineage for fault tolerance, not disk writes. In-memory processing is a key differentiator.

2. Wrong: "RDDs are the same as DataFrames." Why: Both are distributed collections, but DataFrames have schema and are optimized by Catalyst. The exam may ask which is easier to use for SQL queries.

3. Wrong: "Transformations execute immediately." Why: Many candidates assume all operations are eager. The exam tests lazy evaluation by asking when computation occurs.

4. Wrong: "The driver stores all data." Why: The driver only stores metadata and results from actions; data is distributed across executors. Candidates may think the driver is a storage node.

Specific Numbers and Terms That Appear on the Exam

Default shuffle partitions: 200

Storage levels: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY

Spark components: Driver, Executor, Cluster Manager (YARN, Kubernetes, Standalone)

Actions: count(), collect(), saveAsTextFile(), show()

Transformations: map(), filter(), join(), groupBy()

Edge Cases and Exceptions

If an executor fails, Spark recomputes its tasks using lineage. If the driver fails, the entire application fails.

Caching does NOT guarantee data is always in memory – it can be evicted if memory is needed for other tasks.

collect() should be used only on small datasets; for large datasets, use take(n) or write to disk.

How to Eliminate Wrong Answers

If a question mentions "in-memory processing" and "fault tolerance via lineage," it's about Spark.

If it contrasts "disk-based" vs. "in-memory," the disk-based option is usually Hadoop MapReduce.

For questions about Spark vs. other services: Synapse Spark is for big data batch/streaming; Azure Stream Analytics is for real-time analytics on streams; Azure Data Factory is for orchestration.

Key Takeaways

Apache Spark is a distributed, in-memory computing framework for big data processing.

Spark cluster consists of a driver and multiple executors; the driver coordinates, executors run tasks.

RDDs are immutable, partitioned collections; DataFrames add schema and optimization on top of RDDs.

Transformations are lazy (e.g., map, filter); actions trigger execution (e.g., count, collect).

Shuffles redistribute data across partitions and are expensive; controlled by spark.sql.shuffle.partitions (default 200).

Caching stores data in memory for faster iterative processing; storage levels include MEMORY_ONLY and MEMORY_AND_DISK.

In Azure, Spark is available via Azure Synapse Analytics (serverless or dedicated pools) and Azure Databricks.

Fault tolerance in Spark is achieved through lineage: lost partitions are recomputed from the original transformations.

Spark supports batch and streaming (Structured Streaming) processing.

Common exam traps: confusing transformations with actions, thinking Spark always uses memory, and mixing up RDDs and DataFrames.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Spark RDD

Low-level API with no schema

Supports arbitrary Java/Scala/Python objects

No built-in optimization; manual optimization required

Slower for SQL-like queries

Better for unstructured data (e.g., text files)

Spark DataFrame

High-level API with schema (column names and types)

Uses Catalyst optimizer for query optimization

Faster for SQL and relational operations

Supports SQL queries directly via Spark SQL

Better for structured data (e.g., Parquet, JSON)

Spark (batch processing)

Processes finite data in one shot

No continuous output; results after completion

Uses RDD/DataFrame transformations

Fault tolerance via lineage

Suitable for ETL, historical analysis

Spark Structured Streaming

Processes unbounded data streams continuously

Produces results incrementally (micro-batches or continuous)

Uses DataFrames with streaming sources/sinks

Fault tolerance via checkpointing and replay

Suitable for real-time dashboards, alerts

Watch Out for These

Mistake

Spark always stores data in memory.

Correct

Spark can store data in memory or disk depending on the storage level. If memory is insufficient, data spills to disk. The default storage level for caching is MEMORY_ONLY, but you can choose MEMORY_AND_DISK to avoid recomputation.

Mistake

RDDs and DataFrames are the same thing.

Correct

DataFrames are a higher-level abstraction built on top of RDDs with schema information. DataFrames leverage the Catalyst optimizer for query optimization, making them more efficient for SQL-like operations. RDDs are lower-level and better for unstructured data or custom transformations.

Mistake

Transformations execute as soon as they are called.

Correct

Transformations are lazy – they are not executed until an action (e.g., `count()`, `collect()`) is called. The driver builds a DAG of transformations and optimizes it before execution.

Mistake

The driver process stores all the data.

Correct

The driver only stores metadata and results from actions (like `collect()`). Data is distributed across executors in partitions. The driver does not store the actual dataset.

Mistake

Spark can only process batch data.

Correct

Spark supports both batch and streaming data through Spark Streaming (micro-batch) and Structured Streaming (continuous processing). It can handle real-time data as well as historical data.

Do You Actually Know This?

Reveal each answer, then mark whether you got it right. Score 60%+ to unlock the next chapter.

Frequently Asked Questions

What is the difference between Spark RDD and DataFrame for DP-900?

RDD (Resilient Distributed Dataset) is the low-level data structure in Spark, representing an immutable collection of objects partitioned across nodes. DataFrame is a higher-level abstraction that organizes data into named columns (like a table). DataFrames use the Catalyst optimizer for query optimization, making them faster for SQL-like operations. For DP-900, remember that DataFrames are easier to use and more performant for structured data, while RDDs are lower-level and better for unstructured data.

Does Spark always keep data in memory?

No, Spark can store data in memory, on disk, or both, depending on the storage level chosen when caching. The default storage level for cache() is MEMORY_ONLY, but if memory is insufficient, partitions are recomputed. You can explicitly use MEMORY_AND_DISK to spill to disk. Additionally, during shuffles, data is written to disk. So Spark is not always in-memory; it uses memory when possible but falls back to disk.

What is lazy evaluation in Spark?

Lazy evaluation means that transformations (like map, filter) are not executed immediately. Instead, Spark records them as a lineage graph. Only when an action (like count, collect) is called does Spark optimize and execute the entire DAG. This allows Spark to combine operations and reduce data shuffling. For DP-900, know that transformations are lazy and actions trigger computation.

How does Spark achieve fault tolerance?

Spark achieves fault tolerance through lineage. Each RDD stores the sequence of transformations that created it (its lineage). If a partition is lost due to node failure, Spark can recompute it by replaying the transformations on the original data. This is more efficient than replication used in Hadoop HDFS. Checkpointing to disk can also be used to truncate lineage for long chains.

What is a shuffle in Spark?

A shuffle is a process of redistributing data across partitions, typically required by operations like groupByKey, reduceByKey, join, and distinct. During a shuffle, map tasks write data to local disk, and reduce tasks fetch it over the network. Shuffles are expensive due to disk I/O and network transfer. The number of shuffle partitions is controlled by spark.sql.shuffle.partitions (default 200).

What is the role of the driver in Spark?

The driver is the main process that coordinates the Spark application. It converts user code into tasks, schedules them on executors, and maintains the DAG. The driver also tracks the state of executors and handles fault tolerance by recomputing lost tasks. It runs the user's main function and is responsible for returning results to the user (e.g., from collect()).

How does Spark integrate with Azure services?

Spark is available in Azure through Azure Synapse Analytics (serverless and dedicated Spark pools) and Azure Databricks. Both integrate with Azure Data Lake Storage Gen2, Azure Blob Storage, and Azure Event Hubs. Synapse provides a unified analytics experience, while Databricks offers optimized runtime and collaborative notebooks. For DP-900, know that these services provide managed Spark clusters.

Terms Worth Knowing

Ready to put this to the test?

You've just covered Apache Spark Core Concepts for DP-900 — now see how well it sticks with free DP-900 practice questions. Full explanations included, no account needed.

Done with this chapter?