You are a data engineer for a large e-commerce company. The company uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a massive dataset (approximately 5 TB) stored in Parquet format in the data lake. The dataset contains sales transactions from the past 10 years. The data scientists run a Spark job daily using Azure Synapse Analytics (serverless Spark pool) to compute aggregated sales metrics by product category and region. The job reads the entire dataset each day, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job has been taking longer to complete, and the data scientists have reported that the job now takes over 6 hours, exceeding the acceptable SLA of 4 hours. They suspect the issue is related to data skew or suboptimal partitioning. You need to optimize the job to reduce execution time. Which approach should you take?
Auto Loader incrementally ingests new files, avoiding a full scan of the 5 TB dataset daily. This directly reduces the data processed and speeds up the job.
Why this answer
Option C is correct because the job reads the entire 5 TB dataset every day, which is wasteful when only the newly arrived data needs processing. Auto Loader in 'directoryListing' mode lists the input directory, tracks which files it has already ingested, and processes only the files added since the last run, so the volume read per run drops from the full history to roughly one day of transactions. The aggregates for the new data can then be combined with the results already written to the data lake. This addresses the root cause of the SLA breach (rereading unchanged historical data on every run) rather than tuning resources or partitioning around it.
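The incremental read described above can be sketched in PySpark as follows. This is a minimal sketch, not a definitive implementation: it assumes a runtime where the `cloudFiles` source (Auto Loader, a Databricks feature) is available and Spark 3.3 or later for `trigger(availableNow=True)`. The function names and every path argument are hypothetical, and the aggregation step is left out.

```python
from typing import Dict


def auto_loader_options(schema_location: str) -> Dict[str, str]:
    """Build reader options for Auto Loader in directory listing mode.

    schema_location is a hypothetical path where the inferred schema
    is persisted between runs.
    """
    return {
        "cloudFiles.format": "parquet",
        # "false" selects directory listing instead of file notifications.
        "cloudFiles.useNotifications": "false",
        "cloudFiles.schemaLocation": schema_location,
    }


def run_incremental_load(spark, source_path, target_path,
                         checkpoint_path, schema_location):
    """Ingest only files not seen in earlier runs, then stop.

    Assumption: `spark` is a SparkSession on a runtime that provides
    the cloudFiles source. All paths are placeholders.
    """
    new_files = (
        spark.readStream.format("cloudFiles")
        .options(**auto_loader_options(schema_location))
        .load(source_path)
    )
    return (
        new_files.writeStream.format("parquet")
        # The checkpoint records which files were already processed.
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then terminate.
        .trigger(availableNow=True)
        .start(target_path)
    )
```

The checkpoint is what makes the load incremental: as long as each daily run points at the same checkpoint location, files ingested by earlier runs are skipped.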
Exam trap
The trap here is that candidates focus on tuning Spark parameters (memory, partitions, joins) to handle the existing workload, but the real issue is the unnecessary reprocessing of unchanged data, which only incremental loading can solve.
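A back-of-the-envelope calculation shows how much of each run is spent on unchanged data. It is only a rough estimate: it assumes sales volume was uniform across the 10 years and uses binary units (1 TB = 1024 GB), neither of which the scenario states.

```python
TOTAL_TB = 5   # dataset size from the scenario
YEARS = 10     # history covered by the dataset

total_gb = TOTAL_TB * 1024   # assumption: binary units
days = YEARS * 365           # ignores leap days
daily_gb = total_gb / days   # assumption: uniform daily volume

print(f"Full scan:   {total_gb} GB per run")
print(f"Incremental: about {daily_gb:.2f} GB per run")
print(f"Share of a full scan that is new data: {daily_gb / total_gb:.4%}")
```

Under these assumptions a single day of data is about 1.4 GB, so almost all of a full scan rereads records whose contribution to the aggregates has not changed.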
How to eliminate wrong answers
Option A is wrong because increasing executor memory and cores adds compute capacity but does not reduce the amount of data read or processed. The job would still scan the full 5 TB every day, and its runtime would keep growing as the dataset grows.
Option B is wrong because repartitioning on 'product_category' with 2000 partitions may help with data skew but does not remove the need to read the entire dataset each day. It also introduces a costly full shuffle that could make performance worse.
Option D is wrong because a broadcast join hint optimizes a join by copying a small table to every executor. The fact table (sales transactions) is 5 TB, far too large to broadcast, so forcing the hint would make the job fail or run out of memory rather than speed it up.
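The size argument against Option D can be checked with simple arithmetic. This sketch assumes Spark's default `spark.sql.autoBroadcastJoinThreshold` of 10 MB and the 8 GB ceiling above which Spark refuses to broadcast a table. The helper `can_broadcast` is hypothetical and only compares sizes; it is not a Spark API.

```python
TABLE_BYTES = 5 * 1024**4              # 5 TB fact table from the scenario
AUTO_BROADCAST_DEFAULT = 10 * 1024**2  # default auto-broadcast threshold (10 MB)
BROADCAST_HARD_LIMIT = 8 * 1024**3     # assumed upper bound for a broadcast table (8 GB)


def can_broadcast(size_bytes: int) -> bool:
    """Hypothetical check: is the table within the broadcast ceiling?"""
    return size_bytes <= BROADCAST_HARD_LIMIT


times_over_limit = TABLE_BYTES // BROADCAST_HARD_LIMIT

print(f"Can broadcast the fact table: {can_broadcast(TABLE_BYTES)}")
print(f"Fact table is {times_over_limit}x the broadcast ceiling")
```

Broadcasting remains a reasonable choice for a small dimension table joined to the fact table, but that would not change the volume of fact data scanned each day.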