DP-203 Design and develop data processing • Set 2
DP-203 Design and develop data processing, Practice Test 2: 15 questions with explanations.
You are a data engineer for a large e-commerce company that uses Azure Data Lake Storage Gen2 (ADLS Gen2) as its data lake. A team of data scientists needs to process a dataset of approximately 5 TB, stored in Parquet format in the data lake, that contains sales transactions from the past 10 years. Each day, the data scientists run a Spark job on a serverless Apache Spark pool in Azure Synapse Analytics to compute aggregated sales metrics by product category and region. The job reads the entire dataset, performs transformations, and writes the aggregated results back to the data lake. Over the past few weeks, the job's run time has grown, and it now takes more than 6 hours, exceeding the acceptable SLA of 4 hours. The data scientists suspect that the cause is data skew or suboptimal partitioning. You need to optimize the job to reduce its execution time. Which approach should you take?