Your organization has a data lake on Cloud Storage with millions of small files (average 10 KB). You need to build a batch processing pipeline using Cloud Dataproc that runs a Spark job to transform the data and output results to BigQuery. The pipeline currently takes 4 hours to run because Spark spends a large amount of time listing files and managing tasks. You want to reduce the run time without changing the cluster size. Which action should you take?
Combining files reduces task count and listing overhead.
Why this answer
Option D is correct because the primary bottleneck is the overhead of listing millions of small files and managing many Spark tasks. By combining small files into larger ones using a separate job before the main transformation, you reduce the number of files Spark must list and the number of tasks required, which directly cuts the 4-hour runtime. Enabling Spark Dynamic Resource Allocation ensures resources are used efficiently during this preprocessing step without changing the cluster size.
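As a rough back-of-the-envelope sketch of why consolidation helps, the snippet below estimates how many files remain after compaction. The question only says "millions" of files, so the count of 2 million and the 128 MiB target size are illustrative assumptions, and the function name is hypothetical.

```python
import math

def estimate_compacted_files(num_files: int, avg_file_bytes: int, target_file_bytes: int) -> int:
    """Estimate how many files remain after merging small files up to a target size."""
    total_bytes = num_files * avg_file_bytes
    # At least one output file, even for a tiny dataset.
    return max(1, math.ceil(total_bytes / target_file_bytes))

# Assumed inputs: 2 million files (not stated in the question) at 10 KB each,
# merged toward an assumed 128 MiB target size.
NUM_FILES = 2_000_000
AVG_FILE_BYTES = 10 * 1024
TARGET_FILE_BYTES = 128 * 1024 * 1024

compacted = estimate_compacted_files(NUM_FILES, AVG_FILE_BYTES, TARGET_FILE_BYTES)
print(f"files before: {NUM_FILES}, files after: {compacted}")
# prints "files before: 2000000, files after: 153"
```

Under these assumptions the main job lists and schedules work for roughly 150 files instead of 2 million, which is the overhead reduction the answer relies on.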
Exam trap
The trap is that candidates focus on data format or partition tuning (options A, B, and C) instead of recognizing the root cause: the sheer number of small files drives excessive file listing and task overhead, and fixing it requires a preprocessing step that consolidates the files.
How to eliminate wrong answers
Option A is wrong because converting CSV to Parquet improves read performance and compression but does not address the overhead of listing millions of small files or the task management cost; the bottleneck is file count, not format.
Option B is wrong because using Spark coalesce reduces the number of output partitions, which only affects the write phase to BigQuery and does nothing to reduce the input file listing or task scheduling overhead.
Option C is wrong because increasing the number of Spark partitions would create even more tasks, exacerbating the overhead from managing millions of small files and likely increasing runtime, not reducing it.