An e-commerce company uses AWS Glue to run ETL jobs that transform clickstream data from Amazon S3. The job reads Parquet files, performs aggregations, and writes the results to Amazon Redshift. The job runs successfully but takes longer than expected. The data volume is increasing. Which design change would MOST improve the job's performance?
More workers parallelize tasks and reduce runtime.
Why this answer
Increasing the number of Glue workers (and therefore the DPUs available to the job) scales its distributed processing capacity, so more of the Parquet data is read and aggregated in parallel. Of the options given, this is the most direct way to reduce execution time as data volume grows, because the Spark engine underlying AWS Glue spreads the job's tasks across the additional workers.
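As a rough illustration of what "adding workers" can look like in practice, the sketch below builds the arguments for a single run with a larger worker count and passes them to boto3's `start_job_run`. The job name `clickstream-etl`, the worker count, and both helper functions are hypothetical; the sketch also assumes boto3 is installed and AWS credentials are configured before `start_scaled_run` is called.

```python
def build_job_run_args(job_name, number_of_workers, worker_type="G.1X"):
    """Build keyword arguments for a Glue job run with an explicit worker count."""
    # Assumption: Spark-based Glue jobs need at least 2 workers.
    if number_of_workers < 2:
        raise ValueError("number_of_workers must be at least 2")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": number_of_workers,
    }


def start_scaled_run(job_name, number_of_workers, worker_type="G.1X"):
    """Start one run of an existing Glue job with the given capacity."""
    # Imported here so the argument builder can be used without boto3.
    import boto3

    glue = boto3.client("glue")
    response = glue.start_job_run(
        **build_job_run_args(job_name, number_of_workers, worker_type)
    )
    return response["JobRunId"]


# Hypothetical example: request 20 G.1X workers for one run.
args = build_job_run_args("clickstream-etl", 20)
```

Overriding capacity per run, rather than editing the job definition, makes it easy to compare runtimes at different worker counts before committing to a new default.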
Exam trap
The trap is assuming that more DPUs always means a proportionally higher bill. Glue bills per DPU-hour, so a shorter runtime offsets much of the extra hourly cost. Candidates who focus only on cost tend to pick a data format or target change instead, which does not address the core parallelism issue.
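The cost reasoning can be checked with simple arithmetic. The sketch below is a simplified model, not a pricing calculator: it assumes per-DPU-hour billing, that a G.1X worker maps to 1 DPU and a G.2X worker to 2 DPUs, and an illustrative rate of 0.44 USD per DPU-hour (actual rates vary by region). The function name and the example runtimes are hypothetical.

```python
# DPUs provided by each worker type in this simplified model.
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}


def run_cost(workers, runtime_hours, worker_type="G.1X", price_per_dpu_hour=0.44):
    """Estimate the cost of one job run as DPUs x hours x price."""
    dpus = workers * DPUS_PER_WORKER[worker_type]
    return dpus * runtime_hours * price_per_dpu_hour


# Hypothetical runs of the same job.
baseline = run_cost(workers=10, runtime_hours=2.0)        # original capacity
ideal_scaling = run_cost(workers=20, runtime_hours=1.0)   # runtime halves
partial_scaling = run_cost(workers=20, runtime_hours=1.25)  # sublinear speedup

print(f"baseline: {baseline:.2f} USD")
print(f"20 workers, ideal scaling: {ideal_scaling:.2f} USD")
print(f"20 workers, partial scaling: {partial_scaling:.2f} USD")
```

Under ideal scaling the job finishes in half the time for the same estimated cost; with a sublinear speedup the cost rises, but by far less than the doubling that the trap assumes.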
How to eliminate wrong answers
Option A is wrong because writing a single large file removes parallelism from downstream reads and bottlenecks Redshift's COPY operation, which loads fastest when it can read multiple files concurrently. Option B is wrong because converting Parquet to CSV increases file size and I/O: CSV loses columnar compression, column pruning, and predicate pushdown, so performance degrades. Option C is wrong because Redshift Spectrum queries data in place in S3 and does nothing for the ETL job's bottleneck; the job still has to transform the data and write its results, and Spectrum is a query feature, not a write target.
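To make the point about Option A concrete: Redshift's loading guidance recommends splitting load data into a number of files that is a multiple of the cluster's slice count, so that every slice has work during COPY. The helper below is a hypothetical sketch of that rounding; the function name and the example slice counts are assumptions, not values from the question.

```python
import math


def output_file_count(total_slices, target_files):
    """Round a desired file count up to the nearest multiple of the slice count."""
    if total_slices < 1 or target_files < 1:
        raise ValueError("total_slices and target_files must be positive")
    return math.ceil(target_files / total_slices) * total_slices


# Hypothetical cluster with 8 slices: aim for about 20 files, get 24.
files = output_file_count(total_slices=8, target_files=20)
```

A single output file is the worst case for this rule, since only one slice would load while the others sit idle.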