A company is migrating its on-premises Apache Spark jobs to Dataproc. The jobs read from and write to Cloud Storage. After migration, the jobs are slower than expected. The Dataproc cluster uses standard worker machines with local SSDs. What is the most likely cause of the performance degradation?
Trap 1: The Spark shuffle service is not enabled on the cluster.
Shuffle service affects intermediate data, not final read/write.
Trap 2: The local SSDs are not mounted or are misconfigured.
Dataproc automatically mounts local SSDs; misconfiguration is unlikely.
Trap 3: The Cloud Storage connector is not using the gRPC protocol.
gRPC improves performance but is not the primary cause of slowdown.
- A
The Spark shuffle service is not enabled on the cluster.
Why wrong: Shuffle service affects intermediate data, not final read/write.
- B
The local SSDs are not mounted or are misconfigured.
Why wrong: Dataproc automatically mounts local SSDs; misconfiguration is unlikely.
- C
The Cloud Storage connector is not using the gRPC protocol.
Why wrong: gRPC improves performance but is not the primary cause of slowdown.
- D
The jobs use the Cloud Storage connector instead of HDFS, causing network latency.
Reading from Cloud Storage over network is slower than local HDFS reads.