A company is migrating its on-premises Apache Spark jobs to Dataproc. It wants to minimize code changes and take advantage of serverless infrastructure. Which Dataproc feature should it use?
Serverless Spark runs jobs without cluster management and is compatible with existing Spark code.
Why this answer
Dataproc Serverless Spark is the correct choice because it runs Spark workloads without the company provisioning or managing any cluster. Existing jobs keep using the same Spark APIs and libraries, so code changes stay minimal. The service scales resources automatically and handles failures, which reduces operational overhead while preserving compatibility with the existing Spark jobs.
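As a rough illustration of what "no cluster management" means in practice, the sketch below builds the argument list for a `gcloud dataproc batches submit pyspark` call. It only assembles the command and does not run it. The helper name `build_batch_submit_command`, the bucket path, the region, and the script argument are all hypothetical values chosen for the example.

```python
from typing import List, Optional


def build_batch_submit_command(
    main_python_file: str,
    region: str,
    job_args: Optional[List[str]] = None,
) -> List[str]:
    """Build a `gcloud dataproc batches submit pyspark` argument list.

    No cluster name appears anywhere: the batch workload runs on
    infrastructure that Dataproc Serverless provisions for it.
    """
    command = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_python_file,
        f"--region={region}",
    ]
    if job_args:
        # Everything after `--` is passed through to the Spark script itself.
        command += ["--", *job_args]
    return command


# Hypothetical script location, region, and argument, for illustration only.
command = build_batch_submit_command(
    "gs://example-bucket/jobs/etl.py",
    "us-central1",
    job_args=["--date=2024-01-01"],
)
print(" ".join(command))
```

The PySpark script itself is passed unchanged, which is the sense in which code changes are minimized. A real migration may still need dependency or configuration adjustments that this sketch does not cover.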
Exam trap
Google Cloud often tests the distinction between 'serverless' and 'managed' services. The trap here is confusing Dataproc Workflow Templates or the Jobs API with serverless capabilities: both still run jobs on a cluster that someone has to define and size, whereas Dataproc Serverless Spark abstracts the infrastructure entirely.
How to eliminate wrong answers
Option A is wrong because preemptible VMs are cost-effective but still require managing a cluster and do not provide serverless infrastructure. They can also be terminated at any time, which disrupts jobs that lack proper checkpointing.
Option B is wrong because Workflow Templates orchestrate job sequences on a cluster, either an existing one or a managed cluster created for the workflow and deleted afterwards. In both cases the cluster still has to be specified, so they neither eliminate cluster management nor provide serverless execution.
Option D is wrong because the Dataproc Jobs API with custom machine types still requires a running cluster to submit jobs to, so it achieves neither serverless infrastructure nor reduced cluster management.
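To make the Jobs API point concrete, here is a minimal sketch that assembles a `gcloud dataproc jobs submit pyspark` command. Unlike a Serverless batch submission, this command needs a cluster target (shown here as `--cluster`), so a cluster must already exist. The helper name `build_cluster_job_command` and all values passed to it are hypothetical, and the command is only built, not executed.

```python
from typing import List


def build_cluster_job_command(
    main_python_file: str,
    region: str,
    cluster: str,
) -> List[str]:
    """Build a `gcloud dataproc jobs submit pyspark` argument list.

    The Jobs API submits work to a cluster, so this helper refuses to
    build a command when no cluster name is supplied.
    """
    if not cluster:
        raise ValueError("a Jobs API submission needs a target cluster")
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        main_python_file,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


# Hypothetical script location, region, and cluster name.
job_command = build_cluster_job_command(
    "gs://example-bucket/jobs/etl.py",
    "us-central1",
    cluster="example-cluster",
)
print(" ".join(job_command))
```

The required cluster argument is the practical difference from the serverless option: someone still has to create, size, and eventually delete `example-cluster`.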