A data engineer needs to process a large dataset (500 TB) stored in Cloud Storage using Dataproc. The processing job requires reading the entire dataset and writing results back to Cloud Storage. The job is expected to run for 6 hours. Which configuration minimizes cost?
Preemptible VMs reduce cost significantly while providing sufficient compute.
Why this answer
Option C is correct because preemptible VMs cost up to roughly 80% less than standard VMs, and mixing them with standard VMs lets the job tolerate preemptions during its 6-hour run. Because the job reads from and writes to Cloud Storage rather than local HDFS, local SSDs are unnecessary, and a single-node cluster would lack the parallelism needed to process 500 TB within 6 hours. Running the master and primary workers on standard VMs and adding preemptible VMs as secondary workers minimizes cost while still letting the job complete.
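The cost argument can be checked with a rough calculation. The sketch below is illustrative only: the node counts and the $1.00 hourly price are hypothetical placeholders rather than real Dataproc pricing, and the 80% discount is the approximate figure quoted above.

```python
def cluster_cost(standard_nodes, preemptible_nodes, hourly_price, hours, discount=0.80):
    """Estimate total VM cost for a cluster mixing standard and preemptible nodes.

    hourly_price is the per-node price of a standard VM (hypothetical here).
    discount is the assumed fractional saving for a preemptible VM.
    """
    standard_cost = standard_nodes * hourly_price * hours
    preemptible_cost = preemptible_nodes * hourly_price * (1 - discount) * hours
    return standard_cost + preemptible_cost


# Hypothetical 50-worker cluster running for the 6-hour job.
HOURS = 6
PRICE = 1.00  # assumed price per standard node-hour, not a real quote

all_standard = cluster_cost(50, 0, PRICE, HOURS)
mixed = cluster_cost(2, 48, PRICE, HOURS)  # 2 primary + 48 preemptible secondary workers
savings = 1 - mixed / all_standard

print(f"all standard: ${all_standard:.2f}")
print(f"mixed:        ${mixed:.2f}")
print(f"savings:      {savings:.1%}")
```

Under these assumptions the mixed cluster costs well under a third of the all-standard one, and the gap widens as the share of preemptible workers grows. The estimate ignores master nodes, disks, and the Dataproc service fee.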
Exam trap
Google Cloud exams often test the misconception that local SSDs always improve performance for data processing jobs. In Dataproc, when the data resides in Cloud Storage, the bottleneck is network throughput rather than local disk speed, so local SSDs are an unnecessary cost.
How to eliminate wrong answers
Option A is wrong because a single-node cluster has too little CPU and memory to process 500 TB in 6 hours, and it offers no fault tolerance if the node fails.
Option B is wrong because local SSDs add cost without benefit when the job reads from and writes to Cloud Storage: the bottleneck is network I/O, not disk I/O, and this job uses Cloud Storage rather than HDFS as its primary data store.
Option D is wrong because 1000 cores of n1-highmem-32 instances is over-provisioned and expensive; the job's 6-hour runtime does not justify such a large cluster, and the option ignores the cost savings of preemptible VMs.
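The claim that a single node cannot process 500 TB in 6 hours follows from quick arithmetic on the aggregate read rate the job must sustain. In the sketch below, the 1 GB/s per-node throughput is a hypothetical figure chosen for illustration, not a measured or documented limit, and only the read pass is counted.

```python
import math


def required_throughput_gb_per_s(dataset_tb, hours):
    """Aggregate read rate (GB/s) needed to scan the dataset once in the given time.

    Uses decimal units: 1 TB = 1e12 bytes, 1 GB = 1e9 bytes.
    """
    total_bytes = dataset_tb * 1e12
    seconds = hours * 3600
    return total_bytes / seconds / 1e9


def nodes_needed(dataset_tb, hours, per_node_gb_per_s):
    """Minimum node count if each node sustains per_node_gb_per_s (an assumption)."""
    return math.ceil(required_throughput_gb_per_s(dataset_tb, hours) / per_node_gb_per_s)


rate = required_throughput_gb_per_s(500, 6)
print(f"required aggregate read rate: {rate:.1f} GB/s")

# Hypothetical per-node throughput of 1 GB/s from Cloud Storage.
print(f"nodes needed at 1 GB/s each: {nodes_needed(500, 6, 1.0)}")
```

Reading 500 TB in 6 hours requires about 23 GB/s in aggregate, which is why the workload has to be spread across many workers rather than run on one machine.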