ACE | Chapter 84 of 101 | Objective 3.3

Cloud Dataproc for Spark and Hadoop

This chapter covers Google Cloud Dataproc, the managed service for running Apache Spark and Hadoop clusters. Dataproc is a core component of the ACE exam, appearing in approximately 5-8% of questions across domains, particularly under 'Deploy and Implement' (Objective 3.3). You will learn how Dataproc simplifies cluster creation, job submission, autoscaling, and integration with other GCP services like Cloud Storage and BigQuery. The exam tests your ability to configure clusters, manage jobs, and optimize costs using preemptible instances and lifecycle policies.

25 min read
Intermediate
Updated May 31, 2026

Dataproc as a Prefab Data Factory

Imagine you run a furniture company that needs to process raw lumber into finished tables. You could build a sawmill from scratch: buy land, construct the building, install saws, hire operators, and manage maintenance. That's like building your own Hadoop/Spark cluster on bare VMs. Now imagine a prefab factory: you call a vendor, specify how many tables per hour you need, and they deliver a fully equipped, pre-assembled factory on a flatbed truck. You plug it in, run it for a day, and when you're done, they take it away. That's Cloud Dataproc. The factory is pre-configured with all the tools (Hadoop, Spark, Hive, Pig) and can scale by adding more assembly lines (worker nodes) on the fly. You only pay for the time the factory is operational. The vendor handles the electrical wiring, plumbing, and safety inspections (cluster management, monitoring, patching). If you need to process a different type of wood (a different dataset), you just re-tool the saws (change cluster configuration) and restart. The prefab approach eliminates the capital expense and expertise needed to build and maintain a permanent facility, letting you focus on the actual woodworking (data processing).

How It Actually Works

What is Cloud Dataproc?

Cloud Dataproc is a fully managed, cloud-native service for running Apache Spark, Apache Hadoop, and associated open-source data processing frameworks. It automates the provisioning, scaling, and management of clusters so you can focus on running data pipelines without worrying about infrastructure. Dataproc is designed for speed: clusters can be created in under 90 seconds, and you only pay for the resources you use (per-second billing after a 1-minute minimum). It integrates deeply with Google Cloud Storage (GCS), BigQuery, Cloud Monitoring, and Cloud Logging.
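As a rough illustration of the billing rule described above, the sketch below computes the billable seconds for a cluster lifetime under per-second billing with a 60-second minimum. The function name `billed_seconds` is a hypothetical helper, not part of any Google tool, and it deliberately leaves out prices.

```shell
# billed_seconds RUNTIME_SECONDS
# Per-second billing with a 1-minute minimum: any runtime shorter than
# 60 seconds is billed as 60 seconds; longer runtimes are billed as-is.
billed_seconds() {
  runtime=$1
  if [ "$runtime" -lt 60 ]; then
    echo 60
  else
    echo "$runtime"
  fi
}

billed_seconds 45    # prints 60
billed_seconds 754   # prints 754
```

A cluster that lives for 45 seconds is billed like one that lived for a full minute; past that point, every extra second counts individually.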

Why Dataproc Exists

On-premises or self-managed Hadoop/Spark clusters require significant operational overhead: hardware procurement, software installation, configuration, tuning, patching, and capacity planning. Dataproc eliminates this by providing a turnkey environment. It is ideal for:

ETL pipelines – transforming large datasets using Spark or Hive.

Machine learning – training models with Spark MLlib.

Data migration – moving on-premises Hadoop workloads to GCP.

Ad-hoc analysis – running interactive queries on large datasets.

How Dataproc Works Internally

Dataproc uses a master-worker architecture. The master node runs the YARN ResourceManager, the HDFS NameNode (if HDFS is enabled), and the Spark driver. Worker nodes run the YARN NodeManager, the HDFS DataNode, and Spark executors. When you create a cluster, Dataproc automatically:

1. Provisions Compute Engine VMs (master and workers).

2. Installs and configures Hadoop, Spark, and optional components (Hive, Pig, Tez, etc.).

3. Attaches the VMs to your VPC network and runs them as the cluster's service account. Firewall rules that allow internal traffic must already exist on that network.

4. Sets up monitoring with Cloud Monitoring and Logging.

Key Components and Defaults

Master node – Single node (can be high-memory). Default machine type: n1-standard-4 (4 vCPUs, 15 GB RAM).

Worker nodes – At least 2 nodes. Default: n1-standard-4.

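Preemptible workers – Optional secondary workers. Primary workers are always standard VMs, so a cluster cannot be 100% preemptible. Preemptible VMs cost up to about 80% less than regular instances but can be terminated at any time.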

Cluster types: Single node (1 master, 0 workers – for testing), Standard (1 master, N workers), High Availability (3 masters, N workers).

Image version: Dataproc uses predefined image versions (e.g., 2.0-debian10, 2.1-ubuntu20). Each includes specific versions of Hadoop, Spark, etc.

Idle timeout: Default 0 (no timeout). You can set a timeout (in minutes) to auto-delete the cluster when idle.

Scaling: You can manually add/remove workers or enable autoscaling, which reacts to YARN memory metrics (pending and available memory).
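The cluster types and worker groups listed above map onto flags of `gcloud dataproc clusters create`. The wrappers below are a minimal sketch: the function names and positional arguments are hypothetical, and the worker counts are placeholders rather than recommendations.

```shell
# Single node: one VM acts as both master and worker (testing only).
create_single_node() {
  gcloud dataproc clusters create "$1" --region="$2" --single-node
}

# High availability: three masters plus N primary workers.
create_ha() {
  gcloud dataproc clusters create "$1" --region="$2" \
    --num-masters=3 \
    --num-workers="$3"
}

# Standard cluster with extra preemptible capacity in the secondary group.
create_with_secondary() {
  gcloud dataproc clusters create "$1" --region="$2" \
    --num-workers=2 \
    --num-secondary-workers="$3" \
    --secondary-worker-type=preemptible
}

# Usage (placeholder names):
#   create_single_node dev-cluster us-central1
#   create_ha prod-cluster us-central1 4
#   create_with_secondary batch-cluster us-central1 6
```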

Configuration and Verification Commands

Create a cluster using gcloud:

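# Jupyter is requested as an optional component. ANACONDA is left out because
# 2.x images ship Miniconda by default and do not offer the Anaconda component.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=3 \
    --image-version=2.0-debian10 \
    --optional-components=JUPYTER \
    --enable-component-gateway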

Submit a Spark job:

gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

List clusters:

gcloud dataproc clusters list --region=us-central1

Update cluster (add workers):

gcloud dataproc clusters update my-cluster \
    --region=us-central1 \
    --num-workers=5

Interaction with Related Technologies

Cloud Storage (GCS): Dataproc can read/write data directly from GCS using the gs:// connector, avoiding HDFS. This is recommended for cost savings and durability. HDFS is ephemeral – data is lost when the cluster is deleted.

BigQuery: Use the Spark BigQuery connector to read/write BigQuery tables directly from Spark jobs.

Cloud Monitoring: Metrics like YARN memory, CPU utilization, and HDFS usage are automatically collected.

Cloud Logging: Job logs are streamed to Cloud Logging for analysis.

Cloud Composer: Orchestrate Dataproc jobs as part of Airflow DAGs.
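To make the Cloud Storage point concrete, here is a hedged sketch of a PySpark submission in which the script, the input, and the output all live in a bucket instead of HDFS. The function name, the bucket layout, and the script name `clean_sales.py` are hypothetical.

```shell
# submit_gcs_pyspark CLUSTER REGION BUCKET
# The job script is read from Cloud Storage, and the two gs:// paths after
# "--" are passed to the script as its own arguments.
submit_gcs_pyspark() {
  gcloud dataproc jobs submit pyspark "gs://$3/jobs/clean_sales.py" \
    --cluster="$1" \
    --region="$2" \
    -- "gs://$3/raw/" "gs://$3/clean/"
}

# Usage (placeholder names):
#   submit_gcs_pyspark my-cluster us-central1 my-data-bucket
```

Because nothing is written to HDFS, the cluster can be deleted as soon as the job finishes without losing results.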

Autoscaling

Dataproc autoscaling manages a primary worker group and an optional secondary worker group (preemptible by default). Autoscaling policies define:

Scale-up and scale-down factors – how aggressively to add workers when YARN reports pending memory, and to remove them when memory sits unused.

Cooldown period – default 2 minutes between scaling evaluations.

Min/max instances – bounds on the size of each worker group.
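An autoscaling policy is a small YAML document that is imported once and then attached to clusters. The sketch below assumes illustrative values (the factors, instance bounds, and timeout are placeholders, not tuning advice), and the two function names are hypothetical.

```shell
# write_autoscaling_policy PATH
# Writes an example policy file. Values are illustrative only.
write_autoscaling_policy() {
  cat > "$1" <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.05
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF
}

# apply_autoscaling_policy POLICY_ID CLUSTER REGION FILE
# Registers the policy, then attaches it to an existing cluster.
apply_autoscaling_policy() {
  gcloud dataproc autoscaling-policies import "$1" \
    --source="$4" --region="$3"
  gcloud dataproc clusters update "$2" \
    --autoscaling-policy="$1" --region="$3"
}

# Usage (placeholder names):
#   write_autoscaling_policy policy.yaml
#   apply_autoscaling_policy my-policy my-cluster us-central1 policy.yaml
```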

Security

IAM roles: dataproc.editor, dataproc.worker, dataproc.viewer.

Service accounts: Each cluster uses a service account to access GCS, BigQuery, etc.

VPC-SC: Dataproc supports VPC Service Controls for data exfiltration prevention.

Encryption: Data on disk is encrypted at rest using Google-managed or CMEK keys.
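The IAM and service-account items above can be sketched with two commands. The function names, project ID, user email, and service-account address are placeholders.

```shell
# grant_dataproc_editor PROJECT_ID USER_EMAIL
# Lets a user create clusters and submit jobs in the project.
grant_dataproc_editor() {
  gcloud projects add-iam-policy-binding "$1" \
    --member="user:$2" \
    --role="roles/dataproc.editor"
}

# create_with_service_account CLUSTER REGION SERVICE_ACCOUNT_EMAIL
# Runs the cluster VMs as a dedicated service account. That account is
# assumed to already hold roles/dataproc.worker on the project.
create_with_service_account() {
  gcloud dataproc clusters create "$1" --region="$2" \
    --service-account="$3"
}

# Usage (placeholder names):
#   grant_dataproc_editor my-project analyst@example.com
#   create_with_service_account my-cluster us-central1 \
#     etl-runner@my-project.iam.gserviceaccount.com
```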

Lifecycle Management

Dataproc clusters can be deleted automatically via:

Idle timeout: Auto-delete after N minutes of inactivity (no jobs running).

Cluster scheduling: Use Cloud Scheduler to delete clusters at specific times.

Workflow templates: Define a sequence of jobs; the cluster can be deleted after the workflow completes.
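A minimal sketch of scheduled deletion at creation time, assuming a 30-minute idle window and an 8-hour hard cap (both values are placeholders, and the function name is hypothetical):

```shell
# create_ephemeral_cluster CLUSTER REGION
# --max-idle deletes the cluster after it has had no jobs for that long;
# --max-age deletes it after a fixed lifetime regardless of activity.
create_ephemeral_cluster() {
  gcloud dataproc clusters create "$1" --region="$2" \
    --max-idle=30m \
    --max-age=8h
}

# Usage (placeholder names):
#   create_ephemeral_cluster nightly-etl us-central1
```

Setting both gives a safety net: the idle timer handles the normal case, and the age limit catches a job that hangs and keeps the cluster busy.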

Key Timers and Defaults

Cluster creation time: < 90 seconds (typical).

Per-second billing after 1-minute minimum.

Idle timeout: 0 (disabled) by default.

Autoscaling cooldown period: 2 minutes.

Preemptible VM max lifetime: 24 hours (but can be terminated earlier).

Exam-Relevant Details

Dataproc is not serverless – you manage clusters, but the control plane is managed.

You can use preemptible instances for worker nodes to reduce cost, but not for master nodes.

GPUs can be attached to master, primary worker, and secondary worker nodes; GPU-equipped preemptible workers can still be reclaimed at any time.

The Component Gateway provides web interfaces (Jupyter, Zeppelin, Hadoop UI) via a secure proxy.

Initialization actions are scripts run on all nodes during cluster creation (e.g., install custom libraries).

Workflow templates allow you to define a DAG of jobs with dependencies and can be triggered via Cloud Scheduler.
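The workflow-template item can be sketched as a four-step sequence: create the template, describe the managed (ephemeral) cluster, add a job, and instantiate. The function name, the derived cluster name, and the step ID are hypothetical; the Spark job reuses the SparkPi example from earlier in the chapter.

```shell
# run_workflow TEMPLATE REGION
run_workflow() {
  # 1. Create an empty template.
  gcloud dataproc workflow-templates create "$1" --region="$2"

  # 2. Describe the cluster that is created for each run and deleted afterwards.
  gcloud dataproc workflow-templates set-managed-cluster "$1" \
    --region="$2" \
    --cluster-name="$1-cluster" \
    --num-workers=2

  # 3. Add a job step.
  gcloud dataproc workflow-templates add-job spark \
    --workflow-template="$1" \
    --region="$2" \
    --step-id=compute-pi \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

  # 4. Run the workflow once.
  gcloud dataproc workflow-templates instantiate "$1" --region="$2"
}

# Usage (placeholder names):
#   run_workflow nightly-pi us-central1
```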

Troubleshooting

Common issues:

Out of memory: Increase worker machine type or add more workers.

Job failure: Check logs in Cloud Logging or use gcloud dataproc jobs wait.

Preemptible worker loss: Ensure job is fault-tolerant (e.g., use checkpointing).
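For the job-failure case, a first pass at diagnosis might look like the sketch below. The function name is hypothetical, and the job ID is whatever `gcloud dataproc jobs list` or the Console reports.

```shell
# inspect_job JOB_ID REGION
inspect_job() {
  # Print the job's current state (for example RUNNING, DONE, or ERROR).
  gcloud dataproc jobs describe "$1" --region="$2" \
    --format="value(status.state)"

  # Attach to the job and stream its driver output until it finishes.
  gcloud dataproc jobs wait "$1" --region="$2"
}

# Usage (placeholder job ID):
#   inspect_job 1234567890abcdef us-central1
```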

Walk-Through

1. Define Cluster Configuration

First, determine the cluster's region, zone, machine types, number of workers, image version, and optional components. You can also specify initialization actions (scripts run on all nodes at startup) and a staging bucket for job dependencies. Use the `gcloud dataproc clusters create` command or the Cloud Console. The resulting configuration can be inspected as YAML or JSON with `gcloud dataproc clusters describe`. For example, you might choose n1-standard-4 for master and workers, with 3 workers, image 2.0-debian10, Jupyter enabled via optional components, and the component gateway enabled for web UIs.
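The initialization-action and staging-bucket options mentioned in this step could be combined as follows. This is a sketch: the function name and the script path `scripts/install-libs.sh` are hypothetical, and the script is assumed to have been uploaded to the bucket beforehand.

```shell
# create_with_init_action CLUSTER REGION BUCKET
# --bucket takes the bucket name without the gs:// prefix.
create_with_init_action() {
  gcloud dataproc clusters create "$1" --region="$2" \
    --bucket="$3" \
    --initialization-actions="gs://$3/scripts/install-libs.sh" \
    --initialization-action-timeout=10m
}

# Usage (placeholder names):
#   create_with_init_action my-cluster us-central1 my-staging-bucket
```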

2. Provision Compute Resources

Dataproc sends requests to Compute Engine to create the VMs. The master node is created first, then workers. Each VM is assigned a service account for API access. The cluster is attached to a VPC network (default or custom), whose firewall rules must allow internal traffic between cluster VMs (e.g., Hadoop ports); the default network's default-allow-internal rule covers this, while a custom VPC needs an equivalent ingress rule. Provisioning typically takes under 90 seconds. During this time, the VMs boot the chosen image version and run any initialization actions.

3. Install and Configure Software

After VMs are ready, Dataproc installs Hadoop, Spark, and optional components (Hive, Pig, Tez, etc.) using the image's package manager. Configuration files (e.g., core-site.xml, yarn-site.xml) are generated with optimized settings for GCP. The master node runs YARN ResourceManager, HDFS NameNode, and Spark driver. Workers run YARN NodeManager and HDFS DataNode. If using HDFS, it is formatted on the master. The cluster is now ready for jobs.

4. Submit and Run Jobs

Jobs are submitted via `gcloud dataproc jobs submit` or through the Console. Dataproc supports Spark, Hadoop MapReduce, Pig, Hive, and PySpark. The job is sent to the YARN ResourceManager, which allocates containers on worker nodes. Spark executors run in these containers. You can also use the Component Gateway to access Spark History Server, YARN UI, or Jupyter notebooks. Job logs are streamed to Cloud Logging.

5. Monitor and Scale

During job execution, you can monitor metrics via Cloud Monitoring (e.g., YARN memory, CPU). If the cluster is under/overloaded, you can manually add or remove workers using `gcloud dataproc clusters update`. Autoscaling policies can be configured to automatically adjust the number of workers based on YARN memory utilization. Preemptible workers can be added to the secondary worker group for cost-effective burst capacity.
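Manual scaling of both worker groups can be sketched as one update call. The function name is hypothetical, and the one-hour graceful decommission window is a placeholder; it gives YARN time to finish running containers before a worker is removed.

```shell
# scale_workers CLUSTER REGION PRIMARY_COUNT SECONDARY_COUNT
scale_workers() {
  gcloud dataproc clusters update "$1" --region="$2" \
    --num-workers="$3" \
    --num-secondary-workers="$4" \
    --graceful-decommission-timeout=1h
}

# Usage (placeholder names):
#   scale_workers my-cluster us-central1 5 10
```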

6. Delete Cluster or Let Idle Timeout

After jobs complete, you can delete the cluster to stop incurring costs. Use `gcloud dataproc clusters delete` or set an idle timeout (e.g., 10 minutes) so the cluster auto-deletes. Deleting a cluster removes all VMs and ephemeral HDFS data. Persistent data stored in GCS or BigQuery remains. You can also use workflow templates to automatically delete the cluster after a job sequence finishes.

What This Looks Like on the Job

Enterprise Scenario 1: ETL Pipeline for Retail Analytics

A large retailer processes daily sales data from thousands of stores. They use Dataproc to run a 10-node Spark cluster (n1-standard-8 workers) that reads raw CSV files from Cloud Storage, cleans and transforms the data using Spark SQL, and writes the results to BigQuery for dashboarding. The cluster is created via a Cloud Composer DAG that runs daily at 2 AM. After the job finishes, the cluster is deleted using an idle timeout of 5 minutes. The retailer saves costs by using preemptible instances for 50% of workers, as the job is fault-tolerant with checkpointing to GCS. A common misconfiguration is not setting the idle timeout, leading to the cluster running all day and incurring unnecessary costs.

Enterprise Scenario 2: Ad-Hoc Data Science Exploration

A data science team needs to explore a 5 TB dataset stored in GCS. They create a Dataproc cluster with Jupyter and Anaconda components enabled via --optional-components. They use the Component Gateway to access JupyterLab. The cluster has 5 preemptible workers for cost savings. However, during a long-running hyperparameter tuning job, preemptible VMs are frequently terminated, causing job failures. The solution: switch to standard workers or implement checkpointing in Spark. The team learns that preemptible instances are best for batch jobs that can tolerate interruptions.

Enterprise Scenario 3: Migrating On-Premises Hadoop to GCP

A financial services company migrates its on-premises Hadoop cluster to GCP. They use Dataproc with HDFS enabled (temporarily) to replicate data. They run MapReduce jobs on Dataproc to validate data integrity. They then transition to using GCS as the primary data store, removing HDFS. The migration involves using gcloud dataproc clusters create with an image version whose Hadoop release matches the on-premises cluster. A pitfall is submitting jobs whose dependencies were never uploaded to the staging bucket (set with the --bucket flag; Dataproc creates a default bucket if the flag is omitted), which causes job failures. They also use initialization actions to install custom security certificates.

How ACE Actually Tests This

The ACE exam tests Dataproc under Objective 3.3: Deploy and Implement – Deploy and implement Cloud Dataproc. Expect 2-4 questions. Key areas:

1. Cluster creation and configuration: Know the syntax of gcloud dataproc clusters create, especially --optional-components, --enable-component-gateway, --num-secondary-workers, --image-version, and --initialization-actions. The exam often asks: "Which flag enables Jupyter?" Answer: --optional-components=JUPYTER.

2. Cost optimization: Preemptible instances are a common topic. Wrong answer: "Preemptible instances can be used for master nodes." Reality: Only secondary worker nodes can be preemptible; masters and primary workers are standard VMs. Also, autoscaling with preemptible workers is a cost-saving strategy.

3. Data storage: A frequent question: "Where should you store persistent data?" Wrong answer: HDFS. Correct: Cloud Storage (GCS). HDFS is ephemeral and lost when the cluster is deleted.

4. Job submission: Know how to submit Spark and Hadoop jobs. A trap: "Which command submits a PySpark job?" The correct answer: gcloud dataproc jobs submit pyspark.

5. Security: IAM roles (dataproc.editor, dataproc.worker) and service accounts. A common wrong answer: "You need to create a custom service account for every cluster." Reality: You can use the default Compute Engine service account.

6. Autoscaling: Understand that autoscaling is driven by YARN memory metrics. The exam may ask: "Which metric triggers scale-out?" Answer: YARN pending memory, meaning containers are waiting for memory the cluster does not currently have.

7. Workflow templates: Know that they define a sequence of jobs and can include cluster creation and deletion. A question might ask: "How do you schedule a Dataproc workflow?" Answer: Cloud Scheduler triggers the workflow template.

8. Edge cases: Single-node clusters (1 master, 0 workers) are for testing only – not for production. High-availability mode uses 3 masters. The exam may test that master nodes cannot be preemptible.

Key Takeaways

Dataproc is a managed service for Apache Spark and Hadoop clusters; you manage clusters, not the control plane.

Clusters can be created in under 90 seconds using gcloud or Cloud Console.

Always use Cloud Storage (GCS) for persistent data; HDFS is ephemeral and deleted with the cluster.

Preemptible VMs reduce costs but can only be used for secondary worker nodes, not for master nodes or primary workers.

Autoscaling adjusts worker count based on YARN memory metrics (pending and available memory).

Component Gateway provides secure web access to Jupyter, Zeppelin, and Hadoop UIs.

Workflow templates define a DAG of jobs and can auto-delete the cluster after completion.

Default machine type for master and workers is n1-standard-4.

Idle timeout auto-deletes clusters after a set period of inactivity; default is disabled (0).

Initialization actions are scripts run on all nodes during cluster creation.

Easy to Mix Up

These come up on the exam all the time. Here's how to tell them apart.

Dataproc (Managed Hadoop/Spark)

Cluster creation in < 90 seconds via gcloud or Console.

Automatic installation and configuration of Hadoop/Spark.

Integrated monitoring, logging, and IAM.

Per-second billing after 1-minute minimum.

Preemptible VMs supported for workers.

Self-Managed Hadoop/Spark on Compute Engine

Manual provisioning of VMs and software installation.

Requires manual configuration of all Hadoop/Spark components.

Must set up monitoring and logging manually.

Per-second Compute Engine billing with no Dataproc fee, but VMs keep billing until you stop them yourself.

Preemptible VMs can be used but need manual integration.

Watch Out for These

Mistake

Dataproc is serverless; you don't manage any infrastructure.

Correct

Dataproc is managed, not serverless. You still create and manage clusters (VMs), though the control plane is fully managed. For serverless Spark, use Dataproc Serverless; Cloud Dataflow is the serverless option for Apache Beam pipelines rather than Spark.

Mistake

You should always use HDFS for persistent data storage.

Correct

HDFS is ephemeral – data is lost when the cluster is deleted. For persistent storage, use Cloud Storage (GCS). HDFS is only for temporary data during job execution.

Mistake

Preemptible VMs can be used for master nodes.

Correct

Preemptible VMs can only be used for secondary worker nodes. Master nodes and primary workers must be regular VMs to ensure cluster stability.

Mistake

Dataproc clusters can only be created in the same region as your data.

Correct

Dataproc clusters can be created in any region, but for best performance, place the cluster in the same region as your data (e.g., GCS bucket). Cross-region access incurs egress costs and latency.

Mistake

Autoscaling can scale the number of master nodes.

Correct

Autoscaling only adjusts the number of worker nodes (primary or secondary). Master node count is fixed at creation (1 or 3 for HA).


Frequently Asked Questions

How do I create a Dataproc cluster with Jupyter and Anaconda?

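Use the `--optional-components` flag: `gcloud dataproc clusters create my-cluster --region=us-central1 --optional-components=JUPYTER --enable-component-gateway`. On 2.x images Miniconda is preinstalled and the Anaconda component is not offered, so request `JUPYTER` alone; `--optional-components=ANACONDA,JUPYTER` applies only to older 1.x images. The `--enable-component-gateway` flag provides secure web access to Jupyter. You can then access JupyterLab via the Component Gateway URL shown in the Console.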

Can I use preemptible VMs for the master node?

No. Preemptible VMs can only be used for secondary worker nodes. Master nodes must be regular VMs to ensure cluster stability. If the master node of a standard cluster is lost, the entire cluster fails.

What is the difference between a standard cluster and a high-availability cluster?

A standard cluster has one master node; a high-availability (HA) cluster has three master nodes for redundancy. HA clusters are recommended for production workloads that require fault tolerance. HA clusters cost more because of the additional master nodes.

How do I submit a PySpark job to a Dataproc cluster?

Use the `gcloud dataproc jobs submit pyspark` command: `gcloud dataproc jobs submit pyspark my_script.py --cluster=my-cluster --region=us-central1`. You can also pass arguments after `--`.

How does Dataproc autoscaling work?

Autoscaling adjusts the number of worker nodes based on a policy. It reads YARN memory metrics: when containers are waiting on memory (pending memory), the cluster scales out in proportion to the policy's scale-up factor; when memory sits available, it scales in according to the scale-down factor. Evaluations happen once per cooldown period (default 2 minutes). You can define min and max instances for each worker group.

What happens to my data when I delete a Dataproc cluster?

Ephemeral HDFS data is lost. Data stored in Cloud Storage (GCS), BigQuery, or Cloud SQL persists. Always store persistent data outside HDFS.

Can I schedule a Dataproc cluster to start and stop automatically?

Yes, you can use Cloud Scheduler to trigger a workflow template that creates a cluster, runs jobs, and deletes the cluster. Alternatively, use Cloud Functions to start/stop clusters on a schedule.
