Dataproc Clusters

You can create a Cloud Dataproc cluster via a Cloud Dataproc API clusters.create request, on the command line using the Google Cloud SDK gcloud command-line tool, or with the Google Cloud Platform Console.

Creating clusters using the command line

To create a Cloud Dataproc cluster on the command line, use the following Cloud SDK command:

gcloud dataproc clusters create <cluster-name>

The above command creates a cluster with default Cloud Dataproc service settings for your master and worker virtual machine instances, disk sizes and types, network type, zone where your cluster is deployed, and other cluster settings. The list of possible settings is given below:

gcloud dataproc clusters create NAME [--async] [--bucket=BUCKET]
    [--image-version=VERSION]
    [--initialization-action-timeout=TIMEOUT; default="10m"]
    [--initialization-actions=CLOUD_STORAGE_URI,[CLOUD_STORAGE_URI,…]]
    [--master-machine-type=MASTER_MACHINE_TYPE]
    [--num-workers=NUM_WORKERS]
    [--properties=[PREFIX:PROPERTY=VALUE,…]]
    [--scopes=SCOPE,[SCOPE,…]]
    [--service-account=SERVICE_ACCOUNT]
    [--worker-boot-disk-size=WORKER_BOOT_DISK_SIZE]
    [--worker-machine-type=WORKER_MACHINE_TYPE]
    [--zone=ZONE, -z ZONE]
    [--network=NETWORK     | --subnet=SUBNET]

--initialization-actions sets a list of Cloud Storage URIs of executables to run on each node in the cluster.

--scopes specifies scopes for the node instances.
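
For example, a creation request that overrides several of these defaults might look like the following; the cluster name, zone, machine types, worker count, and disk size are illustrative:

gcloud dataproc clusters create my-cluster \
    --zone us-east1-b \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4 \
    --num-workers 4 \
    --worker-boot-disk-size 500GB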

Staging bucket

When you create a cluster, Cloud Dataproc creates a Cloud Storage staging bucket in your project or reuses an existing bucket from a previous cluster creation request. A separate bucket is used in each geographical region, as determined by the Google Compute Engine zone of the cluster. A Cloud Dataproc-created staging bucket is shared among clusters in the same region.

To list the name of the staging bucket created by Cloud Dataproc, run the Cloud SDK gcloud dataproc clusters describe command.

gcloud dataproc clusters describe <cluster-name>
clusterName: your-cluster-name
clusterUuid: daa40b3f-5ff5-4e89-9bf1-bcbfec ...
configuration:
    configurationBucket: dataproc-edc9d85f-12f9-4905-...
    ...
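
If you need just the bucket name (for example, in a script), the describe command's --format flag can extract it; the field path below assumes the response structure shown above:

gcloud dataproc clusters describe <cluster-name> \
    --format='value(configuration.configurationBucket)'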

Regional endpoints

Google Cloud Dataproc supports both a single "global" endpoint and "regional" endpoints based on Google Compute Engine zones. Each Dataproc region constitutes an independent resource namespace constrained to deploying instances into Compute Engine zones inside the region. Specifically, you can specify distinct regions, such as us-east1 or europe-west1, to isolate resources (including VM instances and Google Cloud Storage) and metadata storage locations utilized by Cloud Dataproc within the user-specified region.

Unless specified, Cloud Dataproc defaults to the "global" region. The "global" region is a special multi-region namespace that is capable of interacting with Cloud Dataproc resources across all Google Compute Engine zones.

There are some situations where specifying a regional endpoint may be useful:

  • specifying an explicit regional endpoint may provide better regional isolation and protection.
  • selecting specific regions may lead to better performance compared to the default "global" namespace.
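
For example, to create a cluster through the us-east1 regional endpoint instead of the default "global" endpoint, you can pass the --region flag; the cluster name and zone are illustrative:

gcloud dataproc clusters create my-cluster \
    --region us-east1 \
    --zone us-east1-b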

Image version

When using the gcloud dataproc clusters create command, you can use the --image-version argument to specify an image version. Image version numbers use the following format:

version_major.version_minor.version_patch

Best practice is to specify the major and minor version only, so the latest patch version is always used. For example, you can run the following command to create a new cluster, my-test-cluster, that uses the current patch release of image version 1.0:

gcloud dataproc clusters create my-test-cluster --image-version 1.0

New major versions will be created periodically to incorporate one or more of the following:

  • Major releases for Operating system, Spark, Hadoop, and other Big Data components, or Google Cloud connectors
  • Major changes or updates to Cloud Dataproc functionality

New minor versions will be created periodically to incorporate minor releases and updates of the above.

New patch versions will be created periodically to incorporate patches or fixes for a component in the image.

Major and minor image versions are supported for a specified period of time after they are released. During this period, clusters using the Image Versions are eligible for support. After the support window has closed, clusters using the Image Version are not eligible for support.

  • 0-12 months after the image version is released: you can create clusters with the image version, and clusters using it are eligible for support.
  • 12-18 months after release: you can still create clusters with the image version, but clusters using it are no longer eligible for support.
  • 18+ months after release: you can no longer create clusters with the image version, and clusters using it are not eligible for support.

Configuring clusters

Network configuration

The Compute Engine VM instances in a Google Cloud Dataproc cluster, consisting of master and worker VMs, require full internal IP networking access to each other. When creating a cluster, you can accept the default network for the cluster, or you can specify your own VPC network. In the latter case, you must first create a VPC network with firewall rules, and then associate that network with the cluster when you create it.

Legacy networks use a default firewall rule with a source IP range of 10.0.0.0/8 to allow intra-cluster communication, while newer subnet-enabled default networks use a slightly more restrictive source IP range of 10.128.0.0/9 due to having constrained IP ranges per regional subnetwork.

Creating a VPC network

When creating a VPC network, you can choose either automatic or custom subnet creation. For automatic creation, you must select one or more firewall rules. In addition to using the source IP ranges noted above, the typical (and default) default-allow-internal firewall rule in a Cloud Dataproc cluster's network opens the udp:0-65535; tcp:0-65535; icmp ports.

For automatic creation, the <network-name>-allow-internal rule, which opens the udp:0-65535; tcp:0-65535; icmp ports, should be selected to enable full internal IP networking access among VM instances in the network. You can also select the <network-name>-allow-ssh rule, which opens standard SSH port 22 to allow SSH connections to the network.

If you choose Custom mode, you must specify the region and private IP address range for each subnetwork.

Note that you provide firewall rules for custom subnetworks after the network is created.
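
Putting these steps together, a custom mode network, subnetwork, and internal firewall rule might be created as shown below. This is a minimal sketch: the network and subnet names, region, and IP range are illustrative, and exact flag names can vary between Cloud SDK releases.

gcloud compute networks create my-custom-network --subnet-mode custom
gcloud compute networks subnets create my-subnet \
    --network my-custom-network \
    --region us-east1 \
    --range 10.0.0.0/16
gcloud compute firewall-rules create my-custom-network-allow-internal \
    --network my-custom-network \
    --source-ranges 10.0.0.0/16 \
    --allow udp:0-65535,tcp:0-65535,icmp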

Currently, you cannot use the Cloud Platform Console to create a Cloud Dataproc cluster that uses a custom mode VPC network. Instead, create the cluster with the gcloud dataproc clusters create command and the --subnet flag.

gcloud dataproc clusters create ... \
  --subnet projects/project-id/regions/region/subnetworks/subnet-name
Create a private IP Cloud Dataproc cluster

You can create a Dataproc cluster that is isolated from the public internet, whose VM instances communicate over a private IP subnet. To do this, first configure the subnet in the VPC network for Private Google Access (a command sketch for this step follows the examples below), then create a Dataproc cluster that uses the private IP subnet.

  • For a subnet in an auto mode VPC network:
gcloud beta dataproc clusters create my-cluster \
  --no-address \
  --network my-custom-auto-subnet
  • For a subnet in a custom mode VPC network:
gcloud beta dataproc clusters create cluster-name \
  --no-address \
  --subnet projects/project-id/regions/region/subnetworks/subnetwork-name
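
The Private Google Access setting mentioned above can be enabled on an existing subnetwork with a command along the following lines; the subnet name and region are illustrative:

gcloud compute networks subnets update subnet-name \
  --region region \
  --enable-private-ip-google-access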

Cluster properties

The open source components installed on Cloud Dataproc clusters contain many configuration files. For example, Spark and Hadoop have several XML and plain text configuration files. From time to time, you may need to update or add to these configuration files. You can easily use the --properties option of the dataproc command to modify many common configuration files when creating a cluster.

To make it easy to change properties, the --properties flag requires a string of text that follows this format:

file_prefix:property=value

Each specified property is then mapped to a configuration file. For example, "core:io.serializations" denotes the io.serializations property in the core-site.xml configuration file. The following table lists the supported prefixes and their mappings:

Prefix Target Configuration File
core core-site.xml
hdfs hdfs-site.xml
mapred mapred-site.xml
yarn yarn-site.xml
hive hive-site.xml
pig pig.properties
spark spark-defaults.conf
Important notes
  • Some properties are reserved and cannot be overridden because they would impact the functionality of the Cloud Dataproc cluster. If you try changing a reserved property, you can expect an error message when creating your cluster.
  • You can specify multiple changes by separating each with a comma.
  • The --properties flag cannot modify configuration files not shown above.
  • Changing properties when creating clusters in the GCP console is currently not supported.
  • The changes in properties will be applied before the daemons on your cluster start.
  • If the specified property already exists, it will be updated. If the specified property does not exist, it will be added to the configuration file.
Examples

If you want to change the spark.master setting in the spark-defaults.conf file, you can do so by adding the following --properties option when creating a new cluster on the command line:

--properties 'spark:spark.master=spark://example.com'

You can change several properties at once, in one or more configuration files, by separating each change with a comma. For example, to change the spark.master setting in the spark-defaults.conf file and the dfs.hosts setting in the hdfs-site.xml file, you can use the following option when creating a cluster:

--properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'
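
In the context of a complete cluster creation command, the option is passed like any other flag; the cluster name is illustrative:

gcloud dataproc clusters create my-cluster \
    --properties 'spark:spark.master=spark://example.com,hdfs:dfs.hosts=/foo/bar/baz'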

Initialization actions

When creating a cluster, you can specify initialization actions, which are executables and/or scripts that Cloud Dataproc will run on some or all nodes in your cluster immediately after the cluster is set up. Frequently used and other sample initialization action scripts are publicly available.

Cluster initialization actions can be specified regardless of how you create a cluster: in the GCP console, on the command line with the gcloud tool, or programmatically with the clusters.create API (see NodeInitializationAction).

There are a few important things to know when creating or using initialization actions:

  • Initialization actions run as the root user.
  • You should use absolute paths in initialization actions.
  • Your initialization actions should use a shebang line to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).

If you want to limit initialization actions to master or worker nodes, you can add simple node-selection logic to your executable or script, as shown below.

ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  ... master specific actions ...
else
  ... worker specific actions ...
fi

When creating a cluster with the gcloud dataproc clusters create command, specify the Cloud Storage location(s) of the initialization executable(s) and/or script(s) with the --initialization-actions flag.

gcloud dataproc clusters create CLUSTER-NAME \
  [--initialization-actions [GCS_URI,...]] \
  [--initialization-action-timeout TIMEOUT; default="10m"] \
  ... other flags ...

You can use the optional --initialization-action-timeout flag to specify a timeout for the initialization action (the default value is 10 minutes). If the initialization executable has not completed by the end of the timeout period, Cloud Dataproc cancels the initialization action.
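
For example, to give a longer-running initialization action 30 minutes to complete (the bucket and script names are illustrative):

gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/long-init.sh \
    --initialization-action-timeout 30m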

Example—Staging binaries

A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume the following initialization script is stored in gs://my-bucket/download-job-jar.sh:

#!/bin/bash
ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
if [[ "${ROLE}" == 'Master' ]]; then
  gsutil cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username
fi

The location of this script can be passed to the gcloud dataproc clusters create command:

gcloud dataproc clusters create my-dataproc-cluster \
    --initialization-actions gs://my-bucket/download-job-jar.sh

Cloud Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:

gcloud dataproc jobs submit hadoop \
    --cluster my-dataproc-cluster \
    --jar file:///home/username/sessionalize-logs-1.0.jar

High-Availability mode

When creating a Google Cloud Dataproc cluster, you can put the cluster into Hadoop High Availability (HA) mode by specifying the number of master instances in the cluster. The number of masters can only be specified at cluster creation time.

Currently, Cloud Dataproc supports two master configurations:

  • 1 master (default, non HA)
  • 3 masters (Hadoop HA)

To create an HA cluster with gcloud dataproc clusters create, run the following command:

gcloud beta dataproc clusters create cluster-name --num-masters 3

In the case of a Compute Engine failure, Dataproc instances will reboot. With the default single-master configuration, Cloud Dataproc is designed to recover and continue processing new work in such cases, but in-flight jobs will necessarily fail and need to be retried. In addition, HDFS will be inaccessible until the single NameNode fully recovers on reboot.

In HA mode, HDFS High Availability and YARN High Availability are configured to allow uninterrupted YARN and HDFS operations despite any single-node failures/reboots.

Note that the driver/main program of any jobs you run still represents a potential single point of failure if the correctness of your job depends on the driver program running successfully. Jobs submitted through the Cloud Dataproc Jobs API are not considered "high availability," and will still be terminated on failure of the master node that runs the corresponding job driver programs.
For individual jobs to be resilient against single-node failures when using an HA Cloud Dataproc cluster, the job must either:

  1. run without a synchronous driver program, or it must
  2. run the driver program itself inside a YARN container and be written to handle driver-program restarts.
Instance Names

The default master is named cluster-name-m; HA masters are named cluster-name-m-0, cluster-name-m-1, cluster-name-m-2.

Apache ZooKeeper

In an HA Cloud Dataproc cluster, all masters participate in a ZooKeeper cluster, which enables automatic failover for other Hadoop services.

HDFS

In a standard Cloud Dataproc cluster:

  • cluster-name-m runs:
    • NameNode
    • Secondary NameNode

In a High Availability Cloud Dataproc cluster:

  • cluster-name-m-0 and cluster-name-m-1 run:
    • NameNode
    • ZKFailoverController
  • All masters run JournalNode
  • There is no Secondary NameNode
YARN

In a standard Cloud Dataproc cluster, cluster-name-m runs ResourceManager.
In a High Availability Cloud Dataproc cluster, all masters run ResourceManager.
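
To see which daemons are currently active on an HA cluster, you can run the standard Hadoop HA admin commands from a master node. This is a sketch only: the service IDs used below are assumptions, and the actual IDs for your cluster are listed in the dfs.ha.namenodes.* property in hdfs-site.xml and the yarn.resourcemanager.ha.rm-ids property in yarn-site.xml.

# Report the active/standby state of a NameNode (service ID is an assumption)
hdfs haadmin -getServiceState nn0
# Report the active/standby state of a ResourceManager (service ID is an assumption)
yarn rmadmin -getServiceState rm1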

Single-node clusters

Single node clusters are Cloud Dataproc clusters with only one node. This single node acts as the master and worker for your Cloud Dataproc cluster.

There are a number of situations where single node Cloud Dataproc clusters can be useful, including:

  • Trying out new versions of Spark and Hadoop or other open source components
  • Building proof-of-concept (PoC) demonstrations
  • Lightweight data science
  • Small-scale non-critical data processing
  • Education related to the Spark and Hadoop ecosystem
Creating single-node clusters

Using the gcloud command-line tool, you can create a single node cluster by passing the --single-node argument to the dataproc clusters create command.

gcloud beta dataproc clusters create <args> --single-node

Using the clusters.create API, you need to do two things to create a single-node cluster (a request sketch is shown after the list):

  1. Add the property dataproc:dataproc.allow.zero.workers="true" to the SoftwareConfig of the cluster request.
  2. Do not submit any values for the workerConfig and secondaryWorkerConfig.
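
As a sketch only, a direct REST call that applies both steps might look like the following curl command. The project, region, and cluster name are illustrative, and the request body uses the v1 field names (clusterName, config, softwareConfig); adjust them to the API version you are targeting.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://dataproc.googleapis.com/v1/projects/my-project/regions/global/clusters \
  -d '{
    "projectId": "my-project",
    "clusterName": "my-single-node-cluster",
    "config": {
      "softwareConfig": {
        "properties": {
          "dataproc:dataproc.allow.zero.workers": "true"
        }
      }
    }
  }'

Note that no workerConfig or secondaryWorkerConfig values are submitted, per step 2 above.
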
Semantics of single-node clusters

The following semantics apply to single node clusters:

  • Single node clusters are configured the same way as multi node Cloud Dataproc clusters.
  • Single node clusters include services such as HDFS and YARN.
  • Single node clusters will report as master nodes for initialization actions.
  • Single node clusters will show 0 workers since the single node is acting as the master and worker.
  • Single node clusters will have hostnames that follow the pattern clustername-m. You can use this hostname to SSH into or connect to a web UI on the node.
  • Single node clusters cannot be upgraded to multi node clusters. Once created, single node clusters will be restricted to one node. Likewise, multi node clusters cannot be scaled down to become single node clusters.

While single node clusters only have one node, most Cloud Dataproc concepts and features still apply, with the exceptions listed below.

Limitations of single node clusters

The following limitations apply to single node clusters on Cloud Dataproc.

  • Single node clusters are not intended for large-scale parallel data processing.
  • n1-standard-1 machine types are not recommended due to their limited resources for YARN applications.
  • Single node clusters are not available with high-availability because there is only one node in the cluster.
  • Single node clusters cannot use preemptible VMs.

Customized instance types

At present, creating clusters with custom machine types is only supported through the Google Cloud SDK gcloud dataproc command.

Custom machine types use a special machine type name that encodes the number of virtual CPUs and the amount of memory in megabytes:

custom-[number-of-virtual-CPUs]-[memory-size-in-megabytes]

For example, the machine type name for a custom VM with 6 virtual CPUs and 22.5 GB of memory (22.5 × 1024 = 23040 MB) is custom-6-23040.

Once you know the machine type name you wish to use, you can create a cluster with that custom machine type:

gcloud dataproc clusters create test-cluster \
    --worker-machine-type custom-6-23040 \
    --master-machine-type custom-6-23040
Specifications for custom types
  • The maximum number of vCPUs allowed for a custom machine type is determined by the zone where the instance will be hosted:

    • Zones that support Broadwell, Haswell, and Ivy Bridge processors can support custom machine types with up to 64 vCPUs.
    • Zones that support Sandy Bridge processors can support custom machine types with up to 16 vCPUs.
  • Above 1, vCPU count must be even, such as 2, 4, 6, 8, 10, and so on.

  • The memory per vCPU of a custom machine type must be between 0.9 GB and 6.5 GB per vCPU, inclusive.

  • The total memory for a custom machine type must be a multiple of 256 MB. For example, 6.9 GB is not acceptable, but 6.75 GB and 7 GB are acceptable.

  • Instances with custom machine types have the same persistent disk capacity limitations as instances with predefined machine types. However, custom machine types are still limited to 16 individual persistent disk volumes.

Examples of invalid machine types

  • 1 vCPU, 0.6 GB of total memory — Invalid because the total memory is less than the minimum 0.9 GB.
  • 1 vCPU, 0.9 GB of total memory — Invalid because the total memory must be a multiple of 256 MB. For 1 vCPU, use a minimum of 1024 MB memory.

Examples of valid machine types

  • 32 vCPUs, 29 GB of total memory — Valid because the total number of vCPUs is even and the total memory is a multiple of 256 MB. The amount of memory per vCPU is 0.9 GB, which satisfies the minimum requirement.
  • 1 vCPU, 1 GB of total memory — Valid because it has one vCPU, which is the minimum value, and the total memory is a multiple of 256 MB. The amount of memory per vCPU is also between the acceptable range of 0.9 GB to 6.5 GB per vCPU.

Preemptible VMs

Cloud Dataproc clusters can use preemptible VM instances to save money. You can use preemptible VMs to lower per-hour compute costs for non-critical data processing or to create very large clusters at a lower total cost.

All preemptible instances added to a cluster use the machine type of the cluster's non-preemptible worker nodes. For example, if you create a cluster with workers that use n1-standard-4 machine types, all preemptible instances added to the cluster will also use n1-standard-4 machines. The addition or removal of preemptible workers from a cluster does not affect the number of non-preemptible workers in the cluster.

Because preemptible instances are reclaimed if they are required for other tasks, Cloud Dataproc adds preemptible instances as secondary workers in a managed instance group, which contains only preemptible workers. The managed group automatically re-adds workers lost due to reclamation as capacity permits.

The following rules will apply when you use preemptible workers with a Cloud Dataproc cluster:

  • Processing only—Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
  • No preemptible-only clusters—To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters. If you use the gcloud dataproc clusters create command with --num-preemptible-workers, and you do not also specify a number of standard workers with --num-workers, Cloud Dataproc will automatically add two non-preemptible workers to the cluster.
  • Persistent disk size—As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.

When creating a Cloud Dataproc cluster from the Cloud Platform Console, you can specify the number of preemptible workers. After a cluster has been created, you can add and remove preemptible workers by editing the cluster from the Cloud Platform Console.

Using the command-line tool, you can create a cluster with preemptible instances by using the --num-preemptible-workers argument. To remove all preemptible workers from a cluster, use the gcloud alpha dataproc clusters update command with --num-preemptible-workers set to 0.
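
For example, the following commands create a cluster with two standard workers and four preemptible workers, and later remove the preemptible workers. The cluster name is illustrative, and depending on your Cloud SDK version the update command may require the alpha component noted above.

gcloud dataproc clusters create my-cluster \
    --num-workers 2 \
    --num-preemptible-workers 4

gcloud dataproc clusters update my-cluster \
    --num-preemptible-workers 0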

Accessing clusters

Web interfaces

Some of the core open source components included with Google Cloud Dataproc clusters, such as Apache Hadoop and Apache Spark, provide web interfaces. These web interfaces can be used to manage and monitor different cluster resources and facilities, such as the YARN resource manager, the Hadoop Distributed File System (HDFS), MapReduce, and Spark.

By default, these interfaces do not require authentication or offer SSL encryption. We recommend that you do not open firewall ports to access these applications directly, since they would then be publicly accessible. Instead, use an SSH tunnel from your local network to your Google Compute Engine network.

The interfaces listed below are available on a Cloud Dataproc cluster master node (replace master-host-name with the name of your master node).

Web UI Port URL
YARN ResourceManager 8088 http://master-host-name:8088
HDFS NameNode 50070 http://master-host-name:50070

To connect to the web interfaces, we recommend you use an SSH tunnel to create a secure connection to your master node. The SSH tunnel supports traffic proxying using the SOCKS protocol. This means that you can send network requests through your SSH tunnel in any browser that supports the SOCKS protocol. This method allows you to transfer all of your browser data over SSH, eliminating the need to open firewall ports to access the web interfaces.

Connecting to the web interfaces with SSH and SOCKS is a two-step process:

  1. Create an SSH tunnel. Use an SSH client or utility to create the SSH tunnel. Use the SSH tunnel to securely transfer web traffic data from your computer's web browser to the Cloud Dataproc cluster.

  2. Use a SOCKS proxy to connect with your browser. Configure your browser to use the SOCKS proxy. The SOCKS proxy routes data intended for the Cloud Dataproc cluster through the SSH tunnel.

Step 1: creating an SSH tunnel

Run the following command to set up an SSH tunnel to the Hadoop master instance on port 1080 of your local machine. Replace master-host-name with the name of the master node in your Cloud Dataproc cluster and master-host-zone with the zone of your Cloud Dataproc cluster.

gcloud compute ssh  --zone=<master-host-zone> \
  --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" <master-host-name>

The --ssh-flag flag allows you to add extra parameters to your SSH connection. The --ssh-flag values above have the following meanings:

  • -D 1080 specifies dynamic application-level port forwarding.
  • -N instructs gcloud not to open a remote shell.
  • -n instructs gcloud not to read from stdin.
Step 2: connecting with your web browser

Your SSH tunnel supports traffic proxying using the SOCKS protocol. You must configure your browser to use the proxy when connecting to your cluster. For example, you can start Google Chrome with flags that route its traffic through the SOCKS proxy listening on localhost port 1080:

<Google Chrome executable path> \
  --proxy-server="socks5://localhost:1080" \
  --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
  --user-data-dir=/tmp/
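
For example, on Linux the Chrome executable is typically /usr/bin/google-chrome, so the full command might be:

/usr/bin/google-chrome \
  --proxy-server="socks5://localhost:1080" \
  --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
  --user-data-dir=/tmp/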

Scaling and updating clusters

Using the Cloud Dataproc API, you can update a cluster with a clusters.patch request. Alternatively, you can use the gcloud dataproc clusters update command. Only the following cluster parameters can be updated:

  • the number of standard worker nodes (--num-workers) in a cluster to scale clusters
  • the number of preemptible worker nodes (--num-preemptible-workers)

After creating a Cloud Dataproc cluster, you can scale the cluster by increasing or decreasing the number of worker nodes in the cluster.

You can scale a Cloud Dataproc cluster at any time, even when jobs are running on the cluster. The primary reasons for scaling are:

  1. to increase the number of workers to make a job run faster
  2. to decrease the number of workers to save money
  3. to increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage

For example, to scale a cluster named "dataproc-1" to use five worker nodes, run the following command.

gcloud dataproc clusters update dataproc-1 --num-workers 5
Waiting on operation [operations/projects/project-id/operations/...].
Waiting for cluster update operation...done.
Updated [https://dataproc.googleapis.com/...].
clusterName: dataproc-1
...
  masterDiskConfiguration:
    bootDiskSizeGb: 500
  masterName: dataproc-1-m
  numWorkers: 5
  ...
  workers:
  - dataproc-1-w-0
  - dataproc-1-w-1
  - dataproc-1-w-2
  - dataproc-1-w-3
  - dataproc-1-w-4
...

Deleting clusters

You can delete a cluster with a clusters.delete request. Alternatively, you can use the following gcloud command to delete a cluster.

gcloud dataproc clusters delete my-dataproc-cluster-name
