Managing Dataproc jobs
Submit jobs via the command line
You can submit a job via a jobs.submit API request or via the gcloud command gcloud dataproc jobs submit.
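For reference, the same kind of submission can be expressed as a jobs.submit API request. The following is a minimal sketch that assumes the google-cloud-dataproc Python client library; the project ID, region, cluster name, and Cloud Storage path are placeholders.
# Sketch: submit a PySpark job with a jobs.submit request (placeholder values).
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"

# Dataproc uses regional endpoints for non-global regions.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your-script.py"},
}

submitted = client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print("Submitted job:", submitted.reference.job_id)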
Example 1: Submit a PySpark job using the command line
gsutil cp gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py .
cat hello-world.py
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!'])
words = sorted(rdd.collect())
print(words)
gcloud dataproc jobs submit pyspark --cluster <my-dataproc-cluster> hello-world.py
Copying file:///tmp/hello-world.py [Content-Type=text/x-python]...
…
Job [dc1c28ac-c380-4d6c-a543-2a6ca43691eb] submitted.
Waiting for job output...
…
['Hello,', 'world!']
Job finished successfully.
…
Example 2: Submit a sample Spark job
gcloud dataproc jobs submit spark --cluster <my-dataproc-cluster> \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
Job [54825071-ae28-4c5b-85a5-58fae6a597d6] submitted.
Waiting for job output…
…
Pi is roughly 3.14177148
…
Job finished successfully.
…
To submit a sample Spark job, set the following arguments:
- the Cluster name on which the job will be run.
- the Job type to Spark.
- the Jar file that defines the job to be run: file:///usr/lib/spark/examples/jars/spark-examples.jar. The file:/// prefix denotes a Hadoop LocalFileSystem scheme; Cloud Dataproc installed /usr/lib/spark/examples/jars/spark-examples.jar on the cluster's master node when it created the cluster. Alternatively, you can specify a Cloud Storage path (gs://your-bucket/your-jarfile.jar) or a Hadoop Distributed File System path (hdfs://examples/example.jar) to one of your jars.
- the Main class to org.apache.spark.examples.SparkPi.
- the Job arguments to their desired value (1000 in the example above).
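These arguments map directly onto fields of the job in a jobs.submit request body. Below is a minimal sketch using the same Python client library as above; the project ID, region, and cluster name are placeholders, while the jar, main class, and argument come from the example.
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The fields below correspond to the cluster name, job type, jar file,
# main class, and job arguments listed above.
job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

client.submit_job(request={"project_id": project_id, "region": region, "job": job})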
You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below:
gcloud dataproc --project=<your-project-id> jobs wait <your-job-id>
SSH into the master node using the Cloud Platform Console
Use the Cloud Platform Console to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix). After establishing an SSH connection to the master VM instance, you can access the Spark shell or HDFS console.
Restarting jobs on failure
By default, Cloud Dataproc jobs do not automatically restart on failure. You can optionally configure jobs to restart on failure; when you do, you specify the maximum number of failures per hour.
Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Google Compute Engine virtual machine reboots. Restartable jobs are especially useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Google Cloud Dataproc clusters to ensure that the streaming jobs are resilient.
Using restartable jobs
Any type of Cloud Dataproc job can be marked as restartable. This applies to jobs submitted through the Cloud Platform Console, the Google Cloud SDK gcloud command-line tool, or the Cloud Dataproc REST API. Jobs can be restarted no more than ten times per hour.
You can specify the maximum number of times a job can be restarted per hour when submitting the job using the gcloud command-line tool.
gcloud beta dataproc jobs submit <args> \
--max-failures-per-hour <number>
You can also set jobs to restart through the jobs.submit API: specify the number of times a job can be restarted on failure in the max_failures_per_hour field of the JobScheduling request parameter.
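For example, here is a minimal sketch of such a request using the google-cloud-dataproc Python client library; the project ID, region, cluster name, and script path are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your-job.py"},
    # Allow up to 5 driver restarts per hour on failure (the service
    # accepts at most 10 per hour).
    "scheduling": {"max_failures_per_hour": 5},
}

client.submit_job(request={"project_id": project_id, "region": region, "job": job})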
Semantics of restartable jobs
The following semantics apply to reporting the success or failure of jobs:
- A job is reported successful if the driver terminates with code 0.
- A job is reported failed if:
  - The driver terminates with a non-zero code more than 4 times in 10 minutes.
  - The driver terminates with a non-zero code and has exceeded the max_failures_per_hour setting.
- A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour setting.
Known limitations
The following limitations apply to restartable jobs on Cloud Dataproc.
- You must design your jobs to gracefully handle restarting. For example, if your job writes to a directory, it must account for the fact that the directory may already exist when the job is restarted (see the sketch after this list).
- Apache Spark streaming jobs that checkpoint can be restarted after failure, but these jobs will not report YARN status.
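For instance, a restart-friendly PySpark job can write its results with Spark's overwrite save mode so that a rerun does not fail because the output directory already exists. The following is a minimal sketch; the input and output paths are placeholders.
#!/usr/bin/python
# Sketch of a job designed to tolerate restarts: writing with mode("overwrite")
# keeps a rerun from failing if a previous attempt already created the output.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restartable-example").getOrCreate()

lines = spark.read.text("gs://your-bucket/input/")
counts = lines.groupBy("value").count()

# Overwrite rather than error out if the output directory already exists
# from an earlier, failed run of this job.
counts.write.mode("overwrite").parquet("gs://your-bucket/output/")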