Managing Dataproc jobs
Submit jobs via the command line
You can submit a job via a jobs.submit API request or via the gcloud command gcloud dataproc jobs submit.
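For reference, the same kind of submission can be expressed as a jobs.submit API request. The following is a minimal sketch that assumes the google-cloud-dataproc Python client library; the project ID, region, cluster name, and Cloud Storage path are placeholders.
# Sketch: submit a PySpark job with a jobs.submit request (placeholder values).
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"

# Dataproc uses regional endpoints for non-global regions.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your-script.py"},
}

submitted = client.submit_job(
    request={"project_id": project_id, "region": region, "job": job}
)
print("Submitted job:", submitted.reference.job_id)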
Example 1: Submit a PySpark job using the command line
gsutil cp gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py .
cat hello-world.py
#!/usr/bin/python
import pyspark
sc = pyspark.SparkContext()
rdd = sc.parallelize(['Hello,', 'world!'])
words = sorted(rdd.collect())
print(words)
gcloud dataproc jobs submit pyspark --cluster <my-dataproc-cluster> hello-world.py
Copying file:///tmp/hello-world.py [Content-Type=text/x-python]...
…
Job [dc1c28ac-c380-4d6c-a543-2a6ca43691eb] submitted.
Waiting for job output...
…
['Hello,', 'world!']
Job finished successfully.
…
Example 2: Submit a sample Spark job
gcloud dataproc jobs submit spark --cluster <my-dataproc-cluster> \
--class org.apache.spark.examples.SparkPi \
--jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000
Job [54825071-ae28-4c5b-85a5-58fae6a597d6] submitted.
Waiting for job output…
…
Pi is roughly 3.14177148
…
Job finished successfully.
…
To submit a sample Spark job, set the following arguments:
- the Cluster name on which the job will be run.
- the Job type to Spark.
- the Jar file that defines the job to be run: file:///usr/lib/spark/examples/jars/spark-examples.jar. The file:/// prefix denotes a Hadoop LocalFileSystem scheme; Cloud Dataproc installed /usr/lib/spark/examples/jars/spark-examples.jar on the cluster's master node when it created the cluster. Alternatively, you can specify a Cloud Storage path (gs://your-bucket/your-jarfile.jar) or a Hadoop Distributed File System path (hdfs://examples/example.jar) to one of your jars.
- the Main class to org.apache.spark.examples.SparkPi.
- the Job arguments to their desired value (1000 in the example above).
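These arguments map directly onto fields of the job in a jobs.submit request body. Below is a minimal sketch using the same Python client library as above; the project ID, region, and cluster name are placeholders, while the jar, main class, and argument come from the example.
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# The fields below correspond to the cluster name, job type, jar file,
# main class, and job arguments listed above.
job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}

client.submit_job(request={"project_id": project_id, "region": region, "job": job})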
You can view your job's driver output from the command line using the gcloud dataproc jobs wait command shown below:
gcloud dataproc --project=<your-project-id> jobs wait <your-job-id>
SSH into the master node using the Cloud Platform Console
Use the Cloud Platform Console to SSH into your cluster's master node (the default name for the master node is the cluster name followed by an -m suffix). After establishing an SSH connection to the master VM instance, you can access the Spark shell or HDFS console.
Restarting jobs on failure
By default, Cloud Dataproc jobs do not automatically restart on failure. You can optionally configure jobs to restart on failure; when you do, you specify the maximum number of failures per hour.
Restarting jobs mitigates common types of job failure, including out-of-memory issues and unexpected Google Compute Engine virtual machine reboots. Restartable jobs are especially useful for long-running and streaming jobs. For example, you can restart Spark streaming jobs running on Google Cloud Dataproc clusters to ensure that the streaming jobs are resilient.
Using restartable jobs
Any type of Cloud Dataproc job can be marked as restartable. This applies to jobs submitted through the Cloud Platform Console, the Google Cloud SDK gcloud command-line tool, or the Cloud Dataproc REST API. Jobs can be restarted no more than ten times per hour.
You can specify the maximum number of times a job can be restarted per hour when submitting the job using the gcloud command-line tool.
gcloud beta dataproc jobs submit <args> \
--max-failures-per-hour <number>
You can also set jobs to restart through the jobs.submit API: specify the number of times a job can be restarted on failure in the max_failures_per_hour field of the JobScheduling request parameter.
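For example, here is a minimal sketch of such a request using the google-cloud-dataproc Python client library; the project ID, region, cluster name, and script path are placeholders.
from google.cloud import dataproc_v1

project_id = "your-project-id"
region = "us-central1"
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-dataproc-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/your-job.py"},
    # Allow up to 5 driver restarts per hour on failure (the service
    # accepts at most 10 per hour).
    "scheduling": {"max_failures_per_hour": 5},
}

client.submit_job(request={"project_id": project_id, "region": region, "job": job})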
Semantics of restartable jobs
The following semantics apply to reporting the success or failure of jobs:
- A job is reported successful if the driver terminates with code 0.
- A job is reported failed if:
  - The driver terminates with a non-zero code more than 4 times in 10 minutes.
  - The driver terminates with a non-zero code and has exceeded the max_failures_per_hour setting.
- A job will be restarted if the driver exits with a non-zero code, is not thrashing, and is within the max_failures_per_hour setting.
Known limitations
The following limitations apply to restartable jobs on Cloud Dataproc.
- You must design your jobs to gracefully handle restarting. For example, if your job writes to a directory, it must account for the fact that the directory may already exist when the job is restarted (see the sketch after this list).
- Apache Spark streaming jobs that checkpoint can be restarted after failure, but these jobs will not report YARN status.
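For instance, a restart-friendly PySpark job can write its results with Spark's overwrite save mode so that a rerun does not fail because the output directory already exists. The following is a minimal sketch; the input and output paths are placeholders.
#!/usr/bin/python
# Sketch of a job designed to tolerate restarts: writing with mode("overwrite")
# keeps a rerun from failing if a previous attempt already created the output.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restartable-example").getOrCreate()

lines = spark.read.text("gs://your-bucket/input/")
counts = lines.groupBy("value").count()

# Overwrite rather than error out if the output directory already exists
# from an earlier, failed run of this job.
counts.write.mode("overwrite").parquet("gs://your-bucket/output/")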