Training models

In Cloud ML Engine, you develop your model using TensorFlow. You create a Python training application as you would when running TensorFlow locally. Cloud ML Engine can help to:

  • Train your model locally, mimicking the cloud-based process, in order to test your model and quickly iterate your design.
  • Run your trainer in the cloud with managed, scalable computing resources.
  • Run your trainer with distributed processing in the cloud to get results faster.
  • Take advantage of cloud resources to automatically optimize your model's settings using hyperparameter tuning.
  • Accelerate your training jobs for models with computationally intensive operations using graphics processing units (GPUs).

Training process

Here's an overview of the training process on Cloud ML Engine:

  1. You create a TensorFlow application that defines your computation graph and trains your model. You can write your training application for Cloud ML Engine the same way you would if you were running it locally in your development environment.
  2. You get your training and verification data into a source that Cloud ML Engine can access. This usually means putting it in Google Cloud Storage, Bigtable, or another Google Cloud Platform storage service associated with the same project that you're using for Cloud ML Engine. (A sketch of reading data directly from Cloud Storage follows this list.)
  3. When your application is ready to run, it must be packaged and staged to a Google Cloud Storage bucket that your project can access. This is automated when you use the gcloud command-line tool to run a training job.
  4. The Cloud ML Engine training service sets up resources for your job. It allocates one or more virtual machines, also called training instances, based on your job configuration. Each training instance is set up by:
    • Applying the standard machine image for the version of Cloud ML Engine your job uses.
    • Loading your trainer package and installing it with pip.
    • Installing any additional packages that you specify as dependencies.
  5. The service runs your training job, passing the command-line arguments you specified when you submitted it.
  6. You can get information about your running job in three ways:
    • In Stackdriver Logging.
    • By requesting job details or running log streaming with the gcloud command-line tool.
    • By programmatically making status requests to the training service.
  7. When your trainer succeeds or encounters an unrecoverable error, Cloud ML Engine halts all job processes and cleans up the resources.
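
As a concrete illustration of step 2, TensorFlow's file I/O can read gs:// URIs directly on Cloud ML Engine instances, so your trainer can consume data straight from Cloud Storage. The following is a minimal sketch (not part of the official documentation); the bucket path is illustrative:

import tensorflow as tf

# Count the records in a (hypothetical) CSV training file in Cloud Storage.
# tf.gfile understands gs:// paths when the job has access to the bucket.
with tf.gfile.Open('gs://bucket/path/to/train.csv') as f:
    num_records = sum(1 for _ in f)

print('training records:', num_records)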

Distributed training

If you run a distributed TensorFlow job with Cloud ML Engine, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify for each node. In accordance with the distributed TensorFlow model, each running trainer on a given node, also called a replica, is given a single role or task in distributed training (a sketch of how a replica discovers its role follows this list):

  • Exactly one replica is designated the master. This task manages the others and reports status for the job as a whole. As mentioned above, the training service runs until your trainer succeeds or encounters an unrecoverable error. In the distributed case, it is the status of the master replica that signals the overall job status.

    If you are running a single-process job, the sole replica is the master for the job.

  • One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your trainer.

  • One or more replicas may be designated as parameter servers. These replicas coordinate shared model state between the workers.
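
In your training code, a replica can discover its assigned role from the TF_CONFIG environment variable that the training service sets on each node. The following is a minimal sketch, assuming the usual task types master, worker, and ps:

import json
import os

# Cloud ML Engine sets TF_CONFIG on every training instance; it describes
# the cluster and this particular replica's task.
tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
task_type = tf_config.get('task', {}).get('type', 'master')

if task_type == 'master':
    pass  # run training, report status, export the final model
elif task_type == 'worker':
    pass  # do this replica's share of the training work
elif task_type == 'ps':
    pass  # run a parameter server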

Submitting training jobs

Before starting a training job, you need to configure its input parameters. You pass your parameters to the training service by setting the members of the Job resource in a JSON request string. The training parameters are defined in the TrainingInput object. There are two kinds of parameters that you provide when creating a training job: job configuration parameters, and training application parameters.

If you use the gcloud command-line tool to create training jobs, the most common training parameters are defined as flags of the gcloud ml-engine jobs submit training command, as follows.

gcloud ml-engine jobs submit training jobID \
    --scale-tier=STANDARD_1 \
    --packages=gs://bucket/path/to/package.tar.gz \
    --module-name=trainer.task \
    --job-dir=gs://bucket/path/to/dir \
    --region=us-central1 \
    -- \
    --my_first_arg=first_arg_value \
    --my_second_arg=second_arg_value
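
Equivalently, the same parameters can be expressed as members of the Job resource. The following is a rough sketch of the request body, written as a Python dict ready to be serialized to JSON; the job ID and bucket paths are the illustrative values from the command above:

# Sketch of the Job resource for the same training job. Field names follow
# the TrainingInput object; values here are illustrative.
training_inputs = {
    'scaleTier': 'STANDARD_1',
    'packageUris': ['gs://bucket/path/to/package.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--my_first_arg', 'first_arg_value',
             '--my_second_arg', 'second_arg_value'],
    'region': 'us-central1',
    'jobDir': 'gs://bucket/path/to/dir',
}

# This body could then be submitted to the training service's jobs.create
# REST method.
job_spec = {'jobId': 'jobID', 'trainingInput': training_inputs}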

Job configuration parameters

The Cloud ML Engine training service needs information to set up resources in the cloud and deploy your trainer application on each node in the processing cluster.

Job ID

You must give your training job a name following these rules:

  • It must be unique within your Google Cloud Platform project.
  • It may only contain mixed-case letters, digits, and underscores.
  • It must start with a letter.
  • It must be no more than 128 characters long.

It's a good idea to make your job IDs easy to distinguish from one another. A common technique is to define a base name for all jobs associated with a given model and then append a date/time string. This convention makes it easy to sort lists of jobs by name—all jobs for a model are grouped together in ascending order. For example, in BASH:

now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="census_$now"

Scale tier

You must tell Cloud ML Engine the number and type of machines to run your training job on. To make the process easier, you can pick from a set of predefined cluster specifications called scale tiers. The specific cluster configuration of each tier is not fixed: it may change as the availability of cloud resources changes over time. Instead, each scale tier is defined in terms of its suitability for certain types of jobs. Generally, the more advanced the tier, the more machines will be allocated to the cluster, and the more powerful the specifications of each virtual machine. As you increase the complexity of the scale tier, the hourly cost of training jobs, measured in ML training units, also increases. The scale tier definitions are listed below:

Scale tier Description
BASIC A single worker instance. This tier is suitable for learning how to use Cloud ML Engine and for experimenting with new models using small datasets.
STANDARD_1 Many workers and a few parameter servers.
PREMIUM_1 A large number of workers with many parameter servers.
BASIC_GPU A single worker instance with a GPU.
CUSTOM The CUSTOM tier is not a set tier; rather, it enables you to use your own cluster specification. You must at least set TrainingInput.masterType when choosing the CUSTOM tier.

Package URIs

Your trainer must be made into a Python package and copied to a Google Cloud Storage bucket before you can run it on Cloud ML Engine. You pass the URI of your package to the training service as an element of the package URI list. The URI of a Cloud Storage location takes this form:

gs://bucket_name/path/to/package.tar.gz

The training service installs each package (using pip install) on every virtual machine it allocates for your training job.

Python module

Your trainer package can contain multiple modules (Python files). You must identify the module that contains your application entry point. For example, if you create a package named my_trainer and your main module is called task.py, you specify that module with the name my_trainer.task.
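
As an illustration (not prescribed by Cloud ML Engine), such a package might be laid out and built with a short setup.py; the names follow the my_trainer/task.py example above:

# setup.py -- illustrative packaging script for a trainer laid out as:
#
#   my_trainer/
#       __init__.py
#       task.py      # entry point, run as the module my_trainer.task
#
# "python setup.py sdist" builds dist/my_trainer-0.1.tar.gz, which you can
# then copy to a Cloud Storage bucket.
from setuptools import find_packages, setup

setup(
    name='my_trainer',
    version='0.1',
    packages=find_packages(),
    description='Example Cloud ML Engine training application',
)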

Region

When you run a training job, you specify the region that you want it to run in. If you store your training dataset on Google Cloud Storage, you should run your training job in the same region as the bucket you're using. If you must run your job in a different region from your data bucket, your job may take longer.

Job output directory

Cloud ML Engine provides a mechanism for specifying the output directory for a job. You can set a job directory when you configure your job by passing the --job-dir command-line argument.
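
The value of --job-dir is passed through to your main module along with your other arguments, so the trainer can use it as its output location. A minimal sketch, assuming a TF 1.x Estimator-style trainer:

import argparse

import tensorflow as tf

parser = argparse.ArgumentParser()
# The training service forwards the --job-dir value to the trainer.
parser.add_argument('--job-dir', type=str, required=True)
args, _ = parser.parse_known_args()

# Use the job directory (typically a gs:// path) for checkpoints and exports.
run_config = tf.estimator.RunConfig(model_dir=args.job_dir)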

Runtime version

The runtime version dictates the versions of TensorFlow and other Python packages that are installed on your allocated training instances. Unless you have a compelling reason to, you should let the training service use its default version, which is always the latest stable version.

Training application parameters

You can send data to your trainer when it runs in the cloud by specifying command-line arguments for your main module. Assemble the list of arguments and include it in your training configuration.

The training service accepts the arguments as a list of strings with the following format:

['--my_first_arg', 'first_arg_value', '--my_second_arg', 'second_arg_value']
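
In the trainer itself, these values arrive as ordinary command-line flags of your main module and can be parsed with the standard library. A minimal sketch using the illustrative argument names above:

import argparse

# The training service appends the argument list to the command line of your
# main module, so standard argument parsing works the same in the cloud.
parser = argparse.ArgumentParser()
parser.add_argument('--my_first_arg', type=str)
parser.add_argument('--my_second_arg', type=str)
args, _ = parser.parse_known_args()

print(args.my_first_arg, args.my_second_arg)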

Monitoring training

For overall status, the easiest way to check on your job is the Machine Learning Jobs page in the Google Cloud Platform Console. You can get the same details programmatically and with the gcloud command-line tool. Use gcloud ml-engine jobs describe to get details about the current state of the job on the command line. You can get a list of jobs associated with your project, including job status and creation time, with gcloud ml-engine jobs list.

You can find charts of your job's aggregate CPU and memory utilization on the Job details page of the Console. You can get more detailed information about the online resources that your training jobs use with Stackdriver Monitoring. Cloud ML Engine exports two metrics to Stackdriver for each task (worker, parameter server, and master) in a job:

  • ml/training/memory/utilization shows the fraction of allocated memory that is currently in use.

  • ml/training/cpu/utilization shows the fraction of allocated CPU that is currently in use.

In addition, you can configure your trainer to save summary data that you can examine and visualize using TensorBoard.
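
A minimal sketch of saving such summary data (TF 1.x style); the summary directory is an illustrative path under the job directory:

import tensorflow as tf

summary_dir = 'gs://bucket/path/to/dir/summaries'  # illustrative path

# A toy graph with a scalar summary that TensorBoard can plot over time.
x = tf.Variable(5.0)
loss = tf.square(x)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter(summary_dir, sess.graph)
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        summary, _ = sess.run([merged, train_op])
        writer.add_summary(summary, step)
    writer.close()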

Accelerating the training process

Using a custom scale tier

When you use the CUSTOM scale tier, set values to configure your processing cluster according to the following guidelines (a configuration sketch follows the list):

  • You must set TrainingInput.masterType to specify the type of machine to use for your master node. This is the only required setting.
  • You may optionally set TrainingInput.workerCount to specify the number of workers to use. If you specify one or more workers, you must also set TrainingInput.workerType to specify the type of machine to use for your worker nodes.
  • You may set TrainingInput.parameterServerCount to specify the number of parameter servers to use. If you specify one or more parameter servers, you must also set TrainingInput.parameterServerType to specify the type of machine to use for your parameter servers.
  • You can specify a different machine type for the master worker, the parameter servers, and the workers, but you can't use different machine types for individual instances within a given type.
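
As a sketch of that configuration, the corresponding TrainingInput fields might look like this when expressed as a Python dict for the job's JSON request (the machine types and counts are illustrative):

# Illustrative TrainingInput settings for a CUSTOM scale tier.
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',        # required for CUSTOM
    'workerType': 'complex_model_m',
    'workerCount': 9,
    'parameterServerType': 'large_model',
    'parameterServerCount': 3,
    # ...plus the usual packageUris, pythonModule, region, and so on.
}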

The machine types are listed here for reference:

Machine type Description
standard A basic machine configuration suitable for training simple models with small to moderate datasets.
large_model A machine with a lot of memory, specially suited for parameter servers when your model is large (having many hidden layers or layers with very large numbers of nodes).
complex_model_s A machine suitable for the master and workers of the cluster when your model requires more computation than the standard machine can handle satisfactorily.
complex_model_m A machine with roughly twice the number of cores and roughly double the memory of complex_model_s.
complex_model_l A machine with roughly twice the number of cores and roughly double the memory of complex_model_m.
standard_gpu A machine equivalent to standard that also includes a GPU that you can use in your trainer.
complex_model_m_gpu A machine equivalent to complex_model_m that also includes four GPUs.
complex_model_l_gpu A machine equivalent to complex_model_l that also includes eight GPUs.

Even though the exact specifications of the machine types are subject to change at any time, you can compare them in terms of relative capability. The following table uses rough "t-shirt" sizing to describe the machine types.

Machine type          CPU   GPUs   Memory   ML units
standard              XS    -      M        1
large_model           S     -      XL       3
complex_model_s       S     -      S        2
complex_model_m       M     -      M        3
complex_model_l       L     -      L        6
standard_gpu          XS    1      M        3
complex_model_m_gpu   M     4      M        12
complex_model_l_gpu   L     8      L        24

Each increase in size constitutes roughly double capacity in the area being measured. Possible sizes are (in increasing order): XS, S, M, L, XL, XXL.

Using GPUs

Graphics Processing Units (GPUs) can significantly accelerate the training process for many deep learning models. To use GPUs in the cloud, configure your job to access GPU-enabled machines:

  • Set the scale tier to CUSTOM.
  • Configure each task (master, worker, or parameter server) to use one of the GPU-enabled machine types:
    • Use standard_gpu to give your task access to a single GPU.
    • Use complex_model_m_gpu to give your task access to four GPUs.
    • Use complex_model_l_gpu to give your task access to eight GPUs.

In addition, you need to run your job in a region that supports GPUs. The following regions currently provide access to GPUs:

  • us-east1
  • us-central1
  • asia-east1
  • europe-west1

To make use of the GPUs on a machine, make the appropriate changes to your TensorFlow trainer application:

  • High-level Estimator API: No code changes are necessary as long as your ClusterSpec is configured properly. If a cluster is a mixture of CPUs and GPUs, map the ps job name to the CPUs and the worker job name to the GPUs.

  • Core TensorFlow API: You must assign ops to run on GPU-enabled machines. This process is the same as using GPUs with TensorFlow locally. You can use tf.train.replica_device_setter to assign ops to devices, as sketched below. A standard_gpu machine's single GPU is identified as "/gpu:0". Machines with multiple GPUs use identifiers ranging from "/gpu:0" to "/gpu:n". For example, complex_model_m_gpu machines have four GPUs identified as "/gpu:0" through "/gpu:3".
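
The following is a minimal sketch of assigning ops to devices with the core TensorFlow API (TF 1.x); the cluster addresses are illustrative, since on Cloud ML Engine they come from the TF_CONFIG environment variable:

import tensorflow as tf

# Illustrative cluster; on Cloud ML Engine this comes from TF_CONFIG.
cluster = tf.train.ClusterSpec({
    'ps': ['ps-0:2222'],
    'worker': ['worker-0:2222', 'worker-1:2222'],
})

# replica_device_setter places variables on the parameter servers and
# other ops on the worker devices by default.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    weights = tf.Variable(tf.zeros([784, 10]), name='weights')

# Pin a compute-heavy op to the first GPU of a GPU-enabled machine.
with tf.device('/gpu:0'):
    inputs = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(inputs, weights)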

Note that the number of concurrent GPUs is limited to 10. If you need more processing capability, you can apply for a quota increase.
