Cloud Machine Learning Engine
Overview
Cloud Machine Learning Engine combines the managed infrastructure of Google Cloud Platform with the power and flexibility of TensorFlow. You can use it to train your machine learning models at scale, and to host trained models to make predictions about new data in the cloud.
Cloud ML Engine mainly does two things:
- Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
- Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.
End-to-end Cloud ML workflow
The prerequisite for getting started with Cloud ML Engine is your training application, written in TensorFlow. This application is responsible for defining your computation graph, managing the training and validation process, and exporting your model. Cloud ML Engine is designed to run your TensorFlow trainer with minimal alteration. However, there are a few required design choices and best practices that help your trainer work well with the Cloud ML Engine services.
- You must make your trainer into a Python package and stage it on Google Cloud Storage where your training job can access it. The Cloud ML Engine service can then access and install your training application on each training instance. This step is included in your training job request when you use the gcloud command-line tool, but is left to you if you call the training service programmatically.
- You must also have your training and validation data prepared for running your trainer. Your input data must be in a format that TensorFlow can process, or you need to account for any transformation in your application. Your data must be stored where Cloud ML Engine can access it. The easiest solution is to store it in a Google Cloud Storage bucket within the same project that you use for Cloud ML Engine.
- With your trainer package and your data prepared, you can begin to train your model using Cloud ML Engine. The training service allocates resources in the cloud according to specifications you include with your job request. While your trainer runs, it can write output to Cloud Storage locations. Typically, a trainer writes regular checkpoints during training and exports the trained model at the end of the job. In addition, Cloud ML Engine sends logging information to Stackdriver Logging.
- After training, you create a model resource and then deploy your trained model as a version of it. You specify an exported model (in the TensorFlow SavedModel format) and the model resource to assign the version to, and Cloud ML Engine hosts it so that you can run predictions on new data.
- Cloud ML Engine supports two kinds of prediction: online and batch. Online prediction is optimized for handling a high rate of requests with minimal latency, whereas batch prediction is optimized for getting inferences for large collections of data with minimal job duration.
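The training step above can be sketched as the request body you would send when calling the training service programmatically. This is a minimal sketch, not the definitive request format: the field names (jobId, trainingInput, scaleTier, packageUris, pythonModule, region, jobDir, args) follow the v1 REST API as commonly documented, and every value, including the bucket, the package file name, the `trainer.task` module, and the `--train-steps` trainer flag, is a hypothetical placeholder. Check the API reference before relying on any of them.

```python
def build_training_job(job_id, package_uri, python_module, region,
                       job_dir, scale_tier="BASIC", trainer_args=None):
    """Build the body of a training job request as a plain dict.

    Nothing is sent anywhere; the dict is only assembled. With the
    Google API client library it would typically be passed as the
    `body` of a jobs.create call (an assumption, not shown here).
    """
    return {
        "jobId": job_id,
        "trainingInput": {
            # Resources the service allocates for the job.
            "scaleTier": scale_tier,
            # Trainer package staged on Cloud Storage.
            "packageUris": [package_uri],
            # Module inside the package to run as the entry point.
            "pythonModule": python_module,
            "region": region,
            # Cloud Storage location for checkpoints and the exported model.
            "jobDir": job_dir,
            # Extra command-line arguments passed through to the trainer.
            "args": list(trainer_args or []),
        },
    }


# All values below are hypothetical placeholders.
job = build_training_job(
    job_id="housing_prices_train_001",
    package_uri="gs://example-bucket/packages/trainer-0.1.tar.gz",
    python_module="trainer.task",
    region="us-central1",
    job_dir="gs://example-bucket/jobs/housing_prices_train_001",
    trainer_args=["--train-steps", "1000"],
)
print(job["jobId"])
```

Keeping the request as a plain dictionary makes it easy to inspect or log before submission, whichever client you use to send it.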
Components and tools
The core of Cloud ML Engine is the REST API, a set of RESTful services that manages jobs, models, and versions, and makes predictions on hosted models on Google Cloud Platform. You could use the REST API directly, but you will most likely find it easier to use the client library for Python to access the APIs in your Python code, or to run Cloud ML Engine tasks at the command line using the gcloud command-line tool.
We recommend using gcloud ml-engine commands to train models and to manage models and versions. Beyond the functionality of the REST API, the gcloud tool includes additional utility commands. For example, gcloud ml-engine local train trains a model locally in a way that emulates a cloud job.
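As an illustration of the local training command, the sketch below assembles a gcloud ml-engine local train invocation as an argument list in Python. It assumes the `--module-name` and `--package-path` flags and the `--` separator for trainer arguments, as the gcloud reference describes them. The module name `trainer.task`, the `trainer/` path, and the `--train-steps` flag are hypothetical and belong to an imagined trainer, not to gcloud.

```python
import shlex


def local_train_command(module_name, package_path, trainer_args=()):
    """Return the argv list for a local training run.

    The command is only built here, not executed. To run it, pass the
    list to subprocess.run(cmd, check=True) on a machine with gcloud.
    """
    cmd = [
        "gcloud", "ml-engine", "local", "train",
        "--module-name", module_name,
        "--package-path", package_path,
    ]
    if trainer_args:
        # Everything after "--" is handed to the trainer unchanged.
        cmd.append("--")
        cmd.extend(trainer_args)
    return cmd


cmd = local_train_command("trainer.task", "trainer/", ["--train-steps", "100"])
# Print a shell-safe rendering of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when trainer arguments contain spaces.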
You can also manage your models, versions, and jobs from the Google Cloud Platform Console. In addition, Cloud Machine Learning Engine functionality has been integrated into Google Cloud Datalab to provide an interactive experience for experimenting with and building machine learning models using TensorFlow and Cloud ML Engine. Finally, your Cloud ML Engine resources are connected to useful tools, like Stackdriver Logging and Stackdriver Monitoring.
Cloud ML resources
There are three main resources that are exposed by the REST API: models, versions and jobs.
Model
In general, a model is the solution to a problem that you're trying to solve by learning from data. In Cloud ML Engine, the term has a more specific meaning: a model is a logical container for individual versions of a solution to a problem. For example, a generic problem to solve is predicting the sale price of houses given a set of data about previous sales. When working on a housing price prediction solution, you might create a model in Cloud ML Engine called housing_prices. You might try multiple machine learning techniques or implementations to solve your problem. At each stage, you can deploy versions of that model. Each version might be completely different from the others, but you can organize them under the same model if you think it best for your workflow.
Cloud ML Engine also uses the terms trained model and SavedModel. A trained model is the state of your TensorFlow computation graph and its settings after training. TensorFlow can serialize that information using its SavedModel format. Your trainer exports a trained model, which you can deploy for prediction in the cloud.
Version
A version is an instance of a machine learning solution stored in the Cloud ML Engine model service. You make a version by passing a serialized trained model (in the TensorFlow SavedModel format) to the service. A model must contain at least one version before you can run predictions on it.
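The relationship between models and versions shows up in how the REST API names them. The sketch below builds those resource names and a version-creation body. It assumes the `projects/.../models/.../versions/...` naming scheme and the `name` and `deploymentUri` fields of the v1 API; the project ID, version name, and Cloud Storage path are hypothetical.

```python
def model_name(project, model):
    """Full resource name of a model within a project."""
    return "projects/{}/models/{}".format(project, model)


def version_name(project, model, version):
    """Full resource name of a version; it is nested under its model."""
    return "{}/versions/{}".format(model_name(project, model), version)


def build_version_body(version, saved_model_location):
    """Body of a request that creates a version from an exported SavedModel."""
    return {
        "name": version,
        # Cloud Storage location that holds the exported SavedModel.
        "deploymentUri": saved_model_location,
    }


# Hypothetical project, version, and bucket names.
parent = model_name("my-project", "housing_prices")
body = build_version_body("v1", "gs://example-bucket/jobs/run1/export")
print(parent)
print(version_name("my-project", "housing_prices", "v1"))
```

Because a version name always extends its model's name, several quite different versions can sit under the same housing_prices model and still be addressed individually.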
Job
You interact with the services of Cloud ML Engine by initiating requests and jobs. Requests are regular Web API requests that return with a response object as quickly as possible. Jobs are long-running operations that are processed asynchronously. You submit a request to start the job and get a quick response that verifies the job status. Then you can request status periodically to track your job's progress.
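The submit-then-poll pattern described above can be sketched as a small loop. This is one possible shape under stated assumptions: `get_state` is a hypothetical callable that stands in for a status request to the service, and the state strings (QUEUED, PREPARING, RUNNING, SUCCEEDED, FAILED, CANCELLED) follow the job states listed in the v1 API reference.

```python
import time

# States after which a job will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30.0, max_polls=None, sleep=time.sleep):
    """Poll get_state() until the job reaches a terminal state.

    get_state: hypothetical callable returning the current state string.
    max_polls: optional cap on status requests; TimeoutError if exceeded.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    polls = 0
    while True:
        state = get_state()
        polls += 1
        if state in TERMINAL_STATES:
            return state
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("job still {} after {} polls".format(state, polls))
        sleep(poll_seconds)


# Simulated status sequence in place of real calls to the service.
states = iter(["QUEUED", "PREPARING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job(lambda: next(states), poll_seconds=0,
                           sleep=lambda seconds: None)
print(final_state)
```

Passing the status lookup in as a callable keeps the loop independent of how you reach the service, whether through the REST API, the Python client library, or a wrapper around gcloud.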