Prediction with Cloud ML Engine

Before you can run prediction on new data, you need to host your trained machine learning models. This section discusses model hosting and prediction and introduces considerations you should keep in mind for your projects.

The process to get set up to make predictions in the cloud is summarized below:

  1. You export your model in the SavedModel format as part of your training application (a minimal export sketch follows this list).

  2. You create a model resource in Cloud ML Engine and then create a model version from your SavedModel. (For batch prediction only, you can get inferences from a SavedModel that isn't deployed to Cloud ML Engine.)

  3. You format your input data for prediction and request either online prediction or batch prediction.

    1. When you use online prediction, the service runs your saved model and returns the requested predictions as the response message for the call. Your model version is deployed in the region you specified when you created the model. Although it is not guaranteed, a model version that you use regularly is generally kept ready to run.

    2. When you use batch prediction, the process is a little more involved. The prediction service allocates one or more prediction nodes to run your job. The service restores your TensorFlow graph on each allocated node. Then, the prediction service distributes your input data across the allocated nodes. Each node runs your graph and saves the predictions to a Cloud Storage location that you specify. When all of your input data is processed, the service shuts down your job and releases the resources it allocated for it.
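
To make step 1 concrete, the following is a minimal sketch of exporting a SavedModel from a trainer built with the TensorFlow 1.x Estimator API. The feature name, model architecture, and Cloud Storage paths are placeholders, not anything mandated by Cloud ML Engine.

```python
import tensorflow as tf  # TensorFlow 1.x

# Hypothetical feature: a single 4-element numeric vector named 'x'.
feature_columns = [tf.feature_column.numeric_column('x', shape=[4])]

estimator = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[16, 8],
    n_classes=3,
    model_dir='gs://your-bucket/model-dir')  # placeholder bucket

# ... estimator.train(input_fn=...) runs here as part of normal training ...

def serving_input_fn():
    # Accept the same feature at prediction time that the model was trained on.
    inputs = {'x': tf.placeholder(dtype=tf.float32, shape=[None, 4], name='x')}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# Writes a timestamped SavedModel directory under the export base path; that
# directory is what you later point Cloud ML Engine at when deploying.
export_dir = estimator.export_savedmodel(
    'gs://your-bucket/exports', serving_input_fn)
```

The timestamped directory written by export_savedmodel is what you later reference as the deployment URI when creating a model version.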

Deploying models

Cloud ML Engine can host your models so that you can get predictions from them in the cloud. The process of hosting a saved model is called deployment. The prediction service manages the infrastructure needed to run your model at scale, and makes it available for online and batch prediction requests. This section describes model deployment.

Models and versions

Cloud ML Engine organizes your trained models using resources called models and versions. A model is conceptually a solution to a given machine learning problem. In Cloud ML Engine, a model is essentially a container for actual implementations of the machine learning model, which are called versions. Developing a machine learning model is an iterative process. For that reason, the Cloud ML Engine resource paradigm is set up with the assumption that you'll be making multiple versions of each machine learning model.

The "model" that you deploy to Cloud ML Engine as a model version is a TensorFlow SavedModel. You export a SavedModel in your trainer. The versions you create for any given model resource are arbitrary; you can use the same model resource even if you completely change the machine-learning model between versions. A model is an organizational tool that you can use however it makes sense for your situation.

Every model with at least one version has a default version; the default is set when the first version is created. If you request predictions specifying just a model name, Cloud ML Engine uses the default version for that model. Note that the only time the service automatically sets the default version is when you create the very first one. You can manually make any subsequent version the default by calling projects.models.versions.setDefault, which is also exposed as gcloud ml-engine versions set-default and as an option in the Versions list on the Model details page in the Google Cloud Platform Console.
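
As a sketch only, the same operation can be issued from Python with the Google API client library; the project, model, and version names below are placeholders, and Application Default Credentials are assumed to be configured in the environment.

```python
from googleapiclient import discovery

# Build a client for the Cloud ML Engine REST API; Application Default
# Credentials are assumed to be available in the environment.
ml = discovery.build('ml', 'v1')

# Placeholder resource name: projects/PROJECT/models/MODEL/versions/VERSION.
version_name = 'projects/my-project/models/my_model/versions/v2'

# setDefault takes the full version resource name and an empty request body.
response = ml.projects().models().versions().setDefault(
    name=version_name, body={}).execute()
print(response['name'])
```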

The Cloud ML Engine quota policy sets a limit of 100 models per project and limits the total number of versions (combined between all models) to 200.

Model deployment parameters

Cloud ML Engine needs some information to create your model version. You also have some options you can configure. This section describes the parameters of both types. These parameters are defined in the Version object or added for convenience in the gcloud ml-engine versions create command.

Version name

A name for the new version that is unique among the names of other versions of the model.

Description

You can provide a description for your version.

At present the description is returned only when you get the version information with the API; neither the gcloud command-line tool nor the Google Cloud Platform Console displays the description.

Deployment URI

You must provide the URI of the Cloud Storage location where your SavedModel is stored. Cloud ML Engine pulls the model from this location and deploys it.

This parameter is called --origin in the gcloud ml-engine versions create command.

Runtime version

Cloud ML Engine uses the latest stable runtime version to deploy your model version unless you specify a different supported one. The runtime version primarily determines the version of TensorFlow that the prediction service uses to run your model.

When you run a batch prediction job you have the option of overriding the assigned runtime version. Online prediction always uses the runtime version set when the model version is deployed.

Manual scaling

You can specify the number of prediction nodes to keep running for your model version.

When you manually set the number of nodes to keep ready for your version, those nodes are considered to be constantly in use, even when not serving predictions. This means that you are charged the hourly rate for each node from the moment you create the version until you delete it. You can't change this value without deploying your model as a new version.

Staging bucket

If you are using the gcloud command-line tool to deploy your model, you can use a SavedModel on your local computer. The tool stages it in the Cloud Storage location you specify before deploying it to Cloud ML Engine.
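
To tie the deployment parameters together, here is a hedged sketch of creating a version through the REST API with the Python client library. The project, model, Cloud Storage path, and runtime version are placeholders; the field names follow the Version object described above.

```python
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')  # Application Default Credentials assumed

model_name = 'projects/my-project/models/my_model'  # placeholder model

# Fields correspond to the deployment parameters above.
version_body = {
    'name': 'v1',
    'description': 'First iteration of the model.',
    'deploymentUri': 'gs://your-bucket/exports/1521234567',  # SavedModel directory
    'runtimeVersion': '1.8',
    'manualScaling': {'nodes': 1},
}

# versions.create returns a long-running operation; the version becomes
# usable once that operation completes.
operation = ml.projects().models().versions().create(
    parent=model_name, body=version_body).execute()
print(operation['name'])
```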

Getting predictions

You can get predictions either in batch or online. In both cases, you pass new data instances to Cloud ML Engine to get predicted values. The following table shows the important differences between these two modes:

| Online prediction | Batch prediction |
| --- | --- |
| Optimized to minimize the latency of serving predictions. | Optimized to handle a high volume of instances in a job and to run more complex models. |
| Can process one or more instances per request. | Can process one or more instances per job. |
| Predictions returned in the response message. | Predictions written to output files in a Cloud Storage location that you specify. |
| Input data passed directly as a JSON string. | Input data passed indirectly as one or more URIs of files in Cloud Storage locations. |
| Returns as soon as possible. | Asynchronous request. |
| Anyone with Viewer access to the project can request. | Must be a project Editor to run. |
| Runs on the runtime version and in the region selected when you deploy the model. | Can run in any available region, using any available runtime version (though you should generally run with the defaults for the deployed model version). |
| Runs models deployed to Cloud ML Engine. | Runs models deployed to Cloud ML Engine or models stored in accessible Cloud Storage locations. |

The needs and cost requirements of your application dictate the type of prediction you should use. You should generally use online prediction when you are making requests in response to application input or in other situations where timely inference is needed. Batch prediction is ideal for processing accumulated data when you don't need immediate results, or when online prediction would be too expensive for your application.
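
For example, an online prediction request can be sent from Python with the API client library. This is a sketch rather than a complete application; the model name and instance fields are placeholders and must match the serving inputs of your SavedModel.

```python
from googleapiclient import discovery

# Client for the Cloud ML Engine REST API; Application Default Credentials
# are assumed to be available in the environment.
ml = discovery.build('ml', 'v1')

# Placeholder model resource; omitting /versions/... uses the default version.
name = 'projects/my-project/models/my_model'

# Each instance must match the inputs that your SavedModel expects.
instances = [
    {'x': [6.4, 3.2, 4.5, 1.5]},
    {'x': [5.1, 3.5, 1.4, 0.2]},
]

response = ml.projects().predict(
    name=name, body={'instances': instances}).execute()

if 'error' in response:
    raise RuntimeError(response['error'])

print(response['predictions'])
```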

Input data format

Online and batch prediction require different formats for the input data that you give to Cloud ML Engine. These formats are summarized in the following table:

| Prediction type and interface | Supported input format |
| --- | --- |
| Batch with API call | Text file with JSON instance strings or TFRecords file (may be compressed) |
| Batch with gcloud tool | Text file with JSON instance strings or TFRecords file (may be compressed) |
| Online with API call | JSON request message |
| Online with gcloud tool | Text file with JSON instance strings or CSV file |
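
As an illustration of how the batch formats in this table are referenced, the sketch below submits a batch prediction job through projects.jobs.create. All names, paths, and the region are placeholders.

```python
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')  # Application Default Credentials assumed

project = 'projects/my-project'  # placeholder project resource

job_body = {
    'jobId': 'my_batch_prediction_001',  # must be unique within the project
    'predictionInput': {
        # 'TEXT' for newline-delimited JSON instances; 'TF_RECORD' for TFRecords.
        'dataFormat': 'TEXT',
        'inputPaths': ['gs://your-bucket/inputs/instances.json'],
        'outputPath': 'gs://your-bucket/outputs/',
        'region': 'us-central1',
        # Alternatively, 'versionName' or a Cloud Storage 'uri' of a SavedModel.
        'modelName': 'projects/my-project/models/my_model',
    },
}

response = ml.projects().jobs().create(parent=project, body=job_body).execute()
print(response['state'])  # newly created jobs typically report QUEUED
```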

Binary data can't be formatted as the UTF-8 encoded strings that JSON supports. If you have binary data in your inputs, you must use base64 encoding to represent it. The following special formatting is required (a minimal sketch follows the list):

  • Your encoded string must be formatted as a JSON object with a single key named b64.

  • In your TensorFlow code, you must name the input and output aliases for binary data so that they end with '_bytes'.
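
A minimal sketch of both rules, assuming a hypothetical input alias named image_bytes and a local image file:

```python
import base64
import json

# Read raw binary data from a hypothetical local image file.
with open('cat.jpg', 'rb') as f:
    image_data = f.read()

# The input alias ends with '_bytes', and the encoded payload is wrapped in a
# JSON object whose only key is 'b64'.
instance = {
    'image_bytes': {'b64': base64.b64encode(image_data).decode('utf-8')},
}

# One JSON object per line is the expected layout for text-format input files.
print(json.dumps(instance))
```

Each line printed this way is one instance, suitable for the JSON input files or request messages described above.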
