Data Preparation
This page provides a simple overview of the steps required to prepare your data. Data preparation usually involves the following iterative steps:
- Gather data.
- Clean the data.
- Split the data.
- Engineer features.
- Preprocess the features.
Gather data
Finding data, especially data with the labels you need, can be a challenge. Your sources might vary significantly from one machine learning project to the next. If you merge data from different sources, or your data is entered in multiple places, take extra care in the next step, cleaning the data.
Clean the data
Cleaning data is the process of checking for integrity and consistency. At this stage you shouldn't be looking at the data overall for patterns. Instead, you clean data by column (attribute), looking for such anomalies as:
- Instances with missing features.
- Multiple ways of representing a feature, for example some instances listing a length measurement in inches and others listing it in centimeters. All instances of a given feature must use the same scale and follow the same format.
- Features with values far outside the typical range (outliers), which may be data-entry anomalies or other invalid data.
- Significant shifts in the data across time, geographic location, or other recognizable characteristics.
- Incorrect labels, or poorly defined labeling criteria.
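As a rough illustration of column-by-column cleaning, the sketch below checks a single hypothetical `length` attribute for missing values, converts mixed units to centimeters, and flags likely outliers. The records, field names, and the modified z-score rule (constant 0.6745, threshold 3.5) are assumptions made for this example, not requirements of any tool.

```python
import statistics

# Hypothetical raw records: lengths arrive in mixed units, one value is
# missing, and one looks like a data-entry error.
records = [
    {"id": 1, "length": 10.0, "unit": "in"},
    {"id": 2, "length": 25.4, "unit": "cm"},
    {"id": 3, "length": None, "unit": "cm"},
    {"id": 4, "length": 27.0, "unit": "cm"},
    {"id": 5, "length": 2600.0, "unit": "cm"},
    {"id": 6, "length": 11.0, "unit": "in"},
]

CM_PER_INCH = 2.54


def to_cm(record):
    """Return the length in centimeters, or None if the value is missing."""
    if record["length"] is None:
        return None
    if record["unit"] == "in":
        return record["length"] * CM_PER_INCH
    return record["length"]


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds the threshold. The rule is an assumed heuristic."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Instances with a missing feature.
missing_ids = [r["id"] for r in records if r["length"] is None]

# One consistent scale for every remaining instance.
lengths_cm = [to_cm(r) for r in records if r["length"] is not None]

# Candidates for manual review, not automatic deletion.
outliers = find_outliers(lengths_cm)
```

Flagged values are best treated as candidates for review: an outlier may be a genuine measurement rather than an error.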
Split the data
You need at least three subsets of data in a supervised learning scenario: training data, evaluation data, and test data.
_Training data_ is the data that you use to train your model. It is also the data that you analyze before developing the model.
_Evaluation data_ is what you use to check your model's performance during the training phase. Its primary use is to estimate how well your model generalizes to data beyond the training set, which makes it useful for comparing the predictive power of different models.
_Test data_ is used to test a model that is close to completion, usually after multiple training iterations. Never analyze or scrutinize your test data; keep it untouched until you need it to test your model.
Here are some important things to remember when you split your data:
- It is better to randomly sample the subsets from one big dataset than to use pre-divided data, such as instances from two distinct date ranges or data-collection systems. The latter approach has an increased risk of non-uniformity that can lead to overfitting.
- Ideally, assign each instance to a subset once and keep that assignment throughout the process.
- In general, you should have more training data than evaluation data, and more evaluation data than test data.
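One possible way to get a random-looking split that still keeps every instance in the same subset is to hash a stable instance ID. The sketch below assumes each instance has such an ID. The function name `assign_subset` and the 70/20/10 proportions are illustrative choices, not recommendations from this page.

```python
import hashlib


def assign_subset(instance_id, train=0.7, evaluation=0.2):
    """Map a stable instance ID to 'train', 'evaluation', or 'test'.

    Hashing the ID gives a pseudo-random but repeatable assignment, so an
    instance lands in the same subset every time the split is recomputed.
    """
    digest = hashlib.sha256(str(instance_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000.0  # value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + evaluation:
        return "evaluation"
    return "test"


# Count how many of 10,000 hypothetical IDs fall into each subset.
counts = {"train": 0, "evaluation": 0, "test": 0}
for instance_id in range(10000):
    counts[assign_subset(instance_id)] += 1
```

Because the assignment depends only on the ID, adding new instances later does not move existing instances between subsets.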
Engineer features
Before you develop your model, you should get acquainted with your training data. Look for patterns in your data, and think about what values could influence your target attribute. This process of deciding which data is important for your model is called feature engineering.
Feature engineering is not just about deciding which attributes you have in your raw data that you want in your model. The harder and often more important work is extracting generalizable, indicative features from specific data. That means combining the data you have with your knowledge about the problem space to get the data you really need. It can be a complex process, and doing it right depends on understanding the subject matter and the goals of your problem. Here are a couple of examples:
Example data: residential address
Data about people often includes a residential address, which is a complex string, often hard to make consistent, and not particularly useful for many applications on its own. You should usually extract a more meaningful feature from it. Here are some examples of things you could extract from an address:
- Longitude and latitude
- Neighborhood
- Closest elementary school
- Legislative district
- Relative position to a landmark
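As a sketch of the last item, the function below computes the great-circle distance between a home and a landmark using the haversine formula. It assumes the address has already been geocoded to latitude and longitude by some other means, and the coordinates shown are illustrative values only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in
    decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates for a residence and a landmark.
home = (47.6062, -122.3321)
landmark = (47.6205, -122.3493)

distance_to_landmark_km = haversine_km(*home, *landmark)
```

A single distance value like this is often easier for a model to use than the raw address string.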
Example data: timestamp
Another common item of data is a timestamp, which is usually a large numerical value indicating the amount of time elapsed since a common reference point. Here are some examples of things you might extract from a precise timestamp:
- Hour of the day
- Elapsed time since another event
- Time of day (morning, afternoon, evening, night)
- Whether some facility was open or closed at that time
- Frequency of an event (in combination with other instances)
- Position of the sun (in combination with latitude and longitude)
Here are some important things to note about the examples above:
- You can combine multiple attributes to make one generalizable feature. For example, address and timestamp can get you the position of the sun.
- You can use feature engineering to simplify data. For example, converting a timestamp to a time of day takes an attribute with seemingly countless values and reduces it to four categories.
- You can get useful features, and reduce the number of instances in your dataset, by engineering across instances. For example, use multiple instances to calculate the frequency of something.
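The timestamp examples can be sketched with the Python standard library. The code below assumes Unix timestamps in seconds, interprets them as UTC, and uses arbitrary cut-off hours for the four time-of-day categories. The event log and field names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def time_of_day(hour):
    """Bucket an hour (0-23) into one of four categories.
    The cut-off hours are assumptions chosen for illustration."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"


def timestamp_features(ts):
    """Derive simple features from a Unix timestamp, interpreted as UTC."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {"hour": dt.hour, "time_of_day": time_of_day(dt.hour)}


# Hypothetical event log: (user, Unix timestamp in seconds).
events = [("a", 0), ("a", 3600 * 13), ("b", 3600 * 20), ("a", 3600 * 23)]

# Simplify each timestamp to an hour and a four-value category.
features = [timestamp_features(ts) for _, ts in events]

# Engineer across instances: one event count per user.
events_per_user = Counter(user for user, _ in events)
```

In practice, the local time zone of each event usually matters for features like time of day, so the UTC assumption here would need revisiting.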
When you're done you'll have a list of features to include when training your model.
One of the most difficult parts of the process is deciding when you have the right set of features. It's sometimes difficult to know which features are likely to affect your prediction accuracy. Machine learning experts often stress that it's a field that requires flexibility and experimentation. You'll never get it perfect on the first try, so make your best guess and use the results to inform your next iteration.
Preprocessing data
So far this page has described generally applicable steps to take when getting your data ready to train your model. It hasn't mattered up to this point how your data is represented and formatted. Preprocessing is the next step: getting your prepared data into a format that works with the tools and techniques you use to train a model.
Data formats and Cloud ML Engine
Cloud ML Engine doesn't get involved in your data format; you can use whatever input format is convenient for your training application. That said, you'll need to have your input data in a format that TensorFlow can read.
The recommended format for TensorFlow, especially sparse vectors and binary data, is a TFRecords file containing `tf.train.Example` protocol buffers (which contain `Features` as a field). You write a little program that gets your data, stuffs it in an `Example` protocol buffer, serializes the protocol buffer to a string, and then writes the string to a TFRecords file using the `tf.python_io.TFRecordWriter`.
To read text files in comma-separated value (CSV) format, use a `tf.TextLineReader` with the `tf.decode_csv` operation.
You also need to have your data in a location that your Cloud ML Engine project can access. The simplest solution is often to use a CSV file in a Google Cloud Storage bucket that your Google Cloud project has access to.
Transforming data
There are many transformations that might be useful to perform on your raw feature data. Some of the more common ones are:
- Normalizing numerical values to be represented at a consistent scale (e.g. a range between -1 and 1 or between 0 and 1).
- Representing non-numeric data numerically, such as changing categorical features to index values or one-hot vectors.
- Changing raw text strings to a more compact representation, like a bag of words.
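The three transformations above can be sketched in plain Python. The function names are illustrative, and the bag-of-words tokenizer simply splits on whitespace, which is a simplifying assumption.

```python
from collections import Counter


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Map each category to an index, then each value to a one-hot vector."""
    vocab = {cat: i for i, cat in enumerate(sorted(set(values)))}
    vectors = [
        [1 if vocab[v] == i else 0 for i in range(len(vocab))]
        for v in values
    ]
    return vocab, vectors


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())


scaled = min_max_scale([10, 20, 30])
vocab, vectors = one_hot(["red", "green", "red"])
word_counts = bag_of_words("The cat saw the dog")
```

When you scale or index features, compute the minimum, maximum, and vocabulary from the training data and reuse them for the evaluation and test data, so that all three subsets are transformed the same way.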
Summary of data preparation and preprocessing
Cloud ML Engine doesn't impose specific requirements on your input data, leaving you to use whatever format works for your training application. Follow TensorFlow's data reading procedures.
- Use the raw data you have to get the data you need.
- Split your dataset into training, evaluation, and test subsets.
- Store your data in a location that your Cloud ML Engine project can access--a Cloud Storage bucket is often the easiest approach.
- Transform features to suit the operations you perform on them.