Loading data into BigQuery

In most cases, you must load your data into BigQuery before you can run queries. In some situations, you can query data from an external source without loading it. You can load data in the following ways:

  • Load directly from a readable data source.
  • Load from Google Cloud Storage.
  • Insert individual records using streaming inserts.

Loaded data can be written to a new table, appended to an existing table, or used to overwrite an existing table.

Before you can load data, you must set up billing and ensure that you have read access to the data source and write access to the destination table.

BigQuery supports loading data from several source formats, including CSV, JSON, Avro, and Google Cloud Datastore backup files. For details on formats, see the later sections on data format and schema auto-detection.

Data formats

BigQuery supports loading data in the following formats: CSV, JSON, Avro, and Cloud Datastore backup files.

Choose a format based on the following factors:

  • Your data's schema.

    CSV, JSON, and Avro all support flat data. JSON and Avro also support data with nested and repeated fields, which are useful for expressing hierarchical data. Nested and repeated fields also reduce duplication when denormalizing the data.

  • Embedded newlines.

    If your data contains embedded newlines, BigQuery can load the data much faster in JSON or Avro format.

  • External limitations.

    For example, your data might come from a document store database that natively stores data in JSON format. Or, your data might come from a source that only exports in CSV format.

CSV format

To change how BigQuery parses CSV data, specify additional CSV properties in the configuration.load object:

CSV data type               BigQuery property
Header rows                 skipLeadingRows
Newline characters          allowQuotedNewlines
Custom field delimiters     fieldDelimiter
Custom null values          nullMarker
Trailing optional columns   allowJaggedRows
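
For illustration, the following is a minimal sketch (written as a Python dict) of a configuration.load request body that sets several of these CSV properties. The bucket, project, dataset, and table names are hypothetical placeholders.

load_config = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://mybucket/data.csv"],   # hypothetical source file
            "destinationTable": {
                "projectId": "projectId",
                "datasetId": "datasetId",
                "tableId": "tableId"
            },
            "skipLeadingRows": 1,           # skip a single header row
            "fieldDelimiter": "|",          # custom field delimiter
            "nullMarker": "\\N",            # custom null value
            "allowQuotedNewlines": True,    # newline characters inside quoted fields
            "allowJaggedRows": True         # trailing optional columns may be missing
        }
    }
}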

Avro format

  • Avro is an open source data format that embeds the serialized data and the data's schema in the same file.
  • BigQuery does not support compressed Avro files. It does support compressed data blocks and the DEFLATE codec.
  • BigQuery converts Avro data types to the following BigQuery data types:

    • null: BigQuery ignores these values.
    • boolean: BOOLEAN
    • int, long: INTEGER
    • float, double: FLOAT
    • bytes: BYTES
    • string: STRING (UTF-8 only)
    • record: RECORD. Aliases are ignored, doc is ignored, default values are set at read time, and order is ignored.
    • enum: STRING. The string is the symbolic value of the enum; aliases are ignored and doc is ignored.
    • array: repeated fields. Arrays of arrays are not supported.
    • map<T>: RECORD. BigQuery converts an Avro map<T> field to a repeated RECORD that contains two fields: a key and a value. BigQuery stores the key as a STRING, and converts the value to its corresponding data type in BigQuery.
    • union: a nullable field, or a RECORD with a list of nullable fields. When the union has only one non-null type, it converts to a nullable field; otherwise, it converts to a RECORD with a list of nullable fields, and only one of those fields is set at read time.
    • fixed: BYTES. Aliases are ignored and size is ignored.

JSON format

Each JSON object, including any nested or repeated fields, must appear on its own line.

The following example shows sample nested/repeated data:

{"fullName": "John Doe", "age": 22, "citiesLived": [{ "place": "Seattle", "numberOfYears": 5}, {"place": "Stockholm", "numberOfYears": 6}]}
{"fullName": "Jane Austen", "age": 24, "citiesLived": [{"place": "Los Angeles", "numberOfYears": 2}, {"place": "Tokyo", "numberOfYears": 2}]}

Schema Auto-Detection in BigQuery

  • When you load data or query an external data source, you can enable schema auto-detection.
  • In this case, BigQuery infers the schema as follows: it selects a random file in the data source and scans up to 100 rows of data as a representative sample. BigQuery then examines each field and attempts to assign a data type based on the values in the sample.

  • To view the detected schema for a table, use the bq show CLI command or the web UI.

Currently, you cannot enable schema auto-detection for Google Sheets data sources using the web UI. Also, schema auto-detection is not used with Avro or Cloud Datastore backup files; when you load Avro or Cloud Datastore backup data, BigQuery automatically retrieves the schema from the source data.

  • When creating a table definition file using the API, set the value of the "autodetect" property to true or false. Setting autodetect to true enables auto-detection; setting autodetect to false disables it.

  • To enable schema auto-detection in the CLI when loading data, pass the --autodetect flag to the bq load command.

  • Using the CLI, you can also enable schema auto-detection when you create a table definition file for CSV, JSON, or Google Sheets data. Pass the --autodetect flag to the mkdef command to enable schema auto-detection, or pass the --noautodetect flag to disable it.
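
For example, the following commands (with hypothetical dataset, table, bucket, and file names) enable auto-detection when loading data and when creating a table definition file; the --source_format flag in the mkdef example is shown as an assumption about a typical invocation:

bq load --autodetect ds.new_tbl gs://mybucket/info.csv
bq mkdef --autodetect --source_format=CSV gs://mybucket/info.csv > info_def.json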

Auto-detection details

In addition to detecting schema details, auto-detection recognizes the following:

Compression

BigQuery recognizes gzip-compatible file compression when opening a file.

CSV Delimiter

BigQuery detects the following delimiters:

  • comma ( , )
  • pipe ( | )
  • tab ( \t )

CSV Header

BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.

CSV Quoted new lines

BigQuery detects quoted new line characters within a CSV field and does not interpret the quoted new line character as a row boundary.

Timestamps

BigQuery detects a wide array of timestamp formats, including, but not limited to:

  • yyyy-mm-dd in any order
  • yyyy-mm-dd hh:mm:ss
  • yyyy-mm-dd hh:mm:ss.sss

Other timestamp details:

  • Date separators can be "-", "/", or "."
  • A time zone can be appended using an offset or name

Preparing data for loading

Data encoding

  • BigQuery supports UTF-8 encoding for both nested or repeated and flat data, and supports ISO-8859-1 encoding for flat data.
  • If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.

  • Delimiters must be encoded as ISO-8859-1.

  • If BigQuery cannot convert a character, it is replaced with the standard Unicode replacement character: �.

  • JSON files must always be encoded in UTF-8.
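
For example, to load flat ISO-8859-1 encoded CSV data with the CLI, you can state the encoding explicitly. This sketch assumes the bq load command's --encoding flag and uses hypothetical dataset, table, bucket, and schema names:

bq load --encoding=ISO-8859-1 ds.latin1_tbl gs://mybucket/latin1.csv ./schema.json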

Data compression

  • BigQuery can load uncompressed files significantly faster than compressed ones due to parallel load operations.
  • Because uncompressed files are larger, using them can lead to bandwidth limitations and higher Google Cloud Storage costs for data staged in Google Cloud Storage prior to being loaded into BigQuery.
  • In general, if bandwidth is limited, compress your files with gzip before uploading them to Google Cloud Storage. Currently, gzip is the only compression type supported when loading data into BigQuery.
  • If loading speed is important to your app and you have a lot of bandwidth to load your data, leave your files uncompressed.

Denormalized data schemas

  • Many developers are accustomed to working with relational databases and normalized data schemas.
  • Normalization prevents duplicate data from being stored, and it provides an important consistency benefit when the data is updated regularly.
  • In BigQuery, you typically should denormalize the data structure in order to enable super-fast querying. While JOINs on small datasets are possible with BigQuery, they're not as performant as a denormalized structure.

  • BigQuery allows you to use repeated and nested fields to avoid duplicating data while retaining the high performance characteristics of a flat schema.
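
For illustration, the citiesLived example shown earlier could be described with a repeated RECORD field. The following is a sketch of such a schema, expressed as a Python dict in the shape of the API's schema.fields property:

schema = {
    "fields": [
        {"name": "fullName", "type": "STRING"},
        {"name": "age", "type": "INTEGER"},
        # A nested, repeated field: one row can hold many citiesLived records.
        {"name": "citiesLived", "type": "RECORD", "mode": "REPEATED",
         "fields": [
             {"name": "place", "type": "STRING"},
             {"name": "numberOfYears", "type": "INTEGER"}
         ]}
    ]
}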

Loading from Cloud Storage

BigQuery supports loading data from all Cloud Storage storage classes. The following examples show how to load from Cloud Storage using different interfaces.

Command-line

Use the bq load command and include a Cloud Storage URI for the source argument:

bq load [DATASET].[TABLE_NAME] [PATH_TO_SOURCE] [SCHEMA]

where:

  • [DATASET].[TABLE_NAME] is a fully qualified table name, where [DATASET] represents an existing dataset.
  • [PATH_TO_SOURCE] is a fully qualified Cloud Storage URI.
  • [SCHEMA] is a valid schema. The schema can be a local file, or it can be typed inline as part of the command.

The following examples all load data from Cloud Storage:

bq load ds.new_tbl gs://mybucket/info.csv ./info_schema.json
bq load ds.small gs://mybucket/small.csv name:integer,value:string
bq load ds.small gs://mybucket/small.csv field1,field2,field3

  • BigQuery does not support source URIs that include multiple consecutive slashes after the initial double slash.

  • If your Google Cloud Storage data is separated into multiple files that share a common base-name, you can use a wildcard in the URI when you load the data. Simply append an asterisk (*) to the base-name.

  • The wildcard can appear inside the object name or at the end of the object name. Appending a wildcard to the bucket name is unsupported.
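
For example, if several files share the hypothetical base-name gs://mybucket/fed-sample, a single command can load all of them:

bq load ds.new_tbl gs://mybucket/fed-sample* ./info_schema.json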

Java

Job job = table.load(FormatOptions.csv(), sourceUri);
// Wait for the job to complete
try {
  Job completedJob = job.waitFor(WaitForOption.checkEvery(1, TimeUnit.SECONDS),
      WaitForOption.timeout(3, TimeUnit.MINUTES));
  if (completedJob != null && completedJob.getStatus().getError() == null) {
    // Job completed successfully
  } else {
    // Handle error case
  }
} catch (InterruptedException | TimeoutException e) {
  // Handle an interrupted wait or a timeout
}

Python

import time
import uuid

from google.cloud import bigquery


def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))


def wait_for_job(job):
    while True:
        job.reload()
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
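
A hypothetical invocation of this function (the dataset, table, and bucket names are placeholders) might look like:

load_data_from_gcs('my_dataset', 'my_table', 'gs://mybucket/data.csv')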

Loading data with a POST request

The BigQuery REST API allows you to load data using a POST request.

  • You can make upload requests in any of the following ways. Specify the method you are using with the uploadType request parameter.

    • Multipart upload: uploadType=multipart is useful for quick transfer of smaller files and metadata; it transfers the file along with the metadata in a single request.
    • Resumable upload: uploadType=resumable is for reliable transfer of larger files. With this method, you use a session-initiating request, which can optionally include metadata. This is a good strategy for most applications, since it also works for smaller files at the cost of one additional HTTP request per upload.
  • The API allows you to upload certain types of binary data, or media. When uploading media, you use a special URI. In fact, methods that support media uploads have two URI endpoints:

    • The /upload URI, for the media. The upload endpoint is the standard resource URI with an “/upload” prefix. Use this URI when transferring the media data itself.
      POST /upload/bigquery/v2/projects/<projectId>/jobs
      
    • The standard resource URI, for the metadata. If the resource contains any data fields, those fields are used to store metadata describing the uploaded file. You can use this URI when creating or updating metadata values.
      POST /bigquery/v2/projects/<projectId>/jobs

Multi-part upload

If you have metadata that you want to send along with the data to upload, you can make a single multipart/related request. This is a good choice if the data is small enough to upload again in its entirety if the connection fails.

To use multipart upload, make a POST request to the method's /upload URI and add the query parameter uploadType=multipart:

POST https://www.googleapis.com/upload/bigquery/v2/projects/<projectId>/jobs?uploadType=multipart

The top-level HTTP headers to use when making a multipart upload request include:

  • Content-Type is set to multipart/related and includes the boundary string you're using to identify the parts of the request.
  • Content-Length is set to the total number of bytes in the request body. The media portion of the request must be less than the maximum file size specified for this method.

The example below shows a multipart upload request for the Google BigQuery API.

POST /upload/bigquery/v2/projects/<projectId>/jobs?uploadType=multipart HTTP/1.1
Host: www.googleapis.com
Authorization: Bearer your_auth_token
Content-Type: multipart/related; boundary=foo_bar_baz
Content-Length: number_of_bytes_in_entire_request_body

--foo_bar_baz
Content-Type: application/json; charset=UTF-8

{
  "configuration": {
    "load": {
      "sourceFormat": "<required for JSON files>",
      "schema": {
        "fields": [
          {"name": "f1", "type": "STRING"},
          {"name": "f2", "type": "INTEGER"}
        ]
      },
      "destinationTable": {
        "projectId": "projectId",
        "datasetId": "datasetId",
        "tableId": "tableId"
      }
    }
  }
}


--foo_bar_baz
Content-Type: */*

CSV or JSON data
--foo_bar_baz--

If the request succeeds, the server returns the HTTP 200 OK status code along with any metadata:

HTTP/1.1 200
Content-Type: application/json

{
  "configuration": {
    "load": {
      "sourceFormat": "<required for JSON files>",
      "schema": {
        "fields": [
          {"name": "f1", "type": "STRING"},
          {"name": "f2", "type": "INTEGER"}
        ]
      },
      "destinationTable": {
        "projectId": "projectId",
        "datasetId": "datasetId",
        "tableId": "tableId"
      }
    }
  }
}
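
The same request can be assembled programmatically. Below is a minimal sketch using Python's requests library; the helper name, the application/octet-stream media type for the data part, and the access-token handling are illustrative assumptions rather than part of the BigQuery API. The load_config argument would contain the same configuration.load fields shown in the JSON metadata part above.

import json

import requests


def multipart_load(project_id, access_token, load_config, data_bytes,
                   boundary="foo_bar_baz"):
    # Build the multipart/related body: a JSON metadata part, then the media part.
    body = (
        "--{b}\r\n"
        "Content-Type: application/json; charset=UTF-8\r\n\r\n"
        "{meta}\r\n"
        "--{b}\r\n"
        "Content-Type: application/octet-stream\r\n\r\n"
    ).format(b=boundary, meta=json.dumps(load_config)).encode("utf-8")
    body += data_bytes + "\r\n--{b}--".format(b=boundary).encode("utf-8")

    url = ("https://www.googleapis.com/upload/bigquery/v2/projects/"
           "{}/jobs?uploadType=multipart".format(project_id))
    headers = {
        "Authorization": "Bearer " + access_token,
        # requests sets Content-Length automatically from the body size.
        "Content-Type": "multipart/related; boundary=" + boundary,
    }
    return requests.post(url, headers=headers, data=body)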

Resumable upload

To upload data files more reliably, you can use the resumable upload protocol. This protocol allows you to resume an upload operation after a network failure has interrupted the upload. It is very useful if you are loading large files. It can also reduce your bandwidth usage in the event of network failures because you don't have to restart large file uploads from the beginning.

The steps for using resumable upload include:

  1. Start a resumable session. Make an initial request to the upload URI that includes the metadata, if any.

    • POST https://www.googleapis.com/upload/bigquery/v2/projects/<projectId>/jobs?uploadType=resumable
    • For this first request, the body is either empty or contains only the metadata; you'll transfer the actual file contents in subsequent requests.

    • Set the following HTTP headers with the initial request: X-Upload-Content-Type (the media MIME type of the upload data) and X-Upload-Content-Length (the number of bytes of upload data), in addition to the Content-Type and Content-Length of the initiating request itself.

  2. Save the resumable session URI. Save the session URI returned in the response of the initial request; you'll use it for the remaining requests in this session.

    • If the session initiation request succeeds, the API server responds with a 200.

    • In addition, it provides a Location header that specifies your resumable session URI, which includes an upload_id query parameter that gives the unique upload ID for this session.

  3. Upload the file. Send the media file to the resumable session URI.

    • To upload the file, send PUT requests to the upload URI from the previous step.

    • The format of the upload request is:

    PUT https://www.googleapis.com/upload/bigquery/v2/projects/<projectId>/jobs?uploadType=resumable&upload_id=xa298sd_sdlkj2 HTTP/1.1
    
    Content-Length: 2000000
    Content-Type: */*
    
    bytes 0-1999999
    

If an upload request is terminated before you receive a response, or if you receive an HTTP 503 Service Unavailable response from the server, you need to resume the interrupted upload. To do this:

  1. Request status.
    Query the current status of the upload by issuing an empty PUT request to the upload URI. For this request, the HTTP headers should include a Content-Range header indicating that the current position in the file is unknown. For example, set the Content-Range to */2000000 if your total file length is 2,000,000. If you don't know the full size of the file, set the Content-Range to */*.
    Note: You can request the status between chunks, not just if the upload is interrupted. This is useful, for example, if you want to show upload progress indications for legacy browsers.

  2. Get number of bytes uploaded.
    Process the response from the status query. The server uses the Range header in its response to specify which bytes it has received so far. For example, a Range header of 0-299999 indicates that the first 300,000 bytes of the file have been received.

  3. Upload remaining data.
    Finally, now that you know where to resume the request, send the remaining data or the current chunk. Note that you need to treat the remaining data as a separate chunk in either case, so you need to send the Content-Range header when you resume the upload.
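
As a rough sketch of the three steps above, the following Python function (using the requests library; the helper name, the single-chunk resume, and the status-code handling are simplifying assumptions) queries the upload status and then sends the remaining bytes:

import requests


def resume_upload(session_uri, file_path, total_size):
    # Step 1: ask the server how much of the file it has received so far.
    status = requests.put(
        session_uri,
        headers={"Content-Range": "bytes */{}".format(total_size)})
    if status.status_code in (200, 201):
        return status  # The upload already completed.

    # Step 2: the Range header reports the bytes stored so far,
    # e.g. "bytes=0-299999" means the first 300,000 bytes were received.
    range_header = status.headers.get("Range", "")
    next_byte = int(range_header.rsplit("-", 1)[-1]) + 1 if range_header else 0

    # Step 3: send the remaining data with a Content-Range header.
    with open(file_path, "rb") as source:
        source.seek(next_byte)
        remaining = source.read()
    return requests.put(
        session_uri,
        headers={"Content-Range": "bytes {}-{}/{}".format(
            next_byte, total_size - 1, total_size)},
        data=remaining)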

Best practices

When uploading media, it is helpful to be aware of some best practices related to error handling.

  • Resume or retry uploads that fail due to connection interruptions or any 5xx errors, including:
    • 500 Internal Server Error
    • 502 Bad Gateway
    • 503 Service Unavailable
    • 504 Gateway Timeout
  • Use an exponential backoff strategy if any 5xx server error is returned when resuming or retrying upload requests. Exponential backoff can help alleviate server problems during periods of high request volume or heavy network traffic.
  • Other kinds of requests should not be retried with exponential backoff, though you can still retry a limited number of them. When retrying these requests, limit the number of retries; for example, your code could allow up to ten retries before reporting an error.
  • Handle 404 Not Found errors when doing resumable uploads by restarting the entire upload from the beginning.
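
A minimal sketch of exponential backoff in Python follows; the helper name and the retry limit are illustrative, and request_fn stands for whatever callable issues the upload request and returns a response object:

import random
import time


def retry_with_backoff(request_fn, max_retries=5):
    # Retry on 5xx responses, waiting 2**attempt seconds plus random jitter.
    for attempt in range(max_retries):
        response = request_fn()
        if response.status_code < 500:
            return response
        time.sleep((2 ** attempt) + random.random())
    return response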

Client code samples:

Java:

TableId tableId = TableId.of(datasetName, tableName);
WriteChannelConfiguration writeChannelConfiguration =
    WriteChannelConfiguration.newBuilder(tableId)
        .setFormatOptions(FormatOptions.csv())
        .build();
TableDataWriteChannel writer = bigquery.writer(writeChannelConfiguration);
// Write data to writer
try (OutputStream stream = Channels.newOutputStream(writer)) {
  Files.copy(csvPath, stream);
}
// Get load job
Job job = writer.getJob();
job = job.waitFor();
LoadStatistics stats = job.getStatistics();
return stats.getOutputRows();

Python:

from google.cloud import bigquery


def load_data_from_file(dataset_name, table_name, source_file_name):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)

    # Reload the table to get the schema.
    table.reload()

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job = table.upload_from_file(
            source_file, source_format='text/csv')

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

Loading from Cloud Datastore backup

BigQuery supports loading data from Google Cloud Datastore backups. In Cloud Datastore, you can back up each entity type, also known as a kind, into a set of backup files. You can then load the backup into a BigQuery table. You can control which properties BigQuery loads by setting the projectionFields property. If you do not want to load the data, you can query the backup directly by setting it up as an external data source.

Datastore type conversion

BigQuery converts data from each entity in Cloud Datastore backup files to BigQuery's data types. The following table describes the conversion between data types.

Cloud Datastore data type   BigQuery data type
Blob                        BigQuery discards these values when loading the data.
Blobstore key               STRING
Boolean                     BOOLEAN
Category                    STRING
Datastore key               RECORD
Date and time               TIMESTAMP
Email                       STRING
Embedded entity             RECORD
Floating-point number       DOUBLE
Geographical point          RECORD [{"lat","DOUBLE"}, {"long","DOUBLE"}]
IM handle                   STRING
Integer                     INTEGER
Link                        STRING
Phone number                STRING
Postal address              STRING
Rating                      INTEGER
Short blob                  BigQuery discards these values when loading the data.
String                      STRING (truncated to 64 KB)
User                        RECORD [{"email","STRING"}, {"userid","STRING"}]

Datastore key properties

Each entity in Cloud Datastore has a unique key that contains information such as the namespace and the path. BigQuery creates a RECORD data type for the key, with nested fields for each piece of information, as described below.

  • __key__.app (STRING): The Cloud Datastore app name.
  • __key__.id (INTEGER): The entity's ID, or null if __key__.name is set.
  • __key__.kind (STRING): The entity's kind.
  • __key__.name (STRING): The entity's name, or null if __key__.id is set.
  • __key__.namespace (STRING): If the Cloud Datastore app uses a custom namespace, the entity's namespace; otherwise, the default namespace is represented by an empty string.
  • __key__.path (STRING): The flattened ancestral path of the entity, consisting of the sequence of kind-identifier pairs from the root entity to the entity itself. For example: "Country", "USA", "PostalCode", 10011, "Route", 1234.

To load a Cloud Datastore backup using the API or the CLI, set the following properties:

  • Set sourceFormat to DATASTORE_BACKUP.
  • Set sourceUris to the full path of the Datastore backup file, which ends with <kindname>.backup_info; the full format is gs://bucket_name/<Datastore backup file>.
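
For example, a CLI load of a Datastore backup (with hypothetical dataset, table, bucket, and backup path names) might look like the following; no schema argument is needed because BigQuery retrieves the schema from the backup itself:

bq load --source_format=DATASTORE_BACKUP ds.datastore_tbl gs://bucket_name/backup/<kindname>.backup_info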

Limits and quotas

Quotas for load jobs

The following quotas apply for loading data into BigQuery.

  • Daily limit: 1,000 load jobs per table per day, and 50,000 load jobs per project per day. These numbers include failures.
  • Row and cell size limits:
    • CSV: 2 MB (row and cell size)
    • JSON: 2 MB (row size)
    • Avro: 16 MB (block size)
  • Maximum columns per table: 10,000
  • Maximum file sizes:
    • CSV (compressed): 4 GB
    • CSV (uncompressed): 4 GB with quoted new-lines in values; 5 TB without new-lines in values
    • JSON (compressed): 4 GB
    • JSON (uncompressed): 5 TB
    • Avro: compressed Avro files are not supported, but compressed data blocks are (DEFLATE codec); uncompressed, 5 TB (1 MB for the file header)
  • Maximum size per load job: 15 TB across all input files for CSV, JSON, and Avro
  • Maximum number of files per load job: 10,000 including all files matching a wildcard
  • Load job execution time limit: 6 hours
  • API requests per second, per user: If you make more than 100 requests per second, throttling might occur. This limit does not apply to streaming inserts.
