Partitioned tables

This section introduces partitioned tables and explain their benefits and use cases. By dividing a large table into smaller partitions, you can improve query performance and reduce your bill by limiting the amount of data scanned.

In the past, users would have to divide a dataset into daily tables to help reduce the amount of data scanned when querying a specific date range. For example, if you have a a year's worth of data in a single table, a query that involves the last seven days of data still requires a full scan of the entire table to determine which data to return. However, if your table is divided into daily tables, you can restrict the query to the seven most recent daily tables.

Daily tables, however, have several disadvantages. You must manually, or programmatically, create the daily tables. SQL queries are often more complex because your data can be spread across hundreds of tables. Performance degrades as the number of referenced tables increases. There is also a limit of 1,000 tables that can be referenced in a single query.

BigQuery enables you to create date-partitioned tables, where each table represents data loaded on a particular date. Partitioned tables have none of these disadvantages. A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. In BigQuery, date-partitioned tables can have up to 2,000 partitions. You can use either legacy SQL or standard SQL to query partitioned tables.

Schema of partitions

A partition does not have a schema by itself. All partitions in a partitioned table share the schema of the table. If you need to load data into a partition with a schema that is not the same as the schema of the table, you need to update the schema of the table before loading the data.

Creating and updating date-partitioned tables

To create a partitioned table, you must declare the table as partitioned at creation time. You do not need to specify a schema, as the schema can be specified when data is subsequently loaded or copied into the table.

Using the Command-line:

Use the bq mk command with the --time_partitioning_type flag. For example, the following command creates a partitioned table named table1 in the mydataset dataset:

bq mk --time_partitioning_type=DAY mydataset.table1

Using API:

Specify the partitioning configuration in the Tables::insert request. For example, the following configuration settings create a partitioned table named table1 in the mydataset dataset in the myProject project:

{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table1",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY"
  }
}

Setting the expiration of a partition

To create a partitioned table and specify an expiration time for data in partitions.

Using the Command-line:

Use the time_partitioning_expirationflag. For example, the following command creates a partitioned table that keeps data in a partition for three days, or 259,200 seconds, before the data expires:

bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2

Using API:

Use the expirationMsconfiguration setting. For example, the following configuration settings create a partitioned table that keeps data in a partition for three days, or 259,200,000 milliseconds, before the data expires::

{
  "tableReference": {
    "projectId": "myProject",
    "tableId": "table2",
    "datasetId": "mydataset"
  },
  "timePartitioning": {
    "type": "DAY",
    "expirationMs": 259200000
  }
}

Querying date-partitioned tables

Partitioned tables include a pseudo column named_PARTITIONTIMEthat contains a date-based timestamp for data loaded into the table. The timestamp is based on UTC time and represents the number of microseconds since the unix epoch. For example, if data is appended to a table on April 15, 2016, all of the rows of data appended on that day contain the valueTIMESTAMP("2016-04-15")in the_PARTITIONTIMEcolumn.

To select the value of_PARTITIONTIME, you must use an alias. For example, the following query selects_PARTITIONTIMEby assigning the aliasptto the pseudo column:

SELECT
  _PARTITIONTIME AS pt, field1
FROM
 mydataset.table1

Partitioning pitfalls

Use the_PARTITIONTIMEpseudo column to limit the number of partitions scanned during a query. For example, the following query scans only the partitions between the dates January 1, 2016 and January 2, 2016 from the partitioned table namedtemps:

# Query works in legacy and standard SQL
SELECT
  temp
FROM
  mydataset.temps
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01')
  AND TIMESTAMP('2016-01-02');

The following examples use the_PARTITIONTIMEpseudo column and scan all the partitions in a table.

In legacy SQL, the_PARTITIONTIMEfilter works only when the filter is specified as closely as possible to the table name. For example, the following query scans all partitions intable1despite the presence of the_PARTITIONTIMEfilter:

# Scans all partitions on t1
SELECT
  t1.field1,
  t2.field1
FROM
  mydataset.table2 t1
CROSS JOIN
  mydataset.table2 t2
WHERE
  t1._PARTITIONTIME = TIMESTAMP('2016-03-28')
  AND t1.field2 = "one"

To effectively use the _PARTITIONTIMEfilter in legacy SQL to limit the number of partitions scanned, use a subquery on

table1with the filter pushed into the subquery, as shown in the following query.

# Scans only the specified partition
SELECT
  t1.field1,
  t2.field1
FROM (
  SELECT
    field1,
    field2
  FROM
    mydataset.table1
  WHERE
    _PARTITIONTIME = TIMESTAMP('2016-03-28')) t1
CROSS JOIN
  mydataset.table2 t2
WHERE
  t1.field2 = "one"

Do not include any other columns in the _PARTITIONTIMEfilter. For example, the following query does not limit the partitions scanned because field2is a column in the table and BigQuery cannot determine in advance which partitions to select.

# Scans all partitions of table2
SELECT
  field1
FROM
  mydataset.table2
WHERE
  _PARTITIONTIME + field2 = TIMESTAMP('2016-03-28');

Filters on _PARTITIONTIMEthat include subqueries cannot be used to limit the number of partitions scanned for a partitioned table. For example, the following query does not limit the partitions scanned on table mydataset.table2

# Scans all partitions of table2
SELECT
  field1
FROM
  mydataset.table2
WHERE
  _PARTITIONTIME = (SELECT MAX(timestamp) FROM mydataset.table1)

The __UNPARTITIONED__ partition

The__UNPARTITIONED__partition temporarily holds data that is streamed to a partitioned table while it is in the streaming buffer. Data that is streamed directly to a specific partition of a partitioned table does not use the__UNPARTITIONED__partition. Instead, the data is streamed directly to the partition. See streaming into partitioned tables. Data in the streaming bufferhas a NULL value for the_PARTITIONTIMEcolumn.

To query data in the__UNPARTITIONED__partition, use the_PARTITIONTIMEpseudo column with the NULL value.

SELECT
  field1
FROM
  [DATASET_ID].[PARTITIONED_TABLE]
WHERE
  _PARTITIONTIME IS NULL

Where[DATASET_ID]is the name of the dataset that stores the table and[PARTITIONED_TABLE]is the name of the partitioned table.

Addressing table partitions

To make it easier to reference data from a specific partition, BigQuery provides partition decorators. Partition decorators take the form:

[TABLE_NAME]$YYYYMMDD

where [TABLE_NAME] is the name of a partitioned table, andYYYYMMDDrepresents a date. For example, the partition for May 19, 2016 in a table namedmydataset.tablecan be referenced using mydataset.table$20160519.

The partition decorator separator ($) is a special variable in the unix shell, so you might have to escape the decorator when using the command-line tool.

Copying, updating and deleting partitions

Copying partitioned tables

  • If you copy a partitioned table to a new table, all of the partitioning information is copied with the table. The new table and the old table will have identical partitions.

  • If you want to copy a partitioned table into another partitioned table, the partition specifications for the source and destination tables must match.

  • If you copy a non-partitioned table into a partitioned table, BigQuery copies the source data into the partition that represents the current date.

  • If you copy multiple source tables into a partitioned table in the same job, the source tables cannot contain a mixture of partitioned and non-partitioned tables. If all of the source tables are partitioned tables, the partition specifications for all source tables must match the destination table's partition specification.

Restating data in a partition

To update data in a specific partition, append a_partition decorator _to the name of the partitioned table when loading data into the table. For example, the following command replaces the data in the entire partition for the date January 1, 2016 in a partitioned table namedmydataset.table1 with content loaded from a Cloud Storage bucket:

bq load  --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json

Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions

Updating partitions using query results

To use the output of a query to restate data for a partition, use a partition decorator when specifying the destination table. For example, the following command restates the data for the March 1, 2016 partition of table1 using the output of the query.

bq query --allow_large_results --replace --noflatten_results --destination_table 'mydataset.table1$20160301' 'SELECT field1 + 10, field2 FROM mydataset.table1$20160301'

Deleting partitions of a partitioned tables

You can use the partition decorator to delete a specific partition in a partitioned table. For example, the partition for March 1, 2016 in a partitioned table namedmydataset.tablecan be deleted using:

bq rm 'mydataset.table$20160301'

Partition expiration

Partition expiration is based on the elapsed time since the partition date. Partitions are deleted a set amount of time after the partition date, and the expiration happens regardless of when the partition was updated.

For example, if the partition expiration is set to 60 days and data is written to partition 20170101 on January 30, 2017, then the data expires on January 1, 2017 + 60 days.

For projects that had partitioned tables that were created before December 13, 2016, the partition expiration is based on the last date that the partition was modified.

Long-term storage pricing

Each partition of a partitioned table is considered separately for long-term storage pricing. If a partition hasn't been modified in the last 90 days, the data in that partition is considered long term storage price and is charged at the discounted price.

Best Practices with partitioned tables

Viewing most recent partitions

To limit the amount of data scanned to a set of partitions, create a view that contains a filter on_PARTITIONTIME. For example, the following query can be used to create a view that includes only the most recent seven days of data from a table namedmydataset.partitioned_table:

LEGACY SQL:
#legacySQL
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP(UTC_USEC_TO_DAY(NOW() - 7 * 60 * 60 * 24 * 1000000))
  AND TIMESTAMP(UTC_USEC_TO_DAY(CURRENT_TIMESTAMP()));
STANDARD SQL:
#standardSQL
SELECT
  *
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY)
  AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);

Dealing with timezone issues

The value of_PARTITIONTIMEis based on the UTC date when the field is populated, which means that partitions are divided based on 12:00 AM UTC. If you want to query data based on a timezone other than UTC, you should choose one of the following options before you start loading data into your table.

There are two ways to query data in a partitioned table using a custom, non-UTC, timezone. You can either create a separate timestamp column or you can use partition decorators to load data into a specific partition.

If you use atimestamp column, you can use the default UTC-based partitioning and account for timezone differences in your SQL queries. Alternately, if you prefer to have partitions that are grouped by a timezone other than UTC, usepartition decorators to load datainto partitions based on a different timezone.

Tracking timezones using a timestamp

To adjust for timezones using a timestamp, create a separate column to store a timestamp that enables you to address rows by the hour or minute.

To query for data based on a timezone other than UTC, use both the_PARTITIONTIMEpseudo column and your custom timestamp column. Using_PARTITIONTIMElimits the table scan to the relevant partitions, and your custom timestamp further limits the results to your timezone. For example, to query data from a partitioned table (mydataset.partitioned_table) with a timestamp field[MY_TIMESTAMP_FIELD]for data added to the table between2016-05-01 08:00:00 PSTand2016-05-05 14:00:00 PST:

LEGACY SQL:
#legacySQL
SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND DATE_ADD([MY_TIMESTAMP_FIELD], 8, 'HOUR') BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");
STANDARD SQL
#standardSQL
SELECT
  field1
FROM
  mydataset.partitioned_table
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP("2016-05-01")
  AND TIMESTAMP("2016-05-06")
  AND TIMESTAMP_ADD([MY_TIMESTAMP_FIELD], INTERVAL 8 HOUR) BETWEEN TIMESTAMP("2016-05-01 12:00:00")
  AND TIMESTAMP("2016-05-05 14:00:00");

Loading data using partition decorators

Partition decorators enable you to load data into a specific partition. To adjust for timezones, use a partition decorator to load data into a partition based on your preferred timezone. For example, if you are on Pacific Standard Time (PST), load all data generated on May 1, 2016 PST into the partition for that date by using the corresponding partition decorator:

[TABLE_NAME]$20160501

Better performance with the pseudo column

To improve query performance, use the_PARTITIONTIMEpseudo column by itself on the left side of a comparison. For example, the following queries process the same amount of data, but the second example can provide better performance.

Example 1. The following query can be slower because it combines the pseudo column value with other operations in theWHEREfilter.

LEGACY SQL
#legacySQL
/* Can be slower */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  DATE_ADD(_PARTITIONTIME, 5, "DAY") > TIMESTAMP("2016-04-15")
STANDARD SQL
#standardSQL
/* Can be slower */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  TIMESTAMP_ADD(_PARTITIONTIME, INTERVAL 5 DAY) > TIMESTAMP("2016-04-15")

Example 2. The following query can perform better because it places the pseudo column by itself on the left side of the filter comparison.

LEGACY SQL
#legacySQL
/* Often performs better */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > DATE_ADD(TIMESTAMP('2016-04-15'), -5, "DAY")
STANDARD SQL
#standardSQL
/* Often performs better */
SELECT
  field1
FROM
  mydataset.table1
WHERE
  _PARTITIONTIME > TIMESTAMP_SUB(TIMESTAMP('2016-04-15'), INTERVAL 5 DAY)

Depending on the table size, the second query, which places_PARTITIONTIMEby itself on the left side of the>comparison operator, can provide better performance than the first query. Because the queries process the same amount of data, the number of bytes billed is the same in both cases.

results matching ""

    No results matching ""