Best practices for BigQuery

Introduction

Here is a checklist of recommended best practices to complete before launching a commercial application that uses Google BigQuery. The checklist focuses on activities related to architecture and schema design, quota and cost management, security, data pre-processing, and performance optimization.

Schema Design

❑ Design the schema based on how you need to query it. BigQuery supports nested and repeated (type: RECORD) fields, allowing logical one-to-many relationships within a denormalized table. Leverage these where appropriate, while maintaining a strategy for accessing such fields. Consider creating logical views for common unnesting of repeated fields (see the sketch after this list). Estimate the querying cost.
❑ Repeated fields can become challenging to query beyond the first or second nesting level, so consider the depth to which nesting suits your use.
❑ While joins are performant on BigQuery, denormalization can improve the performance of some queries. Consider where in your data model joinability is important.
❑ BigQuery allows tables to be appended or updated via DML. DML is intended for bulk updates, deletions, and inserts, rather than single-row modifications. Ensure that your update patterns are consistent with an OLAP system.
❑ Unlike a traditional RDBMS, there is no notion of primary/secondary keys or row IDs. If required, designate a column in the table schema for that purpose.
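As a rough illustration of the nested-and-repeated pattern above, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and field names (my-project.sales.orders, line_items, and so on) are hypothetical, and your own schema and view definitions will differ.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical denormalized orders table: each row carries a repeated
# RECORD of line items instead of a separate child table.
schema = [
    bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("order_ts", "TIMESTAMP"),
    bigquery.SchemaField(
        "line_items", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("sku", "STRING"),
            bigquery.SchemaField("quantity", "INTEGER"),
            bigquery.SchemaField("unit_price", "NUMERIC"),
        ],
    ),
]
client.create_table(bigquery.Table("my-project.sales.orders", schema=schema))

# Logical view that performs the common unnesting once, so consumers
# query flat rows instead of repeating UNNEST in every statement.
view = bigquery.Table("my-project.sales.order_line_items")
view.view_query = """
    SELECT o.order_id, o.order_ts, li.sku, li.quantity, li.unit_price
    FROM `my-project.sales.orders` AS o, UNNEST(o.line_items) AS li
"""
view.view_use_legacy_sql = False
client.create_table(view)
```

Consumers can then query the order_line_items view directly, and the unnesting logic lives in one place.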

Data storage and query processing

Estimate Data Volumes
❑ Calculate how much data will be loaded in the initial upload to establish a baseline.
❑ Calculate how much data will be uploaded incrementally and at what velocity (hourly/daily/weekly).
❑ Calculate how much data (if any) expires and at what frequency.
❑ Identify the types of queries that will be executed on BigQuery datasets every day. Stackdriver monitoring provides useful diagnostics on resource utilization, concurrent queries, and dataset sizes.
❑ Decide on a table partitioning/sharding strategy. For example, if queries issued during the day are relevant only to the data collected during that day, create one table per day. To aggregate data across multiple days, run queries using table wildcards or the _PARTITIONTIME pseudocolumn (see the sketch after this list).
❑ Calculate how much data (the number of queries x data processed per query) will be processed by BigQuery every day. Note that BigQuery charges for processing individual columns (not whole rows), so account for this in your calculations. Also note that any column referenced in a query incurs a full column scan.
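To make the partitioning item concrete, here is a minimal sketch, again using the google-cloud-bigquery Python client; the analytics.events table, its fields, and the date range are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical ingestion-time partitioned table: each day's load lands
# in its own partition, so queries can prune by date.
table = bigquery.Table(
    "my-project.analytics.events",
    schema=[
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("event_name", "STRING"),
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)

# Restricting on _PARTITIONTIME limits the scan (and the charge) to the
# partitions that are actually needed.
query = """
    SELECT event_name, COUNT(*) AS events
    FROM `my-project.analytics.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01')
                             AND TIMESTAMP('2024-01-07')
    GROUP BY event_name
"""
for row in client.query(query).result():
    print(row.event_name, row.events)
```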

Quota Management

Complete a quota analysis in the following areas wherever applicable:

  • The number of load jobs per day required to load data to BigQuery.

  • The number of load jobs per table per day to BigQuery.

  • If using the streaming API, the number of streaming inserts, the row size of inserts, the maximum rows per second, the maximum rows per request, and other streaming-API-specific parameters (see the sketch after this list).

  • The number of queries issued by your application to BigQuery.

  • The number of concurrent sessions used to execute queries simultaneously.

  • The number of export jobs executed per day to extract data out of BigQuery.

  • Note that the API calls that invoke BigQuery operations are themselves subject to rate limits and daily limits. For more information, see the BigQuery API rate limits documentation.
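For the streaming API item, the sketch below batches rows so that each request stays below a chosen rows-per-request size; the table name, row contents, and batch size are hypothetical, and the actual quota values should be taken from the current BigQuery documentation.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

def stream_rows(rows, batch_size=500):
    """Send rows to the streaming API in batches so each request stays
    under the rows-per-request quota (check the current limit for your
    project before choosing a batch size)."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            # Per-row insert errors are returned rather than raised.
            print("Streaming insert errors:", errors)

stream_rows([{"user_id": "u1", "event_name": "click"}] * 2000)
```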

Cost Analysis

Consider the following cost optimization recommendations:

  • Select only relevant columns in your queries. Note that BigQuery will execute (and charge for) a full column scan of each selected column, irrespective of the filters in the WHERE clause.
  • Partition/shard your tables into smaller units to optimize processing cost. There is often a trade-off between cost optimization and performance: if you have too many small tables, there is a fixed overhead in loading each table referenced in the query, which might affect performance.
  • Test your queries on smaller partitions of the table rather than one large table.
  • If using the API, validate queries for syntax and get data-processing statistics using the dryRun flag (see the sketch after this list).
  • Delete older tables if there is no need to query on them, or take advantage of the expirationTime on tables.
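As a sketch of the dryRun item, the google-cloud-bigquery Python client can validate a query and report the bytes it would process without running it; the query and table name below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports how many bytes it would
# process, without executing it or incurring a charge.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT event_name, COUNT(*) FROM `my-project.analytics.events` GROUP BY event_name",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes.")
```

Multiplying the reported bytes by the on-demand price per TB gives a rough per-query cost estimate for the calculations above.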

Security

  • Carefully review the IAM policies for BigQuery to ensure that dataset access and job-running permissions are correct. Points to consider:

  • Audit BigQuery permissions for project members to ensure your security policies are followed. The bigquery.User
    role allows users to run jobs in your project, while dataset.Viewer controls the ability to read all tables in a given dataset.
    dataset.Editor and dataset.Owner should only be granted if the user needs to modify tables or datasets, respectively.

  • If there is a need to share specific datasets with team members, share the dataset directly or use explicit IAM roles instead of broad project-level access (see the sketch after this list).

  • If you require finer-grained security, consider using authorized views to control access.
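As a minimal sketch of dataset-level sharing with the google-cloud-bigquery Python client (the dataset name and user email are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Grant read access on a single dataset instead of a project-wide role.
dataset = client.get_dataset("my-project.analytics")  # hypothetical dataset
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",  # hypothetical user
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```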

Pre-processing data prior to BigQuery

Consider pre-processing data before loading it into BigQuery to optimize for cost and performance. For example, you can:

  • Pre-compute typecasts and calculations so that queries do not repeat them.
  • Pre-join frequently joined tables.
  • Pre-aggregate metrics and frequently-run analytics.
  • Pre-process (transform, de-normalize, etc.) data using BigQuery itself, Google Cloud Dataflow, Google Cloud Dataproc, or ETL tools.

  • Consider having multiple versions of the same dataset structured differently for different types of queries.

  • Since DML statements per table are subject to a daily limit, plan to apply updates and deletes in staged batches of modifications rather than row by row (see the sketch after this list).
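One way to apply a staged batch of modifications in a single DML statement is a MERGE from a staging table, sketched below with hypothetical table and column names.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A single MERGE applies an entire staged batch of updates and inserts,
# so the day's modifications consume one DML statement against the table
# instead of many single-row statements.
merge_sql = """
    MERGE `my-project.crm.customers` AS target
    USING `my-project.crm.customers_staging` AS staged
    ON target.customer_id = staged.customer_id
    WHEN MATCHED THEN
      UPDATE SET target.email = staged.email, target.status = staged.status
    WHEN NOT MATCHED THEN
      INSERT (customer_id, email, status)
      VALUES (staged.customer_id, staged.email, staged.status)
"""
client.query(merge_sql).result()  # wait for the DML job to complete
```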

Query Tuning & Testing

Test your queries on the expected volume of data and tune them according to the following principles:

  • Omit unnecessary columns from the select clause to reduce cost and improve performance.
  • In the context of nested queries, use WHERE clauses effectively to filter out data in the innermost queries so that the outer queries have less data to process.
  • Push flattening (for example, UNNEST of repeated fields) as far inside the query as possible, and use WHERE clauses in conjunction where possible.
  • Move heavyweight filters, such as regular expressions, to the end of the query.
  • Avoid grouping on strings when equivalent non-string data is available; for example, group on timestamps rather than their string representations.
  • Try to use ORDER BY and LIMIT in the outermost queries. Avoid ORDER BY in the inner queries as much as possible.
  • Be aware that using GROUP BY on skewed (very large) groups may result in increased tail latency. Filter such groups out to improve query performance.
  • Consider using IF/CASE expressions or analytic SQL functions instead of self-joins, because they have lower processing overhead than self-joins (see the sketch after this list).
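As an illustration of the last item, a window (analytic) function such as LAG() can often replace a self-join; the table and columns below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Instead of self-joining the table to fetch each user's previous event,
# a LAG() window function computes it in a single pass over the data.
query = """
    SELECT
      user_id,
      event_ts,
      LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_event_ts
    FROM `my-project.analytics.events`
"""
for row in client.query(query).result():
    print(row.user_id, row.event_ts, row.prev_event_ts)
```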
