Getting started with BigQuery
Overview
- Google BigQuery is an enterprise data warehouse that allows storing and querying massive datasets.
- BigQuery can be accessed using the webUI, a command-line tool or the BigQuery REST API.
- There are official client libraries for Java, .NET or Python to simply the access to REST API.
- BigQuery resources, such as tables and datasets, must belong a GCP project. Projects are top-level containers for BigQuery data and query jobs.
BigQuery resources
Tables
- Tables contain your data in BigQuery.
- Each table has a schema that describes field names, types and other metadata.
- BigQuery supports the following table types:
- Native tables: tables backed by native BigQuery storage
- External tables: tables backed by storage external to BigQuery.
- Views: virtual tables defined by a SQL query.
Datasets
- Datasets are useful for organizing and controlling access to BigQuery tables.
- A table must belong to a Dataset so you must create at least one dataset before loading data into tables.
- You can share data with others by settings permission and assigning roles on organizations, projects and datasets.
- These are the three types of resources available to IAM. You cannot control access at the table level.
Storage
BigQuery manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
You can load data into BigQuery via batch loads or streaming.
With BigQuery, you can perform data operations such as copying tables, querying tables using SQL, modifying data through SQL DML, export data, or share stored data with other using Identity and Access Management (IAM) permissions.
Jobs
- Jobs are actions that you create and that BigQuery run on your behalf to load, export, query, or copy data.
- Jobs can take a long time to run. Therefore, they are executed asynchronously and their status can be polled while they are running.
- BigQuery stores a history of all jobs associated with a GCP project.
Slots
- A BigQuery slot is a unit of analytics capacity for measuring the throughput of SQL query executions.
- BigQuery automatically calculates the required slots for each query based on the query size and complexity.
- The BigQuery quota policy limits the number of slots available for each account.
- Slots are share among all queries in a project, and the maximum number of concurrent slots for on-demand pricing is 2000.
As a rule of thumb, if you're processing less than 100 GB of queries at once, you're unlikely to be using all 2000 slots. To check how many slots your account uses, see Monitoring BigQuery Using Stackdriver
Working with BigQuery
There are three main things you can do in BigQuery: loading and exporting data, querying and viewing data, and managing data.
Loading and exporting data
In general, you load into BigQuery storage before running queries. Alternatively, you can create a table that is associated to an external data source. This allows you to store the input data externally and load only the result of the query. If you want to move your data out of BigQuery, you can also export it.
Querying and viewing data
- Once you have setup tables in BigQuery, you can query or view them in your tables.
- You can run query using traditional or standard SQL languages.
- You can run queries interactively or in batch.
- You can also create view, which is a virtual table defined by a SQL query.
- Finally, you can also use partitioned tabled to query only a subset of your data in a more efficient way.
Managing data
In addition to querying and viewing data, you can manage data in BigQuery in the following ways:
- Listing projects, jobs, datasets, and tables.
- Getting information about jobs, datasets, and tables.
- Defining, updating, or patching datasets and tables.
- Deleting datasets and tables.
- Managing table partitions. .