BigQuery

Official · Premium

Sync from BigQuery to any destination

Publisher: cloudquery
Latest version: v1.3.6
Type: Source

Price per 1M rows: Starting from $15
Monthly free quota: 1M rows

Set up process

1. Download CLI and login

brew install cloudquery/tap/cloudquery

See installation options

2. Create source and destination configs

Plugin configuration

3. Run the sync

cloudquery sync bigquery.yml postgresql.yml

Overview

The CloudQuery BigQuery source plugin syncs tables from BigQuery to any of the supported CloudQuery destinations (e.g. PostgreSQL, BigQuery, Snowflake, and more).

Example Configuration

This example syncs from BigQuery to a Postgres destination. The (top level) source spec section is described in the Source Spec Reference.
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.3.6"
  tables: ["*"]
  destinations: ["postgresql"]
  spec:
    project_id: ${PROJECT_ID}
    dataset_id: ${DATASET_ID}
    # Optional parameters
    # dataset_location: ""
    # service_account_key_json: ""
    # endpoint: ""

Configuration Reference

This is the (nested) spec used by the BigQuery source plugin.
  • project_id (string) (required)
    The ID of the project containing the BigQuery dataset to sync from.
  • dataset_id (string) (required)
    The ID of the BigQuery dataset within the project, e.g. my_dataset.
  • dataset_location (string) (optional)
    The data location of the BigQuery dataset. If set, will be used as the default location for job operations. Pro-tip: this can solve "dataset not found" issues for newly created datasets.
  • service_account_key_json (string) (optional) (default: empty)
    GCP service account key content. This allows authenticating with a service account other than the one provided by your Application Default Credentials (for example, separate service accounts for the source and the destination). If using service account keys, it is best to use environment or file variable substitution.
  • endpoint (string) (optional)
    The BigQuery API endpoint to use. This is useful for testing against a local emulator.
  • table_options (map of Table Options Spec) (optional)
    Table options to set for specific tables; the map keys are the untemplated table names. Only valid if queries is empty.
  • queries (list of Queries Spec) (optional)
    (Preview feature) List of queries to run instead of reading tables directly. The tables list in the top-level spec should be left as * or set to a subset of the query names.
  • destination_table_name (string) (optional) (default: {{TABLE}})
    The destination table name to write the data to. (Preview Feature)
    Supports the following placeholder variables:
    • {{TABLE}} will be replaced with the table name
    • {{DATASET}} will be replaced with the dataset name
    • {{UUID}} will be replaced with a random UUID in raw format to uniquely identify each table
    • {{YEAR}} will be replaced with the current year in YYYY format
    • {{MONTH}} will be replaced with the current month in MM format
    • {{DAY}} will be replaced with the current day in DD format
    • {{HOUR}} will be replaced with the current hour in HH format
    • {{MINUTE}} will be replaced with the current minute in mm format
    At least one of {{TABLE}} or {{UUID}} is required.
    Note that timestamps are in UTC and reflect the time the sync started.
  • concurrency (integer) (optional) (default: 100)
    Number of tables to sync concurrently. Decrease or increase this number based on your database size and available resources (see the example after this list).
  • discovery_concurrency (integer) (optional) (default: 100)
    Number of goroutines to use for discovering table schemas.
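
As an illustration, these optional settings can be combined in the nested spec as follows; the naming pattern and concurrency values are placeholders, not recommendations:
  spec:
    project_id: my-project
    dataset_id: my_dataset
    # Write each table to a date-stamped destination table name
    destination_table_name: "{{TABLE}}_{{YEAR}}{{MONTH}}{{DAY}}"
    # Reduce parallelism for smaller databases or constrained resources
    concurrency: 50
    discovery_concurrency: 50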

Table Options Spec (Preview)

Enables setting table options for specific tables. The map key is the untemplated table name (see the Incremental Examples below for a full example).
  • incremental_column (string) (optional)
    Name of the incremental column in the table. If empty, no incremental column will be used.

Queries Spec (Preview)

Allows running arbitrary queries instead of fetching existing tables. Each query runs as a separate table; a minimal example follows the list below.
  • name (string) (required)
    Name of the table to be generated from the query.
  • query (string) (required)
    SQL query to run. If incremental_column is set, the query should include a condition such as {incremental_column} > @cursor in its WHERE clause.
  • incremental_column (string) (optional)
    Name of the incremental column in the query result set. The query must have a reference to @cursor in its WHERE clause, and a column with this name in the result set. If empty, no incremental column will be used.
  • incremental_column_type (enum) (required if incremental_column is set)
    Type of the incremental column: one of TIMESTAMP, INTEGER, or STRING.
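
For instance, a minimal non-incremental query entry could look like this; the table, column, and result names are illustrative:
  spec:
    project_id: my-project
    dataset_id: my_dataset
    queries:
      - name: active_users
        query: "SELECT id, email FROM users WHERE active = TRUE"
With this spec, the top-level tables list should be left as * or include active_users; an incremental variant is shown in the Incremental Examples below.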

Authentication

The BigQuery plugin authenticates using your Application Default Credentials. All of the standard options are available; a shell sketch of the common local setups follows this list.
Local Environment:
  • gcloud auth application-default login (recommended when running locally)
Google Cloud cloud-based development environment:
  • When you run on Cloud Shell or Cloud Code, credentials are already available.
Google Cloud containerized environment:
  • Services such as Compute Engine, App Engine, and Cloud Functions support attaching a user-managed service account, which CloudQuery will be able to use.
On-premises or another cloud provider:
  • The suggested way is to use Workload identity federation
  • If that is not available, you can always use service account keys and export the location of the key via GOOGLE_APPLICATION_CREDENTIALS. (Not recommended, as long-lived keys are a security risk.)
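
For reference, the two local options above look like this in practice; the key file path is only a placeholder:
# Recommended for local development: create Application Default Credentials
gcloud auth application-default login

# Not recommended: point ADC at a long-lived service account key file
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
cloudquery sync bigquery.yml postgresql.yml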

Underlying library

We use the official cloud.google.com/go/bigquery package to connect to BigQuery.


Incremental Examples

Configuration Examples for Incremental Sync

To sync tables incrementally, each table needs to have an incremental_column specified in the table_options section of the configuration. Tables without an incremental_column will be synced fully.

Sync a Table Incrementally

To sync a table incrementally, you need to specify the incremental_column in the table_options section of the configuration. Here's an example source spec to sync a list of tables incrementally:
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.3.6"
  tables:
    - "my_table"
    - "another_table"
    - "yet_another_table"
  destinations: ["postgresql"]
  backend_options:
    table_name: "cq_state_bq"
    connection: "@@plugins.postgresql.connection"
  spec:
    project_id: my-project
    dataset_id: my_dataset
    table_options:
      my_table:
        incremental_column: updated_at
      another_table:
        incremental_column: id
In the example above, my_table will be synced incrementally based on the updated_at column, another_table will be synced incrementally based on the id column, and yet_another_table will be synced fully.
The incremental column will be used to fetch only the new or updated rows from the source table and the state will be stored in the cq_state_bq table in the PostgreSQL destination.

Sync a Custom Query Incrementally

To sync a custom query incrementally, you need to specify the incremental_column as well as the incremental_column_type in the Queries Spec. Here's an example source spec to sync a custom query incrementally:
kind: source
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v1.3.6"
  tables:
    - "my_query_result"
  destinations: ["postgresql"]
  backend_options:
    table_name: "cq_state_bq"
    connection: "@@plugins.postgresql.connection"
  spec:
    project_id: my-project
    dataset_id: my_dataset
    queries:
      - name: my_query_result
        query: "SELECT * FROM my_table WHERE updated_at > @cursor"
        incremental_column: updated_at
        incremental_column_type: timestamp
In the example above, the my_query_result table will be generated from the custom query SELECT * FROM my_table WHERE updated_at > @cursor and will be synced incrementally based on the updated_at column. @cursor is a placeholder that will be replaced with the last sync time. If no sync time is available, it will be replaced with the positive minimum value of that type.

