description |
---|
BigQuery is a serverless, highly scalable, and cost-effective data warehouse offered by Google Cloud Platform. |
Feature | Supported? (Yes/No) | Notes |
---|---|---|
Full Refresh Sync | Yes | |
Incremental - Append Sync | Yes | |
Incremental - Deduped History | Yes | |
Bulk loading | Yes | |
Namespaces | Yes | |
There are two flavors of connectors for this destination:
- Bigquery: This produces the standard Airbyte outputs, using `_airbyte_raw_*` tables to store the JSON blob data first. Afterward, these are transformed and normalized into separate tables, potentially "exploding" nested streams into their own tables if basic normalization is configured.
- Bigquery (Denormalized): Instead of splitting the final data into multiple tables, this destination leverages BigQuery capabilities with Structured and Repeated fields to produce a single "big" table per stream. This does not write the `_airbyte_raw_*` tables in the destination, and normalization from this connector is not supported at this time.
Check out common troubleshooting issues for the BigQuery destination connector on our Discourse here.
Each stream will be output into its own table in BigQuery. Each table will contain 3 columns:
- `_airbyte_ab_id`: a UUID assigned by Airbyte to each event that is processed. The column type in BigQuery is `String`.
- `_airbyte_emitted_at`: a timestamp representing when the event was pulled from the data source. The column type in BigQuery is `Timestamp`.
- `_airbyte_data`: a JSON blob representing the event data. The column type in BigQuery is `String`.
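As a rough illustration of the three-column layout described above, the following Python sketch builds one row the way it might appear in a raw table. The helper name `make_raw_row` is hypothetical and is not part of Airbyte; only the column names and types come from the description above.

```python
import json
import uuid
from datetime import datetime, timezone


def make_raw_row(record: dict) -> dict:
    """Wrap a source record in the three columns described above (illustrative only)."""
    return {
        # String column: a UUID assigned to each processed event.
        "_airbyte_ab_id": str(uuid.uuid4()),
        # Timestamp column: when the event was pulled from the source (UTC here).
        "_airbyte_emitted_at": datetime.now(timezone.utc),
        # String column: the event data serialized as a JSON blob.
        "_airbyte_data": json.dumps(record),
    }


row = make_raw_row({"id": 1, "name": "example"})
print(sorted(row))
```

Because `_airbyte_data` is stored as a string, consumers of the raw table need to parse it as JSON before reading individual fields.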
The output tables from the BigQuery destination are partitioned and clustered by the time-unit column `_airbyte_emitted_at` at a daily granularity. Partition boundaries are based on UTC time.
This is useful for limiting the number of partitions scanned when querying these partitioned tables with a predicate filter (a WHERE clause). Filters on the partitioning column are used to prune partitions and reduce the query cost. (Airbyte does not enable the "Require partition filter" parameter, but you can toggle it by updating the produced tables if you wish.)
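To make the partition pruning described above concrete, here is a minimal Python sketch that builds a query restricted to a single UTC day. The function name `build_daily_query` and the table name `my-project.my_dataset._airbyte_raw_users` are hypothetical placeholders; the sketch only assembles SQL text and does not call BigQuery.

```python
from datetime import date, timedelta


def build_daily_query(table: str, day: date) -> str:
    """Return a query limited to one UTC day of the partitioning column."""
    next_day = day + timedelta(days=1)
    return (
        f"SELECT _airbyte_ab_id, _airbyte_data\n"
        f"FROM `{table}`\n"
        # A half-open range on the partitioning column lets BigQuery prune
        # every daily partition outside [day, next_day).
        f"WHERE _airbyte_emitted_at >= TIMESTAMP('{day.isoformat()}')\n"
        f"  AND _airbyte_emitted_at < TIMESTAMP('{next_day.isoformat()}')"
    )


# Hypothetical fully qualified table name, for illustration only.
query = build_daily_query("my-project.my_dataset._airbyte_raw_users", date(2021, 10, 26))
print(query)
```

A query without any filter on `_airbyte_emitted_at` would still run, since the "Require partition filter" parameter is not enabled, but it would scan every partition.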
To use the BigQuery destination, you'll need:
- A Google Cloud Project with BigQuery enabled
- A BigQuery Dataset into which Airbyte can sync your data
- A Google Cloud Service Account with the "BigQuery User" and "BigQuery Data Editor" roles in your GCP project
- A Service Account Key to authenticate into your Service Account
For GCS Staging upload mode:
- A GCS role enabled for the same user that is used for BigQuery
- An HMAC key obtained for that user. Currently, only the HMAC key is supported. More credential types will be added in the future.
See the setup guide for more information about how to create the required resources.
If you have a Google Cloud Project with BigQuery enabled, skip to the "Create a Dataset" section.
First, follow the Google Cloud instructions to Create a Project.
Enable BigQuery
BigQuery is typically enabled automatically in new projects. If this is not the case for your project, follow the "Before you begin" section in the BigQuery QuickStart docs.
Airbyte needs a location in BigQuery to write the data being synced from your data sources. If you already have a Dataset into which Airbyte should sync data, skip this section. Otherwise, follow the Google Cloud guide for Creating a Dataset via the Console UI to achieve this.
Note that queries written in BigQuery can only reference Datasets in the same physical location. So if you plan on combining the data Airbyte synced with data from other datasets in your queries, make sure you create the datasets in the same location on Google Cloud. See the Introduction to Datasets section for more info on considerations around creating Datasets.
In order for Airbyte to sync data into BigQuery, it needs credentials for a Service Account with the "BigQuery User" and "BigQuery Data Editor" roles, which grant permissions to run BigQuery jobs, write to BigQuery Datasets, and read table metadata. We highly recommend that this Service Account be exclusive to Airbyte for ease of permissioning and auditing. However, you can use a pre-existing Service Account if you already have one with the correct permissions.
The easiest way to create a Service Account is to follow GCP's guide for Creating a Service Account. Once you've created the Service Account, make sure to keep its ID handy as you will need to reference it when granting roles. Service Account IDs typically take the form <account-name>@<project-name>.iam.gserviceaccount.com
Then, add the service account as a Member in your Google Cloud Project with the "BigQuery User" and "BigQuery Data Editor" roles. To do this, follow the instructions for Granting Access in the Google documentation. The email address of the member you are adding is the same as the Service Account ID you just created.
At this point you should have a service account with the "BigQuery User" and "BigQuery Data Editor" project-level permissions.
Service Account Keys are used to authenticate as Google Service Accounts. For Airbyte to leverage the permissions you granted to the Service Account in the previous step, you'll need to provide its Service Account Keys. See the Google documentation for more information about Keys.
Follow the Creating and Managing Service Account Keys guide to create a key. Airbyte currently supports JSON Keys only, so make sure you create your key in that format. As soon as you have created the key, make sure to download it, as that is the only time Google will allow you to see its contents. Once you've successfully configured BigQuery as a destination in Airbyte, delete this key from your computer.
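Before pasting the key into Airbyte, it can help to confirm that the downloaded file really is a JSON Service Account Key. The sketch below is one possible sanity check, not something Airbyte runs: the helper name `key_problems` and the choice of which fields to check are assumptions, although the field names themselves are the ones Google writes into JSON key files.

```python
import json

# Subset of fields found in a Google JSON Service Account Key file.
# Which fields to check is an assumption made for this sketch.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def key_problems(key_json: str) -> list:
    """Return the names of fields that are missing, empty, or unexpected."""
    key = json.loads(key_json)
    problems = [name for name in REQUIRED_KEY_FIELDS if not key.get(name)]
    # A key of another type (for example a user credential) is also a problem.
    if key.get("type") and key["type"] != "service_account":
        problems.append("type")
    return problems


# Placeholder values only; a real key contains an actual private key.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "airbyte@my-project.iam.gserviceaccount.com",
})
print(key_problems(sample))
```

An empty list means the structure looks plausible; it does not prove that the key is valid or that the account has the required roles.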
You should now have all the requirements needed to configure BigQuery as a destination in the UI. You'll need the following information to configure the BigQuery destination:
- Project ID
- Dataset Location
- Dataset ID: the name of the schema where the tables will be created.
- Service Account Key: the contents of your Service Account Key JSON file
Additional options can also be customized:
- Google BigQuery client chunk size: the Google BigQuery client's chunk (buffer) size for each table, in MiB (MIN = 1, MAX = 15). The default value of 15 MiB is used if not set explicitly. Decreasing the value is recommended when migrating big data sets, to reduce heap memory consumption and avoid crashes. For more details refer to https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
- Transformation Priority: configure the priority of queries run for transformations. Refer to https://cloud.google.com/bigquery/docs/running-queries. By default, Airbyte runs interactive query jobs on BigQuery, which means that the query is executed as soon as possible and counts toward daily concurrent quotas and limits. If set to use batch queries on your behalf, BigQuery starts the query as soon as idle resources are available in the BigQuery shared resource pool. This usually occurs within a few minutes. If BigQuery hasn't started the query within 24 hours, BigQuery changes the job priority to interactive. Batch queries don't count toward your concurrent rate limit, which can make it easier to start many queries at once.
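The bounds and default of the chunk size option above can be expressed as a small check. This is only a sketch of the documented rule (1 to 15 MiB, defaulting to 15 MiB): the function name `resolve_chunk_size_mib` is hypothetical, and whether the connector rejects or clamps out-of-range values is not stated here, so raising an error is an assumption.

```python
from typing import Optional

MIN_CHUNK_SIZE_MIB = 1
MAX_CHUNK_SIZE_MIB = 15
DEFAULT_CHUNK_SIZE_MIB = 15


def resolve_chunk_size_mib(configured: Optional[int]) -> int:
    """Apply the documented default and bounds to a configured chunk size."""
    if configured is None:
        # Not set explicitly: fall back to the documented default.
        return DEFAULT_CHUNK_SIZE_MIB
    if not MIN_CHUNK_SIZE_MIB <= configured <= MAX_CHUNK_SIZE_MIB:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(
            f"chunk size must be between {MIN_CHUNK_SIZE_MIB} and "
            f"{MAX_CHUNK_SIZE_MIB} MiB, got {configured}"
        )
    return configured


print(resolve_chunk_size_mib(None))
print(resolve_chunk_size_mib(4))
```

A smaller value such as 4 trades some throughput for lower memory use per table, which is the case the recommendation above is aimed at.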
Once you've configured BigQuery as a destination, delete the Service Account Key from your computer.
There are two available options for uploading data to BigQuery: `Standard` and `GCS Staging`.
- `Standard`: uploads data directly from your source to BigQuery storage. This way is faster and requires fewer resources than the GCS one. Be aware that you may see some failures for big datasets and slow sources, i.e. if reading from the source takes more than 10-12 hours. This is caused by limitations of the Google BigQuery SDK client. For more details, please check airbytehq#3549.
- `GCS Uploading (CSV format)`: This approach was implemented to avoid the issue with big datasets mentioned above. In the first step, all data is uploaded to a GCS bucket; it is then moved to BigQuery in one shot, stream by stream. The destination-gcs connector is partially used under the hood here, so you may check its documentation for more details.
For the GCS Staging upload type, additional parameters must be configured:
- GCS Bucket Name
- GCS Bucket Path
- GCS Bucket Keep files after migration
  - See this to create a GCS bucket.
- HMAC Key Access ID
  - See this on how to generate an access key.
  - We recommend creating an Airbyte-specific user or service account. This user or account will require read and write permissions to objects in the bucket.
- Secret Access Key
  - Corresponding key to the above access ID.
- Make sure your GCS bucket is accessible from the machine running Airbyte.
  - This depends on your networking setup.
  - The easiest way to verify whether Airbyte is able to connect to your GCS bucket is via the check connection tool in the UI.
Note: This mode partially reuses the destination-gcs connector under the hood, so you may also refer to its guide for additional clarification. The GCS region is the same as the one set for BigQuery, and the GCS format is set to CSV.
From BigQuery Datasets Naming:
When you create a dataset in BigQuery, the dataset name must be unique for each project. The dataset name can contain the following:
- Up to 1,024 characters.
- Letters (uppercase or lowercase), numbers, and underscores.
  Note: In the Cloud Console, datasets that begin with an underscore are hidden from the navigation pane. You can query tables and views in these datasets even though these datasets aren't visible.
- Dataset names are case-sensitive: mydataset and MyDataset can coexist in the same project.
- Dataset names cannot contain spaces or special characters such as -, &, @, or %.
Therefore, the Airbyte BigQuery destination converts any invalid characters into '_' characters when writing data.
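The conversion described above can be approximated with a single regular expression. This is a minimal sketch based only on the naming rules quoted here (letters, numbers, and underscores are allowed); the function name `sanitize_dataset_name` is hypothetical, and the connector's actual implementation may treat some characters differently.

```python
import re

# Anything other than ASCII letters, digits, and underscores counts as invalid here.
_INVALID_CHARS = re.compile(r"[^A-Za-z0-9_]")


def sanitize_dataset_name(name: str) -> str:
    """Replace each invalid character with an underscore."""
    return _INVALID_CHARS.sub("_", name)


print(sanitize_dataset_name("my-data set@2021"))  # prints "my_data_set_2021"
```

Note that two different source names, such as `my-data` and `my data`, map to the same sanitized name under this rule.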
Version | Date | Pull Request | Subject |
---|---|---|---|
0.5.0 | 2021-10-26 | #7240 | Output partitioned/clustered tables |
0.4.1 | 2021-10-04 | #6733 | Support dataset starting with numbers |
0.4.0 | 2021-08-26 | #5296 | Added GCS Staging uploading option |
0.3.12 | 2021-08-03 | #3549 | Add optional arg to make a possibility to change the BigQuery client's chunk\buffer size |
0.3.11 | 2021-07-30 | #5125 | Enable additionalPropertities in spec.json |
0.3.10 | 2021-07-28 | #3549 | Add extended logs and made JobId filled with region and projectId |
0.3.9 | 2021-07-28 | #5026 | Add sanitized json fields in raw tables to handle quotes in column names |
0.3.6 | 2021-06-18 | #3947 | Service account credentials are now optional. |
0.3.4 | 2021-06-07 | #3277 | Add dataset location option |
Version | Date | Pull Request | Subject |
---|---|---|---|
0.1.10 | 2021-11-09 | #7804 | handle null values in fields described by a $ref definition |
0.1.9 | 2021-11-08 | #7736 | Fixed the handling of ObjectNodes with $ref definition key |
0.1.8 | 2021-10-27 | #7413 | Fixed DATETIME conversion for BigQuery |
0.1.7 | 2021-10-26 | #7240 | Output partitioned/clustered tables |
0.1.6 | 2021-09-16 | #6145 | BigQuery Denormalized support for date, datetime & timestamp types through the json "format" key |
0.1.5 | 2021-09-07 | #5881 | BigQuery Denormalized NPE fix |
0.1.4 | 2021-09-04 | #5813 | fix Stackoverflow error when receive a schema from source where "Array" type doesn't contain a required "items" element |
0.1.3 | 2021-08-07 | #5261 | 🐛 Destination BigQuery(Denormalized): Fix processing arrays of records |
0.1.2 | 2021-07-30 | #5125 | Enable additionalPropertities in spec.json |
0.1.1 | 2021-06-21 | #3555 | Partial Success in BufferedStreamConsumer |
0.1.0 | 2021-06-21 | #4176 | Destination using Typed Struct and Repeated fields |