Mooore BQ docs changes
pondzix committed May 13, 2024
1 parent 60d0b1b commit 0a938c3
Showing 8 changed files with 28 additions and 70 deletions.
@@ -2,8 +2,7 @@
import Mermaid from '@theme/Mermaid';
import Link from '@docusaurus/Link';
```

<p>The BigQuery Streaming Loader on {props.cloud} is a fully streaming application that continually pulls events from {props.stream} and writes to BigQuery using the <Link to="https://cloud.google.com/bigquery/docs/reference/storage/libraries#client-libraries-install-java">BigQuery Storage API</Link>.</p>
<p>The BigQuery Streaming Loader on {props.cloud} is a fully streaming application that continually pulls events from {props.stream} and writes to BigQuery using the <Link to="https://cloud.google.com/bigquery/docs/write-api">BigQuery Storage API</Link>.</p>

<Mermaid value={`
flowchart LR
@@ -14,8 +14,6 @@ import DeployOverview from '@site/docs/pipeline-components-and-applications/load

## Overview

The BigQuery Loader is an application that loads Snowplow events to BigQuery using the [BigQuery Storage API](https://cloud.google.com/bigquery/docs/write-api).

<Tabs groupId="cloud" queryString lazy>
<TabItem value="aws" label="AWS" default>
<LoaderDiagram stream="Kinesis" cloud="AWS"/>
@@ -1,4 +1,4 @@
At a high level, the BigQuery loader reads enriched Snowplow events in real time and loads them into BigQuery using the Storage Write API.
At a high level, the BigQuery loader reads enriched Snowplow events in real time and loads them into BigQuery using the [legacy streaming API](https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery).

```mermaid
flowchart LR
@@ -6,7 +6,7 @@ sidebar_position: 0
```mdx-code-block
import {versions} from '@site/src/componentVersions';
import CodeBlock from '@theme/CodeBlock';
import Diagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/_diagram.md';
import Diagram from '@site/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/_diagram.md';
```

Under the umbrella of Snowplow BigQuery Loader, we have a family of applications that can be used to load enriched Snowplow data into BigQuery.
@@ -147,7 +147,7 @@ The loader takes command line arguments `--config` with a path to the configurat
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-streamloader:1.7.1 \\
snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json
`}</CodeBlock>
@@ -157,7 +157,7 @@ Or you can pass the whole config as a base64-encoded string using the `--config`
<CodeBlock language="bash">{
`docker run \\
-v /path/to/resolver.json:/resolver.json \\
snowplow/snowplow-bigquery-streamloader:1.7.1 \\
snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\
--config=ewogICJwcm9qZWN0SWQiOiAiY29tLWFjbWUiCgogICJsb2FkZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZW5yaWNoZWQtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogewogICAgICAgICJkYXRhc2V0SWQiOiAic25vd3Bsb3ciCiAgICAgICAgInRhYmxlSWQiOiAiZXZlbnRzIgogICAgICB9CgogICAgICAiYmFkIjogewogICAgICAgICJ0b3BpYyI6ICJiYWQtdG9waWMiCiAgICAgIH0KCiAgICAgICJ0eXBlcyI6IHsKICAgICAgICAidG9waWMiOiAidHlwZXMtdG9waWMiCiAgICAgIH0KCiAgICAgICJmYWlsZWRJbnNlcnRzIjogewogICAgICAgICJ0b3BpYyI6ICJmYWlsZWQtaW5zZXJ0cy10b3BpYyIKICAgICAgfQogICAgfQogIH0KCiAgIm11dGF0b3IiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAidHlwZXMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCiAgICB9CiAgfQoKICAicmVwZWF0ZXIiOiB7CiAgICAiaW5wdXQiOiB7CiAgICAgICJzdWJzY3JpcHRpb24iOiAiZmFpbGVkLWluc2VydHMtc3ViIgogICAgfQoKICAgICJvdXRwdXQiOiB7CiAgICAgICJnb29kIjogJHtsb2FkZXIub3V0cHV0Lmdvb2R9ICMgd2lsbCBiZSBhdXRvbWF0aWNhbGx5IGluZmVycmVkCgogICAgICAiZGVhZExldHRlcnMiOiB7CiAgICAgICAgImJ1Y2tldCI6ICJnczovL2RlYWQtbGV0dGVyLWJ1Y2tldCIKICAgICAgfQogICAgfQogIH0KCiAgIm1vbml0b3JpbmciOiB7fSAjIGRpc2FibGVkCn0= \\
--resolver=/resolver.json
`}</CodeBlock>
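
One way to produce the base64 string passed to `--config` is sketched below, assuming a Node.js environment. The helper names `encodeConfig` and `decodeConfig` are hypothetical and not part of the loader; only the standard `Buffer` API is used.

```javascript
// Illustrative helpers, not part of the loader itself.

// Encode the text of a HOCON config file as a base64 string
// suitable for passing as the value of --config.
function encodeConfig(hoconText) {
  return Buffer.from(hoconText, "utf8").toString("base64");
}

// Decode a base64 string back into the original config text,
// which is handy for checking what an encoded value contains.
function decodeConfig(encoded) {
  return Buffer.from(encoded, "base64").toString("utf8");
}

// Example config fragment (the values are placeholders).
const sample = '{\n  "projectId": "com-acme"\n}';
const encoded = encodeConfig(sample);
console.log(encoded);
console.log(decodeConfig(encoded) === sample); // round trip check
```

Decoding an existing value before deploying is a cheap way to confirm that the string was not truncated when it was copied.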
@@ -169,7 +169,7 @@ For example, to override the `repeater.input.subscription` setting using system
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-streamloader:1.7.1 \\
snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
-Drepeater.input.subscription="failed-inserts-sub"
@@ -180,7 +180,7 @@ Or to use environment variables for every setting:
<CodeBlock language="bash">{
`docker run \\
-v /path/to/resolver.json:/resolver.json \\
snowplow/snowplow-bigquery-repeater:1.7.1 \\
snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x} \\
--resolver=/resolver.json \\
-Dconfig.override_with_env_vars=true
`}</CodeBlock>
@@ -197,7 +197,7 @@ StreamLoader accepts `--config` and `--resolver` arguments, as well as any JVM s
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-streamloader:1.7.1 \\
snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x} \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
-Dconfig.override_with_env_vars=true
@@ -212,7 +212,7 @@ The Dataflow Loader accepts the same two arguments as StreamLoader and [any oth
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-loader:1.7.1 \\
snowplow/snowplow-bigquery-loader:${versions.bqLoader1x} \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
--labels={"key1":"val1","key2":"val2"} # optional Dataflow args
@@ -233,7 +233,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`.
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-mutator:1.7.1 \\
snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\
listen \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -247,7 +247,7 @@ Mutator has three subcommands: `listen`, `create` and `add-column`.
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-mutator:1.7.1 \\
snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\
add-column \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -264,7 +264,7 @@ The specified schema must be present in one of the Iglu registries in the resolv
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-mutator:1.7.1 \\
snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x} \\
create \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
@@ -281,7 +281,7 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta
<CodeBlock language="bash">{
`docker run \\
-v /path/to/configs:/configs \\
snowplow/snowplow-bigquery-repeater:1.7.1 \\
snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x} \\
--config=/configs/bigquery.hocon \\
--resolver=/configs/resolver.json \\
--bufferSize=20 \\ # size of the batch to send to the dead-letter bucket
@@ -297,19 +297,19 @@ We recommend constantly running Repeater on a small / cheap node or Docker conta
All applications are available as Docker images on Docker Hub, based on Ubuntu Focal and OpenJDK 11:

<CodeBlock language="bash">{
`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1
$ docker pull snowplow/snowplow-bigquery-loader:1.7.1
$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1
$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1
`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x}
$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader1x}
$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x}
$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x}
`}</CodeBlock>

<p>We also provide an alternative lightweight set of images based on <a href="https://github.com/GoogleContainerTools/distroless">Google's "distroless" base image</a>, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the <code>{`1.7.1-distroless`}</code> tag:</p>
<p>We also provide an alternative lightweight set of images based on <a href="https://github.com/GoogleContainerTools/distroless">Google's "distroless" base image</a>, which may provide some security advantages for carrying fewer dependencies. These images are distinguished with the <code>{`${versions.bqLoader1x}-distroless`}</code> tag:</p>

<CodeBlock language="bash">{
`$ docker pull snowplow/snowplow-bigquery-streamloader:1.7.1-distroless
$ docker pull snowplow/snowplow-bigquery-loader:1.7.1-distroless
$ docker pull snowplow/snowplow-bigquery-mutator:1.7.1-distroless
$ docker pull snowplow/snowplow-bigquery-repeater:1.7.1-distroless
`$ docker pull snowplow/snowplow-bigquery-streamloader:${versions.bqLoader1x}-distroless
$ docker pull snowplow/snowplow-bigquery-loader:${versions.bqLoader1x}-distroless
$ docker pull snowplow/snowplow-bigquery-mutator:${versions.bqLoader1x}-distroless
$ docker pull snowplow/snowplow-bigquery-repeater:${versions.bqLoader1x}-distroless
`}</CodeBlock>
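
The effect of replacing the hard-coded `1.7.1` with `${versions.bqLoader1x}` can be sketched in plain JavaScript. The `versions` object below is a local stand-in for the one exported from `src/componentVersions.js`, and `imageTag` is a hypothetical helper used only for illustration.

```javascript
// Stand-in for the object exported from src/componentVersions.js.
const versions = { bqLoader1x: "1.7.1" };

// Hypothetical helper: build the Docker image reference for one of the
// 1.x applications (streamloader, loader, mutator, repeater).
function imageTag(app, { distroless = false } = {}) {
  const suffix = distroless ? "-distroless" : "";
  return `snowplow/snowplow-bigquery-${app}:${versions.bqLoader1x}${suffix}`;
}

console.log(imageTag("streamloader"));
console.log(imageTag("mutator", { distroless: true }));
```

Because every snippet reads the version from one place, a later 1.x release only needs a single change in `componentVersions.js`.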

Mutator, Repeater and StreamLoader are also available as fat JAR files attached to [releases](https://github.com/snowplow-incubator/snowplow-bigquery-loader/releases) in the project's GitHub repository.
@@ -53,53 +53,12 @@ There are two main types of schema changes:

**Non-breaking**: The schema version can be changed in a minor way (`1-2-3` → `1-3-0` or `1-2-3` → `1-2-4`). Data is stored in the same database column.

### Without recovery columns

Loader tries to format the incoming data according to the latest version of the schema it saw (for a given major version, e.g. `1-*-*`). For example, if a batch contains events with schema versions `1-0-0`, `1-0-1` and `1-0-2`, the loader derives the output schema based on version `1-0-2`. Then the loader instructs BigQuery to adjust the database column and load the data.

This logic relies on two assumptions:

1. **Old events compatible with new schemas.** Events with older schema versions, e.g. `1-0-0` and `1-0-1`, have to be valid against the newer ones, e.g. `1-0-2`. Those that are not valid will result in failed events.

2. **Old columns compatible with new schemas.** The corresponding BigQuery columns have to be migrated correctly from one version to another. Changes, such as altering the type of a field from `integer` to `string`, would fail. Loading would break with SQL errors and the whole batch would be stuck and hard to recover.

These assumptions are not always clear to the users, making the process error-prone.
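
The "latest version per major version" rule described above can be sketched as follows. This is a minimal illustration, not the loader's actual code: the function names `parseSchemaVer` and `latestPerMajor` are hypothetical, and versions are assumed to be SchemaVer strings of the form `MODEL-REVISION-ADDITION`.

```javascript
// Illustrative only: not the loader's implementation.

// Parse a SchemaVer string such as "1-0-2" into numeric parts.
function parseSchemaVer(version) {
  const match = /^(\d+)-(\d+)-(\d+)$/.exec(version);
  if (match === null) {
    throw new Error(`Not a SchemaVer string: ${version}`);
  }
  return {
    model: Number(match[1]),
    revision: Number(match[2]),
    addition: Number(match[3]),
    raw: version,
  };
}

// For each major version (model), keep the highest version seen in the batch.
// Parts are compared numerically, so "1-0-10" is newer than "1-0-9".
function latestPerMajor(versionsInBatch) {
  const latest = new Map();
  for (const raw of versionsInBatch) {
    const v = parseSchemaVer(raw);
    const current = latest.get(v.model);
    const isNewer =
      current === undefined ||
      v.revision > current.revision ||
      (v.revision === current.revision && v.addition > current.addition);
    if (isNewer) {
      latest.set(v.model, v);
    }
  }
  // Return a map from major version to the original version string.
  return new Map([...latest].map(([model, v]) => [model, v.raw]));
}
```

For the batch in the example above, `latestPerMajor(["1-0-0", "1-0-1", "1-0-2"]).get(1)` returns `"1-0-2"`.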

### With recovery columns

First, we support schema evolution that’s not strictly backwards compatible (although we still recommend against it since it can confuse downstream consumers of the data). This is done by _merging_ multiple schemas so that both old and new events can coexist. For example, suppose we have these two schemas:

```json
{
// 1-0-0
"properties": {
"a": {"type": "integer"}
}
}
```

```json
{
// 1-0-1
"properties": {
"b": {"type": "integer"}
}
}
```

These would be merged into the following:
```json
{
// merged
"properties": {
"a": {"type": "integer"},
"b": {"type": "integer"}
}
}
```
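
The merge shown above can be sketched as a small function. This is a simplified illustration under stated assumptions, not the loader's implementation: `mergeProperties` is a hypothetical name, only top-level `properties` are considered, and a type conflict is reported with an error rather than handled with a recovery column.

```javascript
// Illustrative only: merges the top-level "properties" of two schemas.
function mergeProperties(older, newer) {
  // Start from the older schema's fields so old events remain representable.
  const merged = { ...older.properties };
  for (const [name, definition] of Object.entries(newer.properties ?? {})) {
    const existing = merged[name];
    // A field whose type changes cannot share a column with the old data.
    if (existing !== undefined && existing.type !== definition.type) {
      throw new Error(
        `Field "${name}" changes type from ${existing.type} to ${definition.type}`
      );
    }
    merged[name] = definition;
  }
  return { properties: merged };
}

const v100 = { properties: { a: { type: "integer" } } };
const v101 = { properties: { b: { type: "integer" } } };
console.log(JSON.stringify(mergeProperties(v100, v101)));
```

Applied to the `1-0-0` and `1-0-1` schemas above, the result contains both `a` and `b`, which matches the merged schema shown.
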
### Recovering from invalid schema evolution

Let's consider these two schemas as an example of breaking schema evolution (changing the type of a field from `integer` to `string`) using the same major version (`1-0-0` and `1-0-1`):

Second, the loader does not fail when it can’t modify the database column to store both old and new events. (As a reminder, an example would be changing the type of a field from `integer` to `string`.) Instead, it creates a _temporary_ column for the new data as an exception. The users can then run SQL statements to resolve this situation as they see fit. For instance, consider these two schemas:
```json
{
// 1-0-0
1 change: 1 addition & 0 deletions docs/storing-querying/loading-process/index.md
@@ -27,6 +27,7 @@ We load data into Redshift using the [RDB Loader](/docs/pipeline-components-and-
</TabItem>
<TabItem value="bigquery" label="BigQuery">
<Tabs groupId="bigquery-loader-version" queryString lazy>
We load data into BigQuery using the [BigQuery Loader](/docs/pipeline-components-and-applications/loaders-storage-targets/bigquery-loader/index.md).
<TabItem value="v2" label="Version 2.x" default>
<BigQueryLoaderDiagramV2/>
</TabItem>
4 changes: 2 additions & 2 deletions docs/storing-querying/schemas-in-warehouse/index.md
@@ -381,9 +381,9 @@ If you are [modeling your data with dbt](/docs/modeling-your-data/modeling-your-

:::info Breaking changes

While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing the type of a field from a `string` to a `number`), this is not particularly relevant for BigQuery. Indeed, each schema version gets its own column, so there is no difference between major and minor versions. That said, we believe sticking to our recommendation is a good idea:
While our recommendation is to use major schema versions to indicate breaking changes (e.g. changing the type of a field from a `string` to a `number`), this is not particularly relevant for BigQuery Loader version 1.x. Indeed, each schema version gets its own column, so there is no difference between major and minor versions. That said, we believe sticking to our recommendation is a good idea:
* Breaking changes might affect downstream consumers of the data, even if they don’t affect BigQuery
* In the future, you might decide to migrate to a different data warehouse where our rules are stricter (e.g. Databricks)
* Version 2 of the loader has stricter behavior that matches our loaders for other warehouses and lakes

:::

1 change: 1 addition & 0 deletions src/componentVersions.js
@@ -30,6 +30,7 @@ export const versions = {

// Loaders
bqLoader: '2.0.0',
bqLoader1x: '1.7.1',
esLoader: '2.1.2',
gcsLoader: '0.5.5',
postgresLoader: '0.3.3',