Add public docs for typing & deduping #28902

Merged · 4 commits · Aug 1, 2023
97 changes: 42 additions & 55 deletions docs/release_notes/upgrading_to_destinations_v2.md
@@ -2,59 +2,21 @@

## What is Destinations V2?

At launch, Airbyte Destinations V2 provides:
* One-to-one mapping: Data from one stream (endpoint or table) will now create one table in the destination, making it simpler and more efficient.
* Improved error handling: Typing errors will no longer fail your sync, ensuring smoother data integration processes.
* Auditable typing errors: Typing errors will now be easily visible in a new _airbyte_meta column, allowing for better tracking of inconsistencies and resolution of issues.
* Incremental data loading: Data will become visible in the destination as it is loaded.

## Destinations V2 Example

Consider the following [source schema](https://docs.airbyte.com/integrations/sources/faker) for stream `users`:

```json
{
  "id": "number",
  "first_name": "string",
  "age": "number",
  "address": {
    "city": "string",
    "zip": "string"
  }
}
```

The data from one stream will now be mapped to one table in your schema as below. Highlights:
* Improved error handling with `_airbyte_meta`: Airbyte will populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data.
* Internal Airbyte tables in the `airbyte` schema: Airbyte will now generate all raw tables in the `airbyte` schema. You can use these tables to investigate raw data, but please note the format of the tables in `airbyte` may change at any time.

#### Destination Table Name: *public.users*

| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_extracted_at | _airbyte_meta | id | first_name | age | address |
|----------------------------------------------- |----------------- |--------------------- |-------------------------------------------------------------------------- |---- |------------ |------ |--------------------------------------------- |
| Successful typing and de-duping ⟶ | xxx-xxx-xxx | 2022-01-01 12:00:00 | {} | 1 | sarah | 39 | { city: “San Francisco”, zip: “94131” } |
| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | 2022-01-01 12:00:00 | { errors: [“fish” is not a valid integer for column “age”] } | 2 | evan | NULL | { city: “Menlo Park”, zip: “94002” } |
| Not-yet-typed ⟶ | | | | | | | |

In legacy normalization, columns of [Airbyte type](https://docs.airbyte.com/understanding-airbyte/supported-data-types/#the-types) `Object` in the Destination were "unnested" into separate tables. In this example, with Destinations V2, the previously unnested `public.users_address` table with columns `city` and `zip` will no longer be generated.
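To make the audit described above concrete, here is a minimal Python sketch of filtering rows whose `_airbyte_meta` column records a typing error. The rows and the exact error format are illustrative, loosely following the example table (in practice you would run an equivalent query in your warehouse):

```python
import json

# Rows mirroring the public.users example above (values illustrative).
rows = [
    {"_airbyte_raw_id": "xxx-xxx-xxx", "_airbyte_meta": "{}", "id": 1, "age": 39},
    {"_airbyte_raw_id": "yyy-yyy-yyy",
     "_airbyte_meta": json.dumps(
         {"errors": ["'fish' is not a valid integer for column 'age'"]}),
     "id": 2, "age": None},
]

def rows_with_typing_errors(rows):
    """Return rows whose _airbyte_meta column records at least one typing error."""
    return [row for row in rows if json.loads(row["_airbyte_meta"]).get("errors")]

print([r["_airbyte_raw_id"] for r in rows_with_typing_errors(rows)])
# → ['yyy-yyy-yyy']
```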

#### Destination Table Name: *airbyte.raw_public_users* (`airbyte.{namespace}_{stream}`)
Starting today, Airbyte Destinations V2 provides you with:
* One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables.
* Improved error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data.
* Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your destination schema with raw data tables.
* Incremental delivery for large syncs: Data will be incrementally delivered to your final tables. No more waiting hours to see the first rows in your destination table.

| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_data | _airbyte_loaded_at | _airbyte_extracted_at |
|----------------------------------------------- |----------------- |------------------------------------------------------------------------------------------------------------- |---------------------- |--------------------- |
| Successful typing and de-duping ⟶ | xxx-xxx-xxx | { id: 1, first_name: “sarah”, age: 39, address: { city: “San Francisco”, zip: “94131” } } | 2022-01-01 12:00:00 | 2022-01-01 12:00:00 |
| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | { id: 2, first_name: “evan”, age: “fish”, address: { city: “Menlo Park”, zip: “94002” } } | 2022-01-01 12:00:00 | 2022-01-01 12:00:00 |
| Not-yet-typed ⟶ | zzz-zzz-zzz | { id: 3, first_name: “edward”, age: 35, address: { city: “Sunnyvale”, zip: “94003” } } | NULL | 2022-01-01 13:00:00 |
To see more details and examples on the contents of the Destinations V2 release, see this [guide](../understanding-airbyte/typing-deduping.md). The remainder of this page will walk you through upgrading connectors from legacy normalization to Destinations V2.

## Deprecating Legacy Normalization

The upgrade to Destinations V2 is handled by moving your connections to use [updated versions of Airbyte destinations](#destinations-v2-compatible-versions). Existing normalization options, both `Raw data (JSON)` and `Normalized tabular data`, will be unsupported starting **Nov 1, 2023**.

![Legacy Normalization](./assets/airbyte_legacy_normalization.png)

As a Cloud user, existing connections using legacy normalization will be paused on **Oct 1, 2023**. As an Open Source user, you may choose to upgrade at your convenience. However, destination connector versions prior to Destinations V2 will no longer be supported as of **Nov 1, 2023**.

### Breakdown of Breaking Changes

@@ -66,6 +28,8 @@

The following table details the delivered data modified by Destinations V2:
| Normalized tabular data | API Source | Unnested tables, `_airbyte` metadata columns, SCD tables |
| Normalized tabular data | Tabular Source (database, file, etc.) | `_airbyte` metadata columns, SCD tables |

![Airbyte Destinations V2 Column Changes](./assets/destinations-v2-column-changes.png)

Whenever possible, we've taken this opportunity to use the best data type for storing JSON for your querying convenience. For example, `destination-bigquery` now loads `JSON` blobs as type `JSON` in BigQuery (introduced last [year](https://cloud.google.com/blog/products/data-analytics/bigquery-now-natively-supports-semi-structured-data)), instead of type `string`.

## Quick Start to Upgrading
@@ -117,9 +81,38 @@ These steps allow you to dual-write for connections incrementally syncing data w
1. Copy the raw data you've already replicated to the new schema being used by your newly created connection. You need to do this for every stream in the connection with an incremental sync mode. Sample SQL you can run in your data warehouse:

```sql
CREATE TABLE {new_schema}.raw_{stream_name} AS
SELECT *
FROM {old_schema}.raw_{stream_name};
```

For example, in BigQuery:

```sql
BEGIN
DECLARE gcp_project STRING;
DECLARE target_dataset STRING;
DECLARE target_table STRING;
DECLARE source_dataset STRING;
DECLARE source_table STRING;
DECLARE old_table STRING;
DECLARE new_table STRING;

SET gcp_project = '';
SET target_dataset = 'airbyte_internal';
SET target_table = '';
SET source_dataset = '';
SET source_table = '';
SET old_table = CONCAT(gcp_project, '.', source_dataset, '.', source_table);
SET new_table = CONCAT(gcp_project, '.', target_dataset, '.', target_table);

EXECUTE IMMEDIATE FORMAT('''
CREATE OR REPLACE TABLE `%s` (_airbyte_raw_id STRING, _airbyte_data JSON, _airbyte_extracted_at TIMESTAMP, _airbyte_loaded_at TIMESTAMP)
PARTITION BY DATE(_airbyte_extracted_at)
CLUSTER BY _airbyte_extracted_at
AS (
SELECT
_airbyte_ab_id AS _airbyte_raw_id,
PARSE_JSON(_airbyte_data) AS _airbyte_data,
_airbyte_emitted_at AS _airbyte_extracted_at,
CAST(NULL AS TIMESTAMP) AS _airbyte_loaded_at
FROM `%s`
)
''', new_table, old_table);

END;
```
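Since this copy must be repeated for every incremental stream in the connection, it can help to template it. Below is a hedged Python sketch of that idea — the helper, project, dataset, and stream names are all hypothetical, and the template is simplified (it omits the partitioning and clustering of the full script above):

```python
# Hypothetical helper: render a raw-table copy statement per stream.
# Names are illustrative, not an official Airbyte tool.
MIGRATION_TEMPLATE = """
CREATE OR REPLACE TABLE `{project}.{new_dataset}.{stream}` AS
SELECT
  _airbyte_ab_id AS _airbyte_raw_id,
  PARSE_JSON(_airbyte_data) AS _airbyte_data,
  _airbyte_emitted_at AS _airbyte_extracted_at,
  CAST(NULL AS TIMESTAMP) AS _airbyte_loaded_at
FROM `{project}.{old_dataset}.{stream}`
""".strip()

def render_migrations(project, old_dataset, new_dataset, streams):
    """Return one copy statement per incremental stream."""
    return [
        MIGRATION_TEMPLATE.format(
            project=project,
            old_dataset=old_dataset,
            new_dataset=new_dataset,
            stream=stream,
        )
        for stream in streams
    ]

for sql in render_migrations("my-project", "public", "airbyte_internal",
                             ["raw_users", "raw_purchases"]):
    print(sql, end="\n\n")
```

You would then run each rendered statement in your warehouse, once per stream with an incremental sync mode.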

2. Go to your newly created connection, and navigate to the `Settings` tab.
@@ -158,12 +151,6 @@ For each [CDC-supported](https://docs.airbyte.com/understanding-airbyte/cdc) sou
| MySQL | [All above upgrade paths supported](#advanced-upgrade-paths) | You can upgrade the connection in place, or dual write. When dual writing, Airbyte can leverage the state of an existing, active connection to ensure historical data is not re-replicated from MySQL. |
| SQL Server | [Upgrade connection in place](#quick-start-to-upgrading) | You can optionally dual write, but this requires resyncing historical data from the SQL Server source. |

### Rolling back to Legacy Normalization

If you are an Airbyte Cloud customer, and have an urgent need to temporarily roll back to legacy normalization, you can reach out to in-app support (Support -> In-App Support, in Airbyte Cloud) for assistance.

If you are an Airbyte Open Source user, we have published a [rollback version for each destination](#destinations-v2-compatible-versions) that will re-create the final tables with normalization using raw tables in the new format if they are available, and otherwise default to pre-existing raw tables used by legacy normalization.

## Destinations V2 Compatible Versions

For each destination connector, Destinations V2 is effective as of the following versions:
68 changes: 68 additions & 0 deletions docs/understanding-airbyte/typing-deduping.md
@@ -0,0 +1,68 @@
# Typing and Deduping

This page refers to new functionality currently available in **early access**. Typing and deduping will become the new default method of transforming datasets within data warehouse and database destinations after they've been replicated. This functionality is going live with [Destinations V2](https://github.com/airbytehq/airbyte/issues/26028), which is now in early access for BigQuery.

You will eventually be required to upgrade your connections to use the new destination versions. We are building tools for you to copy your connector’s configuration to a new version to make testing new destinations easier. These will be available in the next few weeks.

## What is Destinations V2?

At launch, Airbyte Destinations V2 will provide:
* One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables.
* Improved per-row error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data.
* Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your desired schema with raw data tables.
* Incremental delivery for large syncs: Data will be incrementally delivered to your final tables when possible. No more waiting hours to see the first rows in your destination table.

## Destinations V2 Example

Consider the following [source schema](https://docs.airbyte.com/integrations/sources/faker) for stream `users`:

```json
{
  "id": "number",
  "first_name": "string",
  "age": "number",
  "address": {
    "city": "string",
    "zip": "string"
  }
}
```

The data from one stream will now be mapped to one table in your schema as below:

#### Destination Table Name: *public.users*

| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_extracted_at | _airbyte_meta | id | first_name | age | address |
|----------------------------------------------- |----------------- |--------------------- |-------------------------------------------------------------------------- |---- |------------ |------ |--------------------------------------------- |
| Successful typing and de-duping ⟶ | xxx-xxx-xxx | 2022-01-01 12:00:00 | {} | 1 | sarah | 39 | { city: “San Francisco”, zip: “94131” } |
| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | 2022-01-01 12:00:00 | { errors: [“fish” is not a valid integer for column “age”] } | 2 | evan | NULL | { city: “Menlo Park”, zip: “94002” } |
| Not-yet-typed ⟶ | | | | | | | |

In legacy normalization, columns of [Airbyte type](https://docs.airbyte.com/understanding-airbyte/supported-data-types/#the-types) `Object` in the Destination were "unnested" into separate tables. In this example, with Destinations V2, the previously unnested `public.users_address` table with columns `city` and `zip` will no longer be generated.
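The per-row typing behavior described above can be sketched in Python. This is a simplified illustration of the idea, not Airbyte's actual implementation — the cast rules and error-message format are assumptions:

```python
def type_row(row, schema):
    """Cast each field to its declared type; on failure, null the field and
    record an error in _airbyte_meta instead of failing the whole sync."""
    casts = {"number": float, "string": str}  # assumed, simplified cast table
    typed, errors = {}, []
    for column, declared in schema.items():
        value = row.get(column)
        if isinstance(declared, dict):
            typed[column] = value  # nested object: kept in place, not unnested
            continue
        try:
            typed[column] = casts[declared](value)
        except (TypeError, ValueError):
            typed[column] = None
            errors.append(f"'{value}' is not a valid {declared} for column '{column}'")
    typed["_airbyte_meta"] = {"errors": errors} if errors else {}
    return typed

schema = {"id": "number", "first_name": "string", "age": "number",
          "address": {"city": "string", "zip": "string"}}
row = {"id": 2, "first_name": "evan", "age": "fish",
       "address": {"city": "Menlo Park", "zip": "94002"}}
result = type_row(row, schema)
print(result["age"], result["_airbyte_meta"])
```

Note how the bad `age` value nulls only that column and leaves the rest of the row intact, matching the "failed typing that didn't break other rows" case in the table above.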

#### Destination Table Name: *airbyte_internal.raw_public_users* (`airbyte_internal.{namespace}_{stream}`)

| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_data | _airbyte_loaded_at | _airbyte_extracted_at |
|----------------------------------------------- |----------------- |------------------------------------------------------------------------------------------------------------- |---------------------- |--------------------- |
| Successful typing and de-duping ⟶ | xxx-xxx-xxx | { id: 1, first_name: “sarah”, age: 39, address: { city: “San Francisco”, zip: “94131” } } | 2022-01-01 12:00:00 | 2022-01-01 12:00:00 |
| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | { id: 2, first_name: “evan”, age: “fish”, address: { city: “Menlo Park”, zip: “94002” } } | 2022-01-01 12:00:00 | 2022-01-01 12:00:00 |
| Not-yet-typed ⟶ | zzz-zzz-zzz | { id: 3, first_name: “edward”, age: 35, address: { city: “Sunnyvale”, zip: “94003” } } | NULL | 2022-01-01 13:00:00 |
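De-duping, at its core, keeps the most recently extracted record per primary key. A minimal sketch, assuming a single-column primary key and timestamps that compare lexicographically:

```python
def dedupe(records, primary_key):
    """Keep only the latest record (by _airbyte_extracted_at) per primary key."""
    latest = {}
    for rec in records:
        key = rec[primary_key]
        if (key not in latest
                or rec["_airbyte_extracted_at"] > latest[key]["_airbyte_extracted_at"]):
            latest[key] = rec
    return list(latest.values())

records = [
    {"id": 1, "age": 39, "_airbyte_extracted_at": "2022-01-01 12:00:00"},
    {"id": 1, "age": 40, "_airbyte_extracted_at": "2022-01-01 13:00:00"},  # newer
    {"id": 2, "age": 35, "_airbyte_extracted_at": "2022-01-01 12:00:00"},
]
print(sorted((r["id"], r["age"]) for r in dedupe(records, "id")))
# → [(1, 40), (2, 35)]
```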

You will also see the following changes in Airbyte-provided columns:

![Airbyte Destinations V2 Column Changes](../release_notes/assets/destinations-v2-column-changes.png)

## Participating in Early Access

You can start using Destinations V2 for BigQuery in early access by following the below instructions:

1. **Upgrade your BigQuery Destination**: If you are using Airbyte Open Source, update your BigQuery destination version to the latest version. If you are a Cloud customer, this step will already be completed on your behalf.
2. **Enabling Destinations V2**: Create a new BigQuery destination, and enable the Destinations V2 option under `Advanced` settings. You will need your BigQuery credentials for this step. For this early release, we ask that you enable Destinations V2 on a new BigQuery destination using new connections. When Destinations V2 is fully available, there will be additional migration paths for upgrading your destination without resetting any of your existing connections.
1. If your previous BigQuery destination is using “GCS Staging”, you can reuse the same staging bucket.
2. Do not enable Destinations V2 on your previous / existing BigQuery destination during early release. It will cause your existing connections to fail.
3. **Create a New Connection**: Create connections using the new BigQuery destination. These will automatically use Destinations V2.
1. If your new destination has the same default namespace, you may want to add a stream prefix to avoid collisions in the final tables.
2. Do not modify the ‘Transformation’ settings. These will be ignored.
4. **Monitor your Sync**: Wait at least 20 minutes, or until your sync is complete. Verify the data in your destination is correct. Congratulations, you have successfully upgraded your connection to Destinations V2!

Once you’ve completed the setup for Destinations V2, we ask that you pay special attention to the data delivered in your destination. Let us know immediately if you see any unexpected data: table and column name changes, missing columns, or columns with incorrect types.
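One lightweight way to perform the verification suggested above — watching for table and column name changes or missing columns — is to diff the column sets of your pre- and post-upgrade tables. This is only a sketch: the legacy column names listed are illustrative, and in practice you would read both lists from your warehouse's information schema:

```python
def column_diff(old_columns, new_columns):
    """Report columns that appeared or disappeared across the upgrade."""
    old, new = set(old_columns), set(new_columns)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

# Illustrative column lists for one stream, pre- and post-upgrade.
legacy = ["_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_normalized_at",
          "id", "age"]
v2 = ["_airbyte_raw_id", "_airbyte_extracted_at", "_airbyte_meta",
      "id", "age"]
print(column_diff(legacy, v2))
```

Anything unexpected in the `added`/`removed` lists — beyond the documented metadata-column renames — is worth reporting.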
1 change: 1 addition & 0 deletions docusaurus/sidebars.js
@@ -385,6 +385,7 @@ const understandingAirbyte = {
'understanding-airbyte/airbyte-protocol',
'understanding-airbyte/airbyte-protocol-docker',
'understanding-airbyte/basic-normalization',
'understanding-airbyte/typing-deduping',
{
type: 'category',
label: 'Connections and Sync Modes',