diff --git a/docs/release_notes/upgrading_to_destinations_v2.md b/docs/release_notes/upgrading_to_destinations_v2.md index 8cf9bcad60ec1..ca67337788380 100644 --- a/docs/release_notes/upgrading_to_destinations_v2.md +++ b/docs/release_notes/upgrading_to_destinations_v2.md @@ -7,12 +7,13 @@ import {SnowflakeMigrationGenerator, BigQueryMigrationGenerator} from './destina ## What is Destinations V2? Starting today, Airbyte Destinations V2 provides you with: -* One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables. -* Improved error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data. -* Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your destination schema with raw data tables. -* Incremental delivery for large syncs: Data will be incrementally delivered to your final tables. No more waiting hours to see the first rows in your destination table. -To see more details and examples on the contents of the Destinations V2 release, see this [guide](../understanding-airbyte/typing-deduping.md). The remainder of this page will walk you through upgrading connectors from legacy normalization to Destinations V2. +- One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables. +- Improved error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data. +- Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your destination schema with raw data tables. +- Incremental delivery for large syncs: Data will be incrementally delivered to your final tables. No more waiting hours to see the first rows in your destination table. + +To see more details and examples on the contents of the Destinations V2 release, see this [guide](understanding-airbyte/typing-deduping.md). The remainder of this page will walk you through upgrading connectors from legacy normalization to Destinations V2. ## Deprecating Legacy Normalization @@ -26,15 +27,15 @@ As a Cloud user, existing connections using legacy normalization will be paused The following table details the delivered data modified by Destinations V2: -| Current Normalization Setting | Source Type | Impacted Data (Breaking Changes) | -|-----------------------------------|--------------------------------------- |--------------------------------------------------------------------------------| -| Raw JSON | All | `_airbyte` metadata columns, raw table location | -| Normalized tabular data | API Source | Unnested tables, `_airbyte` metadata columns, SCD tables | -| Normalized tabular data | Tabular Source (database, file, etc.) 
| `_airbyte` metadata columns, SCD tables | +| Current Normalization Setting | Source Type | Impacted Data (Breaking Changes) | +| ----------------------------- | ------------------------------------- | -------------------------------------------------------- | +| Raw JSON | All | `_airbyte` metadata columns, raw table location | +| Normalized tabular data | API Source | Unnested tables, `_airbyte` metadata columns, SCD tables | +| Normalized tabular data | Tabular Source (database, file, etc.) | `_airbyte` metadata columns, SCD tables | ![Airbyte Destinations V2 Column Changes](./assets/destinations-v2-column-changes.png) -Whenever possible, we've taken this opportunity to use the best data type for storing JSON for your querying convenience. For example, `destination-bigquery` now loads `JSON` blobs as type `JSON` in BigQuery (introduced last [year](https://cloud.google.com/blog/products/data-analytics/bigquery-now-natively-supports-semi-structured-data)), instead of type `string`. +Whenever possible, we've taken this opportunity to use the best data type for storing JSON for your querying convenience. For example, `destination-bigquery` now loads `JSON` blobs as type `JSON` in BigQuery (introduced last [year](https://cloud.google.com/blog/products/data-analytics/bigquery-now-natively-supports-semi-structured-data)), instead of type `string`. ## Quick Start to Upgrading @@ -43,6 +44,7 @@ The quickest path to upgrading is to click upgrade on any out-of-date connection ![Upgrade Path](./assets/airbyte_destinations_v2_upgrade_prompt.png) After upgrading the out-of-date destination to a [Destinations V2 compatible version](#destinations-v2-effective-versions), the following will occur at the next sync **for each connection** sending data to the updated destination: + 1. Existing raw tables replicated to this destination will be copied to a new `airbyte` schema. 2. The new raw tables will be updated to the new Destinations V2 format. 3. The new raw tables will be updated with any new data since the last sync, like normal. @@ -53,12 +55,13 @@ Pre-existing raw tables, SCD tables and "unnested" tables will always be left un Each destination version is managed separately, so if you have multiple destinations, they all need to be upgraded one by one. Versions are tied to the destination. When you update the destination, **all connections tied to that destination will be sending data in the Destinations V2 format**. 
For upgrade paths that will minimize disruption to existing dashboards, see: -* [Upgrading Connections One by One with Dual-Writing](#upgrading-connections-one-by-one-with-dual-writing) -* [Testing Destinations V2 on a Single Connection](#testing-destinations-v2-for-a-single-connection) -* [Upgrading Connections One by One Using CDC](#upgrade-paths-for-connections-using-cdc) -* [Upgrading as a User of Raw Tables](#upgrading-as-a-user-of-raw-tables) -* [Rolling back to Legacy Normalization](#oss-only-rolling-back-to-legacy-normalization) - + +- [Upgrading Connections One by One with Dual-Writing](#upgrading-connections-one-by-one-with-dual-writing) +- [Testing Destinations V2 on a Single Connection](#testing-destinations-v2-for-a-single-connection) +- [Upgrading Connections One by One Using CDC](#upgrade-paths-for-connections-using-cdc) +- [Upgrading as a User of Raw Tables](#upgrading-as-a-user-of-raw-tables) +- [Rolling back to Legacy Normalization](#oss-only-rolling-back-to-legacy-normalization) + ## Advanced Upgrade Paths ### Upgrading Connections One by One with Dual-Writing @@ -67,7 +70,7 @@ Dual writing is a method employed during upgrades where new incoming data is wri #### Steps to Follow for All Sync Modes -1. **[Open Source]** Update the default destination version for your workspace to a [Destinations V2 compatible version](#destinations-v2-effective-versions). This sets the default version for any newly created destination. All existing syncs will remain on their current version. +1. **[Open Source]** Update the default destination version for your workspace to a [Destinations V2 compatible version](#destinations-v2-effective-versions). This sets the default version for any newly created destination. All existing syncs will remain on their current version. ![Upgrade your default destination version](assets/airbyte_version_upgrade.png) @@ -104,6 +107,7 @@ These steps allow you to dual-write for connections incrementally syncing data w ### Testing Destinations V2 for a Single Connection You may want to verify the format of updated data for a single connection. To do this: + 1. If all of the streams you are looking to test with are in **full refresh mode**, follow the [steps for upgrading connections one by one](#steps-to-follow-for-all-sync-modes). Ensure any connections you create have a `Manual` replication frequency. 2. For any streams in **incremental** sync modes, follow the [steps for upgrading incremental syncs](#additional-steps-for-incremental-sync-modes). For testing, you do not need to copy pre-existing raw data. By solely inheriting state from a pre-existing connection, enabling a sync will provide a sample of the most recent data in the updated format for testing. @@ -112,10 +116,11 @@ When you are done testing, you can disable or delete this testing connection, an ### Upgrading as a User of Raw Tables If you have written downstream transformations directly from the output of raw tables, or use the "Raw JSON" normalization setting, you should know that: -* Multiple column names are being updated (from `airbyte_ab_id` to `airbyte_raw_id`, and `airbyte_emitted_at` to `airbyte_extracted_at`). -* The location of raw tables will from now on default to an `airbyte` schema in your destination. -* When you upgrade to a [Destinations V2 compatible version](#destinations-v2-effective-versions) of your destination, we will never alter your existing raw data. Although existing downstream dashboards will go stale, they will never be broken. 
-* You can dual write by following the [steps above](#upgrading-connections-one-by-one-with-dual-writing) and copying your raw data to the schema of your newly created connection. + +- Multiple column names are being updated (from `airbyte_ab_id` to `airbyte_raw_id`, and `airbyte_emitted_at` to `airbyte_extracted_at`). +- The location of raw tables will from now on default to an `airbyte` schema in your destination. +- When you upgrade to a [Destinations V2 compatible version](#destinations-v2-effective-versions) of your destination, we will never alter your existing raw data. Although existing downstream dashboards will go stale, they will never be broken. +- You can dual write by following the [steps above](#upgrading-connections-one-by-one-with-dual-writing) and copying your raw data to the schema of your newly created connection. We may make further changes to raw tables in the future, as these tables are intended to be a staging ground for Airbyte to optimize the performance of your syncs. We cannot guarantee the same level of stability as for final tables in your destination schema. @@ -123,24 +128,24 @@ We may make further changes to raw tables in the future, as these tables are int For each [CDC-supported](https://docs.airbyte.com/understanding-airbyte/cdc) source connector, we recommend the following: -| CDC Source | Recommendation | Notes | -|------------ |----------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Postgres | [Upgrade connection in place](#quick-start-to-upgrading) | You can optionally dual write, but this requires resyncing historical data from the source. You must create a new Postgres source with a different replication slot than your existing source to preserve the integrity of your existing connection. | -| MySQL | [All above upgrade paths supported](#advanced-upgrade-paths) | You can upgrade the connection in place, or dual write. When dual writing, Airbyte can leverage the state of an existing, active connection to ensure historical data is not re-replicated from MySQL. | -| SQL Server | [Upgrade connection in place](#quick-start-to-upgrading) | You can optionally dual write, but this requires resyncing historical data from the SQL Server source. | +| CDC Source | Recommendation | Notes | +| ---------- | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Postgres | [Upgrade connection in place](#quick-start-to-upgrading) | You can optionally dual write, but this requires resyncing historical data from the source. You must create a new Postgres source with a different replication slot than your existing source to preserve the integrity of your existing connection. | +| MySQL | [All above upgrade paths supported](#advanced-upgrade-paths) | You can upgrade the connection in place, or dual write. When dual writing, Airbyte can leverage the state of an existing, active connection to ensure historical data is not re-replicated from MySQL. 
| +| SQL Server | [Upgrade connection in place](#quick-start-to-upgrading) | You can optionally dual write, but this requires resyncing historical data from the SQL Server source. | ## Destinations V2 Compatible Versions For each destination connector, Destinations V2 is effective as of the following versions: -| Destination Connector | Safe Rollback Version | Destinations V2 Compatible | -|----------------------- |----------------------- |------------------------------| -| BigQuery | 1.4.4 | 2.0.0+ | -| Snowflake | 0.4.1 | 2.0.0+ | -| Redshift | 0.4.8 | 2.0.0+ | -| MSSQL | 0.1.24 | 2.0.0+ | -| MySQL | 0.1.20 | 2.0.0+ | -| Oracle | 0.1.19 | 2.0.0+ | -| TiDB | 0.1.3 | 2.0.0+ | -| DuckDB | 0.1.0 | 2.0.0+ | -| Clickhouse | 0.2.3 | 2.0.0+ | +| Destination Connector | Safe Rollback Version | Destinations V2 Compatible | +| --------------------- | --------------------- | -------------------------- | +| BigQuery | 1.4.4 | 2.0.0+ | +| Snowflake | 0.4.1 | 2.0.0+ | +| Redshift | 0.4.8 | 2.0.0+ | +| MSSQL | 0.1.24 | 2.0.0+ | +| MySQL | 0.1.20 | 2.0.0+ | +| Oracle | 0.1.19 | 2.0.0+ | +| TiDB | 0.1.3 | 2.0.0+ | +| DuckDB | 0.1.0 | 2.0.0+ | +| Clickhouse | 0.2.3 | 2.0.0+ | diff --git a/docs/understanding-airbyte/typing-deduping.md b/docs/understanding-airbyte/typing-deduping.md index 4eab218724c69..257bca5668844 100644 --- a/docs/understanding-airbyte/typing-deduping.md +++ b/docs/understanding-airbyte/typing-deduping.md @@ -1,16 +1,38 @@ # Typing and Deduping -This page refers to new functionality currently available in **early access**. Typing and deduping will become the new default method of transforming datasets within data warehouse and database destinations after they've been replicated. This functionality is going live with [Destinations V2](https://github.com/airbytehq/airbyte/issues/26028), which is now in early access for BigQuery. +This page refers to new functionality currently available in **early access**. Typing and deduping will become the new default method of transforming datasets within data warehouse and database destinations after they've been replicated. This functionality is going live with [Destinations V2](/release_notes/upgrading_to_destinations_v2/), which is now in early access for BigQuery. -You will eventually be required to upgrade your connections to use the new destination versions. We are building tools for you to copy your connector’s configuration to a new version to make testing new destinations easier. These will be available in the next few weeks. +You will eventually be required to upgrade your connections to use the new destination versions. We are building tools for you to copy your connector’s configuration to a new version to make testing new destinations easier. These will be available in the next few weeks. ## What is Destinations V2? -At launch, Airbyte Destinations V2 will provide: -* One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables. -* Improved per-row error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data. -* Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your desired schema with raw data tables. -* Incremental delivery for large syncs: Data will be incrementally delivered to your final tables when possible. 
No more waiting hours to see the first rows in your destination table.
+At launch, [Airbyte Destinations V2](/release_notes/upgrading_to_destinations_v2) will provide:
+
+- One-to-one table mapping: Data in one stream will always be mapped to one table in your data warehouse. No more sub-tables.
+- Improved per-row error handling with `_airbyte_meta`: Airbyte will now populate typing errors in the `_airbyte_meta` column instead of failing your sync. You can query these results to audit misformatted or unexpected data.
+- Internal Airbyte tables in the `airbyte_internal` schema: Airbyte will now generate all raw tables in the `airbyte_internal` schema. We no longer clutter your desired schema with raw data tables.
+- Incremental delivery for large syncs: Data will be incrementally delivered to your final tables when possible. No more waiting hours to see the first rows in your destination table.
+
+## `_airbyte_meta` Errors
+
+"Per-row error handling" is a new paradigm for Airbyte which provides greater flexibility for our users. Airbyte now separates `data-moving problems` from `data-content problems`. Prior to Destinations V2, both types of errors were handled the same way: by failing the sync. Now, a failing sync means that Airbyte could not _move_ all of your data. You can query the `_airbyte_meta` column to see which rows failed for _content_ reasons, and why. This is a more flexible approach, as you can now decide how to handle rows with errors on a case-by-case basis.
+
+:::tip
+When using data downstream from Airbyte, we generally recommend you only include rows which do not have an error, e.g.:
+
+```sql
+-- postgres syntax
+SELECT COUNT(*) FROM _table_ WHERE json_array_length(_airbyte_meta -> 'errors') = 0
+```
+
+:::
+
+The types of errors which will be stored in `_airbyte_meta.errors` include:
+
+- **Typing errors**: the source declared that the type of the column `id` should be an integer, but a string value was returned.
+- **Size errors**: the source returned content which cannot be stored within this row or column (e.g. [a Redshift SUPER column has a 16MB limit](https://docs.aws.amazon.com/redshift/latest/dg/limitations-super.html)).
+
+Depending on your use case, it may still be valuable to consider rows with errors, especially for aggregations. For example, you may have a table `user_reviews`, and you would like to know the count of new reviews received today. You can choose to include reviews regardless of whether your data warehouse had difficulty storing the full contents of the `message` column. For this use case, `SELECT COUNT(*) FROM user_reviews WHERE DATE(created_at) = DATE(NOW())` is still valid.
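+
+Whichever approach you take, you can always inspect the failing rows directly. Below is a minimal sketch in Postgres syntax, assuming a final table named `users` that carries the standard Destinations V2 columns (`_airbyte_raw_id`, `_airbyte_extracted_at`, `_airbyte_meta`) and that `_airbyte_meta` is stored as a JSON column:
+
+```sql
+-- postgres syntax; `users` is a hypothetical final table name
+SELECT
+  _airbyte_raw_id,
+  _airbyte_extracted_at,
+  _airbyte_meta -> 'errors' AS errors
+FROM users
+WHERE json_array_length(_airbyte_meta -> 'errors') > 0;
+```
+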
## Destinations V2 Example @@ -30,23 +52,23 @@ Consider the following [source schema](https://docs.airbyte.com/integrations/sou The data from one stream will now be mapped to one table in your schema as below: -#### Destination Table Name: *public.users* +#### Destination Table Name: _public.users_ -| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_extracted_at | _airbyte_meta | id | first_name | age | address | -|----------------------------------------------- |----------------- |--------------------- |-------------------------------------------------------------------------- |---- |------------ |------ |--------------------------------------------- | -| Successful typing and de-duping ⟶ | xxx-xxx-xxx | 2022-01-01 12:00:00 | {} | 1 | sarah | 39 | { city: “San Francisco”, zip: “94131” } | -| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | 2022-01-01 12:00:00 | { errors: {[“fish” is not a valid integer for column “age”]} | 2 | evan | NULL | { city: “Menlo Park”, zip: “94002” } | -| Not-yet-typed ⟶ | | | | | | | | +| _(note, not in actual table)_ | \_airbyte_raw_id | \_airbyte_extracted_at | \_airbyte_meta | id | first_name | age | address | +| -------------------------------------------- | ---------------- | ---------------------- | ------------------------------------------------------------ | --- | ---------- | ---- | --------------------------------------- | +| Successful typing and de-duping ⟶ | xxx-xxx-xxx | 2022-01-01 12:00:00 | {} | 1 | sarah | 39 | { city: “San Francisco”, zip: “94131” } | +| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | 2022-01-01 12:00:00 | { errors: {[“fish” is not a valid integer for column “age”]} | 2 | evan | NULL | { city: “Menlo Park”, zip: “94002” } | +| Not-yet-typed ⟶ | | | | | | | | In legacy normalization, columns of [Airbyte type](https://docs.airbyte.com/understanding-airbyte/supported-data-types/#the-types) `Object` in the Destination were "unnested" into separate tables. In this example, with Destinations V2, the previously unnested `public.users_address` table with columns `city` and `zip` will no longer be generated. 
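+
+Instead, the nested fields remain available on the `address` column of the final table and can be queried directly. As a rough sketch in BigQuery syntax, assuming a BigQuery destination where object columns such as `address` are loaded as the native `JSON` type:
+
+```sql
+-- BigQuery syntax; `public.users` is the example final table described above
+SELECT
+  id,
+  first_name,
+  JSON_VALUE(address, '$.city') AS city,
+  JSON_VALUE(address, '$.zip') AS zip
+FROM public.users;
+```
+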
-#### Destination Table Name: *airbyte.raw_public_users* (`airbyte.{namespace}_{stream}`) +#### Destination Table Name: _airbyte.raw_public_users_ (`airbyte.{namespace}_{stream}`) -| *(note, not in actual table)* | _airbyte_raw_id | _airbyte_data | _airbyte_loaded_at | _airbyte_extracted_at | -|----------------------------------------------- |----------------- |------------------------------------------------------------------------------------------------------------- |---------------------- |--------------------- | -| Successful typing and de-duping ⟶ | xxx-xxx-xxx | { id: 1, first_name: “sarah”, age: 39, address: { city: “San Francisco”, zip: “94131” } } | 2022-01-01 12:00:001 | 2022-01-01 12:00:00 | -| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | { id: 2, first_name: “evan”, age: “fish”, address: { city: “Menlo Park”, zip: “94002” } } | 2022-01-01 12:00:001 | 2022-01-01 12:00:00 | -| Not-yet-typed ⟶ | zzz-zzz-zzz | { id: 3, first_name: “edward”, age: 35, address: { city: “Sunnyvale”, zip: “94003” } } | NULL | 2022-01-01 13:00:00 | +| _(note, not in actual table)_ | \_airbyte_raw_id | \_airbyte_data | \_airbyte_loaded_at | \_airbyte_extracted_at | +| -------------------------------------------- | ---------------- | ----------------------------------------------------------------------------------------- | -------------------- | ---------------------- | +| Successful typing and de-duping ⟶ | xxx-xxx-xxx | { id: 1, first_name: “sarah”, age: 39, address: { city: “San Francisco”, zip: “94131” } } | 2022-01-01 12:00:001 | 2022-01-01 12:00:00 | +| Failed typing that didn’t break other rows ⟶ | yyy-yyy-yyy | { id: 2, first_name: “evan”, age: “fish”, address: { city: “Menlo Park”, zip: “94002” } } | 2022-01-01 12:00:001 | 2022-01-01 12:00:00 | +| Not-yet-typed ⟶ | zzz-zzz-zzz | { id: 3, first_name: “edward”, age: 35, address: { city: “Sunnyvale”, zip: “94003” } } | NULL | 2022-01-01 13:00:00 | You also now see the following changes in Airbyte-provided columns: @@ -54,15 +76,15 @@ You also now see the following changes in Airbyte-provided columns: ## Participating in Early Access -You can start using Destinations V2 for BigQuery in early access by following the below instructions: +You can start using Destinations V2 for BigQuery or Snowflake in early access by following the below instructions: -1. **Upgrade your BigQuery Destination**: If you are using Airbyte Open Source, update your BigQuery destination version to the latest version. If you are a Cloud customer, this step will already be completed on your behalf. -2. **Enabling Destinations V2**: Create a new BigQuery destination, and enable the Destinations V2 option under `Advanced` settings. You will need your BigQuery credentials for this step. For this early release, we ask that you enable Destinations V2 on a new BigQuery destination using new connections. When Destinations V2 is fully available, there will be additional migration paths for upgrading your destination without resetting any of your existing connections. - 1. If your previous BigQuery destination is using “GCS Staging”, you can reuse the same staging bucket. - 2. Do not enable Destinations V2 on your previous / existing BigQuery destination during early release. It will cause your existing connections to fail. +1. **Upgrade your Destination**: If you are using Airbyte Open Source, update your destination version to the latest version. If you are a Cloud customer, this step will already be completed on your behalf. +2. 
**Enabling Destinations V2**: Create a new destination, and enable the Destinations V2 option under `Advanced` settings. You will need your data warehouse credentials for this step. For this early release, we ask that you enable Destinations V2 on a new destination using new connections. When Destinations V2 is fully available, there will be additional migration paths for upgrading your destination without resetting any of your existing connections. + 1. If your previous BigQuery destination is using “GCS Staging”, you can reuse the same staging bucket. + 2. Do not enable Destinations V2 on your previous / existing destinations during early release. It will cause your existing connections to fail. 3. **Create a New Connection**: Create connections using the new BigQuery destination. These will automatically use Destinations V2. - 1. If your new destination has the same default namespace, you may want to add a stream prefix to avoid collisions in the final tables. - 2. Do not modify the ‘Transformation’ settings. These will be ignored. + 1. If your new destination has the same default namespace, you may want to add a stream prefix to avoid collisions in the final tables. + 2. Do not modify the ‘Transformation’ settings. These will be ignored. 4. **Monitor your Sync**: Wait at least 20 minutes, or until your sync is complete. Verify the data in your destination is correct. Congratulations, you have successfully upgraded your connection to Destinations V2! Once you’ve completed the setup for Destinations V2, we ask that you pay special attention to the data delivered in your destination. Let us know immediately if you see any unexpected data: table and column name changes, missing columns, or columns with incorrect types.
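+
+One quick way to spot such issues after your first sync is to compare total row counts against rows carrying errors and to check when data was last extracted. A minimal sketch in BigQuery syntax, assuming a hypothetical `users` stream synced into the `public` namespace:
+
+```sql
+-- BigQuery syntax; `public.users` is a hypothetical Destinations V2 final table
+SELECT
+  COUNT(*) AS total_rows,
+  MAX(_airbyte_extracted_at) AS latest_extracted_at,
+  COUNTIF(ARRAY_LENGTH(JSON_QUERY_ARRAY(_airbyte_meta, '$.errors')) > 0) AS rows_with_errors
+FROM public.users;
+```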