[Docs] No Deduped + History, Append + Deduped is the future! #29114

Merged 3 commits on Aug 8, 2023
62 changes: 31 additions & 31 deletions docs/cloud/core-concepts.md
# Core Concepts

Airbyte enables you to build data pipelines and replicate data from a source to a destination. You can configure how frequently the data is synced, which data is replicated, the format in which the data is written to the destination, and whether the data is stored in raw tables or in a basic normalized (JSON) format.

This page describes the concepts you need to know to use Airbyte.

## Source

A source is an API, file, database, or data warehouse that you want to ingest data from.

## Destination

A destination is a data warehouse, data lake, database, or an analytics tool where you want to load your ingested data.

## Connector

An Airbyte component which pulls data from a source or pushes data to a destination.

## Connection

A connection is an automated data pipeline that replicates data from a source to a destination.

Setting up a connection involves configuring the following parameters:

| Parameter | Description |
| --- | --- |
| Replication frequency | How often should the data sync? |
| [Data residency](https://docs.airbyte.com/cloud/managing-airbyte-cloud/manage-data-residency#choose-the-data-residency-for-a-connection) | Where should the data be processed? |
| Destination Namespace and stream names | Where should the replicated data be written? |
| Catalog selection | Which streams and fields should be replicated from the source to the destination? |
| Sync mode | How should the streams be replicated (read and written)? |
| Optional transformations | How should Airbyte protocol messages (raw JSON blob) data be converted into other data representations? |

## Stream

A stream is a group of related records.

Examples of streams:

- A table in a relational database
- A resource or API endpoint for a REST API
- The records from a directory containing many files in a filesystem

## Field

A field is an attribute of a record in a stream.

Examples of fields:

- A column in a table in a relational database
- A field in an API response
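
For illustration, here is a hypothetical record from a `users` stream; every key in the record is a field. The stream and field names are invented for this example.

```python
# A hypothetical record from a "users" stream; each key is a field.
record = {
    "id": 42,                  # a column in a relational table, or a key in an API response
    "email": "ada@example.com",
    "created_at": "2023-08-01T12:00:00Z",
}
```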

## Namespace

Namespace is a group of streams in a source or destination. Common use cases for namespaces are enforcing permissions, segregating test and production data, and general data organization.

A schema in a relational database system is an example of a namespace.

In a source, the namespace is the location from where the data is replicated to the destination.

In a destination, the namespace is the location where the replicated data is stored.

A sync mode governs how Airbyte reads from a source and writes to a destination. Airbyte provides different sync modes to account for various use cases.

- **Full Refresh | Overwrite:** Sync all records from the source and replace data in the destination by overwriting it.
- **Full Refresh | Append:** Sync all records from the source and add them to the destination without deleting any data.
- **Incremental Sync | Append:** Sync new records from the source and add them to the destination without deleting any data.
- **Incremental Sync | Append + Deduped:** Sync new records from the source and add them to the destination. Also provides a de-duplicated view mirroring the state of the stream in the source.
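
To make the contrast concrete, here is a minimal Python sketch of the difference between **Append** and **Append + Deduped**, assuming records carry a primary key (`id`) and a cursor field (`updated_at`); the data and logic are illustrative, not Airbyte's actual implementation.

```python
# A minimal sketch: Append keeps every synced record, while Append + Deduped
# also maintains a view with one row per primary key, chosen by the cursor.
# Records, keys, and field names here are hypothetical.

records = [
    {"id": 1, "email": "a@example.com", "updated_at": "2023-08-01"},
    {"id": 2, "email": "b@example.com", "updated_at": "2023-08-02"},
    {"id": 1, "email": "a+new@example.com", "updated_at": "2023-08-03"},  # update to id 1
]

# Incremental | Append: updates show up as extra rows (id 1 appears twice).
append_table = list(records)

# Incremental | Append + Deduped: keep only the latest record per primary key,
# ordered by the cursor field.
latest = {}
for record in sorted(records, key=lambda r: r["updated_at"]):
    latest[record["id"]] = record
deduped_view = list(latest.values())

print(len(append_table))  # 3 rows, including the superseded version of id 1
print(len(deduped_view))  # 2 rows, one per primary key
```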

## Normalization

Normalization is the process of structuring data from the source into a format appropriate for consumption in the destination. For example, when writing data from a nested, dynamically typed source like a JSON API to a relational destination like Postgres, normalization is the process which un-nests JSON from the source into a relational table format which uses the appropriate column types in the destination.

Note that normalization is only relevant for the following relational database & warehouse destinations:

- BigQuery
- Snowflake
- Redshift
- Postgres
- Oracle
- MySQL
- MSSQL

Other destinations do not support normalization as described in this section, though they may normalize data in a format that makes sense for them. For example, the S3 destination connector offers the option of writing JSON files in S3, but also offers the option of writing statically typed files such as Parquet or Avro.

After a sync is complete, Airbyte normalizes the data. When setting up a connection, you can choose one of the following normalization options:

- Raw data (no normalization): Airbyte places the JSON blob version of your data in a table called `_airbyte_raw_<stream name>`.
- Basic Normalization: Airbyte converts the raw JSON blob version of your data to the format of your destination. _Note: Not all destinations support normalization._
- [dbt Cloud integration](https://docs.airbyte.com/cloud/managing-airbyte-cloud/dbt-cloud-integration): Airbyte's dbt Cloud integration allows you to use dbt Cloud for transforming and cleaning your data during the normalization process.
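
As a rough sketch of the idea (not Airbyte's actual normalization code), the example below un-nests a nested JSON record into flat, typed columns; the record and column names are hypothetical.

```python
import json

# Hypothetical nested record from a JSON API source.
raw_record = json.loads(
    '{"id": 7, "profile": {"name": "Ada", "plan": "pro"}, "signup_date": "2023-08-01"}'
)

# Raw data (no normalization): the whole blob lands in _airbyte_raw_<stream name>.
raw_row = {"_airbyte_data": json.dumps(raw_record)}

# Basic Normalization: nested fields are un-nested into flat columns with
# destination-appropriate types.
normalized_row = {
    "id": raw_record["id"],                         # integer column
    "profile_name": raw_record["profile"]["name"],  # text column
    "profile_plan": raw_record["profile"]["plan"],  # text column
    "signup_date": raw_record["signup_date"],       # date column
}

print(raw_row)
print(normalized_row)
```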

:::note

Normalizing data may cause an increase in your destination's compute cost.

:::

## Workspace

A workspace is a grouping of sources, destinations, connections, and other configurations. It lets you collaborate with team members and share resources across your team under a shared billing account.

When you [sign up](http://cloud.airbyte.com/signup) for Airbyte Cloud, we automatically create your first workspace where you are the only user with access. You can set up your sources and destinations to start syncing data and invite other users to join your workspace.

57 changes: 31 additions & 26 deletions docs/cloud/getting-started-with-airbyte-cloud.md
A connection is an automated data pipeline that replicates data from a source to a destination.

Setting up a connection involves configuring the following parameters:

| Parameter | Description |
| --- | --- |
| Replication frequency | How often should the data sync? |
| [Data residency](https://docs.airbyte.com/cloud/managing-airbyte-cloud/manage-data-residency#choose-the-data-residency-for-a-connection) | Where should the data be processed? |
| Destination Namespace and stream names | Where should the replicated data be written? |
| Catalog selection | Which streams and fields should be replicated from the source to the destination? |
| Sync mode | How should the streams be replicated (read and written)? |
| Optional transformations | How should Airbyte protocol messages (raw JSON blob) data be converted into other data representations? |

For more information, see [Connections and Sync Modes](../understanding-airbyte/connections/README.md) and [Namespaces](../understanding-airbyte/namespaces.md).

If you need to use [cron scheduling](http://www.quartz-scheduler.org/documentation/quartz-2.3.0/tutorials/crontrigger.html):

1. In the **Replication Frequency** dropdown, click **Cron**.
2. Enter a cron expression and choose a time zone to create a sync schedule.

:::note

- Only one sync per connection can run at a time.
- If cron schedules a sync to run before the last one finishes, the scheduled sync will start after the last sync completes.
- Cloud does not allow schedules that sync more than once per hour.

:::
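
For example, the Quartz cron expression `0 0 12 * * ?` triggers a sync every day at 12:00 PM in the time zone you choose.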

To better understand the destination namespace configurations, see [Namespaces](../understanding-airbyte/namespaces.md).
- Select **Overwrite** to erase the old data and replace it completely.
- Select **Append** to capture changes to your table.
  **Note:** This creates duplicate records.
- Select **Append + Deduped** to mirror your source while keeping records unique.

**Note:** Some sync modes may not yet be available for your source or destination.

4. **Cursor field**: Used in **Incremental** sync mode to determine which records to sync. Airbyte pre-selects the cursor field for you (example: updated date). If you have multiple cursor fields, select the one you want.
5. **Primary key**: Used in **Append + Deduped** sync mode to determine the unique identifier.
6. **Destination**:
   - **Namespace:** The database schema of your destination tables.
   - **Stream name:** The final table name in the destination.
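
As a minimal sketch of what the cursor field does during an Incremental sync (the data and helper function are hypothetical, not Airbyte's implementation):

```python
# Hypothetical source rows with an "updated_at" cursor field.
source_rows = [
    {"id": 1, "updated_at": "2023-08-01T00:00:00"},
    {"id": 2, "updated_at": "2023-08-05T00:00:00"},
    {"id": 3, "updated_at": "2023-08-07T00:00:00"},
]

def incremental_read(rows, cursor_value):
    """Return only rows whose cursor field is newer than the saved state."""
    return [r for r in rows if r["updated_at"] > cursor_value]

# State saved after the previous sync: only ids 2 and 3 are read this time.
previous_state = "2023-08-03T00:00:00"
new_rows = incremental_read(source_rows, previous_state)
print([r["id"] for r in new_rows])  # [2, 3]
```
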
Verify the sync by checking the logs:
3. Check the data at your destination. If you added a Destination Stream Prefix while setting up the connection, make sure to search for the stream name with the prefix.

## Allowlist IP addresses

Depending on your [data residency](https://docs.airbyte.com/cloud/managing-airbyte-cloud/manage-data-residency#choose-your-default-data-residency) location, you may need to allowlist the following IP addresses to enable access to Airbyte:

### United States and Airbyte Default

#### GCP region: us-west3

[comment]: # "IMPORTANT: if changing the list of IP addresses below, you must also update the connector.airbyteCloudIpAddresses LaunchDarkly flag to show the new list so that the correct list is shown in the Airbyte Cloud UI, then reach out to the frontend team and ask them to update the default value in the useAirbyteCloudIps hook!"

- 34.106.109.131
- 34.106.196.165
- 34.106.60.246
- 34.106.229.69
- 34.106.127.139
- 34.106.218.58
- 34.106.115.240
- 34.106.225.141

### European Union

#### AWS region: eu-west-3

- 13.37.4.46
- 13.37.142.60
- 35.181.124.238