Update Cloud Docs #32539

Merged · 4 commits · Nov 17, 2023
68 changes: 31 additions & 37 deletions docs/cloud/core-concepts.md
# Core Concepts

Airbyte enables you to build data pipelines and replicate data from a source to a destination. You can configure how frequently the data is synced, what data is replicated, and how the data is written to the destination.

This page describes the concepts you need to know to use Airbyte.

## Connector

An Airbyte component which pulls data from a source or pushes data to a destination.

## Connection

A connection is an automated data pipeline that replicates data from a source to a destination. Setting up a connection involves configuring the following parameters:

<table>
<tr>
<td><strong>Configuration</strong>
</td>
<td><strong>Description</strong>
</td>
</tr>
<tr>
<td>Replication Frequency
</td>
<td>When should a data sync be triggered?
</td>
</tr>
<tr>
<td>Destination Namespace and Stream Prefix
</td>
<td>Where should the replicated data be written?
</td>
</tr>
<tr>
<td>Catalog Selection
</td>
<td>What data (streams and columns) should be replicated from the source to the destination?
</td>
</tr>
<tr>
<td>Sync Mode
</td>
<td>How should the streams be replicated (read and written)?
</td>
</tr>
<tr>
<td>Schema Propagation
</td>
<td>How should Airbyte handle schema drift in sources?
</td>
</tr>
</table>
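
For illustration, the parameters above can be thought of as a single configuration object. The sketch below is a hypothetical Python representation of those settings; the field names and values are illustrative only and do not reflect the exact Airbyte API or UI schema.

```python
# Hypothetical representation of a connection's settings, mirroring the table
# above. Field names and values are illustrative, not the Airbyte API schema.
connection_settings = {
    "replication_frequency": "every 24 hours",      # when a data sync is triggered
    "destination_namespace": "mirror_source",       # where the replicated data is written
    "stream_prefix": "airbyte_",                    # prefix applied to stream names in the destination
    "catalog_selection": ["users", "orders"],       # which streams (and columns) to replicate
    "sync_mode": "incremental_append_deduped",      # how streams are read and written
    "schema_propagation": "propagate_all_changes",  # how schema drift in the source is handled
}

if __name__ == "__main__":
    for parameter, value in connection_settings.items():
        print(f"{parameter}: {value}")
```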

## Namespace

Namespace is a method of grouping streams in a source or destination. Namespaces are commonly used to organize data, segregate test and production data, and enforce permissions. In a relational database system, this is known as a schema.

In a source, the namespace is the location from which data is replicated to the destination. In a destination, the namespace is the location where the replicated data is stored.

Airbyte supports the following destination namespace configuration options for a connection:

<table>
<tr>
<td><strong>Destination Namespace</strong>
</td>
<td><strong>Description</strong>
</td>
</tr>
<tr>
<td>Destination default
</td>
<td>All streams will be replicated to the single default namespace defined by the destination. For more details, see <a href="https://docs.airbyte.com/understanding-airbyte/namespaces#--destination-connector-settings">Destination Connector Settings</a>.
</td>
</tr>
<tr>
<td>Mirror source structure
</td>
<td>Some sources (for example, databases) provide namespace information for a stream. If a source provides namespace information, the destination will mirror the same namespace when this configuration is set. For sources or streams where the source namespace is not known, the behavior will default to the "Destination default" option.
</td>
</tr>
<tr>
<td>Custom format
</td>
<td>All streams will be replicated to a single user-defined namespace. See <a href="https://docs.airbyte.com/understanding-airbyte/namespaces#--custom-format">Custom format</a> for more details.
</td>
</tr>
</table>
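
To make the three options concrete, here is a small Python sketch of how a destination namespace might be resolved for a stream. The function and its parameters are illustrative assumptions, not Airbyte's implementation, and the `${SOURCE_NAMESPACE}` placeholder is used here only as an example of a custom format variable.

```python
from typing import Optional


def resolve_destination_namespace(
    option: str,
    source_namespace: Optional[str],
    destination_default: str,
    custom_format: str = "${SOURCE_NAMESPACE}",
) -> str:
    """Sketch of destination namespace resolution for the options in the table above."""
    if option == "Destination default":
        return destination_default
    if option == "Mirror source structure":
        # Streams without a known source namespace fall back to the destination default.
        return source_namespace or destination_default
    if option == "Custom format":
        return custom_format.replace("${SOURCE_NAMESPACE}", source_namespace or destination_default)
    raise ValueError(f"Unknown destination namespace option: {option}")


# Example: a Postgres source schema "public" mirrored into the destination.
print(resolve_destination_namespace("Mirror source structure", "public", "airbyte_default"))
```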

## Connection sync modes

A sync mode governs how Airbyte reads from a source and writes to a destination. Airbyte provides different sync modes to account for various use cases.

- **Full Refresh | Overwrite:** Sync all records from the source and replace the data in the destination by overwriting it each time.
- **Full Refresh | Append:** Sync all records from the source and add them to the destination without deleting any data. This creates a historical copy of all records on each sync.
- **Incremental Sync | Append:** Sync new records from the source and add them to the destination without deleting any data. This enables efficient historical tracking of data over time.
- **Incremental Sync | Append + Deduped:** Sync new records from the source and add them to the destination. Also provides a de-duplicated view mirroring the state of the stream in the source. This is the most common replication use case.
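
As a rough illustration of how the four modes differ, the Python sketch below treats the destination as a plain list of records. It is a conceptual model only, not Airbyte's actual write path.

```python
# Conceptual model only: a "destination" is a list of dict records and
# `primary_key` identifies a record. None of this is Airbyte's implementation.

def full_refresh_overwrite(destination, source_records):
    return list(source_records)                 # replace the destination data each sync

def full_refresh_append(destination, source_records):
    return destination + list(source_records)   # re-add every source record, keeping history

def incremental_append(destination, new_records):
    return destination + list(new_records)      # add only records new since the last sync

def incremental_append_deduped(destination, new_records, primary_key):
    merged = {record[primary_key]: record for record in destination}
    for record in new_records:
        merged[record[primary_key]] = record     # the latest version of each record wins
    return list(merged.values())                 # de-duplicated view mirroring the source

destination = [{"id": 1, "name": "Ada"}]
new_records = [{"id": 1, "name": "Ada Lovelace"}, {"id": 2, "name": "Grace"}]
print(incremental_append_deduped(destination, new_records, "id"))
```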

## Normalization

Normalization is the process of structuring data from the source into a format appropriate for consumption in the destination. For example, when writing data from a nested, dynamically typed source like a JSON API to a relational destination like Postgres, normalization is the process which un-nests JSON from the source into a relational table format which uses the appropriate column types in the destination.
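
As a minimal sketch of the idea (not Airbyte's normalization logic), the Python snippet below un-nests a JSON record into flat, relational-style column names:

```python
import json


def flatten(record: dict, parent: str = "") -> dict:
    """Un-nest a JSON object into flat column names, e.g. user.address.city -> user_address_city."""
    columns = {}
    for key, value in record.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            columns.update(flatten(value, name))  # recurse into nested objects
        else:
            columns[name] = value
    return columns


raw = json.loads('{"id": 1, "user": {"name": "Ada", "address": {"city": "Paris"}}}')
print(flatten(raw))  # {'id': 1, 'user_name': 'Ada', 'user_address_city': 'Paris'}
```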

Note that normalization is only relevant for the following relational database & warehouse destinations:

- BigQuery
- Snowflake
- Redshift
- Postgres
- Oracle