Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental Docs and Data Model Update #1021

Merged
merged 4 commits into from
Nov 20, 2020
Merged

Conversation

cgardens
Copy link
Contributor

What

  • As I've started documenting the data model and working with it, I think there was a pretty big weakness in the original implementation.
  • I have restructured it so that an ConfiguredAirbyteStream contains an AirbyteStream. This fixes a few things
    • first copying the json_schema into ConfiguredAirbyteStream but not the other fields meant that the ConfiguredAirbyteStream didn't have enough information to configure an AirbyteStream (i.e. it didn't know what the default_cursor_field) was supposed to be.
    • Copying some of the data and not others was also just confusing. In trying to document it, I couldn't write docs that gave good motivation for what was and wasn't copied. So if I was confused, the user was going to be flummoxed.
    • Without copying the whole AirbyteStream handling the distinction between source-defined and user-defined cursors was going to be very confusing and involve copying more specific parts of the stream.

How

Describe the solution

Checklist

  • I have an branch that implements this restructuring. Still need to run integration tests and push images, but should be relatively fast.

Recommended reading order

  1. airbyte_protocol.yaml
  2. incremental.md
  3. catalog.md

@@ -134,6 +134,11 @@ definitions:
type: array
items:
"$ref": "#/definitions/SyncMode"
cursor_field:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could replace this with a boolean. with a source-defined cursors, it's not technically necessary to expose them outside the source for the actual replication. i exposed it in this iteration so that the UI would be able to display what field would be used (even if it's not configurable). i can go either way on this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making this not a boolean blurs the distinction between catalog and configuredcatalog because this feels like configuration. can't the user be expected to know that if the cursor_field_configurable==false then the cursor is the default_cursor_field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool. going with the boolean as you suggest. called source_defined_cursor_field

@@ -134,6 +134,11 @@ definitions:
type: array
items:
"$ref": "#/definitions/SyncMode"
cursor_field:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

making this not a boolean blurs the distinction between catalog and configuredcatalog because this feels like configuration. can't the user be expected to know that if the cursor_field_configurable==false then the cursor is the default_cursor_field?

docs/architecture/catalog.md Outdated Show resolved Hide resolved
@@ -168,7 +168,7 @@ read(Config, AirbyteCatalog, State) -> Stream<AirbyteMessage>

* Input:
1. `config` - A configuration JSON object that has been validated using the `ConnectorSpecification`.
2. `catalog` - An `AirbyteCatalog`. This `catalog` should be a subset of the `catalog` returned by the `discover` command. It is what will be used in the `read` command to select what data to transfer.
2. `catalog` - An `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. `catalog` - An `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
2. `catalog` - A `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.

@@ -168,7 +168,7 @@ read(Config, AirbyteCatalog, State) -> Stream<AirbyteMessage>

* Input:
1. `config` - A configuration JSON object that has been validated using the `ConnectorSpecification`.
2. `catalog` - An `AirbyteCatalog`. This `catalog` should be a subset of the `catalog` returned by the `discover` command. It is what will be used in the `read` command to select what data to transfer.
2. `catalog` - An `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be updated to indicate you copy the whole stream right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup! fixed.

For a source to do incremental sync is must be able to keep track of new and updated records. This can take a couple different forms. Before we jump into them, we are going to use the word cursor or cursor field to describe the field or column in the data that Airbyte uses as a comparable to determine if any given record is new or has been updated since the last sync.

## Source-Defined Cursor.
Some sources are able to determine the cursor that the use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Some sources are able to determine the cursor that the use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.
Some sources are able to determine the cursor that they use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.

docs/architecture/incremental.md Outdated Show resolved Hide resolved
docs/architecture/incremental.md Outdated Show resolved Hide resolved
Let's assume that our warehouse contains all of the data that it did at the end of the previous section. Now unfortunately the king and queen lose their heads. Let's see that delta:
```json
[
{ "name": "Louis XVI", "deceased": true, "updated_at": 1793 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is such a grim example 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you couldn't tell, i was very please with myself when i came up with this. 😛

docs/architecture/catalog.md Outdated Show resolved Hide resolved

## Overview

Incremental syncs in Airbyte allow sources to replicate only new or modified data. This prevents re-sending data that has already been sent to your warehouse. We will call this set of new or updated records the delta going forward.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should emphasize that this doesn't pull the full data from the source either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@cgardens cgardens force-pushed the cgardens/incremental_docs branch from 5f6a33c to 2b72d97 Compare November 19, 2020 23:00
@cgardens cgardens force-pushed the cgardens/incremental_docs branch from 2b72d97 to 3ec5253 Compare November 20, 2020 02:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants