Incremental Docs and Data Model Update #1021
@@ -134,6 +134,11 @@ definitions:
   type: array
   items:
     "$ref": "#/definitions/SyncMode"
+cursor_field:
we could replace this with a boolean. with source-defined cursors, it's not technically necessary to expose them outside the source for the actual replication. i exposed it in this iteration so that the UI would be able to display what field would be used (even if it's not configurable). i can go either way on this one.
making this not a boolean blurs the distinction between catalog and configuredcatalog because this feels like configuration. can't the user be expected to know that if `cursor_field_configurable==false` then the cursor is the `default_cursor_field`?
cool. going with the boolean as you suggest. it's called `source_defined_cursor_field`.
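A minimal Python sketch of how a consumer such as the UI could work out which cursor applies under the boolean approach agreed on in this thread. Only the names `source_defined_cursor_field` and `default_cursor_field` come from the discussion; the dict layout, the string-valued cursor, the user override argument, and the helper `resolve_cursor_field` are assumptions made for illustration, not the protocol's actual shape.

```python
from typing import Optional


def resolve_cursor_field(stream: dict, user_cursor_field: Optional[str] = None) -> str:
    """Pick the cursor for a stream (hypothetical helper, not Airbyte code)."""
    default = stream.get("default_cursor_field")
    if stream.get("source_defined_cursor_field", False):
        # The source owns the cursor, so any user choice is ignored.
        if default is None:
            raise ValueError("a source-defined cursor needs a default_cursor_field")
        return default
    # Otherwise the user may override the default suggested by the source.
    cursor = user_cursor_field or default
    if cursor is None:
        raise ValueError("no cursor field available for this stream")
    return cursor


# Illustrative streams; the field values are made up for this example.
exchange_rates = {
    "name": "exchange_rates",
    "source_defined_cursor_field": True,
    "default_cursor_field": "date",
}
users = {
    "name": "users",
    "source_defined_cursor_field": False,
    "default_cursor_field": "updated_at",
}

print(resolve_cursor_field(exchange_rates, "created_at"))
print(resolve_cursor_field(users, "created_at"))
print(resolve_cursor_field(users))
```

With a boolean, the catalog only states who owns the cursor, and the choice of a specific field stays on the configuration side.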
@@ -168,7 +168,7 @@ read(Config, AirbyteCatalog, State) -> Stream<AirbyteMessage>

 * Input:
 1. `config` - A configuration JSON object that has been validated using the `ConnectorSpecification`.
-2. `catalog` - An `AirbyteCatalog`. This `catalog` should be a subset of the `catalog` returned by the `discover` command. It is what will be used in the `read` command to select what data to transfer.
+2. `catalog` - An `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
-2. `catalog` - An `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
+2. `catalog` - A `ConfiguredAirbyteCatalog`. This `catalog` should be constructed from the `catalog` returned by the `discover` command. This is done by copying the streams that you want to sync from the `AirbyteCatalog` into the `ConfiguredAirbyteCatalog`. To convert an `AirbyteStream` to a `ConfiguredAirbyteStream` copy the `json_schema` and `name` fields. struct It is what will be used in the `read` command to select what data to transfer.
this should be updated to indicate you copy the whole stream, right?
yup! fixed.
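A rough Python sketch of the "copy the whole stream" approach agreed on in this thread: each selected `AirbyteStream` is embedded in full inside its `ConfiguredAirbyteStream`, rather than copying only `json_schema` and `name`. The plain-dict representation, the `stream` key, and the helper `configure_catalog` are assumptions for illustration and may not match the final schema in `airbyte_protocol.yaml`.

```python
import copy


def configure_catalog(catalog: dict, selected_names: set) -> dict:
    """Build a configured catalog from a discovered one (hypothetical helper)."""
    configured_streams = []
    for stream in catalog["streams"]:
        if stream["name"] in selected_names:
            # Embed a full copy of the stream so no source-provided field
            # (for example default_cursor_field) is lost along the way.
            configured_streams.append({"stream": copy.deepcopy(stream)})
    return {"streams": configured_streams}


# Illustrative output of `discover`; the values are made up for this example.
discovered = {
    "streams": [
        {
            "name": "exchange_rates",
            "json_schema": {"type": "object"},
            "source_defined_cursor_field": True,
            "default_cursor_field": "date",
        },
        {
            "name": "users",
            "json_schema": {"type": "object"},
            "source_defined_cursor_field": False,
        },
    ]
}

configured = configure_catalog(discovered, {"exchange_rates"})
print(configured["streams"][0]["stream"]["default_cursor_field"])
```

The deep copy keeps later edits to the configured catalog from mutating the discovered one.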
docs/architecture/incremental.md
Outdated
For a source to do incremental sync is must be able to keep track of new and updated records. This can take a couple different forms. Before we jump into them, we are going to use the word cursor or cursor field to describe the field or column in the data that Airbyte uses as a comparable to determine if any given record is new or has been updated since the last sync.

## Source-Defined Cursor.
Some sources are able to determine the cursor that the use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.
-Some sources are able to determine the cursor that the use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.
+Some sources are able to determine the cursor that they use without any user input. For example, in the exchange rates api source, the source itself can determine that date field is in fact the field to be used to determine the last record that was synced. In these cases, the source will set the `cursor_field` attribute in the `AirbyteStream`.
Let's assume that our warehouse contains all of the data that it did at the end of the previous section. Now unfortunately the king and queen lose their heads. Let's see that delta:
```json
[
  { "name": "Louis XVI", "deceased": true, "updated_at": 1793 },
this is such a grim example 🤣
if you couldn't tell, i was very pleased with myself when i came up with this. 😛
docs/architecture/incremental.md
Outdated

## Overview

Incremental syncs in Airbyte allow sources to replicate only new or modified data. This prevents re-sending data that has already been sent to your warehouse. We will call this set of new or updated records the delta going forward.
Should emphasize that this doesn't pull the full data from the source either.
fixed.
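To make the overview concrete, here is a small Python sketch of a cursor-based incremental read: only records whose cursor value is greater than the saved state form the delta, and the state then advances to the largest cursor seen. The helper `read_incremental`, the in-memory row list, and every row other than the Louis XVI record quoted in the docs are illustrative assumptions. A real source would push the comparison into its query or API request so that, as the review comment above asks, the full data is never pulled from the source.

```python
from typing import Any, Optional


def read_incremental(rows: list, cursor_field: str, state: Optional[Any]):
    """Return (delta, new_state) for one sync (hypothetical helper)."""
    # Keep only rows that are newer than the last synced cursor value.
    delta = [row for row in rows if state is None or row[cursor_field] > state]
    # Advance the state to the largest cursor value seen, if there was any delta.
    new_state = max((row[cursor_field] for row in delta), default=state)
    return delta, new_state


# Illustrative source contents; only the first row appears in the docs under review.
source_rows = [
    {"name": "Louis XVI", "deceased": True, "updated_at": 1793},
    {"name": "Marie Antoinette", "deceased": True, "updated_at": 1793},
    {"name": "Napoleon Bonaparte", "deceased": False, "updated_at": 1769},
]

# The previous sync ended with a cursor value of 1770 (assumed for the example).
delta, state = read_incremental(source_rows, "updated_at", 1770)
print([row["name"] for row in delta], state)

# Running again with the new state finds nothing left to send.
delta_again, state_again = read_incremental(source_rows, "updated_at", state)
print(delta_again, state_again)
```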
5f6a33c to 2b72d97
2b72d97 to 3ec5253
## What
* `ConfiguredAirbyteStream` contains an `AirbyteStream`. This fixes a few things:
  * Copying `json_schema` into `ConfiguredAirbyteStream` but not the other fields meant that the `ConfiguredAirbyteStream` didn't have enough information to configure an `AirbyteStream` (i.e. it didn't know what the `default_cursor_field` was supposed to be).
  * Without the `AirbyteStream`, handling the distinction between source-defined and user-defined cursors was going to be very confusing and involve copying more specific parts of the stream.

## How
Describe the solution

## Checklist

## Recommended reading order
1. `airbyte_protocol.yaml`
2. `incremental.md`
3. `catalog.md`