-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update state message #12586
Update state message #12586
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benmoriceau was it discussed whether the connector should declare via spec.json if it manages global or per-stream state? this would allow transparency to the user about whether a table reset they perform in the UI will nuke their whole dataset vs. just one stream. Or is the idea that this determination would happen by reading the STATE message?
The type of state the other fields represent. | ||
If not set, the state data is interpreted as GLOBAL and should be read from the `data` field for backwards compatibility. | ||
GLOBAL means that the state should be read from `global` and means that it represents the state for all the streams. | ||
PER_STREAM means that the state should be read from `streams`. Each item in the ist represents the state for the associated stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PER_STREAM means that the state should be read from `streams`. Each item in the ist represents the state for the associated stream. | |
PER_STREAM means that the state should be read from `streams`. Each item in the list represents the state for the associated stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
description: "per stream state data" | ||
additionalProperties: false | ||
properties: | ||
name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: | |
stream_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't it redundant with the name of the fields in the AirbyteSateMessage
it will be something like stream[x].stream_name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly on this one, but I think I do lean towards sherif's suggestion. I don't think it's redundant because when you say an AirbyteStreamState
has a name
it isn't necessarily obvious that that name is a stream name. It could be interpreted as AirbyteStreamState
s just have a name for some reason. doing with stream_name
removes the ambiguity.
global: | ||
"$ref": "#/definitions/AirbyteStateBlob" | ||
streams: | ||
type: array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this an array instead of a map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map type doesn't exist in json schema. One version was using the additional properties for it but it was not clear: https://json-schema.org/draft/2020-12/json-schema-core.html#rfc.section.4.2.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would "additionalProperties": { "type": AirbyteStateBlob }
work with object
definition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth exploring something like what Sherif proposed above. Otherwise we will need to need to sort through all of the streamStates in this array to find specific ones, rather than having direct access to a stream state given the stream name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would
"additionalProperties": { "type": AirbyteStateBlob }
work withobject
definition?
This would work but it is not explicit that the key is a string plus the field is gonna be named additionalProperties and not streams (doc). We considered it at some point but abandoned it because of all the confusion it was bringing.
The tradeoff of rebuilding it as a map sounds better than having a field that it is not properly named and which can lead to have something unexpected in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maps in JSONSchema have been pretty sus. i.e. it seems like a lot of the codegen doesn't know what to do with them. I think the array is going to lead to less headache.
Lake makes a good point on the sorting, but I think we can handle it.
additionalProperties: true | ||
required: | ||
- data | ||
additionalProperties: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we changing this to false
? I think that will make it harder in the future to add/remove fields to this object in a backwards-compatible way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field allows too many thing it is a Map<String, Object>
, we can't know what is in it and we should avoid using it. I don't think it is being used at the moment and we shouldn't be using it in the future.
global: | ||
"$ref": "#/definitions/AirbyteStateBlob" | ||
streams: | ||
type: array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth exploring something like what Sherif proposed above. Otherwise we will need to need to sort through all of the streamStates in this array to find specific ones, rather than having direct access to a stream state given the stream name
I will let Benoit confirm, but AFAIK the idea was that this would be determined by reading the most recent state from the connector, defaulting to GLOBAL if there is no state for that connector. Once we process the first PER_STREAM state message from that connector, it would change the behavior in the UI. |
We spoke about that offline with Sherif. Some connector will support both global and per stream update. The way we will know which state to pass will be by using the state table and the TBD per stream state table and see which one is the active one. As you say the default state will be GLOBAL if no state is present. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. i think it is worth doing name
=> stream_name
properties: | ||
state_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't make this required because it wouldn't be backwards compatible, right?
global: | ||
"$ref": "#/definitions/AirbyteStateBlob" | ||
streams: | ||
type: array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maps in JSONSchema have been pretty sus. i.e. it seems like a lot of the codegen doesn't know what to do with them. I think the array is going to lead to less headache.
Lake makes a good point on the sorting, but I think we can handle it.
description: "per stream state data" | ||
additionalProperties: false | ||
properties: | ||
name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly on this one, but I think I do lean towards sherif's suggestion. I don't think it's redundant because when you say an AirbyteStreamState
has a name
it isn't necessarily obvious that that name is a stream name. It could be interpreted as AirbyteStreamState
s just have a name for some reason. doing with stream_name
removes the ambiguity.
What Update the airbyte state message to support the per stream state. The state message still contains the old way of storing the state in the data fields. It introduce 2 new fields to represent the global and the per stream state.
What
Update the airbyte state message to support the per stream state.
The state message still contains the old way of storing the state in the data fields. It introduce 2 new fields to represent the global and the per stream state.
The new state message has been defined in: https://docs.google.com/document/d/1zJVUQkVy8ZJUcqzM_hYPGijr4pI0Xx6I2eqpagJ2r7E/edit#heading=h.voxzl9lf9qd1