This repository has been archived by the owner on Mar 1, 2024. It is now read-only.
Airbyte based readers #428
so we should figure out a better test system on our side, but currently tests are failing because we assume that third-party imports are lazy imports.
i know specifically for `AirbyteRecordMessage` it's used in the `RecordHandler` type used to type `record_handler`. For this, do you think we could type `record_handler` as `Any` first, and then within `__init__` lazy import `AirbyteRecordMessage`, import `RecordHandler`, and do an `isinstance()` check on `record_handler`? A bit hacky, but it will at least allow tests to pass. thanks
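A minimal sketch of the lazy-import pattern suggested above (class and module names are assumptions; the actual reader in this PR may differ — the sketch swallows a missing `airbyte_cdk` for illustration, and uses a `callable()` check in place of the static `RecordHandler` type):

```python
from typing import Any, Optional


class AirbyteCDKReader:
    """Sketch: type record_handler as Any at module level and defer the
    airbyte_cdk import to __init__, so importing the module itself never
    requires the third-party dependency."""

    def __init__(self, record_handler: Optional[Any] = None) -> None:
        try:
            # Lazy import: only resolved when a reader is constructed.
            from airbyte_cdk.models import AirbyteRecordMessage  # noqa: F401
        except ImportError:
            pass  # illustration only; a real reader would re-raise
        # Runtime stand-in for an isinstance()-style check on record_handler.
        if record_handler is not None and not callable(record_handler):
            raise TypeError("record_handler must be callable")
        self.record_handler = record_handler
```

With this shape, test suites that never instantiate the reader can run without `airbyte_cdk` installed.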
would prefer composition over multiple inheritance; it's easier to inspect the specific object (`BaseEmbeddedIntegration`) you're using
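A sketch of the composition approach being suggested, with hypothetical names standing in for the real CDK classes: the reader holds a `BaseEmbeddedIntegration` instance rather than inheriting from it, so the wrapped object stays directly inspectable:

```python
class BaseEmbeddedIntegration:
    """Stand-in for the CDK's base integration class."""

    def _load_data(self, stream: str) -> list:
        return [f"record from {stream}"]


class AirbyteReader:
    """Composes the integration instead of inheriting from it."""

    def __init__(self, integration: BaseEmbeddedIntegration) -> None:
        self._integration = integration  # composed, not a base class

    def load_data(self, stream: str) -> list:
        # Delegate to the wrapped integration.
        return self._integration._load_data(stream)
```

The composed `self._integration` can be checked with a plain `isinstance()` at runtime, which is harder to reason about when the same behavior arrives via a multiple-inheritance MRO.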
oof, guessing there's no iterator method available for llamaindex?
Could we implement one anyway / is there any value in doing so?
Good question, what do you think, @jerryjliu ? Probably something we can split out of the first PR though
by default every loader implements `load_data`, but you're free to define custom methods too!
OK cool, let's add a `lazy_load` as well that returns an iterable.
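A sketch of what such a lazy variant could look like (names and the record type are illustrative, not the PR's actual API): `lazy_load_data` yields documents one at a time, and `load_data` simply materializes it into a list:

```python
from typing import Iterator, List


class ExampleReader:
    """Hypothetical loader with both an eager and a lazy entry point."""

    def __init__(self, records: List[str]) -> None:
        self._records = records

    def lazy_load_data(self) -> Iterator[str]:
        # Yield one document at a time so large syncs need not
        # be held in memory all at once.
        for record in self._records:
            yield record

    def load_data(self) -> List[str]:
        # Eager variant: drain the lazy iterator into a list.
        return list(self.lazy_load_data())
```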
out of curiosity where is _load_data defined?
it's on the base integration class that's coming from the CDK. We chose this approach so we can roll out improvements without having to update the loader itself
i see, makes sense
So this is the first time I've seen this pattern of putting everything in metadata, but it's an interesting idea that we should explore further.
From my understanding of how LlamaIndex currently works, this won't behave properly: our text splitting doesn't split the metadata but rather attaches it to every split node (which will be an issue if the metadata is large).
@logan-markewich @jerryjliu correct me if I'm wrong.
But maybe we can create a Document subclass that's just a structured dictionary, with a user-defined list of which fields of the content should actually be chunked.
yeah i agree, this is a good concern. right now to keep it simple i'd almost recommend doing a big text dump (so at least text splitters work), for ease-of-use out of the box
So to expand on the other comment further:
If I'm thinking about a hubspot use case, it may be let me do some Q&A over the previous notes and emails I've had with a partner.
The content of the notes and emails are probably what we want to be split, embedded, and retrieved over. If we're only embedding each note individually without any additional text splitting, then we're OK because we embed all of the metadata by default.
However, if we want to start text splitting, then the problem arises, because 1. the splitting will only happen on doc.text, and 2. even more problematic, every split will carry a complete copy of the document's metadata.
3000 words is a lot of text (definitely bigger than most interaction notes and emails) so we may be able to get away with it for now, but I think longer run we'll want a different document class to handle this kind of use case.
There could be a lot of interesting things here: after the loader loads data in this more structured way, the user or application designer could say "I want my coworkers' Notion comments to be split alongside the content of the doc" or "no, I don't want the comments split alongside the doc", and it's just a matter of adjusting a list of the fields to include in the splitting.
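A sketch of the structured-document idea described above, under the assumption of a plain dataclass (this class does not exist in LlamaIndex; all names are hypothetical): the document is a dict of named fields plus a list naming which fields participate in splitting, with everything else kept as metadata:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredDocument:
    """Hypothetical Document variant: named fields plus an opt-in
    list of which fields should be chunked by the text splitter."""

    fields: Dict[str, str]
    split_fields: List[str] = field(default_factory=list)

    def text_for_splitting(self) -> str:
        # Only the opted-in fields are concatenated for the splitter.
        return "\n\n".join(self.fields[name] for name in self.split_fields)

    def metadata(self) -> Dict[str, str]:
        # Everything not selected for splitting stays as metadata.
        return {k: v for k, v in self.fields.items() if k not in self.split_fields}
```

Turning the Notion-comments question into config then becomes a one-line change: include `"comments"` in `split_fields` or leave it out.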
Another thing we could do is allow the user to pass a
`RecordHandler = Optional[Callable[[AirbyteRecordMessage, Optional[str]], Document]]`
as part of the constructor arguments. If it's not set, it will do the current "all metadata" thing, but the user can overwrite it and put things into text vs. metadata as they see fit. Do you think that would be a better fit for the architecture?
@flash1293 yeah i think that can work! can maybe add a usage example in the README of the main AirbyteCDKReader
the other alternative is to dump everything into text actually (so naive text dump), and if the user wants to have more fine-grained metadata specifications they use this argument
I like the record handler plus dumping everything in there by default, seems like the easiest way to get started. The text could get really messy for some streams, but that's probably ok