-
Notifications
You must be signed in to change notification settings - Fork 15.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a generic document loader #4875
Conversation
self, text_splitter: Optional[TextSplitter] = None | ||
) -> List[Document]: | ||
"""Load all documents and split them into sentences.""" | ||
raise NotImplementedError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we load and split (when text_splitter is not None)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can tackle it in the initializer and have it available through any the lazy method of the load method, I don't think we need another method to do the same thing, we could add it for backwards compatibility potentially but I want to add something that can simply process text.
|
||
def load(self) -> List[Document]: | ||
"""Load all documents.""" | ||
return list(self.lazy_load()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this not implemented on BaseLoader like this bc lazy_load was added later and isn't implemented in a lot of places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
glob: str = "**/[!.]*", | ||
suffixes: Optional[Sequence[str]] = None, | ||
show_progress: bool = False, | ||
parser: Union[str, BaseBlobParser] = "default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we do Lister["default"] instead of str
"""Create a generic document loader using a filesystem blob loader. | ||
|
||
Args: | ||
parser: A blob parser which knows how to parse blobs into documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
worth mentioning "default"?
) -> GenericLoader: | ||
"""Create a generic document loader using a filesystem blob loader. | ||
|
||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add types to these, and match signature order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
order yes, i suggest not to add types since the type is already provided as part of the function signature
_PathLike = Union[str, Path] | ||
|
||
|
||
class GenericLoader(BaseLoader): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooc are you imagining for everything that has both a blob parser/parser and doc loader atm we'll reimplement the doc loader as child of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly thinking of this as living independently as a self contained unit. It may need 2-3 more classmethods and it should be able to load content from most popular locations and then parse is with any arbitrary parser.
I think this will replace the functionality of a bunch of other loaders, we don't necessarily need to refactor any of the existing ones. Can keep them as they are.
Add generic document loader
Expected changes
Before submitting
No notebooks yet, we first need to get a few of the basic parsers up (prior to advertising the interface)