Add a generic document loader #4875

Merged
eyurtsev merged 4 commits into master from eugene/add_generic_loader
May 18, 2023

Conversation

eyurtsev
Collaborator

Add generic document loader

  • This PR adds a generic document loader which can assemble a loader from a blob loader and a parser
  • Adds a registry for parsers
  • Populate registry with a default mimetype based parser
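
A minimal sketch of the composition described above, not the code in this PR: `Blob`, `Document`, `PARSER_REGISTRY`, `mimetype_parser`, and `SketchLoader` are hypothetical stand-ins defined here so the example is self-contained. It assumes a parser is simply a callable that lazily turns one blob into documents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Hypothetical stand-in: raw content plus its mimetype."""
    data: str
    mimetype: str


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document."""
    page_content: str


# Assumption: a parser is any callable that lazily yields documents from a blob.
Parser = Callable[[Blob], Iterator[Document]]


def parse_text(blob: Blob) -> Iterator[Document]:
    """Treat the whole blob as a single plain-text document."""
    yield Document(page_content=blob.data)


# Registry mapping mimetype -> parser (illustrative contents only).
PARSER_REGISTRY: Dict[str, Parser] = {"text/plain": parse_text}


def mimetype_parser(blob: Blob) -> Iterator[Document]:
    """Dispatch to whichever parser is registered for the blob's mimetype."""
    parser = PARSER_REGISTRY.get(blob.mimetype)
    if parser is None:
        raise ValueError(f"No parser registered for {blob.mimetype!r}")
    yield from parser(blob)


class SketchLoader:
    """Assemble a loader from a source of blobs and a parser."""

    def __init__(self, blobs: Iterable[Blob], parser: Parser) -> None:
        self.blobs = blobs
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        # Parse one blob at a time so nothing is materialized up front.
        for blob in self.blobs:
            yield from self.parser(blob)

    def load(self) -> List[Document]:
        return list(self.lazy_load())
```

Under these assumptions, swapping the blob source or the parser changes where content comes from or how it is parsed without touching the loader itself.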

Expected changes

  • Parsing involves loading content via IO so can be sped up via:
    • Threading in sync
    • Async
  • The actual parsing logic may be computationally involved: may need to figure out how to add multi-processing support
  • May want to add suffix based parser since suffixes are easier to specify in comparison to mime types
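
Two of the ideas above, sketched under assumptions: threaded parsing in sync code and suffix-based parser lookup. `parse_concurrently`, `parser_for_suffix`, and the `Parser` alias are hypothetical names, and documents are plain strings here for brevity. Threads mainly help with the IO portion; CPU-heavy parsing would likely still need multi-processing.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# Hypothetical parser type: takes raw text, returns a list of document strings.
Parser = Callable[[str], List[str]]


def parse_concurrently(
    blobs: Sequence[str], parse: Parser, max_workers: int = 4
) -> List[str]:
    """Parse blobs in a thread pool and flatten the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as its inputs.
        per_blob = list(pool.map(parse, blobs))
    return [doc for docs in per_blob for doc in docs]


def parser_for_suffix(path: str, parsers_by_suffix: Dict[str, Parser]) -> Parser:
    """Pick a parser from the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix not in parsers_by_suffix:
        raise ValueError(f"No parser registered for suffix {suffix!r}")
    return parsers_by_suffix[suffix]
```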

Before submitting

No notebooks yet, we first need to get a few of the basic parsers up (prior to advertising the interface)

    self, text_splitter: Optional[TextSplitter] = None
) -> List[Document]:
    """Load all documents and split them into sentences."""
    raise NotImplementedError(
Contributor

why can't we load and split (when text_splitter is not None)?

Collaborator Author

we can tackle it in the initializer and have it available through either the lazy method or the load method. I don't think we need another method to do the same thing; we could potentially add it for backwards compatibility, but I want to add something that can simply process text.


def load(self) -> List[Document]:
    """Load all documents."""
    return list(self.lazy_load())
Contributor

is this not implemented on BaseLoader like this because lazy_load was added later and isn't implemented in a lot of places?

Collaborator Author

💯
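
For illustration only, here is roughly what a base-class default could look like if every subclass implemented `lazy_load`. `SketchBaseLoader` and `CountingLoader` are hypothetical names, not the library's `BaseLoader`, and documents are plain strings here.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class SketchBaseLoader(ABC):
    """Hypothetical base class where load() is derived from lazy_load()."""

    @abstractmethod
    def lazy_load(self) -> Iterator[str]:
        """Yield documents one at a time."""

    def load(self) -> List[str]:
        # Eager loading is just the lazy iterator, fully consumed.
        return list(self.lazy_load())


class CountingLoader(SketchBaseLoader):
    """Toy subclass that only has to provide lazy_load()."""

    def __init__(self, n: int) -> None:
        self.n = n

    def lazy_load(self) -> Iterator[str]:
        for i in range(self.n):
            yield f"doc-{i}"
```

This default only works once all subclasses provide `lazy_load`, which matches the concern raised in this thread.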

glob: str = "**/[!.]*",
suffixes: Optional[Sequence[str]] = None,
show_progress: bool = False,
parser: Union[str, BaseBlobParser] = "default",
Contributor

should we do Literal["default"] instead of str?
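
A rough sketch of what a `Literal["default"]` annotation (presumably what this suggestion refers to) buys over a bare `str`: a type checker can reject any string other than `"default"`. `resolve_parser`, `default_parser`, and the callable `Parser` alias are hypothetical stand-ins, not names from this PR.

```python
from typing import Callable, List, Literal, Union

# Hypothetical stand-in for a blob parser: raw text in, documents out.
Parser = Callable[[str], List[str]]


def default_parser(text: str) -> List[str]:
    """Illustrative default: treat the whole text as one document."""
    return [text]


def resolve_parser(parser: Union[Literal["default"], Parser] = "default") -> Parser:
    """Map the "default" sentinel to a concrete parser; pass callables through."""
    if parser == "default":
        return default_parser
    if callable(parser):
        return parser
    # Reached only at runtime, e.g. when an unsupported string slips through.
    raise TypeError(f"Unsupported parser argument: {parser!r}")
```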

"""Create a generic document loader using a filesystem blob loader.

Args:
parser: A blob parser which knows how to parse blobs into documents
Contributor

worth mentioning "default"?

) -> GenericLoader:
"""Create a generic document loader using a filesystem blob loader.

Args:
Contributor

should we add types to these, and match signature order

Collaborator Author

order yes, i suggest not to add types since the type is already provided as part of the function signature

_PathLike = Union[str, Path]


class GenericLoader(BaseLoader):
Contributor

ooc are you imagining for everything that has both a blob parser/parser and doc loader atm we'll reimplement the doc loader as child of this?

Collaborator Author

Mostly thinking of this as living independently as a self-contained unit. It may need 2-3 more classmethods, and it should be able to load content from most popular locations and then parse it with any arbitrary parser.

I think this will replace the functionality of a bunch of other loaders, we don't necessarily need to refactor any of the existing ones. Can keep them as they are.

@eyurtsev eyurtsev merged commit 8e41143 into master May 18, 2023
@eyurtsev eyurtsev deleted the eugene/add_generic_loader branch May 18, 2023 02:38
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
This was referenced Jun 25, 2023