Add a generic document loader #4875

eyurtsev · 2023-05-17T19:02:46Z

Add generic document loader

This PR:
- Adds a generic document loader, which can assemble a loader from a blob loader and a parser
- Adds a registry for parsers
- Populates the registry with a default mimetype-based parser
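
As a rough illustration of the composition described above (not the code in this PR), the sketch below wires a blob source, a mimetype-dispatching parser, and a small registry into one loader. Every name here (`Blob`, `Document`, `MimeTypeParser`, `PARSER_REGISTRY`, `SketchLoader`) is a hypothetical stand-in chosen for the example.

```python
import mimetypes
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Raw bytes plus the path used to guess a mime type (hypothetical)."""

    data: bytes
    path: str

    @property
    def mimetype(self) -> str:
        guessed, _ = mimetypes.guess_type(self.path)
        return guessed or "application/octet-stream"


@dataclass
class Document:
    """Parsed text plus where it came from (hypothetical)."""

    page_content: str
    source: str


# A parser turns one blob into zero or more documents.
Parser = Callable[[Blob], Iterator[Document]]


def parse_text(blob: Blob) -> Iterator[Document]:
    yield Document(page_content=blob.data.decode("utf-8"), source=blob.path)


class MimeTypeParser:
    """Dispatch to a handler based on the blob's mime type."""

    def __init__(self, handlers: Dict[str, Parser]) -> None:
        self.handlers = handlers

    def __call__(self, blob: Blob) -> Iterator[Document]:
        handler = self.handlers.get(blob.mimetype)
        if handler is None:
            raise ValueError(f"No parser registered for {blob.mimetype}")
        yield from handler(blob)


# Registry of named parsers, populated with a mimetype-based default.
PARSER_REGISTRY: Dict[str, Parser] = {
    "default": MimeTypeParser({"text/plain": parse_text}),
}


class SketchLoader:
    """Assemble a loader from a source of blobs and a parser."""

    def __init__(self, blobs: Iterable[Blob], parser: Parser) -> None:
        self.blobs = blobs
        self.parser = parser

    def lazy_load(self) -> Iterator[Document]:
        for blob in self.blobs:
            yield from self.parser(blob)

    def load(self) -> List[Document]:
        return list(self.lazy_load())
```

Usage under these assumptions: `SketchLoader([Blob(b"hello", "notes.txt")], PARSER_REGISTRY["default"]).load()` yields a single document whose `page_content` is `"hello"`.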

Expected changes

Parsing involves loading content via IO, so it can be sped up via:
- Threading in sync code
- Async

The actual parsing logic may be computationally involved: may need to figure out how to add multi-processing support.
May want to add a suffix-based parser, since suffixes are easier to specify than mime types.
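
A minimal sketch of two of the ideas above, under stated assumptions: reading blobs through a thread pool to overlap IO, and choosing a parser by file suffix. `read_all` and `parser_for_suffix` are hypothetical helper names, not part of this PR.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def read_all(
    paths: Iterable[Path],
    read: Callable[[Path], bytes] = Path.read_bytes,
    max_workers: int = 4,
) -> List[bytes]:
    """Read every path on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read, paths))


def parser_for_suffix(
    path: Path, parsers: Dict[str, Callable[[bytes], str]]
) -> Callable[[bytes], str]:
    """Look up a parser by lower-cased file suffix, e.g. '.txt'."""
    suffix = path.suffix.lower()
    try:
        return parsers[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for suffix {suffix!r}") from None
```

For CPU-bound parsing, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface, but the callable and its arguments must then be picklable, which may constrain how parsers are defined.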

Before submitting

No notebooks yet, we first need to get a few of the basic parsers up (prior to advertising the interface)

dev2049 · 2023-05-17T19:15:50Z

langchain/document_loaders/generic.py

+        self, text_splitter: Optional[TextSplitter] = None
+    ) -> List[Document]:
+        """Load all documents and split them into sentences."""
+        raise NotImplementedError(


why can't we load and split (when text_splitter is not None)?

We can tackle it in the initializer and have it available through either the lazy method or the load method. I don't think we need another method to do the same thing. We could potentially add it for backwards compatibility, but I want to add something that can simply process text.
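
One way to read this reply, sketched with hypothetical names: the splitter is passed once to the initializer, and both the lazy and the eager method apply it, so no separate load-and-split method is needed. `SplittingLoader` and the plain-callable splitter are assumptions for illustration, not the PR's API.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# Hypothetical splitter type: one text in, a list of chunks out.
Splitter = Callable[[str], List[str]]


class SplittingLoader:
    """Loader that optionally splits each text as it is produced."""

    def __init__(
        self, texts: Iterable[str], splitter: Optional[Splitter] = None
    ) -> None:
        self.texts = texts
        self.splitter = splitter

    def lazy_load(self) -> Iterator[str]:
        for text in self.texts:
            if self.splitter is None:
                yield text
            else:
                yield from self.splitter(text)

    def load(self) -> List[str]:
        # The eager path reuses the lazy one, so splitting applies to both.
        return list(self.lazy_load())
```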

dev2049 · 2023-05-17T19:16:40Z

langchain/document_loaders/generic.py

+
+    def load(self) -> List[Document]:
+        """Load all documents."""
+        return list(self.lazy_load())


Is this not implemented on BaseLoader like this because lazy_load was added later and isn't implemented in a lot of places?

dev2049 · 2023-05-17T19:18:30Z

langchain/document_loaders/generic.py

+        glob: str = "**/[!.]*",
+        suffixes: Optional[Sequence[str]] = None,
+        show_progress: bool = False,
+        parser: Union[str, BaseBlobParser] = "default",


should we do Literal["default"] instead of str?
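
Assuming the suggestion refers to `typing.Literal`, a sketch of the narrower annotation might look like this. `BaseBlobParser` is stubbed locally, and `DefaultParser` and `resolve_parser` are hypothetical names used only for the example.

```python
from typing import Literal, Union


class BaseBlobParser:
    """Local stand-in for the real parser base class."""


class DefaultParser(BaseBlobParser):
    """Hypothetical parser returned for the "default" keyword."""


# Only the exact string "default" is accepted by a type checker, not any str.
ParserArg = Union[Literal["default"], BaseBlobParser]


def resolve_parser(parser: ParserArg = "default") -> BaseBlobParser:
    if isinstance(parser, BaseBlobParser):
        return parser
    if parser == "default":
        return DefaultParser()
    # Type checkers reject other strings, but guard at runtime as well.
    raise ValueError(f"Unknown parser name: {parser!r}")
```

With this annotation a call such as `resolve_parser("defualt")` is flagged by a static type checker, whereas a plain `str` annotation would accept it silently.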

dev2049 · 2023-05-17T19:18:47Z

langchain/document_loaders/generic.py

+        """Create a generic document loader using a filesystem blob loader.
+
+        Args:
+            parser: A blob parser which knows how to parse blobs into documents


worth mentioning "default"?

dev2049 · 2023-05-17T19:18:58Z

langchain/document_loaders/generic.py

+    ) -> GenericLoader:
+        """Create a generic document loader using a filesystem blob loader.
+
+        Args:


should we add types to these, and match signature order

Order: yes. I suggest not adding types, since the type is already provided as part of the function signature.

dev2049 · 2023-05-17T19:22:22Z

langchain/document_loaders/generic.py

+_PathLike = Union[str, Path]
+
+
+class GenericLoader(BaseLoader):


ooc are you imagining for everything that has both a blob parser/parser and doc loader atm we'll reimplement the doc loader as child of this?

Mostly thinking of this as living independently as a self-contained unit. It may need 2-3 more classmethods, and it should be able to load content from most popular locations and then parse it with any arbitrary parser.

I think this will replace the functionality of a bunch of other loaders, we don't necessarily need to refactor any of the existing ones. Can keep them as they are.

eyurtsev added 2 commits May 17, 2023 14:56

x

d80922f

q

54313ce

eyurtsev requested review from hwchase17, vowelparrot and dev2049 May 17, 2023 19:02

dev2049 reviewed May 17, 2023

eyurtsev added 2 commits May 17, 2023 22:19

w

184c792

q

f579b00

eyurtsev merged commit 8e41143 into master May 18, 2023

eyurtsev deleted the eugene/add_generic_loader branch May 18, 2023 02:38

danielchalef mentioned this pull request Jun 5, 2023

Zep Hybrid Search #5742

Merged

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged
