Bibtex integration for document loader and retriever #5137

eyurtsev · 2023-05-23T16:05:58Z

Bibtex integration

Wrap bibtexparser to retrieve a list of docs from a bibtex file.

* Get the metadata from the bibtex entries.
* `page_content` is read, using `pymupdf`, from the local PDF referenced in the `file` field of the bibtex entry.
* If there is no valid PDF file, `page_content` is set to the `abstract` field of the bibtex entry.
* Support the Zotero flavour, using a regex to get the file path.
* Added a usage example in `docs/modules/indexes/document_loaders/examples/bibtex.ipynb`.
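
As a rough illustration of the behaviour listed above (not the code from this PR), here is a minimal sketch of the fallback logic. It assumes a Zotero-style `file` field of the form `label:path:mime type`, with several attachments separated by `;`. The names `pdf_paths_from_file_field` and `choose_page_content` and the injected `read_pdf` callable are hypothetical. The actual loader relies on `bibtexparser` and `pymupdf`, which are left out here so the sketch runs with the standard library only.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumed Zotero-style attachment layout: "<label>:<path>:application/pdf".
_ZOTERO_PDF = re.compile(r"^[^:]*:(?P<path>.+?):application/pdf$")


def pdf_paths_from_file_field(file_field: str) -> List[str]:
    """Return candidate PDF paths found in a bibtex `file` field."""
    paths: List[str] = []
    for part in file_field.split(";"):
        part = part.strip()
        match = _ZOTERO_PDF.match(part)
        if match:
            # Zotero flavour: keep only the path component.
            paths.append(match.group("path"))
        elif part.lower().endswith(".pdf"):
            # Plain flavour: the field is already a path to a PDF.
            paths.append(part)
    return paths


def choose_page_content(
    entry: Dict[str, str],
    read_pdf: Callable[[str], Optional[str]],
) -> str:
    """Prefer text from a readable PDF, else fall back to the abstract."""
    for path in pdf_paths_from_file_field(entry.get("file", "")):
        text = read_pdf(path)  # hypothetical reader; returns None on failure
        if text:
            return text
    return entry.get("abstract", "")
```

Passing the PDF reader in as a callable keeps the fallback rule testable without any PDF on disk; in a real loader that callable would wrap the PDF library.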

Who can review?

My best guess: @eyurtsev, @dev2049

eyurtsev · 2023-05-23T16:07:33Z

langchain/utilities/bibtex.py

+logger = logging.getLogger(__name__)
+
+
+class BibtexparserWrapper(BaseModel):


@wavefrontshaping why do we have a separate wrapper instead of using the BaseLoader interface?

eyurtsev · 2023-05-23T16:08:19Z

langchain/utilities/bibtex.py

+                    meta[field] = entry[field]
+        return meta
+
+    def run(self, file_path: str) -> str:


What is the purpose of the run interface?

eyurtsev · 2023-05-23T16:08:48Z

langchain/utilities/bibtex.py

+            else "No good bibtex information found. Check your bibtex file."
+        )
+
+    def lazy_load(self, file_path: str) -> Iterator[Document]:


@wavefrontshaping I added a lazy load. We want all loaders to be lazy by default; `load` simply calls `list` on the result of `lazy_load`.
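
The pattern described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: `Document` is a stand-in dataclass rather than the library's class, `SketchBibtexLoader` is a hypothetical name, and the loader takes already-parsed entries instead of a file path.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """Stand-in for the library's Document type (assumption for this sketch)."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class SketchBibtexLoader:
    """Hypothetical loader over pre-parsed bibtex entries."""

    def __init__(self, entries: List[Dict[str, str]]) -> None:
        self.entries = entries

    def lazy_load(self) -> Iterator[Document]:
        # Yield one document at a time so callers can stop early.
        for entry in self.entries:
            yield Document(
                page_content=entry.get("abstract", ""),
                metadata={"id": entry.get("ID", "")},
            )

    def load(self) -> List[Document]:
        # Eager loading is just the lazy iterator materialised into a list.
        return list(self.lazy_load())
```

With this shape, only `lazy_load` carries the parsing logic, and `load` stays a one-liner.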

eyurtsev · 2023-05-23T16:13:12Z

tests/unit_tests/document_loaders/test_bibtex.py

+)
+
+
+@pytest.mark.requires("pymupdf")


Moved to a unit test rather than integration test so that the test runs on CI

eyurtsev · 2023-05-24T02:49:08Z

@wavefrontshaping apologies, I was afk for a while today, so it took me a while to get through the linting issues.

I made a few refactors to fix some static typing issues; let me know if anything looks incorrect or isn't consistent with the intent of the original code.

I left a few questions on the PR -- was wondering what the purpose of the run method was and whether it would make sense to build the bibtex loader directly as a loader without creating a bibtex utility

wavefrontshaping · 2023-05-24T13:34:18Z

> I left a few questions on the PR -- was wondering what the purpose of the run method was and whether it would make sense to build the bibtex loader directly as a loader without creating a bibtex utility

@eyurtsev I do not have a good answer.
I did not know what structure was typically expected, so I duplicated the Arxiv utility and loader and started from there. I copied the frame and functions without questioning their relevance.

wavefrontshaping · 2023-05-24T13:34:48Z

And thanks for your help and responsiveness, BTW!

# Bibtex integration

Wrap bibtexparser to retrieve a list of docs from a bibtex file.

* Get the metadata from the bibtex entries
* `page_content` get from the local pdf referenced in the `file` field of the bibtex entry using `pymupdf`
* If no valid pdf file, `page_content` set to the `abstract` field of the bibtex entry
* Support Zotero flavour using regex to get the file path
* Added usage example in `docs/modules/indexes/document_loaders/examples/bibtex.ipynb`

---------

Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>

wavefrontshaping and others added 10 commits May 15, 2023 12:59

Wrap bibtexparser to be used as document loader and retriever

d14490b

more general and agnostic metadata retrieval

dec8ff0

use file_path instead of path

138dc9e

pep8

c88c879

add unit tests

2e2b0b0

example bibtex file

e2f4f4f

remove bibtex retriever (useless)

1adee8c

Merge branch 'wavefrontshaping/master' into eugene/bibtex

3131046

q

a678dd1

q

c7b43be

eyurtsev mentioned this pull request May 23, 2023

Bibtex integration for document loader and retriever #4719

Closed

eyurtsev added 3 commits May 23, 2023 12:08

q

a46964f

x

3172eaf

q

e0172d6

eyurtsev added 6 commits May 23, 2023 12:47

x

e261334

Merge branch 'master' into eugene/bibtex

10b7b9f

q

e41f8a3

x

a9aca95

q

79fbd8e

q

9ce5278

dev2049 added the labels "03 enhancement" (Enhancement of existing functionality) and "Ɑ: doc loader" (Related to document loader module, not documentation) May 24, 2023

dev2049 added 4 commits May 24, 2023 15:57

merge

cfc88ac

cr

a21d491

cr

1481082

ignore

2fa5188

undo

0811a86

dev2049 merged commit 5cfa72a into master May 25, 2023

dev2049 deleted the eugene/bibtex branch May 25, 2023 07:21

danielchalef mentioned this pull request Jun 5, 2023

Zep Hybrid Search #5742

Merged

This was referenced Jun 25, 2023

Zep Authentication #6725

Closed

Zep Authentication #6728

Merged
