Bibtex integration for document loader and retriever #5137

Merged
24 commits merged on May 25, 2023

eyurtsev
Collaborator

Bibtex integration

Wrap bibtexparser to retrieve a list of docs from a bibtex file.

  • Get the metadata from the bibtex entries
  • Get page_content from the local PDF referenced in the file field of the bibtex entry, using pymupdf
  • If there is no valid PDF file, set page_content to the abstract field of the bibtex entry
  • Support the Zotero flavour, using a regex to get the file path
  • Add a usage example in docs/modules/indexes/document_loaders/examples/bibtex.ipynb
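
The fallback described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the regex, the helper names `pdf_path_from_entry` and `page_content_for`, and the assumed Zotero layout `description:path:application/pdf` are all hypothetical, and the PDF reader is passed in as a plain callable so the sketch does not depend on pymupdf.

```python
import re
from pathlib import Path
from typing import Callable, Dict, Optional

# ASSUMPTION: a Zotero-style `file` field looks like
# "Full Text PDF:/path/to/paper.pdf:application/pdf".
# This pattern is illustrative and is not the regex used in the PR.
ZOTERO_FILE = re.compile(
    r"^[^:]*:(?P<path>.+\.pdf):application/pdf$", re.IGNORECASE
)


def pdf_path_from_entry(entry: Dict[str, str]) -> Optional[str]:
    """Return the PDF path named by the entry's `file` field, if any."""
    raw = entry.get("file", "").strip()
    if not raw:
        return None
    match = ZOTERO_FILE.match(raw)
    if match:
        # Zotero flavour: strip the description and MIME type.
        return match.group("path")
    # Plain flavour: the field is assumed to be the path itself.
    return raw if raw.lower().endswith(".pdf") else None


def page_content_for(
    entry: Dict[str, str], read_pdf: Callable[[str], str]
) -> str:
    """Use the PDF text when the file exists, else fall back to the abstract."""
    path = pdf_path_from_entry(entry)
    if path is not None and Path(path).is_file():
        return read_pdf(path)
    return entry.get("abstract", "")
```

In a real loader, `read_pdf` would be backed by a PDF library such as pymupdf; here any function that maps a path to text will do.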

Who can review?

My best guess: @eyurtsev, @dev2049

logger = logging.getLogger(__name__)


class BibtexparserWrapper(BaseModel):
Collaborator Author

@wavefrontshaping Why do we have a separate wrapper instead of using the BaseLoader interface?

meta[field] = entry[field]
return meta

def run(self, file_path: str) -> str:
Collaborator Author

What is the purpose of the run interface?

else "No good bibtex information found. Check your bibtex file."
)

def lazy_load(self, file_path: str) -> Iterator[Document]:
Collaborator Author

@wavefrontshaping I added a lazy_load. We want all loaders to be lazy by default; load simply calls list on the result of lazy_load.
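
A minimal sketch of that lazy-by-default pattern, for illustration only: the `Document` dataclass and the `EntryLoader` class below are hypothetical stand-ins, not the classes from this PR or from the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for the library's document type."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class EntryLoader:
    """Hypothetical loader over already-parsed bibtex entries."""

    def __init__(self, entries: List[Dict[str, str]]) -> None:
        self.entries = entries

    def lazy_load(self) -> Iterator[Document]:
        # Yield one document at a time so callers can stop early
        # without building every document up front.
        for entry in self.entries:
            yield Document(
                page_content=entry.get("abstract", ""),
                metadata={"title": entry.get("title", "")},
            )

    def load(self) -> List[Document]:
        # The eager API is just the lazy one, materialized.
        return list(self.lazy_load())
```

Keeping `load` as a thin wrapper means there is a single code path to maintain, and callers who only need the first few documents can iterate over `lazy_load` directly.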

)


@pytest.mark.requires("pymupdf")
Collaborator Author

Moved to a unit test rather than an integration test so that the test runs on CI

@eyurtsev
Collaborator Author

@wavefrontshaping apologies, I was AFK for a while today, so it took me a while to get through the linting issues.

I made a few refactors to fix some static typing issues; let me know if anything looks incorrect or isn't consistent with the intent of the original code.

I left a few questions on the PR -- was wondering what the purpose of the run method was and whether it would make sense to build the bibtex loader directly as a loader without creating a bibtex utility

@wavefrontshaping
Contributor

wavefrontshaping commented May 24, 2023

> I left a few questions on the PR -- was wondering what the purpose of the run method was and whether it would make sense to build the bibtex loader directly as a loader without creating a bibtex utility

@eyurtsev I do not have a good answer.
I did not know what structure was typically expected, so I duplicated the Arxiv utility and loader and started from there, copying the framework and functions without questioning their relevance.

@wavefrontshaping
Contributor

And thanks for your help and responsiveness, BTW!

dev2049 added the "03 enhancement" (Enhancement of existing functionality) and "Ɑ: doc loader" (Related to document loader module, not documentation) labels May 24, 2023
dev2049 merged commit 5cfa72a into master May 25, 2023
dev2049 deleted the eugene/bibtex branch May 25, 2023 07:21
danielchalef mentioned this pull request Jun 5, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
This was referenced Jun 25, 2023