Bibtex integration for document loader and retriever #5137
logger = logging.getLogger(__name__)


class BibtexparserWrapper(BaseModel):
@wavefrontshaping why do we have a separate wrapper instead of using the `BaseLoader` interface?
langchain/utilities/bibtex.py
Outdated
            meta[field] = entry[field]
        return meta

    def run(self, file_path: str) -> str:
What is the purpose of the run interface?
langchain/utilities/bibtex.py
Outdated
            else "No good bibtex information found. Check your bibtex file."
        )

    def lazy_load(self, file_path: str) -> Iterator[Document]:
@wavefrontshaping I added a `lazy_load`. We want all loaders to be lazy by default; `load` just calls `list` on the result of `lazy_load`.
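The lazy-by-default pattern described in this comment can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Document` here is a stand-in dataclass and `SketchLoader` is a hypothetical class, both invented so the example runs without LangChain installed.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document class."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class SketchLoader:
    """Hypothetical loader illustrating the lazy-by-default pattern."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    def lazy_load(self) -> Iterator[Document]:
        # Yield documents one at a time so callers can stop early
        # without paying for the entries they never consume.
        for i, text in enumerate(self.texts):
            yield Document(page_content=text, metadata={"index": str(i)})

    def load(self) -> List[Document]:
        # Eager loading is just the lazy iterator materialised into a list.
        return list(self.lazy_load())
```

With this shape, the parsing logic lives only in `lazy_load`, and `load` stays a one-line convenience wrapper.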
)


@pytest.mark.requires("pymupdf")
Moved to a unit test rather than an integration test so that the test runs on CI.
@wavefrontshaping apologies, I was AFK for a while today, so it took me a while to get through the linting issues. I made a few refactors to fix some static typing issues; let me know if anything looks incorrect or isn't consistent with the intent of the original code. I left a few questions on the PR -- was wondering what the purpose of the
@eyurtsev I do not have a good reply.
And thanks for your help and responsiveness, BTW!
# Bibtex integration

Wrap bibtexparser to retrieve a list of docs from a bibtex file.

* Get the metadata from the bibtex entries
* `page_content` is read from the local PDF referenced in the `file` field of the bibtex entry, using `pymupdf`
* If there is no valid PDF file, `page_content` is set to the `abstract` field of the bibtex entry
* Support the Zotero flavour, using a regex to get the file path
* Added a usage example in `docs/modules/indexes/document_loaders/examples/bibtex.ipynb`

---------

Co-authored-by: Sébastien M. Popoff <sebastien.popoff@espci.fr>
Co-authored-by: Dev 2049 <dev.dev2049@gmail.com>
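The behaviour listed in the description can be sketched roughly as below. This is only an illustration under stated assumptions, not the PR's implementation: the function names, the `META_FIELDS` selection, and the assumed Zotero value shape (`<label>:<path>:application/pdf`) are all hypothetical, and `read_pdf` is an injected callable standing in for the `pymupdf`-based text extraction. Entries are plain dicts, as a bibtex parser would typically produce.

```python
import os
import re
from typing import Callable, Dict, Optional

# Hypothetical selection of fields to copy into the metadata.
META_FIELDS = ("title", "author", "year", "journal", "doi")

# Assumed Zotero-style value: "<label>:<path>:application/pdf".
ZOTERO_FILE = re.compile(r"^[^:]*:(?P<path>.+\.pdf):application/pdf$")


def metadata(entry: Dict[str, str]) -> Dict[str, str]:
    # Copy only the selected fields that are present in the entry.
    return {f: entry[f] for f in META_FIELDS if f in entry}


def pdf_path(entry: Dict[str, str]) -> Optional[str]:
    raw = entry.get("file")
    if not raw:
        return None
    # Fall back to the raw value when it is not in the assumed Zotero shape.
    match = ZOTERO_FILE.match(raw)
    return match.group("path") if match else raw


def page_content(entry: Dict[str, str], read_pdf: Callable[[str], str]) -> str:
    path = pdf_path(entry)
    if path and os.path.isfile(path):
        # A valid local PDF exists: use its extracted text.
        return read_pdf(path)
    # Otherwise fall back to the abstract (empty string if there is none).
    return entry.get("abstract", "")
```

Passing `read_pdf` as a parameter keeps the sketch runnable without `pymupdf` installed and makes the abstract fallback easy to exercise in a unit test.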
Who can review?
My best guess: @eyurtsev, @dev2049