Docugami DataLoader #4727
## What is Docugami?

Docugami converts business documents into a Document XML Knowledge Graph, generating forests of XML semantic trees representing entire documents. This is a rich representation that includes the semantic and structural characteristics of various chunks in the document as an XML tree.
nit: "generating forests of XML semantic trees" -- I don't know what this phrase means, and I suspect other readers will be confused as well. I'm going to land as is, but feel free to send a new PR to update the language or to address any of the nits below.
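One possible reading of the phrase, sketched below: each document becomes one XML tree whose element names describe what a chunk means, and a set of documents is a "forest" of such trees. The tag names and sample text here are invented for illustration and are not Docugami's actual schema or output.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic XML for two documents; the tag names are invented
# for illustration and are not Docugami's actual schema.
LEASE_XML = """
<Lease>
  <Parties>
    <Landlord>Acme Holdings</Landlord>
    <Tenant>Jane Doe</Tenant>
  </Parties>
  <Term>
    <StartDate>2023-01-01</StartDate>
  </Term>
</Lease>
"""

NDA_XML = """
<NDA>
  <Parties>
    <DisclosingParty>Acme Holdings</DisclosingParty>
  </Parties>
</NDA>
"""


def leaf_chunks(root):
    """Return (slash-joined tag path, text) for every leaf element under root."""
    chunks = []

    def walk(node, path):
        here = path + [node.tag]
        children = list(node)
        if not children:
            chunks.append(("/".join(here), (node.text or "").strip()))
        for child in children:
            walk(child, here)

    walk(root, [])
    return chunks


# One tree per document; the collection of trees is the "forest".
forest = [ET.fromstring(LEASE_XML), ET.fromstring(NDA_XML)]
for tree in forest:
    for path, text in leaf_chunks(tree):
        print(path, "->", text)
```

In this sketch the tag path carries the structural position of a chunk, while the tag names themselves carry its semantic label.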
access_token: Optional[str] = os.environ.get("DOCUGAMI_API_KEY")
docset_id: Optional[str]
document_ids: Optional[Sequence[str]]
file_paths: Optional[Sequence[Path]]
@tjaffri I updated Lists to Sequences since we want things to be immutable. Consider adding support for `str`, since users are likely to use either `Path` or `str` to work with files.
Thank you! Good idea on `str` input. I can send a follow-up change for your review at your leisure.
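A minimal sketch of what accepting both `Path` and `str` could look like, assuming the loader normalizes its input up front. The `PathLike` alias and the `normalize_paths` helper are hypothetical names for illustration, not part of the PR.

```python
from pathlib import Path
from typing import List, Sequence, Union

# Hypothetical alias for illustration; not taken from the PR.
PathLike = Union[Path, str]


def normalize_paths(file_paths: Sequence[PathLike]) -> List[Path]:
    """Convert a sequence of str or Path entries into Path objects."""
    # A bare str is itself a Sequence[str]; reject it so "a.pdf" is not
    # silently treated as the characters "a", ".", "p", "d", "f".
    if isinstance(file_paths, (str, Path)):
        raise TypeError("file_paths must be a sequence of paths, not a single path")
    return [Path(p) for p in file_paths]


# Tuples work as well as lists, since only the read-only Sequence
# interface is required.
paths = normalize_paths(("report.pdf", Path("contracts") / "lease.docx"))
print(paths)
```

The guard against a bare string is a design choice worth considering once `str` is allowed, because a single string would otherwise type-check as a sequence of one-character strings.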
@@ -0,0 +1,28 @@
"""Test DocugamiLoader."""
@tjaffri this was moved to unit tests
Awesome thank you
cc @tjaffri PR has been merged -- I made minor changes to resolve merge conflicts, changed some type annotations, and moved the test to the unit tests folder.
@eyurtsev thank you!
# Docs and code review fixes for Docugami DataLoader

1. I noticed a couple of hyperlinks that are not loading in the langchain docs (I guess need explicit anchor tags). Added those.
2. In code review @eyurtsev had a [suggestion](#4727 (comment)) to allow string paths. Turns out just updating the type works (I tested locally with string paths).

# Pre-submission checks

I ran `make lint` and `make tests` successfully.

---------

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>
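On the note above that "just updating the type works": one plausible explanation, assuming the loader only hands each path to standard-library file APIs, is that `open()` accepts both `str` and `Path`. The loader's internals are not shown in this thread, so the `read_all` function below is a hypothetical stand-in rather than the actual implementation.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence, Union


def read_all(file_paths: Sequence[Union[Path, str]]) -> List[bytes]:
    """Read each file; open() accepts str as well as os.PathLike objects."""
    contents = []
    for p in file_paths:
        with open(p, "rb") as f:
            contents.append(f.read())
    return contents


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.txt"
    target.write_bytes(b"hello")
    # The same file, passed once as a Path and once as a str.
    results = read_all([target, str(target)])

print(results)
```

If the code ever calls `Path`-only methods such as `.read_bytes()` or `.suffix` on an entry, widening the annotation alone would no longer be enough and the entries would need converting first.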
Adds a document loader for Docugami
Specifically:
Here is an example of a result that is not possible without the capabilities added by Docugami (from the notebook):