fix unstructured-text example #277
ef69797 to 826aa4c
Deploying datachain-documentation
Latest commit: b0d65e4
Status: ✅ Deploy successful!
Preview URL: https://c29e31d9.datachain-documentation.pages.dev
Branch Preview URL: https://fix-unstructured-text.datachain-documentation.pages.dev
2e1a659 to 394e656
import nltk

nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")
[F] These are needed for the "fast" strategy, which is the only strategy supported without installing poppler or other tools that are hard to install across all platforms.
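As a hedged sketch of how these downloads could be skipped when the packs are already present: the helper below is hypothetical (`ensure_nltk_resources` and `RESOURCES` are not part of this PR) and relies only on `nltk.data.find` raising `LookupError` for a missing resource. The `find` and `download` parameters exist so the logic can be exercised without `nltk` installed.

```python
def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it is not already available.

    `resources` maps a lookup path (as used by nltk.data.find) to the
    package name passed to nltk.download. Returns the names fetched.
    """
    if find is None or download is None:
        # Import lazily so the helper can be tested with stand-ins.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path, name in resources.items():
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# The two packs downloaded in the example above.
RESOURCES = {
    "tokenizers/punkt_tab": "punkt_tab",
    "taggers/averaged_perceptron_tagger_eng": "averaged_perceptron_tagger_eng",
}
```

Calling `ensure_nltk_resources(RESOURCES)` with `nltk` installed would fetch only what is missing; the unconditional `nltk.download` calls in the diff are simpler and also fine for an example script.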
edit: a new version of `nltk` came out, which meant that `unstructured` could no longer download these packs by itself. I have pinned `nltk` back to the working version.
# "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
[F] `pipeline` downloads `pszemraj/led-large-book-summary`, so we might as well get it as quickly as possible.
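One caveat worth sketching: `huggingface_hub` reads `HF_HUB_ENABLE_HF_TRANSFER` when it is imported, and it raises an error at download time if the flag is set but the `hf_transfer` package is missing. A minimal guard, assuming a hypothetical helper name `enable_hf_transfer` (not part of this PR), could look like this:

```python
import importlib.util
import os


def enable_hf_transfer(env=os.environ):
    """Set HF_HUB_ENABLE_HF_TRANSFER=1 only if hf_transfer is importable.

    Returns True when the flag was set, False otherwise.
    """
    if importlib.util.find_spec("hf_transfer") is None:
        # Leave the flag unset so downloads fall back to the default path.
        return False
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return True


# Call this before importing transformers or huggingface_hub,
# since the flag is read at import time.
enabled = enable_hf_transfer()
```

Setting the variable unconditionally, as the example does, is reasonable when `hf_transfer` is a declared dependency; the guard only matters when it may be absent.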
Alternatively, we can wait for Unstructured-IO/unstructured#3512 + the next release
LGTM, thanks for fixing this!
177ff07 to b0d65e4
This PR fixes the unstructured-text NLP example. Tl;dr: the new version of `nltk` does not play nicely with `unstructured`.