fix unstructured-text example #277

Merged
merged 5 commits into from
Aug 13, 2024

Conversation

mattseddon
Contributor

@mattseddon mattseddon commented Aug 12, 2024

This PR fixes the unstructured-text NLP example. TL;DR: the new version of nltk does not play nicely with unstructured.

@mattseddon mattseddon force-pushed the fix-unstructured-text branch from ef69797 to 826aa4c Compare August 12, 2024 02:09

cloudflare-workers-and-pages bot commented Aug 12, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: b0d65e4
Status: ✅  Deploy successful!
Preview URL: https://c29e31d9.datachain-documentation.pages.dev
Branch Preview URL: https://fix-unstructured-text.datachain-documentation.pages.dev

@mattseddon mattseddon force-pushed the fix-unstructured-text branch 5 times, most recently from 2e1a659 to 394e656 Compare August 12, 2024 09:07
import nltk

nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger_eng")
Contributor Author

[F] These are needed for the "fast" strategy, which is the only strategy supported without installing poppler or other tools that are not easy to install across all platforms.

Contributor Author

@mattseddon mattseddon Aug 12, 2024

Edit: a new version of nltk came out, which meant that unstructured could no longer download these packs by itself. I have pinned nltk back to the working version.
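
For anyone who keeps the explicit downloads instead of the pin, the sketch below skips packs that are already installed. It assumes the usual NLTK data layout (`tokenizers/punkt_tab` and `taggers/averaged_perceptron_tagger_eng`); `ensure_nltk_resources` and its `find`/`download` parameters are hypothetical names for illustration, not part of this PR.

```python
# Assumed resource paths inside the NLTK data directory.
NLTK_RESOURCES = [
    "tokenizers/punkt_tab",
    "taggers/averaged_perceptron_tagger_eng",
]


def ensure_nltk_resources(resources, find=None, download=None):
    """Download each NLTK resource only if it cannot be found locally.

    `find` and `download` default to nltk.data.find and nltk.download;
    they are injectable so the logic can be exercised without nltk.
    Returns the list of resource paths that had to be downloaded.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper loads without nltk

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for path in resources:
        try:
            find(path)  # raises LookupError when the resource is missing
        except LookupError:
            # nltk.download expects the package id, i.e. the last path part.
            download(path.split("/")[-1])
            fetched.append(path)
    return fetched
```

Calling `ensure_nltk_resources(NLTK_RESOURCES)` before partitioning documents would then be a no-op on machines that already have both packs.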

@mattseddon mattseddon marked this pull request as ready for review August 12, 2024 10:47
@mattseddon mattseddon requested a review from a team August 12, 2024 10:48
# "pdf", "ppt", "pptx", "rtf", "rst", "tsv", "xlsx"
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
Contributor Author

[F] The pipeline downloads pszemraj/led-large-book-summary, so we might as well fetch it as quickly as possible.
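
One caveat, as far as I know: `huggingface_hub` reads this variable when it is imported, and a download fails if the flag is set while the `hf_transfer` package is missing. A minimal sketch that only sets the flag when the package is importable; `enable_hf_transfer_if_available` is a hypothetical helper, not part of the example.

```python
import importlib.util
import os


def enable_hf_transfer_if_available(env=os.environ):
    """Set HF_HUB_ENABLE_HF_TRANSFER=1 only when hf_transfer is importable.

    Returns True if the flag was set, False otherwise.
    """
    available = importlib.util.find_spec("hf_transfer") is not None
    if available:
        env["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
    return available
```

It would need to run before `transformers` or `huggingface_hub` is imported, in the same place the example currently sets the variable.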

@mattseddon mattseddon self-assigned this Aug 12, 2024
@mattseddon
Contributor Author

Alternatively, we can wait for Unstructured-IO/unstructured#3512 + the next release

Contributor

@dtulga dtulga left a comment

LGTM, thanks for fixing this!

@mattseddon mattseddon force-pushed the fix-unstructured-text branch from 177ff07 to b0d65e4 Compare August 12, 2024 23:24
@mattseddon mattseddon merged commit 1fa9465 into main Aug 13, 2024
31 of 36 checks passed
@mattseddon mattseddon deleted the fix-unstructured-text branch August 13, 2024 00:00