bug/_download_nltk_packages_if_not_present throws HTTP 403 Forbidden #3795

Closed
tn-halfspace opened this issue Nov 25, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@tn-halfspace

Describe the bug
Until 25.11.2024 I hadn't seen any problems with this function. Now this URL:

https://utic-public-cf.s3.amazonaws.com/nltk_data_3.8.2.tar.gz

returns 403.

Edit: it's back up now, but it seems inconsistent.

tn-halfspace added the bug label Nov 25, 2024
@villagab4

Also experiencing this issue!

@derekhsu

I still have the issue now. Where can I download and install it?

@akhilkapil

Hi @tn-halfspace, were you able to resolve this issue?

@tn-halfspace
Author

This is ridiculous @Unstructured-DevOps. It happened again today.

I've created my own workaround: use nltk's downloader directly, instead of relying on the unstructured.nlp downloader, to download the two packages that unstructured.io uses (averaged_perceptron_tagger_eng and punkt_tab).
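The workaround described above might look roughly like the sketch below. It is not the author's actual code: the helper name `download_nltk_packages` and the injectable `download` parameter are made up for illustration, and only `nltk.download` and the two package names come from NLTK and this comment.

```python
# Package names as given in the comment above.
REQUIRED_PACKAGES = ("averaged_perceptron_tagger_eng", "punkt_tab")


def download_nltk_packages(packages=REQUIRED_PACKAGES, download=None):
    """Fetch each package with NLTK's own downloader; return the names that failed."""
    if download is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        import nltk

        download = nltk.download
    failed = []
    for name in packages:
        # nltk.download returns True on success and False on failure.
        if not download(name, quiet=True):
            failed.append(name)
    return failed
```

Calling `download_nltk_packages()` before the first unstructured call fetches the data from NLTK's own index rather than the S3 tarball. A non-empty return value lists the packages that still need attention.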

I think in a perfect world the best practice would be to ship your code with these two packages and install them manually by setting the NLTK data path (NLTK_DATA), as described in the NLTK documentation.
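A minimal sketch of the "ship the data yourself" idea, under some assumptions. NLTK reads its data directories from the `NLTK_DATA` environment variable. The helper name `use_bundled_nltk_data` is hypothetical. The bundle is assumed to follow NLTK's usual layout (`tokenizers/punkt_tab`, `taggers/averaged_perceptron_tagger_eng`). Whether a given unstructured version then skips its own download has not been verified here.

```python
import os
from pathlib import Path


def use_bundled_nltk_data(bundle_dir):
    """Put a bundled data directory first in NLTK_DATA and return the new value.

    Call this before importing nltk, which reads NLTK_DATA when it is imported.
    """
    bundle = Path(bundle_dir).resolve()
    if not bundle.is_dir():
        raise FileNotFoundError(f"bundled NLTK data directory not found: {bundle}")
    existing = os.environ.get("NLTK_DATA", "")
    # Keep any directories that were already configured, after the bundled one.
    others = [p for p in existing.split(os.pathsep) if p and p != str(bundle)]
    os.environ["NLTK_DATA"] = os.pathsep.join([str(bundle), *others])
    return os.environ["NLTK_DATA"]
```

For example, an application that ships the data in a `vendor/nltk_data` folder (a made-up location) would call `use_bundled_nltk_data("vendor/nltk_data")` at startup, before anything imports nltk.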

@vangheem
Contributor

If you upgrade to the latest, it should be fixed: #3796

@skshetry

I am still getting AccessDenied. @vangheem, would it be possible to keep the URL up?

We are still stuck with an older version of unstructured due to #3731 at the moment.

@kartik4949

The following could be a hack for solving this issue:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

@skshetry

> The following could be a hack for solving this issue:
>
> import nltk
> nltk.download('punkt')
> nltk.download('averaged_perceptron_tagger')

Yeah, that's what I am doing right now (iterative/datachain#687). :)

@djibomar

This started happening for us today as well. The previous version downloaded the NLTK data from this S3 URL, https://utic-public-cf.s3.amazonaws.com/nltk_data_3.8.2.tar.gz, but it now just throws a 403.
Why did this happen?

@jexp

jexp commented Dec 13, 2024

Same here, we're also seeing the 403 errors.

@zya

zya commented Dec 13, 2024

I'm experiencing this intermittently as well.

@dipanjanS

This is still happening and has broken all our pipelines

@jan-schneider3

Same here

@zya

zya commented Dec 13, 2024

Updating to 0.16.11 seems to have addressed the issue for me.

@dipanjanS

It still doesn't work even with 0.16.11; it's literally the same issue, but 0.15 seems to be working. This is really frustrating.


@dipanjanS

Reinstalled everything and re-updated; it's working now with the latest version so far. However, I now have other issues where the outputs have changed :) but that's for another day (issue). Thanks for all the suggestions in this thread!

@doyeka

doyeka commented Dec 13, 2024

> If you upgrade to the latest, it should be fixed: #3796

Can confirm upgrading to the latest version of unstructured (0.16.11) resolved the bug for me.
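For anyone who wants to check whether their environment already has a release at or above the one reported working here, the following is a rough sketch. The 0.16.11 threshold is only the version mentioned in this thread, and the fix from #3796 may have shipped earlier. `parse_release` and `has_fix` are illustrative helpers, not part of unstructured.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Version reported in this thread as resolving the 403 (an assumption, see above).
FIXED_IN = (0, 16, 11)


def parse_release(version_string):
    """Return the leading numeric components, e.g. '0.16.11' -> (0, 16, 11).

    Suffixes such as '.dev0' are ignored, so pre-releases compare as the release.
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_fix(installed, fixed_in=FIXED_IN):
    """True if the installed version is at least the version reported as fixed."""
    return parse_release(installed) >= fixed_in


if __name__ == "__main__":
    try:
        print("has fix:", has_fix(version("unstructured")))
    except PackageNotFoundError:
        print("unstructured is not installed in this environment")
```

If this reports False, upgrading unstructured as suggested above is the first thing to try.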

ccurme added a commit to langchain-ai/langchain that referenced this issue Dec 14, 2024
Bump unstructured to pick up resolution of
Unstructured-IO/unstructured#3795
@scanny
Collaborator

scanny commented Dec 14, 2024

Closing as resolved. If you're still having trouble feel free to reopen.
