
load_dataset("amazon_polarity") NonMatchingChecksumError #1856

Closed
yanxi0830 opened this issue Feb 10, 2021 · 12 comments

Comments

@yanxi0830

Hi, it seems that loading the amazon_polarity dataset gives a NonMatchingChecksumError.

To reproduce:

from datasets import load_dataset
dataset = load_dataset("amazon_polarity")

This will give the following error:

---------------------------------------------------------------------------
NonMatchingChecksumError                  Traceback (most recent call last)
<ipython-input-3-8559a03fe0f8> in <module>()
----> 1 dataset = load_dataset("amazon_polarity")

3 frames
/usr/local/lib/python3.6/dist-packages/datasets/utils/info_utils.py in verify_checksums(expected_checksums, recorded_checksums, verification_name)
     37     if len(bad_urls) > 0:
     38         error_msg = "Checksums didn't match" + for_verification_name + ":\n"
---> 39         raise NonMatchingChecksumError(error_msg + str(bad_urls))
     40     logger.info("All the checksums matched successfully" + for_verification_name)
     41 

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/u/0/uc?id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM&export=download']
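For context, the check that fails here compares a checksum recorded in the dataset's metadata against the bytes that were actually downloaded; when Google Drive serves an error page instead of the archive, the hash no longer matches. A minimal sketch of that kind of check (simplified, and the helper name is illustrative; the real `verify_checksums` in `datasets` compares recorded metadata rather than recomputing it here):

```python
import hashlib

def sha256_of(path):
    # Stream the file in 1 MiB chunks so large archives don't have to
    # fit in memory, then return the hex digest that would be compared
    # against the recorded checksum.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

If the digest of the cached download differs from the recorded one, the download was corrupted or replaced by an error page.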
@lhoestq
Member

lhoestq commented Feb 10, 2021

Hi! This issue may be related to #996.
It probably comes from the Quota Exceeded error on Google Drive.
Can you try again tomorrow and see if you still get the error?

On my side I didn't get any error today with load_dataset("amazon_polarity").

@calebchiam

+1 encountering this issue as well

@jguoxu

jguoxu commented Feb 10, 2021

@lhoestq Hi! I encounter the same error when loading yelp_review_full.

from datasets import load_dataset
dataset_yp = load_dataset("yelp_review_full")

When you say "Quota Exceeded from Google Drive", is this a quota on the dataset owner's side, or on our (the downloader's) Google Drive?

@dtch1997

+1 Also encountering this issue

@lhoestq
Member

lhoestq commented Feb 11, 2021

When you say "Quota Exceeded from Google Drive", is this a quota on the dataset owner's side, or on our (the downloader's) Google Drive?

Each file on Google Drive can only be downloaded a certain number of times per day because of a quota, and the quota resets every day. So if too many people download the dataset on the same day, the quota gets exceeded.
That's a really bad limitation of Google Drive, and we should definitely find another host for these datasets.
For now I would suggest waiting and trying again later.

So far the issue has happened with CNN DailyMail, Amazon Polarity and Yelp Reviews.
Are you experiencing the issue with other datasets? @calebchiam @dtch1997
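Since the quota resets daily, the practical workaround is simply to retry later. A generic retry-with-backoff helper (not part of the datasets API; the name and defaults are illustrative, and for a daily quota you may need delays of hours rather than minutes):

```python
import time

def retry(fn, attempts=3, base_delay=60.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** i)
```

For example: retry(lambda: load_dataset("amazon_polarity")).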

@calebchiam

@lhoestq Gotcha, that is quite problematic... For what it's worth, I've had no issues with the other datasets I tried, such as yelp_review_full and amazon_reviews_multi.

@thomasw21
Contributor

thomasw21 commented Jul 21, 2021

Same issue today with "big_patent", though the symptoms are slightly different.

When running

from datasets import load_dataset
load_dataset("big_patent", split="validation")

I get the following:

FileNotFoundError: Local file \huggingface\datasets\downloads\6159313604f4f2c01e7d1cac52139343b6c07f73f6de348d09be6213478455c5\bigPatentData\train.tar.gz doesn't exist

I looked into 6159313604f4f2c01e7d1cac52139343b6c07f73f6de348d09be6213478455c5 (which is a file instead of a folder) and it contained the following:

<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>…
…Sorry, you can't view or download this file at this time. Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.…
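Checking for this failure mode is straightforward: the cached "archive" starts with an HTML doctype instead of the gzip magic bytes. A small heuristic (hypothetical helper, not part of the datasets API):

```python
def is_gdrive_error_page(path):
    # A real .tar.gz starts with the gzip magic bytes b"\x1f\x8b";
    # Google Drive's quota page starts with an HTML doctype instead.
    with open(path, "rb") as f:
        head = f.read(64)
    return head.lstrip().lower().startswith((b"<!doctype html", b"<html"))
```

If this returns True for a file under ~/.cache/huggingface/datasets/downloads, deleting it and retrying after the quota resets is the only option.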

@SBrandeis
Contributor

SBrandeis commented Feb 17, 2022

A similar issue arises when trying to stream the dataset

>>> from datasets import load_dataset
>>> iter_dset = load_dataset("amazon_polarity", split="test", streaming=True)
>>> iter(iter_dset).__next__()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\lib\tarfile.py in nti(s)
    186             s = nts(s, "ascii", "strict")
--> 187             n = int(s.strip() or "0", 8)
    188         except ValueError:

ValueError: invalid literal for int() with base 8: 'e nonce='

During handling of the above exception, another exception occurred:

InvalidHeaderError                        Traceback (most recent call last)
~\lib\tarfile.py in next(self)
   2288             try:
-> 2289                 tarinfo = self.tarinfo.fromtarfile(self)
   2290             except EOFHeaderError as e:

~\lib\tarfile.py in fromtarfile(cls, tarfile)
   1094         buf = tarfile.fileobj.read(BLOCKSIZE)
-> 1095         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
   1096         obj.offset = tarfile.fileobj.tell() - BLOCKSIZE

~\lib\tarfile.py in frombuf(cls, buf, encoding, errors)
   1036
-> 1037         chksum = nti(buf[148:156])
   1038         if chksum not in calc_chksums(buf):

~\lib\tarfile.py in nti(s)
    188         except ValueError:
--> 189             raise InvalidHeaderError("invalid header")
    190     return n

InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
<ipython-input-5-6b9058341b2b> in <module>
----> 1 iter(iter_dset).__next__()

~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
    363
    364     def __iter__(self):
--> 365         for key, example in self._iter():
    366             if self.features:
    367                 # we encode the example for ClassLabel feature types for example

~\lib\site-packages\datasets\iterable_dataset.py in _iter(self)
    360         else:
    361             ex_iterable = self._ex_iterable
--> 362         yield from ex_iterable
    363
    364     def __iter__(self):

~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
     77
     78     def __iter__(self):
---> 79         yield from self.generate_examples_fn(**self.kwargs)
     80
     81     def shuffle_data_sources(self, seed: Optional[int]) -> "ExamplesIterable":

~\.cache\huggingface\modules\datasets_modules\datasets\amazon_polarity\56923eeb72030cb6c4ea30c8a4e1162c26b25973475ac1f44340f0ec0f2936f4\amazon_polarity.py in _generate_examples(self, filepath, files)
    114     def _generate_examples(self, filepath, files):
    115         """Yields examples."""
--> 116         for path, f in files:
    117             if path == filepath:
    118                 lines = (line.decode("utf-8") for line in f)

~\lib\site-packages\datasets\utils\streaming_download_manager.py in __iter__(self)
    616
    617     def __iter__(self):
--> 618         yield from self.generator(*self.args, **self.kwargs)
    619
    620

~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_urlpath(cls, urlpath, use_auth_token)
    644     ) -> Generator[Tuple, None, None]:
    645         with xopen(urlpath, "rb", use_auth_token=use_auth_token) as f:
--> 646             yield from cls._iter_from_fileobj(f)
    647
    648     @classmethod

~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_fileobj(cls, f)
    624     @classmethod
    625     def _iter_from_fileobj(cls, f) -> Generator[Tuple, None, None]:
--> 626         stream = tarfile.open(fileobj=f, mode="r|*")
    627         for tarinfo in stream:
    628             file_path = tarinfo.name

~\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1603             stream = _Stream(name, filemode, comptype, fileobj, bufsize)
   1604             try:
-> 1605                 t = cls(name, filemode, stream, **kwargs)
   1606             except:
   1607                 stream.close()

~\lib\tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1484             if self.mode == "r":
   1485                 self.firstmember = None
-> 1486                 self.firstmember = self.next()
   1487
   1488             if self.mode == "a":

~\lib\tarfile.py in next(self)
   2299                     continue
   2300                 elif self.offset == 0:
-> 2301                     raise ReadError(str(e))
   2302             except EmptyHeaderError:
   2303                 if self.offset == 0:

ReadError: invalid header
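The root cause here is the same quota page: tar stores numeric header fields as ASCII octal, and tarfile's nti() fails when HTML bytes land in the checksum field at offset 148. A simplified sketch of that parse (the real nti() also handles NUL padding and a binary base-256 encoding):

```python
def parse_octal_field(raw: bytes) -> int:
    # Tar stores numeric header fields as ASCII octal, e.g. b"0000644 ".
    # When the streamed "archive" is really an HTML page, the slice at
    # the checksum offset holds bytes like b"e nonce=", and int(..., 8)
    # raises exactly the ValueError seen in the traceback above.
    s = raw.decode("ascii", "strict")
    return int(s.strip() or "0", 8)
```

So the InvalidHeaderError/ReadError chain is just tarfile discovering, field by field, that it was handed HTML rather than a tar stream.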

@dirkgr
Contributor

dirkgr commented Mar 10, 2022

This error still happens, but for a different reason now: Google Drive returns a warning instead of the dataset.

@chengjiali

Met the same issue +1

@lhoestq
Member

lhoestq commented Mar 11, 2022

Hi! Thanks for reporting. Google Drive recently changed the way its warning message has to be bypassed.

The latest release, 1.18.4, fixes this for datasets loaded in the regular way.

We opened a PR to fix this for streaming mode in #3843; we'll do a new release once the fix is merged :)
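Since the non-streaming fix shipped in release 1.18.4, pinning at least that version (e.g. in requirements.txt) avoids the regular-load failure:

```
datasets>=1.18.4
```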
