
load_dataset("amazon_polarity") NonMatchingChecksumError #1856

Closed
yanxi0830 opened this issue Feb 10, 2021 · 12 comments

Comments

@yanxi0830

Hi, it seems that loading the amazon_polarity dataset gives a NonMatchingChecksumError.

To reproduce:

from datasets import load_dataset
dataset = load_dataset("amazon_polarity")

This will give the following error:

---------------------------------------------------------------------------
NonMatchingChecksumError                  Traceback (most recent call last)
<ipython-input-3-8559a03fe0f8> in <module>()
----> 1 dataset = load_dataset("amazon_polarity")

3 frames
/usr/local/lib/python3.6/dist-packages/datasets/utils/info_utils.py in verify_checksums(expected_checksums, recorded_checksums, verification_name)
     37     if len(bad_urls) > 0:
     38         error_msg = "Checksums didn't match" + for_verification_name + ":\n"
---> 39         raise NonMatchingChecksumError(error_msg + str(bad_urls))
     40     logger.info("All the checksums matched successfully" + for_verification_name)
     41 

NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/u/0/uc?id=0Bz8a_Dbh9QhbaW12WVVZS2drcnM&export=download']
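For context, the check that fails here compares a checksum recorded in the dataset's metadata against the bytes that were actually downloaded; when Google Drive serves an error page instead of the archive, the hash no longer matches. A minimal sketch of that kind of check (simplified, and the helper name is illustrative; the real `verify_checksums` in `datasets` compares recorded metadata rather than recomputing it here):

```python
import hashlib

def sha256_of(path):
    # Stream the file in 1 MiB chunks so large archives don't have to
    # fit in memory, then return the hex digest that would be compared
    # against the recorded checksum.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

If the digest of the cached download differs from the recorded one, the download was corrupted or replaced by an error page.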
@lhoestq
Member

lhoestq commented Feb 10, 2021

Hi! This issue may be related to #996.
It probably comes from the Quota Exceeded error on Google Drive.
Can you try again tomorrow and see if you still get the error?

On my side I didn't get any error today with load_dataset("amazon_polarity").

@calebchiam

+1 encountering this issue as well

@jguoxu

jguoxu commented Feb 10, 2021

@lhoestq Hi! I encounter the same error when loading yelp_review_full.

from datasets import load_dataset
dataset_yp = load_dataset("yelp_review_full")

When you say "Quota Exceeded from Google Drive", is this a quota on the dataset owner's side, or on our (the downloader's) Google Drive?

@dtch1997

+1 Also encountering this issue

@lhoestq
Member

lhoestq commented Feb 11, 2021

When you say "Quota Exceeded from Google Drive", is this a quota on the dataset owner's side, or on our (the downloader's) Google Drive?

Each file on Google Drive can only be downloaded a certain number of times per day because of a quota, and the quota resets every day. So if too many people download the dataset on the same day, the quota gets exceeded.
That's a really bad limitation of Google Drive, and we should definitely find another host for these datasets.
For now I would suggest waiting and trying again later.

So far the issue has happened with CNN DailyMail, Amazon Polarity and Yelp Reviews.
Are you experiencing the issue with other datasets? @calebchiam @dtch1997
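Since the quota resets daily, the practical workaround is simply to retry later. A generic retry-with-backoff helper (not part of the datasets API; the name and defaults are illustrative, and for a daily quota you may need delays of hours rather than minutes):

```python
import time

def retry(fn, attempts=3, base_delay=60.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts, surface the last error
            time.sleep(base_delay * 2 ** i)
```

For example: retry(lambda: load_dataset("amazon_polarity")).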

@calebchiam

@lhoestq Gotcha, that is quite problematic... For what it's worth, I've had no issues with the other datasets I tried, such as yelp_review_full and amazon_reviews_multi.

@thomasw21
Contributor

thomasw21 commented Jul 21, 2021

Same issue today with "big_patent", though the symptoms are slightly different.

When running

from datasets import load_dataset
load_dataset("big_patent", split="validation")

I get the following:

FileNotFoundError: Local file \huggingface\datasets\downloads\6159313604f4f2c01e7d1cac52139343b6c07f73f6de348d09be6213478455c5\bigPatentData\train.tar.gz doesn't exist

I looked into 6159313604f4f2c01e7d1cac52139343b6c07f73f6de348d09be6213478455c5 (which is a file instead of a folder) and it contained the following:

<!DOCTYPE html><html><head><title>Google Drive - Quota exceeded</title>…
…Sorry, you can't view or download this file at this time. Too many users have viewed or downloaded this file recently. Please try accessing the file again later. If the file you are trying to access is particularly large or is shared with many people, it may take up to 24 hours to be able to view or download the file. If you still can't access a file after 24 hours, contact your domain administrator.…
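Checking for this failure mode is straightforward: the cached "archive" starts with an HTML doctype instead of the gzip magic bytes. A small heuristic (hypothetical helper, not part of the datasets API):

```python
def is_gdrive_error_page(path):
    # A real .tar.gz starts with the gzip magic bytes b"\x1f\x8b";
    # Google Drive's quota page starts with an HTML doctype instead.
    with open(path, "rb") as f:
        head = f.read(64)
    return head.lstrip().lower().startswith((b"<!doctype html", b"<html"))
```

If this returns True for a file under ~/.cache/huggingface/datasets/downloads, deleting it and retrying after the quota resets is the only option.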

@SBrandeis
Contributor

SBrandeis commented Feb 17, 2022

A similar issue arises when trying to stream the dataset

>>> from datasets import load_dataset
>>> iter_dset = load_dataset("amazon_polarity", split="test", streaming=True)
>>> iter(iter_dset).__next__()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\lib\tarfile.py in nti(s)
    186             s = nts(s, "ascii", "strict")
--> 187             n = int(s.strip() or "0", 8)
    188         except ValueError:

ValueError: invalid literal for int() with base 8: 'e nonce='

During handling of the above exception, another exception occurred:

InvalidHeaderError                        Traceback (most recent call last)
~\lib\tarfile.py in next(self)
   2288             try:
-> 2289                 tarinfo = self.tarinfo.fromtarfile(self)
   2290             except EOFHeaderError as e:

~\lib\tarfile.py in fromtarfile(cls, tarfile)
   1094         buf = tarfile.fileobj.read(BLOCKSIZE)
-> 1095         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
   1096         obj.offset = tarfile.fileobj.tell() - BLOCKSIZE

~\lib\tarfile.py in frombuf(cls, buf, encoding, errors)
   1036
-> 1037         chksum = nti(buf[148:156])
   1038         if chksum not in calc_chksums(buf):

~\lib\tarfile.py in nti(s)
    188         except ValueError:
--> 189             raise InvalidHeaderError("invalid header")
    190     return n

InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

ReadError                                 Traceback (most recent call last)
<ipython-input-5-6b9058341b2b> in <module>
----> 1 iter(iter_dset).__next__()

~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
    363
    364     def __iter__(self):
--> 365         for key, example in self._iter():
    366             if self.features:
    367                 # we encode the example for ClassLabel feature types for example

~\lib\site-packages\datasets\iterable_dataset.py in _iter(self)
    360         else:
    361             ex_iterable = self._ex_iterable
--> 362         yield from ex_iterable
    363
    364     def __iter__(self):

~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
     77
     78     def __iter__(self):
---> 79         yield from self.generate_examples_fn(**self.kwargs)
     80
     81     def shuffle_data_sources(self, seed: Optional[int]) -> "ExamplesIterable":

~\.cache\huggingface\modules\datasets_modules\datasets\amazon_polarity\56923eeb72030cb6c4ea30c8a4e1162c26b25973475ac1f44340f0ec0f2936f4\amazon_polarity.py in _generate_examples(self, filepath, files)
    114     def _generate_examples(self, filepath, files):
    115         """Yields examples."""
--> 116         for path, f in files:
    117             if path == filepath:
    118                 lines = (line.decode("utf-8") for line in f)

~\lib\site-packages\datasets\utils\streaming_download_manager.py in __iter__(self)
    616
    617     def __iter__(self):
--> 618         yield from self.generator(*self.args, **self.kwargs)
    619
    620

~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_urlpath(cls, urlpath, use_auth_token)
    644     ) -> Generator[Tuple, None, None]:
    645         with xopen(urlpath, "rb", use_auth_token=use_auth_token) as f:
--> 646             yield from cls._iter_from_fileobj(f)
    647
    648     @classmethod

~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_fileobj(cls, f)
    624     @classmethod
    625     def _iter_from_fileobj(cls, f) -> Generator[Tuple, None, None]:
--> 626         stream = tarfile.open(fileobj=f, mode="r|*")
    627         for tarinfo in stream:
    628             file_path = tarinfo.name

~\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1603             stream = _Stream(name, filemode, comptype, fileobj, bufsize)
   1604             try:
-> 1605                 t = cls(name, filemode, stream, **kwargs)
   1606             except:
   1607                 stream.close()

~\lib\tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1484             if self.mode == "r":
   1485                 self.firstmember = None
-> 1486                 self.firstmember = self.next()
   1487
   1488             if self.mode == "a":

~\lib\tarfile.py in next(self)
   2299                     continue
   2300                 elif self.offset == 0:
-> 2301                     raise ReadError(str(e))
   2302             except EmptyHeaderError:
   2303                 if self.offset == 0:

ReadError: invalid header
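The root cause here is the same quota page: tar stores numeric header fields as ASCII octal, and tarfile's nti() fails when HTML bytes land in the checksum field at offset 148. A simplified sketch of that parse (the real nti() also handles NUL padding and a binary base-256 encoding):

```python
def parse_octal_field(raw: bytes) -> int:
    # Tar stores numeric header fields as ASCII octal, e.g. b"0000644 ".
    # When the streamed "archive" is really an HTML page, the slice at
    # the checksum offset holds bytes like b"e nonce=", and int(..., 8)
    # raises exactly the ValueError seen in the traceback above.
    s = raw.decode("ascii", "strict")
    return int(s.strip() or "0", 8)
```

So the InvalidHeaderError/ReadError chain is just tarfile discovering, field by field, that it was handed HTML rather than a tar stream.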

@dirkgr
Contributor

dirkgr commented Mar 10, 2022

This error still happens, but for a different reason now: Google Drive returns a warning instead of the dataset.

@chengjiali

Met the same issue +1

@lhoestq
Member

lhoestq commented Mar 11, 2022

Hi! Thanks for reporting. Google Drive recently changed the way its warning message has to be bypassed.

The latest release, 1.18.4, fixes this for datasets loaded in the regular way.

We opened a PR to fix this for streaming mode in #3843; we'll do a new release once the fix is merged :)
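Since the non-streaming fix shipped in release 1.18.4, pinning at least that version (e.g. in requirements.txt) avoids the regular-load failure:

```
datasets>=1.18.4
```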
