load_dataset("amazon_polarity") NonMatchingChecksumError #1856
Comments
Hi! This issue may be related to #996. On my side I didn't get any error today with
+1, encountering this issue as well
@lhoestq Hi! I encounter the same error when loading
When you say "Quota Exceeded from Google Drive", is this a quota on the dataset owner's side, or on our (the downloader's) Google Drive?
+1, also encountering this issue
Each file on Google Drive can only be downloaded a certain number of times per day because of a quota, and the quota is reset every day. So if too many people download the dataset on the same day, the quota is likely to be exceeded. So far the issue has happened with CNN DailyMail, Amazon Polarity and Yelp Reviews.
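The mechanism behind the resulting `NonMatchingChecksumError` can be sketched as follows: once the quota is exceeded, Google Drive serves an HTML warning page instead of the archive, so the hash of the downloaded bytes no longer matches the checksum recorded for the dataset. A minimal illustration (the byte strings here are made up, not the real archive contents):

```python
import hashlib

# Checksum recorded at dataset-creation time, computed over the real archive.
expected = hashlib.sha256(b"...the real tar archive bytes...").hexdigest()

# What actually comes back once the Drive quota is exceeded: an HTML page.
downloaded = b"<html>Google Drive - Quota exceeded</html>"
actual = hashlib.sha256(downloaded).hexdigest()

# The library compares the two hashes and raises on mismatch.
print(actual == expected)  # False -> NonMatchingChecksumError
```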
@lhoestq Gotcha, that is quite problematic... For what it's worth, I've had no issues with the other datasets I tried, such as
Same issue today with "big_patent", though the symptoms are slightly different. When running

```python
from datasets import load_dataset
load_dataset("big_patent", split="validation")
```

I get the following. I had to look into
A similar issue arises when trying to stream the dataset:

```
>>> from datasets import load_dataset
>>> iter_dset = load_dataset("amazon_polarity", split="test", streaming=True)
>>> iter(iter_dset).__next__()
---------------------------------------------------------------------------
ValueError                         Traceback (most recent call last)
~\lib\tarfile.py in nti(s)
    186             s = nts(s, "ascii", "strict")
--> 187             n = int(s.strip() or "0", 8)
    188         except ValueError:
ValueError: invalid literal for int() with base 8: 'e nonce='

During handling of the above exception, another exception occurred:

InvalidHeaderError                 Traceback (most recent call last)
~\lib\tarfile.py in next(self)
   2288             try:
-> 2289                 tarinfo = self.tarinfo.fromtarfile(self)
   2290             except EOFHeaderError as e:
~\lib\tarfile.py in fromtarfile(cls, tarfile)
   1094         buf = tarfile.fileobj.read(BLOCKSIZE)
-> 1095         obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
   1096         obj.offset = tarfile.fileobj.tell() - BLOCKSIZE
~\lib\tarfile.py in frombuf(cls, buf, encoding, errors)
   1036
-> 1037         chksum = nti(buf[148:156])
   1038         if chksum not in calc_chksums(buf):
~\lib\tarfile.py in nti(s)
    188         except ValueError:
--> 189             raise InvalidHeaderError("invalid header")
    190         return n
InvalidHeaderError: invalid header

During handling of the above exception, another exception occurred:

ReadError                          Traceback (most recent call last)
<ipython-input-5-6b9058341b2b> in <module>
----> 1 iter(iter_dset).__next__()
~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
    363
    364     def __iter__(self):
--> 365         for key, example in self._iter():
    366             if self.features:
    367                 # we encode the example for ClassLabel feature types for example
~\lib\site-packages\datasets\iterable_dataset.py in _iter(self)
    360         else:
    361             ex_iterable = self._ex_iterable
--> 362         yield from ex_iterable
    363
    364     def __iter__(self):
~\lib\site-packages\datasets\iterable_dataset.py in __iter__(self)
     77
     78     def __iter__(self):
---> 79         yield from self.generate_examples_fn(**self.kwargs)
     80
     81     def shuffle_data_sources(self, seed: Optional[int]) -> "ExamplesIterable":
~\.cache\huggingface\modules\datasets_modules\datasets\amazon_polarity\56923eeb72030cb6c4ea30c8a4e1162c26b25973475ac1f44340f0ec0f2936f4\amazon_polarity.py in _generate_examples(self, filepath, files)
    114     def _generate_examples(self, filepath, files):
    115         """Yields examples."""
--> 116         for path, f in files:
    117             if path == filepath:
    118                 lines = (line.decode("utf-8") for line in f)
~\lib\site-packages\datasets\utils\streaming_download_manager.py in __iter__(self)
    616
    617     def __iter__(self):
--> 618         yield from self.generator(*self.args, **self.kwargs)
    619
    620
~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_urlpath(cls, urlpath, use_auth_token)
    644     ) -> Generator[Tuple, None, None]:
    645         with xopen(urlpath, "rb", use_auth_token=use_auth_token) as f:
--> 646             yield from cls._iter_from_fileobj(f)
    647
    648     @classmethod
~\lib\site-packages\datasets\utils\streaming_download_manager.py in _iter_from_fileobj(cls, f)
    624     @classmethod
    625     def _iter_from_fileobj(cls, f) -> Generator[Tuple, None, None]:
--> 626         stream = tarfile.open(fileobj=f, mode="r|*")
    627         for tarinfo in stream:
    628             file_path = tarinfo.name
~\lib\tarfile.py in open(cls, name, mode, fileobj, bufsize, **kwargs)
   1603             stream = _Stream(name, filemode, comptype, fileobj, bufsize)
   1604             try:
-> 1605                 t = cls(name, filemode, stream, **kwargs)
   1606             except:
   1607                 stream.close()
~\lib\tarfile.py in __init__(self, name, mode, fileobj, format, tarinfo, dereference, ignore_zeros, encoding, errors, pax_headers, debug, errorlevel, copybufsize)
   1484         if self.mode == "r":
   1485             self.firstmember = None
-> 1486             self.firstmember = self.next()
   1487
   1488         if self.mode == "a":
~\lib\tarfile.py in next(self)
   2299                     continue
   2300                 elif self.offset == 0:
-> 2301                     raise ReadError(str(e))
   2302             except EmptyHeaderError:
   2303                 if self.offset == 0:
ReadError: invalid header
```
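The traceback above bottoms out in `tarfile` choking on non-tar bytes: the stream it receives is Google Drive's HTML warning page, and parsing its first 512-byte block as a tar header fails. This can be reproduced offline; the HTML content below is invented, but any non-tar text payload triggers the same chain of errors:

```python
import io
import tarfile

# Fake payload: an HTML warning page where a tar archive was expected.
html = (b"<html>Google Drive cannot verify this download</html>" * 16)[:1024]

err = None
try:
    # Same call as datasets' streaming download manager: open the stream as tar.
    tarfile.open(fileobj=io.BytesIO(html), mode="r|*")
except tarfile.ReadError as e:
    # Parsing the HTML bytes as a tar header raises ReadError("invalid header").
    err = e

print(err)
```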
This error still happens, but for a different reason now: Google Drive returns a warning instead of the dataset.
Met the same issue, +1
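One way to distinguish the two cases when double-checking a download is to sniff the first bytes of the response: the real payload is a binary archive, while the quota/virus-scan warning is an HTML page. A hypothetical helper (`looks_like_drive_warning` is not part of `datasets`; it is just a sketch of the heuristic):

```python
def looks_like_drive_warning(first_bytes: bytes) -> bool:
    """Heuristic: Google Drive's quota/virus-scan warning is an HTML page,
    while the expected dataset payload is a binary archive."""
    head = first_bytes.lstrip()[:14].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

print(looks_like_drive_warning(b"<html><body>Quota exceeded</body></html>"))  # True
print(looks_like_drive_warning(b"\x1f\x8b\x08\x00"))  # False (gzip magic bytes)
```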
Hi! Thanks for reporting. Google Drive changed the way to bypass the warning message recently. The latest release
We opened a PR to fix this recently for streaming mode at #3843 - we'll do a new release once the fix is merged :)
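For context, the usual bypass appends a confirm token to Drive's direct-download endpoint so the server skips the interstitial warning page. A rough sketch of building such a URL (the helper name is invented, and the exact token handling on Drive's side is precisely what changed and what the linked PR addresses):

```python
def drive_direct_url(file_id: str, confirm: str = "t") -> str:
    # "confirm" asks Drive to skip the "can't scan for viruses" interstitial;
    # Drive changed how this parameter is handled, which broke the old bypass.
    return (
        "https://drive.google.com/uc?export=download"
        f"&id={file_id}&confirm={confirm}"
    )

print(drive_direct_url("FILE_ID"))
```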
Hi, it seems that loading the amazon_polarity dataset gives a NonMatchingChecksumError.
To reproduce:
This will give the following error: