Error with HashCheckerIterDataPipe #75

Nayef211 · 2021-10-20T20:24:14Z

🐛 Bug

Using HashCheckerIterDataPipe to implement an SST2 dataset within torchtext causes test failures for unittest_linux_py3.6 and for all Python versions on the Windows platform.

Here is the CircleCI link for all the test failures: failures.
Here is the Dataset implementation where the HashCheckerIterDataPipe is used: code pointer

I believe there may be changes to how io.seek() works from Python 3.6 to 3.7 that could be causing the failures in unittest_linux_py3.6 and unittest_windows_py3.6. I'm not really sure why the other Windows unit tests are failing.

To Reproduce

Steps to reproduce the behavior:

1. Patch commit 62e6fb2 in the Nayef211/text repo
2. Create a PR against the pytorch/text repo
3. Look at the CircleCI unit test failures

Error for `unittest_linux_py3.6` and `unittest_windows_py3.6`

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x7f937f867ba8>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
>               stream.seek(0)
E               io.UnsupportedOperation: seek

env/lib/python3.6/site-packages/torchdata-0.1.0a0+7772406-py3.6.egg/torchdata/datapipes/iter/util/hashchecker.py:51: UnsupportedOperation

Link to Circle CI Error

Error for all other `unittest_windows_py*`

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x000001929F2B5548>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
                stream.seek(0)
    
            if file_name not in self.hash_dict:
>               raise RuntimeError("Unspecified hash for file {}".format(file_name))
E               RuntimeError: Unspecified hash for file C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv

env\lib\site-packages\torchdata-0.1.0a0+7772406-py3.7.egg\torchdata\datapipes\iter\util\hashchecker.py:54: RuntimeError

Link to Circle CI Error
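The thread does not establish why the lookup in `hash_dict` fails on Windows. One possible cause, stated here purely as an assumption, is that the keys in `hash_dict` and the `file_name` yielded by the source datapipe spell the same path differently (for example, forward slashes versus backslashes). The sketch below shows how such a mismatch could be ruled out by normalizing both sides before comparing. `normalize_key` and `lookup_hash` are hypothetical helpers, not part of torchdata, and the `pathmod` parameter exists only so the Windows behavior can be exercised on any OS via `ntpath`.

```python
import ntpath
import os


def normalize_key(path, pathmod=os.path):
    """Collapse separator and case differences according to pathmod's rules."""
    return pathmod.normcase(pathmod.normpath(path))


def lookup_hash(hash_dict, file_name, pathmod=os.path):
    """Return the expected hash for file_name, or None if no key matches.

    Hypothetical helper: both the dict keys and file_name are normalized
    before comparison, so differently spelled but equivalent paths match.
    """
    normalized = {normalize_key(k, pathmod): v for k, v in hash_dict.items()}
    return normalized.get(normalize_key(file_name, pathmod))


# Simulate Windows path rules with ntpath, regardless of the host OS.
expected = {"C:/Users/circleci/.torchtext/cache/SST2/SST-2/train.tsv": "abc123"}
yielded = r"C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv"
found = lookup_hash(expected, yielded, pathmod=ntpath)
```

If normalized lookups succeed where raw lookups fail, the path spelling is the likely culprit; if they still fail, the cause lies elsewhere.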

Expected behavior

Expect all tests to pass

Environment

Tests pass in the devserver environment but fail on CircleCI.


ejguan · 2021-10-20T20:48:24Z

Thanks for notifying us. The behavior of seek differs across platforms and Python versions. IMO, this probably only happens with zipfile; tarfile should behave more consistently.
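A minimal sketch of one way to make the rewind step tolerate streams that cannot seek, assuming the stream exposes the standard `seekable()` method from `io.IOBase`. `hash_and_rewind` is a hypothetical helper, not the torchdata implementation and not necessarily what the later refactor did. For a non-seekable stream it copies the bytes into an in-memory buffer while hashing, which trades memory for the ability to re-read.

```python
import hashlib
import io


def hash_and_rewind(stream, hash_type="sha256", chunk_size=1024 ** 2):
    """Hash a binary stream and return (hexdigest, readable_stream).

    Hypothetical helper. If the stream is seekable it is rewound in place;
    otherwise the bytes are copied into a BytesIO that is returned instead.
    """
    hash_func = hashlib.sha256() if hash_type == "sha256" else hashlib.md5()

    # Treat objects without a seekable() method as non-seekable.
    seekable = getattr(stream, "seekable", lambda: False)()
    buffered = None if seekable else io.BytesIO()

    while True:
        # Read by chunk so hashing itself does not load the whole file.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hash_func.update(chunk)
        if buffered is not None:
            buffered.write(chunk)

    if buffered is None:
        stream.seek(0)
        return hash_func.hexdigest(), stream

    buffered.seek(0)
    return hash_func.hexdigest(), buffered


class NoSeekStream(io.BytesIO):
    """Test double that mimics a stream without seek support."""

    def seekable(self):
        return False

    def seek(self, *args, **kwargs):
        raise io.UnsupportedOperation("seek")


payload = b"sentence\tlabel\n" * 1000
digest, rewound = hash_and_rewind(NoSeekStream(payload))
```

The buffering branch holds the whole file in memory, so for large archives a temporary file or reopening the source may be a better fit.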

~~I will comment on your original PR with a workaround~~
~~We should add comprehensive CI for TorchData (CI Improvement #70)~~
- Multiple Python versions are implemented here
- Multiple OS platforms

We need to refactor the HashChecker.

cc: @VitalyFedyunin @NivekT

ejguan · 2021-12-09T17:02:37Z

Closing this issue, as a refactor PR has landed.

ejguan closed this as completed Dec 9, 2021
