Error with HashCheckerIterDataPipe #75

Closed
Nayef211 opened this issue Oct 20, 2021 · 2 comments

Comments

@Nayef211

🐛 Bug

Using HashCheckerIterDataPipe to implement an SST2 dataset within torchtext causes test failures for unittest_linux_py3.6 and for all Python versions on the Windows platform.

  • Here is the CircleCI link for all the test failures: failures.
  • Here is the Dataset implementation where the HashCheckerIterDataPipe is used: code pointer

I believe there may be changes to how io.seek() works between Python 3.6 and 3.7 that could be causing the failures in unittest_linux_py3.6 and unittest_windows_py3.6. I'm not really sure why the other Windows unit tests are failing.
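A minimal sketch of how the seek hypothesis could be checked, using only the standard library. It assumes the stream comes from a zip archive member, which this report does not confirm; the archive contents here are made up for illustration. As far as I know, `zipfile` member streams only gained `seek()` support in Python 3.7, so on 3.6 `seekable()` would be expected to return False here.

```python
import io
import zipfile

# Build a small in-memory archive so the example needs no files on disk.
# The member name and contents are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.tsv", "sentence\tlabel\n")

with zipfile.ZipFile(buffer) as archive:
    with archive.open("train.tsv") as stream:
        data = stream.read()
        # Ask the stream whether it can rewind instead of calling
        # seek(0) unconditionally.
        can_rewind = stream.seekable()
        if can_rewind:
            stream.seek(0)
            assert stream.read() == data
```

Running this under both interpreter versions should show whether the rewind step is the part that differs.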

To Reproduce

Steps to reproduce the behavior:

  1. Patch commit 62e6fb2 in Nayef211/text repo
  2. Create PR against pytorch/text repo
  3. Look at CircleCI unit test failures

Error for unittest_linux_py3.6 and unittest_windows_py3.6

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x7f937f867ba8>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
>               stream.seek(0)
E               io.UnsupportedOperation: seek

env/lib/python3.6/site-packages/torchdata-0.1.0a0+7772406-py3.6.egg/torchdata/datapipes/iter/util/hashchecker.py:51: UnsupportedOperation

Link to Circle CI Error
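One way the rewind step could avoid this exception is sketched below. This is not the torchdata implementation or the fix that eventually landed; `hash_and_rewind` is a hypothetical helper written for illustration. It rewinds with `seek(0)` when the stream reports itself as seekable and otherwise returns an in-memory copy of the bytes it read.

```python
import hashlib
import io


def hash_and_rewind(stream, hash_type="sha256", chunk_size=1024 ** 2):
    """Return (hex_digest, readable_stream) without assuming seek() works.

    Hypothetical helper, not part of torchdata.
    """
    hash_func = hashlib.sha256() if hash_type == "sha256" else hashlib.md5()
    seekable = stream.seekable()
    # Keep a copy of the data only when the stream cannot be rewound.
    copy = None if seekable else io.BytesIO()

    while True:
        # Read by chunk, as the original loop does.
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hash_func.update(chunk)
        if copy is not None:
            copy.write(chunk)

    if seekable:
        stream.seek(0)
        return hash_func.hexdigest(), stream

    copy.seek(0)
    return hash_func.hexdigest(), copy
```

The trade-off is that a non-seekable stream ends up fully buffered in memory, which gives up the benefit of chunked reading for large files.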

Error for all other unittest_windows_py*

self = <torchdata.datapipes.iter.util.hashchecker.HashCheckerIterDataPipe object at 0x000001929F2B5548>

    def __iter__(self):
    
        for file_name, stream in self.source_datapipe:
            if self.hash_type == "sha256":
                hash_func = hashlib.sha256()
            else:
                hash_func = hashlib.md5()
    
            while True:
                # Read by chunk to avoid filling memory
                chunk = stream.read(1024 ** 2)
                if not chunk:
                    break
                hash_func.update(chunk)
    
            # TODO(VitalyFedyunin): this will not work (or work crappy for non-seekable steams like http)
            if self.rewind:
                stream.seek(0)
    
            if file_name not in self.hash_dict:
>               raise RuntimeError("Unspecified hash for file {}".format(file_name))
E               RuntimeError: Unspecified hash for file C:\Users\circleci\.torchtext\cache\SST2\SST-2\train.tsv

env\lib\site-packages\torchdata-0.1.0a0+7772406-py3.7.egg\torchdata\datapipes\iter\util\hashchecker.py:54: RuntimeError

Link to Circle CI Error
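The cause of this second failure is not identified in this report. One possibility, stated here purely as an assumption, is that the keys of `hash_dict` were written in a different path form than the `file_name` the datapipe yields on Windows. The sketch below uses made-up paths and a hypothetical `normalize` helper to show how such a mismatch would produce a failed lookup, and how normalizing both sides avoids it.

```python
import ntpath

# Hypothetical data: a key written with forward slashes, and the
# backslash form a Windows run might yield for the same file.
hash_dict = {"C:/cache/SST2/SST-2/train.tsv": "<expected hash>"}
file_name = r"C:\cache\SST2\SST-2\train.tsv"

# A plain membership test compares strings, so this lookup misses.
direct_hit = file_name in hash_dict


def normalize(path):
    # Apply Windows path rules explicitly so the example behaves the
    # same on every OS: unify separators, then fold case.
    return ntpath.normcase(ntpath.normpath(path))


normalized = {normalize(key): value for key, value in hash_dict.items()}
normalized_hit = normalize(file_name) in normalized
```

Printing `file_name` next to the keys of `hash_dict` in the failing CI job would confirm or rule out this explanation.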

Expected behavior

Expect all tests to pass

Environment

Tests pass in the devserver environment but fail on CircleCI.

@ejguan
Contributor

ejguan commented Oct 20, 2021

Thanks for notifying us. The behavior of seek differs across platforms and Python versions. IMO, this probably only happens with zipfile; tarfile should have more unified behavior.

  • I will comment on your original PR with a workaround
  • We should add comprehensive CI for TorchData (CI Improvement #70)
    • Multiple Python versions are implemented here
    • Multiple OS platforms
  • We need to refactor the hash checker.

cc: @VitalyFedyunin @NivekT

@ejguan
Contributor

ejguan commented Dec 9, 2021

Closing this issue, as a refactor PR has landed.

@ejguan ejguan closed this as completed Dec 9, 2021