Version of Pytorch and Cuda #136

Open
yuxiangwei0808 opened this issue Sep 26, 2022 · 0 comments

Hi, I am reproducing the experiments in the "osdi21-artifact" branch. However, multiple jobs have failed with the following error:

Traceback (most recent call last):
  File "run_glue.py", line 750, in <module>
    main()
  File "run_glue.py", line 476, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, lr_scheduler)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 68, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/root/adaptdl/adaptdl/checkpoint.py", line 137, in load_state
    state.load(f)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 194, in load
    state_dicts, self.gain = torch.load(fileobj)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

This usually happens when a job is rescaled. I suspect it is caused by an environment conflict. Could you please provide the versions of PyTorch, CUDA, Python, and any other required modules?
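
For context, the message "failed finding central directory" means that the file handed to `torch.load` could not be read as a complete zip archive, since the central directory sits at the end of a zip file. Below is a minimal sketch, using only the Python standard library, of how one might check whether a file is a complete zip archive. The helper name `looks_like_complete_zip` is hypothetical, and the idea that a checkpoint could be cut short during rescaling is an assumption, not something confirmed in this issue.

```python
import io
import zipfile


def looks_like_complete_zip(path_or_file):
    """Return True if the zip end-of-archive record can be located.

    Hypothetical helper for illustration; accepts a path or a file-like object.
    """
    return zipfile.is_zipfile(path_or_file)


# Build a small in-memory zip archive, then cut off its tail to mimic
# a file whose write was interrupted (an assumed failure mode).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.pkl", b"placeholder payload")
complete = buf.getvalue()
truncated = complete[: len(complete) // 2]

complete_ok = looks_like_complete_zip(io.BytesIO(complete))
truncated_ok = looks_like_complete_zip(io.BytesIO(truncated))
print("complete archive readable:", complete_ok)
print("truncated archive readable:", truncated_ok)
```

The same helper could be pointed at the checkpoint file of a failed job. If it returns `False`, the file itself is incomplete or not in zip format, which would point away from a pure version mismatch.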
