Hi, I am reproducing the experiment in the "osdi21-artifact" branch. However, multiple jobs have failed with the following error:
Traceback (most recent call last):
  File "run_glue.py", line 750, in <module>
    main()
  File "run_glue.py", line 476, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, lr_scheduler)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 68, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/root/adaptdl/adaptdl/checkpoint.py", line 137, in load_state
    state.load(f)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 194, in load
    state_dicts, self.gain = torch.load(fileobj)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
This usually happens when a job is rescaled. I suspect it is caused by a mismatch in my environment. Could you please provide the versions of PyTorch, CUDA, Python, and any other required modules that you used?
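For anyone hitting the same error: the message says the file passed to torch.load has no zip central directory, which is typical of a truncated or partially written file. The following is a minimal sketch for checking a file on disk, using only the Python standard library and assuming the checkpoint was saved in PyTorch's zip-based format. The function name is_complete_zip and the example path are hypothetical and not part of adaptdl.

```python
import zipfile


def is_complete_zip(path):
    """Return True if the file at `path` is a readable, intact zip archive."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first corrupt member,
            # or None when every member passes its CRC check.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError):
        # BadZipFile: the central directory is missing or unreadable.
        # OSError: the file is missing or cannot be opened.
        return False


if __name__ == "__main__":
    # Hypothetical path; replace with the checkpoint file being loaded.
    print(is_complete_zip("/path/to/checkpoint/file"))
```

If this returns False for a checkpoint that fails to load, the file itself is incomplete, which would point at the save step during rescaling rather than at the package versions.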