train_math.py model file save error #78

Open
2 of 4 tasks
Tshiyao opened this issue Dec 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Tshiyao

Tshiyao commented Dec 6, 2024

System Info

ubuntu20
python=3.10
A100-SXM4-80GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the codebase (such as scripts/, ...)
  • My own task or dataset (give details below)

Reproduction

save model
total_num_steps:  64
average_step_rewards:  0.7349994
Process Process-5:
Traceback (most recent call last):
  File "/root/.conda/envs/open122/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/.conda/envs/open122/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/openr/train/mat/scripts/../../mat/envs/math/math_env_wrappers.py", line 117, in shareworker
    cmd, data = remote.recv()
  File "/root/.conda/envs/open122/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/root/.conda/envs/open122/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/root/.conda/envs/open122/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
total 17
drwxr-xr-x 2 root root 4096 Dec  6 22:02 ./
drwxr-xr-x 4 root root 4096 Dec  6 22:02 ../
-rw-r--r-- 1 root root 5119 Dec  6 22:02 README.md
-rw-r--r-- 1 root root  667 Dec  6 22:02 adapter_config.json
-rw-r--r-- 1 root root   40 Dec  6 22:02 adapter_model.safetensors

Expected behavior

The program should exit normally, and the model weights should be saved properly.
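
Note on the directory listing above: a 40-byte adapter_model.safetensors is consistent with a file that holds only a header and no tensors (an 8-byte length prefix plus a 32-byte JSON header such as {"__metadata__":{"format":"pt"}}). A minimal sketch for checking this with only the standard library follows. It assumes the standard safetensors layout (little-endian 64-bit header length, then a JSON header); the helper name list_safetensors_tensors is hypothetical and is not part of the repository.

```python
import json
import struct


def list_safetensors_tensors(path):
    """Return the tensor names recorded in a .safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]
```

Calling list_safetensors_tensors("adapter_model.safetensors") on the saved file and getting an empty list would suggest the adapter was written with an empty state dict, which would be a separate problem from the EOFError in the traceback.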

Tshiyao added the bug label Dec 6, 2024
@suanflower

Same error here. Have you solved it?

@yy0525

yy0525 commented Dec 31, 2024

Same error.

@meijuanwang

Same error. It seems to be related to setting up multiple train envs. Has anyone managed to solve it?
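
For context on the traceback: multiprocessing's Connection.recv() raises EOFError when the other end of the pipe has been closed and nothing is left to read. One possible reading, not confirmed in this thread, is that the main process closes or exits while the env worker processes are still blocked in recv(). The sketch below shows a worker loop that treats EOFError as a shutdown signal. Everything in it is hypothetical (the names worker and run_demo, the "step" and "close" commands); it is not the repository's shareworker, and it assumes the fork start method as on Linux.

```python
import multiprocessing as mp


def worker(remote, parent_remote):
    """Hypothetical stand-in for a pipe-driven env worker."""
    # Under fork the child inherits the parent's end; close that copy,
    # otherwise recv() below would never see EOF.
    parent_remote.close()
    while True:
        try:
            cmd, data = remote.recv()
        except EOFError:
            # The parent closed its end (or exited) without sending "close".
            break
        if cmd == "step":
            remote.send(data * 2)  # placeholder for a real env step
        elif cmd == "close":
            break
    remote.close()


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like system
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_end, parent_end))
    proc.start()
    child_end.close()  # the parent keeps only its own end
    parent_end.send(("step", 21))
    result = parent_end.recv()
    parent_end.close()  # no "close" command: the worker gets EOFError
    proc.join(timeout=10)
    return result, proc.exitcode
```

With this handling the worker exits cleanly instead of printing a traceback. It would only silence the EOFError message and would not by itself explain or fix model weights that were not written.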

4 participants