the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors. #63

Open
kuailexiaohunzi opened this issue Jun 5, 2024 · 11 comments

Comments

@kuailexiaohunzi

When training in CT mode, the following error occurs. Does anyone know how to solve it?
[image: screenshot of the error output]

@RICKand-MORTY

Maybe the PyTorch or CUDA version is incorrect.

@kuailexiaohunzi
Author

> Maybe the PyTorch or CUDA version is incorrect.

PyTorch is 1.13 and CUDA is 11.7, so the versions match.

@RICKand-MORTY

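Are you training on multiple GPUs? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to the number of GPUs you actually have, and the 4 in mpiexec -n 4 on the command line has to be replaced with your GPU count as well.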
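
The GPU-count adjustment described here can be sketched as follows. This is a hypothetical illustration, not the repository's actual dist_utils.py: the names `GPUS_PER_NODE` and `local_gpu_index` are assumptions, and so is the idea that each process picks its GPU as the rank modulo the per-node GPU count.

```python
# Hypothetical sketch; these names are illustrative, not copied from the repository.
GPUS_PER_NODE = 1  # set this to the number of GPUs on your machine


def local_gpu_index(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Map a process rank to a GPU index on its node."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return rank % gpus_per_node


if __name__ == "__main__":
    # With 4 processes (mpiexec -n 4) on a 4-GPU node, each rank gets its own GPU.
    print([local_gpu_index(rank, 4) for rank in range(4)])
```

If the configured count is larger than the number of GPUs actually present, a mapping like this can return an index that does not exist on the machine, which is why both the constant and the `-n` value should match the hardware.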

@kuailexiaohunzi
Author

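> Are you training on multiple GPUs? For multi-GPU training, the per-node GPU count in dist_utils.py has to be changed to the number of GPUs you actually have, and the 4 in mpiexec -n 4 on the command line has to be replaced with your GPU count as well.

No, it is a single GPU. I am not even using the mpiexec -n command.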

@RICKand-MORTY

Try adding the environment variable RDMAV_FORK_SAFE and see. The library may be refusing to fork child processes directly for safety reasons.
https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
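
One way to try this from Python is sketched below. It assumes the variable is only read when the RDMA library initializes, so it has to be set before anything that loads libfabric or libibverbs is imported. The helper name `enable_fork_safe` and the value `"1"` are illustrative assumptions, not something taken from the project.

```python
import os


def enable_fork_safe(env=os.environ) -> str:
    """Set RDMAV_FORK_SAFE unless a value has already been chosen; return the value in effect."""
    return env.setdefault("RDMAV_FORK_SAFE", "1")


# Call this at the very top of the entry script, before importing torch,
# mpi4py, or any other module that may load the RDMA libraries.
enable_fork_safe()
```

Setting the variable after those imports has already run is likely too late to have any effect.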

@kuailexiaohunzi
Author

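> Try adding the environment variable RDMAV_FORK_SAFE and see. The library may be refusing to fork child processes directly for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

OK, I will try it later.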

@kuailexiaohunzi
Author

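> Try adding the environment variable RDMAV_FORK_SAFE and see. The library may be refusing to fork child processes directly for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

I added it in the cm.train file, but it still does not work. The same error is reported.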

@RICKand-MORTY

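> > Try adding the environment variable RDMAV_FORK_SAFE and see. The library may be refusing to fork child processes directly for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
>
> I added it in the cm.train file, but it still does not work. The same error is reported.

Add it in /etc/profile so that it is set as a system-wide environment variable.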

@kuailexiaohunzi
Author

Ah, OK.

@RICKand-MORTY

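> Add it in /etc/profile so that it is set as a system-wide environment variable.

Remember to run source on the file after saving so that the change takes effect.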
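
Because source only updates the shell it is run in, it can help to confirm that the training process actually sees the variable. The check below is a minimal sketch with a hypothetical helper name, `fork_safe_is_set`. It tests only for the variable's presence, on the assumption that the RDMA library looks at whether the variable exists rather than at its value.

```python
import os
import sys


def fork_safe_is_set(env=os.environ) -> bool:
    """Return True if RDMAV_FORK_SAFE is present in the given environment."""
    return "RDMAV_FORK_SAFE" in env


if __name__ == "__main__":
    if not fork_safe_is_set():
        # Start training from a shell that has sourced the profile, or export the variable first.
        print("RDMAV_FORK_SAFE is not set in this process", file=sys.stderr)
```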

@kuailexiaohunzi
Author

OK, thanks.
