the libfabric EFA provider is operating in a condition that could result in memory corruption or other system errors. #63

kuailexiaohunzi · 2024-06-05T17:20:06Z

When I train in CT mode, the following errors occur. Does anyone know how to solve them?

RICKand-MORTY · 2024-06-11T07:39:24Z

Maybe the PyTorch or CUDA version is incorrect.

kuailexiaohunzi · 2024-06-11T11:41:19Z

> Maybe the PyTorch or CUDA version is incorrect.

PyTorch is 1.13 and CUDA is 11.7, so the versions match.

RICKand-MORTY · 2024-06-11T14:12:09Z

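Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py needs to be changed to your own GPU count, and the 4 in mpiexec -n 4 on the command line also needs to be replaced with your GPU count.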
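To make the matching rule concrete, here is a minimal Python sketch of the check. The constant name `GPUS_PER_NODE` and the helper `check_launch` are hypothetical and chosen only for illustration; the actual name in dist_utils.py may differ.

```python
# Hypothetical stand-in for the per-node GPU count in dist_utils.py;
# the real constant may have a different name.
GPUS_PER_NODE = 4


def check_launch(num_processes, gpus_available, gpus_per_node=GPUS_PER_NODE):
    """Return a list of mismatches for a single-node launch (empty if consistent)."""
    problems = []
    if gpus_per_node != gpus_available:
        problems.append(
            f"configured GPUs per node is {gpus_per_node}, "
            f"but {gpus_available} GPU(s) are available"
        )
    if num_processes != gpus_available:
        problems.append(
            f"mpiexec -n {num_processes} does not match "
            f"{gpus_available} available GPU(s)"
        )
    return problems


# Example: a config of 4 GPUs per node, launched with -n 4 on a single-GPU machine.
for problem in check_launch(num_processes=4, gpus_available=1):
    print(problem)
```

In a real run, `gpus_available` could come from `torch.cuda.device_count()` and `num_processes` from the value passed to mpiexec -n.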

kuailexiaohunzi · 2024-06-11T14:14:33Z

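> Is this multi-GPU training? For multi-GPU training, the per-node GPU count in dist_utils.py needs to be changed to your own GPU count, and the 4 in mpiexec -n 4 on the command line also needs to be replaced with your GPU count.

No, single GPU. I am not even using the mpiexec -n command.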

RICKand-MORTY · 2024-06-11T14:24:54Z

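Try adding the environment variable RDMAV_FORK_SAFE and see if that helps. It may be that forking child processes directly is not allowed, for safety reasons.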
https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
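A minimal sketch of setting the variable from Python, assuming it has to be present in the process environment before any RDMA-backed library initializes. The helper name `enable_rdma_fork_safety` and the value `"1"` are illustrative choices, not something stated in this thread.

```python
import os


def enable_rdma_fork_safety(env=os.environ):
    """Make sure RDMAV_FORK_SAFE is set, keeping any value already present."""
    # setdefault only writes the variable if the shell has not exported it.
    env.setdefault("RDMAV_FORK_SAFE", "1")
    return env["RDMAV_FORK_SAFE"]


# Assumption: this runs at the very top of the training entry script,
# before importing libraries that may load the RDMA stack.
enable_rdma_fork_safety()
```

Setting the variable this way only affects the current process and the children it starts afterwards; it does not change the environment of the shell that launched it.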

kuailexiaohunzi · 2024-06-11T14:30:25Z

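> Try adding the environment variable RDMAV_FORK_SAFE and see if that helps. It may be that forking child processes directly is not allowed, for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

OK, I will try it later.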

kuailexiaohunzi · 2024-06-13T15:49:04Z

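> Try adding the environment variable RDMAV_FORK_SAFE and see if that helps. It may be that forking child processes directly is not allowed, for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init

I added it in the cm.train file, but it still does not work. It reports the same error.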

RICKand-MORTY · 2024-06-13T15:51:33Z

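> > Try adding the environment variable RDMAV_FORK_SAFE and see if that helps. It may be that forking child processes directly is not allowed, for safety reasons. https://docs.nvidia.com/networking/display/rdmaawareprogrammingv17/ibv_fork_init
>
> I added it in the cm.train file, but it still does not work. It reports the same error.

Add it in /etc/profile, as a system-wide environment variable.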

kuailexiaohunzi · 2024-06-13T15:53:42Z

Ah, OK.

RICKand-MORTY · 2024-06-13T15:54:36Z

> Add it in /etc/profile, as a system-wide environment variable.

Remember to reload it with source after saving.
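Since source only updates the shell it is run in, it can help to confirm that the training process actually sees the variable. A small sketch of such a check; the helper name `fork_safe_status` is hypothetical.

```python
import os


def fork_safe_status(env=os.environ):
    """Describe whether RDMAV_FORK_SAFE is visible in the given environment."""
    value = env.get("RDMAV_FORK_SAFE")
    if value is None:
        return "RDMAV_FORK_SAFE is not set in this process"
    return f"RDMAV_FORK_SAFE={value}"


# Print the status as seen by this Python process.
print(fork_safe_status())
```

If this reports that the variable is not set, the process was probably started from a shell or tool that has not reloaded /etc/profile.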

kuailexiaohunzi · 2024-06-13T15:58:49Z

OK, thanks.
