Hello, I ran ppo_training.py on three GPUs with the following command:
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node 3 ppo_training.py
Looking at the output, I noticed that some of the logger.info messages in the code are printed three times.
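(I understand that part is expected on its own: torchrun starts one process per GPU, so every process executes the same logging calls. If only a single copy is wanted, the usual pattern is to guard on the local rank; this is just a minimal sketch of what I mean, not code from ppo_training.py:)

```python
import os
from loguru import logger

# torchrun sets LOCAL_RANK for every spawned process (0, 1, 2 for three GPUs).
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))

if LOCAL_RANK == 0:
    logger.info("printed once, by the rank-0 process only")
```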
With 48 samples and batch_size=8, each GPU should receive roughly 16 samples, so steps = 16/8 = 2.
However, in the output I observe:
2025-02-18 19:16:13.464 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-1.6240]), tensor([-2.7709]), tensor([-2.6670]), tensor([-0.6654]), tensor([-3.5599]), tensor([-3.7845]), tensor([-3.1603]), tensor([-6.4888])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:13.464 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-2.6631]), tensor([-2.8427]), tensor([-0.5427]), tensor([-3.5366]), tensor([-1.0341]), tensor([-2.5443]), tensor([-0.1083]), tensor([-2.9795])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:14.677 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-3.7929]), tensor([-3.0101]), tensor([-2.5988]), tensor([-1.4293]), tensor([-3.6989]), tensor([-4.6202]), tensor([0.0503]), tensor([-4.9181])] 1it [00:58, 58.83s/it]
Three "Step 0/6" lines are printed, so the total number of steps still seems to be 48/8 = 6. With data parallelism, each GPU's total number of steps should in principle be 16/8 = 2.
So I suspect that data parallelism is not actually taking effect here. Why would this happen?
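To make the expected numbers concrete, this is the arithmetic I have in mind (my own sanity-check sketch, not code from the repo):

```python
import math

num_samples = 48           # total training samples
per_device_batch_size = 8
world_size = 3             # processes launched by torchrun

samples_per_rank = math.ceil(num_samples / world_size)                # 16
steps_per_rank = math.ceil(samples_per_rank / per_device_batch_size)  # 2 <- what I expect per GPU
steps_unsharded = math.ceil(num_samples / per_device_batch_size)      # 6 <- what "Step 0/6" suggests
print(samples_per_rank, steps_per_rank, steps_unsharded)
```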
Partial code:
```python
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    args.device_map = {"": int(os.environ.get("LOCAL_RANK", "0"))}
# the model is loaded onto this device_map
# ...
device = "cuda"
# ...
question_tensors = batch["input_ids"]
question_tensors = [torch.LongTensor(i).to(device).squeeze(0) for i in question_tensors]
```
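For comparison, this is how I would expect each rank to receive a disjoint slice of the prompts if the dataloader were sharded explicitly. This is only a sketch with placeholder names (`ppo_dataset` is hypothetical); the real script may instead rely on trl/accelerate to do this internally:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder for the real 48-prompt PPO dataset.
ppo_dataset = TensorDataset(torch.arange(48))

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Each rank draws a disjoint ~1/world_size share of the samples.
sampler = DistributedSampler(ppo_dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(ppo_dataset, batch_size=8, sampler=sampler)

# With 48 samples, 3 ranks, and batch_size=8 this gives 2 steps per rank.
print(f"rank {rank}: {len(loader)} steps per epoch")
```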
The output of the first few rounds is as follows:
I reduced the amount of training data and logged the data processed by each GPU, and found that the GPUs partly process the same samples. That doesn't seem right, does it? Does anyone know what's going on?
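The kind of per-rank logging I mean is sketched below (my own helper, not part of ppo_training.py): each rank logs a short fingerprint of every prompt it consumes, so samples repeated on another rank show up when the per-rank logs are compared.

```python
import os
import torch
from loguru import logger

LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))

def log_prompt_fingerprints(step: int, question_tensors: list[torch.Tensor]) -> None:
    """Log the first few token ids of every prompt in the batch so that
    samples repeated on another rank are easy to spot in the merged logs."""
    fingerprints = [t[:5].tolist() for t in question_tensors]
    logger.debug(f"rank {LOCAL_RANK} | step {step} | prompt fingerprints: {fingerprints}")
```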