Hello, I ran ppo_training.py on three GPUs with the following command:
CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node 3 ppo_training.py
Looking at the output, I noticed that some of the logger.info messages in the code are printed three times.
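(I understand that part is expected on its own: torchrun starts one process per GPU, so every process executes the same logging calls. If only a single copy is wanted, the usual pattern is to guard on the local rank; this is just a minimal sketch of what I mean, not code from ppo_training.py:)

```python
import os
from loguru import logger

# torchrun sets LOCAL_RANK for every spawned process (0, 1, 2 for three GPUs).
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))

if LOCAL_RANK == 0:
    logger.info("printed once, by the rank-0 process only")
```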
With 48 samples and batch_size=8, each GPU should receive roughly 16 samples, so steps = 16/8 = 2.
However, in the output I observe:
2025-02-18 19:16:13.464 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-1.6240]), tensor([-2.7709]), tensor([-2.6670]), tensor([-0.6654]), tensor([-3.5599]), tensor([-3.7845]), tensor([-3.1603]), tensor([-6.4888])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:13.464 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-2.6631]), tensor([-2.8427]), tensor([-0.5427]), tensor([-3.5366]), tensor([-1.0341]), tensor([-2.5443]), tensor([-0.1083]), tensor([-2.9795])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:14.677 | DEBUG | main:main:559 - Step 0/6: reward score:[tensor([-3.7929]), tensor([-3.0101]), tensor([-2.5988]), tensor([-1.4293]), tensor([-3.6989]), tensor([-4.6202]), tensor([0.0503]), tensor([-4.9181])] 1it [00:58, 58.83s/it]
Three "Step 0/6" lines are printed, so the total number of steps still seems to be 48/8 = 6. With data parallelism, each GPU's total number of steps should in principle be 16/8 = 2.
So I suspect that data parallelism is not actually taking effect here. Why would this happen?
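To make the expected numbers concrete, this is the arithmetic I have in mind (my own sanity-check sketch, not code from the repo):

```python
import math

num_samples = 48           # total training samples
per_device_batch_size = 8
world_size = 3             # processes launched by torchrun

samples_per_rank = math.ceil(num_samples / world_size)                # 16
steps_per_rank = math.ceil(samples_per_rank / per_device_batch_size)  # 2 <- what I expect per GPU
steps_unsharded = math.ceil(num_samples / per_device_batch_size)      # 6 <- what "Step 0/6" suggests
print(samples_per_rank, steps_per_rank, steps_unsharded)
```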
Partial code:
```python
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    args.device_map = {"": int(os.environ.get("LOCAL_RANK", "0"))}
# the model is loaded onto this device_map
# ...
device = "cuda"
# ...
question_tensors = batch["input_ids"]
question_tensors = [torch.LongTensor(i).to(device).squeeze(0) for i in question_tensors]
```
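For comparison, this is how I would expect each rank to receive a disjoint slice of the prompts if the dataloader were sharded explicitly. This is only a sketch with placeholder names (`ppo_dataset` is hypothetical); the real script may instead rely on trl/accelerate to do this internally:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder for the real 48-prompt PPO dataset.
ppo_dataset = TensorDataset(torch.arange(48))

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

# Each rank draws a disjoint ~1/world_size share of the samples.
sampler = DistributedSampler(ppo_dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(ppo_dataset, batch_size=8, sampler=sampler)

# With 48 samples, 3 ranks, and batch_size=8 this gives 2 steps per rank.
print(f"rank {rank}: {len(loader)} steps per epoch")
```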
The output of the first few rounds is as follows:
I reduced the amount of training data and logged the data processed by each GPU, and found that the GPUs partly process the same samples. That doesn't seem right, does it? Does anyone know what's going on?
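The kind of per-rank logging I mean is sketched below (my own helper, not part of ppo_training.py): each rank logs a short fingerprint of every prompt it consumes, so samples repeated on another rank show up when the per-rank logs are compared.

```python
import os
import torch
from loguru import logger

LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))

def log_prompt_fingerprints(step: int, question_tensors: list[torch.Tensor]) -> None:
    """Log the first few token ids of every prompt in the batch so that
    samples repeated on another rank are easy to spot in the merged logs."""
    fingerprints = [t[:5].tolist() for t in question_tensors]
    logger.debug(f"rank {LOCAL_RANK} | step {step} | prompt fingerprints: {fingerprints}")
```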