Did torchrun data parallelism succeed? #441

Open
chloeHXY opened this issue Feb 18, 2025 · 0 comments
Labels
bug Something isn't working

chloeHXY commented Feb 18, 2025

Hello, I ran ppo_training.py on three GPUs with the following command:

CUDA_VISIBLE_DEVICES=1,2,3 torchrun --nproc_per_node 3 ppo_training.py

Looking at the output, I noticed that every message printed by the logger.info calls in the code appears three times.
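
Each process started by torchrun executes the whole script, so seeing every logger.info line once per process is expected on its own and does not show whether the data was sharded. A minimal sketch of tagging or restricting log lines by rank, assuming the RANK environment variable that torchrun sets for each worker and the standard library logging module. The helper names get_rank, format_with_rank and log_main are hypothetical and not part of ppo_training.py:

```python
import logging
import os

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("ppo_demo")  # hypothetical logger name


def get_rank() -> int:
    """Global rank of this process; defaults to 0 outside torchrun."""
    return int(os.environ.get("RANK", "0"))


def format_with_rank(message: str) -> str:
    """Prefix a message with the rank so interleaved lines can be told apart."""
    return f"[rank {get_rank()}] {message}"


def log_main(message: str) -> bool:
    """Log only on rank 0; return True if the message was emitted."""
    if get_rank() != 0:
        return False
    logger.info(format_with_rank(message))
    return True


if __name__ == "__main__":
    log_main("printed once, by rank 0 only")
    logger.info(format_with_rank("printed by every process"))
```

With the rank prefix in place, three copies of a line tagged rank 0, 1 and 2 simply mean three processes are running, which is what --nproc_per_node 3 asks for.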

There are 48 samples and batch_size=8. Split across three GPUs, each GPU should get about 16 samples, so steps = 16 / 8 = 2 per GPU.
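
The expected step count can be written out as a small calculation. This is a sketch under the assumption that batch_size is the per-process batch size and that samples are split evenly across processes. The function steps_per_rank is hypothetical and only restates the arithmetic above:

```python
import math


def steps_per_rank(num_samples: int, batch_size: int, world_size: int) -> int:
    """Steps each process runs when samples are split evenly across ranks."""
    samples_per_rank = math.ceil(num_samples / world_size)
    return math.ceil(samples_per_rank / batch_size)


if __name__ == "__main__":
    print(steps_per_rank(48, 8, 1))  # single process: 6
    print(steps_per_rank(48, 8, 3))  # three processes: 2
```

A displayed total of 6 matches the single-process case, which is why the "Step 0/6" counter looks suspicious. It could also mean that the total is computed from the full dataset size rather than from the per-rank dataloader length.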

However, this is what I see in the output:

2025-02-18 19:16:13.464 | DEBUG | __main__:main:559 - Step 0/6: reward score:[tensor([-1.6240]), tensor([-2.7709]), tensor([-2.6670]), tensor([-0.6654]), tensor([-3.5599]), tensor([-3.7845]), tensor([-3.1603]), tensor([-6.4888])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:13.464 | DEBUG | __main__:main:559 - Step 0/6: reward score:[tensor([-2.6631]), tensor([-2.8427]), tensor([-0.5427]), tensor([-3.5366]), tensor([-1.0341]), tensor([-2.5443]), tensor([-0.1083]), tensor([-2.9795])] 1it [00:57, 57.62s/it]
2025-02-18 19:16:14.677 | DEBUG | __main__:main:559 - Step 0/6: reward score:[tensor([-3.7929]), tensor([-3.0101]), tensor([-2.5988]), tensor([-1.4293]), tensor([-3.6989]), tensor([-4.6202]), tensor([0.0503]), tensor([-4.9181])] 1it [00:58, 58.83s/it]

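Three "Step 0/6" lines are printed, so the total number of steps still appears to be 48 / 8 = 6. With data parallelism, the total number of steps on each GPU should be 16 / 8 = 2.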

So I suspect that data parallelism did not actually take effect here. Why would this happen?

Relevant code (excerpt):

# torchrun sets WORLD_SIZE and LOCAL_RANK for every process it launches.
world_size = int(os.environ.get("WORLD_SIZE", "1"))
if world_size > 1:
    # Place the whole model on this process's GPU.
    args.device_map = {"": int(os.environ.get("LOCAL_RANK", "0"))}
# The model is then loaded onto device_map.

# ...
device = "cuda"
# ...
question_tensors = batch["input_ids"]
question_tensors = [torch.LongTensor(i).to(device).squeeze(0) for i in question_tensors]
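
A bare "cuda" string refers to the process's current CUDA device, which is device 0 unless it has been changed, for example with torch.cuda.set_device. One thing that may be worth checking is whether the input tensors land on the same GPU as the model in each process. A minimal sketch of deriving the device string from LOCAL_RANK. The helper local_device is hypothetical, and whether ppo_training.py needs it depends on code that the excerpt does not show:

```python
import os


def local_device(cuda_available: bool = True) -> str:
    """Device string for this process, e.g. "cuda:1" when LOCAL_RANK=1."""
    if not cuda_available:
        return "cpu"
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


if __name__ == "__main__":
    # In a real script, pass torch.cuda.is_available() and use the result
    # in tensor.to(device) calls instead of the bare "cuda" string.
    print(local_device())
```

Note that with CUDA_VISIBLE_DEVICES=1,2,3 the visible devices are renumbered, so "cuda:0" in the process is physical GPU 1.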

Output from the first few steps:

[Image]

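I reduced the amount of training data and logged the samples processed by each GPU. It turns out that the GPUs process some of the same samples, which does not seem right. Does anyone know what is going on?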
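
For comparison, in a non-overlapping split each sample index belongs to exactly one rank. A minimal sketch in plain Python, assuming a strided assignment like the one torch.utils.data.DistributedSampler uses when shuffling is off and the dataset size divides evenly by the number of processes. The helpers shard_indices and overlapping_indices are hypothetical and meant only for checking logged indices:

```python
from itertools import combinations
from typing import List, Set


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Indices assigned to one rank under a strided, non-shuffled split."""
    return list(range(num_samples))[rank::world_size]


def overlapping_indices(shards: List[List[int]]) -> Set[int]:
    """Indices that appear in more than one shard."""
    seen_twice: Set[int] = set()
    for a, b in combinations(shards, 2):
        seen_twice |= set(a) & set(b)
    return seen_twice


if __name__ == "__main__":
    shards = [shard_indices(48, 3, rank) for rank in range(3)]
    print([len(s) for s in shards])     # samples per rank
    print(overlapping_indices(shards))  # empty set when shards are disjoint
```

Feeding the per-GPU indices from the logs into overlapping_indices gives a quick check. A non-empty result within one pass over the data would suggest that the dataloader is not being sharded per rank, or that each process shuffles the full dataset independently.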

@chloeHXY chloeHXY added the bug Something isn't working label Feb 18, 2025
@chloeHXY chloeHXY changed the title from "Did torchrun model parallelism succeed?" to "Did torchrun data parallelism succeed?" Feb 18, 2025