How to train with multiple GPUs? #40
Comments
Solved it. I had only ever used DataParallel before, whereas this code uses DistributedDataParallel for parallel training:

    CUDA_VISIBLE_DEVICES=1,2 python -m torch.distributed.launch --nproc_per_node=2 multi_gpu_train.py --name='irra' --img_aug --MLM --dataset_name='CUHK-PEDES' --loss_names='sdm+mlm+id' --root_dir='./data' --num_epoch=100 --batch_size=188
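As a side note, recent PyTorch releases deprecate torch.distributed.launch in favor of torchrun (the torch.distributed.run module). A roughly equivalent invocation is sketched below; the arguments are copied from the command above, and whether IRRA's training script reads the local rank from the LOCAL_RANK environment variable (as torchrun expects) rather than from a --local_rank argument is an assumption to verify.

```bash
# Hedged sketch: torchrun replaces the deprecated torch.distributed.launch.
# CUDA_VISIBLE_DEVICES=1,2 makes only GPUs 1 and 2 visible, so the two
# spawned processes land on those cards.
CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 \
    multi_gpu_train.py --name='irra' --img_aug --MLM \
    --dataset_name='CUHK-PEDES' --loss_names='sdm+mlm+id' \
    --root_dir='./data' --num_epoch=100 --batch_size=188
```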
Could you tell me what changes you made to the code for multi-GPU training? I am also running into some problems when training on multiple GPUs. Thanks!
Hi, I did not modify the code at all. I just added a shell script named run.sh with the following content: python -m torch.distributed.run --nproc_per_node=3
In the script above, every line from the python -m line onward ends with a backslash (line continuation), except the last line. I just noticed that GitHub did not render the backslashes, so I wanted to point that out rather than mislead you.
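For readers following the thread, here is a minimal sketch of what such a run.sh could look like. The target script and flags are taken from the CUDA_VISIBLE_DEVICES command earlier in this thread and are an assumption; the commenter's full script was not posted.

```bash
#!/bin/bash
# Hypothetical run.sh -- flags copied from the earlier command in this thread,
# not from the commenter's actual script.
# Make three GPUs visible, then launch one process per GPU.
CUDA_VISIBLE_DEVICES=0,1,2 \
python -m torch.distributed.run --nproc_per_node=3 \
    multi_gpu_train.py \
    --name='irra' \
    --img_aug \
    --MLM \
    --dataset_name='CUHK-PEDES' \
    --loss_names='sdm+mlm+id' \
    --root_dir='./data' \
    --num_epoch=100 \
    --batch_size=188
```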
Got it, thank you very much!
The author's code does not implement the sampler required for DDP training, so you need to add it yourself; the relevant lines are shown below:

Lines 116 to 118 in c698f85

I tried modifying it as follows, but whether it is actually correct still needs to be verified once my training run finishes:

    if args.distributed:
        # TODO: validate the distributed condition
        logger.info('using ddp random sampler')
        logger.info('DISTRIBUTED TRAIN START')
        # split the global batch evenly across processes
        mini_batch_size = args.batch_size // get_world_size()
        # initialize the DistributedSampler
        train_sampler = DistributedSampler(train_set, shuffle=True)
        train_loader = DataLoader(train_set,
                                  batch_size=mini_batch_size,  # per-GPU mini-batch instead of the global batch size
                                  sampler=train_sampler,       # use the DistributedSampler instead of shuffle=True
                                  num_workers=num_workers,
                                  collate_fn=collate)
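One detail worth adding to that snippet (my note, not from the thread): DistributedSampler also needs set_epoch() called at the start of every epoch, otherwise every epoch reuses the same shuffle order. A minimal sketch, assuming the train_sampler from the snippet above and a simple epoch loop:

```python
# Imports assumed by the snippet above.
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

for epoch in range(num_epochs):
    # Re-seed the sampler so each epoch gets a different shuffle
    # that stays consistent across all ranks.
    train_sampler.set_epoch(epoch)
    for batch in train_loader:
        ...  # forward / backward / optimizer step as in the original training loop
```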
How do I train with multiple GPUs? My server has four GPUs (0, 1, 2, 3), and I want to use GPU 1 and GPU 2. How can I do that? Any advice would be appreciated. The multi-GPU part of the source code looks like it has some problems; has anyone else run into this?