forked from yl4579/StyleTTS2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
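* Fix multi-node training issue
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update and complete the distributed training feature

While recently merging the V2 code, I found that the earlier multi-node changes were not correct and still raised errors. In the single-node multi-GPU case local_rank is equivalent to rank, so the problem went unnoticed.
1. Fix DDP initialization in train_ms.py and bind .cuda to local_rank
2. Add the env variable LOCAL_RANK to the default_config.yml config file; otherwise a key error occurs by default
3. Add run_MnodesAndMgpus.sh and update the distributed-training notes
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>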
1 parent 1fbddf4 commit 31de84e
Showing 4 changed files with 80 additions and 27 deletions.
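The rank versus local_rank distinction in the commit message can be illustrated with a small sketch. It relies only on the fact that torchrun exports `RANK`, `LOCAL_RANK` and `WORLD_SIZE` to each worker. The names `Ranks` and `read_ranks` are hypothetical and not taken from `train_ms.py`, and falling back to defaults is an assumed alternative to the config change described above, not the commit's actual fix.

```python
import os
from typing import Mapping, NamedTuple


class Ranks(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (selects the GPU)
    world_size: int  # total number of processes in the job


def read_ranks(env: Mapping[str, str] = os.environ) -> Ranks:
    """Read the per-worker variables that torchrun exports.

    Hypothetical helper: falls back to single-process defaults so that a
    missing LOCAL_RANK does not cause a lookup error.
    """
    return Ranks(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


# Single node, 2 GPUs: for the second worker, rank and local_rank coincide,
# which is why using rank as the device index appears to work.
single = read_ranks({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})

# Two nodes, 2 GPUs each: the second worker on the second node has global
# rank 3 but must use GPU 1 on its machine, so rank != local_rank.
multi = read_ranks({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})

# In a training script the device would be chosen from local_rank, e.g.
#   torch.cuda.set_device(ranks.local_rank)
#   model = DistributedDataParallel(model.cuda(ranks.local_rank),
#                                   device_ids=[ranks.local_rank])
# while rank and world_size describe the process to init_process_group.
```

Using `rank` as a CUDA device index in the second case would ask for GPU 3 on a machine that only has GPUs 0 and 1, which matches the kind of failure that only shows up on multi-node runs.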
This file was deleted.
@@ -0,0 +1,31 @@
# Multi-node, multi-GPU training

# --nnodes=1:3 means one to three machines are used, with resources allocated elastically
# --nnodes=<minimum number of nodes>:<maximum number of nodes>
# --nproc_per_node=number of GPUs available on each machine
# --rdzv_endpoint=<IP of the master node (the one started first)>:<port>
# Nothing else needs to change

# Note: distributed training in this version is data-parallel, so multi-node multi-GPU training amounts to a larger batch size, and each epoch completes faster.
# However, this version of the code saves the model according to the global step, so the time between saved models does not shrink noticeably,
# but each saved model has been through more epochs than before, that is, "fewer steps, better results".

#*************************
# torchrun \
#     --nnodes=1:3 \
#     --nproc_per_node=2 \
#     --rdzv_id=1 \
#     --rdzv_backend=c10d \
#     --rdzv_endpoint="inspur1:8880" \
#     train_ms.py
#****************************

# Single-node, multi-GPU training
# nproc_per_node = number of GPUs available on the machine

#*************************
torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    train_ms.py
#*************************
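The note in the script about checkpoints being tied to the global step can be made concrete with a little arithmetic. The following is a minimal sketch under stated assumptions: the dataset size, per-GPU batch size and save interval are hypothetical numbers chosen for illustration, and the function names are not from the repository.

```python
import math


def steps_per_epoch(num_samples: int, batch_per_gpu: int, world_size: int) -> int:
    """Optimizer steps for one pass over the data under data parallelism.

    Each step consumes batch_per_gpu samples on every one of the
    world_size processes, so the effective batch is their product.
    """
    return math.ceil(num_samples / (batch_per_gpu * world_size))


def epochs_at_checkpoint(save_every_steps: int, num_samples: int,
                         batch_per_gpu: int, world_size: int) -> float:
    """Approximate number of epochs completed when a checkpoint is written."""
    return save_every_steps / steps_per_epoch(num_samples, batch_per_gpu, world_size)


# Hypothetical values, for illustration only.
NUM_SAMPLES = 8000
BATCH_PER_GPU = 4
SAVE_EVERY = 1000  # checkpoint interval measured in global steps

for world_size in (2, 4, 6):  # 1, 2 and 3 nodes with 2 GPUs each
    print(
        f"world_size={world_size} "
        f"steps/epoch={steps_per_epoch(NUM_SAMPLES, BATCH_PER_GPU, world_size)} "
        f"epochs/checkpoint="
        f"{epochs_at_checkpoint(SAVE_EVERY, NUM_SAMPLES, BATCH_PER_GPU, world_size):.2f}"
    )
```

With a fixed save interval in steps, adding GPUs leaves the number of steps between checkpoints unchanged but lets each checkpoint cover more passes over the data, which is the effect the script comment describes.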