Support control flow in DataParallel #32826
Merged
PR types
Bug fixes
PR changes
APIs
Describe
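In dynamic graph mode there are four main situations regarding unused_parameters.
Case 1: no unused_parameters
The dynamic-graph network uses no stop_gradient/detach and declares no extra parameters, so every parameter declared in the network produces a gradient during training.
Case 2: unused_parameters exist, and the computation has no control flow
The network declares parameters that the computation never uses, or it uses stop_gradient/detach/trainable so that some parameters produce no gradient.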
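To make Cases 1 and 2 concrete, here is a minimal pure-Python sketch of a toy computation graph. It is not Paddle code: `Node`, `params_with_grad`, and the parameter names `w1`, `w2`, `w3` are hypothetical and exist only for illustration. The assumption it encodes is that a parameter counts as unused when a backward walk from the output never reaches it, either because no op reads it or because the path to it is cut by a stop-gradient.

```python
class Node:
    """Toy graph node: a parameter leaf, an input, or an op with inputs."""

    def __init__(self, name, inputs=(), is_param=False, stop_gradient=False):
        self.name = name
        self.inputs = tuple(inputs)
        self.is_param = is_param
        self.stop_gradient = stop_gradient  # True cuts the backward path here


def params_with_grad(output):
    """Walk backward from `output`; return names of parameters that are reached."""
    reached, visited, stack = set(), set(), [output]
    while stack:
        node = stack.pop()
        if id(node) in visited or node.stop_gradient:
            continue
        visited.add(id(node))
        if node.is_param:
            reached.add(node.name)
        stack.extend(node.inputs)
    return reached


x = Node("x")
w1 = Node("w1", is_param=True)
w2 = Node("w2", is_param=True)
w3 = Node("w3", is_param=True)  # declared, but no op ever reads it

# Case 1: both declared parameters feed the output.
out1 = Node("out", [Node("h", [x, w1]), w2])
unused_case1 = {"w1", "w2"} - params_with_grad(out1)

# Case 2: w3 is never used, and the branch through w1 is detached.
h_detached = Node("h", [x, w1], stop_gradient=True)
out2 = Node("out", [h_detached, w2])
unused_case2 = {"w1", "w2", "w3"} - params_with_grad(out2)

print(unused_case1)          # set()
print(sorted(unused_case2))  # ['w1', 'w3']
```

In both cases the unused set is the same in every step, which is why a single check in the first step is enough for them.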
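Case 3: control flow that depends on the training step
The computation changes which parameters are trained depending on the step, as in some GAN networks.
Case 4: control flow that depends on the input data
The computation changes its output depending on the input data, for example when the input contains dirty data.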
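The difference between Cases 3 and 4 and the first two cases is that the set of unused parameters is no longer fixed. The sketch below is a toy illustration in plain Python, not Paddle code; the alternating schedule, the use of `None` to mark a dirty sample, and all names (`DECLARED`, `used_in_step`, `used_for_batch`) are assumptions made for the example.

```python
DECLARED = {"generator.w", "discriminator.w"}


def used_in_step(step):
    """Case 3: a hypothetical schedule that alternates the trained sub-network."""
    return {"generator.w"} if step % 2 == 0 else {"discriminator.w"}


def used_for_batch(batch):
    """Case 4: only clean samples (not None) reach the parameters."""
    has_clean_sample = any(sample is not None for sample in batch)
    return set(DECLARED) if has_clean_sample else set()


# Case 3: the unused set differs between step 0 and step 1.
unused_step0 = DECLARED - used_in_step(0)
unused_step1 = DECLARED - used_in_step(1)

# Case 4: a card whose batch is entirely dirty uses no parameter at all.
unused_clean_card = DECLARED - used_for_batch([1.0, None, 2.0])
unused_dirty_card = DECLARED - used_for_batch([None, None])
```

Here `unused_step0` and `unused_step1` disagree, and `unused_dirty_card` covers every declared parameter, so a list recorded in the first step cannot be trusted in later steps or on other cards.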
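To handle these four cases and any combination of them, the parameter `find_unused_parameters` is introduced. It defaults to False.
1. When `find_unused_parameters=False`, the check runs only in the first step. After the forward pass and before the backward pass, the backward graph is inspected to find all unused parameters of that step, and the list is assumed to stay the same in every later step. This has no performance impact and handles Cases 1 and 2.
2. When `find_unused_parameters=True`, the backward graph is traversed in every step, which handles Cases 1, 2, 3, and 4. For Cases 1 and 2 this lowers computation performance, so setting it to True is not recommended there.
3. Currently, setting `find_unused_parameters=True` emits a warning about the performance drop. With `find_unused_parameters=False`, training may core dump with a message prompting the user to enable the parameter, or it may hang without reporting any error, in which case the user has to notice the hang and enable the parameter.
4. In Case 4, data-dependent control flow can leave a card with no parameters involved at all, so that not a single gradient hook is triggered. Besides setting `find_unused_parameters=True`, this case also requires adding a fake leaf node to the network.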
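The behaviour described for `find_unused_parameters` can be sketched with a small simulation. This is a toy model in plain Python, not Paddle's reducer: `ToyReducer`, `step_completes`, and the `fake_leaf` name are hypothetical. It assumes that a step finishes only when the gradients that arrive match the ones the reducer is waiting for, and that synchronization is started from inside a gradient hook, so a step in which no hook fires never completes.

```python
class ToyReducer:
    """Toy stand-in for a gradient reducer; not Paddle's implementation."""

    def __init__(self, declared, find_unused_parameters=False):
        self.declared = set(declared)
        self.find_unused_parameters = find_unused_parameters
        self._first_step_unused = None

    def _assumed_unused(self, used_now):
        if self.find_unused_parameters:
            return self.declared - used_now  # re-checked in every step
        if self._first_step_unused is None:
            self._first_step_unused = self.declared - used_now  # checked once
        return self._first_step_unused

    def step_completes(self, used_now):
        """True if the gradients that arrive match what the reducer waits for."""
        used_now = set(used_now)
        expected = self.declared - self._assumed_unused(used_now)
        if not used_now:
            return False  # assumption: with zero hooks fired, nothing starts the sync
        return expected == used_now


params = {"g.w", "d.w"}
schedule = [{"g.w"}, {"d.w"}]  # Case 3: the trained parameter alternates

cached = ToyReducer(params, find_unused_parameters=False)
cached_results = [cached.step_completes(used) for used in schedule]

every_step = ToyReducer(params, find_unused_parameters=True)
every_step_results = [every_step.step_completes(used) for used in schedule]

# Case 4: a card whose batch touches no parameter at all.
no_leaf = ToyReducer(params, find_unused_parameters=True)
dirty_without_leaf = no_leaf.step_completes(set())

# A fake leaf that always takes part in the computation fires one hook.
with_leaf = ToyReducer(params | {"fake_leaf"}, find_unused_parameters=True)
dirty_with_leaf = with_leaf.step_completes({"fake_leaf"})
```

In this model the cached list fails on the second step of the alternating schedule, the per-step check succeeds on both, and the all-dirty card only completes once the fake leaf guarantees that at least one hook fires.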