
Support control flow in DataParallel #32826

Merged
merged 6 commits on May 11, 2021

Conversation

@ForFishes ForFishes (Member) commented May 10, 2021

PR types

Bug fixes

PR changes

APIs

Describe

In dygraph mode there are four main cases involving unused_parameters.

Case 1: no unused_parameters
The network uses no stop_gradient/detach and declares no extra parameters, so every globally declared parameter produces a gradient during training, as in the following network:

import paddle
import paddle.fluid as fluid

class SimpleDPNet(fluid.dygraph.Layer):
    def __init__(self, vocab_size, hidden_size, inner_size, output_size):
        super(SimpleDPNet, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, hidden_size)
        self.linear3 = paddle.nn.Linear(hidden_size, output_size)
        self.embedding = paddle.nn.Embedding(vocab_size, hidden_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        return x.mean()

Case 2: unused_parameters exist, no control flow in the computation
The network declares parameters globally that are never used in the actual computation, or applies stop_gradient/detach/trainable so that some parameters produce no gradient, as in the following network:

class SimpleDPNet(fluid.dygraph.Layer):
    def __init__(self, vocab_size, hidden_size, inner_size, output_size):
        super(SimpleDPNet, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, hidden_size)
        self.linear3 = paddle.nn.Linear(hidden_size, output_size)
        self.embedding = paddle.nn.Embedding(vocab_size, hidden_size)
       
        # This layer takes no part in the forward pass, so its w and b will produce no gradient.
        self.tmp = paddle.nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear1(x)

        # stop_gradient is set here, so the w and b of linear1 produce no gradient, and neither does the embedding.
        x.stop_gradient = True
        x = self.linear2(x)
        x = self.linear3(x)
        return x.mean()

Case 3: control flow depending on the training step
During the computation, which parameters are trained is adjusted according to the step, e.g. in some GAN networks.

class SimpleDPNet(fluid.dygraph.Layer):
    def __init__(self, vocab_size, hidden_size, inner_size, output_size):
        super(SimpleDPNet, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, hidden_size)
        self.linear3 = paddle.nn.Linear(hidden_size, output_size)
        self.embedding = paddle.nn.Embedding(vocab_size, hidden_size)

    def forward(self, x, step):
        x = self.embedding(x)
        x = self.linear1(x)
        if step > 10:
            # When step > 10, the w and b of linear1 produce no gradient, and neither does the embedding.
            x.stop_gradient = True

        x = self.linear2(x)
        x = self.linear3(x)
        return x.mean()

Case 4: control flow depending on the input data
During the computation, the output is adjusted according to the input data, e.g. when the data contains dirty samples.

class SimpleDPNet(fluid.dygraph.Layer):
    def __init__(self, vocab_size, hidden_size, inner_size, output_size):
        super(SimpleDPNet, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, hidden_size)
        self.linear3 = paddle.nn.Linear(hidden_size, output_size)
        self.embedding = paddle.nn.Embedding(vocab_size, hidden_size)

    def forward(self, x):
        # Data-dependent control flow: when the sum of x is 0, the loss returned is 0 and no parameter produces a gradient (in distributed training, gradients then differ across cards).
        if paddle.sum(x) == 0:
            return 0

        x = self.embedding(x)
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        return x.mean()

To handle the four cases above, and any combination of them, this PR introduces the argument find_unused_parameters, which defaults to False (a minimal usage sketch follows the last example below).

1. With find_unused_parameters=False, the check runs only in the first step: after the forward pass and before the backward pass, the backward graph is traversed to collect all unused_parameters of that first step, and this list is assumed to stay the same in every later step. There is no performance impact, and cases 1 and 2 are handled.

2. With find_unused_parameters=True, the backward graph is traversed in every step, which handles cases 1, 2, 3 and 4. For cases 1 and 2 this costs compute performance, so setting it to True is not recommended there.

3. Currently:

  • In case 1, setting find_unused_parameters=True raises a warning about the performance cost.
  • In case 3, setting find_unused_parameters=False makes training crash (core dump) with a message asking the user to enable the flag.
  • In case 4, setting find_unused_parameters=False makes training hang without any error; the user has to notice this and enable the flag.

4. In case 4, the data-dependent control flow can leave a card with no parameter producing a gradient at all, so not a single gradient hook fires. Besides setting find_unused_parameters=True, a phony leaf node has to be added, with the following modification:

class SimpleDPNet(fluid.dygraph.Layer):
    def __init__(self, vocab_size, hidden_size, inner_size, output_size):
        super(SimpleDPNet, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden_size, inner_size)
        self.linear2 = paddle.nn.Linear(inner_size, hidden_size)
        self.linear3 = paddle.nn.Linear(hidden_size, output_size)
        self.embedding = paddle.nn.Embedding(vocab_size, hidden_size)

        # Phony leaf node
        self.phony = self.create_parameter(shape=[1], dtype="float32")

    def forward(self, x):
        # Data-dependent control flow: when the sum of x is 0, the loss returned is 0 and no parameter produces a gradient (in distributed training, gradients then differ across cards).
        if paddle.sum(x) == 0:
            return 0 * self.phony

        x = self.embedding(x)
        x = self.linear1(x)
        x = self.linear2(x)
        x = self.linear3(x)
        return x.mean()
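For reference, here is a minimal usage sketch showing where the new argument plugs in. The sizes, optimizer, and dummy data are illustrative assumptions rather than part of this PR, and a real multi-card run would be started through paddle.distributed.launch or paddle.distributed.spawn:

# Illustrative sketch only; SimpleDPNet is the case-4 class defined above.
import paddle
import paddle.distributed as dist

dist.init_parallel_env()

model = SimpleDPNet(vocab_size=20, hidden_size=10, inner_size=8, output_size=10)
# find_unused_parameters=True is only needed for cases 3 and 4 (control flow);
# for cases 1 and 2 the default False avoids the per-step backward-graph scan.
model = paddle.DataParallel(model, find_unused_parameters=True)
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

x = paddle.randint(low=0, high=20, shape=[4, 6])  # dummy integer ids for the embedding
loss = model(x)
loss.backward()
opt.step()
opt.clear_grad()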

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@ForFishes ForFishes changed the title Add find_unused_parameters in DataParallel for controlflow processing Support Control flow in DataParallel May 10, 2021
@ForFishes ForFishes changed the title Support Control flow in DataParallel Support control flow in DataParallel May 10, 2021
@XiaoguangHu01 XiaoguangHu01 (Contributor) left a comment


LGTM

@jzhang533 jzhang533 (Contributor) left a comment


LGTM
