flow.cumprod does not support SBP mode, or its SBP implementation has a bug #8920

Closed
shaoshitong opened this issue Aug 16, 2022 · 4 comments · Fixed by #8929
Assignees: wyg1997
Labels: bug, community

Comments

@shaoshitong

Case 1: using flow.cumprod together with SBP:

import oneflow as flow
import oneflow.nn as nn

from libai.utils import distributed as dist

PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])

class Cumprod(nn.Module):
    def __init__(self):
        super(Cumprod, self).__init__()
        self.param = nn.Parameter(
            flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
        )
    def forward(self,x):
        x = x * self.param
        x = flow.cumprod(x,1)
        return x.sum()


model = Cumprod()
x = flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
y = model(x)
y.backward()
"""
/root/anaconda3/envs/torch/bin/python /home/sst/product/libai/alignment/ocumprod.py
libibverbs not available, ibv_fork_init skipped
Distributed env is not set up, configure it by default (single node, single gpu).
Traceback (most recent call last):
  File "/home/sst/product/libai/alignment/ocumprod.py", line 24, in <module>
    y.backward()
  File "/root/anaconda3/envs/torch/lib/python3.7/site-packages/oneflow/framework/tensor.py", line 33, in _backward
    flow.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/anaconda3/envs/torch/lib/python3.7/site-packages/oneflow/autograd/autograd.py", line 114, in backward
    create_graph,
oneflow._oneflow_internal.exception.Exception: Check failed: bn2sbp.find(ibn) != bn2sbp.end() In op_name: cumprod_grad9 op_type_name: cumprod_grad, input_arg_name : output input_arg_index : 0 have NOT set sbp signature
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/api/python/autograd/autograd.cpp", line 90, in Backward
    one::GetThreadLocalAutogradEngine()->RunBackwardAndSaveGrads4LeafTensorIf( outputs, *gradients, retain_graph, create_graph)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/autograd/autograd_engine.cpp", line 411, in RunBackwardAndSaveGrads4LeafTensor
    graph_task.Apply( true)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/autograd/autograd_engine.cpp", line 387, in Apply
    node->Apply(create_graph_)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/autograd/autograd_engine.cpp", line 233, in Apply
    backward_fn_->body(output_grads, &input_grads, create_graph)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 107, in operator()
    grad_closure->Apply(out_grads, in_grads)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/autograd/gradient_funcs/cum_ops.cpp", line 91, in Apply
    functional::CumprodGrad(out_grads.at(0), ctx->SavedTensors().at(0), ctx->SavedTensors().at(1), ctx->dim)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 143, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 134, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, processor.inputs(), outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 96, in Apply
    internal_->Apply(op_expr, inputs, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_global_op_interpreter.cpp", line 131, in Interpret
    user_op_expr.mut_global_tensor_infer_cache()->GetOrInfer(*infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/global_tensor_infer_cache.cpp", line 354, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/global_tensor_infer_cache.cpp", line 290, in Infer
    op->InferNdSbpSignatureIf(nd_sbp_constraints, *parallel_desc, NdSbpInferHint4Ibn)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 823, in InferNdSbpSignatureIf
    InferNdSbpSignature(&nd_sbp_signature, nd_sbp_constraints, parallel_desc, NdSbpInferHint4Ibn)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/user_op.cpp", line 823, in InferNdSbpSignature
    Operator::InferNdSbpSignature(nd_sbp_signature, nd_sbp_constraints, parallel_desc, NdSbpInferHint4Ibn)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 846, in InferNdSbpSignature
    InferSbpSignature(&sbp_signature, sbp_constraints, ibn2sbp_infer_hint)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 596, in InferSbpSignature
    InferSbpSignature(infered_sbp_signature, sbp_sig_conf, CalcOrderValue4SbpSig, SbpInferHint4Ibn, *op_parallel_desc_)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 791, in InferSbpSignature
    GetSbpSignaturesIf(LogicalBlobDesc4Ibn, parallel_desc, &sbp_sig_candidates)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/operator.cpp", line 467, in GetSbpSignaturesIf
    GetSbpSignatures(LogicalBlobDesc4Ibn, parallel_desc, sbp_sig_list)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/operator/user_op.cpp", line 772, in GetSbpSignatures
    
Error Type: oneflow.ErrorProto.check_failed_error

Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
"""

Case 2: using SBP without flow.cumprod:

import oneflow as flow
import oneflow.nn as nn

from libai.utils import distributed as dist

PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])

class Cumprod(nn.Module):
    def __init__(self):
        super(Cumprod, self).__init__()
        self.param = nn.Parameter(
            flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
        )
    def forward(self,x):
        x = x * self.param
        # x = flow.cumprod(x,1)
        return x.sum()


model = Cumprod()
x = flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
y = model(x)
y.backward()
"""
/root/anaconda3/envs/torch/bin/python /home/sst/product/libai/alignment/ocumprod2.py
libibverbs not available, ibv_fork_init skipped
Distributed env is not set up, configure it by default (single node, single gpu).

Process finished with exit code 0
"""

Case 3: using flow.cumprod without SBP:

import oneflow as flow
import oneflow.nn as nn

from libai.utils import distributed as dist

PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])

class Cumprod(nn.Module):
    def __init__(self):
        super(Cumprod, self).__init__()
        self.param = nn.Parameter(
            flow.randn(1,10).cuda()
        )
    def forward(self,x):
        x = x * self.param
        x = flow.cumprod(x,1)
        return x.sum()


model = Cumprod()
x = flow.randn(1,10).cuda()
y = model(x)
y.backward()
"""
/root/anaconda3/envs/torch/bin/python /home/sst/product/libai/alignment/ocumprod3.py
libibverbs not available, ibv_fork_init skipped
Distributed env is not set up, configure it by default (single node, single gpu).

Process finished with exit code 0
"""
@shaoshitong added the bug and community labels on Aug 16, 2022
@shaoshitong (Author)

The workaround I can think of for now is to implement one myself at the Python level.

@MARD1NO (Contributor) commented Aug 16, 2022

https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/user/ops/cum_ops.cpp#L107-L113

The error message shows that the `output` argument of the cumprod_grad op has no SBP specified.

@shaoshitong (Author)

That is the case, but when I check cumprod's input and output in the forward pass, both already have SBP signatures.
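
For reference, the same failure seems to reproduce without LiBai at all. Here is a minimal sketch (an assumption on my part: a single-GPU global environment with a plain 1-D broadcast SBP in place of dist.get_nd_sbp) that also lets you inspect the forward output's SBP before backward fails:

import oneflow as flow

placement = flow.placement("cuda", [0])
x = flow.randn(1, 10, requires_grad=True).to_global(
    sbp=flow.sbp.broadcast, placement=placement
)
y = flow.cumprod(x, 1)       # forward succeeds
print(y.is_global, y.sbp)    # the forward output already carries an SBP signature
y.sum().backward()           # raises: cumprod_grad's "output" input has no SBP set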

@shaoshitong (Author)

Note
I implemented a Python-based global (parallel) version of flow.cumprod; anyone who needs it can use it as a temporary replacement:

import oneflow as flow

def cumprod(inputs, dim=0):
    # Python-level fallback for flow.cumprod on global tensors:
    # slice along `dim` with index_select, multiply the slices cumulatively,
    # then concatenate the partial products back along `dim`.
    ndim = inputs.ndim
    assert -ndim <= dim < ndim, f"{dim} must be in [{-ndim}, {ndim})"
    if dim < 0:
        dim = ndim + dim
    # The index tensor must share the input's sbp/placement.
    res = flow.index_select(
        inputs, dim, flow.LongTensor([0]).to_global(sbp=inputs.sbp, placement=inputs.placement)
    )
    result = [res]
    for i in range(1, inputs.shape[dim]):
        res = res * flow.index_select(
            inputs, dim, flow.LongTensor([i]).to_global(sbp=inputs.sbp, placement=inputs.placement)
        )
        result.append(res)
    return flow.cat(result, dim=dim)

Verification code:

from libai.utils import distributed as dist

PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])

x = flow.randn(2,2,2).to_global(placement=PLACEMENT,sbp=BROADCAST)
print(x)
y = cumprod(x,1)
print(y)
# libibverbs not available, ibv_fork_init skipped
# Distributed env is not set up, configure it by default (single node, single gpu).
# tensor([[[ 1.1759,  1.8179],
#          [-2.1007,  1.3614]],
#
#         [[ 0.9461,  0.1282],
#          [-1.0399, -0.2611]]],
#        placement=oneflow.placement(type="cuda", ranks=[0]),
#        sbp=(oneflow.sbp.broadcast,), dtype=oneflow.float32)
# tensor([[[ 1.1759,  1.8179],
#          [-2.4701,  2.4749]],
#
#         [[ 0.9461,  0.1282],
#          [-0.9838, -0.0335]]],
#        placement=oneflow.placement(type="cuda", ranks=[0]),
#        sbp=(oneflow.sbp.broadcast,), dtype=oneflow.float32)
#
# Process finished with exit code 0

Backward-pass verification:

import oneflow as flow
import oneflow.nn as nn

from libai.utils import distributed as dist

PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])

class Cumprod(nn.Module):
    def __init__(self):
        super(Cumprod, self).__init__()
        self.param = nn.Parameter(
            flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
        )
    def forward(self,x):
        x = x * self.param
        x = cumprod(x,1)
        return x.sum()


model = Cumprod()
x = flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
y = model(x)
y.backward()
"""
/root/anaconda3/envs/torch/bin/python /home/sst/product/libai/alignment/ocumprod3.py
libibverbs not available, ibv_fork_init skipped
Distributed env is not set up, configure it by default (single node, single gpu).

Process finished with exit code 0
"""

@wyg1997 self-assigned this on Aug 16, 2022
wyg1997 added a commit that referenced this issue on Aug 16, 2022
mergify bot closed this as completed in #8929 on Aug 18, 2022
mergify bot added a commit that referenced this issue on Aug 18, 2022
* fix(CumprodGrad): fix cumprod_grad GetSbp bug

fix #8920

* test(Cumprod): add global test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
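
Not the actual test added in #8929 (I have not checked its contents), just a rough sketch of how the fix might be sanity-checked on a build that includes it, comparing the global cumprod result against a NumPy reference:

import numpy as np
import oneflow as flow

placement = flow.placement("cuda", [0])
x_local = flow.randn(2, 8, requires_grad=True)
x = x_local.to_global(sbp=flow.sbp.broadcast, placement=placement)

y = flow.cumprod(x, 1)
y.sum().backward()  # should no longer raise once cumprod_grad sets its SBP

# sanity check against a NumPy reference
ref = np.cumprod(x_local.detach().numpy(), axis=1)
assert np.allclose(y.detach().to_local().cpu().numpy(), ref, atol=1e-5)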