flow.cumprod does not support SBP mode, or its SBP implementation has a bug #8920
Comments
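The workaround I have in mind for now is to implement one myself at the Python level.
What the error message shows is that the output of the cumprod_grad op
That is the case, but when I inspect the input and output of cumprod in the forward pass, both already have an SBP signature.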
import oneflow as flow


def cumprod(inputs, dim=0):
    # Python-level workaround: multiply the slices along `dim` one by one.
    ndim = inputs.ndim
    assert -ndim <= dim < ndim, f"dim {dim} must be in [-{ndim}, {ndim})"
    if dim < 0:
        dim = ndim + dim
    # The index tensor is made global with the same sbp and placement as the input.
    res = flow.index_select(inputs, dim, flow.LongTensor([0]).to_global(sbp=inputs.sbp, placement=inputs.placement))
    result = [res]
    for i in range(1, inputs.shape[dim]):
        res = res * flow.index_select(inputs, dim, flow.LongTensor([i]).to_global(sbp=inputs.sbp, placement=inputs.placement))
        result.append(res)
    result = flow.cat(result, dim=dim)
    return result

Verification code:

from libai.utils import distributed as dist
PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
x = flow.randn(2,2,2).to_global(placement=PLACEMENT,sbp=BROADCAST)
print(x)
y = cumprod(x,1)
print(y)
# libibverbs not available, ibv_fork_init skipped
# Distributed env is not set up, configure it by default (single node, single gpu).
# tensor([[[ 1.1759, 1.8179],
# [-2.1007, 1.3614]],
#
# [[ 0.9461, 0.1282],
# [-1.0399, -0.2611]]],
# placement=oneflow.placement(type="cuda", ranks=[0]),
# sbp=(oneflow.sbp.broadcast,), dtype=oneflow.float32)
# tensor([[[ 1.1759, 1.8179],
# [-2.4701, 2.4749]],
#
# [[ 0.9461, 0.1282],
# [-0.9838, -0.0335]]],
# placement=oneflow.placement(type="cuda", ranks=[0]),
# sbp=(oneflow.sbp.broadcast,), dtype=oneflow.float32)
#
# Process finished with exit code 0

Backward-pass verification:

import oneflow as flow
import oneflow.nn as nn
from libai.utils import distributed as dist
PLACEMENT = flow.placement("cuda", [0])
BROADCAST = dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast])
class Cumprod(nn.Module):
    def __init__(self):
        super(Cumprod, self).__init__()
        self.param = nn.Parameter(
            flow.randn(1, 10).to_global(sbp=BROADCAST, placement=PLACEMENT)
        )

    def forward(self, x):
        x = x * self.param
        x = cumprod(x, 1)  # the Python-level cumprod defined above, not flow.cumprod
        return x.sum()


model = Cumprod()
x = flow.randn(1,10).to_global(sbp=BROADCAST,placement=PLACEMENT)
y = model(x)
y.backward()
"""
/root/anaconda3/envs/torch/bin/python /home/sst/product/libai/alignment/ocumprod3.py
libibverbs not available, ibv_fork_init skipped
Distributed env is not set up, configure it by default (single node, single gpu).
Process finished with exit code 0
"""
wyg1997 added a commit that referenced this issue on Aug 16, 2022
mergify bot added a commit that referenced this issue on Aug 18, 2022
* fix(CumprodGrad): fix cumprod_grad GetSbp bug fix #8920
* test(Cumprod): add global test
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
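For readers who want to check the slice-and-multiply loop from the Python-level workaround earlier in this thread without a OneFlow install, here is a minimal sketch in plain Python. It is not the OneFlow implementation; `normalize_dim` and `running_product` are hypothetical helper names, and nested lists stand in for tensor slices. The sample values are the first batch of the forward input printed above.

```python
from itertools import accumulate
from operator import mul


def normalize_dim(dim, ndim):
    """Map a possibly negative dim into [0, ndim), rejecting out-of-range values."""
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} must be in [-{ndim}, {ndim})")
    return dim + ndim if dim < 0 else dim


def running_product(slices):
    """Elementwise running product over a list of equally sized slices.

    Mirrors the workaround's loop: start from slice 0, then multiply each
    following slice into the running result and keep every intermediate.
    """
    res = list(slices[0])
    result = [res]
    for i in range(1, len(slices)):
        res = [a * b for a, b in zip(res, slices[i])]
        result.append(res)
    return result


# First batch of the input printed in the thread (x[0], cumulated along dim 1).
x0 = [[1.1759, 1.8179], [-2.1007, 1.3614]]
y0 = running_product(x0)

# Cross-check every column against the standard library's accumulate.
columns = [list(accumulate(col, mul)) for col in zip(*x0)]
assert [list(row) for row in zip(*columns)] == y0
```

The second row of `y0` agrees with the printed forward output (`-2.4701`, `2.4749`) up to the rounding of the four-digit inputs.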
Case 1: using flow.cumprod with SBP:
Case 2: using SBP without flow.cumprod:
Case 3: using flow.cumprod without SBP: