2d sbp data tensor parallel training bug #7134

Closed
L1aoXingyu opened this issue Dec 29, 2021 · 3 comments

@L1aoXingyu
Contributor

Summary

Training a model with 2D SBP (data parallel plus tensor parallel) raises an error during graph building:

File "/workspace/oneflow/python/oneflow/nn/graph/graph.py", line 576, in _build_graph
    oneflow._oneflow_internal.CurJobBuildAndInferCtx_Complete()
IndexError: vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)

Code to reproduce bug

The error can be reproduced with this repo and branch: https://github.com/Oneflow-Inc/libai/tree/dev_lxy_bert_profile

System Information

version: 0.6.0+cu111.git.f27860650
git_commit: 120ecadf3
cmake_build_type: Release
rdma: False
mlir: False
@L1aoXingyu L1aoXingyu added the bug label Dec 29, 2021
@strint
Contributor

strint commented Dec 29, 2021

#7032 introduced a bug. @jackalcooper, could you please take a look?

@jackalcooper
Collaborator

Code to reproduce bug

Use this repo and branch https://github.com/Oneflow-Inc/libai/tree/dev_lxy_bert_profile can reproduce the errors.

  • Could you provide a minimal Python script that reproduces the bug?
  • Or run it under gdb and post the stack trace?

@leaves-zwx
Contributor

leaves-zwx commented Dec 29, 2021

I found the cause of this bug. It is at this line, which was introduced by #7077:

got_input_sbp_ss << SbpToString(nd_sbp.sbp_parallel(i));

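The problem is that nd_sbp has only 1 dimension while parallel_hierarchy has 2, so indexing nd_sbp.sbp_parallel(i) with i = 1 goes out of range. This is consistent with the reported IndexError (index 1 against size 1).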

I'll submit a pull request to fix this problem.
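
To make the mismatch concrete, here is a minimal pure-Python sketch of the failure mode described above. It does not use OneFlow and is not the actual OneFlow code or the actual fix. The function names, the list-of-strings representation of nd_sbp, and the assumption that the loop bound comes from parallel_hierarchy are all hypothetical, chosen only to illustrate why a 1-D nd_sbp paired with a 2-D hierarchy goes out of range.

```python
# Hypothetical stand-ins: nd_sbp is modeled as a list of SBP strings,
# parallel_hierarchy as a list of device-mesh dimension sizes.

def format_nd_sbp_unchecked(nd_sbp, parallel_hierarchy):
    """Assumed failure mode: the loop bound comes from parallel_hierarchy,
    but the index is applied to nd_sbp."""
    parts = []
    for i in range(len(parallel_hierarchy)):
        # Raises IndexError when len(nd_sbp) < len(parallel_hierarchy).
        parts.append(nd_sbp[i])
    return "(" + ", ".join(parts) + ")"


def format_nd_sbp_checked(nd_sbp, parallel_hierarchy):
    """One possible guard (an assumption, not the real patch):
    iterate over nd_sbp's own length instead of the hierarchy's."""
    parts = [nd_sbp[i] for i in range(len(nd_sbp))]
    return "(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    nd_sbp = ["S(0)"]            # 1-D nd_sbp
    parallel_hierarchy = [2, 2]  # 2-D hierarchy

    try:
        format_nd_sbp_unchecked(nd_sbp, parallel_hierarchy)
    except IndexError as exc:
        print("unchecked version failed:", exc)

    print("checked version:", format_nd_sbp_checked(nd_sbp, parallel_hierarchy))
```

When both have the same number of dimensions, the two versions behave identically. They only diverge in the mismatched case reported here.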
