add exe graph physical shape check msg (#8002)
* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b7, reversing
changes made to baadf60.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: guo ran <360112263@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
20 people authored Jun 13, 2022
1 parent 3b42b2f commit ba56c84
Showing 2 changed files with 16 additions and 7 deletions.
5 changes: 4 additions & 1 deletion oneflow/core/common/error_util.cpp
@@ -16,6 +16,7 @@ limitations under the License.
 #include <sstream>
 #include "oneflow/core/common/error_util.h"
 #include "oneflow/core/common/util.h"
+#include "oneflow/core/job/graph_scope_vars.h"

 namespace oneflow {

@@ -97,7 +98,9 @@ std::string FormatFunctionOfStackFrame(const std::string& function) {

 // msg in stack frame
 Maybe<std::string> FormatMsgOfStackFrame(std::string error_msg, bool is_last_stack_frame) {
-  if (!is_last_stack_frame) { error_msg = *JUST(ShortenMsg(error_msg)); }
+  const bool debug_mode = GetGraphDebugMode();
+  // only shorten the message if it is not the last stack frame AND not in debug mode
+  if (!is_last_stack_frame && !debug_mode) { error_msg = *JUST(ShortenMsg(error_msg)); }
   // error_msg of last stack frame come from "<<"
   if (is_last_stack_frame) { error_msg = StripSpace(error_msg); }
   std::stringstream ss;
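
The change above gates message shortening on two conditions: the frame is not the last one, and graph debug mode is off. The following is a minimal standalone sketch of that gating, not OneFlow's code. `g_debug_mode`, `DebugModeEnabled`, `ShortenForDisplay`, and `FormatFrameMsg` are hypothetical names introduced for illustration, and the head-and-tail truncation rule is an assumption, since the diff does not show what `ShortenMsg` actually does.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for GetGraphDebugMode(); the real function is declared in
// oneflow/core/job/graph_scope_vars.h and is not shown in this diff.
static bool g_debug_mode = false;
bool DebugModeEnabled() { return g_debug_mode; }

// Hypothetical shortener: keeps the head and tail of a long message.
// This truncation rule is an assumption; the real ShortenMsg may behave differently.
std::string ShortenForDisplay(const std::string& msg, std::size_t max_len = 20) {
  if (msg.size() <= max_len) { return msg; }
  const std::size_t keep = (max_len - 5) / 2;  // 5 characters are reserved for " ... "
  return msg.substr(0, keep) + " ... " + msg.substr(msg.size() - keep);
}

// Shorten only intermediate frames, and only when debug mode is off.
std::string FormatFrameMsg(std::string msg, bool is_last_frame) {
  if (!is_last_frame && !DebugModeEnabled()) { msg = ShortenForDisplay(msg); }
  return msg;
}
```

With debug mode off, a long message on an intermediate frame is truncated while the last frame keeps its full text; with debug mode on, every frame keeps its full text.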
18 changes: 12 additions & 6 deletions oneflow/core/graph/exec_graph.cpp
@@ -14,6 +14,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 */
 #include "oneflow/core/graph/exec_graph.h"
+#include <sstream>
+#include "oneflow/core/common/just.h"
 #include "oneflow/core/graph/op_graph.h"

 namespace oneflow {
@@ -92,9 +94,10 @@ Maybe<void> CheckPhysicalBlobDesc(
       continue;
     }
     if (*JUST(op.GetParallelDesc4BnInOp(bn)) == *op_parallel_desc) {
-      JUST(CheckPhysicalBlobDesc(*JUST(GetLogicalBlobDesc(bn)),
-                                 nd_sbp_signature->bn_in_op2nd_sbp().at(bn), *op_parallel_desc,
-                                 parallel_ctx, *physical_blob_desc));
+      JUST_MSG(CheckPhysicalBlobDesc(*JUST(GetLogicalBlobDesc(bn)),
+                                     nd_sbp_signature->bn_in_op2nd_sbp().at(bn), *op_parallel_desc,
+                                     parallel_ctx, *physical_blob_desc),
+               std::stringstream() << " check physical shape failed, op name " << op.op_loc());
     }
   }
   return Maybe<void>::Ok();
@@ -114,15 +117,18 @@ void ExecNode::InferBlobDescs(const ParallelContext* parallel_ctx) {
         std::bind(&Operator::GetLogicalBlobDesc4Ibn, op().get(), std::placeholders::_1),
         nd_sbp_signature, parallel_ctx, GetBlobDesc4BnInOp));
   }
-  CHECK_JUST(op_->InferBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx, &GlobalJobDesc()));
+  CHECK_JUST_MSG(op_->InferBlobDescsIf(GetBlobDesc4BnInOp, parallel_ctx, &GlobalJobDesc()),
+                 std::stringstream() << " infer blob descs if failed, op name " << op_->op_loc());
   if (op_node != nullptr && parallel_ctx->parallel_num() > 1 && nd_sbp_signature != nullptr) {
     CHECK_JUST(CheckPhysicalBlobDesc(
         *op(), op()->output_bns(),
         std::bind(&Operator::GetLogicalBlobDesc4Obn, op().get(), std::placeholders::_1),
         nd_sbp_signature, parallel_ctx, GetBlobDesc4BnInOp));
   }
-  CHECK_JUST(op_->InferInplaceObn2IbnIf(&mut_inplace_obn2ibn_, &con_inplace_obn2ibn_,
-                                        GetBlobDesc4BnInOp, parallel_ctx));
+  CHECK_JUST_MSG(op_->InferInplaceObn2IbnIf(&mut_inplace_obn2ibn_, &con_inplace_obn2ibn_,
+                                            GetBlobDesc4BnInOp, parallel_ctx),
+                 std::stringstream()
+                     << " infer inplace obn to ibn if failed, op name " << op_->op_loc());
 }

 std::function<BlobDesc*(const std::string&)> ExecNode::GetBlobDesc4BnInOpFunc() const {
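
The `JUST_MSG` and `CHECK_JUST_MSG` call sites in this file attach an extra message (including the op location) to a failure. The general pattern can be sketched as below. This is not OneFlow's `Maybe` type or its macros, whose definitions are not part of this diff: `Status`, `WithContext`, `RETURN_IF_ERROR_MSG`, `CheckShape`, and `CheckOp` are hypothetical names, and the location string passed in is an assumed value.

```cpp
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical minimal result type: an empty optional means success,
// otherwise it holds the error text.
struct Status {
  std::optional<std::string> error;
  bool ok() const { return !error.has_value(); }
  static Status Ok() { return Status{}; }
  static Status Fail(std::string msg) { return Status{std::move(msg)}; }
};

// Appends caller-side context to a failing status; a success passes through unchanged.
Status WithContext(Status s, const std::string& context) {
  if (!s.ok()) { s.error = *s.error + context; }
  return s;
}

// Hypothetical macro: on failure, builds the context string and returns early.
// The stream expression is evaluated only on the failure path.
#define RETURN_IF_ERROR_MSG(expr, stream_expr)            \
  do {                                                    \
    Status status_ = (expr);                              \
    if (!status_.ok()) {                                  \
      std::ostringstream oss_;                            \
      oss_ << stream_expr;                                \
      return WithContext(std::move(status_), oss_.str()); \
    }                                                     \
  } while (0)

// Inner check that knows nothing about which op it is validating.
Status CheckShape(int expected, int actual) {
  if (expected != actual) { return Status::Fail("shape mismatch"); }
  return Status::Ok();
}

// Outer caller that knows the op location and attaches it to any failure.
Status CheckOp(const std::string& op_loc, int expected, int actual) {
  RETURN_IF_ERROR_MSG(CheckShape(expected, actual),
                      " check physical shape failed, op name " << op_loc);
  return Status::Ok();
}
```

The design point is that the low-level check stays generic, while the caller, which knows which op is being processed, supplies the context at the point where the error propagates.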