Support broadcast in fused_softmax kernel #8321

Merged
merged 14 commits into master from support_broadcast_softmax_fused_kernel
Jun 20, 2022

Conversation

MARD1NO
Contributor

@MARD1NO MARD1NO commented May 27, 2022

No description provided.

@chengtbf chengtbf marked this pull request as ready for review June 17, 2022 10:47
@@ -1288,7 +1288,7 @@ template<typename LOAD_Y, typename LOAD_DY, typename STORE, typename ComputeType
 inline typename std::enable_if<!std::is_same<ComputeType, double>::value, cudaError_t>::type
 DispatchSoftmaxGrad(cudaStream_t stream, LOAD_Y load_y, LOAD_DY load_dy, STORE store,
                     const int64_t rows, const int64_t cols) {
   if (cols <= 1024) {
Contributor

This probably doesn't need to change, does it? As I recall, changing the backward pass made it slightly slower than before.

a4ee3b1

I only changed the forward pass here. This is the measured result.
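
For readers without the full diff: the PR has no description, so the following is only a CPU reference sketch of what "broadcast" in a fused scale-mask-softmax forward pass could mean. It is not OneFlow's kernel. The function name `BroadcastMaskedSoftmax`, the parameters `mask_rows`, `scale`, and `fill`, and the row-modulo broadcast rule are all assumptions made for illustration. The rule covers the case where the mask is shared across leading dimensions, for example an input flattened from `[B, H, S, S]` and a mask flattened from `[1, 1, S, S]`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical CPU reference, not OneFlow code.
// x is row-major with shape [rows, cols]; mask is row-major with shape
// [mask_rows, cols], where rows is assumed to be a multiple of mask_rows.
// Row r of x reuses mask row (r % mask_rows), i.e. the mask is broadcast
// over the leading dimensions of x.
std::vector<float> BroadcastMaskedSoftmax(const std::vector<float>& x,
                                          const std::vector<bool>& mask,
                                          int64_t rows, int64_t cols,
                                          int64_t mask_rows, float scale,
                                          float fill) {
  std::vector<float> y(x.size());
  std::vector<float> buf(cols);
  for (int64_t r = 0; r < rows; ++r) {
    const int64_t mr = r % mask_rows;  // broadcast: map input row to mask row
    float row_max = -std::numeric_limits<float>::infinity();
    // Fused step 1: scale kept elements, replace masked-out ones with `fill`.
    for (int64_t c = 0; c < cols; ++c) {
      buf[c] = mask[mr * cols + c] ? x[r * cols + c] * scale : fill;
      row_max = std::max(row_max, buf[c]);
    }
    // Fused step 2: numerically stable softmax over the row.
    float sum = 0.0f;
    for (int64_t c = 0; c < cols; ++c) {
      buf[c] = std::exp(buf[c] - row_max);
      sum += buf[c];
    }
    for (int64_t c = 0; c < cols; ++c) { y[r * cols + c] = buf[c] / sum; }
  }
  return y;
}
```

A finite `fill` such as `-10000.0f` keeps a fully masked row well defined (it becomes a uniform distribution), whereas negative infinity would produce NaN for such a row. Without broadcasting, the caller would have to materialize a full `[rows, cols]` mask before calling the kernel.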

@chengtbf chengtbf requested review from leaves-zwx and strint June 18, 2022 12:25
@chengtbf chengtbf requested a review from oneflow-ci-bot June 19, 2022 02:07
@github-actions
Contributor

CI failed when running job: Build cpu. PR label automerge has been removed

@github-actions
Contributor

Static analysis with clang failed. PR label automerge has been removed

@github-actions
Contributor

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 130.1ms (= 13011.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.2ms (= 14221.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 142.2ms / 130.1ms)

OneFlow resnet50 time: 76.2ms (= 7616.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.4ms (= 8639.1ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 86.4ms / 76.2ms)

OneFlow resnet50 time: 54.2ms (= 10847.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.6ms (= 11927.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.10 (= 59.6ms / 54.2ms)

OneFlow resnet50 time: 43.4ms (= 8672.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.0ms (= 8399.7ms / 200, input_shape=[2, 3, 224, 224])
❌ Relative speed: 0.97 (= 42.0ms / 43.4ms)

OneFlow resnet50 time: 39.1ms (= 7820.4ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 41.5ms (= 8304.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.06 (= 41.5ms / 39.1ms)

OneFlow swin dataloader time: 0.247s (= 49.335s / 200, num_workers=1)
PyTorch swin dataloader time: 0.151s (= 30.272s / 200, num_workers=1)
Relative speed: 0.614 (= 0.151s / 0.247s)

OneFlow swin dataloader time: 0.067s (= 13.377s / 200, num_workers=4)
PyTorch swin dataloader time: 0.044s (= 8.761s / 200, num_workers=4)
Relative speed: 0.655 (= 0.044s / 0.067s)

OneFlow swin dataloader time: 0.037s (= 7.422s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.471s / 200, num_workers=8)
Relative speed: 0.602 (= 0.022s / 0.037s)

❌ OneFlow resnet50 time: 147.0ms (= 14697.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 178.1ms (= 17808.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 178.1ms / 147.0ms)

OneFlow resnet50 time: 96.3ms (= 9634.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.0ms (= 11295.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 113.0ms / 96.3ms)

OneFlow resnet50 time: 72.6ms (= 14517.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.7ms (= 17540.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 87.7ms / 72.6ms)

OneFlow resnet50 time: 59.0ms (= 11794.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.0ms (= 14991.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 75.0ms / 59.0ms)

OneFlow resnet50 time: 53.2ms (= 10633.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.4ms (= 13674.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.29 (= 68.4ms / 53.2ms)

@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8321/

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 19, 2022 11:43
@github-actions
Contributor

CI failed when running job: cuda-module. PR label automerge has been removed

@MARD1NO MARD1NO requested a review from daquexian as a code owner June 20, 2022 01:13
@MARD1NO MARD1NO requested review from oneflow-ci-bot and removed request for oneflow-ci-bot June 20, 2022 01:14
@mergify mergify bot merged commit 5d74efa into master Jun 20, 2022
@mergify mergify bot deleted the support_broadcast_softmax_fused_kernel branch June 20, 2022 04:41
Yipeng1994 added a commit that referenced this pull request Jun 23, 2022
* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* warp dim util (#8382)

* warp dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input(jobname make it  it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py )

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* pulish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and boardcast_addOp conversion

* add matmulOp conversion

* support converting normailzation op to tosa(partically)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add indentity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename inferface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b7, reversing
changes made to baadf60.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: guo ran <360112263@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Fix eval error in FusedMLP (#8413)

Fix eval error

* Init NCCL communicator in graph mode unifiedly (#8263)

* centralized comm init

* address review

* revert

* rename

* ref nccl logical send recv

* fix cpu only

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix dim_scatter 0-dim tensor bug (#8418)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* target based external libraries (#8421)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine hardcoded attr setting/getting in ir (#8420)

* use names in trait static func

* more changes on op name attr

* use wrapped func

* Replace cu115 with cu116 in nightly (#8423)

update workflows

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix repeat interleave 0-size tensor bug (#8414)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Autotest support print input in ci (#8383)

* support print tensor value in autotest to provide more details in ci

* revert

* refine

* auto format by CI

* control precision to 1e-5 when record

* fix bug

* auto format by CI

* relax tensor_size_mb

* fix bug

* fix bug

* refine

* releax

* refinew

* refine

* fix bug

* relax

* refine

* restruct

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Modify sbp.split()'s karg: axis to dim (#8411)

* Modify sbp.split()'s axis karg to dim

* Refine

* Refine

* Refine

* Refine

* Feat/graph logical op debug repr (#8131)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* add module config

* save nn.Module info in job.proto for better debugging

* add new line

* add ModuleBlock.ops_proto() API

* zero use stage 2

* print operators' info when print ModuleBlock

* handle VariableOpConf

* update

* update

* fix

* move operators repr method to graph util

* add limit consumer api

* add new api

* refine zero s select

* add module block

* fix

* refact for rm op in module conf

* fix

* add sbp debug

* add sbp repr

* add shape

* refine

* add sys op in repr

* add full op debug

* fix index out of range

* rm zero limit on device type

* add no scope op to graph

* zero test with activation checkpointing

* fix order

* add indentity when dp sequence len is 1

* add debug repr

* refine repr of op

* refine and fix

* rm useless log

* move to base with master

* fix

* fix

* fix

* fix proto

* refine test

* fix type

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* spit long func

* refine

* restore test

* refine pass and mem debug

* merge master

* repr dtype

* add placement

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

* fix merge

* auto format by CI

* auto format by CI

* refine get job api

* refine graph util import order

* auto format by CI

* fix static check

* auto format by CI

* fix special case

* refine level print and add full dtype repr

* rm useless

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: Cijie Xia <xiacijie1998@163.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rm some test case in test_fused_dot_feature_interaction_pooling_sum (#8425)

rm some case in test

* Remove unused linkages (#8426)

remove unused linkages

* refactor stride (#8402)

* Stride inherits DimVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix argument type of OFStrideToNumpyStride

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Move Tensor.__setitem__  and global related api to Python/C api (#8375)

* add local_to_global, global_to_global, to_global. global_to_global still has bugs

* fix bug of global_to_global

* remove python api

* add setitem

* remove local_to_global sbp pack, format code

* format code

* remove redundant code

* add error msg, refine check of to_global

* fix bug of check

* add error msg

* fix clang static check error

* remove useless api in tensor.py, remove redundant code, remove useless CHECK

* add to_local

* fix wrong exception type in unittest for to_local exception message

* cuda add default error msg (#8427)

default error

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Refactor ShapeView (#8422)

* update

Signed-off-by: daquexian <daquexian566@gmail.com>

* update and add docs

Signed-off-by: daquexian <daquexian566@gmail.com>

* turn on view slice (#8302)

* turn_on_view_slice

* inplace scalar math handle non-contiguous input

* fix clang check

* add docs

* refactor

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Add flow env init rdma api (#8415)

* add_flow_env_init_rdma_api

* adjust persistent_workers logic for RDMA support

* adjust persistent_workers logic for RDMA support

* add rdma_inited api

* minor fix

* add docs

* Update python/oneflow/utils/data/dataloader.py

Co-authored-by: daquexian <daquexian566@gmail.com>

* fix typo

* refine

* fix RDMAIsInitialized

* minor fix

* refine

* rename InitRdma to InitRDMA

* refine

Co-authored-by: Flowingsun007 <flowingsun007@163.com>
Co-authored-by: daquexian <daquexian566@gmail.com>

* add 1d send recv in nccl logical (#8355)

* add 1d send recv in nccl logical

* Update insert_nccl_logical_op_pass.cpp

* auto format by CI

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support iree ci (#8419)

* create mlir cpu and modify build gcc 7 shell script

* fix the bug of test_iree_resnet.py cuda test in cpu version error

* fix constant folding tests

* support oneflow_test_cpu_only

* pub

* build script add flag

* modify test yml

* add python3 into $PATH

* don't use pretrain model

* install flowvision

Co-authored-by: mosout <mosout@qq.com>
Co-authored-by: jackalcooper <jackalcooper@gmail.com>

* Feat straighten task nodes (#8347)

* Add a fast topological traversal

* Add an initial implementation of straighten nodes

* Add the straighten nodes algorithm

* Change algorithm structure

* Remove some debug information

* Finalize the straighten algorithm after
deciding the parameters by experiments

* Notify the usage of straighten algorithm

* Of format

* Update oneflow/core/graph/straighten_nodes.cpp

Of format

Co-authored-by: daquexian <daquexian566@gmail.com>

* Of format

* Stop using visual string before we find a better key

* Remove magic numbers and Of format

* Remove starts

* Of format

* Fix a bug of using GetMaxVal<int32_t>() as an
initial number for comparing

* Refactor add straighten algo interface (#8435)

* feat(*): export straighten nodes algorithm interface

* export documentation

* Update python/oneflow/nn/graph/graph_config.py

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

* Use TopoForEachNodeFast as default. (#8436)

* Use TopoForEachNodeFast as default.
Rename the original one as TopoForEachNodeDynamic

* Speed up TopoForEachNodeFast when traversing a subgraph

* Rename the switch and code clean up

* Hide the class TopoStruct

* Hide all the other functions

* Grammar

* Of format

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Refactor NLLLoss to support split class dim (#8380)

* refactor

* RuntimeError

* avoid atomic add

* test

* fixes

* update test

* update test

* update test

* fix kernel

* improve backward

* update test

* out_weight to be required

* address static analysis error

* fix static analysis error

* fix static analysis error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Strict ordering in memory reuse algorithm (#8441)

* Support broadcast in fused_softmax kernel (#8321)

* support broadcast

* refine

* Remove shape check

* fix sbp when broadcast

* rollback softmax grad threshold

* increase threshold of test conv bn folding

* tol to 1e-2

* check error msg of fuse softmax ops

* add more dispatch

* remove double datatype test and add broadcast test

Co-authored-by: cheng cheng <472491134@qq.com>

* Merge slice and logical slice (#8416)

* remove Slice, SliceUpdate, SliceGrad op

* rename logical_slice to slice and logical_slice_assign to slice_update

* move gradient_func logical_slice.cpp to slice.cpp

* fix some bug and refine local test

* feat(SliceUpdate): support 0size tensor

* test(Slice): refine consistent slice test

* test(SliceUpdate): refine consistent slice_update test

* not export slice_update's inplace parameter

* auto format by CI

* recover slice_grad_op

* fix slice_view bug

* add error message and attr judgement

* modified old test

* auto format by CI

* update test README

* update tensor_string code

* fix test bug

* auto format by CI

* fix(hsplit): hsplit functor bug

* fix vsplit doc test bug

* refine

* fix test

* fix pin_memory bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Graph block.config.set_stage() for recommended Pipeline api. (#8442)

* Graph block.config.set_stage() for recommended Pipeline api.

* revert diff

* refine api doc

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Update PolynomialLR's doc and parameter (#8430)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* * update PolynomialLR doc, current_batch = min(decay_batch, current_batch)
* rename the steps to decay_batch in parameters

* update PolynomialLR test case

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
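
The clamping rule noted in the PolynomialLR entry above, `current_batch = min(decay_batch, current_batch)`, can be illustrated with a minimal sketch. The function name `polynomial_lr`, the parameters `base_lr`, `end_lr`, and `power`, and the decay formula itself are assumptions based on the usual polynomial-decay schedule, not taken from the OneFlow source; only `decay_batch` and the `min` clamp come from the commit message.

```python
def polynomial_lr(base_lr, end_lr, current_batch, decay_batch, power=1.0):
    """Hypothetical polynomial decay; not the OneFlow implementation."""
    # Clamp as described in the commit message: once decay_batch is
    # reached, the learning rate stays at end_lr.
    current_batch = min(decay_batch, current_batch)
    # Assumed standard polynomial decay from base_lr down to end_lr.
    factor = (1.0 - current_batch / decay_batch) ** power
    return (base_lr - end_lr) * factor + end_lr


if __name__ == "__main__":
    for step in (0, 50, 100, 200):
        print(step, polynomial_lr(0.1, 0.0, step, decay_batch=100))
```

Without the clamp, any step past `decay_batch` would make the base of the power negative, so the rate would no longer settle at `end_lr`.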

* Add mv op (#8445)

* add mv op with bug that Int is incompatible

* add test

* update test_mv.py

* fix based on comments

* fix based on comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable oneflow_iree(python package) and corresponding test works in ci (#8431)

* update test.yml

* add pytest for oneflow_iree examples

* add oneflow frontend test

* Dev tensor is pinned api (#8447)

* support tensor.is_pinned

* add test case

* add docs

* auto format by CI

* refine

* auto format by CI

* refine

* auto format by CI

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Nd sbp tensor str (#8458)

* nd sbp tensor str

* add nd sbp tensor str test

* bigger input size

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Patch sbp cost (#8378)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Add the slight penalty for eager

* Consider B -> (B, B) for a scalar

* Do not consider parallel description in priority ratio

* Of format

* Fix a bug in the old version group boxing with 2D SBP (#8448)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Fix bugs of patch-sbp_cost (#8456)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Reduce to uniform B for 1 device.
Use the actual parallel description for each tensor

* Fix a bug of fix-group_boxing-bug

* Group boxing reduce [2, 2]: (S0, S0) to [4]: S0,
then we might infer a 1D SBP from a 2D SBP hint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* Decouple stream and instruction (#7607)

* remove deprecated python api

* backup code

* backup code

* fix compiler complaints

* fix typo in refactoring

* kMockDevice

* add unit test test_mock.py

* revert mock kernels

* revert DEVICE_TYPE_SEQ

* mock placement

* address pr comments

* register device kCriticalSectionDevice and kLazyJobLauncher

* kControlDevice

* Stream::vm_stream_

* fix compiler complaints

* backup code

* rename StreamIsTransport to IsCommNetStream

* decouple vm::StreamType and vm::InstructionType

* fix compiler complaints

* remove 'gpu' related code

* address static analyzer complaints

* address static analyzer complaints

* remove unused module in test_mock.py

* the Env is never destroyed.

* export Env into python

* more unittests

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* fix oneflow.placement.__str__

* revert GlobalSync

* init_producer_stream in oneflow.from_numpy

* debug code for vm

* init disable_vm_threads_ in VirtualMachine::VirtualMachine

* Update oneflow/core/vm/virtual_machine.h

Co-authored-by: daquexian <daquexian566@gmail.com>

* create stream in forked subprocesses.

* refactor StreamRoleSwitch to StreamRoleVisitor

* ThreadLocalGuard

* auto format by CI

* fix compiler complaints

* fix static analyzer complaints

* VirtualMachine::GetVmStream

* fix static analyzer complaints

* reimplement AddAndReadVector by std::deque

* reimplement AddAndReadVector

* merge master

* increase atol for test_consistent_rnn_cell.py

* StreamRole::AsyncLaunchedCommNet is bound to EventRecordedCudaStreamType

* auto format by CI

* remove StreamRoleVisitor<T>::VisitInvalid

* no copy in AddAndReadVector

* fix bug of AddAndReadVector::size_

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix AddAndReadVector::GetGranularity

* remove bad unittest

* auto format by CI

* rename CallInstructionType to OpCallInstructionType

* static variable  GlobalSingletonPtr is a unique_ptr

* replace ++atomic_cnt with atomic_cnt.fetch_add(1, std::memory_order_relaxed)

* AddAndReadVector::operator[]

* change comments 'lock free' to 'thread safe'

* rename StatefulLocalOpKernel to StatefulOpKernel

* rename VirtualMachine::vm_ to VirtualMachine::engine_

* mark VirtualMachine::NoMoreErasedInstructions private

* mark VirtualMachine::FindOrCreateScheduleLocalDepObject private

* remove unused version of VirtualMachineEngine::Receive

* rename argname for VirtualMachineEngine::Receive

* rename unused PendingInstructionList

* rename AddAndReadVector to SteadyVector

* optimize SteadyVector::operator[] by __builtin_clzll

* refactor SteadyVector::granularity2vector_ to SteadyVector::granularity2data_

* reduce usage of steady_vector::size_

* rename unused anonymous namespace

* greater atol for test_consistent_tensordot.py

* fix BarrierInstructionType::ComputeInFuseMode

* revert container_util.h

* run AccessBlobByCallback in default stream of tensor->device

* resolve static check

* resolve static check

* SteadyVector::MutableOrAdd

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: binbinHan <han_binbin@163.com>
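
The SteadyVector entries above (`granularity2data_`, `operator[]` optimized with `__builtin_clzll`) suggest a container made of power-of-two buckets whose elements never move once added. The following is only a sketch of that idea under stated assumptions: the class `SteadyVectorSketch` and its layout are hypothetical, and Python's `int.bit_length()` stands in for the count-leading-zeros builtin. It is not the OneFlow implementation.

```python
class SteadyVectorSketch:
    """Hypothetical append-only vector: bucket g holds 2**g elements."""

    def __init__(self):
        self._buckets = []  # assumed analogue of granularity2data_
        self._size = 0

    @staticmethod
    def _locate(index):
        # The highest set bit of (index + 1) selects the bucket; in C++
        # this could be computed as 63 - __builtin_clzll(index + 1).
        granularity = (index + 1).bit_length() - 1
        offset = index + 1 - (1 << granularity)
        return granularity, offset

    def append(self, value):
        granularity, offset = self._locate(self._size)
        if granularity == len(self._buckets):
            # A new bucket is allocated; existing buckets are untouched,
            # so previously stored elements keep their location.
            self._buckets.append([None] * (1 << granularity))
        self._buckets[granularity][offset] = value
        self._size += 1

    def __getitem__(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)
        granularity, offset = self._locate(index)
        return self._buckets[granularity][offset]

    def __len__(self):
        return self._size


if __name__ == "__main__":
    vec = SteadyVectorSketch()
    for i in range(10):
        vec.append(i * i)
    print(len(vec), vec[0], vec[9])
```

Because growth only ever adds a bucket, the bucket lookup costs one bit operation and no element is copied on growth, which is presumably what makes the structure "steady".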

* fix_tensor_numpy_to_avoid_gpu_mem_increase (#8449)

* fix_tensor_numpy_to_avoid_gpu_mem_increase

* Update tensor.py

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Rename user op tensor shape to shape view (#8433)

* ThreadLocalGuard

* rename user_op::Tensor::shape to user_op::Tensor::shape_view

* auto format by CI

* fix static analyzer complaints

* more verbose code for HobDataType

* larger timeout

* larger timeout

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: jackalcooper <jackalcooper@gmail.com>
Co-authored-by: binbinHan <han_binbin@163.com>

* speedup global test (#8468)

* speedup global test

* Test refine slice ops test (#8471)

* refine consistent_slice test from 112s -> 30s in 4 device

* test(SliceUpdate): refine test from 119s -> 28s in 4 device

* delete useless code

* auto format by CI

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: wyg1997 <wangyinggang@foxmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Set the minimum mtu value for IB communication connection (#8451)

* Set the minimum mtu value for IB communication connection

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Merge branch 'master' into feat-general_basic_communication

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: ZZK <359521840@qq.com>
Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Yao Zihang <1162526220@qq.com>
Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>
Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: guo ran <360112263@qq.com>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: leaves-zwx <kunta0932@gmail.com>
Co-authored-by: Li Xiang <54010254+lixiang007666@users.noreply.github.com>
Co-authored-by: Cijie Xia <xiacijie1998@163.com>
Co-authored-by: Jia <basicv8vc@gmail.com>
Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: wyg1997 <wangyinggang@foxmail.com>
Co-authored-by: Yu OuYang <xuanjiuye@gmail.com>
mergify bot added a commit that referenced this pull request Jul 25, 2022
* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: guo ran <360112263@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* nccl send/recv support different placement

* refine

* auto format by CI

* rm out ctrl

* auto format by CI

Co-authored-by: guo-ran <360112263@qq.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: ZZK <359521840@qq.com>
Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Yao Zihang <1162526220@qq.com>
Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>
Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>

* Support different hierarchy

* Merge branch 'master' into feat-general_basic_communication (#8477)

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable multiple using nn graph with different input (jobname make it ...)

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* Revert "Merge branch 'master' into fea/graph_check_msg"

This reverts commit 28833b73a8041463e5e3d130784be386ee248bd8, reversing
changes made to baadf6045f2fce69c090e442a755229c1c949773.

* Revert "Revert "Merge branch 'master' into fea/graph_check_msg""

This reverts commit 1d5e196d8530ffd2b9bf781abcf168b94ff9ca41.

* update

* resolve conflicts

* resolve conflicts

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: guo ran <360112263@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: Peihong Liu <mosout@qq.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: liufengwei0103 <2472937968@qq.com>
Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: Shijie <821898965@qq.com>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* add batch_matmul sbp (#8385)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* suppress gcc11 false positive warning (#8401)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix variable op conversion to tosa error in ninja c1 (#8412)

* pub

* move test iree resnet python script to oneflow_iree repo

* add bracket

* rename const_val to const_val_ and restore resnet.py test script

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Fix eval error in FusedMLP (#8413)

Fix eval error

* Init NCCL communicator in graph mode unifiedly (#8263)

* centralized comm init

* address review

* revert

* rename

* ref nccl logical send recv

* fix cpu only

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix dim_scatter 0-dim tensor bug (#8418)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* target based external libraries (#8421)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine hardcoded attr setting/getting in ir (#8420)

* use names in trait static func

* more changes on op name attr

* use wrapped func

* Replace cu115 with cu116 in nightly (#8423)

update workflows

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix repeat interleave 0-size tensor bug (#8414)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Autotest support print input in ci (#8383)

* support print tensor value in autotest to provide more details in ci

* revert

* refine

* auto format by CI

* control precision to 1e-5 when record

* fix bug

* auto format by CI

* relax tensor_size_mb

* fix bug

* fix bug

* refine

* relax

* refine

* refine

* fix bug

* relax

* refine

* restruct

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Modify sbp.split()'s kwarg: axis to dim (#8411)

* Modify sbp.split()'s axis kwarg to dim

* Refine

* Refine

* Refine

* Refine

* Feat/graph logical op debug repr (#8131)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* add module config

* save nn.Module info in job.proto for better debugging

* add new line

* add ModuleBlock.ops_proto() API

* zero use stage 2

* print operators' info when print ModuleBlock

* handle VariableOpConf

* update

* update

* fix

* move operators repr method to graph util

* add limit consumer api

* add new api

* refine zero s select

* add module block

* fix

* refact for rm op in module conf

* fix

* add sbp debug

* add sbp repr

* add shape

* refine

* add sys op in repr

* add full op debug

* fix index out of range

* rm zero limit on device type

* add no scope op to graph

* zero test with activation checkpointing

* fix order

* add identity when dp sequence len is 1

* add debug repr

* refine repr of op

* refine and fix

* rm useless log

* move to base with master

* fix

* fix

* fix

* fix proto

* refine test

* fix type

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* refine

* restore test

* refine pass and mem debug

* merge master

* repr dtype

* add placement

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

* fix merge

* auto format by CI

* auto format by CI

* refine get job api

* refine graph util import order

* auto format by CI

* fix static check

* auto format by CI

* fix special case

* refine level print and add full dtype repr

* rm useless

Co-authored-by: Cijie Xia <cijie.xia@mail.utoronto.ca>
Co-authored-by: Cijie Xia <xiacijie1998@163.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rm some test case in test_fused_dot_feature_interaction_pooling_sum (#8425)

rm some case in test

* Remove unused linkages (#8426)

remove unused linkages

* refactor stride (#8402)

* Stride inherits DimVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix argument type of OFStrideToNumpyStride

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Move Tensor.__setitem__  and global related api to Python/C api (#8375)

* add local_to_global, global_to_global, to_global. global_to_global still has bugs

* fix bug of global_to_global

* remove python api

* add setitem

* remove local_to_global sbp pack, format code

* format code

* remove redundant code

* add error msg, refine check of to_global

* fix bug of check

* add error msg

* fix clang static check error

* remove useless api in tensor.py, remove redundant code, remove useless CHECK

* add to_local

* fix wrong exception type in unittest for to_local exception message

* cuda add default error msg (#8427)

default error

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Refactor ShapeView (#8422)

* update

Signed-off-by: daquexian <daquexian566@gmail.com>

* update and add docs

Signed-off-by: daquexian <daquexian566@gmail.com>

* turn on view slice (#8302)

* turn_on_view_slice

* inplace scalar math handle non-contiguous input

* fix clang check

* add docs

* refactor

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Add flow env init rdma api (#8415)

* add_flow_env_init_rdma_api

* adjust persistent_workers logic for RDMA support

* adjust persistent_workers logic for RDMA support

* add rdma_inited api

* minor fix

* add docs

* Update python/oneflow/utils/data/dataloader.py

Co-authored-by: daquexian <daquexian566@gmail.com>

* fix typo

* refine

* fix RDMAIsInitialized

* minor fix

* refine

* rename InitRdma to InitRDMA

* refine

Co-authored-by: Flowingsun007 <flowingsun007@163.com>
Co-authored-by: daquexian <daquexian566@gmail.com>

* add 1d send recv in nccl logical (#8355)

* add 1d send recv in nccl logical

* Update insert_nccl_logical_op_pass.cpp

* auto format by CI

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support iree ci (#8419)

* create mlir cpu and modify build gcc 7 shell script

* fix the bug of test_iree_resnet.py cuda test in cpu version error

* fix constant folding tests

* support oneflow_test_cpu_only

* pub

* build script add flag

* modify test yml

* add python3 into $PATH

* don't use pretrain model

* install flowvision

Co-authored-by: mosout <mosout@qq.com>
Co-authored-by: jackalcooper <jackalcooper@gmail.com>

* Feat straighten task nodes (#8347)

* Add a fast topological traversal

* Add an initial implementation of straighten nodes

* Add the straighten nodes algorithm

* Change algorithm structure

* Remove some debug information

* Finalize the straighten algorithm after deciding the parameters by experiments

* Notify the usage of straighten algorithm

* Of format

* Update oneflow/core/graph/straighten_nodes.cpp

Of format

Co-authored-by: daquexian <daquexian566@gmail.com>

* Of format

* Stop using visual string before we find a better key

* Remove magic numbers and Of format

* Remove starts

* Of format

* Fix a bug of using GetMaxVal<int32_t>() as an initial number for comparing

* Refactor add straighten algo interface (#8435)

* feat(*): export straighten nodes algorithm interface

* export documentation

* Update python/oneflow/nn/graph/graph_config.py

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

* Use TopoForEachNodeFast as default. (#8436)

* Use TopoForEachNodeFast as default.
Rename the original one as TopoForEachNodeDynamic

* Speed up TopoForEachNodeFast when traversing a subgraph

* Rename the switch and code clean up

* Hide the class TopoStruct

* Hide all the other functions

* Grammar

* Of format

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
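
For readers unfamiliar with the term, the "fast topological traversal" mentioned in the entries above can be pictured with a minimal Python sketch of Kahn's algorithm. This is only an illustration of the general technique, not the OneFlow C++ implementation of `TopoForEachNodeFast`; the function name `topo_for_each_node` and the `(src, dst)` edge-list format are assumptions made for this example.

```python
from collections import deque


def topo_for_each_node(nodes, edges, visit):
    """Call visit(node) once per node, in a topological order (Kahn's algorithm)."""
    successors = {n: [] for n in nodes}
    in_degree = {n: 0 for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Start from the nodes that have no unvisited predecessors.
    ready = deque(n for n in nodes if in_degree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visit(node)
        visited += 1
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)

    # Any node left unvisited sits on a cycle.
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")


order = []
topo_for_each_node(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
    order.append,
)
```

Each node and edge is processed once, so the traversal is linear in the size of the graph.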

* Refactor NLLLoss to support split class dim (#8380)

* refactor

* RuntimeError

* avoid atomic add

* test

* fixes

* update test

* update test

* update test

* fix kernel

* improve backward

* update test

* out_weight to be required

* address static analysis error

* fix static analysis error

* fix static analysis error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Strict ordering in memory reuse algorithm (#8441)

* Support broadcast in fused_softmax kernel (#8321)

* support broadcast

* refine

* Remove shape check

* fix sbp when broadcast

* rollback softmax grad threshold

* increase threshold of test conv bn folding

* tol to 1e-2

* check error msg of fuse softmax ops

* add more dispatch

* remove double datatype test and add broadcast test

Co-authored-by: cheng cheng <472491134@qq.com>
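
As a rough illustration of what "broadcast" means for a fused scale-mask-softmax, here is a pure-Python reference sketch in which a single-row mask is reused for every row of the input. It is not the CUDA kernel from this PR; the function name `scale_mask_softmax`, the `fill` default, and the scale-then-mask order are assumptions chosen for the example.

```python
import math


def scale_mask_softmax(x, mask, scale=1.0, fill=-10000.0):
    """Reference softmax over the last dim of a 2-D list `x`.

    `mask` has either one row (broadcast to every row of `x`)
    or exactly as many rows as `x`.
    """
    out = []
    for i, row in enumerate(x):
        # Broadcast: reuse the single mask row for every input row.
        mask_row = mask[0] if len(mask) == 1 else mask[i]
        masked = [v * scale if keep else fill for v, keep in zip(row, mask_row)]
        # Subtract the row maximum for numerical stability.
        peak = max(masked)
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


y = scale_mask_softmax(
    [[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]],
    mask=[[True, True, False]],
)
```

In both output rows the masked last column receives a probability close to zero, and the remaining probability is shared between the first two columns.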

* Merge slice and logical slice (#8416)

* remove Slice, SliceUpdate, SliceGrad op

* rename logical_slice to slice and logical_slice_assign to slice_update

* move gradient_func logical_slice.cpp to slice.cpp

* fix some bug and refine local test

* feat(SliceUpdate): support 0size tensor

* test(Slice): refine consistent slice test

* test(SliceUpdate): refine consistent slice_update test

* not export slice_update's inplace parameter

* auto format by CI

* recover slice_grad_op

* fix slice_view bug

* add error message and attr judgement

* modified old test

* auto format by CI

* update test README

* update tensor_string code

* fix test bug

* auto format by CI

* fix(hsplit): hsplit functor bug

* fix vsplit doc test bug

* refine

* fix test

* fix pin_memory bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Graph block.config.set_stage() for recommended Pipeline api. (#8442)

* Graph block.config.set_stage() for recommended Pipeline api.

* revert diff

* refine api doc

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Update PolynomialLR's doc and parameter (#8430)

* update PolynomialLR doc, current_batch = min(decay_batch, current_batch)

* * update PolynomialLR doc, current_batch = min(decay_batch, current_batch)
* rename the steps to decay_batch in parameters

* update PolynomialLR test case

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
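
The clamp `current_batch = min(decay_batch, current_batch)` quoted in the entry above can be shown in a small sketch. Only that clamp and the `decay_batch` name come from the commit message; the decay expression below is the commonly used polynomial-decay formula, and the function name `polynomial_lr` plus the `end_lr` and `power` parameters are assumptions, so the OneFlow documentation remains the authority on the exact schedule.

```python
def polynomial_lr(base_lr, current_batch, decay_batch, end_lr=1e-4, power=1.0):
    """Polynomial learning-rate decay (illustrative sketch)."""
    # Clamp so the rate stays at end_lr once decay_batch has been reached.
    current_batch = min(decay_batch, current_batch)
    return (base_lr - end_lr) * (1 - current_batch / decay_batch) ** power + end_lr


lr_start = polynomial_lr(0.1, 0, 10, end_lr=0.0)
lr_mid = polynomial_lr(0.1, 5, 10, end_lr=0.0)
lr_after = polynomial_lr(0.1, 20, 10, end_lr=0.0)
```

With `power=1.0` the schedule is linear, so `lr_mid` is half of `lr_start`, and `lr_after` equals `end_lr` because of the clamp.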

* Add mv op (#8445)

* add mv op with bug that Int is incompatible

* add test

* update test_mv.py

* fix based on comments

* fix based on comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable oneflow_iree(python package) and corresponding test works in ci (#8431)

* update test.yml

* add pytest for oneflow_iree examples

* add oneflow frontend test

* Dev tensor is pinned api (#8447)

* support tensor.is_pinned

* add test case

* add docs

* auto format by CI

* refine

* auto format by CI

* refine

* auto format by CI

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Nd sbp tensor str (#8458)

* nd sbp tensor str

* add nd sbp tensor str test

* bigger input size

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Patch sbp cost (#8378)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Add the slight penalty for eager

* Consider B -> (B, B) for a scalar

* Do not consider parallel description in priority ratio

* Of format

* Fix a bug in the old version group boxing with 2D SBP (#8448)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Fix bugs of patch-sbp_cost (#8456)

* Update group boxing to deal with hierarchy [1, 2]

* Use a uniform sbp while grouping consumers

* Steal "ParallelDimReduce"
from "hierarchical_sub_task_graph_builder_impl" to "sbp_infer_util"

* Reduce to uniform B for 1 device.
Use the actual parallel description for each tensor

* Fix a bug of fix-group_boxing-bug

* Group boxing reduce [2, 2]: (S0, S0) to [4]: S0,
then we might infer a 1D SBP from a 2D SBP hint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* Decouple stream and instruction (#7607)

* remove deprecated python api

* backup code

* backup code

* fix compiler complaints

* fix typo in refactoring

* kMockDevice

* add unit test test_mock.py

* revert mock kernels

* vert DEVICE_TYPE_SEQ

* mock placement

* address pr comments

* register device kCriticalSectionDevice and kLazyJobLauncher

* kControlDevice

* Stream::vm_stream_

* fix compiler complaints

* backup code

* rename StreamIsTransport to IsCommNetStream

* decouple vm::StreamType and vm::InstructionType

* fix compiler complaints

* remove 'gpu' related code

* address static analyzer complaints

* address static analyzer complaints

* remove unused module in test_mock.py

* the Env is never destroyed.

* export Env into python

* more unittests

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* fix oneflow.placement.__str__

* revert GlobalSync

* init_producer_stream in oneflow.from_numpy

* debug code for vm

* init disable_vm_threads_ in VirtualMachine::VirtualMachine

* Update oneflow/core/vm/virtual_machine.h

Co-authored-by: daquexian <daquexian566@gmail.com>

* create stream in forked subprocesses.

* refactor StreamRoleSwitch to StreamRoleVisitor

* ThreadLocalGuard

* auto format by CI

* fix compiler complaints

* fix static analyzer complaints

* VirtualMachine::GetVmStream

* fix static analyzer complaints

* reimplement AddAndReadVector by std::deque

* reimplement AddAndReadVector

* merge master

* increase atol for test_consistent_rnn_cell.py

* StreamRole::AsyncLaunchedCommNet is bound to EventRecordedCudaStreamType

* auto format by CI

* remove StreamRoleVisitor<T>::VisitInvalid

* no copy in AddAndReadVector

* fix bug of AddAndReadVector::size_

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix AddAndReadVector::GetGranularity

* remove bad unittest

* auto format by CI

* rename CallInstructionType to OpCallInstructionType

* sta…

3 participants