Merge auto parallel update #7280

Merged: 111 commits, Jan 20, 2022

wyg1997 (Contributor) commented Jan 18, 2022

No description provided.

lixinqi and others added 30 commits December 21, 2021 14:32
* backup code

* EventRecord

* auto format by CI

* backup code

* remove deprecated binary test cases

* refactor volatile to atomic

* add StreamType::InitInstructionStatusIf/StreamType::DeleteInstructionStatusIf

* merge from branch profiling_nn_graph

* address comments

* EventRecordProvider

* more comments for XXXStatusQuerier::SetLaunched

* more comments for SharedEventRecord::Init

* wait source op per critical section

* rename a task_node.cpp

* minor fix

* backup code

* fix compiler complaints

* 1) remove AddCtrlEdgeBetweenSrcDstTickAndInputOutputInSameRank; 2) create CriticalSectionInstance buffers

* fix compiler complaints

* more profiler code

* refactor vm preschedule

* TryMoveFromWaitingToReady

* revert flying_instruction_cnt

* revert to single position to call DispatchInstruction

* revert several code

* reset instruction watermark

* remove is_xxx_hook_empty

* build with profiler

* merge master

* insert device ticks before and after critical sections

* refactor register_num of cs_wait/cs_callback from 2 to 128

* fix static analysis complaints

* fix compiler complaints about JobBuilder::ParallelConf4OpName

* Update oneflow/core/operator/critical_section_wait_tick_op.cpp

Co-authored-by: daquexian <daquexian566@gmail.com>

* address pr comments

* add job example for InstructionsBuilder::LaunchLazyJob

* address pr comments

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: ouyangyu <xuanjiuye@gmail.com>
Co-authored-by: daquexian <daquexian566@gmail.com>
* more details of error msg

* minor change

* address review comment

* avoid namesake iterator
* add once apply of param

* apply once on buffer

* test reuse var on module to

* test reuse var

* rm useless test

* finish test

* refine test

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* change spawn_shell to spawn_shell_and_check, sleep in script

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix distributed test master addr

Signed-off-by: daquexian <daquexian566@gmail.com>

* remove sleep

Signed-off-by: daquexian <daquexian566@gmail.com>

* spawn_shell -> spawn_shell_ignoring_failure

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix bug

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix the reversed logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* improve error msg

Signed-off-by: daquexian <daquexian566@gmail.com>

* resolve name conflict of MASTER_ADDR

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix chunk op dim=-1 bug

* Update oneflow/core/functional/impl/array_functor.cpp

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* Update oneflow/core/functional/impl/array_functor.cpp

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix Resource::DumpCudnnConf

* fix typo and error msg

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix

* support concat single input
* Clear tensor name scope after graph build

* Add test case of 2 graph caught same free eager tensor

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix bias_add dropout fuse when p=0.0

* remove redundant op

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix Resource::DumpCudnnConf

* support_1d_to_2d_eager_boxing

* rename stack to unflatten

* add test case

* of format

* refine test case

* Revert "fix Resource::DumpCudnnConf"

This reverts commit f07278d.

* support nd to 1d

* add 2d to 1d test case

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* add oneflow-tblgen: generate op schema (OpInterpCtx) from ods

* cmake: add inja

* tblgen: add oneflow_datatype

* tblgen: use option cat

* tblgen: fix error

* tblgen: put impl in .cpp

* tblgen: fix null attrs

* tblgen: fix null ops

* refine

* refine

* refine

* Refine op schema template and compilation

* add base OpInterpCtx to finish compilation

* fix

* refine

* fix

* add custom infer code

* generate op registrants automatically

* refine

* fix

* update user op ods and fix shape attr

* refine

* refine

* add custom code in op base

* refine comments

* add same_output_regst_num and infer

* support declare hasxx

* update op schema emitter

* refine

* emit output regist num

* refine

* refine

* migrate acc op

* migrate onerec_reader, ones_like, send, pack and padding ops

* add has_sbp_signature_infer_fn

* refine

* migrate pad, parallel_cast, partial_fc and pooling ops

* rm redundant has_device_infer_fn

* migrate prelu, quantization, randperm, reduce and repeat ops

* migrate reshape, reshape_like, roi_align, same_pad, selu and scalar related ops

* back port

* backport

* migrate ops

* refine

* refine

* refine

* refine

* add new op

* fix llvm not found

* fix mlir headers

* fix mlir headers

* fix llvm not found

* refine

* mark override

* fix merge

* fix

* fix

* set op schema as obj lib to speed up

* rewrite ops

* add addn

* add grid

* refine

* add more def (#7051)

* affine grid

* refine

* refine

* refine

* refine

* fix

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* move more ops

* fix math_binary_broadcast/elementwise_ops

* fix hardtanh

* add norm

* rename file and add CpuOnly no_grad

* fix ir & fix norm op

* fix oneflow-tblgen

* fix math_unary_elementwise_op

* fix norm

* fix bn

* fix op schema

* refine

* fix

* refine physical_tensor_desc_infer_fn

* refine

* add ScalarLogicalNotEqualOp & RecvOp

* refine

* auto format by CI

* fix fmt

* add cuda only trait

* delete unused inja

* del inja_copy_headers_to_destination

* delete unused inja

* del inja_copy_headers_to_destination

* add cuda only to tblgen

* fix json inja url and md5 not used

* fix json inja url and md5 not used

* refine

* revert

* add with cuda

* refine

* delete GenUserOpODS

* remove cuda only

* revert cuda only after meeting

* fix

Co-authored-by: PragmaTwice <i@twice.moe>
Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
* add pass debug

* debug pass

* refine comment of fuse add pass

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix error message

* fix dot doc

* fix dot elem cnt

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* AnyType => Tensor

* refine

* refine

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* add once apply of param

* apply once on buffer

* test reuse var on module to

* test reuse var

* rm useless test

* finish test

* refine test

* Clear tensor name scope after graph build

* Add test case of 2 graph caught same free eager tensor

* auto format by CI

* refactor var build draft

* add full func; add check

* done

* add test of call parameter outside its module

* fix break test

Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* fix l2_normalize

* add normalize

* add test for normalize

* refine

* clean l2_normalize and refine normalize

* simplify normalize test

* Fix l2norm block_size

* refine

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* add linspace op

* fix align error in swintransformer

* add @ magic method

* fix conflict

* support tensor list

* fix meshgrid bug

* revert

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* Clear tensor name scope after graph build

* Add test case of 2 graph caught same free eager tensor

* auto format by CI

* add other api graph autotest

* add more samples

* fix comments

* refine

* refine

* refine

* refine

* refine

* fix error

* fix test error

* fix bug

* fix flip bug

* fix bug

* fix bug

* fix ci bug

* fix ci error

* fix bug

* fix ci error

Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: Li Xiang <54010254+lixiang007666@users.noreply.github.com>
* add cmake changes for liboneflow_cpp.so

Signed-off-by: daquexian <daquexian566@gmail.com>

* add separate target for cpp api test

Signed-off-by: daquexian <daquexian566@gmail.com>

* add cpp api test in ci

Signed-off-by: daquexian <daquexian566@gmail.com>

* graph run

* reverse the order of cudnn and cuda library

Signed-off-by: daquexian <daquexian566@gmail.com>

* update logic of BUILD_MONOLITHIC_LIBONEFLOW

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename BUILD_MONOLITHIC_LIBONEFLOW to BUILD_MONOLITHIC_LIBONEFLOW_CPP_SO

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine

* [draft] implement graph parameter load and save (#7010)

* implement parameter save (python) and load (c++)

Signed-off-by: daquexian <daquexian566@gmail.com>

* revert accident changes

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix circular reference

Signed-off-by: daquexian <daquexian566@gmail.com>

* pimpl

* batching

* share lib directory in test container

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix typo;

* add github actions debug

Signed-off-by: daquexian <daquexian566@gmail.com>

* Revert "add github actions debug"

This reverts commit 7d9aef6.

* add upterm debug after exe test

Signed-off-by: daquexian <daquexian566@gmail.com>

* sleep after fail

Signed-off-by: daquexian <daquexian566@gmail.com>

* set LD_LIBRARY_PATH in yml for cpp api test exe

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine

* add test file && input order

* sleep

Signed-off-by: daquexian <daquexian566@gmail.com>

* upload liboneflow_cpp.so

Signed-off-by: daquexian <daquexian566@gmail.com>

* modify cmake to trigger compilation

Signed-off-by: daquexian <daquexian566@gmail.com>

* load job from ir && clean && add mlir model

* [remove useless python code]save to .pb

* add target of_common_obj to remove duplicate REGISTER_PASS  && run of_format

* remove openvino

* remove openvino test

* refine

* IValue

* Update oneflow/api/cpp/framework/graph.h

Co-authored-by: daquexian <daquexian566@gmail.com>

* refine

* refine

* refine

* refine

* refine

* refine

* rename in oneflow.cmake

* refine oneflow.cmake

* make of_api_common object library

* move device util function in api to core

* remove device check in New and ThreadLocalGetOrNew

* refine

* fix device test

* refine graph test

* refine GetExeDir()

* refine GetExeDir() again

* fix

* refine

* fix

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: mosout <mosout@qq.com>
* disable autograd in lazy mode

* refine
* add test

* fix rand consistent

* add test
* quick fix power

* add int scalar test case

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* Dispatch functional stateful ops

* fix

* fix cmake

* fix

* disable attr check since it may not given when creating op expr.

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* refine

Co-authored-by: VertexC <bob2420083992@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* add_env_api_docs

* minor fix

* fix grammatical errors

Co-authored-by: Yao Chi <later@usopp.net>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
* tmp skip s0 print because of slice

* tmp skip s0 print in test case

* fix

Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
wyg1997 (Contributor, Author) commented Jan 20, 2022

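This PR brings the auto-parallel branch up to date with master. It merges the semi-automatic SBP inference for Operator, some of the functions in sbp_infer_util, and several functions newly registered in OpSchema that auto parallel requires.

It also fixes some bugs (these will be cherry-picked to master later):

  1. The backward SBP inference function of the AdaptiveAvgPoolGrad op contained an unneeded blob name, which caused a key-not-found error when 1D SBP signatures were filtered with nd_sbp_constraints.
  2. In NNGraph, the optimizer-related variable ops are created before the plan is synchronized. If the SBP of these variables in the job is inconsistent, creating them on each rank triggers a CHECK failure.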
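
The first bug in the list above can be pictured with a toy model. The sketch below is not OneFlow code: plain Python dicts stand in for SBP signatures and for nd_sbp_constraints, the function name `filter_by_constraints` is hypothetical, and the stale blob name `"y"` is chosen purely for illustration. It only shows why a signature that names a blob the op does not have can fail a strict, key-based filter.

```python
def filter_by_constraints(candidates, constraints):
    """Keep the candidate signatures that agree with `constraints`.

    Both arguments map blob name -> SBP label (for example "S(0)" or "B").
    A label of None in `constraints` means "no constraint for this blob".
    """
    kept = []
    for sig in candidates:
        # Strict lookup: every blob name in the signature must be a known key.
        if all(constraints[bn] in (None, label) for bn, label in sig.items()):
            kept.append(sig)
    return kept


# Assumption: the constraints cover exactly the blobs the grad op really has.
constraints = {"x": None, "dy": "S(0)", "dx": None}

# Buggy inference (hypothetical): the signature mentions an extra blob "y".
buggy = [{"x": "S(0)", "y": "S(0)", "dy": "S(0)", "dx": "S(0)"}]

# Fixed inference: only real blob names appear in each signature.
fixed = [
    {"x": "S(0)", "dy": "S(0)", "dx": "S(0)"},
    {"x": "B", "dy": "B", "dx": "B"},
]

try:
    filter_by_constraints(buggy, constraints)
    buggy_error = None
except KeyError as err:
    buggy_error = err.args[0]  # name of the blob that could not be found

result = filter_by_constraints(fixed, constraints)
```

In this toy version the buggy signature raises a `KeyError` for the extra name, while the fixed signatures are filtered normally and only the one matching the `dy` constraint survives. Dropping the unneeded name at the source, as the fix describes, keeps the filter itself strict rather than teaching it to ignore unknown keys.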

Tested successfully on resnet50 (1D), vgg (1D), and libai (2D).

@Yipeng1994 Yipeng1994 merged commit 2824513 into feat-auto_parallel Jan 20, 2022
@Yipeng1994 Yipeng1994 deleted the merge-auto_parallel_update branch January 20, 2022 09:38