Enable ZeRO with auto parallel #9288
f56d995 to 54771bc
…low into feat-auto_parallel-ZeRO
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9288/
for (auto& cost_row : this_edge->cost_) {
  for (auto& cost_value : cost_row) {
    // If transferring between devices, we need to add wait time.
    if (cost_value > 0.0) { cost_value += this_edge->wait_time_; }
It looks like the main logic added is the "add wait time" step here?
The main logic removed is the large number of compute_cost switches.
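The only difference is that the assignment was changed to an addition. Of course, both implementations give the same final result.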
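For readers skimming the thread, the loop under discussion can be sketched in isolation as below. This is a minimal illustration, not the OneFlow implementation: `EdgeSketch` and `AddWaitTime` are hypothetical names, and only the two members visible in the diff (`cost_` and `wait_time_`) are modeled. Entries equal to zero are treated as "no transfer" and left unchanged, matching the `cost_value > 0.0` guard in the hunk.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the edge type in the diff; only the two members
// that the loop touches are modeled here.
struct EdgeSketch {
  std::vector<std::vector<double>> cost_;  // cost matrix, one row per producer choice
  double wait_time_ = 0.0;                 // extra latency charged on a real transfer
};

// Adds the wait time to every strictly positive entry. Zero entries are left
// untouched, mirroring the `cost_value > 0.0` guard in the reviewed code.
void AddWaitTime(EdgeSketch* edge) {
  assert(edge != nullptr);
  for (auto& cost_row : edge->cost_) {
    for (auto& cost_value : cost_row) {
      if (cost_value > 0.0) { cost_value += edge->wait_time_; }
    }
  }
}
```

Iterating by reference over rows and values avoids the index bookkeeping of the `cost_[i][j]` form while touching the same entries.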
      if (sbp_edge->cost_[i][j] > 0.0) { sbp_edge->cost_[i][j] = sbp_edge->wait_time_; }
    }
  }
}
Is it faster because the edges are no longer traversed here?
Is the logic still equivalent after this change?
I measured this, and the time is basically the same. The logic is equivalent.
The number of calls to InitializeCopyCost() drops from 2 to 1.
* Use Primitive in Scalar Pow Grad (#8620) * scalar math use primitive * fix * support pow grad * dev scalar pow grad * remove useless code * use std * auto format by CI * Refine Co-authored-by: guo-ran <360112263@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Add higher order derivative for loss function (#9070) * add higher order derivative for smooth_l1/nll loss * add higher order derivative for bce/kl_div loss * fix bug and refine testcase * fix wrong sbp signature of bce loss * optimize code and align precision with pytorch * add some index check * disable calc derivative for target in bce loss * remove unnecessary header include * fix sbp setting in testcase, and restore out_grads size check * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add higher order derivative for softmax and activation (#9032) * add higher order derivative for softmax/logsoftmax * add higher order derivative for mish/gelu activation * auto format by CI * add comment for constexpr parameter Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add higher order derivative for pool (#9096) * add higher order derivative for pool * refine * optimize * fix ndim check error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Cross Encropy 支持 probability 的 target (#9064) * support prob for crossentropy, still has bug for dims > 2 * fix bug of for ndim > 2 inputs, refine code * refine code, use template HasLabelSmoothing * fix grad bug of for ndim > 2 inputs, use pre-calculated factor in kernel * format code, remove redundant including header files * refine op * restore wrong modification * remove op, implement at functor layer * set bind_python to false, remove redundant header files * add docs * 
fix missing default param in unittest, fix typo in docstr example * auto format by CI * Update loss.py * remove useless file Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix nvjpegDecodeParamsSetROI (#9101) * Fix nvjpegGetImageInfo * fix set ROI * add series op : adaptive_max_pool1d/2d/3d (#9023) * startup: cpu adaptive max pool 2d finished (a draft) * add 1d/2d/3d forward * add return_indices * refine files hieararchy * add adaptive_max_pool2d_grad for test * draft backward op for maxpool 2d * cpu op/kernel finished * reformat * gpu draft kernel * gpu forward finished * draft gpu backward version * refine gpu backward * add nn.AdaptiveMaxPoolnd Module * add docstring * rename avg pool gpu file * refine .td file * refine * refine test case * refine * refine by comments of zzk * refine according to clang_tidy errors * refine * refine by comments of zhuping * one_embedding physical_block_size change to 4096 (#9017) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * OneEmbedding add ONEFLOW_ONE_EMBEDDING_DISABLE_PIPELINE (#9098) * one_embedding eager forward * deterministic forward gen random * merge master * merge master * grad op add attrs * Revert "grad op add attrs" This reverts commit 33b67c75d1e5d0e6529a108f7e7a17bc458dc661. 
* auto format by CI * format * refine * prefetch consume id_shuffle out and exec in advance * add new task_node * sort and add ctrl edge * rm id_shuffle_task_node * add register same output blob regst num * rm tasktype * refine * address review * rename * refine * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * develop eager AMP (#9088) * implement eager AMP * skip autocast for inplace and implement make autocast meta * fix * rm unused code * autocast python api * fix * fix * refine * skip autocast if any input is float32 for gray or clear list * refine * fix dead loop * add autocast unittest * refine worker seed (#9102) * refine worker seed * refine * reifne * use default_generator.seed * Dev GroupNorm (#7784) * add groupnorm infer * Add groupnorm forward * refine other forawrd situation * groupnorm backward still has bug * fix forward * support backward * add slow groupnorm param grad kernel * use blockreduce * update blocknum * add gradient func * simplify code * refine and add global test * remove annotation * not limit split dim * fix compile error * Add spatialsize pack logic and fix launch blocknum bug * add two stage reduced backward kernel * refine * simplify logic * refine pack logic * use THREAD_CACHED_MUTABLE_ATTR_MAP * fix comment * refine * refine comment * Refine more check * fix affine=False bug * fix bug * tmp use gemm reduce * use ComputeType buf * fix nvbfloat16 compute type * add amp gray list * Revert back * fix clang analysis * refine userops.td * fix userops * remove result_segment_sizes * add dispatch logic for groupnorm grad uncached block impl Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Introduce bfloat16 type (#9067) * introduce_bfloat16_type * storage * fix compile error * support bfloat16 ep operator * support create cpu bfloat tensor * refine code * minor 
fix * fix static check error * reslove comment * add more test case * fix bfloat16 numeric_limits * fix error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine check in ibverbs (#8974) * refine check in ibverbs * format * fix typo and test * refine error message when there is no errno Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support padding_idx in OneEmbedding (#8998) * init * Add attribute val in Userops.td * simply add paddingidx logic in EncodeLookupKernel * add simple padding_idx EmbeddingGrad * when index is -1 let gather add 0 * skip atomicadd when row index equals to padding_idx * change padding_idx type to int64 * fix compile error * set padding_idx in Pass * 1n1d eval success * refine * remove print * fix compile error * revert * refine * fix compile * refine * Refine * refine * refine store options * remove embedding grad shuffle redundant padding_idx * move gather in datashuffle kernel * remove redundant code * Refine * refine * remove redundant header file * Set padding idx as optional and remove attr has_padding_idx * Add padding_idx unittest * use array equal instead of allclose * remove a test * enlarge timeout * launch oneflow kernels in code generated with MLIR (#8980) * init * registry * add KernelLaunchFunctionPass * pass ninja and relu test * mlir test script & lowering * relu py * fi * kernel launch * fix * fix op and pass interfaces * add comment * add readme docs * fix typo * kenerl launch function pass is done * use template and rename func.func * declare * pass string through mlir.llvm dialect to c interface: llvm.mlir.global internal constant @"relu-0_var"("relu-0") %0 = "llvm.mlir.addressof"() {global_name = @"relu-0_var"} : () -> !llvm.ptr<array<6 x i8>> %1 = "llvm.mlir.constant"() {value = 0 : index} : () -> i64 %2 = "llvm.getelementptr"(%0, %1, %1) {structIndices = dense<-2147483648> : tensor<2xi32>} : (!llvm.ptr<array<6 x i8>>, i64, i64) -> !llvm.ptr<i8> * use 
symbol table * use oneflow variable op * fix symboltable * fix * ninja c1 check * split into kernel-launch-function pass and kernel-launch-with-llvm pass * restore pass 1 * Gen kernel example (#9042) * add example * add todo * add basic assertion * add file check * create pass in translation * sanitizeIdentifier * enable print * fix * update test file * kernel llvm pass is ok * pass ctx ptr to func and this ptr will be an operand to call c interface function * restore llvm ptr type to llvm.ptr<i8> * Kernel lookup in launch op (#9059) * add * move function to another unit * create map * add iter * impl TensorDesc4ArgNameAndIndex * set dev tag * load lib when ONEFLOW_MLIR_FUSE_KERNEL_LAUNCH is set * sharedlibs enables and pass enables in commpute * enable c interface callee * impl todo * naming * rm * add invalid * fix invoke arg * typed * rm log * rename pass * Update user_op_kernel_registry.h * Update user_op_kernel_registry.h * Update OneFlowOps.td * Update Passes.cpp * add comp ctx * add todo * refine todo * refactor op infer * minor fix * add check * refine error * refine msg * fix typo * fix typo * remove string in llvm * impl Tensor4ArgNameAndIndex * fix ninja c1 bug * realize gpu and add cuda test * auto format by CI * fix merge * fix ninja with cpu version * auto format by CI * rename * merge def * deduplicate code * fix * refactor * fix license * cache * add back TODO() * add jit arg type check * rm comment * fix typo * fix ci * todo ci * fix code style * rm misadded * rm misadded * Update Passes.cpp * pass ninja without debug about hungry mode of knerel init * fix null parsed module problem * fix dynamic cast of state problem * fix gpu error * fix * fix * auto format by CI * fix * Update kernel_launch_op.cpp * move * fix * auto format by CI * done * fix * fix * auto format by CI * fix * fix * auto format by CI * Update kernel_launch_op.cpp * rename * auto format by CI * fix * done * Update kernel_launch_op.cpp * fix * fix * fix * fix * fix * auto format by 
CI * Update oneflow/ir/oneflow-extension/kernel_launch_op.cpp Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * fix * fix * fix * fix * fix * Update oneflow/ir/lib/OneFlow/Passes.cpp Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * fix * fix * fix Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * interpolate api align (#9118) * Fix masked select op bug (#9120) * fix masked_select bug * refine * fix ci error * align with pytorch RANK env (#9111) * align with pytorch RANK env * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add oneflow hub (#9116) * add OneflowHub feature, consistent with PyTorchHub * add oneflow hub docs * refine docs and add test * refine * refine * refine * fix comment * auto format by CI * skip unittest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix where op data_type infer bug (#9121) * fix where op data_type infer bug * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix like op infer dtype (#9127) * elementwise.cuh remove template parameter tail (#9128) * fix_global_tensor_detach_bug (#9134) * fix_global_tensor_detach_bug * fix test case * Add deform_conv2d op (#9095) * add new op * add kernel * add deform_conv * add some test * modify test * modify format * modify test * fix the bug and add test * Add error message * modify kernel and add test * adjust the format * add global test * Update python/oneflow/test/modules/test_deform_conv2d.py * add doc and modify global test * adjust OneFlowUserOps.td * remove headfile and modify doc * modify doc * add docs at rst * modify global test * remove unnecessary code * remove unnecessary code * remove debug code * initialize fields * modify global test * modify test * modify test * 
modify test * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix inplace mul 0size check bug (#9132) * fix inplace mul 0-size tensor check bug * code format * revert * Align round op to support round half to even (#9135) * align round op * add test * modify doc ,test and kernel * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * rm dict in module apply (#9137) * rm dict in module apply * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * one_embedding support broadcast table_ids (#9109) * support broadcast table_ids * address review * fix like op infer dtype * address review * address review * refine * refine error message for framework (#9104) * refine error msg for framework * more error messages * fix size_t comparison with zero * check for incomplete error messages * err msg for inconsistent placement * modify acc. 
to review * convert enum to string in error msg * fix redundant error info; clean up * refine error msg for consistency check * auto format by CI Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix loss scale precision (#9126) * fix loss scale cast * amp_white_identity * revert debug log * move constant like back Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * one embedding eager (#8984) * forward * one_embedding eager * fix one_embedding grad * fix * fix * fix * fix amp * fix of_tidy * ONEFLOW_ONE_EMBEDDING_FUSE_UPDATE_PUT default true * merge master * save shadow var * get all ptr from embedding_state * reuse update and put op/kernel * mv id_shuffle to cuh * refine * refine * refine * refine * refine * refine * one_embedding eager forward * deterministic forward gen random * merge master * merge master * merge master * add table_ids in grad op * test pass * refine * create lazy state in lazy mode * optional learning_rate * add attr in update * refine * refine * refine * refine * fix adam and add adagrad attr * refine * refine * refine * refine * refine * address review * refine name * address review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * module.to aligned with pytorch (#9083) * module.to aligned with pytorch Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * fix to str Signed-off-by: daquexian <daquexian566@gmail.com> * fix kwargs device bug Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: binbinHan <han_binbin@163.com> * eager global zero_grad update sbp from b to p (#8853) * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * skip in lazy 
Signed-off-by: daquexian <daquexian566@gmail.com> * implement zero_grad in c++ Signed-off-by: daquexian <daquexian566@gmail.com> * _zero_grad to _zero_grad_, skip boxing of lazy tensor Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * auto format by CI * skip test in cpu only mode Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support inplace scatter (#9016) * refine scatter * fix * refine * refine * add atomicMul & refine * refine * Dev linalg cross (#8979) * add linalg_cross in yaml * add linalg cross * fix * refine broadcast * add global test * reformat * refine and fix * fix tidy * add nansum (#9113) * add nansum, can work on cpu, fail on cuda * implement nansum on cuda * restore modification in preprocessor_internal.h * register only for floating types * remove kernel register for int types, and it works * add whole reduce functor * add backward func * add export in __init__ and refine code * refine code * refine code, and register kernel * add sbp * just for debuging, cannot compile * just for debuging, cannot compile * use primitive to implement assign nan * refine code * add docs, remove useless op and functor * remove useless kernel * add docs, fix bug of primitive * fix typo in global test * refine code * refine code * refine code * refine code * auto format by CI * Update binary_func.h * Update binary_func.h Co-authored-by: MARD1NO <359521840@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Feat eager global tensor indexing (#9138) * test(TensorIndexing): add global basic indexing test * format code * feat(TensorIndexing): support eager global advance indexing * test(TensorIndex): add global tensor indexing error 
message test * format code * feat(TensorIndexing): support global tensor combined indexing * format code * feat(TensorIndexing): eager global combined basic with advance indexing * fix(TensorIndexing): fix global tensor write back bug * remove useless code * refine test and comment * fix(TensorIndexing): remove an unnecessary slice_update * add comment * fix with static analysis Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add lr_scale for optimizers (#9008) * add lr_scale for opt * revert import * set lr scale in pass * add test * lr_scale default value * improve readability * fix_ctc_loss_error_with_float_target_input (#9143) * fix_ctc_loss_error_with_float_target_input * minor fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Inplace masked fill (#9133) * add inpalce masked_fill * reformat * refine * auto format by CI * refine according by comments of hbb * export via cpp directly * export oneflow.masked_fill_ * rename arg * refine test case Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix numpy>=1.23.0 advance indexing code (#9139) * test(TensorIndexing): fix numpy>=1.23.0 * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add_tensor_new_full_func (#9149) * add_tensor_new_full_func * auto format by CI * add global test case * fix error Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * As strided regist more dtype (#9150) * as_strided register more kernel * add test * fix commnet * fix ci error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Auto Parallel (#8891) * add auto_parallel code add auto_parallel pass * Feat ap remove hierarchy cast (#7919) * feat(AutoParallel): support remove parallel_cast ops * feat(AutoParallel): export 
enable_auto_parallel_prune_parallel_cast_ops * format code * Fix add conv grad cost (#7972) * feat(Conv): add grad computation cost * fix ConvDataGrad computation cost * update conv grad cost * refine * Auto parallel/fast collector (#7958) * Try to speed up sbp collector. However, throughput drop * Shrink the parallel candidates for the proxy node * Print out some information and then refine * Store the sbp set for each consumer * Update binary set intersection * Remove impossible parallel candidates from sbp proxy * Refine binary set * Add a Clear() in binary set * Filter out those proxy candidates containing two sbps from the same unique group * refine * Check spells * Clip useless edges * AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033) * feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge * use if instead std::max * fix(AutoParallel): fix pooling computation cost function bug (#8147) * [WIP] Fix auto parallel dump uniform sbp bug (#8330) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * refine source op judgement * update auto_parallel config (#8356) * Refactor dump nd sbp for auto parallel (#8353) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * feat(AutoParallel): add inferface for op to dump nd_sbp to op_conf * refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn * rename Global to Singleton * Refactor SbpEdge (#8684) * refactor(AP): refactor SbpEdge * Rename variables * Add const for some functions Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> * Refactor auto parallel sbp node (#8712) * Rename * Code clean up * Code clean up * Code clean up and package up * Rename * Add const for some functions * Refactor auto parallel sbp graph (#8722) * Code clean up * Package up * Code clean up and package up in SbpNode and SbpEdge * Rename * Rename * Rename mainstem to trunk * Typo, small bugs and rename * Rename and of format * Refactor auto parallel rest (#8731) * Package up SbpCollector * Add const 
for SbpGraph * Add const for SbpNode * Add const for SbpEdge * Add const for SbpCollector * Add const, rename, and package up for BinarySet * Rename for BinarySet * Rename for SbpCollector * Rename for SbpCollector * Rename for algorithm utils * Fix a bug for an unused function AddEntries() * Rename for BinarySet * Rename for SbpConstructor * Rename for BoxingCollector * Add const for sbp utils * fix merge conflict * Remove template for sbp signature (#8787) * Remove template for sbp signature * Remove _H_ from cpp files * Remove namespace specifier oneflow:: * Remove namespace specifier oneflow:: * Of format * Move the inline functions to cpp files * Can not add inline specifier? * Update oneflow/core/auto_parallel/sbp_graph.h Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Of format Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor auto parallel class object stuff (#8835) * Delete copy/move constructor/operator * Move the deconstructor of SbpEdge to the cpp file * Equal by address for Sbp data structor * Replace sbp_sig_list_ with sbp_sig_obj_list_ * Fix auto parallel copy cost infer2 (#8788) * Check the output shape for operator in auto parallel * Return infinity for different sbps while is_mutable * Update oneflow/core/auto_parallel/sbp_constructor.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Update oneflow/core/operator/operator.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * with output -> check output Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor prune identity as much as possible (#8849) * Prune a line of parallel cast ops * Avoid repeated pruning * Code clean up * Remove identity op * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Fix auto parallel low throughput (#8876) * Speed up after pruning identity * Slight changes * Refactor auto parallel final check (#8887) * Of format 
* Use const auto & * Of format and rename * Re-compute cost if steals sbp signatures * Docs auto parallel doc (#8896) * doc(AutoParallel): add auto parallel document framework * docs(AutoParallel): add document * fix typo * refine document * refine documentation * Test alexnet for auto_parallel (#8917) * test(AutoParallel): test alexnet for auto_parallel * test(AutoParallel): test model add auto_parallel config * Fix get sbp bug (#8939) * Fix the bug of missing sbp for uniform op * Speed up * Add the mising sbp for optional input UserSourceOpTickInput * Remove the repeated all-B sbp signature * Add sbp for undefined UserSourceOpTickInput * Resolve confits while merging master * Recompute cost with time shape (#9009) * Address comments * fix merge conflict * Address comments * Disabled ZeRO when enabled AutoParallel (#9087) fix(AutoParallel): disabled ZeRO when enabled AutoParallel * Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp * Address comments * Address comment. 
GetComputationCostFn -> GetComputationCost * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * New interface for pr#9018 * Static analysis * Fix ones like sbp bug and fix test import error in CI (#9123) fix(AutoParallel): skip 1n1d sbp agreement check * auto format by CI * test(AutoParallel): skip acc check * Address comments * rename source op set nd_sbp function and add check * fix typo * Feat full auto parallel (#9140) * Use B for inplace op and remove the check for sbp while truning the auto prallelism on * Slight change * Not using B as the constrain * Address comments * add debugg log for non-deleted cast ops * update prune parallel cast op log * rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * refine oneflow op infer dtype error message (#9155) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix to_global PyArg_ParseTupleAndKeywords (#9158) * Fix tensor local_to_global parse keywords * use PyObject Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Implement exponential_ and multinomial (#9073) * add exponential distribution cpu kernel * add exponential distribution cuda kernel and local tests * refine test * fix bug * auto format by CI * auto format by CI * implement multinomial functor and cpu kernel * auto format by CI * add multinomial cuda kernel * auto format by CI * refine * add multinomial tests * auto format by CI * add categorical distribution module and docs * refine * refine * refine doc * refine * refine * revert Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Disable IB 
when there no active IB devices (#9115) * fix lru_cache offset (#9162) fix lru_cache offset for larger than uint32 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Rename cast to global and cast from global (#9151) * rename_cast_to_global_and_cast_from_global * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine datatype error message part2 (#9168) * refine more ops dtype infer error message * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support tensor.triu_ (#9159) * support tensor.triu_ * Update tensor_functions.cpp * tensor.copy_ support stride (#9142) * tensor.copy_ support stride * add test case * PersistentTable add read_only flag (#9145) * read only * fix * avg_pool_nd support half (#9170) * avg_pool_nd support half * refine * refine * fix new_ones size paramater (#9161) * fix new_ones size paramater * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * hot-fix (#9191) * hot-fix * refine * skip env var check and calculate local rank if not given (#9183) * skip env var check Signed-off-by: daquexian <daquexian566@gmail.com> * calc local rank if need * No warning for absent LOCAL_RANK Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * set to_contiguous to amp clear list (#9171) * add tensor.nansum (#9182) * Add slight cost for different sbp in 1 device (#9172) * Add slight cost for different sbp in 1 device * Print to INFO Co-authored-by: 
mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * refine_to_contiguous_dtype_register (#9196) * refine_to_contiguous_dtype_register * add test case * pool_nd_ops register gray list * skip autocast for non-user op (#9199) * `copy_` support numpy fp16 (#9189) * copy_ support numpy fp16 Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix matmul 0 size input error (#9147) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat functional scalar tensor parameter (#9190) * add ScalarTensor check and unpack, bug has link error * refine scalar tensor item function * feat(functional): functional support ScalarTensor transfer to Scalar automatically * feat(functional): support ScalarTensor transfer to Scalar * change auto transfer rule * test(Functional): add functional scalar tensor param test * format code * refine GetItemInScalarTensor function * Fix broadcast fmod grad (#8865) * impl trunc divide * fix broadcast fmod grad * trunc_div grad, scalar_trunc_div, and primitive * format * gradient_func * add test * rename * compatible with older versions of torch * resolve warning * test global Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat straighten compress memory (#9094) * An initial inplementation of linear programming primal matrix * Coding for the revised simplex method * Finish coding for the phase 1 * Fix bug. Now we can get a corrent x for the initial basic feasible solution * Drive the artificial variables out in phase 1 * Bland's rule and bug fix * Adjust the mapping between the basic variables and compact columns * No columns removed while driving artificial variables out. Terminates the code if positive optimal cost found in auxiliary problem. * Implement the phase 2 of the revised simplex method. 
Remove columns of the inverse base matrix. * Update is_solved status and original problem recovery. * Rows and artificial columns activation * An initial implementation of mix integer programming * Try to assemble the original problem but fial due to the massive exclusion * Steal initial position from current setting * Compute the optimal cost from the compact relationship * Move to a neighbor status and compute the cost * Find the smallest cost and actually move to that status * Check conflit after the adjustment. Adaptively cost reduce * Generate a compact position from nothing * Straighten for memory * Update the offset * Add a demo for using the revised simplex method * Remove the linear programming part * Recompute the compact relationship after moving to a new status * Rename * Code clean up * Set the tag for the straighten algorithm * Code clean up * An attemp to explore the dependency between consumer nodes of a register * Revert "An attemp to explore the dependency between consumer nodes of a register" This reverts commit f219851fb85943d07d28b84c45e5c4bae80872a0. 
* Compute the lower bound and only execute the adjustment 2 for those cases with possible reduction in memory * Pre-compute and store the memory size for registers * Use pre-stored total register num * Limit the maximum iteration step * Use VLOG(3) instead of std::cout * Change interface * Package up memory share strategy interfaces * Address comments * Address comments * Of format * Fix bug lower bound = 0 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add contains magic method (#9185) * refine more ops dtype infer error message * refine * add tensor.__contains__ magic method Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Build cuda 11.8 (#9204) * export unsorted segment sum (#9206) export unsorted_segment_sum python Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Optimize OneEmbedding Save Snapshot (#9112) * init * fix compile error * refine * Refine put logic * todo lrucache logic * refine dump logic * finish * add flag check * Add env var * fix * fix a silly bug * fix template args * fix comment * add template * Refine comment * remove * fix bug * fix compile error * refine initial Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add Tensor.scatter_add & refine scatter (#9201) add Tensor.scatter_add & refine scatter * optimize layernorm need padding cols perf (#9195) * optimize layernorm need padding cols perf * auto format by CI * reduce binary size Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support Inplace behavior in Type Promotion (#9200) * support inplace * refine * add const * refine Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Fix Broadcast Matmul check (#9213) fix check * Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209) * export to graph config * refine or Co-authored-by: mergify[bot] 
<37929162+mergify[bot]@users.noreply.github.com> * fix bug of matmul dim check in `oneflow.bmm` (#9215) * fix bug of matmul dim check * refine code * Update nn_functor.cpp * Regist arange fp16 (#9202) * arange op support cuda half * add test * format * fix comment * fix comment * refine * ci test error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix graph out argstree type judge (#9211) * reproduce bug * fix custom class type deal * fix typo * support ordereddict * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix ConcatFunctor error message (#9225) * Check async errors after kernel launched (#9226) Check errors after kernel launched * Skip unnecessary passes (#9219) * Skip unnecessary passes * refine * one_embedding fix typo (#9230) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [GetAsyncError] Add op name to error message (#9228) GetAsyncError refine error message * [JobBuildAndInferCtx]Remove an inefficient check (#9229) Remove an inefficient check * Fix linalg cross 0-size input error (#9232) * Add silu to amp list (#9233) * Disable CUDA virtual arch compilation (#9236) * Support set/get_default_dtype interface (#9227) * feat(DType): support set/get_default_dtype interface * doc(*): fix set/get_default_dtype document * doc(DType): refine document * feat(oneflow.tensor): support infer dtype as get_default_dtype * test(DType): add default dtype test * refine throw error * modify doctest because it will affect default dtype for other test * fix(DType): make DefaultDType is global * use default type in TensorWithDataCtorFunctor * fix(DType): flow.Tensor support DefaultDType * refine function name Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Enhance 
doctest error message (#9237) * test(doctest): enhance doctest error message * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Yao Chi <later@usopp.net> * Feat: script to import oneflow as torch globally (#9160) * feat: global `import torch as oneflow` * use `console_scripts` to install oneflow-mock-torch to PATH * close quote * use os.makedirs to create temp torch directory * rename to `oneflow-mock-torch` * don't create temp files * use positional argument with 2 choices * add `mock torch test` in CI * uncomment env setup * default argument is enable * fix docker exec * refactor test script * check successful recover * don't run setup.py * support submodule importing & display error message * fix import * and import-from * move mock_torch to oneflow dir; update test command * fix error message * update mock test (less strict) * add more tests for torch imports * modify export path * mock_torch is a package Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add time and mem log tools (#9164) * add time and mem log tools * refine format * auto format by CI * address review * auto format by CI * log with json format * rm useless * refine log format Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support bool for `oneflow.nn.functional.pad` (#9234) * support bool in functor and kernel, add unittest for int and bool * refine unittest * check value for bool tensor * Feat: rand/randn support float16 kernel (#9238) * feat(Op): rand/randn support float16 kernel * add error message and refine code Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * reduc auto tick generate time (#9235) * reduc time * rm useless 
* address review, refact structure * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * TensorIndexing support float16 (#9247) * feat(TensorIndexing): support float16 * feat(TensorIndexing): support bfloat16 * skip bfloat16 test when cuda version less than 11000 Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Add cudnn handle pool (#9243) * add_cudnn_handle_queue * deal normalization_kernel * refine * refine * reslove comment * minor fix * refine * auto format by CI * fix static check Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Added error message for CUDA device incompatibility (#9250) * Added error message for CUDA device incompatibility * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix autograd.Function memory leak (#9249) * fix(AutogradFunction): fix memory leak * add ptr check for AutogradState data * test(AutogradFunction): ensure PyAutogradFunctionState released * test(AutogradFunction): decrease memory * register __dict__ function * refine code * fix state release test bug * refine error message * Feat speed up mem reuse (#9210) * Use HashSet instead of vector * O(n^3) -> O(n^2) * Compute offset for memory-first algorithm only * Remove explicit exclusion relationship * Revert print out information * Speed up exclusion judgement * Switch HashMap to vector * Code clean up * life time -> lifetime * mem_reused_regst: HashSet -> std::vector regst_desc_id2regst_desc -> mem_chain2regst_desc_id2reuse_regst_desc * Re-implement MemReusedAlgorithm_TimeLineAlgo and comment out useless code * Make allocate and free timeline local and HashSet -> std::vector * Eliminate a lot of Hash stuffs * Revert "Eliminate a lot of Hash stuffs" This reverts commit abfb86df57b13074cb50ca9dc080a1333cd46802. 
* Important comment * Address comments * auto format by CI * Remove magic number -1 * Address comment and rename Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug: segfult when argmax has 0 size tensor as input (#9242) * fix_half_check_of_reduce_mean (#9014) * fix_half_check_of_reduce_mean * refine * Support float16 for initializer operators (#9253) * feat(*): support float16 for initializer operators * refine test * Add half clamp (#9241) * Register half * register fp16 in clamp kernel, add check for fp16 in functor, update unittest for more dtype * format code * add macro WITH_CUDA Co-authored-by: WangYi <buaawangyi03@gmail.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CUDA]CheckVersionCompatibility (#9257) * [CUDA]CheckVersionCompatibility * Add CUDA 10.2 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: monkeypatching pytorch (#9256) * update custom meta path finder * update test commands * print warning if `torch` is already imported * rename to `mock` * update tests * private attribute cannot be imported with import * * split testcase Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support destory_rdma (#9246) * support destory_rdma * refine * auto format by CI * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add bincount (#9156) * add bincount * add docs, use atomic add in cuda kernel, add unittest * add minlength param, fix bug of memset in kernel * refine code * refine code * convert to local when input is global, add global test * auto format by CI * refine code * refine docstr, reduce doc length in one line * register fp16, add tensor function and unittest * add 
docs for tensor.bincount * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * ONEFLOW_STREAM_ENABLE_H2D_STREAM (#9205) * Modify generator.manual_seed to return generator rather than None (#9262) generator.manual_seed return generator rather than None Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add tensor bernoulli (#9261) * add tensor.bernoulli * add docs * Update tensor.py * Update tensor.py * Update tensor.py * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Multi tensor update (#9252) * fix multi_tensor_sgd segfault * enable learning_rate_val to replace learning_rate Tensor * support adam and adamw * support epsilon for adam and adamw Co-authored-by: songyicheng <int.rejoice@gmail.com> * fix a typo in readme (#9268) * support nested asyncs.thread (#9270) * OneEmbedding add smart decay sparse adam (#9176) * add sparse adam * smart decay sparse adam * address review * fix * mv smart_decay to one_embedding namespace * upgrade clang-tidy used in ninja of_tidy (#9263) upgrade clang-tidy in ninja of_tidy Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/compile time count (#9245) * add graph compile time count * refine compile log * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix random_normal (#9274) Co-authored-by: Juncheng <liujuncheng1022@gmail.com> * Flip and upsample bilinear support fp16 (#9284) * slice update cpu kernel multi_thread loop * refine * upsample bilinear and flip register fp16 cuda kernel * fix commnet * revert Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix PruneAmpWhiteIdentityOpPass (#9276) * fix * fix dup del * 
ref algorithm * fix dup mut * simple impl * rm useless code * fix * fix typo * fix typo Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support api flow.randn_like (#9283) * support api flow.randn_like * refine * remove dry run, add sanitizers to ci (#8670) * fix some data races in c++ api and SteadyVector Signed-off-by: daquexian <daquexian566@gmail.com> * skip self copy in MutShapeView::ToShape Signed-off-by: daquexian <daquexian566@gmail.com> * remove dry run, add sanitizers to ci Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update gh action * skip lit Signed-off-by: daquexian <daquexian566@gmail.com> * suppress ubsan error in llvm Signed-off-by: daquexian <daquexian566@gmail.com> * disable ubsan for now Signed-off-by: daquexian <daquexian566@gmail.com> * fix ci path Signed-off-by: daquexian <daquexian566@gmail.com> * update test manylinux docker Signed-off-by: daquexian <daquexian566@gmail.com> * restore dry run rpc manager Signed-off-by: daquexian <daquexian566@gmail.com> * run tsan for 3 times Signed-off-by: daquexian <daquexian566@gmail.com> * do not find initializer order bug Signed-off-by: daquexian <daquexian566@gmail.com> * fix merge conflict Signed-off-by: daquexian <daquexian566@gmail.com> * skip sanitizer test in cuda misc Signed-off-by: daquexian <daquexian566@gmail.com> * sleep Signed-off-by: daquexian <daquexian566@gmail.com> * suppress by __attribute__((no_sanitize_address)) Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * revert suppression * fix heap-use-after-free found by asan * auto format by CI * bash -c Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add build config for RTX 40xx GPUs (#9290) * Bool support for triu (#9291) * Refix 
PruneAmpWhiteIdentityOpPass (#9294) fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix concat #8833 (#9275) * fix conat #8833 * support multi-none-input * test and global test * auto format by CI * format license Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support half for masked_fill (#9292) * Fix BatchNorm performance (#9298) * slice update cpu kernel multi_thread loop (#9264) * slice update cpu kernel multi_thread loop * refine * try to fix bug * auto format by CI * deleteusless headfile Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix inplace bug in `tensor.masked_fill_` (#9295) * fix: bind tensor.masked_fill_ to inplace version, fix bug in unittest * refine unittest * fix_inplace_copy_bug (#9301) * FusedMultiHeadAttentionInference (#9287) * FusedMultiHeadAttentionInference * auto format by CI * cmake * fix graph * auto format by CI * fix cmake for mlir * rm duplicated install * fix align * support float * support causal * support causal * test global property * fix * disable clang * skip cpu test * skil all test Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: jackalcooper <jackalcooper@gmail.com> * Fix compile warnings (#9302) * Fix comiple warnings * fix * Set the default value of CUDA_STATIC to OFF when CUDA version is greater than or equal to 11.8 (#9306) * Reduce pass time cost (#9281) * batch del in PrunePinnedIdentityOpPass * add log * fix and refine fuse add_n * add new line * avoid op graph create * add op graph cost cnt and fix boxing log * fix ndsbp csv str * fix multi add same add_n * auto format by CI * rm debug log * auto format by CI * to cont ref * rm useless * refine auto modifier * rm useless * hack to debug * hack to debug * hack to debug * hack to debug * hack to debug ci * hack to debug ci * fix test case env var 
* fix env var set * revert to const ref * auto format by CI * sync to make sure tensor are created Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor get sbp signature (#9304) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Add hierarchy value * Address comments * parallel num j-> hierarchy value for reshape op * Static analysis * refine * Update user_op.cpp * Update operator.cpp * auto format by CI * Revert Update operator.cpp This commit revert 64832e43196067d67f70094a8d35664a805a5891 Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix type error for entering a single tensor using concat op (#9316) * modify tensorprocessor * remove blank line * remove blank line * modify CheckHasDifferentInputDType func * Update oneflow/core/functional/tensor_processor.cpp * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add more sbp signature print functions for log and debug (#9293) * debug code * ReshapeOp::GetSBP use hierarchy dim instead of parallel_num * comment debug log * revert debug code * auto format by CI * rm NdSbpSignatureListAsString * rm 1d sbp signature print functions Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Release/nightly cu118 (#9308) * update action * 116->118 * preserve 116 Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix different dtype in slice_update (#9331) * 
fix(SliceUpdate): fix different dtype in slice_update close #9330 * test(SliceUpdate): enhance test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix FlattenOp GetSbp (#9322) * fix flatten GetSbp * rm flatten op * update group stat * rm mlir test * fix * more strictly check * add reshape converion Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor ONEFLOW_MLIR_PREFER_NHWC to support more ops (#9335) * use bn as gn * hack gn as relu * refine * support concat * ScalarDivOp * fix * move files * refine * fix bn * try fix * fix concat * fix * DRY * refactor * refactor * fix * workaound * add baseclass * rm hack * auto format by CI * minor refine * refine * add more Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * distributions.Categorical support logits not None (#9332) * avoid extra gpu memory usage in flow.save (#9328) * boxing to cpu first in flow.save Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Use primitive to replace Ndarray::BroadcastBinary (#9311) * Use primitive to replace Ndarray::BroadcastBinary * refine * fix * negative * refine * refine * Block forward support modification (#9336) * block forward support modification * add test * fix format * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add log sum exp api (#9333) * add_log_sum_exp_api * refine * add logsumexp to tensor * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: isclose and allclose (#9280) * add allclose op in tablegen * add isclose & allclose op in functional layer * use existing 
framework to implement `isclose` * import isclose & allclose * compose isclose and other op to form allclose in python * typo * add doc & test files * add default arg * curly braces between one stmt * generate one random data, the other is perturbation * update test * comment for ndarray bin func * add ref from torch * Refactor random op with consistent data (#9299) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * fix(RandomSeed): fix parallel_num==1 * move normal functor to random_functor.cpp * test(RandomOp): refine test * add comment for random_seed getter function * remove special judgement for 1n1d * fix random_seed parallel_num==1 * fix cuda generator index bug * fix test function name bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * bool tensor slice_update use masked_fill when possible (#9324) * bool tensor slice_update use masked_fill when possible * refine * auto format by CI * fix comment * auto format by CI * Update oneflow/api/python/framework/tensor_functions.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refine * auto format by CI * except partial sum test * add todo Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Move tensor apis to cpython (#9303) * move tensor.is_floating_point to c++ * refine * move tensor.split to c++ * move tensor.flip to c++ * auto format by CI * Update oneflow/api/python/framework/tensor.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refactor flip * refine * auto format by CI * fix free(): invalid pointer Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Add gelu_tanh op and kernel (#9343) * gelu_tanh * rename GeluTanh -> FastGelu * regulate constant and increase precision * instantiate and reg backward * reg grad fn * address 
review * address review * format * update test * refine_test_maxpool2d_channel_last (#9344) * refine * auto format by CI * add skip * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor normal initializer (#9307) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * refactor(Initializer): refactor normal with oneflow kernel * fix(RandomSeed): fix parallel_num==1 * test(initializer): add initializer data test * format code * move normal functor to random_functor.cpp * test(RandomOp): refine test * add trunc_normal and relax mean/std precision * fix conflict * fix merge conflict Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support fp16 in constant folding (#9337) * support fp16 * format * clean * refine * auto format by CI * refine test * clean * refine * refine Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix exp overflow with minus max trick (#9353) * Fix occasional bug in random_op data test (#9354) fix(RandomOp): fix occasional bug in random_op data test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add gumbel softmax (#9208) * regis gumbel_softmax * add: gumel_noise, attr-hard, next: log, one-hot, grad * add(fail): exp_dist * add: gumbel, grad on cpu, next: cuda * add: cuda & test bug: Synchronize() * add: docs, test_hrad, test_grad * add: format code * fix: TmpSize * fix: review * format, try to add * add: functor * format & half of rand * remove ops & kernels * support half of argmax & dim_scatter * fix review * add gumbel softmax docs * fix review * remove gumbel_softmax_grad_functor * remove grad in yaml * fix: raise half no util error * auto format by CI * auto format by CI * fix: make * fix: static Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix the 
inconsistent behavior of slice update (#9321) * modify tensor_index.cpp * modify * support scalar tensor indexing * support scalar * modify tensor_util * modify tensor_index * add macro definition * add support type * refine getitemscalartensor * Update oneflow/core/framework/tensor_util.cpp * modify macro * modify macro and test * modify test * modify function parameter * modify tensor_index ("uint8" is regarded as "bool") Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * enable autocast for that op which has nocast arguments (#9362) * fix autocast * fix * Add NHWC format for group norm (#9368) * group * nhwc * test_case * ir * fix * refine * Enable ZeRO with auto parallel (#9288) * Enable ZeRO with auto parallel in the first setting and speed up * Remove compute_cost parameter from Initialization of copy cost * Move the addition of wait time into sbp_node * Remove transfer cost since it is merged into the GetTransferCost() * Rename mainstem to trunk * Update warning Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat unbalanced split nd sbp (#9310) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Use the same physical shape as eager did * Remove the difference between eager and lazy for physical shape * Update the filter * Revert "Use the same physical shape as eager did" This reverts commit f20e222327e21166d5b5325e37c3cbe9ca4f4ac6. 
* Compute range for each rank * Compute position for range * Remove the difference between eager and lazy * Allow unbalanced split for variables * Add test script and print out information * Pass 2d test cases * Resolve conflict * Can not merge some split * Reduce in and out sbp simultaneously * Speed up for 1d sbp Package up the function for replacing hierarchy * Reduced simultaneously with the same hierarchy * Deal with 1to2d and 2to1d in InOutParallelDimReduce() * Pass 1to2d and 2to1d test cases * Remove the old code * Revert "Add test script and print out information" This reverts commit 58cdfb40b6536eb74c02174d3a69409676da374f. * Add the check for split questionary back * Feat speed up cost computation (#9355) * Compilation speed up * Speed up compilation for cost between 1d sbp * fix comment typeo * Address comment Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add upsample_nearest_2d to amp clear list (#9366) * fix cuda integral type closeness computation (#9346) * fix cuda integral type computation * remove include Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add fused linear (#9369) * Support fp16 on some cpu operators (#9374) support fp16 cpu triu * Scalar math kernels support inplace (#9372) * Scalar math kernels support inplace * type * fix * Optimize GroupNorm NHWC with FastDivmod (#9373) * GradAcc Mem V5: Part 0-4 (#8961) * default nccl use compute stream in grad acc * rm sharable mem block graph * half implement of LogicalChains * part-0 : Logical Chain * fix compile * logical chain runnable * fix bug of logical chain dp * Part 1 : AfterGradAccChain * fix bug of crush in acc chain infer * AccCtrlTick Op/Task/Actor/Pass * tmp * AccCtrlTick runnable * rename group boxing identity and model diff scale op name * stric order by acc tick * merge mem block by logical chain id group * fix user op register * fix GLOG error when no grad acc 
* Inplace repeat variable * Inplace repeat support consumed/produced ctrl regst * Part-4: merge acc op in to chain for reuse memory acc input (#9071) LogicalChain can merge acc op in to chain for reuse memory acc input 实测 GPT 的显存与 part-3 一致。 bert 与 t5 大部分的显存都略低于 part-3 https://github.com/Oneflow-Inc/OneTeam/issues/1670#issuecomment-1240468576 * find first source/sink op in acc chain which can be insert ctrl * TryMergeAfterAccLogicalChainToFirstLogicalChain * remove debug log * rm old version repeat kernel * fix format * MergeChainByLogicalChainId/PhysicalTaskGraph * IsValidChainId * rm useless file * remove note * fix clang-tidy * more IsValidChainId * rm debug log * rm note * fix bug of cpu repeat inplace var bug * fix bug of memory reuse for 0-size regst in time line algo * fix bug of acc chain merge mem guard * reuse cast to tick op * fix bug of acc different stream hint cause sync backward compute * actor name log * fix for review * remove log * fix note * fix bug of connect to cast to tick op * refine code for review * fix for review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix the bug of fill_tensor_ of support fp16 & autocast (#9375) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Allocate in instruction computation (#9282) * allocate memory in InstructionPolicy::Compute * remove unused methods of VirtualMachineEngine. * backup code * UnimplementedAllocator * prepare allocators for each cpu stream. 
* allocator for ccl stream * init AllocateTensorInstructionPolicy::output_dependences_ * only sync current rank in oneflow._oneflow_internal.eager.Sync * Update oneflow/core/vm/allocate_tensor_instruction_policy.cpp Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Disable conv algorithm search in eager mode (#9…
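This PR was originally meant to fix the incorrect-SBP bug reported in Oneflow-Inc/libai#406. It later turned out that the bug was caused by wrong SBP candidates for the reshape op, and it was fixed in another PR.
Back to this PR: after discussing it on Slack at the time, we concluded that as long as full-auto mode is not enabled (full-auto means auto parallel completely ignores the user's configuration), auto parallel and ZeRO are still compatible. (That said, once auto parallel takes memory into account later on, enabling auto parallel will make ZeRO truly unnecessary.)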
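Then, while testing with auto parallel turned on, yongning found that enabling ZeRO and auto parallel together makes graph compilation extremely slow. Specifically, there are too many edges, so InitCopyCost takes a very long time.
So this PR also covers speeding up graph compilation.
Our other PR for unbalanced splitting, #9310, also speeds up graph compilation.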
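For readers following the review thread, here is a minimal sketch of the wait-time step discussed there: every positive entry of an edge's cost matrix (a real transfer between devices) gets the edge's wait time added to it. Only `cost_` and `wait_time_` come from the diff; the `EdgeSketch` struct and the `AddWaitTime` function are hypothetical stand-ins for illustration, not the actual OneFlow classes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an SBP edge. Only the member names cost_ and
// wait_time_ mirror the diff; everything else is assumed for illustration.
struct EdgeSketch {
  // cost_[i][j]: copy cost from producer SBP choice i to consumer SBP choice j.
  std::vector<std::vector<double>> cost_;
  double wait_time_ = 0.0;
};

// Adds the wait time to every entry that represents an actual transfer.
// Zero-cost entries (no transfer needed) are left untouched.
void AddWaitTime(EdgeSketch& edge) {
  assert(edge.wait_time_ >= 0.0);
  for (auto& cost_row : edge.cost_) {
    for (auto& cost_value : cost_row) {
      if (cost_value > 0.0) { cost_value += edge.wait_time_; }
    }
  }
}
```

Doing this once per edge, where the cost is already being filled in, avoids a separate pass over all edges; according to the review replies, the measured time is about the same either way and the result is equivalent.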
The final result:
Compilation time drops to roughly 80% of what it was, i.e. about a 20% reduction.
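In addition, @strint mentioned that some typo-check plugins do not recognize the word mainstem, so it has been renamed to trunk throughout this PR (some occurrences of mainstem were already changed to trunk when auto parallel was released).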