Feat/zero mix with mp #8036

Merged: 62 commits, Jun 10, 2022
Changes from 53 commits
Commits
a470ff4
add zero limit
strint Apr 8, 2022
9447157
add debug
strint Apr 12, 2022
5eefdf8
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Apr 12, 2022
e3acaa9
add mix zero test
strint Apr 12, 2022
b481a7e
refactor zero api
strint Apr 13, 2022
3b56468
zero test with mp
strint Apr 14, 2022
66c8ac3
add 2d test
strint Apr 14, 2022
4f56df2
add zero nd
strint Apr 15, 2022
635ac69
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Apr 19, 2022
2834289
add nd zero
strint Apr 21, 2022
b805256
add sbp cast
strint Apr 21, 2022
2ede354
test passed soft limit consumer
strint Apr 22, 2022
0227f54
refine size api
strint Apr 22, 2022
7036e04
zero use stage 2
strint Apr 28, 2022
c26763e
add limit consumer api
strint Apr 29, 2022
d84e8a9
add new api
strint Apr 29, 2022
2555ee1
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Apr 29, 2022
ac0b9d2
refine zero s select
strint Apr 29, 2022
dd0a865
fix index out of range
strint May 5, 2022
e0304a7
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint May 6, 2022
501518f
rm zero limit on device type
strint May 6, 2022
0e2f9a2
Merge branch 'feat/zero_mix_with_mp' of https://github.com/Oneflow-In…
strint May 6, 2022
e3eed8c
zero test with activation checkpointing
strint May 7, 2022
f966b4f
Merge branch 'feat/zero_mix_with_mp' of https://github.com/Oneflow-In…
strint May 7, 2022
ebc9ff9
add indentity when dp sequence len is 1
strint May 21, 2022
2011e2c
move to base with master
strint May 26, 2022
b7f4fed
fix confict
strint May 26, 2022
ffe2094
fix
strint May 26, 2022
b58b48a
fix
strint May 26, 2022
6975f33
fix
strint May 26, 2022
32bc1d1
Merge branch 'feat/logical_nccl_send_recv' into feat/zero_mix_with_mp
strint May 26, 2022
cce8efd
add test
strint May 27, 2022
a30b0c0
debug bad case
strint May 27, 2022
c73013f
refine test for eager and graph boxing
strint May 27, 2022
08b1f69
test case ready
strint May 27, 2022
821a8f4
simplify
strint May 30, 2022
29079a0
refine test
strint May 30, 2022
e49d380
fix buff size
strint May 30, 2022
9bd521f
Merge branch 'feat/logical_nccl_send_recv' into feat/zero_mix_with_mp
strint May 30, 2022
b374505
merge master
strint Jun 1, 2022
3fc1821
fix conflict
strint Jun 1, 2022
79e1290
refine zero nd
strint Jun 1, 2022
3225045
refine
strint Jun 1, 2022
c751435
add full test
strint Jun 2, 2022
5c78921
revert change
strint Jun 2, 2022
bfa726c
refine split check
strint Jun 2, 2022
0bcbf30
fix typo
strint Jun 6, 2022
14c8520
rm log
strint Jun 6, 2022
56754bc
spit long func
strint Jun 6, 2022
567af33
restore test
strint Jun 7, 2022
886914c
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Jun 7, 2022
84ca778
Update optimizer_placement_optimization_pass.cpp
strint Jun 9, 2022
7095ec3
auto format by CI
oneflow-ci-bot Jun 9, 2022
5a5b9c5
Merge branch 'master' into feat/zero_mix_with_mp
strint Jun 9, 2022
b401e66
auto format by CI
oneflow-ci-bot Jun 9, 2022
3c7d2c5
Merge branch 'master' into feat/zero_mix_with_mp
strint Jun 9, 2022
2b0324e
fix static check
strint Jun 9, 2022
7d611c4
add tips for zero api change
strint Jun 10, 2022
04548ac
Merge branch 'master' into feat/zero_mix_with_mp
strint Jun 10, 2022
640487b
auto format by CI
oneflow-ci-bot Jun 10, 2022
9928cdd
Merge branch 'master' into feat/zero_mix_with_mp
mergify[bot] Jun 10, 2022
dc4e40d
Merge branch 'master' into feat/zero_mix_with_mp
mergify[bot] Jun 10, 2022
3 changes: 1 addition & 2 deletions docs/source/graph.rst
@@ -20,12 +20,11 @@ Base class for running neural networks in Static Graph Mode.

.. autoclass:: oneflow.nn.graph.graph_config.GraphConfig
:members: enable_amp,
enable_zero,
Review comment (contributor, author): Optimize the ZeRO API.

allow_fuse_model_update_ops,
allow_fuse_add_to_output,
allow_fuse_cast_scale,
set_gradient_accumulation_steps,
set_zero_redundancy_optimizer_mode,
set_zero_redundancy_optimizer_min_size_after_split,
enable_cudnn_conv_heuristic_search_algo,
:member-order: bysource

3 changes: 3 additions & 0 deletions oneflow/core/job/eager_nccl_comm_manager.cpp
@@ -71,6 +71,9 @@ void CreateNcclComm(ncclComm_t* comm, const int dev, const std::string& key,
<< ", nccl_unique_id = " << NcclUniqueId2String(nccl_unique_id) << ", rank = " << rank
<< ", key = {" << key << "}\n";
OF_NCCL_CHECK(ncclCommInitRank(comm, device_vec.size(), nccl_unique_id, rank));
VLOG(2) << " EagerNcclCommMgr::ncclCommInitRank succeed device_vec.size() = " << device_vec.size()
<< ", nccl_unique_id = " << NcclUniqueId2String(nccl_unique_id) << ", rank = " << rank
<< ", key = {" << key << "}\n";
Review comment (contributor, author): Debug logging for EagerNcclCommMgr.

}

} // namespace
6 changes: 6 additions & 0 deletions oneflow/core/job/job_build_and_infer_ctx.cpp
@@ -997,13 +997,19 @@ Maybe<void> LazyJobBuildAndInferCtx::Complete() {
}
};
int32_t pass_cnt = 0;
const int64_t prev_v = FLAGS_v;
auto DoPass = [&](const std::string& pass_name, int32_t cnt = 0) -> Maybe<void> {
VLOG(1) << job_name << " is compiling with pass"
<< " pass_cnt_" + std::to_string(pass_cnt) + "-" + pass_name
<< (cnt > 0 ? std::to_string(cnt) : "");
if (unlikely(NeedLogJob(pass_name))) {
std::string cnt_str = cnt > 0 ? std::to_string(cnt) : "";
LogJob("pass_cnt_" + std::to_string(pass_cnt) + "-" + pass_name + cnt_str + "-before");
FLAGS_v = 3;
Review comment (contributor, author): While a pass is being debugged, the glog verbosity (FLAGS_v) is raised to 3 for the duration of that pass.

}
JUST(JobPass4Name(pass_name)(mut_job(), &job_pass_ctx));
if (unlikely(NeedLogJob(pass_name))) {
FLAGS_v = prev_v;
std::string cnt_str = cnt > 0 ? std::to_string(cnt) : "";
LogJob("pass_cnt_" + std::to_string(pass_cnt) + "-" + pass_name + cnt_str + "-after");
}
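
Note on the hunk above: FLAGS_v is restored only on the line after JUST(...). Assuming JUST returns early when the pass fails, a failing pass would leave the verbosity at 3. A scope guard is one way to make the restore unconditional. The following is a minimal sketch, not the code in this PR: g_verbosity stands in for glog's FLAGS_v so the sketch builds without glog, and ScopedVerbosity and RunPass are hypothetical names.

```cpp
#include <cassert>
#include <functional>

// Stand-in for glog's FLAGS_v (assumption: keeps the sketch free of glog).
static int g_verbosity = 0;

// Hypothetical scope guard: raises the verbosity when active and restores
// the previous value when the scope is left, on any exit path.
class ScopedVerbosity {
 public:
  ScopedVerbosity(bool active, int level) : active_(active), prev_(g_verbosity) {
    if (active_) { g_verbosity = level; }
  }
  ~ScopedVerbosity() {
    if (active_) { g_verbosity = prev_; }
  }
  ScopedVerbosity(const ScopedVerbosity&) = delete;
  ScopedVerbosity& operator=(const ScopedVerbosity&) = delete;

 private:
  bool active_;
  int prev_;
};

// Models DoPass: returns false for a failing pass (the real code returns
// Maybe<void> and propagates the error).
bool RunPass(bool need_log, const std::function<bool()>& pass) {
  ScopedVerbosity guard(need_log, 3);
  if (!pass()) { return false; }  // early return; the guard still restores
  return true;
}
```

With this shape the restore happens in the destructor, so the success path and the error path behave the same way.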
1 change: 1 addition & 0 deletions oneflow/core/job/job_builder.cpp
@@ -18,6 +18,7 @@ limitations under the License.
#include "oneflow/core/common/util.h"
#include "oneflow/core/common/container_util.h"
#include "oneflow/core/job/job.pb.h"
#include "oneflow/core/job/sbp_parallel.pb.h"
#include "oneflow/core/operator/operator.h"

namespace oneflow {
1 change: 1 addition & 0 deletions oneflow/core/job/job_conf.proto
@@ -211,6 +211,7 @@ message JobConfigProto {
optional bool enable_gradients_stats_aggregation = 106 [default = true];
optional string optimizer_placement_optimization_mode = 107;
optional int64 optimizer_placement_optimization_threshold = 108 [default = 1024];
optional int64 optimizer_placement_optimization_shard_restore_level = 110 [default = 2];

optional QatConfig qat_config = 109;
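
Note on the new field: in proto2, an optional field with a declared default reads back that default while unset, so existing job configs that never set optimizer_placement_optimization_shard_restore_level would read as 2. The sketch below imitates that accessor behavior with a hand-written class. It is not protoc output, and JobConfSketch and its shortened accessor names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the accessors generated for
//   optional int64 optimizer_placement_optimization_shard_restore_level = 110 [default = 2];
class JobConfSketch {
 public:
  // True only after the field has been set explicitly.
  bool has_shard_restore_level() const { return has_; }
  // Returns the declared default (2) while the field is unset.
  int64_t shard_restore_level() const { return has_ ? value_ : 2; }
  void set_shard_restore_level(int64_t v) {
    value_ = v;
    has_ = true;
  }
  void clear_shard_restore_level() { has_ = false; }

 private:
  bool has_ = false;
  int64_t value_ = 0;
};
```

The field number (110) being declared before 109 has no effect on this behavior, since proto field numbers only need to be unique within the message.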
