Feat/mut resource eagerly #7267

Merged on Jan 26, 2022 (65 commits)
0eff454
debug
strint Nov 12, 2021
8db9f36
modify graph.py
oneflow-ci-bot Nov 28, 2021
71ebc30
fix bug about graph debug interface
oneflow-ci-bot Nov 29, 2021
cb5d556
fix a bug about graph debug
oneflow-ci-bot Nov 29, 2021
b7f504d
Merge branch 'master' into fdz
grybd Nov 29, 2021
2ea13dd
Merge branch 'master' into fdz
oneflow-ci-bot Nov 29, 2021
71717f3
Fix nn graph variable bind (#6895)
wyg1997 Dec 1, 2021
e208b99
Merge branch 'master' into fdz
strint Dec 1, 2021
171a69a
Merge remote-tracking branch 'origin/fdz' into refactor-nn_graph_vari…
wyg1997 Dec 1, 2021
6963a37
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Dec 29, 2021
b4daa4c
hack check
strint Dec 30, 2021
6dcb1ab
Merge branch 'feat/zero' of https://github.com/Oneflow-Inc/oneflow in…
strint Dec 31, 2021
d01784d
add test
strint Dec 31, 2021
b777104
refine test
strint Dec 31, 2021
bc5b9e9
refine test
strint Jan 3, 2022
4a64a7e
refine code
strint Jan 3, 2022
8af682a
Merge branch 'feat/zero' into feat/zero1
strint Jan 3, 2022
f19a8ac
add and refine zero
strint Jan 3, 2022
8c29944
fix test
strint Jan 3, 2022
7fbb341
refine code
strint Jan 7, 2022
c0f1e2d
rm debug log
strint Jan 7, 2022
b7b96e6
refine min size set
strint Jan 7, 2022
429e886
add note
strint Jan 11, 2022
1c37e66
debug zero
strint Jan 11, 2022
c57a48a
fix cudnn config
strint Jan 11, 2022
1b1835c
refine test doc
strint Jan 14, 2022
f03aa4a
add comment of check
strint Jan 14, 2022
1331fc0
Merge branch 'master' into feat/zero
strint Jan 14, 2022
b2186b1
Merge branch 'master' of https://github.com/Oneflow-Inc/oneflow into …
strint Jan 14, 2022
5931111
Merge branch 'feat/zero' of https://github.com/Oneflow-Inc/oneflow in…
strint Jan 14, 2022
55bfd96
eager mode in graph pass
strint Jan 14, 2022
6ed517b
format
strint Jan 14, 2022
5a202c6
rebuid parameter according to sbp in synced plan
strint Jan 14, 2022
b693c1f
auto format by CI
oneflow-ci-bot Jan 14, 2022
06fb3c9
fix code check
strint Jan 14, 2022
77899a2
fix
strint Jan 14, 2022
d7ac1a9
fix test
strint Jan 14, 2022
c5e281f
Merge branch 'master' into feat/zero
oneflow-ci-bot Jan 14, 2022
d8e763a
try init session at graph init
strint Jan 15, 2022
e006a09
refine and revert session init
strint Jan 15, 2022
cdac2e6
Merge branch 'feat/zero' of https://github.com/Oneflow-Inc/oneflow in…
strint Jan 15, 2022
c7521ac
rm useless code
strint Jan 15, 2022
70eea4d
add back print of sys conf
strint Jan 15, 2022
cb76874
prototype
strint Jan 15, 2022
f801b22
add impl
strint Jan 15, 2022
b13411c
finish prototype
strint Jan 15, 2022
209a3b7
test pass
strint Jan 15, 2022
a400ca1
simplify set
strint Jan 15, 2022
81f0636
Merge branch 'master' into feat/mut_resource_eagerly
strint Jan 17, 2022
86da4e8
Update config_util.py
strint Jan 17, 2022
d5a5c74
Merge branch 'master' into feat/mut_resource_eagerly
strint Jan 24, 2022
1b60b54
auto format by CI
oneflow-ci-bot Jan 24, 2022
3f60f6f
Merge branch 'master' into feat/mut_resource_eagerly
strint Jan 25, 2022
a52af3c
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
d86135e
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
f46f14b
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
f07a468
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
0765dfb
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
75faa13
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
77c5f53
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 25, 2022
9344cc1
Merge branch 'master' into feat/mut_resource_eagerly
strint Jan 26, 2022
7b30ad4
Merge branch 'master' into feat/mut_resource_eagerly
strint Jan 26, 2022
c176dcf
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 26, 2022
cbf1272
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 26, 2022
d990a54
Merge branch 'master' into feat/mut_resource_eagerly
oneflow-ci-bot Jan 26, 2022
1 change: 1 addition & 0 deletions oneflow/api/python/session/session.cpp
@@ -33,6 +33,7 @@ ONEFLOW_API_PYBIND11_MODULE("", m) {
  // multi-client lazy global session context
  m.def("CreateMultiClientSessionContext", &CreateMultiClientSessionContext);
  m.def("InitMultiClientSessionContext", &InitMultiClientSessionContext);
  m.def("MultiClientSessionContextUpdateResource", &MultiClientSessionContextUpdateResource);
  m.def("MultiClientSessionContextAddCGraph", &MultiClientSessionContextAddCGraph);
  m.def("TryDestroyMultiClientSessionContext", &TryDestroyMultiClientSessionContext);

11 changes: 11 additions & 0 deletions oneflow/api/python/session/session.h
@@ -18,6 +18,7 @@ limitations under the License.

#include <string>
#include <google/protobuf/text_format.h>
#include "oneflow/api/python/session/session_api.h"
#include "oneflow/core/common/protobuf.h"
#include "oneflow/core/control/ctrl_client.h"
#include "oneflow/core/control/global_process_ctx.h"
@@ -128,8 +129,18 @@ inline Maybe<void> InitMultiClientSessionContext(const std::string& config_proto
  return Maybe<void>::Ok();
}

inline Maybe<void> MultiClientSessionContextUpdateResource(const std::string& resource_proto_str) {
  CHECK_NOTNULL_OR_RETURN(Global<MultiClientSessionContext>::Get());
  Resource reso_proto;
  CHECK_OR_RETURN(TxtString2PbMessage(resource_proto_str, &reso_proto))
      << "failed to parse config_proto: " << resource_proto_str;
  JUST(Global<MultiClientSessionContext>::Get()->UpdateResource(reso_proto));
  return Maybe<void>::Ok();
}

inline Maybe<void> MultiClientSessionContextAddCGraph(
    const std::shared_ptr<oneflow::NNGraph>& c_graph_ptr) {
  CHECK_NOTNULL_OR_RETURN(Global<MultiClientSessionContext>::Get());
  JUST(Global<MultiClientSessionContext>::Get()->AddCGraph(c_graph_ptr));
  return Maybe<void>::Ok();
}

4 changes: 4 additions & 0 deletions oneflow/api/python/session/session_api.h
@@ -43,6 +43,10 @@ inline void InitMultiClientSessionContext(const std::string& config_proto_str) {
  return oneflow::InitMultiClientSessionContext(config_proto_str).GetOrThrow();
}

inline void MultiClientSessionContextUpdateResource(const std::string& resource_proto_str) {
  return oneflow::MultiClientSessionContextUpdateResource(resource_proto_str).GetOrThrow();
}

inline void MultiClientSessionContextAddCGraph(
    const std::shared_ptr<oneflow::NNGraph>& c_graph_ptr) {
  return oneflow::MultiClientSessionContextAddCGraph(c_graph_ptr).GetOrThrow();

8 changes: 8 additions & 0 deletions oneflow/core/framework/multi_client_session_context.cpp
@@ -15,9 +15,11 @@ limitations under the License.
*/

#include "oneflow/core/common/buffer_manager.h"
#include "oneflow/core/common/maybe.h"
#include "oneflow/core/common/multi_client.h"
#include "oneflow/core/framework/multi_client_session_context.h"
#include "oneflow/core/framework/load_library.h"
#include "oneflow/core/job/resource.pb.h"
#include "oneflow/core/job/version.h"
#include "oneflow/core/job/global_for.h"
#include "oneflow/core/job/id_manager.h"
@@ -117,6 +119,12 @@ Maybe<void> MultiClientSessionContext::TryInit(const ConfigProto& config_proto)
  return Maybe<void>::Ok();
}

Maybe<void> MultiClientSessionContext::UpdateResource(const Resource& reso_proto) {
  CHECK_NOTNULL_OR_RETURN((Global<ResourceDesc, ForSession>::Get()));
  Global<ResourceDesc, ForSession>::Get()->Update(reso_proto);
  return Maybe<void>::Ok();
}

Maybe<void> MultiClientSessionContext::AddCGraph(
    const std::shared_ptr<oneflow::NNGraph>& c_graph_ptr) {
  graphs_.emplace_back(c_graph_ptr);

1 change: 1 addition & 0 deletions oneflow/core/framework/multi_client_session_context.h
@@ -31,6 +31,7 @@ class MultiClientSessionContext {
  ~MultiClientSessionContext() {}

  Maybe<void> TryInit(const ConfigProto& config_proto);
  Maybe<void> UpdateResource(const Resource& reso_proto);
  Maybe<void> AddCGraph(const std::shared_ptr<oneflow::NNGraph>& c_graph_ptr);
  Maybe<void> TryClose();

11 changes: 11 additions & 0 deletions oneflow/core/job/resource_desc.cpp
@@ -14,6 +14,7 @@ See the License for the specific language governing permissions and
limitations under the License.
*/
#include <algorithm>
#include "oneflow/core/job/resource.pb.h"
#include "oneflow/core/job/resource_desc.h"
#include "oneflow/core/common/util.h"
#include "oneflow/core/control/global_process_ctx.h"
@@ -110,4 +111,14 @@ void ResourceDesc::DumpCudnnConf(const JobConfigProto& job_conf) {
  }
}

void ResourceDesc::Update(const Resource& reso_conf) {
  if (reso_conf.has_nccl_use_compute_stream()) {
    resource_.set_nccl_use_compute_stream(reso_conf.nccl_use_compute_stream());
  }
  if (reso_conf.has_disable_group_boxing_by_dst_parallel()) {
    resource_.set_disable_group_boxing_by_dst_parallel(
        reso_conf.disable_group_boxing_by_dst_parallel());
  }
}

}  // namespace oneflow

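`ResourceDesc::Update` above copies a field only when the incoming message explicitly carries it (the `has_*()` checks), so an update that sets one flag leaves the other as it was. A minimal Python sketch of that merge rule follows. It uses `None` as a stand-in for an unset optional field; `ResourceSketch` and `update` are hypothetical names for illustration, not OneFlow code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ResourceSketch:
    """Hypothetical stand-in for the Resource message; None means 'field not set'."""

    nccl_use_compute_stream: Optional[bool] = None
    disable_group_boxing_by_dst_parallel: Optional[bool] = None


def update(current: ResourceSketch, incoming: ResourceSketch) -> ResourceSketch:
    """Copy only the fields that `incoming` sets, mirroring the has_*() checks."""
    changes = {
        name: value
        for name, value in vars(incoming).items()
        if value is not None  # an explicit False still counts as "set"
    }
    return replace(current, **changes)


current = ResourceSketch(
    nccl_use_compute_stream=True,
    disable_group_boxing_by_dst_parallel=True,
)
# An update carrying a single field leaves the other field untouched.
merged = update(current, ResourceSketch(nccl_use_compute_stream=False))
print(merged)
```

Note that, as written in the diff, only these two fields are forwarded; any other field set on the incoming message would be ignored by `Update`.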
1 change: 1 addition & 0 deletions oneflow/core/job/resource_desc.h
@@ -56,6 +56,7 @@ class ResourceDesc final {
  bool enable_tensor_float_32_compute() const { return resource_.enable_tensor_float_32_compute(); }
  const Resource& resource() const { return resource_; }
  void DumpCudnnConf(const JobConfigProto& job_conf);
  void Update(const Resource& reso_conf);

 private:
  Resource resource_;

32 changes: 16 additions & 16 deletions python/oneflow/framework/config_util.py
@@ -17,11 +17,22 @@
import traceback

import oneflow._oneflow_internal
import oneflow.core.job.resource_pb2 as resource_util
import oneflow.framework.hob as hob
import oneflow.framework.session_context as session_ctx
import oneflow.support.enable_if as enable_if


def _set_attr_to_resource(attr_name, attr_value):
    sess = session_ctx.GetDefaultSession()
    if sess.status_ == sess.Status.INITED:
        reso_config = resource_util.Resource()
        setattr(reso_config, attr_name, attr_value)
        sess.update_resource_eagerly(reso_config)
    else:
        setattr(sess.config_proto.resource, attr_name, attr_value)


def api_load_library(val: str) -> None:
    """Load necessary library for job

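The `_set_attr_to_resource` helper above routes a setting by session status: before initialization it writes into the pending config, and after initialization it sends a one-field update to the live resource. The sketch below imitates that routing with plain dictionaries. `SessionSketch`, its attributes, and `set_attr_to_resource` are hypothetical stand-ins, not the OneFlow session class.

```python
from enum import Enum, auto


class Status(Enum):
    CREATED = auto()
    INITED = auto()


class SessionSketch:
    """Hypothetical session: a pending config plus a live resource once inited."""

    def __init__(self):
        self.status = Status.CREATED
        self.pending = {}  # plays the role of sess.config_proto.resource
        self.live = {}  # plays the role of the session-wide ResourceDesc
        self.eager_updates = 0

    def init(self):
        # Initialization snapshots the pending config into the live resource.
        self.live = dict(self.pending)
        self.status = Status.INITED

    def update_resource_eagerly(self, partial):
        assert self.status is Status.INITED
        self.live.update(partial)
        self.eager_updates += 1


def set_attr_to_resource(sess, name, value):
    """Route a setting to the live resource if inited, else to the pending config."""
    if sess.status is Status.INITED:
        sess.update_resource_eagerly({name: value})
    else:
        sess.pending[name] = value


sess = SessionSketch()
set_attr_to_resource(sess, "nccl_use_compute_stream", True)  # before init: buffered
sess.init()
set_attr_to_resource(sess, "nccl_use_compute_stream", False)  # after init: eager
print(sess.live, sess.eager_updates)
```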
@@ -347,14 +358,8 @@ def api_nccl_use_compute_stream(val: bool = False) -> None:
     Args:
         val (bool, optional): True or False. Defaults to False.
     """
-    return enable_if.unique([nccl_use_compute_stream, do_nothing])(val=val)
-
-
-@enable_if.condition(hob.in_normal_mode & ~hob.session_initialized)
-def nccl_use_compute_stream(val=False):
-    sess = session_ctx.GetDefaultSession()
     assert type(val) is bool
-    sess.config_proto.resource.nccl_use_compute_stream = val
+    _set_attr_to_resource("nccl_use_compute_stream", val)


 def api_disable_group_boxing_by_dst_parallel(val: bool = False) -> None:
@@ -363,14 +368,8 @@ def api_disable_group_boxing_by_dst_parallel(val: bool = False) -> None:
     Args:
         val (bool, optional): True or False. Defaults to False.
     """
-    return enable_if.unique([disable_group_boxing_by_dst_parallel, do_nothing])(val=val)
-
-
-@enable_if.condition(hob.in_normal_mode & ~hob.session_initialized)
-def disable_group_boxing_by_dst_parallel(val=False):
-    sess = session_ctx.GetDefaultSession()
     assert type(val) is bool
-    sess.config_proto.resource.disable_group_boxing_by_dst_parallel = val
+    _set_attr_to_resource("disable_group_boxing_by_dst_parallel", val)


 def api_nccl_num_streams(val: int) -> None:
@@ -553,5 +552,6 @@ def nccl_enable_mixed_fusion(val):

 @enable_if.condition(hob.in_normal_mode & hob.session_initialized)
 def do_nothing(*args, **kwargs):
-    print("Nothing happened because the session is running")
-    return False
+    raise NotImplementedError(
+        "This action donot working because session is initialized."
+    )

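With this change, settings that still go through the `do_nothing` fallback fail loudly after session initialization; before, the call printed a message and returned `False`. The following sketch illustrates that behaviour change with a simplified dispatcher. `pick_unique` and the two handlers are hypothetical; they only imitate the idea of selecting the single handler whose condition holds and are not the `enable_if` implementation.

```python
def pick_unique(candidates, session_initialized):
    """Hypothetical dispatcher: return the single handler whose condition holds."""
    matches = [fn for cond, fn in candidates if cond(session_initialized)]
    assert len(matches) == 1, "exactly one handler must match"
    return matches[0]


def set_before_init(val):
    return f"stored {val} in the pending config"


def reject_after_init(*args, **kwargs):
    # Behaviour after this change: raise rather than print and return False.
    raise NotImplementedError(
        "setting is not supported after the session is initialized"
    )


candidates = [
    (lambda inited: not inited, set_before_init),
    (lambda inited: inited, reject_after_init),
]

print(pick_unique(candidates, session_initialized=False)(True))

try:
    pick_unique(candidates, session_initialized=True)(True)
except NotImplementedError as err:
    print("rejected:", err)
```

Raising makes a misplaced configuration call visible at the call site, whereas the earlier silent `False` return was easy to miss.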
7 changes: 7 additions & 0 deletions python/oneflow/framework/multi_client_session.py
@@ -119,3 +119,10 @@ def _update_function_flag_name2defaultVal(self):
    def _update_scope_attr_name2defaultVal(self):
        items = c_api_util.GetScopeConfigDef().attr_name2attr_def.items()
        self.scope_attr_name2default_val_ = {k: v.default_val for (k, v) in items}

    def update_resource_eagerly(self, resource_config):
        self._check_status(self.Status.INITED)
        config_proto_str = text_format.MessageToString(resource_config)
        oneflow._oneflow_internal.MultiClientSessionContextUpdateResource(
            config_proto_str
        )

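`update_resource_eagerly` hands the update to C++ as a text-format string, which `MultiClientSessionContextUpdateResource` parses back into a `Resource` and rejects if parsing fails. The sketch below imitates that string boundary using only the standard library. The `name: true|false` line format resembles protobuf text format for boolean fields, but `to_text`, `from_text`, and `KNOWN_BOOL_FIELDS` are hypothetical helpers and do not use the protobuf library.

```python
KNOWN_BOOL_FIELDS = {
    "nccl_use_compute_stream",
    "disable_group_boxing_by_dst_parallel",
}


def to_text(fields):
    """Serialize the set fields as 'name: true|false' lines."""
    return "".join(
        f"{name}: {'true' if value else 'false'}\n" for name, value in fields.items()
    )


def from_text(text):
    """Parse the lines back into a dict; raise ValueError on anything unrecognized."""
    parsed = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sep, raw = line.partition(":")
        name, raw = name.strip(), raw.strip()
        if not sep or name not in KNOWN_BOOL_FIELDS or raw not in ("true", "false"):
            raise ValueError(f"failed to parse resource text: {line!r}")
        parsed[name] = raw == "true"
    return parsed


text = to_text({"nccl_use_compute_stream": False})
print(repr(text))
print(from_text(text))
```

Only the fields present in the string reach the receiving side, which is what lets the update stay partial.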
10 changes: 10 additions & 0 deletions python/oneflow/test/graph/test_optimization_conf.py
@@ -75,9 +75,19 @@ def build(self, x):
        g = CustomGraphSysConf()

        print("optimization conf: \n", g._optimization_conf_proto)
        test_case.assertTrue(g._optimization_conf_proto.nccl_use_compute_stream)
        g._generate_config_proto()
        print("graph conf: \n", g._config_proto)

        flow.boxing.nccl.enable_use_compute_stream(False)
        test_case.assertTrue(not g._optimization_conf_proto.nccl_use_compute_stream)
        flow.boxing.nccl.disable_group_boxing_by_dst_parallel(False)
        test_case.assertTrue(
            not g._optimization_conf_proto.disable_group_boxing_by_dst_parallel
        )

        print("optimization conf after session init: \n", g._optimization_conf_proto)


if __name__ == "__main__":
    unittest.main()