Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

582 Add ability to model load values used by LB strategies #897

Merged
merged 63 commits into from
Jul 23, 2020

Conversation

PhilMiller
Copy link
Member

@PhilMiller PhilMiller commented Jun 26, 2020

Fixes: #582

@PhilMiller
Copy link
Member Author

I think the key thing I want to push through with this is a relatively straighforward way for application code to interpose a different model, so that we can do as much experimentation as possible without crossing the streams of modifying and recompiling both vt and application code.

@PhilMiller
Copy link
Member Author

PhilMiller commented Jun 26, 2020

A couple models that it may make sense to bake into vt as generic facilities are l norms with various powers and max, each applicable over a configurable subset of subphases.

@PhilMiller PhilMiller changed the title 582 model loads 582 Add ability to model load values used by LB strategies Jun 26, 2020
@codecov
Copy link

codecov bot commented Jun 26, 2020

Codecov Report

Merging #897 into develop will increase coverage by 0.10%.
The diff coverage is 94.79%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #897      +/-   ##
===========================================
+ Coverage    82.79%   82.90%   +0.10%     
===========================================
  Files          356      361       +5     
  Lines        11232    11333     +101     
===========================================
+ Hits          9300     9396      +96     
- Misses        1932     1937       +5     
Impacted Files Coverage Δ
src/vt/vrt/collection/balance/elm_stats.impl.h 92.68% <ø> (ø)
src/vt/vrt/collection/balance/lb_comm.h 9.09% <0.00%> (-2.03%) ⬇️
src/vt/vrt/collection/manager.h 100.00% <ø> (ø)
src/vt/vrt/collection/manager.impl.h 87.26% <ø> (ø)
src/vt/utils/stats/linear_regression.h 95.83% <95.83%> (ø)
...t/collection/test_model_per_collection.extended.cc 96.07% <96.07%> (ø)
...c/vt/vrt/collection/balance/lb_invoke/lb_manager.h 60.00% <100.00%> (+2.85%) ⬆️
...c/vt/vrt/collection/balance/model/composed_model.h 100.00% <100.00%> (ø)
src/vt/vrt/collection/balance/model/load_model.h 100.00% <100.00%> (ø)
tests/unit/utils/test_linear_regression.cc 100.00% <100.00%> (ø)
... and 7 more

@PhilMiller
Copy link
Member Author

I've overhauled the API to I think rationalize responsibilities between user code and the runtime system, and leave models and strategies relatively testable in isolation, without crazy inter-dependencies.

One thing that's niggling at me is the local vs collective nature of the models. I think we need to define them collectively, and possible ensure that at least some calls to them are made similarly.

@PhilMiller
Copy link
Member Author

On the hypothesis that we may want to implement models that communicate to set up their predictions (e.g. to compute global covariance or other such statistics), I have the updating step in a separate global epoch before LBManager calls the strategy that may depend on the model's output.

@PhilMiller
Copy link
Member Author

One thing that's not really considered in the present design is composition across multiple collections, and collections coming from multiple parts of an application. In some respects, load modeling is a global concern that should be configured at the whole-application/job level. In others, part of the expected benefit of load modeling is being able to import knowledge specific to individual collections into the otherwise oblivious LB system. I don't have a clear idea of how to resolve that tension yet.

One possibility that I've been turning over in my head is whether load models should not be one-off, but rather more of a compositional design. For instance, de-noising might consider some history, and provide an adjusted retrospective load, which would then make a more solid base for a prediction.

@PhilMiller PhilMiller marked this pull request as ready for review July 6, 2020 19:11
@PhilMiller
Copy link
Member Author

Per discussion, possibly replace setFocusedSubphase() with a model that does the same.

@PhilMiller
Copy link
Member Author

I should write up some test cases for the various models provided.

src/vt/vrt/collection/balance/model/naive_persistence.cc Outdated Show resolved Hide resolved
src/vt/vrt/collection/balance/model/load_model.h Outdated Show resolved Hide resolved
src/vt/vrt/collection/balance/model/load_model.h Outdated Show resolved Hide resolved
@PhilMiller PhilMiller marked this pull request as draft July 20, 2020 17:29
@lifflander
Copy link
Collaborator

Overall, this looks like a good direction to me. There is some missing documentation on what these actually do. Also, I'm not sure how the user can control them. Does this solve the problem of multiple collections with different models?

@PhilMiller
Copy link
Member Author

Also, I'm not sure how the user can control them.

Users can control them by constructing their own instances, either of a provided type or one they've defined, and passing it to LBManager::setLoadModel(). I've exposed LBManager::getLoadModel() and LBManager::getBaseLoadModel() so that user instances can stack themselves on top of whatever the system would do by default - i.e. if it's configured with noise filtering.

Does this solve the problem of multiple collections with different models?

I just started implementing a PerCollection model that would switch on what collection an element is part of to call collection-specific models for each. However, I ran into the hurdle that there doesn't seem to be any way to get from ElementIDType to the collection's VirtualProxyType or whatever else I would use to identify/distinguish them. As far as data in ProcStats go, all elements of all collections are just part of one big conglomerated sequence.

@PhilMiller
Copy link
Member Author

Have a look at the commented out bits in per_collection.{h,cc}. If I had an implementation for those, we'd be set.

@PhilMiller PhilMiller force-pushed the 582-model-loads branch 4 times, most recently from b7241e8 to 7f086ed Compare July 21, 2020 16:36
docs/md/lb-manager.md Outdated Show resolved Hide resolved
docs/md/lb-manager.md Outdated Show resolved Hide resolved
@lifflander
Copy link
Collaborator

lifflander commented Jul 22, 2020

@PhilMiller Take a look at the test I just finished

A few things:

  • We should probably provide aliases for all the models. Writing out the namespace is a little annoying.
  • The ObjectIterator segfaults if you don't have any objects. Before I called nextPhase/startPhaseCollective, I tried to iterate through with a for loop. There should have been no objects, but instead it just segfaulted:
  auto model = theLBManager()->getLoadModel();
  for (auto&& obj : *model) { }
==4728== Process terminating with default action of signal 11 (SIGSEGV)
==4728==  Access not within mapped region at address 0xFFFFFFFFFFFFFFD8
==4728==    at 0x4E3C0A: std::_Hashtable<unsigned long, std::pair<unsigned long const, double>, std::allocator<std::pair<unsigned long const, double> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_begin() const (hashtable.h:377)
==4728==    by 0x4E3D72: std::_Hashtable<unsigned long, std::pair<unsigned long const, double>, std::allocator<std::pair<unsigned long const, double> >, std::__detail::_Select1st, std::equal_to<unsigned long>, std::hash<unsigned long>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::begin() const (hashtable.h:496)
==4728==    by 0x4E2689: std::unordered_map<unsigned long, double, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, double> > >::begin() const (unordered_map.h:333)
==4728==    by 0x568DF0: vt::vrt::collection::balance::RawData::begin() (raw_data.h:69)
==4728==    by 0x5652CA: vt::vrt::collection::balance::ComposedModel::begin() (composed_model.cc:64)
==4728==    by 0x154071: vt::tests::unit::TestModelPerCollection_test_model_per_collection_1_Test::TestBody() (test_model_per_collection.extended.cc:111)
==4728==    by 0x2EA52E: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2433)
==4728==    by 0x2E4208: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest.cc:2469)
==4728==    by 0x2C1689: testing::Test::Run() (gtest.cc:2508)
==4728==    by 0x2C200E: testing::TestInfo::Run() (gtest.cc:2684)
==4728==    by 0x2C2704: testing::TestSuite::Run() (gtest.cc:2816)
==4728==    by 0x2CE116: testing::internal::UnitTestImpl::RunAllTests() (gtest.cc:5338)

PhilMiller and others added 26 commits July 22, 2020 21:15
Copy link
Collaborator

Codacy Here is an overview of what got changed by this pull request:

Clones added
============
- src/vt/vrt/collection/balance/proc_stats.cc  2
- src/vt/vrt/collection/balance/model/per_collection.h  1
- src/vt/vrt/collection/balance/model/composed_model.h  1
         

See the complete overview on Codacy

@lifflander lifflander merged commit e96ec02 into develop Jul 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Object load modeling in place of implicit short-time persistence
2 participants