
1838: add WeightedCommunicationVolume load model #1888

Merged: 15 commits into develop on Oct 5, 2022

Conversation

@cz4rs (Contributor) commented Aug 2, 2022

fixes #1838

github-actions bot commented Aug 2, 2022

Pipelines results

PR tests: all pipelines compiled successfully and passed testing. Every configuration was built for commit 40a4635, except gcc-11 (ubuntu, mpich), which was built for eff4e41:

  • gcc-5, gcc-6, gcc-11, gcc-12 (ubuntu, mpich)
  • gcc-7 (ubuntu, mpich, trace runtime, LB)
  • gcc-8 (ubuntu, mpich, address sanitizer)
  • gcc-9 (ubuntu, mpich, zoltan)
  • gcc-10 (ubuntu, openmpi, no LB)
  • gcc-11 (ubuntu, mpich, json schema test)
  • clang-3.9, clang-5.0, clang-9, clang-10, clang-11, clang-12, clang-13, clang-14 (ubuntu, mpich)
  • clang-13 (alpine, mpich)
  • nvidia cuda 10.1, nvidia cuda 11.0 (ubuntu, mpich)
  • intel icpc, intel icpx (ubuntu, mpich)

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from f178e0b to 055f8ee, on August 4, 2022 14:26
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 055f8ee to bb2291d on August 4, 2022 14:38
@cz4rs marked this pull request as ready for review on August 4, 2022 15:28
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from bb2291d to efc2016 on August 5, 2022 13:15
codecov bot commented Aug 5, 2022

Codecov Report

Merging #1888 (40a4635) into develop (288d612) will increase coverage by 0.03%.
The diff coverage is 93.93%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1888      +/-   ##
===========================================
+ Coverage    84.32%   84.35%   +0.03%     
===========================================
  Files          728      730       +2     
  Lines        25508    25569      +61     
===========================================
+ Hits         21509    21570      +61     
  Misses        3999     3999              
Impacted Files Coverage Δ
...c/vt/vrt/collection/balance/model/composed_model.h 100.00% <ø> (ø)
...t/vrt/collection/balance/model/weighted_messages.h 100.00% <ø> (ø)
.../vt/vrt/collection/balance/temperedlb/temperedlb.h 100.00% <ø> (ø)
...vrt/collection/balance/temperedwmin/temperedwmin.h 100.00% <ø> (ø)
.../test_model_weighted_communication_volume.nompi.cc 91.83% <91.83%> (ø)
.../vt/vrt/collection/balance/model/composed_model.cc 100.00% <100.00%> (ø)
...ion/balance/model/weighted_communication_volume.cc 100.00% <100.00%> (ø)
...tion/balance/model/weighted_communication_volume.h 100.00% <100.00%> (ø)
...vt/vrt/collection/balance/temperedlb/temperedlb.cc 70.82% <100.00%> (ø)
...rt/collection/balance/temperedwmin/temperedwmin.cc 76.00% <100.00%> (+3.27%) ⬆️
... and 15 more

@nlslatt (Collaborator) commented Aug 5, 2022

When TemperedWMin is used, how will this load model get set up? I can see three potential options: (a) make the app change the load model based on what LB was selected for that phase, which seems very awkward; (b) have TemperedWMin compose this load model on top of whatever the app already set up; or (c) make it possible to choose load models from the command-line like we do with load balancers already. I would lean toward (b), which might motivate leaving the TemperedWMin LB args as they are instead of removing them.
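Option (b) can be sketched concretely. The following is a minimal, self-contained illustration, not the actual vt API: all class names, signatures, and weight values here are hypothetical stand-ins. The idea is that the strategy wraps whatever LoadModel the application installed and folds a weighted communication-volume term into the modeled load it reports.

```cpp
#include <memory>

// Illustrative stand-ins for the real vt interfaces; names and signatures
// here are hypothetical, not the actual vt API.
struct LoadModel {
  virtual ~LoadModel() = default;
  // Modeled load (e.g. compute time) for a given object id.
  virtual double getModeledLoad(int object) const = 0;
};

// Whatever model the application already set up, e.g. raw measured load.
struct AppModel : LoadModel {
  double getModeledLoad(int) const override { return 2.0; }
};

// Option (b): compose on top of the app's model, folding a weighted
// communication-volume term into the reported load.
struct WeightedVolumeModel : LoadModel {
  WeightedVolumeModel(
    std::shared_ptr<LoadModel> base, double per_byte_weight, double bytes
  ) : base_(std::move(base)), per_byte_weight_(per_byte_weight), bytes_(bytes) {}

  double getModeledLoad(int object) const override {
    return base_->getModeledLoad(object) + per_byte_weight_ * bytes_;
  }

private:
  std::shared_ptr<LoadModel> base_;
  double per_byte_weight_;
  double bytes_;  // communication volume attributed to the object
};
```

Under this sketch, TemperedWMin would construct a WeightedVolumeModel around the app's model at LB time, and the app never has to know which balancer was selected.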

@PhilMiller (Member) commented:

I was kind of expecting option B, if the lb strategy is designed with an assumption that it's working over this model

virtual TimeType getModeledWork(ElementIDStruct object, PhaseOffset when) {
  return {};
}

@PhilMiller (Member) commented Aug 10, 2022

This might be me being difficult, but can we just do without this, and have the new model override getModeledLoad() like everything else does? An object's 'load' is really just its contribution to the abstract cost function that whatever LB strategy is trying to minimize. There's no reason to distinguish some kind of 'normal' from 'special' modeled load. It means whatever the model wants it to mean, and whatever the strategy interprets it as.

A Member commented:

This is probably a point that @ppebay would need to stand up for, since my understanding is that @cz4rs you're just implementing it as requested.

@cz4rs (Contributor, Author) commented:

I can see your point here, and it doesn't seem like there's much use for getModeledWork outside of TemperedWMin anyway.

@cz4rs (Contributor, Author) commented:

On the other hand, here's the last 6 commits to load_model.h:

f6ebc31a0 #1672: lb: rename `getComm` to `getModeledComm`
543f320a2 #1672: lb: rename `getLoadMetric` to `getModeledLoad`
1a2ee2e75 #1672: rename `getLoad` to `getLoadMetric`
35b554444 #1672: lb: add TemperedWMin load balancer
26e478bba #1672: lb: add getTotalWork and getComm methods
63c7ee10d #1672: lb: rename getWork to getLoad

There's been plenty of renaming already (note that getModeledLoad() started out as getWork()), so I'd like to make sure we have a consensus about this.

A Collaborator commented:

Do any of the load balancers assume that load summed over ranks is an invariant across LB migrations? If so, then it could be important to allow a load balancer to choose pure load even if the load model in use knows about communication.

@ppebay (Contributor) commented Aug 16, 2022

So getModeledLoad() refers to a notion of a "modeled load" that is itself a generic understanding of "load" (i.e., one that might include communication)? In that sense, would modeledLoad be a replacement for Work?

@ppebay (Contributor) commented Aug 16, 2022

In LBAF we have the following concepts:

  • "Time": the compute time required by an object -- assumed to be constant irrespective of the rank to which it is assigned;
  • "Load": the sum total of all object times -- a per-rank quantity;
  • "Weight": the amount of data (in bytes?) associated with object-to-object communication (which can be aggregated at the rank level);
  • "Work": an (affine) combination of the above -- a per-rank quantity;
  • "Total Work": the sum total of "Work" across the entire collection of ranks -- note that this remains constant when only object times are considered.
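The per-rank quantities above can be written as a small sketch. This is a hypothetical reading of the definitions, not code from LBAF; the coefficient names alpha, beta, and gamma are placeholders for whatever the affine combination actually uses.

```cpp
#include <numeric>
#include <vector>

// Hypothetical sketch of the LBAF notions above; names are illustrative,
// not taken from the LBAF code base.

// "Load": the sum total of all object times on a rank.
double rankLoad(std::vector<double> const& object_times) {
  return std::accumulate(object_times.begin(), object_times.end(), 0.0);
}

// "Work": an affine combination of per-rank Load and communication Weight.
double rankWork(double load, double comm_weight,
                double alpha, double beta, double gamma) {
  return alpha * load + beta * comm_weight + gamma;
}
```

With beta = 0 (and gamma = 0), Work reduces to Load, which is why Total Work stays constant when only object times are considered.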

@cz4rs (Contributor, Author) commented Aug 17, 2022

In the LoadModel interface (the most general one), we currently have:

  • getModeledLoad(...)
    • Provide an estimate of the given object's load during a specified interval
    • How much computation time the object is estimated to require
  • getRawLoad(...)
    • Provide the given object's raw load during a specified interval
    • How much computation time the object required
  • getModeledComm(...)
    • Provide an estimate of the communication cost for a given object during a specified interval
    • How much communication time the object is estimated to require
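Read as an interface, the three queries above might look like the following simplified sketch; the real vt declarations take ElementIDStruct and PhaseOffset arguments and return TimeType, so the signatures here are illustrative only.

```cpp
// Simplified sketch of the three LoadModel queries listed above; the real vt
// declarations take ElementIDStruct / PhaseOffset and may differ in detail.
struct LoadModelSketch {
  virtual ~LoadModelSketch() = default;
  // Estimated computation time the object will require over the interval.
  virtual double getModeledLoad(int object, int phase) const = 0;
  // Measured computation time the object actually required.
  virtual double getRawLoad(int object, int phase) const = 0;
  // Estimated communication cost for the object over the interval.
  virtual double getModeledComm(int object, int phase) const = 0;
};

// A trivial model for demonstration purposes only.
struct ConstantModel : LoadModelSketch {
  double getModeledLoad(int, int) const override { return 1.5; }
  double getRawLoad(int, int) const override { return 2.0; }
  double getModeledComm(int, int) const override { return 0.5; }
};
```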

@cz4rs (Contributor, Author) commented Aug 17, 2022

In both the WeightedMessages and CommOverhead models, the cost of communication is computed using:

  • per_msg_weight_ - weight to add per message received
  • per_byte_weight_ - weight to add per byte received
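The weighting described above amounts to a simple linear cost rule. Here is a free-function sketch of it; the real models keep these values as the per_msg_weight_ and per_byte_weight_ members, so the function form below is purely illustrative.

```cpp
// Sketch of the linear communication-cost rule used by the weighted models:
//   cost = per_msg_weight * messages + per_byte_weight * bytes
// Free-function form for illustration only; the actual models store these
// weights as the per_msg_weight_ / per_byte_weight_ members mentioned above.
double commCost(double num_msgs, double num_bytes,
                double per_msg_weight, double per_byte_weight) {
  return per_msg_weight * num_msgs + per_byte_weight * num_bytes;
}
```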

@cz4rs (Contributor, Author) commented:

@lifflander TL;DR: c72e02a implements Phil's suggestion from the first comment: do not add getModeledWork; use getModeledLoad instead. Do you think that's the right direction, or should we stick with getModeledWork?

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 3441b9e to 7f49544 on August 12, 2022 13:57
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 7f49544 to c72e02a on August 17, 2022 16:49
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from 64c19db to eff4e41, on September 15, 2022 10:53
@cz4rs requested a review from PhilMiller on September 15, 2022 12:21
@cz4rs (Contributor, Author) commented Sep 15, 2022

The gcc-7 failure is unrelated.

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from 47eb740 to 2c34c25, on September 20, 2022 17:28
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 2c34c25 to f317ef0 on September 30, 2022 13:04
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from f317ef0 to 40a4635 on October 3, 2022 10:16
@nlslatt (Collaborator) left a comment:

Looks good

@nlslatt dismissed PhilMiller's stale review on October 5, 2022 17:15 (requested changes made).

@nlslatt merged commit eb27363 into develop on Oct 5, 2022.

Successfully merging this pull request may close these issues:

  • move getModeledWork out of TemperedWMin

4 participants