
1838: add WeightedCommunicationVolume load model #1888

Merged: 15 commits into develop on Oct 5, 2022

Conversation

@cz4rs (Contributor) commented Aug 2, 2022

fixes #1838

github-actions bot commented Aug 2, 2022

Pipelines results

PR tests: all pipelines compiled successfully and passed testing. Every configuration was built for commit 40a4635, except gcc-11 (ubuntu, mpich), which was built for eff4e41:

  • gcc-5, gcc-6, gcc-11, gcc-12 (ubuntu, mpich)
  • gcc-7 (ubuntu, mpich, trace runtime, LB)
  • gcc-8 (ubuntu, mpich, address sanitizer)
  • gcc-9 (ubuntu, mpich, zoltan)
  • gcc-10 (ubuntu, openmpi, no LB)
  • gcc-11 (ubuntu, mpich, json schema test)
  • clang-3.9, clang-5.0, clang-9, clang-10, clang-11, clang-12, clang-13, clang-14 (ubuntu, mpich)
  • clang-13 (alpine, mpich)
  • nvidia cuda 10.1, nvidia cuda 11.0 (ubuntu, mpich)
  • intel icpc, intel icpx (ubuntu, mpich)

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from f178e0b to 055f8ee, on August 4, 2022 14:26
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 055f8ee to bb2291d on August 4, 2022 14:38
@cz4rs marked this pull request as ready for review on August 4, 2022 15:28
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from bb2291d to efc2016 on August 5, 2022 13:15
codecov bot commented Aug 5, 2022

Codecov Report

Merging #1888 (40a4635) into develop (288d612) will increase coverage by 0.03%.
The diff coverage is 93.93%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #1888      +/-   ##
===========================================
+ Coverage    84.32%   84.35%   +0.03%     
===========================================
  Files          728      730       +2     
  Lines        25508    25569      +61     
===========================================
+ Hits         21509    21570      +61     
  Misses        3999     3999              
Impacted Files Coverage Δ
...c/vt/vrt/collection/balance/model/composed_model.h 100.00% <ø> (ø)
...t/vrt/collection/balance/model/weighted_messages.h 100.00% <ø> (ø)
.../vt/vrt/collection/balance/temperedlb/temperedlb.h 100.00% <ø> (ø)
...vrt/collection/balance/temperedwmin/temperedwmin.h 100.00% <ø> (ø)
.../test_model_weighted_communication_volume.nompi.cc 91.83% <91.83%> (ø)
.../vt/vrt/collection/balance/model/composed_model.cc 100.00% <100.00%> (ø)
...ion/balance/model/weighted_communication_volume.cc 100.00% <100.00%> (ø)
...tion/balance/model/weighted_communication_volume.h 100.00% <100.00%> (ø)
...vt/vrt/collection/balance/temperedlb/temperedlb.cc 70.82% <100.00%> (ø)
...rt/collection/balance/temperedwmin/temperedwmin.cc 76.00% <100.00%> (+3.27%) ⬆️
... and 15 more

@nlslatt (Collaborator) commented Aug 5, 2022

When TemperedWMin is used, how will this load model get set up? I can see three potential options: (a) make the app change the load model based on what LB was selected for that phase, which seems very awkward; (b) have TemperedWMin compose this load model on top of whatever the app already set up; or (c) make it possible to choose load models from the command-line like we do with load balancers already. I would lean toward (b), which might motivate leaving the TemperedWMin LB args as they are instead of removing them.
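Option (b) can be sketched concretely. The following is a minimal, self-contained illustration, not the actual vt API: all class names, signatures, and weight values here are hypothetical stand-ins. The idea is that the strategy wraps whatever LoadModel the application installed and folds a weighted communication-volume term into the modeled load it reports.

```cpp
#include <memory>

// Illustrative stand-ins for the real vt interfaces; names and signatures
// here are hypothetical, not the actual vt API.
struct LoadModel {
  virtual ~LoadModel() = default;
  // Modeled load (e.g. compute time) for a given object id.
  virtual double getModeledLoad(int object) const = 0;
};

// Whatever model the application already set up, e.g. raw measured load.
struct AppModel : LoadModel {
  double getModeledLoad(int) const override { return 2.0; }
};

// Option (b): compose on top of the app's model, folding a weighted
// communication-volume term into the reported load.
struct WeightedVolumeModel : LoadModel {
  WeightedVolumeModel(
    std::shared_ptr<LoadModel> base, double per_byte_weight, double bytes
  ) : base_(std::move(base)), per_byte_weight_(per_byte_weight), bytes_(bytes) {}

  double getModeledLoad(int object) const override {
    return base_->getModeledLoad(object) + per_byte_weight_ * bytes_;
  }

private:
  std::shared_ptr<LoadModel> base_;
  double per_byte_weight_;
  double bytes_;  // communication volume attributed to the object
};
```

Under this sketch, TemperedWMin would construct a WeightedVolumeModel around the app's model at LB time, and the app never has to know which balancer was selected.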

@PhilMiller (Member) commented:

I was kind of expecting option B, if the lb strategy is designed with an assumption that it's working over this model

virtual TimeType getModeledWork(ElementIDStruct object, PhaseOffset when) {
  return {};
}

@PhilMiller (Member) commented Aug 10, 2022

This might be me being difficult, but can we just do without this, and have the new model override getModeledLoad() like everything else does? An object's 'load' is really just its contribution to the abstract cost function that whatever LB strategy is trying to minimize. There's no reason to distinguish some kind of 'normal' from 'special' modeled load. It means whatever the model wants it to mean, and whatever the strategy interprets it as.

A Member commented:

This is probably a point that @ppebay would need to stand up for, since my understanding is that @cz4rs you're just implementing it as requested.

@cz4rs (Contributor, Author) commented:

I can see your point here, and it doesn't seem like there's much use for getModeledWork outside of TemperedWMin anyway.

@cz4rs (Contributor, Author) commented:

On the other hand, here's the last 6 commits to load_model.h:

f6ebc31a0 #1672: lb: rename `getComm` to `getModeledComm`
543f320a2 #1672: lb: rename `getLoadMetric` to `getModeledLoad`
1a2ee2e75 #1672: rename `getLoad` to `getLoadMetric`
35b554444 #1672: lb: add TemperedWMin load balancer
26e478bba #1672: lb: add getTotalWork and getComm methods
63c7ee10d #1672: lb: rename getWork to getLoad

There's been plenty of renaming already (note that getModeledLoad() started out as getWork()), so I'd like to make sure we have a consensus about this.

A Collaborator commented:

Do any of the load balancers assume that load summed over ranks is an invariant across LB migrations? If so, then it could be important to allow a load balancer to choose pure load even if the load model in use knows about communication.

@ppebay (Contributor) commented Aug 16, 2022

So getModeledLoad() refers to a notion of a "modeled load" that is itself a generic understanding of "load" (i.e., one that might include communication)? In that sense, would modeledLoad be a replacement for Work?

@ppebay (Contributor) commented Aug 16, 2022

In LBAF we have the following concepts:

  • "Time": the compute time required by an object -- assumed to be constant irrespective of the rank to which it is assigned;
  • "Load": the sum total of all object times -- a per-rank quantity;
  • "Weight": the amount of data (in bytes?) associated with object-to-object communication (which can be aggregated at the rank level);
  • "Work": an (affine) combination of the above -- a per-rank quantity;
  • "Total Work": the sum total of "Work" across the entire collection of ranks -- note that this remains constant when only object times are considered.
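The per-rank quantities above can be written as a small sketch. This is a hypothetical reading of the definitions, not code from LBAF; the coefficient names alpha, beta, and gamma are placeholders for whatever the affine combination actually uses.

```cpp
#include <numeric>
#include <vector>

// Hypothetical sketch of the LBAF notions above; names are illustrative,
// not taken from the LBAF code base.

// "Load": the sum total of all object times on a rank.
double rankLoad(std::vector<double> const& object_times) {
  return std::accumulate(object_times.begin(), object_times.end(), 0.0);
}

// "Work": an affine combination of per-rank Load and communication Weight.
double rankWork(double load, double comm_weight,
                double alpha, double beta, double gamma) {
  return alpha * load + beta * comm_weight + gamma;
}
```

With beta = 0 (and gamma = 0), Work reduces to Load, which is why Total Work stays constant when only object times are considered.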

@cz4rs (Contributor, Author) commented Aug 17, 2022

In the LoadModel interface (the most general one), we currently have:

  • getModeledLoad(...)
    • Provide an estimate of the given object's load during a specified interval
    • How much computation time the object is estimated to require
  • getRawLoad(...)
    • Provide the given object's raw load during a specified interval
    • How much computation time the object required
  • getModeledComm(...)
    • Provide an estimate of the communication cost for a given object during a specified interval
    • How much communication time the object is estimated to require
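Read as an interface, the three queries above might look like the following simplified sketch; the real vt declarations take ElementIDStruct and PhaseOffset arguments and return TimeType, so the signatures here are illustrative only.

```cpp
// Simplified sketch of the three LoadModel queries listed above; the real vt
// declarations take ElementIDStruct / PhaseOffset and may differ in detail.
struct LoadModelSketch {
  virtual ~LoadModelSketch() = default;
  // Estimated computation time the object will require over the interval.
  virtual double getModeledLoad(int object, int phase) const = 0;
  // Measured computation time the object actually required.
  virtual double getRawLoad(int object, int phase) const = 0;
  // Estimated communication cost for the object over the interval.
  virtual double getModeledComm(int object, int phase) const = 0;
};

// A trivial model for demonstration purposes only.
struct ConstantModel : LoadModelSketch {
  double getModeledLoad(int, int) const override { return 1.5; }
  double getRawLoad(int, int) const override { return 2.0; }
  double getModeledComm(int, int) const override { return 0.5; }
};
```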

@cz4rs (Contributor, Author) commented Aug 17, 2022

In both the WeightedMessages and CommOverhead models, the cost of communication is computed using:

  • per_msg_weight_ - weight to add per message received
  • per_byte_weight_ - weight to add per byte received
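The weighting described above amounts to a simple linear cost rule. Here is a free-function sketch of it; the real models keep these values as the per_msg_weight_ and per_byte_weight_ members, so the function form below is purely illustrative.

```cpp
// Sketch of the linear communication-cost rule used by the weighted models:
//   cost = per_msg_weight * messages + per_byte_weight * bytes
// Free-function form for illustration only; the actual models store these
// weights as the per_msg_weight_ / per_byte_weight_ members mentioned above.
double commCost(double num_msgs, double num_bytes,
                double per_msg_weight, double per_byte_weight) {
  return per_msg_weight * num_msgs + per_byte_weight * num_bytes;
}
```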

@cz4rs (Contributor, Author) commented:

@lifflander TL;DR: c72e02a implements Phil's suggestion from the first comment: do not add getModeledWork; use getModeledLoad instead. Do you think that's the right direction, or should we stick with getModeledWork?

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 3441b9e to 7f49544 on August 12, 2022 13:57
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 7f49544 to c72e02a on August 17, 2022 16:49
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from 64c19db to eff4e41, on September 15, 2022 10:53
@cz4rs requested a review from PhilMiller on September 15, 2022 12:21
@cz4rs (Contributor, Author) commented Sep 15, 2022

The gcc-7 failure is unrelated.

@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch 2 times, most recently from 47eb740 to 2c34c25, on September 20, 2022 17:28
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from 2c34c25 to f317ef0 on September 30, 2022 13:04
@cz4rs force-pushed the 1838-add-weighted-communication-volume-model branch from f317ef0 to 40a4635 on October 3, 2022 10:16
@nlslatt (Collaborator) left a comment:

Looks good

@nlslatt dismissed PhilMiller's stale review on October 5, 2022 17:15 (requested changes made).

@nlslatt merged commit eb27363 into develop on Oct 5, 2022.

Successfully merging this pull request may close these issues:

  • move getModeledWork out of TemperedWMin

4 participants