1838: add WeightedCommunicationVolume load model #1888
Pipelines results
PR tests (gcc-12, ubuntu, mpich) Build for 40a4635
PR tests (gcc-5, ubuntu, mpich) Build for 40a4635
PR tests (clang-3.9, ubuntu, mpich) Build for 40a4635
PR tests (gcc-6, ubuntu, mpich) Build for 40a4635
PR tests (clang-5.0, ubuntu, mpich) Build for 40a4635
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 40a4635
PR tests (gcc-9, ubuntu, mpich, zoltan) Build for 40a4635
PR tests (gcc-7, ubuntu, mpich, trace runtime, LB) Build for 40a4635
PR tests (clang-9, ubuntu, mpich) Build for 40a4635
PR tests (nvidia cuda 11.0, ubuntu, mpich) Build for 40a4635
PR tests (clang-13, alpine, mpich) Build for 40a4635
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for 40a4635
PR tests (nvidia cuda 10.1, ubuntu, mpich) Build for 40a4635
PR tests (intel icpx, ubuntu, mpich) Build for 40a4635
PR tests (clang-11, ubuntu, mpich) Build for 40a4635
PR tests (clang-12, ubuntu, mpich) Build for 40a4635
PR tests (clang-13, ubuntu, mpich) Build for 40a4635
PR tests (clang-14, ubuntu, mpich) Build for 40a4635
PR tests (gcc-11, ubuntu, mpich) Build for eff4e41
PR tests (clang-10, ubuntu, mpich) Build for 40a4635
PR tests (intel icpc, ubuntu, mpich) Build for 40a4635
PR tests (gcc-11, ubuntu, mpich, json schema test) Build for 40a4635
f178e0b to 055f8ee
055f8ee to bb2291d
bb2291d to efc2016
Codecov Report
@@            Coverage Diff             @@
##           develop    #1888     +/-   ##
===========================================
+ Coverage    84.32%   84.35%   +0.03%
===========================================
  Files          728      730       +2
  Lines        25508    25569      +61
===========================================
+ Hits         21509    21570      +61
  Misses        3999     3999
When
I was kind of expecting option B, if the LB strategy is designed with the assumption that it's working over this model
virtual TimeType getModeledWork(ElementIDStruct object, PhaseOffset when) {
  return {};
}
This might be me being difficult, but can we just do without this, and have the new model override `getModeledLoad()` like everything else does? An object's 'load' is really just its contribution to the abstract cost function that whatever LB strategy is trying to minimize. There's no reason to distinguish some kind of 'normal' from 'special' modeled load. It means whatever the model wants it to mean, and whatever the strategy interprets it as.
I can see your point here and it doesn't seem like there's much use for `getModeledWork` outside of `TemperedWMin` anyways.
On the other hand, here are the last 6 commits to `load_model.h`:
f6ebc31a0 #1672: lb: rename `getComm` to `getModeledComm`
543f320a2 #1672: lb: rename `getLoadMetric` to `getModeledLoad`
1a2ee2e75 #1672: rename `getLoad` to `getLoadMetric`
35b554444 #1672: lb: add TemperedWMin load balancer
26e478bba #1672: lb: add getTotalWork and getComm methods
63c7ee10d #1672: lb: rename getWork to getLoad
There's been plenty of renaming already (note that `getModeledLoad()` started as `getWork()`), so I'd like to make sure that we have a consensus about this.
Do any of the load balancers assume that load summed over ranks is an invariant across LB migrations? If so, then it could be important to allow a load balancer to choose pure load even if the load model in use knows about communication.
So `getModeledLoad()` refers to a notion of a "modeled load" that is itself a generic understanding of "load" (i.e., one that might include communication)? In that sense, would `modeledLoad` be a replacement for `Work`?
In LBAF we have the following concepts:
- "Time" that is the compute time required by an object -- and it's assumed to be constant irrespective of the rank to which it is assigned.
- "Load" that is simply the sum-total of all object times--this is a per-rank quantity;
- "Weight" that represents the amount of data (in bytes?) associated with object-to-object communication (and which can be aggregated at the rank level);
- "Work" that is an (affine) combination of the above -- this is a per-rank variable;
- "Total Work" that is the sum total of "Work" across the entire collection of ranks -- note that this remains constant when only object times are considered.
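The quantities listed above can be sketched in a few lines of C++. This is only an illustration, not LBAF code: the `Object` and `Rank` types, the function names, and the coefficients `alpha`, `beta`, and `gamma` of the affine combination are all assumptions made for the example. The demo function checks the remark about "Total Work": with `beta == 0` (only object times considered), moving an object between ranks leaves the total unchanged.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in types for illustration; not taken from LBAF or vt.
struct Object {
  double time;        // "Time": compute time, assumed independent of the rank
  double comm_bytes;  // "Weight": bytes of object-to-object communication
};

using Rank = std::vector<Object>;

// "Load": sum of the times of all objects on one rank.
double rankLoad(Rank const& rank) {
  double load = 0.0;
  for (auto const& obj : rank) load += obj.time;
  return load;
}

// "Weight" aggregated at the rank level.
double rankWeight(Rank const& rank) {
  double bytes = 0.0;
  for (auto const& obj : rank) bytes += obj.comm_bytes;
  return bytes;
}

// "Work": affine combination of load and weight (coefficients are assumed).
double rankWork(Rank const& rank, double alpha, double beta, double gamma) {
  return alpha * rankLoad(rank) + beta * rankWeight(rank) + gamma;
}

// "Total Work": sum of per-rank work over all ranks.
double totalWork(
  std::vector<Rank> const& ranks, double alpha, double beta, double gamma
) {
  double total = 0.0;
  for (auto const& rank : ranks) total += rankWork(rank, alpha, beta, gamma);
  return total;
}

// With beta == 0, migrating an object does not change the total work.
bool migrationPreservesTotalWork() {
  Rank r0 = {Object{2.0, 100.0}, Object{1.0, 50.0}};
  Rank r1 = {Object{3.0, 0.0}};
  std::vector<Rank> ranks = {r0, r1};

  double const before = totalWork(ranks, 1.0, 0.0, 0.0);

  // Move the last object of rank 0 to rank 1.
  ranks[1].push_back(ranks[0].back());
  ranks[0].pop_back();

  double const after = totalWork(ranks, 1.0, 0.0, 0.0);
  return std::abs(before - after) < 1e-12;
}
```

The sketch keeps `comm_bytes` fixed per object. If the communication term depends on where objects are placed, the total is no longer guaranteed to stay constant across migrations, which is the concern raised earlier in the thread.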
In the `LoadModel` interface (the most general one), we currently have:
- `getModeledLoad(...)`
  - Provide an estimate of the given object's load during a specified interval
  - How much computation time the object is estimated to require
- `getRawLoad(...)`
  - Provide the given object's raw load during a specified interval
  - How much computation time the object required
- `getModeledComm(...)`
  - Provide an estimate of the communication cost for a given object during a specified interval
  - How much communication time the object is estimated to require
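As a rough sketch of the direction discussed in this thread (folding communication into `getModeledLoad()` instead of adding `getModeledWork()`), the three methods could be arranged as below. The parameter list follows the `getModeledWork` signature quoted in the diff; everything else is an assumption for illustration: `TimeType`, `ElementIDStruct`, and `PhaseOffset` are reduced to stand-ins, and `MeasuredModel` and `LoadPlusComm` are hypothetical classes, not the models in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Stand-in types, assumed for illustration; the real definitions differ.
using TimeType = double;
struct ElementIDStruct { std::uint64_t id; };
struct PhaseOffset { int phases; };

// Simplified interface with the three methods described above.
struct LoadModel {
  virtual ~LoadModel() = default;
  virtual TimeType getModeledLoad(ElementIDStruct object, PhaseOffset when) = 0;
  virtual TimeType getRawLoad(ElementIDStruct object, PhaseOffset when) = 0;
  virtual TimeType getModeledComm(ElementIDStruct, PhaseOffset) { return {}; }
};

// Hypothetical base model backed by lookup tables.
struct MeasuredModel : LoadModel {
  std::unordered_map<std::uint64_t, TimeType> measured_load;
  std::unordered_map<std::uint64_t, TimeType> comm_cost;

  TimeType getRawLoad(ElementIDStruct object, PhaseOffset) override {
    auto it = measured_load.find(object.id);
    return it == measured_load.end() ? TimeType{} : it->second;
  }
  TimeType getModeledLoad(ElementIDStruct object, PhaseOffset when) override {
    return getRawLoad(object, when);
  }
  TimeType getModeledComm(ElementIDStruct object, PhaseOffset) override {
    auto it = comm_cost.find(object.id);
    return it == comm_cost.end() ? TimeType{} : it->second;
  }
};

// Hypothetical composed model: the modeled load includes communication,
// while the raw load is passed through unchanged.
struct LoadPlusComm : LoadModel {
  explicit LoadPlusComm(LoadModel& base) : base_(base) {}

  TimeType getModeledLoad(ElementIDStruct object, PhaseOffset when) override {
    return base_.getModeledLoad(object, when) +
           base_.getModeledComm(object, when);
  }
  TimeType getRawLoad(ElementIDStruct object, PhaseOffset when) override {
    return base_.getRawLoad(object, when);
  }
  TimeType getModeledComm(ElementIDStruct object, PhaseOffset when) override {
    return base_.getModeledComm(object, when);
  }

private:
  LoadModel& base_;
};

// Usage sketch: the strategy only ever asks for the modeled load.
void demoLoadPlusComm() {
  MeasuredModel base;
  base.measured_load[7] = 2.0;
  base.comm_cost[7] = 0.5;

  LoadPlusComm model(base);
  ElementIDStruct obj{7};
  PhaseOffset last{-1};

  assert(model.getRawLoad(obj, last) == 2.0);
  assert(model.getModeledLoad(obj, last) == 2.5);
}
```

Because `getRawLoad()` stays untouched, a strategy that needs pure load can still ask for it even when the model in use knows about communication.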
In both `WeightedMessages` and `CommOverhead` models we use:
- `per_msg_weight_` - weight to add per message received
- `per_byte_weight_` - weight to add per byte received

for computing the cost of communication.
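A minimal sketch of how two such weights could combine into a communication cost, assuming a simple linear form. The helper `commCost` and its parameters are hypothetical and written for this example; it is not the code of either model.

```cpp
#include <cassert>
#include <cstddef>

using TimeType = double;  // stand-in for the real time type (assumption)

// Hypothetical helper: cost grows linearly with the number of messages
// received and with the number of bytes received.
TimeType commCost(
  std::size_t num_msgs, std::size_t num_bytes,
  TimeType per_msg_weight, TimeType per_byte_weight
) {
  return per_msg_weight * static_cast<TimeType>(num_msgs) +
         per_byte_weight * static_cast<TimeType>(num_bytes);
}

// Usage sketch: 4 messages totalling 1000 bytes.
void demoCommCost() {
  // 4 * 0.5 + 1000 * 0.25 = 252
  assert(commCost(4, 1000, 0.5, 0.25) == 252.0);
}
```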
@lifflander TLDR: c72e02a implements Phil's suggestion from the first comment: do not add `getModeledWork`, use `getModeledLoad` instead. Do you think that's the right direction, or should we stick with `getModeledWork`?
3441b9e to 7f49544
7f49544 to c72e02a
c72e02a to 90845d7
64c19db to eff4e41
47eb740 to 2c34c25
2c34c25 to f317ef0
f317ef0 to 40a4635
Looks good
fixes #1838