Ambiguity of xfel balance worker #856
Comments
This is all the data I have on
Please don't ask why I ran a job with 555 expts on 512 ranks – it seems my goals are beyond anyone's understanding, including my own (・・。)ゞ
@irisdyoung I tested the thesis I put forward in issue 3 and realized that the algorithm is by no means recursive, and is much smarter and faster than what I initially understood. Since the fault is not in this part, why is
That's good news from my perspective! I will add some timing info and hopefully get more information. It could be how the experiment and reflection sets are being sliced up (before passing them around, but after all ranks have received their instructions), or how they're aggregated afterwards. I can try removing either of those steps (producing a nonfunctional algorithm; just for testing) to find out.
@irisdyoung Thanks, that's a thing to check. I started some test jobs with timing, so I might look into it.
If it helps, you can run it with `debug.cProfile=True`, then use a profiler on one of the per-rank `.prof` files to see what is going on.
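For instance, a per-rank profile could be inspected with Python's built-in `pstats` module; the filename below is only a placeholder for whatever the run actually produced:

```python
import pstats

# Load one of the per-rank profiles written with debug.cProfile=True
# (the path here is a placeholder, not the actual output location).
stats = pstats.Stats("cpu_profile_rank0000.prof")

# Show the 20 entries with the largest cumulative time to see where
# the balance step actually spends its wall clock.
stats.sort_stats("cumulative").print_stats(20)
```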
Ok, after some timing I got some interesting conclusions:
So it looks like the grossly inefficient part here is … EDIT: Distributing expts and refls separately does not solve the issue; both steps in total take exactly the same time.
Introduction
While working with a large dataset I discovered that the balance worker can act in very unexpected and undesired ways. I'd like to discuss some of my issues and suggest potential improvements. Given how self-contained the code is, I would gladly take care of the "problems" myself, but first I would like to discuss which of my issues are actual "problems".
Issue 1: Balance worker, version `global1`, can be grossly inaccurate
The default balancing algorithm, accessible via the phil parameter `input.parallel_file_load.balance = global1`, does not require any information to be shared between ranks. Every rank stripes its experiment list and assigns the n-th stripe to the n-th rank. This works great as long as #ranks << #expts: in the example above, rank 0 should receive 30 experiments, while every other rank should receive 25. It gets worse with a decreasing #expts/#ranks ratio. In particular, if a rank does not have enough expts for all other ranks, it will reshuffle the sliced lists. In this way, whether every rank will receive some reflections is left to sheer luck; at least I wasn't able to find any other safety mechanism.
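As an illustration only, a minimal sketch of this kind of per-rank striping (not the actual worker code, and with arbitrary numbers):

```python
# Hypothetical illustration: each rank slices its own experiment list into
# n_ranks stripes and sends the i-th stripe to rank i, with no knowledge of
# how many experiments the other ranks hold.
def stripe_local_experiments(local_experiments, n_ranks):
    """Return a list whose i-th entry is the stripe destined for rank i."""
    return [local_experiments[i::n_ranks] for i in range(n_ranks)]

# Example: a rank holding 7 experiments striped over 4 ranks gives stripes of
# sizes [2, 2, 2, 1]; the leftover experiments always go to the low-numbered
# ranks, so with many ranks the imbalance accumulates there.
print(stripe_local_experiments(list(range(7)), 4))
# -> [[0, 4], [1, 5], [2, 6], [3]]
```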
Lately I have been using a very low expt/ranks ratio and the problem really shows. Even if none of the ranks have been assigned 0 expts (which can happen!), one can end up with something like this:
The simplest way to solve this issue is to allow rank 0 to collect information about `len(experiments)` from every rank. Using cumulative sums of these lengths, every rank could take into account which rank it should assign to first. The end product would be essentially identical to striping the list of all experiments from all ranks, and would never allow `min` and `max` to differ by more than 1.
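A rough sketch of what that could look like, assuming only that each rank can learn every rank's experiment count (e.g. via an allgather); the function name and numbers are illustrative, not the worker's actual API:

```python
import numpy as np

def global_stripe_destinations(counts_per_rank, my_rank):
    """For one rank, compute which rank each of its local experiments should
    be sent to, as if all experiments were concatenated in rank order and
    then striped round-robin over all ranks.

    counts_per_rank: experiment counts gathered from every rank
    my_rank: index of the calling rank
    """
    n_ranks = len(counts_per_rank)
    offsets = np.concatenate(([0], np.cumsum(counts_per_rank)))
    # Global indices of this rank's experiments in the concatenated list.
    global_idx = np.arange(offsets[my_rank], offsets[my_rank + 1])
    # Round-robin striping of the global list: experiment g goes to rank
    # g % n_ranks, so per-rank loads can never differ by more than 1.
    return global_idx % n_ranks

# Example: 3 ranks holding 5, 2 and 4 experiments (11 experiments over
# 3 ranks balances to loads of 4, 4 and 3).
counts = [5, 2, 4]
for rank in range(3):
    print(rank, global_stripe_destinations(counts, rank))
# 0 [0 1 2 0 1]
# 1 [2 0]
# 2 [1 2 0 1]
```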
Issue 2: Balance worker, version `global1`, `alltoall` slicing could be automated
For implementation reasons, mpi4py's `alltoall` method raises an `OverflowError` when data of size > 2,147,483,647 should be redistributed. Although neither my experiment list nor my reflection list is this long, this error is raised. The worker has a mechanism to protect against it, i.e. it can execute `alltoall` in several "slices". The number of slices is controlled via `input.parallel_file_load.balance_mpi_alltoall_slices`. However, I believe the need for this parameter could be easily eliminated by calling `exchange_reflections_by_alltoall_sliced` recursively with 1, 10, 100... slices whenever the previous call raises `OverflowError`.
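A hedged sketch of that retry loop; the wrapper takes the sliced-exchange step as a callable because the exact signature of `exchange_reflections_by_alltoall_sliced` is not assumed here:

```python
def exchange_with_auto_slicing(exchange_sliced, max_slices=10_000):
    """Call a sliced alltoall exchange with 1, 10, 100, ... slices,
    increasing the slice count whenever mpi4py's OverflowError is hit.

    exchange_sliced: a callable accepting number_of_slices, e.g. a thin
    wrapper around exchange_reflections_by_alltoall_sliced (its real
    signature is not assumed here).
    """
    number_of_slices = 1
    while number_of_slices <= max_slices:
        try:
            return exchange_sliced(number_of_slices)
        except OverflowError:
            # The payload per alltoall call still exceeds the 2,147,483,647-byte
            # limit; split it into ten times more slices and retry.
            number_of_slices *= 10
    raise RuntimeError("alltoall payload too large even with %d slices" % max_slices)
```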
Issue 3: Balance worker, version `global2`, can be grossly inefficient
Since `global1` is so inaccurate, I switched to `global2`. This version limits rank communication by sending only those experiments that actually need to be transferred. It balances 100% accurately, but uses an essentially recursive algorithm of high complexity: move experiments one by one until everything is balanced. The same level of balance could be achieved much more simply by creating two lists, "`[rank for _ in experiments]`" and "`np.linspace(0, #ranks, #expts, dtype=int)`" (example), and mapping one list onto the other to get the result. The end product might be slightly different and require marginally more communication, but the complexity would be essentially O(1). The question is, is it better to wait 2 minutes for rank 0 to finish, or to allow up to #ranks more experiments to be passed around?
Update: The slow step is `mpi4py`'s `alltoall`, not the balancing algorithm; see comments.
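A minimal sketch of the two-list idea (names and numbers are mine, and `endpoint=False` is added so the last experiment does not land on a non-existent rank; whether this maps directly onto `global2`'s data structures is an assumption):

```python
import numpy as np

def plan_transfers(counts_per_rank):
    """Non-iterative rebalancing plan: pair the current owner of every
    experiment with its ideal owner and keep only the mismatches.

    counts_per_rank: number of experiments currently held by each rank.
    Returns a list of (from_rank, to_rank) moves.
    """
    n_ranks = len(counts_per_rank)
    n_expts = int(sum(counts_per_rank))
    # "[rank for _ in experiments]", flattened over all ranks: current owners.
    current = np.repeat(np.arange(n_ranks), counts_per_rank)
    # Ideal owners: an as-even-as-possible split of n_expts over n_ranks.
    target = np.linspace(0, n_ranks, n_expts, endpoint=False, dtype=int)
    return [(int(s), int(t)) for s, t in zip(current, target) if s != t]

# Example: four ranks holding 5, 0, 1 and 2 experiments; the balanced split
# is 2 each, and the plan is computed without any per-experiment iteration.
print(plan_transfers([5, 0, 1, 2]))
# -> [(0, 1), (0, 1), (0, 2)]
```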
Issue 4: Balance worker, version `global2`, does not accept `alltoall` slicing at all
For an introduction, see issues 2 and 3. While balance worker `global1` has a safety mechanism to protect against long experiment and reflection lists, `global2`'s behavior is currently completely unaffected by the `input.parallel_file_load.balance_mpi_alltoall_slices` parameter. While `global2` is designed to minimize data transfer between nodes, it still utilizes `alltoall` to pass this information around and raises `OverflowError` when the input is too large. Thus it would be convenient to introduce the `alltoall` slicing capability to `global2` as well.
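For illustration, one generic way to slice an `alltoall` payload over several rounds, written against plain mpi4py (this is only a sketch of the idea, not the worker's actual code):

```python
from mpi4py import MPI

def alltoall_sliced(comm, send_lists, number_of_slices):
    """Exchange per-destination lists in several alltoall rounds so that
    no single round exceeds mpi4py's 2,147,483,647-byte message limit.

    send_lists: list of length comm.size; send_lists[i] goes to rank i.
    """
    received = [[] for _ in range(comm.size)]
    for s in range(number_of_slices):
        # Send only every number_of_slices-th item in this round.
        chunk = [items[s::number_of_slices] for items in send_lists]
        for src, part in enumerate(comm.alltoall(chunk)):
            received[src].extend(part)
    return received

# Usage sketch (run under mpirun): each rank prepares one list per destination.
comm = MPI.COMM_WORLD
payload = [["expt-%d-%d" % (comm.rank, dest)] * 3 for dest in range(comm.size)]
print(comm.rank, alltoall_sliced(comm, payload, number_of_slices=2))
```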
Issue 5: Balance worker, version `per_node`, is... undefined
Balance worker, version `per_node`, utilizes a method `distribute_over_ranks` to pass around experiments and reflections. The problem is that in the current implementation this method is completely undefined. As such, this version of the worker does not work at all. This might be a typo, but the file was last modified half a year ago, and the fact that it remained unnoticed for this long suggests it isn't really used.
Issue 6: Balance worker has two names
I keep referring to the worker in question as `balance`, but in fact I am unsure whether it is called `balance` or `load_balancer`. Filenames are ambiguous, the Python code leans towards `load_balancer`, but the phil files exclusively use the term `balance`. I think it would be nice to settle on one name for clarity and maintainability.
Issue 7: Does anyone else need sorting capability in balance?
For my project I need to sort my experiments by `experiment.imageset.path` so that each rank receives as few imagesets as possible. I can achieve this by writing a custom `sort` worker which would be dispatched before `balance` version `global2`, but this would require expts to be passed around twice. What is your opinion on this functionality being added to `balance` instead?
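For what it's worth, the sorting step itself could be as small as the sketch below; the `imageset.path` attribute is taken from the description above, and everything else is hypothetical:

```python
def sort_experiments_by_imageset(experiments):
    """Order experiments so that those sharing an imageset path end up adjacent,
    letting a subsequent balance step keep each imageset on as few ranks as possible."""
    return sorted(experiments, key=lambda expt: expt.imageset.path)
```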