
Balance workloads within synchronized subphases (vector loads rather than scalar) #708

Closed · 4 of 5 tasks

PhilMiller opened this issue Feb 25, 2020 · 14 comments · Fixed by #826
Comments

PhilMiller (Member) commented Feb 25, 2020

What Needs to be Done?

  1. Add an API for objects to indicate which subphase their work is part of (a rough sketch follows this list)
  2. Extend LB instrumentation to record a subphase-indexed vector of loads per object per phase, rather than the current scalar(s)
  3. Add an API for load balancing strategies to query vector loads
  4. Modify the LB strategy migration criterion to disallow candidates that would increase vector imbalance
  5. Output load vectors to the LB stats file
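As a rough illustration of what the API in item 1 might look like (every name here is a hypothetical placeholder, not an existing vt interface):

using SubphaseType = int;

struct Collection {
  // Hypothetical hook: attribute subsequent instrumented time on this
  // object to the given subphase
  void setSubphase(SubphaseType subphase);
};

// Hypothetical usage inside an object's per-step work:
//   this->setSubphase(0);  // e.g., field solve
//   doFieldSolve();
//   this->setSubphase(1);  // e.g., particle push
//   doParticlePush();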

Is your feature request related to a problem? Please describe.

In EMPIRE, each step consists of a number of subphases separated by global synchronization events. Object loads across subphases are not directly proportional, so balancing by the objects' aggregate load over the entire step leaves substantial imbalances within each subphase. This limits performance and scalability.


PhilMiller (Member Author) commented:

@lifflander @ppebay will want to track this

PhilMiller (Member Author) commented:

@ppebay What do you want for the format of subphase object times in the stats file?

lifflander (Collaborator) commented:

I think it makes sense for the sub-phases to be sorted in the same order as the phases appear in the output. We could put them at the end to maintain some positional compatibility, or just add them after the phase, which probably makes more sense: i.e., in the 2nd position of each line, for both comp and comm.
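Purely as an illustration of that layout (the field set here is an assumption, not a settled format), a per-object line might read:

// <phase>, <subphase>, <object-id>, <load>
0, 1, 42, 0.0375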

PhilMiller (Member Author) commented Mar 10, 2020

Here are a few improvement criteria that could be used in a vector strategy.

Suppose we have

TimeType getObjLoad(ObjectID, Phase, Subphase);
TimeType getProcLoad(ProcID, Phase, Subphase);

vector<TimeType> subphase_maximums(num_subphases, 0);
vector<TimeType> subphase_averages(num_subphases, 0);
vector<TimeType> subphase_long_poles(num_subphases, 0);
vector<TimeType> subphase_targets(num_subphases, 0);

for (int j = 0; j < num_subphases; ++j) {
  // Maximum and average processor load in each subphase
  for (int i = 0; i < num_procs; ++i) {
    subphase_maximums[j] = max(subphase_maximums[j], getProcLoad(i, phase, j));
    subphase_averages[j] += getProcLoad(i, phase, j);
  }
  subphase_averages[j] /= num_procs;

  // 'Long pole': the single heaviest object in each subphase
  for (int i = 0; i < num_objs; ++i) {
    subphase_long_poles[j] = max(subphase_long_poles[j], getObjLoad(i, phase, j));
  }

  // Best attainable target: no subphase can be balanced below its average
  // processor load, nor below its single heaviest object
  subphase_targets[j] = max(subphase_averages[j], subphase_long_poles[j]);
}
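As a hypothetical worked example of the target computation: with 4 processors whose loads in some subphase are 2, 2, 3, and 5, the average is 3; if the heaviest single object in that subphase has load 4, then the target for that subphase is max(3, 4) = 4, since no assignment can finish the subphase before its heaviest object does.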

The strongest criterion would be that a proposed migration not create any overload in any subphase:

bool MigrationCreatesNoOverload(ObjectID obj, ProcID src_proc, ProcID dst_proc) {
  // Reject the migration if dst_proc would exceed the target in any subphase
  for (int i = 0; i < num_subphases; ++i) {
    if (getObjLoad(obj, phase, i) + getProcLoad(dst_proc, phase, i) > subphase_targets[i])
      return false;
  }
  return true;
}
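For context, a minimal sketch of how a strategy might use such a criterion as a filter in a greedy pass (selectHeaviestObject and migrate are hypothetical helpers; num_subphases, phase, and subphase_targets are as above):

void GreedyVectorPass(vector<ProcID> const& procs) {
  for (ProcID src : procs) {
    // Only consider sources that are overloaded in at least one subphase
    bool overloaded = false;
    for (int i = 0; i < num_subphases; ++i)
      overloaded = overloaded || getProcLoad(src, phase, i) > subphase_targets[i];
    if (!overloaded) continue;

    for (ProcID dst : procs) {
      if (dst == src) continue;
      ObjectID obj = selectHeaviestObject(src);  // hypothetical helper
      if (MigrationCreatesNoOverload(obj, src, dst)) {
        migrate(obj, src, dst);  // hypothetical helper
        break;
      }
    }
  }
}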

To be continued...

PhilMiller (Member Author) commented Mar 10, 2020

Another criterion would be that the relieved overload on src_proc is greater than any created overload on dst_proc:

bool MigrationRelievesNetOverload(ObjectID obj, ProcID src_proc, ProcID dst_proc) {
  TimeType relieved_overload = 0;
  for (int i = 0; i < num_subphases; ++i) {
    // The object's load varies per subphase, so query it inside the loop
    TimeType obj_load = getObjLoad(obj, phase, i);
    TimeType load_before = getProcLoad(src_proc, phase, i);
    TimeType load_after = load_before - obj_load;

    if (load_after >= subphase_targets[i])
      relieved_overload += obj_load;
    else if (load_before >= subphase_targets[i]) // Corrected to *else* if
      relieved_overload += load_before - subphase_targets[i];
  }

  TimeType created_overload = 0;
  for (int i = 0; i < num_subphases; ++i) {
    TimeType obj_load = getObjLoad(obj, phase, i);
    TimeType load_before = getProcLoad(dst_proc, phase, i);
    TimeType load_after = load_before + obj_load;

    if (load_before >= subphase_targets[i])
      created_overload += obj_load;
    else if (load_after >= subphase_targets[i])
      created_overload += load_after - subphase_targets[i];
  }

  return relieved_overload > created_overload;
}
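As a hypothetical worked example with two subphases and targets of 10 each: if moving an object with loads (3, 2) takes src_proc from (14, 9) to (11, 7), the relieved overload is 3 + 0 = 3; if it takes dst_proc from (8, 9) to (11, 11), the created overload is 1 + 1 = 2, so the migration passes this criterion.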

PhilMiller (Member Author) commented Mar 10, 2020

A slightly trickier one to implement, requiring updated global load knowledge (or, equivalently, a centralized strategy that collects the stats in one place), would be whether the destination processor becomes a 'long pole' in any subphase.

A modification of that considers whether it would become a long pole assuming the current maxima are fixed. This is a valid criterion for global improvement, but probably not a very good one: it would need many iterative rounds of migration to smooth everything down.
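A minimal sketch of the fixed-maxima variant (the function name is hypothetical, and it assumes the subphase_maximums vector computed above is kept current):

bool MigrationAvoidsNewLongPole(ObjectID obj, ProcID src_proc, ProcID dst_proc) {
  for (int i = 0; i < num_subphases; ++i) {
    TimeType dst_load_after = getProcLoad(dst_proc, phase, i) + getObjLoad(obj, phase, i);
    // Reject if dst_proc would exceed the current per-subphase maximum,
    // treating the maxima as fixed rather than recomputing them
    if (dst_load_after > subphase_maximums[i])
      return false;
  }
  return true;
}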

PhilMiller (Member Author) commented:

It's very likely I'm duplicating at least some work of @rbuch here

rbuch commented Mar 12, 2020

Indeed, I am looking at similar things in Charm++. I'm currently focused on pushing through some of the internal LB infrastructure changes, so a lot of the strategy design and validation is still future work, but I have done some preliminary work on it. In any case, we should coordinate to avoid making the same mistakes and to avoid duplicating work where we can.

PhilMiller (Member Author) commented Mar 12, 2020

Another one, that any created overload in any subphase on the destination processor must be less than the overload remaining on the source processor:

bool MigrationUniformlyMitigatesOverload(ObjectID obj, ProcID src_proc, ProcID dst_proc) {
  for (int i = 0; i < num_subphases; ++i) {
    // The object's load varies per subphase, so query it inside the loop
    TimeType obj_load = getObjLoad(obj, phase, i);

    TimeType src_load_after = getProcLoad(src_proc, phase, i) - obj_load;
    TimeType dst_load_after = getProcLoad(dst_proc, phase, i) + obj_load;

    // Reject if dst_proc would be overloaded in this subphase and left
    // worse off than src_proc is after the migration
    if (dst_load_after >= subphase_targets[i] &&
        dst_load_after > src_load_after)
      return false;
  }

  return true;
}

I believe this is, again, an example of a 'strict improvement criterion', while still being more lenient than MigrationCreatesNoOverload.

PhilMiller (Member Author) commented Mar 14, 2020

I just amended the preliminaries and all of the criteria definitions to clarify that they should consider the best attainable target in each subphase, which is the larger of the heaviest object's load and the average processor load. If there's a long pole, there's no sense trying to offload a processor that's already shorter than it.

PhilMiller (Member Author) commented:

A slight variation on MigrationUniformlyMitigatesOverload would make the last comparison dst_load_after > src_load_before. This would allow migrations that improve some subphases, as long as the others are left no worse off than the source was before the migration.
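A minimal sketch of that variant, changing only the comparison (the function name is hypothetical):

bool MigrationLeavesNoSubphaseWorse(ObjectID obj, ProcID src_proc, ProcID dst_proc) {
  for (int i = 0; i < num_subphases; ++i) {
    TimeType obj_load = getObjLoad(obj, phase, i);
    TimeType src_load_before = getProcLoad(src_proc, phase, i);
    TimeType dst_load_after = getProcLoad(dst_proc, phase, i) + obj_load;

    // Compare against the source's load *before* the migration, so each
    // subphase ends up no worse than it already was on the source
    if (dst_load_after >= subphase_targets[i] &&
        dst_load_after > src_load_before)
      return false;
  }
  return true;
}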

lifflander pushed a commit that referenced this issue May 6, 2020
lifflander pushed a commit that referenced this issue May 6, 2020
lifflander added a commit that referenced this issue Jun 10, 2020
#708: Pass subphase timings through to ProcStats, and write them out
lifflander pushed a commit that referenced this issue Jul 15, 2020