Make AMM memory measure configurable #6577

Closed
crusaderky opened this issue Jun 15, 2022 · 0 comments · Fixed by #7062

The Active Memory Manager (AMM) uses the optimistic memory (managed + unmanaged old) as the hardcoded measure on which it bases all of its decisions.
This is generally a good choice in a production environment.
There are, however, two notable exceptions:

  1. When the process memory does not deflate on its own. This is probably fixable on Linux (see "distributed.nanny.environ.MALLOC_TRIM_THRESHOLD_ is ineffective", #5971) and, to my knowledge, unfixable on macOS. It can cause the AMM to make poor decisions, e.g. move all data away from a worker because it sees huge amounts of unmanaged memory, when in fact that memory is reusable.
  2. In unit tests. Most of the AMM tests currently run on nannies and need large amounts of data and lax constraints to be stable. The AMM stress tests are currently disabled on CI. This is not the AMM's fault (the same tests also fail with the AMM disabled); rather, for the AMM to make correct decisions the tests have to spawn 10 Nannies, which is too much for the small GitHub CI hosts to handle. Those stress tests would be extremely valuable to run in CI, as they have detected state machine corruption and other deadlocks many times in the past. See "Remove @avoid_ci from stress tests", #6271.
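
To make the first exception concrete, here is a minimal sketch of how the optimistic measure can diverge from the managed one. The `WorkerMemory` class and the numbers are hypothetical and simplified for illustration; they are not the actual classes in distributed.

```python
from dataclasses import dataclass

GiB = 2**30


@dataclass(frozen=True)
class WorkerMemory:
    """Hypothetical, simplified snapshot of one worker's memory, in bytes."""

    managed: int           # data held by Dask keys in memory
    unmanaged_old: int     # unmanaged memory that has persisted for a while
    unmanaged_recent: int  # unmanaged memory that appeared recently

    @property
    def optimistic(self) -> int:
        # The measure the AMM currently hardcodes: managed + unmanaged old
        return self.managed + self.unmanaged_old

    @property
    def process(self) -> int:
        # Everything the OS attributes to the process in this simplified model
        return self.managed + self.unmanaged_old + self.unmanaged_recent


# Assumed scenario: 1 GiB of real data, plus 6 GiB of freed heap that the
# allocator never handed back to the OS.
worker = WorkerMemory(managed=1 * GiB, unmanaged_old=6 * GiB, unmanaged_recent=0)

print(worker.optimistic // GiB)  # 7
print(worker.managed // GiB)     # 1
```

Under the optimistic measure this worker looks seven times fuller than its managed data alone would suggest, which is the kind of gap that could lead the AMM to move data away unnecessarily.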

Design

Add a new setting to distributed.yaml, {distributed.scheduler.active-memory-manager.measure: optimistic}. This mirrors the existing {distributed.worker.memory.rebalance.measure: optimistic}. Note that rebalance() is slated to be rewritten: #4906.
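
As a rough sketch of what the proposed setting could look like from the consumer's side, the snippet below reads the measure name from a config mapping and uses it to pick a value. Everything here is an assumption for illustration: the nested dict stands in for the parsed distributed.yaml, and `VALID_MEASURES`, `get_measure` and `worker_memory` are hypothetical names, not part of distributed.

```python
from typing import Mapping

# Hypothetical stand-in for the parsed distributed.yaml; only the proposed key is shown
config = {
    "distributed": {
        "scheduler": {
            "active-memory-manager": {"measure": "optimistic"},
        }
    }
}

# Assumed set of accepted measure names
VALID_MEASURES = {"process", "optimistic", "managed"}

# Path of the proposed setting inside the nested config
MEASURE_PATH = ("distributed", "scheduler", "active-memory-manager", "measure")


def get_measure(config: Mapping) -> str:
    """Walk the nested config and return the validated measure name."""
    node = config
    for part in MEASURE_PATH:
        node = node[part]
    if node not in VALID_MEASURES:
        raise ValueError(f"unknown memory measure: {node!r}")
    return node


def worker_memory(measures: Mapping[str, int], config: Mapping) -> int:
    """Return the memory value that the configured measure selects."""
    return measures[get_measure(config)]


# Example values in GiB for a single worker (made up for illustration)
measures = {"process": 8, "optimistic": 7, "managed": 1}
print(worker_memory(measures, config))  # 7
```

In a unit test, switching the measure to "managed" would let the AMM ignore unmanaged memory altogether, which is what would make small, deterministic tests feasible.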
