TaskGroup.nbytes_in_memory miscounted for replicated keys #4927
Comments
I stumbled over this myself recently, see distributed/distributed/tests/test_scheduler.py Lines 1919 to 1920 in fced981
From what I can see, the lack of counting is introduced in distributed/distributed/scheduler.py Lines 6300 to 6303 in fced981
I noticed this in my deadlock PR, which had already grown without bounds, so I didn't fix it there.
Maybe this will solve the problem? #4930
Closed via #4930
I think there is a logic error in the bookkeeping for TaskGroup.nbytes_in_memory. There's a discrepancy between how we increment it and how we decrement it when multiple workers hold the same key.
In transition_memory_released, we decrement it by nbytes once for every worker that holds that task: distributed/distributed/scheduler.py Lines 2646 to 2649 in 6340b5b
Whereas in _propagate_forgotten, we decrement it by nbytes once if any workers hold the task, regardless of how many. This doesn't match transition_memory_released: distributed/distributed/scheduler.py Lines 7339 to 7341 in 646b12b
On the creation side, in TaskState.set_nbytes, we only increment it by the difference between the last known value and the current value. If the key is being copied to multiple workers, this difference is usually 0: distributed/distributed/scheduler.py Lines 1556 to 1562 in 6340b5b
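To make the mismatch concrete, here is a toy model of the three code paths as described above. It is only a sketch, not the scheduler's actual code: ToyGroup, ToyTask, replicate, release_per_holder and forget_once are hypothetical names, and the -1 "unknown size" sentinel is an assumption.

```python
class ToyGroup:
    """Stand-in for TaskGroup; only tracks the counter under discussion."""

    def __init__(self):
        self.nbytes_in_memory = 0


class ToyTask:
    """Stand-in for TaskState with a single nbytes value shared by all copies."""

    def __init__(self, group):
        self.group = group
        self.nbytes = -1  # assumed sentinel for "size not known yet"
        self.who_has = set()

    def set_nbytes(self, nbytes):
        # Increment the group counter only by the change in the known size.
        old = self.nbytes if self.nbytes >= 0 else 0
        self.group.nbytes_in_memory += nbytes - old
        self.nbytes = nbytes


def replicate(n_workers, nbytes=100):
    """Put the same key on n_workers workers, reporting the same size each time."""
    group = ToyGroup()
    ts = ToyTask(group)
    for i in range(n_workers):
        ts.who_has.add(f"worker-{i}")
        ts.set_nbytes(nbytes)  # diff is 0 for every copy after the first
    return group, ts


def release_per_holder(ts):
    # Models the described transition_memory_released: one decrement per holder.
    for _ in ts.who_has:
        ts.group.nbytes_in_memory -= ts.nbytes
    ts.who_has.clear()


def forget_once(ts):
    # Models the described _propagate_forgotten: one decrement if anyone holds it.
    if ts.who_has:
        ts.group.nbytes_in_memory -= ts.nbytes
    ts.who_has.clear()


group, ts = replicate(2)
after_replication = group.nbytes_in_memory  # counted once: 100
release_per_holder(ts)
after_release = group.nbytes_in_memory  # decremented twice: -100

group2, ts2 = replicate(2)
forget_once(ts2)
after_forget = group2.nbytes_in_memory  # decremented once: 0
```

With two replicas, the release path leaves the counter negative, while the forget path happens to balance because it decrements only once.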
In short, I think TaskGroup.nbytes_in_memory is incremented once per key, but decremented once per copy of the key.
If nbytes can be different for different workers, then to do this bookkeeping correctly, I think we'd also need to track TaskState.total_nbytes (size of all copies of the key), then decrement by that once in transition_memory_released and _propagate_forgotten.
Discovered in #4925 (comment). I think #4925 made this more apparent, since it encourages more data replication.
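One possible reading of the total_nbytes idea, again as a toy sketch rather than a patch: keep a per-worker size, accumulate the sum in total_nbytes, and subtract that sum exactly once on release. The nbytes_by_worker dict and the add_replica/release methods are hypothetical, and this assumes the counter is meant to include every copy. It is not necessarily what #4930 does.

```python
class ToyGroup:
    """Stand-in for TaskGroup; only tracks the counter under discussion."""

    def __init__(self):
        self.nbytes_in_memory = 0


class ToyTask:
    """Stand-in for TaskState that tracks the size of every copy."""

    def __init__(self, group):
        self.group = group
        self.nbytes_by_worker = {}  # hypothetical: size reported by each holder
        self.total_nbytes = 0       # size of all copies of the key

    def add_replica(self, worker, nbytes):
        # Increment by the per-copy difference, so each new copy is counted.
        diff = nbytes - self.nbytes_by_worker.get(worker, 0)
        self.nbytes_by_worker[worker] = nbytes
        self.total_nbytes += diff
        self.group.nbytes_in_memory += diff

    def release(self):
        # Decrement once by the total, matching what was added.
        self.group.nbytes_in_memory -= self.total_nbytes
        self.total_nbytes = 0
        self.nbytes_by_worker.clear()


group = ToyGroup()
ts = ToyTask(group)
ts.add_replica("worker-a", 100)
ts.add_replica("worker-b", 120)  # sizes may differ between workers
peak = group.nbytes_in_memory    # 220
ts.release()
final = group.nbytes_in_memory   # 0
```

The alternative would be to keep counting once per key and decrement once per key in both paths; which one is right depends on whether nbytes_in_memory is meant to include replicas.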
cc @crusaderky since you know more about replicated keys.