Allow stealing of fast tasks in some situations #6115
base: main
```diff
@@ -23,7 +23,7 @@
 # submission which may include code serialization. Therefore, be very
 # conservative in the latency estimation to suppress too aggressive stealing
 # of small tasks
-LATENCY = 0.1
+LATENCY = 0.01


 logger = logging.getLogger(__name__)
```

**Review comment** (on `LATENCY = 0.1`): I don't think this is realistic, see #5390

**Review comment** (on `LATENCY = 0.01`): This seems unrealistically fast to me (and goes against what the comment above is saying). We've talked about doing the opposite in other places, see #5324. If we need to artificially suppress the latency estimate, maybe that points to something else being wrong with the stealing heuristics.
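To see why the reviewers care about this constant: for small dependencies, the fixed latency term dominates the estimated transfer time, so changing it by 10x changes how attractive stealing looks. A minimal sketch, assuming a hypothetical bandwidth figure and helper function (neither is the scheduler's actual value or API):

```python
# Hypothetical bandwidth estimate, standing in for self.scheduler.bandwidth.
BANDWIDTH = 100e6  # bytes/s

def transfer_time(nbytes: int, latency: float) -> float:
    """Estimated time to move nbytes to another worker (sketch only)."""
    return nbytes / BANDWIDTH + latency

# For a 1 kB dependency the bandwidth term is only 10 us, so the
# latency constant is essentially the whole transfer-time estimate:
old = transfer_time(1_000, 0.1)    # with LATENCY = 0.1
new = transfer_time(1_000, 0.01)   # with LATENCY = 0.01
```

With the lower constant, moving a tiny task appears roughly ten times cheaper, which is exactly the "too aggressive stealing of small tasks" the code comment warns about.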
```diff
@@ -199,8 +199,7 @@ def steal_time_ratio(self, ts):

         ws = ts.processing_on
         compute_time = ws.processing[ts]
-        if compute_time < 0.005:  # 5ms, just give up
-            return None, None
+        compute_time = max(compute_time, 0.010)

         nbytes = ts.get_nbytes_deps()
         transfer_time = nbytes / self.scheduler.bandwidth + LATENCY
```

**Review comment** (on the `compute_time` change): I've been down this road before, see #4920. I also noticed that the way we measure bandwidths is very skewed, which impacts this significantly; see #4962 (comment). Not sure if this is still relevant, but it was big back then. I still think #4920 is worth getting in, but I had to drop it back then because deadlocks popped up.
**Review comment:** This seems like a large change to occupancy calculations that would affect things beyond stealing.

**Reply:** That's true. It's not unreasonable though. `getitem` tasks have durations like 62 us, which I think is not reflective of general overhead. I think it probably makes sense for us to set a minimum on our compute times more broadly.
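The behavioral change under discussion can be sketched as follows. This is not the actual `distributed` code: the function names, the bandwidth figure, and the single return value are all simplifications for illustration; only the thresholds (5 ms give-up, 10 ms floor) come from the diff above.

```python
LATENCY = 0.01  # the value proposed in this PR

def cost_ratio_old(compute_time, nbytes, bandwidth=100e6):
    """Old behavior: very fast tasks are never steal candidates."""
    if compute_time < 0.005:  # 5 ms, just give up
        return None
    return (nbytes / bandwidth + LATENCY) / compute_time

def cost_ratio_new(compute_time, nbytes, bandwidth=100e6):
    """New behavior: clamp compute time to a 10 ms floor instead,
    so a transfer-cost-to-compute-time ratio can still be formed."""
    compute_time = max(compute_time, 0.010)
    return (nbytes / bandwidth + LATENCY) / compute_time

# A 62 us getitem task with a 1 kB dependency:
# previously ignored entirely, now evaluated against the 10 ms floor.
before = cost_ratio_old(62e-6, 1_000)  # None
after = cost_ratio_new(62e-6, 1_000)   # finite ratio
```

The floor matters because without it a 62 us duration in the denominator would produce an enormous ratio, making every tiny task look prohibitively expensive to steal; the reviewer's point is that such a floor arguably belongs in the occupancy calculation generally, not just here.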