Fix sequence continuous batching close session race condition #3198
Conversation
LGTM
                continue;
            }
            // Avoid duplicate poll tasks in the executor queue
            if (pollQueueTasks.containsKey("") && !pollQueueTasks.get("").isDone()) {
How about naming the key pollJobGroup for doc purposes?
Yes, that makes sense, will update the key.
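A minimal standalone sketch of the keyed dedup check under discussion, with the named key in place of the empty string (the class and method scaffolding here are hypothetical, not the PR's code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PollTaskDedupSketch {
    // Named key, as suggested in the review thread.
    private static final String POLL_JOB_GROUP = "pollJobGroup";

    private final ConcurrentHashMap<String, CompletableFuture<Void>> pollQueueTasks =
            new ConcurrentHashMap<>();
    private final ExecutorService pollExecutors = Executors.newFixedThreadPool(2);

    // Schedule a poll task only if no task for the shared key is still pending.
    void maybeSchedulePoll(Runnable pollJobGroup) {
        CompletableFuture<Void> existing = pollQueueTasks.get(POLL_JOB_GROUP);
        if (existing != null && !existing.isDone()) {
            return;
        }
        pollQueueTasks.put(
                POLL_JOB_GROUP, CompletableFuture.runAsync(pollJobGroup, pollExecutors));
    }
}
```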
Force-pushed from 34feee4 to 80d657c
    public SequenceBatching(Model model) {
        super(model);
        this.localCapacity =
This is a static value, not a dynamic one. The real quota is calculated in the original code:

int quota =
        Math.min(
                this.localCapacity - jobsQueue.size(),
                model.getPendingJobGroups().size() / model.getMaxWorkers());
                        this.localCapacity.get(),
                        Math.max(
                                1, model.getPendingJobGroups().size() / model.getMaxWorkers()));
The only fix should be:

int quota =
        Math.min(
                this.localCapacity - jobsQueue.size(),
                Math.max(
                        1, model.getPendingJobGroups().size() / model.getMaxWorkers()));
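To see why the Math.max matters, here is a quick standalone check with illustrative numbers (the values are made up for the example, not taken from the PR):

```java
public class QuotaSketch {
    public static void main(String[] args) {
        int localCapacity = 4, jobsQueued = 3, pendingJobGroups = 1, maxWorkers = 2;
        // Without Math.max: integer division 1 / 2 == 0, so the pending
        // job group is never admitted even though capacity remains.
        int before = Math.min(localCapacity - jobsQueued, pendingJobGroups / maxWorkers);
        // With Math.max(1, ...): at least one group is admitted
        // whenever capacity allows.
        int after =
                Math.min(localCapacity - jobsQueued,
                        Math.max(1, pendingJobGroups / maxWorkers));
        System.out.println("quota before=" + before + ", after=" + after); // before=0, after=1
    }
}
```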
@@ -120,6 +127,8 @@ private void cleanJobGroup(String jobGroupId) {
         logger.debug("Clean jobGroup: {}", jobGroupId);
         if (jobGroupId != null) {
             model.removeJobGroup(jobGroupId);
+            pollQueueTasks.remove(jobGroupId);
+            localCapacity.incrementAndGet();
"localCapacity.incrementAndGet();" is not needed.
@@ -176,6 +185,7 @@ public void shutdownExecutors() {

     private void addJobGroup(String jobGroupId) {
         if (jobGroupId != null) {
+            localCapacity.decrementAndGet();
This is not needed.
            if (localCapacity.get() <= 0) {
                continue;
            }
            // Avoid duplicate poll tasks in the executor queue
            if (pollQueueTasks.containsKey("pollJobGroup")
                    && !pollQueueTasks.get("pollJobGroup").isDone()) {
                continue;
            }
All of this logic is covered by the original function pollJobGroup():

if (isPollJobGroup.getAndSet(true)) {
    return;
}
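For reference, a minimal sketch of the single-flight pattern the reviewer cites: the first caller flips the flag and proceeds, concurrent callers bail out immediately, so duplicate poll tasks cannot stack up. The scaffolding and the reset placement are illustrative, not the exact method body:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class PollGuardSketch {
    private final AtomicBoolean isPollJobGroup = new AtomicBoolean(false);

    void pollJobGroup() {
        if (isPollJobGroup.getAndSet(true)) {
            return; // another poll is already in flight
        }
        try {
            // ... poll pending job groups here (abridged) ...
        } finally {
            // Reset so the next caller can enter; the real method's reset
            // point may differ.
            isPollJobGroup.set(false);
        }
    }
}
```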
    // HashMap to track poll queue tasks in the executor queue
    private ConcurrentHashMap<String, CompletableFuture<Void>> pollQueueTasks =
            new ConcurrentHashMap<String, CompletableFuture<Void>>();
This var is not necessary. In fact, it breaks the existing logic in eventJobGroupIds.
Description
In the current Sequence Batching event dispatcher implementation, we do the following:
serve/frontend/server/src/main/java/org/pytorch/serve/wlm/SequenceBatching.java
Lines 187 to 218 in 079ff7b
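For orientation, a rough sketch of the dispatcher loop in that region (names follow SequenceBatching where known; the loop body is abridged and partly assumed):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class EventDispatcherSketch implements Runnable {
    private final LinkedBlockingDeque<String> eventJobGroupIds = new LinkedBlockingDeque<>();
    private final ExecutorService pollExecutors = Executors.newFixedThreadPool(4);
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final long maxBatchDelayMs = 100; // illustrative value

    @Override
    public void run() {
        while (running.get()) {
            try {
                // Wake up at least every maxBatchDelay interval.
                String jobGroupId =
                        eventJobGroupIds.poll(maxBatchDelayMs, TimeUnit.MILLISECONDS);
                if (jobGroupId == null || jobGroupId.isEmpty()) {
                    // No specific group event: look for newly arrived job groups.
                    CompletableFuture.runAsync(this::pollJobGroup, pollExecutors);
                } else {
                    // A group signalled readiness: poll its next job.
                    CompletableFuture.runAsync(
                            () -> pollJobFromJobGroup(jobGroupId), pollExecutors);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private void pollJobGroup() { /* abridged */ }

    private void pollJobFromJobGroup(String jobGroupId) { /* abridged */ }
}
```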
Every maxBatchDelay interval, the dispatcher runs through this loop. As a result of this, we have the following outcomes:
Concretely, the issue is triggered in the following scenario:

- maxNumSequence number of sessions are actively open

In summary, a stream response request prevents graceful session closure and cleanup.
The likely root cause is that the session cleanup logic fails to detect session closure after a stream response: the logic that polls jobs from an existing job group has already gone past the point where we detect closed sessions:
serve/frontend/server/src/main/java/org/pytorch/serve/wlm/SequenceBatching.java
Lines 220 to 226 in 079ff7b
Moreover, with a stream response, for each chunk we send back we enqueue a poll job group task on the poll executor queue, although we expect only one such task to be active at a given point in time.
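A small standalone illustration of that symptom (hypothetical code, not the PR's implementation): one executor, five streamed chunks, and five queued poll tasks where one would do.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DuplicatePollSketch {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pollExecutor = Executors.newSingleThreadExecutor();
        AtomicInteger executedPolls = new AtomicInteger();
        int streamedChunks = 5;
        for (int i = 0; i < streamedChunks; i++) {
            // One submit per chunk: one logical poll becomes five queued tasks.
            pollExecutor.execute(executedPolls::incrementAndGet);
        }
        pollExecutor.shutdown();
        pollExecutor.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println("poll tasks executed: " + executedPolls.get()); // 5, not 1
    }
}
```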
Type of change
Feature/Issue validation/testing
Without the fix in this PR, the test fails as follows:
With the fix in this PR: