
Fix sequence continuous batching close session race condition #3198

Merged
merged 24 commits into master from debug-stateful-continuous-batching on Jun 22, 2024

Conversation

namannandan
Collaborator

@namannandan namannandan commented Jun 20, 2024

Description

In the current Sequence Batching event dispatcher implementation, we do the following:

class EventDispatcher implements Runnable {
    @Override
    public void run() {
        while (running.get()) {
            try {
                String jobGroupId =
                        eventJobGroupIds.poll(model.getMaxBatchDelay(), TimeUnit.MILLISECONDS);
                if (jobGroupId == null || jobGroupId.isEmpty()) {
                    CompletableFuture.runAsync(
                            () -> {
                                try {
                                    pollJobGroup();
                                } catch (InterruptedException e) {
                                    logger.error("Failed to poll a job group", e);
                                }
                            },
                            pollExecutors);
                } else {
                    CompletableFuture.runAsync(
                            () -> {
                                pollJobFromJobGroup(jobGroupId);
                            },
                            pollExecutors);
                }
            } catch (InterruptedException e) {
                if (running.get()) {
                    logger.error("EventDispatcher failed to get jobGroup", e);
                }
            }
        }
    }
}

Every maxBatchDelay interval:

  1. We either queue a task to poll for new job groups that the worker can process jobs from, or
  2. we queue a task to poll an existing job group queue for new jobs

As a result, we can end up with the following outcomes:

  1. Duplicate tasks to poll for new job groups, which may tie up more than one thread in the poll executor thread pool
  2. Duplicate tasks to poll the same existing job group queue, which interferes with job group cleanup on close-session requests

Concretely, the issue is triggered in the following scenario:

  1. maxNumSequence number of sessions are actively open
  2. A sequence gets a stream response request
  3. The same sequence subsequently gets a close session request
  4. Although the sequence is closed and should free up capacity to open up a new session, it holds session capacity until the session times out and only then gets cleaned up.

In summary, a stream response request prevents graceful session closure and cleanup.

The likely root cause is that the session cleanup logic fails to detect session closure after a stream response, because the task polling jobs from the existing job group has already gone past the point where we detect closed sessions:

private void pollJobFromJobGroup(String jobGroupId) {
    // Poll a job from a jobGroup
    JobGroup jobGroup = model.getJobGroup(jobGroupId);
    Job job = null;
    if (!jobGroup.isFinished()) {
        job = jobGroup.pollJob(model.getSequenceMaxIdleMSec());
    }
    // ...

Moreover, with stream responses, we enqueue a poll job group task on the poll executor queue for each chunk we send back, even though we expect only one such task to be active at any given point in time.
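To make the fix concrete, here is a minimal sketch of the de-duplication idea (a sketch under assumptions, not the exact PR code): a map keyed by task name tracks the in-flight CompletableFuture, and the dispatcher skips enqueueing a new poll task while a previous one is still pending. The pollQueueTasks name mirrors the diff excerpts further below; logger, pollExecutors, and pollJobGroup() are assumed from the surrounding EventDispatcher.

// Sketch only: track in-flight poll tasks so the dispatcher does not enqueue
// a second task for the same target while one is still pending or running.
private final ConcurrentHashMap<String, CompletableFuture<Void>> pollQueueTasks =
        new ConcurrentHashMap<>();

private void dispatchPollJobGroup() {
    CompletableFuture<Void> inFlight = pollQueueTasks.get("pollJobGroup");
    if (inFlight != null && !inFlight.isDone()) {
        // A poll-job-group task is already queued or running; skip this tick.
        return;
    }
    pollQueueTasks.put(
            "pollJobGroup",
            CompletableFuture.runAsync(
                    () -> {
                        try {
                            pollJobGroup();
                        } catch (InterruptedException e) {
                            logger.error("Failed to poll a job group", e);
                        }
                    },
                    pollExecutors));
}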

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

Feature/Issue validation/testing

  • Updated the regression test to open a session, make a stream response request, and close the session multiple times, verifying graceful session closure and cleanup.

Without the fix in this PR, the test fails as follows:

$ python -m pytest test/pytest/test_example_stateful_sequence_continuous_batching_http.py::test_infer_stateful_cancel
.....
.....
  AssertionError: assert '{\n  "code":...eueSize"\n}\n' == '-1'
    - -1
    + {
    +   "code": 503,
    +   "type": "ServiceUnavailableException",
    +   "message": "Model \"stateful\" has no worker to serve inference request. Please use scale workers API to add workers. If this is a sequence inference, please check if it is closed, or expired; or exceeds maxSequenceJobQueueSize"
    + }

With the fix in this PR:

$ python -m pytest test/pytest/test_example_stateful_sequence_continuous_batching_http.py::test_infer_stateful_cancel

=============================================================== test session starts ===============================================================
platform darwin -- Python 3.9.6, pytest-7.3.1, pluggy-1.5.0
rootdir: /Volumes/workplace/pytorch/serve
plugins: cov-4.1.0, timeout-2.3.1, mock-3.14.0
collected 1 item                                                                                                                                  

test/pytest/test_example_stateful_sequence_continuous_batching_http.py .                                                                    [100%]

================================================================ warnings summary =================================================================
venvs/ts_dev/lib/python3.9/site-packages/urllib3/__init__.py:35
  /Volumes/workplace/pytorch/serve/venvs/ts_dev/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================== 1 passed, 1 warning in 15.58s ==========================================================
  • CI

@namannandan namannandan marked this pull request as ready for review June 21, 2024 23:09
@namannandan namannandan requested a review from mreso June 21, 2024 23:09
Collaborator

@mreso mreso left a comment

LGTM

                        continue;
                    }
                    // Avoid duplicate poll tasks in the executor queue
                    if (pollQueueTasks.containsKey("") && !pollQueueTasks.get("").isDone()) {
Collaborator

How about naming the key pollJobGroup for doc purposes?

Collaborator Author

Yes, that makes sense, will update the key.

@namannandan namannandan force-pushed the debug-stateful-continuous-batching branch from 34feee4 to 80d657c on June 22, 2024 02:23
@namannandan namannandan enabled auto-merge June 22, 2024 02:26
@namannandan namannandan added this pull request to the merge queue Jun 22, 2024
Merged via the queue into master with commit 4c96e6f Jun 22, 2024
12 checks passed

    public SequenceBatching(Model model) {
        super(model);
        this.localCapacity =
Collaborator

This is a static value, not a dynamic value. The real quota is calculated in the original code:

int quota =
        Math.min(
                this.localCapacity - jobsQueue.size(),
                model.getPendingJobGroups().size() / model.getMaxWorkers());

Comment on lines +79 to +81
                        this.localCapacity.get(),
                        Math.max(
                                1, model.getPendingJobGroups().size() / model.getMaxWorkers()));
Collaborator

The only fix should be:
int quota =
        Math.min(
                this.localCapacity - jobsQueue.size(),
                Math.max(
                        1, model.getPendingJobGroups().size() / model.getMaxWorkers()));
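For illustration of why the Math.max(1, ...) floor matters (the values below are invented for this example, not taken from the PR): integer division rounds down, so when there are fewer pending job groups than workers the ratio is 0, and without the floor the quota would never allow a new job group to be polled.

// Invented example values:
int pendingJobGroups = 2;  // model.getPendingJobGroups().size()
int maxWorkers = 4;        // model.getMaxWorkers()
int localCapacity = 4;
int jobsQueueSize = 1;     // jobsQueue.size()

// Without the floor: 2 / 4 == 0, so quota == min(3, 0) == 0 and nothing is polled.
int withoutFloor = Math.min(localCapacity - jobsQueueSize, pendingJobGroups / maxWorkers);

// With the floor: quota == min(3, max(1, 0)) == 1, so one new job group can still be polled.
int withFloor =
        Math.min(localCapacity - jobsQueueSize, Math.max(1, pendingJobGroups / maxWorkers));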

@@ -120,6 +127,8 @@ private void cleanJobGroup(String jobGroupId) {
         logger.debug("Clean jobGroup: {}", jobGroupId);
         if (jobGroupId != null) {
             model.removeJobGroup(jobGroupId);
+            pollQueueTasks.remove(jobGroupId);
+            localCapacity.incrementAndGet();
Collaborator

"localCapacity.incrementAndGet();" is not needed.

@@ -176,6 +185,7 @@ public void shutdownExecutors() {

     private void addJobGroup(String jobGroupId) {
         if (jobGroupId != null) {
+            localCapacity.decrementAndGet();
Collaborator

this is not needed.

Comment on lines +206 to +213
                    if (localCapacity.get() <= 0) {
                        continue;
                    }
                    // Avoid duplicate poll tasks in the executor queue
                    if (pollQueueTasks.containsKey("pollJobGroup")
                            && !pollQueueTasks.get("pollJobGroup").isDone()) {
                        continue;
                    }
Collaborator

All of this logic is covered by the original pollJobGroup() function:

if (isPollJobGroup.getAndSet(true)) {
    return;
}
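For context, a minimal sketch of the guard pattern referenced above, assuming isPollJobGroup is an AtomicBoolean and that the flag is cleared once polling completes (the reset shown in a finally block is illustrative; the original code may differ):

// Sketch only: AtomicBoolean.getAndSet(true) returns the previous value, so at most
// one pollJobGroup() invocation proceeds at a time; concurrent callers return early.
private final AtomicBoolean isPollJobGroup = new AtomicBoolean(false);

private void pollJobGroup() throws InterruptedException {
    if (isPollJobGroup.getAndSet(true)) {
        return; // another poll is already in progress
    }
    try {
        // ... poll model.getPendingJobGroups() and add new job group ids locally ...
    } finally {
        isPollJobGroup.set(false); // illustrative reset location
    }
}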

Comment on lines +41 to +43
    // HashMap to track poll queue tasks in the executor queue
    private ConcurrentHashMap<String, CompletableFuture<Void>> pollQueueTasks =
            new ConcurrentHashMap<String, CompletableFuture<Void>>();
Collaborator

This variable is not necessary. In fact, it breaks the existing logic in eventJobGroupIds.
