Add documentation on generative sequence (#6595)
* Add documentation on generative sequence

* Address comment

* Reflect the "iterative" change
GuanLuo authored Nov 20, 2023
1 parent e7bee37 commit daceccf
Showing 1 changed file with 54 additions and 0 deletions: docs/user_guide/model_configuration.md
@@ -980,6 +980,60 @@
indicating sequence start, end, ready and correlation ID. See
[Stateful Models](architecture.md#stateful-models) for more
information and examples.

#### Iterative Sequence

When using the sequence batcher, users may enable "iterative sequences" by
specifying the following:

```
sequence_batching {
iterative_sequence: true
}
```

"Iterative sequence" refers to the type of model execution that the model
iteratively executes on the same request to produce responses until it
has generated all responses for the request. When iterative sequence is
enabled, the scheduler will expect a single incoming request to initiate the
sequence, and then the backends that support iterative sequence can yield back
to the sequence batcher to reschedule the request for further execution in
a future batch.

Because only one request is used to represent the whole sequence, the user
does not need to, and should not, set the
[control inputs](architecture.md#control-inputs) mentioned in the previous
section; they are filled in internally by the scheduler.
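
For example, since an iterative sequence is initiated by a single ordinary
request, a client simply omits all sequence control parameters. Below is a
minimal sketch using the Python `tritonclient` gRPC streaming API; the model
name `iterative_model` and the tensor name/shape are hypothetical:

```
import queue

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def callback(result, error):
    # Each execution of the iterative sequence may stream back a response.
    results.put(error if error is not None else result)

with grpcclient.InferenceServerClient("localhost:8001") as client:
    client.start_stream(callback=callback)

    # Hypothetical model and tensor names. Note that no sequence control
    # parameters (sequence_id, sequence_start, sequence_end) are passed.
    inputs = [grpcclient.InferInput("INPUT", [1, 8], "INT32")]
    inputs[0].set_data_from_numpy(np.zeros((1, 8), dtype=np.int32))
    client.async_stream_infer(model_name="iterative_model", inputs=inputs)

    # Block for at least one streamed response; a real client would keep
    # reading until it observes the final-response flag.
    print(results.get())
```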

Iterative sequences are similar to the [decoupled](#decoupled) setting in that
an undetermined number of responses may be generated for a request. One major
difference is that an iterative sequence may reschedule and execute the same
request repeatedly to fully utilize Triton's batching feature: a decoupled
model generates all responses associated with a request within a single
execution, which can waste compute resources when execution for one request in
the batch runs much longer than the rest, whereas with iterative sequences the
response generation is broken down into multiple executions and each execution
can run with a newly formed batch. Also note that the two settings are not
mutually exclusive; users may enable both for models that fit.
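
For instance, a model could combine both settings so that each rescheduled
execution can also stream responses as they are produced. A sketch
(`model_transaction_policy` is the standard field for enabling decoupled mode;
the `max_batch_size` value is illustrative):

```
max_batch_size: 8
sequence_batching {
  iterative_sequence: true
}
model_transaction_policy {
  decoupled: true
}
```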

##### Achieving Inflight Batching with Iterative Sequences

Inflight batching is a technique used in large language model (LLM) inference;
without it, model throughput is limited. In LLM inference, the lengths of the
responses within a batch can vary significantly, so the batch slot of an
early-ending response remains idle until the whole batch completes. Inflight
batching continuously assigns new requests to early-ending batch slots while
the batch is still executing, which maps directly onto the iterative sequence
usage described in the previous section.

To achieve inflight batching with iterative sequences, the backend should
break an execution into steps, where each step corresponds to one Triton model
instance execution. At the end of each step, the model instance releases the
requests whose responses have completed and reschedules the remaining
requests, so that Triton forms and schedules the next batch from a mix of new
and rescheduled requests.
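
Below is a minimal sketch of this per-step pattern for a Python backend model,
assuming the Python backend's request rescheduling API (`set_release_flags`
with `pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE`) and decoupled
response senders; `generate_next_token` is a hypothetical helper standing in
for the model's real per-step computation:

```
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # One call to execute() is one "step": every request in the batch
        # advances by a single iteration (e.g. one generated token).
        for request in requests:
            # Hypothetical helper: runs one step of the model for this
            # request and returns (numpy output, whether it has finished).
            token, finished = generate_next_token(request)

            sender = request.get_response_sender()
            response = pb_utils.InferenceResponse(
                output_tensors=[pb_utils.Tensor("OUTPUT", token)])

            if finished:
                # Early-ending request: send the final response and let
                # Triton release its batch slot, freeing it for a new
                # request in the next batch.
                sender.send(
                    response,
                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
            else:
                # Stream the intermediate response and ask Triton to
                # reschedule this request into a future batch.
                sender.send(response)
                request.set_release_flags(
                    pb_utils.TRITONSERVER_REQUEST_RELEASE_RESCHEDULE)

        # With decoupled response senders, execute() returns None.
        return None
```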

### Ensemble Scheduler

The ensemble scheduler must be used for [ensemble
