Smarter slot handling

The way slots currently work with -np is frustrating, because it forces every query into a greatly reduced max token count.  For example, with a context length of 16k and four slots, each slot gets only 4k, and you can no longer run any 16k queries at all without them being heavily truncated.
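To make the arithmetic concrete, here is a tiny sketch of the even split as I understand it from the behaviour described above. The function names are mine and purely illustrative; they are not part of the server.

```python
def per_slot_context(n_ctx: int, n_slots: int) -> int:
    """Context available to each slot, assuming the total is split evenly."""
    return n_ctx // n_slots


def fits_in_slot(query_tokens: int, n_ctx: int, n_slots: int) -> bool:
    """True if a query of this size can run in one slot without truncation."""
    return query_tokens <= per_slot_context(n_ctx, n_slots)


print(per_slot_context(16384, 4))     # 4096
print(fits_in_slot(16384, 16384, 4))  # False: a 16k query no longer fits
print(fits_in_slot(16384, 16384, 1))  # True: it only fits with a single slot
```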

A partial solution would be to let the operator specify the number of tokens in each slot, so that they could at least keep one high-token-count slot.  An ideal solution would be for the server to be adaptive - to look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which queries to run and how many slots of what size to use.
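A rough sketch of the kind of selection I have in mind is below. Everything in it is an assumption for illustration: the `Query` fields, the `pick_batch` name, the greedy strategy and the scoring weights are all hypothetical, and it ignores whatever constraints the real slot implementation has. It scores each waiting query by its wait time plus how much of the remaining context it would fill, and keeps picking until nothing else fits.

```python
from dataclasses import dataclass


@dataclass
class Query:
    qid: str
    tokens: int    # max tokens the query needs (prompt + generation)
    waited: float  # seconds spent in the queue so far


def pick_batch(queue, n_ctx, wait_weight=1.0, fit_weight=1.0):
    """Greedily choose queries whose token needs sum to at most n_ctx.

    Each chosen query would get a slot sized to its own token need.
    """
    remaining = n_ctx
    chosen = []
    candidates = list(queue)
    while True:
        # Only queries that still fit in the unused context are eligible.
        fitting = [q for q in candidates if q.tokens <= remaining]
        if not fitting:
            break
        # Favour long waits, then queries that fill the remaining space well.
        best = max(
            fitting,
            key=lambda q: wait_weight * q.waited + fit_weight * q.tokens / remaining,
        )
        chosen.append(best)
        candidates.remove(best)
        remaining -= best.tokens
    return chosen


queue = [
    Query("b", 4000, 5.0),
    Query("c", 4000, 4.0),
    Query("d", 8000, 1.0),
]
batch = pick_batch(queue, n_ctx=16384)
print([(q.qid, q.tokens) for q in batch])  # three slots: 4000, 4000, 8000
```

With a long-waiting 16000-token query in the same queue, this sketch would run that query alone in a single large slot first, which is exactly the case the fixed even split cannot handle. The weights would need tuning, since wait time and fill fraction are on different scales.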

While I wouldn't be the ideal person to write the slot-handling side of things, I'd be more than happy to write the queueing mechanism if this were of interest.  I would just need to know what sort of data structure you could provide for the queue and what limitations there would be on slots (including any performance considerations).

Smarter slot handling #5737
