Description
The current slot system with -np is frustrating because it forces every query into a greatly reduced maximum token count. For example, with a context length of 16k and four slots, each slot gets only 4k, and you can no longer run a 16k query at all without it being heavily truncated.
A partial solution would be to let the operator specify the number of tokens in each slot, so that they could at least keep one high-token-count slot. An ideal solution would be for the server to be adaptive: to look at what's in the queue and, using a combination of how long each query has been waiting and how well different queries could be packed into the max context length, determine which queries to run, how many slots to use, and what size each slot should be.
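
To make the adaptive idea concrete, here is a rough sketch of the kind of selection step I have in mind. It is written in Python purely for illustration and is not based on the server's actual code: `Request`, `pick_batch`, `n_ctx` and `max_slots` are hypothetical names, and the policy (oldest query first, then first-fit into the remaining context) is just one simple way to combine waiting time with packing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: str      # hypothetical identifier for a queued query
    n_tokens: int    # tokens the query needs (prompt plus max generation)
    waited_s: float  # seconds the query has spent in the queue


def pick_batch(queue: List[Request], n_ctx: int, max_slots: int) -> List[Request]:
    """Choose which queued requests to run next.

    Assumed policy: visit requests from longest-waiting to shortest-waiting
    and take each one that still fits in the remaining context. Each chosen
    request gets a slot sized to its own n_tokens, so the slot sizes always
    sum to at most n_ctx.
    """
    chosen: List[Request] = []
    remaining = n_ctx
    for req in sorted(queue, key=lambda r: r.waited_s, reverse=True):
        if len(chosen) == max_slots:
            break
        if req.n_tokens <= remaining:
            chosen.append(req)
            remaining -= req.n_tokens
    return chosen


queue = [
    Request("A", 12000, 5.0),
    Request("B", 6000, 4.0),
    Request("C", 3000, 3.0),
    Request("D", 1000, 1.0),
    Request("E", 500, 0.5),
]
batch = pick_batch(queue, n_ctx=16384, max_slots=4)
print([r.req_id for r in batch])  # ['A', 'C', 'D']
```

Because the longest-waiting query is always considered first against the full context, a single 16k query would eventually get the whole context to itself instead of being truncated to a 4k slot. A real implementation would presumably need a smarter score than pure waiting time, and would have to respect whatever constraints the slot code imposes.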
While I wouldn't be the ideal person to write the slot-handling side of things, I'd be more than happy to write the queueing mechanism if this is of interest. I would just need to know what sort of data structure you could provide for the queue and what limitations there would be on slots (including any performance considerations).