context : reserve new scheduler when graph topology changes#18547
Conversation
@ngxson PTAL at the server changes when you get the chance. They are relatively minor.
tools/server/server-task.h
```cpp
return type != SERVER_TASK_TYPE_EMBEDDING &&
       type != SERVER_TASK_TYPE_RERANK;
```
I would prefer having the reversed logic here: `type == SERVER_TASK_TYPE_COMPLETION || type == SERVER_TASK_TYPE_INFILL`

Also note that `SERVER_TASK_TYPE_INFILL` will be removed soon, because it's technically just a completion task with a special chat template.
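The suggested change flips the check from a deny-list into an allow-list. A minimal sketch of the reversed logic, using a hypothetical stand-in for the task-type enum in `tools/server/server-task.h` (the helper name `task_uses_sampling` is illustrative, not the real one):

```cpp
// Hypothetical stand-in for the server task types.
enum server_task_type {
    SERVER_TASK_TYPE_COMPLETION,
    SERVER_TASK_TYPE_INFILL,
    SERVER_TASK_TYPE_EMBEDDING,
    SERVER_TASK_TYPE_RERANK,
};

// Reversed logic as suggested: explicitly allow-list the task types that
// perform sampling, instead of deny-listing the ones that do not. New task
// types then default to "no sampling" until explicitly added here.
static bool task_uses_sampling(server_task_type type) {
    return type == SERVER_TASK_TYPE_COMPLETION ||
           type == SERVER_TASK_TYPE_INFILL;
}
```

The allow-list form is also easier to maintain once `SERVER_TASK_TYPE_INFILL` is removed: only one branch of the disjunction has to be deleted.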
tools/server/server-context.cpp
```cpp
for (auto & slot : slots) {
    if (!slot.is_processing() || !slot.smpl) {
        llama_set_sampler(ctx, slot.id, nullptr);
    }
}
```
Should this be moved to `slot.release()`? If I understand it correctly, this means we set the sampler to `nullptr` if the slot is not processing anything.
Yes, good idea.
> If I understand it correctly, this means we set the sampler to nullptr if the slot is not processing anything

Yes, this prevents `llama_context` from adding dummy sampling nodes to the graph.
I moved it to `server_slot.reset()` in d9146ed

Also a bit of refactoring:
- Rename `server_slot.clear()` -> `server_slot.prompt_clear()`
- Remove redundant `slot.reset()` from `launch_slot_with_task()`. The assumption is that all slots will be reset when they are released, so no need to do it again upon launch.
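The move into `reset()` can be sketched as follows. This is a minimal, self-contained illustration with stand-in types (the real `server_slot` and `llama_set_sampler` live in `tools/server/` and `llama.h`; the recording context here is purely for the sketch):

```cpp
#include <vector>

// Stand-in for llama_context that records sampler-detach calls.
struct fake_context {
    std::vector<int> detached;  // slot ids whose sampler was set to nullptr
};

// Stand-in for llama_set_sampler(ctx, slot_id, smpl).
static void fake_set_sampler(fake_context * ctx, int slot_id, void * smpl) {
    if (smpl == nullptr) {
        ctx->detached.push_back(slot_id);
    }
}

// Sketch of a server slot that detaches its sampler on reset, so that
// llama_context does not add dummy sampling nodes to the graph for slots
// that are not processing anything.
struct server_slot {
    int            id   = -1;
    fake_context * ctx  = nullptr;
    void         * smpl = nullptr;  // per-slot backend sampler

    void reset() {
        // detach the sampler for this (now inactive) slot
        fake_set_sampler(ctx, id, nullptr);
        smpl = nullptr;
    }
};
```

Because every released slot goes through `reset()`, the loop over all slots that previously detached samplers in `server-context.cpp` becomes unnecessary.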
Should be good to merge. @ngxson Let me know if you want to take one more look.
cont #17617
In some cases we know in advance that a graph reallocation would be necessary (see #17617). Re-reserve the scheduler to reduce the number of unexpected graph reallocations and to prevent further reallocations later.
Also, the number of active samplers when using backend sampling with `llama-server` is now properly configured. Before, for a server with N slots, we were always running N samplers, regardless of how many slots were active. Now, thanks to the new reserve logic, we disable the samplers for the inactive slots.

TODOs: