Speculative inference is a great way to speed up inference for larger models, but sometimes a small model on its own is fine for small tasks (e.g. quick summarizing), where raw speed would be beneficial for many use cases.
Would it be possible to directly access the already loaded speculative draft model in llama-server?
i.e., when making a completion request to the OpenAI-compatible endpoint, pass "draft" as the model name parameter and get output from the draft model only?
For an extended version, you could request the draft, large, or speculative output, although the large and speculative outputs should be identical. A sketch of the proposed usage follows below.
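For illustration, here is a minimal sketch of what such a request could look like against llama-server's OpenAI-compatible endpoint. To be clear, the "draft", "large", and "speculative" model names are the hypothetical values this request proposes, not something the server supports today, and the base URL is just an assumed local default:

```python
import requests

# Assumed local llama-server address; adjust to your setup.
BASE_URL = "http://localhost:8080/v1"

def complete(model: str, prompt: str) -> str:
    """Send an OpenAI-compatible chat completion request to llama-server.

    `model` uses the *proposed* routing values from this request:
    "draft", "large", or "speculative". These are hypothetical; the
    server does not currently route on them.
    """
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Fast path: draft model only, for small tasks like quick summarizing.
summary = complete("draft", "Summarize this paragraph in one sentence: ...")

# Full quality: large model (output should be identical to "speculative").
answer = complete("large", "Explain the trade-offs in detail: ...")
```

The appeal of keying this off the existing model name parameter is that any OpenAI-compatible client could pick between speed and intelligence per request, with no new endpoint needed.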
The main goal is to benefit the end user by offering either more intelligence or more speed.