Speculative inference is a great way to speed up inference for larger models, but sometimes a small model on its own is fine for small tasks (e.g. quick summarizing), where raw speed would be beneficial for many use cases.
Would it be possible to directly access the already loaded speculative draft model in llama-server?
i.e., when making a completion request to the OpenAI-compatible endpoint, pass "draft" as the model name parameter and get output from the draft model only?
For an extended version, you could request the draft, large, or speculative output, although the large and speculative outputs should be identical. A sketch of the proposed usage follows below.
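For illustration, here is a minimal sketch of what such a request could look like against llama-server's OpenAI-compatible endpoint. To be clear, the "draft", "large", and "speculative" model names are the hypothetical values this request proposes, not something the server supports today, and the base URL is just an assumed local default:

```python
import requests

# Assumed local llama-server address; adjust to your setup.
BASE_URL = "http://localhost:8080/v1"

def complete(model: str, prompt: str) -> str:
    """Send an OpenAI-compatible chat completion request to llama-server.

    `model` uses the *proposed* routing values from this request:
    "draft", "large", or "speculative". These are hypothetical; the
    server does not currently route on them.
    """
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Fast path: draft model only, for small tasks like quick summarizing.
summary = complete("draft", "Summarize this paragraph in one sentence: ...")

# Full quality: large model (output should be identical to "speculative").
answer = complete("large", "Explain the trade-offs in detail: ...")
```

The appeal of keying this off the existing model name parameter is that any OpenAI-compatible client could pick between speed and intelligence per request, with no new endpoint needed.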
The main goal is to benefit the end user by offering either more intelligence or more speed.