Currently, n_ctx is locked to 2048, but with people starting to experiment with ALiBi models (BluemoonRP, and MPT once that gets sorted out properly), RedPajama talking about Hyena, and StableLM potentially aiming for 4k context, the ability to raise the context size for llama.cpp models is going to be very useful going forward, especially since most of those models are likely to be run on CPU by people on consumer hardware.
I also think the expected behavior is that whatever context limit I set in the UI should be passed through to the inference backend. Requiring a model reload to change the setting is fine, but the value should be passed through whenever a ggml model is loaded. A "reload model on context size change" option could be nice to have if there's a clean spot for it, assuming it would be useful for more than just ggml files. Maybe instead of a checkbox, a convenient button that pops up after the value changes to cue people to reload the model, since knowing when the user is done adjusting the context size is hard and reloading is fairly heavy.
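For the loader side, something like this rough sketch is what I have in mind (the function name and plumbing are made up; the only thing I'm relying on is that llama-cpp-python's `Llama` constructor already accepts an `n_ctx` argument):

```python
# Hypothetical sketch: pass the UI/flag context size through at load time
# instead of hardcoding 2048. The function name and wiring are invented;
# llama_cpp.Llama genuinely takes n_ctx.
from llama_cpp import Llama

def load_ggml_model(model_path: str, n_ctx: int = 2048) -> Llama:
    # n_ctx would come from the UI setting or a --n_ctx flag at the
    # moment the ggml model is (re)loaded.
    return Llama(model_path=model_path, n_ctx=n_ctx)
```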
Likewise, I think --n_ctx should be exposed as a command-line flag for people who want to automate loading larger-context models from .sh/.bat scripts.
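For the flag itself, a rough argparse sketch of what I mean (the flag name and how it would hook into the webui's existing argument parsing are just for illustration):

```python
# Hypothetical sketch of the flag; wiring into the webui's real argument
# parser would differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--n_ctx", type=int, default=2048,
    help="Context size to use when loading ggml (llama.cpp) models.",
)
args = parser.parse_args()
# args.n_ctx would then be handed to the ggml loader at model load time.
```

That way a .sh/.bat script could do something like `python server.py --model <ggml model> --n_ctx 4096` without touching the UI.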
Correct me if I'm wrong, but doesn't textgen specifically use llama.cpp via llama-cpp-python, which is meant for LLaMA models?
Maybe it could be useful to be able to change this value anyway though.
I'm pretty sure there was recently a merge for llama.cpp that added loading for GPT NeoX models, though I may have misread that somewhere and it might not be in yet. Either way, it's clear that supporting GPT NeoX in llama.cpp isn't being treated as a separate-project sort of thing.
And BluemoonRP 13B is a llama model with ALiBi support baked in (or however that should be phrased), so it's loadable and usable today in base llama.cpp without changes.