Description
Problem
If GPU acceleration is enabled, Jan appears to follow an "all or nothing" strategy: the model fails to load entirely if, for example, there is not enough VRAM.
Success Criteria
A much better approach would be "graceful degradation": if the model cannot fit in VRAM, load it on the CPU instead, perhaps with a UI warning to notify the user of what has happened. That way the model would at least still respond, even if more slowly. It would also allow accelerating small models while still being able to work with larger ones.
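To illustrate the idea, here is a minimal sketch of what graceful degradation could look like, using llama-cpp-python as a stand-in for Jan's actual backend. The warn_user() helper is a hypothetical placeholder for whatever notification Jan's UI would show, and the sketch assumes the backend reports the VRAM failure as a catchable error rather than crashing the process:

```python
from llama_cpp import Llama


def warn_user(message: str) -> None:
    # Hypothetical placeholder for a UI notification in Jan.
    print(f"[warning] {message}")


def load_model(model_path: str) -> Llama:
    try:
        # First attempt: offload every layer to the GPU.
        return Llama(model_path=model_path, n_gpu_layers=-1)
    except Exception:
        # e.g. the backend could not allocate enough VRAM during load.
        warn_user(
            "Not enough VRAM to accelerate this model; "
            "falling back to CPU. Responses will be slower."
        )
        # Second attempt: keep every layer on the CPU.
        return Llama(model_path=model_path, n_gpu_layers=0)
```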
An ideal approach would be to implement partial model offloading. That way it would be possible to estimate how many layers can be safely offloaded to VRAM, so the model is accelerated as much as possible on the given hardware.
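A rough sketch of the "guess the layer count" part, assuming an NVIDIA GPU queried through NVML (pynvml) and a llama.cpp-style n_gpu_layers setting; the per-layer size estimate and the 0.9 safety margin are illustrative guesses, not measured values:

```python
import os

import pynvml


def estimate_gpu_layers(model_path: str, total_layers: int,
                        safety_margin: float = 0.9) -> int:
    """Guess how many layers fit in the currently free VRAM."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        free_vram = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()

    # Crude approximation: treat layers as roughly equal in size and use
    # the file size on disk as a proxy for the in-memory footprint.
    bytes_per_layer = os.path.getsize(model_path) / total_layers
    usable_vram = free_vram * safety_margin
    return min(total_layers, int(usable_vram // bytes_per_layer))
```

The result could then be passed as n_gpu_layers when loading the model, with the CPU fallback above as a safety net if the guess turns out to be too optimistic.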
Additional context
I think LMStudio and GPT4All implement partial model offloading, so it is clearly feasible. However, they just expose a slider in the UI and leave it to the user to figure out how many layers can fit in VRAM.