**Is your feature request related to a problem? Please describe.**
Models remain resident in GPU memory indefinitely, especially in distributed deployments where the shutdown API does not propagate across nodes.
**Describe the solution you'd like**
- Permit appropriately privileged users to evict models from GPU memory into system RAM, or fully to disk (exiting their runner), across the fleet, on a node-by-node or fleet-wide basis through standard API interfaces.
- Implement "aging" for in-memory model state: track the last interaction with each model's consumer and proactively free resources for incoming tasks by demoting layers to system memory or disk. LRU is the simplest approach, but frequency of use is also relevant, so an Adaptive Replacement Cache (ARC) pattern might be more appropriate.
- Implement "tuning knobs" to bias behavior, such as demoting small models first, or fully evicting certain models based on their metadata instead of demoting them to RAM.
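The aging and tuning-knob ideas above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the project's actual API: all names (`ModelEvictionCache`, `touch`, `full_evict_below_mb`, the tier labels) are invented for the example, and it uses plain LRU rather than ARC for brevity.

```python
import time
from collections import OrderedDict

# Memory tiers a model can occupy (illustrative labels).
GPU, RAM, DISK = "gpu", "ram", "disk"

class ModelEvictionCache:
    """Hypothetical sketch: LRU aging of loaded models with one tuning knob.

    Models are ordered by last use; under GPU pressure the least recently
    used model is demoted GPU -> RAM, except that models smaller than
    `full_evict_below_mb` are fully evicted to disk (runner exit) instead.
    """

    def __init__(self, gpu_slots, full_evict_below_mb=0):
        self.gpu_slots = gpu_slots
        self.full_evict_below_mb = full_evict_below_mb
        self.models = OrderedDict()  # name -> {"size_mb", "tier", "last_used"}

    def touch(self, name, size_mb):
        """Record a request for `name`, loading or promoting it to GPU."""
        if name in self.models:
            self.models.move_to_end(name)  # most recently used goes last
            self.models[name]["tier"] = GPU
        else:
            self.models[name] = {"size_mb": size_mb, "tier": GPU}
        self.models[name]["last_used"] = time.monotonic()
        self._enforce_pressure()

    def evict(self, name, to_disk=False):
        """User-driven eviction (the privileged API surface requested above)."""
        self.models[name]["tier"] = DISK if to_disk else RAM

    def _enforce_pressure(self):
        # Demote oldest GPU-resident models while the GPU tier is over budget.
        while sum(m["tier"] == GPU for m in self.models.values()) > self.gpu_slots:
            for entry in self.models.values():  # iterates oldest first
                if entry["tier"] == GPU:
                    # Tuning knob: small models skip RAM and exit entirely.
                    small = entry["size_mb"] < self.full_evict_below_mb
                    entry["tier"] = DISK if small else RAM
                    break
```

An ARC variant would additionally track recency and frequency in separate lists, which matters when a small set of models is hit constantly alongside one-off loads.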
**Describe alternatives you've considered**
- Currently looking at modifying an Open WebUI contribution to provide a "user-friendly eviction" capability via the shutdown API.
**Additional context**
- A full implementation may merit adding metadata fields to local models, or maintaining state for the known set, to reflect their eviction behaviors.
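As a rough illustration of what such per-model metadata might look like, here is a hypothetical config fragment. Every field name here is an assumption invented for this sketch, not an existing schema:

```yaml
# Hypothetical per-model eviction metadata (all field names illustrative)
name: some-local-model
eviction:
  policy: lru            # or "arc"
  demote_to: ram         # "ram" | "disk" | "pin" (never evict)
  max_idle_seconds: 900  # age threshold before demotion is considered
```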