Hierarchical Model LRU/ARC and Manual Reclaim #5352

@sempervictus

Description

Is your feature request related to a problem? Please describe.
Loaded models remain in GPU memory indefinitely, especially in distributed deployments where the shutdown API does not propagate to every node.

Describe the solution you'd like

  1. Permit appropriately privileged users to evict models from GPU memory, either into system RAM or fully to disk (exiting their runner), across the fleet on a node-by-node or fleet-wide basis through standard API interfaces.
  2. Implement "ageing" for in-memory model state, tracking the last interaction with each model's consumer so that resources can be proactively freed for incoming tasks by demoting layers to system memory or disk. LRU is the simplest approach, but frequency of use is also relevant, so an Adaptive Replacement Cache (ARC) pattern might be more appropriate; see the sketch after this list.
  3. Implement "tuning knobs" to bias this behavior, e.g. demoting small models first, or fully evicting certain models based on their metadata instead of demoting them to RAM.
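
To make the ageing idea concrete, here is a minimal sketch in Go of a tiered LRU reclaimer. All names (`Reclaimer`, `Touch`, `Sweep`, the `Tier` constants) are hypothetical illustrations, not existing ollama APIs, and this shows only the simpler LRU demotion rather than full ARC:

```go
package reclaim

import (
	"sort"
	"sync"
	"time"
)

// Tier records where a model's weights currently live.
type Tier int

const (
	TierGPU  Tier = iota // resident in VRAM
	TierRAM              // demoted to system memory
	TierDisk             // runner exited; weights only on disk
)

// entry tracks one model's placement and last interaction.
type entry struct {
	name     string
	tier     Tier
	lastUsed time.Time
	pinned   bool // tuning knob: never auto-demote this model
}

// Reclaimer demotes idle models one tier at a time, oldest first.
type Reclaimer struct {
	mu      sync.Mutex
	models  map[string]*entry
	gpuIdle time.Duration // idle time before GPU -> RAM demotion
	ramIdle time.Duration // idle time before RAM -> disk eviction
}

func New(gpuIdle, ramIdle time.Duration) *Reclaimer {
	return &Reclaimer{
		models:  make(map[string]*entry),
		gpuIdle: gpuIdle,
		ramIdle: ramIdle,
	}
}

// Touch records an interaction (e.g. a completion request) and
// promotes the model back to the GPU tier.
func (r *Reclaimer) Touch(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.models[name]
	if !ok {
		e = &entry{name: name}
		r.models[name] = e
	}
	e.tier = TierGPU
	e.lastUsed = time.Now()
}

// Sweep runs periodically (or eagerly when a new load needs VRAM)
// and demotes the least recently used models past their thresholds.
func (r *Reclaimer) Sweep(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	candidates := make([]*entry, 0, len(r.models))
	for _, e := range r.models {
		if !e.pinned {
			candidates = append(candidates, e)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].lastUsed.Before(candidates[j].lastUsed)
	})
	for _, e := range candidates {
		idle := now.Sub(e.lastUsed)
		switch {
		case e.tier == TierGPU && idle > r.gpuIdle:
			e.tier = TierRAM // offload weights to system memory
		case e.tier == TierRAM && idle > r.ramIdle:
			e.tier = TierDisk // exit the runner entirely
		}
	}
}
```

A plain LRU like this orders only by recency; ARC would additionally maintain recency and frequency lists (plus ghost lists) so that a model used often but not recently is not evicted ahead of a one-shot load.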

Describe alternatives you've considered

  • Currently looking at modifying an openwebui contribution to provide a "user-friendly eviction" capability via the shutdown API.

Additional context

  • A full implementation may merit adding metadata fields to local models, or maintaining state for the known set, to reflect their eviction behaviors; a sketch of such fields follows.
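
As an illustration of what such per-model metadata could look like, the struct below is a purely hypothetical sketch; the field names are assumptions, not an existing ollama schema:

```go
package reclaim

import "time"

// EvictionPolicy is a sketch of per-model metadata that could bias
// reclaim decisions. All field names here are hypothetical.
type EvictionPolicy struct {
	Pinned      bool          `json:"pinned"`        // never auto-demote
	SkipRAMTier bool          `json:"skip_ram_tier"` // evict straight to disk, bypassing system memory
	MaxGPUIdle  time.Duration `json:"max_gpu_idle"`  // per-model override of the fleet-wide threshold
	Priority    int           `json:"priority"`      // lower-priority models are demoted first under contention
}
```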
