Hierarchical Model LRU/ARC and Manual Reclaim #5352

@sempervictus

Description

Is your feature request related to a problem? Please describe.
Loaded models remain in GPU memory indefinitely, especially in distributed deployments where the shutdown API does not propagate to every node.

Describe the solution you'd like

  1. Permit appropriately privileged users to evict models from GPU memory, either into system RAM or fully to disk (exiting their runner), across the fleet on a node-by-node or fleet-wide basis through standard API interfaces.
  2. Implement "ageing" for in-memory model state, tracking the last interaction with each model's consumer so that resources can be proactively freed for incoming tasks by demoting layers to system memory or disk. LRU is the simplest approach, but frequency of use is also relevant, so an Adaptive Replacement Cache (ARC) pattern might be more appropriate; see the sketch after this list.
  3. Implement "tuning knobs" to bias this behavior, e.g. demoting small models first, or fully evicting certain models based on their metadata instead of demoting them to RAM.
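
To make the ageing idea concrete, here is a minimal sketch in Go of a tiered LRU reclaimer. All names (`Reclaimer`, `Touch`, `Sweep`, the `Tier` constants) are hypothetical illustrations, not existing ollama APIs, and this shows only the simpler LRU demotion rather than full ARC:

```go
package reclaim

import (
	"sort"
	"sync"
	"time"
)

// Tier records where a model's weights currently live.
type Tier int

const (
	TierGPU  Tier = iota // resident in VRAM
	TierRAM              // demoted to system memory
	TierDisk             // runner exited; weights only on disk
)

// entry tracks one model's placement and last interaction.
type entry struct {
	name     string
	tier     Tier
	lastUsed time.Time
	pinned   bool // tuning knob: never auto-demote this model
}

// Reclaimer demotes idle models one tier at a time, oldest first.
type Reclaimer struct {
	mu      sync.Mutex
	models  map[string]*entry
	gpuIdle time.Duration // idle time before GPU -> RAM demotion
	ramIdle time.Duration // idle time before RAM -> disk eviction
}

func New(gpuIdle, ramIdle time.Duration) *Reclaimer {
	return &Reclaimer{
		models:  make(map[string]*entry),
		gpuIdle: gpuIdle,
		ramIdle: ramIdle,
	}
}

// Touch records an interaction (e.g. a completion request) and
// promotes the model back to the GPU tier.
func (r *Reclaimer) Touch(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.models[name]
	if !ok {
		e = &entry{name: name}
		r.models[name] = e
	}
	e.tier = TierGPU
	e.lastUsed = time.Now()
}

// Sweep runs periodically (or eagerly when a new load needs VRAM)
// and demotes the least recently used models past their thresholds.
func (r *Reclaimer) Sweep(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	candidates := make([]*entry, 0, len(r.models))
	for _, e := range r.models {
		if !e.pinned {
			candidates = append(candidates, e)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].lastUsed.Before(candidates[j].lastUsed)
	})
	for _, e := range candidates {
		idle := now.Sub(e.lastUsed)
		switch {
		case e.tier == TierGPU && idle > r.gpuIdle:
			e.tier = TierRAM // offload weights to system memory
		case e.tier == TierRAM && idle > r.ramIdle:
			e.tier = TierDisk // exit the runner entirely
		}
	}
}
```

A plain LRU like this orders only by recency; ARC would additionally maintain recency and frequency lists (plus ghost lists) so that a model used often but not recently is not evicted ahead of a one-shot load.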

Describe alternatives you've considered

  • Currently looking at modifying an openwebui contribution to provide a "user-friendly eviction" capability via the shutdown API.

Additional context

  • A full implementation may merit adding metadata fields to local models, or maintaining state for the known set, to reflect their eviction behaviors; a sketch of such fields follows.
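
As an illustration of what such per-model metadata could look like, the struct below is a purely hypothetical sketch; the field names are assumptions, not an existing ollama schema:

```go
package reclaim

import "time"

// EvictionPolicy is a sketch of per-model metadata that could bias
// reclaim decisions. All field names here are hypothetical.
type EvictionPolicy struct {
	Pinned      bool          `json:"pinned"`        // never auto-demote
	SkipRAMTier bool          `json:"skip_ram_tier"` // evict straight to disk, bypassing system memory
	MaxGPUIdle  time.Duration `json:"max_gpu_idle"`  // per-model override of the fleet-wide threshold
	Priority    int           `json:"priority"`      // lower-priority models are demoted first under contention
}
```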
