Model Management System

Universal model resource management for AI Runner: centralized hardware detection, quantization selection, a model registry, and memory allocation across all model types.

Overview

The Model Management system handles:

  • Hardware Profiling - Detect VRAM, RAM, and compute capabilities
  • Quantization Strategy - Select optimal quantization based on hardware
  • Model Registry - Database of supported models with requirements
  • Memory Allocation - Track and manage GPU/CPU memory usage

Architecture

Components

| Component | Description |
|---|---|
| HardwareProfiler | Detects system resources (VRAM, RAM, compute capability) |
| QuantizationStrategy | Selects optimal quantization for model+hardware |
| ModelRegistry | Database of supported models with hardware requirements |
| MemoryAllocator | Manages VRAM/RAM allocation across loaded models |
| ModelResourceManager | Central coordinator for all model operations |

Design Goals

  1. Universal - Works for all model types (LLM, SD, TTS, STT, Video)
  2. Automatic - Intelligent model and quantization selection
  3. Memory-Safe - Prevents OOM by tracking allocations
  4. Provider-Agnostic - Supports multiple providers (Mistral, Meta, Qwen, etc.)
  5. Extensible - Easy to add new models and providers

Usage

Basic Usage

from airunner.components.model_management import ModelResourceManager
from airunner.components.model_management.model_registry import ModelProvider, ModelType
from airunner.components.model_management.quantization_strategy import QuantizationLevel

# Get singleton instance
manager = ModelResourceManager()

# Auto-select best model for hardware
model = manager.select_best_model(
    provider=ModelProvider.MISTRAL,
    model_type=ModelType.LLM
)

# Prepare for loading with auto quantization
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Ministral-8B-v0.1"
)

# Or with manual quantization preference
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Magistral-23B-v0.1",
    preferred_quantization=QuantizationLevel.INT4
)

# After unloading a model, release its tracked allocation
manager.cleanup_model("mistralai/Ministral-8B-v0.1")

# Check memory pressure
if manager.check_memory_pressure():
    # Unload some models
    pass

Hardware Detection

from airunner.components.model_management import HardwareProfiler

profiler = HardwareProfiler()

# Get system info
print(f"VRAM: {profiler.vram_gb} GB")
print(f"RAM: {profiler.ram_gb} GB")
print(f"CUDA Compute: {profiler.cuda_compute}")
print(f"GPU Name: {profiler.gpu_name}")

Quantization Selection

from airunner.components.model_management.quantization_strategy import (
    QuantizationStrategy,
    QuantizationLevel
)

strategy = QuantizationStrategy()

# Get recommended quantization for model size and available VRAM
quant = strategy.recommend(
    model_size_gb=14.0,
    available_vram_gb=12.0
)
# Returns QuantizationLevel.INT4: a 14 GB model exceeds the 12 GB of
# available VRAM at full precision, so aggressive quantization is needed

Model Types

| Type | Description |
|---|---|
| LLM | Large Language Models (text generation) |
| SD | Stable Diffusion (image generation) |
| TTS | Text-to-Speech |
| STT | Speech-to-Text |
| EMBEDDING | Embedding models for RAG |
| VIDEO | Video generation models |

Quantization Levels

| Level | Bits | Memory Reduction | Quality Impact |
|---|---|---|---|
| FP32 | 32 | 0% | None |
| FP16 | 16 | 50% | Minimal |
| BF16 | 16 | 50% | Minimal |
| INT8 | 8 | 75% | Low |
| INT4 | 4 | 87.5% | Moderate |
| GGUF_Q4 | ~4 | 87.5% | Moderate |
| GGUF_Q8 | ~8 | 75% | Low |
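
The reduction percentages above are relative to FP32. As a rough back-of-envelope check, a model's footprint is approximately its parameter count times bits per weight; the helper below is a sketch for illustration, not part of the library:

def estimate_model_size_gb(num_params_billions: float, bits: float) -> float:
    """Rough footprint: parameters x bits per weight, in GiB.

    Ignores runtime overhead such as activations and the KV cache.
    """
    return num_params_billions * 1e9 * bits / 8 / 1024**3

# An 8B-parameter model at different quantization levels:
print(estimate_model_size_gb(8, 32))  # FP32: ~29.8 GB
print(estimate_model_size_gb(8, 16))  # FP16/BF16: ~14.9 GB
print(estimate_model_size_gb(8, 8))   # INT8: ~7.5 GB
print(estimate_model_size_gb(8, 4))   # INT4: ~3.7 GB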

Model Providers

Supported model providers:

  • Mistral - Mistral AI models
  • Meta - Llama models
  • Qwen - Qwen models
  • StabilityAI - Stable Diffusion models
  • OpenAI - Whisper models
  • Custom - User-added models
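
The registry can also be queried by provider. The snippet below is a sketch: it assumes ModelRegistry exposes a lookup method filtering on provider and model type, so check the actual registry API for the real method name:

from airunner.components.model_management.model_registry import (
    ModelProvider,
    ModelRegistry,
    ModelType,
)

registry = ModelRegistry()

# Hypothetical query method (name assumed for illustration)
for metadata in registry.get_models(
    provider=ModelProvider.MISTRAL,
    model_type=ModelType.LLM,
):
    print(metadata.model_id)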

Memory Management

Checking Available Memory

manager = ModelResourceManager()

# Get current memory status
status = manager.get_memory_status()
print(f"VRAM Used: {status.vram_used_gb} / {status.vram_total_gb} GB")
print(f"RAM Used: {status.ram_used_gb} / {status.ram_total_gb} GB")

Memory Pressure Handling

# Check if under memory pressure
if manager.check_memory_pressure(threshold=0.9):
    # Unload least recently used model
    manager.unload_lru_model()
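
Before loading a large model, it can make sense to shed models in a loop until pressure clears. This is a sketch that assumes unload_lru_model returns the unloaded model's ID, or None when nothing is left to unload; verify against the actual return contract:

# Free memory until usage drops below the threshold
while manager.check_memory_pressure(threshold=0.9):
    unloaded = manager.unload_lru_model()
    if unloaded is None:
        break  # nothing left to unload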

Configuration

Environment Variables

# Force specific quantization
export AIRUNNER_FORCE_QUANTIZATION=int4

# Disable auto quantization
export AIRUNNER_DISABLE_AUTO_QUANT=1

# Memory safety threshold (0.0-1.0)
export AIRUNNER_MEMORY_THRESHOLD=0.85
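
These are ordinary environment variables, so they can also be set from Python before AI Runner initializes (a minimal sketch; the variable names are taken from the list above):

import os

# Must be set before the model manager reads its configuration
os.environ["AIRUNNER_FORCE_QUANTIZATION"] = "int4"
os.environ["AIRUNNER_MEMORY_THRESHOLD"] = "0.85"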

CLI Model Download Tool

AI Runner provides a command-line tool for downloading, listing, and managing models, similar to ollama pull.

Installation

The tool is installed automatically with AI Runner:

pip install airunner
# Or in development mode
pip install -e .

Basic Usage

# List all available models
airunner-hf-download

# Download a model (GGUF by default for LLMs)
airunner-hf-download qwen3-8b

# Download full safetensors version instead of GGUF
airunner-hf-download --full qwen3-8b

# List only LLM models
airunner-hf-download --type llm

# Download any HuggingFace model by repo ID
airunner-hf-download Qwen/Qwen2.5-7B-Instruct

Managing Downloaded Models

# List all downloaded models
airunner-hf-download --downloaded

# Delete a downloaded model
airunner-hf-download --delete qwen3-8b

# Delete by path
airunner-hf-download --delete Qwen3-8B

Available Model Types

| Type | Description | Example |
|---|---|---|
| llm | Large Language Models | qwen3-8b, llama-3.1-8b |
| art | Image generation (Stable Diffusion) | Various SD models |
| tts | Text-to-Speech (OpenVoice) | OpenVoice models |
| stt | Speech-to-Text (Whisper) | Whisper models |
| embedding | Embedding models for RAG | Sentence transformers |

GGUF vs Full Models

By default, LLM downloads use the GGUF format, which offers:

  • Smaller file sizes - Q4_K_M is ~60% smaller than safetensors
  • Faster inference - Optimized llama.cpp backend
  • Lower VRAM usage - Better quantization efficiency

Use the --full flag to download the original safetensors format if needed for fine-tuning or specific compatibility requirements.

Example Output

Available Models
================================================================================
Use 'airunner-hf-download <model>' to download (GGUF by default for LLMs)
Use 'airunner-hf-download --full <model>' for full safetensors version

[LLM]
----------------------------------------
  qwen3-8b [GGUF]
    Repo: Qwen/Qwen3-8B
    VRAM: 8GB (4-bit) | Context: 32K
    Qwen3 8B with built-in reasoning (thinking mode)

  qwen3-coder-30b-a3b [GGUF]
    Repo: Qwen/Qwen3-Coder-30B-A3B-Instruct
    VRAM: 15GB (4-bit) | Context: 262K
    Qwen3 Coder 30B MoE - SOTA agentic coding with 256K context

Download Locations

Models are downloaded to:

| Type | Location |
|---|---|
| LLM | ~/.local/share/airunner/models/text/models/llm/causallm/ |
| Art | ~/.local/share/airunner/models/art/models/ |
| TTS | ~/.local/share/airunner/models/text/models/tts/ |
| STT | ~/.local/share/airunner/models/text/models/stt/ |
| Embedding | ~/.local/share/airunner/models/text/models/llm/embedding/ |
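
To see what is on disk without the CLI, the directories above can be listed directly. A minimal sketch using the LLM location from the table (each downloaded model is assumed to occupy its own subdirectory):

from pathlib import Path

llm_dir = Path.home() / ".local/share/airunner/models/text/models/llm/causallm"

if llm_dir.exists():
    for model_dir in sorted(llm_dir.iterdir()):
        if model_dir.is_dir():
            print(model_dir.name)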
