Model Management System

Universal model resource management for AI Runner: centralized hardware detection, quantization selection, a model registry, and memory allocation across all model types.

Overview

The Model Management system handles:

  • Hardware Profiling - Detect VRAM, RAM, and compute capabilities
  • Quantization Strategy - Select optimal quantization based on hardware
  • Model Registry - Database of supported models with requirements
  • Memory Allocation - Track and manage GPU/CPU memory usage

Architecture

Components

| Component | Description |
|---|---|
| HardwareProfiler | Detects system resources (VRAM, RAM, compute capability) |
| QuantizationStrategy | Selects optimal quantization for model+hardware |
| ModelRegistry | Database of supported models with hardware requirements |
| MemoryAllocator | Manages VRAM/RAM allocation across loaded models |
| ModelResourceManager | Central coordinator for all model operations |

Design Goals

  1. Universal - Works for all model types (LLM, SD, TTS, STT, Video)
  2. Automatic - Intelligent model and quantization selection
  3. Memory-Safe - Prevents OOM by tracking allocations
  4. Provider-Agnostic - Supports multiple providers (Mistral, Meta, Qwen, etc.)
  5. Extensible - Easy to add new models and providers

Usage

Basic Usage

from airunner.components.model_management import ModelResourceManager
from airunner.components.model_management.model_registry import ModelProvider, ModelType
from airunner.components.model_management.quantization_strategy import QuantizationLevel

# Get singleton instance
manager = ModelResourceManager()

# Auto-select best model for hardware
model = manager.select_best_model(
    provider=ModelProvider.MISTRAL,
    model_type=ModelType.LLM
)

# Prepare for loading with auto quantization
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Ministral-8B-v0.1"
)

# Or with manual quantization preference
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Magistral-23B-v0.1",
    preferred_quantization=QuantizationLevel.INT4
)

# After unloading a model, release its tracked allocation
manager.cleanup_model("mistralai/Ministral-8B-v0.1")

# Check memory pressure
if manager.check_memory_pressure():
    # Unload some models
    pass

Hardware Detection

from airunner.components.model_management import HardwareProfiler

profiler = HardwareProfiler()

# Get system info
print(f"VRAM: {profiler.vram_gb} GB")
print(f"RAM: {profiler.ram_gb} GB")
print(f"CUDA Compute: {profiler.cuda_compute}")
print(f"GPU Name: {profiler.gpu_name}")

Quantization Selection

from airunner.components.model_management.quantization_strategy import (
    QuantizationStrategy,
    QuantizationLevel
)

strategy = QuantizationStrategy()

# Get recommended quantization for model size and available VRAM
quant = strategy.recommend(
    model_size_gb=14.0,
    available_vram_gb=12.0
)
# Returns QuantizationLevel.INT4: a 14 GB model exceeds the 12 GB of
# available VRAM at full precision, so aggressive quantization is needed

Model Types

| Type | Description |
|---|---|
| LLM | Large Language Models (text generation) |
| SD | Stable Diffusion (image generation) |
| TTS | Text-to-Speech |
| STT | Speech-to-Text |
| EMBEDDING | Embedding models for RAG |
| VIDEO | Video generation models |

Quantization Levels

| Level | Bits | Memory Reduction | Quality Impact |
|---|---|---|---|
| FP32 | 32 | 0% | None |
| FP16 | 16 | 50% | Minimal |
| BF16 | 16 | 50% | Minimal |
| INT8 | 8 | 75% | Low |
| INT4 | 4 | 87.5% | Moderate |
| GGUF_Q4 | ~4 | 87.5% | Moderate |
| GGUF_Q8 | ~8 | 75% | Low |
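
The reduction percentages above are relative to FP32. As a rough back-of-envelope check, a model's footprint is approximately its parameter count times bits per weight; the helper below is a sketch for illustration, not part of the library:

def estimate_model_size_gb(num_params_billions: float, bits: float) -> float:
    """Rough footprint: parameters x bits per weight, in GiB.

    Ignores runtime overhead such as activations and the KV cache.
    """
    return num_params_billions * 1e9 * bits / 8 / 1024**3

# An 8B-parameter model at different quantization levels:
print(estimate_model_size_gb(8, 32))  # FP32: ~29.8 GB
print(estimate_model_size_gb(8, 16))  # FP16/BF16: ~14.9 GB
print(estimate_model_size_gb(8, 8))   # INT8: ~7.5 GB
print(estimate_model_size_gb(8, 4))   # INT4: ~3.7 GB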

Model Providers

Supported model providers:

  • Mistral - Mistral AI models
  • Meta - Llama models
  • Qwen - Qwen models
  • StabilityAI - Stable Diffusion models
  • OpenAI - Whisper models
  • Custom - User-added models
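
The registry can also be queried by provider. The snippet below is a sketch: it assumes ModelRegistry exposes a lookup method filtering on provider and model type, so check the actual registry API for the real method name:

from airunner.components.model_management.model_registry import (
    ModelProvider,
    ModelRegistry,
    ModelType,
)

registry = ModelRegistry()

# Hypothetical query method (name assumed for illustration)
for metadata in registry.get_models(
    provider=ModelProvider.MISTRAL,
    model_type=ModelType.LLM,
):
    print(metadata.model_id)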

Memory Management

Checking Available Memory

manager = ModelResourceManager()

# Get current memory status
status = manager.get_memory_status()
print(f"VRAM Used: {status.vram_used_gb} / {status.vram_total_gb} GB")
print(f"RAM Used: {status.ram_used_gb} / {status.ram_total_gb} GB")

Memory Pressure Handling

# Check if under memory pressure
if manager.check_memory_pressure(threshold=0.9):
    # Unload least recently used model
    manager.unload_lru_model()
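
Before loading a large model, it can make sense to shed models in a loop until pressure clears. This is a sketch that assumes unload_lru_model returns the unloaded model's ID, or None when nothing is left to unload; verify against the actual return contract:

# Free memory until usage drops below the threshold
while manager.check_memory_pressure(threshold=0.9):
    unloaded = manager.unload_lru_model()
    if unloaded is None:
        break  # nothing left to unload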

Configuration

Environment Variables

# Force specific quantization
export AIRUNNER_FORCE_QUANTIZATION=int4

# Disable auto quantization
export AIRUNNER_DISABLE_AUTO_QUANT=1

# Memory safety threshold (0.0-1.0)
export AIRUNNER_MEMORY_THRESHOLD=0.85
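
These are ordinary environment variables, so they can also be set from Python before AI Runner initializes (a minimal sketch; the variable names are taken from the list above):

import os

# Must be set before the model manager reads its configuration
os.environ["AIRUNNER_FORCE_QUANTIZATION"] = "int4"
os.environ["AIRUNNER_MEMORY_THRESHOLD"] = "0.85"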

CLI Model Download Tool

AI Runner provides a command-line tool for downloading, listing, and managing models, similar to ollama pull.

Installation

The tool is installed automatically with AI Runner:

pip install airunner
# Or in development mode
pip install -e .

Basic Usage

# List all available models
airunner-hf-download

# Download a model (GGUF by default for LLMs)
airunner-hf-download qwen3-8b

# Download full safetensors version instead of GGUF
airunner-hf-download --full qwen3-8b

# List only LLM models
airunner-hf-download --type llm

# Download any HuggingFace model by repo ID
airunner-hf-download Qwen/Qwen2.5-7B-Instruct

Managing Downloaded Models

# List all downloaded models
airunner-hf-download --downloaded

# Delete a downloaded model
airunner-hf-download --delete qwen3-8b

# Delete by path
airunner-hf-download --delete Qwen3-8B

Available Model Types

| Type | Description | Example |
|---|---|---|
| llm | Large Language Models | qwen3-8b, llama-3.1-8b |
| art | Image generation (Stable Diffusion) | Various SD models |
| tts | Text-to-Speech (OpenVoice) | OpenVoice models |
| stt | Speech-to-Text (Whisper) | Whisper models |
| embedding | Embedding models for RAG | Sentence transformers |

GGUF vs Full Models

By default, LLM downloads use the GGUF format, which offers:

  • Smaller file sizes - Q4_K_M is ~60% smaller than safetensors
  • Faster inference - Optimized llama.cpp backend
  • Lower VRAM usage - Better quantization efficiency

Use the --full flag to download the original safetensors format if needed for fine-tuning or specific compatibility requirements.

Example Output

Available Models
================================================================================
Use 'airunner-hf-download <model>' to download (GGUF by default for LLMs)
Use 'airunner-hf-download --full <model>' for full safetensors version

[LLM]
----------------------------------------
  qwen3-8b [GGUF]
    Repo: Qwen/Qwen3-8B
    VRAM: 8GB (4-bit) | Context: 32K
    Qwen3 8B with built-in reasoning (thinking mode)

  qwen3-coder-30b-a3b [GGUF]
    Repo: Qwen/Qwen3-Coder-30B-A3B-Instruct
    VRAM: 15GB (4-bit) | Context: 262K
    Qwen3 Coder 30B MoE - SOTA agentic coding with 256K context

Download Locations

Models are downloaded to:

| Type | Location |
|---|---|
| LLM | ~/.local/share/airunner/models/text/models/llm/causallm/ |
| Art | ~/.local/share/airunner/models/art/models/ |
| TTS | ~/.local/share/airunner/models/text/models/tts/ |
| STT | ~/.local/share/airunner/models/text/models/stt/ |
| Embedding | ~/.local/share/airunner/models/text/models/llm/embedding/ |
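
To see what is on disk without the CLI, the directories above can be listed directly. A minimal sketch using the LLM location from the table (each downloaded model is assumed to occupy its own subdirectory):

from pathlib import Path

llm_dir = Path.home() / ".local/share/airunner/models/text/models/llm/causallm"

if llm_dir.exists():
    for model_dir in sorted(llm_dir.iterdir()):
        if model_dir.is_dir():
            print(model_dir.name)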
