# Model Management
Universal model resource management for AI Runner. Provides centralized hardware detection, quantization selection, model registry, and memory allocation across all model types.
The Model Management system handles:
- Hardware Profiling - Detect VRAM, RAM, and compute capabilities
- Quantization Strategy - Select optimal quantization based on hardware
- Model Registry - Database of supported models with requirements
- Memory Allocation - Track and manage GPU/CPU memory usage
Core components:

| Component | Description |
|---|---|
| `HardwareProfiler` | Detects system resources (VRAM, RAM, compute capability) |
| `QuantizationStrategy` | Selects optimal quantization for a model + hardware combination |
| `ModelRegistry` | Database of supported models with hardware requirements |
| `MemoryAllocator` | Manages VRAM/RAM allocation across loaded models |
| `ModelResourceManager` | Central coordinator for all model operations |
Key features:
- Universal - Works for all model types (LLM, SD, TTS, STT, Video)
- Automatic - Intelligent model and quantization selection
- Memory-Safe - Prevents OOM by tracking allocations
- Provider-Agnostic - Supports multiple providers (Mistral, Llama, etc.)
- Extensible - Easy to add new models and providers
Quick start:

```python
from airunner.components.model_management import ModelResourceManager
from airunner.components.model_management.model_registry import ModelProvider, ModelType
from airunner.components.model_management.quantization_strategy import QuantizationLevel

# Get singleton instance
manager = ModelResourceManager()

# Auto-select best model for hardware
model = manager.select_best_model(
    provider=ModelProvider.MISTRAL,
    model_type=ModelType.LLM
)

# Prepare for loading with auto quantization
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Ministral-8B-v0.1"
)

# Or with a manual quantization preference
metadata, quantization, allocation = manager.prepare_model_loading(
    model_id="mistralai/Magistral-23B-v0.1",
    preferred_quantization=QuantizationLevel.INT4
)

# After unloading a model, release its tracked allocation
manager.cleanup_model(model_id)

# Check memory pressure
if manager.check_memory_pressure():
    # Unload some models
    pass
```

Inspect detected hardware with `HardwareProfiler`:

```python
from airunner.components.model_management import HardwareProfiler
profiler = HardwareProfiler()
# Get system info
print(f"VRAM: {profiler.vram_gb} GB")
print(f"RAM: {profiler.ram_gb} GB")
print(f"CUDA Compute: {profiler.cuda_compute}")
print(f"GPU Name: {profiler.gpu_name}")from airunner.components.model_management.quantization_strategy import (
QuantizationStrategy,
QuantizationLevel
)
strategy = QuantizationStrategy()
# Get recommended quantization for model size and available VRAM
quant = strategy.recommend(
model_size_gb=14.0,
available_vram_gb=12.0
)
# Returns: QuantizationLevel.INT4| Type | Description |
|---|---|
| `LLM` | Large Language Models (text generation) |
| `SD` | Stable Diffusion (image generation) |
| `TTS` | Text-to-Speech |
| `STT` | Speech-to-Text |
| `EMBEDDING` | Embedding models for RAG |
| `VIDEO` | Video generation models |

Quantization levels:
| Level | Bits | Memory Reduction | Quality Impact |
|---|---|---|---|
| `FP32` | 32 | 0% | None |
| `FP16` | 16 | 50% | Minimal |
| `BF16` | 16 | 50% | Minimal |
| `INT8` | 8 | 75% | Low |
| `INT4` | 4 | 87.5% | Moderate |
| `GGUF_Q4` | ~4 | 87.5% | Moderate |
| `GGUF_Q8` | ~8 | 75% | Low |
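The memory-reduction column maps directly to bytes per parameter. A minimal sanity-check sketch (the 8B parameter count is an illustrative figure, not tied to any registry entry):

```python
# Rough VRAM estimate for model weights: parameters * bytes per parameter.
# Illustrative only; real usage adds activations, KV cache, and framework overhead.
BITS = {"FP32": 32, "FP16": 16, "BF16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: float, level: str) -> float:
    """Approximate weight memory in GB for a given quantization level."""
    return num_params * BITS[level] / 8 / 1e9

params = 8e9  # an 8B-parameter model
for level in BITS:
    print(f"{level}: {weight_memory_gb(params, level):.1f} GB")
# INT4 comes out to ~4 GB, an 87.5% reduction versus FP32's ~32 GB.
```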
Supported model providers:
- Mistral - Mistral AI models
- Meta - Llama models
- Qwen - Qwen models
- StabilityAI - Stable Diffusion models
- OpenAI - Whisper models
- Custom - User-added models
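Only `ModelProvider.MISTRAL` appears elsewhere on this page, so treat the other enum members below (`META`, `QWEN`) as assumptions inferred from the provider list. A minimal sketch that picks the best LLM per provider using the documented `select_best_model` call:

```python
from airunner.components.model_management import ModelResourceManager
from airunner.components.model_management.model_registry import ModelProvider, ModelType

manager = ModelResourceManager()

# MISTRAL is documented above; META and QWEN are assumed from the provider list.
for provider in (ModelProvider.MISTRAL, ModelProvider.META, ModelProvider.QWEN):
    model = manager.select_best_model(provider=provider, model_type=ModelType.LLM)
    print(f"{provider}: {model}")
```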
Query and react to memory usage at runtime:

```python
manager = ModelResourceManager()

# Get current memory status
status = manager.get_memory_status()
print(f"VRAM Used: {status.vram_used_gb} / {status.vram_total_gb} GB")
print(f"RAM Used: {status.ram_used_gb} / {status.ram_total_gb} GB")

# Check if under memory pressure
if manager.check_memory_pressure(threshold=0.9):
    # Unload least recently used model
    manager.unload_lru_model()
```

Behavior can also be tuned through environment variables:

```bash
# Force specific quantization
export AIRUNNER_FORCE_QUANTIZATION=int4

# Disable auto quantization
export AIRUNNER_DISABLE_AUTO_QUANT=1

# Memory safety threshold (0.0-1.0)
export AIRUNNER_MEMORY_THRESHOLD=0.85
```

AI Runner provides a command-line tool for downloading, listing, and managing models, similar to `ollama pull`.
The tool is installed automatically with AI Runner:

```bash
pip install airunner

# Or in development mode
pip install -e .
```

Basic usage:

```bash
# List all available models
airunner-hf-download
# Download a model (GGUF by default for LLMs)
airunner-hf-download qwen3-8b
# Download full safetensors version instead of GGUF
airunner-hf-download --full qwen3-8b
# List only LLM models
airunner-hf-download --type llm
# Download any HuggingFace model by repo ID
airunner-hf-download Qwen/Qwen2.5-7B-Instruct
```

Manage downloaded models:

```bash
# List all downloaded models
airunner-hf-download --downloaded
# Delete a downloaded model
airunner-hf-download --delete qwen3-8b
# Delete by path
airunner-hf-download --delete Qwen3-8B
```

Model types available through the CLI:

| Type | Description | Example |
|---|---|---|
| `llm` | Large Language Models | `qwen3-8b`, `llama-3.1-8b` |
| `art` | Image generation (Stable Diffusion) | Various SD models |
| `tts` | Text-to-Speech (OpenVoice) | OpenVoice models |
| `stt` | Speech-to-Text (Whisper) | Whisper models |
| `embedding` | Embedding models for RAG | Sentence transformers |
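Since the tool pulls models by HuggingFace repo ID, downloads can also be scripted with the standard `huggingface_hub` package when you need automation outside the CLI. This is plain `huggingface_hub` usage, not an AI Runner API; the `local_dir` below is a hypothetical target matching the location table further down:

```python
import os
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Hypothetical destination mirroring the default LLM download location.
target = os.path.expanduser(
    "~/.local/share/airunner/models/text/models/llm/causallm/Qwen2.5-7B-Instruct"
)
path = snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct", local_dir=target)
print(f"Model files downloaded to: {path}")
```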
By default, LLM downloads use the GGUF format, which offers:
- Smaller file sizes - Q4_K_M is ~60% smaller than safetensors
- Faster inference - Optimized llama.cpp backend
- Lower VRAM usage - Better quantization efficiency
Use the `--full` flag to download the original safetensors format if needed for fine-tuning or specific compatibility requirements.
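Downloaded GGUF files can be loaded by any llama.cpp-based runtime. A minimal sketch using the `llama-cpp-python` package (the file name below is hypothetical; substitute the actual file the CLI saved):

```python
import os
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path based on the default LLM download location.
model_path = os.path.expanduser(
    "~/.local/share/airunner/models/text/models/llm/causallm/Qwen3-8B-Q4_K_M.gguf"
)
llm = Llama(model_path=model_path, n_ctx=32768, n_gpu_layers=-1)

out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```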
Example output from `airunner-hf-download`:

```
Available Models
================================================================================
Use 'airunner-hf-download <model>' to download (GGUF by default for LLMs)
Use 'airunner-hf-download --full <model>' for full safetensors version

[LLM]
----------------------------------------
qwen3-8b [GGUF]
    Repo: Qwen/Qwen3-8B
    VRAM: 8GB (4-bit) | Context: 32K
    Qwen3 8B with built-in reasoning (thinking mode)

qwen3-coder-30b-a3b [GGUF]
    Repo: Qwen/Qwen3-Coder-30B-A3B-Instruct
    VRAM: 15GB (4-bit) | Context: 262K
    Qwen3 Coder 30B MoE - SOTA agentic coding with 256K context
```
Models are downloaded to:
| Type | Location |
|---|---|
| LLM | ~/.local/share/airunner/models/text/models/llm/causallm/ |
| Art | ~/.local/share/airunner/models/art/models/ |
| TTS | ~/.local/share/airunner/models/text/models/tts/ |
| STT | ~/.local/share/airunner/models/text/models/stt/ |
| Embedding | ~/.local/share/airunner/models/text/models/llm/embedding/ |
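A small sketch for checking what is already on disk, using only the default locations from the table above (the directory layout is copied verbatim from the table; everything else is generic Python):

```python
from pathlib import Path

# Default download locations relative to the models root (see table above).
MODEL_DIRS = {
    "LLM": "text/models/llm/causallm",
    "Art": "art/models",
    "TTS": "text/models/tts",
    "STT": "text/models/stt",
    "Embedding": "text/models/llm/embedding",
}

base = Path.home() / ".local/share/airunner/models"
for kind, rel in MODEL_DIRS.items():
    path = base / rel
    names = sorted(p.name for p in path.iterdir()) if path.is_dir() else []
    print(f"{kind:>9}: {path} ({len(names)} entries)")
```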
See also:
- Architecture - System architecture
- Settings - Configuration options
- Installation - Hardware requirements