Releases · HelpingAI/inferno
🔥 Inferno: Unleash the Blazing Power of Local AI 🔥
Forge your own AI future! Run the absolute latest GGUF models like Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more directly on your hardware. Inferno delivers scorching-fast performance through an intuitive Command Line Interface (CLI) and powerful, dual-compatible API endpoints (OpenAI & Ollama). Take control and experience the raw heat of AI innovation locally.
Why Inferno? Command the Flames:
- State-of-the-Art Models: Access the newest GGUF models as soon as they hit Hugging Face.
- Seamless Hugging Face Integration: Interactively `pull` models, browse repos, select specific files, or target directly (`repo_id:filename`). Get RAM estimates before downloading.
- Scorching Performance: Leverage full hardware potential with GPU acceleration via `llama-cpp-python` (CUDA, Metal, ROCm, Vulkan, SYCL). Optimized CPU performance too!
- Advanced Quantization Forge: Convert models between GGUF formats (`inferno quantize`). Supports cutting-edge methods like `iq4_nl`, `q4_k_m`, and importance matrix techniques. Compare methods and see RAM estimates interactively. Quantize local GGUFs or directly from Hugging Face originals (PyTorch/Safetensors).
- Dual API Powerhouse: Serve models via OpenAI (`/v1`) and Ollama (`/api`) compatible endpoints simultaneously. Drop-in compatibility with countless tools and frameworks (LangChain, LlamaIndex, etc.).
- Intuitive CLI Control: Master your models with simple commands:
  - `inferno pull`: Download models.
  - `inferno list` / `ls`: View local models, sizes, quants, RAM estimates.
  - `inferno run`: Chat directly in your terminal.
  - `inferno serve`: Start the API server.
  - `inferno quantize`: Convert model formats.
  - `inferno remove`, `copy`, `show`, `ps`, `compare`, `estimate`: Full model lifecycle management.
- Native Python Client: Integrate effortlessly with the built-in `InfernoClient`, a drop-in replacement for the official `openai` library. Supports streaming, embeddings, multimodal inputs, and tool calling (with capable models). A usage sketch follows this list.
- Smart & Flexible:
  - Automatic max context detection from GGUF metadata.
  - Adjust context (`n_ctx`), GPU layers (`n_gpu_layers`), threads (`n_threads`), and more per session.
  - Keep models loaded with `keep-alive` management.
  - Generate embeddings locally.
  - Real-time streaming for chat and completions.
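
To make the native client concrete, here is a minimal sketch. It assumes `InfernoClient` is importable from the `inferno` package, points at a locally running `inferno serve` instance, and mirrors the `openai` client's `chat.completions` interface per the drop-in claim above; the import path, constructor arguments, and model name are illustrative assumptions, not confirmed API.

```python
# Minimal sketch of the native Python client (hedged: import path and
# constructor arguments are assumptions based on the "drop-in replacement
# for openai" description above -- check the project docs for the real API).
from inferno import InfernoClient  # assumed import path

client = InfernoClient(
    base_url="http://localhost:8000/v1",  # assumed: a running `inferno serve` instance
    api_key="inferno",                    # local servers typically accept any placeholder key
)

# Non-streaming chat completion (openai-style interface).
response = client.chat.completions.create(
    model="Llama-3.3-8B-Instruct-GGUF",   # a model pulled with `inferno pull`
    messages=[{"role": "user", "content": "Summarize GGUF quantization in one line."}],
)
print(response.choices[0].message.content)

# Streaming variant, since the release notes list real-time streaming support.
stream = client.chat.completions.create(
    model="Llama-3.3-8B-Instruct-GGUF",
    messages=[{"role": "user", "content": "Write a haiku about local AI."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```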
Installation: Fueling the Fire
- CRITICAL Prerequisite: Install `llama-cpp-python` FIRST with hardware acceleration flags. This is ESSENTIAL for performance. (See README for full options: ROCm, Vulkan, SYCL, CPU-only, etc.) A quick sanity check for the build follows these install steps.

  ```bash
  # Example: NVIDIA CUDA
  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

  # Example: Apple Metal (macOS)
  CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
  ```

  (Refer to the full README.md for other backends like ROCm, Vulkan, SYCL, OpenBLAS, and pre-built wheel options.)
- Install Inferno:

  ```bash
  # Install from PyPI
  pip install inferno-llm

  # Or install from source for development
  # git clone https://github.com/HelpingAI/inferno.git
  # cd inferno
  # pip install -e .
  ```
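
After both installs, the sanity check mentioned above can confirm that `llama-cpp-python` imports cleanly and, where the installed version exposes `llama_supports_gpu_offload` (an assumption about your build; older releases may not export it), whether the build can offload layers to the GPU.

```python
# Quick sanity check for the llama-cpp-python install.
# Assumption: recent builds export llama_supports_gpu_offload; guarded with hasattr.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)

if hasattr(llama_cpp, "llama_supports_gpu_offload"):
    print("GPU offload supported:", bool(llama_cpp.llama_supports_gpu_offload()))
else:
    print("GPU offload check not available in this build.")
```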
Quick Start Spark:
1. `inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF` (Follow prompts)
2. `inferno list`
3. `inferno run Llama-3.3-8B-Instruct-GGUF` (Chat!)
4. `inferno serve Llama-3.3-8B-Instruct-GGUF --port 8000` (Start API)
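
Because the server exposes OpenAI-compatible `/v1` routes, any OpenAI SDK can point at it. Below is a minimal sketch using the official `openai` Python package, assuming the server from step 4 is running on `localhost:8000`, the model name matches the pulled model, and no API key is enforced locally.

```python
# Talk to `inferno serve` through its OpenAI-compatible /v1 endpoint.
# Assumptions: server running locally on port 8000 (as started above) and
# no API key enforcement, so a placeholder key is fine.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Llama-3.3-8B-Instruct-GGUF",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is a GGUF file?"},
    ],
)
print(response.choices[0].message.content)
```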
Source & Documentation:
- GitHub: https://github.com/HelpingAI/inferno
- Full Docs: https://deepwiki.com/HelpingAI/inferno
Inferno: Your Local AI Powerhouse. Ignite it.