
Releases: HelpingAI/inferno


05 May 15:26


🔥 Inferno: Unleash the Blazing Power of Local AI 🔥

Forge your own AI future! Run the absolute latest GGUF models like Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more directly on your hardware. Inferno delivers scorching-fast performance through an intuitive Command Line Interface (CLI) and powerful, dual-compatible API endpoints (OpenAI & Ollama). Take control and experience the raw heat of AI innovation locally.

Why Inferno? Command the Flames:

  • State-of-the-Art Models: Access the newest GGUF models as soon as they hit Hugging Face.
  • Seamless Hugging Face Integration: Interactively pull models, browse repos, select specific files, or target directly (repo_id:filename). Get RAM estimates before downloading.
  • Scorching Performance: Leverage full hardware potential with GPU acceleration via llama-cpp-python (CUDA, Metal, ROCm, Vulkan, SYCL). Optimized CPU performance too!
  • Advanced Quantization Forge: Convert models between GGUF formats (inferno quantize). Supports cutting-edge methods like iq4_nl, q4_k_m, and importance matrix techniques. Compare methods and see RAM estimates interactively. Quantize local GGUFs or directly from Hugging Face originals (PyTorch/Safetensors).
  • Dual API Powerhouse: Serve models via OpenAI (/v1) and Ollama (/api) compatible endpoints simultaneously. Drop-in compatibility with countless tools and frameworks (LangChain, LlamaIndex, etc.).
  • Intuitive CLI Control: Master your models with simple commands:
    • inferno pull: Download models.
    • inferno list / ls: View local models, sizes, quants, RAM estimates.
    • inferno run: Chat directly in your terminal.
    • inferno serve: Start the API server.
    • inferno quantize: Convert model formats.
    • inferno remove, copy, show, ps, compare, estimate: Full model lifecycle management.
  • Native Python Client: Integrate effortlessly with the built-in InfernoClient, a drop-in replacement for the official openai library. Supports streaming, embeddings, multimodal inputs, and tool calling (with capable models). A usage sketch follows this list.
  • Smart & Flexible:
    • Automatic max context detection from GGUF metadata.
    • Adjust context (n_ctx), GPU layers (n_gpu_layers), threads (n_threads), and more per session.
    • Keep models loaded with keep-alive management.
    • Generate embeddings locally.
    • Real-time streaming for chat and completions.
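
Since InfernoClient is described as a drop-in replacement for the official openai library, basic chat usage should look roughly like the sketch below. The import path and constructor arguments are assumptions for illustration, not confirmed API; only the chat.completions.create call pattern is implied by the "drop-in" claim.

    # Minimal sketch, assuming InfernoClient mirrors the openai client interface.
    # The import path and the base_url argument are assumptions, not confirmed API.
    from inferno import InfernoClient  # hypothetical import path

    client = InfernoClient(base_url="http://localhost:8000/v1")  # assumed: a running `inferno serve`

    # As a drop-in replacement for the openai library, chat completions should
    # follow the familiar chat.completions.create pattern.
    response = client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",
        messages=[{"role": "user", "content": "Summarize what Inferno does in one sentence."}],
    )
    print(response.choices[0].message.content)

    # Streaming is advertised as supported; with an openai-style client it uses stream=True.
    for chunk in client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",
        messages=[{"role": "user", "content": "Write a haiku about local AI."}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)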

Installation: Fueling the Fire

  1. CRITICAL Prerequisite: Install llama-cpp-python FIRST, with the hardware acceleration flags for your backend. This is ESSENTIAL for performance. (See the README for the full set of options: ROCm, Vulkan, SYCL, CPU-only, etc.)

    # Example: NVIDIA CUDA
    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    
    # Example: Apple Metal (macOS)
    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

    (Refer to the full README.md for other backends like ROCm, Vulkan, SYCL, OpenBLAS, and pre-built wheel options)

  2. Install Inferno:

    # Install from PyPI
    pip install inferno-llm
    
    # Or install from source for development
    # git clone https://github.com/HelpingAI/inferno.git
    # cd inferno
    # pip install -e .
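
After both steps, a quick way to confirm that llama-cpp-python installed and is importable (this only checks the package, not that GPU offload actually works) is a short Python check:

    # Sanity check: confirm llama-cpp-python is installed and importable.
    # If this import fails, re-run the CMAKE_ARGS install command for your backend.
    import llama_cpp

    print("llama-cpp-python version:", llama_cpp.__version__)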

Quick Start Spark:

  1. inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF (Follow prompts)
  2. inferno list
  3. inferno run Llama-3.3-8B-Instruct-GGUF (Chat!)
  4. inferno serve Llama-3.3-8B-Instruct-GGUF --port 8000 (Start API)
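
Once the server from step 4 is running, any OpenAI-compatible client can talk to it. A minimal sketch using the official openai package, assuming the default /v1 path on port 8000 and that no real API key is required (the key below is a placeholder):

    # Point the official openai client at the local Inferno server from step 4.
    # Assumptions: the server listens on http://localhost:8000/v1 and accepts any API key.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="inferno")  # placeholder key

    response = client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",  # the model served in step 4
        messages=[{"role": "user", "content": "Hello from the local API!"}],
    )
    print(response.choices[0].message.content)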

Source & Documentation:

Full source code, the complete README.md, and issue tracking live in the HelpingAI/inferno repository: https://github.com/HelpingAI/inferno

Inferno: Your Local AI Powerhouse. Ignite it.