
Releases: HelpingAI/inferno


05 May 15:26


🔥 Inferno: Unleash the Blazing Power of Local AI 🔥

Forge your own AI future! Run the absolute latest GGUF models like Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1, and more directly on your hardware. Inferno delivers scorching-fast performance through an intuitive Command Line Interface (CLI) and powerful, dual-compatible API endpoints (OpenAI & Ollama). Take control and experience the raw heat of AI innovation locally.

Why Inferno? Command the Flames:

  • State-of-the-Art Models: Access the newest GGUF models as soon as they hit Hugging Face.
  • Seamless Hugging Face Integration: Interactively pull models, browse repos, select specific files, or target directly (repo_id:filename). Get RAM estimates before downloading.
  • Scorching Performance: Leverage full hardware potential with GPU acceleration via llama-cpp-python (CUDA, Metal, ROCm, Vulkan, SYCL). Optimized CPU performance too!
  • Advanced Quantization Forge: Convert models between GGUF formats (inferno quantize). Supports cutting-edge methods like iq4_nl, q4_k_m, and importance matrix techniques. Compare methods and see RAM estimates interactively. Quantize local GGUFs or directly from Hugging Face originals (PyTorch/Safetensors).
  • Dual API Powerhouse: Serve models via OpenAI (/v1) and Ollama (/api) compatible endpoints simultaneously. Drop-in compatibility with countless tools and frameworks (LangChain, LlamaIndex, etc.).
  • Intuitive CLI Control: Master your models with simple commands:
    • inferno pull: Download models.
    • inferno list / ls: View local models, sizes, quants, RAM estimates.
    • inferno run: Chat directly in your terminal.
    • inferno serve: Start the API server.
    • inferno quantize: Convert model formats.
    • inferno remove, copy, show, ps, compare, estimate: Full model lifecycle management.
  • Native Python Client: Integrate effortlessly with the built-in InfernoClient, a drop-in replacement for the official openai library. Supports streaming, embeddings, multimodal inputs, and tool calling (with capable models). A usage sketch follows this list.
  • Smart & Flexible:
    • Automatic max context detection from GGUF metadata.
    • Adjust context (n_ctx), GPU layers (n_gpu_layers), threads (n_threads), and more per session.
    • Keep models loaded with keep-alive management.
    • Generate embeddings locally.
    • Real-time streaming for chat and completions.
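
Since InfernoClient is described as a drop-in replacement for the official openai library, basic chat usage should look roughly like the sketch below. The import path and constructor arguments are assumptions for illustration, not confirmed API; only the chat.completions.create call pattern is implied by the "drop-in" claim.

    # Minimal sketch, assuming InfernoClient mirrors the openai client interface.
    # The import path and the base_url argument are assumptions, not confirmed API.
    from inferno import InfernoClient  # hypothetical import path

    client = InfernoClient(base_url="http://localhost:8000/v1")  # assumed: a running `inferno serve`

    # As a drop-in replacement for the openai library, chat completions should
    # follow the familiar chat.completions.create pattern.
    response = client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",
        messages=[{"role": "user", "content": "Summarize what Inferno does in one sentence."}],
    )
    print(response.choices[0].message.content)

    # Streaming is advertised as supported; with an openai-style client it uses stream=True.
    for chunk in client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",
        messages=[{"role": "user", "content": "Write a haiku about local AI."}],
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)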

Installation: Fueling the Fire

  1. CRITICAL Prerequisite: Install llama-cpp-python FIRST, with the hardware acceleration flags for your backend. This is ESSENTIAL for performance. (See the README for the full set of options: ROCm, Vulkan, SYCL, CPU-only, etc.)

    # Example: NVIDIA CUDA
    CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
    
    # Example: Apple Metal (macOS)
    CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir

    (Refer to the full README.md for other backends like ROCm, Vulkan, SYCL, OpenBLAS, and pre-built wheel options)

  2. Install Inferno:

    # Install from PyPI
    pip install inferno-llm
    
    # Or install from source for development
    # git clone https://github.com/HelpingAI/inferno.git
    # cd inferno
    # pip install -e .
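
After both steps, a quick way to confirm that llama-cpp-python installed and is importable (this only checks the package, not that GPU offload actually works) is a short Python check:

    # Sanity check: confirm llama-cpp-python is installed and importable.
    # If this import fails, re-run the CMAKE_ARGS install command for your backend.
    import llama_cpp

    print("llama-cpp-python version:", llama_cpp.__version__)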

Quick Start Spark:

  1. inferno pull meta-llama/Llama-3.3-8B-Instruct-GGUF (Follow prompts)
  2. inferno list
  3. inferno run Llama-3.3-8B-Instruct-GGUF (Chat!)
  4. inferno serve Llama-3.3-8B-Instruct-GGUF --port 8000 (Start API)
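
Once the server from step 4 is running, any OpenAI-compatible client can talk to it. A minimal sketch using the official openai package, assuming the default /v1 path on port 8000 and that no real API key is required (the key below is a placeholder):

    # Point the official openai client at the local Inferno server from step 4.
    # Assumptions: the server listens on http://localhost:8000/v1 and accepts any API key.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="inferno")  # placeholder key

    response = client.chat.completions.create(
        model="Llama-3.3-8B-Instruct-GGUF",  # the model served in step 4
        messages=[{"role": "user", "content": "Hello from the local API!"}],
    )
    print(response.choices[0].message.content)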

Source & Documentation:

Full source code, the complete README.md, and issue tracking live in the HelpingAI/inferno repository: https://github.com/HelpingAI/inferno

Inferno: Your Local AI Powerhouse. Ignite it.