This project provides PowerShell scripts to download, build, and run large language models locally on Windows using either the standard llama.cpp or the performance-oriented ik_llama.cpp fork.
You can choose the engine that best suits your needs:
- llama.cpp: The official, stable, and widely used version.
- ik_llama.cpp: A fork with advanced features for fine-tuning performance, especially on machines with limited VRAM.
The workflow is self-contained:
```
repo/                 # your checkout
├─ vendor/            # llama.cpp and/or ik_llama.cpp source cloned & built here
└─ models/            # downloaded GGUF model(s)
```
- Windows 10/11 x64
- PowerShell 7
- NVIDIA GPU with CUDA 12.4+ (compute ≥ 7.0 highly recommended)
- ~40 GB free disk space (source tree and model)
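A quick way to sanity-check these prerequisites from a PowerShell 7 prompt (illustrative only; adjust the drive letter if your checkout lives elsewhere):

```powershell
# PowerShell version - should report 7.x
$PSVersionTable.PSVersion

# GPU name, driver version, and VRAM (requires nvidia-smi from the NVIDIA driver on PATH)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader

# Free space on drive C:, in GB
[math]::Round((Get-PSDrive -Name C).Free / 1GB)
```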
The process is split into two steps:
- Installation: Run the appropriate `install_*.ps1` script once.
- Execution: Run the corresponding `run_*_server.ps1` script to start the model server.
First, decide whether you want to use the standard llama.cpp or the ik_llama.cpp fork.
llama.cpp is the official and most stable version.
Installation:
Run the install_llama_cpp.ps1 script from an elevated PowerShell 7 prompt. This will download and build the llama.cpp engine.
```powershell
# Allow script execution for this session
Set-ExecutionPolicy Bypass -Scope Process

# Run the installer
./install_llama_cpp.ps1
```
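To confirm the build succeeded, you can look for the server binary under vendor/ and ask it to print its build info (the exact output path depends on the install script, so the search below is deliberately loose):

```powershell
# Find the freshly built server binary somewhere under vendor/ and print its version/build info
$server = Get-ChildItem -Path .\vendor -Recurse -Filter llama-server.exe | Select-Object -First 1
& $server.FullName --version
```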
Execution:
Once the installation is complete, start the server.

```powershell
./run_llama_cpp_server.ps1
```

The ik_llama.cpp fork offers special flags for optimizing performance, such as quantizing the KV cache or splitting model layers between the GPU and CPU.
Installation:
Run the install_ik_llama.ps1 script from an elevated PowerShell prompt.
```powershell
# Allow script execution for this session
Set-ExecutionPolicy Bypass -Scope Process

# Run the installer (adjust CudaArch for your GPU)
./install_ik_llama.ps1 -CudaArch 86
```

Execution:
Once the installation is complete, start the server.
```powershell
./run_ik_llama_server.ps1
```

The run scripts will download a ~17 GB GGUF model into the models/ directory and launch llama-server.exe with a tuned set of runtime flags.

- `run_llama_cpp_server.ps1` uses the `Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf` model.
- `run_ik_llama_server.ps1` uses the `Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf` model, which is quantized in the `IQ4_KSS` format supported by this fork.
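To confirm the download completed, you can list the models directory; this is just a convenience check, not something the scripts require:

```powershell
# List downloaded GGUF models with their sizes in GB
Get-ChildItem .\models\*.gguf |
    Select-Object Name, @{ Name = 'SizeGB'; Expression = { [math]::Round($_.Length / 1GB, 1) } }
```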
The server starts on http://localhost:8080 and exposes both a browser chat UI and an OpenAI-compatible REST API.
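For example, once the server is up you can exercise the OpenAI-compatible endpoint directly from PowerShell. This is a minimal sketch that assumes the default `/v1/chat/completions` route and no API key:

```powershell
# Minimal chat-completion request against the local server
$body = @{
    model    = "local"   # the local server does not require a real model name
    messages = @(@{ role = "user"; content = "Write a PowerShell one-liner that lists the 5 largest files in the current directory." })
} | ConvertTo-Json -Depth 5

$response = Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" `
                              -Method Post -ContentType "application/json" -Body $body
$response.choices[0].message.content
```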
With the provided settings, both server implementations should achieve comparable performance. On a system with a Ryzen 5 7600 CPU, 32GB DDR5-5600 RAM, and an NVIDIA RTX 4070 Ti (12GB), both servers run at approximately 35 tokens/second.
To get the best performance, match the -CudaArch parameter to your GPU generation during installation.
| Architecture | Cards (examples) | Flag |
|---|---|---|
| Pascal | GTX 10×0, Quadro P | 60 / 61 / 62 |
| Turing | RTX 20×0 / 16×0 | 75 |
| Ampere | RTX 30×0 | 80 / 86 / 87 |
| Ada | RTX 40×0 | 89 |
| Blackwell | RTX 50×0 | 120 |
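If you are unsure which row applies, recent NVIDIA drivers can report the compute capability directly; the flag is that number without the dot (for example, `8.9` becomes `89`). A sketch:

```powershell
# Query the compute capability (e.g. "8.9"); requires a reasonably recent NVIDIA driver
$cc = (nvidia-smi --query-gpu=compute_cap --format=csv,noheader).Trim()

# Strip the dot and pass the result to the installer (e.g. 8.9 -> 89)
./install_ik_llama.ps1 -CudaArch ($cc -replace '\.', '')
```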
The run scripts use a set of optimized flags to launch the server. Most of these are now available in both llama.cpp and ik_llama.cpp.
| Flag | Purpose | Value(s) in Script |
|---|---|---|
| `-ngl 999` | Offloads all possible layers to the GPU. | 999 (all) |
| `-c 65536` | Sets the context size for the model. | 65536 |
| `-fa` | Enables Flash Attention kernels for faster processing. | Enabled |
| `-ctk <type>` | Quantizes the 'key' part of the KV cache to save memory. | `q8_0` (8-bit) |
| `-ctv <type>` | Quantizes the 'value' part of the KV cache. | `q4_0` (4-bit) |
| `-ot <regex>=<backend>` | Overrides tensor placement. Used here to keep some MoE experts on the CPU to save VRAM. | See script |
| `--temp`, `--top-p`, etc. | Standard sampling parameters to control the model's output. | See script |
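Put together, the command assembled by the run scripts looks roughly like the sketch below. The binary and model paths are assumptions, and the `-ot` regex and sampling parameters are omitted; the authoritative values live in the scripts themselves:

```powershell
# Illustrative only - run_llama_cpp_server.ps1 sets the real paths, -ot regex, and sampling flags
& .\vendor\llama.cpp\build\bin\llama-server.exe `
    -m .\models\Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf `
    -ngl 999 `
    -c 65536 `
    -fa `
    -ctk q8_0 -ctv q4_0 `
    --port 8080
```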
While llama.cpp has integrated many high-performance features, ik_llama.cpp currently provides a few unique advantages:
- `-fmoe` / `--fused-moe`: Enables fused Mixture-of-Experts kernels, which can improve performance for models like Qwen that use this architecture.
- `-ser <n>,<p>` / `--smart-expert-reduction`: A powerful feature that computes only the most probable `n` experts with a cumulative probability of `p`. This can significantly speed up MoE models by reducing computation, especially on GPUs with lower memory bandwidth.
- Specialized quants: `ik_llama.cpp` often supports new quantization methods first. The `run_ik_llama_server.ps1` script uses the `IQ4_KSS` quant, which can offer a different balance of performance and quality compared to the `IQ4_NL` quant used by the standard `llama.cpp` script.
The `run_ik_llama_server.ps1` script enables `-fmoe` and `-ser` for maximum performance.
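On the command line that amounts to appending something like the fragment below (`$ikExtraArgs` is just an illustrative name and the `-ser` values are placeholders; the real settings live in the script):

```powershell
# Illustrative: extra arguments appended to the ik_llama.cpp server invocation
$ikExtraArgs = @(
    '-fmoe'        # enable fused Mixture-of-Experts kernels
    '-ser', '7,1'  # smart expert reduction: <n>,<p> as described above (placeholder values)
)
```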
This project is licensed under the MIT License. See the LICENSE file for details.