This project provides PowerShell scripts to download, build, and run large language models locally on Windows using either the standard llama.cpp or the performance-oriented ik_llama.cpp fork.
You can choose the engine that best suits your needs:
- llama.cpp: The official, stable, and widely used version.
- ik_llama.cpp: A fork with advanced features for fine-tuning performance, especially on machines with limited VRAM.
The workflow is self-contained:
```
repo/                 # your checkout
├─ vendor/            # llama.cpp and/or ik_llama.cpp source cloned & built here
└─ models/            # downloaded GGUF model(s)
```
- Windows 10/11 x64
- PowerShell 7
- NVIDIA GPU with CUDA 12.4+ (compute ≥ 7.0 highly recommended)
- ~40 GB free disk space (source tree and model)
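A quick way to sanity-check these prerequisites from a PowerShell 7 prompt (illustrative only; adjust the drive letter if your checkout lives elsewhere):

```powershell
# PowerShell version - should report 7.x
$PSVersionTable.PSVersion

# GPU name, driver version, and VRAM (requires nvidia-smi from the NVIDIA driver on PATH)
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader

# Free space on drive C:, in GB
[math]::Round((Get-PSDrive -Name C).Free / 1GB)
```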
The process is split into two steps:
- Installation: Run the appropriate `install_*.ps1` script once.
- Execution: Run the corresponding `run_*_server.ps1` script to start the model server.
First, decide whether you want to use the standard llama.cpp or the ik_llama.cpp fork.
llama.cpp is the official and most stable version.
Installation:
Run the install_llama_cpp.ps1 script from an elevated PowerShell 7 prompt. This will download and build the llama.cpp engine.
```powershell
# Allow script execution for this session
Set-ExecutionPolicy Bypass -Scope Process

# Run the installer
./install_llama_cpp.ps1
```
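To confirm the build succeeded, you can look for the server binary under vendor/ and ask it to print its build info (the exact output path depends on the install script, so the search below is deliberately loose):

```powershell
# Find the freshly built server binary somewhere under vendor/ and print its version/build info
$server = Get-ChildItem -Path .\vendor -Recurse -Filter llama-server.exe | Select-Object -First 1
& $server.FullName --version
```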
Execution:
Once the installation is complete, start the server.

```powershell
./run_llama_cpp_server.ps1
```

The ik_llama.cpp fork offers special flags for optimizing performance, such as quantizing the KV cache or splitting model layers between the GPU and CPU.
Installation:
Run the install_ik_llama.ps1 script from an elevated PowerShell prompt.
```powershell
# Allow script execution for this session
Set-ExecutionPolicy Bypass -Scope Process

# Run the installer (adjust CudaArch for your GPU)
./install_ik_llama.ps1 -CudaArch 86
```

Execution:
Once the installation is complete, start the server.
```powershell
./run_ik_llama_server.ps1
```

The run scripts will download a ~17 GB GGUF model into the models/ directory and launch llama-server.exe with a tuned set of runtime flags.

- `run_llama_cpp_server.ps1` uses the `Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf` model.
- `run_ik_llama_server.ps1` uses the `Qwen3-Coder-30B-A3B-Instruct-IQ4_KSS.gguf` model, which is quantized in the `IQ4_KSS` format supported by this fork.
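To confirm the download completed, you can list the models directory; this is just a convenience check, not something the scripts require:

```powershell
# List downloaded GGUF models with their sizes in GB
Get-ChildItem .\models\*.gguf |
    Select-Object Name, @{ Name = 'SizeGB'; Expression = { [math]::Round($_.Length / 1GB, 1) } }
```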
The server starts on http://localhost:8080 and exposes both a browser chat UI and an OpenAI-compatible REST API.
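For example, once the server is up you can exercise the OpenAI-compatible endpoint directly from PowerShell. This is a minimal sketch that assumes the default `/v1/chat/completions` route and no API key:

```powershell
# Minimal chat-completion request against the local server
$body = @{
    model    = "local"   # the local server does not require a real model name
    messages = @(@{ role = "user"; content = "Write a PowerShell one-liner that lists the 5 largest files in the current directory." })
} | ConvertTo-Json -Depth 5

$response = Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" `
                              -Method Post -ContentType "application/json" -Body $body
$response.choices[0].message.content
```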
With the provided settings, both server implementations should achieve comparable performance. On a system with a Ryzen 5 7600 CPU, 32GB DDR5-5600 RAM, and an NVIDIA RTX 4070 Ti (12GB), both servers run at approximately 35 tokens/second.
To get the best performance, match the -CudaArch parameter to your GPU generation during installation.
| Architecture | Cards (examples) | Flag |
|---|---|---|
| Pascal | GTX 10×0, Quadro P | 60 / 61 / 62 |
| Turing | RTX 20×0 / 16×0 | 75 |
| Ampere | RTX 30×0 | 80 / 86 / 87 |
| Ada | RTX 40×0 | 89 |
| Blackwell | RTX 50×0 | 120 |
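If you are unsure which row applies, recent NVIDIA drivers can report the compute capability directly; the flag is that number without the dot (for example, `8.9` becomes `89`). A sketch:

```powershell
# Query the compute capability (e.g. "8.9"); requires a reasonably recent NVIDIA driver
$cc = (nvidia-smi --query-gpu=compute_cap --format=csv,noheader).Trim()

# Strip the dot and pass the result to the installer (e.g. 8.9 -> 89)
./install_ik_llama.ps1 -CudaArch ($cc -replace '\.', '')
```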
The run scripts use a set of optimized flags to launch the server. Most of these are now available in both llama.cpp and ik_llama.cpp.
| Flag | Purpose | Value(s) in Script |
|---|---|---|
| `-ngl 999` | Offloads all possible layers to the GPU. | 999 (all) |
| `-c 65536` | Sets the context size for the model. | 65536 |
| `-fa` | Enables Flash Attention kernels for faster processing. | Enabled |
| `-ctk <type>` | Quantizes the 'key' part of the KV cache to save memory. | `q8_0` (8-bit) |
| `-ctv <type>` | Quantizes the 'value' part of the KV cache. | `q4_0` (4-bit) |
| `-ot <regex>=<backend>` | Overrides tensor placement. Used here to keep some MoE experts on the CPU to save VRAM. | See script |
| `--temp`, `--top-p`, etc. | Standard sampling parameters to control the model's output. | See script |
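Put together, the command assembled by the run scripts looks roughly like the sketch below. The binary and model paths are assumptions, and the `-ot` regex and sampling parameters are omitted; the authoritative values live in the scripts themselves:

```powershell
# Illustrative only - run_llama_cpp_server.ps1 sets the real paths, -ot regex, and sampling flags
& .\vendor\llama.cpp\build\bin\llama-server.exe `
    -m .\models\Qwen3-Coder-30B-A3B-Instruct-1M-IQ4_NL.gguf `
    -ngl 999 `
    -c 65536 `
    -fa `
    -ctk q8_0 -ctv q4_0 `
    --port 8080
```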
While llama.cpp has integrated many high-performance features, ik_llama.cpp currently provides a few unique advantages:
- `-fmoe` / `--fused-moe`: Enables fused Mixture-of-Experts kernels, which can improve performance for models like Qwen that use this architecture.
- `-ser <n>,<p>` / `--smart-expert-reduction`: A powerful feature that computes only the most probable `n` experts with a cumulative probability of `p`. This can significantly speed up MoE models by reducing computation, especially on GPUs with lower memory bandwidth.
- Specialized quants: `ik_llama.cpp` often supports new quantization methods first. The `run_ik_llama_server.ps1` script uses the `IQ4_KSS` quant, which can offer a different balance of performance and quality compared to the `IQ4_NL` quant used by the standard `llama.cpp` script.
The `run_ik_llama_server.ps1` script enables `-fmoe` and `-ser` for maximum performance.
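On the command line that amounts to appending something like the fragment below (`$ikExtraArgs` is just an illustrative name and the `-ser` values are placeholders; the real settings live in the script):

```powershell
# Illustrative: extra arguments appended to the ik_llama.cpp server invocation
$ikExtraArgs = @(
    '-fmoe'        # enable fused Mixture-of-Experts kernels
    '-ser', '7,1'  # smart expert reduction: <n>,<p> as described above (placeholder values)
)
```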
This project is licensed under the MIT License. See the LICENSE file for details.