### Name and Version
### Bug description
There is a severe memory leak / infinite graph rebuild in llama-server when using LoRA adapters.
The issue is 100% reproducible and was bisected to a very narrow commit range.
**Working vs broken versions**
- ✅ Commit <= 7692 — works correctly
- ❌ Commit >= 7792 — broken
Between these commits, llama-server starts to:
- repeatedly rebuild execution graphs
- continuously reserve memory
- increase RAM usage without bound
- eventually exhaust system memory
This happens even with:
- CPU-only mode
- single request
- `--parallel 1`
- a small context size
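
To make the unbounded growth visible, one minimal check (a sketch, assuming the default `llama-server.exe` process name): send a single completion request, then poll the idle server's memory usage; the working set climbs on every poll.

```console
REM run repeatedly while the server sits idle after one request;
REM the "Mem Usage" column keeps growing without bound
tasklist /FI "IMAGENAME eq llama-server.exe"
```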
**Environment**
- OS: Windows 11
- llama.cpp built from source
- GPU: RTX 3060 (also reproduced with `-ngl 0`, CPU-only)
- CUDA: 12.4 (but the issue reproduces without CUDA)
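
The build itself was the standard CMake flow; a sketch of the CPU-only configuration (the CUDA flag for the GPU runs is an assumption):

```console
REM CPU-only build; for the CUDA runs add -DGGML_CUDA=ON (assumed)
cmake -B build
cmake --build build --config Release
```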
**Model / LoRA setup**
- Base model: `Meta-Llama-3.1-8B-Instruct.Q8_0.gguf`
- LoRA: converted to GGUF with `convert_lora_to_gguf.py`
- The LoRA was trained on the same (non-quantized) base model
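
For reference, the conversion looked roughly like this (a sketch: the adapter and base-model paths are placeholders, and the flags assume the current CLI of `convert_lora_to_gguf.py`):

```console
python convert_lora_to_gguf.py path\to\lora_adapter ^
    --base path\to\Meta-Llama-3.1-8B-Instruct ^
    --outfile _gestalt-adapter.gguf
```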
**Command used to reproduce (CPU-only)**
```console
llama-server.exe ^
  -m Meta-Llama-3.1-8B-Instruct.Q8_0.gguf ^
  --lora _gestalt-adapter.gguf ^
  -ngl 0 ^
  -c 1024 ^
  --parallel 1 ^
  --host 0.0.0.0 ^
  --port 8080
```
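
A single request is enough to trigger the rebuild/reserve loop, e.g. against the server's standard `/completion` endpoint (the prompt is arbitrary):

```console
curl http://localhost:8080/completion -H "Content-Type: application/json" ^
    -d "{\"prompt\": \"Hello\", \"n_predict\": 16}"
```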
### Operating systems
Windows
### GGML backends
CPU, CUDA
### Hardware
RTX 5060, RTX 3060
### Models
_No response_
### Problem description & steps to reproduce
See the bug description and the reproduce command above.
### First Bad Commit
Not pinpointed; bisected to somewhere between 7692 (works) and 7792 (broken), see above.
### Relevant log output
_No response_