
Conversation

issixx (Contributor) commented Oct 7, 2025

Description

There is an issue in the condition used when canceling pending tasks in llama-server.
As a result, when two or more tasks are queued, pending tasks cannot be canceled properly;
only the currently running task is canceled.

Steps to reproduce

You can easily reproduce this by repeatedly sending and canceling heavy requests:

  1. Send a prompt whose processing in llama_decode takes several seconds.
  2. Quickly repeat sending and canceling the request several times.
  3. After llama_decode finishes, the active task is canceled as expected, but pending tasks remain uncanceled and continue to run.

This issue is especially critical for use cases such as code completion, where requests are sent and canceled rapidly.
Thank you for your review and consideration.
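
For context, the symptom above matches a cancellation check that only compares the cancel target against the task currently being processed, instead of against every entry in the pending queue. The following is a minimal, hypothetical sketch of that pattern and of a queue-wide fix; the names (`PendingQueue`, `cancel_front_only`, `cancel_all`) are illustrative only and do not correspond to the actual llama-server code touched by this PR.

```cpp
// Illustrative sketch only, not the actual llama.cpp patch.
#include <algorithm>
#include <cstdio>
#include <deque>

struct Task {
    int  id;
    bool is_running;
};

struct PendingQueue {
    std::deque<Task> tasks;

    // Buggy pattern: only the task at the front (the one being processed)
    // is checked, so queued tasks matching the cancel target survive.
    void cancel_front_only(int id_target) {
        if (!tasks.empty() && tasks.front().id == id_target) {
            tasks.pop_front();
        }
    }

    // Fixed pattern: walk the whole queue and erase every pending task
    // that matches the cancel target.
    void cancel_all(int id_target) {
        tasks.erase(
            std::remove_if(tasks.begin(), tasks.end(),
                           [id_target](const Task & t) { return t.id == id_target; }),
            tasks.end());
    }
};

int main() {
    PendingQueue q;
    q.tasks = {{1, true}, {2, false}, {2, false}};  // task 1 running, two queued copies of task 2

    q.cancel_front_only(2);  // no effect: the front task has id 1
    std::printf("after buggy cancel : %zu tasks\n", q.tasks.size());  // 3

    q.cancel_all(2);         // removes both queued tasks with id 2
    std::printf("after fixed cancel : %zu tasks\n", q.tasks.size());  // 1
    return 0;
}
```

With the buggy check, canceling while the heavy request is still decoding leaves the re-sent copies in the queue, which is exactly the behavior described in step 3 above.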

ggerganov merged commit d2ee056 into ggml-org:master Oct 8, 2025
63 of 66 checks passed
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...