Conversation

@SamMalayek (Contributor) commented Nov 1, 2025

🧩 Summary

This PR adds a CI workflow for end-to-end embedding tests.
It marks the first phase of an effort to abstract the existing examples/llama-embedding logic and move it behind llama-server, so the server can produce embeddings with llama.cpp's own implementation instead of relying on external (OpenAI) APIs.

🎯 Motivation & Future

llama-server currently accepts OpenAI-compatible /embedding requests, but they are not backed by the native llama.cpp embedding logic.
This workflow establishes a reproducible test foundation before the embedding code is refactored, so that:

  • The server can generate embeddings locally.
  • --parallel N can support multiple concurrent embedding requests.
  • The standalone CLI will remain for lightweight workflows, while the server will use the same shared embedding path for persistent deployments.

⚙️ CI Implementation

  • Adds a GitHub Actions job to run embedding E2E tests with cached GGUF models (TinyLlama).
  • Verifies embedding output dimensions and deterministic behavior.
  • Uses lightweight models for fast CI runs (with an optional large model test).
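
For illustration, here is a minimal sketch of what the dimension and determinism checks amount to, assuming a local llama-server instance with embeddings enabled and its OpenAI-compatible /v1/embeddings endpoint. The port, model file, and expected dimension are assumptions for the sketch, not the literal test code in this PR:

```python
# Illustrative E2E check, assuming a llama-server instance running TinyLlama
# with embeddings enabled, e.g.:
#   llama-server -m tinyllama-1.1b.gguf --embedding --port 8080
import requests

URL = "http://127.0.0.1:8080/v1/embeddings"  # assumed local endpoint/port
EXPECTED_DIM = 2048  # TinyLlama-1.1B hidden size; adjust per model

def embed(text: str) -> list[float]:
    # OpenAI-compatible request/response shape.
    resp = requests.post(URL, json={"input": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def test_embedding_dimension():
    assert len(embed("The quick brown fox")) == EXPECTED_DIM

def test_embedding_is_deterministic():
    a = embed("same prompt, same result")
    b = embed("same prompt, same result")
    # Repeated requests with identical input should agree numerically.
    assert max(abs(x - y) for x, y in zip(a, b)) < 1e-6
```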

🧱 Embedding CPP Logic Flow Update

A small cleanup in print_raw_embeddings() improves readability, logic flow, and isolation.
Although minor, this change is modular alongside the CI workflow changes: it touches a vertical slice of the embedding flow without altering evaluation, model logic, or any interface. (Insisting that every change stay within small horizontal slices tends to ossify software and make it brittle.)

🚀 Next Steps

  1. Extend CI test coverage for all embedding endpoints/flags.
  2. Abstract core embedding code from examples into a shared utility (e.g. common/embedding_utils.cpp).
  3. Integrate that abstraction into llama-server for local "/embedding" requests (while keeping the CLI entry points and preserving backwards compatibility).
    a. Extend CI coverage for concurrent (--parallel) embedding tests; a sketch of such a test follows below.

(this could well grow beyond three steps)
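
To make step 3a concrete, here is a hypothetical concurrency check, assuming a server started with multiple slots via --parallel. Again, the endpoint, port, and model file are illustrative assumptions, not part of this PR:

```python
# Hypothetical concurrency check for step 3a, assuming a server started with
# multiple slots, e.g.:
#   llama-server -m tinyllama-1.1b.gguf --embedding --parallel 4 --port 8080
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://127.0.0.1:8080/v1/embeddings"  # assumed local endpoint/port

def embed(text: str) -> list[float]:
    resp = requests.post(URL, json={"input": text}, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"][0]["embedding"]

def test_parallel_requests_do_not_interfere():
    # Fire identical requests concurrently; slots must not cross-contaminate,
    # so every result should match the first one.
    prompts = ["hello world"] * 8
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(embed, prompts))
    reference = results[0]
    assert all(len(r) == len(reference) for r in results)
    for r in results:
        assert max(abs(x - y) for x, y in zip(r, reference)) < 1e-6
```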

Note:
This PR includes a workflow (run-e2e-embedding.yml) for local/fork testing.
It runs only on feature/* branches and is designed to exercise the embedding E2E tests in forked CI.
Maintainers may integrate or adapt this into the upstream CI configuration if desired.

@ggerganov (Member)

Too much slop.

@ggerganov closed this Nov 2, 2025
@SamMalayek (Contributor, Author) commented Nov 2, 2025

> Too much slop.

This is a pristine PR with a pristine plan. However, I'll interpret your comment as "too much scope creep", coming right after I landed a PR in a relatively sloppy part of the codebase (examples -- totally fair, btw), and I'll push another PR that:

  • Removes the embedding.cpp improvement.
  • Refactors the embedding tests into a focused, deterministic CLI suite with broader input coverage and reproducible numeric validation.
  • Removes the E2E test benchmarking (which would have been quite useful for this much-needed refactor of the embedding CLI code, but I can run those locally).

... #16940

Furthermore, I'm opening an RFC discussion: #16957. This plan is needed for both of our llama.cpp repos (including my fork) because:

  • The embedding endpoint for llama-server should be native rather than relying on OpenAI's APIs.
  • I actually need this for my project's embedding pipeline.


Labels

  • devops (improvements to build systems and github actions)
  • examples
  • python (python script changes)
