A high-throughput and memory-efficient inference and serving engine for LLMs
Updated Apr 22, 2025 - Python
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
SGLang is a fast serving framework for large language models and vision language models.
Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Superduper: End-to-end framework for building custom AI applications and agents.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
MoBA: Mixture of Block Attention for Long-Context LLMs
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
Community-maintained hardware plugin for vLLM on Ascend
Efficient AI Inference & Serving
A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
🪶 Lightweight OpenAI drop-in replacement for Kubernetes
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
Repo for "Z1: Efficient Test-time Scaling with Code"
Friendli: the fastest serving engine for generative AI