llm-serving

Here are 49 public repositories matching this topic...

vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Updated Apr 22, 2025
Python

ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Updated Apr 22, 2025
Python

sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.

cuda inference pytorch transformer moe llama vlm llm llm-serving llava deepseek-llm deepseek llama3 llama3-1 deepseek-v3 deepseek-r1 deepseek-r1-zero

Updated Apr 22, 2025
Python

bentoml / OpenLLM

Run any open-source LLM, such as DeepSeek or Llama, as an OpenAI-compatible API endpoint in the cloud.

llama mistral fine-tuning mlops bentoml vicuna llm model-inference llmops llm-serving llm-inference open-source-llm llama2 openllm llm-ops llama3-1 llama3-2 llama3-2-vision

Updated Apr 22, 2025
Python

skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Updated Apr 22, 2025
Python

bentoml / BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!

python machine-learning deep-learning model-serving multimodal mlops ml-engineering ai-inference llm generative-ai llmops llm-serving model-inference-service llm-inference inference-platform

Updated Apr 21, 2025
Python

superduper-io / superduper

Superduper: End-to-end framework for building custom AI applications and agents.

Updated Apr 22, 2025
Python

predibase / lorax

Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs

transformers pytorch llama gpt lora model-serving fine-tuning llm llmops llm-serving llm-inference

Updated Apr 19, 2025
Python

MoonshotAI / MoBA

MoBA: Mixture of Block Attention for Long-Context LLMs

pytorch transformer moe llm llm-serving llm-training flash-attention

Updated Apr 3, 2025
Python

thu-pacman / chitu

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

gpu pytorch model-serving llm llm-serving deepseek

Updated Apr 22, 2025
Python

mosecorg / mosec

A high-performance ML model serving framework that offers dynamic batching and CPU/GPU pipelines to fully exploit your compute machine

python rust machine-learning deep-learning mxnet tensorflow gpu cv pytorch tts hacktoberfest model-serving nerual-network machine-learning-platform jax mlops llm llm-serving

Updated Apr 22, 2025
Python

vllm-project / vllm-ascend

Community-maintained hardware plugin for vLLM on Ascend

inference transformer model-serving mlops ascend llm llmops llm-serving vllm

Updated Apr 22, 2025
Python

hpcaitech / SwiftInfer

Efficient AI Inference & Serving

deep-learning inference artificial-intelligence llama gpt llm-serving llm-inference llama2

Updated Jan 8, 2024
Python

HPMLL / BurstGPT

A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems

dataset mlsys llm llm-serving

Updated Oct 15, 2024
Python

interestingLSY / swiftLLM

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

cuda transformers inference pytorch transformer llama gpt inference-engine model-serving mlops llm llmops llm-serving llm-inference

Updated Jul 5, 2024
Python

chenhunghan / ialacol

🪶 Lightweight OpenAI drop-in replacement for Kubernetes

python kubernetes ai gpu helm cuda openai cloudnative llm langchain llm-serving llamacpp ggml gptq llm-inference

Updated Feb 5, 2024
Python

bigai-nlco / TokenSwift

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation

inference transformer llms llm-serving llm-inference qwen speculative-decoding deepseek

Updated Mar 19, 2025
Python

asprenger / ray_vllm_inference

A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.

inference pytorch transformer ray model-serving mlops llm llmops llm-serving vllm

Updated Apr 6, 2024
Python

efficientscaling / Z1

Repo for "Z1: Efficient Test-time Scaling with Code"

reasoning llm llm-serving codellms

Updated Apr 11, 2025
Python

friendliai / friendli-client

Friendli: the fastest serving engine for generative AI

ai ml inference gpt inference-server mistral inference-engine serving mlops gpt3 llm stable-diffusion llms generative-ai llmops llm-serving llm-inference llama2 llm-ops

Updated Jan 24, 2025
Python
