TorchServe

TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.

Requires Python >= 3.8

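For example, once a model named bert is being served locally, send it the contents of input.txt with:

curl http://127.0.0.1:8080/predictions/bert -T input.txt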
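The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the default inference address 127.0.0.1:8080 and a registered model named bert. Because curl -T uploads the file body with HTTP PUT, the sketch does the same; the helper names are hypothetical.

```python
import urllib.request

# Assumption: TorchServe's inference API is listening on its default
# address and a model has been registered under the given name.
INFERENCE_URL = "http://127.0.0.1:8080/predictions/{model}"


def build_prediction_request(model_name: str, payload: bytes) -> urllib.request.Request:
    """Build a request equivalent to `curl <url> -T <file>`.

    `curl -T` uploads the file body with an HTTP PUT, so this does the same.
    """
    return urllib.request.Request(
        INFERENCE_URL.format(model=model_name),
        data=payload,
        method="PUT",
    )


def predict(model_name: str, path: str, timeout: float = 30.0) -> str:
    """Send the contents of `path` to the model and return the response body as text."""
    with open(path, "rb") as f:
        request = build_prediction_request(model_name, f.read())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `print(predict("bert", "input.txt"))` prints whatever the model's handler returns.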

🚀 Quick start with TorchServe

# Install dependencies
# CUDA is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
pip install torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archiver-nightly
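After installing, you can check from Python that the interpreter meets the version requirement and that the three distributions are present. This is a minimal sketch using the standard library's importlib.metadata; the distribution names are taken from the pip command above, and the helper names are hypothetical.

```python
import sys
from importlib import metadata

# Distribution names from the "Latest release" pip command. The nightly
# builds are published under "-nightly" names instead.
PACKAGES = ("torchserve", "torch-model-archiver", "torch-workflow-archiver")


def python_is_supported() -> bool:
    """TorchServe requires Python 3.8 or newer."""
    return sys.version_info >= (3, 8)


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

To check a nightly install, pass the -nightly names, e.g. `installed_versions(("torchserve-nightly",))`.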

🚀 Quick start with TorchServe (conda)

# Install dependencies
# CUDA is optional
python ./ts_scripts/install_dependencies.py --cuda=cu121

# Latest release
conda install -c pytorch torchserve torch-model-archiver torch-workflow-archiver

# Nightly build
conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver

See the Getting started guide for a full walkthrough.

🐳 Quick start with Docker

# Latest release
docker pull pytorch/torchserve

# Nightly build
docker pull pytorch/torchserve-nightly

Refer to the TorchServe Docker documentation for details.
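Pulling the image does not start a server. This is a minimal sketch of launching the container from Python, assuming the image serves on TorchServe's default ports (8080 inference, 8081 management, 8082 metrics, 7070/7071 gRPC); treat the port list as an assumption and check the TorchServe Docker documentation for the authoritative options. The helper names are hypothetical.

```python
import subprocess

# Assumed default TorchServe ports: REST inference (8080), REST management
# (8081), metrics (8082), gRPC inference (7070) and gRPC management (7071).
DEFAULT_PORTS = (8080, 8081, 8082, 7070, 7071)


def docker_run_args(image="pytorch/torchserve", ports=DEFAULT_PORTS, host="127.0.0.1"):
    """Build a `docker run` argument list that publishes each port on `host` only."""
    # "-it" is omitted so the command also works when stdin is not a terminal.
    args = ["docker", "run", "--rm"]
    for port in ports:
        args += ["-p", f"{host}:{port}:{port}"]
    args.append(image)
    return args


def run_container(**kwargs) -> int:
    """Run the container in the foreground and return docker's exit code."""
    return subprocess.run(docker_run_args(**kwargs)).returncode
```

Binding to 127.0.0.1 keeps the management API off external interfaces; use `image="pytorch/torchserve-nightly"` for the nightly build.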

⚡ Why TorchServe

- Write once, run anywhere: on-prem or on-cloud, with inference support for CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs and NVIDIA MPS
- Model Management API: multi-model management with optimized worker-to-model allocation
- Inference API: REST and gRPC support for batched inference
- TorchServe Workflows: deploy complex DAGs with multiple interdependent models
- Default way to serve PyTorch models in
  - SageMaker
  - Vertex AI
  - Kubernetes, with support for autoscaling, session affinity and monitoring using Grafana; works on-prem, AWS EKS, Google GKE and Azure AKS
  - KServe: supports both v1 and v2 APIs, autoscaling and canary deployments for A/B testing
  - Kubeflow
  - MLflow
- Export your model for optimized inference: TorchScript out of the box, PyTorch Compiler preview, ORT and ONNX, IPEX, TensorRT, FasterTransformer, FlashAttention (Better Transformers)
- Performance Guide: built-in support to optimize, benchmark and profile PyTorch and TorchServe performance
- Expressive handlers: a handler architecture that makes it trivial to support inference for your use case, with many handlers supported out of the box
- Metrics API: out-of-the-box support for system-level metrics with Prometheus exports, plus custom metrics
- Large Model Inference Guide: support for GenAI and LLMs, including
  - Fast kernels with FlashAttention v2, continuous batching and streaming responses
  - PyTorch Tensor Parallel preview, Pipeline Parallel
  - Microsoft DeepSpeed, DeepSpeed-MII
  - Hugging Face Accelerate, Diffusers
  - Running large models on AWS SageMaker and Inferentia2
  - Running a Llama 2 chatbot locally on Mac
- Monitoring using Grafana and Datadog
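The Model Management and Inference API entries above can be illustrated with a small client. This is a minimal sketch, assuming the management API is on its default address 127.0.0.1:8081 and that an archive named bert.mar is available to the server (both assumptions). It builds requests that register a model with server-side batching (batch_size, max_batch_delay), scale its workers, and list registered models; the helper names are hypothetical, and the Management API documentation has the full parameter set.

```python
import urllib.parse
import urllib.request

# Assumption: TorchServe's management API is on its default address.
MANAGEMENT_URL = "http://127.0.0.1:8081"


def register_model(mar_url: str, batch_size: int = 1, max_batch_delay_ms: int = 100,
                   initial_workers: int = 1) -> urllib.request.Request:
    """POST /models: register a model archive with server-side batching."""
    query = urllib.parse.urlencode({
        "url": mar_url,
        "batch_size": batch_size,               # max requests aggregated into one batch
        "max_batch_delay": max_batch_delay_ms,  # ms to wait while filling a batch
        "initial_workers": initial_workers,
    })
    return urllib.request.Request(f"{MANAGEMENT_URL}/models?{query}", method="POST")


def scale_workers(model_name: str, min_worker: int, max_worker: int) -> urllib.request.Request:
    """PUT /models/{name}: change how many workers serve the model."""
    query = urllib.parse.urlencode({"min_worker": min_worker, "max_worker": max_worker})
    name = urllib.parse.quote(model_name)
    return urllib.request.Request(f"{MANAGEMENT_URL}/models/{name}?{query}", method="PUT")


def list_models() -> urllib.request.Request:
    """GET /models: list the registered models."""
    return urllib.request.Request(f"{MANAGEMENT_URL}/models", method="GET")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the JSON response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, `send(register_model("bert.mar", batch_size=8, max_batch_delay_ms=50))` followed by `send(scale_workers("bert", 2, 4))`. Separating request construction from sending keeps the URL logic testable without a running server.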

🤔 How does TorchServe work

Model Server for PyTorch Documentation: Full documentation
TorchServe internals: How TorchServe was built
Contributing guide: How to contribute to TorchServe
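Requests reach a model through a handler. The following toy sketch shows that flow, assuming the preprocess, inference and postprocess stages used by TorchServe's BaseHandler. It does not import the ts package; ToyTextHandler and its stand-in model are hypothetical. The point it illustrates is that handle receives one entry per request in a batch and must return exactly one response per entry.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyTextHandler:
    """Hypothetical handler mirroring BaseHandler's stages, without the `ts` package."""

    def __init__(self, model: Callable[[List[str]], List[Any]]):
        # Stand-in for initialize(context), which in TorchServe would load
        # the model from the model archive.
        self.model = model

    def preprocess(self, requests: List[Dict[str, Any]]) -> List[str]:
        # Each request in the batch carries its payload under "data" or "body".
        texts = []
        for row in requests:
            payload = row.get("data") or row.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            texts.append(payload)
        return texts

    def inference(self, batch: List[str]) -> List[Any]:
        # Run the whole batch through the model in one call.
        return self.model(batch)

    def postprocess(self, outputs: List[Any]) -> List[Dict[str, Any]]:
        # Wrap each raw output in a JSON-serializable response.
        return [{"prediction": out} for out in outputs]

    def handle(self, requests: List[Dict[str, Any]], context: Optional[Any] = None):
        outputs = self.postprocess(self.inference(self.preprocess(requests)))
        if len(outputs) != len(requests):
            raise ValueError("handler must return one response per request in the batch")
        return outputs
```

Usage with a stand-in model that returns the length of each input: `ToyTextHandler(lambda batch: [len(t) for t in batch]).handle([{"data": b"hello"}, {"body": "hi"}])` returns `[{"prediction": 5}, {"prediction": 2}]`. A real handler would typically subclass `ts.torch_handler.base_handler.BaseHandler` instead.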

🏆 Highlighted Examples

Serving Llama 2 with TorchServe
Chatbot with Llama 2 on Mac 🦙💬
🤗 Hugging Face Transformers with Better Transformer integration, Flash Attention and xFormers memory-efficient attention
Stable Diffusion
Model parallel inference
Multimodal models with MMF, combining text, audio and video
Dual Neural Machine Translation for a complex workflow DAG
TorchServe Integrations
TorchServe Internals
TorchServe UseCases

For more examples

🤓 Learn More

https://pytorch.org/serve

🫂 Contributing

We welcome all contributions!

To learn more about how to contribute, see the contributor guide.

📰 News

High performance Llama 2 deployments with AWS Inferentia2 using TorchServe
Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance
Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs
Deploying your Generative AI model in only four steps with Vertex AI and PyTorch
PyTorch Model Serving on Google Cloud TPU v5
Monitoring using Datadog
Torchserve Performance Tuning, Animated Drawings Case-Study
Walmart Search: Serving Models at a Scale on TorchServe
🎥 Scaling inference on CPU with TorchServe
🎥 TorchServe C++ backend
Grokking Intel CPU PyTorch performance from first principles: a TorchServe case study
Grokking Intel CPU PyTorch performance from first principles (Part 2): a TorchServe case study
Case Study: Amazon Ads Uses PyTorch and AWS Inferentia to Scale Models for Ads Processing
Optimize your inference jobs using dynamic batch inference with TorchServe on Amazon SageMaker
Using AI to bring children's drawings to life
🎥 Model Serving in PyTorch
Evolution of Cresta's machine learning architecture: Migration to AWS and PyTorch
🎥 Explain Like I’m 5: TorchServe
🎥 How to Serve PyTorch Models with TorchServe
How to deploy PyTorch models on Vertex AI
Quantitative Comparison of Serving Platforms
Efficient Serverless deployment of PyTorch models on Azure
Deploy PyTorch models with TorchServe in Azure Machine Learning online endpoints
Dynaboard moving beyond accuracy to holistic model evaluation in NLP
A MLOps Tale about operationalising MLFlow and PyTorch
Operationalize, Scale and Infuse Trust in AI Models using KFServing
How Wadhwani AI Uses PyTorch To Empower Cotton Farmers
TorchServe Streamlit Integration
Dynabench aims to make AI models more robust through distributed human workers
Announcing TorchServe

💖 All Contributors

Made with contrib.rocks.

⚖️ Disclaimer

This repository is jointly operated and maintained by Amazon, Meta and a number of individual contributors listed in the CONTRIBUTORS file. For questions directed at Meta, please send an email to opensource@fb.com. For questions directed at Amazon, please send an email to torchserve@amazon.com. For all other questions, please open an issue in this repository.

TorchServe acknowledges the Multi Model Server (MMS) project, from which it was derived.
