Repositories list

    • aiter

      Public
      AI Tensor Engine for ROCm
      Cuda
      MIT License
      Updated Feb 14, 2025
    • A high-throughput and memory-efficient inference and serving engine for LLMs
      Python
      Apache License 2.0
      Updated Feb 14, 2025
    • JamAIBase

      Public
      The collaborative spreadsheet for AI. Chain cells into powerful pipelines, experiment with prompts and models, and evaluate LLM responses in real-time. Work together seamlessly to build and iterate on AI applications.
      Python
      Apache License 2.0
      Updated Feb 11, 2025
    • vllm

      Public
      vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
      Python
      Apache License 2.0
      Updated Feb 10, 2025
    • The driver for LMCache core to run in vLLM
      Python
      Apache License 2.0
      Updated Jan 24, 2025
    • LMCache

      Public
      ROCm support of Ultra-Fast and Cheaper Long-Context LLM Inference
      Python
      Apache License 2.0
      Updated Jan 24, 2025
    • Python
      Updated Jan 23, 2025
    • Python
      Apache License 2.0
      Updated Jan 22, 2025
    • kvpress

      Public
      LLM KV cache compression made easy
      Python
      Apache License 2.0
      Updated Jan 21, 2025
    • litellm

      Public
      Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq] (see the usage sketch after this list)
      Python
      Other
      Updated Jan 13, 2025
    • Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
      C++
      Other
      Updated Dec 20, 2024
    • Mooncake

      Public
      Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
      C++
      Apache License 2.0
      Updated Dec 16, 2024
    • ROCm Implementation of torchac_cuda from LMCache
      Cuda
      Updated Dec 16, 2024
    • etalon

      Public
      LLM Serving Performance Evaluation Harness
      Python
      Apache License 2.0
      Updated Dec 16, 2024
    • Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
      Python
      MIT License
      Updated Dec 7, 2024
    • Efficient Triton Kernels for LLM Training
      Python
      BSD 2-Clause "Simplified" License
      Updated Dec 6, 2024
    • Efficient LLM Inference over Long Sequences
      Python
      Apache License 2.0
      Updated Nov 29, 2024
    • A calculator to estimate memory footprint, capacity, and latency on NVIDIA, AMD, and Intel hardware.
      Python
      Updated Nov 24, 2024
    • ROCm Quantized Attention that achieves speedups of 2.1x and 2.7x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
      Cuda
      Apache License 2.0
      Updated Nov 21, 2024
    • Go ahead and axolotl questions
      Python
      Apache License 2.0
      Updated Nov 16, 2024
    • TypeScript documentation of JamAISDK
      HTML
      Updated Nov 14, 2024
    • skypilot

      Public
      SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
      Python
      Apache License 2.0
      Updated Nov 7, 2024
    • CI/CD pipelines that build Docker images with Flash Attention pre-compiled, to speed up development and deployment of other frameworks.
      Shell
      Apache License 2.0
      Updated Oct 26, 2024
    • ROCm fork of Fast and memory-efficient exact attention (this branch aims to produce a Flash Attention PyPI package that can be readily installed and used).
      Python
      BSD 3-Clause "New" or "Revised" License
      Updated Oct 26, 2024
    • A Python client for the Unstructured hosted API
      Python
      MIT License
      Updated Oct 14, 2024
    • EmbeddedLLM: API server for embedded device deployment. Currently supports CUDA/OpenVINO/IpexLLM/DirectML/CPU.
      Python
      Updated Oct 6, 2024
    • Go
      Updated Sep 26, 2024
    • PowerToys

      Public
      Windows system utilities to maximize productivity
      C#
      MIT License
      Updated Aug 9, 2024
    • Arena-Hard-Auto: An automatic LLM benchmark.
      Jupyter Notebook
      Apache License 2.0
      Updated Jul 15, 2024
    • Python
      Apache License 2.0
      Updated Jul 11, 2024
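
As a quick illustration of the litellm entry above (the Python SDK / LLM gateway that exposes 100+ providers through the OpenAI chat-completions format), here is a minimal usage sketch. It assumes the litellm package is installed and an OPENAI_API_KEY is set in the environment; the model name is illustrative only and is not taken from this list.

    # Minimal sketch: call an LLM through litellm's OpenAI-format interface.
    # Assumes `pip install litellm` and OPENAI_API_KEY set in the environment;
    # the model name below is illustrative, not prescribed by this repository list.
    from litellm import completion

    response = completion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize what a KV cache is in one sentence."}],
    )

    # litellm normalizes every provider's response to the OpenAI chat-completions shape.
    print(response.choices[0].message.content)

Swapping the model string (for example, to an Anthropic or Bedrock model identifier) routes the same call through a different provider, which is the point of the OpenAI-format abstraction.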