    Repositories list

    • Ditto is an open-source framework that enables direct conversion of HuggingFace PreTrainedModels into TensorRT-LLM engines.
      Python
      Apache License 2.0
      Updated Mar 4, 2025
    • owlite (Public)
      OwLite is a low-code model compression toolkit for AI models.
      Python
      GNU Affero General Public License v3.0
      Updated Feb 20, 2025
    • vllm-fork (Public)
      A high-throughput and memory-efficient inference and serving engine for LLMs. A minimal usage sketch follows the list below.
      Python
      Apache License 2.0
      Updated Feb 20, 2025
    • gradio (Public)
      Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work! A minimal interface sketch follows the list below.
      Python
      Apache License 2.0
      Updated Jan 13, 2025
    • TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. A hedged sketch of this Python API appears after the list below.
      C++
      Apache License 2.0
      Updated Dec 12, 2024
    • Python
      Apache License 2.0
      Updated Nov 22, 2024
    • Intel Neural Compressor
      Python
      Apache License 2.0
      Updated Oct 22, 2024
    • The OwLite Examples repository offers illustrative example code to help users seamlessly compress PyTorch deep learning models and convert them into TensorRT engines.
      Python
      Updated Sep 27, 2024
    • Isolated DinD (Docker in Docker) container for developing and deploying Docker containers using NVIDIA GPUs and the NVIDIA container toolkit. Useful for deploying the Docker engine with NVIDIA in Kubernetes.
      Dockerfile
      Mozilla Public License 2.0
      Updated Aug 27, 2024
    • C++
      Apache License 2.0
      Updated Jul 23, 2024
    • .github (Public)
      Updated Jul 22, 2024
    • QUICK (Public)
      QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
      Python
      MIT License
      Updated Mar 6, 2024
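
vllm-fork — a minimal offline-generation sketch, assuming vLLM's standard Python API (LLM and SamplingParams); the model id is illustrative only:

    from vllm import LLM, SamplingParams

    # Load a small model and generate a short completion for one prompt.
    llm = LLM(model="facebook/opt-125m")  # illustrative model id
    params = SamplingParams(temperature=0.8, max_tokens=32)
    outputs = llm.generate(["Hello, my name is"], params)
    print(outputs[0].outputs[0].text)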
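gradio — a minimal sketch of a Gradio app using the standard gr.Interface entry point; the greet function is a stand-in for any model call:

    import gradio as gr

    def greet(name):
        # Stand-in for real model inference: echo a greeting.
        return f"Hello, {name}!"

    # Wire the function to a simple text-in / text-out web UI and serve it.
    demo = gr.Interface(fn=greet, inputs="text", outputs="text")
    demo.launch()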
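TensorRT-LLM — a hedged sketch of the high-level Python API mentioned in the description above, assuming the tensorrt_llm.LLM entry point and an illustrative HuggingFace model id; engine building is handled when the model is loaded:

    from tensorrt_llm import LLM, SamplingParams

    # Define the model (a TensorRT engine is built/loaded under the hood)
    # and run inference on a single prompt.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # illustrative model id
    outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)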