Stars
Code and data for the Chain-of-Draft (CoD) paper
[NeurIPS'23] Speculative Decoding with Big Little Decoder
A scalable and robust tree-based speculative decoding algorithm (a minimal draft-and-verify sketch of speculative decoding appears after this list)
Official code for GliDe with a CaPE
HArmonizedSS / HASS
Forked from SafeAILab/EAGLE. Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS)
Tile primitives for speedy kernels
An all-in-one repository of awesome LLM pruning papers, integrating useful resources and insights.
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
Unofficial implementations of block/layer-wise pruning methods for LLMs.
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Fully open reproduction of DeepSeek-R1
[CVPR 2022] AlignQ: Alignment Quantization with ADMM-based Correlation Preservation
Finetune Llama 3.3, DeepSeek-R1 & Reasoning LLMs 2x faster with 70% less memory! 🦥
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection (a generic low-rank compression sketch appears after this list)
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without degrading end-to-end metrics across various models.
Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)
Puzzles for learning Triton; play with minimal environment configuration!
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
[NeurIPS'24] An efficient and accurate memory-saving method for W4A4 large multi-modal models.
Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
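
Several of the repositories above (Big Little Decoder, GliDe, HASS, Lookahead Decoding, Spec-Bench) build on the draft-and-verify idea behind speculative decoding. The following is a minimal pure-Python sketch of greedy speculative decoding only: a cheap draft model proposes k tokens and the target model verifies them. The names speculative_decode, target, draft, and k are illustrative assumptions, not any repository's actual API.

# Minimal sketch of greedy speculative decoding. The "models" below are
# toy stand-ins (prefix -> next greedy token), not a real LLM interface.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]

def speculative_decode(target: Model, draft: Model,
                       prompt: List[Token], k: int, max_new: int) -> List[Token]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        spec: List[Token] = []
        for _ in range(k):
            spec.append(draft(seq + spec))
        # 2) Verify: keep draft tokens while the target agrees; on the first
        #    disagreement, take the target's token instead. In a real system
        #    this verification is a single batched forward pass.
        for i in range(k):
            t = target(seq + spec[:i])
            seq.append(t)
            if t != spec[i]:
                break  # the round ends at the target's first correction
    return seq[:len(prompt) + max_new]

# Toy demo: the draft agrees with the target except at every 5th position,
# so most rounds accept several tokens at once.
vocab_cycle = [3, 1, 4, 1, 5, 9, 2, 6]
target: Model = lambda ctx: vocab_cycle[len(ctx) % len(vocab_cycle)]
draft: Model = lambda ctx: target(ctx) if len(ctx) % 5 else 0
print(speculative_decode(target, draft, prompt=[7], k=4, max_new=8))

The key invariant of this loop is that its output matches what greedy decoding with the target model alone would produce; speculation only changes how many target calls are spent per accepted token.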
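For the KV-cache compression entries (Palu, GEAR), here is a generic truncated-SVD sketch of the low-rank idea: store a rank-r factorization of the cached keys instead of the full matrix. This illustrates low-rank projection only and is not Palu's or GEAR's actual algorithm; the shapes, the synthetic data, and the name compress_keys are assumptions.

# Generic low-rank compression of a key cache via truncated SVD.
import numpy as np

def compress_keys(K: np.ndarray, r: int):
    """Factor a (seq_len, head_dim) key cache into (seq_len, r) @ (r, head_dim)."""
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    A = U[:, :r] * S[:r]  # (seq_len, r): per-token low-rank codes to cache
    B = Vt[:r]            # (r, head_dim): shared reconstruction matrix
    return A, B

rng = np.random.default_rng(0)
# Synthetic cache with a decaying spectrum, so low rank captures most energy.
K = (rng.standard_normal((128, 64)) * (0.9 ** np.arange(64))) @ rng.standard_normal((64, 64))
A, B = compress_keys(K, r=16)
rel_err = np.linalg.norm(K - A @ B) / np.linalg.norm(K)
print(f"stored {A.size + B.size} floats vs {K.size}; relative error {rel_err:.3f}")

The trade-off is the usual one: memory drops from seq_len * head_dim to roughly r * (seq_len + head_dim) floats per head, at the cost of a reconstruction error governed by the discarded singular values.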