💭
Machine Learning
  • BUPT
  • Beijing Haidian BUPT

Code and data for the Chain-of-Draft (CoD) paper

Python 115 13 Updated Mar 11, 2025

[NeurIPS'23] Speculative Decoding with Big Little Decoder

Python 89 10 Updated Feb 6, 2024

scalable and robust tree-based speculative decoding algorithm

Python 336 38 Updated Jan 28, 2025

official code for GliDe with a CaPE

Python 13 1 Updated Aug 13, 2024

Official Implementation of "Learning Harmonized Representations for Speculative Sampling" (HASS)

Python 11 Updated Sep 27, 2024

FlashMLA: Efficient MLA decoding kernels

C++ 11,248 785 Updated Mar 1, 2025

Tile primitives for speedy kernels

Cuda 2,135 123 Updated Mar 11, 2025

Awesome LLM pruning papers: an all-in-one repository integrating useful resources and insights.

74 4 Updated Dec 7, 2024

Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models

Python 223 28 Updated Apr 23, 2024

Unofficial implementations of block/layer-wise pruning methods for LLMs.

Jupyter Notebook 64 8 Updated Apr 29, 2024

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Python 2,509 482 Updated Apr 15, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,210 74 Updated Mar 6, 2025

Fully open reproduction of DeepSeek-R1

Python 22,565 2,025 Updated Mar 11, 2025

[CVPR 2022] AlignQ: Alignment Quantization with ADMM-based Correlation Preservation

Python 10 Updated Jan 6, 2023

Finetune Llama 3.3, DeepSeek-R1 & Reasoning LLMs 2x faster with 70% less memory! 🦥

Python 34,235 2,506 Updated Mar 11, 2025

[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection

Python 84 4 Updated Feb 20, 2025

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Python 123 5 Updated Dec 3, 2024

Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

Cuda 1,103 66 Updated Feb 28, 2025

Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings)

Python 238 27 Updated Mar 10, 2025

Puzzles for learning Triton; play them with minimal environment configuration!

Python 253 25 Updated Dec 3, 2024

Puzzles for learning Triton

Jupyter Notebook 1,490 111 Updated Nov 18, 2024

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ 216 21 Updated Sep 30, 2024

[NeurIPS'24] Efficient and accurate memory saving method towards W4A4 large multi-modal models.

Python 67 5 Updated Jan 3, 2025

Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization

Python 108 9 Updated Jan 23, 2025

Official inference framework for 1-bit LLMs

C++ 12,794 900 Updated Feb 18, 2025

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Python 157 16 Updated Jul 12, 2024