The past several years have marked the steady rise of large language models (LLMs), driven largely by advances in computational power, data availability, and algorithmic innovation. LLMs have profoundly shaped the research landscape, introducing new methodologies and paradigms that challenge traditional approaches.
We have also expanded our research interests to the field of LLMs. Below is a curated list of LLM-related research papers, grouped by topic. We highly recommend that beginners read and thoroughly understand these papers.
😄 We welcome and value any contributions.

Foundation models and pre-training (a minimal attention sketch follows the table):

Title | Link |
---|---|
Sequence to Sequence Learning with Neural Networks | [paper] |
Transformer: Attention Is All You Need | [paper] |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | [paper] |
GPT: Improving Language Understanding by Generative Pre-Training | [paper] |
GPT2: Language Models are Unsupervised Multitask Learners | [paper] |
GPT3: Language Models are Few-Shot Learners | [paper] |
GPT3.5: Fine-Tuning Language Models from Human Preferences | [paper] |
LLaMA: Open and Efficient Foundation Language Models | [paper] |
Llama 2: Open Foundation and Fine-Tuned Chat Models | [paper] |
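
The core building block shared by the papers above is the scaled dot-product attention from "Attention Is All You Need": Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V. Below is a minimal PyTorch sketch of that formula; the shapes, names, and toy usage are illustrative and not taken from any paper's code.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    q, k, v: tensors of shape (batch, heads, seq_len, d_k).
    mask:    optional boolean tensor broadcastable to the score matrix,
             where True marks positions to hide (e.g. a causal mask).
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Toy usage: 1 sequence, 2 heads, 4 tokens, head dimension 8, with a causal mask.
q = k = v = torch.randn(1, 2, 4, 8)
causal = torch.triu(torch.ones(4, 4), diagonal=1).bool()
print(scaled_dot_product_attention(q, k, v, mask=causal).shape)  # torch.Size([1, 2, 4, 8])
```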

Multimodal large language models (a CLIP-style loss sketch follows the table):

Title | Link |
---|---|
Efficient Multimodal Large Language Models: A Survey | [paper] |
CLIP: Learning Transferable Visual Models From Natural Language Supervision | [paper] |
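
CLIP, listed above, is trained with a symmetric contrastive loss over a batch of matched image-text pairs: the two encoders' embeddings are normalized, a batch-by-batch similarity matrix is formed, and cross-entropy pulls each image toward its own caption and vice versa. The sketch below follows the spirit of the pseudocode in the CLIP paper; the random embeddings, dimensions, and fixed temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive (InfoNCE) loss over a batch of matched image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_features.size(0))
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2

# Toy usage: random vectors stand in for real image/text encoder outputs.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_contrastive_loss(image_emb, text_emb, logit_scale=100.0))
```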

Distributed and parallel training (a tensor-parallelism sketch follows the table):

Title | Link |
---|---|
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | [paper] |
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | [paper] |
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | [paper] |
ZeRO-Offload: Democratizing Billion-Scale Model Training | [paper] |
PipeDream: Generalized Pipeline Parallelism for DNN Training | [paper] |
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | [paper] |
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models | [paper] |
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding | [paper] |
PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing | [paper] |
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | [paper] |
Accelerating Distributed MoE Training and Inference with Lina | [paper] |
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism | [paper] |
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | [paper] |
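
A recurring idea in the training-systems papers above (Megatron-LM, GShard, Alpa) is sharding a single weight matrix across devices so that each GPU stores and computes only a slice of it. The sketch below simulates Megatron-style column parallelism for one linear layer on a single device to show that the sharded computation reproduces the unsharded result; in a real deployment each shard would live on its own GPU and the final concatenation would be an all-gather collective.

```python
import torch

torch.manual_seed(0)
batch, d_in, d_out, num_shards = 4, 16, 32, 2

x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)                  # full weight of a linear layer y = x @ w

# Column parallelism: split the output dimension of the weight across "devices".
w_shards = torch.chunk(w, num_shards, dim=1)  # each shard has shape (d_in, d_out // num_shards)

# Each "device" computes its slice of the output independently...
partial_outputs = [x @ w_shard for w_shard in w_shards]

# ...and concatenating the slices (an all-gather in practice) recovers the full result.
y_parallel = torch.cat(partial_outputs, dim=1)
y_reference = x @ w
print(torch.allclose(y_parallel, y_reference, atol=1e-6))  # True
```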

Inference and serving systems (a KV-cache sketch follows the table):

Title | Link |
---|---|
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | [paper] |
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | [paper] |
Efficiently Scaling Transformer Inference | [paper] |
vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention | [paper] |
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | [paper] |
Orca: A Distributed Serving System for Transformer-Based Generative Models | [paper] |
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | [paper] |
S³: Increasing GPU Utilization during Generative Inference for Higher Throughput | [paper] |
Splitwise: Efficient generative LLM inference using phase splitting | [paper] |
SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [paper] |
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | [paper] |
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | [paper] |
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | [paper] |
Vidur: A Large-Scale Simulation Framework For LLM Inference | [paper] |
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers | [paper] |
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | [paper] |
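
Most of the serving systems above (vLLM, Orca, Splitwise, DistServe) are built around the key/value cache used in autoregressive decoding: a prefill phase processes the whole prompt once, then each decode step computes attention only for the newest token against the cached keys and values and appends its own entry. The single-sequence sketch below illustrates just that cache-reuse pattern; the random projection matrices and the way the attention output is fed back as the next input are stand-ins for a real model and sampling loop.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model = 16

# Random matrices stand in for a trained model's Q/K/V projections (illustrative only).
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

def attend(q, k_cache, v_cache):
    scores = (q @ k_cache.t()) / math.sqrt(d_model)
    return F.softmax(scores, dim=-1) @ v_cache

# Prefill: run the whole prompt once and materialize its key/value cache.
prompt = torch.randn(5, d_model)                 # 5 prompt "tokens" as embeddings
k_cache, v_cache = prompt @ w_k, prompt @ w_v

# Decode: each step projects only the newest token and reuses the cache,
# so per-token cost grows with cache length instead of re-encoding the prompt.
x = prompt[-1:]
for step in range(3):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    k_cache = torch.cat([k_cache, k], dim=0)
    v_cache = torch.cat([v_cache, v], dim=0)
    x = attend(q, k_cache, v_cache)              # stand-in for the next token's representation
    print(f"decode step {step}: cache length = {k_cache.size(0)}")
```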

Parameter-efficient fine-tuning and adapter serving (a LoRA sketch follows the table):

Title | Link |
---|---|
Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models | [paper] |
Parameter-Efficient Transfer Learning for NLP | [paper] |
Prefix-Tuning: Optimizing Continuous Prompts for Generation | [paper] |
LoRA: Low-Rank Adaptation of Large Language Models | [paper] |
Towards a Unified View of Parameter-Efficient Transfer Learning | [paper] |
Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning | [paper] |
PetS: A Unified Framework for Parameter-Efficient Transformers Serving | [paper] |
Punica: Multi-Tenant LoRA Serving | [paper] |
S-LoRA: Serving Thousands of Concurrent LoRA Adapters | [paper] |
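
LoRA, which underlies several entries above (LoRA, Punica, S-LoRA), freezes the pretrained weight W and learns a low-rank update ΔW = B A, so the adapted layer computes W x + (α/r) B A x with far fewer trainable parameters. The sketch below is a minimal illustration of that idea; the class and attribute names are made up for this example and are not the API of the `peft` library or of any paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B(A(x))."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                 # pretrained weight stays frozen

        self.lora_a = nn.Linear(in_features, r, bias=False)    # A: down-projection to rank r
        self.lora_b = nn.Linear(r, out_features, bias=False)   # B: up-projection back to d_out
        nn.init.zeros_(self.lora_b.weight)                      # B = 0, so the update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Only the low-rank A and B matrices receive gradients.
layer = LoRALinear(768, 768, r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total}")
```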