ChattyLLaMA

ChattyLLaMA is a LLaMA-based take on ChatGPT.

FAIR is aware that LLaMA generations can be unexpected.

This is because LLaMA was not trained on conversational prompts.

Here's what they suggest:

To be able to directly prompt the models with questions / instructions, you can either:

  • Prompt it with few-shot examples so that the model understands the task you have in mind.
  • Finetune the models on datasets of instructions to make them more robust to input prompts.
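The first option can be sketched in a few lines. This is a minimal illustration, not FAIR's recommended format: the `Question:`/`Answer:` template, the helper name `build_few_shot_prompt`, and the demo pairs are all assumptions made up for this example. The idea is only that a base model continues text, so the prompt ends with an open `Answer:` slot for it to complete.

```python
def build_few_shot_prompt(examples, question):
    """Join (question, answer) demos and leave the last answer open for the model."""
    # Each demonstration shows the model the pattern it should continue.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block stops right after "Answer:" so the model fills it in.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical demonstrations, chosen only to illustrate the pattern.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

The resulting string would be passed to the model as-is; generation is typically cut off at the next blank line so the model does not invent further questions.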

My idea is to finetune the models on a diverse set of instruction datasets from LAION's OpenAssistant.

You can finetune language models from human preferences (i.e., Reinforcement Learning from Human Feedback (RLHF)).
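For intuition, learning from human preferences usually starts with a reward model trained on pairs of responses where a human picked one over the other. The sketch below shows the pairwise loss commonly used for that step, `-log(sigmoid(r_chosen - r_rejected))`. It is an illustration under assumptions, not this project's training code: the function name and the scalar reward values are hypothetical.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)) for one comparison."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch on the sign of d
    # so that exp() is only ever called on a non-positive number.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical reward scores: the loss is small when the human-preferred
# response scores higher, and large when the ranking is inverted.
agrees = preference_loss(2.0, 0.0)
disagrees = preference_loss(0.0, 2.0)
print(agrees, disagrees)
```

The trained reward model then scores generations during the RL stage, which is where the hyperparameter sensitivity comes in.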

People under-appreciate fine-tuning alone compared to RLHF: many papers show how far you can get with instruction tuning and no Reinforcement Learning (RL). RL algorithms are quite finicky (sensitive to hard-to-tune hyperparameters) compared to supervised deep learning.

The LLaMA paper touches on finetuning briefly, referencing the fine-tuning protocol from Flan.

ChatLLaMA enables you to build a ChatGPT-style service based on pre-trained LLaMA models.

This allows you to train LLaMA-based architectures in a similar way to ChatGPT, using RLHF.

(Disclaimer: The work is for research purposes.)

Plan

LLaMA model weights

Below are my experiments with compression and acceleration techniques, tricks, algorithms, and more, all documented in my awesome-ml-model-compression project.

Goals:

  • Inference efficiency: make models smaller and faster
  • Unlock on-device deployment: run on low-resource hardware and consumer GPUs instead of the cloud

Compression: A classic example is quantization, which compresses the weight matrices of a layer by reducing their precision (e.g., from 32-bit floating-point values to 8-bit unsigned integers) with minimal loss in quality.
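To make the idea concrete, here is a minimal sketch of affine (scale plus zero-point) quantization to unsigned 8-bit integers, written in plain Python. It is one simple scheme among many and not necessarily what any particular LLaMA quantization tool does; the function names and the sample weights are made up for illustration.

```python
def quantize_uint8(weights):
    """Map floats onto the integers 0..255 with a per-tensor scale and zero point."""
    lo, hi = min(weights), max(weights)
    # One integer step covers `scale` units of the float range.
    # Fall back to 1.0 when all weights are equal to avoid a zero scale.
    scale = (hi - lo) / 255 or 1.0
    # The zero point is the integer that (approximately) represents lo.
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize_uint8(quantized, scale, zero_point):
    """Recover approximate floats from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights standing in for one row of a layer's weight matrix.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized, scale, zero_point = quantize_uint8(weights)
restored = dequantize_uint8(quantized, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now needs one byte instead of four, and the reconstruction error stays within about one quantization step, which is the "minimal loss in quality" trade-off in miniature.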

Current high-level plan (tentative):

To start, use 🤗 PEFT (Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware). The Hugging Face PEFT library enables using the most popular and performant models from Transformers coupled with the simplicity and scalability of Accelerate. Currently supported PEFT methods: LoRA, prefix tuning, prompt tuning, and P-Tuning (which employs trainable continuous prompt embeddings). The PEFT team plans to explore more methods, such as (IA)³ and bottleneck adapters. Results: the number of parameters needed to fine-tune Flan-T5-XXL is now 9.4M, about 7X fewer than AlexNet has in total.
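As a back-of-the-envelope check on where a figure like 9.4M can come from, the sketch below counts LoRA's trainable parameters. LoRA replaces the update to a `d_in x d_out` weight matrix with two low-rank factors, which adds `rank * (d_in + d_out)` parameters per adapted matrix. The configuration here is an assumption, not something stated in this README: rank 8, adapters on the query and value projections only, hidden width 4096, and 24 encoder self-attention, 24 decoder self-attention, and 24 decoder cross-attention blocks.

```python
def lora_trainable_params(d_in, d_out, rank, num_matrices):
    """Parameters added by LoRA: factors of shape (d_in, rank) and (rank, d_out)."""
    return num_matrices * rank * (d_in + d_out)


# Assumed configuration (see the paragraph above); adjust to your own setup.
hidden = 4096
attention_blocks = 24 + 24 + 24   # encoder self, decoder self, decoder cross
adapted_per_block = 2             # query and value projections
rank = 8

total = lora_trainable_params(
    hidden, hidden, rank, attention_blocks * adapted_per_block
)
full_matrix = hidden * hidden                              # one dense projection
per_matrix = lora_trainable_params(hidden, hidden, rank, 1)

print(total)                      # 9437184, i.e. roughly 9.4M
print(full_matrix // per_matrix)  # 256x fewer parameters per adapted matrix
```

Under these assumptions the count lands on about 9.4M, while the multi-billion-parameter base model stays frozen.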

Future plan:

ChattyLLaMA

(TODO)