GitHub link: https://github.com/ldbb/RAG_NLIBench.git
RAG can streamline a large model's reasoning and decision-making and improve its autonomy. By retrieving relevant information directly, RAG can shorten the model's internal reasoning chain and reduce the number of reasoning steps, making decisions more efficient and direct. However, there is no established metric for determining under what conditions RAG actually simplifies reasoning and decision-making.
- Propose a metric for how RAG simplifies reasoning
Instruction-tuning Dataset
- Data sources
| Dataset / Project | Task types | Languages | Link |
|---|---|---|---|
| Super-Natural Instructions | 76 | 55 | https://allenai.org/data/natural-instructions |
| GPT-4-LLM (no task split) | - | En | https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM |
| Unnatural Instructions (no task split) | - | En | https://github.com/orhonovich/unnatural-instructions |
| Self-Instruct (no task split) | - | En | https://github.com/yizhongw/self-instruct |
| Dolly (split per InstructGPT categories) | 8 | En | https://huggingface.co/datasets/databricks/databricks-dolly-15k/tree/main |
| Alpaca (no task split) | - | En | https://github.com/tatsu-lab/stanford_alpaca |
- Classification: train an MLP on the 1,612 definition/category pairs from the Super-Natural Instructions dataset (acc: 0.82), then use it to classify the other datasets.
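The classifier above can be sketched roughly as follows. The toy definitions, labels, featurization, and hyperparameters here are illustrative assumptions, not the actual training setup (which used the 1,612 definition/category pairs).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the 1,612 (definition, category) pairs.
definitions = [
    "Translate the given sentence from German to English.",
    "Translate this French sentence into English.",
    "Decide whether the review expresses a positive or negative sentiment.",
    "Classify the sentiment of the given movie review.",
]
categories = ["Translation", "Translation", "Sentiment Analysis", "Sentiment Analysis"]

# TF-IDF features feeding a small MLP, standing in for the classifier
# that maps a task definition to one of the 76 categories.
clf = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
clf.fit(definitions, categories)

# Classify an unseen definition from another dataset.
pred = clf.predict(["Translate the following Spanish text to English."])[0]
print(pred)
```

The same `predict` call would then be run over the definitions of GPT-4-LLM, Unnatural Instructions, Self-Instruct, Dolly, and Alpaca to assign them categories.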
- Task categories: 76 in total, following Super-Natural Instructions.
- Training data filtering: from each of the 1,600+ tasks, take the first 40 examples and drop any example longer than 256 tokens to keep the training data intact, giving 61,250 training examples in total. The task category distribution is as follows:
| Category | Count |
|---|---|
| Answer Verification | 209 |
| Answerability Classification | 560 |
| Cause Effect Classification | 1480 |
| Code to Text | 83 |
| Coherence Classification | 240 |
| Commonsense Classification | 960 |
| Coreference Resolution | 534 |
| Data to Text | 322 |
| Dialogue Act Recognition | 280 |
| Dialogue Generation | 520 |
| Dialogue State Tracking | 81 |
| Discourse Connective Identification | 40 |
| Discourse Relation Classification | 40 |
| Entity Generation | 40 |
| Entity Relation Classification | 40 |
| Ethics Classification | 240 |
| Explanation | 240 |
| Fact Verification | 96 |
| Fill in The Blank | 880 |
| Gender Classification | 280 |
| Grammar Error Correction | 44 |
| Grammar Error Detection | 86 |
| Information Extraction | 1360 |
| Intent Identification | 200 |
| Irony Detection | 80 |
| Keyword Tagging | 200 |
| Language Identification | 600 |
| Linguistic Probing | 360 |
| Mathematics | 200 |
| Misc. | 1480 |
| Named Entity Recognition | 1000 |
| Negotiation Strategy Detection | 280 |
| Number Conversion | 80 |
| Overlap Extraction | 80 |
| Paper Review | 27 |
| Paraphrasing | 480 |
| Poem Generation | 40 |
| Pos Tagging | 120 |
| Preposition Prediction | 40 |
| Program Execution | 3592 |
| Punctuation Error Detection | 40 |
| Question Answering | 7731 |
| Question Generation | 2733 |
| Question Rewriting | 320 |
| Question Understanding | 520 |
| Section Classification | 80 |
| Sentence Composition | 800 |
| Sentence Compression | 40 |
| Sentence Expansion | 40 |
| Sentence Ordering | 200 |
| Sentence Perturbation | 600 |
| Sentiment Analysis | 2955 |
| Spam Classification | 40 |
| Speaker Identification | 358 |
| Speaker Relation Classification | 62 |
| Spelling Error Detection | 40 |
| Stance Detection | 120 |
| Stereotype Detection | 280 |
| Story Composition | 360 |
| Style Transfer | 80 |
| Summarization | 481 |
| Text Categorization | 1800 |
| Text Completion | 805 |
| Text Matching | 1720 |
| Text Quality Evaluation | 404 |
| Text Simplification | 160 |
| Text to Code | 160 |
| Textual Entailment | 1080 |
| Title Generation | 758 |
| Toxic Language Detection | 1600 |
| Translation | 15760 |
| Word Analogy | 320 |
| Word Relation Classification | 200 |
| Word Semantics | 400 |
| Wrong Candidate Generation | 981 |
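The filtering rule described above (first 40 examples per task, drop examples over 256 tokens) can be sketched as below. The whitespace token count and toy task data are assumptions; the actual pipeline presumably counts tokens with the model's tokenizer.

```python
MAX_TOKENS = 256
PER_TASK = 40

def filter_task(examples, max_tokens=MAX_TOKENS, per_task=PER_TASK):
    """Take the first `per_task` examples of a task, dropping over-long ones."""
    kept = []
    for text in examples[:per_task]:
        # Crude whitespace token count; the real setup likely uses the
        # LLaMA tokenizer instead.
        if len(text.split()) <= max_tokens:
            kept.append(text)
    return kept

# Toy task: one over-long example at the front, then 50 short examples.
task = ["word " * 300] + [f"example {i}" for i in range(50)]
kept = filter_task(task)
print(len(kept))  # 39: first 40 taken, 1 dropped for length
```

Applied over all 1,600+ tasks, this kind of per-task cap is what yields the 61,250-example training set.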
- Test data filtering: split into two broad groups, text generation and text classification. Text generation can be evaluated with BLEU, ROUGE, and BERTScore; text classification with accuracy; other tasks such as NER and QA with Exact Match.
- BLEU / ROUGE / BERTScore test set

| Category | Count |
|---|---|
| Data to Text | 50 |
| Overlap Extraction | 40 |
| Question Rewriting | 80 |
| Summarization | 120 |
| Title Generation | 100 |
| Translation (other language -> English) | 200 |

- Accuracy test set

| Category | Count |
|---|---|
| Commonsense Classification | 50 |
| Sentiment Analysis | 100 |
| Spam Classification | 20 |
| Textual Entailment (SNLI) | 100 |

- Exact Match test set

| Category | Count |
|---|---|
| Question Answering | 200 |
| Named Entity Recognition | 62 |
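A minimal sketch of the three scoring modes used above: accuracy, Exact Match, and a unigram-overlap stand-in for ROUGE-1 F1. These toy functions are illustrative assumptions; the actual evaluation presumably uses standard BLEU/ROUGE/BERTScore implementations.

```python
from collections import Counter

def accuracy(preds, golds):
    """Fraction of predictions that exactly equal the gold label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def exact_match(preds, golds):
    """EM with light normalization (strip + lowercase), as commonly done for QA."""
    norm = lambda s: s.strip().lower()
    return sum(norm(p) == norm(g) for p, g in zip(preds, golds)) / len(golds)

def rouge1_f1(pred, gold):
    """Simplified ROUGE-1: F1 over unigram overlap counts."""
    p, g = Counter(pred.split()), Counter(gold.split())
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(g.values())
    return 2 * prec * rec / (prec + rec)

print(accuracy(["pos", "neg"], ["pos", "pos"]))            # 0.5
print(exact_match(["Paris ", "rome"], ["paris", "Rome"]))  # 1.0
print(round(rouge1_f1("the cat sat", "the cat ran"), 3))   # 0.667
```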
- Model selection: fine-tune LLaMA2-7B, following the Alpaca recipe.
- Fine-tuning hyperparameters (update: a further 300 steps added on top of the initial 300):

```python
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LORA_TARGET_MODULES = [
    "q_proj",
    "v_proj",
]
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32
LEARNING_RATE = 3e-4
TRAIN_STEPS = 300           # 600 for the extended run
OUTPUT_DIR = "experiments"  # "experiments_2" for the extended run
```
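Given these hyperparameters, the effective batch size and the training budget over the 61,250-example set work out as below. This is a sanity-check calculation assuming every optimizer step consumes a full batch of 128; it is not part of the training code.

```python
BATCH_SIZE = 128
MICRO_BATCH_SIZE = 4
DATASET_SIZE = 61250  # training examples after filtering

# Each optimizer step accumulates gradients over this many micro-batches.
grad_accum = BATCH_SIZE // MICRO_BATCH_SIZE
print(grad_accum)  # 32

# Epochs covered by each run length: steps * batch_size / dataset_size.
budget = {}
for steps in (300, 600):
    epochs = steps * BATCH_SIZE / DATASET_SIZE
    budget[steps] = round(epochs, 2)
print(budget)  # {300: 0.63, 600: 1.25}
```

So the 300-step run sees roughly 0.63 epochs of the data and the 600-step run roughly 1.25 epochs, which helps interpret the 300-step vs 600-step result columns below.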
Evaluation metrics: BLEU, ROUGE, BERTScore, accuracy, and Exact Match, as in the test-set split above.

LLaMA2-7B fine-tuning results:
- BLEU / ROUGE / BERTScore test-set results (300 / 600 steps)

| Category | Count | BLEU | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore |
|---|---|---|---|---|---|---|
| Data to Text | 50 | 0.103/0.072 | 0.530/0.537 | 0.296/0.280 | 0.430/0.443 | 0.905/0.910 |
| Question Rewriting | 80 | 0.078/0.228 | 0.529/0.651 | 0.302/0.458 | 0.462/0.589 | 0.905/0.931 |
| Summarization | 116 | 0.059/0.052 | 0.287/0.293 | 0.140/0.128 | 0.259/0.257 | 0.872/0.881 |
| Title Generation | 96 | 0.070/0.067 | 0.324/0.377 | 0.158/0.170 | 0.308/0.355 | 0.875/0.886 |
| Translation (other language -> English) | 200 | 0.163/0.177 | 0.664/0.675 | 0.448/0.458 | 0.615/0.628 | 0.925/0.931 |
- Accuracy test-set results (300 / 600 steps)

| Category | Count | Acc |
|---|---|---|
| Commonsense Classification | 119 | 0.44/0.571 |
| Sentiment Analysis | 100 | 0.703/0.76 |
| Spam Classification | 20 | 0.789/0.7 |
| Textual Entailment (SNLI) | 65 | 0.185/0.462 |
- Exact Match test-set results (300 / 600 steps)

| Category | Count | Exact Match |
|---|---|---|
| Question Answering | 200 | 28.09/32.0 |
| Named Entity Recognition | 62 | 65.0/74.19 |
Title | Citation |
---|---|
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks | Wang Y, Mishra S, Alipoormolabashi P, et al. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks[J]. arXiv preprint arXiv:2204.07705, 2022. |
Instruction Tuning with GPT-4 | Peng B, Li C, He P, et al. Instruction tuning with gpt-4[J]. arXiv preprint arXiv:2304.03277, 2023. |
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor | Honovich O, Scialom T, Levy O, et al. Unnatural instructions: Tuning language models with (almost) no human labor[J]. arXiv preprint arXiv:2212.09689, 2022. |
Self-Instruct: Aligning Language Models with Self-Generated Instructions | Wang Y, Kordi Y, Mishra S, et al. Self-instruct: Aligning language models with self-generated instructions[J]. arXiv preprint arXiv:2212.10560, 2022. |
Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | Conover M, Hayes M, Mathur A, et al. Free dolly: Introducing the world’s first truly open instruction-tuned llm[J]. Company Blog of Databricks, 2023. |
Stanford Alpaca: An Instruction-following LLaMA model | Taori R, Gulrajani I, Zhang T, et al. Stanford alpaca: An instruction-following llama model[J]. 2023. |
BLEU: a Method for Automatic Evaluation of Machine Translation | Papineni K, Roukos S, Ward T, et al. Bleu: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002: 311-318. |
- Instruction dataset collection (1 week) ✅
- Transformer baseline model testing (1 week)
- Other models (2 weeks)