fix bugs in readme #20

Merged: 1 commit, merged on May 28, 2022
89 changes: 89 additions & 0 deletions CLA.md
@@ -0,0 +1,89 @@
# The Contributor License Agreement

The [Cloud Native Computing Foundation](https://www.cncf.io) (CNCF) defines
the legal status of contributed code through two types of _Contributor License Agreement_
(CLA): one for [individual contributors](https://github.com/cncf/cla/blob/master/individual-cla.pdf) and one for [corporations](https://github.com/cncf/cla/blob/master/corporate-cla.pdf).

FlagAI can only accept original source code from CLA signatories.

It is important to read and understand this legal agreement.

## How do I sign?

After creating your first Pull Request, the linux-foundation-easycla bot will respond with information about your CLA status, along with a link to sign the CLA.

<img width="1065" alt="EasyCLA bot" src="https://user-images.githubusercontent.com/69111235/152226443-f6fe61ee-0e92-46c5-b6ea-c0deb718a585.png">

#### 1. Authorize EasyCLA to read some of your GitHub information

<img width="554" alt="GitHub EasyCLA Authorization" src="https://user-images.githubusercontent.com/69111235/152228712-7d22f9d0-9f3c-4226-9ee0-bacba4b47725.png">

Click on the "Please click here to be authorized" link to navigate to the GitHub Authorize Linux Foundation: EasyCLA page. Then click Authorize LF-Engineering to give the Linux Foundation read-only access to list the email addresses associated with your GitHub account.

#### 2. Select your contributor type

<img width="1407" alt="EasyCLA" src="https://user-images.githubusercontent.com/69111235/152224818-1246453a-b086-4a57-9d14-c10d62ad438f.png">


After authorizing EasyCLA, you will be redirected to a page to identify which type of contributor you are.
Select the most appropriate option:
* Individual Contributor: You are contributing as yourself, and not as part of another organization.
* Corporate Contributor: You are contributing on behalf of your employer or other organization.

#### 3. Sign the CLA

Once you have selected your contributor type, proceed to sign the CLA and follow the instructions to complete the signing process through DocuSign.

**Ensure that your GitHub e-mail address matches the e-mail address you use to sign the CLA.**

After you have filled out the information, click "Finish" and you will be redirected back to your Pull Request.

#### 4. Look for an email indicating successful signup.

> Hello,
>
> This is a notification email from EasyCLA regarding the project Cloud Native Computing Foundation (CNCF).
>
> The CLA has now been signed. You can download the signed CLA as a PDF here.
>
> If you need help or have questions about EasyCLA, you can read the documentation or reach out to us for support.
>
> Thanks,
> EasyCLA Support Team



#### 5. Validate your CLA

Once you are redirected back to your GitHub Pull Request, reply with a comment `/easycla` to update the CLA status of your PR.


## Changing your Affiliation

If you've changed employers and still contribute to FlagAI, your affiliation
needs to be updated. The Cloud Native Computing Foundation uses [gitdm](https://github.com/cncf/gitdm)
to track who is contributing and from where. Create a pull request on the [gitdm](https://github.com/cncf/gitdm)
repository with a change to the corresponding developer affiliation text file.
Your entry should look similar to this:

```
Jorge O. Castro*: jorge!heptio.com, jorge!ubuntu.com, jorge.castro!gmail.com
Heptio
Canonical until 2017-03-31
```

## Troubleshooting

If you encounter any problems signing the CLA and need further assistance, log a ticket via the [please submit a support request ticket](https://jira.linuxfoundation.org/plugins/servlet/theme/portal/4) link in the EasyCLA bot's response. Someone from the CNCF will respond to your ticket to help.

Should you have any issues using the [Linux Foundation Support Site], send a message to the
backup e-mail support address <login-issues@jira.linuxfoundation.org>.

## Setting up the CNCF CLA check

If you are a FlagAI GitHub organization or repository owner and would like to set up
the Linux Foundation CNCF CLA check for your repositories, [read the docs on setting up the CNCF CLA check](/github-management/setting-up-cla-check.md).


[Linux Foundation Support Site]: https://support.linuxfoundation.org/
17 changes: 8 additions & 9 deletions CONTRIBUTING.md
@@ -8,6 +8,9 @@ side, please stick to the following process:
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

+## Sign the CLA
+
+Before you can contribute to FlagAI, you will need to sign the [Contributor License Agreement](CLA.md).

## Git Commit Guidelines

@@ -34,17 +37,13 @@ pip install -r requirements.txt
```

### tests

-To run all basic tests execute:
-```bash
-python test.py
-```
-To check the test results in
-```
-tests/test_report
-```
+Install `pytest` for testing:
+```
+pip install pytest
+```
+To run all basic tests execute:
+```bash
+pytest
+```
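
As a hedged illustration of what `pytest` picks up: any function named `test_*` inside a `test_*.py` file is collected automatically. The file below is a made-up example, not part of the repository:

```python
# tests/test_smoke.py -- hypothetical example file, not in the repository
def test_addition():
    assert 1 + 1 == 2
```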

### code formatting
5 changes: 1 addition & 4 deletions README.md
@@ -118,7 +118,7 @@ for text in test_data:
* [Blank_Filling_QA with GLM ](/docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Title Generation with GLM ](/docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Poetry generation with GLM-large-ch](docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
-* [Using huggingface's t5-3b & tricks](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
+* [Using huggingface's t5-11b & tricks](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title Generation with RoBerta-WWM](/docs/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic Matching with RoBerta-WWM](/docs/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [NER with RoBerta-WWM](/docs/TUTORIAL_17_BERT_EXAMPLE_NER.md)
@@ -135,15 +135,12 @@ language models, sequence labeling models, and text classification models. Let u

## Tutorials
We provide a set of quick tutorials to get you started with the library:

-[//]: # (* [Tutorial 1: Background: Transformers]&#40;docs/TUTORIAL_1_BASICS.md&#41;)
* [Tutorial 1: How to construct and use Tokenizer](/docs/TUTORIAL_1_TOKENIZER.md)
* [Tutorial 2: Dataset Preprocessing Pipeline](/docs/TUTORIAL_2_DATASET.md)
* [Tutorial 3: Major Function of Model Module](/docs/TUTORIAL_3_MODEL.md)
* [Tutorial 4: Customize trainer for model and data-parallel training](/docs/TUTORIAL_4_TRAINER.md)
* [Tutorial 5: Simplify model and tokenizer Initialization by Using Autoloader](/docs/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md)
* [Tutorial 6: Use off-the-shelf inference Algorithms with Predictor](/docs/TUTORIAL_6_INSTRUCTIONS_FOR_PREDICTOR.md)

* [Tutorial 7: Use FlagAI prompt-learning tool-kit to improve performance on SuperGLUE](/docs/TUTORIAL_7_PROMPT_LERANING.md)
* [Tutorial 8: Setup environment for training models with multi-machine](/docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* [Tutorial 9: Text generation with encoder/decoder/encoder-decoder models](/docs/TUTORIAL_9_SEQ2SEQ_METHOD.md)
2 changes: 1 addition & 1 deletion README_zh.md
@@ -184,7 +184,7 @@ for text_pair in test_data:
* [Blank-filling QA with GLM-large-ch](/doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Poetry generation with GLM-large-ch](doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Title generation with GLM-large-ch](doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
-* [Support for huggingface's t5-3b model, plus speed-up tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
+* [Support for huggingface's t5-11b model, plus speed-up tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title generation with RoBERTa-base-ch](doc_zh/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic similarity matching with RoBERTa-base-ch](doc_zh/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [Named entity recognition with RoBERTa-base-ch](/doc_zh/TUTORIAL_17_BERT_EXAMPLE_NER.md)
28 changes: 19 additions & 9 deletions doc_zh/GLM.md
@@ -5,8 +5,13 @@
Currently there are several different pre-training architectures: autoencoding models that implement only an encoder (e.g. BERT), autoregressive models that implement only a decoder (e.g. GPT), and encoder-decoder models that implement both (e.g. T5).

The [**GLM model**](https://arxiv.org/abs/2103.10360) is slightly different from all of these. It uses an autoregressive blank-infilling approach and achieves good results on the three main kinds of NLP task: natural language understanding, unconditional generation, and conditional generation.
<div align=center><img src="img/glm_example_1.png" width="600px"></div>

+| Framework       | NLU | Cond. Gen. | Uncond. Gen. |
+|-----------------|-----|------------|--------------|
+| Autoregressive  | -   | -          | ✅           |
+| Autoencoding    | ✅  | ×          | ×            |
+| Encoder-Decoder | -   | ✅         | -            |
+| GLM             | ✅  | ✅         | ✅           |

GLM's main features include:

- Task 1: Some spans of the text are masked (following the autoencoding idea). The spans are randomly shuffled and then predicted autoregressively; the masked spans cover 15% of the original text (see the sketch below).
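
To make Task 1 concrete, here is a minimal, illustrative sketch of GLM-style span corruption in plain Python. It is not FlagAI's implementation; the `[MASK]` sentinel, the span-length cap, and the 15% budget are assumptions taken from the description above:

```python
import random

MASK = "[MASK]"  # placeholder sentinel; the real tokenizer has its own mask ids

def glm_span_corruption(tokens, mask_ratio=0.15, max_span_len=3, max_tries=100):
    """Mask ~mask_ratio of tokens as contiguous spans, then return the
    corrupted sequence plus the spans in a shuffled prediction order."""
    budget = max(1, int(len(tokens) * mask_ratio))
    starts = {}   # span start index -> masked tokens
    used = set()  # token indices already inside a span
    tries = 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = random.randint(1, min(max_span_len, budget))
        start = random.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if used.intersection(span):
            continue  # spans must not overlap
        used.update(span)
        starts[start] = tokens[start:start + length]
        budget -= length
    corrupted, i = [], 0
    while i < len(tokens):  # Part A: each span collapses to one [MASK]
        if i in starts:
            corrupted.append(MASK)
            i += len(starts[i])
        else:
            corrupted.append(tokens[i])
            i += 1
    targets = list(starts.values())
    random.shuffle(targets)  # Part B: spans are predicted autoregressively in shuffled order
    return corrupted, targets

x, y = glm_span_corruption("the quick brown fox jumps over the lazy dog".split())
print(x)  # e.g. ['the', '[MASK]', 'brown', ...]
print(y)  # e.g. [['quick']]
```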
@@ -18,18 +23,23 @@ GLM's main features include:

## GLM's performance

-1. With multi-task pre-training, GLM-Doc and GLM-Sent perform slightly worse than GLM-Large, but still outperform BERT-Large and UniLM-Large.

+### [SuperGLUE](https://super.gluebenchmark.com)
+Results of single-model, single-task fine-tuning on the `dev` set; more results are [here](https://github.com/THUDM/GLM).

-2. Among the multi-task models, GLM-Sent outperforms GLM-Doc by 1.1% on average. Increasing GLM-Doc's parameters to 410M (1.25× BERT-Large) yields better performance than GLM-Large, and a GLM with 515M parameters (1.5× BERT-Large) performs better still.
+| Model | COPA | WSC | RTE | WiC | CB | MultiRC | BoolQ | ReCoRD |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| GLM-10b-ch | 98.0 | 95.2 | 93.1 | 75.7 | 98.7/98.2 | 88.1/63.3 | 88.7 | 94.4/94.0 |
+| [RoBERTa-Large](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 94.0 | 91.3 | 86.6 | 75.6 | 98.2/- | 85.7/- | 86.9 | 89.5/89.0 |
+| [DeBERTa-XXLarge-v2](https://github.com/microsoft/DeBERTa/tree/master/experiments/superglue) | 97.0 | - | 93.5 | - | - | 87.8/63.6 | 88.3 | 94.1/93.7 |

-<div align=center><img src="img/glm_results2.png"></div>
+### [CLUE](https://www.cluebenchmarks.com)
+Results of single-model, single-task fine-tuning on the CLUE datasets (evaluation is still in progress; only some tasks are listed). To use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

-1. GLM-XXLarge achieves an average score of 79.297, with significant gains on many tasks. On the 3 selected general tasks plus 2 business evaluation tasks, the average gain is 2.47pp.

-2. On the CLUE 1.0 tasks, excluding CMRC, the average gain is 1.56pp, with clear improvements on the C3 and OCNLI datasets (+9.9pp and +2.84pp).

-<div align=center><img src="img/glm_performance.png"></div>
+| Model | AFQMC | TNEWS1.0 | IFLYTEK | OCNLI_50K | CSL | CMRC2018 | CHID1.0 | C3 1.0 |
+|:--------------:|:------:|:--------:|:-------:|:---------:|:------:|:--------:|:-------:|:------:|
+| RoBERTa XLarge | 75.835 | 68.75 | 62.654 | 82.333 | 83.433 | 80.5 | 86.57 | 77.03 |
+| GLM-10b-ch | 75.42 | 69.94 | 62.15 | 85 | 86.17 | 70 | 87.009 | 88.335 |

## GLM pre-trained models supported by FlagAI
See [Tutorial 5: Build models quickly with the AutoLoader tool](/doc_zh/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md).
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md
@@ -34,7 +34,7 @@ GLM fine-tunes on downstream tasks and recasts them as blank-filling generation
This objective aims at generating long text. For example (a Chinese prompt asking "Does beer harm the stomach?", followed by the model's generated answer): ``[CLS]问题:啤酒伤胃吗?回答:[gMASK]<|startofpiece|>谢邀。 我是啤酒爱好者,但是我不喝酒。 我以前也说过,喝酒伤身,啤酒伤胃,伤肠道。 现在我也知道了啤酒伤人的很多细节,我就不瞎几把的说,大家看图片就知道了。 <n><n>其实啤酒伤身这个说法只是表面而已。 <n><n>啤酒中含有少量的碳酸和酒精,碳酸和酒精是成酸性物质,而乙醇是脂溶性的,酒精在胃里能够被分解,生成乙醇和二氧化碳,在体内是水和二氧化碳,两种物质会迅速发生中和反应,结果导致人体出现头痛、呕吐、胸痛、浑身发热等现象,这就是所谓喝大了,喝多了。 <n><n> 啤酒的含糖量在15%左右,喝多了也是伤身的,啤酒含糖量较高的主要成分是水分,而水分的体积比酒精大,所以酒精进入人体,与水相遇,就会产生大量气体,二氧化碳、水、一氧化碳等刺激人体,造成人体大量出汗,使体内温度升高,``


-For example, GLM completes the question-answering task as an autoregressive blank-infilling task:
+For example, we use `GLM-large-ch` to complete the question-answering task as an autoregressive blank-infilling task; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001):
```python
import torch
from flagai.model.glm_model import GLMModel
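# --- illustrative continuation: the rest of the example is collapsed in this
# diff view. The AutoLoader/Predictor calls below are assumptions based on
# FlagAI's usual pattern; consult the full tutorial for the real code. ---
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

loader = AutoLoader(task_name="lm", model_name="GLM-large-ch")  # hypothetical arguments
model = loader.get_model()
tokenizer = loader.get_tokenizer()
predictor = Predictor(model, tokenizer)
text = "问题:啤酒伤胃吗?回答:[gMASK]"  # the blank-filling prompt from above
print(predictor.predict_generate_randomsample(text))  # assumed Predictor API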
```
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md
@@ -1,7 +1,7 @@
# GLM example: title generation

## Background
-The title-generation task takes a passage of text as input, and the model outputs a corresponding title.
+The title-generation task takes a passage of text as input, and the model outputs a corresponding title. `GLM-large-ch` is used as the example here; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

![](./img/bert_title_generation_model.png)

1 change: 1 addition & 0 deletions doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md
@@ -25,6 +25,7 @@
cd ./examples/glm_poetry_generation
python ./train.py
```
+`GLM-large-ch` is used as the example here; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).
### 1. Prepare the training data
1) Define a file-reading function that reads the data from file and returns the src and tgt lists:
```python
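# Illustrative sketch only: the repository's actual implementation is collapsed
# in this diff view, and the file paths below are made-up examples.
def read_file(src_path="data/train.src", tgt_path="data/train.tgt"):
    with open(src_path, encoding="utf-8") as f:
        src = [line.strip() for line in f if line.strip()]
    with open(tgt_path, encoding="utf-8") as f:
        tgt = [line.strip() for line in f if line.strip()]
    assert len(src) == len(tgt), "src and tgt must stay aligned"
    return src, tgt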
```
22 changes: 11 additions & 11 deletions doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md
@@ -33,21 +33,21 @@ trainer = MyTrainer(
```python
trainer = MyTrainer(
batch_size=4,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=False)

# using huggingface transformers to get tokenizer and models
-model_name = 't5-3b'
+model_name = 't5-11b'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

print("loading model & tokenizer is done!")
src_dir = 'train_inputs.txt'
tgt_dir = 'train_targets.txt'
model_dir = "./t5-3b" # 模型位置
model_dir = "./t5-11b" # 模型位置
maxlen = 1024


```

@@ -139,7 +139,7 @@ trainer.train(model,

## Tricks for speeding up training
-We can hardly run t5-3b on a 32 GB V100, so we need some tricks to reduce GPU memory usage.
+We can hardly run t5-11b on a 32 GB V100, so we need some tricks to reduce GPU memory usage.
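
A quick back-of-the-envelope check (illustrative Python, parameter count rounded) shows why the raw model cannot fit:

```python
params = 11e9                      # t5-11b has roughly 11 billion parameters
fp32_gib = params * 4 / 2**30      # 4 bytes per fp32 weight
fp16_gib = params * 2 / 2**30      # 2 bytes per fp16 weight
print(f"{fp32_gib:.0f} GiB fp32")  # ~41 GiB: over a 32 GB V100 before any activations
print(f"{fp16_gib:.0f} GiB fp16")  # ~20 GiB: fits only with the further tricks below
```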
### Step 1: fp16
Cast the model parameters to `fp16`:
```python
Expand All @@ -149,23 +149,23 @@ trainer = MyTrainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=True) # change to `True`
```
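
For intuition, here is what fp16 mixed-precision training looks like in plain PyTorch. This is a hedged sketch independent of FlagAI's trainer (which only needs the `fp16=True` flag above); `model`, `optimizer`, and `loader` are assumed to be defined:

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # rescales the loss so fp16 gradients don't underflow
for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
```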
### Step 2: gradient checkpointing
-Intermediate results are not stored during the forward pass, which lets us run t5-3b with `batch size`=1.
-Now we can train/finetune a t5-3b with `gradient_accumulation_steps`.
+Intermediate results are not stored during the forward pass, which lets us run t5-11b with `batch size`=1.
+Now we can train/finetune a t5-11b with `gradient_accumulation_steps`; a sketch of how checkpointing is switched on follows.
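
A minimal sketch of enabling checkpointing on the Hugging Face model itself; `gradient_checkpointing_enable` exists in recent `transformers` releases, but how FlagAI's trainer toggles it may differ, so treat this as an assumption:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-11b')
model.gradient_checkpointing_enable()  # recompute activations during the backward pass
model.config.use_cache = False         # the decoder cache is incompatible with checkpointing
```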
```python
trainer = MyTrainer(
env_type='pytorch',
epochs=1,
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
Expand All @@ -181,7 +181,7 @@ trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```

@@ -205,7 +205,7 @@ trainer = Trainer(
```python
trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```

@@ -230,7 +230,7 @@ trainer = Trainer(
```python
trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```