fix bugs in readme #20

Merged: 1 commit, merged on May 28, 2022
89 changes: 89 additions & 0 deletions CLA.md
@@ -0,0 +1,89 @@
# The Contributor License Agreement

The [Cloud Native Computing Foundation](https://www.cncf.io) (CNCF) defines
the legal status of contributed code through two types of _Contributor License Agreement_
(CLA): one for [individual contributors](https://github.com/cncf/cla/blob/master/individual-cla.pdf) and one for [corporations](https://github.com/cncf/cla/blob/master/corporate-cla.pdf).

FlagAI can only accept original source code from CLA signatories.

It is important to read and understand this legal agreement.

## How do I sign?

After creating your first Pull Request, the linux-foundation-easycla bot will respond with information about your CLA status, along with a link to sign the CLA.

<img width="1065" alt="EasyCLA bot" src="https://user-images.githubusercontent.com/69111235/152226443-f6fe61ee-0e92-46c5-b6ea-c0deb718a585.png">

#### 1. Authorize EasyCLA to read some of your GitHub information

<img width="554" alt="GitHub EasyCLA Authorization" src="https://user-images.githubusercontent.com/69111235/152228712-7d22f9d0-9f3c-4226-9ee0-bacba4b47725.png">

Click on the "Please click here to be authorized" link to navigate to the GitHub Authorize Linux Foundation: EasyCLA page. Then click Authorize LF-Engineering to give the Linux Foundation read-only access to list the email addresses associated with your GitHub account.

#### 2. Select your contributor type

<img width="1407" alt="EasyCLA" src="https://user-images.githubusercontent.com/69111235/152224818-1246453a-b086-4a57-9d14-c10d62ad438f.png">


After authorizing EasyCLA, you will be redirected to a page to identify which type of contributor you are.
Select the most appropriate option:
* Individual Contributor: You are contributing as yourself, and not as part of another organization.
* Corporate Contributor: You are contributing on behalf of your employer or other organization.

#### 3. Sign the CLA

Once you have selected your contributor type, proceed to sign the CLA and follow the instructions to complete the signing process through DocuSign.

**Ensure that your GitHub e-mail address matches the e-mail address you use to sign the CLA.**

After you have filled out the information, click "Finish" and you will be redirected back to your Pull Request.

#### 4. Look for an email indicating successful signup.

> Hello,
>
> This is a notification email from EasyCLA regarding the project Cloud Native Computing Foundation (CNCF).
>
> The CLA has now been signed. You can download the signed CLA as a PDF here.
>
> If you need help or have questions about EasyCLA, you can read the documentation or reach out to us for support.
>
> Thanks,
> EasyCLA Support Team



#### 5. Validate your CLA

Once you are redirected back to your GitHub Pull Request, reply with a comment `/easycla` to update the CLA status of your PR.


## Changing your Affiliation

If you've changed employers and still contribute to FlagAI, your affiliation
needs to be updated. The Cloud Native Computing Foundation uses [gitdm](https://github.com/cncf/gitdm)
to track who is contributing and from where. Create a pull request on the [gitdm](https://github.com/cncf/gitdm)
repository with a change to the corresponding developer affiliation text file.
Your entry should look similar to this:

```
Jorge O. Castro*: jorge!heptio.com, jorge!ubuntu.com, jorge.castro!gmail.com
Heptio
Canonical until 2017-03-31
```

## Troubleshooting

If you encounter any problems signing the CLA and need further assistance, log a ticket via the [please submit a support request ticket](https://jira.linuxfoundation.org/plugins/servlet/theme/portal/4) link in the EasyCLA bot's response. Someone from the CNCF will respond to your ticket to help.

Should you have any issues using the [Linux Foundation Support Site], send a message to the
backup e-mail support address <login-issues@jira.linuxfoundation.org>.

## Setting up the CNCF CLA check

If you are a FlagAI GitHub organization or repository owner and would like to set up
the Linux Foundation CNCF CLA check for your repositories, [read the docs on setting up the CNCF CLA check](/github-management/setting-up-cla-check.md).


[Linux Foundation Support Site]: https://support.linuxfoundation.org/
17 changes: 8 additions & 9 deletions CONTRIBUTING.md
@@ -8,6 +8,9 @@ side, please stick to the following process:
3. If we decide your concern needs code changes, we would be happy to accept a pull request. Please consider the
commit guidelines below.

+## Sign the CLA
+
+Before you can contribute to FlagAI, you will need to sign the [Contributor License Agreement](CLA.md).

## Git Commit Guidelines

@@ -34,17 +37,13 @@ pip install -r requirements.txt
```

### tests

-To run all basic tests execute:
-```bash
-python test.py
-```
-To check the test results in
-```
-tests/test_report
-```
+Install `pytest` for testing:
+```
+pip install pytest
+```
+To run all basic tests execute:
+```bash
+pytest
+```
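
As a hedged illustration of what `pytest` picks up: any function named `test_*` inside a `test_*.py` file is collected automatically. The file below is a made-up example, not part of the repository:

```python
# tests/test_smoke.py -- hypothetical example file, not in the repository
def test_addition():
    assert 1 + 1 == 2
```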

### code formatting
5 changes: 1 addition & 4 deletions README.md
@@ -118,7 +118,7 @@ for text in test_data:
* [Blank_Filling_QA with GLM ](/docs/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Title Generation with GLM ](/docs/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
* [Poetry generation with GLM-large-ch](docs/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
-* [Using huggingface's t5-3b & tricks](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
+* [Using huggingface's t5-11b & tricks](docs/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title Generation with RoBerta-WWM](/docs/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic Matching with RoBerta-WWM](/docs/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [NER with RoBerta-WWM](/docs/TUTORIAL_17_BERT_EXAMPLE_NER.md)
@@ -135,15 +135,12 @@ language models, sequence labeling models, and text classification models. Let u

## Tutorials
We provide a set of quick tutorials to get you started with the library:

-[//]: # (* [Tutorial 1: Background: Transformers]&#40;docs/TUTORIAL_1_BASICS.md&#41;)
* [Tutorial 1: How to construct and use Tokenizer](/docs/TUTORIAL_1_TOKENIZER.md)
* [Tutorial 2: Dataset Preprocessing Pipeline](/docs/TUTORIAL_2_DATASET.md)
* [Tutorial 3: Major Function of Model Module](/docs/TUTORIAL_3_MODEL.md)
* [Tutorial 4: Customize trainer for model and data-parallel training](/docs/TUTORIAL_4_TRAINER.md)
* [Tutorial 5: Simplify model and tokenizer Initialization by Using Autoloader](/docs/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md)
* [Tutorial 6: Use off-the-shelf inference Algorithms with Predictor](/docs/TUTORIAL_6_INSTRUCTIONS_FOR_PREDICTOR.md)

* [Tutorial 7: Use FlagAI prompt-learning tool-kit to improve performance on SuperGLUE](/docs/TUTORIAL_7_PROMPT_LERANING.md)
* [Tutorial 8: Setup environment for training models with multi-machine](/docs/TUTORIAL_8_ENVIRONMENT_SETUP.md)
* [Tutorial 9: Text generation with encoder/decoder/encoder-decoder models](/docs/TUTORIAL_9_SEQ2SEQ_METHOD.md)
2 changes: 1 addition & 1 deletion README_zh.md
@@ -184,7 +184,7 @@ for text_pair in test_data:
* [Blank-filling QA with GLM-large-ch](/doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md)
* [Poetry generation with GLM-large-ch](doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md)
* [Title generation with GLM-large-ch](doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md)
-* [Support for huggingface's t5-3b model, plus speed-up tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
+* [Support for huggingface's t5-11b model, plus speed-up tricks](doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md)
* [Title generation with RoBERTa-base-ch](doc_zh/TUTORIAL_15_BERT_EXAMPLE_TITLE_GENERATION.md)
* [Semantic similarity matching with RoBERTa-base-ch](doc_zh/TUTORIAL_16_BERT_EXAMPLE_SEMANTIC_MATCHING.md)
* [Named entity recognition with RoBERTa-base-ch](/doc_zh/TUTORIAL_17_BERT_EXAMPLE_NER.md)
28 changes: 19 additions & 9 deletions doc_zh/GLM.md
@@ -5,8 +5,13 @@
Currently there are several different pre-training architectures: autoencoding models that implement only an encoder (e.g. BERT), autoregressive models that implement only a decoder (e.g. GPT), and encoder-decoder models that implement both (e.g. T5).

The [**GLM model**](https://arxiv.org/abs/2103.10360) is slightly different from all of these. It uses an autoregressive blank-infilling approach and achieves good results on the three main kinds of NLP task: natural language understanding, unconditional generation, and conditional generation.
<div align=center><img src="img/glm_example_1.png" width="600px"></div>

+| Framework       | NLU | Cond. Gen. | Uncond. Gen. |
+|-----------------|-----|------------|--------------|
+| Autoregressive  | -   | -          | ✅           |
+| Autoencoding    | ✅  | ×          | ×            |
+| Encoder-Decoder | -   | ✅         | -            |
+| GLM             | ✅  | ✅         | ✅           |

GLM's main features include:

- Task 1: Some spans of the text are masked (following the autoencoding idea). The spans are randomly shuffled and then predicted autoregressively; the masked spans cover 15% of the original text (see the sketch below).
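
To make Task 1 concrete, here is a minimal, illustrative sketch of GLM-style span corruption in plain Python. It is not FlagAI's implementation; the `[MASK]` sentinel, the span-length cap, and the 15% budget are assumptions taken from the description above:

```python
import random

MASK = "[MASK]"  # placeholder sentinel; the real tokenizer has its own mask ids

def glm_span_corruption(tokens, mask_ratio=0.15, max_span_len=3, max_tries=100):
    """Mask ~mask_ratio of tokens as contiguous spans, then return the
    corrupted sequence plus the spans in a shuffled prediction order."""
    budget = max(1, int(len(tokens) * mask_ratio))
    starts = {}   # span start index -> masked tokens
    used = set()  # token indices already inside a span
    tries = 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = random.randint(1, min(max_span_len, budget))
        start = random.randrange(0, len(tokens) - length + 1)
        span = range(start, start + length)
        if used.intersection(span):
            continue  # spans must not overlap
        used.update(span)
        starts[start] = tokens[start:start + length]
        budget -= length
    corrupted, i = [], 0
    while i < len(tokens):  # Part A: each span collapses to one [MASK]
        if i in starts:
            corrupted.append(MASK)
            i += len(starts[i])
        else:
            corrupted.append(tokens[i])
            i += 1
    targets = list(starts.values())
    random.shuffle(targets)  # Part B: spans are predicted autoregressively in shuffled order
    return corrupted, targets

x, y = glm_span_corruption("the quick brown fox jumps over the lazy dog".split())
print(x)  # e.g. ['the', '[MASK]', 'brown', ...]
print(y)  # e.g. [['quick']]
```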
@@ -18,18 +23,23 @@ GLM's main features include:

## GLM's performance

-1. With multi-task pre-training, GLM-Doc and GLM-Sent perform slightly worse than GLM-Large, but still outperform BERT-Large and UniLM-Large.

+### [SuperGLUE](https://super.gluebenchmark.com)
+Results of single-model, single-task fine-tuning on the `dev` set; more results are [here](https://github.com/THUDM/GLM).

-2. Among the multi-task models, GLM-Sent outperforms GLM-Doc by 1.1% on average. Increasing GLM-Doc's parameters to 410M (1.25× BERT-Large) yields better performance than GLM-Large, and a GLM with 515M parameters (1.5× BERT-Large) performs better still.
+| Model | COPA | WSC | RTE | WiC | CB | MultiRC | BoolQ | ReCoRD |
+| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
+| GLM-10b-ch | 98.0 | 95.2 | 93.1 | 75.7 | 98.7/98.2 | 88.1/63.3 | 88.7 | 94.4/94.0 |
+| [RoBERTa-Large](https://github.com/pytorch/fairseq/tree/master/examples/roberta) | 94.0 | 91.3 | 86.6 | 75.6 | 98.2/- | 85.7/- | 86.9 | 89.5/89.0 |
+| [DeBERTa-XXLarge-v2](https://github.com/microsoft/DeBERTa/tree/master/experiments/superglue) | 97.0 | - | 93.5 | - | - | 87.8/63.6 | 88.3 | 94.1/93.7 |

-<div align=center><img src="img/glm_results2.png"></div>
+### [CLUE](https://www.cluebenchmarks.com)
+Results of single-model, single-task fine-tuning on the CLUE datasets (evaluation is still in progress; only some tasks are listed). To use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

-1. GLM-XXLarge achieves an average score of 79.297, with significant gains on many tasks. On the 3 selected general tasks plus 2 business evaluation tasks, the average gain is 2.47pp.

-2. On the CLUE 1.0 tasks, excluding CMRC, the average gain is 1.56pp, with clear improvements on the C3 and OCNLI datasets (+9.9pp and +2.84pp).

-<div align=center><img src="img/glm_performance.png"></div>
+| Model | AFQMC | TNEWS1.0 | IFLYTEK | OCNLI_50K | CSL | CMRC2018 | CHID1.0 | C3 1.0 |
+|:--------------:|:------:|:--------:|:-------:|:---------:|:------:|:--------:|:-------:|:------:|
+| RoBERTa XLarge | 75.835 | 68.75 | 62.654 | 82.333 | 83.433 | 80.5 | 86.57 | 77.03 |
+| GLM-10b-ch | 75.42 | 69.94 | 62.15 | 85 | 86.17 | 70 | 87.009 | 88.335 |

## GLM pre-trained models supported by FlagAI
See [Tutorial 5: Build models quickly with the AutoLoader tool](/doc_zh/TUTORIAL_5_INSTRUCTIONS_FOR_AutoLoader.md).
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_11_GLM_BLANK_FILLING_QA.md
@@ -34,7 +34,7 @@ GLM fine-tunes on downstream tasks and recasts them as blank-filling generation
This objective aims at generating long text. For example (a Chinese prompt asking "Does beer harm the stomach?", followed by the model's generated answer): ``[CLS]问题:啤酒伤胃吗?回答:[gMASK]<|startofpiece|>谢邀。 我是啤酒爱好者,但是我不喝酒。 我以前也说过,喝酒伤身,啤酒伤胃,伤肠道。 现在我也知道了啤酒伤人的很多细节,我就不瞎几把的说,大家看图片就知道了。 <n><n>其实啤酒伤身这个说法只是表面而已。 <n><n>啤酒中含有少量的碳酸和酒精,碳酸和酒精是成酸性物质,而乙醇是脂溶性的,酒精在胃里能够被分解,生成乙醇和二氧化碳,在体内是水和二氧化碳,两种物质会迅速发生中和反应,结果导致人体出现头痛、呕吐、胸痛、浑身发热等现象,这就是所谓喝大了,喝多了。 <n><n> 啤酒的含糖量在15%左右,喝多了也是伤身的,啤酒含糖量较高的主要成分是水分,而水分的体积比酒精大,所以酒精进入人体,与水相遇,就会产生大量气体,二氧化碳、水、一氧化碳等刺激人体,造成人体大量出汗,使体内温度升高,``


-For example, GLM completes the question-answering task as an autoregressive blank-infilling task:
+For example, we use `GLM-large-ch` to complete the question-answering task as an autoregressive blank-infilling task; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001):
```python
import torch
from flagai.model.glm_model import GLMModel
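# --- illustrative continuation: the rest of the example is collapsed in this
# diff view. The AutoLoader/Predictor calls below are assumptions based on
# FlagAI's usual pattern; consult the full tutorial for the real code. ---
from flagai.auto_model.auto_loader import AutoLoader
from flagai.model.predictor.predictor import Predictor

loader = AutoLoader(task_name="lm", model_name="GLM-large-ch")  # hypothetical arguments
model = loader.get_model()
tokenizer = loader.get_tokenizer()
predictor = Predictor(model, tokenizer)
text = "问题:啤酒伤胃吗?回答:[gMASK]"  # the blank-filling prompt from above
print(predictor.predict_generate_randomsample(text))  # assumed Predictor API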
```
2 changes: 1 addition & 1 deletion doc_zh/TUTORIAL_12_GLM_EXAMPLE_TITLE_GENERATION.md
@@ -1,7 +1,7 @@
# GLM example: title generation

## Background
-The title-generation task takes a passage of text as input, and the model outputs a corresponding title.
+The title-generation task takes a passage of text as input, and the model outputs a corresponding title. `GLM-large-ch` is used as the example here; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).

![](./img/bert_title_generation_model.png)

1 change: 1 addition & 0 deletions doc_zh/TUTORIAL_13_GLM_EXAMPLE_PEOTRY_GENERATION.md
@@ -25,6 +25,7 @@
cd ./examples/glm_poetry_generation
python ./train.py
```
+`GLM-large-ch` is used as the example here; to use `GLM-10b-ch`, click [here](https://model.baai.ac.cn/model-detail/100001).
### 1. Prepare the training data
1) Define a file-reading function that reads the data from file and returns the src and tgt lists:
```python
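# Illustrative sketch only: the repository's actual implementation is collapsed
# in this diff view, and the file paths below are made-up examples.
def read_file(src_path="data/train.src", tgt_path="data/train.tgt"):
    with open(src_path, encoding="utf-8") as f:
        src = [line.strip() for line in f if line.strip()]
    with open(tgt_path, encoding="utf-8") as f:
        tgt = [line.strip() for line in f if line.strip()]
    assert len(src) == len(tgt), "src and tgt must stay aligned"
    return src, tgt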
```
22 changes: 11 additions & 11 deletions doc_zh/TUTORIAL_14_HUGGINGFACE_T5.md
@@ -33,21 +33,21 @@ trainer = MyTrainer(
```python
trainer = MyTrainer(
batch_size=4,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=False)

# using huggingface transformers to get tokenizer and models
-model_name = 't5-3b'
+model_name = 't5-11b'
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

print("loading model & tokenizer is done!")
src_dir = 'train_inputs.txt'
tgt_dir = 'train_targets.txt'
model_dir = "./t5-3b" # 模型位置
model_dir = "./t5-11b" # 模型位置
maxlen = 1024


```

@@ -139,7 +139,7 @@ trainer.train(model,

## Tricks for speeding up training
-We can hardly run t5-3b on a 32 GB V100, so we need some tricks to reduce GPU memory usage.
+We can hardly run t5-11b on a 32 GB V100, so we need some tricks to reduce GPU memory usage.
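
A quick back-of-the-envelope check (illustrative Python, parameter count rounded) shows why the raw model cannot fit:

```python
params = 11e9                      # t5-11b has roughly 11 billion parameters
fp32_gib = params * 4 / 2**30      # 4 bytes per fp32 weight
fp16_gib = params * 2 / 2**30      # 2 bytes per fp16 weight
print(f"{fp32_gib:.0f} GiB fp32")  # ~41 GiB: over a 32 GB V100 before any activations
print(f"{fp16_gib:.0f} GiB fp16")  # ~20 GiB: fits only with the further tricks below
```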
### Step 1: fp16
Cast the model parameters to `fp16`:
```python
Expand All @@ -149,23 +149,23 @@ trainer = MyTrainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
fp16=True) # change to `True`
```
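
For intuition, here is what fp16 mixed-precision training looks like in plain PyTorch. This is a hedged sketch independent of FlagAI's trainer (which only needs the `fp16=True` flag above); `model`, `optimizer`, and `loader` are assumed to be defined:

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # rescales the loss so fp16 gradients don't underflow
for batch in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = model(**batch).loss
    scaler.scale(loss).backward()      # backward on the scaled loss
    scaler.step(optimizer)             # unscales gradients, then steps
    scaler.update()
```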
### Step 2: gradient checkpointing
-Intermediate results are not stored during the forward pass, which lets us run t5-3b with `batch size`=1.
-Now we can train/finetune a t5-3b with `gradient_accumulation_steps`.
+Intermediate results are not stored during the forward pass, which lets us run t5-11b with `batch size`=1.
+Now we can train/finetune a t5-11b with `gradient_accumulation_steps`; a sketch of how checkpointing is switched on follows.
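
A minimal sketch of enabling checkpointing on the Hugging Face model itself; `gradient_checkpointing_enable` exists in recent `transformers` releases, but how FlagAI's trainer toggles it may differ, so treat this as an assumption:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('t5-11b')
model.gradient_checkpointing_enable()  # recompute activations during the backward pass
model.config.use_cache = False         # the decoder cache is incompatible with checkpointing
```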
```python
trainer = MyTrainer(
env_type='pytorch',
epochs=1,
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
pytorch_device='cuda:0',
load_dir=None,
lr=1e-4,
Expand All @@ -181,7 +181,7 @@ trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```

@@ -205,7 +205,7 @@ trainer = Trainer(
```python
trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```

@@ -230,7 +230,7 @@ trainer = Trainer(
```python
trainer = Trainer(
batch_size=1,
eval_interval=10,
log_interval=10,
-experiment_name='t5-3b',
+experiment_name='t5-11b',
load_dir=None,
lr=1e-4,
fp16=True
```