debug BERT-pytorch\bert_pytorch\model\embedding\position.py #101

Open
wants to merge 27 commits into
base: master
ed3bf25
debug BERT-pytorch\bert_pytorch\model\embedding\position.py
wanghesong2019 Jan 20, 2023
3599ade
Add comments on self.itos and self.stoi to the TorchVocab class in bert_pytorch/dataset/vocab.py
wanghesong2019 Jan 28, 2023
9d54ff0
debug bert_pytorch/dataset/vocab.py WordVocab line 130 line.repleace()
wanghesong2019 Jan 28, 2023
6a2cae9
debug bert_pytorch/dataset/vocab.py WordVocab line 130 line.repleace()
wanghesong2019 Jan 28, 2023
3a873cd
debug bert_pytorch/dataset/vocab.py WordVocab line 130 line.repleace()
wanghesong2019 Jan 28, 2023
83b6859
Merge branch 'master' of https://github.com/wanghesong2019/BERT-pytorch
wanghesong2019 Jan 28, 2023
919adf1
1. Add comments to dataset.py; 2. add data files; 3. revert line 130 of vocab.py to the original version
wanghesong2019 Jan 29, 2023
2216e03
Create corpus.txt
wanghesong2019 Jan 29, 2023
3189be7
Add comments to the 3 kinds of embedding under bert_pytorch/model/embedding
wanghesong2019 Jan 31, 2023
6320e35
Merge branch 'master' of https://github.com/wanghesong2019/BERT-pytorch
wanghesong2019 Jan 31, 2023
8e86425
add comment for embedding
wanghesong2019 Jan 31, 2023
2cf41cf
Comment on the role of mask in single.py
wanghesong2019 Jan 31, 2023
7862e1c
Create tyr.jpg
wanghesong2019 Feb 2, 2023
97c9ac6
upload images
wanghesong2019 Feb 2, 2023
feac7c2
Delete tyr.jpg
wanghesong2019 Feb 2, 2023
3187f92
Rename README.md to README_back.md
wanghesong2019 Feb 2, 2023
0b4c94d
Create README.md
wanghesong2019 Feb 2, 2023
32f3c81
Update README.md
wanghesong2019 Feb 2, 2023
9c13e0f
upload vocab file
wanghesong2019 Feb 2, 2023
0eea299
Comments on the sentence-pair generation logic in dataset.py
wanghesong2019 Feb 5, 2023
c5cca98
Comments explaining the BERTDataset magic method __getitem__(self, item)
wanghesong2019 Feb 6, 2023
9c7a2fd
Revise the comments in bert_pytorch/dataset/dataset.py
wanghesong2019 Feb 10, 2023
33cf718
Add comments to bert_pytorch/trainer/pretrain.py
wanghesong2019 Feb 10, 2023
e9ffb2f
In bert_pytorch/model/language_model.py, for NextSentencePrediction and MaskedL…
wanghesong2019 Feb 10, 2023
7b55204
Update README.md
wanghesong2019 Feb 11, 2023
38a931c
Update README.md
wanghesong2019 Feb 11, 2023
cdf357c
Update README.md
wanghesong2019 Feb 13, 2023
1 change: 1 addition & 0 deletions BERT-pytorch
Submodule BERT-pytorch added at 919adf
130 changes: 15 additions & 115 deletions README.md
@@ -1,120 +1,20 @@
# BERT-pytorch
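# BERT-pytorch study notes
At 2 a.m. in mid-February 2023, I am wrapping up my study of the BERT-pytorch project. This is the first time since registering a GitHub account that I have studied an open-source project fairly seriously and systematically. It began just before the winter break and has continued until now, and I stuck with it. Before leaving, a few words as a keepsake.
## 1. Experience
- Working from the code together with the BERT paper, I got a basic grasp of what BERT really is: the dataset module (vocabulary construction, random token replacement, and random sampling of sentence pairs), the modeling module (an encoder architecture built on the Transformer encoder), and the trainer module (loss computation and gradient descent);
- While studying the code, I picked up basic git operations, GitHub usage habits (my own comments have all been merged into the master branch), and common PyTorch API usage;
- Studying an open-source project works best alongside the paper, which ties theory to practice; better still is to feed in data and get it running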

[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE)
![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg)
[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers)
[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch)
[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest)
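## 2. Lessons learned
- I read the code line by line and set up the bert-pytorch environment, but I never ran it on data to inspect the results, so I gained no hyperparameter-tuning experience
- I set no milestone schedule for studying the project, so it dragged on
- For future open-source project study, I must work with data and get the code running
- I had meant to write a proper README, but in the end I lost heart again.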

Pytorch implementation of Google AI's 2018 BERT, with simple annotation
---
# Notes on understanding BERT

> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
> Paper URL : https://arxiv.org/abs/1810.04805
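## 20230213: Today I discussed the embedding part of BERT with Minghao, a colleague at Dahua: how is the step from a token to its initialized embedding vector implemented? He believes the initialized embedding vectors take part in training. In the evening I looked at this project again and found that its embedding module only handles the random initialization of tokens; the data then enters the attention module, where it is first linearly projected into query, key, and value before the attention computation begins. From this I concluded that the embedding module is merely part of data preprocessing and does not take part in training;

Another question is why the embedding can be randomly initialized. I think it is mainly because token indices are themselves arbitrary (first come, first served). In other words, neither a token's index nor its initial embedding vector carries any semantic information; it is enough that the key-value relationship is fixed. That makes the argument consistent: if the initialized embedding were a model parameter and took part in training, it would break the determinism of the key-value relationship;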
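
Whether an embedding table is trained can be checked rather than argued. As a fact about PyTorch (not something shown in this diff), `nn.Embedding` stores its table as a learnable parameter by default, so randomly initialized token embeddings are normally updated by backpropagation, and the conclusion in the note above may need revisiting. The pure-Python sketch below is only a toy illustration under stated assumptions: the table size, the squared-norm loss, and the names `make_table`, `lookup`, and `sgd_step` are all hypothetical. It shows that a gradient step changes the values of the rows that were looked up while the token-to-row mapping stays fixed.

```python
import random


def make_table(vocab_size, dim, seed=0):
    """Randomly initialize a toy embedding table: one row of floats per token id."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def lookup(table, token_ids):
    """Embedding lookup is plain row selection by token id."""
    return [table[i] for i in token_ids]


def sgd_step(table, token_ids, lr=0.1):
    """One SGD step on the toy loss L = sum of squared entries of the looked-up rows.

    The gradient of L with respect to a looked-up row is 2 * row (accumulated if
    a token id occurs more than once); rows that were not looked up get no gradient.
    """
    grads = {}
    for i in token_ids:
        row_grad = [2.0 * v for v in table[i]]
        if i in grads:
            grads[i] = [g + r for g, r in zip(grads[i], row_grad)]
        else:
            grads[i] = row_grad
    for i, grad in grads.items():
        table[i] = [v - lr * g for v, g in zip(table[i], grad)]


table = make_table(vocab_size=5, dim=3)
before = [row[:] for row in table]
sgd_step(table, token_ids=[1, 3])
changed = [i for i in range(len(table)) if table[i] != before[i]]
print(changed)  # ids of the rows whose values were updated
```

Token id 1 still selects row 1 after the update; only the values stored in that row have moved. In this toy setting, training the table and keeping a fixed index-to-row relationship are compatible.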

## Introduction

Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA),
including outperform the human F1 score on SQuAD v1.1 QA task.
This paper proved that Transformer(self-attention) based encoder can be powerfully used as
alternative of previous language model with proper language model training method.
And more importantly, they showed us that this pre-trained language model can be transfer
into any NLP task without making task specific model architecture.

This amazing result would be record in NLP history,
and I expect many further papers about BERT will be published very soon.

This repo is implementation of BERT. Code is very simple and easy to understand fastly.
Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

Currently this project is working on progress. And the code is not verified yet.

## Installation
```
pip install bert-pytorch
```

## Quickstart

**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator**

### 0. Prepare your corpus
```
Welcome to the \t the jungle\n
I can stay \t here all night\n
```

or tokenized corpus (tokenization is not in package)
```
Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n
```


### 1. Building vocab based on your corpus
```shell
bert-vocab -c data/corpus.small -o data/vocab.small
```

### 2. Train your own BERT model
```shell
bert -c data/corpus.small -v data/vocab.small -o output/bert.model
```

## Language Model Pre-training

In the paper, authors shows the new language model training methods,
which are "masked language model" and "predict next sentence".


### Masked Language Model

> Original Paper : 3.3.1 Task #1: Masked LM

```
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
```

#### Rules:
Randomly 15% of input token will be changed into something, based on under sub-rules

1. Randomly 80% of tokens, gonna be a `[MASK]` token
2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word)
3. Randomly 10% of tokens, will be remain as same. But need to be predicted.

### Predict Next Sentence

> Original Paper : 3.3.2 Task #2: Next Sentence Prediction

```
Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : Is Next

Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
```

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is
not directly captured by language modeling

#### Rules:

1. Randomly 50% of next sentence, gonna be continuous sentence.
2. Randomly 50% of next sentence, gonna be unrelated sentence.


## Author
Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)

## License

This project following Apache 2.0 License as written in LICENSE file

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer)
120 changes: 120 additions & 0 deletions README_back.md
@@ -0,0 +1,120 @@
# BERT-pytorch

[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE)
![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg)
[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers)
[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch)
[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest)

Pytorch implementation of Google AI's 2018 BERT, with simple annotation

> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
> Paper URL : https://arxiv.org/abs/1810.04805


## Introduction

Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA),
including outperform the human F1 score on SQuAD v1.1 QA task.
This paper proved that Transformer(self-attention) based encoder can be powerfully used as
alternative of previous language model with proper language model training method.
And more importantly, they showed us that this pre-trained language model can be transfer
into any NLP task without making task specific model architecture.

This amazing result would be record in NLP history,
and I expect many further papers about BERT will be published very soon.

This repo is implementation of BERT. Code is very simple and easy to understand fastly.
Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

Currently this project is working on progress. And the code is not verified yet.

## Installation
```
pip install bert-pytorch
```

## Quickstart

**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator**

### 0. Prepare your corpus
```
Welcome to the \t the jungle\n
I can stay \t here all night\n
```

or tokenized corpus (tokenization is not in package)
```
Wel_ _come _to _the \t _the _jungle\n
_I _can _stay \t _here _all _night\n
```


### 1. Building vocab based on your corpus
```shell
bert-vocab -c data/corpus.small -o data/vocab.small
```

### 2. Train your own BERT model
```shell
bert -c data/corpus.small -v data/vocab.small -o output/bert.model
```

## Language Model Pre-training

In the paper, authors shows the new language model training methods,
which are "masked language model" and "predict next sentence".


### Masked Language Model

> Original Paper : 3.3.1 Task #1: Masked LM

```
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
```

#### Rules:
Randomly 15% of input token will be changed into something, based on under sub-rules

1. Randomly 80% of tokens, gonna be a `[MASK]` token
2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word)
3. Randomly 10% of tokens, will be remain as same. But need to be predicted.
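
The rules above can be sketched in a few lines of plain Python. This is an illustration, not the package's code: the function name `mask_tokens`, the toy vocabulary, and the use of `None` as the label for unselected positions are assumptions made for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Select roughly 15% of positions and apply the 80/10/10 rule to them.

    Returns (masked_tokens, labels). labels[i] is the original token at every
    selected position (it must be predicted) and None everywhere else.
    """
    masked, labels = [], []
    for token in tokens:
        if rng.random() < select_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random word
            else:
                masked.append(token)              # 10%: keep, but still predict
        else:
            labels.append(None)
            masked.append(token)
    return masked, labels


rng = random.Random(0)
vocab = ["the", "man", "went", "to", "store", "with", "his", "dog"]
tokens = "the man went to the store with his dog".split()
masked, labels = mask_tokens(tokens, vocab, rng)
print(masked)
print(labels)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a real data pipeline would more likely work on vocabulary indices than on strings.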

### Predict Next Sentence

> Original Paper : 3.3.2 Task #2: Next Sentence Prediction

```
Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]
Label : Is Next

Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
```

"Is this sentence can be continuously connected?"

understanding the relationship, between two text sentences, which is
not directly captured by language modeling

#### Rules:

1. Randomly 50% of next sentence, gonna be continuous sentence.
2. Randomly 50% of next sentence, gonna be unrelated sentence.
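
A minimal sketch of this 50/50 sampling, assuming the corpus is already held in memory as a list of `(sentence_a, sentence_b)` pairs. The function name `sample_pair` and the tiny corpus are hypothetical, and the random draw here may occasionally pick the true next sentence, which a more careful version would exclude.

```python
import random


def sample_pair(pairs, index, rng):
    """Return (sentence_a, sentence_b, is_next) for the pair stored at `index`.

    Half of the time the real second sentence is kept (is_next = 1); otherwise
    the second sentence of a randomly chosen line is used (is_next = 0).
    """
    sentence_a, sentence_b = pairs[index]
    if rng.random() < 0.5:
        return sentence_a, sentence_b, 1
    _, random_b = pairs[rng.randrange(len(pairs))]
    return sentence_a, random_b, 0


pairs = [
    ("Welcome to the", "the jungle"),
    ("I can stay", "here all night"),
]
rng = random.Random(0)
sentence_a, sentence_b, is_next = sample_pair(pairs, 0, rng)
print(sentence_a, "|", sentence_b, "|", is_next)
```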


## Author
Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)

## License

This project following Apache 2.0 License as written in LICENSE file

Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors

Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer)
48 changes: 28 additions & 20 deletions bert_pytorch/dataset/dataset.py
@@ -15,32 +15,36 @@ def __init__(self, corpus_path, vocab, seq_len, encoding="utf-8", corpus_lines=N
self.encoding = encoding

with open(corpus_path, "r", encoding=encoding) as f:
if self.corpus_lines is None and not on_memory:
# After opening the corpus, handle the following 2 cases:
if self.corpus_lines is None and not on_memory: # if the corpus is not loaded directly into memory, first count its lines
for _ in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines):
self.corpus_lines += 1

if on_memory:
self.lines = [line[:-1].split("\t")
for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
self.corpus_lines = len(self.lines)
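# load the whole dataset into memory; the corpus is parsed into the list attribute self.lines
self.lines = [line[:-1].split('\t')
for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)] # split each corpus line on the \t character into 2 sentences
self.corpus_lines = len(self.lines) # get the number of corpus lines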

if not on_memory:
if not on_memory:
self.file = open(corpus_path, "r", encoding=encoding)
self.random_file = open(corpus_path, "r", encoding=encoding)

# offset the starting point for drawing negative samples; what is this for?
for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
self.random_file.__next__()

def __len__(self):
return self.corpus_lines

def __getitem__(self, item):
t1, t2, is_next_label = self.random_sent(item)
t1_random, t1_label = self.random_word(t1)
t2_random, t2_label = self.random_word(t2)
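# definition of the magic method __getitem__, which lets an instance be indexed by item like a list
# every sample returned by a BERTDataset instance goes through the Next Sentence and Masked LM operations
t1, t2, is_next_label = self.random_sent(item) # Next Sentence operation
t1_random, t1_label = self.random_word(t1) # Masked LM operation; t1_label holds the class index for each masked position of t1, see the initialization of the Vocab class in vocab.py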
t2_random, t2_label = self.random_word(t2)

# [CLS] tag = SOS tag, [SEP] tag = EOS tag
t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index] # Figure 2 of the paper
t2 = t2_random + [self.vocab.eos_index]

t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
@@ -50,7 +54,7 @@ def __getitem__(self, item):
bert_input = (t1 + t2)[:self.seq_len]
bert_label = (t1_label + t2_label)[:self.seq_len]

padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]
padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))] # the difference between the maximum length and the actual length is the number of positions to pad
bert_input.extend(padding), bert_label.extend(padding), segment_label.extend(padding)

output = {"bert_input": bert_input,
@@ -61,12 +65,15 @@ def __getitem__(self, item):
return {key: torch.tensor(value) for key, value in output.items()}

def random_word(self, sentence):
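# convert the sentence into the indices of its tokens in the token-to-index dictionary
tokens = sentence.split()
output_label = []
output_label = [] # this list holds only 0 and nonzero numbers: 0 means the token at that position is among the 85% that were not selected; a nonzero number is the vocab index the token had before masking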

for i, token in enumerate(tokens):
prob = random.random()
# BERT randomly selects 15% of the tokens for masking
if prob < 0.15:
# for the randomly selected 15% of tokens, make a second random choice
prob /= 0.15

# 80% randomly change token to mask token
@@ -77,26 +84,27 @@ def random_word(self, sentence):
elif prob < 0.9:
tokens[i] = random.randrange(len(self.vocab))

# 10% randomly change token to current token
# 10% doesn't change current token
else:
tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

else:
tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
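tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index) # tokens that are not masked are filled with their actual index in vocab
# specifically, self.vocab.unk_index=1; the line above amounts to a lookup in the stoi token-to-index dictionary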
output_label.append(0)

return tokens, output_label

def random_sent(self, index):
t1, t2 = self.get_corpus_line(index)

# output_text, label(isNotNext:0, isNext:1)
t1, t2 = self.get_corpus_line(index)
# for sentence A and B, 50% of the time B is the actual next sentence that follows A(labeled as IsNext)
# and for 50% of the time it is a random sentence from the corpus(labeled as NotNext)
if random.random() > 0.5:
return t1, t2, 1
return t1, t2, 1 # 1 means isNext
else:
return t1, self.get_random_line(), 0
return t1, self.get_random_line(), 0 # 0 means isNotNext

def get_corpus_line(self, item):
if self.on_memory:
@@ -122,4 +130,4 @@ def get_random_line(self):
for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
self.random_file.__next__()
line = self.random_file.__next__()
return line[:-1].split("\t")[1]
return line[:-1].split("\t")[1]
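
The `random.randint(...)` calls shown in this diff pass a single argument, whereas the standard library's `random.randint(a, b)` requires both bounds, so those lines would raise a `TypeError` if executed. Below is a minimal sketch of what the skip appears to intend, assuming a lower bound of 0 and the same cap of 1000. The helper name `skip_random_lines` and the in-memory `io.StringIO` corpus are hypothetical stand-ins for the real file handle.

```python
import io
import random


def skip_random_lines(handle, corpus_lines, rng, cap=1000):
    """Advance `handle` by a random number of lines in [0, min(corpus_lines, cap)].

    Returns how many lines were actually skipped; stops early at end of file.
    """
    limit = min(corpus_lines, cap)
    skipped = 0
    for _ in range(rng.randint(0, limit)):
        if next(handle, None) is None:  # end of file reached before the target
            break
        skipped += 1
    return skipped


corpus = io.StringIO("a\tb\nc\td\ne\tf\n")
rng = random.Random(0)
skipped = skip_random_lines(corpus, corpus_lines=3, rng=rng)
remaining = corpus.readlines()
print(skipped, len(remaining))
```

Starting the random reader at a random offset is one plausible answer to the question raised in the added comment: it keeps the negative-sample reader from walking the corpus in lockstep with the main reader.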