BERT example in torchtext #767
Conversation
This commit adds a `torchscript` extension `_torchtext.so`, which contains a simple interface to `SentencePiece`.

- SentencePiece `v0.1.86` is used.
- `libsentencepiece.a` is built right before `_torchtext.so` is compiled. The logic for triggering this build from `setuptools` can be found under `build_tools/setup_helpers`.
- `_torchtext.so` provides an interface to train a SentencePiece model and to load a model from file.

Breaking change: previously, `torchtext.data.functional.load_sp_model` returned a `sentencepiece.SentencePieceProcessor` object, which supported the following methods in addition to `__len__` and `__getitem__`:

```
$ grep '$self->' third_party/sentencepiece/python/sentencepiece.i
return $self->Load(filename);
return $self->LoadFromSerializedProto(filename);
return $self->SetEncodeExtraOptions(extra_option);
return $self->SetDecodeExtraOptions(extra_option);
return $self->SetVocabulary(valid_vocab);
return $self->ResetVocabulary();
return $self->LoadVocabulary(filename, threshold);
return $self->EncodeAsPieces(input);
return $self->EncodeAsIds(input);
return $self->NBestEncodeAsPieces(input, nbest_size);
return $self->NBestEncodeAsIds(input, nbest_size);
return $self->SampleEncodeAsPieces(input, nbest_size, alpha);
return $self->SampleEncodeAsIds(input, nbest_size, alpha);
return $self->DecodePieces(input);
return $self->DecodeIds(input);
return $self->EncodeAsSerializedProto(input);
return $self->SampleEncodeAsSerializedProto(input, nbest_size, alpha);
return $self->NBestEncodeAsSerializedProto(input, nbest_size);
return $self->DecodePiecesAsSerializedProto(pieces);
return $self->DecodeIdsAsSerializedProto(ids);
return $self->GetPieceSize();
return $self->PieceToId(piece);
return $self->IdToPiece(id);
return $self->GetScore(id);
return $self->IsUnused(id);
return $self->IsControl(id);
return $self->IsUnused(id);
return $self->GetPieceSize();
return $self->PieceToId(key);
```

The new C++ extension provides the following methods:

```
Encode(input)
EncodeAsIds(input)
EncodeAsPieces(input)
```
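Code written against the old `SentencePieceProcessor` surface will break on the methods that were dropped. A small defensive helper (hypothetical — `check_sp_model` and `NEW_API` are illustrative names, not part of this PR) can make such code fail fast with a clear message:

```python
# The methods exposed by the new C++ extension, per the PR description.
NEW_API = {"Encode", "EncodeAsIds", "EncodeAsPieces"}


def check_sp_model(sp_model, needed):
    """Hypothetical migration helper: verify that a loaded model object
    still exposes every method the calling code relies on, so code written
    against the old SentencePieceProcessor fails early and clearly."""
    missing = [name for name in needed if not hasattr(sp_model, name)]
    if missing:
        raise AttributeError(
            "load_sp_model now returns the C++ extension object; "
            "missing methods: " + ", ".join(missing))
    return sp_model
```

For example, a pipeline that only encodes text would pass `["EncodeAsIds"]`, while one that also called the removed `DecodeIds` would get an immediate `AttributeError` instead of a failure deep inside a training loop.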
```
@@ -0,0 +1,603 @@
import torch
```
I guess we could start incrementally porting this over into the experimental folder, which would also help with cleanup. These datasets seem useful in general.
I second this. It would also help make this pull request shorter.
```
return 100.0 * sum(exact_scores) / len(exact_scores)


def compute_qa_f1(ans_pred_tokens_samples):
```
Can this be turned into a generic F1 metric that we can throw into torchtext?
The `sample_f1` func can be landed in torchtext.
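A generic token-overlap F1 along these lines might look like the sketch below. This is an illustrative implementation of the standard SQuAD-style metric, not the PR's code; the input layout (a list of `(answers_tokens, pred_tokens)` pairs, taking the best F1 over the gold answers for each sample) is an assumption:

```python
from collections import Counter


def sample_f1(pred_tokens, ans_tokens):
    """Token-overlap F1 for a single QA sample (SQuAD-style)."""
    common = Counter(pred_tokens) & Counter(ans_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_qa_f1(ans_pred_tokens_samples):
    """Mean per-sample F1 over a dataset, scaled to 0-100.
    Each element is (list_of_gold_answer_token_lists, predicted_tokens);
    the best-matching gold answer is scored for each sample."""
    scores = [max(sample_f1(pred, ans) for ans in answers)
              for answers, pred in ans_pred_tokens_samples]
    return 100.0 * sum(scores) / len(scores)
```

Only `sample_f1` is QA-agnostic; `compute_qa_f1` is a thin aggregation wrapper, which matches the suggestion that the per-sample function is the part worth landing in torchtext.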
```
elif args.dataset == 'BookCorpus':
    train_dataset, test_dataset, valid_dataset = BookCorpus(vocab)

train_data = process_raw_data(train_dataset.data, args)
```
I'd move everything from this point to the end of the function into a separate `def train(...)` function.
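The suggested extraction could be sketched as follows. The signature and names are purely illustrative (the real step would run the model, loss, and optimizer); the point is the shape: the in-line epoch code becomes a function that iterates over batches and reports a mean loss:

```python
def train(model_step, batches):
    """Illustrative extraction of an in-line training loop.

    model_step: callable that runs one optimisation step on a batch
    and returns its loss; batches: an iterable of batches.
    Returns the mean loss over the epoch.
    """
    total_loss = 0.0
    count = 0
    for batch in batches:
        total_loss += model_step(batch)
        count += 1
    return total_loss / max(count, 1)
```

Factoring the loop out this way also makes it reusable across the three tasks in the example, which ties into the deduplication comments elsewhere in this review.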
```
start_time = time.time()


def run_main(args, rank=None):
```
It seems like some of this code used here could be shared with the other run_main functions?
You are right. However, since those three tasks are quite different, the `run_main` func is set up explicitly here, instead of passing a lot of arguments.
Looks pretty good! I think the next steps could include some code deduplication. Looks like you could factor out some more stuff already and also use more stuff from torchtext.
It would be nice if this were divided into at least 3 PRs: one for each of the two tasks, and one for some of the abstractions that could land straight into torchtext (e.g. datasets).
Thanks for the feedback. Yes, there are currently two PRs to land the datasets into torchtext and a separate PR to merge the model. Once those are done, this PR will contain only the task-related work.
```
def setup(rank, world_size, seed):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
```
Nit: this is specific to the particular distributed implementation.
It's set up for SLURM.
Codecov Report
```
@@           Coverage Diff           @@
##           master     #767   +/-  ##
=======================================
  Coverage   77.44%   77.44%
=======================================
  Files          44       44
  Lines        3055     3055
=======================================
  Hits         2366     2366
  Misses        689      689
```

Continue to review the full report at Codecov.
Force-pushed from c1b12f7 to b61e3d5
Force-pushed from e08705b to 441edff
examples/BERT/data.py (Outdated)
```
import random


class LanguageModelingDataset(torch.utils.data.Dataset):
```
Switched to the one in `experimental/datasets`.
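For reference, the interface such a dataset has to satisfy is small. The sketch below shows a minimal map-style language-modeling dataset, duck-typed without the `torch` dependency for illustration (the real one subclasses `torch.utils.data.Dataset`, which only requires `__getitem__` and `__len__`):

```python
class LanguageModelingDataset:
    """Minimal map-style dataset sketch: wraps a flat sequence of
    token ids plus the vocab used to produce them. Field names are
    illustrative, not the experimental/datasets implementation."""

    def __init__(self, data, vocab):
        self.data = data    # e.g. a flat list (or tensor) of token ids
        self.vocab = vocab

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)
```

Because a map-style dataset is just indexing plus a length, moving it into `torchtext.experimental.datasets` carries essentially no example-specific baggage.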
```
###################################################################
# Set up dataset for book corpus
###################################################################
def BookCorpus(vocab, tokenizer=get_tokenizer("basic_english"),
```
Maybe it's worth moving this into experimental as well?
I was thinking about this, but the original data for BookCorpus comes from the FAIR cluster.
```
return processed_data


def collate_batch(batch, args, cls_id, sep_id, pad_id):
```
You have `collate_batch` earlier as well. Is there some combination possible here with a utils file?
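A shared version in a utils module might look like the sketch below. This is a guess at the common core, not the PR's code: the helper names and the exact wrapping scheme (prepend `cls_id`, append `sep_id`, then right-pad with `pad_id`) are assumptions for illustration:

```python
def pad_sequences(seqs, pad_id):
    """Right-pad a list of token-id lists to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def collate_batch(batch, cls_id, sep_id, pad_id):
    """Hypothetical shared collate function for a utils module:
    wrap each sequence with the special tokens, then pad the batch
    so every row has equal length."""
    wrapped = [[cls_id] + seq + [sep_id] for seq in batch]
    return pad_sequences(wrapped, pad_id)
```

Task-specific variants could then call this common function and add only their own extras (masking, segment ids, etc.), which is the deduplication the comment asks about.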
Train a BERT model with PyTorch and torchtext, including the masked language modeling and next sentence prediction tasks. Fine-tune the BERT model for the question-answering task.

There are a few things still to do:

- Land the datasets in `torchtext.experimental.datasets`, including SQuAD. Ongoing PRs: "Question answer datasets: SQuAD1 and SQuAD2" #773 and "experimental.dataset WikiText2, WikiText103, PennTreeBank, WMTNewsCrawl" #774.