Merge remote-tracking branch 'huggingface/master'
erenup committed Aug 30, 2019
2 parents 2a2832c + caf1d11 commit 6e1ac34
Showing 50 changed files with 3,281 additions and 136 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -127,4 +127,7 @@ proc_data

# examples
runs
examples/runs
examples/runs

# data
data
10 changes: 6 additions & 4 deletions README.md
@@ -12,7 +12,9 @@ The library currently contains PyTorch implementations, pre-trained model weight
4. **[Transformer-XL](https://github.com/kimiyoung/transformer-xl)** (from Google/CMU) released with the paper [Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/abs/1901.02860) by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
5. **[XLNet](https://github.com/zihangdai/xlnet/)** (from Google/CMU) released with the paper [​XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237) by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
6. **[XLM](https://github.com/facebookresearch/XLM/)** (from Facebook) released together with the paper [Cross-lingual Language Model Pretraining](https://arxiv.org/abs/1901.07291) by Guillaume Lample and Alexis Conneau.
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), a [Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.
7. **[RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta)** (from Facebook), released together with the paper [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
8. **[DistilBERT](https://github.com/huggingface/pytorch-transformers/tree/master/examples/distillation)** (from HuggingFace), released together with the blogpost [Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT](https://medium.com/huggingface/distilbert-8cf3380435b5) by Victor Sanh, Lysandre Debut and Thomas Wolf.

These implementations have been tested on several datasets (see the example scripts) and should match the performances of the original implementations (e.g. ~93 F1 on SQuAD for BERT Whole-Word-Masking, ~88 F1 on RocStories for OpenAI GPT, ~18.3 perplexity on WikiText 103 for Transformer-XL, ~0.916 Pearson R coefficient on STS-B for XLNet). You can find more details on the performances in the Examples section of the [documentation](https://huggingface.co/pytorch-transformers/examples.html).

@@ -76,7 +78,7 @@ import torch
from pytorch_transformers import *

# PyTorch-Transformers has a unified API
# for 6 transformer architectures and 27 pretrained weights.
# for 7 transformer architectures and 30 pretrained weights.
# Model | Tokenizer | Pretrained weights shortcut
MODELS = [(BertModel, BertTokenizer, 'bert-base-uncased'),
(OpenAIGPTModel, OpenAIGPTTokenizer, 'openai-gpt'),
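
The snippet is truncated by the diff view. As a minimal sketch of the unified API (assuming each pretrained shortcut downloads cleanly), every tuple in `MODELS` can be exercised through the same calls:

```python
# Illustrative sketch, not part of the diff: the unified API means every
# (model_class, tokenizer_class, shortcut) tuple works identically.
for model_class, tokenizer_class, shortcut in MODELS:
    tokenizer = tokenizer_class.from_pretrained(shortcut)
    model = model_class.from_pretrained(shortcut)
    input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
    last_hidden_states = model(input_ids)[0]  # first element: the hidden states
```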
@@ -328,7 +330,7 @@ Breaking change in the `from_pretrained()` method:

1. Models are now set in evaluation mode by default when instantiated with the `from_pretrained()` method. To train them don't forget to set them back in training mode (`model.train()`) to activate the dropout modules.

2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be directly passed to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead which can break derived model classes build based on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/pytorch-transformers/pull/866) by forwarding the the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuratoin class attributes.
2. The additional `*input` and `**kwargs` arguments supplied to the `from_pretrained()` method used to be passed directly to the underlying model's class `__init__()` method. They are now used to update the model configuration attribute instead, which can break derived model classes built on the previous `BertForSequenceClassification` examples. We are working on a way to mitigate this breaking change in [#866](https://github.com/huggingface/pytorch-transformers/pull/866) by forwarding to the model `__init__()` method (i) the provided positional arguments and (ii) the keyword arguments which do not match any configuration class attributes.

Also, while not a breaking change, the serialization methods have been standardized and you probably should switch to the new method `save_pretrained(save_directory)` if you were using any other serialization method before.
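
A minimal sketch of both points above (the save directory is illustrative):

```python
from pytorch_transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.train()  # from_pretrained() returns eval mode; re-activate dropout for training

# ... training loop ...

model.save_pretrained('./my_model')  # standardized serialization: weights + config
model = BertForSequenceClassification.from_pretrained('./my_model')  # reload later
```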

@@ -393,8 +395,8 @@ for batch in train_data:
loss = model(batch)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm) # Gradient clipping is not in AdamW anymore (so you can use amp without issue)
scheduler.step()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```
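
For context, a hedged sketch of how the optimizer and scheduler in this loop are constructed (the step counts and learning rate are illustrative, not from the diff):

```python
from pytorch_transformers import AdamW, WarmupLinearSchedule

# correct_bias=False reproduces the old BertAdam behavior.
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=100, t_total=1000)
# The training loop above then calls, in order:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```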

73 changes: 58 additions & 15 deletions docs/source/examples.rst
@@ -12,8 +12,8 @@ Examples
- How to use gradient-accumulation, multi-gpu training, distributed training, optimize on CPU and 16-bits training to train Bert models
* - `Fine-tuning with BERT: running the examples <#fine-tuning-bert-examples>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``extract_classif.py``\ , ``run_bert_classifier.py``\ , ``run_bert_squad.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning with OpenAI GPT, Transformer-XL and GPT-2 <#fine-tuning>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py`` and ``run_gpt2.py``
* - `Fine-tuning with OpenAI GPT, Transformer-XL, GPT-2 as well as BERT and RoBERTa <#fine-tuning>`_
- Running the examples in `examples <https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples>`_\ : ``run_openai_gpt.py``\ , ``run_transfo_xl.py``, ``run_gpt2.py`` and ``run_lm_finetuning.py``
* - `Fine-tuning BERT-large on GPUs <#fine-tuning-bert-large>`_
- How to fine tune ``BERT large``

@@ -68,7 +68,9 @@ GLUE results on dev set
~~~~~~~~~~~~~~~~~~~~~~~

We get the following results on the dev set of GLUE benchmark with an uncased BERT base
model. All experiments were run on a P100 GPU with a batch size of 32.
model (``bert-base-uncased``). All experiments ran on 8 V100 GPUs with a total train batch size of 24. Some of
these tasks have a small dataset and training can lead to high variance in the results between different runs.
We report the median over 5 runs (with different seeds) for each of the metrics.

.. list-table::
:header-rows: 1
@@ -78,31 +80,31 @@ model. All experiments were run on a P100 GPU with a batch size of 32.
- Result
* - CoLA
- Matthew's corr.
- 57.29
- 55.75
* - SST-2
- accuracy
- 93.00
- 92.09
* - MRPC
- F1/accuracy
- 88.85/83.82
- 90.48/86.27
* - STS-B
- Pearson/Spearman corr.
- 89.70/89.37
- 89.03/88.64
* - QQP
- accuracy/F1
- 90.72/87.41
- 90.92/87.72
* - MNLI
- matched acc./mismatched acc.
- 83.95/84.39
- 83.74/84.06
* - QNLI
- accuracy
- 89.04
- 91.07
* - RTE
- accuracy
- 61.01
- 68.59
* - WNLI
- accuracy
- 53.52
- 43.66


Some of these results are significantly different from the ones reported on the test set
@@ -382,7 +384,7 @@ Training with the previous hyper-parameters on a single GPU gave us the followin
LM Fine-tuning
~~~~~~~~~~~~~~

The data should be a text file in the same format as `sample_text.txt <./samples/sample_text.txt>`_ (one sentence per line, docs separated by empty line).
The data should be a text file in the same format as `sample_text.txt <./pytorch_transformers/tests/fixtures/sample_text.txt>`_ (one sentence per line, docs separated by empty line).
You can download an `exemplary training corpus <https://ext-bert-sample.obs.eu-de.otc.t-systems.com/small_wiki_sentence_corpus.txt>`_ generated from wikipedia articles and split into ~500k sentences with spaCy.
Training one epoch on this corpus takes about 1:20h on 4 x NVIDIA Tesla P100 with ``train_batch_size=200`` and ``max_seq_length=128``\ :

@@ -393,12 +395,13 @@ Thanks to the work of @Rocketknight1 and @tholor there are now **several scripts**
OpenAI GPT, Transformer-XL and GPT-2: running the examples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We provide three examples of scripts for OpenAI GPT, Transformer-XL and OpenAI GPT-2 based on (and extended from) the respective original implementations:
We provide several examples of scripts for OpenAI GPT, Transformer-XL, OpenAI GPT-2, BERT and RoBERTa based on (and extended from) the respective original implementations:


* fine-tuning OpenAI GPT on the ROCStories dataset
* evaluating Transformer-XL on Wikitext 103
* unconditional and conditional generation from a pre-trained OpenAI GPT-2 model
* fine-tuning GPT/GPT-2 on a causal language modeling task and BERT/RoBERTa on a masked language modeling task

Fine-tuning OpenAI GPT on the RocStories dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -452,7 +455,47 @@ Unconditional generation:
python run_gpt2.py --unconditional
The same option as in the original scripts are provided, please refere to the code of the example and the original repository of OpenAI.
The same options as in the original scripts are provided; please refer to the code of the example and the original repository of OpenAI.


Causal LM fine-tuning on GPT/GPT-2, Masked LM fine-tuning on BERT/RoBERTa
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running the following examples you should download the `WikiText-2 dataset <https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/>`__ and unpack it to some directory ``$WIKITEXT_2_DATASET``.
The following results were obtained using the ``raw`` WikiText-2 (no tokens were replaced before the tokenization).

This example fine-tunes GPT-2 on the WikiText-2 dataset. The loss function is a causal language modeling loss (perplexity).

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=gpt2 \
        --model_name_or_path=gpt2 \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw

This takes about half an hour to train on a single K80 GPU and about one minute for the evaluation to run.
It reaches a score of about 20 perplexity once fine-tuned on the dataset.
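
For reference, the reported perplexity is just the exponential of the mean evaluation cross-entropy loss; a one-line sketch (the loss value is illustrative):

.. code-block:: python

    import math

    eval_loss = 3.0                   # mean per-token cross-entropy over the eval set (illustrative)
    perplexity = math.exp(eval_loss)  # ~20.1, in line with the score quoted above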

This example fine-tunes RoBERTa on the WikiText-2 dataset. The loss function is a masked language modeling loss (masked perplexity).
The ``--mlm`` flag is necessary to fine-tune BERT/RoBERTa on masked language modeling; a sketch of the masking scheme follows the command below.

.. code-block:: bash

    export WIKITEXT_2_DATASET=/path/to/wikitext_dataset

    python run_lm_finetuning.py \
        --output_dir=output \
        --model_type=roberta \
        --model_name_or_path=roberta-base \
        --do_train \
        --train_data_file=$WIKITEXT_2_DATASET/wiki.train.raw \
        --do_eval \
        --eval_data_file=$WIKITEXT_2_DATASET/wiki.test.raw \
        --mlm

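As referenced above, a hedged sketch of the BERT-style masking the ``--mlm`` objective relies on (this mirrors the usual 80/10/10 recipe, not necessarily the script's exact code; the helper name is hypothetical):

.. code-block:: python

    import torch

    def mask_tokens_sketch(inputs, tokenizer, mlm_probability=0.15):
        """Hypothetical helper: corrupt ~15% of tokens for masked LM training."""
        labels = inputs.clone()
        # Choose positions to predict; all other labels are ignored by the loss.
        masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
        labels[~masked] = -1  # ignore_index for the cross-entropy loss

        # 80% of the chosen positions -> mask token.
        replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
        inputs[replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

        # 10% -> random token (half of the remaining 20%); the final 10% stay unchanged.
        randomized = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
        random_ids = torch.randint(tokenizer.vocab_size, labels.shape, dtype=torch.long)
        inputs[randomized] = random_ids[randomized]
        return inputs, labels
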
.. _fine-tuning-BERT-large:

Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -48,3 +48,4 @@ The library currently contains PyTorch implementations, pre-trained model weight
model_doc/xlm
model_doc/xlnet
model_doc/roberta
model_doc/distilbert
43 changes: 43 additions & 0 deletions docs/source/model_doc/distilbert.rst
@@ -0,0 +1,43 @@
DistilBERT
----------------------------------------------------

``DistilBertConfig``
~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertConfig
:members:


``DistilBertTokenizer``
~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertTokenizer
:members:


``DistilBertModel``
~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertModel
:members:


``DistilBertForMaskedLM``
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertForMaskedLM
:members:


``DistilBertForSequenceClassification``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertForSequenceClassification
:members:


``DistilBertForQuestionAnswering``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: pytorch_transformers.DistilBertForQuestionAnswering
:members:
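
A hedged usage sketch for the classes documented above (a minimal example, assuming the ``distilbert-base-uncased`` shortcut resolves like the other pretrained weights):

.. code-block:: python

    import torch
    from pytorch_transformers import DistilBertModel, DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    model = DistilBertModel.from_pretrained('distilbert-base-uncased')

    input_ids = torch.tensor([tokenizer.encode("Hello, my dog is cute")])
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # (batch, seq_len, hidden_size)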
