
Commit

Add Readme for language modeling scripts with accelerate (huggingface…
hemildesai authored and Iwontbecreative committed Jul 15, 2021
1 parent bf6d7a1 commit c6cc8fb
Showing 2 changed files with 26 additions and 8 deletions.
32 changes: 25 additions & 7 deletions examples/language-modeling/README.md
@@ -22,8 +22,7 @@ ALBERT, BERT, DistilBERT, RoBERTa, XLNet... GPT and GPT-2 are trained or fine-tu
loss. XLNet uses permutation language modeling (PLM); you can find more information about the differences between those
objectives in our [model summary](https://huggingface.co/transformers/model_summary.html).

These scripts leverage the 🤗 Datasets library and the Trainer API. You can easily customize them to your needs if you
need extra processing on your datasets.
There are two sets of scripts provided. The first set leverages the Trainer API. The second set, with `no_trainer` in the suffix, uses a custom training loop and leverages the 🤗 Accelerate library. Both sets use the 🤗 Datasets library. You can easily customize them to your needs if you need extra processing on your datasets.

**Note:** The old script `run_language_modeling.py` is still available [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/run_language_modeling.py).

@@ -60,6 +59,15 @@ python run_clm.py \
--output_dir /tmp/test-clm
```

This uses the built-in Hugging Face `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_clm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_clm_no_trainer.py \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--model_name_or_path gpt2 \
--output_dir /tmp/test-clm
```
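
Because the `no_trainer` scripts are built on 🤗 Accelerate, the same command can also be run on distributed hardware through the Accelerate CLI. A minimal sketch, assuming `accelerate` is installed and `accelerate config` has been run once to describe your setup:

```bash
# answer the interactive questions once to describe your machine(s)
accelerate config

# launch the same script on the configured setup (multi-GPU, TPU, ...)
accelerate launch run_clm_no_trainer.py \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--model_name_or_path gpt2 \
--output_dir /tmp/test-clm
```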

### RoBERTa/BERT/DistilBERT and masked language modeling

@@ -95,23 +103,33 @@ python run_mlm.py \
If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
concatenates all texts and then splits them into blocks of the same length).
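
A minimal sketch of that flag in use, assuming a local dataset with one sample per line (`path/to/train.txt` and `path/to/validation.txt` are placeholder paths):

```bash
python run_mlm.py \
--model_name_or_path roberta-base \
--train_file path/to/train.txt \
--validation_file path/to/validation.txt \
--line_by_line \
--do_train \
--do_eval \
--output_dir /tmp/test-mlm
```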

This uses the built-in Hugging Face `Trainer` for training. If you want to use a custom training loop, you can use or adapt the `run_mlm_no_trainer.py` script. Take a look at the script for a list of supported arguments. An example is shown below:

```bash
python run_mlm_no_trainer.py \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--model_name_or_path roberta-base \
--output_dir /tmp/test-mlm
```
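
As with the causal language modeling variant above, this `no_trainer` script can also be launched through the 🤗 Accelerate CLI (for example with `accelerate launch run_mlm_no_trainer.py ...` after running `accelerate config`); see the sketch in the causal language modeling section.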

**Note:** On TPU, you should use the flag `--pad_to_max_length` in conjunction with the `--line_by_line` flag to make
sure all your batches have the same length.
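
A sketch of that flag combination (the TPU process launcher itself is configured separately and is not shown here; the rest of the command mirrors the example above):

```bash
python run_mlm.py \
--model_name_or_path roberta-base \
--train_file path/to/train.txt \
--line_by_line \
--pad_to_max_length \
--do_train \
--output_dir /tmp/test-mlm
```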

### Whole word masking

This part was moved to `examples/research_projects/mlm_wwm`.

### XLNet and permutation language modeling

XLNet uses a different training objective, which is permutation language modeling. It is an autoregressive method
to learn bidirectional contexts by maximizing the expected likelihood over all permutations of the input
sequence factorization order.

We use the `--plm_probability` flag to define the ratio of length of a span of masked tokens to surrounding
context length for permutation language modeling.

The `--max_span_length` flag may also be used to limit the length of a span of masked tokens used
for permutation language modeling.
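
As a rough sketch of how these flags fit together (assuming the companion permutation language modeling script, here called `run_plm.py`; the script name and the flag values are illustrative, and the defaults may already suit your needs):

```bash
python run_plm.py \
--model_name_or_path xlnet-base-cased \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--plm_probability 0.1667 \
--max_span_length 5 \
--do_train \
--do_eval \
--output_dir /tmp/test-plm
```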

Here is how to fine-tune XLNet on wikitext-2:
2 changes: 1 addition & 1 deletion examples/language-modeling/run_mlm_no_trainer.py
@@ -56,7 +56,7 @@


def parse_args():
parser = argparse.ArgumentParser(description="Finetune a transformers model on a text classification task")
parser = argparse.ArgumentParser(description="Finetune a transformers model on a Masked Language Modeling task")
parser.add_argument(
"--dataset_name",
type=str,
