Provide comprehensive guide & best-practices for run_language_modeling.py #3192
Comments
I also tried to follow the blog and train an LM from scratch, but the instructions are ambiguous. For example, the config file is passed as a command-line argument, but when it is passed the script tries to load it and throws an error.
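For reference, a minimal sketch of one way to give the script a config it can actually load: build a `RobertaConfig` in Python and save it to a directory, then point the script's config argument at that directory. The values and the `./my_model_config` path below are illustrative assumptions, not taken from this thread.

```python
# Hypothetical sketch: write a RoBERTa config to disk so the script can load it
# from a path instead of failing on a missing or malformed config.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=32000,             # illustrative; must match the tokenizer you use
    max_position_embeddings=514,  # RoBERTa expects 512 tokens + 2 special positions
)
config.save_pretrained("./my_model_config")  # writes config.json into the directory
```

The script could then be pointed at it, e.g. `--config_name ./my_model_config` (assuming that flag name matches the script version you are running).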
I've covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/
I posted a question related to this on Stack Overflow. Any help is appreciated! @marrrcin
bump!
Hey Marcin, your post is very informative - thanks for that. Could you say a few words on the reasoning for the vocab size being exactly 32000? Are there any heuristics that informed your decision? Or can anyone here say a few words on whether there are good heuristics for choosing this hyperparameter? Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🚀 Feature request
Provide a comprehensive guide for running the scripts included in the repository, especially `run_language_modeling.py`, its parameters and model configurations.

Motivation
The current version has `argparse`-powered help, from which a lot of the parameters seem to be either mysterious or to have variable runtime behaviour (e.g. `tokenizer_name` is sometimes a path, and the value the user provides is expected to mean different things for different models, e.g. for RoBERTa and BERT). Again, when it comes to `tokenizer_name`, the help claims that `If both are None, initialize a new tokenizer.`, which does not work at all, e.g. when you use a RoBERTa model. The script should handle training the new tokenizer on the provided `train_data` right away; a sketch of what that could look like is shown below.
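As an illustration only (not something the script currently does), here is a minimal sketch of training a byte-level BPE tokenizer on the training data with the separate `tokenizers` package. The file paths, the 32000 vocab size (the value discussed in the comment above), and the output directory are assumptions for the example.

```python
# Hypothetical sketch: train a RoBERTa-style byte-level BPE tokenizer from raw text.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./train.txt"],   # illustrative path to the raw training corpus
    vocab_size=32000,        # illustrative; must match the model config's vocab_size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Recent `tokenizers` releases write vocab.json and merges.txt via save_model();
# older releases used tokenizer.save("<dir>") instead.
tokenizer.save_model("./my_tokenizer")
```

The resulting directory could then be passed to the script via `--tokenizer_name ./my_tokenizer`.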
There are a bunch of parameters that are critical to running the script at all (!) which are not even mentioned at https://huggingface.co/blog/how-to-train or in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb, for example:
- For RoBERTa, without `"max_position_embeddings": 514` in the config, the script crashes with an error. I had to dig through GitHub to find some unresolved issues around this case and try out a few solutions before the script finally executed (Error with run_language_modeling.py training from scratch #2877). See the sanity-check sketch below.
- Models with LM heads will train even though the head output size is different from the vocab size of the tokenizer - the script should warn the user or (better) raise an exception in such scenarios (also covered by the sketch below).
- Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?
- Using multi-GPU on a single machine together with the `--evaluate_during_training` parameter crashes the script - why? It might be worth an explanation. It's probably also a bug (run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801).

Those are just off the top of my head - I will update this issue once I come up with more, or maybe someone else will add something to this thread.
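A minimal sketch of a pre-flight check for the first two items in the list above. It is not part of `run_language_modeling.py`; the paths are assumptions, and it simply verifies the 514-position requirement and that the config's vocab size matches the tokenizer before a model is built.

```python
# Hypothetical sanity check run before launching training.
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")  # illustrative path
config = RobertaConfig.from_pretrained("./my_model_config")         # illustrative path

# RoBERTa reserves two extra positions, so 512 usable tokens need 514 here.
assert config.max_position_embeddings >= 514, "max_position_embeddings is too small for RoBERTa"

# The LM head is sized from config.vocab_size, so it must match the tokenizer.
assert config.vocab_size == tokenizer.vocab_size, (
    f"config.vocab_size={config.vocab_size} != tokenizer.vocab_size={tokenizer.vocab_size}"
)

model = RobertaForMaskedLM(config=config)  # safe to build once the checks pass
```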
Given the number of issues currently open, I suspect that I'm not the only one who struggles with the example script. The biggest problem here is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.
Moreover, by improving the docs and providing a best practices guide, you can equip many people with an even better toolkit for their research and business.