
Provide comprehensive guide & best-practices for run_language_modeling.py #3192

Closed
marrrcin opened this issue Mar 9, 2020 · 6 comments
Labels: Ex: LM (Finetuning) (related to language modeling fine-tuning), Ex: LM (Pretraining) (related to language modeling pre-training), wontfix

Comments

marrrcin (Contributor) commented Mar 9, 2020

🚀 Feature request

Provide a comprehensive guide for running the scripts included in the repository, especially run_language_modeling.py, its parameters, and model configurations.

Motivation

  1. The current version has argparse-powered help, in which a lot of parameters are either mysterious or have variable runtime behaviour (e.g. tokenizer_name is sometimes a path, and the value the user provides is expected to supply different data for different models, e.g. for RoBERTa vs. BERT). Also, tokenizer_name claims that If both are None, initialize a new tokenizer., which does not work at all, e.g. when you use a RoBERTa model. It should handle training a new tokenizer on the provided train_data right away (see the sketch after this list).

  2. There are a bunch of parameters that are critical to even run the script (!), which are not mentioned in https://huggingface.co/blog/how-to-train or in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb, e.g.:
    for RoBERTa, without "max_position_embeddings": 514 in the config, the script crashes with:

    CUDA error: device-side assert triggered

    I had to dig through GitHub, read some unresolved issues around this case, and try out a few solutions before the script finally executed (Error with run_language_modeling.py training from scratch #2877).

  3. Models with LM heads will train even when the head output size differs from the tokenizer's vocab size - the script should warn the user or (better) raise an exception in such scenarios.

  4. Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?

  5. Using multi-GPU on a single machine together with the --evaluate_during_training parameter crashes the script - why? It might be worth an explanation. It's probably also a bug (run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801).

  6. Those are just off the top of my head - I will update this issue once I come up with more, or maybe someone else will add something to this thread.
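
To make points 1-3 concrete, here is a minimal sketch of the workaround I ended up with. This is not code from the repository; it only uses the tokenizers and transformers libraries, and the paths, vocab size and model dimensions are placeholders to adjust for your own setup:

    import os

    from tokenizers import ByteLevelBPETokenizer
    from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

    # 1. Train the byte-level BPE tokenizer yourself - the script will NOT do it for you.
    os.makedirs("./my-tokenizer", exist_ok=True)
    bpe = ByteLevelBPETokenizer()
    bpe.train(
        files=["train.txt"],  # placeholder training corpus
        vocab_size=32000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    bpe.save_model("./my-tokenizer")  # writes vocab.json + merges.txt

    # 2. Build a config that actually works for RoBERTa trained from scratch:
    #    max_position_embeddings must be 514 (512 tokens + 2 offset positions),
    #    otherwise training dies with "CUDA error: device-side assert triggered".
    tokenizer = RobertaTokenizerFast.from_pretrained("./my-tokenizer")
    config = RobertaConfig(
        vocab_size=tokenizer.vocab_size,  # 3. keep the LM head in sync with the tokenizer
        max_position_embeddings=514,
        num_hidden_layers=6,
        num_attention_heads=12,
        type_vocab_size=1,
    )
    model = RobertaForMaskedLM(config=config)

    # The sanity check the script itself should perform (point 3):
    assert model.config.vocab_size == tokenizer.vocab_size, (
        "LM head output size does not match the tokenizer vocab size"
    )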

Given the number of issues currently open, I suspect that I'm not the only one who struggles with the example script. The biggest problem is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.

Moreover, by improving the docs and providing a best-practices guide, you can give many people an even better toolkit for their research and business.

@BramVanroy added the Ex: LM (Finetuning) and Ex: LM (Pretraining) labels on Mar 10, 2020
thak123 commented Mar 11, 2020

I also tried to follow the blog and train an LM from scratch, but the instructions are ambiguous. For example, the config file is passed as a command-line arg, but if it's passed, the script tries to load it and throws an error.
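
For what it's worth, this is roughly how the script resolves the config argument (a sketch based on my reading of run_language_modeling.py at the time; exact details may differ between versions), which is why passing --config_name for a from-scratch model fails unless the file already exists:

    from argparse import Namespace

    from transformers import CONFIG_MAPPING, AutoConfig

    # Hypothetical stand-in for the script's parsed command-line arguments.
    args = Namespace(config_name=None, model_name_or_path=None,
                     model_type="roberta", cache_dir=None)

    if args.config_name:
        # Whatever you pass as --config_name must already exist (a hub model id
        # or a directory containing config.json), otherwise loading raises.
        config = AutoConfig.from_pretrained(args.config_name, cache_dir=args.cache_dir)
    elif args.model_name_or_path:
        config = AutoConfig.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
    else:
        # Only when BOTH are left unset does the script build a fresh default
        # config for the chosen --model_type.
        config = CONFIG_MAPPING[args.model_type]()

    print(type(config).__name__)  # e.g. RobertaConfig, with default (!) settings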

marrrcin (Contributor, Author) commented

I've covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/

singhay commented Apr 15, 2020

https://stackoverflow.com/questions/61232399/decoding-predictions-for-masked-language-modeling-task-using-custom-bpe

I posted a question related to this on SO. Any help is appreciated! @marrrcin

singhay commented May 19, 2020

bump!

kuppulur (Contributor) commented Jun 2, 2020

> I've covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/

Hey Marcin, your post is very informative, thanks for that. Could you say a few words on the reasoning for the vocab size being exactly 32000? Are there any heuristics that guided your decision? Or can anyone here share good heuristics for choosing this hyperparameter? Thanks.

stale bot commented Aug 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Aug 1, 2020
stale bot closed this as completed on Aug 8, 2020