Provide comprehensive guide & best-practices for run_language_modeling.py #3192
Comments
I also tried to follow the blog and train an LM from scratch, but the instructions are ambiguous. For example, the config file is passed as a command-line argument, but when it is passed the script tries to load it and throws an error.
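For reference, a minimal sketch of one way to give the script a config it can actually load: build a `RobertaConfig` in Python and save it to a directory, then point the script's config argument at that directory. The values and the `./my_model_config` path below are illustrative assumptions, not taken from this thread.

```python
# Hypothetical sketch: write a RoBERTa config to disk so the script can load it
# from a path instead of failing on a missing or malformed config.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=32000,             # illustrative; must match the tokenizer you use
    max_position_embeddings=514,  # RoBERTa expects 512 tokens + 2 special positions
)
config.save_pretrained("./my_model_config")  # writes config.json into the directory
```

The script could then be pointed at it, e.g. `--config_name ./my_model_config` (assuming that flag name matches the script version you are running).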
I've covered some of the parts here: https://zablo.net/blog/post/training-roberta-from-scratch-the-missing-guide-polish-language-model/
I posted a question related to this on Stack Overflow. Any help is appreciated! @marrrcin
bump!
Hey Marcin, your post is very informative - thanks for that. Could you say a few words on the reasoning for the vocab size being exactly 32000? Are there any heuristics that informed your decision? Or can anyone here say a few words on whether there are good heuristics for choosing this hyperparameter? Thanks.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🚀 Feature request
Provide a comprehensive guide for running the scripts included in the repository, especially `run_language_modeling.py`, its parameters and model configurations.

Motivation
The current version has `argparse`-powered help, from which a lot of the parameters seem to be either mysterious or to have variable runtime behaviour (e.g. `tokenizer_name` is sometimes a path, and the value the user provides is expected to mean different things for different models, e.g. for RoBERTa and BERT). Again, when it comes to `tokenizer_name`, the help claims that `If both are None, initialize a new tokenizer.`, which does not work at all, e.g. when you use a RoBERTa model. The script should handle training the new tokenizer on the provided `train_data` right away; a sketch of what that could look like is shown below.
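As an illustration only (not something the script currently does), here is a minimal sketch of training a byte-level BPE tokenizer on the training data with the separate `tokenizers` package. The file paths, the 32000 vocab size (the value discussed in the comment above), and the output directory are assumptions for the example.

```python
# Hypothetical sketch: train a RoBERTa-style byte-level BPE tokenizer from raw text.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["./train.txt"],   # illustrative path to the raw training corpus
    vocab_size=32000,        # illustrative; must match the model config's vocab_size
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Recent `tokenizers` releases write vocab.json and merges.txt via save_model();
# older releases used tokenizer.save("<dir>") instead.
tokenizer.save_model("./my_tokenizer")
```

The resulting directory could then be passed to the script via `--tokenizer_name ./my_tokenizer`.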
There are a bunch of parameters that are critical to running the script at all (!) which are not even mentioned at https://huggingface.co/blog/how-to-train or in the notebook https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb, for example:
- For RoBERTa, without `"max_position_embeddings": 514` in the config, the script crashes with an error. I had to dig through GitHub to find some unresolved issues around this case and try out a few solutions before the script finally executed (Error with run_language_modeling.py training from scratch #2877). See the sanity-check sketch below.
- Models with LM heads will train even though the head output size is different from the vocab size of the tokenizer - the script should warn the user or (better) raise an exception in such scenarios (also covered by the sketch below).
- Describe what the input dataset should look like. Is it required to have one sentence per line, one article per line, or maybe one paragraph per line?
- Using multi-GPU on a single machine together with the `--evaluate_during_training` parameter crashes the script - why? It might be worth an explanation. It's probably also a bug (run_glue.py RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:3 #1801).

Those are just off the top of my head - I will update this issue once I come up with more, or maybe someone else will add something to this thread.
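A minimal sketch of a pre-flight check for the first two items in the list above. It is not part of `run_language_modeling.py`; the paths are assumptions, and it simply verifies the 514-position requirement and that the config's vocab size matches the tokenizer before a model is built.

```python
# Hypothetical sanity check run before launching training.
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./my_tokenizer")  # illustrative path
config = RobertaConfig.from_pretrained("./my_model_config")         # illustrative path

# RoBERTa reserves two extra positions, so 512 usable tokens need 514 here.
assert config.max_position_embeddings >= 514, "max_position_embeddings is too small for RoBERTa"

# The LM head is sized from config.vocab_size, so it must match the tokenizer.
assert config.vocab_size == tokenizer.vocab_size, (
    f"config.vocab_size={config.vocab_size} != tokenizer.vocab_size={tokenizer.vocab_size}"
)

model = RobertaForMaskedLM(config=config)  # safe to build once the checks pass
```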
Given the number of issues currently open, I suspect that I'm not the only one who struggles with the example script. The biggest problem here is that running it without a proper configuration might really cost a lot, yet the script will still execute, yielding a garbage model.
Moreover, by improving the docs and providing a best practices guide, you can equip many people with an even better toolkit for their research and business.