Training protocol for Roberta model with SentencePiece #2
OK, I will try to give a brief description.
Then you have to encode your data in the format that Fairseq training needs.
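Roughly, the encoding step looks like this (file names, paths and split names below are placeholders rather than our exact setup; `spm_encode` is the sentencepiece CLI and `fairseq-preprocess` is the fairseq binarizer):

```bash
# 1) Turn raw text into SentencePiece pieces (one sentence/document per line).
for SPLIT in train valid; do
  spm_encode \
    --model=sentencepiece.model \
    --output_format=piece \
    < data/${SPLIT}.raw.txt \
    > data/${SPLIT}.sp.txt
done

# 2) Binarize for fairseq; without --srcdict this step also builds dict.txt.
fairseq-preprocess \
  --only-source \
  --trainpref data/train.sp.txt \
  --validpref data/valid.sp.txt \
  --destdir data-bin/my_corpus \
  --workers 16
```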
Here you will have a dictionary in this format:
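(For reference, the `dict.txt` that fairseq builds and consumes has one `<piece> <count>` pair per line; the pieces and counts below are made up, with SentencePiece's `▁` marking a word boundary:)

```
▁the 1234567
▁of 987654
ing 456789
▁w 345678
. 234567
```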
For TOTAL_UPDATES we chose the value based on the RoBERTa paper (page 6). WARMUP_UPDATES was 10% of TOTAL_UPDATES, and the total batch size was about 2k, but that depends on how many GPUs, and what kind of GPUs, you want to use. We had a batch size of 256 per GPU (16 MAX_SENTENCES x 16 UPDATE_FREQ), so across 8 GPUs the effective batch size was 256 x 8 = 2048.
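Putting those numbers into the command from the fairseq pretraining README, the training invocation looks roughly like this (the concrete values below are illustrative rather than our exact ones, and flag names follow the fairseq version of that time, e.g. `--max-sentences` instead of the newer `--batch-size`):

```bash
TOTAL_UPDATES=125000    # total training steps (see RoBERTa paper, page 6)
WARMUP_UPDATES=12500    # ~10% of TOTAL_UPDATES
PEAK_LR=0.0005          # peak learning rate for the polynomial decay schedule
TOKENS_PER_SAMPLE=512   # max sequence length
MAX_SENTENCES=16        # sequences per GPU per step
UPDATE_FREQ=16          # gradient accumulation: 16 x 16 = 256 sequences per GPU,
                        # so 8 GPUs give an effective batch of 2048

fairseq-train --fp16 data-bin/my_corpus \
  --task masked_lm --criterion masked_lm \
  --arch roberta_base --sample-break-mode complete \
  --tokens-per-sample $TOKENS_PER_SAMPLE \
  --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
  --lr-scheduler polynomial_decay --lr $PEAK_LR \
  --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
  --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
  --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
  --max-update $TOTAL_UPDATES --log-format simple --log-interval 1
```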
I'm closing the issue, but feel free to open it again.
Hi @simonefrancia, I have another question about data preparation. The original fairseq tutorial is based on wikitext-103; a sample is below.
As you see, the text was preprocessed: it was tokenized and each token is surrounded by spaces (even the dots at the end of a sentence have a space before them).
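For example, a line in that style looks something like this (an illustrative line, not the actual wikitext-103 sample):

```
The game was released in Japan in 2011 . It was developed by Sega .
```

My raw Polish text, of course, is not pre-tokenized like that.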
Hi @ksopyla,
My intuition was exactly as you write; thanks for the confirmation.
Hi @simonefrancia, I'm training a RoBERTa from scratch following this description (my dataset is 6 GB). The MLM loss keeps decreasing up to 50k steps, but when I evaluate the model on NER, the F1 score starts to decrease progressively after 16k steps. What could be causing this strange behavior?
Hi,
I am trying to train a RoBERTa model from scratch with the fairseq library. I want to pretrain the model on Polish text but can't find any good source which explains the details.
There is a README which explains how to do this with BPE and English: https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
But not everything is obvious to me. First of all, I want to use SentencePiece instead of BPE; you trained your model with SentencePiece, am I correct?
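For context, I trained my SentencePiece model roughly like this (vocab size, model type and file names are just placeholder values for illustration):

```bash
# Rough sketch of SentencePiece training; all values are placeholders.
spm_train \
  --input=data/polish_corpus.txt \
  --model_prefix=sentencepiece \
  --vocab_size=32000 \
  --model_type=unigram \
  --character_coverage=0.9995
```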
Could you share what the format of the vocab file for fairseq should look like?
My trained vocab has this format:
What is the target data format? The same as in the original BERT?
My other concern is data preparation: how to preprocess and encode the data (text).
In the tutorial they encode the text with BPE:
So, should I write my own script which encodes my data with SentencePiece tokens?
And then use
Could you also share some information about your settings?
Finally, could you share how many GPUs you used and how long it took to train the model?
Any tips or warnings are welcome.
Thank you in advance. :)