Error when running the example code for pretraining the rtd model. #127
Comments
Did you find any solution to this?
Actually, I've not found it yet. I guess I am too new.
Could you please give me step-by-step instructions?
I tried it with or without docker, but the result was the same.
My environment is 2 * 80GB A100 GPUs.
Thanks
Soonil
@soonilbae Did you do a clean installation with all the dependencies? If you are using a recent release of torch, make sure you work around the removed torch._six dependency by directly importing six for string_classes. I actually got my pretraining running; I'm currently running the base version with a batch size of 96, since 256 suffers OOM errors.
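For anyone hitting the same import error, here is a minimal sketch of the workaround described above; it assumes the failure comes from a `from torch._six import string_classes` statement, which recent PyTorch releases no longer provide:

```python
# torch._six was removed in recent PyTorch releases; fall back to six, as
# suggested above, wherever the code does `from torch._six import string_classes`.
try:
    from torch._six import string_classes  # older torch still ships this
except ImportError:
    import six
    string_classes = six.string_types  # (str,) on Python 3
```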
@StephennFernandes have you managed to train successfully? I am training a Portuguese version. I am getting 67.5% validation accuracy and 1.55 validation loss after 70k steps (batch size 64), but when I import the discriminator weights in Huggingface, with the spm file for the tokenizer, the downstream classification tasks never converge: after 10 epochs the training and validation error don't decrease.
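For context, loading such a checkpoint in transformers might look roughly like the sketch below; the paths are hypothetical, and it assumes the discriminator weights were already converted to Hugging Face's DebertaV2 format and saved locally:

```python
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification

# Hypothetical local directory holding a converted checkpoint plus spm.model.
model_dir = "./deberta-v3-pt"

tokenizer = DebertaV2Tokenizer(vocab_file=f"{model_dir}/spm.model")
model = DebertaV2ForSequenceClassification.from_pretrained(model_dir, num_labels=2)

inputs = tokenizer("um exemplo em português", return_tensors="pt")
print(model(**inputs).logits.shape)
```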
@fmobrj I just ran the rtd.sh pretraining script to make sure everything is working fine; I didn't really look closely at the training metrics. However, gently pinging @BigBird01.
Thanks!
@fmobrj hey, I'm curious: which hparams did you use, and how big was your training data? If I could take a look at your training metrics, it would help me better understand your problem.
No problem. I made some adjustments to rtd.sh and app.run, because I wanted to use the large pretrained English version as a starting point. It worked: my generator training loss started from 6.58 instead of 10.93 (purely from scratch), and the discriminator from 3.74 instead of 4.19 (scratch). My dataset is a concatenation of ptwiki and BrWAC (https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC). After tokenization it ended up at around 7 million examples (of 512 tokens each). First, I trained a Portuguese spm model with these params:
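The exact parameters aren't shown above; as a rough illustration only, training a Portuguese SentencePiece model with the sentencepiece library could look like this (file names and the vocabulary size are assumptions, not fmobrj's actual settings):

```python
import sentencepiece as spm

# Hypothetical corpus path and output prefix; a vocab_size of 128000 is chosen
# here only to roughly match DeBERTa-v3's vocabulary size.
spm.SentencePieceTrainer.train(
    input="pt_corpus.txt",      # one sentence/document per line
    model_prefix="spm_pt",      # produces spm_pt.model and spm_pt.vocab
    vocab_size=128000,
    model_type="unigram",
    character_coverage=0.9995,
)
```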
Then I tokenized the raw dataset myself instead of using prepare_data.py, because of memory constraints. The code in prepare_data.py loads all texts into memory and processes them in one pass, which means high memory consumption. For wikitext-103 that's fine, but not for my (much bigger) dataset. So I tokenized it myself:
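A minimal sketch of such a streaming tokenization pass is shown below (hypothetical paths); it writes one space-joined block of at most max_seq_length - 2 pieces per line, roughly the layout prepare_data.py appears to produce, without loading the whole corpus into memory:

```python
import sentencepiece as spm

max_seq_length = 512
sp = spm.SentencePieceProcessor(model_file="spm_pt.model")  # hypothetical path

with open("pt_corpus.txt", encoding="utf-8") as src, \
     open("train.txt", "w", encoding="utf-8") as dst:
    buffer = []
    for line in src:
        # Encode each raw line into SentencePiece pieces and accumulate them.
        buffer.extend(sp.encode(line.strip(), out_type=str))
        # Flush full blocks of max_seq_length - 2 pieces to the output file.
        while len(buffer) >= max_seq_length - 2:
            dst.write(" ".join(buffer[: max_seq_length - 2]) + "\n")
            buffer = buffer[max_seq_length - 2:]
```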
Then I created an rtd_pt_continue.sh with some changes to the large-model part. I also used the original deberta-v3-large English checkpoint downloaded from HF:
I also changed this part:
In DeBERTa.apps, I created a run_continue.py to handle the translation of the original English embeddings to the embeddings common to both languages, so I could reuse 33k of the 138k embeddings as a pretrained starting point. For this, I copied app.py as app_continue.py and made these changes to main: I load the pretrained English models and the Portuguese tokenizer normally, as in the run.py code, but I also load the English tokenizer for copying weights. For this, I added:
Then I create a list with the Portuguese vocabulary:
After loading the weights of the pretrained English large models:
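The code for these steps isn't shown above; as a rough illustration of the idea (copying the pretrained English embedding row for every piece shared with the Portuguese vocabulary), a minimal sketch with hypothetical file names might look like this:

```python
import sentencepiece as spm
import torch

# Hypothetical paths for the two SentencePiece models and the English
# word-embedding matrix extracted from the pretrained checkpoint.
sp_en = spm.SentencePieceProcessor(model_file="spm_en.model")
sp_pt = spm.SentencePieceProcessor(model_file="spm_pt.model")

en_ids = {sp_en.id_to_piece(i): i for i in range(sp_en.get_piece_size())}

english_emb = torch.load("english_word_embeddings.pt")  # [en_vocab, hidden]
hidden = english_emb.size(1)
# Start from a randomly initialized Portuguese embedding matrix.
pt_emb = english_emb.new_empty(sp_pt.get_piece_size(), hidden).normal_(0, 0.02)

shared = 0
for pt_id in range(sp_pt.get_piece_size()):
    piece = sp_pt.id_to_piece(pt_id)
    if piece in en_ids:
        # Reuse the pretrained English row for pieces present in both vocabularies.
        pt_emb[pt_id] = english_emb[en_ids[piece]]
        shared += 1
print(f"reused {shared} shared embeddings")
```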
After doing this, I can accelerate training because I use pretrained weights for the almost 33k embedding tokens common to both tokenizers.
@fmobrj hey, I'm glad things started working for you. And thanks a ton for sharing the implementation tweaks that got you up and running; I'm sure someone in the community will benefit greatly from this.
Sure. Send me a DM. |
My training metrics up to now:
04/28/2023 10:08:52|INFO|RTD|00| device=cuda, n_gpu=1, distributed training=False, world_size=1
...
05/04/2023 09:20:31|INFO|RTD|00| ***** Eval results-dev-078000-1000000 *****
The discriminator seems to be learning nothing. After 79k steps, the discriminator result at each 1k dump is still "Best metric: 0@1000", despite the generator learning, with its loss decreasing and its accuracy increasing.
Hi, @BigBird01! Is it expected that during training the metric for the discriminator stays at "Best metric: 0@1000" after many steps (currently at 5,811,200 examples and 90k steps)? The generator is improving, with accuracy ~0.68 and loss 1.51.
@fmobrj did you get to try your checkpoints in any downstream task to see if the training is working? |
Hi, @pvcastro! I tried, but it does not converge when I apply the model to a classification task in Portuguese that works even with the English pretrained model in Huggingface. I suspect the discriminator is not well trained enough. I stopped pretraining with a generator loss of 1.28 and 71.6 accuracy, but the discriminator validation report still shows 0@250 after almost 200k steps with a batch size of 64.
I've set up the environment according to the instructions and tried pretraining the rtd model as follows:
DeBERTa/experiments/language_model$ bash rtd.sh deberta-v3-base
But I got the following error message:
Traceback (most recent call last):
File "./prepare_data.py", line 37, in <module>
tokenize_data(args.input, args.output, args.max_seq_length)
File "./prepare_data.py", line 9, in tokenize_data
tokenizer=deberta.tokenizers[t](p)
File "/usr/local/lib/python3.6/dist-packages/DeBERTa/deberta/spm_tokenizer.py", line 29, in __init__
assert os.path.exists(vocab_file)
File "/usr/lib/python3.6/genericpath.py", line 19, in exists
os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Is it because I did not specify a vocab file? Or did I miss something in setting up the environment?
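The traceback shows that vocab_file ended up as None, i.e. no SentencePiece model path was resolved before the tokenizer was constructed. As a quick sanity check (the path below is hypothetical), one can verify that a downloaded spm.model actually exists and loads:

```python
import os
import sentencepiece as spm

# Hypothetical location of the deberta-v3-base SentencePiece model; point this
# at wherever the vocabulary file was actually downloaded or cached.
vocab_file = os.path.expanduser("~/deberta-v3-base/spm.model")
assert os.path.exists(vocab_file), f"vocab file not found: {vocab_file}"

sp = spm.SentencePieceProcessor(model_file=vocab_file)
print("vocab size:", sp.get_piece_size())
```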