Tried training a model #5

Open
amazingvince opened this issue Jul 10, 2024 · 2 comments
Comments

@amazingvince

First off thank you for building this repo!

I tried training a t5 base model on mini pile and got some interesting results.
https://wandb.ai/amazingvince/flasht5-pretrain/runs/ch4a9y51?nw=nwuseramazingvince

I had to modify code in the UL2 objective to accept the dataset I tokenized. I am worried my modifications might have broken something.

Could you provide a script or link to the one you used to tokenize your dataset?
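
For context, my preprocessing was roughly along these lines (a simplified sketch rather than the exact script; the dataset id, column name, and sequence length here are illustrative):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Rough sketch of what my preprocessing looked like -- not my exact script.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
dataset = load_dataset("JeanKaddour/minipile", split="train")

def tokenize(batch):
    # No padding here; I assumed the UL2 data collator handles masking/padding.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
tokenized.save_to_disk("minipile_tokenized")
```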

After training, the model performed much worse than expected, generating gibberish. I wrote a script based on your HF-to-FlashT5 conversion to reverse the process and go from FlashT5 to HF; I might have messed something up in this conversion.

Do you have an example of running inference with the models without converting them? I tried copying all the files from a repo on y'alls Hugging Face and loading with trust_remote_code=True, but the model still generates junk.
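
For reference, this is roughly what I tried (a sketch; the path is a placeholder for the directory I copied the files into, and the auto class is my guess at the right one):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: local directory containing the files copied from the Hugging Face repo
checkpoint = "./fat5-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, trust_remote_code=True)

inputs = tokenizer("summarize: The cat sat on the mat because it was warm.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # comes out as junk for me
```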

I also tried finetuning the model on samsum, and this did not result in a model that seemed to have learned anything. Curious whether the model needs further tuning before it can be used.

Some interesting results here; this run used my model converted to Hugging Face format:
https://wandb.ai/amazingvince/t5-summarization-test/runs/3yyem3wy?nw=nwuseramazingvince

Same hyperparameters, but using the FlashT5 model loaded with trust_remote_code=True. The loss increases over the training run:
https://wandb.ai/amazingvince/t5-summarization-test/runs/cudb9wm2?nw=nwuseramazingvince
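
For reference, the finetuning setup looked roughly like this in both runs (a simplified sketch; the checkpoint id and hyperparameters shown are illustrative, not the exact run config):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_id = "amazingvince/converted-flasht5-base"  # placeholder for my converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

samsum = load_dataset("samsum")

def preprocess(batch):
    inputs = tokenizer(["summarize: " + d for d in batch["dialogue"]],
                       truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["summary"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = samsum.map(preprocess, batched=True,
                       remove_columns=samsum["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-summarization-test",
    learning_rate=3e-4,              # illustrative, not the exact value I used
    per_device_train_batch_size=16,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```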

Curious if you guys have seen any of these things or have ideas as to what I am doing wrong. Thanks again for building this repo and making it public. I have been annoyed at the lack of resources and attention T5 has gotten over the last few years.

@bourdoiscatie
Contributor

Hi @amazingvince

@b-albar is in charge of pre-training and model optimization. However, he's on vacation this week, so he won't be able to answer your questions on the tokenizer until next week.

For my part, I'm in charge of finetuning the resulting models to assess their performance. In the absence of a concrete answer on that point, here are a few comments based on what I've seen:

- For French, the classification heads available in this repo work. We get results similar to CamemBERT (a French RoBERTa), even though our models are trained on 4 times fewer tokens than CamemBERT (it uses batch sizes we can't afford on a single A100).
- For the text generation part, we found that the model learns things, but that training is very unstable. So I think there's an error somewhere, although we haven't had time to look into it (I suspect an initialization problem in the head, as we had something similar with the QA head before correcting it).
- For English, Boris had converted weights from Flan-T5 which we then put on Hugging Face. I had checked that they loaded correctly (they did), but I hadn't tested finetuning with them. That's what I did this Monday, and like you, I'm seeing problems at evaluation time (in my case I tested classification and training an embedding model).

So in addition to a probable problem in the text generation head, I can't exclude the possibility of problems with the weights themselves.

I hope to be able to give you a more precise answer next week when Boris is back and we've discussed this point. We'll get back to you.

@b-albar
Collaborator

b-albar commented Jul 16, 2024

Hi @amazingvince

I'll try to provide a full training script for minipile. The diverging loss after 10k steps is not something I've encountered before; I'll have a look. Converting from HF to FAT5 is not so trivial, as the masking is different: I had to finetune the converted models for a few steps so they learn how to deal with padding tokens. I haven't tried converting from FAT5 to HF. I feel that finetuning may not be required in that direction, but the proper way is probably to include the modelling code in the HF repo of the model, similar to this for example: https://huggingface.co/CATIE-AQ/FAT5-large-flan-en
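
For what it's worth, the "include the code in the HF repo" pattern would look roughly like this with the standard transformers custom-code mechanism (a sketch; the module, class names, and repo id below are placeholders, not the actual ones in this repo):

```python
from transformers import AutoModelForSeq2SeqLM

# Placeholder names: the real config/model classes live in a standalone .py module.
from fat5_modeling import FAT5Config, FAT5ForConditionalGeneration

# Register the custom classes so push_to_hub also uploads their source files.
FAT5Config.register_for_auto_class()
FAT5ForConditionalGeneration.register_for_auto_class("AutoModelForSeq2SeqLM")

model = FAT5ForConditionalGeneration.from_pretrained("./fat5-checkpoint")  # local weights
model.push_to_hub("my-org/fat5-model")  # hypothetical repo id

# Anyone can then load it directly, without manual conversion or copying files:
model = AutoModelForSeq2SeqLM.from_pretrained("my-org/fat5-model", trust_remote_code=True)
```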
