
Incorrect setup of Learning Rate Scheduler #81

Open
aswathn1 opened this issue Jun 8, 2024 · 6 comments

Comments

aswathn1 commented Jun 8, 2024

Hello! Thanks for sharing your great work.

I noticed a discrepancy in the way you set up the learning rate scheduler in finetune.py.

When you calculate:
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
Dividing by the total batch size across multiple GPUs, rather than by gradient_accumulation_steps alone, should give the right number of update steps per epoch. This in turn affects both your warmup schedule and your linear decay schedule for the learning rate.
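
For concreteness, here is a rough sketch of the adjustment I have in mind inside finetune.py. Whether this is the right fix depends on whether len(train_dataloader) is counted before or after the dataloader is sharded across GPUs; accelerator.num_processes and the scheduler call below are my assumptions based on an open-instruct-style setup, not the actual code:

# current: only divides by gradient_accumulation_steps
# num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

# proposed: also account for the number of GPUs, i.e. divide by the number of
# micro-batches consumed per optimizer update across all processes
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / (args.gradient_accumulation_steps * accelerator.num_processes)
)
max_train_steps = args.num_train_epochs * num_update_steps_per_epoch

# warmup and linear decay are then derived from the corrected step count
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=int(max_train_steps * args.warmup_ratio),
    num_training_steps=max_train_steps,
)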

I've also been having issues reproducing your results with a locally fine-tuned Llama-2 7B model using your codebase and settings, compared to your Huggingface checkpoint. So please let me know if you can share any feedback on additional settings needed to reproduce the Huggingface checkpoint-level performance. Thank you.

@fate-ubw

Hi ~ I have also been having issues reproducing selfrag-7B; I got low evaluation results compared with the eval results in the paper. Would you share your results from fine-tuning Llama-2 7B into selfrag-7B?

@aswathn1
Author

I ran their finetuning script without making any changes, using the hyperparameter settings from their finetuning scripts, and evaluated with retrieval using their pre-computed top-k files. I was only able to get up to 34.28% on PopQA, 69.60% on PubHealth, and str-em of 28.02 and rg of 35.76 on ASQA. Can you share the results you were able to reproduce? That would be helpful for context.

@fate-ubw

My results are as follows:
base model: llama2-hf (not llama-2-chat)
epochs: 3
mode: always retrieval (pay attention to the retrieval mode at evaluation time; always retrieval and adaptive retrieval have different performance)
PopQA: 0.546
PubHealth: 0.678
ARC: 0.569
These are lower than the paper but better than your results. Hope this helps~
I have a question about the incorrect code in finetune.py:

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

Have you tried training a model with the corrected finetune code as well as with the code above? How does the corrected code affect the results?

@aswathn1
Author

I did, and it did increase the numbers, but they are still lower than the paper.

@fate-ubw

Could you please tell me how to modify the above code in finetune.py to make it correct~ I would like to test whether the corrected code can reproduce the results presented in the paper. Thanks a lot.

@fate-ubw

I found that finetune.py in self-rag is revised based on the open-instruct finetune script.

fate-ubw added a commit to fate-ubw/raglab-exp that referenced this issue Jul 14, 2024
…ignal code of finetune.py from open-instruction is correct