Result reproduction #7
It seems the results come from the validation split, right?
Detailed log of the first trial:

```
2024-03-29 22:14:24,994 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-29 23:41:20,965 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 01:06:17,476 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 02:33:57,594 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 04:00:31,784 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 05:28:45,895 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 06:56:00,224 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 08:22:09,814 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 09:47:24,939 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 11:10:58,770 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 12:34:11,931 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 13:57:33,405 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 15:19:45,125 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 16:41:41,170 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 18:05:40,953 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 19:29:54,037 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 20:50:56,876 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 22:12:26,222 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-30 23:33:28,411 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 00:52:50,428 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 02:12:15,114 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 03:32:28,521 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 04:51:45,901 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 06:10:33,505 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 07:29:52,323 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 08:48:13,919 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 10:05:51,761 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 11:24:59,673 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 12:43:09,277 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
2024-03-31 14:01:15,778 INFO ***** validate val_unseen split on CVDN task ***** [Eval] dataset=[CVDN]
```
The losses of the first few epochs are similar to your log, but the val numerical results are much better. It is quite weird. Do you think the simulator environment may be a cause?
I suspect it may be because we actually used vicuna-7b-v1.1 but mistakenly thought we were using vicuna-7b-delta-v0.
For Vicuna-7b-delta-v0, the "delta" model cannot be used directly. Users have to apply it on top of the original LLaMA weights to get the actual Vicuna weights. See the instructions.
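For anyone following along, here is a minimal sketch of that merging step, assuming the CLI flags documented in FastChat's README; all local paths are placeholders:

```python
import subprocess

# Apply the delta weights on top of the original LLaMA-7B weights to
# produce usable Vicuna weights. Paths below are placeholders.
subprocess.run([
    "python3", "-m", "fastchat.model.apply_delta",
    "--base-model-path", "/path/to/llama-7b",        # original LLaMA weights
    "--target-model-path", "/path/to/vicuna-7b-v0",  # merged output directory
    "--delta-path", "lmsys/vicuna-7b-delta-v0",      # delta model from the Hub
], check=True)
```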
So the model is Vicuna-v0? The Vicuna model on my side should be fine, as I have already used it for many other models and tasks.
It will take me some time to verify. Apologies for the inconvenience.
That's fine. Thank you for getting back to me. Do you have any other ideas about the performance?
It seems that we actually used vicuna-7b-v1.1.
It's actually good news. Let me have a try.
Hi, have you successfully reproduced the results?
The model is in training. Also, I have encountered another issue: if I use the vicuna-v0 tokenizer to test the model on ScanQA, everything is fine except that the numbers don't match; whilst if I use the vicuna-v1.1 tokenizer, the code reports an error. I am sure of this because I have tried several times on different machines and environments (including the one you provided in requirement.txt). Testing on R2R and REVERIE is fine with both tokenizers. The error is reported on Line 395 in 09f5cbe
Error trace: the generated_ids at the time the bug is reported are:
Debugging...
The vicuna v1.1 I use is https://huggingface.co/lmsys/vicuna-7b-v1.1
The point is that the loaded tokenizer fails to decode -1, which is not in the vocabulary. Can you check whether this is also an issue on your side? By the way, please help check the config.json in the Vicuna model you use. For now, I replace the token id -1 with your pad token id. Traceback (most recent call last):
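A minimal sketch of that substitution, assuming the generated ids come back as a PyTorch tensor; the helper name and the fallback pad id of 0 are my own choices:

```python
import torch

def replace_invalid_pad_ids(generated_ids: torch.Tensor, pad_token_id: int = 0) -> torch.Tensor:
    """Replace the invalid token id -1 with a valid pad token id so that
    tokenizer.decode() no longer fails on out-of-vocabulary ids."""
    generated_ids = generated_ids.clone()
    generated_ids[generated_ids == -1] = pad_token_id
    return generated_ids
```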
The model is vicuna-7b-v1.1 (I've compared my parameters with the official parameters). However, I found that the tokenizer is slightly different from the current version. Here is my config.
The pad_token_id of vicuna-7b-v1.1 is -1, while the pad_token_id of vicuna-7b-delta-v1.1 is 0 (their parameters are otherwise exactly the same). Thus, -1 is automatically produced during generation, causing this error. To solve this problem, you can either modify the pad_token_id in the config, or pass it manually when calling the generate function.
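For concreteness, a minimal sketch of both options with Hugging Face transformers (the model path is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/vicuna-7b-v1.1"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Where is the kitchen?", return_tensors="pt")

# Option 1: pass a valid pad_token_id per call, overriding the -1 in config.json.
outputs = model.generate(**inputs, max_new_tokens=32, pad_token_id=0)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Option 2: patch the config once, so every later generate() call is safe.
model.config.pad_token_id = 0
```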
Hello, have you addressed the problem and reproduced the results?
I prepared vicuna-v1.1 from the delta model using FastChat. The overall val-unseen score is 2.93 lower than the released training log, specifically by 2 points on REVERIE and by ~1 point on ScanQA. But things are much better now.
It's good to hear that. Thanks for your feedback and discussion!
Hi, thanks for open-sourcing this work. I have problems reproducing the results in the paper and am looking forward to your help.
I have replicated the multi-task w/o pretraining experiments twice. However, I fail to reproduce the results. My results are:

For reference, I have also tested your released checkpoint on R2R and REVERIE; the results are:
