
Reproducing Llama results from the paper #2

Open · e-tornike opened this issue Aug 29, 2024 · 5 comments

e-tornike commented Aug 29, 2024

Hey there,

firstly, thanks for the nice work!

I am attempting to reproduce the results from the paper. I re-ran the experiments with 10 seeds and averaged the results. However, I can only reproduce the numbers for 5 of the 7 tasks that do not require an LLM judge.

My results are the following:

| Model | BioASQ F1 | BioRED F1 | DiscMT BLEU | EvInf "Fuzzy" F1 | MultiCite F1 | SciERC F1 | SciFact F1-Label | SciFact F1-Token |
|---|---|---|---|---|---|---|---|---|
| Llama-3-8B-Inst. (original) | 43.3 | 40.3 | 37.3 | 13.5 | 37.9 | 25.4 | 42.3 | 40.1 |
| Llama-3-8B-Inst. (reproduced) | 43.1 ± 1.5 | 40.1 ± 1.1 | 36.8 ± 2.6 | 15.2 ± 1.7 | 34.9 ± 8.2 | 13.2 ± 4.6 | 22.0 ± 7.7 | 20.6 ± 7.2 |

I am uncertain why the reproduced results for SciERC and SciFact differ from the original. Do you know what could be the cause?

There is a slight change in the --model_args due to memory issues: I added gpu_memory_utilization and max_model_len and removed tensor_parallel_size. I am running the following command:

```bash
python -m lm_eval \
  --include_path ./sciriff/eval/eleuther_templates/general \
  --model vllm \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,gpu_memory_utilization=0.85,max_model_len=5120 \
  --gen_kwargs max_gen_toks=1024 \
  --tasks bioasq_list_qa,biored_ner,discomat_te,evidence_inference,multicite_intent_classification,scierc_ner,scifact_entailment \
  --batch_size auto \
  --output_path results/ \
  --seed 42 \
  --predict_only \
  --log_samples
```

And I am using the following seeds: 42, 1337, 9876, 12345, 999999, 98765, 5555, 2024, 267, 10.
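
For completeness, the sweep is essentially the command above repeated once per seed; here is a minimal Python sketch of the driver loop (the per-seed `results/seed_*/` output layout is just how I organise things, not something the harness requires):

```python
import subprocess

# Seeds used for the sweep; each run writes to its own output directory and the
# per-seed scores are averaged afterwards (mean ± std in the table above).
SEEDS = [42, 1337, 9876, 12345, 999999, 98765, 5555, 2024, 267, 10]

for seed in SEEDS:
    subprocess.run(
        [
            "python", "-m", "lm_eval",
            "--include_path", "./sciriff/eval/eleuther_templates/general",
            "--model", "vllm",
            "--model_args",
            "pretrained=meta-llama/Meta-Llama-3-8B-Instruct,dtype=float16,"
            "gpu_memory_utilization=0.85,max_model_len=5120",
            "--gen_kwargs", "max_gen_toks=1024",
            "--tasks",
            "bioasq_list_qa,biored_ner,discomat_te,evidence_inference,"
            "multicite_intent_classification,scierc_ner,scifact_entailment",
            "--batch_size", "auto",
            "--output_path", f"results/seed_{seed}/",  # illustrative per-seed directory
            "--seed", str(seed),
            "--predict_only",
            "--log_samples",
        ],
        check=True,
    )
```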

I am running the experiments on a single RTX A6000 (48 GB) with CUDA 12.4, driver version 550.90.07, and Python 3.10.13, using the following package versions:

huggingface-hub==0.24.5
jinja2==3.1.4
jsonschema==4.23.0
lm_eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@e74ec966556253fbe3d8ecba9de675c77c075bce
nltk==3.8.1
openai==1.37.2
pandas==2.2.2
pyyaml==6.0.1
rouge_score==0.1.2
spacy==3.7.5
vllm==0.5.4
lihaoxin2020 (Contributor) commented:

Hi,

Thanks for reaching out.
It seems like you are not passing in a prompting template, which means the model is evaluated with no template at all; that could be a major reason for the discrepancy.

We provide an example template for the Llama 3 series in the latest push, so please try again with --chat_template llama3.

e-tornike (Author) commented:

Hey,

thanks for your response.

As I understand it, the template is passed as an argument to the --include_path flag (see here), which is exactly what the command above already does.

Your recent update (#3) now adds the --apply_chat_template flag.

lihaoxin2020 (Contributor) commented:

I see. The general template in the repo basically means no template: it dumps the dataset input directly to the model, as shown here. You might want to include a Llama template or try --apply_chat_template.
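
To make the difference concrete, here is a minimal sketch (not code from this repo) of what the two settings send to the model, using the transformers tokenizer's built-in chat template for Llama 3:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# "general" template: the task instruction + input is sent to the model verbatim.
raw_prompt = "..."  # placeholder for a SciRIFF task input as dumped by lm_eval

# Chat template: the same text is wrapped in Llama 3's header/<|eot_id|> format and
# an assistant header is appended so the model knows a response should follow.
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": raw_prompt}],
    tokenize=False,
    add_generation_prompt=True,
)
print(chat_prompt)
```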

e-tornike (Author) commented:

True, but the real question I am asking is whether the numbers in the preprint (https://arxiv.org/abs/2406.07835) were obtained with no template (i.e., general), as described in the README, or with a template specific to the underlying model. Do you know which is the case?

lihaoxin2020 (Contributor) commented Oct 26, 2024

@e-tornike Thanks for pointing this out.
For your question: we did not use any template for the Llama models in the preprint (which is not correct). Our reported results are misaligned with yours because we did not set the EOS token correctly for the non-Tulu finetuned models, so the model would keep generating and repeat itself until hitting the max token limit. Sometimes the model generates several versions of the answer, which mistakenly pushes the score higher after parsing.
Your results without a template are correct (I reproduced the same scores without templates). The right way to run these experiments is with templates.
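
To make "setting the EOS token correctly" concrete, here is an illustrative standalone vLLM sketch (not the harness's internal code): generation has to stop at Llama 3's <|eot_id|> turn-end token, otherwise the model runs until max_tokens and repeats itself.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, dtype="float16", max_model_len=5120)

# Llama 3 Instruct ends each assistant turn with <|eot_id|>, not only the plain EOS
# token; if it is missing from the stop set, generation runs on until max_tokens.
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
params = SamplingParams(
    max_tokens=1024,
    stop_token_ids=[tokenizer.eos_token_id, eot_id],
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "..."}],  # placeholder for a SciRIFF task input
    tokenize=False,
    add_generation_prompt=True,
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```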

Here are our new scores for Llama-2-7B and Llama-3-8B with the corresponding templates, for reference (the Tulu models in the preprint are fine):

| Model | bioasq (f1) | biored (f1) | discomat (bleu) | evidence_inference (f1_overlap) | multicite (f1) | mup (lm_judge_reference) | qasper (lm_judge_answer) | qasper (f1_evidence) | scierc (f1) | scifact (f1_label) | scifact (f1_evidence_token) | Mean | Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-7b-chat-hf | 34.48 | 19.78 | 35.57 | 13.32 | 27.27 | 77.25 | 7.94 | 2.29 | 6.88 | 50.41 | 31.68 | 27.90 | 27.27 |
| Meta-Llama-3-8B-Instruct | 44.28 | 47.06 | 59.47 | 0.15 | 50.08 | 85.50 | 55.14 | 41.19 | 28.76 | 68.36 | 53.36 | 48.49 | 50.08 |

This EOS issue won't happen with the recent push, which uses a new lm-eval dependency. In the latest version of the paper we have already remade this table, and we will update the preprint soon.

Thanks again for pointing this out. Let me know if you have further questions!
