Cannot reproduce the result for bert-base-uncased, avg_first_last setting #285

Closed
kuriyan1204 opened this issue Oct 13, 2024 · 4 comments
kuriyan1204 commented Oct 13, 2024

@gaotianyu1350
Hi, thank you for the great work and for publishing such clean code!
I have a question about reproducing the STS results for pre-trained BERT models.

When I run the following command in my environment, I get higher STS scores than the results reported in your paper.
Do you have any idea what is causing the issue?

Code executed

python evaluation.py \
    --model_name_or_path bert-base-uncased \
    --pooler avg_first_last \
    --task_set sts \
    --mode test

Results

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 45.09 | 64.30 | 54.56 | 70.52 | 67.87 | 59.05        | 63.75           | 60.73 |

Expected results (scores shown in your paper)

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 39.70 | 59.38 | 49.67 | 66.03 | 66.19 | 53.87        | 62.06           | 56.70 |

Strangely, I can fully reproduce the scores for the SimCSE models with the following command:

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

Here is the output of pip freeze; I am using a single NVIDIA RTX 6000 Ada GPU.
Thank you very much for your help!

pip freeze result
  aiofiles==23.2.1
  aiohappyeyeballs==2.4.3
  aiohttp==3.10.10
  aiosignal==1.3.1
  annotated-types==0.7.0
  anyio==4.5.0
  async-timeout==4.0.3
  attrs==24.2.0
  certifi==2024.8.30
  charset-normalizer==3.4.0
  click==8.1.7
  contourpy==1.1.1
  cycler==0.12.1
  datasets==3.0.1
  dill==0.3.8
  exceptiongroup==1.2.2
  fastapi==0.115.2
  ffmpy==0.4.0
  filelock==3.16.1
  fonttools==4.54.1
  frozenlist==1.4.1
  fsspec==2024.6.1
  gradio==4.44.1
  gradio-client==1.3.0
  h11==0.14.0
  httpcore==1.0.6
  httpx==0.27.2
  huggingface-hub==0.25.2
  idna==3.10
  importlib-resources==6.4.5
  jinja2==3.1.4
  joblib==1.4.2
  kiwisolver==1.4.7
  markdown-it-py==3.0.0
  MarkupSafe==2.1.5
  matplotlib==3.7.5
  mdurl==0.1.2
  multidict==6.1.0
  multiprocess==0.70.17
  numpy==1.24.4
  orjson==3.10.7
  packaging==24.1
  pandas==2.0.3
  pillow==10.4.0
  prettytable==3.11.0
  propcache==0.2.0
  pyarrow==17.0.0
  pydantic==2.9.2
  pydantic-core==2.23.4
  pydub==0.25.1
  pygments==2.18.0
  pyparsing==3.1.4
  python-dateutil==2.9.0.post0
  python-multipart==0.0.12
  pytz==2024.2
  PyYAML==6.0.2
  regex==2024.9.11
  requests==2.32.3
  rich==13.9.2
  ruff==0.6.9
  sacremoses==0.1.1
  safetensors==0.4.5
  scikit-learn==1.3.2
  scipy==1.10.1
  semantic-version==2.10.0
  shellingham==1.5.4
  six==1.16.0
  sniffio==1.3.1
  starlette==0.39.2
  threadpoolctl==3.5.0
  tokenizers==0.9.4
  tomlkit==0.12.0
  torch==1.7.1+cu110
  torchtyping==0.1.5
  tqdm==4.66.5
  transformers==4.2.1
  typeguard==2.13.3
  typer==0.12.5
  typing-extensions==4.12.2
  tzdata==2024.2
  urllib3==2.2.3
  uvicorn==0.31.1
  wcwidth==0.2.13
  websockets==12.0
  xxhash==3.5.0
  yarl==1.15.1
  zipp==3.20.2
@gaotianyu1350
Member

Hi,

It looks like your dependencies match our experiment setting, and the hardware shouldn't cause that much of a difference. Unfortunately, I am also not sure what caused the discrepancy... have you tried testing RoBERTa with first-last avg?

@kuriyan1204
Author

@gaotianyu1350
Thanks for the prompt response! I also could not reproduce the results for RoBERTa with first-last avg.

It turns out that the first-last avg pooling logic was changed in this commit, and as a result the current codebase cannot reproduce the results for models that use first-last avg pooling (such as BERT and RoBERTa).
After rolling back the change (i.e., simply using the static word embedding layer instead of the contextualized embeddings from the first transformer layer), I can successfully reproduce the STS results reported in the paper.
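For reference, here is a minimal sketch of the two variants (my own illustration, not the actual evaluation.py code), assuming the HuggingFace transformers API: with output_hidden_states=True, hidden_states[0] is the embedding-layer output (before any transformer layer) and hidden_states[1] is the output of the first transformer layer.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A man is playing a guitar."], return_tensors="pt", padding=True)
with torch.no_grad():
    # Tuple of (num_layers + 1) tensors; index 0 is the embedding-layer output.
    hidden_states = model(**batch, output_hidden_states=True).hidden_states

mask = batch["attention_mask"].unsqueeze(-1)  # [batch, seq_len, 1]

def masked_avg_first_last(first, last):
    # Illustrative helper (hypothetical name): average the two layers,
    # then mean-pool over non-padding tokens.
    return ((first + last) / 2.0 * mask).sum(1) / mask.sum(1)

# Variant matching the paper numbers (before the commit): embedding-layer output.
emb_paper = masked_avg_first_last(hidden_states[0], hidden_states[-1])
# Variant in the current codebase (after the commit): first transformer layer.
emb_current = masked_avg_first_last(hidden_states[1], hidden_states[-1])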

It would be very nice if you could add a note to the README or the paper about this discrepancy for anyone else trying to reproduce the results! :)

@gaotianyu1350
Member

Hi,

Thanks for figuring it out! Yeah, it makes sense that using the contextualized embeddings improves the results. I'll add a note to the README.

@kuriyan1204
Author

Thank you for updating the README! Closing this issue.
