Cannot reproduce the result for bert-base-uncased, avg_first_last setting #285

Closed
kuriyan1204 opened this issue Oct 13, 2024 · 4 comments
kuriyan1204 commented Oct 13, 2024

@gaotianyu1350
Hi, thank you for the great work and for publishing such clean code!
I have a question about reproducing the STS results for pre-trained BERT models.

When I run the following command in my environment, I get higher STS scores than the results reported in your paper.
Do you have any idea what is causing the issue?

Code executed

python evaluation.py \
    --model_name_or_path bert-base-uncased \
    --pooler avg_first_last \
    --task_set sts \
    --mode test

Results

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 45.09 | 64.30 | 54.56 | 70.52 | 67.87 | 59.05        | 63.75           | 60.73 |

Expected results (scores shown in your paper)

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 39.70 | 59.38 | 49.67 | 66.03 | 66.19 | 53.87        | 62.06           | 56.70 |

Strangely, I can fully reproduce the scores for the SimCSE models with the following command:

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

Here is the output of pip freeze; I am using a single NVIDIA RTX 6000 Ada GPU.
Thank you very much for your help!

pip freeze result
  aiofiles==23.2.1
  aiohappyeyeballs==2.4.3
  aiohttp==3.10.10
  aiosignal==1.3.1
  annotated-types==0.7.0
  anyio==4.5.0
  async-timeout==4.0.3
  attrs==24.2.0
  certifi==2024.8.30
  charset-normalizer==3.4.0
  click==8.1.7
  contourpy==1.1.1
  cycler==0.12.1
  datasets==3.0.1
  dill==0.3.8
  exceptiongroup==1.2.2
  fastapi==0.115.2
  ffmpy==0.4.0
  filelock==3.16.1
  fonttools==4.54.1
  frozenlist==1.4.1
  fsspec==2024.6.1
  gradio==4.44.1
  gradio-client==1.3.0
  h11==0.14.0
  httpcore==1.0.6
  httpx==0.27.2
  huggingface-hub==0.25.2
  idna==3.10
  importlib-resources==6.4.5
  jinja2==3.1.4
  joblib==1.4.2
  kiwisolver==1.4.7
  markdown-it-py==3.0.0
  MarkupSafe==2.1.5
  matplotlib==3.7.5
  mdurl==0.1.2
  multidict==6.1.0
  multiprocess==0.70.17
  numpy==1.24.4
  orjson==3.10.7
  packaging==24.1
  pandas==2.0.3
  pillow==10.4.0
  prettytable==3.11.0
  propcache==0.2.0
  pyarrow==17.0.0
  pydantic==2.9.2
  pydantic-core==2.23.4
  pydub==0.25.1
  pygments==2.18.0
  pyparsing==3.1.4
  python-dateutil==2.9.0.post0
  python-multipart==0.0.12
  pytz==2024.2
  PyYAML==6.0.2
  regex==2024.9.11
  requests==2.32.3
  rich==13.9.2
  ruff==0.6.9
  sacremoses==0.1.1
  safetensors==0.4.5
  scikit-learn==1.3.2
  scipy==1.10.1
  semantic-version==2.10.0
  shellingham==1.5.4
  six==1.16.0
  sniffio==1.3.1
  starlette==0.39.2
  threadpoolctl==3.5.0
  tokenizers==0.9.4
  tomlkit==0.12.0
  torch==1.7.1+cu110
  torchtyping==0.1.5
  tqdm==4.66.5
  transformers==4.2.1
  typeguard==2.13.3
  typer==0.12.5
  typing-extensions==4.12.2
  tzdata==2024.2
  urllib3==2.2.3
  uvicorn==0.31.1
  wcwidth==0.2.13
  websockets==12.0
  xxhash==3.5.0
  yarl==1.15.1
  zipp==3.20.2
@gaotianyu1350
Member

Hi,

It looks like your dependencies match our experiment setting, and the hardware shouldn't cause that much of a difference. Unfortunately, I am also not sure what caused the discrepancy... have you tried testing RoBERTa with first-last avg?

@kuriyan1204
Author

@gaotianyu1350
Thanks for the prompt response! I also could not reproduce the results for RoBERTa with first-last avg.

It turns out that the first-last avg pooling logic was changed in this commit, and as a result the current codebase cannot reproduce the results for models that use first-last avg pooling (such as BERT and RoBERTa).
After rolling back the change (i.e., simply using the static word embedding layer instead of the contextualized embeddings from the first transformer layer), I can successfully reproduce the STS results reported in the paper.
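For reference, here is a minimal sketch of the two variants (my own illustration, not the actual evaluation.py code), assuming the HuggingFace transformers API: with output_hidden_states=True, hidden_states[0] is the embedding-layer output (before any transformer layer) and hidden_states[1] is the output of the first transformer layer.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["A man is playing a guitar."], return_tensors="pt", padding=True)
with torch.no_grad():
    # Tuple of (num_layers + 1) tensors; index 0 is the embedding-layer output.
    hidden_states = model(**batch, output_hidden_states=True).hidden_states

mask = batch["attention_mask"].unsqueeze(-1)  # [batch, seq_len, 1]

def masked_avg_first_last(first, last):
    # Illustrative helper (hypothetical name): average the two layers,
    # then mean-pool over non-padding tokens.
    return ((first + last) / 2.0 * mask).sum(1) / mask.sum(1)

# Variant matching the paper numbers (before the commit): embedding-layer output.
emb_paper = masked_avg_first_last(hidden_states[0], hidden_states[-1])
# Variant in the current codebase (after the commit): first transformer layer.
emb_current = masked_avg_first_last(hidden_states[1], hidden_states[-1])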

It would be very nice if you could add a note to the README or the paper about this discrepancy for anyone else trying to reproduce the results! :)

@gaotianyu1350
Member

Hi,

Thanks for figuring it out! Yeah, it makes sense that using the contextualized embeddings improves the results. I'll add a note to the README.

@kuriyan1204
Author

Thank you for updating the README! Closing this issue.
