
Conversation

@amstu2 (Contributor) commented Sep 24, 2025

Description

Running lighteval vllm "model_name=meta-llama/Llama-3.2-3B-Instruct" "helm|summarization:xsum|0" --max-samples=5 encounters OverflowError: int too big to convert when attempting to calculate BERTScore metrics.

This appears to be caused by a problem with the tokenizer configuration file (https://huggingface.co/microsoft/deberta-large-mnli/discussions/1): because the config does not specify a maximum length, the tokenizer's model_max_length attribute defaults to 1e30 (huggingface/transformers#14561).
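The misconfigured default is easy to observe directly (assuming only that transformers is installed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-large-mnli")
# With no model_max_length in the tokenizer config, transformers falls back
# to its VERY_LARGE_INTEGER sentinel, int(1e30).
print(tokenizer.model_max_length)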

Changes

  1. I've added an extra optional argument to the __init__ method of the BERTScorer class, which allows the user to override the tokenizer's model_max_length attribute.
  2. In the new function validate_tokenizer_length(), the tokenizer's model_max_length is set to the override value if one is given. If no override is set, the function checks whether the length is the misconfigured value of 1e30 and, if so, defaults to 512 with a warning to the user; otherwise the original length is used. (A sketch of this logic follows the list.)
  3. Added an override value of 512 for deberta-large-mnli, which is the default BERTScore model.
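A minimal sketch of the validation logic in item 2 (the exact signature in the PR may differ; 1e30 mirrors transformers' sentinel for a missing model_max_length):

import logging

logger = logging.getLogger(__name__)

MISCONFIGURED_MAX_LENGTH = int(1e30)
DEFAULT_MAX_LENGTH = 512

def validate_tokenizer_length(model_max_length: int, override: int | None = None) -> int:
    # An explicit override always wins.
    if override is not None:
        return override
    # Detect the sentinel left by a missing tokenizer config and fall back.
    if model_max_length >= MISCONFIGURED_MAX_LENGTH:
        logger.warning(
            "Tokenizer reports model_max_length=%s; defaulting to %s.",
            model_max_length,
            DEFAULT_MAX_LENGTH,
        )
        return DEFAULT_MAX_LENGTH
    return model_max_length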

@HuggingFaceDocBuilderDev (Collaborator) commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NathanHB requested a review from Copilot, September 24, 2025 11:35
@Copilot (Contributor) left a comment:

Pull Request Overview

This PR fixes an overflow error that occurs when calculating BERTScore metrics with certain tokenizers, specifically addressing an issue where the microsoft/deberta-large-mnli tokenizer has a misconfigured model_max_length value of 1e30.

  • Added a new tokenizer_max_len parameter to the BERTScorer class to allow overriding the tokenizer's maximum model length
  • Implemented validation logic to detect the problematic 1e30 value and default to 512 when not explicitly overridden
  • Applied the fix to the existing BERTScore usage by setting tokenizer_max_len=512 for the deberta model

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

src/lighteval/metrics/imports/bert_scorer.py: Added a tokenizer validation function and a new parameter to the BERTScorer class
src/lighteval/metrics/metrics_sample.py: Applied the tokenizer length override fix to the existing BERTScore usage
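For illustration, the call site in metrics_sample.py presumably ends up looking something like this (a hypothetical sketch; parameter values mirror the test script below, and the actual code may differ):

from lighteval.metrics.imports.bert_scorer import BERTScorer

# Hypothetical call site: the default deberta-large-mnli scorer now receives
# the explicit 512 override introduced by this PR.
scorer = BERTScorer(
    model_type="microsoft/deberta-large-mnli",
    lang="en",
    num_layers=9,
    tokenizer_max_len=512,
)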


Review comment on src/lighteval/metrics/imports/bert_scorer.py:

if self._model is None:
    logger.info(f"Loading BERTScorer model `{self._model_type}`")
    self._tokenizer = AutoTokenizer.from_pretrained(self._model_type)
    self._tokenizer.max_model_length = validate_tokenizer_length(
Member: shouldn't this be model_max_length as well?

@NathanHB (Member) left a comment:
great! can you actually run the command and check for the correct tokenizer length?

@NathanHB added the bug label Sep 25, 2025
@amstu2 (Contributor, Author) commented Sep 27, 2025

Tested with the following script:

from lighteval.metrics.imports.bert_scorer import BERTScorer

SCORE_THRESHOLD = 0.5
CANDIDATE = ["This is an example text."]
REFERENCE = ["This text contains an example sentence."]

print("####### Default BERTScorer model with length override #######")
scorer = BERTScorer(
    model_type="microsoft/deberta-large-mnli",
    lang="en",
    rescale_with_baseline=False,
    num_layers=9,
    tokenizer_max_len=512,
    device="cpu"
)

scores = scorer.score(cands=CANDIDATE, refs=REFERENCE)  # (P, R, F) tensors

assert all(i.item() > SCORE_THRESHOLD for i in scores)
print("####### Test passed! #######")

print("####### Default BERTScorer model without length override #######")
scorer = BERTScorer(
    model_type="microsoft/deberta-large-mnli",
    lang="en",
    rescale_with_baseline=False,
    num_layers=9,
    device="cpu"
)

scores = scorer.score(cands=CANDIDATE, refs=REFERENCE)

assert all(i.item() > SCORE_THRESHOLD for i in scores)
print("####### Test passed! #######")

print("####### BERTScorer model with correct tokenizer config and without override #######")

scorer = BERTScorer(
    model_type="FacebookAI/roberta-large",
    lang="en",
    rescale_with_baseline=False,
    num_layers=9,
    device="cpu"
)

scores = scorer.score(cands=CANDIDATE, refs=REFERENCE)

assert all(i.item() > SCORE_THRESHOLD for i in scores)
print("####### Test passed! #######")

I haven't run the unit test, but it looks like bert_score could probably be removed from the list of skipped metrics in testing.

@amstu2 requested a review from NathanHB, September 27, 2025 01:10