[Severe Bug] Performance Degradation Starting from v4.42.* #31890
Comments
Hey! I'll try to skim through the resources, but I can't really think of something that could trigger the addition of spaces. I don't know where they are added, but is this related to #26678?
Can you share a small snippet or just a string where you saw extra spaces?
fyi @itazap
Thanks for getting back to me @ArthurZucker! The extra spaces usually appear in the prompt part of the model outputs. Here is an example:

```json
{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n \"\"\"\n Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n for all permutations of a given list. Each permutation is shuffled before calculating the differences.\n\n Args:\n - numbers (list): A list of numbers. Default is numbers from 1 to 10.\n \n Returns:\n float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n Requirements:\n - itertools\n - random. shuffle\n\n Example:\n >>> result = task_func([1, 2, 3 ])\n >>> isinstance(result, float)\n True\n \"\"\"\n sum_diffs = 0\n permutations = list(itertools.permutations(numbers))\n for perm in permutations:\n shuffle(perm)\n diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n sum_diffs += sum(diffs)\n return sum_diffs / len(permutations)"}
```

The ideal one should be:

```json
{"task_id": "BigCodeBench/0", "solution": "import itertools\nfrom random import shuffle\n\ndef task_func(numbers=list(range(1, 11))):\n \"\"\"\n Calculates the average of the sums of absolute differences between each pair of consecutive numbers \n for all permutations of a given list. Each permutation is shuffled before calculating the differences.\n\n Args:\n - numbers (list): A list of numbers. Default is numbers from 1 to 10.\n \n Returns:\n float: The average of the sums of absolute differences for each shuffled permutation of the list.\n\n Requirements:\n - itertools\n - random.shuffle\n\n Example:\n >>> result = task_func([1, 2, 3])\n >>> isinstance(result, float)\n True\n \"\"\"\n sum_diffs = 0\n permutations = list(itertools.permutations(numbers))\n for perm in permutations:\n shuffle(perm)\n diffs = [abs(perm[i] - perm[i + 1]) for i in range(len(perm) - 1)]\n sum_diffs += sum(diffs)\n return sum_diffs / len(permutations)"}
```

Note the extra spaces in `random. shuffle` and `[1, 2, 3 ]` in the first solution string. I haven't checked other models, so I'm not sure if it's a common pattern or not.
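In case it helps narrow things down, a quick roundtrip check along these lines should show whether the spaces come from the tokenizer itself (a sketch only; the model id and sample string are my assumptions, not the exact BigCodeBench call):

```python
from transformers import AutoTokenizer

# Assumption: the Yi chat tokenizer is the one inserting the extra spaces.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")

sample = "    - random.shuffle\n    >>> result = task_func([1, 2, 3])"
ids = tok(sample, add_special_tokens=False).input_ids
roundtrip = tok.decode(ids, skip_special_tokens=True)

# If the regression is in encode/decode, the roundtrip will contain
# "random. shuffle" or "[1, 2, 3 ]" instead of the original text.
print(repr(roundtrip))
```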
Could you share how you call the tokenizer (i.e., how you initialize it)? #30964 and #31305 are the only "big" changes that happened. Could you try to set
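For reference, the kind of initialization being hinted at here would look roughly like this (assuming the flags in question are `legacy` and `from_slow`, which is what the later discussion about the `legacy` default suggests):

```python
from transformers import AutoTokenizer

# Assumption: the flags under discussion are `legacy` and `from_slow`;
# both are forwarded to the underlying Llama-style tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "01-ai/Yi-1.5-9B-Chat",
    legacy=False,    # opt in to the fixed pre-tokenization behaviour
    from_slow=True,  # rebuild the fast tokenizer from the slow one
)
```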
It seems like:
I don't know if this is in decoding or in encoding, which is important for us to be able to fix!
It's on https://github.com/bigcode-project/bigcodebench/blob/bbe93d673fd236e99b81cd2d7f110b63c9c2da35/bigcodebench/model.py#L137 and https://github.com/bigcode-project/bigcodebench/blob/bbe93d673fd236e99b81cd2d7f110b63c9c2da35/bigcodebench/model.py#L197.
Let me try it.
Actually, using
The issue could just as well be the chat template call, given that this is something that was touched, while the
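If the chat template is the suspect, its output is cheap to compare across versions; here is a sketch of what that diff could look like (the model id and message content are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("01-ai/Yi-1.5-9B-Chat")
messages = [{"role": "user", "content": "Write a function that sums a list."}]

# Render the prompt both as text and as token ids; running this under
# transformers v4.41.x and v4.42.x and diffing the results shows whether
# the chat template call itself changed.
print(repr(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)))
print(tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True))
```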
Yeah, similar to my thoughts. We've only tested Chat models. However,
My bad, the extra space no longer exists. I'll check the final results to see if the scores are similar to the reported ones.
@ArthurZucker BTW, if
I don't really know, because I don't remember exactly how the tokenizer is called / how the tokenizer was trained. What fixed your issue?
Both. I'll run the full generation with
@ArthurZucker When generating with There should be some other issues, I guess?
I am not super sure, I don't know how vLLM calls the tokenizer, but
Here are the first few prompts sent to the vLLM calls: yi_vllm.txt. They look correct, but the outputs are bad. Do you also need inputs for the
It would be nice to have the input token ids and the input text / chat, yep.
Yes, for transformers only!
cc @Rocketknight1 if you have an idea as well
Here are the top 10 pairs of prompts and token ids!
Hi @terryyz, can you please also share the outputs of the prompts where you are seeing the issue? Thanks!
Hi @itazap, are there any specific files you want to see? Or just the ones where the model degraded? If that's the case, there were plenty of them in the original thread: bigcode-project/bigcodebench#21
@terryyz Sorry, I understood that
@itazap Unfortunately
@itazap @ArthurZucker Update:

```python
EOS = [
    "<|endoftext|>",
    "<|endofmask|>",
    "</s>",
    "\nif __name__",
    "\ndef main(",
    "\nprint(",
]
stop_sequencer = StopSequencer(
    self.model,
    model_type="causal",  # or seq2seq
    tokenizer=self.tokenizer,
)
model = stop_sequencer.register_stop_texts(
    stop_texts=self.eos,
    input_length=input_tokens.size(-1),
)
outputs = model.generate(
    input_tokens,
    max_new_tokens=self.max_new_tokens,
    do_sample=do_sample,
    num_return_sequences=min(self.batch_size, num_samples),
    pad_token_id=self.tokenizer.eos_token_id,
    **kwargs,
)
# self.eos: ['<|endoftext|>', '<|endofmask|>', '</s>', '\nif __name__', '\ndef main(', '\nprint(', '\n```\n']
```

It seems the above part caused the sudden stop of the generation. However, the EOS doesn't appear in the outputs. Do you know if this is expected? I did a similar setup for vLLM in the same codebase. vLLM may have a different setup for the tokenizer; should I cross-post this issue?
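To rule out the stop-text handling, one alternative is to drop the external StopSequencer and use transformers' own stopping criteria. A rough sketch, reusing `self.model`, `self.tokenizer`, `self.eos` and `input_tokens` from the snippet above (note that, unlike a stop sequencer, this does not trim the stop text from the returned sequence, so it would actually show up in the output):

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTexts(StoppingCriteria):
    """Stop as soon as any stop text appears in the decoded continuation."""

    def __init__(self, tokenizer, stop_texts, prompt_len):
        self.tokenizer = tokenizer
        self.stop_texts = stop_texts
        self.prompt_len = prompt_len

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # For simplicity this only inspects the first sequence in the batch.
        generated = self.tokenizer.decode(input_ids[0, self.prompt_len:])
        return any(stop in generated for stop in self.stop_texts)

criteria = StoppingCriteriaList([StopOnTexts(self.tokenizer, self.eos, input_tokens.size(-1))])
outputs = self.model.generate(
    input_tokens,
    max_new_tokens=self.max_new_tokens,
    stopping_criteria=criteria,
    pad_token_id=self.tokenizer.eos_token_id,
)
```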
Sorry, I find it a bit hard to follow: which outputs are you looking for the EOS in?
Oh, it's a follow-up investigation of #31890 (comment). I removed the part of
@itazap @ArthurZucker Is there going to be a PR to make the correct setup the default? 👀 And as vLLM possibly has a different implementation for the tokenizer, should we inform them?
Sorry, what do you mean by "correct setup"? 🤗
I mean the initialization of
I think the best option is to open a PR on the Hub! I can't merge it for you, but pinging the authors with this issue should be good already! WDYT?
We can't change the default on our side because it would break a lot of other models.
@ArthurZucker Do you mean that the setup of
@ArthurZucker Pinging again to check on the answers 👀
In terms of which models this relates to, there is no list, but it can be determined by checking the
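If the field being referred to is the `legacy` entry in each repo's tokenizer_config.json (my assumption), a small script can do that check per model (a hypothetical helper, not an official API):

```python
import json
from huggingface_hub import hf_hub_download

def tokenizer_legacy_flag(model_id: str):
    """Return the `legacy` value from a repo's tokenizer_config.json (None if unset)."""
    path = hf_hub_download(model_id, "tokenizer_config.json")
    with open(path) as f:
        return json.load(f).get("legacy")

# Gated repos (e.g. meta-llama, codegemma) require an access token.
print(tokenizer_legacy_flag("01-ai/Yi-1.5-9B-Chat"))
# Repos that leave `legacy` unset or set it to True are the candidates that
# may need the fix; repos that already set it to False should be unaffected.
```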
Thanks @itazap! Will the rule-based checking be added in the next Transformers release? I can still use
But @terryyz, IMO the best solution is to merge the fix in the model(s) that you are using. We could indeed check the transformers versions and force the fix, and new models like Llama3 or Gemma already have this set.
But we can do a deprecation cycle: if `legacy` is set to `None`, we warn that in the next release it will default to `False` instead of `True`. cc @itazap wdyt?
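For illustration, the proposed deprecation cycle could look something like this (a hypothetical sketch of the warning logic, not the actual transformers code):

```python
import warnings

def resolve_legacy(legacy=None):
    # Hypothetical sketch: keep the old default for now, but warn users who
    # haven't set `legacy` that the default will flip in a future release.
    if legacy is None:
        warnings.warn(
            "`legacy` was not explicitly set. It currently defaults to True, but "
            "will default to False in a future release; pass legacy=False to opt "
            "in to the fixed behaviour now.",
            FutureWarning,
        )
        legacy = True
    return legacy
```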
@ArthurZucker As you said, it's hard to predict which is more aligned with current users' expectations! With the warning, it should be okay!
Thanks @ArthurZucker and @itazap! That makes sense :) I'm now adding an optional argument for
Thanks for your patience! Let us know if you have any further issues! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Hi @ArthurZucker and the team,
We have noticed a severe performance degradation when trying to reproduce the BigCodeBench results: bigcode-project/bigcodebench#21. I initially thought it was due to the update of vllm, but actually it was not. The greatly affected models include the 01-ai/Yi-1.5, google/codegemma, and meta-llama/Meta-Llama-3 families. Specifically, I observed some weird extra spaces in the outputs of 01-ai/Yi-1.5-9B-Chat. These spaces only appear starting from v4.42.*, not before. I also tried to generate without vllm later, which is not documented in the issue thread. It indeed appears to be an issue related to transformers, not vllm.

cc @takkyu2, who reported the original issue.
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The steps to reproduce the problems, and the corresponding results, are described in bigcode-project/bigcodebench#21.
Expected behavior
The model output quality should be similar to the ones in pregenerated LLM outputs.