
When set num_beams in GenerationConfig, stop_strings parameter has no effect #34843

Open
ZYM66 opened this issue Nov 21, 2024 · 7 comments · May be fixed by #35023 or #35802

@ZYM66

ZYM66 commented Nov 21, 2024

System Info

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.46.2
  • Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
  • Python version: 3.10.15
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA GeForce RTX 4070 SUPER

Who can help?

@gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code

from transformers import GenerationConfig, AutoTokenizer, AutoModelForCausalLM

generate_config = GenerationConfig(
    num_beams=3,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=3,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    length_penalty=1.0,
    stop_strings=":",
    return_dict_in_generate=True,
    max_new_tokens=500,
    output_logits=True
)


tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).cuda()

PROMPT = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"


tokens = tokenizer(PROMPT, return_tensors="pt").to(model.device)
out = model.generate(**tokens, generation_config=generate_config, tokenizer=tokenizer)

print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))

Out

Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? To determine the total number of clips Natalia sold in April and May, we need to follow these steps:

1. Identify the number of clips sold in April.
2. Calculate the number of clips sold in May.
3. Add the number of clips sold in April and May together.

First, we know that Natalia sold 48 clips in April. Next, we need to find out how many clips she sold in May. According to the problem, she sold half as many clips in May as she did in April. Therefore, we calculate the number of clips sold in May as follows:
\[
\text{Number of clips sold in May} = \frac{48}{2} = 24
\]

Now, we add the number of clips sold in April and May together to find the total number of clips sold:
\[
\text{Total number of clips sold} = 48 + 24 = 72
\]

Thus, the total number of clips Natalia sold in April and May is \boxed{72}.

Expected behavior

When I set num_beams=1, stop_strings works as expected.
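
For reference, a minimal sketch of the single-beam configuration where the stop string does fire (same script as in the Reproduction section, only the beam settings changed; MODEL_PATH is left undefined exactly as above):

# Same script as above, but with a single beam; in this setup the ":" stop string
# terminates generation as expected. Only the GenerationConfig differs.
generate_config = GenerationConfig(
    num_beams=1,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=1,
    top_k=50,
    top_p=0.95,
    stop_strings=":",
    return_dict_in_generate=True,
    max_new_tokens=500,
)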

@ZYM66 ZYM66 added the bug label Nov 21, 2024
@Rocketknight1
Member

@gante since this is a crossover stop strings / beam search issue, can you take a look? Feel free to ping me if you think it's an issue in the stop string handling, though, and I can try to fix it.

@secrettoad
Contributor

Seems to me as though any stopping criteria being met should halt generation, as opposed to all needing to be met simultaneously. If that is incorrect, please let me know and I can work on a different type of fix.
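
As a side note, a tiny Python sketch of that "any criterion halts generation" semantics (illustrative only, not transformers' internal implementation):

# Illustrative only: generation halts as soon as any single criterion fires,
# rather than waiting for all of them to be met simultaneously.
def should_stop(criteria, sequence) -> bool:
    return any(criterion(sequence) for criterion in criteria)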


@gante
Member

gante commented Jan 15, 2025

@Rocketknight1 I see what's going on.

From beam search's perspective, we may hit the stopping criteria with one of the beams, but we will continue generating because there are other valid beams to continue. However, we do not tag the completed beam as finished -- there is actually no mechanism to do so in our current PyTorch implementation of beam search. As such, internally, we can see the stopping criteria being detected, but we continue the sequence anyway.

While I could patch this behavior (with non-trivial changes), I'm going to prioritize the refactor of beam search instead. I've been meaning to refactor beam search in the FLAX/TF style, as it is easier to read/maintain and contains vectorized operations. I will then apply the fix for this issue as part of that refactor.


For future reference, code for reproduction (the original example is missing a model)

from transformers import GenerationConfig, AutoTokenizer, AutoModelForCausalLM

generate_config = GenerationConfig(
    num_beams=3,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=3,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.0,
    length_penalty=1.0,
    stop_strings=":",
    return_dict_in_generate=True,
    max_new_tokens=500,
    output_logits=True
)


tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct").cuda()

PROMPT = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"


tokens = tokenizer(PROMPT, return_tensors="pt").to(model.device)
out = model.generate(**tokens, generation_config=generate_config, tokenizer=tokenizer)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
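
Until that refactor lands, a possible user-side workaround is to truncate the decoded beam-search output at the first occurrence of a stop string after generation. A minimal sketch (truncate_at_stop_strings is a hypothetical helper, not part of transformers):

def truncate_at_stop_strings(text: str, stop_strings) -> str:
    # Cut the text at the earliest stop string, keeping the stop string itself.
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx + len(stop))
    return text[:cut]

# Usage with the repro above: everything after the first ":" is dropped.
decoded = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
print(truncate_at_stop_strings(decoded, [":"]))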

@Rocketknight1
Member

@gante yes, sounds good! I can definitely see some awkward interactions here. Let me know if we need to refactor the stop strings check - right now, it only checks the tokens at the end of a sequence, because that's sufficient when tokens are generated one at a time. However, with beam search that assumption might be invalid.
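
For illustration, a rough sketch of what a whole-sequence check could look like as a custom criterion (hypothetical class; note it would not by itself fix the beam-search issue described above, since the existing criterion already fires but is then ignored):

import torch
from transformers import StoppingCriteria

class WholeSequenceStopStringCriteria(StoppingCriteria):
    # Hypothetical: decodes each candidate sequence and searches the full text for a
    # stop string, instead of only the newly generated tail.
    def __init__(self, tokenizer, stop_strings):
        self.tokenizer = tokenizer
        self.stop_strings = list(stop_strings)

    def __call__(self, input_ids, scores, **kwargs):
        texts = self.tokenizer.batch_decode(input_ids, skip_special_tokens=True)
        done = [any(s in t for s in self.stop_strings) for t in texts]
        # Recent transformers versions expect one boolean per batch row.
        return torch.tensor(done, dtype=torch.bool, device=input_ids.device)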

@gante gante linked a pull request Jan 20, 2025 that will close this issue
@gante
Member

gante commented Jan 20, 2025

@Rocketknight1 nope, no need to refactor any stopping criteria :) This has more to do with the old beam search code not expecting stopping criteria other than a specific EOS token and max length than anything else.

BTW, I've linked above a draft PR that will vectorize beam search and start digesting any stopping criteria.

