add new truncation strategy #2300

Draft
wants to merge 2 commits into
base: main

Conversation

artemorloff
Contributor

New feature: "smart truncation". This PR:

  • allows for different truncation modes, not only cutting off ALL tokens from the left (currently the only available option, and one that does not respect special tokens like BOS for Gemma: BOS is the first token to be truncated :) )
  • performs truncation while constructing the task, so the user knows about it before the sequences are fed into the model
  • adds a notification system that tells users how many requests have been truncated, and to what extent. This helps judge whether the metrics can be trusted. If only one token is left of each test question, the estimate is incorrect: the model has not even seen the real test questions
  • makes logging include the sequences that the LLM actually received, not the full requests from the tasks. Logging seems meant to record the real requests that were passed into the model, not content the LLM may never have encountered
  • suggests symbol-wise truncation for APIs with no tokenizer. Until now these APIs could not be assessed via the harness if at least one task sample was too long.
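
The symbol-wise truncation and the notification idea from the list above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `TruncationReport`, `truncate_chars_left` and `truncate_requests` are hypothetical names, and cutting from the left against a character budget is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TruncationReport:
    """Hypothetical container for truncation statistics."""
    total: int = 0
    truncated: int = 0
    chars_removed: int = 0

    def summary(self) -> str:
        return (
            f"{self.truncated}/{self.total} requests truncated, "
            f"{self.chars_removed} characters removed in total"
        )


def truncate_chars_left(text: str, max_chars: int) -> Tuple[str, int]:
    """Keep at most the last `max_chars` characters; return (text, removed)."""
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    removed = max(0, len(text) - max_chars)
    return text[removed:], removed


def truncate_requests(
    requests: List[str], max_chars: int
) -> Tuple[List[str], TruncationReport]:
    """Truncate every request and record how much was cut."""
    report = TruncationReport()
    kept = []
    for text in requests:
        short, removed = truncate_chars_left(text, max_chars)
        kept.append(short)
        report.total += 1
        if removed:
            report.truncated += 1
            report.chars_removed += removed
    return kept, report
```

The report can be printed once per task so that a user can decide whether heavily truncated runs should be trusted.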

Right now the fewshots strategy is implemented and tested. It cuts off entire fewshots while respecting special tokens. The returned status contains either an error string or the number of fewshots that remain.
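
A rough sketch of what such a fewshots strategy might look like is given below. It is not the implementation from this PR: the function name `drop_fewshots_to_fit`, its parameters, and the choice to drop the earliest fewshot first are all assumptions made for illustration. `reserved_tokens` stands in for special tokens such as BOS (and, where needed, the target length).

```python
from typing import Callable, List, Tuple, Union


def drop_fewshots_to_fit(
    fewshots: List[str],
    question: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
    reserved_tokens: int = 0,
) -> Tuple[List[str], Union[int, str]]:
    """Drop whole fewshot examples until the prompt fits.

    Returns (kept_fewshots, status), where status is the number of
    fewshots that remain, or an error string if even the bare question
    does not fit.
    """
    # Budget left for fewshots after the question and reserved tokens.
    budget = max_tokens - reserved_tokens - count_tokens(question)
    if budget < 0:
        return [], "question alone exceeds the context budget"

    kept = list(fewshots)
    # Assumption: the earliest fewshot is dropped first.
    while kept and sum(count_tokens(shot) for shot in kept) > budget:
        kept.pop(0)
    return kept, len(kept)
```

For example, with a whitespace token counter, `drop_fewshots_to_fit(["a b c", "d e", "f"], "q1 q2", 6, lambda s: len(s.split()), reserved_tokens=1)` keeps the last two fewshots and reports a status of 2.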

@artemorloff
Contributor Author

@haileyschoelkopf @baberabb this is only a draft. There are a lot of TODOs and suboptimal things, but I want to get feedback. To me this seems like a completely new feature that lets users manage truncation. Working on it. There will be default, transformers and user_prompt_only truncations.

@baberabb
Contributor

Hi @artemorloff. Thanks very much for this! Will try providing feedback in a couple of days.

@artemorloff
Contributor Author

@baberabb thank you! feedback is exactly what I need
this is a draft, but some major "features" are:

  • fewshot_context should always return a list
  • the system prompt should be defined only once per task (it is always the same within a task)
  • truncation is done only after forming the arguments of the Instance, since only there do we know for certain what the target looks like, which affects truncation (the length of the target has to be taken into account for multiple_choice tasks)
  • maybe make a func that forms the arguments, then truncate, and then form the Instance
  • right now truncation makes "extra" computations to preserve backward compatibility. It could return tokens as well as strings, so there would be no need to tokenize sequences in the model class
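
The "form arguments, then truncate, then form the Instance" order from the list above might look roughly like this. Everything here is a hypothetical stand-in and not the harness API: `SketchInstance`, `build_arguments`, `truncate_arguments` and `build_instances` are made-up names, and whitespace splitting replaces a real tokenizer. The sketch also assumes that the context budget is derived from the longest target, so every choice of a multiple_choice sample sees the same context.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SketchInstance:
    """Hypothetical stand-in for a request instance."""
    arguments: Tuple[str, str]  # (context, target)
    truncated: bool


def build_arguments(context: str, choices: List[str]) -> List[Tuple[str, str]]:
    """Step 1: form one (context, target) pair per answer choice."""
    return [(context, choice) for choice in choices]


def truncate_arguments(
    arguments: List[Tuple[str, str]], max_tokens: int
) -> List[Tuple[str, str, bool]]:
    """Step 2: left-truncate contexts, leaving room for the longest target."""
    longest_target = max(len(target.split()) for _, target in arguments)
    budget = max_tokens - longest_target
    if budget < 0:
        raise ValueError("longest target alone exceeds max_tokens")
    result = []
    for context, target in arguments:
        tokens = context.split()  # whitespace tokens as a stand-in
        kept = tokens[max(0, len(tokens) - budget):]
        result.append((" ".join(kept), target, len(kept) < len(tokens)))
    return result


def build_instances(
    context: str, choices: List[str], max_tokens: int
) -> List[SketchInstance]:
    """Step 3: form instances from the already truncated arguments."""
    truncated = truncate_arguments(build_arguments(context, choices), max_tokens)
    return [SketchInstance((ctx, tgt), flag) for ctx, tgt, flag in truncated]
```

Because truncation happens before the instances exist, whatever gets logged from them is already the text the model would receive.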

2 participants