add new truncation strategy #2300

artemorloff · 2024-09-15T16:15:54Z

New feature: "smart truncation". This PR:

allows different truncation modes, not only cutting off ALL tokens from the left (currently the only available option, which does not even respect special tokens like BOS for Gemma: BOS is the first token to be truncated :) )
performs truncation while constructing the task, so the user knows about it before the sequences are fed into the model
adds a notification system that tells users how many requests have been truncated, and to what extent. This helps judge whether the metrics can be trusted. If only one token is left of each test question, the estimate is invalid: the model has not even seen the real test questions
makes logging record the sequences that the LLM actually received, not the full requests from the tasks. Logging seems to be meant to record the real requests passed into the model, not text that the LLM may never have encountered
suggests symbol-wise (character-level) truncation for APIs with no tokenizer. Until now these APIs could not be evaluated via the harness if at least one task sample was too long.
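To make the last and third points concrete, here is a minimal sketch of character-level truncation combined with a truncation report. It is not the code in this PR: `TruncationReport` and `truncate_chars_left` are hypothetical names, and cutting from the left is just one possible mode.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TruncationReport:
    """Collects per-request truncation statistics (hypothetical helper)."""

    total: int = 0
    truncated: int = 0
    kept_fractions: List[float] = field(default_factory=list)

    def record(self, original_len: int, kept_len: int) -> None:
        # Count every request; remember how much survived if it was cut.
        self.total += 1
        if kept_len < original_len:
            self.truncated += 1
            self.kept_fractions.append(kept_len / original_len)

    def summary(self) -> str:
        if not self.truncated:
            return f"0/{self.total} requests truncated"
        worst = min(self.kept_fractions)
        return (
            f"{self.truncated}/{self.total} requests truncated; "
            f"worst case kept {worst:.0%} of the original text"
        )


def truncate_chars_left(text: str, max_chars: int, report: TruncationReport) -> str:
    """Keep the last max_chars characters, dropping the oldest text first."""
    kept = text if len(text) <= max_chars else text[len(text) - max_chars:]
    report.record(len(text), len(kept))
    return kept
```

A summary like this could be printed once per task, so a user can see at a glance whether the reported metric was computed on heavily truncated inputs.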

Right now the fewshots strategy is implemented and tested. It cuts off entire few-shot examples while respecting special tokens. The returned status contains either an error string or the number of few-shot examples that remain.
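Roughly, the idea can be sketched as below. This is an illustration under assumptions, not the PR's implementation: tokens are plain integer lists, the names `truncate_fewshots` and `build_sequence` are hypothetical, and the sketch assumes that examples at the end of the few-shot list are dropped first.

```python
from typing import List, Optional, Union


def truncate_fewshots(
    fewshots: List[List[int]],
    question: List[int],
    max_length: int,
    bos_token_id: Optional[int] = None,
) -> Union[int, str]:
    """Return how many whole few-shot examples fit, or an error string."""
    # Tokens that must always survive: the optional BOS plus the test question.
    fixed = len(question) + (1 if bos_token_id is not None else 0)
    if fixed > max_length:
        return "error: the question alone exceeds max_length"
    budget = max_length - fixed
    kept = 0
    for shot in fewshots:
        # Never split an example: stop at the first one that does not fit.
        if len(shot) > budget:
            break
        budget -= len(shot)
        kept += 1
    return kept


def build_sequence(
    fewshots: List[List[int]],
    question: List[int],
    kept: int,
    bos_token_id: Optional[int] = None,
) -> List[int]:
    """Assemble BOS + the kept few-shot examples + the question."""
    prefix = [bos_token_id] if bos_token_id is not None else []
    return prefix + [tok for shot in fewshots[:kept] for tok in shot] + question
```

The caller checks whether the status is a string (an error) or an integer before building the final sequence.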

artemorloff · 2024-09-15T16:17:48Z

@haileyschoelkopf @baberabb this is only a draft, with a lot of TODOs and suboptimal parts, but I want to get feedback. To me this looks like a completely new feature that lets users manage truncation. I'm working on it. There will be default, transformers and user_prompt_only truncation modes.

baberabb · 2024-09-17T17:38:00Z

Hi @artemorloff. Thanks very much for this! Will try to provide feedback in a couple of days.

artemorloff · 2024-09-18T08:58:09Z

@baberabb thank you! Feedback is exactly what I need.
This is a draft, but some major "features" are:

fewshot_context should always return a list
the system prompt should be defined only once for each task (it is always the same within a task)
truncation is done only after forming the arguments of the Instance, since only there do we know for certain what the target looks like, and the target affects truncation (the length of the target has to be taken into account for multiple_choice tasks)
maybe add a function that forms the arguments, then truncates, and then forms the Instance
right now truncation does "extra" computation to preserve backward compatibility. It could return tokens as well as strings, so there would be no need to tokenize sequences in the model class
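The "form the arguments, then truncate, then form the Instance" order for multiple_choice could look roughly like this. It is a sketch only: it uses plain token lists and tuples instead of the harness's Instance class, all function names are hypothetical, and reserving room for the longest choice so that every choice sees the same context is my assumption here.

```python
from typing import List, Optional, Tuple


def context_budget(max_length: int, choices: List[List[int]]) -> int:
    """Tokens left for the context once the longest target is reserved."""
    longest = max((len(c) for c in choices), default=0)
    return max_length - longest


def truncate_context_left(
    context: List[int], budget: int, bos_token_id: Optional[int] = None
) -> List[int]:
    """Cut the context from the left, keeping a leading BOS if present."""
    if len(context) <= budget:
        return list(context)
    if budget <= 0:
        return []
    has_bos = bos_token_id is not None and context[0] == bos_token_id
    if has_bos:
        # Reserve one slot for BOS, fill the rest with the newest tokens.
        tail_len = budget - 1
        return [bos_token_id] + context[len(context) - tail_len:]
    return context[len(context) - budget:]


def build_requests(
    context: List[int],
    choices: List[List[int]],
    max_length: int,
    bos_token_id: Optional[int] = None,
) -> List[Tuple[List[int], List[int]]]:
    """Step 1: arguments are known. Step 2: truncate. Step 3: form requests."""
    budget = context_budget(max_length, choices)
    ctx = truncate_context_left(context, budget, bos_token_id)
    return [(ctx, choice) for choice in choices]
```

Because the truncated context is already tokenized at this point, the same tokens could be handed to the model class directly instead of being re-tokenized there.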

add new truncation

6d00279

Merge remote-tracking branch 'upstream/main' into feature/enhanced_tr…

b86caa3

…uncation
