Add seq2seq eval benchmark callback #1274
Conversation
Is this intended as a replacement of the log prediction callback? Looks good so far. I assume you want to finish off the last two items before we merge?
I believe the callbacks serve similar purposes but differ in their objective: the log callback generates a few (e.g. 5) samples for logging purposes, while the causal_lm benchmark generates completions for the whole eval dataset to compute metrics for generative tasks. Ideally these two would share the same code for the generation.
I'll have a look at the case when …
@winglian please have a look now and let me know whether this looks good to you. Do you have further suggestions?
lgtm
Similar to #441, this PR adds an evaluation benchmark for generative tasks such as machine translation.
Description
This additional evaluation benchmark is self-contained and can be triggered via the `do_causal_lm_eval` configuration option. It first generates completions for every sample in the eval set (with generation length configured via `eval_max_new_tokens`) and then scores these against the reference completions via 🤗 Evaluate. Metrics can be chosen among a subset of supported metrics and are skipped with a warning if the corresponding libraries are not available.

A few notes on this PR:

- Part of the generation logic overlaps with the existing prediction-logging callback (`LogPredictionCallback`); ideally the two would share the same generation code.
- The metric libraries are optional dependencies (should they be added as a `[metrics]` entry to the `extras_require`?).

Tagging @winglian and @tmm1 who maybe can help? Cheers.
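To make the scoring step above concrete, here is a minimal sketch of how generated completions can be compared against references with 🤗 Evaluate. This is not the PR's actual code: `score_completions` is a hypothetical helper, and the prediction/reference lists are assumed to come from the generation pass.

```python
# Minimal sketch of scoring generated completions with 🤗 Evaluate.
# Not the PR's implementation; `score_completions` is a hypothetical helper.
import evaluate


def score_completions(predictions, references, metric_names=("sacrebleu",)):
    """Score `predictions` against `references` with the requested metrics."""
    results = {}
    for name in metric_names:
        try:
            metric = evaluate.load(name)
        except Exception as err:
            # Mirrors the behaviour described above: skip a metric with a
            # warning when its backing library is not installed.
            print(f"Skipping metric '{name}': {err}")
            continue
        # Note: the expected `references` format varies per metric
        # (e.g. sacrebleu accepts one or more references per prediction).
        results[name] = metric.compute(predictions=predictions, references=references)
    return results
```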
Motivation and Context
Enable Axolotl to compute generative evaluation metrics to better support generative tasks (e.g. machine translation, language modelling, etc.).
How has this been tested?
End-to-end SFT fine-tuning of an existing Llama-2-hf model with and without the feature.
Screenshots (if appropriate)
Types of changes
- New `CausalLMBenchEvalCallback` callback for generative task evaluation (see the sketch after this list)
- New `do_causal_lm_eval` option in `AxolotlTrainingArguments`
- Renamed `eval_table_max_new_tokens` into `eval_max_new_tokens`
- Handling of the `sample_packing=True` case
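For orientation only, the sketch below shows where such a callback attaches in the Hugging Face `Trainer` lifecycle via `TrainerCallback.on_evaluate`. It is not the `CausalLMBenchEvalCallback` added by this PR; `GenerativeEvalCallback` is a hypothetical name and the loop is heavily simplified (no prompt/label splitting, sample packing, or distributed handling).

```python
# Hedged sketch of a Trainer callback hook for generative evaluation.
# NOT the CausalLMBenchEvalCallback from this PR, just an outline of where
# such logic attaches in the Hugging Face Trainer lifecycle.
import torch
from transformers import TrainerCallback


class GenerativeEvalCallback(TrainerCallback):
    def __init__(self, tokenizer, max_new_tokens=128):
        self.tokenizer = tokenizer
        self.max_new_tokens = max_new_tokens

    def on_evaluate(self, args, state, control, model=None, eval_dataloader=None, **kwargs):
        if model is None or eval_dataloader is None:
            return
        model.eval()
        predictions = []
        with torch.no_grad():
            for batch in eval_dataloader:
                # Generate completions for the batch; a real implementation would
                # also split prompts from labels and collect reference texts.
                outputs = model.generate(
                    input_ids=batch["input_ids"].to(model.device),
                    attention_mask=batch["attention_mask"].to(model.device),
                    max_new_tokens=self.max_new_tokens,
                )
                predictions.extend(
                    self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
                )
        # Scoring the predictions against references (e.g. with 🤗 Evaluate)
        # would follow here.
```

In practice such a callback would be registered with `trainer.add_callback(...)`; in this PR the equivalent behaviour is gated by the `do_causal_lm_eval` option.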
Social Handles (Optional)