Add seq2seq eval benchmark callback #1274
Conversation
Is this intended as a replacement of the log prediction callback? Looks good so far. I assume you want to finish off the last two items before we merge?
I believe the callbacks serve similar purposes but differ in their objective: the log callback generates a few (e.g. 5) samples for logging purposes, while the causal_lm benchmark generates completions for the whole eval dataset to compute metrics for generative tasks. Ideally these two would share the same code for the generation.
I'll have a look at the case when …
@winglian please have a look now and let me know whether this looks good to you. Do you have further suggestions?
lgtm
Similar to #441, this PR adds an evaluation benchmark for generative tasks such as machine translation.
Description
This additional evaluation benchmark is self-contained and can be triggered via the `do_causal_lm_eval` configuration option. It first generates completions for every sample in the eval set (with generation length configured via `eval_max_new_tokens`) and then scores these against the reference completions via 🤗 Evaluate. Metrics can be chosen among a subset of supported metrics and are skipped with a warning if the corresponding libraries are not available.

A few notes on this PR:

- Part of the generation logic overlaps with the existing prediction-logging callback (`LogPredictionCallback`); ideally the two would share the same generation code.
- The metric libraries are optional dependencies (should they be added as a `[metrics]` entry to the `extras_require`?).

Tagging @winglian and @tmm1 who maybe can help? Cheers.
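To make the scoring step above concrete, here is a minimal sketch of how generated completions can be compared against references with 🤗 Evaluate. This is not the PR's actual code: `score_completions` is a hypothetical helper, and the prediction/reference lists are assumed to come from the generation pass.

```python
# Minimal sketch of scoring generated completions with 🤗 Evaluate.
# Not the PR's implementation; `score_completions` is a hypothetical helper.
import evaluate


def score_completions(predictions, references, metric_names=("sacrebleu",)):
    """Score `predictions` against `references` with the requested metrics."""
    results = {}
    for name in metric_names:
        try:
            metric = evaluate.load(name)
        except Exception as err:
            # Mirrors the behaviour described above: skip a metric with a
            # warning when its backing library is not installed.
            print(f"Skipping metric '{name}': {err}")
            continue
        # Note: the expected `references` format varies per metric
        # (e.g. sacrebleu accepts one or more references per prediction).
        results[name] = metric.compute(predictions=predictions, references=references)
    return results
```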
Motivation and Context
Enable Axolotl to compute generative evaluation metrics to better support generative tasks (e.g. machine translation, language modelling, etc.).
How has this been tested?
End-to-end SFT fine-tuning of an existing Llama-2-hf model with and without the feature.
Screenshots (if appropriate)
Types of changes
- New `CausalLMBenchEvalCallback` callback for generative task evaluation (see the sketch after this list)
- New `do_causal_lm_eval` option in `AxolotlTrainingArguments`
- Renamed `eval_table_max_new_tokens` into `eval_max_new_tokens`
- Handling of the `sample_packing=True` case
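For orientation only, the sketch below shows where such a callback attaches in the Hugging Face `Trainer` lifecycle via `TrainerCallback.on_evaluate`. It is not the `CausalLMBenchEvalCallback` added by this PR; `GenerativeEvalCallback` is a hypothetical name and the loop is heavily simplified (no prompt/label splitting, sample packing, or distributed handling).

```python
# Hedged sketch of a Trainer callback hook for generative evaluation.
# NOT the CausalLMBenchEvalCallback from this PR, just an outline of where
# such logic attaches in the Hugging Face Trainer lifecycle.
import torch
from transformers import TrainerCallback


class GenerativeEvalCallback(TrainerCallback):
    def __init__(self, tokenizer, max_new_tokens=128):
        self.tokenizer = tokenizer
        self.max_new_tokens = max_new_tokens

    def on_evaluate(self, args, state, control, model=None, eval_dataloader=None, **kwargs):
        if model is None or eval_dataloader is None:
            return
        model.eval()
        predictions = []
        with torch.no_grad():
            for batch in eval_dataloader:
                # Generate completions for the batch; a real implementation would
                # also split prompts from labels and collect reference texts.
                outputs = model.generate(
                    input_ids=batch["input_ids"].to(model.device),
                    attention_mask=batch["attention_mask"].to(model.device),
                    max_new_tokens=self.max_new_tokens,
                )
                predictions.extend(
                    self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
                )
        # Scoring the predictions against references (e.g. with 🤗 Evaluate)
        # would follow here.
```

In practice such a callback would be registered with `trainer.add_callback(...)`; in this PR the equivalent behaviour is gated by the `do_causal_lm_eval` option.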
Social Handles (Optional)