Unitxt evaluator #156

Roni-Friedman · 2024-10-22T13:12:13Z

This PR adds a Unitxt evaluator.

It is to be complemented by adding Unitxt as a benchmark in the instructlab repo.

mergify · 2024-10-22T13:13:15Z

This pull request has merge conflicts that must be resolved before it can be
merged. @Roni-Friedman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

nathan-weinberg · 2024-10-22T16:03:20Z

@Mergifyio refresh

mergify · 2024-10-22T16:03:57Z

refresh

✅ Pull request refreshed

danmcp

Thanks for the PR!

danmcp · 2024-10-29T12:58:14Z

tests/test_unitxt.py

+    print("===> Executing 'test_unitxt'...")
+    try:
+        model_path = "instructlab/granite-7b-lab"
+        unitxt_recipe = "card=cards.wnli,template=templates.classification.multi_class.relation.default,max_train_instances=5,loader_limit=20,num_demos=3,demos_pool_size=10"


This seems like too much detail to ask for from the CLI in one long string. Are all the values things we want users to specify, or could some of them be implementation details? For the ones users do need to specify, I think we need to break them down into individual params.

Roni-Friedman · 2024-10-29

Users should have the flexibility to specify all the details. I can suggest the following options; tell me which you prefer:
1.a - write the recipe in a file and pass only its path on the CLI
1.b - write the recipe in a file, but in JSON format (e.g. {card: ..., template: ...}), which is friendlier
2 - add a few dedicated parameters such as card and template, plus a free-text parameter, since there are many customizations a unitxt user may want to make
3 - keep it as is :)
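
To make options 1.b and 2 concrete, here is a minimal sketch of converting between the single recipe string used in the test and a structured form that could be stored as JSON or mapped to individual parameters. `parse_recipe` and `build_recipe` are hypothetical helper names, not part of unitxt or of this PR, and the sketch assumes that no value contains a comma or needs quoting.

```python
import json


def parse_recipe(recipe: str) -> dict[str, str]:
    """Split a 'key=value,key=value' recipe string into a dict (hypothetical helper)."""
    params: dict[str, str] = {}
    for part in recipe.split(","):
        key, sep, value = part.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"malformed recipe entry: {part!r}")
        params[key.strip()] = value.strip()
    return params


def build_recipe(params: dict[str, str]) -> str:
    """Join a dict back into the single-string form, preserving key order."""
    return ",".join(f"{key}={value}" for key, value in params.items())


def recipe_to_json(recipe: str) -> str:
    """Render the recipe as JSON, e.g. for storing it in a file (option 1.b)."""
    return json.dumps(parse_recipe(recipe), indent=2)
```

With helpers like these, a CLI could accept a JSON file or separate `card` and `template` options and still hand one string to the evaluator internally.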

danmcp · 2024-10-29T13:10:13Z

src/instructlab/eval/unitxt.py

+        unitxt_recipe: str,
+    ):
+        unitxt_task = self.assign_task_name()
+        tasks_dir = self.assign_tasks_dir(unitxt_task)


Is this using a local directory? If so, it needs to be built off a param, like output_dir in mt_bench.

Roni-Friedman · 2024-10-29

This is a temporary directory that is deleted at the end of the evaluation process. Would you prefer that the user specify an output dir? It contains nothing of use to the user, just the files lm-eval requires to run unitxt.

danmcp · 2024-10-29

I think it would make sense to use the output_dir so that a stray local directory doesn't confuse the user. Also, since you do want to remove it at the end, the create/remove logic should probably be in a try/finally block.

Roni-Friedman · 2024-10-29

So a user would specify an output dir but find that it doesn't exist at the end of the run?
If the user specifies it, I guess I will not delete it, right?

danmcp · 2024-10-29

> so a user would specify an output dir but will find it doesn't exist at the end of the run?

The current output dir is a working dir for mt_bench.

> If the user specifies it, I guess I will not delete it, right?

I was expecting you would create your directory inside the output_dir and then delete it when you are done. Or you could delete the directory before you start if there is some value in leaving it around.
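
The create-inside-output_dir-then-delete pattern, combined with the earlier try/finally suggestion, could look roughly like the sketch below. It is an illustration only, not the PR's code: `working_dir` and the `unitxt_temp_` prefix are assumed names, while the `eval_output` default is the mt_bench value quoted later in this thread.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def working_dir(output_dir: str = "eval_output", prefix: str = "unitxt_temp_"):
    """Create a scratch directory under output_dir and always remove it afterwards."""
    # The root output dir is kept; only the scratch subdirectory is temporary.
    os.makedirs(output_dir, exist_ok=True)
    path = tempfile.mkdtemp(prefix=prefix, dir=output_dir)
    try:
        yield path
    finally:
        # Runs on success and on error, so no task files are left behind.
        shutil.rmtree(path, ignore_errors=True)
```

An evaluator's run method could then wrap the lm-eval call in `with working_dir(self.output_dir) as tasks_dir:` and write the generated task files into `tasks_dir`.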

jwm4 · 2024-10-30

Would it make sense to put all this into an in-memory filesystem? In general, it is best to avoid unnecessary disk writes, especially for something that's likely to run in a cloud service, where it may or may not have write permissions on some sort of disk.

Roni-Friedman · 2024-11-07

@jwm4 An in-memory filesystem does not seem to work well here: the directory is later also accessed by lm-eval inside the mmlu class, and I don't want to start passing this filesystem around (unless the owners support such an overall change).

Roni-Friedman · 2024-11-07

> I think it would make sense to use the output_dir so it doesn't confuse the user in a local directory.

@danmcp I'm not entirely sure what you mean here. I see mt_bench has output_dir: str = "eval_output", but that directory is created only if one runs mt_bench without passing a different output dir. I am doing the following, although I'm not sure it makes a lot of sense:

def assign_tasks_dir(self, task_name):
    return os.path.join("eval_output", f"{TEMP_DIR_PREFIX}_{task_name}")

danmcp · 2024-11-07

Apologies if my request wasn't clear. My suggestion was to follow mt_bench:

- Accept the root dir to use for output as a variable
- Default it to the same root dir as mt_bench
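
Read literally, the two points above amount to something like the following sketch. The class name `UnitxtDirs`, the value of `TEMP_DIR_PREFIX`, and the UUID-based task name are assumptions made for illustration; only the `eval_output` default and the `assign_tasks_dir` shape come from this thread.

```python
import os
import uuid

# Placeholder value; the PR's actual constant is not shown in this thread.
TEMP_DIR_PREFIX = "unitxt_temp"


class UnitxtDirs:
    """Hypothetical holder for the directory logic only, not the PR's evaluator class."""

    def __init__(self, output_dir: str = "eval_output"):
        # Same default root as the mt_bench value quoted above.
        self.output_dir = output_dir

    def assign_task_name(self) -> str:
        # Assumed implementation: a unique name per run to avoid collisions.
        return str(uuid.uuid4())

    def assign_tasks_dir(self, task_name: str) -> str:
        # Built off the configurable root instead of a hard-coded "eval_output".
        return os.path.join(self.output_dir, f"{TEMP_DIR_PREFIX}_{task_name}")
```

Callers that pass nothing get the shared default root, and callers that pass their own `output_dir` get the temporary task directory created beneath it.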

alimaredia · 2024-10-29T15:30:23Z

@Roni-Friedman Could you explain in the description what benefit the Unitxt evaluator would have? Why would a user run the unitxt evaluator over just using the MMLUBranchEvaluator?

Roni-Friedman · 2024-11-10T08:18:57Z

> @Roni-Friedman Could you explain in the description what benefit the Unitxt evaluator would have? Why would a user run the unitxt evaluator over just using the MMLUBranchEvaluator?

Let's discuss this in our meeting as well. I initially wrote a generic evaluator that can use all of unitxt's features, but my understanding now is that there are two clear use cases, and the PR should be adjusted accordingly:
1 - run evaluation on user data
2 - run bluebench

mergify · 2025-02-03T17:08:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. @Roni-Friedman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-07-08T04:43:38Z

This pull request has merge conflicts that must be resolved before it can be
merged. @Roni-Friedman please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the testing label Oct 22, 2024

mergify bot added needs-rebase ci-failure labels Oct 22, 2024

Roni-Friedman mentioned this pull request Oct 22, 2024

Add Unitxt benchmark instructlab/instructlab#2517

Closed

6 tasks

Roni-Friedman force-pushed the unitxt_eval branch from b8229e2 to abf67d5 October 27, 2024 08:03

mergify bot removed the ci-failure label Oct 27, 2024

Roni-Friedman force-pushed the unitxt_eval branch 2 times, most recently from a6d43e7 to c22397b October 27, 2024 09:44

mergify bot added ci-failure and removed needs-rebase labels Oct 27, 2024

Roni-Friedman force-pushed the unitxt_eval branch from 64aff53 to 9ca436c October 28, 2024 09:26

mergify bot added ci-failure and removed ci-failure labels Oct 28, 2024

Roni-Friedman force-pushed the unitxt_eval branch from 1154444 to 75e934b October 28, 2024 09:55

mergify bot removed the ci-failure label Oct 28, 2024

Roni-Friedman marked this pull request as ready for review October 28, 2024 11:36

Roni-Friedman changed the title ~~Unitxt eval~~ Unitxt evaluator Oct 28, 2024

yoavkatz reviewed Oct 28, 2024 (tests/test_unitxt.py, outdated)

danmcp requested changes Oct 29, 2024

danmcp reviewed Oct 31, 2024 (tests/test_unitxt.py, outdated)

danmcp reviewed Oct 31, 2024 (src/instructlab/eval/mmlu.py, outdated)

mergify bot added the ci-failure label Nov 7, 2024

Roni-Friedman added 2 commits November 7, 2024 14:15

initial unitxt evaluator

53beb55

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

create unitxt files on the fly

19bad59

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

Roni-Friedman added 11 commits November 7, 2024 14:15

typo in import

bfc3c6a

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

remove unneeded print

1ed4256

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

tasks -> [task]

6135b91

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

temp dir prefix

5976ce4

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

create+delete temp files in run()

3bdf3e3

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

format: lint

4b11357

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

asserting unitxt evaluation returns a score

93bdd73

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

format: ruff

d70b84f

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

review comments

2faee08

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

simplify run_mmlu return value

777132a

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

make sure temp files are deleted w finally

d918732

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

Roni-Friedman force-pushed the unitxt_eval branch from 905c81e to bbe7108 November 7, 2024 12:18

mergify bot added ci-failure and removed ci-failure labels Nov 7, 2024

test in scripts

7c9e44c

Signed-off-by: Roni Friedman-Melamed <Roni.friedman-melamed@il.ibm.com>

Roni-Friedman force-pushed the unitxt_eval branch from bbe7108 to 7c9e44c November 7, 2024 12:31

mergify bot removed the ci-failure label Nov 7, 2024

nathan-weinberg linked an issue Nov 14, 2024 that may be closed by this pull request

Evaluation of user data using Unitxt #176

Open

mairin mentioned this pull request Dec 17, 2024

InstructLab Maintainer nomination for Bill Murdock instructlab/instructlab#2931

Closed

mergify bot added the needs-rebase label Feb 3, 2025

mergify bot removed the needs-rebase label Jul 8, 2025

mergify bot added the needs-rebase label Jul 8, 2025
