Add eleuther_eval as recipe #549
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/549. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 79d1c55 with merge base 81d93bb. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
✅ Deploy Preview for torchtune-preview ready!
Force-pushed from ae5067a to d18a818
    from lm_eval.evaluator import evaluate
    from lm_eval.models.huggingface import HFLM
    from lm_eval.tasks import get_task_dict
except ImportError:
This catches the case where the user has an incorrect version installed, or no version installed at all.
So this is basically our workaround so that (a) we can still run eleuther eval as a recipe and (b) we do not have to take every dep on god's green earth in our package?
Oui - I think it's reasonable that certain recipes may require other dependencies and we can make sure it's called out, but we ourselves don't have to depend on it in our torchtune pkg.
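For anyone reading along, a minimal sketch of what this guard looks like — the log message and exit behavior here are illustrative assumptions, not necessarily the exact code in this PR:

```python
import logging
import sys

logger = logging.getLogger(__name__)

try:
    # All three imports exist in lm-eval 0.4.x; an older install (or no
    # install at all) raises ImportError on one of them.
    from lm_eval.evaluator import evaluate
    from lm_eval.models.huggingface import HFLM
    from lm_eval.tasks import get_task_dict
except ImportError:
    logger.error(
        "Recipe requires EleutherAI Eval Harness v0.4. "
        "Please install with `pip install lm_eval==0.4.*`"
    )
    sys.exit(1)
```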
    self,
    model: TransformerDecoder,
    tokenizer: Tokenizer,
    *,
KWARGS!
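(For context: the bare `*` above makes everything after it keyword-only. A tiny illustration with placeholder arguments, not the recipe's actual signature:)

```python
class _EvalWrapperSketch:
    def __init__(self, model, tokenizer, *, device="cuda", max_seq_length=4096):
        # `device` and `max_seq_length` are placeholders; the point is that
        # anything after `*` must be passed by keyword.
        self._model = model
        self._tokenizer = tokenizer
        self._device = device
        self._max_seq_length = max_seq_length

# _EvalWrapperSketch(m, t, "cuda", 4096)                        -> TypeError
# _EvalWrapperSketch(m, t, device="cuda", max_seq_length=4096)  -> OK
```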
return self._model(inps)

def _model_generate(self, *args, **kwargs):
    raise RuntimeError(
Found this out the hard way. As a rough estimate, 85% of all tasks in Eleuther are not free generation, so we have the majority of our bases covered. However, if people open a bunch of issues asking for this, we can add a generation method.
What's the reason to fail on this?
Not sure I understand the question completely, but here are some possible responses:
Why raise an error here instead of letting it fail in Eleuther? Better UX, more descriptive message.
Why not implement something for generation now? To keep this PR as simple as possible; given the limited number of free-generation tasks, I don't think it's a priority.
The second one - got you!
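To make the fail-fast point concrete, a sketch of the behavior being described (the exact wording is an assumption):

```python
def _model_generate(self, *args, **kwargs):
    # Free-generation tasks are intentionally unsupported for now; raising a
    # descriptive error here beats an opaque failure deep inside lm_eval.
    raise RuntimeError(
        "This recipe does not currently support tasks that require free "
        "generation. Please choose a task that only needs loglikelihood "
        "scoring, or open an issue requesting generation support."
    )
```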
    max_seq_length=self._cfg.max_seq_length,
)

# Task initialization API changed between v0.4.1 and 0.4.2
Copied this from gpt-fast
Can we give a bit more detail here? And maybe an explicit type of exception?
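One way this could be tightened up per the comment above — more context in the comment plus an explicit exception type. This assumes `lm_eval.tasks.initialize_tasks()` exists in v0.4.1 but was removed in later 0.4.x releases; treat it as a sketch rather than the final code:

```python
import lm_eval

# Task initialization API changed between v0.4.1 and 0.4.2: older releases
# need an explicit initialize_tasks() call before tasks can be looked up.
try:
    lm_eval.tasks.initialize_tasks()
except AttributeError:
    # Newer lm-eval releases register tasks automatically, so the function
    # no longer exists and there is nothing to do here.
    pass
```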
Force-pushed from 157aa37 to 2d30e98
@@ -65,6 +65,7 @@ jobs:
      run: |
        python -m pip install -r requirements.txt
        python -m pip install -r dev-requirements.txt
        python -m pip install lm-eval==0.4.*
How did we decide on this particular version?
It's used by lit-gpt and gpt-fast and is the most up-to-date release. Plus, significant "hacks" would be needed to support both 0.3 and 0.4, e.g. using BaseLM instead of HFLM.
@@ -25,7 +25,8 @@ The library provides:
- Native-PyTorch implementations of popular LLMs
- Support for checkpoints in various formats, including checkpoints in HF format
- Training recipes for popular fine-tuning techniques with reference benchmarks and comprehensive correctness checks
- Integration with HuggingFace Datasets for training and EleutherAI's Eval Harness for evaluation
- Evaluation of trained models with EleutherAI Eval Harness
😎
tests/recipes/test_eleuther_eval.py
Outdated
pkg_path = Path(torchtune.__file__).parent.parent.absolute()
EVAL_CONFIG_PATH = Path.joinpath(
    pkg_path, "recipes", "configs", "llama2_eleuther_eval.yaml"
)
Why do we need this now? Just use the recipe name only, no?
tests/recipes/test_eleuther_eval.py
Outdated
pkg_path, "recipes", "configs", "llama2_eleuther_eval.yaml" | ||
) | ||
|
||
models.small_test_ckpt_tune = llama2_small_test_ckpt |
Merge with testing PR changes in #537 and u will live a happy and fulfilling life
Merge bad, rebase good.
wait yall don't use rebase?
assert "'acc,none': 0.3" in log_out | ||
|
||
@pytest.fixture | ||
def hide_available_pkg(self, monkeypatch): |
nice trick
TIHI
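For readers wondering what the trick is: a fixture like this typically patches `builtins.__import__` so that importing `lm_eval` fails even when it is installed, which lets the test hit the recipe's ImportError branch. A standalone sketch (the PR's actual fixture may differ in detail):

```python
import builtins

import pytest


@pytest.fixture
def hide_available_pkg(monkeypatch):
    """Make `import lm_eval` raise ImportError even if the package is installed."""
    real_import = builtins.__import__

    def fake_import(name, *args, **kwargs):
        if name == "lm_eval" or name.startswith("lm_eval."):
            raise ImportError(f"pretending {name} is not installed")
        return real_import(name, *args, **kwargs)

    monkeypatch.setattr(builtins, "__import__", fake_import)
```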
checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/llama/
  checkpoint_files: [finetuned_model.pt]
Where is this coming from? Might be nice to align with our output checkpoint file format so that it works out of the box
It depends on the epoch, so not sure what the nice solution is here.
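To illustrate the "depends on the epoch" point, a hypothetical config where the file name carries an epoch suffix — the actual naming scheme written by the training checkpointer is an assumption here, not confirmed by this PR:

```yaml
checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/llama/
  # Hypothetical epoch-suffixed name: if training writes one file per epoch,
  # the user has to point at the specific epoch they want to evaluate.
  checkpoint_files: [finetuned_model_0.pt]
```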
recipes/eleuther_eval.py
Outdated
def device(self):
    return self._device

def tok_encode(self, string: str, **kwargs):
This choice of param name hurts me
copy pasta from gpt-fast
I know, I think it's also coming from lm_eval tbh
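For what it's worth, the rename is trivial; a sketch assuming the method just forwards to the wrapped tokenizer's `encode` (the flag names are an assumption):

```python
def tok_encode(self, text: str, **kwargs):
    # `text` instead of `string`; assumes the underlying tokenizer exposes
    # encode(text, add_bos=..., add_eos=...).
    return self._tokenizer.encode(text=text, add_bos=False, add_eos=False)
```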
recipes/eleuther_eval.py
Outdated
)


_DEFAULT_TASKS = ["hellaswag"]
Tbh I am wondering if we should even have this. Like yes it's convenient to not have to write it out but it's already caused us some confusion. Like in what case is a user just gonna say "yolo I'll just run eval without even thinking about the task"
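If the default were dropped (or demoted to a pure fallback), task selection would come straight from the config; a rough sketch, assuming an OmegaConf-style `cfg` and lm-eval 0.4's `get_task_dict`:

```python
# Hypothetical: force the user to be explicit about what they evaluate
# instead of silently defaulting to hellaswag.
if "tasks" not in cfg:
    raise ValueError("Specify at least one task, e.g. tasks=['hellaswag']")
task_dict = get_task_dict(list(cfg.tasks))
```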
Force-pushed from 004a877 to 4888bc7
Context
Adding EleutherAI Eval Harness as a recipe and removing it as a hard dependency.
Changelog
Testing
CI
Local testing
- Ours with meta-llama/Llama-2-7b: 1.41s, 0.39 acc, command: tune eleuther_eval --config eleuther_eval tasks=["truthfulqa_mc2"]
- Eleuther Harness directly with meta-llama/Llama-2-7b-hf: 1.50s, 0.39 acc, command: lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks truthfulqa_mc2 --device cuda:0 --batch_size 32