Add eleuther_eval as recipe #549
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/549. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 79d1c55 with merge base 81d93bb. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
✅ Deploy Preview for torchtune-preview ready!
Force-pushed from ae5067a to d18a818
    from lm_eval.evaluator import evaluate
    from lm_eval.models.huggingface import HFLM
    from lm_eval.tasks import get_task_dict
except ImportError:
This catches the case where the user has an incorrect version installed, or no version installed at all.
So this is basically our workaround so that (a) we can still run eleuther eval as a recipe and (b) we do not have to take every dep on god's green earth in our package?
Oui - I think it's reasonable that certain recipes may require other dependencies and we can make sure it's called out, but we ourselves don't have to depend on it in our torchtune pkg.
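For anyone reading along, a minimal sketch of what this guard looks like — the log message and exit behavior here are illustrative assumptions, not necessarily the exact code in this PR:

```python
import logging
import sys

logger = logging.getLogger(__name__)

try:
    # All three imports exist in lm-eval 0.4.x; an older install (or no
    # install at all) raises ImportError on one of them.
    from lm_eval.evaluator import evaluate
    from lm_eval.models.huggingface import HFLM
    from lm_eval.tasks import get_task_dict
except ImportError:
    logger.error(
        "Recipe requires EleutherAI Eval Harness v0.4. "
        "Please install with `pip install lm_eval==0.4.*`"
    )
    sys.exit(1)
```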
    self,
    model: TransformerDecoder,
    tokenizer: Tokenizer,
    *,
KWARGS!
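(For context: the bare `*` above makes everything after it keyword-only. A tiny illustration with placeholder arguments, not the recipe's actual signature:)

```python
class _EvalWrapperSketch:
    def __init__(self, model, tokenizer, *, device="cuda", max_seq_length=4096):
        # `device` and `max_seq_length` are placeholders; the point is that
        # anything after `*` must be passed by keyword.
        self._model = model
        self._tokenizer = tokenizer
        self._device = device
        self._max_seq_length = max_seq_length

# _EvalWrapperSketch(m, t, "cuda", 4096)                        -> TypeError
# _EvalWrapperSketch(m, t, device="cuda", max_seq_length=4096)  -> OK
```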
return self._model(inps)

def _model_generate(self, *args, **kwargs):
    raise RuntimeError(
Found this out the hard way. As a rough estimate, 85% of all tasks in Eleuther are not free generation, so we have the majority of our bases covered. However, if people open a bunch of issues asking for this, we can add a generation method.
What's the reason to fail on this?
Not sure I understand the question completely, but here are some possible responses:
Why raise an error here instead of letting it fail in Eleuther? Better UX, more descriptive message.
Why not implement something for generation now? To keep this PR as simple as possible; given the limited number of free-generation tasks, I don't think it's a priority.
The second one - got you!
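To make the fail-fast point concrete, a sketch of the behavior being described (the exact wording is an assumption):

```python
def _model_generate(self, *args, **kwargs):
    # Free-generation tasks are intentionally unsupported for now; raising a
    # descriptive error here beats an opaque failure deep inside lm_eval.
    raise RuntimeError(
        "This recipe does not currently support tasks that require free "
        "generation. Please choose a task that only needs loglikelihood "
        "scoring, or open an issue requesting generation support."
    )
```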
    max_seq_length=self._cfg.max_seq_length,
)

# Task initialization API changed between v0.4.1 and 0.4.2
Copied this from gpt-fast
Can we give a bit more detail here? And maybe an explicit type of exception?
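One way this could be tightened up per the comment above — more context in the comment plus an explicit exception type. This assumes `lm_eval.tasks.initialize_tasks()` exists in v0.4.1 but was removed in later 0.4.x releases; treat it as a sketch rather than the final code:

```python
import lm_eval

# Task initialization API changed between v0.4.1 and 0.4.2: older releases
# need an explicit initialize_tasks() call before tasks can be looked up.
try:
    lm_eval.tasks.initialize_tasks()
except AttributeError:
    # Newer lm-eval releases register tasks automatically, so the function
    # no longer exists and there is nothing to do here.
    pass
```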
Force-pushed from 157aa37 to 2d30e98
@@ -65,6 +65,7 @@ jobs:
      run: |
        python -m pip install -r requirements.txt
        python -m pip install -r dev-requirements.txt
        python -m pip install lm-eval==0.4.*
How did we decide on this particular version?
It's used by lit-gpt and gpt-fast and is the most up-to-date release. Plus, significant "hacks" would be needed to support both 0.3 and 0.4, e.g. using BaseLM instead of HFLM.
@@ -25,7 +25,8 @@ The library provides:
- Native-PyTorch implementations of popular LLMs
- Support for checkpoints in various formats, including checkpoints in HF format
- Training recipes for popular fine-tuning techniques with reference benchmarks and comprehensive correctness checks
- Integration with HuggingFace Datasets for training and EleutherAI's Eval Harness for evaluation
- Evaluation of trained models with EleutherAI Eval Harness
😎
tests/recipes/test_eleuther_eval.py
Outdated
pkg_path = Path(torchtune.__file__).parent.parent.absolute()
EVAL_CONFIG_PATH = Path.joinpath(
    pkg_path, "recipes", "configs", "llama2_eleuther_eval.yaml"
)
Why do we need this now? Just use the recipe name only, no?
tests/recipes/test_eleuther_eval.py
Outdated
pkg_path, "recipes", "configs", "llama2_eleuther_eval.yaml" | ||
) | ||
|
||
models.small_test_ckpt_tune = llama2_small_test_ckpt |
Merge with testing PR changes in #537 and u will live a happy and fulfilling life
Merge bad, rebase good.
wait yall don't use rebase?
assert "'acc,none': 0.3" in log_out | ||
|
||
@pytest.fixture | ||
def hide_available_pkg(self, monkeypatch): |
nice trick
TIHI
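For readers wondering what the trick is: a fixture like this typically patches `builtins.__import__` so that importing `lm_eval` fails even when it is installed, which lets the test hit the recipe's ImportError branch. A standalone sketch (the PR's actual fixture may differ in detail):

```python
import builtins

import pytest


@pytest.fixture
def hide_available_pkg(monkeypatch):
    """Make `import lm_eval` raise ImportError even if the package is installed."""
    real_import = builtins.__import__

    def fake_import(name, *args, **kwargs):
        if name == "lm_eval" or name.startswith("lm_eval."):
            raise ImportError(f"pretending {name} is not installed")
        return real_import(name, *args, **kwargs)

    monkeypatch.setattr(builtins, "__import__", fake_import)
```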
checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/llama/
  checkpoint_files: [finetuned_model.pt]
Where is this coming from? Might be nice to align with our output checkpoint file format so that it works out of the box
It depends on the epoch, so not sure what the nice solution is here.
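To illustrate the "depends on the epoch" point, a hypothetical config where the file name carries an epoch suffix — the actual naming scheme written by the training checkpointer is an assumption here, not confirmed by this PR:

```yaml
checkpointer:
  _component_: torchtune.utils.FullModelTorchTuneCheckpointer
  checkpoint_dir: /tmp/llama/
  # Hypothetical epoch-suffixed name: if training writes one file per epoch,
  # the user has to point at the specific epoch they want to evaluate.
  checkpoint_files: [finetuned_model_0.pt]
```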
recipes/eleuther_eval.py
Outdated
def device(self):
    return self._device

def tok_encode(self, string: str, **kwargs):
This choice of param name hurts me
copy pasta from gpt-fast
I know, I think it's also coming from lm_eval tbh
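For what it's worth, the rename is trivial; a sketch assuming the method just forwards to the wrapped tokenizer's `encode` (the flag names are an assumption):

```python
def tok_encode(self, text: str, **kwargs):
    # `text` instead of `string`; assumes the underlying tokenizer exposes
    # encode(text, add_bos=..., add_eos=...).
    return self._tokenizer.encode(text=text, add_bos=False, add_eos=False)
```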
recipes/eleuther_eval.py
Outdated
)


_DEFAULT_TASKS = ["hellaswag"]
Tbh I am wondering if we should even have this. Like yes it's convenient to not have to write it out but it's already caused us some confusion. Like in what case is a user just gonna say "yolo I'll just run eval without even thinking about the task"
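If the default were dropped (or demoted to a pure fallback), task selection would come straight from the config; a rough sketch, assuming an OmegaConf-style `cfg` and lm-eval 0.4's `get_task_dict`:

```python
# Hypothetical: force the user to be explicit about what they evaluate
# instead of silently defaulting to hellaswag.
if "tasks" not in cfg:
    raise ValueError("Specify at least one task, e.g. tasks=['hellaswag']")
task_dict = get_task_dict(list(cfg.tasks))
```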
Force-pushed from 004a877 to 4888bc7
Context
Adding EleutherAI Eval Harness as a recipe and removing it as a hard dependency.
Changelog
Testing
CI
Local testing
- Ours with meta-llama/Llama-2-7b: 1.41s, 0.39 acc, command: tune eleuther_eval --config eleuther_eval tasks=["truthfulqa_mc2"]
- Eleuther Harness directly with meta-llama/Llama-2-7b-hf: 1.50s, 0.39 acc, command: lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-hf --tasks truthfulqa_mc2 --device cuda:0 --batch_size 32