[Test] fix test_evaluator.py #675
Conversation
This is excellent! Would it be possible to parameterize the tests so that we can launch only the ones corresponding to tasks that have changed? That would massively cut down on the numbers of tests that need to be run each time and would make running more than cursory correctness testing plausible.
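For concreteness, the parameterization could look roughly like the sketch below, where the CI job passes the selected task names in through an environment variable; the variable name, fallback task, and test body are assumptions rather than anything in this PR.

```python
# Hypothetical sketch: let CI pass the changed task names via an environment
# variable (TASKS_TO_TEST is an assumed name) and generate one test per task.
import os

import pytest

TASKS_TO_TEST = [t for t in os.environ.get("TASKS_TO_TEST", "").split(",") if t]


@pytest.mark.parametrize("task_name", TASKS_TO_TEST or ["arc_easy"])  # fallback is illustrative
def test_evaluator_on_task(task_name):
    # The actual evaluation call is elided; the point is that pytest expands
    # this into one test per task, so CI can run only the tasks that changed.
    assert task_name
```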
I am weakly in favor of this, assuming it’s not computationally prohibitive. We could also potentially connect it to a Kubernetes cluster so that we can run tests on actual GPUs. Not sure how to do that, but I’m 90% sure it’s possible for someone who knows what they’re doing.
Thanks! Most of the implementation was there already. Pretty minor edits.
Yeah, that shouldn't be too hard. Just need to get the logic straight. Maybe check which task folder(s) the changes were made in and get the task name(s) from there. Will see if I can come up with something.
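One possible way to get the task names from the changed paths, assuming task folders live under lm_eval/tasks/<task_name>/ (the layout and helper name are assumptions, not the final logic):

```python
# Hypothetical sketch: map changed file paths (e.g. from the CI diff step)
# to the task folders they touch.
from pathlib import Path


def tasks_from_changed_files(changed_files: list[str]) -> set[str]:
    """Return task folder names that contain at least one changed file.

    Assumes tasks live under lm_eval/tasks/<task_name>/..., which may not
    match the final layout of the refactor.
    """
    tasks = set()
    for path in changed_files:
        parts = Path(path).parts
        if len(parts) > 2 and parts[:2] == ("lm_eval", "tasks"):
            tasks.add(parts[2])
    return tasks
```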
Yeah, you could definitely link up a custom runner. I was thinking of something quick here to run on every PR, just to check that the API is running as it should.
If you have any experience setting up such a runner for GPU tests, that'd be very helpful to discuss! I'm sure it can be done, but I don't currently know enough about GitHub Actions to know how yet.
Hey @haileyschoelkopf @lintangsutawika. Got a working implementation ready. Right now this runs on every PR/push:
This looks really really great, thanks so much for working on it!
I think, if you're alright with it, we can merge this for now so that it can run on big-refactor PRs and open another PR for continued testing work--we should also fix the errors raised by the linter but that doesn't have to delay this PR.
id: changed-tasks
uses: tj-actions/changed-files@v37.1.2
with:
  files_yaml: |
We should consider selecting a set of tasks that we'll treat as "modified" if lm_eval/api/task.py is edited, perhaps.
Added! It checks if any files changed in api/*, then runs a couple more tasks. Feel free to change them. I feel like we should check different kinds of tasks, but I mostly know the multiple-choice ones.
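As a rough sketch of that check (building on the path-mapping helper sketched earlier; the task names here are only examples, not a vetted selection):

```python
# Hypothetical continuation of the earlier sketch: if anything under the core
# API changed, add a small default set of tasks so shared code gets exercised.
API_CHANGE_TASKS = {"arc_easy", "hellaswag", "piqa"}  # example multiple-choice tasks


def tasks_to_run(changed_files: list[str]) -> set[str]:
    tasks = tasks_from_changed_files(changed_files)
    if any(p.startswith("lm_eval/api/") for p in changed_files):
        tasks |= API_CHANGE_TASKS
    return tasks
```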
Thanks, looks good! We'll want to add a suitable greedy_until task at some point, but those choices look good for now.
lol. That's mostly type checking from mypy. That's all you 😂. The pre-commit failing was weird though. Looks like the tests are all working!
Yes, this was the royal we--I'll have to go fix those errors :) Looks like pre-commit passes, merging this! Thanks a bunch, this is really great progress!
Merged a8bdf20 into EleutherAI:big-refactor. Linked to #656.
I got the test running at least. Using a single task for the time being, not sure how extensive you guys want the tests to be.
I was thinking, for the CI, rather than using the dummy model, we could use a small model from HF on a single task and assert that the results are as expected. Thoughts?
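A minimal sketch of what that could look like, assuming the refactored harness exposes something like lm_eval.simple_evaluate; the model, task, metric key, and reference number below are placeholders, not verified values.

```python
# Rough sketch only: run a tiny HF model on one task and compare against a
# previously recorded reference value instead of using the dummy model.
import pytest

import lm_eval


def test_small_hf_model_regression():
    results = lm_eval.simple_evaluate(  # assumed entry point of the refactor
        model="hf",
        model_args="pretrained=EleutherAI/pythia-70m",  # any small HF checkpoint
        tasks=["lambada_openai"],
        limit=20,  # keep CI runtime short
        device="cpu",
    )
    # The metric key and reference value are placeholders; use pytest.approx
    # so small numerical drift across versions doesn't break CI.
    acc = results["results"]["lambada_openai"]["acc"]
    assert acc == pytest.approx(0.35, abs=0.05)
```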