
[Test] fix test_evaluator.py #675

Merged

Conversation

@baberabb (Contributor)

linked to #656

I got the test running at least. Using a single task for the time being, not sure how extensive you guys want the tests to be.

I was thinking for the CI, rather than using the dummy, we could use a small model from HF on a single task and assert that the results are as expected. Thoughts?
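
A rough sketch of what such a CI test could look like (not code from this PR; the simple_evaluate entry point, the tiny model sshleifer/tiny-gpt2, the task name, and the limit are assumptions or placeholders that would need checking against the refactor):

```python
# Hypothetical sketch, not code from this PR: evaluate one small task on a tiny
# HF model and assert the run produced sane results.
from lm_eval import evaluator


def test_tiny_hf_model_single_task():
    results = evaluator.simple_evaluate(
        model="hf",  # assumed model type string; may differ by branch/version
        model_args="pretrained=sshleifer/tiny-gpt2",  # placeholder tiny model
        tasks=["lambada_openai"],  # placeholder task
        limit=10,  # keep CI cheap: only a handful of documents
    )
    # Don't pin an exact score; just check the task reported numeric metrics.
    metrics = results["results"]["lambada_openai"]
    assert any(isinstance(value, float) for value in metrics.values())
```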

@StellaAthena (Member)

This is excellent!

Would it be possible to parameterize the tests so that we can launch only the ones corresponding to tasks that have changed? That would massively cut down on the numbers of tests that need to be run each time and would make running more than cursory correctness testing plausible.

> I was thinking for the CI, rather than using the dummy, we could use a small model from HF on a single task and assert that the results are as expected. Thoughts?

I am weakly in favor of this, assuming it’s not computationally prohibitive. We could also potentially connect it to a Kubernetes cluster so that we can run tests on actual GPUs. Not sure how to do that, but I’m 90% sure it’s possible for someone who knows what they’re doing.

@baberabb (Contributor, Author)

Thanks! Most of the implementation was there already. Pretty minor edits.

> Would it be possible to parameterize the tests so that we can launch only the ones corresponding to tasks that have changed? That would massively cut down on the numbers of tests that need to be run each time and would make running more than cursory correctness testing plausible.

Yeah, that shouldn't be too hard, just need to get the logic straight. Maybe check which task folder(s) the changes were made in and get the task name(s) from there. Will see if I can come up with something.
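
One possible shape for that logic, purely as a sketch: map the changed file paths reported by CI to the task folders they belong to. The lm_eval/tasks/<task_name>/ layout and the helper below are assumptions for illustration, not code from this PR.

```python
# Hypothetical helper: turn the list of changed file paths (e.g. the output of
# a changed-files action) into the names of the task folders they live in.
from pathlib import PurePosixPath


def changed_task_folders(changed_files):
    folders = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only count .yaml / .py edits that sit inside a task subfolder.
        if (
            len(parts) >= 4
            and parts[:2] == ("lm_eval", "tasks")
            and parts[-1].endswith((".yaml", ".py"))
        ):
            folders.add(parts[2])
    return folders


# Example: an edit to lm_eval/tasks/arc/arc_easy.yaml maps to {"arc"}.
print(changed_task_folders(["lm_eval/tasks/arc/arc_easy.yaml", "README.md"]))
```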

> I am weakly in favor of this, assuming it’s not computationally prohibitive. We could also potentially connect it to a Kubernetes cluster so that we can run tests on actual GPUs. Not sure how to do that, but I’m 90% sure it’s possible for someone who knows what they’re doing.

Yeah, you could definitely link up a custom runner. I was thinking of something fast and quick here to run on every PR, just to check that the API is running as it should.

@haileyschoelkopf (Collaborator)

If you have any experience setting up such a runner for GPU tests, that'd be very helpful to discuss! I'm sure it can be done but I don't currently have the knowledge of github actions to know how yet.

@StellaAthena added the "bug: Something isn't working." label on Jul 14, 2023
@baberabb marked this pull request as draft on July 14, 2023 21:13
@baberabb marked this pull request as ready for review on July 17, 2023 03:05
@baberabb (Contributor, Author)

Hey @haileyschoelkopf @lintangsutawika. Got a working implementation ready. Right now this runs on every PR/push:

  • the evaluator
  • checks that the task methods are working properly
  • if a .yaml or .py file is amended in a tasks folder, it also checks that the class methods of those tasks are working (see the sketch after this list)
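
For illustration, a minimal version of that per-task check might look like the sketch below; the method names (doc_to_text, doc_to_target, validation_docs) and the placeholder task list are assumptions about the task API, not the test actually added in this PR.

```python
# Hypothetical smoke test: load each task to check and call a couple of its
# methods on a few documents. Method names assume the harness task API and
# may differ on the big-refactor branch.
from itertools import islice

import pytest

from lm_eval import tasks

TASKS_TO_CHECK = ["arc_easy"]  # placeholder; CI would fill this from changed files


@pytest.mark.parametrize("task_name", TASKS_TO_CHECK)
def test_task_methods(task_name):
    task = tasks.get_task_dict([task_name])[task_name]
    assert task.has_validation_docs() or task.has_test_docs()
    docs = task.validation_docs() if task.has_validation_docs() else task.test_docs()
    for doc in islice(docs, 5):
        assert task.doc_to_text(doc) is not None
        assert task.doc_to_target(doc) is not None
```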

@haileyschoelkopf (Collaborator) left a comment


This looks really really great, thanks so much for working on it!

I think, if you're alright with it, we can merge this for now so that it can run on big-refactor PRs and open another PR for continued testing work--we should also fix the errors raised by the linter but that doesn't have to delay this PR.

id: changed-tasks
uses: tj-actions/changed-files@v37.1.2
with:
  files_yaml: |
Collaborator

We should consider selecting a set of tasks that we'll treat as "modified" if lm_eval/api/task.py is edited, perhaps.

Contributor Author

Added! It checks if any files changed in api/*, then runs a couple more tasks. Feel free to change them. I feel like it should check different kinds of tasks, but I mostly know multiple choice.

Collaborator

Thanks, looks good! Will want to add 1 greedy_until task that's suitable at some point but those choices look good for now.

@baberabb (Contributor, Author) commented on Jul 17, 2023

> we should also fix the errors raised by the linter but that doesn't have to delay this PR.

lol. That's mostly type checking from mypy. That's all you 😂. The pre-commit failing was weird though.

Looks like the tests are all working!

@haileyschoelkopf (Collaborator) commented on Jul 18, 2023

> lol. That's mostly type checking from mypy. That's all you 😂. The pre-commit failing was weird though.

Yes, this was the royal we--I'll have to go fix those errors :) Looks like pre-commit passes, merging this! Thanks a bunch, this is really great progress!

@haileyschoelkopf merged commit a8bdf20 into EleutherAI:big-refactor on Jul 18, 2023
3 of 4 checks passed
@baberabb deleted the big-refactor_testeval branch on August 2, 2023 13:11