
Reproducibility of old runs (Legacy Server?) #12

Open
PGijsbers opened this issue Sep 2, 2024 · 6 comments

@PGijsbers

Goal:
For reproducibility, it would be great if we could recompute any run. For instance, a run with an old scikit-learn version.

Challenges
This is a complex problem. It might, for instance, be the case that the scikit-learn wheel is no longer available, or not available for more modern Python versions. Old versions might also have serious security vulnerabilities.

Options

  • Setting up Legacy Servers, run by OpenML, containing snapshots of the system for any run.
  • Making it possible to run a Legacy Server locally. This way, you can compute old runs on your own computer.
  • Only save all the metadata, without offering the possibility to run it. If you want to recompute an old run, we give you all the information on what you would need; recreating the environment is your own responsibility. This would save OpenML a lot of complexity (see the sketch below for what that metadata could include).
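
To make option 3 concrete, here is a minimal sketch of what the stored environment metadata could look like; the structure and field names are illustrative, not an existing OpenML schema:

```python
# Sketch: capture the environment metadata that option 3 would store per run.
# The structure is illustrative, not an existing OpenML schema.
import json
import platform
from importlib import metadata

def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed package versions."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "packages": {
            dist.metadata["Name"]: dist.version
            for dist in metadata.distributions()
        },
    }

if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Storing something like this per run would let users recreate the environment themselves (e.g. in a virtualenv or container) without OpenML having to host anything.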
@joaquinvanschoren

At this moment, I think we can only really promise option 3 (which is fair).

Option 1 (running many legacy servers) won't be maintainable.

For Option 2, simple Dockerfiles won't work. We'll need to containerize (using Docker or Singularity) the entire environment, including compiled binaries. That would make sense if we have a regular release cycle for the entire server, which itself would make more sense after we've moved everything to Python. Then we could auto-generate these containers with every major release and link runs to an OpenML release, but this is not low-hanging fruit.
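
For concreteness, auto-generating such a container per release could look roughly like the sketch below; the image name, base image, and pinned versions are placeholders, not an existing OpenML build step:

```python
# Sketch: auto-generate and build a pinned container for one server release.
# Image name, base image, release tag, and pinned versions are placeholders.
import subprocess
from pathlib import Path

RELEASE = "2024.09"                                      # hypothetical release tag
PINNED = {"scikit-learn": "1.5.1", "openml": "0.14.2"}   # illustrative pins

# Write a Dockerfile that pins the full Python environment for this release.
dockerfile = "FROM python:3.11-slim\n" + "".join(
    f"RUN pip install --no-cache-dir {pkg}=={version}\n"
    for pkg, version in PINNED.items()
)
Path("Dockerfile.release").write_text(dockerfile)

# Build and tag the image so runs can later be linked to this release.
subprocess.run(
    ["docker", "build", "-f", "Dockerfile.release",
     "-t", f"openml/run-env:{RELEASE}", "."],
    check=True,
)
```

Linking each run to the release tag it was computed under would then be enough to pull the matching image later.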

Thoughts?

@PGijsbers

As I understand it, aren't we very close to option 2? @josvandervelde Dockerized all server services, so all we would need to do is tag current versions. Then, combined with older openml-python Docker images, we mostly make good on the promise, at least as far as scikit-learn-based runs go.
Insofar as there are runs which cannot be reproduced with that, I would fall back to option 3: make sure the metadata stays available (which, as far as I am concerned, was never really in contention anyway).
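
As a rough sketch of what that looks like in practice with those images (the image name and tag below are hypothetical placeholders, not actual published tags):

```python
# Sketch: verify the pinned environment inside an older openml-python image.
# The image name/tag is a hypothetical placeholder.
import subprocess

IMAGE = "openml/openml-python:0.12"  # placeholder tag for an older release

# Confirm the image ships the expected scikit-learn / openml-python versions.
subprocess.run(
    [
        "docker", "run", "--rm", IMAGE,
        "python", "-c",
        "import sklearn, openml; print(sklearn.__version__, openml.__version__)",
    ],
    check=True,
)
```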

@joaquinvanschoren

joaquinvanschoren commented Sep 6, 2024

Maybe. We would also need to store compiled containers, right? With wheels breaking and all that.

@PGijsbers

The images (should) come with fully pre-installed environments. I am not entirely sure what you are referring to.

@joaquinvanschoren

Then that should be fine. If there are multiple versions of dependent libraries (e.g. scikit-learn, torch, ...) between OpenML releases, would that still work?

@PGijsbers

Gotcha. That's indeed an issue. Packages will likely stay available on PyPI, but we can build images for (sklearn × openml) tuples (or other dependencies) to cover it a bit better. I am not sure where we should draw the line in that case.
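
To make the combinatorics concrete, here is a sketch of enumerating those (sklearn × openml) tuples as candidate image tags; the version lists and the tag naming scheme are purely illustrative:

```python
# Sketch: enumerate (scikit-learn, openml-python) pairs as candidate image tags.
# Version lists and the tag naming scheme are illustrative only.
from itertools import product

SKLEARN_VERSIONS = ["1.3.2", "1.4.2", "1.5.1"]
OPENML_VERSIONS = ["0.13.1", "0.14.2"]

tags = [
    f"openml/run-env:sklearn{skl}-openml{oml}"
    for skl, oml in product(SKLEARN_VERSIONS, OPENML_VERSIONS)
]
for tag in tags:
    print(tag)  # each tag corresponds to one candidate image build
```

The number of images grows multiplicatively with every dependency we add, which is exactly where drawing the line gets hard.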
