Formalise support for old repositories when breaking changes are made #1220

Open
manics opened this issue Nov 21, 2022 · 1 comment
Labels
needs: discussion reproducibility enabling scientific reproducibility with r2d

Comments

@manics
Member

manics commented Nov 21, 2022

#1219 bumps the default version of Python, which is a breaking change since it may break repositories that pin package versions that won't work with a newer Python version.

We'll also need to update the base image from Ubuntu 18.04 soon.

From @betatim

There is an issue (somewhere :-/) about the fact that for real reproducibility people need to have the triple: repo link, revision of the code, and revision of repo2docker. We discussed adding a feature to repo2docker that would allow it to fetch a version of itself and use that, instead of what the user installed, when building the container (something like fetching the container image for that revision of r2d).

In the past we have broken the "reproducibility promise" a few times already. For example, when we switched to JupyterLab as the default, and I think in some other, rare, cases.

In general, I think that because the universe keeps evolving it is already likely that (very) old revisions of a repo will not build, or will lead to a container that is quite different. Despite this, I think we should try hard not to add to this source of entropy by going wild with breaking changes in r2d.
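
To make the "triple" from the quote concrete, here is a minimal sketch of how it could be recorded. `BuildSpec` and its field names are hypothetical and are not part of repo2docker; the URL, commit, and version values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildSpec:
    """Hypothetical record of the three inputs needed to rebuild an image."""

    repo_url: str         # link to the repository
    repo_ref: str         # revision of the code (commit SHA, tag, ...)
    repo2docker_ref: str  # revision of repo2docker used for the build

    def key(self) -> str:
        """Return one string that identifies this exact build."""
        return f"{self.repo_url}@{self.repo_ref}+r2d@{self.repo2docker_ref}"


spec = BuildSpec(
    repo_url="https://github.com/example/repo",  # placeholder URL
    repo_ref="abc1234",                          # placeholder commit
    repo2docker_ref="2022.10.0",                 # placeholder version string
)
print(spec.key())
```

If any one of the three fields changes, the key changes, which is the point: the repo link and code revision alone do not pin the build.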

Related:

manics added the labels needs: discussion, reproducibility (enabling scientific reproducibility with r2d) on Nov 21, 2022
@minrk
Member

minrk commented Nov 21, 2022

For mybinder.org, I think it's going to be tricky to balance accepting the repo2docker version as an input against the security and reliability requirement not to run images built by old versions of repo2docker, even though that version is the only input that makes builds truly reproducible.

I don't think repos that don't specify a Python version have a reasonable expectation of long-term reproducibility, since they have only a partially specified environment. That's hard to weigh against the fact that the Python community doesn't have any standard, widely adopted way to specify the Python version. Only relatively uncommon tools like Pipfiles and conda can specify it, and repos using them often don't. In large part, I think this is a documentation question: defaults will be updated, and if you don't specify a version, it will change over time (this is true of any package in requirements.txt or any other env spec, too).

I still think it would be appropriate for us to add a repo's last_modified_date as an input, and use that to pick the default Python. I think it would dramatically improve our success rate for reproducible envs by default, based on sampling data from our study a couple years ago, but that's slightly tangential to the current discussion.

I think we should specify in docs somewhere our upgrade policy for:

  1. defaults that can be overridden, like Python version, and
  2. hardcoded values that can't be overridden (at least in mybinder.org), like base image

I think it would be valid, for instance, for our default Python to be latest - 1 (3.10, as in this PR), and upgrade every year, following Python's own releases.
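
As a rough sketch of how the last_modified_date idea and the "latest - 1" policy could combine: pick the default from the Python releases that existed when the repo was last modified. The function name, the `lag` parameter, and the fallback to the oldest listed version are assumptions for illustration, not repo2docker behaviour. The release dates are those of the CPython x.y.0 releases.

```python
from datetime import date

# Release dates of the CPython x.y.0 feature releases, oldest first.
PYTHON_RELEASES = [
    ("3.7", date(2018, 6, 27)),
    ("3.8", date(2019, 10, 14)),
    ("3.9", date(2020, 10, 5)),
    ("3.10", date(2021, 10, 4)),
    ("3.11", date(2022, 10, 24)),
]


def default_python_for(last_modified: date, lag: int = 1) -> str:
    """Pick a default Python for a repo last modified on `last_modified`.

    `lag=1` means "latest - 1" as of that date. If too few releases
    existed by then, fall back to the oldest version in the table
    (an assumption of this sketch).
    """
    released = [version for version, day in PYTHON_RELEASES if day <= last_modified]
    if len(released) <= lag:
        return PYTHON_RELEASES[0][0]
    return released[-1 - lag]


print(default_python_for(date(2022, 11, 21)))  # 3.10
print(default_python_for(date(2020, 1, 1)))    # 3.7
```

With this rule, a repo that is never touched again keeps resolving to the same default, while a repo with new commits moves forward with the yearly release cadence.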

I think it may also be appropriate to communicate more clearly during the build process, e.g. with a warning stating that the Python version is unspecified and that repo2docker outputs will therefore change over time.
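
A build-time warning along those lines might look like the sketch below. `resolve_python_version` and `DEFAULT_PYTHON` are hypothetical names, not existing repo2docker code, and the message wording is only a suggestion.

```python
import warnings
from typing import Optional

DEFAULT_PYTHON = "3.10"  # assumed current default, as proposed in this thread


def resolve_python_version(requested: Optional[str]) -> str:
    """Return the requested version, or warn and fall back to the default."""
    if requested:
        return requested
    warnings.warn(
        f"Python version is unspecified; using the default ({DEFAULT_PYTHON}). "
        "This default changes over time, so pin a version for reproducible builds.",
        UserWarning,
        stacklevel=2,
    )
    return DEFAULT_PYTHON
```

Calling `resolve_python_version(None)` emits the warning and returns the default, while `resolve_python_version("3.8")` returns `"3.8"` silently.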

Another valid approach would be to follow something like the Python developer survey, and pick the most popular version (3.9) as default, or something like the 50th percentile version (also 3.9 this year, but 52% would be 3.8).
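
A sketch of how a survey-based default could be computed. The shares below are made-up placeholders, not figures from the Python developer survey; they are chosen only so that the results match the outcomes mentioned above. Walking from the newest version down to the oldest is also an assumption about what "50th percentile version" means here.

```python
from typing import Dict, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn "3.10" into (3, 10) so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


def most_popular(shares: Dict[str, float]) -> str:
    """Return the version with the largest share (newest wins ties)."""
    return max(shares, key=lambda v: (shares[v], version_key(v)))


def percentile_version(shares: Dict[str, float], percentile: float) -> str:
    """Walk from newest to oldest and return the first version at which
    the cumulative share reaches `percentile` percent of the total."""
    total = sum(shares.values())
    running = 0.0
    for version in sorted(shares, key=version_key, reverse=True):
        running += shares[version]
        if running * 100 >= percentile * total:
            return version
    raise ValueError("percentile must not exceed 100")


# PLACEHOLDER shares in percent, for illustration only (not survey data).
shares = {"3.11": 2, "3.10": 20, "3.9": 29, "3.8": 25, "3.7": 15, "3.6": 9}

print(most_popular(shares))            # 3.9
print(percentile_version(shares, 50))  # 3.9
print(percentile_version(shares, 52))  # 3.8
```

The example shows how sensitive a percentile rule is near a boundary: moving the threshold by two points changes the default by a full release.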
