Use git ls-remote to resolve refs for git provider #895

hugokerstens · 2019-07-11T10:41:29Z

Implements the approach specified in #843.

akhmerov · 2019-07-11T10:47:10Z

Shouldn't unresolved ref default to master?

binderhub/repoproviders.py

hugokerstens · 2019-07-11T10:53:31Z

> Shouldn't unresolved ref default to master?

It already defaults to master in the front-end JavaScript if no ref is specified.

binderhub/static/js/index.js

akhmerov · 2019-07-11T11:49:34Z

Test failures seem relevant; also we need to test the new cases.

minrk · 2019-07-11T14:05:57Z

If it works as advertised, this is awesome! In the future, it could also mean that we could greatly reduce our needed git provider API support, since all we'd need to do is turn provider-specific-url into git url, which should generally be doable purely in our code and without making any API requests.

hugokerstens · 2019-07-11T14:20:22Z

The test for 073dba6 failed because JupyterHub is not available for some reason.

consideRatio · 2019-07-11

Wow, this is a serious upgrade to the UX! Thanks @manics and @hugokerstens for your work on this!

1 - `unresolved_ref` renamed for clarity?

From the name self.unresolved_ref, I assumed that this is always an unresolved reference (like a branch name, master, or a tag name, 0.1.0), and I was about to suggest that we should be able to handle both cases. But then I noticed that you already do a check later, within a try/except, that also allows this to be a resolved reference:

        try:
            self.sha1_validate(unresolved_ref)
            self.resolved_ref = unresolved_ref

So perhaps we can rename this variable so that it does not imply an unresolved reference, but simply a reference: unparsed reference, provided reference, input reference, etc.?

2 - a delayed parsing of the provided git reference?

NOTE: These considerations are not important if an additional web request isn't needed, and they are not essential to the decision on whether to merge this PR.

The __init__ call of the GitRepoProvider does not trigger resolution of the reference (for example "master" or "0.1.0") to a resolved reference, which is a 40-character string. Instead, resolution happens when get_resolved_ref() is called. Perhaps this is good, perhaps it isn't. I can't tell, because I don't understand the usage pattern of RepoProvider objects yet: will this cause a lag that could be prevented, or does it actually prevent a lag? Could we return from __init__ but start fetching the resolved reference ahead of time? Hmm...

I also figure that if get_resolved_ref is called multiple times before it returns, it may cause multiple invocations of the git ls-remote command, while we only want one, I think. Can we make it run only once? Is it important? I don't fully know, but that is what I considered when I looked through this code!

To summarize:

Should we invoke the reference resolution on init (asynchronously or synchronously), or not do it at all and wait until get_resolved_ref is explicitly called later?
Should we ensure that only one lookup is done with git ls-remote?
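
One way to get a single lookup, sketched here only as an illustration of the question above: cache the in-flight task so that every caller awaits the same result. `LazyRefResolver` and `fake_lookup` are hypothetical names, not part of BinderHub, and the real lookup would be the `git ls-remote` call.

```python
import asyncio


class LazyRefResolver:
    """Hypothetical helper: resolve a ref at most once, on first request."""

    def __init__(self, resolve):
        # `resolve` is an assumed coroutine function that does the slow
        # lookup, for example by running `git ls-remote`.
        self._resolve = resolve
        self._task = None

    async def get_resolved_ref(self):
        # The first caller starts the lookup; later callers, including
        # concurrent ones, await the same task instead of starting another.
        if self._task is None:
            self._task = asyncio.ensure_future(self._resolve())
        return await self._task


async def demo():
    calls = 0

    async def fake_lookup():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)  # stand-in for the network round trip
        return "a" * 40

    resolver = LazyRefResolver(fake_lookup)
    refs = await asyncio.gather(
        *(resolver.get_resolved_ref() for _ in range(3))
    )
    return calls, refs
```

Note that this sketch also caches a failed lookup: if the task raises, every later call re-raises the same exception, which may or may not be the desired behaviour.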

binderhub/repoproviders.py

betatim · 2019-07-12T14:12:57Z

binderhub/repoproviders.py

+            self.sha1_validate(unresolved_ref)
+            self.resolved_ref = unresolved_ref
+        except ValueError:
+            pass


There are a few things here which aren't obvious from reading it a few times so maybe we can add some comments.

Why are we now swallowing exceptions here?

On L196, calling self.sha1_validate(unresolved_ref) should fail a lot, because users can pass in a value like master. So why call it now and not after we resolve the reference?

It seems weird to set self.resolved_ref to the value of unresolved_ref. I think we should delay that until we have a resolved ref (and set it to None in the meantime).

hugokerstens · 2019-07-12

We pass on the error because, if the ref is not a commit SHA, resolving it is postponed to the getter of resolved_ref. It makes sense to move this sha1 check to the getter for the resolved_ref and not set the resolved_ref at all in the constructor. This is in line with the other repo providers.

consideRatio · 2019-07-12

I also appreciate managing all the logic within the getter (get_resolved_ref), since there is one. The first time I read this through, it was confusing that some of the logic was found there but not all of it.

hugokerstens · 2019-07-13

Is the current implementation satisfactory?

The only thing I'm not 100% sure about is setting the resolved_ref in the else block. It is good practice to keep only the code that can raise the error you want to catch in the try block. However, it might be more readable to set the resolved_ref directly after validating it in the try block. I tried to make it more readable by adding some comments, but I still assign it in the else block.
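
The try/except/else shape discussed here could look roughly like the sketch below. It is not the code from the PR: `sha1_validate` is reimplemented under the assumption that it raises ValueError unless given a full 40-character hex SHA-1, and `lookup` is a hypothetical callable standing in for the git ls-remote call.

```python
import re

SHA1_PATTERN = re.compile(r"[0-9a-f]{40}")


def sha1_validate(value):
    # Assumed behaviour: raise ValueError unless `value` is a full
    # 40-character lowercase hex SHA-1.
    if not SHA1_PATTERN.fullmatch(value):
        raise ValueError("not a full SHA-1: {!r}".format(value))


def get_resolved_ref(unresolved_ref, lookup):
    """Return a commit SHA for `unresolved_ref`.

    `lookup` is a hypothetical callable that maps a branch or tag name
    to a commit SHA.
    """
    try:
        # Only the call that can raise the expected error is in the try.
        sha1_validate(unresolved_ref)
    except ValueError:
        # Not a commit SHA, so treat it as a branch or tag name.
        return lookup(unresolved_ref)
    else:
        # Validation passed: the input already is a resolved reference.
        return unresolved_ref
```

Keeping the assignment in the else block means a ValueError raised by anything other than the validation cannot be swallowed by accident.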


hugokerstens · 2019-07-12T14:35:02Z

> 1 - unresolved_ref renamed for clarity?
>
> So, perhaps we can name this variable to indicate it not as an unresolved reference, but simply a reference or unparsed reference, provided reference or input reference etc?

That's indeed a lot more readable. If we move the sha1 check to the getter (as discussed here), renaming it is not needed anymore. We can just speak of an unresolved reference in the constructor, and only bother with validating in the getter for the resolved ref.

betatim · 2019-07-12T14:50:32Z

I think switching to unresolved_ref as a variable name is a good idea and consistent with how other providers name it. Maybe the most precise name would be potentially_unresolved_ref but that is quite long :-/

Right now, when you enter a git repo URL in the UI, you end up with the following link: https://mybinder.org/v2/git/http%3A%2F%2Fgit.example.com%2Frepo/123124124

We should make sure that this format continues to work. It makes it a bit tricky to deal with branches that contain a / though. From a quick look at the gitlab.com provider, it seems to do what you discussed in the comment: it encodes the / in the URL of the repo and leaves the ones in the branch/tag name alone. So that would be a good example to follow.
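
A minimal sketch of that encoding rule, assuming the spec has the form `<url-escaped-namespace>/<ref>` as in the link above. `build_spec` and `parse_spec` are hypothetical helper names, not BinderHub functions.

```python
from urllib.parse import quote, unquote


def build_spec(repo_url, ref):
    # Escape every "/" in the repository URL; leave the ref untouched.
    return quote(repo_url, safe="") + "/" + ref


def parse_spec(spec):
    # The escaped URL contains no "/", so the first "/" separates the
    # URL from the ref, even when the ref itself contains "/".
    escaped_url, ref = spec.split("/", 1)
    return unquote(escaped_url), ref
```

For example, `parse_spec(build_spec("http://git.example.com/repo", "feature/x"))` gives back the URL and the ref `feature/x` unchanged, because only the first `/` is used as the separator.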

We should extend the tests to handle the new possibilities (as well as some invalid ones?). For this to work we need to mock the calls to git ls-remote, as we don't want to actually run commands that need networking. To test the resolving via git ls-remote we can set up a local "throw away" repo, clone it, and then use the original as the "remote". I think that would work.
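
As a rough sketch of the "throw away" repo idea, assuming a git executable is on the PATH: the helpers below (`git`, `make_throwaway_repo`, `ls_remote_sha`) are hypothetical and not taken from the BinderHub test suite. The sketch skips the clone step and points git ls-remote at the local path directly, which avoids any network access.

```python
import subprocess
import tempfile


def git(*args, cwd=None):
    # Run a git command and return its standard output as text.
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True,
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return result.stdout


def make_throwaway_repo(path):
    """Create a repo with one empty commit and a tag; return the SHA."""
    git("init", "--quiet", path)
    git(
        "-c", "user.name=Test", "-c", "user.email=test@example.com",
        "-c", "commit.gpgsign=false",
        "commit", "--quiet", "--allow-empty", "-m", "initial commit",
        cwd=path,
    )
    git("tag", "v0.1.0", cwd=path)
    return git("rev-parse", "HEAD", cwd=path).strip()


def ls_remote_sha(remote, ref):
    # Output lines look like "<sha>\t<refname>"; take the first field.
    # An unknown ref produces empty output.
    output = git("ls-remote", remote, ref)
    return output.split(None, 1)[0] if output.strip() else None
```

A test could then create the repo inside `tempfile.TemporaryDirectory()`, resolve `v0.1.0` with `ls_remote_sha`, and compare the result against the SHA returned by `make_throwaway_repo`.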

consideRatio · 2019-07-12T16:05:30Z

It is this change that led me to consider the variable name in the first place, I guess. Previously a resolved reference was required, but now it cannot be a resolved reference? Well, the code is currently meant to support that, but this documentation confused me into thinking it was no longer supported. Hence my suggestion to change this documentation snippet, and then it felt natural to change the variable name along with it.

    Users must provide a spec of the following form.

-    <url-escaped-namespace>/<resolved_ref>
+    <url-escaped-namespace>/<unresolved_ref>

hugokerstens · 2019-07-13T13:59:31Z

binderhub/repoproviders.py

+                raise ValueError("The specified branch, tag or commit SHA ('{}') was not found on the remote repository."
+                                .format(self.unresolved_ref))
+            resolved_ref = result.stdout.split(None, 1)[0]
+            self.sha1_validate(resolved_ref)


Is validating the SHA that is returned by git ls-remote too much? I think this safeguard is nice to have in case git ls-remote produces unexpected output.
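
To make the safeguard concrete, here is a small sketch of parsing and validating the command output. `parse_ls_remote` is a hypothetical function; the error message is copied from the diff above, and the 40-character hex check is an assumption about what sha1_validate enforces.

```python
import re


def parse_ls_remote(stdout, unresolved_ref):
    """Extract the commit SHA from `git ls-remote` output."""
    if not stdout.strip():
        # Empty output means no ref on the remote matched.
        raise ValueError(
            "The specified branch, tag or commit SHA ('{}') was not "
            "found on the remote repository.".format(unresolved_ref)
        )
    # Each output line is "<sha>\t<refname>"; keep the first field.
    candidate = stdout.split(None, 1)[0]
    # Safeguard against unexpected output (assumed SHA-1 format).
    if not re.fullmatch(r"[0-9a-f]{40}", candidate):
        raise ValueError("Unexpected ls-remote output: {!r}".format(stdout))
    return candidate
```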

hugokerstens · 2019-07-13T14:10:50Z

I added some extra documentation for the GitRepoProvider that explicitly states that both resolved and unresolved references are possible to address some of the concerns of @consideRatio. However, the variable is still called unresolved_ref in the code, which is confusing as well. The problem here is the ambiguity of 'unresolved'. It could either mean unresolved as in yet to be resolved to a commit hash, OR unresolved as in not validated at all.

chicocvenancio · 2019-08-07T20:26:24Z

This fixes #675 if I'm not misunderstanding something. A great improvement.

betatim · 2019-08-08T21:29:58Z

Thanks for bumping this @chicocvenancio. Deployed this locally and tested it with some GitHub and GitLab repositories (pretending they were just git repos). Seems to work 🎉.

Merging! Thanks for this cool feature!

jupyterhub/binderhub#895

hugokerstens added 2 commits July 11, 2019 12:37

Use git ls-remote to resolve refs for git provider

b845fa4

Update tag text in front-end for git provider

8bcad7d

akhmerov reviewed Jul 11, 2019

binderhub/repoproviders.py (outdated)

Support refs with a / in them

92ad34c

akhmerov reviewed Jul 11, 2019

binderhub/static/js/index.js

Correctly encode URL in test build

073dba6

consideRatio reviewed Jul 11, 2019

betatim reviewed Jul 12, 2019

binderhub/repoproviders.py

betatim reviewed Jul 12, 2019

binderhub/repoproviders.py (outdated)

betatim reviewed Jul 12, 2019

binderhub/repoproviders.py (outdated)

betatim reviewed Jul 12, 2019

binderhub/tests/test_build.py

Apply suggestion from betatim to show the unresolved reference

1a13655

Co-Authored-By: Tim Head <betatim@gmail.com>

hugokerstens added 2 commits July 12, 2019 16:37

Fix syntax error and improve styling

19031b9

Change to RuntimeError

af0d281

hugokerstens added 2 commits July 13, 2019 15:31

Improve documentation of GitRepoProvider

dfceb05

Move resolve ref logic to getter

a20487d

hugokerstens commented Jul 13, 2019


betatim merged commit 1a76d5f into jupyterhub:master Aug 8, 2019

This was referenced Aug 8, 2019

Resolve remote git references using provider agnostic method #843

Closed

Resolving branch names for repo providers that aren't github.com or gitlab.com #675

Closed

yuvipanda pushed a commit to jupyterhub/helm-chart that referenced this pull request Aug 8, 2019

[binderhub] Automatic update for commit 1a76d5f

f014454

jupyterhub/binderhub#895

henchbot mentioned this pull request Aug 8, 2019

binderhub: 10a3dee...1a76d5f jupyterhub/mybinder.org-deploy#1116

Merged

hugokerstens mentioned this pull request Aug 9, 2019

Fixes and tests for git unresolved ref support #921

Merged

choldgraf added reference maintenance Under the hood improvements and fixes and removed reference labels Oct 8, 2019

betatim mentioned this pull request Oct 29, 2020

Migrate ALL providers to expecting main branch, not just GitHub, and announce the change #1175

Closed

2 tasks
