lfs: add support for Git SSH URLs #325
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #325      +/-   ##
==========================================
- Coverage   77.69%   77.50%   -0.19%
==========================================
  Files          39       39
  Lines        5034     5068      +34
  Branches      904      909       +5
==========================================
+ Hits         3911     3928      +17
- Misses        969      986      +17
  Partials      154      154
Thinking about this some more, perhaps it would be better to have pluggable LFS authenticators rather than LFS client variants, i.e.

    class LFSAuth(typing.Protocol):
        def get_headers(self, *, upload: bool) -> dict:
            ...

We'd have two classes that implement this protocol:
The |
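For illustration only, here is a minimal sketch of how such pluggable authenticators could look. The class names `BasicLFSAuth` and `CallbackLFSAuth` and the helper `batch_headers` are hypothetical and not part of scmrepo; only the `LFSAuth` protocol comes from the comment above.

```python
import base64
import typing


class LFSAuth(typing.Protocol):
    def get_headers(self, *, upload: bool) -> dict:
        ...


class BasicLFSAuth:
    """Hypothetical authenticator: username/password -> HTTP Basic auth header."""

    def __init__(self, username: str, password: str) -> None:
        self._username = username
        self._password = password

    def get_headers(self, *, upload: bool) -> dict:
        raw = f"{self._username}:{self._password}".encode()
        return {"Authorization": "Basic " + base64.b64encode(raw).decode()}


class CallbackLFSAuth:
    """Hypothetical authenticator that defers to a callable, for example
    one that runs git-lfs-authenticate over SSH and returns its headers."""

    def __init__(self, fetch: typing.Callable[[str], dict]) -> None:
        self._fetch = fetch

    def get_headers(self, *, upload: bool) -> dict:
        return dict(self._fetch("upload" if upload else "download"))


def batch_headers(auth: LFSAuth, *, upload: bool) -> dict:
    # Media type used by the Git LFS Batch API, merged with the auth headers.
    headers = {
        "Accept": "application/vnd.git-lfs+json",
        "Content-Type": "application/vnd.git-lfs+json",
    }
    headers.update(auth.get_headers(upload=upload))
    return headers
```

With this shape, a single client would accept any object satisfying `LFSAuth`, and the choice between HTTP and SSH credentials would be made when the client is constructed.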
Ideally I think we should be using asyncssh directly so that this is independent of any particular git backend. I'm not sure how the CLI LFS client works with things like Either way, using dulwich here should be fine for now.
I think testing this will really require setting up a docker container that is running a git + LFS + SSH server instance. We do already have a minimal setup for testing git over SSH, you can take a look at

In general this PR LGTM (other than the testing issue), but I should note that my understanding is that this use of LFS auth over SSH is not widely used, and the alternative (HTTPS auth) works for github/gitlab as well as huggingface. So I'm not sure this feature is something that needs to be prioritized. @sisp is your team actively using LFS auth over SSH?

(Full disclosure, I'm leaving the DVC team at the end of this week, and at least in the short term, the rest of the team won't have the bandwidth to work on or review changes in the scmrepo LFS client other than bugs where something is completely broken with github/gitlab/huggingface)
Thanks for your comprehensive feedback, @pmrowla! 🙏

First of all, I'm sorry to hear you're leaving the DVC team. Your work on LFS support here – and thus also for DVC – has been fantastic and triggered an idea for a partial solution related to my struggles with tightly integrating DVC and GitLab (iterative/dvc-http#50 (comment), iterative/dvc-http#56, gitlab.com:gitlab-org/gitlab#413612) by simply using Git LFS for data/artifacts storage instead of a DVC remote. This is feasible on our corporate self-hosted GitLab instance, as it has a very generous LFS storage size limit, so storing significant data and artifacts is no problem. With DVC's LFS support, we can, e.g., import data managed via LFS from non-DVC repos and yet benefit from DVC's other features. That said, LFS is only a substitute for the regular cache but not the run-cache.

Since there's little time left before you leave, I'm inclined to leave the PR as is and look into testing in a follow-up.

Also, I've looked into refactoring the current approach to using authenticator classes. I realized that not only retrieving credentials for the LFS Batch API is different for HTTP and SSH, but also constructing the LFS server URL from the Git URL is different. An authenticator class should not cover the URL construction, as that's a different concern, but authentication and URL construction are closely related. For this reason, I think we should keep the inheritance-based approach as is.

We use Git over SSH on our GitLab instance by default. For us, 2FA is always enabled, and then:
The same applies to GitHub. In fact, I use 2FA and Git over SSH also on github.com and gitlab.com because it's secure and convenient. Git over HTTP only works nicely without 2FA or with public projects IMO, so I think supporting LFS for Git SSH URLs is a high priority in corporate environments, e.g. when not only consuming public LFS-managed files but also internal ones.
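To make the URL-construction point concrete, here is a rough sketch of deriving the default LFS endpoint from different kinds of Git remote URLs, following the convention in the Git LFS server discovery docs (append `.git/info/lfs` and use HTTPS). The function name `lfs_url_from_git_url` is made up for this example, and the actual scmrepo code may handle this differently, for instance by honoring an `href` returned by the server.

```python
from urllib.parse import urlsplit


def lfs_url_from_git_url(git_url: str) -> str:
    """Derive the default LFS endpoint: <https repo url>.git/info/lfs."""
    if "://" in git_url:
        parts = urlsplit(git_url)
        host = parts.hostname or ""
        if parts.scheme in ("http", "https"):
            scheme = parts.scheme
            # Keep an explicit port only for HTTP(S) remotes.
            if parts.port:
                host = f"{host}:{parts.port}"
        else:
            # ssh:// remotes map to HTTPS; the SSH port does not carry over.
            scheme = "https"
        path = parts.path
    else:
        # scp-like syntax: [user@]host:path
        user_host, _, path = git_url.partition(":")
        host = user_host.rpartition("@")[2]
        scheme = "https"
    path = path.strip("/")
    if not path.endswith(".git"):
        path += ".git"
    return f"{scheme}://{host}/{path}/info/lfs"
```

All of the remote forms `https://git-server.com/foo/bar`, `git@git-server.com:foo/bar.git`, and `ssh://git@git-server.com/foo/bar.git` resolve to the same endpoint here, but each needs its own parsing branch.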
I've added support for Git SSH URLs to the Git LFS client. Resolves #288.
I've refactored the LFSClient class a little to have the common code in LFSClient and implement logic specific to Git HTTP URLs and Git SSH URLs in subclasses. The LFSClient class is now abstract with one abstract method, _get_auth_header(upload: bool): dict, which gets implemented by the private subclasses _HTTPLFSClient and _SSHLFSClient.

_HTTPLFSClient retrieves credentials as before using the credential helper, but instead of using AIOHTTP's auth argument, it creates the basic auth header manually using BasicAuth(...).encode(), which gets passed along with other headers.

_SSHLFSClient retrieves credentials using ssh git@<host> git-lfs-authenticate <project_path> upload|download (as documented in the Git LFS SSH authentication docs) and takes the header dict from its response, which contains authentication data.

Some questions regarding this implementation:
Is _get_ssh_vendor from scmrepo.git.backend.dulwich the right way to create an SSH client?
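As a rough illustration of the SSH flow described in this PR, the sketch below builds the remote `git-lfs-authenticate` command and extracts the `header` dict from its JSON response. The function names are hypothetical and this is not the code in the PR; running the command over SSH is deliberately left out, and the sample response mirrors the example in the Git LFS authentication docs.

```python
import json
import shlex


def build_authenticate_command(project_path: str, *, upload: bool) -> str:
    """Return the command to run on the Git host over SSH."""
    operation = "upload" if upload else "download"
    # Quote the path so unusual characters cannot alter the remote command.
    return f"git-lfs-authenticate {shlex.quote(project_path)} {operation}"


def parse_authenticate_response(stdout: str) -> dict:
    """Extract the HTTP headers to send with LFS Batch API requests."""
    payload = json.loads(stdout)
    header = payload.get("header") or {}
    return {str(key): str(value) for key, value in header.items()}


# Sample response in the shape documented for git-lfs-authenticate.
sample = json.dumps(
    {
        "href": "https://lfs-server.com/foo/bar",
        "header": {"Authorization": "RemoteAuth some-token"},
        "expires_in": 86400,
    }
)
headers = parse_authenticate_response(sample)
```

The returned dict can be merged into the request headers in the same place where the HTTP variant adds its Basic auth header. A fuller version would likely cache the result until `expires_in` elapses.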