Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lfs: add support for Git SSH URLs #325

Merged
merged 1 commit into from
Feb 29, 2024
Merged

Conversation

sisp
Copy link
Collaborator

@sisp sisp commented Feb 27, 2024

I've added support for Git SSH URLs to the Git LFS client. Resolves #288.

I've refactored the LFSClient class a little to have the common code in LFSClient and implement logic specific to Git HTTP URLs and Git SSH URLs in subclasses. The LFSClient class is now abstract with one abstract method _get_auth_header(upload: bool): dict which gets implemented by the private subclasses _HTTPLFSClient and _SSHLFSClient. _HTTPLFSClient retrieves credentials as before using the credential helper, but instead of using AIOHTTP's auth argument, it creates the basic auth header manually using BasicAuth(...).encode() which gets passed along with other headers. _SSHLFSClient retrieves credentials using ssh git@<host> git-lfs-authenticate <project_path> upload|download (as documented in the Git LFS SSH authentication docs) and takes the header dict from its response, which contains authentication data.

Some questions regarding this implementation:

  • Is using _get_ssh_vendor from scmrepo.git.backend.dulwich the right way to create an SSH client?
  • Any suggestions about testing? I don't feel too comfortable adding this feature without tests, but I'm not sure what the best approach to testing would be. Perhaps using a Git LFS test server? I've manually tested both Git SSH URLs and Git HTTP URLs of public and private projects, all scenarios are working for me.

@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2024

Codecov Report

Attention: Patch coverage is 35.18519% with 35 lines in your changes are missing coverage. Please review.

Project coverage is 77.50%. Comparing base (454a522) to head (1c7907f).

Files Patch % Lines
src/scmrepo/git/lfs/client.py 35.18% 35 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #325      +/-   ##
==========================================
- Coverage   77.69%   77.50%   -0.19%     
==========================================
  Files          39       39              
  Lines        5034     5068      +34     
  Branches      904      909       +5     
==========================================
+ Hits         3911     3928      +17     
- Misses        969      986      +17     
  Partials      154      154              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sisp
Copy link
Collaborator Author

sisp commented Feb 27, 2024

Thinking about this some more, perhaps it would be better to have pluggable LFS authenticators rather than LFS client variants, i.e. LFSClient would be composed of an object that implements a protocol like:

class LFSAuth(typing.Protocol):
    def get_headers(self, *, upload: bool) -> dict:
        ...

We'd have two classes that implement this protocol:

  1. SSHLFSAuth
  2. HTTPLFSAuth

The LFSClient.from_git_url classmethod would instantiate the authenticator object depending on the Git URL structure and pass it to the LFSClient constructor.

@pmrowla
Copy link
Contributor

pmrowla commented Feb 28, 2024

  • Is using _get_ssh_vendor from scmrepo.git.backend.dulwich the right way to create an SSH client?

Ideally I think we should be using asyncssh directly so that this is independent of any particular git backend. I'm not sure how the CLI LFS client works with things like GIT_SSH_COMMAND, so it may actually be necessary to use dulwich if we do need to support falling back to a user-provided SSH shell command.

Either way, using dulwich here should be fine for now.

  • Any suggestions about testing? I don't feel too comfortable adding this feature without tests, but I'm not sure what the best approach to testing would be. Perhaps using a Git LFS test server? I've manually tested both Git SSH URLs and Git HTTP URLs of public and private projects, all scenarios are working for me.

I think testing this will really require setting up a docker container that is running a git + LFS + SSH server instance. We do already have a minimal setup for testing git over SSH, you can take a look at tests/conftest.py and tests/docker-compose.yml to get an idea of how that setup works.


In general this PR LGTM (other than the testing issue), but I should note that my understanding is that this use of LFS auth over SSH is not widely used, and the alternative (HTTPS auth) works for github/gitlab as well as huggingface. So I'm not sure this feature is something that needs to be prioritized.

@sisp is your team actively using LFS auth over SSH?

(Full disclosure, I'm leaving the DVC team at the end of this week, and at least in the short term, the rest of the team won't have the bandwidth to work on or review changes in the scmrepo LFS client other than bugs where something is completely broken with github/gitlab/huggingface)

@sisp
Copy link
Collaborator Author

sisp commented Feb 28, 2024

Thanks for your comprehensive feedback, @pmrowla! 🙏 First of all, I'm sorry to hear you're leaving the DVC team. Your work on LFS support here – and thus also for DVC – has been fantastic and triggered an idea for a partial solution related to my struggles with tightly integrating DVC and GitLab (iterative/dvc-http#50 (comment), iterative/dvc-http#56, gitlab.com:gitlab-org/gitlab#413612) by simply using Git LFS for data/artifacts storage instead of a DVC remote. This is feasible on our corporate self-hosted GitLab instance, as it has a very generous LFS storage size limit, so storing significant data and artifacts is no problem. With DVC's LFS support, we can, e.g., import data managed via LFS from non-DVC repos and yet benefit from DVC's other features. That said, LFS is only a substitute for the regular cache but not the run-cache.

Since there's little time left before you leave, I'm inclined to leave the PR as is and look into testing in a follow-up. Also, I've looked into refactoring the current approach to using authenticator classes. I realized that not only retrieving credentials for the LFS Batch API is different for HTTP and SSH, but also constructing the LFS server URL from the Git URL is different. An authenticator class should not cover the URL construction, as that's a different concern, but authentication and URL construction are closely related. For this reason, I think we should keep the inheritance-based approach as is.

We use Git over SSH on our GitLab instance by default. For us, 2FA is always enabled, and then:

When 2FA is enabled, you can’t use your password to authenticate with Git over HTTPS or the GitLab API. You can use a personal access token instead.

https://docs.gitlab.com/ee/user/profile/account/two_factor_authentication.html#use-personal-access-tokens-with-two-factor-authentication

The same applies to GitHub. In fact, I use 2FA and Git over SSH also on github.com and gitlab.com because it's secure and convenient. Git over HTTP only works nicely without 2FA or with public projects IMO, so I think supporting LFS for Git SSH URLs is a high priority in corporate environments, e.g. when not only consuming public LFS-managed files but also internal ones.

@pmrowla pmrowla merged commit c44c577 into iterative:main Feb 29, 2024
13 checks passed
@sisp sisp deleted the feat/lfs-ssh branch March 1, 2024 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lfs: support SSH git-lfs-authenticate
3 participants