Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement pygit2 backend for fetch_refspec #169

Merged
merged 1 commit into from
Feb 2, 2023

Conversation

karajan1001
Copy link
Contributor

implement pygit2 backend for fetch_refspec

fix #168

Compare to the cprofile in iterative/dvc#8477.

The new greatly reduced the time cost in fetching.

image

Making the total dvc exp show query finished in 1s. And brings a much smoother performance in vs-code extension. And also solves the hanging problem in iterative/dvc#8762

2023-01-17.18-57-45.mp4

But we still had one problem remaining to solve, how to make scmrepo use the pygit2 as the default backend. In the test and the demo, I simply raise NotImplementError in Dulwich to disable it.

@karajan1001 karajan1001 added enhancement New feature or request performance labels Jan 17, 2023
@karajan1001 karajan1001 requested a review from pmrowla January 17, 2023 12:02
@karajan1001 karajan1001 self-assigned this Jan 17, 2023
@pmrowla
Copy link
Contributor

pmrowla commented Jan 17, 2023

But we still had one problem remaining to solve, how to make scmrepo use the pygit2 as the default backend. In the test and the demo, I simply raise NotImplementError in Dulwich to disable it.

Setting the default backend is controlled here:

"dulwich": DulwichBackend,
"pygit2": Pygit2Backend,
"gitpython": GitPythonBackend,

But what we probably need is a way to just force preferring a specific backend in our _backend_func wrapper

def _backend_func(self, name, *args, **kwargs):

so we could do something like

scm.fetch_refspec(..., prefer_backend="pygit2")

which would force rotating the backends list until pygit2 is at the top before continuing to try fetch_refspec calls

@skshetry
Copy link
Member

so we could do something like

scm.fetch_refspec(..., prefer_backend="pygit2")

Instead of a wrapper, we could also use it directly as following in dvc:

scm.pygit2.fetch_refspecs(...)

@karajan1001
Copy link
Contributor Author

scm.fetch_refspec(..., prefer_backend="pygit2")

I had considered before, but the problem is that the backends will be determined the first time we call it. So if we want to make sure a function using a special backend, we need to add this prefer to all place we call it. So maybe it's better to add the argument in

def _backend_func(self, name, *args, **kwargs):

@dtrifiro
Copy link
Contributor

Isn't using it directly also an issue with backend switching? We did have issues in the past related to that (iterative/dvc#5641, iterative/dvc#7458)

@pmrowla
Copy link
Contributor

pmrowla commented Jan 17, 2023

Using the backends directly can cause issues due to caching of the git index, but fetch does not touch the git index at all so it shouldn't be a problem here

@skshetry
Copy link
Member

Does this work on Windows? AFAIK there is no ssh support in pygit2 for Windows?

@karajan1001 karajan1001 force-pushed the implement_fetch_in_pygit2 branch 4 times, most recently from 0a484f0 to ec6f5ec Compare January 18, 2023 08:03
@codecov-commenter
Copy link

codecov-commenter commented Jan 18, 2023

Codecov Report

Base: 82.29% // Head: 81.36% // Decreases project coverage by -0.94% ⚠️

Coverage data is based on head (24d1bf0) compared to base (4984eb3).
Patch coverage: 75.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
- Coverage   82.29%   81.36%   -0.94%     
==========================================
  Files          25       25              
  Lines        3383     3466      +83     
  Branches      583      601      +18     
==========================================
+ Hits         2784     2820      +36     
- Misses        517      547      +30     
- Partials       82       99      +17     
Impacted Files Coverage Δ
src/scmrepo/git/backend/pygit2.py 76.09% <69.11%> (-3.34%) ⬇️
src/scmrepo/git/__init__.py 84.53% <80.00%> (+0.48%) ⬆️
tests/test_git.py 100.00% <100.00%> (ø)
src/scmrepo/utils.py 67.74% <0.00%> (-19.36%) ⬇️
tests/conftest.py 88.00% <0.00%> (-7.00%) ⬇️
src/scmrepo/git/backend/gitpython.py 63.44% <0.00%> (-1.08%) ⬇️
src/scmrepo/git/backend/dulwich/__init__.py 80.84% <0.00%> (-0.47%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@karajan1001 karajan1001 force-pushed the implement_fetch_in_pygit2 branch 2 times, most recently from d08d198 to c5607b1 Compare January 18, 2023 08:48
@karajan1001
Copy link
Contributor Author

Does this work on Windows? AFAIK there is no ssh support in pygit2 for Windows?

From the test result look like it works for Windows.

except KeyError:
raise SCMError(f"'{url}' is not a valid Git remote or URL")

if os.name == "nt" and remote.url.startswith("file://"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if should I put this code here or in the tests? Looks like pygit2 doesn't support this format on the Windows platform.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should support it https://github.com/libgit2/libgit2/blob/1327dbcf2a4273a8ba6fd978db5f0882530af94d/src/libgit2/transport.c#L35

But the test result is that on Mac or Linux it supports it, but not on Windows. I guess it's a bug of pygit2.

Copy link
Contributor

@pmrowla pmrowla Jan 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that libgit2 or pygit2 expect windows slashes instead of posix ones after the file:// protocol, or it might be that the file:// prefix is supposed to be stripped as you are doing here. But in regular CLI git there is actually different behavior depending on whether you provide a file:// URL or a raw local path (without file://), so we should double check what the problem is.

Please try to take a look into what pygit2 is doing in this situation and see if there is anything we need to fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hadn't found any issues related to it. Let me create a issue for them libgit2/pygit2#1187

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could still be due to not creating a proper Remote() object for the raw URL. As we discussed, we should be using git_remote_create_anonymous to generate our remote here.

remote_name = "fetch"
self.repo.remotes.set_url(remote_name, url)
remote = self.repo.remotes[remote_name]

This is likely going to cause undefined behavior because we are trying to call set_url on a remote that doesn't actually exist (unless the user has already defined their own remote named fetch, in which case we are incorrectly modifying an existing git remote instead of creating our own temporary one)

Copy link
Contributor Author

@karajan1001 karajan1001 Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like git_remote_create_anonymous is not implemented in pygit2, I just made a temporary solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, as we discussed though we should really consider adding it to pygit

@karajan1001 karajan1001 requested a review from skshetry January 18, 2023 09:16
@pmrowla
Copy link
Contributor

pmrowla commented Jan 18, 2023

libgit2 supports ssh but only when it is compiled with libssh2. The libgit2 build bundled with the pygit2 wheels does not include libssh2, so ssh:// is not supported in pygit2 libgit2/pygit2#994

We should be checking to see if we are passed an ssh URL and then raise NotImpelementedError in pygit2 fetch_refspecs for ssh.

@karajan1001 karajan1001 force-pushed the implement_fetch_in_pygit2 branch 2 times, most recently from 38db2c2 to 5a505bb Compare January 30, 2023 13:20
Comment on lines 458 to 461
remote_name = uuid()
self.repo.remotes.set_url(remote_name, url)
try:
yield self.repo.remotes[remote_name]
Copy link
Contributor

@pmrowla pmrowla Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still has the same issue where you will call set_url on a remote that doesn't actually exist which will likely end up causing errors and/or undefined behavior in the underlying libgit C library, especially when you try to delete it later.

If you are going to implement it this way and create a full named remote (instead of adding the anonymous remote function call to pygit), you need to be using self.repo.remotes.create() here

https://github.com/libgit2/pygit2/blob/fcaec818c384b376404e5ff5477317b608f552ce/pygit2/remote.py#L332

func = getattr(backend, name)
result = func(*args, **kwargs)
self._last_backend = key
self.backends.move_to_end(key, last=False)
return result
except NotImplementedError:
except (NotImplementedError, KeyError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What KeyError is being hit here? If someone tries to call this function with backends=[...] that contains an invalid backend we should really be raising that exception and not catching/ignoring it here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's related to the tests when we initialize the backends with selected argument, I had put the logic into __getitem__ to raise different exceptions for the nonexistent backend names and the not initialized backend names.

@karajan1001 karajan1001 force-pushed the implement_fetch_in_pygit2 branch 2 times, most recently from 08a1cf7 to cd64c82 Compare January 31, 2023 10:53
Comment on lines +484 to +487
if os.name == "nt" and remote.url.startswith("file://"):
url = remote.url[len("file://") :]
self.repo.remotes.set_url(remote.name, url)
remote = self.repo.remotes[remote.name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed even after the remotes.create() change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it still fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make this change in the PR then, but we probably need to open another issue for it in scmrepo (this likely means that a regular git remote with file:// URL in it is also broken for us on windows). This may still be related to us not setting the path separators or formatting the URL in a way that libgit2 expects and probably needs some more investigation.

@karajan1001 karajan1001 force-pushed the implement_fetch_in_pygit2 branch from 24d1bf0 to cd64c82 Compare January 31, 2023 13:03
pmrowla
pmrowla previously approved these changes Feb 1, 2023
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karajan1001 please open scmrepo issues for the pygit follow ups, we don't have to do them right away but it would be good for us to handle them whenever we have the bandwidth

  • expose git_remote_create_anonymous in pygit2.Remote.create_anonymous
  • investigate windows file:// behavior in pygit/libgit

fix iterative#168

1. Add order select for `_backend_func`.
2. Raise exception for fetch_refspec for ssh:// repo on Windows.
3. Add order select for _backend_func
@pmrowla
Copy link
Contributor

pmrowla commented Feb 2, 2023

@karajan1001 is this waiting on anything else or can it be merged?

@karajan1001
Copy link
Contributor Author

karajan1001 commented Feb 2, 2023

@karajan1001 is this waiting on anything else or can it be merged?

I'll do some final check today together with DVC.

@karajan1001 karajan1001 enabled auto-merge (rebase) February 2, 2023 08:30
@karajan1001 karajan1001 requested a review from pmrowla February 2, 2023 08:31
@karajan1001 karajan1001 merged commit 3d8c97c into iterative:main Feb 2, 2023
@karajan1001 karajan1001 deleted the implement_fetch_in_pygit2 branch February 3, 2023 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement pygit2 backends for fetch_refspec
5 participants