Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Git support #934

Closed
wants to merge 2 commits into from
Closed

Add Git support #934

wants to merge 2 commits into from

Conversation

arxanas
Copy link
Contributor

@arxanas arxanas commented Jul 26, 2021

Git already supports Watchman via the fsmonitor hook. For performance reasons, I want to add Git support directly:

  • To avoid extra process instantiation overhead.
  • To avoid touching the index file on disk.
  • To eagerly rather than lazily update the "index" (by which I mean the Watchman in-memory data structures).
  • To be able to use clock IDs for incremental computation.

@facebook-github-bot
Copy link
Contributor

Hi @arxanas!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@arxanas
Copy link
Contributor Author

arxanas commented Jul 26, 2021

@wez pls fix build? I just want a macOS executable with these changes 🥺🙏 .

@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@facebook-github-bot
Copy link
Contributor

@fanzeyi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@wez
Copy link
Contributor

wez commented Jul 27, 2021

@wez pls fix build? I just want a macOS executable with these changes 🥺🙏 .

watchman isn't my main focus these days, but @fanzeyi and the rest of the team are doing a great job looking after it!

@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

1 similar comment
@facebook-github-bot
Copy link
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@arxanas
Copy link
Contributor Author

arxanas commented Jul 27, 2021

@facebook-github-bot You're welcome!

@facebook-github-bot
Copy link
Contributor

@mroch merged this pull request in b6034ce.

@arxanas
Copy link
Contributor Author

arxanas commented Aug 23, 2021

@mroch Thanks for cleaning this up and merging! And thanks @chadaustin for reviewing!

@arxanas arxanas deleted the git branch August 23, 2021 18:57
@dmose
Copy link

dmose commented Sep 10, 2021

This looks exciting! What does one need to do to an existing git repo monitored by the fsmonitor hook to take advantage of this new integration?

@wez
Copy link
Contributor

wez commented Sep 10, 2021

https://facebook.github.io/watchman/docs/scm-query.html is the main feature that this integration most immediately unlocks.

@arxanas @mroch: looks like those docs could do with being updated to mention the new new scm-git capability!

@dmose
Copy link

dmose commented Sep 10, 2021

My reading of #934 (comment) was that it would also improve performance. Do I misunderstand?

@arxanas
Copy link
Contributor Author

arxanas commented Sep 10, 2021

@wez / @mroch I'm happy to update documentation, but I think that the current implementation is buggy. Git::getFilesChangedSinceMergeBaseWith seems to only return committed files and not uncommitted files. (Regrettably, there appears to be no single git diff command which will include both committed and uncommitted files?)

I had difficulty updating the tests in the OSS build. I tried getdeps.py test, but if I recall correctly, it seemed to be ignoring the updates to the test cases which I wrote, and I couldn't get it to pick up my new test. Is there a guide on testing Watchman for OSS?

@dmose This functionality could be used to improve performance, but it needs to be adopted by tools which plan to make specific use of it. It also probably won't be too useful unless your repository is big enough. I'm curious how large your repository is?

I personally plan to update https://github.com/arxanas/git-branchless to use it for commit/status/diff operations one day. As per this comment, I think there is room for improvement:

[...] because of bugs / misfeatures in the current git-side implementation it can take hundreds of milliseconds until we make sense of all of that. This is because git doesn't really "know" about watchman/inotify, it just uses it as a replacement for walking the FS, but on big repositories other stuff can still be expensive.

To go over the points in my original comment:

  • To avoid extra process instantiation overhead: the fsmonitor hook at my work right now is written in Python, so it takes an extra 100ms to start up the interpreter and then finally return results...! (But that's not a fundamental failing in Git's design.)
  • To avoid touching the index file on disk: my theory is that we can speed up some operations if we don't have to touch the index.
    • For reading, we won't have to iterate through the entire index file, which could be large, to find the currently-updated files. I believe Watchman stores the changes more efficiently, i.e. the most-recently changed files are at the head of a list, and you can stop traversing the list once you've reached the last of the changes.
    • For writing: well, you won't have to update the index file, which is good. But I think you could already do that with git --no-optional-locks status.
    • The "sparse index" feature might help these cases, but it's still under development. See the design document and patch series.
  • To eagerly rather than lazily update the "index": Git won't read the files from disk to check to see if they're changed until you actually query Git, which means you defer a lot of I/O operations until the time you call e.g. git status (which is probably the worst time for interactive use?). On the other hand, you could ask Watchman to keep track of all the file hashes as the files change, so you could avoid a lot of I/O when you want to ask about the status of the repo.
  • To be able to use clock IDs for incremental computation: this is a specific pattern for tools to integrate with Watchman. If you have artifacts which are associated with certain commit hashes (possibly even not on your machine!), and you have a way to update an artifact given a list of changed files, then this interface lets you easily incrementally build and update these artifacts.
    • For example, if you have cached builds in the cloud somewhere associated with the most recent main branch commit, then you could ask Watchman for the most recent main branch commit, as well as all the files that have changed since then. Then you also receive a clock ID, which you can use to process incremental updates afterwards. (The clock ID also helps with dealing with race conditions, like what happens if files change on disk while you're downloading/processing the artifact?)
    • For my case, you could use the same functionality to incrementally update the result of git diff from a certain commit tree, rather than re-processing all changed files.
    • There's also the "saved-state" functionality in Watchman, which helps you associate artifacts with commits. This PR doesn't affect that, as you usually have to write your saved-state code to integrate with your own service.

@dmose
Copy link

dmose commented Sep 10, 2021

It's quite large: https://github.com/mozilla/gecko-dev (it contains all of firefox desktop and gecko, which includes codebases like webrtc, spidermonkey, lots of vendored-in rust etc).

I guess I was imagining a world where git itself used this interface instead of the fs-notify hook to speed up normal operations (git status, etc).

I'm now suspecting that someone would have to sign up to do that work on the git side. Is that correct?

@wez
Copy link
Contributor

wez commented Sep 10, 2021

The changes in this PR allow watchman to reason about git state by calling out to git and caching information.
git can't make use of these changes directly, because this particular feature is asking git for information. This PR is aimed at tools downstream of watchman that need to move more processing to be O(local-changes) in the face of an update/rebase operation that may change many files.

To speed up the git fsmonitor integration, the logical first step would be to eliminate that python process overhead mentioned above by eg: rewriting that hook in say Rust using the watchman_client crate.

My recollection is that fsmonitor only accelerates a subset of git compared to the deeper integration in the Mercurial fsmonitor extension. I believe that it may be technically possible to more deeply integrate the fsmonitor concept in git, but the architecture of git likely makes that a fairly high effort engineering project.

@arxanas
Copy link
Contributor Author

arxanas commented Sep 11, 2021

@dmose I do performance testing for https://github.com/arxanas/git-branchless on gecko-dev (see e.g. libgit2/libgit2#6036). If you're interested in using the tool, I recommend you install directly from master for now, since it includes unreleased significant performance improvements for large repos, such as for git move. If you still see poor performance on large repositories, feel free to open a discussion thread there.

You can add yourself as a watcher to the repository and get notified when I publish a release that adds a replacement for git status. The intention is to 1) talk directly to Watchman, and 2) avoid using the index as a data structure, since it makes operations O(repo). I don't expect to release something like that soon, though.

You might also want to try the sparse-checkout/sparse-index features of Git: https://git-scm.com/docs/git-sparse-checkout. I think that would be the best bet for improving your performance on a monorepo. It should be under active development, so you might want to keep an eye on the Git release notes.

If you also have the problem of an fsmonitor hook with slow startup, you might be able to try https://github.com/jgavris/rs-git-fsmonitor. This isn't an option for me, unfortunately.

You could also try developing on gecko-dev with Mercurial instead. Presumably "deeper integration" with Watchman for Mercurial would make the performance better than with Git?

sluongng added a commit to sluongng/watchman that referenced this pull request Jun 16, 2022
Git SCM support has been added since 2021.08.30 release via a pull
request by @arxanas. (1)

Update website documentation to reflect that Git support has been added
additionally to Mercurial.

(1): facebook#934
facebook-github-bot pushed a commit that referenced this pull request Jun 17, 2022
Summary:
Git SCM support has been added since 2021.08.30 release via a pull
request by arxanas. (1)

Update website documentation to reflect that Git support has been added
additionally to Mercurial.

(1): #934

Pull Request resolved: #1036

Reviewed By: genevievehelsel

Differential Revision: D37215628

Pulled By: chadaustin

fbshipit-source-id: 2bcef730bcccf976e9c98c7d728985e07078c941
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants