Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff between worktree and index #805

Merged
merged 32 commits into from
Apr 17, 2023
Merged

Diff between worktree and index #805

merged 32 commits into from
Apr 17, 2023

Conversation

pascalkuthe
Copy link
Contributor

@pascalkuthe pascalkuthe commented Apr 2, 2023

An initial prototype for computing diffs between the worktree and the index.
Currently, doesn't compute untracked files (needs a separate function) and doesn't handle git-attributes/filters and symlinks correctly. This work is blocked on improvements to the FS cache #400.

Currently, we only support emitting these change events. Post-processing steps like copy/rename detection are therefore also missing. There is also a lot of potential for performance improvements/parallelization.

We likely also want to integrate the diff here with the infrastructure in gix-diff/gix (for example for copy detection) in some capacity but I am not sure how to best approach that from an architectural perspective.


Ok for Byron review the PR on video?

  • I give my permission to record review and upload on YouTube publicly

If I think the review will be helpful for the community, then I might record and publish a video.

@pascalkuthe pascalkuthe force-pushed the dev branch 4 times, most recently from 773ea1f to 66c3d03 Compare April 2, 2023 19:59
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for getting this massive ball rolling and for researching and implementing this massively important feature! It's much appreciated.

A whopping two ours later I took a look at everything but the tests. When I get to it, probably tomorrow, I will probably write some as well or try to fix the existing failure.

What would be great if we could make sure that the naming/semantics of this 'diff' are similar to gix-diff which uses different language due to a different way of thinking. Rather, it asks the question: What do I have to do to turn X in to Y? We can tune the question to yield a typical git status if that's feasible, or leave such a transformation to the application layer if it's not a natural fit.

I think it's also important to differentiate between changes of worktree and index and index and the tree it was created from (maybe that was already the intent and it just read differently in the docs though).

@pascalkuthe
Copy link
Contributor Author

pascalkuthe commented Apr 3, 2023

thanks for the review!

What would be great if we could make sure that the naming/semantics of this 'diff' are similar to gix-diff which uses different language due to a different way of thinking. Rather, it asks the question: What do I have to do to turn X in to Y? We can tune the question to yield a typical git status if that's feasible, or leave such a transformation to the application layer if it's not a natural fit.

I agree I will try to match the terminology. Altough in this case it's "how do I get the worktree from the index" (which the ChandKind mostly reflects). In fact in git there is just one generic "diff index to worktree" function which gets used primarily for the diff (and status is a bit of an afterthought). That's the reason I always yield a change when the stat has changed instead of checking if the file actually changed inside compare_to_index

I think it's also important to differentiate between changes of worktree and index and index and the tree it was created from (maybe that was already the intent and it just read differently in the docs though).

Right now I only implemented a diff between index and worktree. The code currently ever acceses the ODB in any way. The code here is modeled after the run_diff_files function in git. Added files (and conflict) are a bit of a specicaland only determined from the state within the index I expanded in the review comment. In git the added files detection is behind a config option. I might do that here too. It would be nice to keep the special added detection here (same as git does) so that a gix status implementation doesn't have to rescan the entire index for added files (since they would not necessarily show up as modified). Same for conflict detection. Users that just want a diff could simply ignore these additional changes. That the way git handles it.

Worktree and index diff is also important and I want to tackle that in the future but I would like to go step by step here. For example for helix we don't need the index to blob diff so if I finish that work first then I can unblock peoples work and then look at the index/commit diff next. Note that AFAIK git doesn't actually implement a tree to index diff. Instead, it implements an index to index diff and simply checks out a new (in memory) index (unless I missed something).

@Byron
Copy link
Member

Byron commented Apr 4, 2023

Right now I only implemented a diff between index and worktree. The code currently ever acceses the ODB in any way. The code here is modeled after the run_diff_files function in git. Added files (and conflict) are a bit of a specicaland only determined from the state within the index I expanded in the review comment. In git the added files detection is behind a config option. I might do that here too. It would be nice to keep the special added detection here (same as git does) so that a gix status implementation doesn't have to rescan the entire index for added files (since they would not necessarily show up as modified). Same for conflict detection. Users that just want a diff could simply ignore these additional changes. That the way git handles it.

I think the source of me writing this was the misunderstanding of how Added works - it's a flag that is set in the index and persisted with it when adding the file, instead of recognizing the addition by comparing to the tree.

Worktree and index diff is also important and I want to tackle that in the future but I would like to go step by step here. For example for helix we don't need the index to blob diff so if I finish that work first then I can unblock peoples work and then look at the index/commit diff next. Note that AFAIK git doesn't actually implement a tree to index diff. Instead, it implements an index to index diff and simply checks out a new (in memory) index (unless I missed something).

Yes, that makes perfect sense. Thanks for sharing as well, I wasn't aware of it doing index-to-index diff, but can imagine that it simplifies the implementation quite a bit.

@pascalkuthe pascalkuthe force-pushed the dev branch 8 times, most recently from 5994742 to b15a1f7 Compare April 8, 2023 11:15
@pascalkuthe
Copy link
Contributor Author

pascalkuthe commented Apr 11, 2023

@Byron I think this is ready for a second look. There are still some things missing (mostly waiting on the fs cache) but that should not change the implementation a lot.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much!

When trying to get started I noticed that these two files were binary blobs instead of git-lfs stand-ins.

  • gix-worktree/tests/fixtures/generated-archives/
  • gix-worktree/tests/fixtures/generated-archives/status_unchanged.tar.xz

Knowing too well that this probably means I will have trouble pushing to this branch while being unable to fix it myself, could you rewrite these commits and renormalize these files with git-lfs installed?

Thanks a lot.

@pascalkuthe
Copy link
Contributor Author

Thanks so much!

When trying to get started I noticed that these two files were binary blobs instead of git-lfs stand-ins.

* gix-worktree/tests/fixtures/generated-archives/

* gix-worktree/tests/fixtures/generated-archives/status_unchanged.tar.xz

Knowing too well that this probably means I will have trouble pushing to this branch while being unable to fix it myself, could you rewrite these commits and renormalize these files with git-lfs installed?

Thanks a lot.

I am sort of confused about git lfs. I do have it installed and normally all I needed to do was git add and git LFS picked them up so I am not sure exactly why some files didn't get picked up :/ I will try again

@Byron
Copy link
Member

Byron commented Apr 12, 2023

Maybe a commit was done with --no-verify? After all, git-lfs is just a couple of hooks. That's the best explanation I have, if it's installed it always works at least from my experience.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an online-only review which probably makes it a bit superficial, but it's a start I hope.

My general advice is to add unit-tests where feasible and a lot of doc strings to keep track of all the subtleties that might be entailed and transfer them to the future maintainers. Right now, you know most and learned that with a lot of study, and this valuable knowledge should definitely be persisted one way or another. Maybe I am just overly anxious though.

In this session unfortunately I had to stop at index/status.rs, but I will try to give it another go today or definitely tomorrow.

@pascalkuthe pascalkuthe marked this pull request as ready for review April 13, 2023 14:59
Byron added 8 commits April 15, 2023 09:11
Conflicts
=========
  * gix-worktree/src/index/checkout.rs
  * gix-worktree/src/lib.rs
  * gix/src/config/cache/access.rs
…ed at runtime.

Right now git may be compiled without these capabilities, even though on some platforms
it might make perfect sense to enable them by default or enable them on a per repository
basis. This is now possible thanks to added gitoxide specific functions.
- fix doc links
- fix typos
- add trailing '.' in some doc strings.
- remove unused parts of the API.
- type name changes
- adjust module layout from `a/ a.rs` to `a/mod.rs`.
- allow access to ODB as well.
With it one can read entries and read paths at the same time.
@Byron
Copy link
Member

Byron commented Apr 15, 2023

I have reviewed everything in detail and want to thank you again for such great work! It's much appreciated and will soon be run a lot :D!

Here a the highlights of my adjustments:

  • Avoid trait Foo: Send in favor of adding +Send at the site of use. The only reason I can name is that prodash::Progress also inherited :Send at some point, but that prevented me to use it nicely in non-threaded contexts. So maybe I am a bit eager with this 'rule' just because of that, especially with traits that are very specific to a single call site.
  • Avoid Send + Sync requirements in favor of Send + Clone - to my mind this makes the implementation of these easier in the common case. The reason for this is my having some trouble with Sync requirements in the concurrency primitives, and once it was refactored to Send + Clone I could finally do things that weren't possible before. It's vague, I know, but that's all there is left in my memory.
  • Add tests for all common use-cases by default, and try hard to find some for the less common ones as well. It's the last defense to code rotting away over time and the only way to keep the project maintainable for decades to come.

What's missing is the reorganization/shuffle of code to gix-index, but that happens tomorrow.

@Byron
Copy link
Member

Byron commented Apr 16, 2023

The reorganization is most probably done 🎉. I refrained from moving everything from gix-worktree to gix-index as I realized that the layout of these crates and their function is quite intentional, I just forgot about it.

  • gix-worktree is for all interactions of the index with the worktree, and used to offload a lot of code out of gix-index
  • gix-index is for the raw data structures, basic logic, and IO
  • gix marries gix-worktree and gix-index into an API that completely hides gix-worktree - it basically has nearly no exposed types anyway.

Nearly I am done because I might put checkout and status out of index, as the usage of an index is kind of implied. Also I am not perfectly happy with the internal structure of status yet, but it's all mostly there.

Today I might not get to finishing it, but tomorrow for sure.

@Byron
Copy link
Member

Byron commented Apr 17, 2023

And it's finally done!

The solution was to properly (some some docs) frame the purpose of the gix-worktree crate. This allows it to be index-centric by default, which in turn makes the index module unnecessary. From there checkout and status can move up a level.

Additionally, I left the 'changes-to-obtain' centric documentation but renamed the function to status, which leverages the familiarity of the term while making good use of the similarly named module for consistency. Internally, status was cleaned up under the perspective of "it's only for index->worktree" changes, and index->index changes should most definitely go to gix-index directly. This also means that the potential for code sharing flows from gix-index to gix-worktree naturally, but only when it makes sense.

Overall, for code-sharing, there is gix-features, gix-fs and gix-utils (catch-all), so there should be no problems in avoiding duplication with the purpose of testing certain logic thoroughly and only once (or multiple times from the perspective of different crates, whatever makes sense).

All in all, I believe that the current module and crate layout will aid the future development of index to index diffs as well as untracked file detection (and references to these I have removed to nudge you to find better places in the new layout :)).

Of course, nothing is perfect and nor is this, and I am happy to assist with any questions or suggestions you might have.

Merge imminent!

Conflicts
=========
 * gix-index/src/entry/mod.rs
 * gix-utils/src/lib.rs
@Byron Byron merged commit cdef398 into GitoxideLabs:main Apr 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants