diff correctness #1106

Merged
merged 20 commits into from
Dec 2, 2023

Conversation

Byron
Member

@Byron Byron commented Nov 11, 2023

Based on #1049


diff-correctness → gix-status → gix reset


Improve gix status to the point where it's suitable for use in reset functionality.
Leads to a proper worktree reset implementation, eventually leading to a high-level reset similar to how git supports it.

Architecture

The reason this PR deals quite a bit with gix status is that for a safe implementation of reset() we need to be sure that the files we would want to touch don't carry modifications and aren't untracked files. To know what needs to be done, we have to diff the current-index with the target-index. The set of files to touch can then be used to look up information provided by git-status, like worktree modifications, index modifications, and untracked files, to know if we can proceed or not. This is also where the reset-modes would affect the outcome, i.e. what to change and how.

This is a very modular approach which facilitates testing and understanding of what would otherwise be a very complex algorithm. Having a set of changes as output also makes it possible to one day parallelize applying these changes.
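
The flow described above can be sketched roughly as follows. This is a minimal illustration only: `Index`, `Change`, `Status`, `diff_indices` and `blocking_paths` are hypothetical names made up for this sketch and are not part of the gix API, and blob ids are reduced to plain integers.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for an index: path -> blob id.
type Index = BTreeMap<String, u32>;

/// What a reset would have to do to a single path in the worktree.
#[derive(Debug, PartialEq, Eq)]
enum Change {
    Add,
    Modify,
    Delete,
}

/// Diff `current` against `target`, yielding the set of paths a reset would touch.
fn diff_indices(current: &Index, target: &Index) -> BTreeMap<String, Change> {
    let mut out = BTreeMap::new();
    for (path, id) in target {
        match current.get(path) {
            None => {
                out.insert(path.clone(), Change::Add);
            }
            Some(cur) if cur != id => {
                out.insert(path.clone(), Change::Modify);
            }
            Some(_) => {} // unchanged, nothing to do
        }
    }
    for path in current.keys() {
        if !target.contains_key(path) {
            out.insert(path.clone(), Change::Delete);
        }
    }
    out
}

/// Hypothetical summary of what a status run found in the worktree.
#[derive(Default)]
struct Status {
    modified_in_worktree: BTreeSet<String>,
    untracked: BTreeSet<String>,
}

/// The paths among `changes` that would make a reset unsafe.
fn blocking_paths(changes: &BTreeMap<String, Change>, status: &Status) -> Vec<String> {
    changes
        .keys()
        .filter(|p| status.modified_in_worktree.contains(*p) || status.untracked.contains(*p))
        .cloned()
        .collect()
}
```

Because the diff produces a plain set of changes before anything is written, the safety check and the application step can be tested in isolation, and a reset mode would only need to change how `blocking_paths` is interpreted.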

This leaves us in a situation where the current checkout() implementation wants to become a fastpath for situations where the reset involves an empty tree as source (i.e. create everything and overwrite local changes).

On the way to reset() it's a valid choice to warm up more with the matter by improving on the current gix status implementation and ensuring correctness of what's there, which currently doesn't seem to be the case in comparison. Further, implementing gix status similarly to git status should be made possible.

Tasks Diff Correctness

  • low-level diff conversion pipeline (textconv support, gitattributes)
  • binary file diffs are treated as "new - size bytes added" and "old - size bytes removed", detect binary by data, driver, large-file, content
  • low-level diff platform
  • diffed blobs go through filters in rename-tracker
  • add missing variables for diff.driver for documentation purposes (and a test).
  • is textconv applied unconditionally? Or only if it's actually binary? -> A: always
  • allow controlling direction, so we can convert to git or to worktree + textconv - this is needed as depending on the storage location, different content is diffed or used as base.
    - rewrite-tracking uses what's stored inside of git (pretty sure)
    - user-diffing uses worktree + textconv, but only if textconv is specified
  • a convenient way to execute external diff programs (and mention diff.external in config-tree, probably with key)
  • diff-platform in gix uses gix-diff::blob::Platform to properly do all conversions
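
To illustrate the binary-file item above, here is a minimal sketch of detection by content together with the "size bytes added / size bytes removed" summary. It assumes the commonly used heuristic of looking for a NUL byte in the first 8000 bytes; `looks_binary`, `Summary` and `summarize` are hypothetical names, and detection by driver or large-file threshold is deliberately left out.

```rust
/// Number of leading bytes to inspect (assumption: the common NUL-byte heuristic).
const PROBE_LEN: usize = 8000;

/// Treat data as binary if a NUL byte appears within the first `PROBE_LEN` bytes.
fn looks_binary(data: &[u8]) -> bool {
    data.iter().take(PROBE_LEN).any(|b| *b == 0)
}

/// Hypothetical outcome of preparing two blobs for diffing.
#[derive(Debug, PartialEq, Eq)]
enum Summary {
    /// Both sides are text and can be handed to a line-based diff.
    Text,
    /// At least one side is binary: only report sizes, never a line diff.
    Binary { removed_bytes: usize, added_bytes: usize },
}

fn summarize(old: &[u8], new: &[u8]) -> Summary {
    if looks_binary(old) || looks_binary(new) {
        Summary::Binary {
            removed_bytes: old.len(),
            added_bytes: new.len(),
        }
    } else {
        Summary::Text
    }
}
```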

Next PR: Gix Status

  • what about index/worktree rename tracking? git2 can do that. Needs generalization of what's available for tree/tree diffs, at least learn from it.
  • a way to obtain untracked files to learn if changes can be made. What about the untracked files extension?
  • status in gix crate
  • fun: a way to apply filters in cat-file equivalent, and possibly textconv conversions just like in `git cat-file`.
  • diff index with index to learn what we would want to do in the worktree, or alternatively,
    diff tree with index (with reverse-diff functionality to simulate diff of index with tree), for better performance as it
    would avoid having to allocate a whole index even though we are only interested in a diff. Must include rename tracking.
  • how to make diff results available from status with all transformations applied?

Next PR: Reset

  • reset() that checks whether a worktree modification is allowed, or if an entry should be skipped. That way we can postpone safety checks like --hard
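
One way to read this item is a reset that asks a caller-provided check for every entry before touching it. The following is only a sketch under that assumption: `Decision` and `reset_paths` are hypothetical names, and the actual worktree modification is replaced by recording the path.

```rust
/// Hypothetical per-entry answer of the safety check.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Decision {
    /// The entry may be modified in the worktree.
    Apply,
    /// Leave this entry alone and continue with the next one.
    Skip,
    /// Stop the whole operation.
    Abort,
}

/// Visit each path, asking `decide` whether it may be touched.
/// Returns the paths that would have been modified.
fn reset_paths<'a>(
    paths: &[&'a str],
    mut decide: impl FnMut(&str) -> Decision,
) -> Result<Vec<&'a str>, String> {
    let mut applied = Vec::new();
    for &path in paths {
        match decide(path) {
            Decision::Apply => applied.push(path),
            Decision::Skip => continue,
            Decision::Abort => return Err(format!("refusing to overwrite '{path}'")),
        }
    }
    Ok(applied)
}
```

With the policy supplied from outside, the traversal itself does not need to know anything about reset modes.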

Postponed

What follows is important for resets, but won't be needed for cargo worktree resets.

  • gix status with actual submodule support - needs status in gix (crate) effectively
  • gix status with actual conflict support

Research

  • what about binary diffs?
  • Ignored files are considered expendable and can be overwritten on reset
  • How to integrate submodules - probably easy to answer once gix status can deal a little better with submodules. Even though in this case a lot of submodule-related information is needed for a complete reset, probably only doable by a higher-level caller which orchestrates it.
  • How to deal with various modes like merge and keep? How to control refresh? Maybe partial (only the files we touch), and full, to also update the files we don't touch as part of status? Maybe it's part of status if that is run before.
  • Worthwhile to make explicit the difference between git reset and git checkout in terms of HEAD modifications. With the former changing HEAD's referent, and the latter changing HEAD itself.
  • figure out how this relates to the current checkout() method as technically that's a reset --hard with optional overwrite check. Could it be rolled into one, with pathspec support added?
    • just keep them separate until it's clear that reset() performs just as well, which is unlikely as there is more overhead. But maybe it's not worth maintaining two versions over it. But if so, one should probably rename it.
  • for git status: what about rename tracking? It's available for tree-diffs and quite complex on its own. Probably only needs HEAD-vs-index rename tracking. No, also can have worktree rename tracking, even though it's hard to imagine how this can be fast unless it's tightly integrated with untracked-files handling. This screams for a generalization of the tracking code though as the testing and implementation is complex, but should be generalisable.
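
The reset-versus-checkout distinction noted above can be made explicit with a toy model. This is a sketch only: `Head` and `Repo` are hypothetical types, commits are plain strings, and the worktree and index are ignored entirely.

```rust
use std::collections::HashMap;

/// HEAD either points at a branch by name or directly at a commit.
#[derive(Debug, PartialEq, Eq, Clone)]
enum Head {
    Symbolic(String),
    Detached(String),
}

/// Toy repository: HEAD plus a map of branch name -> commit id.
struct Repo {
    head: Head,
    refs: HashMap<String, String>,
}

impl Repo {
    /// Like `git reset <commit>`: HEAD stays put, the branch it refers to moves.
    fn reset(&mut self, commit: &str) {
        let branch = match &self.head {
            Head::Symbolic(name) => Some(name.clone()),
            Head::Detached(_) => None,
        };
        match branch {
            Some(name) => {
                self.refs.insert(name, commit.to_string());
            }
            // With a detached HEAD there is no referent, so HEAD itself moves.
            None => self.head = Head::Detached(commit.to_string()),
        }
    }

    /// Like `git checkout <branch>`: HEAD itself changes, no branch moves.
    fn checkout(&mut self, branch: &str) {
        self.head = Head::Symbolic(branch.to_string());
    }
}
```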

@Byron Byron force-pushed the gix-status branch 9 times, most recently from f04cdb4 to 8d06b29 Compare November 18, 2023 20:50
@Byron Byron force-pushed the gix-status branch 2 times, most recently from d257099 to c4e4714 Compare November 24, 2023 20:11
…dHeader` trait.

That way one can know its decompressed size and its kind.

We also add a `FindObjectOrHeader` trait for use as `dyn` trait object that
can find objects and access their headers.
@Byron Byron force-pushed the gix-status branch 7 times, most recently from 5ed5ce7 to 9faf3f3 Compare November 28, 2023 08:41
Byron added 11 commits November 28, 2023 15:16
Note that this is also the minimal required version that is resolved
with `cargo +nightly update -Z minimal-versions`, but it's nothing
I could validate or reproduce myself just yet.
It makes it easier to manage a form of 'double buffering'
to better manage conditional alteration of a source buffer,
and to implement conversion pipelines which conditionally
transform an input over multiple steps.
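
A rough sketch of the double-buffering idea, assuming a pipeline where each step may or may not alter its input. `DoubleBuffer` and `strip_cr` are hypothetical names for illustration and do not mirror the type added in this commit.

```rust
/// Two buffers: `src` holds the current content, `dst` is scratch space for the next step.
#[derive(Default)]
struct DoubleBuffer {
    src: Vec<u8>,
    dst: Vec<u8>,
}

impl DoubleBuffer {
    fn new(input: &[u8]) -> Self {
        DoubleBuffer {
            src: input.to_vec(),
            dst: Vec::new(),
        }
    }

    /// Run one conversion step. The step reads `src` and may write `dst`;
    /// it returns `true` only if it produced output, in which case the buffers are swapped.
    fn step(&mut self, f: impl FnOnce(&[u8], &mut Vec<u8>) -> bool) {
        self.dst.clear();
        if f(&self.src, &mut self.dst) {
            std::mem::swap(&mut self.src, &mut self.dst);
        }
    }

    /// The content after all steps applied so far.
    fn current(&self) -> &[u8] {
        &self.src
    }
}

/// Example step: drop carriage returns, but only if there are any.
fn strip_cr(src: &[u8], dst: &mut Vec<u8>) -> bool {
    if !src.contains(&b'\r') {
        return false; // nothing to do, leave the source buffer untouched
    }
    dst.extend(src.iter().copied().filter(|b| *b != b'\r'));
    true
}
```

Steps that decide not to transform their input cost no copy, and both allocations are reused across steps.
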
It's required, but in practice has no effect as it's initialized at
just the right time anyway, which is when it does matter.

Also, re-export `gix_attributes as attributes` to allow using the types
it mentions in the public API.
An attribute selection affects the initialization, hence it should
be added first.
As otherwise, one cannot use `&dyn ` at all in this case as it's
unsized.

Additionally, rename top-level `pub use gix_glob` to `glob` to be
in-line with other public exports of this kind.
@Byron Byron force-pushed the gix-status branch 11 times, most recently from 5ab67b2 to 7aafd09 Compare December 2, 2023 16:24
Byron added 6 commits December 2, 2023 19:55
…ersions and caching.

The `Pipeline` provides ways to obtain content for use with a diffing algorithm,
and the `Platform` is a way to cache such content to efficiently perform MxN matrix
diffing, and possibly to prepare running external diff programs as well.
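
Why caching pays off for MxN diffing can be shown with a small sketch. It assumes a cache keyed by a resource id that stores already-converted content; `ResourceCache` and its fields are hypothetical and much simpler than the `Platform` described here.

```rust
use std::collections::HashMap;

/// Hypothetical cache of converted, diffable content, keyed by a resource id.
struct ResourceCache {
    /// Stand-in for the conversion pipeline (filters, textconv, ...).
    convert: fn(&[u8]) -> Vec<u8>,
    cache: HashMap<u32, Vec<u8>>,
    /// How often `convert` actually ran, to make the savings visible.
    conversions: usize,
}

impl ResourceCache {
    fn new(convert: fn(&[u8]) -> Vec<u8>) -> Self {
        ResourceCache {
            convert,
            cache: HashMap::new(),
            conversions: 0,
        }
    }

    /// Return the converted content for `id`, converting `raw` only on first use.
    fn get(&mut self, id: u32, raw: &[u8]) -> &[u8] {
        if !self.cache.contains_key(&id) {
            let converted = (self.convert)(raw);
            self.conversions += 1;
            self.cache.insert(id, converted);
        }
        &self.cache[&id]
    }
}
```

Comparing every one of M old resources against every one of N new ones then needs M + N conversions instead of 2 x M x N.
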
…timized diffing.

Correctness is improved as all necessary transformations are now performed.
Performance is improved by avoiding duplicate work by caching transformed
diffable data for later reuse.
…memory diffing of combinations of resources.

We also add the `object::tree::diff::Platform::for_each_to_obtain_tree_with_cache()` to pass a resource-cache
for re-use between multiple invocations for significant savings.
That way it's conceivable that applications correctly run either
a configured external diff tool, or one that is configured on a
per diff-driver basis, while being allowed to fall back to
a built-in implementation as needed.
It can handle it, so let's let it be a no-op.
…ersions.

Previously it would just offer the git-ODB version of a blob for diffing,
while it will now make it possible to apply all necessary conversion steps
for you.

This also moves `Event::diff()` to `Change::diff()`, adds
`Repository::diff_resource_cache()` and refactors nearly everything
about the `objects::blob::diff::Platform`.
@Byron Byron merged commit dfb3f18 into main Dec 2, 2023
18 checks passed
This was referenced Dec 2, 2023
1 participant