Add support for Git LFS #80

Open
martinvonz opened this issue Feb 24, 2022 · 34 comments
Labels
enhancement New feature or request

Comments

@martinvonz
Member

Git LFS seems to be used frequently enough that it may be worth adding support for it. I don't think it'll be a priority for me very soon, but I guess that depends on how many people want it.

The specification says that it uses clean/smudge filters. We don't have anything like that yet. So the first step is probably to add support for that. We could make the filters only available internally (i.e. in Rust code) to start with to keep it simple. On the other hand, it might not be hard to make them user-configurable.

Another option is to add a separate file type in the data model for LFS entries. For reference, we currently have files, symlinks, trees, conflicts, and gitmodules. I haven't thought through the consequences yet. There should be no difference to the user and no difference in the representation when using the Git backend. However, clean/smudge filters are probably useful to have anyway. Oh, one possible advantage of representing LFS entries in the model is that we can decide to always leave merged LFS files as conflicts, without downloading the files until the user checks them out or looks at the diff etc.

I don't yet know what other aspects of Git LFS we need to consider.

Originally requested in #77.

@ilyagr
Contributor

ilyagr commented Apr 30, 2023

Filters are supported in libgit2, but not in git2-rs. It's also apparently not very hard to implement the logic ourselves. See rust-lang/git2-rs#442.

Also, the list of files that use lfs is stored in .gitattributes, so this is related to #53.

@ilyagr ilyagr added the enhancement New feature or request label Apr 30, 2023
@martinvonz
Member Author

I'm worried about clean/smudge in general because it seems expensive to make it behave consistently. https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes says that smudging happens just before checkout and cleaning happens just before staging. If that's correct, then that seems to mean that git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly. On the other hand, if it's not correct, and smudging happens whenever you need to present the file to the user, then there are ugly corner cases to deal with, like where .gitattributes changes but subtrees remain the same (then you'd technically need to diff the whole tree recursively to tell if it actually changed). I think I asked someone on the Git team at Google about this and they said Git just ignores the corner case(s).

@ilyagr
Contributor

ilyagr commented May 1, 2023

there are ugly corner cases to deal with, like where .gitattributes changes but subtrees remain the same

(Update: On second thought, this paragraph might not really be addressing your point) Yes, I remember setting up the LFS repo being a pain. I don't remember how git reacted to changing .gitattributes, but it took a while to get right; my intention is to never change the setup (which directories LFS is used for). To make this possible, I have a repository just for LFS, separate from my main dotfiles repository.

For reference, the setup looks like this:

$ cat .gitattributes
.local/bin/* filter=lfs diff=lfs merge=lfs -text

(I use Github's LFS support to sync a few binaries across my machines. I use stow to symlink to the git repo's .local/bin from my real ~/.local/bin)

My sense is that only the filter=lfs -text part is crucial. The mapping from filter=lfs to actual git-lfs commands to run for cleaning/smudging happens inside the git config.
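As a rough illustration of that point, here is a minimal Rust sketch that looks only at `filter=lfs` when reading a `.gitattributes` line. The function name `lfs_pattern` is made up for this example, and the parsing is deliberately simplified: it assumes patterns without spaces or quoting and does not model later lines overriding earlier ones.

```rust
/// Returns the path pattern of a `.gitattributes` line if that line assigns
/// `filter=lfs`, and `None` otherwise.
///
/// Hypothetical helper: assumes the pattern contains no spaces or quoting.
fn lfs_pattern(line: &str) -> Option<&str> {
    let line = line.trim();
    // Blank lines and comment lines carry no attributes.
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let mut fields = line.split_whitespace();
    // The first field is the path pattern; the rest are attributes.
    let pattern = fields.next()?;
    if fields.any(|attr| attr == "filter=lfs") {
        Some(pattern)
    } else {
        None
    }
}
```

Applied to the line above, this would pick out `.local/bin/*` and ignore the `diff`, `merge`, and `-text` attributes.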

If that's correct, then that seems to mean that git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly.

I think that's what the diff and merge gitattributes are for, but they don't seem to have much of an effect now:

$ git diff -r HEAD^
diff --git a/.local/bin/hwatch b/.local/bin/hwatch
index acc9c52..5e4ed26 100755
--- a/.local/bin/hwatch
+++ b/.local/bin/hwatch
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:c0e1ce1ee1a4841f2df5b7721bbeefe7d91f12bfe77b0922c9ac708f93379f2b
-size 7989912
+oid sha256:4dbaaf94bfd0812a38e5eb9bc01ec0f22f0953efc13be768aa310deb6b5982ce
+size 7912656

This works pretty well for LFS specifically. I'm not sure what else, if anything, clean/smudge filters are used for.
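For readers unfamiliar with the pointer format in the diff above, a small Rust sketch of reading one follows. The `LfsPointer` type and `parse_lfs_pointer` function are names invented for this example; it only handles the three lines visible in the diff (`version`, `oid sha256:`, `size`) and is not a full implementation of the LFS spec.

```rust
/// The two fields of a Git LFS pointer file that identify the real content.
/// (Illustrative type, not taken from jj or git-lfs.)
#[derive(Debug, PartialEq)]
struct LfsPointer {
    /// Hex-encoded SHA-256 of the real file content.
    oid: String,
    /// Size of the real file in bytes.
    size: u64,
}

/// Parses pointer text like the one in the diff above. Returns `None` if the
/// version line, the `oid sha256:` line, or a valid `size` line is missing.
fn parse_lfs_pointer(text: &str) -> Option<LfsPointer> {
    let mut has_version = false;
    let mut oid = None;
    let mut size = None;
    for line in text.lines() {
        if line == "version https://git-lfs.github.com/spec/v1" {
            has_version = true;
        } else if let Some(hex) = line.strip_prefix("oid sha256:") {
            oid = Some(hex.to_string());
        } else if let Some(n) = line.strip_prefix("size ") {
            size = n.parse::<u64>().ok();
        }
    }
    if !has_version {
        return None;
    }
    Some(LfsPointer { oid: oid?, size: size? })
}
```

A tool that wants to show something friendlier than the raw pointer diff could start from a parse like this and then fetch or compare the referenced objects.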

This does make LFS a little different from the way jj treats conflicted files.

@ilyagr
Contributor

ilyagr commented May 1, 2023

This might all be quite awkward with jj's auto-rebasing. OTOH, the level of awkwardness would be similar to what happens if one tracks a binary file in a jj repository normally (without LFS), which is something we should probably eventually improve if we can (I'm not sure how).

@Ralith
Contributor

Ralith commented May 1, 2023

Note that LFS is widely disliked due to half-baked support from GitHub (small quotas, made worse by being consumed by both third-party forks and CI activity) and even less support elsewhere. I hope jj can offer a first-class solution to large binary file handling eventually, and that any LFS compat is forwards-compatible with that.

git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly

What else might they display? A diff of a large binary file will rarely be intelligible, even with a suitable diff algorithm (e.g. based on rolling hashes). I suppose it would be cool for a Sufficiently Smart diff viewer to be able to e.g. display two versions of a .png, but in general large binary files are opaque.

@martinvonz
Member Author

git diff on a commit that changes an LFS file would just show the changed metadata, which is not very user-friendly

What else might they display? A diff of a large binary file will rarely be intelligible

Fair enough :) But the point still stands for clean/smudge in general.

@ilyagr
Contributor

ilyagr commented May 1, 2023

As I mentioned, there seemed to be some attempt to support intelligent diffing with LFS. I would guess that the original idea was for the user to configure custom diff tools (e.g. for pngs) that would be called by git-lfs diff or something like that.

@ghost

ghost commented May 1, 2023

Note that LFS is widely disliked

Regardless of LFS's merits, I imagine there's plenty of other users like me that would love to use jj in an existing repo that uses git lfs, but currently can't. Whether I use jj should be opaque to others in the repo, so this wouldn't be a compelling reason to migrate the codebase away from lfs (to whatever solution may be better) in a many-user repo.

@martinvonz
Member Author

I agree, support for LFS would be mostly (maybe only) to make it easier to use jj with existing git repos.

@ghost

ghost commented May 31, 2023

Is there a way to ignore LFS files so that LFS-enabled repos can still use jj, even if it means we can't interact with LFS itself while using jj? Just wondering if there's a way to use jj without needing full support for LFS.

@martinvonz
Member Author

Is there a way to ignore LFS files so that LFS-enabled repos can still use jj, even if it means we can't interact with LFS itself while using jj? Just wondering if there's a way to use jj without needing full support for LFS.

If you don't need the LFS files, then it probably already works - you'd just see the placeholder files (pointers to the real content) in the working copy, I think. But I suspect that you need the actual files, and for that I can't think of a good solution.

Oh, using sparse checkouts in a colocated repo might work. However, sparse checkouts don't currently support negative patterns, so it could be very annoying to maintain the sparse patterns depending on your repo. Hmm, it also looks like we don't have any documentation about sparse checkouts, other than jj help sparse. Run jj sparse set --clear --add <path prefix> --add <path prefix> .... If you realize it's unmaintainable, run jj sparse set --reset to include all paths in the working copy again.

@waylon-brown

waylon-brown commented Jun 1, 2023

So sparse seems like an inverse .gitignore (in that it's a list of path inclusions only)? Unfortunately like you said, this would probably only work for a sizable repo if it supported negative patterns, if I have it correct that --remove only removes an existing inclusion, rather than actually adds a negative matching pattern.

This solution would have worked great especially if it supported negative globs so that I could just add my existing LFS globs in my .gitattributes (ex. **/snapshots/**/*.png) into jj sparse.

@martinvonz
Member Author

We do want to add support for arbitrary patterns in jj sparse. If you have time to spare, I think a good start would be to add a new GlobMatcher in https://github.com/martinvonz/jj/blob/main/lib/src/matchers.rs.

Then we'd also need to figure out the UX for adding and removing patterns. Git seems to use the same format as for .gitignores (https://git-scm.com/docs/git-sparse-checkout). It seems that you can add paths to the list with git sparse-checkout add <pattern>, but I didn't find a command to modify the list. Maybe you need to manually edit .git/info/sparse-checkout for that. Also, Git has something called "cone mode". I hope we can avoid exposing something like that to the user.

@71
Contributor

71 commented Jun 2, 2023

Instead of patterns being "prefixes" as they are now, can't we allow arbitrary globs in jj sparse set (--add|--remove) in terms of UI? Then you could add, remove and list patterns (rather than paths) from the CLI.

Reading comments in this issue, it seems like jj sparse could make it easier to work with Git LFS and replace git update-index --skip-worktree (which I've been using a lot recently), so I could also chip in and try to help bring this.

Edit: adding/removing globs in the command line is (as the git docs mention) error-prone. I personally think this is fine as long as we add warnings about it. An alternative would be a jj sparse edit command which brings up $EDITOR on a temporary file with one pattern per line. After saving the file, jj parses the saved file as patterns and saves them wherever/however it wants.

Edit 2: the git docs mention that having "non-cone" globs can slow down commands, but (naively) this seems like it could be solved by compiling patterns similarly to what globset does.

@martinvonz
Member Author

Instead of patterns being "prefixes" as they are now, can't we allow arbitrary globs in jj sparse set (--add|--remove) in terms of UI? Then you could add, remove and list patterns (rather than paths) from the CLI.

If we accept both things like docs/ and **/Cargo.toml, then the issue becomes how to tell which is which. Is docs a file called that or should it match recursively? We could "solve" that by saying that globs are also recursive, so the glob docs also matches all files anywhere under that directory, but I think that will make e.g. **/Cargo.toml confusing, because you would probably not expect that to match lib/Cargo.toml/foo. That's probably not much of an issue in practice for that sample path (no one creates a Cargo.toml directory), but there are probably other examples that do happen in practice. Mercurial solves the problem by allowing a prefix to specify what kind of pattern it is. We can just copy that solution.
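To make the kind-prefix idea concrete, here is a hedged Rust sketch. The kind names `prefix:` and `glob:`, the `SparsePattern` enum, and `parse_sparse_pattern` are all assumptions made for this example, not an existing jj syntax or API. A pattern without a kind is treated as a path prefix, which matches how sparse patterns are described in this thread.

```rust
/// How a sparse pattern should be interpreted (illustrative type).
#[derive(Debug, PartialEq)]
enum SparsePattern {
    /// Matches the path itself and everything below it.
    Prefix(String),
    /// Matches only the paths that the glob itself matches.
    Glob(String),
}

/// Parses `kind:pattern`. The kind names are hypothetical. Input without a
/// kind is treated as a prefix pattern.
///
/// Limitation of this sketch: a bare path that contains ':' is rejected as
/// an unknown kind and would have to be written as `prefix:<path>`.
fn parse_sparse_pattern(input: &str) -> Result<SparsePattern, String> {
    match input.split_once(':') {
        Some(("prefix", rest)) => Ok(SparsePattern::Prefix(rest.to_string())),
        Some(("glob", rest)) => Ok(SparsePattern::Glob(rest.to_string())),
        Some((kind, _)) => Err(format!("unknown pattern kind: {}", kind)),
        None => Ok(SparsePattern::Prefix(input.to_string())),
    }
}
```

With an explicit kind, `docs` stays recursive while `glob:**/Cargo.toml` would not be expected to match `lib/Cargo.toml/foo`.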

Reading comments in this issue, it seems like jj sparse could make it easier to work with Git LFS and replace git update-index --skip-worktree (which I've been using a lot recently), so I could also chip in and try to help bring this.

That would be appreciated, thanks! Just to be clear, jj sparse makes jj completely ignore the non-sparse paths, so you will have to rely on git to populate and update those paths.

FYI, the typical use case for sparse checkouts is when you're working on only a small part of a large repo, like working only on a particular file system in the Linux repo (I have never worked in the Linux repo, so I have no idea if that's a realistic example - maybe you need most of the repo in order to do a build anyway, for example).

Edit: adding/removing globs in the command line is (as the git docs mention) error-prone. I personally think this is fine as long as we add warnings about it. An alternative would be a jj sparse edit command which brings up $EDITOR on a temporary file with one pattern per line. After saving the file, jj parses the saved file as patterns and saves them wherever/however it wants.

Yes, I think that would be useful. I was looking for git sparse-checkout edit when I was typing my previous message :)

Edit 2: the git docs mention that having "non-cone" globs can slow down commands, but (naively) this seems like it could be solved by compiling patterns similarly to what globset does.

There are still cases that can't be made fast, like **/Cargo.toml, for example. We need to visit every file in the repo to see if it matches. We can add a warning if the user adds a pattern like that. There are less clear cases like some/dir/**/Cargo.toml, which may be cheap if some/dir/ is small. So maybe what we want to do is to apply the pattern, then check how many paths we visit and how many paths match, and warn if < 1% match or something. But that might be going too far :) So maybe we warn exactly when a pattern is not a pure prefix pattern (i.e. exactly what git's cone mode allows).
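Both warning rules mentioned above are easy to sketch in Rust. The function names and the exact set of glob metacharacters are assumptions for illustration; the 1% threshold is the one floated in the comment, not a tested value.

```rust
/// Returns true if `pattern` contains none of the glob metacharacters this
/// sketch knows about, i.e. it can be treated as a plain path prefix.
fn is_pure_prefix(pattern: &str) -> bool {
    !pattern
        .chars()
        .any(|c| matches!(c, '*' | '?' | '[' | ']' | '{' | '}'))
}

/// The "< 1% match" idea: true when applying a pattern visited some paths
/// but fewer than 1% of them matched.
fn matches_too_few(visited: usize, matched: usize) -> bool {
    // matched / visited < 1 / 100, written without floating point.
    visited > 0 && matched.saturating_mul(100) < visited
}
```

The first check needs no repository access and can run when the pattern is added; the second can only run after the pattern has actually been applied.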

@elasticdog
Contributor

For posterity, there are other legit uses of Git's clean/smudge filters beyond LFS: https://github.com/elasticdog/transcrypt

@71
Contributor

71 commented Jun 6, 2023

Oh, I didn't realize that jj erases all files not in jj sparse list; I thought it would simply ignore them (the way Git ignores files with --skip-worktree). Then I'm not sure jj sparse is the way to go (for my particular use case).

With that said, I actually implemented jj sparse set --edit before I realized this, so I can submit a PR for it (and even if it's not submitted, it can serve as future reference).

@ilyagr
Contributor

ilyagr commented Jun 6, 2023

Oh, I didn't realize that jj erases all files not in jj sparse list; I thought it would simply ignore them

This confused me. My impression was the opposite: jj is only allowed to touch (or erase) files in jj sparse list, and should ignore files not in jj sparse list. Did I miss something?

@martinvonz
Member Author

Oh, I didn't realize that jj erases all files not in jj sparse list; I thought it would simply ignore them

This confused me. My impression was the opposite: jj is only allowed to touch (or erase) files in jj sparse list, and should ignore files not in jj sparse list. Did I miss something?

I think @71 meant when you go from having some part of the workspace populated to having that part not populated, then the jj sparse set command will remove those paths. For example, if you do jj git clone <the jj repo itself> and then jj sparse set --clear --add src, then all of lib, docs etc. will be removed.

@71, if you make git populate those paths after setting the sparse patterns with jj sparse, does that work for you?

@71
Contributor

71 commented Jun 7, 2023

In my case, the problem was that I had files that I did not want in the Git repo.

  1. jj git clone ... && cd ...

  2. echo abc > abc

  3. jj sparse set --clear --add src

    At this point abc does not exist in the working copy anymore, but jj st shows it. I didn't realize that the whole directory had been removed, did jj untrack abc, and lost the file. I can also use jj sparse set --add abc after 3. to recover abc, but am not sure how to recover abc through Git (without adding it back into the repo).

@martinvonz
Member Author

3. but am not sure how to recover abc through Git (without adding it back into the repo).

Do you mean that you want it as an untracked file? You can do jj cat abc > abc (jj cat abc reads the content from a commit, and defaults to reading it from the working-copy commit).

chriskrycho added a commit to chriskrycho/v6.chriskrycho.com that referenced this issue Jul 23, 2023
I will add this again later, [once it is supported][issue]. In the meantime,
though, I will ignore it from Jujutsu's POV, but force add it from Git's POV
so that it ends up in the repo from a Git(Hub) perspective without confusing
Jujutsu.

[issue]: jj-vcs/jj#80
@Valodim
Contributor

Valodim commented Aug 3, 2023

Regardless of LFS's merits, I imagine there's plenty of other users like me that would love to use jj in an existing repo that uses git lfs, but currently can't.

I would like to echo this sentiment: Beloved or not, LFS is fairly widely used to track binary files in repos for various reasons, and missing support for it excludes a large number of repositories from use with jj. It's a very promising sign that "mundane" user compatibility concerns like this are taken seriously 👍 thanks for that, and thanks for jujutsu!

@glencbz
Contributor

glencbz commented May 28, 2024

To add some colour to this, I think it's not necessary that jj support the full set of Git LFS features (like smudge, clean, etc), only that jj be able to interop gracefully in a colocated Git LFS repo. IMO it's okay to say this is out of scope for jj, then rely on git lfs checkout to smudge the files, and git commit to commit cleaned files.

If so, then perhaps all that's needed is for jj to just ignore files that are named in .gitattributes, i.e.:

  1. Ignore them when snapshotting. Unlike .gitignore, this ignoring should happen even if the files are already tracked
  2. Don't check them out. This would be trickier, because we'd need a mechanism to inspect the tree before writing it to disk. But not having this feature wouldn't be so bad I think, because you can still recover with git lfs checkout IIUC.

There will be some unhappy cases when the .gitattributes file changes, and some files that used to be ignored aren't ignored any more (and vice-versa), but it should work most of the time.

Apologies if this was suggested earlier, I tried to digest the discussion so far as best as I could.

@bcspragu

This is the one thing stopping me from using jj everywhere. I use it for most repos, and am always forgetting the few existing LFS-using repos I can't use it with yet. Similar to other comments, I'd be fine with a workaround (e.g. ignoring files in .gitattributes entirely), and I'd be happy to do the work and maintain the patch with a few pointers.

@martinvonz
Member Author

Thanks! That should only require changes to https://github.com/jj-vcs/jj/blob/main/lib/src/local_working_copy.rs, specifically the snapshot() and check_out() methods.

Note that this approach means that things like jj diff will only show the changes in the underlying files (i.e. changed hashes, I think). I hope that's okay.

@arxanas
Contributor

arxanas commented Dec 29, 2024

@bcspragu Depending on your appetite for implementation, you could also try to pick up the work I linked in this comment: #2920 (comment). It should also handle arbitrary smudge/clean filters, submodules, etc., but needs more investment.

@bcspragu

bcspragu commented Jan 4, 2025

I have a functional (but extremely janky) patch here main...bcspragu:jj:main. Some notes:

  • I implemented it as a new GitAttributesMatcher that gets subtracted in a DifferenceMatcher to remove LFS files
  • Modifying the matchers in snapshot() and check_out() didn't stop files mentioned in .gitattributes from appearing in jj st
    • I don't know enough of the internals of jj to understand why, I'm sure I'm doing something simple wrong
    • I hacked around this by updating the matcher in cmd_status directly. This works (and proves out the functionality), but definitely isn't correct
  • Error-handling + when to actually parse the .gitattributes file needs work
  • I couldn't figure out how to get a path to the repo root relative to current directory, so it currently assumes commands are run from the repo root
    • Similarly, non-root .gitattributes files are not currently handled
  • It reuses the ignore crate for parsing the .gitattributes, and manually handles where the two file formats differ (described in the Git docs)
    • The alternative was introducing another crate (e.g.), but I figured that was less desirable
  • It decides a file is an LFS file if the entry in .gitattributes includes filter=lfs
    • No idea if there's a better heuristic (or if checking diff=... and merge=... is necessary too)
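The "subtract the LFS files from a matcher" idea in the notes above can be sketched in Rust as follows. This is a simplified stand-in, not the code from the patch and not jj's actual `Matcher` trait: `PathMatcher`, `Everything`, `LfsSuffixes`, and `Difference` are names invented here, and matching by file-name suffix replaces real `.gitattributes` pattern matching.

```rust
/// Simplified stand-in for a path matcher (not jj's real `Matcher` trait).
trait PathMatcher {
    fn matches(&self, path: &str) -> bool;
}

/// Matches every path.
struct Everything;

impl PathMatcher for Everything {
    fn matches(&self, _path: &str) -> bool {
        true
    }
}

/// Matches paths ending in one of the given suffixes. Stands in for a
/// matcher built from the `filter=lfs` entries of `.gitattributes`.
struct LfsSuffixes(Vec<String>);

impl PathMatcher for LfsSuffixes {
    fn matches(&self, path: &str) -> bool {
        self.0.iter().any(|suffix| path.ends_with(suffix.as_str()))
    }
}

/// Matches what `wanted` matches, minus what `unwanted` matches.
struct Difference<A, B> {
    wanted: A,
    unwanted: B,
}

impl<A: PathMatcher, B: PathMatcher> PathMatcher for Difference<A, B> {
    fn matches(&self, path: &str) -> bool {
        self.wanted.matches(path) && !self.unwanted.matches(path)
    }
}
```

Building `Difference { wanted: Everything, unwanted: LfsSuffixes(...) }` gives a matcher that lets ordinary files through and skips the LFS-managed ones.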

Depending on your appetite for implementation [...]

Apologies, but I'm not quite that hungry yet 😅. Though I think that approach is a much more reasonable one in terms of what could actually be merged in

@weiznich

weiznich commented Jan 6, 2025

@bcspragu I took some time to add support for multiple gitattributes files on top of your change in weiznich@a125051. That also fixes the problem with the relative paths. Maybe one of the jj maintainers can give some hints how to move this forward from here on?

@martinvonz
Member Author

I'm not sure the GitAttributesMatcher solution is sufficient. There can be many .gitattributes sprinkled throughout the directory tree, so I think you'll need to do something similar to the .gitignore handling, where you create an updated matcher each time you visit a directory.
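A rough Rust sketch of the "updated matcher per directory" approach follows. The `LfsRules` type and its `descend` method are hypothetical names, and the rules are reduced to file-name suffixes; real `.gitattributes` handling would also need pattern anchoring, unsetting (`-filter`), and precedence between files.

```rust
/// LFS rules in effect for one directory (illustrative type).
#[derive(Clone, Default)]
struct LfsRules {
    /// File-name suffixes marked with `filter=lfs` so far.
    suffixes: Vec<String>,
}

impl LfsRules {
    /// Returns the rules for a subdirectory: everything inherited from the
    /// parent plus what that directory's own `.gitattributes` adds.
    fn descend(&self, local_suffixes: &[&str]) -> LfsRules {
        let mut suffixes = self.suffixes.clone();
        suffixes.extend(local_suffixes.iter().map(|s| (*s).to_string()));
        LfsRules { suffixes }
    }

    /// Returns true if a file with this name is LFS-managed here.
    fn is_lfs(&self, file_name: &str) -> bool {
        self.suffixes.iter().any(|s| file_name.ends_with(s.as_str()))
    }
}
```

A directory walk would start from `LfsRules::default()` at the root and call `descend` each time it enters a directory, so a rule declared in a subdirectory never leaks into its siblings.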

@weiznich

weiznich commented Jan 6, 2025

@martinvonz https://github.com/weiznich/jj/blob/ignore_lfs/lib/src/matchers.rs#L592-L616 implements exactly that. I've run it locally on a repository that uses several gitattributes files in different subfolders and it seems to work fine.

@martinvonz
Member Author

Please send a PR when it's ready. That should include tests. Please put those tests in the library crate (not the CLI crate).

@bcspragu

bcspragu commented Jan 6, 2025

@bcspragu I took some time to add support for multiple gitattributes files on top of your change in weiznich@a125051.

Thanks for the updates, looks excellent!

Please send a PR when it's ready. That should include tests. Please put those tests in the library crate (not the CLI crate).

I can pull in @weiznich's changes (if that works for them) and add some library tests. Just to confirm, the idea is to ship the LFS-ignoring functionality as a part of jj? Should it be behind a config option or something?

@weiznich

weiznich commented Jan 6, 2025

@bcspragu If you have the capacity to put up a PR with tests that would be great. Just pull in my changes if that helps you.

@martinvonz
Member Author

Just to confirm, the idea is to ship the LFS-ignoring functionality as a part of jj?

Yes, I'm fine with that. I assume others will object if they disagree.

Should it be behind a config option or something?

That's probably best.
