Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fetchgit with leaveDotGit = true is still not completely deterministic #8567

Open
wizeman opened this issue Jun 30, 2015 · 72 comments
Open

fetchgit with leaveDotGit = true is still not completely deterministic #8567

wizeman opened this issue Jun 30, 2015 · 72 comments

Comments

@wizeman
Copy link
Member

wizeman commented Jun 30, 2015

The hash of the cargo git repository returned by fetchgit has changed twice already, even though the git revision didn't change (see de322b4 and now #8566).

See also: #4752 and #4767

cc @bjornfor and @madjar (due to #4767).

@bjornfor
Copy link
Contributor

How sad. So it may change even with the same git version?

It'd be interesting to know exactly what in .git/ that changes. But if there is non-determinism for single-threaded git repack of the repo, then I don't know what to do.

Btw, here is the script I used to "prove" that git clones could be deterministic (if anyone wants to play around with it):

https://gist.github.com/bjornfor/bb96b09d4bd1a488cd01

It has a copy of the "make_deterministic_repo()" function from nixpkgs and it creates three repos in current dir with the same basename as the script itself.

UPDATE: I mean "make_deterministic_repo" (not ...clone())

@madjar
Copy link
Member

madjar commented Jun 30, 2015

I feared the day would happen.

As @bjornfor, it would be great if you could get your hands on two versions of the same derivation with different hashes. This might be hard, because sometimes it depends on the upstream (I first had problems when new commits where pushed to the master branch). Maybe a new branch is added, or something like that?

@wizeman
Copy link
Member Author

wizeman commented Jun 30, 2015

Ok, so I tarred up my local copy of the cargo src tree in my /nix/store (which is supposed to have the old sha256 hash), and then used nix-prefetch-git with --leave-dotGit and --fetch-submodules to clone a new tree to see how they differ. I have verified that nix-prefetch-git returned the same new sha256 hash as mentioned in issue #8566.

This is the resulting diff:
https://gist.github.com/wizeman/291be066b61e591051af

You can find the tarballs with full contents here:
https://dl.dropboxusercontent.com/u/731705/cargo-old.tar.bz2
https://dl.dropboxusercontent.com/u/731705/cargo-new.tar.bz2

@wizeman
Copy link
Member Author

wizeman commented Jun 30, 2015

The most obvious differences are inside the fetched submodule in src/rust-installer, specifically:

  1. In the old clone, there is a .git/shallow file (which contains a hash). This file doesn't exist in the new clone.
  2. In the new clone, there is a new .git/refs/remotes directory, which itself contains an empty origin subdirectory. The old clone doesn't have these directories.

It seems that there are also differences in the pack files, but I haven't examined these.

@wizeman
Copy link
Member Author

wizeman commented Jun 30, 2015

Regarding point 1, it seems that the previous fetchgit did a shallow clone of the src/rust-installer submodule, because git log only shows one commit (the last one). However, now it seems to be doing a non-shallow clone of that submodule, because I can see the commit history (but only up to the same commit as in the previous fetchgit).

@madjar
Copy link
Member

madjar commented Jun 30, 2015

Do you have a way to get the git versions used for both clones? If they differ, a change in the behavior might explain the difference.

@jagajaga
Copy link
Member

Do you use --deepClone?

@wizeman
Copy link
Member Author

wizeman commented Jul 1, 2015

Neither the cargo derivation nor my nix-prefetch-git command used the deepClone option, they only used leaveDotGit and the option to fetch submodules.

Regarding git, I was using version 2.4.1 when I ran the nix-prefetch-git command.
However, I don't know with which git version my pre-existing cargo source was downloaded with. Apparently, the build logs of fetchgit derivations don't mention the git version...

@bjornfor
Copy link
Contributor

bjornfor commented Jul 1, 2015

I grabbed this diff from two repositories at work, cloned at different times (and probably with different git versions):

https://gist.github.com/bjornfor/5e5ad0a03a52cbc6f70b

@andrewrk
Copy link
Member

what's the use case for leaveDotGit = true? It seems one could clone the repository with a normal git clone, then rm -rf the .git folder, and viola, you have a deterministic fetch. What am I missing?

@ghost
Copy link

ghost commented Jul 16, 2015

@andrewrk AFAIK, some packages use a build system which requires the .git folder to be present. leaveDotGit clones the repository with the .git folder intact.

@aflatter
Copy link
Contributor

@andrewrk e.g. some ruby gems use git ls-files in their gemspec to define the list of all files in the package.

@nbp
Copy link
Member

nbp commented Jul 17, 2015

What is causing such non-determinism? Is that we might be downloading separated commit, then later, as the repository is optimized, we might download packed files?

I think this is doable if we do not clone all the branches/tag, but only the one that we care about, and run

git repack -dnA
git gc --prune=now

Otherwise, we would have to white-list the tags / branches that we are interested in.

@nbp
Copy link
Member

nbp commented Jul 17, 2015

Also the solution above was suggested by @bjornfor before.

Regarding point 1, it seems that the previous fetchgit did a shallow clone of the src/rust-installer submodule, because git log only shows one commit (the last one). However, now it seems to be doing a non-shallow clone of that submodule, because I can see the commit history (but only up to the same commit as in the previous fetchgit).

Yes, this is an optimization I made a while ago, because otherwise we were pulling full repositories all the time, which is extremely inefficient. I think we can get rid of this limitation when we are downloading a copy of the history, by creating a branch, and making a local non-shared --depth=1 clone of the submodule and replacing the .git, by the one of the cloned version.

@vcunat vcunat mentioned this issue Jul 18, 2015
@wizeman
Copy link
Member Author

wizeman commented Jul 21, 2015

It seems that leaveDotGit still continues to cause hash mismatches for the cargo package, making it very hard to maintain.

See: de322b4, #8566 and #8862.

@andrewrk
Copy link
Member

Here's some brainstorming:

Modify leaveDotGit to remove certain problematic files that seem to be nondeterministic (as seen by some of those diffs posted above). The specific files that are removed are likely to not be read by the upstream package build process, and if they are, the build process will fail due to ENOENT.

wizeman added a commit that referenced this issue Jul 21, 2015
For now, let's remove `leaveDotGit`. The only visible effect I could see
was that `cargo --version` won't report the git commit anymore, but
that's only a minor issue compared to the build breaking often.

Fixes #8566 and closes #8862.
@wizeman
Copy link
Member Author

wizeman commented Jul 21, 2015

FWIW, as a workaround I've disabled leaveDotGit for cargo, as it seemed to be causing more problems than it was solving.

@anderspapitto
Copy link
Contributor

@andrewrk another approach would be to "normalize" the git repo by

rm -rf .git
git init
git add .
git commit -am "nix build commit"

which will leave all functionality intact that cares about the actual files (like git ls-files). Build processes that break because they expect certain things in the git history itself should be much rarer.

@bendlas
Copy link
Contributor

bendlas commented Aug 21, 2015

I've attempted to make --leave-dotGit more deterministic, by unpacking those packs into the plain, uncompressed object db format: bendlas@4b9c24a

Unfortunately, this leads to massive growth in git repos (x 20), but since --leave-dotGit should be used mostly for build-time only sources, this might be tolerable, if it buys us a deterministic --leave-dotGit.

@edolstra
Copy link
Member

IMHO, we should remove leaveDotGit. That flag is just a mistake. As far a I know, the contents of .git do not have a canonical representation guaranteed to be stable across Git versions, so attempts to make this "deterministic" are doomed to fail. (Note it's not just a matter of determinism, but also of Git not changing. We don't want the hash to change even when the git dependency to fetchgit is upgraded.)

For packages that really need .git, the solution is to just tar the entire tree and put it some place where it can be downloaded using fetchurl.

@vcunat
Copy link
Member

vcunat commented Aug 21, 2015

It's certainly best to avoid dealing with .git at all, and only a few packages use leaveDotGit.

In any case, packing isn't meant to be fully deterministic. Efficiency and settings of that can differ in different git versions. @bendlas: I don't see why first pack everything and then unpack it.

@bendlas
Copy link
Contributor

bendlas commented Aug 21, 2015

@edolstra I'm a big fan of removing leaveDotGit, as soon as all packages that use it are updated. However, I also think that until then, unpacking the objects is a good stop-gap. The basic git object-db format is pretty stable, so determinism would only depend on the gc working consistently.

@vcunat I wasn't sure if git gc would do the same thing without a preceding repack. Also, repacking makes sure that we only deal with pack files, i.e. can mkdir a fresh object dir and unpack into that.

Another thought: Would it be worth it to have a canonical, deterministic representation of versioned repositories in nix? Not git specific, but with a git dir conjurable from such a representation (as well as other formats)?

@nbp
Copy link
Member

nbp commented Aug 21, 2015

I guess the main reason of having the leaveDotGit option around is to be able to answer a few git commands. The most common is likely to be git describe, but I would not bet on the fact that this is the only one.

What I can suggests would be to provide a placeholder .git directory in the downloaded sources which serves as a cache of expected commands. Then we provide a wrapped git command which check if this .git folder is a repository or a cache and issue an error if there is no cache hit for the given command.

Also, note that Hydra (used to?) relies on leaveDotGit to do things such as merging top-git branches. So we might be able to remove it from the nix expression, but we will have to keep it in the script.

@the-sun-will-rise-tomorrow
Copy link
Contributor

the-sun-will-rise-tomorrow commented Apr 26, 2024

Has anyone tried doing a git-fast-export then git-fast-import? In theory, that should perfectly recreate the repository using only the necessary objects, but unlike git repack, it will be done by creating a new repository instead of mutating an existing one (which is what make_deterministic_repo does), which I think should be more deterministic.

I'm testing this:

{ pkgs
, lib
, stdenv
, git
, cacert
}:

stdenv.mkDerivation {
  name = "stable-git-test";
  dontUnpack = true;
  buildInputs = [
    git
    cacert
  ];
  buildPhase = ''
    git clone https://github.com/WxNzEMof/test.git stable-git-test-full
    git init stable-git-test
    git -C stable-git-test-full fast-export 837ef898e8213b9e0c86cb99581328b0d7541489 |
    git -C stable-git-test fast-import
    git -C stable-git-test update-ref HEAD 837ef898e8213b9e0c86cb99581328b0d7541489
    rm -rf stable-git-test/.git/hooks stable-git-test/.git/logs
  '';
  installPhase = ''
    mv stable-git-test $out
  '';
  outputHash = "sha256-zCuCvoC6u3u57br8nGPFNBaBakVqx9SHHIeRtN4YTEk=";
  outputHashMode = "recursive";
}

Seems deterministic so far, even with new commits, new branches...

@Atemu
Copy link
Member

Atemu commented Apr 26, 2024

What about different versions of git?

@the-sun-will-rise-tomorrow
Copy link
Contributor

I haven't tried that, but couldn't we pin the Git version?

@Atemu
Copy link
Member

Atemu commented Apr 27, 2024

I don't think that's realistic given that fetchgit is somewhat security-critical. We will also need further git features at some point (sha256 support is looming for instance), so this does not strike me as a sustainable solution.

@CyberShadow
Copy link
Contributor

CyberShadow commented Apr 27, 2024

Has anyone tried doing a git-fast-export then git-fast-import?

That won't work all of the time, as the fast-export format does not preserve all metadata perfectly (e.g. signed commits), so such a process won't necessarily create a repository containing a desired revision. It also gives little control over e.g. which tags are copied - we generally wouldn't want to copy all tags on reachable commits, because a developer may retroactively tag an old commit, breaking determinism.

As I understand, the constraints are:

  • We need to create a Git repository containing only the specified revision (and optionally the revisions in its closure), and no other objects or refs.
  • It can't just be a collection of loose object files, as that takes too much space on disk. We have to use Git pack files.
  • We shouldn't use Git itself to create the new repository, as we don't control its determinism, and its implementation is subject to change.

Here is a start in that direction: https://github.com/CyberShadow/misc/blob/master/git-copy-revs.d

This satisfies the first and second constraint, but not the third.

However, I think it wouldn't be too hard to replace the git-based object writer in that program with a stand-alone pack file writer. The pack file format isn't too complicated. The implementation doesn't need to be perfect; just produce reasonably sized pack files that are readable by Git, and be deterministic.

Would such a program be usable in nixpkgs?

@marc-hb
Copy link

marc-hb commented Apr 27, 2024

It could be fun if nix derivations could accept a different checksumming method. Even if the folder checksum differs I trust git to always give me the same checked-out content. Having to provide both the git ref and the folder checksum feels redundant.

...

  1. Compute the hash of the whole directory, excluding .git.
  2. Get the checked-out commit (in git terms, that's what makes a repo "reproducible").
  3. Combine both previous points to get the final hash.

Obviously it's not 100% pure, but it would be OK for most projects that rely on the availability of the .git folder.

This looks like the only sustainable solution: TRUSTING git and delegating to git some of nix' work.

All other solutions are conceptually flawed and hackish because git and nix do more or less the same job so they will always keep stepping on each other's toes.

With the infamous autotools finally dying, the number of projects that fail to build when .git is missing will only go up and up. From the perspective of a maintainer who has never heard about nix or similar (= most maintainers), abandoning tarballs simplifies the source release process greatly which (the irony) helps ensuring build reproducibility. Dropping the tarball layer of indirection also reduces the attack surface: see for instance the recent jiaT75 attack on XZ.

Excluding .git from the hash and "trusting" it to be deterministic for any (reasonable) interacton isn't a terrible idea IMO but would need to add complexity and special cases to Nix itself which I don't think we should do.

Yes this the price to pay: trusting git requires special cases in nix code and some "impurity". But it does not seem avoidable.

Checksumming .git/ seems futile. However it should be practical to checksum's git outputs and use that instead: git describe, git tag etc. Git's user interface is obviously much more stable than its internals.

@CyberShadow
Copy link
Contributor

Checksumming .git/ seems futile.

As per my comment above - only when it is created by Git itself, surely?

@Atemu
Copy link
Member

Atemu commented Apr 28, 2024

It can't just be a collection of loose object files, as that takes too much space on disk. We have to use Git pack files.

I don't see that as a given. Sure it'd be space inefficient but that's a helluvalot better than not working at all.

We shouldn't use Git itself to create the new repository, as we don't control its determinism, and its implementation is subject to change.

We don't have an exclusive right to determinism. As long as a tool explicitly guarantees determinism in its output, we can use it for FODs. See i.e. cargo or Go vendor hashes.

The problem is rather that git does have any such guarantees w.r.t. its repository format AFAIK. If that existed and was enforced by upstream (i.e. breaking it would be considered a bug) we could absolutely use it.

This looks like the only sustainable solution: TRUSTING git and delegating to git some of nix' work.

All other solutions are conceptually flawed and hackish because git and nix do more or less the same job so they will always keep stepping on each other's toes.

Git and Nix do not do the same job. Nix is primarily a raw input-addressed store for binary data while git is a content-addressed store for plain-text data.

The only overlap is that Nix also contains a content-addressed store on the side in the form of FODs aswell as the experimental CA-derivations.

I too would like to see native support for Git and other trustworthy content-addressing schemes such as IPFS or torrents. That's quite complex however as it'd necessitate changes to Nix itself and isn't really a topic to be discussed here. You'd need an RFC concerning the Nix package manager for that and quite a bit of lead-up time before it could actually be used in Nixpkgs probably.

With the infamous autotools finally dying, the number of projects that fail to build when .git is missing will only go up and up.

Autotools has been dying for a while now. While I too whish it'd go on with it finally, that's still at least a few decades off.

Fortunately however, most build tools have no direct dependence on Git in any way. The worst they usually do is to encode the current commit id and that's where I'd draw the line for "the build tool is borken" too because any more dependency on the state of the VCS is a bug IMHO.

From the perspective of a maintainer who has never heard about nix or similar (= most maintainers), abandoning tarballs simplifies the source release process greatly which (the irony) helps ensuring build reproducibility. Dropping the tarball layer of indirection also reduces the attack surface: see for instance the recent jiaT75 attack on XZ.

Note that this is an entirely tangential topic.

You also appear to be a little confused here. We have no hard dependence on upstream tarballs. We have fetchgit, fetchFromGitHub and the like and they're used everywhere. The only reason release tarballs are used in Nixpkgs is because of technical feasibility reasons (i.e. bootstrap) or because people had a preference for them. There is no clear policy concerning this, see https://discourse.nixos.org/t/reconsider-reusing-upstream-tarballs/42524.

This discussion only concerns calling fetchGit with the leaveDotGit = true argument which is necessary because in some rare cases, the build tool is (IMHO) broken and does have a hard dependency on the VCS state; it must be present during the (pure) build.

As per my comment above - only when it is created by Git itself, surely?

Only when it's not guaranteed to be stable IMO. It doesn't matter whether git or some other tool generates it but it has to be stable. If some external tool's output is also not stable, that's no good either.

@CyberShadow
Copy link
Contributor

It can't just be a collection of loose object files, as that takes too much space on disk. We have to use Git pack files.

I don't see that as a given. Sure it'd be space inefficient but that's a helluvalot better than not working at all.

I understand that it was just too space inefficient (20x) to be considered: #8567 (comment)

Only when it's not guaranteed to be stable IMO. It doesn't matter whether git or some other tool generates it but it has to be stable. If some external tool's output is also not stable, that's no good either.

As I understand, the reason for why we are where we are now is because 1) Git indeed has not made any promises about deterministically creating repositories, even when sticking to a single Git version; 2) future versions of Git may change the repository creation code, e.g. by switching to a new pack file format.

We don't have an exclusive right to determinism. As long as a tool explicitly guarantees determinism in its output, we can use it for FODs. See i.e. cargo or Go vendor hashes.

The problem is rather that git does have any such guarantees w.r.t. its repository format AFAIK. If that existed and was enforced by upstream (i.e. breaking it would be considered a bug) we could absolutely use it.

Right. But, are there any existing tools that create Git repositories and do have a stated commitment towards determinism?

If not, why not create our own?

@Atemu
Copy link
Member

Atemu commented Apr 29, 2024

I understand that it was just too space inefficient (20x) to be considered: #8567 (comment)

That's not what their comment said. It said that it's feasible because such size increases are tolerable for sources and I agree with that. It's not great (obviously) but a worthwhile trade-off that only affects the src.

As I understand, the reason for why we are where we are now is because 1) Git indeed has not made any promises about deterministically creating repositories, even when sticking to a single Git version; 2) future versions of Git may change the repository creation code, e.g. by switching to a new pack file format.

That may indeed be the case currently but I see at least a possibility of getting upstream to provide that feature. Has anyone asked upstream git whether they'd consider adding a guaranteed stable representation of a git object?

If not, why not create our own?

Sure but it might be worthwhile to involve upstream or even do the work upstream.

@tobiasBora
Copy link
Contributor

tobiasBora commented May 10, 2024

Until git is fully deterministic, we can maybe solve this issue together with NixOS/nix#969. In NixOS/nix#969 (comment) I propose an idea to try to formalize the notion that two derivations are equivalent. We need this in the above issue to obtain security without redownloading all the sources any time curl gets an update, and I propose to introduce a notion of trusted fetcher that can also be used to deal with non-deterministic derivations (this would be specified directly in nixpkgs, avoiding nix to grow with the list of fetchers).

Details are in NixOS/nix#969 (comment)

adrianpk added a commit to adrianpk/nixpkgs that referenced this issue May 31, 2024
@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/fetchgit-hash-mismatch-with-qmk-firmware-submodules/49667/2

@Scrumplex
Copy link
Member

I think fetchgit should throw an eval warning referencing this issue when leaveDotGit is enabled.

Additionally, this issue doesn't seem to be documented in https://github.com/NixOS/nixpkgs/blob/43b37c5e92802000e2a988cc6652037b49c06a0e/doc/build-helpers/fetchers.chapter.md#fetchgit-fetchgit

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/git-fetcher-with-each-submodule-as-a-separate-deprivation/53861/1

@Atemu
Copy link
Member

Atemu commented Oct 8, 2024

There are valid use-cases for leaveDotGit: Extracting information from .git deterministically (i.e. current commit hash) and then manually removing .git before FOD hashing is still deterministic.

These use-cases could be retained more safely by introducing an argument for commands to be run before .git removal though.

Deprecating leaveDotGit in favor of that new argument could perhaps happen as a fetchgit2 as there are other issues with the current incantation too:

#347212

GiggleSquid added a commit to GiggleSquid/.nixos that referenced this issue Jan 3, 2025
this uses deepClone which is problematic in nix. deepClone is inherently
non deterministic and can be a pita. see:
NixOS/nixpkgs#8567
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests