fetchgit with leaveDotGit = true is still not completely deterministic #8567
Comments
How sad. So it may change even with the same git version? It'd be interesting to know exactly what in .git/ changes. But if a single-threaded git repack of the repo is itself non-deterministic, then I don't know what to do. Btw, here is the script I used to "prove" that git clones could be deterministic (if anyone wants to play around with it): https://gist.github.com/bjornfor/bb96b09d4bd1a488cd01 It has a copy of the "make_deterministic_repo()" function from nixpkgs and it creates three repos in the current dir with the same basename as the script itself. UPDATE: I mean "make_deterministic_repo" (not ...clone())
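To see exactly which files under .git/ differ between two clones, one option is to hash every file in both trees and diff the sorted listings. The following is a minimal sketch, not the script from the gist above; `hash_tree` and `compare_trees` are hypothetical helper names, and it assumes `sha256sum`, `find`, `diff` and `mktemp` are available.

```shell
#!/bin/sh
# Hypothetical helper: print "<sha256>  <relative path>" for every regular
# file under a directory, sorted by path, so two listings can be diffed.
hash_tree() {
    (cd "$1" && find . -type f | LC_ALL=C sort | while IFS= read -r f; do
        sha256sum "$f"
    done)
}

# Hypothetical helper: print the listing lines that differ between two
# directories. Returns diff's exit status: 0 if identical, 1 if they differ.
compare_trees() {
    a=$(mktemp) && b=$(mktemp) || return 2
    hash_tree "$1" > "$a"
    hash_tree "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Usage would be something like `compare_trees clone1/.git clone2/.git`. Files that exist in only one tree, such as differently named pack files, show up as lines present on one side only.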
I feared the day would happen. As @bjornfor said, it would be great if you could get your hands on two versions of the same derivation with different hashes. This might be hard, because sometimes it depends on the upstream (I first had problems when new commits were pushed to the master branch). Maybe a new branch is added, or something like that?
Ok, so I tarred up my local copy of the cargo src tree in my This is the resulting diff: You can find the tarballs with full contents here: |
The most obvious differences are inside the fetched submodule in
It seems that there are also differences in the pack files, but I haven't examined these. |
Regarding point 1, it seems that the previous |
Do you have a way to get the git versions used for both clones? If they differ, a change in the behavior might explain the difference. |
Do you use |
Neither the cargo derivation nor my Regarding git, I was using version |
I grabbed this diff from two repositories at work, cloned at different times (and probably with different git versions): |
what's the use case for |
@andrewrk AFAIK, some packages use a build system which requires the |
@andrewrk e.g. some ruby gems use |
What is causing such non-determinism? Is it that we might be downloading separate commits, and then later, once the repository has been optimized, we might download packed files? I think this is doable if we do not clone all the branches/tags, but only the one that we care about, and run
Otherwise, we would have to white-list the tags / branches that we are interested in. |
Also the solution above was suggested by @bjornfor before.
Yes, this is an optimization I made a while ago, because otherwise we were pulling full repositories all the time, which is extremely inefficient. I think we can get rid of this limitation when we are downloading a copy of the history, by creating a branch, making a local non-shared --depth=1 clone of the submodule, and replacing the .git with the one from the cloned version.
Here's some brainstorming: Modify |
FWIW, as a workaround I've disabled |
@andrewrk another approach would be to "normalize" the git repo by
which will leave all functionality intact that cares about the actual files (like git ls-files). Build processes that break because they expect certain things in the git history itself should be much rarer. |
I've attempted to make --leave-dotGit more deterministic, by unpacking those packs into the plain, uncompressed object db format: bendlas@4b9c24a Unfortunately, this leads to massive growth in git repos (x 20), but since --leave-dotGit should be used mostly for build-time only sources, this might be tolerable, if it buys us a deterministic --leave-dotGit. |
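As a rough illustration of the unpacking idea (a sketch under stated assumptions, not the code from the commit linked above), the packs can be moved out of the object store and fed to `git unpack-objects`, which writes each object as a loose file. `explode_packs` is a hypothetical helper name. Note that loose objects are still zlib-deflated on disk, so this removes the dependence on pack layout but not on the compression output.

```shell
#!/bin/sh
# Hypothetical helper: replace every pack in a non-bare repository with
# loose objects.
explode_packs() {
    repo=$1
    tmp=$(mktemp -d) || return 1
    # git unpack-objects skips objects that are already present in the
    # repository, so the packs have to be moved out of the object store first.
    for f in "$repo"/.git/objects/pack/*.pack; do
        [ -e "$f" ] || continue
        mv "$f" "$tmp"/ || return 1
    done
    # Drop what is left in the pack directory (indexes, bitmaps and similar
    # auxiliary files); they describe packs that no longer exist there.
    rm -f "$repo"/.git/objects/pack/*
    for p in "$tmp"/*.pack; do
        [ -e "$p" ] || continue
        git -C "$repo" unpack-objects -q < "$p" || return 1
    done
    rm -rf "$tmp"
}
```

After `explode_packs path/to/repo`, running `git -C path/to/repo fsck` is a cheap way to confirm that the object database is still complete.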
IMHO, we should remove For packages that really need |
It's certainly best to avoid dealing with In any case, packing isn't meant to be fully deterministic. Efficiency and settings of that can differ in different git versions. @bendlas: I don't see why first pack everything and then unpack it. |
@edolstra I'm a big fan of removing @vcunat I wasn't sure if Another thought: Would it be worth it to have a canonical, deterministic representation of versioned repositories in nix? Not git specific, but with a git dir conjurable from such a representation (as well as other formats)? |
I guess the main reason of having the What I can suggest would be to provide a placeholder Also, note that Hydra (used to?) relies on
Has anyone tried doing a I'm testing this:

{ pkgs
, lib
, stdenv
, git
, cacert
}:
stdenv.mkDerivation {
  name = "stable-git-test";
  dontUnpack = true;
  buildInputs = [
    git
    cacert
  ];
  buildPhase = ''
    git clone https://github.com/WxNzEMof/test.git stable-git-test-full
    git init stable-git-test
    git -C stable-git-test-full fast-export 837ef898e8213b9e0c86cb99581328b0d7541489 |
      git -C stable-git-test fast-import
    git -C stable-git-test update-ref HEAD 837ef898e8213b9e0c86cb99581328b0d7541489
    rm -rf stable-git-test/.git/hooks stable-git-test/.git/logs
  '';
  installPhase = ''
    mv stable-git-test $out
  '';
  outputHash = "sha256-zCuCvoC6u3u57br8nGPFNBaBakVqx9SHHIeRtN4YTEk=";
  outputHashMode = "recursive";
}

Seems deterministic so far, even with new commits, new branches...
What about different versions of git? |
I haven't tried that, but couldn't we pin the Git version? |
I don't think that's realistic given that fetchgit is somewhat security-critical. We will also need further git features at some point (sha256 support is looming for instance), so this does not strike me as a sustainable solution. |
That won't work all of the time, as the fast-export format does not preserve all metadata perfectly (e.g. signed commits), so such a process won't necessarily create a repository containing a desired revision. It also gives little control over e.g. which tags are copied - we generally wouldn't want to copy all tags on reachable commits, because a developer may retroactively tag an old commit, breaking determinism. As I understand, the constraints are:
Here is a start in that direction: https://github.com/CyberShadow/misc/blob/master/git-copy-revs.d This satisfies the first and second constraint, but not the third. However, I think it wouldn't be too hard to replace the Would such a program be usable in nixpkgs? |
...
This looks like the only sustainable solution: TRUSTING git and delegating to git some of nix' work. All other solutions are conceptually flawed and hackish because git and nix do more or less the same job so they will always keep stepping on each other's toes. With the infamous autotools finally dying, the number of projects that fail to build when .git is missing will only go up and up. From the perspective of a maintainer who has never heard about nix or similar (= most maintainers), abandoning tarballs simplifies the source release process greatly which (the irony) helps ensuring build reproducibility. Dropping the tarball layer of indirection also reduces the attack surface: see for instance the recent jiaT75 attack on XZ.
Yes, this is the price to pay: trusting git requires special cases in nix code and some "impurity". But it does not seem avoidable. Checksumming .git/ seems futile. However, it should be practical to checksum git's outputs and use that instead:
As per my comment above - only when it is created by Git itself, surely? |
I don't see that as a given. Sure it'd be space inefficient but that's a helluvalot better than not working at all.
We don't have an exclusive right to determinism. As long as a tool explicitly guarantees determinism in its output, we can use it for FODs. See e.g. cargo or Go vendor hashes. The problem is rather that git does not have any such guarantees w.r.t. its repository format AFAIK. If that existed and was enforced by upstream (i.e. breaking it would be considered a bug) we could absolutely use it.
Git and Nix do not do the same job. Nix is primarily a raw input-addressed store for binary data while git is a content-addressed store for plain-text data. The only overlap is that Nix also contains a content-addressed store on the side in the form of FODs as well as the experimental CA-derivations. I too would like to see native support for Git and other trustworthy content-addressing schemes such as IPFS or torrents. That's quite complex, however, as it'd necessitate changes to Nix itself and isn't really a topic to be discussed here. You'd need an RFC concerning the Nix package manager for that, and probably quite a bit of lead-up time before it could actually be used in Nixpkgs.
Autotools has been dying for a while now. While I too wish it'd finally get on with it, that's still at least a few decades off. Fortunately, however, most build tools have no direct dependence on Git in any way. The worst they usually do is encode the current commit id, and that's where I'd draw the line for "the build tool is borken" too, because any more dependency on the state of the VCS is a bug IMHO.
Note that this is an entirely tangential topic. You also appear to be a little confused here. We have no hard dependence on upstream tarballs. We have fetchgit, fetchFromGitHub and the like and they're used everywhere. The only reason release tarballs are used in Nixpkgs is because of technical feasibility reasons (i.e. bootstrap) or because people had a preference for them. There is no clear policy concerning this, see https://discourse.nixos.org/t/reconsider-reusing-upstream-tarballs/42524. This discussion only concerns calling fetchGit with the
Only when it's not guaranteed to be stable IMO. It doesn't matter whether git or some other tool generates it but it has to be stable. If some external tool's output is also not stable, that's no good either. |
I understand that it was just too space inefficient (20x) to be considered: #8567 (comment)
As I understand, the reason for why we are where we are now is because 1) Git indeed has not made any promises about deterministically creating repositories, even when sticking to a single Git version; 2) future versions of Git may change the repository creation code, e.g. by switching to a new pack file format.
Right. But, are there any existing tools that create Git repositories and do have a stated commitment towards determinism? If not, why not create our own? |
That's not what their comment said. It said that it's feasible because such size increases are tolerable for sources and I agree with that. It's not great (obviously) but a worthwhile trade-off that only affects the src.
That may indeed be the case currently but I see at least a possibility of getting upstream to provide that feature. Has anyone asked upstream git whether they'd consider adding a guaranteed stable representation of a git object?
Sure but it might be worthwhile to involve upstream or even do the work upstream. |
Until git is fully deterministic, we can maybe solve this issue together with NixOS/nix#969. In NixOS/nix#969 (comment) I propose an idea to try to formalize the notion that two derivations are equivalent. We need this in the above issue to obtain security without redownloading all the sources any time curl gets an update, and I propose to introduce a notion of trusted fetcher that can also be used to deal with non-deterministic derivations (this would be specified directly in nixpkgs, avoiding nix to grow with the list of fetchers). Details are in NixOS/nix#969 (comment) |
Related to NixOS#8567
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/fetchgit-hash-mismatch-with-qmk-firmware-submodules/49667/2 |
I think Additionally, this issue doesn't seem to be documented in https://github.com/NixOS/nixpkgs/blob/43b37c5e92802000e2a988cc6652037b49c06a0e/doc/build-helpers/fetchers.chapter.md#fetchgit-fetchgit |
This issue has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/git-fetcher-with-each-submodule-as-a-separate-deprivation/53861/1 |
There are valid use-cases for These use-cases could be retained more safely by introducing an argument for commands to be run before .git removal though. Deprecating |
this uses deepClone which is problematic in nix. deepClone is inherently non deterministic and can be a pita. see: NixOS/nixpkgs#8567
The hash of the cargo git repository returned by fetchgit has changed twice already, even though the git revision didn't change (see de322b4 and now #8566).

See also: #4752 and #4767

cc @bjornfor and @madjar (due to #4767).