[RFC 0133] Git hashing and Git-hashing-based remote stores #133

Merged
merged 27 commits into from
Jul 12, 2023
Commits
0ee8e17
ipfs: Copy Template
Ericson2314 Aug 23, 2022
ae6ca2d
ipfs: Start drafting
Ericson2314 Aug 23, 2022
5d74bc6
ipfs: Finish draft
Ericson2314 Aug 24, 2022
c858e40
ipfs: Expand discussion of managing complexity
Ericson2314 Aug 24, 2022
3fa874e
ipfs: Fix typos
Ericson2314 Aug 28, 2022
2f25a1a
ipfs: Fix more typos
Ericson2314 Aug 28, 2022
56ad43f
ipfs: FInish motivation on source distribution and archival
Ericson2314 Aug 28, 2022
2290ece
ipfs: Rename now that we have number
Ericson2314 Aug 28, 2022
69b0c46
Apply suggestions from code review
Ericson2314 Aug 29, 2022
d7c3a83
Fix typos
Ericson2314 Sep 8, 2022
a790811
133: Add shepherd team!
Ericson2314 Dec 14, 2022
f134f8c
133: Fix shepherds list
Ericson2314 Feb 1, 2023
fd494d1
133: Move non-`git` steps to future work
Ericson2314 Feb 1, 2023
d3b5313
133: Move one more section out of future work
Ericson2314 Feb 15, 2023
6575564
133: Move IPFS-specific motivation to future work too
Ericson2314 Feb 15, 2023
2e04424
133: Rename feature in light of changes
Ericson2314 Feb 15, 2023
5a68ea0
133: Rename RFC in light of changes
Ericson2314 Feb 15, 2023
17de8dd
133: Discuss the downside of git's file system model being different
Ericson2314 Feb 15, 2023
e2641c9
Split future work, clean up Nix-agnostic stores section
Ericson2314 Jun 22, 2023
852d740
Fix numerious typos
Ericson2314 Jun 24, 2023
15c1cbc
Add RFC open PR date
Ericson2314 Jun 24, 2023
165979c
Be clearer about not supporting references to start
Ericson2314 Jun 24, 2023
3c3cac6
Update rfcs/0133-git-hashing.md
Ericson2314 Jun 26, 2023
641891b
Rip out both RFC-scal Future Work sections
Ericson2314 Jun 26, 2023
5828c41
Remove "Build adoption through seamless interop"
Ericson2314 Jun 26, 2023
9279a03
Apply suggestions from code review
Ericson2314 Jun 29, 2023
3a083b2
Slim down the layering section
Ericson2314 Jun 29, 2023
164 changes: 164 additions & 0 deletions rfcs/0133-git-hashing.md
@@ -0,0 +1,164 @@
---
feature: git-hashing
start-date: 2022-08-27
author: John Ericson (@Ericson2314) on behalf of [Obsidian Systems](https://obsidian.systems)
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: edolstra, kevincox, gador, @amjoseph-nixpkgs
shepherd-leader: amjoseph-nixpkgs
related-issues: (will contain links to implementation PRs)
---

# Summary
[summary]: #summary

Integrate Git hashing with Nix.

Nix should support content-addressed store objects using git blob + tree hashing, and Nix-unaware remote stores that serve git objects.

This follows the work done and described in https://github.com/obsidiansystems/ipfs-nix-guide/ .

# Motivation
[motivation]: #motivation

## Binary distribution

Currently distributing Nix binaries takes a lot of bandwidth and storage.
This is a barrier to being a Nix user in areas of slower internet --- which includes the vast majority of the world's population at this time.
This is also a barrier to users running their own caches.

Content-addressing opens up a *huge* design space of solutions to get around such problems.

The first steps proposed below do *not* tackle this problem directly, but they lay the groundwork for future experiments in this direction.

## Source distribution and archival

Source code used by Nix expressions frequently goes offline. It would be beneficial to have some resistance to this form of bitrot.
The Software Heritage archive stores much of the source code that Nix expressions use. They would be a natural partner in this effort.

Unfortunately, as https://www.tweag.io/blog/2020-06-18-software-heritage/ describes at the end, a major challenge is the way Nix content-addresses software.
First, Nix hashes sources in bespoke ways that no other project is likely to adopt.
Second, hashing tarballs instead of the underlying files makes the hash depend on non-normative details (compression, odd permissions, etc.).

We should natively support Git file hashing, which is supported by both Git repos and Software Heritage.
This would eliminate both problems.

Overall, we are building out a uniform way to work with source code, regardless of its origins or the exact tools involved.

# Detailed design
[design]: #detailed-design

Each item can be done separately provided its dependent items are also done.
These are the items we wish to commit to at this time.
(The goals mentioned under [future work](#future-work) are, in a separate document, also broken down into a dependency graph of smaller steps.)

## Git file hashing

- **Purpose**: Source distribution and archival

In addition to the various forms of content-addressing Nix supports today ("text", "fixed" with either "flat" or "nar" serialization of file system objects), Nix should support Git hashing.
This support entails two basic things:

- Content addresses are used to compute store paths.
- Content addresses are used to verify store object integrity.

Git hashing would not (in this first proposed version) support references, since references in Nix's sense are not part of Git's data model.
This is OK for now; encoding references is not needed for the intended initial use-case of exchanging source code.
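
As a minimal sketch of the hashing scheme itself (not of Nix's implementation), the following computes Git blob and tree hashes the way Git does for SHA-1 repositories. The function names are hypothetical, and the RFC does not fix how such a hash is turned into a store path, so that step is left out.

```python
import hashlib


def git_object_hash(kind: str, payload: bytes) -> bytes:
    """Hash a Git object: SHA-1 over '<kind> <size>\\0' followed by the payload."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def hash_blob(content: bytes) -> bytes:
    """Hash a file's contents as a Git blob."""
    return git_object_hash("blob", content)


def hash_tree(entries: dict) -> bytes:
    """Hash a directory as a Git tree.

    `entries` maps a name to (mode, raw 20-byte hash), where mode is
    "100644" (file), "100755" (executable file), "120000" (symlink),
    or "40000" (subdirectory).
    """
    def sort_key(item):
        name, (mode, _) = item
        # Git orders entries bytewise, comparing directories as if
        # their names ended in "/".
        return name.encode() + (b"/" if mode == "40000" else b"")

    payload = b"".join(
        f"{mode} {name}".encode() + b"\0" + raw
        for name, (mode, raw) in sorted(entries.items(), key=sort_key)
    )
    return git_object_hash("tree", payload)


readme = hash_blob(b"hello world\n")
root = hash_tree({"README": ("100644", readme)})
print(readme.hex())  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(root.hex())
```

Note that the executable bit lives in the tree entry's mode, not in the blob, which is the source of the data-model mismatch discussed under Drawbacks.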

## Git file hashing for `builtins.fetch*`

- **Purpose**: Source distribution and archival
- **Depends on**: Git file hashing

The built-in fetchers can also be made to work with Git file hashing just as they support the other types.
In addition, Git repo fetching can leverage this better than the other formats, since the data in Git repos is already content-addressed in this way.

## Nix-agnostic content-addressing "stores"

- **Purpose**: All distribution

We want to be able to substitute from an arbitrary store (in the general, non-Nix sense) of content-addressed objects.
For the purpose of this RFC, that means querying objects by Git hash, and being able to trust the results because we can verify them against the Git hash.
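
The trust argument can be sketched as follows, assuming SHA-1 Git blob hashing and modelling the remote store as a plain mapping from hex hash to bytes. `substitute_blob` is a hypothetical name for illustration, not a Nix interface.

```python
import hashlib
from typing import Mapping, Optional


def git_blob_hash(content: bytes) -> str:
    """Hex SHA-1 Git blob hash of the given file contents."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def substitute_blob(store: Mapping[str, bytes], wanted: str) -> Optional[bytes]:
    """Query an untrusted store by Git hash and verify what comes back.

    Returns None if the store lacks the object, and raises if the store
    returns data that does not match the requested hash.
    """
    content = store.get(wanted)
    if content is None:
        return None
    actual = git_blob_hash(content)
    if actual != wanted:
        raise ValueError(f"hash mismatch: wanted {wanted}, got {actual}")
    return content


data = b"hello world\n"
key = git_blob_hash(data)
honest_store = {key: data}
tampered_store = {key: b"something else\n"}

print(substitute_blob(honest_store, key) == data)  # True
```

Because the check depends only on the bytes returned, the store needs no knowledge of Nix and no signing key.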

In the implementation, we could accomplish this in a variety of ways.

- On one extreme, we could have a `ContentAddressedSubstitutor` abstract interface completely separate from Nix's `Store` interface.

- On the other extreme, we can generalize `Store` itself to allow taking content addresses or store paths as references.

Exactly how this shakes out is to be determined post-RFC, but it would be nice to use Nix-agnostic persistent methods with `--store` and `--substituters`.

If we do go the route of modifying the `Store` class, note that these things will need to happen:

- Many store interface methods that today take store paths will need to also accept name and content address pairs.

For stores that are purpose-built for Nix, like the ones we support today, all addressing can be done with store paths, so the current interface is fine.
But for Nix-agnostic stores, store paths are rather useless as a key type because Nix-agnostic tools don't know about them.
Those stores can, however, understand content addresses.
And from such a name + content address, we can always produce a store path again, so there is no loss of functionality with existing stores.

- Relax `ValidPathInfo` to merely require that *either* the pair of `NarHash` and `NarSize` or just `CA` alone be defined.

As described in the first step, currently `NarHash` and `NarSize` are the *normative* fields which are used to verify a store object.
But if the store object is content-addressed, we don't need these, because the content address (`CA` field) will also suffice, all by itself.

Existing Nix store types are still required to contain a `NarHash` and `NarSize`, which is good for backwards compatibility and doesn't come with a cost.
Only new Nix-agnostic store types would take advantage of these new, relaxed rules.
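
The relaxed rule can be illustrated with a small sketch. `PathInfo` below is a hypothetical stand-in for `ValidPathInfo` with only the three fields under discussion; the field types and example values are assumptions, not Nix's actual representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathInfo:
    """Hypothetical, simplified model of the relaxed `ValidPathInfo` rule."""
    nar_hash: Optional[str] = None
    nar_size: Optional[int] = None
    ca: Optional[str] = None

    def __post_init__(self) -> None:
        has_nar_pair = self.nar_hash is not None and self.nar_size is not None
        # Either the full (NarHash, NarSize) pair or CA alone must be
        # present, so the store object can always be verified.
        if not has_nar_pair and self.ca is None:
            raise ValueError("need both nar_hash and nar_size, or ca")


# An existing Nix store type keeps supplying the NAR pair.
legacy = PathInfo(nar_hash="sha256:placeholder", nar_size=1024)

# A Nix-agnostic store can supply only the content address.
agnostic = PathInfo(ca="git:placeholder")
```

Under this rule, a record with only `nar_hash` and neither `nar_size` nor `ca` is rejected, since nothing would then suffice to verify the object.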

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

We encourage anyone interested to check our tutorial in https://github.com/obsidiansystems/ipfs-nix-guide/ which demonstrates the above functionality.
Note that, at the time of writing, this guide uses our original 2020 fork of Nix.

# Drawbacks
[drawbacks]: #drawbacks

## Complexity

The main cost is added complexity in the store layer.
For a few reasons we think this is not so bad.

Most important is the division of the work into a dependency graph of steps.
This allows us to slowly try out things like IPFS that leverage Git hashing, and not commit to more change than we want to up front.

Even if we do end up adopting everything though, we think for the following two reasons the complexity can still be kept manageable:

1. Per the abstract vs concrete model of the Nix store in https://github.com/NixOS/nix/pull/6877, everything we are doing is simply fleshing out alternative interpretations of the abstract model.
This is the sense in which we are, per the Scheme mantra, "removing the weaknesses and restrictions that make additional features appear necessary":
Instead of extending the model with new features, we are relaxing concrete model assumptions (e.g. references are always opaque store paths) while keeping the abstract model the same.

2. We also support plans to decouple the layers of Nix further, and update our educational and marketing material to reflect it.
Layering will "divide and conquer" the project so that the interfaces between layers are still rigorously enforced, preventing a combinatorial explosion in complexity.
That frees up "complexity budget" for projects like this.

## Git and Nix's file system data models do not entirely coincide

Nix puts the permission info of a file (executable bit for now) with that file, whereas Git puts it with the name and hash in the directory.
The practical effect of this discrepancy is that a root file (as opposed to directory) in Nix has permission info, but does not in Git.

If we are trying to convert existing Nix data into Git, this is a problem.
Assuming we treat "no permission bits" as meaning "non-executable", we will have a partial conversion that will fail on executable files without a parent directory.
Tricks like always wrapping everything in a directory get around this, but then we have to be careful the directory is exactly as expected when "unwrapping" in the other direction.

For now, we only focus on ingesting data *from* Git *to* Nix, and this side-steps the issue.
That mapping is total, i.e. all Git data can be mapped, and injective, i.e. distinct Git data map to distinct Nix data (though not surjective, i.e. not all Nix data can be represented as a piece of Git data), and so there is no problem for now.
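
The asymmetry can be made concrete with a simplified sketch of the two data models. All types and function names here are hypothetical, symlinks are omitted, and a root blob is treated as non-executable, following the assumption stated above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class GitBlob:
    contents: bytes  # no permission info on the blob itself


@dataclass
class GitTree:
    # name -> (mode, object); the mode carries the executable bit
    entries: Dict[str, Tuple[str, "GitObject"]]


GitObject = Union[GitBlob, GitTree]


@dataclass
class NixFile:
    contents: bytes
    executable: bool = False  # permission info lives with the file


@dataclass
class NixDirectory:
    entries: Dict[str, "NixObject"]


NixObject = Union[NixFile, NixDirectory]


def git_to_nix(obj: GitObject, mode: str = "100644") -> NixObject:
    """Total: every Git object maps to a Nix object."""
    if isinstance(obj, GitBlob):
        return NixFile(obj.contents, executable=(mode == "100755"))
    return NixDirectory(
        {name: git_to_nix(child, m) for name, (m, child) in obj.entries.items()}
    )


def nix_to_git(obj: NixObject, is_root: bool = True) -> Tuple[str, GitObject]:
    """Partial: an executable root file has nowhere to record its mode."""
    if isinstance(obj, NixFile):
        if is_root and obj.executable:
            raise ValueError("executable root file has no Git representation")
        return ("100755" if obj.executable else "100644", GitBlob(obj.contents))
    return ("40000", GitTree(
        {name: nix_to_git(child, is_root=False) for name, child in obj.entries.items()}
    ))


tree = GitTree({"run.sh": ("100755", GitBlob(b"#!/bin/sh\n"))})
as_nix = git_to_nix(tree)
```

Converting `tree` to Nix and back yields the original tree, while `nix_to_git(NixFile(b"#!/bin/sh\n", executable=True))` fails, which is exactly the case that wrapping in a directory would work around.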

# Alternatives
[alternatives]: #alternatives

The dependency graph of steps can be sliced to save some for future work.
For now they are all written together, but during the RFC meetings we will decide which steps (if any) to ratify now, and which steps to save for later.

# Unresolved questions
[unresolved]: #unresolved-questions

None at this time.

# Future work
[future]: #future-work

- Integrate with outside content-addressing storage/transmission like

- The Software Heritage archive

- IPFS