Skip to content

Commit

Permalink
[RFC 0133] Git hashing and Git-hashing-based remote stores (#133)
Browse files Browse the repository at this point in the history
* ipfs: Copy Template

* ipfs: Start drafting

* ipfs: Finish draft

* ipfs: Expand discussion of managing complexity

* ipfs: Fix typos

Thanks!

* ipfs: Fix more typos

Thanks!

* ipfs: FInish motivation on source distribution and archival

* ipfs: Rename now that we have number

* Apply suggestions from code review

Thanks!

Co-authored-by: Kevin Cox <kevincox@kevincox.ca>

* Fix typos

Thanks!

Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>

* 133: Add shepherd team!

Co-authored-by: Eelco Dolstra <edolstra@gmail.com>

* 133: Fix shepherds list

mjoseph -> amjoseph

* 133: Move non-`git` steps to future work

* 133: Move one more section out of future work

* 133: Move IPFS-specific motivation to future work too

* 133: Rename feature in light of changes

* 133: Rename RFC in light of changes

* 133: Discuss the downside of git's file system model being different

* Split future work, clean up Nix-agnostic stores section

* Fix numerious typos

Thanks, all of you!

Co-authored-by: Kevin Cox <kevincox@kevincox.ca>
Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>
Co-authored-by: Linus Heckemann <git@sphalerite.org>

* Add RFC open PR date

* Be clearer about not supporting references to start

* Update rfcs/0133-git-hashing.md

Co-authored-by: Kevin Cox <kevincox@kevincox.ca>

* Rip out both RFC-scal Future Work sections

They are now in an `ipfs-2` branch in this repo.

* Remove "Build adoption through seamless interop"

That can go in a separate blog post.

* Apply suggestions from code review

Thank you both!!

Co-authored-by: Valentin Gagarin <valentin.gagarin@tweag.io>
Co-authored-by: Ryan Lahfa <masterancpp@gmail.com>

* Slim down the layering section

The other stuff is already in flight, we don't need to talk about it so much here.

Co-authored-by: Valentin Gagarin <valentin.gagarin@tweag.io>

---------

Co-authored-by: Kevin Cox <kevincox@kevincox.ca>
Co-authored-by: Adam Joseph <54836058+amjoseph-nixpkgs@users.noreply.github.com>
Co-authored-by: Eelco Dolstra <edolstra@gmail.com>
Co-authored-by: Linus Heckemann <git@sphalerite.org>
Co-authored-by: Valentin Gagarin <valentin.gagarin@tweag.io>
Co-authored-by: Ryan Lahfa <masterancpp@gmail.com>
  • Loading branch information
7 people authored Jul 12, 2023
1 parent 8c86187 commit f470655
Showing 1 changed file with 164 additions and 0 deletions.
164 changes: 164 additions & 0 deletions rfcs/0133-git-hashing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,164 @@
---
feature: git-hashing
start-date: 2022-08-27
author: John Ericsion (@Ericson2314) on behalf of [Obsidian Systems](https://obsidian.systems)
co-authors: (find a buddy later to help out with the RFC)
shepherd-team: edolstra, kevincox, gador, @amjoseph-nixpkgs
shepherd-leader: amjoseph-nixpkgs
related-issues: (will contain links to implementation PRs)
---

# Summary
[summary]: #summary

Integrate Git hashing with Nix.

Nix should support content-addressed store objects using git blob + tree hashing, and Nix-unaware remote stores that serve git objects.

This follows the work done and described in https://github.com/obsidiansystems/ipfs-nix-guide/ .

# Motivation
[motivation]: #motivation

## Binary distribution

Currently distributing Nix binaries takes a lot of bandwidth and storage.
This is a barrier to being a Nix user in areas of slower internet --- which includes the vast majority of the world's population at this time.
This is also a barrier to users running their own caches.

Content-addressing opens up a *huge* design space of solutions to get around such problems.

The first steps proposed below do *not* tackle this problem directly, but it lays the ground-work for future experiments in this direction.

## Source distribution and archival

Source code used by Nix expressions frequently goes off-line. It would be beneficial if there was some resistance to this form of bitrot.
The Software Heritage archive stores much of the source code that Nix expressions use. They would be a natural partner in this effort.

Unfortunately, as https://www.tweag.io/blog/2020-06-18-software-heritage/ describes at the end, a major challenge is the way Nix content-addresses software.
First of all, Nix hashes sources in bespoke ways that no other project will adopt.
Second of all, hashing tarballs instead of the underlying files leads to non-normative details (compression, odd perms, etc.).

We should natively support Git file hashing, which is supported both by Git repos and Software Heritage.
This will completely obliterate these issues.

Overall, we are building out a uniform way to work with source code, regardless of its origins or the exact tools involved.

# Detailed design
[design]: #detailed-design

Each item can be done separately provided its dependent items are also done.
These are the items we wish to commit to at this time.
(The goals mentioned under [future work](#future-work) are, in a separate document, also broken down into a dependency graph of smaller steps.)

## Git file hashing

- **Purpose**: Source distribution and archival

In addition to the various forms of content-addressing Nix supports today ("text", "fixed" with either "flat" or "nar" serialization of file system objects), Nix should support Git hashing.
This support entails two basic things:

- Content addresses are used to compute store paths.
- Content addresses are used to verify store object integrity.

Git hashing would not (in this first proposed version) support references, since references in Nix's sense are not part of Git's data model.
This is OK for now; encoding references is not needed for the intended initial use-case of exchanging source code.

## Git file hashing for `buitins.fetch*`

- **Purpose**: Source distribution and archival
- **Depends on**: Git file hashing

The built-in fetchers can also be made to work with Git file hashing just as they support the other types.
In addition, Git repo fetching can leverage this better to than the other formats since the data in Git repos is already content-addressed in this way.

## Nix-agnostic content-addressing "stores"

- **Purpose**: All distribution

We want to be able to substitute from an arbitrary store (in the general, non-Nix sense) of content-addressed objects.
For the purpose of this RFC, that means querying objects by Git hash, and being able to trust the results because we can verify them against the Git hash.

In the implementation, we could accomplish this in a variety of ways.

- On one extreme, we could have a `ContentAddressedSubstitutor` abstract interface completely separate from Nix's `Store` interface.

- On the other extreme, we can generalize `Store` itself to allow taking content addresses or store paths as references.

Exactly how this shakes out is to be determined post-RFC, but it would be nice to use Nix-agnostic persistent methods with `--store` and `--substituters`.

If we do go the route of modifying the `Store` class, note that these things will need to happen:

- Many store interface methods that today take store paths will need to also accept names & content address pairs.

For stores that are purpose-built for Nix, like the ones we support today, all addressing can be done with store paths, so the current interface is fine.
But for Nix-agnostic stores, store paths are rather useless as a key type because Nix-agnostic tools don't know about them.
Those store can, however, understand content addresses.
And from such a name + content address, we can always produce a store path again, so there is no loss of functionality with existing stores.

- Relax `ValidPathInfo` to merely require that *either* the pair of `NarHash` and `NarSize` or just `CA` alone be defined.

As described in the first step, currently `NarHash` and `NarSize` are the *normative* fields which are used to verify a store object.
But if the store object is content-addressed, we don't need these, because the content address (`CA` field) will also suffice, all by itself.

Existing Nix stores types are still required to contain a `NarHash` and `NarSize`, which is good for backwards compatibility and don't come with a cost.
Only new Nix-agnostic store types would take advantage of these new, relaxed rules.

# Examples and Interactions
[examples-and-interactions]: #examples-and-interactions

We encourage anyone interested to check our tutorial in https://github.com/obsidiansystems/ipfs-nix-guide/ which demonstrates the above functionality.
Note at the time of writing this guide uses our original 2020 fork of Nix.

# Drawbacks
[drawbacks]: #drawbacks

## Complexity

The main cost is more complexity to the store layer.
For a few reasons we think this is not so bad.

Most importantly is the division of the work into a dependency graph of steps.
This allows us to slowly try out things like IPFS that leverage Git hashing, and not commit to more change than we want to up front.

Even if we do end up adopting everything though, we think for the following two reasons the complexity can still be kept manageable:

1. Per the abstract vs concrete model of the Nix store in https://github.com/NixOS/nix/pull/6877, everything we are doing is simply flushing out alternative interpretations of the abstract model.
This is the sense in which we are, per the Scheme mantra, "removing the weaknesses and restrictions that make additional features appear necessary":
Instead of extending the model with new features, we are relaxing concrete model assumptions (e.g. references are always opaque store paths) while keeping the abstract model the same.

2. We also support plans to decouple the layers of Nix further, and update our educational and marketing material to reflect it.
Layering will "divide and conquer" the project so the interfaces between each layer are still rigorously enforced preventing a combinatorial explosion in complexity.
That frees up "complexity budget" for projects like this.

## Git and Nix's file system data models do not entirely coincide

Nix puts the permission info of a file (executable bit for now) with that file, whereas Git puts it with the name and hash in the directory.
The practical effect of this discrepancy is that a root file (as opposed to directory) in Nix has permission info, but does not in Git.

If we are trying to convert existing Nix data into Git, this is a problem.
Assuming we treat "no permission bits" as meaning "non-executable", we will have a partial conversion that will fail on executable files without a parent directory.
Tricks like always wrapping everything in a directory get around this, but then we have to be careful the directory is exactly as expected when "unwrapping" in the other direction.

For now, we only focus on ingesting data *from* Git *to* Nix, and this side-steps the issue.
That mapping is total, i.e. all Git data can be mapped, and injective, i.e. each Git data has a unique Nix data representative (though not surjective, i.e. not all Nix data can be represented as a piece of Git data), and so there is no problem for now.

# Alternatives
[alternatives]: #alternatives

The dependency graph of steps can be sliced to save some for future work.
For now they are all written together, but during the RFC meetings we will decide which steps (if any) to ratify now, and which steps to save for later.

# Unresolved questions
[unresolved]: #unresolved-questions

None at this time.

# Future work
[future]: #future-work

- Integrate with outside content-addressing storage/transmission like

- The Software Heritage archive

- IPFS

0 comments on commit f470655

Please sign in to comment.