IPIP-0421: HTTP Delegated Routing Reader Privacy Upgrade #421

Open
wants to merge 15 commits into
base: main
from
97 changes: 97 additions & 0 deletions src/ipips/ipip-0421.md
@@ -0,0 +1,97 @@
---
title: "IPIP-0421: HTTP Delegated Routing Reader Privacy Upgrade"
date: 2023-05-31
ipip: proposal
editors:
  - name: Andrew Gillis
    github: gammazero
  - name: Ivan Schasny
    github: ischasny
  - name: Masih Derkani
    github: masih
  - name: Will Scott
    github: willscott
order: 421
tags: ['ipips', 'routing', 'privacy', 'double hashing']
---

## Summary

This IPIP introduces an HTTP API designed for Privacy Preserving Delegated Content Routing provider lookups.

## Motivation

Currently, IPFS's privacy safeguards are notably deficient, particularly regarding the Content Routing subsystem. Neither Readers (clients who access files) nor Writers (hosts that store and distribute content) can maintain significant privacy related to the content they produce or consume. Presently, a Content Router or a Passive Observer can discern the identity of a file requested by a client and the specific client making the request during the routing process. This situation allows potential adversaries to gain knowledge about the requested CID. An interested party could then request the same CID and download the corresponding file to track the user's activities. Addressing these privacy concerns has been a long-standing demand from the community.

Recent enhancements to the [IPFS DHT](https://github.com/ipfs/specs/pull/373) and [InterPlanetary Network Indexer (IPNI)](https://github.com/ipni/specs/pull/5) have incorporated Double Hashing to improve Reader Privacy. With Double Hashing, Provider Records become encrypted and non-transparent to Content Routers. Given the original CID, a Content Router can decrypt the relevant Provider Records and supply them through the existing Delegated Routing API. To make use of these privacy enhancements, users must modify their interactions with Content Routers by:

* Utilizing a secondary hash over the original Multihash during content lookup;
* Decrypting the returned, encrypted Provider Records prior to use; and
* Optionally retrieving additional encrypted Metadata from the Content Router.

Existing APIs cannot support these changes in interaction, necessitating this IPIP as a step to improve the HTTP Delegated Routing API. This proposal adds new endpoints for delivering encrypted content while maintaining the original API for non-privacy-preserving lookups. Writer Privacy, however, is not within the scope of this IPIP and will be handled separately.

## Detailed design

Please refer to the Delegated Routing Reader Privacy Upgrade specification (:cite[http-routing-reader-privacy-v1]) included with this IPIP for detailed design information.

## Design rationale

The proposed API makes two key changes:

1. It introduces new methods for looking up encrypted Provider Records and encrypted Metadata.
2. It establishes Hashing and Encryption functions and structures the response payloads.

This proposal does not alter the API's idioms, upholding all data formats, design rationale, and principles established in the original :cite[ipip-0337].

### User benefit

With the proposed APIs, users can protect themselves against malicious actors who might spy on their activities by monitoring their traffic to Content Routers and subsequently downloading identical data. Additionally, this API serves as a first step towards a fully private HTTP Delegated Routing protocol, which would eliminate centralized observers like IPNI routers.

### Compatibility

#### Backwards Compatibility

Users will need to deliberately activate Reader Privacy on their nodes. A new flag could be introduced into IPFS implementations such as Kubo's HTTP Delegated Content Router configuration to streamline this process. Users on older nodes can continue using the existing API and switch on Reader Privacy later.
Contributor

I'd hope this doesn't need to be the case in an application that has some IPFS smarts (rather than a simple HTTP client). If enough features are expressed through something like #388 then the client should be able to have plausible defaults here (e.g. if my delegated router supports IPNI + DHT, but only IPNI has double-hashing support and the client can run its own DHT client it could choose to send double-hashed requests to the delegated router for IPNI and do the DHT lookups itself).

Obviously some clients will still offer configurability (e.g. would you rather ask the delegated router to do DHT lookups for you in cleartext, or not do them at all) but having reasonable default behavior should be possible.


Content Routers should maintain the same Quality of Service (QoS) for both Privacy Preserving and regular APIs, as both can be served over the same encrypted data. A shim non-encrypted content router can be implemented to encrypt regular CIDs on the fly, proxy the requests through an encrypted content router and finally decrypt the results before returning them to the user.

It is worth noting that not all Content Routers might adopt Reader Privacy. Default HTTP Delegated Routers like `cid.contact` should have Reader Privacy enabled by default in the latest versions of IPFS implementations such as Kubo and Helia. Users should confirm if their chosen custom router supports Reader Privacy when setting it up.

The `/routing/v1/encrypted/` API will be implemented in existing libraries, such as [`boxo/routing/http`](https://github.com/ipfs/boxo/tree/main/routing/http), and will not introduce any breaking changes to existing clear text endpoints. The API will be introduced in a new minor version.

#### Forward Compatibility

Reader Privacy relies on the use of specific hashing and encryption functions. Altering these functions would require a network-wide migration. Content Routers might not be able to migrate seamlessly, as they do not possess the original values. Such function rotation should occur infrequently and necessitate network-wide efforts. When function rotation is required, the API version will be incremented.

### Security

For details on security, please see the "Threat Modelling" section of :cite[http-routing-reader-privacy-v1].

### Alternatives

When considering alternatives to this IPIP, two potential scenarios and their corresponding technologies are worth exploring:

1. Oblivious HTTP (OHTTP)
2. Onion Services

In scenario (1), `/routing/v1` would be implemented behind Oblivious HTTP (OHTTP), a protocol proposed by IETF and Cloudflare. OHTTP separates the information about 'who' is making a request from 'what' they are requesting, thereby preventing routing systems such as IPNI instances from viewing both pieces of information concurrently. This would add an additional layer of privacy by obscuring metadata, such as user behavior patterns, IP addresses, and user-agents.

Scenario (2) envisages the `/routing/v1` behind Onion Services. Onion Services provide another approach to concealing the origin of requests by routing them through the Tor network, further enhancing user privacy.

These two scenarios and their corresponding technologies aren't mutually exclusive to this IPIP. Instead, they could be viewed as complementary solutions that could be deployed in conjunction with Double Hashed records, as proposed in this IPIP, to create a more comprehensive privacy solution. The Double Hashing technique encrypts the content of the communication, making it opaque to passive observers. Simultaneously, OHTTP and Onion Services could provide additional privacy layers by obfuscating metadata about who is making a request.

For more information on OHTTP and Onion Services, please refer to these resources:

- [Oblivious HTTP: IETF](https://www.ietf.org/archive/id/draft-thomson-http-oblivious-01.html)
- [Oblivious HTTP: Cloudflare](https://blog.cloudflare.com/stronger-than-a-promise-proving-oblivious-http-privacy-properties/)
- [Onion Services](https://community.torproject.org/onion-services/)

### Resources

- [Double-hashed DHT](https://github.com/ipfs/specs/pull/373/)
- [Reader Privacy in Indexers](https://github.com/ipni/specs/pull/5)

### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
128 changes: 128 additions & 0 deletions src/routing/http-routing-reader-privacy-v1.md
@@ -0,0 +1,128 @@
---
title: Routing V1 HTTP Delegated Routing Reader Privacy Upgrade
description: >
  This specification outlines the Delegated Routing Reader Privacy Upgrade, representing an incremental enhancement to the HTTP Delegated Routing API. It seamlessly integrates with the existing API, adopting its formats and design principles, to ensure continuity and coherence while offering improved privacy protections.
date: 2023-05-31
maturity: reliable
editors:
  - name: Andrew Gillis
    github: gammazero
  - name: Ivan Schasny
    github: ischasny
  - name: Masih Derkani
    github: masih
  - name: Will Scott
    github: willscott
order: 1
tags: [ 'routing', 'double hashing', 'privacy' ]
---

This specification details the implementation of a new HTTP API for Privacy Preserving Delegated Content Routing provider lookups. It represents an expansion of the HTTP Delegated Routing API, embracing its formats and design principles.

## API Specification

### Magic Values

All salts below are 64 bytes long and represent a string padded with `\x00`.

- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each if the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram:
Contributor

Suggested change
- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each if the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram:
- `SALT_DOUBLEHASH`: The string value `CR_DOUBLEHASH`, where each of the 13 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 51 null bytes after the `CR_DOUBLEHASH` string. The following illustrates its corresponding byte frame diagram:


```
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C | R | _ | D | O | U | B | L | E | H | A | S | H | \x00...\x00 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
<---------------------------- 64 Bytes --------------------------->
```
For reference, the following snippet represents the hex dump of the above, where each character of `CR_DOUBLEHASH` is represented by its ASCII hexadecimal equivalent, and the null bytes are represented by "00":

```
43 52 5F 44 4F 55 42 4C 45 48 41 53 48 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
```

- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each if the 15 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 49 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram:
Contributor

Suggested change
- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each if the 15 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 49 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram:
- `SALT_ENCRYPTIONKEY`: The string value `CR_ENCRYPTIONKEY`, where each of the 16 characters are represented by their byte value. The remainder of the 64 bytes is filled with null bytes represented by `\x00`. This results in 48 null bytes after the `CR_ENCRYPTIONKEY` string. The following illustrates its corresponding byte frame diagram:


```
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| C | R | _ | E | N | C | R | Y | P | T | I | O | N | K | E | Y |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| \x00...\x00 |
+---+---+---+
<---------------------------- 64 Bytes --------------------------->
```
For reference, the following snippet represents the hex dump of the above, where each character of `CR_ENCRYPTIONKEY` is represented by its ASCII hexadecimal equivalent, and the null bytes are represented by "00":

```
43 52 5F 45 4E 43 52 59 50 54 49 4F 4E 4B 45 59 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
```
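
As a minimal sketch of how the two salts could be built, the following pads each ASCII label with `\x00` up to 64 bytes. The helper name `make_salt` is hypothetical and not part of this specification. Per the hex dumps above, `CR_DOUBLEHASH` occupies 13 bytes followed by 51 null bytes, and `CR_ENCRYPTIONKEY` occupies 16 bytes followed by 48 null bytes.

```python
SALT_LENGTH = 64  # every salt is exactly 64 bytes long


def make_salt(label: str) -> bytes:
    """Return the ASCII bytes of `label`, right-padded with null bytes to 64 bytes.

    `make_salt` is a hypothetical helper used only for illustration.
    """
    raw = label.encode("ascii")
    if len(raw) > SALT_LENGTH:
        raise ValueError("label does not fit in a 64-byte salt")
    return raw.ljust(SALT_LENGTH, b"\x00")


SALT_DOUBLEHASH = make_salt("CR_DOUBLEHASH")
SALT_ENCRYPTIONKEY = make_salt("CR_ENCRYPTIONKEY")

if __name__ == "__main__":
    # Print both salts in the same style as the hex dumps above.
    print(SALT_DOUBLEHASH.hex(" ").upper())
    print(SALT_ENCRYPTIONKEY.hex(" ").upper())
```

Comparing the printed output against the two hex dumps is a quick way to check an implementation's constants.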

These magic values are utilized to compute distinct digests from identical values for varying purposes. For instance, a hash of a Multihash employed for lookups should differ from the one used for key derivation, despite originating from the same value. To achieve this, the Multihash is concatenated with different magic values before applying the hash function: `SALT_DOUBLEHASH` for lookups and `SALT_ENCRYPTIONKEY` for key derivation as elaborated in the `Glossary`.
Contributor

Not strictly needed, but it might be nice to explain a bit why the hashes should be different, for people who come looking at this later.


### Glossary

- **`enc`** refers to [AESGCM](https://en.wikipedia.org/wiki/Galois/Counter_Mode) encryption. The notation `enc(passphrase, nonce, payload)` will be used henceforth in this specification.
- **`hash`** denotes [SHA256](https://en.wikipedia.org/wiki/SHA-2) hashing.
- **`||`** signifies concatenation of two values.
- **`deriveKey`** pertains to the derivation of a 32-byte encryption key from a passphrase, performed as `hash(SALT_ENCRYPTIONKEY || passphrase)`.
- **`CID`** stands for [Content IDentifier](https://github.com/multiformats/cid).
- **`MH`** refers to the [Multihash](https://github.com/multiformats/multihash) contained in a `CID`. It corresponds to the hash function's digest over certain content.
- **`HASH2`** is a second hash over the multihash. Second Hashes must follow the `Multihash` format with `SHA2_256` codec. The digest must be calculated as `hash(SALT_DOUBLEHASH || MH)`.
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md).
Contributor

nit:

Suggested change
- **`ProviderRecord`** is a JSON with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md).
- **`ProviderRecord`** is a JSON object with Provider Record as described in the [HTTP Delegated Routing Specification](http-routing-v1.md).

Contributor

Is it? The routing-v1 spec allows for opaque blobs in the provider record. Where's the line between "metadata" and "provider record" here?

- **`ProviderRecordKey`** is a concatenation of `peerID || contextID`. Explicit encoding lengths are unnecessary as they are inherently encoded as part of the multihash format. Max `contextID` length is 64 bytes.
- **`EncProviderRecordKey`** is `Nonce || enc(deriveKey(multihash), Nonce, ProviderRecordKey)`. Max `EncProviderRecordKey` is 200 bytes.
- **`HashProviderRecordKey`** is a hash over `ProviderRecordKey`, calculated as `hash(SALT_DOUBLEHASH || ProviderRecordKey)`.
Contributor

Is this a SHA256 multihash as well?

- **`Metadata`** are free-form bytes that can represent information such as IPNI metadata. Max `Metadata` length is 1024 bytes.
- **`EncMetadata`** is `Nonce || enc(deriveKey(ProviderRecordKey), Nonce, Metadata)`. Max `EncMetadata` length is 2000 bytes.

:::note
Maximum allowed lengths may change without incrementing the API version. Such fields as `contextID` or `Metadata` are free-form bytes and their maximum lengths can be altered in the underlying protocols.
:::
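
The hashing definitions in the glossary can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the SHA2-256 multihash header is taken to be the two bytes `0x12 0x20` (function code and digest length), and `HashProviderRecordKey` is returned as a raw digest because the glossary does not say whether it is wrapped as a multihash.

```python
import hashlib

# Salts as defined under "Magic Values": ASCII label padded with \x00 to 64 bytes.
SALT_DOUBLEHASH = b"CR_DOUBLEHASH".ljust(64, b"\x00")
SALT_ENCRYPTIONKEY = b"CR_ENCRYPTIONKEY".ljust(64, b"\x00")

# Multihash header for SHA2-256: function code 0x12, digest length 0x20 (32 bytes).
SHA2_256_MULTIHASH_PREFIX = bytes([0x12, 0x20])


def sha256(data: bytes) -> bytes:
    """`hash` in the glossary: a plain SHA-256 digest."""
    return hashlib.sha256(data).digest()


def hash2(mh: bytes) -> bytes:
    """HASH2: hash(SALT_DOUBLEHASH || MH), wrapped as a SHA2-256 multihash."""
    return SHA2_256_MULTIHASH_PREFIX + sha256(SALT_DOUBLEHASH + mh)


def derive_key(passphrase: bytes) -> bytes:
    """deriveKey: hash(SALT_ENCRYPTIONKEY || passphrase), yielding a 32-byte key."""
    return sha256(SALT_ENCRYPTIONKEY + passphrase)


def hash_provider_record_key(provider_record_key: bytes) -> bytes:
    """HashProviderRecordKey: hash(SALT_DOUBLEHASH || ProviderRecordKey).

    ASSUMPTION: returned as a raw 32-byte digest, with no multihash wrapping.
    """
    return sha256(SALT_DOUBLEHASH + provider_record_key)


if __name__ == "__main__":
    # Illustrative input only: the SHA2-256 multihash of the bytes b"hello".
    mh = SHA2_256_MULTIHASH_PREFIX + sha256(b"hello")
    print("HASH2     :", hash2(mh).hex())
    print("deriveKey :", derive_key(mh).hex())
```

Because the two salts differ, `hash2(mh)` and `derive_key(mh)` produce unrelated digests from the same multihash, so a Content Router that only sees `HASH2` cannot use it as the decryption key.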

### API
#### `GET /routing/v1/encrypted/providers/{HASH2}`
Contributor

How is Hash2 encoded here?

  • Some specific base (e.g. 16, 32, 64)
  • Multibase prefixed, with a standard base being expected (16, 32, 64)
  • CIDv1 with 0x55 (i.e. raw) codec (and a standard multibase)
  • ...


##### Response codes

- `200` (OK): the response body contains one or more records
- `404` (Not Found): must be returned if no matching records are found
- `422` (Unprocessable Entity): request does not conform to schema or semantic constraints

##### Response Body

```json
{
  "EncProviderRecordKeys": [
    "EBxdYDhd.....",
    "IOknr9DK....."
  ]
}
```

Comment on lines +91 to +97
Contributor

It might be helpful/illustrative for the encrypted data to also show what it's expected the unencrypted form would look like.

Where:

- `EncProviderRecordKeys` is a list of base64-encoded `EncProviderRecordKey` values.
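
A client-side sketch of handling this payload might look like the following. It covers only decoding and framing; AES-GCM decryption is left out. Several details are assumptions rather than statements of this specification: the standard base64 alphabet with padding, a 12-byte nonce (the common AES-GCM nonce size), and a `peerID` serialized as a binary multihash. All function names are hypothetical.

```python
import base64
import json
from typing import List, Tuple

MAX_ENC_PROVIDER_RECORD_KEY = 200  # bytes, per the glossary
MAX_CONTEXT_ID = 64  # bytes, per the glossary
NONCE_LENGTH = 12  # ASSUMPTION: common AES-GCM nonce size; not stated in this spec


def parse_providers_response(body: str) -> List[Tuple[bytes, bytes]]:
    """Decode the response body into (nonce, ciphertext) pairs."""
    pairs = []
    for item in json.loads(body)["EncProviderRecordKeys"]:
        # ASSUMPTION: standard base64 alphabet with padding.
        raw = base64.b64decode(item, validate=True)
        if len(raw) > MAX_ENC_PROVIDER_RECORD_KEY:
            raise ValueError("EncProviderRecordKey exceeds 200 bytes")
        if len(raw) <= NONCE_LENGTH:
            raise ValueError("EncProviderRecordKey too short to hold a nonce")
        pairs.append((raw[:NONCE_LENGTH], raw[NONCE_LENGTH:]))
    return pairs


def read_uvarint(buf: bytes, offset: int) -> Tuple[int, int]:
    """Read an unsigned varint as used by multihash; return (value, next_offset)."""
    value = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, offset
        shift += 7


def split_provider_record_key(prk: bytes) -> Tuple[bytes, bytes]:
    """Split a decrypted ProviderRecordKey into (peerID, contextID).

    The multihash header of the peerID carries its own length, so no
    separate length prefix is needed to find where contextID begins.
    """
    _code, pos = read_uvarint(prk, 0)
    digest_len, pos = read_uvarint(prk, pos)
    end = pos + digest_len
    if end > len(prk):
        raise ValueError("peerID multihash is truncated")
    peer_id, context_id = prk[:end], prk[end:]
    if len(context_id) > MAX_CONTEXT_ID:
        raise ValueError("contextID exceeds 64 bytes")
    return peer_id, context_id
```

Each ciphertext would then be opened with AES-GCM using `deriveKey(MH)` and the extracted nonce, and the plaintext passed to `split_provider_record_key`.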

#### `GET /routing/v1/encrypted/metadata/{HashProviderRecordKey}`
Contributor

Same question about encoding as for HASH2


##### Response codes

- `200` (OK): the response body contains one record
- `404` (Not Found): must be returned if no matching records are found
- `422` (Unprocessable Entity): request does not conform to schema or semantic constraints

##### Response Body

```json
{
  "EncMetadata": "EBxdYDhd....."
}
```

Where:

- `EncMetadata` is the base64-encoded `EncMetadata` value.
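
The response codes and body for this endpoint could be handled along these lines. This is a sketch only: `handle_metadata_response` and `InvalidRequest` are hypothetical names, the transport layer is left to the caller, and the standard base64 alphabet is assumed.

```python
import base64
import json
from typing import Optional

MAX_ENC_METADATA = 2000  # bytes, per the glossary


class InvalidRequest(Exception):
    """Raised when the router answers 422 (Unprocessable Entity)."""


def handle_metadata_response(status: int, body: str) -> Optional[bytes]:
    """Map an HTTP status and body to the decoded EncMetadata bytes.

    Returns None on 404 (no matching record) and raises on 422 or any
    status code that this endpoint does not define.
    """
    if status == 404:
        return None
    if status == 422:
        raise InvalidRequest("request does not conform to schema or semantic constraints")
    if status != 200:
        raise RuntimeError(f"unexpected status code {status}")
    # ASSUMPTION: standard base64 alphabet with padding.
    enc_metadata = base64.b64decode(json.loads(body)["EncMetadata"], validate=True)
    if len(enc_metadata) > MAX_ENC_METADATA:
        raise ValueError("EncMetadata exceeds 2000 bytes")
    return enc_metadata
```

The returned bytes are still `Nonce || ciphertext`; decrypting them requires `deriveKey(ProviderRecordKey)` as defined in the glossary.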

### Notes

Assembling a full `ProviderRecord` from the encrypted data requires multiple server roundtrips: one to fetch the list of `EncProviderRecordKey`s, followed by one per `EncProviderRecordKey` to retrieve its `EncMetadata`. To reduce this to a single roundtrip, the client implementation should use the local libp2p peerstore for multiaddress discovery and [libp2p multistream select](https://github.com/multiformats/multistream-select) for protocol negotiation.
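
The roundtrip pattern described here can be sketched as a lookup loop. Everything outside the glossary formulas is an assumption made for illustration: the four callables (`fetch_keys`, `fetch_metadata`, `open_sealed`, `is_known`) are hypothetical hooks supplied by the caller, with `open_sealed(key, sealed)` standing in for AES-GCM decryption of `Nonce || ciphertext` and `is_known` standing in for a local peerstore check.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

SALT_DOUBLEHASH = b"CR_DOUBLEHASH".ljust(64, b"\x00")
SALT_ENCRYPTIONKEY = b"CR_ENCRYPTIONKEY".ljust(64, b"\x00")


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def hash2(mh: bytes) -> bytes:
    # HASH2 as a SHA2-256 multihash (code 0x12, length 0x20).
    return bytes([0x12, 0x20]) + sha256(SALT_DOUBLEHASH + mh)


def derive_key(passphrase: bytes) -> bytes:
    return sha256(SALT_ENCRYPTIONKEY + passphrase)


def lookup(
    mh: bytes,
    fetch_keys: Callable[[bytes], List[bytes]],
    fetch_metadata: Callable[[bytes], Optional[bytes]],
    open_sealed: Callable[[bytes, bytes], bytes],
    is_known: Callable[[bytes], bool],
) -> List[Tuple[bytes, Optional[bytes]]]:
    """Return (ProviderRecordKey, Metadata or None) pairs for a multihash."""
    results = []
    record_key_secret = derive_key(mh)
    # First roundtrip: fetch every EncProviderRecordKey for HASH2.
    for sealed_key in fetch_keys(hash2(mh)):
        prk = open_sealed(record_key_secret, sealed_key)
        metadata = None
        # One extra roundtrip per record, skipped when the provider can
        # already be resolved locally (for example via the peerstore).
        if not is_known(prk):
            sealed_md = fetch_metadata(sha256(SALT_DOUBLEHASH + prk))
            if sealed_md is not None:
                metadata = open_sealed(derive_key(prk), sealed_md)
        results.append((prk, metadata))
    return results
```

With `is_known` returning `True` for every provider, the loop issues only the first request, which is the single-roundtrip case the paragraph above aims for.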
Contributor

@masih we've talked about this before and it's more exploratory, but I'm noting here to make it public and see if folks have any thoughts about how this might impact the interface/API here in the future.

Note: This is not a "please rewrite everything" request.


How well this works is a function of how important the metadata is to performing a useful retrieval, and how important the metadata is depends on the distribution of information between the "ProviderRecord" and the "(ProviderRecord)Metadata".

IIUC the reason it's implemented this way is to keep data storage in routing backends like IPNI from needing to store the same data but encrypted many many times (i.e. once per multihash advertised).

At the extremes we have:

  • A lot of space could be saved by making ProviderRecord just contain a small pointer and Metadata contain all the actual provider record information (e.g. peerID, multiaddrs, protocols, ...). This means two round-trips unless the client already has the pointer information locally (e.g. if the pointer was a peerID, then having the multiaddrs, etc. locally, and if the pointer was some system-specific unique-ID then having that cached from a prior lookup). Also, unless there's some aggregation service/proxy it allows correlation between many different requests that use the same metadata (not necessarily a big deal here).
  • The second round-trip could disappear if we store all the information in the ProviderRecord portion. However, this means encrypting all the data for every advertised multihash

I could situationally see reasons to shuffle data between these two, depending on things like:

  1. how reusable the metadata is
  2. how frequently the metadata information is to be cached
  3. cost models for routing systems
    ...

Two areas where I could see the extremes being in use:

  1. Save storage by just returning the target's "identifier": e.g. libp2p peerID or an unauthenticated HTTP+libp2p URL (that has .well_known for protocol negotiation) since peer routing makes sense to be separable in libp2p, and protocol negotiation to be separate for unauthenticated HTTP + libp2p.
  2. Save round-trips by returning all the information: e.g. a webseeds-like advertisement that points to an outboard blake3 HTTP URL and an HTTP URL for the data (data could be a separate advertisement)

This makes me wonder if there's a better way to do this. For example:

  • encrypted/providers returns a (JSON) blob mimicking the routing-v1 results
  • It contains some information indicating if there's metadata and/or what might be in the metadata
  • encrypted/metadata provides the metadata

This could allow systems like IPNI to optimize data layouts on ingestion and have some flexibility without breaking downstream clients.