-
Regarding compactness of proofs and the extractor, it is important to highlight that these two are conflicting requirements, at least in their practical forms. It is easy to see that the accumulated size of the proofs needed to actually reconstruct the data is at least the size of the data itself. Thus, the more compact the proofs, the more proofs are required, which somewhat defeats the reason for compacting proofs in the first place.
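As a rough, back-of-the-envelope way to state this tension (my notation, not from the write-up):

```latex
% If an extractor reconstructs a file M from t proofs of |\pi| bits each,
% then information-theoretically those proofs must carry at least |M| bits:
t \cdot |\pi| \;\ge\; |M|
\quad\Longrightarrow\quad
t \;\ge\; \frac{|M|}{|\pi|}
% i.e. halving the proof size roughly doubles the number of proofs an
% extractor needs, independent of the concrete PoR construction.
```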
-
Another aspect worth mentioning, which we might want to incorporate into the write-up above, is recursive proof aggregation. What I mean by that is that revealing proofs (sending them over the network) and generating proofs aren't necessarily the same thing. We might want the storage nodes to generate proofs locally on an ongoing basis and aggregate them into a single proof that is then sent over the network on request. This strengthens the proving mechanism in two ways: 1) it allows downtime and failures to be detected retroactively, and 2) it makes it harder to outsource storage of the data to third parties. This then raises a third requirement - recursive aggregatable proofs.
-
Tagging @staheri14 as per previous discussion
-
Dagger remote auditing compression and aggregation
Introduction
The current remote auditing scheme in Dagger is based on the Shacham and Waters Compact Proofs of Retrievability. This scheme relies on bilinear pairings and interactive verification, which we extend with a globally verifiable random oracle (a.k.a. the blockchain) to make it non-interactive. The scheme is public, in the sense that anyone can challenge and verify the prover using publicly available data produced during the setup phase of the scheme.
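As a sketch of what "using the chain as a public random oracle" could look like in practice (a minimal sketch with hypothetical helper names and challenge format, not the actual Dagger implementation):

```python
import hashlib

def derive_challenges(beacon: bytes, dataset_id: bytes, n_blocks: int, q: int):
    """Derive q (block index, coefficient) pairs from a public beacon value.

    `beacon` is assumed to be an unpredictable, globally verifiable value
    (e.g. a recent block hash). Anyone can recompute the same challenge set,
    which is what makes the interactive S&W challenge non-interactive.
    """
    challenges = []
    for counter in range(q):
        seed = hashlib.sha256(beacon + dataset_id + counter.to_bytes(4, "big")).digest()
        index = int.from_bytes(seed[:8], "big") % n_blocks   # which block to open
        coefficient = int.from_bytes(seed[8:16], "big")      # random weight nu_i
        challenges.append((index, coefficient))
    return challenges

# Both prover and verifier can recompute the identical challenge set:
print(derive_challenges(b"\x01" * 32, b"dataset-42", n_blocks=1024, q=4))
```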
High level scheme requirements
The basic requirements of our remote auditing scheme are:
The S&W PoR scheme gives us 1, 2, 3 and 5; however, the size of the public parameters (a.k.a. metadata) is linear in the size of the dataset and the block size. Proofs are also large, in the order of KBs: one block's worth of aggregated data plus one aggregated tag.
S&W high level overview
The goal of the current S&W scheme is to determine whether the dataset is still recoverable. The "Retrievability" in "Proofs of Retrievability" indicates that the scheme aims to detect whether the data is still retrievable in its entirety - the data is still recoverable - or whether it isn't - the data has been lost. Another aspect of the scheme is that it guarantees the data can be retrieved with enough queries (extractability), because the proof contains the queried data in aggregated form.
It accomplishes this with the following steps:
What's important to understand about the scheme is that it sends the data being verified (albeit in aggregated form) as part of the proof. This is important because it protects against potential pre-image attacks.
Strictly speaking, the PoR scheme aims at both recoverability and retrievability. The proofs should allow detecting whether the available data is enough to reconstruct the original dataset - this is related to the erasure coding briefly mentioned in the setup phase. It also allows for retrievability, where, after enough queries, the total aggregated data should be enough to reconstruct the original. The latter property is of lesser importance for our particular use case; the former, however, is of utmost importance.
A simpler way of interpreting the described property is that the scheme should prevent a prover from pre-generating a number of proofs for future challenges and then forgetting the original data.
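For reference, a simplified (single-sector) version of the publicly verifiable S&W construction looks roughly like this; the exact Dagger parameterization may differ:

```latex
% Setup: secret key x, public key v = g^x, random generator u,
% per-block authenticators over blocks m_1, ..., m_n:
\sigma_i = \left( H(i) \cdot u^{m_i} \right)^x

% Challenge: a set Q = \{(i, \nu_i)\} of block indices and random coefficients.
% Prover's aggregated response (note that \mu carries the challenged data):
\mu = \sum_{(i,\nu_i) \in Q} \nu_i \, m_i ,
\qquad
\sigma = \prod_{(i,\nu_i) \in Q} \sigma_i^{\nu_i}

% Public verification via the pairing e and public key v = g^x:
e(\sigma, g) \;\stackrel{?}{=}\;
e\!\left( \prod_{(i,\nu_i) \in Q} H(i)^{\nu_i} \cdot u^{\mu},\; v \right)
```

The proof is the pair (mu, sigma): one aggregated block plus one aggregated tag, which matches the size estimate given above.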
Compression and aggregation requirements
What to aggregate and compress?
The first question to ask is: what needs to be aggregated and compressed?
The two most obvious pieces are the authenticators and the proofs. It would be desirable to store the authenticators in aggregated form (some vector and/or polynomial commitment schemes seem to support this), because they are linear with respect to the size of both the dataset and the number of blocks it is composed of.
At first glance, it would seem that the proofs are already at least partially aggregated, both with regard to the authenticators and the sent data. However, even in its current aggregated form, a proof is massive, in the order of KBs. This isn't scalable across millions of datasets, or even just a few, especially when interfacing with the blockchain; thus further compression and aggregation is required.
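As an illustrative size estimate (assuming BLS12-381, i.e. 48-byte compressed group elements and 32-byte field elements; the block count and sector count below are made-up numbers, not Dagger parameters):

```latex
% Metadata: one authenticator per block, n blocks:
|\text{metadata}| \approx n \cdot 48 \text{ bytes}
\quad\Rightarrow\quad
n = 2^{20} \text{ blocks} \approx 48~\text{MiB of tags}

% Proof: one aggregated block (s sectors of field elements) plus one aggregated tag:
|\pi| \approx s \cdot 32 + 48 \text{ bytes}
\quad\Rightarrow\quad
s = 128 \text{ sectors} \approx 4.1~\text{KiB per proof}
```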
Can we get rid of the aggregated data sent as part of the proof?
As mentioned before, the aggregated data has two functions:
We don't care about extractability in this context, so we can safely ignore it; however, we do need the security guarantees of the first property.
Requirements
The basic requirement for the proof generation is to produce a succinct proof that guarantees (w.h.p.) that the original data was involved in the computation of the proof.
The parts that require compression and aggregation thus are:
Where to aggregate?
Another question is: where should aggregation happen? In the case of the authenticators, it is safe to say that the best place is during the setup phase, and it might very well be implicit in it once we move away from the standard BLS-based S&W scheme.
In the case of the proof, it is a bit more ambiguous. Aggregation and compression can take place either on the prover or on a special "aggregator" node that performs the aggregation. A potentially naive approach might be to take the current BLS-based scheme verbatim and run it as a SNARK, either on the prover side, or by sending the aggregated data and tag to the special aggregator node that executes the SNARK.
It might seem at first that the aggregator role is redundant and everything can be performed on the prover side. However, consider that many nodes are producing several proofs constantly, and these eventually need to be committed to some persistence layer, very likely an external blockchain. This makes further compression and aggregation a requirement. In both cases, whether the prover produces the standard BLS proofs and sends them to the aggregator for further processing, or the prover executes the SNARK locally and sends the result of the execution to the aggregator for further compression and aggregation, an aggregator role seems to be required.
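A minimal sketch of the two placements described above, with the cryptography stubbed out (all type and function names here are hypothetical, not Dagger APIs):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BLSProof:
    mu: bytes      # aggregated data
    sigma: bytes   # aggregated tag

@dataclass
class SuccinctProof:
    data: bytes

# --- stubbed cryptography; placeholders only --------------------------------
def prove_bls(challenge: bytes) -> BLSProof:
    return BLSProof(mu=b"<aggregated block>", sigma=b"<aggregated tag>")

def snark_compress(proofs: List[BLSProof]) -> SuccinctProof:
    return SuccinctProof(data=b"<succinct proof over BLS proofs>")

def snark_aggregate(proofs: List[SuccinctProof]) -> SuccinctProof:
    return SuccinctProof(data=b"<aggregate of succinct proofs>")

# Option A: provers send raw BLS proofs; the aggregator node runs the SNARK.
def option_a(challenges: List[bytes]) -> SuccinctProof:
    bls_proofs = [prove_bls(c) for c in challenges]   # prover side
    return snark_compress(bls_proofs)                 # aggregator side

# Option B: each prover compresses locally; the aggregator only batches
# already-succinct proofs from many provers before posting on-chain.
def option_b(per_prover_challenges: List[List[bytes]]) -> SuccinctProof:
    local = [snark_compress([prove_bls(c) for c in cs])   # prover side
             for cs in per_prover_challenges]
    return snark_aggregate(local)                          # aggregator side
```

In both options the last step before the persistence layer is a further aggregation, which is why the aggregator role appears hard to avoid.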
Recursive aggregatable proofs
Another important requirement is recursive aggregatable proofs.
Revealing proofs (sending them over the network) and generating proofs aren't necessarily the same thing. We might want the storage nodes to generate proofs locally on an ongoing basis and aggregate them into a single proof that is then sent over the network on request.
This strengthens the proving mechanism in two ways: 1) it allows downtime and failures to be detected retroactively, and 2) it makes it harder to outsource storage of the data to third parties.
This then raises a third requirement - recursive aggregatable proofs, where, for example, proof c takes proof b as input, which in turn takes proof a as input.
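One way to picture such an ongoing, recursively aggregated proving schedule (notation mine, a sketch rather than a fixed design):

```latex
% At each local proving epoch t, the node proves against a fresh challenge
% and folds the result into a single running proof of constant size:
\pi_1 = \mathrm{Prove}(\mathrm{chal}_1, D), \qquad
\pi_t = \mathrm{Fold}\big(\pi_{t-1},\ \mathrm{Prove}(\mathrm{chal}_t, D)\big)

% Only \pi_t is revealed on request; verifying it should convince the
% verifier (w.h.p.) that every epoch 1..t was proven against D,
% which is what enables retroactive detection of downtime.
```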
Intuitions and research directions
There are several intuitions and potential research directions to explore. For example, would it be possible to aggregate the authenticators into a commitment to overcome their linear size requirements? Another aspect to look at is what the best construct would be to guarantee that a proof was generated on request using the original data - this could potentially be a SNARK or some other ZK system.
So the potential topics to explore are: