
car-mirror: Potentially unbounded cost of creating a Provider Graph Confirmation #25

Open
softwareplumber opened this issue Nov 29, 2022 · 1 comment

softwareplumber commented Nov 29, 2022

On a partial cold call, the Provider Graph Estimate MUST contain the entire graph minus CIDs in the initial payload. The Provider MUST respond with a Bloom filter of all CIDs that match the Provider Graph Estimate, which is called the Provider Graph Confirmation. On subsequent rounds, the Provider Graph Estimate continues to be refined until it is empty or the entire graph has been synchronized.

In this 'Push' scenario the Provider is the block sink. The block source sends a CAR file containing a number of blocks, plus a Bloom filter of the remaining unsent CIDs. The block sink may not have any of the blocks in the initial CAR, so creating the mandatory response (a Bloom filter of all CIDs it holds that match the list of unsent CIDs) would require testing the estimate Bloom against every CID in its underlying store. For a large block store on slow or remote storage, this could take a very long time.
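To make the cost concrete, here is a minimal sketch of the naive confirmation step described above. The `Bloom` class and `naive_confirmation` function are illustrative stand-ins, not the car-mirror spec's actual encoding; the point is only that the sink's loop is proportional to the size of its whole store, not to the size of the estimate.

```python
# Hypothetical sketch: the sink probes the received estimate Bloom once
# per CID in its ENTIRE store. Names and parameters are illustrative.
import hashlib

class Bloom:
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _indexes(self, item: bytes):
        # Derive k indexes by salting a SHA-256 of the item.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indexes(item))

def naive_confirmation(store_cids, estimate: Bloom) -> Bloom:
    """O(|store|): every CID in the sink's store is tested against the
    estimate, even if almost none of them match."""
    confirmation = Bloom()
    for cid in store_cids:        # potentially millions of entries
        if cid in estimate:       # one Bloom probe per stored CID
            confirmation.add(cid)
    return confirmation
```

Each probe is cheap, but on a store backed by slow or remote storage merely enumerating `store_cids` dominates the round.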

The underlying requirement could perhaps be addressed instead by allowing CAR messages from source to sink to include a list of unsent child CIDs (with a maximum size). This lets the sink skip walking the DAG in the incoming CAR blocks and begin constructing a Bloom filter (of CIDs from the unsent list that it already possesses) while it processes the incoming blocks from the CAR, reducing the round-trip time considerably.
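The proposal above changes the sink's work from a full-store scan to a bounded number of indexed lookups. A minimal sketch, assuming a set-like store index and returning a plain set as a stand-in for the Bloom filter the spec requires:

```python
# Hypothetical sketch of the manifest-based alternative: the sink does
# |unsent| lookups against its store index instead of scanning the
# whole store against a Bloom filter. Names are illustrative.
def confirmation_from_manifest(unsent_cids, have):
    """unsent_cids: bounded list sent alongside the CAR.
    have: the sink's store index, supporting O(1) membership tests."""
    return {cid for cid in unsent_cids if cid in have}
```

Because the manifest has a maximum size, the cost of this step is bounded regardless of how large the sink's store is, and it can run concurrently with ingesting the CAR's blocks.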


expede commented Nov 29, 2022

include a list of unsent child CIDs (with a maximum size).

@softwareplumber that sounds a lot like the straggler protocol! This is what the Bloom does the rest of the time, but compressed: in our early tests a manifest is 10-100x the size, and similar approaches have found it impractically large.

would require it to test the estimate bloom against every CID in its underlying store

This could perhaps be clearer in the spec. You don't have to walk the entire store, just whatever is in your estimate, which may be nothing, or a list of your most common CIDs. The cold-call estimates will vary between implementations based on their access patterns.

For example, if you've been tracking foo.files.fission.name, you only need to walk whatever is underneath that resolved CID. As soon as you have a match against what just came in the request, you can stop walking, because of the recursive nature of CIDs: you can assume that you both have that subgraph, and can include the root of that subgraph in your estimate. This is even pre-cachable 🚀
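The pruned walk described above can be sketched as follows. This is an illustrative reading of the comment, not code from any car-mirror implementation: `children` maps a CID to its child CIDs, and `matches` is the membership test against what arrived in the request (e.g. a Bloom probe).

```python
# Hypothetical sketch: walk only beneath a tracked root, and stop
# descending as soon as a CID matches the incoming estimate. By the
# Merkle structure, matching a node implies both sides share its whole
# subgraph, so its root alone goes into the estimate.
def pruned_estimate(root, children, matches):
    found = []
    stack = [root]
    seen = set()
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        if matches(cid):
            found.append(cid)   # include subgraph root, skip its children
            continue
        stack.extend(children.get(cid, []))
    return found
```

Since the result depends only on the tracked root and the store's contents, it can be precomputed and cached, which is presumably what "pre-cachable" refers to.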

Even in the general case, you can use dynamic programming techniques to radically speed up the worst case where, for some reason, you need to walk the entire store, but the details are left up to the implementer.
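One such technique, sketched here as an assumption about what "dynamic programming" could mean in this setting: memoize per-subgraph completeness so that repeated rounds (or overlapping subgraphs) never re-walk a subtree whose status is already known.

```python
# Hypothetical sketch: memoized check that a subgraph rooted at `cid`
# is fully present in the store. The cache can be reused across rounds,
# so each node's subtree is walked at most once. Assumes the graph is a
# DAG (as CID-linked data is), so no cycle guard is needed.
def subgraph_complete(cid, children, have, cache):
    """children: cid -> list of child CIDs; have: store membership;
    cache: dict carried across calls and across sync rounds."""
    if cid in cache:
        return cache[cid]
    ok = cid in have and all(
        subgraph_complete(child, children, have, cache)
        for child in children.get(cid, []))
    cache[cid] = ok
    return ok
```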

We should add more prose about this to the spec to help clarify 💯

@expede expede added the clarify label Nov 29, 2022