On a partial cold call, the Provider Graph Estimate MUST contain the entire graph minus CIDs in the initial payload. The Provider MUST respond with a Bloom filter of all CIDs that match the Provider Graph Estimate, which is called the Provider Graph Confirmation. On subsequent rounds, the Provider Graph Estimate continues to be refined until it is empty or the entire graph has been synchronized.
In this 'Push' scenario the provider is the block sink. The block source is sending a CAR file with a number of blocks, plus a bloom of the remaining unsent CIDs. The block sink may not have any of the blocks in the initial CAR, meaning that creating the mandatory response (a bloom of all CIDs it has which match the list of unsent CIDs) would require it to test the estimate bloom against every CID in its underlying store. For a large block store on slow or remote storage, this could take a very long time.
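For concreteness, here is a minimal sketch of that mandatory response, assuming a hypothetical `Bloom` interface (the spec defines the real filter parameters and encoding): the sink adds every CID it holds that matches the incoming estimate to the confirmation bloom.

```go
package carmirror

// Bloom is a hypothetical stand-in for whatever Bloom filter an
// implementation uses; the spec defines the real parameters and encoding.
type Bloom interface {
	Test(cid string) bool // probably-contains check
	Add(cid string)       // insert a CID
}

// buildConfirmation sketches the sink's mandatory response: every CID it
// holds that matches the incoming estimate bloom is added to the
// confirmation bloom that gets sent back.
func buildConfirmation(localCIDs []string, estimate Bloom, confirmation Bloom) Bloom {
	for _, c := range localCIDs {
		if estimate.Test(c) {
			confirmation.Add(c)
		}
	}
	return confirmation
}
```

With nothing to narrow `localCIDs`, that loop is a full scan of the underlying block store, which is exactly the concern above.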
The underlying requirement could perhaps be addressed instead by allowing CAR messages from source to sink to include a list of unsent child CIDs (with a maximum size). This allows the sink to skip walking the DAG in the incoming CAR blocks and to begin constructing a bloom (of CIDs from the unsent list that it already possesses) while it processes the incoming blocks from the CAR. This would reduce the round-trip time considerably.
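A rough sketch of the proposed flow, reusing the hypothetical `Bloom` interface above and adding assumed `Block` and `Store` types (none of this is from the spec): the sink ingests the CAR's blocks on one goroutine while checking the accompanying unsent-CID list against its local store, so the response bloom is built without first walking the DAG inside the CAR.

```go
package carmirror

// Hypothetical types for illustration only; real implementations would use
// their own blockstore and CID representations. A real Store would also
// need to be safe for concurrent use.
type Block struct {
	CID  string
	Data []byte
}

type Store interface {
	Has(cid string) bool // do we already hold this CID?
	Put(b Block)         // persist an incoming block
}

// handlePush ingests incoming CAR blocks while concurrently testing the
// sender's unsent-CID list against the local store, adding every CID the
// sink already possesses to the response bloom.
func handlePush(blocks <-chan Block, unsent []string, store Store, response Bloom) {
	done := make(chan struct{})

	go func() {
		defer close(done)
		for b := range blocks {
			store.Put(b) // process the CAR as it streams in
		}
	}()

	for _, c := range unsent {
		if store.Has(c) {
			response.Add(c) // already held locally; sender can skip it
		}
	}

	<-done // wait for the CAR ingest to finish before replying
}
```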
> include a list of unsent child CIDs (with a maximum size).
@softwareplumber that sounds a lot like the straggler protocol! This is what the bloom does the rest of the time, but compressed. A manifest is 10-100x the size of the bloom in our early tests, and has been found to be impractically large in similar approaches.
> would require it to test the estimate bloom against every CID in its underlying store
This could perhaps be clearer in the spec. You don't have to walk the entire store, just whatever is in your estimate, which may be nothing, or a list of your most common CIDs. The cold-call estimates will vary between implementations based on their access patterns.
For example, if you've been tracking foo.files.fission.name, you only need to walk whatever is underneath that resolved CID. As soon as you have a match against what just came in the request, you can stop walking, because of the recursive nature of CIDs: you can assume that you both have that subgraph, and can include the root of that subgraph in your estimate. This is even pre-cachable 🚀
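As an illustration of that pruned walk (same hypothetical `Bloom` interface as above; `children` is an assumed helper returning the links of a locally held block), the traversal stops descending as soon as a CID matches what came in with the request and reports only that subgraph root:

```go
package carmirror

// walkAndMatch walks the graph under root, pruning the walk as soon as a
// CID matches the incoming bloom: a matching root is assumed to mean both
// sides share everything beneath it, so only the root is reported.
func walkAndMatch(root string, incoming Bloom, outgoing Bloom, children func(cid string) []string) {
	stack := []string{root}
	for len(stack) > 0 {
		c := stack[len(stack)-1]
		stack = stack[:len(stack)-1]

		if incoming.Test(c) {
			outgoing.Add(c) // shared subgraph root: no need to descend further
			continue
		}
		stack = append(stack, children(c)...) // keep walking only the unmatched part
	}
}
```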
Even in the general case, you can use dynamic programming techniques to radically speed up the worst case where, for some reason, you need to walk the entire store, but that's left up to the implementer.
We should add more prose about this to the spec to help clarify 💯