Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map id to cid instead of multihash #336

Merged
merged 6 commits into from
Oct 17, 2022

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Oct 13, 2022

@dignifiedquire
Copy link
Contributor

dignifiedquire commented Oct 13, 2022

this is really problematic, the dht only works on multihashes for provider records, so we must be able to find content based on only multihashes

@dignifiedquire
Copy link
Contributor

one alternative I could see is to store links as a a mapping id + codec to links, such that we can store links per codec, but the content itself is still referenced by multihash

@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 13, 2022

one alternative I could see is to store links as a a mapping id + codec to links, such that we can store links per codec, but the content itself is still referenced by multihash

OK, will try that then.

Edit: That is also not that easy. That means that links needs to be a mapping from (code, id) to [(code, id)] not just to [id], otherwise I won't be able to traverse the graph anymore. As somebody who has been working on the store side, I want the part that does the traversal to benefit from the u64 ids...

@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 13, 2022

this is really problematic, the dht only works on multihashes for provider records, so we must be able to find content based on only multihashes

Doesn't bitswap work on the level of cids? Where exactly do you need to look up content by multihash currently?

@dignifiedquire
Copy link
Contributor

  • bitswap works with both depending on the protocol version
  • when we provide records they are going to be by multihash, not by cid
  • when making queries to the dht we always convert to a multihash

@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 13, 2022

  • bitswap works with both depending on the protocol version

Ah, I did not know that.

  • hey are going to be by multihash, not by cid
  • when making queries to the dht we always convert to a multihash

That just requires going from cid to multihash, not vice versa, so no big deal.

So as far as I can see there are two solutions:

  • change the graph table so that the key is (code, hash_id) and the value is [(code, hash_id)]
    I would prefer to avoid that since it makes graph traversal less efficient
  • keep ids as 1:1 with cids but provide a way to look up blobs by hash nonetheless. Key for id could be <hash><code>, then just look for entries with a prefix of .

I guess it depends on what we expect to be the more frequent use case - lookup by cid or by hash.

@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 13, 2022

So I guess everything is fine as long as we can implement these two efficiently, correct?

    #[tracing::instrument(skip(self))]
    pub async fn get_blob_by_hash(&self, hash: &Multihash) -> Result<Option<DBPinnableSlice<'_>>> {
      todo!()
    }

    #[tracing::instrument(skip(self))]
    pub async fn has_blob_for_hash(&self, hash: &Multihash) -> Result<bool> {
      todo!()
    }

@rklaehn rklaehn force-pushed the rklaehn/id-cid-mapping branch 3 times, most recently from 01d4a98 to 924ad8f Compare October 13, 2022 15:47
@rklaehn rklaehn marked this pull request as ready for review October 13, 2022 15:49
@dignifiedquire
Copy link
Contributor

@rklaehn yes

@rklaehn rklaehn force-pushed the rklaehn/id-cid-mapping branch 4 times, most recently from 2258834 to 66e7972 Compare October 14, 2022 09:04
@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 17, 2022

The gist of the above is that ids are still mapped to cids, but the key for looking up ids from cids is

<hash><code>.

So to look up ids by hash you can just do a prefix search by hash and get all codes for which we have the hash. Just pick one that has associated data. Typically, in 99.9% of all cases, there should be exactly one?

Should we go with this? The other alternative would be to have id correspond to hash, and store (id, code) in both graph key and graph content. But I would really like to avoid that since it will make all graph traversal code slower and more cumbersome, for something that is not needed 99.9% of the time.

I think this case definitely needs to be solved one way or another before v1, because it allows you to get the store in a weird state (it is confused about links) from the outside.

Copy link
Contributor

@dignifiedquire dignifiedquire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good way to go 👍

@rklaehn
Copy link
Contributor Author

rklaehn commented Oct 17, 2022

Addressed your PR comments.

These tests are pretty much just smoke tests. I think we need more tests, but that is unrelated to this particular issue. See n0-computer/beetle#151

That way you can have 2 cids with the same hash but different links.
Downside is that you might store the same data twice in the very unlikely
case where you have the same data as both raw and dag-cb. ¯\_(ツ)_/¯
This is one of the two ways of doing this:

- downside: in the rare case where there are 2 cids with the same hash, the data gets stored twice
- upside: storing the graph can be done using just u64 ids instead of (code, id) tuples
@rklaehn rklaehn merged commit 092bdf7 into n0-computer:main Oct 17, 2022
@rklaehn rklaehn deleted the rklaehn/id-cid-mapping branch October 17, 2022 16:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Properly handle multiple cids with the same hash
2 participants