
IPC-37: Resolver Service and Client #55

Merged
merged 11 commits into main on Mar 6, 2023

Conversation

@aakoshh (Contributor) commented Mar 1, 2023

Closes consensus-shipyard/ipc#465

This PR creates a Service and a Client that act as the public interface for content resolution.

The Service is what users of the library have to instantiate; it sets up all the behaviours introduced in consensus-shipyard/ipc#34, consensus-shipyard/ipc#467 and consensus-shipyard/ipc#466, along with a Client, which can be used to send requests/commands to it over a channel. The Client can be cloned, so a separate instance can be given to any component that needs to talk to the Service.

Instead of a request/response queue as in Forest, the Client hides the query behind an async resolve method. The caller specifies which CID they want to resolve and from which subnet. The Service returns an error immediately if no peers are known in that subnet.

The other two methods at the moment are to set the list of subnets the agent can provide data for, and to pin subnets whose IDs are already known (from config or from the ledger). Both take a list, but the first overrides any current value, while the second adds to it.
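
For illustration, here is a minimal sketch of what the Client surface described above could look like. The names (SubnetID, Request, set_provided_subnets, pin_subnets) and the channel types are hypothetical stand-ins, not the exact API introduced in this PR:

    use anyhow::{anyhow, Result};
    use libipld::Cid;
    use tokio::sync::{mpsc, oneshot};

    /// Illustrative stand-in for the subnet identifier type used by the agent.
    #[derive(Clone, Debug)]
    pub struct SubnetID(pub String);

    /// Requests the Client can send to the Service over the channel.
    enum Request {
        Resolve(Cid, SubnetID, oneshot::Sender<Result<()>>),
        SetProvidedSubnets(Vec<SubnetID>),
        PinSubnets(Vec<SubnetID>),
    }

    /// Cloneable handle holding the sender half of the channel to the Service.
    #[derive(Clone)]
    pub struct Client {
        tx: mpsc::UnboundedSender<Request>,
    }

    impl Client {
        /// Ask the Service to resolve `cid` from peers known in `subnet`;
        /// fails immediately if no peers are known in that subnet.
        pub async fn resolve(&self, cid: Cid, subnet: SubnetID) -> Result<()> {
            let (tx, rx) = oneshot::channel();
            self.tx
                .send(Request::Resolve(cid, subnet, tx))
                .map_err(|_| anyhow!("service has stopped"))?;
            rx.await?
        }

        /// Replace the full list of subnets this agent provides data for.
        pub fn set_provided_subnets(&self, subnets: Vec<SubnetID>) -> Result<()> {
            self.tx
                .send(Request::SetProvidedSubnets(subnets))
                .map_err(|_| anyhow!("service has stopped"))
        }

        /// Add subnets with already known IDs (from config or the ledger).
        pub fn pin_subnets(&self, subnets: Vec<SubnetID>) -> Result<()> {
            self.tx
                .send(Request::PinSubnets(subnets))
                .map_err(|_| anyhow!("service has stopped"))
        }
    }

The oneshot channel is what lets resolve stay async for the caller while the Service does the actual work on its own event loop.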

@adlrocha (Contributor) left a comment

Overall, LGTM, but I am having a hard time putting all the pieces together and understanding when each of the behaviours kicks in. Maybe it is worth having a quick sync once the whole protocol is done so you can guide us through the code 🙏

Also, up until now we are using vanilla Bitswap, right? That assumes the content being retrieved lives in the peers' BitswapStore, which is not currently the case for the agent.

Ok(Self {
    ping: Default::default(),
    identify: identify::Behaviour::new(identify::Config::new(
        "ipfs/0.1.0".into(),
Contributor

Out of curiosity, why do we need to pass this ipfs/ prefix to the identify config?

Contributor Author

I don't know, I copied it as-is from Forest. The KademliaConfig says the protocol name can be used to distinguish different groups of peers, but the IdentifyConfig just says:

    /// Application-specific version of the protocol family used by the peer,
    /// e.g. `ipfs/1.0.0` or `polkadot/1.0.0`.
    pub protocol_version: String,

I fixed the 0.1.0 to be 1.0.0; it looked like a typo. I suspect it doesn't matter what it is; it could be ipc? I can't see it being used, but I only looked at the Identify behaviour.

For the record, when my nodes connect to each other, I see these protocols being supported:

["/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/ipc/smoke-test/kad/1.0.0", "/meshsub/1.1.0", "/meshsub/1.0.0", "/ipfs-embed/bitswap/1.0.0"]

So this exact thing doesn't even appear 🤷

@adlrocha (Contributor) Mar 6, 2023

I see they are using it for authentication purposes. They tag the specific DHT they interact with.

let transport = build_transport(config.network.local_key.clone());
let behaviour = Behaviour::new(config.network, config.discovery, config.membership, store)?;

// NOTE: Hardcoded values from Forest. Will leave them as is until we know we need to change.
Contributor

Makes sense; at our current scale these values seem conservative enough.

// We could keep the Swarm alive to serve content to others,
// but we ourselves are unable to send requests. Let's treat
// this as time to quit.
None => { break; }
Contributor

What if the client is dropped and we need to spawn a new one from a Service that is already running? We shouldn't have to restart the service for this, right? Maybe the user should just take care never to drop the client during the lifetime of the service.

Contributor Author

In order to spawn a client, the Service would have to hold a copy of the Sender<Request> channel (which is what happens in Forest). But the Service instance is consumed by run, so you can't call anything on it any more; you can't hold a reference to it, so you would not be able to clone a new Sender from it. And if it held a copy of the Sender itself, it would never experience this condition either, unless it manually dropped it.

So I decided that a cleaner way is for Service::new to return both the Service and the Sender (wrapped in a Client). Then you can make as many clones of the Client as you want, before or after you run the Service.

But you are right that you have to make sure not to drop all Client instances. This is easy, though: if you did, and you later needed another one, you wouldn't be able to get it from anywhere, so the compiler wouldn't let that happen. And if you never need to make a Client call again (after telling it what subnets you support), then yes, you just have to keep an instance in your application state anyway.
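
A rough sketch of that construction pattern, with the internals elided and the names simplified (this is not the literal code in the PR):

    use anyhow::Result;
    use tokio::sync::mpsc;

    // Hypothetical request type; see the Client sketch earlier in the thread.
    enum Request {
        Resolve, // payload elided
    }

    #[derive(Clone)]
    pub struct Client {
        tx: mpsc::UnboundedSender<Request>,
    }

    pub struct Service {
        rx: mpsc::UnboundedReceiver<Request>,
        // ... the Swarm with its behaviours would live here too ...
    }

    impl Service {
        /// Construction returns the Service together with a cloneable Client.
        pub fn new(/* config, store, ... */) -> Result<(Client, Service)> {
            let (tx, rx) = mpsc::unbounded_channel();
            Ok((Client { tx }, Service { rx }))
        }

        /// `run` consumes the Service; when every Client clone has been dropped,
        /// the receiver yields `None` and the event loop exits.
        pub async fn run(mut self) -> Result<()> {
            loop {
                match self.rx.recv().await {
                    Some(_request) => { /* handle the request, drive the Swarm */ }
                    None => break, // no senders left: time to quit
                }
            }
            Ok(())
        }
    }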

membership::Event::Skipped(peer_id) => {
    // Don't repeatedly look up peers we can't add to the routing table.
    if self.background_lookup_filter.insert(&peer_id) {
        self.discovery_mut().background_lookup(peer_id)
Contributor

Are we refreshing the bloom filter after some time? I don't know if having a lot of peers could, after some point, lead to skipping legit peers.

Contributor Author

I am also thinking about this, but I haven't come up with a nice way to do it yet. A fresh start every now and then seems like a blunt and easy way, but it leads to a burst of lookups afterwards. Maybe a better protection would be to use a cache like the Guava library on the JVM: it has memory limits and a TTL as well.
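
One gradual alternative to a full reset (only a sketch, not something in this PR) is to keep two filter generations and rotate them on a timer, so entries expire after roughly one to two periods without a single big burst of lookups. HashSets are used here for clarity, but the same rotation works for a pair of Bloom filters:

    use std::collections::HashSet;
    use std::time::{Duration, Instant};

    /// Two generations that are rotated periodically; entries expire after
    /// roughly one to two periods instead of everything resetting at once.
    pub struct RotatingFilter {
        current: HashSet<Vec<u8>>,
        previous: HashSet<Vec<u8>>,
        rotated_at: Instant,
        period: Duration,
    }

    impl RotatingFilter {
        pub fn new(period: Duration) -> Self {
            Self {
                current: HashSet::new(),
                previous: HashSet::new(),
                rotated_at: Instant::now(),
                period,
            }
        }

        /// Insert a key; returns true if it was not present in either generation.
        pub fn insert(&mut self, key: &[u8]) -> bool {
            if self.rotated_at.elapsed() >= self.period {
                // Drop only the oldest generation to avoid a lookup burst.
                self.previous = std::mem::take(&mut self.current);
                self.rotated_at = Instant::now();
            }
            !self.previous.contains(key) && self.current.insert(key.to_vec())
        }
    }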

Comment on lines +403 to +404
let tcp_transport =
    || libp2p::tcp::tokio::Transport::new(libp2p::tcp::Config::new().nodelay(true));
Contributor

I don't know if it is already supported in rust-libp2p, but I would love it if we could switch to QUIC in the future and only fall back to TCP if it's not supported :)

Contributor Author

I will check. FWIW in #64 I factor out transport creation, so we can switch easily.

Request::Resolve(cid, subnet_id, response_channel) => {
    let peers = self.membership_mut().providers_of_subnet(&subnet_id);
    if peers.is_empty() {
        send_resolve_result(response_channel, Err(anyhow!(NoKnownPeers(subnet_id))));
Contributor

This would mean that no membership was picked up through gossipsub for the subnet, right? In the future we could include a fallback that optimistically asks our direct connections which subnets they are part of, in case we get lucky and find a provider.

Contributor Author

Yes, most likely we asked too early and we should come back later, or perhaps find a known peer and add it to the static address list.

I am also thinking about how to make the gossip better for newcomers, to lessen the trade-off of periodic republishing and let newcomers learn about the providers ASAP.

Here's an idea: when we receive a gossip message from a provider we haven't known about before, we publish our own record in return. They must be a newcomer, so they could surely use our information. Perhaps with some cool-off period, like no repeated publishing within 10 seconds (just above the gossipsub broadcast time frame).
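
A sketch of that cool-off logic, with hypothetical names; the real membership behaviour would decide when to call it:

    use std::time::{Duration, Instant};

    /// Decides whether to publish our own provider record in response to a
    /// record from a previously unknown provider, with a cool-off window.
    pub struct ReciprocalPublish {
        last_published: Option<Instant>,
        cool_off: Duration, // e.g. ~10s, just above the gossipsub broadcast window
    }

    impl ReciprocalPublish {
        pub fn new(cool_off: Duration) -> Self {
            Self { last_published: None, cool_off }
        }

        /// Call when a membership record arrives; `is_new_provider` means we had
        /// not known about this provider before. Returns true if we should
        /// publish our own record in return.
        pub fn should_publish_back(&mut self, is_new_provider: bool) -> bool {
            if !is_new_provider {
                return false;
            }
            match self.last_published {
                Some(t) if t.elapsed() < self.cool_off => false,
                _ => {
                    self.last_published = Some(Instant::now());
                    true
                }
            }
        }
    }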


@aakoshh (Contributor Author) commented Mar 4, 2023

Also, up until now we are using vanilla Bitswap, right? That assumes the content being retrieved lives in the peers' BitswapStore, which is not currently the case for the agent.

That's right, this Bitswap library interacts with the store. The agent will have to provide an implementation of the store that is fit for purpose:

  • For requests, it would look in its own memory, or in the Lotus node it is connected to
  • For responses, it would store the data in a MemoryBlockstore, so that it can serve it to Lotus later

With Fendermint they can share the actual database, so this is easier. An open question is whether the databases of the same validator at different levels of the hierarchy could be shared.
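
A sketch of those two paths, assuming a hypothetical LotusClient handle and a plain HashMap in place of the MemoryBlockstore; the store trait the Bitswap behaviour actually requires has a different shape:

    use std::collections::HashMap;

    use anyhow::Result;
    use libipld::Cid;

    /// Illustrative store with the two lookup paths described above.
    pub struct AgentStore {
        /// Blocks received from other subnets, kept in memory so they can be
        /// served to Lotus later.
        received: HashMap<Cid, Vec<u8>>,
        /// Hypothetical handle to the Lotus node the agent is connected to.
        lotus: LotusClient,
    }

    impl AgentStore {
        /// Serving a Bitswap request: check our own memory first, then ask Lotus.
        pub async fn get(&self, cid: &Cid) -> Result<Option<Vec<u8>>> {
            if let Some(data) = self.received.get(cid) {
                return Ok(Some(data.clone()));
            }
            self.lotus.get_block(cid).await
        }

        /// Handling a Bitswap response: keep the block so it can be served later.
        pub fn put(&mut self, cid: Cid, data: Vec<u8>) {
            self.received.insert(cid, data);
        }
    }

    pub struct LotusClient;

    impl LotusClient {
        pub async fn get_block(&self, _cid: &Cid) -> Result<Option<Vec<u8>>> {
            // In the real agent this would be a JSON-RPC call to the Lotus node.
            Ok(None)
        }
    }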

@aakoshh (Contributor Author) commented Mar 4, 2023

I know you had a different idea of how to resolve content, by going through the Gateway actor itself, something like "give me the contents of the checkpoint identified by this CID", and the Gateway would collect every message that belongs to the checkpoint, whereas Bitswap goes CID by CID until the whole thing is pulled.

So, I reckon your implementation of the BitswapStore would have to be special in that it would first ask the Gateway actor for the CIDs of the checkpoints and load them into its memory; when a request comes in, it is only served if the CID was discovered this way, otherwise it is considered not to be asking for part of a checkpoint.

I recognise that this may not work well for you if you want to respond to a CID by some data that doesn't actually correspond to that CID, which would be impossible, so I urge you to find a compromise.
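
A sketch of that gating idea, with a hypothetical GatewayState placeholder standing in for the query against the Gateway actor:

    use std::collections::{HashMap, HashSet};

    use anyhow::Result;
    use libipld::Cid;

    /// Only serves CIDs that were learned from the Gateway actor.
    pub struct CheckpointStore {
        known_cids: HashSet<Cid>,
        blocks: HashMap<Cid, Vec<u8>>,
    }

    impl CheckpointStore {
        /// Pre-load the CIDs (and data) that make up the current checkpoints.
        pub fn refresh(&mut self, gateway: &GatewayState) -> Result<()> {
            for (cid, bytes) in gateway.checkpoint_contents()? {
                self.known_cids.insert(cid.clone());
                self.blocks.insert(cid, bytes);
            }
            Ok(())
        }

        /// Requests for CIDs not discovered via the Gateway are refused.
        pub fn get(&self, cid: &Cid) -> Option<&[u8]> {
            if self.known_cids.contains(cid) {
                self.blocks.get(cid).map(|v| v.as_slice())
            } else {
                None // not part of any checkpoint we know about
            }
        }
    }

    pub struct GatewayState;

    impl GatewayState {
        pub fn checkpoint_contents(&self) -> Result<Vec<(Cid, Vec<u8>)>> {
            Ok(Vec::new()) // stand-in for reading the Gateway actor's state
        }
    }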

Base automatically changed from ipc-36-content to main March 6, 2023 09:35
@aakoshh merged commit 62662d1 into main on Mar 6, 2023
@aakoshh deleted the ipc-37-service branch on March 6, 2023 09:47
@adlrocha (Contributor) commented Mar 6, 2023

I recognise that this may not work well for you if you want to respond to a CID by some data that doesn't actually correspond to that CID, which would be impossible, so I urge you to find a compromise.

This will never be the case. I looked at the BitswapStore implementation in the smoke test and for our case it may just suffice to add a wrapper for the async calls to the gateway or relevant subnet state. 🙏
