Provider Strategy Discussion #6221
Comments
Currently, various "get"-like operations (e.g. `ipfs get`, `ipfs ls`) trigger a provide. Providing everything would be fine, but this is also a good opportunity to use a more limited strategy, like providing just roots.
^^ I think the above behavior is just a side effect of bitswap providing every single block that goes through its system.
@Stebalien had mentioned to me there's an issue with providing roots where, for example, if a node downloads certain parts of a tree and then gets interrupted, it won't automatically know how to "walk back up" the tree to get the root so it can download the rest of the hashes it needs. That might be something useful to consider when designing this.
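The "walk back up" problem comes from the shape of a Merkle DAG: blocks reference their children by CID, never their parents. A minimal Go sketch (hypothetical types, not go-ipfs code) of why an interrupted download can strand a subtree:

```go
package main

import "fmt"

// A drastically simplified stand-in for an IPLD block: links point only
// *downward*, to child CIDs. Real blocks carry no parent pointers.
type block struct {
	cid   string
	links []string // CIDs of children
}

// A hypothetical three-block DAG: root -> mid -> leaf.
var dag = map[string]block{
	"root": {cid: "root", links: []string{"mid"}},
	"mid":  {cid: "mid", links: []string{"leaf"}},
	"leaf": {cid: "leaf"},
}

// children walks downward, which the links make trivial.
func children(cid string) []string {
	return dag[cid].links
}

// parentOf can only be answered by scanning blocks we already have
// locally; if the download was interrupted before the parent arrived,
// there is no way to discover it from the child alone.
func parentOf(cid string) (string, bool) {
	for _, b := range dag {
		for _, l := range b.links {
			if l == cid {
				return b.cid, true
			}
		}
	}
	return "", false
}

func main() {
	fmt.Println(children("root"))
	p, _ := parentOf("leaf")
	fmt.Println(p)
}
```

If only roots are provided, a peer holding just `leaf` has nothing it can announce or look up to rejoin the tree, which is why a "walk back up looking for providers" mechanism is discussed below.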
@obo20 thanks for joining the discussion! You're correct, it's a thing we have to consider and there has been some discussion about how best to approach that problem. I'm curious what some of the needs and/or pain points around providing are for Pinata, do you have any?
@hannahhoward thanks! That's a great thing to point out. |
The biggest pain points we've had involve content discovery times for both old content as well as recently added content. For lack of a better explanation, it seems like the provider just can't keep up with the amount of content it needs to announce. |
@obo20 thanks again for responding.
Understood. Here we are taking measures to reduce the number of things we provide in the first place. But like you mention, that's only part of the solution. There are efforts parallel to this for improving content discovery times.
Although it does not solve all of the problems you're running into, a small portion of the new provider system is being released with

@obo20 I have two questions, if you don't mind: I know Pinata is primarily dealing with pinned content, but how much non-pinned content are you dealing with? Do you have a sense of the number of things you're trying to provide each day?

@Stebalien @eingenito do either of you know of an issue (or anything) tracking the work being done in
@michaelavila No problem. Glad to help in any way I can. We're quite excited about a lot of the improvements coming to IPFS in 0.4.20. We'll be keeping an eye on things to see how performance improves.
Currently our provider strategy is set to pinned. So we should only be announcing our content.
Currently we have roughly 7000 root nodes that are being announced each day, but this number is increasing steadily.
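For readers following along: the strategy being discussed here is selected through the `Reprovider` section of the go-ipfs config file. A fragment with the "pinned" strategy (the documented values are "all", "pinned", and "roots"; check your version's config docs for exact defaults) looks like:

```json
{
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "pinned"
  }
}
```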
@obo20 thank you. Out of curiosity, does Pinata have non-pinned content in IPFS? If so, is it a lot?
@michaelavila and I just had a discussion and, following a conversation with @Stebalien, it seems like there's a requirement that we preserve the current 0.4.20 'root block first' providing behavior no matter what. Would it be useful to merge an experimental provider system that offers the following control of re/providing:
That's really it. Users who choose to can re/provide nothing without turning off content routing, and everyone else will re/provide everything, but root blocks will always take strict precedence on the initial provide. If we can't provide at a rate that exhausts our list of root blocks, we'll never even get to other blocks. Prioritizing (or restricting re/providing to) pinned roots or subtrees would be added later. Presumably the gateway could make use of the 'none' behavior. @hsanjuan could cluster make use of the 'all' strategy, knowing that roots would always be provided first? Or, more generally, what does cluster do wrt re/providing? Subsequent merges could add (potentially):
@Stebalien, @magik6k, @hsanjuan, @scout - any comments?
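The "root blocks take strict precedence" behavior proposed above can be sketched as two queues, where the root queue must drain completely before anything else is announced. This is an illustrative Go sketch (hypothetical types, not the actual go-ipfs provider code):

```go
package main

import "fmt"

// provideQueue holds CIDs waiting to be announced. Roots are kept
// separate so they always win over non-root blocks.
type provideQueue struct {
	roots  []string
	others []string
}

func (q *provideQueue) push(cid string, isRoot bool) {
	if isRoot {
		q.roots = append(q.roots, cid)
	} else {
		q.others = append(q.others, cid)
	}
}

// next returns the next CID to provide. Roots strictly first: if roots
// arrive faster than we can provide, non-root blocks are starved,
// which is exactly the trade-off described above.
func (q *provideQueue) next() (string, bool) {
	if len(q.roots) > 0 {
		c := q.roots[0]
		q.roots = q.roots[1:]
		return c, true
	}
	if len(q.others) > 0 {
		c := q.others[0]
		q.others = q.others[1:]
		return c, true
	}
	return "", false
}

func main() {
	q := &provideQueue{}
	q.push("leaf1", false)
	q.push("root1", true)
	q.push("leaf2", false)
	for c, ok := q.next(); ok; c, ok = q.next() {
		fmt.Println(c)
	}
}
```

Even though `leaf1` was pushed first, `root1` comes out ahead of both leaves.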
@obo20 I had a couple of questions/comments: you said you're using the 'pinned' reprovider strategy. If most (all?) of your content is pinned, that's roughly equivalent to using 'all'. Would the 'roots' strategy, which just reprovides pin roots, be appropriate for your use case?
All of our content is pinned. We don't store any non-pinned content.
@eingenito We considered this, however the issue I described in my first comment could cause problems. For reference, this is the comment I'm referring to:
If this issue gets fixed, then we could absolutely use the "roots" strategy
@obo20 - thanks for your answers. The ultimate goal of these refactors is exactly what you're talking about: providing a subset of nodes (and @Kubuxu has done some work to characterize nodes as being particularly awesome to provide: #6155), and then enabling bitswap (or a derivative) to walk back up a DAG whose transfer has stalled, looking for providers along the way until it can restart.
@obo20 another question for you (thanks again): any idea how often Pinata is getting
As far as I’m aware, we’re not running into that interrupt situation right now as we’re running “pinned” as the provider strategy instead of “roots”. @Stebalien had simply warned me that it may be an issue if we switched to “roots” as a provider strategy. |
For cluster, it's just fine that IPFS peers can pin (that is, find and retrieve) things that are pinned somewhere else. Therefore the "all", "roots", and "pinned" strategies would all work, and we don't have special requirements that I can think of right now. The sharding feature (when it lands) might require the "all" strategy in order to re-construct DAGs split across multiple peers, but we can just make this a requirement for people wanting to shard. @lanzafame ^^^ double-check this sounds right?
There are a few PRs in flight to improve query perf. I also talked with @momack2 earlier today about having someone on the go-ipfs team work with the libp2p team on this kind of stuff. Unfortunately, much of this is still up in the air.
So you don't get too excited, 0.4.20 brings some improvements but isn't likely to significantly improve content routing in your case. It may improve content routing for new content on initial add but that's about it.
Discussed out-of-band but, for the record, yes.
For context, the issue is that we just massively reduced the provider parallelism in bitswap. Unfortunately, that means it'll take longer to fully provide large files after adding them to go-ipfs. The current 'root block first' providing behavior isn't affected by this reduced provider parallelism.
To make sure we're on the same page, if the user chooses a provider strategy that doesn't provide the first block, we shouldn't provide it even on initial provide.
Those all sound like good ideas. Implementing them with our current datastore may be tricky, but this could be good motivation to finally adopt a database.

(@eingenito, wrt our conversation about the "pinned" provider strategy) How volatile is pinned data? Specifically, could you approximate (don't spend any time on this) the ratio of content unpinned between GC cycles to the total number of pins you have?

I'm asking because it'll be somewhat tricky to re-implement the "pinned" provider strategy in the new provider subsystem exactly as-is. Specifically, we'd start providing pinned content when it's pinned, but we wouldn't stop until we've GCed the content (even if something unpins it in the meantime). We don't have to do it this way, but it's simpler to implement. So, I'm wondering how long you generally have "stale" data around.
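The simpler bookkeeping described above can be sketched as a set that grows on pin but only shrinks at GC time. This is an illustrative Go sketch (hypothetical types, not the real provider subsystem):

```go
package main

import "fmt"

// reprovideSet tracks which CIDs we keep announcing.
type reprovideSet struct {
	providing map[string]bool
}

// onPin starts providing a CID as soon as it is pinned.
func (s *reprovideSet) onPin(cid string) { s.providing[cid] = true }

// onUnpin is intentionally a no-op: the CID stays "stale" in the set
// until the next GC cycle, which is the simplification under discussion.
func (s *reprovideSet) onUnpin(cid string) {}

// onGC drops everything that is no longer pinned.
func (s *reprovideSet) onGC(stillPinned map[string]bool) {
	for cid := range s.providing {
		if !stillPinned[cid] {
			delete(s.providing, cid)
		}
	}
}

func main() {
	s := &reprovideSet{providing: map[string]bool{}}
	s.onPin("a")
	s.onPin("b")
	s.onUnpin("b") // "b" keeps being reprovided until GC
	fmt.Println(len(s.providing))
	s.onGC(map[string]bool{"a": true})
	fmt.Println(len(s.providing))
}
```

With a 24-hour GC cycle, the cost of this shortcut is bounded by how much gets unpinned per day, which is exactly why the question above asks about that ratio.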
Our garbage collection runs every 24 hours. At quick glance I'd guess that roughly 10-20% of our repo consists of unpinned data when a typical collection starts. |
Is this still true? For my use case, providing only the root CIDs of very large trees of recursive pins (multiple terabytes) would be enough, since the root CID is the only "entrypoint" (e.g. when giving the root CID to a pinning service). But I do not want to limit functionality/stability/resilience.
@bigCrash Setting
@Winterhuman I understand. But what are the drawbacks, exactly? To someone without in-depth knowledge, advertising only root CIDs seems sufficient for, e.g., the use case of a pinning service pinning a root CID and all of its recursive descendants (so the root CID is always the entrypoint). However, the comment I cited states that there might be problems fetching that tree if ipfs gets interrupted. My question is: is that still the case? Or am I fundamentally misunderstanding something here?
@Stebalien please add the people you think should be a part of this discussion.
We are introducing a provider system into go-ipfs (#6141) that replaces bitswap's mechanism of providing every new block it comes into contact with. The new provider system is intended to give us more control over which blocks are provided during different operations.
The important questions for this group are:
1. What kinds of provide strategies will need to be supported?
@Stebalien mentioned an approach here #5870 (comment) which should be considered here as well.
2. What do we have to support initially?
Given the concerns I hear around providing, it seems like the following would help right away and could be merged on its own:
But would that be enough?
Additionally, the reprovider strategies are being removed and instead a reprovide will work over all blocks that have been provided. Is that ok?