Provider Strategy Discussion #6221
Comments
Currently, various "get"-like operations (e.g. `ipfs get`, `ipfs ls`) trigger a provide. Providing everything would be fine, but this is also a good opportunity to use a more limited strategy, like providing just roots.
^^ I think the above behavior is just a side effect of bitswap providing every single block that goes through its system.
@Stebalien had mentioned to me there's an issue with providing roots where, for example, if a node downloads certain parts of a tree and then gets interrupted, it won't automatically know how to "walk back up" the tree to get the root so it can download the rest of the hashes it needs. That might be something useful to consider when designing this.
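The "walk back up" problem comes from the shape of a Merkle DAG: blocks reference their children by CID, never their parents. A minimal Go sketch (hypothetical types, not go-ipfs code) of why an interrupted download can strand a subtree:

```go
package main

import "fmt"

// A drastically simplified stand-in for an IPLD block: links point only
// *downward*, to child CIDs. Real blocks carry no parent pointers.
type block struct {
	cid   string
	links []string // CIDs of children
}

// A hypothetical three-block DAG: root -> mid -> leaf.
var dag = map[string]block{
	"root": {cid: "root", links: []string{"mid"}},
	"mid":  {cid: "mid", links: []string{"leaf"}},
	"leaf": {cid: "leaf"},
}

// children walks downward, which the links make trivial.
func children(cid string) []string {
	return dag[cid].links
}

// parentOf can only be answered by scanning blocks we already have
// locally; if the download was interrupted before the parent arrived,
// there is no way to discover it from the child alone.
func parentOf(cid string) (string, bool) {
	for _, b := range dag {
		for _, l := range b.links {
			if l == cid {
				return b.cid, true
			}
		}
	}
	return "", false
}

func main() {
	fmt.Println(children("root"))
	p, _ := parentOf("leaf")
	fmt.Println(p)
}
```

If only roots are provided, a peer holding just `leaf` has nothing it can announce or look up to rejoin the tree, which is why a "walk back up looking for providers" mechanism is discussed below.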
@obo20 thanks for joining the discussion! You're correct, it's a thing we have to consider and there has been some discussion about how best to approach that problem. I'm curious what some of the needs and/or pain points around providing are for Pinata, do you have any?
@hannahhoward thanks! That's a great thing to point out. |
The biggest pain points we've had involve content discovery times for both old content as well as recently added content. For lack of a better explanation, it seems like the provider just can't keep up with the amount of content it needs to announce. |
@obo20 thanks again for responding.
Understood. Here we are taking measures to reduce the number of things we provide in the first place. But like you mention, that's only part of the solution. There are efforts parallel to this for improving content discovery times.
Although it does not solve all of the problems you're running into, a small portion of the new provider system is being released with

@obo20 I have two questions, if you don't mind: I know Pinata is primarily dealing with pinned content, but how much non-pinned content are you dealing with? Do you have a sense of the number of things you're trying to provide each day?

@Stebalien @eingenito do either of you know of an issue (or anything) tracking the work being done in
@michaelavila No problem. Glad to help in any way I can. We're quite excited about a lot of the improvements coming to IPFS in 0.4.20. We'll be keeping an eye on things to see how performance improves.
Currently our provider strategy is set to pinned. So we should only be announcing our content.
Currently we have roughly 7000 root nodes that are being announced each day, but this number is increasing steadily.
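For readers following along: the strategy being discussed here is selected through the `Reprovider` section of the go-ipfs config file. A fragment with the "pinned" strategy (the documented values are "all", "pinned", and "roots"; check your version's config docs for exact defaults) looks like:

```json
{
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "pinned"
  }
}
```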
@obo20 thank you. Out of curiosity, does Pinata have non-pinned content in IPFS? If so, is it a lot?
@michaelavila and I just had a discussion and, following a conversation with @Stebalien, it seems like there's a requirement that we preserve the current 0.4.20 'root block first' providing behavior no matter what. Would it be useful to merge an experimental provider system that offers the following control of re/providing:
That's really it. Users who choose to can re/provide nothing without turning off content routing, and everyone else will re/provide everything, but root blocks will always take strict precedence on the initial provide. If we can't provide at a rate that exhausts our list of root blocks, we'll never even get to other blocks. Prioritizing (or restricting re/providing to) pinned roots or subtrees would be added later. Presumably the gateway could make use of the 'none' behavior. @hsanjuan could cluster make use of the 'all' strategy, knowing that roots would always be provided first? Or, more generally, what does cluster do wrt re/providing? Subsequent merges could add (potentially):
@Stebalien, @magik6k, @hsanjuan, @scout - any comments?
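The "root blocks take strict precedence" behavior proposed above can be sketched as two queues, where the root queue must drain completely before anything else is announced. This is an illustrative Go sketch (hypothetical types, not the actual go-ipfs provider code):

```go
package main

import "fmt"

// provideQueue holds CIDs waiting to be announced. Roots are kept
// separate so they always win over non-root blocks.
type provideQueue struct {
	roots  []string
	others []string
}

func (q *provideQueue) push(cid string, isRoot bool) {
	if isRoot {
		q.roots = append(q.roots, cid)
	} else {
		q.others = append(q.others, cid)
	}
}

// next returns the next CID to provide. Roots strictly first: if roots
// arrive faster than we can provide, non-root blocks are starved,
// which is exactly the trade-off described above.
func (q *provideQueue) next() (string, bool) {
	if len(q.roots) > 0 {
		c := q.roots[0]
		q.roots = q.roots[1:]
		return c, true
	}
	if len(q.others) > 0 {
		c := q.others[0]
		q.others = q.others[1:]
		return c, true
	}
	return "", false
}

func main() {
	q := &provideQueue{}
	q.push("leaf1", false)
	q.push("root1", true)
	q.push("leaf2", false)
	for c, ok := q.next(); ok; c, ok = q.next() {
		fmt.Println(c)
	}
}
```

Even though `leaf1` was pushed first, `root1` comes out ahead of both leaves.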
@obo20 I had a couple of questions/comments: you said you're using the 'pinned' reprovider strategy. If most (all?) of your content is pinned, that's roughly equivalent to using 'all'. Would the 'roots' strategy, which just reprovides pin roots, be appropriate for your use case?
All of our content is pinned. We don't store any non-pinned content.
@eingenito We considered this, however the issue I described in my first comment could cause problems. For reference, this is the comment I'm referring to:
If this issue gets fixed, then we could absolutely use the "roots" strategy
@obo20 - thanks for your answers. The ultimate goal of these refactors is exactly what you're talking about: providing a subset of nodes (and @Kubuxu has done some work to characterize nodes as being particularly awesome to provide: #6155), and then enabling bitswap (or a derivative) to walk back up a DAG whose transfer has stalled, looking for providers along the way until it can restart.
@obo20 another question for you (thanks again): any idea how often Pinata is getting
As far as I’m aware, we’re not running into that interrupt situation right now as we’re running “pinned” as the provider strategy instead of “roots”. @Stebalien had simply warned me that it may be an issue if we switched to “roots” as a provider strategy. |
For cluster, it's just fine that IPFS peers can pin (that is, find and retrieve) things that are pinned somewhere else. Therefore the "all", "roots", and "pinned" strategies would all work, and we don't have special requirements that I can think of right now. The sharding feature (when it lands) might require the "all" strategy in order to re-construct DAGs split across multiple peers, but we can just make this a requirement for people wanting to shard. @lanzafame ^^^ double-check this sounds right?
There are a few PRs in flight to improve query perf. I also talked with @momack2 earlier today about having someone on the go-ipfs team work with the libp2p team on this kind of stuff. Unfortunately, much of this is still up in the air.
So you don't get too excited, 0.4.20 brings some improvements but isn't likely to significantly improve content routing in your case. It may improve content routing for new content on initial add but that's about it.
Discussed out-of-band but, for the record, yes.
For context, the issue is that we just massively reduced the provider parallelism in bitswap. Unfortunately, that means it'll take longer to fully provide large files after adding them to go-ipfs. The current 'root block first' providing behavior isn't affected by this reduced provider parallelism.
To make sure we're on the same page, if the user chooses a provider strategy that doesn't provide the first block, we shouldn't provide it even on initial provide.
Those all sound like good ideas. Implementing them with our current datastore may be tricky, but this could be good motivation to finally adopt a database.

(@eingenito, wrt our conversation about the "pinned" provider strategy) How volatile is pinned data? Specifically, could you approximate (don't spend any time on this) the ratio of content unpinned between GC cycles to the total number of pins you have?

I'm asking because it'll be somewhat tricky to re-implement the "pinned" provider strategy in the new provider subsystem exactly as-is. Specifically, we'd start providing pinned content when it's pinned, but we wouldn't stop until we've GCed the content (even if something unpins it in the meantime). We don't have to do it this way, but it's simpler to implement. So, I'm wondering how long you generally have "stale" data around.
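The simpler bookkeeping described above can be sketched as a set that grows on pin but only shrinks at GC time. This is an illustrative Go sketch (hypothetical types, not the real provider subsystem):

```go
package main

import "fmt"

// reprovideSet tracks which CIDs we keep announcing.
type reprovideSet struct {
	providing map[string]bool
}

// onPin starts providing a CID as soon as it is pinned.
func (s *reprovideSet) onPin(cid string) { s.providing[cid] = true }

// onUnpin is intentionally a no-op: the CID stays "stale" in the set
// until the next GC cycle, which is the simplification under discussion.
func (s *reprovideSet) onUnpin(cid string) {}

// onGC drops everything that is no longer pinned.
func (s *reprovideSet) onGC(stillPinned map[string]bool) {
	for cid := range s.providing {
		if !stillPinned[cid] {
			delete(s.providing, cid)
		}
	}
}

func main() {
	s := &reprovideSet{providing: map[string]bool{}}
	s.onPin("a")
	s.onPin("b")
	s.onUnpin("b") // "b" keeps being reprovided until GC
	fmt.Println(len(s.providing))
	s.onGC(map[string]bool{"a": true})
	fmt.Println(len(s.providing))
}
```

With a 24-hour GC cycle, the cost of this shortcut is bounded by how much gets unpinned per day, which is exactly why the question above asks about that ratio.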
Our garbage collection runs every 24 hours. At quick glance I'd guess that roughly 10-20% of our repo consists of unpinned data when a typical collection starts. |
Is this still true? For my use case, providing only the root CIDs of very large trees of recursive pins (multiple terabytes) would be enough, since the root CID is the only "entrypoint" (e.g. when giving the root CID to a pinning service). But I do not want to limit functionality/stability/resilience.
@bigCrash Setting
@Winterhuman I understand. But what are the drawbacks, exactly? To someone without in-depth knowledge, advertising only root CIDs seems sufficient for, e.g., the use case of a pinning service pinning a root CID and all of its recursive descendants (so the root CID is always the entrypoint). However, the comment I cited states that there might be problems fetching that tree if ipfs gets interrupted. My question is: is that still the case? Or am I fundamentally misunderstanding something here?
@Stebalien please add the people you think should be a part of this discussion.
We are introducing a provider system into go-ipfs (#6141) that replaces bitswap's mechanism of providing every new block it comes into contact with. The new provider system is intended to give us more control over which blocks are provided during different operations.
The important questions for this group are:
1. What kinds of provide strategies will need to be supported?
@Stebalien mentioned an approach here #5870 (comment) which should be considered here as well.
2. What do we have to support initially?
Given the concerns I hear around providing, it seems like the following would help right away and could be merged on its own:
But would that be enough?
Additionally, the reprovider strategies are being removed and instead a reprovide will work over all blocks that have been provided. Is that ok?