Feat: pin add --max-depth (arbitrary depth recursive pins) #5142
base: master
Conversation
Please note that we also have the notion of a "best effort" pin, used to pin anything off the files root. It works by keeping the GC from removing anything under it, but the GC won't fail if one of the children cannot be found. I am thinking we should just make that pin type explicit. @hsanjuan does that type of pin not meet your needs? @whyrusleeping what do you think?
Also, if we go through with this, I want to avoid the special case when the recursion is 0 by defining a direct pin as a recursive pin with a depth of 0.
I don't think it does; honestly, I don't fully understand it. I need go-ipfs to NOT fetch the whole subtree.
That's doable, happy to do it.
This implements #5133, introducing an option ("--max-depth") to limit how deep we fetch and store the DAG associated with a recursive pin. This feature is motivated by the need to fetch and pin partial DAGs in order to do DAG sharding with IPFS Cluster. This means that, when pinning something with --max-depth, the DAG will be fetched only to that depth and no further.

To get this, the PR introduces new recursive pin types: "recursive1" means the given CID is pinned along with its direct children (maxDepth=1); "recursive2" means the given CID is pinned along with its direct children and its grandchildren; and so on.

This required introducing "maxDepth" limits in all the functions walking down DAGs (in the merkledag, pin, core/commands, core/coreapi, and exchange/reprovide modules). maxDepth == -1 effectively acts as no-limit, and all these functions behave like they did before.

To facilitate the task, a new CID Set type has been added: thirdparty/recpinset. This set carries the MaxDepth associated with every Cid. This allows shortcutting already-explored branches, just like the original cid.Set does. It also stores the Recursive pinset (replacing cid.Set). recpinset should eventually be moved out to a different repo.

TODO: tests
TODO: refs -r with --max-depth

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
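The depth semantics described above can be sketched with a toy walk. This is illustrative only, not code from the PR: string IDs stand in for CIDs, and a plain map stands in for the DAG service; the `-1 == unlimited` convention matches the description.

```go
package main

import "fmt"

// Toy DAG: each node lists its children (stand-in for IPFS links).
var dag = map[string][]string{
	"root": {"a", "b"},
	"a":    {"a1"},
	"b":    {"b1", "b2"},
}

// walk collects every node reachable from root within maxDepth levels.
// maxDepth == -1 means no limit, matching the PR's convention;
// maxDepth == 0 pins only the root (i.e. a direct pin).
func walk(root string, maxDepth int, out map[string]bool) {
	out[root] = true
	if maxDepth == 0 {
		return
	}
	next := maxDepth - 1
	if maxDepth < 0 {
		next = -1 // unlimited stays unlimited
	}
	for _, child := range dag[root] {
		walk(child, next, out)
	}
}

func main() {
	for _, d := range []int{0, 1, 2, -1} {
		pinned := map[string]bool{}
		walk("root", d, pinned)
		fmt.Printf("maxDepth=%d pins %d nodes\n", d, len(pinned))
	}
}
```

With this toy DAG, maxDepth=0 pins 1 node, maxDepth=1 ("recursive1") pins 3, and maxDepth=2 or -1 pins all 6.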
f9e2146 to 45e9a4c
InternalPins() is a pinset composed of:
- Recursive pin CIDs
- Direct pin CIDs
- The empty node CID
- A root CID pointing to all of the above (and any of the sub-buckets that may have been created)

It is only set during Flush/Load operations for the pinner. Thus, recursively exploring internal pins in order to decide which CIDs are safe from GC only re-explores the recursive DAGs and should not be necessary. Mind that, previously, the CidSet would correctly prune any already-explored branches, so this had no pernicious effects. But now it does.

License: MIT Signed-off-by: Hector Sanjuan <code@hector.link>
Creating a best-effort pin won't fetch anything. It will simply prevent any subtrees already fetched from being garbage collected.
@kevina I'm thinking that dropping Direct pins as a fully separate set might have some performance impact (in order to keep the current APIs), as direct pins will need to be extracted from the recursive set by filtering them out, so listing direct pins may become a slower operation if the recursive set is very big (something that doesn't happen now). Is this acceptable?
I need to fetch partial subtrees too.
I am thinking this should be implemented directly without complicating the pinner or the However, I am not really against this idea if it will be useful in other contexts. @Stebalien @whyrusleeping what do you think?
I would think so. But others may disagree.
@Kubuxu @whyrusleeping can I get some attention on this?
@Stebalien maybe you can help with this? |
No.
- It's a hack.
- It's a ~800 line hack.
- It's a hack that touches a bunch of critical code.
- It adds a package to thirdparty.
If we're going to do this, we should do it right.
Let's go back to the issue and discuss ways to do this right. While this technically addresses the immediate need, it won't solve the long-term issue of needing more flexible ways to specify pins, and it abuses an enum to store a number. Really, we need a more flexible pinset that allows for complex pin policies.
@Stebalien I am a bit disappointed it took a month to get a full-frontal rejection with little alternative proposal. I understand that the "right" way to do this:
My request is clearly specced and implementable within the current state of things. So instead of deferring to a larger abstract change, I would like to get a list of concrete steps that we can take to move forward, possibly in parallel with the long total-revamp-of-the-pin-system discussion. For example, this change already provides a bunch of areas to discuss that would also clarify parts of the larger discussion:
Despite the criticism, the change does not get much in the way of the current pinning system (it does not change the logic, the API, or the storage format) or the standard pinning path. There is another way of approaching this too. I only need max-depth 1 and 2 (or maybe even just 1). I could support only those values, introducing 2 specific pin types and potentially reducing the parts of the "hack" needed to support arbitrary pin depths. Is there a way of doing this now that you'd consider workable?
It's a large patch, touches a bunch of critical code, hasn't been flagged as a priority, and the design wasn't discussed at all before implementing. I don't even look at patches like this until I have some time to actually think about them.
I don't know enough about the current pin system to give one off the top of my head.
This patch adds a cluster-specific, weird addition to pinning by hacking it into the existing pin system with no discussion. Whatever feature we end up adding to support depth-limited pins, we'll have to maintain it. The primary issue here is the lack of discussion and/or context. You should probably read:
It should be possible to achieve the same thing by passing the depth to the visit function (can create a new Whatever we do, those functions shouldn't talk about pinning or use types like
Personally, I'd give them the type "partial" and then add a "max-depth" field, or something like that.
Yes.
We (and cluster) will need more complex pins anyways.
The values will usually be 1 or 2 for cluster.
@hsanjuan @Stebalien I attempted to start a discussion about alternative ways to solve this problem, but it seems I was ignored. In particular, it would likely be better to enhance our
We just need to make sure that fills the need. If we do that, we won't end up removing any accidentally downloaded nodes. Let's move the discussion back to the issue.
```go
func (s *Set) Visit(c *cid.Cid, maxDepth int) bool {
	curMaxDepth, ok := s.set[string(c.Bytes())]

	if !ok || IsDeeper(maxDepth, curMaxDepth) {
```
```
    [ A ]
    |   \
  [ B ] [ C ]
    |     |
  [ C ] [ D ]
```

So what happens if we visit C in the first (left) tree? It seems like we would call visit on the second one, pass `1,2` to `IsDeeper`, which would return false, and cause us to never visit D.
Sorry, but that's not correct. Assuming you are traversing this graph with maxDepth=2 at the beginning:
- The C on the first branch would be visited with maxDepth=0
- The C on the second branch would be visited with maxDepth=1

1>0, thus `Visit` will return true, and the functions will keep traversing the path because the previously visited C had a lower depth limit than the new one. In this context `maxDepth` means the maximum depth of the tree below this CID. Thus we always keep exploring if `IsDeeper()`. It does not mean the item's own depth, as you seemed to assume.
Ah, these variables could definitely use some better naming then, and maybe a comment. Maybe `maxDepth` -> `curHeight`?
It is actually `maxDepth`. That is, it's the "max depth" at which we are planning on exploring this CID. When we increase the `maxDepth`, we explore the path again.
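That revisit rule can be sketched self-containedly. This is a toy version of the PR's recpinset, with string keys instead of CIDs; the exact `IsDeeper` semantics here are an assumption reconstructed from the discussion (`-1` is treated as unlimited and deeper than any finite budget):

```go
package main

import "fmt"

// Set remembers, per key, the deepest remaining depth budget it was
// visited with (toy stand-in for the PR's thirdparty/recpinset).
type Set struct {
	set map[string]int
}

func NewSet() *Set { return &Set{set: make(map[string]int)} }

// isDeeper reports whether budget a allows exploring further than b.
// -1 means unlimited and is deeper than every finite budget.
func isDeeper(a, b int) bool {
	if b < 0 {
		return false // b is already unlimited
	}
	return a < 0 || a > b
}

// Visit returns true when key is new, or when it reappears with a
// deeper budget than before; the path must then be explored again.
func (s *Set) Visit(key string, maxDepth int) bool {
	cur, ok := s.set[key]
	if !ok || isDeeper(maxDepth, cur) {
		s.set[key] = maxDepth
		return true
	}
	return false
}

func main() {
	s := NewSet()
	fmt.Println(s.Visit("C", 0)) // true: first visit, budget 0
	fmt.Println(s.Visit("C", 1)) // true: deeper budget, so D below C gets reached
	fmt.Println(s.Visit("C", 1)) // false: nothing new below C this time
}
```

In the two-branch example above, this is exactly why the second C (budget 1) is not pruned away by the first C (budget 0).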
@hsanjuan let's do the depth-limited recursion stuff in a separate PR, and discuss the pinning improvements @kevina and @Stebalien are mentioning. Sorry for the wait, but let's sketch this out a bit more before moving forward.
```go
// Thus, setting depth to two will walk the root, the children, and the
// children of the children.
// Setting depth to a negative number will walk the full tree.
func EnumerateChildrenMaxDepth(ctx context.Context, getLinks GetLinks, root *cid.Cid, maxDepth int, visit func(*cid.Cid, int) bool) error {
```
A more general way to do this (and the async version) would be to:
- Make the visit function take a current depth. I'd like to pass a full path (as a `[]string`) but that might get expensive.
- Have the visit function determine if we should go deeper.
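A sketch of that alternative shape, with hypothetical names rather than code from this repo: the walker tracks the current depth and hands it to the visit callback, which owns the go-deeper decision, so the traversal itself knows nothing about pins or depth limits.

```go
package main

import "fmt"

// Toy stand-in for merkledag's GetLinks: node -> children.
var links = map[string][]string{
	"root": {"a", "b"},
	"a":    {"a1"},
}

// EnumerateDepth walks the DAG, handing each node's current depth to
// visit; visit returns false to stop descending below that node.
func EnumerateDepth(root string, depth int, visit func(id string, depth int) bool) {
	if !visit(root, depth) {
		return
	}
	for _, c := range links[root] {
		EnumerateDepth(c, depth+1, visit)
	}
}

func main() {
	const maxDepth = 1 // the caller, not the walker, enforces the limit
	var seen []string
	EnumerateDepth("root", 0, func(id string, d int) bool {
		seen = append(seen, id)
		return d < maxDepth
	})
	fmt.Println(seen) // root and its direct children only
}
```

This keeps the traversal free of pin-specific types, though a caller that wants to skip already-explored branches still has to remember per-node state in its closure.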
Have the visit function determine if we should go deeper.
But this is what it does. You can either remember how far the item is from the limit (maxDepth, as it does now), or you can remember how deep the item is and what the absolute depth limit is. The way it's done now requires remembering only one thing.
This is the discussion. This PR is here to kickstart a discussion: how to support this feature in the current pin system. I was told (offline) that I could attempt to do this in the current pin system until the whole MFS thing is figured out. Now you have a bunch of concrete stuff to criticize and propose improvements on, a list of all the pieces that are potentially touched, and the minimal approach to it (hacky, yes, but minimal). Sorry, but I thought actually understanding the current pin system and proposing how to change it was a better approach than nicely asking someone else to write my feature. That said, I'll gladly maintain it too, whatever the outcome is. Sorry, I don't like the stress on "we'll have to maintain it". It leaves me with very mixed feelings about what you're implying and makes me a little sad.
I think you cannot shortcut branches from the search by knowing only the current depth. You need to know how deep you explored the last time you visited the CID (I tried that way first). At least I could not figure out how to do it just with depth while at the same time not re-exploring branches.
Happy to rename.
That's fair to say, but since cluster is requesting this feature and it is very limited in scope compared to "complex pins", it should make sense that it works well for the cluster use case. It can always be improved in the future as more use cases and needs appear. Perhaps adding a single "only children/max-depth=1" pin type would have been a better approach. (As a side note, it would be interesting to hear use cases of people pinning large amounts of pins all with different depth limits, but I agree this should eventually work well.)
Sure, no problem.
This PR was couched as a finished product. No WIP, "proposal", open design questions, "should we do this", etc. and the remaining TODOs are "write more tests" and "implement refs -r --max-depth". That's why I reacted the way I did. A fair amount of your time was wasted writing this patch and some (although significantly less) of my time was wasted reviewing it.
By "we", I meant all of us. My point was that this is a group effort. And no, you won't (and shouldn't) handle every bug report possibly related to this PR.
Correct. As this patch currently does, the visit function would have to remember how deep it was when it last explored a CID. However, that means that you can:
@hsanjuan The way I see it, the main sticking point around 'maintaining' this is whether or not you're okay with breaking it when we come up with something better. Upgrading pinning to something better will require a migration anyway, so that's not too big of an issue. However, what we have to make sure of is: are we confident the future solution will support exactly this behavior? One potential thing I see being problematic is if we switch to ipld selectors, and that makes more sense for cluster to use; then we might get stuck maintaining a feature we really don't need anymore because someone else might have started using it. Let's see how this PR looks without all the depth-limited traversal logic, and revisit the design then.
Yes, ok that works too and I see the advantage now. Thanks for taking the time to explain. |
As long as there's an equivalent way of doing things, I'm happy with breaking changes (though this sets a bad, or at least weird, precedent, even if it's good for me). Anyway, as you said, let's do the things we agree on first and revisit the discussion.
Is this going to be revisited in the future? From what I understand, the ability to pin subsections of graphs is blocking cluster from shipping their feature of splitting an unreasonably large DAG across multiple IPFS nodes. This would very much help package manager maintainers increase availability of their registries if stored on IPFS.
@achingbrain hopefully, as this is still blocking the sharding functionality in IPFS Cluster.
Is there any news/progress on this issue over the past few months? Whilst I don't want to be 'that guy' who asks others to implement features, ipfs-cluster is currently pretty useless for clusters over a few TB with large files inside, due to the requirement to replicate entire files. Whilst likely not possible for a while, due to the need for me to familiarise myself with the technicalities of IPFS under the hood and the codebase, if there are no plans to revisit this in the near future, I will try to start learning about IPFS and thinking about some of the problems in previous comments, with a view to hopefully writing this feature in X months' time. Though while considering that option, there seem to be a few mentions that the entire pin system is due for a refactor. Is this happening anytime soon? It makes thinking about an implementation of this feature harder if so.