-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: replace go-merkledag walk with go-ipld-prime traversal for dag export #8506
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg is this going to be really slow for network operations? My understanding is that go-merkledag has things like concurrent DAG traversal (and prefetching, although that might just be for UnixFS data) whereas go-ipld-prime traversal does not currently have that.
My guess is that this means gocar's traversal is also going to be slow and linear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, well I'm going to claim ignorance on many of those counts, but maybe @warpfork can help out here
it's true that the vast majority of data going through here will be unixfs, so optimisations for that are going to be beneficial if they make a difference, but maybe the case is to build some of that in to go-ipld-prime instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unclear to me how much we should worry about "prefetching" or concurrent traversal.
I acknowledge that those features exist in some code, but that does not necessarily imply that they need to exist in the world we're building towards. (Maybe it does; maybe it doesn't. If there's not a benchmark proving it, and a documented user story stating why it's required in context, maybe it's time to review the perceived need.)
ISTM the whole point of CAR files (and a lot of the new systems we're building) are to give people new APIs that explicitly avoid having one query turn into an unbounded number of surprise subqueries of unknown latency. (This in turn is a goal that makes sense because once those horses are allowed to bolt the barn, all possible work is damage control: one can do the Sisyphean work to try to minimize the costs, but one can never make them predictable again, nor as minimal as purely local operation would've been.)
So when I think through what I'd want as an end user, probably I'd want the CAR APIs to explicitly error, or return me a partial CAR, if data would need to be fetched that isn't already local.
Which would mean prefetching is radically less relevant. (It'd be widdling around disk access latency rather than worrying about internet network latency, or heaven forbid DHT lookup latency. At that point, it probably drops into "nice to have, as an optimization, that we can add... someday" territory... especially because simply having a predictable local-only mode would still probably be noticably faster than something that hits even one DHT lookup.)
I don't know if that's the only mode of use to care about; at some point that becomes an "ipfs" decision, and I'll leave that decision for someone else. But if I walk through this, starting from visualizing myself as an end user, I'd be looking for APIs that are predictable and fast, and that means I'm looking to not use the network at all, and that really minimizes the relevance of prefetching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ISTM users also pretty repeatedly tell us that this is what they're interested in. For example, QRI's bulk transporter, which builds whole graphs of content, and ships them all at once, point to point, is evidence of this: that system does not do network requests for missing pieces, but just moves what they have: and this has worked well for them, to the point that they ditched basically any of our other transport mechanisms that were less predictable.
My understanding is that the holistic goal of CAR-based APIs is to unlock that kind of victory in an even better-supported way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@warpfork if you are not building towards reasonable-to-massive parallelism at every step and layer of your designs, you are building for an already obsolete world. Latencies continue to increase in every piece of interconnect, regardless of its position in the stack. The only salvation is "do more things", especially in the world of quickly-branching DAGs we find ourselves in.
On modern systems with locally-attached NVMe even something as trivial as
is being blown out of the water at a near linear factor by
( I need to preload VFS caches quite often, so know this behavior painfully well )
One of the main point of the batteries-included
go-ipfs
application is to create (an illusion of) a locality-independent-accessible content. I strongly believe that at the current place of its lifecycle,go-ipfs
does not get the luxury of redefining its implicit API contracts. Nor does any greenfield project ( like e.g.Estuary
) gets to avoid the need for parallel traversal either: if anything, removing much of the architectural debt induced roadblocks, makes the inability to saturate any reasonable hardware even more apparent.Please reconsider your position.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a position.
I'm just saying if somebody finds this important, they'd better figure out how important, and figure out who's going to work on it.
I'd recommend starting with a document about what the user story is that we care about, and a benchmark that gives us a needle to focus on moving. The only thing I'm being cautious about here is letting the thread start asking for work to be done with neither of those in hand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasonable use of the --offline flag (generally already supported across go-ipfs commands and here as well). However, for those who use the online mode being able to download things with reasonable speed is important.
Is this a request for a benchmark showing that a linear DAG walk with a latency per retrieval of 1s is slower than one with concurrent traversal and prefetching? Seems unnecessary, but making a basic one seems pretty straightforward if necessary.
One is using car export to create verifiable gateways (i.e. trusting the gateway only with availability rather than also data integrity).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rvagg I poked around here a bit. The current version of go-car uses slow linear traversal as well so I'm fine with this PR (after we do a
go mod tidy
so tests pass).However, it seems likely we're going to have to revisit this topic as linear DAG traversal is going to be a problem for folks interested in verifiable gateways or generally folks interested in pulling CAR files out of the network (or off of high latency disks which is true for a number of our collabs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@warpfork prefetching is a a very high priority and I feel like I've been saying this for a while. It's already limiting graphsync performance to disk latency, and basically means we can't use selectors anywhere in go-ipfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can tell you that the rate-limit will be
blockSize/rtt
which limits us to a few MiB/s assuming 256KiB blocks. Assuming smaller data, it gets much worse. If you want a benchmark, you can pretty easily testipfs dag export
(no prefetching) versusipfs pin add
(parallel download) versusipfs get
(some prefetching that kind of sucks).