HAMTv4 using generics instead of cbg.Deferred for perf gains #298
Ref: #285
Continuing on the optimisation journey with the belief that allocations and GC are taking up a significant portion of the time, I experimented with removing the `cbg.Deferred` use in go-hamt-ipld and using Go generics to replicate that functionality. When the HAMT doesn't know what value type it's got stored, it skips over the CBOR tokens until it finds a whole unit (single scalar token, whole nested object, whatever) and then slices the bytes for that value and passes it on in `cbg.Deferred` for the caller to decode themselves. This means double CBOR parsing and the allocation and GC of a byte slice that really shouldn't be there. So instead we can pass our serializable type down into the HAMT and it can just defer directly to it to decode its values. That work is in filecoin-project/go-hamt-ipld#122 and this branch depends on that branch as `github.com/filecoin-project/go-hamt-ipld/v4`, keeping `/v3` intact.
Summary of changes so far:
- `builtin.ActorTree` is now an interface which only has the `ActorsV5` method variants, and I've also made `builtin.LegacyActorTree`, which has those plus the `ActorsV4` method variants.
- `builtin.NewLegacyTree` and `builtin.LoadLegacyTree` are the new names for the original form; these will give you what was always returned before.
- `GetStore()` on both interfaces, and `GetMap()` on `LegacyActorTree`. I need the latter for lotus, which only knows about `adt.Map` for now, but with this change `adt.Map` can become a legacy concern and we can just use `hamt.Node[*ActorV5]` to get basically the same thing.
- `RunMigration` only takes an `ActorTree`, it doesn't need the v4 actors; it also uses `builtin.NewTree` to make the new v4 HAMT with generics variety.
- `builtin.LoadTree` to get the new v4 HAMT with generics to pass to `RunMigration`; the other migrations still use `LoadLegacyTree`.
So the impact of this is that when running the v14 migration we read and write the actors HAMT with the new v4 HAMT code that uses generics instead of `cbg.Deferred`.
Results
`lotus-shed migrate-state 23 bafy2bzacecnpvunvyytfzmdofrdwk2jr5sf4cuzit6o6uctrzqdltqwzbwwmk` for both of them:
Current lotus & go-state-types
migration height 4145482
old cid bafy2bzacedpcjm7wgoq6ft3x7jdjtjdmbaqil2zjd2pa6exl2mwpv5opik4s2
new cid bafy2bzacebpjrptlb44oaj4vyaaa5th3lvqwv5cogdfd6agzvjqr223grwhli
completed round actual (without cache), took 11.323758969s
completed round actual (with cache), took 9.755098713s
This branch and HAMT v4
migration height 4145482
old cid bafy2bzacedpcjm7wgoq6ft3x7jdjtjdmbaqil2zjd2pa6exl2mwpv5opik4s2
new cid bafy2bzacebpjrptlb44oaj4vyaaa5th3lvqwv5cogdfd6agzvjqr223grwhli
completed round actual (without cache), took 10.133880646s
completed round actual (with cache), took 8.987781194s
So we're about 10.5% faster without cache and nearly 8% faster with cache.
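
For anyone wanting to check the arithmetic, here is a small sketch that derives those percentages from the timings above. The helper name `speedupPercent` is made up for this example and is not part of lotus or go-state-types.

```go
package main

import "fmt"

// speedupPercent returns how much shorter `after` is than `before`,
// expressed as a percentage of the `before` duration.
// (Hypothetical helper, only for checking the numbers in this PR.)
func speedupPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Timings in seconds, copied from the two runs above.
	fmt.Printf("without cache: %.2f%%\n", speedupPercent(11.323758969, 10.133880646))
	fmt.Printf("with cache:    %.2f%%\n", speedupPercent(9.755098713, 8.987781194))
}
```

Both figures are relative to the current (slower) run, so this is a reduction in wall-clock time rather than a throughput ratio.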
Flame graphs
Before:
After:
You can see in there:
- `ForEach` on the actors tree goes from 30.9% to 28.90% (most of the action is captured under here since it nests the `Set` too)
- `Set` is 11.25% vs 6.22% (this is dramatic because we get to defer serialization whereas previously we had to serialize the value into a slice and hold onto it till a flush)
- `Flush` goes up from 4.97% to 6.51% (this is where the value serialization is moved to, but it's much cheaper because we go directly to the `io.Writer` rather than a slice allocation [+GC], hold, then write)
Profiles
(click to see full profiles)
Before:
After:
These are some of the bits I was most interested in:
(before vs. after profile excerpts)
(GC improvements not as dramatic as I hoped, but still there)
There are probably a few more tweaks that could be done to make this nicer. There's some internal mutation code in the HAMT that I think should be changed now that we have generics, which may speed up the path through the code for inserts and deletes.