-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gossipsub: unsubscribe fixes #569
Conversation
* fix KeyError when updating metric of unsubscribed topic * fix unsubscribe message not being sent to all peers causing them to keep thinking we're still subscribed * release memory earlier in a few places
for i in 0..<min(rpcMsg.subscriptions.len, p.topicsHigh): | ||
let sub = rpcMsg.subscriptions[i] | ||
template sub(): untyped = rpcMsg.subscriptions[i] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really dislike this because it's using a clever and unintuitive trick to create an alias. It's also masking away potential rase conditions, if this proc becomes async it would be very easy to miss why we suddenly get NPEs or out of bounds access. I would rather we use var sub = addr rpcMsg.subscriptions[i] # do something with sub[]
which makes things explicit and hard to miss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might have missed something here, addr
is the construct that disables the safety mechanisms in Nim and permits the kind of issues you're referring to because the resulting pointer can become dangling - it should be avoided when possible.
also, it would not be applicable here since this is not a mutable instance.
template is used throughout the codebase to work around nim 1.2 issues in performance hotspots. gossipsub is the biggest memory consumer in nimbus and this is one of the sources of that consumption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addr is the construct that disables the safety mechanisms in Nim
Nope, didn't miss that. What I'm referring to here is explicitness. The template in this case is "pretending" that the element has been placed in a variable - for all intents and purposes it's usage looks and feels like that, like here for example - https://github.com/status-im/nim-libp2p/pull/569/files#diff-a99c4291e04a9bde36fc1088e4d03053d57e0fcb54d659fcbc4b313362cf7278R206, this is misleading.
I don't think that the compiler can deduce the size of the array/sequence in this case, so there are no compile checks for out of bounds access and you effectively getting similar guarantees to addr
.
The problem with masking it like this is that someone can innadverdly do something like:
template sub(): untyped = rpcMsg.subscriptions[i]
....
del rpcMsg.subscriptions[i]
...
# uhoh - in the best case you're going to crash, in the worst it's you're pointing to the wrong element
echo sub.topic
It won't happen now, because the proc is not async
, but this tend to change over the lifetime of the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deduce the size
the []
operator checks for out-of-bounds at runtime. the pointer returned by addr
is not checked any further.
similar guarantees to addr.
addr
returns a ptr T
- a pointer can for example be copied into a closure, but when that is done, the data that it points to is not copied - this causes it to become dangling, the very thing you're trying to argue is a problem with the template. addr
is one of the memory-unsafe constructs in Nim - it doesn't offer any guarantees at all. If you were to use addr
with async
, you would be circumventing the memory-safety that the language gives you and crash where a template like this wouldn't (it would take a copy of rpcMsg
and keep working, albeit inefficiently)
this is misleading.
Generally, rpcMsg
is immutable meaning that it cannot change the way you're suggesting in your snippet - you also cannot use addr
on it, for similar reasons: you would now be able to mutate it and cause these kinds of issues.
In cases like this where the source data is immutable, in Nim 1.2 you can either take a copy with let
(what the previous code did) or use a template (or type it out) - the language doesn't allow you to use addr
for good reasons, and even if it did, it would cause worse problems if the function was refactored or turned into async, as noted above.
We can forgo this optimization of course, but doing so comes at a measurable cost.
let err = fut.readError() | ||
warn "Error in topic handler", msg = err.msg | ||
proc handleData*(p: PubSub, topic: string, data: seq[byte]): Future[void] = | ||
# Start work on all data handlers without copying data into closure like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this holds. The only local variables are handlers
which is a pointer/ref and futs
. I would imagine that only futs
gets copied because it's the only thing on the stack, the rest get lifted and passed as arguments to the closure. The new waiter
closure keeps copying futs
into the closure environment just as the previous version did, while obfuscating the flow. Where is the gain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nimbus does not use the data handlers, and typically, if a handler returns a completed future, it is no longer interesting to pass it to the callSoon
machinery. The same reasoning holds here as when writing: everything inside {.async.}
is copied into a closure environment to guarantee the lifetime of the data. In this particular example, we know more and can exploit it to conserve memory in a hotspot.
The gain is expressed as the "fast path" in the comments - futs will remain empty and no allocations will happen - waiter
will not be called at all in this case.
Before implementing tricks like this, you also need to have a clear understanding of exception handling in Nim - in particular, this is a good example of why functions that return Future
must not raise.
# Next time sendConn is used, it will be have its close flag set and thus | ||
# will be recycled | ||
let fut = conn.writeLp(encoded) # Avoid copying `encoded` into future | ||
proc sendWaiter(): Future[void] {.async.} = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again not sure where is the gain here, wouldn't this do exactly the same amount of copying as the previous version but in a more convoluted way? What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To understand this, you need to look into how the compiler transforms the code into a closure iterator - in particular, it will copy the variables that enter the closure onto a heap instance - the closure environment - which then gets passed around by reference. encoded
is such a variable, often having significant size.
The best place to see this is the C
code. The comment simplifies things by calling it into the future
- this is a mental shortcut that is mostly accurate.
before:
T9_ = (tyObject_FuturecolonObjectType___GXFSekg1U8JRoedGa2vBSA*)0; T9_ = writeLp__9c71DJYgEH8rgDWbdnwWECg(T8_, (*(*colonenvP_).colonup_).encoded2->data, ((*(*colonenvP_).colonup_).encoded2 ? (*(*colonenvP_).colonup_).encoded2->Sup.len : 0)); asgnRef((void**) (&(*colonenvP_).chronosInternalTmpFuture4), &T9_->Sup);
after:
asgnRef((void**) (&(*colonenv_).fut1), writeLp__9c71DJYgEH8rgDWbdnwWECg(T4_, encoded->data, (encoded ? encoded->Sup.len : 0)));
If you want more background on how libp2p uses memory, the easiest place is the metrics - seq[byte]
, Future[void]
and string
are the 3 kinds of memory that libp2p allocates a lot of and this was one example: to create the future, it first copied encoded
into the closure environment, then passed a reference to that environment to writeLp
- because the closure is a single allocation that has roughly the same lifetime as the future, the copy of encoded
stayed referenced until the future was resolved. This can take a while, and the result is that we see spikes of memory usage in Nimbus.
By doing it this way, the closure environment still has a significant lifetime, but it's now much smaller - in particular, it doesn't hold on to a copy of encoded
.
These kinds of optimizations generally don't make sense unless they've been verified to cause trouble - they are rarely noticeable - but gossipsub
, because of the fanout, is a pathological case - every message is copied to each peer and goes through 4-5 layers of abstraction each taking multiple copies of it as part of framing it - the fewer such copies, the less memory libp2p will use and keep referenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But I'd really like to understand the reasoning behind the things I pointed out.
* gossipsub: unsubscribe fixes * fix KeyError when updating metric of unsubscribed topic * fix unsubscribe message not being sent to all peers causing them to keep thinking we're still subscribed * release memory earlier in a few places * floodsub fix
* mem usage cleanups for pubsub (#564) In `async` functions, a closure environment is created for variables that cross an await boundary - this closure environment is kept in memory for the lifetime of the associated future - this means that although _some_ variables are no longer used, they still take up memory for a long time. In Nimbus, message validation is processed in batches meaning the future of an incoming gossip message stays around for quite a while - this leads to memory consumption peaks of 100-200 mb when there are many attestations in the pipeline. To avoid excessive memory usage, it's generally better to move non-async code into proc's such that the variables therein can be released earlier - this includes the many hidden variables introduced by macro and template expansion (ie chronicles that does expensive exception handling) * move seen table salt to floodsub, use there as well * shorten seen table salt to size of hash * avoid unnecessary memory allocations and copies in a few places * factor out message scoring * avoid reencoding outgoing message for every peer * keep checking validators until reject (in case there's both reject and ignore) * `readOnce` avoids `readExactly` overhead for single-byte read * genericAssign -> assign2 * fix control messages (#566) * remove unused control graft check in handleControl * avoid sending empty Iwant messages * gossipsub: unsubscribe fixes (#569) * gossipsub: unsubscribe fixes * fix KeyError when updating metric of unsubscribed topic * fix unsubscribe message not being sent to all peers causing them to keep thinking we're still subscribed * release memory earlier in a few places * floodsub fix * simplify connmanager (#573) * no need to init orderedset * array more simple than table * adding raises defect across the codebase (#572) * adding raises defect across the codebase * use unittest2 * add windows deps caching * update mingw link * die on failed peerinfo initialization * use result.expect instead of get * use expect more consistently and rework inits * use expect more consistently * throw on missing public key * remove unused closure annotation * merge master * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * Builders (#558) * use a builder pattern to build the switch (#551) * use a builder pattern to build the switch * with with * more refs * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` * fix builders * more builder stuff * more builders Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * Merge master (#562) * builders (#559) * More builders (#560) * address some issues pointed out in review * re-add to prevent breaking other projects * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * use expect with multiaddress init * use expect * raise LPError Co-authored-by: Jacek Sieka <jacek@status.im> Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
* mem usage cleanups for pubsub (#564) In `async` functions, a closure environment is created for variables that cross an await boundary - this closure environment is kept in memory for the lifetime of the associated future - this means that although _some_ variables are no longer used, they still take up memory for a long time. In Nimbus, message validation is processed in batches meaning the future of an incoming gossip message stays around for quite a while - this leads to memory consumption peaks of 100-200 mb when there are many attestations in the pipeline. To avoid excessive memory usage, it's generally better to move non-async code into proc's such that the variables therein can be released earlier - this includes the many hidden variables introduced by macro and template expansion (ie chronicles that does expensive exception handling) * move seen table salt to floodsub, use there as well * shorten seen table salt to size of hash * avoid unnecessary memory allocations and copies in a few places * factor out message scoring * avoid reencoding outgoing message for every peer * keep checking validators until reject (in case there's both reject and ignore) * `readOnce` avoids `readExactly` overhead for single-byte read * genericAssign -> assign2 * fix control messages (#566) * remove unused control graft check in handleControl * avoid sending empty Iwant messages * gossipsub: unsubscribe fixes (#569) * gossipsub: unsubscribe fixes * fix KeyError when updating metric of unsubscribed topic * fix unsubscribe message not being sent to all peers causing them to keep thinking we're still subscribed * release memory earlier in a few places * floodsub fix * simplify connmanager (#573) * no need to init orderedset * array more simple than table * adding raises defect across the codebase (#572) * adding raises defect across the codebase * use unittest2 * add windows deps caching * update mingw link * die on failed peerinfo initialization * use result.expect instead of get * use expect more consistently and rework inits * use expect more consistently * throw on missing public key * remove unused closure annotation * merge master * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * Builders (#558) * use a builder pattern to build the switch (#551) * use a builder pattern to build the switch * with with * more refs * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` * fix builders * more builder stuff * more builders Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * Merge master (#562) * builders (#559) * More builders (#560) * address some issues pointed out in review * re-add to prevent breaking other projects * Merge master (#555) * Revisit Floodsub (#543) Fixes #525 add coverage to unsubscribeAll and testing * add mounted protos to identify message (#546) * add stable/unstable auto bumps * fix auto-bump CI * merge nbc auto bump with CI in order to bump only on CI success * put conditional locks on nbc bump (#549) * Fix minor exception issues (#550) Makes code compatible with status-im/nim-chronos#166 without requiring it. * fix nimbus ref for auto-bump stable's PR * Split dialer (#542) * extracting dialing logic to dialer * exposing upgrade methods on transport * cleanup * fixing tests to use new interfaces * add comments * add base exception class and fix hierarchy * fix imports * `doAssert` is `ValueError` not `AssertionError`? * revert back to `AssertionError` Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com> Co-authored-by: Jacek Sieka <jacek@status.im> * use expect with multiaddress init * use expect * raise LPError Co-authored-by: Jacek Sieka <jacek@status.im> Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
* gossipsub: unsubscribe fixes * fix KeyError when updating metric of unsubscribed topic * fix unsubscribe message not being sent to all peers causing them to keep thinking we're still subscribed * release memory earlier in a few places * floodsub fix
keep thinking we're still subscribed