Intrusive shamap inner (SHAMapTreeNode memory reduction) #4815
base: develop
Conversation
This branch has a long history. About two years ago I wrote a patch to remove the mutex from shamap inner nodes (ref: https://github.com/seelabs/rippled/tree/lockfree-tagged-cache). At the time I measured a large memory savings of about 2 gig. Unfortunately, the code required using the `folly` library, and I was hesitant to introduce such a large dependency into rippled (especially one that was so hard to build). This branch resurrects that old work and removes the `folly` dependency.

The old branch used a lockless atomic shared pointer. This new branch introduces an intrusive pointer type. Unlike boost's intrusive pointer, this intrusive pointer can handle both strong and weak pointers (needed for the tagged cache). Since this is an intrusive pointer type, in order to support weak pointers, the object is not destroyed when the strong count goes to zero. Instead, it is "partially destroyed" (for example, inner nodes will reset their children). This intrusive pointer uses 16 bits for the strong count and 14 bits for the weak count, and takes one 64-bit pointer to point at the object. This is much smaller than a `std::shared_ptr`, which needs a control block to hold the strong and weak counts (and potentially other objects), as well as an extra pointer to point at the control block.

The intrusive shared pointer can be modified to support atomic operations (there is a branch that adds this support). These atomic operations can be used instead of the lock when changing inner node pointers in the shamap.

Note: The space savings is independent of removing the locks from shamap inner nodes. Therefore this work is divided into two phases. In the first phase a non-atomic intrusive pointer is introduced and the locks are kept. In a second phase the atomic intrusive pointer could be introduced and the locks removed. Some of the code in this patch is written with the upcoming atomic work in mind (for example, using exchange in places). The atomic intrusive pointer also requires the C++ library to support `atomic_ref`. Both gcc and msvc support this, but at the time of this writing clang's library does not.

Note: The intrusive pointer will be 12 bytes. The `shared_ptr` will be around 40 bytes, depending on implementation.

When measuring memory usage on a validator, this patch resulted in between a 10 and 15% memory savings.
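To make the layout concrete, here is a minimal sketch of the packing described above. The 16/14-bit field widths come from the description; everything else, including the struct and member names, is illustrative rather than the PR's actual code:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: 16 bits of strong count and 14 bits of weak count packed
// into one 32-bit word, leaving 2 bits for bookkeeping flags. The counts live
// inside the object itself, so the smart pointer needs only a single raw
// pointer to the object instead of a pointer plus a separately allocated
// control block.
struct IntrusiveRefCounts
{
    static constexpr std::uint32_t strongBits = 16;
    static constexpr std::uint32_t weakBits = 14;
    static constexpr std::uint32_t strongMask = (1u << strongBits) - 1;
    static constexpr std::uint32_t weakMask = ((1u << weakBits) - 1) << strongBits;

    std::atomic<std::uint32_t> refCounts{1};  // one strong reference at construction

    std::uint32_t strongCount() const
    {
        return refCounts.load(std::memory_order_acquire) & strongMask;
    }

    std::uint32_t weakCount() const
    {
        return (refCounts.load(std::memory_order_acquire) & weakMask) >> strongBits;
    }
};
```

A strong or weak pointer built on top of such a type only needs to hold the single raw pointer to the object, since the counts live inside the object itself; with `std::shared_ptr` the counts live in a separately allocated control block that the pointer must also reference, which is roughly where the size difference quoted in the note comes from.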
FYI: This is a branch that has some work to make the shared pointer atomic: https://github.com/seelabs/rippled/tree/atomic_shamap_inner. With that patch we may be able to remove the inner node locks. But I'm putting that on the back-burner for a while.
One important todo before this can be merged is to confirm that we can get away with 16-bit strong counts and 14-bit weak counts.
That's max refs for strong/weak?
Yes, that's right. It's trivial to change that so the counts are 32 bits and 30 bits, but at the cost of 4 extra bytes per object. If we can, I'd like to avoid those 4 extra bytes.
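A quick way to see where the 4 extra bytes come from, as a sketch under the assumption that the counts plus two flag bits share a single integer word (types are illustrative, not the PR's):

```cpp
#include <cstdint>

// 16-bit strong + 14-bit weak + 2 flag bits fit in one 32-bit word.
struct Counts32 { std::uint32_t refCounts; };

// 32-bit strong + 30-bit weak + 2 flag bits need a 64-bit word.
struct Counts64 { std::uint64_t refCounts; };

static_assert(sizeof(Counts64) - sizeof(Counts32) == 4,
              "widening the counts costs 4 extra bytes per object");
```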
I guess that adds up with the current node counts.
Damn mac eh
How would you suggest we make that confirmation? Do you know which kinds of workloads are likely to run up those reference counts?
@scottschurr I think we need to audit the code and understand all the places that are storing strong and weak shamap pointers and decide if there are ways this could blow up. I have not done that audit, but we'll need to do it before merging this. However, that audit shouldn't stop a review from starting. Even if we decide the limit is too low this is still worth merging, we'll just have to increase the limit (for slightly less memory savings).
I think this is more than a todo. It's a blocker. We should not approve this PR until we know the answer to this question.
I'm not done with my review, but I wanted to push the comments I have so far.
    bool IsKeyCache,
    class Hash,
    class KeyEqual,
    class Mutex>
inline
I thought templates were implicitly inline. No?
Nope. If the function definition is inside the class declaration then it is implicitly inline. Outside the declaration it is not. Either way, `inline` is just a hint; the compiler is free to inline or not regardless of the declaration. So this isn't critical, but I consider it good style to give the compiler the best hint we can.
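For illustration, a small sketch of the distinction (hypothetical types, not from this PR):

```cpp
template <class T>
struct Wrapper
{
    T value;

    // Defined inside the class body: implicitly inline.
    T get() const
    {
        return value;
    }

    T doubled() const;
};

// Defined outside the class body: not implicitly inline, so the keyword is
// spelled out here as a hint. The compiler may inline (or not) either way.
template <class T>
inline T
Wrapper<T>::doubled() const
{
    return value + value;
}
```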
TIL. Thanks!
    bool IsKeyCache,
    class Hash,
    class KeyEqual,
    class Mutex>
inline
    bool IsKeyCache,
    class Hash,
    class KeyEqual,
    class Mutex>
inline
    bool IsKeyCache,
    class Hash,
    class KeyEqual,
    class Mutex>
inline
    bool IsKeyCache,
    class Hash,
    class KeyEqual,
    class Mutex>
inline
        // (The normal operator= is needed or it will be marked `deleted`)
        if (this == &rhs)
            return *this;
    }
How about instead just:
static_assert(!std::is_same_v<T, TT>);
    // There is no move constructor from a strong intrusive ptr because
    // moving would be more expensive than copying in this case (the strong
    // ref would need to be decremented)
    WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;
This prohibits constructing a `WeakIntrusive` from an rvalue `SharedIntrusive`. I'm guessing that what is intended is that such an expression would execute the `WeakIntrusive(SharedIntrusive<T> const& rhs)` constructor. But instead a compile-time error is created. If I'm correct, just remove this signature.
For reference, the std allows `std::weak_ptr<T>` to be constructed from an rvalue `std::shared_ptr<TT>` where `TT` is convertible to `T` or is the same as `T`: http://eel.is/c++draft/util.smartptr.weak#const-3
As the comment from Scott states, deleting the rvalue constructor was a deliberate choice favouring performance over consistency with standard library smart pointers. Allowing construction from an rvalue `SharedIntrusive` would involve unnecessary reference count modifications (decrement and re-increment), which can be more expensive than copying. This design ensures predictable and efficient behavior in the context of `WeakIntrusive`.
    // moving would be more expensive than copying in this case (the strong
    // ref would need to be decremented)
    WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;
I believe the implicit copy assignment operator does the wrong thing. For `std::weak_ptr` the copy assignment operator is: Equivalent to `weak_ptr(r).swap(*this)`.
Since there are no current use cases for copy assignment in WeakIntrusive, I suggest we delete this operator to simplify the implementation. If a need arises in the future, we can reintroduce it with proper consideration.
    {
        if (this == &rhs)
            return *this;
        unsafeReleaseNoStore();
I believe this is in danger of destructive modification of `rhs` prior to adding the ref count below. The release should come after the add. The std spec does this with a member swap: `SharedWeakUnion(rhs).swap(*this)`.
Cases that are not caught by `this == &rhs` and yet still self-reference are very rare, and very difficult to replicate and/or reason about. But I have seen them happen in the field with `shared_ptr`, `weak_ptr` and `unique_ptr`. Here be dragons.
To validate the correctness of state updates, I have added unit tests covering the following scenarios:
- Normal Use of the Assignment Operator: Ensuring correct state updates when assigning one SharedWeakUnion to another.
- Self-Assignment: Testing for robustness in cases of self-assignment to ensure no unintended state modifications occur.
- Null Pointer Assignment: Verifying that assigning a null SharedWeakUnion correctly updates the state to reflect null.
- Expired Pointer Assignment: Ensuring that assigning an expired weak pointer behaves as expected, transitioning to the appropriate state.
This should address the concerns raised about potential self-referential issues or incorrect state transitions.
@@ -335,7 +472,7 @@ TaggedCache<Key, T, IsKeyCache, Hash, KeyEqual, Mutex>::del(
        if (entry.isCached())
        {
            --m_cache_count;
-           entry.ptr.reset();
+           entry.ptr.convertToWeak();
If I'm not mistaken, `ptr` had type `shared_ptr`. Changing `reset()` to `convertToWeak()` is alarming! I'm not sure how to reason about this. I would've thought we should simply decrement the strong count here by 1.
The distinction between `reset` and `convertToWeak` is necessary due to the intrusive nature of `SharedWeakUnion`, which combines both strong and weak references. By switching to more explicit semantics, we ensure that the transition of the `SharedWeakUnion` from a strong to a weak state is made explicit. This change aligns with the intended behavior: to convert the existing strong reference into a weak reference rather than fully releasing the pointer. This not only preserves the weak reference count but also makes the operation more transparent to users of `SharedWeakUnion`.
[nit] Should *.ipp files be moved to the `impl` folder? TaggedPointer.ipp and DatabaseBody.ipp already use this convention.
Not relevant anymore
@@ -58,6 +58,8 @@ template <
    class Key,
    class T,
    bool IsKeyCache,
+   class SharedWeakUnionPointer,
[nit] It makes sense to rename all usages of SharedWeakUnionPointer / SharedWeakUnionPointerType into SharedWeakCachePointer / SharedWeakCachePointerType for naming consistency.
No support for this. Ignoring
    SharedWeakCachePointer(SharedWeakCachePointer const& rhs);

    template <class TT>
[nit] Using TT as a type name is not conventional. Usually, if a descriptive type name can not be given and T is already used, the type name should follow the alphabet - T, U, V, etc.
Ignoring, since there is no support for this
        prevIntVal = refCounts.load(std::memory_order_acquire);
        prev = RefCountPair{prevIntVal};
    }
    if (!prev.partialDestroyFinishedBit)
Could a single bit be used to track partial destruction? If it could, that would double the range of max refs for weak pointers: 2^15 - 1 = 32767 (instead of 2^14 - 1 = 16383).
For example, why do we need to check whether partial destruction has started? Should the logic just wait regardless? If the target ref count has already been reached, then the wait function will not block here.
// if (!prev.partialDestroyFinishedBit)
// {
// partial destroy MUST finish before running a full destroy (when
// using weak pointers)
refCounts.wait(prevIntVal - weakDelta, std::memory_order_acq_rel);
// }
This comment on line 128 is meant to explain this:
3. Partial destroy started bit. This bit is set if the `partialDestructor` function has been started (or is about to be started). This is used to prevent the destructor from running concurrently with the partial destructor. This can easily happen when the last strong pointer releases its reference in one thread and starts the partialDestructor, while in another thread the last weak pointer goes out of scope and starts the destructor while the partialDestructor is still running. Both a start and a finished bit are needed to handle a corner case where the last strong pointer goes out of scope, then the last `weakPointer` goes out of scope, but this happens before the `partialDestructor` bit is set. It would be possible to use a single bit if it could also be set atomically when the strong count goes to zero and the weak count is non-zero, but that would add complexity (and likely slow down common cases as well).
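A hedged sketch of the coordination the two bits make possible (assumed bit positions and helper name, not the PR's exact code): the path about to run the full destructor waits until the finished bit is observed whenever the started bit is set, so the two destructors never run concurrently.

```cpp
#include <atomic>
#include <cstdint>

// Assumed layout: 16-bit strong count, 14-bit weak count, flags in bits 30-31.
constexpr std::uint32_t partialDestroyStartedBit  = 1u << 30;
constexpr std::uint32_t partialDestroyFinishedBit = 1u << 31;

// Called before running the full destructor. If another thread has started
// (but not finished) the partialDestructor, block until the finished bit
// appears in the counter word.
inline void
waitForPartialDestroy(std::atomic<std::uint32_t>& refCounts)
{
    auto cur = refCounts.load(std::memory_order_acquire);
    while ((cur & partialDestroyStartedBit) && !(cur & partialDestroyFinishedBit))
    {
        // C++20 atomic wait: sleeps until refCounts changes from `cur`.
        refCounts.wait(cur, std::memory_order_acquire);
        cur = refCounts.load(std::memory_order_acquire);
    }
}
```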
All subsequent discussions, fixes and changes will continue in the new PR: #5152
High Level Overview of Change
This branch is a memory optimization. It results in about a 2-2.5 GB savings (10%-15%) when measured on my local validator.
Type of Change
Memory Optimization