Intrusive shamap inner (SHAMapTreeNode memory reduction) #4815

Open
seelabs wants to merge 4 commits into develop

Conversation

@seelabs seelabs (Collaborator) commented Nov 15, 2023

High Level Overview of Change

This branch is a memory optimization. It results in about a 2-2.5 GB savings
(10%-15%) when measured on my local validator.

This branch has a long history. About two years ago I wrote a patch to
remove the mutex from shamap inner nodes (ref:
https://github.com/seelabs/rippled/tree/lockfree-tagged-cache). At the
time I measured a large memory savings of about 2 GB. Unfortunately,
the code required using the `folly` library, and I was hesitant to
introduce such a large dependency into rippled (especially one that was
so hard to build). This branch resurrects that old work and removes the
`folly` dependency.

The old branch used a lockless atomic shared pointer. This new branch
introduces an intrusive pointer type. Unlike boost's intrusive pointer,
this intrusive pointer can handle both strong and weak pointers (needed
for the tagged cache). Since this is an intrusive pointer type, in order
to support weak pointers the object is not destroyed when the strong
count goes to zero. Instead, it is "partially destroyed" (for example,
inner nodes will reset their children). This intrusive pointer uses
16 bits for the strong count and 14 bits for the weak count, plus one
64-bit pointer to point at the object. This is much smaller than a
std::shared_ptr, which needs a control block to hold the strong and
weak counts (and potentially other objects), as well as an extra pointer
to point at the control block.
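
For illustration, the layout described above looks roughly like this;
the definitions below are a simplified sketch, not the actual
implementation in this branch:

#include <atomic>
#include <cstdint>

struct IntrusiveRefCounts
{
    // One 32-bit word embedded in the managed object:
    //   bits  0..15 : strong count (max 65535)
    //   bits 16..29 : weak count   (max 16383)
    //   bits 30..31 : partial-destroy "started"/"finished" flags
    std::atomic<std::uint32_t> refCounts{1};
};

template <class T>
class SharedIntrusive
{
public:
    explicit SharedIntrusive(T* p) : p_(p)
    {
    }

    T*
    get() const
    {
        return p_;
    }

private:
    T* p_ = nullptr;  // the only per-handle storage: one 64-bit pointer
};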

The intrusive shared pointer can be modified to support atomic
operations (there is a branch that adds this support). These atomic
operations could then be used instead of the lock when changing inner
node pointers in the shamap.

Note: The space savings are independent of removing the locks from the
shamap inner nodes. Therefore this work is divided into two phases. In
the first phase a non-atomic intrusive pointer is introduced and the
locks are kept. In a second phase the atomic intrusive pointer could be
introduced and the locks removed. Some of the code in this patch is
written with the upcoming atomic work in mind (for example, using
exchange in places). The atomic intrusive pointer also requires the C++
library to support `atomic_ref`. Both gcc and msvc support this, but at
the time of this writing clang's library does not.
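
As a rough sketch of what that second phase could look like (placeholder
types, not the real SHAMapInnerNode API):

#include <atomic>

struct InnerNode
{
    InnerNode* child = nullptr;

    InnerNode*
    exchangeChild(InnerNode* newChild)
    {
        // std::atomic_ref (C++20) lets existing non-atomic storage be
        // accessed atomically, so the child slot can be swapped without
        // holding the inner-node mutex.
        std::atomic_ref<InnerNode*> slot(child);
        return slot.exchange(newChild, std::memory_order_acq_rel);
    }
};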

Note: The intrusive pointer will be 12 bytes. A std::shared_ptr will be
around 40 bytes, depending on the implementation.

When measuring memory usage on a validator, this patch resulted in
between a 10 and 15% memory savings.

Type of Change

Memory Optimization

@seelabs (Collaborator Author) commented Nov 15, 2023

FYI: This is a branch that has some work to make the shared pointer atomic: https://github.com/seelabs/rippled/tree/atomic_shamap_inner. With that patch we may be able to remove the inner node locks. But I'm putting that on the back-burner for a while.

@seelabs (Collaborator Author) commented Nov 15, 2023

One important todo before this can be merged is to confirm that we can get away with 16-bit strong counts and 14-bit weak counts.

@sublimator (Contributor) commented:

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

@seelabs (Collaborator Author) commented Nov 15, 2023

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

Yes, that's right. It's trivial to change that so the counts are 32 bits and 30 bits, but at the cost of 4 extra bytes per object. If we can, I'd like to avoid those 4 extra bytes.
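
For illustration, one way to guard the 16-bit field would be a checked increment along these lines (names and constants are just a sketch, not the actual code); whether to throw, assert, or saturate would still be an open question:

#include <atomic>
#include <cstdint>
#include <stdexcept>

inline constexpr std::uint32_t strongMask = 0x0000FFFF;  // bits 0..15

inline void
addStrongRef(std::atomic<std::uint32_t>& refCounts)
{
    auto cur = refCounts.load(std::memory_order_relaxed);
    do
    {
        // Refuse to increment past 2^16 - 1 instead of silently
        // carrying into the weak-count bits.
        if ((cur & strongMask) == strongMask)
            throw std::overflow_error("strong count exceeded 2^16 - 1");
    } while (!refCounts.compare_exchange_weak(
        cur,
        cur + 1,
        std::memory_order_acq_rel,
        std::memory_order_relaxed));
}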

@sublimator (Contributor) commented:

at the cost of 4 extra bytes per object

I guess that adds up with the current node counts.
Let's say ~20M nodes for just one state tree: 20M × 4 bytes ≈ 80 MB (~76 MiB) there alone.

@sublimator (Contributor) commented:

Damn mac eh

@scottschurr (Collaborator) commented:

One important todo before this can be merged is to confirm that we can get away with 16-bit strong counts and 14-bit weak counts.

How would you suggest we make that confirmation? Do you know which kinds of workloads are likely to run up those reference counts?

@seelabs (Collaborator Author) commented Jan 16, 2024

@scottschurr I think we need to audit the code and understand all the places that are storing strong and weak shamap pointers and decide if there are ways this could blow up. I have not done that audit, but we'll need to do it before merging this. However, that audit shouldn't stop a review from starting. Even if we decide the limit is too low, this is still worth merging; we'll just have to increase the limit (for slightly less memory savings).

@intelliot intelliot changed the title Intrusive shamap inner Intrusive shamap inner (SHAMapTreeNode memory reduction) Jan 24, 2024
@HowardHinnant (Contributor) commented:

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

Yes, that's right. It's trivial to change that so the counts are 32 bits and 30 bits, but at the cost of 4 extra bytes per object. If we can, I'd like to avoid those 4 extra bytes.

I think this is more than a todo. It's a blocker. We should not approve this PR until we know the answer to this question.

@seelabs (Collaborator Author) commented Feb 1, 2024

important todo
I think this is more than a todo. It's a blocker. We should not approve this PR until we know the answer to this question.
@HowardHinnant
I agree, and I have no intention of merging this until I audit the code to decide whether we need larger count sizes; maybe "important todo" wasn't worded as well as it should have been. On the other hand, the code can be reviewed as-is now. If I need to increase the count sizes there will be a slightly smaller memory saving (but it should still be worth doing), and it will be trivial to review the changes to the count sizes. I don't think the review needs to wait for the audit of the count sizes.

@HowardHinnant (Contributor) left a comment

I'm not done with my review, but I wanted to push the comments I have so far.

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

Collaborator Author:

I thought templates were implicitly inline. No?

Contributor:

Nope. If the function definition is inside the class declaration then it is implicitly inline. Outside the declaration it is implicitly not inline. Either way, the inline is just a hint; the compiler is free to inline or not no matter the inline declaration. So this isn't critical. I consider it good style to hint to the compiler to the best of our ability, though.
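
For example (an illustrative class, not one from this PR):

#include <cstddef>

template <class T>
struct Cache
{
    // Defined inside the class: implicitly inline.
    bool
    empty() const
    {
        return size_ == 0;
    }

    // Declared here, defined outside the class below.
    std::size_t
    size() const;

private:
    std::size_t size_ = 0;
};

// Out-of-class definition: not implicitly inline, so the hint has to be
// spelled out if we want to give it.
template <class T>
inline std::size_t
Cache<T>::size() const
{
    return size_;
}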

Collaborator Author:

TIL. Thanks!

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

// (The normal operator= is needed or it will be marked `deleted`)
if (this == &rhs)
return *this;
}
Contributor:

How about instead just:

static_assert(!std::is_same_v<T, TT>);

// There is no move constructor from a strong intrusive ptr because
// moving would be more expensive than copying in this case (the strong
// ref would need to be decremented)
WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;
Contributor:

This prohibits constructing a WeakIntrusive from an rvalue SharedIntrusive. I'm guessing that what is intended is that such an expression would execute the WeakIntrusive(SharedIntrusive<T> const& rhs) constructor. But instead a compile-time error is created.

If I'm correct, just remove this signature.

For reference, the std allows std::weak_ptr<T> to be constructed from an rvalue std::shared_ptr<TT> where TT is convertible to T or is the same as T: http://eel.is/c++draft/util.smartptr.weak#const-3
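
To illustrate with stand-in types (not the real classes): a deleted rvalue overload is still found by overload resolution, so passing a temporary is a hard error instead of binding to the const& constructor.

template <class T>
struct Weak
{
    Weak(T const&)
    {
    }
    Weak(T const&&) = delete;  // deleted overload still wins for rvalues
};

struct Strong
{
};

Strong
makeStrong()
{
    return {};
}

int
main()
{
    Strong s;
    Weak<Strong> w1{s};  // OK: binds to Weak(T const&)
    // Weak<Strong> w2{makeStrong()};  // error: Weak(T const&&) is deleted
}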

Collaborator:

As the comment from Scott states, deleting the rvalue constructor was a deliberate choice favouring performance over consistency with standard library smart pointers. Allowing construction from an rvalue SharedIntrusive would involve unnecessary reference count modifications (decrement and re-increment), which can be more expensive than copying. This design ensures predictable and efficient behavior in the context of WeakIntrusive.

// moving would be more expensive than copying in this case (the strong
// ref would need to be decremented)
WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;

Contributor:

I believe the implicit copy assignment operator does the wrong thing.

For std::weak_ptr the copy assignment operator is: Equivalent to weak_ptr(r).swap(*this)
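
For reference, a copy-and-swap shape along those lines would look roughly like this (a placeholder, not the actual WeakIntrusive implementation):

#include <utility>

template <class T>
class WeakIntrusive
{
public:
    WeakIntrusive() = default;
    WeakIntrusive(WeakIntrusive const& rhs);  // bumps the weak count
                                              // (definition omitted)

    WeakIntrusive&
    operator=(WeakIntrusive const& rhs)
    {
        // Equivalent to the std wording: weak_ptr(r).swap(*this). The
        // temporary takes its reference before anything held by *this
        // is released, so self-assignment needs no special case.
        WeakIntrusive(rhs).swap(*this);
        return *this;
    }

    void
    swap(WeakIntrusive& other) noexcept
    {
        std::swap(p_, other.p_);
    }

private:
    T* p_ = nullptr;
};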

Collaborator:

Since there are no current use cases for copy assignment in WeakIntrusive, I suggest we delete this operator to simplify the implementation. If a need arises in the future, we can reintroduce it with proper consideration.

{
if (this == &rhs)
return *this;
unsafeReleaseNoStore();
Contributor:

I believe this is in danger of destructive modification of rhs prior to adding the ref count below. The release should come after the add.

The std spec does this with a member swap: SharedWeakUnion(rhs).swap(*this).

Cases that are not caught by this == &rhs and yet still self-reference are very rare, and very difficult to replicate and/or reason about. But I have seen them happen in the field with shared_ptr, weak_ptr and unique_ptr. Here be dragons.
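
A familiar example of that aliasing pattern, shown with std::shared_ptr rather than these classes:

#include <memory>

struct Node
{
    std::shared_ptr<Node> next;
};

int
main()
{
    auto head = std::make_shared<Node>();
    head->next = std::make_shared<Node>();

    // rhs is head->next, owned solely by *head, and &rhs != &head. A
    // naive operator= that released *this before copying rhs would
    // destroy the first node and leave rhs dangling; shared_ptr's
    // copy-and-swap style assignment keeps this safe.
    head = head->next;
}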

Collaborator:

To validate the correctness of state updates, I have added unit tests covering the following scenarios:

  • Normal Use of the Assignment Operator: Ensuring correct state updates when assigning one SharedWeakUnion to another.
  • Self-Assignment: Testing for robustness in cases of self-assignment to ensure no unintended state modifications occur.
  • Null Pointer Assignment: Verifying that assigning a null SharedWeakUnion correctly updates the state to reflect null.
  • Expired Pointer Assignment: Ensuring that assigning an expired weak pointer behaves as expected, transitioning to the appropriate state.

This should address the concerns raised about potential self-referential issues or incorrect state transitions.

@@ -335,7 +472,7 @@ TaggedCache<Key, T, IsKeyCache, Hash, KeyEqual, Mutex>::del(
if (entry.isCached())
{
--m_cache_count;
entry.ptr.reset();
entry.ptr.convertToWeak();
Contributor:

If I'm not mistaken, ptr had type shared_ptr. Changing reset() to convertToWeak() is alarming! I'm not sure how to reason about this. I would've thought we should simply decrement the strong count here by 1.

Collaborator:

The distinction between reset and convertToWeak is necessary due to the intrusive nature of SharedWeakUnion, which combines both strong and weak references. The more explicit name makes the transition of the SharedWeakUnion from a strong to a weak state clear. This aligns with the intended behavior: convert the existing strong reference into a weak reference rather than fully releasing the pointer. This not only preserves the weak reference count but also makes the operation more transparent to users of SharedWeakUnion.
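
Roughly, the intended semantics on the packed count word are as follows (bit positions and names are assumptions, not the actual implementation, and this assumes the caller holds a strong reference):

#include <atomic>
#include <cstdint>

inline constexpr std::uint32_t strongOne = 1u << 0;   // strong count LSB
inline constexpr std::uint32_t weakOne = 1u << 16;    // weak count LSB

inline void
convertToWeak(std::atomic<std::uint32_t>& refCounts)
{
    // -1 strong, +1 weak in a single atomic update; contrast with
    // reset(), which would only drop the strong count.
    refCounts.fetch_add(weakOne - strongOne, std::memory_order_acq_rel);
}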

Collaborator:

[nit] Should *.ipp files be moved to the impl folder? TaggedPointer.ipp and DatabaseBody.ipp already use this convention.

Collaborator:

Not relevant anymore

@@ -58,6 +58,8 @@ template <
class Key,
class T,
bool IsKeyCache,
class SharedWeakUnionPointer,
Collaborator:

[nit] It makes sense to rename all usages of SharedWeakUnionPointer / SharedWeakUnionPointerType into SharedWeakCachePointer / SharedWeakCachePointerType for naming consistency.

Collaborator:

No support for this. Ignoring


SharedWeakCachePointer(SharedWeakCachePointer const& rhs);

template <class TT>
Collaborator:

[nit] Using TT as a type name is not conventional. Usually, if a descriptive type name can not be given and T is already used, the type name should follow the alphabet - T, U, V, etc.

Collaborator:

Ignoring, since there is no support for this

prevIntVal = refCounts.load(std::memory_order_acquire);
prev = RefCountPair{prevIntVal};
}
if (!prev.partialDestroyFinishedBit)
@vlntb (Collaborator) commented Jul 1, 2024:

Would it be possible to use a single bit to track partial destruction? If so, that would double the range of max refs for weak pointers:

2^15 - 1 = 32767 (instead of 16383)

For example, why do we need to check whether partial destruction has started? Should the logic just wait regardless? If the target ref count has already been reached, then the wait function will not block here.

// if (!prev.partialDestroyFinishedBit)
// {
// partial destroy MUST finish before running a full destroy (when
// using weak pointers)
refCounts.wait(prevIntVal - weakDelta, std::memory_order_acq_rel);
// }

Collaborator Author:

This comment on line 128 is meant to explain this:

3. Partial destroy started bit. This bit is set if the
         `partialDestructor` function has been started (or is about to be
         started). This is used to prevent the destructor from running
         concurrently with the partial destructor. This can easily happen when
         the last strong pointer releases its reference in one thread and starts
         the partialDestructor, while in another thread the last weak pointer
         goes out of scope and starts the destructor while the partialDestructor
         is still running. Both a start and a finished bit are needed to handle a
         corner-case where the last strong pointer goes out of scope, then the
         last `weakPointer` goes out of scope, but this happens before the
         `partialDestructor` bit is set. It would be possible to use a single
         bit if it could also be set atomically when the strong count goes to
         zero and the weak count is non-zero, but that would add complexity (and
         likely slow down common cases as well).
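
To make the handshake concrete, here is a rough sketch; the bit positions, names, and the simplified wait are assumptions, not the actual code:

#include <atomic>
#include <cstdint>

inline constexpr std::uint32_t startedBit = 1u << 30;
inline constexpr std::uint32_t finishedBit = 1u << 31;

// Last strong reference released: run the partial destructor between
// the two flag bits and wake any waiting full destructor.
inline void
onLastStrongRelease(
    std::atomic<std::uint32_t>& refCounts,
    void (*partialDestructor)())
{
    refCounts.fetch_or(startedBit, std::memory_order_acq_rel);
    partialDestructor();
    refCounts.fetch_or(finishedBit, std::memory_order_acq_rel);
    refCounts.notify_all();
}

// Last weak reference released while a partial destroy is pending: wait
// for "finished" before running the full destructor. The real code needs
// both bits to close the race described in the comment above; this
// sketch glosses over that corner case.
inline void
waitForPartialDestroy(std::atomic<std::uint32_t>& refCounts)
{
    auto cur = refCounts.load(std::memory_order_acquire);
    while (!(cur & finishedBit))
    {
        refCounts.wait(cur, std::memory_order_acquire);  // C++20
        cur = refCounts.load(std::memory_order_acquire);
    }
}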

@vlntb mentioned this pull request Sep 30, 2024
@vlntb (Collaborator) commented Sep 30, 2024

All subsequent discussions, fixes and changes will continue in the new PR: #5152
The current PR will be closed later.
