Intrusive shamap inner (SHAMapTreeNode memory reduction) #4815

Open
seelabs wants to merge 4 commits into develop

Conversation

@seelabs seelabs (Collaborator) commented Nov 15, 2023

High Level Overview of Change

This branch is a memory optimization. It results in about a 2-2.5 GB savings
(10%-15%) when measured on my local validator.

This branch has a long history. About two years ago I wrote a patch to
remove the mutex from shamap inner nodes (ref:
https://github.com/seelabs/rippled/tree/lockfree-tagged-cache). At the
time I measured a large memory savings of about 2 GB. Unfortunately,
the code required using the `folly` library, and I was hesitant to
introduce such a large dependency into rippled (especially one that was
so hard to build). This branch resurrects that old work and removes the
`folly` dependency.

The old branch used a lockless atomic shared pointer. This new branch
introduces an intrusive pointer type. Unlike boost's intrusive pointer,
this intrusive pointer can handle both strong and weak pointers (needed
for the tagged cache). Since this is an intrusive pointer type, in order
to support weak pointers the object is not destroyed when the strong
count goes to zero. Instead, it is "partially destroyed" (for example,
inner nodes will reset their children). This intrusive pointer uses
16 bits for the strong count and 14 bits for the weak count, plus one
64-bit pointer to point at the object. This is much smaller than a
std::shared_ptr, which needs a control block to hold the strong and
weak counts (and potentially other objects), as well as an extra pointer
to point at the control block.
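
For illustration, the layout described above looks roughly like this;
the definitions below are a simplified sketch, not the actual
implementation in this branch:

#include <atomic>
#include <cstdint>

struct IntrusiveRefCounts
{
    // One 32-bit word embedded in the managed object:
    //   bits  0..15 : strong count (max 65535)
    //   bits 16..29 : weak count   (max 16383)
    //   bits 30..31 : partial-destroy "started"/"finished" flags
    std::atomic<std::uint32_t> refCounts{1};
};

template <class T>
class SharedIntrusive
{
public:
    explicit SharedIntrusive(T* p) : p_(p)
    {
    }

    T*
    get() const
    {
        return p_;
    }

private:
    T* p_ = nullptr;  // the only per-handle storage: one 64-bit pointer
};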

The intrusive shared pointer can be modified to support atomic
operations (there is a branch that adds this support). These atomic
operations could then be used instead of the lock when changing inner
node pointers in the shamap.

Note: The space savings are independent of removing the locks from the
shamap inner nodes. Therefore this work is divided into two phases. In
the first phase a non-atomic intrusive pointer is introduced and the
locks are kept. In a second phase the atomic intrusive pointer could be
introduced and the locks removed. Some of the code in this patch is
written with the upcoming atomic work in mind (for example, using
exchange in places). The atomic intrusive pointer also requires the C++
library to support `atomic_ref`. Both gcc and msvc support this, but at
the time of this writing clang's library does not.
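
As a rough sketch of what that second phase could look like (placeholder
types, not the real SHAMapInnerNode API):

#include <atomic>

struct InnerNode
{
    InnerNode* child = nullptr;

    InnerNode*
    exchangeChild(InnerNode* newChild)
    {
        // std::atomic_ref (C++20) lets existing non-atomic storage be
        // accessed atomically, so the child slot can be swapped without
        // holding the inner-node mutex.
        std::atomic_ref<InnerNode*> slot(child);
        return slot.exchange(newChild, std::memory_order_acq_rel);
    }
};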

Note: The intrusive pointer will be 12 bytes. A std::shared_ptr will be
around 40 bytes, depending on the implementation.

When measuring memory usage on a validator, this patch resulted in
between a 10 and 15% memory savings.

Type of Change

Memory Optimization

@seelabs (Collaborator Author) commented Nov 15, 2023

FYI: This is a branch that has some work to make the shared pointer atomic: https://github.com/seelabs/rippled/tree/atomic_shamap_inner. With that patch we may be able to remove the inner node locks. But I'm putting that on the back-burner for a while.

@seelabs (Collaborator Author) commented Nov 15, 2023

One important todo before this can be merged is to confirm that we can get away with 16-bit strong counts and 14-bit weak counts.

@sublimator (Contributor) commented:

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

@seelabs (Collaborator Author) commented Nov 15, 2023

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

Yes, that's right. It's trivial to change that so the counts are 32 bits and 30 bits, but at the cost of 4 extra bytes per object. If we can, I'd like to avoid those 4 extra bytes.
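
For illustration, one way to guard the 16-bit field would be a checked increment along these lines (names and constants are just a sketch, not the actual code); whether to throw, assert, or saturate would still be an open question:

#include <atomic>
#include <cstdint>
#include <stdexcept>

inline constexpr std::uint32_t strongMask = 0x0000FFFF;  // bits 0..15

inline void
addStrongRef(std::atomic<std::uint32_t>& refCounts)
{
    auto cur = refCounts.load(std::memory_order_relaxed);
    do
    {
        // Refuse to increment past 2^16 - 1 instead of silently
        // carrying into the weak-count bits.
        if ((cur & strongMask) == strongMask)
            throw std::overflow_error("strong count exceeded 2^16 - 1");
    } while (!refCounts.compare_exchange_weak(
        cur,
        cur + 1,
        std::memory_order_acq_rel,
        std::memory_order_relaxed));
}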

@sublimator (Contributor) commented:

at the cost of 4 extra bytes per object

I guess that adds up with the current node counts.
Let's say ~20M nodes for just one state tree: 20M × 4 bytes ≈ 80 MB (~76 MiB) there alone.

@sublimator (Contributor) commented:

Damn mac eh

@scottschurr (Collaborator) commented:

One important todo before this can be merged is to confirm that we can get away with 16-bit strong counts and 14-bit weak counts.

How would you suggest we make that confirmation? Do you know which kinds of workloads are likely to run up those reference counts?

@seelabs (Collaborator Author) commented Jan 16, 2024

@scottschurr I think we need to audit the code and understand all the places that are storing strong and weak shamap pointers and decide if there are ways this could blow up. I have not done that audit, but we'll need to do it before merging this. However, that audit shouldn't stop a review from starting. Even if we decide the limit is too low, this is still worth merging; we'll just have to increase the limit (for slightly less memory savings).

@intelliot intelliot changed the title Intrusive shamap inner Intrusive shamap inner (SHAMapTreeNode memory reduction) Jan 24, 2024
@HowardHinnant (Contributor) commented:

important todo

> 2 ** 16 -1
65535
> 2 ** 14 -1
16383

That's max refs for strong/weak?

Yes, that's right. It's trivial to change that so the counts are 32 bits and 30 bits, but at the cost of 4 extra bytes per object. If we can, I'd like to avoid those 4 extra bytes.

I think this is more than a todo. It's a blocker. We should not approve this PR until we know the answer to this question.

@seelabs (Collaborator Author) commented Feb 1, 2024

important todo
I think this is more than a todo. It's a blocker. We should not approve this PR until we know the answer to this question.
@HowardHinnant
I agree, and I have no intention of merging this until I audit the code to decide whether we need larger count sizes; maybe "important todo" wasn't worded as well as it should have been. On the other hand, the code can be reviewed as-is now. If I need to increase the count sizes there will be a slightly smaller memory saving (but it should still be worth doing), and it will be trivial to review the changes to the count sizes. I don't think the review needs to wait for the audit of the count sizes.

@HowardHinnant (Contributor) left a comment

I'm not done with my review, but I wanted to push the comments I have so far.

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

Collaborator Author:

I thought templates were implicitly inline. No?

Contributor:

Nope. If the function definition is inside the class declaration then it is implicitly inline. Outside the declaration it is implicitly not inline. Either way, the inline is just a hint; the compiler is free to inline or not no matter the inline declaration. So this isn't critical. I consider it good style to hint to the compiler to the best of our ability, though.
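
For example (an illustrative class, not one from this PR):

#include <cstddef>

template <class T>
struct Cache
{
    // Defined inside the class: implicitly inline.
    bool
    empty() const
    {
        return size_ == 0;
    }

    // Declared here, defined outside the class below.
    std::size_t
    size() const;

private:
    std::size_t size_ = 0;
};

// Out-of-class definition: not implicitly inline, so the hint has to be
// spelled out if we want to give it.
template <class T>
inline std::size_t
Cache<T>::size() const
{
    return size_;
}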

Collaborator Author:

TIL. Thanks!

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

bool IsKeyCache,
class Hash,
class KeyEqual,
class Mutex>
Contributor:

inline

// (The normal operator= is needed or it will be marked `deleted`)
if (this == &rhs)
return *this;
}
Contributor:

How about instead just:

static_assert(!std::is_same_v<T, TT>);

// There is no move constructor from a strong intrusive ptr because
// moving would be more expensive than copying in this case (the strong
// ref would need to be decremented)
WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;
Contributor:

This prohibits constructing a WeakIntrusive from an rvalue SharedIntrusive. I'm guessing that what is intended is that such an expression would execute the WeakIntrusive(SharedIntrusive<T> const& rhs) constructor. But instead a compile-time error is created.

If I'm correct, just remove this signature.

For reference, the std allows std::weak_ptr<T> to be constructed from an rvalue std::shared_ptr<TT> where TT is convertible to T or is the same as T: http://eel.is/c++draft/util.smartptr.weak#const-3
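
To illustrate with stand-in types (not the real classes): a deleted rvalue overload is still found by overload resolution, so passing a temporary is a hard error instead of binding to the const& constructor.

template <class T>
struct Weak
{
    Weak(T const&)
    {
    }
    Weak(T const&&) = delete;  // deleted overload still wins for rvalues
};

struct Strong
{
};

Strong
makeStrong()
{
    return {};
}

int
main()
{
    Strong s;
    Weak<Strong> w1{s};  // OK: binds to Weak(T const&)
    // Weak<Strong> w2{makeStrong()};  // error: Weak(T const&&) is deleted
}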

Collaborator:

As the comment from Scott states, deleting the rvalue constructor was a deliberate choice favouring performance over consistency with standard library smart pointers. Allowing construction from an rvalue SharedIntrusive would involve unnecessary reference count modifications (decrement and re-increment), which can be more expensive than copying. This design ensures predictable and efficient behavior in the context of WeakIntrusive.

// moving would be more expensive than copying in this case (the strong
// ref would need to be decremented)
WeakIntrusive(SharedIntrusive<T> const&& rhs) = delete;

Contributor:

I believe the implicit copy assignment operator does the wrong thing.

For std::weak_ptr the copy assignment operator is: Equivalent to weak_ptr(r).swap(*this)
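
For reference, a copy-and-swap shape along those lines would look roughly like this (a placeholder, not the actual WeakIntrusive implementation):

#include <utility>

template <class T>
class WeakIntrusive
{
public:
    WeakIntrusive() = default;
    WeakIntrusive(WeakIntrusive const& rhs);  // bumps the weak count
                                              // (definition omitted)

    WeakIntrusive&
    operator=(WeakIntrusive const& rhs)
    {
        // Equivalent to the std wording: weak_ptr(r).swap(*this). The
        // temporary takes its reference before anything held by *this
        // is released, so self-assignment needs no special case.
        WeakIntrusive(rhs).swap(*this);
        return *this;
    }

    void
    swap(WeakIntrusive& other) noexcept
    {
        std::swap(p_, other.p_);
    }

private:
    T* p_ = nullptr;
};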

Collaborator:

Since there are no current use cases for copy assignment in WeakIntrusive, I suggest we delete this operator to simplify the implementation. If a need arises in the future, we can reintroduce it with proper consideration.

{
if (this == &rhs)
return *this;
unsafeReleaseNoStore();
Contributor:

I believe this is in danger of destructive modification of rhs prior to adding the ref count below. The release should come after the add.

The std spec does this with a member swap: SharedWeakUnion(rhs).swap(*this).

Cases that are not caught by this == &rhs and yet still self-reference are very rare, and very difficult to replicate and/or reason about. But I have seen them happen in the field with shared_ptr, weak_ptr and unique_ptr. Here be dragons.
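
A familiar example of that aliasing pattern, shown with std::shared_ptr rather than these classes:

#include <memory>

struct Node
{
    std::shared_ptr<Node> next;
};

int
main()
{
    auto head = std::make_shared<Node>();
    head->next = std::make_shared<Node>();

    // rhs is head->next, owned solely by *head, and &rhs != &head. A
    // naive operator= that released *this before copying rhs would
    // destroy the first node and leave rhs dangling; shared_ptr's
    // copy-and-swap style assignment keeps this safe.
    head = head->next;
}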

Collaborator:

To validate the correctness of state updates, I have added unit tests covering the following scenarios:

  • Normal Use of the Assignment Operator: Ensuring correct state updates when assigning one SharedWeakUnion to another.
  • Self-Assignment: Testing for robustness in cases of self-assignment to ensure no unintended state modifications occur.
  • Null Pointer Assignment: Verifying that assigning a null SharedWeakUnion correctly updates the state to reflect null.
  • Expired Pointer Assignment: Ensuring that assigning an expired weak pointer behaves as expected, transitioning to the appropriate state.

This should address the concerns raised about potential self-referential issues or incorrect state transitions.

@@ -335,7 +472,7 @@ TaggedCache<Key, T, IsKeyCache, Hash, KeyEqual, Mutex>::del(
if (entry.isCached())
{
--m_cache_count;
entry.ptr.reset();
entry.ptr.convertToWeak();
Contributor:

If I'm not mistaken, ptr had type shared_ptr. Changing reset() to convertToWeak() is alarming! I'm not sure how to reason about this. I would've thought we should simply decrement the strong count here by 1.

Collaborator:

The distinction between reset and convertToWeak is necessary due to the intrusive nature of SharedWeakUnion, which combines both strong and weak references. The more explicit name makes the transition of the SharedWeakUnion from a strong to a weak state clear. This aligns with the intended behavior: convert the existing strong reference into a weak reference rather than fully releasing the pointer. This not only preserves the weak reference count but also makes the operation more transparent to users of SharedWeakUnion.
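
Roughly, the intended semantics on the packed count word are as follows (bit positions and names are assumptions, not the actual implementation, and this assumes the caller holds a strong reference):

#include <atomic>
#include <cstdint>

inline constexpr std::uint32_t strongOne = 1u << 0;   // strong count LSB
inline constexpr std::uint32_t weakOne = 1u << 16;    // weak count LSB

inline void
convertToWeak(std::atomic<std::uint32_t>& refCounts)
{
    // -1 strong, +1 weak in a single atomic update; contrast with
    // reset(), which would only drop the strong count.
    refCounts.fetch_add(weakOne - strongOne, std::memory_order_acq_rel);
}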

Collaborator:

[nit] Should *.ipp files be moved to the impl folder? TaggedPointer.ipp and DatabaseBody.ipp already use this convention.

Collaborator:

Not relevant anymore

@@ -58,6 +58,8 @@ template <
class Key,
class T,
bool IsKeyCache,
class SharedWeakUnionPointer,
Collaborator:

[nit] It makes sense to rename all usages of SharedWeakUnionPointer / SharedWeakUnionPointerType into SharedWeakCachePointer / SharedWeakCachePointerType for naming consistency.

Collaborator:

No support for this. Ignoring


SharedWeakCachePointer(SharedWeakCachePointer const& rhs);

template <class TT>
Collaborator:

[nit] Using TT as a type name is not conventional. Usually, if a descriptive type name can not be given and T is already used, the type name should follow the alphabet - T, U, V, etc.

Collaborator:

Ignoring, since there is no support for this

prevIntVal = refCounts.load(std::memory_order_acquire);
prev = RefCountPair{prevIntVal};
}
if (!prev.partialDestroyFinishedBit)
@vlntb (Collaborator) commented Jul 1, 2024:

Would it be possible to use a single bit to track partial destruction? If so, that would double the range of max refs for weak pointers:

2^15 - 1 = 32767 (instead of 16383)

For example, why do we need to check whether partial destruction has started? Should the logic just wait regardless? If the target ref count has already been reached, then the wait function will not block here.

// if (!prev.partialDestroyFinishedBit)
// {
// partial destroy MUST finish before running a full destroy (when
// using weak pointers)
refCounts.wait(prevIntVal - weakDelta, std::memory_order_acq_rel);
// }

Collaborator Author:

This comment on line 128 is meant to explain this:

3. Partial destroy started bit. This bit is set if the
         `partialDestructor` function has been started (or is about to be
         started). This is used to prevent the destructor from running
         concurrently with the partial destructor. This can easily happen when
         the last strong pointer releases its reference in one thread and starts
         the partialDestructor, while in another thread the last weak pointer
         goes out of scope and starts the destructor while the partialDestructor
         is still running. Both a start and a finished bit are needed to handle a
         corner-case where the last strong pointer goes out of scope, then the
         last `weakPointer` goes out of scope, but this happens before the
         `partialDestructor` bit is set. It would be possible to use a single
         bit if it could also be set atomically when the strong count goes to
         zero and the weak count is non-zero, but that would add complexity (and
         likely slow down common cases as well).
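
To make the handshake concrete, here is a rough sketch; the bit positions, names, and the simplified wait are assumptions, not the actual code:

#include <atomic>
#include <cstdint>

inline constexpr std::uint32_t startedBit = 1u << 30;
inline constexpr std::uint32_t finishedBit = 1u << 31;

// Last strong reference released: run the partial destructor between
// the two flag bits and wake any waiting full destructor.
inline void
onLastStrongRelease(
    std::atomic<std::uint32_t>& refCounts,
    void (*partialDestructor)())
{
    refCounts.fetch_or(startedBit, std::memory_order_acq_rel);
    partialDestructor();
    refCounts.fetch_or(finishedBit, std::memory_order_acq_rel);
    refCounts.notify_all();
}

// Last weak reference released while a partial destroy is pending: wait
// for "finished" before running the full destructor. The real code needs
// both bits to close the race described in the comment above; this
// sketch glosses over that corner case.
inline void
waitForPartialDestroy(std::atomic<std::uint32_t>& refCounts)
{
    auto cur = refCounts.load(std::memory_order_acquire);
    while (!(cur & finishedBit))
    {
        refCounts.wait(cur, std::memory_order_acquire);  // C++20
        cur = refCounts.load(std::memory_order_acquire);
    }
}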

@vlntb mentioned this pull request Sep 30, 2024
@vlntb (Collaborator) commented Sep 30, 2024

All subsequent discussions, fixes and changes will continue in the new PR: #5152
The current PR will be closed later.
