-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lock-free ClockCache #10390
Lock-free ClockCache #10390
Conversation
new_entry->InternalToExclusiveRef(); | ||
Assign(new_entry, h); | ||
return new_entry; | ||
RemoveAll(h->key(), h->hash, probe, deleted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we do this outside of holding the exclusive ref?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may be right. I had a bad case in mind: two or more concurrent inserts (of the same key) annihilating each other. But perhaps this is not possible: the smallest insert in the probe sequence (the one whose call to FindAvailableSlot lands in the smallest probe) that is not deleted during the other concurrent FindAvailableSlot calls will survive, since RemoveAll only deletes larger items in the probe sequence.
cache/clock_cache.h
Outdated
@@ -257,7 +246,15 @@ struct ClockHandle { | |||
key_data.fill(0); | |||
} | |||
|
|||
ClockHandle(const ClockHandle& other) { *this = other; } | |||
// The copy constructor is only used to copy a handle for immediate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this confusing / troublesome, especially because now the copy ctor and assignment operator disagree. This is unusual if not undefined behavior depending on which the compiler chooses (e.g. optional copy elision).
How about a base class of ClockHandle that includes only these fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a mistake---they should agree. Apparently the push_back function in autovector requires both of them, but I don't understand why is the copy ctor needed. I'll investigate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the copy ctor is required to push_back const elements. We're not using this feature, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks pretty complicated. At this point, I have aa few small comments and some questions. Will try to review in more detail tomorrow.
cache/clock_cache.cc
Outdated
dst->SetCachePriority(src->GetCachePriority()); | ||
// dst->SetClockPriority(ClockHandle::ClockPriority::NONE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I will delete it.
cache/clock_cache.h
Outdated
// allows us to save many atomic operations by packing data more carefully. | ||
// In particular: | ||
// references are functionally equivalent to RW locks (external and internal | ||
// references are read locks, and exclusive references are write locks). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, how is the distinction between internal and external refs useful? Are there any operations that are not allowed when there are external refs, but allowed if there are only internal refs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 2 reasons:
- Internal references are short lived, but external references may not. This is helpful when acquiring an exclusive ref: if there is an external ref to the item, it's probably not worth spinning, so we move on.
- In Release, when deciding whether the reference was the last one, transient references to the handle (i.e., other threads probing the slot) don't get mixed up in the ref count. (This is unlikely to happen, though, unless there are too many concurrent operations.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will make this a comment.
constexpr double kStrictLoadFactor = 0.7; | ||
|
||
// Maximum number of spins when trying to acquire a ref. | ||
constexpr uint32_t kSpinsPerTry = 100000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This choice was arbitrary. I was tested a few values and this one performed okay. I'm adding a TODO to investigate more.
cache/clock_cache.cc
Outdated
bool is_high_priority = | ||
h->HasHit() || h->GetCachePriority() == Cache::Priority::HIGH; | ||
h->SetClockPriority(static_cast<ClockHandle::ClockPriority>( | ||
is_high_priority * ClockHandle::ClockPriority::HIGH + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not is_high_priority ? ClockHandle::ClockPriority::HIGH : ClockHandle::ClockPriority::MEDIUM
? Might be a little easier to reason about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way to avoid a conditional. But I guess I should trust the compiler more?
EXCLUSIVE_REF | will_be_deleted)) { | ||
if (expected & EXTERNAL_REFS) { | ||
EXCLUSIVE_REF | will_be_deleted) && | ||
spins--) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe pause after each attempt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Some questions and comments inline. One larger question - to verify the correctness of the lock free implementation, is db_stress sufficient or do you think we need a dedicated clock_cache stress test? I'm wondering if db_stress can flush out all possible race conditions since each operation would do a lot more than cache access.
key, | ||
[&](ClockHandle* h) { | ||
if (h->TryInternalRef()) { | ||
if (h->IsElement() && h->Matches(key, hash)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it wasn't introduced in this PR, but IsElement()
is not very clear. Maybe something like IsValid()
would have been more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A handle can be 3 different things: a KV element, a tombstone or an empty slot. Tombstones and empty slots are not really invalid.
Maybe IsOccupied()
? I don't love this term either because I think of tombstones as occupying slots too.
} | ||
return false; | ||
}, | ||
[&](ClockHandle* h) { return h->displacements == 0; }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to come up with a intuitively named function, instead of h->displacements == 0
? I don't have any good suggestions though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NothingAfterThisProbe()
?
cache/clock_cache.h
Outdated
// These properties induce 4 different states, with transitions defined as | ||
// follows: | ||
// - Not M --> M: When a handle is deleted or replaced by a new version, but | ||
// not immediately evicted. | ||
// - M --> not M: This cannot happen. Once a handle is marked for deletion, | ||
// there is no can't go back. | ||
// there is no way back. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possible, it'd be good to document the transitions in more detail, including the triggers for transition. For example, what would cause a handle to be marked for deletion but not immediately evicted, transition from internal to external ref or vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will work on improving this part of the documentation.
cache/clock_cache.h
Outdated
// ------------------------------------ | ||
// | ||
// We separate data members that are updated frequently from the ones that | ||
// are not frequently updated so that they don't share the same cache line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is cacheline separation being ensured? Also, shouldn't each frequently modified data member be on a separate cacheline? And array_ is not frequently modified I believe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I ported this from LRUCache without giving it much thought, so thank you for asking this question. I think we want to avoid reads on some field X to trigger the cache coherence mechanism only because it shares the cache line with other modified field Y, not because the operation reading X actually needs to use Y. In our case: Lookup uses array_; Release and Erase use array_, occupancy_ and usage_; and Insert uses array_, occupancy_, usage_ and clock_pointer_. So I think we want 3 different cache lines: array_; occupancy_ and usage_; and clock_pointer_.
As to how is cache line separation ensured: I think we should be aligning the fields.
cache/clock_cache.cc
Outdated
|
||
if (h->TryExclusiveRef()) { | ||
if (h->WillBeDeleted()) { | ||
Remove(h, deleted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove()
will update charge_
and occupancy_
. If multiple insertions are happening in parallel, it could result in those cache lines thrashing. Would it be better to evict as many as needed and update in one shot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'll implement this. clock_pointer_
is also contended among all inserts doing eviction, but it's trickier to fix, because with variable-size blocks we don't know right off the bat how many elements we will need to evict.
An alternative approach is to have a single thread on every shard periodically run the eviction algorithm. This could be future optimization work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed all of this, including the clock_pointer_
thing.
I don't know the details of db_stress, but I believe it only exercises steady state. This particular steady state may not be really risky for our lock-free algorithm. For example, I would like to stress it with multiple concurrent lookup/insert/delete on the same key. I don't know if this is tested in db_stress. Also, it'd be good to have more unit tests for adversarial cases that are not exercised in the current battery of tests, which are mostly single thread. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. I'll leave it up to you to implement further optimizations either in this PR or a follow-on PR.
@guidotag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@guidotag has updated the pull request. You must reimport the pull request before landing. |
@guidotag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: ClockCache completely free of locks. As part of this PR we have also pushed clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.
Test plan:
make -j24 check
make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush