Lock-free ClockCache #10390

Closed (wants to merge 25 commits)
Conversation

@guidotag (Contributor) commented Jul 19, 2022

Summary: ClockCache completely free of locks. As part of this PR we have also pushed clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure.

Test plan:

  • make -j24 check
  • make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush

Guido Tagliavini Ponce added 2 commits July 18, 2022 23:55
new_entry->InternalToExclusiveRef();
Assign(new_entry, h);
return new_entry;
RemoveAll(h->key(), h->hash, probe, deleted);
Contributor:

Can't we do this outside of holding the exclusive ref?

@guidotag (Contributor, Author) Jul 20, 2022:

You may be right. I had a bad case in mind: two or more concurrent inserts (of the same key) annihilating each other. But perhaps this is not possible: the smallest insert in the probe sequence (the one whose call to FindAvailableSlot lands in the smallest probe) that is not deleted during the other concurrent FindAvailableSlot calls will survive, since RemoveAll only deletes larger items in the probe sequence.

@guidotag requested a review from anand1976 on July 20, 2022 15:58
@@ -257,7 +246,15 @@ struct ClockHandle {
key_data.fill(0);
}

ClockHandle(const ClockHandle& other) { *this = other; }
// The copy constructor is only used to copy a handle for immediate
Contributor:

I find this confusing and troublesome, especially because now the copy ctor and assignment operator disagree. This is unusual, if not undefined, behavior depending on which one the compiler chooses (e.g., optional copy elision).

How about a base class of ClockHandle that includes only these fields?

@guidotag (Contributor, Author):

This was a mistake; they should agree. Apparently the push_back function in autovector requires both of them, but I don't understand why the copy ctor is needed. I'll investigate.

@guidotag (Contributor, Author):

So, the copy ctor is required to push_back const elements. We're not using this feature, though.

@anand1976 (Contributor) left a comment:

The PR looks pretty complicated. At this point, I have a few small comments and some questions. Will try to review in more detail tomorrow.

dst->SetCachePriority(src->GetCachePriority());
// dst->SetClockPriority(ClockHandle::ClockPriority::NONE);
Contributor:

Not needed?

@guidotag (Contributor, Author):

No, I will delete it.

// allows us to save many atomic operations by packing data more carefully.
// In particular:
// references are functionally equivalent to RW locks (external and internal
// references are read locks, and exclusive references are write locks).
Contributor:

Just curious, how is the distinction between internal and external refs useful? Are there any operations that are not allowed when there are external refs, but allowed if there are only internal refs?

@guidotag (Contributor, Author):

For two reasons (see the sketch after this list):

  1. Internal references are short-lived, but external references may not be. This is helpful when acquiring an exclusive ref: if there is an external ref to the item, it's probably not worth spinning, so we move on.
  2. In Release, when deciding whether the reference was the last one, transient references to the handle (i.e., other threads probing the slot) don't get mixed up in the ref count. (This is unlikely to matter, though, unless there are very many concurrent operations.)
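A minimal sketch of the packed reference word this scheme implies. The bit layout, field widths, and helper names below are illustrative assumptions, not the PR's actual constants:

```cpp
#include <atomic>
#include <cstdint>

struct RefWord {
  // Hypothetical layout: low bits count external (long-lived) read refs,
  // middle bits count internal (transient) read refs, and one bit is the
  // exclusive (write) lock. External refs would use kExternalInc analogously.
  static constexpr uint64_t kExternalInc = 1ull << 0;
  static constexpr uint64_t kInternalInc = 1ull << 20;
  static constexpr uint64_t kExclusive = 1ull << 40;

  std::atomic<uint64_t> refs{0};

  // A transient read ref succeeds unless a writer holds the exclusive bit.
  bool TryInternalRef() {
    uint64_t old = refs.fetch_add(kInternalInc, std::memory_order_acquire);
    if (old & kExclusive) {
      // Lost the race to a writer; undo the optimistic increment.
      refs.fetch_sub(kInternalInc, std::memory_order_release);
      return false;
    }
    return true;
  }

  // The write lock requires that no reader of either kind exists.
  bool TryExclusiveRef() {
    uint64_t expected = 0;
    return refs.compare_exchange_strong(expected, kExclusive,
                                        std::memory_order_acquire);
  }
};
```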

@guidotag (Contributor, Author):

I will turn this explanation into a code comment.

constexpr double kStrictLoadFactor = 0.7;

// Maximum number of spins when trying to acquire a ref.
constexpr uint32_t kSpinsPerTry = 100000;
Contributor:

What's the rationale behind this value?

@guidotag (Contributor, Author):

This choice was arbitrary. I tested a few values and this one performed okay. I'm adding a TODO to investigate more.

bool is_high_priority =
h->HasHit() || h->GetCachePriority() == Cache::Priority::HIGH;
h->SetClockPriority(static_cast<ClockHandle::ClockPriority>(
is_high_priority * ClockHandle::ClockPriority::HIGH +
Contributor:

Why not is_high_priority ? ClockHandle::ClockPriority::HIGH : ClockHandle::ClockPriority::MEDIUM? Might be a little easier to reason about.

@guidotag (Contributor, Author):

I did it this way to avoid a conditional. But I guess I should trust the compiler more?
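For illustration, the two forms side by side (the enum values are assumptions; in practice a modern compiler usually lowers the ternary to a branchless conditional move anyway):

```cpp
#include <cstdint>

enum class ClockPriority : uint8_t { NONE = 0, MEDIUM = 1, HIGH = 2 };

// Arithmetic form: selects between the two priorities without a branch.
ClockPriority ArithmeticForm(bool is_high_priority) {
  return static_cast<ClockPriority>(
      static_cast<int>(is_high_priority) *
          static_cast<int>(ClockPriority::HIGH) +
      static_cast<int>(!is_high_priority) *
          static_cast<int>(ClockPriority::MEDIUM));
}

// Ternary form: equivalent result, easier to reason about.
ClockPriority TernaryForm(bool is_high_priority) {
  return is_high_priority ? ClockPriority::HIGH : ClockPriority::MEDIUM;
}
```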

EXCLUSIVE_REF | will_be_deleted)) {
if (expected & EXTERNAL_REFS) {
EXCLUSIVE_REF | will_be_deleted) &&
spins--) {
Contributor:

Maybe pause after each attempt?

@guidotag (Contributor, Author):

I can try this.
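A sketch of the suggested pause, under the assumption that the loop is a bounded CAS retry (helper names and the exact loop shape are illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

inline void CpuPause() {
#if defined(__x86_64__)
  __builtin_ia32_pause();  // x86 PAUSE: cheap hint for spin-wait loops
#else
  std::this_thread::yield();  // portable fallback
#endif
}

// Spin a bounded number of times trying to take the exclusive bit,
// pausing between attempts so the current ref holder can make progress.
bool TryExclusiveRefWithSpin(std::atomic<uint64_t>& refs,
                             uint64_t exclusive_bit, uint32_t spins) {
  while (spins--) {
    uint64_t expected = 0;
    if (refs.compare_exchange_weak(expected, exclusive_bit,
                                   std::memory_order_acquire)) {
      return true;
    }
    CpuPause();
  }
  return false;
}
```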

@anand1976 (Contributor) left a comment:

Overall LGTM. Some questions and comments inline. One larger question - to verify the correctness of the lock free implementation, is db_stress sufficient or do you think we need a dedicated clock_cache stress test? I'm wondering if db_stress can flush out all possible race conditions since each operation would do a lot more than cache access.

key,
[&](ClockHandle* h) {
if (h->TryInternalRef()) {
if (h->IsElement() && h->Matches(key, hash)) {
Contributor:

I know it wasn't introduced in this PR, but IsElement() is not very clear. Maybe something like IsValid() would have been more appropriate.

@guidotag (Contributor, Author):

A handle can be 3 different things: a KV element, a tombstone or an empty slot. Tombstones and empty slots are not really invalid.

Maybe IsOccupied()? I don't love this term either because I think of tombstones as occupying slots too.

}
return false;
},
[&](ClockHandle* h) { return h->displacements == 0; },
Contributor:

Is it possible to come up with an intuitively named function, instead of h->displacements == 0? I don't have any good suggestions though.

@guidotag (Contributor, Author):

NothingAfterThisProbe()?

// These properties induce 4 different states, with transitions defined as
// follows:
// - Not M --> M: When a handle is deleted or replaced by a new version, but
// not immediately evicted.
// - M --> not M: This cannot happen. Once a handle is marked for deletion,
// there is no way back.
Contributor:

If possible, it'd be good to document the transitions in more detail, including their triggers. For example, what would cause a handle to be marked for deletion but not immediately evicted, or a transition from internal to external ref or vice versa?

@guidotag (Contributor, Author):

I will work on improving this part of the documentation.

// ------------------------------------
//
// We separate data members that are updated frequently from the ones that
// are not frequently updated so that they don't share the same cache line
Contributor:

How is cacheline separation being ensured? Also, shouldn't each frequently modified data member be on a separate cacheline? And array_ is not frequently modified I believe?

@guidotag (Contributor, Author) Jul 23, 2022:

To be honest, I ported this from LRUCache without giving it much thought, so thank you for asking this question. I think we want to avoid a read of some field X triggering the cache coherence mechanism merely because X shares a cache line with some other modified field Y, even though the operation reading X doesn't actually use Y. In our case: Lookup uses array_; Release and Erase use array_, occupancy_ and usage_; and Insert uses array_, occupancy_, usage_ and clock_pointer_. So I think we want 3 different cache lines: array_; occupancy_ and usage_; and clock_pointer_.

As to how cache line separation is ensured: I think we should be aligning the fields.
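A minimal sketch of that alignment, assuming a 64-byte cache line (member types are placeholders; the grouping is the one proposed in this thread, not necessarily what landed):

```cpp
#include <atomic>
#include <cstddef>

struct ClockHandleTableLayout {
  static constexpr size_t kCacheLine = 64;  // assumption; platform-dependent

  // Read-mostly: the slot array pointer gets its own line so frequent
  // counter updates don't invalidate it for readers.
  alignas(kCacheLine) void* array_ = nullptr;

  // Updated together by Insert/Release/Erase, so they share a line.
  alignas(kCacheLine) std::atomic<size_t> occupancy_{0};
  std::atomic<size_t> usage_{0};

  // Contended only by inserts doing eviction; isolated on its own line.
  alignas(kCacheLine) std::atomic<size_t> clock_pointer_{0};
};
```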


if (h->TryExclusiveRef()) {
if (h->WillBeDeleted()) {
Remove(h, deleted);
Contributor:

Remove() will update charge_ and occupancy_. If multiple insertions are happening in parallel, it could result in those cache lines thrashing. Would it be better to evict as many as needed and update in one shot?

@guidotag (Contributor, Author) Jul 23, 2022:

Yes. I'll implement this. clock_pointer_ is also contended among all inserts doing eviction, but it's trickier to fix, because with variable-size blocks we don't know right off the bat how many elements we will need to evict.

An alternative approach is to have a single thread on every shard periodically run the eviction algorithm. This could be future optimization work.

@guidotag (Contributor, Author):

I fixed all of this, including the clock_pointer_ thing.
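A sketch of the batching idea from this exchange, with illustrative names: accumulate per-eviction deltas locally during the scan, then publish each shared counter once.

```cpp
#include <atomic>
#include <cstddef>

struct EvictStats {
  size_t freed_charge = 0;  // sum of charges of evicted handles
  size_t freed_count = 0;   // number of evicted handles
};

// One atomic update per counter, instead of one per evicted element,
// which avoids thrashing the counters' cache line under parallel inserts.
void PublishEvictions(const EvictStats& stats, std::atomic<size_t>& usage,
                      std::atomic<size_t>& occupancy) {
  usage.fetch_sub(stats.freed_charge, std::memory_order_relaxed);
  occupancy.fetch_sub(stats.freed_count, std::memory_order_relaxed);
}
```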

@guidotag (Contributor, Author) commented Jul 23, 2022:

> Overall LGTM. Some questions and comments inline. One larger question - to verify the correctness of the lock free implementation, is db_stress sufficient or do you think we need a dedicated clock_cache stress test? I'm wondering if db_stress can flush out all possible race conditions since each operation would do a lot more than cache access.

I don't know the details of db_stress, but I believe it only exercises steady state, and that steady state may not be particularly risky for our lock-free algorithm. For example, I would like to stress it with multiple concurrent lookups, inserts and deletes on the same key. I don't know if db_stress tests this.

Also, it'd be good to have more unit tests for adversarial cases that are not exercised in the current battery of tests, which are mostly single-threaded.
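A sketch of the kind of adversarial test described above, with many threads hammering the same key. The calls follow the public RocksDB Cache API of this era (NewClockCache, Insert with a deleter, Lookup, Release, Erase); treat the exact signatures, thread count, and iteration count as assumptions to check against the tree.

```cpp
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include "rocksdb/cache.h"

void SameKeyStress() {
  std::shared_ptr<rocksdb::Cache> cache =
      rocksdb::NewClockCache(1 << 20 /* capacity */);
  const std::string key = "hot_key";
  static char dummy_value;
  // Captureless lambda converts to the function-pointer deleter type.
  auto deleter = [](const rocksdb::Slice& /*k*/, void* /*v*/) {};

  std::vector<std::thread> threads;
  for (int t = 0; t < 16; ++t) {
    threads.emplace_back([&] {
      for (int i = 0; i < 100000; ++i) {
        // Insert, look up, and erase the same key from every thread to
        // exercise the concurrent lookup/insert/delete races discussed above.
        cache->Insert(key, &dummy_value, /*charge=*/1, deleter)
            .PermitUncheckedError();
        if (rocksdb::Cache::Handle* h = cache->Lookup(key)) {
          cache->Release(h);
        }
        cache->Erase(key);
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
}
```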

@anand1976 (Contributor) left a comment:

SGTM. I'll leave it up to you to implement further optimizations either in this PR or a follow-on PR.

@facebook-github-bot (Contributor):
@guidotag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):
@guidotag has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot (Contributor):
@guidotag has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
