
Experiment: Improve concurrent merge performance by weakly owning branch updates #8268

Merged
merged 18 commits into master from feature/single-hinted-merger
Oct 28, 2024

Conversation

arielshaqed
Contributor

What

When enabled, this feature improves performance of concurrent merges.

No seriously, what?

lakeFS retains branch heads on KV. An operation that updates the branch head - typically a commit or a merge - performs a 3-phase commit protocol:

  • Get the branch head from KV
  • Create the metarange on block storage based on that branch head
  • Create a new commit with that metarange whose parent is that branch head, IF the branch head has not moved

When the third phase fails, the operation fails. We do retry automatically on the server, but obviously under load it will still fail. Indeed we see this happen to some users who have heavy merge loads. In this case each retry is actually more expensive, because the gap between the source commit and the destination branch tends to grow with every successive commit to the destination. Failure is pretty much guaranteed once sustained load crosses some critical threshold.
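For illustration, here is a minimal, self-contained sketch of that optimistic update loop. The helper names (`getBranchHead`, `buildMetaRange`, `createCommit`, `casBranchHead`) are hypothetical stand-ins for the real graveler/KV calls, not the lakeFS API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrBranchMoved reports that the branch head changed between read and write.
var ErrBranchMoved = errors.New("branch head moved")

// Hypothetical helpers standing in for the real graveler/KV calls.
func getBranchHead(ctx context.Context, branch string) (string, error)           { return "c1", nil }
func buildMetaRange(ctx context.Context, head string) (string, error)            { return "mr1", nil }
func createCommit(ctx context.Context, metaRange, parent string) (string, error) { return "c2", nil }
func casBranchHead(ctx context.Context, branch, expected, newHead string) error  { return nil }

// updateBranch runs the 3-phase protocol with a bounded number of retries.
// Under sustained concurrency phase 3 keeps failing, and every retry is more
// expensive because the destination branch keeps advancing in the meantime.
func updateBranch(ctx context.Context, branch string, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		head, err := getBranchHead(ctx, branch) // phase 1: read the head from KV
		if err != nil {
			return err
		}
		metaRange, err := buildMetaRange(ctx, head) // phase 2: write the metarange to block storage
		if err != nil {
			return err
		}
		commit, err := createCommit(ctx, metaRange, head)
		if err != nil {
			return err
		}
		// phase 3: swing the head to the new commit only if it is still `head`
		err = casBranchHead(ctx, branch, head, commit)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrBranchMoved) {
			return err
		}
		// head moved: loop and redo the (increasingly expensive) work
	}
	return fmt.Errorf("update %s: %w after %d attempts", branch, ErrBranchMoved, maxRetries)
}

func main() {
	fmt.Println(updateBranch(context.Background(), "main", 3))
}
```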

How

This PR is an experiment that uses "weak ownership" to avoid having to retry. Broadly speaking, every branch update operation takes ownership of its branch, which excludes all other operations from updating that branch. This exclusion does not reduce throughput: without it only one of the concurrent branch operations could ever succeed anyway, and with it almost all branch operations succeed. In fact it increases performance, because it stops wasting resources on concurrent operations that are doomed to fail.

What about CAP?

Distributed locking is impossible by the CAP theorem. That's why we use weak ownership. This gives up on C (consistency) and some P (partition tolerance) in favour of keeping A (availability). Specifically, a thread takes ownership of a branch by asserting ownership as a record on KV. This ownership has an expiry time; the thread refreshes its ownership until it is done, at which point it deletes the record.

This ownership is weak: usually a single thread owns the branch, but sometimes 2 threads can end up owning the same branch! If the thread fails to refresh on time (due to a partition or slowness or poor clock sync, for instance) then another thread will also acquire ownership! When that happens we have concurrent branch updates. Because these are correct in lakeFS, we lose performance and may need to retry, but we never give incorrect answers.
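A self-contained sketch of that lease lifecycle, using a toy in-memory store rather than the real lakeFS KV API (all names here are hypothetical):

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// record is a simplified stand-in for the ownership record kept in KV.
type record struct {
	Owner   string
	Expires time.Time
}

// memStore is a toy in-memory store; the real implementation uses the lakeFS KV store.
type memStore struct {
	mu   sync.Mutex
	data map[string]record
}

// tryAcquire takes (or extends) ownership if the key is free, expired, or already ours.
func (s *memStore) tryAcquire(key, owner string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.data[key]
	if ok && cur.Owner != owner && time.Now().Before(cur.Expires) {
		return false // someone else holds an unexpired lease
	}
	s.data[key] = record{Owner: owner, Expires: time.Now().Add(ttl)}
	return true
}

// release deletes the record, but only if we still own it.
func (s *memStore) release(key, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.data[key]; ok && cur.Owner == owner {
		delete(s.data, key)
	}
}

// own acquires the lease and refreshes it in the background until released.
// If a refresh is late (partition, slow goroutine, clock skew), another caller
// may acquire the same key - that is the "weak" part.
func own(ctx context.Context, s *memStore, key, owner string, ttl time.Duration) (func(), bool) {
	if !s.tryAcquire(key, owner, ttl) {
		return nil, false
	}
	refreshCtx, cancel := context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(ttl / 2) // refresh well before the lease expires
		defer ticker.Stop()
		for {
			select {
			case <-refreshCtx.Done():
				return
			case <-ticker.C:
				s.tryAcquire(key, owner, ttl) // extend our own lease
			}
		}
	}()
	return func() {
		cancel()
		s.release(key, owner)
	}, true
}

func main() {
	store := &memStore{data: map[string]record{}}
	release, ok := own(context.Background(), store, "branch/main", "req-1", 200*time.Millisecond)
	fmt.Println("acquired:", ok)
	// ... perform the branch update while holding the lease ...
	release()
}
```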

Results

This PR also adds a lakectl abuse merge command to allow us to measure. b0cfca3 has some numbers when running locally: with sustained concurrency 50, we run faster and get 0 failures instead of 6.6% failures. More details there on why this is even better than it sounds. Graph available here, here's a spoiler:
[Graph: speedup when enabling weak ownership]

@arielshaqed arielshaqed added the area/cataloger, performance, and include-changelog labels on Oct 7, 2024

@yonipeleg33
Contributor

@arielshaqed looks like amazing work, thank you!
I haven't reviewed it yet, but a few general notes:

  1. Do we have an experiment plan for this feature? Which customers/users, for how long, etc.
  2. The term "weak" is a bit inaccurate here IMO:
    The term "weak" as defined in places like C++'s std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different from how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership.
    The term "lease-based ownership" suits better here IMO.

@arielshaqed
Contributor Author

I should probably mention that if we go in this direction, we can deploy this and then improve it. The next step is to add fairness: earlier merge attempts should have a better chance of being next. Otherwise, when there are multiple concurrent merges, some of the earlier merges will repeatedly lose and time out. If everything is fair then we expect fewer timeouts, and in general the tail latency will shorten.
One way to make things fairer is to manage an actual queue of tasks waiting to own the branch. Then either their thread gets priority to pick up the task (it is allowed to pick up tasks sooner after their expiry; for instance, task n in the queue waits until $t_{expiry} + n\cdot dt_{check}$), or the thread that owns the branch just processes all merges in the queue.
Anyway, we don't need that to run this experiment.
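For illustration, one possible staggering rule matching the formula above - this is only a sketch of the fairness idea, not code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// nextAttemptAt staggers waiters by queue position: waiter n may try to take
// over the branch only n check-intervals after the current lease expires, so
// earlier arrivals get earlier retry slots.
func nextAttemptAt(expiry time.Time, queuePos int, checkInterval time.Duration) time.Time {
	return expiry.Add(time.Duration(queuePos) * checkInterval)
}

func main() {
	expiry := time.Now()
	for n := 0; n < 3; n++ {
		fmt.Printf("waiter %d may retry at expiry+%v\n", n, nextAttemptAt(expiry, n, 100*time.Millisecond).Sub(expiry))
	}
}
```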

"now": now,
}).Trace("Try to take ownership")
if errors.Is(err, kv.ErrNotFound) {
err = kv.SetMsg(ctx, w.Store, weakOwnerPartition, prefixedKey, &ownership)
Contributor

This is wrong, no? multiple routines can succeed at the same time

Contributor Author

Ouch, yeah, this was supposed to be a SetIf with the "not present" predicate - and I forgot to find out the correct value for that predicate.
Thanks for catching this!
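For context, a toy illustration of why the conditional write matters here: with an unconditional set, every racing goroutine "succeeds" and believes it owns the branch, while a set-if-absent lets exactly one win. This uses a hypothetical in-memory store, not the lakeFS kv API:

```go
package main

import (
	"fmt"
	"sync"
)

// toyKV mimics a KV store with a conditional put; it is NOT the lakeFS kv API.
type toyKV struct {
	mu   sync.Mutex
	data map[string]string
}

// setIfAbsent succeeds for exactly one of any set of racing goroutines,
// which is what taking ownership needs; a plain unconditional set would let
// every racer "succeed".
func (s *toyKV) setIfAbsent(key, val string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; exists {
		return false
	}
	s.data[key] = val
	return true
}

func main() {
	store := &toyKV{data: map[string]string{}}
	var wg sync.WaitGroup
	winners := make(chan string, 2)
	for _, owner := range []string{"routine-A", "routine-B"} {
		wg.Add(1)
		go func(owner string) {
			defer wg.Done()
			if store.setIfAbsent("branch/main", owner) {
				winners <- owner
			}
		}(owner)
	}
	wg.Wait()
	close(winners)
	for w := range winners {
		fmt.Println("owner:", w) // exactly one line prints
	}
}
```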

@arielshaqed arielshaqed added the minor-change label on Oct 8, 2024
Contributor Author

@arielshaqed arielshaqed left a comment

Thanks, that one's embarrassing.

"now": now,
}).Trace("Try to take ownership")
if errors.Is(err, kv.ErrNotFound) {
err = kv.SetMsg(ctx, w.Store, weakOwnerPartition, prefixedKey, &ownership)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, yeah, this was supposed to be a SetIf with the "not present" predicate - and I forgot to find out the correct value for that predicate.
Thanks for catching this!

Weak ownership is a best-effort lock that occasionally fails.  This one
can fail when a goroutine is delayed for a long interval.

This is fine if the calling code relies on ownership for performance but not
for correctness.  E.g. merges and commits.
This includes merges.  Only one concurrent `BranchUpdate` operation can
succeed, so unless many long-lived such operations can fail, there is little
point in running multiple concurrent updates.
Performs multiple small merges in parallel.
This shows results, even on local!

When I run lakefs (by default weak ownership is OFF) I get 6.6% errors with
concurrency 50.  Rate is <50/s.  Also the long tail is _extremely_ long.

When I switch weak ownership ON, using the default parameters, I get **0**
errors with concurrency 50.  Rate is about the same, except that the
tail (when load drops) is _short_.

See the difference [here][merge-abuse-speed-chart]: it's faster _and_
returns 0 errors.  The distribution of actual successful merge times is
somewhat slower - possibly because of the time to lock, possibly because of
the fact that errors in the really slow cases cause those slow cases to be
dropped.

Finally, note that because we do not queue, some merges take a *long* time
under sustained load.  We could improve weak ownership to hold an actual
queue of work.  This would make merges _fair_: merges will occur roughly in
order of request arrival.

==== Weak ownership OFF ====

```sh
❯ go run ./cmd/lakectl abuse merge --amount 1000 --parallelism 50 lakefs://abuse/main
Source branch: lakefs://abuse/main
merge - completed: 34, errors: 0, current rate: 33.81 done/second
merge - completed: 80, errors: 0, current rate: 45.98 done/second
merge - completed: 128, errors: 0, current rate: 48.02 done/second
merge - completed: 177, errors: 0, current rate: 49.03 done/second
merge - completed: 222, errors: 0, current rate: 44.97 done/second
merge - completed: 265, errors: 3, current rate: 43.03 done/second
merge - completed: 308, errors: 9, current rate: 42.97 done/second
merge - completed: 357, errors: 15, current rate: 49.01 done/second
merge - completed: 406, errors: 21, current rate: 49.03 done/second
merge - completed: 451, errors: 22, current rate: 44.97 done/second
merge - completed: 499, errors: 29, current rate: 48.01 done/second
merge - completed: 545, errors: 30, current rate: 46.01 done/second
merge - completed: 585, errors: 31, current rate: 39.97 done/second
merge - completed: 632, errors: 33, current rate: 47.04 done/second
merge - completed: 679, errors: 37, current rate: 47.00 done/second
merge - completed: 728, errors: 46, current rate: 48.96 done/second
merge - completed: 768, errors: 49, current rate: 40.04 done/second
merge - completed: 808, errors: 53, current rate: 39.98 done/second
merge - completed: 854, errors: 57, current rate: 45.99 done/second
merge - completed: 891, errors: 58, current rate: 37.00 done/second
merge - completed: 935, errors: 64, current rate: 44.00 done/second
merge - completed: 972, errors: 66, current rate: 36.98 done/second
merge - completed: 990, errors: 66, current rate: 18.00 done/second
merge - completed: 995, errors: 66, current rate: 5.00 done/second
merge - completed: 996, errors: 66, current rate: 1.00 done/second
merge - completed: 998, errors: 66, current rate: 2.00 done/second
merge - completed: 999, errors: 66, current rate: 1.00 done/second
merge - completed: 999, errors: 66, current rate: 0.00 done/second
merge - completed: 999, errors: 66, current rate: 0.00 done/second
completed: 1000, errors: 66, current rate: 5.27 done/second

Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      0
75      601
100     671
250     672
350     672
500     696
750     740
1000    765
5000    896
min     54
max     12022
total   934
```

==== Weak ownership ON ====

```sh
❯ go run ./cmd/lakectl abuse merge --amount 1000 --parallelism 50 lakefs://abuse/main
Source branch: lakefs://abuse/main
merge - completed: 36, errors: 0, current rate: 35.23 done/second
merge - completed: 86, errors: 0, current rate: 49.98 done/second
merge - completed: 136, errors: 0, current rate: 50.03 done/second
merge - completed: 185, errors: 0, current rate: 48.99 done/second
merge - completed: 236, errors: 0, current rate: 51.02 done/second
merge - completed: 286, errors: 0, current rate: 49.99 done/second
merge - completed: 337, errors: 0, current rate: 50.97 done/second
merge - completed: 390, errors: 0, current rate: 53.03 done/second
merge - completed: 438, errors: 0, current rate: 48.01 done/second
merge - completed: 487, errors: 0, current rate: 49.00 done/second
merge - completed: 534, errors: 0, current rate: 46.98 done/second
merge - completed: 581, errors: 0, current rate: 46.99 done/second
merge - completed: 632, errors: 0, current rate: 51.00 done/second
merge - completed: 680, errors: 0, current rate: 48.04 done/second
merge - completed: 725, errors: 0, current rate: 44.98 done/second
merge - completed: 771, errors: 0, current rate: 45.99 done/second
merge - completed: 815, errors: 0, current rate: 44.02 done/second
merge - completed: 861, errors: 0, current rate: 46.01 done/second
merge - completed: 905, errors: 0, current rate: 43.98 done/second
merge - completed: 947, errors: 0, current rate: 42.00 done/second
merge - completed: 977, errors: 0, current rate: 30.01 done/second
merge - completed: 997, errors: 0, current rate: 19.99 done/second
completed: 1000, errors: 0, current rate: 4.92 done/second

Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      0
75      457
100     464
250     468
350     468
500     642
750     647
1000    729
5000    952
min     54
max     13744
total   1000
```
- Add some jitter when acquiring ownership on a branch
- Refresh _for_ refresh interval, twice _every_ refresh interval
- nolint unjustified warnings
@arielshaqed
Contributor Author

> @arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:

Thanks!

> 1. Do we have an experiment plan for this feature? Which customers/users, for how long, etc.

Yes. Firstly, this PR adds a lakectl abuse merge command to measure. After we pull I will deploy to Cloud staging, switch it on for a single private installation, and measure. Also as you probably know, we have a user for whom I plan this. I will communicate with them to measure on their system; obviously timing for that depends on the customer timing and how comfortable they feel switching it on.

> 2. The term "weak" is a bit inaccurate here IMO:
>     The term "weak" as defined in places like C++'s std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different from how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership.
>     The term "lease-based ownership" suits better here IMO.

In the C++ and Rust standard libraries, "weak" in this case refers to the pointer. (In other cases in the C++ standard library, "weak" can refer to any variant which is logically weaker; for instance we also have class weak_ordering.) In this PR "weak" is supposed to weaken "ownership".

What we have in this PR indeed uses a lease-based implementation. But it's not ownership, it just behaves like ownership in almost all reasonable cases.

A better term might be "mostly correct ownership"¹. In fact, I believe that such an unwieldy name would be a good choice! If you agree I will change to use it.

Footnotes

  1. The technical term "mostly" here is intended in the same sense that Miracle Max uses it.

Contributor

@itaiad200 itaiad200 left a comment

Reviewed everything except for the tests. Some really good things here!

//
// So it *cannot* guarantee correctness. However it usually works, and if
// it does work, the owning goroutine wins all races by default.
type WeakOwner struct {
Contributor

It seems to me that this shouldn't be in kv pkg. It deserves a standalone ownership/owner/ anything else package. The kv is an implementation detail of the ownership, so this struct should be called KVWeakOwner

Contributor Author

Sure. Let's do this after we agree on a name? @yonipeleg33 requested a name change, I'd rather do a single name change towards the end of the review.

@@ -3031,6 +3031,25 @@ lakectl abuse list <source ref URI> [flags]



### lakectl abuse merge

Merge nonconflicting objects to the source branch in parallel
Contributor

Suggested change
Merge nonconflicting objects to the source branch in parallel
Merge non-conflicting objects to the source branch in parallel

Contributor Author

Will auto-generate this :-)

pkg/graveler/ref/manager.go (resolved thread)
requestID = *requestIDPtr
}
if m.branchOwnership != nil {
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
Contributor

Suggested change
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
ownRelease, err := m.branchOwnership.Own(ctx, requestID, string(branchID))

or

Suggested change
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
ownClose, err := m.branchOwnership.Own(ctx, requestID, string(branchID))

Contributor Author

Nice! Changed to release(), which I hope is even nicer.

Comment on lines 17 to 18
//nolint:stylecheck
var finished = errors.New("finished")
Contributor

Fix linting

Contributor Author

This was a sentinel, not an actual error. See for instance io.EOF. It is used only in sleepFor to indicate that the context expired with no error. But it turns out that there are special cases in context.WithCancelCause, so we don't need it.

Removed, thanks!

// Do NOT attempt to delete, to avoid
// knock-on effects destroying the new
// owner.
break
Contributor

Shouldn't you check the error and exit the function?

Contributor Author

I reported the error, and I do not think there is any useful action to take.

Aborting the action here, in particular, will probably livelock in some situations!

This goroutine has nothing waiting for it, and there really is nothing much to do with broken ownership. "All" that happens is that performance will suffer! This goroutine for the former owner thinks that another action stole its lease. So it makes sense to continue - it is likely to win the race to update, because it has already performed some work. Meanwhile the other action thinks that the former owner has stopped and will never finish - it should continue to avoid zombielock (a name I just invented for livelock against a stuck thread).

Which one is right? Well, if this goroutine is still running then the other action will most likely suffer. If it isn't running, it cannot do anything useful anyway.

log.WithFields(logging.Fields{
"new_owner": ownership.Owner,
}).Info("Lost ownership race")
break
Contributor

Add a comment explaining why the code continues to try and own the key. Assuming that it should, I worry that, across multiple routines, this may have a cascading effect of KV calls.

Contributor Author

Adding a comment to explain. But note that "all" that happens is that we fall back to the previous behaviour. Frankly I believe that the common cause for A losing a race to B will be that the clock on B is a bit too fast; this does not cause chains of lost races. The other common cause for A losing a race to B will be that A was stuck; again, that does not give it or any other thread an advantage against B. The final case is when all are badly stuck - for instance if you configure very short acquire and refresh intervals. In that case you do get what you would expect: a stuck system.


var abuseMergeCmd = &cobra.Command{
Use: "merge <branch URI>",
Short: "Merge nonconflicting objects to the source branch in parallel",
Contributor

Suggested change
Short: "Merge nonconflicting objects to the source branch in parallel",
Short: "Merge non-conflicting objects to the source branch in parallel",

Contributor Author

Done.

amount := Must(cmd.Flags().GetInt("amount"))
parallelism := Must(cmd.Flags().GetInt("parallelism"))

fmt.Println("Source branch:", u)
Contributor

Suggested change
fmt.Println("Source branch:", u)
fmt.Println("Source branch: ", u)

cmd/lakectl/cmd/abuse_merge.go (resolved thread)
Contributor Author

@arielshaqed arielshaqed left a comment

Thanks for the thorough review!

PTAL...



// branch IF a previous worker crashed while owning that branch. It
// has no effect when there are no crashes. Reducing it increases
// write load on the branch ownership record when concurrent
// operations occur. This.
Contributor Author

Done.

Comment on lines 440 to 444
requestIDPtr := httputil.RequestIDFromContext(ctx)
requestID := "<unknown>"
if requestIDPtr != nil {
requestID = *requestIDPtr
}
Contributor Author

Done. Either way will have strange performance characteristics, of course.

// owning key.
func (w *WeakOwner) Own(ctx context.Context, owner, key string) (func(), error) {
prefixedKey := []byte(fmt.Sprintf("%s/%s", w.Prefix, key))
err := w.startOwningKey(context.Background(), owner, prefixedKey)
Contributor Author

Confusion about how cancellation would work. Restored, thanks!

pkg/kv/util/weak_owner.go (outdated, resolved thread)
stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor Author

Not sure why not: this key is on my partition, with my prefix, and I set it starting at l. 245. It serves to indicate ownership, and its value holds only ownership. I think I can delete it.

message WeakOwnership {
string owner = 1;
google.protobuf.Timestamp expires = 2;
string comment = 3;
Contributor Author

Suppose a system is behaving really strangely. I can look at its ownership partition and determine which operation owns it -- that's what I put here when I took ownership. I call it "comment" rather than something else because I don't want to commit to any format of what exactly goes in there.

Contributor Author

Added!

This does not depend on KV implementation, so just add a matrix to one of
the AWS/S3 flavours.
Otherwise no way to tell Esti _really_ ran it with ownership.
@itaiad200
Contributor

@arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

Contributor

@itaiad200 itaiad200 left a comment

Review wip

Contributor

👍

Comment on lines 215 to 217
if errors.Is(err, kv.ErrNotFound) {
predicate = nil
}
Contributor

I really hope kv contract tests cover this

stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor

It holds ownership, but it's not guaranteed to be your routine's ownership (unless I'm missing something). Some other thread might successfully set itself as owner, and your cleanup code breaks it.

@arielshaqed
Contributor Author

> @arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

#8288 :-)

@arielshaqed
Contributor Author

> I really hope kv contract tests cover this

That line no longer occurs directly. We do still have it implied. I believe that it is tested by kvtest here and here.

If ownership has already been lost to another thread, do NOT delete
ownership when released.

- KV does not provide a DeleteIf operation.  Instead, use SetIf with an
  always-expired timestamp.
- Along the way, ensure "owner" string is truly unique by stringing a nanoid
  onto it.  Currently owner is the request ID, which should be unique - but
  adding randomness ensures it will always be unique regardless of the
  calling system.
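A toy illustration of the release path described in this commit message - overwrite with an already-expired lease instead of deleting, and only if we are still the recorded owner. The store and names here are hypothetical stand-ins, not the lakeFS KV API:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type lease struct {
	Owner   string
	Expires time.Time
}

// toyStore mimics the conditional-write behaviour used for release; it is not
// the lakeFS KV API, just an illustration of "SetIf instead of DeleteIf".
type toyStore struct {
	mu   sync.Mutex
	data map[string]lease
}

// releaseIf expires the ownership record only if we still own it. If another
// routine already took over, we leave its record alone (no knock-on damage).
func (s *toyStore) releaseIf(key, owner string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.data[key]
	if !ok || cur.Owner != owner {
		return false // lost the ownership race; nothing to release
	}
	// The KV store has SetIf but no DeleteIf, so "delete" by writing an expired lease.
	s.data[key] = lease{Owner: owner, Expires: time.Unix(0, 0)}
	return true
}

func main() {
	s := &toyStore{data: map[string]lease{
		"branch/main": {Owner: "req-1", Expires: time.Now().Add(time.Second)},
	}}
	fmt.Println("released by req-2:", s.releaseIf("branch/main", "req-2")) // false
	fmt.Println("released by req-1:", s.releaseIf("branch/main", "req-1")) // true
}
```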
Contributor Author

@arielshaqed arielshaqed left a comment

Thanks!

PTAaaAAaaAAL...

stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor Author

A DeleteIf operation sure would be nice here. Since we didn't spec one out, I'll replace it with a sentinel value instead.

@itaiad200
Contributor

Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch-changing operation, like commit & merge, has 2 BranchUpdate calls: one for changing the staging token and one for the actual operation. These 2 operations don't necessarily happen one right after the other.
The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

@arielshaqed
Contributor Author

arielshaqed commented Oct 14, 2024

> Sorry for the late feedback, I am reviewing it very cautiously.
>
> I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch-changing operation, like commit & merge, has 2 BranchUpdate calls: one for changing the staging token and one for the actual operation. These 2 operations don't necessarily happen one right after the other. The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

I'm not sure this matters. There is no connection between the two update operations: my commit can succeed even if you were the last to seal staging tokens. And of course sealing is cheap. So I would not expect this to affect performance.

Of course the easy way to be sure will be to perform the experiment.

Otherwise it doesn't even say how long it took - which is the most
interesting part for `abuse merge`.
@arielshaqed
Contributor Author

Testing on lakeFS Cloud (staging)

Setup

  • Cloud installation on staging (codename popular-mastodon 🐘).
    • Boosted it to 500 RCUs + 50 WCUs.
  • Replaced image with an experimental image built from the branch, tagged v1.38.0-15-g0ec60--hinted-merger.1.
  • Ran the lakectl abuse merge command that this PR provides.
  • Ran from remote (home); this does not matter much because we perform 1.5-2 merges/s.
  • In one case (* below) I noticed a gateway error so I reran to get better numbers. This was a test with the feature turned off.

Results

| Weak ownership | Parallelism | Results |
|---|---|---|
| Off | 3* | First error: merge from 92: [502 Bad Gateway]: request failed. 200 total with 15 failures in 2m0.220660011s. 1.663608 successes/s, 0.124771 failures/s. Gateway failure and 11 branches actually not created; re-ran. |
| Off | 3 | First error: merge from 88: [423 Locked]: Too many attempts, try again later request failed. 200 total with 1 failure in 2m11.462751261s. 1.521343 successes/s, 0.007607 failures/s. |
| Off | 6 | First error: merge from 2: [423 Locked]: Too many attempts, try again later request failed. 200 total with 19 failures in 1m49.366288753s. 1.828717 successes/s, 0.173728 failures/s. |
| Off | 20 (fastest! worst!) | First error: merge from 22: [423 Locked]: Too many attempts, try again later request failed. 200 total with 52 failures in 1m36.082933636s. 2.081535 successes/s, 0.541199 failures/s. |
| On | 3 | 200 total with 0 failures in 2m1.140085198s. 1.650981 successes/s, 0.000000 failures/s. |
| On | 6 | 200 total with 0 failures in 1m46.569144906s. 1.876716 successes/s, 0.000000 failures/s. |
| On | 20 | 200 total with 0 failures in 1m49.595498987s. 1.824892 successes/s, 0.000000 failures/s. |

Analysis

Without weak ownership, contention always caused failures. With concurrency 20, > 25% of merges failed. Each failure actually helps concurrent merges, so failures speed things up.

With weak ownership, we saw no errors. The merge rate improved slightly from 3 to 6, and then stayed flat.

Only report errors.  Obviously if not all branches deleted then we left a
mess, which is too bad.  But the performance test itself succeeded, which is
the (more) important thing.
Contributor

@itaiad200 itaiad200 left a comment

LGTM with few minor comments.
Thank you for your patience!

log.Info("Lost ownership race before trying to release")
return nil
}
ownership.Owner = fmt.Sprintf("[released] %s", ownership.Owner)
Contributor

I don't think we need to change this. Let's keep the Owner as is, you have the 0 Expires field and Comment if you need to add info. It's best to keep the same format if we can.

Contributor Author

Done.

"prefixed_key": string(prefixedKey),
"owner": owner,
"new_owner": ownership.Owner,
}).Info("Lost ownership race while trying to release")
Contributor

Suggested change
}).Info("Lost ownership race while trying to release")
}).Debug("Lost ownership race while trying to release")

Contributor Author

Again, not done because I like knowing about these races.

log = log.WithField("new_owner", ownership.Owner)

if ownership.Owner != owner {
log.Info("Lost ownership race before trying to release")
Contributor

Suggested change
log.Info("Lost ownership race before trying to release")
log.Debug("Lost ownership race before trying to release")

Contributor Author

Again, as rare as before, and even more interesting. Keeping it at info.

// write operation.
//
// - While ownership is held, add 2.5 read and 2.5 write operation
// per second, and an additional ~7 read and write operations
Contributor

I think

Suggested change
// per second, and an additional ~7 read and write operations
// per second, and an additional ~7 read operations

Contributor Author

Thanks! But there is also an additional write operation to actually acquire it. So I made the explanation even longer.

// - While ownership is held, add 2.5 read and 2.5 write operation
// per second, and an additional ~7 read and write operations
// per second per branch operation waiting to acquire ownership.

Contributor

I think we added some one-time costs with releaseIf

Contributor Author

Yup, added those!

if ownership.Owner != owner {
log.WithFields(logging.Fields{
"new_owner": ownership.Owner,
}).Info("Lost ownership race")
Contributor

Suggested change
}).Info("Lost ownership race")
}).Trace("Lost ownership race")

Contributor Author

Not done. Losing an ownership race should be rare. And when it does happen, it stops the loop.

Contributor

Don't forget to move this to a proper location, this is not a kv util :)

Contributor Author

Thanks for the reminder! Next commit :-)

Contributor Author

@arielshaqed arielshaqed left a comment

And now... move and rename!

@yonipeleg33 & @itaiad200 : I found no recognized term for this kind of technique. I will go with "mostly-correct ownership", based on this paper by Dr. M Max.


1. It's not a KV util, move it out into distributed.
1. "Weak" is overloaded, so instead call it "mostly correct" ownership.  It
   precisely describes what it is so it must be a good term of art for what
   we do here.
1. Use the term "branch approximate ownership" at a higher-level in the ref
   manager.  This context doesn't particularly mind the specific properties
   of ownership, and "approximate" is a good fit there.
@arielshaqed
Contributor Author

Thanks! Pulling... time flies by so quickly when you're having fun...

@arielshaqed arielshaqed enabled auto-merge (squash) October 28, 2024 10:56
@arielshaqed
Contributor Author

" Run latest lakeFS app on AWS S3 Expected — Waiting for status to be reported "

  • this is literally finished, today I am not a GitHub fan.

@arielshaqed arielshaqed merged commit ae4bc01 into master Oct 28, 2024
39 checks passed
@arielshaqed arielshaqed deleted the feature/single-hinted-merger branch October 28, 2024 11:22
Labels
  • area/cataloger - Improvements or additions to the cataloger
  • include-changelog - PR description should be included in next release changelog
  • minor-change - Used for PRs that don't require issue attached
  • performance
Development

Successfully merging this pull request may close these issues.

Experiment: Reduce contention on branch HEADs to improve merge performance
3 participants