Increase lock acquiring timeout #3423

Closed
fschade wants to merge 3 commits

Conversation

@fschade (Contributor) commented Nov 1, 2022

When acquiring a file lock, the code retries for a certain amount of time before it fails. In some cases the fixed waiting time can be too short.

For example, when uploading a whole bunch of files into the same folder, the etag propagation updates the destination folder many times, which fails in some cases.

This PR does not change the implementation; instead it adds some basic unit tests and increases the wait time to 180ms.
I'll create a follow-up PR which will use context.WithCancel/context.WithTimeout to detect the timeout... but first this needs to be cherry-picked back into experimental. cc @kobergj @wkloucek 🍒 to experimental!?
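For illustration, a minimal sketch of what that follow-up could look like, assuming a helper built directly on github.com/gofrs/flock; the function name acquireLock and the polling interval are illustrative, not existing reva code:

```go
// Sketch only: context-aware lock acquisition on top of github.com/gofrs/flock.
// acquireLock and pollInterval are illustrative names, not part of the reva code base.
package filelocks

import (
	"context"
	"time"

	"github.com/gofrs/flock"
)

// acquireLock polls TryLock until it succeeds or the context is done,
// so the caller decides the timeout via context.WithTimeout/WithCancel.
func acquireLock(ctx context.Context, path string, pollInterval time.Duration) (*flock.Flock, error) {
	fl := flock.New(path)
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for {
		ok, err := fl.TryLock()
		if err != nil {
			return nil, err
		}
		if ok {
			return fl, nil
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err() // timed out or cancelled while waiting for the lock
		case <-ticker.C:
			// retry on the next tick
		}
	}
}
```

A caller would then pick the budget explicitly, e.g. `ctx, cancel := context.WithTimeout(ctx, 180*time.Millisecond)`, instead of relying on a hard-coded iteration count.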

Before / after:

[Screenshot 2022-11-01 11:53:11 (before)]

[Screenshot 2022-11-01 11:56:45 (after)]

@fschade requested review from butonic, dragotin and micbar and removed the request for labkode, glpatcern and ishank011 on November 1, 2022 11:12
@dragotin (Contributor) commented Nov 1, 2022

This patch is only hiding symptoms. With the previous implementation we already wait 165 msec in total to acquire the lock, and I am unable to calculate how long that will be with your change. If that is not sufficient, I think we have a different problem.

@aduffeck tried to solve that with #3395 - are you sure you have that in your test installation?

I am OK having this in experimental, but for edge I am not OK without further investigation.

@dragotin (Contributor) left a review comment

Please do not merge to edge without further investigation with @aduffeck and me.

@butonic (Contributor) commented Nov 1, 2022

@dragotin @aduffeck Acquiring the lock starts to take too long when lots of small files are being uploaded by the desktop client. It will send up to 6(?) PUT requests (or the TUS equivalent) in parallel. Each request tries to propagate the etag synchronously. To do that, each request has to lock the parent, which takes longer the more requests happen in parallel.

Increasing the timeout allows syncing more small files without returning 500 errors to the client, which is what this PR tries to do for now.

The correct fix is to delay etag propagation for a small amount of time (e.g. 500ms) and extend the delay on subsequent propagation requests. That would aggregate the propagation calls for uploads into the same folder, ideally to 1 instead of n, reducing the number of write requests AND completely getting rid of multiple goroutines fighting over a lock on parent resources. But since this would make etag propagation async, @fschade and I considered it too fragile for GA. We could try to make it optional and only make it async when the client sends a Prefer: respond-async header. To be discussed. For now, I vote to merge this PR to get rid of user-visible 500 errors.
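To make the idea concrete, a rough sketch of such a debounced propagation; Debouncer and the propagate callback are hypothetical names, not existing reva types:

```go
// Sketch of debounced etag propagation: each request for the same node
// resets a timer, so a burst of uploads into one folder collapses into
// (ideally) a single propagation call. Debouncer and propagate are
// hypothetical names, not existing reva code.
package propagation

import (
	"sync"
	"time"
)

type Debouncer struct {
	mu        sync.Mutex
	timers    map[string]*time.Timer
	delay     time.Duration
	propagate func(nodeID string)
}

func NewDebouncer(delay time.Duration, propagate func(string)) *Debouncer {
	return &Debouncer{
		timers:    make(map[string]*time.Timer),
		delay:     delay,
		propagate: propagate,
	}
}

// Request schedules a propagation for nodeID and extends the delay if a
// propagation for the same node is already pending.
func (d *Debouncer) Request(nodeID string) {
	d.mu.Lock()
	defer d.mu.Unlock()

	if t, ok := d.timers[nodeID]; ok {
		t.Reset(d.delay) // a newer request extends the pending delay
		return
	}
	d.timers[nodeID] = time.AfterFunc(d.delay, func() {
		d.mu.Lock()
		delete(d.timers, nodeID)
		d.mu.Unlock()
		d.propagate(nodeID)
	})
}
```

Whether propagation may become asynchronous at all, or only behind a Prefer: respond-async header, is exactly the open question above.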

@fschade you could introduce a variable for the number of iterations to try to get a lock. Then you can set it to 1 or 0 for a negative test and expect it to fail with lots of concurrent goroutines trying to get a lock, similar to the other tests you added. That will test the failure case.
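A sketch of what that negative test could look like, assuming the retry count were exposed as a package-level variable (lockCycles below is hypothetical) and assuming helpers along the lines of AcquireWriteLock/ReleaseLock in the filelocks package:

```go
// Hypothetical negative test inside the filelocks package. lockCycles is
// an assumed knob for the number of retry iterations; AcquireWriteLock
// and ReleaseLock stand in for the helpers the existing tests exercise.
package filelocks

import (
	"os"
	"path/filepath"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestAcquireWriteLockFailsUnderContention(t *testing.T) {
	lockCycles = 1 // assumed variable: give up after a single attempt

	file := filepath.Join(t.TempDir(), "test.txt")
	if err := os.WriteFile(file, []byte("content"), 0o600); err != nil {
		t.Fatal(err)
	}

	var wg sync.WaitGroup
	var failures int64
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l, err := AcquireWriteLock(file)
			if err != nil {
				atomic.AddInt64(&failures, 1)
				return
			}
			time.Sleep(10 * time.Millisecond) // hold the lock for a moment
			_ = ReleaseLock(l)
		}()
	}
	wg.Wait()

	if atomic.LoadInt64(&failures) == 0 {
		t.Error("expected at least one goroutine to fail to acquire the lock")
	}
}
```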

@butonic marked this pull request as draft on November 1, 2022 13:29
@butonic (Contributor) commented Nov 1, 2022

Converted to draft to prevent accidental merging (by e.g. myself).

@fschade (Contributor, Author) commented Nov 1, 2022

@dragotin as said, this is the first iteration; I need this in experimental to get rid of all those 500 errors (they happen on master too).

Yes, this PR does not fix the problem but introduces tests as a starting point; the follow-up PR then introduces context.WithCancel as a next step (and configuration).

The final phase should then provide some kind of buffered xattr writer which takes care to only persist the latest attributes and skip the old ones. This also reduces the fs pressure… but this is a bigger change and for later.
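A very rough sketch of such a latest-wins buffer; XattrBuffer and setXattr are hypothetical names standing in for whatever low-level write would be used:

```go
// Sketch of a buffered xattr writer that only persists the most recent
// value per (node, attribute) pair and skips superseded ones. The names
// (XattrBuffer, setXattr) are hypothetical.
package xattrbuffer

import (
	"sync"
	"time"
)

type key struct{ node, attr string }

type XattrBuffer struct {
	mu       sync.Mutex
	pending  map[key]string
	setXattr func(node, attr, value string) error
}

func New(flushEvery time.Duration, setXattr func(node, attr, value string) error) *XattrBuffer {
	b := &XattrBuffer{
		pending:  make(map[key]string),
		setXattr: setXattr,
	}
	go func() {
		for range time.Tick(flushEvery) {
			b.Flush()
		}
	}()
	return b
}

// Set records the newest value; older unflushed values for the same
// attribute are simply overwritten and never hit the filesystem.
func (b *XattrBuffer) Set(node, attr, value string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[key{node, attr}] = value
}

// Flush writes the latest values and clears the buffer.
func (b *XattrBuffer) Flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = make(map[key]string)
	b.mu.Unlock()

	for k, v := range batch {
		_ = b.setXattr(k.node, k.attr, v) // error handling omitted in this sketch
	}
}
```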

Let’s discuss tomorrow how to proceed here… we should apply it at least for experimental to get rid of those client sync errors.

@fschade (Contributor, Author) commented Nov 1, 2022

The tests only cover the filelocks package, nothing else.

@labkode (Member) commented Nov 1, 2022

@fschade @dragotin @butonic in EOS the propagation of the etag can happen up to 5 seconds after an update, and that is okay given that the sync model is eventual consistency.

File contents can be updated without reflecting the new etag immediately; the important thing is to not lose the propagation of the etag (if the server crashes, etc.).

The lock should only be taken when updating the etag.

@aduffeck (Contributor) commented Nov 1, 2022

@fschade @dragotin @butonic right, etag propagation can eventually cause lock contention when a lot of uploads happen in parallel, but what I've seen as most problematic in master was locking the file for listing the xattrs (because it's not atomic and actually fails occasionally without it) and the huge number of those calls we currently do.

#3397 is supposed to eventually improve the latter, but #3395 improved the first problem and was already merged into edge, though not into experimental as far as I can see. @fschade it would be interesting to see whether that already fixes the problem for you; it had tremendous effects during my testing.

@fschade (Contributor, Author) commented Nov 1, 2022

@aduffeck I tested the latest ocis master with the latest edge reva linked locally… the problem still exists.

I'll give it another try tomorrow.

@@ -28,7 +28,13 @@ import (
 	"github.com/gofrs/flock"
 )

-var _localLocks sync.Map
+var (
+	localLocks sync.Map
Review comment (Contributor):

Please undo this change. Unexported global vars should be prefixed with _ so they can be easily identified.

-	for i := 1; i <= 10; i++ {
-		if flock = getMutexedFlock(n); flock != nil {
+	var lock *flock.Flock
+	for i := 1; i <= 60; i++ {
Review comment (Contributor):

For a permanent solution I would introduce exponential backoff behaviour instead of using time.Sleep.
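As an illustration of that suggestion, a sketch with a deadline and a doubling backoff instead of a fixed sleep; acquireWithBackoff, the cap and the deadline are placeholders, not existing reva code:

```go
// Sketch of lock acquisition with exponential backoff instead of a fixed
// time.Sleep; acquireWithBackoff, the cap and the deadline are placeholders.
package filelocks

import (
	"errors"
	"time"

	"github.com/gofrs/flock"
)

func acquireWithBackoff(path string, maxWait time.Duration) (*flock.Flock, error) {
	fl := flock.New(path)
	backoff := time.Millisecond
	deadline := time.Now().Add(maxWait)

	for time.Now().Before(deadline) {
		ok, err := fl.TryLock()
		if err != nil {
			return nil, err
		}
		if ok {
			return fl, nil
		}
		time.Sleep(backoff)
		backoff *= 2
		if backoff > 50*time.Millisecond {
			backoff = 50 * time.Millisecond // cap the sleep between retries
		}
	}
	return nil, errors.New("filelocks: could not acquire lock before deadline")
}
```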

@fschade (Contributor, Author) commented Nov 2, 2022

I'll close this here and open a separate PR for experimental.
