Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase lock acquiring timeout #3423

Closed
wants to merge 3 commits into from

Conversation

fschade
Copy link
Contributor

@fschade fschade commented Nov 1, 2022

When acquiring a file lock the code tries to wait for a certain amount of time before it fails. In Some cases the fixed waiting time can be too short.

For example when uploading a whole bunch of files into the same folder, the etag propagation updates the destination folder many times which fails in some cases.

This PR does not change the implementation, instead it adds some basic unit tests and increases the wait time to 180ms.
I'll create a follow up pr which will use context.WithCancel//context.WithTimeOut to detect the timeOut... but first this needs to be cherry-picked back into experimental. cC. @kobergj @wkloucek 🍒 to experimental!?

before // after

Bildschirmfoto 2022-11-01 um 11 53 11

Bildschirmfoto 2022-11-01 um 11 56 45

@fschade fschade requested review from labkode, ishank011, glpatcern and a team as code owners November 1, 2022 11:07
@fschade fschade requested review from butonic, dragotin and micbar and removed request for labkode, glpatcern and ishank011 November 1, 2022 11:12
@dragotin
Copy link
Contributor

dragotin commented Nov 1, 2022

This patch is only hiding symptoms. With the previous implementation we already wait 165 msec in total to acquire the lock, and I am unable to calc how long that will be with your change. If that is not sufficient, we have a different problem I think.

@aduffeck tried to solve that with #3395 - are you sure you have that in your test installation?

I am ok having this in edge, but for edgeI am not ok without further investigation.

Copy link
Contributor

@dragotin dragotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not merge to edge without further investigation with @aduffeck and me.

@butonic
Copy link
Contributor

butonic commented Nov 1, 2022

@dragotin @aduffeck aquiring the lock starts to take too long when lots of small files are being uploaded by the desktop client. It will send up to 6? PUT requests (or the TUS equivalent) in parralel. Each request tries to propagate the etag in sync. To do that each request has to lock the parent, which takes longer the more requests happen in parallel.

Increasing the timeout allows syncing more small files without returning 500 errors to the client, which is what this PR tries to do for now.

The correct fix is to delay etag propagation for a small amount of time (eg 500ms) and extend the delay on subsequent propagation requests. That would aggregate the propagation calls for uploads into the same folder ideally to 1 instead of n, reducing the number of write requests AND completely getting rid of multiple goroutinges fighting over a lock on parent resources. But since this will make etag propagation async @fschade and I considered it too fragile for GA. we could try to make it optional and make it async when the client sends a Prefer: respond-async header. To be discussed. For now, I vote to merge this PR to get rid of user visible 500 errors.

@fschade you could introduce a variable for the number of iterations to try an get a lock. Then you can set that to 1 or 0 for a negative test and expect it to fail with lots of concurrent goroutinges trying to get a lock, similar to the other tests you added. That will test the failure case.

@butonic butonic marked this pull request as draft November 1, 2022 13:29
@butonic
Copy link
Contributor

butonic commented Nov 1, 2022

converted to draft to prevent accidental merging (by eg myself).

@fschade
Copy link
Contributor Author

fschade commented Nov 1, 2022

@dragotin as said this is the first iteration, I need this in experimental to get rid of all those 500 errors (those happen on Master too).

Yes this pr does not fix the problem but introduces tests as a starting point, the follow up pr then introduces context.withCancel as a next step (and configuration).

the final phase then should provide some kind of buffered xattr writer which takes care to only persist latest attributes at once and skips the old ones. This also reduces the fs pressure…. But this is a bigger change and for later.

Let’s discuss it tomorrow how to proceed here… we should apply it at least for experimental to get rid of those client sync errors

@fschade
Copy link
Contributor Author

fschade commented Nov 1, 2022

The tests only cover the filelocks package, nothing else

@labkode
Copy link
Member

labkode commented Nov 1, 2022

@fschade @dragotin @butonic in EOS the propagation of ETAG can happen up to 5 seconds after update and is okay given that the sync model is eventual consistency.

File contents can be updated without reflecting the new etag immediately, the important is to not loose the propagation of the etag (if server crashes, etc ...)

The lock should only be taken when updating the etag.

@aduffeck
Copy link
Contributor

aduffeck commented Nov 1, 2022

@fschade @dragotin @butonic right, etag propagation can cause lock contention eventually when a lot of uploads happen in parallel, but what I've seen as most problematic in master was locking the file for listing the xattrs (because it's not atomic and actually fails occassionally without it) and the huge amount of those calls we currently do.

#3397 is supposed to eventually improve the latter but #3395 improved the first problem and was already merged into edge but not into experimental as far as I can see. @fschade It would be interesting to see whether that maybe already fixes the problem for you, it had tremendous effects during my testing.

@fschade
Copy link
Contributor Author

fschade commented Nov 1, 2022

@aduffeck I tested latest ocis master and local linked latest edge reva… Problem still exists.

i give it another try tomorrow

@@ -28,7 +28,13 @@ import (
"github.com/gofrs/flock"
)

var _localLocks sync.Map
var (
localLocks sync.Map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please undo this change. unexported global vars should be prefixed with _ so they can be easily identified

for i := 1; i <= 10; i++ {
if flock = getMutexedFlock(n); flock != nil {
var lock *flock.Flock
for i := 1; i <= 60; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a permanent solution I would introduce exponential backoff behaviour instead of using time.Sleep

@fschade
Copy link
Contributor Author

fschade commented Nov 2, 2022

i close this here and open a separate pr for experimental

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants