Increase lock acquiring timeout #3423
Conversation
This patch is only hiding symptoms. With the previous implementation we already wait 165 msec in total to acquire the lock, and I am unable to calculate how long that will be with your change. If that is not sufficient, we have a different problem I think. @aduffeck tried to solve that with #3395 - are you sure you have that in your test installation? I am ok having this in experimental, but for edge I am not ok without further investigation.
Please do not merge to edge without further investigation with @aduffeck and me.
@dragotin @aduffeck Acquiring the lock starts to take too long when lots of small files are being uploaded by the desktop client. It sends up to 6(?) PUT requests (or the TUS equivalent) in parallel. Each request tries to propagate the etag synchronously, and to do that each request has to lock the parent, which takes longer the more requests happen in parallel. Increasing the timeout allows syncing more small files without returning 500 errors to the client, which is what this PR tries to do for now.

The correct fix is to delay etag propagation for a small amount of time (e.g. 500ms) and extend the delay on subsequent propagation requests. That would aggregate the propagation calls for uploads into the same folder, ideally to 1 instead of n, reducing the number of write requests AND completely getting rid of multiple goroutines fighting over a lock on parent resources. But since this will make etag propagation async, @fschade and I considered it too fragile for GA. We could try to make it optional and make it async when the client sends a

@fschade you could introduce a variable for the number of iterations to try to get a lock. Then you can set that to 1 or 0 for a negative test and expect it to fail with lots of concurrent goroutines trying to get a lock, similar to the other tests you added. That will test the failure case.
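As a rough illustration of that negative-test idea — the `lockRetries` variable and the `acquireLock` stand-in below are hypothetical, not the actual filelocks API — a configurable retry count lets a test force the failure path:

```go
package filelocks_test

import (
	"errors"
	"path/filepath"
	"sync"
	"testing"
	"time"

	"github.com/gofrs/flock"
)

// lockRetries is the hypothetical knob from the review comment: the number of
// attempts acquireLock makes before giving up. Setting it to 1 makes the
// failure path easy to hit in a test.
var lockRetries = 1

// acquireLock is a simplified stand-in for the real helper: retry with a
// short, growing sleep until the flock is obtained or the retries run out.
func acquireLock(path string) (*flock.Flock, error) {
	fl := flock.New(path)
	for i := 1; i <= lockRetries; i++ {
		ok, err := fl.TryLock()
		if err != nil {
			return nil, err
		}
		if ok {
			return fl, nil
		}
		time.Sleep(time.Duration(i*3) * time.Millisecond)
	}
	return nil, errors.New("could not acquire lock on " + path)
}

func TestAcquireLockFailsUnderContention(t *testing.T) {
	path := filepath.Join(t.TempDir(), "parent.flock")

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		failures int
	)

	// many goroutines fight over the same lock file; with lockRetries=1 at
	// least some of them are expected to give up
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fl, err := acquireLock(path)
			if err != nil {
				mu.Lock()
				failures++
				mu.Unlock()
				return
			}
			time.Sleep(10 * time.Millisecond) // hold the lock briefly
			_ = fl.Unlock()
		}()
	}
	wg.Wait()

	if failures == 0 {
		t.Fatalf("expected failures with lockRetries=%d, got none", lockRetries)
	}
}
```

Setting the retry count to 1 makes the failure deterministic enough to assert on without depending too much on timing.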
Converted to draft to prevent accidental merging (by e.g. myself).
@dragotin As said, this is the first iteration; I need this in experimental to get rid of all those 500 errors (they happen on master too). Yes, this PR does not fix the underlying problem, but it introduces tests as a starting point. The follow-up PR then introduces context.WithCancel as a next step (and configuration). The final phase should provide some kind of buffered xattr writer which takes care to only persist the latest attributes and skip the old ones; this also reduces the fs pressure. But that is a bigger change and for later.

Let's discuss tomorrow how to proceed here. We should apply this at least for experimental to get rid of those client sync errors.
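As a sketch of the context-based direction mentioned above — the helper name and the 180ms deadline are illustrative assumptions, using gofrs/flock directly — acquisition could be bounded by a deadline instead of a hard-coded iteration count:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/gofrs/flock"
)

// acquireLockCtx keeps retrying until the flock is obtained or the context
// expires, so the overall wait is bounded by a deadline rather than by a
// fixed number of iterations.
func acquireLockCtx(ctx context.Context, path string) (*flock.Flock, error) {
	fl := flock.New(path)
	wait := 3 * time.Millisecond
	for {
		ok, err := fl.TryLock()
		if err != nil {
			return nil, err
		}
		if ok {
			return fl, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("acquiring lock on %s: %w", path, ctx.Err())
		case <-time.After(wait):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 180*time.Millisecond)
	defer cancel()

	fl, err := acquireLockCtx(ctx, "/tmp/parent.flock")
	if err != nil {
		fmt.Println("could not get lock:", err)
		return
	}
	defer fl.Unlock()
	fmt.Println("locked", fl.Path())
}
```

gofrs/flock also ships a context-aware TryLockContext helper that wraps a similar retry loop, which might be enough on its own.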
The tests only cover the filelocks package, nothing else.
@fschade @dragotin @butonic In EOS the propagation of the etag can happen up to 5 seconds after the update, which is okay given that the sync model is eventual consistency. File contents can be updated without the new etag being reflected immediately; the important thing is to not lose the propagation of the etag (if the server crashes, etc.). The lock should only be taken when updating the etag.
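A minimal sketch of keeping the critical section that small, assuming gofrs/flock for the lock and github.com/pkg/xattr for the attribute write; the lock-file suffix and the attribute name are made up for illustration and are not what decomposedfs actually uses:

```go
package main

import (
	"fmt"
	"os"

	"github.com/gofrs/flock"
	"github.com/pkg/xattr"
)

// setEtag locks the parent only for the duration of the xattr write,
// keeping the critical section as short as possible.
func setEtag(parentPath, etag string) error {
	fl := flock.New(parentPath + ".flock")
	if err := fl.Lock(); err != nil {
		return err
	}
	defer fl.Unlock()

	return xattr.Set(parentPath, "user.example.etag", []byte(etag))
}

func main() {
	dir, err := os.MkdirTemp("", "etag-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := setEtag(dir, "\"2b16a\""); err != nil {
		fmt.Println("etag update failed:", err)
		return
	}
	fmt.Println("etag updated on", dir)
}
```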
@fschade @dragotin @butonic Right, etag propagation can eventually cause lock contention when a lot of uploads happen in parallel, but what I've seen as most problematic in master was locking the file for listing the xattrs (because it's not atomic and actually fails occasionally without it) and the huge number of those calls we currently do. #3397 is supposed to eventually improve the latter, while #3395 improved the first problem and was already merged into edge, but not into experimental as far as I can see. @fschade It would be interesting to see whether that maybe already fixes the problem for you; it had tremendous effects during my testing.
@aduffeck I tested the latest ocis master with the latest edge reva linked locally… the problem still exists. I'll give it another try tomorrow.
```diff
@@ -28,7 +28,13 @@ import (
 	"github.com/gofrs/flock"
 )
 
-var _localLocks sync.Map
+var (
+	localLocks sync.Map
```
Please undo this change. Unexported global vars should be prefixed with _ so they can be easily identified.
```diff
-	for i := 1; i <= 10; i++ {
-		if flock = getMutexedFlock(n); flock != nil {
+	var lock *flock.Flock
+	for i := 1; i <= 60; i++ {
```
For a permanent solution I would introduce exponential backoff behaviour instead of using time.Sleep
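A possible shape for that, with arbitrary jitter, cap, and deadline values not taken from this PR:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"

	"github.com/gofrs/flock"
)

// acquireLockBackoff retries with exponentially growing, jittered waits
// instead of a fixed per-iteration sleep, so heavy contention backs off
// quickly without hammering the lock file.
func acquireLockBackoff(path string) (*flock.Flock, error) {
	fl := flock.New(path)
	wait := 2 * time.Millisecond
	const maxWait = 250 * time.Millisecond
	deadline := time.Now().Add(2 * time.Second)

	for time.Now().Before(deadline) {
		ok, err := fl.TryLock()
		if err != nil {
			return nil, err
		}
		if ok {
			return fl, nil
		}
		// full jitter: sleep a random duration in [0, wait), then grow the cap
		time.Sleep(time.Duration(rand.Int63n(int64(wait))))
		if wait < maxWait {
			wait *= 2
		}
	}
	return nil, errors.New("timed out acquiring lock on " + path)
}

func main() {
	fl, err := acquireLockBackoff("/tmp/parent.flock")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer fl.Unlock()
	fmt.Println("locked", fl.Path())
}
```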
I'm closing this here and will open a separate PR for experimental.
When acquiring a file lock, the code retries for a certain amount of time before giving up. In some cases this fixed waiting time is too short.
For example, when uploading a whole bunch of files into the same folder, the etag propagation updates the destination folder many times, and some of those updates fail to acquire the lock in time.
This PR does not change the implementation; instead it adds some basic unit tests and increases the wait time to 180ms.
I'll create a follow-up PR which will use context.WithCancel/context.WithTimeout to detect the timeout... but first this needs to be cherry-picked back into experimental. cc @kobergj @wkloucek 🍒 to experimental!?
Before / after (screenshots)