Refactor concurrent uploading of local changes #18449
Conversation
I was looking inside the Mutex code for the version we use (1.6.4), and the crash we see in Sentry happens in a very specific situation that I wasn't able to reproduce at all.
I see you removed an `unlock` call from the code, which is definitely good, but the code inside Mutex makes me think something could go wrong even just using `withLock`, since it `unlock`s inside a `finally`, but there's that other code that `unlock`s as well in the very specific situation where the continuation has already been cancelled right after trying to resume the lock waiter.
Anyway, I think that Mutex implementation with `LockWaiter` is pretty confusing to follow, and I see it was changed in version 1.7.0 (Kotlin/kotlinx.coroutines#3020), so I wonder if that version would have this issue at all... 🤔
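For reference, here's a minimal sketch of what `withLock` essentially boils down to (not the actual library source, just the shape of it, under a made-up name): the `unlock` already happens in a `finally`, which is why an extra manual `unlock` on our side risks a double-unlock.

```kotlin
import kotlinx.coroutines.sync.Mutex

// Sketch only: roughly the shape of Mutex.withLock. The real extension is
// inline and takes an optional owner, but the important part is that unlock()
// runs in a finally, even when the action throws or the coroutine is cancelled.
suspend fun <T> Mutex.withLockSketch(action: suspend () -> T): T {
    lock()
    try {
        return action()
    } finally {
        unlock()
    }
}
```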
(Review comment on the line `} catch (e: Exception) {`)
I'm not quite sure this will cause the other coroutines (for other sites) not to be cancelled, since this try/catch is around the coroutine `launch`es (which return instantly) and not around the actual code that throws the exception (the call to `update`). Also, this function creates a new `coroutineScope` which does not use a `SupervisorJob` (line 114), so cancellation will be propagated to sibling coroutine jobs if any job fails.
Also, even though we are talking about coroutine cancellation, this doesn't mean that code will actually stop running immediately, because coroutine cancellation is cooperative.
To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to tell whether the proposed fixes here and in the previous PR would work as expected.
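To illustrate the first point, here's a rough sketch of the shape of the code in question (all names made up for illustration, this is not the actual class):

```kotlin
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

// Hypothetical shape of the function under review.
suspend fun uploadAllSites(sites: List<String>) = coroutineScope {
    try {
        sites.forEach { site ->
            // launch returns immediately, so an exception thrown later inside
            // the coroutine is never caught by this try/catch...
            launch {
                update(site) // ...it is delivered to the parent scope instead
            }
        }
    } catch (e: Exception) {
        // Only failures while *starting* the coroutines land here. A failure
        // inside update() cancels the surrounding coroutineScope and, because
        // there is no SupervisorJob, all of its sibling jobs as well.
    }
}

// Hypothetical per-site upload that can fail.
private suspend fun update(site: String) {
    delay(10)
    if (site == "broken") error("upload failed for $site")
}
```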
This is a good point, and I think it actually might shed light on the original issue. I see that this same possibility exists in an earlier version of this implementation, and I believe the underlying "double-unlock" issue could be a result of this.
One theory / possibility (after reading the backscroll, and this comment) is that the `forEach` is spawning multiple jobs on the `ioDispatcher` (with a thread pool), most of them blocking on the `lock()` invocation. Then, if any of the child jobs throws, the parent job cancels, and all siblings too, and then things get a little bit tricky, imo. The `lock()` docs say:

> This function releases the lock if it was already acquired by this function before the CancellationException was thrown.

and

> Note that this function does not check for cancellation when it is not suspended.

So, although each site's uploads should be blocked and served in FIFO manner, I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔
Though there is a "prompt cancellation guarantee", does that guarantee it cancels immediately when a sibling cancels? (Since those children can be on different threads, along with the fact that we catch and re-throw, it makes me wonder.)
I have not tested this hypothesis, so please take it with a grain of salt. I just wanted to suggest it, as I've skimmed the backscroll and your comment seems like it might shed some light on the issue.
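A rough skeleton of that hypothesized setup, just to make the scenario concrete (illustrative only, not the app code):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex

// Several children pile up on mutex.lock() on the IO thread pool; when one of
// them throws, the parent starts cancelling the siblings, but a sibling that is
// already being resumed with the lock may still run briefly before it observes
// that cancellation.
fun main() = runBlocking {
    val mutex = Mutex()
    val result = runCatching {
        coroutineScope {
            repeat(10) { index ->
                launch(Dispatchers.IO) {
                    mutex.lock()
                    try {
                        if (index == 3) error("simulated upload failure")
                    } finally {
                        mutex.unlock()
                    }
                }
            }
        }
    }
    println("finished: $result")
}
```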
Yeah, I was actually trying to reproduce it with some custom code (though I was running it as a unit test on the JVM and not on an actual device, so I believe the thread implementation might be different).
Because of the `mutex`, no upload jobs will be running at the same time, though their coroutines would've been started and they're all suspended at the `lock` function.

> I wonder if the cancellation bubbling up to the parent job and then back down allows another child job to acquire the lock before it is cancelled as a sibling 🤔

In my tests, I see this definitely happens, though it doesn't seem to be the cause of the crash, because in that scenario the `unlock` call throwing an exception would be either this or this, but not the one causing the Sentry crash, which is this one.
Since the crash happens, I am guessing it is also possible that timing can make that be called right when one of the jobs has released the lock AND while the cancellation is reaching the internal continuation in `suspendCancellableCoroutineReusable` from the Mutex.
This is a tricky crash, and I still think it could happen even using just `withLock` and not calling `unlock` on our side, though that extra `unlock` on our side definitely doesn't help, so maybe @ovitrif's changes here help solve the issue.
I feel that updating coroutines to 1.7.0 would definitely solve this crash (since the Mutex internals were completely changed), though I'm not sure what it would take to make that dependency update.
Also, just adding more information about this part:

> To be honest, I couldn't quite understand what exactly this class is trying to accomplish in terms of job execution order, parallel work, and structured concurrency, so it is a bit hard to understand if the proposed fixes here and previous PR would work as expected.

What I mean is: I don't understand why we are starting multiple parallel jobs but suspending them all on the same mutex. In this scenario it feels to me that the `mutex` (and the concurrency) is not needed, since in practice we are just running each job sequentially.
It looks like all those jobs could be called from one coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (the `mutex` in this case).
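Something along these lines, as a sketch of the "one coroutine, sequential" idea (names are illustrative, not the actual class or API):

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineDispatcher
import kotlinx.coroutines.withContext

// Each site is processed in order inside a single coroutine; a failure for one
// site is handled locally and the loop moves on, so no mutex is needed because
// nothing runs concurrently.
suspend fun uploadAllSitesSequentially(
    sites: List<String>,
    ioDispatcher: CoroutineDispatcher,
    uploadSite: suspend (String) -> Unit,
) = withContext(ioDispatcher) {
    sites.forEach { site ->
        try {
            uploadSite(site)
        } catch (e: CancellationException) {
            throw e // don't swallow cancellation
        } catch (e: Exception) {
            // Log and continue with the next site instead of cancelling everything.
        }
    }
}
```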
> Since the crash happens, I am guessing it is also possible that timing can make that be called right when one of the jobs released the lock AND while the cancellation is reaching the internal continuation in suspendCancellableCoroutineReusable from the Mutex.

Yeah, this is exactly what I was thinking might be happening (and thanks for linking to the various sources where `unlock` is called). It is possible that the test dispatcher exhibits different behavior from the `ioDispatcher` actually used in production. If it is indeed a timing issue / race condition of some sort, this might explain why it is hard to reproduce in the test environment.

> This is a tricky crash, and I still think it could happen even using just the `withLock`

I also agree, since, as @ovitrif pointed out, it is basically doing the same thing we were doing (i.e. calling `unlock` in a `finally`).

> In this scenario it feels to me the mutex (and concurrency) is not needed since in practice we are just running each job sequentially.
> It looks like all those jobs could be called from 1 coroutine, with some exception handling, instead of relying on concurrent jobs and a syncing mechanism (mutex in this case).

I agree with this as well, especially since the jobs are blocked at the highest level, before any control flow traverses into the more granular layers of the implementation where service boundaries might offer concurrency advantages (disk vs. network performance characteristics).
Also, considering performance from a high level: if we hypothetically implemented more granular concurrency, I doubt the advantages would be significant anyway, since the likely bottleneck in most circumstances is the device's network bandwidth. Uploading to multiple sites simultaneously likely won't be much (if at all) faster than uploading to each one sequentially, and it incurs this complexity cost as well as an increased risk of multiple interrupted transfers (instead of just one) in the case of network failure.
That said, currently it seems that the purpose of the `mutex` is described in the code comment:
/**
* When the app comes to foreground both `queueUploadFromAllSites` and `queueUploadFromSite` are invoked.
* The problem is that they can run in parallel and `uploadServiceFacade.isPostUploadingOrQueued(it)` might return
* out-of-date result and a same post is added twice.
*/
so maybe more work needs to be done to make those `queueUploadFrom{AllSites,Site}` methods idempotent? I don't have enough context to know what problem this `mutex` solves here. Perhaps the "deduping" should be happening at a layer closer to the queue, with any needed synchronization there? @ParaskP7 👋 😄 I see that this code comment was from 43949f1, but it appears to originate a bit earlier (#10878), so I'm not sure if you have the full context of the original reason for that comment. But I wonder wdyt about this, i.e. do you think this synchronization / deduping could be done in the service somehow?
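For the sake of discussion, a very rough sketch of what "deduping closer to the queue" could look like (hypothetical names, not the actual upload service API): the queue itself refuses duplicates, so `queueUploadFromAllSites` and `queueUploadFromSite` could overlap safely without a mutex.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch: a thread-safe "currently queued" set makes enqueueing
// idempotent, so calling it twice for the same post is harmless.
class UploadQueueSketch {
    private val queuedPostIds = ConcurrentHashMap.newKeySet<Long>()

    /** Returns true only for the first caller that enqueues the given post. */
    fun enqueueIfAbsent(postId: Long): Boolean = queuedPostIds.add(postId)

    /** Called once the upload finishes (or fails) so the post can be queued again later. */
    fun markDone(postId: Long) {
        queuedPostIds.remove(postId)
    }
}
```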
Thanks a lot for the input @thomashorta and @mkevins 🙇🏻, you're awesome and really helpful 🙏🏻

> so maybe more work needs to be done to make those queueUploadFrom{AllSites,Site} methods idempotent? I don't have enough context to know what problem this mutex solves here

If I can think outside the box on this one, this is also my view: that `mutex` should be replaced with a different solution that avoids the post duplication 👍🏻. To me, locking an async job to enforce synchronicity (basically manually locking and unlocking the `mutex` to force sequential batching) is more of a workaround than an actual solution, therefore we should not take the `mutex` as a given for solving the underlying issue it attempted to solve. This would automatically fix the crash we are seeing in Sentry.
👋 all!
I don't think updating Coroutines to its latest 1.7.1 will take much effort. 🤞
FYI: Plus, you should now be able to do any such update yourselves, with confidence, with me guiding you when and if you feel the need. What I mean is that you shouldn't feel the need to get my 👍 on any such updates going forward, especially after us merging this last major beast into
PS: I'll create a P2 post when this whole dependency updates work is done, just to make all that crystal clear for you all.
Yes @mkevins, you are right, I was just reverting d5b9948 at this point, adding back the previous
Actually no @mkevins, I had zero context of the original reason and was trying to ping Jirka on it to help figure this out (see here). 🤷
FYI: See also this origin parent PR, child PR and commit.
I agree with @ovitrif here, and actually with all of you (to be honest). I too think it would now be prudent to step back and try to solve the actual problem (sequential batching), not its workaround (mutex). 💯
Force-pushed from df3ca1c to 875e30f.
As expected by some of us here, the
FYI: This issue occurred on a Pixel test device used by @SiobhyB while beta testing. We're trying to confirm in the meantime that we really have a stable repro scenario and environment. If we get to that milestone, we might be able to proceed much more "in the known" with fixing this, by manually testing before merging instead of relying on the Sentry reports to figure out whether the crash is fixed or not.
Would've been nice, but it doesn't seem to be the case 😅
Force-pushed from 875e30f to 41da654.
@thomashorta and @mkevins I might have been able to trigger the same exception we're seeing in Sentry inside a failing unit test, please see f377b6e! Thanks to having a unit test to repro this, we can now also see that it could indeed result in data loss: once the mutex exception is thrown, the queued uploads are cancelled.
Thanks for looking into this, Ovi! I really like the idea of creating a failing unit test to repro the issue, but I wonder if invoking
I wonder if there is a way (in the test environment) to simulate
Thank you @mkevins for validating my approach 🙇🏻, this is what I was looking for! I also had my doubts about how useful it is, but nonetheless, imho, at least the fact that the rest of the invocations don't happen (i.e. the expected count doesn't match the actual one in the assert) adds a bit of weight to the importance of finding a solution here 👍🏻
I will continue my investigation; going forward I will dive into debugging & following the call stack at runtime. The first thing I'm after is to simulate the scenario where we would encounter duplicate uploads, basically to see more clearly what problem the mutex is actually solving here 👍🏻
Thank you @ovitrif for looking into it again. It would definitely help to have a failing test, but I agree with @mkevins that maybe handling the Mutex and doing extra stuff in the test/mock itself might not emulate the issue. 🙇
I thought the same and tried to create a class from scratch that does similar stuff in terms of blocking threads, suspension, and coroutine scopes, and wrote something (as a unit test) to see if I could break it somehow. But I couldn't find a way of reproducing anything that would crash it, let alone the crash we got from Sentry. The test code I made is available in the
Nice! I think you're on the right path here. I feel like finding a way to consistently reproduce this specific race condition is going to be pretty complicated (though I agree it would be great to do it), and I think it really boils down to timing; otherwise, I'm sure we would see this issue happening WAY MORE. It might make more sense, in this case, to do our best to understand what could be happening (like the hypothesis we have), figure out exactly what we want our code to do, and improve our solution with those things in mind. Count on me to help with this problem, if needed! ✋
Force-pushed from c5e8a15 to c3ab67a.
Closing this for now, since we're now on
Resolves #17463
This PR adapts the logic for uploading local changes to the cloud and introduces unit tests for the batch job that uploads all local sites' posts & pages.
We've noticed in Sentry that the crash still occurs. We attempted to fix it via the `mutex.withLock { … }` workaround that we merged via PR #18067 into `22.4`. Unfortunately, it didn't eradicate the crash, as seen in this Sentry report for `22.4-rc-1`.

To Test
Regression Notes

Potential unintended areas of impact:
Offline posts and pages upload

What I did to test those areas of impact (or what existing automated tests I relied on):
Manual testing and unit testing

What automated tests I added (or what prevented me from doing so):
Unit tests for the updated code

PR submission checklist:
- Update `RELEASE-NOTES.txt` if necessary.