-
Notifications
You must be signed in to change notification settings - Fork 720
NIOAsyncWriter: Fix suspending yield when writer is finished #3044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIOAsyncWriter: Fix suspending yield when writer is finished #3044
Conversation
Lukasa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch, thanks so much!
Regarding your test, it might be useful for the test to use a new delegate that makes it easier to arrange the ordering to work out. NIO's ConditionLock can be used to set up a nice ordering chain, so that instead of using the usleeps we can just block on that lock being unlocked with the right value.
bf18517 to
f15a87b
Compare
|
@Lukasa : I updated it with ConditionLock. Is this what you had in mind? This is still blocking a thread from the concurrency pool and then depending on another thread to unblock it. Will that not be a problem? The only way I see around that is to add an async hook to NIOAsyncWriter. |
|
For testing purposes, that blocking will be safe. We will eventually resolve it, and we don't rely on Dispatch to make that happen (it requires the NIO thread to do it), so it'll be ok. |
Lukasa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really lovely patch, thanks so much @orobio!
I'm not sure I fully understand what you're saying here, but I assume it's fine then. I do still have that 'FIXME' line in there, which would probably be good to remove. Well..., this got merged while I was typing, so I'll open a new pull request for that. Edit: Opened #3059 |
This was left in when pull request apple#3044 was merged, but as discussed on that pull request, this doesn't need to be fixed.
This was left in when pull request #3044 was merged, but as discussed on that pull request, this doesn't need to be fixed.
|
It appears this pull gets the NIO tests to hang in the Android x86_64 emulator, but weirdly only when compiled and run on macOS, ie the linux CI runs shown there work fine. I don't know if that's a macOS cross-compilation issue or some incompatibility with the macOS emulator triggered by this pull, possibly both. Reverting this pull gets my Android CI runs on macOS to work again. Let me know what you guys think. |
|
That’s very interesting. Do you have a smaller repro? The OS-specificity makes me think this is probably a Swift bug, but if it’s easy enough to reproduce we may be able to figure out if the test that hangs is just flaky. |
|
A bit tough for me to reproduce since I don't use macOS. @marcprux, any interest in looking into this on your Mac? |
|
When I test against a local Android emulator (with However, the test run does ultimately fail due to a leaked promise. Edit: see next comment.
Here's the leaked promise, although it is likely a red herring: |
|
Actually, scratch all that. It seems to be hanging randomly, on both macOS and Ubuntu runners, irrespective of API level. See the runs at https://github.com/marcprux/swift-nio/actions/runs/12812276673 I'm still not able to get it to hang on my local machine, but I'll keep trying. |
|
Does it always hang on the same test? |
|
No, it seems to vary. Sometimes in |
|
Yeah, that seems fairly likely. Can you run the underlying test binary directly? Potentially running it under |
|
On my Android CI, it is entirely consistent: linux CI runs pass every time while macOS runs always hang. See the run yesterday: across two attempts with this pull, I had four macOS hangs and six linux runs pass (ignore the devel macOS failures, that's a separate issue). So much so that today I only reverted this pull for macOS, finagolfin/swift-android-sdk@7476f2be5, and all the runs passed. |
|
@finagolfin What happens if you only revert the test? |
|
I don't think this test is the issue, as it appears to be failing elsewhere in the log, but I will check that and let you know. |
|
Nope, you were right to suspect the test, as simply reverting this test on macOS got the NIO tests to pass again. 😄 Marc, can you confirm? |
Confirmed: if I remove the test, then the hangs never occurs in the emulator (running on macOS-13 or Ubuntu). |
|
@marcprux, are you disabling all NIOAsyncWriterTests though, as shown in the linked commit above, marcprux/swift-nio@87d02310d? If so, it would be better to just disable this new test, as I did above, to see if this code change affects the older tests too, which you are no longer running. I'm going to try running the linux-compiled tests on the macOS emulator and vice versa next, to see if that helps narrow down where this strange hang is coming from, as it could be that we're hitting a bug in the Android emulator for macOS. |
|
This could be an emulator bug, but it could also be that we're seeing an ordering issue flushed out by the performance overhead of the emulator. It'd be useful to try to get a "sequence of events" log from your hanging test as well, as concurrent tests are horrible to debug. |
|
Alright, I ran the NIO tests alone in the Android emulator, once swapping the mac-compiled tests into the linux-hosted emulator and vice versa, once keeping the mac-built tests in the mac-hosted emulator and so on, where only the mac version hung. Looking at the results from the swapped run, the macos-hosted emulator again hung each time, with the linux-compiled tests not finishing for 20 minutes before I killed it (I have a 5 macOS job limit on the OSS free tier of github Actions, so I killed it once it seemed to hang, so other jobs could proceed). I just kicked off a swapped run again that I will let run for longer to make sure it does hang after longer (just killed it again after the macos-hosted emulators all hung for more than an hour and a half). This means the problem is in the interaction between this test and the macOS-hosted Android emulator, since the linux-compiled test runs fine in the linux-hosted emulator but hangs in the mac-hosted emulator. The mac cross-compile and mac-hosted Android emulator are consistently slower on github Actions, because of greater contention for fewer macOS CI machines or whatever, so either that lack of speed screws up this test or there is a bug in the macOS-hosted Android emulator. Regarding "a 'sequence of events' log," you want me to spray this test with print statements to figure out what's running when? |
|
Yeah, that was what I was asking for. Presumably this is the result of a specific ordering sequence caused by the slowdown in the emulator. |
|
Sorry, I was away for a while and completely missed this. I'd like to mention that I'm still a bit uncertain about one Task being blocked, waiting for another Task. Could that have anything to do with this? I can imagine that could cause a problem when other tests are doing a similar thing and we're running out of Swift concurrency threads that are making forward progress. |
|
No idea, I will stick a bunch of |
|
It'd be great to have a more detailed log, thanks! Perhaps next week or so I could have a look at trying to reproduce this on constrained hardware. |
|
Alright, finally got around to this, but not sure how useful it is considering the test output may not always flush. I disabled all other tests in this test file and stuck Given that the test output might cut off anywhere before the actual hang, not sure how useful that is. |
|
Thanks for trying this and the detailed report! Indeed, it's hard to say what's going on without a proper flush. I tried to reproduce this on a VM with only one CPU, but that worked fine. However, I was able to trigger a hanging state by adding a delay in one of the tasks. This causes a timeout when waiting for both yields being suspended, which results in the ConditionLock value never being set to true. The timeout is only 1 second, so maybe that one triggers in the Android emulator, causing the issue? I created the following patch: Fix hanging NIOAsyncWriter test It does 2 things:
I'm hoping that the former fixes the issue, but if not, the latter should make sure that the test doesn't hang and we can hopefully get more information from the log. Could you give this a test run? |
|
I just ran it with your test modifications: doesn't hang anymore but now fails, see full log output on my CI. |
|
It looks like what's happening is what I mentioned earlier. It's waiting on the condition lock and there is no thread available on which the other Task can be scheduled to set the condition. I am not able to reproduce this on my Linux VM with one CPU core. I did some tests and I found that when I keep creating Tasks that block on a ConditionLock, Swift just keeps spawning new threads to schedule the work on, until a maximum of 128 threads is reached. If I understand correctly, the underlying libDispatch is doing this and the Android implementation does not use libDispatch, which explains the difference. I guess the Android implementation really has a fixed-width thread pool. @finagolfin: After some more digging, I can see that all your configurations are using 2 CPU cores for the Android emulator, but I found the following warning in your Android emulator launch logs for the MacOS runs: So, it looks like it's a combination of having only 1 CPU core and a fixed-width concurrency thread pool that causes the issue. @Lukasa: A solution could be to add an async test hook into NIOAsyncWriter, so that we can just suspend instead of blocking the thread. For example, something like |
|
Having an async method on the writer delegate is a weird design, because it arguably misses the intended use-case, which is to bridge async producers into synchronous consumers. The sink delegate is supposed to be synchronous because it is the consumer. A better option is to find a way to skip the test in cases where it is likely to run into trouble. If we needed a real sledgehammer we could just skip the entire test on Android, but I wonder if we can construct a better question to ask the runtime. |
Swift Concurrency on Android uses libdispatch, and I haven't seen any big differences in this repo either, so I don't think that's it. In fact, this test has always worked well on the Android emulator on linux and I believe Marc says it often passes locally in the Android emulator on his macOS machine. So this is most likely tied to the slowness of the Android emulator on the github macOS CI, with the outside chance of some software interaction or bug in that environment.
Could be, as I don't see that warning when starting the Android emulator on linux in that same linked CI run. |
I understand, but I meant for it to only be used for testing, just like the already existing didSuspend callback, which is directly set on the storage with _setDidSuspend(...). We need some way to pause execution of the first yield at the right moment, to make sure the state machine is in the correct state for the second yield. With an async hook we can add this pause without blocking a thread from the concurrency pool. The didYield call is actually performed from an async context, so it should be easy to add an async callback for testing next to it.
Ah, I didn't know it used libdipatch as well. Could it be that libdispatch behaves differently on Android for some reason? On my Linux machine it starts a maximum of 128 threads. When I start 127 Tasks at the start of the test and block them all, leaving exactly 1 concurrency thread free, I can see that the test fails in the same way as on the MacOS Android emulator. When I start 126 blocked Tasks, leaving 2 concurrency threads free, the test is successful. That, combined with the fact that the MacOS Android emulator runs on 1 core and the Linux one on 2 cores makes me think that it's very likely that's the issue. If you want, I can create something with which we can test the behavior of the concurrency thread pool in the Android emulator. |
My nervousness there though is that With that said, I'm not 100% definitely opposed. Just noting that it conceptually complicates matters somewhat. |
|
There is another bug in the implementation that was revealed by the Android run. In the log it can be seen that an error was thrown from one of the tasks. This happens due to that the continuation of the second task (which should suspend again) will not be resumed until the yield of the first task is fully completed. The bug is that when the first yield finishes up, the state machine determines that there are no suspended yields and it moves the state to .finished, resulting in an alreadyFinished error from the second task. I can have a look at fixing that issue later, but if we're going with the async test hook approach, I anticipate that we probably need another hook to be able to test that issue, which means that we're already looking at adding 2 more hooks. After thinking about it some more, I'm now leaning towards using a custom executor and setting an executor preference on the tasks. That way we can guarantee that there are at least 2 threads available. Later, we can use the same mechanism to make sure there is only 1 thread available for both tasks, which will reproduce the issue described above. I'm not really familiar with custom executors, but I think this can work and will investigate it more. |
It could be, but I doubt it, especially since this test always passes in the Android emulator run on linux, which is why I only disabled it on the Android emulator run on macOS.
Yes, that core count difference you found is more likely the culprit.
Sounds good, I will also try limiting the core count of the emulator running on linux and see if that reproduces this hang. |
|
I created a test to see the behavior of the concurrency thread pool. When I run only this test on my single core Linux VM, it results in: And on my 8 core Linux VM: When I run the complete test suite, the initial number of threads count sometimes goes up a bit and sometimes it doesn't. I've seen it go up to around 30 on both machines. I'm very curious to see what will happen on the Android runs! |
|
I think you've managed to track this down, the number of cores appears to be the cause. First, I applied your patch with the new test and ran it on the linux and mac Android emulators as usual. The linux-hosted emulator passed as always: while the mac-hosted emulator hung: I then limited the linux-hosted emulator to a single core just like the mac and they all hung: |
|
Great, that definitely shows the problem. I will investigate the use of a custom task executor further and can hopefully provide a fix based on that. This indeed shows different behavior between (GNU) Linux and Android. Would it make sense to align them? I've always understood the Swift global concurrency thread pool to be a fixed width pool, so from that perspective the behavior on Android is correct. However, I can imagine that changing the behavior on Linux would have some undesired impact on existing projects. One thing that bothers me is that the test hangs when a single core is used. I've been staring at the code for a while, but I don't see why that would happen. If you have any ideas, I'd be happy to hear them! |
…ndroid emulator The test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. In environments where the global concurrency thread pool doesn’t have at least two threads available, the test will fail, as observed on the Android emulator running with a single virtual core (see discussion in apple#3044). Using a custom task executor guarantees that at least two threads are available for the test, regardless of the width of the global concurrency thread pool.
…ndroid emulator The testSuspendingBufferedYield_whenWriterFinished test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. In environments where the global concurrency thread pool doesn’t have at least two threads available, the test will fail, as observed on the Android emulator running with a single virtual core (see discussion in apple#3044). Using a custom task executor guarantees that at least two threads are available for the test, regardless of the width of the global concurrency thread pool.
…ndroid emulator The testSuspendingBufferedYield_whenWriterFinished test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. In environments where the global concurrency thread pool doesn’t have at least two threads available, the test will fail, as observed on the Android emulator running with a single virtual core (see discussion in apple#3044). Using a custom task executor guarantees that at least two threads are available for the test, regardless of the width of the global concurrency thread pool.
…ndroid emulator The testSuspendingBufferedYield_whenWriterFinished test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. In environments where the global concurrency thread pool doesn’t have at least two threads available, the test will fail, as observed on the Android emulator running with a single virtual core (see discussion in apple#3044). Using a custom task executor guarantees that at least two threads are available for the test, regardless of the width of the global concurrency thread pool.
…ndroid emulator The testSuspendingBufferedYield_whenWriterFinished test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. In environments where the global concurrency thread pool doesn’t have at least two threads available, the test will fail, as observed on the Android emulator running with a single virtual core (see discussion in apple#3044). Using a custom task executor guarantees that at least two threads are available for the test, regardless of the width of the global concurrency thread pool.
…#3135) ### Motivation: The testSuspendingBufferedYield_whenWriterFinished test fails on the Android emulator. See also the discussion in #3044. ### Modifications: The test requires at least two threads in the concurrency thread pool because it blocks one task, which waits for another task to set a condition. This PR adds support for running a task executor based on a NIOThreadPool and uses it for the test. Using a custom task executor guarantees that at least two threads are available for the test. Additionally, the test has been renamed to testWriterFinish_AndSuspendBufferedYield, which is more in line with the other test names. ### Result: The test will pass regardless of the width of the global concurrency thread pool. --------- Co-authored-by: George Barnett <gbarnett@apple.com> Co-authored-by: Cory Benfield <lukasa@apple.com>
This fixes hitting the following preconditionFailure in NIOAsyncWriter: "This should have already been handled by
yield()".It doesn't expect a yield to be suspended when the state is
.writerFinished, but this can definitely happen. This seemed to be the correct solution to me, but please check it carefully, because I'm unfamiliar with this code.I'm not happy about the test for this. It's blocking the
didYieldcall for a while, to make sure the correct state is reached. See also the 'FIXME' line. What would be a better way to do this?