Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ProduceChannel never complete cancelling state in MainScope after upgrade coroutines to 1.4.3 #2717

Closed
neworld opened this issue May 18, 2021 · 7 comments
Labels

Comments

@neworld
Copy link

neworld commented May 18, 2021

Our app stops working after upgrade coroutines to 1.4.3. Any older version works just fine. After spending countless hours of debugging, I still do not understand the problem, and I can't reproduce the same conditions in the tests, but at least I got something similar.

We have an advertising service which has the responsibility of releasing any loaded ads after closing. It is a mandatory requirement to avoid memory leaks.

Simplified version of the code looks like. Please read a comments which represents real world:

class AdLoader: CoroutineScope by MainScope() {
    private fun loadData() = async { 0 } // actually ad lib call using CompletableDeferred

    val channel = produce(capacity = 0) {
        println("begin production") //this line is never called in test with Main dispatcher, but always in the prod
        repeat(10000) {
            try {
                println("load data")
                val data = loadData() // always throws an exception for few first calls
                println("try to send data")
                send(data)
                println("data was sent")
            } catch (e: Throwable) {
                println("Exception happened $e")
                if (e is CancellationException) {
                    throw e //this is never called
                }
                delay(100) //seems this delay never completes
                println("After delay") //this comment never shows
            }
        }
    }

    fun close() {
        cancel()
        channel.cancel()
        while (!channel.isClosedForReceive) { //this is always true
            println("Consume remaining: " + channel.poll()) //poll() is always null
            println("Channel: $channel") // println always says channel is in `Cancelling` state
        }
    }
}

You could try use the following test:

    @Test
    fun stuckedChannel() = runBlocking {
        val adLoader = AdLoader()
        delay(500)
        adLoader.close()
    }

However, inside unit tests, the ProduceChannel does not print any lines at all, but cancellation stuck forever. After changing MainScope to CoroutineScope(Dispatchers.IO) cancellation works just fine in the tests. I believe the problem is related to the Main dispatcher.

My best guess it could be related to 7061cc2, but I am not sure how it could cause such behavior.

BTW, the code is very old and was written before Flow. Maybe should we rewrite it to Flow? What is the best approach to preload ads in advance and make sure everything is cleaned?

@neworld
Copy link
Author

neworld commented May 18, 2021

A fix is rather simple. Just close scope after cleaning the channel:

    fun close() {
        channel.cancel()
        while (!channel.isClosedForReceive) {
            //cleanup
        }
        cancel()
    }

And I would agree the problem was in our code. However, I am expecting scope will completely cancel all ReceiveChannels

@qwwdfsad
Copy link
Member

The problem is indeed MainScope used in a test where an application doesn't have access to the Android looper dispatcher.
I'd suggest using either pluggable dispatcher for testability or to use kotlinx-coroutines-test module that does that mockery for you

qwwdfsad added a commit that referenced this issue May 27, 2021
That helps to pro-actively catch cases like #2717 and to report such exception in even more robust manner
@neworld
Copy link
Author

neworld commented May 27, 2021

The problem is indeed MainScope used in a test where an application doesn't have access to the Android looper dispatcher.

The actual problem is not the test. It is a cancellation of the scope in the bad time leave stuck producer in canceling state forever. Fortunately, I was able to reproduce the problem with the test and confirm that after fixing the problem (move cancel()) solves the problem.

And the problem appeared only after upgrading 1.4.2 -> 1.4.3. So the bug could be sneaky.

@qwwdfsad
Copy link
Member

The attached test does not reproduce this problem or I've misunderstood the general idea, the test passes correctly and nothing is stuck.

Could you please re-visit the reproducer and post it in a self-contained manner (so, if copypasted in IDEA and run, it hangs)?

@neworld
Copy link
Author

neworld commented May 28, 2021

Could you please re-visit the reproducer and post it in a self-contained manner (so, if copypasted in IDEA and run, it hangs)?

I believe I could reproduce it in isolated app. But before spend effort I just want clarify one thing. If a producer is suspended, should cancel() of corresponding scope move producer to cancelled state eventually? In other words, will this code return regardless of how producer is actually written or dispatcher used? Or am I missing something?

    fun close() {
        scope.cancel()
        while (!channel.isClosedForReceive) {
           
        }
    }

Also, is any chance that commit 089e163 will fix product problem as well?

@qwwdfsad
Copy link
Member

If a producer is suspended, should cancel() of corresponding scope move producer to cancelled state eventually?

Yes, as long as the producer ain't doing anything suspicious such as catching cancellation and busy-waiting/blocking

In other words, will this code return regardless of how producer is actually written or dispatcher used? Or am I missing something?

No.
For example, take a look at the following producer:

produce<Int> {
    try {
        delay(10000L) // Suspend for a while
    } catch (e: CancellationException) {
        // Catch cancellation and just hang
       while (true) { /* buusy loop, do nothing */ }
    }
}

Replace while (true) with Thread.sleep(Long.MAX_VALUE) or Java-ish blocking (e.g. blockingQueue.take) and the picture will stay the same.

Also, is any chance that commit 089e163 will fix product problem as well?

Nope, it's irrelevant to the problem you're describing and rather prevents using MainScope in tests

@neworld
Copy link
Author

neworld commented May 28, 2021

Replace while (true) with Thread.sleep(Long.MAX_VALUE) or Java-ish blocking (e.g. blockingQueue.take) and the picture will stay the same.

🤦 Big shame on me. The main thread was busy waiting for completion in an infinite loop, while the main thread is also was responsible to handle the completion of produce. Thanks for your time

There are more potential solutions: use different threads, coroutine's yield() or cancel in the right order.

qwwdfsad added a commit that referenced this issue May 31, 2021
That helps to pro-actively catch cases like #2717 and to report such exception in even more robust manner
pablobaxter pushed a commit to pablobaxter/kotlinx.coroutines that referenced this issue Sep 14, 2022
That helps to pro-actively catch cases like Kotlin#2717 and to report such exception in even more robust manner
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants