(Flaky-test) Intermittent failure of ProxyParserTest.testRegexSubscription #6332
Comments
There's an interesting warning in the GitHub CI logs:
Here's the context of this exception:
Should this exception be an error, rather than a warning? |
I found this: https://stackoverflow.com/questions/24979610/netty-defaultchannelpipeline-exceptioncaught |
When the test is run by the Github CI, the |
It looks like the UnsupportedOperationException is actually getting thrown here in PulsarDecoder:
What's the reason this method is getting called only sometimes by this code? |
I learned from @jiazhai that PulsarDecoder is an abstract class whose handler methods should not be getting called directly. He pointed out that all interactions should be between
So, something is causing the method to get called on the abstract class directly instead of on the correct subclass override. Perhaps there's a type conversion issue involved. |
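To make the suspected mechanism concrete, here is a minimal, self-contained sketch. It does not use Pulsar's or Netty's real classes: `BaseDecoder`, `BrokerSideHandler`, `ProxySideHandler`, and the `Command` enum are hypothetical stand-ins. It assumes the base class supplies default handler methods that throw `UnsupportedOperationException`, so a command that lands on a subclass without an override fails in the base class.

```java
// Sketch only: every type here is a hypothetical stand-in, not Pulsar's code.
class DecoderDispatchSketch {

    /** Hypothetical command kinds standing in for protocol commands. */
    enum Command { CONNECT, SUBSCRIBE }

    /** Plays the role of the abstract decoder: default handlers reject the command. */
    abstract static class BaseDecoder {
        // Route a decoded command to the matching handler method.
        final String channelRead(Command cmd) {
            switch (cmd) {
                case CONNECT:   return handleConnect();
                case SUBSCRIBE: return handleSubscribe();
                default:        throw new IllegalArgumentException("unknown command " + cmd);
            }
        }

        String handleConnect()   { throw new UnsupportedOperationException("handleConnect"); }
        String handleSubscribe() { throw new UnsupportedOperationException("handleSubscribe"); }
    }

    /** Broker-side stand-in: overrides both handlers. */
    static class BrokerSideHandler extends BaseDecoder {
        @Override String handleConnect()   { return "broker:connected"; }
        @Override String handleSubscribe() { return "broker:subscribed"; }
    }

    /** Proxy-side stand-in: only overrides CONNECT and expects to forward the rest. */
    static class ProxySideHandler extends BaseDecoder {
        @Override String handleConnect() { return "proxy:connected"; }
    }

    /** Returns the handler result, or a description of the failure if unhandled. */
    static String dispatch(BaseDecoder handler, Command cmd) {
        try {
            return handler.channelRead(cmd);
        } catch (UnsupportedOperationException e) {
            return "unsupported:" + e.getMessage() + "@" + handler.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(new BrokerSideHandler(), Command.SUBSCRIBE));
        System.out.println(dispatch(new ProxySideHandler(), Command.SUBSCRIBE));
    }
}
```

In this sketch the exception surfaces in the base class even though no caller invoked the base class explicitly: the command simply arrived at a handler type that never overrode the method.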
I ran this test 1000 times while running a stress test, and I still couldn't reproduce it locally. This is most certainly a heisenbug... |
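For reference, a repeat-under-load run like the one described above can be sketched in plain Java. This is not the harness that was actually used: `countFailures` and its parameters are hypothetical, and the busy-loop threads are only a crude stand-in for a real stress tool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: a hypothetical harness, not the one used for this issue.
class RepeatUnderLoadSketch {

    /** Runs the test body repeatedly while busy threads compete for CPU; returns the failure count. */
    static int countFailures(Runnable test, int iterations, int loadThreads) {
        AtomicBoolean running = new AtomicBoolean(true);
        ExecutorService load = Executors.newFixedThreadPool(Math.max(1, loadThreads));
        for (int i = 0; i < loadThreads; i++) {
            // Busy loop that keeps a core occupied until the run finishes.
            load.submit(() -> {
                long x = 0;
                while (running.get()) {
                    x += System.nanoTime() % 7;
                }
                return x;
            });
        }
        int failures = 0;
        try {
            for (int i = 0; i < iterations; i++) {
                try {
                    test.run();
                } catch (RuntimeException | AssertionError e) {
                    failures++; // count the failure and keep going
                }
            }
        } finally {
            running.set(false);
            load.shutdownNow();
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical test body that fails on every 10th call.
        int failures = countFailures(() -> {
            if (++calls[0] % 10 == 0) {
                throw new IllegalStateException("simulated failure");
            }
        }, 1000, 4);
        System.out.println("failures: " + failures);
    }
}
```

A harness like this only changes CPU contention on one machine, so a failure that depends on CI-specific timing or networking may still never show up locally.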
@jiazhai Any ideas? I'm pretty stumped... |
Here's the chain of events in the logs during the success-case around when the problem occurs:
Here's what it looks like in the failure case:
|
Interestingly, I'm noticing a little more information in the log now for this test:
|
@jiazhai
|
At a closer look, that result only occurred after the first failure. During the first failure, the logs revealed something a little different. With the additional log messages, they looked like this:
So, we reach this line:
After more careful inspection, I also discovered that the UnsupportedOperationException actually alternates (non-deterministically) between the exception farther above (#6332 (comment)) and this one:
There is one key difference between the two methods that trigger the
Notice the first one is
Notably, ProxyConnection, ServerCnx, ClientCnx, ProxyClientCnx, and ServerConnection all extend PulsarHandler, which extends PulsarDecoder. So, perhaps we're calling methods on an object that could be any one of these subtypes, with non-determinism in which one we end up with at some point in the process. |
My hypothesis was right. I added these two methods to ProxyConnection:
because I suspected that these methods were incorrectly getting called on the ProxyConnection type instead of getting forwarded like they're supposed to, and this is what appeared in the log:
The new error message that I added is most instructive:
|
Sure enough, when I set local breakpoints, I never hit those two methods on ProxyConnection because the local tests always pass. So, only in the failing case are the methods getting called on ProxyConnection instead of on ServerCnx. |
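The diagnostic technique above can be illustrated with a small self-contained sketch. These are not the two methods that were actually added to ProxyConnection. `ProxyLikeConnection`, its `State` enum, and the message format are all hypothetical. The idea is only that an override which records its own class, state, and thread before failing makes a misrouted call visible in the log.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical types illustrating a diagnostic override.
class DiagnosticOverrideSketch {

    /** Hypothetical connection states. */
    enum State { INIT, FORWARDING }

    /** Stand-in for the abstract decoder, whose default handler fails without context. */
    abstract static class BaseDecoder {
        void handleSubscribe(String topic) {
            throw new UnsupportedOperationException();
        }
    }

    /** Stand-in for a proxy connection that should never receive this command itself. */
    static class ProxyLikeConnection extends BaseDecoder {
        final List<String> diagnostics = new ArrayList<>();
        State state = State.INIT;

        @Override
        void handleSubscribe(String topic) {
            // Capture which object received the call, in what state, and on which thread.
            String msg = "handleSubscribe reached " + getClass().getSimpleName()
                    + " in state " + state
                    + " for topic " + topic
                    + " on thread " + Thread.currentThread().getName();
            diagnostics.add(msg);
            throw new UnsupportedOperationException(msg);
        }
    }

    public static void main(String[] args) {
        ProxyLikeConnection conn = new ProxyLikeConnection();
        conn.state = State.FORWARDING;
        try {
            conn.handleSubscribe("example-topic");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the override still throws, behavior is unchanged apart from the richer message, which keeps the instrumentation safe to leave in place while hunting an intermittent failure.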
Added awaitility to two pom files. Increased timeouts for state tests. apache#6200 apache#6198 Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014) Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137) Attempt to fix apache#6207 and add more debugging information by pruning docker containers. Fixed typo in docker commands for getting debug info. apache#6207. Removing timeouts as per comments in apache#5333. This is for apache#6202. Fixed timeout issues for CPP tests. apache#6202 and apache#6137 Increased more timeouts. apache#6202 and apache#6137 Fixed typo in CPP test timeout fix. apache#6202 apache#4884 Edited comment to trigger build apache#6202 Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202 Added timeouts back in places where required. Increased timeouts though. apache#6202 Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentially included in ReaderTest. apache#6202 Editing comment to trigger new build. apache#6202 Attempt to workaround test failure. apache#6202 Adding some timeouts back to get beyond hanging tests. apache#6202 Increased sleep value as temporary workaround for thread timeout. apache#6202 Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202 Added back timeout (but made it longer) to prevent hanging test. apache#6202 Fixed formatting since it was breaking the build. apache#6202 Increased more test timeouts to get them to pass on slow hardware. apache#6202 Increased more test timeouts to get them to pass on slow hardware. apache#6202 Edited more test timeouts to get them to pass on slow hardware. apache#6202 Triggering tests due to 'Could not transfer artifact' maven issue. 
apache#6202 Increased or edited timeouts to get more tests to pass. apache#6202 Triggering new build by changing comment. apache#6202 Fixed timeouts (to short timeouts) when null message is expected. apache#6202 Triggering new build by changing comment. apache#6202 Increased timeout. apache#6202 Increased sleep as temporary workaround. apache#6202 Tuned timeouts more. apache#6202 Widening time to force timeout in timeout test. apache#6202 Fixed spelling typo. apache#6202 Added randomization of namespace name. apache#6202 Added random name generator to names of producers, subscriptions, and topics in ClientDeduplicationTest to fix duplicate name conflicts. apache#6202 Fixed issues with duplicate namespaces with repeated test runs. apache#6202 Added randomization to topic name to prevent potential conflicts that might be causing non-determinism in test. apache#6202 Added randomization to namespace name to prevent issues with topics not clearing out before second run of tests. apache#6202 Attempt to get C++ test fixed. It's not clear if this commit will build though... apache#6202 Replaced snake_case with camelCase to try to get c++ format to pass the build. apache#6202 Adding random name to subscription to see if that resolves the fact that this test only fails on the second subsequent run. apache#6202 Fixed timeout issues. apache#6202 Attempting fix of testPerTopicStats() by addressing race condition. apache#6202 Adding some debugging to help troubleshoot flaky test. apache#6202 Removing code that wasn't building anyway. apache#6202 Changed how we're testing Prometheus by filtering the topic name to fix race conditions between test runs and sharing broker state. 
apache#6202 Added more debugging information and fixed assertion apache#6202 Trigger new build apache#6202 Added long timeouts to ensure that broker tests do timeout instead of hanging but without timing out too soon apache#6202 Fixed imports for TimeUnit apache#6202 Fixed imports for TimeUnit apache#6202 Pushing changes to allow discussion on what's happening. apache#6202 Fixed timeouts for the testSharedSingleAckedPartitionedTopic() test. apache#6202 Fixed issue with Prometheus test. apache#6202 Can't use receive with timeout, if the queue size is 0. Fixed InterceptorsTest. apache#6202 Can't use receive with timeout, if the queue size is 0. apache#6202 Fixed Can't use receive with timeout, if the queue size is 0. apache#6202 Edited comment to trigger re-run of all tests to find more flaky tests. apache#6202 Fixed more of the concurrency issue in testPerTopicStats that was causing namespace conflicts. apache#6202 Fixed something I missed during rebasing. apache#6202 Fixed issues with Prometheus tests. apache#6256 Changed MessageId.latest to MessageId.earliest to fix apache#6224 Fixes issue apache#6352 Triggering build to inspect test results. apache#6202 Added timeouts to fix hanging tests. apache#6202 Triggering new build. apache#6202 Updating Github workflow to build surefire artifacts if previous step was cancelled, not just failed. apache#6202 Changing CI Unit Action to always build surefire artifacts to help with debugging hanging test. apache#6202 Triggering new build with arbitrary edit. apache#6202 Triggering build with arbitrary change to comment apache#6202 Triggering new build with arbitrary code change. apache#6202 Triggering new build with arbitrary code change. apache#6202 Changing surefire trigger back to failure() apache#6202 Added surefire artifacts to run always again. apache#6202 Triggering new build. 
apache#6202 Added condition to make testPartitions() more robust during repeated runs apache#6202 Implementing Sijie's suggestion about timeout for persistentTopicsCursorResetAfterReset(..) test. apache#6202 Fixed file that I forgot to merge. apache#6202 Increased robustness of testPartitions() for repeated execution. apache#6202 Added more debugging to ParserProxyHandler's channelRead, changed test from private to public, and decreased test noise. apache#6332 Trying to get more debug info apache#6332 Added more debugging log statements to try to pinpoint where the failure happens. apache#6332 Added more debugging log statements to try to pinpoint where the failure happens. apache#6332 Added even more debugging for tracing purposes. apache#6332 Added even more debugging for tracing purposes. apache#6332 Rolling back unnecessary changes. apache#6202 Rolling back unnecessary changes. apache#6202 Fixed issue with testDeadLetterTopic() where redelivery was getting triggered. apache#6202 Adding more debug information and methods to test hypothesis. apache#6332 Adding keepAlive to ServerConnection to see what that does. apache#6332 Increasing ProxyServer keepAliveInterval to 90 seconds in case it is timing out during server tests. apache#6332 Rolling back changes. apache#6332
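The commit message above mentions replacing Thread.sleep with polling on assertions via Awaitility. The sketch below shows the general idea using only the JDK. It is not Awaitility's API, and `awaitCondition` is a hypothetical helper written for illustration.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Sketch only: a JDK-only stand-in for polling on a condition; not Awaitility's API.
class PollingAwaitSketch {

    /** Polls the condition until it holds or the timeout elapses; returns whether it held. */
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicBoolean ready = new AtomicBoolean(false);
        // Simulated asynchronous work that finishes after a short delay.
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            ready.set(true);
        });
        worker.start();
        boolean ok = awaitCondition(ready::get, Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("condition met: " + ok);
    }
}
```

Compared with a fixed sleep, polling returns as soon as the condition holds on fast hardware and still tolerates slow CI machines up to the timeout.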
Closed as stale, with no recent reports. Please open a new issue if it's still relevant in maintained versions. |
@devinbost Thanks a lot for reporting these flaky tests at the time. Sorry that the community didn't have enough time to follow up on them then. The Pulsar codebase has evolved a lot since, and I have to close these reports because they can be hard to validate against the current codebase. We now have a dedicated issue form for filing flaky-test tickets. If you're sure that some of these test cases are still relevant, feel free to open a new issue. |
Here's what I'm getting periodically in GitHub CI:
I've attached the surefire log output for this test:
org.apache.pulsar.proxy.server.ProxyParserTest-output.txt