Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Flaky-test) Intermittent failure of ProxyParserTest.testRegexSubscription #6332

Closed
devinbost opened this issue Feb 15, 2020 · 15 comments
Closed
Labels
lifecycle/stale type/bug The PR fixed a bug or issue reported a bug

Comments

@devinbost
Copy link
Contributor

Here's what I'm getting periodically in Github CI:

org.apache.pulsar.proxy.server.ProxyParserTest.testRegexSubscription(org.apache.pulsar.proxy.server.ProxyParserTest)
235[INFO] Run 1: PASS
236[ERROR] Run 2: ProxyParserTest.testRegexSubscription:195 » PulsarClient java.util.concurrent....

I've attached the surefire log output for this test:

org.apache.pulsar.proxy.server.ProxyParserTest-output.txt

@devinbost
Copy link
Contributor Author

There's an interesting warning the Github CI logs:

04:57:30.842 [pulsar-proxy-io-39-2:io.netty.channel.DefaultChannelPipeline@1164] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.UnsupportedOperationException: null

Here's the context of this exception:

04:57:30.817 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@141] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:33716] New connection opened
04:57:30.818 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@205] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:33716] complete connection, init proxy handler. authenticated with none role null, hasProxyToBrokerUrl: false
04:57:30.819 [pulsar-client-io-94-2:org.apache.pulsar.client.impl.ConnectionPool@156] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0x06bebf8c, L:/10.1.0.4:33716 - R:fv-az98.h5sctpwyleze3eigx5cvfmuugc.cx.internal.cloudapp.net/10.1.0.4:35639]] Connected to server
04:57:30.834 [pulsar-proxy-io-39-1:org.apache.pulsar.client.impl.ConnectionPool@156] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xee41bec9, L:/127.0.0.1:47482 - R:localhost/127.0.0.1:34991]] Connected to server
04:57:30.837 [pulsar-io-15-3:org.apache.pulsar.broker.service.ServerCnx@184] INFO  org.apache.pulsar.broker.service.ServerCnx - New connection from /127.0.0.1:47482
04:57:30.840 [pulsar-proxy-io-39-2:org.apache.pulsar.proxy.server.ProxyConnection@141] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:33720] New connection opened
04:57:30.841 [pulsar-proxy-io-39-2:org.apache.pulsar.proxy.server.ProxyConnection@205] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:33720] complete connection, init proxy handler. authenticated with none role null, hasProxyToBrokerUrl: false
04:57:30.841 [pulsar-client-io-94-2:org.apache.pulsar.client.impl.ConnectionPool@156] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xdf32e16c, L:/10.1.0.4:33720 - R:fv-az98.h5sctpwyleze3eigx5cvfmuugc.cx.internal.cloudapp.net/10.1.0.4:35639]] Connected to server
04:57:30.842 [pulsar-client-io-94-2:org.apache.pulsar.client.impl.ConsumerImpl@544] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://sample/test/local/topic1][ignored] Subscribing to topic on cnx [id: 0xdf32e16c, L:/10.1.0.4:33720 - R:fv-az98.h5sctpwyleze3eigx5cvfmuugc.cx.internal.cloudapp.net/10.1.0.4:35639]
04:57:30.842 [pulsar-proxy-io-39-2:io.netty.channel.DefaultChannelPipeline@1164] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.UnsupportedOperationException: null
	at org.apache.pulsar.common.protocol.PulsarDecoder.handleSubscribe(PulsarDecoder.java:457) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:216) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
	at org.apache.pulsar.proxy.server.ProxyConnection.channelRead(ProxyConnection.java:174) ~[classes/:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:502) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]

Should this exception be an error, rather than a warning?

@devinbost
Copy link
Contributor Author

@devinbost
Copy link
Contributor Author

devinbost commented Feb 18, 2020

When the test is run by the Github CI, the UnsupportedOperationException seems to suggest that perhaps we're doing something different when using a regex subscription compared to other subscriptions.

@devinbost
Copy link
Contributor Author

It looks like the UnsupportedOperationException is actually getting thrown here in PulsarDecoder:

    protected void handleSubscribe(CommandSubscribe subscribe) {
        throw new UnsupportedOperationException();
    }

(https://github.com/apache/pulsar/blob/master/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/PulsarDecoder.java#L456)

What's the reason this method is getting called only sometimes by this code?

@devinbost
Copy link
Contributor Author

devinbost commented Feb 19, 2020

I learned from @jiazhai that PulsarDecoder is an interface that should not be getting called directly. He pointed out that all interactions should be between ServerCnx and ClientCnx and that ServerCnx implements PulsarDecoder indirectly like this:

public class ServerCnx extends PulsarHandler 
public abstract class PulsarHandler extends PulsarDecoder {

So, something is causing the method to get called on the abstract class directly instead of the correct implementing class. Perhaps there's a type conversion issue happening or something.

@devinbost
Copy link
Contributor Author

devinbost commented Feb 22, 2020

I ran this test 1000 times while running a stress test, and I still couldn't reproduce it locally.
However, it fails quite frequently in the Github CI.

This is most certainly a heisenbug...

@devinbost
Copy link
Contributor Author

@jiazhai Any ideas? I'm pretty stumped...

@devinbost
Copy link
Contributor Author

Here's the chain of events in the logs during the success-case around when the problem occurs:

20:49:21.322 [pulsar-client-io-5029-1:org.apache.pulsar.client.impl.ConsumerImpl@550] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://sample/test/local/topic2][ignored] Subscribing to topic on cnx [id: 0x6b207b92, L:/10.1.0.4:38008 - R:fv-az59.sa2tkz14bfaudpy2zzhb034ywc.cx.internal.cloudapp.net/10.1.0.4:34623]
20:49:21.322 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ParserProxyHandler@78] INFO  org.apache.pulsar.proxy.server.ParserProxyHandler - frontendconn:[/10.1.0.4:38008/10.1.0.4:34623] cmd:SUBSCRIBE msg:{consumer:0486d,topic:persistent://sample/test/local/topic2}
20:49:21.323 [pulsar-io-15-4:org.apache.pulsar.broker.service.ServerCnx@760] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:51438] Subscribing on topic persistent://sample/test/local/topic2 / ignored

Here's what it looks like in the failure case:

20:49:21.343 [pulsar-client-io-5029-1:org.apache.pulsar.client.impl.ConsumerImpl@550] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://sample/test/local/topic1][regex-sub-proxy-parser-test-1582577361325] Subscribing to topic on cnx [id: 0x604df1da, L:/10.1.0.4:38026 - R:fv-az59.sa2tkz14bfaudpy2zzhb034ywc.cx.internal.cloudapp.net/10.1.0.4:34623]
20:49:21.343 [pulsar-proxy-io-39-1:io.netty.channel.DefaultChannelPipeline@1164] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.UnsupportedOperationException: null

devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 24, 2020
…t from private to public, and decreased test noise. apache#6332
@devinbost
Copy link
Contributor Author

Interestingly, I'm noticing a little more information in the log now for this test:

[ERROR] testRegexSubscription(org.apache.pulsar.proxy.server.ProxyParserTest)  Time elapsed: 0.186 s  <<< FAILURE!
org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at fv-az59.d2lglhwbiqvulamcdjih523n0h.cx.internal.cloudapp.net/10.1.0.4:42631
	at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:719)
	at org.apache.pulsar.client.impl.ConsumerBuilderImpl.subscribe(ConsumerBuilderImpl.java:98)
	at org.apache.pulsar.proxy.server.ProxyParserTest.testRegexSubscription(ProxyParserTest.java:209)
	at sun.reflect.GeneratedMethodAccessor115.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:54)
	at org.testng.internal.InvokeMethodRunnable.run(InvokeMethodRunnable.java:44)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.PulsarClientException: Disconnected from server at fv-az59.d2lglhwbiqvulamcdjih523n0h.cx.internal.cloudapp.net/10.1.0.4:42631
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at org.apache.pulsar.client.impl.ConsumerBuilderImpl.subscribe(ConsumerBuilderImpl.java:96)

@devinbost
Copy link
Contributor Author

@jiazhai
I added a lot more debugging statements, and I discovered that the UnsupportedOperationException is occurring immediately prior to:
cnx.sendRequestWithId(request, requestId).thenRun(() -> {
in ConsumerImpl.connectionOpened(), so in the code block below, the line log.info("ConsumerImpl.connectionOpened(..)22"); never gets hit.

        // startMessageRollbackDurationInSec should be consider only once when consumer connects to first time
        long startMessageRollbackDuration = (startMessageRollbackDurationInSec > 0
                && startMessageId.equals(initialStartMessageId)) ? startMessageRollbackDurationInSec : 0;
        log.info("ConsumerImpl.connectionOpened(..)09");
        ByteBuf request = Commands.newSubscribe(topic, subscription, consumerId, requestId, getSubType(), priorityLevel,
                consumerName, isDurable, startMessageIdData, metadata, readCompacted,
                conf.isReplicateSubscriptionState(), InitialPosition.valueOf(subscriptionInitialPosition.getValue()),
        startMessageRollbackDuration, si, createTopicIfDoesNotExist, conf.getKeySharedPolicy());
        log.info("ConsumerImpl.connectionOpened(..)10");
        if (startMessageIdData != null) {
            startMessageIdData.recycle();
        }
        log.info("ConsumerImpl.connectionOpened(..)11");

        cnx.sendRequestWithId(request, requestId).thenRun(() -> {
            log.info("ConsumerImpl.connectionOpened(..)22");
            synchronized (ConsumerImpl.this) {

devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 25, 2020
@devinbost
Copy link
Contributor Author

devinbost commented Feb 25, 2020

At a closer look, that result only occurred after the first failure. During the first failure, the logs revealed something a little different.

With the additional log messages, they looked like this:

04:35:52.434 [pulsar-client-io-304-1:org.apache.pulsar.client.impl.MultiTopicsConsumerImpl@882] INFO  org.apache.pulsar.client.impl.MultiTopicsConsumerImpl - [MultiTopicsConsumer-70113] [regex-sub-proxy-parser-test-1582605352335] Success subscribe new topic persistent://sample/test/local/topic2 in topics consumer, partitions: 0, allTopicPartitionsNumber: 2
04:35:52.436 [pulsar-client-io-304-1:org.apache.pulsar.client.impl.ConsumerImpl@632] INFO  org.apache.pulsar.client.impl.ConsumerImpl - ConsumerImpl.connectionOpened(..)15
04:35:52.438 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@122] INFO  org.apache.pulsar.proxy.server.ProxyConnection - ProxyConnection.channelRegistered(..)01
04:35:52.438 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@124] INFO  org.apache.pulsar.proxy.server.ProxyConnection - ProxyConnection.channelRegistered(..)02
04:35:52.438 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@143] INFO  org.apache.pulsar.proxy.server.ProxyConnection - ProxyConnection.channelActive(..)01
04:35:52.438 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@145] INFO  org.apache.pulsar.proxy.server.ProxyConnection - ProxyConnection.channelActive(..)02
04:35:52.438 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@147] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:40490] New connection opened
04:35:52.439 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@211] INFO  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:40490] complete connection, init proxy handler. authenticated with none role null, hasProxyToBrokerUrl: false
04:35:52.440 [pulsar-client-io-304-1:org.apache.pulsar.client.impl.ConnectionPool@156] INFO  org.apache.pulsar.client.impl.ConnectionPool - [[id: 0xc9ddb5f2, L:/10.1.0.4:40490 - R:fv-az60.ct13ep5w0hzutbi1ytkrtqhxsg.cx.internal.cloudapp.net/10.1.0.4:37359]] Connected to server
04:35:52.440 [pulsar-client-io-304-1:org.apache.pulsar.client.impl.ProducerImpl@1067] INFO  org.apache.pulsar.client.impl.ProducerImpl - [persistent://sample/test/local/topic1] [null] Creating producer on cnx [id: 0xc9ddb5f2, L:/10.1.0.4:40490 - R:fv-az60.ct13ep5w0hzutbi1ytkrtqhxsg.cx.internal.cloudapp.net/10.1.0.4:37359]
04:35:52.440 [pulsar-proxy-io-39-1:io.netty.channel.DefaultChannelPipeline@1164] WARN  io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.UnsupportedOperationException: null

So, we reach this line:
ConsumerImpl.connectionOpened(..)15
which implies that ConsumerImpl.connectionOpened(..) was able to successfully run to completion.

After more careful inspection, I also discovered that the UnsupportedOperationException actually alternates (non-deterministically) between the exception farther above (#6332 (comment)) and this one:

04:35:48.906 [pulsar-proxy-io-39-1:org.apache.pulsar.proxy.server.ProxyConnection@168] WARN  org.apache.pulsar.proxy.server.ProxyConnection - [/10.1.0.4:40182] Got exception UnsupportedOperationException : null
java.lang.UnsupportedOperationException: null
	at org.apache.pulsar.common.protocol.PulsarDecoder.handleProducer(PulsarDecoder.java:461) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:189) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
	at org.apache.pulsar.proxy.server.ProxyConnection.channelRead(ProxyConnection.java:180) ~[classes/:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:502) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407) [netty-transport-native-epoll-4.1.43.Final-linux-x86_64.jar:4.1.43.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.43.Final.jar:4.1.43.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]

There is one key difference between the two methods that trigger the java.lang.UnsupportedOperationException: null result:

  1. at org.apache.pulsar.common.protocol.PulsarDecoder.handleProducer(PulsarDecoder.java:461) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]
  2. at org.apache.pulsar.common.protocol.PulsarDecoder.handleSubscribe(PulsarDecoder.java:457) ~[pulsar-common-2.6.0-SNAPSHOT.jar:2.6.0-SNAPSHOT]

Notice the first one is handleProducer and the second one is handleSubscribe. This may be an important clue to identifying the cause of this problem.

Notably, ProxyConnection, ServerCnx, ClientCnx, ProxyClientCnx, and ServerConnection all extend PulsarHandler, which extends PulsarDecoder. So, perhaps we're trying to call methods on an object that could be one or the other of that type. Perhaps there's non-determinism in which of these objects we end up with at some point in the process.

devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 25, 2020
@devinbost
Copy link
Contributor Author

My hypothesis was right.

I added these two methods to ProxyConnection:

@Override
protected void handleSubscribe(PulsarApi.CommandSubscribe subscribe) {
    LOG.info("ERROR: We are calling handleSubscribe(..) on ProxyConnection! This shouldn't happen!");
    StringWriter sw = new StringWriter();
    new Throwable("").printStackTrace(new PrintWriter(sw));
    String stackTrace = sw.toString();
    LOG.info("Stacktrace is: " + stackTrace);
    super.handleSubscribe(subscribe);
}
@Override
protected void handleProducer(PulsarApi.CommandProducer producer) {
    LOG.info("ERROR: We are calling handleProducer(..) on ProxyConnection! This shouldn't happen!");
    StringWriter sw = new StringWriter();
    new Throwable("").printStackTrace(new PrintWriter(sw));
    String stackTrace = sw.toString();
    LOG.info("Stacktrace is: " + stackTrace);
    super.handleProducer(producer);
}

because I suspected that these methods were incorrectly getting called on the ProxyConnection type instead of getting forwarded like they're supposed to, and this is what appeared in the log:

07:03:06.668 [pulsar-proxy-io-39-2:org.apache.pulsar.proxy.server.ProxyConnection@391] INFO  org.apache.pulsar.proxy.server.ProxyConnection - ERROR: We are calling handleSubscribe(..) on ProxyConnection! This shouldn't happen!
07:03:06.668 [pulsar-proxy-io-39-2:org.apache.pulsar.proxy.server.ProxyConnection@395] INFO  org.apache.pulsar.proxy.server.ProxyConnection - Stacktrace is: java.lang.Throwable: 
	at org.apache.pulsar.proxy.server.ProxyConnection.handleSubscribe(ProxyConnection.java:393)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:216)
	at org.apache.pulsar.proxy.server.ProxyConnection.channelRead(ProxyConnection.java:182)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:502)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

The new error message that I added is most instructive:

org.apache.pulsar.proxy.server.ProxyConnection - ERROR: We are calling handleSubscribe(..) on ProxyConnection! This shouldn't happen!

@devinbost
Copy link
Contributor Author

devinbost commented Feb 25, 2020

Sure enough, when I set local breakpoints, I never hit those two methods on ProxyConnection because the local tests always pass. So, only in the failing case are the methods getting called on ProxyConnection instead of on ServerCnx.

devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 26, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 26, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this issue Feb 26, 2020
devinbost pushed a commit to devinbost/pulsar that referenced this issue Mar 13, 2020
Added awaitility to two pom files.

Increased timeouts for state tests. apache#6200 apache#6198

Increased timeouts to testSimpleConsumerEventsWithoutPartition and introduced await to poll on assertions to eliminate use of Thread.sleep in several places. (apache#6014)

Attempting to fix testPulsarKafkaProducerWithSerializer issue by adding await to test. (apache#6137)

Attempt to fix apache#6207 and add more debugging information by pruning docker containers.

Fixed typo in docker commands for getting debug info. apache#6207.

Removing timeouts as per comments in apache#5333. This is for apache#6202.

Fixed timeout issues for CPP tests. apache#6202 and apache#6137

Increased more timeouts. apache#6202 and apache#6137

Fixed typo in CPP test timeout fix. apache#6202  apache#4884

Edited comment to trigger build apache#6202

Rolled back changes to PulsarSpoutTest because fixing some instability broke two of the tests that depend on timeout configurations. Those changes will require more investigation. apache#6202

Added timeouts back in places where required. Increased timeouts though. apache#6202

Fixed timeouts for Storm and Kafka tests. Also removed debug block that was accidentially included in ReaderTest. apache#6202

Editing comment to trigger new build. apache#6202

Attempt to workaround test failure. apache#6202

Adding some timeouts back to get beyond hanging tests. apache#6202

Increased sleep value as temporary workaround for thread timeout. apache#6202

Added back timeouts to fix hang but increased timeouts from 1s to 5s. apache#6202

Added back timeout (but made it longer) to prevent hanging test. apache#6202

Fixed formatting since it was breaking the build. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Increased more test timeouts to get them to pass on slow hardware. apache#6202

Edited more test timeouts to get them to pass on slow hardware. apache#6202

Triggering tests due to 'Could not transfer artifact' maven issue. apache#6202

Increased or edited timeouts to get more tests to pass. apache#6202

Triggering new build by changing comment. apache#6202

Fixed timeouts (to short timeouts) when null message is expected. apache#6202

Triggering new build by changing comment. apache#6202

Increased timeout. apache#6202

Increased sleep as temporary workaround. apache#6202

Tuned timeouts more. apache#6202

Widening time to force timeout in timeout test. apache#6202

Fixed spelling typo. apache#6202

Added randomization of namespace name. apache#6202

Added random name generator to names of producers, subscriptions, and topics in ClientDeduplicationTest to fix duplicate name conflicts. apache#6202

Fixed issues with duplicate namespaces with repeated test runs. apache#6202

Added randomization to topic name to prevent potential conflicts that might be causing non-determinism in test. apache#6202

Added randomization to namespace name to prevent issues with topics not clearing out before second run of tests. apache#6202

Attempt to get C++ test fixed. It's not clear if this commit will build though... apache#6202

Replaced snake_case with camelCase to try to get c++ format to pass the build. apache#6202

Adding random name to subscription to see if that resolves the fact that this test only fails on the second subsequent run. apache#6202

Fixed timeout issues. apache#6202

Attempting fix of testPerTopicStats() by addressing race condition. apache#6202

Adding some debugging to help troubleshoot flaky test. apache#6202

Removing code that wasn't building anyway. apache#6202

Changed how we're testing Prometheus by filtering the topic name to fix race conditions between test runs and sharing broker state. apache#6202

Added more debugging information and fixed assertion apache#6202

Trigger new build apache#6202

Added long timeouts to ensure that broker tests do timeout instead of hanging but without timing out too soon apache#6202

Fixed imports for TimeUnit apache#6202

Fixed imports for TimeUnit apache#6202

Pushing changes to allow discussion on what's happening. apache#6202

Fixed timeouts for the testSharedSingleAckedPartitionedTopic() test. apache#6202

Fixed issue with Prometheus test. apache#6202

Can't use receive with timeout, if the queue size is 0. Fixed InterceptorsTest. apache#6202

Can't use receive with timeout, if the queue size is 0. apache#6202

Fixed Can't use receive with timeout, if the queue size is 0. apache#6202

Edited comment to trigger re-run of all tests to find more flaky tests. apache#6202

Fixed more of the concurrency issue in testPerTopicStats that was causing namespace conflicts. apache#6202

Fixed something I missed during rebasing. apache#6202

Fixed issues with Prometheus tests. apache#6256

Changed MessageId.latest to MessageId.earliest to fix apache#6224

Fixes issue apache#6352

Triggering build to inspect test results. apache#6202

Added timeouts to fix hanging tests. apache#6202

Triggering new build. apache#6202

Updating Github workflow to build surefire artifacts if previous step was cancelled, not just failed. apache#6202

Changing CI Unit Action to always build surefire artifacts to help with debugging hanging test. apache#6202

Triggering new build with arbitrary edit. apache#6202

Triggering build with arbitrary change to comment apache#6202

Triggering new build with arbitrary code change. apache#6202

Triggering new build with arbitrary code change. apache#6202

Changing surefire trigger back to failure() apache#6202

Added surefire artifacts to run always again. apache#6202

Triggering new build. apache#6202

Added condition to make testPartitions() more robust during repeated runs apache#6202

Implementing Sijie's suggestion about timeout for persistentTopicsCursorResetAfterReset(..) test. apache#6202

Fixed file that I forgot to merge. apache#6202

Increased robustness of testPartitions() for repeated execution. apache#6202

Added more debugging to ParserProxyHandler's channelRead, changed test from private to public, and decreased test noise. apache#6332

Trying to get more debug info apache#6332

Added more debugging log statements to try to pinpoint where the failure happens. apache#6332

Added more debugging log statements to try to pinpoint where the failure happens. apache#6332

Added even more debugging for tracing purposes. apache#6332

Added even more debugging for tracing purposes. apache#6332

Rolling back unnecessary changes. apache#6202

Rolling back unnecessary changes. apache#6202

Fixed issue with testDeadLetterTopic() where redelivery was getting triggered. apache#6202

Adding more debug information and methods to test hypothesis. apache#6332

Adding keepAlive to ServerConnection to see what that does. apache#6332

Increasing ProxyServer keepAliveInterval to 90 seconds in case it is timing out during server tests. apache#6332

Rolling back changes. apache#6332
@tisonkun
Copy link
Member

Closed as stale and no recent report. Please open a new issue if it's still relevant in maintained versions.

@tisonkun tisonkun closed this as not planned Won't fix, can't repro, duplicate, stale Nov 14, 2022
@tisonkun
Copy link
Member

@devinbost Thanks a lot for reporting these flaky tests at that moment. Sorry that the community doesn't have enough time to catch it up then.

As time goes on, the Pulsar codebase evolved a lot and I have to close these reports since it can be hard to validate to the current codebase. Now we have a dedicated issue form to file a ticket for flaky tests. If you're sure that several test cases are still relevant, feel free to open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

4 participants