Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes bindOnLocalhost=boolean. Adds bindAddress and advertisedAddress. #26

Merged
merged 20 commits into from
Sep 26, 2016
Merged

Removes bindOnLocalhost=boolean. Adds bindAddress and advertisedAddress. #26

merged 20 commits into from
Sep 26, 2016

Conversation

radekg
Copy link
Contributor

@radekg radekg commented Sep 20, 2016

Motivation

The motivation has been documented and discussed in #23.

Modifications

Adds bindAddress and advertisedAddress configuration settings.
Removes bindOnLocalhost configuration setting.

Result

The user can how set the address the service will bind on, for example: 0.0.0.0. The user can also advertise a different address via ZooKeeper. The latter setting makes it much easier to setup Pulsar in environments such as Amazon EC2.

…Address.

@yahoocla
Copy link

CLA is valid!

# Control whether to bind directly on localhost rather than on normal hostname
bindOnLocalhost=false
# Hostname or IP address the service binds on, default is InetAddress.getLocalHost().getHostName().
bindAddress=
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving empty for clarity.

Copy link
Contributor

@merlimat merlimat Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 0.0.0.0 instead of hostname by default?

Copy link
Contributor Author

@radekg radekg Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that's a personal preference. Some people like things wide open by default, some do not. I don't mind making it 0.0.0.0.

bindAddress=

# Hostname or IP address the service advertises to the outside world. If not set, the value of bindAddress is used.
advertisedAddress=
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving empty for clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the bind address was 0.0.0.0, then defaulting to bind address might not work. Should we default to hostname for advertisedAddress ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3745033

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the comment with "default to hostname"

@@ -5,7 +5,8 @@ brokerServicePort=6650
brokerServicePortTls=6651
webServicePort=8080
webServicePortTls=4443
bindOnLocalhost=false
bindAddress=
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left empty for clarity.

@radekg radekg changed the title Changes bindOnLocalhost=boolean to bindAddress. Introduces advertised… Removes bindOnLocalhost=boolean. Adds bindAddress and advertisedAddress Sep 20, 2016
@radekg radekg changed the title Removes bindOnLocalhost=boolean. Adds bindAddress and advertisedAddress Removes bindOnLocalhost=boolean. Adds bindAddress and advertisedAddress. Sep 20, 2016
@radekg
Copy link
Contributor Author

radekg commented Sep 20, 2016

It might be worth adding advertised ports settings for services. 100% in the short term someone is going to ask for it when setting Pulsar in Mesos with Docker.

@radekg
Copy link
Contributor Author

radekg commented Sep 20, 2016

It isn't clear to me how the ServiceConfiguration is populated from file. That might be the only missing link.

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me. I think the only part missing is to enforce the bindAddress settings by have sockets effectively binding on the desired IP/Host. Before, the bindOnLocalhost was just a trick used for unit tests to advertise localhost instead of the real hostname, but that was not really affecting the "bind" IP address.

For example, https://github.com/yahoo/pulsar/blob/master/pulsar-broker/src/main/java/com/yahoo/pulsar/broker/web/WebService.java#L87 we'd have to add a connector.setHost() to use it.

Same thing with the Netty server socket, at https://github.com/yahoo/pulsar/blob/master/pulsar-broker/src/main/java/com/yahoo/pulsar/broker/service/BrokerService.java#L223 we need to specify the bindAddress.

# Control whether to bind directly on localhost rather than on normal hostname
bindOnLocalhost=false
# Hostname or IP address the service binds on, default is InetAddress.getLocalHost().getHostName().
bindAddress=
Copy link
Contributor

@merlimat merlimat Sep 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be 0.0.0.0 instead of hostname by default?

bindAddress=

# Hostname or IP address the service advertises to the outside world. If not set, the value of bindAddress is used.
advertisedAddress=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the bind address was 0.0.0.0, then defaulting to bind address might not work. Should we default to hostname for advertisedAddress ?

bindAddress=

# Hostname or IP address the service advertises to the outside world. If not set, the value of bindAddress is used.
advertisedAddress=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Web socket server component doesn't need to advertise itself

@@ -291,12 +301,32 @@ public void setWebServicePortTls(int webServicePortTls) {
this.webServicePortTls = webServicePortTls;
}

public boolean isBindOnLocalhost() {
return bindOnLocalhost;
public String getBindAddress() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferably, the ServiceConfiguration should be a simple POJO. This initialization of the bind address is potentially at risk if multiple threads are initializing. It'd be better to return null (or even better Optional<String>) and have the caller cope with the missing value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to avoid throws on this method as the caller would have to require try / catch everywhere where this is used. Unless we can settle on RuntimeException or IllegalArgumentException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, scratch that, that's a silly idea. I'll just re-throw the exception and handle where used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 3745033

Copy link
Contributor Author

@radekg radekg Sep 21, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat Why would multiple threads try to initialize a single instance of ServiceConfiguration? Is the service configuration stored somewhere as static?

LOG.debug("bindAddress not set, attempting default configuration.");
this.setBindAddress(InetAddress.getLocalHost().getHostName());
} catch (UnknownHostException ex) {
LOG.warn(ex.getMessage(), "Using localhost as bindAddress.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just re-trow an unchecked exception to have the broker exit if the hostname is invalid. Anyway it will probably not reachable on localhost.

this.bindAddress = bindAddress;
}

public String getAdvertisedAddress() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should just return what was configured, and in any case, the fallback here should be on hostname rather than the bind address.

LOG.error(e.getMessage(), e);
throw new IllegalStateException("failed to find host", e);
}
return config.getBindAddress();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 methods should have the above logic to check whether bindAddress and advertisedAddress were configured.

@merlimat merlimat added this to the 1.15 milestone Sep 20, 2016
@merlimat merlimat added the type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages label Sep 20, 2016
@merlimat
Copy link
Contributor

It might be worth adding advertised ports settings for services. 100% in the short term someone is going to ask for it when setting Pulsar in Mesos with Docker.

If you're in the mood, I'm all for it! 😄

It isn't clear to me how the ServiceConfiguration is populated from file. That might be the only missing link.

Ok, that's a bit mysterious, but there's a ServiceConfigurationLoader class that uses reflections to associate the ServiceConfiguration fields to variables in the config file.

@radekg
Copy link
Contributor Author

radekg commented Sep 20, 2016

@merlimat Addressed the comments related to bindAddress / advertisedAddress configuration setup. Web socket and broker now bind on the bindAddress.

@merlimat
Copy link
Contributor

@rdhabalia Can you take a look as well?

// If not set, InetAddress.getLocalHost().getHostName() will be used.
private String bindAddress;
// Controls which hostname is advertised to the discovery service via ZooKeeper.
// If not set, bindAddress is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is out of date now, since it defaults to hostname

# Hostname or IP address the service binds on, default is InetAddress.getLocalHost().getHostName().
bindAddress=

# Hostname or IP address the service advertises to the outside world. If not set, the value of bindAddress is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default to hostname

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, thank you!

@radekg
Copy link
Contributor Author

radekg commented Sep 20, 2016

@merlimat I'm looking at travis, here: https://travis-ci.org/yahoo/pulsar/builds/161469038, I've been noticing similar situation on my machine:

Running TestSuite
2016-09-20 22:04:02,984 - INFO  - [main:MockedBookKeeperTestCase@65] - >>>>>> starting void com.yahoo.pulsar.broker.service.PersistentMessageFinderTest.testPersistentMessageFinder() throws java.lang.Exception
2016-09-20 22:04:05,505 - INFO  - [main:MockedBookKeeperTestCase@81] - @@@@@@@@@ stopping void com.yahoo.pulsar.broker.service.PersistentMessageFinderTest.testPersistentMessageFinder() throws java.lang.Exception
2016-09-20 22:04:05,543 - INFO  - [main:MockedBookKeeperTestCase@88] - --------- stopped void com.yahoo.pulsar.broker.service.PersistentMessageFinderTest.testPersistentMessageFinder() throws java.lang.Exception
Exception in thread "Thread-7" java.lang.AssertionError: expected [false] but found [true]
    at org.testng.Assert.fail(Assert.java:94)
    at org.testng.Assert.failNotEquals(Assert.java:494)
    at org.testng.Assert.assertFalse(Assert.java:63)
    at org.testng.Assert.assertFalse(Assert.java:73)
    at com.yahoo.pulsar.broker.service.PersistentTopicTest$5.run(PersistentTopicTest.java:483)
Sep 20, 2016 10:04:09 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
Sep 20, 2016 10:04:10 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
Sep 20, 2016 10:04:11 PM org.glassfish.jersey.server.ApplicationHandler initialize
INFO: Initiating Jersey application, version Jersey: 2.8 2014-04-29 01:25:26...
2016-09-20 22:06:18,260 - ERROR - [main:Slf4JLogger@171] - LEAK: HashedWheelTimer.release() was not called before it's garbage-collected. See http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
    io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:241)
    io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:192)
    io.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:174)
    com.yahoo.pulsar.client.impl.PulsarClientImpl.<init>(PulsarClientImpl.java:109)
    com.yahoo.pulsar.client.impl.PulsarClientImpl.<init>(PulsarClientImpl.java:82)
    com.yahoo.pulsar.client.api.PulsarClient.create(PulsarClient.java:55)
    com.yahoo.pulsar.broker.service.BacklogQuotaManagerTest.testConcurrentAckAndEviction(BacklogQuotaManagerTest.java:205)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:483)
    org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)
    org.testng.internal.Invoker.invokeMethod(Invoker.java:714)
    org.testng.internal.Invoker.invokeTestMethod(Invoker.java:901)
    org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1231)
    org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:127)
    org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:111)
    org.testng.TestRunner.privateRun(TestRunner.java:767)
    org.testng.TestRunner.run(TestRunner.java:617)
    org.testng.SuiteRunner.runTest(SuiteRunner.java:334)
    org.testng.SuiteRunner.runSequentially(SuiteRunner.java:329)
    org.testng.SuiteRunner.privateRun(SuiteRunner.java:291)
    org.testng.SuiteRunner.run(SuiteRunner.java:240)
    org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
    org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:86)
    org.testng.TestNG.runSuitesSequentially(TestNG.java:1224)
    org.testng.TestNG.runSuitesLocally(TestNG.java:1149)
    org.testng.TestNG.run(TestNG.java:1057)
    org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:132)
    org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeMulti(TestNGDirectoryTestSuite.java:193)
    org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:94)
    org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:147)
    org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:290)
    org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:242)
    org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:121)
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

I believe none of my changes are responsible for this, obviously I could be wrong. I do not remember having such earlier today, when I started working on this PR. Do you know what happened in the meantime?

Copy link
Contributor

@rdhabalia rdhabalia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor thing. but looks good to me.

public void setup() throws Exception {
bkEnsemble = new LocalBookkeeperEnsemble(3, ZOOKEEPER_PORT, 5001);
bkEnsemble.start();
ServiceConfiguration config = create(new Properties(System.getProperties()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we are not loading ServiceConfiguration from map/properties: it is fine to directly instantiate
ServiceConfiguration config = new ServiceConfiguration()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@merlimat
Copy link
Contributor

@radekg I think the test issue might be related to the fact that without specifying the IP at bind time, we were always binding on 0.0.0.0. Now, we're specifying the bind address, by passing the hostname, so 127.0.0.1 will be excluded. There might be few unit tests that weren't passing the bindOnLocalhost=true, that are affected.

@radekg
Copy link
Contributor Author

radekg commented Sep 20, 2016

The default binding is still InetAddress.getLocalHost().getHostName() (https://github.com/yahoo/pulsar/pull/26/files#diff-5ff5807c8da7704d2729f62605e53e88R571) which should by default go to localhost. What I see locally, where there is no 10 minute timeout, is that the test continues just fine but it takes about 20 minutes just in that step.

When running these tests locally, the progress can be seen with lsof -i -n -P | grep TCP | grep java | wc -l. The number changes, the test is much, much slower though. I'll give it a shot after a few hours of sleep.

@radekg
Copy link
Contributor Author

radekg commented Sep 26, 2016

@rdhabalia that does the job. This is now passing the tests. Thank you! Can we review again?
I was also wondering if it isn't worth changing the PulsarService.getHost() to a little bit more descriptive PulsarService.getBindAddress(). What do you think?

@radekg
Copy link
Contributor Author

radekg commented Sep 26, 2016

@radekg
Copy link
Contributor Author

radekg commented Sep 26, 2016

The failure seems to be one of those intermittent issues. It successfully ran here: https://travis-ci.org/radekg/pulsar/builds/162817512. I guess the build needs to be restarted?

Copy link
Contributor

@merlimat merlimat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good, thanks for going after all test cases. Just one very minor comment

@@ -565,41 +567,56 @@ public MessagingServiceShutdownHook getShutdownService() {
/**
* Derive the host
*
* @param isBindOnLocalhost
* @return
* @return Hostname or IP address the service binds on.
*/
public static String host(ServiceConfiguration config) {
Copy link
Contributor

@merlimat merlimat Sep 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this host(...) method could probably be removed now, given that we can directly use config.getBindAddress() which defaults to a non-null value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat is it okay to change getHost() to getBindAddress()? Also, would you like me to rebase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure getBindAddress() sounds good, but we probably don't even need to have the static method, since we could just call config.getBindAddress() as it is.

For rebasing, I will squash it anyway when merging. I don't expect particular conflicts with current master.

…rService.getHost() to PulsarService.getBindAddress().
@merlimat
Copy link
Contributor

Nice work!

@merlimat merlimat merged commit 8df6b74 into apache:master Sep 26, 2016
@radekg
Copy link
Contributor Author

radekg commented Sep 26, 2016

Thank you!

sijie pushed a commit to sijie/pulsar that referenced this pull request Mar 4, 2018
* Create pulsar-functions module (#1)

* Create pulsar-functions module

* rename `sdk` package to `api`

* Added the first cut of the Java interface for Pulsar functions (#2)

* Introduced the concept of systemException to differentiate from userException
massakam pushed a commit to massakam/pulsar that referenced this pull request Aug 6, 2020
Add Producer/Consumer/EndToEnd Tests
sijie pushed a commit that referenced this pull request Aug 6, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" #26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Aug 15, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Sep 5, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" apache#26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.
wolfstudy pushed a commit that referenced this pull request Oct 30, 2020
### Motivation

When a broker server is restarted, the Pulsar proxy outputs the following warning:

```
15:09:01.486 [main-EventThread] INFO  o.a.p.z.ZooKeeperChildrenCache       - reloadCache called in zookeeperChildrenCache for path /loadbalance/brokers
15:09:31.487 [main-EventThread] WARN  o.a.p.p.s.util.ZookeeperCacheLoader  - Error updating broker info after broker list changed.
java.util.concurrent.TimeoutException: null
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:118)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:82)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

This is due to a deadlock occurring in the proxy. While this deadlock occurs, `main-EventThread` is stopped in the following states:
```
"main-EventThread" #26 daemon prio=5 os_prio=0 tid=0x00007fa9b55ff000 nid=0x3d98 waiting on condition [0x00007fa923dfc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eef2ad80> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
        at org.apache.pulsar.zookeeper.ZooKeeperDataCache.get(ZooKeeperDataCache.java:97)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.updateBrokerList(ZookeeperCacheLoader.java:123)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader.lambda$new$0(ZookeeperCacheLoader.java:84)
        at org.apache.pulsar.proxy.server.util.ZookeeperCacheLoader$$Lambda$40/1450652220.onUpdate(Unknown Source)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache.lambda$0(ZooKeeperChildrenCache.java:89)
        at org.apache.pulsar.zookeeper.ZooKeeperChildrenCache$$Lambda$365/191748085.accept(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
        at org.apache.pulsar.zookeeper.ZooKeeperCache.lambda$22(ZooKeeperCache.java:415)
        at org.apache.pulsar.zookeeper.ZooKeeperCache$$Lambda$46/1819738265.processResult(Unknown Source)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```

`ZookeeperCacheLoader` will try to get load information for all available brokers from ZK or its cache if there is a change in the znode `/loadbalance/brokers`.

The problem is that it is `main-EventThread` that performs the above process, and it is also `main-EventThread` that gets the data from ZK.

`main-EventThread` is blocked at the following line, so deadlock occurs.
https://github.com/apache/pulsar/blob/9d3ab60efddd3f4064789293fd8e9c4d4388b6c7/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperDataCache.java#L97

### Modifications

~Make the method `ZookeeperCacheLoader#updateBrokerList` run on a thread separate from `main-EventThread`.~

Made `updateBrokerList()` method asynchronous so that `main-EventThread` is not blocked.

(cherry picked from commit d7c1cfa)
hangc0276 pushed a commit to hangc0276/pulsar that referenced this pull request May 26, 2021
1. Add kafka-examples to easy verify broker.
2. update readme.
aloyszhang pushed a commit to aloyszhang/pulsar that referenced this pull request Aug 5, 2022
update log4j dependency to 2.17.0
Fixes apache#26 

### Motivation


*Explain here the context, and why you're making that change. What is the problem you're trying to solve.*

### Modifications

*Describe the modifications you've done.*

### Verifying this change

- [ ] Make sure that the change passes the CI checks.

*(Please pick either of the following options)*

This change is a trivial rework / code cleanup without any test coverage.

*(or)*

This change is already covered by existing tests, such as *(please describe tests)*.

*(or)*

This change added tests and can be verified as follows:

*(example:)*
  - *Added integration tests for end-to-end deployment with large payloads (10MB)*
  - *Extended integration test for recovery after broker failure*

### Does this pull request potentially affect one of the following parts:

*If `yes` was chosen, please highlight the chan
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
dragonls pushed a commit to dragonls/pulsar that referenced this pull request Oct 21, 2022
nodece added a commit that referenced this pull request Oct 29, 2024
…plication (#26)

* [feat][broker] Add ResourceGroup rate limit for topic dispatch and replication

Signed-off-by: Zixuan Liu <nodeces@gmail.com>

* [fix][broker] Fix ownership loss

Signed-off-by: Zixuan Liu <nodeces@gmail.com>

* Fix test

Signed-off-by: Zixuan Liu <nodeces@gmail.com>

---------

Signed-off-by: Zixuan Liu <nodeces@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The enhancements for the existing features or docs. e.g. reduce memory usage of the delayed messages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants