
Standalone broker on macOS often fails to start after workstation sleep or reboot #15401

Open
zbentley opened this issue Apr 30, 2022 · 12 comments
Labels
Stale, type/bug (reported a bug)

Comments

@zbentley
Contributor

Describe the bug
Often, but not always, if I start Pulsar standalone on my workstation and then close/reopen my laptop lid or restart the computer, subsequent attempts to start the standalone broker fail. The JVM process keeps running and emits some errors in its logs (see attachment), but never reaches a connectible state.

Repeatedly restarting the broker usually causes this condition to go away. Sometimes that doesn't work, and I have to remove all broker data files (i.e. reinstall Pulsar) to make it start.

This only seems to happen when publishes have been recorded to a persistent topic.

This does not happen when using Pulsar standalone in Docker, or on my Linux laptop. It appears to be macOS-specific.

To Reproduce

  1. brew install apache-pulsar
  2. brew services start apache-pulsar
  3. Verify that Pulsar is connectible on localhost:6650.
  4. Publish some messages to a persistent topic such that the message data and the topic itself are not automatically deleted (i.e. at least one ledger is created and persists).
  5. Close laptop lid or otherwise induce sleep.
  6. Reopen laptop and attempt to connect to Pulsar; it will sometimes fail.
  7. brew services restart apache-pulsar and attempt to connect to Pulsar; it will sometimes fail.
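
Steps 3, 6, and 7 boil down to a TCP connectibility probe against the broker's binary port. A minimal sketch in Python (`is_port_open` is a hypothetical helper for illustration, not part of Pulsar; 6650 is the default broker port):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the repro above, check this after step 3, and again after resume/restart.
print(is_port_open("localhost", 6650))
```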

Expected behavior

  1. Sleep/resume should not break Pulsar. It should either be available after resumption, or the service/JVM process should fully crash so that it reports not-running if it cannot handle connection requests after resumption.
  2. Restarting the service should result in Pulsar either being connectible or not having a JVM process. If Pulsar Standalone can't come up in a connectible state, its process shouldn't hang around (if the process stays around, service managers think it's still running and thus won't report a failure or auto-restart it).

Desktop (please complete the following information):

  • OS: macOS Monterey, Intel.
  • Pulsar standalone 2.9.2 and 2.9.1 both exhibit this issue.

Additional context
brew services is likely not the problem. It doesn't do anything special, and Pulsar does sometimes work when restarted via brew services.

Attached is a copy of my logs from a broker start attempt that did not become connectible after 5min.
nostart.log

@github-actions

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label May 31, 2022
@zbentley
Contributor Author

Anecdotally, things seem better after 2.10. The broker still regularly crashes on workstation sleep, but when restarted it at least seems to come up a lot more reliably. Was any work done in 2.10 that may have improved things?

@github-actions github-actions bot removed the Stale label Jun 1, 2022
@zbentley
Contributor Author

zbentley commented Jun 2, 2022

Hmmm, nope; I'm now able to reliably induce this issue on 2.10. Nothing's bound to Pulsar's ports, but it fails to come up.

The logs from a recent start attempt are attached.
pulsar_start.log

@github-actions

github-actions bot commented Jul 3, 2022

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label Jul 3, 2022
@michaeljmarshall
Member

I didn't realize you could start pulsar that way, thanks for sharing!

Your issue helped me answer a question I'd been wondering about for a while (thank you!).

I think I found the likely root cause of your issue, but you'll need to confirm. If I'm correct, here's the TL;DR: Pulsar discovers its IP address for settings like advertisedAddress by looking up the hostname. If your machine's hostname is not defined in the /etc/hosts file, then (based on my troubleshooting) the JVM performs a DNS lookup on the hostname, and problems arise when the resulting IP does not route to the right Pulsar process.
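
For illustration, a rough Python analogue of that lookup (a diagnostic sketch, not Pulsar code) shows what a machine's hostname actually resolves to:

```python
import socket

hostname = socket.gethostname()
try:
    # Mirrors the JVM's InetAddress.getLocalHost() behavior: if the hostname
    # is not mapped in /etc/hosts, resolution can fall through to DNS and
    # return an address that does not route back to this machine.
    resolved = socket.gethostbyname(hostname)
except socket.gaierror:
    resolved = None

print(f"{hostname} -> {resolved}")
```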

Based on your logs, I noticed that some of your stack traces indicate trying to connect to 192.168.1.2:4181:

2022-06-02T13:29:02,974-0400 [client-scheduler-OrderedScheduler-28-0] ERROR org.apache.bookkeeper.clients.impl.internal.RootRangeClientImplWithRetries - Reason for the failure 
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.Status.asRuntimeException(Status.java:535) ~[io.grpc-grpc-api-1.45.1.jar:1.45.1]
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533) ~[io.grpc-grpc-stub-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.1.2:4181
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	... 1 more

Port 4181 is a BookKeeper port, and for some reason the client is trying to connect to a non-local host, even though we know BookKeeper should be running on localhost. I see the same behavior on my machine:

Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /23.202.231.169:4181
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
        at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        ... 1 more

Interestingly, in my case, Pulsar fails when I am off my work VPN but works when I am on it. I've seen the same behavior when running Pulsar standalone directly, which makes sense since the brew installation is Pulsar standalone. The nuance comes here:

public static String unsafeLocalhostResolve() {
    try {
        // Get the fully qualified hostname
        return InetAddress.getLocalHost().getCanonicalHostName();
    } catch (UnknownHostException ex) {
        LOG.error(ex.getMessage(), ex);
        throw new IllegalStateException("Failed to resolve localhost name.", ex);
    }
}
If Pulsar does not have its advertisedAddress configured, it performs this lookup, which relies on a reverse lookup of your hostname, as described here: https://stackoverflow.com/a/18488982. After updating my /etc/hosts file to map my hostname to 127.0.0.1, standalone now works, and I've confirmed that 2.10.1 works after putting my Mac to sleep and waking it again. (I only just realized tonight that standalone was not previously working because my hostname was not correctly resolving to 127.0.0.1.)

I also noticed that running nslookup on my host name resulted in 23.202.231.169 before updating my /etc/hosts file and when off the VPN.
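
One way to confirm that the /etc/hosts fix took effect is to check whether the local hostname now resolves to a loopback address (again a diagnostic sketch, not part of Pulsar):

```python
import ipaddress
import socket

def hostname_resolves_to_loopback() -> bool:
    """True if any address the local hostname resolves to is a loopback address."""
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        return False
    # sockaddr[0] is the address string; strip any IPv6 zone id (e.g. "%lo0").
    return any(
        ipaddress.ip_address(info[4][0].split("%")[0]).is_loopback
        for info in infos
    )

print(hostname_resolves_to_loopback())
```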

Let me know if this helps.

@michaeljmarshall
Member

michaeljmarshall commented Jul 7, 2022

I've just realized that 192.168.1.2 is a typical home-router address, so my analysis might not solve your problem. Let me know if any of this is helpful.

@zbentley
Contributor Author

zbentley commented Jul 8, 2022

@michaeljmarshall you may still be onto something; just because the IP in my logs is an RFC1918 address doesn't mean it necessarily resolves to my host (or a router). I'll try setting advertisedAddress; if that makes the problem disappear, I can check that into the Homebrew formula for Pulsar.

@zbentley
Contributor Author

zbentley commented Jul 9, 2022

That seems to work, at least so far. Which of the broker.conf, proxy.conf, standalone.conf and bookkeeper.conf config files should I set advertisedAddress in for a standalone broker?

@michaeljmarshall
Member

The brew service command uses standalone mode, which you can see in the formula code: https://github.com/Homebrew/homebrew-core/blob/9441684087afc2c59ef7c6341bc434e3b818a950/Formula/apache-pulsar.rb#L60.

When running Pulsar as a standalone broker, all configuration is passed via the standalone.conf file, so you only need to configure that file.
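
For example, pinning the address in standalone.conf might look like the following (a sketch; 127.0.0.1 assumes a purely local, single-machine setup):

```properties
# standalone.conf
advertisedAddress=127.0.0.1
```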

@michaeljmarshall
Member

if that makes the problem disappear, I can check that into the Homebrew formula for Pulsar.

I agree that this configuration makes sense as a default for the Homebrew formula, though it might be worth mentioning it on the dev@ mailing list.

You can set the advertisedAddress in the standalone.conf or on the command line (e.g. bin/pulsar standalone --advertised-address localhost).

if (this.getAdvertisedAddress() != null) {
    // Use advertised address from command line
    config.setAdvertisedAddress(this.getAdvertisedAddress());
} else if (isBlank(config.getAdvertisedAddress()) && isBlank(config.getAdvertisedListeners())) {
    // Use advertised address as local hostname
    config.setAdvertisedAddress("localhost");
} else {
    // Use advertised or advertisedListeners address from config file
}

@zbentley
Contributor Author

@michaeljmarshall sounds good, I'll put that on my to-do list. In the meantime, this bug should probably stay open; the appropriate fix, I think, is to either:
a) always default advertisedAddress to 0.0.0.0 or 127.0.0.1 in standalone mode, or
b) find a better way to probe whether the default advertisedAddress is correct; could ARP lookups or something similar be used? Is that more trouble than it's worth?

@github-actions

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label Aug 11, 2022