
Standalone broker on macOS often fails to start after workstation sleep or reboot #15401

Open
zbentley opened this issue Apr 30, 2022 · 12 comments
Labels
Stale, type/bug (reported a bug)

Comments

@zbentley
Contributor

Describe the bug
Often, but not always, if I start Pulsar standalone on my workstation and then close/reopen my laptop lid or restart the computer, subsequent attempts to start the standalone broker fail. The JVM process keeps running and emits some errors in its logs (see attachment), but never reaches a connectible state.

Repeatedly restarting the broker usually causes this condition to go away. Sometimes that doesn't work, and I have to remove all broker data files (i.e. reinstall Pulsar) to make it start.

This only seems to happen when publishes have been recorded to a persistent topic.

This does not happen when using Pulsar standalone in Docker, or on my Linux laptop. It appears to be macOS-specific.

To Reproduce

  1. brew install apache-pulsar
  2. brew services start apache-pulsar
  3. Verify that Pulsar is connectible on localhost:6650.
  4. Publish some messages to a persistent topic such that the message data and the topic itself are not automatically deleted (i.e. at least one ledger is created and persists).
  5. Close laptop lid or otherwise induce sleep.
  6. Reopen laptop and attempt to connect to Pulsar; it will sometimes fail.
  7. brew services restart apache-pulsar and attempt to connect to Pulsar; it will sometimes fail.
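
Steps 3, 6, and 7 boil down to a TCP connectibility probe against the broker's binary port. A minimal sketch in Python (`is_port_open` is a hypothetical helper for illustration, not part of Pulsar; 6650 is the default broker port):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the repro above, check this after step 3, and again after resume/restart.
print(is_port_open("localhost", 6650))
```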

Expected behavior

  1. Sleep/resume should not break Pulsar. It should either be available after resumption, or the service/JVM process should fully crash so that it reports not-running if it cannot handle connection requests after resumption.
  2. Restarting the service should result in Pulsar either being connectible or not having a JVM process. If Pulsar Standalone can't come up in a connectible state, its process shouldn't hang around (if the process stays around, service managers think it's still running and thus won't report a failure or auto-restart it).

Desktop (please complete the following information):

  • OS: macOS Monterey, Intel.
  • Pulsar standalone 2.9.2 and 2.9.1 both exhibit this issue.

Additional context
brew services is likely not the problem. It doesn't do anything special, and Pulsar does sometimes work when restarted via brew services.

Attached is a copy of my logs from a broker start attempt that did not become connectible after 5min.
nostart.log

@github-actions

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label May 31, 2022
@zbentley
Contributor Author

Anecdotally, things seem better after 2.10. The broker still regularly crashes on workstation sleep, but when restarted it at least seems to come up a lot more reliably. Was any work done in 2.10 that may have improved things?

@github-actions github-actions bot removed the Stale label Jun 1, 2022
@zbentley
Contributor Author

zbentley commented Jun 2, 2022

Hmmm, nope; I'm now able to reliably induce this issue on 2.10. Nothing's bound to Pulsar's ports, but it fails to come up.

The logs from a recent start attempt are attached.
pulsar_start.log

@github-actions

github-actions bot commented Jul 3, 2022

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label Jul 3, 2022
@michaeljmarshall
Member

I didn't realize you could start pulsar that way, thanks for sharing!

Your issue helped me answer a question I'd been wondering about for a while (thank you!).

I think I found the likely root cause of your issue, but you'll need to confirm. If I'm correct, here's the TL;DR: Pulsar discovers its IP address for settings like advertisedAddress by looking up the hostname. If your machine's hostname is not defined in the /etc/hosts file, then (based on my troubleshooting) the JVM performs a DNS lookup on the hostname, and problems arise when the resulting IP does not route to the right Pulsar process.
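
For illustration, a rough Python analogue of that lookup (a diagnostic sketch, not Pulsar code) shows what a machine's hostname actually resolves to:

```python
import socket

hostname = socket.gethostname()
try:
    # Mirrors the JVM's InetAddress.getLocalHost() behavior: if the hostname
    # is not mapped in /etc/hosts, resolution can fall through to DNS and
    # return an address that does not route back to this machine.
    resolved = socket.gethostbyname(hostname)
except socket.gaierror:
    resolved = None

print(f"{hostname} -> {resolved}")
```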

Based on your logs, I noticed that some of your stack traces indicate trying to connect to 192.168.1.2:4181:

2022-06-02T13:29:02,974-0400 [client-scheduler-OrderedScheduler-28-0] ERROR org.apache.bookkeeper.clients.impl.internal.RootRangeClientImplWithRetries - Reason for the failure 
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
	at io.grpc.Status.asRuntimeException(Status.java:535) ~[io.grpc-grpc-api-1.45.1.jar:1.45.1]
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:533) ~[io.grpc-grpc-stub-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133) ~[io.grpc-grpc-core-1.45.1.jar:1.45.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /192.168.1.2:4181
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
	... 1 more

Port 4181 is a BookKeeper port, and for some reason the client is trying to connect to a non-local host, even though we know BookKeeper should be running on localhost. I see the same behavior on my machine:

Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /23.202.231.169:4181
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
        at io.grpc.netty.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        at io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.grpc-grpc-netty-shaded-1.45.1.jar:1.45.1]
        ... 1 more

Interestingly, in my case, Pulsar fails when I am off my work VPN but works when I am on it. I've seen the same behavior when running Pulsar standalone directly, which makes sense since the brew installation is Pulsar standalone. The nuance comes here:

public static String unsafeLocalhostResolve() {
    try {
        // Get the fully qualified hostname
        return InetAddress.getLocalHost().getCanonicalHostName();
    } catch (UnknownHostException ex) {
        LOG.error(ex.getMessage(), ex);
        throw new IllegalStateException("Failed to resolve localhost name.", ex);
    }
}
If Pulsar does not have its advertisedAddress configured, it performs this lookup, which relies on a reverse lookup of your hostname, as described here: https://stackoverflow.com/a/18488982. After updating my /etc/hosts file to map my hostname to 127.0.0.1, standalone now works, and I've confirmed that 2.10.1 works after putting my Mac to sleep and waking it again. (I only just realized tonight that standalone was not previously working because my hostname was not correctly resolving to 127.0.0.1.)

I also noticed that running nslookup on my host name resulted in 23.202.231.169 before updating my /etc/hosts file and when off the VPN.
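
One way to confirm that the /etc/hosts fix took effect is to check whether the local hostname now resolves to a loopback address (again a diagnostic sketch, not part of Pulsar):

```python
import ipaddress
import socket

def hostname_resolves_to_loopback() -> bool:
    """True if any address the local hostname resolves to is a loopback address."""
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        return False
    # sockaddr[0] is the address string; strip any IPv6 zone id (e.g. "%lo0").
    return any(
        ipaddress.ip_address(info[4][0].split("%")[0]).is_loopback
        for info in infos
    )

print(hostname_resolves_to_loopback())
```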

Let me know if this helps.

@michaeljmarshall
Member

michaeljmarshall commented Jul 7, 2022

I've just realized that 192.168.1.2 is a typical home-router address, so my analysis might not solve your problem. Let me know if any of this is helpful.

@zbentley
Contributor Author

zbentley commented Jul 8, 2022

@michaeljmarshall you may still be onto something; just because the IP in my logs is an RFC1918 address doesn't mean it necessarily resolves to my host (or a router). I'll try setting advertisedAddress; if that makes the problem disappear, I can check that into the Homebrew formula for Pulsar.

@zbentley
Contributor Author

zbentley commented Jul 9, 2022

That seems to work, at least so far. Which of the broker.conf, proxy.conf, standalone.conf and bookkeeper.conf config files should I set advertisedAddress in for a standalone broker?

@michaeljmarshall
Member

The brew service command uses standalone mode, which you can see in the formula code: https://github.com/Homebrew/homebrew-core/blob/9441684087afc2c59ef7c6341bc434e3b818a950/Formula/apache-pulsar.rb#L60.

When running Pulsar as a standalone broker, all configuration is passed via the standalone.conf file, so you only need to configure that file.
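
For example, pinning the address in standalone.conf might look like the following (a sketch; 127.0.0.1 assumes a purely local, single-machine setup):

```properties
# standalone.conf
advertisedAddress=127.0.0.1
```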

@michaeljmarshall
Member

if that makes the problem disappear, I can check that into the Homebrew formula for Pulsar.

I agree that this configuration makes sense as a default for the Homebrew formula, though it might be worth mentioning it on the dev@ mailing list.

You can set the advertisedAddress in the standalone.conf or on the command line (e.g. bin/pulsar standalone --advertised-address localhost).

if (this.getAdvertisedAddress() != null) {
    // Use advertised address from command line
    config.setAdvertisedAddress(this.getAdvertisedAddress());
} else if (isBlank(config.getAdvertisedAddress()) && isBlank(config.getAdvertisedListeners())) {
    // Use advertised address as local hostname
    config.setAdvertisedAddress("localhost");
} else {
    // Use advertised or advertisedListeners address from config file
}

@zbentley
Contributor Author

@michaeljmarshall sounds good, I'll put that on my to-do list. In the meantime, this bug should probably stay open; the appropriate fix, I think, is to either:
a) always default advertisedAddress to 0.0.0.0 or 127.0.0.1 in standalone mode, or
b) find a better way to probe whether the default advertisedAddress is correct; could ARP lookups or something similar be used? Is that more trouble than it's worth?

@github-actions

This issue has had no activity for 30 days; marking it with the Stale label.

@github-actions github-actions bot added the Stale label Aug 11, 2022