Improve error messages when a cluster fails to boot #887
Comments
Sounds good to improve this.
The scenario here probably is this:
There are multiple possible fix alternatives:
This feature request has quite a wide scope, and fixing it will be some sort of whack-a-mole with each different error condition. But one significant problem I've found with cluster bootstrap is that when it fails, interpreting the error messages you get requires an in-depth understanding of Akka cluster bootstrap in order to map them to the actual cause.
So for example, here's a node that's failing to join a cluster:
The above logs are output over and over again. The log message that tells me the most is this one, which I will not put in a code block because the line is so long that a code block makes it hard to read:
2021-04-13 23:07:11.778 WARN akka.management.cluster.bootstrap.internal.HttpContactPointBootstrap - Probing [http://10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local:8558/bootstrap/seed-nodes] failed due to: Exceeded configured max-open-requests value of [32]. This means that the request queue of this pool (HostConnectionPoolSetup(10-8-3-65.3aceb7b3-4e74-4e0e-aa23-1bb50408b733.pod.cluster.local,8558,ConnectionPoolSetup(ConnectionPoolSettings(4,0,0,32,1,Duration.Inf,100 milliseconds,2 minutes,30 seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.12),10 seconds,60 seconds,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$915/0x0000000100645840@72127121),List(),ParserSettings(2048,16,64,64,8192,64,8388608,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,HashMap(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, If-None-Match -> 0, User-Agent -> 32, Content-MD5 -> 0, Date -> 0, If-Match -> 0),false,true,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$775/0x0000000100577840@b4e36a4,akka.util.ConstantFun$$$Lambda$776/0x0000000100578040@cf1346d),100 milliseconds,None,TCPTransport),New,1 second),akka.http.scaladsl.HttpConnectionContext$@5d96a23c,akka.event.MarkerLoggingAdapter@6348ef3))) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.
This is bad in its own right: e.g. the toString of HostConnectionPoolSetup is used, which makes the log message far too long. But it is also very misleading: the message makes it sound like the thing being communicated with is too slow. In actual fact, the problem in this case was that firewall rules were blocking all communication; it wasn't that the network was too slow, it was that the network was blocking it.
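For reference, the [32] limit in that log comes from Akka HTTP's host connection pool settings. A sketch of where it lives in application.conf (the values shown are the documented defaults); note that raising the limit would only delay the overflow in this scenario, since the real cause was a firewall, not a slow peer:

```hocon
# Akka HTTP client-side host connection pool (application.conf).
# max-open-requests is the "[32]" limit quoted in the log above.
akka.http.host-connection-pool {
  max-open-requests = 32   # default; must be a power of 2
  max-connections = 4      # default per-host connection count
}
```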
Now, of course, if I look earlier in the logs, I see a different error:
This is better, but I think we can still do better. The above error message doesn't explain why we are probing that node, nor what the consequence of it timing out is; to understand it you need an in-depth understanding of cluster bootstrap. There's also this log message:
This once again requires an in-depth understanding of cluster bootstrap to interpret. What I think would be useful is, after some additional margin, to start outputting something much more informative, ideally multiple lines long. Imagine something like this:
This would make it far easier for someone who has little to no knowledge of Akka cluster bootstrap to debug the issue. If such an error message were output after a configurable amount of time, the same configuration item could be used to turn it off altogether for people who don't want it.
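To make the proposal concrete, here is a hypothetical sketch of what such a multi-line diagnostic could look like. The wording, counts, and the setting name below are all invented for illustration; none of them exist in Akka Management today:

```
Cluster bootstrap has been probing for seed nodes for 5 minutes without success.
Bootstrap contacts each discovered contact point over HTTP on the management
port (8558) and asks it for known seed nodes; if no contact point ever responds,
no cluster can form. All 3 probes are currently failing with connection errors,
which usually means the management port is unreachable (for example, blocked by
a firewall or network policy) rather than that the peers are responding slowly.
Check that port 8558 is open between all pods in the cluster.
```

A single hypothetical setting such as `friendly-diagnostics-after = 5m` (again, an invented name, not an existing config key) could both control the delay before this message appears and, when set to `off`, disable it entirely.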