WebSocket hangs in blockingWrite #2061
What are the various idle timeouts you have configured? |
OP is using STOMP + Spring. What version of STOMP/Spring are you using? No reports of similar problems from the CometD community or the Google App Engine community (yet?). |
I'm running JDK 8u121 on Debian Jessie with Spring Boot 1.5.9 (Spring Framework 4.3.13). As far as the idle timeout is concerned (and if I understood SharedBlockingCallback correctly), it blocks indefinitely due to … |
The QueuedThreadPool idleTimeout will not take effect on a thread that is actively being used. The -1 idle timeout on the websocket is an indefinite idle timeout and is operating properly. |
Hi @joakime. I was just answering your question about the various idle timeouts... As the previous thread/issue #272 already mentions, many argue the … Correct me if I'm wrong, but your suggestion would not solve this problem. And even if it does, it would just be a different workaround instead of actually solving the root cause (which I feel we're unfortunately not getting closer to finding). Speaking of the root cause: though we have no clear evidence, the behavior described in this issue and in #272 seems for us to be caused by slow clients (we have many mobile users that might have slow connections). Maybe this helps you come up with a test scenario for the root cause. Cheers |
I'm curious what the OS thinks of the connection. Having an idle timeout of -1 isn't normal. In the distribution it defaults to 60,000 ms, and embedded usage seems to default to 30,000 ms. The process to determine the idle timeout seems to be ...
But even at -1 idle timeout, an active websocket connection with an active message write sits in SharedBlockingCallback$Blocker.block() when attempting to write to a congested connection (one where the remote side isn't reading). It will sit like this indefinitely until the connection is closed or the congestion clears up. This is a sane setup for many websocket users, especially connections used between servers in a data center. If I set a websocket container max idle timeout, then even that scenario idle times out. I can sort-of reproduce the SharedBlockingCallback concern, but it's only under a specific (working as intended) scenario ...
This is normal behavior for idle timeouts of -1. |
Here is a way to reproduce it consistently (even though I am not sure it actually is the exact same issue):
My workaround for the time being was to allow more threads for Jetty:
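(The configuration snippet that followed was not preserved. As a rough sketch assuming plain embedded Jetty rather than the poster's Spring Boot setup, and with illustrative 400/20 values that are not the poster's actual settings, raising the thread pool limits looks like this:)

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.ServerConnector;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class BiggerPoolServer
{
    public static void main(String[] args) throws Exception
    {
        // Size the pool up front; it cannot be swapped once the Server has been started.
        QueuedThreadPool threadPool = new QueuedThreadPool(400, 20); // maxThreads, minThreads (illustrative)
        Server server = new Server(threadPool);

        ServerConnector connector = new ServerConnector(server);
        connector.setPort(8080);
        server.addConnector(connector);

        server.start();
        server.join();
    }
}
```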
At runtime, Jetty actually allocated over 300 threads even before any traffic was on the server. Since 200 is the default max, Jetty probably tried to wait for more threads to become available before serving any responses. Might have to do with this: spring-projects/spring-boot#8917 |
Ah! A proxy. Regarding threading configuration ... |
@elekktrisch spring-projects/spring-boot#8917 is about excessively low/insufficient thread pool configurations. If you are using Jetty 9.4.8.v20171121 then you will see warning/error log events telling you about insufficient thread configuration (this was recently added in response to spring-projects/spring-boot#8917). See: Issue #1851 and https://github.com/eclipse/jetty.project/blob/jetty-9.4.8.v20171121/jetty-util/src/main/java/org/eclipse/jetty/util/thread/ThreadPoolBudget.java#L137-L157 |
We don't use Spring proxy facilities and don't have an invalid thread configuration (at least we're not affected by the mentioned Spring Boot issue). What we do use, though, is nginx as a reverse proxy, which to my knowledge is WebSocket/Upgrade aware. |
nginx added websocket support in mid-2014 with, I believe, nginx version 1.3. Just make sure you turn off proxy buffering (a known problem with nginx's implementation)
Also make sure you set a valid |
Looking at the various reports of websocket support in nginx it seems that version 1.3.13 is the first version that the userbase reports as stable for websocket. |
Thanks. We already use 1.12.2 with |
@dreis2211 what are your nginx proxy settings for ...
|
Also, depending on your version of nginx it might be |
@joakime Neither is specified in our config, so they should use the default value of 60 seconds. |
We've had this problem for some time. I've found a 100% sure way to reproduce it, which may be of interest. First some background. We're currently running 3.9.19 under Spring Boot 1.4.7. Our application often has hundreds of mobile clients over websockets, based on STOMP and the simple broker built into Spring. We had occasional problems under heavy load on our servers where the websocket traffic just stopped (while other "regular" requests continued to work, and the server stayed up). When this "zombie mode" occurred, the only remedy was to restart the server. We've scrutinized our code and found nothing here causing it, so I began suspecting something else in the stack. Therefore I was very happy when I found this thread and the previous one, #272, which seem to describe exactly the problem we're seeing.

We had an older version of Jetty before (9.2.16), and there were some hints along the way that this may have been "fixed" in 9.2.x, so we started by upgrading Spring Boot along with Jetty to the versions mentioned above. That did NOT fix the problem. I then proceeded to apply the work-around described in #272 ("overriding" the socket write timeout), and that seems to have fixed it for us. The sockets still hang on write, and if a lot of websockets "die" on us at the same time, this may still starve the thread pool for some time. But once those writes time out, the server comes back to normal again, instead of requiring a server restart. This was a huge improvement for us.

Now to the way to reproduce it. Unfortunately, all of this is in company code, so I can't share the code. But I can share the general idea. We have tests written to simulate large numbers of clients, doing both regular requests as well as the websocket communication. A single nodejs instance can easily simulate several hundred users (with the exception of cellular network peculiarities). Since this is nodejs based, it's also easy to fire up a couple of Amazon EC2 instances to run really massive tests for an hour or so, and then shut them down. Now, as long as everyone plays nice and closes their websocket connections in an orderly manner, everything works and is happy. It appears that the lock-up happens ONLY when sockets aren't closed but merely cease to communicate, as would happen if the phone at the other end suddenly goes off grid, is turned off, its battery dies, or similar.

The way I found I could simulate this situation is to run a test with, say, 100 simulated users on a separate machine on our network. I get the system into a state where it keeps sending data over the websocket to those 100 clients. I then just yank the Ethernet wire out of the machine running the test clients. This quickly causes the send buffers to fill up, and the sending threads to (presumably) block on some low-level write call, very soon causing all threads in the pool to block. Our app uses the standard thread pool provided by Spring/Jetty here, which has twice the number of threads as the processor has cores, so we typically see 8 or 16 threads in the pool. That means that as long as more than this number of clients all "die" at about the same time, while data is being sent to those websockets, the server will almost instantly go into "zombie mode". Having a reproducible test like this quickly allowed us to try out various scenarios, and let us come to the conclusion that the timeout is a viable work-around for now.
Hopefully this will get fixed in Jetty in some more permanent and elegant way, and perhaps my ramblings here can allow you to at least get a reproducible test case to track it down. I don't feel sufficiently familiar with Jetty's innards to venture a fix myself. -JM |
We've tested Jetty 9.3.x and 9.4.x at 10k+ active websocket connections, with some connections going away (no traffic, just dead). If we have a reasonable Idle Timeout things work as expected. Even in the infinite timeout scenario (Idle Timeout == -1) the dead network connections will eventually be cleaned up by the OS, dependent on network configuration, and the connection to Jetty will die and be cleaned up as well. The only scenario we've been able to use to reproduce this reported issue is having a simple proxy with its own infinite timeout, and jetty configured for infinite timeout. Then the connection remains "live" between Jetty and the proxy, with the proxy itself not cleaning up the connection when it goes away. This behavior seems to be exacerbated with various proxy connection buffering configurations as well. In summary so far (Jan 2018):
If you want to help, fork the example project ... https://github.com/spring-guides/gs-messaging-stomp-websocket |
I can try putting together a reproducible case based on the "gs-messaging-stomp-websocket" example you suggest, combined with a nodejs-based multi-client simulator. That's what we use to run our own load testing. However, before pursuing this, I'd like to make sure we indeed "use a sane idle timeout everywhere", as you suggest. Perhaps you could provide some pointers as to how/where timeouts are specified in Jetty and/or Spring for "your websockets" and "your connections". Any pointers related to timeouts in the "OS networking configuration" (assuming Linux OS) would be most appreciated too. After making sure we have those set to cope with unreliable clients (i.e., lots of mobile phones on cellular networks), I'll re-run our tests to see if I can still repro the lockup. If so, I'll do my best to put together a reproducible test case I can contribute. -JM |
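(No answer to that question appears at this point in the thread. For reference only, a minimal sketch of the usual places a websocket idle timeout is set with Jetty 9.4's native API and JSR-356; the 30-second value is illustrative, not a recommendation from the maintainers.)

```java
import javax.websocket.Session;
import javax.websocket.WebSocketContainer;
import org.eclipse.jetty.websocket.servlet.WebSocketServletFactory;

// Hypothetical helper collecting the common idle-timeout knobs in one place.
public final class IdleTimeoutSettings
{
    // Jetty native API: set the policy when configuring a WebSocketServlet factory.
    public static void onFactory(WebSocketServletFactory factory)
    {
        factory.getPolicy().setIdleTimeout(30_000); // milliseconds
    }

    // JSR-356: default for all sessions created by the container.
    public static void onContainer(WebSocketContainer container)
    {
        container.setDefaultMaxSessionIdleTimeout(30_000); // milliseconds
    }

    // JSR-356: per-session override.
    public static void onSession(Session session)
    {
        session.setMaxIdleTimeout(30_000); // milliseconds
    }
}
```

On the OS side, the relevant settings are presumably the TCP keepalive and retransmission timeouts, which govern how long the kernel keeps a dead peer's connection alive.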
@TheWizz any news from your end about this ticket? |
No. We're using the work-around suggested by jonhill1977 in #272. This is really a "hack", since it involves replacing the org.eclipse.jetty.websocket.common.BlockingWriteCallback class with our own (as there seems to be no way to override this behavior from the "outside"). Doing so solved the problem for us, and we just moved on. A cleaner solution would of course be preferred. |
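(The shadowed class itself isn't shown in the thread. As a rough reconstruction, assuming Jetty 9.4.x where SharedBlockingCallback exposes a protected getIdleTimeout() hook consulted by Blocker.block() — verify this against the Jetty release you actually run — the hack would look roughly like the sketch below; the 30-second value is purely illustrative.)

```java
package org.eclipse.jetty.websocket.common;

import java.util.concurrent.TimeUnit;

import org.eclipse.jetty.util.SharedBlockingCallback;

// Shadow of the class bundled in jetty-websocket-common: placed in your own source tree
// under the same package and name so it is picked up ahead of the jar on the classpath.
public class BlockingWriteCallback extends SharedBlockingCallback
{
    @Override
    protected long getIdleTimeout()
    {
        // Abort a blocked write after 30 seconds instead of waiting forever (illustrative value).
        return TimeUnit.SECONDS.toMillis(30);
    }

    // ... the rest of the original BlockingWriteCallback (acquireWriteBlocker(), the
    // WriteBlocker inner class, etc.) must be copied verbatim from the matching Jetty release.
}
```

Because it shadows a class shipped with Jetty, the copy has to be kept in sync with the exact Jetty version in use, which is why the comment above calls it a hack.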
So no other "solution" than ours. But thanks for getting back. |
@joakime Triggered by #2491, I was wondering whether the clients connecting to our servers use the fragments extension and whether the two bugs might be connected. While they (unfortunately) don't use it, they do make use of the per-message-deflate extension. I don't know if that helps with this bug or if it's the same for the other reporters, but I thought this is new information I should share for completeness. |
The Fragment Extension is very rarely used. If you suspect the permessage-deflate extension, here are a few ways to disable it:

Option 1: If using WebSocketServlet (to unregister the permessage-deflate extension)

```java
public static class MyWebSocketServlet extends WebSocketServlet
{
    @Override
    public void configure(WebSocketServletFactory factory)
    {
        factory.getExtensionFactory().unregister("permessage-deflate");
        // The rest of your processing here ...
    }
}
```

Option 2: If using WebSocketCreator (to remove just permessage-deflate)

```java
public static class MyWebSocketCreator implements WebSocketCreator
{
    @Override
    public Object createWebSocket(ServletUpgradeRequest servletUpgradeRequest, ServletUpgradeResponse servletUpgradeResponse)
    {
        // Strip permessage-deflate negotiation
        servletUpgradeRequest.setExtensions(
            servletUpgradeRequest.getExtensions()
                .stream()
                .filter((extConfig) -> !extConfig.getName().equals("permessage-deflate"))
                .collect(Collectors.toList())
        );
        // The rest of your processing here ...
    }
}
```

Option 3: If using JSR's ServerEndpointConfig.Configurator (to remove all extensions offered)

```java
public static class MyServer_NoExtensionsConfigurator extends ServerEndpointConfig.Configurator
{
    @Override
    public List<Extension> getNegotiatedExtensions(List<Extension> installed, List<Extension> requested)
    {
        // Strip all offered extensions
        return Collections.emptyList();
    }
}
```

Option 4: If using JSR's ServerEndpointConfig.Configurator (to filter the permessage-deflate offered extension)

```java
public static class MyServer_NoPerMessageConfigurator extends ServerEndpointConfig.Configurator
{
    @Override
    public List<Extension> getNegotiatedExtensions(List<Extension> installed, List<Extension> requested)
    {
        List<Extension> negotiated = new ArrayList<>();
        for (Extension offered : requested)
        {
            if (offered.getName().equalsIgnoreCase("permessage-deflate"))
            {
                // skip
                continue;
            }
            if (installed.stream().anyMatch((available) -> available.getName().equals(offered.getName())))
            {
                negotiated.add(offered);
            }
        }
        return negotiated;
    }
}
```
We're definitely not using any uncommon options for our websockets. A way I found to repro it is to have a significant number (50 or so) of websockets running, and then "unplug" those clients rather than closing them gracefully. I do this by having a computer (running some NodeJS-based test code to exercise the websockets) and then simply pulling the Ethernet cable from the test machine. I then terminate the Node test code (with Ethernet still unplugged). I then plug it back in and restart the test code. This leaves numerous websocket connections in a "zombie" state. Any attempt at writing to those sockets will eventually block somewhere. It is possible that things will eventually time out. But from the client's point of view, the server is "hung" as it stops responding to calls.

Note that if I instead close down the test clients in an orderly fashion rather than unplugging the Ethernet cable, no hang occurs. So it seems related to this "unexpected" termination of the socket. Surely, this is "wrong" on the websocket client's part. But I suspect that's essentially what happens with real websocket clients, which are typically phones. For instance, the phone leaves cellular coverage while connected, or it runs out of battery. Or perhaps it's just put to sleep and back in the pocket.

If a few connections "misbehave" in this way, it seems to cause no harm. But if enough of them do so quickly enough, all websocket threads likely end up blocking somewhere. Increasing the pool size seems to push the problem out a bit. But given enough misbehaving clients, it will still happen at some point. It seems that applying some reasonable write timeout, causing the write to be aborted unless completed within a (relatively short - say seconds) timeframe, fixes the problem, as it indicates the websocket has gone bad and will therefore be discarded.

Some of the above is empirical testing (with our NodeJS-based test code, and pulling the Ethernet jack), while the last paragraph is speculation on my part. But in either case, hacking the Jetty code as suggested by jonhill1977 in #272 makes the "hang" go away, and clients happy. -JM |
@TheWizz doing various "unplug" tests shows that the idle timeouts do kick in, terminate the connection, and free up the resources associated with it. Using this test project - https://github.com/joakime/jetty-websocket-browser-tool It will compile down to an uber-jar if you want to move it around to other systems to try with. The test setup used ... Laptop A (Wired networking only. Wifi turned off) - this is the client machine.
The behavior and results are the same for Blocking writes and Async writes ... Blocking Writes
Async Writes
A variety of clients have been used (so far) to test this behavior with no change in results (the blocked write is exited when the idle timeout fires)
I will continue to improve this simple tool to attempt to get to the bottom of this reported issue. |
@gbrehmer thank you for the dump, but we're a bit more confused now. |
You are right. I didn't check the log before posting. I created an HTTP endpoint to trigger the dump creation, but probably injected the wrong webserver instance and didn't know that Spring Boot creates multiple instances. I have to dig deeper into the Spring Boot code. It is possible that the missing parts are produced by https://github.com/spring-projects/spring-framework/blob/master/spring-websocket/src/main/java/org/springframework/web/socket/server/jetty/JettyRequestUpgradeStrategy.java |
It seems that the websocket part is created on the fly without a direct mapping to the Jetty server instance, like using the Jetty websocket support in an embedded way as part of Spring WebMVC controller logic. |
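(Not code from this thread: a minimal sketch of what the core of such a dump endpoint might call, assuming you can obtain the actual org.eclipse.jetty.server.Server instance in use — the point above being that Spring Boot may create more than one, so dumping the wrong one gives an incomplete picture.)

```java
import org.eclipse.jetty.server.Server;

// Hypothetical diagnostic helper.
public final class JettyDumper
{
    public static String dump(Server server)
    {
        // Server inherits dump() from ContainerLifeCycle and returns its full component tree,
        // including connectors, handlers and whatever websocket machinery is attached to it.
        return server.dump();
    }
}
```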
Hi guys, we've just encountered this issue with the latest 9.4.18.v20190429 version. Our scenario: … Our observations after unplugging the network cable (and plugging it in again later): … At some point an exception is thrown in our sender thread [Worker-0]:
From that point on, the sender thread Worker-0 is in WAITING (as our thread dump confirms) and is blocked indefinitely; as a result, the system is stuck:
Update: |
I can see the NullPointerException coming from the CompressExtension as well, but I can't really connect it directly to the times where we see blocked threads (yet). Still this might be a hint of what's going wrong. |
Found this issue on our production servers yesterday. Currently on version … I'll try to provide as much information as I can. We are getting this exact issue with an NPE in the CompressExtension.
I believe this is happening when we are sending a message as a client disconnects. I'm going to attempt to beef up our logging to validate that assumption, but was hoping to find a fix instead.
I'm not ruling out the possibility that we are sending messages incorrectly. I'm admittedly not very familiar with the Jetty codebase in general. But this is our general setup (RabbitMqConsumer -> WebSocketSession):

```java
public void handleDelivery(String consumerTag, Envelope envelope, AMQP.BasicProperties properties,
                           byte[] body) throws IOException {
    websocketSession.getAsyncRemote().sendObject(new RawBytes(body));
}
```

This is currently running in our rabbit consumer thread pool. When we experience this exception, it effectively lowers our consumer pool by 1 thread (until we run out). The Jetty threads are still functional: you can initiate new connections and generally interact with the server. But our message-generation side is locked. Given that it appears you need to experience a hard disconnect while a message is being sent, I can't really attach a sample project. In my case it is obvious that the message-sending thread and the thread that handles the session disconnecting are different. Is there a best practice to ensure I'm sending messages on a thread that would sync with any disconnects? My workaround for now was to increase my consumer pool count and restart the servers every night. So right now I can see I have locked threads in production, but there are many more to work with. Thanks, Jeremy |
I'm keen on attempting (again) to replicate this ... What is …? The stacktrace is also telling us that you are not paying attention to the Future returned by javax.websocket.RemoteEndpoint.Async.sendObject(). The fact that you got "Deflater has been closed" means you attempted to send a message after the websocket closed. Since you are also using javax.websocket, you have limited ability to control the batching of outgoing messages.
Consider using those as well. You can also disable the extensions with a custom ServerEndpointConfig.Configurator:

```java
public class NoExtensionsConfigurator extends ServerEndpointConfig.Configurator
{
    @Override
    public List<Extension> getNegotiatedExtensions(List<Extension> installed,
                                                   List<Extension> requested)
    {
        return Collections.emptyList();
    }
}
```

If you choose to disable the … Lastly, … |
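(Not from the thread, but to illustrate the point above about the ignored Future: a minimal sketch of waiting on the Future returned by sendObject() so a dead connection surfaces as an error instead of being silently dropped. The helper name and the 10-second bound are made up here.)

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import javax.websocket.Session;

// Hypothetical helper, not part of the reporter's code.
public final class AsyncSendHelper
{
    public static void send(Session session, Object payload) throws IOException
    {
        Future<Void> result = session.getAsyncRemote().sendObject(payload);
        try
        {
            result.get(10, TimeUnit.SECONDS); // illustrative bound on how long a write may take
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
        catch (ExecutionException | TimeoutException e)
        {
            // The send failed or stalled; treat the session as dead and close it.
            session.close();
        }
    }
}
```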
Thank you for the response. Updating to the latest is part of my plan, but I wanted to see if there was a changelog/issue resolution that aligned with the issue I am experiencing instead of just hoping it was fixed.
We are ignoring the future response. I will also look into handling that. The way our rabbitmq integration works ensures there is only 1 message sent to a websocket session at a time (queues are single-threaded). When we detect that a websocket has closed, we disconnect from the rabbit queue. We are not batching anything (on purpose). As you've pointed out, that might not matter. I can try disabling the extensions. Based on @TheWizz's comment above about yanking the Ethernet cord, it aligns with my theory of a hard disconnect while sending a message. I thought providing my usage example, where that is a very likely scenario, might be helpful to this discussion. |
@joakime I managed to reproduce the same NPE with this test
|
@lachlan-roberts Awesome. Could we also test if this leads to the blocked threads? |
@dreis2211 If the blocking send is used then it just blocks forever, it seems the callback is never notified of the failure. |
@lachlan-roberts: This ticket should be attached to Jetty milestone 9.4.x again, as it has been fixed right there. |
Hi,
it was suggested to me by @joakime to open a new issue for #272, as it still occurs on 9.4.8 - with a slight tendency to occur more often now (which might just be bad luck on our end).
Unfortunately, I can't say anything new about the issue. It still appears to be random (regardless of load, for example) that threads end up in the WAITING state, and only a server restart helps to resolve the issue.
As this is affecting our production servers, I'd appreciate it if this were investigated again. I'd also appreciate any workaround that doesn't involve a server restart.
Cheers,
Christoph