
WebSocketSend CPU overload #67

Closed · odrisci opened this issue Sep 1, 2014 · 10 comments

Comments

@odrisci

odrisci commented Sep 1, 2014

Hi,

I have an issue which I think is related to TooTallNate/Java-WebSocket#225.

Whenever I run the WebSocketSend operator for an extended period of time, the CPU usage of the Streams PEC in which the operator is running goes into overload, usually sitting at 90 to 100% processor usage.

I haven't been able to produce a minimal working example, as the problem seems to be somewhat random and usually only crops up after a few hours of continuous use. My application generates a relatively low tuple rate of one tuple every 1 to 4 seconds.

I was wondering if anyone else is seeing something similar?

Cillian

@odrisci
Author

odrisci commented Nov 19, 2014

OK, while trying to reproduce a minimal working example I found the following fix, which seems to resolve the problem, albeit in a way that I'm not entirely comfortable with.

On the receive side (in the webpage) I was doing all the data processing in the WebSocket.onmessage() JavaScript function. By moving all of the processing out to a new function (handleMessage()) and simply doing the following in onmessage():

ws.onmessage = function(event) {
  // Defer the real work so onmessage() returns immediately
  setTimeout(function() { handleMessage(event); }, 100);
};

the problem seems to have been solved. I presume that the code in handleMessage() would occasionally block, delaying an acknowledgement back to the WebSocketSend operator and triggering the issue reported here: TooTallNate/Java-WebSocket#225.

Anyway, the moral of the story seems to be: do as little as possible in onmessage().

@odrisci odrisci closed this as completed Nov 19, 2014
odrisci added a commit to odrisci/streamsx.inet that referenced this issue Jun 9, 2015
There appears to be an issue with the TupleConsumer base class, which
can lead to one of the threads spinning and taking up 100 % of a CPU.
Moving the WebSocketSend operator to derive from AbstractOperator to
avoid this issue.
@odrisci odrisci reopened this Jun 9, 2015
@odrisci
Author

odrisci commented Jun 9, 2015

So I've looked into this in some detail lately, as the issue resurfaced when we temporarily moved from a large server to a smaller one.

I've established that the high CPU usage is coming from a thread whose stack trace shows it running the function OperatorAdapter.complete(), called from a JNIBridge object. It seems that during normal operation this complete() function sleeps for 10 s periods, does a few milliseconds of work, and then goes back for another 10 s nap. When tuples are being received, the WebSocketSend operator uses a number of helper threads (webSocket-scheduler-thread-N) to run the code in processBatch(). Each of these threads wakes up the napping complete() function when it is done with processBatch().

At a certain point a glitch occurs and the sleeping thread wakes up, does not go back to sleep, and runs continuously, eating up the CPU. I used javap to investigate what is going on in OperatorAdapter.complete(), and it appears that the function uses a private boolean member variable, completerNotify, to determine whether or not to go to sleep. I presume that the glitch is caused when one of the scheduler threads notifies the completer thread that it is done while the completer thread is not in a sleep state. The notification sets completerNotify=true, so the completer thread never goes back to sleep and completerNotify is never set back to false.

If my reasoning above is correct, this is a bug in the interaction between the scheduler threads (which seem to be coming from the TupleConsumer base class) and the completer thread. I'm currently running tests on a fix that derives WebSocketSend directly from AbstractOperator rather than TupleConsumer, and it seems to be working (strace shows the completer thread quietly sleeping for 10 s intervals, undisturbed by the arrival of tuples for submission to the websocket). I can create a pull request from this, but I think it would be better to use the TupleConsumer model if a fix can be made to that implementation.
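
In outline, the AbstractOperator-based version looks something like the sketch below. This is an illustration only, not the exact code in my branch; the class name and the sendOverWebSocket() helper are placeholders.

import com.ibm.streams.operator.AbstractOperator;
import com.ibm.streams.operator.StreamingInput;
import com.ibm.streams.operator.Tuple;

// Sketch: handle each tuple directly in process(), so no TupleConsumer
// scheduler threads (and no completer notifications) are involved.
public class WebSocketSendDirect extends AbstractOperator {

    @Override
    public void process(StreamingInput<Tuple> stream, Tuple tuple) throws Exception {
        // Send on the thread that delivered the tuple instead of queueing it
        // for batch processing.
        sendOverWebSocket(tuple);
    }

    private void sendOverWebSocket(Tuple tuple) {
        // Placeholder: serialize the tuple and push it to connected WebSocket clients.
    }
}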

@ddebrunner
Member

Thanks for the investigation; that sounds like an issue I fixed in Streams (not yet released), but I wouldn't expect it to occur in this operator. I'll take a look soon.

@ddebrunner
Member

Ok, I think I can see why it would occur in the WebSocketSend operator.

You might be able to work around the problem by removing these lines in WebSocketSend.java:

(lines 134-135)
if (tuplesInRequest++ == getBatchSize())
    break;

The issue is that too many pending invocations of processBatch() cause the completer thread to have CPU issues.

Another alternative might be to just increase the batch size using the batchSize parameter.
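
For a concrete picture of why that break matters, here is a small self-contained illustration (not the Streams API or the actual WebSocketSend source, just the pattern): with the early break, each processBatch() call handles only a couple of tuples and leaves the rest queued, which is what produces the pile-up of pending invocations; without it, one call drains the queue.

import java.util.ArrayDeque;
import java.util.Queue;

// Standalone illustration of the effect of the early break (names are
// placeholders; batchSize stands in for getBatchSize()).
public class BatchDrainDemo {

    static int processBatch(Queue<String> batch, int batchSize, boolean earlyBreak) {
        int tuplesInRequest = 0;
        while (!batch.isEmpty()) {
            batch.poll();                               // stand-in for sending one tuple
            if (earlyBreak && tuplesInRequest++ == batchSize)
                break;                                  // the lines suggested for removal
        }
        return batch.size();                            // tuples left for the next call
    }

    public static void main(String[] args) {
        Queue<String> q = new ArrayDeque<>();
        for (int i = 0; i < 10; i++) q.add("tuple" + i);
        System.out.println("left with break:    " + processBatch(new ArrayDeque<>(q), 1, true));
        System.out.println("left without break: " + processBatch(new ArrayDeque<>(q), 1, false));
    }
}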

@odrisci
Author

odrisci commented Jun 11, 2015

Thanks Dan,

I'll give that a shot in the coming days. I don't want to increase the batchSize parameter, as my data arrives relatively infrequently (two metrics every 4 seconds for each input stream, typically 1 to 4 input streams), so batching could introduce unwanted latency.

As I mentioned in my previous comment, I have been testing the fix odrisci/streamsx.inet@63e4c34, and it has been running just fine for the last two days. Note that it was applied to the streams_3.2.1 branch, as we haven't upgraded to 4.0.0 yet.

@ddebrunner
Member

> The notification sets completerNotify=true, so the completer thread never goes back to sleep and completerNotify is never set back to false.

Is this what you saw in the debugger? Because the completer thread explicitly sets completerNotify to false when it wakes up, so I can't see how completerNotify would remain true.

@odrisci
Author

odrisci commented Jun 11, 2015

I haven't been able to capture the moment of transition from normal operation to high CPU usage, but I have obtained strace outputs. This makes it relatively difficult to determine what's going on, since I'm not looking at the Java source but at the system calls. One thing is clear: when the high CPU usage occurs, the completer thread no longer calls the "wait" function.

My assumption is that the notifyCompleter() function is called twice in succession. The first call wakes up the completer thread, which resets completerNotify to false; the second call then occurs before the completer thread has a chance to get back to the waiting state, so completerNotify is left true. It seems that the completer thread checks this value before entering the wait state; if it is true, it does not enter the wait and therefore never resets the value to false.

I don't have the source code here, so I can't really be sure that this is what's happening, but the strace outputs suggest that this is so.
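
To make the suspected sequence concrete, the wait/notify logic I'm picturing looks something like the sketch below. This is a reconstruction from the strace and javap output, not the actual OperatorAdapter source; everything apart from the completerNotify flag and the notifyCompleter() name is a placeholder.

// Reconstruction of the suspected completer/notifier interaction (sketch only).
class CompleterSketch {
    private final Object lock = new Object();
    private boolean completerNotify = false;

    // Called by each webSocket-scheduler thread when it finishes processBatch().
    void notifyCompleter() {
        synchronized (lock) {
            completerNotify = true;
            lock.notify();
        }
    }

    // The completer thread's loop.
    void completerLoop() throws InterruptedException {
        while (true) {
            synchronized (lock) {
                if (!completerNotify) {       // a pending notify skips the wait...
                    lock.wait(10000);         // ...otherwise, the normal 10 s nap
                    completerNotify = false;  // flag is only reset after a real wait
                }
                // Suspected bug: if notifyCompleter() fires twice before the
                // completer gets back here, the wait is skipped and the flag is
                // never cleared, so every subsequent pass also skips the wait
                // and the thread spins. Resetting the flag whether or not
                // wait() ran would avoid the spin.
            }
            doCompletionWork();               // the few milliseconds of work
        }
    }

    private void doCompletionWork() { /* placeholder */ }
}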

@ddebrunner
Member

Perfect, thanks, I see the issue now.

@ddebrunner
Member

I confirmed the issue is in Streams, and have fixed it for the next release.

In the meantime, I'll see if there's a workaround that still allows the operator to extend TupleConsumer.

@odrisci
Author

odrisci commented Jun 15, 2015

That's great, Dan. I appreciate your looking into this.

Meanwhile the AbstractOperator fix seems to be working just fine: about a week of continuous operation now and all is still well.

Regards,

Cillian
