
WebSocket FragmentExtension can produce an invalid stream of frames #2491

Closed · jeremystone opened this issue Apr 26, 2018 · 15 comments

jeremystone commented Apr 26, 2018

When a number of large fragmented messages interspersed with smaller non-fragmented messages are sent over a websocket to a slow client (or over a slow network), the client sometimes sees one of the smaller messages before the final fragment of the previously sent larger message.

The stack trace on the client is:

org.eclipse.jetty.websocket.api.ProtocolException: Unexpected TEXT frame, was expecting CONTINUATION
	at org.eclipse.jetty.websocket.common.Parser.parseFrame(Parser.java:329)
	at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:241)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:560)
	at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:391)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:319)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:175)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:133)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:754)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:672)
	at java.lang.Thread.run(Thread.java:748)

Note that we first noticed this with a Jetty 9.4.9 server and a Netty-based client (and Netty detects the same thing), so the problem is likely in the fragment creation (in this case on the server end) rather than in the fragment re-assembly on the client.

I have created this gist, a JUnit test case showing that this seems to happen only with 'slow' readers. (Our original Netty/Jetty setup required a moderately slow internet link or something like NetLimiter to reliably reproduce it.)

Please let me know if you need any further information.

joakime commented Apr 26, 2018

Bizarre.

The stacktrace shows a parser exception.
That's single-threaded and reads/parses the stream; I can't see how we could screw that up.
We've been (since 2009) constantly testing our implementations against autobahn (the de facto websocket standard testing lib) and pywebsocket (the lib maintained by Google/Chrome).
This kind of bug would be surprising.

joakime self-assigned this Apr 26, 2018

jeremystone (Author) commented:

I added some logging to the test and grepped everything out except the 'nextOutgoingFrame' debug logging in FragmentExtension (see the log for the failing test here).

It appears as if everything initially happens on the 'server response' thread (which is what I called the single thread that does the send), neatly passing on 10 fragmented frames and then a short full-message frame. But after a while some of the work happens on a 'qtp523691575-17' thread.

Not sure if this is relevant(?)

joakime commented Apr 26, 2018

From the Wireshark capture, it's the sending side that's messing up.
There are overlapping messages seen on the wire.

What the testcase seems to be doing (see the sketch after this list):

  • Start a client socket with knowledge of the testcase scenario (size of message, and how much time to sleep on the read side between whole messages).
  • Connect to the server.
  • In client.onConnect(), send a message.
  • On server.onMessage(), start a new thread that uses sendString() to send messages to the remote.
    • The server will send the large message, then a short message, in a loop until the predetermined number of messages have been sent.
  • On client.onMessage(), simulate slowness with Thread.sleep() on each whole-message receive.
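
For orientation, here is a rough sketch of what those two endpoints could look like with the Jetty 9.4 annotated API. This is not the actual gist; the class names, constants, and payloads are made up for illustration.

    import java.io.IOException;

    import org.eclipse.jetty.websocket.api.Session;
    import org.eclipse.jetty.websocket.api.annotations.OnWebSocketConnect;
    import org.eclipse.jetty.websocket.api.annotations.OnWebSocketMessage;
    import org.eclipse.jetty.websocket.api.annotations.WebSocket;

    public class FragmentationScenario {
        static final int MESSAGE_COUNT = 10;   // hypothetical
        static final long READ_DELAY_MS = 100; // hypothetical slow-reader delay
        static final String LARGE = new String(new char[64 * 1024]).replace('\0', 'x'); // big enough to be fragmented

        @WebSocket
        public static class ScenarioServerSocket {
            @OnWebSocketMessage
            public void onMessage(Session session, String trigger) {
                // "server response" thread: push messages without waiting for the remote.
                new Thread(() -> {
                    try {
                        for (int i = 0; i < MESSAGE_COUNT; i++) {
                            session.getRemote().sendString(LARGE);   // fragmented by FragmentExtension
                            session.getRemote().sendString("short"); // small, sent as a single frame
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }, "server response").start();
            }
        }

        @WebSocket
        public static class SlowClientSocket {
            @OnWebSocketConnect
            public void onConnect(Session session) {
                try {
                    session.getRemote().sendString("start"); // kicks off the server loop
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }

            @OnWebSocketMessage
            public void onMessage(String msg) {
                try {
                    Thread.sleep(READ_DELAY_MS); // simulate the slow reader
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }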

Initial blush: There's no regard for backpressure on the write side of this use case.

Initial guess (needs more testing):

Using sendString() from another thread isn't a great idea.
You'll need to use sendStringByFuture() or sendString(msg, callback) and check that the message has left the Java network stack before you send another.
This is the only way you'll know about backpressure from the remote endpoint when sending from another thread.
I'll try working that change in and see what happens.
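
For illustration, a minimal sketch of that suggestion, assuming Jetty 9.4's RemoteEndpoint API (the class name, message source, and timeout here are invented):

    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    import org.eclipse.jetty.websocket.api.RemoteEndpoint;
    import org.eclipse.jetty.websocket.api.Session;

    // Sends each message only after the previous write has completed, so the
    // sending thread observes backpressure from the remote endpoint.
    public class BackpressureAwareSender {
        public void sendAll(Session session, Iterable<String> messages) throws Exception {
            RemoteEndpoint remote = session.getRemote();
            for (String msg : messages) {
                Future<Void> write = remote.sendStringByFuture(msg);
                write.get(30, TimeUnit.SECONDS); // block until this message has left the local stack
            }
        }
    }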

joakime commented Apr 26, 2018

You can follow along with the test project at https://github.com/joakime/websocket-issue-2491

jeremystone (Author) commented:

OK. Thanks. Glad you're seeing it too.

joakime commented Apr 26, 2018

The backpressure behaviors are not the cause.

Need to dig into WebSocketRemoteEndpoint and FrameFlusher to see if the fault lies there.

joakime commented Apr 26, 2018

WebSocketRemoteEndpoint and FrameFlusher are not at fault.

Taking a closer look at the FragmentExtension now.

joakime commented Apr 26, 2018

Note: we are planning on eliminating the FragmentExtension, as it's incompatible with the WebSocket over HTTP/2 drafts, which enforce fragmentation anyway.

joakime added a commit that referenced this issue Apr 26, 2018
+ Adding testcase for RemoteEndpoint
+ Adding testcase for FrameFlusher

Signed-off-by: Joakim Erdfelt <joakim.erdfelt@gmail.com>

joakime commented Apr 26, 2018

I've been able to narrow down the fault to the FragmentExtension. I now have a testcase that replicates the issue in isolation from the rest of the components.

joakime added a commit that referenced this issue Apr 26, 2018
+ Adding testcase to ensure no regression
+ All data frames that arrive, are now sent through the
  IteratingCallback to ensure that that input frame order
  is preserved.

Signed-off-by: Joakim Erdfelt <joakim.erdfelt@gmail.com>
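
For context, here is a minimal sketch of the IteratingCallback pattern that commit message refers to. It is not the actual FragmentExtension flusher; the queue, enqueue(), and writeAsync() names are invented for illustration.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.eclipse.jetty.util.IteratingCallback;

    // Every outgoing frame is appended to one queue and drained by a single
    // iterating loop, so frames go out in exactly the order they were handed in.
    public class OrderedFrameFlusher extends IteratingCallback {
        private final Queue<String> queue = new ConcurrentLinkedQueue<>();

        // Hypothetical entry point: all data frames go through the queue, none bypass it.
        public void enqueue(String frame) {
            queue.offer(frame);
            iterate(); // wakes the flusher if it is currently idle
        }

        @Override
        protected Action process() {
            String frame = queue.poll();
            if (frame == null)
                return Action.IDLE;  // nothing to do until enqueue() calls iterate() again

            writeAsync(frame);       // must call succeeded() (or failed()) when the write completes
            return Action.SCHEDULED; // process() runs again once succeeded() has been called
        }

        // Stand-in for the real asynchronous network write.
        private void writeAsync(String frame) {
            System.out.println("WIRE: " + frame);
            succeeded();
        }
    }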

joakime commented Apr 26, 2018

Fixed in commit c596fca on branch jetty-9.4.x-issue-2491-ws-fragmentextension

Here's what was happening (a simplified sketch follows the list):

  1. The FragmentExtension would enqueue data frames (TEXT/BINARY/CONTINUATION) if they were over the configured maxLength.
  2. The FragmentExtension had an internal flusher that drained that queue, sending large payloads out as fragments with support for backpressure.
  3. We had a large frame in the queue, being processed.
  4. A smaller frame arrived that was under the configured maxLength; it skipped the queue and was sent immediately. (a bug that was just fixed)
  5. The callbacks related to the original frame were being notified of success on every fragment produced. (a bug that was just fixed)
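
Here is a simplified, runnable model of that interleaving (not the actual FragmentExtension code; the class, payloads, and maxLength value are made up):

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Frames over maxLength are queued and drained by a flusher paced by
    // backpressure; smaller frames skipped the queue entirely (the pre-fix bug).
    public class FragmentInterleaveSketch {
        static final int MAX_LENGTH = 4; // hypothetical fragmentation threshold
        final Queue<String> pendingFragments = new ArrayDeque<>();

        void outgoingFrame(String message) {
            if (message.length() > MAX_LENGTH) {
                // Split into fragments and enqueue them for the flusher.
                for (int i = 0; i < message.length(); i += MAX_LENGTH) {
                    boolean first = (i == 0);
                    boolean last = (i + MAX_LENGTH >= message.length());
                    String opcode = first ? "TEXT(fin=false)" : last ? "CONTINUATION(fin=true)" : "CONTINUATION(fin=false)";
                    pendingFragments.add(opcode + " \"" + message.substring(i, Math.min(i + MAX_LENGTH, message.length())) + "\"");
                }
            } else {
                // BUG (pre-fix): small frames bypassed the queue and were written immediately.
                System.out.println("WIRE: TEXT(fin=true) \"" + message + "\"");
            }
        }

        void flush(int budget) {
            // The real flusher drains asynchronously; 'budget' stands in for how far
            // backpressure lets it get before the next frame arrives.
            while (budget-- > 0 && !pendingFragments.isEmpty())
                System.out.println("WIRE: " + pendingFragments.poll());
        }

        public static void main(String[] args) {
            FragmentInterleaveSketch ext = new FragmentInterleaveSketch();
            ext.outgoingFrame("a large fragmented message");
            ext.flush(2);                 // slow client: only the first fragments go out
            ext.outgoingFrame("hi");      // small frame jumps the queue mid-message
            ext.flush(Integer.MAX_VALUE); // the remaining CONTINUATION frames arrive too late
        }
    }

Running main() prints a standalone TEXT frame between the fragments of the first message, which is exactly the invalid sequence the client parser rejects.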

jeremystone commented Apr 26, 2018 via email

joakime added a commit that referenced this issue Apr 26, 2018

joakime commented Apr 26, 2018

9.4.10 final release should occur sometime within the next 2 weeks.
Once I get CI verification of this fix, I'll at least push a SNAPSHOT (for you).

joakime added a commit that referenced this issue Apr 26, 2018
…entextension

Issue #2491 WebSocket FragmentExtension produces out of order Frames
joakime changed the title from "Fragmented websocket frames sometimes received out of order for slow clients" to "WebSocket FragmentExtension can produce an invalid stream of frames" Apr 26, 2018

joakime commented Apr 27, 2018

@jeremystone please try Jetty 9.4.10.RC1 (currently on maven central)

jeremystone (Author) commented:

9.4.10.RC1 looks good. Not been able to reproduce with this. Thanks again.

joakime commented Apr 27, 2018

No, thank you for the issue report with reproducer! (so important to get to the bottom of the issue)
