Send websocket message in multiple fragments when needed. #2355

atoppi · 2020-09-08T10:06:55Z

This PR adds the support to WebSocket fragmentation (RFC 6455) when sending large messages to Janus clients and fixes #2349.

The effort started after the issue #2349 where it has been proved that in particular scenarios that involved the encapsulation of websocket in h2 Janus would crash due to a buffer overflow in the underlying libraries.
We decided to follow @lws-team suggestion (thanks again!) to implement message fragmentation in order to avoid this crash.
The size of each fragment (aka message chunk ) has been set as suggested to about 2 MTU (so 2800 bytes) and it is defined in the MESSAGE_CHUNK_SIZE macro.

While working on this PR we have also refactored the event loop in order to avoid duplicate source and make the code cleaner and more readable.

Clients should not take any action as their websocket libraries are expected to take care of message fragmentation.

Notable changes (for reviewers):

we do not use anymore the sent integer returned by lws_write to move the buffer pointer bufoffset. We just check that sent bytes are not lesser than requested amount (that would mean an error occurred) and then increment the pointer by amount
bufoffset is now initialized to LWS_PRE (IMHO it does more sense, given that the buffer has size equals to LWS_PRE + message size
the response string (that is the message to be sent) is now being freed as soon as it has been copied into the buffer and not just before quitting a loop iteration as before
we fixed some potential overflows by transforming bufsize-like variables from int type to size_t

lminiero

Added some notes 👍

lminiero · 2020-09-08T10:42:07Z

transports/janus_websockets.c

@@ -1198,56 +1201,63 @@ static int janus_websockets_common_callback(
 				}

 				/* Check if we have a pending/partial write to complete first */
-				if(ws_client->buffer && ws_client->bufpending > 0 && ws_client->bufoffset > 0
-						&& !g_atomic_int_get(&ws_client->destroyed) && !g_atomic_int_get(&stopping)) {


These atomic checks should not be removed.

Added again the checks in the OR form to let the function early return.

@lminiero Hi，May I asy why these atomic checks should not be removed ?
It had checked before of this code branch, isn't ?
What if ws_client->destroyed is not volicate int rather than int ?
if so, Shoud it checked here ?

transports/janus_websockets.c

lminiero · 2020-09-08T10:52:50Z

transports/janus_websockets.c

 				}
+				/* Done for this round, check the next response/notification later */
+				lws_callback_on_writable(wsi);


This seems to be calling lws_callback_on_writable even when we successfully sent the last fragment, which shouldn't happen (it will perform one more useless loop).

I have preserved the previous behavior (https://github.com/meetecho/janus-gateway/blob/master/transports/janus_websockets.c#L1246).
A quick test showed some issues when sending asyncronous messages when the writable call gets moved from that position.

…ges.

tmatth · 2020-09-08T13:21:36Z

transports/janus_websockets.c

 					janus_mutex_unlock(&ws_client->ts->mutex);
 					return 0;
 				}
+
+				/* Evaluate amount of data to send according to MESSAGE_CHUNK_SIZE */
+				int amount = ws_client->bufpending <= MESSAGE_CHUNK_SIZE ? ws_client->bufpending : MESSAGE_CHUNK_SIZE;


nit: why not use size_t for amount (to avoid casting later) as it seems it's never needed as an int.

Also a more descriptive name like: amount_to_send or bytes_to_send might be preferable.

Honestly I've used the same name (amount) and type (int) present in the lws example.

@atoppi sure but in their case they don't need to do the casts. It's a minor point but cleaner IMO.

If amount type is set to size_t the following simple test will fail

#include <stdio.h> #include <inttypes.h> #include <stdlib.h> #include <limits.h> int main(int argc, char *argv[]) { size_t amount = 2400; /* Simulate error in lws_write */ int sent = -1; if (sent < amount) { printf("sent < amount !\n"); return 0; } return -1; }

Any cast in the comparison if (sent < amount) might have a problem:

if we do (size_t)sent, we are casting a signed to unsigned

if we do (int)amount, we are casting to a lower (at least on my machine) size type (8 vs 4)

Am I missing something ?

So in that example, since you are comparing to signed you should be using ssize_t for amount, but AFAICT in your patch you were only ever comparing to unsigned.

here I am comparing sent (type int, return value of lws_write) with amount.
According to lws docs about lws_write:

Return may be -1 for a fatal error needing connection close

Right so ssize_t would make sense then.

I personally don't see the point: lws_write returns int, not ssize_t, and comparing size_t and ssize_t would result in a comparison between values with different sizes anyway. Using int for that part seems reasonable.

Yeah didn't want to derail the discussion here as we know these values are capped pretty low anyway.

atoppi · 2020-09-09T10:28:49Z

@lws-team while we wait for this to be merged I'd like to discuss another issue I've found out with Firefox in this scenario.

Since I was testing huge payloads, I tried even bigger sizes (e.g. ~1.2 MBytes messages).
What happens is the following:

a new websocket gets opened on Janus
the message from the client does not seem to reach Janus
when closing the tab or refreshing the page the websocket still remains open in Janus and a TCP socket is still established on the server (confirmed with lsof)
only when the whole browser gets closed all the open websockets (e.g. due to multiple tries) are closed on Janus

The lws log of this scenario can be found here.

lws-team · 2020-09-09T10:41:38Z

[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] seen incoming LWS_H2_FRAME_TYPE_DATA start
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] Frame header DATA: sid 15, flags 0x0, len 14
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] DATA rx on state 3
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] lws_h2_parser: LWS_H2_FRAME_TYPE_DATA: fl 0x0
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_read_h1: h1 path: wsi state 0x119
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_parse_ws: received 14 byte packet
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_parse_ws: exit with 0 unused
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] lws_h2_parse_end_of_frame: DATA flags 0x0

I would think this is a ffox rfc8441 thing... although the support in current ffox is good for normal use now, there were a couple of bugs that hung around for about a year (in their bugzilla)... eventually they figured out it was on their side and solved it.

He's just not sending us the bulk data, it looks like we just get the encapsulated ws frame header. And after that the link is idle.

With rfc8441 the ws traffic is going via what is often an already-existing tcp + tls connection to the server, the h2 one that provided the html and js that told it to open the ws link back to the server. This is highly desirable, since we can mux inside that h2 connection without any further transport setup or validation. But it means that when streams close, that's just a logical action inside a socket / tls tunnel that stays up, it's no longer the case that we get unambiguous events at the transport level like read() blows and throws -1 or 0 length tls reads or the usual that means that socket has had it. We do get an unambiguous event but it's happening at ws frame parser inside h2 parse layer... it's pretty different.

It is meaningful to also try chrome since their implementations are completely different... chrome converged on a stable rfc8441 solution quicker than firefox in the past.

atoppi · 2020-09-09T13:37:54Z

hi @lws-team,
I don't know how to make a test for RFC8441 with Chrome because, as discussed in the linked issue, Chrome seems not to fall in the same scenario of Firefox (WSS + lws w/ h2 support).

Do you think that a fix/workaround is possibile to avoid those stalling websockets on Janus, or we just need to wait for Mozilla to tackle this?

lws-team · 2020-09-09T13:41:10Z

If I understand it, ffox just isn't sending any bulk data on the ws link, nor closing it either.... if so that's all happening inside ffox and I don't see how we can do anything about it.

atoppi · 2020-09-09T13:44:21Z

Ack, thanks again @lws-team !

lminiero · 2020-09-15T10:57:58Z

Merging 👍

Send websocket message in multiple fragments when needed.

0ba580e

atoppi requested a review from lminiero September 8, 2020 10:06

lminiero reviewed Sep 8, 2020

View reviewed changes

Add more atomic checks. Fix start fragment condition. Code style chan…

6dfbf54

…ges.

tmatth reviewed Sep 8, 2020

View reviewed changes

lminiero merged commit 8b9549a into master Sep 15, 2020

lminiero deleted the ws-send-fragments branch September 15, 2020 10:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Send websocket message in multiple fragments when needed. #2355

Send websocket message in multiple fragments when needed. #2355

atoppi commented Sep 8, 2020

lminiero left a comment

lminiero Sep 8, 2020

atoppi Sep 8, 2020

chengmo03013106 Jul 9, 2021

lminiero Sep 8, 2020

atoppi Sep 8, 2020

lminiero Sep 8, 2020

tmatth Sep 8, 2020

tmatth Sep 8, 2020

atoppi Sep 8, 2020

tmatth Sep 8, 2020

atoppi Sep 8, 2020

tmatth Sep 8, 2020

atoppi Sep 8, 2020

tmatth Sep 8, 2020

lminiero Sep 8, 2020

tmatth Sep 8, 2020

atoppi commented Sep 9, 2020

lws-team commented Sep 9, 2020

atoppi commented Sep 9, 2020

lws-team commented Sep 9, 2020

atoppi commented Sep 9, 2020

lminiero commented Sep 15, 2020

Send websocket message in multiple fragments when needed. #2355

Send websocket message in multiple fragments when needed. #2355

Conversation

atoppi commented Sep 8, 2020

lminiero left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

atoppi commented Sep 9, 2020

lws-team commented Sep 9, 2020

atoppi commented Sep 9, 2020

lws-team commented Sep 9, 2020

atoppi commented Sep 9, 2020

lminiero commented Sep 15, 2020