Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send websocket message in multiple fragments when needed. #2355

Merged
merged 2 commits into from
Sep 15, 2020

Conversation

atoppi
Copy link
Member

@atoppi atoppi commented Sep 8, 2020

This PR adds the support to WebSocket fragmentation (RFC 6455) when sending large messages to Janus clients and fixes #2349.

The effort started after the issue #2349 where it has been proved that in particular scenarios that involved the encapsulation of websocket in h2 Janus would crash due to a buffer overflow in the underlying libraries.
We decided to follow @lws-team suggestion (thanks again!) to implement message fragmentation in order to avoid this crash.
The size of each fragment (aka message chunk ) has been set as suggested to about 2 MTU (so 2800 bytes) and it is defined in the MESSAGE_CHUNK_SIZE macro.

While working on this PR we have also refactored the event loop in order to avoid duplicate source and make the code cleaner and more readable.

Clients should not take any action as their websocket libraries are expected to take care of message fragmentation.

Notable changes (for reviewers):

  • we do not use anymore the sent integer returned by lws_write to move the buffer pointer bufoffset. We just check that sent bytes are not lesser than requested amount (that would mean an error occurred) and then increment the pointer by amount
  • bufoffset is now initialized to LWS_PRE (IMHO it does more sense, given that the buffer has size equals to LWS_PRE + message size
  • the response string (that is the message to be sent) is now being freed as soon as it has been copied into the buffer and not just before quitting a loop iteration as before
  • we fixed some potential overflows by transforming bufsize-like variables from int type to size_t

@atoppi atoppi requested a review from lminiero September 8, 2020 10:06
Copy link
Member

@lminiero lminiero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some notes 👍

@@ -1198,56 +1201,63 @@ static int janus_websockets_common_callback(
}

/* Check if we have a pending/partial write to complete first */
if(ws_client->buffer && ws_client->bufpending > 0 && ws_client->bufoffset > 0
&& !g_atomic_int_get(&ws_client->destroyed) && !g_atomic_int_get(&stopping)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These atomic checks should not be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added again the checks in the OR form to let the function early return.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lminiero Hi,May I asy why these atomic checks should not be removed ?
It had checked before of this code branch, isn't ?
What if ws_client->destroyed is not volicate int rather than int ?
if so, Shoud it checked here ?

transports/janus_websockets.c Outdated Show resolved Hide resolved
transports/janus_websockets.c Outdated Show resolved Hide resolved
transports/janus_websockets.c Outdated Show resolved Hide resolved
transports/janus_websockets.c Outdated Show resolved Hide resolved
transports/janus_websockets.c Show resolved Hide resolved
}
/* Done for this round, check the next response/notification later */
lws_callback_on_writable(wsi);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be calling lws_callback_on_writable even when we successfully sent the last fragment, which shouldn't happen (it will perform one more useless loop).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have preserved the previous behavior (https://github.com/meetecho/janus-gateway/blob/master/transports/janus_websockets.c#L1246).
A quick test showed some issues when sending asyncronous messages when the writable call gets moved from that position.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack!

janus_mutex_unlock(&ws_client->ts->mutex);
return 0;
}

/* Evaluate amount of data to send according to MESSAGE_CHUNK_SIZE */
int amount = ws_client->bufpending <= MESSAGE_CHUNK_SIZE ? ws_client->bufpending : MESSAGE_CHUNK_SIZE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not use size_t for amount (to avoid casting later) as it seems it's never needed as an int.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a more descriptive name like: amount_to_send or bytes_to_send might be preferable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I've used the same name (amount) and type (int) present in the lws example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atoppi sure but in their case they don't need to do the casts. It's a minor point but cleaner IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If amount type is set to size_t the following simple test will fail

#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <limits.h>

int main(int argc, char *argv[]) {	
    size_t amount = 2400;
    /* Simulate error in lws_write */
    int sent = -1;
    if (sent < amount) {
        printf("sent < amount !\n");
        return 0;
    }
    return -1;
}

Any cast in the comparison if (sent < amount) might have a problem:

  • if we do (size_t)sent, we are casting a signed to unsigned
  • if we do (int)amount, we are casting to a lower (at least on my machine) size type (8 vs 4)

Am I missing something ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in that example, since you are comparing to signed you should be using ssize_t for amount, but AFAICT in your patch you were only ever comparing to unsigned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I am comparing sent (type int, return value of lws_write) with amount.
According to lws docs about lws_write:

Return may be -1 for a fatal error needing connection close

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right so ssize_t would make sense then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally don't see the point: lws_write returns int, not ssize_t, and comparing size_t and ssize_t would result in a comparison between values with different sizes anyway. Using int for that part seems reasonable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah didn't want to derail the discussion here as we know these values are capped pretty low anyway.

@atoppi
Copy link
Member Author

atoppi commented Sep 9, 2020

@lws-team while we wait for this to be merged I'd like to discuss another issue I've found out with Firefox in this scenario.

Since I was testing huge payloads, I tried even bigger sizes (e.g. ~1.2 MBytes messages).
What happens is the following:

  • a new websocket gets opened on Janus
  • the message from the client does not seem to reach Janus
  • when closing the tab or refreshing the page the websocket still remains open in Janus and a TCP socket is still established on the server (confirmed with lsof)
  • only when the whole browser gets closed all the open websockets (e.g. due to multiple tries) are closed on Janus

The lws log of this scenario can be found here.

@lws-team
Copy link

lws-team commented Sep 9, 2020

[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] seen incoming LWS_H2_FRAME_TYPE_DATA start
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] Frame header DATA: sid 15, flags 0x0, len 14
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] DATA rx on state 3
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] lws_h2_parser: LWS_H2_FRAME_TYPE_DATA: fl 0x0
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_read_h1: h1 path: wsi state 0x119
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_parse_ws: received 14 byte packet
[Wed Sep 9 12:21:47 2020] [libwebsockets][DEBUG] lws_parse_ws: exit with 0 unused
[Wed Sep 9 12:21:47 2020] [libwebsockets][INFO] lws_h2_parse_end_of_frame: DATA flags 0x0

I would think this is a ffox rfc8441 thing... although the support in current ffox is good for normal use now, there were a couple of bugs that hung around for about a year (in their bugzilla)... eventually they figured out it was on their side and solved it.

He's just not sending us the bulk data, it looks like we just get the encapsulated ws frame header. And after that the link is idle.

With rfc8441 the ws traffic is going via what is often an already-existing tcp + tls connection to the server, the h2 one that provided the html and js that told it to open the ws link back to the server. This is highly desirable, since we can mux inside that h2 connection without any further transport setup or validation. But it means that when streams close, that's just a logical action inside a socket / tls tunnel that stays up, it's no longer the case that we get unambiguous events at the transport level like read() blows and throws -1 or 0 length tls reads or the usual that means that socket has had it. We do get an unambiguous event but it's happening at ws frame parser inside h2 parse layer... it's pretty different.

It is meaningful to also try chrome since their implementations are completely different... chrome converged on a stable rfc8441 solution quicker than firefox in the past.

@atoppi
Copy link
Member Author

atoppi commented Sep 9, 2020

hi @lws-team,
I don't know how to make a test for RFC8441 with Chrome because, as discussed in the linked issue, Chrome seems not to fall in the same scenario of Firefox (WSS + lws w/ h2 support).

Do you think that a fix/workaround is possibile to avoid those stalling websockets on Janus, or we just need to wait for Mozilla to tackle this?

@lws-team
Copy link

lws-team commented Sep 9, 2020

If I understand it, ffox just isn't sending any bulk data on the ws link, nor closing it either.... if so that's all happening inside ffox and I don't see how we can do anything about it.

@atoppi
Copy link
Member Author

atoppi commented Sep 9, 2020

Ack, thanks again @lws-team !

@lminiero
Copy link
Member

Merging 👍

@lminiero lminiero merged commit 8b9549a into master Sep 15, 2020
@lminiero lminiero deleted the ws-send-fragments branch September 15, 2020 10:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash when using Firefox with WebSockets
5 participants