Send websocket message in multiple fragments when needed. #2355
Conversation
Added some notes 👍
@@ -1198,56 +1201,63 @@ static int janus_websockets_common_callback(
```c
	}

	/* Check if we have a pending/partial write to complete first */
	if(ws_client->buffer && ws_client->bufpending > 0 && ws_client->bufoffset > 0
			&& !g_atomic_int_get(&ws_client->destroyed) && !g_atomic_int_get(&stopping)) {
```
These atomic checks should not be removed.
I've added the checks again, in OR form, to let the function return early.
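For context, a minimal sketch of what checking those flags in OR form for an early return could look like (hypothetical, not the exact patch; `ws_client` and the mutex follow the snippets in this review):

```c
/* Hypothetical sketch: bail out early if the transport is shutting down,
 * instead of guarding the pending-write branch with the same conditions */
if(g_atomic_int_get(&ws_client->destroyed) || g_atomic_int_get(&stopping)) {
	janus_mutex_unlock(&ws_client->ts->mutex);
	return 0;
}
```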
@lminiero Hi, may I ask why these atomic checks should not be removed? They were already checked before this code branch, weren't they? And what if `ws_client->destroyed` is not a volatile int but a plain int? If so, should it be checked here?
```c
	}
	/* Done for this round, check the next response/notification later */
	lws_callback_on_writable(wsi);
```
This seems to be calling `lws_callback_on_writable` even when we successfully sent the last fragment, which shouldn't happen (it will perform one more useless loop).
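A hypothetical sketch of the alternative being suggested here (not part of the patch): only re-arm the writable callback when part of the buffer is still pending.

```c
/* Hypothetical variant: skip the extra callback round once the
 * last fragment has been sent successfully */
if(ws_client->bufpending > 0)
	lws_callback_on_writable(wsi);
```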
I have preserved the previous behavior (https://github.com/meetecho/janus-gateway/blob/master/transports/janus_websockets.c#L1246). A quick test showed some issues when sending asynchronous messages if the writable call gets moved from that position.
Ack!
```c
		janus_mutex_unlock(&ws_client->ts->mutex);
		return 0;
	}

	/* Evaluate amount of data to send according to MESSAGE_CHUNK_SIZE */
	int amount = ws_client->bufpending <= MESSAGE_CHUNK_SIZE ? ws_client->bufpending : MESSAGE_CHUNK_SIZE;
```
nit: why not use `size_t` for `amount` (to avoid casting later), as it seems it's never needed as an int?
Also, a more descriptive name like `amount_to_send` or `bytes_to_send` might be preferable.
Honestly, I've used the same name (`amount`) and type (`int`) present in the lws example.
@atoppi sure, but in their case they don't need to do the casts. It's a minor point, but cleaner IMO.
If `amount`'s type is set to `size_t`, the following simple test will fail:
```c
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <limits.h>

int main(int argc, char *argv[]) {
	size_t amount = 2400;
	/* Simulate an error in lws_write */
	int sent = -1;
	/* sent is implicitly converted to size_t here, so -1 becomes SIZE_MAX
	 * and the error branch is never taken */
	if (sent < amount) {
		printf("sent < amount !\n");
		return 0;
	}
	return -1;
}
```
Any cast in the comparison `if (sent < amount)` might have a problem:

- if we do `(size_t)sent`, we are casting a signed value to an unsigned one
- if we do `(int)amount`, we are casting to a smaller type (8 vs 4 bytes, at least on my machine)

Am I missing something?
So in that example, since you are comparing to a signed value, you should be using `ssize_t` for `amount`; but AFAICT in your patch you were only ever comparing to unsigned values.
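For illustration, a hypothetical variant of the test above with `ssize_t`: both operands of the comparison are signed, so the -1 error value is detected.

```c
#include <stdio.h>
#include <sys/types.h>	/* ssize_t (POSIX) */

int main(void) {
	ssize_t amount = 2400;
	/* Simulate an error in lws_write */
	int sent = -1;
	/* int is promoted to ssize_t, so -1 stays -1 and the check works */
	if (sent < amount) {
		printf("sent < amount !\n");
		return 0;
	}
	return -1;
}
```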
Here I am comparing `sent` (type `int`, the return value of `lws_write`) with `amount`. According to the lws docs about `lws_write`:

> Return may be -1 for a fatal error needing connection close
Right, so `ssize_t` would make sense then.
I personally don't see the point: `lws_write` returns `int`, not `ssize_t`, and comparing `size_t` and `ssize_t` would still be a mixed signed/unsigned comparison anyway. Using `int` for that part seems reasonable.
Yeah, I didn't want to derail the discussion here, as we know these values are capped pretty low anyway.
@lws-team while we wait for this to be merged, I'd like to discuss another issue I've found with Firefox in this scenario. Since I was testing huge payloads, I tried even bigger sizes (e.g. ~1.2 MB messages). The lws log of this scenario can be found here.
I would think this is a ffox rfc8441 thing... although the support in current ffox is good for normal use now, there were a couple of bugs that hung around for about a year (in their bugzilla)... eventually they figured out it was on their side and solved it. Firefox is just not sending us the bulk data; it looks like we just get the encapsulated ws frame header, and after that the link is idle.

With rfc8441 the ws traffic goes via what is often an already-existing tcp + tls connection to the server: the h2 one that provided the html and js that told it to open the ws link back to the server. This is highly desirable, since we can mux inside that h2 connection without any further transport setup or validation. But it means that when streams close, that's just a logical action inside a socket / tls tunnel that stays up; it's no longer the case that we get unambiguous events at the transport level, like read() returning -1, zero-length tls reads, or the usual signs that the socket has had it. We do get an unambiguous event, but it happens at the ws frame parser inside the h2 parse layer... it's pretty different.

It is meaningful to also try Chrome, since their implementations are completely different... Chrome converged on a stable rfc8441 solution quicker than Firefox in the past.
Hi @lws-team, do you think a fix/workaround is possible to avoid those stalling websockets on Janus, or do we just need to wait for Mozilla to tackle this?

If I understand it, ffox just isn't sending any bulk data on the ws link, nor closing it either... if so, that's all happening inside ffox and I don't see how we can do anything about it.

Ack, thanks again @lws-team!

Merging 👍
This PR adds support for WebSocket fragmentation (RFC 6455) when sending large messages to Janus clients, and fixes #2349.
The effort started after issue #2349, where it was shown that, in particular scenarios involving the encapsulation of WebSocket in h2, Janus would crash due to a buffer overflow in the underlying libraries.
We decided to follow @lws-team's suggestion (thanks again!) to implement message fragmentation in order to avoid this crash.
The size of each fragment (aka message chunk) has been set, as suggested, to about 2 MTUs (2800 bytes) and is defined in the `MESSAGE_CHUNK_SIZE` macro.

While working on this PR we have also refactored the event loop, in order to avoid duplicated sources and make the code cleaner and more readable.
Clients should not need to take any action, as their WebSocket libraries are expected to take care of message fragmentation.
Notable changes (for reviewers):

- we no longer use the `sent` integer returned by `lws_write` to move the buffer pointer `bufoffset`; we just check that `sent` is not less than the requested `amount` of bytes (that would mean an error occurred) and then increment the pointer by `amount`
- `bufoffset` is now initialized to `LWS_PRE` (IMHO it makes more sense, given that the buffer has a size equal to `LWS_PRE` + message size)
- the `response` string (that is, the message to be sent) is now freed as soon as it has been copied into the buffer, and not just before quitting a loop iteration as before
- some of the involved variables have been switched from the `int` type to `size_t`