-
Notifications
You must be signed in to change notification settings - Fork 763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fail fast utf-8 aka 6.4.* autobahn #392
Comments
I took a quick look at this a couple weeks ago while refactoring the protocol handling. I was unable to get fast failing working without erroneously failing uncompleted but valid messages. |
After some work and be able to make both 1 and 2 passing test with a weird work around on 2. 1 and 2 work on frame level. Maybe because I use preg_match and only tell me true/false if the string is utf-8 valid. That's why I pass the test but by the time I write my method here I see a potential hole not cover by the testsuite. Seem on the long run that is not worth the effort. |
You should be able to make fast-fail work on the Message level. You would need to check the frame as it is added to the message by first removing partial characters at the end and then check the contents, then keep the overflow and prepend it to the next frame contents prior to the check. This should not require a whole lot more processing power as you would never need to recheck anything. |
@mbonneau : That is indeed a better way. But to cover every thing it will require at least 3 UTF8 check ? For 6.4.2 it's the last bytes. But it could be up to 4 bytes to have been split. |
There are a few checks that you can use on just the final few bytes to find possible partial characters I think. I started building one a while back didn't get very far, but here is a comment I had put in there:
I did not get far enough to test this though. |
@mbonneau thanks that help. First personnal test seem good. Here the code to find tail UTF-8 codepoint private function TrimEndUTF8($data,$size) {
// single byte in tail
if ((ord($data[$size-1]) & 0x80 ) == 0x00)
return 0;
//only last 3 bytes matter.
$end = 3;
if ( $size < $end )
$end=$size;
for ( $num = 1 ; $num<=$end ; $num++ ) {
$current = ord($data[$size-$num]);
if ((($current & 0xF8) == 0xF0) && ($num<4)) { return $num; }
if ((($current & 0xF0) == 0xE0) && ($num<3)) { return $num; }
if ((($current & 0xE0) == 0xC0) && ($num<2)) { return $num; }
}
//not found incomplet UTF-8 bytes in tail
return 0;
} |
Optimised function with check for invalid tail string private function trimendUTF8($data,$size) {
//check last 4 bytes for UTF8 codepoint
$end = 4;
if ( $size < $end )
$end=$size;
for ( $pos = 1 ; $pos<=$end ; $pos++ ) {
$current = ord($data[$size-$pos]);
if (($current & 0xF8) == 0xF0) { return ($pos == 4) ? 0 : $pos; }
if (($current & 0xF0) == 0xE0) { return ($pos == 3) ? 0 : $pos; }
if (($current & 0xE0) == 0xC0) { return ($pos == 2) ? 0 : $pos; }
if (($current & 0x80) == 0x00) { if ($pos == 1 ) return 0; }
}
//Invalid tail string UTF8 ( skip any more utf8 check )
return false;
} |
On performance point of view, fail fast will result in a slower server deeper you check for UTF8 (message -> frame -> packet ). Unless you expect a high % of traffic to be invalid UTF8 and occur before the middle of the message, you'll not see any advantage and worse the server will be slower for everyone. To be equal in speed the checking must scale evenly and it's not the case. check preg_match only (this exclude the cost of checking overflow )
Multiply the last one by 1024 times... to be even with checking once it require a speed of |
Closing this as it would be implemented in the RFC lib: ratchetphp/RFC6455#11 |
I see the benefit that could bring but fail to understand how the test working specificaly 6.4.2.
6.4.2 work on same level as 6.4.1 as frame.
6.4.1 consist of 2 frame ( last frame didn't count
6.4.2 consist of
Did you manage to make those test pass ? Seem it fail too fast but first frame of 6.4.2 are not UTF8 or I'm wrong.
The text was updated successfully, but these errors were encountered: