Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fail fast utf-8 aka 6.4.* autobahn #392

Closed
Xaraknid opened this issue Jan 6, 2016 · 9 comments
Closed

fail fast utf-8 aka 6.4.* autobahn #392

Xaraknid opened this issue Jan 6, 2016 · 9 comments

Comments

@Xaraknid
Copy link

Xaraknid commented Jan 6, 2016

I see the benefit that could bring but fail to understand how the test working specificaly 6.4.2.
6.4.2 work on same level as 6.4.1 as frame.

6.4.1 consist of 2 frame ( last frame didn't count

1 - \xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5 ( UTF8 = true )
2 - \xf4\x90\x80\x80 ( UTF-8 = false )
--- fail fast
3- \x65\x64\x69\x74\x65\x64

6.4.2 consist of

1- \xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xf4 ( UTF-8 false )
---fail fast
2- \x90
3- \x80\x80\x65\x64\x69\x74\x65\x64

Did you manage to make those test pass ? Seem it fail too fast but first frame of 6.4.2 are not UTF8 or I'm wrong.

@cboden
Copy link
Member

cboden commented Jan 7, 2016

I took a quick look at this a couple weeks ago while refactoring the protocol handling. I was unable to get fast failing working without erroneously failing uncompleted but valid messages.

@cboden cboden added the question label Jan 7, 2016
@Xaraknid
Copy link
Author

Xaraknid commented Jan 7, 2016

After some work and be able to make both 1 and 2 passing test with a weird work around on 2.

1 and 2 work on frame level.
3 and 4 work on buffer level and to me it's seem wrong to do utf-8 test on that level, waste of cpu cycle. That implies you check frame headers each time with incomplete frame ( chance of having incomplete headers frame ) and doing utf-8 check each time. But I guest if you check on buffer level you should remove frame level test. But something bother me on those 2 test is the opcode they use are wrong and by itself will fail opcode = 0 and fin = false.

Maybe because I use preg_match and only tell me true/false if the string is utf-8 valid. That's why
\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xf4 is false. Maybe with a different utf-8 validator it'll be true.

I pass the test but by the time I write my method here I see a potential hole not cover by the testsuite.

Seem on the long run that is not worth the effort.
My opinion is unless you check for validating UTF-8 per bytes or char you can't cover fail-fast efficiently and to me checking on that level seem CPU-heavy and result in low process time with huge data. That negate all benefit that could offer a fail-fast.

@mbonneau
Copy link
Member

mbonneau commented Jan 7, 2016

You should be able to make fast-fail work on the Message level. You would need to check the frame as it is added to the message by first removing partial characters at the end and then check the contents, then keep the overflow and prepend it to the next frame contents prior to the check. This should not require a whole lot more processing power as you would never need to recheck anything.

@Xaraknid
Copy link
Author

Xaraknid commented Jan 7, 2016

@mbonneau : That is indeed a better way. But to cover every thing it will require at least 3 UTF8 check ?

For 6.4.2 it's the last bytes. But it could be up to 4 bytes to have been split.
1- Check string if fail remove 1 bytes / if pass continue
Repeat step 1 two more times if fail again it's not UTF-8 message

@mbonneau
Copy link
Member

mbonneau commented Jan 7, 2016

There are a few checks that you can use on just the final few bytes to find possible partial characters I think.

I started building one a while back didn't get very far, but here is a comment I had put in there:

        /*
         * If a string has 1111110x in the last 5 bytes
         * or 111110xx in the last 4
         * or 11110xxx in the last 3
         * or 1110xxxx in the last 2
         * or 110xxxxx in the last 1
         * Then the $str ends in a fragment - remove the above byte found
         * and the last bytes - return that as the fragment
         */

I did not get far enough to test this though.

@Xaraknid
Copy link
Author

Xaraknid commented Jan 7, 2016

@mbonneau thanks that help. First personnal test seem good. Here the code to find tail UTF-8 codepoint

private function TrimEndUTF8($data,$size) {
    // single byte in tail 
    if ((ord($data[$size-1]) & 0x80 ) == 0x00)
      return 0;

    //only last 3 bytes matter.
    $end = 3;
    if ( $size < $end )
      $end=$size;

    for ( $num = 1 ; $num<=$end ; $num++ ) {
      $current = ord($data[$size-$num]);
      if ((($current & 0xF8) == 0xF0) && ($num<4)) { return $num; } 
      if ((($current & 0xF0) == 0xE0) && ($num<3)) { return $num; }
      if ((($current & 0xE0) == 0xC0) && ($num<2)) { return $num; }
    }
    //not found incomplet UTF-8 bytes in tail
    return 0;
  }

@Xaraknid
Copy link
Author

Xaraknid commented Jan 8, 2016

Optimised function with check for invalid tail string

private function trimendUTF8($data,$size) {
    //check last 4 bytes for UTF8 codepoint
    $end = 4;
    if ( $size < $end )
      $end=$size;

    for ( $pos = 1 ; $pos<=$end ; $pos++ ) {
      $current = ord($data[$size-$pos]);
      if (($current & 0xF8) == 0xF0) { return ($pos == 4) ? 0 : $pos; } 
      if (($current & 0xF0) == 0xE0) { return ($pos == 3) ? 0 : $pos; } 
      if (($current & 0xE0) == 0xC0) { return ($pos == 2) ? 0 : $pos; }
      if (($current & 0x80) == 0x00) { if ($pos == 1 ) return 0; }
    }
    //Invalid tail string UTF8 ( skip any more utf8 check )
    return false;
  }

@Xaraknid
Copy link
Author

On performance point of view, fail fast will result in a slower server deeper you check for UTF8 (message -> frame -> packet ).

Unless you expect a high % of traffic to be invalid UTF8 and occur before the middle of the message, you'll not see any advantage and worse the server will be slower for everyone. To be equal in speed the checking must scale evenly and it's not the case.

check preg_match only (this exclude the cost of checking overflow )
for a 1M message once vs checking for frame/packet of 1024 .

msg size:  1048576 |  0.0016322135925293 seconds | rps : 612.66491381829
msg size:  1024 |  7.9870223999023E-5 seconds    | rps : 12520.310447761

Multiply the last one by 1024 times... to be even with checking once it require a speed of
1,5956903594771E-6 seconds.

@mbonneau
Copy link
Member

Closing this as it would be implemented in the RFC lib: ratchetphp/RFC6455#11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants