fail fast utf-8 aka 6.4.* autobahn #392

Xaraknid · 2016-01-06T19:23:45Z

I see the benefit that could bring but fail to understand how the test working specificaly 6.4.2.
6.4.2 work on same level as 6.4.1 as frame.

6.4.1 consist of 2 frame ( last frame didn't count

1 - \xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5 ( UTF8 = true )
2 - \xf4\x90\x80\x80 ( UTF-8 = false )
--- fail fast
3- \x65\x64\x69\x74\x65\x64

6.4.2 consist of

1- \xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xf4 ( UTF-8 false )
---fail fast
2- \x90
3- \x80\x80\x65\x64\x69\x74\x65\x64

Did you manage to make those test pass ? Seem it fail too fast but first frame of 6.4.2 are not UTF8 or I'm wrong.

cboden · 2016-01-07T15:36:12Z

I took a quick look at this a couple weeks ago while refactoring the protocol handling. I was unable to get fast failing working without erroneously failing uncompleted but valid messages.

Xaraknid · 2016-01-07T16:28:46Z

After some work and be able to make both 1 and 2 passing test with a weird work around on 2.

1 and 2 work on frame level.
3 and 4 work on buffer level and to me it's seem wrong to do utf-8 test on that level, waste of cpu cycle. That implies you check frame headers each time with incomplete frame ( chance of having incomplete headers frame ) and doing utf-8 check each time. But I guest if you check on buffer level you should remove frame level test. But something bother me on those 2 test is the opcode they use are wrong and by itself will fail opcode = 0 and fin = false.

Maybe because I use preg_match and only tell me true/false if the string is utf-8 valid. That's why
\xce\xba\xe1\xbd\xb9\xcf\x83\xce\xbc\xce\xb5\xf4 is false. Maybe with a different utf-8 validator it'll be true.

I pass the test but by the time I write my method here I see a potential hole not cover by the testsuite.

Seem on the long run that is not worth the effort.
My opinion is unless you check for validating UTF-8 per bytes or char you can't cover fail-fast efficiently and to me checking on that level seem CPU-heavy and result in low process time with huge data. That negate all benefit that could offer a fail-fast.

mbonneau · 2016-01-07T16:32:34Z

You should be able to make fast-fail work on the Message level. You would need to check the frame as it is added to the message by first removing partial characters at the end and then check the contents, then keep the overflow and prepend it to the next frame contents prior to the check. This should not require a whole lot more processing power as you would never need to recheck anything.

Xaraknid · 2016-01-07T16:45:05Z

@mbonneau : That is indeed a better way. But to cover every thing it will require at least 3 UTF8 check ?

For 6.4.2 it's the last bytes. But it could be up to 4 bytes to have been split.
1- Check string if fail remove 1 bytes / if pass continue
Repeat step 1 two more times if fail again it's not UTF-8 message

mbonneau · 2016-01-07T16:53:30Z

There are a few checks that you can use on just the final few bytes to find possible partial characters I think.

I started building one a while back didn't get very far, but here is a comment I had put in there:

        /*
         * If a string has 1111110x in the last 5 bytes
         * or 111110xx in the last 4
         * or 11110xxx in the last 3
         * or 1110xxxx in the last 2
         * or 110xxxxx in the last 1
         * Then the $str ends in a fragment - remove the above byte found
         * and the last bytes - return that as the fragment
         */

I did not get far enough to test this though.

Xaraknid · 2016-01-07T18:36:41Z

@mbonneau thanks that help. First personnal test seem good. Here the code to find tail UTF-8 codepoint

private function TrimEndUTF8($data,$size) {
    // single byte in tail 
    if ((ord($data[$size-1]) & 0x80 ) == 0x00)
      return 0;

    //only last 3 bytes matter.
    $end = 3;
    if ( $size < $end )
      $end=$size;

    for ( $num = 1 ; $num<=$end ; $num++ ) {
      $current = ord($data[$size-$num]);
      if ((($current & 0xF8) == 0xF0) && ($num<4)) { return $num; } 
      if ((($current & 0xF0) == 0xE0) && ($num<3)) { return $num; }
      if ((($current & 0xE0) == 0xC0) && ($num<2)) { return $num; }
    }
    //not found incomplet UTF-8 bytes in tail
    return 0;
  }

Xaraknid · 2016-01-08T17:20:16Z

Optimised function with check for invalid tail string

private function trimendUTF8($data,$size) {
    //check last 4 bytes for UTF8 codepoint
    $end = 4;
    if ( $size < $end )
      $end=$size;

    for ( $pos = 1 ; $pos<=$end ; $pos++ ) {
      $current = ord($data[$size-$pos]);
      if (($current & 0xF8) == 0xF0) { return ($pos == 4) ? 0 : $pos; } 
      if (($current & 0xF0) == 0xE0) { return ($pos == 3) ? 0 : $pos; } 
      if (($current & 0xE0) == 0xC0) { return ($pos == 2) ? 0 : $pos; }
      if (($current & 0x80) == 0x00) { if ($pos == 1 ) return 0; }
    }
    //Invalid tail string UTF8 ( skip any more utf8 check )
    return false;
  }

Xaraknid · 2016-01-21T00:00:27Z

On performance point of view, fail fast will result in a slower server deeper you check for UTF8 (message -> frame -> packet ).

Unless you expect a high % of traffic to be invalid UTF8 and occur before the middle of the message, you'll not see any advantage and worse the server will be slower for everyone. To be equal in speed the checking must scale evenly and it's not the case.

check preg_match only (this exclude the cost of checking overflow )
for a 1M message once vs checking for frame/packet of 1024 .

msg size:  1048576 |  0.0016322135925293 seconds | rps : 612.66491381829
msg size:  1024 |  7.9870223999023E-5 seconds    | rps : 12520.310447761

Multiply the last one by 1024 times... to be even with checking once it require a speed of
1,5956903594771E-6 seconds.

mbonneau · 2018-12-14T03:03:17Z

Closing this as it would be implemented in the RFC lib: ratchetphp/RFC6455#11

cboden added the question label Jan 7, 2016

cboden added the WebSocket label Jan 7, 2016

cboden mentioned this issue Feb 20, 2016

Fail Fast UTF-8 AB-6.4.* ratchetphp/RFC6455#11

Open

mbonneau closed this as completed Dec 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fail fast utf-8 aka 6.4.* autobahn #392

fail fast utf-8 aka 6.4.* autobahn #392

Xaraknid commented Jan 6, 2016

cboden commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

mbonneau commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

mbonneau commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

Xaraknid commented Jan 8, 2016

Xaraknid commented Jan 21, 2016

mbonneau commented Dec 14, 2018

fail fast utf-8 aka 6.4.* autobahn #392

fail fast utf-8 aka 6.4.* autobahn #392

Comments

Xaraknid commented Jan 6, 2016

cboden commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

mbonneau commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

mbonneau commented Jan 7, 2016

Xaraknid commented Jan 7, 2016

Xaraknid commented Jan 8, 2016

Xaraknid commented Jan 21, 2016

mbonneau commented Dec 14, 2018