mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

stanvass · 2017-08-28T11:43:31Z

Hi, I've reviewed your code and it seems you don't take precautions against multi-byte characters being split around chunk boundaries, i.e. at the start/end of every chunk. They'll instead be filtered as invalid, which will result in a character being dropped from the stream from time to time.

Let me know if I'm reading this wrong.

ericnorris · 2017-08-28T15:14:58Z

Hey @stanvass, thanks for taking a look! I'm fairly confident this case is handled though - I even have a test for it here:

mbstring-stream/tests/MultibyteStringStreamTest.php

Lines 81 to 119 in 29ab229

    
           public function testMultibyteEdgeHandling() { 
        
               $output     = fopen('php://memory', 'w+'); 
        
               $filtername = 'convert.mbstring.UTF-8/UTF-8'; 
        
               stream_filter_append($output, $filtername, STREAM_FILTER_WRITE); 
        
               $donut_first_half  = substr("🍩", 0, 2); 
        
               $donut_second_half = substr("🍩", 2); 
        
               fwrite($output, $donut_first_half); 
        
               fflush($output); 
        
               rewind($output); 
        
               $expected = ''; 
        
               $result   = stream_get_contents($output); 
        
               $this->assertSame( 
        
                   $expected, 
        
                   $result, 
        
                   'Wrote out invalid character' 
        
               ); 
        
               fseek($output, 2); 
        
               fwrite($output, $donut_second_half); 
        
               fflush($output); 
        
               rewind($output); 
        
               $expected = '🍩'; 
        
               $result   = stream_get_contents($output); 
        
               $this->assertSame( 
        
                   $expected, 
        
                   $result, 
        
                   'Did not handle partial multibyte character' 
        
               ); 
        
           }

Internally the filter keeps a buffer of invalid (e.g. partial) multibyte characters in order to handle this situation.

stanvass · 2017-08-29T03:58:25Z

This technique works for simple stateless var-width charsets like UTF-8, but it'll fail for stateful var-width charset like EUC-*, whose interpretation depends on escape sequences which may be anywhere in a previous chunk (not just at the end of the last one) and need to be present when sending the string to mb_convert_encoding().

I've been working on a similar streaming processor and I've learned there's no single universal solution to the "imperfect chunking" problem. Every charset type requires a different solution.

If you only want to support single-byte charsets, UTF-8 and UTF-16 for decoding, then this is a fair approach, but I don't see this being documented.

ericnorris · 2017-08-29T15:41:58Z

Interesting @stanvass! Do you think you could provide me with some edge cases to test against? It's a little hard to investigate without hard data.

I'll rename the issue to better reflect that it may fail on stateful character sets, since UTF-8 is accounted for. Hopefully we can figure something out, but worst case I'll document the shortcomings.

ericnorris closed this as completed Aug 28, 2017

ericnorris reopened this Aug 29, 2017

ericnorris changed the title ~~The stream filter will corrupt characters split around chunk boundaries~~ mbstring-stream may not handle stateful charsets' multi-byte characters being split Aug 29, 2017

ericnorris mentioned this issue Feb 25, 2022

Add PHP 8.1 support #3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

stanvass commented Aug 28, 2017

ericnorris commented Aug 28, 2017

stanvass commented Aug 29, 2017

ericnorris commented Aug 29, 2017

mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

Comments

stanvass commented Aug 28, 2017

ericnorris commented Aug 28, 2017

stanvass commented Aug 29, 2017

ericnorris commented Aug 29, 2017