Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mbstring-stream may not handle stateful charsets' multi-byte characters being split #1

Open
stanvass opened this issue Aug 28, 2017 · 3 comments

Comments

@stanvass
Copy link

Hi, I've reviewed your code and it seems you don't take precautions against multi-byte characters being split around chunk boundaries, i.e. at the start/end of every chunk. They'll instead be filtered as invalid, which will result in a character being dropped from the stream from time to time.

Let me know if I'm reading this wrong.

@ericnorris
Copy link
Owner

Hey @stanvass, thanks for taking a look! I'm fairly confident this case is handled though - I even have a test for it here:

public function testMultibyteEdgeHandling() {
$output = fopen('php://memory', 'w+');
$filtername = 'convert.mbstring.UTF-8/UTF-8';
stream_filter_append($output, $filtername, STREAM_FILTER_WRITE);
$donut_first_half = substr("🍩", 0, 2);
$donut_second_half = substr("🍩", 2);
fwrite($output, $donut_first_half);
fflush($output);
rewind($output);
$expected = '';
$result = stream_get_contents($output);
$this->assertSame(
$expected,
$result,
'Wrote out invalid character'
);
fseek($output, 2);
fwrite($output, $donut_second_half);
fflush($output);
rewind($output);
$expected = '🍩';
$result = stream_get_contents($output);
$this->assertSame(
$expected,
$result,
'Did not handle partial multibyte character'
);
}

Internally the filter keeps a buffer of invalid (e.g. partial) multibyte characters in order to handle this situation.

@stanvass
Copy link
Author

This technique works for simple stateless var-width charsets like UTF-8, but it'll fail for stateful var-width charset like EUC-*, whose interpretation depends on escape sequences which may be anywhere in a previous chunk (not just at the end of the last one) and need to be present when sending the string to mb_convert_encoding().

I've been working on a similar streaming processor and I've learned there's no single universal solution to the "imperfect chunking" problem. Every charset type requires a different solution.

If you only want to support single-byte charsets, UTF-8 and UTF-16 for decoding, then this is a fair approach, but I don't see this being documented.

@ericnorris
Copy link
Owner

Interesting @stanvass! Do you think you could provide me with some edge cases to test against? It's a little hard to investigate without hard data.

I'll rename the issue to better reflect that it may fail on stateful character sets, since UTF-8 is accounted for. Hopefully we can figure something out, but worst case I'll document the shortcomings.

@ericnorris ericnorris reopened this Aug 29, 2017
@ericnorris ericnorris changed the title The stream filter will corrupt characters split around chunk boundaries mbstring-stream may not handle stateful charsets' multi-byte characters being split Aug 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants