-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mbstring-stream may not handle stateful charsets' multi-byte characters being split #1
Comments
Hey @stanvass, thanks for taking a look! I'm fairly confident this case is handled though - I even have a test for it here: mbstring-stream/tests/MultibyteStringStreamTest.php Lines 81 to 119 in 29ab229
Internally the filter keeps a buffer of invalid (e.g. partial) multibyte characters in order to handle this situation. |
This technique works for simple stateless var-width charsets like UTF-8, but it'll fail for stateful var-width charset like EUC-*, whose interpretation depends on escape sequences which may be anywhere in a previous chunk (not just at the end of the last one) and need to be present when sending the string to mb_convert_encoding(). I've been working on a similar streaming processor and I've learned there's no single universal solution to the "imperfect chunking" problem. Every charset type requires a different solution. If you only want to support single-byte charsets, UTF-8 and UTF-16 for decoding, then this is a fair approach, but I don't see this being documented. |
Interesting @stanvass! Do you think you could provide me with some edge cases to test against? It's a little hard to investigate without hard data. I'll rename the issue to better reflect that it may fail on stateful character sets, since UTF-8 is accounted for. Hopefully we can figure something out, but worst case I'll document the shortcomings. |
Hi, I've reviewed your code and it seems you don't take precautions against multi-byte characters being split around chunk boundaries, i.e. at the start/end of every chunk. They'll instead be filtered as invalid, which will result in a character being dropped from the stream from time to time.
Let me know if I'm reading this wrong.
The text was updated successfully, but these errors were encountered: