Fix encoding issues and improve parsing performance #116
base: main
Conversation
if (function_exists('mb_strtolower')) {
    // Multibyte-safe lowercasing when the mbstring extension is available.
    $mValue = mb_strtolower($mValue, 'utf-8');
} else {
    // Fallback: byte-wise lowercasing, only correct for ASCII input.
    $mValue = strtolower($mValue);
}
Same here…
Thanks for this pull request! I like the approach of moving the whole internal processing to a well-defined charset (UTF-8). But I really think the parser should, by default and unless the user explicitly requests something else, output the document in its original encoding. This means using iconv at the end to re-convert the output string back to its original encoding and preserving the original value of the @charset rule.

Also, converting any non-ASCII chars to unicode escapes seems wasteful to me in a day and age where most files are served using UTF-8 and do contain some non-ASCII chars. In my opinion, this should be an option. Ideally we could add a third option: escape all the chars that were escaped in the input and leave unescaped the chars that appeared verbatim in the input. But since we don't store that information, this should prove difficult to implement. Maybe we could add a way for users to configure ranges of characters to always escape even if they would be representable in the output charset.

Also, one of the open issues of the parser is handling UTF-16 files (either with or without BOM). With your changes, this should finally be fairly simple to implement: the heuristic that searches for the BOM or the @charset rule would just have to feed into the same conversion-to-UTF-8 step.
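For illustration, the suggested round-trip could look roughly like this (a minimal sketch; $oDocument->render() and the charset variable are stand-ins for whatever the parser actually exposes, not the patch's API):

$sUtf8Output = $oDocument->render(); // all internal processing happens in UTF-8
if ($sOriginalCharset !== 'utf-8') {
    // Re-convert to the document's original encoding on the way out;
    // //TRANSLIT approximates chars the target charset cannot represent.
    $sOutput = iconv('utf-8', $sOriginalCharset . '//TRANSLIT', $sUtf8Output);
} else {
    $sOutput = $sUtf8Output;
}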
The problem with returning text in a different encoding is that some characters may not be representable in that encoding. I have converted only the CSS strings back to escaped form, that is, the strings in double quotes (such as the content in font icon definitions, which is where I needed it). If I understand it correctly, the identifiers are not touched, which means they might still contain unicode.

Also, UTF-8 is now the default and recommended encoding in PHP, and the same goes for CSS. I am not sure I have ever seen anything encoded in UTF-16. I agree that detecting it and converting it to UTF-8 is possible and easy; on the other hand, I do not think UTF-16 has any use on the web because it is not ASCII-compatible, which means no developer or designer should ever try to create CSS in UTF-16. I was more worried about the GB2312 encoding, which was required in China, but luckily it appears not to be used much these days.

Anyway, I will be very busy with our company project in the next few weeks; after that I can work a bit more on this: adding UTF-16 support, an optional output encoding with a matching @charset change, and some more tests. I'll need to study the CSS spec a bit more, I guess. Thanks a lot for having a detailed look at my patch!
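For context, turning a non-ASCII char inside a CSS string back into an escape could be done roughly like this (a hedged sketch, not the patch's actual code; escapeCssString() is a hypothetical helper and requires the mbstring extension):

// Replace every char outside printable ASCII with its unicode escape,
// e.g. U+F005 becomes "\f005 " (the trailing space terminates the escape).
function escapeCssString(string $sValue): string
{
    return preg_replace_callback('/[^\x20-\x7E]/u', function (array $aMatch): string {
        return sprintf('\\%x ', mb_ord($aMatch[0], 'UTF-8'));
    }, $sValue);
}

echo escapeCssString("\u{F005}"); // prints: \f005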
Hey @FMCorz! I guess you might be interested in this patch...
Thanks @skodak! Looks like we're solving similar problems, how surprising :). Cheers!
Force-pushed from 2dda16e to 970d17b
Hi, I have reworked the patch. I got much deeper than I originally wanted: the problem was that strict parsing did not work well for me when I started writing tests for the new UTF stuff, so I had to fix some bugs first. When I started stepping through the code in a debugger, my hand started to hurt from the repeated clicking, so I rewrote some parts to stop calling peek() a million times.

I am not finished yet; I would like to add more test coverage to make sure there are no regressions.

Ciao
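To illustrate the kind of hot-loop change described above, a hypothetical before/after might look like this (the cursor fields and helper methods here are illustrative, not the patch's actual internals):

// Before (illustrative): every iteration re-reads the current char via peek().
while (!$this->isEnd() && ctype_space($this->peek())) {
    $this->consume(1);
}

// After (illustrative): advance a plain integer over the pre-split char array
// and update the parser state once at the end.
$iPos = $this->iCurrentPosition;
while ($iPos < $this->iLength && ctype_space($this->aText[$iPos])) {
    $iPos++;
}
$this->consume($iPos - $this->iCurrentPosition);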
// We need to know the charset before the parsing starts;
// UTF BOMs have the highest precedence and must be removed before other processing.
$this->sOriginalCharset = strtolower($this->oParserSettings->sDefaultCharset);
if (strpos($this->sText, self::BOM8) === 0) {
strpos scans the whole string for the BOM even though we only care about a match at position 0. I think we should use strpos(substr($this->sText, 0, strlen(self::BOM8)), self::BOM8) === 0 for performance. Same for all other checks.
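For illustration, an equivalent prefix check that never scans past the BOM length (a sketch; self::BOM8 is assumed to hold the UTF-8 BOM bytes "\xEF\xBB\xBF"):

if (strncmp($this->sText, self::BOM8, strlen(self::BOM8)) === 0) {
    // The input starts with a UTF-8 BOM; strip it before further processing.
    $this->sText = substr($this->sText, strlen(self::BOM8));
}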
OK, makes sense. Thanks.
// Substring operations are slower with unicode; the aText array is used for faster emulation.
if ($this->sTextLibrary !== 'ascii') {
    // Split the UTF-8 text into an array of single code points (-1 means no limit).
    $this->aText = preg_split('//u', $this->sText, -1, PREG_SPLIT_NO_EMPTY);
}
Shouldn't we just use the aText array for all cases instead of distinguishing between an ASCII and a non-ASCII case? Splitting an ASCII string into chars should be pretty cheap, and the memory overhead does IMO not warrant having separate code paths (double the code paths means double the number of tests). I'd love to see how running time and memory usage differ between the two, though, so I can make an informed decision.
If I understand it correctly, strings in PHP are already stored as arrays of bytes. I did not do much testing after the peek() refactoring, and the substrings are now much smaller, so I guess this is no longer critical for performance. I'll try to do more perf testing to see which way is faster.
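A self-contained illustration of the two access strategies under discussion (variable names are placeholders, not the parser's fields):

$sText = "héllo";

// ASCII path: PHP strings are byte arrays, so $sText[$i] is an O(1) byte read,
// which is only correct when every char is a single byte.
echo $sText[0]; // "h"

// Multibyte path: split the UTF-8 text into code points once up front, then
// index the array instead of calling mb_substr() for every lookahead.
$aText = preg_split('//u', $sText, -1, PREG_SPLIT_NO_EMPTY);
echo $aText[1]; // "é"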
Although I don't know much about this matter, I'd like to see this PR get merged (or refined to the point where it becomes mergeable). I had some encoding troubles myself that I could only fix with kind of a hack, and I could really use the performance boost: in the current state I can only use the parser in combination with a file cache. Any update on this PR?
This patch looks good. @skodak, which changes exactly improved the performance that much? I am using https://github.com/cweagans/composer-patches and want to add a patch to improve parsing performance. :-)
Is there any update or solution for this bug?
@ThemeMetro Someone (TM) would need to take this PR and split it into smaller, focused PRs (with good test coverage). Would you be willing to do this?
Hello,
I was wondering if you would be interested in a patch that tries to solve some charset-related issues and improves parsing performance. I was trying to parse a 500 KB CSS file (Bootstrap 3, Font Awesome and some custom stuff) and it took more than 70 seconds; with this patch I got it down to 3.5-7 seconds. While working on the patch I noticed a few other problems. A rough timing harness is sketched below.
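For anyone wanting to reproduce the numbers, a rough measurement harness might look like this (the file name is a placeholder, and Sabberworm\CSS\Parser is assumed to be the parser's entry point):

require 'vendor/autoload.php';

// Hypothetical 500 KB fixture combining Bootstrap 3, Font Awesome and custom rules.
$sCss = file_get_contents('big-stylesheet.css');

$fStart = microtime(true);
$oParser = new Sabberworm\CSS\Parser($sCss);
$oDocument = $oParser->parse();
printf("Parsed %d bytes in %.2f s\n", strlen($sCss), microtime(true) - $fStart);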
List of changes:
Ciao,
Petr