Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix encoding issues and improve parsing performance #116

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

skodak
Copy link

@skodak skodak commented Aug 20, 2016

Hello,

I was wondering if you would be interested in a patch that tries to solve some charset related issues and improves parsing performance. I was trying to parse a 500kb CSS file (Bootstrap 3, Font Awesome and some custom stuff) and it took more than 70 seconds, with this patch I managed to get it down to 3.5-7 seconds. While working on the patch I have noticed a few other problems.

List of changes:

  • internally all strings are now stored as UTF-8
  • BOM is used for utf-8 detection and removed if found
  • setCharset() was removed completely - all charset conversions are done in constructor
  • the rendered text is always returned as utf-8 with all CSS strings \encoded as ASCII chars
  • the @charset atRule is switched to 'utf-8' if present
  • mbstring extension is not required any more
  • iconv is required - previously it was required only if there were \encoded unicode chars
  • all non-ascii unicode chars are \encoded in CSS strings
  • null character is completely ignored in CSS strings for security reasons

Ciao,
Petr

if (function_exists('mb_strtolower')) {
$mValue = mb_strtolower($mValue, 'utf-8');
} else {
$mValue = strtolower($mValue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here…

@sabberworm
Copy link
Contributor

sabberworm commented Aug 21, 2016

Thanks for this pull-request!

I like the approach of moving the whole internal processing to a well-defined charset (UTF-8).

But I really think the parser should, by default and unless the user explicitly requests something else, output the document in its original encoding. This means using iconv at the end to re-convert the output string back to its original encoding and preserving original value of the @charset rule.

Also, converting any non-ASCII chars to unicode escapes seems wasteful too me in a day and age where most files are served using UTF-8 and do contain some non-ASCII chars. In my opinion, this should be an option of OutputFormat that can be configured to: a) escape all non-ASCII chars, or, b) only escape the characters not representable in the output charset (this should be the default).

Ideally we could add a third option: escape all the chars that were escaped in the input and leave the chars unescaped that appeared verbatim in the input. But since we don’t store that information, this should prove difficult to implement. Maybe we could add a way for users to configure ranges of characters to always escape even if they would be representable in the output charset.

Also, one of the open issues of the parser is handling UTF-16 files (either with or without BOM). With your changes, this should finally be fairly simple to implement: the heuristic that searches for the @charset rule should also test, as a fallback, if it can find 0'@'0'c'0'h'0'a'0'r'0's'0'e'0't' (for BE) or '@'0'c'0'h'0'a'0'r'0's'0'e'0't'0 (for LE) and parse its value. UTF-8, 16-LE, 16-BE should all be covered by at least one test case, each with and without BOM.

@skodak
Copy link
Author

skodak commented Aug 22, 2016

The problem with returning text in different encoding is that some characters may not be present in other encodings. I have converted only the CSS strings back to escaped form, those are the strings in double quotes (such as the content in font icon definitions - that was where I needed it). If I understand it correctly the identifiers are not touched which means they might be still in unicode. Also UTF-8 is now the default and recommended encoding in PHP, the same goes for CSS.

I am not sure if I ever saw anything encoded in UTF-16, I agree that detecting it and converting it to UTF-8 is possible and easy, on the other had I do not think UTF-16 has any use for web because it is not compatible with ascii - which means no developer or designer should ever try to create CSS in UTF-16. I was more worried about the GB2312 encoding which was required to be used in China, but luckily it appears not to be used much these days.

Anyway I will be very busy with our company project in the next few weeks and then I can work a bit more on this - adding the UTF-16, optional output encoding witch @charset change and some more tests. I'll need to study the CSS spec a bit more I guess.

Thanks a lot for having a detailed look at my patch!

@skodak
Copy link
Author

skodak commented Aug 22, 2016

hey @FMCorz! I guess you might be interested in this patch...

@FMCorz
Copy link

FMCorz commented Aug 22, 2016

Thanks @skodak! Looks like we're solving similar problems, how surprising :). Cheers!

@skodak skodak force-pushed the unicode branch 2 times, most recently from 2dda16e to 970d17b Compare August 23, 2016 19:46
@skodak
Copy link
Author

skodak commented Aug 23, 2016

Hi, I have reworked the patch, I got much deeper than I wanted originally - the problem was that the strict parsing did not work much for me when I started writing tests for the new UTF stuff - I had to fix some bugs first. When I started stepping through the code in debugger my hand started to hurt from the repeated clicking, so I rewrote some parts to stop calling peek() a million times.

I am not finished yet, I would like to add more test coverage to make sure there are no regressions.

Ciao

// We need to know the charset before the parsing starts,
// UTF BOMs have the highest precedence and must ve removed before other processing.
$this->sOriginalCharset = strtolower($this->oParserSettings->sDefaultCharset);
if (strpos($this->sText, self::BOM8) === 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strpos searches the whole string for the BOM. I think we should use strpos(substr($this->sText, 0, strlen(self::BOM8)), self::BOM8) === 0 for performance. Same for all other checks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes sense

@sabberworm
Copy link
Contributor

Thanks.


// Substring operations are slower with unicode, aText array is used for faster emulation.
if ($this->sTextLibrary !== 'ascii') {
$this->aText = preg_split('//u', $this->sText, null, PREG_SPLIT_NO_EMPTY);
Copy link
Contributor

@sabberworm sabberworm Aug 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t we just use the aText array for all cases instead of distinguishing between an ASCII and non-ASCII case? Splitting an ASCII string into chars should be pretty cheap and the memory overhead does IMO not warrant having separate code paths (double the code paths means double the number of tests). I’d love to see how running time and memory usages differ between the two, though, so I can make an informed decision.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand it correctly the strings in PHP are already stored as arrays of bytes

I did not do much testing after the peek refactoring, also the substrings are now much smaller - so I guess this is not critical for performance any more

I'll try to do more perf testing to see which way is faster

@jkrzefski
Copy link

Although I don't know so much about this matter, I'd like to see this PR get merged (or refined so far that it becomes mergeable). I had some encoding troubles myself that I could fix using kind of a hack. Also I could really use the performance boost. In the current state I can only use this in combination with a file cache. Any update on this PR?

@matzeeable
Copy link

This patch looks good. @skodak what exact changes did improve the performance that lot? I am using https://github.com/cweagans/composer-patches and want to add a patch to improve parsing performance. :-)

@oliverklee oliverklee deleted the branch MyIntervals:main February 7, 2024 11:36
@oliverklee oliverklee closed this Feb 7, 2024
@oliverklee oliverklee reopened this Feb 7, 2024
@oliverklee oliverklee changed the base branch from master to main February 7, 2024 22:38
@ThemeMetro
Copy link

Is there any update OR solution for this bug?

@oliverklee
Copy link
Contributor

@ThemeMetro Someone (TM) would need to take this PR and split it into smaller, focused PRs (with good test coverage). Would you be willing to do this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants