Fix encoding issues and improve parsing performance #116

skodak · 2016-08-20T12:56:42Z

Hello,

I was wondering if you would be interested in a patch that tries to solve some charset-related issues and improves parsing performance. I was trying to parse a 500 KB CSS file (Bootstrap 3, Font Awesome and some custom stuff) and it took more than 70 seconds; with this patch I managed to get it down to 3.5-7 seconds. While working on the patch I noticed a few other problems.

List of changes:

internally all strings are now stored as UTF-8
a BOM is used for UTF-8 detection and removed if found
setCharset() was removed completely - all charset conversions are done in the constructor
the rendered text is always returned as UTF-8, with non-ASCII characters in CSS strings backslash-escaped as ASCII
the @charset at-rule is switched to 'utf-8' if present
the mbstring extension is no longer required
iconv is required - previously it was required only if there were backslash-escaped Unicode chars
all non-ASCII Unicode chars are backslash-escaped in CSS strings
the null character is completely ignored in CSS strings for security reasons

Ciao,
Petr
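
The first items in the list above (BOM detection, conversion to UTF-8 up front) can be sketched roughly as follows. This is a minimal illustration and not the patch's actual code: the constant `SKETCH_BOM_UTF8` and the function `normalizeToUtf8()` are hypothetical names, and only the UTF-8 BOM is handled.

```php
<?php
// Hypothetical constant: the three bytes of a UTF-8 byte order mark.
const SKETCH_BOM_UTF8 = "\xEF\xBB\xBF";

// Hypothetical helper: returns the input as UTF-8 without a leading BOM.
function normalizeToUtf8(string $text, string $defaultCharset = 'utf-8'): string
{
    // A UTF-8 BOM takes precedence over the configured default charset.
    if (strncmp($text, SKETCH_BOM_UTF8, strlen(SKETCH_BOM_UTF8)) === 0) {
        return substr($text, strlen(SKETCH_BOM_UTF8));
    }
    if (strtolower($defaultCharset) === 'utf-8') {
        return $text;
    }
    // iconv() returns false when the conversion fails.
    $converted = iconv($defaultCharset, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $defaultCharset to UTF-8");
    }
    return $converted;
}
```

With the conversion done once at the entry point, the rest of the parser would only ever see UTF-8, which is the property the list describes.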

sabberworm · 2016-08-21T08:27:24Z

lib/Sabberworm/CSS/RuleSet/DeclarationBlock.php

+				if (function_exists('mb_strtolower')) {
+					$mValue = mb_strtolower($mValue, 'utf-8');
+				} else {
+					$mValue = strtolower($mValue);


Same here…

sabberworm · 2016-08-21T08:58:55Z

Thanks for this pull-request!

I like the approach of moving the whole internal processing to a well-defined charset (UTF-8).

But I really think the parser should, by default and unless the user explicitly requests something else, output the document in its original encoding. This means using iconv at the end to re-convert the output string back to its original encoding and preserving the original value of the @charset rule.

Also, converting any non-ASCII chars to unicode escapes seems wasteful to me in a day and age where most files are served using UTF-8 and do contain some non-ASCII chars. In my opinion, this should be an option of OutputFormat that can be configured to: a) escape all non-ASCII chars, or, b) only escape the characters not representable in the output charset (this should be the default).

Ideally we could add a third option: escape all the chars that were escaped in the input and leave the chars unescaped that appeared verbatim in the input. But since we don’t store that information, this should prove difficult to implement. Maybe we could add a way for users to configure ranges of characters to always escape even if they would be representable in the output charset.
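
Options (a) and (b) could look something like the sketch below. The function `escapeCssString()` and its `$escapeAll` flag are hypothetical and not an existing OutputFormat option. The representability test relies on `iconv()` returning false for characters the target charset lacks, which is how glibc-based builds behave; other iconv implementations may substitute a replacement character instead, so treat that part as an assumption.

```php
<?php
// Hypothetical helper: the name and the $escapeAll flag are illustrative,
// not an existing OutputFormat option.
function escapeCssString(string $utf8, string $outputCharset, bool $escapeAll = false): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $match) use ($outputCharset, $escapeAll): string {
            $char = $match[0];
            // Option (b): keep characters the output charset can represent.
            if (!$escapeAll && @iconv('UTF-8', $outputCharset, $char) !== false) {
                return $char;
            }
            // Read the code point from UCS-4BE so that mbstring is not needed.
            $codePoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char))[1];
            // The trailing space ends the escape so a following hex digit is not absorbed.
            return '\\' . dechex($codePoint) . ' ';
        },
        $utf8
    );
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}
```

The returned string is still UTF-8; converting it to the output charset would be a separate final step.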

Also, one of the open issues of the parser is handling UTF-16 files (either with or without BOM). With your changes, this should finally be fairly simple to implement: the heuristic that searches for the @charset rule should also test, as a fallback, if it can find 0'@'0'c'0'h'0'a'0'r'0's'0'e'0't' (for BE) or '@'0'c'0'h'0'a'0'r'0's'0'e'0't'0 (for LE) and parse its value. UTF-8, 16-LE, 16-BE should all be covered by at least one test case, each with and without BOM.
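
A rough sketch of that fallback heuristic, under a few assumptions: the function name `sniffUtf16()` is hypothetical, the `@charset` rule is assumed to sit at the very start of the input, and UTF-32 (whose little-endian BOM also begins with FF FE) is ignored.

```php
<?php
// Hypothetical sketch: returns 'UTF-16BE', 'UTF-16LE', or null if neither is detected.
function sniffUtf16(string $bytes): ?string
{
    // BOMs win over any other hint.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    // Without a BOM, look for "@charset" with a NUL byte before (BE)
    // or after (LE) each ASCII character.
    $chars = str_split('@charset');
    $be = "\0" . implode("\0", $chars);
    $le = implode("\0", $chars) . "\0";
    if (strncmp($bytes, $be, strlen($be)) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($bytes, $le, strlen($le)) === 0) {
        return 'UTF-16LE';
    }
    return null;
}
```

Once the byte order is known, the quoted charset value after the rule name could be read with the same interleaved-NUL layout, or the whole input could be converted to UTF-8 first and parsed normally.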

skodak · 2016-08-22T06:00:28Z

The problem with returning text in a different encoding is that some characters may not be present in other encodings. I have converted only the CSS strings back to escaped form; those are the strings in double quotes (such as the content in font icon definitions, which is where I needed it). If I understand it correctly, the identifiers are not touched, which means they might still be in Unicode. Also, UTF-8 is now the default and recommended encoding in PHP, and the same goes for CSS.

I am not sure if I ever saw anything encoded in UTF-16. I agree that detecting it and converting it to UTF-8 is possible and easy; on the other hand, I do not think UTF-16 has any use for the web because it is not compatible with ASCII, which means no developer or designer should ever try to create CSS in UTF-16. I was more worried about the GB2312 encoding, which was required to be used in China, but luckily it appears not to be used much these days.

Anyway, I will be very busy with our company project for the next few weeks, and then I can work a bit more on this: adding UTF-16, optional output encoding with @charset change, and some more tests. I'll need to study the CSS spec a bit more, I guess.

Thanks a lot for having a detailed look at my patch!

skodak · 2016-08-22T06:20:19Z

hey @FMCorz! I guess you might be interested in this patch...

FMCorz · 2016-08-22T06:28:30Z

Thanks @skodak! Looks like we're solving similar problems, how surprising :). Cheers!

skodak · 2016-08-23T19:50:08Z

Hi, I have reworked the patch. I got much deeper than I originally wanted: the problem was that strict parsing did not work well for me when I started writing tests for the new UTF stuff, so I had to fix some bugs first. When I started stepping through the code in the debugger, my hand started to hurt from the repeated clicking, so I rewrote some parts to stop calling peek() a million times.

I am not finished yet, I would like to add more test coverage to make sure there are no regressions.

Ciao

sabberworm · 2016-08-29T08:48:51Z

lib/Sabberworm/CSS/Parser.php

+		// We need to know the charset before the parsing starts,
+		// UTF BOMs have the highest precedence and must be removed before other processing.
+		$this->sOriginalCharset = strtolower($this->oParserSettings->sDefaultCharset);
+		if (strpos($this->sText, self::BOM8) === 0) {


strpos searches the whole string for the BOM. I think we should use strpos(substr($this->sText, 0, strlen(self::BOM8)), self::BOM8) === 0 for performance. Same for all other checks.

skodak · 2016-08-31

ok, makes sense
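
As an alternative to the `strpos(substr(...))` form suggested above, a bounded prefix comparison avoids both the full-string scan and the temporary substring. This is only a sketch: `SKETCH_BOM8` stands in for `self::BOM8` from the diff, and `startsWithBom8()` is a hypothetical name.

```php
<?php
// Stands in for self::BOM8 from the diff (assumed to be the UTF-8 BOM bytes).
const SKETCH_BOM8 = "\xEF\xBB\xBF";

// Hypothetical helper: true if the text begins with the UTF-8 BOM.
function startsWithBom8(string $text): bool
{
    // strncmp() compares at most strlen(SKETCH_BOM8) bytes,
    // so the cost does not depend on the length of $text.
    return strncmp($text, SKETCH_BOM8, strlen(SKETCH_BOM8)) === 0;
}
```

On PHP 8 and later, `str_starts_with($text, SKETCH_BOM8)` expresses the same check directly.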

sabberworm · 2016-08-29T09:09:07Z

Thanks.

sabberworm · 2016-08-29T09:14:26Z

lib/Sabberworm/CSS/Parser.php

+
+		// Substring operations are slower with unicode, aText array is used for faster emulation.
+		if ($this->sTextLibrary !== 'ascii') {
+			$this->aText = preg_split('//u', $this->sText, null, PREG_SPLIT_NO_EMPTY);


Shouldn’t we just use the aText array for all cases instead of distinguishing between an ASCII and non-ASCII case? Splitting an ASCII string into chars should be pretty cheap, and the memory overhead does not, IMO, warrant having separate code paths (double the code paths means double the number of tests). I’d love to see how running time and memory usage differ between the two, though, so I can make an informed decision.

skodak · 2016-08-31

If I understand it correctly, strings in PHP are already stored as arrays of bytes.

I did not do much testing after the peek refactoring, and the substrings are now much smaller, so I guess this is no longer critical for performance.

I'll try to do more perf testing to see which way is faster.
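
A comparison of the two splitting strategies could start from a micro-benchmark along these lines. It is a hypothetical sketch: `timeSplit()`, the sample CSS, and the run count are arbitrary choices, and the timings it prints depend entirely on the machine and PHP version.

```php
<?php
// Hypothetical micro-benchmark helper: runs $split repeatedly and reports
// the elapsed wall-clock time and the number of elements produced.
function timeSplit(callable $split, string $text, int $runs = 20): array
{
    $chars = [];
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $chars = $split($text);
    }
    return ['seconds' => microtime(true) - $start, 'count' => count($chars)];
}

// Arbitrary ASCII-only sample input.
$css = str_repeat(".a { color: red; }\n", 5000);

// Unicode-aware split, as in the diff (with -1 instead of null for "no limit").
$unicode = timeSplit(function (string $s): array {
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}, $css);

// Plain byte-wise split for comparison.
$bytes = timeSplit('str_split', $css);

printf("preg_split //u: %.4fs, str_split: %.4fs\n", $unicode['seconds'], $bytes['seconds']);
```

Memory could be compared in the same way by recording `memory_get_peak_usage()` in separate runs for each strategy.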

jkrzefski · 2017-02-06T11:42:48Z

Although I don't know so much about this matter, I'd like to see this PR get merged (or refined so far that it becomes mergeable). I had some encoding troubles myself that I could fix using kind of a hack. Also I could really use the performance boost. In the current state I can only use this in combination with a file cache. Any update on this PR?

matzeeable · 2021-12-22T08:33:10Z

This patch looks good. @skodak, which exact changes improved the performance that much? I am using https://github.com/cweagans/composer-patches and want to add a patch to improve parsing performance. :-)

ThemeMetro · 2024-08-06T03:04:42Z

Is there any update OR solution for this bug?

oliverklee · 2024-08-06T07:52:40Z

@ThemeMetro Someone (TM) would need to take this PR and split it into smaller, focused PRs (with good test coverage). Would you be willing to do this?

JoshuaBehrens · 2025-03-02T22:42:28Z

@skodak is the fork not worth it anymore? We've been waiting for this for almost a decade :D

JakeQZ · 2025-03-05T16:42:23Z

Hi @JoshuaBehrens. As I understand it, the PR provided various performance improvements as well as resolving various issues with charset and BOM handling. Is there a specific issue you are encountering? Also, @ThemeMetro, same question.

The changes at the last review can still be viewed on GitHub, so it would be possible to cherry-pick those that are valid and still applicable. @oliverklee, will the changes always be viewable, or might they disappear?

oliverklee · 2025-03-05T16:45:23Z

> will the changes always be viewable, or might they disappear?

AFAIK, they'll always be viewable in the GitHub UI, and we even can check them out locally (I've just tried) using the gh command-line client:

gh pr checkout 116

JoshuaBehrens · 2025-03-08T12:16:40Z

@JakeQZ This post basically summarizes my issue: #116 (comment). I can only use this with a caching layer. We once tried to convert CSS to Google AMP compatible CSS with this and things were unbearably slow.

JakeQZ · 2025-03-09T22:52:18Z

Thanks @JoshuaBehrens. I have some follow-up questions. There seem to be many small performance improvements in the original PR, so I'd like to establish which ones give the most benefit in real-world use cases.

Is the slowdown (bottleneck) in the parsing, manipulation, or rendering phase?
What is your caching layer caching, and how?
Would you be able to provide sample CSS (which I could use when running a profiler; thus I would need the full lengthy CSS to get meaningful results; can be done under NDA if it's proprietary)?

JoshuaBehrens · 2025-03-10T09:57:44Z

@JakeQZ so the idea was to parse a Shopware 5 compiled theme CSS file. So visit any Shopware 5 shop, or host one yourself with dockware https://hub.docker.com/layers/dockware/play/5.5.10/images/sha256-6c8588e9f3d1a2f73daed121092e76a1b677a07453422fb2a9773df68b07f38b and download the compiled CSS file. The caching layer was eventually based on a file hash of that file (back then, likely not xxhash). The bottleneck was likely parsing, IIRC.

We stopped this effort and approach so I have no reference to test against.

skodak mentioned this pull request Aug 21, 2016

Some characters are changed after render. #94

Open

sabberworm reviewed Aug 21, 2016

Fix encoding issues and improve parsing performance

2ddbdef

sabberworm reviewed Aug 29, 2016

sabberworm mentioned this pull request Sep 19, 2018

Wrong charset breaks parsing #137

Closed

sabberworm mentioned this pull request Nov 20, 2018

UTF-8 byte order mark (BOM) not stripped #150

Open

sabberworm self-assigned this Nov 20, 2018

oliverklee deleted the branch MyIntervals:main February 7, 2024 11:36

oliverklee closed this Feb 7, 2024

oliverklee reopened this Feb 7, 2024

oliverklee changed the base branch from master to main February 7, 2024 22:38

skodak closed this by deleting the head repository Feb 28, 2025
