Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve PHP parsing performance #1775

Merged
merged 3 commits into from
Jul 7, 2017
Merged

Improve PHP parsing performance #1775

merged 3 commits into from
Jul 7, 2017

Conversation

nylen
Copy link
Member

@nylen nylen commented Jul 7, 2017

Fixes #1646 where the PHP parser introduced in #1152 caused a performance regression.

Previous parsing times displayed evidence of O(n^2) complexity - from #1646 (comment):

 358 words: 0.06s
 921 words: 0.33s
1438 words: 0.77s
1793 words: 1.13s

Current parsing times (using a post not containing any WP blocks, generated the same way as the above example) are significantly improved, with O(n) complexity:

 500 words: 0.04s
1000 words: 0.08s
1500 words: 0.11s
2000 words: 0.13s

Read on for details about what changed.


Using this excellent StackOverflow answer, I set up xhprof to do granular profiling of PHP code. Here were the initial results when running RUN_SLOW_TESTS=y phpunit -c phpunit-xhprof.xml --filter large_post:

We're calling mb_strlen and mb_substr too many times (the post being tested is the Gutenberg demo content, which has around 10k characters), and these functions are too slow.

mb_strlen is really easy to fix, because it's always called against the full length of the input string. Fixed in nylen/phpegjs@9f6b0f8.

mb_substr is a bit more difficult to get rid of, but it's important to realize what these multibyte string functions actually do. They can't know the length of the string in advance, because UTF-8 characters are variable-length. So, every substring operation needs to loop through the entire string until it reaches its starting position. This is the source of the previous O(n^2) complexity.

This is fixed in nylen/phpegjs@3bfd473 by splitting the parse input into an array of UTF-8 characters when a parsing run begins. Direct array lookups are far faster than looping over a multibyte string, and WordPress core already uses this approach in other places. (Like core does in the code block below that one, we need to detect when PCRE Unicode support is not available, and use an alternative method. This is a task for a separate PR.)

Final performance results indicate further room for improvement, and there are some more obvious things to try, such as eliminating the input_substr function for substrings that are known to be of length 1. However, the major issues with the initial version are fixed and we can continue iterating later:

nylen added 2 commits July 7, 2017 02:40
This version gets rid of expensive `mb_strlen` and `mb_substr` calls
during parsing.
@westonruter
Copy link
Member

@mtias I think the parsing performance problem should be fixed prior to the 0.4 release.

@nylen
Copy link
Member Author

nylen commented Jul 7, 2017

Yep, that's the plan :)

@westonruter I am pretty satisfied with test coverage on these changes; do you see any issues here?

Copy link
Member

@westonruter westonruter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent performance debugging here!

$start_mem = memory_get_usage();

$html = file_get_contents(
dirname( __FILE__ ) . '/fixtures/long-content.html'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding a long-content.html fixture, should it instead just create the content programmatically?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me how to do this in a simple way. It would require executing Node.js, much like generating the PHP parser, which is also checked into the repo. I'd prefer to leave this as-is for now.

error_log( 'Time (seconds) : ' . ( round( $time * 100 ) / 100 ) );

$this->assertLessThanOrEqual(
0.3,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hard-coded number seems problematic in the long run because machines will get faster and faster that we develop on. What if, and this doesn't seem feasible either, we obtain the CPU speed and use that a multiplier for some threshold value? The other thought that came to mind is to use register_tick_function() to count the ticks that happen when parsing the large content, but I tried it out and on master and on this branch the result is both 11, probably because I was only declaring ticks in the unit test file and not the parser.


error_log( '' );
error_log( 'Memory used (KB) : ' . round( $mem / 1024 ) );
error_log( 'Time (seconds) : ' . ( round( $time * 100 ) / 100 ) );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error_log here may have an undesirable effect on the phpunit output, with the results being added inline with the progress when running all tests:

image

@westonruter westonruter added this to the Beta 0.4.0 milestone Jul 7, 2017
@nylen
Copy link
Member Author

nylen commented Jul 7, 2017

Regarding the last couple of review comments here: I would like to build a process that runs these "slow tests" every hour or so against master, then reports the results somewhere. I plan for this to handle many of the cases described at #939 (comment).

For performance tests, PHPunit may not be the best tool. Ideally the performance test mechanism should store past results, provide a history, and compare them against previous results from the same machine.

That's all for later - the important thing for now is to fix the performance degradation included in the 0.3.0 release.

@nylen nylen merged commit f3c1022 into master Jul 7, 2017
@nylen nylen deleted the fix/php-parsing-performance branch July 7, 2017 19:03
@dmsnell
Copy link
Member

dmsnell commented Jul 10, 2017

Very good work here @nylen, and I'm sorry for jumping to a false conclusion earlier about the source of the performance issues.

Curious:

  • do we have any comparison to how fast the JS-based parser performs on the same documents on the same machine?
  • do you see any other major performance issues here?

In light of this I'm going to close our #1681 and try to address other parser-related tasks. Thanks for staying persistent on this.

@nylen
Copy link
Member Author

nylen commented Jul 10, 2017

do we have any comparison to how fast the JS-based parser performs on the same documents on the same machine?

I just ran some more tests on parsing the fixture introduced in this PR.

JavaScript :  11ms
PHP 5.6    : 150ms
PHP 7.0    :  17ms
PHP 7.1    :  19ms

Fortunately PHP 7.x is our official recommendation.

do you see any other major performance issues here?

Nothing major really.

I am not a huge fan of the current approach of splitting the entire input string into an array at the beginning of the parse, but I don't see a good way around it, other than proving that we can safely use substr (sounds tricky and unlikely).

Fixing #1611 should provide further improvements (the peg_regex_test block at the bottom of the second performance graph).

We can get rid of a bunch more input_substr function calls (the red block at the bottom of the second performance graph). These calls aren't very complicated so I'm not sure how much it will actually help.

@dmsnell
Copy link
Member

dmsnell commented Jul 10, 2017

Ouch on 5.6 but those are good numbers otherwise. Keeping the PHP parse within a magnitude of the JS parse I think is a satisfactory target. Getting them within half the runtime of JS is good!

dmsnell added a commit that referenced this pull request Jul 20, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Jul 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 24, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 27, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 29, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 30, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
mcsf pushed a commit that referenced this pull request Aug 31, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 3, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 5, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
gziolo pushed a commit that referenced this pull request Sep 6, 2018
* Parser: Propose new hand-coded PHP parser

For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file

* Fix issue with containing the nested innerHTML

* Also handle newlines as whitespace

* Use classes for some static typing

* add type hints

* remove needless comment

* space where space is due

* meaningless rename

* remove needless function call

* harmonize with spec parser

* don't forget freeform HTML before blocks

* account for oddity in spec-parser

* add some polish, fix a thing

* comment it

* add JS version too

* Change `.` to `[^]` because `/s` isn't well supported in JS

The `s` flag on the RegExp object informs the engine to treat a
dot character as a class that includes the newline character.
Without it newlines aren't considered in the dot.

Since this flag is new to Javascript and not well supported in
different browsers I have removed it in favor of an explicit class
of characters that _does_ include the newline, namely the open
exclusion of `[^]` which permits all input characters.

Hat-top to @Hywan for finding this.

* Move code into `/packages` directory, prepare for review

* take out names from RegExp pattern to not fail tests

* Fix bug in parser: store HTML soup in stack frames while parsing

Previously we were sending all "HTML soup" segments of HTML between
blocks to the output list before any blocks were processed. We should
have been tracking these segments during the parsing and only spit
them out when closing a block at the top level.

This change stores the index into the input document at which that
soup starts if it exists and then produces the freeform block when
adding a block to the output from the parse frame stack.

* fix whitespace

* fix oddity in spec

* match styles

* use class name filter on server-side parser class

* fix whitespace

* Document extensibility

* fix typo in example code

* Push failing parsing test

* fix lazy/greedy bug in parser regexp

* Docs: Fix typos, links, tweak style.

* update from PR feedback

* trim docs

* Load default block parser, replacing PEG-generated one

* Expand `?:` shorthand for PHP 5.2 compat

* add fixtures test for default parser

* spaces to tabs

* could we need no assoc?

* fill out return array

* put that assoc back in there

* isometrize

* rename and add 0

* Conditionally include the parser class

* Add docblocks

* Standardize the package configuration
Tug pushed a commit that referenced this pull request Feb 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PEG.js based parser introduced a performance regression
3 participants