Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid double extraction of substrings in various MATCH_ bytecodes #427

Merged
merged 7 commits into from
Sep 7, 2023

Conversation

markw65
Copy link

@markw65 markw65 commented Aug 21, 2023

MATCH_* opcodes typically do something like

if (<test>(input.substr(peg$currPos, length))) {
  sN = input.substr(peg$currPos, length);
  ...
} else {
  sN = peg$FAILED;
  ...
}

where we extract the same substring twice.

compileInputChunkCondition will convert that to

sN = input.substr(peg$currPos, length);
if (<test>(sN)) {
  ...
} else {
  sN = peg$FAILED;
  ...
}

avoiding the overhead

@markw65
Copy link
Author

markw65 commented Aug 21, 2023

When stacked on #425 this makes the parser generated from examples/xml.peggy an additional 5% faster.

@hildjj
Copy link
Contributor

hildjj commented Aug 21, 2023

Did you check coverage over the changed bits of code with just the one test added?

I need to add codecov.io to the CI action.

@hildjj
Copy link
Contributor

hildjj commented Aug 21, 2023

Looks like line 584 might need a test.

@markw65
Copy link
Author

markw65 commented Aug 21, 2023

Looks like line 584 might need a test

I think thats probably true - the way bytecode is currently generated, I don't see a way to get an ACCEPT_N with a length greater than 1 which doesn't also hit the skipAccept condition.

The only way I can think of to hit that line would be to add an option to disable the transformation, and run a test with it disabled...

But how are you seeing that? npm test coverage doesn't mention that line for me...

@hildjj
Copy link
Contributor

hildjj commented Aug 21, 2023

I downloaded the PR with gh pr checkout 427, ran npm build, and opened coverage/lcov-report/index.html in a browser.

@markw65 markw65 force-pushed the optimize-substrings branch from fe01e6f to 894ee5b Compare August 21, 2023 23:30
@markw65
Copy link
Author

markw65 commented Aug 21, 2023

Ok, so yes, that line seems to be uncovered; and I've confirmed that the current byte code generator never generates a sequence that can hit that line.

In fact, the only thing that can hit the single character branch on the next line is MATCH_ANY.

So I've added a test to plugin-api.spec.js which inserts a munge-the-bytecode pass between generateBytecode and generateJS. It changes the bytecode so that compileInputChunkCondition doesn't see its expected pattern, and as a result we get full coverage (the test also verifies that the compileInputChunkCondition didn't fire by examining the generated parser).

@markw65
Copy link
Author

markw65 commented Aug 24, 2023

I think this is ready to go.

Copy link
Member

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, that some useful improvements could be made

Comment on lines 186 to 187
expect(bc[0]).to.equal(19 /* MATCH_STRING_IC */);
bc.splice(4, 0, 5 /* PUSH_CUR_POS */, 6 /* POP */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use real constants instead of comments. Import opcodes and

Suggested change
expect(bc[0]).to.equal(19 /* MATCH_STRING_IC */);
bc.splice(4, 0, 5 /* PUSH_CUR_POS */, 6 /* POP */);
expect(bc[0]).to.equal(op.MATCH_STRING_IC);
bc.splice(4, 0, op.PUSH_CUR_POS, op.POP);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - all the examples I looked at seem to use the numbers though. Will fix.

Comment on lines 391 to 396
cond = cond.replace("$$$.charCodeAt(0)", "input.charCodeAt(peg$currPos)");
}
compileCondition(
cond.replace("$$$", inputChunk),
argCount
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would better just accept a transformation function instead of replacing special placeholder, something like (inputChunk) => /*condition*/

: "input.substr(peg$currPos, " + inputChunkLength + ")";
if (bc[ip + baseLength] === op.ACCEPT_N
&& bc[ip + baseLength + 1] === inputChunkLength) {
skipAccept = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, it should be possible just consume 2 opcodes at once instead of introducing this flag. If you recognize pattern, just pretend that you have a new long opcode and process it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think about this a bit... it's not just two bytecodes; it's a MATCH bytecode followed by a then clause which starts with ACCEPT_N, and an else clause. So the ACCEPT_N is handled in a recursive call to compile.

But yes, presumably I can make compileCondition spit out the increment to currPos, and then skip over the ACCEPT_N.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to push back on this one. The only way I see to do it is to put a modified copy of compileCondition into compileInputChunkCondition. Or maybe pass in some extra options to tell it to push a particular string at the beginning the then block, and then skip the first two bytecodes. Either way, I think the current solution is cleaner.

If you have other ideas, or would like me to do one of the above, let me know.

Copy link
Member

@Mingun Mingun Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe pass in some extra options to tell it to push a particular string at the beginning the then block, and then skip the first two bytecodes.

Yes, that is what I mean. Basically you pretend that instead of this opcode:

[
  op.MATCH_STRING, 0,  // string 0
  if_len, else_len,
  op.ACCEPT_N, 2, ..., // if part
  ...                  // else part
]

you process this opcode:

[
  /* virtual opcode */,
  if_len-2, else_len,
  ..., // if part
  ...  // else part
]

So it seems that all that you need it is correctly calculate IF part to pass to recursion. It seems that you just need add a new parameter to compileCondition:

-function compileCondition(cond, argCount) {
+function compileCondition(cond, argCount, thenSkip) {
-       const thenLength = bc[ip + baseLength - 2];
+       const thenLength = bc[ip + baseLength - 2] - thenSkip;
          () => {
-           ip += baseLength;
+           ip += baseLength + thenSkip;
            thenCode = compile(bc.slice(ip, ip + thenLength));
            ip += thenLength;
          },
}

Pass 0 is all existing invocations and 2 (or whatever) in this new code (which is on line 391 currently)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't work, because It's only supposed to skip a part of the ACCEPT_N.

You also need to tell it to increment peg$currPos by the correct amount at the beginning of the then, and to bump stack.sp, to get things properly balanced...

And passing in a flag to do all that seems like way too much magic. I suppose it could be a callback, but even that feels much worse to me than the current solution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, ok. Strictly speaking, there are limited count of MATCH + ACCEPT combinations, so instead of repeating it over and over they could be splitted into functions that would be called by the parser. But I'm unsure how this will affect performance

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually working on some bytecode optimizations that will produce lots more MATCH combinations by pulling in subsequent bytecode.

That is to say that where currently we get

MATCH
  ACCEPT
ELSE
  FAIL
IF_FAILED // or IF_NOT_FAILED
  DO_FAILED_THINGS
ELSE
  DO_MATCH_THINGS

My optimization converts that to

MATCH
  ACCEPT
  DO_MATCH_THINGS
ELSE
  FAIL
  DO_FAILED_THINGS

This both saves the second conditional, and in many cases allows other cleanups - often it can avoid assigning the result of the match to anything, or assigning peg$FAILED to anything.

So I think this approach (which continues to work, and can be expanded after my next pull request), is the way to go, rather than pulling out MATCH_ACCEPT patterns.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, if you really hate the skipAccept flag, I think I can make a callback work fairly cleanly:

     ip += baseLength;
     thenCode = (callback || compile)(bc.slice(ip, ip + thenLength));
     ip += thenLength;

Then existing callers of compileCondition can be left unchanged, and compileInputChunkCondition can pass in a suitable callback to emulate the bits of accept that it needs to do, and skip over the first couple of bytecodes.

Let me know if you'd like me to do something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not necessary, but if you like to try this approach I also wouldn't be against

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, and I mostly like it, so I've pushed the update. The only downside is that now the code to handle ACCEPT_N is split into two places - the actual ACCEPT_N, and the cut down version handled by the callback to compileCondition.

But it avoids the statefulness of skipAccept, so I think its a good change on balance.

@markw65
Copy link
Author

markw65 commented Aug 24, 2023

There's a new coverage issue, sorry. I'll fix and update.

@markw65 markw65 force-pushed the optimize-substrings branch from dfbd231 to 8de3123 Compare August 24, 2023 17:34
Copy link
Member

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would replace using of replace on text strings which we should treat here as opaque tokens. Other that that it is OK

Comment on lines 397 to 399
inputChunkLength > 1
? "peg$currPos += " + inputChunkLength + ";"
: "peg$currPos++;"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course inputChunkLength wouldn't be zero, but for symmetry with the code above I think it would be better to invert condition:

Suggested change
inputChunkLength > 1
? "peg$currPos += " + inputChunkLength + ";"
: "peg$currPos++;"
inputChunkLength === 1
? "peg$currPos++;"
: "peg$currPos += " + inputChunkLength + ";"

Comment on lines 568 to 571
inputChunk = inputChunk.includes(".charAt(")
? inputChunk.replace(".charAt(", ".charCodeAt(")
: `${inputChunk}.charCodeAt(0)`;
return `${inputChunk} === ${literal.charCodeAt(0)}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems, we could always generate inputChunk as containing .charCodeAt and this hackish replace wouldn't be needed? Or this is not the case for ignored case literals? Then you can pass a ignoreCase parameter to compileInputChunkCondition and generate different code depending on it.

Or pass in a parameter either "input.charAt(peg$currPos)" or "input.charCodeAt(peg$currPos)" if this will be more readable (it will be used only if literal has length of 1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point is that if we use the inputChunk twice, we want to do

s0 = input.charAt(peg$currPos);
if (s0.charCodeAt(0)) {
  ...
}

but if we only use it once, we want to do:

if (input.charCodeAt(peg$currPos)) {
  ...
}

rather than

if (input.charAt(peg$currPos).charCodeAt(0)) {
  ...
}

Currently, we never generate byte code that uses the input twice when we're extracting a single, case sensitive character; but thats because there's already an optimization in generate-bytecode to avoid using it twice whenever possible as an optimization.

The way I originally wrote this, with a template string, meant that the code that chose inputChunk, and the code that matched the charCodeAt(0); was all self contained in compileInputChunk. Now compileInputChunk chooses inputChunk, but the callback has to make the decision about charAt vs charCodeAt, so yes, I agree it all looks a bit magical...

I could pass a flag back to the callback saying whether the optimization had happened or not, so the callback could then choose to ignore inputChunk when the optimization didn't happen. Would that work?

@@ -166,5 +167,45 @@ describe("plugin API", () => {
expect(parser.parse("x", { startRule: "b" })).to.equal("x");
expect(parser.parse("x", { startRule: "c" })).to.equal("x");
});

it("can munge the bytecode", () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... I learned a new word :)

@markw65
Copy link
Author

markw65 commented Aug 30, 2023

Anything else to do here?

@markw65 markw65 mentioned this pull request Sep 6, 2023
@markw65 markw65 force-pushed the optimize-substrings branch from 6abb00b to 181e338 Compare September 7, 2023 14:26
@markw65
Copy link
Author

markw65 commented Sep 7, 2023

I've rebased this over #425 (thanks!) resolved the CHANGELOG.md "conflict", and rebuilt the build artifacts.

@hildjj
Copy link
Contributor

hildjj commented Sep 7, 2023

Great, thanks. @Mingun final checks, please?

@hildjj hildjj merged commit f323563 into peggyjs:main Sep 7, 2023
@markw65 markw65 deleted the optimize-substrings branch September 7, 2023 15:53
MarcelBolten added a commit to MarcelBolten/phpeggy that referenced this pull request Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants