Skip to content

Remove "offsets" debugging code from regcomp.c #19407

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

demerphq
Copy link
Collaborator

This code was added by Mark Jason Dominus to aid a regex debugger
he wrote. The basic premise is that every opcode in a regex can
be attributed back to parts of the pattern. This assumption has
not been true ever since the TRIE optimizations were added, and
I believe that the debugger is no longer in use anyway.

The regex compiler is complicated enough without having to maintain
this logic. There are essentially no tests for it, and the few
tests that do cover it do so as a byproduct of testing other things.
Despite the offsets logic only being used in debug supporting it
does have a cost to non-debug logic as various internal routines
include parameters related to it that are otherwise unused.

I spoke to him many years ago about whether it was ok to remove
it from the regex engine and he said yes.

As part of this patch I also changed the name of the "parse_start"
and "oregcomp_parse" variables in certain contexts so that the
code is a bit more clear, this was partly because the offsets logic
used its own parse_start variable in certain contexts and changing
the names of the others made it easier to clean up.

@hvds
Copy link
Contributor

hvds commented Feb 11, 2022

This code was added by Mark Jason Dominus to aid a regex debugger
he wrote.

What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.

regcomp.c Outdated
Comment on lines 20549 to 20550
/* Allocate a regnode for 'op', with 'extra_size' extra (smallest) regnode
* equivalents space. It aligns and increments RExC_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've removed 'op'; if you can make more sense of with 'extra_size' extra (smallest) regnode equivalents space that would help too. I'd suggest (based on my best guess): Allocate a regnode that is (1 + extra_size) times as big as the smallest regnode.

@hvds
Copy link
Contributor

hvds commented Feb 11, 2022

regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also appears in ppport.h. If these are retained intentionally, they might merit mention in the commit message.

pod/perlreguts.pod has some references to "offsets" that should mostly be removed (found by grepping for /mjd/i).

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 12, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 12, 2022 via email

@demerphq demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 84349ec to 37ee630 Compare February 12, 2022 04:52
@demerphq
Copy link
Collaborator Author

demerphq commented Feb 12, 2022 via email

@demerphq demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 37ee630 to 9ca30b5 Compare February 12, 2022 10:54
@demerphq
Copy link
Collaborator Author

demerphq commented Feb 12, 2022 via email

@khwilliamson
Copy link
Contributor

khwilliamson commented Feb 12, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 13, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 13, 2022

Thanks Karl. So are you and Hugo good to apply this?

I plan to look at the latest updates later today. There are two other separate questions; my first was above:

What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.

I guess a cpangrep on a couple of critical functions would give a first-approximation answer on that.

The second (somewhat related to the first) is whether this should go in so soon before a new release or be deferred until after it.

I'm also eager to get this cleanup as soon as safely possible.

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 13, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 13, 2022

Last I checked AS werent supporting the product anymore, and as I said,
its been totally broken since jump-tries were implemented in 5.10 or so.
[snip 160 lines]

I understand that, but the question was not whether it works as intended (or ever can) but whether anyone is using it - if someone is, it would seem polite to look at what they're doing and/or talk to them.

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 13, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 13, 2022

Anyway since the "api" here is the regex debug output

Ah, my apologies, this was the piece of info I'd misplaced: I'd forgotten how little interface this actually provides, and mentally conflated the removed regcomp.c defines (and MJD_OFFSET_DEBUG) with the simultaneous embed.fnc changes, somehow imagining that a user of this mechanism would be trying to invoke a bunch of now-removed macros.

In any case I've now done some cpangrepping for a handful of candidate strings and found nothing that warrants a closer look.

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 14, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 14, 2022

So I guess we can merge it then?

You don't need to ask that repeatedly, rest assured that when I'm confident it's good to merge I'll either say so or simply merge it, and if I encounter something that seems like a problem I'll also say so. I'm also confident that others such as Karl would do the same.

I haven't yet had a chance to look at the latest updates, and when I have done I'll say so. I haven't yet clarified my opinion on whether this should be deferred until after the coming release, and when I have done I'll say so.

If you're worried that I've forgotten about it you're welcome to prod me; a few days would seem a reasonable time to wait before deciding that may be the case, except in urgent cases such as a security fix.

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 14, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 15, 2022

@demerphq I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert.

I'd quite like to see the variable changes in a separate commit from the offsets removal (see also khw's recent p5p request for guidance on this sort of thing). I'm happy to do the work of splitting it into two commits if you don't have a particular reason to be against that.

I've not yet looked at the ramifications of the additional commit ("change regexp_interal attribute from I32 to U32" - note typo); will get to that as soon as I can, but at first glance it doesn't seem controversial.

Based on my understanding so far, I don't think there's any pressing reason to defer this until after the upcoming release - @khwilliamson @leonerd @rjbs @neilb please comment if you think otherwise.

@khwilliamson
Copy link
Contributor

I agree completely with Hugo

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 16, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 16, 2022 via email

@khwilliamson
Copy link
Contributor

khwilliamson commented Feb 16, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 16, 2022 via email

Various changes have been made to struct regexp_internal over
time which have not been documented. This updates the docs to
match the code as it is now in preparation of changing the docs
in subsequent commits.
This code was added by Mark Jason Dominus to aid a regex debugger
he wrote for ActiveState. The basic premise is that every opcode
in a regex can be attributed back to a contiguous sequence of characters
that make up the pattern. This assumption has not been true ever since
the "jump" TRIE optimizations were added to the engine.

I spoke to MJD many years ago about whether it was ok to remove this
from the regex engine and he said he had no objections.

An example of a pattern that cannot be handled correctly by this logic is

    /(?: a x+ | b y+ | c z+ )/x

where the

    (?:a ... | b ... | c ...)

parts will now be handled by the TRIE logic and not by the BRANCH/EXACT
opcodes that it would have been in the past. The offset debug output
cannot handle this type of transformation, and produce nonsense output
that mention opcodes that have been optimized away from the final program.

The regex compiler is complicated enough without having to maintain this
logic. There are essentially no tests for it, and the few tests that do
cover it do so as a byproduct of testing other things. Despite the offsets
logic only being used in debug supporting it does have a cost to non-debug
logic as various internal routines include parameters related to it that
are otherwise unused.

Note this output is only usable or visible by enabling special flags
in re.pm, there is no formal API to access it short of parsing the
output of the debug mode of the regex engine, which has changed multiple
time over the past years.
This was originally done to make the cleanup of the offsets debug logic
easier to follow and understand. 'parse_start' was heavily used in
multiple functions, and given the size of the functions in regcomp.c
it was often not clear which parse_start was which. 'oregcomp_parse'
was also used in a similar way.

This patch disambiguates them all so they are all uniquely named
and relevant to the code they operate on and of the form
"thing_parse_start", (or "thing_parse_start_const" where both were in
use).
This changes the name_list_idx attribute from I32 to a U32 as it will
never be negative, and as of a963d6d a 0 can be safely used
to represent "no value" for items in the 'data' array.

I noticed this while cleaning up the offsets debug logic and updating
the perlreguts documentation, so I figured I might as well clean it up
at the same time.
@demerphq demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 9ca30b5 to f254578 Compare February 16, 2022 05:21
@demerphq
Copy link
Collaborator Author

demerphq commented Feb 16, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 16, 2022

LGTM, I'll apply this in a couple of days if nobody else has blocking comments.

(At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.)

@demerphq
Copy link
Collaborator Author

demerphq commented Feb 16, 2022 via email

@hvds
Copy link
Contributor

hvds commented Feb 18, 2022

Now pushed as four commits bddb8c7 .. f08cf40.

@hvds hvds closed this Feb 18, 2022
@iabyn
Copy link
Contributor

iabyn commented Mar 7, 2022 via email

@demerphq
Copy link
Collaborator Author

demerphq commented Mar 8, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants