Remove "offsets" debugging code from regcomp.c #19407

demerphq · 2022-02-11T08:13:13Z

This code was added by Mark Jason Dominus to aid a regex debugger
he wrote. The basic premise is that every opcode in a regex can
be attributed back to parts of the pattern. This assumption has
not been true ever since the TRIE optimizations were added, and
I believe that the debugger is no longer in use anyway.

The regex compiler is complicated enough without having to maintain
this logic. There are essentially no tests for it, and the few
tests that do cover it do so as a byproduct of testing other things.
Despite the offsets logic only being used in debug, supporting it
does have a cost to non-debug logic, as various internal routines
include parameters related to it that are otherwise unused.

I spoke to him many years ago about whether it was ok to remove
it from the regex engine and he said yes.

As part of this patch I also changed the name of the "parse_start"
and "oregcomp_parse" variables in certain contexts so that the
code is a bit clearer. This was partly because the offsets logic
used its own parse_start variable in certain contexts, and changing
the names of the others made it easier to clean up.

hvds · 2022-02-11T16:26:04Z

> This code was added by Mark Jason Dominus to aid a regex debugger
> he wrote.

What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.

hvds · 2022-02-11T16:40:44Z

regcomp.c

    /* Allocate a regnode for 'op', with 'extra_size' extra (smallest) regnode
     * equivalents space.  It aligns and increments RExC_size


You've removed 'op'; if you can make more sense of "with 'extra_size' extra (smallest) regnode equivalents space", that would help too. I'd suggest (based on my best guess): "Allocate a regnode that is (1 + extra_size) times as big as the smallest regnode."
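The suggested wording can be read as simple arithmetic. The sketch below is only an illustration of that reading: `tiny_regnode`, `regnode_slots` and `regnode_bytes` are hypothetical names, and the real smallest regnode is defined in regcomp.h, not here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the smallest regnode. The real definition
 * lives in regcomp.h and is not reproduced here. */
typedef struct {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
} tiny_regnode;

/* Slots needed: one for the node itself plus 'extra_size' more, each the
 * size of the smallest regnode. */
static size_t regnode_slots(size_t extra_size)
{
    return 1 + extra_size;
}

/* The same quantity expressed in bytes. */
static size_t regnode_bytes(size_t extra_size)
{
    return regnode_slots(extra_size) * sizeof(tiny_regnode);
}
```

Under this reading, `extra_size` of 0 asks for a single smallest regnode and `extra_size` of 2 asks for three of them.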

hvds · 2022-02-11T17:26:25Z

regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also appears in ppport.h. If these are retained intentionally, they might merit mention in the commit message.

pod/perlreguts.pod has some references to "offsets" that should mostly be removed (found by grepping for /mjd/i).

demerphq · 2022-02-12T00:12:21Z

On Sat, 12 Feb 2022, 00:40 Hugo van der Sanden, ***@***.***> wrote:

> In regcomp.c:
>
>     /* Allocate a regnode for 'op', with 'extra_size' extra (smallest) regnode
>      * equivalents space.  It aligns and increments RExC_size
>
> You've removed 'op'; if you can make more sense of with 'extra_size' extra (smallest) regnode equivalents space that would help too. I'd suggest (based on my best guess): Allocate a regnode that is (1 + extra_size) times as big as the smallest regnode.

Ack will do. Yves

…

demerphq · 2022-02-12T00:13:36Z

On Sat, 12 Feb 2022, 01:26 Hugo van der Sanden, ***@***.***> wrote:

> regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also appears in ppport.h. If these are retained intentionally, they might merit mention in the commit message.
>
> pod/perlreguts.pod has some references to "offsets" that should mostly be removed (found by grepping for /mjd/i)

Right, I'll fix that too. Yves

…

demerphq · 2022-02-12T05:00:14Z

On Fri, 11 Feb 2022 at 18:26, Hugo van der Sanden ***@***.***> wrote:

#reordered comments

> pod/perlreguts.pod has some references to "offsets" that should mostly be removed (found by grepping for /mjd/i).

Done, also your other comment is resolved too. I had to completely update the regexp_internal documentation in perlreguts.pod.

> regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also appears in ppport.h. If these are retained intentionally, they might merit mention in the commit message.

Gnash, I think I missed this one in the force push I just did to resolve your comments. I am digging into something else, I'll check it later today.

Yves

…

-- perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq · 2022-02-12T11:04:26Z


I just force pushed again with a bunch of further fixes related to this, thanks Hugo!

However I do not know what to do about this:

    $ git grep OFFSET_DEBUG
    dist/Devel-PPPort/parts/base/5009004:MJD_OFFSET_DEBUG # Z added by devel/scanprov
    $ git grep OFFSETS
    dist/Devel-PPPort/parts/base/5009002:DEBUG_OFFSETS_r # Z added by devel/scanprov
    dist/Devel-PPPort/parts/base/5009004:RE_DEBUG_EXTRA_OFFSETS # Z added by devel/scanprov
    dist/Devel-PPPort/parts/base/5009005:RE_TRACK_PATTERN_OFFSETS # Z added by devel/scanprov

Should I just remove these? The PPPort instructions are pretty large and I didn't spot instructions for what to do when we *remove* things.

Advice appreciated!

cheers,
Yves

…


khwilliamson · 2022-02-12T16:39:57Z


Do nothing about them, or in more positive terms, "Don't worry about it". It will be taken care of automatically when D:P is next released, with no manual intervention required on anyone's part.

demerphq · 2022-02-13T04:35:08Z


Thanks Karl. So are you and Hugo good to apply this? Warm regards Yves

hvds · 2022-02-13T12:34:13Z

> Thanks Karl. So are you and Hugo good to apply this?

I plan to look at the latest updates later today. There are two other separate questions; my first was above:

> What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.

I guess a cpangrep on a couple of critical functions would give a first-approximation answer on that.

The second (somewhat related to the first) is whether this should go in so soon before a new release or be deferred until after it.

I'm also eager to get this cleanup as soon as safely possible.

demerphq · 2022-02-13T13:24:20Z

On Sun, 13 Feb 2022 at 13:34, Hugo van der Sanden ***@***.***> wrote:

> > Thanks Karl. So are you and Hugo good to apply this?
>
> I plan to look at the latest updates later today. There are two other separate questions; my first was above:
>
> > What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger <https://metacpan.org/pod/Regexp::Debugger> relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.

Last I checked AS weren't supporting the product anymore, and as I said, it's been *totally* broken since jump-tries were implemented in 5.10 or so. The code assumes you can make a linear map between parts of the pattern and specific regops, which was true prior to the jump-trie logic, but has not been true since. Eg:

    $ perl -Mre=Debug,OFFSETS,DUMP -e'/(?: a \s+ | b \w+ | c \d+ )/x'
    Compiling REx "(?: a \s+ | b \w+ | c \d+ )"
    Got 140 bytes for offset annotations.
    Final program:
       1: TRIE-EXACT[abc] (17)
          <a> (4)
       4:   PLUS (17)
       5:     POSIXD[\s] (0)
          <b> (9)
       9:   PLUS (17)
      10:     POSIXD[\w] (0)
          <c> (14)
      14:   PLUS (17)
      15:     POSIXU[\d] (0)
      17: END (0)
    stclass AHOCORASICK-EXACT[abc] minlen 2
    Offsets: [17]
        1:4[1] 4:10[1] 5:7[2] 6:12[1] 7:12[1] 9:18[1] 10:15[2] 11:20[1] 12:20[1] 14:26[1] 15:23[2] 16:26[0] 17:28[0]
    Freeing REx: "(?: a \s+ | b \w+ | c \d+ )"

There is simply no way to represent this using the offsets notation. The TRIE-EXACT represents multiple discontinuous segments of the pattern, signified by ^ symbols below:

    (?: a \s+ | b \w+ | c \d+ )
    ^   ^     ^ ^     ^ ^     ^

The idea of this code was to be able to use the debug mode of the regex engine to highlight which parts of the pattern were being executed, but you can't do that with a jump-trie and the offsets notation; the latter simply can't represent what the trie opcode does. You can even see the broken data: the offsets reference multiple optimized-away regops (6:12[1] 7:12[1], 11:20[1] 12:20[1], 16:26[0]), all of which were optimized away when regop 1 was converted to a trie regop.

So if anything is using this it has been broken for a very long time. When I did the TRIE optimization I discussed this with MJD and he basically said not to worry. Essentially there is no way we can do advanced optimizations of the regex engine like the TRIE optimization AND keep this logic working. It assumes that the regex engine is a very simple machine, but TRIE or a hypothetical DFA op completely break its expectations.
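The limitation described above can be sketched in a few lines of C. This is a toy model, not the removed regcomp.c code: the `span` type and `positions_to_span` helper are hypothetical, and the offsets used in the note below are zero-based character positions counted by hand from the example pattern.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-regop annotation: one contiguous slice of the pattern,
 * which is all that a "position[length]" entry can describe. */
typedef struct {
    size_t start;  /* offset of the first pattern character covered */
    size_t len;    /* number of pattern characters covered */
} span;

/* Returns true when the sorted positions form one unbroken run and stores
 * that run in *out. Returns false when there is a gap, i.e. when a single
 * (start, len) pair cannot describe the positions. */
static bool positions_to_span(const size_t *pos, size_t n, span *out)
{
    if (n == 0)
        return false;
    for (size_t i = 1; i < n; i++) {
        if (pos[i] != pos[i - 1] + 1)
            return false;
    }
    out->start = pos[0];
    out->len   = n;
    return true;
}
```

In `(?: a \s+ | b \w+ | c \d+ )` the characters of `\s+` sit at offsets 6, 7 and 8, which collapse into one span. The letters `a`, `b` and `c` that feed the single trie regop sit at offsets 4, 12 and 20, which do not, so no single annotation of this shape can describe the trie.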
It is broken and unsavable, and IMO any attempt to keep it would simply restrict our ability to improve the regex engine. Damian's approach is much more intelligent and scalable in some ways, although it actually changes the thing it debugs (code blocks injected into the pattern disable a number of optimizations that we have in place, and would change how some of the ones that aren't disabled work in practice, eg, a pattern that might produce a non-jump trie would start producing a jump trie), but at least it works to explain to the user what is happening or should be happening.

The reality is that if you want to properly understand what the regex engine is doing you need to look at the raw re debug output; these other approaches are simply never going to tell you what is going on. Consider:

    $ perl -Mre=debug -e'"a0b c1"=~/(?: a \s+ | b \w+ | c \d+ )/x'
    Compiling REx "(?: a \s+ | b \w+ | c \d+ )"
    Final program:
       1: TRIE-EXACT[abc] (17)
          <a> (4)
       4:   PLUS (17)
       5:     POSIXD[\s] (0)
          <b> (9)
       9:   PLUS (17)
      10:     POSIXD[\w] (0)
          <c> (14)
      14:   PLUS (17)
      15:     POSIXU[\d] (0)
      17: END (0)
    stclass AHOCORASICK-EXACT[abc] minlen 2
    Matching REx "(?: a \s+ | b \w+ | c \d+ )" against "a0b c1"
    Matching stclass AHOCORASICK-EXACT[abc] against "a0b c1" (6 bytes)
    0 <> <a0b c1> | 0| Charid: 1 CP: 61 State: 1, word=0 - legal
    1 <a> <0b c1> | 0| Charid: 0 CP: 30 State: 2, word=1 - accepting
    Matches word #1 at position 0. Trying full pattern...
    0 <> <a0b c1> | 0| 1:TRIE-EXACT[abc](17)
    0 <> <a0b c1> | 0| | 0| State: 1 Accepted: N Charid: 1 CP: 61 After State: 2
    1 <a> <0b c1> | 0| | 0| State: 2 Accepted: Y Charid: 0 CP: 0 After State: 0
                  | 0| got 1 possible matches
                  | 0| TRIE matched word #1, continuing
                  | 0| only one match left, short-circuiting: #1 <a>
    1 <a> <0b c1> | 0| 4:PLUS(17)
                  | 0| POSIXD[\s] can match 0 times out of 2147483647...
                  | 0| failed...
    Pattern failed. Looking for new start point...
    1 <a> <0b c1> | 0| Scanning for legal start char...
    2 <a0> <b c1> | 0| Charid: 2 CP: 62 State: 1, word=0 - legal
    3 <a0b> < c1> | 0| Charid: 0 CP: 20 State: 3, word=2 - accepting
    Matches word #2 at position 2. Trying full pattern...
    2 <a0> <b c1> | 0| 1:TRIE-EXACT[abc](17)
    2 <a0> <b c1> | 0| | 0| State: 1 Accepted: N Charid: 2 CP: 62 After State: 3
    3 <a0b> < c1> | 0| | 0| State: 3 Accepted: Y Charid: 0 CP: 0 After State: 0
                  | 0| got 1 possible matches
                  | 0| TRIE matched word #2, continuing
                  | 0| only one match left, short-circuiting: #2 <b>
    3 <a0b> < c1> | 0| 9:PLUS(17)
                  | 0| POSIXD[\w] can match 0 times out of 2147483647...
                  | 0| failed...
    Pattern failed. Looking for new start point...
    3 <a0b> < c1> | 0| Scanning for legal start char...
    4 <a0b > <c1> | 0| Charid: 3 CP: 63 State: 1, word=0 - legal
    5 <a0b c> <1> | 0| Charid: 0 CP: 31 State: 4, word=3 - accepting
    Matches word #3 at position 4. Trying full pattern...
    4 <a0b > <c1> | 0| 1:TRIE-EXACT[abc](17)
    4 <a0b > <c1> | 0| | 0| State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
    5 <a0b c> <1> | 0| | 0| State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
                  | 0| got 1 possible matches
                  | 0| TRIE matched word #3, continuing
                  | 0| only one match left, short-circuiting: #3 <c>
    5 <a0b c> <1> | 0| 14:PLUS(17)
                  | 0| POSIXU[\d] can match 1 times out of 2147483647...
    6 <a0b c1> <> | 1| 17:END(0)
    Match successful!
    Freeing REx: "(?: a \s+ | b \w+ | c \d+ )"

As an aside, the above shows a regression in the debug output: the lines from the trie engine are not being displayed properly anymore. Sigh. This:

    4 <a0b > <c1> | 0| | 0| State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
    5 <a0b c> <1> | 0| | 0| State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0

should look like this:

    4 <a0b > <c1> | 0| State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
    5 <a0b c> <1> | 0| State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0

I'd say that fixing that regression is more important than the offsets logic any day. It actually explains what is going on, unlike the offsets data.

> I guess a cpangrep on a couple of critical functions would give a first-approximation answer on that.
>
> The second (somewhat related to the first) is whether this should go in so soon before a new release or be deferred until after it.

My view is this logic has been long broken and also long unused, but sure I understand the concern. If we have a contact at ActiveState that can confirm it would help. I am willing to bet a dollar that no harm whatsoever would come from immediate application, especially given the existing regressions in the debug output.

Yves

…


hvds · 2022-02-13T13:55:21Z

> Last I checked AS weren't supporting the product anymore, and as I said,
> it's been totally broken since jump-tries were implemented in 5.10 or so.
> [snip 160 lines]

I understand that, but the question was not whether it works as intended (or ever can) but whether anyone is using it - if someone is, it would seem polite to look at what they're doing and/or talk to them.

demerphq · 2022-02-13T14:45:38Z

On Sun, 13 Feb 2022, 21:55 Hugo van der Sanden, ***@***.***> wrote:

> > Last I checked AS weren't supporting the product anymore, and as I said, it's been *totally* broken since jump-tries were implemented in 5.10 or so. [snip 160 lines]
>
> I understand that, but the question was not whether it works as intended (or ever can) but whether anyone is using it - if someone is, it would seem polite to look at what they're doing and/or talk to them

Doesn't make sense to me at all. It's unsavably broken and will only get more broken as we improve the regex engine, or it will prevent us from improving it by setting a precedent that we have to consider the debug output a stable API.

The regex debug output is barely tested, and afaik I wrote pretty much all of the tests for it that we do have, not as a guarantee of stability but rather as a mechanism to test the non-debug regex engine. Considering that, this sounds pretty silly to me.

Back in the day I actually changed the debug output so the offsets stuff wasn't output by default, and nobody complained about that. I'd have expected that if these people existed we would have heard from them by now.

Anyway, since the "api" here is the regex debug output, which we have necessarily changed many a time over the years (without a peep from anyone) and which contains obvious regressions no-one noticed until I looked, I don't see how you are even going to find these people. Afaik there is no function to grep for. The best you can do is search for various specialized "use re Debug => ..." lines, including 'All', which will likely give a bunch of false positives. You might get lucky and find something explicitly mentioning OFFSETS, but I just did that on google and I got a bunch of false positives and a link to this PR.

So how do you propose to find these people, and how long do you want to wait for them to surface? What exactly would satisfy you that they don't exist?

Yves

…

hvds · 2022-02-13T15:36:24Z

> Anyway since the "api" here is the regex debug output

Ah, my apologies, this was the piece of info I'd misplaced: I'd forgotten how little interface this actually provides, and mentally conflated the removed regcomp.c defines (and MJD_OFFSET_DEBUG) with the simultaneous embed.fnc changes, somehow imagining that a user of this mechanism would be trying to invoke a bunch of now-removed macros.

In any case I've now done some cpangrepping for a handful of candidate strings and found nothing that warrants a closer look.

demerphq · 2022-02-14T07:23:05Z

On Sun, 13 Feb 2022 at 16:36, Hugo van der Sanden ***@***.***> wrote:

> > Anyway since the "api" here is the regex debug output
>
> Ah, my apologies, this was the piece of info I'd misplaced: I'd forgotten how little interface this actually provides, and mentally conflated the removed regcomp.c defines (and MJD_OFFSET_DEBUG) with the simultaneous embed.fnc changes, somehow imagining that a user of this mechanism would be trying to invoke a bunch of now-removed macros.
>
> In any case I've now done some cpangrepping for a handful of candidate strings and found nothing that warrants a closer look.

So I guess we can merge it then? Yves

…


hvds · 2022-02-14T13:13:28Z

> So I guess we can merge it then?

You don't need to ask that repeatedly, rest assured that when I'm confident it's good to merge I'll either say so or simply merge it, and if I encounter something that seems like a problem I'll also say so. I'm also confident that others such as Karl would do the same.

I haven't yet had a chance to look at the latest updates, and when I have done I'll say so. I haven't yet clarified my opinion on whether this should be deferred until after the coming release, and when I have done I'll say so.

If you're worried that I've forgotten about it you're welcome to prod me; a few days would seem a reasonable time to wait before deciding that may be the case, except in urgent cases such as a security fix.

demerphq · 2022-02-14T13:34:21Z

On Mon, 14 Feb 2022 at 14:15, Hugo van der Sanden ***@***.***> wrote:

> > So I guess we can merge it then?
>
> You don't need to ask that repeatedly, rest assured that when I'm confident it's good to merge I'll either say so or simply merge it, and if I encounter something that seems like a problem I'll also say so. I'm also confident that others such as Karl would do the same.

It wasn't clear to me if you were expecting me to merge or not. cheers, Yves

hvds · 2022-02-15T19:42:29Z

@demerphq I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert.

I'd quite like to see the variable changes in a separate commit from the offsets removal (see also khw's recent p5p request for guidance on this sort of thing). I'm happy to do the work of splitting it into two commits if you don't have a particular reason to be against that.

I've not yet looked at the ramifications of the additional commit ("change regexp_interal attribute from I32 to U32" - note typo); will get to that as soon as I can, but at first glance it doesn't seem controversial.

Based on my understanding so far, I don't think there's any pressing reason to defer this until after the upcoming release - @khwilliamson @leonerd @rjbs @neilb please comment if you think otherwise.

khwilliamson · 2022-02-16T02:13:20Z

I agree completely with Hugo

demerphq · 2022-02-16T02:15:13Z

On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:

> @demerphq <https://github.com/demerphq> I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert.
>
> I'd quite like to see the variable changes in a separate commit from the offsets removal (see also khw's recent p5p request for guidance on this sort of thing). I'm happy to do the work of splitting it into two commits if you don't have a particular reason to be against that.

Personally I see them as related changes since the patch removes such a usage, and since the same variable was duplicated in many places and the code involved includes subs that are very large it made it much easier to be sure I wasn't breaking something. But if you really feel strongly about it I will break them out.

> I've not yet looked at the ramifications of the additional commit ("change regexp_interal attribute from I32 to U32" - note typo); will get to that as soon as I can, but at first glance it doesn't seem controversial.

I'll fix the typo. Sure, go ahead and validate the implications, but I am pretty confident of this: I added this member in the first place, and a valid value can never be negative since it is an index into the data array. I made it I32 in the first place because I was intending to set it to -1 to signify "no value", but either I forgot, or at some point the code was changed to initialize it to 0 when there was no value.

Prior to my recent change forbidding 0 as a valid index into the data array, someone not realizing the value is only relevant when another property is non-zero (RXp_PAREN_NAMES(prog)) might have tried to dereference the 0 slot in the array. But go ahead, check my reasoning. I'll add a comment to the code and docs to explain this, however.

Yves
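The reasoning about the index type can be illustrated with a small C sketch. Everything in it is a hypothetical miniature (`mini_regexp_internal`, `DATA_SLOTS`, `has_name_list`, `name_list`), not the real struct regexp_internal; it only assumes what the comment states, namely that slot 0 of the data array is never a valid entry, so an unsigned index of 0 can stand for "no value".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DATA_SLOTS 4

/* Hypothetical miniature of the structure under discussion. Slot 0 of
 * 'data' is reserved and never holds a real entry. */
typedef struct {
    const char *data[DATA_SLOTS];
    uint32_t    name_list_idx;   /* 0 means "no name list" */
} mini_regexp_internal;

/* An unsigned index needs no negative sentinel: 0 already means "unset". */
static bool has_name_list(const mini_regexp_internal *ri)
{
    return ri->name_list_idx != 0 && ri->name_list_idx < DATA_SLOTS;
}

/* Only dereference the data array when the index is actually set. */
static const char *name_list(const mini_regexp_internal *ri)
{
    return has_name_list(ri) ? ri->data[ri->name_list_idx] : NULL;
}
```

With this convention a caller never touches slot 0 by accident, which is the hazard described above for code that read the index without first checking whether it was relevant.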

demerphq · 2022-02-16T02:28:32Z

On Wed, 16 Feb 2022 at 03:13, Karl Williamson ***@***.***> wrote:

> I agree completely with Hugo

I really don't. I'll do the work, but under protest. Go ahead and try to disentangle it yourself privately to see what I mean.

parse_start is a /* MJD */ marked variable that was intended to be used for the offsets code originally. Over time other code started to use it. This then made it quite unobvious what was going on in places. Fixing the names prior to removing the offsets debug code just means you have to create a bunch of changes that will be deleted in the next patch. Making extra work for nothing. This is just make-work for extremely low ROI.

Yves

khwilliamson · 2022-02-16T03:17:50Z

On 2/15/22 19:28, Yves Orton wrote:

> On Wed, 16 Feb 2022 at 03:13, Karl Williamson ***@***.***> wrote:
>
> > I agree completely with Hugo
>
> I really dont. I'll do the work, but under protest. Go ahead and try to disentangle it yourself privately to see what I mean. parse_start is a /* MJD */ marked variable that was intended to be used for the offsets code originally. Over time other code started to use it. This then made it quite unobvious what was going on in places. Fixing the names prior to removing the offsets debug code just means you have to create a bunch of changes that will be deleted the next patch. Making extra work for nothing. This is just make work for extremely low ROI.

I wouldn't have agreed with Hugo if I thought it was a lot of work. And he has offered to actually do the work for you, so I don't see a problem. Perhaps actually trying it will show why it is too intertwined to separate out. But it doesn't look like it to me, at least for the majority of the changes. I think the new names are a distinct improvement, and I would be sad if they got reverted along with the major thrust should we be wrong and this commit causes field problems.

demerphq · 2022-02-16T03:37:56Z

On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:

> @demerphq <https://github.com/demerphq> I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert.

It wasn't inadvertent. The assert depends on "op" which was removed by this patch. If we really want this assert we need to add that parameter back in, or do some kind of macro trickery. Yves

Various changes have been made to struct regexp_internal over time which have not been documented. This updates the docs to match the code as it is now in preparation of changing the docs in subsequent commits.

This code was added by Mark Jason Dominus to aid a regex debugger he wrote for ActiveState. The basic premise is that every opcode in a regex can be attributed back to a contiguous sequence of characters that make up the pattern. This assumption has not been true ever since the "jump" TRIE optimizations were added to the engine. I spoke to MJD many years ago about whether it was ok to remove this from the regex engine and he said he had no objections.

An example of a pattern that cannot be handled correctly by this logic is /(?: a x+ | b y+ | c z+ )/x where the (?:a ... | b ... | c ...) parts will now be handled by the TRIE logic and not by the BRANCH/EXACT opcodes that it would have been in the past. The offset debug output cannot handle this type of transformation, and produces nonsense output that mentions opcodes that have been optimized away from the final program.

The regex compiler is complicated enough without having to maintain this logic. There are essentially no tests for it, and the few tests that do cover it do so as a byproduct of testing other things. Despite the offsets logic only being used in debug, supporting it does have a cost to non-debug logic, as various internal routines include parameters related to it that are otherwise unused.

Note this output is only usable or visible by enabling special flags in re.pm. There is no formal API to access it short of parsing the output of the debug mode of the regex engine, which has changed multiple times over the past years.

This was originally done to make the cleanup of the offsets debug logic easier to follow and understand. 'parse_start' was heavily used in multiple functions, and given the size of the functions in regcomp.c it was often not clear which parse_start was which. 'oregcomp_parse' was also used in a similar way. This patch disambiguates them all so they are all uniquely named and relevant to the code they operate on and of the form "thing_parse_start", (or "thing_parse_start_const" where both were in use).

This changes the name_list_idx attribute from I32 to a U32 as it will never be negative, and as of a963d6d a 0 can be safely used to represent "no value" for items in the 'data' array. I noticed this while cleaning up the offsets debug logic and updating the perlreguts documentation, so I figured I might as well clean it up at the same time.

demerphq · 2022-02-16T05:43:42Z

On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:

> @demerphq <https://github.com/demerphq> I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert.

As I said elsewhere it was intentional, the assert was in a debugging block and uses a parameter that was removed so I removed the assert as well since the code wouldnt have that parameter. However I have now reworked the patch so that the debugging behavior does not impact the production code, and restored the assert to the debug version of the code. I force pushed the branch.
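One common way to keep a debug-only parameter and its assert out of production code is conditional compilation. The sketch below shows that general technique only; it is an assumption about the shape of such a change, not the code in the pushed branch, and `SKETCH_DEBUGGING`, `ALLOC_SLOTS`, `alloc_slots_impl` and `extra_slots_for_op` are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#ifdef SKETCH_DEBUGGING
/* Hypothetical table of the extra slots each opcode is expected to need. */
static const size_t extra_slots_for_op[] = { 0, 1, 2 };

/* Debug build: the opcode is passed in so the requested size can be
 * cross-checked against the table. */
static size_t alloc_slots_impl(size_t extra_size, unsigned op)
{
    assert(extra_slots_for_op[op] == extra_size);
    return 1 + extra_size;
}
#define ALLOC_SLOTS(extra_size, op) alloc_slots_impl((extra_size), (op))
#else
/* Production build: no 'op' parameter and no assert. */
static size_t alloc_slots_impl(size_t extra_size)
{
    return 1 + extra_size;
}
#define ALLOC_SLOTS(extra_size, op) alloc_slots_impl((extra_size))
#endif
```

Call sites write `ALLOC_SLOTS(extra_size, op)` in both builds; only the debug build evaluates `op`, so the production function signature stays free of the otherwise unused parameter.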

> I'd quite like to see the variable changes in a separate commit from the offsets removal (see also khw's recent p5p request for guidance on this sort of thing). I'm happy to do the work of splitting it into two commits if you don't have a particular reason to be against that.

Done. While I actually did so under protest, as I think it was unnecessary (and this case is well in the grey zone of acceptability), I decided that if I was going to split up that commit I might as well do it right, so the perlreguts.pod change was also split out into its own patch.

> I've not yet looked at the ramifications of the additional commit ("change regexp_interal attribute from I32 to U32" - note typo); will get to that as soon as I can, but at first glance it doesn't seem controversial.

Nod. I already commented on this elsewhere, but please validate my analysis.

> Based on my understanding so far, I don't think there's any pressing reason to defer this until after the upcoming release - @khwilliamson <https://github.com/khwilliamson> @leonerd <https://github.com/leonerd> @rjbs <https://github.com/rjbs> @neilb <https://github.com/neilb> please comment if you think otherwise.

I'll leave it to you or Karl to apply. I also checked the typos and stuff like that, but please let me know if you notice something I missed. cheers, yves

…


hvds · 2022-02-16T22:56:05Z

LGTM, I'll apply this in a couple of days if nobody else has blocking comments.

(At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.)

demerphq · 2022-02-16T23:41:00Z

On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***> wrote:

> LGTM, I'll apply this in a couple of days if nobody else has blocking comments.

Great. Thanks.

> (At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.)

Dave would be the one to do that...

Cheers
Yves

…

hvds · 2022-02-18T15:11:54Z

Now pushed as four commits bddb8c7 .. f08cf40.

iabyn · 2022-03-07T16:10:40Z

On Wed, Feb 16, 2022 at 03:41:16PM -0800, Yves Orton wrote:

> On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***>
>
> > (At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.)
>
> Dave would be the one to do that...

Is this referring to this in regexp.h:

    /* record the position of a (?{...}) within a pattern */
    struct reg_code_block {
        STRLEN start;
        STRLEN end;
        OP     *block;
        REGEXP *src_regex;
    };

in which case in a pattern like /...(?{...}).../, it refers to where in the string containing the pattern that the code block starts and ends. What in particular is confusing?
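As a rough illustration of what "where in the string containing the pattern the code block starts and ends" could look like, here is a minimal, self-contained sketch. It is not perl's implementation: `struct code_block_span` and `find_code_block` are hypothetical names, the scan is deliberately naive (no nested braces, no quoting), and it ASSUMES `start` is the offset of the opening `(` and `end` is the offset one past the closing `)`. Whether the real `end` is inclusive or exclusive is exactly the detail the thread asks to have documented, so treat that choice as a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct reg_code_block:
 * only the two offsets into the pattern string are kept. */
struct code_block_span {
    size_t start;  /* offset of the '(' that opens "(?{" */
    size_t end;    /* ASSUMPTION: offset one past the closing ')' */
};

/* Locate the first "(?{...})" in 'pat' and record its span in 'out'.
 * Returns 1 on success, 0 if no complete code block is found.
 * Naive: the first "})" after the opener is taken as the close. */
static int find_code_block(const char *pat, struct code_block_span *out)
{
    const char *open;
    const char *close;

    assert(pat != NULL && out != NULL);

    open = strstr(pat, "(?{");
    if (open == NULL)
        return 0;

    close = strstr(open + 3, "})");
    if (close == NULL)
        return 0;

    out->start = (size_t)(open - pat);
    out->end = (size_t)(close - pat) + 2;  /* skip past "})" */
    return 1;
}
```

Under these assumptions, the pattern string `ab(?{ 1 })cd` gives `start == 2` and `end == 10`, so `pat + start` points at `(?{ 1 })` and `pat + end` points at `cd`.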

…

-- Red sky at night - gerroff my land! Red sky at morning - gerroff my land! -- old farmers' sayings #14

demerphq · 2022-03-08T02:02:41Z

On Mon, 7 Mar 2022 at 17:11, iabyn ***@***.***> wrote:

> On Wed, Feb 16, 2022 at 03:41:16PM -0800, Yves Orton wrote:
>
> > On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***>
> >
> > > (At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.)
> >
> > Dave would be the one to do that...
>
> Is this referring to this in regexp.h:
>
>     /* record the position of a (?{...}) within a pattern */
>     struct reg_code_block {
>         STRLEN start;
>         STRLEN end;
>         OP     *block;
>         REGEXP *src_regex;
>     };
>
> in which case in a pattern like /...(?{...}).../, it refers to where in the string containing the pattern that the code block starts and ends. What in particular is confusing?

When you added it you didn't update pod/perlreguts.pod to say what the structure means, and I wasn't confident enough that I understood it properly to do it for you. If you like, I can do a patch for you based on this mail. cheers, Yves

…

-- perl -Mre=debug -e "/just|another|perl|hacker/"

hvds reviewed Feb 11, 2022


demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 84349ec to 37ee630 Compare February 12, 2022 04:52

demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 37ee630 to 9ca30b5 Compare February 12, 2022 10:54

demerphq added 4 commits February 16, 2022 05:25

perlreguts.pod: synchronize regexp_internal docs with code

8f1467d

Various changes have been made to struct regexp_internal over time which have not been documented. This updates the docs to match the code as it is now in preparation of changing the docs in subsequent commits.

demerphq force-pushed the yves/remove_regcomp_offsets_debug branch from 9ca30b5 to f254578 Compare February 16, 2022 05:21

hvds closed this Feb 18, 2022

