Remove "offsets" debugging code from regcomp.c #19407
What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is. |
regcomp.c
/* Allocate a regnode for 'op', with 'extra_size' extra (smallest) regnode | ||
* equivalents space. It aligns and increments RExC_size |
You've removed 'op'; if you can make more sense of "with 'extra_size' extra (smallest) regnode equivalents space" that would help too. I'd suggest (based on my best guess): "Allocate a regnode that is (1 + extra_size) times as big as the smallest regnode".
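To make the suggested wording concrete, here is a minimal C sketch of the size arithmetic it describes. The `small_node` struct and the `node_alloc_size` helper are hypothetical stand-ins, not the real regcomp.c definitions, and the struct layout is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the smallest regnode; the real layout
 * lives in the perl source and may differ. */
struct small_node {
    uint8_t  flags;
    uint8_t  type;
    uint16_t next_off;
};

/* Bytes needed for one node plus 'extra_size' additional
 * smallest-node-sized slots, i.e. (1 + extra_size) units. */
size_t node_alloc_size(size_t extra_size)
{
    /* Guard against overflow in the multiplication below. */
    assert(extra_size < SIZE_MAX / sizeof(struct small_node));
    return (1 + extra_size) * sizeof(struct small_node);
}
```

With this reading, `node_alloc_size(0)` is the size of a single smallest node and `node_alloc_size(2)` is three times that.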
|
On Sat, 12 Feb 2022, 00:40 Hugo van der Sanden, ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In regcomp.c
<#19407 (comment)>:
> /* Allocate a regnode for 'op', with 'extra_size' extra (smallest) regnode
* equivalents space. It aligns and increments RExC_size
You've removed 'op'; if you can make more sense of with 'extra_size'
extra (smallest) regnode equivalents space that would help too. I'd
suggest (based on my best guess): Allocate a regnode that is (1 +
extra_size) times as big as the smallest regnode.
Ack will do.
Yves
|
On Sat, 12 Feb 2022, 01:26 Hugo van der Sanden, ***@***.***> wrote:
regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also
appears in ppport.h. If these are retained intentionally, they might
merit mention in the commit message.
pod/perlreguts.pod has some references to "offsets" that should mostly be
removed (found by grepping for /mjd/i)
Right, I'll fix that too.
Yves
|
Force-pushed: 84349ec to 37ee630
On Fri, 11 Feb 2022 at 18:26, Hugo van der Sanden ***@***.***> wrote:
#reordered comments
pod/perlreguts.pod has some references to "offsets" that should mostly be
removed (found by grepping for /mjd/i).
Done, also your other comment is resolved too. I had to completely update
the regexp_internal documentation in perlreguts.pod.
regcomp.h defines MJD_OFFSET_DEBUG, is that still required? I see it also
appears in ppport.h. If these are retained intentionally, they might
merit mention in the commit message.
Gnash, I think I missed this one in the force push I just did to resolve
your comments. I am digging into something else; I'll check it later today.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
Force-pushed: 37ee630 to 9ca30b5
I just force pushed again with a bunch of further fixes related to this,
thanks Hugo! However I do not know what to do about this:
$ git grep OFFSET_DEBUG
dist/Devel-PPPort/parts/base/5009004:MJD_OFFSET_DEBUG # Z added by devel/scanprov
$ git grep OFFSETS
dist/Devel-PPPort/parts/base/5009002:DEBUG_OFFSETS_r # Z added by devel/scanprov
dist/Devel-PPPort/parts/base/5009004:RE_DEBUG_EXTRA_OFFSETS # Z added by devel/scanprov
dist/Devel-PPPort/parts/base/5009005:RE_TRACK_PATTERN_OFFSETS # Z added by devel/scanprov
Should I just remove these? The PPPort instructions are pretty large and I
didn't spot instructions for what to do when we *remove* things.
Advice appreciated!
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
On 2/12/22 04:04, Yves Orton wrote:
I just force pushed again with a bunch of further fixes related to this,
thanks Hugo! However I do not know what to do about this:
$ git grep OFFSET_DEBUG
dist/Devel-PPPort/parts/base/5009004:MJD_OFFSET_DEBUG # Z added by devel/scanprov
$ git grep OFFSETS
dist/Devel-PPPort/parts/base/5009002:DEBUG_OFFSETS_r # Z added by devel/scanprov
dist/Devel-PPPort/parts/base/5009004:RE_DEBUG_EXTRA_OFFSETS # Z added by devel/scanprov
dist/Devel-PPPort/parts/base/5009005:RE_TRACK_PATTERN_OFFSETS # Z added by devel/scanprov
Should I just remove these? The PPPort instructions are pretty large and I
didn't spot instructions for what to do when we *remove* things.
Advice appreciated!
Do nothing about them, or in more positive terms, "Don't worry about
it". It will be taken care of automatically when D:P is next
released, with no manual intervention required on anyone's part.
|
On Sun, 13 Feb 2022, 00:40 Karl Williamson, ***@***.***>
wrote:
Do nothing about them, or in more positive terms, "Don't worry about
it". It will be taken care of automatically when D:P is next
released, with no manual intervention required on anyone's part.
Thanks Karl. So are you and Hugo good to apply this?
Warm regards
Yves
|
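I plan to look at the latest updates later today. There are two other separate questions; my first was above:
> What are the chances it is relied upon by any other regex debugger (or similar) out there? AFAIR Damian's Regexp-Debugger relies only on hooking regexp compilation to inject code blocks, but I don't know what else there is.
I guess a cpangrep on a couple of critical functions would give a first-approximation answer on that.
The second (somewhat related to the first) is whether this should go in so soon before a new release or be deferred until after it. I'm also eager to get this cleanup as soon as safely possible. |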
On Sun, 13 Feb 2022 at 13:34, Hugo van der Sanden ***@***.***> wrote:
Thanks Karl. So are you and Hugo good to apply this?
I plan to look at the latest updates later today. There are two other
separate questions; my first was above:
What are the chances it is relied upon by any other regex debugger (or
similar) out there? AFAIR Damian's Regexp-Debugger
<https://metacpan.org/pod/Regexp::Debugger> relies only on hooking regexp
compilation to inject code blocks, but I don't know what else there is.
Last I checked AS weren't supporting the product anymore, and as I said,
it's been *totally* broken since jump-tries were implemented in 5.10 or so.
The code assumes you can make a linear map between parts of the pattern and
specific regops, which was true prior to the jump-trie logic, but has not
been true since. Eg:
$ perl -Mre=Debug,OFFSETS,DUMP -e'/(?: a \s+ | b \w+ | c \d+ )/x'
Compiling REx "(?: a \s+ | b \w+ | c \d+ )"
Got 140 bytes for offset annotations.
Final program:
1: TRIE-EXACT[abc] (17)
<a> (4)
4: PLUS (17)
5: POSIXD[\s] (0)
<b> (9)
9: PLUS (17)
10: POSIXD[\w] (0)
<c> (14)
14: PLUS (17)
15: POSIXU[\d] (0)
17: END (0)
stclass AHOCORASICK-EXACT[abc] minlen 2
Offsets: [17]
1:4[1] 4:10[1] 5:7[2] 6:12[1] 7:12[1] 9:18[1] 10:15[2] 11:20[1] 12:20[1]
14:26[1] 15:23[2] 16:26[0] 17:28[0]
Freeing REx: "(?: a \s+ | b \w+ | c \d+ )"
There is simply no way to represent this using the offsets notation. The
TRIE-EXACT represents multiple discontinuous segments of the pattern,
signified by ^ symbols below:
(?: a \s+ | b \w+ | c \d+ )
^ ^ ^ ^ ^ ^ ^
The idea of this code was to be able to use the debug mode of the regex
engine to highlight which parts of the pattern were being executed, but you
can't do that with a jump-trie and the offsets notation; the latter simply
can't represent what the trie opcode does. You can even see the broken data:
the offsets reference multiple optimized-away regops (6:12[1] 7:12[1],
11:20[1] 12:20[1], 16:26[0]), all of which were removed when regop 1
was converted to a trie regop. So if anything is using this it has been
broken for a very long time. When I did the TRIE optimization I discussed
this with MJD and he basically said not to worry.
Essentially there is no way we can do advanced optimizations of the regex
engine like the TRIE optimization AND keep this logic working. It assumes
that the regex engine is a very simple machine, but TRIE or a
hypothetical DFA op completely break its expectations. It is broken and
unsavable, and IMO any attempt to keep it would simply restrict our ability
to improve the regex engine.
Damian's approach is much more intelligent and scalable in some ways,
although it actually changes the thing it debugs (code blocks injected into
the pattern disable a number of optimizations that we have in place, and
would change how some of the ones that aren't disabled work in practice; eg,
a pattern that might produce a non-jump trie would start producing a jump
trie), but it at least works to explain to the user what is happening or
should be happening.
The reality is that if you want to properly understand what the regex
engine is doing you need to look at the raw re debug output, these other
approaches are simply never going to tell you what is going on.
Consider:
$ perl -Mre=debug -e'"a0b c1"=~/(?: a \s+ | b \w+ | c \d+ )/x'
Compiling REx "(?: a \s+ | b \w+ | c \d+ )"
Final program:
1: TRIE-EXACT[abc] (17)
<a> (4)
4: PLUS (17)
5: POSIXD[\s] (0)
<b> (9)
9: PLUS (17)
10: POSIXD[\w] (0)
<c> (14)
14: PLUS (17)
15: POSIXU[\d] (0)
17: END (0)
stclass AHOCORASICK-EXACT[abc] minlen 2
Matching REx "(?: a \s+ | b \w+ | c \d+ )" against "a0b c1"
Matching stclass AHOCORASICK-EXACT[abc] against "a0b c1" (6 bytes)
0 <> <a0b c1> | 0| Charid: 1 CP: 61 State: 1, word=0
- legal
1 <a> <0b c1> | 0| Charid: 0 CP: 30 State: 2, word=1
- accepting
Matches word #1 at position 0. Trying full pattern...
0 <> <a0b c1> | 0| 1:TRIE-EXACT[abc](17)
0 <> <a0b c1> | 0| | 0|
State: 1 Accepted: N Charid: 1 CP: 61 After State: 2
1 <a> <0b c1> | 0| | 0|
State: 2 Accepted: Y Charid: 0 CP: 0 After State: 0
| 0| got 1 possible matches
| 0| TRIE matched word #1, continuing
| 0| only one match left, short-circuiting:
#1 <a>
1 <a> <0b c1> | 0| 4:PLUS(17)
| 0| POSIXD[\s] can match 0 times out of
2147483647...
| 0| failed...
Pattern failed. Looking for new start point...
1 <a> <0b c1> | 0| Scanning for legal start char...
2 <a0> <b c1> | 0| Charid: 2 CP: 62 State: 1, word=0
- legal
3 <a0b> < c1> | 0| Charid: 0 CP: 20 State: 3, word=2
- accepting
Matches word #2 at position 2. Trying full pattern...
2 <a0> <b c1> | 0| 1:TRIE-EXACT[abc](17)
2 <a0> <b c1> | 0| | 0|
State: 1 Accepted: N Charid: 2 CP: 62 After State: 3
3 <a0b> < c1> | 0| | 0|
State: 3 Accepted: Y Charid: 0 CP: 0 After State: 0
| 0| got 1 possible matches
| 0| TRIE matched word #2, continuing
| 0| only one match left, short-circuiting:
#2 <b>
3 <a0b> < c1> | 0| 9:PLUS(17)
| 0| POSIXD[\w] can match 0 times out of
2147483647...
| 0| failed...
Pattern failed. Looking for new start point...
3 <a0b> < c1> | 0| Scanning for legal start char...
4 <a0b > <c1> | 0| Charid: 3 CP: 63 State: 1, word=0
- legal
5 <a0b c> <1> | 0| Charid: 0 CP: 31 State: 4, word=3
- accepting
Matches word #3 at position 4. Trying full pattern...
4 <a0b > <c1> | 0| 1:TRIE-EXACT[abc](17)
4 <a0b > <c1> | 0| | 0|
State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
5 <a0b c> <1> | 0| | 0|
State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
| 0| got 1 possible matches
| 0| TRIE matched word #3, continuing
| 0| only one match left, short-circuiting:
#3 <c>
5 <a0b c> <1> | 0| 14:PLUS(17)
| 0| POSIXU[\d] can match 1 times out of
2147483647...
6 <a0b c1> <> | 1| 17:END(0)
Match successful!
Freeing REx: "(?: a \s+ | b \w+ | c \d+ )"
As an aside, the above shows a regression in the debug output, the lines
from the trie engine are not being displayed properly anymore. Sigh. This:
4 <a0b > <c1> | 0| | 0|
State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
5 <a0b c> <1> | 0| | 0|
State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
should look like this:
4 <a0b > <c1> | 0| State: 1 Accepted: N Charid: 3 CP: 63 After State: 4
5 <a0b c> <1> | 0| State: 4 Accepted: Y Charid: 0 CP: 0 After State: 0
I'd say that fixing that regression is more important than the offsets
logic any day. It actually explains what is going on, unlike the offsets
data.
I guess a cpangrep on a couple of critical functions would give a
first-approximation answer on that.
The second (somewhat related to the first) is whether this should go in so
soon before a new release or be deferred until after it.
My view is this logic has been long broken and also long unused, but sure I
understand the concern. If we have a contact at ActiveState that can
confirm it would help.
I am willing to bet a dollar that no harm whatsoever would come from
immediate application, especially given the existing regressions in the
debug output.
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
I understand that, but the question was not whether it works as intended (or ever can) but whether anyone is using it - if someone is, it would seem polite to look at what they're doing and/or talk to them. |
On Sun, 13 Feb 2022, 21:55 Hugo van der Sanden, ***@***.***> wrote:
Last I checked AS weren't supporting the product anymore, and as I said,
it's been *totally* broken since jump-tries were implemented in 5.10 or so.
[snip 160 lines]
I understand that, but the question was not whether it works as intended
(or ever can) but whether anyone is using it - if someone is, it would seem
polite to look at what they're doing and/or talk to them
Doesn't make sense to me at all. It's unsavably broken and will only get
more broken as we improve the regex engine, or it will prevent us from
improving it by setting a precedent that we have to consider the debug
output a stable API. Considering the regex debug output is barely tested,
and afaik I wrote pretty much all of the tests for it that we do have (not
as a guarantee of stability but rather as a mechanism to test the non-debug
regex engine), that sounds pretty silly to me. Back in the day I
actually changed the debug output so the offsets stuff wasn't output by
default and nobody complained about that. I'd have expected that if these
people existed we would have heard from them by now.
Anyway since the "api" here is the regex debug output which we have
necessarily changed many a time over the years (without a peep from anyone)
and which contains obvious regressions no-one noticed until I looked I
don't see how you are even going to find these people. Afaik there is no
function to grep for. The best you can do is search for various specialized
"use re Debug => ..." lines, including 'All' which will likely give a bunch
of false positives. You might get lucky and find something explicitly
mentioning OFFSETS but I just did that on google and I got a bunch of false
positives and a link to this PR.
So how do you propose to find these people, and how long do you want to
wait for them to surface? What exactly would satisfy you that they don't
exist?
Yves
|
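> Anyway since the "api" here is the regex debug output
Ah, my apologies, this was the piece of info I'd misplaced: I'd forgotten how little interface this actually provides, and mentally conflated the removed regcomp.c defines (and MJD_OFFSET_DEBUG) with the simultaneous embed.fnc changes, somehow imagining that a user of this mechanism would be trying to invoke a bunch of now-removed macros.
In any case I've now done some cpangrepping for a handful of candidate strings and found nothing that warrants a closer look. |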
On Sun, 13 Feb 2022 at 16:36, Hugo van der Sanden ***@***.***> wrote:
Anyway since the "api" here is the regex debug output
Ah, my apologies, this was the piece of info I'd misplaced: I'd forgotten
how little interface this actually provides, and mentally conflated the
removed regcomp.c defines (and MJD_OFFSET_DEBUG) with the simultaneous
embed.fnc changes, somehow imagining that a user of this mechanism would
be trying to invoke a bunch of now-removed macros.
In any case I've now done some cpangrepping for a handful of candidate
strings and found nothing that warrants a closer look.
So I guess we can merge it then?
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
You don't need to ask that repeatedly, rest assured that when I'm confident it's good to merge I'll either say so or simply merge it, and if I encounter something that seems like a problem I'll also say so. I'm also confident that others such as Karl would do the same. I haven't yet had a chance to look at the latest updates, and when I have done I'll say so. I haven't yet clarified my opinion on whether this should be deferred until after the coming release, and when I have done I'll say so. If you're worried that I've forgotten about it you're welcome to prod me; a few days would seem a reasonable time to wait before deciding that may be the case, except in urgent cases such as a security fix. |
On Mon, 14 Feb 2022 at 14:15, Hugo van der Sanden ***@***.***> wrote:
So I guess we can merge it then?
You don't need to ask that repeatedly, rest assured that when I'm
confident it's good to merge I'll either say so or simply merge it, and if
I encounter something that seems like a problem I'll also say so. I'm also
confident that others such as Karl would do the same.
It wasn't clear to me if you were expecting me to merge or not.
cheers,
Yves
|
@demerphq I've managed to take another look at the main commit, I think it looks good other than one possibly inadvertent removal of an assert. I'd quite like to see the variable changes in a separate commit from the offsets removal (see also khw's recent p5p request for guidance on this sort of thing). I'm happy to do the work of splitting it into two commits if you don't have a particular reason to be against that. I've not yet looked at the ramifications of the additional commit ("change regexp_interal attribute from I32 to U32" - note typo); will get to that as soon as I can, but at first glance it doesn't seem controversial. Based on my understanding so far, I don't think there's any pressing reason to defer this until after the upcoming release - @khwilliamson @leonerd @rjbs @neilb please comment if you think otherwise. |
I agree completely with Hugo |
On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:
@demerphq <https://github.com/demerphq> I've managed to take another look
at the main commit, I think it looks good other than one possibly
inadvertent removal of an assert.
I'd quite like to see the variable changes in a separate commit from the
offsets removal (see also khw's recent p5p request for guidance on this
sort of thing). I'm happy to do the work of splitting it into two commits
if you don't have a particular reason to be against that.
Personally I see them as related changes since the patch removes such a
usage, and since the same variable was duplicated in many places and the
code involved includes subs that are very large it made it much easier to
be sure I wasn't breaking something. But if you really feel strongly about
it I will break them out.
I've not yet looked at the ramifications of the additional commit ("change
regexp_interal attribute from I32 to U32" - note typo); will get to that as
soon as I can, but at first glance it doesn't seem controversial.
I'll fix the typo. Sure, go ahead and validate the implications, but I am
pretty confident of this: I added this member in the first place, and a
valid value can never be negative since it is an index into the data array.
I made it I32 in the first place because I was intending to set it to -1 to
signify "no value", but either I forgot, or at some point the code was
changed to initialize it to 0 when there was no value. Prior to my recent
change forbidding 0 as a valid index into the data array, someone not
realizing the value is only relevant when another property is non-zero
(RXp_PAREN_NAMES(prog)) might have tried to dereference the 0 slot in the
array. But go ahead and check my reasoning. I'll add a comment to the code and
docs to explain this, however.
Yves
|
On Wed, 16 Feb 2022 at 03:13, Karl Williamson ***@***.***> wrote:
I agree completely with Hugo
I really don't. I'll do the work, but under protest. Go ahead and try to
disentangle it yourself privately to see what I mean.
parse_start is a /* MJD */ marked variable that was intended to be used for
the offsets code originally. Over time other code started to use it. This
then made it quite unobvious what was going on in places. Fixing the names
prior to removing the offsets debug code just means you have to create a
bunch of changes that will be deleted in the next patch, making extra work for
nothing.
This is just make-work for extremely low ROI.
Yves
|
On 2/15/22 19:28, Yves Orton wrote:
On Wed, 16 Feb 2022 at 03:13, Karl Williamson ***@***.***>
wrote:
> I agree completely with Hugo
>
I really don't. I'll do the work, but under protest. Go ahead and try to
disentangle it yourself privately to see what I mean.
parse_start is a /* MJD */ marked variable that was intended to be used for
the offsets code originally. Over time other code started to use it. This
then made it quite unobvious what was going on in places. Fixing the names
prior to removing the offsets debug code just means you have to create a
bunch of changes that will be deleted the next patch. Making extra work for
nothing.
This is just make work for extremely low ROI.
I wouldn't have agreed with Hugo if I thought it was a lot of work. And
he has offered to actually do the work for you, so I don't see a problem.
Perhaps actually trying it will show why it is too intertwined to
separate out. But it doesn't look like it to me, at least for the
majority of the changes.
I think the new names are a distinct improvement, and I would be sad if
they got reverted along with the major thrust should we be wrong and
this commit causes field problems.
|
On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:
@demerphq <https://github.com/demerphq> I've managed to take another look
at the main commit, I think it looks good other than one possibly
inadvertent removal of an assert.
It wasn't inadvertent. The assert depends on "op" which was removed by this
patch. If we really want this assert we need to add that parameter back in,
or do some kind of macro trickery.
Yves
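One shape the "macro trickery" mentioned above could take is sketched below in C. All names (`node_reserve`, `node_reserve_debug`, `NODE_RESERVE`) are hypothetical, the bookkeeping is a toy, and `NDEBUG` merely stands in for whatever debug switch the real build uses. This illustrates the general idea only; it is not the code in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the regcomp.c code. */

/* Production path: takes no opcode argument at all. Reserves one
 * slot plus 'extra_size' more and returns the index of the first. */
size_t node_reserve(size_t *used, size_t extra_size)
{
    size_t at = *used;
    *used += 1 + extra_size;
    return at;
}

#ifndef NDEBUG
/* Debug path: accepts the opcode only so it can be sanity-checked,
 * then defers to the production function. */
size_t node_reserve_debug(size_t *used, size_t extra_size, int op)
{
    assert(op >= 0);    /* placeholder for a check that needs 'op' */
    return node_reserve(used, extra_size);
}
#  define NODE_RESERVE(used, extra, op) node_reserve_debug((used), (extra), (op))
#else
/* Non-debug builds drop 'op' at the call site, so the production
 * function never carries the extra parameter. */
#  define NODE_RESERVE(used, extra, op) node_reserve((used), (extra))
#endif
```

Callers always write `NODE_RESERVE(&used, extra, op)`; only debug builds evaluate the assert, and the production function keeps its shorter signature.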
|
Various changes have been made to struct regexp_internal over time which have not been documented. This updates the docs to match the code as it is now in preparation of changing the docs in subsequent commits.
This code was added by Mark Jason Dominus to aid a regex debugger he wrote for ActiveState. The basic premise is that every opcode in a regex can be attributed back to a contiguous sequence of characters that make up the pattern. This assumption has not been true ever since the "jump" TRIE optimizations were added to the engine. I spoke to MJD many years ago about whether it was ok to remove this from the regex engine and he said he had no objections.
An example of a pattern that cannot be handled correctly by this logic is /(?: a x+ | b y+ | c z+ )/x where the (?:a ... | b ... | c ...) parts will now be handled by the TRIE logic and not by the BRANCH/EXACT opcodes that would have handled them in the past. The offset debug output cannot handle this type of transformation, and produces nonsense output that mentions opcodes that have been optimized away from the final program.
The regex compiler is complicated enough without having to maintain this logic. There are essentially no tests for it, and the few tests that do cover it do so as a byproduct of testing other things. Despite the offsets logic only being used in debug, supporting it does have a cost to non-debug logic, as various internal routines include parameters related to it that are otherwise unused.
Note this output is only usable or visible by enabling special flags in re.pm; there is no formal API to access it short of parsing the output of the debug mode of the regex engine, which has changed multiple times over the past years.
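The "contiguous sequence" premise can be illustrated with a small C sketch. It assumes an offsets entry is a single start/length pair per opcode, which is my reading of the description above; the `span` type and `merge_into_one_span` helper are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One contiguous piece of the pattern text: 'len' bytes from 'start'. */
struct span {
    size_t start;
    size_t len;
};

/* Try to describe 'n' pattern pieces (sorted by start, non-overlapping)
 * as a single span. Returns true and fills '*out' only when the pieces
 * sit back to back with no gap; returns false otherwise. */
bool merge_into_one_span(const struct span *pieces, size_t n,
                         struct span *out)
{
    assert(pieces != NULL && out != NULL);
    if (n == 0)
        return false;

    size_t end = pieces[0].start + pieces[0].len;
    for (size_t i = 1; i < n; i++) {
        if (pieces[i].start != end)
            return false;   /* gap: one start/length pair cannot cover it */
        end = pieces[i].start + pieces[i].len;
    }
    out->start = pieces[0].start;
    out->len   = end - pieces[0].start;
    return true;
}
```

In the pattern string `(?: a x+ | b y+ | c z+ )` the letters a, b and c sit at byte positions 4, 11 and 18 (counting from zero). A trie opcode covering all three would need the pieces `{4,1}`, `{11,1}`, `{18,1}`, which the helper rejects, whereas adjacent pieces such as `{6,2}` and `{8,1}` merge cleanly into `{6,3}`.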
This was originally done to make the cleanup of the offsets debug logic easier to follow and understand. 'parse_start' was heavily used in multiple functions, and given the size of the functions in regcomp.c it was often not clear which parse_start was which. 'oregcomp_parse' was also used in a similar way. This patch disambiguates them all so they are all uniquely named and relevant to the code they operate on and of the form "thing_parse_start", (or "thing_parse_start_const" where both were in use).
This changes the name_list_idx attribute from I32 to a U32 as it will never be negative, and as of a963d6d a 0 can be safely used to represent "no value" for items in the 'data' array. I noticed this while cleaning up the offsets debug logic and updating the perlreguts documentation, so I figured I might as well clean it up at the same time.
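As a rough illustration of why an unsigned index with 0 reserved for "no value" is sufficient, consider the following C sketch. The `side_data` struct and `side_data_get` function are hypothetical simplifications and do not reproduce the real regexp_internal or 'data' array layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a compiled-regex side table. */
struct side_data {
    uint32_t     count;     /* number of slots, including unused slot 0 */
    const void **slots;     /* slots[0] is reserved and never valid */
};

/* Look up an entry by index, treating 0 as "no value". Because the
 * index is unsigned there is no negative sentinel to test for. */
const void *side_data_get(const struct side_data *d, uint32_t idx)
{
    assert(d != NULL);
    if (idx == 0 || idx >= d->count)
        return NULL;
    return d->slots[idx];
}
```

Under these assumptions a caller only has to compare against 0 to learn that nothing was stored, and slot 0 is never dereferenced.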
Force-pushed: 9ca30b5 to f254578
On Tue, 15 Feb 2022 at 20:42, Hugo van der Sanden ***@***.***> wrote:
@demerphq <https://github.com/demerphq> I've managed to take another look
at the main commit, I think it looks good other than one possibly
inadvertent removal of an assert.
As I said elsewhere it was intentional: the assert was in a debugging block
and used a parameter that was removed, so I removed the assert as well since
the code wouldn't have that parameter.
However I have now reworked the patch so that the debugging behavior does
not impact the production code, and restored the assert to the debug
version of the code.
I force pushed the branch.
I'd quite like to see the variable changes in a separate commit from the
offsets removal (see also khw's recent p5p request for guidance on this
sort of thing). I'm happy to do the work of splitting it into two commits
if you don't have a particular reason to be against that.
Done. While I actually did so under protest as I think it was unnecessary
(and that this case is well within the grey zone of acceptability), I decided
that if I was going to split up that commit I might as well do it right, so the
perlreguts.pod change was also split out into its own patch.
I've not yet looked at the ramifications of the additional commit ("change
regexp_interal attribute from I32 to U32" - note typo); will get to that as
soon as I can, but at first glance it doesn't seem controversial.
Nod. Already commented on this elsewhere, but please validate my analysis.
Based on my understanding so far, I don't think there's any pressing
reason to defer this until after the upcoming release - @khwilliamson
<https://github.com/khwilliamson> @leonerd <https://github.com/leonerd>
@rjbs <https://github.com/rjbs> @neilb <https://github.com/neilb> please
comment if you think otherwise.
I'll leave it to you or Karl to apply. I also checked the typos and stuff
like that, but please let me know if you notice something I missed.
cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
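LGTM, I'll apply this in a couple of days if nobody else has blocking comments. (At some point it'd be nice to clarify what precisely the start and end in reg_code_block are marking the start and end of, but having it documented at all is a distinct improvement.) |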
On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***> wrote:
LGTM, I'll apply this in a couple of days if nobody else has blocking
comments.
Great. Thanks.
(At some point it'd be nice to clarify what precisely the start and end in
reg_code_block are marking the start and end of, but having it documented
at all is a distinct improvement.)
Dave would be the one to do that...
Cheers
Yves
|
On Wed, Feb 16, 2022 at 03:41:16PM -0800, Yves Orton wrote:
On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***>
(At some point it'd be nice to clarify what precisely the start and end in
> reg_code_block are marking the start and end of, but having it documented
> at all is a distinct improvement.)
>
Dave would be the one to do that...
Is this referring to this in regexp.h:
/* record the position of a (?{...}) within a pattern */
struct reg_code_block {
STRLEN start;
STRLEN end;
OP *block;
REGEXP *src_regex;
};
in which case in a pattern like /...(?{...}).../, it refers to where in
the string containing the pattern that the code block starts and ends.
What in particular is confusing?
--
Red sky at night - gerroff my land!
Red sky at morning - gerroff my land!
-- old farmers' sayings #14
|
On Mon, 7 Mar 2022 at 17:11, iabyn ***@***.***> wrote:
On Wed, Feb 16, 2022 at 03:41:16PM -0800, Yves Orton wrote:
> On Thu, 17 Feb 2022, 06:56 Hugo van der Sanden, ***@***.***>
> (At some point it'd be nice to clarify what precisely the start and end
in
> > reg_code_block are marking the start and end of, but having it
documented
> > at all is a distinct improvement.)
> >
> Dave would be the one to do that...
Is this referring to this in regexp.h:
/* record the position of a (?{...}) within a pattern */
struct reg_code_block {
STRLEN start;
STRLEN end;
OP *block;
REGEXP *src_regex;
};
in which case in a pattern like /...(?{...}).../, it refers to where in
the string containing the pattern that the code block starts and ends.
What in particular is confusing?
When you added it you didn't update pod/perlreguts.pod about what the
structure means, and I wasn't confident enough I understood it properly to
do it for you.
If you like I can do a patch for you based on this mail.
cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
|
This code was added by Mark Jason Dominus to aid a regex debugger
he wrote. The basic premise is that every opcode in a regex can
be attributed back to parts of the pattern. This assumption has
not been true ever since the TRIE optimizations were added, and
I believe that the debugger is no longer in use anyway.
The regex compiler is complicated enough without having to maintain
this logic. There are essentially no tests for it, and the few
tests that do cover it do so as a byproduct of testing other things.
Despite the offsets logic only being used in debug, supporting it
does have a cost to non-debug logic as various internal routines
include parameters related to it that are otherwise unused.
I spoke to him many years ago about whether it was ok to remove
it from the regex engine and he said yes.
As part of this patch I also changed the name of the "parse_start"
and "oregcomp_parse" variables in certain contexts so that the
code is a bit more clear. This was partly because the offsets logic
used its own parse_start variable in certain contexts and changing
the names of the others made it easier to clean up.