Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pat codeblocks: add size to struct reg_code_blocks and fix leak #22201

Merged
merged 2 commits into from
Jun 18, 2024

Conversation

iabyn
Copy link
Contributor

@iabyn iabyn commented May 8, 2024

The first commit reworks some of the detail of how arrays are interpolated arrays into patterns which have code blocks, and fixes GH #16627.
The second follow up commit fixes a small leak I spotted in that same area of code.

@iabyn iabyn added the defer-next-dev This PR should not be merged yet, but await the next development cycle label May 8, 2024
Copy link
Contributor

@hvds hvds left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I have not searched the code for other references to these structs.

In the first commit message, I don't understand the line Overall, $pat has the same effect as qr/A1B2C3/.

The second commit message needs another s/of of/of/.

pRExC_state->code_blocks->count -= n;
n = 0;
/* pat now represents the return value of overloaded
* concatenation of of two values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/of of/of/

@@ -2512,6 +2512,94 @@ SKIP:
ok($str =~ s/$copy/PQR/, 'replaced $copy with PQR');
is($str, "PQR", 'final string should be PQR');
}


# Various tests for regexes with code blocks interpolated from an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor point: all these tests involve regexps that match, having some that correctly fail to match would be good.

@iabyn
Copy link
Contributor Author

iabyn commented May 11, 2024 via email

@hvds
Copy link
Contributor

hvds commented May 13, 2024

The three variable values returned are 1,2,3

I somehow missed that "A1B2C3" was the literal pattern: for some reason it didn't occur to me to backtrack beyond the current paragraph to find the relevant context. I think this is probably just me overlooking something that should have been obvious.

The main point of those particular tests was to ensure that each time a code block is called, the right code block is called

Well, all we know is that a match succeeded, not that the right code block was called - in principle we could end up with the pattern being changed to /.*/ and all the tests would pass.

I don't think it needs a lot, but just showing that a single example like qr/^(??{$A})$r$/ ("code in array 10") does not match some nearby strings like "AE", "Aee", "A=ee" would be something. We should also verify that lack of use re qw{eval} causes the right error in this case, if we don't already have that.

@iabyn iabyn force-pushed the davem/rx_codeblocks2 branch from 1065606 to 436f01a Compare May 20, 2024 11:21
@iabyn
Copy link
Contributor Author

iabyn commented May 20, 2024 via email

@jkeenan jkeenan removed the defer-next-dev This PR should not be merged yet, but await the next development cycle label Jun 10, 2024
iabyn added 2 commits June 18, 2024 09:36
Split the 'count' field of reg_code_blocks structures into separate
'count' and 'size' fields to make the code less fragile; and as an
intended side-effect, fix GH #16627.

Background:

When a pattern includes embedded perl code, such as /(?{ CODE })/, then
at compile-time the op trees associated with each of those code blocks
are stored within the compiled regex, in a reg_code_blocks structure.

This structure contains some basic info, plus a pointer to an array of
reg_code_block structures, each of which contains a pointer to the
optree for that code block, plus string offsets to where the (?{..}) or
similar expression starts and ends within the pattern string.

For a runtime pattern, perl tries to reuse any original compiled code
blocks rather than recompiling them, to maintain correct closure
behaviour.

So for example, consider the following:

    my $x = 1;
    { my $x = 2; $r = qr/(??{$x})/ }
    my $y = 3;
    my $s = '(??{$y})';

    my $pat = qr/A (??{$x}) B $r C $s/x;

At perl compile time, the two '$x' code blocks are compiled, and their
optrees stored.

At runtime, when the $pat pattern is compiled, the third code block,
'$y', is compiled, and the two earlier optrees are retrieved. A new
three-element 'struct reg_code_block' array is malloc()ed, and the
pointers to the two old, and one new, optrees are stored in it.

So when $pat gets compiled, it becomes equivalent to:

    qr/A (??{$x}) B (??{$x}) C (??{$y})/x;

except that the $x's have different values since they are from different
closures. When the pattern is executed, the sub-patterns returned by the
various (??{..})'s result in $pat having the same overall effect as
qr/A1B2C3/.

The assembly of this reg_code_block array is mostly performed by
S_concat_pat() and S_compile_runtime_code(). It is done incrementally,
since the total number of code blocks isn't known in advance.

Prior to this commit, the array was often realloced() and grown one
element at a time as each new run-time code block was discovered, with
a corresponding pRExC_state->code_blocks->count++.

This count field served twin purposes: it indicated both how many code
blocks had been found and stored so far, and the malloc()ed size of the
array. But some parts of the regex compiler allocate more than one slot
at a time, and so the two meanings of the 'count' field temporarily
diverge. This became noticeable when S_concat_pat() recursed to
interpolate the contents of an array, such as qr/$a$b@c/, where
interpolating $a, $b was done iteratively at the top level, then it
recursed to process each element of @c. S_concat_pat() had a local var,
'int n', which counted how many code blocks had been found so far, and
this value sometimes represented the difference between the two meanings
of the 'count' field.

However when it recursed, n started from zero again and things got out
of whack, which led to GH #16627. The bug in that ticket can be reduced
to:

    my @x = ( qr/(?{A})/ );
    qr/(?{B})@x/;

Here the B code block is stored in pRExC_state->code_blocks->cb[0],
but then S_concat_pat() recurses, n is reset to 0, and the A code block
is also stored into slot 0. Then things would start to crash.

The quick and dirty fix would be to share n between recursive calls to
S_concat_pat(), by passing a pointer to it. Instead, this commit takes
the approach of adding a 'size' field to pRExC_state->code_blocks,
so that ->count now only indicates the current number of code blocks
stored (replacing the local var n) while ->size indicates the current
number of slots malloc()ed.

This makes the code more conventional and simpler to understand, and
allows the realloc() to pre-allocate rather than incrementing the array
size by 1 each time. By removing the fragile double meaning of the
'count' field, it should make any future bugs easier to diagnose, at the
cost of this initial commit being more complex.
When concatenating the components of a run-time pattern, if
a component has concat overloading, then that method is used, and
any previously accumulated code blocks - i.e. (?{...}) and similar -
are discarded. However, the ref counts of of any regex objects
pointed to which contained those code block(s) weren't having
their reference count decremented, and so leaked.

Spotted by code inspection while working on the previous commit.
@iabyn iabyn force-pushed the davem/rx_codeblocks2 branch from 436f01a to f4099af Compare June 18, 2024 08:42
@iabyn iabyn merged commit f4099af into blead Jun 18, 2024
59 checks passed
@iabyn iabyn deleted the davem/rx_codeblocks2 branch June 18, 2024 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants