-
Notifications
You must be signed in to change notification settings - Fork 561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pat codeblocks: add size to struct reg_code_blocks and fix leak #22201
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, though I have not searched the code for other references to these structs.
In the first commit message, I don't understand the line Overall, $pat has the same effect as qr/A1B2C3/.
The second commit message needs another s/of of/of/
.
pRExC_state->code_blocks->count -= n; | ||
n = 0; | ||
/* pat now represents the return value of overloaded | ||
* concatenation of of two values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/of of/of/
@@ -2512,6 +2512,94 @@ SKIP: | |||
ok($str =~ s/$copy/PQR/, 'replaced $copy with PQR'); | |||
is($str, "PQR", 'final string should be PQR'); | |||
} | |||
|
|||
|
|||
# Various tests for regexes with code blocks interpolated from an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor point: all these tests involve regexps that match, having some that correctly fail to match would be good.
On Wed, May 08, 2024 at 04:24:44PM -0700, Hugo van der Sanden wrote:
In the first commit message, I don't understand the line `Overall, $pat
has the same effect as qr/A1B2C3/.`
$pat, after interpolation, when being compiled, expands to
qr/A (??{$x}) B (??{$x}) C (??{$y})/x
except that the $x's are different variables coming from different
closures. When the pattern is executed, the various (??{$foo})'s are
executed, and the various variable values are returned, interpreted as a
sub-pattern and recompiled. The three variable values retuned are 1,2,3,
so the run-time effect of the pattern is the same as is had been written
qr/A1B2C3/.
The point of that line in the commit message was to emphasise that the
two $'s have different values. Which part wasn't clear and needs
expanding? Or should I just include the entirety of the paragraph above?
A minor point: all these tests involve regexps that match, having some
that correctly fail to match would be good.
The main point of those particular tests was to ensure that each time a
code block is called, the *right* code block is called (the bug meant that
sometimes the wrong code block was stored at particular slot). A
successful match implies that every code block which was called had to
have been the right code block (because each is arranged to be a closure
that returns a specific unique value). There are many, many ways for the
match to fail, and only one way for the match to succeed. So I'm not sure
how having a non-matching pattern helps, nor can I think of a suitable
set of such failing test patterns. I'm open to suggestions though.
…--
This email is confidential, and now that you have read it you are legally
obliged to shoot yourself. Or shoot a lawyer, if you prefer. If you have
received this email in error, place it in its original wrapping and return
for a full refund. By opening this email, you accept that Elvis lives.
|
I somehow missed that "A1B2C3" was the literal pattern: for some reason it didn't occur to me to backtrack beyond the current paragraph to find the relevant context. I think this is probably just me overlooking something that should have been obvious.
Well, all we know is that a match succeeded, not that the right code block was called - in principle we could end up with the pattern being changed to I don't think it needs a lot, but just showing that a single example like |
1065606
to
436f01a
Compare
I've just pushed a forced update.
On Mon, May 13, 2024 at 03:11:46PM -0700, Hugo van der Sanden wrote:
I somehow missed that "A1B2C3" was the literal pattern:
I've expanded the commit message a bit to help.
Well, all we know is that a match succeeded, not that the right code
block was called - in principle we could end up with the pattern being
changed to `/.*/` and all the tests would pass.
Some negative tested added.
We should also verify that lack of `use re qw{eval}` causes the right
error in this case, if we don't already have that.
A couple of such tests added.
…--
"Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."
|
Split the 'count' field of reg_code_blocks structures into separate 'count' and 'size' fields to make the code less fragile; and as an intended side-effect, fix GH #16627. Background: When a pattern includes embedded perl code, such as /(?{ CODE })/, then at compile-time the op trees associated with each of those code blocks are stored within the compiled regex, in a reg_code_blocks structure. This structure contains some basic info, plus a pointer to an array of reg_code_block structures, each of which contains a pointer to the optree for that code block, plus string offsets to where the (?{..}) or similar expression starts and ends within the pattern string. For a runtime pattern, perl tries to reuse any original compiled code blocks rather than recompiling them, to maintain correct closure behaviour. So for example, consider the following: my $x = 1; { my $x = 2; $r = qr/(??{$x})/ } my $y = 3; my $s = '(??{$y})'; my $pat = qr/A (??{$x}) B $r C $s/x; At perl compile time, the two '$x' code blocks are compiled, and their optrees stored. At runtime, when the $pat pattern is compiled, the third code block, '$y', is compiled, and the two earlier optrees are retrieved. A new three-element 'struct reg_code_block' array is malloc()ed, and the pointers to the two old, and one new, optrees are stored in it. So when $pat gets compiled, it becomes equivalent to: qr/A (??{$x}) B (??{$x}) C (??{$y})/x; except that the $x's have different values since they are from different closures. When the pattern is executed, the sub-patterns returned by the various (??{..})'s result in $pat having the same overall effect as qr/A1B2C3/. The assembly of this reg_code_block array is mostly performed by S_concat_pat() and S_compile_runtime_code(). It is done incrementally, since the total number of code blocks isn't known in advance. Prior to this commit, the array was often realloced() and grown one element at a time as each new run-time code block was discovered, with a corresponding pRExC_state->code_blocks->count++. This count field served twin purposes: it indicated both how many code blocks had been found and stored so far, and the malloc()ed size of the array. But some parts of the regex compiler allocate more than one slot at a time, and so the two meanings of the 'count' field temporarily diverge. This became noticeable when S_concat_pat() recursed to interpolate the contents of an array, such as qr/$a$b@c/, where interpolating $a, $b was done iteratively at the top level, then it recursed to process each element of @c. S_concat_pat() had a local var, 'int n', which counted how many code blocks had been found so far, and this value sometimes represented the difference between the two meanings of the 'count' field. However when it recursed, n started from zero again and things got out of whack, which led to GH #16627. The bug in that ticket can be reduced to: my @x = ( qr/(?{A})/ ); qr/(?{B})@x/; Here the B code block is stored in pRExC_state->code_blocks->cb[0], but then S_concat_pat() recurses, n is reset to 0, and the A code block is also stored into slot 0. Then things would start to crash. The quick and dirty fix would be to share n between recursive calls to S_concat_pat(), by passing a pointer to it. Instead, this commit takes the approach of adding a 'size' field to pRExC_state->code_blocks, so that ->count now only indicates the current number of code blocks stored (replacing the local var n) while ->size indicates the current number of slots malloc()ed. This makes the code more conventional and simpler to understand, and allows the realloc() to pre-allocate rather than incrementing the array size by 1 each time. By removing the fragile double meaning of the 'count' field, it should make any future bugs easier to diagnose, at the cost of this initial commit being more complex.
When concatenating the components of a run-time pattern, if a component has concat overloading, then that method is used, and any previously accumulated code blocks - i.e. (?{...}) and similar - are discarded. However, the ref counts of of any regex objects pointed to which contained those code block(s) weren't having their reference count decremented, and so leaked. Spotted by code inspection while working on the previous commit.
436f01a
to
f4099af
Compare
The first commit reworks some of the detail of how arrays are interpolated arrays into patterns which have code blocks, and fixes GH #16627.
The second follow up commit fixes a small leak I spotted in that same area of code.