Finish making ParseXS build an AST and fix some bugs #23883

iabyn · 2025-10-29T17:48:48Z

This big branch finishes off the work done in my earlier merge commit
v5.43.0-169-g195fee3008 from July 2025, which refactored ParseXS so that
each XSUB was parsed into an Abstract Syntax Tree (AST).

This branch extends that work so that the whole XS file is now compiled
into a single AST (with all XSUBs embedded somewhere in that tree), and
any code generation takes place after all parsing is done. This opens
the possibility in the future of, for example, varying what boilerplate
C is added to the start of the generated C file based on the nature of the
parsed XSUBs: previously, all possible boilerplate had to be emitted,
because it couldn't be known in advance what would be needed.

Part of this refactoring has involved moving any non-low-level processing
out of fetch_para(). For example, TYPEMAP keywords were formerly entirely
processed by fetch_para(); now it has just enough logic to find the
matching "EOF" heredoc line and return the TYPEMAP block as a single
paragraph, which is then processed just like any other keyword. In fact
fetch_para() has been heavily refactored to make it simpler and more
regular.

The old parse stack, {$pxs->{XS_parse_stack}}, has been eliminated. The
state it used to maintain (mainly concerned with #if/#else/#endif nesting
for handling duplicate XSUBs) is now maintained by the presence of
Node::cpp_scope nodes in the AST, which delineate sequences of nodes which
are all within the same branch of such an #if.

This branch also fixes some bugs along the way; some fixes were just
by-products of the new way of parsing things; while others were ones that
were spotted while trying to understand what the old code did.

The "duplicate XSUB" analysing alluded to above didn't handle #elif
correctly.
False positives have been eliminated for the "duplicate function
definition" warning. This fixes GH 19661. Now the warning is only given
for XSUBs with the same name appearing strictly within the same branch of
an #if/#else/#endif. This might generate some false negatives (i.e. it
doesn't warn when it should) but the problem will be detected by the C
compiler eventually anyway.
Handle POD correctly if it extends to the last line of the file.
Line continuations ("\\n") are now correctly handled if they occur
on the line immediately after a POD or TYPEMAP section.

There are also some visible changes in behaviour:

The sequencing of XSubPPtmpAAA, XSubPPtmpAAB, etc guard defines may
change. These are defines added to indicate different branches of an
#if/#else/#endif. The functionality hasn't changed. In theory some XS code
could break if it was testing for particular defines; but since these are
an internal undocumented implementation detail, it shouldn't be relying on
it.
Similarly,the exact placing of the '#define XSubPPtmpAAA 1' may alter
slightly.
The "#else/elif/endif without #if in this function" warning no longer
includes the hint "(precede it with a blank line...)" because the
condition that was used to determine whether to add the hint is no longer
easy to calculate.
The newXS() boot calls for per-package 'Foo::Bar::()' methods are
emitted slightly later in the boot code now. This should make no
functional difference.
syntax errors in MODULE and TYPEMAP keyword lines are now detected. So
bad MODULE/TYPEMAP lines are now reported as such; previously they would be
silently interpreted as non-keyword lines and thus generate some weird
error message about a bad XSUB start or similar.

This set of changes requires a perldelta entry, and i'll write one later.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add ExtUtils::ParseXS::Node::XS_file class. It doesn't do a lot yet.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a stub Node subclass which is responsible for emitting the preamble at the start of the generated C file.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add these new AST node types: ExtUtils::ParseXS::Node::C_part; ExtUtils::ParseXS::Node::C_part_POD; ExtUtils::ParseXS::Node::C_part_code; and add this method: ExtUtils::ParseXS::Node::is_xs_module_line() These are collectively used to parse and hold the "C" part of the XS file - i.e. everything that comes before the first MODULE line. A C_part node has children consisting of C_part_POD and C_part_code nodes.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a stub Node subclass which is responsible for emitting the postamble to the C file following any C code which has been copied as-is from the C part of the XS file

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) This node stores a single "#foo BAR" C preprocessor line which appears within the XS part of an XS file, and which is in global (file) scope (as opposed to being within code in an XSUB or BOOT).

This condition: match a line which doesn't start with anything other than '#' can be more simply expressed as: match a line which starts with '#' They're equivalent. Or at least I think they are. This was doing my head in.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove the old BOOT_handler() and replace it with a new Node type.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) For some reason most of the file-scoped keywords (PROTOTYPES etc) are handled together in a single 'while /REQUIRE|PROTOTYPES|.../' loop, but then BOOT is checked for separately after that loop has completed. This commit just moves checking for and handling BOOT into that same loop. The only significant difference between BOOT and the other keywords is that BOOT consumes all line until the end of the paragraph, rather than stopping at the next keyword. But this shouldn't make any difference to the top-level parsing loop.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove the old FALLBACK_handler() and replace it with a new Node type. Also, change $pxs->{map_package_to_fallback_string} so that it now holds logical values ('TRUE' etc) rather than the actual strings to later emit into the code ('&PL_sv_yes' etc). Then do the translation at the point the code is emitted.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove the old REQUIRE_handler() and replace it with a new Node type.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) In preparation for code generation to be done only after the whole file has been parsed (rather than after each XSUB is parsed, as at present), store the current package name in each Node::xsub object. Then use that value, rather than $pxs->{PACKAGE_name}, to specify the package name in various bits of boot code and in the $Package variable in typemaps. This is because by the end of parsing, $pxs->{PACKAGE_name} will have the value which was extracted from the *last* seen MODULE line in the file, and not the value at the point where an XSUB was parsed. Add several tests to confirm things work, and will continue to work in a few commits' time, when we switch to code generation coming after *all* parsing.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) This commit adds these new Node types: ExtUtils::ParseXS::Node::include ExtUtils::ParseXS::Node::INCLUDE ExtUtils::ParseXS::Node::INCLUDE_COMMAND and removes these two (now obsolete) methods: INCLUDE_handler INCLUDE_COMMAND_handler INCLUDE_COMMAND_handler() looks to have been a cut+paste of INCLUDE_handler() with the relevant bits changed to handle a piped command rather than a filename. I've combined their functionality back into a single parse() method within the Node::include base class, then made Node::INCLUDE and Node::INCLUDE_COMMAND trivial subclasses which just set (or not) $self->{is_cmd} and then let the base class do all the heavy lifting.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove a branch which says "if present, call new-style Node::FOO::parse(), else call old-style call FOO_handler()", since there are now AST nodes for all keywords. Also update some of the code comments to remove mention of FOO_handler and that way of working.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) This method used to be used for scanning for keywords and then invoking old-style FOO_handler()s. Replace its last usage with the new parse_keywords() method (which looks for keywords and creates Node::FOO's instead), and remove the method itself. For now, create a temporary node to accumulate all the file-scoped keywords in a continuous run, and then call as_code() once at the end. I.e for the input: PROTOTYPES: X REQUIRE: Y .. formerly a temporary Node::PROTOTYPES node was created, parsed, and then as_code() immediately called on it, then freed. Then similarly for REQUIRE etc. After this commit, a temporary parent node is created, child nodes are added to it for each of PROTOTYPES, REQUIRE, etc, and finally as_code() is called once which calls as_code() for each child. This is a small step in the process of moving towards deferring calling as_code() until the whole file has been read in and parsed. The next commit will make the temporary parent node into something better.

For debugging: displays the node and its children in a compact style similar to perl -MO=Concise

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a new node type, ExtUtils::ParseXS::Node::cpp_scope. This node takes on responsibility for the handling the main file-scoped parsing loop, which looks for file-scoped keywords, CPP directives, the start of XSUB declarations etc. It is the main child of the top-most Node::XS_file node type. The bulk of this commit consists of cutting and pasting the main parsing loop (including fetch_para()) out from ExtUtils::ParseXS::process_file() and into ExtUtils::ParseXS::Node::cpp_scope::parse(). For now this as a bit hacky - all tests pass, but mainly by ensuring (temporarily) that various bits of code are emitted during the parsing phase, rather than via a call to as_code(), in order to preserve the correct ordering. The next few commits will gradually clean things up. This commit also converts the lexical variable $cpp_next_tmp_define into a field of the ExtUtils::ParseXS object, as it now needs accessing outside of process_file(). The intent for this node type is that it represents anywhere within a single C preprocessor #if / #else / #endif branch / scope. For this commit, there's only a single cpp_scope node, representing the whole XS part of the main file. Over the next few commits, further nodes will be added: during INCLUDE, which is a whole separate scope, and between CPP conditional directives. When all is complete, it will allow for simpler handling of handling "duplicate" XSUB declarations in different #if branches, and the XS_parse_stack mechanism can be eliminated.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) This commit makes the INCLUDE keyword's parse() method use a new Node::cpp_scope node as a child to hold the nodes generated by parsing the included file. So whereas before those nodes were flattened into the parent list of nodes, they now form a distinct subtree. This involves recursively calling Node::cpp_scope::parse() and fetch_para(), rather than the included file being processed in the main loop. For example, given a main file which contains two declared XSUBs, main1 and main2, and an included file with XSUBs inc1 and inc2; then previously, the parse tree would have had this nesting: XS_file cpp_scope[type=main] xsub main1 include xsub inc1 xsub inc2 xsub main2 whereas it now looks like: XS_file cpp_scope[type=main] xsub main1 include cpp_scope[type=include] xsub inc1 xsub inc2 xsub main2 This commit introduces a slight bug in terms of the order in which BOOT blocks in main and included files get emitted; this will be fixed in a few commits' time.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Delete the ExtUtils::ParseXS methods push_parse_stack() and PopFile(), and inline their single use each in ExtUtils::ParseXS::Node::include::parse(). This is commit is just a crude cut+paste and s/$self/$pxs/g. Subsequent commits will clean the code up.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Previously whenever INCLUDE or INCLUDE_COMMAND parsed a new file, the old state (filename, old lines etc) were saved in a new entry pushed onto the XS_parse_stack. The previous commit inlined this push and pull into Node::include::parse(), which now looked a bit like: sub parse() { ... push @{$pxs->{XS_parse_stack}}, { type =>' file', Filename => $pxs->{in_filename}, ... } ... open new file and parse ... my $data = pop @{$pxs->{XS_parse_stack}}; $pxs->{in_pathname} = $data->{Filepathname}; ... } The intention over the next few commits is to eliminate XS_parse_stack completely: the functionality of the stack will be replaced by a nested series of Node::cpp_scope nodes within the AST. For now, this commit still pushes/pops an entry (so that #if / #endif scope analysing continues to work), but saves all the values in a local var. So after this commit, Node::include::parse() looks like: sub parse() { ... push @{$pxs->{XS_parse_stack}}, { type =>' file' }; my @saved = ($pxs->{in_filename}, ...); ... open new file and parse ... pop @{$pxs->{XS_parse_stack}}; ($pxs->{in_filename}, ...) = @saved; } The rest of this commit is just general tidying up and reordering of some of the code within parse() without change in functionality. Except I also added $pxs->{line_no} = []; in addition to the already present $pxs->{line} = [];

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Make the recently-added Node::global_cpp_line::parse() method split a C preprocessor line into its components and set some flags indicating whether its part of a conditional and if so, an if or variant. Save these values as new fields in the node. They will come in useful shortly.

The XS parser is supposed to be able to do basic processing of file-scoped C preprocessor directives: in particular, to handle alternative XSUB declarations; e.g.: #if X void foo(short i) #elif Y void foo(long i) #else void foo(int i) #endif But due to a bug present since this feature was added in perl5.003 (and which no one has complained about for 29 years), the XS parser was incorrectly looking for the string '#elsif' rather than '#elif'. So the above would fail, but something using only 'else' would compile okay: #if X void foo(short i) #else void foo(int i) #endif This commit fixes the bug and adds some tests.

Add more tests for how the XS parser processes conditional C preprocessor directives in file scope. Note that some of these tests will start to fail after the next commit, which re-implements the processing. The fixups to these tests will indicate what has changed.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) This commit includes a few significant changes to ParseXS which are not separable into individual commits. Some of these also involve minor changes in behaviour. Hence a rather long commit message, using pseudo-POD headers to break things up a bit and give a bit of structure. =head1 Background The XS parser does some basic analysis of file-scoped C preprocessor conditional directives such as #if, #elif, #endif. This is mainly for the purpose of handling otherwise duplicate XSUBs, e.g. #ifdef USE_2ARG int foo(int i, int j) #else int foo(int i) #endif For the above, the parser suppresses any 'duplicate' warning, and adds a 'guard' define at the start of each branch, such as the following in the emitted C code: #ifdef USE_2ARG #define XSubPPtmpAAAA 1 XS_EUPXS(XS_Foo__Bar_foo) { ... } #else #define XSubPPtmpAAAB 1 XS_EUPXS(XS_Foo__Bar_foo) { ... } #endif It then uses those guards in the boot code to ensure that only the correct XSUB(s) (and aliases etc) are registered: #if XSubPPtmpAAAA newXS_deffile("Foo::Bar::foo", XS_Foo__Bar_foo); #endif #if XSubPPtmpAAAB newXS_deffile("Foo::Bar::foo", XS_Foo__Bar_foo); #endif =head1 The changes in this commit =head2 Add a Node::cpp_scope node between branches The XS code above was, before this commit, compiled into an AST which looked something like: XS_file preamble C_part postamble cpp_scope[type=main] global_cpp_line [if] xsub global_cpp_line [else] xsub global_cpp_line [endif] After this commit, all nodes between each conditional CPP line are now hung off a new cpp_scope node: XS_file preamble C_part postamble cpp_scope[type=main] global_cpp_line [if] cpp_scope[type=if] xsub global_cpp_line [else] cpp_scope[type=if] xsub global_cpp_line [endif] =head2 The logic to do the #if/#endif processing is completely reimplemented from scratch. Formerly, @{pxs->{XS_parse_stack}} was a stack that was pushed or popped for every INCLUDE and #if/#endif, and the state about conditionals was contained in that stack. After this commit, the state is instead implied by the structure of the AST; in particular, in the currently active cpp_scope node. The cpp_scope::parse() method has the fetch_para() loop, and each time it encounters an #if etc it, it creates a child cpp_scope node and recursively calls its parse() method. That method returns when it sees the end of a scope, such as encountering an #elif, #else or #endif or EOF. Now, virtually all of the logic involved in maintaining '#if' state is in cpp_scope::parse(), and all the logic to emit '#define XSubPPtmpAAAB 1' etc code is in cpp_scope::as_code(). This is in contrast to previously, where the logic and code generation were liberally spread around in ParseXS.pm and Node.pm To facilitate this change, two fields have been added to the cpp_scope node type: guard_name - the 'XSubPPtmpAAAB' or whatever to use for this branch; seen_xsubs - hash of XSUB names seen in this branch. After this commit, the data in @{pxs->{XS_parse_stack}} isn't used; the next commit will remove it completely. =head2 Almost all code is now generated after parsing is finished. At the start of this git branch, code was emitted after each XSUB was parsed. Now (with a few residual exceptions to be addressed shortly) the whole file is first parsed into a single AST, and *then* the top-level as_code() method is called to generate the entire C file from the AST. Some of the changes in this commit represent the removal of temporary shims that preserved the old parse/emit ordering, and which are now no longer required. =head1 Visible changes in behaviour Although this commit is mostly intended to be just a refactoring, some changes in behaviour (in terms of what C code is generated and what warnings are issued) has occurred. These are made visible in the changes this commit makes to t/001-basic.t, which deliberately contains tests for the behaviours which were about to change. The changes are: =head2 fewer warnings The "Warning: duplicate function definition" warning is now emitted less often, to avoid false positives. In particular, this fixes GH 19661. That ticket concerned code like #if C1 void foo() #endif #if C2 void foo() #endif The logic was such that only one of C1 and C2 could ever be true (basically DBD:MariaDB was adding a missing DBI function if the installed DBI was an old one). But the XS parser can't know that. So it used to warn. The big change in this commit is to now warn only for two identically named XSUBs which appear in the *same* cpp_scope. This removes false positives. It may add false negatives, but that is fairly harmless, as it just means that bad C code will be generated, and the C compiler will eventually complain anyway. The guard definitions are still generated as before. This change in behaviour was intentional, but was also the easiest: it meant just checking the seen_xsubs hash for the current cpp_scope node. For more complex behaviour (if you are considering implementing it), it may require having the concept of a 'current' cpp_scope, e.g. $pxs->{cur_cpp_scope}, plus a chain of pointers from the current scope back to outer ones. (This is slightly problematic, as weak refs aren't available, as Scalar::Util may not be built yet - so it might require higher level node(s) to have a destructor which searches for those links and deletes them first) Once the chain is in place, then it's possible for a newly-added XSUB to update the parent seen_xsubs hashes too, etc. =head2 The sequencing of XSubPPtmpAAA, XSubPPtmpAAB etc may change Previously, the sequence of the guard defines was strictly in the order they appear in the C file. Now, ones in a nested scope come before outer ones, e.g. #ifdef C1 #define XSubPPtmpAAB 1 # ifdef C2 # define XSubPPtmpAAA 1 ... This should be harmless. =head2 XSubPPtmpAAA etc may be placed slightly differently Previously in something like #ifdef C1 #define X 1 #define Y 2 the guard would be added after any other CPP lines; the above would cause this C code to be emitted: #ifdef C1 #define X 1 #define Y 2 #define XSubPPtmpAAA 1 but is now: #ifdef C1 #define XSubPPtmpAAA 1 #define X 1 #define Y 2 i.e.the guard now always comes *immediately* after the conditional at the start of the branch it will guard. This should be harmless.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) The previous commits have changed how the parser tracks file-scoped #if / #endif sections - this is now simply part of the structure of the AST. So this commit removes the old parser stack, @{$pxs->{XS_parse_stack}}, which used to be used to track this info. The stack was still present, but was no longer used to deduce #if state. It also removes the analyze_preprocessor_statements() method and the test file which used to test it, t/111-analyze_preprocessor_statements.t. One minor regression in this commit is that a warning about about unbalanced #if/#endif within an XSUB or BOOT block used to sometimes emit an extra hint about what the problem might be: print STDERR " (precede it with a blank line if the matching #if is outside the function)\n" if $self->{XS_parse_stack}->[-1]{type} eq 'if'; Since it's no longer trivial to check whether we're currently within a global #if/endif section, I've kept the 'unbalanced' warning, but deleted the additional hint. If anybody wants to add it back, you'll need to add a field to the EU::PXS class such as $pxs->{current_cpp_scope} which keeps track of which is the current (innermost) Node::cpp_scope node being parsed. Then check_conditional_preprocessor_statements() can use it in place of the "$self->{XS_parse_stack}->[-1]{type} eq 'if';" test.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Move the parsing of the XS part of the main file from process_file() into cpp_scope::parse(). No functional change.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a stub Node subclass which is responsible for emitting any C code which comes after the XS file has been parsed (and any user XSUBs emiited), but before the boot XSUB is emitted. Currently this code is just concerned with adding overload methods. The code has just been moved from process_file() to the node's as_code method() and is otherwise unchanged.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a stub Node subclass which is responsible for emitting the C code for the boot XSUB. The code has just been moved from process_file() to the node's as_code method() and is otherwise unchanged.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) The previous commit moved code from process_file() into it's own method, with minimal changes. Now clean the code up a bit: - use $open_brace, $close_brace instead of the magic [[ ]]; - clean up comments, white space etc; - use Q to make some code strings more readable.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Move some code from process_file() to XS_file::parse() No change in functionality.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) In the boot XSUB, a bunch of newXS() calls are added, to register all the XSUBs in the perl namespace, plus all aliases, overload methods etc. In addition, for every package which contains at least one overload method, a special stub sub called Foo::Bar::() is registered. This is used at run-time to determine whether overloading is present for a particular class. This commit changes the order, by changing an unshift to a push. Previously the stub package subs were registered first, then the XSUBs; now its the other way round. e.g. /* mark these two packages as having overload methods */ newXS_deffile("X::Y::()", XS_X__Y_nil); newXS_deffile("P::Q::()", XS_P__Q_nil); /* foo XSUB used as an overload method for <=> etc */ newXS_deffile("P::Q::foo", XS_P__Q_foo); newXS_deffile("P::Q::(<=>", XS_P__Q_foo); ... /* bar XSUB used as an overload method for <=> etc */ newXS_deffile("X::Y::bar", XS_X__Y_bar); newXS_deffile("X::Y::(<=>", XS_X__Y_bar); ... becomes /* foo XSUB used as an overload method for <=> etc */ newXS_deffile("P::Q::foo", XS_P__Q_foo); newXS_deffile("P::Q::(<=>", XS_P__Q_foo); ... /* bar XSUB used as an overload method for <=> etc */ newXS_deffile("X::Y::bar", XS_X__Y_bar); newXS_deffile("X::Y::(<=>", XS_X__Y_bar); ... /* mark these two packages as having overload methods */ newXS_deffile("P::Q::()", XS_P__Q_nil); newXS_deffile("X::Y::()", XS_X__Y_nil); This should make no difference to the functionality, but will make a further refactoring of boot code generation easier. AFAIKT, using unshift rather than push in the original 2002 commit didn't have any particular intent.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Before this commit, as_code() methods were responsible for generating both the general code that goes in the C file, and also code that specifically goes in the boot SUB. They achieved this by pushing lines into the arrays $pxs->{bootcode_early} and $pxs->{bootcode_later}. This commit adds as_boot_code() methods to the relevant node types and removes the bootcode_early and bootcode_later arrays. Now, the C code file is generated by walking the AST, calling as_code() methods. When it gets to the boot_xsub::as_code() method, which is responsible for emitting the XS(boot_Foo__Bar) {} boot XSUB, that sub itself then does a second treewalk, calling as_boot_code() methods, which accumulates lines to be added early and later on in the boot SUB.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) The ExtUtils::ParseXS::Utilities::standard_XS_defs() function just returns a big string containing all the standard boilerplate code that gets added to the C file. Delete this function, and instead include the text directly within C_part_postamble::as_code().

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) When parsing the C part of the XS file (mainly looking for POD and C code), don't use $_ and $. to hold the state of the next line. This was ok when it was a simple loop, but now that the logic is spread over multiple parse() methods, store the next line to process in $pxs->{lastline} and $pxs->{lastline_no}. (The XS half of the file already uses those variables, along with @{$pxs->{line}} etc.)

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) The parser, when parsing an XSUB (but not any other parts of an XS file), pushes a special token onto the end of the list of lines in @{$pxs->{line}} which make up the current paragraph. This token, $ExtUtils::ParseXS::END = "!End!\n\n"; (along with a trailing ':') is designed to look like an impossible keyword which can't actually appear in the source code (due to the multiple newlines). It looks like it was originally added in perl5.002 to make the parsing code easier, but I don't really understand why. It just makes the parser harder to understand. So this commit removes it, and just relies on @{$pxs->{line}} being zero to detect the end of the paragraph. This change doesn't alter the C code generated from any of the XS files bundled with perl.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove this line: $_ = ''; which is a leftover from when $_ used to maintain the current line between calls to various parsing subs. It's redundant now.

Previously, a line which was not a completely syntactically correct MODULE line was not treated as a module line; instead it got passed on uninterpreted until likely causing an error further along in parsing. This commit changes things so that anything that looks like the *start* of a module line is treated as a module line, and is *only then* examined for full syntactic correctness, giving an error if not ok. For example: previously, this line: MODULE = Foo XXXPACKAGE = Foo::Bar gave the weird error message: Error: Function definition too short 'MODULE = Foo XXXPACKAGE ... but now gives the error message: Error: unparseable MODULE line: 'MODULE = Foo XXXPACKAGE ... In particular, any line starting with /^MODULE\s*[=:]/ is now treated as an attempt to declare a module, including the syntactically incorrect 'MODULE:' form. This is in the same spirit that other keywords are already processed; for example PROTOTYPES: XXXENABLE is treated as as a badly-formed PROTOTYPES line rather than an otherwise unrecognised and unprocessed line.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a new node type, ExtUtils::ParseXS::Node::MODULE. this (mostly) regularises the treatment of a MODULE line, now handled in the usual way by parse_keywords(), rather than being a special magical snowflake that got its own handling in fetch_para(). This commit also changes one parameter of parse_keywords() from being a boolean to being a bit flag, now that there are now *two* slightly special cases to flag up: MODULE, in addition to NOT_IMPLEMENTED_YET. This commit is supposed to have no changes in behaviour, but there *might* be some edge cases that I haven't thought of.

Previously, a line which was not a completely syntactically correct TYPEMAP: line was not treated as a TYPEMAP line; instead it got passed on uninterpreted until likely causing an error further along in parsing. This commit changes things so that anything starting with /^TYPEMAP\s*:/ is treated as a TYPEMAP line, and is *only then* examined for full syntactic correctness, giving an error if not ok. This is similar to two commits ago which did the same for the MODULE keyword.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add a new node type, ExtUtils::ParseXS::Node::TYPEMAP. this (mostly) regularises the treatment of a TYPEMAP line, now handled in the usual way by parse_keywords(), rather than being processed solely within fetch_para(). fetch_para() does still need to do *some* processing: it has to read the TYPEMAP line, extract out the <<EOF bit, and use that to read lines up until the matchhing 'EOF'. It then returns these lines (including the initial TYPEMAP line) as a paragraph which is then handled as a normal file-scoped keyword. With this commit, fetch_para() becomes closer to its official job of just returning the next useful chunk of lines, rather than having hidden side effects too. This commit is supposed to have no changes in behaviour, but there *might* be some edge cases that I haven't thought of.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Inline the one use of _maybe_skip_pod() into fetch_para().

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Change where $pxs->{lastline} gets chomped at the start of the XS half of the file, where the fetch_para() loop is first entered - do it just the once when transitioning rather than polluting the main fetch_para() loop with it. No functional changes.

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) For consistency with everything else in fetch_para(), chomp any lines extracted from a TYPEMAP section before stuffing them in @{$pxs->{line}}. Later, make Node::TYPEMAP::parse() stick a newline back on each line before passing them to the typemap parser.

If POD continued to the last line of the XS file, a bunch of "Use of uninitialized value" warnings would be generated. The fix is trivial, but it also required a slight tweak to the test framework in t/001-basic.t, which was unconditionally adding a \n to the end of each test string, thus making the '=cut' appear not to be at EOF.

Remove the loop which says 'foreach contiguous section of pod, delete it'. Replace it with with something which removes just *one* section of POD, then relies on the main loop to process the next line, which may well be the start of another POD section. Fewer nested loops is less cognitive load. Should be no functional changes.

change if (cond) { ... lots lines ... } to if (!cond) { goto read_next_line; } ... lots lines ... read_next_line: which might be considered a bit regressive in a 'goto considered harmful' sort of way, but will will help with further refactoring, and emphasises that the current input line is discarded if the condition isn't met (i.e. it strips out code comments, which wasn't obvious).

Reindent a block of code after the previous commit removed a scope. Whitespace-only.

Within fetch_para(), the at-end-of-POD code and the main loop both had separate (but similar) chunks of code to read in the next line, with the latter one being more complete. In particular, the latter had logic to process '\' continuation lines, while the POD one didn't. This meant that this gave a parse error: =pod =cut void foo(int i, \ int j) The fix is to make them both use the same code to read the next line.

(This is similar to the previous commit, which fixed the same bug for POD) Within fetch_para(), the at-end-of-TYPEMAP code and the main loop both had separate (but similar) chunks of code to read in the next line, with the latter one being more complete. In particular, the latter had logic to process '\' continuation lines, while the TYPEMAP one didn't. This meant that this gave a parse error: TYPEMAP: <<EOF ... EOF void foo(int i, \ int j) The fix is to make them both use the same code to read the next line. This bug was probably actually introduced by me during the refactoring in this branch. By using the same read-next-line code, it also means that an immediately following blank line is now stripped of whitespace. I'm not sure whether that actually affects anything in the real word. Also, any trailing lines in typemap are now stripped. Again, this shouldn't affect anything.

Add more code comments to fetch_para(), especially explaining how the "XSUB cuddled by #if/#endif without blank lines" mechanism works

Simplify the logic of the code which handles C-preprocessor directives where there's not the (usually required) blank line directly before and after an XSUB; e.g. #if X int foo(...) ... #endif This commit splits the logic into two halves: the first is responsible for handling conditionals directly *before* the XSUB, and the other half for ones during and directly after the XSUB. Then the second half is further simplified. I think this makes the code easier to understand. I *think* the new code is logically equivalent.

Update the code comments above this function. This is both to reflect the changes made in this branch (e.g. no longer processing MODULE lines in this function), and to reflect my better understanding of this function since I wrote the original text.

Rewrite some code comments explaining why the XS parser sometimes dies if the next line is indented. It's a bit complicated and the initial comments I added a year ago weren't entirely accurate.

Simplify the code in Node::cpp_scope::parse(), which contains the main file-scoped parsing loop, and update / add code comments. The code in this loop has gone through some rough times with all the refactoring that has taken place in the last year, chiefly converting the parser into being AST-based. This commit is kind of a final tidy up to restore sanity. Apart from fixing lots of comments, the main thing this commit does is flatten nested while loops. Before this commit, the overall structure of the parse() function was like; PARAGRAPH: while (lines to process) { while (blank line) { skip } while (C-preprocessor line) { process } ... and similar loops ... } and is now: while (lines to process) { if (blank line) { skip; next } if (C-preprocessor line) { process; next } ... and similar loops ... } i.e. rather than there being nested loops, just have one main loop and go back to the start of that to do the next thing. This should be functionally equivalent, but less complicated. This new paradigm can't be applied yet to the last few items in the main loop due to a backwards compatibility issue with the SCOPE keyword. I'll fix that at some future date.

Now that the XS parser builds a parse tree (AST), it spends most of its time doing recursive-descent $some_node_object->parse() calls. There is a common idiom within the current code along the lines of $some_node_object->parse() or return; This makes a parse failure in a node propagate its way back up the call chain. Note that many parse failures don't immediately croak; they call blurt() which only warns, but also increments the error count; parsing can continue. During my big refactoring I haven't been entirely consistent here; this commit adds a bunch of 'or return's that I missed. It also does a few 'or next's at higher levels, to get behaviour along the lines of: "this XSUB had a syntax error, but continue trying to parse other XSUBs, and only error out at the end." This is all a bit going on gut feelings; in the final analysis it doesn't matter too much how much more parsing gets done before xsubpp finally croaks out.

Update the POD at the top of Node.pm and the code comments at the top of ParseXS.pm to reflect the changes in this branch which have extended the AST from representing just an XSUB to representing the whole XS file.

Add a method which is a simple wrapper around ExtUtils::ParseXS::Q. This means throughout Node.pm, you can write $self->Q(<<"EOF") rather than the more long-winded ExtUtils::ParseXS::Q(<<"EOF");

iabyn added 30 commits October 29, 2025 13:14

Bump ExtUtils::ParseXS to 3.61

848c3b7

ParseXS: refactor: add stub top-level node

c45207c

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Add ExtUtils::ParseXS::Node::XS_file class. It doesn't do a lot yet.

ParseXS: refactor: simplify double-negative regex

51357f9

This condition: match a line which doesn't start with anything other than '#' can be more simply expressed as: match a line which starts with '#' They're equivalent. Or at least I think they are. This was doing my head in.

ParseXS: refactor: add Node::BOOT

c153926

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove the old BOOT_handler() and replace it with a new Node type.

ParseXS: refactor: add Node::REQUIRE

8f8270f

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Remove the old REQUIRE_handler() and replace it with a new Node type.

ParseXS: add Node::as_concise method

b449039

For debugging: displays the node and its children in a compact style similar to perl -MO=Concise

ParseXS: refactor: move some code

39cf145

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Move some code from process_file() to XS_file::parse() No change in functionality.

iabyn added 27 commits October 29, 2025 16:13

ParseXS: refactor: inline _maybe_skip_pod()

970d027

(This commit is part of a series which will extend the AST parse tree from just representing individual XSUBs to representing the whole XS file.) Inline the one use of _maybe_skip_pod() into fetch_para().

ParseXS: refactor: fetch_para(): reindent

b84438e

Reindent a block of code after the previous commit removed a scope. Whitespace-only.

ParseXS: refactor: fetch_para(): add more comments

9d970e5

Add more code comments to fetch_para(), especially explaining how the "XSUB cuddled by #if/#endif without blank lines" mechanism works

ParseXS: refactor: fix comments re die if indented

20d2937

Rewrite some code comments explaining why the XS parser sometimes dies if the next line is indented. It's a bit complicated and the initial comments I added a year ago weren't entirely accurate.

ParseXS: refactor: update POD and comments

1324bc4

Update the POD at the top of Node.pm and the code comments at the top of ParseXS.pm to reflect the changes in this branch which have extended the AST from representing just an XSUB to representing the whole XS file.

ParseXS: refactor: add Node::Q method

d88ca93

Add a method which is a simple wrapper around ExtUtils::ParseXS::Q. This means throughout Node.pm, you can write $self->Q(<<"EOF") rather than the more long-winded ExtUtils::ParseXS::Q(<<"EOF");

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Finish making ParseXS build an AST and fix some bugs #23883

Finish making ParseXS build an AST and fix some bugs #23883

Uh oh!

iabyn commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Finish making ParseXS build an AST and fix some bugs #23883

Are you sure you want to change the base?

Finish making ParseXS build an AST and fix some bugs #23883

Uh oh!

Conversation

iabyn commented Oct 29, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants