Skip to content

Conversation

@iabyn
Copy link
Contributor

@iabyn iabyn commented Oct 29, 2025

This big branch finishes off the work done in my earlier merge commit
v5.43.0-169-g195fee3008 from July 2025, which refactored ParseXS so that
each XSUB was parsed into an Abstract Syntax Tree (AST).

This branch extends that work so that the whole XS file is now compiled
into a single AST (with all XSUBs embedded somewhere in that tree), and
any code generation takes place after all parsing is done. This opens
the possibility in the future of, for example, varying what boilerplate
C is added to the start of the generated C file based on the nature of the
parsed XSUBs: previously, all possible boilerplate had to be emitted,
because it couldn't be known in advance what would be needed.

Part of this refactoring has involved moving any non-low-level processing
out of fetch_para(). For example, TYPEMAP keywords were formerly entirely
processed by fetch_para(); now it has just enough logic to find the
matching "EOF" heredoc line and return the TYPEMAP block as a single
paragraph, which is then processed just like any other keyword. In fact
fetch_para() has been heavily refactored to make it simpler and more
regular.

The old parse stack, {$pxs->{XS_parse_stack}}, has been eliminated. The
state it used to maintain (mainly concerned with #if/#else/#endif nesting
for handling duplicate XSUBs) is now maintained by the presence of
Node::cpp_scope nodes in the AST, which delineate sequences of nodes which
are all within the same branch of such an #if.

This branch also fixes some bugs along the way; some fixes were just
by-products of the new way of parsing things; while others were ones that
were spotted while trying to understand what the old code did.

  • The "duplicate XSUB" analysing alluded to above didn't handle #elif
    correctly.

  • False positives have been eliminated for the "duplicate function
    definition" warning. This fixes GH 19661. Now the warning is only given
    for XSUBs with the same name appearing strictly within the same branch of
    an #if/#else/#endif. This might generate some false negatives (i.e. it
    doesn't warn when it should) but the problem will be detected by the C
    compiler eventually anyway.

  • Handle POD correctly if it extends to the last line of the file.

  • Line continuations ("\\n") are now correctly handled if they occur
    on the line immediately after a POD or TYPEMAP section.

There are also some visible changes in behaviour:

  • The sequencing of XSubPPtmpAAA, XSubPPtmpAAB, etc guard defines may
    change. These are defines added to indicate different branches of an
    #if/#else/#endif. The functionality hasn't changed. In theory some XS code
    could break if it was testing for particular defines; but since these are
    an internal undocumented implementation detail, it shouldn't be relying on
    it.

  • Similarly,the exact placing of the '#define XSubPPtmpAAA 1' may alter
    slightly.

  • The "#else/elif/endif without #if in this function" warning no longer
    includes the hint "(precede it with a blank line...)" because the
    condition that was used to determine whether to add the hint is no longer
    easy to calculate.

  • The newXS() boot calls for per-package 'Foo::Bar::()' methods are
    emitted slightly later in the boot code now. This should make no
    functional difference.

  • syntax errors in MODULE and TYPEMAP keyword lines are now detected. So
    bad MODULE/TYPEMAP lines are now reported as such; previously they would be
    silently interpreted as non-keyword lines and thus generate some weird
    error message about a bad XSUB start or similar.

  • This set of changes requires a perldelta entry, and i'll write one later.

iabyn added 30 commits October 29, 2025 13:14
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add ExtUtils::ParseXS::Node::XS_file class.
It doesn't do a lot yet.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a stub Node subclass which is responsible for emitting the preamble
at the start of the generated C file.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add these new AST node types:

    ExtUtils::ParseXS::Node::C_part;
    ExtUtils::ParseXS::Node::C_part_POD;
    ExtUtils::ParseXS::Node::C_part_code;

and add this method:

    ExtUtils::ParseXS::Node::is_xs_module_line()

These are collectively used to parse and hold the "C" part of the XS
file - i.e. everything that comes before the first MODULE line.

A C_part node has children consisting of C_part_POD and C_part_code
nodes.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a stub Node subclass which is responsible for emitting the postamble
to the C file following any C code which has been copied as-is from the
C part of the XS file
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

This node stores a single "#foo BAR" C preprocessor line which appears
within the XS part of an XS file, and which is in global (file) scope
(as opposed to being within code in an XSUB or BOOT).
This condition:

    match a line which doesn't start with anything other than '#'

can be more simply expressed as:

    match a line which starts with '#'

They're equivalent. Or at least I think they are. This was doing my head
in.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Remove the old BOOT_handler() and replace it with a new Node type.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

For some reason most of the file-scoped keywords (PROTOTYPES etc) are
handled together in a single 'while /REQUIRE|PROTOTYPES|.../' loop, but
then BOOT is checked for separately after that loop has completed.

This commit just moves checking for and handling BOOT into that same
loop. The only significant difference between BOOT and the other
keywords is that BOOT consumes all line until the end of the paragraph,
rather than stopping at the next keyword. But this shouldn't make any
difference to the top-level parsing loop.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Remove the old FALLBACK_handler() and replace it with a new Node type.
Also, change $pxs->{map_package_to_fallback_string} so that it
now holds logical values ('TRUE' etc) rather than the actual strings
to later emit into the code ('&PL_sv_yes' etc). Then do the translation
at the point the code is emitted.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Remove the old REQUIRE_handler() and replace it with a new Node type.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

In preparation for code generation to be done only after the whole file
has been parsed (rather than after each XSUB is parsed, as at present),
store the current package name in each Node::xsub object.

Then use that value, rather than $pxs->{PACKAGE_name}, to specify the
package name in various bits of boot code and in the $Package variable
in typemaps. This is because by the end of parsing, $pxs->{PACKAGE_name}
will have the value which was extracted from the *last* seen MODULE line
in the file, and not the value at the point where an XSUB was parsed.

Add several tests to confirm things work, and will continue to work in a
few commits' time, when we switch to code generation coming after *all*
parsing.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

This commit adds these new Node types:

    ExtUtils::ParseXS::Node::include
    ExtUtils::ParseXS::Node::INCLUDE
    ExtUtils::ParseXS::Node::INCLUDE_COMMAND

and removes these two (now obsolete) methods:

    INCLUDE_handler
    INCLUDE_COMMAND_handler

INCLUDE_COMMAND_handler() looks to have been a cut+paste of
INCLUDE_handler() with the relevant bits changed to handle a piped
command rather than a filename. I've combined their functionality back
into a single parse() method within the Node::include base class, then
made Node::INCLUDE and Node::INCLUDE_COMMAND trivial subclasses which
just set (or not) $self->{is_cmd} and then let the base class do all the
heavy lifting.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Remove a branch which says "if present, call new-style
Node::FOO::parse(), else call old-style call FOO_handler()", since there
are now AST nodes for all keywords.

Also update some of the code comments to remove mention of FOO_handler
and that way of working.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

This method used to be used for scanning for keywords and then invoking
old-style FOO_handler()s. Replace its last usage with the new
parse_keywords() method (which looks for keywords and creates
Node::FOO's instead), and remove the method itself.

For now, create a temporary node to accumulate all the file-scoped
keywords in a continuous run, and then call as_code() once at the end.

I.e for the input:

    PROTOTYPES: X
    REQUIRE: Y
    ..

formerly a temporary Node::PROTOTYPES node was created, parsed, and then
as_code() immediately called on it, then freed. Then similarly for
REQUIRE etc.

After this commit, a temporary parent node is created, child nodes are
added to it for each of PROTOTYPES, REQUIRE, etc, and finally as_code()
is called once which calls as_code() for each child.

This is a small step in the process of moving towards deferring calling
as_code() until the whole file has been read in and parsed.

The next commit will make the temporary parent node into something
better.
For debugging: displays the node and its children in a compact style
similar to perl -MO=Concise
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a new node type,

    ExtUtils::ParseXS::Node::cpp_scope.

This node takes on responsibility for the handling the main file-scoped
parsing loop, which looks for file-scoped keywords, CPP directives,
the start of XSUB declarations etc. It is the main child of the top-most
Node::XS_file node type.

The bulk of this commit consists of cutting and pasting the main parsing
loop (including fetch_para()) out from ExtUtils::ParseXS::process_file()
and into ExtUtils::ParseXS::Node::cpp_scope::parse().

For now this as a bit hacky - all tests pass, but mainly by ensuring
(temporarily) that various bits of code are emitted during the parsing
phase, rather than via a call to as_code(), in order to preserve the
correct ordering. The next few commits will gradually clean things up.

This commit also converts the lexical variable $cpp_next_tmp_define
into a field of the ExtUtils::ParseXS object, as it now needs accessing
outside of process_file().

The intent for this node type is that it represents anywhere within a
single C preprocessor #if / #else / #endif branch / scope. For this
commit, there's only a single cpp_scope node, representing the whole XS
part of the main file. Over the next few commits, further nodes will be
added: during INCLUDE, which is a whole separate scope, and between CPP
conditional directives. When all is complete, it will allow for simpler
handling of handling "duplicate" XSUB declarations in different #if
branches, and the XS_parse_stack mechanism can be eliminated.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

This commit makes the INCLUDE keyword's parse() method use a new
Node::cpp_scope node as a child to hold the nodes generated by parsing
the included file. So whereas before those nodes were flattened into the
parent list of nodes, they now form a distinct subtree.

This involves recursively calling Node::cpp_scope::parse() and
fetch_para(), rather than the included file being processed in the main
loop.

For example, given a main file which contains two declared XSUBs, main1
and main2, and an included file with XSUBs inc1 and inc2; then
previously, the parse tree would have had this nesting:

    XS_file
        cpp_scope[type=main]
            xsub main1
            include
            xsub inc1
            xsub inc2
            xsub main2

whereas it now looks like:

    XS_file
        cpp_scope[type=main]
            xsub main1
            include
                cpp_scope[type=include]
                    xsub inc1
                    xsub inc2
            xsub main2

This commit introduces a slight bug in terms of the order in which BOOT
blocks in main and included files get emitted; this will be fixed in a
few commits' time.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Delete the ExtUtils::ParseXS methods push_parse_stack() and PopFile(),
and inline their single use each in
ExtUtils::ParseXS::Node::include::parse().

This is commit is just a crude cut+paste and s/$self/$pxs/g.
Subsequent commits will clean the code up.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Previously whenever INCLUDE or INCLUDE_COMMAND parsed a new file, the
old state (filename, old lines etc) were saved in a new entry pushed
onto the XS_parse_stack.

The previous commit inlined this push and pull into
Node::include::parse(), which now looked a bit like:

    sub parse() {
        ...
        push @{$pxs->{XS_parse_stack}}, {
            type =>' file',
            Filename => $pxs->{in_filename},
            ...
        }

        ... open new file and parse ...

        my $data = pop @{$pxs->{XS_parse_stack}};
        $pxs->{in_pathname} = $data->{Filepathname};
        ...
    }

The intention over the next few commits is to eliminate XS_parse_stack
completely: the functionality of the stack will be replaced by a nested
series of Node::cpp_scope nodes within the AST. For now, this commit
still pushes/pops an entry (so that #if / #endif scope analysing
continues to work), but saves all the values in a local var. So after
this commit, Node::include::parse() looks like:

    sub parse() {
        ...
        push @{$pxs->{XS_parse_stack}}, { type =>' file' };
        my @saved = ($pxs->{in_filename}, ...);

        ... open new file and parse ...

        pop @{$pxs->{XS_parse_stack}};
        ($pxs->{in_filename}, ...) = @saved;
    }

The rest of this commit is just general tidying up and reordering of
some of the code within parse() without change in functionality. Except
I also added

    $pxs->{line_no} = [];

in addition to the already present

    $pxs->{line}    = [];
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Make the recently-added Node::global_cpp_line::parse() method
split a C preprocessor line into its components and set some flags
indicating whether its part of a conditional and if so, an if or variant.

Save these values as new fields in the node. They will come in useful
shortly.
The XS parser is supposed to be able to do basic processing of
file-scoped C preprocessor directives: in particular, to handle
alternative XSUB declarations; e.g.:

    #if X
    void foo(short i)
    #elif Y
    void foo(long i)
    #else
    void foo(int i)
    #endif

But due to a bug present since this feature was added in perl5.003
(and which no one has complained about for 29 years), the XS parser
was incorrectly looking for the string '#elsif' rather than '#elif'.

So the above would fail, but something using only 'else' would compile
okay:

    #if X
    void foo(short i)
    #else
    void foo(int i)
    #endif

This commit fixes the bug and adds some tests.
Add more tests for how the XS parser processes conditional C
preprocessor directives in file scope.

Note that some of these tests will start to fail after the next commit,
which re-implements the processing. The fixups to these tests will
indicate what has changed.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

This commit includes a few significant changes to ParseXS which are not
separable into individual commits. Some of these also involve minor
changes in behaviour. Hence a rather long commit message, using
pseudo-POD headers to break things up a bit and give a bit of structure.

=head1 Background

The XS parser does some basic analysis of file-scoped C preprocessor
conditional directives such as #if, #elif, #endif. This is mainly for
the purpose of handling otherwise duplicate XSUBs, e.g.

    #ifdef USE_2ARG

    int foo(int i, int j)

    #else

    int foo(int i)

    #endif

For the above, the parser suppresses any 'duplicate' warning, and adds a
'guard' define at the start of each branch, such as the following in the
emitted C code:

    #ifdef USE_2ARG
    #define XSubPPtmpAAAA 1

    XS_EUPXS(XS_Foo__Bar_foo) { ... }

    #else
    #define XSubPPtmpAAAB 1

    XS_EUPXS(XS_Foo__Bar_foo) { ... }

    #endif

It then uses those guards in the boot code to ensure that only the
correct XSUB(s) (and aliases etc) are registered:

    #if XSubPPtmpAAAA
            newXS_deffile("Foo::Bar::foo", XS_Foo__Bar_foo);
    #endif

    #if XSubPPtmpAAAB
            newXS_deffile("Foo::Bar::foo", XS_Foo__Bar_foo);
    #endif

=head1 The changes in this commit

=head2 Add a Node::cpp_scope node between branches

The XS code above was, before this commit, compiled into an AST
which looked something like:

    XS_file
        preamble
        C_part
        postamble
        cpp_scope[type=main]
            global_cpp_line [if]
            xsub
            global_cpp_line [else]
            xsub
            global_cpp_line [endif]

After this commit, all nodes between each conditional CPP line are
now hung off a new cpp_scope node:

    XS_file
        preamble
        C_part
        postamble
        cpp_scope[type=main]
            global_cpp_line [if]
            cpp_scope[type=if]
                xsub
            global_cpp_line [else]
            cpp_scope[type=if]
                xsub
            global_cpp_line [endif]

=head2 The logic to do the #if/#endif processing is completely
reimplemented from scratch.

Formerly, @{pxs->{XS_parse_stack}} was a stack that was pushed or popped
for every INCLUDE and #if/#endif, and the state about conditionals was
contained in that stack.

After this commit, the state is instead implied by the structure of the
AST; in particular, in the currently active cpp_scope node.
The cpp_scope::parse() method has the fetch_para() loop, and each time
it encounters an #if etc it, it creates a child cpp_scope node and
recursively calls its parse() method. That method returns when it sees
the end of a scope, such as encountering an #elif, #else or #endif or
EOF.

Now, virtually all of the logic involved in maintaining '#if' state is
in cpp_scope::parse(), and all the logic to emit '#define XSubPPtmpAAAB
1' etc code is in cpp_scope::as_code(). This is in contrast to
previously, where the logic and code generation were liberally spread
around in ParseXS.pm and Node.pm

To facilitate this change, two fields have been added to the cpp_scope
node type:

   guard_name - the 'XSubPPtmpAAAB' or whatever to use for this branch;
   seen_xsubs - hash of XSUB names seen in this branch.

After this commit, the data in @{pxs->{XS_parse_stack}}  isn't used;
the next commit will remove it completely.

=head2 Almost all code is now generated after parsing is finished.

At the start of this git branch, code was emitted after each XSUB was
parsed. Now (with a few residual exceptions to be addressed shortly) the
whole file is first parsed into a single AST, and *then* the top-level
as_code() method is called to generate the entire C file from the AST.

Some of the changes in this commit represent the removal of temporary
shims that preserved the old parse/emit ordering, and which are now no
longer required.

=head1 Visible changes in behaviour

Although this commit is mostly intended to be just a refactoring, some
changes in behaviour (in terms of what C code is generated and what
warnings are issued) has occurred.

These are made visible in the changes this commit makes to
t/001-basic.t, which deliberately contains tests for the behaviours
which were about to change.

The changes are:

=head2 fewer warnings

The "Warning: duplicate function definition" warning is now emitted
less often, to avoid false positives. In particular, this fixes GH
19661. That ticket concerned code like

    #if C1
    void foo()
    #endif

    #if C2
    void foo()
    #endif

The logic was such that only one of C1 and C2 could ever be true
(basically DBD:MariaDB was adding a missing DBI function if the
installed DBI was an old one). But the XS parser can't know that.
So it used to warn.

The big change in this commit is to now warn only for two identically
named XSUBs which appear in the *same* cpp_scope. This removes false
positives. It may add false negatives, but that is fairly harmless,
as it just means that bad C code will be generated, and the C compiler
will eventually complain anyway.

The guard definitions are still generated as before.

This change in behaviour was intentional, but was also the easiest: it
meant just checking the seen_xsubs hash for the current cpp_scope node.
For more complex behaviour (if you are considering implementing it), it
may require having the concept of a 'current' cpp_scope, e.g.
$pxs->{cur_cpp_scope}, plus a chain of pointers from the current scope
back to outer ones. (This is slightly problematic, as weak refs aren't
available, as Scalar::Util may not be built yet - so it might require
higher level node(s) to have a destructor which searches for those links
and deletes them first) Once the chain is in place, then it's possible
for a newly-added XSUB to update the parent seen_xsubs hashes too, etc.

=head2 The sequencing of XSubPPtmpAAA, XSubPPtmpAAB etc may change

Previously, the sequence of the guard defines was strictly in the order
they appear in the C file. Now, ones in a nested scope come before outer
ones, e.g.

    #ifdef C1
    #define XSubPPtmpAAB 1
    #  ifdef C2
    #  define XSubPPtmpAAA 1
    ...

This should be harmless.

=head2 XSubPPtmpAAA etc may be placed slightly differently

Previously in something like

    #ifdef C1
    #define X 1
    #define Y 2

the guard would be added after any other CPP lines; the above would
cause this C code to be emitted:

    #ifdef C1
    #define X 1
    #define Y 2
    #define XSubPPtmpAAA 1

but is now:

    #ifdef C1
    #define XSubPPtmpAAA 1
    #define X 1
    #define Y 2

i.e.the guard now always comes *immediately* after the conditional at
the start of the branch it will guard.

This should be harmless.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

The previous commits have changed how the parser tracks file-scoped
  #if / #endif  sections - this is now simply part of the structure
of the AST.

So this commit removes the old parser stack, @{$pxs->{XS_parse_stack}},
which used to be used to track this info. The stack was still present,
but was no longer used to deduce #if state. It also removes the
analyze_preprocessor_statements() method and the test file which used to
test it, t/111-analyze_preprocessor_statements.t.

One minor regression in this commit is that a warning about about
unbalanced #if/#endif within an XSUB or BOOT block used to sometimes
emit an extra hint about what the problem might be:

  print STDERR "    (precede it with a blank line if the matching #if is outside the function)\n"
      if $self->{XS_parse_stack}->[-1]{type} eq 'if';

Since it's no longer trivial to check whether we're currently within a
global #if/endif section, I've kept the 'unbalanced' warning, but
deleted the additional hint.

If anybody wants to add it back, you'll need to add a field to the
EU::PXS class such as

    $pxs->{current_cpp_scope}

which keeps track of which is the current (innermost) Node::cpp_scope
node being parsed. Then check_conditional_preprocessor_statements() can
use it in place of the "$self->{XS_parse_stack}->[-1]{type} eq 'if';"
test.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Move the parsing of the XS part of the main file from process_file()
into cpp_scope::parse().

No functional change.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a stub Node subclass which is responsible for emitting any C code
which comes after the XS file has been parsed (and any user XSUBs
emiited), but before the boot XSUB is emitted.

Currently this code is just concerned with adding overload methods.

The code has just been moved from process_file() to the node's
as_code method() and is otherwise unchanged.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a stub Node subclass which is responsible for emitting the C code
for the boot XSUB.

The code has just been moved from process_file() to the node's
as_code method() and is otherwise unchanged.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

The previous commit moved code from process_file() into it's own
method, with minimal changes. Now clean the code up a bit:

- use $open_brace, $close_brace instead of the magic [[ ]];
- clean up comments, white space etc;
- use Q to make some code strings more readable.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Move some code from process_file() to XS_file::parse()

No change in functionality.
iabyn added 27 commits October 29, 2025 16:13
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

In the boot XSUB, a bunch of newXS() calls are added, to register all
the XSUBs in the perl namespace, plus all aliases, overload methods etc.

In addition, for every package which contains at least one overload
method, a special stub sub called Foo::Bar::() is registered. This is
used at run-time to determine whether overloading is present for a
particular class.

This commit changes the order, by changing an unshift to a push.
Previously the stub package subs were registered first, then the XSUBs;
now its the other way round. e.g.

    /* mark these two packages as having overload methods */
    newXS_deffile("X::Y::()",   XS_X__Y_nil);
    newXS_deffile("P::Q::()",   XS_P__Q_nil);

    /* foo XSUB used as an overload method for <=> etc */
    newXS_deffile("P::Q::foo",  XS_P__Q_foo);
    newXS_deffile("P::Q::(<=>", XS_P__Q_foo);
    ...
    /* bar XSUB used as an overload method for <=> etc */
    newXS_deffile("X::Y::bar",  XS_X__Y_bar);
    newXS_deffile("X::Y::(<=>", XS_X__Y_bar);
    ...

becomes

    /* foo XSUB used as an overload method for <=> etc */
    newXS_deffile("P::Q::foo",  XS_P__Q_foo);
    newXS_deffile("P::Q::(<=>", XS_P__Q_foo);
    ...
    /* bar XSUB used as an overload method for <=> etc */
    newXS_deffile("X::Y::bar",  XS_X__Y_bar);
    newXS_deffile("X::Y::(<=>", XS_X__Y_bar);
    ...

    /* mark these two packages as having overload methods */
    newXS_deffile("P::Q::()",   XS_P__Q_nil);
    newXS_deffile("X::Y::()",   XS_X__Y_nil);

This should make no difference to the functionality, but will make
a further refactoring of boot code generation easier.

AFAIKT, using unshift rather than push in the original 2002 commit
didn't have any particular intent.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Before this commit, as_code() methods were responsible for generating
both the general code that goes in the C file, and also code that
specifically goes in the boot SUB. They achieved this by pushing
lines into the arrays $pxs->{bootcode_early} and $pxs->{bootcode_later}.

This commit adds as_boot_code() methods to the relevant node types and
removes the bootcode_early and bootcode_later arrays.

Now, the C code file is generated by walking the AST, calling as_code()
methods. When it gets to the boot_xsub::as_code() method, which is
responsible for emitting the XS(boot_Foo__Bar) {} boot XSUB, that sub
itself then does a second treewalk, calling as_boot_code() methods, which
accumulates lines to be added early and later on in the boot SUB.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

The ExtUtils::ParseXS::Utilities::standard_XS_defs() function
just returns a big string containing all the standard boilerplate code
that gets added to the C file. Delete this function, and instead
include the text directly within C_part_postamble::as_code().
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

When parsing the C part of the XS file (mainly looking for POD and C
code), don't use $_ and $. to hold the state of the next line. This was
ok when it was a simple loop, but now that the logic is spread over
multiple parse() methods, store the next line to process in
$pxs->{lastline} and $pxs->{lastline_no}.

(The XS half of the file already uses those variables, along with
@{$pxs->{line}} etc.)
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

The parser, when parsing an XSUB (but not any other parts of an XS file),
pushes a special token onto the end of the list of lines in
@{$pxs->{line}} which make up the current paragraph. This token,

    $ExtUtils::ParseXS::END = "!End!\n\n";

(along with a trailing ':') is designed  to look like an impossible
keyword which can't actually appear in the source code (due to the
multiple newlines).

It looks like it was originally added in perl5.002 to make the parsing
code easier, but I don't really understand why. It just makes the
parser harder to understand.

So this commit removes it, and just relies on @{$pxs->{line}} being zero
to detect the end of the paragraph. This change doesn't alter the C code
generated from any of the XS files bundled with perl.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Remove this line:

    $_ = '';

which is a leftover from when $_ used to maintain the current line
between calls to various parsing subs. It's redundant now.
Previously, a line which was not a completely syntactically correct
MODULE line was not treated as a module line; instead it got passed on
uninterpreted until likely causing an error further along in parsing.

This commit changes things so that anything that looks like the *start*
of a module line is treated as a module line, and is *only then*
examined for full syntactic correctness, giving an error if not ok.

For example: previously, this line:

    MODULE = Foo XXXPACKAGE = Foo::Bar

gave the weird error message:

    Error: Function definition too short 'MODULE = Foo XXXPACKAGE ...

but now gives the error message:

    Error: unparseable MODULE line: 'MODULE = Foo XXXPACKAGE ...

In particular, any line starting with

    /^MODULE\s*[=:]/

is now treated as an attempt to declare a module, including the
syntactically incorrect 'MODULE:' form.

This is in the same spirit that other keywords are already processed;
for example

    PROTOTYPES: XXXENABLE

is treated as as a badly-formed PROTOTYPES line rather than an otherwise
unrecognised and unprocessed line.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a new node type,

    ExtUtils::ParseXS::Node::MODULE.

this (mostly) regularises the treatment of a MODULE line, now handled in
the usual way by parse_keywords(), rather than being a special magical
snowflake that got its own handling in fetch_para().

This commit also changes one parameter of parse_keywords() from being a
boolean to being a bit flag, now that there are now *two* slightly
special cases to flag up: MODULE, in addition to NOT_IMPLEMENTED_YET.

This commit is supposed to have no changes in behaviour, but there
*might* be some edge cases that I haven't thought of.
Previously, a line which was not a completely syntactically correct
TYPEMAP: line was not treated as a TYPEMAP line; instead it got passed on
uninterpreted until likely causing an error further along in parsing.

This commit changes things so that anything starting with
/^TYPEMAP\s*:/ is treated as a TYPEMAP line, and is *only then*
examined for full syntactic correctness, giving an error if not ok.

This is similar to two commits ago which did the same for the MODULE
keyword.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Add a new node type,

    ExtUtils::ParseXS::Node::TYPEMAP.

this (mostly) regularises the treatment of a TYPEMAP line, now handled in
the usual way by parse_keywords(), rather than being processed solely
within fetch_para(). fetch_para() does still need to do *some*
processing: it has to read the TYPEMAP line, extract out the <<EOF bit,
and use that to read lines up until the matchhing 'EOF'. It then returns
these lines (including the initial TYPEMAP line) as a paragraph which is
then handled as a normal file-scoped keyword.

With this commit, fetch_para() becomes closer to its official job of
just returning the next useful chunk of lines, rather than having hidden
side effects too.

This commit is supposed to have no changes in behaviour, but there
*might* be some edge cases that I haven't thought of.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Inline the one use of _maybe_skip_pod() into fetch_para().
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

Change where $pxs->{lastline} gets chomped at the start of the XS half
of the file, where the fetch_para() loop is first entered - do it just
the once when transitioning rather than polluting the main fetch_para()
loop with it.

No functional changes.
(This commit is part of a series which will extend the AST parse tree
from just representing individual XSUBs to representing the whole XS
file.)

For consistency with everything else in fetch_para(), chomp any lines
extracted from a TYPEMAP section before stuffing them in
@{$pxs->{line}}. Later, make Node::TYPEMAP::parse() stick a newline
back on each line before passing them to the typemap parser.
If POD continued to the last line of the XS file, a bunch of
"Use of uninitialized value" warnings would be generated.

The fix is trivial, but it also required a slight tweak to the test
framework in t/001-basic.t, which was unconditionally adding a \n to the
end of each test string, thus making the '=cut' appear not to be at EOF.
Remove the loop which says 'foreach contiguous section of pod, delete
it'. Replace it with with something which removes just *one* section of
POD, then relies on the main loop to process the next line, which may
well be the start of another POD section.

Fewer nested loops is less cognitive load.

Should be no functional changes.
change

    if (cond) {
        ... lots lines ...
    }

to

    if (!cond) {
        goto read_next_line;
    }
    ... lots lines ...

  read_next_line:

which might be considered a bit regressive in a 'goto considered
harmful' sort of way, but will will help with further refactoring,
and emphasises that the current input line is discarded if the condition
isn't met (i.e. it strips out code comments, which wasn't obvious).
Reindent a block of code after the previous commit removed a scope.
Whitespace-only.
Within fetch_para(), the at-end-of-POD code and the main loop both had
separate (but similar) chunks of code to read in the next line, with the
latter one being more complete. In particular, the latter had logic to
process '\' continuation lines, while the POD one didn't. This meant
that this gave a parse error:

    =pod
    =cut
    void foo(int i, \
             int j)

The fix is to make them both use the same code to read the next line.
(This is similar to the previous commit, which fixed the same bug for
POD)

Within fetch_para(), the at-end-of-TYPEMAP code and the main loop both
had separate (but similar) chunks of code to read in the next line, with
the latter one being more complete. In particular, the latter had logic
to process '\' continuation lines, while the TYPEMAP one didn't. This
meant that this gave a parse error:

    TYPEMAP: <<EOF
       ...
    EOF
    void foo(int i, \
             int j)

The fix is to make them both use the same code to read the next line.
This bug was probably actually introduced by me during the refactoring
in this branch.

By using the same read-next-line code, it also means that an immediately
following blank line is now stripped of whitespace. I'm not sure whether
that actually affects anything in the real word. Also, any trailing
lines in typemap are now stripped. Again, this shouldn't affect
anything.
Add more code comments to fetch_para(), especially explaining how
the "XSUB cuddled by #if/#endif without blank lines" mechanism works
Simplify the logic of the code which handles C-preprocessor directives
where there's not the (usually required) blank line directly before and
after an XSUB; e.g.

    #if X
    int foo(...)
       ...
    #endif

This commit splits the logic into two halves: the first is responsible
for handling conditionals directly *before* the XSUB, and the other half
for ones during and directly after the XSUB. Then the second half is
further simplified.

I think this makes the code easier to understand.

I *think* the new code is logically equivalent.
Update the code comments above this function.

This is both to reflect the changes made in this branch (e.g.
no longer processing MODULE lines in this function), and to reflect
my better understanding of this function since I wrote the original
text.
Rewrite some code comments explaining why the XS parser sometimes dies
if the next line is indented. It's a bit complicated and the initial
comments I added a year ago weren't entirely accurate.
Simplify the code in Node::cpp_scope::parse(), which contains the
main file-scoped parsing loop, and update / add code comments.

The code in this loop has gone through some rough times with all the
refactoring that has taken place in the last year, chiefly converting
the parser into being AST-based. This commit is kind of a final tidy up
to restore sanity. Apart from fixing lots of comments, the main thing
this commit does is flatten nested while loops. Before this commit, the
overall structure of the parse() function was like;

  PARAGRAPH:
    while (lines to process) {
        while (blank line) { skip }
        while (C-preprocessor line) { process }
        ... and similar loops ...
    }

and is now:

    while (lines to process) {
        if (blank line) { skip; next }
        if (C-preprocessor line) { process; next }

        ... and similar loops ...
    }

i.e. rather than there being nested loops, just have one main loop
and go back to the start of that to do the next thing. This should
be functionally equivalent, but less complicated.

This new paradigm can't be applied yet to the last few items in the main
loop due to a backwards compatibility issue with the SCOPE keyword.
I'll fix that at some future date.
Now that the XS parser builds a parse tree (AST), it spends most of its
time doing recursive-descent $some_node_object->parse() calls.

There is a common idiom within the current code along the lines of

    $some_node_object->parse()
        or return;

This makes a parse failure in a node propagate its way back up the call
chain. Note that many parse failures don't immediately croak; they
call blurt() which only warns, but also increments the error count;
parsing can continue.

During my big refactoring I haven't been entirely consistent here; this
commit adds a bunch of 'or return's that I missed. It also does a few
'or next's at higher levels, to get behaviour along the lines of:
"this XSUB had a syntax error, but continue trying to parse other XSUBs,
and only error out at the end."

This is all a bit going on gut feelings; in the final analysis it
doesn't matter too much how much more parsing gets done before xsubpp
finally croaks out.
Update the POD at the top of Node.pm and the code comments at the top
of ParseXS.pm to reflect the changes in this branch which have extended
the AST from representing just an XSUB to representing the whole XS
file.
Add a method which is a simple wrapper around ExtUtils::ParseXS::Q.
This means throughout Node.pm, you can write

    $self->Q(<<"EOF")

rather than the more long-winded

    ExtUtils::ParseXS::Q(<<"EOF");
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants