-
Notifications
You must be signed in to change notification settings - Fork 259
Design note: Unambiguous parsing
Q: How does Cpp2 parse things that seem ambiguous? For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those? Or what about a<b,c>d
, doesn't that depend on whether a
is a template or a comparable object?
A: First match wins, there are no comma-expressions, and a relational comparison in a template-argument must be parenthesized
Let's take these one at a time. First:
For example, template-argument has the two productions expression and id-expression, and wouldn't an identifier match both of those?
In Cpp2, my intent is that the productions are tried in order. By deterministically taking the first match, we can eliminate any ambiguity when input could match more than one production. In the template-argument example, this means that "if it can be an expression, it is." (Similarly, statement has several productions, and some program text could match more than one production; the parsing rule is "if it can be an , it is.")
Note: This disambiguation is completely unlike today's vexing parse "if it can be a declaration, it is" disambiguation, which is truly awful because it requires directing the parse production selection based on whether a name is a type or not... which means today it requires name lookup (semantic analysis!) to decide the parse, which is totally backward from the proper "lex -> parse -> sema" order with all steps independent. Cpp2 has nothing like that problem, I never require sema to guide parsing.
Here's the parse.h code for template-argument:
if (auto e = expression(false)) { // disallow unparenthesized relational comparisons in template args
term.arg = std::move(e);
}
else if (auto i = id_expression()) {
term.arg = std::move(i);
}
Note that this snippet's comment also answers the second example... next:
Or what about
a<b,c>d
, doesn't that depend on whethera
is a template or a comparable object?
Ah, templates and commas and relational comparisons, oh my! In today's C++ these are a delightful (let's call it that) parse because of the comma operator and the overloading of the <
and >
tokens for template parameter/argument lists and relational comparisons.
In Cpp2 my intent is to make this a non-problem (famous last words!) by addressing this in two chunks:
- Cpp2 has no comma operator, so that removes the possibility of
b,c
being a comma-expression. - Cpp2 requires a relational comparison expression in a template-argument-list to be parenthesized. (*)
The reason I deliberately disallow unparenthesized relational comparisons in template arguments is as a simple way to avoid the many subtle issues that have complicated today's C++ parsers in this area.(**) For example, a regularly cited one is this minor variation of the above example:
a < b > c , d >
In Cpp2 that won't compile as is, because a<b>
parses as a template-id, which can't be followed by an identifier like c
. But it works fine with the required parentheses:
a < ( b > c ) , d > // ok
which parses a<...>
as a template-id with two arguments, b>c
and d
.
I also considered other options.
For a while I tried making template parameter/argument lists use [
and ]
instead of <
and >
. This had some theoretical benefits:
- It avoided the overloaded uses of
<
>
for relational comparison. -
[
]
are balanced in the language (outside comments and literals) whereas<
>
are not because of the tokens' additional use as relational operators, and using balanced tokens enables lazy parsing to skip sections more easily (as D does, and as cppfront itself does to rely on balanced{
}
to skip Cpp1 code sections without having to fully parse them).
However, eventually I decided <
>
was better for two reasons:
- It makes Cpp2 much easier for cppfront and for programmers, by letting both easily distinguish the meaning of
a<b>
vsa[b]
locally without having to go off and do name lookup (in the compiler or by manual scrolling around) to figure out whethera
is the name of a template or of an object that overloadedoperator[]
. - It reduced the delta with today's Cpp1 syntax, which I want to minimize unless there's a compelling reason to be different (and here there isn't).
Also, I know that some languages put a prefix on template/generic argument lists (e.g., !
); by itself this doesn't solve the problems here if we still use <
and >
, and (admittedly a matter of my personal taste) I considered it a bit ugly, so I didn't want to use it except as a last resort if nothing else worked. I think the current solution is better for Cpp2... other solutions may well be better for other languages, so this is not at all a diss of other languages' choices!
Even though in today's C++ grammar many of challenging cases aren't actually ambiguous, they still surely are:
- visually ambiguous to programmers, so I want to improve them under the goal of improving "simplicity"; and
- a known pain in the posterior to C++ parser writers, so I want to improve them under the goal of improving "toolability".
This comes up even if you're not writing a full C++ parser, just an approximate "tag" parser that tries to be much faster than an accurate C++ parser. In the past decade I've seen several efforts where people write new C++ parsers of both kinds, accurate and approximate, and each parser author I've talked to raised this class of example. The problem examples are just common enough in real code that even an approximate parser has to do something sensible with them.
Note these complications don't arise in C#, even though C# also uses <
and >
for generics. That's because C# doesn't have non-type template arguments (or the capability to do type computations that would involve relational comparison) and so a relational expression naturally cannot occur in a generic argument in C#. C# also doesn't have comma-expressions. So there seems to be some prior art experience that simply banning (unparenthesized) relational comparisons inside template argument lists, and banning comma-expressions everywhere, would similarly sidestep the issue in Cpp2.