Skip to content

Commit

Permalink
Merge pull request #2128 from guwirth/right-angle-brackets
Browse files Browse the repository at this point in the history
Parsing of nested templates is slow
  • Loading branch information
guwirth authored May 2, 2021
2 parents 8b32415 + 0bceca5 commit fa5575c
Show file tree
Hide file tree
Showing 5 changed files with 329 additions and 99 deletions.
221 changes: 221 additions & 0 deletions cxx-squid/dox/Right_Angle_Brackets_N1757_05-0017.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,221 @@
<html>
<head>
<title>Right Angle Brackets (N1757/05-0017)</title>
</head>
<body bgcolor=white fgcolor=black>

<p align=right>
<table>
<tr><td>Document number:</td><td>N1757</td></tr>
<tr><td></td><td>05-0017</td></tr>
<tr><td>Author:</td><td>Daveed Vandevoorde</td></tr>
<tr><td></td><td>Edison Design Group</td></tr>
<tr><td>Date:</td><td>2005-01-14</td></tr>
</table>

<center><h1>Right Angle Brackets</h1></center>
<center>(Revision 2)</center>

<h2>Introduction</h2>
<p>
Ever since the introduction of angle brackets, C++ programmers have been
surprised by the fact that two consecutive right angle brackets must be
separated by whitespace:
<blockquote><tt><pre>#include &lt;vector&gt;
typedef std::vector&lt;std::vector&lt;int&gt; &gt; Table; // OK
typedef std::vector&lt;std::vector&lt;bool&gt;&gt; Flags; // Error
</pre></tt></blockquote>
The problem is an immediate consequence of the the &ldquo;maximum munch&rdquo; principle and the fact that <tt>&gt;&gt;</tt> is a valid token (right shift) in C++.
<p>
This issue is a minor, but persisting, annoying, and somewhat
embarrassing problem. If the cost is reasonable, it seems therefore
worthwhile to eliminate the surprise.
<p>
The purpose of this document is to explain ways to allow <tt>&gt;&gt;</tt> to be treated as two closing angle brackets, as well as to discuss the resulting issues. A specific option is proposed along with wording that would implement the proposal in the current working paper.

<h2>Constructs with Right Angle Brackets</h2>
<p>
The example above shows the most common context of double right angle brackets: Nested template-ids. However, the &ldquo;new-style&rdquo; cast syntax may also participate in such constructs. For example:
<blockquote><tt><pre>
static_cast&lt;List&lt;B&gt;&gt;(ld)
</pre></tt></blockquote>
This situation currently occurs fairly rarely because the template-ids involved always represent class types, whereas these casts usually involve pointer, pointer-to-member, or reference types.
<p>
However, if template aliases make it into the language (and it seems likely
they will), then template-ids will be able to represent nonclass types.
It seems therefore desirable to address the issue for all constructs with
right angle brackets, not just for templates.
<p>
It is also worth noting that the problem can also occur with the <tt>&gt;&gt;=</tt> and <tt>&gt;=</tt> tokens. For example
<blockquote><tt><pre>
void func(List&lt;B&gt;= default_val1);
void func(List&lt;List&lt;B&gt;&gt;= default_val2);
</pre></tt></blockquote>
Both of these forms are currently ill-formed. It may be desirable to
also address this issue, but this paper does not propose to do so.

<h2>Possible Solutions</h2>
<p>
Solving our problem amounts to decreeing that under some circumstances
a <tt>&gt;&gt;</tt> token is treated as two right angle brackets
instead of a right shift operator. As it turns out, there are several
general approaches to defining those
&ldquo;circumstances.&rdquo;
<p>
<b>Approach 1.</b>
The first approach is the simplest: Decree that if a left angle bracket is
active (i.e. not yet matched by a right angle bracket) the <tt>&gt;&gt;</tt> token
is treated as two right angle brackets instead of a shift operator,
except within parentheses or brackets that are themselves within the angle brackets.
A slight
variation on that theme (call it &ldquo;Approach 1b&rdquo;) is to
require at least two left angle brackets to
be active since otherwise the construct would be an error (because there would be an excess of right angle brackets).
<p>
This strategy is similar to the treatment of the <tt>&gt;</tt> token:
If a left angle bracket is active, the token is treated as a right angle
bracket, except within parentheses. For example:
<blockquote><tt><pre>
A&lt;(X&gt;Y)&gt; a; // The first &gt; token appears within parentheses and
// therefore is not a right angle bracket. The second one
// <i>is</i> a right angle bracket because a left angle bracket
// is active and no parentheses are more recently active.
</pre></tt></blockquote>
<p>
Unfortunately, some programs may be broken by this approach.
Consider the following example:
<blockquote><tt><pre>#include &lt;iostream&gt;
template&lt;int I&gt; struct X {
static int const c = 2;
};
template&lt;&gt; struct X&lt;0&gt; {
typedef int c;
};
template&lt;typename T&gt; struct Y {
static int const c = 3;
};
static int const c = 4;
int main() {
std::cout &lt;&lt; (Y&lt;X&lt;1&gt; &gt;::c &gt;::c&gt;::c) &lt;&lt; '\n';
std::cout &lt;&lt; (Y&lt;X&lt; 1&gt;&gt;::c &gt;::c&gt;::c) &lt;&lt; '\n';
}
</pre></tt></blockquote>
This program is valid today; it produces the following output:
<blockquote><tt><pre>0
3
</pre></tt></blockquote>
With the right angle bracket rule proposed above, the <tt>&gt;&gt;</tt> token
in the second statement would change its meaning (from right shift to double right
angle bracket) and the output would therefore
become:
<blockquote><tt><pre>0
0
</pre></tt></blockquote>
<p>
<b>Approach 2.</b>
To avoid the backward incompatibility, an alternative solution it to modify
the rule proposed above to only treat the <tt>&gt;&gt;</tt> token as two right
angle brackets when parsing template type arguments or template template
arguments, but not when parsing template nontype arguments. This approach would make <tt>A&lt;B&lt;int&gt;&gt;</tt> valid, but would leave <tt>C&lt;D&lt;12&gt;&gt;</tt> ill-formed.
<p>
Another way to view this alternative approach is that a template argument
is always parsed as far as possible (which may include right shift operators).
When an argument is parsed, the next token must be a comma, a <tt>&gt;</tt>
treated as a single closing angle bracket, or (with this proposal) a
<tt>&gt;&gt;</tt> token treated as a double angle bracket.
<p>
<b>Approach 3.</b>
Finally, a third way to tackle the problem is to eliminate the right shift
token altogether and to modify the grammar so that two consecutive
<tt>&gt;</tt> tokens are treated as a right shift operation in the appropriate circumstances. This would for example allow the following form:
<blockquote><tt><pre>
int i = 10000 > > x;
</pre></tt></blockquote>
If limited to the right shift token, this approach introduces no known
new ambiguities, but it does introduce at least one backward compatibility
issue: The <tt>##</tt> preprocessing token can no longer be applied to two
<tt>&gt;</tt> tokens. However, it would be surprising to eliminate the
right shift token and not the left shift token. Eliminating the left
shift token does introduce new parsing ambiguities
(e.g., <tt>&X::operator<tt>&lt;</tt> <tt>&lt;</tt>Y<tt>&gt;</tt></tt>).
The shift-assign operators (<tt>&lt;&lt;=</tt> and <tt>&gt;&gt;=</tt>)
lead to similar considerations. It may also come as a surprise that
shift operations are realized through a two-token construct, whereas
other operations (e.g., prefix and postfix <tt>--</tt>, or <tt>&&</tt>)
use a single two-character token.

<h2>Implementation Experience</h2>
<p>
<b>Approach 1.</b>
As mentioned, the first proposal is analogous to the existing language
rule for the <tt>&gt;</tt> token. We therefore do not expect implementation difficulty for the approach.
<p>
<b>Approach 2.</b>
The GNU and EDG C++ compilers currently implement the second proposed
alternative for error recovery purposes. It would be trivial to promote
the error recovery procedure to a correct parse procedure. (Other compilers
appear to have a facility for the same purpose, but I do not know their exact
strategy.)
<p>
<b>Approach 3.</b>
I'm unaware of implementation experience with eliminating shift tokens
and replacing them with grammar that allows two-token shift expressions.

<h2>Recommendation</h2>
<p>
I suggest we pursue &ldquo;Approach 1&rdquo; (which breaks some valid programs).
Specifically, I propose that if even a single left angle bracket is active,
a <tt>&gt;&gt;</tt> token not enclosed in parentheses is treated as two
right angle brackets and not as a right shift operator. I do <i>not</i>
recommend the variation described as &ldquo;Approach 1b.&rdquo;
<p>
My arguments for doing so are the following:
<ul>
<li>It leaves no remaining cases that require whitespace between
two right angle brackets, which makes teaching easier.</li>
<li>It treats the <tt>&gt;&gt;</tt> token in the same way as the <tt>&gt;</tt>
token, making both specification and teaching simpler.</li>
<li>Programs that would change meaning are probably as contrived as the
example shown above, and therefore unlikely to be found in nature. Programs that would become ill-formed (i.e., containing a nonparenthesized right-shift operator in a trailing nontype template argument) are probably slightly more common but still rare.</li>
</ul>
<p>
(While the approach of eliminating the shift tokens (approach 3) was presented for the
sake of completeness, I find that it has enough small technical and
aesthetic problems to make the other approaches far preferable.)

<h2>Wording changes</h2>
<p>
Insert after the last normative sentence of 14.2/3, but before the example:
<blockquote>Similarly, the first non-nested <tt>&gt;&gt;</tt> is treated as two consecutive but distinct <tt>&gt;</tt> tokens, the first of which is taken as the end of the <i>template-argument-list</i> and completes the <i>template-id</i>. [ <i>Note:</i> The second <tt>&gt;</tt> token produced by this replacement rule may terminate an enclosing <i>template-id</i> construct or it may be part of a different construct (e.g., a cast). <i>--end note</i> ]</blockquote>
<p>
Replace the example of 14.2/3 by the following:
<blockquote>[ <i>Example:</i><pre>
template&lt;int i&gt; class X { /* ... */ };
X&lt; 1&gt;2 &gt; x1; // Syntax error.
X&lt;(1&gt;2)&gt; x2; // Okay.

template&lt;class T&gt; class Y { /* ... */ };
Y&lt;X&lt;1&gt;&gt; x3; // Okay, same as "Y&lt;X&lt;1&gt; &gt; x3;".
Y&lt;X&lt;6&gt;&gt;1&gt;&gt; x4; // Syntax error. Instead, write "Y&lt;X&lt;(6&gt;&gt;1)&gt;&gt; x4;".
</pre>
</blockquote>
<p>
Insert just before the first "<i>Note:</i>" of translation phase "7." in 2.1/1:
<blockquote>[ <i>Note:</i> The process of analyzing and translating the tokens may occasionally result in one token being replaced by a sequence of other tokens (14.2 temp.names). <i>--end note</i> ]</blockquote>
<p>
Insert a new paragraph 5.2/2 that reads:
<blockquote>[ <i>Note:</i> The <tt>&gt;</tt> token following the <i>type-id</i> in a <tt>dynamic_cast</tt>, <tt>static_cast</tt>, <tt>reinterpret_cast</tt>, or <tt>const_cast</tt>, may be the product of replacing a <tt>&gt;&gt;</tt> token by two consecutive <tt>&gt;</tt> tokens (14.2 temp.names). <i>--end note</i> ]</blockquote>
<p>
Insert in 14/1 just after the grammar rules:
<blockquote>[ <i>Note:</i> The <tt>&gt;</tt> token following the <i>template-parameter-list</i> of a <i>template-declaration</i> may be the product of replacing a <tt>&gt;&gt;</tt> token by two consecutive <tt>&gt;</tt> tokens (14.2 temp.names). <i>--end note</i> ]</blockquote>
<p>
Append to 14.1/1 (following the grammar rules):
<blockquote>[ <i>Note:</i> The <tt>&gt;</tt> token following the <i>template-parameter-list</i> of a <i>type-parameter</i> may be the product of replacing a <tt>&gt;&gt;</tt> token by two consecutive <tt>&gt;</tt> tokens (14.2 temp.names). <i>--end note</i> ]</blockquote>

<h2>See Also...</h2>
<p>
Reflector messages: c++std-ext-6767,6771,6773,6775,6779,6786,6788,6789,6792,6793,6794,6799,6801,6809.
<p>
Previous revision: N1649/04-0089, N1699/0139.
</body>
</html>
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
/*
* C++ Community Plugin (cxx plugin)
* Copyright (C) 2010-2021 SonarOpenCommunity
* http://github.com/SonarOpenCommunity/sonar-cxx
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public
* License as published by the Free Software Foundation; either
* version 3 of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/
package org.sonar.cxx.channels;

import com.sonar.sslr.api.Token;
import com.sonar.sslr.impl.Lexer;
import org.sonar.cxx.parser.CxxPunctuator;
import org.sonar.sslr.channel.Channel;
import org.sonar.sslr.channel.CodeReader;

/**
* Solving the problem amounts to decreeing that under some circumstances a >> token is treated as
* two right angle brackets instead of a right shift operator.
*
* According to Document number: N1757 05-0017, Right Angle Brackets (Revision 2)
*
* Decree that if a left angle bracket is active (i.e. not yet matched by a right angle bracket)
* the >> token is treated as two right angle brackets instead of a shift operator, except within
* - parentheses or
* - brackets that are themselves within the angle brackets.
*
* A<(X>Y)> a; // The first > token appears within parentheses and
* // therefore is not a right angle bracket. The second one
* // is a right angle bracket because a left angle bracket
* // is active and no parentheses are more recently active.
*/
public class RightAngleBracketsChannel extends Channel<Lexer> {

private int angleBracketLevel = 0;
private int parentheseLevel = 0;

@Override
public boolean consume(CodeReader code, Lexer output) {
char ch = (char) code.peek();
boolean consumed = false;

if (ch == '<') {
if (parentheseLevel == 0) {
char next = code.charAt(1);
if ((next != '<') && (next != '=')) { // not <<, <=, <<=, <=>,
angleBracketLevel++;
}
}
} else if (angleBracketLevel > 0) {
switch (ch) {
case ';': // end of expression => reset
angleBracketLevel = 0;
parentheseLevel = 0;
break;
case '>':
if (parentheseLevel == 0) {
output.addToken(Token.builder()
.setLine(code.getLinePosition())
.setColumn(code.getColumnPosition())
.setURI(output.getURI())
.setValueAndOriginalValue(">")
.setType(CxxPunctuator.GT)
.build());
code.pop();
consumed = true;
}
angleBracketLevel = Math.max(0, angleBracketLevel - 1);
if (angleBracketLevel == 0) {
parentheseLevel = 0;
}
break;
case '(':
parentheseLevel++;
break;
case ')':
parentheseLevel = Math.max(0, parentheseLevel - 1);
break;
}
}

return consumed;
}

}
Loading

0 comments on commit fa5575c

Please sign in to comment.