Commit 912443c

[lex] Better specify whitespace characters
This commit defines a grammar term for _whitespace-character_ and uses it consistently where the plain text term "whitespace character" is used. A whitespace character is defined as one of the five characters that are mentioned in the text closest to providing a definition. The Unicode character name is (mostly) consistently used to name these characters, and for consistency, similar changes were made to name Unicode characters rather than render specified characters in code font throughout [lex]. The one exception is backslash, which is retained as-is to avoid making more issues for P2348. Note that this commit is not a replacement for P2348, merely a clearer statement of the existing specification without any normative changes.
1 parent 324f564 commit 912443c
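
As a reading aid, the five-character set described in the commit message can be sketched as a C++ predicate. The function name `is_whitespace_character` is hypothetical and not part of the commit, and treating new-line as U+000A is an assumption, since the grammar names new-line without giving it a code point.

```cpp
// Hypothetical helper (not part of the commit): mirrors the five alternatives
// of the whitespace-character grammar production added in [lex.comment].
constexpr bool is_whitespace_character(char32_t c) {
    switch (c) {
    case 0x0009:  // character tabulation
    case U'\n':   // new-line (assumed here to be U+000A)
    case 0x000B:  // line tabulation
    case 0x000C:  // form feed
    case 0x0020:  // space
        return true;
    default:
        return false;
    }
}
```

Under these assumptions, a "non-whitespace-character" in the sense of the diff is simply any character for which the predicate returns `false`.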

1 file changed

+30
-23
lines changed

source/lex.tex

@@ -110,9 +110,9 @@
 \indextext{line splicing}%
 If the first translation character is \unicode{feff}{byte order mark},
 it is deleted.
-Each sequence of a backslash character (\textbackslash)
+Each sequence of a backslash character (\unicode{005c}{reverse solidus})
 immediately followed by
-zero or more whitespace characters other than new-line followed by
+zero or more \grammarterm{whitespace-character}s other than new-line followed by
 a new-line character is deleted, splicing
 physical source lines to form \defnx{logical source lines}{source line!logical}. Only the last
 backslash on any physical source line shall be eligible for being part
@@ -127,7 +127,7 @@
 to the file.

 \item The source file is decomposed into preprocessing
-tokens\iref{lex.pptoken} and sequences of whitespace characters
+tokens\iref{lex.pptoken} and sequences of \grammarterm{whitespace-character}s
 (including comments). A source file shall not end in a partial
 preprocessing token or in a partial comment.
 \begin{footnote}
@@ -140,9 +140,9 @@
 would arise from a source file ending with an unclosed \tcode{/*}
 comment.
 \end{footnote}
-Each comment\iref{lex.comment} is replaced by one space character. New-line characters are
-retained. Whether each nonempty sequence of whitespace characters other
-than new-line is retained or replaced by one space character is
+Each comment\iref{lex.comment} is replaced by one \unicode{0020}{space} character. New-line characters are
+retained. Whether each nonempty sequence of \grammarterm{whitespace-character}s other
+than new-line is retained or replaced by one \unicode{0020}{space} character is
 unspecified.
 As characters from the source file are consumed
 to form the next preprocessing token
@@ -178,7 +178,7 @@
 \item
 Adjacent \grammarterm{string-literal} tokens are concatenated\iref{lex.string}.

-\item Whitespace characters separating tokens are no longer
+\item \grammarterm{whitespace-character}s separating tokens are no longer
 significant. Each preprocessing token is converted into a
 token\iref{lex.token}. The resulting tokens
 constitute a \defn{translation unit} and
@@ -469,16 +469,25 @@

 \rSec1[lex.comment]{Comments}

-\pnum
 \indextext{comment|(}%
+\begin{bnf}
+\nontermdef{whitespace-character}\br
+\unicode{0009}{character tabulation}\br
+\textnormal{new-line}\br
+\unicode{000b}{line tabulation}\br
+\unicode{000c}{form feed}\br
+\unicode{0020}{space}\br
+\end{bnf}
+
+\pnum
 \indextext{comment!\tcode{/*} \tcode{*/}}%
 \indextext{comment!\tcode{//}}%
 The characters \tcode{/*} start a comment, which terminates with the
 characters \tcode{*/}. These comments do not nest.
 \indextext{comment!\tcode{//}}%
 The characters \tcode{//} start a comment, which terminates immediately before the
-next new-line character. If there is a form-feed or a vertical-tab
-character in such a comment, only whitespace characters shall appear
+next new-line character. If there is a \unicode{000c}{form feed} or a \unicode{000b}{line tabulation}
+character in such a comment, only \grammarterm{whitespace-character}s shall appear
 between it and the new-line that terminates the comment; no diagnostic
 is required.
 \begin{note}
\begin{note}
@@ -494,6 +503,7 @@

 \indextext{token!preprocessing|(}%
 \begin{bnf}
+
 \nontermdef{preprocessing-token}\br
 header-name\br
 import-keyword\br
@@ -506,7 +516,7 @@
 string-literal\br
 user-defined-string-literal\br
 preprocessing-op-or-punc\br
-\textnormal{each non-whitespace character that cannot be one of the above}
+\textnormal{each non-\grammarterm{whitespace-character} that cannot be one of the above}
 \end{bnf}

 \pnum
@@ -520,7 +530,7 @@
 (\grammarterm{import-keyword}, \grammarterm{module-keyword}, and \grammarterm{export-keyword}),
 identifiers, preprocessing numbers, character literals (including user-defined character
 literals), string literals (including user-defined string literals), preprocessing
-operators and punctuators, and single non-whitespace characters that do not lexically
+operators and punctuators, and single non-\grammarterm{whitespace-character}s that do not lexically
 match the other preprocessing token categories.
 If a \unicode{0027}{apostrophe} or a \unicode{0022}{quotation mark} character
 matches the last category, the program is ill-formed.
@@ -530,12 +540,9 @@
 \indextext{whitespace}%
 whitespace;
 \indextext{comment}%
-this consists of comments\iref{lex.comment}, or whitespace characters
-(\unicode{0020}{space},
-\unicode{0009}{character tabulation},
-new-line,
-\unicode{000b}{line tabulation}, and
-\unicode{000c}{form feed}), or both.
+this consists of comments\iref{lex.comment},
+\grammarterm{whitespace-character}s, or
+both.
 As described in \ref{cpp}, in certain
 circumstances during translation phase 4, whitespace (or the absence
 thereof) serves as more than preprocessing token separation. Whitespace
@@ -673,13 +680,13 @@
 external source file names as specified in~\ref{cpp.include}.

 \pnum
-The appearance of either of the characters \tcode{'} or \tcode{\textbackslash} or of
+The appearance of either of the characters \unicode{0027}{apostrophe} or \unicode{005c}{reverse solidus} or of
 either of the character sequences \tcode{/*} or \tcode{//} in a
 \grammarterm{q-char-sequence} or an \grammarterm{h-char-sequence}
 is conditionally-supported with \impldef{meaning of \tcode{'}, \tcode{\textbackslash},
 \tcode{/*}, or \tcode{//} in a \grammarterm{q-char-sequence} or an
 \grammarterm{h-char-sequence}} semantics, as is the appearance of the character
-\tcode{"} in an \grammarterm{h-char-sequence}.
+\unicode{0022}{quotation mark} in an \grammarterm{h-char-sequence}.
 \begin{footnote}
 Thus, a sequence of characters
 that resembles an escape sequence can result in an error, be interpreted as the
@@ -826,7 +833,7 @@
 \end{footnote}
 operators, and other separators.
 \indextext{whitespace}%
-Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
+\grammarterm{whitespace-character}s and comments
 (collectively, ``whitespace''), as described below, are ignored except
 as they serve to separate tokens.
 \begin{note}
@@ -1790,8 +1797,8 @@
 \begin{bnf}
 \nontermdef{d-char}\br
 \textnormal{any member of the basic character set except:}\br
-\bnfindent\textnormal{\unicode{0020}{space}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis}, \unicode{005c}{reverse solidus},}\br
-\bnfindent\textnormal{\unicode{0009}{character tabulation}, \unicode{000b}{line tabulation}, \unicode{000c}{form feed}, and new-line}
+\bnfindent\textnormal{a \grammarterm{whitespace-character}, \unicode{0028}{left parenthesis}, \unicode{0029}{right parenthesis},}\br
+\bnfindent\textnormal{and \unicode{005c}{reverse solidus}}
 \end{bnf}

 \pnum
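
For illustration only, the line-splicing rule touched by the first hunk (a backslash, then zero or more whitespace-characters other than new-line, then a new-line, all deleted) might be sketched as follows. `splice_lines` and `is_non_newline_whitespace` are hypothetical names, the input is assumed to use `'\n'` as new-line, and this is a sketch of the rule as worded in the diff, not a description of how any particular compiler implements translation phase 2.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: the whitespace-characters other than new-line.
constexpr bool is_non_newline_whitespace(char c) {
    return c == '\t' || c == '\v' || c == '\f' || c == ' ';
}

// Hypothetical sketch of line splicing: deletes every sequence of a backslash,
// optional non-new-line whitespace, and a new-line. All other characters are
// copied through unchanged.
std::string splice_lines(std::string_view src) {
    std::string out;
    out.reserve(src.size());
    std::size_t i = 0;
    while (i < src.size()) {
        if (src[i] == '\\') {
            // Skip any whitespace-characters other than new-line.
            std::size_t j = i + 1;
            while (j < src.size() && is_non_newline_whitespace(src[j])) {
                ++j;
            }
            if (j < src.size() && src[j] == '\n') {
                i = j + 1;  // drop backslash, trailing whitespace, and new-line
                continue;
            }
        }
        out.push_back(src[i]);
        ++i;
    }
    return out;
}
```

Because a backslash is only dropped when nothing but whitespace separates it from the new-line, only the last backslash on a physical line can ever take part in a splice in this sketch.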
