
<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

<head>

<title>
Named universal character escapes
</title>

<style type="text/css">

body {
    max-width: 1600px;
}

table#header th,
table#header td
{
    text-align: left;
}

table#references th,
table#references td
{
    vertical-align: top;
}

#hideins:checked ~ * ins, #hideins:checked ~ * ins * { display:none; visibility:hidden }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }

ins, ins *
{
    text-decoration: underline;
    color: #000000;
    background-color:#C8FFC8
}
del, del *
{
    text-decoration: line-through;
    color: #000000;
    background-color:#FFA0A0
}

blockquote
{
    color: #000000;
    background-color: #F1F1F1;
    border: 1px solid #D1D1D1;
    padding-left: 0.5em;
    padding-right: 0.5em;
}
blockquote.stdins
{
    color: #000000;
    background-color: #C8FFC8;
    border: 1px solid #B3EBB3;
    padding: 0.5em;
}
blockquote.stddel
{
    text-decoration: line-through;
    color: #000000;
    background-color: #FFA0A0;
    border: 1px solid #ECD7EC;
    padding-left: 0.5em;
    padding-right: 0.5em;
}

</style>

</head>


<body>

<table id="header">
  <tr>
    <th>Document Number:</th>
    <td>P2071R0</td>
  </tr>
  <tr>
    <th>Date:</th>
    <td>2020-01-13</td>
  </tr>
  <tr>
    <th>Audience:</th>
    <td>SG16, EWG</td>
  </tr>
  <tr>
    <th>Reply-to:</th>
    <td>Tom Honermann &lt;tom@honermann.net&gt;<br/>
        Peter Bindels &lt;peterbindels@gmail.com&gt;</td>
  </tr>
</table>


<h1>
Named universal character escapes
</h1>

<ul>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#motivation">Motivation</a></li>
  <li><a href="#design">Design considerations</a>
    <ul>
      <li><a href="#design_syntax">Syntax</a>
      <li><a href="#design_names">Name sources</a>
      <li><a href="#design_matching">Name matching</a>
      <li><a href="#design_portability">Portable names</a>
      <li><a href="#design_existing_practice">Existing practice</a>
      <li><a href="#design_compat">Backward compatibility</a>
      <li><a href="#design_impact">Implementor impact</a>
      <li><a href="#design_alt">Design alternatives</a>
    </ul>
  </li>
  <li><a href="#proposal">Proposal</a></li>
  <li><a href="#proposal_opts">Proposal options</a></li>
  <li><a href="#future">Possible future extensions</a></li>
  <li><a href="#implementation_exp">Implementation experience</a></li>
  <li><a href="#acknowledgements">Acknowledgements</a></li>
  <li><a href="#references">References</a></li>
  <li><a href="#core_wording">Core wording</a></li>
</ul>


<h1 id="introduction">Introduction</h1>

<p>
This proposal continues the effort R. Martinho Fernandes initiated that
culminated in
<a title="Named character escapes"
   href="https://wg21.link/p1097r2">
P1097R2</a><sup><a title="Named character escapes"
                   href="#ref_p1097r2">[P1097R2]</a></sup>.
This proposal does not deviate from the general design intent in Fernandes'
work, but does deviate in the following specific details:
<ul>
  <li>This proposal uses 
      <a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
      for matching names rather than just case-insensitive matching.  This is
      primarily motivated by implementation concerns; ignoring spaces allows
      for a more efficient implementation.
  </li>
  <li>This proposal includes a feature test macro.</li>
</ul>
</p>

<p>
C++ programmers have been able to portably use characters outside of the basic
source character set in character and string literals since the introduction of
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
in C++11.  For example:
<div style="margin-left: 1em;">
<pre><code class="c++">U'\u0100'        // UTF-32 character literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON}
u8"\u0100\u0300" // UTF-8 string literal with U+0100 {LATIN CAPITAL LETTER A WITH MACRON} U+0300 {COMBINING GRAVE ACCENT}</code></pre>
</div>
</p>

<p>
This proposal enables the above literals to be written using Unicode assigned
names instead of Unicode code point values.
<div style="margin-left: 1em;">
<pre><code class="c++">U'\N{LATIN CAPITAL LETTER A WITH MACRON}'                            // Equivalent to U'\u0100'
u8"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}" // Equivalent to u8"\u0100\u0300"</code></pre>
</div>
</p>

<p>
Prior presentations of P1097 to EWG-I and EWG received strong encouragement:
<ul>
  <li>Poll of
      <a title="P1097R1: Named character escapes"
         href="https://wg21.link/p1097r1">
      P1097R1</a><sup><a title="P1097R1: Named character escapes"
                         href="#ref_p1097r1">[P1097R1]</a></sup>
      in
      <a href="http://wiki.edg.com/bin/view/Wg21sandiego2018/P1097R1">EWG-I in San Diego, 2018</a>:
      <div style="margin-left: 1em;">
        Do we want named escape sequences?
        <table border="1" style="border-collapse: collapse">
          <tr><th>SF</th><th>F</th><th>N</th><th>A</th><th>SA</th></tr>
          <tr><td>5</td><td>9</td><td>7</td><td>0</td><td>0</td></tr>
        </table>
      </div>
  </li>
  <li>Poll of
      <a title="P1097R2: Named character escapes"
         href="https://wg21.link/p1097r2">
      P1097R2</a><sup><a title="P1097R2: Named character escapes"
                         href="#ref_p1097r2">[P1097R2]</a></sup>
      in
      <a href="http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG">EWG in Belfast, 2019</a>:
      <div style="margin-left: 1em;">
        EWG wants to encourage further work in this area 
        <table border="1" style="border-collapse: collapse">
          <tr><th>SF</th><th>F</th><th>N</th><th>A</th><th>SA</th></tr>
          <tr><td>8</td><td>16</td><td>8</td><td>1</td><td>1</td></tr>
        </table>
      </div>
  </li>
</ul>
</p>

<p>
Two areas of concern were raised during
<a href="http://wiki.edg.com/bin/view/Wg21belfast/P1097-EWG">discussion in EWG in Belfast, 2019</a>:
<ul>
  <li><b>Implementation impact</b><br/>
      The Unicode name database (names and aliases), in text form, is ~1.5 MiB
      and a naive implementation could significantly impact the size of compiler
      distributions.  This was of particular concern to organizations that
      distribute compilers as part of a distributed build process.
  </li>
  <li><b>Design concerns</b><br/>
      One EWG member strongly preferred a library based design that would have
      a smaller impact on the core language.  For example, a string
      interpolation based design.
  </li>
</ul>
This paper discusses and links to work completed by Corentin Jabot that
investigates implementation impact, though an implementation has not yet been
completed.  This paper also includes discussion regarding alternative design
possibilities.
</p>


<h1 id="motivation">Motivation</h1>

<p>
The introduction of 
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
in C++11 benefitted programmers by allowing them to portably encode characters
outside of the basic source character set without having to resort to use of
octal or hexadecimal
<a href="http://eel.is/c++draft/lex.ccon#nt:escape-sequence"><em>escape-sequence</em></a>s
to explicitly encode code units.  However, Unicode code points by themselves do
not clearly communicate to readers of the code which character is to be encoded;
hence the code comments included with the code examples in the introduction.
Allowing programmers to directly use Unicode assigned character names avoids the
need for side channel communications, like code comments, that might get out of
sync over time.
</p>

<p>
Use of UTF-8 as the encoding for source files has increased over time, but
impediments to adoption remain.  For example, Microsoft Visual C++ still
defaults to a locale dependent encoding and that encourages limiting source
files to ASCII.  If the C++ community were to migrate en masse to UTF-8,
then one might question whether
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
would become a legacy backward compatibility feature since programmers could
reliably type the intended character in their source code directly.  And if
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
were to become an anachronism, then what use would be served by introducing
a named character escape?
</p>

<p>
Unicode defines a number of characters that, even when they can be typed
directly, can result in confusion.  These include invisible characters
such as U+200B {ZERO WIDTH SPACE}, combining characters such as U+0300
{COMBINING GRAVE ACCENT}, visually indistinct characters such as U+003B
{SEMICOLON} and U+037E {GREEK QUESTION MARK}, and characters with
RTL (right-to-left) directionality.  Consider how the following string
literals containing these characters are rendered.  In cases like these,
use of escape sequences improves clarity; thus motivation for use of Unicode
escape sequences will remain.
<div style="margin-left: 1em;">
<table style="border:1px solid black">
  <tr>
    <td>
      <tt>"​"</tt><br/>
      <tt>"‏"</tt><br/>
      <tt>"̀"</tt><br/>
      <tt>";"</tt><br/>
      <tt>";"</tt><br/>
      <tt>"´"</tt><br/>
      <tt>"́"</tt><br/>
      <tt>"´"</tt><br/>
      <tt>"Ω"</tt><br/>
      <tt>"Ω"</tt><br/>
      <tt>"A"</tt><br/>
      <tt>"Α"</tt><br/>
      <tt>"А"</tt><br/>
      <tt>"Ꭺ"</tt><br/>
      <tt>"ꓮ"</tt><br/>
      <tt>"𐊠" </tt><br/>
      <tt>"𖽀" </tt><br/>
    </td>
    <td>
      <tt>// U+0000200B {ZERO WIDTH SPACE}</tt><br/>
      <tt>// U+0000200F {RIGHT-TO-LEFT MARK}</tt><br/>
      <tt>// U+00000300 {COMBINING GRAVE ACCENT}</tt><br/>
      <tt>// U+0000003B {SEMICOLON}</tt><br/>
      <tt>// U+0000037E {GREEK QUESTION MARK}</tt><br/>
      <tt>// U+000000B4 {ACUTE ACCENT}</tt><br/>
      <tt>// U+00000301 {COMBINING ACUTE ACCENT}</tt><br/>
      <tt>// U+00001FFD {GREEK OXIA}</tt><br/>
      <tt>// U+000003A9 {GREEK CAPITAL LETTER OMEGA}</tt><br/>
      <tt>// U+00002126 {OHM SIGN}</tt><br/>
      <tt>// U+00000041 {LATIN CAPITAL LETTER A}</tt><br/>
      <tt>// U+00000391 {GREEK CAPITAL LETTER ALPHA}</tt><br/>
      <tt>// U+00000410 {CYRILLIC CAPITAL LETTER A}</tt><br/>
      <tt>// U+000013AA {CHEROKEE LETTER GO}</tt><br/>
      <tt>// U+0000A4EE {LISU LETTER A}</tt><br/>
      <tt>// U+000102A0 {CARIAN LETTER A}</tt><br/>
      <tt>// U+00016F40 {MIAO LETTER ZZYA}</tt><br/>
    </td>
  </tr>
</table>
</div>
</p>

<p>
Named character escapes are supported in various forms in other programming
languages.  The following is the result of a brief survey of various languages.
For languages that include such support, more details can be found in the
<a href="#design">Design considerations</a> section.
<div style="margin-left: 1em;">
  <table border="1" style="border-collapse: collapse">
    <tr>
      <th style="text-align:left">Language</th>
      <th style="text-align:left">Named character escape support</th>
    </tr>
    <tr> <td>C#</td>           <td>No</td> </tr>
    <tr> <td>D</td>            <td>Yes; HTML 5 named character references</td> </tr>
    <tr> <td>Go</td>           <td>No</td> </tr>
    <tr> <td>Java</td>         <td>No</td> </tr>
    <tr> <td>Javascript</td>   <td>No</td> </tr>
    <tr> <td>Perl</td>         <td>Yes; Unicode names, aliases, and named sequences</td> </tr>
    <tr> <td>PHP</td>          <td>No</td> </tr>
    <tr> <td>Python</td>       <td>Yes; Unicode names and aliases</td> </tr>
    <tr> <td>Raku</td>         <td>Yes; Unicode names, aliases, named sequences, and emoji sequences</td> </tr>
    <tr> <td>Ruby</td>         <td>No</td> </tr>
    <tr> <td>Rust</td>         <td>No</td> </tr>
    <tr> <td>Swift</td>        <td>No</td> </tr>
    <tr> <td>Visual Basic</td> <td>No</td> </tr>
  </table>
</div>
</p>


<h1 id="design">Design considerations</h1>

<p>
There are numerous choices for how support for named characters can be
integrated into C++.  Useful questions for making design choices include:
<ul>
  <li>Which names will be recognized?  Can multiple names for the same character exist?</li>
  <li>How will names be matched?  Must they be exact?  Case insensitive?</li>
  <li>How will support for new names affect backward compatibility?</li>
  <li>How will the requirement for a name database impact implementations?</li>
  <li>What syntax to use?</li>
  <li>What is existing practice in other languages?</li>
</ul>
This section analyzes the various options considered for this proposal.
</p>


<h2 id="design_syntax">Syntax</h2>

<p>
Named character escapes are proposed as a more readable alternative to
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s.
As such, it is desirable that they be similar in syntax to
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s
and other existing escape sequences.
</p>

<p>
The syntax proposed by Fernandes in
<a title="Named character escapes"
   href="https://wg21.link/p1097r2">
P1097R2</a><sup><a title="Named character escapes"
                   href="#ref_p1097r2">[P1097R2]</a></sup>
is modeled after the syntax adopted for Python and consists of a <tt>\N</tt>
escape introducer followed by a name enclosed in curly brackets.  For example:
<div style="margin-left: 1em;">
<pre><code class="c++">'\N{LATIN CAPITAL LETTER A}'
"\N{LATIN CAPITAL LETTER A WITH MACRON}"
</code></pre>
</div>
</p>
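<p>
To make the shape of the syntax concrete, the following is a minimal sketch of
how a lexer might recognize the escape and extract the enclosed name.  The
names <tt>NamedEscape</tt> and <tt>scan_named_escape</tt> are hypothetical and
are not part of this proposal; validating the extracted name against a name
source is a separate step that is not shown.
</p>

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: the extracted name and the offset just past '}'.
struct NamedEscape {
    std::string_view name;
    std::size_t end;
};

// Hypothetical helper: if a \N{...} escape starts at `pos` in `text`, return
// the enclosed name and the offset following the closing bracket.  Returns
// std::nullopt for anything else, including an unterminated or empty name.
std::optional<NamedEscape> scan_named_escape(std::string_view text,
                                             std::size_t pos) {
    constexpr std::string_view introducer = "\\N{";
    if (pos > text.size() || text.substr(pos, introducer.size()) != introducer)
        return std::nullopt;  // not a named character escape
    const std::size_t begin = pos + introducer.size();
    const std::size_t close = text.find('}', begin);
    if (close == std::string_view::npos || close == begin)
        return std::nullopt;  // missing '}' or empty name
    return NamedEscape{text.substr(begin, close - begin), close + 1};
}
```

<p>
Under these assumptions, scanning the text <tt>\N{LATIN CAPITAL LETTER A}</tt>
at offset 0 yields the name <tt>LATIN CAPITAL LETTER A</tt>, while text that
lacks the closing curly bracket is rejected.
</p>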

<p>
Other choices for the escape introducer are possible; the
<a href="#design_compat">Backward compatibility</a>
section discusses some possible motivation for preferring
<tt>\u</tt> and/or <tt>\U</tt> and the
<a href="#proposal_opts">Proposal options</a>
section includes this alternate syntax as an option.
</p>

<p>
Options for recognized names and how to match them are discussed in
subsequent sections.
</p>

<p>
As proposed, only one name is allowed per named character escape, but that
is an artificial limitation.  Raku allows a sequence of comma separated names
to be specified in a single escape.  This is a natural extension if names
are permitted to identify sequences of characters instead of a single
character.  The following would all be equivalent.  This proposal leaves this
option to a future extension; see the
<a href="#future">Possible future extensions</a> section.
<div style="margin-left: 1em;">
<pre><code class="c++">"\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}"
"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"
"\u0100\u0300"
</code></pre>
</div>
</p>

<p>
Perl and Raku both allow Unicode code point numbers to be specified as character
names.  Doing likewise could enable a syntax that avoids the strict 4 or 8 digit
requirements of
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>s
and that supports the natural <tt>U+NNNN</tt> style frequently used to identify
Unicode characters.  The following could all be equivalent.  This proposal
also leaves this option for a future extension as discussed in the
<a href="#future">Possible future extensions</a> section.
<div style="margin-left: 1em;">
<pre><code class="c++">"\N{U+0100}"
"\N{U+100}"
"\N{U+00000100}"
"\N{0x0100}"
"\N{256}"
"\u0100"
</code></pre>
</div>
</p>


<h2 id="design_names">Name sources</h2>

<p>
A named character escape feature is not particularly useful unless accompanied
by at least one source of character names.  The following list contains sources
of character names that are consulted by at least one implementation of named
character escapes in another programming language.
<ul>
  <li>Unicode assigned names (synchronized with ISO/IEC 10646)<br/>
      <a href="https://www.unicode.org/Public/12.0.0/ucd/NamesList.txt">https://www.unicode.org/Public/12.0.0/ucd/NamesList.txt</a>
  </li>
  <li>Unicode aliases (synchronized with ISO/IEC 10646)<br/>
      <a href="https://www.unicode.org/Public/12.0.0/ucd/NameAliases.txt">https://www.unicode.org/Public/12.0.0/ucd/NameAliases.txt</a>
  </li>
  <li>Unicode named sequences (synchronized with ISO/IEC 10646)<br/>
      <a href="https://www.unicode.org/Public/12.0.0/ucd/NamedSequences.txt">https://www.unicode.org/Public/12.0.0/ucd/NamedSequences.txt</a>
  </li>
  <li>Emoji ZWJ sequences<br/>
      <a href="https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt">https://www.unicode.org/Public/emoji/4.0/emoji-zwj-sequences.txt</a>
  </li>
  <li>Emoji sequences<br/>
      <a href="https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt">https://www.unicode.org/Public/emoji/4.0/emoji-sequences.txt</a>
  </li>
  <li>HTML named character references<br/>
      <a href="https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references">https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references</a>
  </li>
</ul>
</p>

<p>
The first three are defined by the Unicode Consortium, part of the Unicode
standard, and synchronized with ISO/IEC 10646.  The names specified in each
are designed in concert, share a common namespace, are immutable once
published, and Unicode guarantees no conflicts between them.  See the
<a title="Unicode Character Encoding Stability Policies"
   href="https://www.unicode.org/policies/stability_policy.html">
Unicode character encoding stability policy</a><sup><a title="Unicode Character Encoding Stability Policies"
                                                       href="#ref_ucesp">[UCESP]</a></sup>
for more details.  These sources are consulted for named character escapes
in Perl, Python, and Raku.
</p>

<p>
The next two sources specify emoji character sequences.  Though produced
by the Unicode Consortium, they are not part of the Unicode standard, and
are not covered by the
<a title="Unicode Character Encoding Stability Policies"
   href="https://www.unicode.org/policies/stability_policy.html">
Unicode character encoding stability policy</a><sup><a title="Unicode Character Encoding Stability Policies"
                                                       href="#ref_ucesp">[UCESP]</a></sup>.
These two sources don't technically provide names; they provide optional
descriptions.  The provided descriptions use characters, particularly
<tt>:</tt> and <tt>,</tt>, that are disallowed in the names provided
by the first three sources.  These sources are consulted for named character
escapes in Raku.
</p>

<p>
The last source is the specification of names recognized for use as named
character references in HTML documents.  This source is used for the
implementation of named character escapes in the D programming language.
</p>

<p>
The stability guarantees offered by the Unicode standard are a strong motivator
for relying on its names and, as such, this proposal adopts the Unicode assigned
names and name aliases as its name sources.
</p>

<p>
The list of Unicode assigned names associates at most one name with each
character.  There are some characters that are not assigned a name in this
list, for example, U+0080 is simply listed as a <tt>&lt;control&gt;</tt>
character with no name.  In some of these cases, the Unicode aliases list
provides one or more names.  For example, U+0080 has assigned aliases of
<tt>PADDING CHARACTER</tt> (a figment alias) and <tt>PAD</tt>
(an abbreviation alias).
</p>

<p>
Unicode aliases provide another critical service.  As mentioned above,
once assigned, names are immutable.  Corrections are only offered by providing
an alias.  Aliases come in five varieties:
<ul>
  <li><b>correction</b><br/>
      Aliases for cases where an incorrect assigned name was published.
      For example, U+FE18 has an assigned name of
      <tt>PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET</tt>
      and a correction alias of
      <tt>PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET</tt>
      (note the typo correction).
  </li>
  <li><b>control</b><br/>
      Aliases for various control characters.  For example, U+0000 has a
      control alias of <tt>NULL</tt>.
  </li>
  <li><b>alternate</b><br/>
      Aliases for widely used alternate names.  For example,
      <tt>BYTE ORDER MARK</tt> for U+FEFF.
  </li>
  <li><b>figment</b><br/>
      Aliases for names that were documented, but never accepted in a standard.
      For example, <tt>HIGH OCTET PRESET</tt> for U+0081.
  </li>
  <li><b>abbreviation</b><br/>
      Aliases for common abbreviations.  For example,
      <tt>NBSP</tt> for U+00A0.
  </li>
</ul>
</p>
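<p>
As an illustration of how the alias data might be consumed, the following
sketch parses a single data line of <tt>NameAliases.txt</tt>, whose fields are
a hexadecimal code point, the alias, and the alias type, separated by
semicolons.  The names <tt>NameAlias</tt> and <tt>parse_alias_line</tt> are
hypothetical; a real implementation would more likely preprocess the file
when the compiler is built rather than parse it during translation.
</p>

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical record for one alias entry.
struct NameAlias {
    char32_t code_point;
    std::string_view alias;
    std::string_view type;  // correction, control, alternate, figment, or abbreviation
};

// Hypothetical helper: parse one line of the form "code;alias;type".
// Blank lines, comment lines (starting with '#'), and malformed lines
// produce std::nullopt.
std::optional<NameAlias> parse_alias_line(std::string_view line) {
    if (line.empty() || line.front() == '#')
        return std::nullopt;
    const std::size_t first = line.find(';');
    if (first == std::string_view::npos)
        return std::nullopt;
    const std::size_t second = line.find(';', first + 1);
    if (second == std::string_view::npos)
        return std::nullopt;

    // The first field is the code point in hexadecimal, without a prefix.
    unsigned long value = 0;
    const char* const begin = line.data();
    const char* const end = begin + first;
    const auto [ptr, ec] = std::from_chars(begin, end, value, 16);
    if (ec != std::errc{} || ptr != end)
        return std::nullopt;

    return NameAlias{static_cast<char32_t>(value),
                     line.substr(first + 1, second - first - 1),
                     line.substr(second + 1)};
}
```

<p>
For example, the line <tt>FEFF;BYTE ORDER MARK;alternate</tt> would yield the
code point U+FEFF, the alias <tt>BYTE ORDER MARK</tt>, and the type
<tt>alternate</tt>.
</p>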

<p>
It is conceivable that implementors could desire, or be requested, to support
additional implementation-defined names, perhaps including names from the
additional sources listed above.  Since new characters and names will continue
to be added to the Unicode standard, caution is warranted to avoid the
possibility of introducing conflicting names over time.  The description of the
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
name matching algorithm describes a historical case of how such a conflict
once occurred.  Any support for additional names should ensure that they
occupy a non-overlapping namespace with the Unicode assigned names.  Out of
caution, this proposal disallows additional implementation-defined names.
</p>


<h2 id="design_matching">Name matching</h2>

<p>
Names can be finicky things.  Having to remember whether a name is, for example,
<tt>ZERO WIDTH SPACE</tt> or <tt>ZERO-WIDTH SPACE</tt> is likely to frustrate
programmers.  Some programmers might prefer <tt>zero width space</tt>.
</p>

<p>
Unicode provides a straightforward algorithm for matching names with various
allowances including case-insensitivity, omission of some hyphens (<tt>-</tt>),
and substitution of underscore (<tt>_</tt>) for space characters.
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
is included in the Unicode standard via
<a title="Unicode Standard Annex #44 - Unicode Character Database"
   href="https://www.unicode.org/reports/tr44/tr44-24.html">
Unicode Standard Annex #44</a><sup><a title="Unicode Standard Annex #44 - Unicode Character Database"
                                      href="#ref_uax44">[UAX#44]</a></sup>.
</p>

<p>
The
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
matching rule would accept any of the following names as a match for
U+200B {ZERO WIDTH SPACE}:
<div style="margin-left: 1em;">
<pre><code class="c++"><tt>ZERO WIDTH SPACE</tt>
<tt>ZERO-WIDTH SPACE</tt>
<tt>zero-width space</tt>
<tt>ZERO width S P_A_C E</tt></code></pre>
</div>
</p>
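<p>
The matching rule can be approximated by reducing each name to a comparison
key, as in the following sketch.  The function name <tt>loose_key</tt> is
hypothetical, the treatment of whitespace is limited to the space character,
and a hyphen is treated as medial only when it sits directly between two
letters or digits; this is an illustration of the idea under those
assumptions, not a conforming implementation of
<a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>.
</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: reduce a character name to a key for loose comparison.
// Two names are treated as a match when their keys compare equal.
std::string loose_key(std::string_view name) {
    const auto is_alnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    std::string key;
    std::size_t dropped_hyphen_at = std::string::npos;  // key offset of the last dropped hyphen
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == ' ' || c == '_')
            continue;  // ignore spaces and underscores
        if (c == '-' && i > 0 && i + 1 < name.size() &&
            is_alnum(name[i - 1]) && is_alnum(name[i + 1])) {
            dropped_hyphen_at = key.size();
            continue;  // ignore medial hyphens
        }
        // Fold case; non-medial hyphens are kept as-is.
        key.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
    }
    // The one exception: U+1180 HANGUL JUNGSEONG O-E keeps its hyphen so that
    // it remains distinct from U+116C HANGUL JUNGSEONG OE.
    if (key == "hanguljungseongoe" && dropped_hyphen_at == key.size() - 1)
        key.insert(dropped_hyphen_at, 1, '-');
    return key;
}
```

<p>
With this sketch, all four spellings listed above reduce to the key
<tt>zerowidthspace</tt>, while <tt>TIBETAN LETTER -A</tt> and
<tt>TIBETAN LETTER A</tt> remain distinct because the hyphen in the former is
not medial.
</p>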


<h2 id="design_portability">Portable names</h2>

<p>
Portably using named character escapes will require implementations to agree
on a minimum version of the name sources.
</p>

<p>
Thanks to the adoption of
<a title="Update The Reference To The Unicode Standard"
   href="https://wg21.link/p1025r1">
P1025R1</a><sup><a title="Update The Reference To The Unicode Standard"
                   href="#ref_p1025r1">[P1025R1]</a></sup>
in Rapperswil, 2019, the C++ standard has a normative floating reference to
<a title="Information technology — Universal Coded Character Set (UCS)"
   href="https://www.iso.org/standard/69119.html">
ISO/IEC 10646</a><sup><a title="Information technology — Universal Coded Character Set (UCS)"
                         href="#ref_10646">[ISO/IEC10646]</a></sup>,
the ISO/IEC standard that specifies a subset of what is specified in the
Unicode standard and is kept synchronized with it.  ISO/IEC 10646:2017
includes the
Unicode assigned names (in section 33),
name aliases (in section 33), and
named character sequences (in section 27).
</p>

<p>
The floating reference to ISO/IEC 10646 indicates a dependence on the version
that is current at the time of standardization.  Thus, conformance with the
C++ standard will require conformance with the latest available publication
of ISO/IEC 10646.
</p>

<p>Implementors must be allowed, and encouraged, to conform to more recent
versions of ISO/IEC 10646 as they are published.
</p>


<h2 id="design_existing_practice">Existing practice</h2>

<p>
Support for named escape sequences exists in several programming languages.
The following details of existing practice were obtained from these
documentation sources.  The author has not verified the accuracy of this
information.
<table style="border:1px solid black">
  <tr><th>Language</th><th>Documentation link</th></tr>
  <tr>
    <td>D</td>
    <td><a href="https://dlang.org/spec/lex.html#StringLiteral">https://dlang.org/spec/lex.html#StringLiteral</a></td>
  </tr>
  <tr>
    <td>Perl</td>
    <td><a href="https://perldoc.perl.org/charnames.html">https://perldoc.perl.org/charnames.html</a></td>
  </tr>
  <tr>
    <td>Python</td>
    <td><a href="https://docs.python.org/3.8/reference/lexical_analysis.html#literals">https://docs.python.org/3.8/reference/lexical_analysis.html#literals</a></td>
  </tr>
  <tr>
    <td>Raku</td>
    <td><a href="https://docs.raku.org/language/unicode#Entering_unicode_codepoints_and_codepoint_sequences">https://docs.raku.org/language/unicode#Entering_unicode_codepoints_and_codepoint_sequences</a></td>
  </tr>
</table>
</p>

<p>
Capabilities vary across languages:
<table border="1" style="border-collapse: collapse">
  <tr>
    <th>Language</th>
    <th>Name sources</th>
    <th>Comma separated names</th>
    <th>Name matching</th>
    <th>Matches code<br/>point numbers</th>
  </tr>
  <tr>
    <td>D</td>
    <td>HTML 5</td>
    <td>No</td>
    <td>Exact match?</td>
    <td>No</td>
  </tr>
  <tr>
    <td>Perl</td>
    <td>Unicode names<br/>
        Unicode name aliases<br/>
        Unicode named sequences<br/>
        registered custom aliases<br/>
    </td>
    <td>No</td>
    <td>
      Optionally, script qualified short names<br/>
      Optionally, loose matching (case insensitive, ignore underscore, most spaces, and most non-medial hyphens)
    </td>
    <td>Yes</td>
  </tr>
  <tr>
    <td>Python</td>
    <td>Unicode names<br/>
        Unicode name aliases<br/>
    </td>
    <td>No</td>
    <td>Case-insensitive</td>
    <td>No</td>
  </tr>
  <tr>
    <td>Raku</td>
    <td>Unicode names<br/>
        Unicode name aliases<br/>
        Unicode named sequences<br/>
        emoji ZWJ sequences<br/>
        emoji sequences<br/>
    </td>
    <td>Yes</td>
    <td>Exact match?</td>
    <td>Yes</td>
  </tr>
</table>
</p>

<p>
Examples:
<table border="1" style="border-collapse: collapse">
  <tr>
    <th>Language</th>
    <th>Code</th>
  </tr>
  <tr>
    <td>D</td>
    <td>
<pre><code class="D">"\&amp;Amacr;"</code></pre>
    </td>
  </tr>
  <tr>
    <td>Perl</td>
    <td>
<pre><code class="perl">"\N{LATIN CAPITAL LETTER A WITH MACRON}"
"\N{U+0100}"
</code></pre>
    </td>
  </tr>
  <tr>
    <td>Python</td>
    <td>
<pre><code class="python">"\N{LATIN CAPITAL LETTER A WITH MACRON}"</code></pre>
    </td>
  </tr>
  <tr>
    <td>Raku</td>
    <td>
<pre><code class="raku">"\c[LATIN CAPITAL LETTER A WITH MACRON]"
"\c[256]"
"\c[LATIN CAPITAL LETTER A WITH MACRON,COMBINING GRAVE ACCENT]"
"\c[LATIN CAPITAL LETTER A WITH MACRON AND GRAVE]"</code></pre>
    </td>
  </tr>
</table>
</p>


<h2 id="design_compat">Backward compatibility</h2>

<p>
Escape sequences beyond those required in the standard are
conditionally-supported
(<a href="http://eel.is/c++draft/lex.ccon#7.sentence-3">[lex.ccon]p7</a>).
For implementations that currently define a meaning for <tt>\N</tt> in
character or string literals, the use of <tt>\N</tt> in this proposal is
technically a breaking change.
</p>

<p>
Gcc, Clang, and Microsoft Visual C++ all accept <tt>\N</tt> as an escape
sequence with the semantic effect of substituting <tt>N</tt> such that
<tt>"\N{xxx}"</tt> is equivalent to <tt>"N{xxx}"</tt>.  However, they each
emit a warning regarding an unrecognized escape sequence, so reliance on
this behavior is not likely to be common.  Still, there are likely to be
some uses in the wild (probably some percentage of which were intended to
be <tt>\n</tt>).
</p>

<p>
Another option would be to reuse the <tt>\u</tt> and/or <tt>\U</tt> introducer
used for 
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
Gcc and Clang both reject code like <tt>"\u{xxx}"</tt> and <tt>"\U{xxx}"</tt>
as containing ill-formed
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
However, Microsoft Visual C++ accepts such uses without a warning and
treats them as equivalent to <tt>"u{xxx}"</tt> and <tt>"U{xxx}"</tt>
respectively.
</p>

<p>
The implementation divergence that occurs for the <tt>\u</tt> and <tt>\U</tt>
cases above suggests that repurposing them may have less backward
compatibility impact than use of <tt>\N</tt>, since Gcc and Clang already
reject such uses.  Use of <tt>\u</tt> and/or <tt>\U</tt> would potentially
require more wording changes to distinguish named character escapes from
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s,
but would be unlikely to pose a significant additional impact to implementors.
</p>

<p>
For now, this proposal adheres to Fernandes' original design and retains use
of <tt>\N</tt> as the introducer for named character escapes.
</p>


<h2 id="design_impact">Implementor impact</h2>

<p>
The sources of character names listed in the
<a href="#design_names">Name sources</a>
section do not constitute big data by today's standards, but that does not mean
that the volume of data and potential for impact to compiler distributions and
compiler performance is insignificant.  As mentioned earlier, some organizations
have valid technical reasons to be sensitive to the size of the compiler
distributions they use; in a distributed build environment that distributes
compilers, the size of the distribution impacts latency and can therefore
negatively impact build times.
</p>

<p>
The combined size of the Unicode 12.0 text files containing the Unicode assigned
names, aliases, and named character sequences is approximately 1.5 MiB.  A
naive implementation might contribute 2+ MiB of code/data to a compiler.  Some
EWG members indicated that amount of increase is a cause for concern.
</p>
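<p>
For illustration, a naive implementation might store every name in full in a
sorted table and search it directly, along the lines of the following sketch.
The table holds only a small hand-picked excerpt, the names
<tt>NameEntry</tt>, <tt>name_table</tt>, and <tt>lookup_exact</tt> are
hypothetical, and matching is exact for simplicity.  Scaling a table of this
shape to the full set of names and aliases is what leads to the size concern
described above.
</p>

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <optional>
#include <string_view>

// Hypothetical table entry: a full name and the code point it denotes.
struct NameEntry {
    std::string_view name;
    char32_t code_point;
};

// A tiny hand-picked excerpt, sorted by name so that it can be searched
// with a binary search.  A real table would hold every name and alias.
constexpr std::array<NameEntry, 4> name_table{{
    {"COMBINING GRAVE ACCENT", char32_t{0x0300}},
    {"LATIN CAPITAL LETTER A", char32_t{0x0041}},
    {"LATIN CAPITAL LETTER A WITH MACRON", char32_t{0x0100}},
    {"ZERO WIDTH SPACE", char32_t{0x200B}},
}};

// Hypothetical helper: exact-match lookup of a name in the table.
std::optional<char32_t> lookup_exact(std::string_view name) {
    const auto it = std::lower_bound(
        name_table.begin(), name_table.end(), name,
        [](const NameEntry& entry, std::string_view key) {
            return entry.name < key;
        });
    if (it == name_table.end() || it->name != name)
        return std::nullopt;  // no such name in the table
    return it->code_point;
}
```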

<p>
Fortunately, naive implementations are not the only option.  Corentin Jabot
has done some excellent work to demonstrate that an implementation should be
possible that increases the code/data size of a compiler by less than 300 KiB.
See the
<a href="#implementation_exp">Implementation experience</a> section for details.
Corentin's approach is promising, but the additional complexity carries
additional implementation and maintenance cost.
</p>

<p>
Staying up to date with new Unicode releases will also, of course, pose an
additional cost on implementors.
</p>


<h2 id="design_alt">Design alternatives</h2>

<p>
As indicated previously, at least one EWG member in Belfast was strongly
interested in a more general core language feature, presumably a string
interpolation facility, that would allow named character escapes to be
implemented as a library feature.  Such a feature could take many forms,
but might look something like the following where <tt>\{</tt> is an
escape sequence followed by a call to a <tt>constexpr</tt> function named
<tt>nce</tt> with arguments passed in some form.
<div style="margin-left: 1em;">
<pre><code class="c++">"\{nce(LATIN CAPITAL LETTER A WITH GRAVE)}"</code></pre>
</div>
</p>

<p>
Such a feature could certainly be implemented, but would seem to necessarily
be more verbose and would necessitate inclusion of appropriate headers;
headers that would either be quite large in the case of a named character
database or make use of a compiler intrinsic, which would put the complexity
back in the compiler (though in implementation-defined territory rather than
in standard core language).  The verbosity concern could potentially be
reduced by introducing core language sugar for lowering the proposed syntax
to the example string interpolation syntax above.
</p>
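<p>
As a rough illustration of the library-based alternative, the following sketch
shows what such a <tt>constexpr</tt> lookup function might look like.  The
name <tt>nce</tt> is taken from the example above; its signature, the
<tt>name_entry</tt> type, the three-entry table, and the choice to return
U+FFFD for unknown names are assumptions made purely for illustration.
</p>
<div style="margin-left: 1em;">
<pre><code class="c++">#include &lt;string_view&gt;

// Hypothetical table entry; a real name database would be far larger.
struct name_entry {
    std::string_view name;
    char32_t code_point;
};

// Tiny illustrative subset of the ISO/IEC 10646 character names.
inline constexpr name_entry names[] = {
    {"COMBINING GRAVE ACCENT", U'\u0300'},
    {"LATIN CAPITAL LETTER A WITH GRAVE", U'\u00C0'},
    {"LATIN CAPITAL LETTER A WITH MACRON", U'\u0100'},
};

// Hypothetical signature: returns the code point for an exact name match,
// or U+FFFD REPLACEMENT CHARACTER if the name is not in the table.
constexpr char32_t nce(std::string_view name) {
    for (const name_entry&amp; e : names) {
        if (e.name == name) return e.code_point;
    }
    return U'\uFFFD';
}

static_assert(nce("LATIN CAPITAL LETTER A WITH GRAVE") == U'\u00C0');</code></pre>
</div>
<p>
Even in this reduced form, the table has to be visible to every translation
unit that calls the function, which is the source of the header size concern
described above.
</p>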


<h1 id="proposal">Proposal</h1>

<p>
The wording included in this proposal is for the following design:
<ul>
  <li>Context:
    <ul>
      <li>Named character escapes are valid only in character and string
          literals.
      </li>
    </ul>
  </li>
  <li>Syntax:
    <ul>
      <li><tt>\N{xxx}</tt> where the name is substituted for <tt>xxx</tt>.</li>
    </ul>
  </li>
  <li>Name sources:
    <ul>
      <li>ISO/IEC 10646 assigned names.</li>
      <li>ISO/IEC 10646 assigned name aliases.</li>
      <li>No allowance for additional implementation-defined names.</li>
    </ul>
  </li>
  <li>Name matching:
    <ul>
      <li>As specified by rule
          <a href="https://www.unicode.org/reports/tr44/tr44-24.html#UAX44-LM2">UAX44-LM2</a>
          in
          <a title="Unicode Standard Annex #44 - Unicode Character Database"
             href="https://www.unicode.org/reports/tr44/tr44-24.html">
          UAX#44</a><sup><a title="Unicode Standard Annex #44 - Unicode Character Database"
                            href="#ref_uax44">[UAX#44]</a></sup>.
      </li>
    </ul>
  </li>
  <li>Feature test macro:
    <ul>
      <li><tt>__cpp_named_character_escapes</tt></li>
    </ul>
  </li>
</ul>
</p>
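<p>
The following sketch shows how code might use the proposed syntax while
remaining portable to implementations that lack the feature.  It assumes only
the <tt>\N{xxx}</tt> syntax and the feature test macro listed above; the
variable name <tt>a_grave</tt> is illustrative.
</p>
<div style="margin-left: 1em;">
<pre><code class="c++">#include &lt;string_view&gt;

// Fall back to a universal-character-name when the implementation does not
// advertise support for named character escapes.
#if defined(__cpp_named_character_escapes)
inline constexpr std::u32string_view a_grave =
    U"\N{LATIN CAPITAL LETTER A WITH GRAVE}";
#else
inline constexpr std::u32string_view a_grave = U"\u00C0";
#endif

// Either spelling denotes the single code point U+00C0.
static_assert(a_grave.size() == 1);
static_assert(a_grave[0] == U'\u00C0');</code></pre>
</div>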


<h1 id="proposal_opts">Proposal options</h1>

<p>
The following options are <em>not</em> currently proposed, but could be adopted
as modifications of the current proposal.
<ol>
  <li>Instead of <tt>\N</tt>, reuse the <tt>\u</tt> and/or <tt>\U</tt> introducers from
      <a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
      to introduce a named character escape.  For example:
    <ul>
      <li><tt>"\u{LATIN CAPITAL LETTER A WITH GRAVE}"</tt></li>
      <li><tt>"\U{LATIN CAPITAL LETTER A WITH GRAVE}"</tt></li>
    </ul>
    See the
    <a href="#design_compat">Backward compatibility</a>
    section for more discussion of this option.
  </li>
  <li>Allow names to match ISO/IEC 10646 named sequences such that the
      following would be equivalent:
    <ul>
      <li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"</tt></li>
      <li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON}\N{COMBINING GRAVE ACCENT}"</tt></li>
      <li><tt>"\u0100\u0300"</tt></li>
    </ul>
  </li>
</ol>
</p>


<h1 id="future">Possible future extensions</h1>

<p>
The following options are <em>not</em> currently proposed but could be considered
for future extension.
<ol>
  <li>Allow named character escapes to be used outside of character and string
      literals (e.g., in identifiers) analogously to 
      <a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s.
  </li>
  <li>Allow comma separated names.  For example:
    <ul>
      <li><tt>"\N{LATIN CAPITAL LETTER A WITH MACRON, COMBINING GRAVE ACCENT}" // Equivalent to "\u0100\u0300"</tt></li>
    </ul>
  </li>
  <li>Allow code point numbers as names.  For example:
    <ul>
      <li><tt>"\N{U+00C0}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
      <li><tt>"\N{0x00C0}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
      <li><tt>"\N{192}"    // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
    </ul>
  </li>
  <li>Allow names to match Unicode emoji named sequences.</li>
  <li>Allow names to match Unicode emoji ZWJ named sequences.</li>
  <li>Allow names to match HTML 5 named character references by surrounding
      them with <tt>&amp;</tt> and <tt>;</tt>.  For example:
    <ul>
      <li><tt>"\N{&amp;Agrave;}" // U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE}</tt></li>
    </ul>
  </li>
</ol>
</p>


<h1 id="implementation_exp">Implementation experience</h1>

<p>
This proposal has not yet been implemented in an existing compiler.  However,
the implementation concerns raised in Belfast prompted Corentin Jabot to conduct
an experiment to determine how small the implementation overhead, in terms of
data and code within the compiler, could be made.  His
<a title="Storing Unicode: Character Name to Codepoint Mapping"
   href="https://cor3ntin.github.io/posts/cp_to_name">
blog post</a><sup><a title="Storing Unicode: Character Name to Codepoint Mapping"
                     href="#ref_cj_blog">[CJ_BLOG]</a></sup>
on the experiment reported that he was able to implement, in under 300 KiB of
code and data, a function
(<a href="https://github.com/cor3ntin/ext-unicode-db/blob/name_to_cp/name_to_cp.hpp#L215-L260"><tt>cp_from_name</tt></a>)
that accepts a Unicode 12.0 name or name alias and returns a code point value.
His implementation is available in the <tt>name_to_cp</tt>
branch of his <tt>ext-unicode-db</tt> GitHub repository at
<a title="ext-unicode-db"
   href="https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp">
https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp</a><sup><a title="ext-unicode-db"
                                                                      href="#ref_cj_impl">[CJ_IMPL]</a></sup>.
</p>


<h1 id="acknowledgements">Acknowledgements</h1>

<p>
Thank you to R. Martinho Fernandes for taking the initiative to research
and first propose support for named character escapes and for contributing
his considerable expertise in general to SG16.
</p>

<p>
Thank you to Corentin Jabot for the excellent work he did experimenting with
and analyzing implementation impact.  Without his work, the data necessary to
respond to the implementation concerns raised in Belfast would not have been
available at this time, thereby delaying further progress on this proposal.
</p>

<p>
Thank you to Peter Bindels and Corentin Jabot for providing feedback on an
initial draft that I delivered to them less than two hours before the Prague
pre-meeting mailing deadline!
</p>


<h1 id="references">References</h1>

<table id="references">
  <tr>
    <td id="ref_cj_blog"><sup>[CJ_BLOG]</sup></td>
    <td>
      Corentin Jabot,
      "Storing Unicode: Character Name to Codepoint Mapping", 2019.<br/>
      <a href="https://cor3ntin.github.io/posts/cp_to_name">
      https://cor3ntin.github.io/posts/cp_to_name</a></td>
  </tr>
  <tr>
    <td id="ref_cj_impl"><sup>[CJ_IMPL]</sup></td>
    <td>
      Corentin Jabot,
      "ext-unicode-db", 2019.<br/>
      <a href="https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp">
      https://github.com/cor3ntin/ext-unicode-db/tree/name_to_cp</a></td>
  </tr>
  <tr>
    <td id="ref_10646"><sup>[ISO/IEC10646]</sup></td>
    <td>
      "Information technology — Universal Coded Character Set (UCS)", ISO/IEC 10646:2017, 2017.<br/>
      <a href="https://www.iso.org/standard/69119.html">
      https://www.iso.org/standard/69119.html</a></td>
  </tr>
  <tr>
    <td id="ref_n4835"><sup>[N4835]</sup></td>
    <td>
      "Working Draft, Standard for Programming Language C++", N4835, 2019.<br/>
      <a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4835.pdf">
      http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4835.pdf</a></td>
  </tr>
  <tr>
    <td id="ref_p1025r1"><sup>[P1025R1]</sup></td>
    <td>
      Steve Downey, et al.
      "Update The Reference To The Unicode Standard", P1025R1, 2018.<br/>
      <a href="https://wg21.link/p1025r1">https://wg21.link/p1025r1</a>
    </td>
  </tr>
  <tr>
    <td id="ref_p1097r1"><sup>[P1097R1]</sup></td>
    <td>
      R. Martinho Fernandes,
      "Named character escapes", P1097R1, 2018.<br/>
      <a href="https://wg21.link/p1097r1">https://wg21.link/p1097r1</a>
    </td>
  </tr>
  <tr>
    <td id="ref_p1097r2"><sup>[P1097R2]</sup></td>
    <td>
      R. Martinho Fernandes,
      "Named character escapes", P1097R2, 2019.<br/>
      <a href="https://wg21.link/p1097r2">https://wg21.link/p1097r2</a>
    </td>
  </tr>
  <tr>
    <td id="ref_p2029r0"><sup>[P2029R0]</sup></td>
    <td>
      Tom Honermann,
      "Proposed resolution for core issues 411, 1656, and 2333;
      numeric and universal character escapes in character and string literals",
      P2029R0, 2020.<br/>
      <a href="https://wg21.link/p2029r0">https://wg21.link/p2029r0</a>
    </td>
  </tr>
  <tr>
    <td id="ref_ucesp"><sup>[UCESP]</sup></td>
    <td>
      "Unicode Character Encoding Stability Policies", 2017.<br/>
      <a href="https://www.unicode.org/policies/stability_policy.html">https://www.unicode.org/policies/stability_policy.html</a>
    </td>
  </tr>
  <tr>
    <td id="ref_uax44"><sup>[UAX#44]</sup></td>
    <td>
      Ken Whistler and Laurențiu Iancu,
      "Unicode Standard Annex #44 - Unicode Character Database", Revision 24, Unicode 12.0.0, 2019.<br/>
      <a href="https://www.unicode.org/reports/tr44/tr44-24.html">https://www.unicode.org/reports/tr44/tr44-24.html</a>
    </td>
  </tr>
</table>


<h1 id="core_wording">Core wording</h1>

<p>These changes are relative to
<a title="Working Draft, Standard for Programming Language C++"
   href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/n4835.pdf">
N4835</a><sup><a title="Working Draft, Standard for Programming Language C++"
                 href="#ref_n4835">[N4835]</a></sup>.
</p>

<p>
If
<a title="Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals"
   href="https://wg21.link/p2029r0">P2029R0</a><sup><a title="Proposed resolution for core issues 411, 1656, and 2333; numeric and universal character escapes in character and string literals"
                                                       href="#ref_p2029r0">[P2029R0]</a></sup>
were to be adopted, substantial wording updates would be required.
</p>

<input type="checkbox" id="hideins">Hide inserted text</input><br/>
<input type="checkbox" id="hidedel">Hide deleted text</input>

<p>Change in
<a href="http://eel.is/c++draft/lex.phases#1.5">
5.2 [lex.phases] paragraph 5</a>:
<blockquote>
Each basic source character set member in a character literal or a string
literal, as well as each escape sequence<del> and</del><ins>,</ins>
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a><ins>,
and
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a></ins>
in a character literal or a non-raw string literal, is converted to the
corresponding member of the execution character set
(<a href="http://eel.is/c++draft/lex.ccon">[lex.ccon]</a>,
<a href="http://eel.is/c++draft/lex.string">[lex.string]</a>);
if there is no corresponding member, it is converted to an
implementation-defined member other than the null (wide) character.
<sup><a href="http://eel.is/c++draft/lex.phases#footnote-8">8</a></sup>
</blockquote>
</p>

<p>Change in
<a href="http://eel.is/c++draft/lex.ccon">
5.13.3 [lex.ccon]</a>:
<blockquote>
<a href="http://eel.is/c++draft/lex.ccon#nt:character-literal">character-literal:</a>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:encoding-prefix">encoding-prefix</a><sub>opt</sub>
<tt>'</tt>
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char-sequence">c-char-sequence</a>
<tt>'</tt>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:encoding-prefix">encoding-prefix:</a> one of
<div style="margin-left: 1em;">
<tt>u8</tt>&ensp; <tt>u</tt>&ensp; <tt>U</tt>&ensp; <tt>L</tt>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char-sequence">c-char-sequence:</a>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char">c-char</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char-sequence">c-char-sequence</a>
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char">c-char</a>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:c-char">c-char:</a>
<div style="margin-left: 1em;">
any member of the basic source character set except the single-quote <tt>'</tt>,
backslash <tt>\</tt>, or
<a href="http://eel.is/c++draft/cpp.pre#nt:new-line">new-line</a>
character
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:escape-sequence">escape-sequence</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name">universal-character-name</a>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:escape-sequence">escape-sequence:</a>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:simple-escape-sequence">simple-escape-sequence</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:octal-escape-sequence">octal-escape-sequence</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence">hexadecimal-escape-sequence</a>
</div>
<ins>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence">named-escape-sequence</a>
</div>
</ins>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:simple-escape-sequence">simple-escape-sequence:</a> one of
<div style="margin-left: 1em;">
<tt>\'</tt>&ensp; <tt>\"</tt>&ensp; <tt>\?</tt>&ensp; <tt>\\</tt><br/>
<tt>\a</tt>&ensp; <tt>\b</tt>&ensp; <tt>\f</tt>&ensp; <tt>\n</tt>&ensp; <tt>\r</tt>&ensp; <tt>\t</tt>&ensp; <tt>\v</tt>&ensp;
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:octal-escape-sequence">octal-escape-sequence:</a>
<div style="margin-left: 1em;">
<tt>\</tt> <a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
</div>
<div style="margin-left: 1em;">
<tt>\</tt> <a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
<a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
</div>
<div style="margin-left: 1em;">
<tt>\</tt> <a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
<a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
<a href="http://eel.is/c++draft/lex.icon#nt:octal-digit">octal-digit</a>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence">hexadecimal-escape-sequence:</a>
<div style="margin-left: 1em;">
<tt>\x</tt> <a href="http://eel.is/c++draft/lex.icon#nt:hexadecimal-digit">hexadecimal-digit</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:hexadecimal-escape-sequence">hexadecimal-escape-sequence</a>
<a href="http://eel.is/c++draft/lex.icon#nt:hexadecimal-digit">hexadecimal-digit</a>
</div>
<ins>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence">named-escape-sequence:</a>
<div style="margin-left: 1em;">
<tt>\N{</tt> 
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char-sequence">n-char-sequence</a>
<tt>}</tt>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char-sequence">n-char-sequence:</a>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char">n-char</a>
</div>
<div style="margin-left: 1em;">
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char">n-char</a>
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char-sequence">n-char-sequence</a>
</div>
<br/>
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char">n-char:</a> one of
<div style="margin-left: 1em;">
<tt>A</tt> <tt>B</tt> <tt>C</tt> <tt>D</tt> <tt>E</tt> <tt>F</tt> <tt>G</tt>
<tt>H</tt> <tt>I</tt> <tt>J</tt> <tt>K</tt> <tt>L</tt> <tt>M</tt> <tt>N</tt>
<tt>O</tt> <tt>P</tt> <tt>Q</tt> <tt>R</tt> <tt>S</tt> <tt>T</tt> <tt>U</tt>
<tt>V</tt> <tt>W</tt> <tt>X</tt> <tt>Y</tt> <tt>Z</tt>
</div>
<div style="margin-left: 1em;">
<tt>a</tt> <tt>b</tt> <tt>c</tt> <tt>d</tt> <tt>e</tt> <tt>f</tt> <tt>g</tt>
<tt>h</tt> <tt>i</tt> <tt>j</tt> <tt>k</tt> <tt>l</tt> <tt>m</tt> <tt>n</tt>
<tt>o</tt> <tt>p</tt> <tt>q</tt> <tt>r</tt> <tt>s</tt> <tt>t</tt> <tt>u</tt>
<tt>v</tt> <tt>w</tt> <tt>x</tt> <tt>y</tt> <tt>z</tt>
</div>
<div style="margin-left: 1em;">
<tt>0</tt> <tt>1</tt> <tt>2</tt> <tt>3</tt> <tt>4</tt>
<tt>5</tt> <tt>6</tt> <tt>7</tt> <tt>8</tt> <tt>9</tt>
</div>
<div style="margin-left: 1em;">
<tt>_</tt> <tt>-</tt> space
</div>
</ins>
</blockquote>
</p>
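<p>
<em>Illustration (not part of the proposed wording):</em> the grammar above
could be recognized by a lexer along the following lines.  The helper names
<tt>is_n_char</tt> and <tt>parse_named_escape</tt> are hypothetical, and the
character range checks assume an ASCII-compatible execution character set.
</p>
<div style="margin-left: 1em;">
<pre><code class="c++">#include &lt;cstddef&gt;
#include &lt;optional&gt;
#include &lt;string_view&gt;

// True for the characters permitted by the n-char production:
// letters, digits, underscore, hyphen, and space.
constexpr bool is_n_char(char c) {
    return (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') || (c &gt;= 'a' &amp;&amp; c &lt;= 'z') ||
           (c &gt;= '0' &amp;&amp; c &lt;= '9') || c == '_' || c == '-' || c == ' ';
}

// Hypothetical helper: if `text` begins with a named-escape-sequence,
// returns the enclosed n-char-sequence; otherwise returns std::nullopt.
constexpr std::optional&lt;std::string_view&gt;
parse_named_escape(std::string_view text) {
    constexpr std::string_view intro = "\\N{";
    if (text.substr(0, intro.size()) != intro) return std::nullopt;
    std::size_t end = intro.size();
    while (end &lt; text.size() &amp;&amp; is_n_char(text[end])) ++end;
    // The grammar requires at least one n-char and a closing brace.
    if (end == intro.size() || end == text.size() || text[end] != '}')
        return std::nullopt;
    return text.substr(intro.size(), end - intro.size());
}

static_assert(parse_named_escape(R"(\N{COMBINING GRAVE ACCENT})") ==
              std::string_view("COMBINING GRAVE ACCENT"));
static_assert(!parse_named_escape(R"(\N{})"));
static_assert(!parse_named_escape(R"(\N{COMBINING GRAVE ACCENT)"));</code></pre>
</div>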

<p>Change in
<a href="http://eel.is/c++draft/lex.ccon#7">
5.13.3 [lex.ccon] paragraph 7</a>:
<blockquote>
Certain non-graphic characters, the single quote <tt>'</tt>, the double quote
<tt>"</tt>, the question mark
<tt>?</tt>,<sup><a href="http://eel.is/c++draft/lex.ccon#footnote-19">19</a></sup>
and the backslash <tt>\</tt>, can be represented according to Table
<a href="http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc">8</a>.
The double quote <tt>"</tt> and the question mark <tt>?</tt>, can be
represented as themselves or by the escape sequences <tt>\"</tt> and <tt>\?</tt>
respectively, but the single quote <tt>'</tt> and the backslash <tt>\</tt> shall
be represented by the escape sequences <tt>\'</tt> and <tt>\\</tt> respectively.
Escape sequences in which the character following the backslash is not listed in
Table <a href="http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc">8</a>
are conditionally-supported, with implementation-defined semantics.  An escape
sequence specifies a single character.
<br/>
<br/>
<div align="center">
<table border="0">
<tr>
<td align="left">Table <a href="http://eel.is/c++draft/lex.ccon#tab:lex.ccon.esc">8</a>:
Escape sequences</td>
<td/>
<td align="right">[tab:lex.ccon.esc]</td>
</tr>
</table>
<br/>
<table border="1">
<tr><td>new-line</td>       <td>NL(LF)</td><td><tt>\n</tt></td></tr>
<tr><td>horizontal tab</td> <td>HT</td>    <td><tt>\t</tt></td></tr>
<tr><td>vertical tab</td>   <td>VT</td>    <td><tt>\v</tt></td></tr>
<tr><td>backspace</td>      <td>BS</td>    <td><tt>\b</tt></td></tr>
<tr><td>carriage return</td><td>CR</td>    <td><tt>\r</tt></td></tr>
<tr><td>form feed</td>      <td>FF</td>    <td><tt>\f</tt></td></tr>
<tr><td>alert</td>          <td>BEL</td>   <td><tt>\a</tt></td></tr>
<tr><td>backslash</td>      <td>\</td>     <td><tt>\\</tt></td></tr>
<tr><td>question mark</td>  <td>?</td>     <td><tt>\?</tt></td></tr>
<tr><td>single quote</td>   <td>'</td>     <td><tt>\'</tt></td></tr>
<tr><td>double quote</td>   <td>"</td>     <td><tt>\"</tt></td></tr>
<tr><td>octal number</td>   <td>ooo</td>   <td><tt>\ooo</tt></td></tr>
<tr><td>hex number</td>     <td>hhh</td>   <td><tt>\xhhh</tt></td></tr>
<tr><td><ins>named escape sequence</ins></td> <td><ins>named character</ins></td> <td><ins><tt>\N{xxx}</tt></ins></td></tr>
</table>
</div>
</blockquote>
</p>

<p>Add a new paragraph (X) after
<a href="http://eel.is/c++draft/lex.ccon#9">
5.13.3 [lex.ccon] paragraph 9</a>:<br/>
<em>Drafting Note:</em> Associated character names and character
name aliases are listed in section 33 of ISO/IEC 10646:2017.  Named UCS
sequence identifiers are listed in section 27.
<blockquote class="stdins">
A
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a>
is translated to the encoding, in the appropriate execution character set, of
the character or character sequence associated with the ISO/IEC 10646
<tt>associated character name</tt> or
<tt>character name alias</tt>
that matches the name specified by the
<a href="http://eel.is/c++draft/lex.ccon#nt:n-char-sequence"><em>n-char-sequence</em></a>.
Matching of names is performed by:
<table>
  <tr>
    <td>(X.1) &mdash;</td>
    <td>removing all medial hyphens.</td>
  </tr>
  <tr>
    <td>(X.2) &mdash;</td>
    <td>removing all space and underscore characters.</td>
  </tr>
  <tr>
    <td>(X.3) &mdash;</td>
    <td>lowercasing all capital letters.</td>
  </tr>
</table>
If no name is matched, then the program is ill-formed.  If the matched name is
<tt>HANGUL JUNGSEONG OE</tt>, then steps (X.2) and (X.3) are performed against the name
<tt>HANGUL JUNGSEONG O-E</tt> and, if the names match, U+1180
{HANGUL JUNGSEONG O-E} is encoded, otherwise U+116C {HANGUL JUNGSEONG OE} is
encoded.  Otherwise, the character associated with the matched name is encoded.
[ <em>Note:</em> The special handling of U+1180
{HANGUL JUNGSEONG O-E} resolves an ambiguity in the matching algorithm; this
is the only case of ambiguity. &mdash; <em>end note</em> ]
</blockquote>
</p>
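<p>
<em>Illustration (not part of the proposed wording):</em> the matching steps
above, including the special case for U+1180, might be sketched as follows.
The names <tt>match_key</tt>, <tt>lookup</tt>, and <tt>sample_names</tt> are
hypothetical, the table is a four-entry subset chosen for illustration, and
the sketch assumes that a hyphen is medial when it sits directly between two
letters or digits.
</p>
<div style="margin-left: 1em;">
<pre><code class="c++">#include &lt;cstddef&gt;
#include &lt;optional&gt;
#include &lt;string&gt;
#include &lt;string_view&gt;

constexpr bool is_alnum(char c) {
    return (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') || (c &gt;= 'a' &amp;&amp; c &lt;= 'z') ||
           (c &gt;= '0' &amp;&amp; c &lt;= '9');
}

// Applies steps X.2 and X.3, and also X.1 when strip_hyphens is true.
// Assumption of this sketch: a hyphen is medial when it sits directly
// between two letters or digits.
inline std::string match_key(std::string_view name, bool strip_hyphens = true) {
    std::string key;
    for (std::size_t i = 0; i &lt; name.size(); ++i) {
        const char c = name[i];
        const bool medial = c == '-' &amp;&amp; i &gt; 0 &amp;&amp; i + 1 &lt; name.size() &amp;&amp;
                            is_alnum(name[i - 1]) &amp;&amp; is_alnum(name[i + 1]);
        if (strip_hyphens &amp;&amp; medial) continue;  // X.1
        if (c == ' ' || c == '_') continue;     // X.2
        key += (c &gt;= 'A' &amp;&amp; c &lt;= 'Z') ? char(c - 'A' + 'a') : c;  // X.3
    }
    return key;
}

struct name_entry {
    std::string_view name;
    char32_t code_point;
};

// Illustrative subset only; a real table holds every name and alias.
inline constexpr name_entry sample_names[] = {
    {"LATIN CAPITAL LETTER A WITH GRAVE", 0x00C0},
    {"TIBETAN MARK TSA -PHRU", 0x0F39},
    {"HANGUL JUNGSEONG OE", 0x116C},
    {"HANGUL JUNGSEONG O-E", 0x1180},
};

// Returns the matched code point, or std::nullopt where the wording would
// make the program ill-formed.
inline std::optional&lt;char32_t&gt; lookup(std::string_view written) {
    const std::string key = match_key(written);
    // The two Hangul names collide once medial hyphens are removed, so
    // re-compare with hyphens retained to tell them apart.
    if (key == "hanguljungseongoe") {
        return match_key(written, false) == "hanguljungseongo-e"
                   ? char32_t{0x1180}
                   : char32_t{0x116C};
    }
    for (const name_entry&amp; e : sample_names) {
        if (match_key(e.name) == key) return e.code_point;
    }
    return std::nullopt;
}</code></pre>
</div>
<p>
In this sketch the table keys are recomputed on every lookup for brevity; an
implementation would more plausibly store the names in their already reduced
form.
</p>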

<p>Change in
<a href="http://eel.is/c++draft/lex.string#14">
5.13.5 [lex.string] paragraph 14</a>:
<blockquote>
Escape sequences and
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
in non-raw string literals have the same meaning as in
<a href="http://eel.is/c++draft/lex.ccon">character literals</a>
(<a href="http://eel.is/c++draft/lex.ccon">[lex.ccon]</a>), except that
the single quote <tt>'</tt> is representable either by itself or by the escape
sequence <tt>\'</tt>, and the double quote <tt>"</tt> shall be preceded by a
<tt>\</tt>, and except that a
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>
<ins>or
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a>
</ins>in a UTF-16 string literal may yield a surrogate pair.  In a narrow string
literal, a
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>
<ins>or
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a>
</ins>may map to more than one <tt>char</tt> or <tt>char8_t</tt> element due to
<a href="http://eel.is/c++draft/lex.string#def:encoding,multibyte"><em>multibyte encoding</em></a>.
The size of a <tt>char32_t</tt> or wide string
literal is the total number of escape sequences,
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s,
<ins>
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a>s,
</ins>and other characters, plus one for the terminating <tt>U'\0'</tt> or
<tt>L'\0'</tt>.  The size of a UTF-16 string literal is the total number of
escape sequences,
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s,
<ins>
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a>s,
</ins>and other characters, plus one for each character requiring a surrogate pair,
plus one for the terminating <tt>u'\0'</tt>.  [ <em>Note:</em> The size of a
<tt>char16_t</tt> string literal is the number of
code units, not the number of characters.  &mdash; <em>end note</em> ] Within
<tt>char32_t</tt> and <tt>char16_t</tt> string literals, any
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>s
shall be within the range <tt>0x0</tt> to <tt>0x10FFFF</tt>.  The size of a
narrow string literal is the total number of escape sequences and other
characters, plus at least one for the multibyte encoding of each
<a href="http://eel.is/c++draft/lex.charset#nt:universal-character-name"><em>universal-character-name</em></a>
<ins>or
<a href="http://eel.is/c++draft/lex.ccon#nt:named-escape-sequence"><em>named-escape-sequence</em></a></ins>,
plus one for the terminating <tt>'\0'</tt>.
</blockquote>
</p>
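<p>
<em>Illustration (not part of the proposed wording):</em> the size rules above
can be checked today with
<em>universal-character-name</em>s; under this proposal, the corresponding
<em>named-escape-sequence</em>s would be expected to yield the same sizes.
The example uses U+00C0 {LATIN CAPITAL LETTER A WITH GRAVE} and
U+1F600 {GRINNING FACE}.
</p>
<div style="margin-left: 1em;">
<pre><code class="c++">// U+00C0 lies in the Basic Multilingual Plane; U+1F600 lies outside it.

// char32_t: one element per character, plus the terminator.
static_assert(sizeof(U"\u00C0\U0001F600") / sizeof(char32_t) == 3);

// char16_t: U+1F600 requires a surrogate pair, so one extra element.
static_assert(sizeof(u"\u00C0\U0001F600") / sizeof(char16_t) == 4);

// UTF-8: U+00C0 encodes as 2 code units and U+1F600 as 4, plus the terminator.
static_assert(sizeof(u8"\u00C0\U0001F600") == 2 + 4 + 1);</code></pre>
</div>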

<p>Change in table 17 of
<a href="http://eel.is/c++draft/cpp.predefined#1.8">
15.11 [cpp.predefined] paragraph 1.8</a>:<br/>
<em>Drafting note:</em> the final value for the
<tt>__cpp_named_character_escapes</tt> feature test macro will be selected by
the project editor to reflect the date of approval. 
<blockquote>
<div style="margin-left: 1em;">
<table>
  <tr>
    <td align="center">
      <table>
        <tr>
          <td align="left">Table 17 &mdash; Feature-test macros</td>
          <td align="right">[tab:cpp.predefined.ft]</td>
        </tr>
      </table>
    </td>
  </tr>
  <tr>
    <td align="center">
      <table border="1">
        <tr>
          <th align="center">Macro name</th>
          <th align="center">Value</th>
        </tr>
        <tr>
          <td>[&hellip;]</td>
          <td>[&hellip;]</td>
        </tr>
        <tr>
          <td>__cpp_modules</td>
          <td>201907L</td>
        </tr>
        <tr>
          <td><ins>__cpp_named_character_escapes</ins></td>
          <td><ins>XXXXXXL</ins> <strong><em style="background-color: yellow">** placeholder **</em></strong></td>
        </tr>
        <tr>
          <td>__cpp_namespace_attributes</td>
          <td>201411L</td>
        </tr>
        <tr>
          <td>[&hellip;]</td>
          <td>[&hellip;]</td>
        </tr>
      </table>
    </td>
  </tr>
</table>
</div>
</blockquote>
</p>


</body>