<!DOCTYPE html>
<meta charset="utf-8">
<pre class="metadata">
title: Unicode property escapes in regular expressions
status: proposal
stage: 3
location: https://tc39.github.io/proposal-regexp-unicode-property-escapes/
copyright: false
contributors: Mathias Bynens
</pre>
<script src="ecmarkup.js" defer></script>
<link rel="stylesheet" href="ecmarkup.css">
<style>
ins.block {
font-weight: bold;
font-style: italic;
}
hr {
height: 0.25em;
background: #ccc;
border: 0;
margin: 2em 0;
}
</style>
<p><ins class="block">The syntax listed in <a href="https://tc39.github.io/ecma262/#sec-patterns">21.2.1 Patterns</a> is modified as follows.</ins></p>
<emu-grammar>
ControlLetter :: one of
  `a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m` `n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`
  `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`

CharacterClassEscape[U] ::
  `d`
  `D`
  `s`
  `S`
  `w`
  `W`
  <ins>[+U] `p{` UnicodePropertyValueExpression `}`
  [+U] `P{` UnicodePropertyValueExpression `}`</ins>

<ins>UnicodePropertyValueExpression ::
  UnicodePropertyName `=` UnicodePropertyValue
  LoneUnicodePropertyNameOrValue</ins>

<ins>UnicodePropertyNameCharacter ::
  ControlLetter
  `_`</ins>

<ins>UnicodePropertyNameCharacters ::
  UnicodePropertyNameCharacter UnicodePropertyNameCharacters?</ins>

<ins>UnicodePropertyName ::
  UnicodePropertyNameCharacters</ins>

<ins>UnicodePropertyValueCharacter ::
  UnicodePropertyNameCharacter
  `0`
  `1`
  `2`
  `3`
  `4`
  `5`
  `6`
  `7`
  `8`
  `9`</ins>

<ins>UnicodePropertyValueCharacters ::
  UnicodePropertyValueCharacter UnicodePropertyValueCharacters?</ins>

<ins>UnicodePropertyValue ::
  UnicodePropertyValueCharacters</ins>

<ins>LoneUnicodePropertyNameOrValue ::
  UnicodePropertyValueCharacters</ins>
</emu-grammar>
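<emu-note>
<p>As an informal illustration of the grammar above, the following sketch shows the two forms of UnicodePropertyValueExpression and the negated escape. It assumes an engine that implements this proposal. The last part additionally assumes the Annex B web-compatibility grammar, under which `\p` without the `u` flag is an identity escape. The variable names are illustrative only.</p>

```javascript
// With the `u` flag, `\p{...}` and `\P{...}` parse as CharacterClassEscape.
const greekLetter = /\p{Script=Greek}/u;   // UnicodePropertyName `=` UnicodePropertyValue
const uppercaseLetter = /\p{Lu}/u;         // LoneUnicodePropertyNameOrValue (general category)
const nonWhiteSpace = /\P{White_Space}/u;  // negated escape with a binary property

const results = {
  greek: greekLetter.test('α'),
  upper: uppercaseLetter.test('É'),
  nonSpace: nonWhiteSpace.test(' '),
};

// Assumption: Annex B semantics. Without `u`, the [+U] alternatives are not
// available, so `\p` matches a literal "p" followed by the text "{Lu}".
const withoutUnicodeFlag = /\p{Lu}/;
const legacy = {
  matchesLiteralText: withoutUnicodeFlag.test('p{Lu}'),
  matchesUppercase: withoutUnicodeFlag.test('É'),
};
```
</emu-note>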
<hr>
<p><ins class="block">The following items are appended to <a href="https://tc39.github.io/ecma262/#sec-patterns-static-semantics-early-errors">21.2.1.1 Static Semantics: Early Errors</a>.</ins></p>
<emu-grammar>UnicodePropertyName :: UnicodePropertyNameCharacters</emu-grammar>
<ul>
<li>
It is a Syntax Error if <emu-nt>UnicodePropertyName</emu-nt> does not strictly match a known Unicode property name or property alias.
</li>
</ul>
<emu-grammar>UnicodePropertyValue :: UnicodePropertyValueCharacters</emu-grammar>
<ul>
<li>
It is a Syntax Error if <emu-nt>UnicodePropertyValue</emu-nt> does not strictly match a known Unicode property value or property value alias.
</li>
</ul>
<emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar>
<ul>
<li>
It is a Syntax Error if <emu-nt>UnicodePropertyName</emu-nt> strictly matches a known Unicode binary property or binary property alias.
</li>
<li>
It is a Syntax Error if <emu-nt>UnicodePropertyValue</emu-nt> is not a known value or value alias for the Unicode property or property alias <emu-nt>UnicodePropertyName</emu-nt>.
</li>
</ul>
<emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar>
<ul>
<li>
It is a Syntax Error if <emu-nt>LoneUnicodePropertyNameOrValue</emu-nt> does not strictly match a known Unicode general category, general category alias, binary property, or binary property alias.
</li>
</ul>
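<emu-note>
<p>The early errors above can be observed from script through the `RegExp` constructor, which throws a `SyntaxError` for a pattern that violates them. The following is a minimal sketch, assuming an engine that implements this proposal. `compilesWithUnicodeFlag` is a hypothetical helper, not part of the proposal.</p>

```javascript
// Hypothetical helper: reports whether `source` is accepted as a pattern
// when compiled with the `u` flag.
function compilesWithUnicodeFlag(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const earlyErrorChecks = {
  // Known non-binary property with one of its values: accepted.
  knownNameAndValue: compilesWithUnicodeFlag('\\p{Script=Greek}'),
  // Unknown property name or unknown property value: Syntax Error.
  unknownName: compilesWithUnicodeFlag('\\p{Skript=Greek}'),
  unknownValue: compilesWithUnicodeFlag('\\p{Script=Greeek}'),
  // A binary property must not be given a value: Syntax Error.
  binaryPropertyWithValue: compilesWithUnicodeFlag('\\p{Alphabetic=Yes}'),
  // The value must belong to the named property: Syntax Error.
  valueOfAnotherProperty: compilesWithUnicodeFlag('\\p{General_Category=Greek}'),
  // A lone name must be a general category value or a binary property.
  loneGeneralCategory: compilesWithUnicodeFlag('\\p{Lu}'),
  loneBinaryProperty: compilesWithUnicodeFlag('\\p{Alphabetic}'),
  loneNonBinaryName: compilesWithUnicodeFlag('\\p{Script}'),
  loneScriptValue: compilesWithUnicodeFlag('\\p{Greek}'),
};
```
</emu-note>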
<hr>
<p><ins class="block">The following two abstract operations are appended to <a href="https://tc39.github.io/ecma262/#sec-atom">21.2.2.8 Atom</a>.</ins></p>
<emu-clause id="sec-static-semantics-unicodematchproperty-p" aoid="UnicodeMatchProperty">
<h1>Static Semantics: UnicodeMatchProperty ( _p_ )</h1>
<p>The abstract operation UnicodeMatchProperty takes a string parameter _p_ and performs the following steps:</p>
<emu-alg>
1. Assert: _p_ strictly matches a known Unicode property name or property alias.
1. Let _property_ be the canonical, unaliased property name corresponding to _p_.
1. Return _property_.
</emu-alg>
<p>Implementations must support the following non-binary Unicode properties and their property aliases:</p>
<ul>
<li><a href="http://unicode.org/reports/tr18/#General_Category_Property">`General_Category`</a></li>
<li><a href="http://unicode.org/reports/tr24/#Script">`Script`</a></li>
<li><a href="http://unicode.org/reports/tr24/#Script_Extensions">`Script_Extensions`</a></li>
</ul>
<p>Additionally, implementations must support the following binary Unicode properties and their property aliases:</p>
<ul>
<li><a href="http://unicode.org/reports/tr44/#Alphabetic">`Alphabetic`</a></li>
<li><a href="http://unicode.org/reports/tr18/#General_Category_Property">`Any`</a></li>
<li><a href="http://unicode.org/reports/tr18/#General_Category_Property">`ASCII`</a></li>
<li><a href="http://unicode.org/reports/tr44/#ASCII_Hex_Digit">`ASCII_Hex_Digit`</a></li>
<li><a href="http://unicode.org/reports/tr18/#General_Category_Property">`Assigned`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Bidi_Control">`Bidi_Control`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Bidi_Mirrored">`Bidi_Mirrored`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Case_Ignorable">`Case_Ignorable`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Cased">`Cased`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWCF">`Changes_When_Casefolded`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWCM">`Changes_When_Casemapped`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWL">`Changes_When_Lowercased`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWKCF">`Changes_When_NFKC_Casefolded`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWT">`Changes_When_Titlecased`</a></li>
<li><a href="http://unicode.org/reports/tr44/#CWU">`Changes_When_Uppercased`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Dash">`Dash`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Default_Ignorable_Code_Point">`Default_Ignorable_Code_Point`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Deprecated">`Deprecated`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Diacritic">`Diacritic`</a></li>
<li><a href="http://unicode.org/reports/tr51/#Emoji_Properties">`Emoji`</a></li>
<li><a href="http://unicode.org/reports/tr51/#Emoji_Properties">`Emoji_Component`</a></li>
<li><a href="http://unicode.org/reports/tr51/#Emoji_Properties">`Emoji_Modifier_Base`</a></li>
<li><a href="http://unicode.org/reports/tr51/#Emoji_Properties">`Emoji_Modifier`</a></li>
<li><a href="http://unicode.org/reports/tr51/#Emoji_Properties">`Emoji_Presentation`</a></li>
<li><!-- Note: by the time this makes it into ECMA-262, the draft defining `Extended_Pictographic` will likely have been published already. Update the following link accordingly when that happens. --><a href="http://unicode.org/reports/tr51/proposed.html#Emoji_Properties">`Extended_Pictographic`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Extender">`Extender`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Grapheme_Base">`Grapheme_Base`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Grapheme_Extend">`Grapheme_Extend`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Hex_Digit">`Hex_Digit`</a></li>
<li><a href="http://unicode.org/reports/tr44/#ID_Continue">`ID_Continue`</a></li>
<li><a href="http://unicode.org/reports/tr44/#ID_Start">`ID_Start`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Ideographic">`Ideographic`</a></li>
<li><a href="http://unicode.org/reports/tr44/#IDS_Binary_Operator">`IDS_Binary_Operator`</a></li>
<li><a href="http://unicode.org/reports/tr44/#IDS_Trinary_Operator">`IDS_Trinary_Operator`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Join_Control">`Join_Control`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Logical_Order_Exception">`Logical_Order_Exception`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Lowercase">`Lowercase`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Math">`Math`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Noncharacter_Code_Point">`Noncharacter_Code_Point`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Pattern_Syntax">`Pattern_Syntax`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Pattern_White_Space">`Pattern_White_Space`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Quotation_Mark">`Quotation_Mark`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Radical">`Radical`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Regional_Indicator">`Regional_Indicator`</a></li>
<li><a href="http://unicode.org/reports/tr44/#STerm">`Sentence_Terminal`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Soft_Dotted">`Soft_Dotted`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Terminal_Punctuation">`Terminal_Punctuation`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Unified_Ideograph">`Unified_Ideograph`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Uppercase">`Uppercase`</a></li>
<li><a href="http://unicode.org/reports/tr44/#Variation_Selector">`Variation_Selector`</a></li>
<li><a href="http://unicode.org/reports/tr44/#White_Space">`White_Space`</a></li>
<li><a href="http://unicode.org/reports/tr44/#XID_Continue">`XID_Continue`</a></li>
<li><a href="http://unicode.org/reports/tr44/#XID_Start">`XID_Start`</a></li>
</ul>
<emu-note>
<p>The listed properties form a superset of what <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a> requires.</p>
</emu-note>
<p>To ensure interoperability, implementations must not support any Unicode properties other than those listed above.</p>
<p>Implementations must recognize only the canonical property names and property aliases listed in `PropertyAliases.txt`.</p>
<emu-note>
<p>For example, `Script_Extensions` (property name) and `scx` (property alias) are valid, but `script_extensions` or `Scx` aren’t.</p>
</emu-note>
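<emu-note>
<p>The strict matching of property names can be sketched as follows, assuming an engine that implements this proposal. `isValidPattern` is a hypothetical helper. `Block` is used here as one example of a Unicode property that is not in the lists above.</p>

```javascript
// Hypothetical helper: true when `source` compiles with the `u` flag.
function isValidPattern(source) {
  try {
    new RegExp(source, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// The canonical name and its alias from PropertyAliases.txt are both accepted.
const accepted = ['\\p{Script_Extensions=Greek}', '\\p{scx=Greek}'].map(isValidPattern);

// Names that differ only in case are rejected, as is a property that is
// not in the supported lists.
const rejected = [
  '\\p{script_extensions=Greek}',
  '\\p{Scx=Greek}',
  '\\p{Block=Basic_Latin}',
].map(isValidPattern);

// The name and the alias resolve to the same property.
const sameProperty =
  /\p{Script_Extensions=Greek}/u.test('α') === /\p{scx=Greek}/u.test('α');
```
</emu-note>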
</emu-clause>
<emu-clause id="sec-static-semantics-unicodematchpropertyvalue-p-v" aoid="UnicodeMatchPropertyValue">
<h1>Static Semantics: UnicodeMatchPropertyValue ( _p_, _v_ )</h1>
<p>The abstract operation UnicodeMatchPropertyValue takes two string parameters _p_ and _v_ and performs the following steps:</p>
<emu-alg>
1. Assert: _p_ is a canonical, unaliased Unicode property name.
1. Assert: _v_ strictly matches a known property value or property value alias for Unicode property _p_.
1. Let _value_ be the canonical, unaliased property value corresponding to _v_.
1. Return _value_.
</emu-alg>
<p>Implementations must support any existing property values and their aliases for the following Unicode properties as required by <a href="http://unicode.org/reports/tr18/#RL1.2">UTS18 RL1.2</a>: `General_Category`, `Script`, and `Script_Extensions`.</p>
<p>Implementations must recognize only the canonical property values and property value aliases listed in `PropertyValueAliases.txt`.</p>
<emu-note>
<p>For example, `Xpeo` and `Old_Persian` are valid `Script_Extensions` values, but `xpeo` and `Old Persian` aren’t.</p>
</emu-note>
<emu-note>
<p>This algorithm differs from <a href="http://unicode.org/reports/tr44/#Matching_Symbolic">the matching rules for symbolic values listed in UAX44</a>: case, <emu-xref href="#sec-white-space">white space</emu-xref>, U+002D (HYPHEN-MINUS), and U+005F (LOW LINE) are not ignored, and the `Is` prefix is not supported.</p>
</emu-note>
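<emu-note>
<p>The difference from the UAX44 loose matching rules can be sketched as follows, assuming an engine that implements this proposal. `acceptsValue` is a hypothetical helper that builds a `Script_Extensions` escape from a candidate value.</p>

```javascript
// Hypothetical helper: true when `value` is accepted as a
// Script_Extensions value in a `u`-flag pattern.
function acceptsValue(value) {
  try {
    new RegExp('\\p{Script_Extensions=' + value + '}', 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

const valueChecks = {
  alias: acceptsValue('Xpeo'),               // value alias: accepted
  canonical: acceptsValue('Old_Persian'),    // canonical value: accepted
  lowercase: acceptsValue('xpeo'),           // case is not ignored
  withSpace: acceptsValue('Old Persian'),    // white space is not ignored
  withHyphen: acceptsValue('Old-Persian'),   // U+002D is not treated as U+005F
  noUnderscore: acceptsValue('OldPersian'),  // U+005F is not ignored
  isPrefix: acceptsValue('IsXpeo'),          // the `Is` prefix is not supported
};
```
</emu-note>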
</emu-clause>
<hr>
<p><ins class="block">The following is appended to the list of productions in <a href="https://tc39.github.io/ecma262/#sec-characterclassescape">21.2.2.12 CharacterClassEscape</a>.</ins></p>
<p>The production <emu-grammar>CharacterClassEscape :: `\p{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the CharSet containing all Unicode code points included in the CharSet returned by <emu-nt>UnicodePropertyValueExpression</emu-nt>.</p>
<p>The production <emu-grammar>CharacterClassEscape :: `\P{` UnicodePropertyValueExpression `}`</emu-grammar> evaluates by returning the CharSet containing all Unicode code points not included in the CharSet returned by <emu-nt>UnicodePropertyValueExpression</emu-nt>.</p>
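<emu-note>
<p>The two productions above yield complementary CharSets. The following sketch checks this on a handful of sample code points rather than on all of Unicode, and assumes an engine that implements this proposal. The sample list and variable names are illustrative only.</p>

```javascript
// Anchored so that each test covers exactly one code point.
const isUppercase = /^\p{Lu}$/u;
const isNotUppercase = /^\P{Lu}$/u;

const samples = ['A', 'a', 'É', 'é', '1', '😀'];

// Every sample is matched by exactly one of the two escapes.
const exactlyOneMatches = samples.every(
  (ch) => isUppercase.test(ch) !== isNotUppercase.test(ch)
);

const uppercaseSamples = samples.filter((ch) => isUppercase.test(ch));
const otherSamples = samples.filter((ch) => isNotUppercase.test(ch));
```
</emu-note>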
<p>The production <emu-grammar>UnicodePropertyValueExpression :: UnicodePropertyName `=` UnicodePropertyValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. Let _p_ be ! UnicodeMatchProperty(_UnicodePropertyName_).
1. Assert: _p_ is not the name of a binary property in Unicode.
1. Let _v_ be ! UnicodeMatchPropertyValue(_p_, _UnicodePropertyValue_).
1. Return the CharSet containing all Unicode code points whose character database definition includes the property _p_ with value _v_.
</emu-alg>
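<emu-note>
<p>As an informal illustration of the steps above, the following sketch contrasts `Script` with `Script_Extensions` and shows that aliases select the same CharSet. It assumes an engine that implements this proposal, and it assumes that the Unicode Character Database assigns U+30FC (KATAKANA-HIRAGANA PROLONGED SOUND MARK) the `Script` value `Common` and lists `Hiragana` among its `Script_Extensions` values.</p>

```javascript
const prolongedSoundMark = '\u30FC';

const scriptChecks = {
  // Assumption: Script=Common for U+30FC.
  scriptIsCommon: /\p{Script=Common}/u.test(prolongedSoundMark),
  scriptIsHiragana: /\p{Script=Hiragana}/u.test(prolongedSoundMark),
  // Assumption: Script_Extensions of U+30FC includes Hiragana.
  extensionsIncludeHiragana: /\p{Script_Extensions=Hiragana}/u.test(prolongedSoundMark),
};

// Property aliases and value aliases resolve to the same property and value.
const aliasesAgree =
  /\p{General_Category=Uppercase_Letter}/u.test('A') === /\p{gc=Lu}/u.test('A');
```
</emu-note>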
<p>The production <emu-grammar>UnicodePropertyValueExpression :: LoneUnicodePropertyNameOrValue</emu-grammar> evaluates as follows:</p>
<emu-alg>
1. If ! UnicodeMatchPropertyValue(`"General_Category"`, _LoneUnicodePropertyNameOrValue_) is the name of a general category in Unicode, then
  1. Return the CharSet containing all Unicode code points whose character database definition includes the property General_Category with value _LoneUnicodePropertyNameOrValue_.
1. Let _property_ be ! UnicodeMatchProperty(_LoneUnicodePropertyNameOrValue_).
1. Assert: _property_ is the name of a binary property in Unicode.
1. Return the CharSet containing all Unicode code points whose character database definition includes the property _property_ with value |True|.
</emu-alg>
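<emu-note>
<p>The two branches of the algorithm above can be sketched as follows, assuming an engine that implements this proposal. A lone general category value selects the same code points as the explicit `General_Category` form, and any other lone name is treated as a binary property. `matchesOf` and the sample list are illustrative only.</p>

```javascript
const samples = ['A', 'a', 'Σ', 'σ', '5', '_', ' '];

// Hypothetical helper: the samples matched by `regex`.
function matchesOf(regex) {
  return samples.filter((ch) => regex.test(ch));
}

// First branch: the lone name is a general category value or alias.
const lone = matchesOf(/\p{Lu}/u);
const explicit = matchesOf(/\p{General_Category=Lu}/u);
const longForm = matchesOf(/\p{gc=Uppercase_Letter}/u);

// Second branch: the lone name is a binary property.
const alphabetic = matchesOf(/\p{Alphabetic}/u);
```
</emu-note>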
<p><ins class="block">The following is appended to <a href="https://tc39.github.io/ecma262/#sec-bibliography">the bibliography</a>.</ins></p>
<!-- es6num="F" -->
<emu-annex id="sec-bibliography">
<h1>Bibliography</h1>
<ol>
<li>
<i>Unicode Technical Standard #18: Unicode Regular Expressions</i>, available at &lt;<a href="http://unicode.org/reports/tr18/">http://unicode.org/reports/tr18/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #24: Unicode `Script` Property</i>, available at &lt;<a href="http://unicode.org/reports/tr24/">http://unicode.org/reports/tr24/</a>&gt;
</li>
<li>
<i>Unicode Standard Annex #44: Unicode Character Database</i>, available at &lt;<a href="http://unicode.org/reports/tr44/">http://unicode.org/reports/tr44/</a>&gt;
</li>
<li>
<i>Unicode Technical Standard #51: Unicode Emoji</i>, available at &lt;<a href="http://unicode.org/reports/tr51/">http://unicode.org/reports/tr51/</a>&gt;
</li>
</ol>
</emu-annex>