w3c · aphillips · Mar 28, 2023 · Jan 27, 2023 · Jan 27, 2023 · Jan 27, 2023
diff --git a/index.html b/index.html
@@ -303,10 +303,10 @@ <h3>Defining language values</h3>
 
 
 	<div class="req" id="lang_bcp_not_rfc">
-	<p class="advisement">Refer to BCP 47, not to RFC 5646.</p>
+	<p class="advisement">Refer to BCP 47, not to its constituent parts, such as RFC 5646 or RFC 4647.</p>
 	</div>
 
-    <p>The link to and name of BCP 47 was created specifically so that there is an unchanging reference to the definition of Tags for the Identification of Languages. RFCs 1766, 3066, 4646 were previous (superseded) versions and 5646 is the current version of BCP 47.</p>
+    <p>The link to and name of BCP 47 were created specifically so that there is an unchanging reference to the definition of <cite>Tags for the Identification of Languages</cite>. RFCs 1766, 3066, and 4646 were previous (superseded) versions. The current version of BCP 47 is made up of two RFCs: 5646 and 4647.</p>
 
 
 
@@ -351,7 +351,15 @@ <h3>Defining language values</h3>
 	</details>
 	</div>
 
- 	<p>BCP 47 contains one RFC dedicated to the syntax and subtags of language tags, and another dedicated to how to match two or more subtags.  (This topic needs more detail, and may merit being a separate section.)</p>
+ 	<p>BCP 47 contains one RFC dedicated to the syntax and subtags of language tags (RFC 5646), and another dedicated to matching language tags against language ranges (RFC 4647).</p>
+
+ 	<aside class="issue"><p>The topic of matching language tags needs more detail, and may merit being a separate section.</p></aside>
+
+ 	<div class="req" id="use_lstr">
+		<p class="advisement">Specifications SHOULD refer to the IANA Language Subtag Registry instead of providing lists of codes extracted from ISO 639, ISO 3166, or other standards.</p>
+ 	</div>
+
+ 	<p>As part of BCP 47, IANA maintains the language subtag registry, which is a publicly available, machine-readable list of valid subtags for use in language tags. While this registry is based on underlying ISO standards, such as ISO 639 (languages) and ISO 3166 (regions), the list is actively maintained, stabilized, and comprehensive in ways that other lists found on the Internet may not be. Each of the subtag types is kept in sync with its parent standards with the help and participation of those standards' maintainers. These include the various parts of ISO 639 (639-1, 639-2, 639-3), the ISO 15924 script codes, and ISO 3166 and UN M.49 region codes. Extracting or making your own list of codes, or referring to ones found elsewhere, can lead to maintenance problems or confusion.</p>
 </section>
 
 
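Note on the matching RFC mentioned in the hunk above: the simplest scheme in RFC 4647 is basic filtering. A language range matches a tag when, compared case-insensitively, it equals the tag or is a prefix of the tag that is immediately followed by a hyphen; the range `*` matches every tag. The sketch below is only an illustration of that rule. The helper names `matchesBasicRange` and `basicFilter` and the sample tag list are assumptions made for this note, not part of BCP 47 or of the document being changed.

```javascript
// Basic filtering as described in RFC 4647 (one of the two RFCs in BCP 47).
// Helper names and sample data are hypothetical, chosen for illustration.
function matchesBasicRange(range, tag) {
  const r = range.toLowerCase(); // tags are ASCII, so toLowerCase is safe
  const t = tag.toLowerCase();
  if (r === "*") return true; // the wildcard range matches any tag
  // Either an exact match, or a prefix that ends on a subtag boundary.
  return t === r || t.startsWith(r + "-");
}

function basicFilter(range, tags) {
  return tags.filter((tag) => matchesBasicRange(range, tag));
}

const available = ["de", "de-CH", "de-Latn-DE", "den", "en-US"];
const germanContent = basicFilter("de", available);
// germanContent is ["de", "de-CH", "de-Latn-DE"];
// "den" is excluded because "de" does not end on a subtag boundary there.
```

The hyphen check is the important part: a plain string-prefix test would wrongly treat `den` as a kind of `de`.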
@@ -937,9 +945,47 @@ <h3><em>Detecting &amp; matching direction (TBD)</em></h3>
 
 <p class="reviewComments"><a href="https://github.com/w3c/i18n-activity/labels/t%3Abidi_detection_x" target="_blank">See related review comments.</a></p>
 </section>
-</section>
 
+<section id="bidi_gen" class="subtopic">
+<h3>Generating or requiring creation of mixed direction strings</h3>
+
+<p class="reviewComments"><a href="https://github.com/w3c/i18n-activity/labels/t%3Abidi_gen" target="_blank">See related review comments.</a></p>
+
+<p>Specifications for APIs, protocols, or document formats sometimes provide [=natural language=] content fields, such as implementation or user-generated labels or descriptions. In addition to the bidirectional text requirements found in the preceding sections, specifications might also need to provide guidance to users or content authors.</p>
+
+<aside class="example" title="Generating a display label">
+<p>Suppose a specification provides a descriptive field in an API that is meant to be filled in by the implementation at runtime. One example of this is the <code>label</code> field in the [window-placement] API. The value of <code>label</code> might take various different implementation-dependent forms that include natural language text generated by the system or user-agent.</p>
+
+<p>Such a label, when generated in a right-to-left language, might not display correctly when assembled by the system from various substrings. Spillover effects can happen when text that mixes left-to-right and right-to-left runs is used in a larger label or token. A device label, such as a monitor name, might include strings such as the brand name ("Dell", "HP", etc.), part number ("S2721H", "A157-B", etc.), device capabilities ("75 Hz", "4ms", etc.), screen resolution (1024x768), and so forth. These often include ASCII letters, digits, and punctuation. For example, the English label for a monitor might be:</p>
+
+<pre>
+Brand A123B (1920 x 1080) 36" monitor 75 Hz, 4ms, built-in speakers
+</pre>
+
+<p>A naive translation to Arabic might look like this:</p>
+
+<pre dir="rtl">
+&#x0645;&#x0627;&#x0631;&#x0643;&#x0629; A123B (1920 x 1080) 36" &#x0634;&#x0627;&#x0634;&#x0629; &#x0627;&#x0644;&#x0643;&#x0645;&#x0628;&#x064A;&#x0648;&#x062A;&#x0631;, 75 Hz, 4 &#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;, &#x0645;&#x0643;&#x0628;&#x0631;&#x0627;&#x062A; &#x0635;&#x0648;&#x062A; &#x0645;&#x062F;&#x0645;&#x062C;&#x0629;</pre>
 
+<p>Notice how the part number (<kbd lang="zxx" translate="no">A123B</kbd>) is separated from the brand name (<kbd lang="ar" dir="rtl" translate="no">&#x0645;&#x0627;&#x0631;&#x0643;&#x0629;</kbd>), how the measurement <kbd>36</kbd> and the marker for inches (<kbd>"</kbd>) have become separated, and how the values <kbd>75 Hz</kbd> (where the unit is in ASCII) and <kbd>4 ms</kbd> (where the unit has an Arabic translation <kbd lang="ar" dir="rtl" translate="no">&#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;</kbd>) both have the number separated from the unit. Generating labels from a sequence of string tokens requires extra care to ensure that the complete string is "bidirectionally clean" and will display properly to the user. Adding isolating bidirectional controls to the above string produces better results:</p>
+
+<pre dir="rtl">&#x0645;&#x0627;&#x0631;&#x0643;&#x0629; A123B &#x2066;(1920 x 1080)&#x2069; &#x2066;36"&#x2069; &#x0634;&#x0627;&#x0634;&#x0629; &#x0627;&#x0644;&#x0643;&#x0645;&#x0628;&#x064A;&#x0648;&#x062A;&#x0631;, &#x2067;75 Hz&#x2069;, &#x2067;4 &#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;&#x2069;, &#x0645;&#x0643;&#x0628;&#x0631;&#x0627;&#x062A; &#x0635;&#x0648;&#x062A; &#x0645;&#x062F;&#x0645;&#x062C;&#x0629;
+</pre>
+</aside>
+
+<div class="req" id="bidi_gen_advice">
+	<p class="advisement">Specifications that require or suggest system or user-generated natural language text values SHOULD provide guidance for the generation of bidirectional labels for those languages that require it.</p>
+</div>
+
+<aside class="example" title="Example of bidi generation guidance">
+	<div class="note" role="note" id="bidi-gen-note-example">
+	   <p class="example_note">If <em><code>_field_name_</code></em> contains or might contain [= bidirectional text =], care should be used to ensure that the string will display correctly without the application needing to process the string. For more information, see <a href="https://www.w3.org/International/questions/qa-bidi-unicode-controls"><cite>How to use Unicode controls for bidi text</cite></a>.</p>
+	</div>
+</aside>
+
+
+</section>
+</section>
 
 
 
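Note on the `bidi_gen` hunk above: the corrected Arabic label wraps individual tokens in isolating controls (U+2066 or U+2067 to open, U+2069 to close). One way an implementation might do this when assembling a label from tokens is sketched below. The function names `isolate` and `buildLabel`, the token shape `{ text, dir }`, and the choice to isolate every token are assumptions made for this note; the diff itself isolates only some of the tokens.

```javascript
// Unicode isolating bidi controls.
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical helper: wrap one token so that its contents cannot
// reorder against neighbouring tokens.
function isolate(text, dir = "auto") {
  const opener = dir === "ltr" ? LRI : dir === "rtl" ? RLI : FSI;
  return opener + text + PDI;
}

// Hypothetical helper: assemble a label from already-localized tokens.
function buildLabel(tokens, separator = " ") {
  return tokens.map(({ text, dir }) => isolate(text, dir)).join(separator);
}

const label = buildLabel([
  { text: "\u0645\u0627\u0631\u0643\u0629", dir: "rtl" }, // brand placeholder from the example
  { text: "A123B", dir: "ltr" },
  { text: "(1920 x 1080)", dir: "ltr" },
  { text: '36"', dir: "ltr" },
  { text: "75 Hz" }, // direction not known in advance: falls back to FSI
]);
```

Isolating every token is the blunt option: it adds invisible characters but avoids having to reason about which neighbours interact. The resulting string is intended for display in a right-to-left base direction, as in the `dir="rtl"` blocks in the diff.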
@@ -2784,29 +2830,51 @@ <h3>Specifying sort and search functionality</h3>
 </aside>
 
 
-    <p>Applications often need to organize sets of information or content. Frequently this involves sorting the content so that users can find what they are looking for. Many data types, such as numbers or dates, are easily sorted by comparing the values. When it comes to textual information, however, the nature of character encodings brings some additional complexity.</p>
-
-    <p>One key choice is whether the sorting of textual data will be shown to users and thus follow the sorting rules of a specific language or culture or whether the sorting is strictly internal.</p>
+    <p>Applications often need to organize sets of information or content. Frequently this involves sorting the content so that users can find what they are looking for. Many data types, such as numbers or dates, are easily sorted by comparing the values. When it comes to textual information, however, the nature of character encodings and user expectations regarding "alphabetical" order bring some additional complexity.</p>
 
     <div class="req" id="char_sort_internal_only">
-	<p class="advisement">Specifications or implementations that require a program-internal, fast, and deterministic sorting of text (which will never be human visible) SHOULD specify that strings are sorted into ascending code unit order.</p>
-	<!--details class="links"><summary>explanations &amp; examples</summary>
-	TODO: add links
-	</details -->
+	<p class="advisement">Specifications or implementations that require a program-internal, fast, and deterministic sorting of text that is not intended for human viewing or interaction SHOULD specify that strings are sorted according to their definition of string. For string types based on UTF-16 (such as DOMString or JavaScript strings), specify <em>ascending code unit</em> order. For data that uses scalar value strings (such as USVString or many XML processes), specify <em>ascending code point</em> order.</p>
+	<details class="links"><summary>explanations &amp; examples</summary>
+	   <a href="#char_string">Defining 'string'</a>
+	</details>
     </div>
 
-    <p>This matches JavaScript's <code>Array.prototype.sort</code> on an <code>Array</code> of <code>String</code>. In JavaScript, this ordering compares the 16-bit code units in each string. The resulting list will not match any particular alphabet or lexicographical order, particularly for code points represented by a surrogate pair.</p>
+    <p>One key choice is whether the sorting of textual data will be shown to users and thus needs to be [=locale=]-sensitive (that is, following the sorting rules of a specific language or culture) or whether the sorting is strictly internal. There are two potential internal sorting sequences: ordering by Unicode [=code point=] or ordering by [=code unit=]. For either type of ordering, the resulting list will not match any particular alphabet or lexicographical order.</p>
+
+    <p>Sorting by [=code point=] makes sense when strings are stored and processed as a sequence of code points, such as in a <a href="https://webidl.spec.whatwg.org/#idl-USVString">USVString</a>. Sorting by [=code unit=] makes sense when strings are stored and processed using the underlying encoding, such as in a <a href="https://webidl.spec.whatwg.org/#idl-DOMString">DOMString</a>.</p>
+
+    <p>For example, consider JavaScript's function <code>Array.prototype.sort</code> applied to an <code>Array</code> of <code>String</code> values. In JavaScript, a String is a sequence of UTF-16 code units. This ordering compares the 16-bit (UTF-16) code units in each string, so [=supplementary characters=], which are encoded as a [=surrogate pair=], compare differently in this sort order than when ordering by code point.</p>
+
+    <aside class="example" title="Code point vs. code unit ordering">
+		<p>Consider two strings, one containing <span class="codepoint" translate="no"><bdi lang="ja">&#x1f63a;</bdi> [<span class="uname">U+1F63A SMILING CAT FACE WITH OPEN MOUTH</span>]</span> and the other containing <span class="codepoint" translate="no"><bdi lang="ja">&#xff5e;</bdi> [<span class="uname">U+FF5E FULLWIDTH TILDE</span>]</span>.</p>
+
+		<p>In ascending <em>code point order</em>, the strings sort like:</p>
+<pre>
+&#xff5e; (U+FF5E)
+&#x1f63a; (U+1F63A)
+</pre>
+
+		<p>In ascending <em>code unit order</em>, the character U+1F63A is encoded as the code unit sequence <code>0xD83D 0xDE3A</code>, so the strings sort like:</p>	
+<pre>
+&#x1f63a; (0xD83D 0xDE3A)
+&#xff5e; (0xFF5E)
+</pre>
+       <p>Note that UTF-8 <em>code unit order</em> (that is, when sorting by byte values in UTF-8 encoded byte strings) is the same as code point order.</p>
+    </aside>
 
-    <p>Specifications or applications that need to deal with sorting natural language text face some additional complexity. Unicode defines a default collation (sorting) order as part of the <code>Unicode Collation Algorithm</code> [[UTS10]], which can then be tailored to meet the needs of specific languages, <a>locales</a>, and cultures.</p>
-
-
 
   	<div class="req" id="char_sort_units">
-	<p class="advisement">Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.</p>
+	<p class="advisement">Software that sorts or searches text for display to users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.</p>
 	<details class="links"><summary>explanations &amp; examples</summary>
 	<p><a href="https://www.w3.org/TR/charmod/#sec-CollationUnits">Units of collation, C006</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
 	</details>
 	</div>
+
+    <p>Specifications or applications that need to deal with sorting natural language text for display to users face some additional complexity. Unicode defines a default collation (sorting) order as part of the <cite>Unicode Collation Algorithm</cite> [[UTS10]], which can then be tailored to meet the needs of specific languages, <a>locales</a>, and cultures.</p>
+
+	<aside class="issue" id="char_sort_user_issue">
+		<p>The following requirement is somewhat unclear for specification authors. In many places, the better advice for specifications would be to follow the language (locale) of the given document or of the application, or to provide controls so that the application can choose appropriately. The "current user", where it means "operating system", "user agent host system's locale", or "browser's localization", is not always what is expected.</p>
+	</aside>
 
   	<div class="req" id="char_sort_user">
 	<p class="advisement">Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.</p>
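Note on the sorting hunk above: the three orderings it distinguishes (code unit, code point, and locale-sensitive collation) can be compared directly in JavaScript. The sketch below reuses the two characters from the "Code point vs. code unit ordering" example. The helper name `compareByCodePoint` and the sample word list are assumptions made for this note; `Intl.Collator` is the standard ECMAScript internationalization API, and its results depend on the locale data shipped with the runtime.

```javascript
// The two strings from the example: U+1F63A (surrogate pair D83D DE3A) and U+FF5E.
const strings = ["\u{1F63A}", "\uFF5E"];

// With no comparator, Array.prototype.sort compares UTF-16 code units.
const byCodeUnit = [...strings].sort();
// byCodeUnit is ["\u{1F63A}", "\uFF5E"] because 0xD83D < 0xFF5E

// Hypothetical helper: compare two strings by code point.
function compareByCodePoint(a, b) {
  // Array.from iterates a string by code point, not by code unit.
  const pa = Array.from(a, (ch) => ch.codePointAt(0));
  const pb = Array.from(b, (ch) => ch.codePointAt(0));
  const shared = Math.min(pa.length, pb.length);
  for (let i = 0; i < shared; i++) {
    if (pa[i] !== pb[i]) return pa[i] - pb[i];
  }
  return pa.length - pb.length; // a shorter prefix sorts first
}

const byCodePoint = [...strings].sort(compareByCodePoint);
// byCodePoint is ["\uFF5E", "\u{1F63A}"] because 0xFF5E < 0x1F63A

// Locale-sensitive ordering for text shown to users.
const words = ["z", "\u00E4", "a"]; // "z", "ä", "a"
const german = [...words].sort(new Intl.Collator("de").compare);
const swedish = [...words].sort(new Intl.Collator("sv").compare);
// With full ICU locale data: german is a, ä, z and swedish is a, z, ä
```

The German and Swedish results differ for the same input, which is why the "relevant language" has to be chosen deliberately rather than assumed.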