Address issues raised by recent reviews #89

Merged
merged 23 commits into from
Mar 28, 2023
Changes from 5 commits
de6fe7f
Add guidance for generating bidirectional text from substrings or for
aphillips Jan 27, 2023
7b24ade
Address #83
aphillips Jan 27, 2023
85fc0dc
Address #69
aphillips Jan 27, 2023
b73b570
Corrections and minor cleanup for the bidi example.
aphillips Jan 28, 2023
d0c0799
Addresses #14
aphillips Jan 29, 2023
57e9fb5
Address @r12a's comment about language tags.
aphillips Jan 30, 2023
3910539
Address @r12a's comment about "isolating bidi controls"
aphillips Jan 30, 2023
7825d50
Fixed bidi example by using markup.
aphillips Jan 31, 2023
bb58dfa
Address teleconference comments.
aphillips Feb 2, 2023
54d1b8b
Improve color contrast in example 3
aphillips Feb 2, 2023
1a2991b
Further simplified bidi example.
aphillips Feb 2, 2023
cc2f39e
Additional work on sorting section.
aphillips Feb 2, 2023
e16100d
Fix description of BCP47 and of matching
aphillips Feb 7, 2023
6da1a39
Minor tweak to ascending code unit sort.
aphillips Feb 7, 2023
2244c70
Added text to description of the matching BP
aphillips Feb 7, 2023
676694b
Address comments on #bidi_gen section from telecon of 2023-02-09
aphillips Feb 9, 2023
a54cf2f
Move explanation under mustard.
aphillips Feb 9, 2023
48fa92a
Ensure all parts of 639 are dealt with, minor edits
aphillips Feb 9, 2023
a7a3897
Further rephrasing based on @r12a's comments in telecon
aphillips Feb 10, 2023
f64e6d2
Further edits to make reqs refer to string values.
aphillips Feb 10, 2023
1caf816
Address telecon 2023-02-16 comments
aphillips Feb 16, 2023
07cbc60
Address @xfq's comment about vertical-align
aphillips Feb 28, 2023
6ce0aea
Update bidi recommendation from telecon of 2023-03-09 [I18N-ACTION-1253]
aphillips Mar 16, 2023
100 changes: 84 additions & 16 deletions index.html
@@ -303,10 +303,10 @@ <h3>Defining language values</h3>


<div class="req" id="lang_bcp_not_rfc">
<p class="advisement">Refer to BCP 47, not to RFC 5646.</p>
<p class="advisement">Refer to BCP 47, not to its constituent parts, such as RFC 5646 or RFC 4647.</p>
</div>

<p>The link to and name of BCP 47 was created specifically so that there is an unchanging reference to the definition of Tags for the Identification of Languages. RFCs 1766, 3066, 4646 were previous (superseded) versions and 5646 is the current version of BCP 47.</p>
<p>The link to and name of BCP 47 were created specifically so that there is an unchanging reference to the definition of <cite>Tags for the Identification of Languages</cite>. RFCs 1766, 3066, and 4646 were previous (superseded) versions. The current version of BCP 47 is made up of two RFCs: 5646 and 4647.</p>



@@ -351,7 +351,15 @@ <h3>Defining language values</h3>
</details>
</div>

<p>BCP 47 contains one RFC dedicated to the syntax and subtags of language tags, and another dedicated to how to match two or more subtags. (This topic needs more detail, and may merit being a separate section.)</p>
<p>BCP 47 contains one RFC dedicated to the syntax and subtags of language tags, and another dedicated to matching language tags against a user's language preferences.</p>

<aside class="issue"><p>The topic of matching language tags needs more detail, and may merit being a separate section.</p></aside>
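As a rough illustration of what matching involves, the following is a minimal sketch of the "basic filtering" scheme described in RFC 4647, in which a language range matches a tag if it is equal to the tag or is a prefix of it ending at a subtag boundary. The function name `basicFilter` and the sample tag list are hypothetical and are not part of any specification.

```javascript
// Sketch of RFC 4647 basic filtering (names here are illustrative only).
// A range matches a tag when, compared case-insensitively, the range equals
// the tag or the tag starts with the range followed by "-".
// The wildcard range "*" matches every tag.
function basicFilter(range, tags) {
  const r = range.toLowerCase();
  return tags.filter((tag) => {
    const t = tag.toLowerCase();
    return r === "*" || t === r || t.startsWith(r + "-");
  });
}

// Hypothetical list of available language tags.
const available = ["de", "de-CH", "de-Latn-DE", "den", "en-US"];

// "den" is not returned: the prefix must end at a subtag boundary.
console.log(basicFilter("de", available)); // → ["de", "de-CH", "de-Latn-DE"]
```

Other schemes in RFC 4647, such as lookup, behave differently, so a specification would still need to say which one it requires.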

<div class="req" id="use_lstr">
<p class="advisement">Specifications SHOULD refer to the IANA Language Subtag Registry instead of providing lists of codes extracted from ISO 639, ISO 3166, or other standards.</p>
</div>

<p>As part of BCP 47, IANA maintains the language subtag registry, which is a publicly available, machine-readable list of valid subtags for use in language tags. While this registry is based on underlying ISO standards, such as ISO 639 (languages) and ISO 3166 (regions), the list is actively maintained, stabilized, and comprehensive in ways that other lists found on the Internet may not be. Each of the subtag types is kept in sync with parent standards with the help and participation of those standards' maintainers. These include the various parts of ISO 639 (639-1, 639-2, 639-3), the ISO 15924 script codes, and ISO 3166 and UN M.49 region codes. Extracting or making your own list of codes or referring to ones found elsewhere can lead to maintenance problems or confusion.</p>
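To show what "machine-readable" can mean in practice, here is a minimal sketch that reads records in the registry's text format, where records are separated by `%%` lines and each field is a `Name: value` line. The `sample` text is a short illustrative excerpt rather than the full registry, and the helper names `parseRegistry` and `subtagsOfType` are assumptions made for this example.

```javascript
// Illustrative excerpt in the registry's record format (not the full file).
const sample = [
  "%%",
  "Type: language",
  "Subtag: fr",
  "Description: French",
  "Added: 2005-10-16",
  "Suppress-Script: Latn",
  "%%",
  "Type: script",
  "Subtag: Latn",
  "Description: Latin",
  "Added: 2005-10-16",
].join("\n");

// Split the text into records and each record into fields.
function parseRegistry(text) {
  return text
    .split(/^%%[ \t]*$/m)
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0)
    .map((chunk) => {
      const record = {};
      // Unfold continuation lines (a line break followed by whitespace).
      const unfolded = chunk.replace(/\r?\n[ \t]+/g, " ");
      for (const line of unfolded.split(/\r?\n/)) {
        const sep = line.indexOf(":");
        if (sep === -1) continue;
        const name = line.slice(0, sep).trim();
        const value = line.slice(sep + 1).trim();
        // Fields such as Description can repeat, so values are kept in arrays.
        (record[name] = record[name] || []).push(value);
      }
      return record;
    });
}

// Return the subtags of one type, e.g. "language", "script", or "region".
function subtagsOfType(records, type) {
  return records
    .filter((r) => r.Type && r.Type[0] === type && r.Subtag)
    .map((r) => r.Subtag[0]);
}

const records = parseRegistry(sample);
console.log(subtagsOfType(records, "language")); // → ["fr"]
```

Reading the registry at build time or run time, instead of copying a list into a specification, is one way to avoid the maintenance problems described above.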
</section>


@@ -937,9 +945,47 @@ <h3><em>Detecting &amp; matching direction (TBD)</em></h3>

<p class="reviewComments"><a href="https://github.com/w3c/i18n-activity/labels/t%3Abidi_detection_x" target="_blank">See related review comments.</a></p>
</section>
</section>

<section id="bidi_gen" class="subtopic">
<h3>Generating or requiring creation of mixed direction strings</h3>

<p class="reviewComments"><a href="https://github.com/w3c/i18n-activity/labels/t%3Abidi_gen" target="_blank">See related review comments.</a></p>

<p>Specifications for APIs, protocols, or document formats sometimes provide [=natural language=] content fields, such as implementation or user-generated labels or descriptions. In addition to bidirectional text requirements found in the preceding sections, specifications can also need to provide guidance to users or content authors.</p>

<aside class="example" title="Generating a display label">
<p>Suppose a specification provides a descriptive field in an API that is meant to be filled in by the implementation at runtime. One example of this is the <code>label</code> field in the [window-placement] API. The value of <code>label</code> might take various different implementation-dependent forms that include natural language text generated by the system or user-agent.</p>

<p>Such a label, when generated in a right-to-left language, might not display correctly when assembled by the system from various substrings. Spillover effects can happen when strings that mix left-to-right and right-to-left text are used in a larger label or token. A device label, such as a monitor name, might include strings such as the brand name ("Dell", "HP", etc.), part number ("S2721H", "A157-B", etc.), device capabilities ("75 Hz", "4ms", etc.), screen resolution (1024x768), and so forth. These often include ASCII letters, digits, and punctuation. For example, the English label for a monitor might be:</p>

<pre>
Brand A123B (1920 x 1080) 36" monitor 75 Hz, 4ms, built-in speakers
</pre>

<p>A naive translation to Arabic might look like this:</p>

<pre dir="rtl">
&#x0645;&#x0627;&#x0631;&#x0643;&#x0629; A123B (1920 x 1080) 36" &#x0634;&#x0627;&#x0634;&#x0629; &#x0627;&#x0644;&#x0643;&#x0645;&#x0628;&#x064A;&#x0648;&#x062A;&#x0631;, 75 Hz, 4 &#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;, &#x0645;&#x0643;&#x0628;&#x0631;&#x0627;&#x062A; &#x0635;&#x0648;&#x062A; &#x0645;&#x062F;&#x0645;&#x062C;&#x0629;</pre>

<p>Notice that the part number (<kbd lang="zxx" translate="no">A123B</kbd>) is separated from the brand name (<kbd lang="ar" dir="rtl" translate="no">&#x0645;&#x0627;&#x0631;&#x0643;&#x0629;</kbd>), that the measurement <kbd>36</kbd> is separated from the marker for inches (<kbd>"</kbd>), and that the values <kbd>75 Hz</kbd> (where the unit is in ASCII) and <kbd>4 ms</kbd> (where the unit has an Arabic translation <kbd lang="ar" dir="rtl" translate="no">&#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;</kbd>) each have the number separated from its unit. Generating labels from a sequence of string tokens requires extra care to ensure that the complete string is "bidirectionally clean" and will display properly to the user. Adding isolating bidirectional controls to the above string produces better results:</p>

<pre dir="rtl">&#x0645;&#x0627;&#x0631;&#x0643;&#x0629; A123B &#x2066;(1920 x 1080)&#x2069; &#x2066;36"&#x2069; &#x0634;&#x0627;&#x0634;&#x0629; &#x0627;&#x0644;&#x0643;&#x0645;&#x0628;&#x064A;&#x0648;&#x062A;&#x0631;, &#x2067;75 Hz&#x2069;, &#x2067;4 &#x0645;&#x0644;&#x0644;&#x064A; &#x062B;&#x0627;&#x0646;&#x064A;&#x0629;&#x2069;, &#x0645;&#x0643;&#x0628;&#x0631;&#x0627;&#x062A; &#x0635;&#x0648;&#x062A; &#x0645;&#x062F;&#x0645;&#x062C;&#x0629;
</pre>
</aside>
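One way an implementation might add isolating controls while assembling a label from substrings is sketched below. The helper names `isolate` and `buildLabel` and the shape of the `parts` array are hypothetical. The sketch assumes the caller knows, or can leave to first-strong detection, the direction of each substring.

```javascript
// Unicode isolating bidi controls.
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Wrap one substring in an isolate. When the direction is not "ltr" or "rtl",
// FSI is used so that the first strong character decides the direction.
function isolate(text, dir = "auto") {
  const opener = dir === "ltr" ? LRI : dir === "rtl" ? RLI : FSI;
  return opener + text + PDI;
}

// Join isolated substrings with spaces (hypothetical label format).
function buildLabel(parts) {
  return parts.map((p) => isolate(p.text, p.dir)).join(" ");
}

// The resolution and the size in inches are isolated as left-to-right runs,
// as in the corrected Arabic label above.
const label = buildLabel([
  { text: "(1920 x 1080)", dir: "ltr" },
  { text: '36"', dir: "ltr" },
]);
```

Where the output format supports markup, such as HTML, using the <code>dir</code> attribute or the <code>bdi</code> element is generally preferable to inserting control characters into the string.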

<div class="req" id="bidi_gen_advice">
<p class="advisement">Specifications that require or suggest system or user-generated natural language text values SHOULD provide guidance for the generation of bidirectional labels for those languages that require it.</p>
</div>

<aside class="example" title="Example of bidi generation guidance">
<div class="note" role="note" id="bidi-gen-note-example">
<p class="example_note">If <em><code>_field_name_</code></em> contains or might contain [= bidirectional text =], care should be taken to ensure that the string will display correctly without the application needing to process the string. For more information see <a href="https://www.w3.org/International/questions/qa-bidi-unicode-controls"><cite>How to use Unicode controls for bidi text</cite></a>.</p>
</div>
</aside>


</section>
</section>



@@ -2784,29 +2830,51 @@ <h3>Specifying sort and search functionality</h3>
</aside>


<p>Applications often need to organize sets of information or content. Frequently this involves sorting the content so that users can find what they are looking for. Many data types, such as numbers or dates, are easily sorted by comparing the values. When it comes to textual information, however, the nature of character encodings brings some additional complexity.</p>

<p>One key choice is whether the sorting of textual data will be shown to users and thus follow the sorting rules of a specific language or culture or whether the sorting is strictly internal.</p>
<p>Applications often need to organize sets of information or content. Frequently this involves sorting the content so that users can find what they are looking for. Many data types, such as numbers or dates, are easily sorted by comparing the values. When it comes to textual information, however, the nature of character encodings and user expectations regarding "alphabetical" order bring some additional complexity.</p>

<div class="req" id="char_sort_internal_only">
<p class="advisement">Specifications or implementations that require a program-internal, fast, and deterministic sorting of text (which will never be human visible) SHOULD specify that strings are sorted into ascending code unit order.</p>
<!--details class="links"><summary>explanations &amp; examples</summary>
TODO: add links
</details -->
<p class="advisement">Specifications or implementations that require a program-internal, fast, and deterministic sorting of text that is not intended for human viewing or interaction SHOULD specify that strings are sorted according to their definition of string. For string types based on UTF-16 (such as DOMString or JavaScript strings), specify <em>ascending code unit</em> order. For data that uses scalar value strings (such as USVString or many XML processes), specify <em>ascending code point</em> order.</p>
<details class="links"><summary>explanations &amp; examples</summary>
<a href="#char_string">Defining 'string'</a>
</details>
</div>

<p>This matches JavaScript's <code>Array.prototype.sort</code> on an <code>Array</code> of <code>String</code>. In JavaScript, this ordering compares the 16-bit code units in each string. The resulting list will not match any particular alphabet or lexicographical order, particularly for code points represented by a surrogate pair.</p>
<p>One key choice is whether the sorting of textual data will be shown to users and thus needs to be [=locale=]-sensitive (that is, following the sorting rules of a specific language or culture) or whether the sorting is strictly internal. There are two potential internal sorting sequences: ordering by Unicode [=code point=] or ordering by [=code unit=]. For either type of ordering, the resulting list will not match any particular alphabet or lexicographical order.</p>

<p>Sorting by [=code point=] makes sense when strings are stored and processed as a sequence of code points, such as in a <a href="https://webidl.spec.whatwg.org/#idl-USVString">USVString</a>. Sorting by [=code unit=] makes sense when strings are stored and processed using the underlying encoding, such as in a <a href="https://webidl.spec.whatwg.org/#idl-DOMString">DOMString</a>.</p>

<p>For example, consider JavaScript's function <code>Array.prototype.sort</code> applied to an <code>Array</code> of <code>String</code> values. In JavaScript, a String is a sequence of UTF-16 code units. This ordering compares the 16-bit (UTF-16) code units in each string, so [=supplementary characters=], which are encoded as a [=surrogate pair=], compare differently in this sort order than when ordering by code point.</p>

<aside class="example" title="Code point vs. code unit ordering">
<p>Consider two strings, one containing <span class="codepoint" translate="no"><bdi lang="ja">&#x1f63a;</bdi> [<span class="uname">U+1F63A SMILING CAT FACE WITH OPEN MOUTH</span>]</span> and the other containing <span class="codepoint" translate="no"><bdi lang="ja">&#xff5e;</bdi> [<span class="uname">U+FF5E FULLWIDTH TILDE</span>]</span>.</p>

<p>In ascending <em>code point order</em>, the strings sort like:</p>
<pre>
&#xff5e; (U+FF5E)
&#x1f63a; (U+1F63A)
</pre>

<p>In ascending <em>code unit order</em>, the character U+1F63A is encoded as the code unit sequence <code>0xD83D 0xDE3A</code>, so the strings sort like:</p>
<pre>
&#x1f63a; (0xD83D 0xDE3A)
&#xff5e; (0xFF5E)
</pre>
<p>Note that UTF-8 <em>code unit order</em> (that is, when sorting by byte values in UTF-8 encoded byte strings) is the same as code point order.</p>
</aside>
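The difference between the two orderings can be reproduced in JavaScript with the same two characters. In this sketch, the default `Array.prototype.sort` gives code unit order, while `compareCodePoints` is a hypothetical comparator written for this example that walks each string one code point at a time.

```javascript
const cat = "\u{1F63A}"; // U+1F63A, stored as the surrogate pair 0xD83D 0xDE3A
const tilde = "\uFF5E";  // U+FF5E, stored as the single code unit 0xFF5E

// With no comparator, sort() compares strings by their UTF-16 code units.
const byCodeUnit = [cat, tilde].sort();

// Compare two strings by code point. String iteration yields one code point
// per step, so a surrogate pair is treated as a single value.
function compareCodePoints(a, b) {
  const ai = a[Symbol.iterator]();
  const bi = b[Symbol.iterator]();
  while (true) {
    const x = ai.next();
    const y = bi.next();
    if (x.done || y.done) {
      // A string that is a prefix of the other sorts first; equal if both end.
      return (x.done ? 0 : 1) - (y.done ? 0 : 1);
    }
    const diff = x.value.codePointAt(0) - y.value.codePointAt(0);
    if (diff !== 0) return diff;
  }
}

const byCodePoint = [cat, tilde].sort(compareCodePoints);

console.log(byCodeUnit[0] === cat);    // true: 0xD83D sorts before 0xFF5E
console.log(byCodePoint[0] === tilde); // true: U+FF5E sorts before U+1F63A
```

The two results differ only because one of the strings contains a supplementary character; for strings limited to the Basic Multilingual Plane (and containing no lone surrogates) both orderings agree.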

<p>Specifications or applications that need to deal with sorting natural language text face some additional complexity. Unicode defines a default collation (sorting) order as part of the <code>Unicode Collation Algorithm</code> [[UTS10]], which can then be tailored to meet the needs of specific languages, <a>locales</a>, and cultures.</p>



<div class="req" id="char_sort_units">
<p class="advisement">Software that sorts or searches text for users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.</p>
<p class="advisement">Software that sorts or searches text for display to users SHOULD do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.</p>
<details class="links"><summary>explanations &amp; examples</summary>
<p><a href="https://www.w3.org/TR/charmod/#sec-CollationUnits">Units of collation, C006</a>, in <cite>Character Model for the World Wide Web: Fundamentals</cite></p>
</details>
</div>

<p>Specifications or applications that need to deal with sorting natural language text for display to users face some additional complexity. Unicode defines a default collation (sorting) order as part of the <cite>Unicode Collation Algorithm</cite> [[UTS10]], which can then be tailored to meet the needs of specific languages, <a>locales</a>, and cultures.</p>
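As a small illustration of how locale-sensitive collation differs from internal ordering, the sketch below sorts the same three strings by code unit and with `Intl.Collator`, the ECMAScript internationalization API's collation interface. The word list is made up for this example, and the collated results depend on the locale data available to the runtime.

```javascript
// "\u00E4" is the precomposed letter a with diaeresis.
const words = ["z", "\u00E4", "a"];

// Code unit order: U+00E4 has a larger value than "z" (U+007A), so it sorts last.
const codeUnitOrder = [...words].sort();

// German collation treats the letter as a variant of "a", so it sorts before "z".
const german = [...words].sort(new Intl.Collator("de").compare);

// Swedish collation treats the same letter as a separate letter placed after "z"
// (assuming the runtime ships Swedish locale data).
const swedish = [...words].sort(new Intl.Collator("sv").compare);

console.log(codeUnitOrder, german, swedish);
```

The same input produces different user-visible orders for different languages, which is why a specification needs to say which language or locale governs a user-facing sort.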

<aside class="issue" id="char_sort_user_issue">
<p>The following requirement is somewhat unclear for specification authors. There are many places where what I'd want to advise specs to do is follow the language (locale) of the given document or of the application or to provide controls so that the application can choose appropriately. The "current user", where it means "operating system" or "user agent host system's locale" or "browser's localization" is not always what is expected.</p>
</aside>

<div class="req" id="char_sort_user">
<p class="advisement">Where searching or sorting is done dynamically, particularly in a multilingual environment, the 'relevant language' SHOULD be determined to be that of the current user, and may thus differ from user to user.</p>