First draft of some registry functions #420

Merged
merged 12 commits into from
Jul 24, 2023

Conversation

mihnita
Collaborator

@mihnita mihnita commented Jul 12, 2023

No description provided.

@mihnita
Collaborator Author

mihnita commented Jul 12, 2023

The .md needs updating, but I thought I should do that only after we finish polishing the .xml.

@macchiati
Member

Some comments from a very quick scan:

  <option name="minimumFractionDigits" values="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20"/>

The number could be a BigDecimal, and you don't want to list all of those possible values! Maybe just \d+?

        Match a numerical value against CLDR plural categories or against a number literal.

should be something like:

Match a formatted numerical value against CLDR plural categories or against a number literal.

That is, I'm looking for some way we can make it harder to make the following mistake (or at least alert people to it):

match {$count}
when one {You have a book}
when * {You have {$count :number maximumFractionDigits=0} books}

That is, when $count = 1.1, this fails, producing "You have 1 books".
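
The mismatch described above can be reproduced in a small sketch. It assumes the English cardinal rule from CLDR (`one` when the integer part is 1 and there are no visible fraction digits) and non-negative input. The helpers `round_to_fraction_digits` and `plural_category_en` are hypothetical names made up for this illustration and do not reflect any MessageFormat implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_to_fraction_digits(value: str, fraction_digits: int) -> str:
    """Round a decimal string to the given number of fraction digits."""
    quantum = Decimal(1).scaleb(-fraction_digits)
    rounded = Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return format(rounded, "f")


def plural_category_en(digits: str) -> str:
    """English cardinal rule (assumed): 'one' if i = 1 and v = 0, else 'other'."""
    integer_part, _, fraction_part = digits.partition(".")
    if integer_part == "1" and fraction_part == "":
        return "one"
    return "other"


raw = "1.1"
shown = round_to_fraction_digits(raw, 0)        # what the placeholder displays
selected_on_raw = plural_category_en(raw)       # what `match {$count}` sees
selected_on_shown = plural_category_en(shown)   # what the reader would expect
```

Selecting on the raw value picks the `*` variant while the placeholder displays a rounded value, which is how the plural form and the displayed digits end up disagreeing.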

@mihnita
Collaborator Author

mihnita commented Jul 12, 2023

Thank you!

The number could be a BigDecimal, and you don't want to list all of those possible values! Maybe just \d+?

The info comes from ECMAScript, which limits the possible values:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/NumberFormat/NumberFormat#minimumfractiondigits

The intention was to start with something that is common between ICU and JS.

Comment on lines +41 to +47
<!-- The time zone to use. The only value implementations must recognize
is "UTC"; the default is the runtime's default time zone.
Implementations may also recognize the time zone names of the IANA
time zone database, such as "Asia/Shanghai", "Asia/Kolkata",
"America/New_York".
-->
<option name="timeZone" pattern="timeZoneId"/>
Member

Don't forget offset time zones.

Collaborator Author

This is exactly what ECMAScript says:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/DateTimeFormat/DateTimeFormat#timezone

I tried to not include anything that is not currently supported by ECMAScript, as we know there is strong opposition to that.

I'm for it, happy to discuss, but I didn't change this xml.

Added a topic in #422

Member

This is what the function accepts as validated input. It doesn't mean that every possible value is supported

Comment on lines 203 to 204
the default for currency formatting is the number of minor unit digits provided by
the ISO 4217 currency code list (2 if the list doesn't provide that information).
Member

... unless there is an override that says otherwise, e.g. some currencies with the cash format (technically COP has two digits, but in practice it has none).

Collaborator Author

Agreed, but this is a direct quote from ECMAScript.

A value with a smaller number of integer digits than this number will be
left-padded with zeros (to the specified length) when formatted.
-->
<option name="minimumIntegerDigits" values="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21" default="1"/>
Member

do we really need a list?

Collaborator Author

@mihnita mihnita Jul 13, 2023

Created issue #422 for this (and more)

<option name="maximumSignificantDigits" values="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21" default="21"/>

<match pattern="anyNumber"/>
<match values="zero one two few many"/>
Member

does * need to appear here? Some locales produce other as output from plural, which is effectively * in our syntax. That is, do we want to be able to say "this selector can request the default message"?

Collaborator Author

@mihnita mihnita Jul 13, 2023

I think our spec makes the presence of * mandatory.
So I didn't include other for plural and the other selectors.

Of course, open for discussion.

Member

I think it is an interesting minor issue for the registry spec.

Collaborator

I think our spec makes the presence of * mandatory. So I didn't include other for plural and the other selectors.

Why omit other if the spec only requires *?

Collaborator Author

My thinking (why I excluded it): because they mean the same thing?
If I see other in the registry, I expect to also see other handled in the when

Collaborator Author

@mihnita mihnita Jul 24, 2023

Option: we add it here as a possible value, we also add default=other, and update the registry spec to say that the default value for selectors maps to *

I think I like that option.

Collaborator

This assumption is unfortunately false, likely by accident in CLDR, although I'm missing the context. For Polish, other is specifically used for fractions, while many is what you'd expect to be a good other in most other languages.
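
To make the Polish case concrete, here is a minimal sketch of the CLDR cardinal rules for Polish as I understand them, restricted to non-negative decimal strings. The function name `plural_category_pl` is made up for this illustration; a real implementation would use a CLDR-backed library rather than hand-written rules.

```python
def plural_category_pl(digits: str) -> str:
    """Polish cardinal categories for a non-negative decimal string."""
    integer_part, _, fraction_part = digits.partition(".")
    if fraction_part:
        # Any visible fraction digits (v != 0) fall through to 'other'.
        return "other"
    i = int(integer_part)
    if i == 1:
        return "one"
    if i % 10 in (2, 3, 4) and i % 100 not in (12, 13, 14):
        return "few"
    # Every remaining integer: 0, 5-21, 25-31, ...
    return "many"


categories = {n: plural_category_pl(n) for n in ["1", "2", "5", "12", "22", "1.5"]}
```

Under these rules no integer ever selects `other`, so treating `other` as a synonym for the generic fallback would not match what a translator expects for large integers in Polish.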

@macchiati
Member

Thank you!

The number could be a BigDecimal, and you don't want to list all of those possible values! Maybe just \d+?

The info comes from ECMAScript, which limits the possible values: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/NumberFormat/NumberFormat#minimumfractiondigits

The intention was to start with something that is common between ICU and JS.

There is, however, no need to have that limitation here, and artificially limit more powerful implementations.

Instead, what I think we should do is document that if the value exceeds what the implementation is capable of, the maximum supported by the implementation is used.

Collaborator

@eemeli eemeli left a comment

This looks like a much bigger subset of the JS Intl.NumberFormat & Intl.DateTimeFormat formatters' options than I expected, which is great! Besides the line comments, I would find it easier to consider exactly which parts are being left out: Could a list of such options with their reasonings be provided?

@mihnita
Collaborator Author

mihnita commented Jul 13, 2023

There is, however, no need to have that limitation here, and artificially limit more powerful implementations.

Instead, what I think we should do is document that if the value exceeds what the implementation is capable of, the maximum supported by the implementation is used.

Some of the group members strongly opposed including anything that is not supported by ECMAScript.
So I tried to start as non-controversial as possible.

I anticipate (and hope) that we will discuss this PR in the next meeting.

@mihnita
Collaborator Author

mihnita commented Jul 13, 2023

Match a formatted numerical value against CLDR plural categories or against a number literal.

Updated registry.md and the registry.xml

That is, I'm looking for some way we can make it harder to make the following mistake (or at least alert people to it):

match {$count}
when one {You have a book}
when * {You have {$count :number maximumFractionDigits=0} books}

That is, when $count = 1.1, this fails, producing "You have 1 books".

ACK, and agree.

I don't think we can capture this in the registry.xml, at least not with the current registry dtd.
Should be easy to implement as a lint.

Added it as a discussion point in #422

@mihnita
Collaborator Author

mihnita commented Jul 13, 2023

I reverted most of my changes to the .dtd, keeping only the fixes for declarations that were clearly unintentional and made a valid .xml impossible.

For example:

<!ELEMENT registry (function*|pattern*)>

This means that the registry can contain either zero or more function elements or zero or more pattern elements.
Mixing function and pattern elements is illegal; it is either/or.

This makes the examples in the .md invalid, and the registry can't cover the use cases we need.
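
A DTD content model behaves like a regular expression over the sequence of child elements, so the difference can be sketched with Python's `re` module. This is only an analogy, not a DTD validator. The one-letter encoding and the `(function|pattern)*` alternative are assumptions chosen for illustration; they are not necessarily the fix adopted in the PR.

```python
import re


def children_code(children: list[str]) -> str:
    """Encode each child element as one letter: f = function, p = pattern."""
    return "".join({"function": "f", "pattern": "p"}[name] for name in children)


# (function*|pattern*): only functions, or only patterns.
EITHER_OR = re.compile(r"(?:f*|p*)")
# (function|pattern)*: any mix of the two, in any order.
ANY_MIX = re.compile(r"(?:f|p)*")

mixed = children_code(["function", "pattern", "function"])
only_functions = children_code(["function", "function"])

either_or_accepts_mixed = EITHER_OR.fullmatch(mixed) is not None
any_mix_accepts_mixed = ANY_MIX.fullmatch(mixed) is not None
either_or_accepts_functions = EITHER_OR.fullmatch(only_functions) is not None
```

The either/or model accepts a registry containing only functions but rejects one that mixes functions and patterns, which is the situation described above.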

@macchiati
Member

macchiati commented Jul 16, 2023 via email

Co-authored-by: Addison Phillips <addisonI18N@gmail.com>
Contributor

@ryzokuken ryzokuken left a comment

This is great! Thanks for kicking this off.

@@ -90,7 +90,7 @@ For the sake of brevity, only `locales="en"` is considered.
<function name="number">
<description>
Format a number.
-Match a numerical value against CLDR plural categories or against a number literal.
+Match a **formatted** numerical value against CLDR plural categories or against a number literal.
Contributor

Does this imply that the input is a localized number that'd be parsed and then matched? Also, since this function includes both selection and formatting, should this also say something about the formatting part?

Collaborator Author

No. I've merged a suggestion from Addison which does not mention formatting.

There is an ongoing discussion in issue #425
Especially this comment #425 (comment)

<pattern id="anyNumber" regex="-?[0-9]+(\.[0-9]+)"/>
<pattern id="positiveInteger" regex="[0-9]+"/>
<pattern id="currencyCode" regex="[A-Z]{3}"/>
<pattern id="timeZoneId" regex="[a-zA-Z/]+"/>
Contributor

We could use a more restricted grammar that only allows valid IANA timezone IDs, but I don't feel very strongly about it. Exhibit A: https://tc39.es/proposal-temporal/#prod-TimeZoneIANAName

Collaborator Author

The current registry spec does not allow us to do this, only regex and list of values.

I suggested changing the spec to allow for URLs to external specs.
Added it as a discussion point in #422

Contributor

The Temporal polyfill includes a RegExp that works for this purpose, but unfortunately it isn't a literal. If we agreed on a restrictive regex here that only allows valid IANA tzids, I imagine we could come up with a simplified regex that works in this case.

Collaborator Author

I think that in the longer run (not first commit) we should also add support for Unicode Time Zone Identifiers

The main reason would be stability, which the Olson database IDs do not offer (see linked)
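
As a quick check of what the four `<pattern>` regexes quoted in this thread accept, here is a minimal sketch. It assumes that a pattern must match the whole literal (hence `re.fullmatch`); the registry spec may define matching differently. The `matches` helper is a hypothetical name, and the regexes are copied as they appear in the quoted lines.

```python
import re

# Regexes copied from the <pattern> elements under review.
PATTERNS = {
    "anyNumber": r"-?[0-9]+(\.[0-9]+)",
    "positiveInteger": r"[0-9]+",
    "currencyCode": r"[A-Z]{3}",
    "timeZoneId": r"[a-zA-Z/]+",
}


def matches(pattern_id: str, literal: str) -> bool:
    """True if the whole literal matches the named pattern (assumed semantics)."""
    return re.fullmatch(PATTERNS[pattern_id], literal) is not None


tz_utc = matches("timeZoneId", "UTC")
tz_kolkata = matches("timeZoneId", "Asia/Kolkata")
tz_new_york = matches("timeZoneId", "America/New_York")  # contains "_"
tz_etc_offset = matches("timeZoneId", "Etc/GMT+1")       # contains "+" and a digit
number_integer = matches("anyNumber", "42")               # fraction group is not optional
number_decimal = matches("anyNumber", "-1.5")
```

Under whole-string matching, `timeZoneId` as quoted rejects identifiers containing an underscore, a digit, or a sign, and `anyNumber` as quoted rejects integers without a fraction part. Both may be worth revisiting along with the questions raised above.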

<!-- The minimum number of significant digits to use. -->
<option name="minimumSignificantDigits" values="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21" default="1"/>
<!-- The maximum number of significant digits to use. -->
<option name="maximumSignificantDigits" values="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21" default="21"/>
Contributor

For all of these *Digits <option>s, I wonder if we could add a convenient shorthand in the DTD... Perhaps a simple <range start="x" end="y"/> definition could work?

Collaborator Author

Changed to positiveInteger.
But I think the range option might also be handy in other cases.
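
The three ways of constraining a `*Digits` option discussed here can be compared in a short sketch. The explicit list and the `positiveInteger` regex come from the quoted XML; the range check models the suggested `<range start="x" end="y"/>` shorthand, which is hypothetical and not part of the DTD. All function names are made up for this illustration.

```python
import re

LISTED = {str(n) for n in range(1, 22)}     # values="1 2 ... 21"
POSITIVE_INTEGER = re.compile(r"[0-9]+")    # <pattern id="positiveInteger">


def valid_by_list(literal: str) -> bool:
    """Accept only literals that appear in the enumerated values."""
    return literal in LISTED


def valid_by_pattern(literal: str) -> bool:
    """Accept any literal that fully matches the positiveInteger regex."""
    return POSITIVE_INTEGER.fullmatch(literal) is not None


def valid_by_range(literal: str, start: int, end: int) -> bool:
    """Hypothetical <range> check: ASCII digits within [start, end]."""
    return literal.isascii() and literal.isdigit() and start <= int(literal) <= end
```

One thing the sketch makes visible: the regex `[0-9]+` also accepts `0`, which the original list for `minimumSignificantDigits` (starting at 1) did not.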

mihnita and others added 2 commits July 20, 2023 07:56
Co-authored-by: Ujjwal Sharma <ryzokuken@disroot.org>
Co-authored-by: Ujjwal Sharma <ryzokuken@disroot.org>
@mihnita
Collaborator Author

mihnita commented Jul 20, 2023

There is, however, no need to have that limitation here, and artificially limit more powerful implementations.

More seriously, if the restriction is removed, it simply allows clients to pass in more digits of accuracy; doesn't mean that the implementations have to support that in formatting.

Between several people commenting on this, and the news that ECMAScript considers removing the restriction, I changed this to positiveInteger. I hope it will not be controversial :-)

Co-authored-by: Eemeli Aro <eemeli@gmail.com>
@eemeli
Collaborator

eemeli commented Jul 20, 2023

Between several people commenting on this, and the news that ECMAScript considers removing the restriction, I changed this to positiveInteger. I hope it will not be controversial :-)

The JS limit isn't being removed; it's being increased from 20 to 100. Attempting to format a number with a minimumFractionDigits value greater than that will still throw a RangeError. As long as we consider that to be acceptable behaviour for a conforming implementation, I'm fine with this option accepting any positive integer value.
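
The two behaviours under discussion (rejecting an out-of-range value, as described for JS, versus falling back to the implementation's maximum, as proposed earlier in the thread) can be sketched as follows. The function `resolve_digits` and its parameters are hypothetical, and `ValueError` merely stands in for whatever error an implementation would raise.

```python
def resolve_digits(requested: int, implementation_max: int, strict: bool) -> int:
    """Resolve a *Digits option value against an implementation limit.

    strict=True rejects values above the limit; strict=False clamps them.
    """
    if requested < 0:
        raise ValueError("digit counts cannot be negative")
    if requested <= implementation_max:
        return requested
    if strict:
        raise ValueError(
            f"{requested} exceeds the supported maximum {implementation_max}"
        )
    return implementation_max


clamped = resolve_digits(150, implementation_max=100, strict=False)
in_range = resolve_digits(20, implementation_max=100, strict=True)
```

Either behaviour is compatible with the registry accepting any positive integer as validated input; the open question is which one a conforming implementation is allowed to choose.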

stasm added a commit to stasm/message-format-wg that referenced this pull request Jul 23, 2023
…, matchSignature

A spin-off from unicode-org#420, in which @mihnita noticed that `registry.dtd` uses alternatives in a wrong way. For example, `<!ELEMENT formatSignature (input?|option*)>` means that `formatSignature` is allowed to have as children either: at most one `input` OR any number of `options`, but not both.

Instead, @mihnita suggested using sequences: `<!ELEMENT formatSignature (input?,option*)>`, which this PR does. Note that sequences require the children to appear in a specific order, which isn't something that's useful to us. However, I'm not aware of any way of lifting this requirement while still enforcing at most one `input` or exactly one `description`.
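
The ordering constraint mentioned above can be illustrated by treating the two content models as regular expressions over child elements. This is an analogy rather than real DTD validation; the one-letter encoding (`i` for `input`, `o` for `option`) and the `accepts` helper are assumptions made for this sketch.

```python
import re

ALTERNATIVE = re.compile(r"(?:i?|o*)")  # models (input?|option*)
SEQUENCE = re.compile(r"i?o*")          # models (input?,option*)


def accepts(model: re.Pattern, children: str) -> bool:
    """True if the whole child sequence matches the content model."""
    return model.fullmatch(children) is not None


input_then_options = (accepts(ALTERNATIVE, "ioo"), accepts(SEQUENCE, "ioo"))
options_then_input = (accepts(ALTERNATIVE, "ooi"), accepts(SEQUENCE, "ooi"))
two_inputs = (accepts(ALTERNATIVE, "ii"), accepts(SEQUENCE, "ii"))
```

The sequence model accepts an `input` followed by options and still rejects two `input` children, but it also rejects options that come before the `input`, which is the trade-off the commit message describes.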
@@ -1,6 +1,6 @@
-<!ELEMENT registry (function*|pattern*)>
+<!ELEMENT registry (function,pattern)*>
Collaborator

Thanks for noticing this. I filed #434 to discuss and fix this independently of this PR, so that you can focus it on registry.xml.

Collaborator

@eemeli eemeli left a comment

This looks like a decent starting point for us.

I'd prefer the DTD changes to be handled separately, but I'm fine with them being included here and then iterated on in other PRs.

Comment on lines +110 to +111
<!-- The formatting style to use. -->
<option name="style" values="decimal currency percent unit" default="decimal"/>
Collaborator

To consider: If we were to leave out currency and unit formatting from the default, then we wouldn't need to say how compound units need to work.

Co-authored-by: Eemeli Aro <eemeli@gmail.com>
@aphillips aphillips merged commit 32c2467 into unicode-org:main Jul 24, 2023
aphillips added a commit that referenced this pull request Aug 13, 2023
…, matchSignature (#434)

A spin-off from #420, in which @mihnita noticed that `registry.dtd` uses alternatives in a wrong way. For example, `<!ELEMENT formatSignature (input?|option*)>` means that `formatSignature` is allowed to have as children either: at most one `input` OR any number of `options`, but not both.

Instead, @mihnita suggested using sequences: `<!ELEMENT formatSignature (input?,option*)>`, which this PR does. Note that sequences require the children to appear in a specific order, which isn't something that's useful to us. However, I'm not aware of any way of lifting this requirement while still enforcing at most one `input` or exactly one `description`.

Co-authored-by: Addison Phillips <addison@unicode.org>
@mihnita mihnita deleted the mihnita_registry branch February 14, 2024 21:27
6 participants