Start message format test suite #113

grhoten · 2020-09-16T09:25:12Z

These changes are a part of issue #109. This is just a start. I wanted to get feedback before I add more. I can confirm that my implementation passes these tests.

It's worth noting that this XML syntax could potentially be extended to include the XHTML namespace or the SSML namespace. For example, the speak namespace could be omitted from the print line, and the print namespace could be omitted from the speak line. Using XML should also lend itself well to XLIFF for translation purposes.
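For illustration, here is a minimal sketch of how a harness might read one of these test cases. It assumes a `<tests>` wrapper element and uses a shortened copy of one message from en_US.xml; `load_cases` and `var_requests` are hypothetical helper names, not part of the proposed suite.

```python
import xml.etree.ElementTree as ET

# Assumption: a <tests> root wraps the <test> elements. The inner elements
# mirror the en_US.xml excerpts in this PR, with the message shortened.
SAMPLE = """
<tests>
    <test>
        <params>
            <param name="relationship" type="String"><value>brother</value></param>
        </params>
        <source>I didn't find <var name="relationship" inflect="indefinite"/> in your contacts.</source>
        <print>I didn't find a brother in your contacts.</print>
    </test>
</tests>
"""


def load_cases(xml_text):
    """Yield (params, source_element, expected_print) for each <test>."""
    root = ET.fromstring(xml_text.strip())
    for test in root.iter("test"):
        params = {p.get("name"): p.findtext("value") for p in test.iter("param")}
        yield params, test.find("source"), test.findtext("print")


def var_requests(source):
    """List the (name, inflect) pairs requested by <var> elements."""
    return [(v.get("name"), v.get("inflect")) for v in source.iter("var")]


cases = list(load_cases(SAMPLE))
```

A real harness would pass `params` and `source` to the formatter under test and compare the result with the expected `<print>` (or `<speak>`) text.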

Some options that remain to be implemented for testing include:

- The full range of CLDR plural rules for quantity.
- The various text transformations (e.g. titlecasing and quoting).
- Semantic concepts to be retrieved from a semantic feature model.
- Dependency handling (e.g. 1 noun affecting the inflection of another noun).
- Handling other variants of numbers, like the number of digits after the decimal point.

grhoten · 2020-09-16T09:36:58Z

Perhaps @nbouvrette, @stasm or @zbraniecki would like to start looking at this?

stasm

Thanks for starting this, @grhoten! I have a few questions inline to help me understand where this sits on the ladder of abstraction.

stasm · 2020-09-18T12:17:36Z

test/format/en_US.xml

+            <param name="relationship" type="String"><value>brother</value></param>
+        </params>
+        <source>I didn't find <var name="relationship" inflect="indefinite"/> in your contacts. What is your <var name="relationship" inflect="genitive"/> name?</source>
+        <print>I didn't find a brother in your contacts. What is your brotherʼs name?</print>


Where should variants like "a brother" and "brotherʼs" be defined? Is that something the implementation should provide? Or would there be another resource, editable by translators, where these terms would be defined?

grhoten

We have a default set of data: a lexical dictionary. We used to have a set based on Wiktionary, but we're switching away from it due to licensing issues around modification, and due to the completeness and cleanliness of the data. The lexical dictionary helps with all of the common edge cases. When a word is missing from the lexical dictionary, we use heuristics or machine learning to fill in the gaps. For English, the default rules are pretty simple.
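As a rough sketch of the "dictionary first, heuristics second" idea for English, assuming a tiny hand-written lexicon: `LEXICON` and `inflect` are hypothetical names for illustration, and the fallback rules are deliberately crude compared with a real lexical dictionary.

```python
# Hypothetical lexical dictionary: only words that break the default rules
# need entries here.
LEXICON = {
    "hour": {"indefinite": "an hour"},  # silent "h"
    "user": {"indefinite": "a user"},   # vowel letter, consonant sound
}

VOWELS = {"a", "e", "i", "o", "u"}


def inflect(word, feature):
    """Return the inflected form, preferring dictionary data over heuristics."""
    entry = LEXICON.get(word.lower(), {})
    if feature in entry:
        return entry[feature]
    if feature == "indefinite":
        # Fallback heuristic: choose the article from the first letter.
        article = "an" if word[:1].lower() in VOWELS else "a"
        return f"{article} {word}"
    if feature == "genitive":
        # U+02BC matches the apostrophe used in the quoted <print> line.
        # Crude rule: words ending in "s" get the apostrophe only.
        return word + ("\u02bc" if word.endswith("s") else "\u02bcs")
    raise ValueError(f"unsupported feature: {feature}")
```

With this sketch, `inflect("brother", "indefinite")` and `inflect("brother", "genitive")` produce the two forms used in the `<print>` line above.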

This also reduces the amount of effort to translate the common stuff. Think of it like translation memory, but it's available at runtime instead of at build time.

If the default is obscure, like a product name, invented word or a company name, you can create something like a semantic concept where you define all of the semantic features to get the words into grammatical agreement. We call this dialog metadata, and it's stored in our semantic feature model. Other parts of this working group have used the more generic term "data model".

I can put in a Russian example to highlight how to do it by hand and how complicated it is to get a number and noun into grammatical agreement. It's hard to find a language tougher than Russian. Arabic, Finnish and Turkish have their own difficult issues, but those complexities fit into an architecture that properly supports Russian.
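To give a sense of the by-hand version, here is a minimal sketch for integer quantities in a nominative context only. The category function follows the CLDR cardinal rules for Russian integers; `HOUR_FORMS` and `format_hours` are hypothetical names, and a real implementation would also have to handle other grammatical cases, gender and fractions.

```python
def russian_plural_category(n: int) -> str:
    """CLDR cardinal plural category for a non-negative integer in Russian."""
    mod10, mod100 = n % 10, n % 100
    if mod10 == 1 and mod100 != 11:
        return "one"
    if 2 <= mod10 <= 4 and not 12 <= mod100 <= 14:
        return "few"
    return "many"


# Forms of "час" (hour) after a numeral in a nominative context:
# one -> nominative singular, few -> genitive singular, many -> genitive plural.
HOUR_FORMS = {"one": "час", "few": "часа", "many": "часов"}


def format_hours(n: int) -> str:
    """Put the number and the noun into agreement."""
    return f"{n} {HOUR_FORMS[russian_plural_category(n)]}"
```

Note how the noun switches both case and number as the value changes: 1 and 21 take one form, 2 and 22 another, and 5, 11 and 112 a third.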

stasm · 2020-09-18T12:18:47Z

test/format/en_US.xml

+        <params>
+            <param name="object" type="String"><value>lights</value></param>
+        </params>
+        <source>The <var name="object"/> <switch value="object" feature="number">


Is the idea here that the implementation is capable of inspecting the object ("lights") and extracting some grammatical information from it (here, the grammatical number)?
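To make the question concrete, this is a sketch of the behavior being asked about. The contents of the `<switch>` are cut off in the excerpt above, so the "is"/"are" cases are an assumption, as are the names `KNOWN_NUMBER`, `grammatical_number` and `select`. It only handles single words, not phrases.

```python
# Hypothetical lexicon for words whose number cannot be read off the spelling.
KNOWN_NUMBER = {"glass": "singular", "people": "plural"}


def grammatical_number(word: str) -> str:
    """Guess the grammatical number of a single English noun."""
    key = word.lower()
    if key in KNOWN_NUMBER:
        return KNOWN_NUMBER[key]
    # Fallback heuristic: a trailing "s" suggests a plural.
    return "plural" if key.endswith("s") else "singular"


def select(word: str, cases: dict) -> str:
    """Mimic a switch on feature="number" by picking the case for the word."""
    return cases[grammatical_number(word)]


VERB = {"singular": "is", "plural": "are"}  # assumed switch cases
```

Under these assumptions, `select("lights", VERB)` picks the plural case and `select("light", VERB)` the singular one.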

stasm · 2020-09-18T12:21:30Z

test/format/en_US.xml

+    </test>
+    <test>
+        <params>
+            <param name="object" type="String"><value>light</value></param>


How does the translation of the word "light" get into this message?

grhoten

This is user vocabulary. I could have easily used the phrase "light on the front porch" or "Stanisław's party light". If you're dealing with something like HomeKit, the naming of various objects is up to the user.

It may not be translated. If it's translated, it would require additional infrastructure. It's stored in a SemanticFeatureModel, but I didn't define how it goes into a SemanticFeatureModel in these test cases. Fluent seems to have a concept similar to the translated phrases, but I think it's put into the same file as the messages.

I added Russian tests to highlight how a SemanticConcept would work as a quantity. I used Latin text to highlight that it's using the declared data instead of a lexical dictionary, how the number changes pronunciation and how the grammatical case changes as the value changes.

If I wanted to be complete with the testing in Russian, I'd vary the grammatical gender and test more of the speak lines.

I picked Russian because it's much more complicated than many other languages. See ru_RU.xml for details.

nbouvrette · 2020-09-18T21:34:39Z

Looks promising, @grhoten. Were you also considering adding tests to cover lists?

You also mentioned "numbers" in your original comment; I presume you also had something in mind for currencies?

Start message format test suite

18b8aaa

romulocintra requested review from DavidFatDavidF, echeran, mihnita and zbraniecki September 16, 2020 15:48

stasm reviewed Sep 18, 2020

Add support for quantity inflection, SemanticConcept and multiple constraints
Add Russian tests

6e8dd59

romulocintra approved these changes Mar 22, 2021

romulocintra merged commit 7e51216 into unicode-org:master Apr 19, 2021
