Serialization of natural language in data formats such as JSON [I18N] #178

aphillips · 2017-05-18T19:48:11Z

Hello TAG!

I'm requesting a TAG review of:

Name: Recommendations for language and direction attributes in data formats
This document is an explainer.
Primary contacts: [@aphillips][@r12a]

Further details (optional):

Relevant time constraints or deadlines: None specifically, but this has been a developing problem for us.

You should also know that...

The Internationalization WG has been commenting on data format specifications with increasing frequency over the past couple years in which we have noted the lack of natural language string types in formats such as JSON. We are concerned that there are internationalization gaps or, in an attempt to address our comments, non-interoperable and divergent implementation choices being made.

This issue is the result of an I18N WG action.

We would like the TAG's opinion on the problem and mooted solutions. The I18N WG chair (@aphillips) and Team contact (@r12a) can be available for consultation as needed.

We'd prefer the TAG provide feedback as (please select one):

open issues in our Github repo for each point of feedback
open a single issue in our Github repo for the entire review
leave review feedback as a comment in this issue and @-notify [@aphillips][@r12a]

aphillips · 2017-05-18T19:51:23Z

Please add the i18n-discuss label so that our tracking mechanism picks this up.

domenic · 2017-05-18T19:55:10Z

This seems related to the discussion currently happening in whatwg/webidl#358, where we're attempting to add a shared primitive to Web IDL that all specs can use, and getting stuck. The point of contention is basically whether the pattern should be

someAPI({
  lang: "...",
  dir: "...",
  label: "a string governed by the lang/dir",
  name: "another string, governed by the same lang/dir"
});

(the "Localizable base dictionary" solution)

or

someAPI({
  label: {
    lang: "...",
    dir: "...",
    value: "a string governed by the lang/dir"
  },
  name: "another string, using the default lang/dir"
});

(the "LocalizableString union typedef" solution).

The former makes it easier to say that all strings have the same lang/dir. The latter allows more granular decision making, at the cost of verbosity.

I suppose you could even have both.
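As a rough illustration of "having both", a consumer could treat top-level `lang`/`dir` as defaults and let an object-valued member override them. This is only a sketch: `resolveLocalizable` and the returned shape are hypothetical names, not anything defined by Web IDL or by the proposals in whatwg/webidl#358.

```javascript
// Hypothetical helper, not part of any spec under discussion.
// `dict` may carry top-level `lang`/`dir` defaults; each string member may be
// a plain string or a { value, lang, dir } object.
function resolveLocalizable(dict, key) {
  const member = dict[key];
  if (member === undefined) return undefined;
  if (typeof member === "string") {
    // Plain string: inherit the dictionary-level defaults, if any.
    return { value: member, lang: dict.lang ?? null, dir: dict.dir ?? null };
  }
  // Object form: per-string metadata wins, defaults fill the gaps.
  return {
    value: member.value,
    lang: member.lang ?? dict.lang ?? null,
    dir: member.dir ?? dict.dir ?? null,
  };
}

const options = {
  lang: "en",
  dir: "ltr",
  label: { lang: "he", dir: "rtl", value: "שלום עולם!" },
  name: "Hello World!",
};

const labelInfo = resolveLocalizable(options, "label");
const nameInfo = resolveLocalizable(options, "name");
console.log(labelInfo.lang, labelInfo.dir); // metadata carried by the string itself
console.log(nameInfo.lang, nameInfo.dir); // dictionary-level defaults
```

The cost of accepting both forms is that every consumer needs a branch like the one above, which is part of what makes the choice contentious.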

r12a · 2017-05-18T19:55:12Z

@aphillips it seems that this repo is not under the w3c/ domain, so i'm unable to set up the normal notifications and labels, and we won't get notifications to our list.

dbaron · 2017-06-06T15:53:29Z

One question we're thinking about is to what extent this can be solved by using only dir="auto" (plus LRM and RLM or similar) and language tags. The ergonomics of both options aren't great. I also wonder whether there's an alternative that could be plain text much of the time but could also be more markup-like when needed.

torgo · 2017-06-06T15:56:46Z

Discussed on the 06-06 call; the suggestion was that the I18N group should work with the WebIDL group.

cynthia · 2017-06-06T16:06:34Z

Solution 1: New Data Type
Create a new data type whose serialization optionally includes language and direction. Examples:

myLocalizedString: "Hello World!"@en^ltr
myLocalizedString_fr: "Bonjour monde !"@fr
myLocalizedString_ar: "مرحبا بالعالم!"@ar-eg^rtl
myLocalizedString_und: "שלום עולם!"^rtl
myLanguageNeutralString: "978-0-123-4567-X" // no language or direction for this non-natural-language string

There are too many parser implementations already out in the wild for this approach to be feasible. Since parsers that do not support this feature will fail on data with these tags present, this does not seem like a way forward.
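The compatibility problem can be seen with any existing JSON parser. The sketch below feeds the `@en^ltr` suffix from the examples above to the standard `JSON.parse`; `tryParse` is just a hypothetical wrapper for reporting the outcome.

```javascript
// Solution 1 syntax from the examples above, embedded in otherwise ordinary JSON.
const tagged = '{ "myLocalizedString": "Hello World!"@en^ltr }';
const plain = '{ "myLocalizedString": "Hello World!" }';

// Hypothetical wrapper that reports success or the error type.
function tryParse(text) {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.name };
  }
}

console.log(tryParse(plain).ok); // the untagged document parses
console.log(tryParse(tagged)); // the tagged document is rejected outright
```

An unmodified parser does not skip the unknown suffix; it rejects the whole document.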

We did briefly touch on http://unicode.org/faq/languagetagging.html during the call, in case that would be an option.

aphillips · 2017-06-06T16:35:15Z

dir="auto" is not a panacea. The first strong characters in a string may be left-to-right and fool the algorithm.

My concern here is that this requires the addition of LRM/RLM markers to data---data that may not be owned by the process assembling the wire format or that may have a field length restriction expressed in characters, code units, or bytes, etc. Adopting auto semantics and requiring the markers introduces (possibly cascading) data change. It also requires, in some cases, developers to introduce more markers into text, as when assembling messages.
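Both concerns can be made concrete with a small sketch. The detector below is only an approximation of first-strong detection: it treats any letter in a short, incomplete list of right-to-left scripts as strong RTL and any other letter as strong LTR, whereas the real algorithm uses Unicode bidi classes. The function names and the sample string are illustrative assumptions.

```javascript
// Approximation only: real first-strong detection uses Unicode Bidi_Class,
// which JavaScript regular expressions do not expose.
const LETTER = /\p{L}/u;
const RTL_SCRIPT = /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}]/u;

function firstStrongDirection(text) {
  for (const ch of text) {
    if (ch === "\u200E") return "ltr"; // LRM counts as a strong LTR character
    if (ch === "\u200F") return "rtl"; // RLM counts as a strong RTL character
    if (LETTER.test(ch)) return RTL_SCRIPT.test(ch) ? "rtl" : "ltr";
  }
  return null; // no strong character found
}

function utf8Length(text) {
  return new TextEncoder().encode(text).length;
}

// A Hebrew sentence that happens to open with a Latin-script acronym.
const title = "HTML היא שפת סימון";
const marked = "\u200F" + title; // same text with an RLM prepended

console.log(firstStrongDirection(title)); // fooled by the leading "HTML"
console.log(firstStrongDirection(marked)); // corrected by the marker
console.log(marked.length - title.length); // extra UTF-16 code units
console.log(utf8Length(marked) - utf8Length(title)); // extra UTF-8 bytes
```

The marker fixes the detection, but the stored value is now one code unit (three UTF-8 bytes) longer, which matters when a field has a length limit or the data belongs to someone else.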

aphillips · 2017-06-06T16:38:27Z

@cynthia The Unicode language tagging characters are deprecated and it is a Bad Idea to use them. We knew coming in that a new data type was more-or-less a non-starter unless/until our request for the keys to the time machine comes through.

@torgo Thanks for the update. We'll reach out to WebIDL.

r12a · 2017-06-13T17:05:51Z

All, wrt using Unicode formatting characters to establish direction, please read the other docs that Addison links – in particular http://w3c.github.io/i18n-discuss/notes/string-base-direction.html – where we try to enumerate the pros and cons of various approaches. (@aphillips we should probably make it a bit clearer that folks should read those docs to get a better basis for discussion.)

@domenic it's useful to be able to apply the same lang/dir metadata to multiple strings without repeating the metadata, if that's possible; however, it's certainly easy to imagine situations where different assignments are needed for particular strings (eg. in the case of a set of alternative translations for an error message, where one string is in english, and another in hebrew).

hth

cynthia · 2017-06-20T04:58:51Z

The Unicode language tagging characters are deprecated and it is a Bad Idea to use them. We knew coming in that a new data type was more-or-less a non-starter unless/until our request for the keys to the time machine comes through.

Understood. It was just a note that the topic came up during the call.

dbaron · 2017-06-26T17:22:38Z

So let me present what I'm concerned about here in a little more detail. My basic concern is that there really don't seem to be any good options:

the new data type seems to be a non-starter in terms of compatibility (e.g., parsing, etc.)
the use of dictionaries makes things harder for developers using the API, for implementors of the API, and for specification authors (increasing both the amount of work and the risk of errors), and:
- if the use of a dictionary rather than a string is optional, the handling of dictionaries is frequently going to be wrong in both specs and implementations.
- if the use of a dictionary is not optional, it adds a good bit of extra overhead for developers in the common case where they're not going to be interested in adding language or direction information.

One of the pieces of advice from i18n in the past was that text that should be presented to users should be markup rather than attribute values, so that when needed it could allow elements within it (for things like language and direction, ruby, etc.). I also wonder whether this sort of advice could be extended here, i.e., whether we should be encouraging the use of HTML rather than text.

dbaron · 2017-07-07T19:55:22Z

(And if we wanted to encourage HTML, would it be a subset of HTML, or arbitrary HTML?)

aphillips · 2017-07-07T20:21:37Z

Thanks @dbaron. While, in general, markup is a Good Thing for this, at the same time the point of using JSON and other data languages is the transmission of "unrendered" data. Let me give a concrete use case.

Suppose that in my day job I am building a Web page to show a customer's library of e-books. The e-books exist in a catalog of data and consist of the usual data values. It might look something like:

{
    "id": "978-0-1234-5678-X",
    "title": "Moby Dick",
    "authors": [ "Herman Melville" ],
    "language": "en-US",
    "pubDate": "1851-10-18",
    "publisher": "Mark Twain Press",
    "coverImage": "https://example.com/images/mobidick_cover.jpg",
    // etc.
},

Each of the above is a data field in a database somewhere. Now, because I know I need it, I have language and direction information for each of the textual fields also in my database. I even have stuff like a pronunciation field for title and author (for sorting Chinese and Japanese). Those are just data fields. Do I really want to serialize them as HTML:

   "title": "<span lang='en-US' dir='ltr'>Moby Dick</span>"

After all, I may not end up displaying the title field in an HTML context! My JSON might very well be used to populate, say, the device's local data store, which uses native controls to show the title.

I'd also argue that:

a good bit of extra overhead for developers in the common case where they're not going to be interested in adding language or direction information

... is probably wrong. The common case where you don't want language or direction information is for non-language-bearing fields (isbn). Omitting the information for language-bearing fields is basically an I18N bug (yeah, being a pedant here).
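One way to picture the "unrendered data" argument: if the title travels as plain data plus metadata, each consumer can decide how to present it. The sketch below assumes a `{ value, lang, dir }` field shape; `toHtmlSpan`, `toNativeLabel`, and the native label's property names are hypothetical.

```javascript
// Escape text for use in HTML content and double-quoted attribute values.
function escapeHtml(text) {
  const entities = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;" };
  return text.replace(/[&<>"']/g, (c) => entities[c]);
}

// HTML consumer: produce markup only at display time.
function toHtmlSpan(field) {
  return `<span lang="${escapeHtml(field.lang)}" dir="${escapeHtml(field.dir)}">${escapeHtml(field.value)}</span>`;
}

// Non-HTML consumer: hand the same data to a (hypothetical) native control.
function toNativeLabel(field) {
  return { text: field.value, locale: field.lang, rightToLeft: field.dir === "rtl" };
}

const titleField = { value: "Moby Dick", lang: "en-US", dir: "ltr" };
console.log(toHtmlSpan(titleField));
console.log(toNativeLabel(titleField));
```

Serializing the span into the JSON instead would force the native consumer to parse HTML just to recover the title.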

dwsinger · 2017-07-07T20:38:52Z

Not to be a pedant, but if you separate the language tag from the other fields, don't you introduce ambiguity or risk of error?

{
"id": "978-0-1234-5678-X",
"title": "Quo Vadis",
"authors": [ "Henryk Sienkiewicz" ],
"language": "la",
"pubDate": "1895-10-18",
// etc.
},

The title is indeed in Latin. The book was originally written in Polish. But maybe this edition is in some other language. This becomes particularly problematic if two fields need different tagging (here, the author's name might be tagged as "pl" -- Polish).

aphillips · 2017-07-07T21:05:34Z

@dwsinger Exactly so. The book language(s) (the language(s) of the intended audience) might be (often are) different from the language of the title or the author. The language field really is wrongly ambiguous, given that each field (title, author, publisher name) needs language and direction metadata.
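A sketch of what per-field metadata might look like for the "Quo Vadis" record follows. The `{ value, lang, dir }` shape, the `contentLanguage` key (with "en" as a made-up edition language), and the `languagesUsed` helper are all assumptions for illustration, not a recommended schema.

```javascript
const book = {
  id: "978-0-1234-5678-X",
  title: { value: "Quo Vadis", lang: "la", dir: "ltr" },
  authors: [{ value: "Henryk Sienkiewicz", lang: "pl", dir: "ltr" }],
  contentLanguage: "en", // hypothetical: language of this edition's text
  pubDate: "1895-10-18",
};

// Collect the distinct languages attached to language-bearing fields.
function languagesUsed(record) {
  const langs = new Set();
  const visit = (v) => {
    if (Array.isArray(v)) {
      v.forEach(visit);
    } else if (v && typeof v === "object" && typeof v.value === "string" && v.lang) {
      langs.add(v.lang);
    }
  };
  Object.values(record).forEach(visit);
  return [...langs].sort();
}

console.log(languagesUsed(book)); // the title and the author carry different languages
```

With the metadata attached to each field, the language of the book's content no longer has to double as the language of its title or author name.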

hsivonen · 2017-07-10T20:07:06Z

Solution 1, which would require changes to JSON itself, isn't practical, because changing all JSON parsers would be too much of an ocean-boiling effort.

I think Solution 2 potentially with bidi control characters within string values is workable.

This seems related to the discussion currently happening in whatwg/webidl#358, where we're attempting to add a shared primitive to Web IDL that all specs can use, and getting stuck. The point of contention is basically whether the pattern should be
someAPI({
 lang: "...",
 dir: "...",
 label: "a string governed by the lang/dir",
 name: "another string, governed by the same lang/dir"
});
(the "Localizable base dictionary" solution)

or
someAPI({
 label: {
   lang: "...",
   dir: "...",
   value: "a string governed by the lang/dir"
 },
 name: "another string, using the default lang/dir"
});
(the "LocalizableString union typedef" solution).

The former makes it easier to say that all strings have the same lang/dir. The latter allows more granular decision making, at the cost of verbosity.

I would expect the former to face less resistance, because it just adds some key-value pairs without forcing a reorganization of a given JSON-based format compared to its lang/dir-unaware version. Moreover, considering JSON from the perspective of developers trying to escape XML, the added nesting/complexity of the latter would probably not be well received. Therefore, I think pushing the latter as the only option wouldn't be productive.

A third option would be:

someAPI({
  label_lang: "...",
  label_dir: "...",
  label: "a string governed by label_lang/label_dir",
  name_lang: "...",
  name_dir: "...",
  name: "another string, governed by name_lang/name_dir"
});
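For comparison with the other two patterns, a consumer of the suffixed-key form might read it roughly as follows. The `_lang`/`_dir` suffixes come from the example above; `readSuffixed` and the returned shape are hypothetical.

```javascript
// Hypothetical reader for the suffixed-key pattern.
function readSuffixed(obj, key) {
  const value = obj[key];
  if (typeof value !== "string") return undefined;
  return {
    value,
    lang: obj[`${key}_lang`] ?? null, // null when no metadata key is present
    dir: obj[`${key}_dir`] ?? null,
  };
}

const args = {
  label_lang: "ar-EG",
  label_dir: "rtl",
  label: "مرحبا بالعالم!",
  name: "978-0-1234-5678-X",
};

console.log(readSuffixed(args, "label"));
console.log(readSuffixed(args, "name"));
```

The structure stays flat, but the format has to reserve the suffixes so that a real field named, say, `label_lang` cannot be mistaken for metadata.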

whether we should be encouraging the use of HTML rather than text

I think using HTML in JSON makes sense for strings that carry multi-paragraph text with inline formatting (i.e. something that would make sense inside HTML <body>), but I think it wouldn't be good to recommend markup inside JSON for strings that are closer to HTML <title>, email subject line, name of a person, a GUI label, invoice/inventory line item, etc.

Even though HTML parsers are now widely available, a plain-text string is a significantly simpler thing for the consumer's data model to deal with than a tree rooted at DOM DocumentFragment or equivalent in a non-DOM markup tree API.

People use JSON instead of XML to avoid various complexities of XML and to use a format that maps nicely to and from basic programming language data structures. Making shortish plainish strings (not just ones representing multi-paragraph text with inline formatting) in JSON potentially carry markup would defeat both avoiding XML mixed content complexity and having a format that maps nicely to and from basic programming language data structures.

When a JSON-based format wouldn't use markup in strings for non-bidi reasons, to the extent a base direction taken from an adjacent key-value pair isn't enough, I think finer-grained bidi control should use the bidi control characters instead of importing the full data model complexity of markup for every (human-readable) string.
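As a sketch of what finer-grained control with bidi control characters could look like, the helper below wraps an embedded run in Unicode isolate characters. Choosing isolates (LRI, RLI, FSI, PDI) rather than the older embedding controls is an assumption here, and `isolate` is a hypothetical name.

```javascript
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const RLI = "\u2067"; // RIGHT-TO-LEFT ISOLATE
const FSI = "\u2068"; // FIRST STRONG ISOLATE (direction unknown)
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Wrap a run of text so its direction does not leak into the surrounding string.
function isolate(text, dir) {
  const open = dir === "rtl" ? RLI : dir === "ltr" ? LRI : FSI;
  return open + text + PDI;
}

const inner = "שלום עולם!";
const message = `Title: ${isolate(inner, "rtl")} (3 copies)`;

console.log(message.length - `Title: ${inner} (3 copies)`.length); // two extra code units
```

The value stays a plain string for the consumer's data model, at the price of two invisible characters per isolated run.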

(Whereas bidi is intrinsic to whole scripts, ruby is a sometimes-used (relatively rarely-used even) typographical device for the scripts with which it is used, so I think in cases where bidi doesn't justify the complexity of markup, ruby doesn't, either.)

travisleithead · 2017-07-11T16:21:29Z

A lot of great points have been made in this thread. I'm personally not convinced that there is any single "right" solution.

It seems that if we want to create a compact data representation for strings, associating lang, direction, and other meta-data about the string, then this encoding has to be as maximally-portable across systems as possible, which leads me to think it must be some new representation of a string literal. That of course, is asking for a huge change across all programming environments and applications--not likely to happen, but neat to dream about--or even start some activity there, perhaps in Unicode.

For a serialization memory layout that associates lang, direction, etc., metadata about strings, I don't offer a strong opinion, though I have a weak opinion: keep it simple, or it will likely be too much of a burden to get much traction. For example, I think a simple dictionary with fields at parallel depths would work fine for most applications, e.g., { lang: .., dir: .., stringvalue: ... }.

aphillips · 2017-07-11T16:35:58Z

@travisleithead I tend to agree. It would have been nice to address this in the past, but we're here now.

However, building recommended patterns and best practices would allow specs to be consistent and interoperate well.

r12a · 2017-07-17T17:28:18Z

The common case where you don't want language or direction information is for non-language-bearing fields (isbn).

Watch out though, apparently harmless data may not actually be so. The isbn field will only display correctly if it is isolated when displayed in a RTL target context and treated as LTR inside that isolated area. For example, if i just drop the text into a field on a RTL page without any precautions, i get:

rather than what i really want, which is:

However, if the value were a range, such as 100-300, the first arrangement would actually be what's wanted in Arabic (though not Hebrew, and i'm not sure about N'Ko), otherwise the range would appear to be decreasing instead of increasing.

So it may actually be useful to have some direction information for isbn numbers, MAC addresses, telephone numbers, etc.

dbaron · 2017-07-17T18:00:10Z

Though in those cases there's a tradeoff between having the direction data in the text data, versus having the application have knowledge of the correct way to present the particular field, since there is a correct and simple per-field algorithm (although it's not particularly simple to have tens or hundreds of them). One is easier for the producer of the data and the other is easier for the consumer.

This is different from cases where you basically have to have the direction data stored in the text because you can't trivially derive it from the text.
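The consumer-side option could be sketched as a per-field table that the application owns. The field names, the table contents, and `presentField` are hypothetical; the sketch isolates known fields with LRI and PDI and leaves everything else untouched.

```javascript
// Hypothetical per-field presentation rules owned by the consuming application.
const FIELD_DIRECTION = new Map([
  ["isbn", "ltr"],
  ["macAddress", "ltr"],
  ["phone", "ltr"],
]);

function presentField(name, value) {
  const dir = FIELD_DIRECTION.get(name);
  if (dir === undefined) return value; // no rule: direction must come from the data
  const open = dir === "rtl" ? "\u2067" : "\u2066"; // RLI or LRI
  return open + value + "\u2069"; // PDI closes the isolate
}

console.log(JSON.stringify(presentField("isbn", "978-0-1234-5678-X")));
console.log(JSON.stringify(presentField("title", "Quo Vadis")));
```

A fixed table like this cannot express the range case mentioned earlier, where the desired arrangement differs between Arabic and Hebrew, so such rules may need to vary by locale as well as by field.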

r12a · 2018-10-31T15:00:06Z

@dbaron I recently added a section about the use of script subtags for guessing bidi info. In addition, we noticed that some people were reading this document and not catching some of the key messages, so i have a plan to summarize and simplify the text. That work is in progress, waiting for some time to become available so that i can complete it. That might be a good time for review (?)
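A minimal sketch of guessing base direction from a script subtag follows. It only looks at an explicit script subtag in the language tag (via the standard `Intl.Locale`), and the list of right-to-left scripts is illustrative and incomplete. `directionFromLanguageTag` is a hypothetical name, and this is not necessarily the approach described in the document's new section.

```javascript
// Illustrative, incomplete list of ISO 15924 codes for right-to-left scripts.
const RTL_SCRIPTS = new Set(["Arab", "Hebr", "Syrc", "Thaa", "Nkoo", "Adlm"]);

function directionFromLanguageTag(tag) {
  let script;
  try {
    script = new Intl.Locale(tag).script; // undefined when the tag has no script subtag
  } catch {
    return null; // not a well-formed language tag
  }
  if (!script) return null; // no explicit script subtag: make no guess
  return RTL_SCRIPTS.has(script) ? "rtl" : "ltr";
}

console.log(directionFromLanguageTag("az-Arab")); // explicit Arabic script
console.log(directionFromLanguageTag("sr-Latn")); // explicit Latin script
console.log(directionFromLanguageTag("ar")); // no script subtag present
```

This yields only a base direction for the string as a whole, and only when the tag actually names a script.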

kenchris · 2018-12-11T21:18:20Z

Related: w3c/manifest#676

torgo · 2019-01-15T21:28:40Z

We agreed to put this on the agenda for the next f2f and close it off somehow.

aphillips · 2019-01-15T21:54:42Z

@torgo Thanks. Would you like to invite my/Richard's participation? We can have an updated version of our doc ready, if the date is 2019-02-05. Is that the target date?

aphillips · 2019-02-02T21:26:30Z

Note that the I18N WG resolved to publish our document of best practice recommendations as FPWD in our last teleconference. The current editor's copy is here: https://w3c.github.io/string-meta/

I suggest that TAG either adopt our best practices or provide feedback on changes (that we can incorporate). I did not receive a reply to my previous question about TAG f2f participation, btw.

dbaron · 2019-02-04T05:25:53Z

I just took a look at the document and filed two issues (above); I'm more concerned about the second one.

Regarding discussion at the meeting; I think the chairs would like us to just stop cycling back to this issue as a group, and I think I agree that it doesn't need attention from the whole TAG, but probably @cynthia and I can continue to provide feedback on the document if needed.

cynthia · 2019-04-03T04:42:30Z

I believe we discussed this in a previous call and were happy to close it off and follow up on the issues in the group's tracker. I'll close this for now; thanks a lot for the long discussion, and we hope to hear more from i18n in the future.

(Please re-open if I got the summary of our last discussion wrong).

torgo added this to the tag-telcon-2017-06-06 milestone Jun 6, 2017

torgo assigned dbaron Jun 6, 2017

torgo modified the milestones: tag-telcon-2017-06-20, tag-telcon-2017-06-06 Jun 6, 2017

dbaron modified the milestones: tag-telcon-2017-07-11, tag-telcon-2017-06-20 Jun 20, 2017

torgo added the "Progress: pending external feedback" label (the TAG is waiting on response to comments/questions asked by the TAG during the review) Jun 27, 2017

torgo modified the milestones: tag-f2f-london-2017-07-25, tag-telcon-2017-07-11 Jul 25, 2017

torgo added the extra time label Jul 25, 2017

torgo modified the milestones: 2018-01-31-f2f-london, 2018-11-20-telcon Oct 30, 2018

BigBlueHat mentioned this issue Nov 6, 2018

Proposal for handling localizable texts (writeup of the F2F discussions) w3c/wpub#354

Closed

torgo modified the milestones: 2018-11-20-telcon, 2018-12-04-telcon, 2018-12-11-telcon Nov 28, 2018

plinss modified the milestones: 2018-12-11-telcon, 2019-01-15-telcon Dec 11, 2018

torgo modified the milestones: 2019-01-15-telcon, 2019-02-05-f2f Jan 15, 2019

This was referenced Feb 4, 2019

section 4 (approaches for language) and section 5 (approaches for base direction) identify recommended practices differently w3c/string-meta#22

Closed

section on document-level @language and @dir has confusing examples w3c/string-meta#23

Open

torgo modified the milestones: 2019-02-05-f2f, 2019-02-26-telcon Feb 5, 2019

travisleithead removed their assignment Feb 5, 2019

plinss modified the milestones: 2019-02-26-telcon, 2019-03-12-telcon Feb 26, 2019

plinss modified the milestones: 2019-03-12-telcon, 2019-03-19-telcon Mar 13, 2019

plinss modified the milestones: 2019-03-19-telcon, 2019-04-02-telcon Mar 25, 2019

cynthia closed this as completed Apr 3, 2019
