i18n of metadata values, in case JSON is chosen as a serialization format #1

llemeurfr · 2017-06-27T16:13:02Z

A question was raised during the first F2F meeting in NYC about the proper internationalization of UTF-8 metadata values (e.g. the book title).

I'll quote Ivan, from the minutes: "On the i18n side, we will need to be careful about ids, uris, iris, etc. w/respect to i18n char-sets. Another area we need to be careful about is metadata, which also have issues with the char-sets for the actual text content. One example is mixing bidi text in the metadata content."

Depending on the serialization format used for expressing publishing metadata, the issues we face may differ. But if a JSON (JSON-LD?) format is chosen, which issues may we face, and which i18n solutions are recommended by other W3C WGs?

llemeurfr · 2017-06-27T16:14:09Z

Two interesting resources on bidi-text and Unicode:
1- http://www.iamcal.com/understanding-bidirectional-text/
2- https://www.w3.org/International/articles/inline-bidi-markup/

llemeurfr · 2017-06-27T16:17:21Z

For bidi text, it appears that we may have to create a JSON "dir" attribute representing the global text direction applicable to the metadata by default.

Embedded in the text as special characters, "implicit marker characters" (Left-to-Right Mark and Right-to-Left Mark) will help tailor the direction of "neutral" characters (e.g. "!"), and "explicit markers" will describe a local text direction.
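
To make the two mechanisms concrete, here is a minimal Python sketch. It guesses a base direction from the first strong directional character (a common heuristic, not something this group has agreed on) and appends the matching implicit mark so that trailing neutral punctuation stays attached to the text. The function names and the `{"value", "dir"}` shape are hypothetical; no such "dir" member has been defined for the manifest.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (implicit marker)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK (implicit marker)


def base_direction(text: str) -> str:
    """Guess the base direction from the first strong directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew, Arabic and other RTL letters
            return "rtl"
        if bidi_class == "L":  # Latin, CJK and other LTR letters
            return "ltr"
    # Assumption: fall back to ltr when the string has no strong character.
    return "ltr"


def with_direction(value: str) -> dict:
    """Wrap a metadata value with a hypothetical 'dir' member."""
    return {"value": value, "dir": base_direction(value)}


def anchor_trailing_neutral(text: str) -> str:
    """Append the mark matching the base direction after trailing neutrals."""
    mark = RLM if base_direction(text) == "rtl" else LRM
    return text + mark
```

The first-strong heuristic fails for strings that start with a word in the "other" direction, which is one reason an explicit "dir" value supplied by the author would still be needed.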

lrosenthol · 2017-06-28T18:03:27Z

it's not just about bidi - you also have the more general problem of language identification.
Consider the case of a book with multiple (localized) titles - how do you encode that information?
Or worse, consider a multi-lingual title?

There is some work in this area for JSON-LD.

HadrienGardeur · 2017-07-02T20:51:24Z

In Readium-2 we already handle that case (multiple localizations for some strings).

Here's how we handle it for the title, for example:

"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}

Since we're using JSON-LD and include the proper info in our context document, this is correctly understood by JSON-LD clients:

schema:name "Twenty Thousand Leagues Under the Sea"@en, "Vingt mille lieues sous les mers"@fr, "海底二万里"@ja ;
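
As a rough illustration of that mapping, the following Python sketch turns a language map like the one above into language-tagged literals. It is not the Readium-2 code or a JSON-LD processor; `language_map_to_literals` is a hypothetical helper, and JSON string escaping is used only as an approximation of Turtle escaping. In JSON-LD itself, the equivalent behaviour presumably comes from declaring the term with `"@container": "@language"` in the context document.

```python
import json


def language_map_to_literals(lang_map: dict[str, str]) -> list[str]:
    """Render {"en": "..."} entries as '"..."@en' literals, sorted by tag."""
    return [
        f"{json.dumps(text, ensure_ascii=False)}@{lang}"
        for lang, text in sorted(lang_map.items())
    ]


title = {
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
    "ja": "海底二万里",
}

# Join the literals into a single statement for display.
print("schema:name " + ", ".join(language_map_to_literals(title)) + " ;")
```

Note that each literal carries exactly one language tag, so this structure covers multiple localized values but not a single value that mixes languages.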

lrosenthol · 2017-07-02T23:02:11Z

Yes, JSON-LD does have possible solutions to this - but that's why we need to keep this all in mind... However, that only handles half the problem - multiple values, each in its own language. What do you do when you have a single value in multiple languages?

HadrienGardeur · 2017-07-03T10:50:38Z

Representing multiple languages in a single string is a much bigger issue that we can't tackle on our own.

Unlike what the name of this issue implies, this IMO has nothing to do with JSON:

- the exact same issue exists today in XML with EPUB
- ... or most metadata formats

If you can't represent that info in a string, the problem is much bigger than the manifest:

- how would anyone store that info in a database (usually these fields are UTF-8 strings)?
- how would you transmit this info in an API?
- how would dedicated reading systems represent this info in-memory?

I think this falls under the "not our problem to solve" category that Ivan mentioned several times during the F2F.
We can participate in efforts to solve this problem with UTF-8, but we can't and shouldn't try to fix it on our own.

lrosenthol · 2017-07-03T11:07:05Z

Not true, Hadrien. Using a markup language like HTML or XML, I can break a single string into "spans" with langs defined on each... JSON (or similar) offers no such mechanism. That isn't to say that we shouldn't use JSON - I am simply pointing out an issue that we need to consider to be fully i18n... Any solution that relies solely on UTF-8, thinking that it solves its i18n problem, is built incorrectly.

HadrienGardeur · 2017-07-03T11:09:32Z

Sure, but who's using HTML to represent strings in a database or an API? Absolutely no one.

lrosenthol · 2017-07-03T11:12:28Z

In a database? Absolutely! For example, any good content search engine indexes HTML (or XML). Yes, for most REST-based APIs, it's JSON and UTF-8... an unfortunate choice on a variety of levels. And yes, we're not going to change that. But that doesn't make it correct.

llemeurfr · 2017-07-03T11:42:02Z

Re. databases: not really, Leonard. @HadrienGardeur is talking about databases (e.g. MySQL); you are moving to search engines (e.g. Solr). Some databases (SQL Server, Oracle, DB2) can handle XML fields in their recent versions, but others (MySQL, SQLite) don't. Most professional search engines don't index HTML or XML; the tags are stripped out before indexing. Note also that Elasticsearch imports JSON structures, not XML.

Re. the Web: a Web Publication must be adapted to ... the Web, i.e. browsers, and nowadays browsers don't handle XML perfectly.

We are dealing with property/value tuples in this discussion. If you want to promote mixed content as a core value type, you'll have the whole database/web community up in arms against the idea.

The problem I raised (i18n for metadata values) is currently not handled in EPUB 3. I suppose that the publishing industry was not so impatient to have it resolved before. So I agree with Hadrien that we should just express why it could be interesting to have a solution for this issue and which solution is offered by other W3C WGs.

IMHO, there are two main reasons why we would like proper internationalized metadata values:

- a mix of ltr and rtl words in a string (it is certainly a MUST)
- proper pronunciation of words by a TTS engine, from a string (IMO it is a "good to have" but not mandatory).

murata2makoto · 2017-07-12T23:04:14Z

In the case of the Japanese language, each human-readable text requires two representations: one in Kana only and one in Kanji.

For example, the Japanese National Diet Library uses

<dc:title>
   <rdf:Description>
     <rdf:value>国立国会図書館資料デジタル化の手引</rdf:value>
     <dcndl:transcription>コクリツ コッカイ トショカン シリョウ デジタルカ ノ テビキ</dcndl:transcription>
   </rdf:Description>
</dc:title>

where dcndl:transcription is Kana-only.

HadrienGardeur · 2017-07-12T23:57:46Z

@murata0204 in the Readium Web Publication Manifest this is supported for most strings. The only place where we can't use it yet is for the description.

murata2makoto · 2017-07-17T17:02:13Z

See Requirements for Language and Direction Metadata in Data Formats.

danielweck · 2017-07-17T17:31:27Z

Thank you Makoto.
Hadrien, what about:

"title": "<span lang='en-US' dir='ltr'>Moby Dick</span>"

?
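
For what it's worth, a consumer of such a value would have to parse the markup back into runs of text with their language and direction. The Python sketch below shows one way to do that with the standard library's `html.parser`; the class and function names are hypothetical, and it assumes well-nested, non-void elements, which a real reading system could not take for granted.

```python
from html.parser import HTMLParser


class LangRunParser(HTMLParser):
    """Collect (text, lang, dir) runs from a string with inline markup."""

    def __init__(self, default_lang=None, default_dir=None):
        super().__init__()
        # Stack of inherited (lang, dir) pairs; the bottom entry is the default.
        self.stack = [(default_lang, default_dir)]
        self.runs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        lang, direction = self.stack[-1]
        # An element inherits lang/dir unless it declares its own.
        self.stack.append((attr.get("lang", lang), attr.get("dir", direction)))

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        lang, direction = self.stack[-1]
        self.runs.append((data, lang, direction))


def parse_runs(value, default_lang=None, default_dir=None):
    parser = LangRunParser(default_lang, default_dir)
    parser.feed(value)
    parser.close()
    return parser.runs


print(parse_runs("<span lang='fr' dir='ltr'>Vingt mille lieues sous les mers</span>"))
```

The same function handles a single value in several languages, e.g. `parse_runs("Le <span lang='en'>weekend</span>", default_lang="fr")`, at the cost of requiring every consumer to run an HTML parser over what would otherwise be a plain string.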

HadrienGardeur · 2017-07-17T17:39:27Z

@danielweck not sure what you mean; as you know, we support both syntaxes in Readium-2.

danielweck · 2017-07-17T18:11:08Z

we have langcode to string mapping, but no dir, right?

iherman · 2018-03-02T10:38:51Z

Propose closure: Can be closed; mostly taken care of in #129, although the JSON serialization is still to be done.

iherman · 2018-03-13T09:04:07Z

Closing per https://www.w3.org/publishing/groups/publ-wg/Meetings/Minutes/2018/2018-03-12-minutes.html#resolution1

dauwhe mentioned this issue Jul 5, 2017

Information content of the abstract manifest #6

Closed

dauwhe added topic:internationalization topic:manifest labels Jul 5, 2017

GarthConboy mentioned this issue Jul 19, 2017

Associating a manifest with publication resources #13

Closed

mattgarrish added a commit that referenced this issue Aug 27, 2017

Merge pull request #1 from w3c/master

a0916b9

Revert "consolidates previous PRs #46, #47 and #49"

mattgarrish pushed a commit that referenced this issue Nov 11, 2017

Merge pull request #1 from prototypo/prototypo-patch-1

50f5272

Suggested an editors note in Section 4.5

llemeurfr mentioned this issue Feb 2, 2018

Proposed changes / additions to WAM #127

Closed

GarthConboy mentioned this issue Feb 14, 2018

Progression/direction between resources #126

Closed

iherman added the propose closing label Mar 2, 2018

iherman closed this as completed Mar 13, 2018

GarthConboy mentioned this issue Jun 12, 2018

yet another 'resource list' in the manifest? #225

Closed

ghost mentioned this issue Aug 28, 2018

indicate different metadata (including narrators/creators/format) for different format of resource/track #229

Closed

iherman mentioned this issue Oct 24, 2018

Duration of an audiobook #307

Closed

iherman mentioned this issue May 8, 2019

Manifest files need their own MIME Media Type (because canonicalization) #409

Closed

iherman mentioned this issue Jun 18, 2019

Should there be a TOC if supplemental materials are provided in an audio book? #408

Closed
