i18n of metadata values, in case JSON is chosen as a serialization format #1

Closed
llemeurfr opened this issue Jun 27, 2017 · 18 comments

Comments

@llemeurfr
Contributor

A question was raised during the first F2F meeting in NYC about the proper internationalization of UTF-8 metadata values (e.g. the book title).

I'll quote Ivan, from the minutes: "On the i18n side, we will need to be careful about ids, uris, iris, etc. w/respect to i18n char-sets. Another area we need to be careful about is metadata, which also have issues with the char-sets for the actual text content. One example is mixing bidi text in the metadata content."

Depending on the serialization format used for expressing publishing metadata, the issues we face may differ. But if a JSON (JSON-LD?) format is chosen, what issues might we face, and which i18n solutions do other W3C WGs recommend?

@llemeurfr
Contributor Author

For bidi text, it appears that we may have to create a JSON "dir" attribute representing the global text direction applicable to the metadata by default.

Embedded in the text as special characters, "implicit marker characters" (Left-to-Right Mark and Right-to-Left Mark) will help tailor the direction of "neutral" characters (e.g. "!"), and "explicit markers" will describe a local text direction.
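
To make the idea concrete, here is a minimal Python sketch. The "dir" key and the with_base_direction helper are hypothetical names used only for illustration; nothing here is an agreed manifest format. U+200E and U+200F are the Unicode LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.

```python
import json

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (implicit marker)
RLM = "\u200f"  # RIGHT-TO-LEFT MARK (implicit marker)


def with_base_direction(metadata: dict, direction: str) -> dict:
    """Return a copy of the metadata with a hypothetical global "dir" key."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    return {"dir": direction, **metadata}


# Hebrew title ending in a neutral "!". The trailing RLM places the "!"
# between two strong right-to-left characters, so it stays with the
# right-to-left run even if the string is shown in a left-to-right context.
title = "שלום!" + RLM

manifest = with_base_direction({"title": title}, "rtl")

# ensure_ascii=False keeps the text and the marker as raw UTF-8 characters.
serialized = json.dumps(manifest, ensure_ascii=False)
restored = json.loads(serialized)
```

The marker is an ordinary character in the string, so it survives a JSON round trip without any special handling; only the global direction needs a dedicated key.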

@lrosenthol

it's not just about bidi - you also have the more general problem of language identification.
Consider the case of a book with multiple (localized) titles - how do you encode that information?
Or worse, consider a multi-lingual title?

There is some work in this area for JSON-LD.

@HadrienGardeur

In Readium-2 we already handle that case (multiple localizations for some strings).

Here's how we handle it for title for example:

"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}

Since we're using JSON-LD and include the proper info in our context document, this is correctly understood by JSON-LD clients:

schema:name "Twenty Thousand Leagues Under the Sea"@en, "Vingt mille lieues sous les mers"@fr, "海底二万里"@ja ;
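
For readers unfamiliar with the mechanism, the mapping above relies on JSON-LD language maps ("@container": "@language"). The Python sketch below mimics that expansion by hand. The context shown is an assumption about what such a context document might contain, not a copy of the Readium-2 one, and to_tagged_literals is a hypothetical helper that only handles simple strings.

```python
import json

# Assumed term definition: "title" is schema:name, stored as a language map.
context = {
    "title": {"@id": "http://schema.org/name", "@container": "@language"}
}

title = {
    "fr": "Vingt mille lieues sous les mers",
    "en": "Twenty Thousand Leagues Under the Sea",
    "ja": "海底二万里",
}


def to_tagged_literals(term: str, value: dict, context: dict):
    """Expand a language map into (property IRI, language-tagged literals)."""
    definition = context[term]
    if definition.get("@container") != "@language":
        raise ValueError(f"{term!r} is not defined as a language map")
    # json.dumps is used only to quote the text; sorted() fixes the order.
    literals = [
        f"{json.dumps(text, ensure_ascii=False)}@{lang}"
        for lang, text in sorted(value.items())
    ]
    return definition["@id"], literals


prop, literals = to_tagged_literals("title", title, context)
print("schema:name " + ", ".join(literals) + " ;")
# prints: schema:name "Twenty Thousand Leagues Under the Sea"@en, "Vingt mille lieues sous les mers"@fr, "海底二万里"@ja ;
```

A real client would use a JSON-LD processor rather than this hand-rolled expansion; the point is only that each key of the object becomes the language tag of one literal.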

@lrosenthol

lrosenthol commented Jul 2, 2017 via email

@HadrienGardeur

Representing multiple languages in a single string is a much bigger issue that we can't tackle on our own.

Unlike what the name of this issue implies, this IMO has nothing to do with JSON:

  • the exact same issue exists today in XML with EPUB
  • ... or most metadata formats

If you can't represent that info in a string, the problem is much bigger than the manifest:

  • how would anyone store that info in a database (usually these fields are UTF-8 strings)?
  • how would you transmit this info in an API?
  • how would dedicated reading systems represent this info in-memory?

I think this falls under the "not our problem to solve" category that Ivan mentioned several times during the F2F.
We can participate in efforts to solve this problem with UTF-8, but we can't and shouldn't try to fix it on our own.

@lrosenthol

lrosenthol commented Jul 3, 2017 via email

@HadrienGardeur

Sure, but who's using HTML to represent strings in a database or an API? Absolutely no one.

@lrosenthol

lrosenthol commented Jul 3, 2017 via email

@llemeurfr
Contributor Author

Re. databases: not really, Leonard. @HadrienGardeur is talking about databases (e.g. MySQL); you moved on to search engines (e.g. Solr). Some databases (SQL Server, Oracle, DB2) can handle XML fields in their recent versions, but others (MySQL, SQLite) can't. Most professional search engines don't index HTML or XML; the tags are stripped out before indexing. Note also that ElasticSearch imports JSON structures, not XML.

Re. the Web: Web Publications must be adapted to ... the Web, i.e. to browsers, and nowadays browsers don't handle XML perfectly.

We are dealing with property/value tuples in this discussion. If you want to promote mixed content as a core value type, you'll have the whole database/web community up in arms ("vent debout") against the idea.

The problem I raised (i18n for metadata values) is currently not handled in EPUB 3. I suppose that the publishing industry was not so impatient to have it resolved before. So I agree with Hadrien that we should just express why it could be interesting to have a solution for this issue and which solution is offered by other W3C WGs.

IMHO, there are two main reasons why we would like proper internationalized metadata values:

  • mix of ltr and rtl words in a string (it is certainly a MUST)
  • proper pronunciation of words by a TTS engine, from a string (IMO it is a "good to have" but not mandatory).

@murata2makoto

murata2makoto commented Jul 12, 2017

In the case of the Japanese language, each human-readable text requires two representations: one in Kana only and one in Kanji.

For example, the Japanese National Diet Library uses

<dc:title>
   <rdf:Description>
     <rdf:value>国立国会図書館資料デジタル化の手引</rdf:value>
     <dcndl:transcription>コクリツ コッカイ トショカン シリョウ デジタルカ ノ テビキ</dcndl:transcription>
   </rdf:Description>
</dc:title>

where dcndl:transcription is Kana-only.
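
As a rough illustration of how a consumer could read both representations, here is a Python sketch using the standard-library xml.etree.ElementTree. The fragment above does not declare its namespaces, so the URIs below are supplied by the sketch; the dc and rdf URIs are the standard ones, while the dcndl URI is an assumption and should be checked against the NDL documentation.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcndl": "http://ndl.go.jp/dcndl/terms/",  # assumed namespace URI
}

# Same fragment as above, with namespace declarations added so it parses.
xml_source = f"""
<dc:title xmlns:dc="{NS['dc']}" xmlns:rdf="{NS['rdf']}" xmlns:dcndl="{NS['dcndl']}">
   <rdf:Description>
     <rdf:value>国立国会図書館資料デジタル化の手引</rdf:value>
     <dcndl:transcription>コクリツ コッカイ トショカン シリョウ デジタルカ ノ テビキ</dcndl:transcription>
   </rdf:Description>
</dc:title>
"""

root = ET.fromstring(xml_source.strip())
description = root.find("rdf:Description", NS)

# Keep the display form and the Kana-only reading side by side.
title = {
    "value": description.findtext("rdf:value", namespaces=NS),
    "transcription": description.findtext("dcndl:transcription", namespaces=NS),
}
```

The resulting pair is what any JSON serialization would also need to carry: one string for display and one Kana-only string, typically useful for sorting and reading aloud.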

@HadrienGardeur

@murata0204 in the Readium Web Publication Manifest this is supported for most strings. The only place where we can't use it yet is for the description.

@murata2makoto

@danielweck
Member

danielweck commented Jul 17, 2017

Thank you Makoto.
Hadrien, what about:

"title": "<span lang='en-US' dir='ltr'>Moby Dick</span>"

?
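
For what it's worth, a consumer could recover both the plain text and the lang/dir hints from a value like this. A minimal Python sketch with the standard-library html.parser follows; InlineMarkupExtractor is a hypothetical class, and it assumes the value contains at most simple span wrappers.

```python
from html.parser import HTMLParser


class InlineMarkupExtractor(HTMLParser):
    """Collect lang/dir attributes of <span> tags and the plain text."""

    def __init__(self):
        super().__init__()
        self.hints = {}
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "span":
            for name, value in attrs:
                if name in ("lang", "dir"):
                    self.hints[name] = value

    def handle_data(self, data):
        self.text_parts.append(data)


parser = InlineMarkupExtractor()
parser.feed("<span lang='en-US' dir='ltr'>Moby Dick</span>")
parser.close()

plain_text = "".join(parser.text_parts)  # what a tag-stripping indexer would keep
hints = parser.hints                     # language and direction metadata
```

This also shows the cost raised earlier in the thread: every consumer, including databases and search indexers, has to parse markup before it can use the string.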

@HadrienGardeur

@danielweck not sure what you mean; as you know, we support both syntaxes in Readium-2.

@danielweck
Member

we have langcode to string mapping, but no dir, right?

mattgarrish added a commit that referenced this issue Aug 27, 2017
Revert "consolidates previous PRs #46, #47 and #49"
mattgarrish pushed a commit that referenced this issue Nov 11, 2017
Suggested an editors note in Section 4.5
@iherman
Member

iherman commented Mar 2, 2018

Propose closure: Can be closed; mostly taken care of in #129, although the JSON serialization is still to be done.

@iherman
Member

iherman commented Mar 13, 2018
