i18n of metadata values, in case JSON is chosen as a serialization format #1
Comments
Two interesting resources on bidi text and Unicode: |
For bidi text, it appears that we may have to create a JSON "dir" attribute representing the global text direction that applies to the metadata by default. Embedded in the text as special characters, "implicit marker characters" (Left-to-Right Mark, U+200E, and Right-to-Left Mark, U+200F) help tailor the direction of "neutral" characters (e.g. "!"), and "explicit markers" describe a local text direction. |
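A minimal sketch of what such a manifest fragment could look like, assuming a hypothetical top-level "dir" property (neither the property name nor its placement comes from any spec). The value mixes RTL Hebrew text with a neutral "!" and an LTR parenthetical, wrapped in Left-to-Right Marks (\u200E) so the neutral characters resolve correctly:

```json
{
  "dir": "rtl",
  "title": "\u05E9\u05DC\u05D5\u05DD \u200E(Hello)\u200E!"
}
```

The global "dir" sets the base direction for the whole value, while the embedded marks override it locally — the two mechanisms described in the comment above.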
It's not just about bidi - you also have the more general problem of language identification. There is some work in this area for JSON-LD. |
In Readium-2 we already handle that case (multiple localizations for some strings). Here's how we handle it for title, for example:
"title": {
  "fr": "Vingt mille lieues sous les mers",
  "en": "Twenty Thousand Leagues Under the Sea",
  "ja": "海底二万里"
}
Since we're using JSON-LD and include the proper info in our context document, this is correctly understood by JSON-LD clients:
schema:name "Twenty Thousand Leagues Under the Sea"@en, "Vingt mille lieues sous les mers"@fr, "海底二万里"@ja ;
|
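For reference, the JSON-LD mechanism relied on here is a language map ("@container": "@language"). A simplified sketch of the relevant context entry — not a verbatim copy of the actual Readium context document — would be:

```json
{
  "@context": {
    "schema": "http://schema.org/",
    "title": {
      "@id": "schema:name",
      "@container": "@language"
    }
  }
}
```

With this context, a JSON-LD processor interprets the keys of the "title" object as language tags rather than as property names.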
Yes, JSON-LD does have possible solutions to this - but that's why we need
to keep all this in mind...
However, that only handles half the problem - multiple values, each in
their own language. What do you do when you have a single value in multiple
languages?
…On Sun, Jul 2, 2017 at 4:51 PM, Hadrien Gardeur wrote:
In Readium-2 we already handle that case (multiple localization for some
strings).
Here's how we handle it for title for example:
"title": {
"fr": "Vingt mille lieues sous les mers",
"en": "Twenty Thousand Leagues Under the Sea",
"ja": "海底二万里"
}
Since we're using JSON-LD and include the proper info in our context
document, this is correctly understood by JSON-LD clients:
schema:name "Twenty Thousand Leagues Under the Sea"@en, "Vingt mille lieues sous les mers"@fr, "海底二万里"@ja ;
|
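For context on the question above about a single value in multiple languages: JSON-LD also offers an expanded syntax for the same information — an array of value objects, each carrying one language tag (a sketch, not quoted from any Readium document). Note that this still does not solve the case of one string that mixes languages internally, since each @value carries exactly one @language:

```json
"title": [
  { "@value": "Twenty Thousand Leagues Under the Sea", "@language": "en" },
  { "@value": "Vingt mille lieues sous les mers", "@language": "fr" },
  { "@value": "海底二万里", "@language": "ja" }
]
```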
Representing multiple languages in a single string is a much bigger issue that we can't tackle on our own. Unlike what the name of this issue implies, this IMO has nothing to do with JSON:
- the exact same issue exists today in XML with EPUB
- ... or most metadata formats
If you can't represent that info in a string, the problem is much bigger than the manifest:
- how would anyone store that info in a database (usually these fields are UTF-8 strings)?
- how would you transmit this info in an API?
- how would dedicated reading systems represent this info in-memory?
I think this falls under the "not our problem to solve" category that Ivan mentioned several times during the F2F. We can participate in efforts to solve this problem with UTF-8, but we can't and shouldn't try to fix it on our own. |
Not true, Hadrien. Using a markup language like HTML or XML, I can break a
single string into "spans" with langs defined on each... JSON (or similar)
does not. That isn't to say we shouldn't use JSON - I am
simply pointing out an issue that we need to consider to be fully i18n-capable...
Any solution that relies solely on UTF-8, thinking that it solves the i18n
problem, is built incorrectly.
…On Mon, Jul 3, 2017 at 6:50 AM, Hadrien Gardeur wrote:
Representing multiple languages in a single string is a much bigger issue
that we can't tackle on our own.
Unlike what the name of this issue implies, this IMO has nothing to do
with JSON:
- the exact same issue exists today in XML with EPUB
- ... or most metadata formats
If you can't represent that info in a string, the problem is much bigger
than the manifest:
- how would anyone store that info in a database (usually these fields
are UTF-8 strings)?
- how would you transmit this info in an API?
- how would dedicated reading systems represent this info in-memory?
I think this falls under the "not our problem to solve" category that Ivan
mentioned several times during the F2F.
We can participate in efforts to solve this problem with UTF-8, but we
can't and shouldn't try to fix it on our own.
|
Sure, but who's using HTML to represent strings in a database or an API? Absolutely no one. |
In a database? Absolutely! Any good content search engine, for example,
indexes HTML (or XML).
Yes, for most REST-based APIs it's JSON and UTF-8... an unfortunate choice
on a variety of levels. And yes, we're not going to change that. But
that doesn't make it correct.
…On Mon, Jul 3, 2017 at 7:09 AM, Hadrien Gardeur wrote:
Sure, but who's using HTML to represent strings in a database or an API?
Absolutely no one.
|
Re. databases: not really, Leonard. @HadrienGardeur is talking about databases (e.g. MySQL); you move to search engines (e.g. Solr). Some databases (SQL Server, Oracle, DB2) can handle XML fields in their recent versions, but others (MySQL, SQLite) don't. Most professional search engines don't index HTML or XML markup: the tags are stripped out before indexing. Note also that Elasticsearch imports JSON structures, not XML.
Re. the Web: a Web Publication must be adapted to... the Web, i.e. browsers, and nowadays browsers don't handle XML perfectly. We are dealing with property/value tuples in this discussion. If you want to promote mixed content as a core value type, you'll have the whole database/web community up in arms ("vent debout") against the idea.
The problem I raised (i18n for metadata values) is currently not handled in EPUB 3. I suppose the publishing industry was not so impatient to have it resolved before. So I agree with Hadrien that we should just express why it could be interesting to have a solution for this issue and which solution is offered by other W3C WGs. IMHO, there are two main reasons why we would like proper internationalized metadata values:
|
In the case of the Japanese language, each human-readable text requires two representations: one in Kana only and one in Kanji. For example, the Japanese National Diet Library pairs each title with a dedicated property, dcndl:transcription, which is Kana-only. |
@murata0204 in the Readium Web Publication Manifest this is supported for most strings. The only place where we can't use it yet is for the description. |
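As a hedged sketch of how the two Japanese representations could coexist in a JSON manifest: the multilingual "title" object mirrors the Readium example earlier in this thread, while the "sortAs" property carrying the Kana reading is an assumption for illustration (this thread does not confirm it, and the Kana transcription shown is illustrative):

```json
"title": {
  "ja": "海底二万里",
  "en": "Twenty Thousand Leagues Under the Sea"
},
"sortAs": "かいていにまんり"
```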
Thank you Makoto.
? |
@danielweck not sure what you mean; as you know, we support both syntaxes in Readium-2. |
We have a langcode-to-string mapping, but no dir, right? |
Suggested an editor's note in Section 4.5
Propose closure: Can be closed; mostly taken care of in #129, although the JSON serialization is still to be done. |
A question was raised during the first F2F meeting in NYC about the proper internationalization of UTF-8 metadata values (e.g. the book title).
I'll quote Ivan, from the minutes: "On the i18n side, we will need to be careful about ids, uris, iris, etc. w/respect to i18n char-sets. Another area we need to be careful about is metadata, which also have issues with the char-sets for the actual text content. One example is mixing bidi text in the metadata content."
Depending on the serialization format used to express publishing metadata, the issues we face may differ. But in case a JSON (JSON-LD?) format is chosen, what issues may we face, and which i18n solutions are recommended by other W3C WGs?