
[meta] Internationalization and localization (i18n, l10n) and internal working vocabulary #15

Open
fititnt opened this issue Mar 14, 2021 · 13 comments


fititnt commented Mar 14, 2021

Quick links:


This issue may be used to collect references about the internal working vocabulary and how to deal with internationalization and localization, in particular for the [meta issue] hxlm #11.

A lot of work has already been done, but in addition to being used internally, tools like https://json-schema.org/ can be used to generate helpers for people who edit YAML by hand in code editors like VSCode. To allow multiple languages (even for the keys, not just the content) we may eventually need to generate one JSON Schema per language, since there is no native way to make JSON Schemas multilingual.
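A rough sketch of what that per-language generation could look like (the vocabulary layout and output file names below are illustrative assumptions, not the actual core_vocab format):

```python
# A minimal sketch: generate one JSON Schema per language from a
# vocabulary table, since JSON Schema itself is not multilingual.
import json

CORE_VOCAB = {  # assumed layout: internal Latin ID -> localized keys
    'descriptionem': {'ENG': 'description', 'POR': 'descrição'},
    'linguam':       {'ENG': 'language',    'POR': 'língua'},
}

def schema_for_language(lang: str) -> dict:
    """Build a JSON Schema whose property names are the localized keys."""
    return {
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'type': 'object',
        'properties': {
            localized[lang]: {'type': 'string'}
            for localized in CORE_VOCAB.values()
        },
    }

# Hypothetical output file name, one schema per language:
with open('hdp-schema.por.json', 'w', encoding='utf-8') as out:
    json.dump(schema_for_language('POR'), out, ensure_ascii=False, indent=2)
```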

TODO: add more context

fititnt added a commit that referenced this issue Mar 14, 2021
…fy more than one source; also content without translation will be prefixed with a single _
fititnt added a commit that referenced this issue Mar 15, 2021
…etions (hsilos). While the tags allow use at any level, this attribute should be explicitly named at the top level.
fititnt added a commit that referenced this issue Mar 15, 2021
…not as intuitive. In particular attr.falsum.zho (because it may have many, many alternatives) and attr.falsum.ara (because I'm not sure about attr.falsum.ara.id) are welcome to get revision; the initial values at least are from Wikidata/Wiktionary.

fititnt commented Mar 17, 2021

Maybe it will be possible to conditionally load JSON Schemas (and, with JSON Schemas, autocompletion) based not just on the file extension: think ola-mundo.por.hdp.yml vs hello-world.eng.hdp.yml, but also something like salve-mundi.mul.hdp.yml or salve-mundi.hdp.yml.


Edit: from salve-mundi.mul.hdp.yml to salve-mundi.hdp.yml
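A minimal sketch of the filename-based selection (the helper and the schema file names are hypothetical):

```python
# Pick a JSON Schema based on the ISO 639-3 code embedded in the file name,
# falling back to a multilingual schema when no code is present.
import re

SCHEMAS = {  # hypothetical per-language schema files
    'por': 'hdp-schema.por.json',
    'eng': 'hdp-schema.eng.json',
    'mul': 'hdp-schema.mul.json',  # multilingual / language-neutral
}

def schema_for(filename: str) -> str:
    """*.LLL.hdp.yml loads the LLL schema; *.hdp.yml loads the mul one."""
    match = re.search(r'\.([a-z]{3})\.hdp\.ya?ml$', filename)
    lang = match.group(1) if match else 'mul'
    return SCHEMAS.get(lang, SCHEMAS['mul'])

print(schema_for('ola-mundo.por.hdp.yml'))  # hdp-schema.por.json
print(schema_for('salve-mundi.hdp.yml'))    # hdp-schema.mul.json
```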

fititnt added a commit that referenced this issue Mar 17, 2021
…SE; comment: maybe JSON Schemas could allow conditional loading, so an *.mul.hdp.yml or *.hdp.yml could allow completion all the time?; also playing with loops;

fititnt commented Mar 17, 2021

Almost there...

hdpcli tests/hrecipe/hello-world.hrecipe.hdp.yml --objectivum-linguam RUS

[Screenshot from 2021-03-17 18-22-05: hdpcli output]

fititnt added a commit that referenced this issue Mar 17, 2021

fititnt commented Mar 17, 2021

We will need to use recursion.

And the recursion must not try to translate even the inline example data (or, at least for now, the country/territory ISO 2 codes). But I think this is still not as hard as the need to do it well, in particular when parsing an unknown language, to avoid some sort of recursive DDoS.
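A minimal sketch of the kind of guarded recursion this implies (the skip list, depth limit, and function names are illustrative assumptions, not the actual hxlm API):

```python
# Recursively translate dict keys while leaving values, and whole skipped
# subtrees (inline example data), untouched. The depth limit guards against
# maliciously deep documents ("recursive DDoS").
MAX_DEPTH = 32
SKIP_KEYS = {'datum', 'exemplum'}  # assumed: inline example data stays as-is

def translate_keys(node, translate_key, depth=0):
    if depth > MAX_DEPTH:
        raise RecursionError('document nested too deep; possible abuse')
    if isinstance(node, dict):
        return {
            translate_key(key): (value if key in SKIP_KEYS else
                                 translate_keys(value, translate_key, depth + 1))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [translate_keys(item, translate_key, depth + 1) for item in node]
    return node  # scalars (including ISO 2 country codes) pass through
```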


fititnt commented Mar 18, 2021

We're already able to export the internal representation (heavily based on Latin) in the 6 UN languages plus Portuguese!!!

(It still does not check input syntax beyond what JSON Schema warns the user about, but OK, it's something!)

1. Make any known vocabulary equally valid

Things really shine if any of the 7 languages is equally valid as a full working project. That's the idea. This feature alone makes it hugely appealing to use.

Note: the core_vocab, while it will always try to export a unique ID per language, tolerates aliases.

1.1 Aliases are good... but the idea is not to overuse them for macrolanguages

In other words: core_vocab (plus user ad-hoc customization for unknown languages) tolerates some variation on input. But it is still a good idea, at some point, not to force entire macrolanguages (like Arabic and Chinese) onto the same ISO 639-3 codes (OK that this could be a hot fix, but it is not ideal).

If necessary, I think we can implement some way to override just part of a vocabulary. So, for example, if 20% of an individual language shares acceptable conventions with the macrolanguage, we make HDP itself allow this (see the sketch below).
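A minimal sketch of such an alias-tolerant, partially overridable lookup (the table layout and the Cantonese overlay are illustrative assumptions, not the actual core_vocab format):

```python
# Map a localized (possibly aliased) key back to its internal Latin ID,
# checking a per-language overlay before the macrolanguage tables.
CORE_VOCAB_ZHO = {'描述': 'descriptionem'}  # macrolanguage (zho) keys
ALIASES_ZHO = {'說明': 'descriptionem'}      # tolerated input variants
OVERLAY_YUE = {}  # hypothetical: yue reuses zho except where overridden

def resolve(term: str, overlay: dict) -> str:
    for table in (overlay, CORE_VOCAB_ZHO, ALIASES_ZHO):
        if term in table:
            return table[term]
    return term  # unknown terms pass through unchanged

print(resolve('說明', OVERLAY_YUE))  # descriptionem, via the zho alias
```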

2. What would be the "official" version of a file?

Even if, in practice, most teams that already use English as their working language would use thing.eng.hdp.yml, I like the idea that resources created by someone else can keep the very same file, and that the HDP tools could still tolerate, on the fly, more than one file on disk.

This may not be as relevant when everyone speaks the same language, but at least it can work as a benchmark for when working with HDP files from others.

2.1 What if two files on disk are out of sync (like someone edited one version)?

I think that, either by default (it should be possible to enable/disable with configuration) or via an extra command line option, there should be some way to detect if two resources in different languages would deliver different results (a sketch of the idea follows below).
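A minimal sketch of such a check, assuming a hypothetical to_internal() normalization step (the function names and file names are illustrative):

```python
# Hash the language-neutral form of a document, so translations of the
# same resource compare equal and out-of-sync copies do not.
import hashlib
import json

def to_internal(document: dict) -> dict:
    # Placeholder: the real step would map localized keys back to the
    # Latin-based internal vocabulary before hashing.
    return document

def fingerprint(document: dict) -> str:
    canonical = json.dumps(to_internal(document), sort_keys=True,
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

# Usage (hypothetical documents):
# if fingerprint(eng_doc) != fingerprint(por_doc):
#     print('warning: thing.eng.hdp.yml and its translation diverged')
```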

2.2 What if two of the same resources NEED to be different? (Like a file from someone else had an error, or needs an update before passing to the next person)

Again, I think this case may need some way to implicitly declare that two resources are almost the same... but that changes are allowed (maybe just allowing a few parameters to be overridden).

At a more basic level, I think that just having a small name change (like thing-v2.eng.hdp.yml) could do the trick. This may sound lazy, but it would be sufficient to not raise errors.

But the idea is that (not this week, maybe not this month, because I need to do other stuff outside this project) we eventually allow digitally signing an HDP file.

And the process of digitally signing, when necessary, needs to allow humans to do it many times, but without automating it too much: there is a reason why smart cards like YubiKeys have a physical button, and it is scary.


Edit: added example

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli tests/hrecipe/salve-mundi.hrecipe.mul.hdp.yml --objectivum-linguam RUS

urn:hdp:OO:HS:local:salve-mundi.hrecipe.mul.hdp.yml:
  силосная:
    группа:
      - salve-mundi
    описание:
      ENG: Hello World!
      POR: Olá Mundo!
    страна:
      - AO
      - BR
      - CV
      - GQ
      - GW
      - MO
      - MZ
      - PT
      - ST
      - TL
    тег:
      - CPLP
    язык: MUL
  трансформация-данных:
    - _recipe:
        - aggregators:
            - sum(population) as Population#population
          filter: count
          patterns: adm1+name,adm1+code
        - filter: clean_data
          number: population
          number_format: .0f
      идентификатор: example-processing-with-a-JSON-spec
      пример:
        - источник:
            _sheet_index: 1
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
        - источник:
            данные:
              - - header 1
                - header 2
                - header 3
              - - '#item +id'
                - '#item +name'
                - '#item +value'
              - - ACME1
                - ACME Inc.
                - '123'
              - - XPTO1
                - XPTO org
                - '456'
          цель:
            данные:
              - - header 1
                - header 2
                - header 3
              - - '#item +id'
                - '#item +name'
                - '#item +value'
              - - ACME1
                - ACME Inc.
                - '123'
              - - XPTO1
                - XPTO org
                - '456'

fititnt added a commit that referenced this issue Mar 18, 2021
fititnt added a commit that referenced this issue Mar 18, 2021

fititnt commented Mar 18, 2021

About gettext

I just learned that it is possible to translate even command line options with Python. In this GNU gettext book, the first 200 pages are the most relevant; a good part of them talks about the translation challenge. The author seems to be someone who speaks French.

[Trivia] Even proper names need localization across scripts (aka conversion of the script)

4.9 Marking Proper Names for Translation
Should names of persons, cities, locations etc. be marked for translation or not? People who only know languages that can be written with Latin letters (English, Spanish, French, German, etc.) are tempted to say “no”, because names usually do not change when transported between these languages. However, in general when translating from one script to another, names are translated too, usually phonetically or by transliteration. For example, Russian or Greek names are converted to the Latin alphabet when being translated to English, and English or French names are converted to the Katakana script when being translated to Japanese. This is necessary because the speakers of the target language in general cannot read the script the name is originally written in.

Good to know this. Maybe we never implement it, but if we have to, at least the command line options (not just the help messages) should allow translation upfront.
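A minimal sketch with the stdlib gettext (the 'hdp' translation domain, the locales/ directory, and the localized alias flag are illustrative assumptions, not the actual hdpcli setup):

```python
# Translate CLI help strings via gettext; a localized alias flag shows
# one way the option *name* itself could also be translated.
import argparse
import gettext

translation = gettext.translation(
    'hdp', localedir='locales', languages=['pt'], fallback=True)
_ = translation.gettext

parser = argparse.ArgumentParser(description=_('Process HDP files'))
# '--lingua-de-destino' is a hypothetical localized alias for the flag.
parser.add_argument('--objectivum-linguam', '--lingua-de-destino',
                    help=_('target language for the exported document'))
args = parser.parse_args()
```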

fititnt added a commit that referenced this issue Mar 18, 2021
fititnt added a commit that referenced this issue Mar 18, 2021
fititnt added a commit that referenced this issue Mar 18, 2021
… to check EVEN the command line instead of only the underlying library; also removed the nacl dependency (it was required to run hdpcli, but it was not tested until now on a clean environment)
fititnt added a commit that referenced this issue Mar 19, 2021
… created; hdpcli is already doing too much; let's break functionality out into other parts of the hxlm lib
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021
fititnt added a commit that referenced this issue Mar 19, 2021

fititnt commented Mar 26, 2021

I believe we will need some sort of way to express rank/ordering as part of the internationalization/localization feature.

The checksums already return an S-expression-like hash. This means it is possible to have a compact form to express how it was done; S-expressions are also easier to construct parsers for, and they could even be translatable. But beyond the name of the algorithm there is a need to express "what" was hashed. While users could customize their very own strings, we could provide some way for even the special values to be translatable. This nice-to-have alone could be sufficient for people to accept the defaults (see the sketch below).
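A minimal sketch of the idea (CRC32 and the DATUM token are stand-ins for illustration, not the project's real checksum vocabulary):

```python
# Express, in one compact S-expression, both the algorithm used and
# *what* was hashed; each token could later be translatable.
import zlib

def checksum_sexpr(payload: bytes, what: str = 'DATUM') -> str:
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return f'(CRC32 ({what}) "{crc:08x}")'

print(checksum_sexpr(b'salve mundi'))  # e.g. (CRC32 (DATUM) "…")
```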

Why numerals (or order of letters in an alphabet)

While some uses would not actually always be a ranked system, it is easier to create Knowledge Graphs that map numbers from different writing systems (we're already using the 6 UN working languages plus Portuguese and Latin), so this new feature is actually not that hard. So if someone is executing a query on a document that is not on a local disk with an exact writing system, whoever did not use specific strings would be able to use any search query term and it would work (a quick illustration follows below).
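A quick illustration of why the numeral mapping is cheap in Python: Unicode itself already records the numeric value of digits across writing systems.

```python
# Unicode knows the numeric value of numerals across scripts, so a
# cross-writing-system mapping table is cheap to build.
import unicodedata

for numeral in ('3', '٣', '३', '三'):  # Latin, Arabic-Indic, Devanagari, Han
    print(numeral, '->', unicodedata.numeric(numeral))  # all print 3.0
```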

Start with one (avoid use of zero)

For the sake of simplifying potential translations, since a decision needs to be made between starting ranks with zero or one (computers often start with zero), I think we should avoid relying on the meaning of zero. The meaning of zero is not easy to translate/localize. Also, even in natural languages that have the concept of zero, like English, the words that describe zero tend to have many more synonyms, and if for some reason people try to brute-force understanding, the term for 'zero' would be more likely to be understood as a string instead of being converted to some more international meaning.

We may still use zero internally (we can't change programming language interpreters), but at least the terms designed for humans to understand could start with 1. It simplifies documentation.


Edit: link to S-expression.


fititnt commented Mar 29, 2021

See also:

[Screenshots from 2021-03-29 12-34-15 and 12-35-12: Python 3 identifier tests]

Python 3 and Unicode identifiers (letters, even non-Latin, are OK; math symbols are not)

Just to mention that Python 3 accepts almost any character someone is willing to put in an identifier (even ç or Greek letters like λ), but Unicode mathematical symbols that are not also letters are not accepted as valid identifiers.

While for the first versions I don't plan to suggest we go full math symbols for everything (for that we implement localized translations for each language), at least for features that are not meant for the average user we could do it with special characters that do not mean anything in most languages. But for the record I'm mentioning this point because it may affect some decisions.

Or maybe this would still be possible, but not literally as identifiers, even for the very internal Python implementation.
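For the record, this is easy to check from Python itself with str.isidentifier():

```python
# Letters from non-Latin scripts are valid identifiers; pure math
# symbols (Unicode category Sm, like ∑) are not.
for name in ('função', 'λ', 'описание', '∑total'):
    print(name, name.isidentifier())
# função True, λ True, описание True, ∑total False
```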

fititnt added a commit that referenced this issue Mar 31, 2021
… wikipedia describes as 'It has been described as the most frequently spoken or written word on the planet' and we need some way to express a program result that is not bad, even if not perfect)
fititnt added a commit that referenced this issue Mar 31, 2021
…ady have a draft to allow localized exceptions in the user's language! (ok that we may need to make it stable enough to not throw exceptions on the exceptions themselves)
fititnt added a commit that referenced this issue Mar 31, 2021
…ed; the idea is to abstract raw text messages to allow future l10n
fititnt referenced this issue Apr 6, 2021
…o-left? (like Imperial Aramaic, not Lingua Latina) will need some serious work on the abstract syntax tree
fititnt added a commit that referenced this issue Apr 13, 2021
…of terms that could be enforced as fallback on any localization

fititnt commented Apr 13, 2021

Even if it would NOT be recommended for end users (think of someone creating rules for design by contract and making mistakes), there do exist some English keywords that (if not kept in English) would need to be defined in Latin, and from Latin extended to every natural language. The bare minimum keywords tend to be ATOM, CAR, CDR, COND, CONS, EQ, QUOTE (sometimes LAMBDA, sometimes abbreviated as λ).

Since we're not in the 1960s anymore, whoever develops compilers could already use an alphabet that does not use Latin at all. This decision could simplify some work: it is neutral, and it could be loaded by default with some other mathematical operations (like + and -, as in Ada).

This is the current draft:

# This is a draft of what neutral names could be used
b:
  ATOM:
    _*_
  CAR:
    _^_ (How will this behave in right-to-left languages when composed, like CADR _^~_ & CDAR _~^_?)
  CDR:
    _~_ (How will this behave in right-to-left languages when composed, like CADR _^~_ & CDAR _~^_?)
  COND:
    _?_
  CONS:
    _*_
  EQ:
    _=_
  LAMBDA:
    _λ_   (Not ideal, it is a letter from one specific writing system)
    ___   (3 _ seems anonymous enough and is neutral)
  PRINT:
    _:_
  QUOTE:
    _"_
  READ:
    ???  (TODO: think about)
  DEFINE, DEF, DEFN, etc:
    _#_
  L:
    _ISO639-3_ltr-text
    rtl-text_ISO639-3_
          (Note: non-Latin alphabets may need some work to discover how to use the term for them)
  "+":
  "-":
  "*":
  "/":

For safety reasons, recommend not using the fallback terms when localization is available (maybe use a nudge)

While these keywords may actually be available in any loaded language (and, when the user has some specialized text editor, these terms could even be autocompleted to the localized words), they should not be used except for debugging, or (maybe when a new language was just added but some term is still missing) only for that missing term to fall back to these ones.

The problem is that we will often have experts working alongside people who are not experts. So what is OK for one may not be OK for the other.

So, in this case, in addition to maybe working with the most common user interfaces that help developers create these scripts, almost every tool that sees the more internal keywords could nudge (see https://en.wikipedia.org/wiki/Nudge_theory) the user, for example by implicitly converting to the more verbose format, to the point that the user has to disable this if they really want to use the internal terms.

So, why not fall back to the ASCII ATOM, CAR, CDR, COND, CONS, EQ, QUOTE instead of creating new terms (if that could be dangerous)?

  • The first reason is neutrality (not hardcoding some alphabet) and to induce empathy.
    • This approach both forces tools to upgrade as soon as possible to the local natural language and also, if mistakes could cause deaths, people who know the Latin alphabet would not be privileged.
      • And no, it is not "fair" to compare the challenge of not knowing English while still having a mother language written in some Latin-like alphabet with having a completely different writing system.
  • A second reason is that, even for Latin, CAR & CDR are confusing.
    • If the non-alphabetic encoding could be good enough to make clearer what is CAR and what is CDR (or at least help someone reading code in debug mode), this alone could be an improvement. Anyway, most people who would use HDPLisp tend not to know the origin of these terms, so while we could allow importing code, new code could already be more user friendly.
      • But I agree upfront that we would need some feedback on combinations of neutral characters to find what could be more intuitive.


fititnt commented Apr 13, 2021

The https://github.com/EticaAI/HXL-Data-Science-file-formats/ontologia (and the public endpoint https://hdp.etica.ai/ontologia/), which until a few hours ago were deeper inside hxlm/ontologia (the Python implementation), are now at the root of the project, and I'm dedicating some time to merging some datasets that are pertinent! This needs some care, so let's put it in a single place. Anyway, https://hdp.etica.ai/ontologia/ also exposes everything for whoever can't download all the tables (likely hxlm-js later, when running in the browser and needing to build a local cache).

The ontologia/

While part of the ontologia is mostly for the Knowledge Graphs (Localization Knowledge Graph, Vocabulary Knowledge Graph) and it already has a draft of the internals of HDPLisp, the idea here is to already have a single place from which every package in this repository gets its data. (This is why there are a few symlinks.)

This also means that people trying to understand how the internals work, or maybe just doing some quick integration without actually loading the libraries from here, can consume just the data. Also, by having a single place to put "all shared knowledge" of all the underlying implementations of the tools here, we can test everything together.

About Monolith / Monorepo

Just to mention that putting several implementations in a single GitHub repository is not considered (on average) a good practice. But in some cases (or at least at this moment, when we're writing the same concept for more than one programming language) a monorepo can work to allow consistent testing.

But when necessary (like when it starts to make testing things slower, not faster) we can split into more projects.

fititnt added a commit that referenced this issue Apr 13, 2021
fititnt added a commit that referenced this issue Apr 14, 2021
…build local cache from Unicode PDFs containing ALL Unicode characters (PDFs are huge; not committed to history; see links)
fititnt added a commit that referenced this issue Apr 14, 2021
fititnt added a commit that referenced this issue Apr 15, 2021
…v added BCD, octal numbers and a draft of Japanese numerals