[meta] Internationalization and localization (i18n, l10n) and internal working vocabulary #15
…fy more than one source; also, content without a translation will be prefixed with a single _
…etions (hsilos). While the tags allow use at any level, this attribute should be explicitly named at the top level.
…not as intuitive. In particular, attr.falsum.zho (because it may have many, many alternatives) and attr.falsum.ara (because I'm not sure about attr.falsum.ara.id) are welcome to revision; the initial values, at least, are from Wikidata/Wiktionary.
Maybe it will be possible to conditionally load JSON Schemas (and, with JSON Schemas, that means autocomplete) based not just on the file extension (think like …)
Edit: from …
…SE; comment: maybe JSON Schemas could allow conditional loading, so a *.mul.hdp.yml or a *.hdp.yml could allow completion all the time?; also playing with loops;
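For reference, the VS Code YAML extension (by Red Hat) already supports mapping glob patterns, not just file extensions, to schemas via its yaml.schemas setting, which would give completion for *.hdp.yml files all the time. A minimal sketch (the schema path below is hypothetical, not something the project ships):

```json
{
  "yaml.schemas": {
    "./ontologia/json-schema/hdp.mul.schema.json": [
      "*.mul.hdp.yml",
      "*.hdp.yml"
    ]
  }
}
```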
We will need to use recursion. And this needs to not try to translate even the inline example data (or, at least for now, the country/territory ISO 2 codes). But I think this is still not as hard as the need for it to be done well, in particular when parsing an unknown language, to avoid some sort of recursive DDoS.
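A minimal sketch of that recursion, with hypothetical key names and limits (not the actual hxlm code): translate only the keys, never descend into inline example data, and cap the depth as a cheap guard against the recursive-DDoS concern.

```python
MAX_DEPTH = 32  # cheap guard against maliciously deep documents

KEY_MAP = {'пример': 'example', 'источник': 'source'}  # illustrative only
DO_NOT_ENTER = {'данные'}  # inline example data: never translated


def translate_keys(node, depth=0):
    """Translate localized keys to internal ones, skipping example data."""
    if depth > MAX_DEPTH:
        raise RecursionError('document too deep; refusing to continue')
    if isinstance(node, dict):
        return {
            KEY_MAP.get(k, k): (v if k in DO_NOT_ENTER
                                else translate_keys(v, depth + 1))
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [translate_keys(item, depth + 1) for item in node]
    return node  # scalars (including country/territory ISO codes) pass through


doc = {'пример': [{'источник': {'данные': [['header 1', 'header 2']]}}]}
print(translate_keys(doc))
# {'example': [{'source': {'данные': [['header 1', 'header 2']]}}]}
```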
We're already able to export the internal representation (heavily based on Latin) to the 6 UN languages plus Portuguese! (It is still not checking input syntax beyond what the JSON Schema warns the user about, but it's something!)

1. Make any known vocabulary equally valid

Things really shine if any of the 7 languages is equally valid as a full working project. That's the idea. This feature alone makes it hugely appealing to use. Note: the core_vocab, while it will always try to export to a unique ID per language, tolerates aliases.

1.1 Aliases are good... but the idea is not to overuse them for macro languages

In other words: core_vocab (plus ad-hoc user customization for unknown languages) tolerates some variation in input. But it is still a good idea, at some point, not to force entire macro languages (like Arabic and Chinese) onto the same ISO 639-3 codes (this could work as a hot fix, but it is not ideal). If necessary, I think we can implement some way to override just part of a vocabulary. So, for example, if 20% of an individual language shares acceptable conventions with the macro language, we make HDP itself allow this.

2. What would be the "official" version of a file?

Even if, in practice, most teams that already use English as a working language would use it, this may not be as relevant when everyone speaks the same language; but it can at least work as a benchmark for when it is not, with HDP files from others.

2.1 What if two files on disk are out of sync (like someone edited one version)?

I think either by default (it should be possible to enable/disable this via configuration) or via an extra command line option, there should be some way to detect whether two resources in different languages would deliver different results.

2.2 What if two same resources NEED to be different? (Like a file from someone else had an error, or needs an update before passing to the next person)

Again, I think this case may need to implicitly allow some way to know that two resources are almost the same, but changes are allowed (maybe just allow overriding a few parameters). At a more basic level, I think that just having a small name change (like …) could work. But the idea would be something that (not this week, maybe not this month, because I need to do other stuff outside this project) eventually allows digitally signing an HDP file. And the process of digitally signing, when necessary, needs to allow the humans who could do this to do it often, but without automating too much: there is a reason why smart cards like YubiKeys have a physical button, and it is scary.

Edit: added example:
```yaml
urn:hdp:OO:HS:local:salve-mundi.hrecipe.mul.hdp.yml:
  силосная:
    группа:
      - salve-mundi
    описание:
      ENG: Hello World!
      POR: Olá Mundo!
    страна:
      - AO
      - BR
      - CV
      - GQ
      - GW
      - MO
      - MZ
      - PT
      - ST
      - TL
    тег:
      - CPLP
    язык: MUL
  трансформация-данных:
    - _recipe:
        - aggregators:
            - sum(population) as Population#population
          filter: count
          patterns: adm1+name,adm1+code
        - filter: clean_data
          number: population
          number_format: .0f
      идентификатор: example-processing-with-a-JSON-spec
  пример:
    - источник:
        _sheet_index: 1
        iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
    - источник:
        данные:
          - - header 1
            - header 2
            - header 3
          - - '#item +id'
            - '#item +name'
            - '#item +value'
          - - ACME1
            - ACME Inc.
            - '123'
          - - XPTO1
            - XPTO org
            - '456'
      цель:
        данные:
          - - header 1
            - header 2
            - header 3
          - - '#item +id'
            - '#item +name'
            - '#item +value'
          - - ACME1
            - ACME Inc.
            - '123'
          - - XPTO1
            - XPTO org
            - '456'
```
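For section 2.1 above, a minimal sketch of the detection idea, with a hypothetical two-entry key vocabulary (the real hxlm API may differ): translate both files to the internal Latin-based representation and compare the canonical forms.

```python
import json

# Tiny illustrative vocabulary: localized key -> internal core_vocab ID.
TO_INTERNAL_KEY = {'пример': 'exemplum', 'example': 'exemplum',
                   'язык': 'linguam', 'language': 'linguam'}


def to_internal(doc: dict) -> dict:
    """Translate top-level localized keys to internal core_vocab IDs."""
    return {TO_INTERNAL_KEY.get(k, k): v for k, v in doc.items()}


def out_of_sync(doc_a: dict, doc_b: dict) -> bool:
    """Two language versions are out of sync if their canonical forms differ."""
    dump = lambda d: json.dumps(to_internal(d), sort_keys=True, ensure_ascii=False)
    return dump(doc_a) != dump(doc_b)


rus = {'язык': 'MUL', 'пример': [1, 2]}
eng = {'language': 'MUL', 'example': [1, 2]}
assert not out_of_sync(rus, eng)  # same content, different languages
```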
…n't contain .LLL.hdp.(yml|json) suffixes
…of ZXX; also, this code is not used by other conlangs; see https://www.kreativekorp.com/clcr/
About gettext

I just learned that it is possible to translate even command line options with Python. Of this GNU gettext book, the first 200 pages are the most relevant. A good part of them talks about the translation challenge. The author seems to be someone who speaks French.

[Trivia] Latin script, even for proper names, needs localization (aka conversion of the script)

Good to know. Maybe we never implement this, but if we have to, at least the command line options (not just the help messages) should allow translation upfront.
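A minimal sketch of the gettext approach, assuming an illustrative 'hdpcli' text domain and a ./locale directory (neither is necessarily what the project uses): with argparse, even the option descriptions pass through the message catalog.

```python
import argparse
import gettext

# Load the "hdpcli" catalog from ./locale/<LANG>/LC_MESSAGES/hdpcli.mo;
# falls back to the original strings if no translation is installed.
t = gettext.translation('hdpcli', localedir='locale', fallback=True)
_ = t.gettext

parser = argparse.ArgumentParser(description=_('Process HDP files'))
parser.add_argument('--verbose', action='store_true',
                    help=_('enable verbose output'))
args = parser.parse_args()
```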
… to check EVEN the command line instead of only the underlying library; also removed the nacl dependency (it was required to run hdpcli, but until now it was not tested in a clean environment)
… created; hdpcli is already doing too much; let's break functionality out into other parts of the hxlm lib
…f_hsilo() & get_languages_of_words()
I believe we will need some sort of way to express rank/ordering as part of the internationalization/localization feature. The checksums already return an S-expression-like hash. This means it is possible to have a compact form to express how it was done; S-expressions are also easier to construct parsers for, and they could even be translatable. But beyond the name of the algorithms, there is a need to express "what" was hashed. While users could customize their very own strings, we could provide some way so that even special values would be translatable. This nice-to-have alone could be sufficient for people to accept the defaults.

Why numerals (or the order of letters in an alphabet)?

While some uses would not actually always be a ranked system, it is easier to create knowledge graphs that map numbers from different writing systems (we're already using the 6 UN working languages plus Portuguese and Latin), so this new feature is actually not that hard. So if someone is executing a query on a document that is not on a local disk with an exact writing system, whoever did not use specific strings would be able to use any search query term and it would work.

Start with one (avoid use of zero)

For the sake of simplifying potential translations, since a decision needs to be made between starting a rank with zero or one (computers often start with zero), I think we should avoid using the meaning of zero. The meaning of zero is not easy to translate/localize. Also, even in natural languages that have the concept of zero, like English, the words to describe zero tend to have many more synonyms, and if for some reason people try to brute-force understanding, the term for 'zero' would be more likely to be understood as a string instead of converted to some more international meaning. We may still use zero internally (we can't change programming language interpreters), but at least the terms designed for humans to understand could start with 1. It simplifies documentation.

Edit: link to S-expression.
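A minimal sketch of the numeral idea, with a hypothetical hard-coded lookup table (the real mapping would come from a knowledge graph, not be inlined like this): a numeral from any known writing system resolves to the same canonical 1-based rank, so queries work across scripts.

```python
# Canonical 1-based ranks for numerals in several writing systems:
# Western Arabic, Eastern Arabic, Chinese, Roman (illustrative subset).
RANK_NUMERALS = {
    '1': 1, '١': 1, '一': 1, 'I': 1,
    '2': 2, '٢': 2, '二': 2, 'II': 2,
    '3': 3, '٣': 3, '三': 3, 'III': 3,
}


def canonical_rank(term: str) -> int:
    """Return the canonical 1-based rank for a numeral in any known script."""
    try:
        return RANK_NUMERALS[term.strip()]
    except KeyError:
        raise ValueError(f'Unknown rank numeral: {term!r}')


assert canonical_rank('٢') == canonical_rank('二') == 2
```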
See also:
Python 3 and Unicode identifiers (letters, even non-Latin, are OK; math symbols are not)

Just to mention that Python 3 accepts almost all characters someone is willing to put in identifiers (even …). While for the first versions I don't plan to suggest we go full math symbols for all the things (for that we would implement localized translations for each language), at least for features that are not meant for the average user we could do it with special characters that do not mean anything in most languages. But for the record, I'm mentioning this point because it may affect some decisions. Or maybe this would still be possible, but not as a literal identifier, even for the very internal Python implementation.
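A quick demonstration of the point (standard Python 3 behavior, per PEP 3131: identifiers may use Unicode letters, but not mathematical operator symbols):

```python
# Non-Latin letters are valid Python 3 identifiers.
язык = "rus-Cyrl"   # Cyrillic identifier: valid
语言 = "zho-Hans"    # Han identifier: valid
print(язык, 语言)

# ∑ = 0             # SyntaxError: '∑' is a math symbol, not an identifier character
```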
… Wikipedia describes as 'It has been described as the most frequently spoken or written word on the planet', and we need some way to express a program result that is not bad, even if not perfect)
…ady have a draft to allow localized exceptions in the user's language! (OK, we may need to make it stable enough to not throw exceptions from the exceptions themselves)
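A minimal sketch of what such localized exceptions could look like — the class name, message table, and languages below are hypothetical, not the actual draft. The key point is the unconditional fallback, so the l10n layer itself cannot throw while formatting an error.

```python
class LocalizedError(Exception):
    """Exception whose message is localized, with a Latin fallback."""

    MESSAGES = {
        'ENG': 'Unknown vocabulary term: {term}',
        'POR': 'Termo de vocabulário desconhecido: {term}',
        'LAT': 'Verbum ignotum: {term}',
    }

    def __init__(self, term: str, lang: str = 'ENG'):
        template = self.MESSAGES.get(lang) or self.MESSAGES['LAT']
        try:
            message = template.format(term=term)
        except Exception:  # never throw while building an error message
            message = self.MESSAGES['LAT'].format(term=term)
        super().__init__(message)


try:
    raise LocalizedError('силосная', lang='POR')
except LocalizedError as err:
    print(err)  # Termo de vocabulário desconhecido: силосная
```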
…ed; the idea is to abstract raw text messages to allow future l10n
…o-left? (like Imperial Aramaic, not Lingua Latina) will need some serious work on the abstract syntax tree
…of terms that could be enforced as a fallback in any localization
Even if it would NOT be recommended for end users (think of someone creating rules for design by contract and making mistakes), there exist some English keywords that (if not in English) would need to be defined in Latin, and from Latin extended to every natural language. The bare minimum keywords tend to be ….

Since we're not in the 1960s anymore, whoever develops compilers could already use an alphabet that does not use Latin at all. This decision could simplify some work: it is neutral, and it could be loaded by default with some other mathematical operations (like + and -, like Ada …).

This is the current draft:

```yaml
# This is a draft of what neutral names could be used
b:
  ATOM:
    _*_
  CAR:
    _^_  # (How will this behave in right-to-left languages when composed, like CADR _^~_ & CDAR _~^_?)
  CDR:
    _~_  # (same right-to-left question as for CAR)
  COND:
    _?_
  CONS:
    _*_
  EQ:
    _=_
  LAMBDA:
    _λ_  # (Not ideal: λ is a letter from the alphabet of a writing system)
    ___  # (3 underscores seem anonymous enough and are neutral)
  PRINT:
    _:_
  QUOTE:
    _"_
  READ:
    ???  # (TODO: think about)
  DEFINE, DEF, DEFN, etc:
    _#_
  L:
    _ISO369-3_ltr-text
    rtl-text_ISO369-3_
    # (Note: non-Latin alphabets may need some work to discover how to use terms for them)
  "+":
  "-":
  "*":
  "/":
```

For safety reasons, recommend not using the fallback terms when localization is available (maybe use Nudge)

The reason for these keywords is that, while they may actually be available in any loaded language (and maybe, when the user has a specialized text editor, these terms could be autocompleted to the localized words), they should not be used except for debugging, or (maybe when a new language was just added but some term is still missing) only that term falls back to these ones. The problem is that often we may have experts working with people who are not experts, so what is OK for one may not be for the other. So, in this case, in addition to maybe working with the most common user interfaces that help developers create these scripts, almost every tool that sees the more internal keywords could nudge (see https://en.wikipedia.org/wiki/Nudge_theory) the user, for example by implicitly converting to the more verbose format, to the point that the user has to disable this if they really want to use the internal terms.

So, why not fall back to ASCII?
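To make the nudge idea concrete, here is a minimal sketch with hypothetical names and vocabularies (not the hxlm implementation): localized keywords resolve to the neutral internal terms, and typing an internal term directly triggers a hint toward the localized form.

```python
# Neutral internal fallback terms (from the draft above) and two tiny
# illustrative localized vocabularies; the real ones live in the ontologia.
NEUTRAL = {'_?_': 'COND', '_=_': 'EQ', '_λ_': 'LAMBDA'}
LOCALIZED = {
    'LAT': {'si': '_?_', 'aequale': '_=_'},   # Latin, illustrative only
    'POR': {'se': '_?_', 'igual': '_=_'},     # Portuguese, illustrative only
}


def resolve_keyword(token: str, lang: str) -> str:
    """Return the neutral internal term for a (possibly localized) keyword."""
    if token in NEUTRAL:
        # Nudge: the internal form works, but the localized form is preferred.
        print(f'hint: prefer the {lang} word for {token} instead of the internal term')
        return token
    return LOCALIZED.get(lang, {}).get(token, token)


assert resolve_keyword('se', 'POR') == '_?_'
```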
The https://github.com/EticaAI/HXL-Data-Science-file-formats ontologia (and the public endpoint https://hdp.etica.ai/ontologia/), which until a few hours ago was deeper inside hxlm/ontologia (the Python implementation), is now at the root of the project, and I'm dedicating some time to merge some datasets that are pertinent. This needs some care, so let's put it in a single place. Anyway, https://hdp.etica.ai/ontologia/ exposes both, for those who can't download all the tables (likely hxlm-js later, when running in the browser and needing to build a local cache). The …
…build a local cache from Unicode PDFs containing ALL Unicode characters (the PDFs are huge; not committed to history; see links)
…orts/tr35/ is huge and complex, but we need to understand this
…v added BCD, octal numbers and a draft of Japanese numerals
Quick links:
This issue may be used to make references to the internal working vocabulary and how to deal with internationalization and localization, in particular for the [meta issue] hxlm #11.
A lot of work has already been done, but in addition to being used internally, for tools like https://json-schema.org/ (which can be used to generate helpers for those who use code editors like VSCode when editing YAML by hand), to allow multiple languages (even for the keys, not just the content) we may eventually need to generate the JSON Schemas themselves per language (there is no native way to make a JSON Schema multilanguage).
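A minimal sketch of that generation step, with a hypothetical two-entry vocabulary (the real core_vocab is much larger and lives in the ontologia): one schema file per language, so each language's keys get completion in the editor.

```python
import json

# Illustrative only: internal ID -> localized key per language.
CORE_VOCAB = {
    'hsilo': {'ENG': 'silo', 'POR': 'silo', 'RUS': 'силосная'},
    'example': {'ENG': 'example', 'POR': 'exemplo', 'RUS': 'пример'},
}


def schema_for(lang: str) -> dict:
    """Build a JSON Schema whose property names use the given language."""
    properties = {terms[lang]: {} for terms in CORE_VOCAB.values()}
    return {
        '$schema': 'http://json-schema.org/draft-07/schema#',
        'type': 'object',
        'properties': properties,
    }


for lang in ('ENG', 'POR', 'RUS'):
    with open(f'hdp.{lang.lower()}.schema.json', 'w', encoding='utf-8') as f:
        json.dump(schema_for(lang), f, ensure_ascii=False, indent=2)
```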
TODO: add more context