Test case for HXLM-Action: datasets from Translation Initiative for COVID-19 "TICO-19" #5
Comments
Truth be told: in addition to the lack of tooling to deal with linguistic content (terminology in particular), not surprisingly most of the initial work was done in a hurry AND merged content from different submitters. The folder with Translation Memories (using TMX) was updated over months. Also, the work from different providers followed different approaches (both in how concepts were documented and in how the language of each term was encoded), which made it not fully compatible. So even the idea of having one file with all the terminology would only make sense later. Anyway, dealing with this type of terminology exchange and translation memory exchange for such a number of languages is already hardcore. The paper mentions that a lot of work went into reviewing the translations themselves (and I think this was mostly from Translators Without Borders), but it is clear that even different companies (like Facebook and Google), which could also ask their users to help with translations, could at least be aware of how to prepare the additional descriptions instead of simply asking for the bare term. I'm saying this because if any future initiative with companies tries to crowdsource content, they would need some minimum standards so that exchanging it with others is less complex.

**Old comment here**

**One visible challenge: TICO-19 does not have any multilingual dataset with more than 2 languages**

Even without the idea of HXLTM (it did not even exist when TICO-19 started, we're both new to @HXL-CPLP, and I don't remember hearing of this initiative back then, or I would 100% have been interested in getting in), the data repository does not have any "global" dataset for terminology. The terminology is released as bilingual CSVs (I think they have some ID for concepts, but the research paper mentions that the way the languages were structured makes them non-uniform). But bilingual files are not really the best way to organize terminology. My complaint here is that maybe no one thought about releasing it as TermBase eXchange (TBX) (which is supposed to be the de facto industry standard, even if I totally understand that its lack of tooling makes that hard).

**Possible hypothesis**

**The non-tooling part**

The research paper mentions both pivot languages and the fact that some languages had more review than others. With this alone, it is likely that if the multilingual terminology were merged back, what Europe's IATE would call the reliability code would not be the same across languages. In other words: we could "merge" the files back together here, but ideally this would be done by the people who worked on the project. Also, they did not store comments from translators about the issues they had; I did notice that some of the translations (not just Portuguese) have minor issues. Very likely the Facebook/Google data has more issues because the source text in English was already shitty. I'm saying this because the lack of language diversity at such companies means no one is aware of how hard it is to translate terms without any context.

**The tooling part**

This compressed README https://github.com/tico-19/tico-19.github.io/blob/master/data/TM/README.md.zip makes me suspicious about why they did not use TBX.
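Just to illustrate the kind of "merge the files back together" I mean, here is a minimal sketch in Python that joins bilingual CSVs on a shared concept ID into one wide multilingual table. The file names and column names (`concept_id`, `source_term`, `target_term`) are hypothetical, not the actual TICO-19 layout:

```python
"""
Rough sketch: merge bilingual terminology CSVs back into one multilingual table.
File names and column names below are hypothetical, not the real TICO-19 layout.
"""
import csv
from collections import defaultdict

# Hypothetical bilingual files: each row has a concept id, the English source
# term and the target-language term.
BILINGUAL_FILES = {
    "pt": "terminology_en_pt.csv",
    "ar": "terminology_en_ar.csv",
    "sw": "terminology_en_sw.csv",
}

def merge_bilingual_csvs(files):
    """Return {concept_id: {"en": term, "pt": term, ...}}."""
    merged = defaultdict(dict)
    for lang, path in files.items():
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                concept = row["concept_id"]                  # assumed column name
                merged[concept]["en"] = row["source_term"]   # assumed column name
                merged[concept][lang] = row["target_term"]   # assumed column name
    return merged

def write_multilingual_csv(merged, out_path, languages):
    """Write one wide CSV: concept_id, en, then one column per target language."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["concept_id", "en", *languages])
        for concept, terms in sorted(merged.items()):
            writer.writerow([concept, terms.get("en", "")]
                            + [terms.get(lang, "") for lang in languages])

if __name__ == "__main__":
    data = merge_bilingual_csvs(BILINGUAL_FILES)
    write_multilingual_csv(data, "terminology_multilingual.csv", list(BILINGUAL_FILES))
```

The mechanical part is trivial; the hard part (which is why it should be done by the original project) is deciding what to do when the same concept got different review levels per language.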
Way back on https://github.com/EticaAI/HXL-Data-Science-file-formats/issues I remember trying to use Okapi to convert CSVs (like the ones on HXL) to TMX and TBX. Okapi does not support TBX (https://okapiframework.org/wiki/index.php/Open_Standards#TBX_-_Term_Base_eXchange). And the other tool that supports TBX, https://toolkit.translatehouse.org/ (see http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/csv2tbx.html), has an internal format that is mono- or bilingual, so the tooling itself simply cannot handle multilingual terminology. So, even if TBX was, in theory, good for merging the end result and getting translations back from Translators Without Borders, the people who compiled the final project would need to create their own tooling, and that means reading the entire TBX specification.
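To give an idea of what that "own tooling" could look like: a minimal sketch that writes the merged multilingual table from the previous snippet as a very simplified TBX-style termbase (one `termEntry` per concept, one `langSet` per language). This is only an approximation of the structure, not a validated TBX-Basic exporter:

```python
"""
Minimal sketch: export a multilingual term table as simplified TBX-style XML.
Approximates the martif/termEntry/langSet/tig/term structure; not a validated
TBX-Basic implementation.
"""
import xml.etree.ElementTree as ET

def to_tbx(merged, out_path):
    """merged: {concept_id: {"en": term, "pt": term, ...}} as in the previous sketch."""
    martif = ET.Element("martif", {"type": "TBX", "xml:lang": "en"})
    text = ET.SubElement(martif, "text")
    body = ET.SubElement(text, "body")
    for concept_id, terms in sorted(merged.items()):
        entry = ET.SubElement(body, "termEntry", {"id": str(concept_id)})
        for lang, term in sorted(terms.items()):
            lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    ET.ElementTree(martif).write(out_path, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    # Tiny in-memory example termbase (illustrative values only).
    sample = {"C001": {"en": "quarantine", "pt": "quarentena", "ar": "حجر صحي"}}
    to_tbx(sample, "terminology.tbx")
```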
Same reasoning as the previous updated comment. I will keep the old comments from yesterday here. I still think this sounds a bit like non-intentional helicopter research; but to improve the quality of the data collected and shared, since it was produced with genuinely different strategies, it would need more planning ahead of time, especially for companies where translation/terminology is not even related to their main goal.

**Old comment here**

I will get some sleep and continue tomorrow (so whatever I'm writing here is not fully reviewed). TBX does have less tooling, but unless the source content for data/TM (which may be the CSVs) was already reduced so that no extra relevant information remained, using Translation Memory eXchange (TMX) (https://en.wikipedia.org/wiki/Translation_Memory_eXchange) instead of XLIFF (XML Localization Interchange File Format) (https://en.wikipedia.org/wiki/XLIFF) to store translations is actually less powerful. Any decent tool focused on translations supports XLIFF. Even open-source ones like https://www.matecat.com/ allow exporting translators' annotations, do some automated tagging and checks, and let the user rearrange the source text when it is poorly written (which is not rare for source text in English).

**Why not TMX (instead of XLIFF) for translations**

TMX is a dead standard. And storing only language pairs still does not fix the big picture. It is supported by most machine learning pipelines and the like only because it is simple to implement, but it falls short for serious usage even on language pairs. I think the TMX files were generated later (so they are not really losing information, because the data was already stored first without any more relevant context), but for cases where translations are already stored in XLIFF, TMX is a downgrade. And I'm saying this because it is bizarre to try to apply a lot of machine learning and such to detect the accuracy of translations while not caring about how this data is collected and stored. If Facebook or Google employees read this comment in the coming years, for god's sake, take some time and implement proper parsing of XLIFF. TBX is far better, but most translation tools would export XLIFF as their best output format.

**The not-so-ironic relation with TBX and helicopter research**

One reason to implement more support for data formats that collect more input from translators (instead of over-reducing everything to only what machines can understand) is that the current approach resembles helicopter research.
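As a concrete (if hypothetical) illustration of what gets lost when only TMX is published: a small sketch that pulls source, target, and translator notes out of an XLIFF 1.2 file. The element names follow the XLIFF 1.2 namespace; the input file name is made up:

```python
"""
Sketch: extract source, target and translator notes from an XLIFF 1.2 file.
TMX keeps only the segment pairs; the <note> elements shown here are exactly
the kind of translator feedback that gets dropped.
"""
import xml.etree.ElementTree as ET

XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

def extract_units(xliff_path):
    """Yield (unit id, source text, target text, [notes]) tuples."""
    tree = ET.parse(xliff_path)
    for unit in tree.getroot().iter("{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        notes = [n.text for n in unit.findall("x:note", XLIFF_NS) if n.text]
        yield (
            unit.get("id"),
            source.text if source is not None else "",
            target.text if target is not None else "",
            notes,
        )

if __name__ == "__main__":
    # Hypothetical file name; any XLIFF 1.2 export from a CAT tool should parse.
    for unit_id, src, tgt, notes in extract_units("covid_terms.pt.xliff"):
        print(unit_id, src, "->", tgt, "| notes:", "; ".join(notes))
```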
This part of the paper seems to ignore the fact that for most of these languages the translators were actually doing word-for-word translations, because the languages themselves may never have had a term for such concepts. This word-by-word strategy is mentioned, for example, for Arabic in https://www.researchgate.net/publication/272431518_Methods_of_Creating_and_Introducing_New_Terms_in_Arabic_Contributions_from_English-Arabic_Translation. My point here is that storing additional metadata from translators (like their comments), which would be possible with any minimal tool compatible with XLIFF, would have made it possible to see why the translations were failing.
The paper mentions somewhere that there will be at least some annotations. So maybe I just did not find them in the zip files, in which case 60% of my written rant here can be wrong. But I don't think they can achieve such high accuracy when even the source text has punctuation, like commas, missing.
Quick proof of concept (merging only 4 languages from 3 different files) here: hxl proxy link. This link may stop working if the test files are changed (which is likely). As long as the working languages are well defined, the merge itself works as expected.

Current screenshots
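A small follow-up sketch (again using the hypothetical column layout from the merge example above) that checks whether every concept actually has a term in each of the expected working languages, which is the kind of check that only makes sense once those languages are well defined:

```python
"""
Sketch: sanity-check a merged multilingual terminology CSV, reporting which
concepts are missing terms for any of the expected working languages.
Column names are the hypothetical ones used in the earlier merge sketch.
"""
import csv

EXPECTED_LANGUAGES = ["en", "pt", "ar", "sw"]  # assumed working languages

def report_missing_terms(path, languages=EXPECTED_LANGUAGES):
    """Print one line per concept that lacks a term in some expected language."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            missing = [lang for lang in languages if not row.get(lang, "").strip()]
            if missing:
                print(f"{row['concept_id']}: missing {', '.join(missing)}")

if __name__ == "__main__":
    report_missing_terms("terminology_multilingual.csv")
```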
…tch scripts/patch/data-terminology-facebook.diff (fititnt/hxltm-action#5, #1)
…al way of label targetLang (Facebook terminology only) (fititnt/hxltm-action#5, #1)
While not crucial for implementing the V1 of hxltm-action, this issue will be used to document the strategy used to convert this real dataset as a test case for conversion. The end result could both be useful on its own and help us understand what additional tooling could be relevant.