Separate function and data #14
Could you explain more about this? I'm working on a change to the API that allows a custom charmap (it can be set in the configuration, so all following calls will use that config without providing options each time). Does that solve the problem? I'm also thinking of providing a pluggable data source, but it needs to be well considered.
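To make the idea concrete, here is a minimal sketch of what a configure-once charmap API could look like. The `configure` and `transliterate` names and the shape of the options are assumptions for illustration, not the library's actual API.

```javascript
// Hypothetical sketch of a configure-once charmap API (names and option
// shape are made up for illustration, not the package's real interface).
const defaultCharmap = { 'ä': 'a', 'ö': 'o', 'ü': 'u' };
let activeCharmap = defaultCharmap;

// Set the charmap once; all following calls use it without options.
function configure(options = {}) {
  if (options.charmap) activeCharmap = options.charmap;
}

function transliterate(str) {
  return Array.from(str)
    .map((ch) => (ch in activeCharmap ? activeCharmap[ch] : ch))
    .join('');
}

configure({ charmap: { 'я': 'ya', 'ю': 'yu' } });
console.log(transliterate('яю')); // 'yayu'
```

With this shape, callers who never call `configure` get the default data, while everyone else pays the configuration cost exactly once.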
dist/transliteration.js - a function/API that transliterates words or sentences. BTW, check wiki/Help:Multilingual_support_(Indic) to understand that other languages transliterate in more complex ways.

Function:

Data:

Example 1: I want to use my own set of pairs:

Example 2: I want to use my own coolest transliteration tool, but the same character pair data:
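The original code snippets for these examples were not preserved, so here is a hedged reconstruction of the split being proposed: the data is a plain pair map (shippable as JSON), and the function takes it explicitly. All names are hypothetical.

```javascript
// Sketch of the proposed function/data split (hypothetical names).
// Data: a plain object of character pairs, shippable as JSON.
const pairs = { 'č': 'c', 'š': 's', 'ž': 'z' };

// Function: takes the data explicitly instead of bundling it.
function transliterate(str, data) {
  return Array.from(str).map((ch) => data[ch] ?? ch).join('');
}

// Example 1: use my own set of pairs.
const myPairs = { 'ā': 'a', 'ē': 'e' };
console.log(transliterate('āē', myPairs)); // 'ae'

// Example 2: use my own tool, but the same character pair data.
const myTool = (str, data) =>
  Array.from(str).map((ch) => (data[ch] || ch).toUpperCase()).join('');
console.log(transliterate('čšž', pairs)); // 'csz'
console.log(myTool('čšž', pairs));        // 'CSZ'
```

Because the data is a plain object, either side of the split can be swapped out independently, which is exactly the point of the issue.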
Thanks for the link. Yes, it can be done this way. If gzipped it should be much smaller, and in the future we'll probably need to cache it in localStorage or somewhere if it's run in the browser.
I just tried it. After gzipping, the
We need to change the data grouping behavior. From the beginning, I saw that the data was grouped by char code, so to cover all characters you need a lot of memory. But do you really need that? Does anyone need that? I think not. In the Baltic states, for example, there are 5 common languages: LV, LT, EE, EN, RU. If I want to cover all transliteration for these languages, I should get a few special characters from LV, LT, EE and all combinations from RU (because of its non-Latin alphabet).
I think the memory usage is OK for now. It takes less than 2MB RSS if you load all the data into memory. Please try process.memoryUsage() to see. And for Node, those files are conditionally loaded, meaning if you do not use them they are not loaded.
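The check suggested above can be run like this; the loaded data is simulated here for illustration.

```javascript
// Inspect resident set size before and after loading data into memory,
// using Node's built-in process.memoryUsage().
const before = process.memoryUsage().rss;

// Simulate loading a charmap into memory (synthetic data, for illustration).
const data = new Array(10000).fill(null).map((_, i) => ({ [i]: String(i) }));

const after = process.memoryUsage().rss;
console.log(`rss before: ${before} bytes, after: ${after} bytes`);
console.log(`delta: ${((after - before) / 1024).toFixed(1)} KB, entries: ${data.length}`);
```

Note that `rss` covers the whole process, so the delta is only a rough indication of what one data structure costs.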
@andyhu Any news on this?
There is another way...
With an example:
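The code for this workaround was not preserved in the thread. As a guess at the kind of approach meant (consistent with the later comments about webpack and trimmed JSON data), a bundler alias can point the package's charmap at a smaller replacement file. Both module paths below are hypothetical.

```javascript
// webpack.config.js (sketch only; both paths are hypothetical, not the
// package's actual file layout). The alias swaps the full bundled charmap
// for a trimmed, language-specific JSON file, so only the required
// character pairs end up in the bundle.
const path = require('path');

module.exports = {
  resolve: {
    alias: {
      'transliteration/data/charmap.json':
        path.resolve(__dirname, 'src/charmap.ru.json'),
    },
  },
};
```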
@ogonkov Nice idea. If I have time, I'll first try to replace browserify with webpack. And maybe it's better to make different builds for different purposes and developers' own preferences, so it can be more flexible. Currently, if you want, you can replace the default character map database by using an undocumented API.
Actually, the most difficult thing is not the code but the data. I can't find enough data to support all the different sorts of transliteration rules for each language. Probably I'll first make it more flexible and let the community contribute the character mapping rules for each language.
Well, the solution above works pretty nicely for me; it leaves only the required data in the JSON, which saves a lot in total. For Russian transliteration this charmap works smoothly.
I'm also thinking of compressing the JSON file and unpacking it in the browser, so it can save around 100+KB of space for the default code map data.
I think it's better not to bundle the JSON by default, and to have each charmap as a separate module that can be imported on its own.
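What that could look like, sketched with inline objects standing in for per-language data modules. The module names in the comments are invented for illustration; the package does not currently ship such entry points.

```javascript
// Hypothetical per-language data modules that can be imported separately
// and merged into one charmap. The module paths in the comments are made
// up for illustration.
const ru = { 'б': 'b', 'г': 'g' }; // e.g. require('transliteration/charmap/ru')
const lv = { 'ā': 'a', 'ē': 'e' }; // e.g. require('transliteration/charmap/lv')

// Only the languages you actually import end up in the bundle.
const charmap = { ...ru, ...lv };

function transliterate(str) {
  return Array.from(str).map((ch) => charmap[ch] ?? ch).join('');
}

console.log(transliterate('бā')); // 'ba'
```

This is the usual way bundlers keep unused data out of the build: tree shaking and dead-code elimination can only drop data that lives in separate modules.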
@ogonkov That's true. I'm thinking of separating code and data in the v3 version.
2.x is breaking my solution above, because now the JSON data is bundled into the JS file that uses it.
I'll probably implement this in the next version, but only if I can manage to get enough time.
What is the plan? How should it work?
Like separating function and data as this issue suggests, and probably also adding an async method, since sync operations may block the main thread under intensive usage. I'm also planning to implement a new algorithm for data storage. It should reduce the package size to about 50% to 70% of the current size and also improve performance a bit (hopefully).
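One common shape for such an async method is to process the input in chunks and yield to the event loop between them. The sketch below is an assumption about the approach, not the maintainer's actual plan; all names are hypothetical.

```javascript
// Hypothetical async variant: processes the string in chunks and yields
// to the event loop between them, so one huge input does not block it
// in a single long synchronous run. Sketch only, not the real API.
const charmap = { 'ß': 'ss', 'æ': 'ae' };

function transliterateChunkSync(str) {
  return Array.from(str).map((ch) => charmap[ch] ?? ch).join('');
}

async function transliterateAsync(str, chunkSize = 1024) {
  let out = '';
  for (let i = 0; i < str.length; i += chunkSize) {
    out += transliterateChunkSync(str.slice(i, i + chunkSize));
    // give the event loop a chance to run other tasks between chunks
    await new Promise((resolve) => setImmediate(resolve));
  }
  return out;
}

transliterateAsync('straße').then((result) => console.log(result)); // 'strasse'
```

The trade-off is latency: each yield costs a trip through the event loop, so the chunk size has to balance responsiveness against throughput.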
One of the proposals in this comment? #14 (comment)
I have to investigate a bit more first, but it should be something similar.
Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - use separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. You could save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner), as an ongoing data update process executed each time before releasing new package versions.

As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point - I can tell you that this is a fairly big undertaking and a never-ending process of patches that are hard to verify and confirm without knowledge of the native languages, so you need a large and active community of contributors for all languages to get them right.

EDIT: PR #57 from 2017 seems to go in a similar direction.
Yes, actually I used to be a Drupal developer myself (Drupal 6). It's an excellent module and gave me a lot of inspiration. The original data was converted from PHP, but back then Drupal's transliteration module didn't have separate files. Maybe I'll take a look again. There are many errors in the data, but I've been fixing them whenever anyone finds one.
T13n is an array of character pairs only, organized by manual or local t13n standards (character mappings/pairs). Some people need only a tool for converting strings by their own standard; some need the pair collections to use in their own software.
I think we need to separate the tool from the data (character pairs).