
Separate function and data #14

Open
iegik opened this issue May 14, 2016 · 24 comments

Comments

@iegik

iegik commented May 14, 2016

T13n (transliteration) is essentially just an array of character pairs, organized according to manual or local transliteration standards (character mappings). Some people only need a tool that converts strings by their own standard; others need the pair collections themselves, to use in their own software.

I think we need to separate the tool from the data (the character pairs).

@dzcpy
Owner

dzcpy commented May 14, 2016

Could you explain this in more detail? I'm working on an API change to allow a custom charmap (it can be set in the configuration, so all subsequent calls will use that config without passing options each time). Does that solve the problem? I'm also considering a pluggable data source, but that needs to be thought through carefully.

@iegik
Author

iegik commented May 14, 2016

dist/transliteration.js - the function/API that transliterates words or sentences
dist/transliteration-data.js - the set of character pairs (or formulas) describing how words should be transliterated.

BTW, check wiki/Help:Multilingual_support_(Indic) to see that other languages transliterate in more complex ways.

Function:

module.exports = function transliteration(data, req, res) {
    // ... calculations depending on the passed `data`
    return res;
};

Data:

var Cyrillic = require('./data/Cyrillic.iso9.ru');
var Hindi = require('./data/Hindi.remington.hi'); // example only (http://www.findurlaptop.com/tech/2012/08/20/hindi-typing-and-google-transliteration/)
module.exports = Object.assign({}, Cyrillic, Hindi);

Example 1:

I want to use my own set of pairs:

var data = require('hulu-char-pairs');
var tr = require('transliteration').bind(null, data);
var res = tr(req);

Example 2:

I want to use my own (coolest) transliteration tool, but the same character-pair data:

var data = require('transliteration.data');
var tr = require('my-transliteration').bind(null, data);
var res = tr(req);
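For illustration, a minimal self-contained sketch of the proposed split (all names and the tiny charmap below are hypothetical, not the library's actual API): the function half knows nothing about any particular language, and the data half is a plain object that can be swapped out.

```javascript
// Function half: pure per-character lookup, no built-in language knowledge.
function transliterate(charmap, input) {
  return Array.from(input)
    .map((ch) => (ch in charmap ? charmap[ch] : ch))
    .join('');
}

// Data half: a tiny Cyrillic fragment, shipped as a separate module.
const cyrillic = { 'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd' };

module.exports = { transliterate, cyrillic };
```

With `Function.prototype.bind`, this reproduces Example 1: `const tr = transliterate.bind(null, cyrillic);`.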

@dzcpy
Owner

dzcpy commented May 15, 2016

Thanks for the link. Yes, it can be done this way; gzipped, it should be much smaller, and in the future we'll probably need to cache it in localStorage or somewhere when it runs in the browser.

@dzcpy
Owner

dzcpy commented May 15, 2016

I just tried it. After gzipping, the transliteration.min.js file weighs less than 80 KB. However, do you have more info on how to get different rules for transliterating different languages? And where can we get the data? I can get some data for transliterating Chinese and Japanese, which both have cases where one character can have different pronunciations depending on context or word combinations. How about other languages?

@iegik
Author

iegik commented May 15, 2016

@iegik
Author

iegik commented May 15, 2016

We need to change the data-grouping behavior. From the beginning, I saw that the data was grouped by char code, so covering all characters requires a lot of memory. But do you really need that? Does anyone?

I think not. In the Baltic states, for example, there are five common languages: LV, LT, EE, EN, RU. If I want to cover transliteration for all of them, I only need a few special characters from LV, LT, EE and all the combinations from RU (because of its non-Latin alphabet).

@dzcpy
Owner

dzcpy commented May 15, 2016

I think the memory usage is OK for now. It takes less than 2 MB of RSS if you load all the data into memory; try process.memoryUsage() to see. And for Node, those files are conditionally loaded, meaning that if you don't use them, they aren't required in the code.
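The check suggested above can be scripted roughly like this (a generated object stands in for the real charmap files here, so the numbers are only indicative):

```javascript
// Measure resident set size before and after loading the "data".
const before = process.memoryUsage().rss;

// Stand-in for the require('./data/...') calls: build a large mapping object.
const charmap = {};
for (let i = 0; i < 100000; i++) {
  charmap[String.fromCharCode(0x0400 + (i % 50000))] = 'x' + i;
}

const after = process.memoryUsage().rss;
console.log(`data added ~${((after - before) / 1024 / 1024).toFixed(1)} MB RSS`);
```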

@dzcpy dzcpy changed the title Seporate function and data Separate function and data May 21, 2016
@shrpne

shrpne commented Mar 1, 2018

@andyhu Any news on this?
It is a great library, but it bloats my bundle size a lot: ~300 KB minified, and the data seems to take almost all of that space.
If I could bundle only the languages I need, the bundle size would decrease significantly.

@iegik
Author

iegik commented Mar 1, 2018

There is another way...
If you only need to remove accents in Latin-script languages, I'd recommend String.prototype.normalize for now.

// shim for String.prototype.normalize: https://github.com/walling/unorm
export default str => str.normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // strip combining accent marks
    .replace(/\u0142/g, 'l'); // ł is a letter in its own right and does not decompose

/*
import normalize from './normalize.js';
console.log(normalize('ąśćńżółźćęāēūīšģķļ')); // ascnzolzceaeuisgkl
*/

@ogonkov
Contributor

ogonkov commented May 21, 2018

With webpack, one could map only the alphabets required for transliteration to charmap.json

Example
https://gist.github.com/ogonkov/bb415854f6a27e39471d391672e43003
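A sketch of the idea in the gist above (the file names are hypothetical): use webpack's NormalModuleReplacementPlugin to swap the library's bundled charmap for a trimmed copy containing only the alphabets you actually need.

```javascript
// webpack.config.js
const path = require('path');
const webpack = require('webpack');

module.exports = {
  plugins: [
    // Whenever a module requests a file matching charmap.json,
    // resolve it to our trimmed, language-specific copy instead.
    new webpack.NormalModuleReplacementPlugin(
      /charmap\.json$/,
      path.resolve(__dirname, 'src/charmap.ru.json')
    ),
  ],
};
```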

@iegik iegik closed this as completed May 30, 2018
@dzcpy
Owner

dzcpy commented Jul 15, 2018

@ogonkov Nice idea. If I have time, I'll first try to replace browserify with webpack. And maybe it's better to make different builds for different purposes and developers' own preferences, so it's more flexible.

Currently, if you want, you can replace the default character-map database via an undocumented API, transliterate.setCharmap(), but you cannot get rid of the default one. The character-map data originally comes from the ICU project; I know its quality is pretty low, but I can't find a better data source. Ideally, the end user should be able to choose which languages or Unicode blocks the module should support and load the respective data. That requires a lot of work, especially high-quality data.

@dzcpy dzcpy reopened this Jul 15, 2018
@dzcpy
Owner

dzcpy commented Jul 15, 2018

Actually, the most difficult thing is not the code but the data. I can't find enough data to support all the different transliteration rules for each language. I'll probably make the module more flexible first, and let the community contribute the mapping rules for each language.

@ogonkov
Contributor

ogonkov commented Jul 15, 2018

Well, the solution above works pretty nicely for me: it leaves only the required data in the JSON and saves a lot overall.

For Russian transliteration this charmap works smoothly.

@dzcpy
Owner

dzcpy commented Jul 15, 2018

I'm also thinking of compressing the JSON file and unpacking it in the browser, which could save around 100+ KB for the default code-map data.

@ogonkov
Contributor

ogonkov commented Jul 15, 2018

I think it's better not to bundle the JSON by default, and to expose each charmap key as a separate module that can be imported on its own
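A rough, self-contained sketch of that idea (the tiny charmaps below stand in for hypothetical per-language modules such as `transliteration/charmaps/ru`):

```javascript
// In the proposed layout these would each be a separate import, e.g.
//   const ru = require('transliteration/charmaps/ru'); // hypothetical path
const ru = { 'п': 'p', 'р': 'r', 'и': 'i', 'в': 'v', 'е': 'e', 'т': 't' };
const lv = { 'ā': 'a', 'č': 'c', 'š': 's' };

// The consumer merges only the alphabets it needs; a bundler never sees the rest.
const charmap = { ...ru, ...lv };
const transliterate = (s) =>
  Array.from(s).map((ch) => (ch in charmap ? charmap[ch] : ch)).join('');

console.log(transliterate('привет')); // privet
```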

@dzcpy
Owner

dzcpy commented Jan 11, 2019

@ogonkov That's true. I'm thinking of separating code and data in the v3 version

@ogonkov
Contributor

ogonkov commented Feb 26, 2020

2.x breaks my solution above, because the JSON data is now bundled into the JS file that uses it.

@dzcpy
Owner

dzcpy commented Mar 15, 2020

I'll probably implement this in the next version, but only if I manage to find enough time

@ogonkov
Contributor

ogonkov commented Mar 16, 2020

What is the plan? How should it work?

@dzcpy
Owner

dzcpy commented Mar 17, 2020

Like separating function and data, as this issue suggests, and probably also adding an async method, since sync operations can block the main thread under intensive usage. I'm also planning to implement a new data-storage algorithm. It should reduce the package to about 50-70% of its current size and hopefully also improve performance a bit.
Anyway, considering my current workload, I might not be able to start working on it within 1-2 months, unless there's a sponsor.
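One possible shape for such an async method (the name and chunking strategy here are my own sketch, not a committed design): process the input in chunks and yield to the event loop between them, so a large input does not block the main thread in one long call.

```javascript
// Hypothetical async variant: chunked transliteration that yields between chunks.
async function transliterateAsync(charmap, input, chunkSize = 4096) {
  let out = '';
  for (let i = 0; i < input.length; i += chunkSize) {
    out += Array.from(input.slice(i, i + chunkSize))
      .map((ch) => (ch in charmap ? charmap[ch] : ch))
      .join('');
    // Let timers/I/O run before the next chunk (Node.js).
    await new Promise((resolve) => setImmediate(resolve));
  }
  return out;
}
```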

@ogonkov
Contributor

ogonkov commented Mar 17, 2020

One of the proposals in this comment? #14 (comment)

@dzcpy
Owner

dzcpy commented Mar 18, 2020

> One of the proposals in this comment? #14 (comment)

I have to investigate a bit more first, but should be something similar.

@sun

sun commented May 5, 2020

Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - are using separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. You could save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner) as an ongoing data update process executed each time before releasing new package versions?

As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point already - I can tell that this is a fairly big undertaking and never-ending process of patches that are hard to verify and confirm without knowledge about the native languages, so you need a large and active community of contributors for all languages to get them right.

EDIT: PR #57 from 2017 seems to go in a similar direction.
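As a sketch of the conversion @sun describes, something like the following could pull mapping strings out of a PHP data file and re-emit them as JSON (the PHP shape shown here is simplified; real Drupal data files would need a more careful parser):

```javascript
// Extract quoted strings from a simplified `$base = array('a', 'b', ...);`
// PHP literal and re-emit them as a JSON array.
function phpArrayToJson(phpSource) {
  const match = phpSource.match(/array\s*\(([\s\S]*)\)/);
  if (!match) throw new Error('no array literal found');
  const items = [...match[1].matchAll(/'((?:[^'\\]|\\.)*)'/g)].map((m) => m[1]);
  return JSON.stringify(items);
}

console.log(phpArrayToJson("$base = array('a', 'b', 'cj');")); // ["a","b","cj"]
```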

@dzcpy
Owner

dzcpy commented May 9, 2020

> Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - are using separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. You could save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner) as an ongoing data update process executed each time before releasing new package versions?
>
> As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point already - I can tell that this is a fairly big undertaking and never-ending process of patches that are hard to verify and confirm without knowledge about the native languages, so you need a large and active community of contributors for all languages to get them right.
>
> EDIT: PR #57 from 2017 seems to go in a similar direction.

Yes, actually I used to be a Drupal developer myself (Drupal 6). It's an excellent module and gave me a lot of inspiration. The original data was in fact converted from PHP, but back then Drupal's transliteration module didn't have separate files. Maybe I'll take another look. There are many errors in the data, but I've been fixing them whenever anyone finds one.
Maybe we can share the data between the two projects.
I'm thinking of refactoring the code from scratch; I just haven't had enough time.
As you said, a large and active community is necessary. Do you have any suggestions for building such a community?
