
Separate function and data #14

Open
iegik opened this issue May 14, 2016 · 24 comments

Comments

@iegik

iegik commented May 14, 2016

T13n (transliteration) is essentially just an array of character pairs, organized according to manual or local transliteration standards (character mappings). Some people only need a tool that converts strings by their own standard; others need the pair collections themselves, to use in their own software.

I think we need to separate the tool from the data (the character pairs).

@dzcpy
Owner

dzcpy commented May 14, 2016

Could you explain this in more detail? I'm working on an API change to allow a custom charmap (it can be set in the configuration, so all subsequent calls will use that config without passing options each time). Does that solve the problem? I'm also considering a pluggable data source, but that needs to be thought through carefully.

@iegik
Author

iegik commented May 14, 2016

dist/transliteration.js - the function/API that transliterates words or sentences
dist/transliteration-data.js - the set of character pairs (or formulas) describing how words should be transliterated.

BTW, check wiki/Help:Multilingual_support_(Indic) to see that other languages transliterate in more complex ways.

Function:

module.exports = function transliteration(data, req, res) {
    // ... calculations depending on the passed `data`
    return res;
};

Data:

var Cyrillic = require('./data/Cyrillic.iso9.ru');
var Hindi = require('./data/Hindi.remington.hi'); // example only (http://www.findurlaptop.com/tech/2012/08/20/hindi-typing-and-google-transliteration/)
module.exports = Object.assign({}, Cyrillic, Hindi);

Example 1:

I want to use my own set of pairs:

var data = require('hulu-char-pairs');
var tr = require('transliteration').bind(null, data);
var res = tr(req);

Example 2:

I want to use my own (coolest) transliteration tool, but the same character-pair data:

var data = require('transliteration.data');
var tr = require('my-transliteration').bind(null, data);
var res = tr(req);
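For illustration, a minimal self-contained sketch of the proposed split (all names and the tiny charmap below are hypothetical, not the library's actual API): the function half knows nothing about any particular language, and the data half is a plain object that can be swapped out.

```javascript
// Function half: pure per-character lookup, no built-in language knowledge.
function transliterate(charmap, input) {
  return Array.from(input)
    .map((ch) => (ch in charmap ? charmap[ch] : ch))
    .join('');
}

// Data half: a tiny Cyrillic fragment, shipped as a separate module.
const cyrillic = { 'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd' };

module.exports = { transliterate, cyrillic };
```

With `Function.prototype.bind`, this reproduces Example 1: `const tr = transliterate.bind(null, cyrillic);`.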

@dzcpy
Owner

dzcpy commented May 15, 2016

Thanks for the link. Yes, it can be done this way; gzipped, it should be much smaller, and in the future we'll probably need to cache it in localStorage or somewhere when it runs in the browser.

@dzcpy
Owner

dzcpy commented May 15, 2016

I just tried it. After gzipping, the transliteration.min.js file weighs less than 80 KB. However, do you have more info on how to get different rules for transliterating different languages? And where can we get the data? I can get some data for transliterating Chinese and Japanese, which both have cases where one character can have different pronunciations depending on context or word combinations. How about other languages?

@iegik
Author

iegik commented May 15, 2016

@iegik
Author

iegik commented May 15, 2016

We need to change the data-grouping behavior. From the beginning, I saw that the data was grouped by char code, so covering all characters requires a lot of memory. But do you really need that? Does anyone?

I think not. In the Baltic states, for example, there are five common languages: LV, LT, EE, EN, RU. If I want to cover transliteration for all of them, I only need a few special characters from LV, LT, EE and all the combinations from RU (because of its non-Latin alphabet).

@dzcpy
Owner

dzcpy commented May 15, 2016

I think the memory usage is OK for now. It takes less than 2 MB of RSS if you load all the data into memory; try process.memoryUsage() to see. And for Node, those files are conditionally loaded, meaning that if you don't use them, they aren't required in the code.
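The check suggested above can be scripted roughly like this (a generated object stands in for the real charmap files here, so the numbers are only indicative):

```javascript
// Measure resident set size before and after loading the "data".
const before = process.memoryUsage().rss;

// Stand-in for the require('./data/...') calls: build a large mapping object.
const charmap = {};
for (let i = 0; i < 100000; i++) {
  charmap[String.fromCharCode(0x0400 + (i % 50000))] = 'x' + i;
}

const after = process.memoryUsage().rss;
console.log(`data added ~${((after - before) / 1024 / 1024).toFixed(1)} MB RSS`);
```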

@dzcpy dzcpy changed the title Seporate function and data Separate function and data May 21, 2016
@shrpne

shrpne commented Mar 1, 2018

@andyhu Any news on this?
It is a great library, but it bloats my bundle size a lot: ~300 KB minified, and the data seems to take almost all of that space.
If I could bundle only the languages I need, the bundle size would decrease significantly.

@iegik
Author

iegik commented Mar 1, 2018

There is another way...
If you only need to remove accents in Latin-script languages, I'd recommend String.prototype.normalize for now.

// shim for String.prototype.normalize: https://github.com/walling/unorm
export default str => str.normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '') // strip combining accent marks
    .replace(/\u0142/g, 'l'); // ł is a letter in its own right and does not decompose

/*
import normalize from './normalize.js';
console.log(normalize('ąśćńżółźćęāēūīšģķļ')); // ascnzolzceaeuisgkl
*/

@ogonkov
Contributor

ogonkov commented May 21, 2018

With webpack, one could map only the alphabets required for transliteration to charmap.json

Example
https://gist.github.com/ogonkov/bb415854f6a27e39471d391672e43003
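A sketch of the idea in the gist above (the file names are hypothetical): use webpack's NormalModuleReplacementPlugin to swap the library's bundled charmap for a trimmed copy containing only the alphabets you actually need.

```javascript
// webpack.config.js
const path = require('path');
const webpack = require('webpack');

module.exports = {
  plugins: [
    // Whenever a module requests a file matching charmap.json,
    // resolve it to our trimmed, language-specific copy instead.
    new webpack.NormalModuleReplacementPlugin(
      /charmap\.json$/,
      path.resolve(__dirname, 'src/charmap.ru.json')
    ),
  ],
};
```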

@iegik iegik closed this as completed May 30, 2018
@dzcpy
Owner

dzcpy commented Jul 15, 2018

@ogonkov Nice idea. If I have time, I'll first try to replace browserify with webpack. And maybe it's better to make different builds for different purposes and developers' own preferences, so it's more flexible.

Currently, if you want, you can replace the default character-map database via an undocumented API, transliterate.setCharmap(), but you cannot get rid of the default one. The character-map data originally comes from the ICU project; I know its quality is pretty low, but I can't find a better data source. Ideally, the end user should be able to choose which languages or Unicode blocks the module should support and load the respective data. That requires a lot of work, especially high-quality data.

@dzcpy dzcpy reopened this Jul 15, 2018
@dzcpy
Owner

dzcpy commented Jul 15, 2018

Actually, the most difficult thing is not the code but the data. I can't find enough data to support all the different transliteration rules for each language. I'll probably make the module more flexible first, and let the community contribute the mapping rules for each language.

@ogonkov
Contributor

ogonkov commented Jul 15, 2018

Well, the solution above works pretty nicely for me: it leaves only the required data in the JSON and saves a lot overall.

For Russian transliteration this charmap works smoothly.

@dzcpy
Owner

dzcpy commented Jul 15, 2018

I'm also thinking of compressing the JSON file and unpacking it in the browser, which could save around 100+ KB for the default code-map data.

@ogonkov
Contributor

ogonkov commented Jul 15, 2018

I think it's better not to bundle the JSON by default, and to expose each charmap key as a separate module that can be imported on its own
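A rough, self-contained sketch of that idea (the tiny charmaps below stand in for hypothetical per-language modules such as `transliteration/charmaps/ru`):

```javascript
// In the proposed layout these would each be a separate import, e.g.
//   const ru = require('transliteration/charmaps/ru'); // hypothetical path
const ru = { 'п': 'p', 'р': 'r', 'и': 'i', 'в': 'v', 'е': 'e', 'т': 't' };
const lv = { 'ā': 'a', 'č': 'c', 'š': 's' };

// The consumer merges only the alphabets it needs; a bundler never sees the rest.
const charmap = { ...ru, ...lv };
const transliterate = (s) =>
  Array.from(s).map((ch) => (ch in charmap ? charmap[ch] : ch)).join('');

console.log(transliterate('привет')); // privet
```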

@dzcpy
Owner

dzcpy commented Jan 11, 2019

@ogonkov That's true. I'm thinking of separating code and data in the v3 version

@ogonkov
Contributor

ogonkov commented Feb 26, 2020

2.x breaks my solution above, because the JSON data is now bundled into the JS file that uses it.

@dzcpy
Owner

dzcpy commented Mar 15, 2020

I'll probably implement this in the next version, but only if I manage to find enough time

@ogonkov
Contributor

ogonkov commented Mar 16, 2020

What is the plan? How should it work?

@dzcpy
Owner

dzcpy commented Mar 17, 2020

Like separating function and data, as this issue suggests, and probably also adding an async method, since sync operations can block the main thread under intensive usage. I'm also planning to implement a new data-storage algorithm. It should reduce the package to about 50-70% of its current size and hopefully also improve performance a bit.
Anyway, considering my current workload, I might not be able to start working on it within 1-2 months, unless there's a sponsor.
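One possible shape for such an async method (the name and chunking strategy here are my own sketch, not a committed design): process the input in chunks and yield to the event loop between them, so a large input does not block the main thread in one long call.

```javascript
// Hypothetical async variant: chunked transliteration that yields between chunks.
async function transliterateAsync(charmap, input, chunkSize = 4096) {
  let out = '';
  for (let i = 0; i < input.length; i += chunkSize) {
    out += Array.from(input.slice(i, i + chunkSize))
      .map((ch) => (ch in charmap ? charmap[ch] : ch))
      .join('');
    // Let timers/I/O run before the next chunk (Node.js).
    await new Promise((resolve) => setImmediate(resolve));
  }
  return out;
}
```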

@ogonkov
Contributor

ogonkov commented Mar 17, 2020

One of the proposals in this comment? #14 (comment)

@dzcpy
Owner

dzcpy commented Mar 18, 2020

> One of the proposals in this comment? #14 (comment)

I have to investigate a bit more first, but should be something similar.

@sun

sun commented May 5, 2020

Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - are using separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. You could save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner) as an ongoing data update process executed each time before releasing new package versions?

As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point already - I can tell that this is a fairly big undertaking and never-ending process of patches that are hard to verify and confirm without knowledge about the native languages, so you need a large and active community of contributors for all languages to get them right.

EDIT: PR #57 from 2017 seems to go in a similar direction.
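As a sketch of the conversion @sun describes, something like the following could pull mapping strings out of a PHP data file and re-emit them as JSON (the PHP shape shown here is simplified; real Drupal data files would need a more careful parser):

```javascript
// Extract quoted strings from a simplified `$base = array('a', 'b', ...);`
// PHP literal and re-emit them as a JSON array.
function phpArrayToJson(phpSource) {
  const match = phpSource.match(/array\s*\(([\s\S]*)\)/);
  if (!match) throw new Error('no array literal found');
  const items = [...match[1].matchAll(/'((?:[^'\\]|\\.)*)'/g)].map((m) => m[1]);
  return JSON.stringify(items);
}

console.log(phpArrayToJson("$base = array('a', 'b', 'cj');")); // ["a","b","cj"]
```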

@dzcpy
Owner

dzcpy commented May 9, 2020

> Other transliteration packages, such as the Transliteration component of Drupal - whose data was originally based on the Unidecode CPAN package but heavily improved in the meantime - are using separate mapping files for each language, and the language needs to be passed as an argument to the transliterate() function. You could save a lot of time and effort by parsing and converting those existing data files from PHP into JSON (with a one-liner) as an ongoing data update process executed each time before releasing new package versions?
>
> As a former co-maintainer of Drupal's Transliteration project - even though the base data of Unidecode was an excellent starting point already - I can tell that this is a fairly big undertaking and never-ending process of patches that are hard to verify and confirm without knowledge about the native languages, so you need a large and active community of contributors for all languages to get them right.
>
> EDIT: PR #57 from 2017 seems to go in a similar direction.

Yes, actually I used to be a Drupal developer myself (Drupal 6). It's an excellent module and gave me a lot of inspiration. The original data was in fact converted from PHP, but back then Drupal's transliteration module didn't have separate files. Maybe I'll take another look. There are many errors in the data, but I've been fixing them whenever anyone finds one.
Maybe we can share the data between the two projects.
I'm thinking of refactoring the code from scratch; I just haven't had enough time.
As you said, a large and active community is necessary. Do you have any suggestions for building such a community?
