Lower right of 舞 #116

Open
benkasminbullock opened this issue May 13, 2021 · 7 comments

Comments

@benkasminbullock

The lower right corner of 舞 seems to diverge between Chinese and Japanese.

Japanese seems to write 舞 with the lower right as four strokes:

https://kakijun.jp/page/15116200.html

舞

But the 㐄 element seems to be three strokes in the same Japanese sources:

https://kakijun.jp/page/masu06200.html

舛

In ids.txt they are unified onto one thing, but there seems to be an actual difference.

@hfhchan

hfhchan commented May 13, 2021

It is the same component but customarily written as different forms due to an inconsistency in Japanese kanji standardization. There is no semantic difference and the distinction is unifiable for the purposes of ISO10646 standardization.

@hfhchan

hfhchan commented May 13, 2021

If you want to decompose glyphs exactly as they look in various standards, you may want to check out yi-bai/ids which decomposes characters down to the stroke level and has data indicating stroke joining behaviour.

@benkasminbullock
Author

OK, but the requirements and details of that specification aren't documented here. If the data is required to fit that spec, that should at least be documented.

@hfhchan

hfhchan commented May 13, 2021

Unfortunately the maintainer has not been able to update the repository :/

This repository is the main data source used for IRG IDS algorithm, though the decomposition data is also useful for other purposes. That's why the IDSs used are more vague.

@benkasminbullock
Author

> If you want to decompose glyphs exactly as they look in various standards, you may want to check out yi-bai/ids which decomposes characters down to the stroke level and has data indicating stroke joining behaviour.

This? https://github.com/yi-bai/ids

It contains some information on this particular character:

舞 ⿳𠂉卌.⿱一舛.

舛 ⿰夕㐄.(.);⿰夕㐄J(K)

I'm not sure what that (.) all means yet, and the above doesn't accord with my own findings, but thank you for the pointer.
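For readers unfamiliar with the notation, the plain IDS part of the 舛 entry above (⿰夕㐄) is a prefix expression: an Ideographic Description Character (IDC) followed by its operands. The following is a minimal Python sketch of reading such a string into a tree. The function name `parse_ids` and the tuple representation are hypothetical choices made for illustration, and the sketch deliberately does not try to interpret the `.`, `(.)` or `J(K)` annotations used by yi-bai/ids, whose meaning is not established in this thread.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB.
# All take two operands except U+2FF2 and U+2FF3, which take three.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY[chr(0x2FF2)] = 3
IDC_ARITY[chr(0x2FF3)] = 3


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (operator, operand, ...).

    A plain component is returned as a one-character string.
    """
    def walk(pos):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            # Not an IDC, so it is a leaf component.
            return ch, pos + 1
        pos += 1
        children = []
        for _ in range(arity):
            child, pos = walk(pos)
            children.append(child)
        return (ch, *children), pos

    tree, end = walk(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


tree = parse_ids("⿰夕㐄")
```

Here `tree` is the tuple `("⿰", "夕", "㐄")`: a left-to-right composition of 夕 and 㐄. Nothing in this representation records whether 㐄 is drawn with three strokes or four, which is the level of detail the original report is about.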

@benkasminbullock
Author

> Unfortunately the maintainer has not been able to update the repository :/

I don't have information except that @kawabata contributed to a project in January 2021 so I assume he is in good health.

> This repository is the main data source used for IRG IDS algorithm, though the decomposition data is also useful for other purposes. That's why the IDSs used are more vague.

Is that kawabata's purpose in making the repository? It seems undocumented, queries to the mailing list went unanswered, and so on. If this repository is intended for your purpose, then it should at least say so. I will leave this bug report open for the time being, pending guidance "from above".

@hfhchan

hfhchan commented May 13, 2021

I believe this repository was born before it was used for IRG standardization; however, I am not sure, because I joined IRG much later than Kawabata-san.

If you refer to the IRG working documents (https://appsrv.cse.cuhk.edu.hk/~irg/irg/irg56/IRG56.htm), you can see that this repo is listed as the official IDS equivalence database for conducting CJK Unification.

The decomposition strategies used for IDS data to be used for CJK standardization purposes are specified in IRGN1183 in IRG#25, written by @kawabata himself: https://appsrv.cse.cuhk.edu.hk/~irg/irg/irg25/IRG25.htm.

Refer to paragraph 2.5 of the decomposition strategy which would be relevant to this case:

2.5. Generousness on minor differences

Don't try to represent details of the shapes of an ideograph. Ignore minor differences. We have a set of unification rules and if the difference is important (for the unification rules), we can consider so through the eye-to-eye review after the IDS based matching. On the other hand, if the IDS is constructed under a draconian policy, two shapes to be unified may have a totally different IDS and we may fail to find them duplicate.
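To illustrate why a lenient IDS helps duplicate detection, here is a minimal Python sketch of matching after normalizing minor component variants. It is not the IRG's actual algorithm, which is not described in this thread. The table `VARIANT_TO_CANONICAL` and the function names are hypothetical, and the katakana ヰ is used purely as a visual stand-in for the four-stroke form of the lower right of 舞; it is not taken from ids.txt or any other database.

```python
# Hypothetical equivalence table: variant component -> canonical component.
# "ヰ" is only an illustrative stand-in for the four-stroke shape; it is an
# assumption made for this example, not data from any IDS repository.
VARIANT_TO_CANONICAL = {"ヰ": "㐄"}


def normalize_ids(ids, table=VARIANT_TO_CANONICAL):
    """Replace each variant component in an IDS with its canonical form."""
    return "".join(table.get(ch, ch) for ch in ids)


def same_after_normalization(a, b, table=VARIANT_TO_CANONICAL):
    """Return True if two IDSs are identical once minor variants are folded."""
    return normalize_ids(a, table) == normalize_ids(b, table)


# A strict, shape-exact comparison treats the two descriptions as different.
strict_match = "⿰夕ヰ" == "⿰夕㐄"

# A lenient comparison folds the variant first, so the pair is flagged as a
# unification candidate for later human review.
lenient_match = same_after_normalization("⿰夕ヰ", "⿰夕㐄")
```

Under these assumptions `strict_match` is `False` and `lenient_match` is `True`. This is the trade-off paragraph 2.5 describes: a shape-exact IDS can give two unifiable shapes completely different descriptions, whereas a generous one lets them collide so that a human can make the final call.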

Recently, though, maintenance of the IDS check for IRG's standardization purposes has been passed to @yi-bai because @kawabata is busy. He maintains a proprietary format for IDSes. You may want to consult with him to see if he wants to increase his coverage for other locales.
