Confusables for ㅋ vs. ᄏ #10
Comments
@ariutta Sorry for the late answer. I updated the Unicode data files and released the result as 3.2.0. Could you please check that it now behaves as expected? |
Hi @vhf, thanks for checking on this, and no worries about the delay! I tried version 3.2.0, and I think Case 1 fails but Case 2 passes.
Case 1
Input:
Expected Output:
Actual Output:
Code
from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ㅋ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))
Case 2
Input:
Expected Output:
Actual Output:
Code
from confusable_homoglyphs import confusables
set(map(lambda x: x['c'], confusables.is_confusable('ᄏ', preferred_aliases=[], greedy=True)[0]['homoglyphs']))
|
Thanks! I'll take a closer look later. For now, here's what Unicode says:
|
I can confirm your two cases: 1 fails, 2 passes. The data files here confirm that this is correct; what might be incorrect is my interpretation of the spec: http://www.unicode.org/reports/tr39/#Confusable_Detection From:
I infer that
@ariutta Can you see the issue here? What am I missing from the spec? Something is incorrect here, I guess: https://github.com/vhf/confusable_homoglyphs/blob/master/confusable_homoglyphs/cli.py#L70 but the spec, like any spec, isn't that easy to understand. :)
Some code I played with:
import unittest
from pprint import pprint

from confusable_homoglyphs import confusables


class TestConfusables(unittest.TestCase):  # enclosing class name assumed; the original showed only the methods
    def test_confusable_with_a(self):
        HANGUL_LETTER_KHIEUKH = u'ㅋ'
        pprint(confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True))
        set(map(lambda x: x['c'], confusables.is_confusable(HANGUL_LETTER_KHIEUKH, preferred_aliases=[], greedy=True)[0]['homoglyphs']))

    def test_confusable_with_b(self):
        HANGUL_JONGSEONG_KHIEUKH = u'ᆿ'
        pprint(confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True))
        # the original line repeated the literal 'ㅋ' here, presumably a copy-paste slip
        set(map(lambda x: x['c'], confusables.is_confusable(HANGUL_JONGSEONG_KHIEUKH, preferred_aliases=[], greedy=True)[0]['homoglyphs']))

    def test_confusable_with_c(self):
        # this one passes and should still pass
        HANGUL_CHOSEONG_KHIEUKH = u'ᄏ'
        confusable_with = confusables.is_confusable(HANGUL_CHOSEONG_KHIEUKH, preferred_aliases=[], greedy=True)
        confusable_char_names = set(map(lambda x: x['n'], confusable_with[0]['homoglyphs']))
        expected = set(['HANGUL LETTER KHIEUKH', 'HANGUL JONGSEONG KHIEUKH'])
        self.assertEqual(confusable_char_names, expected)
|
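For what it's worth, one reading of the Confusable Detection section of UTS #39 is that two strings are confusable when their "skeletons" are equal, where a skeleton is built by applying NFD, replacing each character with its prototype from confusables.txt, and applying NFD again. Below is a minimal sketch of that reading. The `PROTOTYPES` table is a hand-written two-entry excerpt that I am assuming matches the data file (it is consistent with the results reported in this thread), and `skeleton` and `is_confusable_pair` are hypothetical helper names, not part of confusable_homoglyphs.

```python
import unicodedata

# Assumed excerpt of the confusables.txt prototype mapping (source -> prototype).
# Only the two entries relevant to this thread; not loaded from the real file.
PROTOTYPES = {
    "\u314b": "\u110f",  # HANGUL LETTER KHIEUKH    -> HANGUL CHOSEONG KHIEUKH
    "\u11bf": "\u110f",  # HANGUL JONGSEONG KHIEUKH -> HANGUL CHOSEONG KHIEUKH
}


def skeleton(text):
    """Skeleton as sketched above: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def is_confusable_pair(a, b):
    """Two strings count as confusable when their skeletons are identical."""
    return skeleton(a) == skeleton(b)
```

Under this reading the relation is symmetric: ㅋ, ᄏ and ᆿ all share the skeleton ᄏ, so each of them would be confusable with the other two, regardless of which one you start from.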
Hi @vhf, sorry it's taken me so long to respond. I'm not a Unicode/Korean letter expert either, but I based my expectation on the output of this unicode.org "confusables" tool: Does that tool correctly match the spec? I can't say for sure, but the result seems plausible, at least based on a visual comparison of the characters. |
I'm confused as to why I'm getting different results for ㅋ vs. ᄏ. The Unicode site gives the original plus 2 additional homoglyphs for ㅋ:
But the confusable_homoglyphs package yields just one additional homoglyph initially. I only get the other one when I look for homoglyphs of that previous result:
Is this expected behavior?
(Somewhat related to this issue.)
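The two-step lookup described above can be sketched as a breadth-first closure over direct lookups. This is only an illustration: the `DIRECT` table is a hand-written stand-in that mirrors the behaviour reported in this issue (ㅋ returns only ᄏ, while ᄏ returns both ㅋ and ᆿ), and `direct_homoglyphs` and `all_homoglyphs` are hypothetical names, not functions from the package. In real use, the `lookup` argument could wrap `confusables.is_confusable(...)`.

```python
from collections import deque

# Illustrative stand-in for a single direct lookup, mirroring the results
# described in this issue. It is not read from the package's data files.
DIRECT = {
    "\u314b": {"\u110f"},            # ㅋ -> ᄏ only
    "\u110f": {"\u314b", "\u11bf"},  # ᄏ -> ㅋ and ᆿ
    "\u11bf": {"\u110f"},            # ᆿ -> ᄏ only
}


def direct_homoglyphs(ch):
    """Homoglyphs returned by one direct lookup (hypothetical helper)."""
    return DIRECT.get(ch, set())


def all_homoglyphs(ch, lookup=direct_homoglyphs):
    """Follow direct lookups repeatedly until no new characters turn up."""
    seen = {ch}
    queue = deque([ch])
    while queue:
        current = queue.popleft()
        for other in lookup(current):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen - {ch}
```

With this stand-in table, a single lookup for ㅋ gives one homoglyph, while the closure gives the two additional characters that the Unicode site lists.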