Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Duplicate key in dictionary #846

Closed
BLKSerene opened this issue Oct 16, 2023 · 6 comments · Fixed by #851 or #852
Closed

bug: Duplicate key in dictionary #846

BLKSerene opened this issue Oct 16, 2023 · 6 comments · Fixed by #851 or #852
Assignees
Labels
bug bugs in the library Hacktoberfest for Hacktoberfest event
Milestone

Comments

@BLKSerene
Copy link
Contributor

Description

Two issues of duplicate keys in dictionary are found in the codebase:

_punctuation_and_digits = {
"ๆ": "«",
"ฯ": "ǂ",
"๏": "§",
"ฯ": "ǀ",
"๚": "ǁ",
"๛": "»",
"๐": "0",
"๑": "1",
"๒": "2",
"๓": "3",
"๔": "4",
"๕": "5",
"๖": "6",
"๗": "7",
"๘": "8",
"๙": "9",
}

The key "ฯ" has been assigned two different values: "ǂ" and "ǀ".

dict_ipa_rtgs = {
"b":"b",
"d":"d",
"f":"f",
"h":"h",
"j":"y",
"k":"k",
"kʰ":"kh",
"l":"l",
"m":"m",
"n":"n",
"ŋ":"ng",
"p":"p",
"pʰ":"ph",
"r":"r",
"s":"s",
"t":"t",
"tʰ":"th",
"tɕ":"ch",
"tɕʰ":"ch",
"w":"w",
"ʔ":"",
"j":"i",
"a":"a",
"e":"e",
"ɛ":"ae",
"i":"i",
"o":"o",
"ɔ":"o",
"u":"u",
"ɯ":"ue",
"ɤ":"oe",
"aː":"a",
"eː":"e",
"ɛː":"ae",
"iː":"i",
"oː":"o",
"ɔː":"o",
"uː":"u",
"ɯː":"ue",
"ɤː":"oe",
"ia":"ia",
"ua":"ua",
"ɯa":"uea",
"aj":"ai",
"aw":"ao",
"ew":"eo",
"ɛw":"aeo",
"iw":"io",
"ɔj":"io",
"uj":"ui",
"aːj":"ai",
"aːw":"ao",
"eːw":"eo",
"ɛːw":"aeo",
"oːj":"oi",
"ɔːj":"oi",
"ɤːj":"oei",
"iaw":"iao",
"uaj":"uai",
"ɯaj":"ueai",
".":".",
}

The key "j" has been assigned two different values: "y" and "i".

Since I do not speak Thai, I do not have the required knowledge to decide on which line should be removed.

Expected results

No duplicate keys in dictionaries.

Current results

Duplicate keys in dictionaries.

Steps to reproduce

Check in the codebase.

PyThaiNLP version

4.0.2

Python version

3.10.13

Operating system and version

Windows 11 22H2 x64

More info

No response

Possible solution

No response

Files

No response

@bact
Copy link
Member

bact commented Oct 16, 2023

For ฯ it's "by design".

From Wikipedia:

"ISO 11940:1998 distinguishes the abbreviation symbol paiyannoi (ฯ) from the sentence terminator angkhandiao (ฯ), even though neither the national character standard TIS 620-2533 nor Unicode Version 5.0 distinguishes them. Paiyannoi is transliterated as ǂ and angkhandiao is transliterated as ǀ. Note that paiyannoi, angkhandiao and angkhankhu (๚) are transliterated by the letters used for click consonants, not by double dagger, vertical bars or dandas."

https://en.wikipedia.org/wiki/ISO_11940

I would say we can just pick Paiyannoi and neglect Angkhandiao.

@BLKSerene
Copy link
Contributor Author

@bact What is your opinion about the second one? While standards may differ in interpretations, it is meaningless to assign two values to the same key in Python dictionary. It only creates confusions to users/contributors.

@bact
Copy link
Member

bact commented Oct 18, 2023

Agree that it will only create confusion if we have non-unique keys left around. Should remove them.

--

For the ISO 11940 paiyannoi/angkhandiao case, if we know the context we can actually tell if the surface ฯ should be paiyannoi (for abbreviation, tend to occurs at the end of a word) or angkhandiao (for end of sentence).

But for the current simple dictionary lookup implementation, I think we can for the time being commented out the line 112 of angkhandiao to disabled it - but still keep it in the code with some note so future contributors will not trying to add it in again (as they may have an impression that something is missing).

This case is a bit easier to remove/commented out the line 112, as angkhandiao is rarely used in modern Thai writing. (it may create problem if anyone use this for writings like poetry though).

--

For IPA to RTGS one, according to this table https://en.wikipedia.org/wiki/Help:IPA/Thai , j will get converted differently depends on where it occurs in the syllable (j as a syllable-initial consonant vs j as a syllable-final consonant).

This again a shortcoming of current simple dictionary lookup implementation.

Removing the line alone will not improve the code correctness but will surely improve the code 'deterministic' character from the point of view of contributor (having non-unique keys are just confusing).

In this case, it is more difficult to remove a particular line as both of them occur regularly.

Needs an algorithm update, I would say.

@bact bact added the bug bugs in the library label Oct 19, 2023
@bact bact self-assigned this Oct 19, 2023
@bact bact added this to the 4.1 milestone Oct 19, 2023
@bact bact added the Hacktoberfest for Hacktoberfest event label Oct 19, 2023
@bact bact closed this as completed in #851 Oct 19, 2023
@bact
Copy link
Member

bact commented Oct 19, 2023

ISO 11940 case is fixed (for now) by #851
Phoneme still remains.

@bact bact reopened this Oct 19, 2023
@BLKSerene
Copy link
Contributor Author

BLKSerene commented Oct 19, 2023

@bact For the second issue, if both cases are common, and algorithm is not expected to be updated in the near future, I suggest for now we just comment out the first key-value pair and add a comment along with a link to the discussion in this issue, so that the behavior and output is unchanged (as the second key-value overrides the first one) but at least confusion is removed.

@bact
Copy link
Member

bact commented Oct 20, 2023

@BLKSerene I agree. We can put #846 in the comment,
maybe also in the docstring as well. So it will be more transparent about the limitation to the users.

Do you feel like open a PR on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs in the library Hacktoberfest for Hacktoberfest event
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants