-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: Duplicate key in dictionary #846
Comments
For ฯ it's "by design". From Wikipedia: "ISO 11940:1998 distinguishes the abbreviation symbol paiyannoi (ฯ) from the sentence terminator angkhandiao (ฯ), even though neither the national character standard TIS 620-2533 nor Unicode Version 5.0 distinguishes them. Paiyannoi is transliterated as ǂ and angkhandiao is transliterated as ǀ. Note that paiyannoi, angkhandiao and angkhankhu (๚) are transliterated by the letters used for click consonants, not by double dagger, vertical bars or dandas." https://en.wikipedia.org/wiki/ISO_11940 I would say we can just pick Paiyannoi and neglect Angkhandiao. |
@bact What is your opinion about the second one? While standards may differ in interpretations, it is meaningless to assign two values to the same key in Python dictionary. It only creates confusions to users/contributors. |
Agree that it will only create confusion if we have non-unique keys left around. Should remove them. -- For the ISO 11940 paiyannoi/angkhandiao case, if we know the context we can actually tell if the surface ฯ should be paiyannoi (for abbreviation, tend to occurs at the end of a word) or angkhandiao (for end of sentence). But for the current simple dictionary lookup implementation, I think we can for the time being commented out the line 112 of angkhandiao to disabled it - but still keep it in the code with some note so future contributors will not trying to add it in again (as they may have an impression that something is missing).
This case is a bit easier to remove/commented out the line 112, as angkhandiao is rarely used in modern Thai writing. (it may create problem if anyone use this for writings like poetry though). -- For IPA to RTGS one, according to this table https://en.wikipedia.org/wiki/Help:IPA/Thai , This again a shortcoming of current simple dictionary lookup implementation. Removing the line alone will not improve the code correctness but will surely improve the code 'deterministic' character from the point of view of contributor (having non-unique keys are just confusing). In this case, it is more difficult to remove a particular line as both of them occur regularly. Needs an algorithm update, I would say. |
ISO 11940 case is fixed (for now) by #851 |
@bact For the second issue, if both cases are common, and algorithm is not expected to be updated in the near future, I suggest for now we just comment out the first key-value pair and add a comment along with a link to the discussion in this issue, so that the behavior and output is unchanged (as the second key-value overrides the first one) but at least confusion is removed. |
@BLKSerene I agree. We can put #846 in the comment, Do you feel like open a PR on this? |
Description
Two issues of duplicate keys in dictionary are found in the codebase:
pythainlp/pythainlp/transliterate/iso_11940.py
Lines 108 to 125 in 73b17e3
The key "ฯ" has been assigned two different values: "ǂ" and "ǀ".
pythainlp/pythainlp/util/phoneme.py
Lines 123 to 185 in 73b17e3
The key "j" has been assigned two different values: "y" and "i".
Since I do not speak Thai, I do not have the required knowledge to decide on which line should be removed.
Expected results
No duplicate keys in dictionaries.
Current results
Duplicate keys in dictionaries.
Steps to reproduce
Check in the codebase.
PyThaiNLP version
4.0.2
Python version
3.10.13
Operating system and version
Windows 11 22H2 x64
More info
No response
Possible solution
No response
Files
No response
The text was updated successfully, but these errors were encountered: