-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CODEC-308: change NYSIIS encoding to not remove the first character i… #189
base: master
Are you sure you want to change the base?
Conversation
3ae8399
to
caa21a8
Compare
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException { | |||
new String[] { "JILES", "JAL" }, | |||
// violates 6: if the last two characters are AY, remove A | |||
new String[] { "CARRAWAY", "CARY" }, // Original: CARAY | |||
new String[] { "YAMADA", "YANAD" }); | |||
new String[] { "YAMADA", "YANAD" }, | |||
new String[] { "ASH", "A"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Ben-Waters
Thank you for your PR.
Who is right? This or Fuzzy?
py
Python 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.nysiis('ASH')
'AS'
>>>
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@garydgregory
Based on this other implementation it would be just A.
According to the algorithm, the final character should be removed if it is an 'S' so I would think it should be removed.
The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't used fuzzy before but looking at their implementation I don't see where they remove the trailing S or A.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had not used it before today either. I was looking for something easy to compare this PR's behavior.
Let's see if anybody else has feedback.
Do you know of any other software that is easily testable for this use-case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's pretty frustrating but it seems like the results are inconsistent depending on where you go.
Looking at this link the Java version returns AS but the python returns A. For my use-case, either one of those would be fine but it seems hard to get a consistent answer. Commons-codec is currently returning nothing though and I haven't seen another do that yet.
I don't know of a good official implementation to run it against.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that returning something like "A" is better than "". Let's see if anyone else suggests anything. @kinow ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation, phonics
in R. It has a nysiis algorithm implementation that allows for "modified" key. For "A" it gives "" too, but the modified version gives "AS".
>install.packages('phonics')
>library('phonics')
>nysiis(c("ASH"))
[1] ""
> nysiis(c("ASH"), modified=TRUE)
[1] "AS"
> nysiis(c("ASHBURTON"))
[1] "ASBART"
Their docs (PDF) documents the basic algorithm, but links to this PDF that explains the whole package a lot better: James P. Howard, II, Phonetic Spelling Algorithm Implementations for R
Searching more about those papers after the text above, I found this issue that describes the same thing I just said 😬 : CODEC-235
So my think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users to pick which one they want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question that comes up for me is: Is our current implementation "plain" NYSIIS, or, is any non-standard or any behavior from the "modified" L&A 1977 algorithm present in our current implementation? If our current code is "plain", then I could see us creating a ModifiedNysiis
class, if not, then we are in a bit of a pickle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although the results seem to indicate we implement the plain NYSIIS, someone has indeed to verify it reading our code and the algorithm. A good point, +1!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the steps from wikipedia, the core of the issue is that in step 4, we are setting the pointer to the first character instead of the 2nd character. The 9th step on wikipedia is a bit ambiguous though as to whether that step includes the first character or not. If it does, then you would get a blank string like commons-codec currently gives. If not, then you would get 'A'.
Unfortunately I don't have access to the original source book for the algorithm to try to clarify. It looks like berkeley law has it but I don't have an account for that.
@Ben-Waters Not directly related, but do you have any thoughts on #36? |
Hmmm I'm no expert on this since it isn't NYSIIS but some other algorithm but I can take a look. |
I just re-read the comments it seems like:
Help needed. |
With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding.
According to the algorithm the first character of the string should be the first character of the encoding, then based on a bunch of other rules are applied to the string characters are removed. The implementation in commons-codec passes the entire string into the transcodeRemaining method which works for the most part and then afterwards, checks that there is at least 1 character before removing the final 'A' or 'S'.
The problem is, if you have a word like "ASH" you will end up with a single final character of "A". Similarly with "SSH" you would have "S" and the logic will currently remove it and return a blank string when it should still return at least the first letter of the original string.