Symbols & Punctuation Support #16

Open
JamoCA opened this issue Jun 4, 2020 · 2 comments
@JamoCA

JamoCA commented Jun 4, 2020

I'm not entirely sure if this is a valid bug report or not, but while pasting content from Microsoft Word to CKEditor and processing using Junidecode, I encountered a reproducible java.lang.NullPointerException error. I narrowed it down to a single "right single quotation mark" (U+2019) character.

To prevent this error from being a show stopper, I wrote a wrapper for the Junidecode function that pre-sanitizes symbols & punctuation using the recommended mapping from NIH's Lexical Systems Group: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/designDoc/UDF/unicode/NormOperations/mapSymbolToAscii.html

Here's the ColdFusion (similar to Java) user-defined function that I wrote.
https://gist.github.com/JamoCA/6f35220d47caa7fdbf75eb884ff1cec7
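For readers who do not use ColdFusion, the pre-sanitizing idea might look roughly like this in plain Java. This is only a sketch: the class name `SymbolSanitizer`, the method `sanitize`, and the handful of mappings are illustrative assumptions, not the contents of the gist or the full NIH table.

```java
import java.util.Map;

public class SymbolSanitizer {
    // Hypothetical, partial mapping for illustration only;
    // the NIH table linked above covers far more symbols.
    private static final Map<Character, String> ASCII = Map.of(
        '\u2018', "'",    // U+2018 left single quotation mark
        '\u2019', "'",    // U+2019 right single quotation mark
        '\u201C', "\"",   // U+201C left double quotation mark
        '\u201D', "\"",   // U+201D right double quotation mark
        '\u2013', "-",    // U+2013 en dash
        '\u2026', "..."   // U+2026 horizontal ellipsis
    );

    // Replaces each mapped symbol with its ASCII stand-in and
    // copies every other character through unchanged.
    public static String sanitize(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = ASCII.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The output of `sanitize` would then be handed to the transliteration call, so that the problematic punctuation never reaches it.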

Is this something that should be added to Junidecode?

@hohwille
Contributor

Hey @JamoCA, I am just a user of Junidecode, not a maintainer, but I stumbled upon your issue, which has gone unanswered for 3 years. So I added this test method to JunidecodeTest:

    @Test
    public void testIssue16() {
        String rightSingleQuote = "\u2019";
        assertEquals("'", Junidecode.unidecode(rightSingleQuote));
    }

After running the modified test, it remains green. IMHO your bug report is invalid and Junidecode is behaving correctly.
Also, whenever you report bugs like this, always include the full stack trace of the exception.
This would easily reveal whether the NullPointerException actually came from Junidecode (which I doubt, unless you were using an old version that might have had such a bug) or from something completely different (e.g. your own calling code or CKEditor, whatever that may be).
IMHO this issue can be closed unless further evidence is provided.
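As a rough illustration of what "include the full stack trace" can mean in practice, the sketch below captures a trace as text and reports the class of the topmost frame, which is usually enough to tell library code from calling code. The class and method names (`StackTraceReport`, `fullStackTrace`, `originClass`) are made up for this example and are not part of Junidecode.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceReport {
    // Renders the complete stack trace (including causes) as a string
    // that can be pasted into a bug report.
    public static String fullStackTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    // Returns the class name of the topmost frame, i.e. where the
    // exception was created, or a placeholder if no frames are recorded.
    public static String originClass(Throwable t) {
        StackTraceElement[] frames = t.getStackTrace();
        return frames.length > 0 ? frames[0].getClassName() : "(unknown)";
    }
}
```

Wrapping the failing call in a try/catch and logging `fullStackTrace(e)` would show whether the top frames belong to the library's package or to the caller.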

p.s.: String.replaceAll is an expensive operation: every call parses its pattern as a regular expression and matches it against the whole string. Junidecode is designed to be much more efficient. Anyway, thanks for sharing your mappings as a ColdFusion script.
E.g. the fraction mappings you provided can be found here in Junidecode:

" 1/3 ", // 0x53
" 2/3 ", // 0x54
" 1/5 ", // 0x55
" 2/5 ", // 0x56
" 3/5 ", // 0x57
" 4/5 ", // 0x58
" 1/6 ", // 0x59
" 5/6 ", // 0x5a
" 1/8 ", // 0x5b
" 3/8 ", // 0x5c
" 5/8 ", // 0x5d
" 7/8 ", // 0x5e

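The `// 0x53` style comments in the excerpt above suggest a table indexed by the low byte of the code point. Assuming that layout, and assuming one table per high byte (the actual Junidecode internals may differ), a single-pass lookup without any regular expressions could be sketched as follows. `BlockLookup`, `BLOCK_21`, and `transliterate` are hypothetical names, and only the twelve fraction entries quoted above are filled in.

```java
public class BlockLookup {
    // Hypothetical table for code points U+2100..U+21FF, indexed by low byte.
    // Only the fraction entries quoted above (0x53..0x5e) are populated.
    private static final String[] BLOCK_21 = new String[256];

    static {
        String[] fractions = {
            " 1/3 ", " 2/3 ", " 1/5 ", " 2/5 ", " 3/5 ", " 4/5 ",
            " 1/6 ", " 5/6 ", " 1/8 ", " 3/8 ", " 5/8 ", " 7/8 "
        };
        System.arraycopy(fractions, 0, BLOCK_21, 0x53, fractions.length);
    }

    // Walks the input once; each non-ASCII char costs two array-style lookups.
    // Unmapped characters are copied through unchanged in this sketch.
    public static String transliterate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII needs no lookup
                continue;
            }
            int high = c >> 8;   // selects the table
            int low = c & 0xFF;  // index within the table
            String mapped = (high == 0x21) ? BLOCK_21[low] : null;
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Compared with a chain of `String.replaceAll` calls, this does a constant amount of work per character regardless of how many mappings exist.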
And if I am not mistaken the last one (elip) is this one:

@JamoCA
Author

JamoCA commented Nov 11, 2023

I'll need to see if I can recreate the exact issue that I encountered 3 years ago and write some unit tests. I've installed a number of Java updates over the years, and it's possible that whatever I was encountering has been fixed, as my initial tests now work without having to pre-sanitize. It was odd because Microsoft Word was adding cosmetic characters that looked correct but were actually alternate Unicode characters. I also ran into issues where some extended IDN characters were being used to bypass filters. I'll retest soon.
