Symbols & Punctuation Support #16

JamoCA · 2020-06-04T19:51:43Z

I'm not entirely sure if this is a valid bug report or not, but while pasting content from Microsoft Word to CKEditor and processing using Junidecode, I encountered a reproducible java.lang.NullPointerException error. I narrowed it down to a single "right single quotation mark" (U+2019) character.

To prevent this error from being a show stopper, I wrote a wrapper for the Junidecode function that pre-sanitizes symbols & punctuation using the recommended mapping from NIH's Lexical Systems Group: https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/designDoc/UDF/unicode/NormOperations/mapSymbolToAscii.html

Here's the ColdFusion (similar to Java) user-defined function that I wrote.
https://gist.github.com/JamoCA/6f35220d47caa7fdbf75eb884ff1cec7

Is this something that should be added to Junidecode?


hohwille · 2023-11-10T21:38:59Z

Hey @JamoCA, I am just a user of Junidecode, not a maintainer, but I stumbled over your issue, which has gone unanswered for 3 years. So I added this test method to JunidecodeTest:

    @Test
    public void testIssue16() {
        String rightSingleQuote = "\u2019";
        assertEquals("'", Junidecode.unidecode(rightSingleQuote));
    }

After running the modified test, it remains green. IMHO your bug report is invalid and Junidecode is behaving correctly.
Also, whenever you report a bug like this, always include the full stack trace of the exception.
This would easily reveal whether the NullPointerException actually came from Junidecode (which I would only expect from an old version that might have had such a bug) or from something completely different (e.g. your own calling code, or CKEditor, whatever that may be).
IMHO this issue can be closed unless further evidence is provided.
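
For anyone preparing such a report, here is a minimal sketch of capturing the full stack trace as text. The class name `StackTraceReport` and the method `failingCall` are hypothetical stand-ins for the real calling code; only standard JDK classes are used.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class StackTraceReport {

    /** Renders the full stack trace of a throwable as text, ready to paste into a bug report. */
    public static String stackTraceOf(Throwable t) {
        StringWriter buffer = new StringWriter();
        t.printStackTrace(new PrintWriter(buffer, true));
        return buffer.toString();
    }

    /** Hypothetical stand-in for the failing call; it dereferences null on purpose. */
    public static String failingCall(String input) {
        String missing = null;
        return input + missing.length();
    }

    public static void main(String[] args) {
        try {
            failingCall("\u2019");
        } catch (NullPointerException e) {
            // The "at ..." frames show which class and line actually threw.
            System.out.println(stackTraceOf(e));
        }
    }
}
```

The topmost "at ..." frame names the class that threw, which is what tells a library bug apart from a bug in the calling code.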

P.S.: String.replaceAll is an expensive operation because it compiles and matches a regular expression on every call. Junidecode is implemented far more efficiently by design. Anyway, thanks for sharing your mappings as a ColdFusion script.
E.g. the fraction mappings you provided can already be found in Junidecode:

junidecode/src/main/java/net/gcardone/junidecode/X21.java

Lines 109 to 120 in d479f6f

    
    " 1/3 ", // 0x53
    " 2/3 ", // 0x54
    " 1/5 ", // 0x55
    " 2/5 ", // 0x56
    " 3/5 ", // 0x57
    " 4/5 ", // 0x58
    " 1/6 ", // 0x59
    " 5/6 ", // 0x5a
    " 1/8 ", // 0x5b
    " 3/8 ", // 0x5c
    " 5/8 ", // 0x5d
    " 7/8 ", // 0x5e

And if I am not mistaken, the last one (the ellipsis) is this one:

junidecode/src/main/java/net/gcardone/junidecode/X18.java

Line 27 in d479f6f

" ... ", // 0x01
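
To illustrate why a table lookup is cheaper than chained String.replaceAll calls, here is a rough sketch of the idea. It is not Junidecode's actual code: the class `TableLookupSketch`, the `BLOCKS` map, and the choice to pass unmapped characters through unchanged are all assumptions made for illustration, and only three entries are filled in.

```java
import java.util.HashMap;
import java.util.Map;

public class TableLookupSketch {

    // Hypothetical miniature tables: one String[256] per block of 256 code points,
    // keyed by the code point shifted right by 8 bits.
    private static final Map<Integer, String[]> BLOCKS = new HashMap<>();

    static {
        String[] block20 = new String[256];
        block20[0x19] = "'";      // U+2019 RIGHT SINGLE QUOTATION MARK
        BLOCKS.put(0x20, block20);

        String[] block21 = new String[256];
        block21[0x53] = " 1/3 ";  // U+2153 VULGAR FRACTION ONE THIRD
        block21[0x54] = " 2/3 ";  // U+2154 VULGAR FRACTION TWO THIRDS
        BLOCKS.put(0x21, block21);
    }

    /** Replaces each mapped code point with its ASCII string in a single pass. */
    public static String transliterate(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // ASCII passes through untouched
                return;
            }
            String[] block = BLOCKS.get(cp >> 8);
            String mapped = (block == null) ? null : block[cp & 0xFF];
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp); // assumption: keep unmapped characters as-is
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("It\u2019s \u2153 done"));
    }
}
```

Each character costs one map lookup and one array index, with no regular expression compiled or matched, whereas a wrapper built from replaceAll calls rescans the whole string once per mapping.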

JamoCA · 2023-11-11T00:08:52Z

I'll need to see if I can recreate the exact issue that I encountered 3 years ago and write some unit tests. I've installed a number of Java updates over the years, and it's possible that whatever I was encountering has been fixed, as my initial tests now work without pre-sanitizing. It was odd because Microsoft was adding cosmetic characters that looked correct but were actually alternate Unicode characters. I also ran into issues where some extended IDN characters were being used to bypass filters. I'll retest soon.
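
When retesting, it may help to see exactly which look-alike characters a pasted string contains. The following is a small sketch using only the JDK; the class name `NonAsciiReport` and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NonAsciiReport {

    /** Lists every non-ASCII code point in the text as "U+XXXX NAME". */
    public static List<String> describeNonAscii(String text) {
        List<String> findings = new ArrayList<>();
        text.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> findings.add(
                String.format("U+%04X %s", cp, Character.getName(cp))));
        return findings;
    }

    public static void main(String[] args) {
        // Sample input with a curly apostrophe and a horizontal ellipsis.
        describeNonAscii("It\u2019s done\u2026").forEach(System.out::println);
    }
}
```

Running the pasted content through a check like this before calling Junidecode would make it easy to attach the exact offending code points to a unit test.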
