Clean room implementation of `detectCharsetFromBOM` #402

lapo-luchini · 2020-08-26T14:15:23Z

I re-wrote detectCharsetFromBOM starting from specs.

Tested with the following code:

for (String test : new String[] { "0000FEFF", "EFBBBF77", "FEFF7777", "FFFE0000", "FFFE7777", "77777777" }) {
    byte[] buf = Hex.decode(test);
    String our = detectCharsetFromBOM(buf);
    String prev = org.mozilla.universalchardet.UniversalDetector.detectCharsetFromBOM(buf);
    System.out.printf("%-8s %-8s %-8s %b%n", test, our, prev, Objects.equals(our, prev));
}

which produces

0000FEFF UTF-32BE UTF-32BE true
EFBBBF77 UTF-8    UTF-8    true
FEFF7777 UTF-16BE UTF-16BE true
FFFE0000 UTF-32LE UTF-32LE true
FFFE7777 UTF-16LE UTF-16LE true
77777777 null     null     true
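
For readers without the diff at hand, the following is a minimal sketch of BOM-based detection that reproduces the table above. It is not the code from this PR: the class name `BomSketch` and the helpers `startsWith` and `hex` are hypothetical, and it returns plain charset names as strings.

```java
final class BomSketch {

    private BomSketch() {}

    /** Returns the charset name implied by a leading BOM, or null if no BOM is recognised. */
    static String detectCharsetFromBom(byte[] buf) {
        // UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with the UTF-16LE BOM FF FE.
        if (startsWith(buf, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(buf, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(buf, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(buf, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(buf, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if buf begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] buf, int... prefix) {
        if (buf == null || buf.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((buf[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Hypothetical helper: decodes an even-length hex string into bytes. */
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tests = { "0000FEFF", "EFBBBF77", "FEFF7777", "FFFE0000", "FFFE7777", "77777777" };
        for (String test : tests) {
            System.out.printf("%-8s %s%n", test, detectCharsetFromBom(hex(test)));
        }
    }
}
```

The order of the checks matters: the longer UTF-32 signatures have to be tried before the UTF-16 ones that they share a prefix with.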

(Creating an actual test out of this would be difficult, and it would also require keeping the old dependency.)

Also drops the CP037 / EBCDIC support as dead code (as `detectCharsetFromBOM()` never returns "CP037").

Drops dependency. Closes GitHub issue LibrePDF#400.

lapo-luchini · 2020-08-26T14:18:57Z

Checking the old library after the commit, I noticed that with this PR we would lose support for the unusual Unicode orders where word order is big-endian but byte order is little-endian (or the opposite).
I don't think that's much of an issue, really.
(Also, the previous support for EBCDIC was already dead code, never actually used.)
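
To make the "unusual orders" concrete: assuming the comment refers to the two layouts that the XML 1.0 specification (Appendix F) lists as UCS-4 with unusual octet order, 2143 and 3412, their BOM signatures can be sketched as below. The class and method names are hypothetical and this is not code from the PR or from the old library.

```java
final class UnusualOrderSketch {

    private UnusualOrderSketch() {}

    /**
     * Returns "2143" or "3412" when buf starts with the BOM U+FEFF written in one of the
     * two unusual UCS-4 octet orders, otherwise null.
     */
    static String unusualUcs4Order(byte[] buf) {
        if (buf == null || buf.length < 4) return null;
        int b0 = buf[0] & 0xFF;
        int b1 = buf[1] & 0xFF;
        int b2 = buf[2] & 0xFF;
        int b3 = buf[3] & 0xFF;
        // 2143: 16-bit words in big-endian order, bytes inside each word swapped.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFF && b3 == 0xFE) return "2143";
        // 3412: 16-bit words swapped, bytes inside each word in big-endian order.
        if (b0 == 0xFE && b1 == 0xFF && b2 == 0x00 && b3 == 0x00) return "3412";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(unusualUcs4Order(new byte[] { 0x00, 0x00, (byte) 0xFF, (byte) 0xFE }));
        System.out.println(unusualUcs4Order(new byte[] { (byte) 0xFE, (byte) 0xFF, 0x00, 0x00 }));
    }
}
```

A detector that only knows the five common BOMs would typically see `FE FF 00 00` as UTF-16BE and would not recognise `00 00 FF FE` at all.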

lapo-luchini · 2020-08-26T14:26:58Z

Removing this dependency was requested in both #124 and #400.

andreasrosdal · 2020-08-26T16:27:39Z

Excellent. Can you please add unit tests?

openpdf/src/main/java/com/lowagie/text/xml/simpleparser/SimpleXMLParser.java

sonarqubecloud · 2020-08-27T18:37:47Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities (and 0 Security Hotspots to review)
0 Code Smells

No Coverage information
No Duplication information

lapo-luchini · 2020-08-27T18:42:39Z

OK, done as suggested by @andreasrosdal and @noavarice.

lapo-luchini · 2020-08-27T18:51:10Z

I also did some research about EBCDIC: while it doesn't have a BOM to detect, we could maybe try to detect "<?xm", which in CP037 encodes as 0x4C6FA794.
I wonder how many EBCDIC-encoded XML files inside PDFs are out there… and what the chance of false positives is (it should be pretty low, since those bytes decode back as ASCII to partially unprintable text).

Oh, I noticed only later that's exactly what original code did… actually I now notice that this PR is kinda close to a (partial) revert of 211acbc… do we want that?
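
The "<?xm" check described above could look roughly like the following sketch. It only compares the first four bytes against the CP037 values quoted in the comment (4C 6F A7 94); the class and method names are hypothetical, and this is neither the original code nor part of this PR.

```java
import java.nio.charset.StandardCharsets;

final class EbcdicSignatureSketch {

    // "<?xm" encoded in CP037, as quoted in the comment above: 4C 6F A7 94.
    private static final int[] XML_DECL_CP037 = { 0x4C, 0x6F, 0xA7, 0x94 };

    private EbcdicSignatureSketch() {}

    /** True if buf starts with the CP037 encoding of "<?xm". */
    static boolean looksLikeCp037XmlDeclaration(byte[] buf) {
        if (buf == null || buf.length < XML_DECL_CP037.length) return false;
        for (int i = 0; i < XML_DECL_CP037.length; i++) {
            if ((buf[i] & 0xFF) != XML_DECL_CP037[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ebcdic = { 0x4C, 0x6F, (byte) 0xA7, (byte) 0x94 };
        byte[] ascii = "<?xml".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeCp037XmlDeclaration(ebcdic));
        System.out.println(looksLikeCp037XmlDeclaration(ascii));
    }
}
```

The last two bytes (A7 and 94) fall outside the 7-bit ASCII range, which is why an ASCII-compatible document is unlikely to begin with this sequence.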

asturio · 2020-08-28T11:14:32Z

> Oh, I noticed only later that's exactly what original code did… actually I now notice that this PR is kinda close to a (partial) revert of 211acbc… do we want that?

I think that is OK, as you are using Charset here, which is great, and we don't have a dependency on another library just to use a single method.

asturio

Nice changes

lapo-luchini added 2 commits August 26, 2020 15:49

Drop CP037 / EBCDIC support dead code.

bba6d0a

(as `detectCharsetFromBOM()` never returns "CP037")

Clean room implementation of detectCharsetFromBOM.

b56efa0

Drops dependency. Closes GitHub issue LibrePDF#400.

Drop unused import.

19b35d7

noavarice reviewed Aug 26, 2020

openpdf/src/main/java/com/lowagie/text/xml/simpleparser/SimpleXMLParser.java (outdated, resolved)

lapo-luchini added 3 commits August 27, 2020 19:50

Add unit tests for SimpleXMLParser Unicode charset detection.

5f7a6fa

Test for declared encoding as well.

8a97705

Refactor to use Charset.

a6aa1cd

asturio approved these changes Aug 28, 2020

asturio merged commit d94d2f7 into LibrePDF:master Aug 28, 2020

lapo-luchini deleted the chardet branch August 28, 2020 16:22
