Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean room implementation of detectCharsetFromBOM #402

Merged
merged 6 commits into from
Aug 28, 2020

Conversation

lapo-luchini
Copy link
Contributor

I re-wrote detectCharsetFromBOM starting from specs.

Tested with the following code:

for (String test : new String[] { "0000FEFF", "EFBBBF77", "FEFF7777", "FFFE0000", "FFFE7777", "77777777" }) {
    byte[] buf = Hex.decode(test);
    String our = detectCharsetFromBOM(buf);
    String prev = org.mozilla.universalchardet.UniversalDetector.detectCharsetFromBOM(buf);
    System.out.printf("%-8s %-8s %-8s %b%n", test, our, prev, Objects.equals(our, prev));
}

which produces

0000FEFF UTF-32BE UTF-32BE true
EFBBBF77 UTF-8    UTF-8    true
FEFF7777 UTF-16BE UTF-16BE true
FFFE0000 UTF-32LE UTF-32LE true
FFFE7777 UTF-16LE UTF-16LE true
77777777 null     null     true

(creating an actual test would be difficult, and also would need to keep old dependency)

(as `detectCharsetFromBOM()` never returns "CP037")
@lapo-luchini
Copy link
Contributor Author

Checking old library after the commit I noticed that with this PR we would lose support for unusual Unicode orders where word order is big-endian but byte order is little-endian (or the opposite).
I don't think that's much of an issue really.
(also, previous support for EBCDIC was already dead code, never actually used)

@lapo-luchini
Copy link
Contributor Author

Removing this dependency was requested in both #124 and #400.

@andreasrosdal
Copy link
Contributor

Excellent. Can you please add unit tests?

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@lapo-luchini
Copy link
Contributor Author

OK, done as suggested by @andreasrosdal and @noavarice.

@lapo-luchini
Copy link
Contributor Author

I also done some research about EBCDIC and while it doesn't have a BOM to detect we could maybe try and detect "<?xm" which in CP037 encodes as 0x4C6FA794.
I wonder how many EBCDIC-encfoded-XML-inside-PDF are out there… and the chance of false positives (should be pretty low, that decodes back in ASCII as partially unprintable).

Oh, I noticed only later that's exactly what original code did… actually I now notice that this PR is kinda close to a (partial) revert of 211acbc… do we want that?

@asturio
Copy link
Member

asturio commented Aug 28, 2020

Oh, I noticed only later that's exactly what original code did… actually I now notice that this PR is kinda close to a (partial) revert of 211acbc… do we want that?

I think that is ok, as you are using Charset here, which is greate, and we don't have a dependency to another library for using just a single method.

Copy link
Member

@asturio asturio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice changes

@asturio asturio merged commit d94d2f7 into LibrePDF:master Aug 28, 2020
@lapo-luchini lapo-luchini deleted the chardet branch August 28, 2020 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants