Clean room implementation of detectCharsetFromBOM
#402
(as `detectCharsetFromBOM()` never returns "CP037")
Drops dependency. Closes GitHub issue LibrePDF#400.
Checking the old library after the commit, I noticed that with this PR we would lose support for unusual Unicode orders where word order is big-endian but byte order is little-endian (or the opposite).
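To make the "unusual orders" concrete, here is a minimal sketch, assuming they are the two UCS-4 octet orders that XML 1.0 Appendix F labels 2143 and 3412. The class and method names (`UnusualOrderSketch`, `describe`) are hypothetical and are not part of this PR or of the old library.

```java
// Hypothetical helper, for illustration only; not code from this PR.
final class UnusualOrderSketch {
    private UnusualOrderSketch() {}

    /**
     * Returns a label for the two "unusual" UCS-4 byte order marks listed in
     * XML 1.0 Appendix F, or null when the first four bytes match neither.
     */
    static String describe(byte[] b) {
        if (b == null || b.length < 4) {
            return null;
        }
        int b0 = b[0] & 0xFF;
        int b1 = b[1] & 0xFF;
        int b2 = b[2] & 0xFF;
        int b3 = b[3] & 0xFF;
        // 00 00 FF FE: high word first, bytes swapped inside each word.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFF && b3 == 0xFE) {
            return "UCS-4 (2143)";
        }
        // FE FF 00 00: low word first, bytes in big-endian order inside each word.
        if (b0 == 0xFE && b1 == 0xFF && b2 == 0x00 && b3 == 0x00) {
            return "UCS-4 (3412)";
        }
        return null;
    }
}
```

Note that `FE FF 00 00` is also what a UTF-16BE BOM followed by U+0000 looks like, so a real detector would have to decide which reading wins.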
Excellent. Can you please add unit tests?
openpdf/src/main/java/com/lowagie/text/xml/simpleparser/SimpleXMLParser.java
Kudos, SonarCloud Quality Gate passed! 0 Bugs
OK, done as suggested by @andreasrosdal and @noavarice.
I also did some research about EBCDIC, and while it doesn't have a BOM to detect, we could maybe try and detect … Oh, I noticed only later that's exactly what the original code did. Actually, I now notice that this PR is kinda close to a (partial) revert of 211acbc. Do we want that?
I think that is OK, as you are using Charset here, which is great, and we no longer depend on another library just to use a single method.
Nice changes
I re-wrote `detectCharsetFromBOM` starting from specs. Tested with the following code:
which produces
(creating an actual test would be difficult, and also would need to keep old dependency)
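For readers without the diff at hand, a minimal sketch of what spec-based BOM sniffing can look like. This is not the code from this PR: the class and method names (`BomSketch`, `sniffBom`) are hypothetical, and returning a charset name as a `String` (rather than a `Charset`) is an assumption made to keep the example self-contained. The byte sequences are the standard Unicode byte order marks.

```java
// Hypothetical sketch, for illustration only; not the PR's implementation.
final class BomSketch {
    private BomSketch() {}

    /** Returns the charset name implied by a leading BOM, or null if none is found. */
    static String sniffBom(byte[] data) {
        // Check the 4-byte marks first: the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the caller has to fall back to another heuristic
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

The ordering of the checks is the main design point: testing the 2-byte UTF-16 marks before the 4-byte UTF-32 marks would misreport UTF-32LE input as UTF-16LE.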