Description of the new Feature/Bugfix
Although the font uses Identity-H encoding, its text can still be decoded through /ToUnicode. If the font has a /ToUnicode entry, I decode with it and build the char array according to whether each character code is one byte or two bytes.
Related Issue: #330
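The decoding approach described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class `ToUnicodeDecodeSketch`, the method `decode`, and the plain `Map` standing in for a parsed /ToUnicode CMap are all hypothetical names, and unmapped codes are simply skipped here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the OpenPDF implementation.
public class ToUnicodeDecodeSketch {

    /**
     * Decodes raw string bytes with a code -> Unicode table,
     * consuming either one or two bytes per character code.
     */
    public static String decode(byte[] raw, Map<Integer, String> toUnicode, boolean twoByteCodes) {
        StringBuilder out = new StringBuilder();
        int step = twoByteCodes ? 2 : 1;
        for (int i = 0; i + step <= raw.length; i += step) {
            int code = raw[i] & 0xFF;
            if (twoByteCodes) {
                // Big-endian: the first byte is the high byte of the code.
                code = (code << 8) | (raw[i + 1] & 0xFF);
            }
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped); // assumption: unmapped codes are dropped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> cmap = new HashMap<>();
        cmap.put(0x0024, "A");
        cmap.put(0x0025, "B");
        byte[] raw = {0x00, 0x24, 0x00, 0x25};
        System.out.println(decode(raw, cmap, true));
    }
}
```

The same table is reused for both code widths here only to keep the sketch short; a real CMap keeps the code-space ranges it was declared with.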
Unit-Tests for the new Feature/Bugfix
OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/TextExtractTest.java
Lines 10 to 22 in 422b562
Compatibility Issues
I added a new method to CMapAwareDocumentFont that checks whether the font has a two-byte mapping.
OpenPDF/openpdf/src/main/java/com/lowagie/text/pdf/CMapAwareDocumentFont.java
Lines 226 to 231 in 422b562
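One way such a check could work is sketched below. The names `TwoByteCheckSketch`, `addMapping`, and `hasTwoByteMappings` are hypothetical and only illustrate the idea of tracking one-byte and two-byte codes separately; the actual method is the one at the lines referenced above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: not the actual CMapAwareDocumentFont code.
public class TwoByteCheckSketch {
    private final Map<Integer, String> singleByte = new HashMap<>();
    private final Map<Integer, String> doubleByte = new HashMap<>();

    /** Registers a mapping, filing it by the length of its character code. */
    public void addMapping(byte[] code, String unicode) {
        if (code.length == 1) {
            singleByte.put(code[0] & 0xFF, unicode);
        } else if (code.length == 2) {
            doubleByte.put(((code[0] & 0xFF) << 8) | (code[1] & 0xFF), unicode);
        } else {
            throw new IllegalArgumentException("only 1- or 2-byte codes are handled in this sketch");
        }
    }

    /** True if at least one two-byte code was registered. */
    public boolean hasTwoByteMappings() {
        return !doubleByte.isEmpty();
    }

    public static void main(String[] args) {
        TwoByteCheckSketch cmap = new TwoByteCheckSketch();
        cmap.addMapping(new byte[]{0x00, 0x24}, "A");
        System.out.println(cmap.hasTwoByteMappings());
    }
}
```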
Test that cannot pass, 1:
Failure reason: the same error occurs with iText5. When Chunk.NEWLINE is written, '\n' is replaced with 0x3, and the Identity-H font's /ToUnicode maps 0x3 to 0x3, so the extracted text is wrong. The solution is to delete the key 0x3 or to change the way Chunk.NEWLINE is written.
Screenshot from iText5:
OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/parser/PdfTextExtractorTest.java
Lines 85 to 101 in d144eaa
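To illustrate the first proposed workaround for the Chunk.NEWLINE failure (deleting the 0x3 key), here is a minimal sketch. It assumes two-byte codes and uses a plain `Map` in place of the parsed /ToUnicode table; `NewlineCodeSketch`, `decodeTwoByte`, and `withoutCode` are hypothetical names, not OpenPDF or iText API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of dropping the 0x3 entry before decoding.
public class NewlineCodeSketch {

    /** Decodes big-endian two-byte codes, skipping codes without a mapping. */
    public static String decodeTwoByte(byte[] raw, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + 2 <= raw.length; i += 2) {
            int code = ((raw[i] & 0xFF) << 8) | (raw[i + 1] & 0xFF);
            String mapped = toUnicode.get(code);
            if (mapped != null) {
                out.append(mapped);
            }
        }
        return out.toString();
    }

    /** Returns a copy of the table with one code removed. */
    public static Map<Integer, String> withoutCode(Map<Integer, String> toUnicode, int code) {
        Map<Integer, String> copy = new HashMap<>(toUnicode);
        copy.remove(code);
        return copy;
    }

    public static void main(String[] args) {
        Map<Integer, String> cmap = new HashMap<>();
        cmap.put(0x0024, "A");
        cmap.put(0x0003, "\u0003"); // the problematic 0x3 -> 0x3 entry
        byte[] raw = {0x00, 0x24, 0x00, 0x03, 0x00, 0x24};

        // With the entry present, the control character U+0003 leaks into the text.
        System.out.println(decodeTwoByte(raw, cmap).contains("\u0003"));
        // With the entry removed, the code is skipped.
        System.out.println(decodeTwoByte(raw, withoutCode(cmap, 0x0003)));
    }
}
```

Removing the key only hides the control character; it does not restore the line break, which is why changing how Chunk.NEWLINE is written is the other option mentioned above.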
Test that cannot pass, 2:
Failure reason: the trailing whitespace in "data\ttable " should be removed.
OpenPDF/openpdf/src/test/java/com/lowagie/text/pdf/TabTest.java
Lines 14 to 32 in d144eaa
Technical details
See section 9.10, "Extraction of Text Content", of PDF32000_2008.pdf.
I will follow up later with a fix for the Chunk.NEWLINE issue.