Text extraction improvements and bug fixes
Word extraction now respects spaces that are present in strings stored in the PDF, and no longer tries to split or recombine them inappropriately. Since most PDFs now have fairly logically organized content streams, the old graphical reassembly is less important. The resulting word rectangles are now correct for more cases of odd PDFs. Rotated pages are not yet handled, but media boxes with non-zero offsets now work.

Font character decoding now works for Identity-H encoding, which is one form of multi-byte encoding. The decoding has been reorganized because measuring was previously based on the decoded characters themselves, not the code points that were originally parsed. This meant that some measurements would be off, e.g. in the case of ligatures (which decode to multiple characters but are only a single code point). Some minor adjustments were made in CMap and CMapAware font handling so that this could be done.
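To illustrate the ligature problem described in this message, here is a minimal sketch in Python. It is not the code from this commit: the function names, the width tables, and the code-to-text mapping are all hypothetical, and the font is assumed to use 2-byte big-endian codes as in Identity-H. It shows why summing widths per parsed code can differ from summing widths per decoded character.

```python
from typing import Dict, List, Tuple


def split_two_byte_codes(data: bytes) -> List[int]:
    """Split a PDF string into 2-byte big-endian codes (Identity-H style)."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def decode_and_measure(
    data: bytes,
    to_unicode: Dict[int, str],
    code_widths: Dict[int, int],
    default_width: int = 1000,
) -> Tuple[str, int]:
    """Decode text and measure it per parsed code, not per decoded character."""
    parts: List[str] = []
    total = 0
    for code in split_two_byte_codes(data):
        # One code may decode to several characters (a ligature).
        parts.append(to_unicode.get(code, "\ufffd"))
        # The width belongs to the code, so it is added exactly once.
        total += code_widths.get(code, default_width)
    return "".join(parts), total


def measure_by_characters(
    text: str, char_widths: Dict[str, int], default_width: int = 1000
) -> int:
    """The older approach: look up a width for every decoded character."""
    return sum(char_widths.get(ch, default_width) for ch in text)


# Hypothetical font data: code 0x0002 is an "ffi" ligature glyph.
TO_UNICODE = {0x0001: "o", 0x0002: "ffi", 0x0003: "c", 0x0004: "e"}
CODE_WIDTHS = {0x0001: 500, 0x0002: 800, 0x0003: 450, 0x0004: 450}
CHAR_WIDTHS = {"o": 500, "f": 300, "i": 250, "c": 450, "e": 450}

raw = bytes.fromhex("0001000200030004")
text, width_by_code = decode_and_measure(raw, TO_UNICODE, CODE_WIDTHS)
width_by_char = measure_by_characters(text, CHAR_WIDTHS)
```

With these made-up tables the four codes decode to six characters, and the two measurements disagree because the ligature glyph is narrower than its three component letters measured separately.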
1 parent 4dc66b3, commit 0c341c8
Showing 10 changed files with 697 additions and 256 deletions.