-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118
Conversation
…char This is not a principled fix, but it is a hack to avoid a crash when encountering an empty dstString in a `beginbfchar` table in a ToUnicode CMap. The right way to fix this would be to replace all the string manipulation with a formal grammar, but i don't have the skill or capacity to do that right now. Instead, we take narrow aim at the issue of zero-length (empty) hex string representations. We take advantage of the fact that no angle-bracket-delimited hex string contains a . character. when we encounter an empty hex string, rather than replacing it with the empty string, we replace it with a literal ".". Then, when we encounter a ".", we remember that it was supposed to be an empty string. One consequence of this fix is that the exported cmap can now return an empty string, so we also have to clean up `PageObject::process_operation` so that it doesn't try to read the final character from an empty string. This is a hackish workaround for py-pdf#1111.
Codecov Report
@@ Coverage Diff @@
## main #1118 +/- ##
==========================================
- Coverage 91.96% 91.91% -0.06%
==========================================
Files 24 24
Lines 4657 4663 +6
Branches 962 964 +2
==========================================
+ Hits 4283 4286 +3
- Misses 229 230 +1
- Partials 145 147 +2
Continue to review full report at Codecov.
|
@@ -1377,6 +1377,7 @@ def process_operation(operator: bytes, operands: List) -> None: | |||
if ( | |||
(abs(float(op)) >= _space_width) | |||
and (abs(float(op)) <= 8 * _space_width) | |||
and (len(text) > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That one is pretty clear and easy to approve / merge. I have a harder time with the other changes. I'm currently running the benchmark to see if it changed.
Looking at my benchmark results: https://github.com/py-pdf/benchmarks
|
@pubpub-zz What do you think about this PR? |
I of course agree with elif process_char:
lst = [x for x in l.split(b" ") if x]
map_dict[-1] = len(lst[0]) // 2
if len(lst) == 1: # some case where the 2nd param is empty (seems not IAW pdfspec)
map_dict[
unhexlify(lst[0]).decode(
"charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
)
] = ""
else:**
while len(lst) > 0:
map_dict[
unhexlify(lst[0]).decode(
"charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
)
] = unhexlify(lst[1]).decode(
"utf-16-be", "surrogatepass"
) # join is here as some cases where the code was split
int_entry.append(int(lst[0], 16))
lst = lst[2:] Tested on habibi.pdf If you want I can push it as a suggestion note : the arabic test showed before and after the roman text is also reported (but with some unknown car after) : not an issue with PyPDF2 |
@pubpub-zz Very good hint!
Removing that line improves almost all examples. The one exception is |
@dkg Would you mind adding the two changes suggested by @pubpub-zz ? (I could also add them afterwards) |
Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
Thank you for your contributon @dkg ! It will be part of the release today :-) |
Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
As i understand it, there are three requested changes:
In fact, i'm pretty sure that the PDF format doesn't care about linebreaks here at all. That is, a CMap stream's
could just as easily be written as:
now, if the dstString of the first mapping is the empty string, you get this:
this will work with either proposed code change. But if the CMap is written the second way:
Then @pubpub-zz 's proposed variant (in 7740a6e), which only deals with the final odd pairing, will effectively transform such a CMap stream to:
That's an entirely different character mapping! So if i had to choose one, i'd stick with my original proposal. Again, the right way to fix this is to use a formal grammar for CMap, but i'm grateful for the consideration of any improvement that would work with |
@pubpub-zz wrote:
Agreed, and thanks for pointing this out. The fact that the arabic text shows up twice is a distinct issue with the generator, WeasyPrint, not something for PyPDF2 to worry about. |
Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>
@dkg, |
New Features (ENH): - Add color and font_format to PdfReader.outlines[i] (#1104) - Extract Text Enhancement (whitespaces) (#1084) Bug Fixes (BUG): - Use `build_destination` for named destination outlines (#1128) - Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118) - Prevent deduplication of PageObject (#1105) - None-check in DictionaryObject.read_from_stream (#1113) - Avoid IndexError in _cmap.parse_to_unicode (#1110) Documentation (DOC): - Explanation for git submodule - Watermark and stamp (#1095) Maintenance (MAINT): - Text extraction improvements (#1126) - Destination.color returns ArrayObject instead of tuple as fallback (#1119) - Use add_bookmark_destination in add_bookmark (#1100) - Use add_bookmark_destination in add_bookmark_dict (#1099) Testing (TST): - Remove xfail from test_outline_title_issue_1121 - Add test for arab text (#1127) - Add xfail for decryption fail (#1125) - Add xfail test for IndexError when extracting text (#1124) - Add MCVE showing outline title issue (#1123) Code Style (STY): - Apply black and isort - Use IntFlag for permissions_flag / update_page_form_field_values (#1094) - Simplify code (#1101) Full Changelog: 2.5.0...2.6.0
@pubpub-zz that's an entirely reasonable question. I think you're asking whether a string can have zero characters in it or not. I refer you to §7.3.4 ("String Objects") from PDF32000_2008.pdf, which says:
§7.3.4.3 ("Hexadecimal Strings"), which is the relevant bit for what we're parsing here, says:
All of these suggest that a string generally can be the empty string. So maybe you're asking whether it's acceptable to have an empty string in the
back in the PDF spec, §9.10.3 ("ToUnicode CMaps") says that these maps:
Neither the PDF spec nor the tech note suggests that the strings have any required length other than an integer number of bytes (as a baseline for strings) or valid UTF-16BE encoding (for ToUnicode CMaps). and of course the empty string has an integer number of bytes and is also a valid UTF-16BE string. |
This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a
beginbfchar
table in aToUnicode CMap.
The right way to fix this would be to replace all the string
manipulation with a formal grammar, but i don't have the skill or
capacity to do that right now.
Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.
We take advantage of the fact that no angle-bracket-delimited hex
string contains a . character. when we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".". Then, when we encounter a ".", we remember that it was
supposed to be an empty string.
One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
PageObject::process_operation
so that it doesn't try to read thefinal character from an empty string.
This is a hackish workaround for #1111.