Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

dkg · 2022-07-15T23:09:50Z

This is not a principled fix, but it is a hack to avoid a crash when
encountering an empty dstString in a beginbfchar table in a
ToUnicode CMap.

The right way to fix this would be to replace all the string
manipulation with a formal grammar, but i don't have the skill or
capacity to do that right now.

Instead, we take narrow aim at the issue of zero-length (empty) hex
string representations.

We take advantage of the fact that no angle-bracket-delimited hex
string contains a . character. when we encounter an empty hex string,
rather than replacing it with the empty string, we replace it with a
literal ".". Then, when we encounter a ".", we remember that it was
supposed to be an empty string.

One consequence of this fix is that the exported cmap can now return
an empty string, so we also have to clean up
PageObject::process_operation so that it doesn't try to read the
final character from an empty string.

This is a hackish workaround for #1111.

…char This is not a principled fix, but it is a hack to avoid a crash when encountering an empty dstString in a `beginbfchar` table in a ToUnicode CMap. The right way to fix this would be to replace all the string manipulation with a formal grammar, but i don't have the skill or capacity to do that right now. Instead, we take narrow aim at the issue of zero-length (empty) hex string representations. We take advantage of the fact that no angle-bracket-delimited hex string contains a . character. when we encounter an empty hex string, rather than replacing it with the empty string, we replace it with a literal ".". Then, when we encounter a ".", we remember that it was supposed to be an empty string. One consequence of this fix is that the exported cmap can now return an empty string, so we also have to clean up `PageObject::process_operation` so that it doesn't try to read the final character from an empty string. This is a hackish workaround for py-pdf#1111.

codecov · 2022-07-15T23:14:42Z

Codecov Report

Merging #1118 (cba5b04) into main (9bbe827) will decrease coverage by 0.05%.
The diff coverage is 57.14%.

@@            Coverage Diff             @@
##             main    #1118      +/-   ##
==========================================
- Coverage   91.96%   91.91%   -0.06%     
==========================================
  Files          24       24              
  Lines        4657     4663       +6     
  Branches      962      964       +2     
==========================================
+ Hits         4283     4286       +3     
- Misses        229      230       +1     
- Partials      145      147       +2

Impacted Files	Coverage Δ
PyPDF2/_page.py	`92.13% <ø> (ø)`
PyPDF2/_cmap.py	`92.18% <57.14%> (-1.37%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9bbe827...cba5b04. Read the comment docs.

MartinThoma · 2022-07-16T08:58:10Z

PyPDF2/_page.py

@@ -1377,6 +1377,7 @@ def process_operation(operator: bytes, operands: List) -> None:
 if (
 (abs(float(op)) >= _space_width)
 and (abs(float(op)) <= 8 * _space_width)
+ and (len(text) > 0)


That one is pretty clear and easy to approve / merge. I have a harder time with the other changes. I'm currently running the benchmark to see if it changed.

MartinThoma · 2022-07-16T09:23:01Z

Looking at my benchmark results: https://github.com/py-pdf/benchmarks

            "1601.03642": 98.8 -> 98.9 : +0.1
            "1602.06541": 98.0 -> 97.8 : -0.2
            "1707.09725": 94.0 -> 94.1 : +0.1
            "2201.00021": 96.7 -> 96.5 : -0.2
            "2201.00022": 97.2 -> 97.3 : +0.1
            "2201.00029": 97.6 -> 96.7 : -0.9
            "2201.00037": 93.8 -> 94.0 : +0.2
            "2201.00069": 96.3 -> 96.4 : +0.1
            "2201.00151": 93.3 -> 93.8 : +0.5
            "2201.00178": 92.9 -> 93.1 : +0.2
            "2201.00200": 97.4 -> 97.4 : ----
            "2201.00201": 98.2 -> 98.2 : ----
            "2201.00214": 97.2 -> 97.4 : +0.2
          "GeoTopo-book": 86.5 -> 87.1 : +0.6

MartinThoma · 2022-07-16T09:28:45Z

@pubpub-zz What do you think about this PR?

pubpub-zz · 2022-07-16T10:40:13Z

@pubpub-zz What do you think about this PR?

I of course agree with and (text != ""). Just wandering if the and (abs(float(op)) <= 8 * _space_width) should not be removed : can you compare the benchmark with and without?
about change in _cmap.py I would more likely propose this mod:

        elif process_char:
            lst = [x for x in l.split(b" ") if x]
            map_dict[-1] = len(lst[0]) // 2
            if len(lst) == 1:       # some case where the 2nd param is empty (seems not IAW pdfspec)
                map_dict[
                    unhexlify(lst[0]).decode(
                        "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                    )
                ] = ""
            else:**
                while len(lst) > 0:
                    map_dict[
                        unhexlify(lst[0]).decode(
                            "charmap" if map_dict[-1] == 1 else "utf-16-be", "surrogatepass"
                        )
                    ] = unhexlify(lst[1]).decode(
                        "utf-16-be", "surrogatepass"
                    )  # join is here as some cases where the code was split
                    int_entry.append(int(lst[0], 16))
                    lst = lst[2:]

Tested on habibi.pdf

If you want I can push it as a suggestion

note : the arabic test showed before and after the roman text is also reported (but with some unknown car after) : not an issue with PyPDF2

MartinThoma · 2022-07-17T08:37:46Z

@pubpub-zz Very good hint!

                PyPDF2==2.5.0  This PR        removing the line 'and (abs(float(op)) <= 8 * _space_width)'
  "1601.03642": 98.8%          98.9% (+0.1)   99.0% (+0.2 / +0.1) 
  "1602.06541": 98.0%          97.8% (-0.2)   98.0% (---  / +0.2)
  "1707.09725": 94.0%          94.1% (+0.1)   94.3% (+0.3 / +0.2)
  "2201.00021": 96.7%          96.5% (-0.2)   96.7% (---- / +0.2)
  "2201.00022": 97.2%          97.3% (+0.1)   97.3% (+0.1 / ----)
  "2201.00029": 97.6%          96.7% (-0.9)   96.7% (-0.9 / ----)
  "2201.00037": 93.8%          94.0% (+0.2)   94.0% (+0.2 / ----)
  "2201.00069": 96.3%          96.4% (+0.1)   96.4% (+0.1 / ----)
  "2201.00151": 93.3%          93.8% (+0.5)   93.8% (+0.5 / ----)
  "2201.00178": 92.9%          93.1% (+0.2)   93.2% (+0.3 / +0.1)
  "2201.00200": 97.4%          97.4% (----)   97.4% (---  / ----)
  "2201.00201": 98.2%          98.2% (----)   98.3% (+0.1 / +0.1)
  "2201.00214": 97.2%          97.4% (+0.2)   97.6% (+0.4 / +0.2)
"GeoTopo-book": 86.5%          87.1% (+0.6)   87.1% (+0.6 / ----)

Removing that line improves almost all examples. The one exception is 2201.00029 - sadly by quite a bit.

MartinThoma · 2022-07-17T08:46:41Z

@dkg Would you mind adding the two changes suggested by @pubpub-zz ? (I could also add them afterwards)

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma · 2022-07-17T12:21:57Z

Thank you for your contributon @dkg ! It will be part of the release today :-)

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

dkg · 2022-07-17T12:30:55Z

As i understand it, there are three requested changes:

use (text != "") instead of (len(text) > 0) as an "if" predicate. I'm fine with either, i don't know enough about the various implementations of python to know whether there's any significant performance benefit across all of them. fwiw, _page.py has at least one other instance of a len(text) == 0 test internally.
remove and (abs(float(op)) <= 8 * _space_width) -- this seems orthogonal to the change i'm proposing to make here, and would recommend it being done separately.
changes to the elif process_char stanza: the proposed change only works if the last entry on each line is the empty entry. It is not at all clear to me from either the PDF standard or the CMap technical note 5014 requires each grouping to be on a single line.

In fact, i'm pretty sure that the PDF format doesn't care about linebreaks here at all. That is, a CMap stream's beginbfchar range that contains two mappings which most folks would emit as:

<0012> <0056>
<0013> <0059>

could just as easily be written as:

<0012> <0056> <0013> <0059>

now, if the dstString of the first mapping is the empty string, you get this:

<0012> <>
<0013> <0059>

this will work with either proposed code change.

But if the CMap is written the second way:

<0012> <> <0013> <0059>

Then @pubpub-zz 's proposed variant (in 7740a6e), which only deals with the final odd pairing, will effectively transform such a CMap stream to:

<0012> <0013>
<0059> <>

That's an entirely different character mapping! So if i had to choose one, i'd stick with my original proposal.

Again, the right way to fix this is to use a formal grammar for CMap, but i'm grateful for the consideration of any improvement that would work with habibi.pdf :)

dkg · 2022-07-17T13:50:04Z

@pubpub-zz wrote:

note : the arabic test showed before and after the roman text is also reported (but with some unknown car after) : not an issue with PyPDF2

Agreed, and thanks for pointing this out. The fact that the arabic text shows up twice is a distinct issue with the generator, WeasyPrint, not something for PyPDF2 to worry about.

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

pubpub-zz · 2022-07-17T19:05:10Z

now, if the dstString of the first mapping is the empty string, you get this:
<0012> <>
<0013> <0059>

@dkg,
When I read the pdf Spec, I have not been able to identify an example about empty string. Despite WeasyPrint can produce this and we have a way to cope with it, can you point me a ref where this case is described. Thx

New Features (ENH): - Add color and font_format to PdfReader.outlines[i] (#1104) - Extract Text Enhancement (whitespaces) (#1084) Bug Fixes (BUG): - Use `build_destination` for named destination outlines (#1128) - Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar (#1118) - Prevent deduplication of PageObject (#1105) - None-check in DictionaryObject.read_from_stream (#1113) - Avoid IndexError in _cmap.parse_to_unicode (#1110) Documentation (DOC): - Explanation for git submodule - Watermark and stamp (#1095) Maintenance (MAINT): - Text extraction improvements (#1126) - Destination.color returns ArrayObject instead of tuple as fallback (#1119) - Use add_bookmark_destination in add_bookmark (#1100) - Use add_bookmark_destination in add_bookmark_dict (#1099) Testing (TST): - Remove xfail from test_outline_title_issue_1121 - Add test for arab text (#1127) - Add xfail for decryption fail (#1125) - Add xfail test for IndexError when extracting text (#1124) - Add MCVE showing outline title issue (#1123) Code Style (STY): - Apply black and isort - Use IntFlag for permissions_flag / update_page_form_field_values (#1094) - Simplify code (#1101) Full Changelog: 2.5.0...2.6.0

dkg · 2022-07-19T04:27:55Z

@pubpub-zz that's an entirely reasonable question. I think you're asking whether a string can have zero characters in it or not. I refer you to §7.3.4 ("String Objects") from PDF32000_2008.pdf, which says:

A string object shall consist of a series of zero or more bytes.

§7.3.4.3 ("Hexadecimal Strings"), which is the relevant bit for what we're parsing here, says:

If the final digit of a hexadecimal string is missing—that is, if there is an odd number of digits—the final digit
shall be assumed to be 0.

All of these suggest that a string generally can be the empty string.

So maybe you're asking whether it's acceptable to have an empty string in the beginbfchar/endbfchar table. Adobe Techical note #5014 itself predates unicode, but page 71 describes the structure of beginbfchar and endbfchar, which says:

dstCode must be specified as hexadecimal strings.

back in the PDF spec, §9.10.3 ("ToUnicode CMaps") says that these maps:

define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Neither the PDF spec nor the tech note suggests that the strings have any required length other than an integer number of bytes (as a baseline for strings) or valid UTF-16BE encoding (for ToUnicode CMaps). and of course the empty string has an integer number of bytes and is also a valid UTF-16BE string.

dkg changed the title ~~Avoid a crash when a ToUnicode CMap has an empty dstString in beginbf…~~ Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar Jul 15, 2022

dkg mentioned this pull request Jul 15, 2022

cmap handling: failure when a beginbfchar pair has a zero-length second element #1111

Closed

MartinThoma reviewed Jul 16, 2022

View reviewed changes

MartinThoma merged commit ae0ff49 into py-pdf:main Jul 17, 2022

MartinThoma added a commit that referenced this pull request Jul 17, 2022

MAINT: Text extraction improvements

c08fa8f

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma mentioned this pull request Jul 17, 2022

MAINT: Text extraction improvements #1126

Merged

MartinThoma added a commit that referenced this pull request Jul 17, 2022

MAINT: Text extraction improvements

7740a6e

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

MartinThoma added a commit that referenced this pull request Jul 17, 2022

MAINT: Text extraction improvements (#1126)

e24b0a0

Credits to pubpub-zz, see #1118 (comment) Co-authored-by: pubpub-zz <4083478+pubpub-zz@users.noreply.github.com>

dkg mentioned this pull request Jul 20, 2022

feat: Update weasyprint to 55.0 ietf-tools/xml2rfc#802

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

dkg commented Jul 15, 2022 •

edited

Loading

codecov bot commented Jul 15, 2022

MartinThoma Jul 16, 2022 •

edited

Loading

MartinThoma commented Jul 16, 2022

MartinThoma commented Jul 16, 2022

pubpub-zz commented Jul 16, 2022 •

edited by MartinThoma

Loading

MartinThoma commented Jul 17, 2022

MartinThoma commented Jul 17, 2022

MartinThoma commented Jul 17, 2022

dkg commented Jul 17, 2022

dkg commented Jul 17, 2022

pubpub-zz commented Jul 17, 2022

dkg commented Jul 19, 2022

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

Avoid a crash when a ToUnicode CMap has an empty dstString in beginbfchar #1118

Conversation

dkg commented Jul 15, 2022 • edited Loading

codecov bot commented Jul 15, 2022

Codecov Report

MartinThoma Jul 16, 2022 • edited Loading

Choose a reason for hiding this comment

MartinThoma commented Jul 16, 2022

MartinThoma commented Jul 16, 2022

pubpub-zz commented Jul 16, 2022 • edited by MartinThoma Loading

MartinThoma commented Jul 17, 2022

MartinThoma commented Jul 17, 2022

MartinThoma commented Jul 17, 2022

dkg commented Jul 17, 2022

dkg commented Jul 17, 2022

pubpub-zz commented Jul 17, 2022

dkg commented Jul 19, 2022

dkg commented Jul 15, 2022 •

edited

Loading

MartinThoma Jul 16, 2022 •

edited

Loading

pubpub-zz commented Jul 16, 2022 •

edited by MartinThoma

Loading