DeprecationWarning: invalid escape sequence #482

stephenfin · 2020-04-03T16:13:12Z

Bug description

Calling Page.getText('blocks') on PDFs whose text contains invalid Python escape sequences (e.g. \ ) results in the following warning:

../fitz/fitz.py:5404: DeprecationWarning: invalid escape sequence '\ '
  return _fitz.TextPage_extractBLOCKS(self, lines)

This is a warning now but may or may not be an error in Python 3.10.

To Reproduce (mandatory)

Create the following test script and save as test.py:

 import sys
 import fitz

 pdf = fitz.open(sys.argv[1])
 for page in pdf.pages():
     page.getText('blocks')

Save the attached file locally.
Run the script against the file with deprecation warnings enabled:
```
 PYTHONWARNINGS=d python3 test.py test_aafigure.pdf
```

Expected behavior (optional)

The strings should be treated as raw strings (e.g. r'\ ') internally, or the backslashes should be escaped.

Screenshots (optional)

N/A

Your configuration (mandatory)

Fedora 31 (64 bit)
Python 3.7.6
PyMuPDF 1.16.16, wheel

3.7.6 (default, Jan 30 2020, 09:44:41) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] 
 linux

PyMuPDF 1.16.16: Python bindings for the MuPDF 1.16.0 library.
Version date: 2020-03-29 09:44:30.
Built for Python 3.7 on linux (64-bit).

Additional context (optional)

I did try to fix this myself, but I haven't worked with SWIG (or Python bindings to a C lib) before and got lost. Sorry 😞


JorjMcKie · 2020-04-03T18:56:32Z

Thanks for reporting this! ... And for your interest in PyMuPDF.
Let me have a look. I'll be back 😎

JorjMcKie · 2020-04-03T19:00:00Z

Where is the attached PDF please?

stephenfin · 2020-04-04T12:17:59Z

Whoops, sorry. Attached. Page three is the errant one.

test_aafigure.pdf

JorjMcKie · 2020-04-04T13:38:42Z

ok, thanks, will look at it

JorjMcKie · 2020-04-05T08:43:34Z

Fixed it I think by using PyUnicode_DecodeRawUnicodeEscape instead of PyUnicode_DecodeUnicodeEscape.
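The difference between the two decoders can be approximated from pure Python, on the assumption that the standard `unicode_escape` and `raw_unicode_escape` codecs behave like the two C API functions named above. This is only a sketch: the sample bytes are hypothetical, and the thread does not show how PyMuPDF builds its buffers.

```python
import warnings


def decode_and_collect(data: bytes, codec: str):
    """Decode `data` with `codec`; return the text and any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        text = data.decode(codec)
    return text, [str(w.message) for w in caught]


# Hypothetical sample: a backslash followed by a space, as in the report.
sample = b"a\\ b"

escaped_text, escaped_warnings = decode_and_collect(sample, "unicode_escape")
raw_text, raw_warnings = decode_and_collect(sample, "raw_unicode_escape")

print(escaped_warnings)  # expected to mention an invalid escape sequence
print(raw_warnings)      # expected to be empty
```

Both codecs return the same text for this input; only `unicode_escape` complains about the unrecognized `\ ` sequence, because `raw_unicode_escape` interprets nothing except `\uXXXX` and `\UXXXXXXXX`.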

stephenfin · 2020-04-05T09:47:04Z

Awesome. Thanks! Let me know if you need anything from me testing-wise. I assume the reproducer I provided did the trick.

JorjMcKie · 2020-04-05T11:46:21Z

> I assume the reproducer I provided did the trick.

Yes, thanks again. The same warning also occurred for the "text", "words", "(x)html" and "(raw)dict" variants of getText().
Until fairly recently, I just used PyUnicode_FromStringAndSize to make Python strings from extracted PDF text. But I learned from user-provided PDF examples that this text is not reliably valid UTF-8. So I had to switch ... and switch again now 😉.
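The sketch below illustrates why a strict UTF-8 conversion can fail on extracted text. PyUnicode_FromStringAndSize interprets its buffer as UTF-8, which is approximated here with `bytes.decode("utf-8")`. The sample bytes are hypothetical and were not taken from any PDF in this thread.

```python
# Hypothetical sample: the lone byte 0xE9 is not a valid UTF-8 sequence.
sample = b"caf\xe9"

try:
    sample.decode("utf-8")  # strict UTF-8, as a stand-in for the C call
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# raw_unicode_escape maps each non-escape byte to the code point of the
# same value, so it does not reject this input.
fallback = sample.decode("raw_unicode_escape")

print(utf8_ok)
print(fallback)
```

The fallback never raises for single bytes like this, but it only yields the intended characters when the bytes really are Latin-1 text or `\uXXXX` escapes.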

JorjMcKie · 2020-04-09T16:40:22Z

Hopefully addressed in version 1.16.17 uploaded today.

stephenfin added the bug label Apr 3, 2020

stephenfin assigned JorjMcKie Apr 3, 2020

JorjMcKie closed this as completed Apr 9, 2020

bschollnick mentioned this issue Jan 25, 2021

Unable to install under M1 Macintosh mode (works in x86/rosetta 2 mode) #834

Closed
