DeprecationWarning: invalid escape sequence #482

Closed
stephenfin opened this issue Apr 3, 2020 · 8 comments
@stephenfin

stephenfin commented Apr 3, 2020

Bug description

Calling Page.getText('blocks') on PDFs whose text contains invalid Python escape sequences (e.g. \ ) results in the following warning:

../fitz/fitz.py:5404: DeprecationWarning: invalid escape sequence '\ '
  return _fitz.TextPage_extractBLOCKS(self, lines)

This is a warning now but may or may not be an error in Python 3.10.
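For readers without the PDF at hand, the same kind of warning can be reproduced in pure Python. The sketch below assumes the warning is raised at runtime while bytes are decoded with unicode-escape semantics (which the traceback pointing into the extension module suggests), and uses the built-in `unicode_escape` codec as a stand-in. It is an analogue for illustration, not PyMuPDF's code, and the helper name `decode_and_collect` is made up for this example.

```python
import warnings


def decode_and_collect(raw: bytes):
    """Decode with the unicode_escape codec and return (text, warnings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        text = raw.decode("unicode_escape")
    return text, caught


# A backslash followed by a space is not a recognised escape sequence.
text, caught = decode_and_collect(b"\\ ")
for w in caught:
    print(w.category.__name__, w.message)

# Promoting the warning to an error shows what a stricter future Python
# could do with the same input.
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        b"\\ ".decode("unicode_escape")
    except DeprecationWarning as exc:
        print("raised:", exc)
```

Note that the decoded text still contains the backslash; only the warning signals that the sequence was not understood.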

To Reproduce (mandatory)

  1. Create the following test script and save as test.py:

     import sys
     import fitz
    
     pdf = fitz.open(sys.argv[1])
     for page in pdf.pages():
         page.getText('blocks')
    
  2. Save the attached file locally

  3. Run the script against the file with deprecation warnings enabled:

     PYTHONWARNINGS=d python3 test.py test_aafigure.pdf
    

Expected behavior (optional)

The strings should be treated as raw strings (e.g. r'\ ') internally, or the backslashes should be escaped.

Screenshots (optional)

N/A

Your configuration (mandatory)

  • Fedora 31 (64 bit)
  • Python 3.7.6
  • PyMuPDF 1.16.16, wheel
3.7.6 (default, Jan 30 2020, 09:44:41) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] 
 linux 
PyMuPDF 1.16.16: Python bindings for the MuPDF 1.16.0 library.
Version date: 2020-03-29 09:44:30.
Built for Python 3.7 on linux (64-bit).

Additional context (optional)

I did try to fix this myself, but I haven't worked with SWIG (or Python bindings to a C lib) before and got lost. Sorry 😞

@JorjMcKie
Collaborator

Thanks for reporting this! ... And for your interest in PyMuPDF.
Let me have a look. I'll be back 😎

@JorjMcKie
Collaborator

Where is the attached PDF please?

@stephenfin
Author

stephenfin commented Apr 4, 2020

Whoops, sorry. Attached. Page three is the errant one.

test_aafigure.pdf

@JorjMcKie
Collaborator

ok, thanks, will look at it

@JorjMcKie
Collaborator

Fixed it I think by using PyUnicode_DecodeRawUnicodeEscape instead of PyUnicode_DecodeUnicodeEscape.
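The Python-level codecs `unicode_escape` and `raw_unicode_escape` correspond to the two C API functions named above, so the difference can be sketched without touching the C code. This only illustrates the codec semantics and is not the actual change made in PyMuPDF; the helper name `decode_quietly` and the sample bytes are hypothetical.

```python
import warnings


def decode_quietly(raw: bytes, codec: str):
    """Return (decoded text, number of warnings emitted while decoding)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        text = raw.decode(codec)
    return text, len(caught)


sample = b"a\\ b \\u00e9 c\\nd"  # backslash-space, \u00e9, backslash-n

# unicode_escape interprets every Python escape, so "\n" becomes a newline
# and the unknown "\ " triggers a warning.
print(decode_quietly(sample, "unicode_escape"))

# raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX; every other
# backslash is kept literally and no warning is emitted.
print(decode_quietly(sample, "raw_unicode_escape"))
```

One consequence worth keeping in mind: with the raw variant, sequences such as backslash-n stay as two characters instead of becoming a newline.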

@stephenfin
Author

Awesome. Thanks! Let me know if you need anything from me testing-wise. I assume the reproducer I provided did the trick.

@JorjMcKie
Collaborator

> I assume the reproducer I provided did the trick.

Yes, thanks again. The same warning also occurred for the "text", "words", "(x)html" and "(raw)dict" variants of getText().
Until fairly recently, I just used PyUnicode_FromStringAndSize to make Python strings from extracted PDF text. But I learned from user-provided PDF examples that this text is not reliably UTF-8 encodable. So I had to switch ... and switch again now 😉.
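As a rough Python-level illustration of why strict UTF-8 handling can fail on extracted text: decoding arbitrary bytes as UTF-8 raises an error whenever the bytes are not valid UTF-8. The byte strings below are invented for the example and are not taken from any of the PDFs mentioned; `try_utf8` is a hypothetical helper name.

```python
def try_utf8(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")  # strict error handling by default
    except UnicodeDecodeError:
        return None


print(try_utf8("caf\u00e9".encode("utf-8")))  # valid UTF-8
print(try_utf8(b"caf\xe9"))                   # Latin-1 byte, not valid UTF-8
```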

@JorjMcKie
Collaborator

Hopefully addressed in version 1.16.17 uploaded today.
