Bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed #616

htInEdin · 2021-05-05T16:41:02Z

Pull request

I have only just caught up with pdfminer.six; until now I was using pdfminer.20191103.

Two bug fixes:

pdfminer.20191103 allowed me to use a TextIOWrapper as the output stream for a TextConverter, but this fails with pdfminer.six because _is_binary_stream fails to recognise TextIOWrapper as non-binary.

All that's needed is to test for instances of io.TextIOBase.

Fixes #615.
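A minimal sketch of the idea, not the actual pdfminer.six code: the function name `is_binary_stream` and the fallback on the `mode` attribute are assumptions made for illustration. It relies only on the fact that every text stream in the standard library, including `io.TextIOWrapper` and `io.StringIO`, derives from `io.TextIOBase`.

```python
import io


def is_binary_stream(stream) -> bool:
    """Hypothetical helper: guess whether a stream expects bytes."""
    # Text streams (StringIO, TextIOWrapper, ...) all derive from io.TextIOBase.
    if isinstance(stream, io.TextIOBase):
        return False
    # Raw and buffered byte streams derive from these base classes.
    if isinstance(stream, (io.RawIOBase, io.BufferedIOBase)):
        return True
    # Assumption: for file-like objects outside the io hierarchy,
    # fall back to the mode string and treat "no mode" as binary.
    mode = getattr(stream, "mode", None)
    if isinstance(mode, str):
        return "b" in mode
    return True


wrapped = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
print(is_binary_stream(wrapped))       # False
print(is_binary_stream(io.BytesIO()))  # True
```

Checking the base class first means a wrapper is classified correctly even when it carries no `mode` attribute of its own.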

Also fixed a bug in psparser, which failed to remove all of an escaped \r\n (a backslash followed by CR LF) in a literal string.

Fixes #624.
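For reference, section 7.3.4.2 of PDF32000_2008 says that a backslash at the end of a line inside a literal string is a line continuation: the backslash and the end-of-line marker that follows it (CR, LF, or CR LF) are both disregarded. The sketch below illustrates only that rule. It is not the psparser code; the function name `drop_line_continuations` is hypothetical, and it leaves every other escape sequence untouched.

```python
def drop_line_continuations(body: bytes) -> bytes:
    """Hypothetical helper: remove backslash + end-of-line pairs."""
    out = bytearray()
    i = 0
    n = len(body)
    while i < n:
        ch = body[i:i + 1]
        if ch != b"\\" or i + 1 >= n:
            # Ordinary byte, or a lone backslash at the very end: copy it.
            out += ch
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt == b"\r":
            # Skip the backslash, the CR, and the LF too if one follows.
            i += 3 if body[i + 2:i + 3] == b"\n" else 2
        elif nxt == b"\n":
            # Skip the backslash and the LF.
            i += 2
        else:
            # Some other escape (for example \\ or \n): copy both bytes as-is.
            out += body[i:i + 2]
            i += 2
    return bytes(out)


print(drop_line_continuations(b"abc\\\r\ndef"))  # b'abcdef'
```

Scanning escape by escape, rather than using a single substitution, keeps an escaped backslash followed by a real newline from being mistaken for a continuation.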

How Has This Been Tested?

I have a link extractor (a patched version of pdfx) which ran on a 200-page PDF, finding 1300+ links, using pdfminer.20191103. It crashed with pdfminer.20201018, and runs again with the same output as before when patched with these changes. Running on a sample of 900+ PDF files from the Common Crawl of August 2019, it finds more and cleaner links than when using pdfminer.20191103.

Checklist

I have added tests that prove my fix is effective or that my feature works
I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial version
I have updated the README.md or verified that this is not necessary
I have updated the readthedocs documentation or verified that this is not necessary
Commit includes a ChangeLog. Sorry, I don't understand the new format, but I have supplied an old-format version.

pietermarsman · 2021-08-31T20:03:58Z

@htInEdin I did some changes to make it more readable. Could you double-check the psparser.py logic for literal strings? I find that these kinds of things are very easy to get wrong.

htInEdin · 2021-08-31T23:25:23Z

Thanks, will do, on holiday but back next week.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ***@***.***
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

pietermarsman · 2021-09-17T18:33:37Z

@htInEdin bump ;)

(Just a friendly reminder, no hurry)

htInEdin · 2021-09-20T14:49:04Z

I've reviewed your work above, thanks, and _parse_string(_1) in particular, and I think it's good to go.

htInEdin · 2021-09-22T07:49:34Z

Done. ht

…lowing the Python official guidance that warning.warn is directed at _developers_, not users

* (pdfdocument.py) remove declarations of PDFTextExtractionNotAllowedWarning, PDFNoValidXRefWarning
* (pdfpage.py) Don't import warning, don't use PDFTextExtractionNotAllowedWarning
* (tools/dumppdf.py) Don't import warning, don't use PDFNoValidXRefWarning
* (tests/test_tools_dumppdf.py) Don't import warning, check for logging.WARN rather than PDFNoValidXRefWarning

htInEdin · 2021-09-22T15:31:20Z

Bother, I still haven't quite gotten the hang of branch management. The above 4 commits were meant to be for a new 'preferLoggingToWarning' pull request, now merged into this one :-(.
d6fc079 is the correct commit to pull from for this request. Let me know if I need to do some surgery...
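For context on what the stray commits were about, here is a minimal sketch of the difference between raising a warning with `warnings.warn` and logging one with `Logger.warning`. It is not the code from those commits: the logger name, the `ListHandler` helper, the two `report_*` functions, and the message text are all hypothetical.

```python
import logging
import warnings

logger = logging.getLogger("example.pdf")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Hypothetical helper that keeps emitted records in a list."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def report_with_warnings(message):
    # Raises a Python warning; warning filters decide whether it is
    # shown, hidden, or turned into an exception.
    warnings.warn(message, UserWarning, stacklevel=2)


def report_with_logging(message):
    # Sends the message through the logging system at WARNING level.
    logger.warning(message)


handler = ListHandler()
logger.addHandler(handler)
report_with_logging("example message")

# A test can capture the warning variant with catch_warnings instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_with_warnings("example message")
```

With the logging variant a test inspects the level of the captured record, while the warnings variant is checked by its warning category.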

pietermarsman · 2021-09-27T18:31:04Z

@htInEdin I think I fixed it by reverting them.

Merged it now. Thanks for the work!

hstHome added 2 commits May 5, 2021 16:59

detect TextIOWrapper as non-binary

74cbc67

I don't understand the CHANGELOG.md format, hope this is good enough

737a100

htInEdin changed the title ~~Hst~~ Bug: _is_binary_stream should recognize TextIOWrapper as non-binary May 6, 2021

hstHome added 2 commits May 22, 2021 12:24

Delete \\\r\n in Literal Strings (ref. section 7.3.4.2 of PDF32000_2008)

357726e

Keep Travis CI happy

624402a

htInEdin changed the title ~~Bug: _is_binary_stream should recognize TextIOWrapper as non-binary~~ Bug: _is_binary_stream should recognize TextIOWrapper as non-binary, escaped \r\n should be removed May 27, 2021

pietermarsman added 7 commits August 31, 2021 21:11

Added test

2ad707e

Remove pdfminer/Changelog

3e6a987

Prettify _parse_string_1

f83b3f8

Add CHANGELOG.md

a2c70fc

Satisfy flake8

fdb5b1f

Merge branch 'develop' into hst

5bb26f2

Update CHANGELOG.md

1347336

Merge branch 'pdfminer:develop' into hst

d6fc079

hstHome added 4 commits September 22, 2021 10:07

get name right

80091ea

Merge branch 'preferLoggingToWarning' into hst

9d9d139

make flake8 happy

4592769

pietermarsman added 5 commits September 27, 2021 20:27

Revert "make flake8 happy"

dd4f600

This reverts commit 4592769.

Revert "get name right"

de1edf1

This reverts commit 80091ea.

Revert "Use logging.Logger.warning instead of warning.warn in most ca…

bb05ac1

…ses, following" This reverts commit 3c1e3d6.

Revert "Merge branch 'preferLoggingToWarning' into hst"

b3da219

This reverts commit 9d9d139, reversing changes made to 80091ea.

Revert "Revert "Merge branch 'preferLoggingToWarning' into hst""

2667d4c

This reverts commit b3da219.

pietermarsman approved these changes Sep 27, 2021

pietermarsman merged commit 33d7dde into pdfminer:develop Sep 27, 2021

htInEdin deleted the hst branch September 28, 2021 08:50
