Tables drawn from single path is converted to curve instead of rects #369

cheungpat · 2020-02-05T17:44:58Z

Describe the bug

When a spreadsheet is exported from Excel with the Print to PDF function, the table borders in the generated PDF are missing from the HTML output. In the XML output, the borders appear as <curve> elements instead of <rect> elements.

It appears that the borders are drawn as a single path, which is therefore interpreted as one curve instead of a set of rects.

To Reproduce

Run pdf2txt.py -t html output.pdf > output.html.

output.pdf

output.html: (some borders are missing)

When converting to xml, the borders become <curve> instead of <rect>.

Expected behavior

output.html: (borders should be shown)


For a path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes pdfminer#369
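The splitting step described above can be sketched in a few lines. This is an illustration only, not the code from the PR: it assumes a path is a list of tuples such as `('m', x, y)`, `('l', x, y)` and `('h',)`, and the helper names `split_rect_subpaths`, `is_axis_aligned_rect` and `classify` are hypothetical.

```python
import re


def split_rect_subpaths(path):
    """Split a path whose shape is 'mlllh' repeated into 5-element subpaths.

    Any other shape is returned unchanged as a single subpath.
    """
    shape = "".join(op[0] for op in path)
    if not re.fullmatch(r"(mlllh)+", shape):
        return [path]
    return [path[i:i + 5] for i in range(0, len(path), 5)]


def is_axis_aligned_rect(subpath):
    """Return True if an 'mlllh' subpath traces an axis-aligned rectangle."""
    if "".join(op[0] for op in subpath) != "mlllh":
        return False
    (_, x0, y0), (_, x1, y1), (_, x2, y2), (_, x3, y3) = subpath[:4]
    return (x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0) or (
        y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0
    )


def classify(path):
    """Label each subpath as 'rect' or 'curve' (hypothetical labels)."""
    return [
        "rect" if is_axis_aligned_rect(sub) else "curve"
        for sub in split_rect_subpaths(path)
    ]


# Two border cells drawn as one path: shape is 'mlllhmlllh'.
two_cells = [
    ("m", 0, 0), ("l", 10, 0), ("l", 10, 5), ("l", 0, 5), ("h",),
    ("m", 10, 0), ("l", 20, 0), ("l", 20, 5), ("l", 10, 5), ("h",),
]
print(classify(two_cells))
```

Without the split, the ten-element path matches no single-rectangle shape and is treated as one curve. With it, each group of five elements is checked on its own.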

pietermarsman · 2020-03-03T20:41:13Z

Thanks for raising this issue!

I'll review your PR and see if we can merge it.

* Fix converting path to multiple rectangles

  For a path that consists of a series of rectangles (shape is 'mlllhmlllh...'), call paint_path again with each group of 5 points. The result is multiple rects instead of a single curve. fixes #369
* Reduce pdf size by removing font
* Add unittest for PDFLayoutAnalyzer.paint_path()
* Add line to CHANGELOG.md
* Add reference to pdf reference manual
* Cleanup function paint_path a bit
* Reduce line length of tests
* Reduce line length of tests

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>

cheungpat mentioned this issue Feb 6, 2020

Fix converting path to multiple rectangles #371

Merged

5 tasks

pietermarsman added the type: bug label Mar 3, 2020

pietermarsman closed this as completed in #371 Jul 11, 2020

jsvine mentioned this issue Aug 20, 2020

Bug in new .paint_path logic #473

Closed
