Line breaks being lost between headers and paragraphs #608

robt-dice · 2023-06-23T13:46:50Z

PHP Version: 8.1
PDFParser Version: v2.5.0

Description:

Line breaks between headings and paragraph text are being lost. Also, the subsequent line breaks after each paragraph are being lost. The PDF was created using macOS Pages' export function, using the following settings:

PDF input

Example PDF.pdf <https://github.com/smalot/pdfparser/files/11848963/Example.PDF.pdf>

Expected output & actual output

expected.txt
actual.txt

Code

require 'vendor/autoload.php'; // Composer autoloader

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($filePath);
echo $pdf->getText();

GreyWyvern · 2023-07-31T14:52:45Z

I'm fairly certain that what's going on here is that the "macOS Pages' export function" is using the cm operator (which modifies the current transformation matrix in the graphics state) to position the text, rather than the Tm text matrix. Here's a snip of the text stream from the original document:

/Cs1 cs
0 0 0 sc
q
1 0 0 1 56.69292 766.6251 cm
BT
18 0 0 18 0 0 Tm
/TT1 1 Tf
(Heading 1 ) Tj
ET
Q
EMC
/P << /MCID 2 >> BDC
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Lor) 18 (em ipsum dolor sit amet, consectetur adipiscing elit. In at leo eu er) 18 (os dapibus ultrices. Cras ac ) ] TJ
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (facilisis diam. Nunc et augue ut mauris gravida phar) 18 (etra. Nam egestas ex et diam aliquet varius. ) ] TJ
ET
Q
q
1 0 0 1 56.69292 728.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Sed consectetur enim a ur) -18 (na vehicula dignissim. Phasellus venenatis nunc sed risus ultricies, id ) ] TJ
ET

You'll notice that all the Tm commands are identical except for the heading's. On their own, the Tm commands for the three paragraph lines would position them on top of each other, but the cm command before each one moves the text block down to the next line. See how the cm y-values decrease: 752, 740, 728...?
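To make the decreasing offsets concrete, here is a minimal sketch that pulls the vertical translation out of every cm operator in a raw stream and prints the gap between consecutive lines. The helper name `cmVerticalOffsets` is made up for illustration and is not part of PdfParser; the regex assumes plain decimal operands as in the snip above.

```php
<?php
// Hypothetical helper (not a PdfParser API): collect the vertical offset,
// i.e. the sixth operand, of every "a b c d e f cm" operator in a stream.
function cmVerticalOffsets(string $stream): array
{
    $num = '-?\d*\.?\d+'; // plain decimal operand, as in the snip above
    preg_match_all(
        "/($num)\s+($num)\s+($num)\s+($num)\s+($num)\s+($num)\s+cm\b/",
        $stream,
        $m
    );
    return array_map('floatval', $m[6]);
}

// Shortened copy of the three paragraph lines from the original document.
$stream = <<<'PDF'
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
ET
Q
q
1 0 0 1 56.69292 728.8481 cm
PDF;

$ys = cmVerticalOffsets($stream);

// Distance between consecutive lines, in user-space units.
$gaps = [];
for ($i = 1; $i < count($ys); $i++) {
    $gaps[] = round($ys[$i - 1] - $ys[$i], 4);
}
print_r($gaps);
```

Each gap comes out as 12 units, which lines up with the `0 -1.091 TD` leading in the Acrobat re-save (1.091 × the font scale of 11 ≈ 12).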

If the document is re-saved in Adobe Acrobat as a reduced-size PDF, then the output above becomes:

/CS0 cs
0 0 0 scn
/TT0 1 Tf
18 0 0 18 56.6929 766.6251 Tm
(Heading 1 )Tj
ET
EMC
BT
/P <</MCID 2 >>BDC
/TT1 1 Tf
11 0 0 11 56.6929 752.8481 Tm
[(Lor)18 (em ipsum dolor sit amet, consectetur adipiscing elit. In at leo eu er)18 (os dapibus ultrices. Cras ac )]TJ
0 -1.091 TD
[(facilisis diam. Nunc et augue ut mauris gravida phar)18 (etra. Nam egestas ex et diam aliquet varius. )]TJ
T*
[(Sed consectetur enim a ur)-18 (na vehicula dignissim. Phasellus venenatis nunc sed risus ultricies, id \ )]TJ
T*
[(laor)18 (eet massa semper)92 (. In id mi at magna porttitor blandit eget eget liber)18 (o. Pellentesque id )]TJ
T*
[(malesuada ipsum. Pellentesque viverra molestie consectetur)92 (.)]TJ

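Here you'll see that the cm value from the original document has been folded into the Tm command. Strictly speaking this is matrix multiplication rather than addition, although for a pure translation like this one the effect is simply to add the cm offsets to the Tm position: 18 0 0 18 0 0 Tm × 1 0 0 1 56.69292 766.6251 cm = 18 0 0 18 56.6929 766.6251 Tm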
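As a quick check of that arithmetic, the sketch below multiplies the two matrices using the PDF convention, where the six operands `a b c d e f` stand for the 3×3 matrix with rows `[a b 0]`, `[c d 0]`, `[e f 1]`. The function name `multiplyPdfMatrices` is an illustrative assumption, not something PdfParser provides.

```php
<?php
// Hypothetical helper: multiply two PDF matrices given as [a, b, c, d, e, f].
// Returns $m x $n, i.e. $m is applied first, then $n.
function multiplyPdfMatrices(array $m, array $n): array
{
    return [
        $m[0] * $n[0] + $m[1] * $n[2],          // a
        $m[0] * $n[1] + $m[1] * $n[3],          // b
        $m[2] * $n[0] + $m[3] * $n[2],          // c
        $m[2] * $n[1] + $m[3] * $n[3],          // d
        $m[4] * $n[0] + $m[5] * $n[2] + $n[4],  // e (horizontal offset)
        $m[4] * $n[1] + $m[5] * $n[3] + $n[5],  // f (vertical offset)
    ];
}

$tm = [18, 0, 0, 18, 0, 0];               // 18 0 0 18 0 0 Tm
$cm = [1, 0, 0, 1, 56.69292, 766.6251];   // 1 0 0 1 56.69292 766.6251 cm

$combined = multiplyPdfMatrices($tm, $cm);
echo implode(' ', $combined), " Tm\n";
```

Element-wise addition would give a scale of 19 rather than 18, which is why the combined matrix has to come from a product and not a sum.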

Since PdfParser ignores cm commands when extracting text, it has no positioning information for these lines and detects no line breaks, so all the output appears on one line.

k00ni · 2023-08-01T06:44:08Z

@GreyWyvern thanks for investigating. So it seems that PDFParser is lacking the ability to process cm parts? What do you suggest?

GreyWyvern · 2023-08-01T13:50:48Z

> @GreyWyvern thanks for investigating. So it seems that PDFParser is lacking the ability to process cm parts? What do you suggest?

In this case the cm commands are outside of the BT ... ET blocks which means PDFObject::getSectionsText() doesn't even read them. Until PdfParser has the way it reads content streams overhauled*, this can't be fixed.
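For illustration only, here is a rough sketch of what reading the whole stream could look like: it tracks q/Q and cm outside the BT ... ET blocks and starts a new output line whenever the effective vertical position changes. This is not how PdfParser or the later overhaul actually works. Every name here is hypothetical, and it assumes one operator per line, unnested literal strings, and no escape decoding.

```php
<?php
// Hypothetical helper: multiply two PDF matrices [a, b, c, d, e, f].
function multiplyPdfMatrices(array $m, array $n): array
{
    return [
        $m[0] * $n[0] + $m[1] * $n[2],
        $m[0] * $n[1] + $m[1] * $n[3],
        $m[2] * $n[0] + $m[3] * $n[2],
        $m[2] * $n[1] + $m[3] * $n[3],
        $m[4] * $n[0] + $m[5] * $n[2] + $n[4],
        $m[4] * $n[1] + $m[5] * $n[3] + $n[5],
    ];
}

// Hypothetical sketch: walk every line of the stream, not just BT ... ET.
function extractTextWithLineBreaks(string $stream): string
{
    $identity = [1, 0, 0, 1, 0, 0];
    $ctm = $identity;   // current transformation matrix (set by cm)
    $stack = [];        // graphics state stack for q / Q
    $tm = $identity;    // text matrix (set by Tm)
    $lastY = null;
    $out = '';

    foreach (preg_split('/\R/', trim($stream)) as $line) {
        $line = trim($line);
        if ($line === 'q') {
            $stack[] = $ctm;
        } elseif ($line === 'Q') {
            $ctm = array_pop($stack) ?? $identity;
        } elseif ($line === 'BT') {
            $tm = $identity;
        } elseif (preg_match('/^(.+)\s+(cm|Tm)$/', $line, $m)) {
            $nums = array_map('floatval', preg_split('/\s+/', trim($m[1])));
            if (count($nums) !== 6) {
                continue;
            }
            if ($m[2] === 'cm') {
                $ctm = multiplyPdfMatrices($nums, $ctm);
            } else {
                $tm = $nums;
            }
        } elseif (preg_match('/T[jJ]$/', $line)) {
            // Join the literal strings; TJ kerning numbers are ignored.
            preg_match_all('/\(((?:[^()\\\\]|\\\\.)*)\)/', $line, $s);
            $y = multiplyPdfMatrices($tm, $ctm)[5];
            if ($lastY !== null && abs($y - $lastY) > 0.01) {
                $out .= "\n";
            }
            $out .= implode('', $s[1]);
            $lastY = $y;
        }
    }
    return $out;
}

$stream = <<<'PDF'
q
1 0 0 1 56.69292 766.6251 cm
BT
18 0 0 18 0 0 Tm
/TT1 1 Tf
(Heading 1 ) Tj
ET
Q
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Lor) 18 (em ipsum ) ] TJ
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (facilisis diam. ) ] TJ
ET
Q
PDF;

$text = extractTextWithLineBreaks($stream);
echo $text, "\n";
```

A real implementation would need a proper tokenizer, since operators are not guaranteed to sit one per line, and would also have to handle TD, T* and the other text-positioning operators seen in the Acrobat output.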

*I'm working on this, hopefully soon. :)

GreyWyvern · 2023-08-14T13:25:23Z

Hi @robt-dice. Are we able to use your Example PDF.pdf in the PdfParser test suite? Is it free to use?

It's actually good for at least a couple tests of mine. :)

robt-dice · 2023-08-14T13:50:52Z

Please, feel free to use it.

k00ni added the bug label Jun 25, 2023

k00ni added the stale needs decision label Jul 31, 2023

k00ni removed the stale needs decision label Aug 1, 2023

GreyWyvern mentioned this issue Aug 10, 2023: PdfParser does not consider the entire document stream #628 (Closed)

GreyWyvern mentioned this issue Aug 18, 2023: Major Update to PDFObject.php + Ancillary #634 (Merged)

k00ni closed this as completed in #634 Nov 7, 2023
