Line breaks being lost between headers and paragraphs #608

Closed
robt-dice opened this issue Jun 23, 2023 · 5 comments · Fixed by #634

@robt-dice

robt-dice commented Jun 23, 2023

  • PHP Version: 8.1
  • PDFParser Version: v2.5.0

Description:

Line breaks between headings and paragraph text are being lost. Also, the subsequent line breaks after each paragraph are being lost. The PDF was created using macOS Pages' export function, using the following settings:
[Image: Screenshot 2023-06-23 at 14 32 03]

PDF input

Example PDF.pdf

Expected output & actual output

expected.txt
actual.txt

Code

// Assumes `use Smalot\PdfParser;` is in scope and $filePath holds the path to the example PDF.
$parser = new PdfParser\Parser();
$pdf = $parser->parseFile($filePath);
echo $pdf->getText();
@k00ni k00ni added the bug label Jun 25, 2023
@GreyWyvern
Contributor

I'm fairly certain that the "macOS Pages' export function" positions the text with the graphics-state cm operator (which modifies the current transformation matrix) rather than with the Tm text matrix. Here's a snippet of the content stream from the original document:

/Cs1 cs
0 0 0 sc
q
1 0 0 1 56.69292 766.6251 cm
BT
18 0 0 18 0 0 Tm
/TT1 1 Tf
(Heading 1 ) Tj
ET
Q
EMC
/P << /MCID 2 >> BDC
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Lor) 18 (em ipsum dolor sit amet, consectetur adipiscing elit. In at leo eu er) 18 (os dapibus ultrices. Cras ac ) ] TJ
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (facilisis diam. Nunc et augue ut mauris gravida phar) 18 (etra. Nam egestas ex et diam aliquet varius. ) ] TJ
ET
Q
q
1 0 0 1 56.69292 728.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Sed consectetur enim a ur) -18 (na vehicula dignissim. Phasellus venenatis nunc sed risus ultricies, id ) ] TJ
ET

You'll notice that all the Tm commands are the same except for the heading, and that every one of them has a zero translation (the last two operands are 0 0). Taken alone, the Tm commands for the three paragraph lines would place them on top of each other; it is the cm command before each BT block that moves the text down to the next line. Note the decreasing vertical offsets: 752.8481, 740.8481, 728.8481, ...
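To make the pattern concrete, here is a minimal PHP sketch that pulls the vertical offset (the sixth operand) out of each cm command in an abbreviated copy of the stream above. The function name `cmVerticalOffsets` is hypothetical and not part of PdfParser. The regular expression is deliberately naive: it assumes the six operands and the operator are separated only by whitespace, and it does not guard against matches inside string literals.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns the 6th operand
// (vertical translation) of every `cm` operator found in a content stream.
function cmVerticalOffsets(string $stream): array
{
    $num = '[-+]?(?:\d+\.?\d*|\.\d+)';
    // Six numeric operands, each followed by whitespace, then the `cm` operator.
    $pattern = '/' . str_repeat("($num)\\s+", 6) . 'cm\b/';
    preg_match_all($pattern, $stream, $matches);

    return array_map('floatval', $matches[6]);
}

// Abbreviated copy of the stream from the issue (text strings shortened).
$stream = <<<'PDF'
q
1 0 0 1 56.69292 766.6251 cm
BT
18 0 0 18 0 0 Tm
(Heading 1 ) Tj
ET
Q
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
[ (Lor) 18 (em ipsum dolor sit amet, ) ] TJ
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
[ (facilisis diam. ) ] TJ
ET
Q
q
1 0 0 1 56.69292 728.8481 cm
BT
11 0 0 11 0 0 Tm
[ (Sed consectetur enim a ur) -18 (na vehicula dignissim. ) ] TJ
ET
PDF;

$offsets = cmVerticalOffsets($stream);

// Gap between consecutive blocks; a positive gap means the text moved down the page.
$gaps = [];
for ($i = 1; $i < count($offsets); $i++) {
    $gaps[] = round($offsets[$i - 1] - $offsets[$i], 4);
}

print_r($offsets);
print_r($gaps);
```

The gap after the heading comes out larger than the gap between the paragraph lines, which is the kind of signal a text extractor could use to decide where line breaks belong.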

If the document is re-saved in Adobe Acrobat as a reduced-size PDF, then the output above becomes:

/CS0 cs
0 0 0 scn
/TT0 1 Tf
18 0 0 18 56.6929 766.6251 Tm
(Heading 1 )Tj
ET
EMC
BT
/P <>BDC
/TT1 1 Tf
11 0 0 11 56.6929 752.8481 Tm
[(Lor)18 (em ipsum dolor sit amet, consectetur adipiscing elit. In at leo eu er)18 (os dapibus ultrices. Cras ac )]TJ
0 -1.091 TD
[(facilisis diam. Nunc et augue ut mauris gravida phar)18 (etra. Nam egestas ex et diam aliquet varius. )]TJ
T*
[(Sed consectetur enim a ur)-18 (na vehicula dignissim. Phasellus venenatis nunc sed risus ultricies, id \ )]TJ
T*
[(laor)18 (eet massa semper)92 (. In id mi at magna porttitor blandit eget eget liber)18 (o. Pellentesque id )]TJ
T*
[(malesuada ipsum. Pellentesque viverra molestie consectetur)92 (.)]TJ

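Here you'll see that the cm matrix from the original document has been folded into the Tm command. Strictly speaking the two matrices are multiplied (concatenated) rather than added, but because this cm is a pure translation the result looks as if its offsets were simply added to Tm: 1 0 0 1 56.69292 766.6251 cm combined with 18 0 0 18 0 0 Tm gives 18 0 0 18 56.6929 766.6251 Tm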
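The arithmetic can be checked with a short PHP sketch. It assumes the usual PDF convention that a matrix written as a b c d e f stands for the 3×3 matrix [[a, b, 0], [c, d, 0], [e, f, 1]], and that the effective text placement is Tm multiplied by the current transformation matrix. The function name `concatMatrices` is made up for illustration and is not PdfParser API.

```php
<?php
// Hypothetical helper (not PdfParser API): multiplies two PDF matrices given
// as [a, b, c, d, e, f], returning $m x $n (the transform $m is applied first).
function concatMatrices(array $m, array $n): array
{
    return [
        $m[0] * $n[0] + $m[1] * $n[2],
        $m[0] * $n[1] + $m[1] * $n[3],
        $m[2] * $n[0] + $m[3] * $n[2],
        $m[2] * $n[1] + $m[3] * $n[3],
        $m[4] * $n[0] + $m[5] * $n[2] + $n[4],
        $m[4] * $n[1] + $m[5] * $n[3] + $n[5],
    ];
}

$tm  = [18, 0, 0, 18, 0, 0];                 // 18 0 0 18 0 0 Tm
$ctm = [1, 0, 0, 1, 56.69292, 766.6251];     // 1 0 0 1 56.69292 766.6251 cm

$effective = concatMatrices($tm, $ctm);
echo implode(' ', $effective), " Tm\n";
```

Because the cm matrix here only translates, the scale of 18 passes through unchanged and the translation components become 56.69292 and 766.6251, matching the Tm that Acrobat wrote (up to rounding). With a cm that also scaled or rotated, plain addition would give a different and wrong result.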

Since PdfParser ignores cm commands when extracting text, it has no positioning information for these blocks and detects no line breaks, so all the output appears on one line.

@k00ni k00ni added the stale needs decision label Jul 31, 2023
@k00ni
Collaborator

k00ni commented Aug 1, 2023

@GreyWyvern thanks for investigating. So it seems that PDFParser is lacking the ability to process cm parts? What do you suggest?

@k00ni k00ni removed the stale needs decision label Aug 1, 2023
@GreyWyvern
Contributor

> @GreyWyvern thanks for investigating. So it seems that PDFParser is lacking the ability to process cm parts? What do you suggest?

In this case the cm commands are outside of the BT ... ET blocks which means PDFObject::getSectionsText() doesn't even read them. Until PdfParser has the way it reads content streams overhauled*, this can't be fixed.

*I'm working on this, hopefully soon. :)
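For illustration only, a reader that also tracks cm outside the BT ... ET blocks might look roughly like the PHP sketch below. It is not the planned PdfParser overhaul, and the function names `pdfMatrixMultiply` and `extractLines` are hypothetical. It assumes one operator per line (as in the snippet from this issue), no escaped parentheses inside strings, and only the operators q, Q, cm, BT, Tm, Tj and TJ.

```php
<?php
// Hypothetical sketch, not PdfParser code.
// Multiplies two PDF matrices [a, b, c, d, e, f]: returns $m x $n.
function pdfMatrixMultiply(array $m, array $n): array
{
    return [
        $m[0] * $n[0] + $m[1] * $n[2],
        $m[0] * $n[1] + $m[1] * $n[3],
        $m[2] * $n[0] + $m[3] * $n[2],
        $m[2] * $n[1] + $m[3] * $n[3],
        $m[4] * $n[0] + $m[5] * $n[2] + $n[4],
        $m[4] * $n[1] + $m[5] * $n[3] + $n[5],
    ];
}

// Extracts text and starts a new line whenever the effective vertical
// position (text matrix combined with the CTM) changes.
function extractLines(string $stream): string
{
    $identity = [1, 0, 0, 1, 0, 0];
    $ctm = $identity;   // current transformation matrix
    $stack = [];        // saved CTMs for q / Q
    $tm = $identity;    // text matrix
    $lastY = null;
    $out = '';

    foreach (preg_split('/\R/', trim($stream)) as $line) {
        $line = trim($line);
        if ($line === 'q') {
            $stack[] = $ctm;
        } elseif ($line === 'Q') {
            $ctm = array_pop($stack) ?? $identity;
        } elseif ($line === 'BT') {
            $tm = $identity;
        } elseif (preg_match('/^(.+)\s+(cm|Tm)$/', $line, $m)) {
            $matrix = array_map('floatval', preg_split('/\s+/', trim($m[1])));
            if ($m[2] === 'cm') {
                $ctm = pdfMatrixMultiply($matrix, $ctm); // cm concatenates onto the CTM
            } else {
                $tm = $matrix;                           // Tm replaces the text matrix
            }
        } elseif (preg_match('/T[jJ]$/', $line)) {
            preg_match_all('/\(([^)]*)\)/', $line, $parts);
            $y = pdfMatrixMultiply($tm, $ctm)[5];
            if ($lastY !== null && abs($y - $lastY) > 0.01) {
                $out .= "\n";
            }
            $out .= implode('', $parts[1]);
            $lastY = $y;
        }
    }

    return $out;
}

// Abbreviated copy of the stream from this issue (text strings shortened).
$stream = <<<'PDF'
q
1 0 0 1 56.69292 766.6251 cm
BT
18 0 0 18 0 0 Tm
/TT1 1 Tf
(Heading 1 ) Tj
ET
Q
q
1 0 0 1 56.69292 752.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (Lor) 18 (em ipsum dolor sit amet, ) ] TJ
ET
Q
q
1 0 0 1 56.69292 740.8481 cm
BT
11 0 0 11 0 0 Tm
/TT2 1 Tf
[ (facilisis diam. ) ] TJ
ET
Q
PDF;

$text = extractLines($stream);
echo $text, "\n";
```

The key design point is that q, Q and cm are processed even though they sit outside BT ... ET, so each text block's position is known by the time its Tj or TJ is reached. A real implementation would need a proper tokenizer and support for the remaining text-positioning operators (Td, TD, T*, and so on).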

@GreyWyvern
Contributor

Hi @robt-dice. Are we able to use your Example PDF.pdf in the PdfParser test suite? Is it free to use?

It's actually good for at least a couple tests of mine. :)

@robt-dice
Author

robt-dice commented Aug 14, 2023 via email
