extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF #454

Closed
izabala opened this issue Aug 24, 2021 · 0 comments · Fixed by #455
Labels
bug, parsing fail (When (almost) nothing can be extracted from a given PDF)

Comments

Contributor

izabala commented Aug 24, 2021

I am using FPDI/FPDF (setasign\Fpdi\Fpdi) to split and merge some PDF files. When I try to parse them, all of the methods named in the title fail: they return nothing or almost nothing (empty arrays).
I then created a PDF file (also with FPDI/FPDF) by merging some of this library's sample files and tried again. The methods failed on it as well. Here is the sample file:
FpdiFpdf.pdf

I investigated:
Page->getText() works fine. However, the information that extractRawData and the related methods need in order to build the text matrix is not stored the way it is in normal PDF files. The PDF commands and the text matrix data are saved inside the XObjects (in a Form class), not in the pages.

I already have a fix/workaround that changes getTextArray, extractRawData and extractDecodedRawData. It has to know the Producer of the file (it checks whether the file was produced by FPDF) in order to decide what to do, because I could not find any other way to handle this.
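To illustrate the kind of Producer check described above, here is a minimal sketch. It is not the code from the eventual pull request: `looksLikeFpdfProducer` is a hypothetical helper name, the metadata is assumed to be available as a plain array with a `Producer` key, and the sample producer strings are made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): guess from the document
 * metadata whether a PDF was produced by FPDF.
 *
 * @param array<string, mixed> $details metadata keyed by name,
 *                                      e.g. ['Producer' => 'FPDF 1.84'] (illustrative value)
 */
function looksLikeFpdfProducer(array $details): bool
{
    $producer = $details['Producer'] ?? '';
    if (!is_string($producer)) {
        // Metadata without a usable Producer string: treat as "not FPDF".
        return false;
    }

    // Case-insensitive substring match on the producer name.
    return false !== stripos($producer, 'fpdf');
}
```

With the parser, the array would presumably come from the document details (for example `$pdf->getDetails()`). Matching on a metadata string is fragile, which is why the check is described here as a workaround rather than a proper fix.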

I will wait until pull request #453 (for issue #450) is merged into the master branch before opening a pull request with my fix/workaround for this new problem, because it touches the same methods.

izabala changed the title from "extractRawData, extractDecodedRawData, getDataTm and getDataXY doesnt work with a Pdf file produced by FPDI/FPDF" to "extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF" on Aug 24, 2021
k00ni added the labels bug and parsing fail (When (almost) nothing can be extracted from a given PDF) on Aug 25, 2021
k00ni linked a pull request on Oct 15, 2021 that will close this issue
k00ni added a commit that referenced this issue Oct 18, 2021
* workaround for the Issue #450

The sample file causes two of the Page methods to fail.

Page->extractDecodedRawData was not returning the correct string. This was corrected.

Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If this is detected and PDFObject->getTextArray() is called instead, it returns the correct data. This is a workaround: exactly how the format of this PDF differs, and why it fails, needs a deeper investigation. I ran all the PageTests and they pass.

This happens because the sample PDF file is not formatted the way we usually see in other files. I actually have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what is actually happening before I open an issue for it. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.

* PageTest: attempt to fix cs issues

* Page.php: fixed cs issues

* ParserTest: fixed failing test testRetainImageContentImpact 

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run.
Adjusted values to fix it.

Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

* refined memory threshold in ParserTest::testRetainImageContentImpact

* Update Page.php

* Taking out line

Taking out the line:
$decodedText = '';
This was not needed. Thanks @j0k3r

* Changing the catch of the Error

Now catching Throwable instead.

* Fix/workaround for Issue #454

When the PDF file is produced by setasign/fpdi/fpdi or FPDF, this fixes the methods returning nothing.
To do that, it needs to know that the producer is FPDF and the page number, which are used in conjunction with getXObjects.

* Update Page.php

Some of the changes requested on GitHub by k00ni

* Update Page.php

Other changes requested by k00ni

* Some other recommendations

Some other @k00ni recommendations

* After manually doing php-cs-fixer

I manually ran dev-tools\vendor\bin\php-cs-fixer fix

* Correcting the phpstan error

* Update Page.php

Just a code enhancement.

* Removing vscode\lauch.json and some corrections

Some corrections mentioned by @k00ni.

* creating some function to get this clearer

Following @k00ni's recommendation to use extra functions to make the code clearer.

* After applying some @k00ni recommendations

Many changes following @k00ni recommendations.

* Updating the comment for the isFpdf function

Better explanation for the function

* Changes for correcting phpstan errors

* some changes

Changes in comments, function names and variable names.

* Reformatted some code parts

Co-authored-by: Konrad Abicht <hi@inspirito.de>
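
The first entry of the commit message above describes retrying PDFObject->getTextArray() without the page argument when the call with it throws, and a later entry switches the catch to Throwable. A rough sketch of that pattern follows. It is an illustration only, not the code that was merged, and `callWithFallback` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): run $primary and,
 * if it throws anything, run $fallback instead.
 *
 * @return mixed whatever the callable that completed returns
 */
function callWithFallback(callable $primary, callable $fallback)
{
    try {
        return $primary();
    } catch (\Throwable $e) {
        // \Throwable covers both \Exception and \Error (PHP 7+),
        // so an Error raised inside $primary is caught as well.
        return $fallback();
    }
}
```

In the situation from the commit message, `$primary` would wrap the call that passes the page and `$fallback` the call without it. Catching Throwable this broadly also hides unrelated failures, which fits the commit's own description of the change as a workaround.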