extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF #454

izabala · 2021-08-24T19:05:50Z

I am using FPDI/FPDF (setasign\Fpdi\Fpdi) to split and merge some PDF files. When I try to parse them, all the methods mentioned in the title fail: they return nothing or almost nothing (empty arrays).
After that, I created a PDF file (also with FPDI/FPDF) by merging some of the sample files from this library, and tried again. The methods failed again. Here is the sample file:
FpdiFpdf.pdf

My investigation so far:
Page->getText() works fine. But the information that extractRawData and the other methods need in order to build the text matrix is not stored as it is in normal PDF files. The PDF commands and the text matrix data are stored inside the XObjects, in objects of the Form class, not in the pages.
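To illustrate the layout described above, here is a minimal toy model in Python. It is not pdfparser code: content streams are represented as plain lists of (operator, operands) tuples, and the names `collect_text_ops`, `TEXT_OPS` and `/TPL0` are hypothetical. Only the PDF operators themselves (`Do`, `Tm`, `Tj`, `BT`, `ET`, `q`, `Q`) are real. The sketch assumes the page content only invokes a Form XObject with `Do`, so a scan that stops at the page level finds no text operators, while a scan that follows `Do` does.

```python
# Toy model only: a content stream is a list of (operator, operands) tuples.
# All function and variable names here are hypothetical, not pdfparser's API.

TEXT_OPS = {"Tm", "Td", "Tj", "TJ"}


def collect_text_ops(stream, xobjects, active=frozenset()):
    """Return text operations in order, following `Do` into Form XObjects."""
    found = []
    for op, operands in stream:
        if op in TEXT_OPS:
            found.append((op, operands))
        elif op == "Do":
            name = operands[0]
            # Skip unknown names and forms already being expanded (cycle guard).
            if name in xobjects and name not in active:
                found.extend(
                    collect_text_ops(xobjects[name], xobjects, active | {name})
                )
    return found


# A page shaped like the output described above: it only invokes a form.
xobjects = {
    "/TPL0": [
        ("BT", []),
        ("Tm", [1, 0, 0, 1, 72, 720]),
        ("Tj", ["Hello"]),
        ("ET", []),
    ],
}
page_stream = [("q", []), ("Do", ["/TPL0"]), ("Q", [])]

# Scanning only the page-level stream finds no text operators at all.
page_only = [item for item in page_stream if item[0] in TEXT_OPS]

# Following `Do` into the form recovers the text matrix and the text.
with_forms = collect_text_ops(page_stream, xobjects)
```

In this model `page_only` is empty, which mirrors the empty arrays reported above, while `with_forms` contains the `Tm` and `Tj` operations taken from the form.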

I already have a fix/workaround for this. It involves changes in getTextArray, extractRawData and extractDecodedRawData, but it needs to know the Producer of the file (it checks whether the file was produced by FPDF) in order to decide what to do, because I couldn't find any other way to work this out.
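The producer-based decision described above could look roughly like the following Python sketch. It is an illustration under assumptions, not the code from the fix: the thread names an `isFpdf` function but does not show it, so the case-insensitive substring match, the `details` dictionary and the helper `pick_text_sources` are all hypothetical.

```python
# Hypothetical sketch of a producer-based switch; not the library's code.


def is_fpdf(details):
    """Assumed check: does the Producer metadata entry mention FPDF?"""
    producer = details.get("Producer")
    return isinstance(producer, str) and "fpdf" in producer.lower()


def pick_text_sources(details, page_stream, page_xobjects):
    """Return the streams that should be scanned for text operators."""
    if is_fpdf(details) and page_xobjects:
        # Assumption: for FPDF/FPDI output the text lives in the forms.
        return list(page_xobjects.values())
    return [page_stream]


fpdf_sources = pick_text_sources(
    {"Producer": "FPDF 1.82"}, ["page ops"], {"/TPL0": ["form ops"]}
)
other_sources = pick_text_sources(
    {"Producer": "Some other tool"}, ["page ops"], {"/TPL0": ["form ops"]}
)
```

Because Producer is only a metadata entry, it can be missing or rewritten by later tools, so a check like this is a heuristic. That matches the post's own description of the change as a workaround.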

I will wait until pull request #453 (for issue #450) is merged into the master branch before opening a pull request with my fix/workaround for this new problem, because it touches the same methods.

@j0k3r

* workaround for the Issue #450
  The file makes 2 of the Page methods fail. Page->extractDecodedRawData was not returning the correct string; this was corrected. Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If you detect this and call PDFObject->getTextArray() instead, it returns the correct data. This is a workaround, because exactly how the format of this PDF differs, and why it fails, needs a deeper investigation. I ran all the PageTests and they pass. This happens because the sample PDF file is not formatted as we usually see in other files. Actually, I have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what actually happens before I open an issue for that. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.
* PageTest: attempt to fix cs issues
* Page.php: fixed cs issues
* ParserTest: fixed failing test testRetainImageContentImpact
  This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it. Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22
* refined memory threshold in ParserTest::testRetainImageContentImpact
* Update Page.php
* Taking out line
  Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r
* Changing the catch of the Error
  To catching Throwable.
* Fix/workaround for Issue #454
  When the PDF file is produced by setasign/fpdi/fpdi or FPDF, this corrects the methods returning nothing. To do that, the fix needs to know that the producer is FPDF and the page number, which are used in conjunction with getXObjects.
* Update Page.php
  Some of the changes asked for on GitHub by @k00ni
* Update Page.php
  Other changes asked for by @k00ni
* Some other recommendations
  Some other @k00ni recommendations
* After manually doing php-cs-fixer
  I manually ran dev-tools\vendor\bin\php-cs-fixer fix
* Correcting the phpstan error
* Update Page.php
  Just a code enhancement
* Removing vscode\lauch.json and some corrections
  Some corrections mentioned by @k00ni.
* creating some function to get this clearer
  Followed the recommendation of @k00ni to use extra functions to make the code clearer.
* After applying some @k00ni recommendations
  Many changes following @k00ni recommendations.
* Updating the comment for the isFpdf function
  Better explanation for the function
* Changes for correcting phpstan errors
* some changes
  Changes in comments, function names and variable names.
* Reformatted some code parts

Co-authored-by: Konrad Abicht <hi@inspirito.de>

izabala changed the title ~~extractRawData, extractDecodedRawData, getDataTm and getDataXY doesnt work with a Pdf file produced by FPDI/FPDF~~ extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF Aug 24, 2021

k00ni added the "bug" and "parsing fail" (When (almost) nothing can be extracted from a given PDF) labels Aug 25, 2021

k00ni linked a pull request Oct 15, 2021 that will close this issue

Fix for Issue #454 #455

Merged

k00ni closed this as completed in #455 Oct 18, 2021

izabala mentioned this issue Oct 21, 2021

Issue loading pdf generated from FPDI #472

Open
