extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF #454
k00ni added a commit that referenced this issue on Oct 18, 2021:
* workaround for the Issue #450
  The file makes two of the Page methods fail. Page->extractDecodedRawData was not returning the correct string; this was corrected. Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If that case is detected and PDFObject->getTextArray() is called instead, it returns the correct data. This is a workaround, because finding out exactly how the format of this PDF differs and why it fails needs a deeper investigation. I ran all the PageTests and they pass. This happens because the sample PDF file is not formatted the way we usually see in other files. Actually, I have a similar (not exactly the same) case for a file created with FPDI that also broke the getTextArray and getDataTm methods, but I am researching what actually happens before I open an issue for it. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.
* PageTest: attempt to fix cs issues
* Page.php: fixed cs issues
* ParserTest: fixed failing test testRetainImageContentImpact
  This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it. Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22
* refined memory threshold in ParserTest::testRetainImageContentImpact
* Update Page.php
* Taking out line
  Removed the line `$decodedText = '';`, which was not needed. Thanks @j0k3r
* Changing the catch of the Error
  It now catches Throwable.
* Fix/workaround for Issue #454
  When the PDF file is produced by setasign/fpdi/fpdi or FPDF, this corrects the methods returning nothing. To do that, the code needs to know that the producer is FPDF and the page number, which are used in conjunction with getXObjects.
* Update Page.php
  Some of the changes requested on GitHub by @k00ni
* Update Page.php
  Other changes requested by @k00ni
* Some other recommendations
  Some other @k00ni recommendations
* After manually doing php-cs-fixer
  I manually ran dev-tools\vendor\bin\php-cs-fixer fix
* Correcting the phpstan error
* Update Page.php
  Just a code enhancement
* Removing vscode\lauch.json and some corrections
  Some corrections mentioned by @k00ni.
* creating some function to get this clearer
  Followed the recommendation of @k00ni to use extra functions to make the code clearer.
* After applying some @k00ni recommendations
  Many changes following @k00ni recommendations.
* Updating the comment for the isFpdf function
  Better explanation for the function
* Changes for correcting phpstan errors
* some changes
  Changes in comments, function names and variable names.
* Reformatted some code parts

Co-authored-by: Konrad Abicht <hi@inspirito.de>
I am using FPDI/FPDF (setasign\Fpdi\Fpdi) to split and merge some PDF files. When I try to parse them, all the mentioned methods fail: they return nothing or almost nothing (empty arrays).
After that, I created a PDF file (also with FPDI/FPDF) by merging some of this library's sample files and tried again. The methods failed again. Here is the sample file:
FpdiFpdf.pdf
I investigated: Page->getText() works fine. But the information needed to build the text matrix in extractRawData and the other methods is not kept as it is in normal PDF files. The PDF commands and text matrix data are stored inside the XObjects, in a Form class, not in the pages.
I already made a fix/workaround for that, which involves changes in getTextArray, extractRawData, and extractDecodedRawData. However, it needs to know the producer of the file (it checks whether the file was produced by FPDF) in order to know what to do, because I couldn't find any other way to work this out.
I will wait until pull request #453 (for issue #450) is finally merged into the master branch, because it touches the same methods, before opening a pull request with my fix/workaround for this new problem.
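The producer check described above could look roughly like the following sketch. It is not the code from the pull request: `looksLikeFpdf` is a hypothetical helper name, and it assumes that FPDF-produced files carry a `Producer` entry starting with `FPDF` in the details array returned by `Document::getDetails()`. The optional part at the end only runs if the Composer autoloader, the library and the sample file are all present next to the script.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): decides from the
// document details whether the file appears to have been produced by FPDF.
// Assumption: FPDF writes a Producer string that starts with "FPDF".
function looksLikeFpdf(array $details): bool
{
    $producer = $details['Producer'] ?? null;

    // A missing or non-string Producer is treated as "not FPDF".
    return is_string($producer) && 0 === strncmp($producer, 'FPDF', 4);
}

// Optional reproduction; skipped unless the autoloader, the library and
// the sample file from this issue are available.
$autoload = __DIR__.'/vendor/autoload.php';
$sample = __DIR__.'/FpdiFpdf.pdf';
if (is_file($autoload) && is_file($sample)) {
    require_once $autoload;
    if (class_exists(\Smalot\PdfParser\Parser::class)) {
        $pdf = (new \Smalot\PdfParser\Parser())->parseFile($sample);
        $isFpdf = looksLikeFpdf($pdf->getDetails());
        foreach ($pdf->getPages() as $index => $page) {
            // Compare what the page itself yields with how many
            // XObjects it references.
            printf(
                "page %d: fpdf=%s, getDataTm items=%d, xObjects=%d\n",
                $index,
                $isFpdf ? 'yes' : 'no',
                count($page->getDataTm()),
                count($page->getXObjects())
            );
        }
    }
}
```

On a file affected by this issue, one would expect the `getDataTm` count to be zero or very small while the page still references XObjects, which matches the observation that the text data lives in the Form XObjects rather than in the page contents.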