Upgrade to PDFBox 2.0.0 #52
For your reference, Apache released PDFBox 2.0 on March 21, 2016. |
We recently upgraded our project to pdfbox 2.0, and most of the tabula code doesn't work now. A lot of functions have been moved or deleted in pdfbox, so it's getting very hard to make the changes ourselves. I understand you are working very hard on it, but roughly when can we expect the migration in tabula? Thanks and cheers! |
Hi @kapil-mangtani, The migration to pdfbox 2.0 hasn't even started. pdfbox 2 has a completely different API, so it's going to be quite a bit of work. As much as I'd like to work on it, there are other priorities, and Tabula is a labor of love. Unfortunately, I can't give you a timeline. If you'd like to contribute a patch, however, we'll be happy to work with you in integrating it to the |
Thanks for the swift reply. I am trying to rewrite the ObjectExtractor class along with its PageIterator, would love to contribute if this thing works out correctly. |
Some initial explorations: Our The solution would have to be a class that inherits from Additionally, the new |
We have run into a similar problem as Kapil. Regards |
Hi @subhashbylaiah, No, we haven't made much progress. However, if you are interested in sponsoring the development of this, or contributing a patch, let us know. |
I've started to do some real work on this issue (be0b41a). Things are looking good. In addition, the pdfbox 2 version is faster than 1.8. With PDFBox 1.8:
With PDFBox 2.0.3
|
Leaving a comment here for future reference: when this issue is ready to be resolved, let's make sure that we don't regress the accuracy of the table detector. We have the output of the Travis builds as a baseline: https://travis-ci.org/tabulapdf/tabula-java/jobs/177768121 |
We have run sample PDFs with the master branch and with the pdfbox2-0 working branch. Tables that were identified correctly using master are not detected using this branch (along with other errors in cell information). Looking at the Travis build results, we see that many of the test cases are still failing. Is there a timeline for merging this branch into master? We would like to contribute to make a quicker release. Would fixing the failing test cases be the best way to proceed? |
Just for the record: We expect to merge @melisabok's fantastic work in the coming weeks. |
We have a pull request: #146 — Will review and integrate with master in the coming days. Those of you (@gudipatiharitha, @subhashbylaiah, @beng06, @kapil-mangtani, @chezou) interested in helping out testing this, please build from melisabok/tabula-java@pdfbox2.0 |
* Starting with upgrade to PDFBox 2.0 (#52)
* 2.0
* little progress in upgrading to pdfbox 2
* upgrade to pdfbox 2 starting to show signs of life
* Fix TextElement creation
* fix tabs
* Use the code from LegacyPDFStreamEngine to create the TextElements
* Fix removeText function using the example: org.apache.pdfbox.examples.util.RemoveAllText
* close the document
* close removed text document
* fix array serialization
* add spanning cells test with CSV format
* - Remove capheight calculation - Temporally set height
* Test writer two tables checking the json result object instead of the string Add a test writer two tables for CSV output
* Fix pageTransform when there is a rotation Add more csv tests
* fix path iterator
* update json tests
* update json outputs
* upgrade pdfbox version
* back to the old implementation and catch the IndexOutOfBoundsException
* Remove hardcoded code
* Remove more hardcoded code
* test all the elements of the detected table
* Change the expected table top value
* Increase the threshold factor to support a greater headings
* Fix rectangle comparator.
* fix wrong expected column size, 5 instead of 6. add more tests
* update expected table, more spaces are expected to respect the alingment.
* when the text value has length > 1, clean the spaces.
* clean code
* remove stackstrace
* add log error
* upgrade all dependencies
* code formatting
* setting pom to snapshot version
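For readers wondering what the "Fix pageTransform when there is a rotation" commit is about: PDF user space has its origin at the bottom-left with y pointing up, while table extraction is easier to reason about with a top-left origin, and a page /Rotate entry changes the mapping. The following is only a rough sketch of that idea using java.awt.geom.AffineTransform from the JDK. The class PageTransformSketch and the method toTopLeft are hypothetical names made up for this illustration, not the code from the branch. It assumes the crop box starts at (0, 0) and only handles rotations of 0 and 90 degrees.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper for illustration only; not the code from the pdfbox2.0 branch. */
final class PageTransformSketch {

    /**
     * Builds a transform from PDF user space (origin bottom-left, y up)
     * to display space (origin top-left, y down).
     * Assumption: the crop box starts at (0, 0). Only /Rotate 0 and 90 are handled.
     */
    static AffineTransform toTopLeft(double pageHeight, int rotationDegrees) {
        AffineTransform t = new AffineTransform();
        if (rotationDegrees == 0) {
            // Operations apply to a point in reverse order of the calls:
            // first flip y, then shift down by the page height.
            t.translate(0, pageHeight);
            t.scale(1, -1);
        } else if (rotationDegrees == 90) {
            // The page is displayed turned clockwise, so the original
            // bottom-left corner becomes the top-left one: (x, y) -> (y, x).
            // pageHeight is not needed here because of the (0, 0) assumption.
            t.quadrantRotate(1);
            t.scale(1, -1);
        } else {
            throw new IllegalArgumentException("unsupported rotation: " + rotationDegrees);
        }
        return t;
    }

    public static void main(String[] args) {
        Point2D p = new Point2D.Double(10, 20);
        // Unrotated US Letter page (height 792 pt), then the same point with /Rotate 90.
        System.out.println(toTopLeft(792, 0).transform(p, null));
        System.out.println(toTopLeft(792, 90).transform(p, null));
    }
}
```

With these assumptions, the point (10, 20) maps to (10, 772) on an unrotated page and to (20, 10) on a page rotated by 90 degrees. A real implementation would also need to account for the crop box offset and the remaining rotation values.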
A stable release of PDFBox 2.0 is around the corner (they're at rc2 now), so it makes sense to start thinking about upgrading.
Our ObjectExtractor class extends PDFBox 1.8's PageDrawer, which changed substantially in 2.0. Also, PDF rendering improved substantially in PDFBox 2.0, so we might be able to drop JPedal in Tabula and use PDFBox for rendering.