Upgrade to PDFBox 2.0.0 #52

jazzido · 2015-12-03T21:08:51Z

A stable release of PDFBox 2.0 is around the corner (they're at rc2 now), so it makes sense to start thinking about upgrading.

Our ObjectExtractor class extends PDFBox 1.8 PageDrawer, which changed substantially in 2.0.

Also, PDF rendering improved substantially in PDFBox 2.0, so we might be able to drop JPedal in Tabula and use PDFBox for rendering.


beng06 · 2016-03-23T05:07:31Z

For your reference, Apache released PDFBox 2.0 on March 21, 2016.

http://sdtimes.com/apache-pdfbox-2-0-is-released/

kapil-mangtani · 2016-03-31T10:16:32Z

We recently upgraded our project to PDFBox 2.0, and most of the Tabula code no longer works. Many functions have been moved or deleted in PDFBox, so it's getting very hard to make the changes ourselves. I understand you are working very hard on it, but around when can we expect the migration in Tabula? Thanks and cheers!

jazzido · 2016-03-31T13:36:30Z

Hi @kapil-mangtani,

The migration to pdfbox 2.0 hasn't even started. pdfbox 2 has a completely different API, so it's going to be quite a bit of work. As much as I'd like to work on it, there are other priorities and Tabula is a labor of love.

Unfortunately, I can't give you a timeline. If you'd like to contribute a patch, however, we'll be happy to work with you on integrating it into the master branch.

kapil-mangtani · 2016-04-01T07:41:54Z

Thanks for the swift reply.

I am trying to rewrite the ObjectExtractor class along with its PageIterator. I would love to contribute if this works out correctly.

jazzido · 2016-04-28T22:09:19Z

Some initial explorations:

Our ObjectExtractor is a subclass of PageDrawer (PDFBox 1.8). That class is now meant for rendering a PDF onto a Graphics2D (the docs state that "If you want to do custom graphics processing rather than Graphics2D rendering, then you should subclass PDFGraphicsStreamEngine instead"). ObjectExtractor mines both graphics and text elements, so we need hooks for both. Unfortunately, there is no single class in PDFBox 2.0 that can interpret both.

The solution would have to be a class that inherits from PDFStreamEngine and combines the functionality of PDFGraphicsStreamEngine, PageDrawer and PDFTextStripper.

Additionally, the new StreamEngines no longer operate on a PDDocument, but on a PDPage. We'll need to modify PageIterator and ObjectExtractor accordingly.

subhashbylaiah · 2016-06-30T04:53:48Z

Hi @kapil-mangtani, @jazzido

We have run into a similar problem to Kapil's.
Have you been able to make any progress on the migration to 2.0?

Regards
Subhash

jazzido · 2016-07-01T20:34:57Z

Hi @subhashbylaiah,

No, we haven't made much progress. However, if you are interested in sponsoring the development of this, or contributing a patch, let us know.

jazzido · 2016-10-03T04:22:41Z

I've started to do some real work on this issue (be0b41a)

Things are looking good. In addition, the pdfbox 2 version is faster than 1.8.
Unscientific benchmarks ahead:

With PDFBox 1.8:

for i in `seq 1 10`; do time java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1 src/test/resources/technology/tabula/argentina_diputados_voting_record.pdf > /dev/null; done
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.71s user 0.27s system 177% cpu 2.249 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.64s user 0.25s system 176% cpu 2.197 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.58s user 0.23s system 173% cpu 2.199 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.47s user 0.23s system 171% cpu 2.151 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.44s user 0.23s system 172% cpu 2.132 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.43s user 0.22s system 169% cpu 2.160 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.51s user 0.24s system 173% cpu 2.162 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.55s user 0.24s system 173% cpu 2.184 total
java -jar ~/Downloads/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/nul  3.66s user 0.24s system 176% cpu 2.217 total

With PDFBox 2.0.3

for i in `seq 1 10`; do time java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1 src/test/resources/technology/tabula/argentina_diputados_voting_record.pdf > /dev/null; done
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.91s user 0.23s system 208% cpu 1.501 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.96s user 0.23s system 206% cpu 1.539 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.91s user 0.23s system 201% cpu 1.561 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.19s user 0.23s system 212% cpu 1.613 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.01s user 0.23s system 209% cpu 1.548 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.01s user 0.23s system 208% cpu 1.552 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.18s user 0.23s system 176% cpu 1.926 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.07s user 0.23s system 214% cpu 1.543 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  2.92s user 0.23s system 203% cpu 1.544 total
java -jar target/tabula-0.9.1-jar-with-dependencies.jar -p 1  > /dev/null  3.10s user 0.24s system 171% cpu 1.945 total
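To put rough numbers on the runs above, here is a minimal sketch that summarizes the wall-clock "total" column of both listings. It is not part of Tabula: the class name `BenchmarkSummary` and its helpers are made up for illustration, and the values are copied by hand from the output above (only nine lines appear for 1.8, so only nine are used).

```java
import java.util.Arrays;

// Illustrative only: summarizes the "total" (wall-clock) seconds listed above.
final class BenchmarkSummary {
    // Nine runs were recorded for PDFBox 1.8, ten for PDFBox 2.0.3.
    static final double[] PDFBOX_1_8 =
        {2.249, 2.197, 2.199, 2.151, 2.132, 2.160, 2.162, 2.184, 2.217};
    static final double[] PDFBOX_2_0_3 =
        {1.501, 1.539, 1.561, 1.613, 1.548, 1.552, 1.926, 1.543, 1.544, 1.945};

    // Arithmetic mean; NaN for an empty array.
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Median of a copy of the input, so the caller's array is left untouched.
    static double median(double[] xs) {
        if (xs.length == 0) {
            return Double.NaN;
        }
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(String[] args) {
        double oldMean = mean(PDFBOX_1_8);
        double newMean = mean(PDFBOX_2_0_3);
        System.out.printf("PDFBox 1.8:   mean %.3fs, median %.3fs%n", oldMean, median(PDFBOX_1_8));
        System.out.printf("PDFBox 2.0.3: mean %.3fs, median %.3fs%n", newMean, median(PDFBOX_2_0_3));
        System.out.printf("Mean wall-clock reduction: %.1f%%%n", 100.0 * (1.0 - newMean / oldMean));
    }
}
```

On these figures the mean drops from about 2.18s to about 1.63s. The median is reported as well because two of the 2.0.3 runs (1.926 and 1.945) are visibly slower than the rest. These are still single-page, single-file timings that include JVM startup, so they only indicate a direction.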

jazzido · 2016-11-22T00:13:56Z

Leaving a comment here for future reference: when this issue is ready to be resolved, let's make sure that we don't regress the accuracy of the table detector.

We have the output of the Travis builds as a baseline: https://travis-ci.org/tabulapdf/tabula-java/jobs/177768121

gudipatiharitha · 2016-12-13T11:05:11Z

We ran sample PDFs with the master branch and with the pdfbox2-0 working branch. Tables that were identified correctly using the master branch are not detected using this branch (along with other errors in cell information). After looking at the Travis build results, we saw that many of the test cases are still failing. Is there a timeline for merging this branch into master? We would like to contribute to make a quicker release. Would fixing the failing test cases be the best way to proceed?

jazzido · 2017-03-07T05:39:46Z

Just for the record: melisabok/tabula-java@pdfbox2.0 now passes all the tests.

We expect to merge @melisabok's fantastic work in the coming weeks.

jazzido · 2017-03-08T19:02:20Z

We have a pull request: #146. We will review it and integrate it with master in the coming days.

Those of you (@gudipatiharitha, @subhashbylaiah, @beng06, @kapil-mangtani, @chezou) interested in helping test this, please build from melisabok/tabula-java@pdfbox2.0

* Starting with upgrade to PDFBox 2.0 (#52)
* 2.0
* little progress in upgrading to pdfbox 2
* upgrade to pdfbox 2 starting to show signs of life
* Fix TextElement creation
* fix tabs
* Use the code from LegacyPDFStreamEngine to create the TextElements
* Fix removeText function using the example: org.apache.pdfbox.examples.util.RemoveAllText
* close the document
* close removed text document
* fix array serialization
* add spanning cells test with CSV format
* Remove capheight calculation. Temporally set height
* Test writer two tables checking the json result object instead of the string. Add a test writer two tables for CSV output
* Fix pageTransform when there is a rotation. Add more csv tests
* fix path iterator
* update json tests
* update json outputs
* upgrade pdfbox version
* back to the old implementation and catch the IndexOutOfBoundsException
* Remove hardcoded code
* Remove more hardcoded code
* test all the elements of the detected table
* Change the expected table top value
* Increase the threshold factor to support a greater headings
* Fix rectangle comparator.
* fix wrong expected column size, 5 instead of 6. add more tests
* update expected table, more spaces are expected to respect the alingment.
* when the text value has length > 1, clean the spaces.
* clean code
* remove stackstrace
* add log error
* upgrade all dependencies
* code formatting
* setting pom to snapshot version


jazzido added a commit that referenced this issue Dec 3, 2015

Starting with upgrade to PDFBox 2.0 (#52)

a3f4cfb

jazzido mentioned this issue Oct 23, 2016

Support for incremental output #113

Open

jeremybmerrill mentioned this issue Oct 27, 2016

Unable to extract Japanese characters #114

Closed

chezou mentioned this issue Nov 27, 2016

Unable to extract Japanese characters chezou/tabula-py#10

Closed

jazzido mentioned this issue Feb 6, 2017

Upgrade pdfbox dependecy to version 2.1.0? #136

Closed

leeper mentioned this issue Mar 9, 2017

Switch to PDFBox 2.0 ropensci/tabulizerjars#3

Closed

leeper mentioned this issue Apr 8, 2017

Handling issues related to upgrade to PDFBox 2.0 ropensci/tabulapdf#48

Closed

jazzido closed this as completed Apr 9, 2017
