Conversion Tests

Matteo Seclì edited this page Mar 13, 2018 · 6 revisions

TL;DR

I ran a series of conversion tests to assess how the success rate of pdf2archive depends on the Ghostscript version used. These are the results for pdf2archive v0.3, based on the conversion of 764 random PDF files:

GS Version PASS FAIL ERROR
9.14 716 (93.72%) 48 (6.28%) 0 (0%)
9.15 716 (93.72%) 48 (6.28%) 0 (0%)
9.16 715 (93.59%) 49 (6.41%) 0 (0%)
9.18 715 (93.59%) 49 (6.41%) 0 (0%)
9.19 715 (93.59%) 49 (6.41%) 0 (0%)
9.20 715 (93.59%) 49 (6.41%) 0 (0%)
9.21 715 (93.59%) 49 (6.41%) 0 (0%)
9.22 715 (93.59%) 49 (6.41%) 0 (0%)

Only Ghostscript versions with a success rate of at least 50% are shown. The results have the following meaning:

  • PASS: The resulting PDF file is a valid PDF/A-1B file.
  • FAIL: The resulting PDF file is not a valid PDF/A-1B file.
  • ERROR: An error occurred in the workflow (Ghostscript met an unrecognized option, failed to produce a PDF file, or crashed; or veraPDF crashed).
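
The three outcomes above amount to a small decision rule. The following is a minimal Python sketch of that rule, not the harness actually used for these tests: the function name `classify` and its three inputs are assumptions made for illustration, standing in for whatever the conversion and validation steps report.

```python
from typing import Optional


def classify(converted: bool, validated: bool, compliant: Optional[bool]) -> str:
    """Map the outcome of one conversion to PASS, FAIL or ERROR.

    All three arguments are hypothetical flags:
    converted: Ghostscript ran and produced a PDF file.
    validated: veraPDF ran to completion without crashing.
    compliant: veraPDF's verdict on PDF/A-1B validity, or None if unknown.
    """
    # Any breakage in the workflow is an ERROR, regardless of the file itself.
    if not converted or not validated or compliant is None:
        return "ERROR"
    # The workflow completed, so the verdict decides between PASS and FAIL.
    return "PASS" if compliant else "FAIL"
```

Keeping ERROR separate from FAIL matters for the analysis: an ERROR says nothing about whether the output file is a valid PDF/A-1B file.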

Methods

All the tests were done by using pdf2archive (versions v0.1-1 and v0.3) on different GNU/Linux x86_64 machines. The Ghostscript GNU/Linux binaries from gs v9.04 to gs v9.22 were downloaded from the official site; I compiled the ones from gs v8.63 to gs v9.02 myself. The files used for the tests were randomly selected PDF files from the arXiv. The validation was done with veraPDF.

Details of the tests

Test #04

Settings

Setting Value
pdf2archive Version 0.3
VeraPDF Version 1.8.4
Linux Kernel 2.6.32 (x86_64)
Java Version Java(TM) SE "9.0.1"
Test Files 695 PDF files in arXiv_pdf_1210_002

Results

GS Version PASS FAIL ERROR
9.14 648 47 0
9.15 648 47 0
9.16 648 47 0
9.18 648 47 0
9.19 648 47 0
9.20 648 47 0
9.21 648 47 0
9.22 648 47 0

Test #03

Settings

Setting Value
pdf2archive Version 0.3
VeraPDF Version 1.8.4
Linux Kernel 4.10.0 (x86_64)
Java Version OpenJDK "9-internal"
Test Files 69 PDF files in arXiv_src_1309_007

Results

GS Version PASS FAIL ERROR
9.14 68 1 0
9.15 68 1 0
9.16 67 2 0
9.18 67 2 0
9.19 67 2 0
9.20 67 2 0
9.21 67 2 0
9.22 67 2 0

Test #02 [OLD]

Settings

Setting Value
pdf2archive Version 0.1-1
VeraPDF Version 1.8.4
Linux Kernel 2.6.32 (x86_64)
Java Version OpenJDK "1.8.0_151"
Test Files 695 PDF files in arXiv_pdf_1210_002

Results

GS Version PASS FAIL ERROR
8.63 0 0 695
8.64 0 0 695
8.70 0 0 695
8.71 0 0 695
9.00 0 0 695
9.01 0 0 695
9.02 0 0 695
9.04 0 695 0
9.05 0 695 0
9.06 0 694 1
9.07 0 694 1
9.09 0 695 0
9.10 0 692 3
9.14 647 47 1
9.15 647 47 1
9.16 647 47 1
9.18 647 47 1
9.19 647 47 1
9.20 647 47 1
9.21 635 59 1
9.22 635 59 1

Test #01 [OLD]

Settings

Setting Value
pdf2archive Version 0.1-1
VeraPDF Version 1.8.4
Linux Kernel 4.10.0 (x86_64)
Java Version Java(TM) SE "1.8.0_151"
Test Files 69 PDF files in arXiv_src_1309_007

Results

GS Version PASS FAIL ERROR
8.63 0 0 69
8.64 0 0 69
8.70 0 69 0
8.71 0 69 0
9.00 0 69 0
9.01 0 69 0
9.02 0 69 0
9.04 0 0 69
9.05 0 69 0
9.06 0 69 0
9.07 0 69 0
9.09 0 69 0
9.10 0 69 0
9.14 68 1 0
9.15 68 1 0
9.16 68 1 0
9.18 68 1 0
9.19 68 1 0
9.20 68 1 0
9.21 52 17 0
9.22 67 2 0

Analysis [pdf2archive v0.3]

Metadata handling is more robust in pdf2archive v0.3 than in pdf2archive v0.1-1. This, in turn, has solved most of the issues with Ghostscript 9.21 and later. However, if you compare test #01 and test #03 (which use the same set of PDFs) for Ghostscript 9.16 to 9.20, you'll notice that test #03 has one more fail. The reason is documented in https://github.com/matteosecli/pdf2archive/issues/5 and has to do, again, with metadata handling.
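
The comparison between test #01 and test #03 can be made explicit with a few lines of Python. This is only an illustrative sketch: the FAIL counts are copied from the two result tables, while the dictionary names and the helper `fail_delta` are made up for this example.

```python
# FAIL counts per Ghostscript version, copied from the result tables.
fails_test01 = {"9.14": 1, "9.15": 1, "9.16": 1, "9.18": 1,
                "9.19": 1, "9.20": 1, "9.21": 17, "9.22": 2}  # pdf2archive v0.1-1
fails_test03 = {"9.14": 1, "9.15": 1, "9.16": 2, "9.18": 2,
                "9.19": 2, "9.20": 2, "9.21": 2, "9.22": 2}   # pdf2archive v0.3


def fail_delta(old, new):
    """Return new FAIL count minus old FAIL count for each shared version."""
    return {version: new[version] - old[version] for version in old if version in new}


delta = fail_delta(fails_test01, fails_test03)
for version, change in delta.items():
    # A positive change means v0.3 fails more files than v0.1-1 did.
    print(f"gs {version}: {change:+d}")
```

A positive value marks the extra fail for Ghostscript 9.16 to 9.20; a negative value for 9.21 reflects the fails that v0.3 removed.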

The font problems (which cause most of the fails) have not been solved yet; they range from zero-width glyphs to inconsistent glyph widths within a document. I think these are problems that Ghostscript cannot solve, and that they are mostly due to the software that produced the document. For LaTeX, one of the goals is to write a section documenting good practices for producing documents with pdfLaTeX (see https://github.com/matteosecli/pdf2archive/issues/3); this should, in principle, improve the situation. For all other cases (such as documents the user has no control over), the idea is to use some other software that corrects the mismatches. I have not found a viable solution yet; PDFBox might do the job, but I still need to read up on it.

Analysis [pdf2archive v0.1-1]

Ghostscript versions 9.14-9.20 are the best-performing ones; it seems that 9.21 introduced a regression that has to be investigated. Preliminary tests suggest that the regression has something to do with metadata.

It's also strange that, with recent versions of Ghostscript, the conversion of one document ended in an "ERROR". It turns out that the resulting PDF file was indeed a valid PDF/A-1B file; the error was due to veraPDF running out of memory during validation. The problem appears to be related to OpenJDK: after switching to Oracle Java, it disappeared. For this reason, and since it is not Ghostscript's fault, I counted that file as a "PASS" instead of an "ERROR" when summing up the final results for pdf2archive v0.1-1 (see Old Test Results).

Overall, the success rate on the tested files is generally above 90%. The converted files that gave a "FAIL" seem to have font problems; I'll have to look into that and report it to the Ghostscript developers.

Old Test Results

pdf2archive v0.1-1

GS Version PASS FAIL ERROR
9.14 716 (93.72%) 48 (6.28%) 0 (0%)
9.15 716 (93.72%) 48 (6.28%) 0 (0%)
9.16 716 (93.72%) 48 (6.28%) 0 (0%)
9.18 716 (93.72%) 48 (6.28%) 0 (0%)
9.19 716 (93.72%) 48 (6.28%) 0 (0%)
9.20 716 (93.72%) 48 (6.28%) 0 (0%)
9.21 688 (90.05%) 76 (9.95%) 0 (0%)
9.22 703 (92.02%) 61 (7.98%) 0 (0%)
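
The totals in this table can be reproduced from tests #01 and #02, as long as the veraPDF out-of-memory ERROR of test #02 is counted as a PASS, as explained in the analysis. The Python sketch below shows the arithmetic for three versions. The names `TEST01`, `TEST02`, `OOM_ERRORS` and `combined_row` are illustrative, and the sketch assumes that the single ERROR in each of these test #02 rows is that out-of-memory case.

```python
# (PASS, FAIL, ERROR) per Ghostscript version, copied from the result tables.
TEST01 = {"9.20": (68, 1, 0), "9.21": (52, 17, 0), "9.22": (67, 2, 0)}
TEST02 = {"9.20": (647, 47, 1), "9.21": (635, 59, 1), "9.22": (635, 59, 1)}

# Assumption: the one ERROR per row in test #02 is the veraPDF
# out-of-memory case, which is counted as a PASS in the summary.
OOM_ERRORS = 1


def combined_row(version):
    """Sum both tests for one version and format counts with percentages."""
    p1, f1, e1 = TEST01[version]
    p2, f2, e2 = TEST02[version]
    counts = (p1 + p2 + OOM_ERRORS, f1 + f2, e1 + e2 - OOM_ERRORS)
    total = sum(counts)  # 69 + 695 = 764 files
    return [f"{n} ({100 * n / total:.2f}%)" for n in counts]


for version in TEST01:
    print(version, *combined_row(version))
```

The same summation, without any reclassification, gives the pdf2archive v0.3 summary from tests #03 and #04, since neither of those tests reported an ERROR.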