Conversion Tests
I ran some conversion tests to assess how the success rate of pdf2archive depends on the Ghostscript version used. These are the results for pdf2archive v0.3, based on the conversion of 764 random PDF files:
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
9.14 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.15 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.16 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
9.18 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
9.19 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
9.20 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
9.21 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
9.22 | 715 (93.59%) | 49 (6.41%) | 0 (0%) |
Only the Ghostscript versions that achieved a success rate of at least 50% are shown. The results have the following meaning:
- PASS: The resulting PDF file is a valid PDF/A-1B file.
- FAIL: The resulting PDF file is not a valid PDF/A-1B file.
- ERROR: An error occurred in the workflow (for example, Ghostscript met an unrecognized option, failed to produce a PDF file, or crashed, or veraPDF crashed).
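The three outcomes above can be read as a small decision rule. What follows is a minimal Python sketch of that rule, not the script actually used for these tests; the `RunResult` record, its fields, and the `classify` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical record of one conversion + validation run."""
    converter_ok: bool          # Ghostscript finished without an error
    output_exists: bool         # a PDF file was actually produced
    validator_ok: bool          # the validator ran to completion (no crash)
    is_pdfa_1b: Optional[bool]  # validator verdict, None if it never ran


def classify(run: RunResult) -> str:
    """Map a run onto the PASS / FAIL / ERROR categories."""
    # Any breakdown in the workflow itself counts as ERROR.
    if not (run.converter_ok and run.output_exists and run.validator_ok):
        return "ERROR"
    # No verdict available, so the file cannot be called PASS or FAIL.
    if run.is_pdfa_1b is None:
        return "ERROR"
    return "PASS" if run.is_pdfa_1b else "FAIL"
```

Note that under this rule a validator crash is an ERROR even when the converted file itself is fine, so ERROR does not necessarily mean that Ghostscript misbehaved.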
All the tests were done using pdf2archive (versions v0.1-1 and v0.3) on different GNU/Linux x86_64 machines. The Ghostscript GNU/Linux binaries from gs v9.04 to gs v9.22 were downloaded from the official site; the ones from gs v8.63 to gs v9.02 I compiled myself. The test files were randomly selected PDF files from arXiv. The validation was done with veraPDF.
Setting | Value |
---|---|
pdf2archive Version | 0.3 |
VeraPDF Version | 1.8.4 |
Linux Kernel | 2.6.32 (x86_64) |
Java Version | Java(TM) SE "9.0.1" |
Test Files | 695 PDF files in arXiv_pdf_1210_002 |
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
9.14 | 648 | 47 | 0 |
9.15 | 648 | 47 | 0 |
9.16 | 648 | 47 | 0 |
9.18 | 648 | 47 | 0 |
9.19 | 648 | 47 | 0 |
9.20 | 648 | 47 | 0 |
9.21 | 648 | 47 | 0 |
9.22 | 648 | 47 | 0 |
Setting | Value |
---|---|
pdf2archive Version | 0.3 |
VeraPDF Version | 1.8.4 |
Linux Kernel | 4.10.0 (x86_64) |
Java Version | OpenJDK "9-internal" |
Test Files | 69 PDF files in arXiv_src_1309_007 |
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
9.14 | 68 | 1 | 0 |
9.15 | 68 | 1 | 0 |
9.16 | 67 | 2 | 0 |
9.18 | 67 | 2 | 0 |
9.19 | 67 | 2 | 0 |
9.20 | 67 | 2 | 0 |
9.21 | 67 | 2 | 0 |
9.22 | 67 | 2 | 0 |
Setting | Value |
---|---|
pdf2archive Version | 0.1-1 |
VeraPDF Version | 1.8.4 |
Linux Kernel | 2.6.32 (x86_64) |
Java Version | OpenJDK "1.8.0_151" |
Test Files | 695 PDF files in arXiv_pdf_1210_002 |
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
8.63 | 0 | 0 | 695 |
8.64 | 0 | 0 | 695 |
8.70 | 0 | 0 | 695 |
8.71 | 0 | 0 | 695 |
9.00 | 0 | 0 | 695 |
9.01 | 0 | 0 | 695 |
9.02 | 0 | 0 | 695 |
9.04 | 0 | 695 | 0 |
9.05 | 0 | 695 | 0 |
9.06 | 0 | 694 | 1 |
9.07 | 0 | 694 | 1 |
9.09 | 0 | 695 | 0 |
9.10 | 0 | 692 | 3 |
9.14 | 647 | 47 | 1 |
9.15 | 647 | 47 | 1 |
9.16 | 647 | 47 | 1 |
9.18 | 647 | 47 | 1 |
9.19 | 647 | 47 | 1 |
9.20 | 647 | 47 | 1 |
9.21 | 635 | 59 | 1 |
9.22 | 635 | 59 | 1 |
Setting | Value |
---|---|
pdf2archive Version | 0.1-1 |
VeraPDF Version | 1.8.4 |
Linux Kernel | 4.10.0 (x86_64) |
Java Version | Java(TM) SE "1.8.0_151" |
Test Files | 69 PDF files in arXiv_src_1309_007 |
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
8.63 | 0 | 0 | 69 |
8.64 | 0 | 0 | 69 |
8.70 | 0 | 69 | 0 |
8.71 | 0 | 69 | 0 |
9.00 | 0 | 69 | 0 |
9.01 | 0 | 69 | 0 |
9.02 | 0 | 69 | 0 |
9.04 | 0 | 0 | 69 |
9.05 | 0 | 69 | 0 |
9.06 | 0 | 69 | 0 |
9.07 | 0 | 69 | 0 |
9.09 | 0 | 69 | 0 |
9.10 | 0 | 69 | 0 |
9.14 | 68 | 1 | 0 |
9.15 | 68 | 1 | 0 |
9.16 | 68 | 1 | 0 |
9.18 | 68 | 1 | 0 |
9.19 | 68 | 1 | 0 |
9.20 | 68 | 1 | 0 |
9.21 | 52 | 17 | 0 |
9.22 | 67 | 2 | 0 |
Metadata handling has been strengthened in pdf2archive v0.3 compared to pdf2archive v0.1-1. This, in turn, has solved most of the issues with Ghostscript 9.21 and later. However, if you compare test #01 and test #03 (which were done on the same set of PDFs) for Ghostscript 9.16 to 9.20, you'll notice that there is one more fail. The reason is documented in https://github.com/matteosecli/pdf2archive/issues/5, and it has to do, again, with metadata handling.
The font problems (which cause most of the fails) have not been solved yet; they range from zero-width glyphs to inconsistent glyph widths in the document. I think these are problems that Ghostscript cannot solve, and they are mostly due to the software that produced the document. For LaTeX, one of the goals is to write a section that documents good practices for producing documents with pdfLaTeX (see https://github.com/matteosecli/pdf2archive/issues/3); this should in principle improve the situation. For all the other cases (such as documents the user has no control over), the idea is to try some other software that corrects the mismatches. I still have to find a viable solution; maybe PDFBox could do the job, but I need to read up on it first.
Ghostscript versions 9.14 to 9.20 are the best-performing ones; it seems that 9.21 introduced a regression that still has to be investigated. Preliminary tests suggest that the regression has something to do with metadata.
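A drop like the one at 9.21 can be spotted mechanically by comparing each version's PASS count with that of the version listed just before it. The snippet below is only an illustrative sketch: the `find_drops` helper is a hypothetical name, and the counts are copied from the 69-file pdf2archive v0.1-1 table above.

```python
# PASS counts per Ghostscript version, copied from the 69-file
# pdf2archive v0.1-1 table (only versions with at least one PASS).
passes = {
    "9.14": 68, "9.15": 68, "9.16": 68, "9.18": 68,
    "9.19": 68, "9.20": 68, "9.21": 52, "9.22": 67,
}


def find_drops(pass_counts):
    """Return (previous, current, delta) for every version whose PASS
    count is lower than that of the version listed just before it."""
    versions = list(pass_counts)  # dicts preserve insertion order
    drops = []
    for prev, cur in zip(versions, versions[1:]):
        delta = pass_counts[cur] - pass_counts[prev]
        if delta < 0:
            drops.append((prev, cur, delta))
    return drops


print(find_drops(passes))  # [('9.20', '9.21', -16)]
```

Only the step from 9.20 to 9.21 is flagged; 9.22 recovers most of the loss on this set and so does not register as a further drop.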
It's also strange that recent versions of Ghostscript completely failed the conversion of one document (giving "ERROR"). It turns out that the resulting PDF file was indeed a valid PDF/A-1B file; the error was due to veraPDF running out of memory during the validation process. The problem appears to be related to OpenJDK; after switching to Oracle Java, it disappeared. For this reason, and since it is not Ghostscript's fault, I counted the resulting file as a "PASS" instead of an "ERROR" when summing up the final results.
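As a worked example of that bookkeeping, the sketch below sums the two per-test tallies for Ghostscript 9.14 with pdf2archive v0.1-1 and moves the single out-of-memory ERROR into PASS. The `merge_counts` helper and its `errors_confirmed_valid` parameter are hypothetical names chosen for illustration, not part of pdf2archive.

```python
def merge_counts(tallies, errors_confirmed_valid=0):
    """Sum (PASS, FAIL, ERROR) tuples, then move a number of ERRORs that
    were verified by hand to be valid PDF/A-1B output into PASS."""
    passed = sum(t[0] for t in tallies)
    failed = sum(t[1] for t in tallies)
    errors = sum(t[2] for t in tallies)
    if errors_confirmed_valid > errors:
        raise ValueError("cannot reclassify more errors than were recorded")
    return (passed + errors_confirmed_valid,
            failed,
            errors - errors_confirmed_valid)


# Ghostscript 9.14, pdf2archive v0.1-1: the 695-file and 69-file runs.
print(merge_counts([(647, 47, 1), (68, 1, 0)], errors_confirmed_valid=1))
# (716, 48, 0)
```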
Overall, the success rate on the tested files is generally above 90%. The converted files that gave a "FAIL" seem to have font problems; I'll have to look into that and report to the Ghostscript developers. The summed results for pdf2archive v0.1-1 over all 764 files are:
GS Version | PASS | FAIL | ERROR |
---|---|---|---|
9.14 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.15 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.16 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.18 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.19 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.20 | 716 (93.72%) | 48 (6.28%) | 0 (0%) |
9.21 | 688 (90.05%) | 76 (9.95%) | 0 (0%) |
9.22 | 703 (92.02%) | 61 (7.98%) | 0 (0%) |
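The percentages in the summary tables are each count divided by the total number of files, rounded to two decimals. A short sketch of that arithmetic, using a hypothetical `rates` helper:

```python
def rates(passed, failed, errors):
    """Return the PASS, FAIL and ERROR shares as percentages (2 decimals)."""
    total = passed + failed + errors
    if total == 0:
        raise ValueError("no files were tested")
    return tuple(round(100 * n / total, 2) for n in (passed, failed, errors))


print(rates(716, 48, 0))  # (93.72, 6.28, 0.0)
print(rates(688, 76, 0))  # (90.05, 9.95, 0.0)
```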