PDF/UA Requirements #1012

Open
DonRichards opened this issue Jan 23, 2019 · 9 comments
Labels
Subject: Access Control related to managing roles and permissions/information security.
Milestone

Comments

@DonRichards
Member

Ingesting a PDF should generate a PDF/UA file (ISO 14289-1, which builds on ISO 32000-1) that the viewer then uses, for accessibility compliance.

AIIM Technical Implementation Guide 32000-1

One suggestion is to explore something like pandoc or drupalwxt.org to achieve this.

@DonRichards
Member Author

Slightly related to 7.x issue https://jira.duraspace.org/browse/ISLANDORA-2350

@mjordan
Contributor

mjordan commented Jan 23, 2019

I'll drop a note here to say that accessibility has a relatively high priority in the Proposed Technical Roadmap.

@mjordan
Contributor

mjordan commented Jan 23, 2019

Should we tag this as a meta issue?

@whikloj
Member

whikloj commented Jan 23, 2019

Do we have a meta issue around accessibility? I'll note that the Manitoba government requires government websites (and the University falls under that) to adhere to the W3C's Web Content Accessibility Guidelines (WCAG) 2.0 Level AA. So I need it.

@DonRichards
Member Author

In the States, we're required to adhere to the Section 508 online accessibility standards (WCAG 2.0 Level A and AA) as well. So this is an absolute requirement.

This is a simple breakdown (aka checklist). It would be a quick reference for anyone trying to create a solution for this.

Although I don't know much about commonlook.com, it is referenced from webaim.org, so it should be a good place to start.

@whikloj whikloj added this to the 1.x milestone Apr 11, 2019
@Natkeeran
Contributor

@DonRichards

Do you have any suggestions for OS tools that can be used to generate PDF/UA?

As per your ticket, the current derivatives https://github.com/Islandora/islandora_solution_pack_pdf/blob/7.x/includes/derivatives.inc#L154 don't meet the accessibility criteria.

Maybe gs can do the job with the right configuration?

@Natkeeran
Contributor

Natkeeran commented May 1, 2019

According to the following docs, gs supports ISO 32000-1 as well as the PDF 2.0 standard, which means we should be able to generate PDF/UA using gs. Samples are attached below.

https://artifex.com/news/ghostscript-9-22-release/
https://www.ghostscript.com/doc/9.22/VectorDevices.htm

@dannylamb
Should we make gs a microservice?

pdfua1

sudo gs -dPDFA=1 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o pdfua1.pdf -dPDFACompatibilityPolicy=1 sample.pdf
GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 10.
Page 1
Loading NimbusRoman-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Regular... 5086732 3679115 2659272 1289313 3 done.
Page 2
Page 3
Loading NimbusRoman-Bold font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Bold... 5274044 3913360 3002672 1607847 3 done.
Loading NimbusMonoPS-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusMonoPS-Regular... 5501756 4146899 3063272 1667348 3 done.
Page 4
Page 5
Loading NimbusMonoPS-Bold font from /usr/share/ghostscript/9.26/Resource/Font/NimbusMonoPS-Bold... 5769868 4408756 3346072 1942782 3 done.
Page 6
Loading NimbusRoman-Italic font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Italic... 6017780 4651223 3649072 2224743 3 done.
Page 7
Page 8
Page 9
Loading NimbusSans-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusSans-Regular... 6225292 4863399 4457072 2997521 3 done.
Page 10

pdfua2

sudo gs -dPDFA=2 -dNOOUTERSAVE -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o pdfua2.pdf -dPDFACompatibilityPolicy=1 sample.pdf
GPL Ghostscript 9.26 (2018-11-20)
Copyright (C) 2018 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 10.
Page 1
Loading NimbusRoman-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Regular... 5086732 3679115 2659272 1289313 3 done.
GPL Ghostscript 9.26: Setting Overprint Mode to 1
 not permitted in PDF/A-2, overprint mode not set

Page 2
Page 3
Loading NimbusRoman-Bold font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Bold... 5274044 3913360 3002672 1607847 3 done.
Loading NimbusMonoPS-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusMonoPS-Regular... 5501756 4146899 3063272 1667348 3 done.
Page 4
Page 5
Loading NimbusMonoPS-Bold font from /usr/share/ghostscript/9.26/Resource/Font/NimbusMonoPS-Bold... 5769868 4408756 3346072 1942782 3 done.
Page 6
Loading NimbusRoman-Italic font from /usr/share/ghostscript/9.26/Resource/Font/NimbusRoman-Italic... 6017780 4651223 3649072 2224743 3 done.
Page 7
Page 8
Page 9
Loading NimbusSans-Regular font from /usr/share/ghostscript/9.26/Resource/Font/NimbusSans

pdfua2.pdf
pdfua1.pdf
sample.pdf
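
The two gs invocations above could be wrapped in a small helper, which is roughly what a microservice would end up calling. This is only a sketch: the flags are copied from the commands in this comment, while the function names `build_gs_pdfa_command` and `convert_with_gs` are hypothetical. Note that `-dPDFA` is Ghostscript's PDF/A switch, so the output is a PDF/A candidate; whether it also satisfies PDF/UA would still need to be checked separately.

```python
import subprocess
from typing import List


def build_gs_pdfa_command(src: str, dest: str, level: int = 1) -> List[str]:
    """Build the gs argument list used in the thread (hypothetical helper).

    Only levels 1 and 2 are accepted because those are the two values
    that were actually tried above.
    """
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    return [
        "gs",
        f"-dPDFA={level}",
        "-dNOOUTERSAVE",
        "-sProcessColorModel=DeviceRGB",
        "-sDEVICE=pdfwrite",
        "-o", dest,
        "-dPDFACompatibilityPolicy=1",
        src,
    ]


def convert_with_gs(src: str, dest: str, level: int = 1) -> None:
    """Run gs and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_gs_pdfa_command(src, dest, level), check=True)
```

Calling `convert_with_gs("sample.pdf", "pdfua1.pdf", level=1)` would run the same command as the first sample, without `sudo`, which should not be needed to write a file in a directory you own.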

@dannylamb
Contributor

dannylamb commented Aug 9, 2019

@Natkeeran Circling back to this one. Yes!!!! If we can do it with gs, then I'd wrap that up as a microservice and call it a day. We could also use gs to extract text from PDFs.
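
For the text-extraction idea, Ghostscript ships a `txtwrite` output device. A minimal sketch, assuming gs is on the PATH; the helper names `build_gs_text_command` and `extract_text_with_gs` are hypothetical, and the quality of the extracted text will depend on the PDF (image-only scans have no text layer to extract).

```python
import subprocess
from typing import List


def build_gs_text_command(src: str) -> List[str]:
    """Build a gs command that writes the extracted text to stdout."""
    return [
        "gs",
        "-q",                  # suppress the startup banner
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=txtwrite",   # plain-text output device
        "-sOutputFile=-",      # "-" sends the output to stdout
        src,
    ]


def extract_text_with_gs(src: str) -> str:
    """Run gs and return whatever text it extracted (hypothetical helper)."""
    result = subprocess.run(
        build_gs_text_command(src),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```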

@kstapelfeldt kstapelfeldt added Subject: Access Control related to managing roles and permissions/information security. and removed Accessibility labels Sep 25, 2021
@alxp
Contributor

alxp commented Nov 25, 2021

Outside of work, I recently needed to read PDF documents that my provincial government released in response to a Freedom of Information request. They are image-only and completely inaccessible, and in working with them I discovered the "ocrmypdf" utility.

It is available via apt-get on Ubuntu and Homebrew on macOS and works very well in my testing.

It wraps a lot of tesseract configuration that would be impractical to re-engineer, so I propose we make a microservice for it.

I have run into a wall, though: there is no documentation for making a microservice.

Before I embark on another copy-and-paste effort, I think we need an example of adding a simple microservice that wraps a straightforward Unix command.

This documentation should give a step-by-step set of files to add, and in what order - one of the harder things to know when embarking on a new application in an unfamiliar framework.

The output of ocrmypdf indicates that it produces PDF/A files:

iMac:Charlottetown STR FOIPP Vol. 2 aoneill$ ocrmypdf "FOIPP0012 Tab 5 - Peter Kelly.pdf" output.pdf
Scanning contents: 100%|███████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 34.92page/s]
Start processing 4 pages concurrently
    5 [tesseract] lots of diacritics - possibly poor OCR                                                                             
    8 [tesseract] lots of diacritics - possibly poor OCR                                                                             
OCR: 100%|█████████████████████████████████████████████████████████████████████████████████████| 20.0/20.0 [00:08<00:00,  2.35page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.39page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2:   0%|                                                                                                | 0/20 [00:00<?, ?item/s]
Optimize ratio: 1.96 savings: 49.0%
Output file is a PDF/A-2B (as expected)
iMac:Charlottetown STR FOIPP Vol. 2 aoneill$ 
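
To make the "simple microservice that wraps a straightforward Unix command" idea concrete, here is a minimal sketch using only the Python standard library. It is not Islandora's microservice framework, and every name in it (`build_ocrmypdf_command`, `convert`, `OcrHandler`, `serve`, the port) is hypothetical. It assumes ocrmypdf is installed and is called exactly as in the transcript above: `ocrmypdf input.pdf output.pdf`.

```python
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import List


def build_ocrmypdf_command(src: Path, dest: Path) -> List[str]:
    """Same invocation as in the transcript: ocrmypdf <input> <output>."""
    return ["ocrmypdf", str(src), str(dest)]


def convert(pdf_bytes: bytes, run=subprocess.run) -> bytes:
    """Write the upload to a temp dir, run the tool, return the result.

    `run` is injectable so the function can be exercised without
    ocrmypdf installed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.pdf"
        dest = Path(tmp) / "output.pdf"
        src.write_bytes(pdf_bytes)
        run(build_ocrmypdf_command(src, dest), check=True)
        return dest.read_bytes()


class OcrHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # The request body is the raw PDF.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            result = convert(body)
        except (subprocess.CalledProcessError, OSError):
            self.send_error(500, "conversion failed")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), OcrHandler).serve_forever()
```

Calling `serve()` starts the listener, after which a client could POST a PDF as the request body and receive the processed file back. A real service would also need authentication, size limits, and timeouts, none of which are shown here.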

7 participants