Overview of feature request
All PDF files ingested as media for a Digital Document model should have Extracted Text (OCR) derivatives that contain the OCR'd text from the files.
What kind of user is the feature intended for?
Collections Manager, User
What inspired the request?
Migrating PDFs created in Islandora 7 (by Ghostscript) and discovering that all the Extracted Text derivatives are blank.
What existing behavior do you want changed?
PDF files used as media for a Digital Document model currently have their Extracted Text media generated by copying the text layer embedded in the PDF. If the PDF is "image-only" (has no text layer), the Extracted Text media is created as a blank/empty text file.
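To make the "blank derivative" case concrete, here is a minimal sketch of how an image-only PDF could be detected. It assumes the text layer is read with Poppler's `pdftotext` CLI, which is an assumption for illustration and not a description of Islandora's actual derivative code; the function names are hypothetical.

```python
import subprocess


def is_effectively_blank(text: str) -> bool:
    """Return True when extracted text holds no visible characters.

    pdftotext separates pages with form feeds, so an image-only PDF
    typically yields only form feeds and newlines.
    """
    return text.strip() == ""


def extract_text_layer(pdf_path: str) -> str:
    """Copy the embedded text layer out of a PDF (assumes pdftotext is installed)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A derivative step could call `is_effectively_blank(extract_text_layer(path))` and treat a `True` result as the signal that the PDF needs real OCR instead of a text-layer copy.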
Any brand new behavior do you want to add to Islandora?
Not sure how this should be implemented. Maybe pages could be broken out and individually OCR'd as image files, with that OCR fed back to the original object's media. Or maybe PDFs can be processed directly somehow?
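The page-by-page idea could look roughly like the sketch below. It is only an illustration under stated assumptions: it shells out to Poppler's `pdftoppm` to rasterize pages and to the Tesseract CLI to OCR each page image, and the function names, the 300 DPI default, and the injectable `run` parameter are all hypothetical choices, not anything Islandora ships.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List

# A runner takes a command line and returns its stdout as text.
Runner = Callable[[List[str]], str]


def run_command(args: List[str]) -> str:
    """Run an external command and return its stdout (raises on failure)."""
    return subprocess.run(
        args, capture_output=True, text=True, check=True
    ).stdout


def ocr_pdf_by_page(pdf_path: str, run: Runner = run_command, dpi: int = 300) -> str:
    """Rasterize each PDF page to PNG, OCR each image, and join the text."""
    with tempfile.TemporaryDirectory() as workdir:
        prefix = str(Path(workdir) / "page")
        # pdftoppm writes one image per page: page-1.png, page-2.png, ...
        run(["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix])
        # Page numbers are zero-padded to a common width, so a plain sort
        # keeps the pages in order.
        pages = sorted(Path(workdir).glob("page-*.png"))
        # "stdout" as the output base makes tesseract print the recognized text.
        texts = [run(["tesseract", str(page), "stdout"]) for page in pages]
    return "\n".join(text.strip() for text in texts)
```

The returned string would then be saved as the Extracted Text media for the original object. Passing the runner as a parameter keeps the external tools swappable, so the same flow could be pointed at a different rasterizer or OCR engine.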
Any related open or closed issues to this feature request?
Couldn't find any!