GCV to HOCR or PAGE conversion not working #121

Open
OmriPi opened this issue Jan 27, 2020 · 9 comments

Comments

@OmriPi

OmriPi commented Jan 27, 2020

Hi all,

I am new to using this software so please bear with me if this has been asked before or I'm not using the tool correctly.

I have the JSON output of Google Vision OCR for a PDF (emphasis on PDF, not an image).
I would like to create a searchable version of that PDF using the OCR results. I tried gcv2hocr, but it either doesn't work on PDFs or hits some other error, because the hOCR output I get from it is basically just the metadata. I tried ocr-fileformat on the same file, but again I get only the metadata. Converting to PAGE fails as well, and the output is a series of Java exception lines. Does ocr-fileformat support GCV JSON generated from a PDF?

The file I'm trying to run it on is the sample file from google:
gs://cloud-samples-data/vision/pdf_tiff/census2010.pdf

And the JSON is generated following this tutorial:
https://cloud.google.com/vision/docs/pdf#vision_text_detection_pdf_gcs-python

If you could assist me or point me toward a solution I would be very grateful, as I urgently need to solve this issue.

Thanks in advance!
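
For readers with the same problem, here is a minimal sketch of pulling word text and boxes out of a GCV file-annotation result, which is the data an hOCR converter needs. It assumes the PDF output follows the documented layout: a top-level `responses` list, one `fullTextAnnotation` per page, and boxes given as `normalizedVertices` (fractions of the page size) rather than pixel `vertices`. The helper names `word_bbox` and `iter_words` and the `sample` data are hypothetical and are not part of gcv2hocr or ocr-fileformat.

```python
def word_bbox(word, width, height):
    """Return (x0, y0, x1, y1) in page units, or None if the word has no box."""
    box = word.get("boundingBox", {})
    if "normalizedVertices" in box:
        # Assumption: PDF/TIFF results use fractions of the page width/height.
        points = [(v.get("x", 0) * width, v.get("y", 0) * height)
                  for v in box["normalizedVertices"]]
    else:
        # Image results use absolute pixel coordinates.
        points = [(v.get("x", 0), v.get("y", 0)) for v in box.get("vertices", [])]
    if not points:
        return None
    xs, ys = zip(*points)
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


def iter_words(gcv_output):
    """Yield (page_number, text, bbox) for each word in every response."""
    for response in gcv_output.get("responses", []):
        annotation = response.get("fullTextAnnotation")
        if not annotation:
            continue  # nothing recognised on this page
        page_number = response.get("context", {}).get("pageNumber")
        for page in annotation.get("pages", []):
            width, height = page.get("width", 1), page.get("height", 1)
            for block in page.get("blocks", []):
                for paragraph in block.get("paragraphs", []):
                    for word in paragraph.get("words", []):
                        text = "".join(s.get("text", "")
                                       for s in word.get("symbols", []))
                        yield page_number, text, word_bbox(word, width, height)


# Hand-written sample in the assumed PDF output shape (not real GCV output).
sample = {"responses": [{
    "context": {"pageNumber": 1},
    "fullTextAnnotation": {"pages": [{
        "width": 100, "height": 200,
        "blocks": [{"paragraphs": [{"words": [{
            "boundingBox": {"normalizedVertices": [
                {"x": 0.25, "y": 0.25}, {"x": 0.5, "y": 0.25},
                {"x": 0.5, "y": 0.5}, {"x": 0.25, "y": 0.5}]},
            "symbols": [{"text": "H"}, {"text": "i"}],
        }]}]}],
    }]},
}]}

words = list(iter_words(sample))
```

If `iter_words` finds words in a file for which a converter emits only metadata, the text is present in the JSON and the converter is probably looking for it under different keys.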

@kba
Collaborator

kba commented Jan 28, 2020

To convert PDF to Google Cloud Vision JSON, you need to use Google Cloud Vision, which is commercial cloud software we neither support nor endorse. Once you have that JSON data from their services, you can convert it to hOCR.

@kba
Collaborator

kba commented Jan 28, 2020

You could also convert to PAGE via hOCR and try https://github.com/PRImA-Research-Lab/prima-page-to-pdf

@OmriPi
Author

OmriPi commented Jan 30, 2020

Hi @kba, thank you for the answer. I think I may not have explained it correctly, or you misunderstood me:
I have used Google Vision to get the JSON; I already have it. My problem is with the GCV-to-hOCR transformer in this package. When I run it on the JSON I got from Google Vision, I get an almost blank output containing only the metadata.

When I try to convert it to PAGE instead, I get this result:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1472)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:994)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:169)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1239)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.parse(XmlPageReader.java:204)
    at org.primaresearch.dla.page.io.xml.XmlPageReader.read(XmlPageReader.java:130)
    at org.primaresearch.dla.page.io.xml.PageXmlInputOutput.readPage(PageXmlInputOutput.java:212)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:192)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Exception in thread "main" java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.handleNegativeCoordinates(PageConverter.java:389)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:216)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

So I'm trying to understand why the GCV converters in this module are not working for me, even though I have a perfectly valid GCV JSON. I can send you the JSON generated by GCV so you can try converting it yourself, if that helps.

Thanks!
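
A note on reading the trace: "Content is not allowed in prolog" at line 1, column 1 generally means the XML parser found something other than XML at the very start of its input, for example JSON or stray text. The trace shows the PAGE converter's XML reader raising it, so the file it was handed was apparently not XML. Whether that file was the raw GCV JSON or a failed intermediate is not established here. A minimal sketch for checking a file before passing it to an XML-based converter (the helper name `parses_as_xml` is hypothetical):

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if text is a well-formed XML document, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# JSON handed to an XML parser fails immediately, much like the Java error above.
json_is_xml = parses_as_xml('{"responses": []}')
# A tiny hOCR-like document parses without error.
hocr_is_xml = parses_as_xml("<html><body><div class='ocr_page'/></body></html>")
```

Running this check on the converter's input narrows down whether the failure happens before the PAGE step or inside it.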

@OmriPi
Author

OmriPi commented Jan 30, 2020

extracted_pdf.pdfoutput-1-to-1.txt

This is the JSON from GCV that I'm using (I changed the suffix to .txt to upload it here). It is the JSON for the sample document that Google uses in the tutorial.
Can you try and see if transforming it works correctly for you?
Thanks!

@kba
Collaborator

kba commented Jan 30, 2020

Then it's best to ask @dinosauria123 (not sure whether they're subscribed to issues here but they should see the mention). The code is at https://github.com/dinosauria123/gcv2hocr

@dinosauria123

Hi,
If you have a problem, please open an issue at https://github.com/dinosauria123/gcv2hocr.

@OmriPi
Author

OmriPi commented Feb 6, 2020

OK @dinosauria123! Thanks

@sarepal

sarepal commented Mar 2, 2021

Is this issue still live? I'm getting a similar error (org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.) when I try to convert GCV to PAGE. I'm attaching a zip file with the JPG and two versions of the GCV output: gcv-google-api (made with a Python script I wrote to interact with the Google API) and gcv-sh (produced by the shell script provided by @dinosauria123 at https://github.com/dinosauria123/gcv2hocr). Thanks for your consideration.
gcv-sample.zip

@jcuenod

jcuenod commented Nov 8, 2021

@sarepal I'm still having issues converting GCV to hOCR and, though I could be wrong, I think the conversion to PAGE goes via hOCR. Are you using a result from TEXT_DETECTION or DOCUMENT_TEXT_DETECTION?
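
One way to compare results is to report which annotation keys each response in the JSON actually contains. The sketch below assumes image requests return a flat `textAnnotations` list (often alongside `fullTextAnnotation`), while PDF/TIFF file requests return only `fullTextAnnotation`. The feature type cannot be recovered reliably from the JSON alone, and which key a given converter reads should be checked against its source. The function name `describe_gcv_json` is hypothetical.

```python
def describe_gcv_json(data):
    """Summarise, per response, which GCV annotation keys are present."""
    # Assumption: a file without a "responses" wrapper holds one bare response.
    responses = data.get("responses", [data])
    return [
        {
            "textAnnotations": len(r.get("textAnnotations", [])),
            "fullTextAnnotation": "fullTextAnnotation" in r,
        }
        for r in responses
    ]


# Hand-written examples of the two assumed shapes (not real GCV output).
image_style = {"responses": [{
    "textAnnotations": [{"description": "Hello"}, {"description": "Hello"}],
    "fullTextAnnotation": {"text": "Hello"},
}]}
pdf_style = {"responses": [{"fullTextAnnotation": {"text": "Hello"}}]}

image_summary = describe_gcv_json(image_style)
pdf_summary = describe_gcv_json(pdf_style)
```

To use it on a real file, load the file with `json.load` and pass the result to `describe_gcv_json`. A response that reports zero `textAnnotations` would explain an empty result from any converter that reads only that key.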
