[Cross-posted from the Forum/Suggestion] Implement a way to integrate (original image file, detected text) →searchable PDF #660
This is a complicated way of asking for an option to send one image through OCR and insert a different image in the output PDF.
I know this was requested before and I believe @jbreiden said it would be added to the PDF renderer at some point. |
I'm very reluctant to make Tesseract PDF generation fancy. I wonder if we can do an image swap like this outside of Tesseract, using one of the PDF manipulation toolkits. |
Sounds reasonable.
It is fairly simple to swap an image using qpdf's C++ API.
|
@jbreiden It's the last really missing issue. If this helps, I will donate some mBTC for implementing it just right now. Just post your receiving address. |
@jbarlow83 background info. As you know, I recently wanted to try your OCRmyPDF because I found the interesting which "does not alter the final output":
but unfortunately this does not work with tesseract 4, at the present. So I looked for bug reports, if tesseract could pass the original input image to the output; and filed the present issue. |
Really? That's interesting, qpdf is very well written. Maybe the right thing to do is allow Tesseract to produce a multi-page PDF with invisible symbolic text PDF only, no images. Then another tool (perhaps an enhanced qpdf tool) would merge and composite two PDFs together. One being the original image-only PDF, and the other an invisible-text-only PDF. What do you think, @jbarlow83? Please point me at the relevant qpdf API calls if you happen to know them. |
I think invisible-text-only output would be far more useful for developers who integrate Tesseract or anyone who wants to do something fancy. It would still make sense to keep the existing OCR-with-image option, of course. As a plus, it should be easier to suppress the image than to add a different one. OCRmyPDF (which I maintain) uses Ghostscript to rasterize and then runs one of its two PDF renderers. One uses Tesseract hOCR and provides more features but is not as good at producing the OCR text layer as Tesseract PDF, so I also provide Tesseract PDF. If Tesseract could produce an invisible-text-only PDF, I could offer all the features for both and work towards phasing out the hOCR renderer. When possible I already graft the text layer onto the existing PDF instead of constructing a new one. In addition to OCRmyPDF In writing this I've made a case for not using qpdf, because other tools should be able to do the job with an invisible-text PDF, but for interest's sake here is example code that inverts black and white for all images; clearly this is close to how one would replace an image outright. |
This sounds reasonable to me. I'll try to find time over this coming week to make an experimental invisible-text-only PDF that we can play with. All the other pieces of the puzzle are there; for example, Leptonica already ships with an images->pdf tool that avoids transcoding for PNG, JP2K, and JPEG. It would be cool to use qpdf for the merge step because it is already so useful for linearizing. But it's great that there are more options. The qpdf author is extremely friendly in my experience, in case we eventually chat with him. Oh, I now vaguely remember that PDFBox had something for merging as well, but I've never tried it and can't find it at the moment. |
Here's an experimental PDF pair, image-only and text-only. Let the merging begin! |
This works brilliantly. I will implement for real if someone promises that they will use it. Also, what do we call the configuration option? My best idea so far to describe a PDF that has invisible text only is 'naked'. I'm sure someone has a better idea.
Actually this works better the other way around, for preserving the bookmarks and things like that.
|
Implementation complete and under review by Ray. @jbarlow83 this is a good time to look at the samples above and make sure they meet your needs.
|
Looks really good @jbreiden. Works great in pdftk. No display issues and PDF syntax looks fine. PyPDF2 is also capable of merging. It does not have an equivalent of "multibackground", but pages can be merged manually. Here is merging one page:
In [1]: import PyPDF2 as pypdf
In [4]: pdf_text = pypdf.PdfFileReader(open('text.pdf', 'rb'))
In [5]: pdf_image = pypdf.PdfFileReader(open('images.pdf', 'rb'))
In [6]: page_text = pdf_text.pages[1]
In [7]: page_image = pdf_image.pages[1]
In [8]: page_text.mergeRotatedScaledTranslatedPage(page_image, 0, 1.0, 0, 0, expand=False)
In [9]: out = pypdf.PdfFileWriter()
In [10]: out.addPage(page_text)
In [11]: with open('pypdfmerge.pdf','wb') as o:
    ...:     out.write(o)
    ...:
For reference, pdfbox did not work out of the box. As far as I can tell the closest command in pdfbox is
However pdfbox takes the unusual approach of rasterizing the overlay PDF as a bitmap and drawing it on top of the base page, making it useless regardless of image/text order. (I suppose when you go to the trouble implementing a full PDF renderer in Java you feel compelled to use it even when it's not strictly needed.) |
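The pdftk merge mentioned above can be sketched as a small wrapper. This is only an illustration, not the exact command used in this thread: the function names and the file names `text.pdf`, `images.pdf`, and `merged.pdf` are hypothetical, and the choice of the text-only PDF as the foreground input is an assumption. pdftk's `multibackground` operation places page N of the background PDF behind page N of the input PDF.

```python
import shutil
import subprocess


def pdftk_merge_command(text_pdf, image_pdf, out_pdf):
    """Build a pdftk command line (hypothetical helper).

    Page N of image_pdf is placed behind page N of text_pdf.
    """
    return ["pdftk", text_pdf, "multibackground", image_pdf, "output", out_pdf]


def merge(text_pdf, image_pdf, out_pdf):
    """Run the merge; requires the pdftk binary on PATH."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk not found on PATH")
    subprocess.run(pdftk_merge_command(text_pdf, image_pdf, out_pdf), check=True)


# Show the command without running it.
print(" ".join(pdftk_merge_command("text.pdf", "images.pdf", "merged.pdf")))
```

Swapping the two inputs changes which document's bookmarks and metadata survive, so it is worth trying both orders on a real file.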
I don't know about calling it a naked PDF because there's nothing exciting to see in it. It's more of a phantom or spectral apparition PDF, having form without substance.
|
Spectral writing. Perhaps a kind of ghost script, if you will. |
How about @jbreiden is it also possible to use a .pdf file as input to tesseract directly? |
|
@Shreeshrii PDF is a very complex vector-based file format. Tesseract works only on images. It is much easier to write PDFs that use a limited set of PDF features than read arbitrary PDFs. Have a look at OCRmyPDF (which I develop) - it addresses the details of using tesseract to apply OCR to PDFs. |
@jbreiden @jbarlow83 @amitdo info: I just built the whole toolchain from their git repos (tesseract, ocrmypdf, unpaper), and have ghostscript version 9.20 ready in a dedicated debian 9 "OCR VM" on my Qubes OS system. Pls. let me know, what (if) you want me to test - I have time to test and want to help you. |
Hmmm, an invisible text layer, invisible text, let's see ... iText? Anyway, I'll pick something. There is zero chance that a PDF rasterizer will ever be part of Tesseract or Leptonica. In theory one could write a PDF image extractor for Leptonica, but there isn't really enough motivation to do so. |
Ray will eventually merge this patch, but it is hard to predict when. I am posting here for anyone who is impatient or excited.
--- api/pdfrenderer.cpp 2016-12-13 14:43:24.000000000 -0800
+++ api/pdfrenderer.cpp 2017-01-19 14:50:56.000000000 -0800
@@ -178,10 +178,12 @@
* PDF Renderer interface implementation
**********************************************************************/
-TessPDFRenderer::TessPDFRenderer(const char* outputbase, const char *datadir)
+TessPDFRenderer::TessPDFRenderer(const char *outputbase, const char *datadir,
+ bool textonly)
: TessResultRenderer(outputbase, "pdf") {
obj_ = 0;
datadir_ = datadir;
+ textonly_ = textonly;
offsets_.push_back(0);
}
@@ -326,7 +328,11 @@
pdf_str.add_str_double("", prec(width));
pdf_str += " 0 0 ";
pdf_str.add_str_double("", prec(height));
- pdf_str += " 0 0 cm /Im1 Do Q\n";
+ pdf_str += " 0 0 cm";
+ if (!textonly_) {
+ pdf_str += " /Im1 Do";
+ }
+ pdf_str += " Q\n";
int line_x1 = 0;
int line_y1 = 0;
@@ -832,6 +838,7 @@
bool TessPDFRenderer::AddImageHandler(TessBaseAPI* api) {
size_t n;
char buf[kBasicBufSize];
+ char buf2[kBasicBufSize];
Pix *pix = api->GetInputImage();
char *filename = (char *)api->GetInputName();
int ppi = api->GetSourceYResolution();
@@ -840,6 +847,9 @@
double width = pixGetWidth(pix) * 72.0 / ppi;
double height = pixGetHeight(pix) * 72.0 / ppi;
+ snprintf(buf2, sizeof(buf2), "/XObject << /Im1 %ld 0 R >>\n", obj_ + 2);
+ const char *xobject = (textonly_) ? "" : buf2;
+
// PAGE
n = snprintf(buf, sizeof(buf),
"%ld 0 obj\n"
@@ -850,19 +860,18 @@
" /Contents %ld 0 R\n"
" /Resources\n"
" <<\n"
- " /XObject << /Im1 %ld 0 R >>\n"
+ " %s"
" /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]\n"
" /Font << /f-0-0 %ld 0 R >>\n"
" >>\n"
">>\n"
"endobj\n",
obj_,
- 2L, // Pages object
- width,
- height,
- obj_ + 1, // Contents object
- obj_ + 2, // Image object
- 3L); // Type0 Font
+ 2L, // Pages object
+ width, height,
+ obj_ + 1, // Contents object
+ xobject, // Image object
+ 3L); // Type0 Font
if (n >= sizeof(buf)) return false;
pages_.push_back(obj_);
AppendPDFObject(buf);
@@ -899,13 +908,15 @@
objsize += strlen(b2);
AppendPDFObjectDIY(objsize);
- char *pdf_object;
- if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
- return false;
+ if (!textonly_) {
+ char *pdf_object = nullptr;
+ if (!imageToPDFObj(pix, filename, obj_, &pdf_object, &objsize)) {
+ return false;
+ }
+ AppendData(pdf_object, objsize);
+ AppendPDFObjectDIY(objsize);
+ delete[] pdf_object;
}
- AppendData(pdf_object, objsize);
- AppendPDFObjectDIY(objsize);
- delete[] pdf_object;
return true;
}
--- api/renderer.h 2016-11-07 07:44:03.000000000 -0800
+++ api/renderer.h 2017-01-19 14:50:56.000000000 -0800
@@ -186,7 +186,7 @@
public:
// datadir is the location of the TESSDATA. We need it because
// we load a custom PDF font from this location.
- TessPDFRenderer(const char *outputbase, const char *datadir);
+ TessPDFRenderer(const char* outputbase, const char* datadir, bool textonly);
protected:
virtual bool BeginDocumentHandler();
@@ -196,20 +196,20 @@
private:
// We don't want to have every image in memory at once,
// so we store some metadata as we go along producing
- // PDFs one page at a time. At the end that metadata is
+ // PDFs one page at a time. At the end, that metadata is
// used to make everything that isn't easily handled in a
// streaming fashion.
long int obj_; // counter for PDF objects
GenericVector<long int> offsets_; // offset of every PDF object in bytes
GenericVector<long int> pages_; // object number for every /Page object
const char *datadir_; // where to find the custom font
+ bool textonly_; // skip images if set
// Bookkeeping only. DIY = Do It Yourself.
void AppendPDFObjectDIY(size_t objectsize);
// Bookkeeping + emit data.
void AppendPDFObject(const char *data);
// Create the /Contents object for an entire page.
- static char* GetPDFTextObjects(TessBaseAPI* api,
- double width, double height);
+ char* GetPDFTextObjects(TessBaseAPI* api, double width, double height);
// Turn an image into a PDF object. Only transcode if we have to.
static bool imageToPDFObj(Pix *pix, char *filename, long int objnum,
char **pdf_object, long int *pdf_object_size);
--- api/tesseractmain.cpp 2016-12-15 15:28:37.000000000 -0800
+++ api/tesseractmain.cpp 2017-01-19 14:50:56.000000000 -0800
@@ -337,8 +337,10 @@
api->GetBoolVariable("tessedit_create_pdf", &b);
if (b) {
- renderers->push_back(
- new tesseract::TessPDFRenderer(outputbase, api->GetDatapath()));
+ bool textonly;
+ api->GetBoolVariable("textonly_pdf", &textonly);
+ renderers->push_back(new tesseract::TessPDFRenderer(
+ outputbase, api->GetDatapath(), textonly));
}
api->GetBoolVariable("tessedit_write_unlv", &b);
--- ccmain/tesseractclass.cpp 2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.cpp 2017-01-19 18:15:57.000000000 -0800
@@ -391,6 +391,8 @@
this->params()),
BOOL_MEMBER(tessedit_create_pdf, false, "Write .pdf output file",
this->params()),
+ BOOL_MEMBER(textonly_pdf, false, "Invisible text only for PDF",
+ this->params()),
STRING_MEMBER(unrecognised_char, "|",
"Output char for unidentified blobs", this->params()),
INT_MEMBER(suspect_level, 99, "Suspect marker level", this->params()),
--- ccmain/tesseractclass.h 2017-01-19 11:57:09.000000000 -0800
+++ ccmain/tesseractclass.h 2017-01-19 16:31:04.000000000 -0800
@@ -1027,6 +1027,7 @@
BOOL_VAR_H(tessedit_create_hocr, false, "Write .html hOCR output file");
BOOL_VAR_H(tessedit_create_tsv, false, "Write .tsv output file");
BOOL_VAR_H(tessedit_create_pdf, false, "Write .pdf output file");
+ BOOL_VAR_H(textonly_pdf, false, "Invisible text only for PDF");
STRING_VAR_H(unrecognised_char, "|",
"Output char for unidentified blobs");
INT_VAR_H(suspect_level, 99, "Suspect marker level"); |
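To make the effect of the patch easier to see, here is a rough Python paraphrase of the two places where `textonly_` changes the generated PDF. It is a sketch of the string logic only, not Tesseract code: the function names are hypothetical, the `:g` number formatting stands in for the patch's `prec()` helper, and whatever the renderer emits before the transform (outside the diff hunk) is not reproduced.

```python
def image_placement_ops(width_pt, height_pt, textonly):
    """Tail of the page content stream that scales and paints the image.

    Mirrors the pdf_str changes in the patch: the 'cm' transform and the
    closing 'Q' are always written, but '/Im1 Do' (paint the image) is
    skipped when textonly is set.
    """
    ops = f"{width_pt:g} 0 0 {height_pt:g} 0 0 cm"
    if not textonly:
        ops += " /Im1 Do"
    return ops + " Q\n"


def xobject_entry(image_objnum, textonly):
    """Resources entry for the page image; empty for a text-only page."""
    if textonly:
        return ""
    return f"/XObject << /Im1 {image_objnum} 0 R >>\n"


print(image_placement_ops(612, 792, textonly=False), end="")
print(image_placement_ops(612, 792, textonly=True), end="")
```

With `textonly` set, the page keeps its size and text objects but no longer references or embeds an image object, which is what lets another tool supply the image later.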
@Shreeshrii http://kiirani.com/2013/03/22/tesseract-pdf.html The PDF/invisible text output you guys are implementing works quite well for me using OSX 'Preview' but for a little jerkiness depending on scaling, of course. This is quite a big deal, in my opinion, as it will allow those who have, for instance... legal documents containing notary stamps in color, or in my use-case aviation emergency manuals with color-coded pages, to keep their original copies unmodified from their scanners, but modify them in a clean way into searchable documents. Thanks for this. |
Thanks for info on pdf to images conversion for use with tesseract. I usually use ghostscript for the purpose e.g.
I will give the other suggestions a try (including a new one suggested by zdenop in the forum- https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/vvMldrkcuOQ/xLES3_ZoEwAJ ) @jbreiden Thanks, Jeff, for this invisible text output pdf which can be merged with the original pdf. |
pdfimages from poppler-utils will do image extraction as well. And pdfium offers API calls for image extraction. I am sure there are many others. Have fun. |
I suggest to merge this to master now. Ray can modify it later if needed. |
merged to master. |
why again *.jpg (step 1) ? Never ever use jpg with text files. |
I already developed code for this using -c textonly_pdf=1, thanks |
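A minimal sketch of how such an invocation might be assembled from a script, assuming the `tesseract` binary is on PATH and using the parameter name exactly as quoted in this thread. The helper name and the file names are hypothetical; the `-l deu+eng` language selection is taken from the original report.

```python
import subprocess


def textonly_pdf_command(image_path, output_base, languages=None):
    """Build a tesseract command that writes an invisible-text-only PDF."""
    cmd = ["tesseract", image_path, output_base]
    if languages:
        # Tesseract joins multiple languages with '+', e.g. deu+eng.
        cmd += ["-l", "+".join(languages)]
    # -c sets a config variable; 'pdf' selects the PDF renderer.
    cmd += ["-c", "textonly_pdf=1", "pdf"]
    return cmd


def run_textonly_ocr(image_path, output_base, languages=None):
    """Run tesseract; the result is written to <output_base>.pdf."""
    subprocess.run(textonly_pdf_command(image_path, output_base, languages), check=True)


print(" ".join(textonly_pdf_command("in-cleaned.ppm", "out", ["deu", "eng"])))
```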
An image-only PDF file is a bag of images. If the bag is holding a bunch of JPEG images, extract them as-is. Don't convert. Don't recompress. Just empty the PDF bag and get your images out. If it is holding JPEG2000, then just get those out. Same with PNG. |
Yes and no, why can't tesseract do this (pass-through the "bunch of input images") ? |
Let's shift this discussion back to the forum. Please re-ask your most recent question there; I don't follow exactly what you are asking. |
Pls. elaborate your step I use |
C-API should be fixed now. Thanks for finding this wikinaut. |
Was there a final resolution to this request for putting back in the original images? @Wikinaut? |
Yes. The final solution was to implement |
Yeah, that doesn't work for me: I'm using version 3.05.00 installed via homebrew. |
@Jmuccigr I am definitely not happy with the current implementation, and decided some months ago to stay silent and let other users come back with the issue (hoping that my original proposal, to pass through the original input image without transcoding it, will be implemented in forthcoming versions). |
The |
@Wikinaut, yeah, my workflow at some point involves adding OCR'ed text to an optimized PDF. Having the OCR step degrade the quality of that PDF kind of spoils it. |
@zdenop Please backport for 3.05. Thanks! |
done. |
Thanks, @zdenop. Please also make a 3.05.01 release with the latest commit in 3.05 branch so that all these enhancements are easily accessible. |
Just getting back to this now that 3.05.01 has hit homebrew and wanted to say that it seems to be working. I've tested it out by running text-only tesseract on a 2x version of an image - which tends to give better results if the original dpi is too low - and then combining that text-only PDF with a PDF made from the original image, which keeps the file size down. |
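One detail to watch with the 2x trick is page size. The patch earlier in this thread computes the page dimensions as pixels * 72.0 / ppi, so a text-only PDF made from a 2x image only matches the original page if the upscaled image also reports twice the resolution. The sketch below works through that arithmetic; the function name and the sample pixel sizes are illustrative assumptions, not values from this thread.

```python
def page_size_points(width_px, height_px, ppi):
    """Page size in PDF points, using the same formula as the renderer patch."""
    return (width_px * 72.0 / ppi, height_px * 72.0 / ppi)


# Hypothetical scan: 1275 x 1650 pixels at 150 ppi (US Letter).
original = page_size_points(1275, 1650, 150)

# 2x upscale that also doubles the declared resolution: same page size.
upscaled_ok = page_size_points(2550, 3300, 300)

# 2x upscale that keeps the old resolution: the page doubles in size.
upscaled_bad = page_size_points(2550, 3300, 150)

print(original, upscaled_ok, upscaled_bad)
```

If the sizes differ, the merge tool has to scale one layer onto the other, and whether the text still lines up then depends on that tool.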
FWIW, I created a small command line utility pdfmerge as a frontend to the merge functionality (equivalent to the pdftk multibackground command) in the Python packages PyPDF2 and pdfrw. |
@wrznr, thanks for the info. |
Hello, my first goal was to try to understand how it works and what it does exactly... (for merging image and textonly pdf files). Thank you! |
You didn't read the whole thread. The parameter name was changed to |
Oh sorry! I thought it was another parameter! (I already know |
I should have written: "Did you read the whole thread?" or just omit the sentence. |
https://groups.google.com/forum/#!topic/tesseract-ocr/vvMldrkcuOQ has asked:
How to reproduce:
in.pdf, converted to in.ppm image
unpaper in.ppm in-cleaned.ppm
tesseract in-cleaned.ppm out -l deu+eng --oem 2 pdf txt
out.pdf has now a blotchy background (from the unpaper step above)
Is there any way to "feed-in" the original in.ppm as image, so that this is used instead of in-cleaned.ppm when creating the out.pdf?
So what is wanted is original input image plus ocr layer, so that output looks like