-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ALTO Output Functionality #2067
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this contribution. I'll test it for our images later this week.
Up to now I only reviewed the code and have added several inline remarks.
One general remark: personally I like things sorted, be it new code blocks, several lines of code or other things, either alphabetically or some other criteria. That helps when maintaining the code and also gives nicer help texts, for example. So I suggest to add new code not always at the end.
src/api/altorenderer.cpp
Outdated
"\t\t<OCRProcessing ID=\"OCR_0\">\n" | ||
"\t\t\t<ocrProcessingStep>\n" | ||
"\t\t\t\t<processingSoftware>\n" | ||
"\t\t\t\t\t<softwareName>tesseract 4.0.0</softwareName>\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version should be taken from TessBaseAPI::Version()
here.
src/api/altorenderer.cpp
Outdated
} | ||
|
||
/** | ||
* Append the ALTO XML for the end of the document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would C++ (///
) comments be better? Much of the Tesseract code was written before using C++, so is not necessarily a good template.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/api/altorenderer.cpp
Outdated
// convert input name from ANSI encoding to utf-8 | ||
int str16_len = | ||
MultiByteToWideChar(CP_ACP, 0, input_file_->string(), -1, nullptr, 0); | ||
wchar_t *uni16_str = new WCHAR[str16_len]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use std::vector
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need help making this change. I did not write this section of code (It came from the hOCR implementation), so I am not 100% sure of what it is doing.
src/api/altorenderer.cpp
Outdated
char *utf8_str = new char[utf8_len]; | ||
WideCharToMultiByte(CP_UTF8, 0, uni16_str, str16_len, utf8_str, | ||
utf8_len, nullptr, nullptr); | ||
*input_file_ = utf8_str; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could *input_file
be used directly instead of utf8_str
thus avoiding the copying and delete[]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need help making this change. I did not write this section of code (It came from the hOCR implementation.), so I am not 100% sure of what it is doing.
src/api/altorenderer.cpp
Outdated
/** | ||
* Append the ALTO XML for the layout of the image | ||
*/ | ||
bool TessAltoRenderer::AddImageHandler(TessBaseAPI *api) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The recommended style for new or modified code is TessBaseAPI* api
.
src/api/altorenderer.cpp
Outdated
return ret; | ||
} | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing line feed as last character of the source file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | |
} |
src/api/baseapi.cpp
Outdated
@@ -94,34 +94,34 @@ BOOL_VAR(stream_filelist, FALSE, "Stream a filelist from stdin"); | |||
namespace tesseract { | |||
|
|||
/** Minimum sensible image size to be worth running tesseract. */ | |||
const int kMinRectSize = 10; | |||
const int kMinRectSize = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are several code changes which are unrelated to ALTO.
src/api/capi.cpp
Outdated
@@ -66,6 +66,11 @@ TESS_API TessResultRenderer* TESS_CALL TessHOcrRendererCreate2(const char* outpu | |||
return new TessHOcrRenderer(outputbase, font_info); | |||
} | |||
|
|||
TESS_API TessResultRenderer* TESS_CALL TessAltoRendererCreate(const char* outputbase) | |||
{ | |||
return new TessHOcrRenderer(outputbase); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy+paste error. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new TessHOcrRenderer(outputbase); | |
return new TessAltoRenderer(outputbase); |
Personally I think that an own file is good here. What would be the advantages of the alternatives? |
src/api/altorenderer.cpp
Outdated
* Append the ALTO XML for the layout of the image | ||
*/ | ||
bool TessAltoRenderer::AddImageHandler(TessBaseAPI *api) { | ||
const std::unique_ptr<const char[]> hocr(api->GetAltoText(imagenum())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That requires #include <memory.h>
(see build failures from continuous integration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, my comment was wrong. It should be just #include <memory>
.
@amitdo : see @jbreiden comment in #419 (comment) |
Ok, alto renderer will get its own file. What about the other renderers, should each of them get its own file? |
src/api/baseapi.cpp
Outdated
@@ -118,10 +118,10 @@ namespace tesseract { | |||
static void addAvailableLanguages(const STRING &datadir, const STRING &base, | |||
GenericVector<STRING>* langs) | |||
{ | |||
const STRING base2 = (base.string()[0] == '\0') ? base : base + "/"; | |||
const size_t extlen = sizeof(kTrainedDataSuffix); | |||
const STRING base2 = (base.string()[0] == '\0') ? base : base + "/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use C++ comments, and Tesseract uses indentation steps of 2, so at least most of this commit is not needed or even not wanted at all.
Thank you very much for these comments and suggestions. They are incredibly helpful and informative. I will work on making these improvements over the next few days. A note: I had originally implemented ALTO functionality in api/baseapi.cpp but then separated it out due to the aforementioned comment. |
@jakesebright, I have now processed a page with your code. It can be seen here. Full text view must be activated by clicking on the 4th icon in the top menu. Obviously the DFG viewer software which is used there has problems with the separation of the words. All other pages which were made by ABBYY Finereader don't have that problem because it separates the All ALTO files are available online. |
|
@jakesebright, do you plan to add more commits, for example to fix the build error? Or should we take your contribution as it is and fix the remaining issues on our side? |
@stweil Yes, I should have time to work on this tomorrow. Thank you for following up and for your patience. |
I have made all of the suggested changes except for those where I left a comment noting otherwise. I am still getting build failures from the continuous integration, and am having a hard time figuring out why this is happening. Could anyone offer some help / advice with this failure? |
I just want to say thank you very much for adding this feature. At one
point I reasonably understood the HOCR renderer so l will try to take a
look when I find time, if you are still stuck.
|
src/api/altorenderer.cpp
Outdated
// limitations under the License. | ||
|
||
#include "baseapi.h" | ||
#include <memory.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include <memory.h> | |
#include <memory> // for unique_ptr |
bool TessAltoRenderer::BeginDocumentHandler() { | ||
AppendString( | ||
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" | ||
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n" | |
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" " | |
"xmlns:xlink=\"http://www.w3.org/1999/xlink\" " | |
"xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" " | |
"xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n" |
@jakesebright, thank you for the updates and all of you work. The build failure is still there because I gave a wrong advice: the missing include file is |
alto_str += "\" HEIGHT=\""; | ||
alto_str.add_str_int("", rect_height_); | ||
alto_str += "\" PHYSICAL_IMG_NR=\""; | ||
alto_str.add_str_int("", rect_height_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakesebright , rect_height
does not look like a PHYSICAL_IMG_NR
. Is this a copy+paste error? What was intended here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a copy/paste error. I believe it should be page_number
instead of rect_height
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a fix in pull request #2122.
I have added support for ALTO output as described in this issue.
ALTO XML can be output by using the config file at tessdata/configs/alto. I have confirmed that this output validates against the schema defined here.
This is my first pull request to a code base of this size, so please have patience if I have misunderstood anything. I use ALTO quite often and would love to have support for it in tesseract. I would be happy to make additional recommended improvements to my implementation.