Add ALTO Output Functionality #2067

Merged
merged 9 commits into from
Nov 30, 2018

jakesebright
Contributor

I have added support for ALTO output as described in this issue.

ALTO XML can be output by using the config file at tessdata/configs/alto. I have confirmed that this output validates against the schema defined here.

This is my first pull request to a code base of this size, so please have patience if I have misunderstood anything. I use ALTO quite often and would love to have support for it in tesseract. I would be happy to make additional recommended improvements to my implementation.

@amitdo
Collaborator

amitdo commented Nov 21, 2018

@zdenop , @stweil

Should the alto renderer code be in its own file like the pdf renderer or in baseapi.cpp?

Another option is to put all text renderers in one file textrenderers.cpp.

Member

@stweil stweil left a comment

Thank you very much for this contribution. I'll test it for our images later this week.

Up to now I only reviewed the code and have added several inline remarks.

One general remark: personally I like things sorted, whether new code blocks, several lines of code, or other things, either alphabetically or by some other criterion. That helps when maintaining the code and also gives nicer help texts, for example. So I suggest not always adding new code at the end.

"\t\t<OCRProcessing ID=\"OCR_0\">\n"
"\t\t\t<ocrProcessingStep>\n"
"\t\t\t\t<processingSoftware>\n"
"\t\t\t\t\t<softwareName>tesseract 4.0.0</softwareName>\n"
Member

The version should be taken from TessBaseAPI::Version() here.
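
A minimal sketch of what this remark suggests: build the `<softwareName>` line from the version string at run time rather than hard-coding it. The `Version()` function below is only a local stand-in for `TessBaseAPI::Version()` so that the snippet compiles on its own, and `SoftwareNameElement` is a hypothetical helper name, not part of the pull request.

```cpp
#include <string>

// Local stand-in for tesseract::TessBaseAPI::Version(), which returns the
// library version as a C string. The value here is a placeholder.
static const char* Version() { return "4.0.0"; }

// Builds the <softwareName> line from the version reported at run time
// instead of embedding a fixed literal in the ALTO header.
std::string SoftwareNameElement() {
  std::string element = "\t\t\t\t\t<softwareName>tesseract ";
  element += Version();
  element += "</softwareName>\n";
  return element;
}
```

In the renderer itself the result would presumably be passed to `AppendString` between the surrounding header literals.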

}

/**
* Append the ALTO XML for the end of the document
Member

Would C++ (///) comments be better? Much of the Tesseract code was written before the project used C++, so it is not necessarily a good template.

// convert input name from ANSI encoding to utf-8
int str16_len =
MultiByteToWideChar(CP_ACP, 0, input_file_->string(), -1, nullptr, 0);
wchar_t *uni16_str = new WCHAR[str16_len];
Member

Use std::vector here.

Contributor Author

I need help making this change. I did not write this section of code (it came from the hOCR implementation), so I am not 100% sure what it is doing.

char *utf8_str = new char[utf8_len];
WideCharToMultiByte(CP_UTF8, 0, uni16_str, str16_len, utf8_str,
utf8_len, nullptr, nullptr);
*input_file_ = utf8_str;
Member

Could *input_file be used directly instead of utf8_str thus avoiding the copying and delete[]?

Contributor Author

I need help making this change. I did not write this section of code (it came from the hOCR implementation), so I am not 100% sure what it is doing.
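
As a rough illustration of the two suggestions above, the conversion could use a `std::vector` for the UTF-16 buffer and write the UTF-8 result directly into the destination string, so no `new[]`/`delete[]` pair is needed. This sketch assumes the destination is a `std::string` (Tesseract's own `STRING` class may not allow writing into its buffer this way), and `AnsiToUtf8` is a hypothetical helper name. On non-Windows platforms it simply returns the input, on the assumption that file names are already UTF-8 there.

```cpp
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Converts a file name from the ANSI code page to UTF-8 (Windows only).
// Elsewhere the input is returned unchanged.
std::string AnsiToUtf8(const std::string& ansi) {
#ifdef _WIN32
  // First call: ask for the required UTF-16 length, terminator included.
  int len16 = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, nullptr, 0);
  if (len16 <= 0) return ansi;
  std::vector<wchar_t> utf16(len16);  // freed automatically
  MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), -1, utf16.data(), len16);

  // Same two-step pattern for UTF-16 -> UTF-8.
  int len8 = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), len16, nullptr, 0,
                                 nullptr, nullptr);
  if (len8 <= 0) return ansi;
  std::string utf8(len8, '\0');
  // Write straight into the destination string; no temporary char array.
  WideCharToMultiByte(CP_UTF8, 0, utf16.data(), len16, &utf8[0], len8,
                      nullptr, nullptr);
  utf8.resize(len8 - 1);  // drop the terminating NUL that was converted too
  return utf8;
#else
  return ansi;
#endif
}
```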

/**
* Append the ALTO XML for the layout of the image
*/
bool TessAltoRenderer::AddImageHandler(TessBaseAPI *api) {
Member

The recommended style for new or modified code is TessBaseAPI* api.

return ret;
}

}
Member

Missing line feed as last character of the source file.

Member

Suggested change
}
}

@@ -94,34 +94,34 @@ BOOL_VAR(stream_filelist, FALSE, "Stream a filelist from stdin");
namespace tesseract {

/** Minimum sensible image size to be worth running tesseract. */
const int kMinRectSize = 10;
const int kMinRectSize = 10;
Member

Here are several code changes which are unrelated to ALTO.

src/api/capi.cpp Outdated
@@ -66,6 +66,11 @@ TESS_API TessResultRenderer* TESS_CALL TessHOcrRendererCreate2(const char* outpu
return new TessHOcrRenderer(outputbase, font_info);
}

TESS_API TessResultRenderer* TESS_CALL TessAltoRendererCreate(const char* outputbase)
{
return new TessHOcrRenderer(outputbase);
Member

Copy+paste error. :-)

Member

Suggested change
return new TessHOcrRenderer(outputbase);
return new TessAltoRenderer(outputbase);

@stweil
Member

stweil commented Nov 21, 2018

Should the alto renderer code be in its own file like the pdf renderer or in baseapi.cpp?

Personally I think that a separate file is good here. What would be the advantages of the alternatives?

* Append the ALTO XML for the layout of the image
*/
bool TessAltoRenderer::AddImageHandler(TessBaseAPI *api) {
const std::unique_ptr<const char[]> hocr(api->GetAltoText(imagenum()));
Member

That requires #include <memory.h> (see build failures from continuous integration).

Member

I'm sorry, my comment was wrong. It should be just #include <memory>.
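
A small self-contained sketch of the pattern in question: `std::unique_ptr` is declared in the C++ header `<memory>`, whereas `<memory.h>` is an unrelated C header. `MakeText` below is a hypothetical stand-in for an API call such as `GetAltoText` that returns a `new[]`-allocated string owned by the caller.

```cpp
#include <cstring>
#include <memory>  // std::unique_ptr is declared here, not in <memory.h>
#include <string>

// Hypothetical stand-in for an API that returns a new[]-allocated C string
// which the caller must release with delete[].
char* MakeText() {
  const char src[] = "<alto/>";
  char* out = new char[sizeof(src)];
  std::memcpy(out, src, sizeof(src));
  return out;
}

// Takes ownership of the buffer; delete[] runs when `text` goes out of scope.
std::string CopyText() {
  const std::unique_ptr<const char[]> text(MakeText());
  return text ? std::string(text.get()) : std::string();
}
```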

@zdenop
Contributor

zdenop commented Nov 21, 2018

@amitdo: see @jbreiden's comment in #419

@amitdo
Collaborator

amitdo commented Nov 21, 2018

Ok, alto renderer will get its own file.

What about the other renderers, should each of them get its own file?

@@ -118,10 +118,10 @@ namespace tesseract {
static void addAvailableLanguages(const STRING &datadir, const STRING &base,
GenericVector<STRING>* langs)
{
const STRING base2 = (base.string()[0] == '\0') ? base : base + "/";
const size_t extlen = sizeof(kTrainedDataSuffix);
const STRING base2 = (base.string()[0] == '\0') ? base : base + "/";
Member

You can use C++ comments, and Tesseract uses indentation steps of 2, so most of this commit is unnecessary or even unwanted.

@jakesebright
Contributor Author

Thank you very much for these comments and suggestions. They are incredibly helpful and informative. I will work on making these improvements over the next few days.

A note: I had originally implemented ALTO functionality in api/baseapi.cpp but then separated it out due to the aforementioned comment.

@stweil
Member

stweil commented Nov 22, 2018

@jakesebright, I have now processed a page with your code. It can be seen here. Full text view must be activated by clicking on the 4th icon in the top menu.

Obviously the DFG viewer software used there has trouble separating the words. All other pages, which were produced by ABBYY FineReader, don't have that problem because FineReader separates the <String> tags with <SP> tags.

All ALTO files are available online.
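
To make the difference concrete, here is a minimal sketch of emitting an `<SP/>` element between neighbouring `<String>` elements of one text line. `TextLineContent` is a hypothetical helper, not code from this pull request. Position and size attributes are omitted for brevity, and the words are assumed to be XML-escaped already.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins the words of one text line into ALTO <String> elements and places
// an <SP/> element between neighbours, so viewers can tell the words apart.
std::string TextLineContent(const std::vector<std::string>& words) {
  std::string out;
  for (std::size_t i = 0; i < words.size(); ++i) {
    if (i > 0) out += "<SP/>";  // separator only between words
    out += "<String CONTENT=\"" + words[i] + "\"/>";
  }
  return out;
}
```

A real renderer would presumably also write HPOS, VPOS and WIDTH attributes for both element types.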

@stweil
Member

stweil commented Nov 22, 2018

tessdata/configs/Makefile.am also needs an update to install the configuration file tessdata/configs/alto for ALTO.

@stweil
Member

stweil commented Nov 28, 2018

@jakesebright, do you plan to add more commits, for example to fix the build error? Or should we take your contribution as it is and fix the remaining issues on our side?

@jakesebright
Contributor Author

@stweil Yes, I should have time to work on this tomorrow. Thank you for following up and for your patience.

@jakesebright
Contributor Author

I have made all of the suggested changes except for those where I left a comment noting otherwise.

I am still getting build failures from the continuous integration, and am having a hard time figuring out why this is happening. Could anyone offer some help / advice with this failure?

@jbreiden
Contributor

jbreiden commented Nov 30, 2018 via email

// limitations under the License.

#include "baseapi.h"
#include <memory.h>
Member

Suggested change
#include <memory.h>
#include <memory> // for unique_ptr

bool TessAltoRenderer::BeginDocumentHandler() {
AppendString(
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n"
Member

Suggested change
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n"
"<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\" "
"xmlns:xlink=\"http://www.w3.org/1999/xlink\" "
"xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" "
"xsi:schemaLocation=\"http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd\">\n"

@stweil
Member

stweil commented Nov 30, 2018

@jakesebright, thank you for the updates and all of your work. The build failure is still there because I gave wrong advice: the missing include file is memory, not memory.h. I fixed it now.

@ghost ghost assigned stweil Nov 30, 2018
@ghost ghost added the review label Nov 30, 2018
@stweil stweil merged commit d7cee03 into tesseract-ocr:master Nov 30, 2018
@ghost ghost removed the review label Nov 30, 2018
@jakesebright jakesebright deleted the alto branch December 1, 2018 23:48
alto_str += "\" HEIGHT=\"";
alto_str.add_str_int("", rect_height_);
alto_str += "\" PHYSICAL_IMG_NR=\"";
alto_str.add_str_int("", rect_height_);
Member

@jakesebright , rect_height does not look like a PHYSICAL_IMG_NR. Is this a copy+paste error? What was intended here?

Contributor Author

Yes, this is a copy/paste error. I believe it should be page_number instead of rect_height.

Member

I have a fix in pull request #2122.
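
For illustration only, and not necessarily what pull request #2122 does, the intended attribute string might be assembled like this. `PageAttributes` and its parameters are hypothetical names, and `std::to_string` stands in for Tesseract's `add_str_int`.

```cpp
#include <string>

// Builds the size and page-number attributes of a page element.
// PHYSICAL_IMG_NR takes the page number, not the rectangle height.
std::string PageAttributes(int width, int height, int page_number) {
  std::string attrs = "WIDTH=\"" + std::to_string(width);
  attrs += "\" HEIGHT=\"" + std::to_string(height);
  attrs += "\" PHYSICAL_IMG_NR=\"" + std::to_string(page_number);
  attrs += "\"";
  return attrs;
}
```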
