Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix conversion from ALTO to PAGE and vice versa #106

Merged
merged 9 commits into from
Jan 2, 2020
Merged

Conversation

stweil
Copy link
Member

@stweil stweil commented Dec 31, 2019

  • Fix order of arguments passed
  • Remove shell debugging (-x)
  • Handle input from STDIN
  • Add -convert-to ALTO argument needed for conversion from PAGE to ALTO

@stweil
Copy link
Member Author

stweil commented Dec 31, 2019

The prima-page-converter does not format the output file. Should we add post processing which formats it nicely? Otherwise the web interface only shows a single (very lengthy) line.

@zuphilip
Copy link
Member

Yes, pretty print would be good 👍 Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?

I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml . But the third example works fine. Are these upstream problems?

@zuphilip
Copy link
Member

There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation output nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml . However, this works in the docker run on the master branch. (Sorry for the the previous message which was unrelated to this issue.)

@stweil
Copy link
Member Author

stweil commented Dec 31, 2019

There seems to be a side effect in the abbyy2hocr transformation.

That's strange because ABBYY to hOCR does not use a transformer script (and this PR does not change anything else).

@zuphilip
Copy link
Member

Okay, I did another test with this branch here and can confirm that abbyy2hocr works fine in my docker container. There seems to be a problem with the instance on digi.

@stweil
Copy link
Member Author

stweil commented Jan 1, 2020

Yes, pretty print would be good. Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform?

There are a lot of options how to implement pretty printing. I added a commit which uses Saxon, so the usual command line argument can be used to enable it (currently only implemented for output to STDOUT). The web interface now uses pretty printing for all PAGE related conversions by default.

@zuphilip
Copy link
Member

zuphilip commented Jan 1, 2020

Is this ready to merge? It looks good to me!

Do you agree that the problems with the alto files I described above are upstream problems?

stweil added 9 commits January 1, 2020 23:14
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The output is formatted when a Saxon serialization parameter is given on the
command line.

The web interface automatically uses `!indent=yes`.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
It used an undefined macro SAXON_JAR.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Copy link
Member Author

stweil commented Jan 2, 2020

There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation output nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml .

Commit OCR-D/format-converters@5b9568f was missing in our installation (fixed now). I noticed that just running make all or make install does not update any of the existing cloned code from external git repositories. I think we should start using git submodules to get explicit dependencies.

@stweil
Copy link
Member Author

stweil commented Jan 2, 2020

I tried some examples in the Web GUI and see an error for alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml.

The first one throws a Java nullpointer exception, the second one looks like a vendor specific variant of ALTO. So both problems are not caused by the ocr-fileformat code.

@zuphilip zuphilip mentioned this pull request Jan 2, 2020
@zuphilip
Copy link
Member

zuphilip commented Jan 2, 2020

The first one throws a Java nullpointer exception, the second one looks like a vendor specific variant of ALTO. So both problems are not caused by the ocr-fileformat code.

Okay, we should report them upstream such that they will hopefully been fixed there in the future.

Moreover, I created an issue about a better update mechanism.

So, let me ask again: Is this ready to merge? It looks good to me! 👍

@stweil stweil merged commit 20886fe into master Jan 2, 2020
@stweil stweil deleted the fix-page-to-alto branch January 2, 2020 08:36
@zuphilip
Copy link
Member

zuphilip commented Jan 2, 2020

Thank you @stweil for all the work on this! 🙇‍♂️

@@ -50,6 +49,8 @@ main () {
fi
fi

declare -a script_args
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you move that line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simply wanted to have it close to the first use of script_args.

@@ -86,8 +87,7 @@ main () {
[[ "$outfile" != '-' ]] && script_args=("${script_args[@]}" "-o:$outfile")
exec_saxon "${script_args[@]}"
else
script_args=("${script_args[@]}" "$infile")
script_args=("${script_args[@]}" "$outfile")
script_args=("$infile" "$outfile" "${script_args[@]}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is reverse sorted then before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is now sorted as described in script/transform/README.md.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was wrong before, weird we never noticed this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants