Fix conversion from ALTO to PAGE and vice versa #106
The prima-page-converter does not format the output file. Should we add a post-processing step that formats it nicely? Otherwise the web interface shows only a single (very long) line.
Yes, pretty-printing would be good 👍 Can we use Saxon for that? Would it maybe even make sense to have a CLI option for that in ocr-transform? I tried some examples in the Web GUI and see an error for the alto__page transformation with https://rawgit.com/kba/ocr-fileformat-samples/master/samples/alto/2.0/wetzel_reisebegleiter_1901_0021.alto as well as with http://chroniclingamerica.loc.gov/lccn/sn86069133/1910-10-31/ed-1/seq-1/ocr.xml . But the third example works fine. Are these upstream problems?
There seems to be a side effect in the abbyy2hocr transformation. In the Web GUI this transformation outputs nothing with the input https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/abbyy/417576986_0031.xml . However, it works in a docker run on the master branch. (Sorry for the previous message, which was unrelated to this issue.)
5cd607c to 509aac2
That's strange, because ABBYY to hOCR does not use a transformer script (and this PR does not change anything else).
Okay, I did another test with this branch and can confirm that abbyy2hocr works fine in my docker container. There seems to be a problem with the instance on digi.
There are many ways to implement pretty printing. I added a commit which uses Saxon, so the usual Saxon command line argument can be used to enable it (currently only implemented for output to STDOUT). The web interface now uses pretty printing by default for all PAGE-related conversions.
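As a rough illustration of the approach described above, the following sketch builds (but does not run) a Saxon command line and passes a serialization parameter such as `!indent=yes` through unchanged. The function name `build_saxon_cmd`, the default jar path, and the file names are assumptions made for this example, not taken from ocr-transform itself.

```shell
#!/usr/bin/env bash
# Sketch only: build_saxon_cmd, the jar path and the file names are hypothetical.
SAXON_JAR="${SAXON_JAR:-/usr/share/java/saxon.jar}"

build_saxon_cmd () {
    local xsl="$1" infile="$2" outfile="$3"
    shift 3
    local -a cmd=(java -jar "$SAXON_JAR" "-s:$infile" "-xsl:$xsl")
    # '-' means STDOUT, so no -o: option is added in that case
    if [[ "$outfile" != '-' ]]; then
        cmd+=("-o:$outfile")
    fi
    # Remaining arguments (e.g. '!indent=yes') are appended as given
    cmd+=("$@")
    # Print one argument per line instead of executing the command
    printf '%s\n' "${cmd[@]}"
}

build_saxon_cmd identity.xsl page.xml - '!indent=yes'
```

The parameter is single-quoted because an interactive bash would otherwise try to treat `!` as a history expansion.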
Is this ready to merge? It looks good to me! Do you agree that the problems with the ALTO files I described above are upstream problems?
Signed-off-by: Stefan Weil <sw@weilnetz.de>
The output is formatted when a Saxon serialization parameter is given on the command line. The web interface automatically uses `!indent=yes`. Signed-off-by: Stefan Weil <sw@weilnetz.de>
It used an undefined macro SAXON_JAR. Signed-off-by: Stefan Weil <sw@weilnetz.de>
c34ae24 to d1c4477
Commit OCR-D/format-converters@5b9568f was missing in our installation (fixed now). I noticed that just running
The first one throws a Java null pointer exception, the second one looks like a vendor-specific variant of ALTO. So neither problem is caused by the ocr-fileformat code.
Okay, we should report them upstream so that they will hopefully be fixed there in the future. Moreover, I created an issue about a better update mechanism. So, let me ask again: Is this ready to merge? It looks good to me! 👍
Thank you @stweil for all the work on this! 🙇‍♂️
@@ -50,6 +49,8 @@ main () {
 fi
 fi

+declare -a script_args
Why did you move that line?
I simply wanted to have it close to the first use of `script_args`.
@@ -86,8 +87,7 @@ main () {
 [[ "$outfile" != '-' ]] && script_args=("${script_args[@]}" "-o:$outfile")
 exec_saxon "${script_args[@]}"
 else
-script_args=("${script_args[@]}" "$infile")
-script_args=("${script_args[@]}" "$outfile")
+script_args=("$infile" "$outfile" "${script_args[@]}")
This is the reverse of the previous order.
It is now sorted as described in script/transform/README.md.
It was wrong before; weird that we never noticed this...
`-convert-to ALTO` argument needed for conversion from PAGE to ALTO
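To make the argument order discussed above concrete, here is a minimal sketch of how the array for a transformer script could be assembled so that the input and output files come first and any extra options (such as `-convert-to ALTO`) follow. The function name `build_script_args` and the file names are hypothetical; how the real transformer script consumes these arguments is assumed rather than confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: build_script_args and the file names are hypothetical.
build_script_args () {
    local infile="$1" outfile="$2"
    shift 2
    # Extra options collected so far, e.g. -convert-to ALTO
    local -a script_args=("$@")
    # Input and output file first, extra options afterwards
    script_args=("$infile" "$outfile" "${script_args[@]}")
    # Print one argument per line to make the order visible
    printf '%s\n' "${script_args[@]}"
}

build_script_args in.page.xml out.alto.xml -convert-to ALTO
```

Keeping the array quoted as `"${script_args[@]}"` preserves arguments that contain spaces, which matters for file names.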