Fix conversion from ALTO to PAGE and vice versa #106

Merged 9 commits on Jan 2, 2020
9 changes: 6 additions & 3 deletions README.md
@@ -144,6 +144,7 @@ Usage: ocr-transform [-dhLv] <from> <to> [<infile> [<outfile>]] [-- <script-args

Transformations:
abbyy hocr
abbyy page
alto2.0 alto3.0
alto2.0 alto3.1
alto2.0 hocr
@@ -153,8 +154,10 @@ Usage: ocr-transform [-dhLv] <from> <to> [<infile> [<outfile>]] [-- <script-args
alto page
alto text
gcv hocr
gcv page
hocr alto2.0
hocr alto2.1
hocr page
hocr text
page alto
page hocr
@@ -190,11 +193,11 @@ capable stylesheet transformer.

| From ╲ To | hOCR | ALTO | PAGEXML |
| ---: | --- | --- | --- |
| hOCR | = | ✓ | - |
| hOCR | = | ✓ | |
| ALTO | ✓ | = | ✓ |
| PAGEXML | ✓ | ✓ | = |
| FineReader | ✓ | - | - |
| Google Cloud Vision | ✓ | - | - |
| FineReader | ✓ | - | |
| Google Cloud Vision | ✓ | - | |
| TEI | ✓ | - | - |

## Validation
6 changes: 3 additions & 3 deletions bin/ocr-transform.sh
@@ -34,7 +34,6 @@ show_version () {
main () {
local from="$1" to="$2" infile='-' outfile='-' transformer
shift 2
declare -a script_args

# Validate parameters
if [[ -z "$from" ]];then
@@ -50,6 +49,8 @@ main () {
fi
fi

declare -a script_args
Collaborator: Why did you move that line?

Member Author: I simply wanted to have it close to the first use of script_args.


# <infile>
if [[ "$1" == '--' ]];then
script_args+=("${@:2}")
@@ -86,8 +87,7 @@ main () {
[[ "$outfile" != '-' ]] && script_args=("${script_args[@]}" "-o:$outfile")
exec_saxon "${script_args[@]}"
else
script_args=("${script_args[@]}" "$infile")
script_args=("${script_args[@]}" "$outfile")
script_args=("$infile" "$outfile" "${script_args[@]}")
Collaborator: This is the reverse of the previous order.

Member Author: It is now ordered as described in script/transform/README.md.

Collaborator: It was wrong before; odd that we never noticed this...

"$transformer" "${script_args[@]}"
fi
}
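
The argument order discussed in the review thread above (input file, then output file, then any extra script arguments) can be illustrated with a minimal sketch. The helper name `build_transformer_args` is hypothetical and not part of this PR; it only shows how the array ends up ordered.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): prints the argument list for
# a non-XSLT transformer script in the order
#   <infile> <outfile> [<script-args>...]
# one argument per line.
build_transformer_args() {
    local infile="$1" outfile="$2"
    shift 2
    # Remaining positional parameters are the extra script arguments.
    local -a args=("$infile" "$outfile" "$@")
    printf '%s\n' "${args[@]}"
}

# Example: read from in.xml, write to stdout, pass one extra argument.
build_transformer_args in.xml - 'extra-arg'
```

Putting the two file names first means a transformer script can always read them from `$1` and `$2`, regardless of how many extra arguments follow.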
2 changes: 1 addition & 1 deletion lib.sh
@@ -118,7 +118,7 @@ show_saxon_options () {
#{{{ run saxon / xsd-validator (xsdv.sh)
# exec_saxon ()
exec_saxon() {
(( DEBUG > 0 )) && loginfo Executing "java -jar $SAXON_JAR" "$@"
(( DEBUG > 0 )) && loginfo Executing "java -jar $SHAREDIR/vendor/saxon9he.jar" "$@"
(( DEBUG > 1 )) && SAXON_ARGS+=('-t')
java -jar "$SHAREDIR/vendor/saxon9he.jar" "$@"
}
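
The change above makes the debug message name the same jar that is actually executed. One way to keep the two from drifting apart again is to build the command once and reuse it for both logging and execution. The following is only a sketch: `exec_saxon_sketch` is a hypothetical name, the fallback value for `SHAREDIR` is an assumption, plain `echo` to stderr stands in for the project's `loginfo`, and the `SAXON_ARGS` handling is left out.

```shell
#!/bin/bash
# Assumption: SHAREDIR is normally set by the surrounding tooling; the
# fallback path below is purely illustrative.
SHAREDIR="${SHAREDIR:-/usr/local/share/ocr-fileformat}"

# Hypothetical variant of exec_saxon: the command is assembled in one
# array, so the logged command and the executed command cannot differ.
exec_saxon_sketch() {
    local -a cmd=(java -jar "$SHAREDIR/vendor/saxon9he.jar" "$@")
    if (( ${DEBUG:-0} > 0 )); then
        # Log to stderr so stdout stays reserved for the transform result.
        echo "Executing ${cmd[*]}" >&2
    fi
    "${cmd[@]}"
}
```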
1 change: 1 addition & 0 deletions script/transform/abbyy__page
27 changes: 20 additions & 7 deletions script/transform/alto__page
@@ -1,19 +1,32 @@
#!/bin/bash -x
#!/bin/bash

SCRIPTDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENDORDIR="$(cd $SCRIPTDIR/../../vendor/; pwd)"
JAR="$VENDORDIR/JPageConverter/PageConverter.jar"
INFILE="$1"
OUTFILE="$2"
ARGUMENT="$3"

if [[ "$1" = "-" ]]; then
INFILE="$(mktemp)"
cat >"$INFILE"
fi

is_temp=
if [[ "$2" = "-" ]];then
is_temp=true
if [[ "$2" = "-" ]]; then
OUTFILE="$(mktemp)"
fi

java -jar "$JAR" -neg-coords toZero -source-xml "$INFILE" -target-xml "$OUTFILE"
java -jar "$JAR" -neg-coords toZero -source-xml "$INFILE" -target-xml "$OUTFILE" 2>&1

if [[ "$1" = "-" ]]; then
rm "$INFILE"
fi

if [[ "$is_temp" = true ]];then
cat "$OUTFILE"
if [[ "$2" = "-" ]]; then
if [[ -z "$ARGUMENT" ]]; then
cat "$OUTFILE"
else
java -cp "$VENDORDIR/saxon9he.jar" net.sf.saxon.Query -s:"$OUTFILE" -qs:/ "$ARGUMENT"
fi
rm "$OUTFILE"
fi
16 changes: 12 additions & 4 deletions script/transform/gcv__hocr
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#!/bin/bash

SCRIPTDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENDORDIR="$(cd $SCRIPTDIR/../../vendor/; pwd)"
VENDORSCRIPT="$VENDORDIR/gcv2hocr/gcv2hocr"
@@ -8,15 +9,22 @@ OUTFILE="$2"
WIDTH=2000
HEIGHT=2000

is_temp=
if [[ "$2" = "-" ]];then
is_temp=true
if [[ "$1" = "-" ]]; then
INFILE="$(mktemp)"
cat >"$INFILE"
fi

if [[ "$2" = "-" ]]; then
OUTFILE="$(mktemp)"
fi

"$VENDORSCRIPT" "$INFILE" "$OUTFILE" "$WIDTH" "$HEIGHT"

if [[ "$is_temp" = true ]];then
if [[ "$1" = "-" ]]; then
rm "$INFILE"
fi

if [[ "$2" = "-" ]]; then
cat "$OUTFILE"
rm "$OUTFILE"
fi
1 change: 1 addition & 0 deletions script/transform/gcv__page
1 change: 1 addition & 0 deletions script/transform/hocr__page
1 change: 0 additions & 1 deletion script/transform/page__alto

This file was deleted.

32 changes: 32 additions & 0 deletions script/transform/page__alto
@@ -0,0 +1,32 @@
#!/bin/bash

SCRIPTDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENDORDIR="$(cd $SCRIPTDIR/../../vendor/; pwd)"
JAR="$VENDORDIR/JPageConverter/PageConverter.jar"
INFILE="$1"
OUTFILE="$2"
ARGUMENT="$3"

if [[ "$1" = "-" ]]; then
INFILE="$(mktemp)"
cat >"$INFILE"
fi

if [[ "$2" = "-" ]]; then
OUTFILE="$(mktemp)"
fi

java -jar "$JAR" -neg-coords toZero -source-xml "$INFILE" -target-xml "$OUTFILE" -convert-to ALTO 2>&1

if [[ "$1" = "-" ]]; then
rm "$INFILE"
fi

if [[ "$2" = "-" ]]; then
if [[ -z "$ARGUMENT" ]]; then
cat "$OUTFILE"
else
java -cp "$VENDORDIR/saxon9he.jar" net.sf.saxon.Query -s:"$OUTFILE" -qs:/ "$ARGUMENT"
fi
rm "$OUTFILE"
fi
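
The transform scripts in this PR all repeat the same pattern: when `-` is given as the input or output file, a temporary file stands in for stdin or stdout, because the underlying converters only work on file paths. A minimal sketch of that pattern as a reusable function might look as follows. The function name `with_std_streams` and its calling convention are hypothetical and not part of the PR; `cp` is used as a stand-in for a real converter.

```shell
#!/bin/bash
# Hypothetical wrapper (not part of the PR): run a file-to-file converter
# with "-" meaning stdin/stdout.
# Usage: with_std_streams <infile> <outfile> <converter> [<converter-args>...]
# The converter is invoked as: <converter> [<args>...] <infile> <outfile>
with_std_streams() {
    local infile="$1" outfile="$2"
    shift 2
    local tmp_in='' tmp_out='' status=0

    # Buffer stdin in a temporary file if the input is "-".
    if [[ "$infile" = "-" ]]; then
        tmp_in="$(mktemp)"
        cat >"$tmp_in"
        infile="$tmp_in"
    fi

    # Let the converter write to a temporary file if the output is "-".
    if [[ "$outfile" = "-" ]]; then
        tmp_out="$(mktemp)"
        outfile="$tmp_out"
    fi

    "$@" "$infile" "$outfile" || status=$?

    # Emit the result on stdout only if the converter succeeded.
    if [[ -n "$tmp_out" ]]; then
        if (( status == 0 )); then
            cat "$tmp_out"
        fi
        rm -f "$tmp_out"
    fi
    if [[ -n "$tmp_in" ]]; then
        rm -f "$tmp_in"
    fi
    return "$status"
}

# Example: "cp" stands in for a real converter.
echo '<alto/>' | with_std_streams - - cp
```

Unlike the scripts in the diff, this sketch removes both temporary files even when the converter fails and passes the converter's exit status back to the caller.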