Tesseract as a microservice for use with API-X.
- Install
tesseract
. On Ubuntu, this can be done withsudo apt-get install tesseract-ocr
. If you want to install extra languages, they are available as separate packages in Ubuntu. You can use apt's autocomplete to get a quick list of them. - Clone this repository somewhere in your web root.
- Install
composer
. Install instructions here. $ cd /path/to/Hypercube
and run$ composer install
- Then either
- For production, configure your web server appropriately (e.g. add a VirtualHost for Hypercube in Apache) some documentation (here)[http://silex.sensiolabs.org/doc/2.0/web_servers.html].
- For development, run the PHP built-in web server
$ php -S localhost:8888 -t src
from Hypercube root.
To use Hypercube with Apache you need to configure your Virtualhost with a few options:
- Redirect all requests to the Hypercube index.php file
- Make sure Hypercube has access to Authorization headers
Here is an example configuration for Apache 2.4:
Alias "/hypercube" "/path/to/Crayfish/Hypercube/src"
<Directory "/path/to/Crayfish/Hypercube/src">
FallbackResource /hypercube/index.php
Require all granted
DirectoryIndex index.php
SetEnvIf Authorization "(.*)" HTTP_AUTHORIZATION=$1
</Directory>
This will put the Hypercube at the /hypercube endpoint on the web server.
If your tesseract installation is not on your path, then you can configure Hypercube to use a specific executable by editing executable
entry in config.yaml.
You also will need to set the fedora base url
entry to point to your Fedora installation.
In order to work on larger images, be sure post_max_size
is sufficiently large and max_execution_time
is set to 0 in your PHP installation's ini file. You can determine which ini file is getting used by running the command $ php --ini
.
Hypercube is meant for use with API-X. It accepts only accepts one request, a GET
with the URI of a Fedora resource in the ApixLdpResource
header..
For example, suppose if you have a TIFF in Fedora at http://localhost:8080/fcrepo/rest/foo/bar
. If running the PHP built-in server command described in the Installation section:
$ curl -H "Authorization: Bearer blabhlahblah" -H "Apix-Ldp-Resource: http://localhost:8080/fcrepo/rest/foo/bar" "http://localhost:8888"
This will return the OCR generated from the TIFF in Fedora. Additional arguments to tesseract
can be provided using the X-Islandora-Args
header. For example, to change the page layout:
$ curl -H "Authorization: Bearer blabhlahblah" -H "Apix-Ldp-Resource: http://localhost:8080/fcrepo/rest/foo/bar" -H "X-Islandora-Args: -psm 9" "http://localhost:8888"
But you're probably going to use Hypercube through API-X, which exposes it as svc:ocr
. Assuming your API-X proxy is on port 8081, you can access the service with
$ curl -H "Authorization: Bearer blabhlahblah" "http://localhost:8081/services/foo/bar/svc:ocr"
Current maintainers:
If you would like to contribute, please get involved by attending our weekly Tech Call. We love to hear from you!
If you would like to contribute code to the project, you need to be covered by an Islandora Foundation Contributor License Agreement or Corporate Contributor License Agreement. Please see the Contributors pages on Islandora.ca for more information.
We recommend using the islandora-playbook to get started.