FITS runs Tika, which runs Tesseract, which is very slow #25

Open
escowles opened this issue Oct 26, 2016 · 8 comments

@escowles

Related to #18 — we found that 75% or more of the time to run FITS on our 100MB TIFF files was spent running Tesseract (run by Tika). We disabled Tika by commenting out the TikaTool line in the /path/to/fits/xml/fits.xml configuration file, and saw dramatically faster FITS execution times (20 seconds per file instead of 90+).

We updated our Ansible playbook to comment out the Tika line when we install FITS: ucsdlib/ansible-role-fits#2
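
For anyone who wants to try the same change by hand, here is a minimal sketch of commenting out the Tika tool line. It is not the Ansible role's actual task: the sample XML, the class names in it, and the helper `comment_out_tool` are assumptions for illustration, and it only handles `<tool ... />` elements that sit on a single line. Check your own fits.xml before relying on it.

```python
# Hypothetical excerpt of a fits.xml <tools> section (an assumption;
# the real file shipped with your FITS version may look different).
SAMPLE_FITS_XML = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt" classpath-dirs="lib/exiftool" />
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" exclude-exts="jar" classpath-dirs="lib/tika" />
  </tools>
</fits_configuration>
"""


def comment_out_tool(xml_text: str, tool_name: str = "TikaTool") -> str:
    """Wrap every single-line <tool ... /> element naming tool_name in an XML comment.

    Lines that are already commented out are left alone, so running the
    function twice gives the same result as running it once.
    """
    out = []
    for line in xml_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("<tool") and tool_name in stripped:
            # Preserve the original indentation and line ending.
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            out.append(f"{indent}<!-- {stripped} -->{newline}")
        else:
            out.append(line)
    return "".join(out)


if __name__ == "__main__":
    print(comment_out_tool(SAMPLE_FITS_XML))
```

To apply it to a real install you would read /path/to/fits/xml/fits.xml, pass the text through `comment_out_tool`, and write the result back, keeping a backup of the original.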

@jcoyne
Member

jcoyne commented Oct 27, 2016

/cc @mark-dce @val99erie @acozine

@mark-dce

@jcoyne - thanks for the heads-up

@jrochkind

Does anyone know what you lose by not running Tika, and whether there are any alternatives?

@jrochkind

Ah, Tesseract is OCR, I think! I wonder if you can configure Tika (as run by FITS) to skip OCR but still extract the other metadata it gets.

@mjgiarlo
Member

mjgiarlo commented Nov 2, 2017

FWIW, the way Hydra::Derivatives does full-text extraction does rely on Tika but does not rely on FITS' usage of Tika. It would be interesting to test a FITS config with Tika disabled in both this gem and Hyrax to see if it breaks any tests. If not, perhaps we should disable FITS' usage of Tika by default (and folks are always free to tweak the FITS config if they need it).

@jrochkind

I'd still want to get the other metadata extraction/validation that I think Tika is doing.

Where does the FITS config actually live in a Hyrax app or another app using hydra-derivatives?

@ghost

ghost commented Nov 2, 2017

From what I read here, it doesn't sound like the OCR step will run if Tesseract is not installed: https://wiki.apache.org/tika/TikaOCR

If it is installed, then disabling Tika for only certain file types might be a better option?

If FITS is only doing a limited job for hydra-file_characterization, we could publish a fits.xml configured so that only certain tools run. The exclude options in fits.xml could make it more efficient by running just one tool on a given file type, like EXIF for all images and so on.
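
A rough sketch of what using the exclude options could look like, assuming each tool line carries a comma-separated `exclude-exts` attribute (the attribute name, the sample line, and the helper `exclude_extensions` are assumptions here, not taken from a tested config). Note that this would skip Tika entirely for the listed extensions, not just its OCR step.

```python
import re


def exclude_extensions(tool_line: str, extensions: list[str]) -> str:
    """Return tool_line with the given extensions added to its exclude-exts attribute.

    Extensions already present are not duplicated. If the attribute is
    missing, it is added right after the opening "<tool".
    """
    match = re.search(r'exclude-exts="([^"]*)"', tool_line)
    if match is None:
        merged = []
    else:
        merged = [ext for ext in match.group(1).split(",") if ext]

    for ext in extensions:
        if ext not in merged:
            merged.append(ext)

    attribute = 'exclude-exts="' + ",".join(merged) + '"'
    if match is None:
        # No attribute yet: insert one after the first "<tool".
        return tool_line.replace("<tool", "<tool " + attribute, 1)
    return tool_line[: match.start()] + attribute + tool_line[match.end():]


if __name__ == "__main__":
    # Hypothetical Tika tool line; check it against your own fits.xml.
    line = '<tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" exclude-exts="jar" classpath-dirs="lib/tika" />'
    print(exclude_extensions(line, ["tif", "tiff"]))
```

Whether extension matching is case-sensitive in your FITS version is something to verify; if it is, you may need to list both `tif` and `TIF`.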

@escowles
Author

escowles commented Nov 2, 2017

I don't know if it's possible to disable Tesseract only for some file types, but you can disable it globally. We do this in our Ansible scripts (see https://github.com/pulibrary/princeton_ansible/pull/2/files).

In our experience, it dramatically sped up FITS/Tika times (Tesseract was the majority of FITS processing time).
