Apache Tika and FITS (File Information Tool Set) are two software solutions that extract metadata from files, and both handle a very large number of file types. FITS is itself a wrapper that runs several tools (including Apache Tika) and combines their metadata output.
Both of these have Docker options and I've played with them over the weekend.
Apache Tika can use Tesseract OCR to perform OCR on images and PDF files, which makes it a must-have in this tool set. On the other hand, it cannot run multiple parsers on the same file, which is quite limiting. The second problem is that it's written in Java, so it takes time to start up and uses a lot of resources.
FITS overcomes the single-parser problem by running multiple tools (Apache Tika, MediaInfo, File Utility, ExifTool and others) on the given file, depending on the file type. That means it can run both Apache Tika and ExifTool on an image and get both the OCR text and the metadata for that file. The downside is that it doesn't seem to include the OCR text in the XML it generates, and again, it's a Java application: slow and resource-intensive, especially when it invokes Tika.
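To get a feel for what consuming FITS output looks like, here is a minimal sketch that pulls tool-reported identities out of a FITS XML document with Python's standard library. The sample XML is a hypothetical, heavily simplified fragment for illustration only; real FITS output is much richer (and the namespace URI is the one FITS is believed to use, so verify it against your FITS version).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified FITS output fragment for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="JPEG File Interchange Format" mimetype="image/jpeg">
      <tool toolname="Jhove" toolversion="1.20.1"/>
      <tool toolname="Exiftool" toolversion="11.54"/>
    </identity>
  </identification>
</fits>
"""

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def identities(xml_text):
    """Return (format, mimetype, [tool names]) tuples from a FITS document."""
    root = ET.fromstring(xml_text)
    out = []
    for ident in root.findall(".//fits:identification/fits:identity", NS):
        tools = [t.get("toolname") for t in ident.findall("fits:tool", NS)]
        out.append((ident.get("format"), ident.get("mimetype"), tools))
    return out

if __name__ == "__main__":
    for fmt, mime, tools in identities(SAMPLE):
        print(f"{fmt} ({mime}) reported by {', '.join(tools)}")
```

The interesting part is that a single `<identity>` aggregates findings from several tools, which is exactly the multi-parser behaviour Tika alone lacks.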
So, my conclusions after this weekend:
Neither FITS nor Apache Tika does a very good job of extracting OCR text from images and PDFs; both are very slow (Java software), and FITS doesn't even return the extracted OCR text in its output XML.
I am considering using Tesseract in a "manual" manner to extract text from images.
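Calling Tesseract "manually" could look like the sketch below. The helper wrapper is my own assumption, not an established API; the CLI convention it relies on (`tesseract <image> stdout -l <lang>` printing recognized text to standard output) is standard Tesseract usage, but whether `tesseract` is installed and which languages are available depends on the host.

```python
import shutil
import subprocess

def tesseract_cmd(image_path, lang="eng"):
    """Build the Tesseract CLI invocation: OCR image_path, write text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image and return the extracted text.

    Returns None if the tesseract binary is not available on this host.
    """
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        tesseract_cmd(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout
```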
For extracting OCR from PDFs, I can use some simple script like ocr2text.
I also started to play with processing large sets of files; it might be a good idea to use PHP for that.[^1]
I'm thinking about creating a small PHP script that calculates checksums for files,[^2] but it looks like `md5sum`, `sha256sum` and `cksum` do a very good job at it, and they're widely available.
Monitoring the filesystem for updates, so we can run the metadata extraction tool and the indexer, seems to be possible via either `fswatch` or inotify.
[^1]: "Parse large number of files with PHP" script
[^2]: "Calculate file checksum in PHP" script