Apache Tika and FITS (File Information Tool Set) are two software solutions that extract metadata from files, and both handle a very large number of file types. FITS is itself a wrapper that runs several tools (including Apache Tika) and combines their metadata output.
Both of these have Docker options and I've played with them over the weekend.
Apache Tika can use Tesseract OCR to perform OCR on images and PDF files, which makes it a must-have in this tool set. On the other hand, it cannot run multiple parsers on the same file, which is quite limiting. The second problem is that it's written in Java, so it takes time to start up and uses a lot of resources.
FITS overcomes the single-parser problem by running multiple tools (Apache Tika, MediaInfo, File Utility, ExifTool and others) on the given file, depending on the file type. That means it can run both Apache Tika and ExifTool on an image and get both the OCR text and the metadata for that file. The downside is that it doesn't seem to include the OCR text in the XML it generates, and again, it's a Java application: slow and resource-intensive, especially when it invokes Tika.
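To get a feel for what consuming FITS output looks like, here is a minimal sketch that pulls tool-reported identities out of a FITS XML document with Python's standard library. The sample XML is a hypothetical, heavily simplified fragment for illustration only; real FITS output is much richer (and the namespace URI is the one FITS is believed to use, so verify it against your FITS version).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified FITS output fragment for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="JPEG File Interchange Format" mimetype="image/jpeg">
      <tool toolname="Jhove" toolversion="1.20.1"/>
      <tool toolname="Exiftool" toolversion="11.54"/>
    </identity>
  </identification>
</fits>
"""

NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

def identities(xml_text):
    """Return (format, mimetype, [tool names]) tuples from a FITS document."""
    root = ET.fromstring(xml_text)
    out = []
    for ident in root.findall(".//fits:identification/fits:identity", NS):
        tools = [t.get("toolname") for t in ident.findall("fits:tool", NS)]
        out.append((ident.get("format"), ident.get("mimetype"), tools))
    return out

if __name__ == "__main__":
    for fmt, mime, tools in identities(SAMPLE):
        print(f"{fmt} ({mime}) reported by {', '.join(tools)}")
```

The interesting part is that a single `<identity>` aggregates findings from several tools, which is exactly the multi-parser behaviour Tika alone lacks.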
So, my conclusions after this weekend:
Neither FITS nor Apache Tika does a very good job of extracting OCR text from images and PDFs; both are very slow (Java software), and FITS doesn't even return the extracted OCR text in its output XML.
I am considering using Tesseract in a "manual" manner to extract text from images.
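Calling Tesseract "manually" could look like the sketch below. The helper wrapper is my own assumption, not an established API; the CLI convention it relies on (`tesseract <image> stdout -l <lang>` printing recognized text to standard output) is standard Tesseract usage, but whether `tesseract` is installed and which languages are available depends on the host.

```python
import shutil
import subprocess

def tesseract_cmd(image_path, lang="eng"):
    """Build the Tesseract CLI invocation: OCR image_path, write text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]

def ocr_image(image_path, lang="eng"):
    """Run Tesseract on an image and return the extracted text.

    Returns None if the tesseract binary is not available on this host.
    """
    if shutil.which("tesseract") is None:
        return None
    proc = subprocess.run(
        tesseract_cmd(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return proc.stdout
```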
For extracting OCR from PDFs, I can use some simple script like ocr2text.
I also started to play with processing large sets of files; it might be a good idea to use PHP for that.[^1]
I'm thinking about creating a small PHP script that calculates checksums for files,[^2] but it looks like `md5sum`, `sha256sum` and `cksum` do a very good job at it, and they're widely available.
Monitoring the filesystem for updates, so we can run the metadata extraction tool and the indexer, seems to be possible via either `fswatch` or inotify.
[^1]: "Parse large number of files with PHP" script
[^2]: "Calculate file checksum in PHP" script