Adding notes that I sent to @aosan:
Meta-data
Alternative Scenarios
Important Meta-data Attributes
Meta-data Storage Options
Why use extended attributes (xattr)?
The initial structure mentions "the vault", which corresponds to the storage, "the index", which corresponds to the index, and "the ai".
What?
I believe the structure above is missing the meta part, which consists of analyzing a file and storing meta information about that file.
Why?
So that search, and beyond it the AI, can easily find information relevant to your query. Consider the following scenarios based on the filetype you have:
Also, and this is pretty important, it gives us the ability to add tags to files and thereby connect them, which makes it easier to find relationships between them.
How?
One simple approach would be to use extended file attributes.
Some sources mention that:
On macOS (and probably on ExtX file systems), one can easily use xattr to read/write extended attributes on files. I tried this and it looks pretty straightforward:
$ xattr -w co.pvp.plaintext "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam facilisis mauris vel eros aliquam, eget posuere augue interdum. Etiam finibus et nulla id semper. Nulla a tincidunt arcu. Donec rhoncus, dui at faucibus eleifend, nisi urna consectetur velit, et mattis eros velit nec magna. Vivamus leo augue, porta id convallis ac, tincidunt nec odio. Nulla dictum consectetur tincidunt. In sagittis dui imperdiet nibh consectetur sollicitudin. Mauris rhoncus consequat ligula et tincidunt. Nullam sed massa ultrices, condimentum dui eget, blandit nunc. Aenean libero leo, finibus non nulla id, consequat tempor ante." \
    output.mkv
Retrieving as well:
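A minimal sketch of the same write/read round trip driven from Python, assuming the macOS xattr command-line tool is on the PATH. The attribute name co.pvp.plaintext comes from the command above; the helper names (write_cmd, read_cmd, write_attr, read_attr) are hypothetical and only for illustration. Passing the arguments as a list avoids the shell-quoting problems that a long value can cause.

```python
import subprocess

ATTR_NAME = "co.pvp.plaintext"  # attribute name used in the example above


def write_cmd(path, value, name=ATTR_NAME):
    """Build the argument list for `xattr -w <name> <value> <file>`."""
    return ["xattr", "-w", name, value, path]


def read_cmd(path, name=ATTR_NAME):
    """Build the argument list for `xattr -p <name> <file>`."""
    return ["xattr", "-p", name, path]


def write_attr(path, value, name=ATTR_NAME):
    """Store `value` in the extended attribute `name` of `path` (macOS only)."""
    subprocess.run(write_cmd(path, value, name), check=True)


def read_attr(path, name=ATTR_NAME):
    """Return the value of the extended attribute `name` of `path` (macOS only)."""
    result = subprocess.run(
        read_cmd(path, name), check=True, capture_output=True, text=True
    )
    return result.stdout.rstrip("\n")
```

On a Mac, write_attr("output.mkv", "some extracted text") followed by read_attr("output.mkv") should hand back the stored text. On other systems the xattr tool may be missing, so treat this as a sketch rather than a portable implementation.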
FAQ
Why not compute the "meta" when indexing? Some of the meta is much more expensive to produce than the rest, for example using AI to describe images (and, in the future, videos), or running OCR on videos or images. You don't want to reprocess everything every time you build the index. The most efficient way is to store the meta data attached to the file.
What happens when you change a file? The meta will remain the same. Our processed meta information should therefore come with a hash/checksum of the file, so that when the file changes, we know the meta data is stale and needs reprocessing.
Why not keep this data in a database, external to the storage? My thinking is that when we copy a folder, or move files from one place to another, the meta data should travel along with them, so there's no need to reprocess files that have already been processed.
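The checksum idea from the FAQ above could be sketched roughly like this. It assumes SHA-256 as the hash and a small JSON record as the stored format; neither is decided in this thread, and the function names (file_sha256, make_record, is_stale) are hypothetical. The record string is what would be written into an extended attribute next to the file.

```python
import hashlib
import json


def file_sha256(path, chunk_size=1 << 20):
    """Hash the file contents in chunks so large media files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_record(path, meta):
    """Bundle the extracted meta with the checksum of the file it describes."""
    return json.dumps({"sha256": file_sha256(path), "meta": meta}, sort_keys=True)


def is_stale(path, record):
    """True when the file no longer matches the checksum stored in the record."""
    return json.loads(record)["sha256"] != file_sha256(path)
```

An indexer would call is_stale before trusting stored meta and rerun the expensive extraction (OCR, image description) only when it returns True.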
Tools
This is not something we should build ourselves, but we should find an extensible way to extract this information.
Some tools that already pull out meta information from various files:
There is even a tool wrapper for extracting meta data, called File Information Tool Set (FITS).
P.S. Apache Tika apparently supports chaining Tesseract OCR, so that OCR is performed on the contents of images.
Conclusion
The storage system must support extended attributes because we are using them to properly index the content, and we should use an extensible open-source library to extract meta data from stored files.