The Meta component in the Storage, needed by the Index #6

dlucian · 2023-04-16T16:15:12Z

dlucian
Apr 16, 2023
Maintainer

The initial structure mentions "the vault" that corresponds to the storage, "the index" that corresponds to the index and "the ai".

What?

I believe that the above is missing the Meta part, which consists of analyzing a file and storing meta information about that file.

Why?

To enable the search and beyond (AI) to easily find relevant information for your query. Consider the following scenarios based on the filetype you have:

PDF: you'll want a stripped-down plain-text version of the PDF so you can index it
PPT(X): you'll want the plain-text data in the PPT
JPG,PNG: you'll want EXIF info + description of the image (AI-powered)
MP3: you'll want info such as album, artist, genre and maybe even BPM (some might be AI-powered in the future)
MP4: you'll want a plain-text version (e.g. a transcript) of whatever is in the video file so you can index it
MD,TXT: you'll want to know it's a plaintext file that you can fully index
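
The scenarios above could be captured as a small lookup table from file extension to the kinds of meta worth extracting. This is only a minimal sketch: the labels (`plain_text`, `exif`, `ai_description`, `audio_tags`, `transcript`) and the function name `meta_kinds_for` are made up for illustration and don't exist anywhere in the project.

```python
from pathlib import Path

# Hypothetical labels for the kinds of meta each file type needs,
# following the scenarios listed above.
META_KINDS = {
    ".pdf": ["plain_text"],
    ".ppt": ["plain_text"],
    ".pptx": ["plain_text"],
    ".jpg": ["exif", "ai_description"],
    ".png": ["exif", "ai_description"],
    ".mp3": ["audio_tags"],
    ".mp4": ["transcript"],
    ".md": ["plain_text"],
    ".txt": ["plain_text"],
}


def meta_kinds_for(path: str) -> list:
    """Return the kinds of meta to extract for a file, based on its extension."""
    suffix = Path(path).suffix.lower()
    # Unknown file types get no meta rather than raising an error.
    return META_KINDS.get(suffix, [])
```

A table like this keeps the "which extractor for which file" decision in one place, so new file types can be added without touching the indexing code.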

Also, and this is pretty important: we need the ability to add tags to files, thereby connecting them together. This makes it easier to find related files.

How?

One simple approach would be to use extended file attributes.

In Linux, the ext2, ext3, ext4, JFS, Squashfs, UBIFS, Yaffs2, ReiserFS, Reiser4, XFS, Btrfs, OrangeFS, Lustre, OCFS2 1.6, ZFS, and F2FS filesystems support extended attributes (abbreviated xattr) when enabled in the kernel configuration. (via google)

Some sources mention that:

File systems which do not support extended attributes: NFS as a general rule. However, support is now available in some NFSv4 implementations. RFC 8276 tries to standardize this so it may be available in NFS as well.
Cloud and similar services which provide full support for extended attributes aren't known, although in some circumstances iCloud Drive can.
Services which do not support extended attributes: Box, Google Drive, Microsoft OneDrive, OmniPresence, SugarSync, Sync.

On macOS (and probably on ext2/3/4 file systems), one can use the xattr command to read and write extended attributes on files.

I tried this, and it looks pretty straightforward:

$ xattr -w co.pvp.plaintext "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam facilisis mauris vel eros aliquam, eget posuere augue interdum. Etiam finibus et nulla id semper. Nulla a tincidunt arcu. Donec rhoncus, dui at faucibus eleifend, nisi urna consectetur velit, et mattis eros velit nec magna. Vivamus leo augue, porta id convallis ac, tincidunt nec odio. Nulla dictum consectetur tincidunt. In sagittis dui imperdiet nibh consectetur sollicitudin. Mauris rhoncus consequat ligula et tincidunt. Nullam sed massa ultrices, condimentum dui eget, blandit nunc. Aenean libero leo, finibus non nulla id, consequat tempor ante." \
output.mkv

Retrieving as well:

$ xattr -l ./output.mkv
co.pvp.plaintext: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nullam facilisis mauris vel eros aliquam, eget posuere augue interdum. Etiam finibus et nulla id semper. Nulla a tincidunt arcu. Donec rhoncus, dui at faucibus eleifend, nisi urna consectetur velit, et mattis eros velit nec magna. Vivamus leo augue, porta id convallis ac, tincidunt nec odio. Nulla dictum consectetur tincidunt. In sagittis dui imperdiet nibh consectetur sollicitudin. Mauris rhoncus consequat ligula et tincidunt. Nullam sed massa ultrices, condimentum dui eget, blandit nunc. Aenean libero leo, finibus non nulla id, consequat tempor ante.

FAQ

Why not compute the "meta" when indexing?
Some meta is much more expensive to extract than other meta, for example using AI to describe images and (in the future) videos, or running OCR on videos or images. You don't want to reprocess everything every time you build the index. The most efficient way is to store the meta data attached to the file.

What happens when you change a file?
The stored meta would remain the same, so our processed meta information should come with a hash/checksum of the file. When the file changes, the checksum no longer matches, and we know the meta data is stale and needs reprocessing.
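
The staleness check could look roughly like this. It is a sketch under assumptions: SHA-256 is just one reasonable choice of hash, and the names `file_checksum` and `is_stale` are invented here. Where the stored checksum comes from (xattr, sidecar file, database) is left open.

```python
import hashlib
from typing import Optional


def file_checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read 64 KiB at a time so large media files don't fill memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_stale(path: str, stored_checksum: Optional[str]) -> bool:
    """True if no checksum was stored or the content changed since then."""
    return stored_checksum is None or stored_checksum != file_checksum(path)
```

Hashing the whole file on every check is itself not free for big videos; comparing size and modification time first would be a cheaper pre-filter, at the price of occasionally missing a change.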

Why not keep this data in a database, external to the storage?
My thinking is that when we copy a folder, or move files from one place to another, the meta data travels with the files (as long as the copy tool preserves extended attributes), so there's no need to reprocess files that have already been processed.

Tools

This is not something we should build ourselves, but we should find an extensible way to extract this information.

Some tools that already pull out meta information from various files:

ExifTool by Phil Harvey https://exiftool.org
Apache Tika https://tika.apache.org (see formats)
Metadata Extraction Tool https://meta-extractor.sourceforge.net (no recent activity)

There is even a tool wrapper for extracting meta data, called File Information Tool Set (FITS).

P.S. Apache Tika apparently supports chaining TesseractOCR, to have OCR performed on the contents of the image.

Conclusion

The Storage system must support Extended Attributes, because we will use them to properly index the content, and we should use an extensible open-source library to extract meta data from stored files.

dlucian · 2023-04-19T07:27:21Z

dlucian
Apr 19, 2023
Maintainer Author

# other ways to set extended attributes
attr -s vasile.toarna -V "Lorem ipsum doloret!" ./meow.txt

# other ways to get extended attributes
attr -l ./meow.txt
attr -g vasile.toarna ./meow.txt
getfattr -d -m . ./panda.txt
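
The same round trip can be done from Python on Linux with the standard library's os.setxattr and os.getxattr (these functions are not available on macOS). A minimal sketch: the `user.co.pvp.` prefix just reuses the `co.pvp` name from the earlier xattr example, and the helper names are made up.

```python
import os

# Linux requires user-defined attributes to live in the "user." namespace.
# The "co.pvp." part is borrowed from the earlier example and is an assumption.
PREFIX = "user.co.pvp."


def attr_name(key: str) -> str:
    """Build the full attribute name for a meta key."""
    return PREFIX + key


def write_meta(path: str, key: str, value: str) -> None:
    """Store a text value as an extended attribute on the file."""
    os.setxattr(path, attr_name(key), value.encode("utf-8"))


def read_meta(path: str, key: str) -> str:
    """Read a text value back from the file's extended attributes."""
    return os.getxattr(path, attr_name(key)).decode("utf-8")
```

Both calls raise OSError when the underlying file system doesn't support extended attributes, so callers need a fallback for that case.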


dlucian · 2023-05-08T08:49:36Z

dlucian
May 8, 2023
Maintainer Author

Adding notes that I sent to @aosan:

Meta-data

Challenges with meta-data: Unlike Google and Yahoo, it might be costly for us to extract meta-data (in energy or money). For example, I created a small project that records my screen at a frame rate of 0.2 fps (1 screenshot every 5 seconds) and creates small 10-minute .mp4 clips.
Initial idea: I initially made it perform OCR on the active window, but later realized that I don't want to spend processing power on this while recording. A simple screen video is enough, because what I want is to apply a video-to-OCR tool afterwards to extract a transcription of what is on the screen. Why? To easily find something I read/saw/did/wrote some time ago but can't remember exactly what it was.

Alternative Scenarios

Personal video footage: Soon, we will be able to easily extract a frame every few seconds and use an AI tool to describe the frame. The result would be a textual description of the clips, an excellent candidate for full-text search, embeddings, or an LLM.
Whisper API: This API could transcribe the speech in all audio files in the vault. Although it's not expensive, transcription has a cost.

Important Meta-data Attributes

I believe there should be a set of attributes for as many files as possible: creation timestamp and location (latitude/longitude - mostly for photos/videos/audio). In some cases, the creation timestamp can be obtained from the file (attribute), but there are situations when it's extracted differently (e.g., EXIF info).
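
Picking the creation timestamp could work roughly like this. A sketch with assumptions spelled out: the EXIF value is expected in the usual "YYYY:MM:DD HH:MM:SS" form and is passed in by whatever extractor is used; the fallback is the file's modification time, which is only a stand-in, since a true creation time isn't available on every platform. The function name is invented.

```python
import os
from datetime import datetime
from typing import Optional

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"  # e.g. "2023:04:16 16:15:12"


def resolve_created_at(path: str, exif_datetime: Optional[str] = None) -> datetime:
    """Prefer the EXIF capture time; fall back to the file's modification time."""
    if exif_datetime:
        try:
            return datetime.strptime(exif_datetime, EXIF_FORMAT)
        except ValueError:
            pass  # malformed EXIF value, fall through to the file system
    # Assumption: mtime is used as an approximation of the creation time.
    return datetime.fromtimestamp(os.stat(path).st_mtime)
```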

Meta-data Storage Options

This meta-data, whatever it may be, must be handled in one of four ways: (1) used immediately (at indexing time) and not kept, (2) saved in a separate file, (3) stored in a database, or (4) stored as an extended attribute. There are pros and cons for each option. It might be worth making this configurable.
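
A configurable option might be modelled as an enum with one backend per choice. This is just a sketch: all names are hypothetical, the `.meta.json` sidecar naming is an assumption, and only the sidecar backend is filled in.

```python
import json
from enum import Enum
from pathlib import Path


class MetaStorage(Enum):
    INDEX_ONLY = "index_only"  # (1) use immediately, keep nothing
    SIDECAR = "sidecar"        # (2) separate file next to the original
    DATABASE = "database"      # (3) external database
    XATTR = "xattr"            # (4) extended attribute on the file


def sidecar_path(path: str) -> Path:
    """Hypothetical naming scheme: photo.jpg -> photo.jpg.meta.json"""
    return Path(path + ".meta.json")


def save_meta(path: str, meta: dict, storage: MetaStorage) -> None:
    """Persist meta according to the configured storage option."""
    if storage is MetaStorage.INDEX_ONLY:
        return  # nothing to persist, the index consumes it directly
    if storage is MetaStorage.SIDECAR:
        sidecar_path(path).write_text(json.dumps(meta), encoding="utf-8")
        return
    raise NotImplementedError(f"{storage.value} backend is not sketched here")
```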

Why use extended attributes (xattr)?

I would choose xattr because, regardless of where you move the file, as long as you don't modify its content, the meta-data is valid and attached to the file. If it's separate, you either have to be careful to move the meta-data as well, or it needs to be re-processed.
