Switch to SQLite DB storage #187
Conversation
The thread about the DB is very tl;dr, so if you don't mind, some questions about the final DB schema attached:
Entry_attribute replaces all references to fields and tags in the data storage, so it's essentially the storage of all the attrs of the key:attr pairs. It maps entries to their metadata, with tags as keys (stored in the tag table) and the attrs stored in the entry_attribute table.
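As a rough sketch of how I'm reading that description (the table and column names here are guesses, not the actual schema from this PR's create_db.sql):

```python
import sqlite3

# Hypothetical reconstruction of the described relationship, not the real schema.
con = sqlite3.connect("library.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS entry (
    id INTEGER PRIMARY KEY
);
CREATE TABLE IF NOT EXISTS tag (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per key:attr pair attached to an entry; the tag acts as the key.
CREATE TABLE IF NOT EXISTS entry_attribute (
    entry_id INTEGER NOT NULL REFERENCES entry(id),
    tag_id INTEGER NOT NULL REFERENCES tag(id),
    attr TEXT
);
""")
```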
Question on ignored extensions: is the plan that files with those extensions are ignored by the database (no entries generated), or just hidden from the UI? Trying to see if that info should be stored as a UI settings item, or as another table in the DB (currently commented out).
I was intending for them to be hidden on the UI side, so the library doesn't have to rescan whenever you make changes to the ignore list.
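For concreteness, UI-side hiding could be as simple as filtering at display time. A minimal sketch, with made-up names (`ignored_extensions` and the `Entry.path` attribute are assumptions, not this PR's API):

```python
from pathlib import Path

# Hypothetical: entries stay in the database; the view just skips ignored extensions.
ignored_extensions = {".tmp", ".bak"}

def visible_entries(entries):
    """Yield entries whose file extension is not on the ignore list."""
    for entry in entries:
        if Path(entry.path).suffix.lower() not in ignored_extensions:
            yield entry
```

Changing the ignore list then only changes what's displayed; no rescan needed.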
This PR could get quite big (which is okay) as it's reimplementing many core features. But could we use this as a starting point to refactor some components out of here before continuing? Namely:

**Decouple the Library from the storage backend.** Let the storage backend handle data storage in the DB. The library can manage CRUD, caching, linking, and other management-related features, but let the storage backend handle (and optimize) the implementation.
**Isolate the filesystem implementation from TagStudio internals.** This would fix the inability to reference files if they move or are deleted. We should have a module that handles the management of files, e.g., their IDs, location, system metadata, etc., and provides an API for libraries to interact with them.
**Scopes and defaults.** Instead of each library managing its own implementation of Tags, their storage, and their defaults, have the Tag implementation be separate. Since the storage is already abstracted, the Tag Manager can handle this by creating, managing, and storing tags and their relationships (not sure if this is a goal, but tag relationships could be more than just parent-child) in Global Scope. Then, the application could provide a UI to manage these (and import across libraries), and individual libraries can manage local tags and their file associations. This would also allow for easy imports of tags and moving them around libraries in a user-friendly fashion. In the future, if we want user plugins for adding tags (like image classification or OCR plugins), those would interop with this API for adding tags and then with the libraries API for linking them. I'm happy to start on some of these (like filesystem and storage), but it's up to @CyanVoxel to see if he thinks this is a good direction.
Thanks for the comments; yeah, this really would be a big one. I just wanted to get the discussion going and loop in some of the GitHub crowd.
I believe this was one of the end goals for this, though definitely not touched on in the first stages.
Agreed; it was never the intention for an entire library to live in memory at once long-term, but since that's how it's currently implemented, I was looking at incremental changes to make that more possible.
This level of filesystem interaction is well beyond my existing knowledge, but I would be interested in learning about it. I'm not seeing clear ways for these metadata structures to resolve back to their file data, such that things like thumbnails and opening with system default viewers would be achievable without falling back to system calls to resolve the filename. Or is the thought more that this implementation would scan a directory, resolve the filesystem IDs from the file names, and use that to internally translate between file names and OS-level file identifiers? (e.g. I move
I think this is sort of being shifted towards just by having the tags live in the database: there wouldn't be a list of defaults in the source code; they would instead be pulled from storage. The current defaults would just be created as defaults in the storage solution, since that simplifies the transition.
Kind of; essentially, I'm saying to separate concerns. For now, abstract the storage implementation specifics out of the TagStudio Library class/implementation. We could do a Factory or Prototype pattern, or just provide an abstract implementation. The Library should be agnostic to the storage backend; each storage implementation would then handle figuring out how to actually implement the methods. (And avoid tangling the GUI with any of this; it becomes a big hot mess really fast.) See projects like Napari for an idea of structuring larger PyQt projects.

```python
from abc import ABC, abstractmethod


class StorageInterface(ABC):
    @abstractmethod
    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        pass

    @abstractmethod
    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None:
        pass

    # ... more methods (CRUD for entries, tags, etc.)
```
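To illustrate the idea, a SQLite-backed implementation could then satisfy that interface without the Library knowing anything about SQL. A minimal sketch, assuming the `Tag`, `Entry`, and `Association` types from the interface above are importable and have `id`/`name` attributes (the table and column names are placeholders, not this PR's schema):

```python
import sqlite3


class SqliteStorage(StorageInterface):
    """Sketch of one possible backend; table and column names are assumptions."""

    def __init__(self, db_path: str) -> None:
        self.con = sqlite3.connect(db_path)

    def attach_tag_entry(self, tag: Tag, entry: Entry) -> None:
        # The Library never sees this SQL; it only calls the interface method.
        self.con.execute(
            "INSERT INTO entry_attribute (entry_id, tag_id) VALUES (?, ?)",
            (entry.id, tag.id),
        )
        self.con.commit()

    def link_tags(self, tag1: Tag, tag2: Tag, association: Association) -> None:
        self.con.execute(
            "INSERT INTO tag_link (tag1_id, tag2_id, association) VALUES (?, ?, ?)",
            (tag1.id, tag2.id, association.name),
        )
        self.con.commit()
```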
We don't need to translate between file names and the ID. The file name, path, ID, and other metadata are already attached to the file. If we use the path as the identifier, we run into linking issues as files get moved around; if we use a hash, the hash changes when internal data is modified (like if you crop a photo). The ID is a more consistent identifier (though not guaranteed to always be the same: on Windows, if the file moves across drives, the volume ID, a part of the whole ID, changes). But take, for example, the directory below.
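Something like this layout (the exact listing from the original comment wasn't preserved, so the names here are illustrative):

```
library/
├── games/
└── screenshot.png
```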
Say I have all my tags already associated with the png, and the file then moves under games:
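(Again, names illustrative:)

```
library/
└── games/
    └── screenshot.png
```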
We would lose the association, as the path has changed. This could get really bad if you're moving around more than just a few files after you've spent time tagging them. And if I happen to crop or modify it in some way afterwards, almost any hash I know of (md5, sha, crc64) would change (and they're also expensive to calculate as the file size grows). The ID would not, preserving the links. Not perfect, but I believe it's better. An example implementation for this:

```python
import ctypes
import ctypes.wintypes
import os
from datetime import datetime, timezone


def _filetime_to_dt(ft):
    # Windows FILETIME counts 100-ns intervals since 1601-01-01; convert to Unix epoch.
    us = (ft.dwHighDateTime << 32) + ft.dwLowDateTime
    us = us // 10 - 11644473600000000
    return datetime.fromtimestamp(us / 1e6, tz=timezone.utc)


def _get_windows_metadata(file_path: str):
    try:
        file_handle = ctypes.windll.kernel32.CreateFileW(
            file_path, 0x00, 0x01 | 0x02 | 0x04, None, 0x03, 0x02000000, None
        )
        if file_handle == -1:
            raise ctypes.WinError()
        info = ctypes.wintypes.BY_HANDLE_FILE_INFORMATION()
        if not ctypes.windll.kernel32.GetFileInformationByHandle(file_handle, ctypes.byref(info)):
            raise ctypes.WinError()
        ctypes.windll.kernel32.CloseHandle(file_handle)
        return {
            "path": file_path,
            # Volume serial number + file index together identify the file.
            "uid": f"{info.dwVolumeSerialNumber}{info.nFileIndexHigh}{info.nFileIndexLow}",
            "size": (info.nFileSizeHigh << 32) + info.nFileSizeLow,
            "creation_time": _filetime_to_dt(info.ftCreationTime),
            "last_access_time": _filetime_to_dt(info.ftLastAccessTime),
            "last_write_time": _filetime_to_dt(info.ftLastWriteTime),
        }
    except Exception as e:
        return {"error": str(e)}


def _get_unix_metadata(file_path):
    try:
        stats = os.stat(file_path)
        return {
            "path": file_path,
            # Device + inode identify the file on POSIX filesystems.
            "uid": f"{stats.st_dev}{stats.st_ino}",
            "size": stats.st_size,
            "creation_time": datetime.fromtimestamp(stats.st_ctime),
            "last_access_time": datetime.fromtimestamp(stats.st_atime),
            "last_write_time": datetime.fromtimestamp(stats.st_mtime),
        }
    except Exception as e:
        return {"error": str(e)}
```
I can see the flexibility gained from such a system; I'll look into abstract classes and Prototypes a bit more. I'll admit I tend to lean away from them because I'm not normally writing things that need plugins or configurable backends. For Napari, I see they went with prototypes, but that repo is a lot to take in when trying to understand the structure of what they did and why. I'll see if I can look it over a bit more when I have more time.
I think I agree and am following on this. So to look up tags for a file, you would select the file, parse the system metadata, and use the system ID as the lookup key.
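In other words, a sketch under the same assumptions as the earlier snippets (the `entry` table and its `uid` column are placeholders, and `get_file_metadata` is the hypothetical dispatcher from above):

```python
import sqlite3


def tags_for_file(con: sqlite3.Connection, file_path: str) -> list[str]:
    # Derive the stable system ID for the file, then use it as the lookup key.
    uid = get_file_metadata(file_path)["uid"]
    rows = con.execute(
        """
        SELECT tag.name
        FROM entry
        JOIN entry_attribute ON entry_attribute.entry_id = entry.id
        JOIN tag ON tag.id = entry_attribute.tag_id
        WHERE entry.uid = ?
        """,
        (uid,),
    )
    return [name for (name,) in rows]
```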
Napari is a great project, and I recommend giving it a look, but it has a different goal. We don't need to copy its systems per se; the idea is just that they've been able to manage the separation of concerns pretty well in a larger Python Qt project. PyQt is nice as it's really easy to get started and have an MVP fast, but as soon as it grows in complexity and contributors, the difficulty can ramp up fast. Separation of concerns, types, and documentation all really help here.
Exactly! This should minimize
Yeah, I wouldn't think of copying verbatim, just looking for an understanding of the separation. After exploring for a little bit, and especially with the potential for future plugins, I'll be looking at Protocols for this PR, but I'm still open to changes if there's a better suggestion.
Not going to lie, that sounds pretty appealing. I'm sure there are still some cases this won't catch, but we would have those either way. I'll probably start working that way unless I hear direction otherwise or there are solid points against it.
…a standalone object
Add `save` to the DataSource Protocol
comply with json_typing structures
…o feature/SQLite-database
A DataSource can operate CRUD on one item at a time or get all. Reads are passed an ID; Creates, Updates, and Deletes are passed objects (tag, entity, etc.).
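Given the earlier mention of Protocols, that contract might be sketched along these lines (method names and types here are my guesses, not this PR's actual code):

```python
from typing import Iterable, Protocol, TypeVar

T = TypeVar("T")  # a stored object: tag, entity, location, etc.


class DataSource(Protocol[T]):
    def get(self, item_id: int) -> T: ...          # reads are passed an ID
    def get_all(self) -> Iterable[T]: ...
    def create(self, item: T) -> None: ...         # writes are passed the object
    def update(self, item: T) -> None: ...
    def delete(self, item: T) -> None: ...
```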
Trying to separate the library from the data source is just leading me to either keep the library functioning on JSON or implement some naive ORM. As a result, I'm closing this in favor of #190. I think there are still some good ideas for identification and deduplication that came out of the discussion, but I'm not going to be able to match the performance or maintainability of code written with SQLAlchemy; as stated, that library also opens the backend to a variety of SQL dialects.
The only reason this is going up in this state is that it's sat on my computer too long already; maybe if it's out there, I'll move faster on the actual code writing.
- `src\core\sql_library.py` representing the in-memory data structures
- `src\core\create_db.sql` representing the schema for the database

Incredibly rough draft that isn't close to done:
- Only `Location` and `Entry` have had a cursory initial pass for features, with many missing features of the existing library.
- First thought is that `Library` handles all instantiation and processing, with the remaining objects primarily being memory caching to prevent slowdowns in the initial phases, when things like `src\qt\modals\tag_database.py` would otherwise try to read the database one tag at a time.

Initial DB Schema Graphic