Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use UUIDs for primary keys #3

Closed
dlqqq opened this issue Sep 7, 2022 · 6 comments · Fixed by #30
Closed

use UUIDs for primary keys #3

dlqqq opened this issue Sep 7, 2022 · 6 comments · Fixed by #30
Assignees
Labels
enhancement New feature or request

Comments

@dlqqq
Copy link
Collaborator

dlqqq commented Sep 7, 2022

Preferring UUIDs over integers for primary keys in the Files table should be a configurable trait on the FileIdManager class.

Use UUIDs for file IDs instead of integers.

See below discussion for further context.

@dlqqq dlqqq added the enhancement New feature or request label Sep 7, 2022
@kevin-bates
Copy link
Member

Will UUIDs and Integers be included in the table regardless of the configuration and the configuration merely identifies which of the two is the PK? Or is this a one-time option and requires deletion of the database to change?

@dlqqq
Copy link
Collaborator Author

dlqqq commented Sep 16, 2022

I think it makes sense for this to be a one-time option. Keep in mind other extensions may associate data (e.g. comments) with a file ID. If you change the format of the file ID, then these associations are retroactively broken.

@kevin-bates
Copy link
Member

I think it makes sense for this to be a one-time option. Keep in mind other extensions may associate data (e.g. comments) with a file ID. If you change the format of the file ID, then these associations are retroactively broken.

That's exactly right. This is another reason UUID should be the PK, because once the "ID" is out in the wild, it's a done deal.

Objects containing file ID references will undoubtedly find their way to other systems (where "other systems" is defined to be a set of Jupyter servers using a different FileID database). When that happens, there will be no way to determine that their ID reference of "42" is the same as the file associated with ID 42 in this "other system". The association will be made and we'll essentially have a data integrity violation on our hands. External references unrelated to this file ID 42 now have no relation to the document in which they were intended. Folks will see comments that they shouldn't see nor have any context to the actual document. That is a problem that will continue until the foreign ID reference is greater than the total number of files indexed in the target database. Once the target database grows large enough, however, then foreign ID references that used to correctly fail will suddenly start (incorrectly) succeeding due to this issue.

Using UUIDs, the attempt to associate the foreign objects will correctly fail with a not found exception - always. In addition, UUIDs allow exporting/importing files from one database to another without conflict. In essence, the local database provides information relative to that local file, but its ID is its true identifier everywhere.

Having a sequenced column (in addition to a UUID PK) makes sense for determining insertion order, but one could argue a sufficiently precise timestamp would be more useful in that it provides history in time rather than only relative history.

@dlqqq
Copy link
Collaborator Author

dlqqq commented Oct 25, 2022

@kevin-bates

So, thinking more deeply, I think it's definitely true that it makes no sense for this to be configurable, since you can't really switch between integer and UUID PKs without breaking every other service that uses file IDs. So likely, we either want to provide two separate implementations (with two different default db_paths of course), or just be opinionated and stick to one.

My only objection to using UUIDs was the performance concern associated with generating and indexing by them. However, this was before the days of sync_all() and the out-of-band handling, and the overhead added by that should be a few orders of magnitude slower than any performance cost associated with using UUIDs.

Furthermore, I want to apologize for being unable to understand the use-case up until now. I think I get it now. Imagine a network of multiple Jupyter servers, and there is a main server responsible for associating a file ID with some data. Then if the other Jupyter servers send the payload with the file ID to the main server, and that file ID is not globally unique, then this main server would have to have a separate table mapping "local file ID + server ID" to "global file IDs", and do a JOIN query every time to return the data associated with a local file ID. That just seems silly and brittle.

All things considered, I am now in favor of switching our default implementation to using UUIDs. However, I want to emphasize one major point: globally unique file IDs should not be used to shared between copies of files from different filesystems. When copying a file from one Jupyter server to another, a new UUID should be generated, since that is a copy. Sharing file IDs between different files defeats the purpose of file ID, and opens you up to all sorts of weird consistency issues with sharing state between servers.

Thank you for pushing repeatedly on this. Future-proofing file ID definitely is best done before any users are installing this at home. I'll need buy-in from the RTC folks, but I'm confident we can move forward with your proposal.

@dlqqq dlqqq changed the title new config option for UUIDs for primary keys use UUIDs for primary keys Oct 25, 2022
@kevin-bates
Copy link
Member

Thank you @dlqqq. I agree with your comment regarding copying files in principle. However, UUIDs enable the ability for applications to be developed to merge environments without concern of conflicts. For example, an application may wish to offer the ability to preserve path and ID, but update the inode (or even move to an entirely different filesystem) knowing the previous system's filesystem will not be used (although that's their decision). UUIDs enable that choice because entries in one DB are unique in another thereby eliminating the need to update external references (which is like herding cats).

My only objection to using UUIDs was the performance concern associated with generating and indexing by them.

FWIW, prior to coming into Open Source and Jupyter, I spent 12+ years with an ECM application in which every primary key (and other indices) were UUIDs across dozens of tables in SQLServer, DB2, Oracle, and other "enterprise" DBs. The application was optimized for retrievals but also supported bulk, high-volume, ingestion rates, and performance due to our use of UUIDs was not an issue.

That said, I would recommend that implementations convert their string-based IDs to/from binary forms as that will be faster and half the space (16 bytes vs. 36 characters) - but that's up to the implementation. Fortunately, Jupyter is not a high-volume, 1000+ user application so that kind of optimization may not be critical.

@dlqqq dlqqq self-assigned this Oct 26, 2022
@dlqqq
Copy link
Collaborator Author

dlqqq commented Oct 26, 2022

@kevin-bates

That said, I would recommend that implementations convert their string-based IDs to/from binary forms as that will be faster and half the space (16 bytes vs. 36 characters) - but that's up to the implementation. Fortunately, Jupyter is not a high-volume, 1000+ user application so that kind of optimization may not be critical.

I'm not going to add this in for my initial PR, since the performance improvement (if any) is unclear. While this does decrease the size of each record, the difference is small relative to the size of the file paths and other columns, and may actually worsen performance from the overhead of converting UUIDs to/from binary in Python. I would only add this if there were performance benchmarks to validate the efficacy of this approach.

Anecdotal evidence that binary UUIDs may only have a marginal performance enhancement: https://vespa-mrs.github.io/vespa.io/development/project_dev/database/DatabaseUuidEfficiency.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants