use UUIDs for primary keys #3

dlqqq · 2022-09-07T16:45:40Z

~~Preferring UUIDs over integers for primary keys in the Files table should be a configurable trait on the FileIdManager class.~~

Use UUIDs for file IDs instead of integers.

See below discussion for further context.

The text was updated successfully, but these errors were encountered:

kevin-bates · 2022-09-15T23:01:31Z

Will UUIDs and Integers be included in the table regardless of the configuration and the configuration merely identifies which of the two is the PK? Or is this a one-time option and requires deletion of the database to change?

dlqqq · 2022-09-16T23:22:29Z

I think it makes sense for this to be a one-time option. Keep in mind other extensions may associate data (e.g. comments) with a file ID. If you change the format of the file ID, then these associations are retroactively broken.

kevin-bates · 2022-09-17T14:44:01Z

I think it makes sense for this to be a one-time option. Keep in mind other extensions may associate data (e.g. comments) with a file ID. If you change the format of the file ID, then these associations are retroactively broken.

That's exactly right. This is another reason UUID should be the PK, because once the "ID" is out in the wild, it's a done deal.

Objects containing file ID references will undoubtedly find their way to other systems (where "other systems" is defined to be a set of Jupyter servers using a different FileID database). When that happens, there will be no way to determine that their ID reference of "42" is the same as the file associated with ID 42 in this "other system". The association will be made and we'll essentially have a data integrity violation on our hands. External references unrelated to this file ID 42 now have no relation to the document in which they were intended. Folks will see comments that they shouldn't see nor have any context to the actual document. That is a problem that will continue until the foreign ID reference is greater than the total number of files indexed in the target database. Once the target database grows large enough, however, then foreign ID references that used to correctly fail will suddenly start (incorrectly) succeeding due to this issue.

Using UUIDs, the attempt to associate the foreign objects will correctly fail with a not found exception - always. In addition, UUIDs allow exporting/importing files from one database to another without conflict. In essence, the local database provides information relative to that local file, but its ID is its true identifier everywhere.

Having a sequenced column (in addition to a UUID PK) makes sense for determining insertion order, but one could argue a sufficiently precise timestamp would be more useful in that it provides history in time rather than only relative history.

dlqqq · 2022-10-25T21:33:26Z

@kevin-bates

So, thinking more deeply, I think it's definitely true that it makes no sense for this to be configurable, since you can't really switch between integer and UUID PKs without breaking every other service that uses file IDs. So likely, we either want to provide two separate implementations (with two different default db_paths of course), or just be opinionated and stick to one.

My only objection to using UUIDs was the performance concern associated with generating and indexing by them. However, this was before the days of sync_all() and the out-of-band handling, and the overhead added by that should be a few orders of magnitude slower than any performance cost associated with using UUIDs.

Furthermore, I want to apologize for being unable to understand the use-case up until now. I think I get it now. Imagine a network of multiple Jupyter servers, and there is a main server responsible for associating a file ID with some data. Then if the other Jupyter servers send the payload with the file ID to the main server, and that file ID is not globally unique, then this main server would have to have a separate table mapping "local file ID + server ID" to "global file IDs", and do a JOIN query every time to return the data associated with a local file ID. That just seems silly and brittle.

All things considered, I am now in favor of switching our default implementation to using UUIDs. However, I want to emphasize one major point: globally unique file IDs should not be used to shared between copies of files from different filesystems. When copying a file from one Jupyter server to another, a new UUID should be generated, since that is a copy. Sharing file IDs between different files defeats the purpose of file ID, and opens you up to all sorts of weird consistency issues with sharing state between servers.

Thank you for pushing repeatedly on this. Future-proofing file ID definitely is best done before any users are installing this at home. I'll need buy-in from the RTC folks, but I'm confident we can move forward with your proposal.

kevin-bates · 2022-10-25T22:46:48Z

Thank you @dlqqq. I agree with your comment regarding copying files in principle. However, UUIDs enable the ability for applications to be developed to merge environments without concern of conflicts. For example, an application may wish to offer the ability to preserve path and ID, but update the inode (or even move to an entirely different filesystem) knowing the previous system's filesystem will not be used (although that's their decision). UUIDs enable that choice because entries in one DB are unique in another thereby eliminating the need to update external references (which is like herding cats).

My only objection to using UUIDs was the performance concern associated with generating and indexing by them.

FWIW, prior to coming into Open Source and Jupyter, I spent 12+ years with an ECM application in which every primary key (and other indices) were UUIDs across dozens of tables in SQLServer, DB2, Oracle, and other "enterprise" DBs. The application was optimized for retrievals but also supported bulk, high-volume, ingestion rates, and performance due to our use of UUIDs was not an issue.

That said, I would recommend that implementations convert their string-based IDs to/from binary forms as that will be faster and half the space (16 bytes vs. 36 characters) - but that's up to the implementation. Fortunately, Jupyter is not a high-volume, 1000+ user application so that kind of optimization may not be critical.

dlqqq · 2022-10-26T20:37:02Z

@kevin-bates

That said, I would recommend that implementations convert their string-based IDs to/from binary forms as that will be faster and half the space (16 bytes vs. 36 characters) - but that's up to the implementation. Fortunately, Jupyter is not a high-volume, 1000+ user application so that kind of optimization may not be critical.

I'm not going to add this in for my initial PR, since the performance improvement (if any) is unclear. While this does decrease the size of each record, the difference is small relative to the size of the file paths and other columns, and may actually worsen performance from the overhead of converting UUIDs to/from binary in Python. I would only add this if there were performance benchmarks to validate the efficacy of this approach.

Anecdotal evidence that binary UUIDs may only have a marginal performance enhancement: https://vespa-mrs.github.io/vespa.io/development/project_dev/database/DatabaseUuidEfficiency.html

dlqqq added the enhancement New feature or request label Sep 7, 2022

kevin-bates mentioned this issue Oct 11, 2022

draft: File ID Manager base implementation jupyter-server/jupyter_server#921

Closed

kevin-bates mentioned this issue Oct 25, 2022

Allow arbitrary contents managers #24

Merged

dlqqq changed the title ~~new config option for UUIDs for primary keys~~ use UUIDs for primary keys Oct 25, 2022

dlqqq self-assigned this Oct 26, 2022

dlqqq mentioned this issue Oct 26, 2022

use UUIDs for default file ID manager implementations #30

Merged

dlqqq closed this as completed in #30 Oct 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

use UUIDs for primary keys #3

use UUIDs for primary keys #3

dlqqq commented Sep 7, 2022 •

edited

Loading

kevin-bates commented Sep 15, 2022

dlqqq commented Sep 16, 2022

kevin-bates commented Sep 17, 2022

dlqqq commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

dlqqq commented Oct 26, 2022

use UUIDs for primary keys #3

use UUIDs for primary keys #3

Comments

dlqqq commented Sep 7, 2022 • edited Loading

kevin-bates commented Sep 15, 2022

dlqqq commented Sep 16, 2022

kevin-bates commented Sep 17, 2022

dlqqq commented Oct 25, 2022

kevin-bates commented Oct 25, 2022

dlqqq commented Oct 26, 2022

dlqqq commented Sep 7, 2022 •

edited

Loading