-
-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for comb v4 UUIDs #303
Comments
For posterity:
After reading Nilsson's article, I don't see an immediate need for "combs" here. In a nutshell:
|
Very well argued points :). |
ramsey/uuid#53 (comment) points to this benchmark that shows a ~30X increase in INSERT time for Postgres 9.6 when using V4 UUIDS. However, the V1 perf in that test supports my suggestion, above, that V1 UUIDs are an acceptable substitute for combs. |
For me, v1 UUIDs pose an issue because of their guessability. I know there's some back-and-forth right now about the benefits (or not) that you get in having randomized, unguessable keys, but at best, it looks like the jury's still out. Thus, for my purposes, the most important criteria are
V1 UUIDs fail on the second requirement, but hit all the others. V4 UUIDs fail on the third requirement, but hit all the others. CombV4 UUIDs are, to me, the perfect blend..... But again, I accept that you're not interested in incorporating this functionality. I'll likely extend your library to add it when I get time to do some research on how, but for now, I'll just use standard V4's. |
FWIW, this module sets the
Edit: This is RFC-compliant, btw. While the |
I'd like to give some more insights on what's going on with PostgreSQL when indexing UUID values. Like most relation databases, PostgreSQL defaults to using a B-Tree data structure for indexes. Below is an example B-Tree from the linked article showing integer values being indexed. Each B-Tree node (the dark grey boxes in the graphic above) in PostgreSQL is a 8 KB page that can contain a variable number of keys. If you're indexing UUIDs, this means you can theoretically cram up to 512 (8192/16) uuids into one node (page). In reality it's quite a bit less, because each entry in the B-Tree is also containing pointers to rows (aka tuples) into the tables (aka heap) and to other B-Tree pages. For simplicity, let's assume 100 UUIDs fit into a page. Now let's imagine an indexed uuid table with 1M rows. Assuming our leaf nodes are 50% full (i.e. containing 50 UUIDs each), this mean we'll have 20K leaf nodes. And now, let's say we want to insert 20K new v4 UUIDs. Given the random distribution of those values, on average we'd expect each UUID to be inserted into a different leaf node, so we'd end up with 51 UUIDs on each node. Eventually PostgreSQL has to write this data to disk (via checkpointing, but that's another topic), and this is done by writing out each page that has been changed in full. In our case that's You might protest saying that it's really stupid to write the full 8KB of each page, when only 1% of it (~82 bytes) has changed. So yeah, in theory PostgreSQL could make much smaller writes, but in reality that would be even worse because SSD disks have a minimum write size of usually 4KB, and trying to issue a 82 byte write would actually force the SSD to first perform a 4KB read, then update the 82 bytes you're trying to write, and then issue a 4KB write to perform the update, which would likely double the latency of the overall operation. So as you can see, using random UUIDs is pretty much the worst case scenario for any B-Tree implementation and might cause over 100x write amplification. And from my personal experience running a few TB-sized PostgreSQL instances using v4 UUIDs, I've seen those kind of write amplifications in the wild. If our indexed values instead would have been sequential, we might simply have allocated 200 (20K/100) new pages as our values wouldn't have been randomly distributed across the nodes. And writing those pages would have only taken So yeah, while modern DBs might optimize a few things (e.g. storing UUIDs in binary rather than as text), random UUIDs will always present a worst-case scenario for B-Trees. And sure, some databases (e.g. LevelDB, Cassandra, RocksDB) use LSM-Trees rather than B-Trees, but they're basically just giving you higher write throughput at the expense of read throughput, so they're unlikely to replace B-Trees in the near future. But I don't have much LSM experience, and I'd have to do some more research on how they'd perform with the scenario outlined above. Oh, and last but not least, here is my attempt to reproduce the benchmark linked above: https://gist.github.com/felixge/64086c200ad6dc03aadd8b2900434c7a . In my testing v4 UUIDs are only 2x slower than v1. I suspect this is because the disk is not a significant bottleneck on my system, and the CPU overheads dominate. We're also inserting as fast as possible, so we probably end up hitting the same leaf nodes more than once between checkpoints, greatly reducing the write amplification. Oh and my |
@felixge Thanks for taking the time to provide such an insightful comment. Very helpful! Reopening so as to encourage further discussion. |
@ramsey points out that my previous comment about v1 serving much the same purpose is mistaken, which I'm happy to concede in light of @felixge's test, above. That said, my concern with COMBs is that "time-ordered v4 UUIDs" is an oxymoron. Version 4 UUIDS are "meant for generating UUIDs from truly-random or pseudo-random numbers" [rfc]. There is a certain calculus around uniqueness that follows from that, and that consumers of UUIDs have come to expect. That uniqueness contract breaks down when you start fiddling around with the now "random" (in air-quotes) bits. ... and, too, as soon as timestamp information gets encoded into a v4 uuid, someone somewhere is going to write code that relies on that, and then bitch about how actually-random v4 ids are breaking things. Count on it. Hence, I consider COMBs to be non-standard. As such, I'm reluctant to provide support for them here at this time. Having RFC-compliance be the barrier to entry for code in this module has served this project well over the years. That said, I get that there are compelling reasons for a new form of ID. I guess my suggestion for moving forward would be to treat COMBs as the different version of UUID that they are and move forward as such:
I could take a stab at implementing that new module if nobody else is so inclined but, honestly, I think it would be better if someone other than @ctavan or I took ownership of that. The more people invested in this org and this code space, the better. |
In the nearly 7 years this has been supported in ramsey/uuid, I’ve not had a single complaint. See also http://gh.peabody.io/uuidv6/ I plan to support this recommendation in version 4 of ramsey/uuid. It has been mentioned to the IETF but, without implementations, they didn’t wish to discuss it. |
Thanks for pointing to the uuidv6 draft @ramsey, that sounds really promising! Looks like 1. of @broofa's suggestions has already been done 😉. I think in the light of producing more implementations to encourage standardization it might be worthwhile to implement it for JavaScript. As @broofa suggested I think we would be happy to host this new module here in this org. However I would also prefer to release it as a separate module until there are more concrete plans to actually standardize such "v6" uuids. |
Here is already an implementation: https://github.com/aarondcohen/id128/blob/master/src/id/uuid-6.js |
@ramsey, Interesting article, thanks for sharing! I think it highlights the need for a broader conversation, though. Simply riffing on version 1 is, imho, not the right approach. There is more than one issue with v1 that should be addressed in a proposed v6 format. Specifically:
I welcome a new format but - and I say this with nothing but respect for your work - if we're going to create a v6 UUID spec I think we can, collectively, do better. |
These are all historical artifacts of the GUID format, originally created by Microsoft. I think if you were to try deviating from the RFC 4122 variant enough, you might consider creating a new variant. |
For reference, here’s the IETF thread I mentioned. I was wrong; they didn’t state they needed more implementations before discussing it. I had that confused with something else, I think. https://mailarchive.ietf.org/arch/msg/ietf/8h5cs3CHzYpbKIOc4FxO730OD4o |
@bradleypeabody as you can see from the thread above we have been discussing your UUID v6 proposal here. The linked email thread on the IETF mailing list seems to end without any conclusion. Would you like to share some more insight on what happened to your initial v6 proposal? How do you think about it today? |
@ctavan Thanks for asking. To summarize some thoughts on it:
Those are my thoughts. I'd be interested in trying to put together a draft with anyone else who wants to work on it. Probably the first step would be agreeing on the high level requirements and then summarizing those in a doc, followed by some initial idea of what the spec might look like. If someone wants to take the node/JS implementation, I can do a Go implementation and probably a C one too. |
Thank you for your write up, @bradleypeabody! I'm definitely happy to help exploring around this topic (and potentially contributing a JS impl). I'm also fine with moving the discussion over into your repo. |
Depends on the database. PostgreSQL turns text UUIDs into binary and back transparently to the point that application developers don't have to worry about it much. In fact, using anything other than UUID will force you to reimplement this text<->binary conversion yourself. (I have no horse in this race, just wanted to point this out) |
@felixge I bet you have only been waiting to implement such a conversion as a PostgreSQL extension, right 😉?! |
@ctavan my comment was actually a warning against throwing the baby out with the bathwater ; ). Additionally, IIRC, PostgreSQL doesn't care how your UUID looks and if certain version related bits are set. Any 128 bit value is accepted. So not sure if a new id format is needed. There is also a lot of prior art in this space that should be considered. |
@felixge It's a valid point for sure. I do definitely think a new format is needed, simply because the existing one doesn't provide much functionality for most applications and the whole setup can be made simpler. Allowing variable widths for both time and random bytes I feel is extremely useful for a wide range of applications. I also think that many implementations can just use (store) the string format directly if it is compact enough. However, it is certainly possible to combine the idea of multiple text formats with having one that is backward compatible. The hex encoding form that is used along with the bit positions, it's not terribly complicated to encode - as you say it's mostly a matter of putting the version number in the right place. I believe Cassandra's UUID support is like this too and I'm sure others. I'll add to the list of ideas to review. |
For anyone interested, an IETF draft for UUID version 6 has now been submitted. We'll see what happens. https://tools.ietf.org/html/draft-peabody-dispatch-new-uuid-format-00 |
@bradleypeabody When you submit, does it automatically open up a discussion thread on the IETF mailing list? If so, which mailing list (I'm subscribed to the general one, but I don't see a thread for it). |
@ramsey I believe the discussion is separate from the submission. I did a bit more discussion on the main IETF mailing list (ietf at ietf dot org) and the upshot of it was to submit a draft that lays out the proposal in detail - with particular attention to why it's needed. From my understanding the IETF meets quarterly and their next meeting is in a few weeks, which is when it would get some official feedback (although I don't know what form that feedback takes - this is new to me). I'll also reach out separately and see if I can get some more interest in the subject in general. I did get some great responses on the mailing list but the case of why this is needed will definitely require some advocacy to make it happen (to become an actual RFC and hopefully get some adoption). |
@kael-shipman if you want something like the proposed uuid v6 I use https://github.com/tanglebones/pg_tuid which time prefixes the top bits of a UUIDv4. |
Closing in preference to #580. There's been some good discussion here, but it's getting a bit long in the tooth and I suspect most/all of it is stale compared to the IETF conversations that have been taking place. |
Hey there,
Wondering if you have any thoughts on supporting comb v4 UUIDs. Doesn't seem to be a ton of information out there about this, but there seems to be general agreement that it delivers potentially massive increases in database performance at the relatively minor cost of decreased uniqueness guarantees.
I'd be willing to take a shot at implementing this, but just want to know if it would be a welcome addition or not first.
The text was updated successfully, but these errors were encountered: