Feature suggestion: (Option to) keep trying rather than throw SequenceOverflowException #21
Well, it's exactly what this test does. Simply create a
If you're experiencing unusually high demand, "simply waiting" might actually worsen the situation, piling up more and more stuff "simply waiting" for an ID, where you actually probably would want to tell the systems upstream requesting ID's to try another host or cluster or... stop requesting ID's or... take a chill-pill or something 😆

Imagine, if you will, a factory that produces internet connected marbles and each marble rolling off the factory line gets its own ID. First, that would have to be an insanely large factory, and you'd probably have hundreds of those around the world - hence why you chose IdGen. Also, that toy would have to be popular in millions of galaxies; can you imagine the profit on that? Now imagine production is ramped up because of high demand; Christmas maybe. And now you're unable to produce ID's fast enough. Would you "skip" assigning marbles ID's altogether and potentially have a lot of unhappy customers because their toy won't work? Or "keep the marbles waiting" (where? the factory floors are finite... so is our solar system) while there are even more coming from the machines producing them? Or do you prefer to signal an operator with a red light? Or maybe tell the machines upstream to slow production down just a little?

I think explicitly leaving the handling of this exceptional case up to whomever is using the library is exactly what we want. It forces people to think about their code. "Silently slowing down" generating ID's and "solving" the problem may make matters worse. And what better way to notify someone they forgot to handle an overflow than by crashing? 😆 As a library creator I don't want to "take responsibility" for "handling"* such cases; the user (should) know(s) best and take responsibility.

* Where I'm not sure if 'just put in a delay and wait for a bit' can even be defined as 'handling' the situation...
Actually, it's 4096 IDs (12 bits) for each generator (1024 generators, 10 bits) per millisecond, to be precise, in the default configuration. That's 4096 * 1024 * 1000 ≈ 4 billion per second... Which is exactly why throwing an exception shouldn't be a problem; it'll be an exceptional situation anyway; hence the exception 😉

Let's be real for a second here; most of us will never, ever, even come close to even the tiniest fraction of requiring millions or billions of ID's per second. Even if you do, you should then be maxing out at least 2^10 generators (which is 1024 generators!), EACH to their fullest capacity; else why would you have used a MaskConfig of only 12 sequence bits and 10 generator bits? Why didn't you use, say, only 2 generator bits and 20 sequence bits, which increases each generator's sequence size 256-fold from the default configuration?
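The bit-budget arithmetic above is easy to check. A quick sketch (in Python for brevity; IdGen itself is a .NET library, and these function names are illustrative, not part of its API) of how the sequence/generator split determines capacity:

```python
def per_generator_capacity(sequence_bits: int, ticks_per_second: int = 1000) -> int:
    """IDs a single generator can hand out per second: 2^sequence_bits per tick."""
    return (1 << sequence_bits) * ticks_per_second

def total_capacity(sequence_bits: int, generator_bits: int,
                   ticks_per_second: int = 1000) -> int:
    """IDs all generators combined can hand out per second."""
    return per_generator_capacity(sequence_bits, ticks_per_second) << generator_bits

# Default snowflake-style split: 12 sequence bits, 10 generator bits, 1 ms ticks.
#   per generator: 4096 * 1000        = 4,096,000 IDs/s
#   in total:      4096 * 1024 * 1000 = 4,194,304,000 IDs/s (the "~4 billion" above)
# Moving 8 bits from generators to the sequence (a 2/20 split) multiplies each
# generator's per-tick space by 2^8 = 256.
```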
Actually catching the exception and handling it accordingly will improve reliability, instead of ignoring the situation at hand and trying to sweep the problem under the floormat by just keeping everyone waiting. Exceptions aren't always a Bad Thing™; they signal something exceptional is going on and you might want to act upon it (or catch and swallow, ignoring the problem - it's up to whomever is using the library). Having said that, it shouldn't be hard to implement a wrapper that waits for you. Though I currently disagree on the issue at hand, I am open to discussion. Change my mind 😜
Cool, here goes: The situation that caused this for me was when I was migrating from another database, and generating several million IDs in advance. In a tight loop, 4,096 is easy to hit in a millisecond, even on my 9 year old machine.
Presumably, it's unlikely that you're going to max out on your generator count, and you're probably going to be generating a ton from one generator, since those generators are probably going to be different application instances or even entirely different computers. So having 1024 generators doesn't really help in most practical cases. Also, the generator count could be precious, whereas a millisecond isn't.
Softly, by slowing down. It buys time for developers to react. I understand that silent failures are bad, but it's also been said that premature optimization is the root of all evil. It's hard to imagine a situation where generating IDs is the most time-consuming part of any process, or where generating IDs is more time critical than mission critical.

Imagining a very likely scenario: You're developing an image uploader. During development, you're working on sets of, say, 1000, and then a couple years after release, people start uploading larger and larger sets, and you start hitting 3000, 4000, 5000. A crash would stop business then and there. After hours or potentially days of lost business or a very large customer, you wrap the whole thing in a try/catch. At that point, you can start evaluating other ID options until performance begins to suffer beyond what is acceptable -- you've effectively landed on the same solution as not throwing an exception in the first place.

Years pass, and people start uploading 10,000 images at a time. The upload takes two minutes and a millisecond rather than two minutes exactly. It's 2080. Miraculously, the earth, along with you and your users, is still alive. Per-core CPU performance is still the same as it is now. People are uploading 100,000 images at a time. The upload takes twenty minutes and ten milliseconds rather than twenty minutes exactly. The time component is about to overflow. Now you need to switch to a new way of doing things -- and not even because it was too slow.

If instead you're doing something that does generate a bajillion IDs all day every day in a tight loop, like, say, a data logger or a user input device driver, you're probably just as likely to encounter an unacceptable slowdown during development as you are to encounter a crash.

I should add that exceptions are also quite slow. This is why you see the TryParse/Try* API all across .NET for time-critical code.
For that specific scenario you could use a generator-id per thread (for example). Or just chop up the work in, say, 1024 batches and process each batch "as" a specific generator. And, again, these numbers aren't carved in stone, right? You might as well use up to 64 generators (only 6 bits per generator), giving you 4 extra bits (16 times the 'sequence space'). Another "trick" you could use (but make sure you understand what's going on / what you're doing) is, for the migration only, to "temporarily" use an IdGenerator with a more generous MaskConfig.
It's not about premature optimization at all. It's about fail fast.
Again; this isn't about the time taken to generate an ID (it being the "most time consuming part" in a system); it's about not sweeping problems under the rug.
No, it wouldn't / shouldn't. Just possibly one of your 1024 nodes threw an exception for an upload because its sequence overflowed; it may even throw one more for each upload during the remainder of that millisecond, and it may even flood some logs if you're not careful. But the next millisecond everything will be (or should be) alright again. In this setup, when the problem doesn't get immediate attention, you'll just start to notice that some percentage of every 4096 * 1024 * 1000 uploads per second will fail, and that total number will rise as your system evolves.

Whenever you're in the market for an Id generator like IdGen you won't be using a single process that, when crashing, will bring everything to a screeching halt. You'll have redundancy and high availability as part of your system - a horizontal architecture. Not a single process somewhere in the back of a broom-closet that is brought to its knees by a single exception. Tens or hundreds, maybe thousands of nodes.
Because in 2023 the developers noticed a problem and fixed it by either adding more nodes and lowering individual loads or redesigning the system or...
I just can't get my head around it: if you have a (single) system that generates a bajillion IDs all day every day in a tight loop, how does making that system delay intentionally solve your problem? All it does is make matters worse, because the users keep coming and uploading, and now you need huge amounts of memory and storage just to be able to keep track of some 'queue' of requests waiting for ID's...
Who's prematurely optimizing now? 😉 I am very aware of the "expense" of exceptions; again: they're not meant for regular program flow but for exceptional circumstances, which an overflow undoubtedly is. We need the (library) user to think about how to actually handle the situation of a sequence overflow. If they want to wait, then sure, be my guest and put a delay in the exception handler.

On the other hand, if they don't want to wait and slow down, they have the freedom to put something else in the exception handler. Like, I don't know, spin up another Amazon node to dynamically 'scale up' and spread the load over more nodes. Or tell the frontend to try another cluster to handle the request. Or... maybe even putting in a delay there themselves. If the "just wait around a bit..." was built into the library, the system would never be aware of the problem and not be able to (automatically) 'scale up'.

Also, depending on the MaskConfig, a tick doesn't have to be a millisecond at all. And to drive my point home even a bit more: what if a tick lasts an hour, or a day? Do you still want to quietly wait around for it to pass?
On a (more) positive note: I think that's one thing we agree on. You (anyone) should carefully think about the amount of generators required (now and in the distant future); adjusting this later can (but doesn't have to) be a problem. And finally, even though we may disagree (and I'm still open to having my mind changed), I would like to stress that it's only a handful of lines of code to implement a waiting wrapper yourself.
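That "handful of lines" might look something like the sketch below (Python for brevity; IdGen itself is .NET, and `TickGenerator`/`WaitingGenerator` are illustrative stand-ins, not IdGen's API): a decorator that catches the overflow and retries until the next tick resets the sequence.

```python
import time

class SequenceOverflowError(Exception):
    """Stand-in for IdGen's SequenceOverflowException."""

class TickGenerator:
    """Toy snowflake-style generator: a limited number of IDs per tick,
    then it overflows (like IdGen with a small sequence mask)."""
    def __init__(self, ids_per_tick, tick=lambda: int(time.time() * 1000)):
        self.ids_per_tick, self.tick = ids_per_tick, tick
        self.last_tick, self.seq = -1, 0

    def create_id(self):
        now = self.tick()
        if now != self.last_tick:
            self.last_tick, self.seq = now, 0
        if self.seq >= self.ids_per_tick:
            raise SequenceOverflowError("sequence exhausted for this tick")
        self.seq += 1
        return (now, self.seq - 1)  # (timestamp, sequence) stands in for an ID

class WaitingGenerator:
    """The wrapper under discussion: turn the overflow into a spin-wait,
    kept *outside* the library so waiting is the caller's explicit choice."""
    def __init__(self, inner):
        self.inner = inner

    def create_id(self):
        while True:
            try:
                return self.inner.create_id()
            except SequenceOverflowError:
                time.sleep(0)  # yield; the next tick resets the sequence
```

Because the waiting lives in a wrapper, callers who would rather redirect, scale up, or fail fast simply don't use it.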
Right, but then I'm using more generator bits, and I have to sacrifice the one-to-one correspondence between a generator and whatever broader unit it represents. I also need to keep track of what range of generators is reserved for each. So it's very much not ideal.
I mention silent failures for the sake of argument, because those are actually post-mortem failures, which are the latest of all kinds of failures.
Getting back to failing fast: The best place to fail is at the compiler, then deployment time, then execution time, then post-mortem. Crashes aren't particularly high on the "fail fast" chart. Not all failures are equal, either. You're going to need to be doing something with the IDs anyway, and the rest of the program is going to be taking up much more time per ID than a quick call like CreateId()
Right, but every single node is going to fail identically on the same or similar input if the input someday involves several thousand IDs.
That's a different hypothetical scenario. My first scenario is in regards to an image uploader (or literally any application involving thousands of distinct items being uploaded or stored to a server); the second is in regards to hardware drivers or IoT type situations.
I was just pointing out that if performance is a concern, try/catch isn't the right answer.
That's a good option if your work unit is parallelizable. It may not be desirable or even possible to do so.
Maybe not
That would be another reason to avoid
I'm just talking about practical cases here. Presumably that's why the defaults are set to what they are, because they're practical. It's also why I think it would be a good option to have, whether or not it's the default behavior.
But that can halt on the order of hours even in a normal situation, and the wait time is in no way correlated to the size of the workload.
If you want several thousand items to be served (or even just processed) by the same node, I think it makes perfect sense.
Only during the migration; time is part of the generated Id's.
No, you don't. Who cares which generator was used, at what time, or which node was used to do the migration? And if it is important to you, then why not simply keep track of it yourself?
There's no post-mortem if you actually catch the exception and handle it accordingly.
Again; what crashes? Handle 👏 your 👏 exceptions.
Yes, like signalling the problem and redirecting to a less busy node, for example. Not busy-waiting and twiddling thumbs until some arbitrary time has passed so we 'can' generate a new id.

EDIT: I just realized this is where some of the confusion may come from. I keep/kept saying redirecting to other nodes, spinning up new instances etc. whenever a sequence overflows. Let me be clear: it's not IdGen's job to signal a host is too busy or whatever; there are other means for that (like decent monitoring of your nodes). It's also not up to IdGen to use the "arbitrarily sized" sequence as a means for figuring out what's "too busy" and what isn't. There should be other systems in place for that.

By the time your sequence overflows IdGen has no option; it quite literally ran out of possibilities; it can't generate a new id because that would create a conflicting id. One option to solve that is to redirect the work to another node that could handle the job. Another option is to simply fail. Or wait. Again, whichever option you choose, it should be up to the library user, not IdGen deciding for them that, well, it ran out of options, let's just slow things down. The exception is there to signal something is wrong: your sequence has overflowed where it shouldn't. The number of bits you reserved for the sequence are too few. How you want to fix it (adjust the MaskConfig, redirect the work, wait, or fail) is up to you.
That is assuming all 1024 nodes (or whatever the generator count is) are actually up and the load is evenly distributed over those nodes, so they all fail at the same time. By then you should start to wonder if IdGen is actually for you.
Again, it isn't. Correctness and reliability are. You're insisting that exceptions are bad, but they're not. Their actual intended use is to signal exceptional situations; hence their name.
Again; maybe IdGen then isn't for you. Maybe you should look into plain simple auto-incremented ID's. The specific use-case for IdGen is distributed (e.g. parallelizable) uncoordinated ID generation.
Whether it's
They are for most intended use-cases: distributed, uncoordinated ID generation. Not tight-loop migrations in a single process.
I still don't see any strong arguments, if any at all, for 'waiting'. I also don't see any solutions for how to handle the work piling up when we start waiting. And finally, I still don't see why you wouldn't just write the handful of lines of code to create a waiting wrapper yourself.
Who's to say the Id-generation won't? Even with the smallest per-item workload, if you start waiting for ticks to expire, more requests/uploads/marbles will be queuing up. During the next tick, whatever its duration, even more work will have to be processed (causing new waits, causing more backlog).
Then adjust the MaskConfig accordingly? It's not as if you don't have the option. Every bit you add to the sequence part doubles the space (at the cost of halving your generators or time resolution, depending on where you 'take' the bit from). IdGen is actually one of the very few, if not the only, library offering variable sized timestamp, generator and sequence ranges. Most other snowflake(-alike) libraries are fixed at 41/10/12 bits respectively.

I'm sorry, I just don't find any convincing or compelling reason to (silently or otherwise) keep waiting around until a tick has expired to generate the next ID in a busy system. Heck, why would you even wait around when you could simply wrap around the sequence counter and 'artificially' increment the tick; it's what the outcome is going to be anyway; you know this ahead of time. Why keep sitting around, thumb-twiddling, for the tick to pass? But by then you're damn near close to plain auto-incrementing ID's. Again, not what IdGen is intended to solve.

Circling back to your migration, that makes me wonder too: Don't you want your migration to happen as fast as possible? If you're having to migrate millions of items, with a workload low enough to actually overflow the sequence, do you really want to add extra time doing nothing? Why don't you take the much easier route(s) of either adjusting the number of bits reserved for the sequence or 'faking' a generator-id for the duration of the migration, just to speed things along? Both make much more sense to me than introducing artificial delays just because some counter overflowed.
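The wrap-around idea mentioned above — don't wait for the wall clock, just 'borrow' the next tick when the sequence overflows — can be sketched like this (Python, purely illustrative; IdGen does not do this):

```python
class WrapAroundGenerator:
    """Snowflake-style layout where an overflowing sequence increments the
    tick component instead of waiting for real time to catch up."""
    def __init__(self, sequence_bits=12):
        self.sequence_bits = sequence_bits
        self.seq_mask = (1 << sequence_bits) - 1
        self.tick, self.seq = 0, 0

    def create_id(self, now):
        if now > self.tick:
            # Real time moved on; start a fresh tick.
            self.tick, self.seq = now, 0
        elif self.seq > self.seq_mask:
            # Sequence exhausted: borrow the next tick instead of sleeping.
            self.tick, self.seq = self.tick + 1, 0
        id_ = (self.tick << self.sequence_bits) | self.seq
        self.seq += 1
        return id_
```

The trade-off, as the comment notes, is that timestamps embedded in the IDs are no longer guaranteed to be real wall-clock times, and under sustained overload the generator drifts toward plain auto-increment behavior.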
Yes, if it's only a one-time migration, and you can be sure that there's nothing else happening simultaneously. The image uploader situation I noted earlier would have most users making IDs in the hundreds (and, crucially, occasionally, potentially, in the thousands) kicked off by multiple users simultaneously and frequently. Allocating generators using this tactic then becomes a non-trivial problem.
That would be a better argument in Java, where the caller is forced to be aware of every exception. May I suggest logging via a callback function or firing an event instead of throwing an exception?
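The callback suggestion could look roughly like this (Python sketch; `NotifyingGenerator` and its parameters are made up for illustration, not an existing or proposed IdGen API):

```python
class SequenceOverflowError(Exception):
    """Stand-in for IdGen's SequenceOverflowException."""

class NotifyingGenerator:
    """Report overflows through a caller-supplied callback (metrics, logs,
    alerts) while still re-raising, so nothing is silently swallowed."""
    def __init__(self, create_id, on_overflow):
        self._create_id = create_id   # zero-argument id factory
        self._on_overflow = on_overflow

    def create_id(self):
        try:
            return self._create_id()
        except SequenceOverflowError as exc:
            self._on_overflow(exc)    # e.g. bump a counter, emit an event
            raise                     # the caller still decides what to do
```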
Yes, but unexpected situations are unhandled almost by definition, and unhandled exceptions cause crashes, regardless of how severe the unexpected situation is. An exception that occurs early in development or during testing is great. An exception that likely won't occur until some point after release is antithetical to reliability.
These two concerns seem to contradict.
Autoincrements and UUIDs pose problems even for relatively simple SQL applications, which can be solved with a combination of source IDs (generators) and mostly-sortable source-specific IDs (timestamp and ordinal). Cloud-style massively scalable services aren't the only place that IDs with these characteristics make sense.
True, if you assume the ID generation is interspersed with the work. Generating a batch of IDs up front for a batch of work is a common scenario, say, if you want to assign IDs to graph nodes before you link them up.
Because the worst case isn't the usual case, and in the usual case, autoincrement isn't always the best solution. In most cases, failing in an unusual case rather than taking a few extra milliseconds doesn't seem like a good tradeoff.
Benford's law. For numeric/digital representations of countable real-world items, most counts will tend toward the highest digit having the lowest value (i.e. 1), with higher values in lower digits being anomalous. This is so much so that it's used as a signal in detecting fraud. Assuming you've chosen your mask size (i.e. your groupings) to take into account the most common use cases, most of your data is going to be of smaller counts, until you have an unusual case. The unusual case, the rollover digit beyond 4096, will usually be 1 -- meaning that the wait will almost always be one tick, sometimes two or three, rarely more. Even then, it'll likely be a fraction of a tick as opposed to a full tick.

And if someone is going to generate millions of IDs for a time-critical application, surely they'll have profiled its performance. And if someone is going to generate millions of IDs for things that aren't used in tight loops, surely they need those IDs for some kind of processing, which will invariably take up a lot more time than generating IDs or even waiting to generate IDs.
ID formats, in most applications, are kind of set in stone once they're chosen.
In my particular case, I had 22 million items to migrate. 22 million IDs on a single generator takes 5 seconds with the default settings. The rest of the migration takes 10 minutes. You're right that I could safely use another generator (and I admit I hadn't thought of that), but what I'm trying to illustrate with this example is that, given that generating IDs is a very small part of the process, waiting another 5 seconds is perfectly fine.

In normal operation, I expect to be generating hundreds of IDs at a time. It would be unusual, but I can't say I won't be generating thousands. In those cases, I would rather lose a millisecond or two than fail an entire work item or have to parallelize right down my work unit. And if the performance gets really bad -- hey, I should be monitoring my nodes, right?

In any case, it seems I've failed to change your mind. No prob, it's just a difference in philosophy. I was able to write up my own solution in the meantime. If I put it up on Github, I'll be sure to credit you. Thanks for entertaining my thoughts!
Then you should've solved that problem earlier; again: the (intended) use-case for IdGen is distributed, uncoordinated id generation. If you're only going to use a single generator then there's not much use for IdGen. Using multiple generators is the intended scenario and, by extension, allocating generators should have been a solved problem by then. Whether you provision each node with a fixed generator id, use a per-thread generator id, have a server ("generator coordinator") hand out generator id's or... whatever.
You may, and that would be a possibility, but I don't think it's a very intuitive way of doing things.
The situation isn't unexpected; the SequenceOverflowException is documented, so you can anticipate and handle it.
But, again, if it happens it will be clear what is going on, what the problem is. Instead of silently ignoring things.
Again, if this is your use-case then IdGen offers many options (some of which may need to have been thought out upfront): adjusting the MaskConfig, spreading the work over multiple generator-ids, and so on.
Maybe in your situation(s). But where for you a millisecond is "just a small wait", for others it's an unacceptable price to pay. From a library standpoint, which can be used (and is intended to be used) in any random scenario, throwing in a wait is not always the best option. Also, again, even if we were to wait - a tick may be anything from a millisecond to days, nanoseconds to centuries (theoretically).
Without using Benford's law or any other convoluted methods, there's an easier way to detect a sequence overflow: the SequenceOverflowException itself 😉
Which, again, may be acceptable in your specific use-case/scenario, but it may not be in others. You really should consider scenarios outside your own in which this library will be used (and is intended to be used) and take that into account when suggesting solutions. To exaggerate a bit: what would happen if developers of OS'es, other libraries etc. all would just throw in a random millisecond (or whatever period) wait in their code? "Oh, hi! So you want a resource? Hold on, let me just wait around for a bit first..."
That's another assumption I wouldn't want to make; who's to say you didn't overflow the sequence within the first one-thousandth of the tick? I agree it's likely, but not that likely.
That's why you need to sit down and think about it before and during the design of your application. Having said that, if anything, IdGen allows you to 'cheat' a little; you can 'fake' a generator-id (for example, if you normally only use 1, 2, 3, ... you could use 1023, 1022, 1021... to 'keep track' of when you needed to do this) for these kinds of situations. And because a timestamp is included in the ID you could even keep track of when you cheated and mark ID's between the start- and end-time as 'cheated'.
... again: IdGen is not _intended_ to be used in a 'single generator' scenario; it's (very much) ok if you do, but that's not how it's intended to be used. So if you then overflow the sequence you need to either re-think and adjust the MaskConfig, or use a second generator ID, or wait for the remainder of that millisecond, or... do whatever you want. IdGen signalled you very clearly, and as early as possible, what was going on and that there was a problem. Didn't it? 😉
... for you. Keep that in mind. For you. Who am I to decide that's OK for everyone using this library?
Again; YOU would rather lose a millisecond. But that may not be the case for everyone using this library.
Agree to disagree 😅
Glad you did 👍
Cool. Maybe link to this issue?
You're welcome and ditto!
You may also be happy to know that since 2.4.1 (see #24) it is now possible to SpinWait. I still don't recommend it and it still defaults to throwing an exception but...
SequenceOverflowException only occurs under very specific circumstances (hardware speed, unusually high demand, very tight loop etc.) which may not always be easy to test for or anticipate. It's hard to imagine a use case where a user can simply give up if they can't have an ID, so there's no harm in simply waiting. The default settings allow for 4 million IDs per second, which ought to be fast enough for all intents and purposes, and much preferable to unexpected crashes.
Of course, it's easy to wrap the whole thing in a try/catch block, but my concern isn't usability so much as reliability.