RFC: Persist state across Sage executions #9
Comments
The simplest solution could be to use the
@michalmuskala I've read through
We need something with fast writes and deletes at the cost of read performance, such as log-structured merge trees (LSM trees). But I'm not aware of a good implementation that wouldn't add too much maintenance burden. Any ideas?
The implementation could work something like this:
We should probably remove the compensation error handler, since it looks obsolete given the ability to persist and retry compensations after a critical failure. Questions:
I think we should keep it all local, at least in the first iteration. I don't think distributing the log (or transaction persistence) would actually be that useful. A simple adapter could just use a dets table: it might not be the best tool for the job, but it's a no-dependencies solution that should allow validating the idea. Lastly, saving funs might be a problem: any code change would basically make them irrecoverable (since serialisation of funs includes the md5 hash of the module). Maybe the persistence feature should only work with MFA tuples?
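The MFA-tuple point can be sketched in a few lines. An anonymous fun's external term representation embeds an md5 of the defining module's bytecode, so a persisted fun breaks after any redeploy; an `{module, function, args}` tuple is plain data and survives as long as the function still exists. The `Demo` module below is purely illustrative:

```elixir
defmodule Demo do
  def add(a, b), do: a + b
end

# A persisted step described as data, not as a closure. This term
# can be written to disk and applied after a code change:
step = {Demo, :add, [1, 2]}
{mod, fun, args} = step
IO.inspect(apply(mod, fun, args))
# By contrast, :erlang.term_to_binary(fn -> :ok end) embeds the
# module md5 and fails to apply after the module is recompiled.
```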
@michalmuskala Sage which relies on MFA instead of callbacks in a form
If we further extend persistence with an external transaction ID, we would be able to trigger recovery externally by executing the same Sage with the same ID and attributes, which would provide the idempotency feature described in #2.
We could use the LSM-tree database from the NHS: https://github.com/martinsumner/leveled
I think this is fine for an initial pass, but I'm not sure it's the correct default to have long term; it would be problematic for our uses. But as long as persistence is behind an adapter as described, everything should be fine, since we can swap in our own persistence. Thanks for this library and the thought you're putting into it 👍.
@AndrewDryga, is there any update on this issue? Thanks!
Hey @deepankar-j. I'm not spending much time adding features to Sage, as it fits our use cases as it is now. So if anyone wants, this one is open for grabs.
Thanks @AndrewDryga. Out of curiosity, how do you handle deployments for your system while a saga is in the middle of executing? We have a three-node cluster on which we do rolling updates, which is why I'm interested in the execution log and saga recovery. However, I imagine that updating in the middle of a saga execution might not be as big an issue for blue-green deployments.
I think the goal is to handle process crashes, not just node failures. In my opinion this shouldn't be special to Sage. For example, the whole context can be rehydrated before calling Sage execution (or at the first step of the Sage), and all steps should be able to handle that. In our case we keep everything in the database, and each Sage step is idempotent given the information available in the database: e.g. when we know a step is already done (say, an attribute for the external resource is already set), the execution step is a no-op.
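The idempotent-step pattern described above can be sketched as follows. All names are illustrative, and a plain map stands in for the database; a real step would query actual storage before performing the external call:

```elixir
defmodule IdempotentStep do
  # Returns {new_db, resource_id}. If the effect is already recorded
  # in the "database", the step is a no-op and returns the stored id.
  def create_resource(db, order_id) do
    case Map.get(db, order_id) do
      nil ->
        # Pretend external API call; its result is persisted so a
        # retried execution after a crash skips this branch.
        resource_id = "res-#{order_id}"
        {Map.put(db, order_id, resource_id), resource_id}

      resource_id ->
        # Already done on a previous (possibly crashed) run: no-op.
        {db, resource_id}
    end
  end
end

{db, id1} = IdempotentStep.create_resource(%{}, 42)
{_db, id2} = IdempotentStep.create_resource(db, 42)
IO.inspect(id1 == id2)
```

Re-running the step with the same input returns the same effect without repeating the external call, which is exactly what makes retries after a crash safe.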
@deepankar-j we currently don't handle it, as it's pretty rare for our system. It's the same as not handling a node crash before Sage was introduced, but with Sage's semantics the code is much easier to understand. Also, if your functions are idempotent (e.g. you persist some state between steps with an upsert operation), the saga would be persistent without any handling from the library.
Ha, I'm implementing my own distributed transaction system and started to think "I have to be reinventing the wheel here", and then I found this cool project! My system runs in Kubernetes, so pods are very frequently killed, restarted, deleted, etc., and I must persist my transaction state. Currently my system uses Redis and JetStream: write the transaction to Redis, enqueue a JetStream message, and have consumers process the message and update the transaction state in Redis. Notification of transaction/step values is all done with pub/sub (which comes built in with JetStream). JetStream handles the work-queue concerns: retries, acks, etc. Each step in a transaction must be idempotent, because the system has at-least-once delivery semantics. Anyway, a long-winded way to say: maybe there should be a behaviour that specifies how to store/load transaction state, so that people can use whatever storage system they want?
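A pluggable storage behaviour of the kind suggested here might look like the sketch below. The behaviour name, callback names, and the toy `Agent`-backed adapter are all assumptions for illustration, not Sage's actual API:

```elixir
defmodule SagaPersistence do
  @moduledoc "Hypothetical persistence behaviour; not part of Sage."

  @callback store_state(transaction_id :: term(), state :: term()) ::
              :ok | {:error, term()}
  @callback load_state(transaction_id :: term()) ::
              {:ok, term()} | {:error, :not_found}
  @callback delete_state(transaction_id :: term()) :: :ok
end

defmodule MapAdapter do
  @moduledoc "Toy in-memory adapter; a real one could use dets, Redis, etc."
  @behaviour SagaPersistence

  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  @impl true
  def store_state(id, state),
    do: Agent.update(__MODULE__, &Map.put(&1, id, state))

  @impl true
  def load_state(id) do
    case Agent.get(__MODULE__, &Map.get(&1, id)) do
      nil -> {:error, :not_found}
      state -> {:ok, state}
    end
  end

  @impl true
  def delete_state(id),
    do: Agent.update(__MODULE__, &Map.delete(&1, id))
end

{:ok, _pid} = MapAdapter.start_link()
:ok = MapAdapter.store_state("tx-1", %{stage: :started})
IO.inspect(MapAdapter.load_state("tx-1"))
```

Swapping in a Redis- or dets-backed adapter would then only require implementing the same three callbacks.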
@cjbottaro there is no interface right now, but the idea was to introduce such a behaviour and then move the state handling into a separate module that would be the default implementation. If somebody has a practical need for this, a PR would be welcome. Keep in mind that persisting state is not as easy as it sounds: for example, you need some sort of two-phase commit for every execution step to recover from a failure where execution has finished but the node is restarted before Sage persists the state.
Yes, but even then, I think two-phase commit relies on something being 100% reliable (i.e. the file system), which doesn't square with networked/distributed systems. This problem is not fun at all: https://en.wikipedia.org/wiki/Two_Generals%27_Problem I'd love to brainstorm on this. My current thoughts are around eventually consistent systems, but that depends on detecting inconsistencies. cc @AndrewDryga
@cjbottaro the thing I would focus on is making sure that when your node receives SIGTERM it can stop gracefully: cut the traffic to it and give it time to finish processing sagas. This way your k8s deployments would be a bit slower, but you won't have sagas stopped mid-way (unless the node crashes or is OOM-killed, which is rare). I feel this approach would be more practical than solving distributed-systems issues.
We could use khepri for this.
We can add a persistence adapter for Sage. The main goal of implementing it is to let users recover after node failures, making it possible to provide better compensation guarantees. As I see it, we need:
This mechanism should be resilient to network errors: e.g. we should persist the state before executing a transaction, so that even if the node is restarted, compensation (even without knowing anything about the effect) would still start, potentially on another node.
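The persist-before-execute flow described above can be sketched as follows. All names are hypothetical, and a map stands in for durable storage; the point is the ordering, where intent is recorded before the step runs, so a recovery process finding a `:started` record knows compensation must run even though nothing is known about the effect:

```elixir
defmodule PersistedExecutor do
  # Returns {new_db, result}. The :started marker is written before
  # the step function runs; if the node dies between the write and
  # the step, recovery still sees the pending record and can start
  # compensation on another node.
  def execute_step(db, step_name, step_fun) do
    db = Map.put(db, step_name, :started)

    case step_fun.() do
      {:ok, effect} ->
        {Map.put(db, step_name, {:done, effect}), {:ok, effect}}

      {:error, reason} ->
        {Map.put(db, step_name, :failed), {:error, reason}}
    end
  end
end

{db, result} =
  PersistedExecutor.execute_step(%{}, :charge_card, fn -> {:ok, "ch_123"} end)

IO.inspect({result, Map.get(db, :charge_card)})
```

The remaining gap, as discussed in the comments, is that the final `{:done, effect}` write can itself be lost if the node crashes after the step succeeds, which is why idempotent steps (or a two-phase protocol) are still needed on top.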