Add docs about backwards compat design #7085
Thanks for writing this. I agree with all the content here. I wonder if we can organize it more usefully so that it's easier to skim to find the areas of relevance to the reader in that moment. Honestly, I'm just afraid people won't really read the wall of text start-to-finish. As an example, the first two paragraphs cover stuff we explicitly don't expect people to need to think about. One idea is just to put some section headers in, like "Network APIs made with Progenitor/Dropshot", "Network APIs with custom protocols over TCP", "Database state", "Other ad hoc persistent state", etc. Another idea would be to lead with the specific guidelines and relegate the rest to a "Background" section that people could read if they want to better understand how to apply them or where they came from.
That said, having this as-is is better than not having it so if that all seems annoying then I'd say just go ahead and land this. I think section headers would be good bang-for-the-buck.
docs/control-plane-architecture.adoc
Outdated
@@ -205,6 +205,24 @@ It's essential that components provide visibility into what they're doing for de
* Sagas are potentially long-lived. Without needing any per-saga work, the saga log provides detailed information about which steps have run, which steps are in-progress, and the results of each step that completed.
* Background tasks are continuous processes. They can provide whatever detailed status they want to, including things like: activity counters, error counters, ringbuffers of recent events, data produced by the task, etc. These can be viewed with `omdb`.

=== Backwards Compatibility

In general, backwards compatibility between services will be provided at the API level as described in <<rfd421>>. Most internal control plane service APIs are Dropshot based and therfore can utilize the same strategy. Some other services, such as trust quroum and crucible operate over TCP with custom protocols, and have their own mechanisms for backwards compatibility. The introduction of new services of this type should be largely unnecessary for the foreseeable future.
In general, backwards compatibility between services will be provided at the API level as described in <<rfd421>>. Most internal control plane service APIs are Dropshot based and therfore can utilize the same strategy. Some other services, such as trust quroum and crucible operate over TCP with custom protocols, and have their own mechanisms for backwards compatibility. The introduction of new services of this type should be largely unnecessary for the foreseeable future.
In general, backwards compatibility between services will be provided at the API level as described in <<rfd421>>. Most internal control plane service APIs are Dropshot based and therefore can utilize the same strategy. Some other services, such as trust quorum and Crucible, operate over TCP with custom protocols and have their own mechanisms for backwards compatibility. The introduction of new services of this type should be largely unnecessary for the foreseeable future.
docs/control-plane-architecture.adoc
Outdated
guidelines for how to do this consistently and safely in the future.

1. Ensure the code to perform an upgrade / backfill is in one location. This makes it easier to find and remove once it is no longer needed. It also makes it easier to test in isolation, and to understand the complete change.
2. We currently operate in a mupdate-driven world. For the time being we should not perform arbitrary backfilling during normal operation of the system. Instead, after a mupdate/schema migration we should force a blueprint step that performs the backfilling before doing anything else. Once that is done, all other nexus code can operate as if there never was any backfilling. This keeps the bulk of the code streamlined and allows the update code to be self-maintained. It's possible we may want to make this a first-class part of reconfigurator for when we enter the online update world.
I think we can make this a more general rule. This is basically what I did in #4466, where I created a new ledger for the new format. On startup, sled agent would read the old one and write the new one and then from then on it would only ever look at the new one. This is the same principle you're describing here -- for all the same reasons and benefits -- but for the Ledger data you mentioned a few paragraphs above.
I think the general principle is something like: when doing a migration from old-format stuff to new-format stuff, prefer to do it up front during some kind of startup operation so that the rest of the system can operate only in the new world. (This may require being pretty careful that the conversion won't fail, since if it does, the system won't come up. I was pretty paranoid about that with #4466 and took a lot of steps to make sure that wouldn't happen (e.g., fetched data from every system I could get my hands on and ran it through the conversion and added a lot of it to the test suite too).)
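To make the principle concrete, here is a rough sketch of the pattern, not the actual code from #4466. Every name in it (`LedgerV1`, `LedgerV2`, `Store`, `convert`, `load_at_startup`) is hypothetical, the "old format" is an invented `name=size` string list, and an in-memory struct stands in for the on-disk ledgers. The point is only the shape: conversion lives in one function, runs once at startup, and everything afterwards sees only the new format.

```rust
use std::collections::BTreeMap;

/// Hypothetical old format: a flat list of "name=size_gib" strings.
#[derive(Debug, Clone, PartialEq)]
struct LedgerV1 {
    entries: Vec<String>,
}

/// Hypothetical new format: typed records plus an explicit generation.
#[derive(Debug, Clone, PartialEq)]
struct LedgerV2 {
    generation: u64,
    datasets: BTreeMap<String, u64>,
}

#[derive(Debug, PartialEq)]
enum MigrateError {
    Malformed(String),
}

/// In-memory stand-in for persistent storage (an assumption for this sketch).
#[derive(Debug, Default)]
struct Store {
    v1: Option<LedgerV1>,
    v2: Option<LedgerV2>,
}

/// All old-to-new conversion logic lives here, so it is easy to test in
/// isolation and easy to delete once no old-format data remains.
fn convert(old: &LedgerV1) -> Result<LedgerV2, MigrateError> {
    let mut datasets = BTreeMap::new();
    for entry in &old.entries {
        let (name, size) = entry
            .split_once('=')
            .ok_or_else(|| MigrateError::Malformed(entry.clone()))?;
        let size: u64 = size
            .parse()
            .map_err(|_| MigrateError::Malformed(entry.clone()))?;
        datasets.insert(name.to_string(), size);
    }
    Ok(LedgerV2 { generation: 1, datasets })
}

/// Run once during startup. Prefers the new ledger if it already exists;
/// otherwise converts the old one (or starts empty) and writes the new one.
/// On a conversion error nothing is written and startup fails.
fn load_at_startup(store: &mut Store) -> Result<LedgerV2, MigrateError> {
    if let Some(new) = &store.v2 {
        return Ok(new.clone());
    }
    let new = match &store.v1 {
        Some(old) => convert(old)?,
        None => LedgerV2 { generation: 1, datasets: BTreeMap::new() },
    };
    store.v2 = Some(new.clone());
    Ok(new)
}
```

Because a failed conversion blocks startup in this shape, `convert` is the function worth feeding with as much real-world old-format data as possible in tests.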
@davepacheco I took a stab at your suggestions in 219fab9