This repository has been archived by the owner on Oct 27, 2020. It is now read-only.

Mecha resiliency and the API #74

Open
dewiniaid opened this issue Mar 15, 2016 · 1 comment

@dewiniaid
Contributor

Poking all of you since this is somewhat multidisciplinary
@trezy @tyrope @xlexi @kenneaal

(Wall of text crits for 9001 damage).

The current API model for "skunkworks" will be stripping out all of the HTTP functionality and speaking only WebSockets. This drastically simplifies some of the design bits, considering it needs WS anyway for notifications and not having to implement both is... convenient.

It's been theoretically possible for Mecha to operate in a hybrid "online, with offline fallback" mode for a while now, though I hadn't given much thought to how it'd look. I also don't have a convenient way to test various failure modes, since I haven't been able to get the API to run locally to be able to do things like... 'crash' it in the middle of Mecha talking to it.

One part of this is figuring out some of the specifics of when Mecha should consider itself offline vs online, how exactly it should go about checking for restored connectivity, and how it should reconcile with the API after being disconnected for a set amount of time.

To that end, here's some tidbits on the current setup:

Property Change Tracking

Mecha tracks "pending" property changes. That is, I can do this:

rescue.platform = 'pc'
rescue.quotes.append("quote")

and Mecha will know that both platform and quotes have changed. Furthermore, for some simple cases of modification (like appending to a list, add/remove from a set, or add/remove/set in a dict) it maintains a limited amount of state that can be used to re-apply the same change against an updated collection. (In essence, it's kind of like git rebase.) More complicated modifications that can't be reliably repeated (like removing or altering an item in a list) flag the collection so that Mecha knows it can't reliably merge and will instead overwrite whatever the API provides in its entirety.
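To make the "rebase" idea concrete, here's a minimal sketch of a list wrapper that records appends and replays them on top of a newer copy. TrackedList and its method names are made up for illustration and are not Mecha's actual classes.

```python
class TrackedList:
    """Hypothetical list wrapper that remembers appends so they can be replayed."""

    def __init__(self, items=()):
        self._items = list(items)
        self._appended = []     # pending appends, in the order they happened
        self._mergeable = True  # becomes False after a change we can't replay

    def append(self, item):
        self._items.append(item)
        self._appended.append(item)

    def remove(self, item):
        # Removing an item can't be replayed reliably, so stop trying to merge.
        self._items.remove(item)
        self._mergeable = False

    @property
    def changed(self):
        return bool(self._appended) or not self._mergeable

    def rebase(self, server_items):
        """Take the API's copy and re-apply pending appends on top of it."""
        if self._mergeable:
            self._items = list(server_items) + self._appended
        # Otherwise keep the local copy; it overwrites the API's on the next save.

    def commit(self):
        """Called after a successful save: forget all pending state."""
        self._appended.clear()
        self._mergeable = True

    def __iter__(self):
        return iter(self._items)


quotes = TrackedList(["a"])
quotes.append("b")          # local, not yet saved
quotes.rebase(["a", "x"])   # the API gained "x" in the meantime
print(list(quotes))         # ['a', 'x', 'b']
```

After a remove(), rebase() leaves the local list alone, which matches the "can't merge, overwrite instead" fallback described above.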

This is currently used by the mechanism for saving cases to determine which attributes to send -- once a case is successfully saved, the particular properties are 'committed' which essentially removes them from the set of changed properties and tells them to discard any pending state they might have.

This is all handled under the hood; individual commands just change properties and call rescue.save() without having to deal with the minutiae.

Async saving and applying updates from the API

Mecha immediately applies any changes locally, reports on their effects, and then queues the relevant API call. It does have the ability to report any subsequent failures, but does not have the ability to roll back state.

Mecha also applies any updated rescue messages it receives from the API against a rescue, with one exception: Any properties with pending changes keep their existing (Mecha-supplied) values rather than what the API says. This is probably the correct behavior, because if they're flagged as changed in Mecha that means it's probably trying to tell the API about the change and hasn't yet. The exception to this exception is collections: as mentioned above, they'll replay their changes against the API version of the data if they believe it is feasible.
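A rough sketch of that merge rule for plain (non-collection) properties, using ordinary dicts in place of Mecha's rescue objects. The function name and the dict-based shape are assumptions for illustration only.

```python
def apply_update(local, pending, server):
    """Merge an update from the API into the local copy of a rescue.

    local   -- dict of current local attribute values (changed in place)
    pending -- set of attribute names with unsaved local changes
    server  -- dict of attribute values received from the API
    """
    for name, value in server.items():
        if name in pending:
            # The local change hasn't reached the API yet, so keep our value.
            continue
        local[name] = value
    return local


local = {"platform": "pc", "active": True}
pending = {"platform"}  # platform was changed locally and not yet saved
apply_update(local, pending, {"platform": "xb", "active": False})
print(local)  # {'platform': 'pc', 'active': False}
```

Collections would not take the simple "keep ours" branch; they would replay their recorded changes against the API's copy instead.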

There is one potential problem here: multiple pending updates to a case could end up executing out of order. In theory the change protection should prevent any mayhem from happening. In practice, Mecha has a per-rescue "lock" -- while a case is locked, no other command may modify it until the lock is released. If everything is healthy with the API, this should be unnoticeable -- but if there are issues saving cases (the API is slow or calls are otherwise timing out), the lock will take longer to release and may somewhat slow down multiple actions on the same case.

The rescue lock only applies to writes -- !quote, !list, etc are unaffected.
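Here's a small sketch of how a write-only, per-rescue lock behaves when a save is slow. It uses asyncio purely for illustration; whether Mecha is built on asyncio is an assumption, and Rescue, fake_save and set_platform are hypothetical names.

```python
import asyncio


class Rescue:
    """Hypothetical rescue with a lock that guards writes only."""

    def __init__(self):
        self.platform = None
        self.lock = asyncio.Lock()


async def fake_save(delay):
    # Stand-in for the API call; a slow API means the lock is held longer.
    await asyncio.sleep(delay)


async def set_platform(rescue, platform, log, delay):
    async with rescue.lock:  # other writers wait here until the save finishes
        rescue.platform = platform
        await fake_save(delay)
        log.append(("saved", platform))


async def main():
    rescue = Rescue()
    log = []
    slow = asyncio.create_task(set_platform(rescue, "pc", log, delay=0.05))
    fast = asyncio.create_task(set_platform(rescue, "xb", log, delay=0.0))
    await asyncio.sleep(0.01)
    # Reads never take the lock, so they aren't held up by the slow save.
    log.append(("read", rescue.platform))
    await asyncio.gather(slow, fast)
    return log


log = asyncio.run(main())
print(log)  # [('read', 'pc'), ('saved', 'pc'), ('saved', 'xb')]
```

The second write can't start until the first save completes, which is the slowdown mentioned above when the API is struggling.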

Timeouts and Retries.

Currently, Mecha allows for a 30 second timeout before giving up on a change. There's not yet any retry mechanism -- the case will just stay out of sync until something triggers it to save again and that save succeeds. This is bad and needs to be fixed.

There are also issues with retrying some requests -- appending quotes as a convenience method is great, until this scenario occurs:

  1. Mecha tries to append a quote.
  2. The request times out or the connection is lost.
  3. Unknown to Mecha, the quote is successfully appended.
  4. Mecha retries the action
  5. The quote gets duplicated.

(Replace "append a quote" with "Create a new case" for a bigger issue.)
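The scenario above can be reproduced with a toy stand-in for the API that applies the write and then loses the reply. FakeAPI and append_with_retry are invented for this sketch and don't correspond to real API or Mecha code.

```python
class FakeAPI:
    """Toy API that applies the first append but drops its reply."""

    def __init__(self):
        self.quotes = []
        self.drop_next_reply = True

    def append_quote(self, text):
        self.quotes.append(text)  # step 3: the append actually succeeds...
        if self.drop_next_reply:
            self.drop_next_reply = False
            raise TimeoutError("no reply")  # step 2: ...but the caller sees a timeout
        return "ok"


def append_with_retry(api, text, attempts=2):
    """Blindly retry on timeout, which is only safe for idempotent calls."""
    for _ in range(attempts):
        try:
            return api.append_quote(text)
        except TimeoutError:
            continue  # step 4: retry without knowing what the server did
    raise TimeoutError("gave up")


api = FakeAPI()
append_with_retry(api, "quote")
print(api.quotes)  # ['quote', 'quote']  (step 5: duplicated)
```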

One option here might be to treat a timeout as "Okay, we're totally offline" until some other connectivity test proves otherwise and then reconcile state (but we need to figure out how to reconcile state.)

Currently, only appending quotes and opening new cases are not idempotent and thus cannot safely be retried.

Reconciliation after downtime.

The big question here is... what happens when whatever issue kept Mecha away from the API is resolved?

There's a few options here:

  • Mecha can assume it's authoritative and completely overwrite the case with its own state. (This can merge in anything that doesn't have pending changes as normal). This is currently what I'm leaning towards, but is not without drawbacks -- anyone using !sub on a case in Mecha will overwrite all of the changes to that case's quotes made via some other API user if the issue is due to Mecha losing its network connection rather than the server.
  • Mecha can discard all of its changes and use the API's versions instead. This is probably the wrong option.
  • Mecha can pick some hybrid approach (but what?) based on case attributes -- like what Mecha has for dateModified vs what the server has. This would be super easy if individual attributes were versioned, but that's also overkill.
  • Something else you guys can come up with.

Something else Mecha will need to do is remember all of its closed cases until API connectivity is restored -- but I want a way to be able to access recently closed cases from Mecha anyways. (My thought is they'd use negative index numbers and rotate between, e.g. -1..-10)
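One possible reading of the negative-index idea, sketched with a bounded deque: -1 is always the most recently closed case, and anything older than the last ten falls off. ClosedCases is a hypothetical name, and the "shift everything down by one" behavior is my assumption about what rotating would mean.

```python
from collections import deque


class ClosedCases:
    """Hypothetical holder for the last few closed cases, indexed -1..-size."""

    def __init__(self, size=10):
        # With maxlen set, adding on the left discards the oldest on the right.
        self._cases = deque(maxlen=size)

    def add(self, rescue):
        self._cases.appendleft(rescue)

    def get(self, index):
        """-1 is the most recently closed case, -2 the one before it, etc."""
        if not -len(self._cases) <= index <= -1:
            raise IndexError("no closed case at index %d" % index)
        return self._cases[-index - 1]


closed = ClosedCases(size=3)
for name in ["a", "b", "c", "d"]:
    closed.add(name)
print(closed.get(-1), closed.get(-3))  # d b
```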

What exactly is downtime, and what is an error.

While connection loss is easy to identify, there's a fine line between "the API is being slow/timing out" and "the API is allowing connections, but is completely unresponsive to commands." Figuring out where to draw this line in Mecha for it to switch from online to offline mode is going to be key. Figuring out when functionality is restored is also important.

Also, Mecha needs to be able to distinguish between an error message that is -- say -- complaining about MongoDB or Elasticsearch being down (which corresponds to "API is borked, go offline!"), error messages telling it to retry something/etc (e.g. versioning conflicts, if/when the API gets them), and error messages saying "Nope, you screwed up, don't ever try that again." Right now most API errors are pretty vague and nonspecific.
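If the API did return specific error codes, the three-way split could look something like this. The code strings below are entirely made up, since the API doesn't currently expose anything this precise.

```python
from enum import Enum


class ErrorAction(Enum):
    GO_OFFLINE = "go offline"  # backend is down; switch to offline mode
    RETRY = "retry"            # transient; trying again may work
    GIVE_UP = "give up"        # the request itself is wrong


# Hypothetical error codes; the real API does not define these today.
BACKEND_DOWN = {"database_unavailable", "search_unavailable"}
RETRYABLE = {"version_conflict"}


def classify(error_code):
    """Map an API error code to what Mecha should do about it."""
    if error_code in BACKEND_DOWN:
        return ErrorAction.GO_OFFLINE
    if error_code in RETRYABLE:
        return ErrorAction.RETRY
    # Unknown errors are treated as permanent so nothing retries forever.
    return ErrorAction.GIVE_UP


print(classify("database_unavailable"))  # ErrorAction.GO_OFFLINE
print(classify("bad_request"))           # ErrorAction.GIVE_UP
```

Treating unknown codes as "give up" is a design choice in this sketch; defaulting to "go offline" would be the more cautious alternative.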

@Marenthyu
Member

TL;DR?
