-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return errors when session consistency would be broken #4351
Return errors when session consistency would be broken #4351
Conversation
kill_session_on_consistency_loss@51658 aka 20221020.5 vs main ewma over 20 builds from 51225 to 51650 Click to see tablemain
kill_session_on_consistency_loss
|
src/http/http_endpoint.h
Outdated
auto header_begin = std::search( | ||
response.begin(), response.end(), target.begin(), target.end()); | ||
auto header_name_end = std::find(header_begin, response.end(), ':'); | ||
auto header_value_end = std::find(header_name_end, response.end(), '\r'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems unfortunate at best, and possibly insecure at worst if it's possible to add http::headers::CCF_TX_ID
inline as a header value, followed by a bogus value. We should at least look for "\r{}", http::headers::CCF_TX_ID
, I think.
x-my-evil-header: x-ms-ccf-transaction-id: 9.99\r
x-ms-ccf-transaction-id: 2.1
I'm a bit skeptical about the tradeoffs involved in going above and beyond when we know for sure there has been an election, to still attempt to accurately report status. It seems to me that immediately returning an error and shutting down the connection is cleaner to implement and to reason about in terms of availability. A batching client cannot avoid having to implement a backtracking procedure, as far as I can tell anyway. |
Perhaps something we could do in that situation is respond with errors to all further requests, generically at first, and then with the last committed transaction id in the last term the session wrote to.
|
Summary of discussion with @heidihoward yesterday: the tradeoff for trying to do better than just dropping the connection is unclear, particularly in an environment where many client libraries use connection pools and will not expose sessions directly. It is unlikely that a client will write logic that correctly handles these responses. |
My view is that this approach was easier to implement than closing the session (application code returning an application error, rather than affecting session lifetimes), that it makes little difference to clients which don't handle this behaviour (is your session closed aggressively or do you get permanent errors? Could you or do you want to handle one of these under the hood?), and that it's much nicer in the case where there's a human-in-the-loop request flow (harmless elections are invisible and allow you to proceed, you get a readable error if your state was lost that helps you backtrack). EDIT - On reflection, there's really 2 changes here from what we initially envisaged. A is whether this is "an election affects all active sessions", or "per-request, we look at the reported TxIDs" (with a middle-ground of "after an election, we do FOO for every new request on previously active sessions"). B is whether we close the connection, send an error response and close the connection (requires waiting for the next request), or keep the session open but continue to return errors. |
@eddyashton in the case of an automatic client with a connection pool, keeping the connection open but returning errors for every subsequent request is clearly suboptimal, even if it's for a while until we can provide information about where the rollback took place. It results in a much higher rate of failure going forward than closing the connection. |
Parking this PR for now, and will try a simpler approach of aggressively killing sessions on election. |
Resolves #3952.
This currently implements a softer variant of what was previously discussed. Rather than killing sessions, we return HTTP errors. Even after returning such an error, we will not kill the session - the user may ask us multiple times over the same session and get repeated errors, or even ask us a pure command (non-transactional and thus not inconsistent) and get a real response back.
I've added an end-to-end test that I think covers everything, but I'm considering a stochastic for broader coverage: spammy clients confirming they see a consistently ratcheting TxID (or this new error), while we load the service/cause elections/kill nodes.