Server instances using events-based entry cache occasionally have sticky, stale entries #5021
Please assign this to me |
@rturner3 One theory I have for this is that we are reading from the database and getting back a valid entry upon receiving a delete event. Is it possible we are reading from a replica that has stale data? |
We do configure read-only replicas in SPIRE Server |
To be explicit, this is configured via "ro_connection_string", correct? |
Yes |
I think this is somewhat similar to #4329 (or at least shares its root cause), just that now it affects events, which leads to more visible issues. I think the conclusion there was that we need the serializable isolation level. |
I also noticed that during start up we fetch the latest event id and the list the entries/agents in two independent transactions:
Should this happen in a single transaction to guarantee a consistent view of the world? |
@sorindumitru In that case I think it's OK that it's not behind a transaction. Since the events table is basically a queue, if we get some events in between the two calls we'll do some extra reads against the database, but things will still be consistent. |
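To illustrate why the two-transaction startup is tolerable, here is a minimal sketch of an idempotent event-driven cache; the `cache` and `apply` names are illustrative, not SPIRE's actual code. Because applying the same event twice converges to the same state, re-reading events that arrived between the two startup calls only costs extra reads:

```go
// Why two separate startup transactions are OK here: event processing is
// assumed idempotent, so events replayed from the window between "get
// latest event ID" and "list entries" cause extra reads, not corruption.
// Illustrative sketch only, not SPIRE's actual implementation.
package main

import "fmt"

type cache map[string]string // entryID -> entry data

// apply is idempotent: replaying the same upsert or delete converges
// to the same cache state.
func apply(c cache, entryID, data string, deleted bool) {
	if deleted {
		delete(c, entryID)
		return
	}
	c[entryID] = data
}

func main() {
	c := cache{}
	apply(c, "entry-a", "v1", false)
	apply(c, "entry-a", "v1", false) // replayed create event: no harm
	apply(c, "entry-a", "", true)
	apply(c, "entry-a", "", true) // replayed delete event: still consistent
	fmt.Println(len(c)) // 0
}
```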
@rturner3 Found an issue that could lead to a dropped event in the below: spire/pkg/server/endpoints/authorized_entryfetcher.go Lines 159 to 173 in 0f1310b
If the call to
Is that happening in your deployment? |
I see some I do, however, have metrics data for all our deployments going back the past 40 days, and I don't see any failed calls to get node selectors (I checked this metric). In any case, this logic does look suspect if we ever did fail to fetch node selectors, since |
A couple observations from recent logs:
We also have the agent LRU cache feature enabled. Just thinking out loud, is there some edge case we might be hitting where:
It still doesn't fully explain to me why we see the |
Mentioning #4985 as this may be a blocker. There have been two additional changes to the event cache proposed in PRs; can we verify whether those address some or all of the suspected issues described here? |
I think we found the root cause for this. The problem is with how SQL auto_increment IDs are assigned: spire/pkg/server/datastore/sqlstore/sqlstore.go Lines 455 to 470 in 7872525
SQL doesn't guarantee auto_increment IDs are committed in order across transactions, since assigning one is a two-step process: first the next index is secured, then the transaction commits. It's kind of like a read-modify-write cycle. The problem is more visible with the prune-registration-entries functionality, because the transaction there spans a loop of multiple deletes and event creates, allowing a bigger window for this race condition to happen. What we need to solve this is an always-increasing index in SQL, or more generally a way to get "what's new" since the last read. |
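The race described above can be simulated without a database. In this sketch (the `eventLog` type and its methods are illustrative, not SPIRE's code), IDs are handed out in order but transactions commit out of order, so a poller that only remembers the highest ID it has seen skips the late-committing event forever:

```go
// Simulation of the auto_increment race: IDs are allocated in order,
// but transactions may commit out of order. A poller tracking only
// "highest ID seen" permanently skips a late-committing lower ID.
// Illustrative sketch, not SPIRE's actual datastore code.
package main

import (
	"fmt"
	"sync"
)

type eventLog struct {
	mu        sync.Mutex
	committed map[int64]bool
	nextID    int64
}

// allocateID models securing the next auto_increment index.
func (l *eventLog) allocateID() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.nextID++
	return l.nextID
}

// commit models the transaction finally committing its row.
func (l *eventLog) commit(id int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.committed[id] = true
}

// poll returns committed events with ID > lastSeen, advancing lastSeen.
func (l *eventLog) poll(lastSeen int64) ([]int64, int64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []int64
	for id := lastSeen + 1; id <= l.nextID; id++ {
		if l.committed[id] {
			out = append(out, id)
			lastSeen = id
		}
	}
	return out, lastSeen
}

func main() {
	l := &eventLog{committed: map[int64]bool{}}
	slow := l.allocateID() // ID 1: allocated first, commits last
	fast := l.allocateID() // ID 2: allocated second, commits first
	l.commit(fast)

	seen, last := l.poll(0)
	fmt.Println(seen, last) // [2] 2 — event 1 not visible yet

	l.commit(slow)
	seen, last = l.poll(last)
	fmt.Println(seen, last) // [] 2 — event 1 committed, but skipped forever
}
```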
Couple solutions to this problem:
1 and 2 both depend on a "transaction timeout": if the server crashes in the middle of a transaction, the skipped event will never be written. This is a database configuration setting, which I think defaults to 8 hours on MySQL. As this is not a common occurrence, I think option 2 would work best. |
@faisal-memon and I have been discussing various solutions. *** Faisal prefers a solution where we don't process entries more than once. To implement this, he proposed a skipped_entry list. I then refined this approach to ensure it provably does not mis-process entries.
The algorithm can be simplified conceptually by these features:
*** If Faisal deems the above algorithm too complex, we will likely pivot to the next complete solution:
The algorithm above depends on the current assumption that all entries can be processed idempotently, which was a goal of the original event processing system. Note that this assumes all registration entry events remain idempotent going forward. If neither of the two solutions is deemed sufficient, the remaining solutions are still possibly viable:
*** The following solutions are deemed non-viable, with the reasoning given below:
The problems of capturing the autoincrement EntryID at the beginning of the transaction seem like they could be fixed by capturing the row creation time; however, even with a capture of the time the row was written to the database, the same problem remains. Rows written in a second session might not commit without significant delay. Such a delay would commit these rows out of time order, even though after the commit they correctly indicate the ordering of their creation. When the event polling fails to read these in-time-order but not-yet-committed rows, it cannot go backwards in time to capture the skipped events. Additionally, while Postgres supports row-creation capture through the non-standard SQL function CLOCK_TIMESTAMP(), which resolves the system clock time immediately as each row is processed, MySQL does not support similar functionality. NOW(), CURRENT_TIME, CURRENT_TIME(), and other variants of time-capturing variables and functions across all databases are pinned to the transaction create time, and are cached such that subsequent calls within the same transaction resolve to the same time.
This suggestion by Faisal would solve the event polling issues by ensuring that we only read transactions in the order in which they close, making it impossible to skip events; but such a solution is impossible, because by the time the transaction closes, the rows that are part of the transaction are already written to the temporary data structures that comprise the commit. Rewriting these rows effectively means holding the transaction open to do the "commit close" timestamp update, which itself takes time, delaying the commit close. This creates a chicken-and-egg problem, as it is impossible to both close the transaction and update its rows in zero time.
This suggestion comes from a mistaken assumption that the server count and server identity are stable, neither of which is a requirement for the product. SPIRE servers may join or leave the cluster at any time, and it is impossible for a client to predict when a joining server will perform its initial cache population. As it could not know when the cache is initialized, it follows that it could not predict whether the "needs to be processed" value is required or not. |
I was able to create a unit test to reproduce this failure and have a prototype fix here: https://github.com/faisal-memon/spire/pull/344/files |
@faisal-memon would you open a merge request for your branch? If it is incomplete, please mark it as in-progress. |
This is most likely the last issue open that blocks #4985 |
Will do once #5042 merges. My branch is based on top of that one. |
blocks #4985 |
FYI, #5042 was merged this morning. |
The algorithm posted above will work great for covering skipped events during the polling loop. It doesn't cover the case of a server restart while there is a skipped event that hasn't been resolved. So upon restart we have to get a list of all the events and look for any gaps. This leads to the issue below:
In the scenario event 20 will never get processed by server 3. |
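A gap scan at startup, as described above, can be sketched as follows. The `findGaps` helper is hypothetical, not SPIRE's actual startup code; it shows how a restarting server could detect that an allocated-but-uncommitted event ID exists below the maximum, instead of just starting from the highest ID seen:

```go
// Sketch of the restart hazard: if a server boots while event 20 is
// allocated but uncommitted, starting lastSeen at the max committed ID
// (21) silently drops event 20. Scanning for gaps exposes it.
// findGaps is a hypothetical helper, not SPIRE's actual code.
package main

import (
	"fmt"
	"sort"
)

// findGaps returns IDs missing from a snapshot of committed event IDs.
func findGaps(ids []int64) []int64 {
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	var gaps []int64
	for i := 1; i < len(ids); i++ {
		for missing := ids[i-1] + 1; missing < ids[i]; missing++ {
			gaps = append(gaps, missing)
		}
	}
	return gaps
}

func main() {
	// Event 20's transaction has not committed at startup time.
	committed := []int64{17, 18, 19, 21}
	fmt.Println(findGaps(committed)) // [20]
	// Without this scan, lastSeen would start at 21, and event 20,
	// committing moments later, would never be processed.
}
```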
This seems like a viable race condition. If there is a large gap between event 19 and 20, events < 20 could be quite old and in danger of pruning. |
I'm following this step by step
The auto increment index has issued 21, and the next index will be 22.
The database table will be updated after the initial server load, and additionally the next record Server 3 reads will be 22 or greater, skipping record 21. While I fully agree with you that this condition can exist, I'd still replace a known bug in steady-state running systems with an error that has a rare chance of happening on server startup. Please proceed with closing the steady-state problem of not keeping these entries synced. Startup issues like this will be investigated after we close the current issue. |
Opened #5151 to track the race issue referenced in #5021 (comment) |
We've seen this error log being emitted regularly in our deployments of SPIRE since enabling the events-based cache:
Entry not found or not authorized
We've seen this error get repeatedly logged for a few entries (~0.01% of total entries). I noticed this log gets emitted for a long time, for days on end. Looking at a few examples, the pattern I've seen is that one server instance is providing an entry from its cache that no longer exists in the database. This causes agents to pick up this entry when they round robin to this bad server instance (let's call this instance A). Then when the agent round robins to the next server instance B, it asks for an X.509-SVID to be signed for this old entry ID. Server instance B denies the SVID from being signed because the entry ID doesn't match any entry in its cache, then it logs the aforementioned error message. There doesn't appear to be any other way to mitigate this than to restart the server to force it to refresh its entry cache.
Relevant Server config: