Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPIRE Server MySQL allowing duplicate entry creations #4329

Closed
amoore877 opened this issue Jul 7, 2023 · 9 comments
Closed

SPIRE Server MySQL allowing duplicate entry creations #4329

amoore877 opened this issue Jul 7, 2023 · 9 comments
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog stale unscoped The issue needs more design or understanding in order for the work to progress

Comments

@amoore877
Copy link
Member

amoore877 commented Jul 7, 2023

In my org we have noticed for a while now that sometimes we have duplicate registrations existing in SPIRE with MySQL DB. At our scale, this is maybe a few dozen duplications a month across 2M+ entries, so not a show-stopping issue but definitely still a bug.
We had some time to investigate deeply today, and we believe we've root-caused it. Credit to @rturner3 as partner in the investigation.

Relevant code: https://github.com/spiffe/spire/blob/727fd183530abc02c8ead0799afccc4d0f667266/pkg/server/datastore/sqlstore/sqlstore.go#L367C2-L367C2

	if err = ds.withWriteTx(ctx, func(tx *gorm.DB) (err error) {
		registrationEntry, err = lookupSimilarEntry(ctx, ds.db, tx, entry)
		if err != nil {
			return err
		}
		if registrationEntry != nil {
			existing = true
			return nil
		}
		registrationEntry, err = createRegistrationEntry(tx, entry)

There is a race condition if multiple requesting processes try to make a similar entry in SPIRE.
Normally, a lookup is first done to see if a similar entry (by SPIFFE ID, Parent ID, and Selectors) already exists. If so, we create nothing and just return that entry.
However, since we are using withWriteTx, there is not a sufficient lock retrieved for performing the lookup. Thus, concurrent requests can both get back that no similar entry exists, and proceed with creation.

We suspect that withReadModifyWriteTx is needed instead.

My org intends to try to schedule time to test this fix in our internal fork, observing if there's any performance hits and if the issue fully resolves.
However, we're not yet committed to a timeframe here and welcome community members to try to reproduce the issue and test out the fix themselves.

Observed on multiple versions up to our current of 1.6.1. If needed I can attempt to get exact versions :)

@zmt
Copy link
Contributor

zmt commented Jul 10, 2023

If needed I can attempt to get exact versions

We observed it at the following versions: v0.12.3, v1.0.4, v1.6.1

@rturner3 rturner3 added the triage/in-progress Issue triage is in progress label Jul 11, 2023
@amoore877
Copy link
Member Author

amoore877 commented Jul 13, 2023

This post stands as a living place I will update ad hoc with rates of duplication found over some periods of time. Scale of total registrations is continually in millions, distributed across a couple dozen separate spire clusters:


Version 1.6.1: 2023.07.07 - 1 dupe over 2 hours
2023.07.13 - 10 dupes over 6 days
2023.07.18- 7 dupes over 5 days

@evan2645
Copy link
Member

Related to #3467 and #2275. Closing out the former in favor of this issue.

@evan2645 evan2645 added help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress and removed triage/in-progress Issue triage is in progress labels Jul 18, 2023
@sorindumitru
Copy link
Contributor

I've also observed this today, with PostgreSQL not MySQL. It seems fairly reliably to reproduce: start N instances of spire-server with a side-car that creates the same registration entry.

This was with a recent HEAD, from sometime this week. I was able to reproduce the issue twice, so if there's a potential fix for the issue, I can test it to see if helps or not. I should be able to also test MySQL, if that's needed.

@sorindumitru
Copy link
Contributor

sorindumitru commented Aug 9, 2023

I tested using withReadModifyWriteTx but that didn't help at all. I'm not sure if that would help with this particular transaction since it only sets a gorm option for SELECT (AFAICT):

tx = tx.Set("gorm:query_option", "FOR UPDATE")

but for listing entries we build the query ourselves.

I'm no database expert, but I'm not sure locking rows would help us here much with the data being spread out across multiple tables. All inserts affect independent rows with no relation to each other, from the point of view of the database.

Setting serializable isolation level seems to work.

@rturner3
Copy link
Collaborator

rturner3 commented Aug 10, 2023

Using serializable isolation level seems like the right direction (also suggested in previous issue #3467), but we will need to do it in a way that is compatible with SQL databases that don't support serializable isolation level, e.g. Percona XtraDB Cluster (see #2587). Some thoughts that come to mind that could help us find a concrete approach:

  • On server startup, can we reliably detect at runtime whether the configured database supports serializable isolation level? One idea could be to try to execute a serializable transaction and use the success/failure result as a way to decide whether to use serializable isolation level or not. Alternatively, maybe there is a way we can figure out the database type and enable/disable serializable isolation level based on some hardcoded list of databases like Percona?
  • If detecting support for serializable isolation level at runtime is infeasible, I feel it would be good to prioritize backwards compatibility by making usage of serializable isolation level an opt-in configuration.

@sorindumitru
Copy link
Contributor

I'll see if that works. I should have time for this later this week.

Do we have a list of database engines that we support? I wouldn't have thought of Percona, but I can see why it's supported, it's supposed to look like MySQL.

Copy link

This issue is stale because it has been open for 365 days with no activity.

@github-actions github-actions bot added the stale label Aug 13, 2024
Copy link

This issue was closed because it has been inactive for 30 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog stale unscoped The issue needs more design or understanding in order for the work to progress
Projects
None yet
Development

No branches or pull requests

5 participants