
Possibility of losing commits when polling in high concurrency scenarios with SQL Server #21

Open
fschmied opened this issue Apr 17, 2016 · 10 comments


@fschmied

As explained in this dba.stackexchange.com answer, the IDENTITY-based checkpointing mechanism used by the SqlPersistenceEngine with MsSqlDialect is not safe: commits could be skipped when using GetFrom or CommitPollingClient under high load. The probability of running into this scenario is probably very low, but it does exist.
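
To illustrate, the polling pattern in question boils down to roughly the following (a simplified plain ADO.NET sketch, not the actual SqlPersistenceEngine code; the connection string and structure are purely illustrative, only the Commits/CheckpointNumber names follow the default schema):

using System.Data.SqlClient;

// Simplified sketch of the polling pattern. 16 stands in for the last checkpoint the
// subscriber has already processed and persisted somewhere.
long lastCheckpoint = 16;

using (var connection = new SqlConnection("Server=.;Database=EventStore;Integrated Security=true"))
{
    connection.Open();

    // CheckpointNumber is an IDENTITY column: its value is assigned at INSERT time, but the row
    // only becomes visible at COMMIT time - and not necessarily in IDENTITY order. If the
    // transaction holding 18 commits before the one holding 17, this query can already return 18;
    // the poller then advances its checkpoint to 18 and the later-committed 17 is never read.
    using (var command = new SqlCommand(
        "SELECT CheckpointNumber /*, other commit columns */ FROM Commits " +
        "WHERE CheckpointNumber > @LastCheckpoint ORDER BY CheckpointNumber",
        connection))
    {
        command.Parameters.AddWithValue("@LastCheckpoint", lastCheckpoint);
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                lastCheckpoint = reader.GetInt64(0); // the checkpoint only ever moves forward
            }
        }
    }
}

Because the checkpoint only ever moves forward, a row that becomes visible later with a lower CheckpointNumber is lost to the subscriber.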

See also NEventStore/NEventStore#425 for a discussion of possible solutions.

@mgevans

mgevans commented Jul 27, 2016

I've also seen this issue on PostgreSQL.

@fschmied

fschmied commented Aug 4, 2016

@mgevans In SQL Server, the scenario is probably very unlikely to occur, as the window of opportunity is very, very small if you use the right isolation level (lock-based READ COMMITTED without the READ COMMITTED SNAPSHOT option).
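
For reference, lock-based READ COMMITTED can be forced on SQL Server either per query or database-wide. A quick sketch, written as the kind of SQL-string constants MsSqlDialect uses (the statements themselves are standard T-SQL; the constant names are just illustrative):

// Per-query table hint - leaves the rest of the database on READ COMMITTED SNAPSHOT and only
// makes the polling query take shared row locks again:
public const string GetFromWithLockHint =
    "SELECT * FROM Commits WITH (READCOMMITTEDLOCK) " +
    "WHERE CheckpointNumber > @FromCheckpointNumber ORDER BY CheckpointNumber;";

// Database-wide switch (affects all readers and needs exclusive access for a moment):
public const string DisableReadCommittedSnapshot =
    "ALTER DATABASE CURRENT SET READ_COMMITTED_SNAPSHOT OFF WITH ROLLBACK IMMEDIATE;";

// Check which mode is currently active:
public const string CheckReadCommittedSnapshot =
    "SELECT is_read_committed_snapshot_on FROM sys.databases WHERE name = DB_NAME();";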

I don't really know PostgreSQL, but if there is a similar way of using lock-based READ COMMITTED vs. snapshot-based READ COMMITTED, that might be a way of avoiding it (or making it less likely) on PostgreSQL as well.

@fschmied

fschmied commented Dec 1, 2016 via email

@PeterStephenson

Hi, we actually realised that READ COMMITTED SNAPSHOT was on shortly after I posted that, and I removed the post as it was misleading. Sorry!

@fschmied

fschmied commented Aug 9, 2019

See #14 (comment) about READ COMMITTED SNAPSHOT and Azure SQL.

@cecilyth

cecilyth commented Dec 4, 2024

We are experiencing this in our production environment. Are there any suggestions for fixing this? We've implemented the READCOMMITTEDLOCK change (ea686a1#diff-9ec321712e759f586acc6e7a5b53661f1f7cadf2e6b6faade5e6e7db28f98f30) but still occasionally miss commits in our pollers.

For example, here we pulled out checkpoints 3227030099 and 3227030104, but not 3227030103, which is in the same bucket.


Update: We ended up implementing something similar to the suggestion in this comment. So far it's working well.

@fschmied

@cecilyth Would it be possible for you to share the code of your solution? I think that if READCOMMITTEDLOCK is in place, adding NOLOCK wouldn't actually make much difference: the problem is not a lock being skipped, but an IDENTITY increment AND another full transaction happening before the lock in question is taken. (See the dba.stackexchange.com discussion linked above.) However, simply repeating the query might reduce the likelihood of the problem occurring even further (with NOLOCK or without).

Also, I'd be interested: Are you 100% sure the READCOMMITTEDLOCK hint was included in the relevant GetFrom query? I'm asking because we're also using NEventStore in production with quite a lot of load, but have never seen skipped commits. I was under the impression that with the locks in place, the chance of skipping was very, very low. If you're seeing it occasionally despite locking, that might be wrong and I'd like to know 😅.

@cecilyth

cecilyth commented Dec 16, 2024

Here's the narrative of how our journey with this bug went. We're a summer camp registration company, so we have very heavy spurts of load during a registration event, and the Azure SQL database for our eventstore is just over 3 TB, with almost 2 billion rows in the Commits table.

We originally released our eventstore upgrade (going from version 3 of the original Eventstore repo to the latest version of NEventStore) in late August. We fairly quickly saw the pollers skipping commits, implemented the READCOMMITTEDLOCK hint, and patched that out to our production environment in the first week of September. The rate at which we were "skipping" went down drastically after that release, to the point where it happened maybe once or twice a month and we had to dig pretty deep to find any instances at all. I can say that the READCOMMITTEDLOCK hint was definitely added to the query, but I can't say with certainty that it was included in the query plan. It did however seemingly fix our skip issue.

Unfortunately, shortly thereafter we realized we had an issue with the primary key on the Commits table. Our primary key included the CheckpointNumber, meaning we never got a primary key violation (because the IDENTITY value is always unique) and thus never got any ConflictingCommandExceptions. This meant we had streams with multiple commits sharing the same CommitSequence, and some of those commits conflicted with one another since they were never retried or detected as conflicting.

In late November we changed the primary key on the Commits table, dropping CheckpointNumber from the key. Unexpectedly, when we did this we started seeing many more skipped commits, anywhere between 10 and 100 per day, and it was drastically impacting our system stability and data proliferation during periods of high load. Another thing we noticed was that querying commits in the poller got about 10 times faster after the PK change.

Sequencing the commits on the client side wasn't really an option for us: we're a multi-tenant system and each commit Bucket corresponds to a different client, so holes in the CheckpointNumber sequence are expected when polling commits per bucket, and we'd have no way of knowing whether a hole was just a checkpoint belonging to a different bucket or a commit that had truly been skipped.

I implemented new functionality in SqlPersistenceEngine as well as MsSqlDialect to try the Count(*) with NOLOCK suggestion that I linked in my previous comment. In our load tests, the NOLOCK hint seemed to do nothing - we were consistently counting the same number of commits in the NOLOCK query as we were pulling out with the READCOMMITTEDLOCK hint on the GetFrom query. I then refactored a bit to open a transaction with the READ UNCOMMITTED isolation level and do the count inside it. Load testing revealed that this did frequently result in a different count (e.g. an uncommitted count of 10 but only 9 commits pulled on the first query), and re-querying commits until the two counts aligned prevented us from seeing any symptoms of skipped commits in our load tests. It's possible that just doing the second query was enough of a delay, but I never got a difference in count with the first implementation (NOLOCK hint), whereas after refactoring to set the transaction isolation level explicitly it was fairly easy to see a difference at about the rate we were expecting.

We deployed to production last week and haven't seen any skipped commits since. Finally, we seem to be in a stable place with the primary key change, read committed lock for the GetFrom, and uncommitted count. Our code for the READCOMMITTEDLOCK change is exactly what's in the changeset that I linked above. Unfortunately, I'm not able to share our exact implementation of the uncommitted count, but it's basically this:

in MsSqlDialect:

public const string GetCountFromToBucketAndCheckpoint =
	@"SELECT Count(*)
	    FROM Commits
	   WHERE BucketId = @BucketId
	     AND CheckpointNumber > @FromCheckpointNumber
	     AND CheckpointNumber <= @ToCheckpointNumber;";

// Opens an explicit READ UNCOMMITTED transaction so that the count query also sees rows whose
// IDENTITY value has already been assigned but whose transaction has not committed yet.
public IDbTransaction OpenReadUncommittedTransaction(IDbConnection connection)
{
	return connection.BeginTransaction(System.Data.IsolationLevel.ReadUncommitted);
}

in SqlPersistenceEngine:

// Dirty count of the commits in the range (fromCheckpoint, toCheckpoint]; a result larger than
// the number of rows actually pulled indicates an in-flight transaction in that range.
public int GetCountUncommitted(string bucketId, long fromCheckpoint, long toCheckpoint)
{
	return ExecuteReadUncommittedCommand((IDbStatement cmd) =>
	{
		string getCountFromBucketAndCheckpointUncommitted = _dialect.GetCountFromToBucketAndCheckpoint;
		cmd.AddParameter(_dialect.BucketId, bucketId, DbType.AnsiString);
		cmd.AddParameter(_dialect.FromCheckpointNumber, fromCheckpoint);
		cmd.AddParameter(_dialect.ToCheckpointNumber, toCheckpoint);
		return cmd.ExecuteScalar(getCountFromBucketAndCheckpointUncommitted).ToInt();
	});
}

private T ExecuteReadUncommittedCommand<T>(Func<IDbStatement, T> command)
{
	return ExecuteReadUncommittedCommand((IDbConnection _, IDbStatement statement) => command(statement));
}

protected virtual T ExecuteReadUncommittedCommand<T>(Func<IDbConnection, IDbStatement, T> command)
{
	using (TransactionScope transactionScope = OpenCommandScope())
	using (IDbConnection dbConnection = _connectionFactory.Open())
	using (IDbTransaction dbTransaction = _dialect.OpenReadUncommittedTransaction(dbConnection))
	using (IDbStatement arg = _dialect.BuildStatement(transactionScope, dbConnection, dbTransaction))
	{
		try
		{
			T val = command(dbConnection, arg);
			dbTransaction?.Commit();
			transactionScope?.Complete();
			return val;
		}
		catch (Exception ex)
		{
			if (!RecoverableException(ex))
			{
				throw new StorageException(ex.Message, ex);
			}

			transactionScope?.Complete();
			throw;
		}
	}

	bool RecoverableException(Exception e)
	{
		if (!(e is UniqueKeyViolationException))
		{
			return e is StorageUnavailableException;
		}

		return true;
	}
}
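
And roughly how the polling side uses it - again not our exact code, just the shape of it. It assumes the Int64-checkpoint GetFrom(bucketId, checkpoint) overload and ICommit.CheckpointToken of current NEventStore; the retry limit and delay are illustrative:

// Sketch of the consuming side: pull commits as usual, then compare against the dirty count
// and re-query while an in-flight row is detected in the pulled range.
protected ICommit[] GetFromWithUncommittedCheck(string bucketId, long fromCheckpoint)
{
	const int retryLimit = 5; // illustrative; in practice the first retry almost always suffices

	for (int attempt = 0; ; attempt++)
	{
		// Regular GetFrom query (with the READCOMMITTEDLOCK hint) above the last checkpoint.
		ICommit[] commits = GetFrom(bucketId, fromCheckpoint).ToArray();
		if (commits.Length == 0)
		{
			return commits;
		}

		long toCheckpoint = commits[commits.Length - 1].CheckpointToken;

		// Dirty count over the same range: sees rows whose IDENTITY value has already been
		// assigned but whose transaction hasn't committed yet.
		int uncommittedCount = GetCountUncommitted(bucketId, fromCheckpoint, toCheckpoint);

		if (uncommittedCount <= commits.Length || attempt >= retryLimit)
		{
			return commits; // counts agree (or we give up waiting) - safe to dispatch
		}

		System.Threading.Thread.Sleep(50); // give the in-flight transaction a moment, then re-query
	}
}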

@fschmied

Wow, thanks a lot for the detailed description! ❤️

So, you were still rarely seeing skipped commits with your very high load even after introducing the READCOMMITTEDLOCK hint in September, albeit much, much fewer than without the hint (which would be very much expected) - that's very valuable info to me.

It also makes some sense to me that the skip likelihood increased a lot after removing the CheckpointNumber from the Primary Key and thus, I assume, the CLUSTERED index, as the GetFrom query for pulling commits is acting on that very same index (if you're pulling by checkpoint number, which would be the default approach). So, the whole synchronization between commit pulling and creating new commits usually takes place on that index.

(BTW, there's usually another UNIQUE index called IX_Commits_CommitSequence that I think is responsible for avoiding duplicate CommitSequence values and for triggering the ConflictingCommandException. If that didn't work, that index must have been deleted or gotten corrupted somehow? Maybe restoring that would be/have been an alternative solution to your issue.)
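
For reference, I'd expect that index to be created roughly as follows by the schema initialization - the exact column list is an assumption on my part (shown here as a SQL-string constant like the dialect uses), so please check it against your actual database:

public const string AssumedCommitSequenceIndex =
	@"CREATE UNIQUE NONCLUSTERED INDEX IX_Commits_CommitSequence
	      ON dbo.Commits (BucketId, StreamId, CommitSequence);"; // one CommitSequence per stream per bucket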

Thanks also for the details about your workaround! The bad news is I think it still doesn't solve the original issue in theory, but in practice it's probably enough to get a stable solution.

In case you're still interested: the theoretical issue (as I understand it) is the following sequence of events, caused by a race involving (at least) three parallel threads. Threads A and B are writing commits, Thread C is pulling commits with a query like WHERE CheckpointNumber > 16 and with READCOMMITTEDLOCK.

  1. Thread A: Get next CheckpointNumber (IDENTITY) => 17
  2. Thread B: Get next CheckpointNumber (IDENTITY) => 18
  3. Thread B: Take ROWLOCK on CLUSTERED index entry for value 18
  4. Thread C: Run into ROWLOCK of step 3, block due to READCOMMITTEDLOCK.
  5. Thread A: Take ROWLOCK on CLUSTERED index entry for value 17
  6. Not caring about the order, Threads A and B insert and commit their rows.

As Thread C is already blocked on value 18 (step 4) and its scan has already passed the position where value 17 will go, it will skip the value 17 written in step 6.

With your solution, you still have the race condition in theory: if Thread C performs its second (read uncommitted) count query before Thread A gets to insert its row, the count will match the number of commits already pulled and value 17 is still skipped. The problem is inherent because the next CheckpointNumber value is incremented independently of the rows becoming visible or locks being taken. In practice, performing the second query probably delays Thread C long enough for Thread A to reach step 6, so it mitigates the problem ("by chance").

The only other fix I know that can solve this issue is to pause when you see a gap while pulling events, but even that can only be done heuristically (how long to wait?), and it will negatively affect performance if you have lots of aborted transactions (e.g., due to high concurrency on the same streams) :( .
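
Sketched out, such a gap pause could look roughly like this (purely illustrative, not an actual NEventStore API; the retry count and wait time are exactly the heuristic part, and it only works when pulling across all buckets, where gaps are otherwise unexpected):

// Illustrative gap-detection wrapper around whatever pulls the commits.
// ICommit.CheckpointToken is assumed to be the Int64 checkpoint number.
private ICommit[] PullWaitingForGaps(Func<ICommit[]> pull)
{
    const int maxRetries = 3;         // heuristic
    const int waitMilliseconds = 100; // heuristic

    ICommit[] commits = pull();
    for (int i = 0; i < maxRetries && HasGap(commits); i++)
    {
        // Hope that the transaction holding the missing IDENTITY value commits in the meantime
        // (if it aborted, the gap is permanent and the wait is wasted).
        System.Threading.Thread.Sleep(waitMilliseconds);
        commits = pull();
    }

    return commits;
}

private static bool HasGap(ICommit[] commits)
{
    for (int i = 1; i < commits.Length; i++)
    {
        if (commits[i].CheckpointToken != commits[i - 1].CheckpointToken + 1)
        {
            return true; // a checkpoint number is missing between two consecutive commits
        }
    }

    return false;
}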

@cecilyth

cecilyth commented Dec 17, 2024

It's a good theory about that UNIQUE index somehow being corrupt or missing somewhere! I'll look into that. I totally understand your example, and based on all the talk of the issue being very, very rare, I was suspicious of it being the cause of our problems. One thing I'm not totally getting yet is why we see the mismatch in count when querying uncommitted, and why re-querying that range results in more commits getting pulled. I guess in your example, it's because by the time we count the uncommitted, we've hit step 5 and Thread A has taken the rowlock on 17 (i.e., its not-yet-committed row now exists for the dirty count to see)?

I really appreciate your response!
