tsm: ensure that snapshot writes are eventually retried/released, even if WriteSnapshot fails #5689
Conversation
Force-pushed from f83195c to 194ad0f
Oops. I need to re-roll with corresponding changes to the tests.
Force-pushed from 194ad0f to 79b5c0f
On reflection, I think this fix is probably incomplete: the write for the latest snapshot does not gather the keys from snapshots that failed to be written by a previous WriteSnapshot() call, so we end up losing the data from those snapshots. I need to think about this some more. I will re-open the PR when/if I find a solution.
Force-pushed from df3edd7 to 8257570
Previously, if a snapshot write failed, the memory was never released and the associated writes were never retried. Furthermore, if the write error was transient and a subsequent compaction succeeded, the associated WAL segments would be deleted without ever having been written into TSM files. Such data loss would not be apparent until influxd was restarted, because until then the data would still be served from the cache.

Interestingly, these issues made influxd less resilient to transient write errors (which apparently do not require a server restart) than to permanent write errors (which do). The reason is that, provided the WAL segments are not deleted, the normal influxd restart process will eventually compact the WAL segments associated with failed snapshot writes. If the write errors are only transient, however, the next successful compaction will delete the WAL segments without their contents ever having been written into TSM files.

This series should address these issues.
I have now added some tests for the rollback functionality. The new and revised tests show how the association between a cache snapshot and a slice of WAL segment names is established, and how WriteSnapshots() can use this capability to retry previously failed snapshot writes.
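For concreteness, here is a minimal Go sketch of the kind of association described above. The `snapshot` struct and the `prepare` function are illustrative names, not the actual tsm1 types.

```go
package main

import "fmt"

// snapshot is an illustrative stand-in for a cache snapshot that records the
// WAL segments it captures, so they are only deleted once the snapshot has
// been written to a TSM file.
type snapshot struct {
	files []string // WAL segment paths captured by this snapshot
}

// prepare associates the currently closed WAL segments with a new snapshot,
// copying the slice so later changes to closedSegments do not alias it.
func prepare(closedSegments []string) *snapshot {
	return &snapshot{files: append([]string(nil), closedSegments...)}
}

func main() {
	s := prepare([]string{"_00001.wal", "_00002.wal"})
	fmt.Println(s.files) // removable only after this snapshot is in a TSM file
}
```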
Force-pushed from 953958f to b65864d
Fixed a thread safety issue in the implementation: PrepareSnapshots() now returns a clone of the snapshots slice, rather than a reference. This allows the snapshotter to modify the contents of the slice without interfering with queries that may be trying to use it.
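A minimal sketch of what returning a clone rather than a reference might look like; the types and field names here are stand-ins, not the real tsm1 implementation.

```go
package sketch

import "sync"

// Snapshot is a stand-in for the cache snapshot type.
type Snapshot struct{ files []string }

// Cache is a stand-in for the engine cache holding the active snapshots.
type Cache struct {
	mu        sync.RWMutex
	snapshots []*Snapshot
}

// PrepareSnapshots returns a copy of the snapshots slice rather than the
// slice itself, so the snapshotter can reorder or trim its copy without
// racing against readers of c.snapshots.
func (c *Cache) PrepareSnapshots() []*Snapshot {
	c.mu.Lock()
	defer c.mu.Unlock()
	clone := make([]*Snapshot, len(c.snapshots))
	copy(clone, c.snapshots)
	return clone
}
```

Only the slice header is copied; the snapshot pointers themselves are shared, which is enough to let the snapshotter work on its own slice without affecting concurrent readers.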
Force-pushed from b65864d to dc541f4
Note: commits prior to 7059331 are from PR #5778 and will not be delivered by the final version of this PR.
This query output shows how this fix recovered the snapshots that could not be written because write permission to the shard's TSM directory was temporarily removed. When the write permissions were restored, the snapshot count dropped to zero again. The second screenshot, taken after restarting the server, indicates that the full contents of the cache were successfully persisted to TSM files. Log output showing compaction of the resulting TSM files:
@jwilder - an example of (manual) integration test output showing that this fix does what it intends to do.
To run the test:
cd tsdb/engine/tsm1
GORACE=history=1000 go test -race cache_race_test.go
The current implementation of Cache.Deduplicate is not thread safe, as revealed by a code review and, separately, by a go race detector test case. The first race condition arises because entries being deduplicated by the Cache.Deduplicate method were being read by the cache logic without the deduplicating thread taking a write lock. A second race condition arose with one of the initial fixes because the snapshot map was being updated by the deduplicating thread without a read lock (on the snapshot) being taken by readers of the snapshot.

This fix works by taking a copy of the snapshot's store (called dirty) while the write lock of the engine's cache is held. Only the entries that need sorting are cloned; the other entries are copied by reference. Deduplication occurs on the dirty slice, rather than the store slice. There is no thread safety issue with the dirty slice because only the snapshotting thread sees it. The entries that are updated by the snapshotting thread are the entries that need sorting, but all of these entries are clones which are not visible to any other thread. When deduplication is done, we update the store of the snapshot while holding the write lock of the engine's cache - not that of the snapshot itself.

Once a snapshot is deduplicated, it is never modified again. In the time between the snapshot being created and the deduplication finishing, the snapshot may be modified; however, in these cases the writes are guarded by a write lock on the engine's cache and the snapshotting thread does not refer to those values.

The end result is that it is no longer necessary to take any snapshot locks: the lock guarding the engine's cache is sufficient. The snapshotter does need to take write locks on the engine's cache, but only at two points:

* while taking the initial snapshot
* after deduplication is complete, when updating the cache's store reference

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
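A condensed sketch of the dirty-copy approach described above, using simplified stand-in types for the cache internals; it is meant to illustrate the locking pattern, not the actual tsm1 code.

```go
package sketch

import (
	"sort"
	"sync"
)

// entry is a stand-in for a cache entry; values are simplified to int64s.
type entry struct {
	needSort bool
	values   []int64
}

func (e *entry) clone() *entry {
	return &entry{needSort: e.needSort, values: append([]int64(nil), e.values...)}
}

// snapshot is a stand-in for a cache snapshot.
type snapshot struct{ store map[string]*entry }

// engineCache is a stand-in for the engine's cache; mu guards both the live
// store and the snapshot's store, as described above.
type engineCache struct {
	mu   sync.RWMutex
	snap *snapshot
}

func (c *engineCache) deduplicateSnapshot() {
	// Copy the snapshot's store while holding the cache's write lock. Entries
	// that need sorting are cloned; the rest are shared by reference.
	c.mu.Lock()
	dirty := make(map[string]*entry, len(c.snap.store))
	for k, e := range c.snap.store {
		if e.needSort {
			dirty[k] = e.clone()
		} else {
			dirty[k] = e
		}
	}
	c.mu.Unlock()

	// Sort the cloned entries without holding any lock: no other goroutine can
	// see these clones, so there is no race.
	for _, e := range dirty {
		if e.needSort {
			sort.Slice(e.values, func(i, j int) bool { return e.values[i] < e.values[j] })
			e.needSort = false
		}
	}

	// Swap the deduplicated store in while holding the cache's write lock.
	c.mu.Lock()
	c.snap.store = dirty
	c.mu.Unlock()
}
```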
Force-pushed from dc541f4 to cfef88f
Rebased to remove the "comment-only assertions" commit that was removed from another PR.
@jonseymour I think supporting multiple snapshots in the cache is not necessary and is complicating things. I have an alternative fix in #5789 that removes the multiple snapshot support.
Force-pushed from cfef88f to a1673df
Also, introduce a currently empty RollbackSnapshot method. Later, we are going to use a commit pattern where resources are first prepared and then either committed or rolled back. The code doesn't yet understand the concept of a rollback, so for now we simply rename the existing methods to their final names and add the RollbackSnapshot() method as a placeholder. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
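As a rough illustration, the intended shape of the prepare/commit/rollback pairing might look like the following; the exact signatures are assumptions, not the final API.

```go
package sketch

// Snapshot is a stand-in for the cache snapshot type.
type Snapshot struct{ files []string }

// snapshotCommitter captures the intended pairing: every PrepareSnapshots is
// followed by exactly one CommitSnapshots or RollbackSnapshot, so resources
// are always released.
type snapshotCommitter interface {
	PrepareSnapshots() []*Snapshot // acquire resources; return the work to do
	CommitSnapshots() error        // all snapshots written; release WAL segments
	RollbackSnapshot()             // a write failed; keep state for a later retry
}
```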
…led. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
This lock ensures that at most one thread is actively attempting to commit the WAL log for a single shard at once. The assumption here is that it is not helpful to have more than one thread attempting to do this, and that reasoning about correctness is much easier if one only has to consider a single thread. The Prepare/Commit-or-Rollback pattern ensures that the lock will always be released. This is not to say that the TSM writing process might not benefit from concurrency, only that the concurrency should be managed within the scope of the shard-wide commit lock. Note that the acquisition order is commit lock first, then cache mutex. This ensures the cache is available for writes while the snapshot is being processed. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
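A small sketch of the lock ordering described here (commit lock first, then cache mutex), using stand-in names for the real fields.

```go
package sketch

import "sync"

// Cache is a stand-in with just the two locks discussed in the commit message.
type Cache struct {
	commit sync.Mutex   // held for the whole prepare -> commit/rollback cycle
	mu     sync.RWMutex // guards the cache store; held only briefly
}

// PrepareSnapshots blocks until any in-flight commit finishes, then takes the
// cache mutex only long enough to capture the store, so writes into the cache
// are not blocked while the snapshot is being written to TSM files.
func (c *Cache) PrepareSnapshots() {
	c.commit.Lock() // at most one snapshot writer per shard
	c.mu.Lock()
	// ... capture the snapshot of the store here ...
	c.mu.Unlock()
}

func (c *Cache) CommitSnapshots()  { c.commit.Unlock() }
func (c *Cache) RollbackSnapshot() { c.commit.Unlock() }
```

Because the cache mutex is only held briefly inside PrepareSnapshots, ordinary writes proceed while a snapshot is being written; only a second snapshot attempt blocks, waiting on the commit lock.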
Now that only one thread is allowed to be actively writing a snapshot at once, we can simplify the commit logic to simply removing all active snapshots. It also means we do not need to scan for the committed snapshot, since we are going to purge all of them. Note that Rollback doesn't touch the snapshot slice - it leaves the snapshot in place until a subsequent compaction succeeds, at which point it will be removed. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The previous commits ensured that snapshots that failed to write would eventually be released. However, only the current snapshot would get written to disk. This commit ensures that the writes for previously failed snapshots are retried as well. To ensure we clean up only those WAL segments that we have successfully compacted, we need to keep track of precisely which WAL segments each snapshot captures, and retry all the snapshots in the same order. We only remove WAL segments for snapshots that we successfully write, and we stop on the first error we encounter. When we roll back, we only roll back the snapshots that we failed to write. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
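A sketch of the retry behaviour this commit describes: snapshots are written oldest first, WAL segments are removed only for snapshots that were written successfully, and writing stops at the first failure. The function names (writeTSM, removeSegments, rollback) are hypothetical.

```go
package sketch

// Snapshot is a stand-in for a cache snapshot that records the WAL segments
// it captures.
type Snapshot struct {
	files []string // WAL segments captured by this snapshot
}

// writeSnapshots retries snapshots in order, committing partial progress and
// rolling back only the snapshots that have not yet been written.
func writeSnapshots(
	snaps []*Snapshot,
	writeTSM func(*Snapshot) error,
	removeSegments func([]string) error,
	rollback func(failed []*Snapshot),
) error {
	for i, s := range snaps {
		if err := writeTSM(s); err != nil {
			// Keep the failed snapshot (and any later ones) for the next attempt.
			rollback(snaps[i:])
			return err
		}
		// Safe to delete the WAL segments for this snapshot only: its data is
		// now durable in a TSM file.
		if err := removeSegments(s.files); err != nil {
			return err
		}
	}
	return nil
}
```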
Add some tests to check that new files specified on PrepareSnapshots are associated with the latest snapshot.
The test now demonstrates that if a snapshot is rolled back, the failed snapshots (only) will appear in the slice returned by the next PrepareSnapshots call. The test demonstrates how the discovery of new WAL segments leads to the new files being associated with the latest snapshot. The test also demonstrates that when retrying multiple snapshots, we can successfully commit partial progress, even if we need to roll back one of the later snapshots.
A test to check that the commit lock is working as expected. In particular, we want to check that if there is a pending commit, a prepare issued by a concurrent goroutine will block until such time as the current holder of the lock either commits or rolls back. This test also checks that concurrent writes into the cache are not prevented by goroutines that hold the cache's commit lock. Signed-off-by: Jon Seymour <jon@wildducktheories.com>
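A sketch of the kind of concurrency check described above, using a minimal stand-in cache so the example is self-contained; the real test in this PR exercises the actual cache implementation.

```go
package sketch

import (
	"sync"
	"testing"
	"time"
)

// cache is a minimal stand-in modelling just the two locks under discussion.
type cache struct {
	commit sync.Mutex   // held from Prepare until Commit/Rollback
	mu     sync.RWMutex // guards the store; held only briefly per write
	store  map[string]int64
}

func newCache() *cache { return &cache{store: map[string]int64{}} }

func (c *cache) PrepareSnapshots() { c.commit.Lock() }
func (c *cache) CommitSnapshots()  { c.commit.Unlock() }

func (c *cache) Write(key string, v int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[key] = v
	return nil
}

func TestPrepareBlocksUntilCommit(t *testing.T) {
	c := newCache()
	c.PrepareSnapshots() // this goroutine now holds the commit lock

	done := make(chan struct{})
	go func() {
		c.PrepareSnapshots() // must block until the commit below
		close(done)
	}()

	// Ordinary writes must not be blocked by the pending commit.
	if err := c.Write("cpu", 1); err != nil {
		t.Fatal(err)
	}

	select {
	case <-done:
		t.Fatal("concurrent PrepareSnapshots returned before the lock was released")
	case <-time.After(100 * time.Millisecond):
		// still blocked, as expected
	}

	c.CommitSnapshots()
	<-done // the concurrent prepare can now proceed
}
```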
Force-pushed from a1673df to 80e97b4
@jonseymour I'm closing this in favor of #5789 because it resolves #5686 by removing some unnecessary complexity in the cache. Since we don't need the multiple snapshots, it does not make sense to add additional layers on top of them that add further complexity to the cache.
This is my attempt to fix the code review issue I identified in #5686.
The controversial aspect of this fix may be the introduction of a commit lock that ensures that at most one thread is writing a snapshot for a given shard at a given time. I think the lock is justified because reasoning about concurrent execution of these code paths seems fraught with danger. Of course, if there are contrary opinions, let's discuss.