Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsm: ensure that snapshot writes are eventually retried/released, even if WriteSnapshot fails #5689

Closed

Commits on Feb 22, 2016

  1. tsm: cache: add a test that demonstrates the the race condition.

    To run the test, do:
    
       cd tsdb/engine/tsm1
    
    Then run:
    
       GORACE=history=1000 go test -race cache_race_test.go
    jonseymour committed Feb 22, 2016
    Configuration menu
    Copy the full SHA
    32fc885 View commit details
    Browse the repository at this point in the history
  2. tsm: cache: improve thread safety of Cache.Deduplicate.

    The current implementation of Cache.Deduplicate is not thread safe as
    revealed by a code review and, seperately, a go race detector test case.
    
    The first race condition arises because entries being deduplicated by the
    Cache.Deduplicate method were being read by the cache logic without the
    deduplicating thread taking a write lock.
    
    A second race condition arose with one of the initial fixes because the snapshot
    map was being updated by the deduplicating thread without a readlock (on the snapshot)
    being taken by readers of the snapshot.
    
    This fix works by taking a copy of the snapshot's store (called dirty) while
    the write lock of the engine's cache is held. Only the entries that need sorting are
    cloned. The other entries are copied by reference.
    
    Deduplication of the slice occurs on the dirty slice, rather than the store slice.
    There is no thread safety issue with the dirty slice because only the
    snapshotting thread sees that slice. The entries that are updated by the
    snapshotting thread are the entries that need sorting but all these entries
    are clones which are not visible to any other thread.
    
    When we are done deduplicating, we update the store of the snapshot while holding the
    write lock of the engine's cache - not that of the snapshot itself.
    
    Once a snapshot is deduplicated, it is never modified again. In the time
    between the snapshot being created and the deduplication finishing, the
    snapshot may be modified, however, in these cases the writes are
    guarded by a writelock on the engine's cache and the snapshotting thread
    does not refer to those values.
    
    The end result is that it is no longer necessary to take any snapshot locks
    the lock guarding the engine's cache is sufficient.
    
    The snapshotter does need to take write locks on the engine's cache but only at two points:
    
    * while taking the initial snapshot
    * after deduplication is complete when updating the cache's store reference
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 22, 2016
    Configuration menu
    Copy the full SHA
    26fd71d View commit details
    Browse the repository at this point in the history

Commits on Feb 23, 2016

  1. tsm: rename Snapshot/ClearSnapshot to PrepareSnapshot/CommitSnapshot

    Also, introduce a currently empty RollbackSnapshot method.
    
    Later, we are going to use a commit pattern where resources are first prepared
    and then either committed or rolled back.
    
    Currently, the code doesn't understand the concept of a rollback, but
    we will add it later.
    
    For now, we will simply rename the existing methods to their final names and
    add the RollbackSnapshot() method as a placeholder.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    982c4df View commit details
    Browse the repository at this point in the history
  2. tsm: ensure that either CommitSnapshot() or RollbackSnapshot() is cal…

    …led.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    964a0e4 View commit details
    Browse the repository at this point in the history
  3. tsm: introduce a commit lock.

    This lock ensures that at most one thread is actively attempting to
    commit the WAL log for a single shard at once.
    
    The assumption here it is that it is not helpful to have more than
    one thread attempting to do this and that reasoning about correctness
    is much more easily achieved if one only has to consider
    a single thread.
    
    The Prepare/Commit or Rollback pattern ensures that the lock will always
    be released.
    
    This is not to say that the TSM writing process might not be benefit
    from concurrency, only that the concurrency should be managed within
    the scope of the shard-wide commit lock.
    
    Note that the acqusition order is commit lock first, then cache mutex.
    This ensures the cache is available for writes while the snapshot is
    being processed.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    8d1b7b8 View commit details
    Browse the repository at this point in the history
  4. tsm: simplify interface to CommitSnapshot() and RollbackSnapshot()

    Now that there is only one thread allowed to be actively writing
    a snapshot at once, we can simplify the the commit logic to simply
    removing all active snapshots. It also means we do not need
    to scan for the committed snapshot since we are going to purge all of them.
    
    Note that Rollback doesn't touch the snapshot slice - it leaves the
    snapshot in place until a subsequent compaction succeeds at which
    point it will be removed.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    dda3b19 View commit details
    Browse the repository at this point in the history
  5. tsm: ensure previously failed snapshot writes are retried

    The previous commits ensured that snapshots that failed to write
    would eventually be released. However, only the current snapshot
    would get written to disk. This commit ensures that the writes for
    previously failed snapshots are retried again.
    
    To ensure we cleanup only those WAL segments that we have successfully
    compacted, we need to keep track of precisely which WAL segments each
    snapshot captures, and retry all the snapshots in the same order.
    We only remove WAL segments for snapshots that we successfully write
    and stop on the first error we encounter.
    
    When we rollback, we only rollback the snapshots that we failed to
    write.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    59d5319 View commit details
    Browse the repository at this point in the history
  6. tsm: extend TestCache_CacheSnapshot

    Add some tests to check that new files specified on PrepareSnapshots are
    associated with the latest snapshot.
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    393e07e View commit details
    Browse the repository at this point in the history
  7. tsm: add TestCache_CacheRollback

    The test now demonstrates that if a snapshot is rolled back, the failed
    snapshots (only) will appear in the slice returned by the next
    PrepareSnapshots call.
    
    The test demonstrates how the discovery of new WAL segment leads to the
    new files being associated with the latest snapshot.
    
    The test demonstrates that when retrying multiple snapshots, we
    can successfully commit partial progress, even if we need to rollback
    one of the later snapshots.
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    c8796d9 View commit details
    Browse the repository at this point in the history
  8. tsm: add TestCache_CacheTestCommitLock

    A test to check that the commit lock is working as expected. In particular,
    we want to check that if there is a pending commit, a prepare issued
    by a concurrent goroutine will block until such time as the
    current holder of the lock either commits or rolls back.
    
    This test also checks that concurrent writes into the cache
    are not prevented by goroutines that hold the cache's commit lock.
    
    Signed-off-by: Jon Seymour <jon@wildducktheories.com>
    jonseymour committed Feb 23, 2016
    Configuration menu
    Copy the full SHA
    80e97b4 View commit details
    Browse the repository at this point in the history