Skip to content

Conversation

@sumeerbhola
Copy link
Collaborator

@sumeerbhola sumeerbhola commented Feb 22, 2022

The const override that disabled loosely coupled truncation
is removed. There is additional testing of both the end-to-end
path and the truncator.

Informs #36262

Release note: None

@sumeerbhola sumeerbhola requested review from a team as code owners February 22, 2022 19:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two commits are from #76215 and #76793

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)

@sumeerbhola sumeerbhola force-pushed the raft_trunc3 branch 5 times, most recently from e569ce7 to 1345618 Compare February 23, 2022 15:00
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased. The non test changes are tiny.
Most of the test diffs are indentation changes due to running the kvserver server-side tests both with and without loosely-coupled truncation. The client-side kvserver tests are only run with whatever is the default truncation behavior.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't reviewed the tests yet, have a suggestion on how to simplify the prod code. Will review tests first thing tomorrow.

Reviewed 3 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/raft_log_truncator.go, line 394 at r1 (raw file):

	if runTruncation {
		if err := t.stopper.RunAsyncTask(
			t.ambientCtx, "raft-log-truncation", t.durabilityAdvanced); err != nil {

Consider pulling the t.mu.runningTruncation = false out. It is nice to have it all in one place.

Near the top of the method:

runTruncation := false
done := func() {}

t.mu.Lock()
[...]
  runTruncation = true
  t.mu.runningTruncation = true
  done := func() {
    t.mu.Lock()
    if !t.mu.runningTruncation { panic("...") }
    t.mu.runningTruncation = false
    t.mu.Unlock()
  }

[...]

for runTruncation {
  if err := t.stopper.RunAsyncTask(...., func(ctx context.Context) {
    defer done()
    for {
      t.durabilityAdvanced(ctx)
      t.mu.Lock()
      queued := t.mu.queuedDurabilityCB
      t.mu.queuedDurabilityCB = false
      t.mu.Unlock()
      if !queued { return }
    }
  }); err != nil {
    done()
    return
  }
  
}

That way runningTruncation is only accessed in this method.


pkg/kv/kvserver/raft_log_truncator.go, line 435 at r1 (raw file):

		t.tryEnactTruncations(ctx, rangeID, reader)
		select {
		case <-shouldQuiesce:

Why do you need this? Were you seeing hangs? tryEnactTruncations should also be sensitive to quiescence so this seems superfluous.


pkg/kv/kvserver/raft_log_truncator.go, line 443 at r1 (raw file):

		}
	}
	t.mu.Lock()

This part would go away with my other suggestion.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/raft_log_truncator.go, line 394 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Consider pulling the t.mu.runningTruncation = false out. It is nice to have it all in one place.

Near the top of the method:

runTruncation := false
done := func() {}

t.mu.Lock()
[...]
  runTruncation = true
  t.mu.runningTruncation = true
  done := func() {
    t.mu.Lock()
    if !t.mu.runningTruncation { panic("...") }
    t.mu.runningTruncation = false
    t.mu.Unlock()
  }

[...]

for runTruncation {
  if err := t.stopper.RunAsyncTask(...., func(ctx context.Context) {
    defer done()
    for {
      t.durabilityAdvanced(ctx)
      t.mu.Lock()
      queued := t.mu.queuedDurabilityCB
      t.mu.queuedDurabilityCB = false
      t.mu.Unlock()
      if !queued { return }
    }
  }); err != nil {
    done()
    return
  }
  
}

That way runningTruncation is only accessed in this method.

Much better! Done.


pkg/kv/kvserver/raft_log_truncator.go, line 435 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why do you need this? Were you seeing hangs? tryEnactTruncations should also be sensitive to quiescence so this seems superfluous.

I saw a test fail with a pebble.tableCacheShard leak, which suggests the test shutdown without waiting for this goroutine that was holding open a pebble.Iterator. It seemed wise to not have stray goroutines that are not linked to a stopper, hence the stopper plumbing.

This particular change to check shouldQuiesce was just to play nice -- my reading of the stopper contract was that it waits for all such async tasks after closing this channel. So may as well check after processing each range.


pkg/kv/kvserver/raft_log_truncator.go, line 443 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This part would go away with my other suggestion.

Done

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 13 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/raft_log_truncator.go, line 435 at r1 (raw file):

hence the stopper plumbing.

Sorry, that wasn't the question. Was just asking about checking quiescing here. It doesn't seem necessary, since this won't block shutdown. Mild preference for removal to keep the code clean.


pkg/kv/kvserver/testdata/raft_log_truncator, line 405 at r2 (raw file):

# Test case to ensure that truncator is reading only flushed engine state when
# deciding to enact truncations.
create-replica id=3 trunc-index=20 last-log-entry=30

It's odd to see last-log-entry but then the log ends at 29. I think this should be last-log-entry=29 (i.e. make it inclusive).


pkg/kv/kvserver/testdata/raft_log_truncator, line 438 at r2 (raw file):

# Move RaftAppliedState enough to allow first truncation, but don't flush that
# change.
write-raft-applied-index id=3 raft-applied-index=22 no-flush=true

What ensures that a flush doesn't accidentally happen?

The const override that disabled loosely coupled truncation
is removed. There is additional testing of both the end-to-end
path and the truncator.

Informs cockroachdb#36262

Release note: None
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/raft_log_truncator.go, line 435 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

hence the stopper plumbing.

Sorry, that wasn't the question. Was just asking about checking quiescing here. It doesn't seem necessary, since this won't block shutdown. Mild preference for removal to keep the code clean.

I've left the code in place and added

		// Check if the stopper is quiescing. This isn't strictly necessary, but
		// if there are a huge number of ranges that need to be truncated, this
		// will cause us to stop faster.

pkg/kv/kvserver/testdata/raft_log_truncator, line 405 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

It's odd to see last-log-entry but then the log ends at 29. I think this should be last-log-entry=29 (i.e. make it inclusive).

Done


pkg/kv/kvserver/testdata/raft_log_truncator, line 438 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What ensures that a flush doesn't accidentally happen?

The initial memtable size is 256KB, and doubles for each new memtable. Even that initial size is much larger than anything we do in this test between explicit flushes.

I've added a comment to the test

				// The initial engine memtable size is 256KB, and doubles for each new
				// memtable. Even the initial size is much larger than anything we do
				// in this test between explicit flushes. Hence we can rely on the
				// fact that no-flush will actually be respected, and we won't
				// encounter an unexpected flush.

@sumeerbhola
Copy link
Collaborator Author

bors r=tbg

@sumeerbhola
Copy link
Collaborator Author

bors r=tbg

@craig
Copy link
Contributor

craig bot commented Feb 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Feb 25, 2022

Build succeeded:

@craig craig bot merged commit 978744d into cockroachdb:master Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants