Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: add snapshot-pinned keys sstable properties and metrics #1814

Merged
merged 1 commit into from
Mar 30, 2023

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Jul 15, 2022

Open snapshots require flushes and compactions to preserve, and write,
additional keys. Add new sstable properties and global metrics surfacing
the count and size of these obsolete keys.

These can be used to improve observability and the ability to diagnose
write-amplification incurred due to open snapshots.

Additionally, future work may make use of the table-level properties to
improve compaction heuristics, because files with obsolete keys may be
cheaper to compact than their physical file size alone indicates.

Informs #1204.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens force-pushed the snapshot-pinned branch 2 times, most recently from 699a043 to e0c7a26 Compare July 16, 2022 04:02
@jbowens
Copy link
Collaborator Author

jbowens commented Jul 16, 2022

I think a good next step after this PR would be to run some experiments that are known to incur high w-amp (@tbg might have something in mind?) and observe whether the LSM stats record significant pinned bytes. If they do, #1810 or incorporating the pinned-bytes sstable property into compaction heuristics might be worthwhile.

@jbowens jbowens mentioned this pull request Jul 17, 2022
3 tasks
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't yet read compactionIter but flushing some minor comments (it looks good).

Reviewed 3 of 24 files at r1, 1 of 5 files at r2, 1 of 10 files at r3.
Reviewable status: 4 of 29 files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)


compaction.go line 2288 at r2 (raw file):

	}()

	internalTableOptConstructor := private.SSTableInternalTableOpt.(func(func(p *sstable.Properties)) sstable.WriterOption)

I'm confused by the indirection happening here.
private.SSTableInternalTableOpt used to be an interface sstable.WriterOption. Is this change because it was too wide and gave access to the whole Writer when all we want is access to the Properties? So is this using something that takes a func(*sstable.Properties) and produces something that is a sstable.WriterOption.

So something like

type propertiesAdapter struct {
    f func(p *sstable.Properties)
}
func (p propertiesAdapter) writerApply(w *Writer) {
   f(w.props)
}

which could be used as

internalTableOpt := propertiesAdapter{
  f: func(p *sstable.Properties) {
     ...
  }
}

I guess I am confused about the casting starting with private.SSTableInternalTableOpt. And I never understood why private.SSTableInternalOpt is an interface{} and not the real type.


compaction.go line 2669 at r2 (raw file):

				// its elision. Increment the stats.
				pinnedCount++
				pinnedBytes += uint64(len(key.UserKey)) + base.InternalTrailerLen + uint64(len(val))

the bytes include the value, but the properties documentation says "cumulative bytes of keys". I guess it was meant to say "key-value pairs".


db.go line 423 at r3 (raw file):

			// The cumulative count and size of snapshot-pinned keys written to
			// sstables.
			cumulativePinnedCount uint64

I assume the idea is to compare it with the cumulative number of bytes being written in compactions.

Is there a value to knowing how many of the bytes currently in the DB are pinned? It would mean reading the properties (in the background) after the DB is opened, so I guess we can wait until we think it is worth the trouble.


sstable/properties.go line 144 at r2 (raw file):

	SnapshotPinnedKeys uint64 `prop:"pebble.num.snapshot-pinned-keys"`
	// The cumulative bytes of keys in this table that were pinned by open
	// snapshots.

Is this comparable to RawKeySize? If yes, can you add a comment.

Also, how about calling this SnapshotPinnedKeySize and also having SnapshotPinnedValueSize that would be comparable to RawValueSize?


sstable/writer.go line 1893 at r2 (raw file):

// Apply this option during Close.
func (i internalTableOpt) postApply() {}

when does writerApply get applied? And is postApply part of an interface?

Oh, I see -- postApply is just a marker saying when writerApply should be called, yes?
This could use some code commentary.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 29 files reviewed, 6 unresolved discussions (waiting on @jbowens and @sumeerbhola)


compaction_iter.go line 301 at r3 (raw file):

	i.pos = iterPosCurForward
	i.valid = false
	for i.iterKey != nil {

The preceding call to nextInStripe can return due to sameStripeNonSkippable while leaving i.skip=true. So we may be still in the same snapshot stripe.
I am trying to find that case in the following comment.

Should we store the stripeChangeType returned by nextInStripe etc. and split the newStripe case into newStripeSameKey and newStripeDifferentKey, and use the former to decide that pinned is true. That will allow us to leverage the current behavior of i.skip which stays true even with intermediate sameStripeNonSkippable.

I think what you have here is logically similar, so this may be subjective.

Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

Reviewed 24 of 24 files at r1, 5 of 5 files at r2, 10 of 10 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)


internal/keyspan/fragmenter.go line 238 at r2 (raw file):

	// CoversInvisibly indicates the tested key does fall within the span's
	// bounds and the span contains at least one key with a higher sequence
	// number, but none visible at the provided snapshot.

Some diagrams might be useful for these types, demonstrating "visible" and "invisible".


internal/keyspan/testdata/fragmenter_covers line 1 at r2 (raw file):

build

Without reading the test code directly, it's a little hard to decipher how this test works. Can you add a brief description?

Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r4, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 29 files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


compaction.go line 2288 at r2 (raw file):

Previously, sumeerbhola wrote…

I'm confused by the indirection happening here.
private.SSTableInternalTableOpt used to be an interface sstable.WriterOption. Is this change because it was too wide and gave access to the whole Writer when all we want is access to the Properties? So is this using something that takes a func(*sstable.Properties) and produces something that is a sstable.WriterOption.

So something like

type propertiesAdapter struct {
    f func(p *sstable.Properties)
}
func (p propertiesAdapter) writerApply(w *Writer) {
   f(w.props)
}

which could be used as

internalTableOpt := propertiesAdapter{
  f: func(p *sstable.Properties) {
     ...
  }
}

I guess I am confused about the casting starting with private.SSTableInternalTableOpt. And I never understood why private.SSTableInternalOpt is an interface{} and not the real type.

I think private.SSTableInternalTableOpt is an interface{} to avoid a cyclical package dependency.


compaction.go line 2669 at r2 (raw file):

Previously, sumeerbhola wrote…

the bytes include the value, but the properties documentation says "cumulative bytes of keys". I guess it was meant to say "key-value pairs".

thanks, done by splitting this into two properties.


db.go line 423 at r3 (raw file):

Previously, sumeerbhola wrote…

I assume the idea is to compare it with the cumulative number of bytes being written in compactions.

Is there a value to knowing how many of the bytes currently in the DB are pinned? It would mean reading the properties (in the background) after the DB is opened, so I guess we can wait until we think it is worth the trouble.

yeah, I think there's some value, especially for just getting a sense of how much space amplification results from snapshots under workloads. for our own internal use, we can use the cockroach debug pebble db properties command to aggregate all the sstables' props


sstable/properties.go line 144 at r2 (raw file):

Previously, sumeerbhola wrote…

Is this comparable to RawKeySize? If yes, can you add a comment.

Also, how about calling this SnapshotPinnedKeySize and also having SnapshotPinnedValueSize that would be comparable to RawValueSize?

Done.


sstable/writer.go line 1893 at r2 (raw file):

Previously, sumeerbhola wrote…

when does writerApply get applied? And is postApply part of an interface?

Oh, I see -- postApply is just a marker saying when writerApply should be called, yes?
This could use some code commentary.

Done.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 21, 2023

Resurrecting this one for fixit week.

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 29 files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


compaction_iter.go line 301 at r3 (raw file):

Previously, sumeerbhola wrote…

The preceding call to nextInStripe can return due to sameStripeNonSkippable while leaving i.skip=true. So we may be still in the same snapshot stripe.
I am trying to find that case in the following comment.

Should we store the stripeChangeType returned by nextInStripe etc. and split the newStripe case into newStripeSameKey and newStripeDifferentKey, and use the former to decide that pinned is true. That will allow us to leverage the current behavior of i.skip which stays true even with intermediate sameStripeNonSkippable.

I think what you have here is logically similar, so this may be subjective.

I like it, done.

@jbowens jbowens requested review from a team and bananabrick March 21, 2023 21:33
Copy link
Contributor

@bananabrick bananabrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Have some clarifying questions/nits.

Reviewed 2 of 21 files at r5, 1 of 1 files at r6, 16 of 24 files at r7, 1 of 3 files at r8, all commit messages.
Reviewable status: 21 of 29 files reviewed, 18 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


compaction.go line 2761 at r8 (raw file):

		// Set the external sst version to 0. This is what RocksDB expects for
		// db-internal sstables; otherwise, it could apply a global sequence number.
		p.ExternalFormatVersion = 0

I'm confused about ExternalFormatVersion. It's not being used by Pebble, is it used by Cockroach? ExternalFormatVersion is set to rocksDBExternalFormatVersion in NewWriter and then we set this to 0 during Close. Definitely confusing due to the multiple layers of indirection.

Could we directly set these in finishOutput before calling Writer.Close?


compaction_iter.go line 324 at r8 (raw file):

i.iterKey is the first of a new snapshot stripe.

Is this true? Can iterStripeChange not be sameStripeSkippable or sameStripeNonSkippable?


compaction_iter.go line 358 at r8 (raw file):

			// therefore warrants some investigation.
			i.saveKey()
			// TODO(jackson): Handle tracking pinned statistics for range keys

nit: It would be nice to update this todo to indicate what it would take to handle tracking pinned stats for range keys. It would allow someone else to pick up the todo and implement it in the future more easily.


compaction_iter.go line 415 at r8 (raw file):

			// preserving the original value, and potentially mutating the key
			// kind.
			i.setNext()

Will the value of snapshotPinned stay the same after the call to setNext here? Or after the mergeNext call below?


compaction_iter.go line 518 at r8 (raw file):

// overwriting or mutating the saved key or value before they have been returned
// to the caller of the exported function (i.e. the caller of Next, First, etc.)
func (i *compactionIter) nextInStripe() stripeChangeType {

nit: We could write a comment that the caller must use the return type to update i.iterStripeChange. Or update iterStripeChange from within nextInStripe and also return stripeChangeType.


metrics.go line 482 at r8 (raw file):

	formatCacheMetrics(w, &m.BlockCache, "bcache")
	formatCacheMetrics(w, &m.TableCache, "tcache")
	w.Printf("  snaps %9d %7s %7d  (score == earliest seq num)\n",

Why remove this?


sstable/properties.go line 152 at r8 (raw file):

	// The cumulative bytes of values in this table that were pinned by
	// open snapshots. This value is comparable to RawValueSize.
	SnapshotPinnedValueSize uint64 `prop:"pebble.raw.snapshot-pinned-values"`

nit: I think we don't need the raw in the tags for the fields here.


sstable/properties.go line 371 at r8 (raw file):

		p.saveString(m, unsafe.Offsetof(p.PropertyCollectorNames), p.PropertyCollectorNames)
	}
	if p.SnapshotPinnedKeys > 0 {

Does writing these change the on disk format of the sstables? Would we require version gating for older versions of Pebble which try and read this kind of sstable?


internal/keyspan/testdata/fragmenter_covers line 1 at r8 (raw file):

# This datadriven test uses a single command 'build' that illustrates a sequence

I like this test. It's pretty thorough.


internal/keyspan/testdata/fragmenter_covers line 5 at r8 (raw file):

#
# 'add' lines add a new span with the provided sequence number and the provided
# bounds.

We should add a line that there is no output line associated with the add command.


internal/keyspan/testdata/fragmenter_covers line 12 at r8 (raw file):

build
deleted a.SET.0     5
add 3: a-----------m

nit: How about addedSpan instead of add? It clarifies what's going on in the test.


range_del_test.go line 427 at r8 (raw file):

		FS: vfs.NewMem(),
		Levels: []LevelOptions{
			{TargetFileSize: 200},

Is this because the properties increase the size of the sstable?

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 21 of 29 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)


compaction.go line 2761 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I'm confused about ExternalFormatVersion. It's not being used by Pebble, is it used by Cockroach? ExternalFormatVersion is set to rocksDBExternalFormatVersion in NewWriter and then we set this to 0 during Close. Definitely confusing due to the multiple layers of indirection.

Could we directly set these in finishOutput before calling Writer.Close?

We don't use it. It exists only for compatibility with RocksDB (which the MostCompatible format major version still supports). It's set to rocksDBExternalFormatVersion so that non-internal constructors of sstables set rocksDBExternalFormatVersion.

The existing indirection and private hook exist to ensure that it's not possible for external users of the sstable package to fiddle with the standard sstable properties, but Pebble itself is allowed to manipulate them appropriately.


compaction_iter.go line 324 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

i.iterKey is the first of a new snapshot stripe.

Is this true? Can iterStripeChange not be sameStripeSkippable or sameStripeNonSkippable?

this was poorly worded; updated


compaction_iter.go line 415 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Will the value of snapshotPinned stay the same after the call to setNext here? Or after the mergeNext call below?

Yeah. Both of these cases return one key to represent the snapshot stripe. setNext and mergeNext determine the value of the resulting key-value pair, but it's still representing a snapshot stripe that's not the first.


compaction_iter.go line 518 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

nit: We could write a comment that the caller must use the return type to update i.iterStripeChange. Or update iterStripeChange from within nextInStripe and also return stripeChangeType.

Done.


metrics.go line 482 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Why remove this?

thanks, this was a mistake during rebasing


sstable/properties.go line 152 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

nit: I think we don't need the raw in the tags for the fields here.

The raw is to match rocksdb.raw.[range-key.]{key,value}.size properties.


sstable/properties.go line 371 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Does writing these change the on disk format of the sstables? Would we require version gating for older versions of Pebble which try and read this kind of sstable?

no — the properties block is just a block where the key is the property name (eg, pebble.raw.snapshot-pinned-keys). At read time, if the property is unrecognized, it's loaded into the UserProperties map.


internal/keyspan/testdata/fragmenter_covers line 5 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

We should add a line that there is no output line associated with the add command.

Done.


range_del_test.go line 427 at r8 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

Is this because the properties increase the size of the sstable?

Yeah

@jbowens jbowens force-pushed the snapshot-pinned branch 2 times, most recently from efdefc5 to 2a29376 Compare March 24, 2023 17:52
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 14 of 30 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)


compaction.go line 2288 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think private.SSTableInternalTableOpt is an interface{} to avoid a cyclical package dependency.

I updated this interface boundary to be a little more direct (and remove the postApply marker)


compaction.go line 2761 at r8 (raw file):

Previously, jbowens (Jackson Owens) wrote…

We don't use it. It exists only for compatibility with RocksDB (which the MostCompatible format major version still supports). It's set to rocksDBExternalFormatVersion so that non-internal constructors of sstables set rocksDBExternalFormatVersion.

The existing indirection and private hook exist to ensure that it's not possible for external users of the sstable package to fiddle with the standard sstable properties, but Pebble itself is allowed to manipulate them appropriately.

I updated this interface boundary to be a little more direct.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 24 files at r1, 2 of 21 files at r5, 12 of 24 files at r7, 1 of 3 files at r8, 8 of 11 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: 25 of 30 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @jbowens, and @nicktrav)


compaction_iter.go line 390 at r10 (raw file):

		switch i.iterKey.Kind() {
		case InternalKeyKindDelete, InternalKeyKindSingleDelete:
			if i.elideTombstone(i.iterKey.UserKey) {

we are always calling elideTombstone now, which involves a key comparison, while earlier it was only happening if i.curSnapshotIdx == 0. Is this not a performance concern because the common case in CockroachDB will be i.curSnapshotIdx == 0 (since snapshots are rare)?


compaction_iter.go line 537 at r10 (raw file):

// stripeChangeType indicates how the snapshot stripe changed relative to the previous
// key. If no change, it also indicates whether the current entry is skippable.
// If there snapshot changed, it also indicates whether the new stripe was

nit: if the ...


metrics.go line 236 at r10 (raw file):

		// sstables during flushes or compactions that would've been elided if
		// it weren't for open snapshots.
		PinnedKeysSize uint64

this is key and value size, yes? Should we call it PinnedSize?


sstable/properties.go line 149 at r10 (raw file):

	// The cumulative bytes of keys in this table that were pinned by
	// open snapshots. This value is comparable to RawKeySize.
	SnapshotPinnedKeySize uint64 `prop:"pebble.raw.snapshot-pinned-keys"`

shouldn't this and the one below be suffixed by .size?

Add three new built-in sstable properties that hold the count, cumulative raw
size of keys and cumulative raw size of values that would've been elided were
it not for open snapshots. For now these properties only include point keys and
not range deletions or range keys.

Also, add two new fields to pebble.Metrics: monotonically increasing cumulative
totals of the count and sum of keys written to sstables due to snapshots.
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 25 of 30 files reviewed, 18 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)


compaction_iter.go line 390 at r10 (raw file):

Previously, sumeerbhola wrote…

we are always calling elideTombstone now, which involves a key comparison, while earlier it was only happening if i.curSnapshotIdx == 0. Is this not a performance concern because the common case in CockroachDB will be i.curSnapshotIdx == 0 (since snapshots are rare)?

I think that's true, but I don't have a great intuition for truly how rare snapshots are. Hopefully these metrics can help with that.

We alternatively could also liberally label these tombstones as pinned.

Also worth noting that L5->L6 compactions, plus compactions over keyspaces that are entirely in-use avoid the key comparison.


metrics.go line 236 at r10 (raw file):

Previously, sumeerbhola wrote…

this is key and value size, yes? Should we call it PinnedSize?

Done.


sstable/properties.go line 149 at r10 (raw file):

Previously, sumeerbhola wrote…

shouldn't this and the one below be suffixed by .size?

ah, good catch. Done.

@jbowens
Copy link
Collaborator Author

jbowens commented Mar 30, 2023

TFTRs!

@jbowens jbowens merged commit 53a50a0 into cockroachdb:master Mar 30, 2023
@jbowens jbowens deleted the snapshot-pinned branch March 30, 2023 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants