-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: add snapshot-pinned keys sstable properties and metrics #1814
Conversation
699a043
to
e0c7a26
Compare
I think a good next step after this PR would be to run some experiments that are known to incur high w-amp (@tbg might have something in mind?) and observe whether the LSM stats record significant pinned bytes. If they do, #1810 or incorporating the pinned-bytes sstable property into compaction heuristics might be worthwhile. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet read compactionIter
but flushing some minor comments (it looks good).
Reviewed 3 of 24 files at r1, 1 of 5 files at r2, 1 of 10 files at r3.
Reviewable status: 4 of 29 files reviewed, 5 unresolved discussions (waiting on @jbowens and @sumeerbhola)
compaction.go
line 2288 at r2 (raw file):
}() internalTableOptConstructor := private.SSTableInternalTableOpt.(func(func(p *sstable.Properties)) sstable.WriterOption)
I'm confused by the indirection happening here.
private.SSTableInternalTableOpt
used to be an interface sstable.WriterOption
. Is this change because it was too wide and gave access to the whole Writer
when all we want is access to the Properties
? So is this using something that takes a func(*sstable.Properties)
and produces something that is a sstable.WriterOption
.
So something like
type propertiesAdapter struct {
f func(p *sstable.Properties)
}
func (p propertiesAdapter) writerApply(w *Writer) {
f(w.props)
}
which could be used as
internalTableOpt := propertiesAdapter{
f: func(p *sstable.Properties) {
...
}
}
I guess I am confused about the casting starting with private.SSTableInternalTableOpt
. And I never understood why private.SSTableInternalOpt
is an interface{}
and not the real type.
compaction.go
line 2669 at r2 (raw file):
// its elision. Increment the stats. pinnedCount++ pinnedBytes += uint64(len(key.UserKey)) + base.InternalTrailerLen + uint64(len(val))
the bytes include the value, but the properties documentation says "cumulative bytes of keys". I guess it was meant to say "key-value pairs".
db.go
line 423 at r3 (raw file):
// The cumulative count and size of snapshot-pinned keys written to // sstables. cumulativePinnedCount uint64
I assume the idea is to compare it with the cumulative number of bytes being written in compactions.
Is there a value to knowing how many of the bytes currently in the DB are pinned? It would mean reading the properties (in the background) after the DB is opened, so I guess we can wait until we think it is worth the trouble.
sstable/properties.go
line 144 at r2 (raw file):
SnapshotPinnedKeys uint64 `prop:"pebble.num.snapshot-pinned-keys"` // The cumulative bytes of keys in this table that were pinned by open // snapshots.
Is this comparable to RawKeySize
? If yes, can you add a comment.
Also, how about calling this SnapshotPinnedKeySize
and also having SnapshotPinnedValueSize
that would be comparable to RawValueSize
?
sstable/writer.go
line 1893 at r2 (raw file):
// Apply this option during Close. func (i internalTableOpt) postApply() {}
when does writerApply
get applied? And is postApply
part of an interface?
Oh, I see -- postApply
is just a marker saying when writerApply
should be called, yes?
This could use some code commentary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 of 29 files reviewed, 6 unresolved discussions (waiting on @jbowens and @sumeerbhola)
compaction_iter.go
line 301 at r3 (raw file):
i.pos = iterPosCurForward i.valid = false for i.iterKey != nil {
The preceding call to nextInStripe
can return due to sameStripeNonSkippable
while leaving i.skip=true
. So we may be still in the same snapshot stripe.
I am trying to find that case in the following comment.
Should we store the stripeChangeType
returned by nextInStripe
etc. and split the newStripe
case into newStripeSameKey
and newStripeDifferentKey
, and use the former to decide that pinned is true. That will allow us to leverage the current behavior of i.skip
which stays true even with intermediate sameStripeNonSkippable
.
I think what you have here is logically similar, so this may be subjective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 24 of 24 files at r1, 5 of 5 files at r2, 10 of 10 files at r3, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)
internal/keyspan/fragmenter.go
line 238 at r2 (raw file):
// CoversInvisibly indicates the tested key does fall within the span's // bounds and the span contains at least one key with a higher sequence // number, but none visible at the provided snapshot.
Some diagrams might be useful for these types, demonstrating "visible" and "invisible".
internal/keyspan/testdata/fragmenter_covers
line 1 at r2 (raw file):
build
Without reading the test code directly, it's a little hard to decipher how this test works. Can you add a brief description?
e0c7a26
to
28ea7ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r4, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @jbowens)
28ea7ed
to
c3e896c
Compare
c3e896c
to
faeffa4
Compare
faeffa4
to
ea942e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 29 files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
compaction.go
line 2288 at r2 (raw file):
Previously, sumeerbhola wrote…
I'm confused by the indirection happening here.
private.SSTableInternalTableOpt
used to be an interfacesstable.WriterOption
. Is this change because it was too wide and gave access to the wholeWriter
when all we want is access to theProperties
? So is this using something that takes afunc(*sstable.Properties)
and produces something that is asstable.WriterOption
.So something like
type propertiesAdapter struct { f func(p *sstable.Properties) } func (p propertiesAdapter) writerApply(w *Writer) { f(w.props) }
which could be used as
internalTableOpt := propertiesAdapter{ f: func(p *sstable.Properties) { ... } }
I guess I am confused about the casting starting with
private.SSTableInternalTableOpt
. And I never understood whyprivate.SSTableInternalOpt
is aninterface{}
and not the real type.
I think private.SSTableInternalTableOpt
is an interface{}
to avoid a cyclical package dependency.
compaction.go
line 2669 at r2 (raw file):
Previously, sumeerbhola wrote…
the bytes include the value, but the properties documentation says "cumulative bytes of keys". I guess it was meant to say "key-value pairs".
thanks, done by splitting this into two properties.
db.go
line 423 at r3 (raw file):
Previously, sumeerbhola wrote…
I assume the idea is to compare it with the cumulative number of bytes being written in compactions.
Is there a value to knowing how many of the bytes currently in the DB are pinned? It would mean reading the properties (in the background) after the DB is opened, so I guess we can wait until we think it is worth the trouble.
yeah, I think there's some value, especially for just getting a sense of how much space amplification results from snapshots under workloads. for our own internal use, we can use the cockroach debug pebble db properties
command to aggregate all the sstables' props
sstable/properties.go
line 144 at r2 (raw file):
Previously, sumeerbhola wrote…
Is this comparable to
RawKeySize
? If yes, can you add a comment.Also, how about calling this
SnapshotPinnedKeySize
and also havingSnapshotPinnedValueSize
that would be comparable toRawValueSize
?
Done.
sstable/writer.go
line 1893 at r2 (raw file):
Previously, sumeerbhola wrote…
when does
writerApply
get applied? And ispostApply
part of an interface?Oh, I see --
postApply
is just a marker saying whenwriterApply
should be called, yes?
This could use some code commentary.
Done.
Resurrecting this one for fixit week. |
ea942e4
to
c92b5eb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 of 29 files reviewed, 6 unresolved discussions (waiting on @nicktrav and @sumeerbhola)
compaction_iter.go
line 301 at r3 (raw file):
Previously, sumeerbhola wrote…
The preceding call to
nextInStripe
can return due tosameStripeNonSkippable
while leavingi.skip=true
. So we may be still in the same snapshot stripe.
I am trying to find that case in the following comment.Should we store the
stripeChangeType
returned bynextInStripe
etc. and split thenewStripe
case intonewStripeSameKey
andnewStripeDifferentKey
, and use the former to decide that pinned is true. That will allow us to leverage the current behavior ofi.skip
which stays true even with intermediatesameStripeNonSkippable
.I think what you have here is logically similar, so this may be subjective.
I like it, done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Have some clarifying questions/nits.
Reviewed 2 of 21 files at r5, 1 of 1 files at r6, 16 of 24 files at r7, 1 of 3 files at r8, all commit messages.
Reviewable status: 21 of 29 files reviewed, 18 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)
compaction.go
line 2761 at r8 (raw file):
// Set the external sst version to 0. This is what RocksDB expects for // db-internal sstables; otherwise, it could apply a global sequence number. p.ExternalFormatVersion = 0
I'm confused about ExternalFormatVersion
. It's not being used by Pebble, is it used by Cockroach? ExternalFormatVersion
is set to rocksDBExternalFormatVersion
in NewWriter
and then we set this to 0 during Close
. Definitely confusing due to the multiple layers of indirection.
Could we directly set these in finishOutput
before calling Writer.Close
?
compaction_iter.go
line 324 at r8 (raw file):
i.iterKey is the first of a new snapshot stripe.
Is this true? Can iterStripeChange
not be sameStripeSkippable
or sameStripeNonSkippable
?
compaction_iter.go
line 358 at r8 (raw file):
// therefore warrants some investigation. i.saveKey() // TODO(jackson): Handle tracking pinned statistics for range keys
nit: It would be nice to update this todo to indicate what it would take to handle tracking pinned stats for range keys. It would allow someone else to pick up the todo and implement it in the future more easily.
compaction_iter.go
line 415 at r8 (raw file):
// preserving the original value, and potentially mutating the key // kind. i.setNext()
Will the value of snapshotPinned
stay the same after the call to setNext
here? Or after the mergeNext
call below?
compaction_iter.go
line 518 at r8 (raw file):
// overwriting or mutating the saved key or value before they have been returned // to the caller of the exported function (i.e. the caller of Next, First, etc.) func (i *compactionIter) nextInStripe() stripeChangeType {
nit: We could write a comment that the caller must use the return type to update i.iterStripeChange
. Or update iterStripeChange
from within nextInStripe
and also return stripeChangeType.
metrics.go
line 482 at r8 (raw file):
formatCacheMetrics(w, &m.BlockCache, "bcache") formatCacheMetrics(w, &m.TableCache, "tcache") w.Printf(" snaps %9d %7s %7d (score == earliest seq num)\n",
Why remove this?
sstable/properties.go
line 152 at r8 (raw file):
// The cumulative bytes of values in this table that were pinned by // open snapshots. This value is comparable to RawValueSize. SnapshotPinnedValueSize uint64 `prop:"pebble.raw.snapshot-pinned-values"`
nit: I think we don't need the raw
in the tags for the fields here.
sstable/properties.go
line 371 at r8 (raw file):
p.saveString(m, unsafe.Offsetof(p.PropertyCollectorNames), p.PropertyCollectorNames) } if p.SnapshotPinnedKeys > 0 {
Does writing these change the on disk format of the sstables? Would we require version gating for older versions of Pebble which try and read this kind of sstable?
internal/keyspan/testdata/fragmenter_covers
line 1 at r8 (raw file):
# This datadriven test uses a single command 'build' that illustrates a sequence
I like this test. It's pretty thorough.
internal/keyspan/testdata/fragmenter_covers
line 5 at r8 (raw file):
# # 'add' lines add a new span with the provided sequence number and the provided # bounds.
We should add a line that there is no output line associated with the add
command.
internal/keyspan/testdata/fragmenter_covers
line 12 at r8 (raw file):
build deleted a.SET.0 5 add 3: a-----------m
nit: How about addedSpan
instead of add
? It clarifies what's going on in the test.
range_del_test.go
line 427 at r8 (raw file):
FS: vfs.NewMem(), Levels: []LevelOptions{ {TargetFileSize: 200},
Is this because the properties increase the size of the sstable?
c92b5eb
to
c4af672
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 21 of 29 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)
compaction.go
line 2761 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I'm confused about
ExternalFormatVersion
. It's not being used by Pebble, is it used by Cockroach?ExternalFormatVersion
is set torocksDBExternalFormatVersion
inNewWriter
and then we set this to 0 duringClose
. Definitely confusing due to the multiple layers of indirection.Could we directly set these in
finishOutput
before callingWriter.Close
?
We don't use it. It exists only for compatibility with RocksDB (which the MostCompatible format major version still supports). It's set to rocksDBExternalFormatVersion
so that non-internal constructors of sstables set rocksDBExternalFormatVersion
.
The existing indirection and private hook exist to ensure that it's not possible for external users of the sstable
package to fiddle with the standard sstable properties, but Pebble itself is allowed to manipulate them appropriately.
compaction_iter.go
line 324 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
i.iterKey is the first of a new snapshot stripe.
Is this true? Can
iterStripeChange
not besameStripeSkippable
orsameStripeNonSkippable
?
this was poorly worded; updated
compaction_iter.go
line 415 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Will the value of
snapshotPinned
stay the same after the call tosetNext
here? Or after themergeNext
call below?
Yeah. Both of these cases return one key to represent the snapshot stripe. setNext
and mergeNext
determine the value of the resulting key-value pair, but it's still representing a snapshot stripe that's not the first.
compaction_iter.go
line 518 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
nit: We could write a comment that the caller must use the return type to update
i.iterStripeChange
. Or updateiterStripeChange
from withinnextInStripe
and also return stripeChangeType.
Done.
metrics.go
line 482 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Why remove this?
thanks, this was a mistake during rebasing
sstable/properties.go
line 152 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
nit: I think we don't need the
raw
in the tags for the fields here.
The raw
is to match rocksdb.raw.[range-key.]{key,value}.size
properties.
sstable/properties.go
line 371 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Does writing these change the on disk format of the sstables? Would we require version gating for older versions of Pebble which try and read this kind of sstable?
no — the properties block is just a block where the key is the property name (eg, pebble.raw.snapshot-pinned-keys
). At read time, if the property is unrecognized, it's loaded into the UserProperties
map.
internal/keyspan/testdata/fragmenter_covers
line 5 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
We should add a line that there is no output line associated with the
add
command.
Done.
range_del_test.go
line 427 at r8 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
Is this because the properties increase the size of the sstable?
Yeah
efdefc5
to
2a29376
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 30 files reviewed, 15 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)
compaction.go
line 2288 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I think
private.SSTableInternalTableOpt
is aninterface{}
to avoid a cyclical package dependency.
I updated this interface boundary to be a little more direct (and remove the postApply
marker)
compaction.go
line 2761 at r8 (raw file):
Previously, jbowens (Jackson Owens) wrote…
We don't use it. It exists only for compatibility with RocksDB (which the MostCompatible format major version still supports). It's set to
rocksDBExternalFormatVersion
so that non-internal constructors of sstables setrocksDBExternalFormatVersion
.The existing indirection and private hook exist to ensure that it's not possible for external users of the
sstable
package to fiddle with the standard sstable properties, but Pebble itself is allowed to manipulate them appropriately.
I updated this interface boundary to be a little more direct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 24 files at r1, 2 of 21 files at r5, 12 of 24 files at r7, 1 of 3 files at r8, 8 of 11 files at r9, 1 of 1 files at r10, all commit messages.
Reviewable status: 25 of 30 files reviewed, 19 unresolved discussions (waiting on @bananabrick, @jbowens, and @nicktrav)
compaction_iter.go
line 390 at r10 (raw file):
switch i.iterKey.Kind() { case InternalKeyKindDelete, InternalKeyKindSingleDelete: if i.elideTombstone(i.iterKey.UserKey) {
we are always calling elideTombstone
now, which involves a key comparison, while earlier it was only happening if i.curSnapshotIdx == 0
. Is this not a performance concern because the common case in CockroachDB will be i.curSnapshotIdx == 0
(since snapshots are rare)?
compaction_iter.go
line 537 at r10 (raw file):
// stripeChangeType indicates how the snapshot stripe changed relative to the previous // key. If no change, it also indicates whether the current entry is skippable. // If there snapshot changed, it also indicates whether the new stripe was
nit: if the ...
metrics.go
line 236 at r10 (raw file):
// sstables during flushes or compactions that would've been elided if // it weren't for open snapshots. PinnedKeysSize uint64
this is key and value size, yes? Should we call it PinnedSize?
sstable/properties.go
line 149 at r10 (raw file):
// The cumulative bytes of keys in this table that were pinned by // open snapshots. This value is comparable to RawKeySize. SnapshotPinnedKeySize uint64 `prop:"pebble.raw.snapshot-pinned-keys"`
shouldn't this and the one below be suffixed by .size
?
Add three new built-in sstable properties that hold the count, cumulative raw size of keys and cumulative raw size of values that would've been elided were it not for open snapshots. For now these properties only include point keys and not range deletions or range keys. Also, add two new fields to pebble.Metrics: monotonically increasing cumulative totals of the count and sum of keys written to sstables due to snapshots.
2a29376
to
90d3597
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 25 of 30 files reviewed, 18 unresolved discussions (waiting on @bananabrick, @nicktrav, and @sumeerbhola)
compaction_iter.go
line 390 at r10 (raw file):
Previously, sumeerbhola wrote…
we are always calling
elideTombstone
now, which involves a key comparison, while earlier it was only happening ifi.curSnapshotIdx == 0
. Is this not a performance concern because the common case in CockroachDB will bei.curSnapshotIdx == 0
(since snapshots are rare)?
I think that's true, but I don't have a great intuition for truly how rare snapshots are. Hopefully these metrics can help with that.
We alternatively could also liberally label these tombstones as pinned.
Also worth noting that L5->L6 compactions, plus compactions over keyspaces that are entirely in-use avoid the key comparison.
metrics.go
line 236 at r10 (raw file):
Previously, sumeerbhola wrote…
this is key and value size, yes? Should we call it PinnedSize?
Done.
sstable/properties.go
line 149 at r10 (raw file):
Previously, sumeerbhola wrote…
shouldn't this and the one below be suffixed by
.size
?
ah, good catch. Done.
TFTRs! |
Open snapshots require flushes and compactions to preserve, and write,
additional keys. Add new sstable properties and global metrics surfacing
the count and size of these obsolete keys.
These can be used to improve observability and the ability to diagnose
write-amplification incurred due to open snapshots.
Additionally, future work may make use of the table-level properties to
improve compaction heuristics, because files with obsolete keys may be
cheaper to compact than their physical file size alone indicates.
Informs #1204.