
eventpb: add storage event types #86277

Merged: 1 commit into cockroachdb:master on Sep 21, 2022

Conversation

@nicktrav (Collaborator) commented Aug 17, 2022

Add the StoreStats event type, a per-store event emitted to the
TELEMETRY logging channel. This event type will be computed from the
Pebble metrics for each store.

Emit a StoreStats event periodically, by default, once per hour, per
store.

Touches #85589.

Release note: None.

Release justification: low risk, high benefit changes to existing
functionality.
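
For orientation, a minimal sketch of the shape described above: once per interval (an hour by default), turn each store's Pebble metrics into one structured event and write it to the telemetry log. The `storeStats`, `gatherMetrics`, and `emitTelemetry` names are hypothetical stand-ins, not the actual APIs touched by this PR.

```go
package main

import (
	"fmt"
	"time"
)

// storeStats is a hypothetical stand-in for the StoreStats event payload,
// which the PR generates from pkg/util/log/eventpb/storage_events.proto.
type storeStats struct {
	StoreID        int
	FlushCount     int64 // example cumulative counter derived from Pebble metrics
	CompactedBytes int64 // example cumulative counter derived from Pebble metrics
}

// gatherMetrics pretends to read the Pebble metrics for one store.
func gatherMetrics(storeID int) storeStats {
	return storeStats{StoreID: storeID, FlushCount: 42, CompactedBytes: 1 << 20}
}

// emitTelemetry pretends to write one structured event to the TELEMETRY channel.
func emitTelemetry(ev storeStats) {
	fmt.Printf("TELEMETRY store=%d flushes=%d compacted=%dB\n",
		ev.StoreID, ev.FlushCount, ev.CompactedBytes)
}

func main() {
	ticker := time.NewTicker(time.Hour) // once per hour, per store, by default
	defer ticker.Stop()
	for range ticker.C {
		for _, storeID := range []int{1, 2} {
			emitTelemetry(gatherMetrics(storeID))
		}
	}
}
```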

@cockroach-teamcity (Member) commented:

This change is Reviewable

@nicktrav (Collaborator, Author) commented:

cc: @jbowens - sneak peek at the new event type. I've cherry-picked a bunch of the Pebble metrics, erring on the side of "too many". We can probably prune this list a bit, but wanted to get a sense check on what we'd be capturing (second commit).

@nicktrav nicktrav force-pushed the nickt.storage-events branch 2 times, most recently from 94689f4 to 5b9ed06 on August 19, 2022 16:22
@nicktrav nicktrav requested review from jbowens and a team August 19, 2022 16:23
@nicktrav nicktrav changed the title from "[DNM] eventpb: add storage event types" to "eventpb: add storage event types" Aug 19, 2022
@nicktrav nicktrav marked this pull request as ready for review August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team as a code owner August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team as a code owner August 19, 2022 20:30
@sumeerbhola (Collaborator) left a comment

Reviewed 1 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nicktrav)


pkg/kv/kvserver/store.go line 130 at r1 (raw file):

var logStoreTelemetryTicks = envutil.EnvOrDefaultInt(
	"COCKROACH_LOG_STORE_TELEMETRY_TICKS_INTERVAL",
	6*60, // Once per hour. (tick interval = 10s) * 6 * 60 = 3600s = 1h.
)

what's the justification for 1h intervals? Can we miss transient interesting events because of this coarse granularity? What are other layers using as their telemetry logging interval?


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

  repeated LevelStats levels = 4 [(gogoproto.nullable) = false, (gogoproto.jsontag) = ""];

  // Cache metrics.

Could you add in parentheses whether something is cumulative? I'm assuming everything here is cumulative since node restart. Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)? If yes, would it be simpler to export the deltas? Otherwise I can imagine a more complex calculation where we also export the restart time and then do a delta calculation from 0 if the restart time changed. Metrics systems can sometimes do that under the covers for you (Monarch did), but it is a complicated sql query to write in the absence of built-in support.
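
A minimal sketch of the "also export the restart time" alternative, with hypothetical names (these are not fields in this PR's events): if every sample carries the node's start time, the delta is taken from the full current value whenever that start time changes.

```go
package main

import "fmt"

// sample pairs a cumulative counter value with the node start time it was
// observed under; both fields are illustrative, not part of the PR's events.
type sample struct {
	nodeStart int64 // e.g. unix seconds of the last process restart
	value     int64 // cumulative counter since that restart
}

// delta computes the work done between two consecutive samples.
func delta(prev, cur sample) int64 {
	if cur.nodeStart != prev.nodeStart {
		// The node restarted in between, so the counter reset to 0: everything
		// in the current sample accrued since the restart.
		return cur.value
	}
	return cur.value - prev.value
}

func main() {
	fmt.Println(delta(sample{nodeStart: 100, value: 500}, sample{nodeStart: 100, value: 750})) // 250
	fmt.Println(delta(sample{nodeStart: 100, value: 500}, sample{nodeStart: 200, value: 150})) // 150, restart in between
}
```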

@nicktrav nicktrav changed the title from "eventpb: add storage event types" to "[DNM] eventpb: add storage event types" Aug 22, 2022
@nicktrav (Collaborator, Author) left a comment

After talking with some folks internally, I'm going to open this back up for review.

Some points that I've discussed with folks:

PCI compliance / sensitive data. There is no PII emitted in these per-store stats. The data in each event is numerical, and pertains to the state of a single LSM (i.e. store).

Data volume. We've landed on a starting rate of one event per store per hour. There were some concerns that logging too much could overload the fluentbit logging daemon. I expect each event to be on the order of low single-digit KB (uncompressed). The set of stats present in this PR can also be scoped down, if necessary, to bring this number down.

More context in this thread (internal).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 130 at r1 (raw file):

what's the justification for 1h intervals?

My intention was to strike a balance between too frequent (the lower the interval, the more this degenerates into timeseries metrics, which is more expensive and puts more stress on the logging infra) and still useful (when emitting infrequently, derived metrics like rates of change become harder to compute, etc.). I believe the lower bound is more important from an operational / cost perspective. I suspect on the upper bound, logging as infrequently as once every 24 hours could still be useful. I'd be hesitant to go higher than that.

Can we miss transient interesting events because of this coarse granularity?

Certainly - however, the goal of this telemetry logging isn't to capture such events. These new telemetry events are intended to roll up to provide an aggregate view across all clusters (cloud, for now). We're more interested in the trends / broad themes. We have the ability to dig into the details if necessary, as we have the cluster, node and store information in the events.

What are other layers using as their telemetry logging interval?

One recent example of telemetry logging is #84761. That emits schema information once per week.

I have a thread going here (internal) that discusses what is reasonable. tldr: it was indicated that an event per store per hour seems like a reasonable starting point, and should not be an issue.


pkg/kv/kvserver/store.go line 3357 at r2 (raw file):

	// if reporting is enabled. These events are intended to be emitted at low
	// frequency. Trigger on tick 1 for the same reasons as above.
	if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) &&

While the logcrash package name doesn't seem all that relevant here, the comment on DiagnosticsReportingEnabled makes me believe it is suitable for use as a gate on whether we should emit the event or not:

	// "diagnostics.reporting.enabled" enables reporting of metrics related to a
	// node's storage (number, size and health of ranges) back to CockroachDB.
	// Collecting this data from production clusters helps us understand and improve
	// how our storage systems behave in real-world use cases.

If there's a better cluster setting / env var to use, please let me know. Alternatively, if it's fine to unconditionally log this to the telemetry channel, that could also be an option (I'm less inclined to do this for non-CC clusters, as the data is effectively useless there and will just eat up disk space).


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Could you add in parentheses whether something is cumulative?

Done. Clarified whether a metric is a counter or gauge.

I'm assuming everything here is cumulative since node restart.

Correct. Added that clarification at the top.

Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?

cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?

@nicktrav nicktrav changed the title from "[DNM] eventpb: add storage event types" to "eventpb: add storage event types" Aug 22, 2022
@jbowens (Collaborator) left a comment

:lgtm:

modulo the question about deltas.

Reviewed 1 of 11 files at r1, 2 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nicktrav, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 3358 at r2 (raw file):

	// frequency. Trigger on tick 1 for the same reasons as above.
	if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) &&
		tick%logStoreTelemetryTicks == 1 {

To avoid spamming the telemetry if we're crash looping, I think we could report on tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1.
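
A small self-contained illustration of the difference, assuming (as in the quoted code) a tick counter that restarts from zero with the process and advances every 10 seconds: a condition satisfied near the end of the cycle waits a full interval after every restart, while one satisfied at the start fires almost immediately.

```go
package main

import "fmt"

func main() {
	const logStoreTelemetryTicks = 6 * 60 // one hour of 10s ticks

	// firstTick returns the first post-restart tick on which cond holds.
	firstTick := func(cond func(tick int) bool) int {
		for tick := 1; ; tick++ {
			if cond(tick) {
				return tick
			}
		}
	}

	early := firstTick(func(t int) bool { return t%logStoreTelemetryTicks == 1 })
	late := firstTick(func(t int) bool { return t%logStoreTelemetryTicks == logStoreTelemetryTicks-1 })
	fmt.Printf("tick%%N == 1   first fires at tick %d (~%d seconds after restart)\n", early, early*10)
	fmt.Printf("tick%%N == N-1 first fires at tick %d (~%d seconds after restart)\n", late, late*10)
}
```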


pkg/storage/engine.go line 1062 at r2 (raw file):

		TableZombieCount:           m.Table.ZombieCount,
		TableZombieSize:            m.Table.ZombieSize,
	}

maybe I'm greedy, but I'm not sure if there's anything I'd want to drop. If anything, maybe WAL.ObsoleteFiles and WAL.ObsoletePhysicalSize.

I think eventually we could replace CompactionNumInProgress with a value that shows the 'effective compaction concurrency' over time, rather than just a single moment. I think including the CompactionNumInProgress until we have that metric is fine.

Can we add the new Keys.RangeKeySetsCount metric too?

@nicktrav nicktrav force-pushed the nickt.storage-events branch 2 times, most recently from 75c21f3 to 29426b4 on August 23, 2022 21:36
@nicktrav (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @knz, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 3358 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

To avoid spamming the telemetry if we're crash looping, I think we could report on tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1.

Good call. Done.


pkg/storage/engine.go line 1062 at r2 (raw file):

Can we add the new Keys.RangeKeySetsCount metric too?

Done!

not sure if there's anything I'd want to drop.

+1 - Given the limited concern around the amount of data being emitted and the relative infrequency (even at one event per hour per store), I'm inclined to keep this set of metrics as they are.

@andreimatei andreimatei self-requested a review August 23, 2022 21:49
@pransudash (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Could you add in parentheses whether something is cumulative?

Done. Clarified whether a metric is a counter or gauge.

I'm assuming everything here is cumulative since node restart.

Correct. Added that clarification at the top.

Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?

cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?

If needed, I can instruct downstream pipelines to reconstruct deltas from the cumulative values, and this is something we do frequently for other datasets. As long as your counters are always increasing, that shouldn't be an issue. If it ever decreases or goes back to 0, does that mean the node restarted? In any case, I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.
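
For reference, a minimal sketch of that downstream reconstruction under the decrease-means-restart assumption; the function and sample values are illustrative, not part of any existing pipeline.

```go
package main

import "fmt"

// deltas converts successive cumulative samples into per-interval deltas,
// treating any decrease as a counter reset (i.e. a node restart).
func deltas(samples []int64) []int64 {
	out := make([]int64, 0, len(samples))
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur >= prev {
			out = append(out, cur-prev)
		} else {
			// Counter went backwards: assume it reset to 0, so the whole
			// current value accrued since the restart.
			out = append(out, cur)
		}
	}
	return out
}

func main() {
	// 100 -> 180 (+80), then a restart resets the counter and it reaches 40, then 90.
	fmt.Println(deltas([]int64{100, 180, 40, 90})) // [80 40 50]
}
```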

@nicktrav (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

If it ever decreases or goes back to 0, does that mean the node restarted?

Correct.

I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.

Great. In that case I think we should be good here. We have either counters or gauges. No deltas.

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

If it ever decreases or goes back to 0, does that mean the node restarted?

Correct.

I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.

Great. In that case I think we should be good here. We have either counters or gauges. No deltas.

Ignore me since I'm being nitpicky, but I do want to stress that we can't really reliably calculate deltas from resetting counters, since we may observe monotonically non-decreasing values despite the counter having dropped to 0 in between samples (e.g. a last sample of 100 followed by a restart and a next sample of 150 looks like a delta of 50, even though at least 150 accrued after the reset).

@nicktrav (Collaborator, Author) commented:

Circling back on this one finally.

I chatted with @sumeerbhola offline about the delta issue. We'll see how much we can work around this on the backend for now (i.e. in how we do the reporting). This will avoid complicating things too much for the metrics reporting pipeline, which is still being fleshed out (the JSON schema can be considered "internal" and is subject to change).

TFTRs!

bors r=jbowens

craig bot commented Sep 21, 2022

👎 Rejected by code reviews

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)

@pransudash (Contributor) commented:

@nicktrav Sounds good! Let me know if there's any need for us to chat about reporting from these log events. I'm assuming the logs are still scheduled to be sent hourly?

@nicktrav (Collaborator, Author) commented:

bors r=jbowens,sumeerbhola

craig bot commented Sep 21, 2022

Build succeeded.

@craig craig bot merged commit 766b62d into cockroachdb:master Sep 21, 2022
@nicktrav nicktrav deleted the nickt.storage-events branch September 21, 2022 21:15