
eventpb: add storage event types #86277

Merged: 1 commit into cockroachdb:master on Sep 21, 2022

Conversation

@nicktrav (Collaborator) commented Aug 17, 2022

Add the StoreStats event type, a per-store event emitted to the
TELEMETRY logging channel. This event type will be computed from the
Pebble metrics for each store.

Emit a StoreStats event periodically, by default, once per hour, per
store.

Touches #85589.

Release note: None.

Release justification: low risk, high benefit changes to existing
functionality.
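
For orientation, a minimal sketch of the shape described above: once per interval (an hour by default), turn each store's Pebble metrics into one structured event and write it to the telemetry log. The `storeStats`, `gatherMetrics`, and `emitTelemetry` names are hypothetical stand-ins, not the actual APIs touched by this PR.

```go
package main

import (
	"fmt"
	"time"
)

// storeStats is a hypothetical stand-in for the StoreStats event payload,
// which the PR generates from pkg/util/log/eventpb/storage_events.proto.
type storeStats struct {
	StoreID        int
	FlushCount     int64 // example cumulative counter derived from Pebble metrics
	CompactedBytes int64 // example cumulative counter derived from Pebble metrics
}

// gatherMetrics pretends to read the Pebble metrics for one store.
func gatherMetrics(storeID int) storeStats {
	return storeStats{StoreID: storeID, FlushCount: 42, CompactedBytes: 1 << 20}
}

// emitTelemetry pretends to write one structured event to the TELEMETRY channel.
func emitTelemetry(ev storeStats) {
	fmt.Printf("TELEMETRY store=%d flushes=%d compacted=%dB\n",
		ev.StoreID, ev.FlushCount, ev.CompactedBytes)
}

func main() {
	ticker := time.NewTicker(time.Hour) // once per hour, per store, by default
	defer ticker.Stop()
	for range ticker.C {
		for _, storeID := range []int{1, 2} {
			emitTelemetry(gatherMetrics(storeID))
		}
	}
}
```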

@cockroach-teamcity (Member) commented:

This change is Reviewable

@nicktrav (Collaborator, Author) commented:

cc: @jbowens - sneak peek at the new event type. I've cherry-picked a bunch of the Pebble metrics, erring on the side of "too many". We can probably prune this list a bit, but wanted to get a sense check on what we'd be capturing (second commit).

@nicktrav nicktrav force-pushed the nickt.storage-events branch 2 times, most recently from 94689f4 to 5b9ed06 on August 19, 2022 16:22
@nicktrav nicktrav requested review from jbowens and a team August 19, 2022 16:23
@nicktrav nicktrav changed the title from "[DNM] eventpb: add storage event types" to "eventpb: add storage event types" Aug 19, 2022
@nicktrav nicktrav marked this pull request as ready for review August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team as a code owner August 19, 2022 16:23
@nicktrav nicktrav requested a review from a team as a code owner August 19, 2022 20:30
@sumeerbhola (Collaborator) left a comment

Reviewed 1 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens and @nicktrav)


pkg/kv/kvserver/store.go line 130 at r1 (raw file):

var logStoreTelemetryTicks = envutil.EnvOrDefaultInt(
	"COCKROACH_LOG_STORE_TELEMETRY_TICKS_INTERVAL",
	6*60, // Once per hour. (tick interval = 10s) * 6 * 60 = 3600s = 1h.
)

what's the justification for 1h intervals? Can we miss transient interesting events because of this coarse granularity? What are other layers using as their telemetry logging interval?


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

  repeated LevelStats levels = 4 [(gogoproto.nullable) = false, (gogoproto.jsontag) = ""];

  // Cache metrics.

Could you add in parentheses whether something is cumulative? I'm assuming everything here is cumulative since node restart. Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)? If yes, would it be simpler to export the deltas? Otherwise I can imagine a more complex calculation where we also export the restart time and then do a delta calculation from 0 if the restart time changed. Metrics systems can sometimes do that under the covers for you (Monarch did), but it is a complicated sql query to write in the absence of built-in support.
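
A minimal sketch of the "also export the restart time" alternative, with hypothetical names (these are not fields in this PR's events): if every sample carries the node's start time, the delta is taken from the full current value whenever that start time changes.

```go
package main

import "fmt"

// sample pairs a cumulative counter value with the node start time it was
// observed under; both fields are illustrative, not part of the PR's events.
type sample struct {
	nodeStart int64 // e.g. unix seconds of the last process restart
	value     int64 // cumulative counter since that restart
}

// delta computes the work done between two consecutive samples.
func delta(prev, cur sample) int64 {
	if cur.nodeStart != prev.nodeStart {
		// The node restarted in between, so the counter reset to 0: everything
		// in the current sample accrued since the restart.
		return cur.value
	}
	return cur.value - prev.value
}

func main() {
	fmt.Println(delta(sample{nodeStart: 100, value: 500}, sample{nodeStart: 100, value: 750})) // 250
	fmt.Println(delta(sample{nodeStart: 100, value: 500}, sample{nodeStart: 200, value: 150})) // 150, restart in between
}
```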

@nicktrav nicktrav changed the title from "eventpb: add storage event types" to "[DNM] eventpb: add storage event types" Aug 22, 2022
@nicktrav (Collaborator, Author) left a comment

After talking with some folks internally, I'm going to open this back up for review.

Some points that I've discussed with folks:

PCI compliance / sensitive data. There is no PII emitted in these per-store stats. The data in each event is numerical, and pertains to the state of a single LSM (i.e. store).

Data volume. We've landed on a starting rate of one event per store per hour. There were some concerns that logging too much could overload the fluentbit logging daemon. I expect each event to be on the order of low single-digit KB (uncompressed). The set of stats present in this PR can also be scoped down, if necessary, to bring this number down.

More context in this thread (internal).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jbowens, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 130 at r1 (raw file):

what's the justification for 1h intervals?

My intention was to strike a balance between too frequent (the lower the interval, the more this degenerates into timeseries metrics, which is more expensive and puts more stress on the logging infra) and still useful (when emitting infrequently, derived metrics like rates of change become harder to compute, etc.). I believe the lower bound is more important from an operational / cost perspective. I suspect on the upper bound, logging as infrequently as once every 24 hours could still be useful. I'd be hesitant to go higher than that.

Can we miss transient interesting events because of this coarse granularity?

Certainly - however, the goal of this telemetry logging isn't to capture such events. These new telemetry events are intended to roll up to provide an aggregate view across all clusters (cloud, for now). We're more interested in the trends / broad themes. We have the ability to dig into the details if necessary, as we have the cluster, node and store information in the events.

What are other layers using as their telemetry logging interval?

One recent example of telemetry logging is #84761. That emits schema information once per week.

I have a thread going here (internal) that discusses what is reasonable. tldr: it was indicated that an event per store per hour seems like a reasonable starting point, and should not be an issue.


pkg/kv/kvserver/store.go line 3357 at r2 (raw file):

	// if reporting is enabled. These events are intended to be emitted at low
	// frequency. Trigger on tick 1 for the same reasons as above.
	if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) &&

While the logcrash package name doesn't seem all that relevant here, the comment on DiagnosticsReportingEnabled makes me believe it is suitable for use as a gate on whether we should emit the event or not:

	// "diagnostics.reporting.enabled" enables reporting of metrics related to a
	// node's storage (number, size and health of ranges) back to CockroachDB.
	// Collecting this data from production clusters helps us understand and improve
	// how our storage systems behave in real-world use cases.

If there's a better cluster setting / env var to use, please let me know. Alternatively, if it's fine to unconditionally log this to the telemetry channel, that could also be an option (I'm less inclined to do this for non-CC clusters, as the data is effectively useless there and will just eat up disk space).


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Could you add in parentheses whether something is cumulative?

Done. Clarified whether a metric is a counter or gauge.

I'm assuming everything here is cumulative since node restart.

Correct. Added that clarification at the top.

Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?

cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?

@nicktrav nicktrav changed the title from "[DNM] eventpb: add storage event types" to "eventpb: add storage event types" Aug 22, 2022
@jbowens (Collaborator) left a comment

:lgtm:

modulo the question about deltas.

Reviewed 1 of 11 files at r1, 2 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz, @nicktrav, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 3358 at r2 (raw file):

	// frequency. Trigger on tick 1 for the same reasons as above.
	if logcrash.DiagnosticsReportingEnabled.Get(&s.ClusterSettings().SV) &&
		tick%logStoreTelemetryTicks == 1 {

To avoid spamming the telemetry if we're crash looping, I think we could report on tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1.
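
A small self-contained illustration of the difference, assuming (as in the quoted code) a tick counter that restarts from zero with the process and advances every 10 seconds: a condition satisfied near the end of the cycle waits a full interval after every restart, while one satisfied at the start fires almost immediately.

```go
package main

import "fmt"

func main() {
	const logStoreTelemetryTicks = 6 * 60 // one hour of 10s ticks

	// firstTick returns the first post-restart tick on which cond holds.
	firstTick := func(cond func(tick int) bool) int {
		for tick := 1; ; tick++ {
			if cond(tick) {
				return tick
			}
		}
	}

	early := firstTick(func(t int) bool { return t%logStoreTelemetryTicks == 1 })
	late := firstTick(func(t int) bool { return t%logStoreTelemetryTicks == logStoreTelemetryTicks-1 })
	fmt.Printf("tick%%N == 1   first fires at tick %d (~%d seconds after restart)\n", early, early*10)
	fmt.Printf("tick%%N == N-1 first fires at tick %d (~%d seconds after restart)\n", late, late*10)
}
```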


pkg/storage/engine.go line 1062 at r2 (raw file):

		TableZombieCount:           m.Table.ZombieCount,
		TableZombieSize:            m.Table.ZombieSize,
	}

maybe I'm greedy, but I'm not sure if there's anything I'd want to drop. If anything, maybe WAL.ObsoleteFiles and WAL.ObsoletePhysicalSize.

I think eventually we could replace CompactionNumInProgress with a value that shows the 'effective compaction concurrency' over time, rather than just a single moment. I think including the CompactionNumInProgress until we have that metric is fine.

Can we add the new Keys.RangeKeySetsCount metric too?

@nicktrav nicktrav force-pushed the nickt.storage-events branch 2 times, most recently from 75c21f3 to 29426b4 on August 23, 2022 21:36
@nicktrav (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jbowens, @knz, @pransudash, and @sumeerbhola)


pkg/kv/kvserver/store.go line 3358 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

To avoid spamming the telemetry if we're crash looping, I think we could report on tick%logStoreTelemetryTicks == logStoreTelemetryTicks-1.

Good call. Done.


pkg/storage/engine.go line 1062 at r2 (raw file):

Can we add the new Keys.RangeKeySetsCount metric too?

Done!

not sure if there's anything I'd want to drop.

+1 - Given the limited concern around the amount of data being emitted and the relative infrequency (even at one event per hour per store), I'm inclined to keep this set of metrics as they are.

@andreimatei andreimatei self-requested a review August 23, 2022 21:49
@pransudash (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Could you add in parentheses whether something is cumulative?

Done. Clarified whether a metric is a counter or gauge.

I'm assuming everything here is cumulative since node restart.

Correct. Added that clarification at the top.

Do we expect to take deltas or rates of these cumulative values downstream (in the analytics pipeline)?

cc: @pransudash - do you know the answer to this? Will it be an issue if we emit counters that contain discontinuities? I assume that's only an issue to consider when we get to performing the analytics?

If needed, I can instruct downstream pipelines to reconstruct deltas from the cumulative values, and this is something we do frequently for other datasets. As long as your counters are always increasing, that shouldn't be an issue. If it ever decreases or goes back to 0, does that mean the node restarted? In any case, I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.
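
For reference, a minimal sketch of that downstream reconstruction under the decrease-means-restart assumption; the function and sample values are illustrative, not part of any existing pipeline.

```go
package main

import "fmt"

// deltas converts successive cumulative samples into per-interval deltas,
// treating any decrease as a counter reset (i.e. a node restart).
func deltas(samples []int64) []int64 {
	out := make([]int64, 0, len(samples))
	for i := 1; i < len(samples); i++ {
		prev, cur := samples[i-1], samples[i]
		if cur >= prev {
			out = append(out, cur-prev)
		} else {
			// Counter went backwards: assume it reset to 0, so the whole
			// current value accrued since the restart.
			out = append(out, cur)
		}
	}
	return out
}

func main() {
	// 100 -> 180 (+80), then a restart resets the counter and it reaches 40, then 90.
	fmt.Println(deltas([]int64{100, 180, 40, 90})) // [80 40 50]
}
```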

@nicktrav (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

If it ever decreases or goes back to 0, does that mean the node restarted?

Correct.

I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.

Great. In that case I think we should be good here. We have either counters or gauges. No deltas.

@sumeerbhola (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)


pkg/util/log/eventpb/storage_events.proto line 33 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

If it ever decreases or goes back to 0, does that mean the node restarted?

Correct.

I think it's good to keep cumulative values and only include deltas if they can't be reliably calculated from the counters.

Great. In that case I think we should be good here. We have either counters or gauges. No deltas.

Ignore me since I'm being nitpicky, but I do want to stress that we can't really reliably calculate deltas from resetting counters, since we may observe monotonically non-decreasing values despite the counter having dropped to 0 in between samples (e.g. a last sample of 100 followed by a restart and a next sample of 150 looks like a delta of 50, even though at least 150 accrued after the reset).

@nicktrav (Collaborator, Author) commented:

Circling back on this one finally.

I chatted with @sumeerbhola offline about the delta issue. We'll see how much we can work around this on the backend for now (i.e. in how we do the reporting). This will avoid complicating things too much for the metrics reporting pipeline, which is still being fleshed out (the JSON schema can be considered "internal" and is subject to change).

TFTRs!

bors r=jbowens

craig bot commented Sep 21, 2022

👎 Rejected by code reviews

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @jbowens, @knz, @pransudash, and @sumeerbhola)

@pransudash (Contributor) commented:

@nicktrav Sounds good! Let me know if there's any need for us to chat about reporting from these log events. I'm assuming the logs are still scheduled to be sent hourly?

@nicktrav (Collaborator, Author) commented:

bors r=jbowens,sumeerbhola

craig bot commented Sep 21, 2022

Build succeeded.

@craig craig bot merged commit 766b62d into cockroachdb:master Sep 21, 2022
@nicktrav nicktrav deleted the nickt.storage-events branch September 21, 2022 21:15