-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: add metric for schema change job user/database errors #68252
Conversation
65ca358
to
4043b8b
Compare
4043b8b
to
26f7760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few comments from me.
query I | ||
SELECT count(usage_count) | ||
FROM crdb_internal.feature_usage | ||
WHERE feature_name = 'sql.schema_changer.errors.constraint_violation' and usage_count > 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this query quite confusing to be honest. What is the purpose and meaning of the usage_count > 1
clause? Shouldn't this be a sum
instead of a count
?
pkg/sql/schema_changer.go
Outdated
} else { | ||
telemetry.Inc(sc.metrics.UncategorizedErrors) | ||
} | ||
telemetry.Inc(sc.metrics.AllErrors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing wrong with this AllErrors counter per se, but since it can be trivially inferred by summing the other two above I'd be inclined to remove it.
When running analytics queries on the counters it's nice to be able to aggregate on the parent key. In CRDB this means SELECT ..., sum(usage_count) FROM crdb_internal.feature_usage WHERE feature_name LIKE 'sql.schema_changer.errors.%' AND ... GROUP BY ...
and in Snowflake it's I don't know what but I'd wager a guess that it's nicer for the clients of these counters to not have to special-case AllErrors.
pgCode == pgcode.ForeignKeyViolation || | ||
pgCode == pgcode.NotNullViolation || | ||
pgCode == pgcode.IntegrityConstraintViolation | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming this is correct (I honestly don't know) this might not age so well if we need to cover new failure codes. Have you considered making the pgcode itself part of the telemetry counter key? That way you wouldn't have to worry about this. I believe it'd be OK to shift the burden of interpreting these to the client. cc @vy-ton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pgcode itself part of the telemetry counter key
Do you mean a counter for every time a pgcode is returned? Could that be noisy and include instances when these specific codes are encountered in non-schema changes?
When we have the new telemetry logging infra, I think tracking pgcode errors would be a great thing to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @postamar)
pkg/sql/schema_changer.go, line 151 at r1 (raw file):
Previously, vy-ton (Vy Ton) wrote…
pgcode itself part of the telemetry counter key
Do you mean a counter for every time a pgcode is returned? Could that be noisy and include instances when these specific codes are encountered in non-schema changes?When we have the new telemetry logging infra, I think tracking pgcode errors would be a great thing to add.
Does that mean right now we are okay with this approach, due to limitations of existing infrasturcture? I do like the idea of shifting the burden on the clients and just adding this in the key @postamar @vy-ton
pkg/sql/schema_changer.go, line 2135 at r1 (raw file):
Previously, postamar (Marius Posta) wrote…
There's nothing wrong with this AllErrors counter per se, but since it can be trivially inferred by summing the other two above I'd be inclined to remove it.
When running analytics queries on the counters it's nice to be able to aggregate on the parent key. In CRDB this means
SELECT ..., sum(usage_count) FROM crdb_internal.feature_usage WHERE feature_name LIKE 'sql.schema_changer.errors.%' AND ... GROUP BY ...
and in Snowflake it's I don't know what but I'd wager a guess that it's nicer for the clients of these counters to not have to special-case AllErrors.
Good point, I'll drop the allErrors counter, since it's trivial to compute.
pkg/sql/logictest/testdata/logic_test/alter_table, line 1848 at r1 (raw file):
Previously, postamar (Marius Posta) wrote…
I find this query quite confusing to be honest. What is the purpose and meaning of the
usage_count > 1
clause? Shouldn't this be asum
instead of acount
?
This is more of a sanity check at the end of the test depending on the mode we are running for the logic test, there may be either N or more failures. I'll make this saner by having it greater than N. The count is just mean to give a binary indicator that a greater or equal number was seen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @postamar)
pkg/sql/logictest/testdata/logic_test/alter_table, line 1848 at r1 (raw file):
Previously, fqazi (Faizan Qazi) wrote…
This is more of a sanity check at the end of the test depending on the mode we are running for the logic test, there may be either N or more failures. I'll make this saner by having it greater than N. The count is just mean to give a binary indicator that a greater or equal number was seen.
*Sorry its because the errors sum across the different modes
26f7760
to
881a77f
Compare
LGTM |
@postamar TFTR! |
Fixes: cockroachdb#68111 Previously, we had no way of tracking the reason a schema change job failed in our metrics. This was inadequate because we had no way of knowing why schema jobs failed. To address this, this patch will add metrics for tracking failures due to constraint violations and database errors. Release note (sql change): Add a new metrics to track schema job failure (sql.schema_changer.errors.all, sql.schema_changer.errors.constraint_violation, sql.schema_changer.errors.uncategorized), errors inside the crdb_internal.feature_usage table.
881a77f
to
3375d87
Compare
bors r=postamar |
Build failed (retrying...): |
Build failed: |
bors r=postamar |
Build succeeded: |
Fixes: #68111
Previously, we had no way of tracking the reason
a schema change job failed in our metrics. This was
inadequate because we had no way of knowing why
schema jobs failed. To address this, this patch
will add metrics for tracking failures due to constraint
violations and database errors.
Release note (sql change): Add a new metrics to track schema job failure
(sql.schema_changer.errors.all,
sql.schema_changer.errors.constraint_violation,
sql.schema_changer.errors.uncategorized), errors inside the
crdb_internal.feature_usage table.