Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: add metric for schema change job user/database errors #68252

Merged
merged 1 commit into from
Aug 13, 2021

Conversation

fqazi
Copy link
Collaborator

@fqazi fqazi commented Jul 29, 2021

Fixes: #68111

Previously, we had no way of tracking the reason
a schema change job failed in our metrics. This was
inadequate because we had no way of knowing why
schema jobs failed. To address this, this patch
will add metrics for tracking failures due to constraint
violations and database errors.

Release note (sql change): Add a new metrics to track schema job failure
(sql.schema_changer.errors.all,
sql.schema_changer.errors.constraint_violation,
sql.schema_changer.errors.uncategorized), errors inside the
crdb_internal.feature_usage table.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@fqazi fqazi force-pushed the schemaChangeFailErrorType branch from 65ca358 to 4043b8b Compare July 30, 2021 14:34
@fqazi fqazi requested a review from a team July 30, 2021 14:34
@fqazi fqazi marked this pull request as ready for review July 30, 2021 14:34
@fqazi fqazi force-pushed the schemaChangeFailErrorType branch from 4043b8b to 26f7760 Compare July 30, 2021 18:28
Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments from me.

query I
SELECT count(usage_count)
FROM crdb_internal.feature_usage
WHERE feature_name = 'sql.schema_changer.errors.constraint_violation' and usage_count > 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this query quite confusing to be honest. What is the purpose and meaning of the usage_count > 1 clause? Shouldn't this be a sum instead of a count?

} else {
telemetry.Inc(sc.metrics.UncategorizedErrors)
}
telemetry.Inc(sc.metrics.AllErrors)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing wrong with this AllErrors counter per se, but since it can be trivially inferred by summing the other two above I'd be inclined to remove it.

When running analytics queries on the counters it's nice to be able to aggregate on the parent key. In CRDB this means SELECT ..., sum(usage_count) FROM crdb_internal.feature_usage WHERE feature_name LIKE 'sql.schema_changer.errors.%' AND ... GROUP BY ... and in Snowflake it's I don't know what but I'd wager a guess that it's nicer for the clients of these counters to not have to special-case AllErrors.

pgCode == pgcode.ForeignKeyViolation ||
pgCode == pgcode.NotNullViolation ||
pgCode == pgcode.IntegrityConstraintViolation
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this is correct (I honestly don't know) this might not age so well if we need to cover new failure codes. Have you considered making the pgcode itself part of the telemetry counter key? That way you wouldn't have to worry about this. I believe it'd be OK to shift the burden of interpreting these to the client. cc @vy-ton

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pgcode itself part of the telemetry counter key
Do you mean a counter for every time a pgcode is returned? Could that be noisy and include instances when these specific codes are encountered in non-schema changes?

When we have the new telemetry logging infra, I think tracking pgcode errors would be a great thing to add.

Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/schema_changer.go, line 151 at r1 (raw file):

Previously, vy-ton (Vy Ton) wrote…

pgcode itself part of the telemetry counter key
Do you mean a counter for every time a pgcode is returned? Could that be noisy and include instances when these specific codes are encountered in non-schema changes?

When we have the new telemetry logging infra, I think tracking pgcode errors would be a great thing to add.

Does that mean right now we are okay with this approach, due to limitations of existing infrasturcture? I do like the idea of shifting the burden on the clients and just adding this in the key @postamar @vy-ton


pkg/sql/schema_changer.go, line 2135 at r1 (raw file):

Previously, postamar (Marius Posta) wrote…

There's nothing wrong with this AllErrors counter per se, but since it can be trivially inferred by summing the other two above I'd be inclined to remove it.

When running analytics queries on the counters it's nice to be able to aggregate on the parent key. In CRDB this means SELECT ..., sum(usage_count) FROM crdb_internal.feature_usage WHERE feature_name LIKE 'sql.schema_changer.errors.%' AND ... GROUP BY ... and in Snowflake it's I don't know what but I'd wager a guess that it's nicer for the clients of these counters to not have to special-case AllErrors.

Good point, I'll drop the allErrors counter, since it's trivial to compute.


pkg/sql/logictest/testdata/logic_test/alter_table, line 1848 at r1 (raw file):

Previously, postamar (Marius Posta) wrote…

I find this query quite confusing to be honest. What is the purpose and meaning of the usage_count > 1 clause? Shouldn't this be a sum instead of a count?

This is more of a sanity check at the end of the test depending on the mode we are running for the logic test, there may be either N or more failures. I'll make this saner by having it greater than N. The count is just mean to give a binary indicator that a greater or equal number was seen.

Copy link
Collaborator Author

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @postamar)


pkg/sql/logictest/testdata/logic_test/alter_table, line 1848 at r1 (raw file):

Previously, fqazi (Faizan Qazi) wrote…

This is more of a sanity check at the end of the test depending on the mode we are running for the logic test, there may be either N or more failures. I'll make this saner by having it greater than N. The count is just mean to give a binary indicator that a greater or equal number was seen.

*Sorry its because the errors sum across the different modes

@fqazi fqazi force-pushed the schemaChangeFailErrorType branch from 26f7760 to 881a77f Compare August 4, 2021 02:45
@postamar postamar self-requested a review August 12, 2021 14:53
@postamar
Copy link
Contributor

LGTM

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 12, 2021

@postamar TFTR!

Fixes: cockroachdb#68111

Previously, we had no way of tracking the reason
a schema change job failed in our metrics. This was
inadequate because we had no way of knowing why
schema jobs failed. To address this, this patch
will add metrics for tracking failures due to constraint
violations and database errors.

Release note (sql change): Add a new metrics to track schema job failure
(sql.schema_changer.errors.all,
sql.schema_changer.errors.constraint_violation,
sql.schema_changer.errors.uncategorized), errors inside the
crdb_internal.feature_usage table.
@fqazi fqazi force-pushed the schemaChangeFailErrorType branch from 881a77f to 3375d87 Compare August 12, 2021 19:38
@fqazi
Copy link
Collaborator Author

fqazi commented Aug 12, 2021

bors r=postamar

@craig
Copy link
Contributor

craig bot commented Aug 12, 2021

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Aug 12, 2021

Build failed:

@fqazi
Copy link
Collaborator Author

fqazi commented Aug 13, 2021

bors r=postamar

@craig
Copy link
Contributor

craig bot commented Aug 13, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: Add telemetry counters for schema change error type
4 participants