Summary of errors in logs that are not yet monitored #3

Open
jtgeibel opened this issue Sep 6, 2021 · 2 comments


Comments

@jtgeibel
Member

jtgeibel commented Sep 6, 2021

Here is a summary of the error="..." entries in our logs that we may want to monitor more closely in our metrics. We may want to follow Heroku's example and assign code values to these error cases. We should ensure these all have an at=error prefix so that they can be easily ingested from the logs.

  • error="canceling statement due to statement timeout"
  • error="unhealthy database pool"
  • error="there is no unique or exclusion constraint matching the ON CONFLICT specification"
  • error="end of file reached" (on crate publish endpoint)
  • downloads_counter error: unhealthy database pool
    • Only 1 occurrence in the last month. I'm opening a PR to log this as at=error mod=downloads_counter error="unhealthy database pool".
  • Error: error sending request for url (https://events.pagerduty.com/generic/2010-04-15/create_event.json): operation timed out
    • A few occurrences in the last month. (We send a "resolved" update if nothing is wrong, and this failed a few times.)
  • error="failed to upload crate: error sending request for url (https://crates-io.s3-us-west-1.amazonaws.com/crates/xyz/xyz-0.2.0.crate): connection closed before message completed"
  • From crate_owner_invitations. These don't seem like internal status=500 errors and should maybe be surfaced as user-facing errors instead: error="missing user {private.inviter_id}", error="missing crate with id {invitation.crate_id}".

Additionally, we may want to add an at=warn prefix that could be used to flag slow requests and other operationally interesting events that aren't strictly errors.
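
To make the convention concrete, here is a rough sketch of how a logfmt-style line carrying the at=error prefix could be picked out and its error message extracted for counting. This is only an illustration of the idea; the function name and parsing approach are hypothetical, not what crates.io or crates-io-heroku-metrics actually does:

```rust
/// Hypothetical helper: given a logfmt-style line such as
/// `at=error mod=downloads_counter error="unhealthy database pool"`,
/// return the error message if the line is tagged `at=error`.
fn error_message(line: &str) -> Option<String> {
    // Only consider lines explicitly tagged as errors.
    if !line.split_whitespace().any(|tok| tok == "at=error") {
        return None;
    }

    // Find the `error="..."` field and return its quoted value.
    let start = line.find("error=\"")? + "error=\"".len();
    let end = line[start..].find('"')? + start;
    Some(line[start..end].to_string())
}

fn main() {
    let line = r#"at=error mod=downloads_counter error="unhealthy database pool""#;
    assert_eq!(error_message(line).as_deref(), Some("unhealthy database pool"));

    // Lines without the at=error prefix are ignored.
    assert_eq!(error_message("at=info method=GET path=/api/v1/crates"), None);
}
```

A consumer along these lines could increment a per-message counter, which is why a consistent at=error prefix (and eventually stable error codes) matters.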

jtgeibel added a commit to jtgeibel/crates-io-heroku-metrics that referenced this issue Sep 6, 2021

This is an initial step in adding metrics for the errors identified in rust-lang#3.
@Turbo87
Member

Turbo87 commented Sep 6, 2021

While logging certainly makes sense, I'm wondering if it would be better to use Sentry more for these things 🤔

@jtgeibel
Member Author

jtgeibel commented Sep 7, 2021

I think we should do both where possible. My original motivation for investigating was to make sure we capture Heroku platform-level error codes, where the request may not make it to the backend, or where the backend completes successfully but the user still sees an error for some reason. Then, by adopting the existing prefix, we can ensure that all levels of errors end up in at least one place together.
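
As a rough illustration of that last point (hypothetical code, not the crates-io-heroku-metrics implementation): Heroku's router already emits lines like at=error code=H12 desc="Request timeout" ..., so a single consumer keyed on the at=error prefix can tally platform-level codes and application errors in one place:

```rust
use std::collections::HashMap;

/// Hypothetical tally: count `at=error` lines by their `code=` value,
/// grouping application errors that carry no code under "unknown".
fn tally_error_codes(lines: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for line in lines {
        let mut is_error = false;
        let mut code = None;
        for token in line.split_whitespace() {
            match token.split_once('=') {
                Some(("at", "error")) => is_error = true,
                Some(("code", value)) => code = Some(value.to_string()),
                _ => {}
            }
        }
        if is_error {
            *counts
                .entry(code.unwrap_or_else(|| "unknown".to_string()))
                .or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let logs = [
        // Heroku router error (platform level).
        r#"heroku[router]: at=error code=H12 desc="Request timeout" method=GET path="/api/v1/crates""#,
        // Application-level error using the same prefix.
        r#"at=error error="unhealthy database pool""#,
    ];
    let counts = tally_error_codes(&logs);
    assert_eq!(counts.get("H12"), Some(&1));
    assert_eq!(counts.get("unknown"), Some(&1));
}
```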
