Recover from panic in http and grpc handlers. #2059
I don't see any good reason to crash any component during a bad request. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Codecov Report
@@            Coverage Diff             @@
##           master    #2059      +/-   ##
==========================================
+ Coverage   64.02%   64.19%   +0.17%
==========================================
  Files         133      136       +3
  Lines       10222    10240      +18
==========================================
+ Hits         6545     6574      +29
+ Misses       3187     3175      -12
- Partials      490      491       +1
This looks good, but I'm not sure about putting panic recovery on the write path. I believe we wanted to recover on the read path, specifically on ingester reads, which is the one area where our reads and writes coincide and thus a place where read panics could cause data loss.
@@ -140,37 +141,33 @@ func New(cfg Config) (*Loki, error) {
}

func (t *Loki) setupAuthMiddleware() {
	t.cfg.Server.GRPCMiddleware = []grpc.UnaryServerInterceptor{serverutil.RecoveryGRPCUnaryInterceptor}
I'm not sure if we want recovery middleware everywhere -- what about just on the read path?
This was intentional. I don't see why an API call, read or write, should cause a crash and data loss. Even crashing a distributor can cause an outage: imagine a situation where a specific stream takes down a distributor. In a very short time, it could bring the whole write path down, for all tenants. Again, I don't see a situation where we would want to let a component crash on a request.
I would second this: read or write, we never want a panic. Write is actually a much worse scenario because it would guarantee a round-robin crash of every component, since writes are consistent and always happening.
lgtm
I don't see any good reason to crash any component during a bad request.
The stack trace is still printed to stderr across multiple lines.
e.g