Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic within a handler should return an HTTP 500 #1501

Closed
sazzer opened this issue May 11, 2020 · 26 comments
Closed

Panic within a handler should return an HTTP 500 #1501

sazzer opened this issue May 11, 2020 · 26 comments

Comments

@sazzer
Copy link

sazzer commented May 11, 2020

(Apologies if this is already covered anywhere. I did look and couldn't find anything, but I may very well have missed it!)

Expected Behavior

If anything goes catastrophically wrong within a handler, it would be good to return an HTTP 500 response to the client.

Even better would be the option to customize what the error response is.

Current Behavior

Currently, the server just hangs up the network connection.
It does not seem to break the threading - I can make many more panicking requests to the server than it is running threads and they all connect just fine. I just don't get a decent error.

Steps to Reproduce (for bugs)

  1. Write a handler
  2. Call - e.g. unwrap() on an Err value from within the handler

Context

There are certain classes of errors that are catastrophic in nature. For example, the database not being present. Adding deliberate error propagation for these scenarios only serves to bloat the codebase - all of a sudden I've got to propagate the errors all the way from the bottom layer up to the handlers in order to return an error.

Alternatively, I can just not care about these. Just call pool.get().await.unwrap() to get a connection from the connection pool, for example. If I get a connection then that's great. If I don't then I have a catastrophic failure on my hands anyway, so panicking isn't unreasonable.

I could wrap every single handler in panic::catch_unwind() and handle it myself, but this is then a lot of duplication of effort, and has the need to return appropriate catastrophic errors from every handler. (I'm also not sure how to use catch_unwind in an async context, but that's my problem :) )

Alternatively, if Actix-web can do this then it only needs to be done once, and can have guaranteed consistent responses no matter which handler had the problem.

Obviously, this should only apply to actual catastrophic failures. Business errors, validation errors, etc are the domain of the application and not of Actix-web and should remain as such.

Your Environment

  • Rust Version (I.e, output of rustc -V): rustc 1.45.0-nightly (a08c47310 2020-05-07)
  • Actix Web Version:
-> % cargo tree | grep actix
├── actix-cors v0.2.0
├── actix-http v1.0.1 (*)
├── actix-rt v1.1.1 (*)
├── actix-web v2.0.0 (*)
@robjtede
Copy link
Member

robjtede commented May 11, 2020

I would argue that terminating a connection is a suitable response to a handler panic. You should endevor to employ proper error handling at all levels. Things that return a Result do so because their errors are recoverable and should be either handled or bubbled.

Using pool.get().await is certainly common in handler code. Taking advantage of the ? operator is the best and cleanest way to handle this pool.get().await? where your handler's Error type impls From<PoolError> and can be converted to an error response (eg impl actix_web::ResponseError). There a plethora of error helper crates to make this pattern clean and easy to implement but using a plain global error enum with manual impls is also reasonably simple.

On catch_unwind, trying to employ a system that relies on catch_unwind has all sorts of gotchas including considerations of things that are or are not unwind-safe; it is not equivalent to try-catch in other languages that are built to handle clean up.

@sazzer
Copy link
Author

sazzer commented May 11, 2020

I appreciate that catch_unwind is not a trivial thing to go for, and that it's not an equivalent to try/catch in other languages.

However, I'm also not suggesting that this is something to do lightly. You're absolutely right that returning a Result and using the ? operator is correct behaviour most of the time. But I'm not convinced it's correct behaviour all of the time.

Specifically I am talking about things that really are catastrophic failures. Things like talking to a database that just isn't there. Not reading a database record that is not found, but the entire database being inaccessible.

And yes, this can be done using Result and propagating an error value. That works, but it gets very tedious and repetitive when you have a multi-layered architecture. For example, in my app right now I'm following the Clean Architecture pattern (Which works remarkably well in Rust), which means I have:

  • Handler function lookup_username() -> impl Responder
  • Calls UserService.lookup_username() -> bool
  • Calls UserRepository.find_user_by_username() -> Option<UserModel>
  • Calls Database.checkout() -> PooledConnection (Where Database is a wrapper around bb8)
  • Calls Pool.get() -> Result<PooledConnection, Error>

Right now, this works great. And the only way this can at all fail is in a catastrophic sense. If the database is missing, or the schema is corrupt. Anything else will cause a graceful result to be returned and all is good. (This design was based on reading around in blog posts, the Rust Book and The Rustonomicon.)

Now, I could return a Result here. That would mean that every step of the way I've got to have a Result type, with an appropriate Error type for that level - bearing in mind that other calls through could have business errors involved as well, so you'd now be mixing these catastrophic failures with normal business errors - and then have to handle it in every single handler function to correctly return an HTTP 500 indicating that something was seriously wrong.

As such, it seemed reasonable to have this level of handling be something that was more common and shared so that everyone can benefit automatically. :)

Since these are the level of errors that means that everything is broken anyway, I'm not totally invested in whether it is decided to go ahead with this or not. If you still deem it better to leave it alone then you are significantly more experienced than I am and I bow to that wisdom :)

Cheers

@Lesiuk
Copy link
Contributor

Lesiuk commented May 11, 2020

I am writing my code without using any panics but some external libraries are using panics in some cases so returning 500 if it happens should be better idea than just closing connection. I will research this possibility (https://docs.rs/futures/0.3.5/futures/future/trait.FutureExt.html#method.catch_unwind) this weekend and maybe create my first PR.

@robjtede
Copy link
Member

Anyone from @actix/contributors got experiences or opinions on handler panicking?

@Dowwie
Copy link
Contributor

Dowwie commented May 11, 2020

I'm sorry but I don't suppose this proposal. Rust offers first-class error handling to manage this situation. Impress upon the authors of 3rd party libraries to manage errors rather than panic.

@robjtede
Copy link
Member

@sazzer if you want the feeling of throwing from functions this crate gets you pretty close to the syntax: https://docs.rs/fehler/1.0.0/fehler/attr.throws.htm

@fafhrd91
Copy link
Member

catch unwind is not possible with actix, period.

@sazzer
Copy link
Author

sazzer commented May 11, 2020

@robjtede It's not that I want exceptions or throws. I actually like the Result<> handling that Rust gives us (Despite coming from a predominantly overwhelming Java background!)

What I wasn't keen on was the need to do that for error conditions that are truly catastrophic, and that wouldn't need to have a Result<> return for any reason other than this catastrophic case.

For example, my case above with "lookup_username". Assuming that the database is present, the network is working and the schema is valid, this can not fail. Putting in multiple levels of Result<> to handle a case that should never happen just feels like busywork for no real benefit. That's why I've got it panicking in that case - if the database isn't there then there really is nothing that can be done within the code to deal with that. (As opposed to, for example, trying to persist a record with a duplicate username which really should return a Result<> since that's an error that is expected and can be handled)

However, based on the comment from @fafhrd91 that turned up as I was typing this, it's all a non-starter anyway so I'm going to shut up now :)

Cheers

@mamcx
Copy link

mamcx commented May 12, 2020

Exist another very simple reason to support this.

Is the expected behavior. Specially on development, where the browser IS the main interface.

In any other web framework I have used (in .net, python, delphi,...) is how things are done. To the point that when this happened to me, I was wonder why?? Is something I haven't done?.

Exist many quality-of-life stuff that come for "free" in things like https://flask.palletsprojects.com/ that in actix and other rust frameworks are "missing".

I'm totally 100% in agreement that this must be done with correctness in mind. This is what I have bet on rust for rewrite a very large .net web backend, and the payoff is very good. But also I think that provide this small touches (maybe as additions or as part of "extras" crate or something like) will make rust even more attractive to the crow of python/.net/ruby...

@fafhrd91
Copy link
Member

This is not matter of new feature. This is just not possible

@Lesiuk
Copy link
Contributor

Lesiuk commented May 12, 2020

@fafhrd91 Could you explain us why catch_unwind don't work for actix so we all can learn something? Just curious.

@fafhrd91
Copy link
Member

Did you check definition?

@onelson
Copy link

onelson commented May 12, 2020

I was about to write a thoughtful response to this, but #1501 (comment) pretty much sums up my point of view.

Signaling failure with Result is a win-win and is well supported. 90% of my error handling code comes in the form of an impl actix_web::ResponseError. At that point, my handler code uses ? to focus on the "happy path" and the rest is automatic.

Using things like catch_unwind in my mind is a last resort, probably most useful in stuff like test runners or what have you.

@robjtede
Copy link
Member

From std::panic::catch_unwind docs:

It is not recommended to use this function for a general try/catch mechanism.

From futures::future::FutureExt::catch_unwind docs:

It's not recommended to use this for error handling.

From the Rust book ch 9.3:

When code panics, there’s no way to recover.

From the edition guide:

Panics thus abort execution up to some "isolation boundary", with code on the other side of the boundary still able to run, and perhaps to "recover" from the panic in some very coarse-grained way.

Fundamentally, panicking is Rust's solution to prevent memory errors/corruption and generally prevent invalid states in situations where recovery is not possible. Such cases are exceeding rare in HTTP request handling, therefore Result should be used where possible.

The "isolation boundary" should be well defined and not let states become invalid. In the case of actix-web, since workers are naturally isolated by their arbiter/thread, it is sensible for that boundary to be the thread and nothing more granular.

@fafhrd91
Copy link
Member

It is simpler, Rc and Arc are not catch unwind

@mamcx
Copy link

mamcx commented May 12, 2020

Such cases are exceeding rare in HTTP request handling, therefore Result should be used where possible.

Rare? I wish. Loss database connections, the database schema change, the upstream server/apis break (my world, all the time), etc is not rare at ALL. It happens in production and is handled corrected by ANY decent web server or router or whatever in the stack.

As I say, this lack of feature is VERY surprising to me and I think anyone that come from any other web framework below the sun. Where else this is like this?

I get if this in rust is harder. So what else can be done? Is possible to wrap the code, fork it then handle the crash in the fork or something like that? Actix MUST be put behind another web server like nginx for production? If is the case, put the idea in the docs is important (not assume everyone put nginx or similar in front of their APIs).

If rust is mean to be better at handle problems, then we can't just say "let it crash" and not respond with the proper error 500 just because this. As I say, if a workaround must be implemented, is ok. But "blind" crash is not good.

@fafhrd91
Copy link
Member

Then you need to fix dB connector or api access code. Panic is not exception, you can have a lot of bad consequences after panic. For example leaked memory.

It is better to use “panic = abort” in your cargo.toml

@onelson
Copy link

onelson commented May 12, 2020

Yeah, exactly. For all intents and purposes, panics just shouldn't happen in production systems, least of all due to an unwrap() in your own code.

If you're in a situation where some code you're calling panics without giving you the opportunity to handle the error (as a Result), I'd look to report it as a bug in whatever that library is.

@mamcx
Copy link

mamcx commented May 12, 2020

@fafhrd91,@onelson

Sure, but when the problem happen, the user get a "connection reset", so the tech support blame it on the internet or something (totally believable on my case, my users roam the country) and not report it to me, instead of a "error 500: Internal server error" where now I sure the problem is in the server.

Again, this behavior is not what is expected. So look like I need to workaround to put things behind nginx (that look like a non issue for regular websites but are another complexity for deploy intranet things).

@fafhrd91
Copy link
Member

If you put rust code with panics into production then your have to use nginx, no options.

@mamcx
Copy link

mamcx commented May 12, 2020

@fafhrd91

That is fair, is something good to know. Thanks for the help.

@robjtede
Copy link
Member

where recovery is not possible

These are the exceedingly rare cases referred to in the previous sentence. Recoverable errors are obviously commonplace.

Loss database connections, the database schema change, the upstream server/apis break...

These are all recoverable.

@onelson
Copy link

onelson commented May 12, 2020

Sure, but when the problem happen, the user get a "connection reset"

This would not happen if your handler returned a Result where the Err is converted to a 500 status response via impl actix_web::ResponseError. They'd get a 500 response, and you'd have one less chance of leaking memory. You could even retry your database connection a couple times if you wanted.

@ssokolow
Copy link

ssokolow commented Dec 31, 2020

I still think it's better UX for something like this to return an Error 500 to the user if the invariant is broken:

        thumb_path.push(format!("{:x}.png", md5::compute(
            Url::from_file_path(path).expect("path should be absolute by this point").as_str())));

(It's MD5 because it's from an in-development implementation of the XDG Thumbnail Managing Standard for a miniserve-like tool for quickly sharing image sets with friends... a tool where the whole point is that you don't need to set up something like nginx.)

@phiresky
Copy link

phiresky commented Jul 5, 2023

I want to mention that the (at least with tokio::main), actix already seems to catch panics and continue on uninterrupted: (I guess this is what the OP is talking about with "threading continues uninterrupted")

actix seems barely affected by panics:

...
thread 'actix-server worker 0' panicked at 'called `Option::unwrap()` on a `None` value', /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/markdown-it-0.5.0/src/plugins/extra/smartquotes.rs:518:51
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
... (it just goes on here)

So far so good, but the issue with this is that if actix is behind nginx, the current behaviour of actix causes nginx to log these errors:

2023/07/05 11:46:30 [error] 2158443#2158443: *721527 upstream prematurely closed connection while reading response header from upstream, client: 47.149.14.66, server: lemmy.world, request: "GET /c/xxx HTTP/1.1", upstream: "http://0.0.0.0:8536/c/xxx", host: "lemmy.world"

which (by default) causes nginx to mark the upstream down for ten seconds, the same as it would if the upstream was actually down. This causes nginx to not serve requests to the server anymore even though it is still up. This causes a DOS because one panicking request handler effectively takes the server down for 10s . Sadly this behaviour isn't even cleanly configurable in nginx.

So I'm not sure about the exact implications, but it would be great if actix would not instantly close the connection if one of my dependencies does something dumb. I don't want a configurable error message, just a 500 response.

@robjtede
Copy link
Member

robjtede commented Jul 5, 2023

Reluctantly, I created a CatchPanic middleware some time ago. It is documented with big warnings about usage and a link to this thread to understand why it's not recommended for use without careful consideration.

If you're interested in reporting panics and not catching them then the alternative PanicReporter middleware is a solution.

As for the built-in behavior of Actix Web, our view about making this behavior default is extremely unlikely to change unless the project maintainership changes significantly.

@actix actix locked as too heated and limited conversation to collaborators Jul 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants