rocket-worker-thread panics with PoisonError after receiving Os Error { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" } once #2031
Comments
This can happen due to spawning too many async tasks and not completing them properly; it's hard to tell without seeing the implementation. In my case, I had used an async HTTP client library that would spawn a new async client on every request, when it should actually create the HTTP client once and reuse it.
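For illustration only, a minimal sketch of that "create once, reuse everywhere" pattern using Rocket's managed state; the `reqwest` client and the `/status` route are stand-ins I'm assuming for whatever async HTTP client library was actually involved:

```rust
use rocket::{get, launch, routes, State};

// Build the HTTP client once at startup and hand it to Rocket as managed
// state, instead of constructing a new client inside every request handler.
#[launch]
fn rocket() -> _ {
    let client = reqwest::Client::new(); // created exactly once
    rocket::build().manage(client).mount("/", routes![status_of])
}

// Every request borrows the shared client from managed state.
#[get("/status")]
async fn status_of(client: &State<reqwest::Client>) -> String {
    client
        .get("https://example.com")
        .send()
        .await
        .map(|resp| resp.status().to_string())
        .unwrap_or_else(|err| format!("request failed: {err}"))
}
```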
Well, then I can only guess that it might be related to

```rust
let pool = mysql_async::Pool::new(url);

let cors = rocket_cors::CorsOptions {
    allowed_origins: rocket_cors::AllowedOrigins::All,
    allowed_methods: vec![
        rocket::http::Method::Get,
        rocket::http::Method::Delete,
        rocket::http::Method::Post,
    ]
    .into_iter()
    .map(From::from)
    .collect(),
    allowed_headers: rocket_cors::AllowedHeaders::All,
    allow_credentials: true,
    ..Default::default()
}
.to_cors()?;

let mut config = Config::default();
config.address = IpAddr::V4(Ipv4Addr::new(127, 0, 0, 1));
config.port = 8080;

rocket::custom(config)
    .manage(pool)
    // Handle routing errors.
    .register("/", catchers![not_found])
    // This catches all **CORS** preflight requests and handles them.
    .mount("/", rocket_cors::catch_all_options_routes())
    .mount("/mypath", FileServer::from(relative!("mypath")))
    .manage(cors)
    // Some more routes are mounted here (GET, POST, UPDATE, DELETE, etc.)...
    .launch()
    .await?;
```

and then I reuse the pool whenever I need to get data from the database, like this:

```rust
#[get("/my/route")]
pub async fn my_route(pool: &State<Pool>) -> content::Json<String> {
    let mut conn = pool
        .get_conn()
        .await
        .expect("Failed to get a connection from the MySQL access pool.");
    let data = crate::database::data::get_some_data(&mut conn)
        .await
        .expect("Failed to load data from database.");
    let json = to_string(&data).expect("Failed to parse data as JSON.");
    content::Json(json)
}
```

Is there any reasonable method to debug these async clients? Oddly, there do not even seem to be any requests to the server when it crashes. Here is the access log from the nginx reverse proxy:
All of these requests should result in

nevertheless, the error log shows this:

P.S.: I've already tried running the service with

P.S. 2: For some reason I cannot really match the nginx access_log to the rocket log. The only request that I found to be matching was the one at
I totally disagree with that. Rocket shouldn't be responsible for regaining the health of your API. Rocket is an abstraction that lets you create APIs, and that's its job. If the logic of your app panics, you should use infrastructure such as k8s or cf to reschedule your app, alongside some service like Sentry catching your panic. Panics != Errors. Yes, if your business logic returns some Result with an error, I would agree; Rocket would need to react to these with the appropriate response, and it does. Panics, on the other hand, are meant to terminate your thread because something happened which can't be ignored (like the unwrap in your example). If Rocket were to just regain health, this would totally distort the picture your monitoring gives you. If you'd like to get some async debugging going, use the

BTW, have you tried to use the
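To make the "Panics != Errors" point concrete against the route shown earlier, here is a hedged sketch that returns a `Result` so database failures become a 500 response instead of an `expect` panic. It reuses the `Pool` and `content::Json` types from the snippet above (the latter was renamed `RawJson` in later Rocket releases) and the hypothetical `get_some_data` helper, so treat it as an illustration rather than a drop-in fix:

```rust
use mysql_async::Pool;
use rocket::http::Status;
use rocket::response::content;
use rocket::State;

#[rocket::get("/my/route")]
pub async fn my_route(pool: &State<Pool>) -> Result<content::Json<String>, Status> {
    // Each failure becomes a 500 response instead of panicking the worker
    // thread (and potentially poisoning a shared mutex).
    let mut conn = pool
        .get_conn()
        .await
        .map_err(|_| Status::InternalServerError)?;
    let data = crate::database::data::get_some_data(&mut conn)
        .await
        .map_err(|_| Status::InternalServerError)?;
    let json = serde_json::to_string(&data).map_err(|_| Status::InternalServerError)?;
    Ok(content::Json(json))
}
```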
The error logs appear to indicate that the issue resides within Tokio, since that is where the panics are being generated. As noted, there are two types of panics in the initial logs: the OS error and the Poison Errors. The Poison Errors should probably be panics, but they are a direct result of the OS error (a mutex is being held by the thread that panics with the OS error). The OS error should either be gracefully handled or require an immediate shutdown, and I'm not sure which. Looking into the Tokio code, it appears that the OS error is related to attempting to spawn a thread. My best guess is that the application has run into a limit on the number of threads it can spawn, which can cause this kind of OS error.
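As a side note on why the Poison Errors follow from the first panic, here is a small self-contained example (not related to Rocket's or Tokio's internals) showing how a panic while a `std::sync::Mutex` guard is held poisons the lock for every later caller:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(0_u32));

    // This thread panics while holding the lock, which poisons the mutex.
    let poisoner = Arc::clone(&shared);
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join(); // ignore the join error caused by the panic

    // Any subsequent lock() now returns Err(PoisonError { .. }), so an
    // unwrap() here would panic too -- the same pattern seen in the logs above.
    match shared.lock() {
        Ok(_) => println!("lock acquired"),
        Err(poisoned) => println!("mutex poisoned: {poisoned}"),
    }
}
```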
@FrankenApps I believe I have tracked down the issue. Specifically, you appear to be reaching a limit on the number of threads that a process is allowed to spawn. In theory, Tokio should prevent this since there is a configurable limit to the number of threads Tokio can spawn, but I believe the issue is that in your environment, the OS limit is lower than the Tokio limit. There are two ways to solve this: lower the Tokio limit, or raise the OS limit. The default Tokio limit is 512 blocking threads, plus the same number of worker threads as there are cores in your system. After a quick look at Rocket's configuration, it appears Rocket does not allow this value to be configured; I recommend taking a look at

For now, I would recommend increasing the OS limit to at least 530. Please let me know if the system or per-user limit is already higher than this, since that would indicate a deeper issue.
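For reference, this is roughly how the limits mentioned here map onto Tokio's runtime builder. A minimal sketch, assuming you construct the runtime yourself; Rocket normally builds its own runtime, so the numbers and names here are illustrative only:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // Worker threads default to the number of CPU cores; blocking threads
    // (used by spawn_blocking) default to 512 on top of that.
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)        // async worker threads
        .max_blocking_threads(64) // cap for spawn_blocking threads
        .thread_name("app-worker")
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // The application entry point would go here.
    });

    Ok(())
}
```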
@the10thWiz Thank you for the amazing work.

However, I still think you are correct, because I started the application using a

so I think the problem is that the number of

I will now try to increase this limit and report back afterwards.
@FrankenApps Awesome. I've also gone ahead and made two pull requests in relation to this issue. First, I have an open pull request to add a Rocket configuration value for the maximum number of blocking worker threads, in addition to the existing setting for the number of non-blocking worker threads. Second, I have opened a pull request in Tokio to add a better error message for the panic you ran into; I had to look at the actual source code and cross-reference the line number to identify that the panic occurred when attempting to spawn a thread.

The actual number should be higher than 512, since the non-blocking worker threads, in addition to the IO and timer driver threads, don't actually count towards the 512 limit. If this fixes the issue, I would also recommend looking into your dependencies to see what is spawning these blocking tasks, and trying to reduce them. From my understanding of Tokio, this seems to imply that there are at least 60 concurrent blocking tasks, since Tokio will reuse existing threads if they are available. After taking a look inside Rocket, it looks like the only places where
@the10thWiz There is only one place in my codebase where I spawn any thread, but not using

I have no prior experience with

```rust
#[post("/route/that/sends/email", format = "json", data = "<data>")]
pub async fn send_mail_route(
    pool: &State<Pool>,
    data: rocket::serde::json::Json<MyData>,
) -> std::io::Result<()> {
    send_email().await;
    Ok(())
}

async fn send_email() {
    // Generate a PDF and send it as an email attachment if needed.
    if send_the_mail {
        // This might take a while, therefore execute it in another thread.
        std::thread::spawn(move || {
            let pdf_result = generate_document();
            match pdf_result {
                Ok(pdf) => {
                    send_mail(pdf);
                }
                Err(err) => println!("Generating PDF failed {}.", err.to_string()),
            }
        });
    }
}
```

Because I initially found that the thread might panic and a mutex might be locked as described here, I made sure everything I do in this thread will not result in a panic. Is this something that should not be done with

```rust
async fn send_email() {
    // Generate a PDF and send it as an email attachment if needed.
    if send_the_mail {
        // This might take a while, therefore execute it in another thread.
        tokio::task::spawn_blocking(move || {
            let pdf_result = generate_document();
            match pdf_result {
                Ok(pdf) => {
                    send_mail(pdf);
                }
                Err(err) => println!("Generating PDF failed {}.", err.to_string()),
            }
        })
        .await;
    }
}
```

This feels unnatural to me, because I want the HTTP response to be a

I do not ever join the thread; instead, my plan is to let it do its thing (e.g. generate a PDF and send it via email), and when it's done, it should (as far as I know) just cease to exist. A quick local test sending multiple emails on my macOS machine resulted in a constant 9 threads after the emails had been sent.

P.S. I set the limit to

Thanks for all the help.
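For anyone wanting to repeat that kind of thread-count check on a Linux server rather than macOS, here is a small self-contained sketch (not from this issue) that reads the process's own thread count from `/proc/self/status`:

```rust
use std::fs;

/// Returns the number of OS threads in the current process by parsing the
/// "Threads:" line of /proc/self/status (Linux only).
fn current_thread_count() -> Option<usize> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("Threads:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|count| count.parse().ok())
}

fn main() {
    match current_thread_count() {
        Some(count) => println!("process currently has {count} threads"),
        None => println!("thread count unavailable (non-Linux or parse error)"),
    }
}
```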
Looking at your code, I don't think what you did is necessarily wrong. However, I would recommend using the

That being said, this could be part of the issue. These threads would count towards the limit set by the OS (or systemd), but not towards the limit set by Tokio. I believe that

P.S. 570 should be good, especially if you don't use the std threads.
I see. Yes, I found out afterwards that I do not need to await the spawn_blocking and switched to using it. Now that I know that the problem is probably related to having too many threads, I will monitor the application more closely in that regard and try to see where exactly the problem might be.
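A minimal sketch of that fire-and-forget variant, reusing the hypothetical `generate_document` and `send_mail` helpers from the snippets above (stubbed out here so the example compiles): dropping the `JoinHandle` returned by `spawn_blocking` detaches the task, so the handler can respond immediately while the work finishes on Tokio's blocking pool.

```rust
// Stubs standing in for the helpers from the snippets above.
fn generate_document() -> Result<Vec<u8>, String> { Ok(Vec::new()) }
fn send_mail(_pdf: Vec<u8>) {}

async fn send_email(send_the_mail: bool) {
    if send_the_mail {
        // Dropping the JoinHandle detaches the task: the closure still runs to
        // completion on Tokio's blocking pool, but nothing waits for it here.
        let _detached = tokio::task::spawn_blocking(move || match generate_document() {
            Ok(pdf) => send_mail(pdf),
            Err(err) => println!("Generating PDF failed: {err}"),
        });
        // No `.await` -- the HTTP handler can return right away.
    }
}
```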
@FrankenApps You should queue your PDF and mail jobs (this can be done with files, in the DB, or with additional crates) and have one or a limited number of workers do the work on this queue. With uncontrolled spawning you might be vulnerable to a DoS by simply requesting the right URL.
@kolbma Yes, you are right. However, in my case this is not a concern, because PDFs and emails are only generated after successful authentication.
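To illustrate the queue-and-worker idea from the previous comment, here is a minimal sketch using a bounded `tokio::sync::mpsc` channel drained by a single worker task; the `EmailJob` type and the capacity of 32 are assumptions for the example, not anything from this issue:

```rust
use tokio::sync::mpsc;

// Hypothetical job description; in practice this would carry whatever the
// PDF/email generation needs.
struct EmailJob {
    recipient: String,
}

#[tokio::main]
async fn main() {
    // Bounded channel: once 32 jobs are queued, senders have to wait, which
    // naturally limits how much work a burst of requests can pile up.
    let (tx, mut rx) = mpsc::channel::<EmailJob>(32);

    // Single worker task draining the queue; heavy work goes to the blocking pool.
    let worker = tokio::spawn(async move {
        while let Some(job) = rx.recv().await {
            let _ = tokio::task::spawn_blocking(move || {
                // Generate the PDF and send the email for `job` here.
                println!("processing job for {}", job.recipient);
            })
            .await;
        }
    });

    // A request handler would enqueue a job instead of spawning a thread per request:
    tx.send(EmailJob { recipient: "user@example.com".into() })
        .await
        .expect("worker has shut down");

    drop(tx); // closing all senders lets the worker loop end
    worker.await.expect("worker panicked");
}
```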
Thx to @the10thWiz rwf2#2064

This is mostly a topic of bad app architecture, when worker threads run for too long or block and the workers are exhausted as a result. It results in a panic (rwf2#2031):

```
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
```

But on limited OS configurations this needs to be adapted to the OS capabilities. The default of the Tokio runtime's max_blocking_threads is 512, and it is modified by this config.
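If the configuration value referenced here corresponds to the `max_blocking` setting available in current Rocket releases, it can be set programmatically as sketched below; the field name is an assumption based on newer Rocket documentation, not something confirmed in this thread:

```rust
use rocket::{launch, Config};

#[launch]
fn rocket() -> _ {
    // `max_blocking` is assumed here to be the config field controlling Tokio's
    // blocking-thread limit; verify the name against your Rocket version's docs.
    let config = Config {
        workers: 4,       // async worker threads
        max_blocking: 64, // cap on blocking threads used by spawn_blocking
        ..Config::default()
    };
    rocket::custom(config)
}
```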
I just wanted to remark on the excellent triage and feedback from @the10thWiz, @kolbma, and @FrankenApps in this thread (as well as the earlier comments by @beemstream and @somehowchris!). Truly spectacular.
Description
I have a simple Rocket application running on a server; it receives around 150 requests per day. After a week or two the service stops working until I restart it.

This is the output from stderr:

I restarted the server after the first panic and it then ran fine for about two weeks before crashing again, which I have now observed multiple times (two cycles are shown in the log).

I am sorry that I do not have more information (I am working to collect more, though), but I thought the problem was significant enough to post as is.
Expected Behavior
I expect Rocket to safely recover from the panic and not to get tangled up because of poisoned threads.
Environment:
Additional Context
As I currently see this issue, the problem is basically that the socket would block (WouldBlock), and while this certainly is not a fatal error, it seems to me as if it is not handled correctly.

I guess it is worth noting that in the application referred to above I use some other dependencies; however, the only two dependencies that I think might be the culprit here are `mysql_async` and `rocket_cors`, but I feel like it is most likely an issue with `rocket`.