
rocket-worker-thread panics with PoisonError after receiving Os Error { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" } once #2031

Closed
FrankenApps opened this issue Dec 21, 2021 · 14 comments

@FrankenApps

Description

I have a simple Rocket application running on a server that receives around 150 requests per day. After a week or two the service stops working until I restart it.

This is the output from stderr:

thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:261:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
...
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:261:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
...
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86

I restarted the server after the first panic and then it ran fine for about two weeks before crashing again, which I have now observed multiple times (two cycles are shown in the log).

I am sorry that I do not have more information (I am working on collecting more), but I thought the problem was significant enough to post it as is.

Expected Behavior

I expect Rocket to safely recover from the panic and not to get tangled up because of poisoned threads.

Environment:

  • OS Distribution and Kernel: Ubuntu 20.04
  • Rocket Version: 0.5.0-rc.1

Additional Context
As I currently see this issue, the problem is basically that the socket would block, and while this is certainly not a fatal error, it does not seem to be handled correctly.

It is probably worth noting that the application referred to above uses some other dependencies, namely:

base64 = "0.13.0"
genpdf = { version = "0.2.0", features = ["images"] }
image = "0.23.14"
imageproc = "0.22.0"
lettre = { version = "0.10.0-rc.3", default-features = false, features = ["sendmail-transport", "native-tls", "hostname", "r2d2", "builder"] }
mysql_async = "0.28.0"
rocket = { version = "0.5.0-rc.1", features = ["json", "tls"] }
rocket_cors = { git = "https://github.com/lawliet89/rocket_cors", rev = "ad80993bcb420437a2998404bde51da9a9c0b151" }

However, the only two other dependencies that I think might be the culprit here are mysql_async and rocket_cors, although I feel it is most likely an issue with Rocket.

@FrankenApps added the triage (a bug report being investigated) label on Dec 21, 2021
@beemstream

This can happen when too many async tasks are spawned and not completed properly; it's hard to tell without seeing the implementation. In my case, I had used an async HTTP client library that spawned a new async client on every request, when it actually should have created the HTTP client once and reused it.

@FrankenApps
Author

FrankenApps commented Jan 12, 2022

Well, then I could only guess that it might be related to mysql_async instead, but I am not sure. I set up the server like so:

let pool = mysql_async::Pool::new(url);

let cors = rocket_cors::CorsOptions {
        allowed_origins: rocket_cors::AllowedOrigins::All,
        allowed_methods: vec![
            rocket::http::Method::Get,
            rocket::http::Method::Delete,
            rocket::http::Method::Post,
        ]
        .into_iter()
        .map(From::from)
        .collect(),
        allowed_headers: rocket_cors::AllowedHeaders::All,
        allow_credentials: true,
        ..Default::default()
    }
    .to_cors()?;

let mut config = Config::default();
config.address = IpAddr::V4(Ipv4Addr::new(127, 0, 0, 1));
config.port = 8080;

rocket::custom(config)
        .manage(pool)
        // Handle routing errors.
        .register("/", catchers![not_found])
        // This catches all **CORS** preflight requests and handles them.
        .mount("/", rocket_cors::catch_all_options_routes())
        .mount(
            "/mypath",
            FileServer::from(relative!("mypath")),
        )
        .manage(cors)
        // Some more routes are mounted here (GET, POST, UPDATE, DELETE, etc.)...
        .launch()
        .await?;

and then I reuse the pool whenever I need to get data from the database, like this:

#[get("/my/route")]
pub async fn my_route(pool: &State<Pool>) -> content::Json<String> {
    let mut conn = pool
        .get_conn()
        .await
        .expect("Failed to get a connection from the MySQL access pool.");
    let data = crate::database::data::get_some_data(&mut conn)
        .await
        .expect("Failed to load data from database.");

    let json = to_string(&data).expect("Failed to parse data as JSON.");
    content::Json(json)
}

Is there any reasonable method to debug these async clients?

Oddly, there do not even seem to be any requests to the server when it crashes. Here is the access log from the nginx reverse proxy:

- - [08/Jan/2022:07:28:58 +0100] "GET / HTTP/1.0" 200 612 "http://fonlove.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
- - [08/Jan/2022:07:29:33 +0100] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 166 "-" "-"
- - [08/Jan/2022:07:29:33 +0100] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 166 "-" "-"
- - [08/Jan/2022:07:31:51 +0100] "GET /2.php HTTP/1.1" 404 197 "http://fonlove.com/2.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
- - [08/Jan/2022:07:31:57 +0100] "GET /2.php HTTP/1.1" 404 197 "http://fonlove-berlin.com/2.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
- - [08/Jan/2022:07:55:59 +0100] "GET / HTTP/1.1" 200 396 "-" "python-requests/2.26.0"
- - [08/Jan/2022:07:58:45 +0100] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
- - [08/Jan/2022:08:01:39 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) Project-Resonance (http://project-resonance.com/) (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"
- - [08/Jan/2022:08:04:59 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
- - [08/Jan/2022:08:09:15 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
- - [08/Jan/2022:08:14:45 +0100] "GET / HTTP/1.1" 200 612 "-" "NetSystemsResearch studies the availability of various services across the internet. Our website is netsystemsresearch.com"
- - [08/Jan/2022:08:28:50 +0100] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
- - [08/Jan/2022:08:32:25 +0100] "GET / HTTP/1.1" 200 396 "-" "page-preview-tool Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
- - [08/Jan/2022:08:40:21 +0100] "POST /cache.php HTTP/1.1" 404 134 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
- - [08/Jan/2022:08:53:16 +0100] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
- - [08/Jan/2022:08:53:17 +0100] "POST / HTTP/1.1" 405 568 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
- - [08/Jan/2022:08:57:15 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
- - [08/Jan/2022:09:35:01 +0100] "GET /robots.txt HTTP/1.1" 404 134 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
- - [08/Jan/2022:09:35:02 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
- - [08/Jan/2022:10:04:24 +0100] "\x16\x03\x01\x00\xE3\x01\x00\x00\xDF\x03\x03\xCC\x022\x08\x00?\xFAKC#\x15>hgKd\xEC\x89\x800'5\x02\xED\xA5\xECj\x1F\xFB\xDDJ\x9A\x00\x00h\xCC\x14\xCC\x13\xC0/\xC0+\xC00\xC0,\xC0\x11\xC0\x07\xC0'\xC0#\xC0\x13\xC0\x09\xC0(\xC0$\xC0\x14\xC0" 400 166 "-" "-"
- - [08/Jan/2022:11:06:45 +0100] "\x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\xBE\xBF&\xE6\xB3\x85?\xBA\xB9>\xF4\x85\xEB1\x1AI\x06<\xFA\xCF\x86\xCC\xA9\xAA\xB0\xD6k|\x8E\x99\x81\xA2 bo\xD2\xA9\xD3\xEE\xD8[\xF3{E\x04\x85\xEA\x8F}#\xB2\xF7\x07" 400 166 "-" "-"
- - [08/Jan/2022:11:06:45 +0100] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4 240.111 Safari/537.36"
- - [08/Jan/2022:11:11:37 +0100] "GET /config/getuser?index=0 HTTP/1.1" 404 134 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0"
- - [08/Jan/2022:11:21:06 +0100] "\x01\x00\x00\x00 \xBF\x02\x00\x88\x13\x00\x00\x87\x00\x00\x00NIMABIJIAN\x04\x03\x00\x00{\x99Caig\x9C\x03\xC7eB\xC5\x09\xC1\x18a\x11\x1A\x91\x1F\x02\x09cof\x91\xC0\x80sJ5\xD2\x80\xE6\x9A~\xB9\xC7\x83^\x96\xEEN\x16\x96\x96&\xE6\x03\xEA\xBC\x81\x02=\xAC\x10\xFA?7\x03\xC3\xDF\xF7\xE4\x98`p\xE6\x8D\xC1\xA9\x8D\xC6\x06\xDB\xAF\x91\xE7\x82s\xF7\x14H\xD4\xE1W\x9A\x93C\x9E]\xA4\x01#\x03#\x03]\x03c]CC\x05C\x03+S\x03b\xF4\x00\x00/\x9E\x16E" 400 166 "-" "-"
- - [08/Jan/2022:11:21:16 +0100] "\x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\x1C\x85SB\x8F\xD5R\xB6RJ\xD1Q\xD0\x96D\xF2\xBBRU\xE1\xC1\xCD\xAD\xCB[\xAE&\xA0\xB8g\xB8\xD7 \x11\xC2\x0C\xF6\xB97mU5\xBE\xB1M\xDD7>\xFF\xC6\x9EU/\xA0#\xF0\xE6$\x92\x92i<~k \x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 166 "-" "-"
- - [08/Jan/2022:11:21:17 +0100] "\x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\xFAh\x16\x1Bpd\xC3\x9F\x85\xB39\xB4g\xF6\xDB}\xCEw\xC0\xC2\xBA\xB6K\xAD}\xF0\x81\xD7{\xC0\xFD\x11 \x8C3=\x181\xF2t\xE4\xEC[_|z\xA8\x8D\xD4\xE0P\xA4\xF8(\xC3\xD0\xE1\xE9d\x5C\xFA\xD1\xAC\x1Ct\x00&\xC0+\xC0/\xC0,\xC00\xCC\xA9\xCC\xA8\xC0\x09\xC0\x13\xC0" 400 166 "-" "-"
- - [08/Jan/2022:11:47:15 +0100] "HEAD / HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
- - [08/Jan/2022:11:53:41 +0100] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"
- - [08/Jan/2022:12:09:33 +0100] "GET http://azenv.net/ HTTP/1.1" 200 396 "-" "Go-http-client/1.1"
- - [08/Jan/2022:12:51:55 +0100] "HEAD /robots.txt HTTP/1.0" 404 0 "-" "-"

All of these requests should result in 404s. I have already rate-limited requests to the Rocket server using nginx:

limit_req_zone $binary_remote_addr zone=mylimit:10m rate=30r/s;
...
limit_req zone=mylimit burst=10 nodelay;

Nevertheless, the error log shows this:

[2022-01-08 08:35:50] thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:261:14
[2022-01-08 08:35:50] note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[2022-01-08 08:35:50] thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
...
[2022-01-08 08:36:00] thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
...
[2022-01-08 08:36:00] thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on an `Err` value: PoisonError { .. }', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86

P.S. I've already tried running the service with RUST_BACKTRACE=full but apparently I did something wrong.

P.S. 2: For some reason I cannot really match the nginx access_log to the Rocket log. The only request that I found to be matching was the one at 08:28:50, even though all of the requests shown below should have resulted in 404s.

[08/Jan/2022:07:15:50 +0100] "GET /_ignition/execute-solution HTTP/1.1" 404 197 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
[08/Jan/2022:07:29:33 +0100] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 166 "-" "-"
[08/Jan/2022:07:29:33 +0100] "\x03\x00\x00/*\xE0\x00\x00\x00\x00\x00Cookie: mstshash=Administr" 400 166 "-" "-"
[08/Jan/2022:07:31:51 +0100] "GET /2.php HTTP/1.1" 404 197 "http://fonlove.com/2.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
[08/Jan/2022:07:31:57 +0100] "GET /2.php HTTP/1.1" 404 197 "http://fonlove-berlin.com/2.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
[08/Jan/2022:08:28:50 +0100] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
[08/Jan/2022:08:40:21 +0100] "POST /cache.php HTTP/1.1" 404 134 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"
[08/Jan/2022:08:53:16 +0100] "GET /.env HTTP/1.1" 404 197 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
[08/Jan/2022:09:35:01 +0100] "GET /robots.txt HTTP/1.1" 404 134 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
[08/Jan/2022:10:04:24 +0100] "\x16\x03\x01\x00\xE3\x01\x00\x00\xDF\x03\x03\xCC\x022\x08\x00?\xFAKC#\x15>hgKd\xEC\x89\x800'5\x02\xED\xA5\xECj\x1F\xFB\xDDJ\x9A\x00\x00h\xCC\x14\xCC\x13\xC0/\xC0+\xC00\xC0,\xC0\x11\xC0\x07\xC0'\xC0#\xC0\x13\xC0\x09\xC0(\xC0$\xC0\x14\xC0" 400 166 "-" "-"
[08/Jan/2022:11:06:45 +0100] "\x16\x03\x01\x00\xEE\x01\x00\x00\xEA\x03\x03\xBE\xBF&\xE6\xB3\x85?\xBA\xB9>\xF4\x85\xEB1\x1AI\x06<\xFA\xCF\x86\xCC\xA9\xAA\xB0\xD6k|\x8E\x99\x81\xA2 bo\xD2\xA9\xD3\xEE\xD8[\xF3{E\x04\x85\xEA\x8F}#\xB2\xF7\x07" 400 166 "-" "-"

->

[2022-01-08 05:04:35] >> No matching routes for GET /level/15/exec/-/sh/run/CR.
[2022-01-08 05:04:35] >> Responding with registered (not_found) 404 catcher.
[2022-01-08 08:28:50] >> No matching routes for GET /.env.
[2022-01-08 08:28:50] >> Responding with registered (not_found) 404 catcher.
[2022-01-08 09:40:33] >> No matching routes for GET /.env.
[2022-01-08 09:40:33] >> Responding with registered (not_found) 404 catcher.
[2022-01-08 10:23:12] >> No matching routes for POST /vendor/phpunit/phpunit/src/Util/PHP/eval-stdin.php application/x-www-form-urlencoded.
[2022-01-08 10:23:12] >> Responding with registered (not_found) 404 catcher.
[2022-01-08 10:39:03] >> No matching routes for GET /ecp/Current/exporttool/microsoft.exchange.ediscovery.exporttool.application.
[2022-01-08 10:39:03] >> Responding with registered (not_found) 404 catcher.
[2022-01-08 10:47:30] >> No matching routes for GET /owa/auth/logon.aspx?url=https%3a%2f%2f1%2fecp%2f.
[2022-01-08 10:47:30] >> Responding with registered (not_found) 404 catcher.

@somehowchris

Expected Behavior
I expect Rocket to safely recover from the panic and not to get tangled up because of poisoned threads.

I totally disagree with that. Rocket shouldn't be responsible for restoring the health of your API. Rocket is an abstraction that lets you create APIs, and that is its job. If the logic of your app panics, you should use infrastructure such as k8s or cf to reschedule your app, alongside a service like Sentry to catch your panics.

Panics != errors. Yes, if your business logic returns some Result with an error, I would agree; Rocket needs to, and does, react to these with the appropriate response. Panics, on the other hand, are meant to terminate your thread because something happened which can't be ignored (like the unwrap in your example). If Rocket were to just regain health, it would completely break the picture that monitoring gives you.

If you'd like to get some async debugging going, use the tracing crate.
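
A minimal setup sketch for that suggestion, assuming the tracing and tracing-subscriber crates are added to Cargo.toml (the init_tracing name and the DEBUG level are illustrative, not from this thread):

```rust
use tracing::Level;

// Install a global subscriber that prints spans and events to stdout.
// Call this once, before building the Rocket instance.
fn init_tracing() {
    tracing_subscriber::fmt()
        .with_max_level(Level::DEBUG) // raise or lower verbosity as needed
        .init();
}
```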

BTW, have you tried using Rocket's [Json<T> type](https://rocket.rs/v0.5-rc/guide/requests/#json) instead of serializing manually?
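
For reference, a sketch of what the route shown earlier could look like with the Json<T> responder, assuming the data returned by get_some_data implements Serialize (MyData is a stand-in for the real type):

```rust
use mysql_async::Pool;
use rocket::serde::json::Json;
use rocket::{get, State};

// Sketch only: `MyData` and `get_some_data` stand in for the types used above.
#[get("/my/route")]
pub async fn my_route(pool: &State<Pool>) -> Json<Vec<MyData>> {
    let mut conn = pool
        .get_conn()
        .await
        .expect("Failed to get a connection from the MySQL access pool.");
    let data = crate::database::data::get_some_data(&mut conn)
        .await
        .expect("Failed to load data from database.");

    // Rocket serializes `data` and sets the JSON Content-Type itself,
    // so the manual `to_string` + `content::Json<String>` step goes away.
    Json(data)
}
```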

@the10thWiz
Collaborator

the10thWiz commented Jan 12, 2022

The error logs appear to indicate that the issue resides within tokio, since that is where the panics are being generated.

As noted, there are two types of panics in the initial logs: the OS error and the poison errors. The poison errors should probably be panics, but they are a direct result of the OS error (a mutex is being held by the thread that panics with the OS error). The OS error should either be handled gracefully or should require an immediate shutdown, and I'm not sure which.

Looking into the tokio code, it appears that the OS error is related to attempting to spawn a thread. My best guess is that the application has run into a limit on the number of threads it can spawn, which can cause pthread_create (the underlying C function) to block. Given that tokio puts pretty much everything into non-blocking mode, this results in the WouldBlock error seen above.

@the10thWiz
Collaborator

@FrankenApps I believe I have tracked down the issue. Specifically, you appear to be reaching a limit on the number of threads that a process is allowed to spawn. In theory, Tokio should prevent this since there is a configurable limit to the number of threads Tokio can spawn, but I believe the issue is that in your environment, the OS limit is lower than the Tokio limit.

There are two ways to solve this: lower the Tokio limit, or raise the OS limit (a configuration sketch for the former follows after this comment). The default Tokio limit is 512 blocking threads, plus the same number of worker threads as cores in your system. After a quick look into the Rocket configuration, it appears Rocket does not allow this value to be configured. I recommend taking a look at /proc/sys/kernel/threads-max and ulimit/getrlimit to see the system and per-user thread limits respectively.

For now, I would recommend increasing the OS limit to at least 530. Please let me know if the system or per-user limit is already higher than this, since that would indicate a deeper issue.
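
For illustration, a sketch of the "lower the Tokio limit" direction, assuming Rocket is driven from a hand-built runtime instead of the #[rocket::main] macro (the chosen thread counts are arbitrary, and whether Rocket's own workers setting is still honored this way is not verified here):

```rust
// Sketch: cap Tokio's blocking-thread pool below the OS/systemd task limit.
fn main() -> Result<(), rocket::Error> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)        // async worker threads
        .max_blocking_threads(32) // cap for spawn_blocking threads (Tokio default: 512)
        .enable_all()             // IO and timer drivers
        .build()
        .expect("failed to build the Tokio runtime")
        .block_on(async {
            rocket::build()
                // .manage(...) / .mount(...) as in the snippets above
                .launch()
                .await?;
            Ok(())
        })
}
```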

@FrankenApps
Author

FrankenApps commented Jan 12, 2022

@the10thWiz Thank you for the amazing work.

  • /proc/sys/kernel/threads-max is 3090191
  • ulimit gives unlimited and ulimit -Sa:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1545095
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 62987
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

However, I still think you are correct, because I started the application as a systemd service, and when I run systemctl status my_service I get this:

● my_service.service - Description of my service.
     Loaded: loaded (/etc/systemd/system/my_service.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-01-08 12:18:46 CET; 4 days ago
     Main PID: 237 (bash)
     Tasks: 9 (limit: 60)
     Memory: 39.9M

So I think the problem is that the number of tasks is currently limited to 60, when it should be more than 512?

I will now try to increase this limit (see the sketch below) and report back afterwards.
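
For reference, one way to raise that limit is a systemd drop-in; a sketch, assuming the unit is called my_service as above (the chosen value is illustrative):

```ini
# /etc/systemd/system/my_service.service.d/override.conf
# (create it with `systemctl edit my_service`, then run
#  `systemctl daemon-reload` and `systemctl restart my_service`)
[Service]
TasksMax=1024
```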

@the10thWiz
Collaborator

@FrankenApps Awesome. I've also gone ahead and made two pull requests related to this issue. First, I have an open pull request to add a Rocket configuration value for the maximum number of blocking worker threads, in addition to the existing setting for the number of non-blocking worker threads. Second, I have opened a pull request in Tokio to add a better error message for the panic you ran into. I had to look at the actual source code and cross-reference the line number to identify that the panic occurred when attempting to spawn a thread.

The actual limit you need should be higher than 512, since the non-blocking worker threads, as well as the IO and timer driver threads, don't count towards the 512 limit.

If this fixes the issue, I would also recommend looking into your dependencies to see what is spawning these blocking tasks, and trying to reduce them. From my understanding of Tokio, this seems to imply that there are at least 60 concurrent blocking tasks, since Tokio will re-use existing threads if they are available. After taking a look inside Rocket, it appears the only places where spawn_blocking is used are within the TempFile construct (and sync_db_pools). After taking a cursory glance at rocket_cors and mysql_async, neither of them uses spawn_blocking internally, so they are unlikely to be the culprit. A more likely culprit would be any of the sync dependencies like genpdf, since I assume you call them from within spawn_blocking.

@FrankenApps
Author

FrankenApps commented Jan 12, 2022

@the10thWiz There is only one place in my codebase where I spawn any thread, and it does not use tokio::task::spawn_blocking; instead, I simply used std::thread::spawn.

I have no prior experience with tokio, so it would not be a surprise to me if I did something wrong there:

#[post("/route/that/sends/email", format = "json", data = "<data>")]
pub async fn send_mail_route(
    pool: &State<Pool>,
    data: rocket::serde::json::Json<MyData>,
) -> std::io::Result<()> {
    send_email().await;
    Ok(())
}

async fn send_email() {
    // Generate a PDF and send it as an email attachment if needed.
    if send_the_mail {
        // This might take a while, therefore execute it in another thread.
        std::thread::spawn(move || {
            let pdf_result = generate_document();
            match pdf_result {
                Ok(pdf) => {
                    send_mail(pdf);
                }
                Err(err) => println!("Generating PDF failed {}.", err.to_string()),
            }
        });
    }
}

Because initially I found that the thread might panic and a mutex might be locked as described here, I made sure everything I do in this thread will not result in a panic.

Is this something that should not be done with Tokio, and should I instead change the implementation to:

async fn send_email() {
    // Generate a PDF and send it as an email attachment if needed.
    if send_the_mail {
        // This might take a while, therefore execute it in another thread.
        tokio::task::spawn_blocking(move || {
            let pdf_result = generate_document();
            match pdf_result {
                Ok(pdf) => {
                    send_mail(pdf);
                }
                Err(err) => println!("Generating PDF failed {}.", err.to_string()),
            }
        }).await;
    }
}

This feels unnatural to me, because I want the HTTP response to be a 200 first and only then try sending the email (e.g. if for some reason it was impossible to send the mail, I want to treat this as a silent error). Additionally, I do not depend on the result of the synchronous operations.

I never join the thread; instead, my plan is to let it do its thing (i.e. generate a PDF and send it by mail), and when it is done it should (as far as I know) simply cease to exist.

A quick local test sending multiple emails on my MacOS machine resulted in a constant 9 threads after the emails had been sent.

P.S. I set the limit to 570 for now, but it would not be a problem to go a bit higher. However, I would rather sort out what causes these blocking threads.

Thanks for all the help.

@the10thWiz
Collaborator

Looking at your code, I don't think what you did is necessarily wrong. However, I would recommend using the spawn_blocking version, for a few reasons. The first thing I would like to note is that even if you use spawn_blocking, you don't need to await the returned handle; spawning the task is enough for it to run to completion. This would give you the desired behavior: starting the task and immediately returning a 200 (a sketch of this variant follows at the end of this comment).

spawn_blocking has a few advantages over std::thread::spawn. First, Tokio uses a thread pool, so this will avoid creating a new thread if possible (which can be an expensive operation), as well as setting a limit on the number of threads that can be spawned. Second, although not applicable here, you can call async code from within spawn_blocking, since it is an async task. This would make the process for switching to some async sendmail utility easier, but I don't know if this is actually a concern. Finally, I think it's a good idea to avoid using the std lib's threads if you're already using Tokio, just to keep things consistent.

That being said, this could be part of the issue. These threads would count towards the limit set by the OS (or systemd), but not towards the limit set by Tokio. I believe that std::thread::spawn will either panic or block if it can't spawn a thread, but I'm not sure which, or how to identify what it's doing.

P.S., 570 should be good, especially if you don't use the std threads.
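
A minimal sketch of that fire-and-forget variant, reusing the placeholder names from the snippet above (send_the_mail, generate_document, and send_mail come from the earlier code, not from a real API):

```rust
async fn send_email() {
    // Generate a PDF and send it as an email attachment if needed.
    if send_the_mail {
        // Intentionally not awaited: the handler returns its 200 right away
        // while Tokio's blocking pool runs the job, and the thread count
        // stays within Tokio's configured limit.
        let _handle = tokio::task::spawn_blocking(move || {
            match generate_document() {
                Ok(pdf) => {
                    send_mail(pdf);
                }
                Err(err) => println!("Generating PDF failed: {}.", err),
            }
        });
    }
}
```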

@FrankenApps
Author

I see. Yes, I found out afterwards that I do not need to await the spawn_blocking call, and I have switched to using it.

Now that I know that the problem is probably related to having too many threads, I will monitor the application more closely in that regard and try to pin down where exactly the problem lies.

@kolbma
Contributor

kolbma commented Jan 16, 2022

@FrankenApps You should queue your PDF and mail jobs (this can be done with files, in the DB, or with additional crates) and have one worker, or a limited number of workers, process that queue (see the sketch below). With uncontrolled spawning you might be vulnerable to a DoS simply by someone requesting the right URL.
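
As an illustration of that suggestion, a sketch using a bounded tokio::sync::mpsc queue with a single worker task (MailJob, start_mail_worker, and the queue capacity are hypothetical names and values, not from this thread):

```rust
use tokio::sync::mpsc;

// Hypothetical job description: whatever the PDF/mail step needs.
pub struct MailJob {
    pub recipient: String,
}

// Spawn one long-lived worker and hand back the queue's sender; the sender
// can be put into Rocket's managed state (`.manage(tx)`) and used in handlers.
// Must be called from within the Tokio runtime (e.g. just before `launch()`).
pub fn start_mail_worker() -> mpsc::Sender<MailJob> {
    // Bounded channel: when it is full, `try_send` fails instead of piling
    // up unbounded work, which caps memory and thread usage.
    let (tx, mut rx) = mpsc::channel::<MailJob>(64);

    tokio::spawn(async move {
        while let Some(job) = rx.recv().await {
            // At most one blocking task at a time for the CPU-heavy step.
            let result = tokio::task::spawn_blocking(move || {
                // generate_document(...) and send_mail(...) would go here.
                println!("processing mail job for {}", job.recipient);
            })
            .await;

            if let Err(e) = result {
                println!("mail worker task failed: {}", e);
            }
        }
    });

    tx
}
```

A handler would then call something like `queue.try_send(MailJob { .. })`, which either enqueues the job or reports back-pressure immediately instead of spawning a new thread per request.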

@FrankenApps
Author

@kolbma Yes, you are right.

However, in my case this is not a concern, because PDFs and emails are only generated after successful authentication.

kolbma added a commit to kolbma/Rocket that referenced this issue Jan 22, 2022
Thx to @the10thWiz
rwf2#2064

Mostly a topic of bad app architecture, when worker threads run for too long
or block, and workers are exhausted as a result.
Results in a panic: rwf2#2031
```
thread 'rocket-worker-thread' panicked at 'called `Result::unwrap()` on 
an `Err` value: PoisonError { .. }', 
/root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.6.1/src/runtime/blocking/pool.rs:287:86
```
But on limited OS configurations this needs to be adapted to the OS capabilities.

The default of the Tokio runtime's max_blocking_threads is 512 and is modified by this config.
@FrankenApps
Author

FrankenApps commented Feb 8, 2022

After applying the changes mentioned above and restricting access to all routes except the known ones in my nginx reverse proxy, the service has now been running fine for 4 weeks, so I would mark this as resolved (apart from it not really being an issue with Rocket).

For future reference, here is a two-week profile showing the number of active threads:
[Screenshot (2022-02-08): plot of the active thread count over two weeks]
When the server is idle, the thread count reliably goes back to 5, and it never went beyond 120 threads (which, however, is higher than my previous limit of 60 set for the systemd service), so I am pretty confident that the above diagnosis is right and actually let me resolve my problem.

Thanks for the help.

@SergioBenitez
Member

I just wanted to remark on the excellent triage and feedback from @the10thWiz, @kolbma, and @FrankenApps in this thread (as well as the earlier comments by @beemstream and @somehowchris!). Truly spectacular.

@SergioBenitez removed the triage (a bug report being investigated) label on Apr 26, 2024