
Memory leak on high number of concurrent connections #1790

Closed
prasannavl opened this issue Apr 5, 2019 · 57 comments
Labels
A-server Area: server. C-performance Category: performance. This is making existing behavior go faster.

Comments

@prasannavl

prasannavl commented Apr 5, 2019

Hyper version: 0.12.25

Context

Server

  • Run any hyper or hyper-derived server (warp, tide, etc.) that just serves a simple hello-text response.
use warp::Filter;

fn main() -> Result<(), exitfailure::ExitFailure> {
    let hello = warp::any().map(|| "Hello there!");
    println!("starting");
    let addr: std::net::SocketAddr = "127.0.0.1:9000".parse()?;
    warp::serve(hello)
        .run(addr);
    Ok(())
}

(For brevity, I've used the warp app, but a basic hyper sample works just as well. Here's a repo of simple hello-world servers built on several hyper-based frameworks (tide, warp, raw hyper, etc.) that I used to test: https://github.com/quixon-labs/playground-rs)

Boom! (Stress tests)

  • Start bombarding the server with multiple concurrent connections, increasing them gradually.
for concurrency in 100 500 1000 2000 5000 10000 15000 20000 30000; do 
   h2load -D10 -c "$concurrency" http://localhost:9000/hello
   sleep 5s
done

Memory observation

  • Observe the memory rise steadily, but at sporadic intervals, hitting anywhere between 130MB and 300MB. (In my tests, the release builds had significantly higher usage, since they can handle a higher load, so fewer connections end up being dropped as a result.)
  • Usage keeps increasing steadily as the concurrency increases, and sporadically otherwise. The only deterministic behavior I'm able to observe is that it stays flat when concurrency is not increased.
  • But it never comes back down once it goes up, as one might expect, even when there is no load and the system has had enough time to catch up: closing all TCP connections, settling I/O, kernel swapping, and so on.

Screenshot from 2019-04-06 01-36-39

The process doesn't go beyond 5 threads. Note that when tokio's blocking functions are used, the thread count can rise all the way to 100 by default (and those threads don't go back down yet), causing a significant memory bump on its own. So I used a simpler example to rule that out as the cause.

Heap profiles

  • Running this with valgrind (massif):

Here's the relevant bit:

(I'm not entirely sure what exactly is causing this. My first guess was some per-connection thread_local cache, but I couldn't really find any evidence to correlate that yet. I'm also not sure whether tokio or hyper is the cause, but it is consistently noticeable in all hyper-based servers.)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 62 83,195,458,515      213,409,040      212,742,984       666,056            0
99.69% (212,742,984B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->41.87% (89,363,456B) 0x958C5A: alloc::alloc::alloc (alloc.rs:75)
| ->41.76% (89,112,576B) 0x958A40: <alloc::alloc::Global as core::alloc::Alloc>::alloc (alloc.rs:151)
| | ->41.76% (89,112,576B) 0x95A490: alloc::raw_vec::RawVec<T,A>::allocate_in (raw_vec.rs:95)
| |   ->41.76% (89,112,576B) 0x95A284: alloc::raw_vec::RawVec<T>::with_capacity (raw_vec.rs:139)
| |     ->41.76% (89,112,576B) 0x957002: alloc::vec::Vec<T>::with_capacity (vec.rs:355)
| |       ->41.76% (89,112,576B) 0x955CF4: bytes::bytes::Inner::reserve (bytes.rs:2261)
| |       | ->41.76% (89,112,576B) 0x9542E6: bytes::bytes::BytesMut::reserve (bytes.rs:1444)
| |       |   ->41.76% (89,112,576B) 0x291A81: hyper::proto::h1::io::Buffered<T,B>::read_from_io (io.rs:175)
| |       |     ->41.76% (89,112,576B) 0x293D6C: hyper::proto::h1::io::Buffered<T,B>::parse (io.rs:160)
| |       |       ->41.76% (89,112,576B) 0x29B18F: hyper::proto::h1::conn::Conn<I,B,T>::read_head (conn.rs:137)
| |       |         ->41.76% (89,112,576B) 0x217ADA: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_read_head (dispatch.rs:200)
| |       |           ->41.76% (89,112,576B) 0x218D96: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_read (dispatch.rs:136)
| |       |             ->41.76% (89,112,576B) 0x214EEF: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_inner (dispatch.rs:100)
| |       |               ->41.76% (89,112,576B) 0x214A67: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch (dispatch.rs:86)
| |       |                 ->41.76% (89,112,576B) 0x21471E: <hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T> as futures::future::Future>::poll (dispatch.rs:348)
| |       |                   ->41.76% (89,112,576B) 0x249788: <futures::future::either::Either<A,B> as futures::future::Future>::poll (either.rs:35)
| |       |                     ->41.76% (89,112,576B) 0x28AE00: futures::future::option::<impl futures::future::Future for core::option::Option<F>>::poll (option.rs:12)
| |       |                       ->41.76% (89,112,576B) 0x24E556: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as futures::future::Future>::poll (conn.rs:934)
| |       |                         ->41.76% (89,112,576B) 0x241D49: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (conn.rs:874)
| |       |                           ->41.76% (89,112,576B) 0x91EBE3: <alloc::boxed::Box<F> as futures::future::Future>::poll (mod.rs:113)
| |       |                             ->41.76% (89,112,576B) 0x7025E2: futures::task_impl::Spawn<T>::poll_future_notify::{{closure}} (mod.rs:326)
| |       |                               ->41.76% (89,112,576B) 0x70285F: futures::task_impl::Spawn<T>::enter::{{closure}} (mod.rs:396)
| |       |                                 ->41.76% (89,112,576B) 0x72E3C5: futures::task_impl::std::set (mod.rs:78)
| |       |                                   ->41.76% (89,112,576B) 0x7027FD: futures::task_impl::Spawn<T>::enter (mod.rs:396)
| |       |                                     ->41.76% (89,112,576B) 0x702396: futures::task_impl::Spawn<T>::poll_fn_notify (mod.rs:288)
| |       |                                       ->41.76% (89,112,576B) 0x702586: futures::task_impl::Spawn<T>::poll_future_notify (mod.rs:326)
| |       |                                         ->41.76% (89,112,576B) 0x70C200: tokio_threadpool::task::Task::run::{{closure}} (mod.rs:145)
| |       |                                           ->41.76% (89,112,576B) 0x715FE7: core::ops::function::FnOnce::call_once (function.rs:231)
| |       |                                             ->41.76% (89,112,576B) 0x704497: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (panic.rs:309)
| |       |                                               ->41.76% (89,112,576B) 0x7020D1: std::panicking::try::do_call (panicking.rs:293)
| |       |                                                 
| |       ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
| |       
| ->00.12% (250,880B) in 1+ places, all below ms_print's threshold (01.00%)
| 
->41.76% (89,112,618B) 0x828E4A: alloc::alloc::alloc (alloc.rs:75)
| ->41.76% (89,112,618B) 0x828C30: <alloc::alloc::Global as core::alloc::Alloc>::alloc (alloc.rs:151)
|   ->41.76% (89,112,618B) 0x827A90: alloc::raw_vec::RawVec<T,A>::allocate_in (raw_vec.rs:95)
|     ->41.76% (89,112,618B) 0x8278D4: alloc::raw_vec::RawVec<T>::with_capacity (raw_vec.rs:139)
|       ->41.76% (89,112,618B) 0x828852: alloc::vec::Vec<T>::with_capacity (vec.rs:355)
|         ->41.76% (89,112,576B) 0x290E55: hyper::proto::h1::io::WriteBuf<B>::new (io.rs:413)
|         | ->41.76% (89,112,576B) 0x292974: hyper::proto::h1::io::Buffered<T,B>::new (io.rs:65)
|         |   ->41.76% (89,112,576B) 0x29927E: hyper::proto::h1::conn::Conn<I,B,T>::new (conn.rs:39)
|         |     ->41.76% (89,112,576B) 0x1DD2A7: hyper::server::conn::Http<E>::serve_connection (conn.rs:374)
|         |       ->41.76% (89,112,576B) 0x1DFB9A: <hyper::server::conn::Connecting<I,F,E> as futures::future::Future>::poll (conn.rs:732)
|         |         ->41.76% (89,112,576B) 0x241ADF: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (conn.rs:864)
|         |           ->41.76% (89,112,576B) 0x91EBE3: <alloc::boxed::Box<F> as futures::future::Future>::poll (mod.rs:113)
|         |             ->41.76% (89,112,576B) 0x7025E2: futures::task_impl::Spawn<T>::poll_future_notify::{{closure}} (mod.rs:326)
|         |               ->41.76% (89,112,576B) 0x70285F: futures::task_impl::Spawn<T>::enter::{{closure}} (mod.rs:396)
|         |                 ->41.76% (89,112,576B) 0x72E3C5: futures::task_impl::std::set (mod.rs:78)
|         |                   ->41.76% (89,112,576B) 0x7027FD: futures::task_impl::Spawn<T>::enter (mod.rs:396)
|         |                     ->41.76% (89,112,576B) 0x702396: futures::task_impl::Spawn<T>::poll_fn_notify (mod.rs:288)
|         |                       ->41.76% (89,112,576B) 0x702586: futures::task_impl::Spawn<T>::poll_future_notify (mod.rs:326)
|         |                         ->41.76% (89,112,576B) 0x70C200: tokio_threadpool::task::Task::run::{{closure}} (mod.rs:145)
|         |                           ->41.76% (89,112,576B) 0x715FE7: core::ops::function::FnOnce::call_once (function.rs:231)
|         |                             ->41.76% (89,112,576B) 0x704497: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once (panic.rs:309)
|         |                               ->41.76% (89,112,576B) 0x7020D1: std::panicking::try::do_call (panicking.rs:293)
|         |                                 ->41.76% (89,112,576B) 0x979428: __rust_maybe_catch_panic (lib.rs:87)
|         |                                   ->41.76% (89,112,576B) 0x701E0E: std::panicking::try (panicking.rs:272)
|         |                                     ->41.76% (89,112,576B) 0x70457F: std::panic::catch_unwind (panic.rs:388)
|         |                                       ->41.76% (89,112,576B) 0x70BC84: tokio_threadpool::task::Task::run (mod.rs:130)
|         |                                         ->41.76% (89,112,576B) 0x722C0D: tokio_threadpool::worker::Worker::run_task2 (mod.rs:567)
|         |                                           ->41.76% (89,112,576B) 0x7224D2: tokio_threadpool::worker::Worker::run_task (mod.rs:459)
|         |                                             ->41.44% (88,432,640B) 0x721B51: tokio_threadpool::worker::Worker::try_run_owned_task (mod.rs:390)
|         |                                             | ->41.44% (88,432,640B) 0x72158B: tokio_threadpool::worker::Worker::try_run_task (mod.rs:297)
|         |                                             |   
|         |                                             ->00.32% (679,936B) in 1+ places, all below ms_print's threshold (01.00%)
|         |                                             
|         ->00.00% (42B) in 1+ places, all below ms_print's threshold (01.00%)

Here's the full massif profile: massif.out.zip

Note: The above massif profile is from a cargo build without release optimisations, which kept the memory levels much lower. Running this on a release build at least doubles the usage compared to the debug build.


Edit:

I'm attaching another massif profile (release build) that's probably more interesting:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 79  7,621,402,146      408,363,480      407,058,828     1,304,652            0
99.68% (407,058,828B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->41.75% (170,500,096B) 0x184EC8: hyper::proto::h1::io::Buffered<T,B>::new (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x186BAC: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|   ->41.75% (170,500,096B) 0x2D3CB0: futures::task_impl::std::set (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|     ->41.75% (170,500,096B) 0x2D48A4: futures::task_impl::Spawn<T>::poll_future_notify (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|       ->41.75% (170,500,096B) 0x2D78DF: _ZN3std9panicking3try7do_call17hf61819960f116aefE.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|         ->41.75% (170,500,096B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|           ->41.75% (170,500,096B) 0x2D7E7E: tokio_threadpool::task::Task::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|             ->41.75% (170,500,096B) 0x2D0463: tokio_threadpool::worker::Worker::run_task (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               ->39.95% (163,135,488B) 0x2CF95C: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               | ->39.95% (163,135,488B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |   ->39.95% (163,135,488B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |     ->39.95% (163,135,488B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |       ->39.95% (163,135,488B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |         ->39.95% (163,135,488B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |           ->39.95% (163,135,488B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |             ->39.95% (163,135,488B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |               ->39.95% (163,135,488B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |                 ->39.95% (163,135,488B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |                   ->39.95% (163,135,488B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|               |                     ->39.95% (163,135,488B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               |                       ->39.95% (163,135,488B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
|               |                         ->39.95% (163,135,488B) 0x487D162: start_thread (pthread_create.c:486)
|               |                           ->39.95% (163,135,488B) 0x49CADED: clone (clone.S:95)
|               |                             
|               ->01.80% (7,364,608B) 0x2CFB67: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                 ->01.80% (7,364,608B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                   ->01.80% (7,364,608B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                     ->01.80% (7,364,608B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                       ->01.80% (7,364,608B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         ->01.80% (7,364,608B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                           ->01.80% (7,364,608B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                             ->01.80% (7,364,608B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                               ->01.80% (7,364,608B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                 ->01.80% (7,364,608B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                   ->01.80% (7,364,608B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|                                     ->01.80% (7,364,608B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                       ->01.80% (7,364,608B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
|                                         ->01.80% (7,364,608B) 0x487D162: start_thread (pthread_create.c:486)
|                                           ->01.80% (7,364,608B) 0x49CADED: clone (clone.S:95)
|                                             
->41.75% (170,500,096B) 0x333E8A: bytes::bytes::Inner::reserve (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
| ->41.75% (170,500,096B) 0x184D6C: hyper::proto::h1::io::Buffered<T,B>::read_from_io (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|   ->41.75% (170,500,096B) 0x18585E: hyper::proto::h1::io::Buffered<T,B>::parse (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|     ->41.75% (170,500,096B) 0x166312: hyper::proto::h1::conn::Conn<I,B,T>::read_head (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|       ->41.75% (170,500,096B) 0x16C6F3: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|         ->41.75% (170,500,096B) 0x1A9B8F: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|           ->41.75% (170,500,096B) 0x1875B1: <hyper::server::conn::spawn_all::NewSvcTask<I,N,S,E,W> as futures::future::Future>::poll (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|             ->41.75% (170,500,096B) 0x2D3CB0: futures::task_impl::std::set (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|               ->41.75% (170,500,096B) 0x2D48A4: futures::task_impl::Spawn<T>::poll_future_notify (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                 ->41.75% (170,500,096B) 0x2D78DF: _ZN3std9panicking3try7do_call17hf61819960f116aefE.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                   ->41.75% (170,500,096B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|                     ->41.75% (170,500,096B) 0x2D7E7E: tokio_threadpool::task::Task::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                       ->41.75% (170,500,096B) 0x2D0463: tokio_threadpool::worker::Worker::run_task (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         ->39.95% (163,135,488B) 0x2CF95C: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         | ->39.95% (163,135,488B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |   ->39.95% (163,135,488B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |     ->39.95% (163,135,488B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |       ->39.95% (163,135,488B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |         ->39.95% (163,135,488B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |           ->39.95% (163,135,488B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |             ->39.95% (163,135,488B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |               ->39.95% (163,135,488B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |                 ->39.95% (163,135,488B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |                   ->39.95% (163,135,488B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|                         |                     ->39.95% (163,135,488B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                         |                       ->39.95% (163,135,488B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
|                         |                         ->39.95% (163,135,488B) 0x487D162: start_thread (pthread_create.c:486)
|                         |                           ->39.95% (163,135,488B) 0x49CADED: clone (clone.S:95)
|                         |                             
|                         ->01.80% (7,364,608B) 0x2CFB67: tokio_threadpool::worker::Worker::run (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                           ->01.80% (7,364,608B) 0x2B3705: tokio_trace_core::dispatcher::with_default (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                             ->01.80% (7,364,608B) 0x2B48F4: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                               ->01.80% (7,364,608B) 0x2B4ABC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                 ->01.80% (7,364,608B) 0x2B4C42: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                   ->01.80% (7,364,608B) 0x2B8015: tokio::runtime::threadpool::builder::Builder::build::{{closure}} (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                     ->01.80% (7,364,608B) 0x2D8921: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                       ->01.80% (7,364,608B) 0x2D86FC: std::thread::local::LocalKey<T>::with (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                         ->01.80% (7,364,608B) 0x2DA56F: std::sys_common::backtrace::__rust_begin_short_backtrace (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                           ->01.80% (7,364,608B) 0x2D789A: _ZN3std9panicking3try7do_call17h20721f9dd88a1855E.llvm.11685027487046046852 (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                             ->01.80% (7,364,608B) 0x347D38: __rust_maybe_catch_panic (lib.rs:87)
|                                               ->01.80% (7,364,608B) 0x2D259E: <F as alloc::boxed::FnBox<A>>::call_box (in /home/pvl/workspace/repo/prasannavl/local/rusty/target/release/rusty)
|                                                 ->01.80% (7,364,608B) 0x34715C: std::sys::unix::thread::Thread::new::thread_start (boxed.rs:759)
|                                                   ->01.80% (7,364,608B) 0x487D162: start_thread (pthread_create.c:486)
|                                                     ->01.80% (7,364,608B) 0x49CADED: clone (clone.S:95)

Full massif profile [release build]:
massif.release.out.zip

@seanmonstar
Member

Thanks for the detailed writeup! I have some follow-up questions:

  • I see the tool's name is h2load; is it testing HTTP/2? The stacks look like HTTP/1, so I wanted to clarify.
  • Did you notice this recently, in newer versions? It'd be useful to know whether this occurred in older versions of hyper or tokio.

@seanmonstar seanmonstar added A-server Area: server. C-performance Category: performance. This is making existing behavior go faster. labels Apr 5, 2019
@prasannavl
Author

prasannavl commented Apr 6, 2019

I see the tool's name is h2load, is it testing HTTP2? The stacks look like HTTP1, so I wanted to clarify.

Oh, this specific one is just HTTP/1 (since I'm just spinning up a basic server without TLS, it defaults to HTTP/1). But it happens with HTTP/2 as well; I remember first noticing the problem running warp with TLS. However, let me verify this again soon on HTTP/2.

Did you notice this recently in newer versions? It'd be useful to know if this occurred in older versions of hyper or tokio.

Unfortunately, this is the only version I've tested so far. I'll update in case I happen to be able to run some more tests soon.

@prasannavl
Author

Okay, with http2:

Starting from minimal concurrency to just 5k: Memory from 60MB to 600MB

Screenshot from 2019-04-06 14-27-37
Screenshot from 2019-04-06 14-27-55
Screenshot from 2019-04-06 14-28-43
Screenshot from 2019-04-06 14-29-07
Screenshot from 2019-04-06 14-29-37

@prasannavl
Author

prasannavl commented Apr 6, 2019

Alright, the above uses tokio's blocking fn, which spins up 100 threads. So here it is again, serving just a simple hello text with only 5 threads.

It goes from 5.2MB to 250MB and never comes back down, for just 5k concurrency.

Screenshot from 2019-04-06 14-32-42
Screenshot from 2019-04-06 14-32-55
Screenshot from 2019-04-06 14-33-44
Screenshot from 2019-04-06 14-34-12
Screenshot from 2019-04-06 14-34-36

@seanmonstar
Member

Thanks for the extra info! I've looked around, and so far nothing jumps out at me. It seems everything is dropped correctly when the connection is closed, which makes me suspect it's just the memory allocator holding onto the blocks instead of eagerly releasing them back to the OS. I'll try some other log configs to see what else I can learn.

@lnicola
Contributor

lnicola commented Apr 8, 2019

@seanmonstar I think those Massif profiles indicate that the memory is held by some Buffered instances (assuming the snapshots were taken after everything got dropped).

@prasannavl
Author

prasannavl commented Apr 8, 2019

Which makes me suspicious if it's just the memory allocator holding onto the blocks instead of eagerly releasing back to the OS

One quick way to rule this out would be to test this with a raw tokio TCP socket. However..

I think those Massif profiles indicate that the memory is held by some Buffered instances (assuming they're taken after everything got dropped).

I think @lnicola is right. While it's a reasonably accurate indication that the Buffered instances are what's being left over, I'm just not sure when, or for what reason, they are being left behind.
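To build that allocator baseline without hyper or tokio in the allocation path, a plain std::net server is enough: if its memory also plateaus high after the same h2load run, the allocator is implicated rather than hyper. A minimal sketch (the 8 KiB buffer size and names are illustrative assumptions, not hyper's actual values):

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// Reads (and discards) whatever the client sent, then writes a fixed
// HTTP/1.1 response. The per-connection Vec loosely mirrors the kind
// of read buffer hyper's Buffered keeps per connection.
fn handle_connection(mut stream: TcpStream) {
    let mut buf = vec![0u8; 8192];
    let _ = stream.read(&mut buf);
    let _ = stream.write_all(
        b"HTTP/1.1 200 OK\r\nContent-Length: 12\r\n\r\nHello there!",
    );
}

fn main() -> std::io::Result<()> {
    // Port 9000 matches the warp example above; change as needed.
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    for stream in listener.incoming() {
        let stream = stream?;
        thread::spawn(move || handle_connection(stream));
    }
    Ok(())
}
```

A thread per connection won't scale like tokio, but for the purpose of watching whether RSS returns to baseline after load, it keeps the comparison to just the allocator.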

@seanmonstar
Member

Apologies, I was unfamiliar with Massif; I thought it was just showing where memory had been allocated. If it's saying that memory isn't being de-allocated, then it seems the task owning that memory is just hanging around in the executor. This could be due either to a flaw in the executor (less likely), or to a specific case causing hyper's connection task to stop being polled even though the connection closed (more likely).

I've been using this patch to monitor whether the tasks are dropped, and so far with hyper's hello-world example, the count always drops back to 0. There may be something where a connection hangs up while hyper is processing a request body, or maybe while sitting idle, but I've so far not found how to trigger it.
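The patch itself isn't reproduced here, but that kind of task-drop accounting can be sketched with an RAII guard around a global counter (names are hypothetical, not hyper's actual patch):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Global count of live connection tasks (hypothetical name; the real
// patch in hyper may look different).
static LIVE_TASKS: AtomicUsize = AtomicUsize::new(0);

// RAII guard: embed one in each spawned connection task. If LIVE_TASKS
// returns to 0 after load but memory does not, the tasks themselves
// are not what is retaining the memory.
struct TaskGuard;

impl TaskGuard {
    fn new() -> Self {
        LIVE_TASKS.fetch_add(1, Ordering::SeqCst);
        TaskGuard
    }
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        // Runs when the task future is dropped, even on panic/cancel.
        LIVE_TASKS.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    let g1 = TaskGuard::new();
    let g2 = TaskGuard::new();
    assert_eq!(LIVE_TASKS.load(Ordering::SeqCst), 2);
    drop(g1);
    drop(g2);
    // All "tasks" dropped: counter returns to 0.
    println!("live tasks: {}", LIVE_TASKS.load(Ordering::SeqCst));
}
```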

@lnicola
Contributor

lnicola commented Apr 8, 2019

Could it be related to half-closed connections? Or does that not apply to HTTP/2?

@seanmonstar
Member

It could; by default hyper allows half-closed connections from the client. If there's a response to write, though, that should either finish the connection and close it, or notice the EPIPE when streaming a big response.
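For anyone trying to trigger this, half-close is straightforward to reproduce with plain std TCP: the client sends FIN on its write side but keeps its read side open, and the server can still respond. A sketch of the mechanics (not hyper's handling):

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, TcpListener, TcpStream};
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Server side: read_to_end returns once the client half-closes
    // (FIN on its write side), after which we can still respond.
    let server = thread::spawn(move || -> std::io::Result<()> {
        let (mut stream, _) = listener.accept()?;
        let mut req = Vec::new();
        stream.read_to_end(&mut req)?;
        stream.write_all(b"response after half-close")
    });

    // Client side: write, then shut down only the write half.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"request")?;
    client.shutdown(Shutdown::Write)?; // half-close
    let mut resp = String::new();
    client.read_to_string(&mut resp)?; // the read half still works
    assert_eq!(resp, "response after half-close");
    server.join().unwrap()
}
```

Pointing a load generator that half-closes like this at a hyper server, versus one that hard-closes, might separate the two cases discussed above.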

@prasannavl
Author

prasannavl commented Apr 8, 2019

Apologies, I was unfamiliar with Massif; I thought it was just showing where memory had been allocated.

@seanmonstar I don't think you're completely off about how massif works. AFAIK, it primarily relies on collecting allocations over a time period. The reason I think it's a reasonably accurate indication is correlation: this pathway seems to be the only one that generates remotely enough allocations, and it matches up with the overall memory usage.

What I pasted into the issue is just one small part of the profile that I thought was relevant. If you look at the entire profile with either the massif-visualizer GUI or ms_print on the command line, that should give you a better idea.

I've been using this patch to monitor if the tasks are dropped, and so far with hyper's hello world example, it always drops back to 0. There may be something where a connection hangs up while hyper is processing a request body or maybe sitting in idle mode, but I've so far not found how to trigger it.

I just tried this. Same result: it drops back to 0, but memory usage increases as usual.

@prasannavl
Author

Just thinking aloud looking at this: could there be any pooled memory being reused that increases its capacity under load but never trims itself back down? More precisely, a Vec or Bytes that's being reused, growing its capacity without ever shrinking it?

@seanmonstar
Member

The buffer inside Buffered doesn't ever shrink itself; it just has a limit on its growth, and it only grows on a heuristic that watches for individual reads filling the whole buffer. But that buffer is owned by the task, and since the logs seem to suggest all the tasks are being dropped, it should be freed.

Since the memory reached a large size and was not freed, but did not grow further when the benchmark was run again against the same running server, I assumed that something was just reusing memory; and since the tasks are dropped and their memory is freed, I started to suspect the allocator. But if this only happens with hyper and not with a similar buffered TCP server in Rust, then it is likely a bug somewhere in hyper. I just don't yet know where.
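As a rough illustration of that growth heuristic (my own sketch under stated assumptions - `ReadBuf`, `INIT`, and `MAX` are invented names, not hyper's actual types): grow the target capacity only when a read fills the whole buffer, stop at a hard cap, and never shrink:

```rust
const INIT: usize = 8 * 1024;      // initial target capacity
const MAX: usize = 1024 * 1024;    // hard growth limit

struct ReadBuf {
    buf: Vec<u8>,
    cap: usize, // current target capacity
}

impl ReadBuf {
    fn new() -> Self {
        ReadBuf { buf: Vec::with_capacity(INIT), cap: INIT }
    }

    // Record how many bytes the last read returned. If the read filled
    // the whole buffer, assume the peer has more to send and double the
    // target capacity, up to the hard limit. The buffer never shrinks.
    fn record_read(&mut self, n: usize) {
        if n >= self.cap && self.cap < MAX {
            self.cap = (self.cap * 2).min(MAX);
            self.buf.reserve(self.cap);
        }
    }
}

fn main() {
    let mut rb = ReadBuf::new();
    rb.record_read(INIT); // read filled the buffer: grow
    assert_eq!(rb.cap, 2 * INIT);
    rb.record_read(10);   // small read: no growth, and no shrink either
    assert_eq!(rb.cap, 2 * INIT);
    println!("target capacity: {} bytes", rb.cap);
}
```

Under a heuristic like this, the buffer of a long-lived connection stays at its high-water mark for the connection's lifetime, which is why the per-task buffers only matter if the tasks themselves outlive the load.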

@ccgogo123
Copy link

@seanmonstar Any update on this issue?

@seanmonstar
Copy link
Member

@ccgogo123 latest status is all there. Are you experiencing this? Could be helpful to dig more.

If not, I actually was thinking of closing as there doesn't seem to be much more to do (I don't experience a leak myself).

@ccgogo123
Copy link

@seanmonstar What I'm experiencing is that the rusoto_s3 client doesn't release the memory under high concurrency. I happened to see this issue and suspect the root cause might be this.

@prasannavl
Copy link
Author

prasannavl commented Dec 20, 2019

@ccgogo123 - Would be helpful to see if you can eliminate tokio threadpool as the cause. Are you able to provide any insight?

@seanmonstar - Could you hold it off until Jan 2020? I should be able to get some free time mid Jan, and would like to take another stab at digging in, and also check on newer std future based versions.

@gauteh
Copy link

gauteh commented Jan 18, 2020

Hi, I am experiencing this as well in https://github.com/gauteh/dars. This is streaming netCDF files using the DAP protocol. In this case it keeps increasing with sequential requests as well, without ever going down. I am using hyper directly with tokio. I have not been able to isolate the leak in the netcdf reading library, so I suspect I am seeing this issue.

@peshwar9
Copy link

peshwar9 commented May 16, 2020

@seanmonstar Any further update on this issue? I'm still facing this.

I have tried this with warp 0.2 and tide 0.8.0, and under high concurrent load the memory keeps rising and is not released even hours after the load drops.

FYI - I used the following for loadtest:

wrk -t12 -c400 -d10s <url>

The tests are pretty basic. For example, this is the hello-world code for warp:

use warp::Filter;
  
#[tokio::main]
async fn main() {
    // GET /hello/warp => 200 OK with body "Hello, warp!"
    let hello = warp::path!("hello" / String)
        .map(|name| format!("Hello, {}!", name));

    warp::serve(hello)
        .run(([127, 0, 0, 1], 3030))
        .await;
}

Test run on MacOS Mojave.
Cargo version 1.43.0

@seanmonstar
Copy link
Member

As stated before, all updates are in the issue.

I have not been able to reproduce it myself. It's difficult to pin point that this is actually leaked memory. It can also be a case of the allocator not releasing back to the OS. Does the behavior exist if you use jemalloc instead of the system allocator?

I'm inclined to close this (really this time) unless someone would like to dig more into this themselves, as I can't trigger it.

@brianbruggeman
Copy link

I also have this problem. I'm concerned because if I put my private app into our production system (our peak is 1M+ simultaneous connections), I think I'm guaranteed to have a memory leak.

I've created a repo to maybe? help with reproducing the problem: https://github.com/brianbruggeman/warp-mem-leak-test

Getting started:

git clone https://github.com/brianbruggeman/warp-mem-leak-test.git warp-mem-leaktest
cd warp-mem-leaktest
./test-me.sh

Environment variables

  • HOST - the host name to use (default: 127.0.0.1)
  • PORT - the port name to use (default: 3030)
  • DURATION - the duration of the load test (default: 3 seconds)
  • WORKERS - the number of simultaneous workers (default: 100)

Highlights:

  • I can see this problem with a single load test (using hey) over 1 second with 1 worker on my machine.
  • I'm not using anything particularly special (just warp and full featured tokio).
  • I'm not running with any specific release or compiler settings
  • I'm using the bash command, ps, and the rss field to capture change in kilobytes
  • The first iteration always seems to be the largest
  • Subsequent iterations may report 0 KB changes, but I cannot know if that's due to precision error (rss is in KB) or if it's really 0
  • I'm not consistently seeing values; each invocation has a different set of values (though they are similar).

@jxs
Copy link
Contributor

jxs commented Jun 18, 2020

hi @brianbruggeman have you read #1790 (comment), tried with jemalloc instead of the system allocator?

@brianbruggeman
Copy link

brianbruggeman commented Jun 18, 2020

Edit:

@jxs

Assuming I correctly set the allocator, then I still see the same problems.

For my jemalloc test, I added jemallocator-global.

I also created a branch specifically for this test: https://github.com/brianbruggeman/warp-mem-leak-test/tree/jemalloc

@prasannavl
Copy link
Author

prasannavl commented Jun 19, 2020

Let me jump in here - I can confirm @brianbruggeman's inference on allocators. I had switched allocators in the past and don't recall it making any difference.

(Sorry for not following up before - ended up with other life priorities unfortunately)

@seanmonstar would you mind stating your platform, OS, mem etc? I'm curious to see if that can help dig in further since you don't seem to be running into this, interestingly (is this still the case?)

@whfuyn
Copy link

whfuyn commented Oct 21, 2021

It seems related to the allocator. After I switched to jemalloc, memory is reduced when the client side sends fewer requests. (This is triggered by sending new, but fewer, requests. If no new requests are sent, memory won't shrink.)

@oronsh
Copy link

oronsh commented Oct 21, 2021

@whfuyn I switched to jemalloc too, but I still see a small problem: if, for example, I open a lot of connections at the same time, the memory increases a bit... It could be 3MB, it could be 10MB or more.

@oronsh
Copy link

oronsh commented Oct 21, 2021

@oronsh My general impression is that the maintainers don't feel there is an issue here. In other words, we (as a community) still have not proven that there is a problem and we have not been able to provide a repeatable environment to show the issue.

And as a note, this issue is closed... not to say it couldn't be reopened, but I think we have to provide clear evidence that can be repeated, rather than screenshots.

Yes maybe it's kind of an edge case and maybe people are okay with restarting their server once in a while...
It's really up to us to prove it...

@oronsh
Copy link

oronsh commented Oct 21, 2021

I can reproduce similar behavior with the system allocator as well as mimalloc; but if we're going to fix this, we'll need to get a lot more detailed about the specific behavior that exacerbates the problem.

I'm curious if this can be reproduced in an application that uses Tokio but not Hyper. Hyper is going to spawn a task for every connection (basically). If we write a program that spawns 10s of thousands of tasks concurrently and then sits for some amount of time does the OS reclaim memory or does the process's RSS remain at its peak?

I also think it's more of a tokio issue than a hyper one, and it will be very interesting to try what you've suggested. I noticed that it only happens when I'm doing some expensive operation, like serializing a huge payload before sending it back to the client.
My case is that, on each request, I fetch a huge payload from another server (with reqwest).
If I don't do anything with the result and just send back "hello" to the client, for example, the memory stays stable.
If I try to deserialize the payload with serde_json, the memory stacks up little by little with every request.
Something like this:

pub async fn get_data_route(Extension(client): Extension<Arc<MyClient>>) -> impl IntoResponse {
    // fetching a huge payload
    let res = client
        .body(...)
        .send()
        .await
        .unwrap();

    // without this part the memory seems to be stable
    let data = res.json::<Foo>().await.unwrap();

    (StatusCode::OK, Json("hello world"))
}

or it could be an issue with hyper here: https://github.com/hyperium/hyper/blob/master/src/body/to_bytes.rs#L46
since it's called from the json method.

It could also be that there's a general problem with spawning tasks, I'm wondering how it works with tokio_uring

@olix0r
Copy link
Contributor

olix0r commented Oct 21, 2021

I'll just share some of the data we have, as it may be helpful in narrowing down a smaller repro. I have some load tests for linkerd-proxy. The proxy itself is pretty complicated and probably doesn't serve as the best repro -- we'll want something a lot smaller.

We have a load test instrumented like so:

[ http/1 load generator ] -> [ client proxy (http/1 hyper server) ] -> [ client proxy (h2 hyper client) ] ->
        -> [ server proxy (h2 hyper server) ] -> [ server proxy (http/1 hyper client) ] -> [ http/1 test server ]

When running the client with steadily increasing (and then resetting) load we see memory use grow steadily, basically in lock-step with the number of server connections. Memory is not released when load resets:

image

On the server-side, however, memory usage basically remains steady.

image

The server uses far fewer TCP connections, but it also creates far fewer Hyper servers.

I can instrument this test to run with pure TCP connections and no Hyper servers/clients. I'll report back when I have some data points there. At the moment, it appears as if our memory costs are correlated with the instantiation of per-connection Hyper servers, though... That might be something to focus on as we try to get a narrower reproduction.

@olix0r
Copy link
Contributor

olix0r commented Oct 21, 2021

I've changed the test scenario so that we use a single proxy:

[ http/1 load generator ] ->
        -> [ server proxy (http/1 hyper server) ] -> [ server proxy (http/1 hyper client) ] -> [ http/1 test server ]

image

And in this scenario:

[ http/1 load generator ] ->
        -> [ server proxy (TCP tokio server) ] -> [ server proxy (TCP tokio client) ] -> [ http/1 test server ]

image

We can see in both cases that the behavior is roughly similar: memory usage never decreases. The same load profile is used in both cases, so the proxy is transporting the same amount of raw bytes. Though, in the HTTP case we use dramatically more memory--presumably due to more allocation.

In this case, we're using mimalloc, though we've seen similar behavior with the system allocator (and, I believe, jemalloc as well).

Looking at the /proc filesystem confirms that all of this is heap allocated:

:; sudo cat /proc/$pid/smaps | grep heap -A11
56023f5e2000-56026092c000 rw-p 00000000 00:00 0                          [heap]
Size:             544040 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              541300 kB
Pss:              541300 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:    541300 kB
Referenced:       541300 kB
Anonymous:        541300 kB 

I'll try to get this running through heaptrack, though in the past it has shown wildly different results (seemingly underreporting used memory).
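For anyone who wants to reproduce this without heaptrack, here's a Linux-only, std-only sketch that samples VmRSS from /proc/self/status around an allocate/free cycle of many small blocks (small enough to stay below glibc's mmap threshold, so the freed pages are cached in the heap arena rather than returned to the kernel). This is my own diagnostic sketch, not taken from the proxy code:

```rust
use std::fs;

// Read VmRSS (resident set size, in kB) from /proc/self/status. Linux only.
fn rss_kb() -> Option<u64> {
    fs::read_to_string("/proc/self/status")
        .ok()?
        .lines()
        .find(|l| l.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    let before = rss_kb();
    // ~100 MiB as 100k small allocations; each stays below the mmap
    // threshold, so glibc serves them from the main heap arena instead
    // of individual mmap regions that would be unmapped on free.
    let chunks: Vec<Vec<u8>> = (0..100_000).map(|_| vec![1u8; 1024]).collect();
    let peak = rss_kb();
    drop(chunks);
    let after = rss_kb();
    // `after` often stays close to `peak`, not `before`: the freed
    // pages are cached by the allocator instead of returned to the OS.
    println!("before={:?} kB, peak={:?} kB, after-free={:?} kB", before, peak, after);
}
```

Comparing `after` with `peak` distinguishes "memory the program still references" from "freed memory the allocator is holding", which is exactly the question heaptrack and /proc are answering differently here.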

@olix0r
Copy link
Contributor

olix0r commented Oct 21, 2021

Here's where things get weird. Using the system allocator and using heaptrack we see the 'leaky behavior' as usual:

image

But heaptrack reports a very different picture:

image

Heaptrack claims that there are only 5.4MB of 'leaked' allocations remaining at the end of the run, while the procinfo crate (and manually inspecting the status in /proc) reports 535MB of resident memory.

I'm assuming this indicates that this memory is actually freed, but the operating system doesn't recognize that?

I suppose I'll test jemalloc for completeness...

@sfackler
Copy link
Contributor

High performance memory allocators are built to allocate memory quickly, not to keep RSS at the minimum possible value. They generally operate under the assumption that if you have allocated a bunch of memory recently, you will want to allocate a bunch of memory again in the future and so hold onto the pages for some period of time.

If you actually want to look for leaks you should be asking the allocator how many bytes it has actually allocated to the program. In jemalloc, this can be done with the jemalloc-ctl crate: https://docs.rs/jemalloc-ctl/0.3.3/jemalloc_ctl/stats/struct.allocated.html.

@sfackler
Copy link
Contributor

If it does appear that the application is explicitly holding onto allocated memory for too long, jemalloc also has profiling features you can use to see what sections of code are performing those allocations: https://github.com/jemalloc/jemalloc/wiki/Use-Case%3A-Leak-Checking

@oronsh
Copy link

oronsh commented Oct 22, 2021

When I tried jemalloc configured with JEMALLOC_SYS_WITH_MALLOC_CONF=background_thread:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0,abort_conf:true the memory released just fine but I can still see a little increase overtime with a lot of requests.

@hseeberger
Copy link

As stated before, all updates are in the issue.

I have not been able to reproduce it myself. It's difficult to pin point that this is actually leaked memory. It can also be a case of the allocator not releasing back to the OS. Does the behavior exist if you use jemalloc instead of the system allocator?

I'm inclined to close this (really this time) unless someone would like to dig more into this themselves, as I can't trigger it.

@seanmonstar I ran into this issue with tonic 0.7.2 using hyper 0.14.19 under tokio 1.19.2. The behavior was exactly the same with the system allocator and with jemalloc: the more load (requests/second) you put on the server, the more memory it allocates, and that memory is not freed. Decreased load is handled without further allocation, but once the load increases above the former level, additional memory gets allocated.

@JoePassanante
Copy link

Did anyone ever resolve this?

@mikemiles-dev
Copy link

Was there any mitigation for this?

@lsk569937453
Copy link

I encountered the same problem as @hseeberger. Switching to another allocator did not help either.

@prasannavl
Copy link
Author

prasannavl commented Jun 28, 2023

I think the leak is pretty well established by this point - it's just that no one (including myself) has had enough time to dig in and find the exact source, rather than the simpler outer workarounds.

I think @olix0r made an interesting point above - whether this can be reproduced with just tokio. I'm not sure anyone has tried that. It would also be interesting to see the behavior across different fixed-size worker thread pools; I think that could help narrow down where the problem lives.

@cocodee
Copy link

cocodee commented Aug 29, 2023

I don't think the issue should be closed without a solution.

@wonrax
Copy link

wonrax commented Dec 6, 2023

I was having the same issue and switched to the MiMalloc allocator. It releases memory instantly - from 2GB under load back to 50MB when idle, in contrast to the default allocator, which peaks at 3GB and sits at 1.7GB when idle. MiMalloc not only consumes less memory but also performs better in my case; I bench-tested both on my Linux machine and in a Docker container with resources limited to 1 CPU and 1GB. I think I'm going to stick with MiMalloc for now.

@prasannavl
Copy link
Author

prasannavl commented May 6, 2024

@wonrax this is interesting. Was there any specific config you had used for MiMalloc?

I recall seeing these comments earlier: #1790 (comment) by @olix0r, where mimalloc doesn't seem to have helped. And many more comments related to jemalloc.

Curious whether this was a specific config, or whether mimalloc itself has had tweaks that give it better behavior.

@wonrax
Copy link

wonrax commented May 7, 2024

I didn't set or tweak any config, the code only imported and set mimalloc as the global allocator

@lithbitren
Copy link

lithbitren commented Jun 15, 2024


  • Test Preparation:

I tested in a VMware virtual machine, with the environment being Ubuntu 24.04 desktop, allocated 2.6GHz, 8 cores, and 8GB of RAM.

Rust is the latest nightly, rustc 1.81.0-nightly (b5b13568f 2024-06-10), built in --release mode.

All terminals are set to the maximum file descriptor count with ulimit -n 65536.

The client accesses http directly in the same environment.

For the server side I wrote a most basic 'hello' with axum, and tested the three memory allocators mentioned in the previous posts: the default allocator, MiMalloc, and JeMalloc.

First, I benchmarked the default mode with oha; after adjusting the oha client's concurrency parameters, it reported that axum can reach 130,000+ QPS at most.

The plan was to use reqwest as a client to hit axum with bursts of 1000/10000 QPS completed within each second, using tokio's interval to fire once every second, continuously for more than a minute.

The code is the official example, all tokio async with no synchronous blocking code, and a report is printed once a second, which basically does not affect concurrency performance.


  • Test Results:

At startup, the MiMalloc build uses 400+K of memory, while the default allocator and JeMalloc builds use less than 300K.

When the client sustains 1000 QPS, after a few seconds the MiMalloc build sits at around 30+M of memory, and the others at around 50~60M.

When the client sustains 10000 QPS, after a few seconds the MiMalloc build sits at around 300+M of memory, and the others at around 500~600M.

When the client load test ends, the server keeps its high memory usage under all allocators, and does not release memory in the short term if no new requests arrive.

However, sending requests at a low frequency (intermittently, 1~100 QPS) gradually releases memory.

  • After sustained 1000 QPS and 10000 QPS loads followed by intermittent low-frequency requests, the approximate memory at each stage of a test run:

Status      0 QPS    1000 QPS    10000 QPS    10 QPS
Default:    300K  ->  60M     ->  600M     ->  560M
MiMalloc:   500K  ->  35M     ->  350M     ->    9M
JeMalloc:   300K  ->  56M     ->  560M     ->   30M

  • Speculation on the cause of memory issues:

First of all, it must be admitted that axum/hyper does have a memory allocation problem: after multiple rounds of requests, it does not return to a memory state close to the initial one.

Under the default allocator, hitting axum with reqwest at 1000 QPS for two hours (7,200,000 requests), the memory still sits at around 55~60M.

The failure to reclaim memory seems related to the number of concurrent requests, and has little to do with the total number of requests.

MiMalloc and JeMalloc both release memory better than the default allocator; although their final memory states still differ, both can be considered within a bearable range.


  • Some other issues encountered during the test:

Because the client and server run in the same environment, CPU overload and port congestion affect both at the same time.

Although reqwest used tokio's semaphore to control concurrency, if the instantaneous concurrency is too large the CPU maxes out and the client generates a large number of send errors; the specific cause is unknown, but exhaustion of ephemeral TCP ports is suspected.

In this environment, reqwest can send about 28,000 requests within a second without errors, but after a few seconds a large number of send errors appear, which clogs the system's TCP channels; even though the server is not down, it cannot serve requests.

Ultimately, I decided to test at a maximum of 10,000 QPS, which could be handled within each second, completing all requests in about 200~300 ms.

With the CPU not overloaded, there is basically no big difference in server-side throughput between the memory allocators.

But even at 10,000 QPS, after running for a while it is still possible for the CPU to overload and for requests to pile up.

Under sustained 10,000 QPS:

The default allocator starts to max out the CPU after about ten seconds and starts reporting send errors;

MiMalloc starts to max out the CPU after about thirty seconds and starts reporting send errors;

JeMalloc starts to max out the CPU after about eighty seconds and starts reporting send errors.

When the CPU is maxed out, the axum server becomes unreachable; after the client stops sending requests, the server returns to normal within a few seconds.


  • Conclusion:

Based on this test run, even though there are memory allocation problems, memory does not seem to increase indefinitely.

MiMalloc performed better at allocating and releasing memory during the test.

JeMalloc performed better in CPU occupancy while still releasing memory well.


@lithbitren
Copy link

...

I tested it in a virtual machine of VMware, with the environment being Ubuntu 24.04 desktop version, allocated 2.6GHz 8C8G.

Status---(0 QPS)-(1000 QPS)-(10000 QPS)--(10 QPS)

Default:  300K -> 60M -> 600M -> 560M

MiMalloc: 500K -> 35M -> 350M -> 9M

JeMalloc: 300K -> 56M -> 560M -> 30M

...

@prasannavl ,Modifying the memory allocator isn't complicated in itself, but I'm unsure about the feasibility of integrating it into the hyper framework.

cargo.toml:

[dependencies]
axum =  "0.7.5"
tokio = { version = "1.38.0", features = ["full"] }
mimalloc = "0.1.42"
jemallocator = "0.5.4"

src/main.rs:

...
use axum::Extension;
use axum::{routing::get, Router, extract::State};
use axum::Json;
use tokio;

// use mimalloc::MiMalloc;

// #[global_allocator]
// static GLOBAL: MiMalloc = MiMalloc;

// #[cfg(not(target_env = "msvc"))]
// use jemallocator::Jemalloc;

// #[cfg(not(target_env = "msvc"))]
// #[global_allocator]
// static GLOBAL: Jemalloc = Jemalloc;

#[tokio::main]
async fn main() {
    ...
}

@belovaf
Copy link

belovaf commented Jun 19, 2024

I have the same issue.
I created the most basic service and left it under load (3000 rps) for a few days.
The service was consuming more and more memory and wouldn't stop.
I tried mimalloc, but it didn't help.

@lixiang365
Copy link

Strictly speaking, this is not a bug or a memory leak; it is the system's memory management strategy. For memory allocated with malloc, after free the freed memory is merged into larger free blocks and is not necessarily returned to the operating system right away (see the free implementation in the source for details); malloc_trim returns the free blocks to the OS:

unsafe {
    libc::malloc_trim(0);
}

Alternatively, just switch allocators.
