Client connection leak with more than one body chunk #1479

Closed
unixpickle opened this issue Apr 4, 2018 · 10 comments

unixpickle commented Apr 4, 2018

In a web application where the server provides more than ~8062 bytes of data to the client, the client's connection leaks. This can lead to all sorts of resource exhaustion errors. I have found that this correlates perfectly with Response::body() producing more than one Chunk.

The client/server programs below reproduce this issue. The server usually dies with the following error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Io(Error { repr: Os { code: 24, message: "Too many open files" } })', src/libcore/result.rs:916:5

This is likely related to commit ef40081 and issue #1397.

Update: this is using hyper version 0.11.24. Added Cargo.toml file below.

client.rs:

extern crate futures;
extern crate hyper;
extern crate tokio_core;

use futures::{Future, IntoFuture, Stream};
use hyper::{Chunk, Client, Error};
use tokio_core::reactor::Core;

fn main() {
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let client = Client::configure().keep_alive(true).build(&handle);
    loop {
        let res = client.get("http://127.0.0.1:8080/".parse().unwrap())
            .and_then(|resp| resp.body().fold((0usize, Vec::<u8>::new()), join_chunks));
        let (num_chunks, data) = core.run(res).unwrap();
        println!("got {} bytes ({} chunks)", data.len(), num_chunks);
    }
}

fn join_chunks(
    mut state: (usize, Vec<u8>),
    next: Chunk
) -> Box<Future<Item = (usize, Vec<u8>), Error = Error>> {
    state.1.extend(next.to_vec());
    Box::new(Ok((state.0 + 1, state.1)).into_future())
}

server.rs:

extern crate futures;
extern crate hyper;

use std::iter::repeat;

use futures::{Future, IntoFuture};
use hyper::{Error, Request, Response, StatusCode};
use hyper::header::ContentType;
use hyper::server::{Http, Service};

// If you set this to something like 100, it works!
// When I did a binary search, I found 8062 worked and 8063 didn't.
const PAYLOAD_SIZE: usize = 10000;

fn main() {
    Http::new()
        .bind(&"127.0.0.1:8080".parse().unwrap(), || Ok(MyService{}))
        .unwrap()
        .run()
        .unwrap();
}

struct MyService{}

impl Service for MyService {
    type Request = Request;
    type Response = Response;
    type Error = Error;
    type Future = Box<Future<Item = Self::Response, Error = Self::Error>>;

    fn call(&self, _: Request) -> Self::Future {
        Box::new(Ok(Response::new()
            .with_status(StatusCode::Ok)
            .with_header(ContentType("application/octet-stream".parse().unwrap()))
            .with_body(repeat(0u8).take(PAYLOAD_SIZE).collect::<Vec<u8>>())).into_future())
    }
}

Cargo.toml:

[package]
name = "reproduce"
version = "0.1.0"
authors = ["Alex Nichol <unixpickle@gmail.com>"]

[[bin]]
name = "reproduce-server"
path = "src/server.rs"

[[bin]]
name = "reproduce-client"
path = "src/client.rs"

[dependencies]
futures = "0.1"
hyper = "0.11"
tokio-core = "0.1"

unixpickle added a commit to unixpickle/squidtun that referenced this issue Apr 4, 2018:
"Without keepalives, we get too many sockets in TIME_WAIT. With keepalives, we are bumping into hyperium/hyper#1479"

@lnicola (Contributor) commented Apr 4, 2018

Can you try with 7fe9710? I ran into what looks like the same issue and that commit fixed it for me.
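
(For anyone who wants to test an unreleased commit: one way is to point the Cargo dependency at that git revision. A minimal sketch, assuming the standard hyperium/hyper repository URL:)

[dependencies]
hyper = { git = "https://github.com/hyperium/hyper", rev = "7fe9710" }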

@unixpickle (Author)

Tested with 7fe9710 and it worked!

@lnicola (Contributor) commented Apr 4, 2018

@unixpickle FYI, hyper 0.11.25 is now out.

@unixpickle (Author)

Very speedy fix!

@unixpickle (Author)

I'm still not entirely sure what the problem was, though.

7fe9710 supposedly works because it delays EOF until the connection has been added back to the pool. This seems to imply that I could work around the problem on 0.11.24 by sleeping after reading the body. However, this is not the case. Here's an updated client that still does not work on 0.11.24:

extern crate futures;
extern crate hyper;
extern crate simple_logger;
extern crate tokio_core;

use std::thread::sleep;
use std::time::Duration;

use futures::{Future, IntoFuture, Stream};
use hyper::{Chunk, Client, Error};
use tokio_core::reactor::Core;

fn main() {
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let client = Client::configure().keep_alive(true).build(&handle);
    loop {
        let res = client.get("http://127.0.0.1:8080/".parse().unwrap())
            .and_then(|resp| resp.body().fold((0usize, Vec::<u8>::new()), join_chunks));
        let (num_chunks, data) = core.run(res).unwrap();
        println!("got {} bytes ({} chunks)", data.len(), num_chunks);
        sleep(Duration::from_millis(100));
    }
}

fn join_chunks(
    mut state: (usize, Vec<u8>),
    next: Chunk
) -> Box<Future<Item = (usize, Vec<u8>), Error = Error>> {
    state.1.extend(next.to_vec());
    Box::new(Ok((state.0 + 1, state.1)).into_future())
}

There must be something else I'm missing. For example, I don't see why this drop is necessary. Maybe @seanmonstar can shine some light on this.

@seanmonstar (Member)

Sleeping the thread won't fix it: the connection task (future) that was spawned needs to be polled again, and that can't happen if the Core isn't running.

@unixpickle (Author)

Ah, that's very helpful. However, I'm still not sure I see the full picture.

Here's a modified version where the core runs forever, and the sleep is done using a future. The problem still happens on 0.11.24. In this case, the core is running forever, so @seanmonstar's previous explanation doesn't apply.

extern crate futures;
extern crate hyper;
extern crate simple_logger;
extern crate tokio_core;

use std::time::Duration;

use futures::{Future, IntoFuture, Stream};
use futures::stream::repeat;
use hyper::{Chunk, Client, Error};
use tokio_core::reactor::{Core, Timeout};

fn main() {
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let client = Client::configure().keep_alive(true).build(&handle);
    core.run(repeat::<(), Error>(()).for_each(move |_| {
        let local_handle = handle.clone();
        client.get("http://127.0.0.1:8080/".parse().unwrap())
            .and_then(|resp| resp.body().fold((0usize, Vec::<u8>::new()), join_chunks))
            .and_then(move |(num_chunks, data)| {
                println!("got {} bytes ({} chunks)", data.len(), num_chunks);
                Timeout::new(Duration::from_millis(100), &local_handle).unwrap()
                    .map_err(From::from).map(|_| ())
            })
    })).unwrap();
}

fn join_chunks(
    mut state: (usize, Vec<u8>),
    next: Chunk
) -> Box<Future<Item = (usize, Vec<u8>), Error = Error>> {
    state.1.extend(next.to_vec());
    Box::new(Ok((state.0 + 1, state.1)).into_future())
}

By the way, I have been testing for the problem by running netstat -an | grep 127 | wc -l.

@seanmonstar (Member)

I don't know exactly where the issue is, but it's fixed in 0.11.25, so I personally wouldn't expend much effort looking further XD

@unixpickle (Author)

I did some investigating, since I didn't think this should be left a mystery. I found that commit 7fe9710 fixed another bug in addition to the delayed-EOF bug. That bug was in this block of code from 0.11.24 (in src/client/mod.rs):

if let Ok(Async::NotReady) = pooled.tx.poll_ready() {
    // If the executor doesn't have room, oh well. Things will likely
    // be blowing up soon, but this specific task isn't required.
    let _ = executor.execute(future::poll_fn(move || {
        pooled.tx.poll_ready().map_err(|_| ())
    }));
}

The problem here is that pooled.tx.poll_ready() is called once from one task, and then repeatedly from a new task. In general, I'm guessing futures do not support being rescheduled on a different task. In particular, the Giver does not handle being moved to a new task; it continues to wake up the original task (which can end before the connection is put back into the pool).
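
To make the failure mode concrete, here is a rough, self-contained futures 0.1 sketch (not hyper's actual code; the names are made up) of a future that records the task from its first poll and never updates it. If such a future is later polled from a different task, the eventual notify() only reaches the stale original task, so the new task is never woken again:

extern crate futures;

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

use futures::{Async, Future, Poll};
use futures::task::{self, Task};

// Hypothetical shared state: a completion flag plus the task recorded by the
// *first* poll. Illustrative only; these names are not taken from hyper.
struct Shared {
    done: bool,
    first_task: Option<Task>,
}

struct StickyWait {
    shared: Arc<Mutex<Shared>>,
}

impl Future for StickyWait {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        let mut shared = self.shared.lock().unwrap();
        if shared.done {
            return Ok(Async::Ready(()));
        }
        if shared.first_task.is_none() {
            // Bug: only the task that polls first is ever recorded. If this
            // future is later moved to a newly spawned task (as hyper 0.11.24
            // did with its poll_fn), that task is never stored, so the
            // notify() below can never reach it and the spawned task leaks.
            shared.first_task = Some(task::current());
        }
        Ok(Async::NotReady)
    }
}

// The producer side finishes the work and wakes whichever task was recorded,
// i.e. the original one, even if it has long since completed.
fn complete(shared: &Mutex<Shared>) {
    let mut shared = shared.lock().unwrap();
    shared.done = true;
    if let Some(task) = shared.first_task.take() {
        task.notify();
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(Shared { done: false, first_task: None }));
    let shared2 = shared.clone();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        complete(&shared2);
    });
    // A single task polls the future for its whole lifetime (the analogue of
    // the one-chunk case), so this terminates; hand it to a second task and
    // that task would never be woken.
    StickyWait { shared: shared }.wait().unwrap();
    println!("completed");
}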

And, it hardly needs to be said: the bug doesn't happen for 1 chunk because the first time the executed poll_fn is called, the connection is ready to be given back to the pool.

So the delayed-EOF bug was only one of the issues that needed fixing. My original reproduction fell victim to both bugs, but the later one with Timeouts only hit the task rescheduling bug.

If anybody is interested, here is a minimal patch on top of 0.11.24 that fixes the task rescheduling bug but not the EOF bug:

diff --git a/src/client/conn.rs b/src/client/conn.rs
index f6a139b5..ed5d7147 100644
--- a/src/client/conn.rs
+++ b/src/client/conn.rs
@@ -126,6 +126,10 @@ impl<B> SendRequest<B>
         self.dispatch.poll_ready()
     }
 
+    pub(super) fn is_wanting(&self) -> bool {
+        self.dispatch.is_wanting()
+    }
+
     pub(super) fn is_closed(&self) -> bool {
         self.dispatch.is_closed()
     }
diff --git a/src/client/dispatch.rs b/src/client/dispatch.rs
index efc720d8..d7943d9b 100644
--- a/src/client/dispatch.rs
+++ b/src/client/dispatch.rs
@@ -45,6 +45,10 @@ impl<T, U> Sender<T, U> {
         }
     }
 
+    pub fn is_wanting(&self) -> bool {
+        self.giver.is_wanting()
+    }
+
     pub fn is_closed(&self) -> bool {
         self.giver.is_canceled()
     }
diff --git a/src/client/mod.rs b/src/client/mod.rs
index eb8c10ea..e81ef5dc 100644
--- a/src/client/mod.rs
+++ b/src/client/mod.rs
@@ -253,7 +253,7 @@ where C: Connect,
                     // for a new request to start.
                     //
                     // It won't be ready if there is a body to stream.
-                    if let Ok(Async::NotReady) = pooled.tx.poll_ready() {
+                    if !pooled.tx.is_wanting() {
                         // If the executor doesn't have room, oh well. Things will likely
                         // be blowing up soon, but this specific task isn't required.
                         let _ = executor.execute(future::poll_fn(move || {
@@ -661,4 +661,3 @@ mod background {
         }
     }
 }
-
diff --git a/src/client/signal.rs b/src/client/signal.rs
index 2ddf67f7..7fd6bb44 100644
--- a/src/client/signal.rs
+++ b/src/client/signal.rs
@@ -88,6 +88,10 @@ impl Giver {
         }
     }
 
+    pub fn is_wanting(&self) -> bool {
+        self.inner.state.load(Ordering::SeqCst) == STATE_WANT
+    }
+
     pub fn is_canceled(&self) -> bool {
         self.inner.state.load(Ordering::SeqCst) == STATE_CLOSED
     }

@seanmonstar (Member)

Ah yes, I had noticed this as well a little while ago, and forgotten that the fix included that. Generally, implementations of Future should behave correctly if moved to different tasks.
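
For comparison, a minimal sketch (again with made-up names, not hyper's code) of the pattern that does tolerate moving between tasks: re-record task::current() on every poll that returns NotReady, so that whichever task polled most recently is the one that gets notified:

extern crate futures;

use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

use futures::{Async, Future, Poll};
use futures::task::{self, Task};

struct Shared {
    done: bool,
    task: Option<Task>,
}

struct Wait {
    shared: Arc<Mutex<Shared>>,
}

impl Future for Wait {
    type Item = ();
    type Error = ();

    fn poll(&mut self) -> Poll<(), ()> {
        let mut shared = self.shared.lock().unwrap();
        if shared.done {
            return Ok(Async::Ready(()));
        }
        // Re-register on every poll: if the future has migrated to a new
        // task, the stale handle is replaced and the new task gets woken.
        shared.task = Some(task::current());
        Ok(Async::NotReady)
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(Shared { done: false, task: None }));
    let shared2 = shared.clone();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        let mut shared = shared2.lock().unwrap();
        shared.done = true;
        if let Some(task) = shared.task.take() {
            task.notify();
        }
    });
    Wait { shared: shared }.wait().unwrap();
    println!("woken by the most recently registered task");
}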
