Short-read optimization is wrong for O_DIRECT pipes #7051

Open
throwable-one opened this issue Dec 29, 2024 · 1 comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-net Module: tokio/net

Comments

@throwable-one

Version

tokio v1.42.0

Platform

Linux UNIT-2619 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022 x86_64 GNU/Linux

(but I tried that on several different Linuxes)

Description
The problem is covered here:
https://users.rust-lang.org/t/tokio-process-freezes-with-packet-pipes-on-linux-when-buffer-is-too-big/123103
Here is a copy:

There is a thing called "packet mode" pipes in Linux, see pipe(2).
TL;DR: when a pipe is opened with O_DIRECT, each write is a separate packet (no larger than PIPE_BUF, which is 4096 bytes).

Each read returns one packet; if the buffer is too small, the remaining bytes of that packet are discarded.

Here is a small tool that runs dd(1) in "packet" mode.

use std::process::Stdio;
use tokio::io::AsyncReadExt;
use tokio::process::Command;

const READ_BLOCK_SIZE: usize = 65536;
const BYTES_TO_WRITE: usize = 65536 * 2;

#[tokio::main]
async fn main() {
    let process = Command::new("/bin/dd")
        .arg("if=/dev/zero")
        // important: sets `fcntl` F_SETFL O_DIRECT
        // enables so-called "packet mode", see `pipe(2)` `O_DIRECT` option
        .arg("oflag=direct")
        .arg(format!("bs={}", BYTES_TO_WRITE))
        .arg("count=1")
        .stdout(Stdio::piped())
        .spawn()
        .unwrap();


    let mut stdout = process.stdout.unwrap();
    let mut buffer = [0u8; READ_BLOCK_SIZE];
    let mut bytes_read = 0;
    loop {
        let i = stdout.read(&mut buffer).await.unwrap();
        println!("I read {}", i);
        bytes_read += i;
        if i == 0 {
            break;
        }
    }
    if bytes_read != BYTES_TO_WRITE {
        panic!("Wrong number of bytes read: {bytes_read}");
    }
}

...and it gets stuck. Here is the strace output:

// dd enables packet mode
[pid 20030] fcntl(1, F_SETFL, O_WRONLY|O_DIRECT) = 0

// reads and writes zeros
[pid 20030] read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
[pid 20030] write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072 <unfinished ...>

// futex awakes
[pid 20017] <... epoll_wait resumed>[{events=EPOLLIN, data={u32=3533706496, u64=94346785330432}}], 1024, -1) = 1
[pid 20017] futex(0x55ced29ecd70, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 20013] <... futex resumed>)        = 0
[pid 20017] epoll_wait(3,  <unfinished ...>

// Tokio tries to read 64K, but reads only 4K (due to packet mode)
[pid 20013] read(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 4096
[pid 20013] write(1, "I read 4096\n", 12I read 4096
) = 12
[pid 20013] futex(0x55ced29ecd70, FUTEX_WAIT_BITSET_PRIVATE, 1, NULL, FUTEX_BITSET_MATCH_ANY
// everything is frozen here forever

Now, let's try the blocking API.

-use tokio::process::Command;
+use std::process::Command;

and remove .await from the read call (importing std::io::Read instead of tokio::io::AsyncReadExt).

It works: it reads 4096-byte blocks until the end (just as pipe(2) suggests).
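For reference, the blocking variant described above could look roughly like the sketch below. The helper name `drain` is made up for this illustration and is not part of the original reproducer; the dd arguments are the same as in the Tokio version.

```rust
use std::io::{self, Read};
use std::process::{Command, Stdio};

/// Hypothetical helper: reads `reader` until EOF with a buffer of `buf_len`
/// bytes and returns (total bytes read, number of non-empty reads).
fn drain<R: Read>(mut reader: R, buf_len: usize) -> io::Result<(usize, usize)> {
    let mut buffer = vec![0u8; buf_len];
    let mut total = 0;
    let mut reads = 0;
    loop {
        let n = reader.read(&mut buffer)?;
        if n == 0 {
            // A zero-length read means EOF.
            return Ok((total, reads));
        }
        total += n;
        reads += 1;
    }
}

fn main() -> io::Result<()> {
    // Same dd invocation as in the reproducer: one 128 KiB write with
    // O_DIRECT set on stdout, which puts the pipe into packet mode.
    let mut child = Command::new("/bin/dd")
        .arg("if=/dev/zero")
        .arg("oflag=direct")
        .arg("bs=131072")
        .arg("count=1")
        .stdout(Stdio::piped())
        .spawn()?;

    let stdout = child.stdout.take().expect("stdout was requested as piped");
    // Blocking reads with a 64 KiB buffer, as in the original test.
    let (total, reads) = drain(stdout, 65536)?;
    child.wait()?;
    println!("read {total} bytes in {reads} reads");
    Ok(())
}
```

Because every `read` here blocks until data is available, a short read does not change when the next read is attempted, so the loop simply keeps going until EOF.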

Workaround: setting the buffer size to 4096 helps. It seems that Tokio waits for more data (to fill the buffer), but a "packet" pipe never delivers more than 4096 bytes per read.

@throwable-one throwable-one added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Dec 29, 2024
@Darksonn Darksonn added the M-net Module: tokio/net label Dec 29, 2024
@Darksonn Darksonn changed the title tokio::process freezes with "packet pipes" on Linux when buffer is too big Short-read optimization is wrong for O_DIRECT pipes Dec 29, 2024
@Darksonn
Contributor

Thanks for reporting this. This is due to Tokio's short-read optimization:

// When mio is using the epoll or kqueue selector, reading a partially full
// buffer is sufficient to show that the socket buffer has been drained.
//
// This optimization does not work for level-triggered selectors such as
// windows or when poll is used.
//
// Read more:
// https://github.com/tokio-rs/tokio/issues/5866
#[cfg(all(
    not(mio_unsupported_force_poll_poll),
    any(
        // epoll
        target_os = "android",
        target_os = "illumos",
        target_os = "linux",
        target_os = "redox",
        // kqueue
        target_os = "dragonfly",
        target_os = "freebsd",
        target_os = "ios",
        target_os = "macos",
        target_os = "netbsd",
        target_os = "openbsd",
        target_os = "tvos",
        target_os = "visionos",
        target_os = "watchos",
    )
))]
if 0 < n && n < len {
    self.registration.clear_readiness(evt);
}

Normally, a read that is shorter than the buffer size indicates that Tokio should wait for readiness before attempting to read again. This is incorrect for O_DIRECT pipes.
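To make the failure mode concrete, here is a toy model of the interaction. It is only a sketch: `PacketPipe`, `looks_drained`, and `read_until_heuristic_stops` are hypothetical names, the pipe is modelled as a plain queue of packet sizes, and nothing here is Tokio's or the kernel's actual code. The only piece taken from the snippet above is the `0 < n && n < len` condition.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a packet-mode pipe: each queued entry is one packet size.
struct PacketPipe {
    packets: VecDeque<usize>,
}

impl PacketPipe {
    /// A single read returns at most one packet, however large the buffer is.
    fn read(&mut self, buf_len: usize) -> usize {
        self.packets.pop_front().map_or(0, |p| p.min(buf_len))
    }

    fn is_empty(&self) -> bool {
        self.packets.is_empty()
    }
}

/// Condition from the quoted snippet: a non-empty read shorter than the
/// buffer is taken to mean that the pipe has been drained.
fn looks_drained(n: usize, len: usize) -> bool {
    0 < n && n < len
}

/// Reads until the heuristic says "drained" (or the queue is empty) and
/// returns how many bytes were consumed before the reader would go back
/// to waiting for a new readiness event.
fn read_until_heuristic_stops(pipe: &mut PacketPipe, buf_len: usize) -> usize {
    let mut total = 0;
    loop {
        let n = pipe.read(buf_len);
        total += n;
        if n == 0 || looks_drained(n, buf_len) {
            return total;
        }
    }
}

fn main() {
    // Assumed sizes: 16 packets of 4096 bytes, i.e. a full 64 KiB pipe.
    let mut pipe = PacketPipe { packets: VecDeque::from(vec![4096usize; 16]) };
    let consumed = read_until_heuristic_stops(&mut pipe, 65536);
    println!("64K buffer: consumed {consumed}, data still queued: {}", !pipe.is_empty());

    let mut pipe = PacketPipe { packets: VecDeque::from(vec![4096usize; 16]) };
    let consumed = read_until_heuristic_stops(&mut pipe, 4096);
    println!("4K buffer: consumed {consumed}, data still queued: {}", !pipe.is_empty());
}
```

In this model the 64 KiB buffer stops after a single 4096-byte packet while 15 packets are still queued, whereas the 4096-byte buffer never sees a "short" read and consumes everything, which is consistent with the reported workaround.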

@Noah-Kennedy Thoughts on what we should do here? Since the flag can be changed on an existing pipe, I'm not sure that we can just cache the flag ...?
