Discussion: write/send number of bytes based on previous read/recv result for IOSQE_IO_LINK #58

CarterLi · 2020-01-20T08:42:24Z

One use case for IOSQE_IO_LINK is zero-copy I/O, but it's hard to determine how many bytes were actually read.

Take an echo server, for example. It's just ACCEPT -> RECV -> SEND -> CLOSE, but it's hard or impossible to do in a zero-copy way. The problems are:

1. The fd used by RECV/SEND/CLOSE is generated by ACCEPT, so there's no way to use it in an IOSQE_IO_LINK chain.
2. The number of bytes received from the client is known only after RECV completes. You have to wait for RECV's completion to know how many bytes need to be sent.

For 2, man 2 read says:

> It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.

So even a simple READ -> WRITE link chain may not always be reliable.
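
To make the short-read point concrete, here is a minimal sketch in plain POSIX C (no io_uring involved). The helper name `short_read_demo` is hypothetical; it only shows that asking read() for 4096 bytes from a pipe that holds 5 returns 5, so a following write sized by the request rather than the result would send garbage.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: put 5 bytes into a pipe, then ask read() for 4096.
 * Returns what read() reports, or -1 if the setup failed. */
ssize_t short_read_demo(void)
{
    int fds[2];
    char buf[4096];

    if (pipe(fds) != 0)
        return -1;

    ssize_t w = write(fds[1], "hello", 5);
    /* Only 5 bytes are available, so read() returns 5, not 4096. */
    ssize_t n = (w == 5) ? read(fds[0], buf, sizeof buf) : -1;
    assert(n <= (ssize_t)sizeof buf);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Calling `short_read_demo()` returns 5 even though 4096 bytes were requested.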

I suggest adding a flag called IOSQE_IO_USE_PREV_RES (the name is not decided) that works only with IORING_OP_{WRITE,SEND} and must be used together with IOSQE_IO_LINK. It indicates that the current operation's buffer size is set by the previous operation's return code. If the previous return code is <= 0, the operation should generate an error.

What do you think?
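
For reference, the userspace round trip that the proposed flag is meant to avoid can be sketched roughly as follows. This is plain POSIX C, not io_uring, and `echo_once` is a hypothetical helper name: it waits for recv() to complete, then sends exactly the number of bytes received, and treats a result <= 0 as an error, as the proposal suggests.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: echo one message back on a connected stream socket.
 * Returns the number of bytes echoed, or -1 on EOF or error. */
ssize_t echo_once(int fd)
{
    char buf[4096];

    ssize_t n = recv(fd, buf, sizeof buf, 0);   /* may return fewer than 4096 */
    if (n <= 0)
        return -1;   /* mirrors "previous ret code <= 0 generates an error" */

    ssize_t sent = 0;
    while (sent < n) {   /* send exactly n bytes, tolerating short sends */
        ssize_t w = send(fd, buf + sent, (size_t)(n - sent), 0);
        if (w < 0)
            return -1;
        sent += w;
    }
    assert(sent == n);
    return n;
}
```

The send length comes from the recv result, which is exactly the dependency a fixed-size linked SEND cannot express.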

isilence · 2020-01-20T11:04:13Z

Composability is one of the current topics of interest for io_uring, but it's not that simple. It's not feasible to support each such case as a separate IOSQE flag in the kernel. It will slow io_uring down (and please consider that the kernel must not break userspace).

What you really want is some ability to control flow depending on e.g. the return code. We are thinking about using eBPF for steering, but I don't think libraries such as yours would use it without a really good reason.

Regarding your issue:

I think the right thing to do is to fail the link if ret_size != asked_size, letting userspace handle it.
As I remember, that's exactly the behaviour for e.g. regular file reads. So it's not the same as read(2).

CarterLi · 2020-01-20T11:25:36Z

> It will slow io_uring down (and please consider that the kernel must not break userspace)

It's not enabled by default, so it shouldn't break anything.

> I think the right thing to do is to fail the link if ret_size != asked_size, letting userspace handle it.

You still need an extra flag, or it will result in a user-visible behavior change, which may break existing code.

In addition, it's not as useful as IOSQE_IO_USE_PREV_RES in my opinion, as it still goes back to userland.

What I want to suggest is making it possible to implement an echo server in a zero-copy manner, which should generally extend the use cases of IOSQE_IO_LINK.

> As I remember, that's exactly the behaviour for e.g. regular file reads. So it's not the same as read(2).

I don't quite understand what you mean.

isilence · 2020-01-20T11:41:05Z

> It's not enabled by default, so it shouldn't break anything.

What I mean is that a flag can't be removed in the future. And the kernel would need to check every such flag, and that's slow. So, instead of adding a new flag for each such case, I would prefer to have something generic/programmable.

E.g. what if you don't want to proceed if you didn't read enough? What if you have read(n) -> read(m) -> write(n+m)? Or want to do tee-like stuff? There are a lot of cases.

> > I think the right thing to do is to fail the link if ret_size != asked_size, letting userspace handle it.
>
> You still need an extra flag, or it will result in a user-visible behavior change, which may break existing code.
> In addition, it's not as useful as IOSQE_IO_USE_PREV_RES in my opinion, as it still goes back to userland.

Right, but it's generic. After getting back to userspace you can program whatever behaviour you want. Would love to see the performance difference though.

> > As I remember, that's exactly the behaviour for e.g. regular file reads. So it's not the same as read(2).
>
> I don't quite understand what you mean.

IORING_OP_READ* will fail a link if it reads less than was specified.
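
The behaviour described here can be illustrated with a toy userspace model. This is only a sketch of the semantics as stated in this thread, not kernel or liburing code: `struct model_req` and `run_link_chain` are hypothetical names, and the model assumes that requests cancelled because the link broke report -ECANCELED.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of one linked request; NOT a kernel or liburing structure. */
struct model_req {
    int asked;      /* bytes requested */
    int transfers;  /* bytes the operation would actually move, or -errno */
    int res;        /* filled in: the result userspace would see */
};

/* Walk the chain in order. Once a request comes up short or fails, every
 * later request is cancelled. Returns how many requests actually ran. */
size_t run_link_chain(struct model_req *chain, size_t n)
{
    size_t ran = 0;
    int broken = 0;

    assert(chain != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (broken) {
            chain[i].res = -ECANCELED;   /* assumed result for cancelled links */
            continue;
        }
        chain[i].res = chain[i].transfers;
        ran++;
        if (chain[i].res != chain[i].asked)
            broken = 1;                  /* short or failed: break the link */
    }
    return ran;
}
```

In this model a read that asks for 64 bytes and gets 10 completes with 10, and the linked write after it never runs, so userspace sees the real byte count and can resubmit a write of the right size.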

isilence · 2020-01-20T11:51:43Z

The approach I'm thinking of is to have a steering opcode, with different programs making decisions. E.g.

read -> OP_STEERING(1) -> write as you described
... -> OP_STEERING(2) -> ... runs an eBPF program, which may drop/modify/add sqes, etc.

CarterLi · 2020-01-20T12:21:26Z

> Right, but it's generic. After getting back to userspace you can program whatever behaviour you want. Would love to see the performance difference though.

Well yes. And it can still be used in a zero-copy manner: just submit multiple small-buffer recv-send requests. Users only need to handle the last send operation.

But the requirement is that the kernel still needs to return the number of bytes it has read successfully.

> runs eBPF program, which may drop/modify/add sqes, etc.

This is far beyond my knowledge. Does it require root privileges?

isilence · 2020-01-20T12:48:36Z

> > runs eBPF program, which may drop/modify/add sqes, etc.
>
> This is far beyond my knowledge. Does it require root privileges?

I think some popular cases (like yours) should be available to non-root users (e.g. kind of pre-registered). And if you want something fancier (i.e. custom in-kernel steering logic), you'd probably need root privileges.

Though it's yet to be discussed and designed, and that's why I would like to see use cases and needs.

CarterLi · 2020-01-20T14:00:24Z

> > > runs eBPF program, which may drop/modify/add sqes, etc.
> >
> > This is far beyond my knowledge. Does it require root privileges?
>
> I think some popular cases (like yours) should be available to non-root users (e.g. kind of pre-registered). And if you want something fancier (i.e. custom in-kernel steering logic), you'd probably need root privileges.
>
> Though it's yet to be discussed and designed, and that's why I would like to see use cases and needs.

I'd say it will be the most powerful feature I've seen since io_uring was created, if it becomes real 👍

CarterLi · 2020-02-02T02:29:43Z

According to the man page:

> a short read will also terminate the remainder of the chain

Never realized that. Awesome!

EDIT: Well, it seems that only IORING_OP_READV follows this rule, while IORING_OP_RECVMSG doesn't.

axboe closed this as completed Mar 12, 2022
