Discussion: write/send number of bytes based on previous read/recv result for IOSQE_IO_LINK #58

Closed
CarterLi opened this issue Jan 20, 2020 · 8 comments

Comments

@CarterLi
Contributor

One use case for IOSQE_IO_LINK is zero-copy I/O, but it's hard to determine how many bytes were actually read.

Take an echo server, for example. It's just ACCEPT -> RECV -> SEND -> CLOSE, but that is hard, if not impossible, to do in a zero-copy way. The problems are:

  1. The fd used by RECV/SEND/CLOSE is produced by ACCEPT, so there's no way to use it in an IOSQE_IO_LINK chain.
  2. The number of bytes received from the client is known only after RECV completes. You have to wait for RECV's completion to know how many bytes need to be sent.

For point 2, man 2 read says:

It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.

So even a simple READ -> WRITE link chain may not always be reliable.
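
The short-read behaviour quoted above can be reproduced with plain POSIX calls, without io_uring. The sketch below is only an illustration: short_read_demo is a hypothetical helper name, and a pipe stands in for the client socket.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/*
 * Puts `avail` bytes into a fresh pipe, then asks read(2) for `want` bytes.
 * Returns whatever read(2) reported, or -1 if the setup failed.
 * short_read_demo is a hypothetical helper, not part of any library.
 */
ssize_t short_read_demo(size_t avail, size_t want)
{
    char in[64], out[64];
    int fds[2];
    ssize_t got;

    /* avail must be > 0, otherwise read(2) would block on the empty pipe. */
    assert(avail > 0 && avail <= sizeof(in));
    assert(want > 0 && want <= sizeof(out));

    if (pipe(fds) != 0)
        return -1;

    memset(in, 'x', sizeof(in));
    if (write(fds[1], in, avail) != (ssize_t)avail) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The request is for `want` bytes, but only `avail` are buffered. */
    got = read(fds[0], out, want);

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Calling short_read_demo(5, 64) returns 5, not 64. A linked WRITE that was prepared in advance with a fixed length of 64 would therefore send 59 bytes that were never received.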

I suggest adding a flag called IOSQE_IO_USE_PREV_RES (the name is not decided). It would work only with IORING_OP_{WRITE,SEND}, would have to be used together with IOSQE_IO_LINK, and would indicate that the current operation's buffer size is set by the previous operation's return code. If the previous return code is <= 0, the operation should generate an error.
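
To make the proposed rule concrete, here is a small user-space model of it. This is not kernel code and the flag does not exist in io_uring; MODEL_USE_PREV_RES and model_linked_len are hypothetical names, and the choice of EINVAL is an assumption, since the proposal only says "an error".

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical flag bit for this model only; not a real IOSQE_* flag. */
#define MODEL_USE_PREV_RES 0x1u

/*
 * Returns the length a linked WRITE/SEND would use, or a negative errno.
 *   prev_res: result of the previous operation in the link
 *   sqe_len:  length stored in the SQE when it was prepared
 */
long model_linked_len(long prev_res, size_t sqe_len, unsigned int flags)
{
    assert(sqe_len <= (size_t)LONG_MAX);

    /* Without the flag, the length fixed at submission time is used. */
    if (!(flags & MODEL_USE_PREV_RES))
        return (long)sqe_len;

    /* Proposed rule: a previous result <= 0 makes the operation fail.
     * EINVAL is an assumption; the proposal does not name an error. */
    if (prev_res <= 0)
        return -EINVAL;

    /* Proposed rule: the previous result becomes the buffer size. */
    return prev_res;
}
```

With the flag set, a RECV that returned 5 makes the following SEND use a length of 5 regardless of the 4096 stored in its SQE; without the flag the SEND keeps 4096.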

What do you think?

@isilence
Collaborator

Composability is one of the current topics/issues of interest for io_uring, but it's not that simple. It's not feasible to support each such case as a separate IOSQE flag in the kernel. It will slow io_uring down (and please consider that the kernel must not break userspace).

What you really want is some ability to control the flow depending on, e.g., the return code. We are thinking about using eBPF for steering, but I don't think libraries such as yours would use it without a really good reason.

Regarding your issue:

  1. I think the right thing to do is to fail a link if ret_size != asked_size, letting userspace handle this.
  2. As I remember, that's exactly the behaviour for, e.g., regular file reads. So it's not the same as read(2).

@CarterLi
Contributor Author

CarterLi commented Jan 20, 2020

It will slow io_uring down (and please consider that the kernel must not break userspace)

It's not enabled by default, so it shouldn't break anything

I think the right thing to do is to fail a link if ret_size != asked_size, letting userspace handle this.

You still need an extra flag, or it will result in a user-visible behavior change, which may break existing code.

In addition, it's not as useful as IOSQE_IO_USE_PREV_RES, in my opinion, since it still goes back to userland.

What I want is for an echo server to be implementable in a zero-copy manner, which should generally extend the use cases of IOSQE_IO_LINK.

As I remember, that's exactly the behaviour for, e.g., regular file reads. So it's not the same as read(2).

I don't quite understand what you mean.

@isilence
Collaborator

It's not enabled by default, so it shouldn't break anything

What I mean is that a flag can't be removed in the future. And the kernel would need to check every such flag, and that's slow. So, instead of adding a new flag for each such case, I would prefer to have something generic/programmable.

E.g. what if you don't want to proceed if you didn't read enough? What if you have read(n) -> read(m) -> write(n+m)? Or want to do tee-like stuff? There are a lot of cases.

I think the right thing to do is to fail a link if ret_size != asked_size, letting userspace handle this.

You still need an extra flag, or it will result in a user-visible behavior change, which may break existing code.
In addition, it's not as useful as IOSQE_IO_USE_PREV_RES, in my opinion, since it still goes back to userland.

Right, but it's generic. Once back in userspace, you can program whatever behaviour you want. I would love to see the performance difference, though.

As I remember, that's exactly the behaviour for, e.g., regular file reads. So it's not the same as read(2).

I don't quite understand what you mean.

IORING_OP_READ* will fail a link if it reads less than was specified.
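
A toy user-space model of that rule may help. It is a sketch, not the kernel's implementation: struct model_op and model_run_link are hypothetical names. It assumes that any transfer shorter than requested breaks the link and that the remaining linked requests then complete with -ECANCELED.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One request in a modelled link chain (hypothetical type). */
struct model_op {
    long asked; /* bytes requested when the request was prepared */
    long ret;   /* bytes the operation would transfer if it ran */
    long res;   /* filled in: the result the completion would carry */
};

/*
 * Runs the chain in order and fills in `res` for every entry.
 * Returns how many operations actually executed.
 */
size_t model_run_link(struct model_op *ops, size_t n)
{
    size_t i, ran = 0;
    int broken = 0;

    assert(ops != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (broken) {
            /* The link was failed earlier: this request never runs. */
            ops[i].res = -ECANCELED;
            continue;
        }
        ops[i].res = ops[i].ret;
        ran++;
        /* A short (or failed) transfer breaks the rest of the chain. */
        if (ops[i].ret != ops[i].asked)
            broken = 1;
    }
    return ran;
}
```

For a chain read(4096) -> write(4096) where the read only returns 100, the model reports 100 for the read and -ECANCELED for the write, so userspace sees the real byte count and can issue the write itself.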

@isilence
Collaborator

The approach I'm thinking of is to have a steering opcode, with different programs making the decisions. E.g.

  1. read -> OP_STEERING(1) -> write as you described
  2. ... -> OP_STEERING(2) -> ... runs an eBPF program, which may drop/modify/add sqes, etc.
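
The steering idea in the list above can be pictured in user space as a callback that sits between two linked operations. Everything here is hypothetical: OP_STEERING is only an idea in this thread, and steer_fn, steer_prev, steer_sum and model_steer are names invented for the sketch. A real design might look entirely different.

```c
#include <assert.h>
#include <stddef.h>

/*
 * A steering program sees the results of the operations completed so far
 * and returns the length for the next operation, or -1 to stop the chain.
 */
typedef long (*steer_fn)(const long *results, size_t n);

/* read -> steer -> write: the write uses the previous result as length. */
long steer_prev(const long *results, size_t n)
{
    assert(results != NULL || n == 0);
    if (n == 0 || results[n - 1] <= 0)
        return -1;
    return results[n - 1];
}

/* read(n) -> read(m) -> steer -> write(n+m): sum of all earlier results. */
long steer_sum(const long *results, size_t n)
{
    long total = 0;
    size_t i;

    assert(results != NULL || n == 0);
    if (n == 0)
        return -1;
    for (i = 0; i < n; i++) {
        if (results[i] <= 0)
            return -1; /* any failed or empty step stops the chain */
        total += results[i];
    }
    return total;
}

/* Applies one steering step between two linked operations. */
long model_steer(steer_fn prog, const long *results, size_t n)
{
    return prog(results, n);
}
```

With results {3, 5}, steer_prev yields 5 and steer_sum yields 8, which is how one generic mechanism could cover both the echo case and the read(n) -> read(m) -> write(n+m) case without a dedicated flag for each.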

@CarterLi
Contributor Author

CarterLi commented Jan 20, 2020

Right, but it's generic. Once back in userspace, you can program whatever behaviour you want. I would love to see the performance difference, though.

Well, yes. And it can still be used in a zero-copy manner: just submit multiple small-buffer recv-send requests. Users only need to handle the last send operation.

The requirement, though, is that the kernel still returns the number of bytes it has read successfully.

runs an eBPF program, which may drop/modify/add sqes, etc.

This is far beyond my knowledge. Does it require root privileges?

@isilence
Collaborator

isilence commented Jan 20, 2020

runs an eBPF program, which may drop/modify/add sqes, etc.

This is far beyond my knowledge. Does it require root privileges?

I think some popular cases (like yours) should be available to non-root users (e.g. pre-registered in some way). And if you want something fancier (i.e. custom in-kernel steering logic), you'd probably need root privileges.

It is yet to be discussed and designed, though, which is why I would like to see use cases and needs.

@CarterLi
Contributor Author

runs an eBPF program, which may drop/modify/add sqes, etc.

This is far beyond my knowledge. Does it require root privileges?

I think some popular cases (like yours) should be available to non-root users (e.g. pre-registered in some way). And if you want something fancier (i.e. custom in-kernel steering logic), you'd probably need root privileges.

It is yet to be discussed and designed, though, which is why I would like to see use cases and needs.

I'd say it will be the most powerful feature I've seen since io_uring was created, if it comes true 👍

@CarterLi
Contributor Author

CarterLi commented Feb 2, 2020

According to the man page:

a short read will also terminate the remainder of the chain

Never realized that. Awesome!

EDIT: Well, it seems that only IORING_OP_READV follows this rule, while IORING_OP_RECVMSG doesn't.

@axboe axboe closed this as completed Mar 12, 2022