proposal: runtime: allow termination of blocked syscalls #41054
Comments
I don't see how it could work to call the function asynchronously. It would be possible to see the Also, not all system calls involve the file system. |
For any given ThreadId, the sequence is synchronous: a) syscall.Stat(filename) I used asynchronously to indicate that the We're concerned with all blocking syscalls, filesystem and otherwise. Some can't be terminated; that's OK. |
I don't think we can synchronously call a user function as we are calling a syscall. At the moment of calling a syscall, we are in a fairly restricted execution environment. Calling user code at that point seems like a footgun. (I thought that by "asynchronously" you meant that the user code would run in a separate goroutine. I think that could be done safely. I don't think it would be safe to call a user function and wait for it to return.) |
I don't really understand why this API would be necessary in the runtime proper. Isn't this already possible to implement as a third-party library today, using |
Oh, I see. I think that would be a mistake: terminating a syscall that has not explicitly opted in to termination might just send it back through a retry loop, resulting in live-lock if the |
Ian, since this is a low-level API, let's define what is safe in a function called from the restricted syscall environment. The point of Let's also provide a @bcmills the callback can't terminate anything; it simply provides data to the app for a later What do you mean by syscalls that opt in to termination? EDIT: We'd presumably install signal handlers without SA_RESTART, to make more syscalls interruptible. That might require retry loops around non-blocking syscalls, or a sigmask to protect those threads. |
After catching up on #40846, it's not clear to me where this API comes from. It seems like that thread was slowly converging around having versions of os.Stat, etc. which take a Context that interrupts the system call when cancelled. Lots of details to be worked out, but a pretty clear API. On the other hand, I'm worried that this API is both very low-level and difficult to use. In particular:
@ianlancetaylor I think calling user code could be generally safe so long as this is limited to standard library calls like os.Stat/Readdir, etc, since those aren't used in any fragile states within the runtime itself. If this is intended to wrap every syscall, then we're definitely going to have many cases where it is nearly impossible to call user code (e.g., mmap call trying to allocate more stack space in morestack). |
I mean syscalls invoked from a setting where the caller is prepared to do something reasonable (other than just retrying) in case of Experience with Java's |
@prattmic thanks for digging in! There were two use cases raised in the previous thread: This API can serve both cases. Re Re multiple packages, I considered that! Multiple Re different callers, concurrent syscalls have different The callbacks needn't wrap every syscall, only blocking syscalls that can be terminated without undermining the runtime. @bcmills For background, I'm building a desktop app for Windows, MacOS, and Linux, which makes heavy use of the filesystem. I have no control over the kind of storage used. The app (via timeout or user intervention) may need to terminate all ops within a filesystem tree. The app doesn't use |
I am strongly opposed to any form of calling back to user code while deep in invoking a syscall. Even if we carefully write down rules as to what is permitted, people will inevitably write code that breaks those rules, and we will be tied to what people write. We went through years of pain with cgo vs. the garbage collector, and that was in an area where we had explicitly said there was no compatibility guarantee. That said, perhaps this API could use a channel. Similarly, the notion of marking the API as experimental is a non-starter. Once people start using an API, it is fixed. We cannot break people once they have started using some feature. That's not how Go works.
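A minimal sketch of the channel idea, assuming a hypothetical syscallEvent record posted before and after each tracked syscall: the posting side uses a non-blocking send, so it never waits on user code, and all user logic runs in a separate consumer goroutine. None of these names exist in the runtime; this only illustrates the shape of such a design.

```go
package main

import (
	"fmt"
	"time"
)

// syscallEvent is a hypothetical record posted before (Done == false)
// and after (Done == true) a blocking syscall.
type syscallEvent struct {
	ThreadID int
	Op       string
	Done     bool
	At       time.Time
}

// post delivers ev without ever blocking the poster. If the buffer is
// full because the consumer has fallen behind, the event is dropped and
// post reports false.
func post(ch chan<- syscallEvent, ev syscallEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false
	}
}

// consume runs in its own goroutine and tracks which threads are
// currently inside a posted syscall.
func consume(ch <-chan syscallEvent, pending map[int]syscallEvent, done chan<- struct{}) {
	for ev := range ch {
		if ev.Done {
			delete(pending, ev.ThreadID)
		} else {
			pending[ev.ThreadID] = ev
		}
	}
	close(done)
}

func main() {
	ch := make(chan syscallEvent, 16)
	pending := make(map[int]syscallEvent)
	done := make(chan struct{})
	go consume(ch, pending, done)

	now := time.Now()
	post(ch, syscallEvent{ThreadID: 1, Op: "stat", At: now})
	post(ch, syscallEvent{ThreadID: 2, Op: "open", At: now})
	post(ch, syscallEvent{ThreadID: 1, Op: "stat", Done: true, At: now})
	close(ch)
	<-done // wait until consume has drained the channel

	fmt.Println(len(pending), pending[2].Op) // 1 open
}
```

Dropping events on a full buffer keeps the syscall path non-blocking, but a dropped "after" event would leave a stale entry, so a real design would need some way to reconcile that.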
Nevertheless, I believe that |
Ian, on a related note, it seems that internal/poll.fcntl didn't get an EINTR loop: ... Also, it might be worth a comment with this link on poll.CloseFunc, to note that errors don't mean failure, so retry is incorrect. EDIT: Filed #41115 |
I didn't add an I guess we could add a comment to |
So right now, Go apps...
Can't we fix this problem? And without a major stdlib expansion? Can we prototype an internal runtime solution? It would underpin a future public API, and in the interim, projects that desperately need this bugfix can obtain it via
Thanks for the channel suggestion, but aren't two sends and two receives per syscall rather expensive? |
You should of course feel free to prototype whatever you like, but I would argue against any implicit approach. I think that a cancelable syscall should be clearly cancelable. |
For reference: on Linux with io_uring practically any system call request can be canceled: https://kernel.dk/io_uring-whatsnew.pdf -> |
This is incredibly invasive and breaks essentially all the abstractions that the Go runtime works hard to establish. I have a hard time seeing why we would do that. There may be a problem to solve here, but the answer can't be tearing down all the abstractions we have. Are you going to let the user TerminateSyscall in the middle of the runtime asking the operating system for more memory? It's just going to lead to an incredible number of subtle bugs. |
Golly, such strident critique; it seems I touched a nerve :-) This needn't be invasive. The app only needs to know about (and interrupt) its own syscalls -- not the runtime's! Re "subtle bugs", a terminated syscall returns an error indicating interruption; how is that subtle? The runtime is missing an important abstraction, the "pending system operation". Let's hear more ideas to halt stalled or runaway pending system ops! To date, the only other suggestion is to replicate dozens of stdlib APIs to take a |
Added to issue text: Here is a high-level API, adapted from #41249. Tracked ops could be limited to file syscalls.
Typical usage:
|
This proposal remains a non-starter for the reasons given above. |
@rsc, there is a problem here: Go apps...
And #41414 restates the problem as goroutine leaks due to stalled I/O. Is the Go team willing to explore solutions to this problem? |
There may be a problem here but it doesn't seem like a new problem or an urgent one. As discussed above, I don't think that this specific proposal is a good fix for that problem. |
No change in consensus, so declined. |
Many syscalls may block indefinitely, for example file ops trying paths on a network filesystem or peripheral device that's unavailable. Such syscalls can often be terminated, but Go is unable to do so at present. So today, Go apps...
CIFS (#39237, #38836) and FUSE (#40846) support this by returning `EINTR` instead of restarting after a signal. NFS on Linux used to do the same, but dropped it a while ago. So not all hung syscalls can be terminated gracefully.
Rust used to retry syscalls on `EINTR`, but dropped the practice outside of some high-level APIs.
The runtime should provide a way to terminate blocked syscalls. On Unix, this entails sending the blocked thread a signal. Windows has an analogous mechanism, `CancelSynchronousIo()`. See also https://docs.microsoft.com/en-us/windows/win32/fileio/canceling-pending-i-o-operations.
The solution must not add `context.Context` variants of all stdlib APIs that make blocking syscalls. Mandating `Context` arguments for termination would force them into third-party package APIs, and code importing those packages would then be broken. If a package author did not amend their API, the callers would have to manage stalled ops some other way, and would likely leak resources on every op retry.
I think the simplest way to allow this (other ideas welcome) is to asynchronously post a threadId & metadata to the app (edit: or an internal table) immediately before and after trying a blocking syscall.
The following API could be introduced as experimental (edit: or internal). If we also want cancellable variants of stdlib APIs, Add/DropSyscallPost() could take an argument limiting its scope to the current goroutine, for use by the variants.
A file-oriented variation of this appears in #41054 (comment). A runtime-internal variation is suggested in #41054 (comment).
Discussion of this began (more or less) with #40846 (comment).
Changelog
27-Aug: add `InterruptError` and improve `PostSyscall` docs