runtime: netpoll port_getn() fails with impossible errno on illumos #45643
Comments
Is this possibly related to #35261? @jclulow @ianlancetaylor |
This is not related to #35261. I don't see how this could happen. This is calling the function Is there a Solaris equivalent to It's odd that nobody else is seeing this. |
That definitely seems unusual. On illumos, truss(1) is the moral equivalent of |
e.g., given process 597:
|
I made this evil, gross hack as an experiment; I do not understand The gross hack allowed Navidrome to power through an initial scan - "seemed to work"
I was spooked by the whole thing and backed out the change. |
Can you give a bit more detail about the environment you're running this in? Which illumos distribution are you using, and what release or version of that distribution? Were you able to collect information about the I'm not familiar with Navidrome -- is it doing anything unusual like polling on a character device, or is it just regular network (e.g., TCP & UDP) things? |
ie OmniOS CE r151030 LTS The failures were random-seeming enough that I didn't get a good opportunity to run It was always fd 4. And it was always from the (navidrome is a subsonic-API compatible music server) I am using golang 1.16 as mentioned above, but also worth mentioning that is the go pkg from the omnios "extras" repo [OOCE], ie as OmniOS-standard as can be. |
Update:
which is as expected, and the same version used by |
That's good news! Can you include the |
Well that's not what I expected...
Something seems borked here - the man page says either 0 or -1 should be returned
and for completeness, the running program said just as became non-running:
|
The mechanics are a little complicated here. What you see from I am a bit confused by the output, as it doesn't seem like there was a non-zero return at least in the
That's what comes out of
#include <port.h>

int
main(int argc, char *argv[])
{
	uint_t nget = 1;

	/* 10000 is not expected to be an open descriptor, so this should fail */
	port_getn(10000, NULL, 128, &nget, NULL);
	return (0);
}
... at any rate. The I will try and knock together a DTrace script that will show you the user-level |
Here is a script that will trace some of the
#!/usr/sbin/dtrace -qs
pid$target::port_getn:entry
{
self->x = timestamp;
self->fd = args[0];
self->ngetp = args[3];
self->nget = *self->ngetp;
printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
timestamp, pid, tid, execname,
self->fd, self->ngetp, self->nget);
}
syscall::portfs:entry
/self->x/
{
printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
timestamp, pid, tid, execname,
(int)arg0);
}
syscall::portfs:return
/self->x/
{
printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
timestamp, pid, tid, execname,
(int)arg0, errno);
}
pid$target::port_getn:return
/self->x/
{
this->dur = (timestamp - self->x) / 1000;
/*
* Get the per-thread errno location for this thread:
*/
this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
this->ulwp = (userland struct pid`ulwp *)this->fsbase;
this->errnop = this->ulwp->ul_errnop;
printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
timestamp, pid, tid, execname,
(int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);
if ((int)arg1 < 0 && *this->errnop == 0) {
printf("WARNING: oddball return detected\n");
ustack();
printf("\n");
}
self->x = 0;
self->fd = 0;
self->ngetp = 0;
self->nget = 0;
}
e.g., some output from an
Critically this script looks at two things:
There is a conditional print if we detect a failure return (
It would be good to capture this sort of trace output while the problem is occurring, so that we can determine whether the C library is mishandling the error numbers here, or if it is some part of the Go machinery for fetching thread local errno after a C library call, or something else. |
OK! and slaps forehead for confusing native syscall vs libc illusions. Living in userland...
|
This is definitely strange. I have been looking at the kernel and the C library code for a while and I cannot see how that condition could arise so far. I have added some more in-kernel bits to the tracing here to try and find what I'm missing:
Can you run this one while reproducing again? As an aside: does the system where you're running this program have ECC memory in it? |
as before, last bits:
Yes, I believe the server has ECC memory. All filesystems are regularly scrubbed and show no errors. |
In this last trial did the
|
correct, those are the last lines of the output |
If it blew up and reported (that the |
I have no idea what I'm doing as regards the illumos kernel. |
Confirming that this still misbehaves after OS upgrade to:
|
The most relevant piece of Illumos kernel code is here:
|
Looking at lines 640-641, it seems like it should be impossible to get -1 and zero. r.r_vals coming into this block is initialized to zero. If the error is zero, then I'd expect to skip over 640 because of the first test, and return 0 at 642. |
I am wondering if something else is clobbering the errno in user space. The value is in a thread specific variable, but perhaps we're missing something in the calling code in the go runtime? Note that we have never seen this error at RackTop, and we are a heavy user of Golang on illumos. |
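As a general illustration of why the errno read has to happen immediately after the failing call, here is a minimal C sketch. It is not the Go runtime's mechanism and it does not use port_getn: it assumes a plain POSIX read on an invalid descriptor, and the wrapper name read_capture_errno is hypothetical.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper for illustration only: perform a read and copy
 * errno out immediately, before any other libc call on this thread has
 * a chance to overwrite it.
 */
static ssize_t
read_capture_errno(int fd, void *buf, size_t len, int *saved_errno)
{
	ssize_t rc;

	errno = 0;
	rc = read(fd, buf, len);
	/* Only trust errno when the call actually reported failure. */
	*saved_errno = (rc == -1) ? errno : 0;
	return (rc);
}
```

Calling it with fd -1 should return -1 and store EBADF. Anything that runs between the failing call and the errno read is a candidate for clobbering the value.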
I don't see how this could explain the trace output above, but I thought I'd mention that we've discovered this illumos OS issue that under some conditions causes memory that is supposed to be zero'd by the Go runtime to not be properly zero'd. Obviously this can cause all kinds of strange behavior. I could imagine that bug affecting some of the arguments to the libc |
@jclulow I think there may be a bug in the D scripts. In this probe:
I assume that arg0 here is an Given what we know (that the low 32 bits of the syscall return value were 0; that we returned -1; that errno was 0) I infer we must have hit this branch, which means the high bits of the return value would have been -1. Working backwards, in the kernel in |
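The truncation concern above can be illustrated in isolation. This is a minimal C sketch, not the kernel or libc code: it assumes the raw syscall return is a 64-bit value made up of two 32-bit halves, and the helper names low32, high32 and pack64 are hypothetical. It shows that a 32-bit cast, like the (int)arg0 in the script's syscall::portfs:return probe, reports 0 for a value whose upper half is -1.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for illustration only; they are not part of
 * illumos, libc, or DTrace.
 */

/* The low half: what a 32-bit cast of the raw value would display. */
static int32_t
low32(int64_t raw)
{
	return ((int32_t)((uint64_t)raw & 0xffffffffULL));
}

/* The high half, which a 32-bit cast silently discards. */
static int32_t
high32(int64_t raw)
{
	return ((int32_t)((uint64_t)raw >> 32));
}

/* Assumed layout: pack two 32-bit halves into one 64-bit value. */
static int64_t
pack64(int32_t high, int32_t low)
{
	return ((int64_t)(((uint64_t)(uint32_t)high << 32) | (uint32_t)low));
}
```

With pack64(-1, 0), low32 yields 0 and high32 yields -1, which is the combination inferred in the comment above: a trace that only prints the low half would show a return of 0.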
Greetings and thank you for your continued diligence in this spooky endeavor.
|
Thanks for following it up. Closing because it doesn't seem like we need to change anything in Go. |
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
I do not know. I found the issue running Navidrome (https://github.com/navidrome/navidrome), which enforces building only with Go 1.16. The relevant bits of netpoll_solaris.go do not seem to be changed against latest.

What operating system and processor architecture are you using (go env)?
go env Output

What did you do?
Having built the latest navidrome (master from https://github.com/navidrome/navidrome), it will sporadically crash with:
That (errno=0) appears to be impossible.

What did you expect to see?
Ideally, no crashes. If there were a crash, I'd expect a meaningful errno.

What did you see instead?
A crash with errno=0