runtime: netpoll port_getn() fails with impossible errno on illumos #45643

whorfin · 2021-04-20T04:06:33Z

What version of Go are you using (`go version`)?

$ go version
go version go1.16.3 illumos/amd64

Does this issue reproduce with the latest release?

I do not know, I found the issue running Navidrome
https://github.com/navidrome/navidrome
which enforces building only with Go 1.16
The relevant bits of netpoll_solaris.go do not seem to be changed against latest.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/whorfin/.cache/go-build"
GOENV="/home/whorfin/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="illumos"
GOINSECURE=""
GOMODCACHE="/home/whorfin/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="illumos"
GOPATH="/home/whorfin/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/ooce/go-1.16"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/ooce/go-1.16/pkg/tool/illumos_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/whorfin/navidrome/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3292860960=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Having built the latest navidrome (master from https://github.com/navidrome/navidrome), it will sporadically crash with:

runtime: port_getn on fd 4 failed (errno=0)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw(0xe01def, 0x17)
        /opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72
runtime.netpoll(0x71c11d, 0x12202ce5e8f688)
        /opt/ooce/go-1.16/src/runtime/netpoll_solaris.go:249 +0x505
runtime.findrunnable(0xc000034000, 0x0)
        /opt/ooce/go-1.16/src/runtime/proc.go:2879 +0x3ee
runtime.schedule()
        /opt/ooce/go-1.16/src/runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000001080)
        /opt/ooce/go-1.16/src/runtime/proc.go:3274 +0x9d
runtime.mcall(0x57eac8)
        /opt/ooce/go-1.16/src/runtime/asm_amd64.s:327 +0x64

goroutine 1 [chan receive, 26 minutes]:
github.com/oklog/run.(*Group).Run(0xc000052d20, 0xc0003fa0b0, 0x1)
        /home/whorfin/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:43 +0xed
github.com/navidrome/navidrome/cmd.runNavidrome()
        /home/whorfin/navidrome/cmd/root.go:59 +0x107
github.com/navidrome/navidrome/cmd.glob..func2(0x25e8540, 0x266d098, 0x0, 0x0)
        /home/whorfin/navidrome/cmd/root.go:31 +0x25
github.com/spf13/cobra.(*Command).execute(0x25e8540, 0xc000092020, 0x0, 0x0, 0x25e8540, 0xc000092020)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x25e8540, 0xc000000180, 0x200000003, 0xc000000180)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
        /home/whorfin/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897
github.com/navidrome/navidrome/cmd.Execute()
        /home/whorfin/navidrome/cmd/root.go:39 +0x65
main.main()
        /home/whorfin/navidrome/main.go:11 +0x2b

That (errno=0) appears to be impossible.

What did you expect to see?

Ideally, no crashes. If there were a crash, I'd expect a meaningful errno.

What did you see instead?

A crash with errno=0

The text was updated successfully, but these errors were encountered:

toothrot · 2021-04-20T21:50:01Z

Is this possibly related to #35261? @jclulow @ianlancetaylor

ianlancetaylor · 2021-04-20T21:59:28Z

This is not related to #35261.

I don't see how this could happen. This is calling the function port_getn in the C library. That function is returning -1, but is not setting errno. That sounds like a bug in the C library.

Is there a Solaris equivalent to strace that will show the exact system calls being made and what they return? That might help identify the problem.

It's odd that nobody else is seeing this.

jclulow · 2021-04-20T22:03:54Z

That definitely seems unusual. On illumos, truss(1) is the moral equivalent of strace.

jclulow · 2021-04-20T22:25:20Z

e.g., given process 597:

# truss -t port_getn -p 597
/6:     port_getn(4, 0xFFFFFC7FEB3FF200, 128, 1, 0x00000000) (sleeping...)
/6:     port_getn(4, 0xFFFFFC7FEB3FF200, 128, 1, 0x00000000) = 0 [1]
/8:     port_associate(4, 4, 0x00000007, 0x00000001, 0xFFFFFC7FEF023F08) = 0
/6:     port_getn(4, 0xFFFFFC7FEB3FF210, 128, 1, 0xFFFFFC7FEB3FF1E8) = 0 [0]
/6:     port_getn(4, 0xFFFFFC7FEB3FF210, 128, 1, 0x00000000) (sleeping...)
...

whorfin · 2021-04-20T23:05:09Z

I made this evil, gross hack as an experiment; I do not understand port_getn well enough to know if this is safe,; it is probably stupid, but it confirms that something funky is happening. the errno() function here does seem to be doing something different AFAIK from what the "normal" pattern of using the "raw" or "Err" variant of sysvicall5 would be.

The gross hack allowed Navidrome to power through an initial scan - "seemed to work"
I saw the the "returning gList..." branch fire each time
Probably of no use to anybody, but in the spirit of "hope this helps":

        if r < 0 {
                if e != _EINTR && e != _ETIME {
                        print("runtime: port_getn on fd ", portfd, " failed (errno=", e, ")\n")
                        // whorfin
                        if (e ==0) {
                                if delay > 0 {
                                        print("\tImpossible zero errno, returning gList...\n")
                                        return gList{}
                                }
                                print("\tImpossible zero errno, retrying...\n")
                                goto retry
                        }
                        throw("runtime: netpoll failed")
                }
                // If a timed sleep was interrupted and there are no events,
                // just return to recalculate how long we should sleep now.
                if delay > 0 {
                        return gList{}
                }
                goto retry
        }

I was spooked by the whole thing and backed out the change.

jclulow · 2021-04-20T23:28:57Z

Can you give a bit more detail about the environment you're running this in? Which illumos distribution are you using, and what release or version of that distribution?

Were you able to collect information about the port_getn(3C) calls which are failing, using truss or DTrace?

I'm not familiar with Navidrome -- is it doing anything unusual like polling on a character device, or is it just regular network (e.g., TCP & UDP) things?

whorfin · 2021-04-21T00:37:06Z

SunOS xxxxxx 5.11 omnios-r151030-5bd7739fe4 i86pc i386 i86pc illumos

ie OmniOS CE r151030 LTS

The failures were random-seeming enough that I didn't get a good opportunity to run truss at an opportune time. I don't know how to make truss reveal errnos of syscalls which don't return them directly though; wouldn't it just show the "-1" retval?

It was always fd 4. And it was always from the findrunnable call, called by park_m, both in runtime/proc.go as shown in the stacktrace; Navidrome does not call netpoll directly, this seems to be from some fairly intrinsic threading code. That's why this seemed worth elevating to y'all golang folks.

(navidrome is a subsonic-API compatible music server)

I am using golang 1.16 as mentioned above, but also worth mentioning that is the go pkg from the omnios "extras" repo [OOCE], ie as OmniOS-standard as can be.

whorfin · 2021-04-22T23:40:01Z

Update:
I did a hack with a local runtime build, where I made a sysvicall5Err variant of sysvicall5 in order to call port_getn and directly get the library errno.
I then modified the netpoll call to always check if that errno matched that returned by errno(). Which it should but I'm paranoid
Managed to catch a crash, confirmed the errno() matched that from sysvicall5Err, and also that the return value of port_getn was -1. errno was the impossible 0
Finally, confirmed with ldd of my navidrome build that

libc.so.1 => /lib/64/libc.so.1

which is as expected, and the same version used by /opt/ooce/bin/go

jclulow · 2021-04-22T23:49:06Z

Managed to catch a crash, confirmed the errno() matched that from sysvicall5Err, and also that the return value of port_getn was -1. errno was the impossible 0

That's good news! Can you include the truss output that shows that information as well?

whorfin · 2021-04-23T00:43:18Z

Well that's not what I expected...
It looks like errno=0 because ... errno is zero
But the retval of port_getn was not negative!
here's the last port_getn lines captured by truss as it died:

/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/10:	port_getn(4, 0xFFFFFC7FDFFFF170, 128, 1, 0xFFFFFC7FDFFFF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 0 [0]
/5:	port_getn(4, 0xFFFFFC7FE87FF170, 128, 1, 0xFFFFFC7FE87FF148) = 0 [0]
/1:	port_getn(4, 0xFFFFFC7FFFDFAFA8, 128, 1, 0xFFFFFC7FFFDFAF80) = 62 [0]

Something seems borked here - the man page says either 0 or -1 should be returned
Here we see a retval of 62 with an errno of 0?! Am i reading this wrong?
This is on

SunOS illumosbox 5.11 omnios-r151030-5bd7739fe4 i86pc i386 i86pc illumos

and for completeness, the running program said just as became non-running:

runtime: port_getn on fd 4 failed (errno=0)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw(0xe02209, 0x17)
        /opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72
runtime.netpoll(0x7256d3, 0xc00003d800)
        /opt/ooce/go-1.16/src/runtime/netpoll_solaris.go:258 +0x505
runtime.findrunnable(0xc000036000, 0x0)
        /opt/ooce/go-1.16/src/runtime/proc.go:2879 +0x3ee
runtime.schedule()
        /opt/ooce/go-1.16/src/runtime/proc.go:3125 +0x2d7
runtime.park_m(0xc000001080)
        /opt/ooce/go-1.16/src/runtime/proc.go:3274 +0x9d
runtime.mcall(0x57eac8)
        /opt/ooce/go-1.16/src/runtime/asm_amd64.s:327 +0x64

jclulow · 2021-04-23T01:14:31Z

The mechanics are a little complicated here. What you see from truss is the underlying system call, which is actually a bit different from the signature of the C function port_getn(). The return value (here, either = 0 or = 62) is either success, or the error ETIME as you see on some of the line. The value in square brackets is, I believe, the number of events the kernel returned, and which we will write to the *nget pointer.

I am a bit confused by the output, as it doesn't seem like there was a non-zero return at least in the truss output you've pasted. If there was an error, I would expect to see something like:

port_getn(10000, 0x00000000, 128, 1, 0x00000000) Err#9 EBADF

That's what comes out of truss when I run...

#include <port.h>

int
main(int argc, char *argv[])
{
        int nget = 1;

        port_getn(10000, NULL, 128, &nget, NULL);
}

... at any rate. The ETIME condition (system call level return value of 62) is not technically an error from the perspective of truss because we don't communicate that back through the regular "set errno" path of a system call return.

I will try and knock together a DTrace script that will show you the user-level port_getn() calls.

jclulow · 2021-04-23T04:34:23Z

Here is a script that will trace some of the port_getn() calls for a process:

#!/usr/sbin/dtrace -qs

pid$target::port_getn:entry
{
        self->x = timestamp;
        self->fd = args[0];
        self->ngetp = args[3];
        self->nget = *self->ngetp;

        printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
            timestamp, pid, tid, execname,
            self->fd, self->ngetp, self->nget);
}

syscall::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
            timestamp, pid, tid, execname,
            (int)arg0);
}

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

pid$target::port_getn:return
/self->x/
{
        this->dur = (timestamp - self->x) / 1000;

        /*
         * Get the per-thread errno location for this thread:
         */
        this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
        this->ulwp = (userland struct pid`ulwp *)this->fsbase;
        this->errnop = this->ulwp->ul_errnop;

        printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
            timestamp, pid, tid, execname,
            (int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);

        if ((int)arg1 < 0 && *this->errnop == 0) {
                printf("WARNING: oddball return detected\n");
                ustack();
                printf("\n");
        }

        self->x = 0;
        self->fd = 0;
        self->ngetp = 0;
        self->nget = 0;
}

e.g.,

some output from an acme-dns process:

dns2 # /var/tmp/portgetn.d -p $(pgrep acme-dns)
13334996135948040 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135962747 [597/6:acme-dns] SYSCALL portfs 6 entry
13334996135968032 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13334996135973638 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 25us

13334996135986941 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13334996135991899 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017554884999 [597/6:acme-dns] SYSCALL portfs return 1 (errno 0)
13335017554908883 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 1) 21418921us

13335017555435247 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555446540 [597/6:acme-dns] SYSCALL portfs 6 entry
13335017555451788 [597/6:acme-dns] SYSCALL portfs return 0 (errno 0)
13335017555456734 [597/6:acme-dns] -> 0 (errnop fffffc7feec8234c = 0, nget = 0) 21us

13335017555465954 [597/6:acme-dns] port_getn(4, nget fffffc7feb3ff1c8 = 1)...
13335017555470793 [597/6:acme-dns] SYSCALL portfs 6 entry

Critically this script looks at two things:

the libc-level port_getn() call, including the per-thread errno value that Go should be collecting via errno()
the underlying portfs system call made by port_getn() which is what truss is showing you as well, including the errno value that the kernel put in place for the system call return (this is separate but related to the libc errno value)

There is a conditional print if we detect a failure return (-1) where errno is 0, which should print something like:

WARNING: oddball return detected

              libc.so.1`port_getn+0x51
              acme-dns`runtime.netpoll+0x1ef
              acme-dns`runtime.startTheWorldWithSema+0x236
              acme-dns`runtime.gcMarkTermination.func3+0x26
              acme-dns`runtime.systemstack+0x73
              acme-dns`runtime.gcMarkDone+0x1ff
              acme-dns`runtime.gcBgMarkWorker+0x29d
              acme-dns`runtime.goexit+0x1

It would be good to capture this sort of trace output while the problem is occurring, so that we can determine whether the C library is mishandling the error numbers here, or if it is some part of the Go machinery for fetching thread local errno after a C library call, or something else.

whorfin · 2021-04-23T06:31:58Z

OK! and slaps forehead for confusing native syscall vs libc illusions. Living in userland...
Nicely done. Here's the last reelvant bits, showing some -1 as well as 0 retvals, and then we die right after your dtrace script detects the oddball return:

5375078499701446 [28188/6:navidrome] -> 0 (errnop fffffc7fef2a238c = 0, nget = 0) 86us
5375078499704052 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078499714442 [28188/8:navidrome] -> 0 (errnop fffffc7fef2a338c = 0, nget = 0) 82us
5375078506469969 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506524468 [28188/1:navidrome] -> -1 (errnop fffffc7feea28ed8 = 62, nget = 0) 6883us
5375078506593065 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506613707 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506626595 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506632341 [28188/1:navidrome] -> 0 (errnop fffffc7feea28ed8 = 0, nget = 0) 39us
5375078506769053 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506776102 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506779076 [28188/8:navidrome] port_getn(4, nget fffffc7fe87ff118 = 1)...
5375078506779730 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506786340 [28188/1:navidrome] -> 0 (errnop fffffc7feea28ed8 = 0, nget = 0) 17us
5375078506792631 [28188/6:navidrome] port_getn(4, nget fffffc7fe83ff120 = 1)...
5375078506800609 [28188/1:navidrome] port_getn(4, nget fffffc7fffdfe338 = 1)...
5375078506807195 [28188/1:navidrome] SYSCALL portfs 6 entry
5375078506819033 [28188/8:navidrome] SYSCALL portfs 6 entry
5375078506831290 [28188/6:navidrome] SYSCALL portfs 6 entry
5375078506838443 [28188/8:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506851638 [28188/6:navidrome] SYSCALL portfs return 0 (errno 0)
5375078506863374 [28188/8:navidrome] -> 0 (errnop fffffc7fef2a338c = 0, nget = 0) 84us
5375078506875764 [28188/6:navidrome] -> 0 (errnop fffffc7fef2a238c = 0, nget = 0) 83us
5375078512819552 [28188/1:navidrome] SYSCALL portfs return 0 (errno 0)
5375078512870335 [28188/1:navidrome] -> -1 (errnop fffffc7feea28ed8 = 0, nget = 0) 6069us
WARNING: oddball return detected

              libc.so.1`port_getn+0x5f
              navidrome`runtime.netpoll+0xc5
              navidrome`runtime.findrunnable+0x3ee
              navidrome`runtime.schedule+0x2d7
              navidrome`runtime.park_m+0x9d
              navidrome`runtime.mcall+0x64
              navidrome`runtime.scavengeSleep+0xc5
              navidrome`runtime.bgscavenge+0x1e5
              navidrome`runtime.goexit+0x1

jclulow · 2021-04-26T06:03:57Z

This is definitely strange. I have been looking at the kernel and the C library code for a while and I cannot see how that condition could arise so far.

I have added some more in-kernel bits to the tracing here to try and find what I'm missing:

#!/usr/sbin/dtrace -qs

pid$target::port_getn:entry
{
        self->x = timestamp;
        self->fd = args[0];
        self->ngetp = args[3];
        self->nget = *self->ngetp;

        printf("%u [%d/%d:%s] port_getn(%d, nget %p = %d)...\n",
            timestamp, pid, tid, execname,
            self->fd, self->ngetp, self->nget);
}

syscall::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs %d entry\n",
            timestamp, pid, tid, execname,
            (int)arg0);
}

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

pid$target::port_getn:return
/self->x/
{
        this->dur = (timestamp - self->x) / 1000;

        /*
         * Get the per-thread errno location for this thread:
         */
        this->fsbase = curthread->t_lwp->lwp_pcb.pcb_fsbase;
        this->ulwp = (userland struct pid`ulwp *)this->fsbase;
        this->errnop = this->ulwp->ul_errnop;

        printf("%u [%d/%d:%s] -> %d (errnop %p = %d, nget = %d) %dus\n",
            timestamp, pid, tid, execname,
            (int)arg1, this->errnop, *this->errnop, *self->ngetp, this->dur);

        self->x = 0;
        self->fd = 0;
        self->ngetp = 0;
        self->nget = 0;
}

fbt::portfs:entry
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL portfs op %d a0 %x a1 %x a2 %x a3 %x a4 %x\n",
            timestamp, pid, tid, execname,
            args[0], args[1], args[2], args[3], args[4], args[5]);
}

fbt::port_getn:entry
/self->x/
{
        self->k_nget = args[3];
        self->k_pgt = args[4];

        printf("%u [%d/%d:%s] KERNEL port_getn max %u nget %p = %u pgt %p loop %u\n",
            timestamp, pid, tid, execname,
            args[2], self->k_nget, *self->k_nget, self->k_pgt,
            self->k_pgt->pgt_loop);
}

fbt::set_errno:entry
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL set_errno %d\n",
            timestamp, pid, tid, execname,
            args[0]);
        stack();
        printf("\n");
}

fbt::port_getn:return
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL port_getn ret i %x v %d 0x%x nget %u loop %u\n",
            timestamp, pid, tid, execname,
            arg0, arg1, arg1, *self->k_nget, self->k_pgt->pgt_loop);

        self->k_nget = 0;
        self->k_pgt = 0;
}


fbt::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] KERNEL portfs ret i %x v %d 0x%x\n",
            timestamp, pid, tid, execname,
            arg0, arg1, arg1);
}

Can you run this one while reproducing again?

As an aside: does the system where you're running this program have ECC memory in it?

whorfin · 2021-04-26T17:21:38Z

as before, last bits:

5673413389769518 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389781061 [11542/1:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389785050 [11542/1:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389787737 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389792140 [11542/1:navidrome] -> 0 (errnop fffffc7feefb8ed8 = 0, nget = 0) 53us
5673413389886540 [11542/3:navidrome] port_getn(4, nget fffffc7fe53ff120 = 1)...
5673413389886581 [11542/1:navidrome] port_getn(4, nget fffffc7fffdfd708 = 1)...
5673413389893263 [11542/1:navidrome] SYSCALL portfs 6 entry
5673413389896461 [11542/1:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fffdfd758 a2 80 a3 1 a4 fffffc7fffdfd730
5673413389896890 [11542/11:navidrome] port_getn(4, nget fffffc7fe37ff120 = 1)...
5673413389900305 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389905345 [11542/1:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389909244 [11542/1:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389909799 [11542/3:navidrome] SYSCALL portfs 6 entry
5673413389911866 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389916351 [11542/1:navidrome] -> 0 (errnop fffffc7feefb8ed8 = 0, nget = 0) 29us
5673413389918754 [11542/11:navidrome] SYSCALL portfs 6 entry
5673413389920104 [11542/3:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fe53ff170 a2 80 a3 1 a4 fffffc7fe53ff148
5673413389927285 [11542/1:navidrome] port_getn(4, nget fffffc7fffdfd708 = 1)...
5673413389928718 [11542/11:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fe37ff170 a2 80 a3 1 a4 fffffc7fe37ff148
5673413389933200 [11542/1:navidrome] SYSCALL portfs 6 entry
5673413389934911 [11542/3:navidrome] KERNEL port_getn max 128 nget fffffe00114cedec = 1 pgt fffffe00114ceda0 loop 0
5673413389936374 [11542/1:navidrome] KERNEL portfs op 6 a0 4 a1 fffffc7fffdfd758 a2 80 a3 1 a4 fffffc7fffdfd730
5673413389940191 [11542/1:navidrome] KERNEL port_getn max 128 nget fffffe0011153dec = 1 pgt fffffe0011153da0 loop 0
5673413389943122 [11542/11:navidrome] KERNEL port_getn max 128 nget fffffe0010780dec = 1 pgt fffffe0010780da0 loop 0
5673413389953552 [11542/3:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389959349 [11542/11:navidrome] KERNEL port_getn ret i 94 v 0 0x0 nget 0 loop 0
5673413389965264 [11542/3:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389970619 [11542/11:navidrome] KERNEL portfs ret i e1 v 0 0x0
5673413389972164 [11542/3:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389977261 [11542/11:navidrome] SYSCALL portfs return 0 (errno 0)
5673413389981458 [11542/3:navidrome] -> 0 (errnop fffffc7feecd0b8c = 0, nget = 0) 94us
5673413389986837 [11542/11:navidrome] -> 0 (errnop fffffc7feecd4b8c = 0, nget = 0) 89us
5673413394332656 [11542/1:navidrome] KERNEL port_getn ret i 94 v 62 0x3e nget 0 loop 0
5673413394378638 [11542/1:navidrome] KERNEL portfs ret i e1 v 266287972352 0x3e00000000
5673413394387872 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413394403418 [11542/1:navidrome] -> -1 (errnop fffffc7feefb8ed8 = 62, nget = 0) 4476us

Yes, I believe the server has ECC memory. All filesystems are regularly scrubbed and show no errors.
For extra paranoia:
$ ls -l /lib/64/libc.so.1
-rwxr-xr-x 1 root bin 2115832 Oct 23 2020 /lib/64/libc.so.1
$ sha512sum /lib/64/libc.so.1
802da4ec8c007ffc9354c64ef715e730853574b72c2ee324e9fe4992eaa1fe6a43c256d3c57de71d3b8bde715e1207c4edc38a7c3be6c26c6940d0f9429baf40 /lib/64/libc.so.1

jclulow · 2021-04-26T17:34:33Z

In this last trial did the WARNING: oddball return detected message not appear? The final line in the trace output you pasted shows that errno was ETIME (62):

...
5673413394378638 [11542/1:navidrome] KERNEL portfs ret i e1 v 266287972352 0x3e00000000
5673413394387872 [11542/1:navidrome] SYSCALL portfs return 0 (errno 0)
5673413394403418 [11542/1:navidrome] -> -1 (errnop fffffc7feefb8ed8 = 62, nget = 0) 4476us

whorfin · 2021-04-26T17:49:50Z

correct, those are the last lines of the output
the program blew up, per usual, with the message about failed (errno=0)
(that last line looks just like what truss showed, right?)

jclulow · 2021-04-26T18:00:16Z

If it blew up and reported errno=0 in that case I'm not sure it can be a libc issue, as the last line of this trace output shows that the memory address of the thread local errno was set to 62 (ETIME). The return from the kernel portfs() function just prior was 0x3e00000000 which we shift down, effectively, to get 0x3e which is 62 -- this causes us to set errno to ETIME in libc before returning:

https://github.com/illumos/illumos-gate/blob/9ecd05bdc59e4a1091c51ce68cce2028d5ba6fd1/usr/src/lib/libc/port/gen/event_port.c#L94-L97

(that the ETIME case is handled this way is why truss does not see an error, but libc is able to return one anyway)

whorfin · 2021-04-26T19:50:40Z

I have no idea what I'm doing as regards the illumos kernel.
But, uh, couldn't this be the return at line 91 with something globally haywire?
...or line 98 for that matter; a million ways to die in the kernel

whorfin · 2021-05-27T23:11:25Z

Confirming that this still misbehaves after OS upgrade to:

SunOS xxxxx 5.11 omnios-r151038-adeb0c00cf i86pc i386 i86pc illumos

gdamore · 2021-11-02T16:11:31Z

The most relevant piece of Illumos kernel code is here:

620  	case	PORT_GETN:
621  	{
622  		/*
623  		 * port_getn() can only retrieve own or shareable events from
624  		 * other processes. The port_getn() function remains in the
625  		 * kernel until own or shareable events are available or the
626  		 * timeout elapses.
627  		 */
628  		port_timer.pgt_flags = 0;
629  		port_timer.pgt_loop = 0;
630  		port_timer.pgt_rqtp = NULL;
631  		port_timer.pgt_timeout = (struct timespec *)a4;
632  		do {
633  			nget = a3;
634  			error = port_getn(pp, (port_event_t *)a1, (uint_t)a2,
635  			    (uint_t *)&nget, &port_timer);
636  		} while (nget == 0 && error == 0 && port_timer.pgt_loop);
637  		r.r_val1 = nget;
638  		r.r_val2 = error;
639  		releasef((int)a0);
640  		if (error && error != ETIME)
641  			return ((int64_t)set_errno(error));
642  		return (r.r_vals);

gdamore · 2021-11-02T16:16:48Z

The lines at 640-641 seem like it should be impossible to get -1 and zero. r.r_vals coming into this block is initialized to zero. IF the error is zero, then I'd expect to skip over 640 because of the first test, and return 0 at 642.

gdamore · 2021-11-02T16:20:04Z

I am wondering if something else is clobbering the errno in user space. The value is in a thread specific variable, but perhaps we're missing something in the calling code in the go runtime?

Note that we have never seen this error at RackTop, and we are a heavy user of Golang on illumos.

davepacheco · 2022-12-20T23:08:11Z

I don't see how this could explain the trace output above, but I thought I'd mention that we've discovered this illumos OS issue that under some conditions causes memory that is supposed to be zero'd by the Go runtime to not be properly zero'd. Obviously this can cause all kinds of strange behavior. I could imagine that bug affecting some of the arguments to the libc port_getn. But I don't see how it could explain the syscall returning 0 and then the libc function returning -1 with errno = 0.

davepacheco · 2022-12-20T23:55:33Z

@jclulow I think there may be a bug in the D scripts. In this probe:

syscall::portfs:return
/self->x/
{
        printf("%u [%d/%d:%s] SYSCALL portfs return %d (errno %d)\n",
            timestamp, pid, tid, execname,
            (int)arg0, errno);
}

I assume that arg0 here is an int64_t (dtrace -lvn syscall::portfs:return doesn't say, but portfs does appear to return an int64_t). We're casting it to an int, and I assume this is LP64 so that it's a 32-bit int. The DTrace Guide says "Integers are cast to types of lower rank by zeroing the excess high-order bits of the integer." So I think all we know from the output is that the low 32 bits are zero. I have to believe the high 32 bits were not zero. (That's because if they were zero, then I don't see how we wouldn't have returned 0 from the libc port_getn. That's based on my read of both the C source and the disassembly.)

Given what we know (that the low 32 bits of the syscall return value were 0; that we returned -1; that errno was 0) I infer we must have hit this branch, which means the high bits of the return value would have been -1. Working backwards, in the kernel in portfs after calling port_getn, I think that means nget would have had to be -1 and error would be zero. I'm not sure how this is possible either though. error = 0 and *nget != 0 and nget != 1 implies we were at the final return from port_getn, in which case we set *nget = nevents, and nevents is a uint_t incremented from 0 until at most max. It makes sense that more kernel tracing would be a good step here and it seems like we need a repro that reports the oddball message.

whorfin · 2022-12-21T19:00:10Z

Greetings and thank you for your continued diligence in this spooky endeavor.
I thought I would update that as navidrome has updated to Go 1.18 and I have rebuilt, along with many kernel updates since the last posting, I have not seen this bug re-occur.
Absence isn't proof, of course, but it seems something did improve.
netpoll_solaris.go seems only to have changed pd.closing to pd.info().closing() in netpollupdate()
FWIW:

$ go version
go version go1.18.9 illumos/amd64
$ uname -a
SunOS serverbox 5.11 omnios-r151038-ef7fcbcae5 i86pc i386 i86pc illumos

ianlancetaylor · 2022-12-21T19:18:53Z

Thanks for following it up. Closing because it doesn't seem like we need to change anything in Go.

toothrot added OS-Solaris NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Apr 20, 2021

toothrot added this to the Backlog milestone Apr 20, 2021

davepacheco mentioned this issue May 27, 2022

cockroachdb crashed in Go runtime after reporting EINVAL from port_getn oxidecomputer/omicron#1130

Closed

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022

mknyszek added this to Go Compiler / Runtime Jul 7, 2022

mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022

davepacheco mentioned this issue Dec 16, 2022

cockroachdb crashed in Go runtime during test run: s.allocCount != s.nelems oxidecomputer/omicron#1146

Closed

ianlancetaylor closed this as not planned Won't fix, can't repro, duplicate, stale Dec 21, 2022

Repository owner moved this from Triage Backlog to Done in Go Compiler / Runtime Dec 21, 2022

bcmills mentioned this issue Feb 2, 2023

x/sys/unix: TestEventPortErrors failures #58259

Closed

mknyszek removed this from Go Compiler / Runtime Feb 15, 2023

golang locked and limited conversation to collaborators Dec 21, 2023

gopherbot added the FrozenDueToAge label Dec 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

whorfin commented Apr 20, 2021

toothrot commented Apr 20, 2021

ianlancetaylor commented Apr 20, 2021

jclulow commented Apr 20, 2021

jclulow commented Apr 20, 2021

whorfin commented Apr 20, 2021

jclulow commented Apr 20, 2021

whorfin commented Apr 21, 2021

whorfin commented Apr 22, 2021

jclulow commented Apr 22, 2021

whorfin commented Apr 23, 2021 •

edited

Loading

jclulow commented Apr 23, 2021

jclulow commented Apr 23, 2021

whorfin commented Apr 23, 2021

jclulow commented Apr 26, 2021

whorfin commented Apr 26, 2021

jclulow commented Apr 26, 2021

whorfin commented Apr 26, 2021 •

edited

Loading

jclulow commented Apr 26, 2021 •

edited

Loading

whorfin commented Apr 26, 2021 •

edited

Loading

whorfin commented May 27, 2021

gdamore commented Nov 2, 2021

gdamore commented Nov 2, 2021

gdamore commented Nov 2, 2021

davepacheco commented Dec 20, 2022

davepacheco commented Dec 20, 2022

whorfin commented Dec 21, 2022

ianlancetaylor commented Dec 21, 2022

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

runtime: netpoll port_getn() fails with impossible errno on illumos #45643

Comments

whorfin commented Apr 20, 2021

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

toothrot commented Apr 20, 2021

ianlancetaylor commented Apr 20, 2021

jclulow commented Apr 20, 2021

jclulow commented Apr 20, 2021

whorfin commented Apr 20, 2021

jclulow commented Apr 20, 2021

whorfin commented Apr 21, 2021

whorfin commented Apr 22, 2021

jclulow commented Apr 22, 2021

whorfin commented Apr 23, 2021 • edited Loading

jclulow commented Apr 23, 2021

jclulow commented Apr 23, 2021

whorfin commented Apr 23, 2021

jclulow commented Apr 26, 2021

whorfin commented Apr 26, 2021

jclulow commented Apr 26, 2021

whorfin commented Apr 26, 2021 • edited Loading

jclulow commented Apr 26, 2021 • edited Loading

whorfin commented Apr 26, 2021 • edited Loading

whorfin commented May 27, 2021

gdamore commented Nov 2, 2021

gdamore commented Nov 2, 2021

gdamore commented Nov 2, 2021

davepacheco commented Dec 20, 2022

davepacheco commented Dec 20, 2022

whorfin commented Dec 21, 2022

ianlancetaylor commented Dec 21, 2022

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

whorfin commented Apr 23, 2021 •

edited

Loading

whorfin commented Apr 26, 2021 •

edited

Loading

jclulow commented Apr 26, 2021 •

edited

Loading

whorfin commented Apr 26, 2021 •

edited

Loading