
eventloop/networking gets stuck on musl #1431

Closed
l63l opened this issue Mar 21, 2024 · 11 comments
Labels
bug This is not expected behavior, and needs fixing

Comments

@l63l

l63l commented Mar 21, 2024

The following happens only when building Janet with musl libc, starting with commit 5f2e287.
When hitting a simple spork/http/server with a load-testing tool like wrk
and about 20 parallel connections, the server stops responding on the
second invocation of wrk.

And this only happens when using ev/go (which http/server does), but
not with ev/call:

(def a (net/listen "127.0.0.1" "8000"))

(defn handler [con]
  (defer (:close con)
    (ev/read con 1000)))

# gets stuck:
(ev/go (fn [] (net/accept-loop a handler)))

# doesn't:
# (forever
#   (def con (net/accept a))
#   (ev/call handler con))

And a client like:

(loop [i :range [0 20]]
  (ev/spawn-thread
    (forever
      (with [con (net/connect "127.0.0.1" "8000" :stream)]
        (:write con "AAA")
        (def res (:read con 1024))
        (print con res)
        (:close con)))))

To get a musl build of Janet, for example on Debian:

apt-get install musl-tools
make CC=musl-gcc
@pepe
Member

pepe commented Mar 21, 2024

Hello. I am using musl build on Alpine, and I have yet to notice the behavior you are describing. I will try your reproduction code and report back. Thanks for reporting.

@pepe
Member

pepe commented Mar 22, 2024

I have tried it with the latest Janet from the master branch on Alpine Linux with musl at the time of this writing and see no issues like the one you are describing. We will need more information about your setup.

@l63l
Author

l63l commented Mar 22, 2024

Hm, thanks for testing. I tested on Void Linux with the host musl libc (1.1.24), an Alpine 3.17 VM (musl 1.2.3), and Debian Bullseye with musl-gcc. Now I fetched Alpine 3.19.2, built Janet from master, and could reproduce it there too.

You need to keep the "server snippet" running, start the "client snippet", kill the "client snippet", and the second invocation will get stuck.

@pepe
Member

pepe commented Mar 22, 2024

Oh. I did not get the process right. I am trying it again.

@pepe
Member

pepe commented Mar 22, 2024

Yep. I can confirm that the problem manifests in this case on Alpine.

@bakpakin added the "bug" label on Apr 15, 2024
@psagers

psagers commented Apr 25, 2024

I'm experiencing similar behavior on Arch with glibc. This is Janet version 1.34.0-f92f3eb6, built from the repo. I don't see the issue on FreeBSD with an equivalent build (from the same commit).

Here's my repro setup:

(var n-req 0)

(defn handler
  [stream]
  (defer (net/close stream)
    (ev/sleep (math/random))
    (set n-req (inc n-req))
    (printf "%d: %N" n-req (net/peername stream))
    (net/write stream (string/format "%3d: Accepted\n" n-req))))

(defn main
  [&]
  (let [server (net/listen "127.0.0.1" 9000)]
    (net/accept-loop server handler)))

In another shell:

for i in $(seq 100)
do nc -d -w 2 127.0.0.1 9000 &
done

Once the accept-loop gets "stuck", the symptom I see is that some (but not all) connections are accepted and sent to the handler and some of those (but not all) result in data actually being written back to the client. The issue seems to accumulate, such that successive batches of 100 requests result in fewer and fewer successful responses.

I tried (forever (ev/call handler (net/accept server))) and that does seem to work correctly for me as well.

Updated: I see the same behavior with the official build as well.

@bakpakin
Member

So I took another look at this now that I have a repro that works on my system, and the root cause is in how we set up epoll/kqueue in edge-trigger mode.

This works well for most cases, but with net/accept-loop it can cause connections to be missed in these micro-benchmarks where we make many connections back to back.

The fix is conceptually fairly simple, but I would like to make server sockets use edge-trigger mode unless we call net/accept-loop, in which case they switch to level-trigger mode, and to make sure we fix kqueue too, which seems to have the same wrong logic here.
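
For background, here is a minimal sketch of why the trigger mode matters for accepting sockets (illustrative C, not Janet's actual code; listen_fd is assumed to be non-blocking and handle_connection is a made-up placeholder). With an edge-triggered registration, a readiness notification arrives once per state change, so the accept path must drain the backlog; with level triggering, the poller keeps reporting the fd while any connection is still queued.

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>

static void handle_connection(int fd) { (void)fd; /* hand off to the application */ }

/* Edge-triggered (EPOLLET / EV_CLEAR): one notification may cover a burst of
 * connections, so accept() must loop until the backlog is empty, or the
 * remaining connections are never reported again. */
static void on_ready_edge_triggered(int listen_fd) {
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) break; /* backlog drained */
            break; /* real error; a real server would handle or log it */
        }
        handle_connection(conn);
    }
}

/* Level-triggered (no EPOLLET / no EV_CLEAR): accepting a single connection
 * per wakeup is enough, because the poller keeps reporting the listening fd
 * as long as any connection remains queued. */
static void on_ready_level_triggered(int listen_fd) {
    int conn = accept(listen_fd, NULL, NULL);
    if (conn >= 0) handle_connection(conn);
}

An accept path that takes one connection per wakeup, like the second function, silently loses connections when paired with an edge-triggered registration, which matches the commit message further down.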

@bakpakin
Member

POC fix that just makes all server sockets run in level-trigger mode (so not a "real" solution, as it would cause a busy loop in the case of (forever (net/accept ...))).

diff --git a/src/core/ev.c b/src/core/ev.c
index 335b9738..e373c3f1 100644
--- a/src/core/ev.c
+++ b/src/core/ev.c
@@ -1543,8 +1543,12 @@ static JanetTimestamp ts_now(void) {
 /* Wait for the next event */
 static void janet_register_stream(JanetStream *stream) {
     struct epoll_event ev;
-    ev.events = EPOLLET;
-    if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) ev.events |= EPOLLIN;
+    if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+        ev.events = EPOLLIN; /* Do NOT edge trigger server sockets */
+    } else {
+        ev.events = EPOLLET;
+    }
+    if (stream->flags & JANET_STREAM_READABLE) ev.events |= EPOLLIN;
     if (stream->flags & JANET_STREAM_WRITABLE) ev.events |= EPOLLOUT;
     ev.data.ptr = stream;
     int status;
@@ -1695,7 +1699,9 @@ static void timestamp2timespec(struct timespec *t, JanetTimestamp ts) {
 void janet_register_stream(JanetStream *stream) {
     struct kevent kevs[2];
     int length = 0;
-    if (stream->flags & (JANET_STREAM_READABLE | JANET_STREAM_ACCEPTABLE)) {
+    if (stream->flags & JANET_STREAM_ACCEPTABLE) {
+        EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, stream);
+    } else if (stream->flags & JANET_STREAM_READABLE) {
         EV_SETx(&kevs[length++], stream->handle, EVFILT_READ, EV_ADD | EV_ENABLE | EV_CLEAR, 0, 0, stream);
     }
     if (stream->flags & JANET_STREAM_WRITABLE) {
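For readers unfamiliar with kqueue: EV_CLEAR is the kqueue analogue of EPOLLET, so omitting it for acceptable streams in the second hunk registers them level-triggered, mirroring the epoll change above.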

bakpakin added a commit that referenced this issue Apr 27, 2024
In the edge-trigger mode before this change, if a socket receives 2 connections before one can be handled, then only a single connection is handled, and in some cases the other connection will never be handled. Reverting to level-trigger mode makes this impossible.
bakpakin added a commit that referenced this issue Apr 27, 2024
@bakpakin
Member

@psagers, @l63l or @pepe, let me know if this fixes the problem (especially the musl builds that I can't easily verify) and we can close this out.

@psagers

psagers commented Apr 27, 2024

This appears to be working for me. Thanks!

@l63l
Author

l63l commented Apr 27, 2024

musl builds are also fixed. Thanks.

bakpakin added a commit that referenced this issue Jun 7, 2024
This doesn't seem to reintroduce the original issue. There was
definitely some interplay with #1431.

Doing git bisect landed me at commit
2f0c789 as the first bad commit for
issue #1452.