SIGSEGV in futures handler under linux #28
We're implementing an event store with two basic operations: appending events to a stream, and reading a stream's events back.
In our test we have 10 readers per writer, and we're currently at about 3k writes per second. We're currently not using atomics, only promises.
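A layer with this shape might look roughly like the following sketch on top of node-foundationdb. The appendEvent/readEvents names and the key layout are hypothetical, not the actual implementation; the relevant property is that the per-stream sequence counter is read and written back inside the same transaction, using promises rather than atomic ops:

```ts
import * as fdb from 'foundationdb'

fdb.setAPIVersion(600)
const db = fdb.open()  // assumes the default cluster file points at a reachable cluster

// Append an event to a stream. The sequence counter is read and then written
// back in the same transaction, so concurrent appends to one stream conflict
// and are retried by doTransaction().
async function appendEvent(stream: string, payload: Buffer) {
  await db.doTransaction(async tn => {
    const seqKey = fdb.tuple.pack(['seq', stream])
    const prev = await tn.get(seqKey)
    const seq = prev ? prev.readUInt32BE(0) + 1 : 0

    const seqBuf = Buffer.alloc(4)
    seqBuf.writeUInt32BE(seq, 0)
    tn.set(seqKey, seqBuf)
    tn.set(fdb.tuple.pack(['events', stream, seq]), payload)
  })
}

// Read back every event in a stream, in sequence order.
async function readEvents(stream: string) {
  return db.doTransaction(async tn => {
    const prefix = fdb.tuple.pack(['events', stream])
    const end = Buffer.concat([prefix, Buffer.from([0xff])])
    const rows = await tn.getRangeAll(prefix, end)
    return rows.map(([, value]) => value)
  })
}
```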
It's on a different machine. We have dedicated machines for the FDB cluster in the cloud.
I'll try to make a minimal implementation to reproduce this error, because our layer has already become pretty sophisticated.
Awesome; that'd be great. It sounds similar to what I'm doing in statecraft. Given what you've said, there's a decent chance we can repro this with the binding tester. I'll spin up an ubuntu machine with the stack and run it overnight to see if I can make it crash.
Here's a minimal version of our event store: https://gist.github.com/aikoven/b6788f3ce0d146302d1a9d9201576bb7 I've run it locally (in Docker) for several hours and it didn't crash. Then we deployed 12 instances of this script to the cloud and got a segfault within 5 minutes. In both cases, the client part runs in Docker. As for FDB, we have a cluster of 20 FDB processes in the cloud. For local development, I use a single-process database from the official FDB Docker image.
We purposely write events to the same stream to simulate contention. So the crash may have something to do with transaction retries.
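To make the contention scenario concrete, a hypothetical load driver along these lines (reusing the appendEvent/readEvents sketch above) has every writer appending to one shared stream, so most transactions conflict on the sequence key and go back through the client's retry loop:

```ts
// Hypothetical load driver: hammer one stream from several concurrent writers,
// plus 10 readers per writer, so most append transactions conflict and retry.
const STREAM = 'load-test-stream'
const WRITERS = 10
const READERS_PER_WRITER = 10

async function writeForever(id: number) {
  for (let i = 0; ; i++) {
    await appendEvent(STREAM, Buffer.from(JSON.stringify({ writer: id, i })))
  }
}

async function readForever() {
  for (;;) {
    await readEvents(STREAM)
  }
}

const tasks: Promise<void>[] = []
for (let w = 0; w < WRITERS; w++) {
  tasks.push(writeForever(w))
  for (let r = 0; r < READERS_PER_WRITER; r++) tasks.push(readForever())
}
Promise.all(tasks).catch(err => { console.error(err); process.exit(1) })
```

Under that kind of load most appends conflict and are retried, which is the behaviour being simulated here.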
Small update: I've been running 5 instances of this script for the last 6 hours on a little freebsd box I have lying around, with node v10.15.2 and fdb 6.1.0. I was hoping I could repro this problem, but no dice - it's stable. I'll fire up an ubuntu box next. Either this issue is linux-only, the bug is specific to fdb 6.0.x, or it needs more latency to show up. Or maybe it specifically needs node 10.13.
Small update: Managed to reproduce this on ubuntu 17.10 / node 10.13.0 / fdb 6.0.18. Still no repro on freebsd / fdb 6.1. There's probably a hint there, but either way I can work on it now that I can reproduce the problem. Thanks for your help making a test case, @aikoven. It would have been much harder to debug this issue without it.
I've been playing with this for a couple of days and I'm still not 100% sure what's causing it. At this point I suspect it's a bug in libuv, but I'm not entirely sure. The crash is on this line in libuv. Under normal behaviour, the following things happen:
Instead what is happening is that sometimes:
This makes sense given that there aren't any read barriers in that code. What doesn't make sense is the assembly listing for uv_async_send:
0000000000aab790 <uv_async_send>:
aab790: mov 0x78(%rdi),%eax
aab793: test %eax,%eax
aab795: jne aab7a5 <uv_async_send+0x15>
aab797: mov $0x1,%edx
aab79c: lock cmpxchg %edx,0x78(%rdi)
aab7a1: test %eax,%eax
aab7a3: je aab7b0 <uv_async_send+0x20>
aab7a5: xor %eax,%eax
aab7a7: retq
aab7a8: nopl 0x0(%rax,%rax,1)
aab7b0: push %rbp
aab7b1: mov %rsp,%rbp
aab7b4: push %r14
aab7b6: push %r13
aab7b8: push %r12
aab7ba: push %rbx
aab7bb: mov 0x8(%rdi),%rax
aab7bf: mov 0x200(%rax),%ebx # <---- CRASH HERE
aab7c5: mov 0x1f8(%rax),%edx
aab7cb: cmp $0xffffffff,%ebx
aab7ce: je aab820 <uv_async_send+0x90>
aab7d0: mov $0x1,%r13d
aab7d6: mov $0x1,%r14d
aab7dc: mov $0x1d727ef,%r12d
aab7e2: jmp aab7f4 <uv_async_send+0x64>
aab7e4: nopl 0x0(%rax)
aab7e8: callq 8d9f70 <__errno_location@plt>
aab7ed: mov (%rax),%eax
aab7ef: cmp $0x4,%eax
aab7f2: jne aab840 <uv_async_send+0xb0>
aab7f4: mov %r13,%rdx
aab7f7: mov %r12,%rsi
aab7fa: mov %ebx,%edi
aab7fc: callq 8d9fb0 <write@plt> # <---- WRITE SYSCALL HERE
aab801: cmp $0xffffffff,%eax
aab804: movslq %eax,%rdx
aab807: je aab7e8 <uv_async_send+0x58>
aab809: cmp %rdx,%r14
aab80c: jne aab845 <uv_async_send+0xb5>
aab80e: pop %rbx
aab80f: xor %eax,%eax
aab811: pop %r12
aab813: pop %r13
aab815: pop %r14
aab817: pop %rbp
aab818: retq
aab819: nopl 0x0(%rax)
aab820: mov %edx,%ebx
aab822: mov $0x8,%r13d
aab828: mov $0x8,%r14d
aab82e: mov $0x1c00500,%r12d
aab834: jmp aab7f4 <uv_async_send+0x64>
aab836: nopw %cs:0x0(%rax,%rax,1)
aab840: cmp $0xb,%eax
aab843: je aab80e <uv_async_send+0x7e>
aab845: callq 8d97b0 <abort@plt>
aab84a: nopw 0x0(%rax,%rax,1)
The crash happens at instruction aab7bf, the mov marked "CRASH HERE" above. A few notes on this:
I really appreciate your working on this. Unfortunately I'm not experienced enough in these low-level matters, so I can't give any valuable comment here. But please tell me if you need any assistance from my side 🙂
Sure, thanks 😅
Update: the bug is actually a long-standing, previously unknown bug in libuv. @bnoordhuis has put together a PR to fix the issue with a spin lock (just like I was suggesting above). In the meantime I've put together a simple spin lock around the node fdb code which seems to fix the issue (I ran it for a few hours and it worked fine). I'll push it and publish tomorrow. That should resolve the crashes you're seeing in prod. Meanwhile it looks like I'm using async_send in a really inefficient way. Rewriting all this code on top of N-API will probably result in a significant performance boost because it will cut down on async handle allocations and syscalls.
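The actual workaround lives in the binding's native code, but the spin-lock idea itself is simple. Purely as an illustration of the technique (not the binding's or libuv's code), a compare-and-swap spin lock can be sketched in Node with Atomics on a SharedArrayBuffer:

```ts
// Illustration of the spin-lock idea only; the real workaround sits in the
// binding's native code around its uv_async_send usage, not in JS.
const lockWord = new Int32Array(new SharedArrayBuffer(4))  // 0 = unlocked, 1 = locked

function lock(l: Int32Array) {
  // Atomically swap 0 -> 1; keep spinning while another thread holds the lock.
  while (Atomics.compareExchange(l, 0, 0, 1) !== 0) { /* busy-wait */ }
}

function unlock(l: Int32Array) {
  Atomics.store(l, 0, 0)
}

// Usage, e.g. with the same SharedArrayBuffer passed to worker_threads workers:
lock(lockWord)
try {
  // critical section: only one thread at a time executes this
} finally {
  unlock(lockWord)
}
```

The native workaround presumably does the equivalent with an atomic flag in C, so that only one thread at a time goes through the code that signals the event loop.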
I've just published 📦 0.9.2 with a workaround for this issue via spin locks while we wait for the new libuv code to trickle through to nodejs. @aikoven can you grab 0.9.2 and confirm that it fixes your bug?
This is amazing! Thank you. We stopped our test cluster for now because we've finished with our load test. So, unfortunately, I can't check the fix right now. Let's consider this fixed for now, and I will check back once we have a new cluster. Thanks again! |
Great. Let me know if you still run into problems! |
Hey, we're continuing our FDB project, and I can confirm that we don't get these errors anymore. Thanks! |
This is an issue tracking apple/foundationdb#1294. It seems like there's a race condition or something affecting the uv_async_send thread handling code, which under rare conditions causes a crash.