FastRouter segfaults #146
Comments
I've recompiled the whole of uWSGI to make sure it's not a build issue. I'll report if it still occurs. |
After a clean install it still happens, so I'm 100% sure it's not a build issue:
|
Do you have a minimal command line to reproduce it? Thanks |
This is the config I'm using:
I'm also trying to reproduce it on my dev cluster, but so far I can't trigger it with synthetic benchmarks. Maybe it happens when a client disconnects in the middle of a request or something. |
I will investigate this tomorrow; sorry, today was a busy day (I skipped 4 releases :P) |
There are two uwsgi_buffer_destroy() calls in uwsgi_cr_del_peer. Since it looks like out_need_free is never set, the failing one should be the call on peer->in. Or we are corrupting memory. Any chance you can compile with debug = True and run it under valgrind --tool=memcheck? |
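To make the double-destroy concern above concrete, here is a minimal sketch of a destroy helper that tolerates a NULL or already-released buffer. Everything in it (`demo_buffer`, `demo_peer`, `demo_buffer_destroy`, `demo_peer_release`) is a hypothetical stand-in written for illustration; it is not the uWSGI implementation of uwsgi_buffer_destroy() or uwsgi_cr_del_peer, and only the field names in, out and out_need_free are borrowed from the comment above.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; NOT the real uWSGI structures. */
struct demo_buffer {
    char *buf;
    size_t len;
};

struct demo_peer {
    struct demo_buffer *in;
    struct demo_buffer *out;
    int out_need_free;
};

/* Allocate a buffer of len bytes; returns NULL on allocation failure. */
static struct demo_buffer *demo_buffer_new(size_t len) {
    struct demo_buffer *ub = calloc(1, sizeof(*ub));
    if (!ub) return NULL;
    ub->buf = malloc(len);
    if (!ub->buf) {
        free(ub);
        return NULL;
    }
    ub->len = len;
    return ub;
}

/* Free *slot and reset it to NULL, so a second call on the same slot is a
 * harmless no-op. Returns 1 if something was freed, 0 otherwise. */
static int demo_buffer_destroy(struct demo_buffer **slot) {
    if (!slot || !*slot) return 0;
    free((*slot)->buf);
    free(*slot);
    *slot = NULL;
    return 1;
}

/* Release the peer's buffers; returns how many buffers were freed. */
static int demo_peer_release(struct demo_peer *peer) {
    int freed = 0;
    freed += demo_buffer_destroy(&peer->in);
    if (peer->out_need_free) freed += demo_buffer_destroy(&peer->out);
    return freed;
}
```

Passing the address of the pointer lets the helper clear it, so a peer whose in buffer was never allocated, or was already released, cannot be freed twice through this path. It does not protect against a pointer that is non-NULL but invalid, which is the memory corruption case mentioned above.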
I would feel bad doing this on my production routers, but I couldn't trigger it on my dev cluster, so I fear it's the only way. I'll see what I can do. |
I'd try with debug + valgrind on your dev cluster, maybe we'll have more luck :) Otherwise you can wait for Roberto, since he is way more clueful than me on this issue. |
So far only those two errors are repeating; I'm waiting for a segfault:
|
I'm running it on my less busy production cluster, so I might need to move it to a more loaded one (or just wait a while) |
On 22/02/2013 09:49, Łukasz Mierzwa wrote:
This is interesting because confirms there's something around |
So far no segfault, only those errors above are showing from time to time. I'll leave it running, maybe I need more traffic (it's 10 AM so not much is going on right now). If I can't reproduce it I will disable debug and retry. If I still can't get any segfault then it might have been fixed already (or I'm out of luck). |
Side note:
"0.0.0.0" doesn't seem valid; it looks like it's currently hard-coded to this value. Same story with the port. |
So far no issues, so it might be gone |
No issues after 6 hours, but I need to revert to 1.4.5. I don't want to leave it running unattended for the whole weekend. |
valgrind log, in case it's useful: |
Doesn't happen with the latest 1.9, closing |
Issue is still present, reopening. |
Can you confirm the backtrace is always the same? |
uwsgi_master_check_gateways_death() does not always show in the backtrace, but besides that I don't see any differences. It always ends with:
|
I've got another segfault in FastRouter:
This might be related to malloc() issues observed before (?) |
Another crash (spotted thanks to airbrake), this time different but might be related:
|
After looking at the code, it seems more and more likely that it's what @xrmx already suggested: peer->in is somehow invalid. Maybe it gets NULLed in this case? |
I have just changed the retry system so that it applies in a narrower window: peer->can_retry is set before the backend connection and reset to 0 after the connection. It looks like the connection happens while the buffer is not initialized if an error occurs during the first fastrouter bytes. |
I'll push patched FastRouters to my production nodes tomorrow to verify if it helps |
Another crash in FastRouter. I had another hang (#239) during high traffic, so I restarted FastRouter, and soon after the restart I had this segfault:
Is it safe to use a low fastrouter-timeout (like 15 seconds)? Won't it affect client connections to FastRouter? |
Is the frequency of crashes reduced after my patch? Is it possible that we have found another bug while the previous one has been solved? |
I believe so. The previous crash has not occurred since the last patch, and this one is clearly in a different place. |
Another segfault:
uwsgi_hooked_parse() is called during subscription packet parsing, so maybe this is related to #239 (?) |
I suspect #146 and #239 are the same problem. What about https://github.com/unbit/uwsgi/blob/master/plugins/corerouter/cr_common.c#L120 and https://github.com/unbit/uwsgi/blob/master/plugins/corerouter/cr_common.c#L193? uwsgi_hooked_parse is not checked for errors there (it can return -1 if the packet is malformed). Maybe a corrupted uwsgi packet is possible. Can you retry after adding an uwsgi_log call for when uwsgi_hooked_parse returns -1? |
I've added uwsgi_error for a <0 return value; once it hangs again I'll check for such errors |
Any news on this problem? I still have not been able to reproduce it :( |
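The return-value check discussed in the two comments above can be sketched as follows. This is a simplified, hypothetical stand-in and not uwsgi_hooked_parse itself: `demo_parse`, `demo_hook`, `demo_count_hook` and `demo_handle_packet` are names invented for the example, the layout (16-bit little-endian length, key, 16-bit little-endian length, value) is only assumed to resemble the uwsgi key/value body, and rejecting empty keys is a choice made here for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical callback type: invoked once per parsed key/value pair. */
typedef void (*demo_hook)(const char *key, uint16_t keylen,
                          const char *val, uint16_t vallen, void *data);

/* Read a 16-bit little-endian length from two bytes. */
static uint16_t demo_le16(const unsigned char *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Simplified stand-in parser. Returns 0 on success and -1 as soon as a
 * length field points past the end of the buffer (malformed packet). */
static int demo_parse(const char *buf, size_t len, demo_hook hook, void *data) {
    const unsigned char *p = (const unsigned char *)buf;
    size_t pos = 0;
    while (pos < len) {
        if (len - pos < 2) return -1;
        uint16_t keylen = demo_le16(p + pos);
        pos += 2;
        if (keylen == 0 || len - pos < keylen) return -1;
        const char *key = buf + pos;
        pos += keylen;
        if (len - pos < 2) return -1;
        uint16_t vallen = demo_le16(p + pos);
        pos += 2;
        if (len - pos < vallen) return -1;
        if (hook) hook(key, keylen, buf + pos, vallen, data);
        pos += vallen;
    }
    return 0;
}

/* Example hook: counts the pairs seen so far. */
static void demo_count_hook(const char *key, uint16_t keylen,
                            const char *val, uint16_t vallen, void *data) {
    (void)key; (void)keylen; (void)val; (void)vallen;
    (*(int *)data)++;
}

/* Caller that never ignores the parser's return value: a malformed
 * packet is logged and dropped instead of being processed further. */
static int demo_handle_packet(const char *buf, size_t len, int *pairs) {
    *pairs = 0;
    if (demo_parse(buf, len, demo_count_hook, pairs) < 0) {
        fprintf(stderr, "malformed packet (%zu bytes), dropping\n", len);
        return -1;
    }
    return 0;
}
```

Note that the hook may already have run for the valid pairs that precede a malformed one, so a caller that builds state in the hook has to discard that state when -1 comes back.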
I was (un)lucky and this issue didn't occur since my last post here, still waiting for more data |
Turns out I had those issues, both segfaults with But no trace of those uwsgi_error() I've added to failing uwsgi_hooked_parse() [1], so it's something else. 1:
|
This time I've got some new errors:
but they probably orbit around the same core issue, hang from #239 occurs between hangs, so both issues are certainly connected. Can you think of any more debug logs I could add? |
I think the best approach at this point would be generating a coredump file you can inspect with gdb. Feel free to send it to me if you want |
...oh, and remember to add -g to the CFLAGS when you build, like CFLAGS=-g make. This avoids fully enabling the UWSGI_DEBUG log lines |
I've set up everything so I should now get core dumps. I'll get back once there is some more info |
In case it is useful, I have added --use-abort. I have noticed that on some systems resetting SIGSEGV to SIG_DFL does not restore core dump generation, while abort() works reliably (at least on Linux) |
I haven't had any crash since I've pushed
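As an illustration of the general idea in the comment above (this is not how --use-abort is implemented in uWSGI, and all `demo_` names are hypothetical), a process can raise its core-size soft limit and install a SIGSEGV handler that calls abort(). The sketch assumes a POSIX system; whether a core file is actually written still depends on system settings outside the process, such as the hard limit and the kernel's core pattern.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Raise the soft core-size limit to the hard limit so that a core file
 * can be written. Returns 0 on success, -1 on failure. */
static int demo_enable_core_dumps(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0) return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* SIGSEGV handler that ends the process via abort(); abort() is
 * async-signal-safe and its default action is a core dump. */
static void demo_segv_handler(int signum) {
    (void)signum;
    abort();
}

/* Install the handler above. Returns 0 on success, -1 on failure. */
static int demo_install_abort_on_segv(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_segv_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call `demo_enable_core_dumps()` and `demo_install_abort_on_segv()` once at startup, before any worker processes are forked, so that children inherit both the limit and the handler.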
No other errors recorded, no malloc error, it just gets respawned (I do have |
ok, thanks |
I haven't had any malloc() error since I recompiled with -O0. Could this somehow be triggered by @unbit vs gcc optimizations trying to outsmart each other? It seems unlikely, but I always had corrupted memory when the hang occurred, and now I only get the hang. |
AFAIR this was fixed in #415, closing |
Current version from git master branch.