pdns recursor crash on startup, the probability of a crash is relatively high #11965

Closed
zjs604381586 opened this issue Sep 16, 2022 · 6 comments

Comments

@zjs604381586
Contributor

zjs604381586 commented Sep 16, 2022

  • system version: Debian 9

  • kernel version: Linux 4.19.117.bsk.10-amd64

  • pdns recursor version: 4.7.1

  • gcc/g++ version: 6.3.0

  • crash infos:
    Thread 8 "pdns-r/distr" received signal SIGSEGV, Segmentation fault.

  • crash code position:

    • FileName: syncres.cc
    • in class: nsspeeds_t
    • line code: 227
      [image]
  • problem causes:
    The lambda uses the variable now, which lives on the stack, after it may already have been released: now refers to the member d_now of the SyncRes object sr, so once sr is destroyed d_now is gone too, the lambda accesses invalid memory, and the program crashes.

  • correct code:
    Modify line 227 as follows, so that the lambda captures now by value:

    ind.modify(it, [now](DecayingEwmaCollection& d) { d.d_lastget = now; });
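
    To illustrate the difference the proposed change relies on, here is a minimal, self-contained sketch (not PowerDNS code). Entry and captureDemo are hypothetical names, and a plain long stands in for the real timestamp type. It only shows that a by-value capture keeps the value seen when the lambda was created, while a by-reference capture reads the variable when the lambda runs, which is only safe while that variable is still alive.

```cpp
// Hypothetical stand-in for the record whose d_lastget field is updated.
struct Entry {
  long d_lastget = 0;
};

// Result of the demonstration: what each kind of capture wrote.
struct CaptureResult {
  long byValue;
  long byReference;
};

// Creates two lambdas, changes `now`, then runs both lambdas.
CaptureResult captureDemo() {
  long now = 100;

  // Copies `now` into the closure at this point.
  auto stampByValue = [now](Entry& e) { e.d_lastget = now; };
  // Stores a reference; `now` must outlive every call of this lambda.
  auto stampByReference = [&now](Entry& e) { e.d_lastget = now; };

  now = 200;  // change the variable after both lambdas exist

  Entry a;
  Entry b;
  stampByValue(a);      // writes the copied value
  stampByReference(b);  // writes the current value of `now`
  return CaptureResult{a.d_lastget, b.d_lastget};
}
```

    In this sketch both lambdas run while now is still in scope, so neither access is invalid; whether the real code can run its lambda after the SyncRes object is gone is a separate question.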
    
@omoerbeek
Member

Thanks for the report. It's not clear to me yet what sequence of events could cause the scenario you describe. I'm also wondering why you are seeing this and others are not. Do you have a backtrace, perhaps? The configuration file would also be nice. Before fixing this I would like to fully understand it and perhaps write a regression test for it.

@omoerbeek omoerbeek added this to the rec-4.8.0 milestone Sep 17, 2022
@zjs604381586
Contributor Author

bt:
#8 0x00005555559419ef in SyncRes::shuffleInSpeedOrder (this=this@entry=0x7fffd405cec0, tnameservers=std::unordered_map with 13 elements = {...}, prefix="", auth=...) at syncres.cc:1808
#9 0x000055555591b62b in SyncRes::doResolveAt (this=this@entry=0x7fffd405cec0, nameservers=..., auth=..., flawedNSSet=, flawedNSSet@entry=false, qname=..., qtype=..., ret=...,
depth=, beenthere=..., state=, stopAtDelegation=) at syncres.cc:3749
#10 0x000055555591dd76 in SyncRes::doResolveNoQNameMinimization (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., ret=..., depth=0, beenthere=..., state=,
fromCache=, stopAtDelegation=) at syncres.cc:934
#11 0x000055555591fce4 in SyncRes::doResolve (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., ret=std::vector of length 0, capacity 0, depth=depth@entry=0,
beenthere=std::set with 1 elements = {...}, state=) at syncres.cc:767
#12 0x0000555555920e74 in SyncRes::beginResolve (this=this@entry=0x7fffd405cec0, qname=..., qtype=..., qclass=qclass@entry=1, ret=std::vector of length 0, capacity 0) at syncres.cc:164
#13 0x0000555555921568 in SyncRes::getRootNS(timeval, std::function<int (ComboAddress const&, DNSName const&, int, bool, bool, int, timeval*, boost::optional&, boost::optional<ResolveContext const&>, LWResult*, bool*)>) (now=..., asyncCallback=...) at syncres.cc:4042
#14 0x0000555555842df4 in houseKeeping () at pdns_recursor.cc:3005
#15 0x000055555586756b in MTasker<PacketID, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::makeThread(void ()(void), void*)::{lambda()#1}::operator()() const (__closure=0x7fffd405d5a8) at mtasker.cc:284
#16 boost::detail::function::void_function_obj_invoker0<MTasker<PacketID, std::__cxx11::basic_string<char, std::char_traits, std::allocator > >::makeThread(void ()(void), void*)::{lambda()#1}, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /usr/include/boost/function/function_template.hpp:159
#17 0x000055555581b479 in boost::function0::operator() (this=0x7fffd405d5a0) at /usr/include/boost/function/function_template.hpp:771
#18 threadWrapper (t=...) at mtasker_fcontext.cc:144
#19 0x00007ffff773ae6b in make_fcontext () from /usr/lib/x86_64-linux-gnu/libboost_context.so.1.62.0
#20 0x0000000000000000 in ?? ()

code:
[image]

describe:
You don't need to look at the stack; the problem shows up when analyzing the code logic. When the sr.beginResolve call ends, the getRootNS function also returns, and at that point the sr object is destructed. If the lambda in the fastest function of nsspeeds_t has not been executed by then, it will access invalid data and the process will crash.

@omoerbeek
Member

Is this a backtrace of the crash you are referring to?
I see shuffleInSpeedOrder being executed, but getRootNS and beginResolve are on the stack. So the SyncRes object is still alive.

At this moment I still have trouble seeing how fastest and the lambda could be executed while the corresponding SyncRes has gone out of scope. beginResolve is a synchronous function: it returns only after its work is done (even though the name would suggest async execution).
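
The lifetime argument can be sketched in a few lines. This is a simplified model, not the PowerDNS source: Entry, modifyNow, Resolver and lookupOnce are hypothetical stand-ins, and a long replaces the real timestamp type. The assumption being illustrated is that the modifier passed to a modify-style call runs before that call returns, so a lambda holding a reference into the resolver object is finished while the object is still on the stack.

```cpp
// Hypothetical stand-in for the record stored in the speeds table.
struct Entry {
  long d_lastget = 0;
};

// Hypothetical stand-in for a modify-style call: the modifier is
// applied immediately, before modifyNow returns.
template <typename Modifier>
void modifyNow(Entry& e, Modifier mod) {
  mod(e);
}

// Hypothetical stand-in for the resolver object that owns d_now.
struct Resolver {
  long d_now;

  // Synchronous: returns only after the lambda has run.
  long resolve(Entry& e) {
    const long& now = d_now;  // reference to a member of *this
    modifyNow(e, [&now](Entry& d) { d.d_lastget = now; });
    return e.d_lastget;
  }
};

// Hypothetical stand-in for the caller that creates the resolver on
// its stack and destroys it on return.
long lookupOnce(long currentTime) {
  Entry e;
  Resolver sr{currentTime};
  // sr is alive for the whole call, so the reference captured by the
  // lambda is valid every time the lambda executes.
  return sr.resolve(e);
}
```

A crash of the kind described would need the lambda to be stored and invoked after lookupOnce has returned, which this sketch never does.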

@omoerbeek
Member

I have thought about this a bit more but still have trouble seeing how the circumstances you describe could happen: SyncRes being out of scope while fastest is being executed.

I really would appreciate both a config file and a full backtrace (not leaving out the topmost frames) of an actual crash you observed.

@omoerbeek
Member

Hello @zjs604381586, it has been a week since my questions. Do you have answers?

@zjs604381586
Contributor Author

I looked at the logic again; maybe my analysis is wrong. Sorry.

@omoerbeek omoerbeek removed this from the rec-4.8.0 milestone Sep 26, 2022

2 participants