null pointer in services/outside_network.c:160 reuse_cmp / rbtree_find_less_equal in Unbound 1.13.0 release #411

Closed
jcjones opened this issue Jan 27, 2021 · 35 comments

@jcjones

jcjones commented Jan 27, 2021

Using my amd64 linux on centos7 build from #393 (comment) with these commits added to 1.13.0:

I am still getting rare crashes. I've caught one here, in reuse_cmp having a nullptr for key2, coming from node->key:

unbound/util/rbtree.c

Lines 525 to 528 in ca49781

/* While there are children... */
while (node != RBTREE_NULL) {
r = rbtree->cmp(key, node->key);
if (r == 0) {

The backtrace is:

(gdb) bt
#0  reuse_cmp_addrportssl (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:144
#1  0x000055e21f6ba7c1 in reuse_cmp (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
#2  0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
#3  0x000055e21f6baf0c in reuse_tcp_find (outnet=outnet@entry=0x7fdd7a428090, addr=addr@entry=0x7fdd6718d6f0, addrlen=16, use_ssl=<optimized out>) at services/outside_network.c:480
#4  0x000055e21f6bbf5f in use_free_buffer (outnet=outnet@entry=0x7fdd7a428090) at services/outside_network.c:723
#5  0x000055e21f6bc4fb in outnet_tcp_cb (c=0x7fdd61fd0ce0, arg=0x7fdd61fd0bb0, error=<optimized out>, reply_info=0x7fdd61fd0d18) at services/outside_network.c:1095
#6  0x000055e21f6b4087 in tcp_callback_reader (c=0x7fdd61fd0ce0) at util/netevent.c:1144
#7  0x000055e21f6b5548 in comm_point_tcp_handle_read (fd=217, c=0x7fdd61fd0ce0, short_ok=0) at util/netevent.c:1668
#8  0x000055e21f6b584b in comm_point_tcp_handle_callback (fd=217, event=<optimized out>, arg=0x7fdd61fd0ce0) at util/netevent.c:2062
#9  0x00007fde30723a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x000055e21f6b1fac in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#11 0x000055e21f62e499 in worker_work (worker=worker@entry=0x55e2216803b0) at daemon/worker.c:1941
#12 0x000055e21f6222bf in thread_start (arg=0x55e2216803b0) at daemon/daemon.c:540
#13 0x00007fde3009bea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fde2fdc496d in clone () from /lib64/libc.so.6

At the null pointer, *node is normal except for the null:

(gdb) up
#1  0x000055e21f6ba7c1 in reuse_cmp (key1=0x7fde2e2737e8, key2=0x0) at services/outside_network.c:160
160		r = reuse_cmp_addrportssl(key1, key2);
(gdb) up
#2  0x000055e21f6759ce in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fdd7a428198, key=key@entry=0x7fde2e2737e8, result=result@entry=0x7fde2e2737c8) at util/rbtree.c:527
527			r = rbtree->cmp(key, node->key);
(gdb) p *node
$2 = {parent = 0x7fdd61a09468, left = 0x55e21f922460 <rbtree_null_node>, right = 0x55e21f922460 <rbtree_null_node>, key = 0x0, color = 1 '\001'}

The core file is 6.36 GB; I can certainly share it and the centos7 rpm files out-of-band if you'd like to investigate directly, or I am happy to dig around in the core file in response to your questions. Thanks again!

@gthess
Member

gthess commented Jan 27, 2021

Could you provide the configuration in use?

@gthess gthess self-assigned this Jan 27, 2021
@jcjones
Author

jcjones commented Jan 27, 2021

@gthess
Member

gthess commented Jan 27, 2021

Thanks! At least this excludes a path I was looking at: forwarders and tls configuration.
The fault here is a node with a NULL key that is somehow still part of the tree; still looking into it.
If you notice a way to reliably reproduce it that would be a big plus.
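
To make the failure mode concrete, here is a minimal sketch of a less-or-equal lookup over a binary search tree with a sentinel leaf, loosely modelled on the loop quoted from util/rbtree.c in the first comment. It is not Unbound's code: `struct node`, `NIL`, `cmp_int` and `find_less_equal` are hypothetical names, and the explicit NULL-key check is only an assumption about how a lookup could fail safely, not something the project is known to do.

```c
#include <stddef.h>

/* Hypothetical, simplified tree node; not Unbound's rbnode_type. */
struct node {
    struct node *left, *right;
    const void *key;
};

/* Sentinel leaf, playing the role of RBTREE_NULL. */
static struct node NIL = { &NIL, &NIL, NULL };

/* Comparator that dereferences both keys, much like a real key compare. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/*
 * Find the greatest node whose key is <= key.
 * Returns 1 on an exact match, 0 otherwise (*result is the predecessor
 * or NULL), and -1 if a linked node has a NULL key, which would crash
 * the comparator if it were called.
 */
static int find_less_equal(struct node *root, const void *key,
                           struct node **result)
{
    struct node *n = root;
    *result = NULL;
    while (n != &NIL) {
        int r;
        if (n->key == NULL)
            return -1; /* corrupt tree: node still linked, key cleared */
        r = cmp_int(key, n->key);
        if (r == 0) {
            *result = n;
            return 1;
        }
        if (r < 0) {
            n = n->left;
        } else {
            *result = n; /* best candidate so far */
            n = n->right;
        }
    }
    return 0;
}
```

Without the NULL check, the third case is the one in the backtrace: the walk reaches a node that looks normal apart from `key = 0x0`, and the comparator dereferences it.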

@jcjones
Author

jcjones commented Jan 27, 2021

I'm afraid a reproducer is not going to be likely given the nature of the input data to these instances; sorry. Still, if something comes up, I'll let you know.

@gthess
Member

gthess commented Feb 2, 2021

Hi @jcjones, I wasn't able to reproduce this. We have an attempted fix that is included in the ongoing release candidate for unbound 1.13.1. There is a branch for it; it would be great if you could try it out.
Other than that, an extra question: do you use unbound-control to alter the configuration on the fly?

@jcjones
Author

jcjones commented Feb 2, 2021

No, we're not using unbound-control. I'll give the tag a try ASAP, though it may be a few days before I have results. Thanks!

@jcjones
Author

jcjones commented Feb 9, 2021

Just an update that we've had 24 hours of 1.13.1rc1 so far with only a stuck process (100% CPU, unfortunately didn't get a core on the restart), no segfaults yet.

@jcjones
Author

jcjones commented Feb 9, 2021

We've seen a segfault on 1.13.1rc1; I have a core dump I'll analyze tonight.

@gthess
Member

gthess commented Feb 9, 2021

Log output would also be nice if available. We added a couple extra print error cases ("internal error: ...") when trying to fix this.

@jcjones
Author

jcjones commented Feb 9, 2021 via email

jedisct1 added a commit to jedisct1/unbound that referenced this issue Feb 9, 2021
* nlnet/master:
  - Fix for Python 3.9, no longer use deprecated functions of   PyEval_CallObject (now PyObject_Call), PyEval_InitThreads (now   none), PyParser_SimpleParseFile (now Py_CompileString).
  Changelog note for 1.13.1 release and main branch is 1.13.2 in development.
  - release 1.13.1rc2 tag on branch-1.13.1 with added changes of 2 feb.
  - Fix indentation of root anchor for use by windows install script.
  Fixup to add to LIBS.
  And autoconf.
  - Fix windows dependency on libssp.dll because of default stack   protector in mingw.
  - Fix dynlibmod link on rhel8 for -ldl inclusion.
  - branch-1.13.1 is created, with release-1.13.1rc1 tag.
  - Hide our time traveling abilities.
  - Attempt to fix NULL keys in the reuse_tcp tree; relates to NLnetLabs#411.
@jcjones
Author

jcjones commented Feb 10, 2021

Stack trace:

(gdb) bt
#0  reuse_cmp_addrportssl (key1=0x7fb22db8f858, key2=0x0) at services/outside_network.c:148
#1  0x0000564e82da01c1 in reuse_cmp (key1=0x7fb22db8f858, key2=0x0) at services/outside_network.c:164
#2  0x0000564e82d5b3be in rbtree_find_less_equal (rbtree=rbtree@entry=0x7fb17a428198, key=key@entry=0x7fb22db8f858, result=result@entry=0x7fb22db8f838) at util/rbtree.c:527
#3  0x0000564e82da090c in reuse_tcp_find (outnet=outnet@entry=0x7fb17a428090, addr=addr@entry=0x7fb15a124b10, addrlen=28, use_ssl=<optimized out>) at services/outside_network.c:487
#4  0x0000564e82da19d5 in use_free_buffer (outnet=outnet@entry=0x7fb17a428090) at services/outside_network.c:740
#5  0x0000564e82da1f6b in outnet_tcp_cb (c=0x7fb161c93a80, arg=0x7fb161c93950, error=<optimized out>, reply_info=0x0) at services/outside_network.c:1112
#6  0x00007fb23003fa14 in event_base_loop () from /lib64/libevent-2.0.so.5
#7  0x0000564e82d9793c in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#8  0x0000564e82d14779 in worker_work (worker=worker@entry=0x564e846bfe30) at daemon/worker.c:1949
#9  0x0000564e82d084cf in thread_start (arg=0x564e846bfe30) at daemon/daemon.c:540
#10 0x00007fb22f9b7ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007fb22f6e098d in ?? () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

I swear I saw internal errors somewhere recently, but perhaps it wasn't unbound. None of my unbound instances have logged "internal error" in accessible logs, nor the specific messages from a8485d5.

Here's the log around the segfault, though it's pretty bare:

"2021-02-09T07:57:12.300438+00:00 unbound: [30742:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:56:07.801150+00:00 unbound: [30742:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:56:07.673111+00:00 unbound: [30742:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:55:06.567384+00:00 unbound: [30742:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:55:06.129952+00:00 unbound: [30742:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:54:02.351378+00:00 unbound: [30742:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:54:02.331246+00:00 unbound: [30742:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:54:02.247063+00:00 unbound: [30742:0] info: start of service (unbound 1.13.1rc1)."
"2021-02-09T07:54:01.651197+00:00 unbound: [30742:0] notice: init module 2: iterator"
"2021-02-09T07:54:01.650961+00:00 unbound: [30742:0] notice: init module 1: validator"
"2021-02-09T07:54:01.650680+00:00 unbound: [30742:0] notice: init module 0: subnet"
"2021-02-09T07:54:01.639870+00:00 unbound: [30742:0] warning: did not exit gracefully last time (6459)"
"2021-02-09T07:54:01.626723+00:00 systemd: Started Unbound recursive Domain Name Server."
"2021-02-09T07:54:01.566999+00:00 unbound-checkconf: unbound-checkconf: no errors in /etc/unbound/unbound.conf"
"2021-02-09T07:54:01.535324+00:00 systemd: Starting Unbound recursive Domain Name Server..."
"2021-02-09T07:54:01.533312+00:00 systemd: Stopped Unbound recursive Domain Name Server."
"2021-02-09T07:54:01.533003+00:00 systemd: unbound.service holdoff time over, scheduling restart."
"2021-02-09T07:54:01.370294+00:00 systemd: unbound.service failed."
"2021-02-09T07:54:01.370028+00:00 systemd: Unit unbound.service entered failed state."
"2021-02-09T07:54:01.369677+00:00 systemd: unbound.service: main process exited, code=killed, status=11/SEGV"
"2021-02-09T07:53:13.530533+00:00 kernel: unbound[6461]: segfault at a8 ip 0000564e82da0154 sp 00007fb22db8f7c0 error 4 in unbound[564e82cf4000+112000]"
"2021-02-09T07:52:20.096442+00:00 unbound: [6459:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:51:15.445427+00:00 unbound: [6459:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:51:15.373748+00:00 unbound: [6459:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:50:14.272442+00:00 unbound: [6459:0] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:50:14.243744+00:00 unbound: [6459:1] info: generate keytag query _ta-4f66. NULL IN"
"2021-02-09T07:49:13.625484+00:00 unbound: [6459:1] info: generate keytag query _ta-4f66. NULL IN"

@jcjones
Author

jcjones commented Mar 4, 2021

I'm still seeing these sporadically with the 1.13.1 release; they look the same. E.g.:

Core was generated by `/usr/sbin/unbound -d'.
Program terminated with signal 11, Segmentation fault.
#0  reuse_cmp_addrportssl (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:148
148		r = sockaddr_cmp(&r1->addr, r1->addrlen, &r2->addr, r2->addrlen);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-323.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-50.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libselinux-2.5-15.el7.x86_64 openssl-libs-1.0.2k-21.el7_9.x86_64 pcre-8.32-17.el7.x86_64 protobuf-c-1.0.2-3.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  reuse_cmp_addrportssl (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:148
#1  0x000055aefd4df251 in reuse_cmp (key1=0x7f3c049a6588, key2=0x0) at services/outside_network.c:164
#2  0x000055aefd49a45e in rbtree_find_less_equal (rbtree=rbtree@entry=0x7f3b52428198, key=key@entry=0x7f3c049a6588, 
    result=result@entry=0x7f3c049a6568) at util/rbtree.c:527
#3  0x000055aefd4df99c in reuse_tcp_find (outnet=0x7f3b52428090, addr=addr@entry=0x7f3b3c856808, addrlen=28, use_ssl=<optimized out>)
    at services/outside_network.c:487
#4  0x000055aefd4e322d in pending_tcp_query (sq=sq@entry=0x7f3b3c8567b0, packet=packet@entry=0x7f3b52428350, 
    timeout=timeout@entry=3000, callback=callback@entry=0x55aefd4e37d0 <serviced_tcp_callback>, 
    callback_arg=callback_arg@entry=0x7f3b3c8567b0) at services/outside_network.c:2120
#5  0x000055aefd4e3789 in serviced_tcp_initiate (sq=0x7f3b3c8567b0, buff=0x7f3b52428350) at services/outside_network.c:2807
#6  0x000055aefd4e3d51 in serviced_udp_callback (c=0x7f3b3c287800, arg=0x7f3b3c8567b0, error=<optimized out>, rep=0x7f3c049a6c30)
    at services/outside_network.c:3010
#7  0x000055aefd4e198a in outnet_udp_cb (c=0x7f3b3c287800, arg=0x7f3b52428090, error=<optimized out>, reply_info=0x7f3c049a6c30)
    at services/outside_network.c:1243
#8  0x000055aefd4d6d58 in comm_point_udp_callback (fd=159, event=<optimized out>, arg=<optimized out>) at util/netevent.c:769
#9  0x00007f3c06e56a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#10 0x000055aefd4d69cc in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#11 0x000055aefd453779 in worker_work (worker=worker@entry=0x55aefdd24e60) at daemon/worker.c:1949
#12 0x000055aefd4474cf in thread_start (arg=0x55aefdd24e60) at daemon/daemon.c:540
#13 0x00007f3c067ceea5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f3c064f79fd in clone () from /lib64/libc.so.6

No instances of "internal error" in the logs at all. No output from unbound in the logs anytime close to the segfault.

Most interesting to me is that I have two of these same segfaults occurring within 10 minutes of each other, whereas before they always seemed to need substantial time to reproduce:

$ sudo coredumpctl list
TIME                            PID   UID   GID SIG PRESENT EXE
Sat 2021-02-20 12:32:09 UTC   20030   997   993  11   /usr/sbin/unbound
Sun 2021-02-21 06:49:44 UTC   26011   997   993  11   /usr/sbin/unbound
Tue 2021-02-23 04:41:03 UTC    8149   997   993  11   /usr/sbin/unbound
Thu 2021-02-25 08:25:12 UTC   26630   997   993  11   /usr/sbin/unbound
Thu 2021-02-25 11:43:34 UTC   10285   997   993  11   /usr/sbin/unbound
Mon 2021-03-01 19:41:24 UTC   24626   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 20:43:46 UTC   20576   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 23:27:30 UTC   25400   997   993  11 * /usr/sbin/unbound
Mon 2021-03-01 23:37:13 UTC    4760   997   993  11 * /usr/sbin/unbound

@gthess
Member

gthess commented Mar 5, 2021

This is still under investigation. As reported on #439, there is now extra logging that could help pinpoint the issue.
The logging is on master (commit 269c168).
If you could compile a new version and keep an eye out for "reuse tcp delete: node not present, internal error" that could help shed some light.
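
As a rough illustration of what such a check guards against, the sketch below removes a node from a container first and clears its key only if the removal succeeded. This is an assumption about one way a "node not present" condition can be detected, not a diagnosis of the bug and not Unbound's code: `struct tiny_tree`, `tiny_delete` and `remove_and_clear` are hypothetical, a small array stands in for the red-black tree, and the log text is a shortened imitation of the real message.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_NODES 8

/* Hypothetical stand-in for the reuse tree: a small array of node pointers. */
struct tiny_node { const void *key; };
struct tiny_tree { struct tiny_node *slot[MAX_NODES]; size_t count; };

/* Remove n from the tree; returns 1 if it was present, 0 otherwise. */
static int tiny_delete(struct tiny_tree *t, struct tiny_node *n)
{
    size_t i;
    for (i = 0; i < t->count; i++) {
        if (t->slot[i] == n) {
            t->slot[i] = t->slot[--t->count]; /* fill the hole */
            return 1;
        }
    }
    return 0;
}

/*
 * Unlink first, clear the key second. If the node was not found, the
 * caller's view and the tree disagree, so report it instead of
 * continuing silently.
 */
static int remove_and_clear(struct tiny_tree *t, struct tiny_node *n)
{
    if (!tiny_delete(t, n)) {
        fprintf(stderr, "delete: node not present, internal error\n");
        return 0;
    }
    n->key = NULL; /* safe now: nothing in the tree points at n */
    return 1;
}
```

The ordering is the point of the sketch: a key cleared while the node is still linked would leave exactly the state seen in the core dump.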

@jcjones
Author

jcjones commented Mar 15, 2021

I see one in the last three days:

unbound: [8283:1] error: reuse tcp delete: node not present, internal error, 192.41.162.30 ssl 0 lru 1
kernel: unbound[8300]: segfault at a8 ip 00005585a73dd1e4 sp 00007fcef74684f0 error 4 in unbound[5585a7331000+10f000]

That address is for l.gtld-servers.net
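
A side note on the kernel line: "segfault at a8" gives the faulting address, and an address that close to zero is consistent with reading a struct member through a NULL pointer, which matches `key2=0x0` in the backtraces. The sketch below shows the arithmetic with a made-up layout. `struct fake_reuse` and `addrlen_address` are hypothetical and the field sizes are invented so that a member lands at 0xa8; nothing here is taken from Unbound's actual `struct reuse_tcp`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, invented for illustration only. */
struct fake_reuse {
    unsigned char addr[168]; /* stand-in for whatever precedes the member */
    unsigned int addrlen;    /* lands at offset 168 == 0xa8 */
};

/* Address the CPU would touch when reading the addrlen member of a
 * struct that starts at `base`. */
static uintptr_t addrlen_address(uintptr_t base)
{
    return base + offsetof(struct fake_reuse, addrlen);
}
```

With `base` equal to 0, the result is 0xa8, the same shape of address the kernel reported.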

@gthess
Member

gthess commented Mar 16, 2021

Quick question: how many outgoing interfaces are available for that instance?

@jcjones
Author

jcjones commented Mar 16, 2021

Only one, with two IPv6 addresses (one route-able, one internal) and one IPv4 address.

@jcjones
Author

jcjones commented Mar 22, 2021

Got another:

unbound: [30108:0] error: reuse tcp delete: node not present, internal error, 199.249.121.1 ssl 0 lru 1

I'm guessing more than this won't be useful unless I see something other than ssl 0 lru 1 at the end. I'll watch and let you know if I see other flags than those.

@Mityai

Mityai commented Apr 8, 2021

Are there any updates on this problem?

We have been experiencing a similar problem since 1.13.0 (and still on 1.13.1):

#0  waiting_tcp_callback (w=0x0, c=0x0, error=-2, reply_info=0x0)
    at services/outside_network.c:721
#1  reuse_cb_readwait_for_failure (tree_by_id=0x7ff63fba0b10, err=<optimized out>)
    at services/outside_network.c:954
#2  reuse_cb_and_decommission (outnet=<optimized out>, outnet@entry=0x1074b3e80, pend=<optimized out>, error=error@entry=-2)
    at services/outside_network.c:974
#3  0x0000000001abb391 in outnet_tcptimer (arg=0x14b04f200)
    at services/outside_network.c:2004
#4  0x0000000001b4fe71 in event_process_active_single_queue (base=0x1073fa500, activeq=0x107c43ad0, max_to_process=max_to_process@entry=2147483647,
    endtime=endtime@entry=0x0) at libevent/event.c:1697
#5  0x0000000001b4cedc in event_process_active (base=0x1073fa500)
    at libevent/event.c:1789
#6  event_base_loop (base=0x1073fa500, flags=<optimized out>, flags@entry=0)
    at libevent/event.c:2012
#7  0x0000000001b4c8a7 in event_base_dispatch (event_base=0x14b04f228)
    at libevent/event.c:1823
#8  0x0000000001af4105 in ub_event_base_dispatch (base=0x14b04f228)
    at util/ub_event.c:280
#9  0x0000000001ae8a8c in comm_base_dispatch (b=<optimized out>)
    at util/netevent.c:246
#10 0x0000000001a83519 in worker_work (worker=worker@entry=0x10b67e000)
    at daemon/worker.c:2027
#11 0x0000000001a75381 in thread_start (arg=0x10b67e000)
    at daemon/daemon.c:543
#12 0x00007ff7e02cc6db in start_thread (arg=0x7ff63fba1700) at pthread_create.c:463
#13 0x00007ff7dfdf2a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@gthess
Member

gthess commented Apr 9, 2021

@Mityai: No update on this, still looking though.
What @wcawijngaards already committed prevents the specific crash you saw, but the underlying issue remains: a node with a NULL key is somehow still part of the tree.

jedisct1 added a commit to jedisct1/unbound that referenced this issue Apr 22, 2021
* nlnet/master: (61 commits)
  - Fix that testcode dohclient has OpenSSL initialisation calls.
  - Further fix for NLnetLabs#468: detect SSL_CTX_set_alpn_protos for build with   OpenSSL 1.0.1.
  - Fix NLnetLabs#468: OpenSSL 1.0.1 can no longer build Unbound.
  Changelog note for NLnetLabs#466 - Merge NLnetLabs#466 from FGasper: Support OpenSSLs that lack   SSL_get0_alpn_selected.
  Support OpenSSLs that lack SSL_get0_alpn_selected.
  - Remove unused functions worker_handle_reply and   libworker_handle_reply.
  - Fix documentation comment for files previously residing in checkconf/.
  - Fix that nxdomain synthesis does not happen above the stub or   forward definition.
  - Fix (increase) verbosity level for iterator error log in   processQueryTargets().
  - Fix permission denied sendto log, squelch the log messages   unless high verbosity is set.
  - rebuild configure to set EXTRALINK to libunbound.la for NLnetLabs#460.
  - Fix for NLnetLabs#411: Depth protect for crash on deleted element timeout.
  - Fix to stop IPv6 PMTU discovery.
  Changelog note for NLnetLabs#460. - Merge NLnetLabs#460 from orbea: build: Link with the libtool archive.
  build: Link with the libtool archive.
  - Clean makedist.sh.
  - Fix stack-protector change to not override other CFLAGS options.
  - Disable the use of stack-protector for cross compiled 32-bit windows builds;   relates to NLnetLabs#444.
  - Fix NLnetLabs#429: Also fix end of transfer for http download of auth zones.
  - Fix that cachedb does not produce empty object files when disabled.
  ...
@internationils

Hello, just some info: pfsense and opnsense both have open bugs and forum threads on this. If you need people to help reproduce and debug, that might be a good place to find them. In pfsense at least, some people can get it to happen quite frequently. Links:

@gthess
Member

gthess commented May 4, 2021

Thanks @internationils! It's always good to have more information.
As I read it, these issues happen when unbound is signaled for a reload. The same bug seems to apply in that case, since there is cleanup code that runs when unbound needs to stop. Having the DHCP daemon (from the forum links) reload unbound several times a day raises the probability of hitting the bug.
I believe that once this issue is resolved, the crashes on reload will also be fixed.

gthess added a commit that referenced this issue May 19, 2021
- Fix for #411, #439, #469: Reset the DNS message ID when moving queries
  between TCP streams.
- Refactor for uniform way to produce random DNS message IDs.
@gthess
Member

gthess commented May 19, 2021

Hi @jcjones, @Mityai, @internationils,
There is a possible fix on master branch (ff6b527) for this.
It would be great if you could test and provide feedback!
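
For readers following along: the changelog text quoted elsewhere in this thread describes this fix as resetting the DNS message ID when a query is moved between TCP streams. The sketch below is one reading of why that matters, under the assumption that each stream tracks its in-flight queries by ID, so an ID carried over from another stream could collide with one already in use. All names (`struct stream_ids`, `assign_id_on_stream`) are hypothetical, and the linear probe stands in for whatever random selection real code would use.

```c
#include <stdint.h>

/* Hypothetical per-stream record of DNS message IDs currently in flight. */
struct stream_ids {
    uint8_t used[65536 / 8]; /* one bit per 16-bit ID */
};

static int id_in_use(const struct stream_ids *s, uint16_t id)
{
    return (s->used[id >> 3] >> (id & 7)) & 1;
}

static void id_mark(struct stream_ids *s, uint16_t id)
{
    s->used[id >> 3] |= (uint8_t)(1u << (id & 7));
}

/*
 * Pick an ID for a query being moved onto stream s. Starts from `wanted`
 * (in real code this would be a random value) and probes upward,
 * wrapping around after 0xffff. Returns 1 and stores the ID on success,
 * 0 if all 65536 IDs are already taken on this stream.
 */
static int assign_id_on_stream(struct stream_ids *s, uint16_t wanted,
                               uint16_t *out)
{
    uint32_t i;
    for (i = 0; i < 65536; i++) {
        uint16_t id = (uint16_t)(wanted + i);
        if (!id_in_use(s, id)) {
            id_mark(s, id);
            *out = id;
            return 1;
        }
    }
    return 0;
}
```

The key design point is that the ID is chosen against the destination stream's set, not kept from the source stream.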

@internationils

I can't test it, but the PFsense people have grabbed it already...
https://redmine.pfsense.org/issues/11316#change-53857

@gthess
Member

gthess commented May 28, 2021

Hi @jcjones, @Mityai,
just checking if you were able to test with the aforementioned fix.

@internationils: I don't see any movement on the forum threads you posted above. Do you maybe have other information on the matter?

@jcjones
Author

jcjones commented May 28, 2021

> Hi @jcjones, @Mityai,
> just checking if you were able to test with the aforementioned fix.

No, not yet. Had to move this off my juggling-stack for other issues and haven't had time to reintroduce it in the interval. As soon as possible, though.

jedisct1 added a commit to jedisct1/unbound that referenced this issue May 31, 2021
* nlnet/master:
  - zonemd-check: yesno option, default no, enables the processing   of ZONEMD records for that zone.
  - Merge NLnetLabs#496 from banburybill: Use build system endianness if   available, otherwise try to work it out.
  Use build system endianness if available, otherwise try to work it out.
  - For NLnetLabs#492: Fix font highlighting for the man page on emacs.
  - Fix NLnetLabs#492: module-config respip missing in unbound.conf.5.in man   page. Merges NLnetLabs#494 from he32. Remove comment line (?) from man page.
  Transplant parts of the contributed RPZ documentation.
  - Move the NSEC3 max iterations count in line with the 150 value   used by BIND, Knot and PowerDNS. This sets the default value   for it in the configuration to 150 for all key sizes.
  - Test code has -q option for quiet output.
  - Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: Reset the DNS message ID when moving queries   between TCP streams. - Refactor for uniform way to produce random DNS message IDs.
  Fix date in changelog.
  - Fix NLnetLabs#489: Compile using MSYS2 MinGW 64-bit.
  - Fix that auth-zone zonefiles use last TTL if no TTL is specified.
  Changelog note for NLnetLabs#487 - Merge PR NLnetLabs#487: ifdef RLIMIT_AS in recently added check.
  ifdef RLIMIT_AS in recently added check
gthess added a commit that referenced this issue Jul 26, 2021
gthess added a commit that referenced this issue Jul 26, 2021
gthess added a commit that referenced this issue Jul 26, 2021
gthess added a commit that referenced this issue Jul 26, 2021
@gthess gthess closed this as completed in dcd7581 Jul 26, 2021
@gthess
Member

gthess commented Jul 26, 2021

Auto-closed by commit message, reopening.

@gthess gthess reopened this Jul 26, 2021
@gthess
Member

gthess commented Jul 26, 2021

Hi @jcjones, @Mityai, @internationils,
Further fixes have been merged (PR #513) to the master branch.
Our own testing does not yield the issue anymore but it would be great if you could test and provide feedback!

@rbgarga

rbgarga commented Jul 26, 2021

> Hi @jcjones, @Mityai, @internationils,
> Further fixes have been merged (PR #513) to the master branch.
> Our own testing does not yield the issue anymore but it would be great if you could test and provide feedback!

I would love to add this patch to the pfSense development branches so more people can test, but it does not apply cleanly on 1.13.1. I'm going to work on fixing the conflicts and see if it works.

@gthess
Member

gthess commented Jul 26, 2021

Hi @rbgarga,
that would be great!
From a quick look, the conflicts don't appear complicated to me, glad to help if you get stuck. (Make sure to ignore doc/Changelog as this can give a lot of conflicts).

Btw you would also need the following commits (in order, before the PR diff) that solve parts of the issue before the PR was created:

  1. 1bdae42
  2. 7396eff
  3. ff6b527

@rbgarga

rbgarga commented Jul 26, 2021

> Hi @rbgarga,
> that would be great!
> From a quick look, the conflicts don't appear complicated to me, glad to help if you get stuck. (Make sure to ignore doc/Changelog as this can give a lot of conflicts).
>
> Btw you would also need the following commits (in order, before the PR diff) that solve parts of the issue before the PR was created:
>
>   1. 1bdae42
>   2. 7396eff
>   3. ff6b527

I already have these 3 commits applied and removed test changes for #513 ending up with a patch that only touches services/outside_network.[ch].

2 hunks fail to apply creating this reject file https://idaho.arrakis.com.br/files/outside_network.c.rej

@rbgarga

rbgarga commented Jul 26, 2021

> Hi @rbgarga,
> that would be great!
> From a quick look, the conflicts don't appear complicated to me, glad to help if you get stuck. (Make sure to ignore doc/Changelog as this can give a lot of conflicts).
> Btw you would also need the following commits (in order, before the PR diff) that solve parts of the issue before the PR was created:
>
>   1. 1bdae42
>   2. 7396eff
>   3. ff6b527

I already have these 3 commits applied and removed test changes for #513 ending up with a patch that only touches services/outside_network.[ch].

2 hunks fail to apply creating this reject file https://idaho.arrakis.com.br/files/outside_network.c.rej

I sorted out the first hunk, but the second seems to depend on some other change:

@@ -801,11 +907,15 @@
 #ifdef USE_DNSTAP
                        pend_tcp = pend;
 #endif
+               } else {
+                       /* no reuse and no free buffer, put back at the start */
+                       outnet_add_tcp_waiting_first(outnet, w, 0);
+                       break;
                }
 #ifdef USE_DNSTAP
                if(outnet->dtenv && pend_tcp && w && w->sq &&
-                  (outnet->dtenv->log_resolver_query_messages ||
-                   outnet->dtenv->log_forwarder_query_messages)) {
+                       (outnet->dtenv->log_resolver_query_messages ||
+                       outnet->dtenv->log_forwarder_query_messages)) {
                        sldns_buffer tmp;
                        sldns_buffer_init_frm_data(&tmp, w->pkt, w->pkt_len);
                        dt_msg_send_outside_query(outnet->dtenv, &w->sq->addr,

I don't see any place that this change would fit

@gthess
Member

gthess commented Jul 26, 2021

You need the first part for sure; the else including outnet_add_tcp_waiting_first and break.
This becomes the else to the if(reuse) above. The previous else has now become else if(outnet->tcp_free).

I believe there are no ifdefs in your current version and you can ignore the second part.

Hope this is clear.
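
The branch order described above can be sketched as a small loop. This is a simplified model with counters rather than Unbound's lists: `struct sketch_outnet` and `drain_waiting` are hypothetical, and treating a reusable stream as able to take only one more query is a simplification made to keep the example short.

```c
/* Hypothetical, simplified state; not Unbound's struct outside_network. */
struct sketch_outnet {
    int reusable_streams; /* open streams that can take another query */
    int free_buffers;     /* unused TCP buffers (the tcp_free list) */
    int waiting;          /* queries queued for a TCP slot */
};

/*
 * Drain the waiting queue and return the number of queries dispatched.
 * Branch order: reuse, else free buffer, else put the query back and stop.
 */
static int drain_waiting(struct sketch_outnet *o)
{
    int sent = 0;
    while (o->waiting > 0) {
        o->waiting--;                 /* take the first waiting query */
        if (o->reusable_streams > 0) {
            o->reusable_streams--;    /* if(reuse): send on an open stream */
        } else if (o->free_buffers > 0) {
            o->free_buffers--;        /* else if(tcp_free): open a new stream */
        } else {
            o->waiting++;             /* no slot: put back at the start */
            break;                    /* and stop, or the loop never ends */
        }
        sent++;
    }
    return sent;
}
```

The final `else` is the part the hunk adds: without it, a query taken off the queue when neither option is available has nowhere to go.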

@rbgarga

rbgarga commented Jul 26, 2021

> You need the first part for sure; the else including outnet_add_tcp_waiting_first and break.
> This becomes the else to the if(reuse) above. The previous else has now become else if(outnet->tcp_free).
>
> I believe there are no ifdefs in your current version and you can ignore the second part.
>
> Hope this is clear.

Awesome! Thanks!

jedisct1 added a commit to jedisct1/unbound that referenced this issue Jul 27, 2021
* nlnet/master:
  - Changelog entry for NLnetLabs#513: Stream reuse, attempt to fix NLnetLabs#411, NLnetLabs#439,   NLnetLabs#469.
  - Fix readzone unknown type print for memory resize.
  - Fix unittcpreuse.c: properly initialise outnet.
  - Remove redundant log_assert and fix error messages.
  - stream reuse, do not explicitly wait for a free pending_tcp if a reuse   could be used.
  Changelog note for NLnetLabs#512 - Merge NLnetLabs#512: unbound.service.in: upgrade hardening to latest   standards.
  unbound.service.in: upgrade hardening to latest standards
  - Add unittest for tcp_reuse functions.
  - stream reuse, move log_assert to the correct location.
  - stream reuse, clean links on structs that are unlinked from a list.
  - Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: stream reuse, fix loop in the free   pending_tcp list.
  - Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: stream reuse, fix outnet deletion for all   non-free pending_tcp.
  - Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: stream reuse, fix LRU list when reuse is   already in the tree.
  - Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: stream reuse, fix linking when touching the   tcp_reuse LRU list.
  - More log_assert for stream reuse operations.
  - Fix that ldns_zone_new_frm_fp_l counts the line number for an empty   line after a comment.
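
Several entries in this list concern list linkage ("clean links on structs that are unlinked from a list", "fix linking when touching the tcp_reuse LRU list"). As a generic illustration of that idea, and not a copy of the actual patch, the sketch below unlinks an item from a doubly linked list and clears its pointers so that a second unlink is harmless and detectable. `struct lru_item`, `struct lru_list` and the `linked` flag are hypothetical.

```c
#include <stddef.h>

/* Hypothetical LRU list types, for illustration only. */
struct lru_item { struct lru_item *prev, *next; int linked; };
struct lru_list { struct lru_item *first, *last; };

/* Insert an item at the front of the list. */
static void lru_push_front(struct lru_list *l, struct lru_item *it)
{
    it->prev = NULL;
    it->next = l->first;
    if (l->first)
        l->first->prev = it;
    else
        l->last = it;
    l->first = it;
    it->linked = 1;
}

/*
 * Remove an item and clear its links. Returns 1 if it was linked,
 * 0 if it was not (a double unlink), in which case the list is untouched.
 */
static int lru_unlink(struct lru_list *l, struct lru_item *it)
{
    if (!it->linked)
        return 0;
    if (it->prev)
        it->prev->next = it->next;
    else
        l->first = it->next;
    if (it->next)
        it->next->prev = it->prev;
    else
        l->last = it->prev;
    it->prev = it->next = NULL; /* no stale pointers into the list */
    it->linked = 0;
    return 1;
}
```

Stale `prev`/`next` pointers on an unlinked item are a common way for a list to end up with a loop or a dangling entry, which is why clearing them is worthwhile.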
@gthess
Member

gthess commented Aug 12, 2021

@rbgarga, 1.13.2 is now released and includes the aforementioned patches. I believe it solves the occasional crash while reloading that I've been reading about on the pfsense forum.

@jcjones, @Mityai, we believe 1.13.2 solves this issue. I'll leave the issue open; feel free to close or update it based on your experience.

@gthess
Member

gthess commented Mar 17, 2023

Closing as inactive; the observed issues seem resolved.
