Orderly router shutdown and resource cleanup #293

kgiusti · 2022-04-06T18:51:00Z

On shutdown the router stops all processes "mid stream" and attempts to manually clean up resources that remain in use. This is a best-effort attempt and involves bespoke cleanup code for just about every subsystem in the router. Here is an example where the router core attempts to release resources for all links that were active at the point of shutdown.

This is error-prone and triggers many leak events in ASAN, which makes it nearly impossible to determine whether a leak occurs during runtime (probably a serious issue) or only at shutdown (sloppy, but not service affecting).

The router's shutdown needs to be refactored to instead perform an orderly shutdown controlled by management. This feature would cause management to:

Close the $management subscription to prevent further external management access.
Delete all listeners and connectors to prevent new connections from being established.
Close all active connections.
Stop any periodic services (idle link scrubber, stuck delivery detection, etc.).

This will trigger the existing run-time cleanup for all outstanding messages, dispositions, etc. Once all connections have finished closing, all in-flight messages have been released, and other provisioned entities and services have been released, the subsystems (core, proactor, etc.) can clean up and shut down.
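The ordering described above can be sketched as a small toy model. This is only an illustration under stated assumptions: `ToyRouter`, `close_connection`, and `orderly_shutdown` are hypothetical names, not router APIs, and the real work would live in the router's C code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Toy stand-in for the router; all names here are hypothetical."""
    management_open: bool = True
    listeners: set = field(default_factory=lambda: {"listener-a", "connector-b"})
    periodic_services: set = field(
        default_factory=lambda: {"idle link scrubber", "stuck delivery detection"})
    # connection name -> number of in-flight deliveries it still owns
    connections: dict = field(default_factory=lambda: {"conn-1": 3, "conn-2": 1})
    steps: list = field(default_factory=list)

    def close_connection(self, name):
        # Stands in for the existing run-time cleanup path: closing a
        # connection releases its outstanding deliveries with it.
        released = self.connections.pop(name)
        self.steps.append(f"closed {name}, released {released} deliveries")


def orderly_shutdown(router):
    """Run the four management steps in order, then stop the subsystems."""
    router.management_open = False
    router.steps.append("closed $management subscription")

    router.listeners.clear()
    router.steps.append("deleted listeners and connectors")

    # sorted() copies the keys, so popping inside the loop is safe.
    for name in sorted(router.connections):
        router.close_connection(name)

    router.periodic_services.clear()
    router.steps.append("stopped periodic services")

    # Subsystems may only stop once nothing is left in use.
    if router.connections or router.listeners or router.periodic_services:
        raise RuntimeError("resources still in use; refusing to stop subsystems")
    router.steps.append("subsystems (core, proactor) stopped")
    return router.steps
```

Calling `orderly_shutdown(ToyRouter())` returns the recorded steps in order. The point of the sketch is that the subsystems stop last, and only after the ordinary run-time cleanup path has emptied everything they manage.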

The devil is in the details for sure - this will be no small job.

kgiusti · 2022-04-06T19:00:29Z

Needs fixing: #156

kgiusti · 2022-04-22T20:40:39Z

Goal: We want to avoid runtime leaks in the router.

Old way: All leak analysis happens at process exit (by ASAN or alloc pool leak detection).

New way: Shut down all listeners, terminate connections, and OTHER STUFF; then run leak analysis at a time prior to process exit.
The key thing here is to detect leaks only for the desired scope, not for the whole router process.
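One way to picture "leaks only for the desired scope" is a baseline snapshot: record what is live before the scope starts, quiesce the scope, then report only allocations made since the baseline that are still live. The sketch below is a toy model, not the router's alloc pool; `TrackingPool` and its methods are hypothetical. In the C code, a check before process exit could plausibly be triggered through LeakSanitizer's `__lsan_do_leak_check()`, though whether that fits the router is an assumption here.

```python
class TrackingPool:
    """Toy allocator that remembers every live allocation (hypothetical)."""

    def __init__(self):
        self._live = {}      # handle -> descriptive tag
        self._next_id = 0

    def alloc(self, tag):
        self._next_id += 1
        self._live[self._next_id] = tag
        return self._next_id

    def free(self, handle):
        del self._live[handle]

    def snapshot(self):
        # Handles that are live right now; anything in here is out of scope
        # for a later scoped leak check.
        return set(self._live)

    def leaks_since(self, baseline):
        # Allocations made after the snapshot that are still live.
        return sorted(tag for handle, tag in self._live.items()
                      if handle not in baseline)


pool = TrackingPool()
pool.alloc("router-config")          # long-lived, allocated before the scope
baseline = pool.snapshot()

buf = pool.alloc("delivery-buffer")  # allocated inside the scope
pool.alloc("link")                   # allocated inside the scope, never freed
pool.free(buf)                       # released when the connection closes

print(pool.leaks_since(baseline))    # ['link']
```

The long-lived `router-config` allocation is not reported because it predates the baseline. The sketch does not settle whether a still-live object is truly leaked or merely not freed yet; it only narrows the set that has to be examined.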

jiridanek · 2022-04-23T16:21:44Z

I'm sorry, but I still don't really understand this.

Runtime leaks are quite a different problem from orderly shutdown. You can have one without the other, and vice versa. Consider the memory leaks you can have in Java and JavaScript, for example. (You keep a reference to something you should not, or somebody else keeps a reference on your behalf (how insidious!), and so the GC cannot collect it. Or you keep a reference longer than necessary, so things do eventually get collected, but you might hit an OOM in the meantime.)
I do not see what the new way would translate into in practice. How does it apply to, for example, "Mick's asan leak in qd_policy_c_counts_alloc" #253 or to "Leaked termini and qdr_link_t at router shutdown" #335?

jiridanek · 2022-04-23T16:28:57Z

An even simpler trial case of a leak to be judged by the new rules: #156

jiridanek · 2022-06-22T16:23:10Z

> The key thing here is to detect leaks only for the desired scope, not for the whole router process.

I guess this is the main stumbling block for me. How do you do that? How do you tell that a buffer was "leaked" only because we did not bother to free it (yet)? How can you know it is not truly leaked because of a bug in the code, so that it would never be freed even if the connection it belongs to were closed while the router keeps running?

kgiusti · 2023-03-24T15:06:30Z

Won't fix.

It has been decided to maintain the current resource cleanup architecture.

kgiusti added the enhancement (New feature or request) label Apr 6, 2022

kgiusti self-assigned this Apr 6, 2022

kgiusti mentioned this issue Apr 6, 2022

Leak Sanitizer errors when router exits due to port-conflict #156

Closed

ganeshmurthy added this to the 3.0.0 milestone Apr 12, 2022

jiridanek mentioned this issue May 14, 2022

proton raw connection event leak #469

Closed

jiridanek linked a pull request May 26, 2022 that will close this issue

Fixes #415 - Close stdin when tearing down subprocesses in system-tests to prevent fd leaks #505

Merged

jiridanek closed this as completed in #505 May 26, 2022

jiridanek reopened this May 26, 2022

jiridanek mentioned this issue May 27, 2022

python_embedded.c:665: heap-use-after-free in qd_io_rx_handler #520

Closed

kgiusti closed this as completed Mar 24, 2023

jiridanek mentioned this issue Apr 23, 2023

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

Open

jiridanek mentioned this issue Jan 5, 2024

Issue #1072: add LSAN_DO_LEAK_CHECK macro #1073

Open
