Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

jiridanek · 2023-04-23T09:34:31Z

Orderly router shutdown and resource cleanup #293

We all agree that clean shutdown is unnecessary, with the possible exception of cleanly shutting down network sockets, in accordance with whatever protocol is in use on each. Everything else gets cleaned up by the operating system just fine.

When the code to perform cleanups is the same during normal operations and what's invoked on shutdown, then being able to clean everything up on shutdown gives a clear indication that cleanups during normal operations of the server are working as well. This is encouraged by programming languages that use language-level abstractions to keep track of resources, such as objects with destructors in object oriented languages.

In the router, however, the final cleanup code is distinct from the lifecycle operations (as the linked ticket describes) and resource management code, which means each is tested separately. The shutdown code does nothing that the os terminating the process cannot do better. Keeping the shutdown code working in sync with what is happening during router operation is a maintenance burden.

In any case, as it now stands, tests that check for leaks must be careful to always close all connections to the router before router shuts down, otherwise it will cause router shutdown leaks of connection-related data to be reported. (#750 and #767) On the other hand, something like this is a legitimate bug caused by leak related to connections #927.

What is to be done?

leak_check_at_exit=0, or simply call the one-shot void __lsan_do_leak_check(void); at the right moment
trigger lsan manually at an opportune time when exiting, probably after worker threads have been joined
delete the current upon-shutdown cleanup and deallocation code, probably remove it
decide what to do about the current alloc-pool leak-checking code; maybe it is to remain as aid for debugging, but not be run upon shutdown (because we will be now leaking alloc-pool objects on shutdown by intention)

This way, if any objects were leaked during router operation and they were no longer reachable through pointers, the manually triggered leak check would report them.

Router keeps allocation statistics. Tests could check that the number of allocated objects at the beginning and ending of the test is approximately the same, and that during testing the object counts are not climbing, which would suggest a leak.

Limitations

Pooled memory is referenced by the pool itself, so even if an object in the pool was effectively leaked, the alloc-pool linked list will still hold pointer to the chunk of memory and leak sanitizer will not report this. I don't know what to do about it. There is something very similar to our alloc-pool in facebook folly at https://github.com/facebook/folly/blob/e3b9b1e936e1d27908a18ee27c62274db87ca5ab/folly/IndexedMemPool.h#L101 and they don't seem to be doing anyting special about integrating with leak sanitizer or anything there.

The memory pool could run its own heap scan, with the knowledge of its own structure, to see if there are things inside it that are not reachable anymore. It does not appear that lsan provides any "heap scanning api", so maybe steal from some simple GC such as https://chromium.googlesource.com/chromium/src/+/7c787e8cafaa3b13951ce38c76e0ebefb31ff454/third_party/WebKit/Source/platform/heap (not simple) or https://www.hboehm.info/gc/

Disabling the memory pool with -DQD_DISABLE_MEMORY_POOL=ON in (some of) the CI jobs is probably best. With sanitizers, this still works correctly (for short running routers, with a bit of luck) and even provides better allocation stacktraces for use-after-free. Also we should limit use of safe pointers, as in

Fixes #799: Split the pooled type declarations into weak-reffed and the non-weak-reffed #564

Implementation

https://github.com/gcc-mirror/gcc/blob/master/libsanitizer/include/sanitizer/lsan_interface.h, or various wrappers on top, such as https://cs.android.com/android/platform/superproject/+/master:external/webrtc/third_party/abseil-cpp/absl/debugging/leak_check.h or https://dart.googlesource.com/sdk//+/dd76895ee0c1a03341d132b242fa2dab93641eda/runtime/platform/address_sanitizer.h

#include <sanitizer/lsan_interface.h>

Alternatives

Valgrind Memcheck, Mastiff
https://gperftools.github.io/gperftools/heap_checker.html
https://github.com/KDE/heaptrack/blob/master/src/track/heaptrack_api.h

Sources

https://www.buaq.net/go-149085.html

The text was updated successfully, but these errors were encountered:

jiridanek · 2023-04-23T17:34:20Z

I should mention that c_unittests don't work right with this, so they will have to be either removed or redone. For example, consider something like this, where the router is started in each of the tests, so that's multiple times in a single OS process.

skupper-router/tests/c_unittests/test_listener_startup.cpp

Lines 104 to 118 in c503f87

    
           TEST_CASE("Start AMQP listener with zero port" * doctest::skip(regex_is_broken())) 
        
           { 
        
               std::thread([] { 
        
                   qd_server_config_t config{}; 
        
                   config.port      = strdup("0"); 
        
                   config.host      = strdup("localhost"); 
        
                   config.host_port = strdup("localhost:0"); 
        
                   check_amqp_listener_startup_log_message( 
        
                       config, 
        
                       R"EOS(SERVER \(notice\) Listening on (127.0.0.1)|(::1):(\d\d+))EOS", 
        
                       R"EOS(SERVER \(trace\) Listener closed on localhost:0)EOS" 
        
                   ); 
        
               }).join(); 
        
           }

… _all_ shutdown lsan leaks

jiridanek mentioned this issue Apr 23, 2023

Issue #1072: add LSAN_DO_LEAK_CHECK macro #1073

Open

jiridanek added a commit to jiridanek/skupper-router that referenced this issue Jan 5, 2024

Issue skupperproject#1072: add LSAN_DO_LEAK_CHECK macro to suppress…

a9bea85

… _all_ shutdown lsan leaks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

jiridanek commented Apr 23, 2023

jiridanek commented Apr 23, 2023

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

Comments

jiridanek commented Apr 23, 2023

jiridanek commented Apr 23, 2023