Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embrace shutdown leaks and adjust leak checking for the resource cleanup strategy currently in use #1072

Open
jiridanek opened this issue Apr 23, 2023 · 1 comment

Comments

@jiridanek
Copy link
Contributor

We all agree that clean shutdown is unnecessary, with the possible exception of cleanly shutting down network sockets, in accordance with whatever protocol is in use on each. Everything else gets cleaned up by the operating system just fine.

When the code to perform cleanups is the same during normal operations and what's invoked on shutdown, then being able to clean everything up on shutdown gives a clear indication that cleanups during normal operations of the server are working as well. This is encouraged by programming languages that use language-level abstractions to keep track of resources, such as objects with destructors in object oriented languages.

In the router, however, the final cleanup code is distinct from the lifecycle operations (as the linked ticket describes) and resource management code, which means each is tested separately. The shutdown code does nothing that the os terminating the process cannot do better. Keeping the shutdown code working in sync with what is happening during router operation is a maintenance burden.

In any case, as it now stands, tests that check for leaks must be careful to always close all connections to the router before router shuts down, otherwise it will cause router shutdown leaks of connection-related data to be reported. (#750 and #767) On the other hand, something like this is a legitimate bug caused by leak related to connections #927.

What is to be done?

  • leak_check_at_exit=0, or simply call the one-shot void __lsan_do_leak_check(void); at the right moment
  • trigger lsan manually at an opportune time when exiting, probably after worker threads have been joined
  • delete the current upon-shutdown cleanup and deallocation code, probably remove it
  • decide what to do about the current alloc-pool leak-checking code; maybe it is to remain as aid for debugging, but not be run upon shutdown (because we will be now leaking alloc-pool objects on shutdown by intention)

This way, if any objects were leaked during router operation and they were no longer reachable through pointers, the manually triggered leak check would report them.

Router keeps allocation statistics. Tests could check that the number of allocated objects at the beginning and ending of the test is approximately the same, and that during testing the object counts are not climbing, which would suggest a leak.

Limitations

Pooled memory is referenced by the pool itself, so even if an object in the pool was effectively leaked, the alloc-pool linked list will still hold pointer to the chunk of memory and leak sanitizer will not report this. I don't know what to do about it. There is something very similar to our alloc-pool in facebook folly at https://github.com/facebook/folly/blob/e3b9b1e936e1d27908a18ee27c62274db87ca5ab/folly/IndexedMemPool.h#L101 and they don't seem to be doing anyting special about integrating with leak sanitizer or anything there.

The memory pool could run its own heap scan, with the knowledge of its own structure, to see if there are things inside it that are not reachable anymore. It does not appear that lsan provides any "heap scanning api", so maybe steal from some simple GC such as https://chromium.googlesource.com/chromium/src/+/7c787e8cafaa3b13951ce38c76e0ebefb31ff454/third_party/WebKit/Source/platform/heap (not simple) or https://www.hboehm.info/gc/

Disabling the memory pool with -DQD_DISABLE_MEMORY_POOL=ON in (some of) the CI jobs is probably best. With sanitizers, this still works correctly (for short running routers, with a bit of luck) and even provides better allocation stacktraces for use-after-free. Also we should limit use of safe pointers, as in

Implementation

https://github.com/gcc-mirror/gcc/blob/master/libsanitizer/include/sanitizer/lsan_interface.h, or various wrappers on top, such as https://cs.android.com/android/platform/superproject/+/master:external/webrtc/third_party/abseil-cpp/absl/debugging/leak_check.h or https://dart.googlesource.com/sdk//+/dd76895ee0c1a03341d132b242fa2dab93641eda/runtime/platform/address_sanitizer.h

#include <sanitizer/lsan_interface.h>

Alternatives

Sources

@jiridanek
Copy link
Contributor Author

I should mention that c_unittests don't work right with this, so they will have to be either removed or redone. For example, consider something like this, where the router is started in each of the tests, so that's multiple times in a single OS process.

TEST_CASE("Start AMQP listener with zero port" * doctest::skip(regex_is_broken()))
{
std::thread([] {
qd_server_config_t config{};
config.port = strdup("0");
config.host = strdup("localhost");
config.host_port = strdup("localhost:0");
check_amqp_listener_startup_log_message(
config,
R"EOS(SERVER \(notice\) Listening on (127.0.0.1)|(::1):(\d\d+))EOS",
R"EOS(SERVER \(trace\) Listener closed on localhost:0)EOS"
);
}).join();
}

jiridanek added a commit to jiridanek/skupper-router that referenced this issue Jan 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant