You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We all agree that clean shutdown is unnecessary, with the possible exception of cleanly shutting down network sockets, in accordance with whatever protocol is in use on each. Everything else gets cleaned up by the operating system just fine.
When the code to perform cleanups is the same during normal operations and what's invoked on shutdown, then being able to clean everything up on shutdown gives a clear indication that cleanups during normal operations of the server are working as well. This is encouraged by programming languages that use language-level abstractions to keep track of resources, such as objects with destructors in object oriented languages.
In the router, however, the final cleanup code is distinct from the lifecycle operations (as the linked ticket describes) and resource management code, which means each is tested separately. The shutdown code does nothing that the os terminating the process cannot do better. Keeping the shutdown code working in sync with what is happening during router operation is a maintenance burden.
In any case, as it now stands, tests that check for leaks must be careful to always close all connections to the router before router shuts down, otherwise it will cause router shutdown leaks of connection-related data to be reported. (#750 and #767) On the other hand, something like this is a legitimate bug caused by leak related to connections #927.
What is to be done?
leak_check_at_exit=0, or simply call the one-shot void __lsan_do_leak_check(void); at the right moment
trigger lsan manually at an opportune time when exiting, probably after worker threads have been joined
delete the current upon-shutdown cleanup and deallocation code, probably remove it
decide what to do about the current alloc-pool leak-checking code; maybe it is to remain as aid for debugging, but not be run upon shutdown (because we will be now leaking alloc-pool objects on shutdown by intention)
This way, if any objects were leaked during router operation and they were no longer reachable through pointers, the manually triggered leak check would report them.
Router keeps allocation statistics. Tests could check that the number of allocated objects at the beginning and ending of the test is approximately the same, and that during testing the object counts are not climbing, which would suggest a leak.
Limitations
Pooled memory is referenced by the pool itself, so even if an object in the pool was effectively leaked, the alloc-pool linked list will still hold pointer to the chunk of memory and leak sanitizer will not report this. I don't know what to do about it. There is something very similar to our alloc-pool in facebook folly at https://github.com/facebook/folly/blob/e3b9b1e936e1d27908a18ee27c62274db87ca5ab/folly/IndexedMemPool.h#L101 and they don't seem to be doing anyting special about integrating with leak sanitizer or anything there.
Disabling the memory pool with -DQD_DISABLE_MEMORY_POOL=ON in (some of) the CI jobs is probably best. With sanitizers, this still works correctly (for short running routers, with a bit of luck) and even provides better allocation stacktraces for use-after-free. Also we should limit use of safe pointers, as in
I should mention that c_unittests don't work right with this, so they will have to be either removed or redone. For example, consider something like this, where the router is started in each of the tests, so that's multiple times in a single OS process.
We all agree that clean shutdown is unnecessary, with the possible exception of cleanly shutting down network sockets, in accordance with whatever protocol is in use on each. Everything else gets cleaned up by the operating system just fine.
When the code to perform cleanups is the same during normal operations and what's invoked on shutdown, then being able to clean everything up on shutdown gives a clear indication that cleanups during normal operations of the server are working as well. This is encouraged by programming languages that use language-level abstractions to keep track of resources, such as objects with destructors in object oriented languages.
In the router, however, the final cleanup code is distinct from the lifecycle operations (as the linked ticket describes) and resource management code, which means each is tested separately. The shutdown code does nothing that the os terminating the process cannot do better. Keeping the shutdown code working in sync with what is happening during router operation is a maintenance burden.
In any case, as it now stands, tests that check for leaks must be careful to always close all connections to the router before router shuts down, otherwise it will cause router shutdown leaks of connection-related data to be reported. (#750 and #767) On the other hand, something like this is a legitimate bug caused by leak related to connections #927.
What is to be done?
leak_check_at_exit=0
, or simply call the one-shotvoid __lsan_do_leak_check(void);
at the right momentThis way, if any objects were leaked during router operation and they were no longer reachable through pointers, the manually triggered leak check would report them.
Router keeps allocation statistics. Tests could check that the number of allocated objects at the beginning and ending of the test is approximately the same, and that during testing the object counts are not climbing, which would suggest a leak.
Limitations
Pooled memory is referenced by the pool itself, so even if an object in the pool was effectively leaked, the alloc-pool linked list will still hold pointer to the chunk of memory and leak sanitizer will not report this. I don't know what to do about it. There is something very similar to our alloc-pool in facebook folly at https://github.com/facebook/folly/blob/e3b9b1e936e1d27908a18ee27c62274db87ca5ab/folly/IndexedMemPool.h#L101 and they don't seem to be doing anyting special about integrating with leak sanitizer or anything there.
The memory pool could run its own heap scan, with the knowledge of its own structure, to see if there are things inside it that are not reachable anymore. It does not appear that lsan provides any "heap scanning api", so maybe steal from some simple GC such as https://chromium.googlesource.com/chromium/src/+/7c787e8cafaa3b13951ce38c76e0ebefb31ff454/third_party/WebKit/Source/platform/heap (not simple) or https://www.hboehm.info/gc/
Disabling the memory pool with
-DQD_DISABLE_MEMORY_POOL=ON
in (some of) the CI jobs is probably best. With sanitizers, this still works correctly (for short running routers, with a bit of luck) and even provides better allocation stacktraces for use-after-free. Also we should limit use of safe pointers, as inImplementation
https://github.com/gcc-mirror/gcc/blob/master/libsanitizer/include/sanitizer/lsan_interface.h, or various wrappers on top, such as https://cs.android.com/android/platform/superproject/+/master:external/webrtc/third_party/abseil-cpp/absl/debugging/leak_check.h or https://dart.googlesource.com/sdk//+/dd76895ee0c1a03341d132b242fa2dab93641eda/runtime/platform/address_sanitizer.h
Alternatives
Sources
The text was updated successfully, but these errors were encountered: