Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak when using TLS on Windows #374

Closed
ashie opened this issue Apr 8, 2022 · 11 comments · Fixed by #387
Closed

Memory leak when using TLS on Windows #374

ashie opened this issue Apr 8, 2022 · 11 comments · Fixed by #387

Comments

@ashie
Copy link
Member

ashie commented Apr 8, 2022

Recently we received a bug report from a customer that Windows version of td-agent causes memory leak.
It's reproducible with td-agent 4.3.0.
After I investigated it, I found out that just using TLS causes this issue.
In addition, Ruby doesn't recognize it. I confirmed it by fluent-plugin-sigdump.

fluent-plugin-windows-eventlog and certstore_c also don't have major memory leak, I confirmed it by Dr.Memory and reviewing the code.
While testing cerstore_c by Dr.Memory, OpenSSL related leaks are detected. So that the origin of this memory leaks might be Ruby (openssl gem) or OpenSSL. We don't yet understand the detail due to lack of debug symbols.

Although the memory leak by one communication is tiny, it's not ignorable. Because out_forward plugin of Fluentd discards a socket every time after finishing a communication and recreate it by default, so that memory leaks are accumulated.

It can be mitigate by keepalive option of out_forward, but we need to fix the root cause to allow long-life operation.

This issue doesn't reproduce when I use Ruby 2.4.

@ashie
Copy link
Member Author

ashie commented Apr 8, 2022

I'm not sure whether it's reproduced also on GNU/Linux or not.

@ashie
Copy link
Member Author

ashie commented Apr 22, 2022

It seems that some of OpenSSL::X509::Store (or OpenSSL::X509::Store itself) of Ruby seems leaked.
Because it's not needed to recreate every communication, I've tested an ad-hoc patch for Fluentd to reuse cert_store:

	diff --git a/socket.rb b/socket.rb
	index afde072..7a72edc 100644
	--- a/socket.rb
	+++ b/socket.rb
	@@ -107,36 +107,41 @@ module Fluent
		   log.trace "setting TLS verify_mode NONE"
		   context.verify_mode = OpenSSL::SSL::VERIFY_NONE
		 else
	-          cert_store = OpenSSL::X509::Store.new
	-          if allow_self_signed_cert && OpenSSL::X509.const_defined?('V_FLAG_CHECK_SS_SIGNATURE')
	-            cert_store.flags = OpenSSL::X509::V_FLAG_CHECK_SS_SIGNATURE
	-          end
	-          begin
	-            if enable_system_cert_store
	-              if Fluent.windows? && cert_logical_store_name
	-                log.trace "loading Windows system certificate store"
	-                loader = Certstore::OpenSSL::Loader.new(log, cert_store, cert_logical_store_name,
	-                                                        enterprise: cert_use_enterprise_store)
	-                loader.load_cert_store
	-                cert_store = loader.cert_store
	-                context.cert = loader.get_certificate(cert_thumbprint) if cert_thumbprint
	+          cert_store_key = "socket_cert_store_#{object_id}".to_sym
	+          cert_store = ::Thread.current[cert_store_key]
	+          unless cert_store
	+            cert_store = OpenSSL::X509::Store.new
	+            ::Thread.current[cert_store_key] = cert_store
	+            if allow_self_signed_cert && OpenSSL::X509.const_defined?('V_FLAG_CHECK_SS_SIGNATURE')
	+              cert_store.flags = OpenSSL::X509::V_FLAG_CHECK_SS_SIGNATURE
	+            end
	+            begin
	+              if enable_system_cert_store
	+                if Fluent.windows? && cert_logical_store_name
	+                  log.trace "loading Windows system certificate store"
	+                  loader = Certstore::OpenSSL::Loader.new(log, cert_store, cert_logical_store_name,
	+                                                          enterprise: cert_use_enterprise_store)
	+                  loader.load_cert_store
	+                  cert_store = loader.cert_store
	+                  context.cert = loader.get_certificate(cert_thumbprint) if cert_thumbprint
	+                end
	+                log.trace "loading system default certificate store"
	+                cert_store.set_default_paths
		       end
	-              log.trace "loading system default certificate store"
	-              cert_store.set_default_paths
	+            rescue OpenSSL::X509::StoreError
	+              log.warn "failed to load system default certificate store", error: e
		     end
	-          rescue OpenSSL::X509::StoreError
	-            log.warn "failed to load system default certificate store", error: e
	-          end
	-          if cert_paths
	-            if cert_paths.respond_to?(:each)
	-              cert_paths.each do |cert_path|
	+            if cert_paths
	+              if cert_paths.respond_to?(:each)
	+                cert_paths.each do |cert_path|
	+                  log.trace "adding CA cert", path: cert_path
	+                  cert_store.add_file(cert_path)
	+                end
	+              else
	+                cert_path = cert_paths
		         log.trace "adding CA cert", path: cert_path
		         cert_store.add_file(cert_path)
		       end
	-            else
	-              cert_path = cert_paths
	-              log.trace "adding CA cert", path: cert_path
	-              cert_store.add_file(cert_path)
		     end
		   end

The issue seems gone without keepalive option:
Before_ in_tail + out_forward (no patch)
After_ in_tail + out_forward (with a patch to reuse cert_store)

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

I confirmed that OpenSSL::X509::Store#set_default_paths causes the memory leak:

require 'openssl'
loop do
  cert_store = OpenSSL::X509::Store.new
  cert_store.set_default_paths
  cert_store = nil
  GC.start
end

It's reproduced only on Windows.

@ashie ashie changed the title Memory leak when using TLS (at least on Windows) Memory leak when using TLS on Windows Apr 24, 2022
@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

openssl gem's implementation:

365 static VALUE
366 ossl_x509store_set_default_paths(VALUE self)
367 {
368     X509_STORE *store;
369 
370     GetX509Store(self, store);
371     if (X509_STORE_set_default_paths(store) != 1){
372         ossl_raise(eX509StoreError, NULL);
373     }
374 
375     return Qnil;
376 }

It doesn't seem possible to cause the memory leak.
So I'm suspecting OpenSSL.

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

I've wrote a minimum C code to reproduce the memory leak.

#include <openssl/x509v3.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
  X509_STORE *store = X509_STORE_new();
#ifdef USE_SET_DEFAULT_PATHS
  int result = X509_STORE_set_default_paths(store);
  if (result) {
    printf("no error\r\n");
  } else {
    printf("error\r\n");
  }
#endif
  X509_STORE_free(store);
  return 0;
}
> ridk exec gcc openssl-leak.c -DUSE_SET_DEFAULT_PATHS -IC:/Ruby27-x64/msys64/mingw64/bin/../include -LC:/Ruby27-x64/msys64/mingw64/bin/../lib -lssl -lcrypto

> ridk exec "C:\Program Files (x86)\Dr. Memory\bin\drmemory.exe" -leaks_only -- a.exe
~~Dr.M~~ Dr. Memory version 2.5.0
~~Dr.M~~ Running "a.exe"
~~Dr.M~~ Using system call file C:\Users\aho\AppData\Roaming\Dr. Memory\symcache\syscalls_x64.txt
no error
~~Dr.M~~
~~Dr.M~~ Error #1: LEAK 121 direct bytes 0x000002ba3d420600-0x000002ba3d420679 + 0 indirect bytes
~~Dr.M~~ # 0 replace_malloc               [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
~~Dr.M~~ # 1 KERNELBASE.dll!CompareStringW+0x164    (0x00007fff6a6b2875 <KERNELBASE.dll+0x22875>)
~~Dr.M~~ # 2 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea973 <libcrypto-1_1-x64.dll+0x13a973>)
~~Dr.M~~ # 3 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4eac18 <libcrypto-1_1-x64.dll+0x13ac18>)
~~Dr.M~~ # 4 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560b68 <libcrypto-1_1-x64.dll+0x1b0b68>)
~~Dr.M~~ # 5 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b55d0b5 <libcrypto-1_1-x64.dll+0x1ad0b5>)
~~Dr.M~~ # 6 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b5609ed <libcrypto-1_1-x64.dll+0x1b09ed>)
~~Dr.M~~ # 7 main
~~Dr.M~~
~~Dr.M~~ Error #2: LEAK 313 direct bytes 0x000002ba3d4206a0-0x000002ba3d4207d9 + 0 indirect bytes
~~Dr.M~~ # 0 replace_malloc               [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
~~Dr.M~~ # 1 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4eac2b <libcrypto-1_1-x64.dll+0x13ac2b>)
~~Dr.M~~ # 2 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea973 <libcrypto-1_1-x64.dll+0x13a973>)
~~Dr.M~~ # 3 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea6bf <libcrypto-1_1-x64.dll+0x13a6bf>)
~~Dr.M~~ # 4 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560b83 <libcrypto-1_1-x64.dll+0x1b0b83>)
~~Dr.M~~ # 5 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b55d0b5 <libcrypto-1_1-x64.dll+0x1ad0b5>)
~~Dr.M~~ # 6 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b5609ed <libcrypto-1_1-x64.dll+0x1b09ed>)                                                [4/786]
~~Dr.M~~ # 7 main
~~Dr.M~~
~~Dr.M~~ Error #3: LEAK 97 direct bytes 0x000002ba3d44ef20-0x000002ba3d44ef81 + 0 indirect bytes
~~Dr.M~~ # 0 replace_malloc               [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
~~Dr.M~~ # 1 KERNELBASE.dll!CompareStringW+0x164    (0x00007fff6a6b2875 <KERNELBASE.dll+0x22875>)
~~Dr.M~~ # 2 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea973 <libcrypto-1_1-x64.dll+0x13a973>)
~~Dr.M~~ # 3 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4eac18 <libcrypto-1_1-x64.dll+0x13ac18>)
~~Dr.M~~ # 4 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560b68 <libcrypto-1_1-x64.dll+0x1b0b68>)
~~Dr.M~~ # 5 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b55cca5 <libcrypto-1_1-x64.dll+0x1acca5>)
~~Dr.M~~ # 6 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560a21 <libcrypto-1_1-x64.dll+0x1b0a21>)
~~Dr.M~~ # 7 main
~~Dr.M~~
~~Dr.M~~ Error #4: LEAK 289 direct bytes 0x000002ba3d44efb0-0x000002ba3d44f0d1 + 0 indirect bytes
~~Dr.M~~ # 0 replace_malloc               [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
~~Dr.M~~ # 1 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4eac2b <libcrypto-1_1-x64.dll+0x13ac2b>)
~~Dr.M~~ # 2 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea973 <libcrypto-1_1-x64.dll+0x13a973>)
~~Dr.M~~ # 3 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b4ea6bf <libcrypto-1_1-x64.dll+0x13a6bf>)
~~Dr.M~~ # 4 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560b83 <libcrypto-1_1-x64.dll+0x1b0b83>)
~~Dr.M~~ # 5 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b55cca5 <libcrypto-1_1-x64.dll+0x1acca5>)
~~Dr.M~~ # 6 libcrypto-1_1-x64.dll!?     +0x0      (0x00007fff0b560a21 <libcrypto-1_1-x64.dll+0x1b0a21>)
~~Dr.M~~ # 7 main
~~Dr.M~~
~~Dr.M~~ ERRORS FOUND:
~~Dr.M~~       0 unique,     0 total invalid heap argument(s)
~~Dr.M~~       0 unique,     0 total warning(s)
~~Dr.M~~       4 unique,     4 total,    820 byte(s) of leak(s)
~~Dr.M~~       0 unique,     0 total,      0 byte(s) of possible leak(s)
~~Dr.M~~ ERRORS IGNORED:
~~Dr.M~~       1 potential leak(s) (suspected false positives)
~~Dr.M~~          (details: C:\Users\aho\AppData\Roaming\Dr. Memory\DrMemory-a.exe.7696.000\potential_errors.txt)
~~Dr.M~~       7 unique,     7 total,    550 byte(s) of still-reachable allocation(s)
~~Dr.M~~          (re-run with "-show_reachable" for details)
~~Dr.M~~ Details: C:\Users\aho\AppData\Roaming\Dr. Memory\DrMemory-a.exe.7696.000\results.txt

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

The memory leak doesn't cause when I remove X509_STORE_set_default_paths:

>ridk exec gcc openssl-leak.c -IC:/Ruby27-x64/msys64/mingw64/bin/../include -LC:/Ruby27-x64/msys64/mingw64/bin/../li
b -lssl -lcrypto

>ridk exec "C:\Program Files (x86)\Dr. Memory\bin\drmemory.exe" -leaks_only -- a.exe
~~Dr.M~~ Dr. Memory version 2.5.0
~~Dr.M~~ Running "a.exe"
~~Dr.M~~ Using system call file C:\Users\aho\AppData\Roaming\Dr. Memory\symcache\syscalls_x64.txt
~~Dr.M~~
~~Dr.M~~ NO ERRORS FOUND:
~~Dr.M~~       0 unique,     0 total invalid heap argument(s)
~~Dr.M~~       0 unique,     0 total warning(s)
~~Dr.M~~       0 unique,     0 total,      0 byte(s) of leak(s)
~~Dr.M~~       0 unique,     0 total,      0 byte(s) of possible leak(s)
~~Dr.M~~ ERRORS IGNORED:
~~Dr.M~~       1 potential leak(s) (suspected false positives)
~~Dr.M~~          (details: C:\Users\aho\AppData\Roaming\Dr. Memory\DrMemory-a.exe.17940.000\potential_errors.txt)
~~Dr.M~~       3 unique,     3 total,     38 byte(s) of still-reachable allocation(s)
~~Dr.M~~          (re-run with "-show_reachable" for details)
~~Dr.M~~ Details: C:\Users\aho\AppData\Roaming\Dr. Memory\DrMemory-a.exe.17940.000\results.txt

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

Use OpenSSL 1.1.1n with debug symbol:

Dr. Memory version 2.5.0 build 0 built on Oct 18 2021 03:01:22
Windows version: WinVer=105;Rel=2009;Build=19044;Edition=Professional
Dr. Memory results for pid 20624: "a.exe"
Application cmdline: "a.exe"
Recorded 124 suppression(s) from default C:\Program Files (x86)\Dr. Memory\bin64\suppress-default.txt

Error #1: LEAK 121 direct bytes 0x0000022a61f90600-0x0000022a61f90679 + 0 indirect bytes
# 0 replace_malloc                             [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
# 1 libcrypto-1_1-x64.dll!simplify_path
# 2 libcrypto-1_1-x64.dll!malloc_copy_string
# 3 libcrypto-1_1-x64.dll!get_relative_path
# 4 libcrypto-1_1-x64.dll!openssl_relocation
# 5 libcrypto-1_1-x64.dll!X509_get_default_cert_file
# 6 libcrypto-1_1-x64.dll!by_file_ctrl
# 7 libcrypto-1_1-x64.dll!X509_LOOKUP_ctrl
# 8 libcrypto-1_1-x64.dll!X509_STORE_set_default_paths
# 9 main

Error #2: LEAK 313 direct bytes 0x0000022a61f906a0-0x0000022a61f907d9 + 0 indirect bytes
# 0 replace_malloc                             [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
# 1 libcrypto-1_1-x64.dll!simplify_path
# 2 libcrypto-1_1-x64.dll!malloc_copy_string
# 3 libcrypto-1_1-x64.dll!openssl_relocation
# 4 libcrypto-1_1-x64.dll!X509_get_default_cert_file
# 5 libcrypto-1_1-x64.dll!by_file_ctrl
# 6 libcrypto-1_1-x64.dll!X509_LOOKUP_ctrl
# 7 libcrypto-1_1-x64.dll!X509_STORE_set_default_paths
# 8 main

Error #3: LEAK 97 direct bytes 0x0000022a61fbef20-0x0000022a61fbef81 + 0 indirect bytes
# 0 replace_malloc                                   [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
# 1 KERNEL32.dll!GetPhysicallyInstalledSystemMemory +0x22bf   (0x00007fff6bce3530 <KERNEL32.dll+0x3530>)
# 2 libcrypto-1_1-x64.dll!simplify_path
# 3 libcrypto-1_1-x64.dll!malloc_copy_string
# 4 libcrypto-1_1-x64.dll!get_relative_path
# 5 libcrypto-1_1-x64.dll!openssl_relocation
# 6 libcrypto-1_1-x64.dll!X509_get_default_cert_dir
# 7 libcrypto-1_1-x64.dll!dir_ctrl
# 8 libcrypto-1_1-x64.dll!X509_LOOKUP_ctrl
# 9 libcrypto-1_1-x64.dll!X509_STORE_set_default_paths
#10 main

Error #4: LEAK 289 direct bytes 0x0000022a61fbefb0-0x0000022a61fbf0d1 + 0 indirect bytes
# 0 replace_malloc                             [d:\a\drmemory\drmemory\common\alloc_replace.c:2580]
# 1 libcrypto-1_1-x64.dll!simplify_path
# 2 libcrypto-1_1-x64.dll!malloc_copy_string
# 3 libcrypto-1_1-x64.dll!openssl_relocation
# 4 libcrypto-1_1-x64.dll!X509_get_default_cert_dir
# 5 libcrypto-1_1-x64.dll!dir_ctrl
# 6 libcrypto-1_1-x64.dll!X509_LOOKUP_ctrl
# 7 libcrypto-1_1-x64.dll!X509_STORE_set_default_paths
# 8 main

===========================================================================
FINAL SUMMARY:

DUPLICATE ERROR COUNTS:

SUPPRESSIONS USED:

ERRORS FOUND:
      0 unique,     0 total invalid heap argument(s)
      0 unique,     0 total warning(s)
      4 unique,     4 total,    820 byte(s) of leak(s)
      0 unique,     0 total,      0 byte(s) of possible leak(s)
ERRORS IGNORED:
      1 potential leak(s) (suspected false positives)
         (details: C:\Users\aho\AppData\Roaming\Dr. Memory\DrMemory-a.exe.20624.000\potential_errors.txt)
      7 unique,     7 total,    550 byte(s) of still-reachable allocation(s)
         (re-run with "-show_reachable" for details)

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

@ashie
Copy link
Member Author

ashie commented Apr 24, 2022

Use OpenSSL 1.1.1n with debug symbol:

How to build:

> git clone git@github.com:msys2/MINGW-packages.git
> C:\Ruby27-x64\msys64\usr\bin\bash.exe
# cd path /path/to/MINGW-packages/mingw-w64-openssl
# pacman -S base-devel
# sed -i -e "s/'strip'/'\!strip'/g" PKGBUILD
# makepkg-mingw --skippgpcheck --nocheck --nosign --syncdeps

@ashie
Copy link
Member Author

ashie commented May 8, 2022

Fixed version of MinGW's OpenSSL package (1.1.1.o-2) has been released:
https://packages.msys2.org/base/mingw-w64-openssl

We'll replace Ruby's OpenSSL with it (if we release td-agent before future version of Ruby which doesn't include it).

ashie added a commit that referenced this issue May 9, 2022
Fix #374

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
ashie added a commit that referenced this issue May 10, 2022
MinGW'S OpenSSL 1.1.1.o-1 or before has a memory leak bug in
`X509_STORE_set_default_paths`. Because Fluentd's out_forward plugin
calls it on every communication, memory leaks are accumulated so that
it will causes out of memory.

Fix #374

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
ashie added a commit that referenced this issue May 11, 2022
MinGW'S OpenSSL 1.1.1.o-1 or before has a memory leak bug in
`X509_STORE_set_default_paths`. Because Fluentd's out_forward plugin
calls it on every communication, memory leaks are accumulated so that
it will causes out of memory.

Fix #374

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant