Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CurlHandleContainer/StlAllocator crash on shutdown #1833

Closed
2 tasks done
grrtrr opened this issue Jan 1, 2022 · 2 comments
Closed
2 tasks done

CurlHandleContainer/StlAllocator crash on shutdown #1833

grrtrr opened this issue Jan 1, 2022 · 2 comments
Labels
bug This issue is a bug. pending-release This issue will be fixed by an approved PR that hasn't been released yet.

Comments

@grrtrr
Copy link
Contributor

grrtrr commented Jan 1, 2022

Confirm by changing [ ] to [x] below to ensure that it's a bug:

Describe the bug

This problem happens on shutdown of the API:

Fatal error condition occurred in third_party/aws-cpp-sdk/aws-crt/crt/aws-c-common/source/allocator.c:115: allocator != ((void *)0)
Exiting Application
################################################################################
Stack trace:
################################################################################
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-crt_Scrt_Saws-c-common_Slibaws-c-common.so(aws_backtrace_print+0x52) [0x7ffff758c30f]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-crt_Scrt_Saws-c-common_Slibaws-c-common.so(aws_fatal_assert+0x4a) [0x7ffff75802f5]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-crt_Scrt_Saws-c-common_Slibaws-c-common.so(aws_mem_acquire+0x47) [0x7ffff757e904]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so(_ZNSt6vectorIPvN3Aws3Crt12StlAllocatorIS0_EEEaSERKS5_+0xce) [0x7ffff792c422]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so(_ZN3Aws4Http19CurlHandleContainerD2Ev+0x385) [0x7ffff7929675]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so(_ZN3Aws4Http14CurlHttpClientD2Ev+0x197) [0x7ffff793521d]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so(_ZNSt23_Sp_counted_ptr_inplaceIN3Aws4Http14CurlHttpClientENS0_3Crt12StlAllocatorIS2_EELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x11) [0x7ffff792012d]
plugin_test(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x51) [0x55555558da3f]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-s3_Slibaws-s3.so(_ZN3Aws6Client9AWSClientD1Ev+0xc0) [0x7ffff7b71d30]
_solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-s3_Slibaws-s3.so(_ZN3Aws2S38S3ClientD2Ev+0x9e) [0x7ffff7ad3712]
_solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so(_ZNSt23_Sp_counted_ptr_inplaceIN3Aws2S38S3ClientESaIS2_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x11) [0x7ffff7fc8713]
plugin_test(_ZNSt16_Sp_counted_baseILN9__gnu_cxx12_Lock_policyE2EE10_M_releaseEv+0x51) [0x55555558da3f]
_solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so(_ZN2av5cloud3aws2s317S3BlobstorePluginD1Ev+0x20) [0x7ffff7fca9a8]
_solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so(_ZNSt23_Sp_counted_ptr_inplaceIN2av5cloud3aws2s317S3BlobstorePluginESaIS4_ELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x11) [0x7ffff7fc86f3]
plugin_test(_ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St10shared_ptrIN2av9blobstore15BlobstorePluginEEESt10_Select1stISD_ESt4lessIS5_ESaISD_EE8_M_eraseEPSt13_Rb_tree_nodeISD_E+0x91) [0x5555555924e3]
plugin_test(_ZNSt8_Rb_treeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt4pairIKS5_St10shared_ptrIN2av9blobstore15BlobstorePluginEEESt10_Select1stISD_ESt4lessIS5_ESaISD_EE8_M_eraseEPSt13_Rb_tree_nodeISD_E+0x5f) [0x5555555924b1]
plugin_test(_ZN2av9blobstore23BlobstorePluginRegistryD0Ev+0x27) [0x55555559257d]
_solib_k8/libcommon_Sblobstore_Slibblobstore_Uplugin.so(_ZNSt10unique_ptrIN2av9blobstore23BlobstorePluginRegistryENS0_9SingletonIS2_Lb0EE13StaticDeleterEED2Ev+0x29) [0x7ffff72754fb]
/lib/x86_64-linux-gnu/libc.so.6(__cxa_finalize+0xf5) [0x7ffff5c01735]
_solib_k8/libcommon_Sblobstore_Slibblobstore_Uplugin.so(+0x3a83) [0x7ffff7273a83]

It is a race condition, please see below for reproducability.

SDK version number
1.9.92 - the affected code is still present in today's master (1.9.165).

Platform/OS/Hardware/Device
Linux/ubuntu 18.04.

To Reproduce (observed behavior)
I was able to reproduce this problem reliably in a unit tests, but any other exit sequence involving the steps below should trigger the same outcome.

Here is the gdb trace with some more details:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff5bfe921 in __GI_abort () at abort.c:79
#2  0x00007ffff75802fa in aws_fatal_assert ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-crt_Scrt_Saws-c-common_Slibaws-c-common.so
#3  0x00007ffff757e904 in aws_mem_acquire ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-crt_Scrt_Saws-c-common_Slibaws-c-common.so
#4  0x00007ffff792c422 in std::vector<void*, Aws::Crt::StlAllocator<void*> >::operator=(std::vector<void*, Aws::Crt::StlAllocator<void*> > const&) ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so
#5  0x00007ffff7929675 in Aws::Http::CurlHandleContainer::~CurlHandleContainer() ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so
#6  0x00007ffff793521d in Aws::Http::CurlHttpClient::~CurlHttpClient() ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so
#7  0x00007ffff792012d in std::_Sp_counted_ptr_inplace<Aws::Http::CurlHttpClient, Aws::Crt::StlAllocator<Aws::Http::CurlHttpClient>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-core_Slibaws-core.so
#8  0x000055555558da3f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#9  0x00007ffff7b71d30 in Aws::Client::AWSClient::~AWSClient() ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-s3_Slibaws-s3.so
#10 0x00007ffff7ad3712 in Aws::S3::S3Client::~S3Client() ()
   from _solib_k8/libthird_Uparty_Saws-cpp-sdk_Saws-s3_Slibaws-s3.so
#11 0x00007ffff7fc8713 in std::_Sp_counted_ptr_inplace<Aws::S3::S3Client, std::allocator<Aws::S3::S3Client>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from _solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so
#12 0x000055555558da3f in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
#13 0x00007ffff7fca9a8 in av::cloud::aws::s3::S3BlobstorePlugin::~S3BlobstorePlugin() ()
   from _solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so
#14 0x00007ffff7fc86f3 in std::_Sp_counted_ptr_inplace<av::cloud::aws::s3::S3BlobstorePlugin, std::allocator<av::cloud::aws::s3::S3BlobstorePlugin>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from _solib_k8/libcloud_Saws_Ss3_Slibs3_Ublobstore_Uplugin.so
#15 0x00005555555924e3 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > >*) ()
#16 0x00005555555924b1 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::shared_ptr<av::blobstore::BlobstorePlugin> > >*) ()
#17 0x000055555559257d in av::blobstore::BlobstorePluginRegistry::~BlobstorePluginRegistry() ()
#18 0x00007ffff72754fb in std::unique_ptr<av::blobstore::BlobstorePluginRegistry, av::Singleton<av::blobstore::BlobstorePluginRegistry, false>::StaticDeleter>::~unique_ptr() ()
   from _solib_k8/libcommon_Sblobstore_Slibblobstore_Uplugin.so
#19 0x00007ffff5c01735 in __cxa_finalize (d=0x7ffff7279000) at cxa_finalize.c:83
#20 0x00007ffff7273a83 in __do_global_dtors_aux ()
   from _solib_k8/libcommon_Sblobstore_Slibblobstore_Uplugin.so
#21 0x00007fffffffd600 in ?? ()
#22 0x00007ffff7de3d13 in _dl_fini () at dl-fini.c:138
Backtrace stopped: frame did not save the PC

Analysis

I printed extra context out when the application crashed, which shows the sequence of events:

  1. CleanupCrt() is called on API shutdown.

  2. This calls Aws::Delete(g_apiHandle)

  3. ApiHandle::~ApiHandle() sets g_allocator = nullptr (important for the events below).

  4. AWSClient::~AWSClient() is called (that is where the above backtrace starts).

  5. CurlHttpClient::~CurlHttpClient is called next (please refer to above strack trace for the remaining events).

  6. Aws::Http::CurlHandleContainer::~CurlHandleContainer() is called when the m_curlHandleContainer of the CurlHttpClient is destructed.

  7. The destructor in (6) calls the ShutdownAndWait operation:

// source/http/curl/CurlHandleContainer.cpp
CurlHandleContainer::~CurlHandleContainer()
{
    AWS_LOGSTREAM_INFO(CURL_HANDLE_CONTAINER_TAG, "Cleaning up CurlHandleContainer.");
    for (CURL* handle : m_handleContainer.ShutdownAndWait(m_poolSize))  // <=== PROBLEM HAPPENS HERE
    {
        AWS_LOGSTREAM_DEBUG(CURL_HANDLE_CONTAINER_TAG, "Cleaning up " << handle);
        curl_easy_cleanup(handle);
    }
} 
  1. The Aws::Utils::ExclusiveOwnershipResourceManager<CURL*> m_handleContainer is invoking the following code:
// include/aws/core/utils/ResourceManager.h
            Aws::Vector<RESOURCE_TYPE> ShutdownAndWait(size_t resourceCount)
            {   
                Aws::Vector<RESOURCE_TYPE> resources;  // <=== ALLOCATOR ALREADY NULL HERE
                std::unique_lock<std::mutex> locker(m_queueLock);
                m_shutdown = true;

                //wait for all acquired resources to be released.
                while (m_resources.size() < resourceCount)
                {   
                    m_semaphore.wait(locker, [&]() { return m_resources.size() == resourceCount; });
                }

                resources = m_resources;   // <=== CRASH HAPPENS HERE
                m_resources.clear();

                return resources;
            }
  1. The RESOURCE_TYPE in this case (as per code and stack trace) is a CURL* (aka void*) with an Aws::Crt::StlAllocator<void*> allocator.
  2. When resources in the above is constructed, the StlAllocator default constructor is called:
//  include/aws/crt/StlAllocator.h
        template <typename T> class StlAllocator : public std::allocator<T>
        { 
          public:
            using Base = std::allocator<T>;

            StlAllocator() noexcept : Base() { m_allocator = g_allocator; } // <=== HERE
  1. However, g_allocator is at this time NULL (after the call in (3) above).
  2. Next the statement resources = m_resources; invokes the std::vector::operator =
  • this calls out to the allocator of resources,
  • however, the g_allocator is NULL.
  1. Hence it trips the assert when calling aws_mem_acquire
// crt/aws-c-common/source/allocator.c

void *aws_mem_acquire(struct aws_allocator *allocator, size_t size) {
   AWS_FATAL_PRECONDITION(allocator != NULL);
   // ...
}

Possible fix

Since m_resources is empty when ShutdownAndWait is called, we can use std::move to construct resources in place:

// include/aws/core/utils/ResourceManager.h
 Aws::Vector<RESOURCE_TYPE> ShutdownAndWait(size_t resourceCount)
{
    // ...
    Aws::Vector<RESOURCE_TYPE> resources{std::move(m_resources)};

    return resources;
}

This works since, at the time of calling, m_resources still has a reference to its allocator. When the move is performed, the allocator of m_resources is used, avoiding the problem with the default-constructed (NULL) allocator of resources.

@grrtrr grrtrr added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jan 1, 2022
@jmklix
Copy link
Member

jmklix commented Jan 6, 2022

Thanks for figuring this out and making a PR!

@jmklix jmklix added pending-release This issue will be fixed by an approved PR that hasn't been released yet. and removed needs-triage This issue or PR still needs to be triaged. labels Jan 6, 2022
@jmklix jmklix closed this as completed Jan 12, 2022
@github-actions
Copy link

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. pending-release This issue will be fixed by an approved PR that hasn't been released yet.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants