[CI][R] Resolve Valgrind errors #42234

Closed
jonkeane opened this issue Jun 20, 2024 · 5 comments · Fixed by #42249

Comments

@jonkeane
Member

jonkeane commented Jun 20, 2024

Describe the bug, including details regarding any error messages, version, and platform.

We have been seeing Valgrind errors for a while now in R.

==774== HEAP SUMMARY:
==774==     in use at exit: 351,781,345 bytes in 69,559 blocks
==774==   total heap usage: 16,807,335 allocs, 16,737,776 frees, 9,804,514,696 bytes allocated
==774== 
==774== 400 bytes in 1 blocks are possibly lost in loss record 252 of 3,243
==774==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==774==    by 0x40147D9: calloc (rtld-malloc.h:44)
==774==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==774==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==774==    by 0x4DA67B4: allocate_stack (allocatestack.c:430)
==774==    by 0x4DA67B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==774==    by 0x11D614D3: je_arrow_private_je_pthread_create_wrapper (background_thread.c:47)
==774==    by 0x11D614D3: background_thread_create_signals_masked (background_thread.c:287)
==774==    by 0x11D614D3: background_thread_create_locked (background_thread.c:495)
==774==    by 0x11D6275C: je_arrow_private_je_background_thread_create (background_thread.c:520)
==774==    by 0x400647D: call_init.part.0 (dl-init.c:70)
==774==    by 0x4006567: call_init (dl-init.c:33)
==774==    by 0x4006567: _dl_init (dl-init.c:117)
==774==    by 0x4E85AF4: _dl_catch_exception (dl-error-skeleton.c:182)
==774==    by 0x400DFF5: dl_open_worker (dl-open.c:808)
==774==    by 0x400DFF5: dl_open_worker (dl-open.c:771)
==774==    by 0x4E85A97: _dl_catch_exception (dl-error-skeleton.c:208)
==774==    by 0x400E34D: _dl_open (dl-open.c:883)
==774==    by 0x4DA163B: dlopen_doit (dlopen.c:56)
==774== 
==774== 723 (144 direct, 579 indirect) bytes in 1 blocks are definitely lost in loss record 288 of 3,243
==774==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==774==    by 0x12F2B64D: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F10997: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F008F9: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F205E8: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F2050B: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F3FB2A: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x13009227: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x1300987D: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x12F0D392: EVP_MAC_fetch (in /usr/lib/x86_64-linux-gnu/libcrypto.so.3)
==774==    by 0x1183F6B2: Aws::Utils::Crypto::Sha256HMACOpenSSLImpl::Calculate(Aws::Utils::Array<unsigned char> const&, Aws::Utils::Array<unsigned char> const&) (in /usr/local/RDvalgrind/lib/R/site-library/arrow/libs/arrow.so)
==774==    by 0x11B6CF4F: Aws::Utils::Crypto::Sha256HMAC::Calculate(Aws::Utils::Array<unsigned char> const&, Aws::Utils::Array<unsigned char> const&) (in /usr/local/RDvalgrind/lib/R/site-library/arrow/libs/arrow.so)
==774== 
==774== 1,248 bytes in 3 blocks are possibly lost in loss record 330 of 3,243
==774==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==774==    by 0x40147D9: calloc (rtld-malloc.h:44)
==774==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==774==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==774==    by 0x4DA67B4: allocate_stack (allocatestack.c:430)
==774==    by 0x4DA67B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==774==    by 0x11D62513: je_arrow_private_je_pthread_create_wrapper (background_thread.c:47)
==774==    by 0x11D62513: background_thread_create_signals_masked (background_thread.c:287)
==774==    by 0x11D62513: check_background_thread_creation (background_thread.c:332)
==774==    by 0x11D62513: background_thread0_work (background_thread.c:370)
==774==    by 0x11D62513: background_work (background_thread.c:412)
==774==    by 0x11D62513: background_thread_entry (background_thread.c:444)
==774==    by 0x4DA5AC2: start_thread (pthread_create.c:442)
==774==    by 0x4E36A03: clone (clone.S:100)
==774== 
==774== 12,048 bytes in 4 blocks are possibly lost in loss record 1,495 of 3,243
==774==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==774==    by 0x4013E4D: malloc (rtld-malloc.h:56)
==774==    by 0x4013E4D: allocate_dtv_entry (dl-tls.c:684)
==774==    by 0x4013E4D: allocate_and_init (dl-tls.c:709)
==774==    by 0x4013E4D: tls_get_addr_tail (dl-tls.c:907)
==774==    by 0x401820B: __tls_get_addr (tls_get_addr.S:55)
==774==    by 0x11D616D3: tsd_state_get (tsd.h:269)
==774==    by 0x11D616D3: tsd_fetch_impl (tsd.h:421)
==774==    by 0x11D616D3: tsd_fetch_min (tsd.h:433)
==774==    by 0x11D616D3: tsd_internal_fetch (tsd.h:439)
==774==    by 0x11D616D3: background_thread_entry (background_thread.c:444)
==774==    by 0x4DA5AC2: start_thread (pthread_create.c:442)
==774==    by 0x4E36A03: clone (clone.S:100)
==774== 
==774== LEAK SUMMARY:
==774==    definitely lost: 144 bytes in 1 blocks
==774==    indirectly lost: 579 bytes in 11 blocks
==774==      possibly lost: 13,696 bytes in 8 blocks
==774==    still reachable: 351,098,631 bytes in 69,538 blocks
==774==                       of which reachable via heuristic:
==774==                         length64           : 456 bytes in 2 blocks
==774==                         newarray           : 4,264 bytes in 1 blocks
==774==         suppressed: 668,295 bytes in 1 blocks
==774== Reachable blocks (those to which a pointer was found) are not shown.
==774== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==774== 
==774== For lists of detected and suppressed errors, rerun with: -s
==774== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 1 from 1)

This output is from a recent build.

Judging from when these started, I suspect one of these PRs is what introduced this:

* #41419
* #41295
* #41421
* #41366
* #41434

Turns out, something changed in how we look for binaries when setting up the Valgrind run, and that was causing these issues. The strange thing is that there were no code changes around this that should have caused this build to start using libarrow binaries, but it seemingly did:

The last success 30 April:

*** No nightly binaries were found for version 16.0.0.9000: falling back to libarrow build from source

The first failure 1 May:

*** Latest available nightly for 16.0.0.9000: 16.0.0.100000045

I've hardcoded don't-download-binaries in #42249, which resolves the issue, but I'm curious whether you know what changed around then to start this, @assignUser? We might also need to check the other builds that we want to be source builds and confirm that they still are.
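One way to confirm that a job is still a source build is to scan its log for the two messages quoted above. The following is only a rough sketch under assumptions: the `build_kind` enum and `classify_libarrow_build` function are hypothetical names, not part of the Arrow build system, and the check matches only the message fragments shown in this issue. A "Latest available nightly" line is treated here as a hint that a binary may have been used, not as proof.

```c
#include <string.h>

/* Hypothetical classification of a CI log, based only on the two
   messages quoted in this issue. */
typedef enum {
    BUILD_UNKNOWN,
    BUILD_FROM_SOURCE,
    BUILD_NIGHTLY_BINARY
} build_kind;

/* Returns how libarrow appears to have been obtained, judging by `log`. */
build_kind classify_libarrow_build(const char *log) {
    /* Message seen in the last successful run (30 April). */
    if (strstr(log, "falling back to libarrow build from source") != NULL)
        return BUILD_FROM_SOURCE;
    /* Message seen in the first failing run (1 May). */
    if (strstr(log, "Latest available nightly for") != NULL)
        return BUILD_NIGHTLY_BINARY;
    return BUILD_UNKNOWN;
}
```

A job that is meant to build from source could then be flagged whenever the result is anything other than `BUILD_FROM_SOURCE`.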

Component(s)

C++, R

@paulfloyd

You need to either create your threads detached or, usually better, use join.
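For illustration, here is a minimal sketch of the two options with plain POSIX threads. It is not the jemalloc background-thread code from the trace; the names `worker`, `detached_worker`, `run_joined`, and `run_detached` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical worker: stores a result through its argument. */
static void *worker(void *arg) {
    int *out = arg;
    *out = 42;
    return NULL;
}

/* Hypothetical worker for the detached case: it must not touch
   memory owned by a caller that may already have returned. */
static void *detached_worker(void *arg) {
    (void)arg;
    return NULL;
}

/* Option 1: joinable thread, reclaimed with pthread_join.
   Returns 0 on success, otherwise the pthread error code. */
int run_joined(int *out) {
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, worker, out);
    if (rc != 0)
        return rc;
    /* Joining waits for the thread and lets the implementation
       reclaim its resources. */
    return pthread_join(tid, NULL);
}

/* Option 2: thread created detached, so its resources are released
   automatically when it finishes and it can never be joined.
   Returns 0 on success, otherwise the pthread error code. */
int run_detached(void) {
    pthread_attr_t attr;
    pthread_t tid;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, detached_worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}
```

Joining is usually the easier choice to reason about, because the caller knows the thread has finished before the process exits; a detached thread may still be running at that point.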

@jonkeane jonkeane changed the title [C++][R] Valgrind errors [C++][R] Resolve Valgrind errors Jun 22, 2024
@jonkeane jonkeane changed the title [C++][R] Resolve Valgrind errors [CI][R] Resolve Valgrind errors Jun 22, 2024
@jonkeane
Member Author

Issue resolved by pull request #42249

@jonkeane jonkeane added this to the 17.0.0 milestone Jun 22, 2024
@assignUser
Member

I am not aware of any changes in that regard, but I do remember that we had this issue before: the Valgrind build used the binary after we initially added the change to the build system, but that was fixed long ago... Weird.

@assignUser
Member

assignUser commented Jun 22, 2024

Actually, looking through the logs, this job was on Azure Pipelines before and was moved to GHA in #41127. During that move, the libarrow_binary=false setting was probably lost (looking at the PR, it doesn't seem to have been there? :shrug:), but we also didn't have any working nightlies until then, so it didn't show up!

@jonkeane
Member Author

Aaaah ok that makes sense. I didn't realize the nightlies were down for that long, but that would explain it!


3 participants