Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory error in devArrMatch #3860

Merged
merged 2 commits into from
May 14, 2021

Conversation

hcho3
Copy link
Contributor

@hcho3 hcho3 commented May 13, 2021

Related: rapidsai/raft#229

We recently discovered a memory error in the devArrMatch() function: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci-v0.20/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu,OS=ubuntu16.04,PYTHON=3.7/161/console

13:09:34 [----------] 44 tests from FilTests/TreeliteDenseFilTest
13:09:34 [ RUN      ] FilTests/TreeliteDenseFilTest.Import/0
13:09:34 *** Error in `./test/ml': free(): invalid pointer: 0x00007f632b691fe8 ***
13:09:34 ======= Backtrace: =========
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7f632b3447f5]
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7f632b34d38a]
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f632b35158c]
13:09:34 /workspace/ci/artifacts/cuml/cpu/conda_work/cpp/build/libcuml++.so(_ZN8treelite9ModelImplIffED0Ev+0xf5)[0x7f632c556405]

which was traced to the devArrMatch() function as follows:

$ valgrind ./cpp/build/test/ml --gtest_filter=FilTests/TreeliteDenseFilTest.Import/0
==6398== Mismatched free() / delete / delete []
==6398==    at 0x483D1CF: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6398==    by 0x209287: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC253: testing::AssertionResult raft::devArrMatch<float, raft::CompareApprox<float> >(float const*, float const*, unsigned long, raft::CompareApprox<float>, CUstream_st*) (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC51C: ML::BaseFilTest::compare() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x4858098D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580BE0: testing::Test::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580F0E: testing::TestInfo::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581035: testing::TestSuite::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x485815EB: testing::internal::UnitTestImpl::RunAllTests() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581858: testing::UnitTest::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x4853007E: main (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest_main.so)
==6398==  Address 0x232bfa860 is 0 bytes inside a block of size 160,000 alloc'd
==6398==    at 0x483C583: operator new[](unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6398==    by 0x2ABFF8: testing::AssertionResult raft::devArrMatch<float, raft::CompareApprox<float> >(float const*, float const*, unsigned long, raft::CompareApprox<float>, CUstream_st*) (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC51C: ML::BaseFilTest::compare() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x4858098D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580BE0: testing::Test::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580F0E: testing::TestInfo::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581035: testing::TestSuite::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x485815EB: testing::internal::UnitTestImpl::RunAllTests() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581858: testing::UnitTest::Run() (in/home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x4853007E: main (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest_main.so)

Diagnosis. The devArrMatch functions are allocating a temporary buffer using new[] and then assigning it to a shared_ptr<T>. This is not valid because the destructor of shared_ptr<T> will invoke delete, not delete[]. Calling delete with a buffer allocated by new[] leads to an undefined behavior. See https://docs.microsoft.com/en-us/cpp/code-quality/c6278?view=msvc-160.

Proposed fix. Use std:unique_ptr<T[]> instead to store temporary buffers.

@hcho3 hcho3 requested a review from a team as a code owner May 13, 2021 23:05
@hcho3 hcho3 added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change labels May 13, 2021
@hcho3 hcho3 mentioned this pull request May 13, 2021
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending CI

@dantegd
Copy link
Member

dantegd commented May 13, 2021

@hcho3 orthogonal request to the issue being fixed: could you open an issue to remove the duplicated file from cuML? We probably need to do some cleanup regarding situations like this

@hcho3
Copy link
Contributor Author

hcho3 commented May 13, 2021

@dantegd Here it is: #3861

@dantegd dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 14, 2021
@cjnolet
Copy link
Member

cjnolet commented May 14, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 917e7ca into rapidsai:branch-0.20 May 14, 2021
@codecov-commenter
Copy link

Codecov Report

Merging #3860 (97fb9d8) into branch-0.20 (46174b7) will decrease coverage by 0.55%.
The diff coverage is 54.57%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #3860      +/-   ##
===============================================
- Coverage        85.96%   85.40%   -0.56%     
===============================================
  Files              225      227       +2     
  Lines            16986    17313     +327     
===============================================
+ Hits             14602    14787     +185     
- Misses            2384     2526     +142     
Flag Coverage Δ
dask 48.94% <24.08%> (-0.03%) ⬇️
non-dask 77.35% <52.13%> (-0.46%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/benchmark/nvtx_benchmark.py 0.00% <0.00%> (ø)
python/cuml/common/memory_utils.py 79.26% <ø> (+0.51%) ⬆️
python/cuml/dask/common/utils.py 43.68% <0.00%> (ø)
python/cuml/ensemble/randomforestclassifier.pyx 83.61% <ø> (ø)
python/cuml/linear_model/logistic_regression.pyx 89.21% <ø> (ø)
python/cuml/neighbors/nearest_neighbors.pyx 93.11% <ø> (-0.03%) ⬇️
python/cuml/tsa/auto_arima.pyx 57.48% <0.00%> (-0.53%) ⬇️
python/cuml/common/base.pyx 74.10% <29.41%> (-6.23%) ⬇️
python/cuml/model_selection/_split.py 88.99% <75.67%> (-1.87%) ⬇️
python/cuml/manifold/t_sne.pyx 78.34% <76.31%> (-1.00%) ⬇️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef13080...97fb9d8. Read the comment docs.

@hcho3 hcho3 deleted the fix_memory_error branch May 14, 2021 02:23
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
Related: rapidsai/raft#229

We recently discovered a memory error in the `devArrMatch()` function: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci-v0.20/job/cuml/job/prb/job/cuml-gpu-test/CUDA=11.0,GPU_LABEL=gpu,OS=ubuntu16.04,PYTHON=3.7/161/console
```
13:09:34 [----------] 44 tests from FilTests/TreeliteDenseFilTest
13:09:34 [ RUN      ] FilTests/TreeliteDenseFilTest.Import/0
13:09:34 *** Error in `./test/ml': free(): invalid pointer: 0x00007f632b691fe8 ***
13:09:34 ======= Backtrace: =========
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7f632b3447f5]
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7f632b34d38a]
13:09:34 /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f632b35158c]
13:09:34 /workspace/ci/artifacts/cuml/cpu/conda_work/cpp/build/libcuml++.so(_ZN8treelite9ModelImplIffED0Ev+0xf5)[0x7f632c556405]
```
which was traced to  the `devArrMatch()` function as follows:
```
$ valgrind ./cpp/build/test/ml --gtest_filter=FilTests/TreeliteDenseFilTest.Import/0
==6398== Mismatched free() / delete / delete []
==6398==    at 0x483D1CF: operator delete(void*, unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6398==    by 0x209287: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC253: testing::AssertionResult raft::devArrMatch<float, raft::CompareApprox<float> >(float const*, float const*, unsigned long, raft::CompareApprox<float>, CUstream_st*) (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC51C: ML::BaseFilTest::compare() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x4858098D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580BE0: testing::Test::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580F0E: testing::TestInfo::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581035: testing::TestSuite::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x485815EB: testing::internal::UnitTestImpl::RunAllTests() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581858: testing::UnitTest::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x4853007E: main (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest_main.so)
==6398==  Address 0x232bfa860 is 0 bytes inside a block of size 160,000 alloc'd
==6398==    at 0x483C583: operator new[](unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6398==    by 0x2ABFF8: testing::AssertionResult raft::devArrMatch<float, raft::CompareApprox<float> >(float const*, float const*, unsigned long, raft::CompareApprox<float>, CUstream_st*) (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x2AC51C: ML::BaseFilTest::compare() (in /home/ubuntu/cuml/cpp/build/test/ml)
==6398==    by 0x4858098D: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580BE0: testing::Test::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48580F0E: testing::TestInfo::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581035: testing::TestSuite::Run() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x485815EB: testing::internal::UnitTestImpl::RunAllTests() (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x48581858: testing::UnitTest::Run() (in/home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest.so)
==6398==    by 0x4853007E: main (in /home/ubuntu/miniconda3/envs/cuml_dev/lib/libgtest_main.so)
```

**Diagnosis**. The `devArrMatch` functions are allocating a temporary buffer using `new[]` and then assigning it to a `shared_ptr<T>`. This is not valid because the destructor of `shared_ptr<T>` will invoke `delete`, not `delete[]`. Calling `delete` with a buffer allocated by `new[]` leads to an undefined behavior. See https://docs.microsoft.com/en-us/cpp/code-quality/c6278?view=msvc-160.

**Proposed fix**. Use `std:unique_ptr<T[]>` instead to store temporary buffers.

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#3860
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working CUDA/C++ non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants