[REVIEW] Add tfidf bm25 #2353

jperez999 · 2024-06-05T16:21:07Z

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It does not require the user to work within the confines of a COO or CSR matrix; it only requires the triplets of data (row, column, value). With this information, we can preprocess the values accordingly. I am putting this up early to get eyes on it and to confirm it is going in the right direction, or to adjust if it is not.

Unit tests are still required for these features.
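
For readers who want a concrete picture of what preprocessing from (row, column, value) triplets can look like, here is a minimal host-side sketch. It is not the PR's CUDA implementation: the names `Triplets`, `Stats`, `gather_stats`, `tfidf`, and `bm25` are hypothetical, rows are assumed to be documents and columns terms, and the formulas are common textbook variants that may differ from the ones in `preprocessing.cuh`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the PR's CUDA implementation.
// Entry i is one (row, column, value) triplet: document id, term id, raw count.
struct Triplets {
  std::vector<int> rows;
  std::vector<int> cols;
  std::vector<float> vals;
};

// Per-document length (sum of counts) and per-term document frequency.
struct Stats {
  std::vector<float> doc_len;
  std::vector<int> doc_freq;
};

// Assumes no duplicate (row, column) pairs and strictly positive counts.
Stats gather_stats(const Triplets& t, int n_docs, int n_terms) {
  assert(t.rows.size() == t.cols.size() && t.cols.size() == t.vals.size());
  Stats s{std::vector<float>(n_docs, 0.0f), std::vector<int>(n_terms, 0)};
  for (std::size_t i = 0; i < t.vals.size(); ++i) {
    s.doc_len[t.rows[i]] += t.vals[i];
    s.doc_freq[t.cols[i]] += 1;
  }
  return s;
}

// TF-IDF variant (an assumption): tf = count / doc_len, idf = log(n_docs / doc_freq).
std::vector<float> tfidf(const Triplets& t, int n_docs, int n_terms) {
  const Stats s = gather_stats(t, n_docs, n_terms);
  std::vector<float> out(t.vals.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    const float tf  = t.vals[i] / s.doc_len[t.rows[i]];
    const float idf = std::log(static_cast<float>(n_docs) / s.doc_freq[t.cols[i]]);
    out[i] = tf * idf;
  }
  return out;
}

// BM25 variant (an assumption) with the usual k1 and b parameters and a
// smoothed idf = log(1 + (N - df + 0.5) / (df + 0.5)).
std::vector<float> bm25(const Triplets& t, int n_docs, int n_terms,
                        float k1 = 1.6f, float b = 0.75f) {
  const Stats s = gather_stats(t, n_docs, n_terms);
  float avg_len = 0.0f;
  for (float len : s.doc_len) avg_len += len;
  avg_len /= static_cast<float>(n_docs);

  std::vector<float> out(t.vals.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    const float tf   = t.vals[i];
    const float df   = static_cast<float>(s.doc_freq[t.cols[i]]);
    const float idf  = std::log(1.0f + (n_docs - df + 0.5f) / (df + 0.5f));
    const float norm = k1 * (1.0f - b + b * s.doc_len[t.rows[i]] / avg_len);
    out[i] = idf * tf * (k1 + 1.0f) / (tf + norm);
  }
  return out;
}
```

Both functions return one transformed value per input triplet, in the same order, so the result can be paired back with the original row and column arrays.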

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

rhdong · 2024-10-30T21:26:20Z

cpp/test/sparse/neighbors/brute_force_csr.cu

+  SparseKNNInputs<value_idx, value_t> params;
+};
+
+const std::vector<SparseKNNInputs<int, float>> inputs_i32_f = {


Would it be possible to add some additional test cases that generate random CSR matrices instead of hardcoding them? Just a suggestion.

Oh boy, I thought this had been updated / fixed. We definitely don't want to be hardcoding these.

Those hardcoded values come from the test that was already in place; take a look at what is currently available in main: https://github.com/rapidsai/raft/blob/branch-24.12/cpp/test/sparse/neighbors/brute_force.cu. I felt it was acceptable to keep using those hardcoded values, because the point of these tests is not to ensure that brute_force works correctly but to check that the new interfaces I created for COO and CSR work correctly. If you want me to change this, that is fine, but I think it means I need to create a kernel for it, and I don't believe that is the goal of this PR. Let me know if you think I need to make this change in this PR @cjnolet @rhdong

It's up to you. Brute force (and the other neighbor algorithms) has been moved to cuVS, and the test cases there may be more solid; for your reference: https://github.com/rapidsai/cuvs/blob/branch-24.12/cpp/test/neighbors/brute_force.cu#L469 .
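
As an illustration of the randomized-input suggestion above, here is a minimal host-side sketch that builds a random CSR matrix which a test could then copy to the device. `HostCsr` and `make_random_csr` are hypothetical names, not existing RAFT or cuVS utilities, and keeping each entry with a fixed probability is only one possible way to generate the sparsity pattern.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical host-side CSR container for test inputs.
struct HostCsr {
  std::vector<int> indptr;     // n_rows + 1 row offsets
  std::vector<int> indices;    // column index of each stored value
  std::vector<float> values;   // stored values
};

// Keeps each (row, column) entry with probability `density`.
// Columns are visited in increasing order, so every row is sorted
// and contains no duplicate column indices.
HostCsr make_random_csr(int n_rows, int n_cols, double density, unsigned seed) {
  assert(n_rows >= 0 && n_cols >= 0 && density >= 0.0 && density <= 1.0);
  std::mt19937 gen(seed);
  std::bernoulli_distribution keep(density);
  std::uniform_real_distribution<float> value(0.1f, 1.0f);

  HostCsr m;
  m.indptr.reserve(n_rows + 1);
  m.indptr.push_back(0);
  for (int r = 0; r < n_rows; ++r) {
    for (int c = 0; c < n_cols; ++c) {
      if (keep(gen)) {
        m.indices.push_back(c);
        m.values.push_back(value(gen));
      }
    }
    m.indptr.push_back(static_cast<int>(m.indices.size()));
  }
  return m;
}
```

A test built on this would still need a reference result to compare against, for example a naive host-side nearest-neighbor search over the same matrix.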

rhdong · 2024-10-30T21:32:11Z

cpp/include/raft/sparse/neighbors/knn.cuh

@@ -103,4 +106,171 @@ void brute_force_knn(const value_idx* idxIndptr,
                                       metricArg);
 }

+/**
+ * Search the sparse kNN for the k-nearest neighbors of a set of sparse query vectors
+ * using some distance implementation


Please add doc comments for the template parameters.

cpp/include/raft/sparse/neighbors/knn.cuh

rhdong · 2024-10-30T21:40:19Z

cpp/include/raft/sparse/neighbors/knn.cuh

+                     float metricArg                     = 0)
+{
+  cudaStream_t stream = raft::resource::get_cuda_stream(handle);
+


Maybe we could add a check for zero-size idx and query data, though it should rarely happen (the code that follows computes size() - 1).

@rhdong do you think I should raise an error or just return before performing bfknn?

It depends on the logic: return directly (leaving the outputs unchanged) if zero-size input is normal, or use RAFT_EXPECTS to notify the caller.
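
The two options discussed above can be sketched on the host as follows. This is only an illustration: `csr_n_rows` and `has_work` are hypothetical helpers, `std::invalid_argument` stands in for whatever error mechanism the library would use, and plain `std::vector` stands in for the device views in the real signature.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: number of rows described by a CSR row-offset array.
// size() is unsigned, so size() - 1 on an empty array would wrap around to a
// huge value; the empty case is rejected explicitly instead.
std::size_t csr_n_rows(const std::vector<int>& indptr) {
  if (indptr.empty()) {
    throw std::invalid_argument("indptr must contain at least one entry");
  }
  assert(indptr.front() == 0);  // CSR row offsets start at zero
  return indptr.size() - 1;
}

// "Return early" option: true only when both the index and the query have
// at least one row, so the caller can skip the search and leave the outputs
// untouched otherwise.
bool has_work(const std::vector<int>& idx_indptr, const std::vector<int>& query_indptr) {
  return csr_n_rows(idx_indptr) > 0 && csr_n_rows(query_indptr) > 0;
}
```

A caller would check `has_work(...)` first and return immediately when it is false, while a malformed (empty) offset array surfaces as an error rather than an out-of-range size.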

cpp/test/preprocess_utils.cu

rhdong · 2024-10-30T21:57:18Z

cpp/test/preprocess_utils.cu

+  auto host_matrix = raft::make_host_matrix<T2, int64_t>(handle, num_rows, num_cols);
+  raft::copy(host_matrix.data_handle(), device_matrix.data_handle(), device_matrix.size(), stream);
+
+  for (int i = 0; i < elements_size; i++) {


Unclear on the primary objective of this logic, but just a heads-up: unless you explicitly sync on the stream before this line, we can't assume host_matrix will have the same value as device_matrix.

What is happening here is that I am loading the incoming host data into a dense matrix. So we represent the COO arrays as a dense matrix and then I am copying that dense matrix from host memory to GPU memory. Before that line I am expecting that both the host and device matrices are zero filled. I did it this way to use raft APIs as much as possible. A for loop over host memory did not seem like the most efficient way to fill a matrix.

rhdong · 2024-10-30T22:16:38Z

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

+ * @param csr_in: Input CSR matrix
+ * @param values_out: Output values array
+ */
+template <typename T1, typename T2, typename IdxT>


The T1, T2 might be a bit unclear; feel free to rename them to something more meaningful if you prefer.

cpp/test/preprocess_utils.cu

rhdong · 2024-10-30T22:31:27Z

cpp/test/sparse/preprocess_csr.cu

+using SparsePreprocessBm25Csr = SparsePreprocessCSR<float, int>;
+TEST_P(SparsePreprocessBm25Csr, Result) { Run(true); }
+
+const std::vector<SparsePreprocessInputs<float, int>> sparse_preprocess_inputs = {


To be conservative and ensure that there are no surprises after merging, it would be best to add some test cases for larger matrices.

Larger random matrices are difficult to guarantee, because RMAT currently produces many duplicates during edge creation, which results in far fewer edges than anticipated. In its current form I think it would be misleading. I can certainly pass much bigger parameters to RMAT, but I don't think the end result will be what we expect. We would first need a function that creates an RMAT, removes duplicates, and keeps looping until we have the desired number of duplicate-free edges. That is outside the purview of this PR, IMO. How do you feel about it @cjnolet?
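
The "generate, drop duplicates, repeat until the target count is reached" loop described above can be sketched on the host like this. Note the assumptions: uniform random sampling stands in for RMAT, `sample_unique_edges` is a hypothetical name, and a GPU version would need a different deduplication strategy (for example sort plus unique) rather than a `std::set`.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper: draws (row, column) pairs until `target` distinct
// edges have been collected. Uniform sampling is a stand-in for RMAT here.
std::vector<std::pair<int, int>> sample_unique_edges(int n_rows,
                                                     int n_cols,
                                                     std::size_t target,
                                                     unsigned seed) {
  assert(n_rows > 0 && n_cols > 0);
  // More edges than cells can never be satisfied without duplicates.
  assert(target <= static_cast<std::size_t>(n_rows) * static_cast<std::size_t>(n_cols));

  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> row(0, n_rows - 1);
  std::uniform_int_distribution<int> col(0, n_cols - 1);

  // The set silently rejects duplicates, so the loop keeps drawing
  // until the number of distinct edges reaches the target.
  std::set<std::pair<int, int>> seen;
  while (seen.size() < target) {
    const int r = row(gen);
    const int c = col(gen);
    seen.insert(std::make_pair(r, c));
  }
  // Edges come out sorted by (row, column), which suits a COO-to-CSR step.
  return std::vector<std::pair<int, int>>(seen.begin(), seen.end());
}
```

Rejection sampling like this slows down as the target approaches the total number of cells, so it fits sparse test matrices better than nearly dense ones.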

jperez999 · 2024-12-11T17:36:17Z

cpp/test/sparse/preprocess_csr.cu

+                                             values_nnz.view(),
+                                             num_rows);
+    auto rows_csr = raft::make_device_vector<Index_, int64_t>(handle, non_dupe_nnz_count);
+    raft::sparse::convert::sorted_coo_to_csr(


With larger matrices this seems to be failing with invalid writes:

========= COMPUTE-SANITIZER
Running main() from /raid/workspace/raft/cpp/build/_deps/gtest-src/googletest/src/gtest_main.cc
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SparsePreprocessCSR/SparsePreprocessTfidfCsr
[ RUN ] SparsePreprocessCSR/SparsePreprocessTfidfCsr.Result/0
========= Invalid __global__ write of size 4 bytes
=========     at void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void>>::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int *>, int, int>(T2, T3, T4, int, T5, T6, T7)+0x1b30
=========     by thread (8,0,0) in block (0,0,0)
=========     Address 0x7c6486200e20 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7c6486200e00 of size 32 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2b76ef]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x15bc3]
=========                in /opt/conda/envs/rapids_raft/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel_ptsz [0x54aa0]
=========                in /opt/conda/envs/rapids_raft/lib/libcudart.so.12
=========     Host Frame:void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void> >::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int, int>(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int) [0x2f10a]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:cudaError thrust::cuda_cub::launcher::triple_chevron::doit_host<void (*)(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int), thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int>(void (*)(thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, true>, int, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int*>, int), thrust::device_ptr<int> const&, thrust::device_ptr<int> const&, cub::CUB_200500_890_NS::ScanTileState<int, true> const&, int const&, thrust::plus<void> const&, cub::CUB_200500_890_NS::detail::InputValue<int, int*> const&, int const&) const [clone .isra.0] [0x180bb]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:thrust::device_ptr<int> thrust::cuda_cub::detail::exclusive_scan_n_impl<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base>, thrust::device_ptr<int>, long, thrust::device_ptr<int>, int, thrust::plus<void> >(thrust::cuda_cub::execution_policy<thrust::detail::execute_with_allocator<rmm::mr::thrust_allocator<char>, thrust::cuda_cub::execute_on_stream_base> >&, thrust::device_ptr<int>, long, thrust::device_ptr<int>, int, thrust::plus<void>) [clone .isra.0] [0x292d8]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:void raft::sparse::convert::detail::sorted_coo_to_csr<int>(int const*, int, int*, int, CUstream_st*) [0x5dd30]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:raft::sparse::SparsePreprocessCSR<float, int>::Run(bool) [0x6a8cc]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [0x9403d]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::Test::Run() [0x94310]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::TestInfo::Run() [0x946d6]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::TestSuite::Run() [0x94e13]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::internal::UnitTestImpl::RunAllTests() [0x9a88d]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:testing::UnitTest::Run() [0x949d8]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame:main [0x1774e]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========     Host Frame: [0x29d8f]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:__libc_start_main [0x29e3f]
=========                in /usr/lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x177b1]
=========                in /raid/workspace/raft/./cpp/build/gtests/SPARSE_TEST
=========

This excerpt might be helpful in resolving the error (maybe we can check whether the allocated memory size is correct and whether the indices contain any invalid values):

========= Invalid __global__ write of size 4 bytes
=========     at void cub::CUB_200500_890_NS::DeviceScanKernel<cub::CUB_200500_890_NS::DeviceScanPolicy<int, thrust::plus<void>>::Policy900, thrust::device_ptr<int>, thrust::device_ptr<int>, cub::CUB_200500_890_NS::ScanTileState<int, (bool)1>, thrust::plus<void>, cub::CUB_200500_890_NS::detail::InputValue<int, int *>, int, int>(T2, T3, T4, int, T5, T6, T7)+0x1b30
=========     by thread (8,0,0) in block (0,0,0)
=========     Address 0x7c6486200e20 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7c6486200e00 of size 32 bytes
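
To make the allocation-size question concrete: a 32-byte allocation holds eight 4-byte ints, and the failing write is issued by thread 8 of the scan. The host-side sketch below shows why the output of a COO-to-CSR conversion is sized by the number of rows rather than by the number of nonzeros. `coo_rows_to_csr_offsets` is a hypothetical helper using the textbook `n_rows + 1` layout; the exact length that `raft::sparse::convert::sorted_coo_to_csr` expects for its output should be checked against its documentation, and whether this is the actual cause of the failure is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: converts sorted COO row indices into
// CSR row offsets. The result has n_rows + 1 entries regardless of how many
// nonzeros there are, so a buffer sized by nnz is too small when nnz < n_rows.
std::vector<int> coo_rows_to_csr_offsets(const std::vector<int>& sorted_rows, int n_rows) {
  assert(n_rows >= 0);
  std::vector<int> offsets(static_cast<std::size_t>(n_rows) + 1, 0);

  // Count the entries of each row, shifted by one slot.
  for (int r : sorted_rows) {
    assert(r >= 0 && r < n_rows);  // an out-of-range row index would also write out of bounds
    offsets[r + 1] += 1;
  }
  // Running sum turns the per-row counts into start offsets.
  for (int r = 0; r < n_rows; ++r) {
    offsets[r + 1] += offsets[r];
  }
  return offsets;
}
```

For example, four nonzeros in rows `{0, 0, 2, 9}` of a 10-row matrix produce eleven offsets, `0 2 2 3 3 3 3 3 3 3 4`, even though there are only four entries.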

rhdong · 2024-12-11T17:59:02Z

cpp/test/sparse/preprocess_csr.cu

+
+    raft::util::create_dataset<Index_, Type_f>(
+      handle, rows.view(), columns.view(), values.view(), 5, params.n_rows, params.n_cols);
+    int non_dupe_nnz_count = raft::util::get_dupe_mask_count<Index_, Type_f>(


Declaring the non_dupe_nnz_count as int64_t might be safer since it is used as int64_t in the following code.

ajschmidt8 and others added 30 commits July 14, 2020 17:05

update master references

a6677ca

REL DOC Updates for main branch switch

ad2d7d7

[skip ci] Update master references for main branch

Merge pull request rapidsai#272 from rapidsai/branch-21.06

e3c9344

REL Fix `21.06` Release Changelog

Merge pull request rapidsai#321 from rapidsai/branch-21.08

3b0a6d2

[HOTFIX] Remove `-g` from cython compile commands

REL v21.08.00 release

309ea1a

Merge pull request rapidsai#612 from rapidsai/branch-22.04

3740998

[RELEASE] v22.04

REL v22.04.00 release

e987ec8

update changelog

229b9f8

Merge pull request rapidsai#708 from rapidsai/branch-22.06

0eded98

[RELEASE] v22.06 raft

FIX update-version.sh

3e5a625

Merge pull request rapidsai#709 from rapidsai/branch-22.06

ad50a7f

FIX update-version.sh

REL v22.06.00 release

ed2c529

Merge pull request rapidsai#782 from rapidsai/branch-22.08

aae5e34

REL v22.08.00 release

87a7d16

Merge pull request rapidsai#908 from rapidsai/branch-22.10

1de93ba

REL v22.10.00 release

31ae597

Merge pull request rapidsai#988 from rapidsai/branch-22.10

c6e6ce8

[RELEASE] raft v22.10.01

REL v22.10.01 release

f7d2335

Merge pull request rapidsai#1063 from rapidsai/branch-22.12

c16fa56

REL v22.12.00 release

9a716b7

Merge pull request rapidsai#1101 from rapidsai/branch-22.12

60936ba

[RELEASE] raft v22.12.01 [skip-gpuci]

REL v22.12.01 release

a655c9a

Merge pull request rapidsai#1250 from rapidsai/branch-23.02

9a66f42

REL v23.02.00 release

69dce2d

Merge pull request rapidsai#1405 from rapidsai/branch-23.04

1467154

REL v23.04.00 release

7d1057e

REL v23.04.01 release

dc800d6

REL Merge pull request rapidsai#1486 from rapidsai/branch-23.04

520e12c

REL Update changelog v23.04

jperez999 added 4 commits October 30, 2024 12:30

revert contributing md changes

c59bdf9

remove change to pre-commit-config.yaml

c871023

remove all changes to conda env files

04041c2

revert changes to python pyproject files

b022e6e

github-actions bot removed the python label Oct 30, 2024

jperez999 added 6 commits October 30, 2024 14:38

remove extra comment symbol in file

52dd0d9

complete reversion of file to main

8b53c8d

revert dependencies file from merge

d0e8750

file revert

f9c9a0b

revert contributing md

9db7cd9

revert contributing md

a70619e