Support standardization for sparse vectors in logistic regression MG #5806
Conversation
Found some very minor things in a first review, but the code already looks good!
cpp/src/glm/qn_mg.cu
Standardizer<T>* stder = NULL;

if (standardization)
  stder = new Standardizer(handle, X_simple, n_samples, mean_std_buff, vec_size);
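For context, the idea behind the Standardizer can be sketched in plain Python (a hypothetical NumPy/SciPy illustration, not cuml's actual C++ implementation): scale each column by the inverse of its standard deviation while keeping the matrix sparse, since subtracting the mean explicitly would densify it. The mean is computed and kept separately so the solver can account for the centering term.

```python
import numpy as np
from scipy import sparse

def standardize_sparse(X):
    """Column-wise standardization of a CSR matrix without densifying it.

    Scaling by 1/std preserves sparsity; mean subtraction is deferred
    (the mean is returned so the caller can handle the centering term).
    Hypothetical sketch for illustration only.
    """
    mean = np.asarray(X.mean(axis=0)).ravel()
    # population variance via E[x^2] - E[x]^2, computed sparsely
    sq_mean = np.asarray(X.multiply(X).mean(axis=0)).ravel()
    std = np.sqrt(np.maximum(sq_mean - mean**2, 0.0))
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    X_scaled = X @ sparse.diags(1.0 / std)  # stays sparse
    return X_scaled, mean, std
```

After scaling, every non-constant column has unit standard deviation, which is what makes regularization behave consistently across features.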
`stder` reads too much like std err, maybe we could rename it?
Sure, revised to `std_obj`.
assert array_equal(lron_coef_origin, sg.coef_, tolerance)
assert array_equal(lron_intercept_origin, sg.intercept_, tolerance)
Same as above: using unit_tol and total_tol will lead to less flakiness in the tests.
unit_tol=tolerance,
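A minimal sketch of what a two-tolerance comparison does (a hypothetical re-implementation for illustration; cuml's actual `array_equal` test helper may differ in details): `unit_tol` bounds the per-element difference, while `total_tol` caps the fraction of elements allowed to exceed it. This tolerates a handful of outlier elements instead of failing the whole test on one, which is why it reduces flakiness.

```python
import numpy as np

def array_equal(a, b, unit_tol=1e-4, total_tol=1e-4):
    """Compare arrays with two tolerances (illustrative sketch).

    unit_tol:  maximum absolute difference allowed per element.
    total_tol: fraction of elements permitted to exceed unit_tol.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mismatch = np.abs(a - b) > unit_tol
    return mismatch.mean() <= total_tol
```

With a single strict tolerance, one stray element fails the test; here, up to `total_tol` of the elements may drift without a failure.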
Done.
Force-pushed from a7a6793 to 527cbc1.
@dantegd Thank you for the review! I have pushed the revised code. CI seems to fail with an "unrecognized arguments: --force" error associated with memba. Is that expected? The gemmb was tested on a large dataset by multiplying a ones vector (1 x num_rows) with a sparse matrix (num_rows x 2) in which every row is [1, 0.5]. When num_rows is 20 million, gemmb returns [16,777,200, 8,388,610] rather than the expected [20,000,000, 10,000,000]. This PR therefore uses a chunk-based calculation that splits the sparse matrix by rows and then aggregates over the chunks. This minimizes the precision loss and returns the expected results; it has been tested from 20 million up to 130 million rows. Let me know if the revised code looks okay or whether there is any remaining risk.
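The mismatch described above is the classic float32 accumulation limit: 2^24 = 16,777,216 is the point past which adding 1.0 to a float32 running sum no longer changes it, which is exactly why the sum stalls near 16,777,200 instead of reaching 20,000,000. The chunk-then-aggregate idea can be sketched in plain NumPy (not the PR's CUDA gemm code): each chunk's partial sum stays far below 2^24, so no single addition is lost, and the partials are combined in a final reduction.

```python
import numpy as np

def chunked_sum_f32(x, chunk_size=1 << 20):
    """Sum a float32 array by chunks to limit accumulation error.

    Each partial sum stays well below 2**24, so adding values of
    magnitude ~1 never hits float32 granularity; the partials are then
    reduced in float64. Illustrative sketch of the chunking principle.
    """
    partials = [np.sum(x[i:i + chunk_size], dtype=np.float32)
                for i in range(0, len(x), chunk_size)]
    return float(np.sum(partials, dtype=np.float64))
```

The same principle applies to the row-chunked sparse gemm in the PR: bounding the magnitude of each partial accumulator is what keeps the aggregate exact.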
Force-pushed from 527cbc1 to c20610e.
…are term, not tested yet
…ing one to a large number
Force-pushed from c20610e to a44edb9.
…crease the stability of the tests
No description provided.