Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support standardization for sparse vectors in logistic regression MG #5806

Merged
merged 5 commits into from
Apr 3, 2024

Conversation

lijinf2
Copy link
Contributor

@lijinf2 lijinf2 commented Mar 15, 2024

No description provided.

@lijinf2 lijinf2 requested review from a team as code owners March 15, 2024 23:20
@github-actions github-actions bot added Cython / Python Cython or Python issue CUDA/C++ labels Mar 15, 2024
@lijinf2 lijinf2 self-assigned this Mar 15, 2024
@lijinf2 lijinf2 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 15, 2024
Copy link
Member

@dantegd dantegd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some very minor things in a first review, but the code already looks good!

cpp/src/glm/qn/mg/standardization.cuh Show resolved Hide resolved
cpp/src/glm/qn/mg/standardization.cuh Outdated Show resolved Hide resolved
Standardizer<T>* stder = NULL;

if (standardization)
stder = new Standardizer(handle, X_simple, n_samples, mean_std_buff, vec_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stder reads too much like std err, maybe we coud rename it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. revised to std_obj.

python/cuml/tests/dask/test_dask_logistic_regression.py Outdated Show resolved Hide resolved
Comment on lines 1021 to 1022
assert array_equal(lron_coef_origin, sg.coef_, tolerance)
assert array_equal(lron_intercept_origin, sg.intercept_, tolerance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, using unit_tol and total_tol will lead to less flakiness in the tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@lijinf2 lijinf2 force-pushed the fea_lr_std_sparse_ready branch 2 times, most recently from a7a6793 to 527cbc1 Compare March 27, 2024 03:13
@lijinf2
Copy link
Contributor Author

lijinf2 commented Mar 27, 2024

Found some very minor things in a first review, but the code already looks good!

@dantegd Thank you for the review! I have pushed the revised code. Seems CI fails with a "unrecognized arguments: --force" error associated with memba. Is it expected?

The gemmb was tested on large dataset by multiplying a ones vector (1 x num_rows) with the sparse matrix (num_rows x 2), where every row is [1. 0.5]. When num_rows is 20 million, the gemmb returns [16,777,200 8,388,610], not the expected [20,000,000 10,000,000]. Therefore, this PR uses a chunk-based calculation to split the sparse matrix by rows, then aggregates over chunks. This can minimize the precision loss and return the expected results, already tested from 20 million 130 million. Let me know if the revised code looks ok or if there is any risk.

@lijinf2 lijinf2 force-pushed the fea_lr_std_sparse_ready branch from 527cbc1 to c20610e Compare April 2, 2024 15:36
@lijinf2 lijinf2 requested a review from a team as a code owner April 2, 2024 15:36
Copy link

copy-pr-bot bot commented Apr 2, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added conda conda issue ci labels Apr 2, 2024
@lijinf2 lijinf2 force-pushed the fea_lr_std_sparse_ready branch from c20610e to a44edb9 Compare April 2, 2024 15:40
@github-actions github-actions bot removed conda conda issue ci labels Apr 2, 2024
@raydouglass raydouglass merged commit 669fad2 into rapidsai:branch-24.04 Apr 3, 2024
56 of 65 checks passed
@lijinf2 lijinf2 deleted the fea_lr_std_sparse_ready branch June 26, 2024 21:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CUDA/C++ Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants