Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WiP] SRHTs #122

Open
wants to merge 27 commits into
base: main
Choose a base branch
from
Open

[WiP] SRHTs #122

wants to merge 27 commits into from

Conversation

aryamanjeendgar
Copy link

This is a draft PR for integrating SRHTs into RandBLAS

@CLAassistant
Copy link

CLAassistant commented Oct 1, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exciting to see progress here!

I left several comments. Please resolve them, but then take a step back. Part of what you're doing is adding FHT support. This already takes effort without thinking about randomization.

Building up your suite of FHT kernels

Make a folder RandBLAS/trig/ and a file RandBLAS/trig/hadamard.hh. Have your various FHT implementations go here. You already have one for applying to column-major data from the left and storing the result in a column-major sketch. You could easily write implementation that takes in row-major data and writes to a row-major sketch (although the only way you can parallelize this is with BLAS1, I think).

It would be great to have implementations for both the Hadamard transform itself and for the transpose of the Hadamard transform (equivalently, the inverse of the Hadamard transform). If you do this then w.l.o.g. you can always assume you're applying the transformation from the left.

Once you have those functions working, add in the ability to implicitly scale the rows of the input matrix by a vector of coefficients. In sketching we end up setting this vector to a Rademacher random vector, but from an implementation standpoint these functions don't need to care where the vector comes from.

Once you have those cases sorted out, you can allow conflicting layouts for the input matrix and the output matrix (i.e., input is row-major and output is column-major). This is the trick I use for resolving transposition in the sparse-times-dense matrix kernels.

Note: if you feel like you need to allocate a temporary matrix for workspace in order to do anything useful, you can definitely try that.

Writing tests

Make a folder test/test_matmul_cores/test_trig/ and a file test/test_matmul_cores/test_trig/test_hadamard.cc. This will handle tests only for your FHT.

You can take some inspiration from

// Adapted from test::linop_common::test_left_apply_transpose_to_eye.
template <typename T, typename DenseSkOp, SparseMatrix SpMat = COOMatrix<T,int64_t>>
void test_left_transposed_sketch_of_eye(
// B = S^T * eye, where S is m-by-d, B is d-by-m
DenseSkOp &S, Layout layout
) {
auto [m, d] = dimensions(S);
auto I = eye<SpMat>(m);
std::vector<T> B(d * m, 0.0);
bool is_colmajor = (Layout::ColMajor == layout);
int64_t ldb = (is_colmajor) ? d : m;
int64_t lds = (is_colmajor) ? m : d;
lsksp3(
layout, Op::Trans, Op::NoTrans, d, m, m,
(T) 1.0, S, 0, 0, I, 0, 0, (T) 0.0, B.data(), ldb
);
std::vector<T> S_dense(m * d, 0.0);
to_explicit_buffer(S, S_dense.data(), layout);
test::comparison::matrices_approx_equal(
layout, Op::Trans, d, m,
B.data(), ldb, S_dense.data(), lds,
__PRETTY_FUNCTION__, __FILE__, __LINE__
);
}
// Adapted from test::linop_common::test_left_apply_submatrix_to_eye.
template <typename T, typename DenseSkOp, SparseMatrix SpMat = COOMatrix<T,int64_t>>
void test_left_submat_sketch_of_eye(
// B = alpha * submat(S0) * eye + beta*B, where S = submat(S) is d1-by-m1 offset by (S_ro, S_co) in S0, and B is random.
T alpha, DenseSkOp &S0, int64_t d1, int64_t m1, int64_t S_ro, int64_t S_co, Layout layout, T beta = 0.0
) {
auto [d0, m0] = dimensions(S0);
randblas_require(d0 >= d1);
randblas_require(m0 >= m1);
bool is_colmajor = layout == Layout::ColMajor;
int64_t ldb = (is_colmajor) ? d1 : m1;
// define a matrix to be sketched, and create workspace for sketch.
auto I = eye<SpMat>(m1);
auto B = std::get<0>(random_matrix<T>(d1, m1, RNGState(42)));
std::vector<T> B_backup(B);
// Perform the sketch
lsksp3(
layout, Op::NoTrans, Op::NoTrans, d1, m1, m1,
alpha, S0, S_ro, S_co, I, 0, 0, beta, B.data(), ldb
);
// Check the result
T *expect = new T[d0 * m0];
to_explicit_buffer(S0, expect, layout);
int64_t ld_expect = (is_colmajor) ? d0 : m0;
auto [row_stride_s, col_stride_s] = layout_to_strides(layout, ld_expect);
auto [row_stride_b, col_stride_b] = layout_to_strides(layout, ldb);
int64_t offset = row_stride_s * S_ro + col_stride_s * S_co;
#define MAT_E(_i, _j) expect[offset + (_i)*row_stride_s + (_j)*col_stride_s]
#define MAT_B(_i, _j) B_backup[ (_i)*row_stride_b + (_j)*col_stride_b]
for (int i = 0; i < d1; ++i) {
for (int j = 0; j < m1; ++j) {
MAT_E(i,j) = alpha * MAT_E(i,j) + beta * MAT_B(i, j);
}
}
test::comparison::matrices_approx_equal(
layout, Op::NoTrans,
d1, m1,
B.data(), ldb,
&expect[offset], ld_expect,
__PRETTY_FUNCTION__, __FILE__, __LINE__
);
delete [] expect;
}
// Adapted from test::linop_common::test_right_apply_transpose_to_eye.
template <typename T, typename DenseSkOp, SparseMatrix SpMat = COOMatrix<T,int64_t>>
void test_right_transposed_sketch_of_eye(
// B = eye * S^T, where S is d-by-n, so eye is order n and B is n-by-d
DenseSkOp &S, Layout layout
) {
auto [d, n] = dimensions(S);
auto I = eye<SpMat>(n);
std::vector<T> B(n * d, 0.0);
bool is_colmajor = Layout::ColMajor == layout;
int64_t ldb = (is_colmajor) ? n : d;
int64_t lds = (is_colmajor) ? d : n;
rsksp3(layout, Op::NoTrans, Op::Trans, n, d, n, (T) 1.0, I, 0, 0, S, 0, 0, (T) 0.0, B.data(), ldb);
std::vector<T> S_dense(n * d, 0.0);
to_explicit_buffer(S, S_dense.data(), layout);
test::comparison::matrices_approx_equal(
layout, Op::Trans, n, d,
B.data(), ldb, S_dense.data(), lds,
__PRETTY_FUNCTION__, __FILE__, __LINE__
);
}
// Adapted from test::linop_common::test_right_apply_submatrix_to_eye.
template <typename T, typename DenseSkOp, SparseMatrix SpMat = COOMatrix<T,int64_t>>
void test_right_submat_sketch_of_eye(
// B = alpha * eye * submat(S) + beta*B : submat(S) is n-by-d, eye is n-by-n, B is n-by-d and random
T alpha, DenseSkOp &S0, int64_t n, int64_t d, int64_t S_ro, int64_t S_co, Layout layout, T beta = 0.0
) {
auto [n0, d0] = dimensions(S0);
randblas_require(n0 >= n);
randblas_require(d0 >= d);
bool is_colmajor = layout == Layout::ColMajor;
int64_t ldb = (is_colmajor) ? n : d;
auto I = eye<SpMat>(n);
auto B = std::get<0>(random_matrix<T>(n, d, RNGState(11)));
std::vector<T> B_backup(B);
rsksp3(layout, Op::NoTrans, Op::NoTrans, n, d, n, alpha, I, 0, 0, S0, S_ro, S_co, beta, B.data(), ldb);
T *expect = new T[n0 * d0];
to_explicit_buffer(S0, expect, layout);
int64_t ld_expect = (is_colmajor)? n0 : d0;
auto [row_stride_s, col_stride_s] = layout_to_strides(layout, ld_expect);
auto [row_stride_b, col_stride_b] = layout_to_strides(layout, ldb);
int64_t offset = row_stride_s * S_ro + col_stride_s * S_co;
#define MAT_E(_i, _j) expect[offset + (_i)*row_stride_s + (_j)*col_stride_s]
#define MAT_B(_i, _j) B_backup[ (_i)*row_stride_b + (_j)*col_stride_b]
for (int i = 0; i < n; ++i) {
for (int j = 0; j < d; ++j) {
MAT_E(i,j) = alpha * MAT_E(i,j) + beta * MAT_B(i, j);
}
}
test::comparison::matrices_approx_equal(
layout, Op::NoTrans, n, d, B.data(), ldb, &expect[offset], ld_expect,
__PRETTY_FUNCTION__, __FILE__, __LINE__
);
delete [] expect;
}
to see what your tests could like. But don't let this line of thinking hobble you. Feel free to take a different approach to get started. I can give feedback and we iterate a bit.

RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
@rileyjmurray
Copy link
Contributor

rileyjmurray commented Oct 2, 2024

@aryamanjeendgar, for our reference, here's the code you mentioned from the test suite of FFHT:

        void fht(double *buf, int log_n) {
        int n = 1 << log_n;
        for (int i = 0; i < log_n; ++i) {
                int s1 = 1 << i;
                int s2 = s1 << 1;
                for (int j = 0; j < n; j += s2) {
                for (int k = 0; k < s1; ++k) {
                        double u = buf[j + k];
                        double v = buf[j + k + s1];
                        buf[j + k] = u + v;
                        buf[j + k + s1] = u - v;
                }
                }
        }
        }

Step 1 to figuring out how we'll change it: replace the bit manipulations with standard integer arithmetic (or just give explanatory comments). You can also create a GitHub issue to facilitate the discussion, or make an Overleaf project.

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do a detailed read of my notes in GitHub Issue #99.

I didn't review the whole PR since my plane is landing and I need to put away my laptop.

Ping @aryamanjeendgar

RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
Comment on lines 93 to 131
template<typename T, SignedInteger sint_t>
void applyDiagonalRademacher(
bool left,
blas::Layout layout,
int64_t rows,
int64_t cols,
T* A,
sint_t* diag
)
{
if(left && layout == blas::Layout::ColMajor) {
for(int col=0; col < cols; col++) {
if(diag[col] > 0)
continue;
blas::scal(rows, diag[col], &A[col * rows], 1);
}
}
else if(left && layout == blas::Layout::RowMajor) {
for(int col=0; col < cols; col++) {
if(diag[col] > 0)
continue;
blas::scal(rows, diag[col], &A[col], cols);
}
}
else if(!left && layout == blas::Layout::ColMajor) {
for(int row = 0; row < rows; row++) {
if(diag[row] > 0)
continue;
blas::scal(cols, diag[row], &A[row], rows);
}
}
else {
for(int row = 0; row < rows; row++) {
if(diag[row] > 0)
continue;
blas::scal(cols, diag[row], &A[row * cols], 1);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine for now, but it can and should be much more efficient. It's also probably better suited to RandBLAS/util.hh.

RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking pretty good! Please start writing unit tests.

@aryamanjeendgar
Copy link
Author

aryamanjeendgar commented Oct 17, 2024

A description of the tests:

All of the tests right now (unfortunately) use Eigen --- I use Eigen extensively to be able to consistently produce RowMajor and ColMajor random matrices, computing matrix norms and products in all of the tests --- additionally, I use Eigen in the tests for the permutation matrices since Eigen has a convenient out-of-the-box PermutationMatrix class.

I also ended up using std::random for generating a simple "random" buffer ƒrom which I instantiate the Eigen matrices.

Much of this infrastructure (with norm computation, scaling, computing products etc.) can be ported (with a lot more LoC XD) to BLAS with the exception of the permutation tests which tests against Eigen's PermutationMatrix .

The correctness tests are fairly straightforward to understand:

  1. For testing permutation: I permute the last row/column of the input matrix (it is swapped with the leading row/column), and then check if the result of my implementation and Eigens' matches up.
  2. For testing Hadamard: Checked against explicit application of the Hadamard matrix
  3. For testing Diagonal scaling: Scale the input matrix by all $-1$'s via my code and see if it's equal to a negated copy of the initial matrix

The tests are simple, but they cover all of the potential code paths in my code (since the code paths are chosen around two inputs: sketching direction + layout of the input matrix)

Let me know how we should proceed next, @rileyjmurray!

UPDATE: The latest commit also adds in invSRHT tests for checking that $\left(\mathbf{\Pi RHT}\right)^{-1}\left(\mathbf{\Pi RHT}\left(A\right)\right) = A$

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. You don't need the RandBLAS:: namespace qualifications within RandBLAS itself.
  2. Don't explicitly reference r123::Philox4x32. Reference RandBLAS::DefaultRNG (But, again, without the namespace qualification).
  3. I don't like people having to explicitly pass in log_n to the various FHT functions. Do you have a good reason for that?
  4. Please don't use a single monolithic "correctness" test.
  5. Tests should template numerical precision, not hard-code double. The tolerance for the test would ideally be based on some known error bounds for the operations you're trying to perform. The error bounds you're working with should be very simple since every part of an RHT is an orthogonal matrix, and there are special bounds for those.
  6. There should be functions in RandBLAS proper (not just tests) for inverting an RHT.

This is just a partial review. I have to jump to a meeting.

RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
RandBLAS/trig_skops.hh Outdated Show resolved Hide resolved
@aryamanjeendgar
Copy link
Author

aryamanjeendgar commented Oct 23, 2024

You don't need the RandBLAS:: namespace qualifications within RandBLAS itself.

Fixed!

Don't explicitly reference r123::Philox4x32. Reference RandBLAS::DefaultRNG (But, again, without the namespace qualification).

Looks like I didn't have DefaultRNG on the previous version that I was working on --- will fix this with the next commit

I don't like people having to explicitly pass in log_n to the various FHT functions. Do you have a good reason for that?

I use it to implicitly decide the size of the hadamard matrix while applying the transform --- it's not necessary per se, but that's why I made sure to remove the argument from the user-facing functions lmiget and rmiget

Please don't use a single monolithic "correctness" test.

I just wanted to make the tests as easy to dispatch as possible --- the test fixtures themselves are a lot more descriptive (like test_diag_left_rowmajor --- denoting the module I test, which direction the multiplication happens from and the layout of the test matrix).

Tests should template numerical precision, not hard-code double. The tolerance for the test would ideally be based on some known error bounds for the operations you're trying to perform. The error bounds you're working with should be very simple since every part of an RHT is an orthogonal matrix, and there are special bounds for those.

With the latest commit all of test matrices etc. I construct are now templated --- since most of the tests involve some kind of a norm check, I just set the tolerance to 1e-5 (but they pass even for 1e-11)

There should be functions in RandBLAS proper (not just tests) for inverting an RHT.

This is hard to do without committing to refactor to work with a TrigSkOp class rather than just the functions lmiget, rmiget --- since to invert the transform, I'd require access to 1) The exact diagonal matrix generated from the generate_rademacher_vector_r123 and 2) The indices of rows/cols which we permuted (which will need to be reversed for the inversion) --- how do you think we can introduce inversion inside of lmiget, rmiget?

I am sure r123's counter based random number generation could help us bypass this --- but in that case, it would be the user that will have to take care of passing in the original RNG state and not the updated one that we return --- from our end, it would just involve adding in another flag to lmiget/rmiget for inversion where we just make sure to reverse the list of permutation indices to be able to perform the inversion

@rileyjmurray
Copy link
Contributor

With the latest commit all of test matrices etc. I construct are now templated --- since most of the tests involve some kind of a norm check, I just set the tolerance to 1e-5 (but they pass even for 1e-11)

If you're using a rule of thumb like that then the least you can do is define your rule with respect to the unit roundoff of the current numerical type. In this context it sounds like RandBLAS::sqrt_epsilon<T>() would do the job.

[Having proper functions for inverting an RHT] is hard to do without committing to refactor to work with a TrigSkOp class rather than just the functions lmiget, rmiget --- since to invert the transform, I'd require access to 1) The exact diagonal matrix generated from the generate_rademacher_vector_r123 and 2) The indices of rows/cols which we permuted (which will need to be reversed for the inversion) --- how do you think we can introduce inversion inside of lmiget, rmiget?

How about you define a struct type called "HadamardMixingOp" to hold the data needed for lmiget and rmiget. Maybe we end up defining a MixingOperator C++20 concept later. For now you can just do one type.

Given the HadamardMixingOp container, you can define "fill_hadamard" function (or something like that) akin to fill_dense and fill_sparse, which falls back on an implementation of a "fill_hadamard_unpacked_nosub" that's analogous to "fill_sparse_unpacked_nosub."

@aryamanjeendgar
Copy link
Author

aryamanjeendgar commented Oct 25, 2024

So, right now --- I have written HadamardMixingOp to just be a class with all the data that I need on-hand to be able to invert the $\mathbb{\Pi \text{RHT}}$ operation --- with the actual invert method being a free-function that takes in an instance of HadamardMixingOp and the sketched matrix as an input

But the problem is that this instance of HadamardMixingOp ought to be constructed inside of lmiget/rmiget when the indices for sampling and the rademacher entries are sampled --- and as of right now, since lmiget/rmiget are independent of HadamardMixingOp --- I perform the memory management for the buffers holding the rademacher entries and the sampled indices inside of the functions themselves

But from what I've realised after thinking about it for a bit, requiring an invert operation would also require us to enforce the creation of a HadamardMixingOp instance with every call of lmiget/rmiget , for the memory management for all of this to shift inside of HadamardMixingOp and most importantly, for lmiget/rmiget/invert to become free-methods accepting an instance of HadamardMixingOp

i.e. I'm not sure if there is a way to keep the construction of a HadamardMixingOp class "optional" should we want the user to have the option of being able to invert the $\mathbb{\Pi \text{RHT}}$

UPDATE: We decided to make lmiget/rmiget/invert, all free-functions accepting instances of HadamardMixingOp

Comment on lines 300 to 318
void fht_dispatch(
bool left,
bool left, // Pre-multiplying?
blas::Layout layout,
T* buff,
int64_t log_n,
int64_t num_rows,
int64_t num_cols
int64_t num_cols,
int64_t log_n,
T* A
)
{
if(left && layout == blas::Layout::ColMajor)
fht_left_col_major(buff, log_n, num_rows, num_cols);
fht_left_col_major(A, log_n, num_rows, num_cols);
else if(left && layout == blas::Layout::RowMajor)
fht_left_row_major(buff, log_n, num_rows, num_cols);
fht_left_row_major(A, log_n, num_rows, num_cols);
else if(!left && layout == blas::Layout::ColMajor)
fht_right_col_major(buff, log_n, num_rows, num_cols);
fht_right_col_major(A, log_n, num_rows, num_cols);
else
fht_right_row_major(buff, log_n, num_rows, num_cols);
fht_right_row_major(A, log_n, num_rows, num_cols);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please accept num_rows, num_cols, and a leading dimension parameter "lda". (You'll probably need the lda to also appear in the signatures of the functions you're dispatching here.)

Comment on lines +322 to +325
template <SignedInteger sint_t = int64_t>
struct HadamardMixingOp{
sint_t* diag_scale;
sint_t* selected_idxs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more, I want selected_idxs to always be int64_t*. The reason that sparse matrices use a template parameter sint_t is because in sparse matrix multiplication moving around indices can actually contribute to the bottleneck runtime. In our context the selected_idxs is used to specify how we move much larger chunks of data (entire rows in the case of mixing from the left) and so using int64_t doesn't risk making things significantly more expensive.

I'm not sure what to do about the type of diag_scale, to be honest. I won't let this hold back merging the PR but I'd appreciate it if you put thought into this.

Comment on lines 357 to 364
* A free-function that performs an inversion of a matrix transformed by `lmiget | rmiget`
* the inversion is also performed in-place
*/
template <typename T, SignedInteger sint_t = int64_t>
void invert(
HadamardMixingOp<sint_t> &hmo, // details about the transform
T* SA // sketched matrix
) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of function name would be problematic if someone wanted to call it from a language that doesn't support function overloading. Please propose some alternative names (here on a comment thread is fine).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of right now, I renamed it to invert_hadamard --- but maybe since its a free method of the HadamardMixingOp class, we can also call it invert_hmo?

Comment on lines +388 to +393
T log_frac_sz = std::modf(log_sz, &log_int_sz);

if(log_frac_sz < 1e-3)
log_final_sz = log_int_sz;
else
log_final_sz = log_int_sz + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of check makes me very uncomfortable. Please handle the problem of computing $\lceil \log_2(\texttt{ld})\rceil$ in a way that's more robust, or provide a proof that this way of doing this is safe.

Copy link
Contributor

@rileyjmurray rileyjmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left lots of individual comments. After you resolve these, the next step will be to remove dependency on Eigen.

@rileyjmurray
Copy link
Contributor

@aryamanjeendgar please sign the Contributor License Agreement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants