Adding support for scipy sparse matrix datatypes #12

Open
paucablop opened this issue Aug 4, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@paucablop
Collaborator

Description of the Issue

Some estimators, such as Support Vector Classifiers (SVC), store some data (e.g. the support vectors) as scipy.sparse._csr.csr_matrix. This is not obvious from the documentation, which says that the support vectors are stored as np.ndarrays.

This becomes an issue because scipy's sparse matrix representations are not supported types and are filtered out before serialization, raising an exception when deserializing the object.

Possible solutions

I have been thinking of two possible solutions for this (better alternatives are most welcome 😄):

Option 1:

Use the built-in toarray() method to transform the csr_matrix into a np.ndarray.

  • Pros: implementation is quite straightforward. The resulting np.ndarray can be directly transformed into a list through the SklearnSerializer._array_to_list() function.
  • Cons: this could substantially increase file size, because we might be storing sparse matrices as dense ones. It would also hurt maintainability, because it requires knowing which attributes of which estimators produce a csr_matrix so that we can transform them back from dense to sparse during deserialization.
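For illustration, a minimal sketch of Option 1 (using plain tolist() here in place of the SklearnSerializer._array_to_list() helper mentioned above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small sparse matrix, as an SVC might store its support vectors
sparse = csr_matrix(np.array([[0, 1], [2, 0]]))

# Densify, then convert to a JSON-serializable nested list
dense = sparse.toarray()
as_list = dense.tolist()  # → [[0, 1], [2, 0]]
```

Note that the type information is lost here: nothing in `as_list` records that the original attribute was a csr_matrix, which is exactly the maintainability problem described in the cons.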

Option 2:

We could transform the sparse matrix into a dict, and then serialize the dict. Below is an example for a csc_matrix:

import numpy as np
from scipy.sparse import csc_matrix

# Define the matrix in dense format
dense_matrix = np.array([
    [0, 10, 0, 0],
    [0, 0, 20, 0],
    [0, 0, 30, 40],
    [50, 60, 0, 0]
])

# Create a CSC matrix from the dense matrix
csc_mat = csc_matrix(dense_matrix)

During serialization we can transform the csc_matrix into a dict and store it as JSON.

from typing import Union

from scipy.sparse import csc_matrix, csr_matrix

def sparse_matrix_to_dict(sparse_matrix: Union[csc_matrix, csr_matrix]) -> dict:
    return {
        "datatype": sparse_matrix.__class__.__name__,
        "data": sparse_matrix.data.tolist(),
        "indices": sparse_matrix.indices.tolist(),
        "indptr": sparse_matrix.indptr.tolist(),
        "shape": sparse_matrix.shape
    }

During deserialization we can read in the JSON and transform it back into a csc_matrix.
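A possible inverse could look like this (a sketch; the name dict_to_sparse_matrix is just a suggestion). It uses the "datatype" entry to pick the right class, and the standard (data, indices, indptr) constructor that both csc_matrix and csr_matrix accept:

```python
from typing import Union

from scipy.sparse import csc_matrix, csr_matrix

def dict_to_sparse_matrix(d: dict) -> Union[csc_matrix, csr_matrix]:
    # Map the stored class name back to the scipy sparse type
    matrix_types = {"csc_matrix": csc_matrix, "csr_matrix": csr_matrix}
    matrix_type = matrix_types[d["datatype"]]
    # Rebuild from the compressed-sparse components
    return matrix_type(
        (d["data"], d["indices"], d["indptr"]),
        shape=tuple(d["shape"]),
    )
```

Round-tripping through sparse_matrix_to_dict and dict_to_sparse_matrix should then reproduce the original matrix exactly.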

  • Pros: it will reduce the size of the JSON files, especially when large sparse matrices are generated. I also think it will be a bit easier to deserialize, because attributes stored as sparse matrices will be flagged by the "datatype" entry in the dictionary.
  • Cons: it requires writing some additional functions to convert from sparse matrix to dict and from dict back to sparse matrix.
@paucablop paucablop added the enhancement New feature or request label Aug 4, 2024
@paucablop paucablop moved this to Backlog in OpenModels Aug 4, 2024