Adding support for scipy sparse matrix datatypes #12

Open
paucablop opened this issue Aug 4, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@paucablop
Collaborator

Description of the Issue

Some estimators, such as Support Vector Classifiers (SVC), store some data (e.g. the support vectors) as scipy.sparse._csr.csr_matrix. This is not obvious from the documentation, which says that the support vectors are stored as np.ndarrays.

This becomes an issue because scipy's sparse matrix representations are not supported types and are filtered out before serialization, raising an exception when deserializing the object.

Possible solutions

I have been thinking of two possible solutions for this (better alternatives are most welcome 😄):

Option 1:

Use the built-in toarray() method to transform the csr_matrix into a np.ndarray.

  • Pros: implementation is quite straightforward. The resulting np.ndarray can be directly transformed into a list through the SklearnSerializer._array_to_list() function.
  • Cons: this could substantially increase file size, because we might be storing sparse matrices as dense ones. It would also hurt maintainability, because it requires knowing which attributes of which estimators produce a csr_matrix so that we can transform them back from dense to sparse during deserialization.
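For illustration, a minimal sketch of Option 1 (using plain tolist() here in place of the SklearnSerializer._array_to_list() helper mentioned above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A small sparse matrix, as an SVC might store its support vectors
sparse = csr_matrix(np.array([[0, 1], [2, 0]]))

# Densify, then convert to a JSON-serializable nested list
dense = sparse.toarray()
as_list = dense.tolist()  # → [[0, 1], [2, 0]]
```

Note that the type information is lost here: nothing in `as_list` records that the original attribute was a csr_matrix, which is exactly the maintainability problem described in the cons.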

Option 2:

We could transform the sparse matrix into a dict, and then serialize the dict. Below is an example for a csc_matrix:

import numpy as np
from scipy.sparse import csc_matrix

# Define the matrix in dense format
dense_matrix = np.array([
    [0, 10, 0, 0],
    [0, 0, 20, 0],
    [0, 0, 30, 40],
    [50, 60, 0, 0]
])

# Create a CSC matrix from the dense matrix
csc_mat = csc_matrix(dense_matrix)

During serialization we can transform the csc_matrix into a dict and store it as JSON.

from typing import Union

from scipy.sparse import csc_matrix, csr_matrix

def sparse_matrix_to_dict(sparse_matrix: Union[csc_matrix, csr_matrix]) -> dict:
    return {
        "datatype": sparse_matrix.__class__.__name__,
        "data": sparse_matrix.data.tolist(),
        "indices": sparse_matrix.indices.tolist(),
        "indptr": sparse_matrix.indptr.tolist(),
        "shape": sparse_matrix.shape
    }

During deserialization we can read in the JSON and transform it back into a csc_matrix.
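A possible inverse could look like this (a sketch; the name dict_to_sparse_matrix is just a suggestion). It uses the "datatype" entry to pick the right class, and the standard (data, indices, indptr) constructor that both csc_matrix and csr_matrix accept:

```python
from typing import Union

from scipy.sparse import csc_matrix, csr_matrix

def dict_to_sparse_matrix(d: dict) -> Union[csc_matrix, csr_matrix]:
    # Map the stored class name back to the scipy sparse type
    matrix_types = {"csc_matrix": csc_matrix, "csr_matrix": csr_matrix}
    matrix_type = matrix_types[d["datatype"]]
    # Rebuild from the compressed-sparse components
    return matrix_type(
        (d["data"], d["indices"], d["indptr"]),
        shape=tuple(d["shape"]),
    )
```

Round-tripping through sparse_matrix_to_dict and dict_to_sparse_matrix should then reproduce the original matrix exactly.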

  • Pros: it will reduce the size of the JSON files, especially when large sparse matrices are generated. I also think it will be a bit easier to deserialize, because attributes stored as sparse matrices will be flagged by the "datatype" entry in the dictionary.
  • Cons: it requires writing some additional functions to convert from sparse matrix to dict and from dict back to sparse matrix.
@paucablop paucablop added the enhancement New feature or request label Aug 4, 2024
@paucablop paucablop moved this to Backlog in OpenModels Aug 4, 2024