OptiMask
is a Python package designed for efficiently handling NaN values in matrices, specifically focusing on computing the largest non-contiguous submatrix without NaN. OptiMask employs a heuristic method, relying solely on Numpy for speed and efficiency. In machine learning applications, OptiMask surpasses traditional methods like pandas dropna
by maximizing the amount of valid data available for model fitting. It strategically identifies the optimal set of columns (features) and rows (samples) to retain or remove, ensuring that the largest (non-contiguous) submatrix without NaN is utilized for training models.
The problem differs from the computation of the largest rectangles of 1s in a binary matrix (which can be tackled with dynamic programming) and requires a novel approach. The problem also differs from this algorithmic challenge in that it requires rearranging both columns and rows, rather than just columns.
- Largest Submatrix without NaN: OptiMask calculates the largest submatrix without NaN, enhancing data analysis accuracy.
- Efficient Computation: With optimized computation, OptiMask provides rapid results without undue delays.
- Numpy and Pandas Compatibility: OptiMask seamlessly adapts to both Numpy and Pandas data structures.
To employ OptiMask, install the optimask
package via pip:
pip install optimask
OptiMask is also available on the conda-forge channel:
conda install -c conda-forge optimask
mamba install optimask
Import the OptiMask
class from the optimask
package and utilize its methods for efficient data masking:
from optimask import OptiMask
import numpy as np
# Create a matrix with NaN values
m = 120
n = 7
data = np.zeros(shape=(m, n))
data[24:72, 3] = np.nan
data[95, :5] = np.nan
# Solve for the largest submatrix without NaN values
rows, cols = OptiMask().solve(data)
# Calculate the ratio of non-NaN values in the result
coverage_ratio = len(rows) * len(cols) / data.size
# Check if there are any NaN values in the selected submatrix
has_nan_values = np.isnan(data[rows][:, cols]).any()
# Print or display the results
print(f"Coverage Ratio: {coverage_ratio:.2f}, Has NaN Values: {has_nan_values}")
# Output: Coverage Ratio: 0.85, Has NaN Values: False
The grey cells represent the NaN locations, the blue ones represent the valid data, and the red ones represent the rows and columns removed by the algorithm:
OptiMask’s algorithm is useful for handling unstructured NaN patterns, as shown in the following example:
OptiMask
efficiently handles large matrices, delivering results within reasonable computation times:
from optimask import OptiMask
import numpy as np
def generate_random(m, n, ratio):
"""Missing at random arrays"""
return np.random.choice(a=[0, np.nan], size=(m, n), p=[1-ratio, ratio])
x = generate_random(m=100_000, n=1_000, ratio=0.02)
%time rows, cols = OptiMask(verbose=True).solve(x)
>>> Trial 1 : submatrix of size 35830x51 (1827330 elements) found.
>>> Trial 2 : submatrix of size 37167x49 (1821183 elements) found.
>>> Trial 3 : submatrix of size 37190x49 (1822310 elements) found.
>>> Trial 4 : submatrix of size 36530x50 (1826500 elements) found.
>>> Trial 5 : submatrix of size 37227x49 (1824123 elements) found.
>>> Result: the largest submatrix found is of size 35830x51 (1827330 elements) found.
>>> CPU times: total: 938 ms
>>> Wall time: 223 ms
For detailed documentation,API usage, examples and insights on the algorithm, visit OptiMask Documentation.
If you're working with time series data, check out timefiller, another Python package I developed for time series imputation. timefiller
is designed to efficiently handle missing data in time series and relies heavily on optimask
.
If you use OptiMask in your research or work, please cite it:
@software{optimask2024,
author = {Cyril Joly},
title = {OptiMask: NaN Removal and Largest Submatrix Computation},
year = {2024},
url = {https://github.com/CyrilJl/OptiMask},
}
Or:
OptiMask (2024). NaN Removal and Largest Submatrix Computation. Developed by Cyril Joly: https://github.com/CyrilJl/OptiMask