# thunder-factorization

> algorithms for large-scale matrix factorization
Many common matrix factorization algorithms can benefit from parallelization. This package provides distributed implementations of `PCA`, `ICA`, `NMF`, and others that target the distributed computing engine `spark`. It also wraps local implementations from `scikit-learn`, to support factorization in both a local and distributed setting with a common API.
The package includes a collection of algorithms that can be fit to data, all of which return matrices with the results of the factorization. Compatible with Python 2.7+ and 3.4+. Built on `numpy`, `scipy`, and `scikit-learn`. Works well alongside `thunder` and supports parallelization via `spark`, but can also be used on local `numpy` arrays.
## Installation

Use `pip`:

```
pip install thunder-factorization
```
## Example

Here's an example computing `PCA`:

```python
# create high-dimensional low-rank matrix
from sklearn.datasets import make_low_rank_matrix
X = make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=5)

# use PCA to recover low-rank structure
from factorization import PCA
algorithm = PCA(k=5)
T, W_T = algorithm.fit(X)
```
## Usage

All algorithms have a `fit` method that fits the algorithm and returns the components of the factorization; a short sketch follows the input and output descriptions below.
#### Input

- `X` data matrix as a `numpy` `ndarray`, a `bolt` `array`, or `thunder` `series` or `images` data
- `return_parallel` whether or not to keep the output parallelized; only valid if the input matrix is already parallelized via `bolt` or `thunder`; default is `False`, meaning that all returned arrays will be local
#### Output

- Two or more arrays representing the estimated factors.
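Here's a minimal sketch of the common pattern, reusing the `PCA` class from the example above; the expected shapes follow the `PCA` output description further down:

```python
import numpy as np
from factorization import PCA

# any two-dimensional local array works as input
X = np.random.randn(100, 20)

# every algorithm is fit the same way and returns its factors
algorithm = PCA(k=3)
T, W = algorithm.fit(X)
print(T.shape, W.shape)  # (100, 3) and (3, 20)
```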
## Algorithms

Here are all the available algorithms with their options.
### ICA

Factors the matrix into statistically independent sources, `X = S * A`. Note: it is the columns of `S` that represent the independent sources, linear combinations of which reconstruct the columns of `X`.
#### Parameters

- `k` number of sources
- `max_iter` maximum number of iterations
- `tol` tolerance for stopping iterations
- `seed` seed for random number generator that initializes algorithm
- `k_pca` number of principal components used for initial dimensionality reduction; default is no dimensionality reduction (`spark` mode only)
- `svd_method` how to compute the distributed SVD: `'auto'`, `'direct'`, or `'em'`; see the SVD documentation below for details (`spark` mode only)
#### Output

- `S` sources, shape `nrows x k`
- `A` mixing matrix, shape `k x ncols`
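A minimal sketch of unmixing two sources, assuming `ICA` is importable from `factorization` the same way `PCA` is in the example above:

```python
import numpy as np
from factorization import ICA  # import path assumed, mirroring PCA above

# build a mixture of two independent signals (the columns of S_true)
t = np.linspace(0, 8, 1000)
S_true = np.c_[np.sin(2 * t), np.sign(np.cos(3 * t))]
A_true = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S_true.dot(A_true)  # X = S * A, shape (1000, 2)

# recover the sources and the mixing matrix
algorithm = ICA(k=2, seed=42)
S, A = algorithm.fit(X)  # S: nrows x k, A: k x ncols
```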
### NMF

Factors a non-negative matrix as the product of two small non-negative matrices, `X = W * H`.
#### Parameters

- `k` number of components
- `max_iter` maximum number of iterations
- `tol` tolerance for stopping iterations
- `seed` seed for random number generator that initializes algorithm
#### Output

- `W` left factor, shape `nrows x k`
- `H` right factor, shape `k x ncols`
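A minimal sketch on random non-negative data, again assuming the import path mirrors the `PCA` example:

```python
import numpy as np
from factorization import NMF  # import path assumed, mirroring PCA above

# non-negative data matrix
rng = np.random.RandomState(0)
X = rng.rand(100, 50)

algorithm = NMF(k=5, max_iter=200, seed=0)
W, H = algorithm.fit(X)  # W: 100 x 5, H: 5 x 50

# reconstruction error of the factorization X ~ W * H
print(np.linalg.norm(X - W.dot(H)))
```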
### PCA

Performs dimensionality reduction by finding an ordered set of components formed by an orthogonal projection that successively explain the maximum amount of remaining variance, `T = X * W^T`.
#### Parameters

- `k` number of components
- `max_iter` maximum number of iterations
- `tol` tolerance for stopping iterations
- `seed` seed for random number generator that initializes algorithm
- `svd_method` how to compute the distributed SVD: `'auto'`, `'direct'`, or `'em'`; see the SVD documentation below for details (`spark` mode only)
#### Output

- `T` components, shape `(nrows, k)`
- `W` weights, shape `(k, ncols)`
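Because `T = X * W^T`, the product `T * W` gives a rank-`k` approximation of the input (up to any internal mean-centering the implementation may apply, which is an assumption here); continuing the quick-start example:

```python
import numpy as np
from sklearn.datasets import make_low_rank_matrix
from factorization import PCA

X = make_low_rank_matrix(n_samples=100, n_features=100, effective_rank=5)
T, W = PCA(k=5).fit(X)

# rank-k reconstruction of X (up to any internal centering -- an assumption)
X_hat = T.dot(W)
print(np.linalg.norm(X - X_hat))
```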
### SVD

Generalization of the eigen-decomposition to non-square matrices, `X = U * diag(S) * V`.
#### Parameters

- `k` number of components
- `max_iter` maximum number of iterations
- `tol` tolerance for stopping iterations
- `seed` seed for random number generator that initializes algorithm
- `svd_method` how to compute the distributed SVD (`spark` mode only)
    - `'direct'` explicit computation based on eigenvalue decomposition of the covariance matrix
    - `'em'` approximate iterative method based on the expectation-maximization algorithm
    - `'auto'` uses `'direct'` for `ncols` < 750, otherwise uses `'em'`
#### Output

- `U` left singular vectors, shape `(nrows, k)`
- `S` singular values, shape `(k,)`
- `V` right singular vectors, shape `(k, ncols)`
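A minimal sketch, assuming `SVD` is importable like the other algorithms:

```python
import numpy as np
from factorization import SVD  # import path assumed, mirroring PCA above

X = np.random.randn(100, 50)
algorithm = SVD(k=5, svd_method='auto')
U, S, V = algorithm.fit(X)  # U: (100, 5), S: (5,), V: (5, 50)

# rank-k reconstruction from the factors, X ~ U * diag(S) * V
X_hat = U.dot(np.diag(S)).dot(V)
```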
## Input formats

All `numpy` `array` and `bolt` `array` inputs to factorization algorithms must be two-dimensional. Thunder `images` and `series` can also be factored, even though they technically have more than two dimensions, because in each case one dimension is treated as primary. For `images`, each image will be flattened, creating a two-dimensional matrix with shape `(key, pixels)`, and for `series`, the keys will be flattened, creating a two-dimensional matrix with shape `(key, time)`. After factorization, outputs will be reshaped back to their original form.
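A minimal sketch of fitting directly on `thunder` `series` data; `td.series.fromarray` reflects my understanding of the `thunder` API and should be treated as an assumption, and the reshaping comment paraphrases the paragraph above:

```python
import numpy as np
import thunder as td  # td.series.fromarray assumed from the thunder API
from factorization import PCA

# a 4 x 4 grid of keys, each with a length-100 time series
a = np.random.randn(4, 4, 100)
series = td.series.fromarray(a)

# keys are flattened to a (16, 100) matrix before fitting, and
# outputs are reshaped back to the original keyed form afterwards
T, W = PCA(k=3).fit(series)
```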
## Tests

Run tests with

```
py.test
```

Tests run locally with `numpy` by default, but the same tests can be run against a local `spark` installation using

```
py.test --engine=spark
```