Finite Mixtures of Multivariate Poisson-Log Normal Model for Clustering Count Data
MPLNClust is an R package for clustering using finite mixtures of multivariate Poisson-log normal (MPLN) distributions, as proposed by Silva et al., 2019. It was developed for count data, with clustering of RNA sequencing data as the motivation, but the method may be applied to other types of count data. The package provides functions for parameter estimation via 1) an MCMC-EM framework by Silva et al., 2019 and 2) a variational Gaussian approximation with an EM algorithm by Subedi and Browne, 2020. Information criteria (AIC, BIC, AIC3 and ICL) and slope heuristics (Djump and DDSE, if more than 10 models are considered) are offered for model selection. Also included are functions for simulating data from this model and for visualization.
To install the latest version of the package:
require("devtools")
devtools::install_github("anjalisilva/MPLNClust", build_vignettes = TRUE)
library("MPLNClust")
To run the Shiny app:
MPLNClust::runMPLNClust()
To list all functions available in the package:
ls("package:MPLNClust")
MPLNClust contains 14 functions.
- mplnVariational for carrying out clustering of count data using mixtures of MPLN via variational expectation-maximization
- mplnMCMCParallel for carrying out clustering of count data using mixtures of MPLN via a Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM) with parallelization
- mplnMCMCNonParallel for carrying out clustering of count data using mixtures of MPLN via a Markov chain Monte Carlo expectation-maximization algorithm (MCMC-EM) with no parallelization
- mplnDataGenerator for generating simulation data via mixtures of MPLN
- mplnVisualizeAlluvial for visualizing clustering results as Alluvial plots
- mplnVisualizeBar for visualizing clustering results as bar plots
- mplnVisualizeHeatmap for visualizing clustering results as heatmaps
- mplnVisualizeLine for visualizing clustering results as line plots
- AICFunction for model selection
- AIC3Function for model selection
- BICFunction for model selection
- ICLFunction for model selection
- runMPLNClust for the Shiny implementation of mplnVariational
- mplnVarClassification for classification (currently under construction)
The variational framework of mplnVariational makes it computationally more efficient and faster than mplnMCMCParallel or mplnMCMCNonParallel. Therefore, mplnVariational may perform better for large datasets. For more information, see the details below. An overview of the package is illustrated below:
The MPLN distribution (Aitchison and Ho, 1989) is a multivariate log normal mixture of independent Poisson distributions. The hidden layer of the MPLN distribution is a multivariate Gaussian distribution, which allows for the specification of a covariance structure. Further, the MPLN distribution can account for overdispersion in count data. Additionally, the MPLN distribution supports negative and positive correlations.
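The two-layer structure described above can be sketched in base R. This is a minimal illustration and not the package's own generator (mplnDataGenerator); the values of `n`, `mu` and `sigma` are assumptions chosen only for demonstration.

```r
# Minimal base-R sketch of sampling from a single MPLN component.
set.seed(1)

n     <- 500                      # number of observations (illustrative)
mu    <- c(2, 2, 3)               # mean of the hidden Gaussian layer (assumed)
sigma <- matrix(c( 0.5,  0.3, -0.2,
                   0.3,  0.5,  0.0,
                  -0.2,  0.0,  0.5), nrow = 3, byrow = TRUE)  # assumed covariance

# Hidden layer: theta_i ~ N(mu, sigma), drawn via the Cholesky factor.
d     <- length(mu)
z     <- matrix(rnorm(n * d), nrow = n, ncol = d)
theta <- sweep(z %*% chol(sigma), 2, mu, "+")

# Observed layer: independent Poisson counts given theta.
y <- matrix(rpois(n * d, lambda = exp(theta)), nrow = n, ncol = d)

colMeans(y)        # marginal means
apply(y, 2, var)   # variances exceed the means (overdispersion)
cor(y)             # correlations inherit their sign from sigma
```

Because the covariance sits in the Gaussian layer, off-diagonal entries of `sigma` with either sign carry through to the counts, which is what allows both negative and positive correlations.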
A mixture of MPLN distributions was introduced for clustering count data by Silva et al., 2019, where its applicability is illustrated using RNA sequencing data. To date, two frameworks have been proposed for parameter estimation: 1) an MCMC-EM framework by Silva et al., 2019 and 2) a variational Gaussian approximation with an EM algorithm by Subedi and Browne, 2020.
Silva et al., 2019 used an MCMC-EM framework via Stan for parameter estimation. This method is employed in functions mplnMCMCParallel and mplnMCMCNonParallel.
Coarse-grain parallelization is employed in mplnMCMCParallel: when a range of components/clusters (g = 1,…,G) is considered, each component/cluster size is run on a different processor. This is possible because the fit for each component/cluster size is independent of the others. Each component/cluster size in the range to be tested runs on a separate core using the parallel R package. The number of cores used for clustering is calculated using parallel::detectCores() - 1. No internal parallelization is performed for mplnMCMCNonParallel.
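The coarse-grain pattern can be sketched with the parallel package as follows. This is only an illustration of dispatching one fit per value of g; `fitOneG` is a hypothetical stand-in that uses k-means so the sketch runs quickly, and it is not the MCMC-EM fit performed by mplnMCMCParallel.

```r
library(parallel)

set.seed(2)
# Toy data: 60 observations, 4 variables (illustrative only).
dataset <- matrix(rpois(240, lambda = 10), nrow = 60, ncol = 4)

# Hypothetical stand-in for fitting a g-component model.
fitOneG <- function(g, dataset) {
  fit <- stats::kmeans(log(dataset + 1), centers = g, nstart = 5)
  list(g = g, totWithinSS = fit$tot.withinss)
}

gRange <- 1:4

# Leave one core free; guard against NA or single-core machines.
nCores <- max(1, detectCores() - 1, na.rm = TRUE)
cl <- makeCluster(min(nCores, length(gRange)))

# Each value of g is dispatched to a worker independently of the others.
results <- parLapply(cl, gRange, fitOneG, dataset = dataset)
stopCluster(cl)

sapply(results, function(r) r$totWithinSS)
```

Since no information is shared between values of g, the speed-up is limited mainly by the slowest single fit, which is typically the largest g.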
To check the convergence of MCMC chains, the potential scale reduction factor and the effective number of samples are used. Heidelberger and Welch’s convergence diagnostic (Heidelberger and Welch, 1983) is used to check the convergence of the MCMC-EM algorithm. Starting values (argument: initMethod) and the number of iterations for each chain (argument: nIterations) play an important role in the successful operation of this algorithm.
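As a rough illustration of the potential scale reduction factor, the classic (non-split) Gelman-Rubin statistic can be computed in base R. The draws below are simulated stand-ins for MCMC output, and Stan reports a refined variant, so this sketch shows the idea and is not the exact quantity the package checks.

```r
set.seed(3)
nChains <- 3
nIter   <- 1000

# Simulated draws standing in for MCMC output: one column per chain.
chains <- matrix(rnorm(nChains * nIter, mean = 0, sd = 1),
                 nrow = nIter, ncol = nChains)

# Classic (non-split) Gelman-Rubin statistic.
chainMeans <- colMeans(chains)
W <- mean(apply(chains, 2, var))       # within-chain variance
B <- nIter * var(chainMeans)           # between-chain variance
varPlus <- (nIter - 1) / nIter * W + B / nIter
rHat <- sqrt(varPlus / W)
rHat   # values close to 1 suggest the chains have mixed
```

Chains started from poor values or run for too few iterations tend to give a between-chain variance that is large relative to the within-chain variance, pushing the statistic above 1.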
Subedi and Browne, 2020 proposed a variational Gaussian approximation that alleviates the challenges of the MCMC-EM algorithm. Here the posterior distribution is approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities. A variational-EM based framework is used for parameter estimation. This algorithm is implemented in the function mplnVariational. The parsimonious family of models obtained by considering an eigen-decomposition of the covariance matrix in Subedi and Browne, 2020 is not yet available in this package.
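The link between the KL divergence and the quantity that is maximized can be written using the standard variational identity. The notation here is an assumption made for illustration: $y$ is an observed count vector, $\theta$ the latent Gaussian variable, and $q(\theta)$ the approximating Gaussian density with variational mean $m$ and covariance $S$.

```latex
\log f(y) \;=\;
\underbrace{\int q(\theta)\,\log\frac{f(y,\theta)}{q(\theta)}\,d\theta}_{F(q,\,y)}
\;+\;
\underbrace{\int q(\theta)\,\log\frac{q(\theta)}{f(\theta \mid y)}\,d\theta}_{D_{\mathrm{KL}}\left(q \,\|\, f(\cdot \mid y)\right)},
\qquad q(\theta) = \mathcal{N}(\theta;\, m,\, S).
```

Because $\log f(y)$ does not depend on $q$, minimizing the KL divergence over $(m, S)$ is equivalent to maximizing the lower bound $F(q, y)$, which avoids sampling from the posterior.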
Four model selection criteria are offered: the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al., 2000). Slope heuristics (Djump and DDSE; Arlot et al., 2016) can be used for model selection if more than 10 models are considered.
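The four criteria can be sketched in base R from a fitted log-likelihood. This is a minimal sketch in the "smaller is better" convention, not the code of AICFunction, BICFunction, AIC3Function or ICLFunction, whose sign conventions and arguments may differ. The helper names `nFreeParameters` and `informationCriteria`, and all numbers, are hypothetical.

```r
# Free parameters in an unconstrained G-component mixture with d dimensions:
# (G - 1) mixing proportions + G*d means + G*d*(d+1)/2 covariance entries.
nFreeParameters <- function(G, d) {
  (G - 1) + G * d + G * d * (d + 1) / 2
}

# logLik: maximized log-likelihood; k: free parameters; n: observations;
# z: optional n x G matrix of posterior membership probabilities.
informationCriteria <- function(logLik, k, n, z = NULL) {
  out <- c(AIC  = -2 * logLik + 2 * k,
           AIC3 = -2 * logLik + 3 * k,
           BIC  = -2 * logLik + k * log(n))
  if (!is.null(z)) {
    # ICL adds an entropy-type penalty based on MAP assignments.
    mapLabels <- max.col(z, ties.method = "first")
    zMap <- z[cbind(seq_len(nrow(z)), mapLabels)]
    out["ICL"] <- out["BIC"] - 2 * sum(log(zMap))
  }
  out
}

# Illustrative numbers only (not from a real fit).
set.seed(4)
z1 <- runif(50)
z  <- cbind(z1, 1 - z1)
k  <- nFreeParameters(G = 2, d = 3)
ic <- informationCriteria(logLik = -250, k = k, n = nrow(z), z = z)
ic
```

The criteria differ only in how strongly they penalize the parameter count, and ICL additionally penalizes uncertain cluster assignments, so it tends to favour well-separated clusters.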
Starting values (argument: initMethod) and the number of iterations for each chain (argument: nInitIterations) play an important role in the successful operation of this algorithm. There may be issues with singularity, in which case altering the starting values or the initialization method may help.
The Shiny app employing mplnVariational can be run to perform clustering and visualize the results:
MPLNClust::runMPLNClust()
In short, runMPLNClust is a web application available with MPLNClust.
For tutorials and plot interpretation, refer to the vignette:
browseVignettes("MPLNClust")
citation("MPLNClust")
Silva, A., S. J. Rothstein, P. D. McNicholas, and S. Subedi (2019). A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinformatics. 20(1):394.
A BibTeX entry for LaTeX users is
@Article{,
title = {A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data},
author = {A. Silva and S. J. Rothstein and P. D. McNicholas and S. Subedi},
journal = {BMC Bioinformatics},
year = {2019},
volume = {20},
number = {1},
pages = {394},
url = {https://pubmed.ncbi.nlm.nih.gov/31311497/},
}
- Aitchison, J. and C. H. Ho (1989). The multivariate Poisson-log normal distribution. Biometrika.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6.
- Anjali Silva (anjali@alumni.uoguelph.ca).
MPLNClust welcomes issues, enhancement requests, and other contributions. To submit an issue, use the GitHub issues.
- Dr. Marcelo Ponce, SciNet HPC Consortium, University of Toronto, ON, Canada for all the computational support.
- This work was funded by Natural Sciences and Engineering Research Council of Canada, Queen Elizabeth II Graduate Scholarship, and Arthur Richmond Memorial Scholarship.