This paper was presented at EUSIPCO 2024 (paper #1133) in session TH2.PA2 (poster 8).
This repository contains the implementation of the Coupled Generator Decomposition (CGD) model, which can be used for multisubject, multimodal, or multigroup modeling. The model introduced here extends sparse principal component analysis (SPCA) for data fusion, enabling the identification of common features across diverse data sources while accommodating modality- and subject-specific variability. The toolbox also includes archetypal analysis and directional archetypal analysis. Please cite the following paper if you use this repository in your own work (the paper is also available on arXiv):
Say you have a data matrix for each of several subjects, modalities, or groups. Here we present a conceptual framework and PyTorch toolbox for data fusion using sparse principal component analysis. In short, we require that at least one dimension is sampled equally across data groups (e.g., a time axis shared between modalities and subjects). Along the shared dimension, a sparse source identifier matrix $G$ is learned that is common to all data groups, while group-specific mixing matrices accommodate modality- and subject-specific variability.
The main Python class in `CGD.py` implements the Coupled Generator Decomposition model for a variable number of input data dimensions.
- Requirements: PyTorch (plus tqdm and numpy for the trainer)
The input data `X` may be a torch tensor of size `(*,N,P)`, where `*` is any number of extra dimensions (e.g., subjects, clinical groups, or modalities). If `N` differs across data groups, the input may instead be a dictionary of torch tensors.
- Example 1: A time-locked EEG experiment is repeated across $S$ subjects using the same number of EEG sensors $N$ and time points $P$: `X` is a torch tensor with `X.shape = (S,N,P)`.
- Example 2: A time-locked, simultaneous multimodal EEG and MEG experiment is repeated across $S$ subjects, but the numbers of EEG and MEG sensors differ: `X = {'EEG': EEGdata, 'MEG': MEGdata}` is a dictionary of torch tensors, where, e.g., `EEGdata.shape = (S,N_EEG,P)`. There are no restrictions on the naming or number of dictionary elements. Both cases are sketched below.
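To make the two cases concrete, the following sketch builds synthetic inputs of the expected shapes; the dimensions (`S`, `P`, `N_EEG`, `N_MEG`) and the random data are purely illustrative.

```python
import torch

# Hypothetical dimensions for illustration only
S, P = 10, 200            # subjects, shared time samples
N_EEG, N_MEG = 64, 102    # differing sensor counts per modality

# Example 1: equal dimensions across subjects -> a single tensor (S, N, P)
X_single = torch.randn(S, N_EEG, P)

# Example 2: differing sensor counts -> a dictionary of tensors sharing P
X_multi = {'EEG': torch.randn(S, N_EEG, P),
           'MEG': torch.randn(S, N_MEG, P)}
```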
In summary, these are the inputs:
- Required inputs:
  - `num_comp`: the number of components to learn.
  - `X`: a torch tensor or dictionary of torch tensors.
- Optional inputs:
  - `model`: the model to use. Options are 'SPCA', 'AA', and 'DAA'. Default is 'SPCA'.
  - `Xtilde`: a torch tensor or dictionary of torch tensors to construct sources from, e.g., only the post-stimulus part of a time-locked experiment. If not specified, `Xtilde` is assumed to be equal to `X`.
  - `G_idx`: as an alternative to providing `Xtilde`, a boolean tensor of size `(P,)` indicating the subset of `X` to construct sources from (defaults to all samples; see the sketch after this list).
  - `lambda1`: the L1 regularization parameter for SPCA.
  - `lambda2`: the L2 regularization parameter for SPCA.
  - `init`: a dictionary of torch tensors specifying initial values for `G` (and `S`, if `model=='AA'`).
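As a sketch of how `Xtilde` and `G_idx` relate, the two calls below are intended to be equivalent; the data and timing vector are synthetic, and the constructor signature follows the quickstart example further down.

```python
import torch
from CGD import CGD

# Synthetic example data; all dimensions are hypothetical
X = torch.randn(10, 64, 200)             # (subjects, sensors, samples)
times = torch.linspace(-0.2, 0.8, 200)   # sample timing relative to stimulus

# (a) boolean index along the shared dimension
model_a = CGD.TMMSAA(X=X, num_comp=5, model='SPCA',
                     lambda1=1e-2, lambda2=1e-1, G_idx=times >= 0)

# (b) the same subset passed explicitly
model_b = CGD.TMMSAA(X=X, num_comp=5, model='SPCA',
                     lambda1=1e-2, lambda2=1e-1, Xtilde=X[..., times >= 0])
```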
- SPCA: Sparse Principal Component Analysis, for which the L1 and L2 regularization coefficients must be provided. The model optimizes a shared generator matrix `G` of size `(P,K)`; a mixing matrix `S` of size `(*,K,P)` is inferred via an orthogonal Procrustes transformation of `(X.T@Xtilde)@G` (see the sketch after this list).
- AA: Archetypal Analysis, in which the model optimizes both a shared generator matrix `G` of size `(P,K)` and a mixing matrix `S` of size `(*,K,P)`. Both `G` and `S` are non-negative, and sum-to-one constraints are enforced through the softmax function.
- DAA: Directional Archetypal Analysis, which works like AA except that the data are assumed to lie on a sign-invariant hypersphere.
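For intuition, here is a minimal sketch of an orthogonal Procrustes step of the kind described for SPCA: the closest orthonormal mixing matrix is obtained from the SVD of `(X.T@Xtilde)@G`. The helper name `procrustes_mixing` and the shapes are illustrative assumptions, not the toolbox's internal code.

```python
import torch

def procrustes_mixing(X, Xtilde, G):
    """Orthogonal Procrustes sketch: orthonormal S closest to (X.T @ Xtilde) @ G."""
    M = X.transpose(-2, -1) @ Xtilde @ G               # (*, P, K)
    U, _, Vt = torch.linalg.svd(M, full_matrices=False)
    return (U @ Vt).transpose(-2, -1)                  # S with shape (*, K, P)

# Tiny synthetic check with hypothetical shapes
X = torch.randn(3, 64, 200)      # (*, N, P)
G = torch.randn(200, 5)          # (P, K)
S = procrustes_mixing(X, X, G)   # (3, 5, 200)
```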
The class is accompanied by a trainer, which outputs the loss curve and the lowest loss obtained. Following training, the estimated model parameters may be extracted using `G,S,Bp,Bn = model.get_model_params()`. `Bp` and `Bn` are constrained non-negative variables such that `G = Bp - Bn`.
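A minimal sketch of the split parameterization `G = Bp - Bn` is shown below, assuming non-negativity is enforced with a softplus, which is one common choice; the toolbox's actual constraint mechanism may differ.

```python
import torch
import torch.nn.functional as F

P, K = 200, 5                                  # hypothetical sizes
Bp_raw = torch.randn(P, K, requires_grad=True)
Bn_raw = torch.randn(P, K, requires_grad=True)

Bp = F.softplus(Bp_raw)        # non-negative positive part
Bn = F.softplus(Bn_raw)        # non-negative negative part
G = Bp - Bn                    # signed, potentially sparse generator
l1_surrogate = (Bp + Bn).sum() # differentiable bound on the L1 norm of G
```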
In a terminal, change directory to the directory containing the CGD folder (optionally create a new virtual/conda environment with Python, PyTorch, tqdm, and numpy) and run `pip install .` to install CGD as a loadable Python package.
```python
import torch
from CGD import CGD, CGD_trainer

# specify data and hyperparameters
X_train = {'EEG': EEGdata, 'MEG': MEGdata}  # EEGdata and MEGdata are torch tensors
X_test = {'EEG': EEGdata_test, 'MEG': MEGdata_test}
lambda_1 = 1e-2     # L1-regularization
lambda_2 = 1e-1     # L2-regularization
G_idx = times >= 0  # times is a torch tensor indicating sample timing relative to stimulus
K = 5               # number of components
lr = 0.1            # learning rate

# initialize model and run parameter estimation
model = CGD.TMMSAA(X=X_train, num_comp=K, model='SPCA',
                   lambda1=lambda_1, lambda2=lambda_2, G_idx=G_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss, _ = CGD_trainer.Optimizationloop(model=model, optimizer=optimizer)

# retrieve estimated parameters and evaluate loss on a test set
G, S, _, _ = model.get_model_params()
test_loss = model.eval_model(Xtrain=X_train, Xtest=X_test, G_idx=G_idx)
```
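Since the trainer returns the loss curve, a quick way to check convergence is to plot it; this assumes `loss` is a per-iteration sequence of scalar values.

```python
import matplotlib.pyplot as plt

# Plot the training loss returned by CGD_trainer.Optimizationloop
plt.plot(loss)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('CGD training loss')
plt.show()
```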
- Anders S Olsen
- Affiliation: DTU Compute
- Year: 2023-2024
- Toolbox introduced: February 2024
- Toolbox updates: March 2024 (AA bug fix), August 2024 (readability and usability)