Tidy tutorials

dherrera1911 committed Feb 5, 2025
1 parent 763e0b0 commit 4b2df7b

Showing 4 changed files with 73 additions and 68 deletions.
17 changes: 9 additions & 8 deletions docs/source/tutorials/digit_processing.md
@@ -14,9 +14,8 @@ kernelspec:
# Digit recognition with SQFA

In this tutorial, we compare SQFA to standard dimensionality
reduction methods using the digit recognition dataset
[Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/).
We compare SQFA to different standard methods available
in the `sklearn` library: PCA, LDA, ICA and Factor Analysis.
To compare the methods, we test the performance of a
@@ -92,8 +91,8 @@ plt.show()

We see that we have 10 classes and that the training
data consists of 73257 samples of 1024 dimensions. We will now
learn 9 filters for this dataset using each of the different
dimensionality reduction methods.

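As a rough sketch of how the `sklearn` baselines might be configured with
9 components (the tutorial's actual setup code is not shown in this excerpt,
and the variable names below are illustrative):

```python
# Hypothetical configuration of the sklearn baselines with 9 components each;
# the tutorial's actual code may differ.
from sklearn.decomposition import PCA, FastICA, FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_filters = 9
baselines = {
    "PCA": PCA(n_components=n_filters),
    "ICA": FastICA(n_components=n_filters),
    "Factor Analysis": FactorAnalysis(n_components=n_filters),
    "LDA": LinearDiscriminantAnalysis(n_components=n_filters),
}
# The unsupervised methods are fit on x_train alone; LDA also needs y_train.
```
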
:::{admonition} Maximum number of filters
A limitation of LDA is that it can learn a maximum of $c-1$ filters, where
@@ -200,10 +199,12 @@ the learning process. The method `fit_pca` of the `SQFA` class
sets the filters to the PCA components of the data.
:::

Let's evaluate how well the filters separate the classes quadratically,
by using a QDA classifier on each feature set.
QDA fits a Gaussian distribution (mean and covariance) to
each class and uses Bayes' rule to classify samples. Both the
class-specific means and covariances are used
to classify samples.

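The helper used for this evaluation is defined in the next cell (its body is
not shown in this excerpt). As a rough sketch, an equivalent evaluation with
`sklearn`'s `QuadraticDiscriminantAnalysis` might look as follows; the filter
shape convention here is an assumption:

```python
# Sketch of a QDA accuracy evaluation on filter outputs (not the tutorial's
# own helper); the filter orientation (n_filters, n_dims) is an assumption.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def qda_accuracy_sketch(x_train, y_train, x_test, y_test, filters):
    z_train = x_train @ filters.T  # project data onto the learned filters
    z_test = x_test @ filters.T
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(z_train, y_train)
    return qda.score(z_test, y_test)
```
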
```{code-cell} ipython3
def get_qda_accuracy(x_train, y_train, x_test, y_test, filters):
    ...
```
40 changes: 21 additions & 19 deletions docs/source/tutorials/distances.md
@@ -15,11 +15,11 @@ kernelspec:

In the [geometry tutorial](https://sqfa.readthedocs.io/en/latest/tutorials/spd_geometry.html)
we explained the geometric intuition behind SQFA and smSQFA.
Without much motivation, we proposed using the affine invariant
distance in the SPD manifold for smSQFA, and the Fisher-Rao distance
in the manifold of normal distributions for SQFA.
In the [SQFA paper](https://arxiv.org/abs/2502.00168) we provide
a theoretical and empirical motivation for this choice.
However, there are other possible distances
(or discriminability measures, or divergences) that could
be used instead, either for practical or theoretical reasons.
@@ -79,28 +79,30 @@ This means that, when using the Fisher-Rao metric, the length of
a curve is given by the accumulated discriminability of the
infinitesimal changes along the curve.

Interestingly, the affine invariant metric for SPD matrices is equivalent to the
Fisher-Rao metric for zero-mean Gaussian distributions. Thus, the
affine invariant distance applied to second-moment matrices has
some interpretability in terms of probability distributions:
it is the accumulated discriminability of the infinitesimal changes
transforming $\mathcal{N}(\mathbf{0}, \mathbf{A})$ into
$\mathcal{N}(\mathbf{0}, \mathbf{B})$.
:::

The Bures-Wasserstein distance between two SPD matrices
$\mathbf{A}$ and $\mathbf{B}$ is defined as:
$d_{BW}(\mathbf{A}, \mathbf{B}) =
\sqrt{ \text{Tr}(\mathbf{A}) + \text{Tr}(\mathbf{B}) -
2 \text{Tr}(\sqrt{\mathbf{A}^{1/2} \mathbf{B} \mathbf{A}^{1/2}}) }$

where $\text{Tr}$ is the trace.

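As a quick numerical sketch of this formula (using `scipy`, not the `sqfa`
implementation discussed below):

```python
# Minimal numerical sketch of the Bures-Wasserstein distance formula above.
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(A, B):
    sqrt_A = sqrtm(A)
    # Trace of the matrix square root of A^{1/2} B A^{1/2}.
    cross_term = np.trace(sqrtm(sqrt_A @ B @ sqrt_A))
    inner = np.trace(A) + np.trace(B) - 2 * np.real(cross_term)
    return np.sqrt(max(inner, 0.0))  # clip tiny negatives from round-off
```
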
:::{admonition} Bures-Wasserstein distance and optimal transport
:name: optimal-transport
Like the affine invariant distance, the Bures-Wasserstein distance
in the SPD manifold has an interpretation
in terms of Gaussian distributions. Specifically, the Bures-Wasserstein
distance between two SPD matrices
$\mathbf{A}$ and $\mathbf{B}$ is the optimal transport distance
between the two zero-mean Gaussian distributions
$\mathcal{N}(\mathbf{0}, \mathbf{A})$ and $\mathcal{N}(\mathbf{0}, \mathbf{B})$.

@@ -111,14 +113,13 @@ Gaussian distribution given by $\mathcal{N}(\mathbf{0}, \mathbf{A})$
is a pile of dirt. The Bures-Wasserstein distance is the cost of
moving that pile of dirt into the shape given by
$\mathcal{N}(\mathbf{0}, \mathbf{B})$.

Optimal transport distances are a popular tool in machine learning,
and sometimes have advantages with respect to the Fisher-Rao distances.

From the earth mover's perspective, it is easy to see that the
Bures-Wasserstein distance is not scale-invariant: if we scale up
the distributions, we need to move the dirt across larger distances,
increasing the cost.
:::

### Implementing the Bures-Wasserstein distance
@@ -435,6 +436,7 @@ Calvo and Oller (1990)[^3^].
This approximation is implemented in `sqfa.distances.fisher_rao_lower_bound()`.
[^3^]: Calvo, B., & Oller, J. (1990). "A distance between multivariate normal distributions based in an embedding into the Siegel group." Journal of Multivariate Analysis, 35(2), 223-242.

Currently, the functionality of using custom distance functions
with the SQFA method is not implemented in `sqfa`, but it will
be implemented soon.

76 changes: 38 additions & 38 deletions docs/source/tutorials/spd_geometry.md
@@ -17,7 +17,7 @@ In this tutorial we provide a brief overview of the geometric perspective
of SQFA and smSQFA. We start by explaining what supervised linear
dimensionality reduction is. Then, we explain how SQFA and smSQFA
use a geometric perspective on discriminability to learn
discriminative linear features.

## SQFA and smSQFA perform supervised dimensionality reduction

@@ -26,15 +26,15 @@ of methods that use the class labels of the data to learn a set
of linear features that are useful for classification.

More specifically, given a dataset $\{\mathbf{x}_t, y_t\}_{t=1}^N$
where $\mathbf{x}_t \in \mathbb{R}^n$ are the data vectors,
$y_t \in \{1, \ldots, c\}$ are the class labels,
the goal of supervised linear dimensionality reduction in
general is to learn a set of linear filters
$\mathbf{F} \in \mathbb{R}^{n \times m}$, with $m<n$, such that
the transformed data points $\mathbf{z}_t = \mathbf{F}^T \mathbf{x}_t$
support classification as well as possible.

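In code, applying the filters is just a matrix product; here is a minimal
sketch with assumed shapes (the data layout is an assumption, not the `sqfa`
API):

```python
# Sketch: x has shape (n_samples, n) and the filters F have shape (n, m);
# the projected features z_t = F^T x_t are the rows of x @ F, shape (n_samples, m).
import numpy as np

def apply_filters(x, F):
    return x @ F
```
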
The classical example of supervised dimensionality reduction is
[Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
(LDA). LDA aims to maximize linear discriminability.
For this, LDA uses the class-conditional means
@@ -43,7 +43,7 @@ $\Sigma$, where it is assumed that all classes have the same
covariance matrix. Then, LDA learns a set of linear filters
$\mathbf{F} \in \mathbb{R}^{n \times m}$ that maximizes the spread
between the class means (or their distances)
relative to the covariance of the data in the feature space.
For this goal, LDA quantifies the spread between class means
using the Mahalanobis distance,
which is just the Euclidean distance after transforming
@@ -54,9 +54,11 @@ The goal of SQFA and smSQFA is similar to LDA. What
characterizes SQFA and smSQFA, however, is that they take
into account the class-specific second-order statistics to make
the classes discriminable. Specifically, SQFA uses both
the class-conditional means $\mu_i$ and the class-conditional
covariance matrices $\Sigma_i$ of the features,
and smSQFA uses only the class-conditional
second-moment matrices $\Psi_i = \mathbb{E}[\mathbf{z}\mathbf{z}^T | y=i]$.
Then, SQFA and smSQFA learn the
filters that make the classes as different as possible
considering their second-order statistics. Such
differences in second-order statistics are particularly
Expand All @@ -73,16 +75,6 @@ geometry to explain smSQFA, which is the simpler of the two methods.

## The geometric perspective of smSQFA


:::{admonition} Riemannian manifolds
Riemannian manifolds are geometric spaces that locally look like Euclidean
spaces, but globally can have a more complex structure. The classical example
@@ -108,17 +100,25 @@ such spaces, like measuring distances, interpolating, finding averages, etc.
</figure>
:::

Symmetric Positive Definite (SPD) matrices are symmetric matrices whose
eigenvalues are all strictly positive. This type of matrix appears in many
statistics and machine learning applications. For example, covariance matrices
and second-moment matrices are SPD matrices[^1].
The set of $m$-by-$m$ SPD matrices forms a manifold, denoted $\mathrm{SPD(m)}$.
The geometry of $\mathrm{SPD(m)}$ (which is shaped like an open cone
in the space of symmetric matrices) is very well studied, and there
are different formulas for computing distances, geodesics, means, and other
geometric quantities in this space.

As mentioned above, smSQFA focuses on maximizing the
discriminability allowed by the class-conditional
second-moment matrices $\Psi_i$. For this, smSQFA considers the
second-moment matrices $\Psi_i$ as points in the SPD manifold $\mathrm{SPD(m)}$.

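For concreteness, here is a small sketch (not the `sqfa` API) of estimating
the class-conditional second-moment matrices from feature data:

```python
# Sketch: estimate Psi_i = E[z z^T | y = i] from features z (n_samples, m)
# and integer labels y; each estimate is a symmetric positive (semi)definite
# m-by-m matrix. Not part of the sqfa package.
import numpy as np

def class_second_moments(z, y):
    classes = np.unique(y)
    return np.stack([z[y == c].T @ z[y == c] / np.sum(y == c) for c in classes])
```
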
<figure>
<p align="center">
<img src="../_static/sqfa_geometry.svg" width="600">
<figcaption>
<b>Geometry of data statistics.</b>
<i>smSQFA considers the geometry of the second-order statistics of
@@ -131,11 +131,11 @@ that they can be seen as points in the SPD manifold $\mathrm{SPD(m)}$.
</figure>

What is the advantage of considering the second-moment matrices
as points in $\mathrm{SPD(m)}$? The key idea is that when matrices
$\Psi_i$ and $\Psi_j$ are farther apart in $\mathrm{SPD(m)}$, they
are more different, and thus more discriminable. Therefore,
smSQFA maximizes the distances between the second-moment matrices
in $\mathrm{SPD(m)}$. This is analogous to how LDA
maximizes the distances between the class means in Euclidean
space[^2]. The objective of smSQFA can be written as

@@ -155,10 +155,11 @@ $d(\Psi_i,\Psi_j) = \left\| \log(\Psi_i^{-1/2}\Psi_j\Psi_i^{-1/2}) \right\|_F = \sqrt{\sum_k \log^2 \lambda_k}$
where $\log$ is the matrix logarithm, $\|\cdot\|_F$ is the Frobenius norm,
and where $\lambda_k$ are the eigenvalues of $\Psi_i^{-1/2}\Psi_j\Psi_i^{-1/2}$,
or equivalently, the generalized eigenvalues of the
pair $(\Psi_i, \Psi_j)$.

The relationship between the
affine-invariant distance and quadratic discriminability is
discussed at length in the [SQFA paper](https://arxiv.org/abs/2502.00168).

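As a numerical sketch of the distance above, computed via generalized
eigenvalues with `scipy` (this is an illustration, not the `sqfa`
implementation):

```python
# Sketch of the affine invariant distance between SPD matrices Psi_i, Psi_j,
# computed from the generalized eigenvalues of the pair.
import numpy as np
from scipy.linalg import eigh

def affine_invariant_distance(psi_i, psi_j):
    # Generalized eigenvalues lambda_k solving psi_j v = lambda psi_i v equal
    # the eigenvalues of psi_i^{-1/2} psi_j psi_i^{-1/2}.
    lambdas = eigh(psi_j, psi_i, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lambdas) ** 2))
```
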
With this geometric approach, smSQFA learns filters that
allow for quadratic discriminability, as shown with different
examples in the documentation tutorials.
@@ -202,11 +203,10 @@ discriminability based on the second-moment matrices of the
features. However, second-moment matrices are less informative
than considering both the means and the covariance matrices
of the features simultaneously. SQFA considers both the
class-conditional means and covariances.

SQFA uses the same geometric perspective as smSQFA, but
using a different manifold that accounts for both means and
covariances. This is the information-geometry manifold of
$m$-dimensional Gaussian distributions, denoted
$\mathcal{M}_{\mathcal{N}}$, where each point
8 changes: 5 additions & 3 deletions docs/source/tutorials/toy_problem.md
@@ -31,19 +31,21 @@ but different covariance matrices that allow for good quadratic
separability of the classes. The covariances of the classes
are rotated versions of each other.
The differences in covariances make this space preferred by
SQFA and smSQFA. But because the means are the same,
this subspace is not preferred by LDA. The overall variance
in this subspace is moderate, so PCA does not prefer it either.
2) Dimensions 3 and 4 have slightly different means for the classes,
but the same covariance matrix. The differences in means make this
space preferred by LDA. The overall variance in this subspace
is moderate, so PCA does not prefer it. The differences in the
class means are small, so this subspace is not very discriminative.
3) Dimensions 5 and 6 have the same mean and covariance matrix
for all classes, but high overall variance. This space is
preferred by PCA. This subspace is not preferred by SQFA or LDA
because it is not discriminative.

The three subspaces will be made clear in the plots below.

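For intuition, here is a rough sketch of class statistics with this
three-subspace structure. All specific values (rotation angles, variances,
mean offsets) are made-up assumptions; the tutorial's actual parameters are
set in the implementation section below.

```python
# Illustrative class statistics only; the numbers are assumptions, not the
# tutorial's actual toy-problem parameters.
import numpy as np

n_classes, dim = 3, 6
means = np.zeros((n_classes, dim))
covs = np.tile(np.eye(dim), (n_classes, 1, 1))

for i in range(n_classes):
    # Dims 1-2: same mean, covariances that are rotated copies of each other.
    angle = i * np.pi / n_classes
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    covs[i, :2, :2] = rot @ np.diag([2.0, 0.2]) @ rot.T
    # Dims 3-4: slightly different means, shared (identity) covariance.
    means[i, 2:4] = 0.2 * i
    # Dims 5-6: same mean and covariance for all classes, high variance.
    covs[i, 4:6, 4:6] = 5.0 * np.eye(2)

# Sample data for each class from the corresponding Gaussian.
rng = np.random.default_rng(0)
x = np.concatenate([rng.multivariate_normal(means[i], covs[i], size=200)
                    for i in range(n_classes)])
y = np.repeat(np.arange(n_classes), 200)
```
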

## Implementation of the toy problem
