Tidy tutorials

dherrera1911 committed Feb 5, 2025
1 parent 763e0b0 commit 4b2df7b

Showing 4 changed files with 73 additions and 68 deletions.
17 changes: 9 additions & 8 deletions docs/source/tutorials/digit_processing.md
@@ -14,9 +14,8 @@ kernelspec:
# Digit recognition with SQFA

In this tutorial, we compare SQFA to standard dimensionality
reduction methods using the digit recognition dataset
[Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/).
We compare SQFA to different standard methods available
in the `sklearn` library: PCA, LDA, ICA and Factor Analysis.
To compare the methods, we test the performance of a
@@ -92,8 +91,8 @@ plt.show()

We see that we have 10 classes and that the training
data consists of 73257 samples of 1024 dimensions. We will now
learn 9 filters for this dataset using each of the different
dimensionality reduction methods.

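As a rough sketch of how the `sklearn` baselines might be configured with
9 components (the tutorial's actual setup code is not shown in this excerpt,
and the variable names below are illustrative):

```python
# Hypothetical configuration of the sklearn baselines with 9 components each;
# the tutorial's actual code may differ.
from sklearn.decomposition import PCA, FastICA, FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_filters = 9
baselines = {
    "PCA": PCA(n_components=n_filters),
    "ICA": FastICA(n_components=n_filters),
    "Factor Analysis": FactorAnalysis(n_components=n_filters),
    "LDA": LinearDiscriminantAnalysis(n_components=n_filters),
}
# The unsupervised methods are fit on x_train alone; LDA also needs y_train.
```
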
:::{admonition} Maximum number of filters
A limitation of LDA is that it can learn a maximum of $c-1$ filters, where
@@ -200,10 +199,12 @@ the learning process. The method `fit_pca` of the `SQFA` class
sets the filters to the PCA components of the data.
:::

Let's evaluate how well the filters separate the classes quadratically,
by using a QDA classifier on each feature set.
QDA fits a Gaussian distribution (mean and covariance) to
each class and uses Bayes' rule to classify samples. Both the
class-specific means and covariances are used
to classify samples.

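The helper used for this evaluation is defined in the next cell (its body is
not shown in this excerpt). As a rough sketch, an equivalent evaluation with
`sklearn`'s `QuadraticDiscriminantAnalysis` might look as follows; the filter
shape convention here is an assumption:

```python
# Sketch of a QDA accuracy evaluation on filter outputs (not the tutorial's
# own helper); the filter orientation (n_filters, n_dims) is an assumption.
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

def qda_accuracy_sketch(x_train, y_train, x_test, y_test, filters):
    z_train = x_train @ filters.T  # project data onto the learned filters
    z_test = x_test @ filters.T
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(z_train, y_train)
    return qda.score(z_test, y_test)
```
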
```{code-cell} ipython3
def get_qda_accuracy(x_train, y_train, x_test, y_test, filters):
    ...
```
40 changes: 21 additions & 19 deletions docs/source/tutorials/distances.md
@@ -15,11 +15,11 @@ kernelspec:

In the [geometry tutorial](https://sqfa.readthedocs.io/en/latest/tutorials/spd_geometry.html)
we explained the geometric intuition behind SQFA and smSQFA.
Without much motivation, we proposed using the affine invariant
distance in the SPD manifold for smSQFA, and the Fisher-Rao distance
in the manifold of normal distributions for SQFA.
In the [SQFA paper](https://arxiv.org/abs/2502.00168) we provide
a theoretical and empirical motivation for this choice.
However, there are other possible distances
(or discriminability measures, or divergences) that could
be used instead, either for practical or theoretical reasons.
@@ -79,28 +79,30 @@ This means that, when using the Fisher-Rao metric, the length of
a curve is given by the accumulated discriminability of the
infinitesimal changes along the curve.

Interestingly, the affine invariant metric for SPD matrices is equivalent to the
Fisher-Rao metric for zero-mean Gaussian distributions. Thus, the
affine invariant distance applied to second-moment matrices has
some interpretability in terms of probability distributions:
it is the accumulated discriminability of the infinitesimal changes
transforming $\mathcal{N}(\mathbf{0}, \mathbf{A})$ into
$\mathcal{N}(\mathbf{0}, \mathbf{B})$.
:::

The Bures-Wasserstein distance between two SPD matrices
$\mathbf{A}$ and $\mathbf{B}$ is defined as:
$d_{BW}(\mathbf{A}, \mathbf{B}) =
\sqrt{ \text{Tr}(\mathbf{A}) + \text{Tr}(\mathbf{B}) -
2 \text{Tr}(\sqrt{\mathbf{A}^{1/2} \mathbf{B} \mathbf{A}^{1/2}}) }$

where $\text{Tr}$ is the trace.

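As a quick numerical sketch of this formula (using `scipy`, not the `sqfa`
implementation discussed below):

```python
# Minimal numerical sketch of the Bures-Wasserstein distance formula above.
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(A, B):
    sqrt_A = sqrtm(A)
    # Trace of the matrix square root of A^{1/2} B A^{1/2}.
    cross_term = np.trace(sqrtm(sqrt_A @ B @ sqrt_A))
    inner = np.trace(A) + np.trace(B) - 2 * np.real(cross_term)
    return np.sqrt(max(inner, 0.0))  # clip tiny negatives from round-off
```
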
:::{admonition} Bures-Wasserstein distance and optimal transport
:name: optimal-transport
Like the affine invariant distance, the Bures-Wasserstein distance
in the SPD manifold has an interpretation
in terms of Gaussian distributions. Specifically, the Bures-Wasserstein
distance between two SPD matrices
$\mathbf{A}$ and $\mathbf{B}$ is the optimal transport distance
between the two zero-mean Gaussian distributions
$\mathcal{N}(\mathbf{0}, \mathbf{A})$ and $\mathcal{N}(\mathbf{0}, \mathbf{B})$.

@@ -111,14 +113,13 @@ Gaussian distribution given by $\mathcal{N}(\mathbf{0}, \mathbf{A})$
is a pile of dirt. The Bures-Wasserstein distance is the cost of
moving that pile of dirt into the shape given by
$\mathcal{N}(\mathbf{0}, \mathbf{B})$.

Optimal transport distances are a popular tool in machine learning,
and sometimes have advantages with respect to the Fisher-Rao distances.

From the earth mover's perspective, it is easy to see that the
Bures-Wasserstein distance is not scale-invariant: if we scale up
the distributions, we need to move the dirt across larger distances,
increasing the cost.
:::

### Implementing the Bures-Wasserstein distance
@@ -435,6 +436,7 @@ Calvo and Oller (1990)[^3^].
This approximation is implemented in `sqfa.distances.fisher_rao_lower_bound()`.
[^3^]: Calvo, B., & Oller, J. (1990). "A distance between multivariate normal distributions based in an embedding into the Siegel group." Journal of Multivariate Analysis, 35(2), 223-242.

Currently, the functionality of using custom distance functions
with the SQFA method is not implemented in `sqfa`, but it will
be implemented soon.

76 changes: 38 additions & 38 deletions docs/source/tutorials/spd_geometry.md
@@ -17,7 +17,7 @@ In this tutorial we provide a brief overview of the geometric perspective
of SQFA and smSQFA. We start by explaining what supervised linear
dimensionality reduction is. Then, we explain how SQFA and smSQFA
use a geometric perspective on discriminability to learn
discriminative linear features.

## SQFA and smSQFA perform supervised dimensionality reduction

@@ -26,15 +26,15 @@ of methods that use the class labels of the data to learn a set
of linear features that are useful for classification.

More specifically, given a dataset $\{\mathbf{x}_t, y_t\}_{t=1}^N$
where $\mathbf{x}_t \in \mathbb{R}^n$ are the data vectors,
$y_t \in \{1, \ldots, c\}$ are the class labels,
the goal of supervised linear dimensionality reduction in
general is to learn a set of linear filters
$\mathbf{F} \in \mathbb{R}^{n \times m}$, with $m<n$, such that
the transformed data points $\mathbf{z}_t = \mathbf{F}^T \mathbf{x}_t$
support classification as well as possible.

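In code, applying the filters is just a matrix product; here is a minimal
sketch with assumed shapes (the data layout is an assumption, not the `sqfa`
API):

```python
# Sketch: x has shape (n_samples, n) and the filters F have shape (n, m);
# the projected features z_t = F^T x_t are the rows of x @ F, shape (n_samples, m).
import numpy as np

def apply_filters(x, F):
    return x @ F
```
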
The classical example of supervised dimensionality reduction is
[Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
(LDA). LDA aims to maximize linear discriminability.
For this, LDA uses the class-conditional means
@@ -43,7 +43,7 @@ $\Sigma$, where it is assumed that all classes have the same
covariance matrix. Then, LDA learns a set of linear filters
$\mathbf{F} \in \mathbb{R}^{n \times m}$ that maximizes the spread
between the class means (or their distances)
relative to the covariance of the data in the feature space.
For this goal, LDA quantifies the spread between class means
using the Mahalanobis distance,
which is just the Euclidean distance after transforming
@@ -54,9 +54,11 @@ The goal of SQFA and smSQFA is similar to LDA. What
characterizes SQFA and smSQFA, however, is that they take
into account the class-specific second-order statistics to make
the classes discriminable. Specifically, SQFA uses both
the class-conditional means $\mu_i$ and the class-conditional
covariance matrices $\Sigma_i$ of the features,
and smSQFA uses only the class-conditional
second-moment matrices $\Psi_i = \mathbb{E}[\mathbf{z}\mathbf{z}^T | y=i]$.
Then, SQFA and smSQFA learn the
filters that make the classes as different as possible
considering their second-order statistics. Such
differences in second-order statistics are particularly
Expand All @@ -73,16 +75,6 @@ geometry to explain smSQFA, which is the simpler of the two methods.

## The geometric perspective of smSQFA


:::{admonition} Riemannian manifolds
Riemannian manifolds are geometric spaces that locally look like Euclidean
spaces, but globally can have a more complex structure. The classical example
@@ -108,17 +100,25 @@ such spaces, like measuring distances, interpolating, finding averages, etc.
</figure>
:::

Symmetric Positive Definite (SPD) matrices are symmetric matrices whose
eigenvalues are all strictly positive. This type of matrix appears in many
statistics and machine learning applications. For example, covariance matrices
and second-moment matrices are SPD matrices[^1].
The set of $m$-by-$m$ SPD matrices forms a manifold, denoted $\mathrm{SPD(m)}$.
The geometry of $\mathrm{SPD(m)}$ (which is shaped like an open cone
in the space of symmetric matrices) is very well studied, and there
are different formulas for computing distances, geodesics, means, and other
geometric quantities in this space.

As mentioned above, smSQFA focuses on maximizing the
discriminability allowed by the class-conditional
second-moment matrices $\Psi_i$. For this, smSQFA considers the
second-moment matrices $\Psi_i$ as points in the SPD manifold $\mathrm{SPD(m)}$.

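For concreteness, here is a small sketch (not the `sqfa` API) of estimating
the class-conditional second-moment matrices from feature data:

```python
# Sketch: estimate Psi_i = E[z z^T | y = i] from features z (n_samples, m)
# and integer labels y; each estimate is a symmetric positive (semi)definite
# m-by-m matrix. Not part of the sqfa package.
import numpy as np

def class_second_moments(z, y):
    classes = np.unique(y)
    return np.stack([z[y == c].T @ z[y == c] / np.sum(y == c) for c in classes])
```
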
<figure>
<p align="center">
<img src="../_static/sqfa_geometry.svg" width="600">
<figcaption>
<b>Geometry of data statistics.</b>
<i>smSQFA considers the geometry of the second-order statistics of
@@ -131,11 +131,11 @@ that they can be seen as points in the SPD manifold $\mathrm{SPD(m)}$.
</figure>

What is the advantage of considering the second-moment matrices
as points in $\mathrm{SPD(m)}$? The key idea is that when matrices
$\Psi_i$ and $\Psi_j$ are farther apart in $\mathrm{SPD(m)}$, they
are more different, and thus more discriminable. Therefore,
smSQFA maximizes the distances between the second-moment matrices
in $\mathrm{SPD(m)}$. This is analogous to how LDA
maximizes the distances between the class means in Euclidean
space[^2]. The objective of smSQFA can be written as

@@ -155,10 +155,11 @@ $d(\Psi_i,\Psi_j) = \left\| \log(\Psi_i^{-1/2}\Psi_j\Psi_i^{-1/2}) \right\|_F = \sqrt{\sum_k \log^2 \lambda_k}$
where $\log$ is the matrix logarithm, $\|\cdot\|_F$ is the Frobenius norm,
and where $\lambda_k$ are the eigenvalues of $\Psi_i^{-1/2}\Psi_j\Psi_i^{-1/2}$,
or equivalently, the generalized eigenvalues of the
pair $(\Psi_i, \Psi_j)$.

The relationship between the
affine-invariant distance and quadratic discriminability is
discussed at length in the [SQFA paper](https://arxiv.org/abs/2502.00168).

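As a numerical sketch of the distance above, computed via generalized
eigenvalues with `scipy` (this is an illustration, not the `sqfa`
implementation):

```python
# Sketch of the affine invariant distance between SPD matrices Psi_i, Psi_j,
# computed from the generalized eigenvalues of the pair.
import numpy as np
from scipy.linalg import eigh

def affine_invariant_distance(psi_i, psi_j):
    # Generalized eigenvalues lambda_k solving psi_j v = lambda psi_i v equal
    # the eigenvalues of psi_i^{-1/2} psi_j psi_i^{-1/2}.
    lambdas = eigh(psi_j, psi_i, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lambdas) ** 2))
```
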
With this geometric approach, smSQFA learns filters that
allow for quadratic discriminability, as shown with different
examples in the documentation tutorials.
@@ -202,11 +203,10 @@ discriminability based on the second-moment matrices of the
features. However, second-moment matrices are less informative
than considering both the means and the covariance matrices
of the features simultaneously. SQFA considers both the
class-conditional means and covariances.

SQFA uses the same geometric perspective as smSQFA, but
using a different manifold that accounts for both means and
covariances. This is the information-geometry manifold of
$m$-dimensional Gaussian distributions, denoted
$\mathcal{M}_{\mathcal{N}}$, where each point
8 changes: 5 additions & 3 deletions docs/source/tutorials/toy_problem.md
@@ -31,19 +31,21 @@ but different covariance matrices that allow for good quadratic
separability of the classes. The covariances of the classes
are rotated versions of each other.
The differences in covariances make this space preferred by
SQFA and smSQFA. But because the means are the same,
this subspace is not preferred by LDA. The overall variance
in this subspace is moderate, so PCA does not prefer it either.
2) Dimensions 3 and 4 have slightly different means for the classes,
but the same covariance matrix. The differences in means make this
space preferred by LDA. The overall variance in this subspace
is moderate, so PCA does not prefer it. The differences in the
class means are small, so this subspace is not very discriminative.
3) Dimensions 5 and 6 have the same mean and covariance matrix
for all classes, but high overall variance. This space is
preferred by PCA. This subspace is not preferred by SQFA or LDA
because it is not discriminative.

The three subspaces will be made clear in the plots below.

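For intuition, here is a rough sketch of class statistics with this
three-subspace structure. All specific values (rotation angles, variances,
mean offsets) are made-up assumptions; the tutorial's actual parameters are
set in the implementation section below.

```python
# Illustrative class statistics only; the numbers are assumptions, not the
# tutorial's actual toy-problem parameters.
import numpy as np

n_classes, dim = 3, 6
means = np.zeros((n_classes, dim))
covs = np.tile(np.eye(dim), (n_classes, 1, 1))

for i in range(n_classes):
    # Dims 1-2: same mean, covariances that are rotated copies of each other.
    angle = i * np.pi / n_classes
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle), np.cos(angle)]])
    covs[i, :2, :2] = rot @ np.diag([2.0, 0.2]) @ rot.T
    # Dims 3-4: slightly different means, shared (identity) covariance.
    means[i, 2:4] = 0.2 * i
    # Dims 5-6: same mean and covariance for all classes, high variance.
    covs[i, 4:6, 4:6] = 5.0 * np.eye(2)

# Sample data for each class from the corresponding Gaussian.
rng = np.random.default_rng(0)
x = np.concatenate([rng.multivariate_normal(means[i], covs[i], size=200)
                    for i in range(n_classes)])
y = np.repeat(np.arange(n_classes), 200)
```
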

## Implementation of the toy problem
