Merge pull request #7 from valentingol/features
✨ Add features: gradient backward, whitening and randomized svd + benchmark with sklearn
valentingol authored Jun 19, 2024
2 parents 26cc518 + 1f2f7a6 commit 9bff944
Showing 12 changed files with 664 additions and 62 deletions.
56 changes: 50 additions & 6 deletions README.md
@@ -3,6 +3,7 @@
Principal Component Analysis (PCA) in PyTorch. The intention is to provide a
simple and easy-to-use implementation of PCA in PyTorch, as similar to
`sklearn`'s PCA as possible (in terms of API and, of course, output).
Plus, this implementation is **fully differentiable and faster** (thanks to GPU parallelization)!

[![Release](https://img.shields.io/github/v/tag/valentingol/torch_pca?label=Pypi&logo=pypi&logoColor=yellow)](https://pypi.org/project/torch_pca/)
![PythonVersion](https://img.shields.io/badge/python-3.8%20%7E%203.11-informational)
@@ -58,27 +59,70 @@ pca_model = PCA(n_components=None, svd_solver='full')

More details and features in the [API documentation](https://torch-pca.readthedocs.io/en/latest/api.html#torch_pca.pca_main.PCA).

## Gradient backward pass

Using the PyTorch framework allows automatic differentiation of the PCA!

The PCA `transform` method is always differentiable, so it is always possible
to compute gradients like this:

```python
pca = PCA()
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    with torch.no_grad():
        pca.fit(out)
    out = pca.transform(out)
    loss = loss_fn(out, targets)
    loss.backward()
```
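
In this example the gradient flows only through `transform`: wrapping `fit` in
`torch.no_grad()` keeps the fitting step itself out of the autograd graph.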

If you want to compute the gradient over the full PCA model (including the
fitted `pca.n_components`), you can do so by using the "full" SVD solver
and skipping the part of the `fit` method that enforces a deterministic
output, by passing `determinist=False` to the `fit` or `fit_transform` method.
This part sorts the components by their singular values and flips their signs
accordingly, so it is not differentiable by nature, but it may not be
necessary if you don't care about the determinism of the output:

```python
pca = PCA(svd_solver="full")
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    out = pca.fit_transform(out, determinist=False)
    loss = loss_fn(out, targets)
    loss.backward()
```

## Comparison of execution time with sklearn's PCA

As shown below, the PyTorch PCA is faster than sklearn's PCA in all the
tested configurations, with the default parameters (for each PCA model):

![include](docs/_static/comparison.png)
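
The script used for this benchmark, `benchmark.py`, is available at the root
of the repository.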

## Implemented features

- [x] `fit`, `transform`, `fit_transform` methods.
- [x] All attributes from sklearn's PCA are available: `explained_variance_(ratio_)`,
  `singular_values_`, `components_`, `mean_`, `noise_variance_`, ...
- [x] Full SVD solver
- [x] SVD by covariance matrix solver
- [x] Randomized SVD solver
- [x] (absent from sklearn) Decide how to center the input data in `transform` method
(default is like sklearn's PCA)
- [x] Find number of components with explained variance proportion
- [x] Automatically find number of components with MLE
- [x] `inverse_transform` method
- [x] Whitening option (see the usage sketch after this list)
- [x] `get_covariance` method
- [x] `get_precision` method and `score`/`score_samples` methods
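
For illustration, here is a minimal sketch combining several of these features
(whitening, the randomized SVD solver, choosing the number of components by
explained variance proportion, and `inverse_transform`). The argument names
are assumed to mirror sklearn's PCA API, as stated above:

```python
import torch

from torch_pca import PCA

inputs = torch.randn(200, 16)

# NOTE: argument names are assumed to follow sklearn's PCA API.
# Whitening combined with the randomized SVD solver.
pca = PCA(n_components=8, svd_solver="randomized", whiten=True)
reduced = pca.fit_transform(inputs)

# Project the reduced data back to the original feature space.
reconstructed = pca.inverse_transform(reduced)

# Choose the number of components by explained variance proportion
# (here: enough components to explain 90% of the variance).
pca_var = PCA(n_components=0.9).fit(inputs)
print(reduced.shape, reconstructed.shape, pca_var.components_.shape)
```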

## To be implemented

- [ ] ARPACK solver
- [ ] Support sparse matrices with ARPACK solver

## Contributing

63 changes: 63 additions & 0 deletions benchmark.py
@@ -0,0 +1,63 @@
"""Comparison between sklearn and torch PCA models."""

# Copyright (c) 2024 Valentin Goldité. All Rights Reserved.

from time import time

# NOTE: requires matplotlib (not in requirements(-dev).txt)
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.decomposition import PCA as PCA_sklearn

from torch_pca import PCA


def main() -> None:
"""Measure and compare the time of execution of the PCA."""
configs = [(75, 75), (100, 2000), (10_000, 500)]
torch_times, sklearn_times = [], []
for config in configs:
inputs = torch.randn(*config)
t0 = time()
PCA(n_components=50).fit_transform(inputs)
torch_times.append(round(time() - t0, 4))
t0 = time()
PCA_sklearn(n_components=50).fit_transform(inputs)
sklearn_times.append(round(time() - t0, 4))
ticks = np.arange(len(configs))
labels = [f"n_samples={config[0]}, n_features={config[1]}" for config in configs]
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(ticks - width / 2, torch_times, width, label="Pytorch PCA")
rects2 = ax.bar(ticks + width / 2, sklearn_times, width, label="Sklearn PCA")
ax.set_ylabel("Time of execution (s)")
ax.set_title("Comparison of execution time between Pytorch and Sklearn PCA.")
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects1, ax)
autolabel(rects2, ax)
fig.tight_layout()
plt.show()


def autolabel(rects: list, ax: plt.Axes) -> None:
"""Attach a text label above each bar in *rects*, displaying its height.
From https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html
"""
for rect in rects:
height = rect.get_height()
ax.annotate(
str(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
ha="center",
va="bottom",
)


if __name__ == "__main__":
main()
Binary file added docs/_static/comparison.png
6 changes: 6 additions & 0 deletions docs/comparison.md
@@ -0,0 +1,6 @@
# Comparison of execution time with sklearn's PCA

As shown below, the PyTorch PCA is faster than sklearn's PCA in all the
tested configurations, with the default parameters (for each PCA model):

![include](https://raw.githubusercontent.com/valentingol/torch_pca/main/docs/_static/comparison.png)
36 changes: 36 additions & 0 deletions docs/grad.md
@@ -0,0 +1,36 @@
# Gradient backward pass

Using the PyTorch framework allows automatic differentiation of the PCA!

The PCA `transform` method is always differentiable, so it is always possible
to compute gradients like this:

```python
pca = PCA()
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    with torch.no_grad():
        pca.fit(out)
    out = pca.transform(out)
    loss = loss_fn(out, targets)
    loss.backward()
```

If you want to compute the gradient over the full PCA model (including the
fitted `pca.n_components`), you can do so by using the "full" SVD solver
and skipping the part of the `fit` method that enforces a deterministic
output, by passing `determinist=False` to the `fit` or `fit_transform` method.
This part sorts the components by their singular values and flips their signs
accordingly, so it is not differentiable by nature, but it may not be
necessary if you don't care about the determinism of the output:

```python
pca = PCA(svd_solver="full")
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    out = pca.fit_transform(out, determinist=False)
    loss = loss_fn(out, targets)
    loss.backward()
```
8 changes: 4 additions & 4 deletions docs/index.rst
@@ -4,15 +4,16 @@ Pytorch PCA
Principal Component Analysis (PCA) in PyTorch. The intention is to
provide a simple and easy-to-use implementation of PCA in PyTorch, as
similar to ``sklearn``\ 's PCA as possible (in terms of API
and, of course, output). Plus, this implementation is **fully differentiable and faster**
(thanks to GPU parallelization)!

|Release| |PythonVersion| |PytorchVersion|

|GitHub User followers| |GitHub User’s User stars|

|Ruff_logo| |Black_logo|

|Ruff| |Flake8| |MyPy| |PyLint|

|Tests| |Coverage| |Documentation Status|

@@ -41,8 +42,6 @@ Documentation: https://torch-pca.readthedocs.io/en/latest/
   :target: https://github.com/valentingol/Dinosor/actions/workflows/ruff.yaml
.. |Flake8| image:: https://github.com/valentingol/torch_pca/actions/workflows/flake.yaml/badge.svg
   :target: https://github.com/valentingol/Dinosor/actions/workflows/flake.yaml
.. |MyPy| image:: https://github.com/valentingol/torch_pca/actions/workflows/mypy.yaml/badge.svg
   :target: https://github.com/valentingol/Dinosor/actions/workflows/mypy.yaml
.. |PyLint| image:: https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/valentingol/8fb4f3f78584e085dd7b0cca7e046d1f/raw/torch_pca_pylint.json
@@ -60,6 +59,7 @@ Documentation: https://torch-pca.readthedocs.io/en/latest/

installation
howto
grad
api
contributing.md
license.md