Merge pull request #7 from valentingol/features
✨ Add features: gradient backward, whitening and randomized svd + benchmark with sklearn
valentingol authored Jun 19, 2024
2 parents 26cc518 + 1f2f7a6 commit 9bff944
Showing 12 changed files with 664 additions and 62 deletions.
56 changes: 50 additions & 6 deletions README.md
@@ -3,6 +3,7 @@
Principal Component Analysis (PCA) in PyTorch. The intention is to provide a
simple and easy-to-use implementation of PCA in PyTorch, as similar to
`sklearn`'s PCA as possible (in terms of API and, of course, output).
Plus, this implementation is **fully differentiable and faster** (thanks to GPU parallelization)!

[![Release](https://img.shields.io/github/v/tag/valentingol/torch_pca?label=Pypi&logo=pypi&logoColor=yellow)](https://pypi.org/project/torch_pca/)
![PythonVersion](https://img.shields.io/badge/python-3.8%20%7E%203.11-informational)
@@ -58,27 +59,70 @@ pca_model = PCA(n_components=None, svd_solver='full')

More details and features in the [API documentation](https://torch-pca.readthedocs.io/en/latest/api.html#torch_pca.pca_main.PCA).

## Gradient backward pass

Using the PyTorch framework allows automatic differentiation of the PCA!

The PCA `transform` method is always differentiable, so it is always possible
to compute gradients like this:

```python
pca = PCA()
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    with torch.no_grad():
        pca.fit(out)
    out = pca.transform(out)
    loss = loss_fn(out, targets)
    loss.backward()
```
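
In this example the gradient flows only through `transform`: wrapping `fit` in
`torch.no_grad()` keeps the fitting step itself out of the autograd graph.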

If you want to compute the gradient over the full PCA model (including the
fitted `pca.n_components`), you can do so by using the "full" SVD solver
and skipping the part of the `fit` method that enforces a deterministic
output, by passing `determinist=False` to the `fit` or `fit_transform` method.
This part sorts the components by their singular values and flips their signs
accordingly, so it is not differentiable by nature, but it may not be
necessary if you don't care about the determinism of the output:

```python
pca = PCA(svd_solver="full")
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    out = pca.fit_transform(out, determinist=False)
    loss = loss_fn(out, targets)
    loss.backward()
```

## Comparison of execution time with sklearn's PCA

As shown below, the PyTorch PCA is faster than sklearn's PCA in all the
tested configurations, with the default parameters (for each PCA model):

![include](docs/_static/comparison.png)
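
The script used for this benchmark, `benchmark.py`, is available at the root
of the repository.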

## Implemented features

- [x] `fit`, `transform`, `fit_transform` methods.
- [x] All attributes from sklearn's PCA are available: `explained_variance_(ratio_)`,
  `singular_values_`, `components_`, `mean_`, `noise_variance_`, ...
- [x] Full SVD solver
- [x] SVD by covariance matrix solver
- [x] Randomized SVD solver
- [x] (absent from sklearn) Decide how to center the input data in `transform` method
(default is like sklearn's PCA)
- [x] Find number of components with explained variance proportion
- [x] Automatically find number of components with MLE
- [x] `inverse_transform` method
- [x] Whitening option (see the usage sketch after this list)
- [x] `get_covariance` method
- [x] `get_precision` method and `score`/`score_samples` methods
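
For illustration, here is a minimal sketch combining several of these features
(whitening, the randomized SVD solver, choosing the number of components by
explained variance proportion, and `inverse_transform`). The argument names
are assumed to mirror sklearn's PCA API, as stated above:

```python
import torch

from torch_pca import PCA

inputs = torch.randn(200, 16)

# NOTE: argument names are assumed to follow sklearn's PCA API.
# Whitening combined with the randomized SVD solver.
pca = PCA(n_components=8, svd_solver="randomized", whiten=True)
reduced = pca.fit_transform(inputs)

# Project the reduced data back to the original feature space.
reconstructed = pca.inverse_transform(reduced)

# Choose the number of components by explained variance proportion
# (here: enough components to explain 90% of the variance).
pca_var = PCA(n_components=0.9).fit(inputs)
print(reduced.shape, reconstructed.shape, pca_var.components_.shape)
```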

## To be implemented

- [ ] ARPACK solver
- [ ] Support sparse matrices with ARPACK solver

## Contributing

63 changes: 63 additions & 0 deletions benchmark.py
@@ -0,0 +1,63 @@
"""Comparison between sklearn and torch PCA models."""

# Copyright (c) 2024 Valentin Goldité. All Rights Reserved.

from time import time

# NOTE: requires matplotlib (not in requirements(-dev).txt)
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.decomposition import PCA as PCA_sklearn

from torch_pca import PCA


def main() -> None:
"""Measure and compare the time of execution of the PCA."""
configs = [(75, 75), (100, 2000), (10_000, 500)]
torch_times, sklearn_times = [], []
for config in configs:
inputs = torch.randn(*config)
t0 = time()
PCA(n_components=50).fit_transform(inputs)
torch_times.append(round(time() - t0, 4))
t0 = time()
PCA_sklearn(n_components=50).fit_transform(inputs)
sklearn_times.append(round(time() - t0, 4))
ticks = np.arange(len(configs))
labels = [f"n_samples={config[0]}, n_features={config[1]}" for config in configs]
width = 0.35
fig, ax = plt.subplots()
rects1 = ax.bar(ticks - width / 2, torch_times, width, label="Pytorch PCA")
rects2 = ax.bar(ticks + width / 2, sklearn_times, width, label="Sklearn PCA")
ax.set_ylabel("Time of execution (s)")
ax.set_title("Comparison of execution time between Pytorch and Sklearn PCA.")
ax.set_xticks(ticks)
ax.set_xticklabels(labels)
ax.legend()
autolabel(rects1, ax)
autolabel(rects2, ax)
fig.tight_layout()
plt.show()


def autolabel(rects: list, ax: plt.Axes) -> None:
"""Attach a text label above each bar in *rects*, displaying its height.
From https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/barchart.html
"""
for rect in rects:
height = rect.get_height()
ax.annotate(
str(height),
xy=(rect.get_x() + rect.get_width() / 2, height),
xytext=(0, 3), # 3 points vertical offset
textcoords="offset points",
ha="center",
va="bottom",
)


if __name__ == "__main__":
main()
Binary file added docs/_static/comparison.png
6 changes: 6 additions & 0 deletions docs/comparison.md
@@ -0,0 +1,6 @@
# Comparison of execution time with sklearn's PCA

As shown below, the PyTorch PCA is faster than sklearn's PCA in all the
tested configurations, with the default parameters (for each PCA model):

![include](https://raw.githubusercontent.com/valentingol/torch_pca/main/docs/_static/comparison.png)
36 changes: 36 additions & 0 deletions docs/grad.md
@@ -0,0 +1,36 @@
# Gradient backward pass

Using the PyTorch framework allows automatic differentiation of the PCA!

The PCA `transform` method is always differentiable, so it is always possible
to compute gradients like this:

```python
pca = PCA()
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    with torch.no_grad():
        pca.fit(out)
    out = pca.transform(out)
    loss = loss_fn(out, targets)
    loss.backward()
```

If you want to compute the gradient over the full PCA model (including the
fitted `pca.n_components`), you can do so by using the "full" SVD solver
and skipping the part of the `fit` method that enforces a deterministic
output, by passing `determinist=False` to the `fit` or `fit_transform` method.
This part sorts the components by their singular values and flips their signs
accordingly, so it is not differentiable by nature, but it may not be
necessary if you don't care about the determinism of the output:

```python
pca = PCA(svd_solver="full")
for ep in range(n_epochs):
    optimizer.zero_grad()
    out = neural_net(inputs)
    out = pca.fit_transform(out, determinist=False)
    loss = loss_fn(out, targets)
    loss.backward()
```
8 changes: 4 additions & 4 deletions docs/index.rst
@@ -4,15 +4,16 @@ Pytorch PCA
Principal Component Analysis (PCA) in PyTorch. The intention is to
provide a simple and easy-to-use implementation of PCA in PyTorch, as
similar to ``sklearn``\ 's PCA as possible (in terms of API
and, of course, output). Plus, this implementation is **fully differentiable and faster**
(thanks to GPU parallelization)!

|Release| |PythonVersion| |PytorchVersion|

|GitHub User followers| |GitHub User’s User stars|

|Ruff_logo| |Black_logo|

|Ruff| |Flake8| |MyPy| |PyLint|

|Tests| |Coverage| |Documentation Status|

@@ -41,8 +42,6 @@ Documentation: https://torch-pca.readthedocs.io/en/latest/
   :target: https://github.com/valentingol/Dinosor/actions/workflows/ruff.yaml
.. |Flake8| image:: https://github.com/valentingol/torch_pca/actions/workflows/flake.yaml/badge.svg
   :target: https://github.com/valentingol/Dinosor/actions/workflows/flake.yaml
.. |MyPy| image:: https://github.com/valentingol/torch_pca/actions/workflows/mypy.yaml/badge.svg
   :target: https://github.com/valentingol/Dinosor/actions/workflows/mypy.yaml
.. |PyLint| image:: https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/valentingol/8fb4f3f78584e085dd7b0cca7e046d1f/raw/torch_pca_pylint.json
@@ -60,6 +59,7 @@ Documentation: https://torch-pca.readthedocs.io/en/latest/

installation
howto
grad
api
contributing.md
license.md