diff --git a/README.md b/README.md deleted file mode 100644 index ed79b0043..000000000 --- a/README.md +++ /dev/null @@ -1,67 +0,0 @@ -# scikit-matter - -[![Test](https://github.com/lab-cosmo/scikit-matter/workflows/Test/badge.svg)](https://github.com/lab-cosmo/scikit-matter/actions?query=workflow%3ATest) -[![codecov](https://codecov.io/gh/lab-cosmo/scikit-matter/branch/main/graph/badge.svg?token=UZJPJG34SM)](https://codecov.io/gh/lab-cosmo/scikit-matter/) -[![pypi](https://img.shields.io/pypi/v/skmatter.svg)](https://pypi.org/project/skmatter) -[![conda](https://anaconda.org/conda-forge/skmatter/badges/version.svg)](https://anaconda.org/conda-forge/skmatter) -[![documentation](https://img.shields.io/badge/documentation-latest-sucess)](https://scikit-matter.readthedocs.io) - -A collection of scikit-learn compatible utilities that implement methods -born out of the materials science and chemistry communities. - -## Installation - -You can install *scikit-matter* either via pip using - -```bash -pip install skmatter -``` - -or conda - -```bash -conda install -c conda-forge skmatter -``` - -You can then `import skmatter` in your code! - -## Developing the package - -Start by installing the development dependencies: - -```bash -pip install tox black flake8 -``` - -Then this package itself - -```bash -git clone https://github.com/lab-cosmo/scikit-matter -cd scikit-matter -pip install -e . -``` - -This install the package in development mode, making is `import`able globally -and allowing you to edit the code and directly use the updated version. - -### Running the tests - -```bash -cd -# run unit tests -tox -# run the code formatter -black --check . -# run the linter -flake8 -``` - -You may want to setup your editor to automatically apply the -[black](https://black.readthedocs.io/en/stable/) code formatter when saving your -files, there are plugins to do this with [all major -editors](https://black.readthedocs.io/en/stable/editor_integration.html). - -## License and developers - -This project is distributed under the BSD-3-Clauses license. By contributing to -it you agree to distribute your changes under the same license. diff --git a/README.rst b/README.rst new file mode 100644 index 000000000..342f5d00e --- /dev/null +++ b/README.rst @@ -0,0 +1,94 @@ +scikit-matter +============= + +|tests| |codecov| |pypi| |conda| |docs| + +A collection of scikit-learn compatible utilities that implement methods born out of the +materials science and chemistry communities. + +Installation +------------ + +You can install *scikit-matter* either via pip using + +.. code-block:: bash + + pip install skmatter + + +or conda + +.. code-block:: bash + + conda install -c conda-forge skmatter + + +You can then ``import skmatter`` in your code! + +Developing the package +---------------------- + +Start by installing the development dependencies: + +.. code-block:: bash + + pip install tox black flake8 + + +Then this package itself + +.. code-block:: bash + + git clone https://github.com/lab-cosmo/scikit-matter + cd scikit-matter + pip install -e . + + +This installs the package in development mode, making it importable globally and +allowing you to edit the code and directly use the updated version. + +Running the tests +^^^^^^^^^^^^^^^^^ + +.. code-block:: bash + + cd + # run unit tests + tox + # run the code formatter + black --check .
+ # run the linter + flake8 + + You may want to set up your editor to automatically apply the `black`_ code formatter +when saving your files; there are plugins to do this with `all major editors`_. + +License and developers +---------------------- + +This project is distributed under the BSD-3-Clause license. By contributing to it you +agree to distribute your changes under the same license. + +.. _`black`: https://black.readthedocs.io/en/stable/ +.. _`all major editors`: https://black.readthedocs.io/en/stable/editor_integration.html + +.. |tests| image:: https://github.com/lab-cosmo/scikit-matter/workflows/Test/badge.svg + :alt: Github Actions Tests Job Status + :target: https://github.com/lab-cosmo/scikit-matter/actions?query=workflow%3ATests + +.. |codecov| image:: https://codecov.io/gh/lab-cosmo/scikit-matter/branch/main/graph/badge.svg?token=UZJPJG34SM + :alt: Code coverage + :target: https://codecov.io/gh/lab-cosmo/scikit-matter/ + +.. |pypi| image:: https://img.shields.io/pypi/v/skmatter.svg + :alt: Latest PYPI version + :target: https://pypi.org/project/skmatter + +.. |conda| image:: https://anaconda.org/conda-forge/skmatter/badges/version.svg + :alt: Latest conda version + :target: https://anaconda.org/conda-forge/skmatter + +.. |docs| image:: https://img.shields.io/badge/documentation-latest-sucess + :alt: Documentation + :target: https://scikit-matter.readthedocs.io diff --git a/docs/src/bibliography.rst b/docs/src/bibliography.rst index 4af27a247..428925508 100644 --- a/docs/src/bibliography.rst +++ b/docs/src/bibliography.rst @@ -3,42 +3,39 @@ References .. [deJong1992] - S. de Jong, H.A.L. Kiers, - "Principal covariates regression: Part I. Theory", - Chemom. intell. lab. syst. 14 (1992) 155-164 - https://doi.org/10.1016/0169-7439(92)80100-I + "Principal covariates regression: Part I. Theory", Chemom. intell. lab. syst. 14 + (1992) 155-164 https://doi.org/10.1016/0169-7439(92)80100-I .. [Imbalzano2018] - Giulio Imbalzano, Andrea Anelli, Daniele Giofré, Sinja Klees, Jörg Behler, and Michele Ceriotti, - “Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials.” - The Journal of chemical physics 148 24 (2018): 241730. - https://aip.scitation.org/doi/10.1063/1.5024611. + Giulio Imbalzano, Andrea Anelli, Daniele Giofré, Sinja Klees, Jörg Behler, and + Michele Ceriotti, “Automatic selection of atomic fingerprints and reference + configurations for machine-learning potentials.” The Journal of chemical physics 148 + 24 (2018): 241730. https://aip.scitation.org/doi/10.1063/1.5024611. .. [Ceriotti2019] - Michele Ceriotti, Lyndon Emsley, Federico Paruzzo, Albert Hofstetter, Félix Musil, Sandip De, Edgar A. Engel, and Andrea Anelli. - "Chemical Shifts in Molecular Solids by Machine Learning Datasets", - Materials Cloud Archive 2019.0023/v2 (2019), + Michele Ceriotti, Lyndon Emsley, Federico Paruzzo, Albert Hofstetter, Félix Musil, + Sandip De, Edgar A. Engel, and Andrea Anelli. "Chemical Shifts in Molecular Solids + by Machine Learning Datasets", Materials Cloud Archive 2019.0023/v2 (2019), https://doi.org/10.24435/materialscloud:2019.0023/v2. .. [Helfrecht2020] Benjamin A Helfrecht, Rose K Cersonsky, Guillaume Fraux, and Michele Ceriotti, - "Structure-property maps with Kernel principal covariates regression." - 2020 Mach. Learn.: Sci. Technol. 1 045021. + "Structure-property maps with Kernel principal covariates regression." 2020 Mach. + Learn.: Sci. Technol. 1 045021. https://iopscience.iop.org/article/10.1088/2632-2153/aba9ef. ..
[Pozdnyakov2020] - Pozdnyakov, S. N., Willatt, M. J., Bartók, A. P., Ortner, C., Csányi, G., & Ceriotti, M. (2020). - "Incompleteness of Atomic Structure Representations." - Physical Review Letters, 125(16). - https://doi.org/10.1103/physrevlett.125.166001 + Pozdnyakov, S. N., Willatt, M. J., Bartók, A. P., Ortner, C., Csányi, G., & + Ceriotti, M. (2020). "Incompleteness of Atomic Structure Representations." Physical + Review Letters, 125(16). https://doi.org/10.1103/physrevlett.125.166001 .. [Goscinski2021] - Alexander Goscinski, Guillaume Fraux, Giulio Imbalzano, and Michele Ceriotti, - "The role of feature space in atomistic learning." - 2021 Mach. Learn.: Sci. Technol. 2 025028. - https://iopscience.iop.org/article/10.1088/2632-2153/abdaf7. + Alexander Goscinski, Guillaume Fraux, Giulio Imbalzano, and Michele Ceriotti, "The + role of feature space in atomistic learning." 2021 Mach. Learn.: Sci. Technol. 2 + 025028. https://iopscience.iop.org/article/10.1088/2632-2153/abdaf7. .. [Cersonsky2021] - Rose K Cersonsky, Benjamin A Helfrecht, Edgar A. Engel, Sergei Kliavinek, and Michele Ceriotti, - "Improving Sample and Feature Selection with Principal Covariates Regression" - 2021 Mach. Learn.: Sci. Technol. 2 035038. + Rose K Cersonsky, Benjamin A Helfrecht, Edgar A. Engel, Sergei Kliavinek, and + Michele Ceriotti, "Improving Sample and Feature Selection with Principal Covariates + Regression" 2021 Mach. Learn.: Sci. Technol. 2 035038. https://iopscience.iop.org/article/10.1088/2632-2153/abfe7c. diff --git a/docs/src/contributing.rst b/docs/src/contributing.rst index a1d4e5073..ef0716e15 100644 --- a/docs/src/contributing.rst +++ b/docs/src/contributing.rst @@ -18,14 +18,14 @@ Then this package itself cd scikit-matter pip install -e . -This install the package in development mode, making it importable globally -and allowing you to edit the code and directly use the updated version. +This installs the package in development mode, making it importable globally and allowing +you to edit the code and directly use the updated version. Running the tests ################# -The testsuite is implemented using Python's `unittest`_ framework and should be set-up and -run in an isolated virtual environment with `tox`_. All tests can be run with +The testsuite is implemented using Python's `unittest`_ framework and should be set up +and run in an isolated virtual environment with `tox`_. All tests can be run with .. code-block:: bash @@ -40,11 +40,11 @@ If you wish to test only specific functionalities, for example: tox -e examples # test the examples -You can also use ``tox -e format`` to use tox to do actual formatting instead -of just testing it. Also, you may want to setup your editor to automatically apply the -`black <https://black.readthedocs.io/en/stable/>`_ code formatter when saving your -files, there are plugins to do this with `all major -editors <https://black.readthedocs.io/en/stable/editor_integration.html>`_. +You can also use ``tox -e format`` to use tox to do actual formatting instead of just +testing it. Also, you may want to set up your editor to automatically apply the `black +<https://black.readthedocs.io/en/stable/>`_ code formatter when saving your files; there +are plugins to do this with `all major editors +<https://black.readthedocs.io/en/stable/editor_integration.html>`_. .. _unittest: https://docs.python.org/3/library/unittest.html .. _tox: https://tox.readthedocs.io/en/latest @@ -60,9 +60,8 @@ machine as described above. Then, build the documentation with tox -e docs -You can then visualize the local documentation with your favorite browser using -the following command (or open the :file:`docs/build/html/index.html` file -manually).
+You can then visualize the local documentation with your favorite browser using the +following command (or open the :file:`docs/build/html/index.html` file manually). .. code-block:: bash @@ -172,8 +171,8 @@ Then, show ``scikit-matter`` how to load your data by adding a loader function t Add this function to ``src/skmatter/datasets/__init__.py``. -Finally, add a test to ``tests/test_datasets.py`` to see that your dataset -loads properly. It should look something like this: +Finally, add a test to ``tests/test_datasets.py`` to see that your dataset loads +properly. It should look something like this: .. code-block:: python @@ -190,7 +189,8 @@ loads properly. It should look something like this: self.my_data.DESCR -You're good to go! Time to submit a `pull request. `_ +You're good to go! Time to submit a `pull request. +`_ License diff --git a/docs/src/datasets.rst b/docs/src/datasets.rst index 162683323..cd3c368fd 100644 --- a/docs/src/datasets.rst +++ b/docs/src/datasets.rst @@ -7,5 +7,4 @@ Datasets .. include:: ../../src/skmatter/datasets/descr/nice_dataset.rst -.. include:: ../../src/skmatter/datasets/descr/who_dataset.rst - +.. include:: ../../src/skmatter/datasets/descr/who_dataset.rst diff --git a/docs/src/gfrm.rst b/docs/src/gfrm.rst index 1330d4f8f..a7d1f5a6a 100644 --- a/docs/src/gfrm.rst +++ b/docs/src/gfrm.rst @@ -11,12 +11,12 @@ Reconstruction Measures Global Reconstruction Error ########################### -.. autofunction:: pointwise_global_reconstruction_error -.. autofunction:: global_reconstruction_error +.. autofunction:: pointwise_global_reconstruction_error +.. autofunction:: global_reconstruction_error .. _GRD-api: -Global Reconstruction Distortion +Global Reconstruction Distortion ################################ .. autofunction:: pointwise_global_reconstruction_distortion diff --git a/docs/src/index.rst b/docs/src/index.rst index 106b89a70..186faff2b 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -1,23 +1,23 @@ scikit-matter documentation =========================== -``scikit-matter`` is a collection of `scikit-learn `_ -compatible utilities that implement methods born out of the materials science -and chemistry communities. +``scikit-matter`` is a collection of `scikit-learn `_ compatible +utilities that implement methods born out of the materials science and chemistry +communities. -Convenient-to-use libraries such as scikit-learn have accelerated the adoption and application -of machine learning (ML) workflows and data-driven methods. Such libraries have gained great -popularity partly because the implemented methods are generally applicable in multiple domains. -While developments in the atomistic learning community have put forward general-use machine -learning methods, their deployment is commonly entangled with domain-specific functionalities, -preventing access to a wider audience. +Convenient-to-use libraries such as scikit-learn have accelerated the adoption and +application of machine learning (ML) workflows and data-driven methods. Such libraries +have gained great popularity partly because the implemented methods are generally +applicable in multiple domains. While developments in the atomistic learning community +have put forward general-use machine learning methods, their deployment is commonly +entangled with domain-specific functionalities, preventing access to a wider audience. 
scikit-matter targets domain-agnostic implementations of methods developed in the -computational chemical and materials science community, following the -scikit-learn API and coding guidelines to promote usability and interoperability -with existing workflows. scikit-matter contains a toolbox of methods for -unsupervised and supervised analysis of ML datasets, including the comparison, -decomposition, and selection of features and samples. +computational chemical and materials science community, following the scikit-learn API +and coding guidelines to promote usability and interoperability with existing workflows. +scikit-matter contains a toolbox of methods for unsupervised and supervised analysis of +ML datasets, including the comparison, decomposition, and selection of features and +samples. .. toctree:: :maxdepth: 1 diff --git a/docs/src/intro.rst b/docs/src/intro.rst index 642fd443f..85a3e7630 100644 --- a/docs/src/intro.rst +++ b/docs/src/intro.rst @@ -1,44 +1,68 @@ What's in scikit-matter? ======================== -``scikit-matter`` is a collection of `scikit-learn `_ -compatible utilities that implement methods born out of the materials science -and chemistry communities. - -This package serves two purposes: 1) as a development ground for models and patches that may ultimately be suitable for inclusion -in sklearn, and 2) to coalesce field-specific sklearn-like routines and models in -a well-documented and standardized repository. - -Currently, scikit-matter contains models described in [Imbalzano2018]_, [Helfrecht2020]_, [Goscinski2021]_ and [Cersonsky2021]_, as well -as some modifications to sklearn functionalities and minimal datasets that are useful within the field -of computational materials science and chemistry. +``scikit-matter`` is a collection of `scikit-learn `_ compatible +utilities that implement methods born out of the materials science and chemistry +communities. +This package serves two purposes: 1) as a development ground for models and patches that +may ultimately be suitable for inclusion in sklearn, and 2) to coalesce field-specific +sklearn-like routines and models in a well-documented and standardized repository. +Currently, scikit-matter contains models described in [Imbalzano2018]_, +[Helfrecht2020]_, [Goscinski2021]_ and [Cersonsky2021]_, as well as some modifications +to sklearn functionalities and minimal datasets that are useful within the field of +computational materials science and chemistry. - Fingerprint Selection: - Multiple data sub-selection modules, for selecting the most relevant features and samples out of a large set of candidates [Imbalzano2018]_, [Helfrecht2020]_ and [Cersonsky2021]_. + Multiple data sub-selection modules, for selecting the most relevant features and + samples out of a large set of candidates [Imbalzano2018]_, [Helfrecht2020]_ and + [Cersonsky2021]_. - * :ref:`CUR-api` decomposition: an iterative feature selection method based upon the singular value decoposition. - * :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left singular vectors inspired by Principal Covariates Regression. - * :ref:`FPS-api`: a common selection technique intended to exploit the diversity of the input space. The selection of the first point is made at random or by a separate metric. + * :ref:`CUR-api` decomposition: an iterative feature selection method based upon the + singular value decoposition. 
+ * :ref:`PCov-CUR-api` decomposition extends upon CUR by using augmented right or left + singular vectors inspired by Principal Covariates Regression. + * :ref:`FPS-api`: a common selection technique intended to exploit the diversity of + the input space. The selection of the first point is made at random or by a + separate metric. * :ref:`PCov-FPS-api` extends upon FPS much like PCov-CUR does to CUR. - * :ref:`Voronoi-FPS-api`: conduct FPS selection, taking advantage of Voronoi tessellations to accelerate selection. - * :ref:`DCH-api`: selects samples by constructing a directional convex hull and determining which samples lie on the bounding surface. + * :ref:`Voronoi-FPS-api`: conduct FPS selection, taking advantage of Voronoi + tessellations to accelerate selection. + * :ref:`DCH-api`: selects samples by constructing a directional convex hull and + determining which samples lie on the bounding surface. - Reconstruction Measures: - A set of easily-interpretable error measures of the relative information capacity of feature space `F` with respect to feature space `F'`. - The methods returns a value between 0 and 1, where 0 means that `F` and `F'` are completey distinct in terms of linearly-decodable information, and where 1 means that `F'` is contained in `F`. - All methods are implemented as the root mean-square error for the regression of the feature matrix `X_F'` (or sometimes called `Y` in the doc) from `X_F` (or sometimes called `X` in the doc) for transformations with different constraints (linear, orthogonal, locally-linear). - By default a custom 2-fold cross-validation :py:class:`skosmo.linear_model.RidgeRegression2FoldCV` is used to ensure the generalization of the transformation and efficiency of the computation, since we deal with a multi-target regression problem. - Methods were applied to compare different forms of featurizations through different hyperparameters and induced metrics and kernels [Goscinski2021]_ . + A set of easily-interpretable error measures of the relative information capacity of + feature space `F` with respect to feature space `F'`. The methods return a value + between 0 and 1, where 0 means that `F` and `F'` are completely distinct in terms of + linearly-decodable information, and where 1 means that `F'` is contained in `F`. All + methods are implemented as the root mean-square error for the regression of the + feature matrix `X_F'` (or sometimes called `Y` in the doc) from `X_F` (or sometimes + called `X` in the doc) for transformations with different constraints (linear, + orthogonal, locally-linear). By default a custom 2-fold cross-validation + :py:class:`skmatter.linear_model.RidgeRegression2FoldCV` is used to ensure the + generalization of the transformation and efficiency of the computation, since we deal + with a multi-target regression problem. Methods were applied to compare different + forms of featurizations through different hyperparameters and induced metrics and + kernels [Goscinski2021]_. + * :ref:`GRE-api` (GRE) computes the amount of linearly-decodable information + recovered through a global linear reconstruction.
+ * :ref:`GRD-api` (GRD) computes the amount of distortion contained in a global linear + reconstruction. + * :ref:`LRE-api` (LRE) computes the amount of decodable information recovered through + a local linear reconstruction for the k-nearest neighborhood of each sample. - Principal Covariates Regression - * PCovR: the standard Principal Covariates Regression [deJong1992]_. Utilises a combination between a PCA-like and an LR-like loss, and therefore attempts to find a low-dimensional projection of the feature vectors that simultaneously minimises information loss and error in predicting the target properties using only the latent space vectors $\mathbf{T}$ :ref:`PCovR-api`. - * Kernel Principal Covariates Regression (KPCovR) a kernel-based variation on the original PCovR method, proposed in [Helfrecht2020]_ :ref:`KPCovR-api`. - -If you would like to contribute to scikit-matter, check out our :ref:`contributing` page! + * PCovR: the standard Principal Covariates Regression [deJong1992]_. Utilises a + combination between a PCA-like and an LR-like loss, and therefore attempts to find + a low-dimensional projection of the feature vectors that simultaneously minimises + information loss and error in predicting the target properties using only the + latent space vectors $\mathbf{T}$ :ref:`PCovR-api`. + * Kernel Principal Covariates Regression (KPCovR) a kernel-based variation on the + original PCovR method, proposed in [Helfrecht2020]_ :ref:`KPCovR-api`. + +If you would like to contribute to scikit-matter, check out our :ref:`contributing` +page! diff --git a/docs/src/linear_models.rst b/docs/src/linear_models.rst index ed9ef3b99..4833c844d 100644 --- a/docs/src/linear_models.rst +++ b/docs/src/linear_models.rst @@ -6,7 +6,7 @@ Linear Models Orthogonal Regression ##################### -.. autoclass:: OrthogonalRegression +.. autoclass:: OrthogonalRegression .. currentmodule:: skmatter.linear_model._ridge diff --git a/docs/src/preprocessing.rst b/docs/src/preprocessing.rst index addad1659..4baaeabd7 100644 --- a/docs/src/preprocessing.rst +++ b/docs/src/preprocessing.rst @@ -1,5 +1,5 @@ Preprocessing -============================= +============= .. automodule:: skmatter.preprocessing :members: diff --git a/docs/src/reference.rst b/docs/src/reference.rst index 5c34960d5..ed2f3d070 100644 --- a/docs/src/reference.rst +++ b/docs/src/reference.rst @@ -3,7 +3,6 @@ API Reference ============= - .. toctree:: :maxdepth: 1 :caption: Contents: diff --git a/docs/src/selection.rst b/docs/src/selection.rst index dcab88bd7..fa45d8dab 100644 --- a/docs/src/selection.rst +++ b/docs/src/selection.rst @@ -40,7 +40,8 @@ This can be executed using: Xr = selector.transform(X) -where `Selector` is one of the classes below that overwrites the method :py:func:`score`. +where `Selector` is one of the classes below that overwrites the method +:py:func:`score`. From :py:class:`GreedySelector`, selectors inherit these public methods: @@ -58,29 +59,30 @@ CUR ### -CUR decomposition begins by approximating a matrix :math:`{\mathbf{X}}` using a subset of columns and rows +CUR decomposition begins by approximating a matrix :math:`{\mathbf{X}}` using a subset +of columns and rows .. math:: - \mathbf{\hat{X}} \approx \mathbf{X}_\mathbf{c} \left(\mathbf{X}_\mathbf{c}^- \mathbf{X} \mathbf{X}_\mathbf{r}^-\right) \mathbf{X}_\mathbf{r}. + \mathbf{\hat{X}} \approx \mathbf{X}_\mathbf{c} \left(\mathbf{X}_\mathbf{c}^- + \mathbf{X} \mathbf{X}_\mathbf{r}^-\right) \mathbf{X}_\mathbf{r}. 
These subsets of rows and columns, denoted :math:`\mathbf{X}_\mathbf{r}` and -:math:`\mathbf{X}_\mathbf{c}`, respectively, can be determined by iterative -maximization of a leverage score :math:`\pi`, representative of the relative -importance of each column or row. From hereon, we will call selection methods -which are derived off of the CUR decomposition "CUR" as a shorthand for -"CUR-derived selection". In each iteration of CUR, we select the column or row -that maximizes :math:`\pi` and orthogonalize the remaining columns or rows. -These steps are iterated until a sufficient number of features has been selected. -This iterative approach, albeit comparatively time consuming, is the most -deterministic and efficient route in reducing the number of features needed to -approximate :math:`\mathbf{X}` when compared to selecting all features in a -single iteration based upon the relative :math:`\pi` importance. - -The feature and sample selection versions of CUR differ only in the computation -of :math:`\pi`. In sample selection :math:`\pi` is computed using the left -singular vectors, versus in feature selection, :math:`\pi` is computed using the -right singular vectors. In addition to :py:class:`GreedySelector`, both instances -of CUR selection build off of :py:class:`skmatter._selection._cur._CUR`, and inherit +:math:`\mathbf{X}_\mathbf{c}`, respectively, can be determined by iterative maximization +of a leverage score :math:`\pi`, representative of the relative importance of each +column or row. From hereon, we will call selection methods which are derived off of the +CUR decomposition "CUR" as a shorthand for "CUR-derived selection". In each iteration of +CUR, we select the column or row that maximizes :math:`\pi` and orthogonalize the +remaining columns or rows. These steps are iterated until a sufficient number of +features has been selected. This iterative approach, albeit comparatively time +consuming, is the most deterministic and efficient route in reducing the number of +features needed to approximate :math:`\mathbf{X}` when compared to selecting all +features in a single iteration based upon the relative :math:`\pi` importance. + +The feature and sample selection versions of CUR differ only in the computation of +:math:`\pi`. In sample selection :math:`\pi` is computed using the left singular +vectors, versus in feature selection, :math:`\pi` is computed using the right singular +vectors. In addition to :py:class:`GreedySelector`, both instances of CUR selection +build off of :py:class:`skmatter._selection._cur._CUR`, and inherit .. currentmodule:: skmatter._selection @@ -88,7 +90,8 @@ of CUR selection build off of :py:class:`skmatter._selection._cur._CUR`, and inh .. automethod:: _CUR._compute_pi They are instantiated using -:py:class:`skmatter.feature_selection.CUR` and :py:class:`skmatter.sample_selection.CUR`, e.g. +:py:class:`skmatter.feature_selection.CUR` and +:py:class:`skmatter.sample_selection.CUR`, e.g. .. code-block:: python @@ -117,14 +120,15 @@ They are instantiated using PCov-CUR ######## -PCov-CUR extends upon CUR by using augmented right or left singular vectors -inspired by Principal Covariates Regression, as demonstrated in [Cersonsky2021]_. -These methods employ the modified kernel and covariance matrices introduced in :ref:`PCovR-api` -and available via the Utility Classes. +PCov-CUR extends upon CUR by using augmented right or left singular vectors inspired by +Principal Covariates Regression, as demonstrated in [Cersonsky2021]_. 
These methods +employ the modified kernel and covariance matrices introduced in :ref:`PCovR-api` and +available via the Utility Classes. -Again, the feature and sample selection versions of PCov-CUR differ only in the computation -of :math:`\pi`. So, in addition to :py:class:`GreedySelector`, both instances -of PCov-CUR selection build off of :py:class:`skmatter._selection._cur._PCovCUR`, inheriting +Again, the feature and sample selection versions of PCov-CUR differ only in the +computation of :math:`\pi`. So, in addition to :py:class:`GreedySelector`, both +instances of PCov-CUR selection build off of +:py:class:`skmatter._selection._cur._PCovCUR`, inheriting .. currentmodule:: skmatter._selection @@ -168,15 +172,15 @@ Farthest Point-Sampling (FPS) Farthest Point Sampling is a common selection technique intended to exploit the diversity of the input space. -In FPS, the selection of the first point is made at random or by a separate metric. -Each subsequent selection is made to maximize the Haussdorf distance, -i.e. the minimum distance between a point and all previous selections. -It is common to use the Euclidean distance, however other distance metrics may be employed. +In FPS, the selection of the first point is made at random or by a separate metric. Each +subsequent selection is made to maximize the Haussdorf distance, i.e. the minimum +distance between a point and all previous selections. It is common to use the Euclidean +distance, however other distance metrics may be employed. Similar to CUR, the feature and selection versions of FPS differ only in the way -distance is computed (feature selection does so column-wise, sample selection does -so row-wise), and are built off of the same base class, :py:class:`skmatter._selection._fps._FPS`, -in addition to GreedySelector, and inherit +distance is computed (feature selection does so column-wise, sample selection does so +row-wise), and are built off of the same base class, +:py:class:`skmatter._selection._fps._FPS`, in addition to GreedySelector, and inherit .. currentmodule:: skmatter._selection @@ -184,8 +188,8 @@ in addition to GreedySelector, and inherit .. automethod:: _FPS.get_distance .. automethod:: _FPS.get_select_distance -These selectors can be instantiated using -:py:class:`skmatter.feature_selection.FPS` and :py:class:`skmatter.sample_selection.FPS`. +These selectors can be instantiated using :py:class:`skmatter.feature_selection.FPS` and +:py:class:`skmatter.sample_selection.FPS`. .. code-block:: python @@ -209,13 +213,14 @@ These selectors can be instantiated using PCov-FPS ######## -PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the -Euclidean distance solely in the space of :math:`\mathbf{X}`, we use a combined -distance in terms of :math:`\mathbf{X}` and :math:`\mathbf{y}`. -Again, the feature and sample selection versions of PCov-FPS differ only in -computing the distances. So, in addition to :py:class:`GreedySelector`, both instances -of PCov-FPS selection build off of :py:class:`skmatter._selection._fps._PCovFPS`, and inherit +PCov-FPS extends upon FPS much like PCov-CUR does to CUR. Instead of using the Euclidean +distance solely in the space of :math:`\mathbf{X}`, we use a combined distance in terms +of :math:`\mathbf{X}` and :math:`\mathbf{y}`. + +Again, the feature and sample selection versions of PCov-FPS differ only in computing +the distances. 
So, in addition to :py:class:`GreedySelector`, both instances of PCov-FPS +selection build off of :py:class:`skmatter._selection._fps._PCovFPS`, and inherit .. currentmodule:: skmatter._selection @@ -259,7 +264,8 @@ Voronoi FPS .. autoclass :: VoronoiFPS -These selectors can be instantiated using :py:class:`skmatter.sample_selection.VoronoiFPS`. +These selectors can be instantiated using +:py:class:`skmatter.sample_selection.VoronoiFPS`. .. code-block:: python @@ -285,13 +291,12 @@ These selectors can be instantiated using :py:class:`skmatter.sample_selection.V When *Not* to Use Voronoi FPS ----------------------------- -In many cases, this algorithm may not increase upon the efficiency. For example, -for simple metrics (such as Euclidean distance), Voronoi FPS will likely not -accelerate, and may decelerate, computations when compared to FPS. The sweet -spot for Voronoi FPS is when the number of selectable samples is already enough -to divide the space with Voronoi polyhedrons, but not yet comparable to the total -number of samples, when the cost of bookkeeping significantly degrades the speed -of work compared to FPS. +In many cases, this algorithm may not increase upon the efficiency. For example, for +simple metrics (such as Euclidean distance), Voronoi FPS will likely not accelerate, and +may decelerate, computations when compared to FPS. The sweet spot for Voronoi FPS is +when the number of selectable samples is already enough to divide the space with Voronoi +polyhedrons, but not yet comparable to the total number of samples, when the cost of +bookkeeping significantly degrades the speed of work compared to FPS. .. _DCH-api: @@ -301,7 +306,8 @@ Directional Convex Hull (DCH) .. autoclass :: DirectionalConvexHull -This selector can be instantiated using `skmatter.sample_selection.DirectionalConvexHull`. +This selector can be instantiated using +:class:`skmatter.sample_selection.DirectionalConvexHull`. .. code-block:: python diff --git a/docs/src/tutorials.rst b/docs/src/tutorials.rst index ba52579ab..2b592007c 100644 --- a/docs/src/tutorials.rst +++ b/docs/src/tutorials.rst @@ -2,7 +2,8 @@ Examples ######## For a thorough tutorial of the methods introduced in `scikit-matter`, we suggest you -check out the pedagogic notebooks in our companion project `kernel-tutorials `_. +check out the pedagogic notebooks in our companion project `kernel-tutorials +`_. .. toctree:: :glob: diff --git a/docs/src/utils.rst b/docs/src/utils.rst index b53fbb136..bae996748 100644 --- a/docs/src/utils.rst +++ b/docs/src/utils.rst @@ -21,7 +21,9 @@ Orthogonalizers for CUR .. currentmodule:: skmatter.utils._orthogonalizers -When computing non-iterative CUR, it is necessary to orthogonalize the input matrices after each selection. For this, we have supplied a feature and a sample orthogonalizer for feature and sample selection. +When computing non-iterative CUR, it is necessary to orthogonalize the input matrices +after each selection. For this, we have supplied a feature and a sample orthogonalizer +for feature and sample selection. .. autofunction:: X_orthogonalizer .. autofunction:: Y_feature_orthogonalizer diff --git a/pyproject.toml b/pyproject.toml index fc6ca12aa..7870f7300 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -17,7 +17,7 @@ authors = [ {name = "Victor P. 
Principe"}, {name = "Michele Ceriotti"} ] -readme = "README.md" +readme = "README.rst" requires-python = ">=3.8" license = {text = "BSD-3-Clause"} classifiers = [ diff --git a/src/skmatter/datasets/descr/csd-1000r.rst b/src/skmatter/datasets/descr/csd-1000r.rst index d97dc21f3..8fa9b55ff 100644 --- a/src/skmatter/datasets/descr/csd-1000r.rst +++ b/src/skmatter/datasets/descr/csd-1000r.rst @@ -3,9 +3,9 @@ CSD-1000R ######### -This dataset, intended for model testing, contains the SOAP power spectrum -features and local NMR chemical shieldings for 100 environments selected -from CSD-1000r, originally published in [Ceriotti2019]_. +This dataset, intended for model testing, contains the SOAP power spectrum features and +local NMR chemical shieldings for 100 environments selected from CSD-1000r, originally +published in [Ceriotti2019]_. Function Call ------------- @@ -15,33 +15,33 @@ Function Call Data Set Characteristics ------------------------ - :Number of Instances: Each representation 100 +:Number of Instances: Each representation 100 - :Number of Features: Each representation 100 +:Number of Features: Each representation 100 - The representations were computed with [C1]_ using the hyperparameters: +The representations were computed with [C1]_ using the hyperparameters: - :rascal hyperparameters: +:rascal hyperparameters: - +---------------------------+------------+ - | key | value | - +---------------------------+------------+ - | interaction_cutoff: | 3.5 | - +---------------------------+------------+ - | max_radial: | 6 | - +---------------------------+------------+ - | max_angular: | 6 | - +---------------------------+------------+ - | gaussian_sigma_constant": | 0.4 | - +---------------------------+------------+ - | gaussian_sigma_type: | "Constant"| - +---------------------------+------------+ - | cutoff_smooth_width: | 0.5 | - +---------------------------+------------+ - | normalize: | True | - +---------------------------+------------+ ++---------------------------+------------+ +| key | value | ++---------------------------+------------+ +| interaction_cutoff: | 3.5 | ++---------------------------+------------+ +| max_radial: | 6 | ++---------------------------+------------+ +| max_angular: | 6 | ++---------------------------+------------+ +| gaussian_sigma_constant": | 0.4 | ++---------------------------+------------+ +| gaussian_sigma_type: | "Constant"| ++---------------------------+------------+ +| cutoff_smooth_width: | 0.5 | ++---------------------------+------------+ +| normalize: | True | ++---------------------------+------------+ - Of the 2'520 resulting features, 100 were selected via FPS using [C2]_. +Of the 2'520 resulting features, 100 were selected via FPS using [C2]_. 
References ---------- @@ -57,7 +57,7 @@ Reference Code from skmatter.feature_selection import CUR from skmatter.preprocessing import StandardFlexibleScaler from skmatter.sample_selection import FPS - + # read all of the frames and book-keep the centers and species filename = "/path/to/CSD-1000R.xyz" frames = np.asarray( diff --git a/src/skmatter/datasets/descr/degenerate_CH4_manifold.rst b/src/skmatter/datasets/descr/degenerate_CH4_manifold.rst index 306974a9e..07d5b59af 100644 --- a/src/skmatter/datasets/descr/degenerate_CH4_manifold.rst +++ b/src/skmatter/datasets/descr/degenerate_CH4_manifold.rst @@ -3,10 +3,14 @@ Degenerate CH4 manifold ####################### -The dataset contains two representations (SOAP power spectrum and bispectrum) of the two manifolds spanned by the carbon atoms of two times 81 methane structures. -The SOAP power spectrum representation the two manifolds intersect creating a degenerate manifold/line for which the representation remains the same. -In contrast for higher body order representations as the (SOAP) bispectrum the carbon atoms can be uniquely represented and do not create a degenerate manifold. -Following the naming convention of [Pozdnyakov2020]_ for each representation the first 81 samples correspond to the X minus manifold and the second 81 samples contain the X plus manifold +The dataset contains two representations (SOAP power spectrum and bispectrum) of the two +manifolds spanned by the carbon atoms of two times 81 methane structures. In the SOAP +power spectrum representation the two manifolds intersect, creating a degenerate +manifold/line for which the representation remains the same. In contrast, for higher +body order representations such as the (SOAP) bispectrum, the carbon atoms can be +uniquely represented and do not create a degenerate manifold. Following the naming +convention of [Pozdnyakov2020]_, for each representation the first 81 samples correspond +to the X minus manifold and the second 81 samples contain the X plus manifold. Function Call ------------- @@ -16,40 +20,39 @@ Function Call Data Set Characteristics ------------------------ - :Number of Instances: Each representation 162 - - :Number of Features: Each representation 12 - - The representations were computed with [D1]_ using the hyperparameters: - - :rascal hyperparameters: - - +---------------------------+------------+ - | key | value | - +===========================+============+ - | radial_basis: | "GTO" | - +---------------------------+------------+ - | interaction_cutoff: | 4 | - +---------------------------+------------+ - | max_radial: | 2 | - +---------------------------+------------+ - | max_angular: | 2 | - +---------------------------+------------+ - | gaussian_sigma_constant": | 0.5 | - +---------------------------+------------+ - | gaussian_sigma_type: | "Constant"| - +---------------------------+------------+ - | cutoff_smooth_width: | 0.5 | - +---------------------------+------------+ - | normalize: | False | - +---------------------------+------------+ - -The SOAP bispectrum features were in addition reduced to 12 features with principal component analysis (PCA) [D2]_.
+:Number of Instances: Each representation 162 + +:Number of Features: Each representation 12 + +The representations were computed with [D1]_ using the hyperparameters: + +:rascal hyperparameters: + ++---------------------------+------------+ +| key | value | ++===========================+============+ +| radial_basis: | "GTO" | ++---------------------------+------------+ +| interaction_cutoff: | 4 | ++---------------------------+------------+ +| max_radial: | 2 | ++---------------------------+------------+ +| max_angular: | 2 | ++---------------------------+------------+ +| gaussian_sigma_constant": | 0.5 | ++---------------------------+------------+ +| gaussian_sigma_type: | "Constant"| ++---------------------------+------------+ +| cutoff_smooth_width: | 0.5 | ++---------------------------+------------+ +| normalize: | False | ++---------------------------+------------+ + +The SOAP bispectrum features were in addition reduced to 12 features with principal +component analysis (PCA) [D2]_. References ---------- .. [D1] https://github.com/lab-cosmo/librascal commit 8d9ad7a .. [D2] https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html - -======= diff --git a/src/skmatter/datasets/descr/nice_dataset.rst b/src/skmatter/datasets/descr/nice_dataset.rst index 20b3c35e8..23d733755 100644 --- a/src/skmatter/datasets/descr/nice_dataset.rst +++ b/src/skmatter/datasets/descr/nice_dataset.rst @@ -3,19 +3,24 @@ NICE dataset ############ -This is a toy dataset containing NICE[1, 4](N-body Iterative Contraction of Equivariants) features for first 500 configurations of the dataset[2, 3] with randomly displaced methane configurations. +This is a toy dataset containing NICE[1, 4](N-body Iterative Contraction of +Equivariants) features for first 500 configurations of the dataset[2, 3] with randomly +displaced methane configurations. Function Call ------------- + .. function:: skmatter.datasets.load_nice_dataset Data Set Characteristics ------------------------ :Number of Instances: 500 + :Number of Features: 160 -The representations were computed using the NICE package[4] using the following definition of the NICE calculator: +The representations were computed using the NICE package[4] using the following +definition of the NICE calculator: .. code-block:: python @@ -52,13 +57,18 @@ The representations were computed using the NICE package[4] using the following References ---------- -[1] Jigyasa Nigam, Sergey Pozdnyakov, and Michele Ceriotti. "Recursive evaluation and iterative contraction of N-body equivariant features." The Journal of Chemical Physics 153.12 (2020): 121101. + +[1] Jigyasa Nigam, Sergey Pozdnyakov, and Michele Ceriotti. "Recursive evaluation and + iterative contraction of N-body equivariant features." The Journal of Chemical + Physics 153.12 (2020): 121101. [2] Incompleteness of Atomic Structure Representations -Sergey N. Pozdnyakov, Michael J. Willatt, Albert P. Bartók, Christoph Ortner, Gábor Csányi, and Michele Ceriotti + Sergey N. Pozdnyakov, Michael J. Willatt, Albert P. Bartók, Christoph Ortner, + Gábor Csányi, and Michele Ceriotti [3] https://archive.materialscloud.org/record/2020.110 Reference Code -------------- + [4] https://github.com/lab-cosmo/nice diff --git a/src/skmatter/datasets/descr/who_dataset.rst b/src/skmatter/datasets/descr/who_dataset.rst index 4aaf6dd05..b794a70b6 100644 --- a/src/skmatter/datasets/descr/who_dataset.rst +++ b/src/skmatter/datasets/descr/who_dataset.rst @@ -42,12 +42,13 @@ References .. 
[8] https://data.worldbank.org/indicator/SN.ITK.DEFC.ZS .. [9] https://data.worldbank.org/indicator/SP.DYN.LE00.IN .. [10] https://data.worldbank.org/indicator/SP.POP.TOTL - + Reference Code -------------- -and compiled through the following script, where the datasets have been placed in a folder named `who_data`: +and compiled through the following script, where the datasets have been placed in a +folder named ``who_data``: .. code-block:: python @@ -68,7 +69,7 @@ and compiled through the following script, where the datasets have been placed i sheet_name="Data", index_col=0, ) - + indicator = data["Indicator Code"].values[0] indicator_codes[indicator] = data["Indicator Name"].values[0] diff --git a/tox.ini b/tox.ini index 09cfcbdec..52d8a4d8d 100644 --- a/tox.ini +++ b/tox.ini @@ -4,7 +4,7 @@ envlist = tests examples -lint_folders = {toxinidir}/src {toxinidir}/tests +lint_folders = "{toxinidir}/src" "{toxinidir}/tests" "{toxinidir}/docs/src/" [testenv:tests] @@ -42,10 +42,12 @@ deps = flake8-bugbear flake8-sphinx-links isort + sphinx-lint commands = flake8 {[tox]lint_folders} black --check --diff {[tox]lint_folders} isort --check-only --diff {[tox]lint_folders} + sphinx-lint --enable line-too-long --max-line-length 88 {[tox]lint_folders} "{toxinidir}/README.rst" [testenv:format] # Abuse tox to do actual formatting. Users can call `tox -e format` to run