Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.9.0] - 2023-11-10

Added

New RandomSVD algorithm
New LanczosSVD algorithm
New distributed versions of Random Forest Classifier and Random Forest Regressor
New nested versions of Random Forest Classifier and Random Forest Regressor
A version of the TeraSort algorithm

[0.8.0] - 2022-11-11

Added

save and load methods for all models
Multiclass CSVM
TS-QR (Tall Skinny QR)
New in-place operations for ds-arrays: add, iadd, isub
Matrix subtraction and matrix addition
Concatenating two ds-arrays by columns
Save ds-array to npy file
Load ds-array from several npy files
Create ds-arrays from blocks
GridSearch for simulations & improvements
Inverse transformation in Scalers
Train-Test-Split functionality
KNN Classifier
Better SVD column pairing
GPU support using CUDA/CuPy for algorithms: K-means, KNN, SVD, PCA, Matmul, Addition, Subtraction, QR, Kronecker

Changed

New documentation for GPU, RandomForest, Scalers

Fixed

Bug in Scalers and the related tests

[0.7.0] - 2021-11-10

Added

New decomposition algorithm QR
New preprocessing algorithm MinMaxScaler
Jenkinsfile for CI automated tests
ds-array matrix multiplication (matmul)
New function for ds-array creation
@constraint(computing_units="${ComputingUnits}") on all tasks
More I/O functions for reading and writing ds-arrays
More tests

Changed

Move RandomForest from 'classification' to 'trees'

Fixed

Some bugs in the ds-array

[0.6.0] - 2020-10-09

Added

User guide and glossary
Method to read from npy files
Support for one-dimensional data in ds-array
Parametrized ds-array tests
identity, full and zeros methods that generate ds-arrays filled with a value
ds-array operators: subtraction, division, conjugate, transpose, item setting, etc.
matmul, kronecker product and rechunk methods for ds-arrays
Automatic deletion of ds-arrays when the GC is called
Multivariate linear regression
SVD (Singular Value Decomposition)
PCA using SVD
ADMM Lasso algorithm
Daura clustering algorithm

Changed

Improved performance testing scripts and added new tests
Allow executing applications with params using dislib exec
Extended and improved the tutorial notebook
Moved data loading routines to a different file as array.py was getting too big
apply_along_axis for sparse data now returns sparse ds-arrays
Updated dislib-base docker image
Replaced COLLECTION_INOUT parameters with COLLECTION_OUT when possible for improving performance
Updated requirement PyCOMPSs >= 2.7

Fixed

Some bugs in the ds-array
Internal inconsistencies in transformed_array of PCA

[0.5.0] - 2019-11-25

Added

Grid search and randomized search with cross-validation
K-fold splitter
Support for jupyter-notebooks from dislib docker image
Automatic installation of dislib executable when running pip install dislib
Support for sparse data in PCA
A new notebook with more usage examples
jupyter command in the dislib executable
Pointer to sklearn license in LICENSE file
NOTICE file

Changed

Estimators now extend sklearn BaseEstimator
Extended tutorial notebook with other examples
Added acknowledgements to README

Removed

Pandas dependency in test_als
CODEOWNERS file

Fixed

Small fixes to tutorial notebook
Small fixes to documentation
dislib executable now works even if PyCOMPSs is not installed
Bug fix in ALS performance test
Several bugs in fancy indexing of ds-arrays
dislib executable on macOS

[0.4.0] - 2019-09-16

Added

Distributed array data structure
A basic tutorial notebook

Changed

Updated docker image to PyCOMPSs 2.5
Modified the whole library to use distributed arrays instead of Datasets (including estimators, examples, etc.)
Added 'init' parameter to K-means
Updated the developer guide

Removed

Dataset and Subset data structures
FFT estimator
Methods to load from multiple files

Fixed

Fixed the usage of random state in K-means
Some issues in the performance tests
Other minor bug fixes

[0.3.0] - 2019-06-28

Added

The VERSION file
Test for duplicate support vectors in CSVM
Test for GaussianMixture with random initialization
New types of covariances for GaussianMixture and more tests
Scripts for automated performance tests on MareNostrum 4
A small Performance section to the docs
Two new algorithms: PCA and LinearRegression
Added some tests for DBSCAN

Changed

Dataset now does not check for duplicate samples (and does not build an array of unique IDs). This improves performance significantly.
CSVM now checks and removes duplicate samples generated during the fit process.
GaussianMixture now works with sparse data
GaussianMixture now removes partial results using compss_delete
Improved the performance of K-means' _partial_sum task
Improved docs of GaussianMixture and simplified the code
Added a check_convergence argument to GaussianMixture
Significant performance improvement of DBSCAN
Improved the performance of the shuffle method by using PyCOMPSs COLLECTIONS

Fixed

A bug in DBSCAN that was generating incorrect results in certain cases

[0.2.0] - 2019-03-01

Added

This CHANGELOG file
Added badges to README file
Added tests for C-SVM and K-means
Created a utils module with shuffle and as_grid methods
Added an API reference to the documentation
Dataset.samples and Dataset.labels properties
New tests for DBSCAN
A first version of nearest neighbors algorithm
Added tests for C-SVM, K-means and DBSCAN with sparse data
Created a setup.py file and a pip package
First implementation of Gaussian mixtures and ALS
Implemented a StandardScaler class as part of a new preprocessing module
Created a resample method in the utils module
Dataset transpose
Dataset apply function

Changed

Refactored DBSCAN completely to make code more legible and fix several bugs
Fixed DBSCAN, which was producing wrong results in some scenarios. Replaced the use of disjoint sets with connected components.
Extended the installation instructions in the README file
The script classifier_comparison.py now includes Random Forest classifier
Tests are split into modules
The COMPSs docker image has been reworked
Changed the way random_state is used in the different algorithms to ensure proper randomization and reproducibility of different executions.
Unified the signatures of the different algorithms to fit, predict, and fit_predict. These methods now have the same arguments in all the algorithms.
Changed license to Apache v2
Fixed some typos in README
load methods in the data module now accept a delimiter argument
Moved the quickstart guide to a separate file and included it in the documentation
Fixed several bugs