Release history

.. currentmodule:: skrub

Ongoing development

Skrub is a very recent package. It is currently undergoing fast development and backward compatibility is not ensured.

New features

Major changes

Minor changes

  • The column filter selection dropdown in the TableReport is smaller and its label has been removed to save space. :pr:`1107` by :user:`Jérôme Dockès <jeromedockes>`.

  • The TableReport now uses the font size of its parent element when inserted into another page. This makes it smaller in pages that use a smaller font size than the browser default such as VSCode in some configurations. It also makes it easier to control its size when inserting it in a web page by setting the font size of its parent element. A few other small adjustments have also been made to make it a bit more compact. :pr:`1098` by :user:`Jérôme Dockès <jeromedockes>`.

  • Display of labels in the plots of the TableReport has improved, especially for scripts other than the Latin alphabet:

    • Previously, some characters could be missing and replaced by empty boxes.
    • Previously, when the text was truncated, the ellipsis "..." could appear on the wrong side for right-to-left scripts.

    Moreover, when the text contains line breaks it now appears all on one line. Note this only affects the labels in the plots; the rest of the report did not have these problems. :pr:`1097` by :user:`Jérôme Dockès <jeromedockes>`.

  • In the TableReport it is now possible, before clicking any of the cells, to reach the dataframe sample table and activate a cell with tab key navigation. :pr:`1101` by :user:`Jérôme Dockès <jeromedockes>`.

  • The "Column name" column of the "summary statistics" table in the TableReport is now always visible when scrolling the table. :pr:`1102` by :user:`Jérôme Dockès <jeromedockes>`.

Bug fixes

Release 0.3.1

Minor changes

Release 0.3.0

Highlights

  • Polars dataframes are now supported across all skrub estimators.
  • :class:`TableReport` generates an interactive report for a dataframe. This page gathers some precomputed examples.

Major changes

Minor changes

Release 0.2.0

Major changes

Minor changes

skrub release 0.1.1

This is a bugfix release to adapt to the most recent versions of pandas (2.2) and scikit-learn (1.5). There are no major changes to the functionality of skrub.

skrub release 0.1.0

Major changes

Minor changes

Before skrub: dirty_cat

Skrub was born from the dirty_cat package.

Dirty-cat release 0.4.1

Major changes

Minor changes

Dirty-cat Release 0.4.0

Major changes

Minor changes

Bug fixes

Dirty-cat Release 0.3.0

Major changes

Notes

Dirty-cat Release 0.2.2

Bug fixes

Dirty-cat Release 0.2.1

Major changes

Bug-fixes

Notes

Dirty-cat Release 0.2.0

Also see pre-release 0.2.0a1 below for additional changes.

Major changes

Notes

Dirty-cat Release 0.2.0a1

Version 0.2.0a1 is a pre-release. To try it, you have to install it manually using:

pip install --pre dirty_cat==0.2.0a1

or from the GitHub repository:

pip install git+https://github.com/dirty-cat/dirty_cat.git

Major changes

Bug-fixes

Dirty-cat Release 0.1.1

Major changes

Bug-fixes

Dirty-cat Release 0.1.0

Major changes

Bug-fixes

Dirty-cat Release 0.0.7

  • MinHashEncoder: Added minhash_encoder.py and fast_hash.py files that implement minhash encoding through the :class:`MinHashEncoder` class. This method allows for fast and scalable encoding of string categorical variables.
  • datasets.fetch_employee_salaries: change the origin of download for employee_salaries.
    • The function now returns a bunch with a dataframe under the field "data", and not the path to the csv file.
    • The field "description" has been renamed to "DESCR".
  • SimilarityEncoder: Fixed a bug when using the Jaro-Winkler distance as a similarity metric. Our implementation now accurately reproduces the behaviour of the python-Levenshtein implementation.
  • SimilarityEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • TargetEncoder: Added a handle_missing attribute to allow encoding with missing values.
  • MinHashEncoder: Added a handle_missing attribute to allow encoding with missing values.
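
To give an intuition for the minhash encoding mentioned above, here is a minimal pure-Python sketch. It is not the dirty_cat implementation: the helper names (``char_ngrams``, ``seeded_hash``, ``minhash_encode``), the space padding, the use of ``hashlib.blake2b`` and the default parameters are all assumptions made for illustration. The idea is that each component of the encoding is the minimum of a seeded hash over the character n-grams of the string, so strings that share many n-grams tend to share many components.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the set of character n-grams of ``text``, padded with spaces."""
    padded = f" {text} "
    # Very short strings yield a single, shorter gram instead of an empty set.
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))}


def seeded_hash(ngram, seed):
    """Deterministic 32-bit hash of an n-gram for a given non-negative seed."""
    digest = hashlib.blake2b(
        ngram.encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_encode(text, n_components=8, n=3):
    """Encode a string as one minimum hash value per seed (hypothetical helper)."""
    grams = char_ngrams(text, n)
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(n_components)]


# Count how many components two encodings have in common.
a = minhash_encode("senior engineer")
b = minhash_encode("senior engineers")
shared_components = sum(x == y for x, y in zip(a, b))
```

The encoding has a fixed length that does not depend on the number of distinct categories and needs no fitted vocabulary, which is what makes this family of methods scale to large sets of string categories.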

Dirty-cat Release 0.0.6

  • SimilarityEncoder: Accelerate SimilarityEncoder.transform, by:
    • computing the vocabulary count vectors in fit instead of transform
    • computing the similarities in parallel using joblib. This option can be turned on/off via the n_jobs attribute of the :class:`SimilarityEncoder`.
  • SimilarityEncoder: Fix a bug that was preventing a :class:`SimilarityEncoder` from being created when categories was a list.
  • SimilarityEncoder: Set the dtype passed to the ngram similarity to float32, which reduces memory consumption during encoding.
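
The first speed-up listed above, computing the vocabulary count vectors in fit rather than in transform, can be sketched with a toy encoder. This is only an illustration under stated assumptions: ``TinySimilarityEncoder``, ``ngram_counts`` and ``count_similarity`` are hypothetical names, the similarity formula is a simple shared-over-total n-gram ratio, and neither the joblib parallelism nor the float32 dtype is reproduced here.

```python
from collections import Counter


def ngram_counts(text, n=3):
    """Count the character n-grams of ``text``, padded with spaces."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def count_similarity(a, b):
    """Shared n-grams divided by total n-grams of two count vectors."""
    return sum((a & b).values()) / sum((a | b).values())


class TinySimilarityEncoder:
    """Hypothetical toy encoder; not the dirty_cat SimilarityEncoder."""

    def fit(self, categories):
        # The count vectors of the reference categories are computed once here.
        self.categories_ = sorted(set(categories))
        self.vocabulary_counts_ = [ngram_counts(c) for c in self.categories_]
        return self

    def transform(self, values):
        rows = []
        for value in values:
            # Only the incoming value is processed; the vocabulary is reused.
            counts = ngram_counts(value)
            rows.append([count_similarity(counts, ref) for ref in self.vocabulary_counts_])
        return rows


encoder = TinySimilarityEncoder().fit(["london", "paris"])
rows = encoder.transform(["london", "londres"])
```

With this layout, repeated calls to ``transform`` no longer redo the work that depends only on the fitted categories.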

Dirty-cat Release 0.0.5

  • SimilarityEncoder: Change the default ngram range to (2, 4) which performs better empirically.
  • SimilarityEncoder: Added a most_frequent strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added a k-means strategy to define prototype categories for large-scale learning.
  • SimilarityEncoder: Added the possibility to use hashing ngrams for stateless fitting with the ngram similarity.
  • SimilarityEncoder: Performance improvements in the ngram similarity.
  • SimilarityEncoder: Expose a get_feature_names method.
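
As a rough illustration of two of the items above, the sketch below computes an n-gram similarity over the (2, 4) range and picks prototype categories with a most-frequent strategy. It is a simplified stand-in, not the dirty_cat code: the function names, the space padding and the exact similarity formula (shared n-grams over total n-grams) are assumptions, and the k-means and hashing variants are not shown.

```python
from collections import Counter


def ngrams_in_range(text, ngram_range=(2, 4)):
    """Count all character n-grams of ``text`` for n in ``ngram_range`` (inclusive)."""
    padded = f" {text} "
    low, high = ngram_range
    grams = Counter()
    for n in range(low, high + 1):
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


def ngram_similarity(a, b, ngram_range=(2, 4)):
    """Shared n-grams divided by total n-grams, a value between 0 and 1."""
    counts_a = ngrams_in_range(a, ngram_range)
    counts_b = ngrams_in_range(b, ngram_range)
    total = sum((counts_a | counts_b).values())
    return sum((counts_a & counts_b).values()) / total if total else 0.0


def most_frequent_prototypes(values, n_prototypes):
    """Keep the ``n_prototypes`` most common values as reference categories."""
    return [value for value, _ in Counter(values).most_common(n_prototypes)]


column = ["london", "paris", "london", "londres", "london", "paris"]
prototypes = most_frequent_prototypes(column, 2)
encoded = [[ngram_similarity(value, p) for p in prototypes] for value in column]
```

Restricting the comparison to a small set of prototypes keeps the encoded dimension fixed at ``n_prototypes`` even when the column has many distinct values.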