Alexander is a Python wrapper that aims to make scikit-learn fully compatible with pandas.
Alexander mirrors sklearn's API and structure. Most classes are either estimators (they have fit
, predict
and fit_predict
methods) or transformers (they have a fit
, transform
and fit_transform
methods). Transformers takes pd.DataFrame as input and returns pd.DataFrame as output, and are expected to be concatenated into pipelines.
import alexander.datasets
import alexander.ensemble
import alexander.pipeline
import alexander.preprocessing
import pandas as pd
data, target = alexander.datasets.load_titanic(return_X_y=True)
rf_pipeline = alexander.pipeline.Pipeline([
(['Pclass', 'Fare'], None),
('Age', alexander.preprocessing.MissingValuesFiller(strategy='median')),
('Sex', [alexander.preprocessing.LabelEncoder(),
alexander.preprocessing.OneHotEncoder()]),
('Embarked', [alexander.preprocessing.MissingValuesFiller(strategy='most_frequent'),
alexander.preprocessing.LabelEncoder(),
alexander.preprocessing.OneHotEncoder()])
])
rf_pipeline.fit(data) # Note how the pipeline is trained on the whole dataset
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(data, target, test_size=0.2))
rf_train = rf_pipeline.transform(X_train)
rf_clf = alexander.ensemble.RandomForestClassifier(n_estimators=100)
rf_clf.fit(rf_train, y_train) # The model is ready to be used!
Alexander follows these principles:
- Alexander has the same interface as scikit-learn
- Transformers and estimators expect pd.DataFrame as input
- Every method or function that returns something, should return a pd.DataFrame
Guidelines for Alexander usage:
- Pretty much all data transformations should be done as part of an alexander.pipeline.Pipeline()
- Pipelines should be fitted to the whole dataset, before any split occures
These are the modules there are currently implemented in Alexander and their status:
Module | Status | Notes |
---|---|---|
sklearn.datasets | Work in progress | Added a new function to load the Titanic dataset |
sklearn.ensemble | Fully wrapped | |
sklearn.model_selection | Work in progress | |
sklearn.neural_network | Fully wrapped | |
sklearn.pipeline | Fully replaced | Pipeline works differently in sklearn and Alexander |
sklearn.preprocessing | Work in progress | This will have to be re-designed |
Please, refrain from using any module whose status is not either Fully wrapped
or Fully replaced
.
Unless otherwise stated, the returned DataFrames have the same index and columns names as the input.
Done. It behaves as in scikit-learn except for the fact both transform() and fit_transform() return pd.DataFrame(). The returned DataFrame has the same index and columns names as the input. Note: even though fit() does not really do anything here, the input is still checked and an error is returned if X is not a Pandas object.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame().
Done. It behaves as in scikit-learn except for the fact both transform() and fit_transform() return pd.DataFrame().
Not implemented yet.
Not implemented yet.
Not implemented yet.
Not implemented yet.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
Done. It behaves as in scikit-learn except for the fact both transform() and fit_transform() return pd.DataFrame(). Currently the columns names are just integers (starting from 0).
Done. It behaves as in scikit-learn except for the fact both transform() and fit_transform() return pd.DataFrame(). Alexander uses get_feature_names() to get the right names for the DataFrame columns.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
Done. It behaves as in scikit-learn except for the fact that transform(), inverse_transform() and fit_transform() return pd.DataFrame(). Note: Alexander does not allow inplace scaling or normalization, so parameter copy is de facto always set to True.
- Usage of mock: if you look into the code from time to time you'll encounter non-orthodox usages of the mock library. This is because Alexander uses mock functionalities to repeat infinite loops. Imagine for example an Alexander class that implements a fit(), a transform() and a fit_transform() methods. The first two methods call their sklearn equivalents (via super()) while fit_transform call self.fit() and self.transform() one after the other. However, sklearn fit() often call self.fit_transform() behind the curtains so what happens is: self.fit() calls super().fit(), which calls self.fit_transform(), which has now been replaced by a different method whose first action is to simply call self.fit() et voilà we have an infinite loop. Mock solves all this.
If you find this library useful, consider dropping me a message or starring this repo so that I know there are people interested in the project out there. (: