Commit

Merge branch 'main' into dependabot/pip/pygments-2.15.0
koaning authored Sep 19, 2023
2 parents 39fd742 + 3ed073d commit 23cf390
Showing 10 changed files with 30 additions and 31 deletions.
2 changes: 1 addition & 1 deletion doc/crossvalidation.ipynb
@@ -593,7 +593,7 @@
"source": [
"## GroupTimeSeriesSplit\n",
"\n",
"In a time series problem it is possible that not every time unit (e.g. years) has the same amount of rows/observations. This makes a normal kfold split inpractical as you cannot specify a certain timeframe per fold (e.g. 5 years), because this can cause the folds' sizes to be very different. With `GroupTimeSeriesSplit` you can specify the amount of folds you want (e.g. `n_splits=3`) and `GroupTimeSeriesSplit` will calculate itself folds in such a way that the amount of observations per fold are as similar as possible. <br>\n",
"In a time series problem it is possible that not every time unit (e.g. years) has the same amount of rows/observations. This makes a normal kfold split impractical as you cannot specify a certain timeframe per fold (e.g. 5 years), because this can cause the folds' sizes to be very different. With `GroupTimeSeriesSplit` you can specify the amount of folds you want (e.g. `n_splits=3`) and `GroupTimeSeriesSplit` will calculate itself folds in such a way that the amount of observations per fold are as similar as possible. <br>\n",
"\n",
"The folds are created with a smartly modified brute forced method. This still means that for higher `n_splits` values in combination with many different unique time periods (e.g. 100 different years, thus 100 groups) the generation of the optimal split points can take minutes to hours. `UserWarnings` are raised when `GroupTimeSeriesSplit` expects to be running over a minute. Of course, this actual runtime depends on your machine's specifications.\n",
"\n",
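For readers skimming this diff, a minimal usage sketch of `GroupTimeSeriesSplit` may help. It assumes the splitter follows the standard scikit-learn CV interface (`n_splits` in the constructor, groups passed to `split`); the data and group sizes below are made up for illustration.

```python
import numpy as np
from sklego.model_selection import GroupTimeSeriesSplit

# hypothetical data: 30 observations spread unevenly over 6 years
X = np.random.randn(30, 2)
y = np.random.randn(30)
groups = np.repeat([2015, 2016, 2017, 2018, 2019, 2020], [3, 8, 4, 6, 5, 4])

cv = GroupTimeSeriesSplit(n_splits=3)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    # fold boundaries are placed so observation counts per fold stay as even as possible
    print(f"train={len(train_idx)} test={len(test_idx)}")
```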
12 changes: 6 additions & 6 deletions doc/linear-models.ipynb

Large diffs are not rendered by default.

15 changes: 7 additions & 8 deletions doc/meta.ipynb

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions doc/outliers.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions doc/rstudio.md
@@ -6,10 +6,10 @@ on how to build a proper scikit-learn gridsearch using reticulate so
we figured we might add a resource to our documentation here.

It should be said that we feel that the best developer experience
-is definately going to be in python but we figured it be helpful
+is definitely going to be in python but we figured it be helpful
to put a small example in our documentation.

-## Demo
+## Demo

You'll first need to install a dependency and set up a link to a
python virtualenv that has scikit-lego already installed.
@@ -127,7 +127,7 @@ ggplot(data=cv_df) +

![](_static/Rplot2.png)

-## Important
+## Important

Note that we're mainly trying to demonstrate the R api here. In terms of fairness you
would want to explore the dataset further before you say anything conclusive.
2 changes: 1 addition & 1 deletion readme.md
@@ -142,7 +142,7 @@ Here's a list of features that this library currently offers:
We want to be rather open here in what we accept but we do demand three
things before they become added to the project:

-1. any new feature contributes towards a demonstratable real-world usecase
+1. any new feature contributes towards a demonstrable real-world usecase
2. any new feature passes standard unit tests (we use the ones from scikit-learn)
3. the feature has been discussed in the issue list beforehand

2 changes: 1 addition & 1 deletion sklego/linear_model.py
@@ -491,7 +491,7 @@ def constraints(self, y_hat, y_true, sensitive, n_obs):

class BaseScipyMinimizeRegressor(BaseEstimator, RegressorMixin, ABC):
"""
-Base class for regressors relying on scipy's minimze method. Derive a class from this one and give it the function to be minimized.
+Base class for regressors relying on scipy's minimize method. Derive a class from this one and give it the function to be minimized.
Parameters
----------
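As a sketch of the pattern this docstring describes (not the class's actual internals): a derived regressor hands an objective to `scipy.optimize.minimize`, e.g. the least-absolute-deviation loss behind a subclass such as `LADRegression`. The data here is synthetic and for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def lad_loss(coef):
    # least absolute deviation: mean absolute residual for given coefficients
    return np.mean(np.abs(y - X @ coef))

result = minimize(lad_loss, x0=np.zeros(3), method="Nelder-Mead")
print(result.x)  # should land near [1.0, -2.0, 0.5]
```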
2 changes: 1 addition & 1 deletion sklego/meta/confusion_balancer.py
@@ -48,7 +48,7 @@ def fit(self, X, y):
X, y = check_X_y(X, y, estimator=self.estimator, dtype=FLOAT_DTYPES)
if not isinstance(self.estimator, ProbabilisticClassifier):
raise ValueError(
"The ConfusionBalancer meta model only works on classifcation models with .predict_proba."
"The ConfusionBalancer meta model only works on classification models with .predict_proba."
)
self.estimator.fit(X, y)
self.classes_ = unique_labels(y)
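To see the check above in action, a minimal fit sketch (assuming `ConfusionBalancer` takes the wrapped estimator as its first argument and an `alpha` mixing parameter; both are assumptions, not confirmed by this diff):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklego.meta import ConfusionBalancer

X, y = make_classification(n_samples=200, random_state=1)

# LogisticRegression exposes predict_proba, so the ProbabilisticClassifier
# check passes; wrapping a model without it raises the ValueError above
model = ConfusionBalancer(LogisticRegression(), alpha=0.5)
model.fit(X, y)
print(model.predict_proba(X[:3]))
```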
2 changes: 1 addition & 1 deletion sklego/meta/thresholder.py
@@ -18,7 +18,7 @@ class Thresholder(BaseEstimator, ClassifierMixin):
design the algorithm to only accept a certain class if the probability
for it is larger than, say, 90% instead of 50%.
-:param model: the moddel to threshold
+:param model: the model to threshold
:param threshold: the actual threshold to use
:param refit: if True, we will always retrain the model even if it is already fitted.
If False we only refit if the original model isn't fitted.
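The `:param` docs above suggest usage along these lines; a minimal sketch on a binary problem, with made-up data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklego.meta import Thresholder

X, y = make_classification(n_samples=200, random_state=42)

# only assign the positive class when its probability exceeds 0.9;
# per the :param docs above, refit=False reuses an already fitted model
clf = Thresholder(LogisticRegression(), threshold=0.9)
clf.fit(X, y)
print(clf.predict(X[:10]))
```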
6 changes: 3 additions & 3 deletions sklego/model_selection.py
@@ -482,7 +482,7 @@ def _calc_first_and_last_split_index(self, X=None, y=None, groups=None):
)
init_ideal_group_size = self._ideal_group_size * 0.9

-# initalize the index of the first split, to reduce the amount of possible index split options
+# initialize the index of the first split, to reduce the amount of possible index split options
first_split_index = (
self._grouped_df.assign(
cumsum_obs=lambda df: df["observations"].cumsum()
@@ -496,7 +496,7 @@
.iloc[0]
.name
)
-# initalize the index of the last split point, to reduce the amount of possible index split options
+# initialize the index of the last split point, to reduce the amount of possible index split options
last_split_index = len(self._grouped_df) - (
self._grouped_df.assign(
observations=lambda df: df["observations"].values[::-1],
@@ -634,7 +634,7 @@ def _regroup(self, groups):
"""
Specifies in which group every observation belongs
-:param groups: orginal groups in array
+:param groups: original groups in array
:type: groups: np.array
:return: indices for the train and test splits of each fold
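The cumulative-sum bounding used in the first hunk can be illustrated on its own. A simplified sketch with made-up group sizes (not the real `_grouped_df`):

```python
import pandas as pd

# hypothetical per-group observation counts (one row per time unit, e.g. per year)
grouped_df = pd.DataFrame({"observations": [10, 12, 9, 11, 8]})
ideal_group_size = grouped_df["observations"].sum() / 3  # with, say, n_splits=3
init_ideal_group_size = ideal_group_size * 0.9

# earliest group whose cumulative count reaches ~90% of the ideal fold size;
# everything before it is ruled out as a first split point
first_split_index = (
    grouped_df.assign(cumsum_obs=lambda df: df["observations"].cumsum())
    .loc[lambda df: df["cumsum_obs"] >= init_ideal_group_size]
    .iloc[0]
    .name
)
print(first_split_index)  # -> 1 for these counts
```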
