[ENH] make single problem loaders for equal length problems return numpy arrays #109

Merged Feb 28, 2023 · 37 commits (changes shown from 24 commits)
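For context, a minimal sketch of the behaviour change this PR describes. This is hedged: the defaults are inferred from the title and from the diffs below, where notebooks and tests now pass `return_type="nested_univ"` to recover the old pandas output.

```python
from sktime.datasets import load_arrow_head

# After this change, equal-length problem loaders are expected to return
# a 3D numpy array of shape (n_instances, n_channels, n_timepoints).
X, y = load_arrow_head(split="train", return_X_y=True)
print(type(X))  # <class 'numpy.ndarray'>, e.g. shape (36, 1, 251)

# The previous nested-DataFrame output remains available explicitly:
X_df, _ = load_arrow_head(
    split="train", return_X_y=True, return_type="nested_univ"
)
print(type(X_df))  # <class 'pandas.core.frame.DataFrame'>
```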
Commits
89a0a3c
remove RimeSeriesSVC
TonyBagnall Feb 23, 2023
c3f6f55
Merge branch 'main' of https://github.com/scikit-time/scikit-time
TonyBagnall Feb 25, 2023
706fd62
revise single problem loaders
TonyBagnall Feb 25, 2023
7fa8859
revise single problem loaders
TonyBagnall Feb 25, 2023
878a79f
notebooks temporarily use dataframes
TonyBagnall Feb 25, 2023
d63b05e
set basic motions to return nested_univ in testing
TonyBagnall Feb 25, 2023
db3274c
set basic motions to return nested_univ in build notebooks
TonyBagnall Feb 25, 2023
7cf6263
remove ilocs with numpy
TonyBagnall Feb 25, 2023
3f11f49
more notebook dataframes
TonyBagnall Feb 25, 2023
f2b092a
more notebook dataframes
TonyBagnall Feb 25, 2023
5975dcc
more notebook dataframes
TonyBagnall Feb 25, 2023
995bc0a
remove double Raises
TonyBagnall Feb 25, 2023
9b70ade
remove double Returns
TonyBagnall Feb 25, 2023
548cf89
remove more pandas operations on numpy
TonyBagnall Feb 25, 2023
f469ace
change tests that rely on basic_motions
TonyBagnall Feb 26, 2023
46a0b3e
adjust tests using single problem loaders
TonyBagnall Feb 26, 2023
72d5a43
Merge branch 'main' of https://github.com/scikit-time/scikit-time
TonyBagnall Feb 26, 2023
c949e30
revert test_orchestration
TonyBagnall Feb 26, 2023
53d89ed
test numpy3d
TonyBagnall Feb 26, 2023
b0c41b6
reformat tests for single problem loaders
TonyBagnall Feb 26, 2023
fe34052
yet more tests
TonyBagnall Feb 26, 2023
0c8114b
notebooks
TonyBagnall Feb 26, 2023
a44e3ee
test_FittedParamExtractor.py
TonyBagnall Feb 26, 2023
a4f0449
test_tsfresh
TonyBagnall Feb 26, 2023
104cfb7
test_tsfresh
TonyBagnall Feb 26, 2023
c5d0666
test_tsfresh
TonyBagnall Feb 27, 2023
51e273a
revised classifier notebook
TonyBagnall Feb 27, 2023
549a754
revised classifier notebook
TonyBagnall Feb 27, 2023
afe87e2
revised classifier notebook
TonyBagnall Feb 27, 2023
78e1057
revised minirocket notebook
TonyBagnall Feb 27, 2023
43c46f8
revised data io docstrings
TonyBagnall Feb 27, 2023
9ab91e9
Merge branch 'main' of https://github.com/scikit-time/scikit-time
TonyBagnall Feb 27, 2023
0c56387
return removed test in benchmarking
TonyBagnall Feb 27, 2023
f6de9d3
Merge branch 'main' into nested_univ
TonyBagnall Feb 27, 2023
5c244dc
Merge branch 'ajb/notebooks' into nested_univ
TonyBagnall Feb 27, 2023
ed7d01b
Revert "Merge branch 'ajb/notebooks' into nested_univ"
TonyBagnall Feb 27, 2023
fe3e3ab
fix data_io
TonyBagnall Feb 28, 2023
Files changed
4 changes: 2 additions & 2 deletions examples/02_classification.ipynb
@@ -545,7 +545,7 @@
"`sktime` offers two other ways of building estimators for multivariate time series problems:\n",
"\n",
"1. Concatenation of time series columns into a single long time series column via `ColumnConcatenator` and apply a classifier to the concatenated data,\n",
"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme. \n",
"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme.\n",
"\n",
"We can concatenate multivariate time series/panel data into long univariate time series/panel using a transform and then apply a classifier to the univariate data:"
]
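A sketch of the two approaches listed above (the estimator choices here are illustrative assumptions, not taken from this diff):

```python
from sklearn.pipeline import Pipeline
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.transformations.panel.compose import ColumnConcatenator

# 1. Concatenate all columns into one long univariate series, then classify.
clf_concat = Pipeline(
    [
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=10)),
    ]
)

# 2. One classifier per column/dimension; predictions combined by voting.
clf_ensemble = ColumnEnsembleClassifier(
    estimators=[
        ("tsf0", TimeSeriesForestClassifier(n_estimators=10), [0]),
        ("tsf1", TimeSeriesForestClassifier(n_estimators=10), [1]),
    ]
)
```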
@@ -690,7 +690,7 @@
"\n",
"#### HIVE-COTE 2.0 (HC2)\n",
"The HIerarchical VotE Collective of Transformation-based Ensembles is a meta ensemble that combines classifiers built on different representations. Version 2 combines DrCIF, TDE, an ensemble of RocketClassifiers called the Arsenal and the ShapeletTransformClassifier. It is one of the most accurate classifiers on the UCR and UEA time series archives.\n",
" \n",
"\n",
"[3] Middlehurst, Matthew, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, and Anthony Bagnall. \"HIVE-COTE 2.0: a new meta ensemble for time series classification.\" Machine Learning (2021)\n",
"[ML 2021](https://link.springer.com/article/10.1007/s10994-021-06057-9)\n",
"\n",
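A hedged usage sketch of HC2 (the constructor arguments are assumptions; a time limit keeps the demo cheap):

```python
from sktime.classification.hybrid import HIVECOTEV2
from sktime.datasets import load_unit_test

X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")

# HC2 internally combines DrCIF, TDE, the Arsenal and the
# ShapeletTransformClassifier, as described above.
hc2 = HIVECOTEV2(time_limit_in_minutes=1, random_state=0)
hc2.fit(X_train, y_train)
print(hc2.score(X_test, y_test))
```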
9 changes: 6 additions & 3 deletions examples/02a_classification_multivariate_cnn.ipynb
@@ -39,8 +39,12 @@
"metadata": {},
"outputs": [],
"source": [
"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
"X_train, y_train = load_basic_motions(\n",
" split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
")\n",
"X_test, y_test = load_basic_motions(\n",
" split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
")\n",
"print(X_train.shape)\n",
"print(X_test.shape)\n",
"print(type(X_train.iloc[1, 1]))\n",
@@ -133,5 +137,4 @@
},
"nbformat": 4,
"nbformat_minor": 2

}
8 changes: 6 additions & 2 deletions examples/02b_classification_multivariate_lstmfcn.ipynb
@@ -39,8 +39,12 @@
"metadata": {},
"outputs": [],
"source": [
"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
"X_train, y_train = load_basic_motions(\n",
" split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
")\n",
"X_test, y_test = load_basic_motions(\n",
" split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
")\n",
"print(X_train.shape)\n",
"print(X_test.shape)\n",
"print(type(X_train.iloc[1, 1]))\n",
12 changes: 6 additions & 6 deletions examples/AA_datatypes_and_datasets.ipynb
@@ -189,7 +189,7 @@
"* structure convention: `obj` must be 2D, i.e., `obj.shape` must have length 2. This is also true for univariate time series.\n",
"* variables: variables correspond to columns of `obj`.\n",
"* variable names: the `\"np.ndarray\"` mtype cannot represent variable names.\n",
"* time points: the rows of `obj` correspond to different, distinct time points. \n",
"* time points: the rows of `obj` correspond to different, distinct time points.\n",
"* time index: The time index is implicit and by-convention. The `i`-th row (for an integer `i`) is interpreted as an observation at the time point `i`.\n",
"* capabilities: cannot represent multivariate series; cannot represent unequally spaced series"
]
@@ -262,8 +262,8 @@
"\n",
"* structure convention: `obj.index` must be a pair multi-index of type `(Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonous. `obj.index` must have two levels (can be named or not).\n",
"* instance index: the first element of pairs in `obj.index` (0-th level value) is interpreted as an instance index, we call it \"instance index\" below.\n",
"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances. \n",
"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below. \n",
"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances.\n",
"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below.\n",
"* time points: rows of `obj` with the same \"time index\" value correspond correspond to the same time point; rows of `obj` with different \"time index\" index correspond correspond to the different time points.\n",
"* variables: columns of `obj` correspond to different variables\n",
"* variable names: column names `obj.columns`\n",
@@ -297,7 +297,7 @@
"\n",
"* structure convention: `obj` must be 3D, i.e., `obj.shape` must have length 3.\n",
"* instances: instances correspond to axis 0 elements of `obj`.\n",
"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`. \n",
"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`.\n",
"* variables: variables correspond to axis 1 elements of `obj`.\n",
"* variable names: the `\"numpy3D\"` mtype cannot represent variable names.\n",
"* time points: time points correspond to axis 2 elements of `obj`.\n",
@@ -376,8 +376,8 @@
"\n",
"* structure convention: `obj.index` must be a 3 or more level multi-index of type `(Index, ..., Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonous. We call the last index the \"time-like\" index.\n",
"* hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combination correspond to different hierarchy unit.\n",
"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index. \n",
"* time index: the last element of tuples in `obj.index` is interpreted as a time index. \n",
"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index.\n",
"* time index: the last element of tuples in `obj.index` is interpreted as a time index.\n",
"* time points: rows of `obj` with the same `\"timepoints\"` index correspond correspond to the same time point; rows of `obj` with different `\"timepoints\"` index correspond correspond to the different time points.\n",
"* variables: columns of `obj` correspond to different variables\n",
"* variable names: column names `obj.columns`\n",
10 changes: 2 additions & 8 deletions examples/feature_extraction_with_tsfresh.ipynb
@@ -80,7 +80,7 @@
}
],
"source": [
"X, y = load_arrow_head(return_X_y=True)\n",
"X, y = load_arrow_head(return_X_y=True, return_type=\"nested_univ\")\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
]
@@ -97,9 +97,6 @@
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
@@ -606,7 +603,7 @@
}
],
"source": [
"X, y = load_basic_motions(return_X_y=True)\n",
"X, y = load_basic_motions(return_X_y=True, return_type=\"nested_univ\")\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
]
@@ -623,9 +620,6 @@
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
12 changes: 6 additions & 6 deletions examples/interpolation.ipynb
@@ -11,8 +11,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose we have a set of time series with different lengths, i.e. different number \n",
"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first converted our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so. "
"Suppose we have a set of time series with different lengths, i.e. different number\n",
"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first converted our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so."
]
},
{
@@ -73,7 +73,7 @@
}
],
"source": [
"X, y = load_basic_motions()\n",
"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"steps = [\n",
@@ -133,7 +133,7 @@
" ) # here is a problem\n",
"\n",
"\n",
"X, y = load_basic_motions()\n",
"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"for df in [X_train, X_test]:\n",
@@ -156,7 +156,7 @@
"metadata": {},
"source": [
"# Now the interpolator enters\n",
"Now we use our interpolator to resize time series of different lengths to user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points. \n",
"Now we use our interpolator to resize time series of different lengths to user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points.\n",
"\n",
"After interpolating the data, the classifier works again."
]
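A hedged sketch of that step (the dataset and target length are illustrative):

```python
from sktime.datasets import load_japanese_vowels
from sktime.transformations.panel.interpolate import TSInterpolator

X, y = load_japanese_vowels(split="train", return_X_y=True)  # unequal lengths
# linear interpolation, drawing 50 equidistant samples per series
X_equal = TSInterpolator(50).fit_transform(X)
```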
@@ -187,7 +187,7 @@
"source": [
"from sktime.transformations.panel.interpolate import TSInterpolator\n",
"\n",
"X, y = load_basic_motions()\n",
"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
"\n",
"for df in [X_train, X_test]:\n",
12 changes: 6 additions & 6 deletions examples/minirocket.ipynb
@@ -87,7 +87,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_arrow_head(split=\"train\", return_X_y=True)\n",
"X_train, y_train = load_arrow_head(split=\"train\", return_type=\"nested_univ\")\n",
"# visualize the first univariate time series\n",
"X_train.iloc[0, 0].plot()"
]
@@ -252,7 +252,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)"
"X_train, y_train = load_basic_motions(split=\"train\")"
]
},
{
@@ -327,7 +327,7 @@
},
"outputs": [],
"source": [
"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
"X_test, y_test = load_basic_motions(split=\"test\")\n",
"X_test_transform = minirocket_multi.transform(X_test)"
]
},
@@ -471,8 +471,8 @@
"\n",
"\n",
"### 4.1 Load japanese_vowels as unequal length dataset\n",
"Japanese vowels is a a UCI Archive dataset. 9 Japanese-male speakers were recorded saying the vowels ‘a’ and ‘e’. \n",
"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification probem. The series lengths are between 7 and 29. "
"Japanese vowels is a a UCI Archive dataset. 9 Japanese-male speakers were recorded saying the vowels ‘a’ and ‘e’.\n",
"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification probem. The series lengths are between 7 and 29."
]
},
{
@@ -509,7 +509,7 @@
"metadata": {},
"source": [
"### 4.2 Create a pipeline, train on it\n",
"As before, we create a sklearn pipeline. \n",
"As before, we create a sklearn pipeline.\n",
"MiniRocketMultivariateVariable requires a minimum series length of 9, where missing values are padded up to a length of 9, with the value \"-10.0\".\n",
"Afterwards a scaler and a RidgeClassifierCV are added.\n"
]
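A sketch of such a pipeline (hyperparameters are assumptions consistent with the text above):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sktime.transformations.panel.rocket import MiniRocketMultivariateVariable

pipeline = make_pipeline(
    # series shorter than 9 are padded with the value -10.0
    MiniRocketMultivariateVariable(
        pad_value_short_series=-10.0, random_state=42
    ),
    StandardScaler(with_mean=False),
    RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
)
```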
4 changes: 2 additions & 2 deletions sktime/benchmarking/tests/test_TSCStrategy.py
@@ -16,8 +16,8 @@
@pytest.mark.parametrize("dataset", DATASET_LOADERS)
def test_TSCStrategy(dataset):
"""Test strategy."""
train = dataset(split="train", return_X_y=False)
test = dataset(split="test", return_X_y=False)
train = dataset(split="train", return_X_y=False, return_type="nested_univ")
test = dataset(split="test", return_X_y=False, return_type="nested_univ")
s = TSCStrategy(classifier)
task = TSCTask(target="class_val")
s.fit(task, train)
51 changes: 5 additions & 46 deletions sktime/benchmarking/tests/test_orchestration.py
@@ -37,55 +37,14 @@ def make_reduction_pipeline(estimator):
return pipeline


# simple test of orchestration and metric evaluation
@pytest.mark.parametrize("data_loader", [load_gunpoint, load_arrow_head])
def test_automated_orchestration_vs_manual(data_loader):
"""Test orchestration."""
data = data_loader(return_X_y=False)

dataset = RAMDataset(dataset=data, name="data")
task = TSCTask(target="class_val")

# create strategies
# clf = TimeSeriesForestClassifier(n_estimators=1, random_state=1)
clf = make_reduction_pipeline(
RandomForestClassifier(n_estimators=2, random_state=1)
)
strategy = TSCStrategy(clf)

# result backend
results = RAMResults()
orchestrator = Orchestrator(
datasets=[dataset],
tasks=[task],
strategies=[strategy],
cv=SingleSplit(random_state=1),
results=results,
)

orchestrator.fit_predict(save_fitted_strategies=False)
result = next(results.load_predictions(cv_fold=0, train_or_test="test")) # get
# only first item of iterator
actual = result.y_pred

# expected output
task = TSCTask(target="class_val")
cv = SingleSplit(random_state=1)
train_idx, test_idx = next(cv.split(data))
train = data.iloc[train_idx, :]
test = data.iloc[test_idx, :]
strategy.fit(task, train)
expected = strategy.predict(test)

# compare results
np.testing.assert_array_equal(actual, expected)


# extensive tests of orchestration and metric evaluation against sklearn
@pytest.mark.parametrize(
"dataset",
[
RAMDataset(dataset=load_arrow_head(return_X_y=False), name="ArrowHead"),
RAMDataset(
dataset=load_arrow_head(return_X_y=False, return_type="nested_univ"),
name="ArrowHead",
),
UEADataset(path=DATAPATH, name="GunPoint", target_name="class_val"),
],
)
@@ -161,7 +120,7 @@ def test_single_dataset_single_strategy_against_sklearn(
# simple test of sign test and ranks
def test_stat():
"""Test sign ranks."""
data = load_gunpoint(split="train", return_X_y=False)
data = load_gunpoint(split="train", return_X_y=False, return_type="nested_univ")
dataset = RAMDataset(dataset=data, name="gunpoint")
task = TSCTask(target="class_val")

2 changes: 1 addition & 1 deletion sktime/benchmarking/tests/test_tasks.py
@@ -11,7 +11,7 @@

TASKS = (TSCTask, TSRTask)

gunpoint = load_gunpoint(return_X_y=False)
gunpoint = load_gunpoint(return_X_y=False, return_type="nested_univ")
shampoo_sales = load_shampoo_sales()

BASE_READONLY_ATTRS = ("target", "features", "metadata")
(changes to an additional file; filename not captured)
@@ -44,8 +44,8 @@
def test_knn_on_unit_test(distance_key):
"""Test function for elastic knn, to be reinstated soon."""
# load arrowhead data for unit tests
X_train, y_train = load_unit_test(split="train", return_X_y=True)
X_test, y_test = load_unit_test(split="test", return_X_y=True)
X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")
knn = KNeighborsTimeSeriesClassifier(
distance=distance_key,
)
@@ -61,8 +61,8 @@ def test_knn_on_unit_test(distance_key):
@pytest.mark.parametrize("distance_key", distance_functions)
def test_knn_bounding_matrix(distance_key):
"""Test knn with custom bounding parameters."""
X_train, y_train = load_unit_test(split="train", return_X_y=True)
X_test, y_test = load_unit_test(split="test", return_X_y=True)
X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")
knn = KNeighborsTimeSeriesClassifier(
distance=distance_key, distance_params={"window": 0.5}
)
(changes to an additional file; filename not captured)
@@ -8,14 +8,13 @@
)
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.datasets import load_unit_test
from sktime.datatypes._panel._convert import from_nested_to_3d_numpy


def test_prob_threshold_on_unit_test_data():
"""Test of ProbabilityThresholdEarlyClassifier on unit test data."""
# load unit test data
X_train, y_train = load_unit_test(split="train", return_X_y=True)
X_test, y_test = load_unit_test(split="test", return_X_y=True)
X_train, y_train = load_unit_test(split="train")
X_test, y_test = load_unit_test(split="test")
indices = np.random.RandomState(0).choice(len(y_train), 10, replace=False)

# train probability threshold
@@ -30,7 +29,6 @@ def test_prob_threshold_on_unit_test_data():
final_probas = np.zeros((10, 2))
final_decisions = np.zeros(10)

X_test = from_nested_to_3d_numpy(X_test)
states = None
for i in pt.classification_points:
X = X_test[indices, :, :i]