[ENH] make single problem loaders for equal length problems return numpy arrays (#109)

Changes the `return_type` from `None` to `numpy3d` and removes the very long comments describing possible data structures in unnecessary detail. Part of #42
TonyBagnall authored Feb 28, 2023
1 parent 015339f commit f7e226a
Showing 37 changed files with 388 additions and 550 deletions.
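
For illustration, a minimal sketch of the changed loader behaviour (not part of the diff; it assumes the post-change loaders and uses `load_arrow_head` as a representative equal-length problem):

    from sktime.datasets import load_arrow_head

    # equal-length problem loaders now default to return_type="numpy3d"
    X, y = load_arrow_head(split="train")
    print(type(X))  # numpy.ndarray of shape (n_cases, n_channels, n_timepoints)

    # the previous nested-DataFrame format remains available on request
    X_nested, _ = load_arrow_head(split="train", return_type="nested_univ")
    print(type(X_nested))  # pandas.DataFrame with pd.Series cells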
4 changes: 2 additions & 2 deletions examples/02_classification.ipynb
@@ -545,7 +545,7 @@
 "`sktime` offers two other ways of building estimators for multivariate time series problems:\n",
 "\n",
 "1. Concatenation of time series columns into a single long time series column via `ColumnConcatenator` and apply a classifier to the concatenated data,\n",
-"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme. \n",
+"2. Dimension ensembling via `ColumnEnsembleClassifier` in which one classifier is fitted for each time series column/dimension of the time series and their predictions are combined through a voting scheme.\n",
 "\n",
 "We can concatenate multivariate time series/panel data into long univariate time series/panel using a transform and then apply a classifier to the univariate data:"
 ]
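
For illustration, a minimal sketch of the two strategies named in the cell above (not part of the diff; the estimator choice and `n_estimators` values are illustrative assumptions, not the notebook's exact settings):

    from sklearn.pipeline import Pipeline
    from sktime.classification.compose import ColumnEnsembleClassifier
    from sktime.classification.interval_based import TimeSeriesForestClassifier
    from sktime.datasets import load_basic_motions
    from sktime.transformations.panel.compose import ColumnConcatenator

    X_train, y_train = load_basic_motions(split="train", return_type="nested_univ")

    # 1. concatenate all columns into one long univariate series, then classify
    clf1 = Pipeline([
        ("concatenate", ColumnConcatenator()),
        ("classify", TimeSeriesForestClassifier(n_estimators=10)),
    ])
    clf1.fit(X_train, y_train)

    # 2. one classifier per column/dimension; predictions combined by voting
    clf2 = ColumnEnsembleClassifier(estimators=[
        ("tsf_dim_0", TimeSeriesForestClassifier(n_estimators=10), [0]),
        ("tsf_dim_1", TimeSeriesForestClassifier(n_estimators=10), [1]),
    ])
    clf2.fit(X_train, y_train)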
@@ -690,7 +690,7 @@
 "\n",
 "#### HIVE-COTE 2.0 (HC2)\n",
 "The HIerarchical VotE Collective of Transformation-based Ensembles is a meta ensemble that combines classifiers built on different representations. Version 2 combines DrCIF, TDE, an ensemble of RocketClassifiers called the Arsenal and the ShapeletTransformClassifier. It is one of the most accurate classifiers on the UCR and UEA time series archives.\n",
-" \n",
+"\n",
 "[3] Middlehurst, Matthew, James Large, Michael Flynn, Jason Lines, Aaron Bostrom, and Anthony Bagnall. \"HIVE-COTE 2.0: a new meta ensemble for time series classification.\" Machine Learning (2021)\n",
 "[ML 2021](https://link.springer.com/article/10.1007/s10994-021-06057-9)\n",
 "\n",
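
For illustration, a hedged usage sketch of the HC2 classifier described above (not part of the diff; the one-minute time limit is an assumption to keep the run short):

    from sktime.classification.hybrid import HIVECOTEV2
    from sktime.datasets import load_unit_test

    X_train, y_train = load_unit_test(split="train")
    X_test, _ = load_unit_test(split="test")

    hc2 = HIVECOTEV2(time_limit_in_minutes=1)  # HC2 is expensive; cap the contract time
    hc2.fit(X_train, y_train)
    print(hc2.predict(X_test))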
9 changes: 6 additions & 3 deletions examples/02a_classification_multivariate_cnn.ipynb
@@ -39,8 +39,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_train, y_train = load_basic_motions(\n",
+"    split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
+"X_test, y_test = load_basic_motions(\n",
+"    split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
 "print(X_train.shape)\n",
 "print(X_test.shape)\n",
 "print(type(X_train.iloc[1, 1]))\n",
@@ -133,5 +137,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 2
-
 }
8 changes: 6 additions & 2 deletions examples/02b_classification_multivariate_lstmfcn.ipynb
@@ -39,8 +39,12 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)\n",
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_train, y_train = load_basic_motions(\n",
+"    split=\"train\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
+"X_test, y_test = load_basic_motions(\n",
+"    split=\"test\", return_X_y=True, return_type=\"nested_univ\"\n",
+")\n",
 "print(X_train.shape)\n",
 "print(X_test.shape)\n",
 "print(type(X_train.iloc[1, 1]))\n",
12 changes: 6 additions & 6 deletions examples/AA_datatypes_and_datasets.ipynb
@@ -189,7 +189,7 @@
 "* structure convention: `obj` must be 2D, i.e., `obj.shape` must have length 2. This is also true for univariate time series.\n",
 "* variables: variables correspond to columns of `obj`.\n",
 "* variable names: the `\"np.ndarray\"` mtype cannot represent variable names.\n",
-"* time points: the rows of `obj` correspond to different, distinct time points. \n",
+"* time points: the rows of `obj` correspond to different, distinct time points.\n",
 "* time index: The time index is implicit and by-convention. The `i`-th row (for an integer `i`) is interpreted as an observation at the time point `i`.\n",
 "* capabilities: cannot represent multivariate series; cannot represent unequally spaced series"
 ]
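
For illustration, a minimal sketch of an object satisfying this convention (not part of the diff; assumes sktime's `check_is_mtype` checker from `sktime.datatypes`):

    import numpy as np
    from sktime.datatypes import check_is_mtype

    # univariate series with 4 time points: rows are time points, the single column is the variable
    obj = np.array([[1.0], [2.0], [3.0], [4.0]])
    print(check_is_mtype(obj, mtype="np.ndarray", scitype="Series"))  # True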
@@ -262,8 +262,8 @@
 "\n",
 "* structure convention: `obj.index` must be a pair multi-index of type `(Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonic. `obj.index` must have two levels (can be named or not).\n",
 "* instance index: the first element of pairs in `obj.index` (0-th level value) is interpreted as an instance index, we call it \"instance index\" below.\n",
-"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances. \n",
-"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below. \n",
+"* instances: rows with the same \"instance index\" index value correspond to the same instance; rows with different \"instance index\" values correspond to different instances.\n",
+"* time index: the second element of pairs in `obj.index` (1-st level value) is interpreted as a time index, we call it \"time index\" below.\n",
 "* time points: rows of `obj` with the same \"time index\" value correspond to the same time point; rows of `obj` with different \"time index\" values correspond to different time points.\n",
 "* variables: columns of `obj` correspond to different variables\n",
 "* variable names: column names `obj.columns`\n",
@@ -297,7 +297,7 @@
 "\n",
 "* structure convention: `obj` must be 3D, i.e., `obj.shape` must have length 3.\n",
 "* instances: instances correspond to axis 0 elements of `obj`.\n",
-"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`. \n",
+"* instance index: the instance index is implicit and by-convention. The `i`-th element of axis 0 (for an integer `i`) is interpreted as indicative of observing instance `i`.\n",
 "* variables: variables correspond to axis 1 elements of `obj`.\n",
 "* variable names: the `\"numpy3D\"` mtype cannot represent variable names.\n",
 "* time points: time points correspond to axis 2 elements of `obj`.\n",
@@ -376,8 +376,8 @@
 "\n",
 "* structure convention: `obj.index` must be a 3 or more level multi-index of type `(Index, ..., Index, t)`, where `t` is one of `Int64Index`, `RangeIndex`, `DatetimeIndex`, `PeriodIndex` and monotonic. We call the last index the \"time-like\" index.\n",
 "* hierarchy level: rows with the same non-time-like index values correspond to the same hierarchy unit; rows with different non-time-like index combinations correspond to different hierarchy units.\n",
-"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index. \n",
-"* time index: the last element of tuples in `obj.index` is interpreted as a time index. \n",
+"* hierarchy: the non-time-like indices in `obj.index` are interpreted as a hierarchy identifying index.\n",
+"* time index: the last element of tuples in `obj.index` is interpreted as a time index.\n",
 "* time points: rows of `obj` with the same `\"timepoints\"` index correspond to the same time point; rows of `obj` with different `\"timepoints\"` index correspond to different time points.\n",
 "* variables: columns of `obj` correspond to different variables\n",
 "* variable names: column names `obj.columns`\n",
10 changes: 2 additions & 8 deletions examples/feature_extraction_with_tsfresh.ipynb
@@ -80,7 +80,7 @@
 }
 ],
 "source": [
-"X, y = load_arrow_head(return_X_y=True)\n",
+"X, y = load_arrow_head(return_X_y=True, return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
 ]
@@ -97,9 +97,6 @@
 },
 "jupyter": {
 "outputs_hidden": false
-},
-"pycharm": {
-"name": "#%%\n"
 }
 },
 "outputs": [
@@ -606,7 +603,7 @@
 }
 ],
 "source": [
-"X, y = load_basic_motions(return_X_y=True)\n",
+"X, y = load_basic_motions(return_X_y=True, return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
 ]
@@ -623,9 +620,6 @@
 },
 "jupyter": {
 "outputs_hidden": false
-},
-"pycharm": {
-"name": "#%%\n"
 }
 },
 "outputs": [
12 changes: 6 additions & 6 deletions examples/interpolation.ipynb
@@ -11,8 +11,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Suppose we have a set of time series with different lengths, i.e. different numbers \n",
-"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first convert our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so. "
+"Suppose we have a set of time series with different lengths, i.e. different numbers\n",
+"of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first convert our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so."
 ]
 },
 {
@@ -73,7 +73,7 @@
 }
 ],
 "source": [
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "steps = [\n",
@@ -133,7 +133,7 @@
 "    ) # here is a problem\n",
 "\n",
 "\n",
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "for df in [X_train, X_test]:\n",
@@ -156,7 +156,7 @@
 "metadata": {},
 "source": [
 "# Now the interpolator enters\n",
-"Now we use our interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points. \n",
+"Now we use our interpolator to resize time series of different lengths to a user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points.\n",
 "\n",
 "After interpolating the data, the classifier works again."
 ]
@@ -187,7 +187,7 @@
 "source": [
 "from sktime.transformations.panel.interpolate import TSInterpolator\n",
 "\n",
-"X, y = load_basic_motions()\n",
+"X, y = load_basic_motions(return_type=\"nested_univ\")\n",
 "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
 "\n",
 "for df in [X_train, X_test]:\n",
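
For illustration, a minimal sketch of the interpolator described above (not part of the diff; passing the target length as the sole constructor argument is an assumption based on the sktime API of this period):

    from sktime.datasets import load_basic_motions
    from sktime.transformations.panel.interpolate import TSInterpolator

    X, _ = load_basic_motions(return_type="nested_univ")
    # resize every series to 50 equidistant points, via scipy linear interpolation
    X_resized = TSInterpolator(50).fit_transform(X)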
31 changes: 12 additions & 19 deletions examples/minirocket.ipynb
Expand Up @@ -87,7 +87,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_arrow_head(split=\"train\", return_X_y=True)\n",
"X_train, y_train = load_arrow_head(split=\"train\", return_type=\"nested_univ\")\n",
"# visualize the first univariate time series\n",
"X_train.iloc[0, 0].plot()"
]
@@ -177,7 +177,7 @@
 },
 "outputs": [],
 "source": [
-"X_test, y_test = load_arrow_head(split=\"test\", return_X_y=True)\n",
+"X_test, y_test = load_arrow_head(split=\"test\")\n",
 "X_test_transform = minirocket.transform(X_test)"
 ]
 },
@@ -208,10 +208,7 @@
 },
 {
 "cell_type": "markdown",
-"metadata": {},
 "source": [
-"***\n",
-"\n",
 "## 2 Multivariate Time Series\n",
 "\n",
 "We can use the multivariate version of MiniRocket for multivariate time series input.\n",
@@ -220,15 +217,11 @@
 "\n",
 "Import MiniRocketMultivariate.\n",
 "\n",
-"**Note**: MiniRocketMultivariate compiles via Numba on import."
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": []
+"**Note**: MiniRocketMultivariate compiles via Numba on import.\n"
+],
+"metadata": {
+"collapsed": false
+}
 },
 {
 "cell_type": "markdown",
Expand All @@ -252,7 +245,7 @@
},
"outputs": [],
"source": [
"X_train, y_train = load_basic_motions(split=\"train\", return_X_y=True)"
"X_train, y_train = load_basic_motions(split=\"train\")"
]
},
{
@@ -327,7 +320,7 @@
 },
 "outputs": [],
 "source": [
-"X_test, y_test = load_basic_motions(split=\"test\", return_X_y=True)\n",
+"X_test, y_test = load_basic_motions(split=\"test\")\n",
 "X_test_transform = minirocket_multi.transform(X_test)"
 ]
 },
@@ -471,8 +464,8 @@
 "\n",
 "\n",
 "### 4.1 Load japanese_vowels as unequal length dataset\n",
-"Japanese vowels is a UCI Archive dataset. 9 Japanese male speakers were recorded saying the vowels ‘a’ and ‘e’. \n",
-"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification problem. The series lengths are between 7 and 29. "
+"Japanese vowels is a UCI Archive dataset. 9 Japanese male speakers were recorded saying the vowels ‘a’ and ‘e’.\n",
+"The raw recordings are preprocessed to get a 12-dimensional (multivariate) classification problem. The series lengths are between 7 and 29."
 ]
 },
 {
@@ -509,7 +502,7 @@
 "metadata": {},
 "source": [
 "### 4.2 Create a pipeline, train on it\n",
-"As before, we create a sklearn pipeline. \n",
+"As before, we create a sklearn pipeline.\n",
 "MiniRocketMultivariateVariable requires a minimum series length of 9; shorter series are padded up to length 9 with the value \"-10.0\".\n",
 "Afterwards a scaler and a RidgeClassifierCV are added.\n"
 ]
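
For illustration, a hedged sketch of the pipeline section 4.2 describes (not part of the diff; parameter names such as `pad_value_short_series` and the alpha grid are assumptions in the spirit of the period's sktime API):

    import numpy as np
    from sklearn.linear_model import RidgeClassifierCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sktime.datasets import load_japanese_vowels
    from sktime.transformations.panel.rocket import MiniRocketMultivariateVariable

    X_train, y_train = load_japanese_vowels(split="train", return_X_y=True)

    pipe = make_pipeline(
        # pad series shorter than the required minimum length of 9 with -10.0
        MiniRocketMultivariateVariable(pad_value_short_series=-10.0, random_state=42),
        StandardScaler(with_mean=False),  # scale the rocket features
        RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)),
    )
    pipe.fit(X_train, y_train)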
4 changes: 2 additions & 2 deletions sktime/benchmarking/tests/test_TSCStrategy.py
@@ -16,8 +16,8 @@
 @pytest.mark.parametrize("dataset", DATASET_LOADERS)
 def test_TSCStrategy(dataset):
     """Test strategy."""
-    train = dataset(split="train", return_X_y=False)
-    test = dataset(split="test", return_X_y=False)
+    train = dataset(split="train", return_X_y=False, return_type="nested_univ")
+    test = dataset(split="test", return_X_y=False, return_type="nested_univ")
     s = TSCStrategy(classifier)
     task = TSCTask(target="class_val")
     s.fit(task, train)
9 changes: 6 additions & 3 deletions sktime/benchmarking/tests/test_orchestration.py
@@ -41,7 +41,7 @@ def make_reduction_pipeline(estimator):
 @pytest.mark.parametrize("data_loader", [load_gunpoint, load_arrow_head])
 def test_automated_orchestration_vs_manual(data_loader):
     """Test orchestration."""
-    data = data_loader(return_X_y=False)
+    data = data_loader(return_X_y=False, return_type="nested_univ")
 
     dataset = RAMDataset(dataset=data, name="data")
     task = TSCTask(target="class_val")
@@ -85,7 +85,10 @@ def test_automated_orchestration_vs_manual(data_loader):
 @pytest.mark.parametrize(
     "dataset",
     [
-        RAMDataset(dataset=load_arrow_head(return_X_y=False), name="ArrowHead"),
+        RAMDataset(
+            dataset=load_arrow_head(return_X_y=False, return_type="nested_univ"),
+            name="ArrowHead",
+        ),
         UEADataset(path=DATAPATH, name="GunPoint", target_name="class_val"),
     ],
 )
@@ -161,7 +164,7 @@ def test_single_dataset_single_strategy_against_sklearn(
 # simple test of sign test and ranks
 def test_stat():
     """Test sign ranks."""
-    data = load_gunpoint(split="train", return_X_y=False)
+    data = load_gunpoint(split="train", return_X_y=False, return_type="nested_univ")
     dataset = RAMDataset(dataset=data, name="gunpoint")
     task = TSCTask(target="class_val")
 
2 changes: 1 addition & 1 deletion sktime/benchmarking/tests/test_tasks.py
@@ -11,7 +11,7 @@
 
 TASKS = (TSCTask, TSRTask)
 
-gunpoint = load_gunpoint(return_X_y=False)
+gunpoint = load_gunpoint(return_X_y=False, return_type="nested_univ")
 shampoo_sales = load_shampoo_sales()
 
 BASE_READONLY_ATTRS = ("target", "features", "metadata")
@@ -44,8 +44,8 @@
 def test_knn_on_unit_test(distance_key):
     """Test function for elastic knn, to be reinstated soon."""
     # load arrowhead data for unit tests
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     knn = KNeighborsTimeSeriesClassifier(
         distance=distance_key,
     )
@@ -61,8 +61,8 @@ def test_knn_on_unit_test(distance_key):
 @pytest.mark.parametrize("distance_key", distance_functions)
 def test_knn_bounding_matrix(distance_key):
     """Test knn with custom bounding parameters."""
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     knn = KNeighborsTimeSeriesClassifier(
         distance=distance_key, distance_params={"window": 0.5}
     )
@@ -8,14 +8,13 @@
 )
 from sktime.classification.interval_based import TimeSeriesForestClassifier
 from sktime.datasets import load_unit_test
-from sktime.datatypes._panel._convert import from_nested_to_3d_numpy
 
 
 def test_prob_threshold_on_unit_test_data():
     """Test of ProbabilityThresholdEarlyClassifier on unit test data."""
     # load unit test data
-    X_train, y_train = load_unit_test(split="train", return_X_y=True)
-    X_test, y_test = load_unit_test(split="test", return_X_y=True)
+    X_train, y_train = load_unit_test(split="train")
+    X_test, y_test = load_unit_test(split="test")
     indices = np.random.RandomState(0).choice(len(y_train), 10, replace=False)
 
     # train probability threshold
@@ -30,7 +29,6 @@ def test_prob_threshold_on_unit_test_data():
     final_probas = np.zeros((10, 2))
     final_decisions = np.zeros(10)
 
-    X_test = from_nested_to_3d_numpy(X_test)
     states = None
     for i in pt.classification_points:
         X = X_test[indices, :, :i]